YouTube Comments Analysis in 100 Baby Challenge Sims 4

Maria Emine Nylund
Mar 30, 2021

On December 20th, 2018, the first episode of the Buzzfeed Multiplayer 100 Baby Challenge was published. It was created, produced and played by Kelsey Impicciche. The goal of the challenge is to have 100 babies in Sims 4, each with a different sim; there are a lot of rules to this challenge, see more here. Kelsey quickly asked viewers to suggest names in the comments and chose from them whenever a new offspring arrived. Somewhere during episode 20, after I upvoted a “Burrito” name suggestion, I started to wonder which names were suggested the most. And thus started my journey of collecting YouTube comments and trying to extract some sensible information from them. Answering “What are the top 10 most suggested names for the Buzzfeed Multiplayer 100 Baby Challenge?” turned out to be more challenging than I expected. If you just want to know the answer, scroll down, don’t be shy. For the code, check out GitHub here.
Otherwise, let’s break down the process:

Downloading the data

To download all the comments on the videos, one needs to venture into Google’s territory and acquire API keys and other secret tokens. I followed the tutorial here and used this sandbox to try out different calls. I decided to save only the top-level comments, not the replies, as Kelsey would only look for suggestions in the top-level comments. First I found the id of the 100 Baby Challenge playlist and saved the names and ids of its videos. Then a Python script went through each video, creating a folder for each and saving its comments in a txt file.
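The download loop described above could look roughly like the sketch below. It is not the script from the repository: the function names, the `comments.txt` file name and the injected `fetch_page` callable are assumptions made for illustration. The commented-out part shows how `fetch_page` might be wired to the YouTube Data API v3 `commentThreads.list` call through the `google-api-python-client` package, with `API_KEY` as a placeholder.

```python
import os


def iter_top_level_comments(fetch_page, video_id):
    """Yield the text of every top-level comment of one video, page by page."""
    page_token = None
    while True:
        response = fetch_page(video_id, page_token)
        for item in response.get("items", []):
            # Replies are not included; only the thread's top-level comment is read.
            yield item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        page_token = response.get("nextPageToken")
        if not page_token:
            break


def save_comments(comments, out_dir, video_title):
    """Write one comment per line into <out_dir>/<video_title>/comments.txt."""
    folder = os.path.join(out_dir, video_title)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "comments.txt")
    with open(path, "w", encoding="utf-8") as f:
        for comment in comments:
            f.write(comment.replace("\n", " ") + "\n")
    return path


# Possible real fetch_page (assumption, needs an API key and network access):
#
# from googleapiclient.discovery import build
# youtube = build("youtube", "v3", developerKey=API_KEY)
#
# def fetch_page(video_id, page_token):
#     return youtube.commentThreads().list(
#         part="snippet", videoId=video_id,
#         maxResults=100, pageToken=page_token,
#     ).execute()
```

Passing the fetch function in as an argument keeps the paging logic testable without calling the API.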

Filtering

First, using regex, I filtered out all the symbols, numbers, emojis and such; then I removed duplicates and English stop words (using the nltk library). It actually turned out quite impressive, thanks to this regex sandbox:

import re
from nltk.corpus import stopwords # needs a one-time nltk.download("stopwords")

english_stop_words = set(stopwords.words("english")) # assumed definition, not shown in the original snippet

# `line` is the raw text of one comment
txt = re.sub(r"(<([^>])*([^<])*([^>])*>)", "", line) # delete everything contained in < > symbols
txt = re.sub(r"([^a-zA-Z ])", " ", txt) # replace everything that is not a letter or a space with a space
txt = re.sub(r'\b\w{1,1}\b', '', txt) # delete everything that is one letter long
txt = re.sub(r'\b\w{70,}\b', '', txt) # delete words of 70 characters or more
txt = re.sub(r"( +)", " ", txt) # delete all the extra spaces
list_of_text = txt.strip().lower().split() # delete whitespace at the end and beginning, lower and split on whitespace
set_of_text = set(list_of_text) # delete duplicates
set_of_text.difference_update(english_stop_words) # delete the most common words in English

But filtering out everything that is not a letter might have mangled some suggestions containing numbers. For example, Mateo69 would only be counted as Mateo.

With my approach, two-part names such as “Harry Jr” would not be counted either:

“BABY NAMES SUGGESTIONS Lindaya/Eleanor for a girl And Harry Jr a tribute to Harry since he ………. drowned”

Deleting duplicates meant that one comment has only one “vote” for a given name. So a comment like this one would count the Jasmine suggestion only once:

“Jasmine JASMINE jasminE JASmine jasMine JaSmiNE Jasmine JASMINE jasminE JASmine name a girl jasMine JaSmiNE Jasmine JASMINE jasminE JASmine jasMine JaSmiNE Jasmine JASMINE jasminE JASmine Name a girl jasMine JaSmiNE Jasmine JASMINE jasminE JASmine jasMine JaSmiNE Name a girl Jasmine JASMINE jasminE JASmine jasMine JaSmiNE Jasmine JASMINE jasminE JASmine Name a girl jasMine JaSmiNE Jasmine JASMINE jasminE JASmine jasMine Ja?SmiNE Jasmine JASMINE Name a girl jasminE JASmine jasMine JaSmiNE Jasmine JASMINE jasminE JASmine jasMine JaSmiNE . . . Please name a girl Jasmine maybe”

Another issue with duplicates was that, as it turns out, people love posting the same comment several times:

Duplicate comments

To solve that, I used difflib.SequenceMatcher, whose ratio() method returns how alike two strings are. In my case, if two comments were more than 90% alike, I only counted one of them. To save some time and not compare every comment to every other, I only compared adjacent ones. So if anyone was spamming two comments, alternating between them, all of them would be counted. Another downside of this method is that if someone commented “Serene” and then “Seren”, the second one was disregarded, since it has a high similarity score with the previous one. If people spammed with a pause, or with other comments in between, those would count as well, since I only check the preceding comment. So this could have added a lot of false suggestion counts.
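A minimal sketch of that adjacent comparison is shown below. The function name and the choice to compare each comment with the one immediately before it in the raw list (rather than with the last comment that was kept) are assumptions; only difflib.SequenceMatcher and the 90% threshold come from the description above.

```python
from difflib import SequenceMatcher


def drop_adjacent_near_duplicates(comments, threshold=0.9):
    """Drop a comment when it is more than `threshold` similar to the comment right before it."""
    kept = []
    previous = None
    for comment in comments:
        is_repeat = (
            previous is not None
            and SequenceMatcher(None, previous, comment).ratio() > threshold
        )
        if not is_repeat:
            kept.append(comment)
        previous = comment  # always compare against the immediately preceding comment
    return kept
```

With this sketch, `["Serene", "Seren"]` collapses to a single entry (the ratio is 10/11, about 0.91), while two alternating spam comments such as `["Emma", "Luna", "Emma", "Luna"]` all survive, which matches the limitations mentioned above.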

Extracting names

Now for probably the most challenging part, actually counting the names. There are several issues with this one:

How to differentiate between comments and actual name suggestions?

“Mars looks like wyatt oleff”

I went back and forth on whether I should check if a comment contains “suggestion”, “baby” or just “name”, but there were a lot of simple requests such as:

“Antoinette or Tosia, please 🥺💓 I love you and yours films”

Even “better” when those were combined:

“Mars low key looks like Troye Sivan. Kelsey can you name a boy Troye and a girl Sivan?”

Not to mention how plainly difficult it can be to tell whether the intent was to suggest a name or not:

“My hamsters name is chewy”
“Name your next baby Quarantine”
“YAY!! I needed this in quarentine 😁”

Moreover, how to differentiate between names and objects?

“Name one willow”

In addition, there were also a lot of comments about the parents of babies, already-born babies and the matriarchs (main characters of the challenge).

So, to solve some of my troubles, I found the NameDataset library, which lets you check whether a word is a name or not. But first I had to filter out the names of already-born babies (depending on when they were born), their parents and the matriarchs. As for differentiating between names and objects, and between comments and actual name suggestions, I decided to count everything as long as NameDataset says it is a name, and to create my own blacklist of words to filter out. Hopefully, someday I will learn of a better way.
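The filtering rule described above can be sketched roughly as follows. This is only an illustration: `is_name` stands in for whatever lookup NameDataset provides (its actual API is not shown here), and the function names, the toy name list and the episode numbers are all made up for the example.

```python
def names_known_by_episode(births, episode):
    """Names that already belong to someone in the series by the given episode.

    `births` maps a name to the episode number in which that sim appeared
    (hypothetical layout for the data collected from the fandom page).
    """
    return {name for name, born_in in births.items() if born_in <= episode}


def extract_candidate_names(tokens, is_name, known_names, blacklist):
    """Keep tokens that look like names and are neither already taken nor blacklisted."""
    return {
        token
        for token in tokens
        if is_name(token) and token not in known_names and token not in blacklist
    }


# Toy stand-in for the name lookup; a real run would ask NameDataset instead.
toy_names = {"mars", "troye", "sivan", "willow", "hope"}
births = {"mars": 3, "harry": 10}  # invented episode numbers

tokens = {"mars", "looks", "like", "troye", "sivan", "hope", "name", "boy", "girl"}
candidates = extract_candidate_names(
    tokens,
    is_name=toy_names.__contains__,
    known_names=names_known_by_episode(births, episode=5),
    blacklist={"hope"},
)
print(sorted(candidates))
```

Here "mars" is dropped because the sim already exists by episode 5, "hope" is dropped by the blacklist, and ordinary words never pass the name lookup.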

Gathering name data

A side quest I was not aware of, but it was fun to do anyway. I had to download a list of all the parents’ and babies’ names and the episode in which each baby was born. Thankfully, some smart people have created a fandom web page with all the information about the series, so I did not have to look through every video once again (even though it sounds like a sweet excuse to watch it again). Unfortunately, there was no simple list with what I needed, so I had to write a short web crawler, saving the data as a JSON file.

Counting s̵h̵e̵e̵p̵ names

First a Python script went through each episode and created a JSON file with names and their mention counts. Then I merged those into one JSON file containing all the episodes and all their names.
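The counting and merging steps might be sketched like this. The function names and the JSON layout (episode label mapped to a name-to-count dictionary) are assumptions for illustration, not necessarily the format used in the repository.

```python
import json
from collections import Counter


def count_names(comment_name_sets):
    """Count names for one episode; each comment gives at most one vote per name."""
    counts = Counter()
    for names in comment_name_sets:
        counts.update(set(names))  # set() guards against repeated names in one comment
    return counts


def merge_episode_counts(per_episode):
    """Combine per-episode counters into one JSON-serializable dictionary."""
    return {episode: dict(counts) for episode, counts in per_episode.items()}


episode_1 = count_names([{"emma", "luna"}, {"emma"}, {"burrito"}])
episode_2 = count_names([{"luna"}, {"luna", "jack"}])
merged = merge_episode_counts({"episode_1": episode_1, "episode_2": episode_2})
print(json.dumps(merged, sort_keys=True))
```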

Now I could remove all the names gathered from the fandom page from the dictionary, based on which episode it is. I converted the cleaned data into a pandas dataframe, so I could easily drop the columns for my exceptions:

ignore_names = set(["kelsey", "chelsea", "chelsey", "kelly", "kacey", "jamie", "maleficent"])
ignore_words = set(["hope", "best", "happy", "new", "coward", "chance", "sup", "black", "tho", "angel", "elders", "may", "nanny", "couch", "soo", "irl", "hurry", "drew", "marry", "luv", "bunk", "demon", "hug", "nap", "jr", "cousin", "max", "grim", "sunny"])

Then I aggregated the data and sorted on the count. Tada! Moment of truth, finally:

emma 9778.0
alex 9381.0
lily 9011.0
luna 8660.0
jack 8328.0
noah 7509.0
emily 6628.0
james 6626.0
luke 6475.0
sam 6006.0
Top ten names
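The drop, aggregate and sort steps that produced this list could be sketched as below. To keep the example dependency-free it uses plain dictionaries and collections.Counter rather than a pandas dataframe, so it is a stand-in for the approach described above, not the original code. The `merged` layout and the tiny counts are invented for illustration.

```python
from collections import Counter


def top_names(merged, ignore, n=10):
    """Sum counts over all episodes, skip ignored words, return the n most common names."""
    totals = Counter()
    for counts in merged.values():
        for name, count in counts.items():
            if name not in ignore:
                totals[name] += count
    return totals.most_common(n)


merged = {
    "episode_1": {"emma": 3, "kelsey": 9, "luna": 1},
    "episode_2": {"emma": 2, "luna": 3, "hope": 7},
}
ignore = {"kelsey", "hope"}  # would be ignore_names | ignore_words in the real run
print(top_names(merged, ignore))
```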

To spice up the results, I used the wordcloud library and this tutorial to create a word cloud image of the top 100 names:

Word cloud of top 100 suggested names throughout the series

In addition, I was curious about the top name for each episode, so we have this chunky table as well:

Top name for each episode (except for the Final episode)
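Picking the top name per episode is a small step once the counts are merged. A minimal sketch, assuming the same hypothetical episode-to-counts dictionary layout and made-up numbers:

```python
def top_name_per_episode(merged):
    """Return the most mentioned name for every episode that has any counts."""
    return {
        episode: max(counts, key=counts.get)
        for episode, counts in merged.items()
        if counts
    }


example = {
    "episode_1": {"emma": 3, "luna": 1},
    "episode_2": {"emma": 2, "luna": 4},
    "episode_3": {},
}
print(top_name_per_episode(example))
```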

One interesting finding is that episode 52 aired on April 18th, 2020, and the top name for that episode is Corona. I saw quite a handful of comments suggesting it as a name:

“Baby names: Quarantina or Corinne and Tina Covid Corona”

There were also comments just mentioning it, so I am not sure if it counts as a top name, but it definitely describes the world at that moment.

Also, in episodes 80 and 81, Wanda was the top name, probably because of the new show airing at the time and Kelsey posting a video about making a 4-course meal from that show.

All in all, I am very curious why Emma was the most suggested name, and especially what happened in episode 9 that made people write that name in over 2000 comments. On the other hand, my research was filled with some shortcuts, so we might never know if Emma was truly the most suggested name for this fun video series.

In total, this project took around 25 hours of focused work including writing this post. Tracked using Toggle.
