YouTube Comments Analysis in 100 Baby Challenge Sims 4

On December 20th, 2018, the first episode of the BuzzFeed Multiplayer 100 Baby Challenge was published. It was created, produced and played by Kelsey Impicciche. The goal of the challenge is to have 100 babies in The Sims 4, each with a different sim. There are a lot of rules to this challenge; see more here. Kelsey quickly asked viewers to suggest names in the comments and chose from them whenever a new offspring arrived. Somewhere during episode 20, after I upvoted a “Burrito” name suggestion, I started to wonder which names were suggested the most. And thus started my journey of collecting YouTube comments and trying to extract some sensible information from them. Answering “What are the top 10 most suggested names for the BuzzFeed Multiplayer 100 Baby Challenge?” turned out to be more challenging than I expected. If you just want to know the answer, scroll down, don’t be shy. For the code, check out GitHub here.
Otherwise let’s break down the process:

Downloading the data

To download all the comments on the videos, one needs to venture into Google’s territory and acquire API keys and other secret tokens. I followed the tutorial here and used this sandbox to try out different calls. I decided to save only the top-level comments, not the replies, as Kelsey would only look for suggestions in the top-level comments. First I found the ID of the 100 Baby Challenge playlist and saved the names and IDs of the videos. Then a Python script went through each video, creating a folder for each and saving the comments in a txt file.
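A minimal sketch of what such a download loop could look like, using only the standard library against the YouTube Data API v3 `commentThreads` endpoint. This is not the script from the repository: the function names `fetch_page` and `top_level_comments` are made up for illustration, and the API key is assumed to come from your own Google project. Each page holds at most 100 comment threads, so the loop follows `nextPageToken` until there is none left.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def fetch_page(video_id, api_key, page_token=None):
    """Request one page (up to 100 threads) of comment threads for a video."""
    params = {
        "part": "snippet",      # the snippet holds the top-level comment
        "videoId": video_id,
        "maxResults": 100,
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def top_level_comments(video_id, api_key, fetch=fetch_page):
    """Yield the text of every top-level comment, page by page.

    `fetch` is injectable (hypothetical design choice) so the paging
    logic can be tried without network access.
    """
    page_token = None
    while True:
        page = fetch(video_id, api_key, page_token)
        for item in page.get("items", []):
            yield item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        page_token = page.get("nextPageToken")
        if not page_token:
            break
```

Replies are never requested here, which matches the decision to keep only top-level comments. The yielded text may still contain HTML markup such as line-break tags, so it needs cleaning before any counting.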

Filtering

First, using regex, I filtered out all the symbols, numbers, emojis and such, then removed duplicates and English stop words (using the NLTK library). It actually turned out quite well, thanks to this regex sandbox:

import re
from nltk.corpus import stopwords # needs a one-time nltk.download("stopwords")

english_stop_words = set(stopwords.words("english"))

# `line` is one raw comment read from a video's txt file
txt = re.sub(r"<[^>]*>", " ", line) # replace every < > tag with a space, keeping the text between tags
txt = re.sub(r"[^a-zA-Z ]", " ", txt) # replace everything that is not a letter or a space
txt = re.sub(r"\b\w\b", "", txt) # delete every word that is one letter long
txt = re.sub(r"\b\w{70,}\b", "", txt) # delete words of 70 or more characters
txt = re.sub(r" +", " ", txt) # collapse runs of spaces into one
list_of_text = txt.strip().lower().split() # trim whitespace, lowercase and split on whitespace
set_of_text = set(list_of_text) # delete duplicates within the comment
set_of_text.difference_update(english_stop_words) # delete the most common words in English
Duplicate comments

Extracting names

Now for probably the most challenging part, actually counting the names. There are several issues with this one:

Gathering name data

A side-quest I was not aware of, but one that was fun to do anyway. I had to download the list of all the parents’ and babies’ names, along with the episode in which each baby was born. Thankfully, some smart people have created a fandom web page with all the information about the series, so I did not have to look through every video once again (even though it sounds like a sweet excuse to watch it again). Unfortunately, there was no simple list with what I needed, so I had to write a short web crawler that saved the data as a JSON file.
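To give an idea of how small such a crawler can be, here is a rough sketch built on the standard library's `html.parser`. The page layout is an assumption made purely for illustration: it pretends the fandom page has a table whose rows hold baby name, parent name and episode number, in that order. The real page is likely structured differently, and the names `TableRows`, `babies_from_html` and `save_babies` are hypothetical.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:               # header rows (<th> only) stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def babies_from_html(html):
    """Turn (name, parent, episode) table rows into a list of dicts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [
        {"name": row[0], "parent": row[1], "episode": int(row[2])}
        for row in parser.rows
        if len(row) == 3    # assumed three-column layout
    ]


def save_babies(babies, path):
    """Write the collected records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(babies, handle, indent=2)
```

Fetching the page itself is left out; any HTTP client works, and the parser only needs the HTML as a string.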

Counting s̵h̵e̵e̵p̵ names

First, a Python script went through each episode and created a JSON file with names and their mention counts. Then I merged those into one JSON file containing all the episodes and all their names.
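The count-then-merge idea can be sketched with `collections.Counter`. This is an illustration under assumptions, not the repository's code: it supposes each comment has already been reduced to a set of lowercase words, and that a set of candidate names (`known_names`, hypothetical) is available to match against. The helper names and the `ignore` parameter are also made up here.

```python
from collections import Counter


def count_names(comment_word_sets, known_names):
    """Count in how many comments of one episode each known name appears."""
    counts = Counter()
    for words in comment_word_sets:
        # Each comment is a set, so a name counts once per comment.
        counts.update(words & known_names)
    return counts


def merge_episodes(per_episode_counts):
    """Add up the per-episode counters into one counter for the series."""
    total = Counter()
    for counts in per_episode_counts.values():
        total.update(counts)
    return total


def top_names(counts, n=10, ignore=frozenset()):
    """Return the n most mentioned names, skipping anything in `ignore`."""
    kept = Counter({name: c for name, c in counts.items() if name not in ignore})
    return kept.most_common(n)
```

Because a `Counter` is a `dict` subclass, both the per-episode and the merged results can be written out with `json.dump` as they are.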

ignore_names = set(["kelsey", "chelsea", "chelsey", "kelly", "kacey", "jamie", "maleficent"])
ignore_words = set(["hope", "best", "happy", "new", "coward", "chance", "sup", "black", "tho", "angel", "elders", "may", "nanny", "couch", "soo", "irl", "hurry", "drew", "marry", "luv", "bunk", "demon", "hug", "nap", "jr", "cousin", "max", "grim", "sunny"])
emma 9778.0
alex 9381.0
lily 9011.0
luna 8660.0
jack 8328.0
noah 7509.0
emily 6628.0
james 6626.0
luke 6475.0
sam 6006.0
Top ten names
Word cloud of top 100 suggested names throughout the series
Top name for each episode (except for the Final episode)