Methodology for Data
Data Processing
The initial data table was created by extracting the first ‘top 100’ list from Billboard’s "The Hot 100" Songs of each year and combining them into a single table. This table contained only three columns: year, artist, and song title.
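As a rough illustration of this combination step (not the code actually used), the yearly lists could be merged with pandas as below; the file names, year range, and column names are assumptions made for the sketch.

```python
# Hypothetical sketch: combine the yearly 'top 100' CSV files into one table.
# File names, the year range, and column names are illustrative assumptions.
import pandas as pd

frames = []
for year in range(1960, 2020):                        # assumed year range
    yearly = pd.read_csv(f"hot100_{year}.csv")        # one Billboard list per year
    yearly["year"] = year
    frames.append(yearly[["year", "artist", "title"]])

songs = pd.concat(frames, ignore_index=True)          # roughly 100 rows per year
songs.to_csv("billboard_combined.csv", index=False)
```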
Each data entry was the result of applying the steps below (a combined code sketch of these steps follows the list):

1. Query the Pronoun & Gender Database to check the artist's gender and pronouns and whether the artist is a band. The extracted information was added to our file.

2. Query the Genius API (accessed through the lyricsgenius library) for the song's lyrics in order to examine them further (the lyrics themselves were not stored).

a. Carry out a sentiment analysis on the lyrics and include the results in the table.

b. Before the analysis was carried out, the lyrics step was skipped if the song was not found or its language was not English. This caused several minor difficulties, such as the surprisingly frequent case of a song that is only partly in English (as judged by the textblob library); such songs were either skipped or analysed anyway. Unfortunately, the library used for language detection was not 100% accurate, so some non-English songs slipped through.

The results were inserted into the output table.

3. Query the Spotify database (accessed through the spotipy Python library) for the artist. Extract the genres assigned to that artist and add them in a new column.
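The snippet below is a minimal sketch of the three per-entry steps above, not the project's actual code. The API tokens, the local CSV standing in for the Pronoun & Gender Database, and the use of the langdetect package in place of the textblob-based language check are assumptions made for illustration.

```python
# Hypothetical sketch of steps 1-3 for a single table entry. Tokens, the local
# pronoun/gender CSV, and the langdetect stand-in for the textblob language
# check are assumptions for illustration only.
import lyricsgenius
import pandas as pd
import spotipy
from langdetect import detect
from spotipy.oauth2 import SpotifyClientCredentials
from textblob import TextBlob

genius = lyricsgenius.Genius("GENIUS_ACCESS_TOKEN")            # assumed token
spotify = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="SPOTIFY_CLIENT_ID", client_secret="SPOTIFY_CLIENT_SECRET"))

# Step 1: the Pronoun & Gender Database, assumed here to be a local CSV
# with 'artist', 'gender', 'pronouns' and 'is_band' columns.
gender_db = pd.read_csv("pronoun_gender_db.csv").set_index("artist")


def enrich(entry):
    """Add gender/pronoun info, sentiment scores and Spotify genres to one row."""
    artist, title = entry["artist"], entry["title"]

    # Step 1: gender, pronouns and the band flag.
    if artist in gender_db.index:
        row = gender_db.loc[artist]
        entry["gender"] = row["gender"]
        entry["pronouns"] = row["pronouns"]
        entry["is_band"] = row["is_band"]

    # Step 2: fetch the lyrics (never stored), check the language, run sentiment.
    song = genius.search_song(title, artist)
    if song is not None:
        try:
            language = detect(song.lyrics)
        except Exception:
            language = None
        if language == "en":
            sentiment = TextBlob(song.lyrics).sentiment
            entry["polarity"] = sentiment.polarity
            entry["subjectivity"] = sentiment.subjectivity

    # Step 3: genres that Spotify assigns to the artist.
    found = spotify.search(q=artist, type="artist", limit=1)["artists"]["items"]
    entry["genres"] = found[0]["genres"] if found else []
    return entry
```

As described in step 2b, entries whose lyrics are missing or not detected as English simply end up without sentiment columns in this sketch.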
The biggest issue while processing the data turned out to be the multiple API queries. Although they were easy to implement initially, there was a fairly high likelihood of a query returning a timeout error that could not be avoided on our side, as it originated on the server side. Such an error stopped the program and lost all of the progress made, which was a severe inconvenience: since the algorithm is heavily API-reliant, it takes several hours to go through all the entries. We came up with a two-step work-around (sketched in code after the list below):
- Break the whole table into manageable chunks of 20 entries each. This way, when an error occurs, only the currently processed chunk is lost, and when the program resumes it has to make up for at most 20 lost entries (rather than 6000).
- Include a try-except statement nested in an infinite while loop, so that when an error occurs the program can resume automatically.
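A minimal sketch of this two-step work-around is given below. The chunk size of 20 follows the description above, while the file names, the retry delay, and the enrich() helper (the per-entry function sketched earlier) are assumptions; resuming from the last finished chunk after a restart is not shown.

```python
# Hypothetical sketch of the work-around: process the table in chunks of 20
# and retry a failed chunk inside an infinite while loop. `enrich` is the
# per-entry function sketched earlier; file names and the delay are assumed.
import time

import pandas as pd

songs = pd.read_csv("billboard_combined.csv")
CHUNK_SIZE = 20

for start in range(0, len(songs), CHUNK_SIZE):
    chunk = songs.iloc[start:start + CHUNK_SIZE]
    while True:                                   # retry until the chunk succeeds
        try:
            processed = [enrich(row.to_dict()) for _, row in chunk.iterrows()]
            break                                 # chunk done, move on
        except Exception:                         # e.g. a server-side timeout
            time.sleep(5)                         # wait briefly, then retry
    # Append the finished chunk immediately so at most 20 entries are ever lost.
    pd.DataFrame(processed).to_csv("enriched_songs.csv", mode="a",
                                   index=False, header=(start == 0))
```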