Chapter 5 Genre Classification — An NLP approach

There may be marked differences between the musical features of country and rock songs, but to use audio features by themselves would ignore another important aspect of music: the lyrics. Lyrics contain information that audio features cannot capture; they may even betray a feeling that the sound alone does not portray. We can use natural language processing (NLP) techniques to quantify song lyrics. This will enable us to identify the sentiment of a song based on its lyrics alongside the mood that is set by its tone. In addition, we will be able to identify clusters of related words, as we will see later. All of these techniques can be used to distinguish songs, artists, and genres from one another.

This chapter outlines a process for creating clusters of related words and using them as the basis for a lyric-based genre classification model.

To do this we will use three additional packages:

  • genius: acquiring song lyrics
  • tidytext: text mining with tidy tools
  • topicmodels: package containing R implementations of topic models

To best understand the following code I recommend reading Text Mining with R: A Tidy Approach by Julia Silge and David Robinson.

From here we frame the issue of genre classification as one of language rather than strict musical features. In creating this model we are attempting to see whether there are linguistic differences between the lyrics of rock and country songs that can be captured, quantified, and classified using NLP and predictive modeling.

The steps that we will take to create this classification model are roughly as follows:

  • Retrieve song lyrics
  • Tokenization
  • Removal of stop words
  • Word stemming
  • Create a document term matrix
  • Create an LDA topic model
  • Calculate LDA topic probabilities
  • Create a model using class probabilities as features

Latent Dirichlet Allocation (LDA) is an unsupervised classification model that takes the term frequency (tf) of each document as input, following a bag-of-words approach. To create an LDA model, we first need to calculate the term frequency of our documents (songs). And to do that, we need to process the song lyrics by tokenizing, removing stop words, and stemming.
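To make the bag-of-words idea concrete, here is a minimal term-frequency count on a made-up lyric fragment (the text is invented, used only to show the representation LDA consumes — word order is thrown away and only counts remain):

```r
# a toy "document" (hypothetical lyric fragment)
doc <- "thunder feel the thunder never gave up on the thunder"

# bag-of-words: split into words and count occurrences
words <- strsplit(tolower(doc), "\\s+")[[1]]
tf <- table(words)
tf
```

Each song gets reduced to exactly this kind of word-count vector before it ever reaches the model.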

But we can’t get ahead of ourselves. We need the lyrics first.

5.1 Retrieving song lyrics

We will use add_genius() with the charts table to get the lyrics for each song. Below we specify the columns that contain the artist and track names, respectively. The type argument tells genius whether it will be fetching lyrics for a song or an album. Here we specify that we are interested in "lyrics".

This can be rather time-consuming. So maybe go for a walk around the block. Make some tea. Have a moment to yourself.

library(genius)
charts_lyrics <- add_genius(charts, artist, title, type = "lyrics")

Now that we have the charts and their lyrics, let’s see whether any songs were missed by add_genius().

anti_join(charts, charts_lyrics %>% 
  count(year, artist, title)) %>% 
  distinct(artist, title)
## Joining, by = c("year", "artist", "title")
## # A tibble: 16 x 2
##    artist                                           title                  
##    <chr>                                            <chr>                  
##  1 Florida Georgia Line                             H.O.L.Y.               
##  2 Dan + Shay                                       From The Ground Up     
##  3 Chris Young Duet With Cassadee Pope              Think Of You           
##  4 Dan + Shay                                       Nothin' Like You       
##  5 Dan + Shay                                       How Not To             
##  6 Dan + Shay                                       Tequila                
##  7 Dan + Shay                                       Speechless             
##  8 David Lee Murphy & Kenny Chesney                 Everything's Gonna Be …
##  9 Brantley Gilbert                                 Ones That Like Me      
## 10 Lil Wayne, Wiz Khalifa & Imagine Dragons With L… Sucker For Pain        
## 11 Nathaniel Rateliff & The Night Sweats            S.O.B.                 
## 12 Coldplay                                         Up&Up                  
## 13 The Dirty Heads                                  Vacation               
## 14 Florence + The Machine                           Hunger                 
## 15 Sir Sly                                          &Run                   
## 16 Imagine Dragons + Khalid                         Thunder/Young Dumb & B…

There appear to be some inconsistencies in naming. We can either manually fix these or omit them. In this case, I will omit them (fixing them properly would invariably mean filing a GitHub issue; if you want to tackle it, please do!).
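Since the missed songs come back with an empty lyric column, omitting them amounts to dropping rows with a missing lyric. A minimal sketch on toy data (the rows here are invented; the column name lyric matches the one used in the preprocessing code later in this chapter):

```r
library(dplyr)

# toy stand-in for charts_lyrics: one missed song (NA lyric), one retrieved
toy_lyrics <- tibble(
  artist = c("Coldplay", "Hypothetical Artist"),
  title  = c("Up&Up", "Some Song"),
  lyric  = c(NA, "la la la")
)

# keep only songs whose lyrics were retrieved
kept <- toy_lyrics %>% filter(!is.na(lyric))
kept
```

The same filter applied to charts_lyrics would drop the sixteen songs listed above.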

5.2 Lyric Preprocessing

To use song lyrics in a model, we need to quantify them in some manner. NLP allows us to add quantitative structure to inherently unstructured text data. We will follow the tidy text mining approach outlined in Text Mining with R: A Tidy Approach.

We will impose structure on song lyrics by tokenizing words and estimating document topics (groups of related words).

The code below takes the charts_lyrics data frame and deduplicates the songs, then splits the lyric column into unigrams. Next, stop words are removed. Finally, unigrams are stemmed using wordStem() from the SnowballC package (see the SnowballC documentation for more on word stemming).

library(tidytext)

# create unigrams
lyric_unigrams <- charts_lyrics %>% 
  # if a song appears more than once, keep only one observation
  distinct(artist, title, line, .keep_all = TRUE) %>% 
  # create unigrams
  unnest_tokens(word, lyric) %>% 
  # remove stop words
  anti_join(get_stopwords()) %>% 
  # stem each word
  mutate(word = SnowballC::wordStem(word)) 
## Joining, by = "word"
lyric_unigrams 
## # A tibble: 79,885 x 9
##     rank  year chart  artist  featured_artist title track_title  line word 
##    <dbl> <dbl> <chr>  <chr>   <chr>           <chr> <chr>       <dbl> <chr>
##  1     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     1 babi 
##  2     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     1 last 
##  3     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     1 night
##  4     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     1 hand 
##  5     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     2 on   
##  6     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     2 best 
##  7     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     2 night
##  8     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     2 doubt
##  9     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     3 bottl
## 10     2  2016 Hot C… Thomas… <NA>            Die … Die a Happ…     3 wine 
## # … with 79,875 more rows

We’re almost there. We want to create an LDA model, but LDA models are based on term frequency. At the moment, we only have terms. We need to count the number of times each word occurs in each document (song).

To accomplish this we will count() by each unique combination of song and word. Below we create a common identifier as a combination of the artist name and title, then count the number of times each word occurs in the song.

# create a song id column, then count word occurrences per song
lyric_counts <- lyric_unigrams %>%
  mutate(id = glue::glue("{artist}....{title}")) %>% 
  count(id, word, sort = TRUE)

lyric_counts
## # A tibble: 33,662 x 3
##    id                                         word        n
##    <glue>                                     <chr>   <int>
##  1 Eric Church....Desperate Man               boo       147
##  2 Imagine Dragons....Thunder                 thunder   113
##  3 Dierks Bentley....Woman, Amen              oh        103
##  4 The Head And The Heart....All We Ever Knew la        103
##  5 Vance Joy....Saturday Sun                  ba         99
##  6 Eric Church....Mr. Misunderstood           na         96
##  7 Fall Out Boy....Hold Me Tight Or Don't     na         95
##  8 alt-J....In Cold Blood                     la         81
##  9 CHVRCHES....Get Out                        get        74
## 10 twenty one pilots....Message Man           eh         64
## # … with 33,652 more rows

Now these counts can be cast into a DocumentTermMatrix using tidytext::cast_dtm(). A document term matrix takes the structure of one document per row and one column per term. This is important when we are thinking about the models that we will be creating.

For our model, we want to know the genre of each song. This means there needs to be only one row per song rather than a row per song and word pairing as is the case with our tidy lyric_counts object.

lyric_dtm <- lyric_counts %>% 
  cast_dtm(id, word, n)

lyric_dtm
## <<DocumentTermMatrix (documents: 513, terms: 4638)>>
## Non-/sparse entries: 33662/2345632
## Sparsity           : 99%
## Maximal term length: NA
## Weighting          : term frequency (tf)

Notice how the term weighting is done with term-frequency (tf)? This is ideal as LDA requires term-frequency.
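The same cast works on a toy set of counts, which makes it easy to see where that weighting comes from (the songs and counts here are invented):

```r
library(dplyr)
library(tidytext)

# toy word counts: two songs, three terms
toy_counts <- tibble(
  id   = c("song_a", "song_a", "song_b"),
  word = c("love", "tender", "thunder"),
  n    = c(2L, 1L, 3L)
)

# cast into a DocumentTermMatrix; its print method reports tf weighting
toy_dtm <- cast_dtm(toy_counts, id, word, n)
toy_dtm
```

Printing toy_dtm shows a 2-document, 3-term matrix weighted by term frequency (tf), just like lyric_dtm above.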

5.3 Topic Modeling

Now that we have our DocumentTermMatrix we can create our LDA model. We will use the LDA() function from topicmodels to do this. LDA will create as many topics as specified; much as with k-means clustering, we must choose the number of groups, k, up front. In this case we will use 5.

library(topicmodels)

lda_5 <- LDA(lyric_dtm, k = 5, control = list(seed = 0))

lda_5
## A LDA_VEM topic model with 5 topics.

Now that the model has been fit, we can use it to calculate the posterior probabilities of each document’s membership in the k topics. posterior() returns a list with two matrices. The first contains the term probabilities for each topic. The second contains the topic probabilities for each document. We will extract the topic probabilities and coerce the matrix into a tibble so we can join it back onto the original charts table.

# calculate the posterior probabilities for each document's classification
lda_inf <- posterior(lda_5, lyric_dtm)

# extract document class probabilities 
chart_lda <- lda_inf[[2]] %>% 
  as_tibble(rownames = "id")

chart_lda
## # A tibble: 513 x 6
##    id                                 `1`      `2`      `3`     `4`     `5`
##    <chr>                            <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
##  1 Eric Church....Desperate Man  0.999    0.000187 0.000187 1.87e-4 1.87e-4
##  2 Imagine Dragons....Thunder    0.000200 0.000200 0.999    2.00e-4 2.00e-4
##  3 Dierks Bentley....Woman, Amen 0.000208 0.999    0.000208 2.08e-4 2.08e-4
##  4 The Head And The Heart....Al… 0.999    0.000288 0.000288 2.88e-4 2.88e-4
##  5 Vance Joy....Saturday Sun     0.000189 0.999    0.000189 1.89e-4 1.89e-4
##  6 Eric Church....Mr. Misunders… 0.999    0.000155 0.000155 1.55e-4 1.55e-4
##  7 Fall Out Boy....Hold Me Tigh… 0.999    0.000200 0.000200 2.00e-4 2.00e-4
##  8 alt-J....In Cold Blood        0.999    0.000241 0.000241 2.41e-4 2.41e-4
##  9 CHVRCHES....Get Out           0.999    0.000312 0.000312 3.12e-4 3.12e-4
## 10 twenty one pilots....Message… 0.999    0.000200 0.000200 2.00e-4 2.00e-4
## # … with 503 more rows

To join back onto the original charts data we need to recreate the id column in the charts tibble. We then join only the unique songs to avoid duplication, and clean the column headers.

# join back on to charts to get the genre 
chart_topics <- charts %>% 
  mutate(id = glue::glue("{artist}....{title}")) %>% 
  distinct(chart, id) %>% 
  right_join(chart_lda) %>% 
  janitor::clean_names()
## Joining, by = "id"
## Warning: Column `id` has different attributes on LHS and RHS of join
head(chart_topics)
## # A tibble: 6 x 7
##   chart      id                          x1      x2      x3      x4      x5
##   <chr>      <glue>                   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Hot Count… Eric Church....Despe… 0.999    1.87e-4 1.87e-4 1.87e-4 1.87e-4
## 2 Rock Songs Imagine Dragons....T… 0.000200 2.00e-4 9.99e-1 2.00e-4 2.00e-4
## 3 Hot Count… Dierks Bentley....Wo… 0.000208 9.99e-1 2.08e-4 2.08e-4 2.08e-4
## 4 Rock Songs The Head And The Hea… 0.999    2.88e-4 2.88e-4 2.88e-4 2.88e-4
## 5 Rock Songs Vance Joy....Saturda… 0.000189 9.99e-1 1.89e-4 1.89e-4 1.89e-4
## 6 Hot Count… Eric Church....Mr. M… 0.999    1.55e-4 1.55e-4 1.55e-4 1.55e-4

We now have a dataset that can be used to create a genre classification model.

5.4 Genre Classification of Topics

Below we follow the same steps used for the audio-feature classification. The only differences are that the recipe formula is specified as chart ~ x1 + x2 + x3 + x4 + x5, and that we use the randomForest engine rather than ranger.

#------------------------------- pre-processing -------------------------------#
set.seed(0)
init_split <- initial_split(chart_topics, strata = "chart")
train_df <- training(init_split)
test_df <- testing(init_split)

# create recipe
chart_rec <- recipe(chart ~ x1 + x2 + x3 + x4 + x5, data = train_df) %>% 
  prep()

# bake the training and testing to have clean dfs
baked_train <- bake(chart_rec, train_df)
baked_test <- bake(chart_rec, test_df)


#--------------------------------- model fit ----------------------------------#

rf_fit <- rand_forest(mode = "classification") %>%
  set_engine("randomForest") %>%
  fit(chart ~ ., data = baked_train)

lyric_classifier <- rf_fit

#------------------------------ model evaluation ------------------------------#
rf_estimates <- predict(rf_fit, baked_test) %>%
  bind_cols(baked_test) %>%
  yardstick::metrics(truth = chart, estimate = .pred_class)

rf_estimates
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.567
## 2 kap      binary         0.131

This is not a very good model. But what happens if we combine it with the audio feature model?