Chapter 4 Genre Classification — Audio feature approach

Thanks to bbcharts and spotifyr, we were able to identify the most popular rock and country songs from 2016-2018 and retrieve their audio features. We can use these features to build a classification model that uses the Spotify audio features to predict whether a song is rock & roll or country.

The following code relies heavily on tidymodels to pre-process, split, and model the data.

First, we will remove any unnecessary columns and coerce key and chart to factors.

library(tidymodels)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidymodels 0.0.2 ──
## ✔ broom     0.5.2       ✔ recipes   0.1.5  
## ✔ dials     0.0.2       ✔ rsample   0.0.4  
## ✔ infer     0.4.0.1     ✔ yardstick 0.0.3  
## ✔ parsnip   0.0.2
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
# drop the columns we will not model on and convert chart and key to factors
clean_chart <- select(chart_analysis,
                      -c("duration_ms", "time_signature", "type", "mode",
                         "rank", "year", "artist", "featured_artist", "title")) %>%
  mutate(chart = as.factor(chart),
         key = as.factor(key))

Now that we have our data, we need to partition it into a training and test set using rsample. By default, initial_split() holds out one quarter of the observations for testing, and strata = "chart" keeps the proportion of rock and country songs similar in both sets.

# set a seed for reproducibility 
set.seed(0)

# partition data
init_split <- initial_split(clean_chart, strata = "chart")

# extract training set
train_df <- training(init_split)

# extract testing set
test_df <- testing(init_split)
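
Because we stratified on chart, the two partitions should contain a similar mix of rock and country songs. A quick count of each set is an easy sanity check (not strictly required for the analysis):

# compare the genre balance across the two partitions
count(train_df, chart)
count(test_df, chart)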

Next, we specify the recipe we want to use with the recipes package. Think of the recipe as the “description of what steps should be applied” to the data. In this chunk, we specify the formula and the training data in recipe(). Following that, we specify two preprocessing steps: centering and scaling (standardizing) our numeric variables. Last, we prep() the recipe for training.

# define the recipe: center and scale all numeric variables, then prep it
chart_rec <- recipe(chart ~ ., data = train_df) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()
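
If you want to see which steps the prepped recipe contains, tidy() returns a one-row-per-step summary (a quick inspection only; nothing later depends on it):

# list the preprocessing steps stored in the prepped recipe
tidy(chart_rec)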

We apply the recipe to the training and testing data using bake(). The first argument is the recipe object, which provides the plan for processing the data passed in the second argument, new_data.

# apply pre-processing to create tibbles ready for modeling
baked_train <- bake(chart_rec, new_data = train_df)
baked_test <- bake(chart_rec, new_data = test_df)
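
As a quick check that the recipe behaved as expected, a standardized numeric column should now have a mean near 0 and a standard deviation near 1. Here danceability is assumed to be one of the retained Spotify audio features:

# sanity check on one standardized column (danceability is assumed to be present)
mean(baked_train$danceability)
sd(baked_train$danceability)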

Now that the training and testing sets have been created, we can move on to the model itself. In this example we will fit a random forest model from the ranger package. This will be done with parsnip.

parsnip is another package from the tidymodels ecosystem that creates a single interface for models in R. In the code chunk below we:

  • specify our model as a random forest: rand_forest()
  • identify which package will be used to create the model: set_engine()
  • fit the model to the baked training data using the formula interface: fit()

# specify a random forest classifier, set ranger as the computational engine,
# and fit it to the baked training data
audio_classifier <- rand_forest(mode = "classification") %>%
  set_engine("ranger") %>%
  fit(chart ~ ., data = baked_train)

audio_classifier
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      448 
## Number of independent variables:  10 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.149174

Now that the model has been fit, we can validate its performance on the testing set baked_test. This follows the procedure used in Edgar Ruiz’s A Gentle Introduction to tidymodels.

# predict on the baked test set, attach the true labels, and compute the metrics
ranger_estimates <- predict(audio_classifier, baked_test) %>%
  bind_cols(baked_test) %>%
  metrics(truth = chart, estimate = .pred_class)

ranger_estimates
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.791
## 2 kap      binary         0.581
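
Accuracy and kappa summarize overall agreement with the true labels. To see which genre the model confuses more often, one option is yardstick’s conf_mat(), following the same predict-then-bind pattern as above (a sketch; the exact counts will depend on the seed and split):

# cross-tabulate predicted vs. actual chart labels on the test set
predict(audio_classifier, baked_test) %>%
  bind_cols(baked_test) %>%
  conf_mat(truth = chart, estimate = .pred_class)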