What factors are meaningfully predictive of a song staying on the Billboard Hot 100? We are team Food Network, and we are about to take you on a journey to Flavortown, describing our process for creating a model that will predict the duration of a song on the Billboard Hot 100.
Let’s load in our data. Our data come from three sources: Spotify audio features, network features, and lyric embeddings. The lyric embeddings come from the RoBERTa model, which encodes each song’s full lyrics as a 768-dimensional vector. The network features come from three separate networks – a songwriter, a performer, and a producer network. Centrality values are calculated for each song by taking the mean and the sum over all performers, songwriters, or producers credited on that song. The Spotify features all come directly from the hot_long dataset.
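The per-song centrality aggregation can be sketched in Python. This is a toy illustration, not our actual pipeline: the table, column names, and numbers below are made up, but the mean/sum collapse mirrors how features like perform_mean_eigen and perform_sum_eigen are built.

```python
import pandas as pd

# Hypothetical toy data: one row per (song, performer) pair, with that
# performer's centrality in the performer network. Illustrative values only.
pairs = pd.DataFrame({
    "Song":  ["A", "A", "B"],
    "eigen": [0.25, 0.5, 0.75],
})

# Collapse to one row per song: mean and sum over everyone credited on it.
agg = pairs.groupby("Song")["eigen"].agg(["mean", "sum"]).reset_index()
print(agg)
```

The same groupby-then-aggregate pattern works for closeness and betweenness, and for the songwriter and producer networks.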
lyrics_df <- read_csv('master_songs_dataset.csv')
## Rows: 1403 Columns: 45
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Song, Performer_x, lyrics, genius_title, Performer_y, spotify_genr...
## dbl (37): genius_pageviews, duration, casey_bin, Previous.Week.Position, Pea...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cols_to_use <- c('casey_bin', 'loudness', 'danceability', 'energy', 'key',
'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'time_signature', 'prod_mean_close',
'prod_mean_between', 'prod_mean_eigen', 'prod_sum_close', 'prod_sum_between',
'prod_sum_eigen', 'write_mean_close', 'write_mean_between', 'write_mean_eigen',
'write_sum_close', 'write_sum_between', 'write_sum_eigen',
'perform_mean_close', 'perform_mean_between',
'perform_mean_eigen', 'perform_sum_close', 'perform_sum_between',
'perform_sum_eigen')
y <- as.matrix(lyrics_df$duration)
x2 <- as.matrix(lyrics_df[, cols_to_use])
np <- import("numpy")
# Load in lyric embeddings
embeddings <- np$load("lyrics_embedded_roberta.npy")
Create the training/testing split
#nice
set.seed(69)
full_matrix <- cbind(embeddings, x2, y)
full_matrix <- full_matrix[rowSums(is.na(full_matrix)) == 0, ]
data_split <- sample(c(TRUE, FALSE), nrow(full_matrix), replace=TRUE, prob=c(0.7,0.3))
training_data <- full_matrix[data_split,]
testing_data <- full_matrix[!data_split,]
x_train <- training_data[, 1:(ncol(training_data) - 1)]
y_train <- training_data[, ncol(training_data)]
x_test <- testing_data[, 1:(ncol(testing_data) - 1)]
y_test <- testing_data[, ncol(testing_data)]
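The same Bernoulli-mask split can be sketched in Python. This is a toy stand-in (a 10×4 matrix instead of our real feature matrix), but the mechanics match the R code above: draw one random TRUE/FALSE per row with probability 0.7, then peel off the last column as the target.

```python
import numpy as np

rng = np.random.default_rng(69)  # same seed spirit as set.seed(69) above

# Toy stand-in for full_matrix: 10 rows, last column is the target.
full_matrix = np.arange(40, dtype=float).reshape(10, 4)

# Bernoulli(0.7) mask, mirroring sample(c(TRUE, FALSE), prob=c(0.7, 0.3)) in R.
mask = rng.random(full_matrix.shape[0]) < 0.7
train, test = full_matrix[mask], full_matrix[~mask]

x_train, y_train = train[:, :-1], train[:, -1]
x_test, y_test = test[:, :-1], test[:, -1]
print(x_train.shape, x_test.shape)
```

Note that a Bernoulli mask gives a split that is only 70/30 in expectation; the exact row counts vary run to run unless the seed is fixed.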
Define the model – we’ll be using a fancy neural network regression model.
model <- keras_model_sequential() %>%
layer_dense(units = 32, input_shape = c(ncol(x_train))) %>%
layer_activation('relu') %>%
layer_dense(units = 8) %>%
layer_activation('relu') %>%
layer_dense(units = 1)
model %>% compile(
  optimizer = 'adam',
  loss = 'mean_absolute_error',
  metrics = 'mean_absolute_error'  # accuracy is not meaningful for regression
)
history <- model %>%
fit(x_train,
y_train,
epochs = 100,
batch_size = 16,
validation_split = 0.05,
verbose = 0)
Let’s get the predictions from the trained model and see how well they fit the true data.
predictions<-model %>% predict(x_test)
## 14/14 - 0s - 76ms/epoch - 5ms/step
head(predictions)
## [,1]
## [1,] 28.45243
## [2,] 27.81550
## [3,] 28.32582
## [4,] 31.07643
## [5,] 24.50566
## [6,] 23.79673
# R-squared value (how good is the model fit)
cor(predictions, y_test)**2
## [,1]
## [1,] 0.05382058
# P-value of correlation
cor.test(predictions, y_test)
##
## Pearson's product-moment correlation
##
## data: predictions and y_test
## t = 4.911, df = 424, p-value = 1.296e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1400707 0.3199498
## sample estimates:
## cor
## 0.2319926
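For readers following along in Python, the quantity R reports here is just the squared Pearson correlation. A minimal sketch with made-up numbers (the arrays below are illustrative, not our actual predictions or test targets):

```python
import numpy as np

# Toy stand-ins; in the report these are the model's predictions and the
# held-out true durations (in weeks).
predictions = np.array([28.4, 27.8, 31.1, 24.5, 23.8, 26.2])
y_test      = np.array([10.0, 35.0, 12.0, 30.0,  8.0, 20.0])

# Squared Pearson correlation, the same quantity as cor(predictions, y_test)^2 in R.
r = np.corrcoef(predictions, y_test)[0, 1]
r_squared = r ** 2
print(round(r_squared, 4))
```

A caveat worth keeping in mind: squared correlation between predictions and truth can only reward linear association, so it can flatter a model whose predictions are systematically biased.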
Damn! That wasn’t a good R² value. Let’s look at the residual plot to get an idea of what values the model is predicting.
# Residual plot (how close are the predictions to the actual value)
residuals <- predictions - y_test
idxs <- seq_len(nrow(residuals))
df <- data.frame(residuals, idxs)
ggplot(data=df, aes(x=idxs, y=residuals)) + geom_jitter(color="orange")+
geom_hline(yintercept = 0)+
labs(x="Song", y="Predicted duration - truth",
title="Residual plot of duration prediction model")
Lots of variance.
The residuals run as large as 60 weeks, which is more than a year on the Billboard chart, and many others are off by 30 weeks, over half a year.
The features we used were not very good predictors of a song’s duration on the Billboard Hot 100.
Worry not, for our chefs have cooked up another interesting find in our other kitchen: Python.
import pandas as pd
import numpy as np
Load our libraries for Python
#This is a attempt to recreate the hot long set with the WeekID values
hot_long = pd.read_csv('hot_long.csv')
song_creators = pd.read_csv('song_creators_wide - song_creators_wide.csv')
merged_hotCreators = pd.merge(hot_long, song_creators, left_on='Song', right_on='Song', how='inner')
Loading the files we are using and joining them. The easiest way to group these songs was to join hot_long – the one with the WeekID format – with the wide spreadsheet. We went with this because artists pop in and out of the chart so often that grouping by artist doesn’t make sense; grouping by song is both practical and gives the results we were chasing. It also avoids long filtering times: having a script scan every song from both lists would have taken too long.
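The inner join described above can be sketched with toy data. The song titles and columns here are made up for illustration, but the behavior is the point: only songs present in both sheets survive the merge.

```python
import pandas as pd

# Toy stand-ins for the two spreadsheets; everything but 'Song' is illustrative.
hot_long = pd.DataFrame({
    "Song":   ["Hey Ya!", "Toxic", "Yeah!"],
    "WeekID": ["2003-11-01", "2004-01-17", "2004-02-28"],
})
song_creators = pd.DataFrame({
    "Song":    ["Hey Ya!", "Toxic"],
    "Creator": ["OutKast", "Cathy Dennis"],
})

# Inner join on Song keeps only songs present in both sheets,
# just like merged_hotCreators in our pipeline.
merged = pd.merge(hot_long, song_creators, on="Song", how="inner")
print(merged)
```

Since both keys share the name Song, `on="Song"` is equivalent to the `left_on`/`right_on` pair in our code.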
Pretend this is Python code:
merged_hotCreators = pd.merge(hot_long, song_creators, left_on='Song', right_on='Song', how='inner')
# Rename columns
merged_hotCreators.columns = ['Song', 'Start_date', 'End_date', 'Source', 'Target']
merged_hotCreators = merged_hotCreators[['Source', 'Target', 'Song', 'Start_date', 'End_date']]
Converting to a time format that Gephi will appreciate. This produces a CSV that will be used for the graph in Gephi.
merged_hotCreators.to_csv('modified_dates.csv', index=False)
Chef Gordon will explain. What we see here is the network of artists from the Billboard chart. At the beginning of the timeline, sliced into increments of 10 years, we mostly see stand-alone artists.
Toward the middle of the time frame, artists start to cluster together. This is around the 2010s, when artists collaborate with each other more often and hip-hop and rap show up on the chart.
The third image is the modern era of the Billboard Hot 100: the groupings form much tighter, creating rings around themselves. The scattered dots are artists who appear on the Hot 100 but are solo artists, in the sense that they don’t feature anyone.