Predicting Popularity on Spotify – When Data Needs Culture More than Culture Needs Data

Philip Peker

Jun 25, 2021

A short, step-by-step stroll through an Introductory Machine Learning project using Spotify data.

Image Source: Mick Haupt/Unsplash

Back when I was an Original Music intern at Butter Music in 2016, our team thought a lot about how to parameterize audio. For one, we were laying the foundation for a proprietary music sync library and we needed a novel taxonomy for all the different sounds that were going to make up the library. The goal was to allow our current and future clients to search by mood, genre, instruments or other keywords.

In the meantime, we were also continuing to write original music for our clients’ commercials. While this workflow was more custom and service-driven, we still had to figure out how to continuously transform client feedback such as "Let’s make the track more inquisitive and approachable, but less sappy" into actual melodies, arrangements, timbres, and atmospheres.

With both of these tasks, it felt like we were missing a translation layer between colloquial language and the language of music. This problem is not new. If you’re a music nut like me, I’m sure you’ve sometimes struggled for the right words to explain why you like a song so much. "I don’t know, it’s just so funky and smooth" is usually what rolls out of my mouth when describing the new Emotional Oranges or SiR track that I can’t stop telling my friends about.

Many organizations and teams have embarked on the journey of quantifying music at scale, but none as radically as Spotify. According to Counterpoint Research, Spotify holds a 34% share of paid music-streaming subscriptions worldwide, the largest of any competitor in the space. Apple Music comes in second with a 21% market share.

Consequently, this also means the size and scale of their data warehouses are second to none. Spotify has found a wonderful harmony between their data engineering teams and their product and sales teams. They feed off and into each other, driving the company’s unrelenting growth.

The better Spotify can quantify music, the better they can tune their systems and algorithms to generate more revenue for themselves and their stakeholders.

I was fascinated by Spotify’s unique business goal of making quantitative sense of music. So fascinated, that it pushed me to go out of my comfort zone and embark on my personal data science journey to better understand the interplay between music and data.

In this article, I invite you to come walk the path I took for my first Machine Learning project using Spotify tracks data as the focal point.

Before I move any further, I want to make a few things clear: this article details the trek I took from 0 to 1 and not 1 to ‘n’. My methodology was neither exhaustive nor flawless, and in fact, likely contains many missed technical opportunities, some of which I touch on near the end of the article. I am excited to continue iterating and learning as I go and so this project is not a final stop, but merely a first step.

What I want to leave behind is the bigger-picture findings that helped me wrap my head around the shape and size of the problem in quantifying and parameterizing music. I don’t purport to solve anything here, but rather shed a new light, and at a new angle, on an age-old problem that will only get more complex the more we listen and stream music.

Alright, enough preambling, let’s dig in.

What the hell is popularity in music, how do I become famous, and other gnawing existential predicaments.

For my data science capstone project (shoutout to General Assembly) I was interested in finding out a bit more on how Spotify understands ‘popularity’. The essential question for me was: could we use a song’s attributes to predict a track’s ‘popularity’?

From the get-go, I was eyeing a Kaggle dataset that was put together based on Spotify’s Web API. For those not familiar with the Spotify Web API, here is a screenshot of just some of the callable parameters that can be used to analyze tracks on Spotify:

Image by author

According to Spotify, "popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past."

The first order of business for me is to take a look at the data source and begin on some exploratory data analysis.

EDA – A bit of looking, cleaning and visualization

The dataset has 586,672 rows, 20 columns.

Image by author

Off the bat, I notice three wrinkles in the data that I should be mindful of:

Two variables are already dummified (‘mode’ and ‘explicit’)
Certain categorical variables, such as ‘key’, are value-encoded, but their relative values are meaningless. If 0 is the key of C, and 1 is the key of C#, this does not mean the key of C# is intrinsically greater by 1 point than the key of C
‘time_signature’ is a predicted value already

Let’s take a look at the raw data to ensure it’s in the best format.

Image by author

We drop some null values and do one retouch on the ‘duration_ms’ column. Duration is being expressed in milliseconds which makes little sense in the context of song duration, so we convert to minutes.

Image by author
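For readers without the screenshot, here is a minimal sketch of that cleaning step. The toy table and the new column name 'duration_min' are assumptions made for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for the tracks table; the values are made up for illustration.
tracks = pd.DataFrame({
    "name": ["Track A", None, "Track C"],
    "duration_ms": [210000, 180000, 245500],
    "popularity": [55, 12, 70],
})

# Drop any row that contains a missing value.
tracks = tracks.dropna().reset_index(drop=True)

# One minute is 60,000 milliseconds.
tracks["duration_min"] = tracks["duration_ms"] / 60_000
tracks = tracks.drop(columns=["duration_ms"])

print(tracks)
```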

It may also be useful to dummify the three categorical variables we have in the dataframe, so let’s just do that now.

data = pd.get_dummies(data, columns=['time_signature', 'key', 'mode'], drop_first=True)

With that, we should be good to start on some visualizations.

Visualizations:

For starters, we generate some Seaborn Pair plots across several variables:

Image by author

Shockingly, few to no simple linear relationships jump out. Let’s continue on to some more granular visualizations to see what’s really going on.

Since predicting popularity is our north star, I’m curious to see what the popularity distribution is across the dataset.

Image by author

The Pareto principle is in full effect here, with a right-skewed distribution showing us how truly rare it is to have a popular song.

I also want to discover some domain-specific nuances in the data. Among several fun visualizations, the double bar plot of Key/Mode vs Popularity highlighted some interesting points.

Image by author

In Western music, there exist 12 possible keys. Each key, however, can live in either a minor or a major tonality. Diatonically speaking, there are three major modes, and four minor modes. This bar plot describes how the popularity differs for the same key across different tonalities (0 being a minor tonality, 1 being major). For example, a track in C# minor tends to be more popular than a track in C# major.
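The numbers behind a plot like this can be sketched with a group-by. This is a rough illustration on made-up values, and it assumes the original integer 'key' and 'mode' columns as they were before dummification.

```python
import pandas as pd

# Made-up rows; key 0 = C, key 1 = C#, mode 0 = minor, mode 1 = major.
toy = pd.DataFrame({
    "key": [0, 0, 1, 1, 1, 1],
    "mode": [0, 1, 0, 0, 1, 1],
    "popularity": [40, 30, 60, 50, 20, 30],
})

# Mean popularity per key, with one column per tonality.
mean_pop = toy.groupby(["key", "mode"])["popularity"].mean().unstack("mode")
print(mean_pop)
```

Each row of mean_pop is a key and each column a tonality, which is the shape a double bar plot needs.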

Of course, a confounding variable here could be the keys that vocalists prefer. This assumes that the more popular a track is, the more likely it is to contain vocals, which seems like a fair assumption based on the third observation below (i.e., ‘acousticness’ has a negative correlation with ‘popularity’).

Let’s look at a correlation table to identify some baseline correlations between our many X variables.

plt.figure(figsize=(20, 10))
# numeric_only=True (pandas 1.5+) skips text columns such as track names
sns.heatmap(data.corr(numeric_only=True), annot=True)

Image by author

Some observations:

‘Energy’ and ‘loudness’ have the highest correlation, and a positive one, which does not surprise
‘Energy’ and ‘acousticness’ have a strong inverse correlation, which also makes total sense. The more a song skews towards being acoustic, the less energetic it tends to be
Unfortunately, with our dependent variable being ‘popularity’, we notice very poor correlation values across our independent variables. The best we get is a -.37 between ‘acousticness’ and ‘popularity’
From this correlation matrix, I plucked four of the best features (ones with the highest correlation) to use later on during feature engineering. These four are: ‘acousticness’, ‘instrumentalness’, ‘loudness’, and ‘energy’

By squaring our highest correlation coefficient, R, we get the coefficient of determination (R²) that we need to clear: .136. The bar is low, but let’s explore by how much we can beat it.
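Why squaring a correlation gives a sensible R² baseline: for a linear regression with a single predictor and an intercept, the in-sample R² equals the squared Pearson correlation. A small sketch on synthetic data (the slope of -0.4 and the noise level are arbitrary assumptions, not values from the Spotify dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a weak negative linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = -0.4 * x + rng.normal(size=500)

# Pearson correlation between the single feature and the target.
r = np.corrcoef(x, y)[0, 1]

# R² of a one-feature linear regression, scored on the same data.
X_single = x.reshape(-1, 1)
r2_model = LinearRegression().fit(X_single, y).score(X_single, y)

print(round(r ** 2, 4), round(r2_model, 4))
```

The two printed values match, so any model using more than the single best feature should be expected to clear this bar.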

"Hi, I’m here for the modeling gig? Am I in the right place?"

Model 1: Linear Regression

To warm up the oven, let’s see what kind of success we can get from a simple linear regression model.

Set the variables:

X = data[features]
y = data['popularity']

For now [features] includes every single independent variable in our dataframe.
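One way such a list could be built is sketched below. The toy columns are assumptions, and the notebook may well have defined the list by hand instead.

```python
import pandas as pd

# Toy frame; 'mode_1' mimics a dummy column produced by pd.get_dummies.
toy = pd.DataFrame({
    "name": ["a", "b"],
    "popularity": [10, 50],
    "energy": [0.3, 0.8],
    "loudness": [-12.0, -5.0],
    "mode_1": [True, False],
})

# Keep numeric and boolean columns, then exclude the target itself.
candidates = toy.select_dtypes(include=["number", "bool"]).columns
features = [col for col in candidates if col != "popularity"]

print(features)
```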

Split our data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X[features], y, train_size=0.5, random_state=8)

Train our model

lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

This prints out our coefficient of determination, R², of .213. While we did already beat our baseline, generally speaking, this is a dangerously low R².

Making predictions

Now let’s call the predict method on our test data.

y_pred = lr.predict(X_test)

Metric Evaluation

Finally, we can print out three key metrics to determine model fit.

import numpy as np
from sklearn import metrics

print(metrics.mean_absolute_error(y_test, y_pred))  # MAE
print(metrics.mean_squared_error(y_test, y_pred))  # MSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # RMSE

MAE = 13.24

MSE = 266.19

RMSE = 16.32
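To make these three numbers less abstract, here is a small worked sketch that computes them by hand on four made-up predictions and compares the RMSE against a naive model that always predicts the mean. The values are illustrative only.

```python
import numpy as np

# Four made-up true scores and predictions.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_hat             # [-2, 2, -3, 3]
mae = np.mean(np.abs(errors))       # mean absolute error
mse = np.mean(errors ** 2)          # mean squared error
rmse = np.sqrt(mse)                 # back in the units of the target

# Naive reference: always predict the mean of the true values.
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))

print(mae, mse, rmse, baseline_rmse)
```

Comparing a model's RMSE to this kind of mean-only reference is one quick way to judge whether an error of a given size is large for the data at hand.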

Conclusion

Following common practice, we use RMSE as the primary metric to evaluate our linear regression model. On a ‘popularity’ scale of 0–100, a typical prediction error of 16.32 points is huge.

And with an R² of .213, we have only just cleared our baseline of .136.

I ran this model again, but now with only the four best features noted above. The model actually worsened, with an R² of .181 and an RMSE of 16.64.

So let’s see if we can nudge these in a positive direction with some other regression techniques.

Model 2: Decision Tree

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

max_depth_range = range(1, 15)
RMSE_scores = []

# For each depth, average the RMSE over 5 cross-validation folds
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=5, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))

plt.plot(max_depth_range, RMSE_scores);
plt.xlabel('max_depth');
plt.ylabel('RMSE (lower is better)');

Image by author

Even with just a small range of depths (1 through 14), we managed to get a better RMSE of 15.64 using a max_depth of 10. Better, but given the nature of decision trees, we naturally engaged in some gross overfitting for a minor gain in the standard deviation of our prediction errors.

Model 3: Random Forest

This is another attempt to curb overfitting and improve accuracy. We apply the same methodology as above to a random forest regressor, and see the following metrics:

RMSE: 14.80

Out of bag score: .38

*A reminder: the out-of-bag score evaluates each example 𝑥ᵢ using only the trees in the random forest ensemble for which it was omitted during training. For a regressor in scikit-learn, this score is reported as an R².
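A minimal sketch of how an out-of-bag score can be obtained with scikit-learn. The data is synthetic, and n_estimators=50 is an arbitrary assumption rather than the setting used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: the target depends mostly on the first feature.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 4))
y_toy = 3 * X_toy[:, 0] + rng.normal(size=300)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not include that row.
forest = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=1)
forest.fit(X_toy, y_toy)

# For a regressor, oob_score_ is an R² value.
print(forest.oob_score_)
```

Because the out-of-bag rows act as a built-in validation set, this score can be compared to a test-set R² without holding out extra data.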

We see that our OOB score is now handsomely beating our baseline R². As far as regression models tested, hiking through the random forest has led us to a happier destination. We’re getting closer and closer to a model that has a stronger generalizability, which is ultimately what we’re shooting for here.

With all this being said, I’m definitely not excited about the results we’re getting with our regression methods, so perhaps it’s time to see what the classification world could offer us. Off we go.

Classifications

In order to set up any kind of classification model, we need to move away from trying to predict a continuous numeric value for our output, ‘popularity’, and instead predict categories/labels for it. So let’s create some bins for ‘popularity’. We’ll segment and sort our values into three equal-width bins of ‘low’, ‘medium’ and ‘high’ popularity using pd.cut.

Image by author
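As a rough sketch of the binning step, assuming pd.cut is called with bins=3 (the exact arguments in the notebook are not shown here, and the sample values are made up):

```python
import pandas as pd

# Made-up popularity scores spanning the full 0-100 range.
popularity = pd.Series([0, 10, 33, 34, 50, 66, 67, 100])

# bins=3 splits the observed range into three equal-width intervals.
pop_class = pd.cut(popularity, bins=3, labels=["low", "medium", "high"])

print(pop_class.value_counts())
```

Equal-width bins say nothing about how many rows land in each one, so a skewed popularity distribution will produce bins of very different sizes.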

One element that sticks out from our binning is the uneven count distribution across the three bins. Knowing how easily some classification models can be affected by imbalanced data, I resampled the classes using RandomOverSampler from the imbalanced-learn package:

Image by author
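To show what random oversampling does mechanically, here is a sketch that uses sklearn.utils.resample as a stand-in for RandomOverSampler. It is not the imbalanced-learn implementation, and the toy column names are assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Imbalanced toy data: 4 'low', 2 'medium', 1 'high'.
toy = pd.DataFrame({
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "pop_class": ["low", "low", "low", "low", "medium", "medium", "high"],
})

target_size = toy["pop_class"].value_counts().max()

parts = []
for label, group in toy.groupby("pop_class"):
    parts.append(group)  # keep every original row
    extra = target_size - len(group)
    if extra > 0:
        # Draw duplicates with replacement until the class reaches target_size.
        parts.append(resample(group, replace=True, n_samples=extra, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["pop_class"].value_counts())
```

One thing to keep in mind with this approach: if the oversampling happens before the train-test split, copies of the same row can end up on both sides of the split, which may inflate test accuracy.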

Now that our classes are even, we can set up and instantiate our classification model; this time, we’ll try a KNN classifier.

Let’s re-input our four top features to set up our design matrix:

feature_cols = ['acousticness', 'instrumentalness', 'loudness', 'energy']
X = data[feature_cols]

Next, we can perform a train-test split using our oversampled classes:

X_train, X_test, y_train, y_test = train_test_split(X_ros, y_ros, random_state=99, test_size=0.3)

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

Out pops an accuracy score of: .807

Holy smokes, what an improvement. Smells like major overfitting though, so let’s implement some hyperparameter tuning and search for an optimal ‘k’.

Since both a manual search over nearest-neighbor values using for-loops and GridSearchCV were very computationally expensive for my humble computer, I opted to use RandomizedSearchCV. This class still implements "fit" and "score" methods but doesn’t try out all parameter values, like GridSearchCV does. Rather, a fixed number of parameter settings get sampled from the specified distributions.

Image by author
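A minimal sketch of a randomized search over k, run on synthetic data. The n_iter=5 setting and the generated dataset are assumptions for illustration; only the range of 1 to 21 neighbors mirrors the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem with four features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Sample 5 of the 21 candidate k values instead of trying all of them.
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": list(range(1, 22))},
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

The trade-off is speed for coverage: the best k among the sampled values is returned, which is not guaranteed to be the best k in the full range.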

Results

So how did we do?

First, we can print out a confusion matrix.

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns are the predicted classes
cmat = confusion_matrix(y_test, y_pred_class)
print(cmat)

# With three classes, the binary TN/FP/FN/TP labels do not apply:
# correct predictions sit on the diagonal, errors everywhere else
correct = np.trace(cmat)
total = np.sum(cmat)
print('Accuracy Score: {}'.format(correct / total))
print('Misclassification Rate: {}'.format((total - correct) / total))

Image by author

An accuracy score of 56.52% is the best we’ve batted so far, and I’m feeling fairly confident in the veracity of this score. For our KNN model, we balanced our classes, we feature engineered, we performed some hyperparameter tuning, and we came all the way from regression world to boot! But for all our work, when we zoom out, an accuracy of 56.52% also means an error rate of 43.48% when predicting with our model.

Conclusion: Limitations, expectations, and reverberations

Our highest-performing model was the KNN classification model, but this comes with a large asterisk. Morphing our data to allow for a discrete, classification approach versus a continuous, regression approach means that predictive robustness within the model was largely swept under the rug. It’s much easier for a model to predict for one of three popularity classes ("low", "medium" or "high"), versus a discrete numerical score for popularity from 0–100. So while we can feel more optimistic about the accuracy of the model itself, we’re far from being able to productionalize this and enable a fulfilling end-user prediction experience from our work so far.

I’m also hyper-aware of other limitations that I faced throughout:

Machine learning becomes computationally expensive, quickly, and I had to make concessions so that I could even run certain blocks of code. This compromised some level of robustness and depth, whether it was using only five folds for our cross-validator, versus the commonly used cv=10, or only running our RandomizedSearch for a k_range of (1, 22)
In the future, I could try testing a weighted KNN (weighted voting) model to mitigate the effect of our imperfect search for the perfect k
Did I handle the data balance problem correctly? While we flagged it and did oversample, I did not try undersampling, and perhaps this could have yielded a different result. Yes, it would mean fewer rows, but also fewer duplicated rows
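The weighted-voting idea mentioned above could be tried with scikit-learn's weights='distance' option. A sketch on synthetic data, where k=15 is an arbitrary assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the popularity classes.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

results = {}
for weights in ("uniform", "distance"):
    # 'uniform' gives every neighbor one vote; 'distance' weights each vote
    # by the inverse of its distance, so closer neighbors count more.
    knn = KNeighborsClassifier(n_neighbors=15, weights=weights)
    results[weights] = cross_val_score(knn, X_toy, y_toy, cv=5).mean()

print(results)
```

Whether distance weighting helps is an empirical question for each dataset; the sketch only shows how the two variants could be compared side by side.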

As I crawled out of the ML cave, I wondered if there is a bigger question here in terms of the dataset philosophy itself, and whether this was an ontological limitation.

Why is it that for a dataset that is so well-manicured and organized, and with no shortage of rows, the relationships found within are strained? One would think that popularity is a lever that artists, labels, management companies, A&R, and Spotify themselves would want to be able to pull up and down. The obvious answer is that there are thousands of other variables that add to the noise. And, perhaps selecting popularity as the dependent variable in my project meant I was destined for disappointment, in that it highlighted how narrow the independent variables I was given to work with are, and also how little we truly know about the Spotify algorithm that computes ‘popularity’.

This dataset and its Web API parent are descriptive, not predictive, in nature. ‘good 4 u’ by Olivia Rodrigo isn’t popular because its loudness, energy, and instrumentalness scores are high (even though they probably are; unfortunately, that song wasn’t in this somewhat outdated dataset). It’s popular because of who Olivia is, her blend of nostalgic old (hi @ Paramore) with sparkly new, and the buildup of anticipation for anything by Olivia after ‘drivers license’. Where were these parameters in our dataset?!

Given the state of modern music and how it’s consumed, it would be prudent to study variables such as, and we’re just scratching the surface here:

Social media following
Whether an artist is signed to a record label, and if so, which one?
A metric to represent an artist’s network value (A "Who do you know here?" score)
A "nostalgia" score
Historical data for that artist

We could also look to break down "popularity" into subsets based on:

region
demographics
which device it is streaming on
number of shared Spotify accounts

…and many more slices.

With just these new features added to our dataframe, it would be interesting to see if there is an improvement in our model accuracy, and if so, by how much.

And yet, like all guardians of a galaxy, Spotify prudently keeps this type of data hidden. Whether for reasons involving PII, or stakeholder contracts and master service agreements, or anything else beyond our purview, those accessing Spotify data via the API are ultimately accessing just the tip of the data iceberg.

Obviously, the world of data science and AI extends far, far beyond what was tried here. Moving forward, I’d love to dig into unsupervised learning and deep learning techniques to see what delights we can find along those paths.

All in all, I’m really excited to see the music analytics vertical grow. Coming out of this project, I have a renewed sense of hope that even in our algorithmic, seemingly monolithic streaming world, we can still have wild, silky, colorful, complicated, emotional auditory outliers that punch us in the gut and wake us the f*** up.

Thanks for reading all the way until the end (hell, I barely made it to the end writing this). I’m personally still making sense of all this newness, and would love to hear your thoughts on my thoughts, whether you’re a data scientist, a musician, a Spotify enthusiast, a fan of cultural insights, or anything at all.

Until next time 🎶

In case you want to check out the full Jupyter notebook and a high-level PowerPoint for this capstone project, here’s the GitHub: https://github.com/philinyouin/SpotifyPopularityPrediction

URL: https://towardsdatascience.com/predicting-popularity-on-spotify-when-data-needs-culture-more-than-culture-needs-data-2ed3661f75f1/