Predicting Spotify Song Popularity

Uploaded by

FabioSantos

Ranking every Machine Learning algorithm to build the best Data Science model using PyCaret.

Matt Przybyla · Towards Data Science · Feb · 7 min read

Photo by Cezar Sampaio on Unsplash [1].

Table of Contents

1. Introduction
2. Model Comparison
3. Summary
4. References

Introduction

Because Spotify and other music streaming services are incredibly popular and widely used, I wanted to apply Data Science techniques with Machine Learning algorithms to this product to predict song popularity. I personally use this product, and what I apply here could be applied to other services as well. I will examine every popular Machine Learning algorithm and pick the best one based on success metrics or criteria; oftentimes, that is some sort of calculated error. The goal of the best model is to predict a song's popularity based on various current and historical features. Keep reading if you would like a tutorial on how to use Data Science to predict the popularity of a song.

Model Comparison

Photo by Markus Spiske on Unsplash [2].

I will be discussing the Python library that I used, along with the data, parameters, models compared, results, and code below.

Library

Using the power of PyCaret [3], you can now test every popular Machine Learning algorithm against one another (or at least most of them).
For this problem, I will be comparing MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), the time it takes for the model to be completed. Some of the benefits of using PyCaret overall, as stated by the developers, are increased productivity, ease of use, and business readiness, all of which I can personally attest to.

Data

The dataset [4] that I am using is from Kaggle. You can download it easily and quickly. It is about 17 MB and contains Spotify data from the years 1921 to 2020, covering 160,000+ tracks. It consists of 174,389 rows and 19 columns. Below is a screenshot of the first few rows along with the first columns:

[Screenshot: output of spotify.head(), showing the first few rows and the first columns of the dataset.]

Data Sample. Screenshot by Author [5].

Columns

After we eventually pick the best model, we can look at the most important features.
I am using the interpret_model() function of PyCaret, which is based on the popular SHAP library. Here are all of the possible features:

['acousticness', 'artists', 'danceability', 'duration_ms', 'energy', 'explicit', 'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'name', 'popularity', 'release_date', 'speechiness', 'tempo', 'valence', 'year']

Here are the most important features using SHAP:

[Figure: SHAP summary plot [6]. x-axis: SHAP value (impact on model output). Features from most to least important: year, instrumentalness, loudness, duration_ms, acousticness, liveness, release_date_month_1, speechiness, danceability, release_date_month_12, valence, tempo, energy, release_date_is_month_start_0, key_0, release_date_weekday_1, release_date_month_6, key_4, explicit_1, key_9.]

All columns are used as features, except for the target variable, which is the column popularity. As you can see, the top three features are year, instrumentalness, and loudness. As a future improvement, it would be better to keep each categorical feature in one column instead of breaking it out into tens of columns, and then feed it into the CatBoost model so that target encoding can be applied instead of one-hot encoding. To do this, we would confirm or change the key column to be categorical, and do the same for any other similar columns.

Parameters

These are the parameters that I used in the setup() of PyCaret. The Machine Learning problem is a regression one, using data from Spotify, with the target variable being the popularity field.
For reproducibility, you can establish a session_id. There are a ton more parameters, but these are the ones that I used, and PyCaret does a great job of automatically detecting information from your data, like picking which features are categorical, and it will confirm that with you during setup().

Models Compared

I will be comparing 19 Machine Learning algorithms. Some are incredibly popular, while some I have actually not heard of, so it will be interesting to see which one wins with this dataset. For the success criteria, I am comparing all of the metrics MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which PyCaret automatically ranks.

Here are all of the models that I compared:

* Linear Regression
* Lasso Regression
* Ridge Regression
* Elastic Net
* Orthogonal Matching Pursuit
* Bayesian Ridge
* Gradient Boosting Regressor
* Decision Tree Regressor
* CatBoost Regressor
* Light Gradient Boosting Machine
* Extra Trees Regressor
* AdaBoost Regressor
* K Neighbors Regressor
* Lasso Least Angle Regression
* Huber Regressor
* Passive Aggressive Regressor
* Least Angle Regression

Results

It is important to note that I am just using a sample of the data, so the order of these algorithms may change if you test this code yourself with all of the data. I used only 1,000 rows instead of the total ~170,000 rows.

As you can see, CatBoost was ranked first, having the best MSE, RMSE, and R2. However, it did not have the best MAE, RMSLE, or MAPE, and it was not the fastest. Therefore, you should establish what you mean by success in terms of these metrics. For example, if time is essential, then you will want to rank that higher, or if MAE matters more to you, you might want to pick Extra Trees Regressor as the winner instead.
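To make the point about success criteria concrete, here is a minimal sketch of how the "winner" changes with the metric you rank by. The model names and numbers are placeholders that I made up for illustration; they are not the results from the comparison in this article.

```python
# Hypothetical leaderboard: names and values are illustrative placeholders,
# NOT the results reported in this article.
leaderboard = [
    {"model": "model_a", "MAE": 9.9, "RMSE": 13.0, "R2": 0.62, "TT (Sec)": 4.0},
    {"model": "model_b", "MAE": 9.7, "RMSE": 13.4, "R2": 0.60, "TT (Sec)": 0.9},
    {"model": "model_c", "MAE": 11.2, "RMSE": 14.8, "R2": 0.51, "TT (Sec)": 0.1},
]

# R2 is the only metric here where a larger value is better;
# the error metrics and the training time should be as small as possible.
HIGHER_IS_BETTER = {"R2"}


def rank_by(rows, metric):
    """Return the rows sorted best-first for the chosen metric."""
    return sorted(rows, key=lambda r: r[metric], reverse=metric in HIGHER_IS_BETTER)


def best_model(rows, metric):
    """Return the name of the top-ranked model for the chosen metric."""
    return rank_by(rows, metric)[0]["model"]


for metric in ("R2", "MAE", "TT (Sec)"):
    print(metric, "->", best_model(leaderboard, metric))
```

With these placeholder values, each of the three metrics picks a different model, which is why it helps to decide on the metric before looking at the ranking. As far as I know, compare_models() also accepts a sort argument for choosing the ranking metric, but check the PyCaret documentation for your version.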
[Screenshot: output of compare_models(), with the models ranked by the metrics above.]

Model Comparison. Screenshot by Author [7].

Overall, you can see that even with a small sample of the dataset, we fared pretty well. The popularity target variable has a range of 0 to 91. Therefore, for MAE for example, our average error is 9.7 popularity units. Out of 91, that is not too bad, considering we would be off by roughly 10 on average. However, a model trained this way would probably not generalize that well, since we are just using a sample. With all of the data, you can expect all of the error metrics to decrease significantly (which is good), but unfortunately, you will see the training time increase dramatically.

One of the neat features of PyCaret is the ability to remove algorithms from your compare_models() training. I would start on a small sample of the dataset, see which algorithms generally take longer, and then remove those when you compare with all of the original data, since some of these could take hours to train depending on the dataset.

In the screenshot below, I am printing the dataframe with the predictions and the actual values. For example, we can see that popularity, the original value, is compared side by side to Label, which is the prediction. You can see that some predictions were better than others. The last prediction was quite poor, while the first two predictions were great.

In [15]: predictions[['popularity', 'Label']].head(3)
Out[15]:
   popularity      Label
0        25.0  19.695469
1        28.0  26.412759
2        73.0  37.664414

Predictions. Screenshot by Author [8].
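To show what the error metrics actually measure, here is a small sketch that computes them by hand on the three predictions shown above. The formulas are the standard textbook definitions; I am assuming they match what PyCaret reports, and three rows are far too few to say anything about the model itself.

```python
import math

# The three (actual, predicted) pairs from the predictions screenshot above.
actual = [25.0, 28.0, 73.0]
predicted = [19.695469, 26.412759, 37.664414]


def mae(y, yhat):
    """Mean absolute error: average size of the miss, in popularity units."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)


def mse(y, yhat):
    """Mean squared error: punishes large misses more than small ones."""
    return sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean squared error: MSE brought back to popularity units."""
    return math.sqrt(mse(y, yhat))


def rmsle(y, yhat):
    """Root mean squared log error (values must be >= 0)."""
    return math.sqrt(
        sum((math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(y, yhat)) / len(y)
    )


def mape(y, yhat):
    """Mean absolute percentage error as a fraction (actual values must be non-zero)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(y, yhat)) / len(y)


def r2(y, yhat):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


for name, fn in [("MAE", mae), ("MSE", mse), ("RMSE", rmse),
                 ("RMSLE", rmsle), ("MAPE", mape), ("R2", r2)]:
    print(f"{name}: {fn(actual, predicted):.4f}")
```

On these three rows the poor third prediction dominates every metric, which is a good reminder that MSE and RMSE react more strongly to a single large miss than MAE does.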
Code

The code below covers reading in your data, sampling your data (only if you want), setting up your regression, comparing models, creating your final model, making predictions, and visualizing feature importance [9]:

    # import libraries
    from pycaret.regression import (
        setup, compare_models, create_model, predict_model, interpret_model,
    )
    import pandas as pd

    # read in your Spotify data
    spotify = pd.read_csv('file location of your data on your computer.csv')

    # using a sample of the dataset (you can use any amount)
    spotify_sample = spotify.sample(1000)

    # set up your regression parameters
    regression = setup(
        data=spotify_sample,
        target='popularity',
        session_id=100,  # fixed seed for reproducibility
    )

    # compare models
    compare_models()

    # create a model
    catboost = create_model('catboost')

    # predict on the test set
    predictions = predict_model(catboost)

    # interpret the model (SHAP-based feature importance)
    interpret_model(catboost)

Summary

Photo by bruce mars on Unsplash [10].

Using Data Science models to predict a variable can be quite overwhelming, but we have seen how, with a few lines of code, we can compare several Machine Learning algorithms efficiently. We have also shown how easy it is to set up different types of data, including numeric and categorical data. For the next steps, I would apply this to the entire dataset, confirm the data types, and make sure to remove inaccurate models, as well as models that take too long to train.

In summary, we now know how to perform the following to determine song popularity:

* import libraries
* read in data
* set up your model
* compare models
* pick and create the best model
* predict using the best model
* interpret feature importance

I want to give thanks and admiration to Moez Ali for developing this awesome Data Science library. I hope you found my article both interesting and useful.
Please feel free to comment down below if you applied this library to a dataset or if you use other techniques. Do you prefer one over the other? What do you think about automatic Data Science?

Please feel free to check out my profile.

References

[1] Photo by Cezar Sampaio on Unsplash, (2020)
[2] Photo by Markus Spiske on Unsplash, (2020)
[3] Moez Ali, PyCaret, (2021)
[4] Yamac Eren Ay on Kaggle, Spotify Dataset, (2021)
[5] M. Przybyla, Dataframe Screenshot, (2021)
[6] M. Przybyla, SHAP Feature Importance Screenshot, (2021)
[7] M. Przybyla, Model Comparison Screenshot, (2021)
[8] M. Przybyla, Predictions Screenshot, (2021)
[9] M. Przybyla, Python Code, (2021)
[10] Photo by bruce mars on Unsplash, (2018)
