Predicting Spotify Song Popularity

Uploaded by

FabioSantos
Copyright
© All Rights Reserved
Towards Data Science

OPINION

Ranking every Machine Learning algorithm to build the best Data Science model using PyCaret.

Matt Przybyla · Feb · 7 min read

Photo by Cezar Sampaio on Unsplash [1].

Table of Contents

1. Introduction
2. Model Comparison
3. Summary
4. References

Introduction

Because Spotify and other music streaming services are incredibly popular and widely used, I wanted to apply Data Science techniques with Machine Learning algorithms to this product to predict song popularity. I personally use this product, and what I apply here could be applied to other services as well. I will be examining every popular Machine Learning algorithm and picking the best one based on success metrics or criteria; oftentimes, this is some sort of calculated error. The goal of the best model developed is to predict a song's popularity based on various current and historical features. Keep on reading if you would like to follow a tutorial on how to use Data Science to predict the popularity of a song.

Model Comparison

Photo by Markus Spiske on Unsplash [2].

I will be discussing the Python library that I used, along with the data, parameters, models compared, results, and code below.

Library

Using the power of PyCaret [3], you can now test every popular Machine Learning algorithm against one another (or most of them, at least).
For this problem, I will be comparing MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which is the time it takes for the model to be completed. Some of the benefits of using PyCaret overall, as stated by the developers, are increased productivity, ease of use, and business readiness, all of which I can personally attest to.

Data

The dataset [4] that I am using is from Kaggle. You can download it easily and quickly. It is about 17 MB and contains data from Spotify from the years 1921 to 2020, including 160,000+ tracks. It consists of 174,389 rows and 19 columns. Below is a screenshot of the first few rows along with the first columns:

[Screenshot: output of spotify.head(), showing the first few rows and the first columns, such as acousticness, artists, danceability, duration_ms, energy, explicit, and instrumentalness.]

Data Sample. Screenshot by Author [5].

Columns

After we eventually pick the best model, we can look at the most important features.
I am using the interpret_model() function of PyCaret, which is based on the popular SHAP library. Here are all of the possible features:

    ['acousticness', 'artists', 'danceability', 'duration_ms', 'energy',
     'explicit', 'id', 'instrumentalness', 'key', 'liveness', 'loudness',
     'mode', 'name', 'popularity', 'release_date', 'speechiness', 'tempo',
     'valence', 'year']

Here are the most important features using SHAP:

[Figure: SHAP summary plot, screenshot by author [6]. The x-axis is the SHAP value (impact on model output) and the color shows the feature value from low to high. Features from top to bottom: year, instrumentalness, loudness, duration_ms, acousticness, liveness, release_date_month_1, speechiness, danceability, release_date_month_12, valence, tempo, energy, release_date_is_month_start_0, key_0, release_date_weekday_1, release_date_month_6, key_4, explicit_1, key_9.]

All columns are used as features, except for the target variable, which is the column popularity. As you can see, the top three features are year, instrumentalness, and loudness. As a future improvement, it would be better to keep each categorical feature in one column instead of breaking it out into tens of columns, and then, as a next step, feed those columns into the CatBoost model so that target encoding can be applied instead of one-hot encoding. To do this, we would confirm or change the key column to be categorical, and do the same for any other similar columns.

Parameters

These are the parameters that I used in the setup() of PyCaret. The Machine Learning problem is a regression one, using data from Spotify, with the target variable being the popularity field.
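A minimal sketch of how these parameters might be passed to setup(). The data, target, and session_id arguments match the code given later in this article; categorical_features=['key'] is my own assumption, added only to illustrate the future improvement of marking the key column as categorical, and make_setup_kwargs and run_setup are hypothetical helper names. PyCaret is imported inside run_setup so that the argument-building part can be inspected without the library installed.

```python
def make_setup_kwargs(data, seed=100):
    """Collect the setup() arguments described in the text into one dict."""
    return {
        "data": data,                      # the (sampled) Spotify dataframe
        "target": "popularity",            # column we want to predict
        "session_id": seed,                # fixed seed for reproducibility
        "categorical_features": ["key"],   # assumption: treat 'key' as categorical
    }


def run_setup(data):
    """Call PyCaret's regression setup() with the arguments above."""
    # Imported lazily so this sketch can be loaded without PyCaret installed.
    from pycaret.regression import setup

    return setup(**make_setup_kwargs(data))
```

Calling run_setup(spotify_sample) would then run setup() with these arguments on the sampled dataframe.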
For reproducibility, you can establish a session_id. There are a ton more parameters, but these are the ones that I used, and PyCaret does a great job of automatically detecting information from your data, like picking which features are categorical, and it will confirm that with you in the setup().

Models Compared

I will be comparing 19 Machine Learning algorithms. Some are incredibly popular, while some I have actually not heard of, so it will be interesting to see which one wins with this dataset. For the success criteria, I am comparing all of the metrics MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which PyCaret automatically ranks.

Here are all of the models that I compared:

* Linear Regression
* Lasso Regression
* Ridge Regression
* Elastic Net
* Orthogonal Matching Pursuit
* Bayesian Ridge
* Gradient Boosting Regressor
* Decision Tree Regressor
* CatBoost Regressor
* Light Gradient Boosting Machine
* Extra Trees Regressor
* AdaBoost Regressor
* K Neighbors Regressor
* Lasso Least Angle Regression
* Huber Regressor
* Passive Aggressive Regressor
* Least Angle Regression

Results

It is important to note that I am just using a sample of the data, so the order of these algorithms may change if you use all of the data when you test this code yourself. I used only 1,000 rows instead of the total ~170,000 rows.

As you can see, CatBoost was ranked first, having the best MSE, RMSE, and R2. However, it did not have the best MAE, RMSLE, or MAPE, and it was not the fastest. Therefore, you should establish what you mean by success in terms of these metrics. For example, if time is essential, then you will want to rank that higher, or if MAE matters more to you, you might want to pick the Extra Trees Regressor as the winner instead.
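To make the point about choosing a success metric concrete, here is a minimal sketch in plain Python. The model names and scores are made-up placeholders, not the results of the comparison in this article, and rank_models is a hypothetical helper. It only shows that the "winner" depends on which metric you rank by. PyCaret's compare_models() also accepts a sort argument (for example, sort='MAE') if you want it to rank by a different metric.

```python
# Illustrative placeholder scores only; these are NOT the article's results.
results = [
    {"model": "Model A", "MAE": 10.4, "RMSE": 13.0, "R2": 0.60, "TT (Sec)": 4.2},
    {"model": "Model B", "MAE": 10.1, "RMSE": 13.4, "R2": 0.58, "TT (Sec)": 1.1},
    {"model": "Model C", "MAE": 12.0, "RMSE": 15.2, "R2": 0.45, "TT (Sec)": 0.1},
]

# For R2, higher is better; for the error metrics and training time, lower is better.
HIGHER_IS_BETTER = {"R2"}


def rank_models(rows, metric):
    """Return model names ordered from best to worst on the given metric."""
    ordered = sorted(
        rows,
        key=lambda row: row[metric],
        reverse=metric in HIGHER_IS_BETTER,
    )
    return [row["model"] for row in ordered]


for metric in ("RMSE", "MAE", "TT (Sec)"):
    print(metric, "->", rank_models(results, metric))
```

With these placeholder numbers, a different model comes out on top for RMSE, for MAE, and for training time, which is the same trade-off described above.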
[Screenshot: output of compare_models(), a table ranking the models by MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec).]

Model Comparison. Screenshot by Author [7].

Overall, you can see that, even with a small sample of the dataset, we fared pretty well. The popularity target variable has a range of 0 to 91. Therefore, taking MAE as an example, our average error is 9.7 popularity units. Out of 91, that is not too bad, considering we would be off by roughly 10 on average. However, a model trained this way would probably not generalize that well, since we are just using a sample. With the full dataset, you can expect all of the error metrics to decrease significantly (which is good), but unfortunately, you will see the training time increase dramatically.

One of the neat features of PyCaret is the ability to remove algorithms from your compare_models() training. I would start on a small sample of the dataset, see which algorithms generally take longer, and then remove those when you compare with all of the original data, since some of these could take hours to train depending on the dataset.

In the screenshot below, I am printing the dataframe with the predictions and the actual values. For example, we can see that popularity, the original value, is compared side by side to Label, which is the prediction. You can see that some predictions were better than others. The last prediction was quite poor, while the first two predictions were great.

    In [15]: predictions[['popularity', 'Label']].head(3)
    Out[15]:
       popularity      Label
    0        25.0  19.695469
    1        28.0  26.412759
    2        73.0  37.664414

Predictions. Screenshot by Author [8].
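To put numbers on "great" and "poor", the sketch below recomputes the error for the three rows shown in the predictions output, using only the Python standard library. The values are copied from that output; the variable names are my own.

```python
import math

# (actual popularity, predicted Label) for the three rows in the predictions output
rows = [(25.0, 19.695469), (28.0, 26.412759), (73.0, 37.664414)]

# absolute error per row: |actual - predicted|
abs_errors = [abs(actual - predicted) for actual, predicted in rows]

# mean absolute error and root mean squared error over these three rows only
mae = sum(abs_errors) / len(abs_errors)
rmse = math.sqrt(sum(err ** 2 for err in abs_errors) / len(abs_errors))

for (actual, predicted), err in zip(rows, abs_errors):
    print(f"actual={actual:5.1f}  predicted={predicted:9.6f}  abs_error={err:9.6f}")
print(f"MAE over 3 rows:  {mae:.2f}")
print(f"RMSE over 3 rows: {rmse:.2f}")
```

The first two rows are off by about 5.3 and 1.6 popularity units, while the third is off by about 35.3, which dominates both averages. Three rows are far too few to say anything about the model overall; the sketch only shows how the side-by-side comparison turns into an error number.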
Code

The code covers importing libraries, reading in your data, sampling your data (only if you want), setting up your regression, comparing models, creating your final model, making predictions, and visualizing feature importance [9]:

```python
# import libraries
from pycaret.regression import *
import pandas as pd

# read in your Spotify data
spotify = pd.read_csv('file location of your data on your computer.csv')

# use a sample of the dataset (you can use any amount)
spotify_sample = spotify.sample(1000)

# set up your regression parameters
regression = setup(
    data=spotify_sample,
    target='popularity',
    session_id=100,
)

# compare models
compare_models()

# create a model
catboost = create_model('catboost')

# predict on the test set
predictions = predict_model(catboost)

# interpret the model (SHAP feature importance)
interpret_model(catboost)
```

Summary

Photo by bruce mars on Unsplash [10].

Using Data Science models to predict a variable can be quite overwhelming, but we have seen how, with a few lines of code, we can compare several Machine Learning algorithms efficiently. We have also shown how easy it is to set up different types of data, including numeric and categorical data. For the next steps, I would apply this to the entire dataset, confirm data types, and make sure to remove inaccurate models, as well as models that take too long to train.

In summary, we now know how to perform the following to determine song popularity:

* import libraries
* read in data
* set up your model
* compare models
* pick and create the best model
* predict using the best model
* interpret feature importance

I want to give thanks and admiration to Moez Ali for developing this awesome Data Science library.
I hope you found my article both interesting and useful. Please feel free to comment down below if you applied this library to a dataset or if you use other techniques. Do you prefer one over the other? What do you think about automatic Data Science?

References

[1] Photo by Cezar Sampaio on Unsplash, (2020)
[2] Photo by Markus Spiske on Unsplash, (2020)
[3] Moez Ali, PyCaret, (2021)
[4] Yamac Eren Ay on Kaggle, Spotify Dataset, (2021)
[5] M. Przybyla, Dataframe Screenshot, (2021)
[6] M. Przybyla, SHAP Feature Importance Screenshot, (2021)
[7] M. Przybyla, Model Comparison Screenshot, (2021)
[8] M. Przybyla, Predictions Screenshot, (2021)
[9] M. Przybyla, Python Code, (2021)
[10] Photo by bruce mars on Unsplash, (2018)
