ML(project)_merged
1 Problem Formulation
This project aims to predict the future popularity of songs based on streaming
and engagement data collected in 2024. The problem is framed as a regression
task, where the objective is to forecast a continuous value that represents a
song’s future popularity. This popularity metric could be defined by future
streaming counts, chart positions, or other relevant indicators that reflect how
widely a song is consumed.
2 Data Points
Features (Input Data): The features represent the input data that the
model uses to make predictions. These are the various metrics and attributes
related to each song that can influence its popularity.
1. Track Information:
• Track: The name of the song.
• Album Name: The name of the album on which the song appears.
• Artist: The artist or group performing the song.
• Release Date: The date when the song was released.
• ISRC: International Standard Recording Code, a unique identifier
for the song.
2. Popularity Metrics:
• All Time Rank: A ranking of the song based on its historical performance.
• Spotify Streams: The number of times the song has been streamed
on Spotify.
• Spotify Playlist Count: The number of Spotify playlists that include the song.
• Spotify Playlist Reach: The potential audience size of the Spotify
playlists that include the song.
• Spotify Popularity: A metric indicating the song’s popularity on
Spotify.
• YouTube Views: The number of views the song has received on
YouTube.
• YouTube Likes: The number of likes the song has received on
YouTube.
• TikTok Posts: The number of posts featuring the song on TikTok.
• TikTok Likes: The number of likes for posts featuring the song on
TikTok.
• TikTok Views: The number of views for posts featuring the song
on TikTok.
• YouTube Playlist Reach: The potential audience size of YouTube
playlists that include the song.
• Apple Music Playlist Count: The number of Apple Music playlists
that include the song.
• AirPlay Spins: The number of radio airplay spins the song has received.
• SiriusXM Spins: The number of times the song has been played
on SiriusXM radio.
• Deezer Playlist Count: The number of Deezer playlists that include the song.
• Deezer Playlist Reach: The potential audience size of Deezer
playlists that include the song.
• Amazon Playlist Count: The number of Amazon Music playlists
that include the song.
• Pandora Streams: The number of times the song has been streamed
on Pandora.
• Pandora Track Stations: The number of Pandora stations that
include the song.
• Soundcloud Streams: The number of times the song has been
streamed on SoundCloud.
• Shazam Counts: The number of times the song has been identified
using Shazam.
• TIDAL Popularity: A metric indicating the song’s popularity on
TIDAL.
• Explicit Track: A binary indicator (0 or 1) indicating whether the
song is explicit.
Label (Target Variable): The label is the variable that the model aims to
predict; here it summarizes the overall popularity of the song.
Track Score: A composite score that reflects the overall popularity of the
song, integrating aspects of the song’s performance and reception across
different platforms and metrics.
3 Data Set
The dataset is from Kaggle and is titled “Most Streamed Spotify Songs 2024.” [1]
It offers a detailed collection of streaming and engagement metrics for popular
songs across various platforms in 2024, including track details, streaming data
from platforms such as Spotify, YouTube, TikTok, and Pandora, as well as radio
spins and playlist counts.
4 Feature Selection
Identify Numeric Fields: I listed all numeric columns that might help predict
Track Score.
Select Features: I chose features based on their relevance:
• Streaming Metrics: Spotify Streams, YouTube Views, etc., reflect
track popularity.
• Engagement Metrics: Spotify Playlist Count, YouTube Likes, etc.,
show listener engagement.
• Playlist Metrics: Spotify Playlist Reach, Deezer Playlist Reach,
etc., indicate playlist inclusion.
• Explicit Track: Included as a categorical feature that could impact the
score.
Exclude Non-Numeric Features: Columns like Track, Album Name, Artist,
and Release Date were left out because they are less relevant for numeric
prediction.
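As a sketch, this selection step can be expressed with pandas. The toy frame below is illustrative only: the column names follow the dataset, but the values are invented.

```python
import pandas as pd

# Toy frame standing in for the real dataset; values are invented.
data = pd.DataFrame({
    "Track": ["Song A", "Song B"],
    "Artist": ["X", "Y"],
    "Spotify Streams": [390_470_936, 323_703_884],
    "Explicit Track": [0, 1],
    "Track Score": [725.4, 545.9],
})

# Drop identifier/text columns and the label; what remains are the features
non_numeric = ["Track", "Album Name", "Artist", "Release Date", "ISRC"]
X = data.drop(columns=non_numeric + ["Track Score"], errors="ignore")
print(list(X.columns))  # ['Spotify Streams', 'Explicit Track']
```

Using `errors="ignore"` lets the same drop list work even when some identifier columns are absent from a given frame.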
5 Model
The model is a linear regression fit on the selected numeric features.
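Concretely, linear regression models Track Score as a weighted sum of the selected features, with the weights chosen to minimize the mean squared error on the training data:

```latex
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d,
\qquad
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2
```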
ML(project)
September 6, 2024
data.head(10)
0 390,470,936 30,716 196,631,588 …
1 323,703,884 28,113 174,597,137 …
2 601,309,283 54,331 211,607,669 …
3 2,031,280,633 269,802 136,569,078 …
4 107,034,922 7,223 151,469,874 …
5 670,665,438 105,892 175,421,034 …
6 900,158,751 73,118 201,585,714 …
7 675,079,153 40,094 211,236,940 …
8 1,653,018,119 1 15 …
9 90,676,573 10,400 184,199,419 …
[10 rows x 29 columns]
def handle_missing_data(data):
    # TIDAL Popularity is almost entirely empty, so drop the column first
    data = data.drop(columns=['TIDAL Popularity'], errors='ignore')
    missing_data = data.isnull().sum()
    print("\nMissing Data before handling (after dropping TIDAL Popularity):")
    print(missing_data[missing_data > 0])
    # Drop the remaining rows that contain missing values
    data = data.dropna()
    missing_data_after = data.isnull().sum()
    if missing_data_after.sum() == 0:
        print("\nNo missing data found after handling.")
    else:
        print("\nMissing Data after handling:")
        print(missing_data_after[missing_data_after > 0])
    return data

data = handle_missing_data(data)
data.head(10)
AirPlay Spins 498
SiriusXM Spins 2123
Deezer Playlist Count 921
Deezer Playlist Reach 928
Amazon Playlist Count 1055
Pandora Streams 1106
Pandora Track Stations 1268
Soundcloud Streams 3333
Shazam Counts 577
dtype: int64
15 925,655,569 103,605 79,944,921 …
16 395,433,400 12,784 177,932,568 …
18 91,272,461 6,499 52,287,548 …
21 547,882,871 24,425 262,343,414 …
AirPlay Spins SiriusXM Spins Deezer Playlist Count Deezer Playlist Reach \
0 40,975 684 62.0 17,598,718
1 40,778 3 67.0 10,422,430
2 74,333 536 136.0 36,321,847
5 522,042 4,654 86.0 17,167,254
9 3,823 117 78.0 10,800,098
12 41,344 45 138.0 38,243,636
15 92,231 228 60.0 5,633,435
16 129,968 3 99.0 37,988,531
18 181 1 24.0 5,054,005
21 37,208 236 167.0 41,414,565
import pandas as pd

def clean_numeric_fields(data):
    numeric_columns = [
        'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
        'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes',
        'TikTok Views', 'YouTube Playlist Reach', 'Apple Music Playlist Count',
        'AirPlay Spins', 'SiriusXM Spins', 'Deezer Playlist Count',
        'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams',
        'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts'
    ]
    # Remove thousands separators and convert the strings to numbers
    for col in numeric_columns:
        data[col] = pd.to_numeric(
            data[col].astype(str).str.replace(',', ''), errors='coerce')
    return data

data = clean_numeric_fields(data)
data.head(10)
9 90676573 10400 184199419 …
12 221636195 13800 197280692 …
15 925655569 103605 79944921 …
16 395433400 12784 177932568 …
18 91272461 6499 52287548 …
21 547882871 24425 262343414 …
NORMALIZING DATA
[117]: from sklearn.preprocessing import MinMaxScaler

def normalize_data(data):
    numeric_columns = [
        'Track Score', 'Spotify Streams', 'Spotify Playlist Count',
        'Spotify Playlist Reach', 'Spotify Popularity', 'YouTube Views',
        'YouTube Likes', 'TikTok Posts', 'TikTok Likes', 'TikTok Views',
        'YouTube Playlist Reach', 'Apple Music Playlist Count', 'AirPlay Spins',
        'SiriusXM Spins', 'Deezer Playlist Count', 'Deezer Playlist Reach',
        'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
        'Soundcloud Streams', 'Shazam Counts', 'Explicit Track'
    ]
    # Scale every numeric column into the [0, 1] range
    scaler = MinMaxScaler()
    data[numeric_columns] = scaler.fit_transform(data[numeric_columns])
    return data

data = normalize_data(data)
data.head(10)
2 0.138373 0.089457 0.805405 …
5 0.154611 0.177038 0.666612 …
9 0.018820 0.014837 0.700281 …
12 0.049481 0.020612 0.750454 …
15 0.214312 0.173153 0.300416 …
16 0.090172 0.018887 0.676245 …
18 0.018960 0.008211 0.194337 …
21 0.125865 0.038660 1.000000 …
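Under the hood, MinMaxScaler maps each column value to (x − min) / (max − min). A hand-rolled sketch on a few of the AirPlay Spins values from the table above shows the effect:

```python
# Min-max scaling by hand: each value is mapped to (x - min) / (max - min),
# which is what MinMaxScaler does column by column.
col = [40_975, 40_778, 74_333, 3_823]
lo, hi = min(col), max(col)
scaled = [(x - lo) / (hi - lo) for x in col]
print([round(s, 3) for s in scaled])  # → [0.527, 0.524, 1.0, 0.0]
```

The column minimum lands at 0 and the maximum at 1, matching the 0.0–1.0 values visible in the normalized output.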
[118]: def display_data_points(data):
    features = [
        '------Track Information---------',
        'Track',
        'Album Name',
        'Artist',
        'Release Date',
        'ISRC',
        '------Popularity Metrics---------',
        'All Time Rank',
        'Spotify Streams',
        'Spotify Playlist Count',
        'Spotify Playlist Reach',
        'Spotify Popularity',
        'YouTube Views',
        'YouTube Likes',
        'TikTok Posts',
        'TikTok Likes',
        'TikTok Views',
        'YouTube Playlist Reach',
        'Apple Music Playlist Count',
        'AirPlay Spins',
        'SiriusXM Spins',
        'Deezer Playlist Count',
        'Deezer Playlist Reach',
        'Amazon Playlist Count',
        'Pandora Streams',
        'Pandora Track Stations',
        'Soundcloud Streams',
        'Shazam Counts',
        'Explicit Track'
    ]
    # Print each feature name as a bullet point
    for feature in features:
        print(f"- {feature}")

display_data_points(data)
- Release Date
- ISRC
- ------Popularity Metrics---------
- All Time Rank
- Spotify Streams
- Spotify Playlist Count
- Spotify Playlist Reach
- Spotify Popularity
- YouTube Views
- YouTube Likes
- TikTok Posts
- TikTok Likes
- TikTok Views
- YouTube Playlist Reach
- Apple Music Playlist Count
- AirPlay Spins
- SiriusXM Spins
- Deezer Playlist Count
- Deezer Playlist Reach
- Amazon Playlist Count
- Pandora Streams
- Pandora Track Stations
- Soundcloud Streams
- Shazam Counts
- Explicit Track
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

features = [
    'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
    'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
    'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
    'Deezer Playlist Count', 'Deezer Playlist Reach', 'Amazon Playlist Count',
    'Pandora Streams', 'Pandora Track Stations', 'Soundcloud Streams',
    'Shazam Counts', 'Explicit Track'
]
# Dividing variables
X = data[features]
y = data['Track Score']
# 80/20 train/test split (random_state assumed; not shown in the original)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\nModel Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.6f}")
print(f"R-squared Score: {r2:.4f}")
Model Evaluation:
Mean Squared Error (MSE): 0.003194
R-squared Score: 0.6456
Choice of Loss Function:
I chose Mean Squared Error (MSE) because it captures the average squared difference between
predictions and actual values, which is the natural fit for a regression task. It also penalizes
larger errors more heavily, focusing the model on reducing them.
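For n test points, MSE = (1/n) Σ (yᵢ − ŷᵢ)². A tiny hand computation (toy values, not taken from the model) matches what sklearn's mean_squared_error would return:

```python
# MSE by hand on toy normalized scores (values invented for illustration)
y_true = [0.30, 0.10, 0.55]
y_pred = [0.25, 0.12, 0.50]
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(mse, 4))  # → 0.0018
```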
Training Set (80%):
Size: 80% of the data. Why: This larger portion gives the model plenty of examples to learn from,
helping it understand patterns and relationships better for accurate predictions.
Test Set (20%):
Size: 20% of the data. Why: This set is for evaluating how well the model performs on new, unseen
data. It’s a common practice to ensure the model can generalize well to real-world situations.
Design Choice:
Split Ratio: An 80% training / 20% testing split is, in my opinion, a good balance. It provides
enough data to train the model effectively while keeping a significant portion for testing its
performance.
Overfitting Prevention: Keeping a separate test set helps avoid overfitting, ensuring the model isn’t
just memorizing the training data but can perform well on new data.
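The split itself can be sketched without sklearn (the notebook uses train_test_split); shuffling row indices with a fixed seed keeps the split reproducible:

```python
import random

# Shuffle row indices with a fixed seed, then cut at the 80% mark
indices = list(range(100))
random.seed(0)
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))  # → 80 20
```

Because the two index lists are disjoint, no test row ever influences training, which is what makes the evaluation an honest estimate of generalization.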