
ML(project)_merged

The document outlines a project aimed at predicting the future popularity of songs using streaming and engagement metrics collected in 2024, framed as a regression task. It details the input data features, including track information and various popularity metrics from platforms like Spotify, YouTube, and TikTok, as well as the data preprocessing steps required. The dataset used for this analysis is sourced from Kaggle and includes comprehensive streaming metrics, with a focus on selecting relevant features for a linear regression model.

Predicting the future popularity of songs based on various streaming and engagement metrics collected in 2024.
Anonymous .. lol
September 2024

1 Problem Formulation
This project aims to predict the future popularity of songs based on streaming
and engagement data collected in 2024. The problem is framed as a regression
task, where the objective is to forecast a continuous value that represents a
song’s future popularity. This popularity metric could be defined by future
streaming counts, chart positions, or other relevant indicators that reflect how
widely a song is consumed.

2 Data Points
Features (Input Data): The features represent the input data that the
model uses to make predictions. These are the various metrics and attributes
related to each song that can influence its popularity.

1. Track Information:
• Track: The name of the song.
• Album Name: The name of the album on which the song appears.
• Artist: The artist or group performing the song.
• Release Date: The date when the song was released.
• ISRC: International Standard Recording Code, a unique identifier
for the song.
2. Popularity Metrics:
• All Time Rank: A ranking of the song based on its historical performance.
• Spotify Streams: The number of times the song has been streamed
on Spotify.
• Spotify Playlist Count: The number of Spotify playlists that include the song.
• Spotify Playlist Reach: The potential audience size of the Spotify
playlists that include the song.
• Spotify Popularity: A metric indicating the song’s popularity on
Spotify.
• YouTube Views: The number of views the song has received on
YouTube.
• YouTube Likes: The number of likes the song has received on
YouTube.
• TikTok Posts: The number of posts featuring the song on TikTok.
• TikTok Likes: The number of likes for posts featuring the song on
TikTok.
• TikTok Views: The number of views for posts featuring the song
on TikTok.
• YouTube Playlist Reach: The potential audience size of YouTube
playlists that include the song.
• Apple Music Playlist Count: The number of Apple Music playlists
that include the song.
• AirPlay Spins: The number of times the song has been played on the
radio (airplay spins).
• SiriusXM Spins: The number of times the song has been played
on SiriusXM radio.
• Deezer Playlist Count: The number of Deezer playlists that include the song.
• Deezer Playlist Reach: The potential audience size of Deezer
playlists that include the song.
• Amazon Playlist Count: The number of Amazon Music playlists
that include the song.
• Pandora Streams: The number of times the song has been streamed
on Pandora.
• Pandora Track Stations: The number of Pandora stations that
include the song.
• Soundcloud Streams: The number of times the song has been
streamed on SoundCloud.
• Shazam Counts: The number of times the song has been identified
using Shazam.
• TIDAL Popularity: A metric indicating the song’s popularity on
TIDAL.

• Explicit Track: A binary indicator (0 or 1) indicating whether the
song is explicit.
Label (Target Variable): The label is the variable that the model aims to
predict. In this case, it summarizes the overall popularity of the song.
Track Score: A composite score that reflects the overall popularity of the
song. This score integrates various aspects of the song’s performance and reception
across different platforms and metrics.

3 Data Set
The dataset is from Kaggle and is titled "Most Streamed Spotify Songs 2024" [1].
It contains streaming and engagement metrics for popular songs across several
platforms in 2024: track details, streaming data from platforms such as Spotify,
YouTube, TikTok, and Pandora, as well as radio spins and playlist counts.

3.1 Needed Data Preprocessing
1. Handling missing data
2. Cleaning numeric fields
3. Normalization
4. Counting data points

4 Feature Selection
Identify Numeric Fields: I listed all numeric columns that might help predict
Track Score.
Select Features: I chose features based on their relevance:
• Streaming Metrics: Spotify Streams, YouTube Views, etc., reflect
track popularity.
• Engagement Metrics: Spotify Playlist Count, YouTube Likes, etc.,
show listener engagement.
• Playlist Metrics: Spotify Playlist Reach, Deezer Playlist Reach,
etc., indicate playlist inclusion.
• Explicit Track: Included as a categorical feature that could impact the
score.
Exclude Non-Numeric Features: Columns like Track, Album Name, Artist,
and Release Date were left out because they’re less relevant for numeric
prediction.
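The selection steps above can be sketched on a small toy table. The values below are made up for illustration and only the column names come from the dataset. Ranking candidates by their correlation with Track Score is my assumption about one way to judge "relevance"; the report itself does not say how relevance was measured.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],
    "Artist": ["X", "Y", "Z"],
    "Spotify Streams": [1_000_000, 250_000, 4_000_000],
    "YouTube Views": [500_000, 90_000, 2_500_000],
    "Explicit Track": [0, 1, 0],
    "Track Score": [120.5, 48.2, 310.0],
})

target = "Track Score"

# Keep every numeric column except the target as a candidate feature;
# text columns such as Track and Artist are excluded automatically.
numeric_columns = toy.select_dtypes(include="number").columns
candidate_features = [c for c in numeric_columns if c != target]

# Rank the candidates by absolute Pearson correlation with the target.
correlations = toy[candidate_features].corrwith(toy[target]).abs()
print(correlations.sort_values(ascending=False))
```

On the real data this would have to run after the numeric fields are cleaned, since the count columns are stored as text with commas.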

5 Model
Linear regression

ML(project)

September 6, 2024

READING DATA INTO PANDAS


[113]: import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

[114]: data = pd.read_csv('Most Streamed Spotify Songs 2024.csv', encoding='ISO-8859-1')

data.head(10)

[114]: Track Album Name \


0 MILLION DOLLAR BABY Million Dollar Baby - Single
1 Not Like Us Not Like Us
2 i like the way you kiss me I like the way you kiss me
3 Flowers Flowers - Single
4 Houdini Houdini
5 Lovin On Me Lovin On Me
6 Beautiful Things Beautiful Things
7 Gata Only Gata Only
8 Danza Kuduro - Cover ýýýýýýýýýýýýýýýýýýýýý - ýýýýýýýýýýýýýýýýýý -
9 BAND4BAND (feat. Lil Baby) BAND4BAND (feat. Lil Baby)

Artist Release Date ISRC All Time Rank Track Score \


0 Tommy Richman 4/26/2024 QM24S2402528 1 725.4
1 Kendrick Lamar 5/4/2024 USUG12400910 2 545.9
2 Artemas 3/19/2024 QZJ842400387 3 538.4
3 Miley Cyrus 1/12/2023 USSM12209777 4 444.9
4 Eminem 5/31/2024 USUG12403398 5 423.3
5 Jack Harlow 11/10/2023 USAT22311371 6 410.1
6 Benson Boone 1/18/2024 USWB12307016 7 407.2
7 FloyyMenor 2/2/2024 QZL382406049 8 375.8
8 MUSIC LAB JPN 6/9/2024 TCJPA2463708 9 355.7
9 Central Cee 5/23/2024 USSM12404354 10 330.6

Spotify Streams Spotify Playlist Count Spotify Playlist Reach … \

0 390,470,936 30,716 196,631,588 …
1 323,703,884 28,113 174,597,137 …
2 601,309,283 54,331 211,607,669 …
3 2,031,280,633 269,802 136,569,078 …
4 107,034,922 7,223 151,469,874 …
5 670,665,438 105,892 175,421,034 …
6 900,158,751 73,118 201,585,714 …
7 675,079,153 40,094 211,236,940 …
8 1,653,018,119 1 15 …
9 90,676,573 10,400 184,199,419 …

SiriusXM Spins Deezer Playlist Count Deezer Playlist Reach \


0 684 62.0 17,598,718
1 3 67.0 10,422,430
2 536 136.0 36,321,847
3 2,182 264.0 24,684,248
4 1 82.0 17,660,624
5 4,654 86.0 17,167,254
6 429 168.0 48,197,850
7 30 87.0 33,245,595
8 NaN NaN NaN
9 117 78.0 10,800,098

Amazon Playlist Count Pandora Streams Pandora Track Stations \


0 114.0 18,004,655 22,931
1 111.0 7,780,028 28,444
2 172.0 5,022,621 5,639
3 210.0 190,260,277 203,384
4 105.0 4,493,884 7,006
5 152.0 138,529,362 50,982
6 154.0 65,447,476 57,372
7 53.0 3,372,428 5,762
8 NaN NaN NaN
9 92.0 1,005,626 842

Soundcloud Streams Shazam Counts TIDAL Popularity Explicit Track


0 4,818,457 2,669,262 NaN 0
1 6,623,075 1,118,279 NaN 1
2 7,208,651 5,285,340 NaN 0
3 NaN 11,822,942 NaN 0
4 207,179 457,017 NaN 1
5 9,438,601 4,517,131 NaN 1
6 NaN 9,990,302 NaN 0
7 NaN 6,063,523 NaN 1
8 NaN NaN NaN 1
9 3,679,709 666,302 NaN 1

[10 rows x 29 columns]

HANDLING MISSING DATA


[115]: def handle_missing_data(data):
    # 'TIDAL Popularity' is reported as entirely empty, so drop it up front.
    if 'TIDAL Popularity' in data.columns:
        data = data.drop(columns=['TIDAL Popularity'])
        print("Dropped 'TIDAL Popularity' column (entirely empty).")

    # Report the per-column count of missing values.
    missing_data = data.isnull().sum()
    print("\nMissing Data before handling (after dropping TIDAL Popularity):")
    print(missing_data[missing_data > 0])

    # Drop any remaining all-empty columns, then every row with a missing value.
    data = data.dropna(axis=1, how='all')
    data = data.dropna()

    print("\nData after dropping columns and rows with missing values:")
    print(f"Remaining data points: {len(data)}")

    # Confirm that nothing is missing any more.
    missing_data_after = data.isnull().sum()
    if missing_data_after.sum() == 0:
        print("\nNo missing data found after handling.")
    else:
        print("\nMissing Data after handling:")
        print(missing_data_after[missing_data_after > 0])

    return data

data = pd.read_csv('Most Streamed Spotify Songs 2024.csv', encoding='ISO-8859-1')

data = handle_missing_data(data)
data.head(10)

Dropped 'TIDAL Popularity' column (entirely empty).

Missing Data before handling (after dropping TIDAL Popularity):


Artist 5
Spotify Streams 113
Spotify Playlist Count 70
Spotify Playlist Reach 72
Spotify Popularity 804
YouTube Views 308
YouTube Likes 315
TikTok Posts 1173
TikTok Likes 980
TikTok Views 981
YouTube Playlist Reach 1009
Apple Music Playlist Count 561

AirPlay Spins 498
SiriusXM Spins 2123
Deezer Playlist Count 921
Deezer Playlist Reach 928
Amazon Playlist Count 1055
Pandora Streams 1106
Pandora Track Stations 1268
Soundcloud Streams 3333
Shazam Counts 577
dtype: int64

Data after dropping columns and rows with missing values:


Remaining data points: 565

No missing data found after handling.

[115]: Track Album Name \


0 MILLION DOLLAR BABY Million Dollar Baby - Single
1 Not Like Us Not Like Us
2 i like the way you kiss me I like the way you kiss me
5 Lovin On Me Lovin On Me
9 BAND4BAND (feat. Lil Baby) BAND4BAND (feat. Lil Baby)
12 LUNCH HIT ME HARD AND SOFT
15 LALA LALA - Single
16 Fortnight (feat. Post Malone) THE TORTURED POETS DEPARTMENT
18 BLUE HIT ME HARD AND SOFT
21 Espresso Espresso

Artist Release Date ISRC All Time Rank Track Score \


0 Tommy Richman 4/26/2024 QM24S2402528 1 725.4
1 Kendrick Lamar 5/4/2024 USUG12400910 2 545.9
2 Artemas 3/19/2024 QZJ842400387 3 538.4
5 Jack Harlow 11/10/2023 USAT22311371 6 410.1
9 Central Cee 5/23/2024 USSM12404354 10 330.6
12 Billie Eilish 5/17/2024 USUM72401991 13 316.3
15 Myke Towers 3/22/2023 USWL12300002 16 299.9
16 Taylor Swift 4/18/2024 USUG12401028 17 297.6
18 Billie Eilish 5/17/2024 USUM72401996 19 292.6
21 Sabrina Carpenter 4/12/2024 USUM72403305 22 281.5

Spotify Streams Spotify Playlist Count Spotify Playlist Reach … \


0 390,470,936 30,716 196,631,588 …
1 323,703,884 28,113 174,597,137 …
2 601,309,283 54,331 211,607,669 …
5 670,665,438 105,892 175,421,034 …
9 90,676,573 10,400 184,199,419 …
12 221,636,195 13,800 197,280,692 …

15 925,655,569 103,605 79,944,921 …
16 395,433,400 12,784 177,932,568 …
18 91,272,461 6,499 52,287,548 …
21 547,882,871 24,425 262,343,414 …

AirPlay Spins SiriusXM Spins Deezer Playlist Count Deezer Playlist Reach \
0 40,975 684 62.0 17,598,718
1 40,778 3 67.0 10,422,430
2 74,333 536 136.0 36,321,847
5 522,042 4,654 86.0 17,167,254
9 3,823 117 78.0 10,800,098
12 41,344 45 138.0 38,243,636
15 92,231 228 60.0 5,633,435
16 129,968 3 99.0 37,988,531
18 181 1 24.0 5,054,005
21 37,208 236 167.0 41,414,565

Amazon Playlist Count Pandora Streams Pandora Track Stations \


0 114.0 18,004,655 22,931
1 111.0 7,780,028 28,444
2 172.0 5,022,621 5,639
5 152.0 138,529,362 50,982
9 92.0 1,005,626 842
12 163.0 1,354,692 1,219
15 83.0 12,171,026 13,242
16 134.0 9,961,769 13,437
18 33.0 283,089 162
21 149.0 10,362,898 10,848

Soundcloud Streams Shazam Counts Explicit Track


0 4,818,457 2,669,262 0
1 6,623,075 1,118,279 1
2 7,208,651 5,285,340 0
5 9,438,601 4,517,131 1
9 3,679,709 666,302 1
12 1,313,357 450,344 0
15 871,978 2,765,808 1
16 377,734 1,210,029 0
18 975,891 257,661 0
21 1,551,157 1,373,085 1

[10 rows x 28 columns]
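The output above shows that some columns are missing far more often than others (for example, Soundcloud Streams is missing in 3333 rows), and that only 565 rows survive the row-wise drop. The sketch below, on a made-up toy table, shows how one could measure the share of missing values per column and compare the number of surviving rows with and without the sparsest columns. The 0.5 threshold and the idea of dropping sparse columns first are assumptions for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names follow the dataset.
toy = pd.DataFrame({
    "Spotify Streams": [10.0, 20.0, np.nan, 40.0, 50.0],
    "Soundcloud Streams": [np.nan, np.nan, np.nan, 4.0, np.nan],
    "Shazam Counts": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Share of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = toy.isnull().mean()

# Row-wise drop across all columns, as handle_missing_data does.
rows_all = len(toy.dropna())

# Hypothetical alternative: drop columns that are mostly empty first.
MAX_MISSING_SHARE = 0.5  # assumed threshold, not taken from the notebook
dense = toy.loc[:, missing_share <= MAX_MISSING_SHARE]
rows_dense = len(dense.dropna())

print(missing_share)
print(f"Rows kept with all columns: {rows_all}")
print(f"Rows kept after dropping sparse columns: {rows_dense}")
```

Whether losing a feature is worth the extra rows depends on how useful that feature is for predicting Track Score, so this is a trade-off to test rather than a fixed rule.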

CLEANING NUMERIC FIELDS


[116]: def clean_numeric_fields(data):
    # Columns that store counts as text with thousands separators.
    numeric_columns_with_commas = [
        'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
        'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes',
        'TikTok Views', 'YouTube Playlist Reach', 'Apple Music Playlist Count',
        'AirPlay Spins', 'SiriusXM Spins', 'Deezer Playlist Count',
        'Deezer Playlist Reach', 'Amazon Playlist Count', 'Pandora Streams',
        'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts'
    ]

    data = data.copy()  # avoid modifying the caller's DataFrame in place
    for col in numeric_columns_with_commas:
        if col in data.columns:
            # Strip the commas, then parse; unparseable values become NaN.
            data[col] = data[col].astype(str).str.replace(',', '', regex=False)
            data[col] = pd.to_numeric(data[col], errors='coerce')

    return data

data = clean_numeric_fields(data)
data.head(10)

[116]: Track Album Name \


0 MILLION DOLLAR BABY Million Dollar Baby - Single
1 Not Like Us Not Like Us
2 i like the way you kiss me I like the way you kiss me
5 Lovin On Me Lovin On Me
9 BAND4BAND (feat. Lil Baby) BAND4BAND (feat. Lil Baby)
12 LUNCH HIT ME HARD AND SOFT
15 LALA LALA - Single
16 Fortnight (feat. Post Malone) THE TORTURED POETS DEPARTMENT
18 BLUE HIT ME HARD AND SOFT
21 Espresso Espresso

Artist Release Date ISRC All Time Rank Track Score \


0 Tommy Richman 4/26/2024 QM24S2402528 1 725.4
1 Kendrick Lamar 5/4/2024 USUG12400910 2 545.9
2 Artemas 3/19/2024 QZJ842400387 3 538.4
5 Jack Harlow 11/10/2023 USAT22311371 6 410.1
9 Central Cee 5/23/2024 USSM12404354 10 330.6
12 Billie Eilish 5/17/2024 USUM72401991 13 316.3
15 Myke Towers 3/22/2023 USWL12300002 16 299.9
16 Taylor Swift 4/18/2024 USUG12401028 17 297.6
18 Billie Eilish 5/17/2024 USUM72401996 19 292.6
21 Sabrina Carpenter 4/12/2024 USUM72403305 22 281.5

Spotify Streams Spotify Playlist Count Spotify Playlist Reach … \


0 390470936 30716 196631588 …
1 323703884 28113 174597137 …
2 601309283 54331 211607669 …
5 670665438 105892 175421034 …

9 90676573 10400 184199419 …
12 221636195 13800 197280692 …
15 925655569 103605 79944921 …
16 395433400 12784 177932568 …
18 91272461 6499 52287548 …
21 547882871 24425 262343414 …

AirPlay Spins SiriusXM Spins Deezer Playlist Count \


0 40975 684 62.0
1 40778 3 67.0
2 74333 536 136.0
5 522042 4654 86.0
9 3823 117 78.0
12 41344 45 138.0
15 92231 228 60.0
16 129968 3 99.0
18 181 1 24.0
21 37208 236 167.0

Deezer Playlist Reach Amazon Playlist Count Pandora Streams \


0 17598718 114.0 18004655
1 10422430 111.0 7780028
2 36321847 172.0 5022621
5 17167254 152.0 138529362
9 10800098 92.0 1005626
12 38243636 163.0 1354692
15 5633435 83.0 12171026
16 37988531 134.0 9961769
18 5054005 33.0 283089
21 41414565 149.0 10362898

Pandora Track Stations Soundcloud Streams Shazam Counts Explicit Track


0 22931 4818457 2669262 0
1 28444 6623075 1118279 1
2 5639 7208651 5285340 0
5 50982 9438601 4517131 1
9 842 3679709 666302 1
12 1219 1313357 450344 0
15 13242 871978 2765808 1
16 13437 377734 1210029 0
18 162 975891 257661 0
21 10848 1551157 1373085 1

[10 rows x 28 columns]
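As a small illustration of the cleaning step, the sketch below applies the same two operations (strip commas, then `pd.to_numeric` with `errors='coerce'`) to a few sample strings. The first three values are taken from the outputs above; the `"n/a"` entry is a hypothetical bad value added to show what coercion does.

```python
import pandas as pd

# Three values as they appear in the raw data, plus one hypothetical bad entry.
raw = pd.Series(["390,470,936", "30,716", "62.0", "n/a"])

# Remove thousands separators, then parse; unparseable entries become NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned.tolist())
print(f"Values that could not be parsed: {cleaned.isna().sum()}")
```

Because coercion silently turns bad values into NaN, it can be worth counting missing values again after this step.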

NORMALIZING DATA

[117]: def normalize_data(data):
    # Numeric columns to rescale to the [0, 1] range.
    # Note: the target 'Track Score' is scaled too, so later errors are on a 0-1 scale.
    numeric_columns = [
        'Track Score', 'Spotify Streams', 'Spotify Playlist Count',
        'Spotify Playlist Reach', 'Spotify Popularity', 'YouTube Views',
        'YouTube Likes', 'TikTok Posts', 'TikTok Likes', 'TikTok Views',
        'YouTube Playlist Reach', 'Apple Music Playlist Count', 'AirPlay Spins',
        'SiriusXM Spins', 'Deezer Playlist Count', 'Deezer Playlist Reach',
        'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
        'Soundcloud Streams', 'Shazam Counts', 'Explicit Track'
    ]

    # Min-max scaling: each column's minimum maps to 0 and its maximum to 1.
    scaler = MinMaxScaler()
    data[numeric_columns] = scaler.fit_transform(data[numeric_columns])

    return data

data = normalize_data(data)
data.head(10)

[117]: Track Album Name \


0 MILLION DOLLAR BABY Million Dollar Baby - Single
1 Not Like Us Not Like Us
2 i like the way you kiss me I like the way you kiss me
5 Lovin On Me Lovin On Me
9 BAND4BAND (feat. Lil Baby) BAND4BAND (feat. Lil Baby)
12 LUNCH HIT ME HARD AND SOFT
15 LALA LALA - Single
16 Fortnight (feat. Post Malone) THE TORTURED POETS DEPARTMENT
18 BLUE HIT ME HARD AND SOFT
21 Espresso Espresso

Artist Release Date ISRC All Time Rank Track Score \


0 Tommy Richman 4/26/2024 QM24S2402528 1 1.000000
1 Kendrick Lamar 5/4/2024 USUG12400910 2 0.745715
2 Artemas 3/19/2024 QZJ842400387 3 0.735090
5 Jack Harlow 11/10/2023 USAT22311371 6 0.553336
9 Central Cee 5/23/2024 USSM12404354 10 0.440714
12 Billie Eilish 5/17/2024 USUM72401991 13 0.420456
15 Myke Towers 3/22/2023 USWL12300002 16 0.397223
16 Taylor Swift 4/18/2024 USUG12401028 17 0.393965
18 Billie Eilish 5/17/2024 USUM72401996 19 0.386882
21 Sabrina Carpenter 4/12/2024 USUM72403305 22 0.371157

Spotify Streams Spotify Playlist Count Spotify Playlist Reach … \


0 0.089010 0.049345 0.747964 …
1 0.073378 0.044924 0.663452 …

2 0.138373 0.089457 0.805405 …
5 0.154611 0.177038 0.666612 …
9 0.018820 0.014837 0.700281 …
12 0.049481 0.020612 0.750454 …
15 0.214312 0.173153 0.300416 …
16 0.090172 0.018887 0.676245 …
18 0.018960 0.008211 0.194337 …
21 0.125865 0.038660 1.000000 …

AirPlay Spins SiriusXM Spins Deezer Playlist Count \


0 0.024170 0.096701 0.104631
1 0.024054 0.000283 0.113208
2 0.043848 0.075747 0.231561
5 0.307945 0.658785 0.145798
9 0.002255 0.016424 0.132075
12 0.024388 0.006230 0.234991
15 0.054405 0.032139 0.101201
16 0.076666 0.000283 0.168096
18 0.000106 0.000000 0.039451
21 0.021948 0.033272 0.284734

Deezer Playlist Reach Amazon Playlist Count Pandora Streams \


0 0.412387 0.604278 0.016404
1 0.244226 0.588235 0.007087
2 0.851123 0.914439 0.004575
5 0.402276 0.807487 0.126229
9 0.253076 0.486631 0.000914
12 0.896156 0.866310 0.001232
15 0.132006 0.438503 0.011088
16 0.890178 0.711230 0.009075
18 0.118428 0.171123 0.000256
21 0.970460 0.791444 0.009441

Pandora Track Stations Soundcloud Streams Shazam Counts Explicit Track


0 0.006437 0.018714 0.058976 0.0
1 0.007987 0.025724 0.024369 1.0
2 0.001575 0.027998 0.117350 0.0
5 0.014324 0.036659 0.100208 1.0
9 0.000227 0.014291 0.014284 1.0
12 0.000333 0.005100 0.009465 0.0
15 0.003713 0.003386 0.061131 1.0
16 0.003768 0.001466 0.026416 0.0
18 0.000035 0.003789 0.005166 0.0
21 0.003040 0.006024 0.030055 1.0

[10 rows x 28 columns]
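Min-max scaling maps each value x in a column to (x - min) / (max - min), which is why the largest Track Score above becomes 1.000000. The notebook fits the scaler on all rows before the train/test split. A common alternative, sketched below on synthetic data, is to fit the scaler on the training rows only and reuse that fit for the test rows, so the test set does not influence the scaling. This is an assumption about a possible refinement, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for three count-like feature columns.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1_000_000, size=(50, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print("Train range:", X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print("Test range: ", X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

With this approach, scaled test values can fall slightly outside [0, 1] when a test row is more extreme than anything in the training set, which is expected.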

[118]: def display_data_points(data):
    features = [
        '------Track Information---------',
        'Album Name',
        'Artist',
        'Release Date',
        'ISRC',
        '------Popularity Metrics---------',
        'All Time Rank',
        'Spotify Streams',
        'Spotify Playlist Count',
        'Spotify Playlist Reach',
        'Spotify Popularity',
        'YouTube Views',
        'YouTube Likes',
        'TikTok Posts',
        'TikTok Likes',
        'TikTok Views',
        'YouTube Playlist Reach',
        'Apple Music Playlist Count',
        'AirPlay Spins',
        'SiriusXM Spins',
        'Deezer Playlist Count',
        'Deezer Playlist Reach',
        'Amazon Playlist Count',
        'Pandora Streams',
        'Pandora Track Stations',
        'Soundcloud Streams',
        'Shazam Counts',
        'Explicit Track'
    ]

    labels = ['Track Score']

    print("Features (Input Data):")
    for feature in features:
        print(f"- {feature}")

    print("\nLabels (Target Variable):")
    for label in labels:
        print(f"- {label}")

display_data_points(data)

Features (Input Data):


- ------Track Information---------
- Album Name
- Artist

- Release Date
- ISRC
- ------Popularity Metrics---------
- All Time Rank
- Spotify Streams
- Spotify Playlist Count
- Spotify Playlist Reach
- Spotify Popularity
- YouTube Views
- YouTube Likes
- TikTok Posts
- TikTok Likes
- TikTok Views
- YouTube Playlist Reach
- Apple Music Playlist Count
- AirPlay Spins
- SiriusXM Spins
- Deezer Playlist Count
- Deezer Playlist Reach
- Amazon Playlist Count
- Pandora Streams
- Pandora Track Stations
- Soundcloud Streams
- Shazam Counts
- Explicit Track

Labels (Target Variable):


- Track Score
COUNTING DATA POINTS
[119]: data_points = data.shape
data_points

[119]: (565, 28)

MODEL OF CHOICE: Linear Regression
• Simplicity and Interpretability: Linear regression is straightforward, and its coefficients show how each feature affects the Track Score.
• Feature Relationships: The relationship between features such as streaming metrics and Track Score is assumed to be approximately linear, in which case a linear model can capture it effectively.
• Baseline Performance: Linear regression serves as a good baseline model. If it performs well, more complex models may not be necessary.
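To illustrate the interpretability point, here is a minimal sketch on synthetic data (not the Kaggle dataset). The two feature names are borrowed from the notebook, but the values and the "true" weights 0.7 and 0.2 are invented so that the fitted coefficients can be compared against something known.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["Spotify Streams", "YouTube Views"]  # names only; values are synthetic
X = rng.uniform(0, 1, size=(200, 2))                  # already on a 0-1 scale

# Invented ground truth: the target depends on the features with weights 0.7 and 0.2.
y = 0.7 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.01, size=200)

model = LinearRegression().fit(X, y)

# Each coefficient is the change in the target per unit change in that feature.
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:+.3f}")
print(f"Intercept: {model.intercept_:+.3f}")
```

Since all features are scaled to the same range, the coefficient sizes can be compared with each other, although strongly correlated features can make individual coefficients unstable.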
[120]: features = [
    'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
    'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
    'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
    'Deezer Playlist Count', 'Deezer Playlist Reach', 'Amazon Playlist Count',
    'Pandora Streams', 'Pandora Track Stations', 'Soundcloud Streams',
    'Shazam Counts', 'Explicit Track'
]

# Separate the input features from the target variable
X = data[features]
y = data['Track Score']

# Split the data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the linear regression model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.6f}")
print(f"R-squared Score: {r2:.4f}")

Model Evaluation:
Mean Squared Error (MSE): 0.003194
R-squared Score: 0.6456
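The MSE above is measured on the min-max scaled Track Score, so it is hard to read directly. The sketch below converts it to a root mean squared error in original Track Score points. The scaling range is not printed anywhere in the notebook, so it is inferred here from two rows of the outputs (raw 725.4 maps to 1.000000 and raw 545.9 maps to 0.745715); treat the result as an estimate under that assumption.

```python
import math

mse_scaled = 0.003194  # reported above, on the min-max scaled Track Score

# Recover the scaling range from two rows shown in the earlier outputs:
# raw 725.4 -> 1.000000 (the maximum) and raw 545.9 -> 0.745715.
score_max = 725.4
raw, scaled = 545.9, 0.745715
# From scaled = (raw - min) / (max - min), solved for min.
score_min = (raw - scaled * score_max) / (1 - scaled)

rmse_scaled = math.sqrt(mse_scaled)
rmse_original = rmse_scaled * (score_max - score_min)

print(f"Inferred minimum Track Score: {score_min:.1f}")
print(f"RMSE (scaled): {rmse_scaled:.4f}")
print(f"RMSE (Track Score points): {rmse_original:.1f}")
```

Under these assumptions the typical test error is about 40 Track Score points, which gives more context to the R-squared of 0.6456 than the raw MSE does.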
Choice of Loss Function:
I chose Mean Squared Error (MSE) because it measures the average squared difference between
predictions and actual values, which is the standard loss for least-squares regression. Because errors
are squared, larger errors are penalized more heavily than smaller ones.
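A small made-up example shows how MSE emphasizes larger errors. Both sets of predictions below have the same mean absolute error, but the one with a single large miss has a much higher MSE. The numbers are invented for illustration, and MAE is included only as a point of comparison.

```python
import numpy as np

y_true = np.array([0.50, 0.40, 0.30, 0.20])
y_pred_small = np.array([0.45, 0.45, 0.25, 0.25])  # four errors of 0.05 each
y_pred_large = np.array([0.50, 0.40, 0.30, 0.40])  # one error of 0.20, rest exact


def mse(y, y_hat):
    """Mean of the squared differences."""
    return float(np.mean((y - y_hat) ** 2))


def mae(y, y_hat):
    """Mean of the absolute differences."""
    return float(np.mean(np.abs(y - y_hat)))


print(f"Small errors: MAE={mae(y_true, y_pred_small):.4f}  MSE={mse(y_true, y_pred_small):.4f}")
print(f"One big miss: MAE={mae(y_true, y_pred_large):.4f}  MSE={mse(y_true, y_pred_large):.4f}")
```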
Training Set (80%):
Size: 80% of the data. Why: This larger portion gives the model enough examples to learn
the patterns and relationships between the features and Track Score.
Test Set (20%):
Size: 20% of the data. Why: This set is used to evaluate how well the model performs on new, unseen
data. Holding out a test set is common practice for checking that the model generalizes.
Design Choice:
Split Ratio: An 80/20 split is a reasonable balance. It provides enough
data to train the model while keeping a meaningful portion for testing its performance.
Overfitting Detection: Keeping a separate test set helps detect overfitting, showing whether the model is
just memorizing the training data or can also perform well on new data.
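One way to use the held-out set for this check is to compare R-squared on the training and test sets: a large gap suggests overfitting. With only 565 rows, a single 20% test set holds just over a hundred rows, so k-fold cross-validation can give a steadier estimate. The sketch below uses synthetic data with the same number of rows; the feature weights, the noise level, and the choice of 5 folds are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: 565 rows to mirror the size of the cleaned dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(565, 5))
y = X @ np.array([0.5, 0.2, 0.1, 0.0, 0.0]) + rng.normal(0, 0.05, size=565)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# A training score far above the test score would point to overfitting.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 5-fold cross-validation: every row is used for testing exactly once.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(f"Train R^2: {train_r2:.3f}  Test R^2: {test_r2:.3f}")
print(f"5-fold CV R^2: mean={cv_r2.mean():.3f}  std={cv_r2.std():.3f}")
```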