ML Ass
June 4, 2024
(pip output: ydata_profiling and all of its dependencies — pandas, matplotlib, seaborn, phik, numba, pydantic, statsmodels, visions, imagehash, wordcloud, requests, tqdm — already satisfied in /usr/local/lib/python3.10/dist-packages)
[ ]: import pandas as pd
import ydata_profiling as pandas_profiling
try:
    movie_data = pd.read_csv('movies_data.csv', encoding='utf-8')
except UnicodeDecodeError:
    movie_data = pd.read_csv('movies_data.csv', encoding='latin1')
# Build the profiling report before exporting it to HTML
profile = pandas_profiling.ProfileReport(movie_data)
profile.to_file("movie_data_report.html")
/usr/local/lib/python3.10/dist-packages/ydata_profiling/profile_report.py:363:
UserWarning: Try running command: 'pip install --upgrade Pillow' to avoid ValueError
warnings.warn(
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
movie_data['Year'].fillna(movie_data['Year'].mean(), inplace=True)
[10]: movie_data.isnull().sum()
[10]: Name 0
Year 0
Duration 8269
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64
[12]: movie_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 15509 non-null object
1 Year 15509 non-null int64
2 Duration 7240 non-null object
3 Genre 13632 non-null object
4 Rating 7919 non-null float64
5 Votes 7920 non-null object
6 Director 14984 non-null object
7 Actor 1 13892 non-null object
8 Actor 2 13125 non-null object
9 Actor 3 12365 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 1.2+ MB
[13]: movie_data.head()
[13]: Name
Year Duration Genre Rating \
0 1987 NaN Drama NaN
1 #Gadhvi (He thought he was Gandhi) 2019 109 min Drama 7.0
2 #Homecoming 2021 90 min Drama, Musical NaN
3 #Yaaram 2019 110 min Comedy, Romance 4.4
4 …And Once Again 2010 105 min Drama NaN
[88]: # @title Year vs Rating
[14]: movie_data.isnull().sum()
[14]: Name 0
Year 0
Duration 8269
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64
[15]: movie_data['Duration'] = movie_data['Duration'].str.extract(r'(\d+)').astype(float)
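For instance, a raw value such as '109 min' becomes 109.0 after this extraction; a quick illustrative check (not part of the original notebook):
[ ]: # Illustrative check: pull the numeric part out of a duration string
sample = pd.Series(['109 min', '90 min'])
print(sample.str.extract(r'(\d+)').astype(float))  # -> 109.0, 90.0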
[16]: movie_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 15509 non-null object
1 Year 15509 non-null int64
2 Duration 7240 non-null float64
3 Genre 13632 non-null object
4 Rating 7919 non-null float64
5 Votes 7920 non-null object
6 Director 14984 non-null object
7 Actor 1 13892 non-null object
8 Actor 2 13125 non-null object
9 Actor 3 12365 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 1.2+ MB
movie_data["Duration"].fillna(movie_data["Duration"].mean(), inplace=True)
[19]: movie_data.isnull().sum()
[19]: Name 0
Year 0
Duration 0
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64
[20]: movie_data.head()
Actor 3
0 Rajendra Bhatia
1 Arvind Jangid
2 Roy Angana
3 Siddhant Kapoor
4 Antara Mali
[21]: movie_data["Genre"].isnull().sum()
[21]: 1877
[24]: genres = movie_data['Genre'].str.split(',', expand=True)
genres.head(5)
[24]: 0 1 2
0 Drama None None
1 Drama None None
2 Drama Musical None
3 Comedy Romance None
4 Drama None None
[25]: genre_counts = {}
for genre in genres.values.flatten():
    if pd.notna(genre):  # skip missing values (the original check only excluded None)
        if genre in genre_counts:
            genre_counts[genre] += 1
        else:
            genre_counts[genre] = 1
# Print the per-genre counts shown below
for genre, count in sorted(genre_counts.items()):
    print(f"{genre}: {count}")
Action: 56
Adventure: 289
Biography: 53
Comedy: 468
Crime: 863
Drama: 2726
Family: 782
Fantasy: 266
History: 178
Horror: 121
Music: 74
Musical: 424
Mystery: 365
News: 9
Reality-TV: 1
Romance: 1687
Sci-Fi: 48
Short: 1
Sport: 59
Thriller: 927
War: 39
Western: 5
Action: 3487
Adventure: 252
Animation: 125
Biography: 155
Comedy: 1561
Crime: 459
Documentary: 383
Drama: 4517
Family: 161
Fantasy: 192
History: 29
Horror: 403
Music: 16
Musical: 165
Mystery: 148
Reality-TV: 2
Romance: 762
Sci-Fi: 10
Sport: 11
Thriller: 786
War: 8
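Most genres appear twice in this listing (e.g. Action: 56 and Action: 3487) because split(',') leaves a leading space on every genre after the first. A more direct way to get clean per-genre counts from the same genres frame (a sketch, not part of the original notebook):
[ ]: # Count each individual genre once, stripping the leading spaces left by split(',')
clean_genre_counts = genres.stack().str.strip().value_counts()
print(clean_genre_counts.head(10))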
[26]: Genre
Drama 2780
Action 1289
Thriller 779
Romance 708
Drama, Romance 524
Name: count, dtype: int64
# 'data' is assumed to be the genre counts computed above (its defining cell is not in the export)
genre_counts_df = pd.DataFrame(list(genre_counts.items()), columns=['Genre', 'Count'])
genre_counts_df.drop(columns=['Genre'], inplace=True)
[30]: movie_data.isnull().sum()
[30]: Name 0
Year 0
Duration 0
Genre 0
Rating 5815
Votes 5814
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64
movie_data["Rating"].fillna(movie_data["Rating"].mean(), inplace=True)
[32]: movie_data.isnull().sum()
[32]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 5814
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64
[33]: movie_data["Votes"].info()
<class 'pandas.core.series.Series'>
Index: 13632 entries, 0 to 15508
Series name: Votes
Non-Null Count Dtype
-------------- -----
7818 non-null object
dtypes: object(1)
memory usage: 213.0+ KB
print(movie_data['Votes'].dtype)
print(movie_data['Votes'].count())
float64
7818
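The cell that converted Votes from object to float64 is not included in the export; a minimal sketch of a conversion that would produce this dtype, assuming the raw values are numeric strings with thousands separators such as '1,086':
[ ]: # Assumed conversion: drop thousands separators and cast to float
movie_data['Votes'] = movie_data['Votes'].str.replace(',', '', regex=False).astype(float)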
[35]: movie_data['Votes'].info()
<class 'pandas.core.series.Series'>
Index: 13632 entries, 0 to 15508
Series name: Votes
Non-Null Count Dtype
-------------- -----
7818 non-null float64
dtypes: float64(1)
memory usage: 213.0 KB
movie_data['Votes'].fillna(movie_data['Votes'].mean(), inplace=True)
[37]: movie_data.isnull().sum()
[37]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64
[38]: movie_data.head()
Actor 2 Actor 3
0 Birbal Rajendra Bhatia
1 Vivek Ghamande Arvind Jangid
2 Plabita Borthakur Roy Angana
3 Ishita Raj Siddhant Kapoor
4 Rituparna Sengupta Antara Mali
[41]: movie_data.isnull().sum()
[41]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64
actors.head(5)
[44]: movie_data.isnull().sum()
[44]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64
[45]: movie_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13131 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 13131 non-null object
1 Year 13131 non-null int64
2 Duration 13131 non-null float64
3 Genre 13131 non-null object
4 Rating 13131 non-null float64
5 Votes 13131 non-null float64
6 Director 13131 non-null object
7 Actor 1 12407 non-null object
8 Actor 2 11928 non-null object
9 Actor 3 11368 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 1.1+ MB
[46]: movie_data.isnull().sum()
[46]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64
[48]: movie_data.isnull().sum()
[48]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 0
Actor 2 0
Actor 3 0
dtype: int64
3 Data Visualization
[49]: import matplotlib.pyplot as plt
plt.hist(movie_data['Rating'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Ratings')
plt.show()
[53]: import numpy as np
• This code snippet filters the numeric columns from the movie dataset and computes the
correlation matrix among them. The correlation matrix quantifies the linear relationships
between pairs of numeric features.
• The heatmap generated from the correlation matrix visually represents these correlations
using a color gradient. Darker colors indicate stronger positive or negative correlations, while
numerical correlation values are annotated within each cell.
[54]: numeric_cols = movie_data.select_dtypes(include=[np.number])
corrmat = numeric_cols.corr()
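The cell that renders the heatmap described above is not included in the export; a minimal sketch of how it could look, assuming seaborn is used:
[ ]: import seaborn as sns
# Render the correlation matrix as an annotated heatmap (sketch; plotting cell not exported)
plt.figure(figsize=(8, 6))
sns.heatmap(corrmat, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numeric Features')
plt.show()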
[55]: genre_counts = movie_data['Genre'].value_counts().head(10)
plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar', color='lightcoral')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.title('Top 10 Movie Genres')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
This code segment casts the ‘Year’ column to integer so it is treated as a numeric variable, then counts the number of movie releases per year and sorts the results by year.
It then plots the releases over time as a line chart with a marker at each data point: the x-axis shows the year and the y-axis the number of movies released. Axis labels, a title, and a grid are added for clarity.
[56]: movie_data['Year'] = movie_data['Year'].astype(int)
movie_counts_by_year = movie_data['Year'].value_counts().sort_index()
plt.figure(figsize=(12, 6))
plt.plot(movie_counts_by_year.index, movie_counts_by_year.values, marker='o', linestyle='-', color='mediumseagreen')
plt.xlabel('Year')
plt.ylabel('Number of Movies Released')
plt.title('Movie Releases Over Time')
plt.grid(True)
plt.show()
[60]: import plotly.express as px
# 'fig' is assumed to be a Plotly figure built in a cell that is not included in this export
fig.show()
plt.figure(figsize=(10, 6))
plt.hist(movie_data['Duration'], bins=30, color='lightblue', edgecolor='black')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.title('Histogram of Movie Durations')
plt.show()
[65]: !pip install wordcloud
(pip output: wordcloud and its dependencies already satisfied in /usr/local/lib/python3.10/dist-packages)
from wordcloud import WordCloud
# Build the word cloud from director names (assumed; the generating cell is not in the export)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(movie_data['Director'].astype(str)))
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Directors')
plt.show()
# 'top_20_genre_counts' and 'colors' come from a cell not in the export; assumed here to be
# the 20 most frequent genre strings and a matplotlib colour list
top_20_genre_counts = movie_data['Genre'].value_counts().head(20)
colors = plt.cm.tab20.colors
plt.figure(figsize=(8, 8))
plt.pie(top_20_genre_counts, labels=top_20_genre_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()
X = movie_data[['Duration', 'Votes']]
y = movie_data['Rating']
Three ensemble models are trained on a train/test split of these features and evaluated using mean squared error (MSE), along with mean absolute error (MAE) and the R-squared score.
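The cells defining the split and the base regressors are not included in the export; a minimal sketch of the assumed setup (split ratio and hyperparameters are illustrative):
[ ]: from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor

# Assumed setup: hold out 20% of the data for testing and fit the two base regressors
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

adaboost_regressor = AdaBoostRegressor(n_estimators=100, random_state=42)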
rf_predictions = rf_regressor.predict(X_test)
adaboost_regressor.fit(X_train, y_train)
adaboost_predictions = adaboost_regressor.predict(X_test)
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

base_estimators = [
    ('random_forest', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('adaboost', AdaBoostRegressor(n_estimators=100, random_state=42))
]
stacking_regressor = StackingRegressor(estimators=base_estimators, final_estimator=LinearRegression())
stacking_regressor.fit(X_train, y_train)
stacking_predictions = stacking_regressor.predict(X_test)
Stacking Metrics:
Mean Squared Error: 1.8230041471178846
AdaBoost Metrics:
• Mean Squared Error: 1.914
• Mean Absolute Error: 1.137
• R-squared Score: -0.029
Stacking Metrics:
• Mean Squared Error: 1.823
• Mean Absolute Error: 1.088
• R-squared Score: 0.020
print("Random Forest Metrics:")
print(f"Mean Squared Error: {rf_mse}")
print(f"Mean Absolute Error: {rf_mae}")
print(f"R-squared Score: {rf_r2}")
print()
print("AdaBoost Metrics:")
print(f"Mean Squared Error: {adaboost_mse}")
print(f"Mean Absolute Error: {adaboost_mae}")
print(f"R-squared Score: {adaboost_r2}")
print()
# Evaluate Stacking
stacking_mse = mean_squared_error(y_test, stacking_predictions)
stacking_mae = mean_absolute_error(y_test, stacking_predictions)
stacking_r2 = r2_score(y_test, stacking_predictions)
print("Stacking Metrics:")
print(f"Mean Squared Error: {stacking_mse}")
print(f"Mean Absolute Error: {stacking_mae}")
print(f"R-squared Score: {stacking_r2}")
AdaBoost Metrics:
Mean Squared Error: 1.9140002337811928
Mean Absolute Error: 1.1371216538097293
R-squared Score: -0.028746739563326296
Stacking Metrics:
Mean Squared Error: 1.8230041471178846
Mean Absolute Error: 1.0876853487767815
R-squared Score: 0.020162307476320973
Conclusion
Mean Squared Error (MSE): Lower values indicate better performance. Stacking has the lowest
MSE (1.823) followed by AdaBoost (1.914) and then Random Forest (2.275).
Mean Absolute Error (MAE): Lower values indicate better performance. Stacking has the
lowest MAE (1.088), followed by AdaBoost (1.137) and then Random Forest (1.186).
R-squared Score: Higher values indicate a better fit to the data. Stacking has the highest
R-squared score (0.020), followed by AdaBoost (-0.029) and then Random Forest (-0.223).
Note that scores this close to (or below) zero mean none of the models explains much of the
variance in ratings from Duration and Votes alone.
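For reference, with $y_i$ the true ratings, $\hat{y}_i$ the predictions, $\bar{y}$ the mean rating, and $n$ the number of test samples:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$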
Based on these metrics, Stacking appears to perform the best among the three ensemble tech-
niques for this regression task. It has the lowest MSE and MAE, and the highest R-squared score,
indicating the best overall performance.