ML Ass

Uploaded by

Rayyan Athar

pvegnu5k0

June 4, 2024

1 Importing Required Libraries


[ ]: import pandas as pd
import ydata_profiling as pandas_profiling
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
from IPython.display import display

[ ]: !pip install pandas ydata_profiling

Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages


(2.2.2)
Requirement already satisfied: ydata_profiling in
/usr/local/lib/python3.10/dist-packages (4.8.3)
Requirement already satisfied: numpy>=1.22.4 in /usr/local/lib/python3.10/dist-
packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in
/usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-
packages (from pandas) (2023.4)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-
packages (from pandas) (2024.1)
Requirement already satisfied: scipy<1.14,>=1.4.1 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (1.11.4)
Requirement already satisfied: matplotlib<3.9,>=3.2 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (3.8.4)
Requirement already satisfied: pydantic>=2 in /usr/local/lib/python3.10/dist-
packages (from ydata_profiling) (2.7.2)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (6.0.1)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (3.1.4)
Requirement already satisfied: visions[type_image_path]<0.7.7,>=0.7.5 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (0.7.6)
Requirement already satisfied: htmlmin==0.1.12 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (0.1.12)
Requirement already satisfied: phik<0.13,>=0.11.1 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (0.12.4)
Requirement already satisfied: requests<3,>=2.24.0 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (2.31.0)
Requirement already satisfied: tqdm<5,>=4.48.2 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (4.66.4)
Requirement already satisfied: seaborn<0.14,>=0.10.1 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (0.13.2)
Requirement already satisfied: multimethod<2,>=1.4 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (1.11.2)
Requirement already satisfied: statsmodels<1,>=0.13.2 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (0.14.2)
Requirement already satisfied: typeguard<5,>=3 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (4.3.0)
Requirement already satisfied: imagehash==4.3.1 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (4.3.1)
Requirement already satisfied: wordcloud>=1.9.1 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (1.9.3)
Requirement already satisfied: dacite>=1.8 in /usr/local/lib/python3.10/dist-
packages (from ydata_profiling) (1.8.1)
Requirement already satisfied: numba<1,>=0.56.0 in
/usr/local/lib/python3.10/dist-packages (from ydata_profiling) (0.58.1)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.10/dist-
packages (from imagehash==4.3.1->ydata_profiling) (1.6.0)
Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages
(from imagehash==4.3.1->ydata_profiling) (9.4.0)
Requirement already satisfied: MarkupSafe>=2.0 in
/usr/local/lib/python3.10/dist-packages (from
jinja2<3.2,>=2.11.1->ydata_profiling) (2.1.5)
Requirement already satisfied: contourpy>=1.0.1 in
/usr/local/lib/python3.10/dist-packages (from
matplotlib<3.9,>=3.2->ydata_profiling) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-
packages (from matplotlib<3.9,>=3.2->ydata_profiling) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in
/usr/local/lib/python3.10/dist-packages (from
matplotlib<3.9,>=3.2->ydata_profiling) (4.52.4)
Requirement already satisfied: kiwisolver>=1.3.1 in
/usr/local/lib/python3.10/dist-packages (from
matplotlib<3.9,>=3.2->ydata_profiling) (1.4.5)
Requirement already satisfied: packaging>=20.0 in
/usr/local/lib/python3.10/dist-packages (from
matplotlib<3.9,>=3.2->ydata_profiling) (24.0)
Requirement already satisfied: pyparsing>=2.3.1 in
/usr/local/lib/python3.10/dist-packages (from
matplotlib<3.9,>=3.2->ydata_profiling) (3.1.2)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in
/usr/local/lib/python3.10/dist-packages (from numba<1,>=0.56.0->ydata_profiling)
(0.41.1)
Requirement already satisfied: joblib>=0.14.1 in /usr/local/lib/python3.10/dist-
packages (from phik<0.13,>=0.11.1->ydata_profiling) (1.4.2)
Requirement already satisfied: annotated-types>=0.4.0 in
/usr/local/lib/python3.10/dist-packages (from pydantic>=2->ydata_profiling)
(0.7.0)
Requirement already satisfied: pydantic-core==2.18.3 in
/usr/local/lib/python3.10/dist-packages (from pydantic>=2->ydata_profiling)
(2.18.3)
Requirement already satisfied: typing-extensions>=4.6.1 in
/usr/local/lib/python3.10/dist-packages (from pydantic>=2->ydata_profiling)
(4.12.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-
packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in
/usr/local/lib/python3.10/dist-packages (from
requests<3,>=2.24.0->ydata_profiling) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-
packages (from requests<3,>=2.24.0->ydata_profiling) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in
/usr/local/lib/python3.10/dist-packages (from
requests<3,>=2.24.0->ydata_profiling) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in
/usr/local/lib/python3.10/dist-packages (from
requests<3,>=2.24.0->ydata_profiling) (2024.2.2)
Requirement already satisfied: patsy>=0.5.6 in /usr/local/lib/python3.10/dist-
packages (from statsmodels<1,>=0.13.2->ydata_profiling) (0.5.6)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.10/dist-
packages (from visions[type_image_path]<0.7.7,>=0.7.5->ydata_profiling) (23.2.0)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.10/dist-
packages (from visions[type_image_path]<0.7.7,>=0.7.5->ydata_profiling) (3.3)

[ ]: import pandas as pd
import ydata_profiling as pandas_profiling

try:
    movie_data = pd.read_csv('movies_data.csv', encoding='utf-8')
except UnicodeDecodeError:
    movie_data = pd.read_csv('movies_data.csv', encoding='latin1')

profile = pandas_profiling.ProfileReport(movie_data, title="Movie Data Profiling Report",
                                         explorative=True)

profile.to_file("movie_data_report.html")
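The encoding fallback used when loading the CSV can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `read_csv_with_fallback` is hypothetical, and a throwaway temporary file stands in for movies_data.csv.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error


# Demo file: the Latin-1 byte for "é" is not valid UTF-8, so the first attempt fails.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("Name,Year\nCafé,(2019)\n".encode("latin1"))
    demo_path = handle.name

demo = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(demo)
```

Latin-1 can decode any byte sequence, so placing it last guarantees that the loop returns something; whether the decoded characters are the intended ones still has to be checked by eye.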

/usr/local/lib/python3.10/dist-packages/ydata_profiling/profile_report.py:363:
UserWarning: Try running command: 'pip install --upgrade Pillow' to avoid
ValueError
warnings.warn(
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]

[ ]: # Generate a Pandas Profiling Report for initial data exploration


report = pandas_profiling.ProfileReport(movie_data)
display(report)

Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]


Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
<IPython.core.display.HTML object>

2 Data Preprocessing and Cleaning


[8]: movie_data['Year'] = movie_data['Year'].str.extract(r'(\d+)').astype(float)

[9]: movie_data['Year'] = movie_data['Year'].fillna(movie_data['Year'].mean())

[10]: movie_data.isnull().sum()

[10]: Name 0
Year 0
Duration 8269
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64

[11]: movie_data['Year'] = movie_data['Year'].astype(int)
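The Year clean-up (extract digits, fill gaps with the mean, cast to integer) can be tried on a few rows in isolation. This is a sketch under assumptions: the raw "(YYYY)" format is a guess at what the CSV contains, and the toy values are made up. It assigns the result of `fillna` back to the column, which is the pattern pandas recommends in place of a chained `inplace=True` call.

```python
import pandas as pd

# Hypothetical raw values; the "(YYYY)" format is an assumption for illustration.
toy = pd.DataFrame({"Year": ["(2019)", "(2021)", None, "(2010)"]})

# expand=False returns a Series rather than a one-column DataFrame.
toy["Year"] = toy["Year"].str.extract(r"(\d+)", expand=False).astype(float)

# Assign back instead of calling fillna(..., inplace=True) on a column selection.
toy["Year"] = toy["Year"].fillna(toy["Year"].mean())

# The mean of whole years is usually fractional, so round before casting;
# a bare astype(int), as in the notebook, truncates instead.
toy["Year"] = toy["Year"].round().astype(int)
print(toy["Year"].tolist())
```

Rounding is a design choice of this sketch; either way, the filled value is a synthetic year, which is worth keeping in mind when reading the release-count plots.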

[12]: movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 15509 non-null object
1 Year 15509 non-null int64
2 Duration 7240 non-null object
3 Genre 13632 non-null object
4 Rating 7919 non-null float64
5 Votes 7920 non-null object
6 Director 14984 non-null object
7 Actor 1 13892 non-null object
8 Actor 2 13125 non-null object
9 Actor 3 12365 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 1.2+ MB

[13]: movie_data.head()

[13]:
                                 Name  Year  Duration            Genre  Rating  \
0                                      1987       NaN            Drama     NaN
1  #Gadhvi (He thought he was Gandhi)  2019   109 min            Drama     7.0
2                         #Homecoming  2021    90 min   Drama, Musical     NaN
3                             #Yaaram  2019   110 min  Comedy, Romance     4.4
4                     …And Once Again  2010   105 min            Drama     NaN

   Votes            Director       Actor 1             Actor 2          Actor 3
0    NaN       J.S. Randhawa      Manmauji              Birbal  Rajendra Bhatia
1      8       Gaurav Bakshi  Rasika Dugal      Vivek Ghamande    Arvind Jangid
2    NaN  Soumyajit Majumdar  Sayani Gupta   Plabita Borthakur       Roy Angana
3     35          Ovais Khan       Prateik          Ishita Raj  Siddhant Kapoor
4    NaN        Amol Palekar  Rajat Kapoor  Rituparna Sengupta      Antara Mali
[88]: # @title Year vs Rating

from matplotlib import pyplot as plt


movie_data.plot(kind='scatter', x='Year', y='Rating', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

[14]: movie_data.isnull().sum()

[14]: Name 0
Year 0
Duration 8269
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64

[15]: movie_data['Duration'] = movie_data['Duration'].str.extract(r'(\d+)').astype(float)

[16]: movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 15509 non-null object
1 Year 15509 non-null int64
2 Duration 7240 non-null float64
3 Genre 13632 non-null object
4 Rating 7919 non-null float64
5 Votes 7920 non-null object
6 Director 14984 non-null object
7 Actor 1 13892 non-null object
8 Actor 2 13125 non-null object
9 Actor 3 12365 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 1.2+ MB

[17]: movie_data["Duration"] = movie_data["Duration"].fillna(movie_data["Duration"].mean())

[19]: movie_data.isnull().sum()

[19]: Name 0
Year 0
Duration 0
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64

[20]: movie_data.head()

[20]:
                                 Name  Year    Duration            Genre  \
0                                      1987  128.126519            Drama
1  #Gadhvi (He thought he was Gandhi)  2019  109.000000            Drama
2                         #Homecoming  2021   90.000000   Drama, Musical
3                             #Yaaram  2019  110.000000  Comedy, Romance
4                     …And Once Again  2010  105.000000            Drama

   Rating  Votes            Director       Actor 1             Actor 2  \
0     NaN    NaN       J.S. Randhawa      Manmauji              Birbal
1     7.0      8       Gaurav Bakshi  Rasika Dugal      Vivek Ghamande
2     NaN    NaN  Soumyajit Majumdar  Sayani Gupta   Plabita Borthakur
3     4.4     35          Ovais Khan       Prateik          Ishita Raj
4     NaN    NaN        Amol Palekar  Rajat Kapoor  Rituparna Sengupta

           Actor 3
0  Rajendra Bhatia
1    Arvind Jangid
2       Roy Angana
3  Siddhant Kapoor
4      Antara Mali

[97]: # @title Rating vs Votes

from matplotlib import pyplot as plt


movie_data.plot(kind='scatter', x='Rating', y='Votes', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
[21]: movie_data["Genre"].isnull().sum()

[21]: 1877

[23]: movie_data.dropna(subset=['Genre'], inplace=True)

[24]: genres=movie_data['Genre'].str.split(',',expand=True)
genres.head(5)

[24]: 0 1 2
0 Drama None None
1 Drama None None
2 Drama Musical None
3 Comedy Romance None
4 Drama None None

[25]: genre_counts = {}
# Note: entries after a comma keep their leading space (e.g. ' Musical'),
# so they are tallied separately from the same genre listed first.
for genre in genres.values.flatten():
    if genre is not None:
        if genre in genre_counts:
            genre_counts[genre] += 1
        else:
            genre_counts[genre] = 1

sorted_genre_counts = {genre: count for genre, count in sorted(genre_counts.items())}

for genre, count in sorted_genre_counts.items():
    print(f"{genre}: {count}")

Action: 56
Adventure: 289
Biography: 53
Comedy: 468
Crime: 863
Drama: 2726
Family: 782
Fantasy: 266
History: 178
Horror: 121
Music: 74
Musical: 424
Mystery: 365
News: 9
Reality-TV: 1
Romance: 1687
Sci-Fi: 48
Short: 1
Sport: 59
Thriller: 927
War: 39
Western: 5
Action: 3487
Adventure: 252
Animation: 125
Biography: 155
Comedy: 1561
Crime: 459
Documentary: 383
Drama: 4517
Family: 161
Fantasy: 192
History: 29
Horror: 403
Music: 16
Musical: 165
Mystery: 148
Reality-TV: 2
Romance: 762
Sci-Fi: 10
Sport: 11
Thriller: 786
War: 8
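The listing above contains each genre twice because splitting "Drama, Musical" on a bare comma leaves a leading space on the second entry. A possible way to merge the two tallies is to strip whitespace before counting. The sketch below uses toy rows that mirror the first five Genre values shown earlier; the variable names are illustrative.

```python
from collections import Counter

import pandas as pd

# Toy data mirroring the first rows of the Genre column.
genre_column = pd.Series(["Drama", "Drama", "Drama, Musical", "Comedy, Romance", "Drama"])

# Split on commas, put each genre on its own row, and strip surrounding whitespace.
exploded = genre_column.str.split(",").explode().str.strip()

stripped_counts = Counter(exploded)
for genre, count in sorted(stripped_counts.items()):
    print(f"{genre}: {count}")
```

With the whitespace removed, "Musical" and " Musical" fall into the same bucket, so each genre appears once in the sorted output.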

[26]: genresPie = movie_data['Genre'].value_counts()


genresPie.head(5)

[26]: Genre
Drama 2780
Action 1289
Thriller 779
Romance 708
Drama, Romance 524
Name: count, dtype: int64

[27]: genrePie = pd.DataFrame(list(genresPie.items()))


genrePie = genrePie.rename(columns={0: 'Genre', 1: 'Count'})
genrePie.head(5)

[27]: Genre Count


0 Drama 2780
1 Action 1289
2 Thriller 779
3 Romance 708
4 Drama, Romance 524

[101]: data = {'Genre': ['Drama', 'Action', 'Thriller', 'Romance', 'Drama, Romance'],
    'Count': [2780, 1289, 779, 708, 524]}

genre_counts_df = pd.DataFrame(data)

2.1 One-Hot Encoding for Genre Data


In this section, we one-hot encode the Genre column of the DataFrame genre_counts_df. One-hot
encoding converts a categorical variable into binary indicator columns, one per category. Because
the encoding is applied to the full genre string, a combination such as 'Drama, Romance' is
treated as a single category of its own.
[102]: one_hot_encoded = pd.get_dummies(genre_counts_df['Genre'], prefix='Genre')

genre_counts_df = pd.concat([genre_counts_df, one_hot_encoded], axis=1)

genre_counts_df.drop(columns=['Genre'], inplace=True)
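When a movie carries several genres, an alternative to one column per combined string is one indicator column per individual genre. The sketch below uses `Series.str.get_dummies` on made-up rows; it illustrates that variant and is not what the notebook itself does.

```python
import pandas as pd

toy_genres = pd.Series(["Drama", "Drama, Musical", "Comedy, Romance"], name="Genre")

# Normalize the separator first so that " Musical" and "Musical" map to the same column.
normalized = toy_genres.str.replace(r"\s*,\s*", ",", regex=True)

# One indicator column per individual genre; a row can contain several 1s.
multi_hot = normalized.str.get_dummies(sep=",")
print(multi_hot)
```

Here "Drama, Musical" sets both the Drama and the Musical column, whereas `pd.get_dummies` on the raw strings would create a separate 'Drama, Musical' column.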

[30]: movie_data.isnull().sum()

[30]: Name 0
Year 0
Duration 0
Genre 0
Rating 5815
Votes 5814
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64

[31]: movie_data["Rating"] = movie_data["Rating"].fillna(movie_data["Rating"].mean())

[32]: movie_data.isnull().sum()

[32]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 5814
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64

[33]: movie_data["Votes"].info()

<class 'pandas.core.series.Series'>
Index: 13632 entries, 0 to 15508
Series name: Votes
Non-Null Count Dtype
-------------- -----
7818 non-null object
dtypes: object(1)
memory usage: 213.0+ KB

2.2 Data Cleaning: Convert ‘Votes’ Column to Float


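[34]: # Strip '$', ',' and 'M' characters, then convert to float.
# Note: removing 'M' discards a "millions" suffix rather than scaling the value.
movie_data['Votes'] = movie_data['Votes'].str.replace(r'[\$,M]', '', regex=True).astype(float)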

print(movie_data['Votes'].dtype)
print(movie_data['Votes'].count())

float64
7818

[35]: movie_data['Votes'].info()

<class 'pandas.core.series.Series'>
Index: 13632 entries, 0 to 15508
Series name: Votes
Non-Null Count Dtype
-------------- -----
7818 non-null float64
dtypes: float64(1)
memory usage: 213.0 KB
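The regex-based conversion above deletes the characters `$`, `,` and `M`, so a value with an "M" suffix would lose its scale. A scale-aware variant might look like the sketch below. The sample strings, in particular "$5.16M", are assumptions inferred from the characters the regex removes, and `parse_votes` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical raw values; the "$5.16M" form is an assumption suggested by the regex.
raw_votes = pd.Series(["8", "1,086", "$5.16M", None])


def parse_votes(value):
    """Convert a vote string to float, scaling an 'M' suffix to millions."""
    if pd.isna(value):
        return float("nan")
    text = value.replace("$", "").replace(",", "").strip()
    if text.endswith("M"):
        return float(text[:-1]) * 1_000_000
    return float(text)


votes = raw_votes.map(parse_votes)
print(votes.tolist())
```

Whether such entries should be scaled or treated as data errors and dropped depends on the dataset; this sketch only shows the scaling option.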

[36]: movie_data['Votes'] = movie_data['Votes'].fillna(movie_data['Votes'].mean())

[37]: movie_data.isnull().sum()

[37]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64

[38]: movie_data.head()

[38]:
                                 Name  Year    Duration            Genre  \
0                                      1987  128.126519            Drama
1  #Gadhvi (He thought he was Gandhi)  2019  109.000000            Drama
2                         #Homecoming  2021   90.000000   Drama, Musical
3                             #Yaaram  2019  110.000000  Comedy, Romance
4                     …And Once Again  2010  105.000000            Drama

     Rating        Votes            Director       Actor 1  \
0  5.839568  1963.393471       J.S. Randhawa      Manmauji
1  7.000000     8.000000       Gaurav Bakshi  Rasika Dugal
2  5.839568  1963.393471  Soumyajit Majumdar  Sayani Gupta
3  4.400000    35.000000          Ovais Khan       Prateik
4  5.839568  1963.393471        Amol Palekar  Rajat Kapoor

              Actor 2          Actor 3
0              Birbal  Rajendra Bhatia
1      Vivek Ghamande    Arvind Jangid
2   Plabita Borthakur       Roy Angana
3          Ishita Raj  Siddhant Kapoor
4  Rituparna Sengupta      Antara Mali

[40]: movie_data.dropna(subset=['Director'], inplace=True)

[41]: movie_data.isnull().sum()

[41]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64

[43]: actors = pd.concat([movie_data['Actor 1'], movie_data['Actor 2'],
                    movie_data['Actor 3']]).dropna().value_counts()
actors.head(5)

[43]: Mithun Chakraborty 241


Dharmendra 230
Ashok Kumar 212
Jeetendra 179
Amitabh Bachchan 178
Name: count, dtype: int64

[44]: movie_data.isnull().sum()

[44]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64

[45]: movie_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13131 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 13131 non-null object
1 Year 13131 non-null int64
2 Duration 13131 non-null float64
3 Genre 13131 non-null object
4 Rating 13131 non-null float64
5 Votes 13131 non-null float64
6 Director 13131 non-null object
7 Actor 1 12407 non-null object
8 Actor 2 11928 non-null object
9 Actor 3 11368 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 1.1+ MB

[46]: movie_data.isnull().sum()

[46]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64

[47]: movie_data.dropna(subset = ['Actor 1','Actor 2', 'Actor 3'], inplace = True)

[48]: movie_data.isnull().sum()

[48]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 0
Actor 2 0
Actor 3 0
dtype: int64

3 Data Visualization
[49]: plt.hist(movie_data['Rating'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Ratings')
plt.show()
[53]: import numpy as np

• This code snippet selects the numeric columns of the movie dataset and computes their
correlation matrix. Each entry is the Pearson correlation coefficient, which quantifies the
linear relationship between a pair of numeric features.
• The heatmap renders the matrix with a color gradient and annotates each cell with its
coefficient. Color encodes the signed value of the coefficient, not its strength, so strong
negative and strong positive correlations sit at opposite ends of the gradient; vmax=.8 caps the
color scale at 0.8.
[54]: numeric_cols = movie_data.select_dtypes(include=[np.number])
corrmat = numeric_cols.corr()

fig = plt.figure(figsize=(20, 5))


sns.heatmap(corrmat, vmax=.8, square=True, annot=True)
plt.show()
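Each cell of the heatmap is a Pearson correlation coefficient. As a rough illustration of what `corr()` computes for one pair of columns, the sketch below evaluates the formula by hand with NumPy; the arrays are made-up stand-ins for Duration and Rating, not values from the dataset.

```python
import numpy as np

# Toy columns standing in for Duration and Rating (values are made up).
duration = np.array([90.0, 105.0, 109.0, 110.0, 128.0])
rating = np.array([4.4, 5.1, 7.0, 6.2, 5.8])


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


r = pearson(duration, rating)
print(round(r, 3))
print(round(np.corrcoef(duration, rating)[0, 1], 3))  # same value from NumPy's built-in
```

The coefficient always lies between -1 and 1, and a value near 0 means no linear relationship, which does not rule out a non-linear one.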

[55]: genre_counts = movie_data['Genre'].value_counts().head(10)

plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar', color='lightcoral')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.title('Top 10 Movie Genres')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
This code segment converts the ‘Year’ column in the movie dataset to integer data type, ensuring
it is treated as a numerical variable. It then counts the number of movie releases for each year and
sorts the results by year.
Afterwards, it plots the movie releases over time, with the x-axis representing the years and the
y-axis showing the number of movies released. The plot is created using a line plot with markers
for each data point. Additionally, labels, title, and a grid are added for clarity.
[56]: movie_data['Year'] = movie_data['Year'].astype(int)
movie_counts_by_year = movie_data['Year'].value_counts().sort_index()

plt.figure(figsize=(12, 6))
plt.plot(movie_counts_by_year.index, movie_counts_by_year.values, marker='o',
         linestyle='-', color='mediumseagreen')

plt.xlabel('Year')
plt.ylabel('Number of Movies Released')
plt.title('Movie Releases Over Time')
plt.grid(True)
plt.show()
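The counting step relies on `value_counts()` followed by `sort_index()`. A small sketch with made-up years shows why the second call matters before plotting a time series:

```python
import pandas as pd

years = pd.Series([2019, 2021, 2019, 2010, 2019, 2021])

# value_counts orders by frequency; sort_index reorders the result chronologically.
by_frequency = years.value_counts()
by_year = by_frequency.sort_index()
print(by_year)
```

Without `sort_index()`, the line plot would connect the years in order of frequency and zigzag across the x-axis.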

[60]: import plotly.express as px

[61]: fig = px.scatter_3d(movie_data, x='Duration', y='Rating', z='Votes',
                    color='Rating', title='3D Plot of Duration, Rating, and Votes')

fig.show()

[63]: import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(movie_data['Duration'], bins=30, color='lightblue', edgecolor='black')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.title('Histogram of Movie Durations')
plt.show()
[65]: !pip install wordcloud

Requirement already satisfied: wordcloud in /usr/local/lib/python3.10/dist-


packages (1.9.3)
Requirement already satisfied: numpy>=1.6.1 in /usr/local/lib/python3.10/dist-
packages (from wordcloud) (1.26.4)
Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages
(from wordcloud) (9.4.0)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-
packages (from wordcloud) (3.8.4)
Requirement already satisfied: contourpy>=1.0.1 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->wordcloud) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-
packages (from matplotlib->wordcloud) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->wordcloud) (4.52.4)
Requirement already satisfied: kiwisolver>=1.3.1 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->wordcloud) (1.4.5)
Requirement already satisfied: packaging>=20.0 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->wordcloud) (24.0)
Requirement already satisfied: pyparsing>=2.3.1 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->wordcloud) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->wordcloud) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-
packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)

[66]: from wordcloud import WordCloud

[67]: director_text = ' '.join(movie_data['Director'].dropna())

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(director_text)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Directors')
plt.show()

[68]: top_20_genre_counts = movie_data['Genre'].value_counts().head(20)

colors = ['#4287f5', '#3d5a80', '#587b7f', '#325e84', '#6487a0',
          '#2f496e', '#385d7e', '#4287f5', '#3d5a80', '#587b7f',
          '#2c4f6d', '#376c9b', '#419be0', '#4fa2e5', '#66b3ff',
          '#2c4f6d', '#376c9b', '#419be0', '#4fa2e5', '#66b3ff']

plt.figure(figsize=(8, 8))
plt.pie(top_20_genre_counts, labels=top_20_genre_counts.index,
        autopct='%1.1f%%', startangle=140, colors=colors)

# The slices are the 20 most frequent Genre values, not 20 movies.
plt.title('Distribution of the 20 Most Common Genre Labels')
plt.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.

plt.show()

3.1 Splitting the Data


[104]: from sklearn.model_selection import train_test_split

X = movie_data[['Duration', 'Votes']]
y = movie_data['Rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
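The effect of `test_size=0.2` and `random_state=42` can be checked on synthetic data. In this sketch the arrays are random stand-ins for the Duration/Votes features and the Rating target, not the movie dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))  # two features, like Duration and Votes
y_toy = rng.normal(size=100)       # stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same partition.
X_tr_again, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(X_tr, X_tr_again))
```

Fixing the seed keeps the three models in the following sections comparable, since they are all scored on the same held-out rows.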

3.1.1 Random Forest Regression Model Evaluation


In this section, we evaluate the performance of the Random Forest regression model trained on the
movie dataset. We initialize the Random Forest regressor with 100 estimators and train it on the
training data. Then, we make predictions on the test data and evaluate the model’s performance
using mean squared error (MSE).

[105]: from sklearn.ensemble import RandomForestRegressor


from sklearn.metrics import mean_squared_error

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)


rf_regressor.fit(X_train, y_train)

rf_predictions = rf_regressor.predict(X_test)

mse = mean_squared_error(y_test, rf_predictions)


print("Mean Squared Error:", mse)

Mean Squared Error: 2.2754586685392897
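An MSE on its own is hard to judge without a reference point. One common reference is a model that always predicts the mean of the training targets. The sketch below compares a random forest with such a baseline on synthetic data that has a clear signal; the data, the variable names, and the reduced `n_estimators=50` are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a clear signal (made up for illustration, not the movie dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
forest_mse = mean_squared_error(y_te, forest.predict(X_te))
print(f"baseline MSE: {baseline_mse:.3f}, forest MSE: {forest_mse:.3f}")
```

Running the same comparison on the movie data would show how much of the reported MSE a mean-only predictor already achieves.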

3.1.2 AdaBoost Regression Model Evaluation


In this section, we evaluate the performance of the AdaBoost regression model trained on the movie
dataset. We initialize the AdaBoost regressor with 100 estimators and train it on the training data.
Then, we make predictions on the test data and evaluate the model’s performance using mean
squared error (MSE).

[107]: from sklearn.ensemble import AdaBoostRegressor


from sklearn.metrics import mean_squared_error

adaboost_regressor = AdaBoostRegressor(n_estimators=100, random_state=42)

adaboost_regressor.fit(X_train, y_train)

adaboost_predictions = adaboost_regressor.predict(X_test)

mse = mean_squared_error(y_test, adaboost_predictions)


print("Mean Squared Error:", mse)

Mean Squared Error: 1.9140002337811928

3.1.3 Stacking Regression Model Evaluation


In this section, we evaluate the performance of the Stacking regression model trained on the movie
dataset. We initialize two base estimators, Random Forest regressor and AdaBoost regressor, each
with 100 estimators, and train them on the training data. Then, we use a Linear Regression model
as the final estimator in the StackingRegressor. We make predictions on the test data using the
StackingRegressor and evaluate the model’s performance using mean squared error (MSE).

[108]: from sklearn.ensemble import StackingRegressor


from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error

base_estimators = [
    ('random_forest', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('adaboost', AdaBoostRegressor(n_estimators=100, random_state=42))
]

stacking_regressor = StackingRegressor(estimators=base_estimators,
                                       final_estimator=LinearRegression())

stacking_regressor.fit(X_train, y_train)

stacking_predictions = stacking_regressor.predict(X_test)

mse = mean_squared_error(y_test, stacking_predictions)


print("Stacking Metrics:")
print("Mean Squared Error:", mse)

Stacking Metrics:
Mean Squared Error: 1.8230041471178846
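To make the stacking idea concrete, the sketch below builds the two levels by hand: out-of-fold predictions from the base models become the input features of the final linear model. It is a simplified illustration on synthetic data, assuming 5-fold cross-validation and smaller ensembles (`n_estimators=20`) for speed; it is not a reimplementation of `StackingRegressor`.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

base_models = [
    RandomForestRegressor(n_estimators=20, random_state=42),
    AdaBoostRegressor(n_estimators=20, random_state=42),
]

# Level 0: out-of-fold predictions, so each row is predicted by a model that never saw it.
meta_features = np.column_stack(
    [cross_val_predict(model, X_toy, y_toy, cv=5) for model in base_models]
)

# Level 1: the final estimator learns how to weight the base predictions.
final_estimator = LinearRegression().fit(meta_features, y_toy)
print(final_estimator.coef_)
```

Using out-of-fold predictions keeps the final estimator from simply rewarding whichever base model overfits the training rows the most.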

3.1.4 Evaluation Metrics for Regression Models


Here, we compute and compare evaluation metrics for the Random Forest, AdaBoost, and Stacking
regression models trained on the movie dataset.

Random Forest Metrics:
• Mean Squared Error: 2.275
• Mean Absolute Error: 1.186
• R-squared Score: -0.223

AdaBoost Metrics:
• Mean Squared Error: 1.914
• Mean Absolute Error: 1.137
• R-squared Score: -0.029

Stacking Metrics:
• Mean Squared Error: 1.823
• Mean Absolute Error: 1.088
• R-squared Score: 0.020

[109]: from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rf_mse = mean_squared_error(y_test, rf_predictions)


rf_mae = mean_absolute_error(y_test, rf_predictions)
rf_r2 = r2_score(y_test, rf_predictions)
print("Random Forest Metrics:")
print(f"Mean Squared Error: {rf_mse}")
print(f"Mean Absolute Error: {rf_mae}")
print(f"R-squared Score: {rf_r2}")
print()

adaboost_mse = mean_squared_error(y_test, adaboost_predictions)


adaboost_mae = mean_absolute_error(y_test, adaboost_predictions)
adaboost_r2 = r2_score(y_test, adaboost_predictions)

print("AdaBoost Metrics:")
print(f"Mean Squared Error: {adaboost_mse}")
print(f"Mean Absolute Error: {adaboost_mae}")
print(f"R-squared Score: {adaboost_r2}")
print()

# Evaluate Stacking
stacking_mse = mean_squared_error(y_test, stacking_predictions)
stacking_mae = mean_absolute_error(y_test, stacking_predictions)
stacking_r2 = r2_score(y_test, stacking_predictions)

print("Stacking Metrics:")
print(f"Mean Squared Error: {stacking_mse}")
print(f"Mean Absolute Error: {stacking_mae}")
print(f"R-squared Score: {stacking_r2}")

Random Forest Metrics:


Mean Squared Error: 2.2754586685392897
Mean Absolute Error: 1.1855945596224633
R-squared Score: -0.2230252875394938

AdaBoost Metrics:
Mean Squared Error: 1.9140002337811928
Mean Absolute Error: 1.1371216538097293
R-squared Score: -0.028746739563326296

Stacking Metrics:
Mean Squared Error: 1.8230041471178846
Mean Absolute Error: 1.0876853487767815
R-squared Score: 0.020162307476320973
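The three metrics reported above follow directly from the prediction errors. The sketch below recomputes them by hand on made-up numbers and checks them against scikit-learn; it also shows that a predictor which always outputs the mean of the true values scores an R-squared of zero, which is the reference point for reading the negative scores.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up ratings and predictions for illustration.
y_true = np.array([7.0, 4.4, 5.8, 6.2])
y_pred = np.array([6.0, 5.0, 5.8, 7.0])

errors = y_true - y_pred
mse_manual = np.mean(errors ** 2)
mae_manual = np.mean(np.abs(errors))
# R-squared: 1 minus (residual sum of squares / total sum of squares around the mean).
r2_manual = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))

# A constant prediction equal to the mean of y_true gives an R-squared of zero.
mean_only = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_only))
```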
3.2 Conclusion
Mean Squared Error (MSE): Lower values indicate better performance. Stacking has the lowest
MSE (1.823), followed by AdaBoost (1.914) and then Random Forest (2.275).
Mean Absolute Error (MAE): Lower values indicate better performance. Stacking has the
lowest MAE (1.088), followed by AdaBoost (1.137) and then Random Forest (1.186).
R-squared Score: This metric measures the share of the variance in the ratings that the model
explains, with higher values being better. A score of 0 matches a model that always predicts the
mean rating of the test set, and a negative score is worse than that baseline. Stacking has the
highest R-squared score (0.020), followed by AdaBoost (-0.029) and then Random Forest (-0.223).
Based on these metrics, Stacking performs best among the three ensemble techniques for this
regression task, with the lowest MSE and MAE and the highest R-squared score. In absolute terms,
however, all three R-squared scores are close to or below zero, so with only Duration and Votes as
features none of the models predicts ratings much better than the mean.
