ML Ass
June 4, 2024
(pip output: ydata_profiling and all of its dependencies — pandas, matplotlib, seaborn, phik, numba, pydantic, statsmodels, visions, imagehash, wordcloud, requests, tqdm — already satisfied in /usr/local/lib/python3.10/dist-packages)
[ ]: import pandas as pd
import ydata_profiling as pandas_profiling
try:
    movie_data = pd.read_csv('movies_data.csv', encoding='utf-8')
except UnicodeDecodeError:
    movie_data = pd.read_csv('movies_data.csv', encoding='latin1')
# Build the profiling report before exporting it to HTML
profile = pandas_profiling.ProfileReport(movie_data)
profile.to_file("movie_data_report.html")
/usr/local/lib/python3.10/dist-packages/ydata_profiling/profile_report.py:363:
UserWarning: Try running command: 'pip install --upgrade Pillow' to avoid ValueError
warnings.warn(
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
movie_data['Year'].fillna(movie_data['Year'].mean(), inplace=True)
[10]: movie_data.isnull().sum()
[10]: Name 0
Year 0
Duration 8269
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64
[12]: movie_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 15509 non-null object
1 Year 15509 non-null int64
2 Duration 7240 non-null object
3 Genre 13632 non-null object
4 Rating 7919 non-null float64
5 Votes 7920 non-null object
6 Director 14984 non-null object
7 Actor 1 13892 non-null object
8 Actor 2 13125 non-null object
9 Actor 3 12365 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 1.2+ MB
[13]: movie_data.head()
[13]: Name
Year Duration Genre Rating \
0 1987 NaN Drama NaN
1 #Gadhvi (He thought he was Gandhi) 2019 109 min Drama 7.0
2 #Homecoming 2021 90 min Drama, Musical NaN
3 #Yaaram 2019 110 min Comedy, Romance 4.4
4 …And Once Again 2010 105 min Drama NaN
[88]: # @title Year vs Rating
[14]: movie_data.isnull().sum()
[14]: Name 0
Year 0
Duration 8269
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64
[15]: movie_data['Duration'] = movie_data['Duration'].str.extract(r'(\d+)').astype(float)
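For instance, a raw value such as '109 min' becomes 109.0 after this extraction; a quick illustrative check (not part of the original notebook):
[ ]: # Illustrative check: pull the numeric part out of a duration string
sample = pd.Series(['109 min', '90 min'])
print(sample.str.extract(r'(\d+)').astype(float))  # -> 109.0, 90.0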
[16]: movie_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 15509 non-null object
1 Year 15509 non-null int64
2 Duration 7240 non-null float64
3 Genre 13632 non-null object
4 Rating 7919 non-null float64
5 Votes 7920 non-null object
6 Director 14984 non-null object
7 Actor 1 13892 non-null object
8 Actor 2 13125 non-null object
9 Actor 3 12365 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 1.2+ MB
movie_data["Duration"].fillna(movie_data["Duration"].mean(), inplace=True)
[19]: movie_data.isnull().sum()
[19]: Name 0
Year 0
Duration 0
Genre 1877
Rating 7590
Votes 7589
Director 525
Actor 1 1617
Actor 2 2384
Actor 3 3144
dtype: int64
[20]: movie_data.head()
Actor 3
0 Rajendra Bhatia
1 Arvind Jangid
2 Roy Angana
3 Siddhant Kapoor
4 Antara Mali
[21]: movie_data["Genre"].isnull().sum()
[21]: 1877
[24]: genres = movie_data['Genre'].str.split(',', expand=True)
genres.head(5)
[24]: 0 1 2
0 Drama None None
1 Drama None None
2 Drama Musical None
3 Comedy Romance None
4 Drama None None
[25]: genre_counts = {}
for genre in genres.values.flatten():
    if pd.notna(genre):  # skip missing values (the original check only excluded None)
        if genre in genre_counts:
            genre_counts[genre] += 1
        else:
            genre_counts[genre] = 1
# Print the per-genre counts shown below
for genre, count in sorted(genre_counts.items()):
    print(f"{genre}: {count}")
Action: 56
Adventure: 289
Biography: 53
Comedy: 468
Crime: 863
Drama: 2726
Family: 782
Fantasy: 266
History: 178
Horror: 121
Music: 74
Musical: 424
Mystery: 365
News: 9
Reality-TV: 1
Romance: 1687
Sci-Fi: 48
Short: 1
Sport: 59
Thriller: 927
War: 39
Western: 5
Action: 3487
Adventure: 252
Animation: 125
Biography: 155
Comedy: 1561
Crime: 459
Documentary: 383
Drama: 4517
Family: 161
Fantasy: 192
History: 29
Horror: 403
Music: 16
Musical: 165
Mystery: 148
Reality-TV: 2
Romance: 762
Sci-Fi: 10
Sport: 11
Thriller: 786
War: 8
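Most genres appear twice in this listing (e.g. Action: 56 and Action: 3487) because split(',') leaves a leading space on every genre after the first. A more direct way to get clean per-genre counts from the same genres frame (a sketch, not part of the original notebook):
[ ]: # Count each individual genre once, stripping the leading spaces left by split(',')
clean_genre_counts = genres.stack().str.strip().value_counts()
print(clean_genre_counts.head(10))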
[26]: Genre
Drama 2780
Action 1289
Thriller 779
Romance 708
Drama, Romance 524
Name: count, dtype: int64
# 'data' is assumed to be the genre counts computed above (its defining cell is not in the export)
genre_counts_df = pd.DataFrame(list(genre_counts.items()), columns=['Genre', 'Count'])
genre_counts_df.drop(columns=['Genre'], inplace=True)
[30]: movie_data.isnull().sum()
[30]: Name 0
Year 0
Duration 0
Genre 0
Rating 5815
Votes 5814
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64
movie_data["Rating"].fillna(movie_data["Rating"].mean(), inplace=True)
[32]: movie_data.isnull().sum()
[32]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 5814
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64
[33]: movie_data["Votes"].info()
<class 'pandas.core.series.Series'>
Index: 13632 entries, 0 to 15508
Series name: Votes
Non-Null Count Dtype
-------------- -----
7818 non-null object
dtypes: object(1)
memory usage: 213.0+ KB
print(movie_data['Votes'].dtype)
print(movie_data['Votes'].count())
float64
7818
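The cell that converted Votes from object to float64 is not included in the export; a minimal sketch of a conversion that would produce this dtype, assuming the raw values are numeric strings with thousands separators such as '1,086':
[ ]: # Assumed conversion: drop thousands separators and cast to float
movie_data['Votes'] = movie_data['Votes'].str.replace(',', '', regex=False).astype(float)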
[35]: movie_data['Votes'].info()
<class 'pandas.core.series.Series'>
Index: 13632 entries, 0 to 15508
Series name: Votes
Non-Null Count Dtype
-------------- -----
7818 non-null float64
dtypes: float64(1)
memory usage: 213.0 KB
movie_data['Votes'].fillna(movie_data['Votes'].mean(), inplace=True)
[37]: movie_data.isnull().sum()
[37]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 501
Actor 1 1225
Actor 2 1704
Actor 3 2264
dtype: int64
[38]: movie_data.head()
Actor 2 Actor 3
0 Birbal Rajendra Bhatia
1 Vivek Ghamande Arvind Jangid
2 Plabita Borthakur Roy Angana
3 Ishita Raj Siddhant Kapoor
4 Rituparna Sengupta Antara Mali
[41]: movie_data.isnull().sum()
[41]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64
actors.head(5)
[44]: movie_data.isnull().sum()
[44]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64
[45]: movie_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13131 entries, 0 to 15508
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 13131 non-null object
1 Year 13131 non-null int64
2 Duration 13131 non-null float64
3 Genre 13131 non-null object
4 Rating 13131 non-null float64
5 Votes 13131 non-null float64
6 Director 13131 non-null object
7 Actor 1 12407 non-null object
8 Actor 2 11928 non-null object
9 Actor 3 11368 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 1.1+ MB
[46]: movie_data.isnull().sum()
[46]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 724
Actor 2 1203
Actor 3 1763
dtype: int64
[48]: movie_data.isnull().sum()
[48]: Name 0
Year 0
Duration 0
Genre 0
Rating 0
Votes 0
Director 0
Actor 1 0
Actor 2 0
Actor 3 0
dtype: int64
3 Data Visualization
[49]: import matplotlib.pyplot as plt
plt.hist(movie_data['Rating'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Ratings')
plt.show()
[53]: import numpy as np
• This code snippet filters the numeric columns from the movie dataset and computes the
correlation matrix among them. The correlation matrix quantifies the linear relationships
between pairs of numeric features.
• The heatmap generated from the correlation matrix visually represents these correlations
using a color gradient. Darker colors indicate stronger positive or negative correlations, while
numerical correlation values are annotated within each cell.
[54]: numeric_cols = movie_data.select_dtypes(include=[np.number])
corrmat = numeric_cols.corr()
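The cell that renders the heatmap described above is not included in the export; a minimal sketch of how it could look, assuming seaborn is used:
[ ]: import seaborn as sns
# Render the correlation matrix as an annotated heatmap (sketch; plotting cell not exported)
plt.figure(figsize=(8, 6))
sns.heatmap(corrmat, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numeric Features')
plt.show()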
[55]: genre_counts = movie_data['Genre'].value_counts().head(10)
plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar', color='lightcoral')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.title('Top 10 Movie Genres')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
This code segment casts the ‘Year’ column to integer so it is treated as a numeric variable, then counts the number of movie releases per year and sorts the results by year.
It then plots the releases over time as a line chart with a marker at each data point: the x-axis shows the year and the y-axis the number of movies released. Axis labels, a title, and a grid are added for clarity.
[56]: movie_data['Year'] = movie_data['Year'].astype(int)
movie_counts_by_year = movie_data['Year'].value_counts().sort_index()
plt.figure(figsize=(12, 6))
plt.plot(movie_counts_by_year.index, movie_counts_by_year.values, marker='o', linestyle='-', color='mediumseagreen')
plt.xlabel('Year')
plt.ylabel('Number of Movies Released')
plt.title('Movie Releases Over Time')
plt.grid(True)
plt.show()
[60]: import plotly.express as px
# 'fig' is assumed to be a Plotly figure built in a cell that is not included in this export
fig.show()
plt.figure(figsize=(10, 6))
plt.hist(movie_data['Duration'], bins=30, color='lightblue', edgecolor='black')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.title('Histogram of Movie Durations')
plt.show()
[65]: !pip install wordcloud
(pip output: wordcloud and its dependencies already satisfied in /usr/local/lib/python3.10/dist-packages)
from wordcloud import WordCloud
# Build the word cloud from director names (assumed; the generating cell is not in the export)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(movie_data['Director'].astype(str)))
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Directors')
plt.show()
# 'top_20_genre_counts' and 'colors' come from a cell not in the export; assumed here to be
# the 20 most frequent genre strings and a matplotlib colour list
top_20_genre_counts = movie_data['Genre'].value_counts().head(20)
colors = plt.cm.tab20.colors
plt.figure(figsize=(8, 8))
plt.pie(top_20_genre_counts, labels=top_20_genre_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()
X = movie_data[['Duration', 'Votes']]
y = movie_data['Rating']
Three ensemble models are trained on a train/test split of these features and evaluated using mean squared error (MSE), along with mean absolute error (MAE) and the R-squared score.
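The cells defining the split and the base regressors are not included in the export; a minimal sketch of the assumed setup (split ratio and hyperparameters are illustrative):
[ ]: from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor

# Assumed setup: hold out 20% of the data for testing and fit the two base regressors
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

adaboost_regressor = AdaBoostRegressor(n_estimators=100, random_state=42)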
rf_predictions = rf_regressor.predict(X_test)
adaboost_regressor.fit(X_train, y_train)
adaboost_predictions = adaboost_regressor.predict(X_test)
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

base_estimators = [
    ('random_forest', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('adaboost', AdaBoostRegressor(n_estimators=100, random_state=42))
]
stacking_regressor = StackingRegressor(estimators=base_estimators, final_estimator=LinearRegression())
stacking_regressor.fit(X_train, y_train)
stacking_predictions = stacking_regressor.predict(X_test)
Stacking Metrics:
Mean Squared Error: 1.8230041471178846
AdaBoost Metrics:
• Mean Squared Error: 1.914
• Mean Absolute Error: 1.137
• R-squared Score: -0.029
Stacking Metrics:
• Mean Squared Error: 1.823
• Mean Absolute Error: 1.088
• R-squared Score: 0.020
print("Random Forest Metrics:")
print(f"Mean Squared Error: {rf_mse}")
print(f"Mean Absolute Error: {rf_mae}")
print(f"R-squared Score: {rf_r2}")
print()
print("AdaBoost Metrics:")
print(f"Mean Squared Error: {adaboost_mse}")
print(f"Mean Absolute Error: {adaboost_mae}")
print(f"R-squared Score: {adaboost_r2}")
print()
# Evaluate Stacking
stacking_mse = mean_squared_error(y_test, stacking_predictions)
stacking_mae = mean_absolute_error(y_test, stacking_predictions)
stacking_r2 = r2_score(y_test, stacking_predictions)
print("Stacking Metrics:")
print(f"Mean Squared Error: {stacking_mse}")
print(f"Mean Absolute Error: {stacking_mae}")
print(f"R-squared Score: {stacking_r2}")
AdaBoost Metrics:
Mean Squared Error: 1.9140002337811928
Mean Absolute Error: 1.1371216538097293
R-squared Score: -0.028746739563326296
Stacking Metrics:
Mean Squared Error: 1.8230041471178846
Mean Absolute Error: 1.0876853487767815
R-squared Score: 0.020162307476320973
Conclusion
Mean Squared Error (MSE): Lower values indicate better performance. Stacking has the lowest
MSE (1.823) followed by AdaBoost (1.914) and then Random Forest (2.275).
Mean Absolute Error (MAE): Lower values indicate better performance. Stacking has the
lowest MAE (1.088), followed by AdaBoost (1.137) and then Random Forest (1.186).
R-squared Score: Higher values indicate a better fit to the data. Stacking has the highest
R-squared score (0.020), followed by AdaBoost (-0.029) and then Random Forest (-0.223).
Note that scores this close to (or below) zero mean none of the models explains much of the
variance in ratings from Duration and Votes alone.
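For reference, with $y_i$ the true ratings, $\hat{y}_i$ the predictions, $\bar{y}$ the mean rating, and $n$ the number of test samples:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$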
Based on these metrics, Stacking appears to perform the best among the three ensemble tech-
niques for this regression task. It has the lowest MSE and MAE, and the highest R-squared score,
indicating the best overall performance.