Ads - Phase 5

The document discusses analyzing IMDB movie data to understand factors that contribute to a movie's success. It describes collecting IMDB data, preprocessing the data, exploring it through statistical analysis and visualization, and providing insights. A link to the Netflix movies dataset used is also included.

Uploaded by

secondyear2325

IBM NAANMUDHALVAN

PHASE 5
DOMAIN - INTERNET MOVIE DATABASE
SCORES PREDICTION
PROBLEM DEFINITION:
You have multiple rows of data listing movies that were among the top 100 rated
movies in the past. The data contains information about actors, genres, voters, and
other attributes of the movies.

DESIGN THINKING:
• Data Collection: Collect IMDB data from reputable sources, covering fields such as
genre, premiere date, runtime, and language.

• Data Preprocessing: Clean and preprocess the data, handle missing values, and
convert categorical features into numerical representations.

• Exploratory Data Analysis: Explore the data to understand its characteristics, identify
trends, and outliers.

• Statistical Analysis: Perform statistical tests to analyze movie ratings, premiere
dates, and languages.

• Visualization: Create visualizations (e.g., bar plots, line charts, heatmaps) to present key
findings and insights.

• Insights and Recommendations: Provide actionable insights and recommendations
based on the analysis to support decisions about movie ratings and premiere features.

PROBLEM STATEMENT:
With so much movie information available, it is interesting to understand which
factors make one movie more successful than another. We therefore analyze what kinds of
movies are more successful, in other words, which receive higher IMDB scores. In this
project, we take the IMDB score as the response variable and focus on predicting it. The
results can help film companies understand what goes into making a commercially
successful movie.

SOLUTION:

DATA COLLECTION:

The dataset we chose to analyze, Netflix originals IMDb scores, contains
approximately 600 movies (584 rows), each described by a title and 5 variables:
genre, runtime, premiere date, IMDB score, and language.

DATA PREPROCESSING:

Clean and preprocess the data by handling missing values and converting categorical
features into numerical representations. Also remove punctuation marks, HTML tags, URLs,
and successive whitespace, convert the text to lower case, and strip whitespace from the
beginning and the end of the reviews.
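
The text-cleaning steps listed above can be sketched as follows. This is a minimal illustration rather than the project's own code; the helper name clean_text, the regular expressions, and the example string are assumptions.

```python
import re
import string


def clean_text(text: str) -> str:
    """Apply the cleaning steps described above to one string (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", text)                # drop HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = text.lower()                                 # convert to lower case
    # remove punctuation marks
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text)                    # collapse successive whitespace
    return text.strip()                                 # trim both ends


cleaned = clean_text("  <b>Great</b> movie!!  See https://example.com  now. ")
print(cleaned)
```

URLs are removed before punctuation so that the URL pattern still sees the "://" it relies on.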

EXPLORATORY DATA ANALYSIS:


Explore the data to understand its characteristics: inspect the dataset, find any
missing values, categorize the values, find the shape of the dataset, find relationships
between columns, and locate any outliers.
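
One common way to locate outliers is the interquartile-range (IQR) rule. The sketch below is an assumption about how that step could be done, not the report's method; the helper name iqr_outliers is hypothetical, and the runtimes are the ten values shown later in this report plus the reported minimum (4) and maximum (209).

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5):
    """Return (lower, upper, flagged) using the k * IQR rule (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = values[(values < lower) | (values > upper)]
    return lower, upper, flagged


# Runtimes from the rows displayed in this report, plus the reported min and max.
runtime = pd.Series([58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 4, 209], name="Runtime")

lower, upper, flagged = iqr_outliers(runtime)
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print("flagged values:", flagged.tolist())
```

Whether anything is flagged depends on the sample; on the full 584-row column the bounds would differ from those of this small subset.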

STATISTICAL ANALYSIS:
Perform statistical tests to analyze how IMDB scores relate to premiere date, runtime,
and language.
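
As a starting point for this kind of analysis, the sketch below computes a Pearson correlation between runtime and score and a mean score per language. It uses only the ten rows displayed in this report (the ten lowest-rated films), so the numbers are illustrative and not representative of the full dataset; the variable names are assumptions.

```python
import pandas as pd

# The ten rows shown in this report (not the full 584-row dataset).
sample = pd.DataFrame({
    "Runtime": [58, 81, 79, 94, 90, 147, 112, 149, 73, 139],
    "IMDB Score": [2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1],
    "Language": ["English/Japanese", "Spanish", "Italian", "English", "Hindi",
                 "Hindi", "Turkish", "English", "English", "Hindi"],
})

# Pearson correlation is the default method of Series.corr.
r = sample["Runtime"].corr(sample["IMDB Score"])
print(f"Pearson r between Runtime and IMDB Score: {r:.2f}")

# Average score per language in this sample.
mean_by_language = sample.groupby("Language")["IMDB Score"].mean()
print(mean_by_language)
```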

DATA VISUALIZATION:
Visualizations such as histograms, box plots, scatter plots, line plots, heat
maps, and bar charts help identify patterns, trends, and relationships within the
data and present key findings and insights.

OVERVIEW:
DATASET LINK:
https://www.kaggle.com/datasets/luiscorter/netflix-original-films-imdb-scores
ABOUT:
• Loading a dataset
• Preprocessing dataset
• Data cleaning
• Data transformation
• Data reduction
PROGRAM:
LOAD THE DATASET:
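
The loading code is not reproduced in this copy of the report. A minimal sketch, assuming pandas and a CSV downloaded from the Kaggle link above, might look like this; the file name in the comment is an assumption, and a two-row in-memory CSV stands in for the real file so the example is self-contained.

```python
import io

import pandas as pd

# With the downloaded file, the call would be something like:
#   a = pd.read_csv("NetflixOriginals.csv")   # file name is an assumption
# Two rows copied from this report stand in for the real file here.
csv_text = (
    "Title,Genre,Premiere,Runtime,IMDB Score,Language\n"
    'Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese\n'
    'Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish\n'
)

a = pd.read_csv(io.StringIO(csv_text))  # same variable name as the rest of the report
print(a.head())
print(a.shape)
```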

OUTPUT:

DATA PREPROCESSING:

   Title                            Genre                    Premiere           Runtime  IMDB Score  Language
0  Enter the Anime                  Documentary              August 5, 2019     58       2.5         English/Japanese
1  Dark Forces                      Thriller                 August 21, 2020    81       2.6         Spanish
2  The App                          Science fiction/Drama    December 26, 2019  79       2.6         Italian
3  The Open House                   Horror thriller          January 19, 2018   94       3.2         English
4  Kaali Khuhi                      Mystery                  October 30, 2020   90       3.4         Hindi
5  Drive                            Action                   November 1, 2019   147      3.5         Hindi
6  Leyla Everlasting                Comedy                   December 4, 2020   112      3.7         Turkish
7  The Last Days of American Crime  Heist film/Thriller      June 5, 2020       149      3.7         English
8  Paradox                          Musical/Western/Fantasy  March 23, 2018     73       3.9         English
9  Sardar Ka Grandson               Comedy                   May 18, 2021       139      4.1         Hindi

a.shape
(584, 6)

a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Title       584 non-null    object
 1   Genre       584 non-null    object
 2   Premiere    584 non-null    object
 3   Runtime     584 non-null    int64
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB

a.nunique()
Title         584
Genre         115
Premiere      390
Runtime       124
IMDB Score     54
Language       38
dtype: int64

a.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64

a.duplicated().any()

False

a.agg(['min','max'])
     Title         Genre         Premiere           Runtime  IMDB Score  Language
min  #REALITYHIGH  Action        April 1, 2021      4        2.5         Bengali
max  Òlòtūré       Zombie/Heist  September 9, 2020  209      9.0         Turkish

import numpy as np
a.replace(20.0,np.nan)

     Title                                        Genre                  Premiere           Runtime  IMDB Score  Language
0    Enter the Anime                              Documentary            August 5, 2019     58       2.5         English/Japanese
1    Dark Forces                                  Thriller               August 21, 2020    81       2.6         Spanish
2    The App                                      Science fiction/Drama  December 26, 2019  79       2.6         Italian
3    The Open House                               Horror thriller        January 19, 2018   94       3.2         English
4    Kaali Khuhi                                  Mystery                October 30, 2020   90       3.4         Hindi
...  ...                                          ...                    ...                ...      ...         ...
579  Taylor Swift: Reputation Stadium Tour        Concert Film           December 31, 2018  125      8.4         English
580  Winter on Fire: Ukraine's Fight for Freedom  Documentary            October 9, 2015    91       8.4         English/Ukranian/Russian
581  Springsteen on Broadway                      One-man show           December 16, 2018  153      8.5         English
582  Emicida: AmarElo - It's All For Yesterday    Documentary            December 8, 2020   89       8.6         Portuguese
583  David Attenborough: A Life on Our Planet     Documentary            October 4, 2020    83       9.0         English

584 rows × 6 columns


a.isnull()
a.notnull()

     Title  Genre  Premiere  Runtime  IMDB Score  Language
0    True   True   True      True     True        True
1    True   True   True      True     True        True
2    True   True   True      True     True        True
3    True   True   True      True     True        True
4    True   True   True      True     True        True
...  ...    ...    ...       ...      ...         ...
579  True   True   True      True     True        True
580  True   True   True      True     True        True
581  True   True   True      True     True        True
582  True   True   True      True     True        True
583  True   True   True      True     True        True

584 rows × 6 columns


a.tail()
     Title                                        Genre         Premiere           Runtime  IMDB Score  Language
579  Taylor Swift: Reputation Stadium Tour        Concert Film  December 31, 2018  125      8.4         English
580  Winter on Fire: Ukraine's Fight for Freedom  Documentary   October 9, 2015    91       8.4         English/Ukranian/Russian
581  Springsteen on Broadway                      One-man show  December 16, 2018  153      8.5         English
582  Emicida: AmarElo - It's All For Yesterday    Documentary   December 8, 2020   89       8.6         Portuguese
583  David Attenborough: A Life on Our Planet     Documentary   October 4, 2020    83       9.0         English

FEATURE ENGINEERING:
• Data science is not a field where theoretical understanding alone is enough to start a
career. Success depends largely on the projects you complete and the practice you put
in. Feature engineering is a very important aspect of machine learning and data
science and should never be ignored. Its main goal is to get the best results from
the algorithms.

• Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive model
using machine learning or statistical modelling. The goal of feature engineering and
selection is to improve the performance of machine learning (ML) algorithms.
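
To make this concrete for the columns in this dataset, the sketch below derives numeric features from the Premiere and Language columns. It is an illustration under stated assumptions, not the report's code: the new column names (PremiereYear, PremiereMonth, the Lang prefix) are hypothetical, and the date format "%B %d, %Y" is assumed from the rows shown in this report.

```python
import pandas as pd

# Three rows copied from this report.
sample = pd.DataFrame({
    "Premiere": ["August 5, 2019", "August 21, 2020", "December 26, 2019"],
    "Runtime": [58, 81, 79],
    "Language": ["English/Japanese", "Spanish", "Italian"],
})

# Parse the premiere string and derive numeric calendar features.
premiere = pd.to_datetime(sample["Premiere"], format="%B %d, %Y")
features = pd.DataFrame({
    "Runtime": sample["Runtime"],
    "PremiereYear": premiere.dt.year,
    "PremiereMonth": premiere.dt.month,
})

# One-hot encode the categorical Language column.
language_dummies = pd.get_dummies(sample["Language"], prefix="Lang")
features = pd.concat([features, language_dummies], axis=1)
print(features)
```

Rows whose dates do not match the assumed format would raise an error here; passing errors="coerce" to pd.to_datetime turns them into missing values instead.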
SPLITTING DATA:
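# Features and target. Note: 'IMDB Score' is in x and is also the target y,
# so the model sees the answer among its inputs (target leakage).
x = a[['Runtime', 'IMDB Score']]
y = a['IMDB Score']

from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
x_train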

Runtime IMDB Score

464 78 7.1

193 96 5.9

453 47 7.1

43 106 4.8

41 93 4.7

... ... ...

335 120 6.5

125 103 5.5

481 136 7.2

559 135 7.7

340 101 6.5

467 rows × 2 columns


x_test

MODEL EVALUATION:

LINEAR REGRESSION:
from sklearn import linear_model  # provides LinearRegression
from sklearn import metrics  # error and accuracy metrics for the model

lin = linear_model.LinearRegression()
# train the model on the training set
lin.fit(x_train, y_train)

# R^2 on each split; both are 1.0 below because x contains the target column
lin_score_train = lin.score(x_train, y_train)
lin_score_test = lin.score(x_test, y_test)
print("Training score: ", lin_score_train)
print("Testing score: ", lin_score_test)
Training score: 1.0
Testing score: 1.0
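
A score of exactly 1.0 on both splits is what a linear model produces when the target column is among its inputs. A minimal sketch of a split that keeps the target out of the features is shown below. It uses only the ten rows displayed in this report, so the resulting scores are illustrative; the variable names and the fixed random_state are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The ten rows shown in this report (not the full dataset).
sample = pd.DataFrame({
    "Runtime": [58, 81, 79, 94, 90, 147, 112, 149, 73, 139],
    "IMDB Score": [2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1],
})

X = sample[["Runtime"]]   # predictors only; the target is excluded
y = sample["IMDB Score"]  # response variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on the training rows
test_r2 = model.score(X_test, y_test)     # R^2 on the held-out rows
print("Training R^2:", train_r2)
print("Testing R^2:", test_r2)
```

With a single predictor and so few rows the scores will be far from 1.0, which is expected: the point is only that the evaluation no longer sees the answer.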

RIDGE REGRESSION:
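from sklearn.linear_model import Ridge
# perform_cross_validation, X, y and num_folds are not defined in this report
ridge_model = Ridge(alpha=1.0)
ridge_mse, ridge_rmse, ridge_mae, ridge_r2 = perform_cross_validation(ridge_model, X, y, num_folds)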
print("Ridge Regression:")
print(f"Average MSE: {np.mean(ridge_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(ridge_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(ridge_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(ridge_r2) * 100:.2f}%")
print("\n")
Ridge Regression:
Average MSE: 18.89%
Average RMSE: 11.01%
Average MAE: 8.38%
Average R-squared: 89.54%
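
The helper perform_cross_validation used in this section is not shown in the report. One plausible reconstruction, assuming scikit-learn's KFold and standard regression metrics, is sketched below; the body of the function, the shuffling, and the synthetic demo data are all assumptions and will not reproduce the figures printed above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold


def perform_cross_validation(model, X, y, num_folds):
    """Return per-fold MSE, RMSE, MAE and R^2 lists (hypothetical reconstruction)."""
    X, y = np.asarray(X), np.asarray(y)
    mse, rmse, mae, r2 = [], [], [], []
    folds = KFold(n_splits=num_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        model.fit(X[train_idx], y[train_idx])      # fit on the training folds
        pred = model.predict(X[test_idx])          # predict the held-out fold
        fold_mse = mean_squared_error(y[test_idx], pred)
        mse.append(fold_mse)
        rmse.append(np.sqrt(fold_mse))
        mae.append(mean_absolute_error(y[test_idx], pred))
        r2.append(r2_score(y[test_idx], pred))
    return mse, rmse, mae, r2


# Synthetic stand-in data (not the movie dataset), only to exercise the helper.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

mse, rmse, mae, r2 = perform_cross_validation(Ridge(alpha=1.0), X_demo, y_demo, num_folds=5)
print(f"Average R-squared: {np.mean(r2) * 100:.2f}%")
```

Returning one value per fold keeps the helper compatible with the np.mean(...) calls in the report's print statements.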

DECISION TREE:
from sklearn.tree import DecisionTreeRegressor
# Decision Trees
tree_model = DecisionTreeRegressor(max_depth=None, random_state=0)
tree_mse, tree_rmse, tree_mae, tree_r2 = perform_cross_validation(tree_model, X, y, num_folds)
print("Decision Trees:")
print(f"Average MSE: {np.mean(tree_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(tree_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(tree_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(tree_r2) * 100:.2f}%")
print("\n")
Decision Trees:
Average MSE: 16.90%
Average RMSE: 10.45%
Average MAE: 7.63%
Average R-squared: 90.55%
RANDOM FOREST:
from sklearn.ensemble import RandomForestRegressor
forest_model = RandomForestRegressor(n_estimators=100, random_state=0)
forest_mse, forest_rmse, forest_mae, forest_r2 = perform_cross_validation(forest_model, X, y, num_folds)
print("Random Forest:")
print(f"Average MSE: {np.mean(forest_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(forest_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(forest_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(forest_r2) * 100:.2f}%")
Random Forest:
Average MSE: 10.40%
Average RMSE: 8.11%
Average MAE: 6.03%
Average R-squared: 94.23%

DATA VISUALIZATION:

LINE PLOT:

import numpy as np
import matplotlib.pyplot as plt
df=a['Runtime'].head(5)
df1=a['IMDB Score'].head(5)
fig =plt.figure(figsize=(5,5))
plt.plot(df, df1,color='red')
plt.title("LINEPLOT")
plt.xlabel("RUNTIME")
plt.ylabel("IMDB SCORE")
plt.show()
SCATTER PLOT:

import matplotlib.pyplot as plt


df=a['Runtime'].head()
df1=a['IMDB Score'].head()
fig =plt.figure(figsize=(5,5))
plt.scatter(df, df1,marker='*',color='red')
plt.show()
BAR CHART:
import numpy as np
import matplotlib.pyplot as plt
df=a['Genre'].head(5)
df1=a['Runtime'].head(5)
fig = plt.figure(figsize =(8, 7))
plt.bar(df, df1)
plt.title("BARPLOT")
plt.xlabel("GENRE")
plt.ylabel("RUNTIME")
plt.show()

PIE CHART:
df=a['IMDB Score'].head(10)
df1=a['Runtime'].head(10)
fig = plt.figure(figsize =(4,4))
plt.pie(df, labels= df1)
plt.title("PIE CHART")
plt.show()
HEAT MAP:
import numpy as np
import matplotlib.pyplot as plt
# random demo values, not the movie data; named 'data' so the DataFrame 'a' is kept
data = np.random.random((10, 12))
plt.imshow(data)
plt.title( "2-D Heat Map" )
plt.show()
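
The heat map above is drawn from random values. A heat map of the correlations between the dataset's numeric columns could be sketched as follows. This is an assumption about what such a plot might look like, using only the ten rows displayed in this report; the output file name and the colour map are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows shown in this report (not the full dataset).
sample = pd.DataFrame({
    "Runtime": [58, 81, 79, 94, 90, 147, 112, 149, 73, 139],
    "IMDB Score": [2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1],
})
corr = sample.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax)
ax.set_title("Correlation Heat Map")
fig.savefig("correlation_heatmap.png")  # in a notebook, plt.show() would be used instead
```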

HISTOGRAM:
import matplotlib.pyplot as plt
import numpy as np
# random demo values, not the movie data; named 'samples' so the features 'x' are kept
samples = np.random.normal(170, 10, 250)
plt.hist(samples)
plt.show()
CONCLUSION:
Predicting IMDB scores helps companies understand how much they may need to
spend every year. In this project we can also accurately predict how people
will rate (like) a movie from its general characteristics. Using the Netflix
originals dataset, we clearly outlined the problem statement, the design
thinking process, the data preprocessing steps, the feature extraction
techniques, model training, the machine learning algorithms, the data
visualization techniques, and the evaluation metrics.

NAME : ABINAYA B
COLLEGE CODE : 4204
REGISTER NO : 420421104003
