Ads - Phase 5

The document discusses analyzing IMDB movie data to understand factors that contribute to a movie's success. It describes collecting IMDB data, preprocessing the data, exploring it through statistical analysis and visualization, and providing insights. A link to the Netflix movies dataset used is also included.

Uploaded by

secondyear2325

IBM NAANMUDHALVAN

PHASE 5
DOMAIN - INTERNET MOVIE DATABASE
SCORES PREDICTION
PROBLEM DEFINITION:
You have multiple rows of data listing movies that were among the top 100 rated
movies in the past. The data contains information about actors, genres, voters, and
other attributes of the movies.

DESIGN THINKING:
• Data Collection: Collect IMDB data from a reputable source, including genre, premiere
date, runtime, and language.

• Data Preprocessing: Clean and preprocess the data, handle missing values, and
convert categorical features into numerical representations.

• Exploratory Data Analysis: Explore the data to understand its characteristics and
identify trends and outliers.

• Statistical Analysis: Perform statistical tests on movie rating, premiere date, and
language.

• Visualization: Create visualizations (e.g., bar plots, line charts, heatmaps) to present key
findings and insights.

• Insights and Recommendations: Provide actionable insights and recommendations
based on the analysis to inform decisions about movie ratings and premiere features.

PROBLEM STATEMENT:
Given the large amount of movie information available, it is interesting to understand
which factors make one movie more successful than another. We therefore analyze which
kinds of movies are more successful, in other words, which get a higher IMDB score. In
this project, we take the IMDB score as the response variable and focus on predicting it.
The results can help film companies understand what makes a movie a commercial success.

SOLUTION:

DATA COLLECTION:

The dataset that we chose to analyze, the Netflix originals IMDB scores dataset,
contains 584 movies and 6 columns: title, genre, premiere date, runtime,
IMDB score, and language.

DATA PREPROCESSING:

Clean and preprocess the data by handling missing values and converting categorical
features into numerical representations. For text fields, also remove punctuation marks,
HTML tags, URLs, and successive whitespace, convert the text to lower case, and strip
whitespace from the beginning and the end.
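
The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the sample rows are made up, the helper name clean_text is hypothetical, and median imputation and one-hot encoding are assumed choices among several reasonable ones.

```python
import re
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    "Title": ["  <b>Dark   Forces!</b> ", "The App", "Drive"],
    "Language": ["Spanish", "Italian", None],
    "Runtime": [81.0, None, 147.0],
})

def clean_text(text: str) -> str:
    """Drop HTML tags, URLs and punctuation, collapse whitespace, lower-case."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation marks
    text = re.sub(r"\s+", " ", text)            # successive whitespace
    return text.strip().lower()

df["Title"] = df["Title"].map(clean_text)

# Missing values: median for numeric columns, an explicit label for categories.
df["Runtime"] = df["Runtime"].fillna(df["Runtime"].median())
df["Language"] = df["Language"].fillna("Unknown")

# Categorical -> numerical representation via one-hot encoding.
encoded = pd.get_dummies(df, columns=["Language"])
print(encoded)
```

One-hot encoding adds one column per category, so for a column with many distinct values (the dataset has 115 genres) grouping rare categories first may be worth considering.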

EXPLORATORY DATA ANALYSIS:

Explore the data to understand its characteristics: inspect the dataset, find any
missing values, categorize the values, check the shape of the dataset, find relationships
between columns, and locate any outliers.
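
A minimal sketch of these exploration steps is shown below. The runtimes are made up for illustration, and the 1.5 × IQR rule (Tukey's rule) is an assumed outlier criterion; the report does not say which one was used.

```python
import pandas as pd

# Illustrative runtimes in minutes (made up for this sketch).
df = pd.DataFrame({"Runtime": [58, 81, 79, 94, 90, 4, 112, 209, 73, 95]})

print(df.shape)            # number of rows and columns
print(df.isnull().sum())   # missing values per column
print(df.describe())       # summary statistics

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["Runtime"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["Runtime"] < q1 - 1.5 * iqr) | (df["Runtime"] > q3 + 1.5 * iqr)
outliers = df[mask]
print(outliers)
```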

STATISTICAL ANALYSIS:
Perform statistical tests to analyze how the IMDB score relates to premiere date,
runtime, and language.
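
As a starting point, the relationship between score and the other columns can be summarized with a correlation coefficient and per-group means. The sketch below uses ten rows copied from the tables in this report; these are the lowest-rated rows, so the numbers are not representative of the full dataset. It computes descriptive statistics only, not a formal significance test.

```python
import pandas as pd

# Ten rows copied from the first table in this report.
df = pd.DataFrame({
    "Runtime": [58, 81, 79, 94, 90, 147, 112, 149, 73, 139],
    "IMDB Score": [2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1],
    "Language": ["English/Japanese", "Spanish", "Italian", "English", "Hindi",
                 "Hindi", "Turkish", "English", "English", "Hindi"],
})

# Pearson correlation between runtime and score.
r = df["Runtime"].corr(df["IMDB Score"])

# Mean score per language, highest first.
by_language = df.groupby("Language")["IMDB Score"].mean().sort_values(ascending=False)

print(f"Pearson r = {r:.2f}")
print(by_language)
```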

DATA VISUALIZATION:
Visualizations such as histograms, box plots, scatter plots, line plots, heat
maps, and bar charts help identify patterns, trends, and relationships within the
data and present key findings and insights.

OVERVIEW:
DATASET LINK:
https://www.kaggle.com/datasets/luiscorter/netflix-original-films-imdb-scores
ABOUT:
• Loading a dataset
• Preprocessing dataset
• Data cleaning
• Data transformation
• Data reduction
PROGRAM:
LOAD THE DATASET:
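
The loading code is not reproduced in this report. A minimal sketch, assuming the Kaggle CSV has been downloaded locally, is given below; the file name in the comment is hypothetical, and an in-memory string with three rows copied from this report stands in for the file so the example runs on its own. The DataFrame is named a to match the rest of the report.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV: three rows copied from the table in this report.
csv_text = """Title,Genre,Premiere,Runtime,IMDB Score,Language
Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish
The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian
"""

# With the real file this would be pd.read_csv("NetflixOriginals.csv") (hypothetical path).
a = pd.read_csv(io.StringIO(csv_text))
print(a.head())
```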

OUTPUT:

DATA PREPROCESSING:

   Title                            Genre                    Premiere           Runtime  IMDB Score  Language
0  Enter the Anime                  Documentary              August 5, 2019          58         2.5  English/Japanese
1  Dark Forces                      Thriller                 August 21, 2020         81         2.6  Spanish
2  The App                          Science fiction/Drama    December 26, 2019       79         2.6  Italian
3  The Open House                   Horror thriller          January 19, 2018        94         3.2  English
4  Kaali Khuhi                      Mystery                  October 30, 2020        90         3.4  Hindi
5  Drive                            Action                   November 1, 2019       147         3.5  Hindi
6  Leyla Everlasting                Comedy                   December 4, 2020       112         3.7  Turkish
7  The Last Days of American Crime  Heist film/Thriller      June 5, 2020           149         3.7  English
8  Paradox                          Musical/Western/Fantasy  March 23, 2018          73         3.9  English
9  Sardar Ka Grandson               Comedy                   May 18, 2021           139         4.1  Hindi

a.shape
(584, 6)

a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Title       584 non-null    object
 1   Genre       584 non-null    object
 2   Premiere    584 non-null    object
 3   Runtime     584 non-null    int64
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB

a.nunique()
Title         584
Genre         115
Premiere      390
Runtime       124
IMDB Score     54
Language       38
dtype: int64

a.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64

a.duplicated().any()

False

a.agg(['min','max'])
     Title         Genre         Premiere           Runtime  IMDB Score  Language
min  #REALITYHIGH  Action        April 1, 2021            4         2.5  Bengali
max  Òlòtūré       Zombie/Heist  September 9, 2020      209         9.0  Turkish

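import numpy as np
# Replace every occurrence of the value 20.0 with NaN.
# replace() returns a new DataFrame; a itself is unchanged unless the result is assigned.
a.replace(20.0, np.nan)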

     Title                                        Genre                  Premiere           Runtime  IMDB Score  Language
0    Enter the Anime                              Documentary            August 5, 2019          58         2.5  English/Japanese
1    Dark Forces                                  Thriller               August 21, 2020         81         2.6  Spanish
2    The App                                      Science fiction/Drama  December 26, 2019       79         2.6  Italian
3    The Open House                               Horror thriller        January 19, 2018        94         3.2  English
4    Kaali Khuhi                                  Mystery                October 30, 2020        90         3.4  Hindi
...  ...                                          ...                    ...                    ...         ...  ...
579  Taylor Swift: Reputation Stadium Tour        Concert Film           December 31, 2018      125         8.4  English
580  Winter on Fire: Ukraine's Fight for Freedom  Documentary            October 9, 2015         91         8.4  English/Ukranian/Russian
581  Springsteen on Broadway                      One-man show           December 16, 2018      153         8.5  English
582  Emicida: AmarElo - It's All For Yesterday    Documentary            December 8, 2020        89         8.6  Portuguese
583  David Attenborough: A Life on Our Planet     Documentary            October 4, 2020         83         9.0  English

584 rows × 6 columns

a.isnull()
a.notnull()

     Title  Genre  Premiere  Runtime  IMDB Score  Language
0    True   True   True      True     True        True
1    True   True   True      True     True        True
2    True   True   True      True     True        True
3    True   True   True      True     True        True
4    True   True   True      True     True        True
...  ...    ...    ...       ...      ...         ...
579  True   True   True      True     True        True
580  True   True   True      True     True        True
581  True   True   True      True     True        True
582  True   True   True      True     True        True
583  True   True   True      True     True        True

584 rows × 6 columns

a.tail()
     Title                                        Genre         Premiere           Runtime  IMDB Score  Language
579  Taylor Swift: Reputation Stadium Tour        Concert Film  December 31, 2018      125         8.4  English
580  Winter on Fire: Ukraine's Fight for Freedom  Documentary   October 9, 2015         91         8.4  English/Ukranian/Russian
581  Springsteen on Broadway                      One-man show  December 16, 2018      153         8.5  English
582  Emicida: AmarElo - It's All For Yesterday    Documentary   December 8, 2020        89         8.6  Portuguese
583  David Attenborough: A Life on Our Planet     Documentary   October 4, 2020         83         9.0  English

FEATURE ENGINEERING:
• In data science, theoretical understanding alone is not enough to start a career;
success depends largely on the projects you complete and the practice you put in.
Feature engineering is a very important aspect of machine learning and data science
and should never be ignored. Its main goal is to get the best results from the
algorithms.

• Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive model
using machine learning or statistical modelling. The goal of feature engineering and
selection is to improve the performance of machine learning (ML) algorithms.
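
To make this concrete, the sketch below derives a few numeric features from the text columns of this dataset. It is an illustration only: the three rows are copied from the tables in this report, while the derived column names (PremiereYear, PremiereMonth, GenreCount, PrimaryGenre, IsMultilingual) are hypothetical and not taken from the project code.

```python
import pandas as pd

# Three rows copied from the tables in this report.
df = pd.DataFrame({
    "Genre": ["Documentary", "Science fiction/Drama", "Musical/Western/Fantasy"],
    "Premiere": ["August 5, 2019", "December 26, 2019", "March 23, 2018"],
    "Language": ["English/Japanese", "Italian", "English"],
})

# Dates stored as text become numeric year and month features.
premiere = pd.to_datetime(df["Premiere"], format="%B %d, %Y")
df["PremiereYear"] = premiere.dt.year
df["PremiereMonth"] = premiere.dt.month

# Slash-separated labels: count the tags and keep the first as the primary genre.
df["GenreCount"] = df["Genre"].str.split("/").str.len()
df["PrimaryGenre"] = df["Genre"].str.split("/").str[0]

# Flag movies released in more than one language.
df["IsMultilingual"] = df["Language"].str.contains("/").astype(int)
print(df)
```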
SPLITTING DATA:
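# NOTE: 'IMDB Score' is also the target y, so including it in x leaks the
# answer to the model; this is why the linear regression below scores 1.0.
x = a[['Runtime', 'IMDB Score']]
y = a['IMDB Score']
from sklearn.model_selection import train_test_split
# hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
x_train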

     Runtime  IMDB Score
464       78         7.1
193       96         5.9
453       47         7.1
43       106         4.8
41        93         4.7
...      ...         ...
335      120         6.5
125      103         5.5
481      136         7.2
559      135         7.7
340      101         6.5

467 rows × 2 columns

x_test

MODEL EVALUATION:

LINEAR REGRESSION:
from sklearn import linear_model  # provides LinearRegression
from sklearn import metrics  # error and accuracy metrics for the model
lin = linear_model.LinearRegression()
# train the model on the training set
lin.fit(x_train, y_train)

# score() returns R-squared for a regressor, not classification accuracy
lin_score_train = lin.score(x_train, y_train)
lin_score_test = lin.score(x_test, y_test)
print("Training score: ", lin_score_train)
print("Testing score: ", lin_score_test)
Training score: 1.0
Testing score: 1.0
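
A score of exactly 1.0 on both sets is what one would expect when the target column is also among the input features. The sketch below shows a split that keeps the target out of the features. It uses synthetic random data, not the Netflix dataset, so the printed scores say nothing about the real data; the point is only the shape of the split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (random values, not the real Netflix data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Runtime": rng.integers(40, 160, size=200),
    "IMDB Score": rng.uniform(2.5, 9.0, size=200).round(1),
})

X = df[["Runtime"]]      # features only; the target is NOT included
y = df["IMDB Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"Training R-squared: {train_r2:.3f}")
print(f"Testing R-squared:  {test_r2:.3f}")
```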

RIDGE REGRESSION:
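from sklearn.linear_model import Ridge
# Ridge Regression
# perform_cross_validation, X, y and num_folds are not shown in this report
ridge_model = Ridge(alpha=1.0)
ridge_mse, ridge_rmse, ridge_mae, ridge_r2 = perform_cross_validation(ridge_model, X, y, num_folds)
print("Ridge Regression:")
# errors are reported as a percentage of the mean target value
print(f"Average MSE: {np.mean(ridge_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(ridge_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(ridge_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(ridge_r2) * 100:.2f}%")
print("\n")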
Ridge Regression:
Average MSE: 18.89%
Average RMSE: 11.01%
Average MAE: 8.38%
Average R-squared: 89.54%
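
The helper perform_cross_validation is called in this report but its definition is not shown. One plausible shape for it, assuming plain k-fold cross-validation that returns per-fold lists of MSE, RMSE, MAE and R-squared, is sketched below. The body, the shuffling, and the fixed random seed are assumptions, and the data is synthetic rather than the Netflix dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold

def perform_cross_validation(model, X, y, num_folds):
    """Return per-fold lists of MSE, RMSE, MAE and R-squared (hypothetical helper)."""
    X, y = np.asarray(X), np.asarray(y)
    mse, rmse, mae, r2 = [], [], [], []
    folds = KFold(n_splits=num_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        # Refit on the training folds, then evaluate on the held-out fold.
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_mse = mean_squared_error(y[test_idx], pred)
        mse.append(fold_mse)
        rmse.append(np.sqrt(fold_mse))
        mae.append(mean_absolute_error(y[test_idx], pred))
        r2.append(r2_score(y[test_idx], pred))
    return mse, rmse, mae, r2

# Synthetic data: y depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = 6.0 + X @ np.array([0.8, -0.5]) + rng.normal(scale=0.1, size=120)

ridge_mse, ridge_rmse, ridge_mae, ridge_r2 = perform_cross_validation(
    Ridge(alpha=1.0), X, y, num_folds=5
)
print(f"Average R-squared: {np.mean(ridge_r2) * 100:.2f}%")
```

Returning one value per fold, rather than a single average, lets the caller inspect the spread across folds as well as the mean.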

DECISION TREE:
from sklearn.tree import DecisionTreeRegressor
# Decision Trees
# perform_cross_validation, X, y and num_folds are not shown in this report
tree_model = DecisionTreeRegressor(max_depth=None, random_state=0)
tree_mse, tree_rmse, tree_mae, tree_r2 = perform_cross_validation(tree_model, X, y, num_folds)
print("Decision Trees:")
print(f"Average MSE: {np.mean(tree_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(tree_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(tree_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(tree_r2) * 100:.2f}%")
print("\n")
Decision Trees:
Average MSE: 16.90%
Average RMSE: 10.45%
Average MAE: 7.63%
Average R-squared: 90.55%
RANDOM FOREST:
from sklearn.ensemble import RandomForestRegressor
# Random Forest
# perform_cross_validation, X, y and num_folds are not shown in this report
forest_model = RandomForestRegressor(n_estimators=100, random_state=0)
forest_mse, forest_rmse, forest_mae, forest_r2 = perform_cross_validation(forest_model, X, y, num_folds)
print("Random Forest:")
print(f"Average MSE: {np.mean(forest_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(forest_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(forest_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(forest_r2) * 100:.2f}%")
Random Forest:
Average MSE: 10.40%
Average RMSE: 8.11%
Average MAE: 6.03%
Average R-squared: 94.23%

DATA VISUALIZATION:

LINE PLOT:

import matplotlib.pyplot as plt
# plot runtime against IMDB score for the first five movies
df = a['Runtime'].head(5)
df1 = a['IMDB Score'].head(5)
fig = plt.figure(figsize=(5, 5))
plt.plot(df, df1, color='red')
plt.title("LINEPLOT")
plt.xlabel("RUNTIME")
plt.ylabel("IMDB SCORE")
plt.show()
SCATTER PLOT:

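import matplotlib.pyplot as plt

# first five movies (head() defaults to 5 rows)
df = a['Runtime'].head()
df1 = a['IMDB Score'].head()
fig = plt.figure(figsize=(5, 5))
plt.scatter(df, df1, marker='*', color='red')
plt.title("SCATTERPLOT")
plt.xlabel("RUNTIME")
plt.ylabel("IMDB SCORE")
plt.show()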
BAR CHART:
import matplotlib.pyplot as plt
# runtime of the first five movies, one bar per genre label
df = a['Genre'].head(5)
df1 = a['Runtime'].head(5)
fig = plt.figure(figsize=(8, 7))
plt.bar(df, df1)
plt.title("BARPLOT")
plt.xlabel("GENRE")
plt.ylabel("RUNTIME")
plt.show()

PIE CHART:
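import matplotlib.pyplot as plt
# each wedge is sized by IMDB score and labelled with that movie's runtime
df = a['IMDB Score'].head(10)
df1 = a['Runtime'].head(10)
fig = plt.figure(figsize=(4, 4))
plt.pie(df, labels=df1)
plt.title("PIE CHART")
plt.show()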
HEAT MAP:
import numpy as np
import matplotlib.pyplot as plt
# random demo values, not the movie data; named data so the DataFrame a is not overwritten
data = np.random.random((10, 12))
plt.imshow(data)
plt.title("2-D Heat Map")
plt.show()
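
A heat map is more informative when it shows a statistic of the movie data itself, for example the correlation matrix of the numeric columns. The sketch below assumes that goal; it uses ten (Runtime, IMDB Score) pairs copied from the first rows shown in this report, and the output file name is hypothetical. The Agg backend line is only there so the example runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Ten (Runtime, IMDB Score) pairs copied from the first rows shown in this report.
df = pd.DataFrame({
    "Runtime": [58, 81, 79, 94, 90, 147, 112, 149, 73, 139],
    "IMDB Score": [2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1],
})
corr = df.corr()  # Pearson correlation matrix of the numeric columns

fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)
ax.set_title("Correlation Heat Map")
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```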

HISTOGRAM:
import matplotlib.pyplot as plt
import numpy as np
# 250 random values drawn from a normal distribution (mean 170, std 10); demo data only
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
CONCLUSION:
Predicting IMDB scores can help film companies plan how much they spend
each year. In this project we predict how people will rate a movie from
its general characteristics; the cross-validated models above reach an
average R-squared of roughly 90 to 94%. Using the Netflix originals
dataset, we outlined the problem statement, design thinking process, data
preprocessing steps, feature extraction techniques, model training,
machine learning algorithms, data visualization techniques, and
evaluation metrics.

NAME : ABINAYA B
COLLEGE CODE : 4204
REGISTER NO : 420421104003
