Ads - Phase 5
PHASE 5
DOMAIN - INTERNET MOVIE DATABASE
SCORES PREDICTION
PROBLEM DEFINITION:
The data consists of multiple rows listing movies that were among the top 100 rated
movies in the past. It contains information about actors, genres, voters, and
other attributes of the movies.
DESIGN THINKING:
• Data Collection: Collect IMDB data from a reputable source, covering fields such as
genre, premiere date, runtime, and language.
• Data Preprocessing: Clean and preprocess the data, handle missing values, and
convert categorical features into numerical representations.
• Exploratory Data Analysis: Explore the data to understand its characteristics, identify
trends, and outliers.
• Statistical Analysis: Perform statistical tests to analyze movie ratings, premiere dates, and
language.
• Visualization: Create visualizations (e.g., bar plots, line charts, heatmaps) to present key
findings and insights.
PROBLEM STATEMENT:
Given the large amount of movie information available, it is interesting to understand which
factors make one movie more successful than others. We therefore analyze what kinds of
movies are more successful, in other words, which get higher IMDB scores. In this project,
we take the IMDB score as the response variable and focus on making predictions. The
results can help film companies understand what goes into a commercially successful
movie.
SOLUTION:
DATA COLLECTION:
The dataset we chose for analyzing Netflix originals' IMDB scores
contains approximately 600 different movies. Besides the title, it has 5 variables:
genre, runtime, premiere, IMDB score, and language.
DATA PREPROCESSING:
Clean and preprocess the data by handling missing values and converting categorical features
into numerical representations. For text fields, also remove punctuation marks, HTML tags,
URLs, and successive whitespace, convert the text to lower case, and strip whitespace from the
beginning and the end of the reviews.
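The text-cleaning steps listed above can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `clean_text` and the sample string are assumptions made for the example, not part of the original project code.

```python
import re
import string


def clean_text(text: str) -> str:
    """Apply the cleaning steps described above to a single string."""
    # Remove HTML tags first, so their angle brackets are not treated as punctuation.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs before punctuation is stripped, while they are still recognizable.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove punctuation marks.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Convert to lower case.
    text = text.lower()
    # Collapse successive whitespace, then strip both ends.
    return re.sub(r"\s+", " ", text).strip()


sample_review = "  <b>Great</b> movie!!  See https://example.com  now. "
cleaned = clean_text(sample_review)
print(cleaned)
```

The order matters here: tags and URLs are removed before punctuation, because stripping punctuation first would break the patterns that identify them.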
STATISTICAL ANALYSIS:
Perform statistical tests to analyze how IMDB scores relate to premiere date, runtime,
and language.
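As a hedged starting point for this step, the sketch below computes two descriptive statistics that would typically precede a formal test: the Pearson correlation between runtime and score, and the mean score per language. The seven rows are copied from the output table shown in this report; the variable names are assumptions chosen for the example.

```python
import pandas as pd

# Seven rows taken from the table shown in this report (indices 0, 2-6, 9).
rows = pd.DataFrame(
    {
        "Runtime": [58, 79, 94, 90, 147, 112, 139],
        "IMDB Score": [2.5, 2.6, 3.2, 3.4, 3.5, 3.7, 4.1],
        "Language": [
            "English/Japanese", "Italian", "English",
            "Hindi", "Hindi", "Turkish", "Hindi",
        ],
    }
)

# Pearson correlation between runtime and score (Series.corr defaults to Pearson).
runtime_score_corr = rows["Runtime"].corr(rows["IMDB Score"])

# Average score within each language group.
mean_score_by_language = rows.groupby("Language")["IMDB Score"].mean()

print(f"Runtime vs. score correlation: {runtime_score_corr:.2f}")
print(mean_score_by_language)
```

With only seven rows these numbers are illustrative; the same two calls can be run on the full DataFrame `a`.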
DATA VISUALIZATION:
Visualizations such as histograms, box plots, scatter plots, line plots, heat
maps, and bar charts help identify patterns, trends, and relationships within the
data and present key findings and insights.
OVERVIEW:
DATASET LINK:
https://www.kaggle.com/datasets/luiscorter/netflix-original-films-imdb-scores
ABOUT:
Loading a dataset
Preprocessing dataset
Data cleaning
Data transformation
Data reduction
PROGRAM:
LOAD THE DATASET:
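A minimal sketch of the loading step, assuming the Kaggle CSV has been downloaded locally and is read with pandas. The `load_dataset` helper and the in-memory `SAMPLE_CSV` text are hypothetical stand-ins for illustration; only the column names and the two rows are taken from the output shown in this report.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded Kaggle CSV: two rows copied from
# the output shown in this report. Pass the real file path instead.
SAMPLE_CSV = (
    "Title,Genre,Premiere,Runtime,IMDB Score,Language\n"
    'Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese\n'
    'The App,Science fiction/Drama,"December 26, 2019",79,2.6,Italian\n'
)


def load_dataset(source):
    """Read the CSV into a DataFrame; `source` is a path or a file-like object."""
    return pd.read_csv(source)


a = load_dataset(io.StringIO(SAMPLE_CSV))
print(a.head())
```

The DataFrame is named `a` to match the rest of the program.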
OUTPUT:
DATA PREPROCESSING:
   Title               Genre                  Premiere           Runtime  IMDB Score  Language
0  Enter the Anime     Documentary            August 5, 2019     58       2.5         English/Japanese
1  ...                 ...                    ..., 2020          ...      ...         ...
2  The App             Science fiction/Drama  December 26, 2019  79       2.6         Italian
3  The Open House      Horror thriller        January 19, 2018   94       3.2         English
4  Kaali Khuhi         Mystery                October 30, 2020   90       3.4         Hindi
5  Drive               Action                 November 1, 2019   147      3.5         Hindi
6  Leyla Everlasting   Comedy                 December 4, 2020   112      3.7         Turkish
9  Sardar Ka Grandson  Comedy                 May 18, 2021       139      4.1         Hindi
a.shape
(584, 6)
a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Title 584 non-null object
1 Genre 584 non-null object
2 Premiere 584 non-null object
3 Runtime 584 non-null int64
4 IMDB Score 584 non-null float64
5 Language 584 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB
a.nunique()
Title 584
Genre 115
Premiere 390
Runtime 124
IMDB Score 54
Language 38
dtype: int64
a.isnull().sum()
Title 0
Genre 0
Premiere 0
Runtime 0
IMDB Score 0
Language 0
dtype: int64
a.duplicated().any()
False
a.agg(['min','max'])
Title Genre Premiere Runtime IMDB Score Language
import numpy as np
a.replace(20.0,np.nan)
     Title                                        Genre                  Premiere           Runtime  IMDB Score  Language
0    Enter the Anime                              Documentary            August 5, 2019     58       2.5         English/Japanese
1    ...                                          ...                    ..., 2020          ...      ...         ...
2    The App                                      Science fiction/Drama  December 26, 2019  79       2.6         Italian
3    The Open House                               Horror thriller        January 19, 2018   94       3.2         English
4    Kaali Khuhi                                  Mystery                October 30, 2020   90       3.4         Hindi
579  Taylor Swift: Reputation Stadium Tour        Concert Film           December 31, 2018  125      8.4         English
580  Winter on Fire: Ukraine's Fight for Freedom  Documentary            October 9, 2015    91       8.4         English/Ukranian/Russian
581  Springsteen on Broadway                      One-man show           December 16, 2018  153      8.5         English
FEATURE ENGINEERING:
Theoretical understanding alone is not enough to start a career in data science; success
depends largely on the projects you complete and the practice you put in. Feature
engineering is a very important aspect of machine learning and data science and should
never be ignored.
Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive model
using machine learning or statistical modelling. The goal of feature engineering and
selection is to improve the performance of machine learning (ML) algorithms.
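To make this concrete, here is a minimal sketch of two common transformations for this kind of data: deriving year and month from the premiere date, and one-hot encoding the language column. The helper name `engineer_features`, the new column names, and the assumption that dates follow the "Month D, YYYY" pattern are illustrative choices; the three sample rows come from the table shown in this report.

```python
import pandas as pd

# Three rows copied from the table in this report.
sample = pd.DataFrame(
    {
        "Premiere": ["August 5, 2019", "December 26, 2019", "January 19, 2018"],
        "Runtime": [58, 79, 94],
        "Language": ["English/Japanese", "Italian", "English"],
    }
)


def engineer_features(df):
    """Return a numeric feature table derived from the raw columns."""
    out = df.copy()
    # Parse dates such as "August 5, 2019"; values that do not match become NaT.
    premiere = pd.to_datetime(out["Premiere"], format="%B %d, %Y", errors="coerce")
    out["premiere_year"] = premiere.dt.year
    out["premiere_month"] = premiere.dt.month
    out = out.drop(columns=["Premiere"])
    # One-hot encode the categorical language column.
    return pd.get_dummies(out, columns=["Language"])


features = engineer_features(sample)
print(features.columns.tolist())
```

One-hot encoding is used here instead of integer label codes so that a linear model does not read an artificial ordering into the languages.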
SPLITTING DATA:
# Use Runtime as the feature; 'IMDB Score' is the target, so including it in x
# would leak the answer into the model inputs
x=a[['Runtime']]
y=a['IMDB Score']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
x_train
     Runtime
464       78
193       96
453       47
43       106
41        93
MODEL EVALUATION:
LINEAR REGRESSION:
from sklearn import linear_model  # provides the LinearRegression estimator
from sklearn import metrics  # error and accuracy metrics for checking the model
lin = linear_model.LinearRegression()
# train the model on the training set
lin.fit(x_train, y_train)
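Once the model is fitted, it can be scored on the held-out test set with the `metrics` module. The sketch below shows one way to do that; it runs on synthetic stand-in data (an assumption, not the Netflix dataset), so the printed numbers say nothing about the real model.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): one runtime-like feature in minutes
# and a noisy score-like target.
rng = np.random.default_rng(0)
x = rng.uniform(40, 160, size=(200, 1))
y = 5.0 + 0.01 * x[:, 0] + rng.normal(scale=0.5, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

lin = linear_model.LinearRegression()
lin.fit(x_train, y_train)

# Evaluate on data the model has not seen during training.
y_pred = lin.predict(x_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
r2 = metrics.r2_score(y_test, y_pred)

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R-squared: {r2:.3f}")
```

Fixing `random_state` makes the split reproducible, which helps when comparing models.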
RIDGE REGRESSION:
from sklearn.linear_model import Ridge
ridge_model=Ridge(alpha=1.0)
ridge_mse,ridge_rmse,ridge_mae,ridge_r2=perform_cross_validation(ridge_model,X,y,num_folds)
print("Ridge Regression:")
print(f"Average MSE: {np.mean(ridge_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(ridge_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(ridge_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(ridge_r2) * 100:.2f}%")
print("\n")
Ridge Regression:
Average MSE: 18.89%
Average RMSE: 11.01%
Average MAE: 8.38%
Average R-squared: 89.54%
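The `perform_cross_validation` helper used in this section is not defined in the report. The sketch below is one plausible implementation, assuming k-fold cross-validation that returns per-fold MSE, RMSE, MAE, and R-squared lists in that order; the shuffling, the fixed seed, and the synthetic `X` and `y` are assumptions made for illustration, so the output will not match the figures reported here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold


def perform_cross_validation(model, X, y, num_folds):
    """Hypothetical helper: return per-fold MSE, RMSE, MAE and R-squared lists."""
    X = np.asarray(X)
    y = np.asarray(y)
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=0)
    mse_scores, rmse_scores, mae_scores, r2_scores = [], [], [], []
    for train_idx, test_idx in kfold.split(X):
        # Fit a fresh copy of the model on each training fold.
        fold_model = clone(model)
        fold_model.fit(X[train_idx], y[train_idx])
        pred = fold_model.predict(X[test_idx])
        mse = mean_squared_error(y[test_idx], pred)
        mse_scores.append(mse)
        rmse_scores.append(np.sqrt(mse))
        mae_scores.append(mean_absolute_error(y[test_idx], pred))
        r2_scores.append(r2_score(y[test_idx], pred))
    return mse_scores, rmse_scores, mae_scores, r2_scores


# Synthetic stand-in data (an assumption, not the Netflix dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
num_folds = 5

ridge_mse, ridge_rmse, ridge_mae, ridge_r2 = perform_cross_validation(
    Ridge(alpha=1.0), X, y, num_folds
)
print(f"Average R-squared: {np.mean(ridge_r2) * 100:.2f}%")
```

Returning lists rather than averages keeps the fold-to-fold spread available, which is useful for judging how stable each model is.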
DECISION TREE:
from sklearn.tree import DecisionTreeRegressor
# Decision Trees
tree_model=DecisionTreeRegressor(max_depth=None,random_state=0)
tree_mse,tree_rmse,tree_mae,tree_r2=perform_cross_validation(tree_model,X,y,num_folds)
print("Decision Trees:")
print(f"Average MSE: {np.mean(tree_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(tree_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(tree_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(tree_r2) * 100:.2f}%")
print("\n")
Decision Trees:
Average MSE: 16.90%
Average RMSE: 10.45%
Average MAE: 7.63%
Average R-squared: 90.55%
RANDOM FOREST:
from sklearn.ensemble import RandomForestRegressor
forest_model=RandomForestRegressor(n_estimators=100,random_state=0)
forest_mse,forest_rmse,forest_mae,forest_r2=perform_cross_validation(forest_model,X,y,num_folds)
print("Random Forest:")
print(f"Average MSE: {np.mean(forest_mse) / np.mean(y) * 100:.2f}%")
print(f"Average RMSE: {np.mean(forest_rmse) / np.mean(y) * 100:.2f}%")
print(f"Average MAE: {np.mean(forest_mae) / np.mean(y) * 100:.2f}%")
print(f"Average R-squared: {np.mean(forest_r2) * 100:.2f}%")
Random Forest:
Average MSE: 10.40%
Average RMSE: 8.11%
Average MAE: 6.03%
Average R-squared: 94.23%
DATA VISUALIZATION:
LINE PLOT:
import numpy as np
import matplotlib.pyplot as plt
df=a['Runtime'].head(5)
df1=a['IMDB Score'].head(5)
fig =plt.figure(figsize=(5,5))
plt.plot(df, df1,color='red')
plt.title("LINEPLOT")
plt.xlabel("RUNTIME")
plt.ylabel("IMDB SCORE")
plt.show()
SCATTER PLOT:
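A minimal sketch of a scatter plot of runtime against IMDB score, in the same style as the line plot. The helper name `make_scatter` is an assumption, and the seven points are the rows visible in the table of this report; with the full DataFrame, `a['Runtime']` and `a['IMDB Score']` could be passed instead.

```python
import matplotlib.pyplot as plt

# Seven (runtime, score) pairs copied from the table shown in this report.
runtime = [58, 79, 94, 90, 147, 112, 139]
imdb_score = [2.5, 2.6, 3.2, 3.4, 3.5, 3.7, 4.1]


def make_scatter(x_values, y_values):
    """Draw one marker per movie and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(x_values, y_values, color="blue")
    ax.set_title("SCATTER PLOT")
    ax.set_xlabel("RUNTIME")
    ax.set_ylabel("IMDB SCORE")
    return fig, ax


fig, ax = make_scatter(runtime, imdb_score)
```

Calling `plt.show()` afterwards displays the figure. Unlike the line plot, a scatter plot does not connect consecutive rows, which suits data where row order carries no meaning.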
PIE CHART:
df=a['IMDB Score'].head(10)
df1=a['Runtime'].head(10)
fig = plt.figure(figsize =(4,4))
plt.pie(df, labels= df1)
plt.title("PIE CHART")
plt.show()
HEAT MAP:
import numpy as np
import matplotlib.pyplot as plt
# random demo values, kept in their own variable so the DataFrame `a` is not overwritten
grid = np.random.random(( 10, 12 ))
plt.imshow( grid )
plt.title( "2-D Heat Map" )
plt.show()
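The heat map above is drawn from random numbers. A variant tied to the dataset could show the correlation matrix of the numeric columns instead; the sketch below is one way to do that, assuming a hypothetical helper `make_corr_heatmap` and using the seven rows visible in this report as sample data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Numeric columns for seven rows copied from the table in this report.
numeric = pd.DataFrame(
    {
        "Runtime": [58, 79, 94, 90, 147, 112, 139],
        "IMDB Score": [2.5, 2.6, 3.2, 3.4, 3.5, 3.7, 4.1],
    }
)
corr = numeric.corr()


def make_corr_heatmap(corr_matrix):
    """Render a correlation matrix as a labelled heat map."""
    fig, ax = plt.subplots(figsize=(4, 4))
    # Fix the colour scale to the full correlation range of -1 to 1.
    image = ax.imshow(corr_matrix.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr_matrix.columns)))
    ax.set_xticklabels(corr_matrix.columns)
    ax.set_yticks(range(len(corr_matrix.index)))
    ax.set_yticklabels(corr_matrix.index)
    fig.colorbar(image, ax=ax)
    ax.set_title("Correlation Heat Map")
    return fig, ax


fig, ax = make_corr_heatmap(corr)
```

Calling `plt.show()` afterwards displays the figure. Engineered numeric features, such as premiere year, could be added to the frame to make the matrix more informative.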
HISTOGRAM:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
CONCLUSION:
Predicting IMDB scores can help film companies plan how much they
spend each year. In this project we showed how to predict how people
will rate (like) a movie from its general characteristics. Using the
Netflix originals dataset, we outlined the problem statement, design
thinking process, data preprocessing steps, feature extraction
techniques, model training, machine learning algorithms, data
visualization techniques, and evaluation metrics.
NAME : ABINAYA B
COLLEGE CODE : 4204
REGISTER NO : 420421104003