Answer Book (Ashish)
You are hired by one of the leading news channels, CNBE, which wants to analyze recent
elections. This survey was conducted on 1,525 voters with 9 variables. You have to build
a model to predict which party a voter will vote for on the basis of the given
information, in order to create an exit poll that will help predict the overall win and
seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
Data Ingestion: 11 marks
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)
- There are a total of 1,525 rows and 10 columns in the dataset.
- We dropped the “Unnamed” column since it only holds serial numbers.
- There are no null values in the dataset, so no imputation was needed.
- All columns are integers except “Vote” and “Gender”, which are objects (strings).
- There are no duplicate rows.
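The ingestion checks above can be sketched as follows. The file name Election_Data.xlsx comes from the brief; the helper name `ingest` and the exact column labels are illustrative assumptions, not the report's actual code.

```python
import pandas as pd

def ingest(df):
    """Drop serial-number columns, then report descriptive stats,
    null counts and duplicate counts, as described in the answer."""
    # Drop any "Unnamed" serial-number column if present
    df = df.drop(columns=[c for c in df.columns if c.startswith("Unnamed")])
    stats = df.describe(include="all")   # descriptive statistics
    nulls = df.isnull().sum()            # null value condition check
    dups = df.duplicated().sum()         # duplicate check
    return df, stats, nulls, dups

# In the project itself (path assumed):
# df, stats, nulls, dups = ingest(pd.read_excel("Election_Data.xlsx"))
```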
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)
Univariate and Bivariate analysis –
- There are no significant outliers in the dataset, so no outlier treatment was applied.
- More voters hold a Eurosceptic attitude. Voters with a Eurosceptic score of 8 or
less mostly voted for the Labour party, suggesting a sentiment that Labour would
work more towards strengthening the European Union.
- Vote counts are higher for the Labour party irrespective of voters’ political
knowledge of positions on European integration.
- More female voters participated than male voters.
- Overall, the Labour party receives more votes in the survey, so Labour would be
expected to win.
- Voters in the age group 42 to 58 participated the most in the survey.
- People who rate “Blair”, the Labour party leader, 4 or more tend to vote for the
Labour party.
- People who rate “Hague”, the Conservative party leader, 4 or more prefer to vote
“Conservative” over “Labour”, with very small margins.
- Household economic conditions are assessed as “Average” by most respondents.
- National economic conditions are assessed as “Average” by most respondents.
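A minimal sketch of the univariate/bivariate checks behind these observations: an IQR-based outlier count per numeric column and a per-party comparison of means. The function name and the "vote" target column follow the report's description of the data, not its actual code.

```python
import pandas as pd

def eda_summary(df, target="vote"):
    """IQR outlier counts (univariate) and per-party means (bivariate)."""
    out = {}
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Count points outside the 1.5*IQR whiskers for each numeric column
    out["outliers"] = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()
    # Bivariate view: mean of each numeric variable for each party
    if target in df:
        out["by_party"] = df.groupby(target)[num.columns].mean()
    return out
```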
Modeling: 22 marks
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
Logistic Regression –
- The model is fitted on the training data using the “newton-cg” solver.
- The model score is 0.8275.
- Please refer to code to understand the application of this model.
LDA –
- The model score is 0.8340.
- Please refer to code to understand the application of this model.
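Fitting the two models can be sketched as below. The "newton-cg" solver matches the report; since the survey data is not reproduced here, synthetic data stands in for it, and the scores will differ from those quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,525-voter survey
X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

logit = LogisticRegression(solver="newton-cg").fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

logit_score = logit.score(X_te, y_te)
lda_score = lda.score(X_te, y_te)
```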
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
KNN Model –
- The model was first fitted with the default value n_neighbors=5. The model
score is 0.8556 on this criterion.
- The model was then fitted with n_neighbors=7. The score dropped slightly
to 0.8481.
- Further, we calculated the misclassification error (“MCE”) for odd values of K
from 1 to 19. The model with the least MCE is assumed to have the most
appropriate n_neighbors. We took n = 17 as the best estimator for the
KNN model; its score is 0.8253.
- Please refer to code to understand the application of this model.
Naïve Bayes Model –
- We applied the Gaussian Naïve Bayes model after importing GaussianNB from
sklearn.naive_bayes.
- The model score is 0.8384.
- Please refer to code to understand the application of this model.
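The MCE search over odd K and the Gaussian Naïve Bayes fit described above can be sketched as follows, again on synthetic stand-in data (the real scores quoted above come from the survey itself).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Misclassification error ("MCE") for each odd K from 1 to 19
mce = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    mce[k] = 1 - knn.score(X_te, y_te)
best_k = min(mce, key=mce.get)   # K with the least MCE

# Gaussian Naïve Bayes on the same split
nb_score = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
```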
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
(7 marks)
Since the question does not specify which model to tune, we have assumed Random
Forest, although that model was not requested in questions 1.4 and 1.5.
- The Bagging classifier is used for bagging.
- AdaBoost and Gradient Boosting are used as the boosting techniques.
- With the Random Forest and Bagging classifiers, recall on the training data is 1,
so we concluded that these models are overfit and not good models to proceed
with.
- Boosting works better than bagging in this case, since the bagged models overfit.
- The AdaBoost classifier score is 0.8296 and the Gradient Boosting classifier
score is 0.8493.
- Please refer to the code for the application of these techniques.
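The ensemble step can be sketched as below: a Random Forest as the base model inside the Bagging classifier (as the question requires), plus AdaBoost and Gradient Boosting. Synthetic data stands in for the survey, and the hyperparameters shown are illustrative defaults, not the report's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
    # Random Forest as the bagged base estimator (passed positionally so the
    # sketch works across sklearn versions that renamed this parameter)
    "bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=50, random_state=1),
        n_estimators=10, random_state=1),
    "adaboost": AdaBoostClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```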
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized. (7 marks)
Accuracy, confusion matrix, ROC curve and ROC_AUC score of all the models –
Based on training data (the ROC curve plots and confusion matrices are shown in
the code output):
- Logistic Regression – ROC_AUC 0.885
- LDA – ROC_AUC 0.885
- KNN – ROC_AUC 0.898
- Naïve Bayes – ROC_AUC 0.881
- Random Forest – ROC_AUC 1.000
- Bagging – ROC_AUC 1.000
- ADA Boosting – ROC_AUC 0.897
- Gradient Boosting – ROC_AUC 0.933
(The corresponding metrics on the test data are shown in the code output.)
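The metric computations named above (accuracy, confusion matrix, ROC curve, ROC_AUC) can be sketched as follows for any one fitted model; logistic regression on synthetic stand-in data is used here purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(solver="newton-cg").fit(X_tr, y_tr)

preds = model.predict(X_te)
probs = model.predict_proba(X_te)[:, 1]   # probability of class "1"

acc = accuracy_score(y_te, preds)
cm = confusion_matrix(y_te, preds)
auc = roc_auc_score(y_te, probs)
fpr, tpr, _ = roc_curve(y_te, probs)      # points for plotting the ROC curve
```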
Let's look at the performance of all the models on the train data set.
Recall is the percentage of truly relevant results that the algorithm classifies
correctly, so we compare the recall of class "1" across all models.
The worst performing model on the training data set is the KNN model; it also
performed worst on the test data set. In general, models that did not perform well
on the train data set also did not perform well on the test data set.
Random Forest and Bagging have the highest recall (1) for class "1" and seem to be
the best performing models on the train data set, where both score 99.90%.
However, both show poor results on the test data set (recall is very low), which is
a clear case of overfitting.
Based on the above analysis, Logistic Regression, LDA, Naïve Bayes, ADA Boosting
and Gradient Boosting can be considered for modeling. For all of these models the
difference between train and test performance is less than 10%, so they seem
reasonable to use, and their ROC curves and AUC scores are also good.
However, if we must choose a single best model, we would go with the Naïve Bayes
model: its recall is the highest, and its test results are better than its train
results. Naïve Bayes is also easy to implement and its assumptions are
straightforward.
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.2 Remove all the stopwords from all three speeches. – 3 Marks
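A minimal sketch of the stopword-removal step. Because nltk's English stopword list requires a corpus download, a small illustrative subset stands in here so the sketch is self-contained; in the project itself `stopwords.words("english")` from `nltk.corpus` would supply the full list.

```python
import re

# Illustrative subset; the project would use nltk's full English list:
# from nltk.corpus import stopwords
# STOPWORDS = set(stopwords.words("english"))
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "we", "our", "that", "this", "for", "on", "with"}

def remove_stopwords(text):
    """Lowercase, tokenize on letters/apostrophes, drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]
```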
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
Roosevelt’s speech – top three words: Nation, Know, Peopl
Kennedy’s speech –
Nixon’s speech – top three words: Us, Let, America
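The top-word count can be sketched as below. The truncated forms in the results above ("Peopl") suggest a stemmer was applied in the project; this sketch skips stemming and counts surface forms with `collections.Counter` as a stand-in for nltk's `FreqDist`.

```python
import re
from collections import Counter

def top_words(text, n=3, stopwords=frozenset()):
    """Return the n most frequent words after stopword removal."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in stopwords]
    return [w for w, _ in Counter(words).most_common(n)]

# In the project, text would be e.g. inaugural.raw('1941-Roosevelt.txt')
```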
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks [ refer to the End-to-End Case Study done in the Mentored
Learning Session ]
Code snippet to extract the three speeches:

import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
roosevelt = inaugural.raw('1941-Roosevelt.txt')
kennedy = inaugural.raw('1961-Kennedy.txt')
nixon = inaugural.raw('1973-Nixon.txt')
Important Note: Please reflect on all that you have learned while working on
this project. This step is critical in cementing all your concepts and closing the loop.
Please write down your thoughts here.