
Answer Report – Machine Learning (Graded Project)

Name – Ashish Agrawal


========================================================================

You are hired by one of the leading news channels CNBE who wants to analyze recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build
a model, to predict which party a voter will vote for on the basis of the given
information, to create an exit poll that will help in predicting overall win and seats
covered by a particular party.
Dataset for Problem: Election_Data.xlsx
Data Ingestion: 11 marks
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)
- There are a total of 1,525 rows and 10 columns in the dataset.
- We have dropped the “Unnamed” column since it only represents serial numbers.
- There are no null values in the dataset, hence no imputation was needed.
- All columns are integers except “Vote” and “Gender”, which are of object (string) type.
- There are no duplicate rows (see the code sketch below for these checks).
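The checks above can be reproduced with a short pandas sketch along the following lines; the exact label of the serial-number column ("Unnamed: 0") and the capitalisation of the column names are assumptions and may differ in the actual file:
"
import pandas as pd

# Load the dataset and drop the serial-number column (label assumed)
df = pd.read_excel("Election_Data.xlsx")
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

print(df.shape)                     # expected (1525, 9) after the drop
print(df.describe(include="all"))   # descriptive statistics
print(df.isnull().sum())            # null value check
print(df.dtypes)                    # the vote and gender columns are object type
print(df.duplicated().sum())        # duplicate row check
"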

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)
Univariate and Bivariate analysis –
- There are no significant outliers present in the dataset, hence no outlier treatment was applied.
- More voters hold a Eurosceptic attitude. Voters with a Eurosceptic score of 8 or less have largely voted for the Labour party, which suggests a sentiment that the Labour party would work more towards strengthening the European Union.
- Vote counts are higher for the Labour party irrespective of voters’ political knowledge of the parties’ positions on European integration.
- More female voters participated in the survey than male voters.
- Overall, based on the survey data, the Labour party receives more votes and would be expected to win.
- Voters in the 42 to 58 age group participated the most in the survey.
- People who rate “Blair”, the Labour party leader, at 4 or more are likely to vote for the Labour party.
- People who rate “Hague”, the Conservative party leader, at 4 or more prefer to vote “Conservative” over “Labour”, though by a very small margin.
- Household economic conditions are assessed as “Average” by most respondents.
- National economic conditions are assessed as “Average” by most respondents.
A sketch of the plots used for this analysis is given below.
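Continuing from the sketch in 1.1, a hedged sketch of the kind of plots behind these observations; the column names "Blair", "Hague", "Europe" and "vote" are assumptions about the file's labels:
"
import seaborn as sns
import matplotlib.pyplot as plt

# Univariate: boxplots of the numeric columns for the outlier check
num_cols = df.select_dtypes(include="number").columns
df[num_cols].plot(kind="box", subplots=True, layout=(3, 3), figsize=(12, 8))
plt.tight_layout()
plt.show()

# Bivariate: leader ratings and the Europe score against the vote
for col in ["Blair", "Hague", "Europe"]:   # column names assumed
    sns.countplot(data=df, x=col, hue="vote")
    plt.show()
"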

Data Preparation: 4 marks


1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or
not? Data Split: Split the data into train and test (70:30). (4 Marks)
- The columns “Vote” and “Gender” are manually encoded and replaced with binary values, i.e. 0 and 1.
- The data is split into training and test sets in a 70:30 ratio. The training data is used to train the models and the test data is used to assess predictions on data unseen by the models (see the code sketch below).
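A minimal sketch of the encoding and the 70:30 split, assuming the target column is "vote", the category labels are "Labour"/"Conservative" and "female"/"male", and a fixed random seed (these details are not stated in the report):
"
from sklearn.model_selection import train_test_split

# Manual binary encoding of the two object columns (label values assumed)
df["vote"] = df["vote"].map({"Labour": 1, "Conservative": 0})
df["gender"] = df["gender"].map({"female": 1, "male": 0})

X = df.drop(columns=["vote"])
y = df["vote"]

# 70:30 train-test split; random_state chosen here only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
"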

Modeling: 22 marks
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
Logistic Regression –
- The model is fitted on the training data using the “newton-cg” solver.
- The model score is 0.8275.
- Please refer to the code for the application of this model.

LDA –
- The model score is 0.8340.
- Please refer to the code for the application of this model; a brief sketch of both models is given below.
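A sketch of how the two models could be fitted on the split above (max_iter is an added safeguard, not mentioned in the report):
"
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Logistic Regression with the newton-cg solver
log_reg = LogisticRegression(solver="newton-cg", max_iter=10000)
log_reg.fit(X_train, y_train)
print("Logistic Regression score:", log_reg.score(X_test, y_test))

# Linear Discriminant Analysis with default settings
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("LDA score:", lda.score(X_test, y_test))
"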

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
KNN Model –
- The model is initially built with the default value n_neighbors=5; the model score is 0.8556 with this setting.
- The model is then rebuilt with n_neighbors=7; the score drops slightly to 0.8481.
- Further, we calculated the misclassification error (“MCE”) for odd values of K between 1 and 20. The model with the lowest MCE is assumed to have the most appropriate n_neighbors. We took n = 17 as the best estimator for the KNN model; the model score is 0.8253 with n = 17.
- Please refer to the code for the application of this model; a brief sketch is given below.
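A sketch of the misclassification-error search over odd K values, assuming the errors are computed on the test set (the report does not state which set was used):
"
from sklearn.neighbors import KNeighborsClassifier

# Misclassification error (MCE) for odd K between 1 and 19
k_values = list(range(1, 20, 2))
mce = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    mce.append(1 - knn.score(X_test, y_test))

print("K with lowest MCE:", k_values[mce.index(min(mce))])

# Final KNN model with the chosen K (the report settled on K = 17)
knn_final = KNeighborsClassifier(n_neighbors=17).fit(X_train, y_train)
print("KNN score:", knn_final.score(X_test, y_test))
"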
Naïve Bayes Model –
- We applied the Gaussian Naïve Bayes model after importing GaussianNB from the sklearn.naive_bayes module.
- The model score is 0.8384.
- Please refer to the code for the application of this model; a brief sketch is given below.
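A minimal sketch of the Gaussian Naïve Bayes fit on the same split:
"
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes score:", nb.score(X_test, y_test))
"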

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
(7 marks)
Since no specific model is named for tuning, I have assumed that tuning should be done for the Random Forest model, although this model was not requested in questions 1.4 and 1.5.
- The Bagging classifier is used for bagging.
- AdaBoost and Gradient Boosting are used as the boosting techniques.
- With both the Random Forest classifier and the Bagging classifier, the training recall is 1, so we concluded that these models are overfit and not good models to proceed with.
- Boosting works better than bagging in this case, since the bagging-based models are overfit.
- The score with the AdaBoost classifier is 0.8296 and the score with the Gradient Boosting classifier is 0.8493.
- Please refer to the code for the application of the different techniques; a brief sketch is given below.
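A hedged sketch of the ensemble step; the grid-search parameters below are illustrative only, since the report does not list the tuned hyperparameters:
"
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import GridSearchCV

# Random Forest tuned with a small, illustrative grid
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 7, None]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
rf_grid.fit(X_train, y_train)
rf_best = rf_grid.best_estimator_

# Bagging, AdaBoost and Gradient Boosting with default settings
bag = BaggingClassifier(random_state=1).fit(X_train, y_train)
ada = AdaBoostClassifier(random_state=1).fit(X_train, y_train)
gbc = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

for name, model in [("Random Forest", rf_best), ("Bagging", bag),
                    ("AdaBoost", ada), ("Gradient Boosting", gbc)]:
    print(name, "score:", model.score(X_test, y_test))
"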

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized. (7 marks)
ROC_AUC scores for all the models are summarised below. The accuracy figures, confusion matrices and ROC curve plots for each model are generated in the accompanying code and are not reproduced here; a code sketch for producing them follows the lists.

Based on Training data
- Logistic Regression – ROC_AUC score 0.885
- LDA – ROC_AUC score 0.885
- KNN – ROC_AUC score 0.898
- Naïve Bayes – ROC_AUC score 0.881
- Random Forest – ROC_AUC score 1.000
- Bagging – ROC_AUC score 1.000
- AdaBoost – ROC_AUC score 0.897
- Gradient Boosting – ROC_AUC score 0.933

Based on Testing data
- Logistic Regression – ROC_AUC score 0.885
- LDA – ROC_AUC score 0.885
- KNN – ROC_AUC score 0.898
- Naïve Bayes – ROC_AUC score 0.881
- Random Forest – ROC_AUC score 1.000
- Bagging – ROC_AUC score 1.000
- AdaBoost – ROC_AUC score 0.897
- Gradient Boosting – ROC_AUC score 0.933
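A sketch of how these metrics and curves could be produced for the models fitted in the sketches above; the helper function evaluate() is introduced here only for illustration and is not part of the original code:
"
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

def evaluate(model, X, y, label):
    # Accuracy, confusion matrix, ROC_AUC and one ROC curve per model
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    auc = roc_auc_score(y, prob)
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})")

# Test-set curves; repeat with X_train, y_train for the training-set plots
for name, model in [("Logistic Regression", log_reg), ("LDA", lda),
                    ("KNN", knn_final), ("Naive Bayes", nb),
                    ("Random Forest", rf_best), ("Bagging", bag),
                    ("AdaBoost", ada), ("Gradient Boosting", gbc)]:
    evaluate(model, X_test, y_test, name)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.legend()
plt.show()
"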
Inference: 5 marks
1.8 Based on these predictions, what are the insights? (5 marks)

Ans –

Let's look at the performance of all the models on the train data set.

Recall refers to the percentage of total relevant results correctly classified by the algorithm, and hence we compare the recall of class "1" for all models.

The worst performing model based on the performance metrics on the training data set is the KNN model; KNN has performed worst on both the training and test data sets.

Models which have not performed well on the train data set have also not performed well on the test data set.

However, Random Forest and Bagging have the highest recall (1) for class ‘1’ and appear to be the best performing models on the train data set. The score for both of these models is 99.90% on the train data, although both models show poor results on the test data (recall is very low). This is a clear case of overfitting.

Based on the above analysis, we think Logistic Regression, LDA, Naïve Bayes, AdaBoost and Gradient Boosting can be considered for modelling purposes. The difference between the performance scores on the train and test data is less than 10% for all of these models, so it seems reasonable to use them. The ROC curves and AUC scores for these models are also good.

However, if we need to choose the best model, we would probably go ahead with the Naïve Bayes model, since its recall is the highest and its test results are better than its train results. Naïve Bayes is also easier to implement and its assumptions are simple.

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

(Hint: use .words(), .raw(), .sents() for extracting counts)


2.1 Find the number of characters, words, and sentences for the mentioned documents.
– 3 Marks
Characters in Roosevelt’s speech – 7571
Characters in Kennedy’s speech – 7618
Characters in Nixon’s speech – 9991

Words in Roosevelt’s speech – 1536
Words in Kennedy’s speech – 1546
Words in Nixon’s speech – 20208

Sentences in Roosevelt’s speech – 68
Sentences in Kennedy’s speech – 52
Sentences in Nixon’s speech – 69
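A minimal sketch of how these counts are obtained from the inaugural corpus (note that the corpus reader's sentence method is .sents()):
"
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural

speeches = ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']
for fid in speeches:
    print(fid,
          "characters:", len(inaugural.raw(fid)),
          "words:", len(inaugural.words(fid)),
          "sentences:", len(inaugural.sents(fid)))
"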

2.2 Remove all the stopwords from all three speeches. – 3 Marks
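The report does not reproduce this step in prose; a hedged sketch, continuing from the snippet in 2.1 and assuming NLTK's English stopword list with simple lowercasing and punctuation filtering:
"
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_words(fileid):
    # Lowercase, keep alphabetic tokens only, and drop stopwords
    words = [w.lower() for w in inaugural.words(fileid)]
    return [w for w in words if w.isalpha() and w not in stop_words]

cleaned = {fid: clean_words(fid) for fid in speeches}
print({fid: len(words) for fid, words in cleaned.items()})
"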

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
Roosevelt’s speech –
Top three words: Nation, Know, Peopl

Kennedy’s speech –
Top three words: Let, Us, Power

Nixon’s speech –
Top three words: Us, Let, America
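The reported tokens look stemmed ("Peopl"), so a Porter stemmer is assumed here; a sketch of the frequency count on the cleaned word lists from 2.2:
"
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for fid, words in cleaned.items():
    stemmed = [stemmer.stem(w) for w in words]  # stemming assumed, to match 'Peopl'
    print(fid, FreqDist(stemmed).most_common(3))
"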

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks [ refer to the End-to-End Case Study done in the Mentored
Learning Session ]
Code Snippet to extract the three speeches:
"
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
"
Important Note: Please reflect on all that you have learned while working on
this project. This step is critical in cementing all your concepts and closing the loop.
Please write down your thoughts here.
