Answer Book (Ashish)
You are hired by one of the leading news channels, CNBE, which wants to analyze recent
elections. This survey was conducted on 1,525 voters with 9 variables. You have to build
a model to predict which party a voter will vote for on the basis of the given
information, in order to create an exit poll that will help predict the overall win and
seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
Data Ingestion: 11 marks
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)
- There are a total of 1,525 rows and 10 columns in the dataset.
- We dropped the “Unnamed” column since it only holds serial numbers.
- There are no null values in the dataset, so no imputation was needed.
- All columns are integers except “Vote” and “Gender”, which are objects (strings).
- There are no duplicate rows.
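The ingestion checks above can be sketched as follows. The file name Election_Data.xlsx comes from the brief; the helper name `ingest` and the exact column labels are illustrative assumptions, not the report's actual code.

```python
import pandas as pd

def ingest(df):
    """Drop serial-number columns, then report descriptive stats,
    null counts and duplicate counts, as described in the answer."""
    # Drop any "Unnamed" serial-number column if present
    df = df.drop(columns=[c for c in df.columns if c.startswith("Unnamed")])
    stats = df.describe(include="all")   # descriptive statistics
    nulls = df.isnull().sum()            # null value condition check
    dups = df.duplicated().sum()         # duplicate check
    return df, stats, nulls, dups

# In the project itself (path assumed):
# df, stats, nulls, dups = ingest(pd.read_excel("Election_Data.xlsx"))
```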
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)
Univariate and Bivariate analysis –
- There are no significant outliers in the dataset, so no outlier treatment was applied.
- More voters hold a Eurosceptic attitude. Voters with a Eurosceptic score of 8 or
less mostly voted for the Labour party, suggesting a sentiment that Labour would
work more towards strengthening the European Union.
- Vote counts are higher for the Labour party irrespective of voters’ political
knowledge of positions on European integration.
- More female voters participated than male voters.
- Overall, the Labour party receives more votes in the survey, so Labour would be
expected to win.
- Voters in the age group 42 to 58 participated the most in the survey.
- People who rate “Blair”, the Labour party leader, 4 or more tend to vote for the
Labour party.
- People who rate “Hague”, the Conservative party leader, 4 or more prefer to vote
“Conservative” over “Labour”, with very small margins.
- Household economic conditions are assessed as “Average” by most respondents.
- National economic conditions are assessed as “Average” by most respondents.
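A minimal sketch of the univariate/bivariate checks behind these observations: an IQR-based outlier count per numeric column and a per-party comparison of means. The function name and the "vote" target column follow the report's description of the data, not its actual code.

```python
import pandas as pd

def eda_summary(df, target="vote"):
    """IQR outlier counts (univariate) and per-party means (bivariate)."""
    out = {}
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Count points outside the 1.5*IQR whiskers for each numeric column
    out["outliers"] = ((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()
    # Bivariate view: mean of each numeric variable for each party
    if target in df:
        out["by_party"] = df.groupby(target)[num.columns].mean()
    return out
```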
Modeling: 22 marks
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
Logistic Regression –
- The model is fitted on the training data using the “newton-cg” solver.
- The model score is 0.8275.
- Please refer to code to understand the application of this model.
LDA –
- The model score is 0.8340.
- Please refer to code to understand the application of this model.
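Fitting the two models can be sketched as below. The "newton-cg" solver matches the report; since the survey data is not reproduced here, synthetic data stands in for it, and the scores will differ from those quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,525-voter survey
X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

logit = LogisticRegression(solver="newton-cg").fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

logit_score = logit.score(X_te, y_te)
lda_score = lda.score(X_te, y_te)
```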
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
KNN Model –
- The model was first fitted with the default value n_neighbors=5. The model
score is 0.8556 on this criterion.
- The model was then fitted with n_neighbors=7. The score dropped slightly
to 0.8481.
- Further, we calculated the misclassification error (“MCE”) for odd values of K
from 1 to 19. The model with the least MCE is assumed to have the most
appropriate n_neighbors. We took n = 17 as the best estimator for the
KNN model; its score is 0.8253.
- Please refer to code to understand the application of this model.
Naïve Bayes Model –
- We applied the Gaussian Naïve Bayes model after importing GaussianNB from
sklearn.naive_bayes.
- The model score is 0.8384.
- Please refer to code to understand the application of this model.
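The MCE search over odd K and the Gaussian Naïve Bayes fit described above can be sketched as follows, again on synthetic stand-in data (the real scores quoted above come from the survey itself).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Misclassification error ("MCE") for each odd K from 1 to 19
mce = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    mce[k] = 1 - knn.score(X_te, y_te)
best_k = min(mce, key=mce.get)   # K with the least MCE

# Gaussian Naïve Bayes on the same split
nb_score = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
```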
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
(7 marks)
Since the question does not specify which model to tune, we have assumed Random
Forest, although that model was not requested in questions 1.4 and 1.5.
- The Bagging classifier is used for bagging.
- AdaBoost and Gradient Boosting are used as the boosting techniques.
- With the Random Forest and Bagging classifiers, recall on the training data is 1,
so we concluded that these models are overfit and not good models to proceed
with.
- Boosting works better than bagging in this case, since the bagged models overfit.
- The AdaBoost classifier score is 0.8296 and the Gradient Boosting classifier
score is 0.8493.
- Please refer to the code for the application of these techniques.
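The ensemble step can be sketched as below: a Random Forest as the base model inside the Bagging classifier (as the question requires), plus AdaBoost and Gradient Boosting. Synthetic data stands in for the survey, and the hyperparameters shown are illustrative defaults, not the report's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
    # Random Forest as the bagged base estimator (passed positionally so the
    # sketch works across sklearn versions that renamed this parameter)
    "bagging": BaggingClassifier(
        RandomForestClassifier(n_estimators=50, random_state=1),
        n_estimators=10, random_state=1),
    "adaboost": AdaBoostClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```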
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized. (7 marks)
Accuracy, confusion matrix, ROC curve and ROC_AUC score of all the models –
Based on training data (the ROC curve plots and confusion matrices are shown in
the code output):
- Logistic Regression – ROC_AUC 0.885
- LDA – ROC_AUC 0.885
- KNN – ROC_AUC 0.898
- Naïve Bayes – ROC_AUC 0.881
- Random Forest – ROC_AUC 1.000
- Bagging – ROC_AUC 1.000
- ADA Boosting – ROC_AUC 0.897
- Gradient Boosting – ROC_AUC 0.933
(The corresponding metrics on the test data are shown in the code output.)
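The metric computations named above (accuracy, confusion matrix, ROC curve, ROC_AUC) can be sketched as follows for any one fitted model; logistic regression on synthetic stand-in data is used here purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1525, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(solver="newton-cg").fit(X_tr, y_tr)

preds = model.predict(X_te)
probs = model.predict_proba(X_te)[:, 1]   # probability of class "1"

acc = accuracy_score(y_te, preds)
cm = confusion_matrix(y_te, preds)
auc = roc_auc_score(y_te, probs)
fpr, tpr, _ = roc_curve(y_te, probs)      # points for plotting the ROC curve
```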
Let's look at the performance of all the models on the train data set.
Recall is the percentage of truly relevant results that the algorithm classifies
correctly, so we compare the recall of class "1" across all models.
The worst performing model on the training data set is the KNN model; it also
performed worst on the test data set. In general, models that did not perform well
on the train data set also did not perform well on the test data set.
Random Forest and Bagging have the highest recall (1) for class "1" and seem to be
the best performing models on the train data set, where both score 99.90%.
However, both show poor results on the test data set (recall is very low), which is
a clear case of overfitting.
Based on the above analysis, Logistic Regression, LDA, Naïve Bayes, ADA Boosting
and Gradient Boosting can be considered for modeling. For all of these models the
difference between train and test performance is less than 10%, so they seem
reasonable to use, and their ROC curves and AUC scores are also good.
However, if we must choose a single best model, we would go with the Naïve Bayes
model: its recall is the highest, and its test results are better than its train
results. Naïve Bayes is also easy to implement and its assumptions are
straightforward.
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.2 Remove all the stopwords from all three speeches. – 3 Marks
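A minimal sketch of the stopword-removal step. Because nltk's English stopword list requires a corpus download, a small illustrative subset stands in here so the sketch is self-contained; in the project itself `stopwords.words("english")` from `nltk.corpus` would supply the full list.

```python
import re

# Illustrative subset; the project would use nltk's full English list:
# from nltk.corpus import stopwords
# STOPWORDS = set(stopwords.words("english"))
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "we", "our", "that", "this", "for", "on", "with"}

def remove_stopwords(text):
    """Lowercase, tokenize on letters/apostrophes, drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]
```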
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
Roosevelt’s speech – top three words: Nation, Know, Peopl
Kennedy’s speech –
Nixon’s speech – top three words: Us, Let, America
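The top-word count can be sketched as below. The truncated forms in the results above ("Peopl") suggest a stemmer was applied in the project; this sketch skips stemming and counts surface forms with `collections.Counter` as a stand-in for nltk's `FreqDist`.

```python
import re
from collections import Counter

def top_words(text, n=3, stopwords=frozenset()):
    """Return the n most frequent words after stopword removal."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in stopwords]
    return [w for w, _ in Counter(words).most_common(n)]

# In the project, text would be e.g. inaugural.raw('1941-Roosevelt.txt')
```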
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks [ refer to the End-to-End Case Study done in the Mentored
Learning Session ]
Code snippet to extract the three speeches:

import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
roosevelt = inaugural.raw('1941-Roosevelt.txt')
kennedy = inaugural.raw('1961-Kennedy.txt')
nixon = inaugural.raw('1973-Nixon.txt')
Important Note: Please reflect on all that you have learned while working on
this project. This step is critical in cementing all your concepts and closing the loop.
Please write down your thoughts here.