ASSIGNMENT Machine Learning
By
Sambit Roy Chowdhury
OF
Group 1 (Sat) May 21
ON
5/12/2021
Table of Contents:
Sl. No.    Topic    Page Number
    Executive Summary    6
1    Problem: 1 (Machine Learning Models)    7-51
1.1    Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.    7
1.2    Perform Univariate and Bivariate Analysis. Do exploratory data analysis.    8-17
1.3    Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).    17-18
1.4    Apply Logistic Regression and LDA (linear discriminant analysis).    18-22
1.5    Apply KNN Model and Naïve Bayes Model. Interpret the results.    22-25
1.6    Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.    25-46
1.7    Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.    47-50
1.8    Based on these predictions, what are the insights?    50-51
2    Problem: 2 (Text Mining)    51-60
2.1    Find the number of characters, words, and sentences for the mentioned documents.    52-53
2.2    Remove all the stopwords from all three speeches.    53-54
2.3    Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)    54-56
2.4    Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)    57-60
List of figures & tables for Problem 1 – Machine Learning Models:
Fig/Table Number    Topic    Page Number
Table 1    Data dictionary for the analysis.    7
Table 2    Statistical summary table.    8
Table 3    Vote % by gender.    12
Table 4    Vote % by assessment ratings of Labour leader, Blair.    13
Table 5    Vote % by assessment ratings of Conservative leader, Hague.    14
Table 6    Percentage of votes by ratings of European integration sentiment.    14
Table 7    Vote share by household economic condition.    15
Table 8    Vote share by national economic condition.    16
Table 9    Sample of scaled and encoded data.    18
Table 10    Comprehensive performance report - Logistic Regression Train Model.    19
Table 11    Comprehensive performance report - Logistic Regression Test Model.    20
Table 12    Comprehensive performance report - LDA Train Model.    21
Table 13    Comprehensive performance report - LDA Test Model.    21
Table 14    Comprehensive performance report - KNN Train Model.    22
Table 15    Comprehensive performance report - KNN Test Model.    23
Table 16    Comprehensive performance report - Naive Bayes Train Model.    24
Table 17    Comprehensive performance report - Naive Bayes Test Model.    25
Table 18    Comprehensive performance report - Bagging Train Model.    26
Table 19    Comprehensive performance report - Bagging Test Model.    27
Table 20    Comprehensive performance report - ADA Boost Train Model.    28
Table 21    Comprehensive performance report - ADA Boost Test Model.    29
Table 22    Comprehensive performance report - Gradient Boost Train Model.    30
Table 23    Comprehensive performance report - Gradient Boost Test Model.    30
Table 24    Comprehensive performance report - Reg LR Train Model.    32
Table 25    Comprehensive performance report - Reg LR Test Model.    33
Table 26    Performance report of Reg LDA Train Model.    34
Table 27    Performance report of Reg LDA Test Model.    35
Table 28    Performance report - Reg KNN Train Model.    37
Table 29    Performance report - Reg KNN Test Model.    38
Table 30    Performance report - Reg Bagging Train Model.    39
Table 31    Performance report - Reg Bagging Test Model.    40
Table 32    Performance report - Reg ADA Boost Train Model.    42
Table 33    Performance report - Reg ADA Boost Test Model.    43
Table 34    Performance report - Reg Gradient Boost Train Model.    45
Table 35    Performance report - Reg Gradient Boost Test Model.    46
Table 36    Comparison of basic and regularized ML Models.    47
List of figures & tables for Problem 2 – Text Mining:
Executive Summary
Problem: 1
A dataset of 1525 voters provided by CNBE was analyzed, and 7 machine learning models were built, tested, tuned and compared to find the most optimized model that could accurately predict the win of a particular political party and further help in creating an exit poll. The analysis examined the vote share distribution and distinguished the sentiments of Labour and Conservative voters, which lays the foundation for the prediction. Furthermore, alongside the suggestion of an optimized model for the business case, a few key recommendations have been made for the business to implement.
Problem: 2
The text mining project analyzed 3 inaugural speeches of Presidents of the United States of America in order to extract key insights such as the most frequently used words, the length of each speech, and the overall sentiment, with the help of word clouds. The contextual and sentiment similarities and dissimilarities between the speeches of Roosevelt, Kennedy and Nixon were analyzed and summarized.
Problem 1 – Machine Learning Models
Introduction:
We have been hired by CNBE, one of the leading news channels, to analyze the recent elections. A survey was conducted on 1525 voters with 9 variables. We have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it.
Note: The descriptive statistics of the features has been analyzed and described as a part of
Univariate analysis in the next answer (1.2).
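Although the descriptive statistics themselves are discussed under 1.2, the loading, summary and null value check can be reproduced with a short snippet such as the sketch below. The file name Election_Data.csv is an assumption for illustration; the actual file used in the assignment may differ.

```python
# Minimal sketch for 1.1, assuming the survey data sits in a CSV file named
# "Election_Data.csv" (the file name/format is an assumption, not from the report).
import pandas as pd

df = pd.read_csv("Election_Data.csv")

print(df.shape)                        # rows (voters) and columns (variables)
df.info()                              # data types and non-null counts
print(df.describe(include="all").T)    # descriptive statistics for every feature
print(df.isnull().sum())               # null value condition check, column by column
print(df.duplicated().sum())           # duplicate rows, if any
```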
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.
i) Age:
Insights:
ii) Economic.condition.national:
Insights:
The mean rating of the national economic condition assessed by the voters is about 3 out of 5;
75% of the voters have rated the national economic condition 3 (or below) out of 5;
The bottom 25% of voters rated it less than 3.
iii) Economic.condition.household:
Insights:
The mean rating of the household economic condition assessed by the voters is about 3 out of 5;
The bottom 25% of voters rated it less than 3;
75% of voters have rated the household economic condition 3 (or below) out of 5.
iv) Europe:
Insights:
The mean score towards European integration is 7 and the median score is 6. Both mean and median show that voters are moderately sceptical about European integration;
Interestingly, the top 25% of voters are highly Eurosceptic in nature;
Furthermore, the bottom 25% of voters scored less than 4 out of 11, indicating that they are pro-European integration.
v) Political.knowledge:
Insights:
The mean and median ratings are 1.54 and 2 out of 3 respectively, indicating that a majority of voters have sound knowledge of the parties' positions on the issue of European integration;
Interestingly, about 50% of the voters have little to no knowledge about the parties' outlook on the issue.
Insights:
Age - The distribution is nearly normal, with negligible skewness and no outliers;
National economic condition - Outliers are present. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Economic household condition - Outliers are present. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Blair - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Hague - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Europe - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Political knowledge - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal.
Insights:
Assessment of the Labour leader, Blair, has a negative correlation of -0.30 with the assessment of European integration (Europe). The negative sign indicates that voters who are not Eurosceptic have rated the Labour leader highly;
Assessment of the Conservative leader, Hague, has a positive correlation of 0.29 with the assessment of European integration (Europe). The positive sign indicates that voters who are Eurosceptic have rated the Conservative leader highly;
The two economic parameters have a negative correlation with the rating of the Conservative leader, Hague;
The two economic parameters have a positive correlation with the ratings of the Labour leader;
Age has very little to no correlation with any of the other features/parameters.
Table 3 and Fig 3: Bar graph and table showing vote % by gender.
Insights:
iii) Analysis of the Labour leader's assessment by voters:
Table 4 and Fig 4: Bar graph and table showing vote % by assessment ratings of the Labour leader, Blair.
Insights:
The above graphical illustration shows that a majority of females (53.33%) and males (56.52%) have rated Blair with a score of 4 out of 5;
A small percentage of both female and male voters gave him a perfect score of 5;
A significant percentage of voters rated him poorly with a score of 2: 25.25% of males and 31.77% of females;
Blair received almost no average score of 3; a mere 0.14% of males gave him that rating and no females did.
Table 5 and Fig 5: Bar graph and table showing vote % by assessment ratings of the Conservative leader, Hague.
Insights:
The above graphical illustrations shows that a significant percentage of voters have rated
Hague low. 13.79% females and almost 17% males have rated him with the lowest score
of 1;
A significant percentage of voters have rated Hague low. 40.9% of females and 41%
males have rated him 2;
Despite, 38.18% females and 34.78% males have rated him high of 4;
More than 50% of voters have a negative view assessment of the Conservative leader.
Table 6: Table showing the percentage of votes by ratings of European integration sentiment.
Insights:
Table 7 and Fig 6: Table and bar graph showing vote share by household economic condition.
Insights:
From the above table and bar graph, Labour voters have rated the household economic condition moderately good, with most ratings between 3 and 5;
19% of Conservative voters have rated the household economic condition as 4;
The trend shown for ratings 1 to 3 by Conservative voters clearly indicates that they are not happy with the economic performance;
The majority of both Conservative and Labour voters indicated that the household economic condition is average, rating it 3 out of 5.
Table 8 and Fig 7: Table and bar graph showing vote share by national economic condition.
Insights:
From the above table and bar graph, Labour voters have rated the national economic condition moderately good, with most ratings between 3 and 5;
42% of Labour voters have given a high score of 4 out of 5, whereas only 20% of Conservative voters have done so;
Only 2% of Conservative voters have given a perfect 5, compared to 7% of Labour voters;
30% of Conservative voters have rated the national economic condition poorly at 2 out of 5;
43% of Conservative voters have given an average score of 3, indicating that their overall assessment is average;
On the other hand, Labour voters have indicated that they are moderately happy with the national economic condition.
d) Removal of outliers:
The outliers in national economic condition and economic household condition have been treated before splitting the data into Train and Test sets.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).
Data Preparation:
a) Treatment of outliers - outliers found in the variables national economic condition and economic household condition have been treated by capping them at the lower end of the interquartile range.
b) Scaling of data- The data needs to be scaled to be used in distance-based algorithms such as
KNN as they are affected by the scale of the variables. Thus before KNN, we did the appropriate
scaling.
c) Encoding of data- There are two string type variables present- gender and vote. For the
purpose of transforming the categorical data (gender and vote) into numeric nominal data, one-
hot label encoding is performed.
d) Train-test split- The data is split into 70% train and 30% test set.
The X_train set has 1067 entries and 8 columns (the target variable 'IsLabour_or_not' has been dropped);
The X_test set has 458 entries and 8 columns (the target variable 'IsLabour_or_not' has been dropped);
The y_train and y_test sets contain the 'IsLabour_or_not' values of their corresponding X_train and X_test dataframes respectively. 'IsLabour_or_not' = 0 means Conservative. These preparation steps are sketched below.
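The sketch below illustrates the preparation steps a) to d) described above, assuming the cleaned survey dataframe is called df and that the vote column holds the strings 'Labour' and 'Conservative'; any names not quoted in the text are illustrative.

```python
# Sketch of the preparation steps a) to d), assuming the cleaned dataframe is `df`
# and the vote column holds the strings "Labour" / "Conservative" (an assumption).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# c) Encoding: vote -> binary target, gender -> dummy column
df["IsLabour_or_not"] = (df["vote"] == "Labour").astype(int)
df = pd.get_dummies(df.drop(columns=["vote"]), columns=["gender"], drop_first=True)

X = df.drop(columns=["IsLabour_or_not"])
y = df["IsLabour_or_not"]

# d) 70:30 train-test split (random_state is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# b) Scaling, needed mainly for distance-based models such as KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # fit on train only, then transform test
```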
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
Logistic Regression:
Logistic regression is a regression analysis performed when the dependent variable is binary in nature. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Building of LR model:
The basic model is built with default parameters and the train and test performance results have
been discussed.
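A minimal sketch of this step is shown below, together with an illustrative evaluate() helper (the helper name and structure are my own, not from the report) that is reused for the other models in this section.

```python
# Sketch of the basic Logistic Regression model, plus an illustrative evaluate()
# helper reused for the other models below.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Print accuracy, confusion matrix, classification report and AUC for train and test."""
    for name, X_, y_ in [("Train", X_tr, y_tr), ("Test", X_te, y_te)]:
        pred = model.predict(X_)
        proba = model.predict_proba(X_)[:, 1]
        print(f"--- {name} ---")
        print("Accuracy:", round(accuracy_score(y_, pred), 3))
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print("AUC:", round(roc_auc_score(y_, proba), 3))

lr_model = LogisticRegression(max_iter=1000)  # defaults otherwise; max_iter raised to ensure convergence
lr_model.fit(X_train, y_train)
evaluate(lr_model, X_train, y_train, X_test, y_test)
```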
Training performance:
Insights:
Testing performance:
Insights:
Precision has remained unchanged, and AUC has marginally reduced to 0.88;
The F1-score has marginally reduced to 0.88;
The recall value has dropped to 0.89 and accuracy dropped to 0.82.
LDA model:
The basic LDA model is built with default parameters and the train and test performance results have been discussed.
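A corresponding sketch for the LDA model, reusing the evaluate() helper from the Logistic Regression sketch:

```python
# Sketch of the basic LDA model with default parameters.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
evaluate(lda_model, X_train, y_train, X_test, y_test)   # helper from the LR sketch
```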
LDA Performance:
Training performance:
Insights:
Testing performance:
Insights:
The F1-score has marginally reduced to 0.87;
The recall value has dropped to 0.88 and accuracy dropped to 0.82;
Precision has remained unchanged, and AUC has marginally reduced to 0.88.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN model:
The basic model is built with default parameters and the train and test performance results have
been discussed.
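A sketch of the basic KNN model, using the scaled features since KNN is distance based:

```python
# Sketch of the basic KNN model with default parameters, on the scaled features.
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()   # default n_neighbors = 5
knn_model.fit(X_train_scaled, y_train)
evaluate(knn_model, X_train_scaled, y_train, X_test_scaled, y_test)
```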
KNN Performance:
Training performance:
Insights:
The AUC value of 0.933 means there is a very high chance that the classifier will be able to distinguish the positive class from the negative class.
Testing performance:
Insights:
Naive Bayes model:
A Naive Bayes classifier is an algorithm that uses Bayes' theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between the attributes of the data points. Well-known applications of Naive Bayes classifiers include spam filters, text analysis and medical diagnosis.
The basic model is built with default parameters and the train and test performance results have been discussed.
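A sketch of the Naive Bayes step; GaussianNB is assumed here because the predictors are numeric ratings:

```python
# Sketch of the Naive Bayes model; GaussianNB is assumed since the predictors are numeric.
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
evaluate(nb_model, X_train, y_train, X_test, y_test)
```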
Training performance:
Insights:
Testing performance:
Table 17: Comprehensive performance report- Naive Bayes Test Model
Insights:
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Before going further with the hyper-parameter tuning of the Machine Learning models generated
so far, 3 more basic Machine Learning models, namely, Bagging Classifier, ADA Boosting
Classifier and Gradient Boosting Classifier models have been built and their performances have
been analyzed. After building these 3 basic models, a separate section of the analysis illustrates
hyper-parameter tuning of all 6 models- Logistic Regression, LDA, KNN, Bagging, ADA
Boosting and Gradient Boosting, except Naive Bayes.
Bagging Classifier:
A Bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
Building of Bagging Classifier model:
The model is built with base_estimator set to a Random Forest classifier and n_estimators = 100, with all other parameters left at their defaults; the train and test performance results have been discussed below.
base_estimator: The base estimator is the base model to fit on random subsets of the dataset.
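A sketch of this Bagging setup; the random_state values are illustrative:

```python
# Sketch of the Bagging classifier with a Random Forest base estimator and 100 estimators.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag_model = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),   # `base_estimator=` on older scikit-learn versions
    n_estimators=100,
    random_state=1,
)
bag_model.fit(X_train, y_train)
evaluate(bag_model, X_train, y_train, X_test, y_test)
```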
Training performance:
Insights:
Testing performance:
Table 19: Comprehensive performance report-Bagging Test Model.
Insights:
ADA Boost algorithm, short for Adaptive Boosting, is a Boosting technique that is used as an
Ensemble Method in Machine Learning. It is called Adaptive Boosting as the weights are re-
assigned to each instance, with higher weights to incorrectly classified instances. Boosting is
used to reduce bias as well as the variance for supervised learning. It works on the principle
where learners are grown sequentially. Except for the first, each subsequent learner is grown
from previously grown learners.
The basic model is built with n_estimators = 100 and all other parameters at their defaults, and the train and test performance results have been discussed;
Base_estimator: The base estimator from which the boosted ensemble is built. The default is
DecisionTreeClassifier.
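A sketch of this ADA Boost setup; the random_state value is illustrative:

```python
# Sketch of the ADA Boost model: 100 estimators, default (decision tree) base estimator.
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(n_estimators=100, random_state=1)
ada_model.fit(X_train, y_train)
evaluate(ada_model, X_train, y_train, X_test, y_test)
```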
Training performance:
Insights:
Testing performance:
Table 21: Comprehensive performance report-ADA Boost Test Model.
Insights:
Gradient boosting classifiers are a group of machine learning algorithms that combine many
weak learning models together to create a strong predictive model. Decision trees are usually
used when doing gradient boosting.
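A sketch of the basic Gradient Boosting model with default parameters:

```python
# Sketch of the basic Gradient Boosting model with default parameters.
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(random_state=1)
gb_model.fit(X_train, y_train)
evaluate(gb_model, X_train, y_train, X_test, y_test)
```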
Training performance:
Table 22: Comprehensive performance report-Gradient Boost Train Model.
Insights:
Testing performance:
Insights:
AUC has dropped to 0.904;
The F1-score has reduced to 0.88;
The recall value has dropped to 0.87;
Accuracy dropped to 0.83.
A) Regularized Logistic Regression Model:
Hyperparameters tuning:
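The sketch below illustrates how such a grid search can be set up with GridSearchCV; the parameter grid is an assumption for illustration, since the report does not list the exact values searched for Logistic Regression.

```python
# Illustrative GridSearchCV setup for Logistic Regression. The grid values below
# are assumptions; the report does not list the exact parameters searched.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],            # assumed example values
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
}
grid_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, scoring="roc_auc", cv=5)
grid_lr.fit(X_train, y_train)

print(grid_lr.best_params_)              # best hyperparameters
best_lr = grid_lr.best_estimator_        # regularized (tuned) LR model
evaluate(best_lr, X_train, y_train, X_test, y_test)
```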
Best hyperparameters:
Training performance:
Table 24: Comprehensive performance report- Reg LR Train Model.
Insights:
Testing performance:
Table 25: Comprehensive performance report- Reg LR Test Model.
Insights:
B) Regularized LDA Model:
Hyperparameters tuning:
Best hyperparameters:
On executing the GridSearchCV function, the following best parameters were found:
Training performance:
Insights:
Testing performance:
Insights:
C) Regularized KNN Model
Hyperparameters tuning:
n_neighbors: Number of neighbors to use by default for k-neighbors queries. A range from 3 to 19 was considered.
metric: default.
weights: possible values:
'uniform': uniform weights; all points in each neighbourhood are weighted equally.
'distance': weight points by the inverse of their distance; in this case, closer neighbours of a query point will have a greater influence than neighbours which are further away.
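A sketch of this grid search, using the scaled features; the scoring metric and number of cross-validation folds are illustrative assumptions.

```python
# Sketch of the KNN grid search described above, on the scaled features.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": list(range(3, 20)),   # 3 to 19 as stated in the text
    "weights": ["uniform", "distance"],
}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=5)
grid_knn.fit(X_train_scaled, y_train)

print(grid_knn.best_params_)
best_knn = grid_knn.best_estimator_
evaluate(best_knn, X_train_scaled, y_train, X_test_scaled, y_test)
```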
Best hyperparameters:
Training performance:
Insights:
Testing performance:
Insights:
D) Regularized Bagging Classifier Model:
Hyperparameters tuning:
Fig 15: Parameters passed to find the best fit.
Best hyperparameters:
Training performance:
Insights:
The AUC value of 0.913 means there is a very high chance that the classifier will be able to distinguish the positive class from the negative class.
Testing performance:
Insights:
E) Regularized ADA Boosting Classifier Model:
Hyperparameters tuning:
learning_rate: Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. Here, 3 values have been tested.
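A sketch of this tuning step; the three learning-rate values below are assumptions, as the report does not list them.

```python
# Sketch of the ADA Boost tuning step. Three learning-rate values were tested per
# the text, but their exact values are not listed, so the values here are assumptions.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.1, 1.0],   # assumed example values
    "n_estimators": [100],
}
grid_ada = GridSearchCV(AdaBoostClassifier(random_state=1), param_grid, scoring="roc_auc", cv=5)
grid_ada.fit(X_train, y_train)

print(grid_ada.best_params_)
best_ada = grid_ada.best_estimator_
evaluate(best_ada, X_train, y_train, X_test, y_test)
```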
Best hyperparameters:
Training performance:
Insights:
Testing performance:
Table 33: Performance report-Reg ADA Boost Test Model.
Insights:
F) Regularized Gradient Boosting Classifier Model:
Hyperparameters tuning:
The following parameters were passed to find the best fit:
learning_rate: Learning rate shrinks the contribution of each tree by learning_rate. Here, 3 values (0.001, 0.01 and 0.2) were tested.
n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting, so a large number usually results in better performance.
max_features: The number of features to consider when looking for the best split. Here, 3 values (4, 5 and 6) were tested.
min_samples_split: The minimum number of samples required to split an internal node. Ideally, it is about 3 times min_samples_leaf.
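A sketch of this grid search; the learning_rate and max_features values come from the text, while the n_estimators and min_samples_split values are illustrative assumptions.

```python
# Sketch of the Gradient Boosting grid search. learning_rate and max_features values
# come from the text; n_estimators and min_samples_split values are assumptions.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.001, 0.01, 0.2],
    "max_features": [4, 5, 6],
    "n_estimators": [100, 200],          # assumed
    "min_samples_split": [30, 45, 60],   # assumed
}
grid_gb = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid,
                       scoring="roc_auc", cv=5)
grid_gb.fit(X_train, y_train)

print(grid_gb.best_params_)
best_gb = grid_gb.best_estimator_
evaluate(best_gb, X_train, y_train, X_test, y_test)
```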
Best hyperparameters:
Training performance:
Table 34: Performance report-Reg Gradient Boost Train Model.
Insights:
Testing performance:
Table 35: Comprehensive performance report-Reg Gradient Boost Test Model.
Insights:
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is best/
optimized.
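The sketch below shows how the ROC curves and ROC_AUC scores of all the fitted models can be plotted together on the test set; the model variable names follow the earlier sketches.

```python
# Sketch of plotting the test-set ROC curves and ROC_AUC scores of all fitted models.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

models = {
    "LR": lr_model, "LDA": lda_model, "KNN": knn_model, "NB": nb_model,
    "Bagging (RF)": bag_model, "ADA Boost": ada_model, "Gradient Boost": gb_model,
}
for name, model in models.items():
    X_eval = X_test_scaled if name == "KNN" else X_test   # KNN uses scaled features
    proba = model.predict_proba(X_eval)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves on the test set")
plt.legend()
plt.show()
```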
Table 36: Comparison of basic and regularized ML models.

MODEL            SET    | BEFORE tuning                               | AFTER tuning
                        | Precision  Recall  F1    Accuracy  AUC      | Precision  Recall  F1    Accuracy  AUC
LR               Train  | 0.87       0.91    0.89  0.84      0.89     | 0.87       0.91    0.89  0.84      0.89
LR               Test   | 0.87       0.89    0.88  0.82      0.883    | 0.87       0.89    0.88  0.82      0.883
LDA              Train  | 0.87       0.90    0.89  0.84      0.889    | 0.87       0.90    0.89  0.84      0.889
LDA              Test   | 0.87       0.88    0.87  0.82      0.884    | 0.87       0.88    0.87  0.82      0.884
NB               Train  | 0.88       0.88    0.88  0.83      0.887    | 0.88       0.88    0.88  0.83      0.887
NB               Test   | 0.89       0.86    0.87  0.82      0.885    | 0.89       0.86    0.87  0.82      0.885
KNN              Test   | 0.85       0.85    0.85  0.78      0.828    | 0.85       0.88    0.86  0.80      0.876
ADA Boost        Train  | 0.88       0.91    0.89  0.85      0.913    | 0.86       0.91    0.89  0.84      0.913
ADA Boost        Test   | 0.88       0.87    0.87  0.82      0.879    | 0.87       0.88    0.87  0.82      0.891
Gradient Boost   Train  | 0.91       0.93    0.92  0.89      0.950    | 0.86       0.93    0.90  0.85      0.913
Gradient Boost   Test   | 0.89       0.87    0.88  0.83      0.904    | 0.88       0.90    0.89  0.83      0.891
Bagging (RF)     Train  | 0.96       0.99    0.98  0.97      0.997    | 0.88       0.90    0.89  0.85      0.913
Bagging (RF)     Test   | 0.88       0.89    0.89  0.84      0.897    | 0.88       0.87    0.87  0.82      0.891
Note: The Naive Bayes classifier has no scope for regularization; hence, for the purpose of continuity and ease of comparison, the basic NB model performance metrics have been shown in the regularized performance table.
A) Comparison of LR Models:
In terms of performance, there is no change between the basic and regularized models. Hence, we can safely choose the regularized LR model for further comparison with the other Machine Learning models.
B) Comparison of LDA Models:
In terms of performance, there is no change between the basic and regularized models. Hence, we can safely choose the regularized LDA model for further comparison with the other Machine Learning models.
The NB model is chosen by default for further comparison, as it does not have a regularized version.
Accuracy: The regularized KNN model performs better, with a test accuracy of 80% compared to 78% for the basic model, but the regularized model is over-fit.
Recall: Recall of the regularized model has increased marginally in the testing phase, by 3%.
Precision: It has remained the same in the regularized test as in the basic test performance (85%).
AUC: The AUC test score is much better, at 87.6%, for the regularized model compared to the basic test score.
F1-score: The F1-score increased by 1% in the regularized testing phase compared to the basic test performance.
However, over-fitting on the Train set is clearly observed for the regularized KNN model. Hence, choosing the basic KNN model is the wiser decision.
Accuracy: The accuracy of the regularized ADA Boost model is marginally better than the basic model when the difference between train and test performance of the two models is considered.
Recall: Recall of the regularized model has increased marginally in the testing phase, by 1%, compared to the basic test performance.
AUC: The AUC test score is much better, at 89%, for the regularized model compared to the basic model's test score.
F1-score: The F1-score remains unchanged in the regularized model compared to the basic model's performance.
Overall, the regularized ADA Boosting Classifier model performs marginally better than the basic model; hence the regularized ADA Boosting Classifier model is chosen for further comparison with the other Machine Learning models.
Accuracy: The accuracy of the regularized Gradient Boosting model is marginally better than the basic model when the difference between train and test performance of the two models is considered.
Recall: Recall of the regularized model has increased marginally in the testing phase, by 3%, compared to the basic test performance.
AUC: The AUC test score of the regularized model is marginally better than the basic model when the difference between train and test performance of the two models is considered.
F1-score: The F1-score of the regularized model is better than the basic model's test performance.
Overall, the regularized Gradient Boosting Classifier model performs marginally better than the basic model and gives more consistent results between train and test. Hence, the regularized Gradient Boosting Classifier model is chosen for further comparison with the other Machine Learning models.
Accuracy: The basic Bagging model performs marginally better, with a test accuracy of 84% compared to 82% for the regularized model.
Recall: Recall of the regularized model has decreased marginally in the testing phase, by 2%, compared to the basic test performance.
Precision: It remains the same in the regularized test as in the basic test performance.
AUC: The AUC test score is much better, at 90%, for the basic model compared to the regularized test score.
F1-score: The F1-score decreased marginally, by 2%, in the regularized testing phase compared to the basic test performance.
Looking at the recall, precision and F1-score, the basic Bagging Classifier model is chosen for further comparison with the other Machine Learning models.
Final inferences:
Therefore, to conclude, we will go with the regularized Gradient Boosting model. There are no signs of over-fitting or under-fitting compared to the other models. Recall and F1-scores are also excellent, as is desirable for classification models. Moreover, the difference between test and train performance for this regularized model is very small, and it gives more consistent results between train and test than the other models.
1.8 Based on these predictions, what are the insights?
The business problem essentially revolved around developing a model to predict which party a voter would vote for, based on the information about the voters. The model will then be used to create an exit poll that will help in predicting the overall win and the seats covered by a particular party. To achieve this, the analysis assumed that CNBE wishes to focus more on accurately predicting Labour's win, and hence that has been the class of choice for prediction. The analysis and the Machine Learning models are based on a limited dataset of 1525 voters with specific details about the electors. Despite these limitations, the exercise has helped us find a few key insights and patterns, alongside identifying the optimal model that CNBE could use for the predictions mentioned above.
Insights summary:
The majority of voters are between the ages of 33 and 75, and there is no voter data captured for ages 18 to 24;
The majority of people think that the household and national economic conditions are satisfactory, as most have rated them 3 or 4 out of 5;
Conservative voters consist of a slightly higher proportion of older voters (50 years and above);
50% of the voters are above 53 years of age, and only the bottom 25% of voters are younger than 41 years;
The Labour leader Blair is more popular than the Conservative leader Hague, as Blair has received a rating of 4 on average, whereas Hague has received mixed ratings of 2 and 4;
The general population does not seem to be very Eurosceptic, as the cumulative frequency of non-Eurosceptic people (who opted for 6 or less) is higher than the cumulative frequency of Eurosceptic people (who opted for 7 or higher);
There are more female voters than male voters;
Conservative voters have better knowledge of the political parties' positions on European integration than their Labour counterparts;
Labour voters appear to hold a pro-European-integration opinion, as opposed to Conservative voters;
43% of Conservative voters have rated the national economic condition as average with a score of 3, further indicating that their overall assessment lies between poor and average;
National economic condition has a mild positive correlation with 'IsLabour_or_not', which means that as the impression of the national economic condition improves, the votes for Labour also increase;
Candidates can focus on improving the image of the economic conditions to gather more crowd favour;
Most people find Blair to be the better leader, and if the Conservative party wants to win, it has to focus on improving Hague's image among the people, or go with a different candidate.
Business Recommendations:
CNBE should gather data on voters aged between 18 and 24 so as to make the predictions more accurate;
It should be noted that the larger the number of voters surveyed, the better the Machine Learning models can be optimized;
The dataset should also include additional assessment ratings, for example on migration policy (including refugee settlement), employment generation, the income tax regime, etc.;
Based on the existing data, the most optimized model is found to be the regularized Gradient Boosting Classifier model. It would, however, need re-tuning of its hyperparameters with a larger dataset to accurately predict the win for the Labour party;
Irrespective of the size of the dataset, the regularized Gradient Boosting Classifier model could be deployed to build an exit poll, and it should still perform with a high degree of accuracy.
Problem 2 – Text Mining
Introduction:
In this particular project, we are supposed to work on the inaugural corpora from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon
Loading of dataset:
The 3 speeches (in text document format) are converted into a data frame for the purpose of text mining. We have created 3 different Excel files for the 3 different speeches. Below is a snapshot of the initial data frame:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
Text mining involves various preprocessing of the text before starting to build a model. In this
case, plain text was used instead of the preprocessed text to perform a count of words, characters
and sentences.
Using pre-defined functions available with the inaugural corpus in the nltk toolkit, the counts of words, characters and sentences were computed as shown below:
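A sketch of how these counts can be computed is shown below; the three file ids are assumptions based on the speeches named in this report.

```python
# Sketch of counting characters, words and sentences with the nltk inaugural corpus.
# The three file ids are assumptions based on the speeches named in this report.
import nltk
nltk.download("inaugural")
nltk.download("punkt")
from nltk.corpus import inaugural

speech_ids = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]
for fid in speech_ids:
    raw = inaugural.raw(fid)
    print(fid,
          "characters:", len(raw),
          "words:", len(inaugural.words(fid)),
          "sentences:", len(inaugural.sents(fid)))
```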
Table 38: Data frames showing the word, character and sentence counts of the speeches.
Insights:
From the above table, Nixon's speech has the highest word and character counts (1769 and 10107) and a sentence count of 68;
The number of sentences in Nixon's speech (68) is similar to that of Roosevelt's speech;
Hence, Nixon's speech can be confirmed to be the longest of the 3 speeches.
2.2 Remove all the stopwords from all three speeches.
A few data pre-processing steps have been undertaken before removing the stopwords (such as 'a', 'the', 'then', 'is', etc.). Before the removal of stopwords, the number of stopwords in each speech is as follows:
The text must first be converted to lowercase in order to reduce redundant words such as 'The' and 'the'. To Python these are two separate words, which would make the models and word clouds built on them inaccurate. In order to mitigate this double counting of words, the text of the 3 speeches has been converted to lowercase.
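A sketch of the stopword-removal step, reusing the file ids assumed in 2.1:

```python
# Sketch of the stopword-removal step: lowercase, tokenize, drop stopwords and punctuation.
import nltk
nltk.download("stopwords")
nltk.download("punkt")
from nltk.corpus import inaugural, stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def clean_tokens(file_id):
    tokens = word_tokenize(inaugural.raw(file_id).lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

roosevelt_clean = clean_tokens("1941-Roosevelt.txt")
kennedy_clean = clean_tokens("1961-Kennedy.txt")
nixon_clean = clean_tokens("1973-Nixon.txt")
print(len(roosevelt_clean), len(kennedy_clean), len(nixon_clean))
```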
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stopwords).
After the removal of the stopwords, an important text pre-processing step is applied to reduce the words to their root forms, called stemming.
Stemming is a rule-based approach because it slices the inflected part off a word, as a prefix or suffix, using a set of commonly occurring prefixes and suffixes such as '-ing', '-ed', '-es', 'pre-', etc. It can result in a stem that is not an actual word. Using the Porter Stemmer available in the nltk package, the texts of the 3 speeches are stemmed to their root words.
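A sketch of the stemming and frequency-count step, reusing the clean_tokens() helper from the 2.2 sketch:

```python
# Sketch of stemming the cleaned tokens with the Porter Stemmer and listing the
# most frequent stems.
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def top_words(file_id, n=10):
    stems = [stemmer.stem(t) for t in clean_tokens(file_id)]
    return FreqDist(stems).most_common(n)

print(top_words("1941-Roosevelt.txt"))
print(top_words("1961-Kennedy.txt"))
print(top_words("1973-Nixon.txt"))
```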
Table 42: Top 10 words used in Roosevelt’s speech.
The top 3 words used by Roosevelt in his speech are: 'nation' (10 times), 'know' (10 times) and 'spirit' (8 times). However, the word 'us' is also used 8 times.
Table 43: Top 10 words used in Kennedy's speech.
The top 3 words used by Kennedy in his speech are: 'us' (11 times), 'let' (11 times) and 'sides' (8 times).
Table 44: Top 10 words used in Nixon’s speech.
The top 3 words used by Nixon in his speech are: ‘us’ (26 times), ‘peace’ (15 times) and ‘new’
(15 times).
2.4 Plot the word cloud of each of the speeches of the variable. (After removing the
stopwords)
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a textual data source (such as a speech, blog post, news story or database), the bigger and bolder it appears in the word cloud. A word cloud is a collection, or group, of words represented in different sizes: the bigger and bolder a word appears, the more often it is mentioned in a given text and the more important it is.
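A sketch of generating the word clouds from the cleaned tokens, using the third-party wordcloud package:

```python
# Sketch of plotting a word cloud per speech from the stopword-free tokens,
# using the third-party `wordcloud` package and matplotlib.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for file_id in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    text = " ".join(clean_tokens(file_id))        # helper from the 2.2 sketch
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(file_id)
    plt.show()
```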
Roosevelt's speech analysis:
Insights:
Roosevelt used the word 'nation' the most in his speech, followed by words such as 'spirit', 'people', 'life' and 'America';
Further, his speech also stressed positive words like 'spirit', 'security', 'life', 'faith', etc.;
Other prominent words visible are 'live', 'freedom', 'people', 'America', 'preserve' and 'history';
Based on the word cloud, the sentiment of his speech is positive;
His speech seems to encourage the audience to preserve America's history, democracy and freedom;
The president talks about the country, which is stressed through words like 'nation', 'America' and 'people'.
Kennedy’s speech analysis:
Insights:
Kennedy in his speech stressed words like 'let', 'new', 'world', 'power', 'nation', 'side', etc.;
Unlike Roosevelt's speech, his speech seems to centre on 'power', 'nation' and 'world'. These words point to a more aggressive approach to building America as a new world power;
Also, unlike Roosevelt's speech, words such as 'peace', 'hope', 'us', 'human', 'friend' and 'citizen' are less prominent in his speech;
The word cloud implies that his speech asks the audience to embrace and support the idea of America's global power, rather than the actual well-being of its citizens;
However, this sentiment inference could be challenged in a different context.
Nixon’s speech analysis:
Insights:
Nixon's speech centres around words like 'let', 'us', 'peace', 'world', 'America', 'nation', 'role', 'government', etc.;
Unlike Kennedy's speech, the general idea of his speech appears to be how the people of the United States can contribute to world peace;
His speech, similar to Roosevelt's, conveys a positive tone, which shows in the repeated use of words such as 'live', 'build', 'right', 'together', 'promise' and 'justice';
The word cloud thus suggests an overall positive encouragement for the audience;
It is interesting to note that Nixon's speech takes a more balanced approach to nation-building and its positive impact in a global context, compared to Roosevelt's speech, where the whole message focused on the human aspects of the country. This is derived from common words such as 'peace', 'home', 'nation', 'together', 'faith' and 'justice'.
Conclusion:
The Text Mining and Sentiment Analysis exercise highlighted the key steps taken to preprocess text and how text can be used to visually analyze the sentiment of a full speech.
Based on this analysis, both Roosevelt and Nixon spoke positively about the welfare of the people and supported the ideas of peace and freedom. In contrast, Kennedy's speech focused on the United States as a global power rather than on the welfare of its citizens.