ASSIGNMENT Machine Learning
By
Sambit Roy Chowdhury
OF
Group 1 (Sat) May 21
ON
5/12/2021
Table of Contents:
Sl. No.    Topic    Page Number
    Executive Summary    6
1    Problem: 1 (Machine Learning Models)    7-51
1.1    Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.    7
1.2    Perform Univariate and Bivariate Analysis. Do exploratory data analysis.    8-17
1.3    Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).    17-18
1.4    Apply Logistic Regression and LDA (linear discriminant analysis).    18-22
1.5    Apply KNN Model and Naïve Bayes Model. Interpret the results.    22-25
1.6    Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.    25-46
1.7    Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.    47-50
1.8    Based on these predictions, what are the insights?    50-51
2    Problem: 2 (Text Mining)    51-60
2.1    Find the number of characters, words, and sentences for the mentioned documents.    52-53
2.2    Remove all the stopwords from all three speeches.    53-54
2.3    Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)    54-56
2.4    Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)    57-60
List of figures & tables for Problem 1 – Machine Learning Models:
Fig/Table Number    Topic    Page Number
Table 1    Data dictionary for the analysis.    7
Table 2    Statistical summary table.    8
Table 3    Vote % by gender.    12
Table 4    Vote % by assessment ratings of Labour leader, Blair.    13
Table 5    Vote % by assessment ratings of Conservative leader, Hague.    14
Table 6    Percentage of votes by ratings of European integration sentiment.    14
Table 7    Vote share by household economic condition.    15
Table 8    Vote share by national economic condition.    16
Table 9    Sample of scaled and encoded data.    18
Table 10    Comprehensive performance report - Logistic Regression Train Model.    19
Table 11    Comprehensive performance report - Logistic Regression Test Model.    20
Table 12    Comprehensive performance report - LDA Train Model.    21
Table 13    Comprehensive performance report - LDA Test Model.    21
Table 14    Comprehensive performance report - KNN Train Model.    22
Table 15    Comprehensive performance report - KNN Test Model.    23
Table 16    Comprehensive performance report - Naive Bayes Train Model.    24
Table 17    Comprehensive performance report - Naive Bayes Test Model.    25
Table 18    Comprehensive performance report - Bagging Train Model.    26
Table 19    Comprehensive performance report - Bagging Test Model.    27
Table 20    Comprehensive performance report - ADA Boost Train Model.    28
Table 21    Comprehensive performance report - ADA Boost Test Model.    29
Table 22    Comprehensive performance report - Gradient Boost Train Model.    30
Table 23    Comprehensive performance report - Gradient Boost Test Model.    30
Table 24    Comprehensive performance report - Reg LR Train Model.    32
Table 25    Comprehensive performance report - Reg LR Test Model.    33
Table 26    Performance report of Reg LDA Train Model.    34
Table 27    Performance report of Reg LDA Test Model.    35
Table 28    Performance report - Reg KNN Train Model.    37
Table 29    Performance report - Reg KNN Test Model.    38
Table 30    Performance report - Reg Bagging Train Model.    39
Table 31    Performance report - Reg Bagging Test Model.    40
Table 32    Performance report - Reg ADA Boost Train Model.    42
Table 33    Performance report - Reg ADA Boost Test Model.    43
Table 34    Performance report - Reg Gradient Boost Train Model.    45
Table 35    Performance report - Reg Gradient Boost Test Model.    46
Table 36    Comparison of basic and regularized ML Models.    47
List of figures & tables for Problem 2 – Text Mining:
Executive Summary
Problem: 1
A dataset of 1525 voters provided by CNBE was analyzed, and 7 machine learning models were built, tested, tuned and compared to find the most optimized model that could accurately predict the win of a particular political party and further help in creating an exit poll. The analysis examined the vote share distribution and distinguished the sentiments of Labour and Conservative voters, which lays the foundation for the prediction. Furthermore, alongside the suggestion of an optimized model for the business case, a few key recommendations have been made for the business to implement.
Problem: 2
The text mining project analyzed 3 inaugural speeches of Presidents of the United States of America in order to extract key insights such as the most frequently used words, the length of each speech, and the overall sentiment, with the help of word clouds. The contextual and sentiment similarities and dissimilarities between the speeches of Roosevelt, Kennedy and Nixon were analyzed and summarized.
Problem 1 – Machine Learning Models
Introduction:
We have been hired by CNBE, one of the leading news channels, to analyze the recent elections. A survey was conducted on 1525 voters with 9 variables. We have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it.
Note: The descriptive statistics of the features has been analyzed and described as a part of
Univariate analysis in the next answer (1.2).
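Although the descriptive statistics themselves are discussed under 1.2, the loading, summary and null value check can be reproduced with a short snippet such as the sketch below. The file name Election_Data.csv is an assumption for illustration; the actual file used in the assignment may differ.

```python
# Minimal sketch for 1.1, assuming the survey data sits in a CSV file named
# "Election_Data.csv" (the file name/format is an assumption, not from the report).
import pandas as pd

df = pd.read_csv("Election_Data.csv")

print(df.shape)                        # rows (voters) and columns (variables)
df.info()                              # data types and non-null counts
print(df.describe(include="all").T)    # descriptive statistics for every feature
print(df.isnull().sum())               # null value condition check, column by column
print(df.duplicated().sum())           # duplicate rows, if any
```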
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.
i) Age:
Insights:
ii) Economic.condition.national:
Insights:
The mean rating of the national economic condition assessed by the voters is about 3 out of 5;
75% of the voters have rated the national economic condition 3 (or below) out of 5;
The bottom 25% of voters rated it less than 3.
iii) Economic.condition.household:
Insights:
The mean rating of the household economic condition assessed by the voters is about 3 out of 5;
The bottom 25% of voters rated it less than 3;
75% of voters have rated the household economic condition 3 (or below) out of 5.
iv) Europe:
Insights:
The mean score towards European integration is 7 and the median score is 6. Both mean and median show that voters are moderately sceptical about European integration;
Interestingly, the top 25% of voters are highly Eurosceptic in nature;
Furthermore, the bottom 25% of voters scored less than 4 out of 11, indicating that they are pro-European integration.
v) Political.knowledge:
Insights:
The mean and median ratings are 1.54 and 2 out of 3 respectively, indicating that a majority of voters have sound knowledge of the parties' positions on the issue of European integration;
Interestingly, about 50% of the voters have little to no knowledge about the parties' outlook on the issue.
Insights:
Age - The distribution is nearly normal, with negligible skewness and no outliers;
National economic condition - Outliers are present. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Economic household condition - Outliers are present. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Blair - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Hague - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Europe - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal;
Political knowledge - No outliers found. Because the variable is categorical/ordinal in nature, the distribution cannot be fully ascertained; however, it is approximately normal.
Insights:
Assessment of the Labour leader, Blair, has a negative correlation of -0.30 with the assessment of European integration (Europe). The negative sign indicates that voters who are not Eurosceptic have rated the Labour leader highly;
Assessment of the Conservative leader, Hague, has a positive correlation of 0.29 with the assessment of European integration (Europe). The positive sign indicates that voters who are Eurosceptic have rated the Conservative leader highly;
The two economic parameters have a negative correlation with the rating of the Conservative leader, Hague;
The two economic parameters have a positive correlation with the ratings of the Labour leader;
Age has very little to no correlation with any of the other features/parameters.
Table 3 and Fig 3: Bar graph and table showing vote % by gender.
Insights:
iii) Analysis of the Labour leader's assessment by voters:
Table 4 and Fig 4: Bar graph and table showing vote % by assessment ratings of the Labour leader, Blair.
Insights:
The above graphical illustration shows that a majority of females (53.33%) and males (56.52%) have rated Blair with a score of 4 out of 5;
A small percentage of both female and male voters gave him a perfect score of 5;
A significant percentage of voters rated him poorly with a score of 2: 25.25% of males and 31.77% of females;
Blair received almost no average score of 3; a mere 0.14% of males gave him that rating and no females did.
Table 5 and Fig 5: Bar graph and table showing vote % by assessment ratings of the Conservative leader, Hague.
Insights:
The above graphical illustrations shows that a significant percentage of voters have rated
Hague low. 13.79% females and almost 17% males have rated him with the lowest score
of 1;
A significant percentage of voters have rated Hague low. 40.9% of females and 41%
males have rated him 2;
Despite, 38.18% females and 34.78% males have rated him high of 4;
More than 50% of voters have a negative view assessment of the Conservative leader.
Table 6: Table showing the percentage of votes by ratings of European integration sentiment.
Insights:
Table 7 and Fig 6: Table and bar graph showing vote share by household economic condition.
Insights:
From the above table and bar graph, Labour voters have rated the household economic condition moderately good, with most ratings between 3 and 5;
19% of Conservative voters have rated the household economic condition as 4;
The trend shown for ratings 1 to 3 by Conservative voters clearly indicates that they are not happy with the economic performance;
The majority of both Conservative and Labour voters indicated that the household economic condition is average, rating it 3 out of 5.
Table 8 and Fig 7: Table and bar graph showing vote share by national economic condition.
Insights:
From the above table and bar graph, Labour voters have rated the national economic condition moderately good, with most ratings between 3 and 5;
42% of Labour voters have given a high score of 4 out of 5, whereas only 20% of Conservative voters have done so;
Only 2% of Conservative voters have given a perfect 5, compared to 7% of Labour voters;
30% of Conservative voters have rated the national economic condition poorly at 2 out of 5;
43% of Conservative voters have given an average score of 3, indicating that their overall assessment is average;
On the other hand, Labour voters have indicated that they are moderately happy with the national economic condition.
d) Removal of outliers:
The outliers in national economic condition and economic household condition have been treated before splitting the data into Train and Test sets.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30).
Data Preparation:
a) Treatment of outliers - outliers found in the variables national economic condition and economic household condition have been treated by capping them at the lower end of the interquartile range.
b) Scaling of data- The data needs to be scaled to be used in distance-based algorithms such as
KNN as they are affected by the scale of the variables. Thus before KNN, we did the appropriate
scaling.
c) Encoding of data- There are two string type variables present- gender and vote. For the
purpose of transforming the categorical data (gender and vote) into numeric nominal data, one-
hot label encoding is performed.
d) Train-test split- The data is split into 70% train and 30% test set.
The X_train set has 1067 entries and 8 columns (the target variable 'IsLabour_or_not' has been dropped);
The X_test set has 458 entries and 8 columns (the target variable 'IsLabour_or_not' has been dropped);
The y_train and y_test sets contain the 'IsLabour_or_not' values of their corresponding X_train and X_test dataframes respectively. 'IsLabour_or_not' = 0 means Conservative. These preparation steps are sketched below.
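The sketch below illustrates the preparation steps a) to d) described above, assuming the cleaned survey dataframe is called df and that the vote column holds the strings 'Labour' and 'Conservative'; any names not quoted in the text are illustrative.

```python
# Sketch of the preparation steps a) to d), assuming the cleaned dataframe is `df`
# and the vote column holds the strings "Labour" / "Conservative" (an assumption).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# c) Encoding: vote -> binary target, gender -> dummy column
df["IsLabour_or_not"] = (df["vote"] == "Labour").astype(int)
df = pd.get_dummies(df.drop(columns=["vote"]), columns=["gender"], drop_first=True)

X = df.drop(columns=["IsLabour_or_not"])
y = df["IsLabour_or_not"]

# d) 70:30 train-test split (random_state is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# b) Scaling, needed mainly for distance-based models such as KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # fit on train only, then transform test
```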
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
Logistic Regression:
Logistic regression is a regression analysis performed when the dependent variable is binary in nature. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Building of LR model:
The basic model is built with default parameters and the train and test performance results have
been discussed.
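A minimal sketch of this step is shown below, together with an illustrative evaluate() helper (the helper name and structure are my own, not from the report) that is reused for the other models in this section.

```python
# Sketch of the basic Logistic Regression model, plus an illustrative evaluate()
# helper reused for the other models below.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score)

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Print accuracy, confusion matrix, classification report and AUC for train and test."""
    for name, X_, y_ in [("Train", X_tr, y_tr), ("Test", X_te, y_te)]:
        pred = model.predict(X_)
        proba = model.predict_proba(X_)[:, 1]
        print(f"--- {name} ---")
        print("Accuracy:", round(accuracy_score(y_, pred), 3))
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print("AUC:", round(roc_auc_score(y_, proba), 3))

lr_model = LogisticRegression(max_iter=1000)  # defaults otherwise; max_iter raised to ensure convergence
lr_model.fit(X_train, y_train)
evaluate(lr_model, X_train, y_train, X_test, y_test)
```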
Training performance:
Insights:
Testing performance:
Insights:
Precision has remained unchanged, and AUC has marginally reduced to 0.88;
The F1-score has marginally reduced to 0.88;
The recall value has dropped to 0.89 and accuracy dropped to 0.82.
LDA model:
The basic LDA model is built with default parameters and the train and test performance results have been discussed.
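A corresponding sketch for the LDA model, reusing the evaluate() helper from the Logistic Regression sketch:

```python
# Sketch of the basic LDA model with default parameters.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
evaluate(lda_model, X_train, y_train, X_test, y_test)   # helper from the LR sketch
```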
LDA Performance:
Training performance:
Insights:
Testing performance:
Insights:
The F1-score has marginally reduced to 0.87;
The recall value has dropped to 0.88 and accuracy dropped to 0.82;
Precision has remained unchanged, and AUC has marginally reduced to 0.88.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN model:
The basic model is built with default parameters and the train and test performance results have
been discussed.
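A sketch of the basic KNN model, using the scaled features since KNN is distance based:

```python
# Sketch of the basic KNN model with default parameters, on the scaled features.
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()   # default n_neighbors = 5
knn_model.fit(X_train_scaled, y_train)
evaluate(knn_model, X_train_scaled, y_train, X_test_scaled, y_test)
```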
KNN Performance:
Training performance:
Insights:
The AUC value of 0.933 means there is a very high chance that the classifier will be able to distinguish the positive class from the negative class.
Testing performance:
Insights:
Naive Bayes model:
A Naive Bayes classifier is an algorithm that uses Bayes' theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between the attributes of the data points. Well-known applications of Naive Bayes classifiers include spam filters, text analysis and medical diagnosis.
The basic model is built with default parameters and the train and test performance results have been discussed.
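A sketch of the Naive Bayes step; GaussianNB is assumed here because the predictors are numeric ratings:

```python
# Sketch of the Naive Bayes model; GaussianNB is assumed since the predictors are numeric.
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
evaluate(nb_model, X_train, y_train, X_test, y_test)
```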
Training performance:
Insights:
Testing performance:
Table 17: Comprehensive performance report- Naive Bayes Test Model
Insights:
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Before going further with the hyper-parameter tuning of the Machine Learning models generated
so far, 3 more basic Machine Learning models, namely, Bagging Classifier, ADA Boosting
Classifier and Gradient Boosting Classifier models have been built and their performances have
been analyzed. After building these 3 basic models, a separate section of the analysis illustrates
hyper-parameter tuning of all 6 models- Logistic Regression, LDA, KNN, Bagging, ADA
Boosting and Gradient Boosting, except Naive Bayes.
Bagging Classifier:
A Bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
Building of Bagging Classifier model:
The model is built with base_estimator set to a Random Forest classifier and n_estimators = 100, with all other parameters left at their defaults; the train and test performance results have been discussed below.
base_estimator: The base estimator is the base model to fit on random subsets of the dataset.
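A sketch of this Bagging setup; the random_state values are illustrative:

```python
# Sketch of the Bagging classifier with a Random Forest base estimator and 100 estimators.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag_model = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),   # `base_estimator=` on older scikit-learn versions
    n_estimators=100,
    random_state=1,
)
bag_model.fit(X_train, y_train)
evaluate(bag_model, X_train, y_train, X_test, y_test)
```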
Training performance:
Insights:
Testing performance:
Table 19: Comprehensive performance report-Bagging Test Model.
Insights:
ADA Boost algorithm, short for Adaptive Boosting, is a Boosting technique that is used as an
Ensemble Method in Machine Learning. It is called Adaptive Boosting as the weights are re-
assigned to each instance, with higher weights to incorrectly classified instances. Boosting is
used to reduce bias as well as the variance for supervised learning. It works on the principle
where learners are grown sequentially. Except for the first, each subsequent learner is grown
from previously grown learners.
The basic model is built with n_estimators = 100 and all other parameters at their defaults, and the train and test performance results have been discussed;
Base_estimator: The base estimator from which the boosted ensemble is built. The default is
DecisionTreeClassifier.
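A sketch of this ADA Boost setup; the random_state value is illustrative:

```python
# Sketch of the ADA Boost model: 100 estimators, default (decision tree) base estimator.
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(n_estimators=100, random_state=1)
ada_model.fit(X_train, y_train)
evaluate(ada_model, X_train, y_train, X_test, y_test)
```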
Training performance:
Insights:
Testing performance:
Table 21: Comprehensive performance report-ADA Boost Test Model.
Insights:
Gradient boosting classifiers are a group of machine learning algorithms that combine many
weak learning models together to create a strong predictive model. Decision trees are usually
used when doing gradient boosting.
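A sketch of the basic Gradient Boosting model with default parameters:

```python
# Sketch of the basic Gradient Boosting model with default parameters.
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(random_state=1)
gb_model.fit(X_train, y_train)
evaluate(gb_model, X_train, y_train, X_test, y_test)
```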
Training performance:
Table 22: Comprehensive performance report-Gradient Boost Train Model.
Insights:
Testing performance:
Insights:
AUC has dropped to 0.904;
The F1-score has reduced to 0.88;
The recall value has dropped to 0.87;
Accuracy dropped to 0.83.
A) Regularized Logistic Regression Model:
Hyperparameters tuning:
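The sketch below illustrates how such a grid search can be set up with GridSearchCV; the parameter grid is an assumption for illustration, since the report does not list the exact values searched for Logistic Regression.

```python
# Illustrative GridSearchCV setup for Logistic Regression. The grid values below
# are assumptions; the report does not list the exact parameters searched.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],            # assumed example values
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
}
grid_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, scoring="roc_auc", cv=5)
grid_lr.fit(X_train, y_train)

print(grid_lr.best_params_)              # best hyperparameters
best_lr = grid_lr.best_estimator_        # regularized (tuned) LR model
evaluate(best_lr, X_train, y_train, X_test, y_test)
```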
Best hyperparameters:
Training performance:
Table 24: Comprehensive performance report- Reg LR Train Model.
Insights:
Testing performance:
Table 25: Comprehensive performance report- Reg LR Test Model.
Insights:
B) Regularized LDA Model:
Hyperparameters tuning:
Best hyperparameters:
On executing the GridSearchCV function, the following best parameters were found:
Training performance:
Insights:
Testing performance:
Insights:
C) Regularized KNN Model
Hyperparameters tuning:
n_neighbors: Number of neighbors to use by default for k-neighbors queries. A range from 3 to 19 was considered.
metric: default.
weights: possible values:
'uniform': uniform weights; all points in each neighbourhood are weighted equally.
'distance': weight points by the inverse of their distance; in this case, closer neighbours of a query point will have a greater influence than neighbours which are further away.
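A sketch of this grid search, using the scaled features; the scoring metric and number of cross-validation folds are illustrative assumptions.

```python
# Sketch of the KNN grid search described above, on the scaled features.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": list(range(3, 20)),   # 3 to 19 as stated in the text
    "weights": ["uniform", "distance"],
}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=5)
grid_knn.fit(X_train_scaled, y_train)

print(grid_knn.best_params_)
best_knn = grid_knn.best_estimator_
evaluate(best_knn, X_train_scaled, y_train, X_test_scaled, y_test)
```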
Best hyperparameters:
Training performance:
Insights:
Testing performance:
Insights:
D) Regularized Bagging Classifier Model:
Hyperparameters tuning:
Fig 15: Parameters passed to find the best fit.
Best hyperparameters:
Training performance:
Insights:
The AUC value of 0.913 means there is a very high chance that the classifier will be able to distinguish the positive class from the negative class.
Testing performance:
Insights:
E) Regularized ADA Boosting Classifier Model:
Hyperparameters tuning:
learning_rate: Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. Here, 3 values have been tested.
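A sketch of this tuning step; the three learning-rate values below are assumptions, as the report does not list them.

```python
# Sketch of the ADA Boost tuning step. Three learning-rate values were tested per
# the text, but their exact values are not listed, so the values here are assumptions.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.1, 1.0],   # assumed example values
    "n_estimators": [100],
}
grid_ada = GridSearchCV(AdaBoostClassifier(random_state=1), param_grid, scoring="roc_auc", cv=5)
grid_ada.fit(X_train, y_train)

print(grid_ada.best_params_)
best_ada = grid_ada.best_estimator_
evaluate(best_ada, X_train, y_train, X_test, y_test)
```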
Best hyperparameters:
Training performance:
Insights:
Testing performance:
Table 33: Performance report-Reg ADA Boost Test Model.
Insights:
F) Regularized Gradient Boosting Classifier Model:
Hyperparameters tuning:
The following parameters were passed to find the best fit:
learning_rate: Learning rate shrinks the contribution of each tree by learning_rate. Here, 3 values (0.001, 0.01 and 0.2) were tested.
n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting, so a large number usually results in better performance.
max_features: The number of features to consider when looking for the best split. Here, 3 values (4, 5 and 6) were tested.
min_samples_split: The minimum number of samples required to split an internal node. Ideally, it is about 3 times min_samples_leaf.
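A sketch of this grid search; the learning_rate and max_features values come from the text, while the n_estimators and min_samples_split values are illustrative assumptions.

```python
# Sketch of the Gradient Boosting grid search. learning_rate and max_features values
# come from the text; n_estimators and min_samples_split values are assumptions.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.001, 0.01, 0.2],
    "max_features": [4, 5, 6],
    "n_estimators": [100, 200],          # assumed
    "min_samples_split": [30, 45, 60],   # assumed
}
grid_gb = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid,
                       scoring="roc_auc", cv=5)
grid_gb.fit(X_train, y_train)

print(grid_gb.best_params_)
best_gb = grid_gb.best_estimator_
evaluate(best_gb, X_train, y_train, X_test, y_test)
```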
Best hyperparameters:
Training performance:
Table 34: Performance report-Reg Gradient Boost Train Model.
Insights:
Testing performance:
Table 35: Comprehensive performance report-Reg Gradient Boost Test Model.
Insights:
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is best/
optimized.
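The sketch below shows how the ROC curves and ROC_AUC scores of all the fitted models can be plotted together on the test set; the model variable names follow the earlier sketches.

```python
# Sketch of plotting the test-set ROC curves and ROC_AUC scores of all fitted models.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

models = {
    "LR": lr_model, "LDA": lda_model, "KNN": knn_model, "NB": nb_model,
    "Bagging (RF)": bag_model, "ADA Boost": ada_model, "Gradient Boost": gb_model,
}
for name, model in models.items():
    X_eval = X_test_scaled if name == "KNN" else X_test   # KNN uses scaled features
    proba = model.predict_proba(X_eval)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves on the test set")
plt.legend()
plt.show()
```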
Table 36: Comparison of basic and regularized ML models.

MODEL            SET    | BEFORE tuning                               | AFTER tuning
                        | Precision  Recall  F1    Accuracy  AUC      | Precision  Recall  F1    Accuracy  AUC
LR               Train  | 0.87       0.91    0.89  0.84      0.89     | 0.87       0.91    0.89  0.84      0.89
LR               Test   | 0.87       0.89    0.88  0.82      0.883    | 0.87       0.89    0.88  0.82      0.883
LDA              Train  | 0.87       0.90    0.89  0.84      0.889    | 0.87       0.90    0.89  0.84      0.889
LDA              Test   | 0.87       0.88    0.87  0.82      0.884    | 0.87       0.88    0.87  0.82      0.884
NB               Train  | 0.88       0.88    0.88  0.83      0.887    | 0.88       0.88    0.88  0.83      0.887
NB               Test   | 0.89       0.86    0.87  0.82      0.885    | 0.89       0.86    0.87  0.82      0.885
KNN              Test   | 0.85       0.85    0.85  0.78      0.828    | 0.85       0.88    0.86  0.80      0.876
ADA Boost        Train  | 0.88       0.91    0.89  0.85      0.913    | 0.86       0.91    0.89  0.84      0.913
ADA Boost        Test   | 0.88       0.87    0.87  0.82      0.879    | 0.87       0.88    0.87  0.82      0.891
Gradient Boost   Train  | 0.91       0.93    0.92  0.89      0.950    | 0.86       0.93    0.90  0.85      0.913
Gradient Boost   Test   | 0.89       0.87    0.88  0.83      0.904    | 0.88       0.90    0.89  0.83      0.891
Bagging (RF)     Train  | 0.96       0.99    0.98  0.97      0.997    | 0.88       0.90    0.89  0.85      0.913
Bagging (RF)     Test   | 0.88       0.89    0.89  0.84      0.897    | 0.88       0.87    0.87  0.82      0.891
Note: The Naive Bayes classifier has no scope for regularization; hence, for the purpose of continuity and ease of comparison, the basic NB model performance metrics have been shown in the regularized performance table.
A) Comparison of LR Models:
In terms of performance, there is no change between the basic and regularized models. Hence, we can safely choose the regularized LR model for further comparison with the other Machine Learning models.
B) Comparison of LDA Models:
In terms of performance, there is no change between the basic and regularized models. Hence, we can safely choose the regularized LDA model for further comparison with the other Machine Learning models.
The NB model is chosen by default for further comparison, as it does not have a regularized version.
Accuracy: The regularized KNN model performs better, with a test accuracy of 80% compared to 78% for the basic model, but the regularized model is over-fit.
Recall: Recall of the regularized model has increased marginally in the testing phase, by 3%.
Precision: It has remained the same in the regularized test as in the basic test performance (85%).
AUC: The AUC test score is much better, at 87.6%, for the regularized model compared to the basic test score.
F1-score: The F1-score increased by 1% in the regularized testing phase compared to the basic test performance.
However, over-fitting on the Train set is clearly observed for the regularized KNN model. Hence, choosing the basic KNN model is the wiser decision.
Accuracy: The accuracy of the regularized ADA Boost model is marginally better than the basic model when the difference between train and test performance of the two models is considered.
Recall: Recall of the regularized model has increased marginally in the testing phase, by 1%, compared to the basic test performance.
AUC: The AUC test score is much better, at 89%, for the regularized model compared to the basic model's test score.
F1-score: The F1-score remains unchanged in the regularized model compared to the basic model's performance.
Overall, the regularized ADA Boosting Classifier model performs marginally better than the basic model; hence the regularized ADA Boosting Classifier model is chosen for further comparison with the other Machine Learning models.
Accuracy: The accuracy of the regularized Gradient Boosting model is marginally better than the basic model when the difference between train and test performance of the two models is considered.
Recall: Recall of the regularized model has increased marginally in the testing phase, by 3%, compared to the basic test performance.
AUC: The AUC test score of the regularized model is marginally better than the basic model when the difference between train and test performance of the two models is considered.
F1-score: The F1-score of the regularized model is better than the basic model's test performance.
Overall, the regularized Gradient Boosting Classifier model performs marginally better than the basic model and gives more consistent results between train and test. Hence, the regularized Gradient Boosting Classifier model is chosen for further comparison with the other Machine Learning models.
Accuracy: The basic Bagging model performs marginally better, with a test accuracy of 84% compared to 82% for the regularized model.
Recall: Recall of the regularized model has decreased marginally in the testing phase, by 2%, compared to the basic test performance.
Precision: It remains the same in the regularized test as in the basic test performance.
AUC: The AUC test score is much better, at 90%, for the basic model compared to the regularized test score.
F1-score: The F1-score decreased marginally, by 2%, in the regularized testing phase compared to the basic test performance.
Looking at the recall, precision and F1-score, the basic Bagging Classifier model is chosen for further comparison with the other Machine Learning models.
Final inferences:
Therefore, to conclude, we will go with the regularized Gradient Boosting model. There are no signs of over-fitting or under-fitting compared to the other models. Recall and F1-scores are also excellent, as is desirable for classification models. Moreover, the difference between test and train performance for this regularized model is very small, and it gives more consistent results between train and test than the other models.
1.8 Based on these predictions, what are the insights?
The business problem essentially revolved around developing a model to predict which party a voter would vote for, based on the information about the voters. The model will then be used to create an exit poll that will help in predicting the overall win and the seats covered by a particular party. To achieve this, the analysis assumed that CNBE wishes to focus more on accurately predicting Labour's win, and hence that has been the class of choice for prediction. The analysis and the Machine Learning models are based on a limited dataset of 1525 voters with specific details about the electors. Despite these limitations, the exercise has helped us find a few key insights and patterns, alongside identifying the optimal model that CNBE could use for the predictions mentioned above.
Insights summary:
The majority of voters are between the ages of 33 and 75, and there is no voter data captured for ages 18 to 24;
The majority of people think that the household and national economic conditions are satisfactory, as most have rated them 3 or 4 out of 5;
Conservative voters consist of a slightly higher proportion of older voters (50 years and above);
50% of the voters are above 53 years of age, and only the bottom 25% of voters are younger than 41 years;
The Labour leader Blair is more popular than the Conservative leader Hague, as Blair has received a rating of 4 on average, whereas Hague has received mixed ratings of 2 and 4;
The general population does not seem to be very Eurosceptic, as the cumulative frequency of non-Eurosceptic people (who opted for 6 or less) is higher than the cumulative frequency of Eurosceptic people (who opted for 7 or higher);
There are more female voters than male voters;
Conservative voters have better knowledge of the political parties' positions on European integration than their Labour counterparts;
Labour voters appear to hold a pro-European-integration opinion, as opposed to Conservative voters;
43% of Conservative voters have rated the national economic condition as average with a score of 3, further indicating that their overall assessment lies between poor and average;
National economic condition has a mild positive correlation with 'IsLabour_or_not', which means that as the impression of the national economic condition improves, the votes for Labour also increase;
Candidates can focus on improving the image of the economic conditions to gather more crowd favour;
Most people find Blair to be the better leader, and if the Conservative party wants to win, it has to focus on improving Hague's image among the people, or go with a different candidate.
Business Recommendations:
CNBE should gather data on voters aged between 18 and 24 so as to make the predictions more accurate;
It should be noted that the larger the number of voters surveyed, the better the Machine Learning models can be optimized;
The dataset should also include additional assessment ratings, for example on migration policy (including refugee settlement), employment generation, the income tax regime, etc.;
Based on the existing data, the most optimized model is found to be the regularized Gradient Boosting Classifier model. It would, however, need re-tuning of its hyperparameters with a larger dataset to accurately predict the win for the Labour party;
Irrespective of the size of the dataset, the regularized Gradient Boosting Classifier model could be deployed to build an exit poll, and it should still perform with a high degree of accuracy.
Problem 2 – Text Mining
Introduction:
In this particular project, we are supposed to work on the inaugural corpora from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon
Loading of dataset:
The 3 speeches (in text document format) are converted into a data frame for the purpose of text mining. We have created 3 different Excel files for the 3 different speeches. Below is a snapshot of the initial data frame:
2.1 Find the number of characters, words, and sentences for the mentioned documents.
Text mining involves various preprocessing of the text before starting to build a model. In this
case, plain text was used instead of the preprocessed text to perform a count of words, characters
and sentences.
Using pre-defined functions available with the inaugural corpus in the nltk toolkit, the counts of words, characters and sentences were computed as shown below:
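A sketch of how these counts can be computed is shown below; the three file ids are assumptions based on the speeches named in this report.

```python
# Sketch of counting characters, words and sentences with the nltk inaugural corpus.
# The three file ids are assumptions based on the speeches named in this report.
import nltk
nltk.download("inaugural")
nltk.download("punkt")
from nltk.corpus import inaugural

speech_ids = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]
for fid in speech_ids:
    raw = inaugural.raw(fid)
    print(fid,
          "characters:", len(raw),
          "words:", len(inaugural.words(fid)),
          "sentences:", len(inaugural.sents(fid)))
```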
Table 38: Data frames showing the word, character and sentence counts of the speeches.
Insights:
From the above table, Nixon's speech has the highest word and character counts (1769 and 10107) and a sentence count of 68;
The number of sentences in Nixon's speech (68) is similar to that of Roosevelt's speech;
Hence, Nixon's speech can be confirmed to be the longest of the 3 speeches.
2.2 Remove all the stopwords from all three speeches.
A few data pre-processing steps have been undertaken before removing the stopwords (such as 'a', 'the', 'then', 'is', etc.). Before the removal of stopwords, the number of stopwords in each speech is as follows:
The text must first be converted to lowercase in order to reduce redundant words such as 'The' and 'the'. To Python these are two separate words, which would make the models and word clouds built on them inaccurate. In order to mitigate this double counting of words, the text of the 3 speeches has been converted to lowercase.
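A sketch of the stopword-removal step, reusing the file ids assumed in 2.1:

```python
# Sketch of the stopword-removal step: lowercase, tokenize, drop stopwords and punctuation.
import nltk
nltk.download("stopwords")
nltk.download("punkt")
from nltk.corpus import inaugural, stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def clean_tokens(file_id):
    tokens = word_tokenize(inaugural.raw(file_id).lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

roosevelt_clean = clean_tokens("1941-Roosevelt.txt")
kennedy_clean = clean_tokens("1961-Kennedy.txt")
nixon_clean = clean_tokens("1973-Nixon.txt")
print(len(roosevelt_clean), len(kennedy_clean), len(nixon_clean))
```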
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stopwords).
After the removal of the stopwords, an important text pre-processing step is applied to reduce the words to their root forms, called stemming.
Stemming is a rule-based approach because it slices the inflected part off a word, as a prefix or suffix, using a set of commonly occurring prefixes and suffixes such as '-ing', '-ed', '-es', 'pre-', etc. It can result in a stem that is not an actual word. Using the Porter Stemmer available in the nltk package, the texts of the 3 speeches are stemmed to their root words.
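A sketch of the stemming and frequency-count step, reusing the clean_tokens() helper from the 2.2 sketch:

```python
# Sketch of stemming the cleaned tokens with the Porter Stemmer and listing the
# most frequent stems.
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def top_words(file_id, n=10):
    stems = [stemmer.stem(t) for t in clean_tokens(file_id)]
    return FreqDist(stems).most_common(n)

print(top_words("1941-Roosevelt.txt"))
print(top_words("1961-Kennedy.txt"))
print(top_words("1973-Nixon.txt"))
```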
Table 42: Top 10 words used in Roosevelt’s speech.
The top 3 words used by Roosevelt in his speech are: 'nation' (10 times), 'know' (10 times) and 'spirit' (8 times). However, the word 'us' is also used 8 times.
Table 43: Top 10 words used in Kennedy's speech.
The top 3 words used by Kennedy in his speech are: 'us' (11 times), 'let' (11 times) and 'sides' (8 times).
Table 44: Top 10 words used in Nixon’s speech.
The top 3 words used by Nixon in his speech are: ‘us’ (26 times), ‘peace’ (15 times) and ‘new’
(15 times).
2.4 Plot the word cloud of each of the speeches of the variable. (After removing the
stopwords)
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a textual data source (such as a speech, blog post, news story or database), the bigger and bolder it appears in the word cloud. A word cloud is a collection, or group, of words represented in different sizes: the bigger and bolder a word appears, the more often it is mentioned in a given text and the more important it is.
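A sketch of generating the word clouds from the cleaned tokens, using the third-party wordcloud package:

```python
# Sketch of plotting a word cloud per speech from the stopword-free tokens,
# using the third-party `wordcloud` package and matplotlib.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for file_id in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    text = " ".join(clean_tokens(file_id))        # helper from the 2.2 sketch
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(file_id)
    plt.show()
```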
Roosevelt's speech analysis:
Insights:
Roosevelt used the word 'nation' the most in his speech, followed by words such as 'spirit', 'people', 'life' and 'America';
Further, his speech also stressed positive words like 'spirit', 'security', 'life', 'faith', etc.;
Other prominent words visible are 'live', 'freedom', 'people', 'America', 'preserve' and 'history';
Based on the word cloud, the sentiment of his speech is positive;
His speech seems to encourage the audience to preserve America's history, democracy and freedom;
The president talks about the country, which is stressed through words like 'nation', 'America' and 'people'.
Kennedy’s speech analysis:
Insights:
Kennedy in his speech stressed words like 'let', 'new', 'world', 'power', 'nation', 'side', etc.;
Unlike Roosevelt's speech, his speech seems to centre on 'power', 'nation' and 'world'. These words point to a more aggressive approach to building America as a new world power;
Also, unlike Roosevelt's speech, words such as 'peace', 'hope', 'us', 'human', 'friend' and 'citizen' are less prominent in his speech;
The word cloud implies that his speech asks the audience to embrace and support the idea of America's global power, rather than the actual well-being of its citizens;
However, this sentiment inference could be challenged in a different context.
Nixon’s speech analysis:
Insights:
Nixon's speech centres around words like 'let', 'us', 'peace', 'world', 'America', 'nation', 'role', 'government', etc.;
Unlike Kennedy's speech, the general idea of his speech appears to be how the people of the United States can contribute to world peace;
His speech, similar to Roosevelt's, conveys a positive tone, which shows in the repeated use of words such as 'live', 'build', 'right', 'together', 'promise' and 'justice';
The word cloud thus suggests an overall positive encouragement for the audience;
It is interesting to note that Nixon's speech takes a more balanced approach to nation-building and its positive impact in a global context, compared to Roosevelt's speech, where the whole message focused on the human aspects of the country. This is derived from common words such as 'peace', 'home', 'nation', 'together', 'faith' and 'justice'.
Conclusion:
The Text Mining and Sentiment Analysis exercise highlighted the key steps taken to preprocess text and how text can be used to visually analyze the sentiment of a full speech.
Based on this analysis, both Roosevelt and Nixon spoke positively about the welfare of the people and supported the ideas of peace and freedom. In contrast, Kennedy's speech focused on the United States as a global power rather than on the welfare of its citizens.