Machine Learning Project Report
By Suraj Shaw
Date of Submission: 28th January
#PROBLEM 1
#Context: CNBE, a prominent news channel, is gearing up to provide insightful coverage of
recent elections, recognizing the importance of data-driven analysis. A comprehensive
survey has been conducted, capturing the perspectives of 1525 voters across various
demographic and socio-economic factors. This dataset encompasses 9 variables, offering a
rich source of information regarding voters' characteristics and preferences.
#Objective: The primary objective is to leverage machine learning to build a predictive
model capable of forecasting which political party a voter is likely to support. This
predictive model, developed based on the provided information, will serve as the
foundation for creating an exit poll. The exit poll aims to contribute to the accurate
prediction of the overall election outcomes, including determining which party is likely to
secure the majority of seats.
#Data Description: vote: Party choice: Conservative or Labour
age: in years
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to 5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: an 11-point scale that measures respondents' attitudes toward European
integration. High scores represent ‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
gender: female or male.
#Check shape, Data types, and statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key
meaningful observations on individual variables and the relationship between variables
df.head()
df.describe().T
                           75%      max
Unnamed: 0              1144.0   1525.0
age                       67.0     93.0
economic.cond.national     4.0      5.0
economic.cond.household    4.0      5.0
Blair                      4.0      5.0
Hague                      4.0      5.0
Europe                    10.0     11.0
political.knowledge        2.0      3.0
Missing-value counts per column (no missing values in any column):
Unnamed: 0                 0
vote                       0
age                        0
economic.cond.national     0
economic.cond.household    0
Blair                      0
Hague                      0
Europe                     0
political.knowledge        0
gender                     0
dtype: int64
The number of rows and columns is shown in coordinate form; the first value is the number of rows and the second is the number of columns:
(1525, 10)
#To check how many categories are present in the 'vote' column
Labour 1063
Conservative 462
Name: vote, dtype: int64
VOTE 2
Conservative 462
Labour 1063
Name: vote, dtype: int64
GENDER 2
male 713
female 812
Name: gender, dtype: int64
df.dtypes
Unnamed: 0 int64
vote object
age int64
economic.cond.national int64
economic.cond.household int64
Blair int64
Hague int64
Europe int64
political.knowledge int64
gender object
dtype: object
#EDA
Univariate analysis explores each variable in a dataset separately.
Performing univariate analysis on the column 'age', we can see that most of the respondents in this data are in the age bracket 40-60.
Performing univariate analysis on the column 'gender', we can see that the number of females in the dataset is greater than the number of males.
Performing univariate analysis on the column 'vote', we can see that there is more preference for the Labour party than for the Conservatives.
Performing univariate analysis on the column 'economic.cond.national', we find that the majority of the respondents have a medium assessment of current national economic conditions, as most of the population lies in the 3 to 4 buckets.
Distribution of 'political.knowledge' (proportion of respondents per score):
2    0.512787
0    0.298361
3    0.163934
1    0.024918
Name: political.knowledge, dtype: float64
Performing univariate analysis on the column 'Blair', we find that the majority of the respondents have a good assessment of the Labour leader.
Performing univariate analysis on the column 'Hague', we find that the majority of the respondents have a low assessment of the Conservative leader; however, a fair number have a high assessment as well.
Performing univariate analysis on the column 'Europe', we find that most of the respondents have 'Eurosceptic' sentiments.
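These univariate observations come from simple distribution plots. A minimal sketch of the kind of code that produces them (seaborn and matplotlib are assumed; the exact plot calls are not shown in the original output):
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['age'], bins=20)       # age distribution: bulk of respondents between 40 and 60
plt.show()
for col in ['vote', 'gender', 'economic.cond.national', 'Blair', 'Hague', 'Europe']:
    sns.countplot(x=col, data=df)      # frequency of each category / rating
    plt.show()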
#Bivariate Analysis
Performing bivariate analysis on the columns 'vote' and 'age', we can see that younger people have a lower probability of voting Conservative; the strip plot shows that the probability of voting Conservative is low for younger respondents.
Bivariate analysis between 'political.knowledge' and 'age':
Here, we can see that the majority of the population has a moderate understanding of the political situation.
However, the middle-aged (35 to 50) population seems to have a better understanding than the others.
From the plot, we see that the population of both middle-aged males and females is larger than that of the other age groups.
Bivariate analysis between 'economic.cond.national' and 'age' shows a similar trend to its univariate analysis.
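A sketch of the kind of bivariate plots referred to above (the strip plot of 'age' against 'vote' is mentioned in the text; the other plot calls are assumptions):
import seaborn as sns
import matplotlib.pyplot as plt

sns.stripplot(x='vote', y='age', data=df, jitter=True)       # vote preference versus age
plt.show()
sns.boxplot(x='political.knowledge', y='age', data=df)       # political knowledge versus age
plt.show()
sns.boxplot(x='economic.cond.national', y='age', data=df)    # national economic assessment versus age
plt.show()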
#Outlier Detection (treat, if needed) - Encode the data - Data split - Scale the data (and state your reasons for scaling the features)
#Checking for Outliers
I am checking for outliers only in the column 'age', because the rest of the columns hold categorical values.
We see that there are no outliers in the data.
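A minimal sketch of the outlier check on 'age' (a box plot plus the usual 1.5×IQR rule; the exact code is an assumption, not shown in the original output):
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['age'])                                   # visual check: no points beyond the whiskers
plt.show()
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)]
print(len(outliers))                                       # 0, i.e. no outliers in 'age'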
Encoding the string-valued columns for modelling.
Index(['vote', 'age', 'economic.cond.national', 'economic.cond.household',
'Blair', 'Hague', 'Europe', 'political.knowledge', 'gender'],
dtype='object')
ml_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vote 1525 non-null category
1 age 1525 non-null int64
2 economic.cond.national 1525 non-null int64
3 economic.cond.household 1525 non-null int64
4 Blair 1525 non-null int64
5 Hague 1525 non-null int64
6 Europe 1525 non-null int64
7 political.knowledge 1525 non-null int64
8 gender 1525 non-null category
dtypes: category(2), int64(7)
memory usage: 86.7 KB
Labour 1063
Conservative 462
Name: vote, dtype: int64
ml_1['vote']=np.where(ml_1['vote']=='Conservative', 0, ml_1['vote'])
ml_1['vote']=np.where(ml_1['vote']=='Labour', 1, ml_1['vote'])
ml_1['vote'].value_counts()
1 1063
0 462
Name: vote, dtype: int64
ml_1['gender'].value_counts()
female 812
male 713
Name: gender, dtype: int64
ml_1['gender']=np.where(ml_1['gender']=='male', 0, ml_1['gender'])
ml_1['gender']=np.where(ml_1['gender']=='female', 1, ml_1['gender'])
ml_1['gender'].value_counts()
1 812
0 713
Name: gender, dtype: int64
ml_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vote 1525 non-null object
1 age 1525 non-null int64
2 economic.cond.national 1525 non-null int64
3 economic.cond.household 1525 non-null int64
4 Blair 1525 non-null int64
5 Hague 1525 non-null int64
6 Europe 1525 non-null int64
7 political.knowledge 1525 non-null int64
8 gender 1525 non-null object
dtypes: int64(7), object(2)
memory usage: 107.4+ KB
Scaling is not necessary in this case, as apart from 'age' all the other columns hold categorical (ordinal) values on small scales.
#Splitting the data
Before splitting, we need to identify the target variable. Here, the target variable is "vote".
# Arrange data into independent variables and dependent variables
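A minimal sketch of the split (the 70/30 ratio matches the 1067/458 train/test sizes shown in the info() outputs below; the random_state value is an assumption):
from sklearn.model_selection import train_test_split

X = ml_1.drop('vote', axis=1)     # independent variables
y = ml_1['vote']                  # dependent (target) variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)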
#Metrics of Choice (Justify the evaluation metrics) - Model Building (KNN, Naive Bayes, Bagging, Boosting)
#Applying Logistic Regression
X_train['gender']=pd.to_numeric(X_train['gender'])
X_test['gender']=pd.to_numeric(X_test['gender'])
y_train = pd.to_numeric(y_train)
y_test = pd.to_numeric(y_test)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1067 entries, 1453 to 1061
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1067 non-null int64
1 economic.cond.national 1067 non-null int64
2 economic.cond.household 1067 non-null int64
3 Blair 1067 non-null int64
4 Hague 1067 non-null int64
5 Europe 1067 non-null int64
6 political.knowledge 1067 non-null int64
7 gender 1067 non-null int64
dtypes: int64(8)
memory usage: 75.0 KB
X_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 458 entries, 91 to 776
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 458 non-null int64
1 economic.cond.national 458 non-null int64
2 economic.cond.household 458 non-null int64
3 Blair 458 non-null int64
4 Hague 458 non-null int64
5 Europe 458 non-null int64
6 political.knowledge 458 non-null int64
7 gender 458 non-null int64
dtypes: int64(8)
memory usage: 32.2 KB
y_train.info()
<class 'pandas.core.series.Series'>
Int64Index: 1067 entries, 1453 to 1061
Series name: vote
Non-Null Count Dtype
-------------- -----
1067 non-null int64
dtypes: int64(1)
memory usage: 16.7 KB
y_test.info()
<class 'pandas.core.series.Series'>
Int64Index: 458 entries, 91 to 776
Series name: vote
Non-Null Count Dtype
-------------- -----
458 non-null int64
dtypes: int64(1)
memory usage: 7.2 KB
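The logistic regression figures below were produced along these lines (a minimal sketch; the solver settings and max_iter value are assumptions):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
for X_, y_ in [(X_train, y_train), (X_test, y_test)]:
    pred = lr.predict(X_)
    print(accuracy_score(y_, pred))        # accuracy on train, then test
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))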
LOGISTIC REGRESSION
Training data
Accuracy: 0.8397375820056232
Confusion matrix:
[[229 103]
 [ 68 667]]
Testing data
Accuracy: 0.8231441048034934
Confusion matrix:
[[ 85  45]
 [ 36 292]]

LDA
Training data
Accuracy: 0.8369259606373008
Confusion matrix:
[[233  99]
 [ 75 660]]
Testing data
Accuracy: 0.8187772925764192
Confusion matrix:
[[ 86  44]
 [ 39 289]]
Applying LDA, we see that we get a fairly good model with an accuracy of about 82% on the test data, and the train and test performance are within the accepted limit of +/-10%.
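A sketch of the LDA fit behind those figures (the default solver is assumed):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(accuracy_score(y_test, lda.predict(X_test)))    # about 0.82 on the test data
print(confusion_matrix(y_test, lda.predict(X_test)))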
Applying the KNN model and then applying model tuning to improve the prediction on the test data.
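A minimal sketch of the KNN fit (the value of k used is not shown; the default k=5 is assumed):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

knn = KNeighborsClassifier(n_neighbors=5)     # assumed k
knn.fit(X_train, y_train)
print(accuracy_score(y_train, knn.predict(X_train)))
print(confusion_matrix(y_train, knn.predict(X_train)))
print(accuracy_score(y_test, knn.predict(X_test)))
print(confusion_matrix(y_test, knn.predict(X_test)))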
#Performance metrics on the train data set
Accuracy: 0.8584817244611059
Confusion matrix:
[[672  63]
 [ 88 244]]
#Performance metrics on the test data set
Accuracy: 0.7816593886462883
Confusion matrix:
[[277  51]
 [ 49  81]]
Here, we can see that there is not much difference in the model scores (around 78% accuracy on the test data). Hence, model tuning will not necessarily bring a performance improvement.
Training data
Accuracy: 0.8331771321462043
Confusion matrix:
[[649  86]
 [ 92 240]]
              precision    recall  f1-score   support
           0       0.88      0.88      0.88       735
           1       0.74      0.72      0.73       332
Testing data
Accuracy: 0.8253275109170306
Confusion matrix:
[[284  44]
 [ 36  94]]
The Naive Bayes model also seems to give a fair output when comparing the train and test data, with around 83% accuracy.
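A sketch of the Naive Bayes fit behind the figures above (GaussianNB is assumed, since the features are treated as numeric):
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

nb = GaussianNB()
nb.fit(X_train, y_train)
print(accuracy_score(y_train, nb.predict(X_train)))
print(confusion_matrix(y_train, nb.predict(X_train)))
print(accuracy_score(y_test, nb.predict(X_test)))
print(confusion_matrix(y_test, nb.predict(X_test)))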
Grid search is used to find the optimal hyperparameters of a model, resulting in the most 'accurate' predictions. Here we have tried to minimize the false negatives in the KNN model by using grid search to find the optimal parameters. Grid search can be used to improve any specific evaluation metric.
from sklearn.model_selection import GridSearchCV
# Create the GridSearchCV object (the grid and scoring below are assumed examples; the exact values are not shown in the project)
param_grid = {'n_neighbors': range(3, 22, 2), 'weights': ['uniform', 'distance']}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='recall', cv=5)
# Fit the model to the training data
grid_knn.fit(X_train, y_train)
# Print the best hyperparameters
print(grid_knn.best_params_)
Accuracy: 0.7882096069868996

Training data
Accuracy: 0.8584817244611059
Confusion matrix:
[[672  63]
 [ 88 244]]
              precision    recall  f1-score   support
           0       0.88      0.91      0.90       735
           1       0.79      0.73      0.76       332
We can see that the recall has changed slightly in each model; however, the accuracy has also changed.
Bagging is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. Bagging reduces variance and helps avoid overfitting. Using a Random Forest classifier for bagging below:
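A minimal sketch of the bagging step with a Random Forest classifier, as described above (the hyperparameters are assumptions):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rf = RandomForestClassifier(n_estimators=100, random_state=1)   # assumed settings
rf.fit(X_train, y_train)
print(accuracy_score(y_train, rf.predict(X_train)))
print(confusion_matrix(y_train, rf.predict(X_train)))
print(accuracy_score(y_test, rf.predict(X_test)))
print(confusion_matrix(y_test, rf.predict(X_test)))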
Training data:
0.7816593886462883
[[726   9]
 [ 27 305]]
Test data:
0.7816593886462883
[[277  51]
 [ 49  81]]
However, we can see that this is not a very strong model, as the difference between the train and test scores exceeds the recommended +/-10%.
Boosting is an ensemble strategy that sequentially builds on weak learners in order to generate one final strong learner.
By building a weak model, drawing conclusions about the various feature importances and parameters, and then using those conclusions to build a new, stronger model, boosting can effectively convert weak learners into a strong learner.
The method of boosting used here is AdaBoost (Adaptive Boosting).
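A sketch of the AdaBoost fit (the number of estimators and random_state are assumptions):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

ada = AdaBoostClassifier(n_estimators=100, random_state=1)   # assumed settings
ada.fit(X_train, y_train)
print(accuracy_score(y_train, ada.predict(X_train)))
print(confusion_matrix(y_train, ada.predict(X_train)))
print(accuracy_score(y_test, ada.predict(X_test)))
print(confusion_matrix(y_test, ada.predict(X_test)))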
Training data:
0.7816593886462883
[[666  69]
 [ 94 238]]
Test data:
0.7816593886462883
[[277  51]
 [ 49  81]]
#PROBLEM 2:
In this particular project, we are going to work on the inaugural corpus from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
#2.1) Find the number of characters, words and sentences for the mentioned documents.
len(inaugural.fileids())
59

Number of characters:
Roosevelt (1941) - 7571
Kennedy (1961) - 7618
Nixon (1973) - 9991

Number of sentences (S1-S3):
S1 (Roosevelt) - 68
S2 (Kennedy) - 52
S3 (Nixon) - 69

Number of words (W1-W3):
In W1 -- ['On', 'each', 'national', 'day', 'of', 'inauguration', ...]
W1 (Roosevelt) - 1536
W2 (Kennedy) - 1546
W3 (Nixon) - 2028
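A minimal sketch of how these counts can be obtained from the nltk inaugural corpus (the fileids are the standard names in the corpus; downloading the corpus data beforehand is assumed):
import nltk
from nltk.corpus import inaugural
# nltk.download('inaugural'); nltk.download('punkt')   # if the corpus data is not already available

for fid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    print(fid,
          len(inaugural.raw(fid)),     # number of characters
          len(inaugural.words(fid)),   # number of words (tokens, including punctuation)
          len(inaugural.sents(fid)))   # number of sentences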
#2.2 Remove all the stopwords from the three speeches. Show the word count before and after the removal of stopwords. Show a sample sentence after the removal of stopwords.
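The outputs below were generated along these lines (a minimal sketch for the Roosevelt speech; the exact filtering condition and the extra punctuation tokens added to the stopword list are assumptions based on the stop_words and add_to_stop_words outputs shown further down):
from nltk.corpus import inaugural, stopwords
from nltk.probability import FreqDist
# nltk.download('stopwords')   # if the stopword list is not already available

stop_words = stopwords.words('english')
stop_words = stop_words + [',', '.', '-', ';', '--', 'It', 'The', 'We', 'us', 'Let', 'let']
words1_all = inaugural.words('1941-Roosevelt.txt')
words1 = [w for w in words1_all if w not in stop_words]
print(len(words1_all), len(words1))     # word count before and after the removal of stopwords
FreqDist(words1)                        # most frequent remaining words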
#1) Roosevelt - Stopword removal
words1
['On',
'each',
'national',
'day',
'of',
'inauguration',
'since',
'1789',
',',
'the',
'people',
'have',
'renewed',
'Those',
'who',
'first',
------
'came',
'here',
'to',
'carry',
'out',
'the',
'millions',
'who',
'followed',
...]
FreqDist({'the': 104, 'of': 81, ',': 77, '.': 67, 'and': 44, 'to': 35, 'in':
30, 'a': 29, '--': 25, 'is': 24, ...})
stop_words
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',----
------------
'her',
'hers',
'herself',
'it',
"it's",
',',
'.',
'-',
';',
'--',
'It',
'The',
'We']
['Vice',
'President',
'Johnson',
',',
'Mr',
'.',
'Speaker',
',',
'Mr',
'.',
'Chief',
'Justice',
',',
'President',
'Eisenhower',
',',
'Vice',
'President',
'Nixon',
',',-----
-----------
',',
'tap',
'the',
'ocean',
'depths',
',',
'and',
'encourage',
'the',
'arts',
...]
FreqDist({',': 85, 'the': 83, 'of': 65, '.': 51, 'to': 38, 'and': 37, 'a':
29, 'we': 27, '--': 25, 'in': 24, ...})
add_to_stop_words = [',','.','-',';','--','It','The','We','us','Let','let']
['Mr',
'.',
'Vice',
'President',
',',
'Mr',
'.',
'Speaker',
',',
'Mr',
'.',
'Chief',
'Justice',
',',
'Senator',
'Cook',
',',
'Mrs',
'.',
'Eisenhower',
',',
'and',
'my',
'fellow',
'citizens',
'of',
'this',
'great',
'and',
'good',
'country',
'we',
'share',
-----
----
'those',
'new',
'responsibilities',
'lies',
'in',
'the',
'placing',
'and',
'the',
'division',
'of',
'responsibility',
'.',
'We',
'have',
'lived',
'too',
...]
FreqDist({',': 96, 'the': 80, '.': 68, 'of': 68, 'to': 65, 'in': 54, 'and':
47, 'we': 38, 'a': 34, 'that': 32, ...})
add_to_stop_words = [',','.','-',';','--','It','The','We','us','Let','let']
FreqDist({'America': 21, 'peace': 19, 'world': 17, 'new': 15, "'": 14, 'I':
12, 'responsibility': 11, 'great': 9, 'home': 9, 'nation': 9, ...})
Word counts after the removal of stopwords:
A1=len(words1)
print(A1)
A2=len(words2)
print(A2)
A3=len(words3)
print(A3)
670
716
857
2) Kennedy
3) Nixon
Here, we get the most common words used by each of the presidents, excluding the stopwords, by looking at the words with the bigger fonts in the word clouds.
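A sketch of how such a word cloud can be generated (assuming the third-party wordcloud package, and that words1/words2/words3 hold the filtered word lists as above; the styling options are assumptions):
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for name, words in [('Roosevelt', words1), ('Kennedy', words2), ('Nixon', words3)]:
    wc = WordCloud(background_color='white', max_words=100).generate(' '.join(words))
    plt.imshow(wc, interpolation='bilinear')   # larger fonts correspond to more frequent words
    plt.axis('off')
    plt.title(name)
    plt.show()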