
Machine Learning

Project Report

(Different models and Text Learning Case Study)


Module 6 – DSBA

By Suraj Shaw
Date of Submission: 28th Jan
#PROBLEM 1
#Context: CNBE, a prominent news channel, is gearing up to provide insightful coverage of
recent elections, recognizing the importance of data-driven analysis. A comprehensive
survey has been conducted, capturing the perspectives of 1525 voters across various
demographic and socio-economic factors. This dataset encompasses 9 variables, offering a
rich source of information regarding voters' characteristics and preferences.
#Objective: The primary objective is to leverage machine learning to build a predictive
model capable of forecasting which political party a voter is likely to support. This
predictive model, developed based on the provided information, will serve as the
foundation for creating an exit poll. The exit poll aims to contribute to the accurate
prediction of the overall election outcomes, including determining which party is likely to
secure the majority of seats.
#Data Description: vote: Party choice: Conservative or Labour
age: in years
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to 5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: an 11-point scale that measures respondents' attitudes toward European
integration. High scores represent ‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
gender: female or male.

#Check shape, Data types, and statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key
meaningful observations on individual variables and the relationship between variables
df.head()

   Unnamed: 0    vote  age  economic.cond.national  economic.cond.household  \
0           1  Labour   43                       3                        3
1           2  Labour   36                       4                        4
2           3  Labour   35                       4                        4
3           4  Labour   24                       4                        2
4           5  Labour   41                       2                        2

   Blair  Hague  Europe  political.knowledge  gender
0      4      1       2                    2  female
1      4      4       5                    2    male
2      5      2       3                    2    male
3      2      1       4                    0  female
4      1      1       6                    2    male

df.describe().T

                          count        mean         std   min    25%    50%  \
Unnamed: 0               1525.0  763.000000  440.373894   1.0  382.0  763.0
age                      1525.0   54.182295   15.711209  24.0   41.0   53.0
economic.cond.national   1525.0    3.245902    0.880969   1.0    3.0    3.0
economic.cond.household  1525.0    3.140328    0.929951   1.0    3.0    3.0
Blair                    1525.0    3.334426    1.174824   1.0    2.0    4.0
Hague                    1525.0    2.746885    1.230703   1.0    2.0    2.0
Europe                   1525.0    6.728525    3.297538   1.0    4.0    6.0
political.knowledge      1525.0    1.542295    1.083315   0.0    0.0    2.0

                            75%     max
Unnamed: 0               1144.0  1525.0
age                        67.0    93.0
economic.cond.national      4.0     5.0
economic.cond.household     4.0     5.0
Blair                       4.0     5.0
Hague                       4.0     5.0
Europe                     10.0    11.0
political.knowledge         2.0     3.0

# Check whether any missing values are present in the data

df.isnull().sum()

Unnamed: 0 0
vote 0
age 0
economic.cond.national 0
economic.cond.household 0
Blair 0
Hague 0
Europe 0
political.knowledge 0
gender 0
dtype: int64

# Check the data types using info()

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1525 non-null int64
1 vote 1525 non-null object
2 age 1525 non-null int64
3 economic.cond.national 1525 non-null int64
4 economic.cond.household 1525 non-null int64
5 Blair 1525 non-null int64
6 Hague 1525 non-null int64
7 Europe 1525 non-null int64
8 political.knowledge 1525 non-null int64
9 gender 1525 non-null object
dtypes: int64(8), object(2)
memory usage: 119.3+ KB

## Check whether any duplicate rows are present in the data

dups = df.duplicated()
print("Total number of duplicate values = %d" % (dups.sum()))


df[dups]

Total number of duplicate values = 0

df.shape returns the dimensions as a (rows, columns) pair:
(1525, 10)

# Check the category counts in the 'vote' column

df['vote'].value_counts()

Labour 1063
Conservative 462
Name: vote, dtype: int64

# Check the unique values in the object-type variables

VOTE 2
Conservative 462
Labour 1063
Name: vote, dtype: int64
GENDER 2
male 713
female 812
Name: gender, dtype: int64
df.dtypes

Unnamed: 0 int64
vote object
age int64
economic.cond.national int64
economic.cond.household int64
Blair int64
Hague int64
Europe int64
political.knowledge int64
gender object
dtype: object

#EDA
Univariate analysis explores each variable in a dataset separately.
Performing univariate analysis on the column 'age', we can see that most of the
respondents are in the age bracket 40-60.
Performing univariate analysis on the column 'gender', we can see that there are
more females in the dataset than males.

Performing univariate analysis on the column 'vote', we can see that there is more
preference for the Labour party than for the Conservatives.
Performing univariate analysis on the column 'economic.cond.national', we find that
the majority of respondents give a medium assessment of current national economic
conditions, with most responses in the 3-4 range.

Performing univariate analysis on the column 'political.knowledge', we find that the
majority of respondents (around 51%) are fairly aware of the parties' positions on
European integration. However, a large share (around 29%) is not aware of them at all.

2    0.512787
0    0.298361
3    0.163934
1    0.024918
Name: political.knowledge, dtype: float64

Performing univariate analysis on the column 'Blair', we find that the majority of the
respondents have a good assessment of the Labour leader.
Performing univariate analysis on the column 'Hague', we find that the majority of the
respondents have a low assessment of the Conservative leader, although a sizeable
share has a high assessment as well.
Performing univariate analysis on the column 'Europe', we see that most of the
respondents hold 'Eurosceptic' sentiments.
#Bivariate Analysis
Performing bivariate analysis on the columns 'vote' and 'age', we can see that
younger people are less likely to vote Conservative; the strip plot shows a low
probability of voting Conservative at younger ages.
Bivariate analysis between 'political.knowledge' and 'age':

Here, we can see that the majority of the population has a moderate understanding
of the political situation.
However, the middle-aged (35 to 50) population seems to have a better
understanding than the others.
From the plot, we see that middle-aged respondents, both male and female,
outnumber the other age groups.
Bivariate analysis between 'economic.cond.national' and 'age' shows a similar trend
to its univariate analysis.

#Outlier detection (treat, if needed) - Encode the data - Data split - Scale the data (and
state your reasons for scaling the features)
#Checking for outliers
Outliers are checked only for the column 'age', because the rest of the columns hold
categorical values.
We see that there are no outliers in the data.
Encoding the string columns for modelling:
Index(['vote', 'age', 'economic.cond.national', 'economic.cond.household',
'Blair', 'Hague', 'Europe', 'political.knowledge', 'gender'],
dtype='object')

# Convert the string columns ('vote', 'gender') to categorical codes
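The conversion step itself is not shown in the report; a minimal sketch that would produce the category dtypes in the info() output below, assuming ml_1 is a copy of df without the 'Unnamed: 0' index column:

# Hypothetical reconstruction of the encoding setup
ml_1 = df.drop('Unnamed: 0', axis=1)                 # drop the exported row index
ml_1['vote'] = ml_1['vote'].astype('category')       # cast string columns to category
ml_1['gender'] = ml_1['gender'].astype('category')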

ml_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vote 1525 non-null category
1 age 1525 non-null int64
2 economic.cond.national 1525 non-null int64
3 economic.cond.household 1525 non-null int64
4 Blair 1525 non-null int64
5 Hague 1525 non-null int64
6 Europe 1525 non-null int64
7 political.knowledge 1525 non-null int64
8 gender 1525 non-null category
dtypes: category(2), int64(7)
memory usage: 86.7 KB

Labour 1063
Conservative 462
Name: vote, dtype: int64

ml_1['vote']=np.where(ml_1['vote']=='Conservative', 0, ml_1['vote'])
ml_1['vote']=np.where(ml_1['vote']=='Labour', 1, ml_1['vote'])

ml_1['vote'].value_counts()

1 1063
0 462
Name: vote, dtype: int64

ml_1['gender'].value_counts()

female 812
male 713
Name: gender, dtype: int64
ml_1['gender']=np.where(ml_1['gender']=='male', 0, ml_1['gender'])
ml_1['gender']=np.where(ml_1['gender']=='female', 1, ml_1['gender'])

ml_1['gender'].value_counts()

1 812
0 713
Name: gender, dtype: int64

ml_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vote 1525 non-null object
1 age 1525 non-null int64
2 economic.cond.national 1525 non-null int64
3 economic.cond.household 1525 non-null int64
4 Blair 1525 non-null int64
5 Hague 1525 non-null int64
6 Europe 1525 non-null int64
7 political.knowledge 1525 non-null int64
8 gender 1525 non-null object
dtypes: int64(7), object(2)
memory usage: 107.4+ KB

Scaling is not necessary in this case, as all the columns other than 'age' hold
categorical values on small, fixed scales.
#Splitting the data
Before splitting, I need to identify the target variable. Here, the target variable is "vote".
# Arrange data into independent variables and dependent variables
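The split itself is not shown in the report; a minimal sketch that reproduces the 1067/458 train/test row counts seen later (a 70/30 split; the random_state used in the report is unknown, so 1 is an assumption):

from sklearn.model_selection import train_test_split

X = ml_1.drop('vote', axis=1)   # independent variables
y = ml_1['vote']                # dependent variable (target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)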

#- Metrics of Choice (Justify the evaluation metrics) - Model Building (KNN, Naive Bayes,
Bagging, Boosting)
#Applying Logistic Regression

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1067 entries, 1453 to 1061
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   age                      1067 non-null   int64
 1   economic.cond.national   1067 non-null   int64
 2   economic.cond.household  1067 non-null   int64
 3   Blair                    1067 non-null   int64
 4   Hague                    1067 non-null   int64
 5   Europe                   1067 non-null   int64
 6   political.knowledge      1067 non-null   int64
 7   gender                   1067 non-null   object
dtypes: int64(7), object(1)
memory usage: 75.0+ KB

X_train['gender']=pd.to_numeric(X_train['gender'])
X_test['gender']=pd.to_numeric(X_test['gender'])
y_train = pd.to_numeric(y_train)
y_test = pd.to_numeric(y_test)

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1067 entries, 1453 to 1061
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1067 non-null int64
1 economic.cond.national 1067 non-null int64
2 economic.cond.household 1067 non-null int64
3 Blair 1067 non-null int64
4 Hague 1067 non-null int64
5 Europe 1067 non-null int64
6 political.knowledge 1067 non-null int64
7 gender 1067 non-null int64
dtypes: int64(8)
memory usage: 75.0 KB

X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 458 entries, 91 to 776
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 458 non-null int64
1 economic.cond.national 458 non-null int64
2 economic.cond.household 458 non-null int64
3 Blair 458 non-null int64
4 Hague 458 non-null int64
5 Europe 458 non-null int64
6 political.knowledge 458 non-null int64
7 gender 458 non-null int64
dtypes: int64(8)
memory usage: 32.2 KB

y_train.info()

<class 'pandas.core.series.Series'>
Int64Index: 1067 entries, 1453 to 1061
Series name: vote
Non-Null Count Dtype
-------------- -----
1067 non-null int64
dtypes: int64(1)
memory usage: 16.7 KB

y_test.info()

<class 'pandas.core.series.Series'>
Int64Index: 458 entries, 91 to 776
Series name: vote
Non-Null Count Dtype
-------------- -----
458 non-null int64
dtypes: int64(1)
memory usage: 7.2 KB

LOGISTIC REGRESSION
Training data
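The fitting and metric-printing code is omitted from the report; a minimal sketch, assuming the model object is named log_model (the name is not shown) and default solver settings:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)

# accuracy, confusion matrix and classification report, as printed below
y_train_pred = log_model.predict(X_train)
print(accuracy_score(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred))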

0.8397375820056232
[[229 103]
[ 68 667]]
precision recall f1-score support

0 0.77 0.69 0.73 332


1 0.87 0.91 0.89 735

accuracy 0.84 1067


macro avg 0.82 0.80 0.81 1067
weighted avg 0.84 0.84 0.84 1067

TESTING DATA

0.8231441048034934
[[ 85 45]
[ 36 292]]
precision recall f1-score support

0 0.70 0.65 0.68 130


1 0.87 0.89 0.88 328

accuracy 0.82 458


macro avg 0.78 0.77 0.78 458
weighted avg 0.82 0.82 0.82 458
#AUC and ROC for the training data
#AUC and ROC for the test data
# predict probabilities
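The probability and ROC code exists only as the comments above; a hedged sketch, reusing the assumed log_model name from earlier:

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# predicted probability of the positive class (Labour = 1)
probs_train = log_model.predict_proba(X_train)[:, 1]
probs_test = log_model.predict_proba(X_test)[:, 1]
print('Train AUC:', roc_auc_score(y_train, probs_train))
print('Test AUC:', roc_auc_score(y_test, probs_test))

fpr, tpr, _ = roc_curve(y_test, probs_test)   # ROC curve for the test data
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()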
The model score is good on both the training and testing data, so this looks like a
fairly good model. The test and train performance are within the accepted limit of
+/-10%, which makes it an acceptable model as well.
#Apply Linear Discriminant Analysis model:
LDA_model=LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)
## Performance metrics on the train dataset

0.8369259606373008
[[233 99]
[ 75 660]]
precision recall f1-score support

0 0.76 0.70 0.73 332


1 0.87 0.90 0.88 735

accuracy 0.84 1067


macro avg 0.81 0.80 0.81 1067
weighted avg 0.83 0.84 0.84 1067

# Performance metrics on the test dataset

0.8187772925764192
[[ 86 44]
[ 39 289]]
precision recall f1-score support

0 0.69 0.66 0.67 130


1 0.87 0.88 0.87 328

accuracy 0.82 458


macro avg 0.78 0.77 0.77 458
weighted avg 0.82 0.82 0.82 458

Applying LDA, we get a fairly good model with an accuracy of about 82% on the test
data, and with test and train performance within the accepted limit of +/-10%.
Applying the KNN model, and then tuning it to try to improve the prediction on the
test data.
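The KNN fitting step is not shown; a minimal sketch (k = 5 is sklearn's default; the k used in the report is not stated):

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)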
# Performance metrics on the train data set

0.8584817244611059
[[672 63]
[ 88 244]]
precision recall f1-score support

0 0.88 0.91 0.90 735


1 0.79 0.73 0.76 332

accuracy 0.86 1067


macro avg 0.84 0.82 0.83 1067
weighted avg 0.86 0.86 0.86 1067

# Performance metrics on the test data set

0.7816593886462883
[[277 51]
[ 49 81]]
precision recall f1-score support

0 0.85 0.84 0.85 328


1 0.61 0.62 0.62 130

accuracy 0.78 458


macro avg 0.73 0.73 0.73 458
weighted avg 0.78 0.78 0.78 458

Here, we can see that there is not much difference in the model scores (around 78%
accuracy on the test data). Hence, model tuning will not necessarily yield a
performance improvement.

Applying the Naive Bayes model
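The fitting step is not shown; a minimal sketch, assuming the Gaussian variant (the report does not name which Naive Bayes was used):

from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)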


# Performance metrics on the train dataset

0.8331771321462043
[[649 86]
[ 92 240]]
precision recall f1-score support
0 0.88 0.88 0.88 735
1 0.74 0.72 0.73 332

accuracy 0.83 1067


macro avg 0.81 0.80 0.80 1067
weighted avg 0.83 0.83 0.83 1067

# Performance metrics on the test dataset

0.8253275109170306
[[284 44]
[ 36 94]]
precision recall f1-score support

0 0.89 0.87 0.88 328


1 0.68 0.72 0.70 130

accuracy 0.83 458


macro avg 0.78 0.79 0.79 458
weighted avg 0.83 0.83 0.83 458

The Naive Bayes model also gives a fair result, with around 83% accuracy on both the
train and test data.
Grid search is used to find the optimal hyperparameters of a model, i.e., those that
yield the most accurate predictions. Here we have tried to minimize the false negatives
in the KNN model by using grid search to find the optimal parameters. Grid search can
be used to improve any specific evaluation metric.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Create the GridSearchCV object (this grid is an assumption; the report does not list it;
# scoring='recall' targets the false negatives mentioned above)
param_grid = {'n_neighbors': list(range(3, 21, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='recall')
# Fit the model to the training data
grid.fit(X_train, y_train)
# Print the best hyperparameters
print(grid.best_params_)
# Make predictions on the test set using the best model
y_pred = grid.best_estimator_.predict(X_test)
# Evaluate the performance of the tuned model
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.7882096069868996

# Performance metrics on the train data set

0.8584817244611059
[[672 63]
[ 88 244]]
precision recall f1-score support
0 0.88 0.91 0.90 735
1 0.79 0.73 0.76 332

accuracy 0.86 1067


macro avg 0.84 0.82 0.83 1067
weighted avg 0.86 0.86 0.86 1067

# Performance metrics on the test data set


0.7816593886462883
[[277 51]
[ 49 81]]
precision recall f1-score support

0 0.85 0.84 0.85 328


1 0.61 0.62 0.62 130

accuracy 0.78 458


macro avg 0.73 0.73 0.73 458
weighted avg 0.78 0.78 0.78 458

We can see that the recall has changed slightly in each model; however, the accuracy
has also changed.
Bagging is a machine learning ensemble meta-algorithm designed to improve the stability
and accuracy of machine learning algorithms used in statistical classification and
regression. Bagging reduces variance and helps to avoid overfitting. A Random Forest
classifier is used for bagging below:
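The bagging step is not shown; a minimal sketch using RandomForestClassifier (the n_estimators and random_state values are assumptions; the report does not list them):

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)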

# Performance metrics on the train dataset

0.9663
[[726 9]
[ 27 305]]
precision recall f1-score support

0 0.96 0.99 0.98 735


1 0.97 0.92 0.94 332

accuracy 0.97 1067


macro avg 0.97 0.95 0.96 1067
weighted avg 0.97 0.97 0.97 1067
# Performance metrics on the test dataset

0.7816593886462883
[[277 51]
[ 49 81]]
precision recall f1-score support

0 0.85 0.84 0.85 328


1 0.61 0.62 0.62 130

accuracy 0.78 458


macro avg 0.73 0.73 0.73 458
weighted avg 0.78 0.78 0.78 458

However, we can see that this is not a very strong model, as the gap between the
train and test scores exceeds the recommended +/-10%.
Boosting is an ensemble strategy that sequentially builds on weak learners in order to
generate one final strong learner:
by building a weak model, drawing conclusions about the various feature importances
and parameters, and then using those conclusions to build a new, stronger model.
Boosting can effectively convert weak learners into a strong learner.
The method of boosting used here is AdaBoost (Adaptive Boosting).
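The boosting step is not shown; a minimal sketch (the default decision-stump base learner and n_estimators=50 are assumptions):

from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(n_estimators=50, random_state=1)
ada_model.fit(X_train, y_train)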
# Performance metrics on the train dataset

0.8472
[[666 69]
[ 94 238]]
precision recall f1-score support

0 0.88 0.91 0.89 735


1 0.78 0.72 0.74 332

accuracy 0.85 1067


macro avg 0.83 0.81 0.82 1067
weighted avg 0.84 0.85 0.85 1067

# Performance metrics on the test data set

0.7816593886462883
[[277 51]
[ 49 81]]
precision recall f1-score support

0 0.85 0.84 0.85 328


1 0.61 0.62 0.62 130
accuracy 0.78 458
macro avg 0.73 0.73 0.73 458
weighted avg 0.78 0.78 0.78 458

#PROBLEM 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States
of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

#2.1) Find the number of characters, words and sentences for the mentioned documents.
len(inaugural.fileids())

59
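The counting code is omitted; a minimal sketch for the Roosevelt speech, using the standard inaugural corpus file IDs ('1961-Kennedy.txt' and '1973-Nixon.txt' follow the same pattern):

import nltk
from nltk.corpus import inaugural
nltk.download('inaugural')

R1 = len(inaugural.raw('1941-Roosevelt.txt'))     # characters: 7571
words1 = inaugural.words('1941-Roosevelt.txt')
W1 = len(words1)                                  # words: 1536
S1 = len(inaugural.sents('1941-Roosevelt.txt'))   # sentences: 68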

1. President Franklin D. Roosevelt in 1941


# R represents the number of characters
R1

7571

2. President John F. Kennedy in 1961


R2

7618

3. President Richard Nixon in 1973


R3

9991
S represents the number of sentences
S1 - 68

S2 - 52

S3 - 69

W represents the number of words
In W1 -- ['On', 'each', 'national', 'day', 'of', 'inauguration', ...]

W1 - 1536

In W2 - ['Vice', 'President', 'Johnson', ',', 'Mr', '.', ...]

W2 - 1546

In W3 - ['Mr', '.', 'Vice', 'President', ',', 'Mr', '.', ...]

W3 - 2028

Roosevelt, W1 = 1536
Kennedy, W2 = 1546
Nixon, W3 = 2028
#2.2 Remove all the stopwords from the three speeches. Show the word count before
and after the removal of stopwords. Show a sample sentence after the removal of
stopwords.
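The removal code is only partially shown below; a minimal sketch for Roosevelt (the extra punctuation tokens match the extended stop_words list printed later; for Kennedy and Nixon the report additionally drops 'us', 'Let' and 'let'):

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = stopwords.words('english')
stop_words += [',', '.', '-', ';', '--', 'It', 'The', 'We']   # extensions used in the report
words1_clean = [w for w in words1 if w not in stop_words]     # 670 tokens remain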
#1)Roosevelt Stopwords
words1
['On',
'each',
'national',
'day',
'of',
'inauguration',
'since',
'1789',
',',
'the',
'people',
'have',
'renewed',
'Those',
'who',
'first',
------

'came',
'here',
'to',
'carry',
'out',
'the',
'millions',
'who',
'followed',
...]

FreqDist({'the': 104, 'of': 81, ',': 77, '.': 67, 'and': 44, 'to': 35, 'in':
30, 'a': 29, '--': 25, 'is': 24, ...})

stop_words

['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',----

------------
'her',
'hers',
'herself',
'it',
"it's",
',',
'.',
'-',
';',
'--',
'It',
'The',
'We']

FreqDist({'know': 10, 'spirit': 9, 'life': 9, 'us': 8, 'democracy': 8,
'people': 7, 'Nation': 7, 'America': 7, 'years': 6, 'freedom': 6, ...})
#2) Kennedy Stopwords
words2

['Vice',
'President',
'Johnson',
',',
'Mr',
'.',
'Speaker',
',',
'Mr',
'.',
'Chief',
'Justice',
',',
'President',
'Eisenhower',
',',
'Vice',
'President',
'Nixon',
',',-----

-----------
',',
'tap',
'the',
'ocean',
'depths',
',',
'and',
'encourage',
'the',
'arts',
...]

FreqDist({',': 85, 'the': 83, 'of': 65, '.': 51, 'to': 38, 'and': 37, 'a':
29, 'we': 27, '--': 25, 'in': 24, ...})

add_to_stop_words = [',','.','-',';','--','It','The','We','us','Let','let']

FreqDist({'world': 8, 'sides': 8, 'new': 7, 'pledge': 7, 'citizens': 5, 'I': 5,
'power': 5, 'shall': 5, 'To': 5, 'free': 5, ...})
#3) Nixon
words3

['Mr',
'.',
'Vice',
'President',
',',
'Mr',
'.',
'Speaker',
',',
'Mr',
'.',
'Chief',
'Justice',
',',
'Senator',
'Cook',
',',
'Mrs',
'.',
'Eisenhower',
',',
'and',
'my',
'fellow',
'citizens',
'of',
'this',
'great',
'and',
'good',
'country',
'we',
'share',
-----

----

'those',
'new',
'responsibilities',
'lies',
'in',
'the',
'placing',
'and',
'the',
'division',
'of',
'responsibility',
'.',
'We',
'have',
'lived',
'too',
...]

FreqDist({',': 96, 'the': 80, '.': 68, 'of': 68, 'to': 65, 'in': 54, 'and':
47, 'we': 38, 'a': 34, 'that': 32, ...})

add_to_stop_words = [',','.','-',';','--','It','The','We','us','Let','let']

FreqDist({'America': 21, 'peace': 19, 'world': 17, 'new': 15, "'": 14, 'I':
12, 'responsibility': 11, 'great': 9, 'home': 9, 'nation': 9, ...})

A1=len(words1)
print(A1)
A2=len(words2)
print(A2)
A3=len(words3)
print(A3)

670
716
857

# Before stopword removal:
Roosevelt, W1 = 1536
Kennedy, W2 = 1546
Nixon, W3 = 2028
# After stopword removal:
Roosevelt, A1 = 670
Kennedy, A2 = 716
Nixon, A3 = 857
#2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words after removing the stopwords.
After removing the stopwords (we also removed "Let" and "us" via the extended
stopword list, since we need to see which significant words stand out in each
president's speech):
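The lookups below come from FreqDist; a minimal sketch for Roosevelt (most_common(3) returns the top three tokens with their counts):

from nltk import FreqDist

fdist1 = FreqDist(words1_clean)
print(fdist1.most_common(3))   # [('know', 10), ('spirit', 9), ('life', 9)]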
a) In Roosevelt, the top three words are:
1)'know'
2)'spirit'
3)'life'
FreqDist({'know': 10, 'spirit': 9, 'life': 9, 'us': 8, 'democracy': 8,
'people': 7, 'Nation': 7, 'America': 7, 'years': 6, 'freedom': 6, ...})

b) In Kennedy, the top three words are:


1)'world'
2)'sides'
3)'new'
FreqDist({'world': 8, 'sides': 8, 'new': 7, 'pledge': 7, 'citizens': 5, 'I':
5, 'power': 5, 'shall': 5, 'To': 5, 'free': 5, ...})

c) In Nixon, the top three words are:


1)'America'
2)'peace'
3)'world'
FreqDist({'America': 21, 'peace': 19, 'world': 17, 'new': 15, "'": 14, 'I':
12, 'responsibility': 11, 'great': 9, 'home': 9, 'nation': 9, ...})
#2.4 Plot the word cloud of each of the three speeches after removing the
stopwords.
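The plotting code is not included in the report; a minimal sketch for Roosevelt, assuming the third-party wordcloud package:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# build the cloud from the stopword-filtered tokens
wc = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words1_clean))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()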
1) Roosevelt

2) Kennedy
3) Nixon

Here we see the most common words used by each of the presidents, excluding the
stopwords, by looking at the words with the bigger fonts in each word cloud.
