Lab-4: Regression Analysis: Logistic & Multinomial Logistic Regression
Objectives / Outcomes:
• To carry out logistic regression
• To analyze various evaluation parameters such as Accuracy, Misclassification
Rate, True Positive Rate, False Positive Rate, True Negative Rate, Precision,
Prevalence, Null Error Rate, Cohen's Kappa, F Score, etc.
• To understand how to predict probabilities in a multinomial logistic
regression model
When the response variable has only two possible values, it is desirable to have a model
that predicts the value either as 0 or 1, or as a probability score that ranges between 0
and 1. Linear regression does not have this capability: if you use linear regression to
model a binary response variable, the resulting model may not restrict the predicted Y
values to the interval between 0 and 1. Logistic regression solves this by passing the
linear combination of predictors through the sigmoid (logistic) function, which maps any
real number into (0, 1).
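As a quick illustration (a minimal added sketch, not part of the lab dataset), the sigmoid function 1 / (1 + e^(-z)) squashes any real-valued score into a probability between 0 and 1:
> sigmoid <- function(z) 1 / (1 + exp(-z))   # logistic (sigmoid) function
> sigmoid(c(-5, 0, 5))   # large negative -> near 0, large positive -> near 1
[1] 0.006692851 0.500000000 0.993307149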
Building a Logistic Regression Model in R
Now let's see how to implement logistic regression using the given University
Admission dataset in CSV format. The goal here is to model and predict whether a given
application (a row in the dataset) is admitted or rejected, based on the 3 other features.
So, let's load the data and keep only the complete cases.
> mydata<-read.csv("~/Downloads/binary.csv",header=T)
> str(mydata)
'data.frame': 400 obs. of 4 variables:
 $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int 380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
The dataset has 400 observations and 4 columns. The admit column is the response
(dependent) variable; it tells whether a given student was admitted (1) or rejected (0).
The admit and rank columns are stored as numeric int data types, so let us convert
them into factors.
> mydata$admit=as.factor(mydata$admit)
> mydata$rank=as.factor(mydata$rank)
> str(mydata)
'data.frame': 400 obs. of 4 variables:
 $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
 $ gre  : int 380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
A cross-tabulation of admit against rank confirms that every combination of the two factors is represented in the data:
> xtabs(~admit+rank,data=mydata)
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
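To see the admission rate within each rank (a small added sketch using the same table; prop.table with margin = 2 gives column-wise proportions):
> round(prop.table(xtabs(~ admit + rank, data = mydata), margin = 2), 2)
     rank
admit    1    2    3    4
    0 0.46 0.64 0.77 0.82
    1 0.54 0.36 0.23 0.18
Higher-ranked institutions (rank 1) have a noticeably higher admission rate.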
Partitioning the dataset
We partition the given dataset into training and test data: before building the logistic
regression model, you need to randomly split the data into training and test samples
(here, roughly 80% / 20%).
> set.seed(123)
> ind<-sample(2,nrow(mydata),replace=T,prob=c(0.8,0.2))
> ind
[ind is a vector of 400 values, each 1 or 2, assigning every row to the training (1) or test (2) sample; full printout omitted]
> traindata<-mydata[ind==1,]
> testdata<-mydata[ind==2,]
> str(traindata)
'data.frame': 325 obs. of 4 variables:
> str(testdata)
'data.frame': 75 obs. of 4 variables:
 $ admit: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
 $ gre  : int 640 520 400 800 480 540 500 680 540 760 ...
 $ gpa  : num 3.19 2.93 3.08 4 3.44 3.81 3.17 3.19 3.78 3.35 ...
 $ rank : Factor w/ 4 levels "1","2","3","4": 4 4 2 4 3 1 3 4 4 3 ...
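As a quick added sanity check, the two partitions together account for all 400 rows:
> nrow(traindata)
[1] 325
> nrow(testdata)
[1] 75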
The syntax to build a logit model is very similar to the lm function you saw in linear
regression. You only need to set family = binomial for glm to build a logistic
regression model.
glm stands for generalised linear models and it is capable of building many types of
regression models besides linear and logistic regression.
> model <- glm(formula = admit ~ gre + gpa + rank, data = traindata, family = binomial)
In the above model, admit is modeled as a function of gre, gpa and rank.
> summary(model)

Call:
glm(formula = admit ~ gre + gpa + rank, family = binomial, data = traindata)

[Deviance residuals and the coefficient table appear here; output truncated.]

(Dispersion parameter for binomial family taken to be 1)
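Because the logistic model works on the log-odds scale, exponentiating the fitted coefficients turns them into odds ratios, which are easier to interpret (a small added step using base-R coef() and exp(); output not shown):
> exp(coef(model))   # odds ratios: > 1 raises the odds of admission, < 1 lowers them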
The model is now built. You can now use it to predict the response on the test
data. You need to set type='response' in order to compute the predicted
probabilities.
> pred<-predict(model,testdata,type="response")
> head(pred)
4 5 8 11 16 20
0.2357941 0.1736073 0.2277837 0.3399874 0.1990518 0.5332868
The logistic (sigmoid) function is used here, so each prediction is a probability
between 0 and 1. The common practice is to take the probability cutoff as 0.5: if the
predicted probability of Y is greater than 0.5, the observation is classified as an
event. So if the prediction is greater than 0.5, the application is classified as
admitted; otherwise it is rejected.
> p1<-ifelse(pred>0.5,1,0)
> head(p1)
 4  5  8 11 16 20
 0  0  0  0  0  1
> table(p1)
p1
0 1
63 12
> table(testdata$admit)
0 1
57 18
> table(predicted=p1,actual=testdata$admit)
         actual
predicted  0  1
        0 51 12    (TN  FN)
        1  6  6    (FP  TP)
Evaluation of the model's performance (standard parameters, computed from the confusion matrix above):
Accuracy: (51 + 6) / 75 = 0.76
Misclassification Rate: (12 + 6) / 75 = 0.24
Precision: 6 / (6 + 6) = 0.5
True Positive Rate (Recall): 6 / (6 + 12) = 0.3333
Prevalence: (6 + 12) / 75 = 0.24
F1 Score: 2 × (0.5 × 0.3333) / (0.5 + 0.3333) = 0.4
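These values can be computed in R directly from the confusion matrix (a minimal sketch; the variable names tn, fn, fp, tp are our own):
> tab <- table(predicted = p1, actual = testdata$admit)
> tn <- tab[1, 1]; fn <- tab[1, 2]   # predicted 0: true negatives, false negatives
> fp <- tab[2, 1]; tp <- tab[2, 2]   # predicted 1: false positives, true positives
> accuracy   <- (tp + tn) / sum(tab)                          # 0.76
> misclass   <- 1 - accuracy                                  # 0.24
> precision  <- tp / (tp + fp)                                # 0.5
> recall     <- tp / (tp + fn)                                # true positive rate, 0.3333
> f1         <- 2 * precision * recall / (precision + recall) # 0.4
> prevalence <- (tp + fn) / sum(tab)                          # 0.24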
Multinomial Regression
Multinomial logistic regression is used to model a nominal outcome variable with more
than two categories, in which the log odds of the outcomes are modeled as a linear
combination of the predictor variables.
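Concretely, for K classes the model assigns class probabilities through the softmax of the K linear scores, with one class fixed as the baseline (its score set to 0). A minimal sketch with made-up scores:
> softmax <- function(scores) exp(scores) / sum(exp(scores))
> round(softmax(c(0, 1.2, -0.4)), 3)   # baseline class scores 0; probabilities sum to 1
[1] 0.200 0.665 0.134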
For our data analysis example, we will be using the preloaded IRIS data
set.
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data set contains measurements on 150 flowers. The outcome variable is Species,
a factor with three levels. Below we use the multinom function from the nnet package
to estimate a multinomial logistic regression model.
> install.packages("nnet")
> library(nnet)
As before, we partition the dataset into training and test samples before fitting the
model.
> data<-iris
> set.seed(1234)
> ind<-sample(2,nrow(data),replace=T,prob=c(0.8,0.2))
> traindata<-data[ind==1,]
> testdata<-data[ind==2,]
The model is trained using the multinom function from the nnet package.
> model <- multinom(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = traindata)
# weights: 18 (10 variable)
initial value 129.636250
iter 10 value 10.683012
iter 20 value 5.933903
iter 30 value 5.873500
iter 40 value 5.866866
iter 50 value 5.861992
iter 60 value 5.860395
iter 70 value 5.859634
iter 80 value 5.859340
iter 90 value 5.859208
iter 100 value 5.859118
final value 5.859118
stopped after 100 iterations
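Note that "stopped after 100 iterations" means the optimizer hit multinom's default iteration limit rather than fully converging; if needed, the limit can be raised via the maxit argument (a hedged one-line sketch; Species ~ . is shorthand for using all remaining columns as predictors):
> model <- multinom(Species ~ ., data = traindata, maxit = 500)   # allow up to 500 iterations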
> summary(model)
Call:
multinom(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = traindata)
Coefficients:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 20.95306 -2.019734 -12.17769 10.47244 -2.626553
virginica -18.78599 -4.541125 -18.31369 19.53561 14.264642
Std. Errors:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 41.61007 134.8689 191.1050 76.27252 17.95107
> pred<-predict(model,testdata,type="class")
> head(pred)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
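To predict class probabilities rather than hard class labels (the third lab outcome), set type = "probs" in predict(); each row of the result gives P(setosa), P(versicolor) and P(virginica) for one test flower:
> prob <- predict(model, testdata, type = "probs")   # matrix of class probabilities
> rowSums(head(prob))                                # every row sums to 1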
Confusion Matrix
> tab<-table(pred,testdata$Species)
> tab
pred         setosa versicolor virginica
  setosa         11          0         0
  versicolor      0          6         0
  virginica       0          0        15
We obtain an accuracy of 100% on the test data, which can clearly be concluded from
the above confusion matrix: all 32 test observations lie on the diagonal.
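The same accuracy can be computed as the sum of the diagonal of the table divided by its total:
> sum(diag(tab)) / sum(tab)   # (11 + 6 + 15) / 32
[1] 1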
Conclusion:
In this lab we built a binary logistic regression model with glm(family = binomial) on
the university admission data, evaluated it using a confusion matrix and standard
parameters such as accuracy, precision and F1 score, and then extended the approach
to a multinomial logistic regression model with nnet::multinom() on the iris data,
where the fitted model classified every test flower correctly.