
Data Science for Engineers

Prof. Ragunathan Rengasamy


Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Lecture- 43
Logistic Regression

(Refer Slide Time: 00:13)

We will continue our lecture on Logistic Regression that we introduced in the last lecture. And if you recall from the last lecture, we modeled the probability as a sigmoidal function.

And the sigmoidal function that we used is given here: p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)). Notice that the exponent is your hyperplane equation. And in n dimensions this quantity β0 + β1X is a scalar, because you have n elements in X and n elements in β1, and this becomes something like β0 + β11 x1 + β12 x2 and so on, up to β1n xn.
So, this is a scalar, and then we saw that if this quantity is a very large negative number, then the probability is 0, and if this quantity is a very large positive number, the probability is 1. And for the transition of the probability at 0.5, remember I said you have to always look at it from one class's viewpoint.

So, let us say you want class 1 to have high probability and class 0 to be the low probability case; then you need a threshold, as we described before, so that you can convert this into a binary output. So, if you were to use a threshold of 0.5, because probabilities go from 0 to 1, then you notice that this p(X) becomes 0.5 exactly when β0 + β1X = 0. This is because p(X) then equals e^0 / (1 + e^0), which is equal to 1/2.

Also notice another interesting thing: this equation is then the equation of the hyperplane. So, if I had data from one class here and data from the other class there, and if I draw this line, any point on this line is a probability-equal-to-0.5 point. That basically says that any point on this line in this 2-D case, or on the hyperplane in the n-dimensional case, will have an equal probability of belonging to either class 0 or class 1, which makes sense from what we are trying to do. So, this model is what is called a logit model.
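To make this concrete, here is a minimal sketch in R (R is the language used for the case studies in this course; the coefficient values below are invented for illustration and are not from the lecture):

# Sigmoidal function: p(X) = exp(z) / (1 + exp(z)), where z is the hyperplane expression
p_of_x <- function(x1, x2, b0, b1, b2) {
  z <- b0 + b1 * x1 + b2 * x2    # beta0 + beta11*x1 + beta12*x2, a scalar
  exp(z) / (1 + exp(z))
}

# Hypothetical coefficients: the line x1 + x2 - 5 = 0 is the decision boundary
p_of_x(2, 3, -5, 1, 1)    # point on the line: probability exactly 0.5
p_of_x(5, 5, -5, 1, 1)    # far on the positive side: probability close to 1
p_of_x(0, 0, -5, 1, 1)    # far on the negative side: probability close to 0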

(Refer Slide Time: 02:52)

Let us take a very simple example to understand this. So, let us assume that we are given data. So, here we have data for class 0 and data for class 1, and then clearly this is a 2-dimensional problem. So, the hyperplane is going to be a line.

So, a line will separate this. And these kinds of classification problems are actually called supervised classification problems. We call this a supervised classification problem because all of this data is labeled. So, I already know that all of this data is coming from class 0 and all of this data is coming from class 1.

So, in other words, I am being supervised in terms of what I should call class 0 and what I should call class 1. So, in these kinds of problems, typically you have this, and then you are given new data, which is called the test data, and then the question is what class this test data belongs to. So, it is either class 0 or class 1, as far as we are concerned in this example.
Just keep in mind where we would use problems like this. Remember, at the beginning of this course I talked about fraud detection and so on, where you could have lots of records of, let us say, fraudulent credit card use, and all of those instances of fraudulent credit card use you could describe by certain attributes.

So, for example, the time of day at which the credit card transaction was done, whether the credit card use was done at the place where the person lives, and many other attributes. So, let us say many such attributes are there, and you have lots of records for normal use of the credit card and some records for fraudulent use of the credit card.

Then you could build a classifier which, given a new set of attributes, that is, a new transaction that is being initiated, could identify how likely it is that this transaction is fraudulent. So that is one other way of thinking about the same problem. So, nonetheless, as far as this example is concerned, what we need to do is fill this column with zeros and ones. If I fill a row with 0, then that means this data belongs to class 0, and if I fill it with 1, then let us say this belongs to class 1, and so on.

So, this is what we are trying to do; we do not know yet what the classes are.

(Refer Slide Time: 05:34)

So, just to see this, it is a very simple problem: we have plotted the same data that was shown in the last table. And you would notice that if you wanted a classifier, something like this would do. So, this problem is linearly separable. So, you could come up with a line that does it. So, let us see what happens if we use logistic regression to solve this problem.
(Refer Slide Time: 06:04)

So, if you did a logistic regression solution, then in this case it turns out that the parameter values are these. And how did we get these parameter values? These parameter values are obtained through the optimization formulation, where one is maximizing the log likelihood with β0, β11 and β12 as decision variables.

And as we see here, there are 3 decision variables, because this was a 2-dimensional problem. So, one coefficient for each dimension and then one constant. Now once you have this, then what you do is, you have your expression for p(X), which is, as written before, the sigmoid. So, this is the sigmoidal function that we have been talking about. Then whenever you get a test data point, let us say (1, 3), you plug this into this sigmoidal function and you get a probability. Let us say the first data point, when you plug it in, gives you a probability like this.

So, if you use a threshold of 0.5, then what we are going to say is anything less than 0.5 is going to belong to class 0 and anything greater than 0.5 is going to belong to class 1. So, you will notice that this is class 0, class 1, class 1, class 0, class 0, class 1, class 0, class 0, class 0. So, as I mentioned in the previous slide, what we wanted was to fill this column, and if you go across a row, it says which class that particular sample belongs to. So, now, what we have done is we have classified these test cases, which the classifier did not see while you were identifying these parameters.

So, the process of identifying these parameters is what is usually called training in machine learning algorithms. So, you are training the classifier to be able to solve test cases later. And the data that you use while these parameters are being identified is called the training data, and this is called the test data that you are testing the classifier with.
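As a minimal sketch of this training-and-testing workflow in R (the data frame and its column names are invented for illustration; glm with a binomial family is one standard way of fitting a logistic regression, though the lecture itself only shows slides):

# Hypothetical labeled (supervised) training data for classes 0 and 1
set.seed(1)
train <- data.frame(
  x1 = c(rnorm(10, mean = 1), rnorm(10, mean = 4)),
  x2 = c(rnorm(10, mean = 1), rnorm(10, mean = 4)),
  y  = rep(c(0, 1), each = 10)
)

# Training: identify beta0, beta11, beta12 by maximizing the log likelihood
model <- glm(y ~ x1 + x2, data = train, family = binomial)
coef(model)    # the three decision variables

# Test data that the classifier did not see during training
test <- data.frame(x1 = c(1, 5), x2 = c(3, 4))
p <- predict(model, newdata = test, type = "response")    # p(X) from the sigmoid

# Threshold at 0.5: below 0.5 -> class 0, above 0.5 -> class 1
ifelse(p > 0.5, 1, 0)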

So, typically, if you have lots of data with class labels already given, one of the good things that one should do is split this into training data and test data. And the reason for splitting this into training and test data is the following. In this case, if you look at it, we built a classifier based on some data and then we tested it on some other data, but we have no way of knowing whether these results are right or wrong.

So, we just have to take the results as they are. So, ideally what you would like to do is use some portion of the data to build a classifier, and then you want to retain some portion of the data for testing, and the reason for retaining this is that the labels are already known in this portion.

So, if I just give this portion of the data to the classifier, the classifier will come up with some classification. Now that can be compared with the already established labels for those data points. So, for verifying how good your classifier is, it is always a good idea to split this into training and testing data. What proportion of data you use for training, what proportion you use for testing, and so on are things to think about.

Also, there are many different ways of doing this validation, as one would call it, with test data. There are techniques such as k-fold cross validation and so on. So, there are many ways of splitting the data into train and test and then verifying how good your classifier is. Nonetheless, the most important idea to remember is that one should always look at the data and partition it into training and testing so that you get results that are consistent.
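As a sketch of the simplest version of this idea, reusing the hypothetical train data frame from the earlier sketch, a random hold-out split in base R (the 80/20 proportion is an assumption; k-fold cross validation is a more thorough variant):

# Partition labeled data into training and test portions (80/20 assumed)
set.seed(2)
n   <- nrow(train)
idx <- sample(seq_len(n), size = round(0.8 * n))
train_set <- train[idx, ]     # used to identify the parameters
test_set  <- train[-idx, ]    # held back; its known labels verify the classifier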
(Refer Slide Time: 10:23)

So, if one were to draw these points again that we used in this exercise: these are all class 1 data points, these are class 0 data points, this is your hyperplane that the logistic regression model figured out, and these are the test points that we tried with this classifier. So, you can see that in this case everything seems to be working well, but as I said before, you can look at results like this in 2 dimensions quite easily.

However, when there are multiple dimensions, it is very difficult to visualize where the data point lies and so on. Nonetheless, it gives you an idea of what logistic regression is doing. It is actually doing a linear classification here; however, based on the distance, in some sense, from this hyperplane, we also assign a probability for the data being in a particular class.
Now, there is one more idea that we want to talk about in logistic regression. This idea is what is called regularization. The idea here is the following. Notice the objective function that we used in the general logistic regression, which is what we called the log likelihood objective function.
(Refer Slide Time: 11:42)

Here θ again refers to the constants in the hyperplane, or the decision variables, and this is the form of the equation that we saw in the previous lecture, and in the beginning of this lecture also, I believe. Now, if you have n variables in your problem, or n features or n attributes, then the number of decision variables that you are identifying is n + 1: one coefficient for each variable and one constant. If this n becomes very large, when there are a large number of variables present, then what happens is these logistic regression models can overfit, because there are so many parameters that you could tend to overfit the data.

So, to prevent this, what we want to do is somehow say that though you have these n + 1 decision variables to use, one would want these decision variables to be used sparingly. So, whenever you use a coefficient for a variable in the classification problem, then we want to ensure that you get the maximum value for using that variable in the classification problem. So, in other words, let us say there are 2 variables and I have β0 + β11 x1 + β12 x2. Then for this classifier I am using both variables x1 and x2 as being important. What one would like to do is make sure that I use these only if they really contribute to the solution, or to the efficacy of the solution.
So, one might say that for every term that you use, you should get something in return; or, in other words, if you use a term and get nothing in return, I want to penalize this term. So, I want to penalize these coefficients. This is what is typically called regularization.
(Refer Slide Time: 14:23)

So, regularization avoids building complex models, or it helps in building non-complex models, so that your overfitting effects can be reduced. So, how do we penalize this? Notice that what we are trying to do is minimize the negative of the log likelihood.
So, what we do here is add another term to the objective: λ is called the regularization parameter, and this h(θ) is some regularization function. So, what we want is, when I choose the values of θ to be very large, I want this function to be large, so that the penalty is more; or whenever I choose a variable, right away a penalty kicks in.

And this penalty should be offset by the improvement I have in this term of the objective function. So, that is the basic idea behind regularization. Now this function could be of many types. If you use this function to be θᵀθ, then this is called L2 type regularization. So, in the previous example, with θ = (β0, β11, β12)ᵀ, this will turn out to be θᵀθ = (β0, β11, β12)(β0, β11, β12)ᵀ.

So, in this case h(θ) = β0² + β11² + β12². Now there are other types of regularization that you can use. This is what is called the L2 type or L2 norm; you can also use something called an L1 type or an L1 norm. And the larger the value of the coefficient λ multiplying this term, the greater the regularization strength; that is, you are penalizing the use of variables a lot more. And one general rule is that regularization helps the model work better with test data, because you avoid overfitting on the training data. So, that is in general something that one can keep in mind as one does these kinds of problems.
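As a minimal sketch of this penalized formulation in R (the data are synthetic and the use of optim is my illustration, not the lecture's code), we minimize the negative log likelihood plus λ θᵀθ, the L2 penalty:

# Synthetic 2-variable data, for illustration only
set.seed(3)
X <- cbind(x1 = rnorm(100), x2 = rnorm(100))
y <- rbinom(100, 1, plogis(1 + 2 * X[, 1]))

# Objective: negative log likelihood plus lambda * theta' theta
penalized_nll <- function(theta, X, y, lambda) {
  z   <- theta[1] + X %*% theta[-1]        # beta0 + beta11*x1 + beta12*x2
  nll <- -sum(y * z - log(1 + exp(z)))     # negative log likelihood of the logit model
  nll + lambda * sum(theta^2)              # h(theta) = theta' theta, the L2 penalty
}

# Larger lambda gives a stronger penalty and shrinks the coefficients toward zero
optim(par = c(0, 0, 0), fn = penalized_nll, X = X, y = y, lambda = 0.1)$par

In practice one would usually reach for a package such as glmnet, which implements both the L1 and L2 penalties for logistic regression and lets you choose λ by cross validation.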

So, with this, the portion on logistic regression comes to an end. What we are going to do next is show you an example case study where logistic regression is used for a solution. However, before we do this case study, since all the case studies on classification and clustering will involve looking at the output from the R code, I am going to take a typical output from the R code, and there are several results that will show up. These are called performance measures of a classifier. I am going to describe what these performance measures are and how you should interpret them once you use a particular technique for any case study.

So, in the next lecture we will talk about these performance measures, and then following that will be the lecture on a case study using logistic regression.

Thank you for listening to this lecture and I will see you in the next lecture.
