
Artificial Intelligence II (CS4442 & CS9542)

Classification: Logistic Regression

Boyu Wang
Department of Computer Science
University of Western Ontario
Recall: tumor example

Slide credit: Doina Precup

Recall: supervised learning

▶ A training example i has the form (x_i, y_i), where x_i ∈ R^n and n is the
  number of features (the feature dimension). If y ∈ R, this is the
  regression problem.

▶ If y ∈ {0, 1}, this is the binary classification problem.

▶ If y ∈ {1, . . . , C} (i.e., y can take more than two values), this is the
  multi-class classification problem.

▶ Most binary classification algorithms can be extended to multi-class
  classification algorithms.
Linear model for classification

▶ As in linear regression, we consider a linear model h_w for classification:

      h_w(x) = w^T x,

  where we have already augmented the data: x → [x; 1].

▶ Rule for binary classification: h_w(x) ≥ 0 ⇒ y = 1; h_w(x) < 0 ⇒ y = 0.

▶ How to choose w?
Error (cost) function for classification

▶ Recall: for regression, we use the sum-of-squared-errors cost:

      J(w) = (1/2) Σ_{i=1}^m (h_w(x_i) − y_i)^2

▶ Can we use it for classification?

▶ We could, but it is not the best choice:
  - If y = 1, we want h_w(x) > 0 by as large a margin as possible, which
    reflects our confidence in the classification.
  - Recall the connection between linear regression and maximum likelihood
    under a Gaussian noise assumption.
Probabilistic view for classification

▶ We would like a way of learning that is more suitable for the problem.

▶ Classification can be formulated as the following question: given a data
  point x, what is the probability p(y|x)?

▶ Intuition: we want to find a conditional probability p(y|x; w),
  parameterized by w, such that

      h_w(x) = w^T x → +∞  ⇒  p(y = 1|x) → 1
      h_w(x) = w^T x → −∞  ⇒  p(y = 1|x) → 0  (i.e., p(y = 0|x) → 1)
      h_w(x) = w^T x = 0   ⇒  p(y = 1|x) = 0.5
Sigmoid function

Consider the following function:

      σ(a) ≜ 1 / (1 + e^{−a}) = e^a / (1 + e^a)

- a → +∞ ⇒ σ(a) → 1
- a → −∞ ⇒ σ(a) → 0
- a = 0 ⇒ σ(a) = 0.5

Plug h_w(x) into σ(·):

      p(y = 1|x; w) ≜ σ(h_w(x)) = 1 / (1 + e^{−w^T x}),

which is exactly what we are looking for!
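To make the sigmoid mapping concrete, here is a minimal numerical sketch of
σ and the resulting class-1 probability. The function names (sigmoid,
predict_proba) and the use of NumPy are illustrative choices, not part of the
lecture.

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, x):
    """p(y = 1 | x; w) = sigmoid(w^T x), with x already augmented by a 1."""
    return sigmoid(w @ x)

# Large positive scores give probabilities near 1, large negative scores give
# probabilities near 0, and a score of exactly 0 gives 0.5.
w = np.array([2.0, -1.0, 0.5])                        # last entry acts as the bias
print(predict_proba(w, np.array([3.0, 0.0, 1.0])))    # close to 1
print(predict_proba(w, np.array([-3.0, 0.0, 1.0])))   # close to 0
```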
1D sigmoid function

Figure: The sigmoid function and the predicted probabilities.

Figure credit: Kevin Murphy

2D sigmoid function

Figure: Plots of σ(w1 x1 + w2 x2 )


Figure credit: Kevin Murphy

Error (cost) function for classification – revisited

▶ Maximum likelihood classification: find the hypothesis that maximizes the
  (log-)likelihood of the training data:

      arg max_h log p({x_i, y_i}_{i=1}^m | h) = arg max_h Σ_{i=1}^m log p(x_i, y_i | h)

  (using the i.i.d. assumption)

▶ If we ignore the marginal distribution p(x), we may maximize the
  conditional probability of the labels, given the inputs and the
  hypothesis h:

      arg max_h Σ_{i=1}^m log p(y_i | x_i; h)
The cross-entropy loss function for logistic regression

▶ Recall that for any data point (x_i, y_i), the conditional probability can
  be represented by the sigmoid function:

      p(y_i = 1 | x_i; h) = σ(h_w(x_i))

▶ Then the log-likelihood of a hypothesis h_w is

      log L(w) = Σ_{i=1}^m log p(y_i | x_i; h_w)
               = Σ_{i=1}^m [ log σ(h_w(x_i)) if y_i = 1;  log(1 − σ(h_w(x_i))) if y_i = 0 ]
               = Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) ),

  where t_i = σ(h_w(x_i)).

▶ The cross-entropy loss function is the negative of log L(w).
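As a quick sanity check on the formula above, the cross-entropy loss can be
computed directly. This is a minimal sketch assuming NumPy; the clipping
constant eps is an added numerical-stability detail, not something from the
slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_loss(w, X, y, eps=1e-12):
    """J(w) = -sum_i [ y_i log t_i + (1 - y_i) log(1 - t_i) ], t_i = sigmoid(w^T x_i).

    X has shape (m, n) with the column of ones already appended; y is in {0, 1}.
    Probabilities are clipped away from 0 and 1 to avoid log(0).
    """
    t = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.sum(y * np.log(t) + (1.0 - y) * np.log(1.0 - t))
```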
Linear regression vs. logistic regression

Both use the linear model h_w(x) = w^T x.

▶ Conditional probability:
  - linear regression:
        p(y_i | x_i; w) = (1 / √(2πσ^2)) exp( −(y_i − w^T x_i)^2 / (2σ^2) )
  - logistic regression:
        p(y_i | x_i; w) = 1 / (1 + e^{−w^T x_i})       if y_i = 1
        p(y_i | x_i; w) = 1 − 1 / (1 + e^{−w^T x_i})   if y_i = 0

▶ Log-likelihood function:
  - linear regression:
        log L(w) = Σ_{i=1}^m log [ (1 / √(2πσ^2)) exp( −(y_i − w^T x_i)^2 / (2σ^2) ) ]
  - logistic regression:
        log L(w) = Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) ),  where t_i = σ(h_w(x_i))

▶ Solution:
  - linear regression: w = (X^T X)^{-1} X^T y  (closed form)
  - logistic regression: no analytical solution
Optimization procedure: gradient descent

Objective: minimize a loss function J(w)

Gradient descent for function minimization

1: Input: number of iterations N, learning rate α
2: Initialize w_(0)
3: for n = 1 to N do
4:   Compute the gradient: g_(n) = ∇J(w_(n))
5:   w_(n+1) = w_(n) − α g_(n)
6:   if converged (e.g., |w_(n+1) − w_(n)| ≤ ε) then
7:     Stop
8:   end if
9: end for
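A minimal Python sketch of this loop. The names gradient_descent and grad_J
are illustrative; the caller supplies the gradient function and a starting
point.

```python
import numpy as np

def gradient_descent(grad_J, w0, num_iters=100, alpha=0.1, tol=1e-6):
    """Generic gradient descent: w <- w - alpha * grad_J(w), stopping once the
    update is smaller than tol (the epsilon in the pseudocode above)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        g = grad_J(w)
        w_new = w - alpha * g
        if np.linalg.norm(w_new - w) <= tol:   # convergence check
            return w_new
        w = w_new
    return w

# Usage example on J(w) = ||w||^2 / 2, whose gradient is simply w:
w_min = gradient_descent(lambda w: w, w0=[3.0, -2.0])
print(w_min)   # close to the minimizer [0, 0]
```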
Effect of the learning rate

Figure: Gradient descent on a simple function, starting from (0, 0), for 20
steps with a fixed learning rate α. (a) α = 0.1. (b) α = 0.6.

Figure credit: Kevin Murphy
Optimization procedure: Newton’s method
Objective: minimize a loss function J(w)

Newton's method for function minimization

1: Input: number of iterations N
2: Initialize w_(0)
3: for n = 1 to N do
4:   Compute the gradient: g_(n) = ∇J(w_(n))
5:   Compute the Hessian: H_(n) = ∇²J(w_(n))
6:   w_(n+1) = w_(n) − H_(n)^{-1} g_(n)
7:   if converged (e.g., |w_(n+1) − w_(n)| ≤ ε) then
8:     Stop
9:   end if
10: end for
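A minimal sketch of this update, assuming the caller provides grad_J and
hess_J (hypothetical names). np.linalg.solve is used instead of explicitly
forming the inverse Hessian.

```python
import numpy as np

def newtons_method(grad_J, hess_J, w0, num_iters=20, tol=1e-8):
    """Newton's method: w <- w - H^{-1} g, where g and H are the gradient and
    Hessian of J at the current w."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        g = grad_J(w)
        H = hess_J(w)
        w_new = w - np.linalg.solve(H, g)   # solve H d = g rather than invert H
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w
```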
Gradient descent for logistic regression
Objective: minimize the cross-entropy loss function (−log L(w)):

      J(w) ≜ −log L(w) = −Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) )

      ∇J(w) = Σ_{i=1}^m (t_i − y_i) x_i,   t_i = σ(h_w(x_i)) = 1 / (1 + e^{−w^T x_i})

Gradient descent for logistic regression

1: Input: number of iterations N, learning rate α
2: Initialize w_(0)
3: for n = 1 to N do
4:   Compute the gradient: g_(n) = ∇J(w_(n)) = Σ_{i=1}^m (t_i − y_i) x_i
5:   w_(n+1) = w_(n) − α g_(n)
6:   if converged (e.g., |w_(n+1) − w_(n)| ≤ ε) then
7:     Stop
8:   end if
9: end for
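Putting the pieces together, here is a minimal batch-gradient-descent sketch
for logistic regression. The function and argument names are illustrative,
and X is assumed to already contain the appended column of ones.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, num_iters=1000, alpha=0.01, tol=1e-6):
    """Minimize the cross-entropy loss J(w) by batch gradient descent.

    X: (m, n) design matrix with a column of ones appended; y: (m,) labels in {0, 1}.
    Gradient: grad J(w) = sum_i (t_i - y_i) x_i = X^T (t - y).
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(num_iters):
        t = sigmoid(X @ w)        # predicted probabilities t_i
        g = X.T @ (t - y)         # gradient of the cross-entropy loss
        w_new = w - alpha * g
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w

# Prediction: p(y = 1 | x) = sigmoid(w^T x); predict label 1 when this is >= 0.5.
```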
Newton’s method for logistic regression

Objective: minimize the cross-entropy loss function (−log L(w)):

      J(w) ≜ −log L(w) = −Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) )

      ∇J(w) = Σ_{i=1}^m (t_i − y_i) x_i
      H(w) = ∇²J(w) = Σ_{i=1}^m t_i (1 − t_i) x_i x_i^T = X^T S X,

where S ∈ R^{m×m} is a diagonal matrix with elements S_ii = t_i (1 − t_i), and
t_i = σ(h_w(x_i)) = 1 / (1 + e^{−w^T x_i}).

Then the update rule becomes:

      w_(n+1) = w_(n) − H(w_(n))^{-1} ∇J(w_(n))
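A minimal sketch of one Newton step for this objective, with illustrative
names. The Hessian is built as X^T S X and the linear system is solved
directly rather than inverting H.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step_logistic(w, X, y):
    """One Newton update for the cross-entropy loss.

    grad = X^T (t - y);  H = X^T S X with S = diag(t_i (1 - t_i)).
    """
    t = sigmoid(X @ w)
    g = X.T @ (t - y)
    S = np.diag(t * (1.0 - t))          # m x m diagonal weight matrix
    H = X.T @ S @ X
    return w - np.linalg.solve(H, g)    # w - H^{-1} g
```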
Gradient descent vs. Newton’s method

▶ Newton's method usually requires significantly fewer iterations than
  gradient descent.

▶ Computing the Hessian and its inverse is expensive.

▶ Approximation algorithms exist that compute the product of the inverse
  Hessian with the gradient without explicitly computing H.
Cross-entropy loss vs. squared loss

Figure: Decision boundaries obtained by minimizing the squared loss (magenta
line) and the cross-entropy loss (green line).

Figure credit: Christopher Bishop
Regularized logistic regression

One can do regularization for logistic regression just like in the case
of linear regression.

▶ ℓ2-regularized logistic regression:

      J(w) ≜ −log L(w) + (λ/2) ||w||_2^2

▶ ℓ1-regularized logistic regression:

      J(w) ≜ −log L(w) + λ ||w||_1
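As a sketch of how the ℓ2 penalty changes the training objective, only the
loss and gradient pick up an extra term. This is illustrative; in practice
the bias component is often left unpenalized, but here the full augmented w
is regularized for simplicity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def l2_regularized_loss_and_grad(w, X, y, lam, eps=1e-12):
    """Cross-entropy loss plus (lam / 2) * ||w||_2^2, and its gradient."""
    t = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    loss = -np.sum(y * np.log(t) + (1.0 - y) * np.log(1.0 - t)) \
           + 0.5 * lam * np.dot(w, w)
    grad = X.T @ (t - y) + lam * w      # gradient of the penalty term is lam * w
    return loss, grad
```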
Multi-class logistic regression

▶ For 2 classes:

      p(y = 1|x; w) = 1 / (1 + e^{−w^T x}) = e^{w^T x} / (1 + e^{w^T x})

▶ For C classes {1, . . . , C}:

      p(y = c|x; w_1, . . . , w_C) = e^{w_c^T x} / Σ_{c'=1}^C e^{w_{c'}^T x}

  – called the softmax function
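A minimal, numerically stable sketch of the softmax probabilities. W is
assumed to stack the C class weight vectors as rows; this layout is an
illustrative choice, not fixed by the slides.

```python
import numpy as np

def softmax_proba(W, x):
    """p(y = c | x) = exp(w_c^T x) / sum_c' exp(w_c'^T x) for each class c.

    W: (C, n) matrix whose c-th row is w_c; x: (n,) augmented input.
    Subtracting the maximum score before exponentiating avoids overflow and
    does not change the resulting probabilities.
    """
    scores = W @ x                 # one score w_c^T x per class
    scores -= np.max(scores)       # numerical stability
    e = np.exp(scores)
    return e / np.sum(e)
```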
Multi-class logistic regression

Gradient descent for multi-class logistic regression

1: Input: number of iterations N, learning rate α
2: Initialize C vectors: w_{1,(0)}, . . . , w_{C,(0)}
3: for n = 1 to N do
4:   for c = 1 to C do
5:     Compute the gradient with respect to w_{c,(n)}: g_{c,(n)}
6:     w_{c,(n+1)} = w_{c,(n)} − α g_{c,(n)}
7:   end for
8:   if converged (e.g., |w_{c,(n+1)} − w_{c,(n)}| ≤ ε for all classes) then
9:     Stop
10:   end if
11: end for
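A compact sketch of this loop. The per-class gradient used below,
Σ_i (p_ic − 1[y_i = c]) x_i, is the standard softmax cross-entropy gradient;
the slides do not spell it out, so treat it as an added detail. W stacks the
class weight vectors as rows and labels are coded as integers 0, ..., C−1.

```python
import numpy as np

def fit_softmax_regression(X, y, C, num_iters=1000, alpha=0.01):
    """Batch gradient descent for multi-class (softmax) logistic regression.

    X: (m, n) augmented design matrix; y: (m,) integer labels in {0, ..., C-1}.
    Per-class gradient: sum_i (p_ic - [y_i == c]) x_i, i.e. (P - Y)^T X.
    """
    m, n = X.shape
    W = np.zeros((C, n))
    Y = np.eye(C)[y]                          # one-hot labels, shape (m, C)
    for _ in range(num_iters):
        scores = X @ W.T                      # (m, C) class scores
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)     # softmax probabilities
        grad = (P - Y).T @ X                  # (C, n), one gradient row per class
        W -= alpha * grad
    return W
```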
Multi-class logistic regression

▶ After training, the probability of y = c is given by

      p(y = c|x; w_1, . . . , w_C) ≜ σ_c(x) = e^{w_c^T x} / Σ_{c'=1}^C e^{w_{c'}^T x}

▶ Predict the class label as the most probable label:

      y = arg max_c σ_c(x)
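For completeness, a one-line prediction sketch with the trained weights. W is
the (C, n) matrix of stacked class weight vectors, as in the sketches above
(an illustrative layout).

```python
import numpy as np

def predict_class(W, x):
    """Most probable label: argmax_c p(y = c | x).

    The softmax is monotone in the scores w_c^T x, so the argmax of the raw
    scores equals the argmax of the probabilities and normalization can be
    skipped.
    """
    return int(np.argmax(W @ x))
```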
Summary

▶ Logistic regression for classification

▶ Cross-entropy loss for logistic regression

▶ No analytical solution that minimizes the cross-entropy loss
  (equivalently, maximizes the log-likelihood)

▶ Use gradient descent to find a solution

▶ Multi-class logistic regression
