
Artificial Intelligence II (CS4442 & CS9542)

Classification: Logistic Regression

Boyu Wang
Department of Computer Science
University of Western Ontario
Recall: tumor example

Slide credit: Doina Precup

Recall: supervised learning

▶ A training example i has the form (x_i, y_i), where x_i ∈ R^n and n is the
  number of features (the feature dimension). If y ∈ R, this is the
  regression problem.

▶ If y ∈ {0, 1}, this is the binary classification problem.

▶ If y ∈ {1, . . . , C} (i.e., y can take more than two values), this is the
  multi-class classification problem.

▶ Most binary classification algorithms can be extended to multi-class
  classification algorithms.
Linear model for classification

▶ As in linear regression, we consider a linear model h_w for classification:

      h_w(x) = w^T x,

  where we have already augmented the data: x → [x; 1].

▶ Rule for binary classification: h_w(x) ≥ 0 ⇒ y = 1; h_w(x) < 0 ⇒ y = 0.

▶ How to choose w?
Error (cost) function for classification

▶ Recall: for regression, we use the sum-of-squared-errors cost:

      J(w) = (1/2) Σ_{i=1}^m (h_w(x_i) − y_i)^2

▶ Can we use it for classification?

▶ We could, but it is not the best choice:
  - If y = 1, we want h_w(x) > 0 by as large a margin as possible, which
    reflects our confidence in the classification.
  - Recall the connection between linear regression and maximum likelihood
    under a Gaussian noise assumption.
Probabilistic view for classification

▶ We would like a way of learning that is more suitable for the problem.

▶ Classification can be formulated as the following question: given a data
  point x, what is the probability p(y|x)?

▶ Intuition: we want to find a conditional probability p(y|x; w),
  parameterized by w, such that

      h_w(x) = w^T x → +∞  ⇒  p(y = 1|x) → 1
      h_w(x) = w^T x → −∞  ⇒  p(y = 1|x) → 0  (i.e., p(y = 0|x) → 1)
      h_w(x) = w^T x = 0   ⇒  p(y = 1|x) = 0.5
Sigmoid function

Consider the following function:

      σ(a) ≜ 1 / (1 + e^{−a}) = e^a / (1 + e^a)

- a → +∞ ⇒ σ(a) → 1
- a → −∞ ⇒ σ(a) → 0
- a = 0 ⇒ σ(a) = 0.5

Plug h_w(x) into σ(·):

      p(y = 1|x; w) ≜ σ(h_w(x)) = 1 / (1 + e^{−w^T x}),

which is exactly what we are looking for!
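To make the sigmoid mapping concrete, here is a minimal numerical sketch of
σ and the resulting class-1 probability. The function names (sigmoid,
predict_proba) and the use of NumPy are illustrative choices, not part of the
lecture.

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, x):
    """p(y = 1 | x; w) = sigmoid(w^T x), with x already augmented by a 1."""
    return sigmoid(w @ x)

# Large positive scores give probabilities near 1, large negative scores give
# probabilities near 0, and a score of exactly 0 gives 0.5.
w = np.array([2.0, -1.0, 0.5])                        # last entry acts as the bias
print(predict_proba(w, np.array([3.0, 0.0, 1.0])))    # close to 1
print(predict_proba(w, np.array([-3.0, 0.0, 1.0])))   # close to 0
```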
1D sigmoid function

Figure: The sigmoid function and the predicted probabilities.

Figure credit: Kevin Murphy

2D sigmoid function

Figure: Plots of σ(w1 x1 + w2 x2 )


Figure credit: Kevin Murphy

Error (cost) function for classification – revisited

▶ Maximum likelihood classification: find the hypothesis that maximizes the
  (log-)likelihood of the training data:

      arg max_h log p({x_i, y_i}_{i=1}^m | h) = arg max_h Σ_{i=1}^m log p(x_i, y_i | h)

  (using the i.i.d. assumption)

▶ If we ignore the marginal distribution p(x), we may maximize the
  conditional probability of the labels, given the inputs and the
  hypothesis h:

      arg max_h Σ_{i=1}^m log p(y_i | x_i; h)
The cross-entropy loss function for logistic regression

▶ Recall that for any data point (x_i, y_i), the conditional probability can
  be represented by the sigmoid function:

      p(y_i = 1 | x_i; h) = σ(h_w(x_i))

▶ Then the log-likelihood of a hypothesis h_w is

      log L(w) = Σ_{i=1}^m log p(y_i | x_i; h_w)
               = Σ_{i=1}^m [ log σ(h_w(x_i)) if y_i = 1;  log(1 − σ(h_w(x_i))) if y_i = 0 ]
               = Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) ),

  where t_i = σ(h_w(x_i)).

▶ The cross-entropy loss function is the negative of log L(w).
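As a quick sanity check on the formula above, the cross-entropy loss can be
computed directly. This is a minimal sketch assuming NumPy; the clipping
constant eps is an added numerical-stability detail, not something from the
slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_loss(w, X, y, eps=1e-12):
    """J(w) = -sum_i [ y_i log t_i + (1 - y_i) log(1 - t_i) ], t_i = sigmoid(w^T x_i).

    X has shape (m, n) with the column of ones already appended; y is in {0, 1}.
    Probabilities are clipped away from 0 and 1 to avoid log(0).
    """
    t = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.sum(y * np.log(t) + (1.0 - y) * np.log(1.0 - t))
```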
Linear regression vs. logistic regression

Both use the linear model h_w(x) = w^T x.

▶ Conditional probability:
  - linear regression:
        p(y_i | x_i; w) = (1 / √(2πσ^2)) exp( −(y_i − w^T x_i)^2 / (2σ^2) )
  - logistic regression:
        p(y_i | x_i; w) = 1 / (1 + e^{−w^T x_i})       if y_i = 1
        p(y_i | x_i; w) = 1 − 1 / (1 + e^{−w^T x_i})   if y_i = 0

▶ Log-likelihood function:
  - linear regression:
        log L(w) = Σ_{i=1}^m log [ (1 / √(2πσ^2)) exp( −(y_i − w^T x_i)^2 / (2σ^2) ) ]
  - logistic regression:
        log L(w) = Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) ),  where t_i = σ(h_w(x_i))

▶ Solution:
  - linear regression: w = (X^T X)^{-1} X^T y  (closed form)
  - logistic regression: no analytical solution
Optimization procedure: gradient descent

Objective: minimize a loss function J(w)

Gradient descent for function minimization

1: Input: number of iterations N, learning rate α
2: Initialize w_(0)
3: for n = 1 to N do
4:   Compute the gradient: g_(n) = ∇J(w_(n))
5:   w_(n+1) = w_(n) − α g_(n)
6:   if converged (e.g., |w_(n+1) − w_(n)| ≤ ε) then
7:     Stop
8:   end if
9: end for
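A minimal Python sketch of this loop. The names gradient_descent and grad_J
are illustrative; the caller supplies the gradient function and a starting
point.

```python
import numpy as np

def gradient_descent(grad_J, w0, num_iters=100, alpha=0.1, tol=1e-6):
    """Generic gradient descent: w <- w - alpha * grad_J(w), stopping once the
    update is smaller than tol (the epsilon in the pseudocode above)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        g = grad_J(w)
        w_new = w - alpha * g
        if np.linalg.norm(w_new - w) <= tol:   # convergence check
            return w_new
        w = w_new
    return w

# Usage example on J(w) = ||w||^2 / 2, whose gradient is simply w:
w_min = gradient_descent(lambda w: w, w0=[3.0, -2.0])
print(w_min)   # close to the minimizer [0, 0]
```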
Effect of the learning rate

Figure: Gradient descent on a simple function, starting from (0, 0), for 20
steps with a fixed learning rate α. (a) α = 0.1. (b) α = 0.6.

Figure credit: Kevin Murphy
Optimization procedure: Newton’s method
Objective: minimize a loss function J(w)

Newton's method for function minimization

1: Input: number of iterations N
2: Initialize w_(0)
3: for n = 1 to N do
4:   Compute the gradient: g_(n) = ∇J(w_(n))
5:   Compute the Hessian: H_(n) = ∇²J(w_(n))
6:   w_(n+1) = w_(n) − H_(n)^{-1} g_(n)
7:   if converged (e.g., |w_(n+1) − w_(n)| ≤ ε) then
8:     Stop
9:   end if
10: end for
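A minimal sketch of this update, assuming the caller provides grad_J and
hess_J (hypothetical names). np.linalg.solve is used instead of explicitly
forming the inverse Hessian.

```python
import numpy as np

def newtons_method(grad_J, hess_J, w0, num_iters=20, tol=1e-8):
    """Newton's method: w <- w - H^{-1} g, where g and H are the gradient and
    Hessian of J at the current w."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        g = grad_J(w)
        H = hess_J(w)
        w_new = w - np.linalg.solve(H, g)   # solve H d = g rather than invert H
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w
```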
Gradient descent for logistic regression
Objective: minimize the cross-entropy loss function (−log L(w)):

      J(w) ≜ −log L(w) = −Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) )

      ∇J(w) = Σ_{i=1}^m (t_i − y_i) x_i,   t_i = σ(h_w(x_i)) = 1 / (1 + e^{−w^T x_i})

Gradient descent for logistic regression

1: Input: number of iterations N, learning rate α
2: Initialize w_(0)
3: for n = 1 to N do
4:   Compute the gradient: g_(n) = ∇J(w_(n)) = Σ_{i=1}^m (t_i − y_i) x_i
5:   w_(n+1) = w_(n) − α g_(n)
6:   if converged (e.g., |w_(n+1) − w_(n)| ≤ ε) then
7:     Stop
8:   end if
9: end for
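Putting the pieces together, here is a minimal batch-gradient-descent sketch
for logistic regression. The function and argument names are illustrative,
and X is assumed to already contain the appended column of ones.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, num_iters=1000, alpha=0.01, tol=1e-6):
    """Minimize the cross-entropy loss J(w) by batch gradient descent.

    X: (m, n) design matrix with a column of ones appended; y: (m,) labels in {0, 1}.
    Gradient: grad J(w) = sum_i (t_i - y_i) x_i = X^T (t - y).
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(num_iters):
        t = sigmoid(X @ w)        # predicted probabilities t_i
        g = X.T @ (t - y)         # gradient of the cross-entropy loss
        w_new = w - alpha * g
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w

# Prediction: p(y = 1 | x) = sigmoid(w^T x); predict label 1 when this is >= 0.5.
```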
Newton’s method for logistic regression

Objective: minimize the cross-entropy loss function (−log L(w)):

      J(w) ≜ −log L(w) = −Σ_{i=1}^m ( y_i log t_i + (1 − y_i) log(1 − t_i) )

      ∇J(w) = Σ_{i=1}^m (t_i − y_i) x_i
      H(w) = ∇²J(w) = Σ_{i=1}^m t_i (1 − t_i) x_i x_i^T = X^T S X,

where S ∈ R^{m×m} is a diagonal matrix with elements S_ii = t_i (1 − t_i), and
t_i = σ(h_w(x_i)) = 1 / (1 + e^{−w^T x_i}).

Then the update rule becomes:

      w_(n+1) = w_(n) − H(w_(n))^{-1} ∇J(w_(n))
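A minimal sketch of one Newton step for this objective, with illustrative
names. The Hessian is built as X^T S X and the linear system is solved
directly rather than inverting H.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step_logistic(w, X, y):
    """One Newton update for the cross-entropy loss.

    grad = X^T (t - y);  H = X^T S X with S = diag(t_i (1 - t_i)).
    """
    t = sigmoid(X @ w)
    g = X.T @ (t - y)
    S = np.diag(t * (1.0 - t))          # m x m diagonal weight matrix
    H = X.T @ S @ X
    return w - np.linalg.solve(H, g)    # w - H^{-1} g
```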
Gradient descent vs. Newton’s method

▶ Newton's method usually requires significantly fewer iterations than
  gradient descent.

▶ Computing the Hessian and its inverse is expensive.

▶ Approximation algorithms exist that compute the product of the inverse
  Hessian with the gradient without explicitly computing H.
Cross-entropy loss vs. squared loss

Figure: Decision boundaries obtained by minimizing the squared loss (magenta
line) and the cross-entropy loss (green line).

Figure credit: Christopher Bishop
Regularized logistic regression

One can do regularization for logistic regression just like in the case
of linear regression.

▶ ℓ2-regularized logistic regression:

      J(w) ≜ −log L(w) + (λ/2) ||w||_2^2

▶ ℓ1-regularized logistic regression:

      J(w) ≜ −log L(w) + λ ||w||_1
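As a sketch of how the ℓ2 penalty changes the training objective, only the
loss and gradient pick up an extra term. This is illustrative; in practice
the bias component is often left unpenalized, but here the full augmented w
is regularized for simplicity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def l2_regularized_loss_and_grad(w, X, y, lam, eps=1e-12):
    """Cross-entropy loss plus (lam / 2) * ||w||_2^2, and its gradient."""
    t = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    loss = -np.sum(y * np.log(t) + (1.0 - y) * np.log(1.0 - t)) \
           + 0.5 * lam * np.dot(w, w)
    grad = X.T @ (t - y) + lam * w      # gradient of the penalty term is lam * w
    return loss, grad
```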
Multi-class logistic regression

▶ For 2 classes:

      p(y = 1|x; w) = 1 / (1 + e^{−w^T x}) = e^{w^T x} / (1 + e^{w^T x})

▶ For C classes {1, . . . , C}:

      p(y = c|x; w_1, . . . , w_C) = e^{w_c^T x} / Σ_{c'=1}^C e^{w_{c'}^T x}

  – called the softmax function
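A minimal, numerically stable sketch of the softmax probabilities. W is
assumed to stack the C class weight vectors as rows; this layout is an
illustrative choice, not fixed by the slides.

```python
import numpy as np

def softmax_proba(W, x):
    """p(y = c | x) = exp(w_c^T x) / sum_c' exp(w_c'^T x) for each class c.

    W: (C, n) matrix whose c-th row is w_c; x: (n,) augmented input.
    Subtracting the maximum score before exponentiating avoids overflow and
    does not change the resulting probabilities.
    """
    scores = W @ x                 # one score w_c^T x per class
    scores -= np.max(scores)       # numerical stability
    e = np.exp(scores)
    return e / np.sum(e)
```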
Multi-class logistic regression

Gradient descent for multi-class logistic regression

1: Input: number of iterations N, learning rate α
2: Initialize C vectors: w_{1,(0)}, . . . , w_{C,(0)}
3: for n = 1 to N do
4:   for c = 1 to C do
5:     Compute the gradient with respect to w_{c,(n)}: g_{c,(n)}
6:     w_{c,(n+1)} = w_{c,(n)} − α g_{c,(n)}
7:   end for
8:   if converged (e.g., |w_{c,(n+1)} − w_{c,(n)}| ≤ ε for all classes) then
9:     Stop
10:   end if
11: end for
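A compact sketch of this loop. The per-class gradient used below,
Σ_i (p_ic − 1[y_i = c]) x_i, is the standard softmax cross-entropy gradient;
the slides do not spell it out, so treat it as an added detail. W stacks the
class weight vectors as rows and labels are coded as integers 0, ..., C−1.

```python
import numpy as np

def fit_softmax_regression(X, y, C, num_iters=1000, alpha=0.01):
    """Batch gradient descent for multi-class (softmax) logistic regression.

    X: (m, n) augmented design matrix; y: (m,) integer labels in {0, ..., C-1}.
    Per-class gradient: sum_i (p_ic - [y_i == c]) x_i, i.e. (P - Y)^T X.
    """
    m, n = X.shape
    W = np.zeros((C, n))
    Y = np.eye(C)[y]                          # one-hot labels, shape (m, C)
    for _ in range(num_iters):
        scores = X @ W.T                      # (m, C) class scores
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)     # softmax probabilities
        grad = (P - Y).T @ X                  # (C, n), one gradient row per class
        W -= alpha * grad
    return W
```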
Multi-class logistic regression

▶ After training, the probability of y = c is given by

      p(y = c|x; w_1, . . . , w_C) ≜ σ_c(x) = e^{w_c^T x} / Σ_{c'=1}^C e^{w_{c'}^T x}

▶ Predict the class label as the most probable label:

      y = arg max_c σ_c(x)
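For completeness, a one-line prediction sketch with the trained weights. W is
the (C, n) matrix of stacked class weight vectors, as in the sketches above
(an illustrative layout).

```python
import numpy as np

def predict_class(W, x):
    """Most probable label: argmax_c p(y = c | x).

    The softmax is monotone in the scores w_c^T x, so the argmax of the raw
    scores equals the argmax of the probabilities and normalization can be
    skipped.
    """
    return int(np.argmax(W @ x))
```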
Summary

▶ Logistic regression for classification

▶ Cross-entropy loss for logistic regression

▶ No analytical solution that minimizes the cross-entropy loss
  (equivalently, maximizes the log-likelihood)

▶ Use gradient descent to find a solution

▶ Multi-class logistic regression
