
Lecture 4: More classifiers and classes

C4B Machine Learning, Hilary 2011, A. Zisserman

• Logistic regression
  • Loss functions revisited

• AdaBoost
  • Loss functions revisited

• Optimization

• Multi-class classification

Logistic Regression
Overview

• Logistic regression is actually a classification method

• LR introduces an extra non-linearity over a linear classifier, f(x) = w^T x + b, by using a logistic (or sigmoid) function, σ(·).

• The LR classifier is defined as

      σ(f(x_i)) ≥ 0.5  →  y_i = +1
      σ(f(x_i)) < 0.5  →  y_i = −1

  where σ(f(x)) = 1 / (1 + e^{−f(x)})

The logistic function or sigmoid function


    σ(z) = 1 / (1 + e^{−z})

[Plot: σ(z) for z from −20 to 20]

• As z goes from −∞ to ∞, σ(z) goes from 0 to 1: a “squashing function”

• It has a “sigmoid” shape (i.e. S-like shape)

• σ(0) = 0.5, and if z = w^T x + b then ||dσ(z)/dx||_{z=0} = (1/4)||w|| (checked numerically below)
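The following is a small NumPy sketch, not part of the original slides, that evaluates a numerically stable sigmoid and checks the properties listed above (the limits, σ(0) = 0.5, and the slope 1/4 at z = 0, which gives the (1/4)||w|| factor by the chain rule). All names here are illustrative.

    import numpy as np

    def sigmoid(z):
        """Numerically stable sigma(z) = 1 / (1 + exp(-z))."""
        z = np.atleast_1d(np.asarray(z, dtype=float))
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        ez = np.exp(z[~pos])            # avoids overflow of exp(-z) for very negative z
        out[~pos] = ez / (1.0 + ez)
        return out

    z = np.linspace(-20, 20, 401)
    s = sigmoid(z)
    print(s[0], sigmoid(0.0), s[-1])    # ~0, 0.5, ~1: the "squashing" behaviour

    # slope dsigma/dz at z = 0 is sigma(0) * (1 - sigma(0)) = 1/4
    h = 1e-6
    print((sigmoid(h) - sigmoid(-h)) / (2 * h))   # ~0.25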
Intuition – why use a sigmoid?
Here, binary classification is represented by y_i ∈ {0, 1} rather than y_i ∈ {−1, 1}.

Least squares fit

[Plots: σ(wx + b) fitted to y, compared with a linear fit wx + b to y, for 1D data]

• the fit of wx + b is dominated by the more distant points

• this causes misclassification

• instead, LR regresses the sigmoid to the class data

Similarly in 2D

[Plots: LR fit σ(w1x1 + w2x2 + b) vs linear fit w1x1 + w2x2 + b]

Learning
In logistic regression, a sigmoid function is fitted to the data {x_i, y_i} by minimizing the classification errors y_i − σ(w^T x_i).

[Plot: sigmoid fitted to the binary class data]

Margin property
A sigmoid favours a larger margin compared with a step classifier.

[Plots illustrating the margin property]
Probabilistic interpretation

• Think of σ(f(x)) as the posterior probability that y = 1, i.e. P(y = 1|x) = σ(f(x))

• Hence, if σ(f(x)) > 0.5 then class y = 1 is selected

• Then, after a rearrangement,

      f(x) = log [ P(y = 1|x) / (1 − P(y = 1|x)) ] = log [ P(y = 1|x) / P(y = 0|x) ]

  which is the log odds ratio

Maximum Likelihood Estimation


Assume

    p(y = 1 | x; w) = σ(w^T x)
    p(y = 0 | x; w) = 1 − σ(w^T x)

and write this more compactly as

    p(y | x; w) = σ(w^T x)^y (1 − σ(w^T x))^{(1 − y)}

Then the likelihood (assuming data independence) is

    p(y | x; w) ∝ Π_{i=1}^{N} σ(w^T x_i)^{y_i} (1 − σ(w^T x_i))^{(1 − y_i)}

and the negative log likelihood is

    L(w) = − Σ_{i=1}^{N} [ y_i log σ(w^T x_i) + (1 − y_i) log(1 − σ(w^T x_i)) ]

A short code sketch of L(w) and its gradient is given below.
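Below is a minimal NumPy sketch, not from the slides, of this negative log likelihood and its gradient for labels y_i ∈ {0, 1}, followed by a few plain gradient steps. The function names and the toy data are purely illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nll_and_grad(w, X, y, eps=1e-12):
        """Negative log likelihood L(w) and its gradient, labels y in {0, 1}.
        X: (N, d) data matrix, w: (d,) weights (a bias can be included by
        appending a constant-1 feature to each x)."""
        p = sigmoid(X @ w)                               # sigma(w^T x_i) for each i
        nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        grad = X.T @ (p - y)                             # dL/dw
        return nll, grad

    # tiny illustrative example with made-up data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    w = np.zeros(3)
    for _ in range(200):                                 # plain gradient descent
        loss, g = nll_and_grad(w, X, y)
        w -= 0.01 * g
    print(loss, w)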
Logistic Regression Loss function
Use the notation y_i ∈ {−1, 1}. Then

    P(y = 1 | x) = σ(f(x)) = 1 / (1 + e^{−f(x)})
    P(y = −1 | x) = 1 − σ(f(x)) = 1 / (1 + e^{+f(x)})

So in both cases

    P(y_i | x_i) = 1 / (1 + e^{−y_i f(x_i)})

Assuming independence, the likelihood is

    Π_{i=1}^{N} 1 / (1 + e^{−y_i f(x_i)})

and the negative log likelihood is

    Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})

which defines the loss function.

Logistic Regression Learning


Learning is formulated as the optimization problem

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})  +  λ||w||²

where the first term is the loss function and the second is the regularization.

• For correctly classified points, −y_i f(x_i) is negative, and log(1 + e^{−y_i f(x_i)}) is near zero

• For incorrectly classified points, −y_i f(x_i) is positive, and log(1 + e^{−y_i f(x_i)}) can be large

• Hence the optimization penalizes parameters which lead to such misclassifications
Comparison of SVM and LR cost functions

SVM:

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²

Logistic regression:

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²

Note:
• both approximate the 0–1 loss
• very similar asymptotic behaviour
• the main difference is smoothness, and the non-zero values outside the margin (seen by plotting each loss against y_i f(x_i))
• the SVM gives a sparse solution for the α_i

These losses are tabulated in the sketch below.
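As a quick illustration (not from the slides), the following tabulates the 0–1, hinge, and logistic losses as a function of the margin y_i f(x_i); the exponential loss used by AdaBoost (next section) is included for comparison.

    import numpy as np

    # margins m = y_i * f(x_i): positive means correctly classified
    m = np.linspace(-3.0, 3.0, 7)

    zero_one    = (m <= 0).astype(float)        # 0-1 loss
    hinge       = np.maximum(0.0, 1.0 - m)      # SVM hinge loss
    logistic    = np.log1p(np.exp(-m))          # LR loss log(1 + e^{-m})
    exponential = np.exp(-m)                    # AdaBoost exponential loss

    for row in zip(m, zero_one, hinge, logistic, exponential):
        print("m=%5.1f  0-1=%.0f  hinge=%.2f  log=%.3f  exp=%.3f" % row)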

AdaBoost
Overview

• AdaBoost is an algorithm for constructing a strong classifier out of a linear combination

      Σ_{t=1}^{T} α_t h_t(x)

  of simple weak classifiers h_t(x). It provides a method of choosing the weak classifiers and setting the weights α_t

Terminology

• weak classifier: h_t(x) ∈ {−1, 1}, for data vector x

• strong classifier: H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Example: combination of linear classifiers h_t(x) ∈ {−1, 1}

[Figure: weak classifier 1, h_1(x); weak classifier 2, h_2(x); weak classifier 3, h_3(x); and the resulting strong classifier H(x)]

    H(x) = sign(α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x))

• Note, this linear combination is not a simple majority vote (it would be if all the α_t were equal)

• Need to compute the α_t as well as selecting the weak classifiers
AdaBoost algorithm – building a strong classifier

Start with equal weights on each x_i, and a set of weak classifiers h_t(x).

For t = 1, …, T:

• Select the weak classifier with minimum weighted error

      ε_t = Σ_i ω_i [h_t(x_i) ≠ y_i]     where the ω_i are weights

• Set

      α_t = (1/2) ln((1 − ε_t) / ε_t)

• Reweight the examples (boosting) to give misclassified examples more weight:

      ω_{t+1,i} = ω_{t,i} e^{−α_t y_i h_t(x_i)}

• Add the weak classifier with weight α_t

The strong classifier is

    H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Example

[Figure: weak classifier 1; weights increased on misclassified points; weak classifier 2; weak classifier 3; the final classifier]

• start with equal weights on each data point x_i

      ε_j = Σ_i ω_i [h_j(x_i) ≠ y_i]

• the final classifier is a linear combination of the weak classifiers
The AdaBoost algorithm (Freund & Schapire, 1995)

• Given example data (x_1, y_1), …, (x_n, y_n), where y_i = −1, 1 for negative and positive examples respectively.

• Initialize the weights ω_{1,i} = 1/(2m), 1/(2l) for y_i = −1, 1 respectively, where m and l are the number of negatives and positives respectively.

• For t = 1, …, T:

  1. Normalize the weights,

         ω_{t,i} ← ω_{t,i} / Σ_{j=1}^{n} ω_{t,j}

     so that ω_t is a probability distribution.

  2. For each j, train a weak classifier h_j with error evaluated with respect to ω_{t,i}:

         ε_j = Σ_i ω_{t,i} [h_j(x_i) ≠ y_i]

  3. Choose the classifier h_t with the lowest error ε_t.

  4. Set

         α_t = (1/2) ln((1 − ε_t) / ε_t)

  5. Update the weights

         ω_{t+1,i} = ω_{t,i} e^{−α_t y_i h_t(x_i)}

• The final strong classifier is

      H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

A code sketch of the algorithm follows.
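A compact sketch of this algorithm, not from the slides, using axis-aligned decision stumps as the weak classifiers; for simplicity the weights are initialized uniformly (rather than 1/(2m), 1/(2l)), and all function names and data are illustrative.

    import numpy as np

    def train_stumps_adaboost(X, y, T):
        """AdaBoost with decision stumps. X: (N, d), y: (N,) labels in {-1, +1}.
        Returns a list of (feature, threshold, polarity, alpha) tuples."""
        N, d = X.shape
        w = np.full(N, 1.0 / N)                      # start with equal weights
        ensemble = []
        for _ in range(T):
            best = None
            # exhaustive search over stumps h(x) = polarity * sign(x[f] - thr)
            for f in range(d):
                for thr in np.unique(X[:, f]):
                    for polarity in (+1, -1):
                        pred = polarity * np.sign(X[:, f] - thr)
                        pred[pred == 0] = polarity
                        err = np.sum(w * (pred != y))        # weighted error
                        if best is None or err < best[0]:
                            best = (err, f, thr, polarity, pred)
            err, f, thr, polarity, pred = best
            err = np.clip(err, 1e-10, 1 - 1e-10)             # avoid division by zero
            alpha = 0.5 * np.log((1 - err) / err)
            w = w * np.exp(-alpha * y * pred)                # boost misclassified points
            w = w / w.sum()                                  # renormalize
            ensemble.append((f, thr, polarity, alpha))
        return ensemble

    def predict(ensemble, X):
        score = np.zeros(X.shape[0])
        for f, thr, polarity, alpha in ensemble:
            pred = polarity * np.sign(X[:, f] - thr)
            pred[pred == 0] = polarity
            score += alpha * pred
        return np.sign(score)                                # strong classifier H(x)

    # toy data
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    ens = train_stumps_adaboost(X, y, T=10)
    print(np.mean(predict(ens, X) == y))                     # training accuracy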

Why does it work?

The AdaBoost algorithm carries out a greedy optimization of a loss function.

AdaBoost (exponential loss):

    min_{α_i, h_i}  Σ_{i=1}^{N} e^{−y_i H(x_i)}

SVM (hinge loss):

    Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))

Logistic regression loss:

    Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})

[Plot: these three losses as a function of y_i f(x_i)]
Sketch derivation – non-examinable
The objective function used by AdaBoost is

    J(H) = Σ_i e^{−y_i H(x_i)}

For a correctly classified point the penalty is exp(−|H|), and for an incorrectly classified point the penalty is exp(+|H|). The AdaBoost algorithm incrementally decreases the cost by adding simple functions to

    H(x) = Σ_t α_t h_t(x)

Suppose that we have a function B and we propose to add the function αh(x), where the scalar α is to be determined and h(x) is some function that takes values in +1 or −1 only. The new function is B(x) + αh(x) and the new cost is

    J(B + αh) = Σ_i e^{−y_i B(x_i)} e^{−α y_i h(x_i)}

Differentiating with respect to α and setting the result to zero gives

    e^{−α} Σ_{y_i = h(x_i)} e^{−y_i B(x_i)}  −  e^{+α} Σ_{y_i ≠ h(x_i)} e^{−y_i B(x_i)}  =  0

Rearranging, the optimal value of α is therefore determined to be

    α = (1/2) log [ Σ_{y_i = h(x_i)} e^{−y_i B(x_i)} / Σ_{y_i ≠ h(x_i)} e^{−y_i B(x_i)} ]

The classification error is defined as

    ε = Σ_i ω_i [h(x_i) ≠ y_i]

where

    ω_i = e^{−y_i B(x_i)} / Σ_j e^{−y_j B(x_j)}

Then it can be shown that

    α = (1/2) log((1 − ε) / ε)

The update from B to H therefore involves evaluating the weighted performance ε of the “weak” classifier h (with the weights ω_i given above).

If the current function B is B(x) = 0 then the weights will be uniform. This is a common starting point for the minimization. As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights by exp(−α y_i h(x_i)) and then normalizing. This gives the update formula

    ω_{t+1,i} = (1/Z_t) ω_{t,i} e^{−α_t y_i h_t(x_i)}

where Z_t is a normalizing factor.

Choosing h: the function h is not chosen arbitrarily, but is chosen to give a good performance (a low value of ε) on the training data weighted by ω.
Optimization

We have seen many cost functions, e.g.

SVM:

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²

Logistic regression:

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²

[Figure: a cost function with a local minimum and a global minimum]

• Do these have a unique solution?

• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).
Convex functions

Convex function examples

[Figure: examples of a convex and a non-convex function]

A non-negative sum of convex functions is convex.

Logistic regression:

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²    — convex

SVM:

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²     — convex

Gradient (or Steepest) descent algorithms


To minimize a cost function C(w), use the iterative update

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η_t is the learning rate.

In our case the loss function is a sum over the training data. For example, for LR,

    min_{w ∈ R^d}  C(w) = Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²  =  Σ_{i=1}^{N} L(x_i, y_i; w) + λ||w||²

This means that one iterative update consists of a pass through the training data, with an update for each point:

    w_{t+1} ← w_t − ( Σ_{i=1}^{N} η_t ∇_w L(x_i, y_i; w_t) + 2λ w_t )

The advantage is that, for large amounts of data, this can be carried out point by point.
Gradient descent algorithm for LR
Minimizing L(w) using gradient descent gives [exercise] the per-point update rule

    w ← w − η (σ(w^T x_i) − y_i) x_i

where y_i ∈ {0, 1}.

Note:
• this is similar, but not identical, to the perceptron update rule

      w ← w − η sign(w^T x_i) x_i

• there is a unique solution for w
• in practice, more efficient Newton methods are used to minimize L
• there can be problems with w becoming infinite for linearly separable data

The sketch below implements this per-point update.
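A minimal sketch (not from the slides) of the per-point update with the λ||w||² regularizer included; the learning rate, data, and function names are illustrative, and in practice Newton-type methods would be preferred, as noted above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lr_sgd(X, y, lr=0.1, lam=0.01, epochs=50, seed=0):
        """Per-point (stochastic) gradient descent for regularized logistic
        regression with labels y in {0, 1}. Illustrative sketch only."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(N):                  # one pass through the data
                err = sigmoid(w @ X[i]) - y[i]            # sigma(w^T x_i) - y_i
                w -= lr * (err * X[i] + 2 * lam * w / N)  # loss gradient + regularizer
        return w

    # toy example: the bias is handled by appending a constant-1 feature
    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 2))
    X = np.hstack([X, np.ones((300, 1))])
    y = (X[:, 0] - X[:, 1] > 0).astype(float)
    w = lr_sgd(X, y)
    print(np.mean((sigmoid(X @ w) >= 0.5) == y))          # training accuracy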

Gradient descent algorithm for SVM


First, rewrite the optimization problem as an average:

    min_w  C(w) = (λ/2)||w||² + (1/N) Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))
                = (1/N) Σ_{i=1}^{N} [ (λ/2)||w||² + max(0, 1 − y_i f(x_i)) ]

(with λ = 2/(NC), up to an overall scale of the problem) and f(x) = w^T x + b.

Because the hinge loss is not differentiable, a sub-gradient is computed.

Sub-gradient for hinge loss

    L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)),    with f(x_i) = w^T x_i + b

    ∂L/∂w = −y_i x_i    if y_i f(x_i) < 1
    ∂L/∂w = 0           otherwise

[Plot: the hinge loss as a function of y_i f(x_i)]

Sub-gradient descent algorithm for SVM

    C(w) = (1/N) Σ_{i=1}^{N} [ (λ/2)||w||² + L(x_i, y_i; w) ]

The iterative update is

    w_{t+1} ← w_t − η ∇_w C(w_t)
            ← w_t − η (1/N) Σ_{i=1}^{N} ( λ w_t + ∇_w L(x_i, y_i; w_t) )

where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

    w_{t+1} ← (1 − ηλ) w_t + η y_i x_i    if y_i (w^T x_i + b) < 1
    w_{t+1} ← (1 − ηλ) w_t                otherwise

These updates are sketched in code below.
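A sketch of these updates, not from the slides, with the bias b updated alongside w (an assumption; the slides leave the bias update implicit); names and data are illustrative.

    import numpy as np

    def svm_subgradient(X, y, lam=0.01, lr=0.01, epochs=50, seed=0):
        """Per-point sub-gradient descent for the linear SVM objective
        (lambda/2)||w||^2 + (1/N) sum max(0, 1 - y_i(w^T x_i + b)),
        with labels y in {-1, +1}. Illustrative sketch only."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        b = 0.0
        for _ in range(epochs):
            for i in rng.permutation(N):
                margin = y[i] * (w @ X[i] + b)
                w = (1 - lr * lam) * w               # shrinkage from the regularizer
                if margin < 1:                       # inside the margin: hinge active
                    w += lr * y[i] * X[i]
                    b += lr * y[i]
        return w, b

    # toy example
    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 2))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
    w, b = svm_subgradient(X, y)
    print(np.mean(np.sign(X @ w + b) == y))          # training accuracy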
Multi-class Classification

Multi-Class Classification – what we would like

Assign input vector x to one of K classes Ck

Goal: a decision rule that divides input space into K decision regions separated by decision boundaries
Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm
• For each test point x to be classified, find the K nearest samples in the training data
• Classify the point x according to the majority vote of their class labels

[Figure: e.g. K = 3]

• naturally applicable to the multi-class case (sketched in code below)
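A small sketch of the K-NN rule above, not from the slides, using Euclidean distance; the data and names are made up for illustration.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, X_test, K=3):
        """K nearest neighbour classification by majority vote."""
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
            nearest = np.argsort(dists)[:K]               # indices of the K nearest
            votes = Counter(y_train[nearest])             # majority vote over labels
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

    # toy 3-class example
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 30)
    print(knn_predict(X, y, np.array([[0.1, 0.2], [2.8, 0.1], [0.2, 2.9]]), K=3))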

Build from binary classifiers …

• Learn: K two-class “1 vs the rest” classifiers f_k(x)

[Figure: three classes C1, C2, C3, the classifiers 1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2, and a region marked “?”]
Build from binary classifiers …

• Learn: K two-class “1 vs the rest” classifiers f_k(x)

• Classification: choose the class with the most positive score

      max_k f_k(x)

[Figure: the three classes C1, C2, C3 and the classifiers 1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2, now labelled by max_k f_k(x)]

A one-vs-the-rest sketch in code follows.
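A sketch, not from the slides, of “one vs the rest” training and classification with a simple linear scorer trained by SVM-style sub-gradient updates as in the previous section; all names and data are illustrative.

    import numpy as np

    def train_linear(X, y, lr=0.01, lam=0.01, epochs=100, seed=0):
        """A linear scorer f(x) = w^T x + b trained with hinge-loss sub-gradient
        updates; labels y in {-1, +1}. Illustrative only."""
        rng = np.random.default_rng(seed)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                w = (1 - lr * lam) * w
                if y[i] * (w @ X[i] + b) < 1:
                    w += lr * y[i] * X[i]
                    b += lr * y[i]
        return w, b

    def one_vs_rest(X, y, classes):
        """Learn K 'one vs the rest' classifiers f_k(x)."""
        return {k: train_linear(X, np.where(y == k, 1, -1)) for k in classes}

    def predict(models, X):
        # choose the class with the most positive score f_k(x)
        classes = np.array(list(models.keys()))
        scores = np.column_stack([X @ w + b for (w, b) in models.values()])
        return classes[np.argmax(scores, axis=1)]

    # toy 3-class example
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(m, 0.6, size=(40, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 40)
    models = one_vs_rest(X, y, classes=[0, 1, 2])
    print(np.mean(predict(models, X) == y))              # training accuracy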

Application: handwritten digit recognition

• Feature vectors: each image is 28 × 28 pixels. Rearrange it as a 784-vector x

• Training: learn k = 10 two-class “1 vs the rest” SVM classifiers f_k(x)

• Classification: choose the class with the most positive score

      f(x) = max_k f_k(x)
Example

[Figure: hand-drawn digits and the classifications assigned to them]

Background reading and more

• Other multiple-class classifiers (not covered here):


• Neural networks
• Random forests

• Bishop, chapters 4.1 – 4.3 and 14.3

• Hastie et al, chapters 10.1 – 10.6

• More on web page:


http://www.robots.ox.ac.uk/~az/lectures/ml
