
Lecture 4: More classifiers and classes

C4B Machine Learning, Hilary 2011, A. Zisserman

• Logistic regression
  • Loss functions revisited

• AdaBoost
  • Loss functions revisited

• Optimization

• Multi-class classification

Logistic Regression
Overview

• Logistic regression is actually a classification method

• LR introduces an extra non-linearity over a linear classifier, f(x) = w^T x + b, by using a logistic (or sigmoid) function, σ(·).

• The LR classifier is defined as

      σ(f(x_i)) ≥ 0.5  →  y_i = +1
      σ(f(x_i)) < 0.5  →  y_i = −1

  where σ(f(x)) = 1 / (1 + e^{−f(x)})

The logistic function or sigmoid function


    σ(z) = 1 / (1 + e^{−z})

[Plot: σ(z) for z from −20 to 20]

• As z goes from −∞ to ∞, σ(z) goes from 0 to 1: a “squashing function”

• It has a “sigmoid” shape (i.e. S-like shape)

• σ(0) = 0.5, and if z = w^T x + b then ||dσ(z)/dx||_{z=0} = (1/4)||w|| (checked numerically below)
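The following is a small NumPy sketch, not part of the original slides, that evaluates a numerically stable sigmoid and checks the properties listed above (the limits, σ(0) = 0.5, and the slope 1/4 at z = 0, which gives the (1/4)||w|| factor by the chain rule). All names here are illustrative.

    import numpy as np

    def sigmoid(z):
        """Numerically stable sigma(z) = 1 / (1 + exp(-z))."""
        z = np.atleast_1d(np.asarray(z, dtype=float))
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        ez = np.exp(z[~pos])            # avoids overflow of exp(-z) for very negative z
        out[~pos] = ez / (1.0 + ez)
        return out

    z = np.linspace(-20, 20, 401)
    s = sigmoid(z)
    print(s[0], sigmoid(0.0), s[-1])    # ~0, 0.5, ~1: the "squashing" behaviour

    # slope dsigma/dz at z = 0 is sigma(0) * (1 - sigma(0)) = 1/4
    h = 1e-6
    print((sigmoid(h) - sigmoid(-h)) / (2 * h))   # ~0.25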
Intuition – why use a sigmoid?
Here, binary classification is represented by y_i ∈ {0, 1} rather than y_i ∈ {−1, 1}.

Least squares fit

[Plots: σ(wx + b) fitted to y, compared with a linear fit wx + b to y, for 1D data]

• the fit of wx + b is dominated by the more distant points

• this causes misclassification

• instead, LR regresses the sigmoid to the class data

Similarly in 2D

[Plots: LR fit σ(w1x1 + w2x2 + b) vs linear fit w1x1 + w2x2 + b]

Learning
In logistic regression, a sigmoid function is fitted to the data {x_i, y_i} by minimizing the classification errors y_i − σ(w^T x_i).

[Plot: sigmoid fitted to the binary class data]

Margin property
A sigmoid favours a larger margin compared with a step classifier.

[Plots illustrating the margin property]
Probabilistic interpretation

• Think of σ(f(x)) as the posterior probability that y = 1, i.e. P(y = 1|x) = σ(f(x))

• Hence, if σ(f(x)) > 0.5 then class y = 1 is selected

• Then, after a rearrangement,

      f(x) = log [ P(y = 1|x) / (1 − P(y = 1|x)) ] = log [ P(y = 1|x) / P(y = 0|x) ]

  which is the log odds ratio

Maximum Likelihood Estimation


Assume

    p(y = 1 | x; w) = σ(w^T x)
    p(y = 0 | x; w) = 1 − σ(w^T x)

and write this more compactly as

    p(y | x; w) = σ(w^T x)^y (1 − σ(w^T x))^{(1 − y)}

Then the likelihood (assuming data independence) is

    p(y | x; w) ∝ Π_{i=1}^{N} σ(w^T x_i)^{y_i} (1 − σ(w^T x_i))^{(1 − y_i)}

and the negative log likelihood is

    L(w) = − Σ_{i=1}^{N} [ y_i log σ(w^T x_i) + (1 − y_i) log(1 − σ(w^T x_i)) ]

A short code sketch of L(w) and its gradient is given below.
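Below is a minimal NumPy sketch, not from the slides, of this negative log likelihood and its gradient for labels y_i ∈ {0, 1}, followed by a few plain gradient steps. The function names and the toy data are purely illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nll_and_grad(w, X, y, eps=1e-12):
        """Negative log likelihood L(w) and its gradient, labels y in {0, 1}.
        X: (N, d) data matrix, w: (d,) weights (a bias can be included by
        appending a constant-1 feature to each x)."""
        p = sigmoid(X @ w)                               # sigma(w^T x_i) for each i
        nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        grad = X.T @ (p - y)                             # dL/dw
        return nll, grad

    # tiny illustrative example with made-up data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    w = np.zeros(3)
    for _ in range(200):                                 # plain gradient descent
        loss, g = nll_and_grad(w, X, y)
        w -= 0.01 * g
    print(loss, w)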
Logistic Regression Loss function
Use the notation y_i ∈ {−1, 1}. Then

    P(y = 1 | x) = σ(f(x)) = 1 / (1 + e^{−f(x)})
    P(y = −1 | x) = 1 − σ(f(x)) = 1 / (1 + e^{+f(x)})

So in both cases

    P(y_i | x_i) = 1 / (1 + e^{−y_i f(x_i)})

Assuming independence, the likelihood is

    Π_{i=1}^{N} 1 / (1 + e^{−y_i f(x_i)})

and the negative log likelihood is

    Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})

which defines the loss function.

Logistic Regression Learning


Learning is formulated as the optimization problem

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})  +  λ||w||²

where the first term is the loss function and the second is the regularization.

• For correctly classified points, −y_i f(x_i) is negative, and log(1 + e^{−y_i f(x_i)}) is near zero

• For incorrectly classified points, −y_i f(x_i) is positive, and log(1 + e^{−y_i f(x_i)}) can be large

• Hence the optimization penalizes parameters which lead to such misclassifications
Comparison of SVM and LR cost functions

SVM:

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²

Logistic regression:

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²

Note:
• both approximate the 0–1 loss
• very similar asymptotic behaviour
• the main difference is smoothness, and the non-zero values outside the margin (seen by plotting each loss against y_i f(x_i))
• the SVM gives a sparse solution for the α_i

These losses are tabulated in the sketch below.
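As a quick illustration (not from the slides), the following tabulates the 0–1, hinge, and logistic losses as a function of the margin y_i f(x_i); the exponential loss used by AdaBoost (next section) is included for comparison.

    import numpy as np

    # margins m = y_i * f(x_i): positive means correctly classified
    m = np.linspace(-3.0, 3.0, 7)

    zero_one    = (m <= 0).astype(float)        # 0-1 loss
    hinge       = np.maximum(0.0, 1.0 - m)      # SVM hinge loss
    logistic    = np.log1p(np.exp(-m))          # LR loss log(1 + e^{-m})
    exponential = np.exp(-m)                    # AdaBoost exponential loss

    for row in zip(m, zero_one, hinge, logistic, exponential):
        print("m=%5.1f  0-1=%.0f  hinge=%.2f  log=%.3f  exp=%.3f" % row)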

AdaBoost
Overview

• AdaBoost is an algorithm for constructing a strong classifier out of a linear combination

      Σ_{t=1}^{T} α_t h_t(x)

  of simple weak classifiers h_t(x). It provides a method of choosing the weak classifiers and setting the weights α_t

Terminology

• weak classifier: h_t(x) ∈ {−1, 1}, for data vector x

• strong classifier: H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Example: combination of linear classifiers h_t(x) ∈ {−1, 1}

[Figure: weak classifier 1, h_1(x); weak classifier 2, h_2(x); weak classifier 3, h_3(x); and the resulting strong classifier H(x)]

    H(x) = sign(α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x))

• Note, this linear combination is not a simple majority vote (it would be if all the α_t were equal)

• Need to compute the α_t as well as selecting the weak classifiers
AdaBoost algorithm – building a strong classifier

Start with equal weights on each x_i, and a set of weak classifiers h_t(x).

For t = 1, …, T:

• Select the weak classifier with minimum weighted error

      ε_t = Σ_i ω_i [h_t(x_i) ≠ y_i]     where the ω_i are weights

• Set

      α_t = (1/2) ln((1 − ε_t) / ε_t)

• Reweight the examples (boosting) to give misclassified examples more weight:

      ω_{t+1,i} = ω_{t,i} e^{−α_t y_i h_t(x_i)}

• Add the weak classifier with weight α_t

The strong classifier is

    H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Example

[Figure: weak classifier 1; weights increased on misclassified points; weak classifier 2; weak classifier 3; the final classifier]

• start with equal weights on each data point x_i

      ε_j = Σ_i ω_i [h_j(x_i) ≠ y_i]

• the final classifier is a linear combination of the weak classifiers
The AdaBoost algorithm (Freund & Schapire, 1995)

• Given example data (x_1, y_1), …, (x_n, y_n), where y_i = −1, 1 for negative and positive examples respectively.

• Initialize the weights ω_{1,i} = 1/(2m), 1/(2l) for y_i = −1, 1 respectively, where m and l are the number of negatives and positives respectively.

• For t = 1, …, T:

  1. Normalize the weights,

         ω_{t,i} ← ω_{t,i} / Σ_{j=1}^{n} ω_{t,j}

     so that ω_t is a probability distribution.

  2. For each j, train a weak classifier h_j with error evaluated with respect to ω_{t,i}:

         ε_j = Σ_i ω_{t,i} [h_j(x_i) ≠ y_i]

  3. Choose the classifier h_t with the lowest error ε_t.

  4. Set

         α_t = (1/2) ln((1 − ε_t) / ε_t)

  5. Update the weights

         ω_{t+1,i} = ω_{t,i} e^{−α_t y_i h_t(x_i)}

• The final strong classifier is

      H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

A code sketch of the algorithm follows.
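A compact sketch of this algorithm, not from the slides, using axis-aligned decision stumps as the weak classifiers; for simplicity the weights are initialized uniformly (rather than 1/(2m), 1/(2l)), and all function names and data are illustrative.

    import numpy as np

    def train_stumps_adaboost(X, y, T):
        """AdaBoost with decision stumps. X: (N, d), y: (N,) labels in {-1, +1}.
        Returns a list of (feature, threshold, polarity, alpha) tuples."""
        N, d = X.shape
        w = np.full(N, 1.0 / N)                      # start with equal weights
        ensemble = []
        for _ in range(T):
            best = None
            # exhaustive search over stumps h(x) = polarity * sign(x[f] - thr)
            for f in range(d):
                for thr in np.unique(X[:, f]):
                    for polarity in (+1, -1):
                        pred = polarity * np.sign(X[:, f] - thr)
                        pred[pred == 0] = polarity
                        err = np.sum(w * (pred != y))        # weighted error
                        if best is None or err < best[0]:
                            best = (err, f, thr, polarity, pred)
            err, f, thr, polarity, pred = best
            err = np.clip(err, 1e-10, 1 - 1e-10)             # avoid division by zero
            alpha = 0.5 * np.log((1 - err) / err)
            w = w * np.exp(-alpha * y * pred)                # boost misclassified points
            w = w / w.sum()                                  # renormalize
            ensemble.append((f, thr, polarity, alpha))
        return ensemble

    def predict(ensemble, X):
        score = np.zeros(X.shape[0])
        for f, thr, polarity, alpha in ensemble:
            pred = polarity * np.sign(X[:, f] - thr)
            pred[pred == 0] = polarity
            score += alpha * pred
        return np.sign(score)                                # strong classifier H(x)

    # toy data
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    ens = train_stumps_adaboost(X, y, T=10)
    print(np.mean(predict(ens, X) == y))                     # training accuracy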

Why does it work?

The AdaBoost algorithm carries out a greedy optimization of a loss function.

AdaBoost (exponential loss):

    min_{α_i, h_i}  Σ_{i=1}^{N} e^{−y_i H(x_i)}

SVM (hinge loss):

    Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))

Logistic regression loss:

    Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)})

[Plot: these three losses as a function of y_i f(x_i)]
Sketch derivation – non-examinable
The objective function used by AdaBoost is

    J(H) = Σ_i e^{−y_i H(x_i)}

For a correctly classified point the penalty is exp(−|H|), and for an incorrectly classified point the penalty is exp(+|H|). The AdaBoost algorithm incrementally decreases the cost by adding simple functions to

    H(x) = Σ_t α_t h_t(x)

Suppose that we have a function B and we propose to add the function αh(x), where the scalar α is to be determined and h(x) is some function that takes values in +1 or −1 only. The new function is B(x) + αh(x) and the new cost is

    J(B + αh) = Σ_i e^{−y_i B(x_i)} e^{−α y_i h(x_i)}

Differentiating with respect to α and setting the result to zero gives

    e^{−α} Σ_{y_i = h(x_i)} e^{−y_i B(x_i)}  −  e^{+α} Σ_{y_i ≠ h(x_i)} e^{−y_i B(x_i)}  =  0

Rearranging, the optimal value of α is therefore determined to be

    α = (1/2) log [ Σ_{y_i = h(x_i)} e^{−y_i B(x_i)} / Σ_{y_i ≠ h(x_i)} e^{−y_i B(x_i)} ]

The classification error is defined as

    ε = Σ_i ω_i [h(x_i) ≠ y_i]

where

    ω_i = e^{−y_i B(x_i)} / Σ_j e^{−y_j B(x_j)}

Then it can be shown that

    α = (1/2) log((1 − ε) / ε)

The update from B to H therefore involves evaluating the weighted performance ε of the “weak” classifier h (with the weights ω_i given above).

If the current function B is B(x) = 0 then the weights will be uniform. This is a common starting point for the minimization. As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights by exp(−α y_i h(x_i)) and then normalizing. This gives the update formula

    ω_{t+1,i} = (1/Z_t) ω_{t,i} e^{−α_t y_i h_t(x_i)}

where Z_t is a normalizing factor.

Choosing h: the function h is not chosen arbitrarily, but is chosen to give a good performance (a low value of ε) on the training data weighted by ω.
Optimization

We have seen many cost functions, e.g.

SVM:

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²

Logistic regression:

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²

[Figure: a cost function with a local minimum and a global minimum]

• Do these have a unique solution?

• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).
Convex functions

Convex function examples

[Figure: examples of a convex and a non-convex function]

A non-negative sum of convex functions is convex.

Logistic regression:

    min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²    — convex

SVM:

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||²     — convex

Gradient (or Steepest) descent algorithms


To minimize a cost function C(w), use the iterative update

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η_t is the learning rate.

In our case the loss function is a sum over the training data. For example, for LR,

    min_{w ∈ R^d}  C(w) = Σ_{i=1}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²  =  Σ_{i=1}^{N} L(x_i, y_i; w) + λ||w||²

This means that one iterative update consists of a pass through the training data, with an update for each point:

    w_{t+1} ← w_t − ( Σ_{i=1}^{N} η_t ∇_w L(x_i, y_i; w_t) + 2λ w_t )

The advantage is that, for large amounts of data, this can be carried out point by point.
Gradient descent algorithm for LR
Minimizing L(w) using gradient descent gives [exercise] the per-point update rule

    w ← w − η (σ(w^T x_i) − y_i) x_i

where y_i ∈ {0, 1}.

Note:
• this is similar, but not identical, to the perceptron update rule

      w ← w − η sign(w^T x_i) x_i

• there is a unique solution for w
• in practice, more efficient Newton methods are used to minimize L
• there can be problems with w becoming infinite for linearly separable data

The sketch below implements this per-point update.
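A minimal sketch (not from the slides) of the per-point update with the λ||w||² regularizer included; the learning rate, data, and function names are illustrative, and in practice Newton-type methods would be preferred, as noted above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lr_sgd(X, y, lr=0.1, lam=0.01, epochs=50, seed=0):
        """Per-point (stochastic) gradient descent for regularized logistic
        regression with labels y in {0, 1}. Illustrative sketch only."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(N):                  # one pass through the data
                err = sigmoid(w @ X[i]) - y[i]            # sigma(w^T x_i) - y_i
                w -= lr * (err * X[i] + 2 * lam * w / N)  # loss gradient + regularizer
        return w

    # toy example: the bias is handled by appending a constant-1 feature
    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 2))
    X = np.hstack([X, np.ones((300, 1))])
    y = (X[:, 0] - X[:, 1] > 0).astype(float)
    w = lr_sgd(X, y)
    print(np.mean((sigmoid(X @ w) >= 0.5) == y))          # training accuracy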

Gradient descent algorithm for SVM


First, rewrite the optimization problem as an average:

    min_w  C(w) = (λ/2)||w||² + (1/N) Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))
                = (1/N) Σ_{i=1}^{N} [ (λ/2)||w||² + max(0, 1 − y_i f(x_i)) ]

(with λ = 2/(NC), up to an overall scale of the problem) and f(x) = w^T x + b.

Because the hinge loss is not differentiable, a sub-gradient is computed.

Sub-gradient for hinge loss

    L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)),    with f(x_i) = w^T x_i + b

    ∂L/∂w = −y_i x_i    if y_i f(x_i) < 1
    ∂L/∂w = 0           otherwise

[Plot: the hinge loss as a function of y_i f(x_i)]

Sub-gradient descent algorithm for SVM

    C(w) = (1/N) Σ_{i=1}^{N} [ (λ/2)||w||² + L(x_i, y_i; w) ]

The iterative update is

    w_{t+1} ← w_t − η ∇_w C(w_t)
            ← w_t − η (1/N) Σ_{i=1}^{N} ( λ w_t + ∇_w L(x_i, y_i; w_t) )

where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

    w_{t+1} ← (1 − ηλ) w_t + η y_i x_i    if y_i (w^T x_i + b) < 1
    w_{t+1} ← (1 − ηλ) w_t                otherwise

These updates are sketched in code below.
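A sketch of these updates, not from the slides, with the bias b updated alongside w (an assumption; the slides leave the bias update implicit); names and data are illustrative.

    import numpy as np

    def svm_subgradient(X, y, lam=0.01, lr=0.01, epochs=50, seed=0):
        """Per-point sub-gradient descent for the linear SVM objective
        (lambda/2)||w||^2 + (1/N) sum max(0, 1 - y_i(w^T x_i + b)),
        with labels y in {-1, +1}. Illustrative sketch only."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        b = 0.0
        for _ in range(epochs):
            for i in rng.permutation(N):
                margin = y[i] * (w @ X[i] + b)
                w = (1 - lr * lam) * w               # shrinkage from the regularizer
                if margin < 1:                       # inside the margin: hinge active
                    w += lr * y[i] * X[i]
                    b += lr * y[i]
        return w, b

    # toy example
    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 2))
    y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
    w, b = svm_subgradient(X, y)
    print(np.mean(np.sign(X @ w + b) == y))          # training accuracy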
Multi-class Classification

Multi-Class Classification – what we would like

Assign input vector x to one of K classes Ck

Goal: a decision rule that divides input space into K decision regions separated by decision boundaries
Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm
• For each test point x to be classified, find the K nearest samples in the training data
• Classify the point x according to the majority vote of their class labels

[Figure: e.g. K = 3]

• naturally applicable to the multi-class case (sketched in code below)
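A small sketch of the K-NN rule above, not from the slides, using Euclidean distance; the data and names are made up for illustration.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, X_test, K=3):
        """K nearest neighbour classification by majority vote."""
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
            nearest = np.argsort(dists)[:K]               # indices of the K nearest
            votes = Counter(y_train[nearest])             # majority vote over labels
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

    # toy 3-class example
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 30)
    print(knn_predict(X, y, np.array([[0.1, 0.2], [2.8, 0.1], [0.2, 2.9]]), K=3))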

Build from binary classifiers …

• Learn: K two-class “1 vs the rest” classifiers f_k(x)

[Figure: three classes C1, C2, C3, the classifiers 1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2, and a region marked “?”]
Build from binary classifiers …

• Learn: K two-class “1 vs the rest” classifiers f_k(x)

• Classification: choose the class with the most positive score

      max_k f_k(x)

[Figure: the three classes C1, C2, C3 and the classifiers 1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2, now labelled by max_k f_k(x)]

A one-vs-the-rest sketch in code follows.
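A sketch, not from the slides, of “one vs the rest” training and classification with a simple linear scorer trained by SVM-style sub-gradient updates as in the previous section; all names and data are illustrative.

    import numpy as np

    def train_linear(X, y, lr=0.01, lam=0.01, epochs=100, seed=0):
        """A linear scorer f(x) = w^T x + b trained with hinge-loss sub-gradient
        updates; labels y in {-1, +1}. Illustrative only."""
        rng = np.random.default_rng(seed)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                w = (1 - lr * lam) * w
                if y[i] * (w @ X[i] + b) < 1:
                    w += lr * y[i] * X[i]
                    b += lr * y[i]
        return w, b

    def one_vs_rest(X, y, classes):
        """Learn K 'one vs the rest' classifiers f_k(x)."""
        return {k: train_linear(X, np.where(y == k, 1, -1)) for k in classes}

    def predict(models, X):
        # choose the class with the most positive score f_k(x)
        classes = np.array(list(models.keys()))
        scores = np.column_stack([X @ w + b for (w, b) in models.values()])
        return classes[np.argmax(scores, axis=1)]

    # toy 3-class example
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(m, 0.6, size=(40, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 40)
    models = one_vs_rest(X, y, classes=[0, 1, 2])
    print(np.mean(predict(models, X) == y))              # training accuracy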

Application: handwritten digit recognition

• Feature vectors: each image is 28 × 28 pixels. Rearrange it as a 784-vector x

• Training: learn k = 10 two-class “1 vs the rest” SVM classifiers f_k(x)

• Classification: choose the class with the most positive score

      f(x) = max_k f_k(x)
Example

[Figure: hand-drawn digits and the classifications assigned to them]

Background reading and more

• Other multiple-class classifiers (not covered here):


• Neural networks
• Random forests

• Bishop, chapters 4.1 – 4.3 and 14.3

• Hastie et al, chapters 10.1 – 10.6

• More on web page:


http://www.robots.ox.ac.uk/~az/lectures/ml
