
Lecture 4: More classifiers and classes

C4B Machine Learning Hilary 2011 A. Zisserman

• Logistic regression
• Loss functions revisited

• Adaboost
• Loss functions revisited

• Optimization

• Multiple class classification

Logistic Regression
Overview

• Logistic regression is actually a classification method

• LR introduces an extra non-linearity over a linear classifier,
  f(x) = wᵀx + b, by using a logistic (or sigmoid) function, σ()

• The LR classifier is defined as

      σ(f(xi)) ≥ 0.5  →  yi = +1
      σ(f(xi)) < 0.5  →  yi = −1

  where σ(f(x)) = 1 / (1 + e^{−f(x)})

The logistic function or sigmoid function

   σ(z) = 1 / (1 + e^{−z})

[Figure: plot of σ(z) for z from −20 to 20, rising from 0 to 1]

• As z goes from −∞ to ∞, σ(z) goes from 0 to 1, a “squashing function”

• It has a “sigmoid” shape (i.e. S-like shape)

• σ(0) = 0.5, and if z = wᵀx + b then ||dσ(z)/dx||_{z=0} = (1/4)||w||
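For reference, a minimal NumPy sketch of σ() and the resulting LR decision rule (function names are mine, not the lecture's):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_classify(x, w, b):
    """LR decision rule: +1 if sigma(w^T x + b) >= 0.5, else -1.
    Note sigma(f) >= 0.5 is equivalent to f = w^T x + b >= 0."""
    return 1 if sigmoid(w @ x + b) >= 0.5 else -1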
Intuition – why use a sigmoid?

Here, choose binary classification to be represented by yi ∈ {0, 1},
rather than yi ∈ {1, −1}

Least squares fit

[Figure: σ(wx + b) fit to y compared with a straight-line wx + b fit to y]

• the fit of wx + b is dominated by the more distant points

• this causes misclassification

• instead, LR regresses the sigmoid to the class data

Similarly in 2D

[Figure: LR fit σ(w1x1 + w2x2 + b) vs the linear fit w1x1 + w2x2 + b]
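To make this intuition concrete, here is a small sketch (the toy data, optimizer choice and variable names are mine): the distant positive point drags the straight line's 0.5 crossing well past the class boundary, while a least-squares fit of σ(wx + b) is barely affected.

import numpy as np
from scipy.optimize import minimize

# toy 1-D data: negatives near the origin, positives at x >= 1 plus one distant point
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 1.0, 1.5, 2.0, 15.0])
y = np.array([ 0.0,  0.0,  0.0,  0.0, 0.0, 1.0, 1.0, 1.0,  1.0])   # yi in {0,1}

# straight-line least-squares fit of w*x + b to y
w_lin, b_lin = np.polyfit(x, y, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# least-squares fit of sigma(w*x + b) to y (as on the slide)
def sq_err(params):
    w, b = params
    return np.sum((y - sigmoid(w * x + b)) ** 2)

w_sig, b_sig = minimize(sq_err, x0=[1.0, 0.0]).x

# decision boundaries: where each fit crosses 0.5
print("linear fit crosses 0.5 at x =", (0.5 - b_lin) / w_lin)   # pulled right by x = 15
print("sigmoid fit crosses 0.5 at x =", -b_sig / w_sig)         # stays near the class boundary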


Learning

In logistic regression we fit a sigmoid function to the data { xi, yi }
by minimizing the classification errors

   yi − σ(wᵀxi)

[Figure: sigmoid curve fitted to class data with yi ∈ {0, 1}]

Margin property

A sigmoid favours a larger margin cf. a step classifier

[Figures: a sigmoid fit and a step-classifier fit to the same data]
Probabilistic interpretation

• Think of σ(f(x)) as the posterior probability that y = 1,
  i.e. P(y = 1|x) = σ(f(x))

• Hence, if σ(f(x)) > 0.5 then class y = 1 is selected

• Then, after a rearrangement,

   f(x) = log [ P(y = 1|x) / (1 − P(y = 1|x)) ] = log [ P(y = 1|x) / P(y = 0|x) ]

  which is the log odds ratio
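The rearrangement follows directly from the definition of σ:

P(y=1|x)\,\bigl(1 + e^{-f(x)}\bigr) = 1
\;\Rightarrow\;
e^{-f(x)} = \frac{1 - P(y=1|x)}{P(y=1|x)}
\;\Rightarrow\;
f(x) = \log\frac{P(y=1|x)}{1 - P(y=1|x)}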

Maximum Likelihood Estimation

Assume

   p(y = 1|x; w) = σ(wᵀx)
   p(y = 0|x; w) = 1 − σ(wᵀx)

and write this more compactly as

   p(y|x; w) = σ(wᵀx)^y (1 − σ(wᵀx))^{1−y}

Then the likelihood (assuming data independence) is

   p(y|x; w) ∼ Π_{i=1}^{N} σ(wᵀxi)^{yi} (1 − σ(wᵀxi))^{1−yi}

and the negative log likelihood is

   L(w) = − Σ_{i=1}^{N} [ yi log σ(wᵀxi) + (1 − yi) log(1 − σ(wᵀxi)) ]
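A minimal NumPy sketch of this negative log likelihood, assuming yi ∈ {0, 1} and f(x) = wᵀx (the function names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log likelihood for labels y in {0, 1}.
    X is N x d (a bias can be handled by appending a constant-1 feature)."""
    p = sigmoid(X @ w)          # sigma(w^T x_i) for every i
    eps = 1e-12                 # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))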
Logistic Regression Loss function

Use the notation yi ∈ {−1, 1}. Then

   P(y = 1|x)  = σ(f(x))      = 1 / (1 + e^{−f(x)})
   P(y = −1|x) = 1 − σ(f(x))  = 1 / (1 + e^{+f(x)})

So in both cases

   P(yi|xi) = 1 / (1 + e^{−yi f(xi)})

Assuming independence, the likelihood is

   Π_{i=1}^{N} 1 / (1 + e^{−yi f(xi)})

and the negative log likelihood is

   Σ_{i=1}^{N} log(1 + e^{−yi f(xi)})

which defines the loss function.

Logistic Regression Learning

Learning is formulated as the optimization problem

   min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−yi f(xi)}) + λ||w||²

   (the sum is the loss function, λ||w||² is the regularization)

• For correctly classified points −yi f(xi) is negative, and
  log(1 + e^{−yi f(xi)}) is near zero

• For incorrectly classified points −yi f(xi) is positive, and
  log(1 + e^{−yi f(xi)}) can be large

• Hence the optimization penalizes parameters which lead to such
  misclassifications
Comparison of SVM and LR cost functions

SVM:

   min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − yi f(xi)) + ||w||²

Logistic regression:

   min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−yi f(xi)}) + λ||w||²

[Figure: the 0-1, hinge and logistic losses plotted against yi f(xi)]

Note:
• both approximate the 0-1 loss
• very similar asymptotic behaviour
• the main differences are smoothness, and that the logistic loss is
  non-zero outside the margin
• SVM gives a sparse solution for the αi
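To see the comparison numerically, a small sketch evaluating the 0-1, hinge and logistic losses on a grid of margin values yi f(xi) (variable names are mine):

import numpy as np

margins = np.linspace(-3, 3, 13)           # values of y_i * f(x_i)

zero_one = (margins < 0).astype(float)     # 0-1 loss
hinge    = np.maximum(0.0, 1.0 - margins)  # SVM hinge loss
logistic = np.log1p(np.exp(-margins))      # LR loss, log(1 + e^{-y f(x)})

for m, z, h, l in zip(margins, zero_one, hinge, logistic):
    print(f"y*f(x) = {m:5.2f}   0-1: {z:.0f}   hinge: {h:.2f}   logistic: {l:.3f}")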

AdaBoost
Overview

• AdaBoost is an algorithm for constructing a strong classifier out of a
  linear combination

   Σ_{t=1}^{T} αt ht(x)

  of simple weak classifiers ht(x). It provides a method of choosing the
  weak classifiers and setting the weights αt

Terminology

• weak classifier ht(x) ∈ {−1, 1}, for data vector x

• strong classifier H(x) = sign( Σ_{t=1}^{T} αt ht(x) )

Example: combination of linear classifiers ht(x) ∈ {−1, 1}

[Figure: weak classifier 1, weak classifier 2, weak classifier 3 (h1(x), h2(x),
h3(x)) and the resulting strong classifier H(x)]

   H(x) = sign( α1 h1(x) + α2 h2(x) + α3 h3(x) )

• Note, this linear combination is not a simple majority vote
  (it would be if all the αt were equal)

• Need to compute the αt as well as selecting the weak classifiers
AdaBoost algorithm – building a strong classifier

Start with equal weights on each xi, and a set of weak classifiers ht(x)

For t = 1, …, T

• Select the weak classifier with minimum error

   εt = Σ_i ωi [ht(xi) ≠ yi]     where the ωi are weights

• Set

   αt = (1/2) ln( (1 − εt) / εt )

• Reweight the examples (boosting) to give misclassified examples more weight

   ω_{t+1,i} = ω_{t,i} e^{−αt yi ht(xi)}

• Add the weak classifier with weight αt

   H(x) = sign( Σ_{t=1}^{T} αt ht(x) )

Example

[Figure: boosting rounds on a 2-D toy data set]

• start with equal weights on each data point, and select weak classifier 1;
  the error of a candidate classifier hj is

   εj = Σ_i ωi [hj(xi) ≠ yi]

• the weights of misclassified points are increased before weak classifier 2
  is selected, and again before weak classifier 3

• the final classifier is a linear combination of the weak classifiers
The AdaBoost algorithm (Freund & Schapire 1995)

• Given example data (x1, y1), . . . , (xn, yn), where yi = −1, 1 for negative and
  positive examples respectively.

• Initialize the weights ω_{1,i} = 1/(2m), 1/(2l) for yi = −1, 1 respectively,
  where m and l are the number of negatives and positives respectively.

• For t = 1, . . . , T

  1. Normalize the weights,

        ω_{t,i} ← ω_{t,i} / Σ_{j=1}^{n} ω_{t,j}

     so that ω_{t,i} is a probability distribution.

  2. For each j, train a weak classifier hj with error evaluated with respect
     to ω_{t,i},

        εj = Σ_i ω_{t,i} [hj(xi) ≠ yi]

  3. Choose the classifier, ht, with the lowest error εt.

  4. Set αt as

        αt = (1/2) ln( (1 − εt) / εt )

  5. Update the weights

        ω_{t+1,i} = ω_{t,i} e^{−αt yi ht(xi)}

• The final strong classifier is

     H(x) = sign( Σ_{t=1}^{T} αt ht(x) )
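A compact sketch of this loop using decision stumps (single-feature thresholds) as the weak classifiers; the stump search and all function names are my own illustration, not part of the lecture:

import numpy as np

def train_adaboost(X, y, T):
    """X: N x d array, y: labels in {-1, +1}. Returns a list of (alpha, stump)."""
    N, d = X.shape
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    w = np.where(y == 1, 1.0 / (2 * n_pos), 1.0 / (2 * n_neg))   # initial weights
    ensemble = []
    for t in range(T):
        w = w / w.sum()                                   # 1. normalize the weights
        best = None
        for j in range(d):                                # 2. search the weak classifiers
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = sign * np.where(X[:, j] >= thresh, 1, -1)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, (j, thresh, sign), pred)
        eps, stump, pred = best                           # 3. lowest weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))   # 4. classifier weight
        w = w * np.exp(-alpha * y * pred)                 # 5. reweight the examples
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    score = np.zeros(len(X))
    for alpha, (j, thresh, sign) in ensemble:
        score += alpha * sign * np.where(X[:, j] >= thresh, 1, -1)
    return np.sign(score)                                 # strong classifier H(x)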

Why does it work?

The AdaBoost algorithm carries out a greedy optimization of a loss function

AdaBoost:

   min_{αi, hi}  Σ_{i=1}^{N} e^{−yi H(xi)}

SVM loss function:

   Σ_{i=1}^{N} max(0, 1 − yi f(xi))

Logistic regression loss function:

   Σ_{i=1}^{N} log(1 + e^{−yi f(xi)})

[Figure: exponential, hinge and logistic losses plotted against yi f(xi)]
Sketch derivation – non-examinable

The objective function used by AdaBoost is

   J(H) = Σ_i e^{−yi H(xi)}

For a correctly classified point the penalty is exp(−|H|) and for an incorrectly
classified point the penalty is exp(+|H|). The AdaBoost algorithm incrementally
decreases the cost by adding simple functions to

   H(x) = Σ_t αt ht(x)

Suppose that we have a function B and we propose to add the function αh(x), where
the scalar α is to be determined and h(x) is some function that takes values in
+1 or −1 only. The new function is B(x) + αh(x) and the new cost is

   J(B + αh) = Σ_i e^{−yi B(xi)} e^{−α yi h(xi)}

Differentiating with respect to α and setting the result to zero gives

   e^{−α} Σ_{yi = h(xi)} e^{−yi B(xi)}  −  e^{+α} Σ_{yi ≠ h(xi)} e^{−yi B(xi)}  =  0

Rearranging, the optimal value of α is therefore determined to be

   α = (1/2) log [ Σ_{yi = h(xi)} e^{−yi B(xi)}  /  Σ_{yi ≠ h(xi)} e^{−yi B(xi)} ]

The classification error is defined as

   ε = Σ_i ωi [h(xi) ≠ yi]

where

   ωi = e^{−yi B(xi)} / Σ_j e^{−yj B(xj)}

Then it can be shown that

   α = (1/2) log( (1 − ε) / ε )
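The step from the previous expression for α to this one is short; writing Z = Σ_j e^{−yj B(xj)},

\sum_{y_i = h(x_i)} e^{-y_i B(x_i)} = Z \sum_{y_i = h(x_i)} \omega_i = Z(1-\epsilon),
\qquad
\sum_{y_i \neq h(x_i)} e^{-y_i B(x_i)} = Z \sum_{y_i \neq h(x_i)} \omega_i = Z\epsilon,

so Z cancels in the ratio, leaving α = (1/2) log((1 − ε)/ε).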
The update from B to H therefore involves evaluating the weighted performance ε
of the “weak” classifier h (with the weights ωi given above).

If the current function B is B(x) = 0 then the weights will be uniform. This is a
common starting point for the minimization. As a numerical convenience, note that
at the next round of boosting the required weights are obtained by multiplying the
old weights by exp(−α yi h(xi)) and then normalizing. This gives the update formula

   ω_{t+1,i} = (1/Zt) ω_{t,i} e^{−αt yi ht(xi)}

where Zt is a normalizing factor.

Choosing h   The function h is not chosen arbitrarily but is chosen to give a good
performance (a low value of ε) on the training data weighted by ω.
Optimization

We have seen many cost functions, e.g.

SVM:

   min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − yi f(xi)) + ||w||²

Logistic regression:

   min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−yi f(xi)}) + λ||w||²

[Figure: a cost surface with a local minimum and a global minimum]

• Do these have a unique solution?

• Does the solution depend on the starting point of an iterative
  optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is globally optimal
(provided the optimization is over a convex set, which it is in our case)
Convex functions

[Figure: examples of convex and non-convex functions]

A non-negative sum of convex functions is convex

Logistic regression:

   min_{w ∈ R^d}  Σ_{i=1}^{N} log(1 + e^{−yi f(xi)}) + λ||w||²     convex

SVM:

   min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − yi f(xi)) + ||w||²      convex

Gradient (or Steepest) descent algorithms

To minimize a cost function C(w) use the iterative update

   w_{t+1} ← w_t − ηt ∇w C(w_t)

where η is the learning rate.

In our case the loss function is a sum over the training data. For example,
for LR

   min_{w ∈ R^d}  C(w) = Σ_{i=1}^{N} log(1 + e^{−yi f(xi)}) + λ||w||²
                       = Σ_{i=1}^{N} L(xi, yi; w) + λ||w||²

This means that one iterative update consists of a pass through the training
data with an update for each point:

   w_{t+1} ← w_t − ( Σ_{i=1}^{N} ηt ∇w L(xi, yi; w_t) + 2λ w_t )

The advantage is that for large amounts of data, this can be carried out
point by point.
Gradient descent algorithm for LR

Minimizing L(w) using gradient descent gives the update rule [exercise]

   w ← w + η (yi − σ(wᵀxi)) xi

where yi ∈ {0, 1}

Note:
• this is similar, but not identical, to the perceptron update rule

   w ← w − η sign(wᵀxi) xi

• there is a unique solution for w
• in practice more efficient Newton methods are used to minimize L
• there can be problems with w becoming infinite for linearly separable data
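A minimal sketch of this point-by-point update (the learning rate, number of epochs and the omission of the λ||w||² term are my own simplifications):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr_sgd(X, y, eta=0.1, epochs=100):
    """Point-by-point gradient descent for LR with labels y in {0, 1}.
    X is N x d; a bias term can be added by appending a column of ones."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(N):
            # the per-point gradient of the negative log likelihood is
            # (sigma(w^T x_i) - y_i) x_i, so descent adds eta*(y_i - sigma)*x_i
            w += eta * (y[i] - sigmoid(w @ X[i])) * X[i]
    return w

# prediction: classify as 1 when sigma(w^T x) >= 0.5, i.e. when w^T x >= 0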

Gradient descent algorithm for SVM

First, rewrite the optimization problem as an average

   min_w  C(w) = (λ/2) ||w||² + (1/N) Σ_{i=1}^{N} max(0, 1 − yi f(xi))

              = (1/N) Σ_{i=1}^{N} [ (λ/2) ||w||² + max(0, 1 − yi f(xi)) ]

(with λ = 2/(N C), up to an overall scale of the problem) and

   f(x) = wᵀx + b

Because the hinge loss is not differentiable, a sub-gradient is computed
Sub-gradient for hinge loss

   L(xi, yi; w) = max(0, 1 − yi f(xi)),     f(xi) = wᵀxi + b

   ∂L/∂w = −yi xi     if yi f(xi) < 1

   ∂L/∂w = 0          otherwise

[Figure: hinge loss plotted against yi f(xi), showing the two sub-gradient regions]

Sub-gradient descent algorithm for SVM

   C(w) = (1/N) Σ_{i=1}^{N} [ (λ/2) ||w||² + L(xi, yi; w) ]

The iterative update is

   w_{t+1} ← w_t − η ∇w C(w_t)

           ← w_t − η (1/N) Σ_{i=1}^{N} ( λ w_t + ∇w L(xi, yi; w_t) )

where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

   w_{t+1} ← (1 − ηλ) w_t + η yi xi     if yi (wᵀxi + b) < 1

   w_{t+1} ← (1 − ηλ) w_t               otherwise
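A sketch of this per-point sub-gradient update; the constants and the treatment of the bias b (kept separate and unregularized) are my own choices:

import numpy as np

def fit_svm_subgradient(X, y, lam=0.01, eta=0.01, epochs=50):
    """Per-point sub-gradient descent for the linear SVM.
    X is N x d, y has labels in {-1, +1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(N):
            if y[i] * (w @ X[i] + b) < 1:           # inside the margin: hinge active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b = b + eta * y[i]
            else:                                    # outside the margin
                w = (1 - eta * lam) * w
    return w, b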
Multi-class Classification

Multi-Class Classification – what we would like

Assign an input vector x to one of K classes Ck

Goal: a decision rule that divides the input space into K decision regions
separated by decision boundaries
Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm
• For each test point, x, to be classified, find the K nearest samples in
  the training data
• Classify the point, x, according to the majority vote of their class labels

e.g. K = 3

• naturally applicable to the multi-class case
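A minimal NumPy sketch of this rule, assuming a Euclidean distance (the lecture does not fix a metric):

import numpy as np

def knn_classify(x, X_train, y_train, K=3):
    """Classify a single test point x by majority vote of its K nearest
    training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)       # distance to every sample
    nearest = np.argsort(dists)[:K]                   # indices of the K nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote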

Build from binary classifiers …

• Learn: K two-class 1 vs the rest classifiers fk(x)

[Figure: three classes C1, C2, C3 and the classifiers 1 vs 2 & 3, 2 vs 1 & 3,
3 vs 1 & 2; some regions are claimed ambiguously (marked “?”)]

Build from binary classifiers …

• Learn: K two-class 1 vs the rest classifiers fk(x)

• Classification: choose the class with the most positive score

   max_k fk(x)

[Figure: the same three classes, with each region labelled by the classifier
giving the maximum score]
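A sketch of the 1-vs-the-rest construction on top of any binary trainer that returns a weight vector scoring fk(x) = wᵀx; here the fit_lr_sgd sketch from earlier is reused purely as an illustration:

import numpy as np

def train_one_vs_rest(X, y, train_binary):
    """y contains class labels 0..K-1. train_binary(X, y01) must return a
    weight vector w scoring f_k(x) = w^T x (e.g. fit_lr_sgd from above)."""
    classes = np.unique(y)
    return {k: train_binary(X, (y == k).astype(float)) for k in classes}

def classify(x, models):
    # choose the class whose 1-vs-the-rest classifier gives the most positive score
    scores = {k: w @ x for k, w in models.items()}
    return max(scores, key=scores.get)

# e.g. models = train_one_vs_rest(X, y, fit_lr_sgd); label = classify(x, models)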

Application: hand written digit recognition

• Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x

• Training: learn k = 10 two-class 1 vs the rest SVM classifiers fk(x)

• Classification: choose the class with the most positive score

   f(x) = max_k fk(x)
Example

[Figure: hand drawn digits (classes 1–9, 0) with the classification assigned
to each by the ten 1-vs-the-rest classifiers]
Background reading and more

• Other multiple-class classifiers (not covered here):


• Neural networks
• Random forests

• Bishop, chapters 4.1 – 4.3 and 14.3

• Hastie et al, chapters 10.1 – 10.6

• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml
