Lecture 4: Logistic Regression
• Logistic regression
• Loss functions revisited
• AdaBoost
• Loss functions revisited
• Optimization
Logistic Regression
Overview
  σ(z) = 1 / (1 + e^{−z})

[Figure: the sigmoid σ(z) plotted over z from −20 to 20]
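As a quick concrete check, a minimal NumPy sketch of the sigmoid (the function name and test values are my own, not from the lecture):

import numpy as np

def sigmoid(z):
    # Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z)), squashing z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-20.0, 0.0, 20.0])))  # approx. [0.0, 0.5, 1.0]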
[Figure: a linear function w x + b fit to the binary targets y]
[Figure: the sigmoid σ(w x + b) fit to the same binary targets]
Similarly in 2D
[Figure: 2D comparison of the logistic regression (LR) fit and the linear fit]
  y_i − σ(w^T x_i)

[Figure: sigmoid curve, −20 ≤ z ≤ 20]
Margin property
A sigmoid favours a larger margin compared to a step classifier
[Figures: a step classifier and a sigmoid fit to the same 1D data; the sigmoid fit places the decision boundary with a larger margin]
Probabilistic interpretation
The sigmoid output can be read as a posterior probability, P(y = 1 | x) = σ(w^T x).
Note (comparing the logistic regression loss with the SVM hinge loss, both plotted against y_i f(x_i)):
• both approximate the 0-1 loss
• very similar asymptotic behaviour
• the main difference is smoothness, and the non-zero loss outside the margin
• the SVM gives a sparse solution for the α_i
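To make the comparison concrete, here is a small sketch (my own code, not from the lecture) evaluating the 0-1, hinge, and logistic losses as functions of the margin y_i f(x_i):

import numpy as np

# Losses as a function of the signed margin m = y * f(x)
m = np.linspace(-3.0, 3.0, 7)

zero_one = (m <= 0).astype(float)     # 0-1 loss: 1 if misclassified, else 0
hinge    = np.maximum(0.0, 1.0 - m)   # SVM hinge loss: exactly zero once m >= 1
logistic = np.log1p(np.exp(-m))       # LR loss: smooth, non-zero for all finite m

for mi, z, h, l in zip(m, zero_one, hinge, logistic):
    print(f"m={mi:+.1f}  0-1={z:.0f}  hinge={h:.2f}  logistic={l:.3f}")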
AdaBoost
Overview
Terminology
• Reweight the data points: ω_{t+1,i} = ω_{t,i} e^{−α_t y_i h_t(x_i)}
• Add a weak classifier h_t with weight α_t; the strong (final) classifier is

  H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
Example
[Figure: weak classifier 1 — start with equal weights ω_i on each data point i]

  ε_j = Σ_i ω_i [h_j(x_i) ≠ y_i]

[Figure: weights increased on the misclassified points; weak classifier 2]
[Figure: weak classifier 3]
The final classifier is a linear combination of the weak classifiers.
The AdaBoost algorithm (Freund & Schapire 1995)
• Given example data (x_1, y_1), ..., (x_n, y_n), where y_i = −1, 1 for negative and positive examples respectively.
• Initialize weights ω_{1,i} = 1/(2m), 1/(2l) for y_i = −1, 1 respectively, where m and l are the number of negatives and positives respectively.
• For t = 1, ..., T
  1. Normalize the weights,

       ω_{t,i} ← ω_{t,i} / Σ_{j=1}^{n} ω_{t,j}

     so that ω_{t,i} is a probability distribution.
  2. For each j, train a weak classifier h_j with error evaluated with respect to ω_{t,i},

       ε_j = Σ_i ω_{t,i} [h_j(x_i) ≠ y_i]
AdaBoost minimizes the exponential cost of the strong classifier H,

  min_{α_i, h_i} Σ_{i=1}^{N} e^{−y_i H(x_i)}
For a correctly classified point the penalty is exp(−|H|) and for an incorrectly classified
point the penalty is exp(+|H|). The AdaBoost algorithm incrementally decreases the
cost by adding simple functions to
  H(x) = Σ_t α_t h_t(x)
Suppose that we have a function B and we propose to add the function αh(x) where
the scalar α is to be determined and h(x) is some function that takes values in +1
or −1 only. The new function is B(x) + αh(x) and the new cost is
  J(B + αh) = Σ_i e^{−y_i B(x_i)} e^{−α y_i h(x_i)} ∝ Σ_i ω_i e^{−α y_i h(x_i)}

where

  ω_i = e^{−y_i B(x_i)} / Σ_j e^{−y_j B(x_j)}
If the current function B is B(x) = 0 then the weights will be uniform. This is a
common starting point for the minimization. As a numerical convenience, note that
at the next round of boosting the required weights are obtained by multiplying the
old weights with exp(−αyih(xi)) and then normalizing. This gives the update formula
  ω_{t+1,i} = (1/Z_t) ω_{t,i} e^{−α_t y_i h_t(x_i)}

where Z_t is a normalizing factor.
Choosing h
The function h is not chosen arbitrarily, but is chosen to give good performance (a low value of ε) on the training data weighted by ω.
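For concreteness, here is a hedged sketch of the full boosting loop with axis-aligned decision stumps as the weak learners. The algorithm listing above stops after the weighted-error step, so the classifier weight below uses the standard AdaBoost choice α_t = ½ ln((1 − ε_t)/ε_t); all function and variable names are my own.

import numpy as np

def train_adaboost(X, y, T=10):
    # AdaBoost with axis-aligned decision stumps.
    # X: (n, d) features; y: (n,) labels in {-1, +1}.
    d = X.shape[1]
    m, l = np.sum(y == -1), np.sum(y == +1)               # numbers of negatives / positives
    w = np.where(y == -1, 1.0 / (2 * m), 1.0 / (2 * l))   # initial weights
    ensemble = []                                         # list of (alpha, feature, threshold, polarity)

    for t in range(T):
        w = w / w.sum()                                   # 1. normalize the weights

        # 2. pick the stump h(x) = polarity * sign(x[feature] - threshold)
        #    with the lowest weighted error eps = sum_i w_i [h(x_i) != y_i]
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for polarity in (+1.0, -1.0):
                    pred = polarity * np.where(X[:, j] > thr, 1.0, -1.0)
                    eps = np.sum(w[pred != y])
                    if best is None or eps < best[0]:
                        best = (eps, j, thr, polarity)
        eps, j, thr, polarity = best
        eps = np.clip(eps, 1e-10, 1.0 - 1e-10)            # avoid log(0)

        alpha = 0.5 * np.log((1.0 - eps) / eps)           # standard AdaBoost classifier weight
        ensemble.append((alpha, j, thr, polarity))

        # reweight: w_{t+1,i} proportional to w_{t,i} * exp(-alpha * y_i * h_t(x_i))
        pred = polarity * np.where(X[:, j] > thr, 1.0, -1.0)
        w = w * np.exp(-alpha * y * pred)

    return ensemble

def predict_adaboost(ensemble, X):
    # Strong classifier H(x) = sign(sum_t alpha_t * h_t(x))
    F = np.zeros(X.shape[0])
    for alpha, j, thr, polarity in ensemble:
        F += alpha * polarity * np.where(X[:, j] > thr, 1.0, -1.0)
    return np.sign(F)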
Optimization
SVM:

  min_{w ∈ R^d}  C Σ_{i}^{N} max(0, 1 − y_i f(x_i)) + ||w||²
Logistic regression:

  min_{w ∈ R^d}  Σ_{i}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²

[Figure: a cost with a local minimum and a global minimum]
If the cost function is convex, then a locally optimal point is globally optimal
(provided the optimization is over a convex set, which it is in our case)
Convex functions
SVM:

  min_{w ∈ R^d}  C Σ_{i}^{N} max(0, 1 − y_i f(x_i)) + ||w||²     (convex)
In our case the loss function is a sum over the training data. For example, for LR

  min_{w ∈ R^d}  C(w) = Σ_{i}^{N} log(1 + e^{−y_i f(x_i)}) + λ||w||²  =  Σ_{i}^{N} L(x_i, y_i; w) + λ||w||²
This means that one iterative update consists of a pass through the
training data with an update for each point
  w_{t+1} ← w_t − η_t ( Σ_{i}^{N} ∇_w L(x_i, y_i; w_t) + 2λ w_t )
The advantage is that for large amounts of data, this can be carried
out point by point.
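A minimal sketch of one such full-batch update for the LR cost above, with f(x) = w^T x and y_i ∈ {−1, +1}; the step size and names are my own choices:

import numpy as np

def lr_batch_step(w, X, y, eta=0.1, lam=0.01):
    # One gradient step on C(w) = sum_i log(1 + exp(-y_i w.x_i)) + lam * ||w||^2
    # X: (N, d), y: (N,) in {-1, +1}, w: (d,)
    margins = y * (X @ w)                                  # y_i f(x_i)
    grad_loss = -(X.T @ (y / (1.0 + np.exp(margins))))     # sum_i grad_w log(1 + e^{-y_i w.x_i})
    return w - eta * (grad_loss + 2.0 * lam * w)

# usage: w = np.zeros(X.shape[1]); then repeat w = lr_batch_step(w, X, y) for a fixed number of passes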
Gradient descent algorithm for LR
Minimizing L(w) using gradient descent gives [exercise] the per-point update rule

  w ← w − η (σ(w^T x_i) − y_i) x_i

where y_i ∈ {0, 1}.
Note:
• this is similar, but not identical, to the perceptron update rule (for a misclassified point x_i):
    w ← w − η sign(w^T x_i) x_i
• there is a unique solution for w
• in practice more efficient Newton methods are used to minimize L
• there can be problems with w becoming infinite for linearly separable data
[Figure: the LR loss plotted against y_i f(x_i); in the steep region ∂L/∂w ≈ −y_i x_i, while in the flat region ∂L/∂w ≈ 0]
  C(w) = (1/N) Σ_{i}^{N} ( (λ/2) ||w||² + L(x_i, y_i; w) )

Then each iteration t involves cycling through the training data with the updates:
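The per-point updates themselves are not reproduced above; a plausible sketch, assuming plain stochastic gradient descent on this C(w) with the cross-entropy form of L and labels y_i ∈ {0, 1} (as in the update rule earlier), is:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(w, X, y01, eta=0.1, lam=0.01):
    # One pass (epoch) through the data for
    # C(w) = (1/N) sum_i ( (lam/2) ||w||^2 + L(x_i, y_i; w) ), with y_i in {0, 1}
    for xi, yi in zip(X, y01):
        grad_i = (sigmoid(w @ xi) - yi) * xi + lam * w   # gradient of the i-th summand
        w = w - eta * grad_i
    return w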
K nearest neighbour (K-NN) classification algorithm
• For each test point, x, to be classified, find the K nearest samples in the training data
• Classify the point, x, according to the majority vote of their class labels
e.g. K = 3
• naturally applicable to the multi-class case
[Figure: three classes C1, C2, C3 with a query point '?']
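A compact sketch of this procedure, assuming a brute-force search with Euclidean distances (the names are my own):

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    # Majority vote among the K nearest training samples (Euclidean metric)
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:K]               # indices of the K closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]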
Build from binary classifiers …
[Figure: one-vs-rest binary classifiers '1 vs 2 & 3', '2 vs 1 & 3', '3 vs 1 & 2' for classes C1, C2, C3; the class is assigned by max_k f_k(x)]
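A sketch of the one-vs-rest construction, assuming each binary classifier k produces a real-valued score f_k(x) = w_k^T x + b_k (e.g. from a linear SVM or LR trained on "class k vs the rest"); names are my own:

import numpy as np

def one_vs_rest_predict(W, b, X):
    # W: (num_classes, d) one-vs-rest weight vectors, b: (num_classes,) biases
    # Row k of W scores "class k vs the rest"; assign the class with the largest score
    scores = X @ W.T + b              # (N, num_classes) matrix of f_k(x) values
    return np.argmax(scores, axis=1)  # max_k f_k(x)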
[Figure: hand-drawn digits (1-9, 0) and their classification results]