Lecture 2
Anna Choromanska
[email protected]
http://cims.nyu.edu/~achoroma/
Department of Electrical and Computer Engineering
New York University Tandon School of Engineering
Lecture outline
Perceptron
Exponentiated gradient (EG)
Expert advice:
Static-expert
Fixed-share (α)
Learn-α
Online multi-class classification
Bibliography
C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st ed. 2006, corr. 2nd printing, Springer.
T. Jebara, Course notes, Machine Learning, Topic 4.
M. Warmuth and J. Kivinen, Exponentiated Gradient versus Gradient Descent for Linear Predictors, Information and Computation, 132(1):1-63, 1995.
C. Monteleoni, Learning with Online Constraints: Shifting Concepts and Active Learning, PhD Thesis, MIT, 2006.
A. Choromanska and C. Monteleoni, Online Clustering with Experts, AISTATS, 2012.
M. Herbster and M. K. Warmuth, Tracking the Best Expert, Machine Learning, 32:151-178, 1998.
Y. Jernite, A. Choromanska, D. Sontag, and Y. LeCun, Simultaneous Learning of Trees and Representations for Extreme Classification, with Application to Language Modeling, CoRR, abs/1610.04658, 2016.
A. Choromanska and J. Langford, Logarithmic Time Online Multiclass Prediction, NIPS, 2015.
A. Choromanska, K. Choromanski, and M. Bojarski, On the Boosting Ability of Top-Down Decision Tree Learning Algorithm for Multiclass Classification, CoRR, abs/1605.05223, 2016.
There may be more than one separating solution; the perceptron finds one of them. Which one? It depends on the parameter initialization and the order in which the data are processed.
Perceptron
Proof.
Let $w^0 = 0$. Make two necessary assumptions ($w^*$ denotes the optimum):
all data lie inside a sphere of radius $r$, i.e. $\forall i \;\; \|x_i\| \le r$
the data are separable with a non-zero margin $\gamma$, i.e. $\forall i \;\; y_i (w^*)^\top x_i \ge \gamma$
Note that
$$(w^*)^\top w^t = (w^*)^\top w^{t-1} + y_i (w^*)^\top x_i \ge (w^*)^\top w^{t-1} + \gamma.$$
After $t$ such updates we get $(w^*)^\top w^t \ge t\gamma$.
Note that
$$\|w^t\|^2 = \|w^{t-1} + y_i x_i\|^2 = \|w^{t-1}\|^2 + \underbrace{2 y_i (w^{t-1})^\top x_i}_{\text{negative, since only mistakes cause updates}} + \|x_i\|^2 \le \|w^{t-1}\|^2 + \|x_i\|^2 \le \|w^{t-1}\|^2 + r^2 \le t r^2.$$
The angle between the optimal and the current solution satisfies
$$1 \ge \cos(w^*, w^t) = \frac{(w^*)^\top w^t}{\|w^t\|\,\|w^*\|} \ge \frac{t\gamma}{r\sqrt{t}\,\|w^*\|}, \quad \text{thus} \quad t \le \frac{r^2 \|w^*\|^2}{\gamma^2}.$$
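To make the quantities in the proof concrete, here is a minimal sketch of the mistake-driven perceptron update being analyzed, assuming labels in {-1, +1} and the zero initialization $w^0 = 0$ used above; the function name and the epoch loop are illustrative, not from the lecture.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven perceptron: update w only when sign(w.x) disagrees with y."""
    n, d = X.shape
    w = np.zeros(d)                               # w^0 = 0, as in the proof
    mistakes = 0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:       # mistake (or on the boundary)
                w = w + y[i] * X[i]               # w^t = w^{t-1} + y_i x_i
                mistakes += 1
    return w, mistakes
```

Under the two assumptions above, the total number of updates made by this loop is at most $r^2\|w^*\|^2/\gamma^2$, no matter how many passes over the data are made.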
Perceptron
Figure: Figure from Tony Jebara’s class notes. Linear regression (left) (classification
error = 2, squared error = 0.139) versus perceptron (right) (classification error = 0,
squared error = 0).
Perceptron
Figure: Figure from C. M. Bishop’s book, Pattern recognition and Machine Learning.
Illustration of the convergence of the perceptron learning algorithm, showing data
points from two classes (red and blue) in a two-dimensional feature space. The left
plot shows the initial parameter vector w shown as a black arrow together with the
corresponding decision boundary (black line), in which the arrow points towards the
decision region which is classified as belonging to the red class. The data point circled
in green is misclassified and so its feature vector is added to the current weight
vector, giving the new decision boundary shown on the second plot from the left.
The third plot from the left shows the next misclassified point to be considered,
indicated by the green circle, and its feature vector is again added to the weight
vector giving the decision boundary shown in the right plot for which all data points
are correctly classified.
Exponentiated gradient (EG)
The perceptron algorithm of Rosenblatt is one of the most fundamental and simplest online learning algorithms.
We now look at another online learning algorithm, known as exponentiated gradient (EG). Here the learner (learning algorithm) tries to accurately predict real-valued outcomes in a sequence of trials: online linear regression. Let $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ be the loss. At an arbitrary $t$-th trial, EG does the following:
initialize the weights $w^1$ s.t. $\sum_{i=1}^n w_i^1 = 1$ and $\forall_{i=1}^{n}\; w_i^1 \ge 0$ (usually uniform)
update the weights (shown here for the EG$^{\pm}$ variant with positive and negative weights and total weight $U$):
$$w_i^{-,t+1} = U\, \frac{w_i^{-,t}\, r_i^{-,t}}{\sum_{j=1}^n \big(w_j^{+,t} r_j^{+,t} + w_j^{-,t} r_j^{-,t}\big)}, \qquad \text{where } r_i^{+,t} = \exp\!\big(-\eta\, L'_{y^t}(\hat y^t)\, U x_i^t\big),$$
with $w_i^{+,t+1}$ and $r_i^{-,t}$ defined analogously (the sign in the exponent is flipped for $r_i^{-,t}$).
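A minimal sketch of one EG$^{\pm}$ trial is given below, assuming the squared loss $L_y(\hat y) = (\hat y - y)^2$, so that $L'_y(\hat y) = 2(\hat y - y)$; the function and variable names are illustrative, and the prediction $\hat y = (w^+ - w^-)\cdot x$ follows the Kivinen-Warmuth formulation cited in the bibliography.

```python
import numpy as np

def eg_pm_trial(w_plus, w_minus, x, y, eta=0.1, U=1.0):
    """One trial of EG+- (exponentiated gradient with positive/negative weights).

    Assumes squared loss L_y(yhat) = (yhat - y)^2, so L'_y(yhat) = 2*(yhat - y).
    w_plus and w_minus are nonnegative and jointly sum to U.
    """
    y_hat = np.dot(w_plus - w_minus, x)           # prediction for this trial
    grad = 2.0 * (y_hat - y)                      # L'_y(y_hat) for squared loss
    r_plus = np.exp(-eta * grad * U * x)          # r_i^{+,t}
    r_minus = np.exp(+eta * grad * U * x)         # r_i^{-,t} (opposite sign)
    Z = np.sum(w_plus * r_plus + w_minus * r_minus)
    w_plus = U * w_plus * r_plus / Z
    w_minus = U * w_minus * r_minus / Z
    return w_plus, w_minus, y_hat

# usage: uniform initialization of 2n weights summing to U
# n, U = 5, 1.0
# w_plus = np.full(n, U / (2 * n)); w_minus = np.full(n, U / (2 * n))
```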
Consider a loss function $L : [0, 1] \times [0, 1] \to [0, \infty)$. We will use the following notation:
$L_t(i) := L(y_t, e_t(i))$ - the loss of expert $i$ at time $t$
$L_T(i) := \sum_{t=1}^T L(y_t, e_t(i))$ - the cumulative loss of expert $i$ after $T$ steps
$L_t(\text{alg}) := L(y_t, \hat y_t) = L(y_t, \text{pred}(e_t, p_t))$ - the loss of the algorithm (learner) at time $t$
$L_T(\text{alg}) := \sum_{t=1}^T L_t(\text{alg})$ - the cumulative loss of the algorithm (learner) after $T$ steps
Expert advice
Static-expert algorithm:
assumes no shifting occurs and is thus dedicated to stationary sequences
the identity of the best expert does not change with time (this is reflected in the update on $p_t(i)$; see the sketch below)
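As a concrete illustration of the update on $p_t(i)$, here is a minimal Static-expert sketch; the weighted-average prediction pred$(e_t, p_t)$ and the default squared loss are assumptions made for the example only.

```python
import numpy as np

def static_expert(expert_preds, outcomes, eta=1.0, loss=lambda y, p: (y - p) ** 2):
    """Static-expert: maintain a distribution p over n experts; no shifting assumed.

    expert_preds: T x n array, expert_preds[t, i] = e_t(i)
    outcomes:     length-T array of y_t
    """
    T, n = expert_preds.shape
    p = np.full(n, 1.0 / n)                       # p_1(i) = 1/n
    total_loss = 0.0
    for t in range(T):
        y_hat = np.dot(p, expert_preds[t])        # pred(e_t, p_t): weighted average (assumed)
        total_loss += loss(outcomes[t], y_hat)    # L_t(alg)
        expert_losses = loss(outcomes[t], expert_preds[t])   # L_t(i)
        p = p * np.exp(-eta * expert_losses)      # multiplicative update, no sharing
        p = p / p.sum()
    return p, total_loss
```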
Static-expert
Theorem 3 (Regret bound for the Static-expert)
Given any sequence of length $T$, let $i^*$ be the best expert in hindsight (with respect to cumulative loss on the sequence). Then the cumulative loss of the algorithm obeys the following bound with respect to that of the best expert:
$$L_T(\text{alg}) \le L_T(i^*) + c \ln n$$
We proceed with the proof. We first apply the previous theorem to bound the cumulative loss of the algorithm:
$$L_T(\text{alg}) \le \sum_{t=1}^T -c \ln \sum_{i=1}^n p_t(i)\, e^{-\eta L_t(i)} = \sum_{t=1}^T -c \ln p(y_t \mid y_1, \ldots, y_{t-1})$$
$$= -c \ln p(y_1) \prod_{t=2}^T p(y_t \mid y_1, \ldots, y_{t-1}) = -c \ln p(y_1, \ldots, y_T)$$
$$= -c \ln \sum_{i=1}^n p_1(i)\, p(y_1 \mid i) \prod_{t=2}^T p(y_t \mid i, y_1, \ldots, y_{t-1})$$
Static-expert
$$L_T(\text{alg}) \le -c \ln \frac{1}{n} \sum_{i=1}^n e^{-\eta L_1(i)} \prod_{t=2}^T e^{-\eta L_t(i)}$$
$$= -c \ln \frac{1}{n} \sum_{i=1}^n e^{-\eta \sum_{t=1}^T L_t(i)}$$
$$= -c \ln \frac{1}{n} - c \ln \sum_{i=1}^n e^{-\eta \sum_{t=1}^T L_t(i)}$$
Since $-\ln(\cdot)$ decreases monotonically, we can upper-bound the last term by the same function of any single term in the summation:
$$\le L_T(i) + c \ln n.$$
Since this holds for any expert $i$, it holds in particular for the best expert $i^*$, which gives the claimed bound.
Let $L_T(\alpha)$ be the cumulative loss of the Fixed-share($\alpha$) algorithm, where $\alpha \in [0, 1]$. Previously this was denoted $L_T(\text{alg})$.
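For reference, a minimal sketch of one Fixed-share($\alpha$) round in the spirit of Herbster and Warmuth (cited in the bibliography): the Static-expert loss update followed by an $\alpha$-sharing step. The names are illustrative, and the sharing scheme shown (spreading mass over the other $n-1$ experts) is the standard one rather than a verbatim transcription of the lecture slides.

```python
import numpy as np

def fixed_share_round(p, expert_losses, alpha, eta=1.0):
    """One round of Fixed-share(alpha) over n experts.

    p:             current distribution p_t over experts
    expert_losses: length-n array with L_t(i)
    """
    n = len(p)
    # loss update (identical to Static-expert)
    p_loss = p * np.exp(-eta * expert_losses)
    p_loss = p_loss / p_loss.sum()
    # sharing step: keep 1 - alpha of each expert's mass,
    # spread alpha of it uniformly over the other n - 1 experts
    p_next = (1.0 - alpha) * p_loss + (alpha / (n - 1)) * (1.0 - p_loss)
    return p_next
```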
We will show another bound of a simpler form for the same algorithm.
Theorem 5 (Regret bound for the Fixed-share($\alpha$); constants $c$, $\eta$ are omitted)
Let $L_T(\alpha^*) = \min_\alpha L_T(\alpha)$ be the cumulative loss of the best Fixed-share($\alpha$) algorithm chosen in hindsight. Then
Figure: Figure from C. Monteleoni PhD Thesis. The hierarchy of experts and
“α-experts”, Fixed-share algorithms running with different settings of α, maintained
by the Learn-α algorithm.
Learn-α
$$L_T^{\text{top}} \le L_T(\alpha^*) + \ln m + (T - 1)\, D(\alpha^* \,\|\, \alpha_{j^*}).$$
One can show that $m^* = O(\sqrt{T})$ is the optimal resolution (the one for which the upper bound on regret, or in other words the worst-case regret, is minimized) at which to learn $\alpha$.
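A rough sketch of one Learn-α round, matching the two-level hierarchy in the figure above: $m$ Fixed-share($\alpha_j$) sub-algorithms (reusing fixed_share_round from the earlier sketch) and a Static-expert-style top layer over them. The weighted-average predictions and the squared loss are assumptions made only for this example.

```python
import numpy as np

def learn_alpha_round(top_p, sub_ps, expert_preds, y, alphas, eta=1.0,
                      loss=lambda y, p: (y - p) ** 2):
    """One round of Learn-alpha over m Fixed-share(alpha_j) sub-algorithms.

    top_p:        distribution over the m alpha-experts
    sub_ps:       m x n array; row j is the expert distribution of Fixed-share(alpha_j)
    expert_preds: length-n array with e_t(i)
    """
    # each alpha-expert predicts, then the top layer averages their predictions
    sub_preds = sub_ps @ expert_preds             # length-m vector of alpha-expert predictions
    y_hat = np.dot(top_p, sub_preds)              # Learn-alpha prediction
    # top layer: Static-expert-style update on the losses of the alpha-experts
    top_p = top_p * np.exp(-eta * loss(y, sub_preds))
    top_p = top_p / top_p.sum()
    # bottom layer: each Fixed-share(alpha_j) updates its own expert distribution
    expert_losses = loss(y, expert_preds)
    sub_ps = np.stack([fixed_share_round(sub_ps[j], expert_losses, alphas[j], eta)
                       for j in range(len(alphas))])
    return top_p, sub_ps, y_hat
```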
Online multi-class classification
We will next stay in the online setting, but leave the experts setting. We will focus on the classification problem.
Problem setting:
classification with a large number of classes
large = 10 classes: 15 years ago
large = 1000/10000/100000/1000000 classes and more: today
...
Our approach:
reduction to tree-structured binary classification
top-down approach for class partitioning allowing gradient-descent-style optimization
Existing approaches are intractable, do not learn the tree structure, or are not online.
Online multi-class classification
h - hypothesis inducing the split, x - data point
Online multi-class classification
Given a set of n examples each with one of k labels, find a partitioner h that
maximizes the objective.
Lemma 7
For any hypothesis $h : \mathcal{X} \mapsto \{-1, 1\}$, $J(h) \in [0, 1]$.
h induces a perfectly pure and perfectly balanced partition iff J(h) = 1.
Online multi-class classification
Balancing factor
$$\text{balancedness} \in \left[\frac{1 - \sqrt{1 - J(h)}}{2},\; \frac{1 + \sqrt{1 - J(h)}}{2}\right]$$
Purity factor
$$\text{purity} \le \frac{2 - J(h)}{4 \cdot \text{balancedness}} - \text{balancedness}$$
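As a quick check of these bounds (assuming the purity factor is defined so that the value 0 corresponds to a perfectly pure split): plugging $J(h) = 1$ collapses the balancedness interval to the single point $1/2$, and the purity bound becomes $\frac{2 - 1}{4 \cdot 1/2} - \frac{1}{2} = 0$, so a maximizer of $J(h)$ must be perfectly balanced and perfectly pure, consistent with Lemma 7.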
Figure: The balancedness bounds $y = \frac{1 \mp \sqrt{1 - J(h)}}{2}$ and the purity bound $y = \frac{2 - J(h)}{4 \cdot \text{balance}} - \text{balance}$ (plotted for balance $= 1/2$), shown as functions of $J(h)$.
Online multi-class classification
$$G_T = \sum_{\ell \in \text{leaves of } T} \sum_{y=1}^{k} w_\ell\, \pi_{\ell,y} \ln \frac{1}{\pi_{\ell,y}},$$
where $\pi_{\ell,y}$ is the probability that a data point $x$ has label $y$ given that $x$ reaches node $\ell$, and $w_\ell$ is the weight of leaf $\ell$ (the probability that $x$ reaches leaf $\ell$).
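A small sketch that evaluates $G_T$ from per-leaf statistics; the data layout (an array of leaf weights $w_\ell$ and a matrix of label probabilities $\pi_{\ell,y}$) is an assumption made for the example.

```python
import numpy as np

def tree_entropy_objective(leaf_weights, leaf_label_probs):
    """G_T = sum over leaves l of w_l * sum_y pi_{l,y} * ln(1 / pi_{l,y}).

    leaf_weights:     length-L array, w_l = P(x reaches leaf l), sums to 1
    leaf_label_probs: L x k array, pi_{l,y} = P(label y | x reaches leaf l)
    """
    G = 0.0
    for w, pi in zip(leaf_weights, leaf_label_probs):
        pi = np.asarray(pi)
        pi = pi[pi > 0]                   # 0 * ln(1/0) is taken to be 0
        G += w * np.sum(pi * np.log(1.0 / pi))
    return G

# example: two perfectly pure leaves give G_T = 0
# print(tree_entropy_objective([0.5, 0.5], np.array([[1.0, 0.0], [0.0, 1.0]])))
```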
Why?
Small entropy of the leaves ⇒ pure leaves
Tree depth $\approx \frac{4(1-\gamma)^2}{\gamma^2} \ln k = O(\ln k)$
⇒ split every node with the objective
$$J(h) = 2\, \mathbb{E}_y\Big[\big|\mathbb{E}_x[\mathbf{1}(h(x) > 0)] - \mathbb{E}_x[\mathbf{1}(h(x) > 0 \mid y)]\big|\Big].$$
Apply recursively to construct a tree structure.
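An empirical version of the objective is easy to compute on a finite sample by replacing the expectations with sample averages; below is a sketch under that assumption, with $\mathbb{E}_y$ taken with respect to the empirical label frequencies.

```python
import numpy as np

def empirical_J(h_scores, labels):
    """Empirical J(h) = 2 * E_y | E_x[1(h(x) > 0)] - E_x[1(h(x) > 0 | y)] |.

    h_scores: length-n array of h(x) for each example
    labels:   length-n array of class labels
    """
    went_right = (h_scores > 0).astype(float)     # 1(h(x) > 0)
    overall = went_right.mean()                   # E_x[1(h(x) > 0)]
    J = 0.0
    for y in np.unique(labels):
        mask = labels == y
        pi_y = mask.mean()                        # empirical P(y)
        J += pi_y * abs(overall - went_right[mask].mean())
    return 2.0 * J
```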
Online multi-class classification
Let $n = 0$, $e = 0$, and for all $y$: $e_y = 0$, $n_y = 0$
For each example $(x, y)$:
    if $e_y < e$ then $b = -1$ else $b = 1$
    Update $w$ using $(x, b)$
    $n \leftarrow n + 1$, $n_y \leftarrow n_y + 1$
    $e_y \leftarrow \frac{(n_y - 1)\, e_y}{n_y} + \frac{w \cdot x}{n_y}$
    $e \leftarrow \frac{(n - 1)\, e}{n} + \frac{w \cdot x}{n}$
Apply recursively to construct a tree structure.
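A minimal sketch of this per-node update follows. The binary learner behind "Update $w$ using $(x, b)$" is left abstract in the pseudocode, so the perceptron-style step used here is an assumption, as is making the running counter $n$ explicit.

```python
import numpy as np
from collections import defaultdict

class OnlineNodeSplit:
    """One node of the tree: online version of the split update above.

    e   - running mean of w.x over all examples seen at this node
    e_y - running mean of w.x over examples with label y
    """
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.e, self.n = 0.0, 0
        self.e_y = defaultdict(float)
        self.n_y = defaultdict(int)

    def update(self, x, y):
        # choose the binary target: send class y to the side it currently lags on
        b = -1.0 if self.e_y[y] < self.e else 1.0
        # update w using (x, b); the perceptron-style step is an assumed stand-in
        if b * np.dot(self.w, x) <= 0:
            self.w += self.lr * b * x
        # update the running means with the new margin w.x
        m = np.dot(self.w, x)
        self.n += 1
        self.n_y[y] += 1
        self.e_y[y] += (m - self.e_y[y]) / self.n_y[y]   # e_y <- (n_y-1)e_y/n_y + w.x/n_y
        self.e += (m - self.e) / self.n                  # e   <- (n-1)e/n   + w.x/n
        return b
```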
Online multi-class classification
Figure: Test accuracy as a function of the number of classes (26, 105, 1000, 21841, 105033), and log2(time ratio) as a function of log2(number of classes).
Table: Test error (%) and confidence interval on all problems.
LOMtree Rtree Filter tree
Isolet 6.36±1.71 16.92±2.63 15.10±2.51
Sector 16.19±2.33 15.77±2.30 17.70±2.41
Aloi 16.50±0.70 83.74±0.70 80.50±0.75
ImNet 90.17±0.05 96.99±0.03 92.12±0.04
ODP 93.46±0.12 93.85±0.12 93.76±0.12
Online multi-class classification
New algorithm: