
Special Topics in Advanced Machine Learning

Lecture 2

Anna Choromanska
[email protected]
http://cims.nyu.edu/~achoroma/
Department of Electrical and Computer Engineering
New York University Tandon School of Engineering
Lecture outline

Perceptron
Exponentiated gradient (EG)
Expert advice:
Static-expert
Fixed-share (α)
Learn-α
Online multi-class classification
Bibliography
C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st ed. 2006, corr. 2nd printing, Springer.
T. Jebara, Course notes, Machine Learning, Topic 4.
M. Warmuth and J. Kivinen, Exponentiated Gradient versus Gradient Descent for Linear Predictors, Inf. and Comp. 132(1): 1-63, 1995.
C. Monteleoni, Learning with Online Constraints: Shifting Concepts and Active Learning, PhD Thesis, MIT, 2006.
A. Choromanska and C. Monteleoni, Online clustering with experts, AISTATS, 2012.
M. Herbster and M. K. Warmuth, Tracking the best expert, Machine Learning, 32:151-178, 1998.
Y. Jernite, A. Choromanska, D. Sontag, and Y. LeCun, Simultaneous Learning of Trees and Representations for Extreme Classification, with Application to Language Modeling, CoRR, abs/1610.04658, 2016.
A. Choromanska and J. Langford, Logarithmic Time Online Multiclass Prediction, NIPS, 2015.
A. Choromanska, K. Choromanski, and M. Bojarski, On the boosting ability of top-down decision tree learning algorithm for multiclass classification, CoRR, abs/1605.05223, 2016.

The slides occasionally use fragments of text from, or only slightly modified versions of, original slides in the bibliographic material. They are used for educational purposes only.
Perceptron
Consider binary classification with two possible labels $y \in \{-1, 1\}$. To obtain a binary output, use the following rule:
$$g(z) = \begin{cases} -1 & \text{if } z < 0 \\ +1 & \text{if } z \ge 0, \end{cases}$$
where $z = f(x, w)$ is the model's prediction. Assume a linear model: $f(x, w) = w^\top x$. A simple classification loss is
$$L(y, f(x, w)) = \frac{1}{n}\sum_{i=1}^{n} \text{step}(-y_i f(x_i, w)) = \frac{1}{n}\sum_{i=1}^{n} \text{step}(-y_i w^\top x_i),$$
where the step function is 0 for negative arguments and 1 otherwise. The gradient of this function is 0, except at the edges where a label flips, which makes GD inapplicable. The perceptron is a linear classifier which instead uses the loss
$$L(y, f(x, w)) = \frac{1}{n}\sum_{i \in \{\text{misclassified}\}} -y_i f(x_i, w) = \frac{1}{n}\sum_{i \in \{\text{misclassified}\}} -y_i w^\top x_i,$$
where $i \in \{\text{misclassified}\} \Leftrightarrow i \in \{i : y_i x_i^\top w \le 0\}$.
The GD update becomes
$$w^{t+1} = w^t - \eta \nabla_w L|_{w^t} = w^t + \eta \frac{1}{n} \sum_{i \in \{\text{misclassified}\}} y_i x_i,$$
and the SGD update is based on a single misclassified data point: $w^{t+1} = w^t + y_i x_i$, where $i$ is the index of a misclassified point and, w.l.o.g., $\eta = 1$.
Perceptron
The SGD update gives rise to the online perceptron algorithm below:
Initialize $w^0$ (at random, close to 0)
While not converged (i.e. until no update is made to $w$ during one epoch or the maximum number of iterations is reached) do
    Pick $i \in \{1, 2, \dots, n\}$ at random
    If $y_i x_i^\top w^t \le 0$
        $w^{t+1} = w^t + y_i x_i$
        $t \Leftarrow t + 1$
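The loop above can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the function name `perceptron`, the toy data, starting from the zero vector, and visiting points in a random permutation per epoch (instead of sampling one index at a time) are all assumptions made here.

```python
import random

def perceptron(X, y, max_epochs=100, seed=0):
    """Online perceptron with eta = 1; returns (w, number_of_updates)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])            # assumption: start exactly at zero
    updates = 0
    for _ in range(max_epochs):
        order = list(range(len(X)))
        rng.shuffle(order)           # assumption: random permutation per epoch
        changed = False
        for i in order:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            if margin <= 0:          # misclassified (or on the boundary)
                w = [wj + y[i] * xj for wj, xj in zip(w, X[i])]
                updates += 1
                changed = True
        if not changed:              # a full epoch without updates: converged
            break
    return w, updates

# Toy linearly separable data; the last feature is a constant 1 acting as a bias.
X = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
y = [1, 1, -1, -1]
w, updates = perceptron(X, y)
```

On separable data such as this toy set, the loop stops once an entire epoch passes without an update.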
If $\|x_i y_i\|^2 > 0$, then the contribution to the error from a misclassified pattern is reduced by a single update:
$$-(w^{t+1})^\top x_i y_i = -(w^t)^\top x_i y_i - (x_i y_i)^\top x_i y_i < -(w^t)^\top x_i y_i,$$
but this does not imply that the contribution to the error from other misclassified patterns is reduced. The change in the weight vector may also have caused some previously correctly classified patterns to become misclassified.
The perceptron learning rule is not guaranteed to reduce the total error function at each stage...but...
Perceptron

Theorem 1 (Perceptron convergence theorem)

If there exists an exact solution (if the training set is linearly separable), then
the perceptron algorithm is guaranteed to find it in a finite number of steps.

There may be more than one solution - perceptron finds one of them. Which
one? Depends on parameter initialization and order of data processing.
Perceptron
Proof.
Let $w^0 = 0$. Make two necessary assumptions ($w^*$ denotes the optimum):
all data are inside a sphere of radius $r$, i.e. $\forall i \; \|x_i\| \le r$
the data are separable with a non-zero margin $\gamma$, i.e. $\forall i \; y_i (w^*)^\top x_i \ge \gamma$
Note that
$$(w^*)^\top w^t = (w^*)^\top w^{t-1} + y_i (w^*)^\top x_i \ge (w^*)^\top w^{t-1} + \gamma.$$
After $t$ such updates we get $(w^*)^\top w^t \ge t\gamma$.
Note that
$$\|w^t\|^2 = \|w^{t-1} + y_i x_i\|^2 = \|w^{t-1}\|^2 + \underbrace{2 y_i (w^{t-1})^\top x_i}_{\text{non-positive since only mistakes cause updates}} + \|x_i\|^2 \le \|w^{t-1}\|^2 + \|x_i\|^2 \le \|w^{t-1}\|^2 + r^2 \le t r^2.$$
The angle between the optimal and the current solution satisfies
$$1 \ge \cos(w^*, w^t) = \frac{(w^*)^\top w^t}{\|w^t\| \|w^*\|} \ge \frac{t\gamma}{r\sqrt{t}\,\|w^*\|}, \quad \text{thus} \quad t \le \frac{r^2}{\gamma^2}\|w^*\|^2.$$
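The bound $t \le r^2\|w^*\|^2/\gamma^2$ can be checked numerically on a small example. The sketch below is illustrative only: the data set, the hand-picked separator `w_star` (any vector achieving a positive margin can play the role of $w^*$ in the argument above), and the cyclic visiting order are assumptions made here.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def count_perceptron_updates(X, y, max_epochs=1000):
    """Run the perceptron from w^0 = 0 and count the updates t."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):          # cyclic order (an arbitrary choice)
            if yi * dot(w, xi) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                t += 1
                mistakes += 1
        if mistakes == 0:
            break
    return w, t

X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0], [0.5, 0.2]]
y = [1, 1, -1, -1, 1]
w_star = [1.0, 1.0]                        # hand-picked separator (assumption)

r = max(math.sqrt(dot(x, x)) for x in X)                    # radius of the data
gamma = min(yi * dot(w_star, xi) for xi, yi in zip(X, y))   # margin of w_star
bound = r ** 2 * dot(w_star, w_star) / gamma ** 2

w, t = count_perceptron_updates(X, y)
```

Comparing `t` with `bound` shows how loose or tight the guarantee is for a particular data set; the bound depends only on $r$, $\gamma$ and $\|w^*\|$, not on the number of points.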
Perceptron

Figure: Figure from Tony Jebara’s class notes. Linear regression (left) (classification
error = 2, squared error = 0.139) versus perceptron (right) (classification error = 0,
squared error = 0).
Perceptron

Figure: Figure from C. M. Bishop’s book, Pattern recognition and Machine Learning.
Illustration of the convergence of the perceptron learning algorithm, showing data
points from two classes (red and blue) in a two-dimensional feature space. The left
plot shows the initial parameter vector w shown as a black arrow together with the
corresponding decision boundary (black line), in which the arrow points towards the
decision region that is classified as belonging to the red class. The data point circled
in green is misclassified and so its feature vector is added to the current weight
vector, giving the new decision boundary shown on the second plot from the left.
The third plot from the left shows the next misclassified point to be considered,
indicated by the green circle, and its feature vector is again added to the weight
vector giving the decision boundary shown in the right plot for which all data points
are correctly classified.
Exponentiated gradient (EG)
The perceptron algorithm of Rosenblatt is one of the most fundamental and simplest online learning algorithms.
We now look at another online learning algorithm known as exponentiated gradient (EG). Here, the learner, or learning algorithm, tries to accurately predict real-valued outcomes in a sequence of trials - online linear regression. Let $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ be the loss. At an arbitrary $t$-th trial EG does the following (the weights are initialized once, before the first trial):
initialize weights $w^1$ s.t. $\sum_{i=1}^{n} w_i^1 = 1$ and $w_i^1 \ge 0$ for all $i = 1, \dots, n$ (usually uniform)
the learner receives an instance $x^t$, which is an $n$-dimensional real vector (its $i$-th component is denoted as $x_i^t$) - in the online learning with an adversary literature, these features are called experts and thus the weights can be thought of as a distribution over the experts
the learner makes a prediction based on the information received in previous trials: $\hat{y}^t = (w^t)^\top x^t$
the world reveals the truth $y^t$ (e.g. for cross-entropy loss $y^t \in [0, 1]$)
EG updates the weights:
$$w_i^{t+1} = \frac{w_i^t r_i^t}{\sum_{j=1}^{n} w_j^t r_j^t}, \quad \text{where} \quad r_i^t = \exp\left(-\eta L'_{y^t}(\hat{y}^t)\, x_i^t\right)$$
and $\eta$ is the learning rate. For squared loss $L_y(\hat{y}) := L(y, \hat{y}) = (y - \hat{y})^2$: $L'_y(\hat{y}) = 2(\hat{y} - y)$, where $L'_y(\hat{y}) = (\partial L(y, z)/\partial z)_{z=\hat{y}}$.
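One EG trial with squared loss can be sketched as below. The function name `eg_step`, the learning rate, and the synthetic stream (in which the outcome always equals the second feature) are assumptions chosen for illustration, not part of the lecture.

```python
import math

def eg_step(w, x, y, eta):
    """One EG trial with squared loss; returns (prediction, new_weights)."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    grad = 2.0 * (y_hat - y)                     # L'_y(y_hat) for squared loss
    r = [math.exp(-eta * grad * xi) for xi in x]
    z = sum(wi * ri for wi, ri in zip(w, r))     # normalizer keeps sum(w) = 1
    return y_hat, [wi * ri / z for wi, ri in zip(w, r)]

n = 3
w = [1.0 / n] * n                                # uniform initialization
# Hypothetical stream: the outcome always equals the second feature.
stream = [([0.0, 1.0, 0.5], 1.0), ([1.0, 0.0, 0.5], 0.0), ([0.2, 0.9, 0.1], 0.9)] * 50
for x, y in stream:
    _, w = eg_step(w, x, y, eta=0.5)
```

Because the update is multiplicative and renormalized, the weights stay non-negative and sum to 1 throughout, and mass drifts toward the feature (expert) that predicts best.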
Exponentiated gradient (EG)
Recall the GD algorithm for linear predictions known as Least Mean Squares, where the update is
$$w^{t+1} = w^t - 2\eta(\hat{y}^t - y^t)x^t.$$
GD uses an additive update rule, whereas EG uses a multiplicative update rule.
What is common between GD and EG?
There is a common framework. In making an update, the algorithm must balance its need to be conservative, i.e. retain the information it has acquired in the preceding trials, and to be corrective, i.e. make certain that if the same instance were observed again, the algorithm could make a more accurate prediction, at least if the outcome is still the same. Thus the algorithm chooses a new weight vector $w^{t+1}$ that approximately minimizes
$$d(w^{t+1}, w^t) + \eta L(y^t, (w^{t+1})^\top x^t),$$
where $d(w^{t+1}, w^t)$ measures the distance between the old and new parameter vector (it is typically not a metric), $L$ is the loss function, and the learning rate $\eta$ represents the importance of correctiveness compared to the importance of conservativeness.
Exponentiated gradient (EG)

Consider squared loss $L(y^t, (w^{t+1})^\top x^t) = (y^t - (w^{t+1})^\top x^t)^2$.

$d(w^{t+1}, w^t) = \frac{1}{2}\|w^{t+1} - w^t\|_2^2$ gives the GD algorithm
$d(w^{t+1}, w^t) = \sum_{i=1}^{n} w_i^{t+1} \ln \frac{w_i^{t+1}}{w_i^t}$ is a relative entropy, also known as Kullback-Leibler divergence, and gives the EG algorithm
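As an informal sketch of why these two distances lead to the two updates, one can set the gradient of $d(w^{t+1}, w^t) + \eta L$ to zero. Two things here are assumptions of this sketch rather than statements from the slides: the loss derivative at the new prediction is approximated by its value at the old prediction $\hat{y}^t$ (which is where "approximately minimizes" enters), and $\lambda$ is a Lagrange multiplier introduced to enforce $\sum_i w_i^{t+1} = 1$.

```latex
\begin{align*}
% Squared Euclidean distance, no constraint:
w^{t+1} - w^t + \eta L'_{y^t}(\hat{y}^t)\, x^t = 0
  \;&\Rightarrow\; w^{t+1} = w^t - \eta L'_{y^t}(\hat{y}^t)\, x^t
  = w^t - 2\eta(\hat{y}^t - y^t)\, x^t \quad \text{(squared loss)} \\[4pt]
% Relative entropy, with the constraint \sum_i w_i^{t+1} = 1:
\ln \frac{w_i^{t+1}}{w_i^t} + 1 + \eta L'_{y^t}(\hat{y}^t)\, x_i^t + \lambda = 0
  \;&\Rightarrow\; w_i^{t+1} = w_i^t \exp\!\left(-\eta L'_{y^t}(\hat{y}^t)\, x_i^t\right) e^{-1-\lambda} \\
  \;&\Rightarrow\; w_i^{t+1} = \frac{w_i^t r_i^t}{\sum_{j=1}^{n} w_j^t r_j^t},
  \qquad r_i^t = \exp\!\left(-\eta L'_{y^t}(\hat{y}^t)\, x_i^t\right)
\end{align*}
```

The factor $e^{-1-\lambda}$ is the same for every component, so it is fixed by normalization, which yields the denominator in the last line.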
EG assumes all components of the parameter vector are non-negative and
sum to 1. This may limit the predictive ability of the algorithm. A simple
modification to the EG algorithm allows for both positive and negative
weights, where their sum does not have to be fixed, but instead the algorithm
assumes there is a fixed upper bound for it. The new algorithm is called
exponentiated gradient algorithm with positive and negative weights
(EG±).
Exponentiated gradient with positive and negative weights (EG±)
Let $U$ be the upper bound on the total weight of the weight vectors. At an arbitrary $t$-th trial EG± does the following (the weights are initialized once, before the first trial):
initialize a pair of weight vectors $w^{+,1}$ and $w^{-,1}$ in $[0, 1]^n$ such that $\sum_{i=1}^{n} (w_i^{+,1} + w_i^{-,1}) = 1$ (usually uniform with each component equal to $1/(2n)$)
finish the initialization with: $w^{+,1} \leftarrow U w^{+,1}$ and $w^{-,1} \leftarrow U w^{-,1}$
the learner receives an instance $x^t$, which is an $n$-dimensional real vector
the learner makes a prediction based on the information received in previous trials: $\hat{y}^t = (w^{+,t} - w^{-,t})^\top x^t$ (note that $(w^{+,t} - w^{-,t})$ can contain negative components)
the world reveals the truth $y^t$ (e.g. for cross-entropy loss $y^t \in [0, 1]$)
EG± updates the weights according to
$$w_i^{+,t+1} = U \frac{w_i^{+,t} r_i^{+,t}}{\sum_{j=1}^{n} (w_j^{+,t} r_j^{+,t} + w_j^{-,t} r_j^{-,t})} \quad \text{and} \quad w_i^{-,t+1} = U \frac{w_i^{-,t} r_i^{-,t}}{\sum_{j=1}^{n} (w_j^{+,t} r_j^{+,t} + w_j^{-,t} r_j^{-,t})},$$
where $r_i^{+,t} = \exp\left(-\eta L'_{y^t}(\hat{y}^t)\, U x_i^t\right)$, $r_i^{-,t} = \exp\left(\eta L'_{y^t}(\hat{y}^t)\, U x_i^t\right) = \frac{1}{r_i^{+,t}}$, and $\eta$ is the learning rate.
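A minimal sketch of one EG± trial with squared loss is given below. The function name `eg_pm_step`, the values of `U` and `eta`, and the synthetic target $y = x_1 - x_2$ (which plain EG cannot represent, since it needs a negative weight) are assumptions chosen for illustration.

```python
import math

def eg_pm_step(w_pos, w_neg, x, y, eta, U):
    """One EG+- trial with squared loss; returns (prediction, new w+, new w-)."""
    y_hat = sum((p - q) * xi for p, q, xi in zip(w_pos, w_neg, x))
    grad = 2.0 * (y_hat - y)                        # L'_y(y_hat) for squared loss
    r_pos = [math.exp(-eta * grad * U * xi) for xi in x]
    r_neg = [1.0 / rp for rp in r_pos]              # r^- = 1 / r^+
    z = sum(p * rp + q * rn for p, q, rp, rn in zip(w_pos, w_neg, r_pos, r_neg))
    new_pos = [U * p * rp / z for p, rp in zip(w_pos, r_pos)]
    new_neg = [U * q * rn / z for q, rn in zip(w_neg, r_neg)]
    return y_hat, new_pos, new_neg

n, U = 2, 2.0
w_pos = [U / (2 * n)] * n                           # uniform 1/(2n), scaled by U
w_neg = [U / (2 * n)] * n
# Hypothetical target y = x_1 - x_2, which requires a negative weight.
stream = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)] * 100
for x, y in stream:
    _, w_pos, w_neg = eg_pm_step(w_pos, w_neg, x, y, eta=0.1, U=U)
w = [p - q for p, q in zip(w_pos, w_neg)]           # effective signed weights
```

The total weight $\sum_i (w_i^{+} + w_i^{-})$ stays equal to $U$ after every update, while the effective weight vector `w` is free to take either sign.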
Expert advice

We next study a more complicated expert advice setting in supervised learning. We make no assumptions on the observation sequence, which can even be generated by an adaptive adversary. Thus, the analysis will be based on bounding the regret - the difference between the cumulative loss of the algorithm and the loss of the best method in some comparator class.

We focus on non-stationary observations that are generated by a shifting sequence of stationary distributions.
Expert advice
Goal: predict, at each time step $t$, the outcome $y_t \in [0, 1]$. Framework:
the learner has access to $n$ experts
an expert is a black box (we do not have to know how it forms its prediction) that at each time step receives a data point $x_t$ and makes a prediction that is revealed to the learner
the learner keeps a distribution $p_t$ over the experts (usually initialized to uniform) reflecting how well they have performed so far
after receiving the expert predictions (an $n$-tuple denoted as $e_t$), the learner forms the final prediction $\hat{y}_t = \text{pred}(e_t, p_t) = \sum_{i=1}^{n} p_t(i) e_t(i)$
nature reveals the true outcome $y_t$
the loss is evaluated for the learner and for each expert, and the learner updates $p_t$

Figure: Figure from C. Monteleoni PhD Thesis. The algorithm maintains a distribution over the experts, in order to inform its own predictions.
Expert advice

Consider a loss function $L : [0, 1] \times [0, 1] \to [0, \infty)$. We will use the following notation:
$L_t(i) := L_t(y_t, e_t(i))$ - the loss of expert $i$ at time $t$
$L_T(i) := \sum_{t=1}^{T} L_t(y_t, e_t(i))$ - the cumulative loss of expert $i$ after $T$ steps
$L_t(\text{alg}) := L_t(y_t, \hat{y}_t) = L_t(y_t, \text{pred}(e_t, p_t))$ - the loss of the algorithm (learner) at time $t$
$L_T(\text{alg}) := \sum_{t=1}^{T} L_t(\text{alg})$ - the cumulative loss of the algorithm (learner) after $T$ steps
Expert advice

Figure: Figure from C. Monteleoni PhD Thesis. A generalized HMM of the probability of the next observation, given past observations and the current best expert.
We will talk about algorithms that can be viewed as Bayesian updates in the graphical model called a generalized Hidden Markov Model (HMM) (the current observation depends on past observations, not just on the hidden state):
hidden state - identity of the current best expert; the $y$'s - observations
$p(i_t | i_{t-1})$ - transition matrix $\Theta$
emission probabilities are defined as: $p(y_t | i, y_1, \dots, y_{t-1}) = e^{-\eta L_t(i)} \Leftrightarrow \eta L_t(i) = -\log p(y_t | i, y_1, \dots, y_{t-1})$
the predictive probability (view experts as making probabilistic predictions) is therefore
$$p(y_t | y_1, \dots, y_{t-1}) = \sum_{i=1}^{n} p_t(i)\, p(y_t | i, y_1, \dots, y_{t-1}) = \sum_{i=1}^{n} p_t(i)\, e^{-\eta L_t(i)},$$
where $\eta$ is the learning rate and $p_t(i) = p(i | y_1, \dots, y_{t-1})$.
Expert advice
The algorithms combining expert predictions can now be derived as simple Bayesian estimation methods calculating the distribution $p_t(i) = p(i | y_1, \dots, y_{t-1})$ over the experts on the basis of the observations seen so far.
The Bayesian algorithm updating $p_t(\cdot)$ is defined as follows: for each $i \in \{1, 2, \dots, n\}$ do
$$p_t(i; \alpha) = \frac{1}{Z_t} \sum_{j=1}^{n} p_{t-1}(j; \alpha)\, e^{-\eta L_{t-1}(j)}\, p(i | j; \alpha),$$
where $Z_t$ is a normalization factor and $p(i | j; \alpha)$ denotes how the optimal choice of expert can change with time:
$$\Theta_{i,j} = p(i | j; \alpha) = \begin{cases} 1 - \alpha & i = j \\ \frac{\alpha}{n-1} & i \ne j. \end{cases}$$
Thus we have two frameworks:
Static-expert - when $\alpha = 0$ - the identity of the best expert cannot change with time: $p_t(i) = \frac{1}{Z_t} p_{t-1}(i)\, e^{-\eta L_{t-1}(i)}$
Fixed-share (α) - when $\alpha \ne 0$
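The update above can be sketched in a few lines of Python. The function name `fixed_share_update` and the example numbers are assumptions for illustration; setting `alpha=0.0` recovers the Static-expert update, and the code assumes at least two experts.

```python
import math

def fixed_share_update(p_prev, losses_prev, alpha, eta=1.0):
    """p_t(i) = (1/Z_t) * sum_j p_{t-1}(j) * exp(-eta * L_{t-1}(j)) * p(i | j; alpha)."""
    n = len(p_prev)                                   # requires n >= 2
    v = [pj * math.exp(-eta * lj) for pj, lj in zip(p_prev, losses_prev)]
    total = sum(v)
    # Transition: stay with probability 1 - alpha, move to each of the
    # other n - 1 experts with probability alpha / (n - 1).
    p = [(1.0 - alpha) * v[i] + alpha / (n - 1) * (total - v[i]) for i in range(n)]
    z = sum(p)                                        # normalization factor Z_t
    return [pi / z for pi in p]

p0 = [0.25] * 4
losses = [0.0, 1.0, 1.0, 1.0]                         # expert 0 was the best one
p_static = fixed_share_update(p0, losses, alpha=0.0)  # Static-expert
p_shared = fixed_share_update(p0, losses, alpha=0.2)  # Fixed-share(0.2)
```

With sharing, the best expert still gains weight, but less aggressively, and the other experts retain more mass than under the static update.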
Expert advice
Consider three different kinds of losses $L : [0, 1] \times [0, 1] \to [0, \infty)$:
Squared loss: $L(p, q) = (p - q)^2$
Relative entropy: $L(p, q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1-p}{1-q}$
Hellinger loss: $L(p, q) = \frac{1}{2}\left((\sqrt{1-p} - \sqrt{1-q})^2 + (\sqrt{p} - \sqrt{q})^2\right)$
Then:
Theorem 2 ((c, η)-realizability)
The loss function $L$ and prediction function pred are $(c, \eta)$-realizable for the constants $c$ and $\eta$:
$$(L(\text{alg}) :=)\; L(y, \underbrace{\text{pred}(e, p)}_{\sum_{i=1}^{n} p(i) e(i)}) \le -c \ln \sum_{i=1}^{n} p(i)\, e^{-\eta L(y, e(i))}.$$
These constants are (we provide $c$, whereas $\eta = \frac{1}{c}$):
Squared loss: $c = 2$
Relative entropy: $c = 1$
Hellinger loss: $c = 1$
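The inequality can be probed numerically. The sketch below evaluates both sides for squared loss with $c = 2$, $\eta = 1/2$ on a coarse grid of outcomes, two expert predictions, and mixing weights; the grid, the restriction to two experts, and the helper names are assumptions made for illustration, and a grid check is of course not a proof.

```python
import math

def squared_loss(p, q):
    return (p - q) ** 2

def realizability_gap(y, e, p, loss, c):
    """Right-hand side minus left-hand side of the (c, eta) inequality, eta = 1/c."""
    eta = 1.0 / c
    pred = sum(pi * ei for pi, ei in zip(p, e))       # weighted-average prediction
    lhs = loss(y, pred)
    rhs = -c * math.log(sum(pi * math.exp(-eta * loss(y, ei)) for pi, ei in zip(p, e)))
    return rhs - lhs

grid = [k / 10 for k in range(11)]                    # values 0.0, 0.1, ..., 1.0
gaps = [
    realizability_gap(y, (e1, e2), (p1, 1.0 - p1), squared_loss, c=2.0)
    for y in grid for e1 in grid for e2 in grid for p1 in grid
]
```

A non-negative gap at every grid point is consistent with the stated constants; the gap is zero when both experts agree or when all weight sits on one expert.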
Static-expert

Static-expert algorithm:
assumes that no shifting occurs and is thus suited to stationary sequences
the identity of the best expert does not change with time (this is reflected in the update on $p_t(i)$)
Static-expert
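Theorem 3 (Regret bound for the Static-expert)
Given any sequence of length $T$, let $i^*$ be the best expert in hindsight (with respect to cumulative loss on the sequence). Then the cumulative loss of the algorithm obeys the following bound, with respect to that of the best expert:
$$L_T(\text{alg}) \le L_T(i^*) + c \ln n$$
We proceed with the proof. We first apply the previous theorem to bound the cumulative loss of the algorithm:
$$\begin{aligned}
L_T(\text{alg}) &\le -c \sum_{t=1}^{T} \ln \sum_{i=1}^{n} p_t(i)\, e^{-\eta L_t(i)} = -c \sum_{t=1}^{T} \ln p(y_t | y_1, \dots, y_{t-1}) \\
&= -c \ln p(y_1) \prod_{t=2}^{T} p(y_t | y_1, \dots, y_{t-1}) = -c \ln p(y_1, \dots, y_T) \\
&= -c \ln \sum_{i=1}^{n} p_1(i)\, p(y_1 | i) \prod_{t=2}^{T} p(y_t | i, y_1, \dots, y_{t-1}).
\end{aligned}$$
With the uniform prior $p_1(i) = 1/n$ and the emission probabilities $e^{-\eta L_t(i)}$ this becomes
$$\begin{aligned}
L_T(\text{alg}) &\le -c \ln \frac{1}{n} \sum_{i=1}^{n} e^{-\eta L_1(i)} \prod_{t=2}^{T} e^{-\eta L_t(i)} \\
&= -c \ln \frac{1}{n} \sum_{i=1}^{n} e^{-\eta \sum_{t=1}^{T} L_t(i)} \\
&= -c \ln \frac{1}{n} - c \ln \sum_{i=1}^{n} e^{-\eta \sum_{t=1}^{T} L_t(i)} \\
&\le c \ln n + c\eta L_T(i) = L_T(i) + c \ln n.
\end{aligned}$$
The inequality in the last line holds because $-\ln(\cdot)$ decreases monotonically, so we can upper-bound the logarithm of the sum by the same function of any single term in the summation; the final equality uses $\eta = 1/c$.

The last inequality holds for any $i$, so in particular for $i^*$.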
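The bound can be observed on a small simulation. The sketch below runs Static-expert with squared loss ($c = 2$, $\eta = 1/2$) and compares the algorithm's cumulative loss with $L_T(i^*) + c \ln n$; the three constant experts and the periodic outcome sequence are hypothetical choices made for illustration.

```python
import math

def static_expert(expert_preds, outcomes, c=2.0):
    """Run Static-expert with squared loss; returns (L_T(alg), [L_T(i) for each i])."""
    eta = 1.0 / c
    n = len(expert_preds[0])
    p = [1.0 / n] * n                                  # uniform prior p_1
    alg_loss, cum = 0.0, [0.0] * n
    for e, y in zip(expert_preds, outcomes):
        y_hat = sum(pi * ei for pi, ei in zip(p, e))   # pred(e_t, p_t)
        alg_loss += (y - y_hat) ** 2
        losses = [(y - ei) ** 2 for ei in e]
        cum = [ci + li for ci, li in zip(cum, losses)]
        p = [pi * math.exp(-eta * li) for pi, li in zip(p, losses)]
        z = sum(p)
        p = [pi / z for pi in p]                       # Static-expert update
    return alg_loss, cum

# Hypothetical data: three constant experts and a periodic binary outcome sequence.
outcomes = [1.0, 1.0, 0.0, 1.0] * 25
expert_preds = [[0.0, 0.5, 1.0]] * len(outcomes)
alg_loss, cum = static_expert(expert_preds, outcomes)
bound = min(cum) + 2.0 * math.log(3)                   # L_T(i*) + c ln n
```

Note that the regret term $c \ln n$ does not grow with $T$, so the per-step regret of Static-expert vanishes on long sequences.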

Fixed-share (α)

Fixed-share (α) algorithm:

assumes that shifting occurs and is thus suited to non-stationary sequences
the identity of the best expert changes with time (this is reflected in the update on $p_t(i)$ - we cannot simply use an update of the form $\frac{1}{Z_t} p_{t-1}(i)\, e^{-\eta L_{t-1}(i)}$, because before an expert becomes optimal for a segment of a sequence its loss in prior segments may be arbitrarily large, and thus its weight may become arbitrarily small)
experts share a fixed fraction of their weights with each other - this guarantees that the ratio of the weight of any expert to the total weight of all the experts is bounded from below
Fixed-share (α)

Let $L_T(\alpha)$ be the cumulative loss of the Fixed-share (α) algorithm, where $\alpha \in [0, 1]$. Previously this was denoted as $L_T(\text{alg})$.

Theorem 4 (Regret bound for the Fixed-share (α))

For any sequence of length $T$, and for any $s < T$, consider the best partition, computed in hindsight, of the sequence into $s + 1$ segments, mapping each segment to its best expert. Then
$$L_T(\alpha) \le c\eta L_T(\text{best } s\text{-partition}) + c\left[\ln n + s \ln(n - 1) + (T - 1)\left(H(\alpha^*) + D(\alpha^* \| \alpha)\right)\right],$$
where $H(p) = p \ln \frac{1}{p} + (1 - p) \ln \frac{1}{1-p}$ is the binary entropy and $D(p \| q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1-p}{1-q}$ is the binary relative entropy. $\alpha^* = s/(T - 1)$ is the hindsight-optimal setting of the switching rate parameter $\alpha$ given $s$.
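To get a feel for the additive term in this bound, the following sketch evaluates $c[\ln n + s \ln(n-1) + (T-1)(H(\alpha^*) + D(\alpha^* \| \alpha))]$ for a matched and a mismatched switching rate. The helper names and the values of $T$, $s$, $n$, $\alpha$ are assumptions chosen for illustration.

```python
import math

def binary_entropy(p):
    """H(p) in nats, with the convention 0 * ln(1/0) = 0."""
    return sum(-q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def binary_kl(p, q):
    """D(p || q) for Bernoulli parameters, with 0 * ln 0 = 0; requires 0 < q < 1."""
    terms = ((p, q), (1.0 - p, 1.0 - q))
    return sum(a * math.log(a / b) for a, b in terms if a > 0.0)

def fixed_share_overhead(T, s, n, alpha, c=1.0):
    """Additive term of the Fixed-share bound for s switches among n experts."""
    alpha_star = s / (T - 1)                 # hindsight-optimal switching rate
    return c * (math.log(n) + s * math.log(n - 1)
                + (T - 1) * (binary_entropy(alpha_star) + binary_kl(alpha_star, alpha)))

T, s, n = 1001, 10, 5
overhead_opt = fixed_share_overhead(T, s, n, alpha=s / (T - 1))  # alpha = alpha*
overhead_off = fixed_share_overhead(T, s, n, alpha=0.2)          # mismatched alpha
```

Even with $\alpha = \alpha^*$ the term $(T-1)H(\alpha^*)$ remains, whereas the penalty $(T-1)D(\alpha^* \| \alpha)$ is what a poorly chosen $\alpha$ adds on top.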
Fixed-share (α)

We will show another bound of a simpler form for the same algorithm.
Theorem 5 (Regret bound for the Fixed-share (α) (constants c, η are omitted))
Let $L_T(\alpha^*) = \min_\alpha L_T(\alpha)$ be the cumulative loss of the best Fixed-share (α) algorithm chosen in hindsight. Then
$$L_T(\alpha) \le L_T(\alpha^*) + (T - 1) D(\alpha^* \| \alpha).$$
the regret term in this bound, as opposed to the previous one, vanishes when $\alpha = \alpha^*$
the bound does not directly depend on the number of experts, though it may depend on it indirectly through $\alpha^*$
Learn-α
Intuitively, the switching rate $\alpha$ is another parameter to learn, but how to learn it?
This problem can be viewed as finding the single best "α-expert", where the collection of α-experts is given by Fixed-share (α) algorithms running with different switching rates $\alpha$ (we have $m$ α-experts: $\{\text{Fixed-share}(\alpha_j)\}_{j=1}^{m}$).

Figure: Figure from C. Monteleoni PhD Thesis. The hierarchy of experts and
“α-experts”, Fixed-share algorithms running with different settings of α, maintained
by the Learn-α algorithm.
Learn-α

Theorem 6 (Regret bound for the Learn-α (constants c, η are omitted))

Let $L_T^{top}$ be the loss of the hierarchical Learn-α algorithm and $m$ be any search resolution. Let $L_T(\alpha^*) = \min_\alpha L_T(\alpha)$, as before, be the cumulative loss of the best Fixed-share (α) algorithm chosen in hindsight and let $\alpha_{j^*}$ be the best discrete choice (at resolution $m$) of the switching rate chosen in hindsight for the sequence. Then
$$L_T^{top} \le L_T(\alpha^*) + \ln m + (T - 1) D(\alpha^* \| \alpha_{j^*}).$$
One can show that $m^* = O(\sqrt{T})$ is the optimal resolution (the one for which the upper bound on regret, or in other words the worst-case regret, is minimized) at which to learn $\alpha$.
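The two-level structure can be sketched as follows: several Fixed-share instances run in parallel over the same experts, and a top-level Static-expert update maintains a distribution over them. This is a simplified illustration under stated assumptions: squared loss throughout, each α-expert's loss taken to be the squared loss of that instance's own prediction (the thesis may define the α-expert loss differently), and a hand-picked set of switching rates and data.

```python
import math

def share(p, losses, alpha, eta):
    """Fixed-share(alpha) update over the base experts (requires n >= 2)."""
    n = len(p)
    v = [pi * math.exp(-eta * li) for pi, li in zip(p, losses)]
    total = sum(v)
    return [((1 - alpha) * vi + alpha / (n - 1) * (total - vi)) / total for vi in v]

def learn_alpha(expert_preds, outcomes, alphas, eta=0.5):
    n, m = len(expert_preds[0]), len(alphas)
    p_low = [[1.0 / n] * n for _ in alphas]   # one Fixed-share distribution per alpha
    p_top = [1.0 / m] * m                     # distribution over the alpha-experts
    total_loss = 0.0
    for e, y in zip(expert_preds, outcomes):
        sub_preds = [sum(pi * ei for pi, ei in zip(p, e)) for p in p_low]
        y_hat = sum(pt * sp for pt, sp in zip(p_top, sub_preds))
        total_loss += (y - y_hat) ** 2
        # Top level: Static-expert update on the alpha-experts' own losses (assumed).
        p_top = [pt * math.exp(-eta * (y - sp) ** 2) for pt, sp in zip(p_top, sub_preds)]
        z = sum(p_top)
        p_top = [pt / z for pt in p_top]
        # Bottom level: each Fixed-share(alpha_j) updates over the n base experts.
        losses = [(y - ei) ** 2 for ei in e]
        p_low = [share(p, losses, a, eta) for p, a in zip(p_low, alphas)]
    return total_loss, p_top

# Hypothetical shifting sequence: the best constant expert switches halfway.
outcomes = [0.0] * 50 + [1.0] * 50
expert_preds = [[0.0, 1.0]] * len(outcomes)
alphas = [0.0, 0.01, 0.1]
total_loss, p_top = learn_alpha(expert_preds, outcomes, alphas)
```

On this shifting sequence, the α-expert with $\alpha = 0$ recovers slowly after the switch, so the top-level distribution moves its mass to the instances with $\alpha > 0$.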
Online multi-class classification
We next stay in the online setting, but leave the experts setting, and focus on the classification problem.
Problem setting:
classification with a large number of classes (large = 10 classes 15 years ago; 1000/10000/100000/1000000 classes and more today)
data is accessed online

Goal: good predictor with logarithmic training and testing time

Most multi-class algorithms run in $O(k)$ time, where $k$ is the number of classes. The lower bound is $\Omega(\log k)$.
Online multi-class classification

Deep representation learning:

Computation in the last layer can blow up...

k - total number of classes is large!!!

Real-world data have billions of labels.
Applications: search engines, targeted advertising, aggregation of online
news stories and their categorization etc.

Our approach:
reduction to tree-structured binary classification
top-down approach for class partitioning allowing gradient descent style
optimization
Existing approaches: intractable/do not learn the tree structure/not online
Online multi-class classification
h - hypothesis inducing the split, x - data point
Online multi-class classification

Our approach is based on a new splitting criterion leading to a balanced

tree (⇒ logarithmic training and testing time) with small entropy of
leaves.
Online multi-class classification


$k_r(y)$: # data points of class $y$ to the right of the partitioning
$k(y)$: total # data points in class $y$
$n_r$: # data points to the right of the partitioning
$n$: total # data points
$h \in \mathcal{H}$ - hypothesis creating the partition, e.g. linear classifier
Measure of balancedness: $\frac{n_r}{n}$
Measure of purity: $\frac{k_r(y)}{k(y)}$

balancedness $= P(h(x) > 0)$

purity $= \sum_{y=1}^{k} \pi_y \min\left(P(h(x) > 0 | y), P(h(x) < 0 | y)\right)$, where $\pi_y = \frac{k(y)}{n}$
Online multi-class classification

$$J(h) = 2 E_y\left[\,|P(h(x) > 0) - P(h(x) > 0 | y)|\,\right] = 2 \sum_{y=1}^{k} |P(h(x) > 0)\,\pi_y - P(h(x) > 0, y)|$$

$J(h)$ ⇒ Splitting criterion (objective function)

Given a set of $n$ examples each with one of $k$ labels, find a partitioner $h$ that maximizes the objective.

Lemma 7
For any hypothesis $h : X \mapsto \{-1, 1\}$, $J(h) \in [0, 1]$.
$h$ induces a perfectly pure and perfectly balanced partition iff $J(h) = 1$.
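The objective can be computed directly from per-class counts, using the empirical estimates $P(h(x) > 0) = n_r/n$, $\pi_y = k(y)/n$ and $P(h(x) > 0, y) = k_r(y)/n$. The function name `splitting_objective` and the three example splits are assumptions made for illustration.

```python
def splitting_objective(right_counts, total_counts):
    """J(h) = 2 * sum_y | P(h>0) * pi_y - P(h>0, y) | from per-class counts."""
    n = sum(total_counts)
    n_right = sum(right_counts)
    balance = n_right / n                                   # P(h(x) > 0)
    return 2.0 * sum(abs(balance * k / n - kr / n)
                     for kr, k in zip(right_counts, total_counts))

# Four classes with 10 points each.
totals = [10, 10, 10, 10]
j_perfect = splitting_objective([10, 10, 0, 0], totals)     # pure and balanced
j_useless = splitting_objective([5, 5, 5, 5], totals)       # balanced but impure
j_one_side = splitting_objective([10, 10, 10, 10], totals)  # everything sent right
```

The three cases illustrate the lemma: the objective reaches 1 only for the split that is both pure and balanced, and drops to 0 when the split carries no information about the class.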
Online multi-class classification
Balancing factor
$$\text{balancedness} \in \left[\frac{1 - \sqrt{1 - J(h)}}{2}, \frac{1 + \sqrt{1 - J(h)}}{2}\right]$$

Purity factor
$$\text{purity} \le \frac{2 - J(h)}{4 \cdot \text{balancedness}} - \text{balancedness}$$

Figure: Left: the bounds $y = \frac{1 \pm \sqrt{1 - J(h)}}{2}$ on balancedness as a function of $J(h)$. Right: the purity bound $y = \frac{2 - J(h)}{4 \cdot \text{balance}} - \text{balance}$ for balance $= 1/2$ as a function of $J(h)$.
Online multi-class classification

In each node of the tree $T$ optimize the splitting criterion
Apply recursively to construct a tree structure
Measure the quality of the tree using entropy
$$G_T = \sum_{\ell \in \text{leaves of } T} w_\ell \sum_{y=1}^{k} \pi_{\ell,y} \ln \frac{1}{\pi_{\ell,y}},$$
where $\pi_{\ell,y}$ is the probability that a data point $x$ has label $y$ given that $x$ reaches node $\ell$ and $w_\ell$ is the weight of leaf $\ell$ (probability that $x$ reaches leaf $\ell$).

Why?
Small entropy of leaves ⇒ pure leaves

Goal: maximizing the objective reduces the entropy
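The entropy of a tree can be estimated from the class counts in its leaves, with $w_\ell$ and $\pi_{\ell,y}$ replaced by empirical frequencies. The function name `tree_entropy` and the two example trees are assumptions for illustration.

```python
import math

def tree_entropy(leaf_class_counts):
    """G_T = sum over leaves of w_l * sum_y pi_{l,y} * ln(1 / pi_{l,y})."""
    n = sum(sum(counts) for counts in leaf_class_counts)
    g = 0.0
    for counts in leaf_class_counts:
        n_leaf = sum(counts)
        if n_leaf == 0:
            continue                                  # an empty leaf has weight 0
        w_leaf = n_leaf / n                           # probability of reaching the leaf
        g += w_leaf * sum(-(c / n_leaf) * math.log(c / n_leaf) for c in counts if c > 0)
    return g

pure = tree_entropy([[10, 0], [0, 10]])               # each leaf holds one class
mixed = tree_entropy([[5, 5], [5, 5]])                # each leaf is a 50/50 mix
```

Pure leaves give zero entropy, while leaves that mix two classes evenly give $\ln 2$, the maximum for two classes.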

Online multi-class classification

Definition 8 (Weak Hypothesis Assumption)

For any distribution $P$ over an input space $X$, at each node of the tree there exists a hypothesis $h$ such that $J(h)/2 \ge \gamma$, where $\gamma > 0$.

Theorem 9 (Boosting theorem for LOM tree)

Under the Weak Hypothesis Assumption, for any $\epsilon \in [0, 1]$, to obtain $G_T \le \epsilon$ it suffices to make $\left(\frac{1}{\epsilon}\right)^{\frac{4(1-\gamma)^2 \ln k}{\gamma^2}}$ splits.

$$\text{Tree depth} \approx \log\left[\left(\frac{1}{\epsilon}\right)^{\frac{4(1-\gamma)^2 \ln k}{\gamma^2}}\right] = O(\ln k) \;\Rightarrow\; \text{logarithmic training and testing time}$$
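To see the logarithmic dependence on the number of classes concretely, the sketch below evaluates the logarithm of the sufficient number of splits, i.e. $\frac{4(1-\gamma)^2 \ln k}{\gamma^2} \ln\frac{1}{\text{eps}}$, where `eps` denotes the target entropy level. The function name, the use of the natural logarithm, and the parameter values are assumptions made for illustration.

```python
import math

def log_num_splits(eps, gamma, k):
    """ln of (1/eps)^(4 (1-gamma)^2 ln k / gamma^2), the sufficient number of splits."""
    return 4.0 * (1.0 - gamma) ** 2 * math.log(k) / gamma ** 2 * math.log(1.0 / eps)

depth_1k = log_num_splits(eps=0.5, gamma=0.4, k=1_000)
depth_1m = log_num_splits(eps=0.5, gamma=0.4, k=1_000_000)
```

Going from a thousand to a million classes only doubles this quantity, since it is linear in $\ln k$ for fixed `eps` and $\gamma$.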

Online multi-class classification

Recall the objective function we consider at every tree node
$$J(h) = 2 E_y\left[\,\left|E_x[\mathbb{1}(h(x) > 0)] - E_x[\mathbb{1}(h(x) > 0) \,|\, y]\right|\,\right].$$
Problem: discrete optimization

Relaxation: drop the indicator operator and look at margins
The objective function becomes
$$J(h) = 2 E_y\left[\,\left|E_x[h(x)] - E_x[h(x) \,|\, y]\right|\,\right].$$
Keep the online empirical estimates of these expectations.

The sign of the difference of two expectations decides whether to send an example to the left or right child node.
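The bookkeeping described here can be sketched as a small class that keeps running means of $h(x)$ at one node, overall and per class. The class name `NodeStats`, the convention that a class whose mean exceeds the overall mean is sent right, and the toy margins are all assumptions made for illustration; the published algorithm may use a different convention.

```python
class NodeStats:
    """Running means of h(x) at one tree node: overall and per class."""

    def __init__(self):
        self.n, self.mean = 0, 0.0
        self.class_n, self.class_mean = {}, {}

    def route(self, label):
        """Return +1 (right) or -1 (left); the direction convention is assumed."""
        return 1 if self.class_mean.get(label, 0.0) > self.mean else -1

    def update(self, label, h_value):
        """Incrementally update E[h(x)] and E[h(x) | y = label]."""
        self.n += 1
        self.mean += (h_value - self.mean) / self.n
        k = self.class_n.get(label, 0) + 1
        self.class_n[label] = k
        m = self.class_mean.get(label, 0.0)
        self.class_mean[label] = m + (h_value - m) / k

    def objective(self):
        """Empirical relaxed J(h) = 2 * sum_y pi_y * |E[h(x)] - E[h(x) | y]|."""
        return 2.0 * sum(self.class_n[y] / self.n * abs(self.mean - self.class_mean[y])
                         for y in self.class_n)

stats = NodeStats()
# Hypothetical margins h(x): class "a" gets positive scores, class "b" negative ones.
for label, h_value in [("a", 0.5), ("b", -0.5), ("a", 0.5), ("b", -0.5)]:
    stats.update(label, h_value)
```

Each update costs constant time per example, which is what makes it possible to maintain the estimates online at every node.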
Online multi-class classification