Online Learning Lecture Notes 2011 Oct 20
1 Shooting Game
1.1 Exercises
6 Tracking
6.1 The problem of tracking
6.2 Fixed-share forecaster
6.3 Analysis
6.4 Variable-share forecaster
6.5 Exercises
7 Linear classification with Perceptron
7.1 The Perceptron Algorithm
7.2 Analysis for Linearly Separable Data
7.3 Analysis in the General Case
7.4 Exercises
10 Least Squares
10.1 Analysis
10.2 Ridge Regression with Projections
10.2.1 Analysis of Regret
10.3 Directional Strong Convexity
10.4 Exercises
11 Exp-concave Functions
11.1 Exercises
15 Multi-Armed Bandits
15.1 Exp3-γ algorithm
15.2 A high probability bound for the Exp3.P algorithm
Chapter 1
Shooting Game
• We suffer a loss $\ell(\hat p_t, y_t) = \|\hat p_t - y_t\|^2$.
If we knew the sequence y1 , y2 , . . . , yn ahead of time and wanted to use the best fixed point
to predict, we would choose the point that minimizes the total loss:
$$p_n^* = \operatorname*{argmin}_{p \in \mathbb{R}^d} \sum_{t=1}^n \ell(p, y_t) = \operatorname*{argmin}_{p \in \mathbb{R}^d} \sum_{t=1}^n \|p - y_t\|^2$$
For the loss function $\ell(p, y) = \|p - y\|^2$ it is not hard to calculate $p_n^*$ explicitly:
$$p_n^* = \frac{1}{n} \sum_{t=1}^n y_t$$
A particularly good algorithm for the shooting game is the Follow-The-Leader (FTL)
algorithm, which chooses the point that minimizes the loss on the points y1 , y2 , . . . , yt−1 seen
so far:
$$\hat p_t = \operatorname*{argmin}_{p \in \mathbb{R}^d} \sum_{s=1}^{t-1} \ell(p, y_s) = \operatorname*{argmin}_{p \in \mathbb{R}^d} \sum_{s=1}^{t-1} \|p - y_s\|^2$$
(The point $\hat p_t$ is the current “leader”.) Again, there exists an explicit formula for $\hat p_t$:
$$\hat p_t = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s$$
Note that, using our notation, $\hat p_t = p_{t-1}^*$. For concreteness, we define $\hat p_1 = p_0^* = 0$.
We analyze the regret of the Follow-The-Leader algorithm using the following lemma:
Lemma 1.1 (Hannan’s “Follow The Leader–Be The Leader” lemma). For any sequence $y_1, y_2, \ldots, y_n \in \mathbb{R}^d$ we have
$$\sum_{t=1}^n \ell(p_t^*, y_t) \le \sum_{t=1}^n \ell(p_n^*, y_t)\,.$$
Proof. We prove the lemma by induction on $n$. For $n = 1$ the inequality is equivalent to
$$\ell(p_1^*, y_1) \le \ell(p_1^*, y_1)\,,$$
which is trivially satisfied with equality. Now, take some n ≥ 2 and assume that the desired
inequality holds up to n − 1. Our goal is to prove that it also holds for n, i.e., to show that
$$\sum_{t=1}^n \ell(p_t^*, y_t) \le \sum_{t=1}^n \ell(p_n^*, y_t)\,,$$
which we prove by writing it as a chain of two inequalities, each of which is easy to justify:
$$\sum_{t=1}^{n-1} \ell(p_t^*, y_t) \le \sum_{t=1}^{n-1} \ell(p_{n-1}^*, y_t) \le \sum_{t=1}^{n-1} \ell(p_n^*, y_t)\,.$$
The first inequality holds by the induction hypothesis. The second follows since $p_{n-1}^* = \operatorname{argmin}_{p \in \mathbb{R}^d} \sum_{t=1}^{n-1} \ell(p, y_t)$ is a minimizer of the sum of losses, so the sum of losses evaluated at $p_n^*$ must be at least as large as the sum evaluated at $p_{n-1}^*$. Adding the common term $\ell(p_n^*, y_n)$ to both ends of the chain then yields the claim for $n$. This finishes the proof.
We use the lemma to upper bound the regret of the Follow-The-Leader algorithm:
$$R_n = \sum_{t=1}^n \ell(\hat p_t, y_t) - \sum_{t=1}^n \ell(p_n^*, y_t) = \sum_{t=1}^n \ell(p_{t-1}^*, y_t) - \sum_{t=1}^n \ell(p_n^*, y_t) \le \sum_{t=1}^n \ell(p_{t-1}^*, y_t) - \sum_{t=1}^n \ell(p_t^*, y_t)\,.$$
Now, the rightmost expression can be written as follows (for vectors $u, v \in \mathbb{R}^d$, $\langle u, v \rangle = u^\top v$ denotes the inner product and $\|v\| = \sqrt{\langle v, v \rangle}$):
$$\sum_{t=1}^n \ell(p_{t-1}^*, y_t) - \sum_{t=1}^n \ell(p_t^*, y_t) = \sum_{t=1}^n \left( \|p_{t-1}^* - y_t\|^2 - \|p_t^* - y_t\|^2 \right) = \sum_{t=1}^n \langle p_{t-1}^* - p_t^*,\, p_{t-1}^* + p_t^* - 2y_t \rangle\,,$$
where we have used that $\|u\|^2 - \|v\|^2 = \langle u - v, u + v \rangle$. The last expression can be further upper bounded by the Cauchy–Schwarz inequality, which states that $|\langle u, v \rangle| \le \|u\| \cdot \|v\|$, and the triangle inequality, which states that $\|u + v\| \le \|u\| + \|v\|$:
$$\sum_{t=1}^n \langle p_{t-1}^* - p_t^*,\, p_{t-1}^* + p_t^* - 2y_t \rangle \le \sum_{t=1}^n \|p_{t-1}^* - p_t^*\| \cdot \|p_{t-1}^* + p_t^* - 2y_t\| \le \sum_{t=1}^n \|p_{t-1}^* - p_t^*\| \left( \|p_{t-1}^*\| + \|p_t^*\| + 2\|y_t\| \right)\,.$$
We use that $y_1, y_2, \ldots, y_n$ have norm at most 1, and thus so do the averages $p_0^*, p_1^*, \ldots, p_n^*$:
$$R_n \le \sum_{t=1}^n \|p_{t-1}^* - p_t^*\| \left( \|p_{t-1}^*\| + \|p_t^*\| + 2\|y_t\| \right) \le 4 \sum_{t=1}^n \|p_{t-1}^* - p_t^*\|\,.$$
It remains to upper bound $\|p_{t-1}^* - p_t^*\|$, which we do by substituting $p_t^* = \frac{(t-1)p_{t-1}^* + y_t}{t}$, using the triangle inequality and $\|y_t\| \le 1$, $\|p_{t-1}^*\| \le 1$:
$$\|p_{t-1}^* - p_t^*\| = \left\| p_{t-1}^* - \frac{(t-1)p_{t-1}^* + y_t}{t} \right\| = \frac{1}{t} \left\| p_{t-1}^* - y_t \right\| \le \frac{1}{t} \left( \|p_{t-1}^*\| + \|y_t\| \right) \le \frac{2}{t}\,.$$
In summary,
$$R_n \le 4 \sum_{t=1}^n \|p_{t-1}^* - p_t^*\| \le 8 \sum_{t=1}^n \frac{1}{t} \le 8(1 + \ln n)\,.$$
The last inequality, $\sum_{t=1}^n \frac{1}{t} \le 1 + \ln n$, is a well-known bound on the harmonic numbers.
Theorem 1.2 (FTL Regret Bound). If $y_1, y_2, \ldots, y_n$ is any sequence of points of the unit ball of $\mathbb{R}^d$, then the Follow-The-Leader algorithm that in round $t$ predicts $\hat p_t = p_{t-1}^* = \frac{1}{t-1} \sum_{s=1}^{t-1} y_s$ has regret at most $8(1 + \ln n)$.
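To make the algorithm concrete, here is a minimal Python sketch of FTL for the shooting game (the code and the random test data are our own illustration, not part of the original notes):

```python
import numpy as np

def ftl_shooting(ys):
    """Follow-The-Leader on the outcome sequence ys (an n x d array).

    In round t the algorithm predicts the mean of y_1, ..., y_{t-1}
    (and p_1 = 0), then suffers the loss ||p_t - y_t||^2.
    Returns the regret against the best fixed point in hindsight.
    """
    n, d = ys.shape
    p = np.zeros(d)                       # p_1 = p*_0 = 0
    total, running_sum = 0.0, np.zeros(d)
    for t, y in enumerate(ys, start=1):
        total += np.sum((p - y) ** 2)     # loss of the algorithm in round t
        running_sum += y
        p = running_sum / t               # new leader p*_t = (1/t) sum_{s<=t} y_s
    best = ys.mean(axis=0)                # p*_n, the best fixed point
    return total - np.sum((ys - best) ** 2)

# Points in the unit ball of R^2; by Theorem 1.2 the regret stays O(log n).
rng = np.random.default_rng(0)
ys = rng.normal(size=(1000, 2))
ys /= np.maximum(1.0, np.linalg.norm(ys, axis=1, keepdims=True))
print(ftl_shooting(ys))
```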
1.1 Exercises
Exercise 1.1.
(a) Consider an arbitrary deterministic online algorithm A for the shooting game. Show
that for any n ≥ 1 there exists a sequence y1 , y2 , . . . , yn such that A has non-positive
regret. Justify your answer.
(b) Consider an arbitrary deterministic online algorithm A for the shooting game. Show
that for any n ≥ 1 there exists a sequence y1 , y2 , . . . , yn in the unit ball of Rd such that
A has non-negative regret. Justify your answer.
Hint: Imagine that the algorithm is given to you as a subroutine, which you can call at any time. More precisely, you will have two functions: init and predict. The first initializes the algorithm’s internal variables (whatever they are) and returns the algorithm’s first prediction; the second, given the previous outcome, returns the new prediction. That the algorithm is deterministic means that if you call it with the same sequence of outcomes, it will give the same sequence of predictions. Can you write code that produces an outcome sequence making the regret behave as desired? Remember, your code is free to use the above two functions! Do not forget to justify why your solution works.
(b) According to Theorem 1.2, the regret of the FTL algorithm on any sequence y1 , y2 , . . . , yn
in the unit ball of Rd is at most O(log n). This problem asks you to show that the upper
bound O(log n) for FTL is tight. More precisely, for any n ≥ 1 construct a sequence
y1 , y2 , . . . , yn in the unit ball such that FTL has regret at least Ω(log n).
Hint for Part (b): Consider the sequence yt = (−1)t v where v is an arbitrary unit vector.
First solve the case when n is even. The inequality 1 + 1/2 + · · · + 1/n ≥ ln n might be
useful.
Exercise 1.3. (Stochastic model) Consider the situation when $y_1, y_2, \ldots, y_n$ are generated i.i.d. (independently and identically distributed) according to some probability distribution. For this case, we have seen in class an alternative definition of regret:
$$\widetilde R_n = \sum_{t=1}^n \ell(\hat p_t, y_t) - \min_{p \in \mathbb{R}^d} \sum_{t=1}^n \mathbb{E}[\ell(p, y_t)]\,.$$
Chapter 2
Consider the prediction problem in which we want to predict whether tomorrow will be rainy or sunny. At our disposal, we have $N$ experts that predict rain/sunshine. We do this repeatedly over a span of $n$ days. In round (day) $t = 1, 2, \ldots, n$:
(The symbol $\mathbb{I}$ denotes the indicator function. If $\ell(\hat p_t, y_t) = 1$ we make a mistake, and if $\ell(f_{i,t}, y_t) = 1$, expert $i$ makes a mistake.) We are interested in algorithms which make as few mistakes as possible.
• Let $S_{t-1} \subseteq \{1, 2, \ldots, N\}$ be the set of experts that did not make any mistake so far.
• The algorithm receives the experts’ predictions $f_{1,t}, f_{2,t}, \ldots, f_{N,t} \in \{0,1\}$.
• The set $S_{t-1}$ is split into $S_{t-1}^0 = \{i \in S_{t-1} : f_{i,t} = 0\}$ and $S_{t-1}^1 = \{i \in S_{t-1} : f_{i,t} = 1\}$.
• If $|S_{t-1}^0| > |S_{t-1}^1|$, the algorithm predicts $\hat p_t = 0$, otherwise it predicts $\hat p_t = 1$.
• The algorithm then receives $y_t$ (and suffers the loss $\ell(\hat p_t, y_t) = \mathbb{I}\{\hat p_t \ne y_t\}$).
• The set of alive experts is updated: $S_t = S_{t-1}^{y_t}$.
2.1.1 Analysis
If in round $t$ the Halving Algorithm makes a mistake, then $|S_t| = |S_{t-1}^{y_t}| \le |S_{t-1}|/2$. Since we assume that there is an expert that never makes a mistake, we have $|S_t| \ge 1$ at all time steps, and therefore the algorithm cannot make more than $\log_2 |S_0| = \log_2 N$ mistakes.
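The argument above is easy to mirror in code; the following Python sketch of the Halving Algorithm is our own illustration (it assumes, as the analysis does, that at least one expert never errs):

```python
def halving(predictions, outcomes):
    """Halving Algorithm: follow the majority of the still-perfect experts.

    predictions[t][i] is expert i's {0,1} forecast in round t;
    outcomes[t] is y_t. Returns the number of mistakes, which is at most
    log2(N) when some expert makes no mistakes at all.
    """
    alive = set(range(len(predictions[0])))      # S_0 = {1, ..., N}
    mistakes = 0
    for f_t, y_t in zip(predictions, outcomes):
        ones = sum(1 for i in alive if f_t[i] == 1)
        p_t = 0 if len(alive) - ones > ones else 1   # majority vote of S_{t-1}
        mistakes += int(p_t != y_t)
        alive = {i for i in alive if f_t[i] == y_t}  # S_t = S_{t-1}^{y_t}
    return mistakes
```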
$$\hat p_t = \mathbb{I}\Big\{ \sum_{i=1}^N w_{i,t-1} f_{i,t} > \sum_{i=1}^N w_{i,t-1} (1 - f_{i,t}) \Big\}\,.$$
2.2.1 Analysis
To analyze the number of mistakes of WMA, we use the notation $W_t = \sum_{i=1}^N w_{i,t}$ for the total weight after round $t$ and, for a set $J$ of experts, $W_{t-1,J} = \sum_{i \in J} w_{i,t-1}$; here $J_{t,\mathrm{good}}$ and $J_{t,\mathrm{bad}}$ denote the sets of experts whose round-$t$ prediction is correct and incorrect, respectively.

Claim 2.1. $W_t \le W_{t-1}$, and if $\hat p_t \ne y_t$ then $W_t \le \frac{1+\beta}{2} W_{t-1}$.
Proof. Since $w_{i,t} \le w_{i,t-1}$ we have $W_t \le W_{t-1}$. Now, if $\hat p_t \ne y_t$ then the good experts were in the minority:
$$W_{t-1,J_{t,\mathrm{good}}} \le W_{t-1}/2\,.$$
Using this inequality we can upper bound $W_t$:
$$W_t = W_{t-1,J_{t,\mathrm{good}}} + \beta W_{t-1,J_{t,\mathrm{bad}}} = W_{t-1,J_{t,\mathrm{good}}} + \beta\big(W_{t-1} - W_{t-1,J_{t,\mathrm{good}}}\big) = (1-\beta) W_{t-1,J_{t,\mathrm{good}}} + \beta W_{t-1} \le (1-\beta) \frac{W_{t-1}}{2} + \beta W_{t-1} = \frac{1+\beta}{2} W_{t-1}\,,$$
and we’re done.
The claim we’ve just proved allows us to upper bound $W_n$ in terms of $\hat L_n = \sum_{t=1}^n \mathbb{I}\{\hat p_t \ne y_t\}$, the number of mistakes made by the WMA:
$$W_n \le \left( \frac{1+\beta}{2} \right)^{\hat L_n} W_0\,.$$
On the other hand, if $L_{i,n} = \sum_{t=1}^n \mathbb{I}\{f_{i,t} \ne y_t\}$ denotes the number of mistakes made by expert $i$, the weight of expert $i$ at the end of the algorithm is $w_{i,n} = \beta^{L_{i,n}}$, which in particular means that
$$\beta^{L_{i,n}} = w_{i,n} \le W_n\,.$$
Putting these two inequalities together and using that $W_0 = N$, we get
$$\beta^{L_{i,n}} \le \left( \frac{1+\beta}{2} \right)^{\hat L_n} N\,.$$
Taking logarithms and rearranging, we get an upper bound on the number of mistakes of WMA:
$$\hat L_n \le \frac{\ln\frac{1}{\beta}\, L_{i,n} + \ln N}{\ln\frac{2}{1+\beta}}\,,$$
which holds for any expert $i$. If $L_n^* = \min_{1 \le i \le N} L_{i,n}$ denotes the loss of the best expert, we get
$$\hat L_n \le \frac{\ln\frac{1}{\beta}\, L_n^* + \ln N}{\ln\frac{2}{1+\beta}}\,.$$
Note that WMA with $\beta = 0$ coincides with the Halving Algorithm, and if $L_n^* = 0$ we recover the bound $\hat L_n \le \log_2 N$, which was shown to hold for the Halving Algorithm.
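A few lines of Python suffice to implement WMA as well; the sketch below is our own illustration of the weighted vote and the multiplicative update $w_{i,t} = \beta^{L_{i,t}}$:

```python
def weighted_majority(predictions, outcomes, beta=0.5):
    """Weighted Majority Algorithm with penalty factor beta in [0, 1).

    predictions[t][i] is expert i's {0,1} forecast in round t; outcomes[t]
    is y_t. Each expert starts with weight 1; a wrong expert's weight is
    multiplied by beta (beta = 0 recovers the Halving Algorithm).
    """
    w = [1.0] * len(predictions[0])
    mistakes = 0
    for f_t, y_t in zip(predictions, outcomes):
        mass_one = sum(wi for wi, fi in zip(w, f_t) if fi == 1)
        p_t = 1 if mass_one > sum(w) - mass_one else 0   # weighted vote
        mistakes += int(p_t != y_t)
        w = [wi * (beta if fi != y_t else 1.0) for wi, fi in zip(w, f_t)]
    return mistakes
```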
Chapter 3
We consider the situation when the experts’ predictions are continuous. Each expert $i = 1, 2, \ldots, N$ in round $t$ predicts $f_{i,t} \in D$, where $D$ is a convex subset of some vector space.¹ In each round $t = 1, 2, \ldots, n$ we play a game according to the following protocol:
The set $Y$ of outcomes can be arbitrary; however, we will make two assumptions on the loss function $\ell : D \times Y \to \mathbb{R}$:
where $\hat L_n = \sum_{t=1}^n \ell(\hat p_t, y_t)$ is the cumulated loss of the algorithm and $L_{i,n} = \sum_{t=1}^n \ell(f_{i,t}, y_t)$ is the cumulated loss of expert $i$.
¹Recall that a subset $D$ of a real vector space is convex if for any $x_1, x_2 \in D$ and any real numbers $\alpha, \beta \ge 0$ such that $\alpha + \beta = 1$, the point $\alpha x_1 + \beta x_2$ belongs to $D$. Further, a real-valued function $f$ with a convex domain $D$ is convex if for any $x_1, x_2 \in D$, $\alpha, \beta \ge 0$, $\alpha + \beta = 1$, $f(\alpha x_1 + \beta x_2) \le \alpha f(x_1) + \beta f(x_2)$. An equivalent characterization of convex functions is the following: consider the graph $G_f = \{(p, f(p)) : p \in D\}$ of $f$ (this is a surface). Then $f$ is convex if and only if for any $x \in D$, any hyperplane $H_x$ tangent to $G_f$ at $x$ is below $G_f$, in the sense that for any point $(x', y') \in H_x$, $y' \le f(x')$.
3.1 The Exponentially Weighted Average forecaster
A very good algorithm for this problem is the Exponentially Weighted Average forecaster (EWA). For each expert $i$, EWA maintains a weight $w_{i,t} = e^{-\eta L_{i,t}}$, which depends on the loss $L_{i,t}$ of expert $i$ up to round $t$ and on a parameter $\eta > 0$ which will be chosen later. In each round $t$, EWA predicts by taking the convex combination of the experts’ predictions $f_{1,t}, f_{2,t}, \ldots, f_{N,t}$ with the current weights.
More precisely, EWA is as follows: First, the weights are initialized to w1,0 = w2,0 =
· · · = wN,0 = 1. Further, the sum of the weights is stored in W0 = N . Then, in each round
t = 1, 2, . . . , n EWA does the following steps:
• It receives the experts’ predictions $f_{1,t}, f_{2,t}, \ldots, f_{N,t} \in D$.
• It predicts
$$\hat p_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{W_{t-1}}\,.$$
• Then, the environment reveals an outcome $y_t \in Y$.
• EWA updates $W_t = \sum_{i=1}^N w_{i,t}$.
For numerical stability, in software implementations, instead of working with the weights $w_{i,t}$ and their sum, one would only store and maintain the normalized weights $\hat w_{i,t} = w_{i,t}/W_t$.
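In code, the forecaster is equally short; the Python sketch below is our own illustration and, following the remark above, stores log-weights so that only normalized quantities are ever exponentiated:

```python
import numpy as np

def ewa(forecasts, outcomes, loss, eta):
    """Exponentially Weighted Average forecaster.

    forecasts[t] is an (N, d) array of expert advice for round t,
    outcomes[t] is the outcome y_t, and loss(p, y) is a [0,1]-valued loss,
    convex in p. Returns the cumulated loss of the forecaster.
    """
    log_w = np.zeros(forecasts[0].shape[0])   # log w_{i,0} = 0
    total = 0.0
    for f_t, y_t in zip(forecasts, outcomes):
        w = np.exp(log_w - log_w.max())       # shift for numerical stability
        p_t = (w[:, None] * f_t).sum(axis=0) / w.sum()   # convex combination
        total += loss(p_t, y_t)
        log_w -= eta * np.array([loss(f, y_t) for f in f_t])
    return total
```

With $\eta = \sqrt{8 \ln N / n}$ this is exactly the tuning analyzed in Theorem 3.3.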
3.2 Analysis
To analyze the regret of EWA we will need two results. We prove only the first result.
Lemma 3.1 (Jensen’s inequality). Let $V$ be any real vector space. If $X$ is a $V$-valued random variable such that $\mathbb{E}[X]$ exists and $f : V \to \mathbb{R}$ is a convex function, then
$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]\,.$$
Proof. Consider x̄ = E [X] and the hyperplane Hx̄ which is tangent to the graph Gf of f
at x̄. We can write Hx̄ = {(x0 , ha, x0 i + b) : x0 ∈ V } with some a ∈ V , b ∈ R. Since
f is convex, Hx̄ is below Gf . Thus, ha, Xi + b ≤ f (X). Since Hx̄ is tangent to Gf at x̄,
ha, x̄i + b = f (x̄) = f (E [X]). Taking expectations of both sides, and using the equality just
obtained, we get f (E [X]) = ha, E [X]i + b ≤ E [f (X)], which proves the statement.
Lemma 3.2 (Hoeffding’s lemma). If $X$ is a real-valued random variable lying in $[0,1]$, then for any $s \in \mathbb{R}$,
$$\mathbb{E}[e^{sX}] \le e^{s\mathbb{E}[X] + s^2/8}\,.$$
The analysis will be similar to the analysis of WMA. First we lower bound $W_n$:
$$W_n \ge w_{i,n} = e^{-\eta L_{i,n}}\,,$$
and then upper bound it by upper bounding the factors in
$$W_n = W_0 \cdot \frac{W_1}{W_0} \cdot \frac{W_2}{W_1} \cdots \frac{W_n}{W_{n-1}}\,.$$
Let us thus upper bound the fraction $\frac{W_t}{W_{t-1}}$ for some fixed $t \in \{1, \ldots, n\}$. We can express this fraction as
$$\frac{W_t}{W_{t-1}} = \sum_{i=1}^N \frac{w_{i,t}}{W_{t-1}} = \sum_{i=1}^N \frac{w_{i,t-1} e^{-\eta \ell(f_{i,t}, y_t)}}{W_{t-1}} = \sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}}\, e^{-\eta \ell(f_{i,t}, y_t)} = \mathbb{E}\big[ e^{-\eta \ell(f_{I_t,t}, y_t)} \big]\,,$$
where $I_t$ is a random variable that attains values in the set $\{1, 2, \ldots, N\}$ and has distribution $\Pr(I_t = i) = \frac{w_{i,t-1}}{W_{t-1}}$. Applying Hoeffding’s lemma to the random variable $X = \ell(f_{I_t,t}, y_t)$ (with $s = -\eta$) gives
$$\mathbb{E}\big[ e^{-\eta \ell(f_{I_t,t}, y_t)} \big] \le e^{-\eta \mathbb{E}[\ell(f_{I_t,t}, y_t)] + \eta^2/8}\,.$$
Applying Jensen’s inequality to the expression in the exponent, $\mathbb{E}[\ell(f_{I_t,t}, y_t)]$ (exploiting that $\ell$ is convex in its first argument), and then using the definition of $\hat p_t$, we have
$$\mathbb{E}[\ell(f_{I_t,t}, y_t)] \ge \ell(\mathbb{E}[f_{I_t,t}], y_t) = \ell\left( \sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}} f_{i,t},\, y_t \right) = \ell(\hat p_t, y_t)\,.$$
Taking logarithms, dividing by $\eta > 0$ and rearranging gives us an upper bound on the loss:
$$\hat L_n \le L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta n}{8}\,.$$
We summarize the result in the following theorem:
Theorem 3.3. Assume $D$ is a convex subset of some vector space. Let $\ell : D \times Y \to [0,1]$ be convex in its first argument. Then, the loss of the EWA forecaster is upper bounded by
$$\hat L_n \le \min_{1 \le i \le N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta n}{8}\,.$$
With $\eta = \sqrt{\frac{8 \ln N}{n}}$,
$$\hat L_n \le \min_{1 \le i \le N} L_{i,n} + \sqrt{\frac{n \ln N}{2}}\,.$$
3.3 Exercises
Exercise 3.1. For $A, B > 0$, find $\operatorname{argmin}_{\eta > 0} \left( \frac{A}{\eta} + \eta B \right)$ and also $\min_{\eta > 0} \left( \frac{A}{\eta} + \eta B \right)$.
Exercise 3.2. It is known that if $X$ is a random variable taking values in $[0,1]$, then for any $s \in \mathbb{R}$, $\mathbb{E}\big[e^{sX}\big] \le \exp\big((e^s - 1)\,\mathbb{E}[X]\big)$. For $s > 0$ fixed, this bound becomes smaller than what we would get from Hoeffding’s inequality as $\mathbb{E}[X] \to 0$.
(a) Use this inequality in place of Hoeffding’s inequality in the proof of Theorem 3.3 to prove the bound
$$\hat L_n \le \frac{\eta L_n^* + \ln N}{1 - e^{-\eta}}\,,$$
where $L_n^* = \min_{1 \le i \le N} L_{i,n}$.
(b) Let $\eta = \ln\big(1 + \sqrt{(2 \ln N)/L_n^*}\big)$, where $L_n^*$ is assumed to be positive. Show that in this case
$$\hat L_n \le L_n^* + \sqrt{2 L_n^* \ln N} + \ln N\,.$$
Hint for Part (b): Use $\eta \le \sinh(\eta) = (e^\eta - e^{-\eta})/2$, which is known to hold for all $\eta > 0$, together with the bound of Part (a).
(c) Compare the bound of Part (b) to that of Theorem 3.3. When would you use this bound
and when the original one?
(c) Implement the hierarchical EWA algorithm described above and test it in the environment you have used above. Select $\eta_1, \ldots, \eta_K$ in such a way that you can get interesting results (include the values of the learning rate used in the previous part). Describe your findings. As to the value of $\eta^*$, use the value specified in Theorem 3.3.
(d) Is it worth defining yet another layer of the hierarchy, to “learn” the best value of $\eta^*$? How about yet another layer on top of this? Justify your answer!
Exercise 3.4. Let $L_{it} > 0$, $f_{it} \in D$, and let $\ell : D \times Y \to [0,1]$ be convex in its first argument. For $y \in Y$, $\eta > 0$, define
$$f_t(\eta, y) = \ell\left( \sum_i \frac{\exp(-\eta L_{it})}{\sum_j \exp(-\eta L_{jt})}\, f_{it},\, y \right)\,.$$
If this function were convex as a function of $\eta$, could you use it for tuning $\eta$? How? Let $\ell$ be linear in its first argument. Is this function convex as a function of $\eta$? Justify your answer.
Chapter 4
A discrete prediction problem is one in which the outcome space $Y$ has at least two elements, $D = Y$, and the loss is the zero-one loss: $\ell(p, y) = \mathbb{I}\{p \ne y\}$.
In Chapter 2 we have shown that the Weighted Majority Algorithm (WMA) for binary prediction problems makes at most
$$\hat L_n \le \frac{\log_2\frac{1}{\beta}\, L_n^* + \log_2 N}{\log_2\frac{2}{1+\beta}}$$
mistakes. If we subtract $L_n^*$ from both sides, we get a bound on the regret:
$$R_n = \hat L_n - L_n^* \le \frac{\left(\log_2\frac{1}{\beta} - \log_2\frac{2}{1+\beta}\right) L_n^* + \log_2 N}{\log_2\frac{2}{1+\beta}}\,.$$
Unfortunately, this bound is of the form $R_n \le a L_n^* + b = O(n)$. That’s much worse than the bound $R_n \le \sqrt{\frac{n}{2}\ln N} = O(\sqrt n)$, which we proved for EWA for continuous predictions. This suggests that discrete problems are harder than continuous problems.
That this is true, at least if we stick to deterministic algorithms, is shown as follows: Take $D = Y = \{0,1\}$ and let the loss $\ell(p, y) = \mathbb{I}\{p \ne y\}$ be the zero-one loss. Assume that we have two experts such that the first expert predicts $f_{1,1} = f_{1,2} = \cdots = f_{1,n} = 0$ and the second one predicts $f_{2,1} = f_{2,2} = \cdots = f_{2,n} = 1$. Then, the following hold:
(a) $\min(L_{1,n}, L_{2,n}) \le n/2$.
(b) For any deterministic algorithm $A$ there exists an outcome sequence $y_1^A, y_2^A, \ldots, y_n^A \in \{0,1\}$ such that $\hat L_n = \sum_{t=1}^n \ell(\hat p_t^A, y_t^A) = n$, where $\hat p_t^A$ denotes the prediction of $A$ at time $t$ on the outcome sequence $y_1^A, \ldots, y_n^A$.
Part (a) follows from the fact that $L_{1,n} + L_{2,n} = n$, and therefore at least one of the two terms is at most $n/2$. Part (b) follows by an adversarial argument: Let $y_1^A = 1 - \hat p_1^A$. This is well defined, since the first prediction of $A$ is just some constant, which can thus be used to construct $y_1^A$. Now, for $t \ge 2$, assume that $y_1^A, y_2^A, \ldots, y_{t-1}^A$ have already been constructed, and let $\hat p_t^A$ be the prediction of algorithm $A$ for round $t$, given the constructed sequence. This is again well-defined, since the prediction of $A$ for round $t$ depends in a deterministic fashion on the previous outcomes. Then, set $y_t^A = 1 - \hat p_t^A$. With this construction, $\ell(\hat p_t^A, y_t^A) = 1$ holds for all $t$.
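The adversarial argument is essentially a three-line program; the sketch below is our own illustration, where `algorithm` stands for any deterministic map from past outcomes to a $\{0,1\}$ prediction (an assumed interface, in the spirit of the hint of Exercise 1.1):

```python
def adversarial_outcomes(algorithm, n):
    """Construct y^A_1, ..., y^A_n forcing a mistake in every round.

    Because the algorithm is deterministic, its round-t prediction is a
    function of the outcomes chosen so far, so y_t = 1 - p_t is well defined.
    """
    ys = []
    for _ in range(n):
        p_t = algorithm(tuple(ys))   # prediction given the past outcomes
        ys.append(1 - p_t)           # guarantees loss l(p_t, y_t) = 1
    return ys
```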
From (a) and (b) we get the following result:
Theorem 4.1. Let $Y$ be a set with at least two elements. Consider the discrete prediction problem where the outcome space is $Y$, the decision space is $D = Y$, and the loss is the zero-one loss $\ell(p, y) = \mathbb{I}\{p \ne y\}$. Then, for any deterministic algorithm $A$, in the worst case the regret $R_n^A$ of $A$ can be as large as $n/2$.
• It predicts $f_{I_t,t}$.
• The algorithm suffers the loss $\ell(f_{I_t,t}, y_t)$ and each expert $i = 1, 2, \ldots, N$ suffers a loss $\ell(f_{i,t}, y_t)$.
We do not assume anything about $D$, $Y$, and the loss function $\ell$ does not need to be convex anymore. The only assumption that we make is that $\ell(p, y) \in [0,1]$. Also note that the numbers $\hat p_{1,t}, \hat p_{2,t}, \ldots, \hat p_{N,t}$ are non-negative and sum to 1, and therefore the distribution of $I_t$ is a valid probability distribution.
Since the algorithm randomizes, its regret becomes a random variable. Hence, our state-
ments will be of probabilistic nature: First we show a bound on the expected regret and
then we will argue that with high probability, the actual (random) regret is also bounded by
some “small” quantity.
• $D' = \{p \in [0,1]^N : \sum_{i=1}^N p_i = 1\}$ is the $N$-dimensional probability simplex;
• $Y' = Y \times D^N$;
• $\ell'(p, (y, f_1, f_2, \ldots, f_N)) = \sum_{i=1}^N p_i \cdot \ell(f_i, y)$.
Note that, as promised, $D'$ is convex and $\ell'$ is convex (in fact linear!) in its first argument.
Now, given the expert predictions $f_{i,t}$ and outcomes $y_t$, we define the sequences $(f'_{i,t})$, $(y'_t)$, with $f'_{i,t} \in D'$, $y'_t \in Y'$, as follows:
• $f'_{i,t} = e_i$, where $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ is a vector of length $N$ with $i$-th coordinate equal to 1 and all others equal to 0;
• $y'_t = (y_t, f_{1,t}, \ldots, f_{N,t})$.
With these definitions,
$$\ell'(f'_{i,t}, y'_t) = \ell'(e_i, (y_t, f_{1,t}, \ldots, f_{N,t})) = 0 \cdot \ell(f_{1,t}, y_t) + \cdots + 0 \cdot \ell(f_{i-1,t}, y_t) + 1 \cdot \ell(f_{i,t}, y_t) + 0 \cdot \ell(f_{i+1,t}, y_t) + \cdots + 0 \cdot \ell(f_{N,t}, y_t) = \ell(f_{i,t}, y_t)\,.$$
Therefore, the cumulated losses $L_{i,t} = \sum_{s=1}^t \ell(f_{i,s}, y_s) = \sum_{s=1}^t \ell'(f'_{i,s}, y'_s)$ of the experts are the same in both algorithms. It follows that the weights $w_{i,t}$, the sum of weights $W_t$ and the vectors $\hat p_t = (\hat p_{1,t}, \hat p_{2,t}, \ldots, \hat p_{N,t})$ are also identical between the two algorithms.
Furthermore, the expected loss of the randomized algorithm is the same as the loss of the continuous EWA running on the $f'_{i,t}$’s and $y'_t$’s:
$$\mathbb{E}[\ell(f_{I_t,t}, y_t)] = \sum_{i=1}^N \hat p_{i,t} \cdot \ell(f_{i,t}, y_t) = \ell'(\hat p_t, (y_t, f_{1,t}, f_{2,t}, \ldots, f_{N,t})) = \ell'(\hat p_t, y'_t)\,. \tag{4.1}$$
If $\hat L'_n = \sum_{t=1}^n \ell'(\hat p_t, y'_t)$ denotes the loss of the continuous EWA running on the $f'_{i,t}$’s and $y'_t$’s, then by Theorem 3.3,
$$\hat L'_n \le \min_{1 \le i \le N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta n}{8}\,.$$
If $\hat L_n = \sum_{t=1}^n \ell(f_{I_t,t}, y_t)$ denotes the loss of the randomized EWA, then thanks to (4.1) we see that $\mathbb{E}[\hat L_n] = \hat L'_n$, and therefore
$$\mathbb{E}[\hat L_n] \le \min_{1 \le i \le N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta n}{8}\,.$$
We summarize this result in the following theorem:
Theorem 4.2. Let $D$, $Y$, $\ell : D \times Y \to [0,1]$ be arbitrary. Then, the expected loss of the randomized EWA forecaster is upper bounded as follows:
$$\mathbb{E}[\hat L_n] \le \min_{1 \le i \le N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta n}{8}\,.$$
In particular, with $\eta = \sqrt{\frac{8 \ln N}{n}}$,
$$\mathbb{E}[\hat L_n] \le \min_{1 \le i \le N} L_{i,n} + \sqrt{\frac{n \ln N}{2}}\,.$$
Equivalently, for any $0 < \delta < 1$, with probability at least $1 - \delta$,
$$\sum_{t=1}^n X_t < \sum_{t=1}^n \mathbb{E}[X_t] + \sqrt{\frac{n}{2} \ln(1/\delta)}\,.$$
Remark 4.4. Note that the (upper) tail probability of a zero-mean random variable $Z$ is defined as $\Pr(Z \ge \varepsilon)$ for $\varepsilon > 0$. The first form of the inequality explains why we say that the “tail” of $Z = \sum_{t=1}^n X_t - \sum_{t=1}^n \mathbb{E}[X_t]$ shows a sub-Gaussian behavior: the tail of a Gaussian with variance $\sigma^2 = n/4$ would also decay as $\exp(-2\varepsilon^2/n)$ as $\varepsilon \to \infty$. Therefore, the tail of $Z$ is not “fatter” than that of a Gaussian, which we summarize by saying that the tail of $Z$ is sub-Gaussian.
We apply Hoeffding’s inequality to the losses $X_t = \ell(f_{I_t,t}, y_t)$ of randomized EWA. We get that with probability at least $1 - \delta$,
$$\hat L_n < \mathbb{E}[\hat L_n] + \sqrt{\frac{n}{2} \ln(1/\delta)}\,.$$
This, together with Theorem 4.2 for $\eta = \sqrt{\frac{8 \ln N}{n}}$, gives that with probability at least $1 - \delta$,
$$\hat L_n < \min_{1 \le i \le N} L_{i,n} + \sqrt{\frac{n}{2} \ln N} + \sqrt{\frac{n}{2} \ln(1/\delta)}\,.$$
We summarize the result in a theorem:
Theorem 4.5. Let $D$, $Y$, $\ell : D \times Y \to [0,1]$ be arbitrary. Then, for any $0 < \delta < 1$, the loss of the randomized EWA forecaster with $\eta = \sqrt{\frac{8 \ln N}{n}}$ is, with probability at least $1 - \delta$, upper bounded as follows:
$$\hat L_n < \min_{1 \le i \le N} L_{i,n} + \sqrt{\frac{n}{2}} \left( \sqrt{\ln N} + \sqrt{\ln(1/\delta)} \right)\,.$$
Using $R_n = \hat L_n - \min_{1 \le i \le N} L_{i,n}$, this gives a high-probability regret bound, and in fact shows that the tail of the regret is sub-Gaussian.
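As a concrete illustration of Theorems 4.2 and 4.5 (the code itself is ours, not from the notes), randomized EWA only needs the expert weights and one random draw per round:

```python
import numpy as np

def randomized_ewa(forecasts, outcomes, loss, eta, seed=0):
    """Randomized EWA: draw expert I_t with probability w_{i,t-1}/W_{t-1}.

    forecasts[t] is the list of expert advices for round t, outcomes[t] is
    y_t, and loss takes values in [0, 1]; D, Y and the loss are otherwise
    arbitrary. Returns the (random) cumulated loss.
    """
    rng = np.random.default_rng(seed)
    log_w = np.zeros(len(forecasts[0]))
    total = 0.0
    for f_t, y_t in zip(forecasts, outcomes):
        w = np.exp(log_w - log_w.max())
        i = rng.choice(len(w), p=w / w.sum())   # I_t with Pr(I_t = i) = p_{i,t}
        total += loss(f_t[i], y_t)              # loss actually suffered
        log_w -= eta * np.array([loss(f, y_t) for f in f_t])
    return total
```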
4.4 Exercises
Exercise 4.1. Consider the setting of Exercise 3.3, just now for a discrete prediction
problem. That is, the goal is to find the best learning rate for randomized EWA from a finite
pool {η1 , . . . , ηK }. One possibility is to run a randomized EWA on the top of K randomized
EWA forecasters, each using some learning rate ηk , k = 1, . . . , K. The randomized EWA
“on the top” would thus randomly select one of the randomized EWA forecasters in the
“base”, which would in turn select an expert at random. Another possibility is that if
$p_t^{(0)} = \big(p_{1,t}^{(0)}, \ldots, p_{K,t}^{(0)}\big)^\top \in [0,1]^K$ is the probability vector of the randomized EWA on the top for round $t$, and $p_t^{(k)} = \big(p_{1,t}^{(k)}, \ldots, p_{N,t}^{(k)}\big)^\top \in [0,1]^N$ is the corresponding probability vector of the $k$-th randomized EWA in the base, then just select expert $i$ with probability $\sum_{k=1}^K p_{k,t}^{(0)} p_{i,t}^{(k)}$, i.e., combine the “votes” across the two layers before selecting the expert. Which method would you prefer? Why? Design an experiment to validate your claim.
Chapter 5
Is the EWA of the last two chapters a good algorithm? Can there exist a better algorithm? Or maybe our bound for EWA is loose and in fact EWA is a better algorithm than we think it is? In an exercise you proved that for the shooting game FTL achieves $\Theta(\ln n)$ regret. But can there be a better algorithm than FTL for this problem?
Minimax lower bounds provide answers to questions like these. We shall investigate continuous prediction problems with convex losses (in short, continuous convex problems), since for discrete problems, if we use a randomized algorithm, the problem is effectively transformed into a continuous convex problem, while if no randomization is allowed, we have already seen a lower bound in Theorem 4.1.
where $p_t^{(A)}$ is algorithm $A$’s decision at time $t$ and $f_t^{(F)}$ is the decision of expert $F$ at time $t$. (Of course, the regret depends on $D$, $Y$ and $\ell$, too, so for full precision we should denote this dependence on the right-hand side as well, i.e., we should use, say, $R_n^A(F, \ell)$. However, since we will mostly treat $D, Y, \ell$ as given (fixed), we suppress this dependence.)
The corollary of Theorem 3.3 is this: Fix some sets $D$, $Y$. Let $\mathcal F$ be the set of experts over $D$, $Y$ and let $\mathcal A$ be the set of algorithms. Take $c = 1/\sqrt 2$. Then the following holds true:
(UB) For any loss function $\ell : D \times Y \to [0,1]$, for any horizon $n$, for any positive integer $N$, there exists an algorithm $A \in \mathcal A$ such that for any multiset¹ of experts $F_N \subset \mathcal F$ of cardinality $N$, the regret $R_n^A(F_N)$ of algorithm $A$ satisfies $R_n^A(F_N) \le c\sqrt{n \ln N}$.
For a fixed value of $c$, the negation of the above statement is the following:
(NUB) There exists a loss function $\ell : D \times Y \to [0,1]$, a horizon $n$ and a positive integer $N$, such that for all algorithms $A \in \mathcal A$, there exists a multiset of experts $F_N \subset \mathcal F$ with cardinality $N$ such that the regret $R_n^A(F_N)$ of algorithm $A$ satisfies $R_n^A(F_N) > c\sqrt{n \ln N}$.
Clearly, for any value of $c$, only one of the statements (UB), (NUB) can be true. We want to show that for any $c < 1/\sqrt 2$, (NUB) holds true, since then it follows that the bound for EWA is tight in the sense that there is no better algorithm than EWA in the worst case, and the bound for EWA cannot be improved with respect to its constant, or how it depends on the number of experts $N$ or the length of the horizon $n$.
We will show the result for $D = [0,1]$, $Y = \{0,1\}$ and $\ell(p, y) = |p - y|$.
To show that for any $c < 1/\sqrt 2$, (NUB) holds true, it suffices to prove that there exist $n, N$ such that for any algorithm $A \in \mathcal A$,
$$\sup_{F_N \in \mathcal F,\, |F_N| \le N} R_n^A(F_N) \ge \frac{1}{\sqrt 2} \sqrt{n \ln N}\,.$$
In fact, it suffices to prove that there exist $n, N$ such that for any algorithm $A \in \mathcal A$,
$$\sup_{F_N \in S_n} R_n^A(F_N) \ge \frac{1}{\sqrt 2} \sqrt{n \ln N}\,, \tag{5.1}$$
where $S_n$ is the set of static experts, i.e., the set of experts which decide about their choices ahead of time, independently of what the outcome sequence will be. Thus, for $F \in S_n$, for any $1 \le t \le n$ and $y_1, \ldots, y_{t-1} \in Y$, $F(y_1, \ldots, y_{t-1}) = f_t$ for some fixed sequence $(f_t) \in D^n$, and vice versa: for any fixed sequence $(f_t) \in D^n$, there exists an expert $F$ in $S_n$ such that for any $1 \le t \le n$, $y_1, \ldots, y_{t-1} \in Y$, $F(y_1, \ldots, y_{t-1}) = f_t$. Hence, in what follows we shall identify the set of experts $S_n$ with the set $D^n$ of sequences of length $n$. That inequality (5.1) holds for any algorithm $A \in \mathcal A$ is equivalent to requiring that
$$\inf_{A \in \mathcal A} \sup_{F_N \in S_n} R_n^A(F_N) \ge \frac{1}{\sqrt 2} \sqrt{n \ln N}$$
¹A multiset is similar to a set. The only difference is that multisets can contain elements multiple times. We could use a list (or vector), but we would like to use $\subset$ and $\in$.
holds. Let
$$V_n^{(N)} = \inf_{A \in \mathcal A} \sup_{F_N \in S_n} R_n^A(F_N)\,.$$
5.2 Results
A sequence $(Z_t)_{t=1}^n$ of independent, identically distributed $\{-1,+1\}$-valued random variables with $\Pr(Z_t = 1) = \Pr(Z_t = -1) = 1/2$ is called a Rademacher sequence. When you sum up a Rademacher sequence, you get a random walk on the integers. A matrix with random elements will be called a Rademacher random matrix when its elements form a Rademacher sequence. We will need the following lemma, which shows how the expected value of the maximum of $N$ independent random walks behaves. The results stated in the lemma are purely probabilistic and we do not prove them. The essence of our proof will be a reduction of our problem to this lemma.
$$\lim_{N \to \infty} \frac{\mathbb{E}\left[\max_{1 \le i \le N} G_i\right]}{\sqrt{2 \ln N}} = 1\,.$$
The first part of the lemma states that, asymptotically speaking, the expected maximal excursion of $N$ independent random walks is the same as the expectation of the maximum of $N$ independent Gaussian random variables. Note that for $N = 1$, this result would follow from the central limit theorem, so you can consider this as a generalization of the central limit theorem. The second part of the lemma states that asymptotically, as $N$ gets big, the expectation of the maximum of $N$ independent Gaussian random variables is $\sqrt{2 \ln N}$. Together the two statements say that, asymptotically, as both $n$ and $N$ get large, the expected size of the maximal excursion of $N$ independent random walks is $\sqrt{2n \ln N}$.
As many lower bound proofs, our proof will also use what we call the randomization hammer, according to which, for any random variable $X$ with domain $\mathrm{dom}(X)$ and any function $f$,
$$\sup_{x \in \mathrm{dom}(X)} f(x) \ge \mathbb{E}[f(X)]$$
holds whenever $\mathbb{E}[f(X)]$ exists. When using this “trick” to lower bound $\sup_x f(x)$, we will choose a distribution over the range of values of $x$ (equivalently, the random variable $X$), and then we will further calculate with the expected value $\mathbb{E}[f(X)]$. The distribution will be chosen such that the expectation is easy to deal with.
Our main result is the following theorem:
We use the randomization hammer to lower bound the quantity on the right-hand side. Let $Y_1, Y_2, \ldots, Y_n$ be an i.i.d. sequence of Bernoulli(1/2) random variables such that $Y_t$ is also independent of the random numbers which are used by algorithm $A$. Then,
$$R_n^A(F_N) \ge \mathbb{E}\left[ \sum_{t=1}^n |\hat p_t^{(A)} - Y_t| - \min_{F \in F_N} \sum_{t=1}^n |f_t^{(F)} - Y_t| \right] = \mathbb{E}\left[ \sum_{t=1}^n |\hat p_t^{(A)} - Y_t| \right] - \mathbb{E}\left[ \min_{F \in F_N} \sum_{t=1}^n |f_t^{(F)} - Y_t| \right]\,.$$
It is not hard to see that $\mathbb{E}\big[ |\hat p_t^{(A)} - Y_t| \big] = 1/2$ holds for any $1 \le t \le n$, thanks to our choice of $(Y_t)$ (see Exercise 5.1).
Therefore,
$$R_n^A(F_N) \ge \frac{n}{2} - \mathbb{E}\left[ \min_{F \in F_N} \sum_{t=1}^n |f_t^{(F)} - Y_t| \right] = \mathbb{E}\left[ \max_{F \in F_N} \sum_{t=1}^n \left( \frac{1}{2} - |f_t^{(F)} - Y_t| \right) \right]\,.$$
Note that the right-hand side is a quantity which does not depend on algorithm $A$. Thus, we have made a major step forward.
Now, let us find a shorter expression for the right-hand side. If $Y_t = 0$, the expression $\frac12 - |f_t^{(F)} - Y_t|$ equals $\frac12 - f_t^{(F)}$. If $Y_t = 1$, the expression equals $f_t^{(F)} - \frac12$. Therefore, the expression can be written as $(f_t^{(F)} - \frac12)(2Y_t - 1)$. Let us introduce $\sigma_t = 2Y_t - 1$.
Notice that $(\sigma_t)$ is a Rademacher sequence. With the help of this sequence, we can write
$$R_n^A(F_N) \ge \mathbb{E}\left[ \max_{F \in F_N} \sum_{t=1}^n \big(f_t^{(F)} - \tfrac12\big)\,\sigma_t \right]\,.$$
It is not hard to see then that $(Z_{i,t}\sigma_t)$ is an $N \times n$ Rademacher random matrix (Exercise 5.2).
Now, since $(Z_{i,t}\sigma_t)$ is an $N \times n$ Rademacher random matrix, the distribution of
$$\max_{1 \le i \le N} \sum_{t=1}^n Z_{i,t}\sigma_t$$
is the same as that of $\max_{1 \le i \le N} \sum_{t=1}^n Z_{i,t}$. Therefore,²
$$\mathbb{E}\left[ \max_{1 \le i \le N} \sum_{t=1}^n Z_{i,t}\sigma_t \right] = \mathbb{E}\left[ \max_{1 \le i \le N} \sum_{t=1}^n Z_{i,t} \right]\,.$$
²Here, the Rademacher matrix $(Z_{i,t})$ is recycled just to save the introduction of some new letters. Of course, we could have also carried $Z_{i,t}\sigma_t$ further, but that would again be just too much writing.
Since this holds for any algorithm $A \in \mathcal A$, we must also have
$$V_n^{(N)} = \inf_{A \in \mathcal A} \sup_{F_N \in S_n} R_n^A(F_N) \ge \frac12\, \mathbb{E}\left[ \max_{1 \le i \le N} \sum_{t=1}^n Z_{i,t} \right]\,.$$
Dividing both sides by $\sqrt n$ and letting $n \to \infty$, the first part of Lemma 5.1 gives
$$\lim_{n \to \infty} \frac{V_n^{(N)}}{\sqrt n} \ge \frac12 \lim_{n \to \infty} \frac{\mathbb{E}\left[ \max_{1 \le i \le N} \sum_{t=1}^n Z_{i,t} \right]}{\sqrt n} = \frac12\, \mathbb{E}\left[ \max_{1 \le i \le N} G_i \right]\,,$$
where $G_1, G_2, \ldots, G_N$ are independent Gaussian random variables. Now, divide both sides by $\sqrt{2 \ln N}$ and let $N \to \infty$. The second part of Lemma 5.1 gives
$$\lim_{N \to \infty} \lim_{n \to \infty} \frac{V_n^{(N)}}{\sqrt{2n \ln N}} \ge \frac12 \lim_{N \to \infty} \frac{\mathbb{E}\left[ \max_{1 \le i \le N} G_i \right]}{\sqrt{2 \ln N}} = \frac12\,,$$
therefore we also have
$$\sup_{n,N} \frac{V_n^{(N)}}{\sqrt{2n \ln N}} \ge \frac12\,.$$
Multiplying both sides by 2 gives the desired result.
5.3 Exercises
Exercise 5.1. Show that in the proof of Theorem 5.2, $\mathbb{E}\big[ |\hat p_t^{(A)} - Y_t| \big] = 1/2$ holds for any $1 \le t \le n$, where the algorithm $A$ is allowed to randomize (and of course may use past information to come up with its prediction).
Exercise 5.2. Show that if $(Z_{i,t})$ is an $N \times n$ Rademacher matrix and $(\sigma_t)$ is a Rademacher sequence with $n$ elements, and if for any $t$, $V_t \stackrel{\mathrm{def}}{=} (Z_{i,t})_i$ and $\sigma_t$ are independent of each other, then $(Z_{i,t}\sigma_t)$ is an $N \times n$ Rademacher matrix.
Exercise 5.3. Show that for any sets X, Y , and any function A : X × Y → R,
Exercise 5.4. Show that the following strengthening of the result in this chapter holds, too: Take any $c < 1/\sqrt 2$. Fix $D = [0,1]$, $Y = \{0,1\}$, $\ell(p, y) = |p - y|$. Then, there exists a horizon $n$, a positive integer $N$, and a non-empty set of experts $F_N \subset \mathcal F$ with cardinality $N$ such that for all algorithms $A \in \mathcal A$ the regret $R_n^A(F_N)$ of algorithm $A$ satisfies $R_n^A(F_N) > c\sqrt{n \ln N}$.
Hint: Modify the proof of the above theorem.
Chapter 6
Tracking
In this chapter we still consider discrete prediction problems with expert advice with N
experts, where the losses take values in [0, 1]. The horizon n will be fixed.
So far we have considered a framework when the learning algorithm competed with the
single best expert out of a set of N experts. At times, this base set of experts might not
perform very well on their own. In such cases one might try a larger set of compound experts,
potentially created from the base experts. For example, we may want to consider decision
trees of experts when the conditions in the decision tree nodes refer to past outcomes, the
time elapsed, past predictions, etc., while the leaves could be associated with indices of base
experts. A decision tree expert can itself be interpreted as an expert: In a given round,
the past information would determine a leaf and thus the base expert whose advice the
tree expert would take. Then one can just use randomized EWA to compete with the best
compound expert. The benefit of doing so is that the best compound expert might have a
much smaller cumulated loss than the cumulated loss of the best base expert. The drawback
is that it might be hard to find this expert with a small loss, i.e., the regret bound, becomes
larger. So in general, this idea must be applied with care.
The regret against the best switching expert within the class $B \subseteq \{1, \ldots, N\}^n$ is $R_n = \hat L_n - \min_{\sigma \in B} L_{\sigma,n}$.
Clearly, randomized EWA can be applied to this problem. By Theorem 4.2, when $\eta$ is appropriately selected,
$$\mathbb{E}[R_n] \le \sqrt{\frac{n}{2} \ln |B|}\,. \tag{6.1}$$
We immediately see that if $B = \{1, \ldots, N\}^n$ (all switching experts are considered), the bound becomes vacuous, since $|B| = N^n$, hence $\ln|B| = n \ln N$ and the bound is linear as a function of $n$. This should not be surprising since, by not restricting $B$, we effectively set
off to achieve the goal of predicting in every round the index of the expert which is the best
in that round. In fact, one can show that there is no algorithm whose worst-case expected
regret is sublinear when competing against this class of experts (Exercise 6.1).
How to restrict B? There are many ways. One sensible way is to just compete with the
switching experts which do not switch more than m times. The resulting problem is called
the tracking problem because we want to track the best experts over time with some number
of switches allowed. Let s(σ) be the number of times expert σ switches from one base expert
to another:
$$s(\sigma) = \sum_{t=2}^n \mathbb{I}\{\sigma_{t-1} \ne \sigma_t\}\,.$$
For an integer $0 \le m \le n-1$, let
$$B_{n,m} = \{\sigma \mid s(\sigma) \le m\}\,.$$
How will EWA perform when the class of switching experts is $B_{n,m}$? In order to show this we only need to bound $M = |B_{n,m}|$. It is not hard to see that $M = \sum_{k=0}^m \binom{n-1}{k} N (N-1)^k$. Some further calculation gives that $M \le N^{m+1} \exp\left( (n-1) H\left( \frac{m}{n-1} \right) \right)$, where $H(x) = -x \ln x - (1-x)\ln(1-x)$, $x \in [0,1]$, is the entropy function. Hence, the expected regret of randomized EWA applied to this problem is bounded as follows:
$$\mathbb{E}[R_n] \le \sqrt{\frac{n}{2} \left( (m+1)\ln N + (n-1) H\left( \frac{m}{n-1} \right) \right)}\,.$$
One scenario of interest is when we expect that for any $n$, by allowing $m_n \approx \alpha n$ switches during $n$ rounds, the loss of the best expert in $B_{n,m_n}$ will be reasonably small. When using such a sequence $m_n$, we are thus betting on a constant “rate of change” $\alpha$ of who the best expert is. Of course, in this case the expected regret per round will not vanish, but it will converge to a constant value as $n \to \infty$. Some calculation gives that
$$\frac{\mathbb{E}[R_n]}{n} \le \sqrt{\frac{1}{2} \left( \Big(\alpha + \frac1n\Big) \ln N + \Big(1 - \frac1n\Big) H(\alpha) \right)}$$
and thus
$$\frac{\mathbb{E}[R_n]}{n} = \sqrt{\frac{\alpha \ln N + H(\alpha)}{2}} + O\big(n^{-1/2}\big)\,.$$
Thus, the average regret per time step when the best expert is allowed to change at a rate $\alpha$ close to zero is $\sqrt{H(\alpha)/2}$ (since this is the dominating term in the first expression when $\alpha \to 0$).
Applying EWA directly to the set of switching experts $B_{n,m}$ is impractical, since EWA needs to store and update one weight per expert, and the cardinality of $B_{n,m}$ is just too large for typical values of $n$, $N$ and $m$.
Lemma 6.1 (EWA with non-uniform priors). Consider a continuous, convex expert prediction problem given by $(D, Y, \ell)$ and $N$ experts. Let $w_{i,0}$ be nonnegative weights such that $W_0 = \sum_i w_{i,0} \le 1$, and let $w_{i,t} = w_{i,0} e^{-\eta L_{i,t}}$, where $L_{i,t} = \sum_{s=1}^t \ell(f_{i,s}, y_s)$. Further, let $W_t = \sum_{i=1}^N w_{i,t}$. Then, for $p_t = \sum_{i=1}^N \frac{w_{i,t-1}}{W_{t-1}} f_{i,t}$ and any $\eta > 0$, it holds that
$$\sum_{t=1}^n \ell(p_t, y_t) \le \frac{1}{\eta} \ln W_n^{-1} + \eta \frac{n}{8}\,.$$
How does this help? Since $W_n \ge w_{i,0} \exp(-\eta L_{i,n})$, we have $\ln(W_n^{-1}) \le \ln(w_{i,0}^{-1}) + \eta L_{i,n}$. Therefore, $\sum_{t=1}^n \ell(p_t, y_t) - L_{i,n} \le \frac{1}{\eta} \ln w_{i,0}^{-1} + \eta \frac{n}{8}$, and thus if $i^*$ is the index of the expert with the smallest loss, $\sum_{t=1}^n \ell(p_t, y_t) - \min_{1 \le i \le N} L_{i,n} = \sum_{t=1}^n \ell(p_t, y_t) - L_{i^*,n} \le \frac{1}{\eta} \ln w_{i^*,0}^{-1} + \eta \frac{n}{8}$. Thus, if $i^*$ was assigned a large weight (larger than $1/N$), the regret of EWA will be smaller than if all experts received the same weight: the weights $w_{i,0}$ act like our a priori bets on how much we believe in the experts initially. If these bets are correct, the algorithm is rewarded by achieving a smaller regret (this can also be seen directly, since then $w_{i^*,n}$ will be larger).
Let us now go back to our problem of competing with the best switching expert. Consider now randomized EWA, but on the full set $B$ of switching experts. Let $w'_t(\sigma)$ be the weight assigned to switching expert $\sigma$ by the randomized EWA algorithm after observing $y_{1:t}$. As initial weights choose
$$w'_0(\sigma) = \frac{1}{N} \left( \frac{\alpha}{N} \right)^{s(\sigma)} \left( 1 - \alpha + \frac{\alpha}{N} \right)^{n - s(\sigma) - 1}\,.$$
Here, $0 < \alpha < 1$ reflects our a priori belief in switching per time step, in a sense that will be made clear next. For $\sigma = (\sigma_1, \ldots, \sigma_n)$ and $1 \le s < t \le n$, let $\sigma_{s:t}$ be $(\sigma_s, \ldots, \sigma_t)$. Introduce the “marginalized” weights $w'_0(\sigma_{1:t}) = \sum_{\sigma' : \sigma'_{1:t} = \sigma_{1:t}} w'_0(\sigma')$. We have the following lemma, which shows that the above weights indeed sum to one and which also helps us in understanding where these weights are coming from:
Lemma 6.2 (Markov process view). Let $(X_t)$ be a Markov chain with state space $\{1, \ldots, N\}$ defined as follows: $\Pr[X_1 = i] = 1/N$ and $\Pr[X_{t+1} = i' \mid X_t = i] = \frac{\alpha}{N} + (1 - \alpha)\mathbb{I}\{i' = i\}$. Then, for any $\sigma \in B$, $1 \le t \le n$, $\Pr[(X_1, \ldots, X_t) = \sigma_{1:t}] = w'_0(\sigma_{1:t})$ and in particular, $\Pr[(X_1, \ldots, X_n) = \sigma] = w'_0(\sigma)$.
Defining
$$w'_{i,t} = \sum_{\sigma : \sigma_{t+1} = i} w'_t(\sigma)\,,$$
we see that
$$p'_{i,t+1} = \frac{w'_{i,t}}{W'_t}\,,$$
where $W'_t = \sum_{i=1}^N w'_{i,t}$. Notice that by definition $w'_{i,0} = w'_0(i)$, and so by Lemma 6.2, $w'_{i,0} = 1/N$.
Our goal now is to show that the weights $w'_{i,t}$ can be calculated in a recursive fashion. Introduce the shorthand notation $\ell_t(i) = \ell(f_{i,t}, y_t)$ to denote the loss of expert $i$. By definition, $w'_t(\sigma) = w'_0(\sigma) e^{-\eta L_{\sigma,t}}$, where $L_{\sigma,t} = \sum_{s=1}^t \ell(f_{\sigma_s,s}, y_s) = \sum_{s=1}^t \ell_s(\sigma_s)$. Define
$$\gamma_{i \to i'} = \frac{\alpha}{N} + (1 - \alpha)\,\mathbb{I}\{i = i'\}\,.$$
Note that for any $\sigma$ such that $\sigma_{t+1} = i$, $\gamma_{\sigma_t \to i} = \frac{w'_0(\sigma_{1:t+1})}{w'_0(\sigma_{1:t})}$, as follows from Lemma 6.2. Introduce $L_{\sigma_{1:t}} = \sum_{s=1}^t \ell_s(\sigma_s)$. Further, for arbitrary $1 \le p < q < \cdots < t \le n$ and $\sigma \in B$, by a slight abuse of notation, we shall also write $w'_0(\sigma_{1:p}, \sigma_{p+1:q}, \ldots, \sigma_{t+1:n})$ in place of $w'_0(\sigma)$. Then,
$$\begin{aligned}
w'_{i,t} &= \sum_{\sigma : \sigma_{t+1} = i} w'_t(\sigma) = \sum_{\sigma : \sigma_{t+1} = i} e^{-\eta L_{\sigma,t}}\, w'_0(\sigma) = \sum_{\sigma : \sigma_{t+1} = i} e^{-\eta \ell_t(\sigma_t)}\, e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma) \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)} \sum_{\sigma_{1:t-1}} \sum_{\sigma_{t+2:n}} e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma_{1:t-1}, \sigma_t, i, \sigma_{t+2:n}) \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)} \sum_{\sigma_{1:t-1}} e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma_{1:t}, i) \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)} \sum_{\sigma_{1:t-1}} e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma_{1:t})\, \frac{w'_0(\sigma_{1:t}, i)}{w'_0(\sigma_{1:t})} \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)} \sum_{\sigma_{1:t-1}} e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma_{1:t})\, \gamma_{\sigma_t \to i} \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)}\, \gamma_{\sigma_t \to i} \sum_{\sigma_{1:t-1}} e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma_{1:t}) \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)}\, \gamma_{\sigma_t \to i} \sum_{\sigma_{1:t-1}} \sum_{\sigma_{t+1:n}} e^{-\eta L_{\sigma_{1:t-1}}}\, w'_0(\sigma_{1:n}) \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)}\, \gamma_{\sigma_t \to i} \sum_{\sigma_{1:t-1}} \sum_{\sigma_{t+1:n}} w'_{t-1}(\sigma_{1:n}) \\
&= \sum_{\sigma_t} e^{-\eta \ell_t(\sigma_t)}\, \gamma_{\sigma_t \to i}\, w'_{\sigma_t, t-1}\,.
\end{aligned}$$
Thus,
$$w'_{i,t} = \sum_j e^{-\eta \ell_t(j)}\, w'_{j,t-1}\, \gamma_{j \to i} \tag{6.2}$$
$$w'_{i,t} = \sum_j e^{-\eta \ell_t(j)}\, w'_{j,t-1} \Big( \frac{\alpha}{N} + (1 - \alpha)\,\mathbb{I}\{j = i\} \Big) \tag{6.3}$$
$$w'_{i,t} = (1 - \alpha)\, e^{-\eta \ell_t(i)}\, w'_{i,t-1} + \frac{\alpha}{N} \sum_j e^{-\eta \ell_t(j)}\, w'_{j,t-1}\,,$$
which can be evaluated with $O(N)$ arithmetic per round. This recursion defines the Fixed-Share Forecaster (FSF): it maintains only one weight per base expert. Before round one, the weights are initialized to $w_{i,0} = 1/N$. In round $t = 1, 2, \ldots, n$, the FSF:
1. Observes the expert forecasts $f_{i,t}$.
2. Draws the index $I_t$ of a base expert such that $\Pr(I_t = i) = p_{i,t}$, where $p_{i,t} = \frac{w_{i,t-1}}{\sum_{j=1}^N w_{j,t-1}}$.
3. Predicts $f_{I_t,t}$.
4. Observes $y_t$ and the losses $\ell(f_{i,t}, y_t)$ (and it suffers the loss $\ell(f_{I_t,t}, y_t)$).
5. Updates the weights according to the recursion above: $w_{i,t} = (1 - \alpha) e^{-\eta \ell_t(i)} w_{i,t-1} + \frac{\alpha}{N} \sum_{j=1}^N e^{-\eta \ell_t(j)} w_{j,t-1}$.
6.3 Analysis
Theorem 6.3. Consider a discrete prediction problem over arbitrary sets $D$, $Y$ and the zero-one loss $\ell$. Let $y_1, \ldots, y_n \in Y$ be an arbitrary sequence of outcomes and let $f_{i,t} \in D$ be the advice of base expert $i$ in round $t$, where $1 \le i \le N$, $1 \le t \le n$. Let $\hat L_n$ be the cumulated loss of the Fixed-Share Forecaster at the end of round $n$ and, similarly, let $L_{\sigma,n}$ be the cumulated loss of switching expert $\sigma$ at the end of round $n$. Then,
$$\mathbb{E}\big[\hat L_n\big] - L_{\sigma,n} \le \frac{s(\sigma) + 1}{\eta} \ln N + \frac{1}{\eta} \ln \frac{1}{\alpha^{s(\sigma)} (1 - \alpha)^{n - s(\sigma) - 1}} + \eta \frac{n}{8}\,.$$
Further, for $0 \le m \le n$, $\alpha = m/(n-1)$, with a specific choice of $\eta = \eta(n, m, N)$, for any $\sigma$ with $s(\sigma) \le m$,
$$\mathbb{E}\big[\hat L_n - L_{\sigma,n}\big] \le \sqrt{\frac{n}{2} \left( (m+1)\ln N + (n-1) H\left( \frac{m}{n-1} \right) \right)}\,.$$
We see from the second part that the algorithm indeed achieves the same regret bound as randomized EWA with uniform initial weights competing with experts in $B_{n,m}$.
Proof. Since FSF implements randomized EWA exactly, we can use the reduction technique developed in the proof of Theorem 4.2 to study its performance. The only difference is that now we use non-uniform initial weights. Therefore, the problem reduces to the bound in Lemma 6.1. Then, following the argument already used after Lemma 6.1, we get that for any $\sigma \in B$,
$$\mathbb{E}\big[\hat L_n\big] - L_{\sigma,n} \le \frac{-\ln w'_0(\sigma)}{\eta} + \eta \frac{n}{8}\,.$$
Thus, we need an upper bound on $-\ln w'_0(\sigma)$. From the definition,
$$w'_0(\sigma) = \frac{1}{N} \left( \frac{\alpha}{N} \right)^{s(\sigma)} \left( 1 - \alpha + \frac{\alpha}{N} \right)^{n - s(\sigma) - 1}\,.$$
To finish, just note that $-\ln w'_0(\sigma) \le (1 + s(\sigma))\ln N + \ln\left( \frac{1}{\alpha^{s(\sigma)} (1 - \alpha)^{n - s(\sigma) - 1}} \right)$.
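The whole forecaster is a few lines once the recursion (6.3) is available; the Python sketch below is our own illustration (the interfaces are assumed, not from the notes):

```python
import numpy as np

def fixed_share(forecasts, outcomes, loss, eta, alpha, seed=0):
    """Fixed-Share Forecaster: randomized EWA over switching experts,
    implemented with one weight per base expert via the recursion (6.3).
    """
    rng = np.random.default_rng(seed)
    N = len(forecasts[0])
    w = np.full(N, 1.0 / N)                     # w_{i,0} = 1/N
    total = 0.0
    for f_t, y_t in zip(forecasts, outcomes):
        i = rng.choice(N, p=w / w.sum())        # draw I_t, predict f_{I_t,t}
        total += loss(f_t[i], y_t)
        v = w * np.exp(-eta * np.array([loss(f, y_t) for f in f_t]))
        w = (1 - alpha) * v + (alpha / N) * v.sum()   # recursion (6.3)
        w /= w.sum()    # renormalize: only the ratios matter for the draws
    return total
```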
6.4 Variable-share forecaster
According to the result of Theorem 6.3, the regret against a switching expert $\sigma$ is governed by the prior weight assigned to it. The Variable-Share Forecaster makes this prior depend on the incurred losses: for any switching expert $\sigma$, the weights are defined recursively by
$$w'_0(\sigma_{1:t+1}) = w'_0(\sigma_{1:t}) \left( \frac{1 - (1 - \alpha)^{\ell_t(\sigma_t)}}{N - 1}\, \mathbb{I}\{\sigma_t \ne \sigma_{t+1}\} + (1 - \alpha)^{\ell_t(\sigma_t)}\, \mathbb{I}\{\sigma_t = \sigma_{t+1}\} \right)\,.$$
Now, when $\ell_t(\sigma_t)$ is close to zero, $(1 - \alpha)^{\ell_t(\sigma_t)}$ will be close to one. Hence, the first term of the sum in the bracket will be close to zero, while the second one will be close to one if and only if $\sigma_t = \sigma_{t+1}$. Thus, in this case, from the Markov process view, we see that staying at $\sigma_t$ is encouraged in the prior. On the other hand, when $\ell_t(\sigma_t)$ is close to one, $(1 - \alpha)^{\ell_t(\sigma_t)}$ will be close to $1 - \alpha$. Hence, the expression in the bracket will be close to the previous expression, and staying will be encouraged by a probability close to the “default stay probability” $1 - \alpha$. Therefore, these weights are expected to result in a smaller regret when there is an expert with small cumulated loss and a small number of switches. Further, $\ell_t(\sigma_t) = \ell(f_{\sigma_t,t}, y_t)$ is available at the end of round $t$, therefore
$$\gamma^{(t)}_{\sigma_t \to i} = \frac{1 - (1 - \alpha)^{\ell_t(\sigma_t)}}{N - 1}\, \mathbb{I}\{\sigma_t \ne i\} + (1 - \alpha)^{\ell_t(\sigma_t)}\, \mathbb{I}\{\sigma_t = i\}$$
is not only well-defined, but its value is also available at the end of round $t$.
This leads to the Variable-Share Forecaster (VSF). Formally, this algorithm works as follows: Before round one, initialize the weights using $w_{i,0} = 1/N$. In round $t = 1, 2, 3, \ldots$, the VSF does the following:
1. Observes the expert forecasts $f_{i,t}$.
2. Draws the index $I_t$ of a base expert such that $\Pr(I_t = i) = p_{i,t}$, where $p_{i,t} = \frac{w_{i,t-1}}{\sum_{j=1}^N w_{j,t-1}}$.
3. Predicts $f_{I_t,t}$.
4. Observes $y_t$ and the losses $\ell(f_{i,t}, y_t)$ (and it suffers the loss $\ell(f_{I_t,t}, y_t)$).
5. Updates the weights: $w_{i,t} = \sum_{j=1}^N e^{-\eta \ell_t(j)} w_{j,t-1}\, \gamma^{(t)}_{j \to i}$.
It is then not hard to see that the result of this update is that, for binary losses, the term $\frac{n - s(\sigma) - 1}{\eta} \ln \frac{1}{1 - \alpha}$ in the bound is replaced by $s(\sigma) + \frac{1}{\eta} L_{\sigma,n} \ln \frac{1}{1 - \alpha}$. Hence, the VSF may achieve a much smaller loss when some expert $\sigma$ which does not switch too often achieves a small loss.
6.5 Exercises
Exercise 6.1. Let N > 1. Show that there is no algorithm whose worst-case expected
regret is sublinear when competing against all switching experts. More precisely, show that
there exists a constant $c$ such that for $D = [0,1]$, $Y = \{0,1\}$, $\ell(p, y) = |p - y|$, for any $N > 1$,
for any algorithm, there exists a set of base experts of size N and a time horizon n such that
the regret with respect to all switching experts is at least cn.
Exercise 6.2. Show that an algorithm that competes against experts in Bn,0 is effectively
back to competing with the best of the base experts.
Exercise 6.3. Show that $nH(m/n) = O(\ln n)$ as $n \to \infty$ for fixed $m$. Hint: For large $n$, $H(m/n) \approx \frac{m}{n} \ln\frac{n}{m}$, therefore $nH(m/n) = O(\ln n)$.
Exercise 6.7. Give a practical algorithm that does not require the knowledge of the horizon $n$ and which achieves the $O\big(\sqrt{H(\alpha)/2}\big)$ regret per time step when the rate of change of the identity of the best expert is bounded by $\alpha$. You may assume that $\alpha$ is known ahead of time.
Exercise 6.8. Give an algorithm like in the previous exercise, except that now the
algorithm does not know α, but it may know n.
Exercise 6.9. Give an algorithm like in the previous exercise, except that now neither α,
nor n is known.
Chapter 7
Linear classification with Perceptron
Suppose we want to classify emails according to whether they are spam (say, encoded by $+1$) or not spam (say, encoded by $-1$). From the text of the emails we extract the features $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ (e.g., $x_{t,i} \in \{0,1\}$ indicates whether a certain phrase or word is in the email), and we assign to every email a target label (or output) $y_t$ such that, say, $y_t = +1$ if the email is spam. The features will also be called inputs.
A classifier f is just a mapping from Rd to {−1, +1}. An online learning algorithm upon
seeing the samples
(x1 , y1 ), . . . , (xt−1 , yt−1 )
and xt produces a classifier ft−1 to predict yt (1 ≤ t ≤ n).
The algorithm is said to make a mistake if its prediction $\hat y_t$ does not match $y_t$. The total number of mistakes of the algorithm is $M = \sum_{t=1}^n \mathbb{I}\{\hat y_t \ne y_t\}$. The general goal in online learning of classifiers is to come up with an online algorithm that makes a small number of mistakes.
A linear classifier $f_w : \mathbb{R}^d \to \{-1, +1\}$ is given by a weight vector $w \in \mathbb{R}^d$, $w \ne 0$, such that
$$f_w(x) = \begin{cases} +1, & \text{if } \langle w, x \rangle \ge 0\,; \\ -1, & \text{otherwise}\,. \end{cases}$$
For simplicity, at the price of abusing the sign function, we will just write $f_w(x) = \operatorname{sign}\langle w, x \rangle$ (but when we write this, we will mean the above). Introduce the $(d-1)$-dimensional hyperplane $H_w = \{x : \langle w, x \rangle = 0\}$. Thus, if $\langle w, x \rangle = 0$ then $x$ is on the hyperplane $H_w$. We will also say that $x$ is above (below) the hyperplane $H_w$ when $\langle w, x \rangle$ is positive (resp., negative). Thus, $f_w(x) = +1$ if $x$ is above the hyperplane $H_w$, etc. In this sense, $H_w$ is really the decision surface underlying $f_w$. (In general, the decision “surface” underlying a classifier $f$ is $\{x : f(x) = 0\}$.)
By the law of cosines, $|\langle w, x \rangle| = \|w\| \|x\| \,|\cos(\angle(w, x))|$, therefore $|\langle w, x \rangle|$ is just $\|w\|$ times the distance of $x$ from $H_w$ (as usual, $\|\cdot\|$ is the 2-norm). In particular, when $\|w\| = 1$, $|\langle w, x \rangle|$ is the distance of $x$ to $H_w$. In some sense, this should also reflect how confident we are in our prediction of the label, if
we believe $w$ gives a good classifier. The distance of $x$ to the hyperplane $H_w$ is also called the (unsigned) margin of $x$ (the signed margin of the pair $(x, y) \in \mathbb{R}^d \times \{-1, +1\}$ would be $y\langle w, x \rangle$).
The general scheme for online learning with linear classifiers is as follows:
Initialize $w_0$.
1. Receive $x_t \in \mathbb{R}^d$.
Remark 7.1 (Bias (or intercept) terms). We defined linear classifiers as functions of the form $f(x) = \operatorname{sign}(\langle w, x \rangle)$. However, this restricts the decision surfaces to hyperplanes which cross the origin. A more general class of classifiers allows a bias term (or intercept term): $f(x) = \operatorname{sign}(\langle w, x \rangle + w_0)$, where $w, x \in \mathbb{R}^d$, $w_0 \in \mathbb{R}$. These allow hyperplanes which do not cross the origin. However, any linear classifier with a bias can be given as a linear classifier with no bias term when the input space is appropriately enlarged. In particular, for $w, x, w_0$, let $w' = (w_1, \ldots, w_d, w_0)$ and $x' = (x_1, \ldots, x_d, 1)$. Then $\operatorname{sign}(\langle w, x \rangle + w_0) = \operatorname{sign}(\langle w', x' \rangle)$. In practice, this just means that before running the algorithms one should simply amend the input vectors with a constant 1.
1. Receive $x_t \in \mathbb{R}^d$.
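The listing above is truncated; for completeness, here is a hedged Python sketch of the whole algorithm (the update $w_t = w_{t-1} + y_t x_t$ on a mistake is the one used in the proofs below; per Remark 7.1, append a constant 1 to the inputs if a bias term is needed):

```python
import numpy as np

def perceptron(xs, ys, sweeps=1):
    """Perceptron: predict sign(<w, x_t>), update on mistakes only.

    xs is an (n, d) array, ys in {-1, +1}^n. With sweeps > 1 the data are
    re-run until (hopefully) no mistakes remain, cf. Remark 7.4 below.
    Returns the final weight vector and the number of mistakes.
    """
    w = np.zeros(xs.shape[1])            # w_0 = 0
    mistakes = 0
    for _ in range(sweeps):
        for x_t, y_t in zip(xs, ys):
            y_hat = 1 if np.dot(w, x_t) >= 0 else -1
            if y_hat != y_t:             # mistake: rotate hyperplane toward x_t
                w = w + y_t * x_t
                mistakes += 1
    return w, mistakes
```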
7.2 Analysis for Linearly Separable Data
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be the data, where $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$. We say that $w^* \in \mathbb{R}^d$ separates the data set if $\operatorname{sign}(\langle w^*, x_t \rangle) = y_t$ for all $1 \le t \le n$. If there exists a $w^*$ which separates the data set, we call the data set linearly separable. Notice that if $w^*$ separates the data set, then for any $c > 0$, $cw^*$ separates it as well. So, we may even assume that $\|w^*\| = 1$, which we will indeed do from this point on.
Theorem 7.2 (Novikoff’s Theorem (1962)). Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be a data set that is separated by a $w^* \in \mathbb{R}^d$ with $\|w^*\| = 1$. Let $R, \gamma \ge 0$ be such that for all $1 \le t \le n$, $\|x_t\| \le R$ and $y_t \langle w^*, x_t \rangle \ge \gamma$. Let $M$ be the number of mistakes Perceptron makes on the data set. Then,
$$M \le \frac{R^2}{\gamma^2}\,.$$
Proof. We want to prove that $M \le R^2/\gamma^2$, or $M\gamma^2 \le R^2$. We will prove an upper bound on $\|w_n\|^2$ and a lower bound on $\langle w^*, w_n \rangle$, from which the bound will follow. To prove these bounds, we study the evolution of $\|w_t\|^2$ and $\langle w^*, w_t \rangle$.
If Perceptron makes no mistake, both quantities stay the same. Hence, the only interesting case is when Perceptron makes a mistake, i.e., $\hat y_t \ne y_t$. Let $t$ be such a time step. We have
$$\langle w^*, w_t \rangle = \langle w^*, w_{t-1} + y_t x_t \rangle = \langle w^*, w_{t-1} \rangle + y_t \langle w^*, x_t \rangle \ge \langle w^*, w_{t-1} \rangle + \gamma\,,$$
where the inequality follows because, by assumption, $y_t \langle w^*, x_t \rangle \ge \gamma$. Hence, by unfolding the recurrence (and using $w_0 = 0$),
$$\langle w^*, w_n \rangle \ge \gamma M\,.$$
Similarly, a mistake means $y_t \langle w_{t-1}, x_t \rangle \le 0$, so
$$\|w_t\|^2 = \|w_{t-1} + y_t x_t\|^2 = \|w_{t-1}\|^2 + 2 y_t \langle w_{t-1}, x_t \rangle + \|x_t\|^2 \le \|w_{t-1}\|^2 + R^2\,,$$
and unfolding this recurrence gives $\|w_n\|^2 \le R^2 M$.
By Cauchy–Schwarz, $|\langle w^*, w_n \rangle| \le \|w^*\| \|w_n\| = \|w_n\|$, where the last equality follows since by assumption $\|w^*\| = 1$. Chaining the obtained inequalities we get
$$\gamma^2 M^2 \le \|w_n\|^2 \le R^2 M\,.$$
If M = 0, the mistake bound indeed holds. Otherwise, we can divide both sides by M γ 2 , to
get the desired statement.
Intuitively, if the smallest margin on the examples is big, it should be easier to find a separating hyperplane. According to the bound, this is indeed what the algorithm achieves! Why do we have $R^2$ in the bound? To understand this, notice that the algorithm is scale invariant (it makes the same mistakes if all inputs are multiplied by some $c > 0$). Thus, the bound on the number of mistakes should be scale invariant! Since $\gamma^2$ changes by a factor of $c^2$ when scaling the inputs by $c > 0$, $R^2 = \max_{1 \le t \le n} \|x_t\|^2$ must appear in the bound so that the ratio stays scale invariant. In other words, the number of mistakes that Perceptron makes scales inversely proportionally with the square of the normalized margin.
The bound becomes smaller when $\gamma$ is larger. In fact, the best (largest admissible) choice is $\gamma = \min_{1 \le t \le n} y_t \langle w^*, x_t \rangle$, the smallest margin of $w^*$ on the data.
Remark 7.3. Novikoff’s proof follows pretty much the same patterns as the proofs which
we have seen beforehand: We prove upper and lower bounds on some function of the weights
and then combine these to get a bound. This should not be very surprising (given that the
essence of the algorithms is that they change their weights).
Remark 7.4. We can use the Perceptron algorithm to solve linear feasibility problems.
A linear feasibility problem (after maybe some transformation of the data inputs) is the
problem of finding a weight w ∈ Rd such that, with some D = ((xt , yt ))1≤t≤n , xt ∈ Rd ,
yt ∈ {−1, +1},
$$y_t \langle w, x_t \rangle > 0\,, \qquad t = 1, 2, \ldots, n$$
holds. Now, this is nothing but finding a separating hyperplane. If this problem has a
solution, repeatedly running Perceptron on the data D until it does not make any mistakes
will find a solution. (Why?) One can even bound the number of sweeps over D needed until
the algorithm will stop. Thus, we have an algorithm (Perceptron) which can be coded up
in 5 minutes to solve linear feasibility problems (if they have a solution). See Exercise 7.1
for some further ideas.
algorithm has a natural extension to the case when no expert is perfect. Can we have some similar extension in the linear classification case? In particular, is it possible to prove some nice regret bounds of the form
$$M \le \min_{w \in \mathbb{R}^d} \sum_{t=1}^n \mathbb{I}\{y_t \langle w, x_t \rangle \le 0\} + \text{“something small”}\;?$$
In the expert setting, we saw that no regret bound is possible, unless we randomize. The
following theorem parallels this result:
Theorem 7.5. For any deterministic algorithm A and any n ≥ 0 there exists a data sequence
(x1 , y1 ), . . . , (xn , yn ) and w∗ ∈ Rd such that the following hold:
The proof is left as an exercise (Exercise 7.2). In the discrete prediction setting random-
ization saved us. Unfortunately, randomization cannot save us now (Exercise 7.3).
So, what do we do? The Big Cheat! We introduce a surrogate loss, the so-called hinge loss. The 0-1 loss (or binary classification loss) is $\ell_{0\text{-}1}(w, (x, y)) = \mathbb{I}\{y\langle w, x \rangle \le 0\}$, while the hinge loss is $\ell_{\mathrm{hinge}}(w, (x, y)) = \max\big(0,\, 1 - y\langle w, x \rangle\big)$; the hinge loss is a convex upper bound on the 0-1 loss.
Theorem 7.6. Let $M$ be the number of mistakes that the Perceptron algorithm makes on a sequence $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{+1, -1\}$. Suppose $R \ge 0$ is such that $\|x_t\| \le R$ for all $1 \le t \le n$. For any $w \in \mathbb{R}^d$,
$$M \le \sum_{t=1}^n \ell_{\mathrm{hinge}}(w, (x_t, y_t)) + \|w\| R \sqrt{n}\,.$$
Proof. Fix $w \in \mathbb{R}^d$. We analyze the evolution of $\|w_t\|^2$ and $\langle w, w_t \rangle$. When a mistake occurs in time step $t$, then $y_t \langle w_{t-1}, x_t \rangle \le 0$ and therefore
$$\|w_t\|^2 = \|w_{t-1} + y_t x_t\|^2 = \|w_{t-1}\|^2 + 2 y_t \langle w_{t-1}, x_t \rangle + \|x_t\|^2 \le \|w_{t-1}\|^2 + R^2\,.$$
Thus, after processing all $n$ points and making all $M$ mistakes, by unrolling the recurrence we get $\|w_n\|^2 \le \|w_0\|^2 + M R^2 = M R^2$. Taking the square root and using $M \le n$, we have $\|w_n\| \le R\sqrt{M} \le R\sqrt{n}$.
Similarly, if a mistake occurs in time step $t$, then
$$\langle w, w_t \rangle = \langle w, w_{t-1} + y_t x_t \rangle = \langle w, w_{t-1} \rangle + y_t \langle w, x_t \rangle \ge \langle w, w_{t-1} \rangle + 1 - \ell_{\mathrm{hinge}}(w, (x_t, y_t))\,,$$
where in the last step we have used the inequality $y\langle w, x \rangle \ge 1 - \ell_{\mathrm{hinge}}(w, (x, y))$, valid for any $w, x \in \mathbb{R}^d$ and any $y \in \{+1, -1\}$. We unroll this recurrence in each round $t$ in which Perceptron makes a mistake. If no mistake was made, we use $\langle w, w_t \rangle = \langle w, w_{t-1} \rangle$ instead. We get
$$\langle w, w_n \rangle \ge \langle w, w_0 \rangle + M - \sum_{t :\, \hat y_t \ne y_t} \ell_{\mathrm{hinge}}(w, (x_t, y_t))\,.$$
Using $\|w_n\| \le R\sqrt{n}$, the Cauchy–Schwarz inequality, $w_0 = 0$ and the fact that the hinge loss is non-negative, we can write:
$$R\|w\|\sqrt{n} \ge \|w\| \|w_n\| \ge \langle w, w_n \rangle \ge \langle w, w_0 \rangle + M - \sum_{t :\, \hat y_t \ne y_t} \ell_{\mathrm{hinge}}(w, (x_t, y_t)) \ge M - \sum_{t=1}^n \ell_{\mathrm{hinge}}(w, (x_t, y_t))\,.$$
Reading off the beginning and the end of the chain of inequalities gives the statement of the theorem.
7.4 Exercises
(a) Show that if this algorithm stops, it will stop with a solution to the linear feasibility problem.
(b) Let $S$ be the number of sweeps the algorithm makes on the data $D$ before it stops. Show a bound on $S$.
(c) What happens when the original problem does not have a solution? Suggest a simple way of detecting that there is no solution.
Exercise 7.4. Show that $\ell_{\mathrm{hinge}}(w, (x, y))$ is convex in its first argument.
Chapter 8
In this chapter we describe the generic Follow The Regularized Leader algorithm for linear loss functions and derive a regret bound for it. To analyze and implement the algorithm, we will need bits of convex analysis such as Legendre functions, Bregman divergences, Bregman projections and strong convexity.
The restriction to linear loss functions is not as severe as one might think. In later lectures, we will see that we can cope with non-linear loss functions by working with their linear approximations. The analysis in the linear case is simpler and often leads to computationally faster algorithms.
We consider an online learning scenario where an online algorithm in each round $t = 1, 2, \ldots, n$ chooses a point $w_t$ in a non-empty convex set $K \subseteq \mathbb{R}^d$ and suffers a loss $\ell_t(w_t)$. The loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$ is chosen by the adversary and we assume that it is linear, i.e., $\ell_t(w) = \langle f_t, w \rangle$ where $f_t \in \mathbb{R}^d$.
The Follow The Regularized Leader (FTRL) algorithm is a particular algorithm for this scenario, which in round $t+1$ chooses $w_{t+1} \in K$ based on the sum of the loss functions up to time $t$,
$$L_t(w) = \sum_{s=1}^t \ell_s(w)\,,$$
and a regularizer $R$: it picks $w_{t+1} = \operatorname{argmin}_{w \in K} \big( \eta L_t(w) + R(w) \big)$, where $\eta > 0$ is a learning-rate parameter.
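As a concrete instance (the regularizer and the constraint set are our own choices, not fixed by the text): with $R(w) = \frac12\|w\|^2$ and $K$ a Euclidean ball, the FTRL step has a closed form, obtained by minimizing without the constraint and then projecting, which anticipates the projection lemma proved later in this chapter:

```python
import numpy as np

def ftrl_linear(fs, eta, radius=1.0):
    """FTRL with linear losses l_t(w) = <f_t, w> and R(w) = ||w||^2 / 2.

    K = {w : ||w|| <= radius}. The minimizer of eta*L_t(w) + R(w) over K is
    the unconstrained minimizer -eta * sum_s f_s projected onto the ball
    (for this quadratic R the Bregman projection is the Euclidean one).
    Returns the cumulated loss sum_t l_t(w_t).
    """
    F = np.zeros(len(fs[0]))             # running sum of loss vectors f_s
    w = np.zeros_like(F)                 # w_1 = argmin_K R = 0
    total = 0.0
    for f_t in fs:
        total += float(np.dot(f_t, w))   # suffer the linear loss l_t(w_t)
        F += f_t
        u = -eta * F                     # unconstrained minimizer
        norm = np.linalg.norm(u)
        w = u if norm <= radius else u * (radius / norm)  # project onto K
    return total
```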
We make additional assumptions on K and the regularizer R. We assume that K is
closed. Furthermore, we assume that R is a Legendre function. In the next section, we
explain what Legendre functions are and introduce other necessary tools from convex analysis.
1. $A \subseteq \mathbb{R}^d$, $A \ne \emptyset$, and the interior $A^\circ$ is convex.
2. $F$ is strictly convex.
3. The partial derivatives $\frac{\partial F}{\partial x_i}$ exist for all $i = 1, 2, \ldots, d$ and are continuous.
The Bregman divergence $D_F(u, v) = F(u) - \big( F(v) + \langle \nabla F(v), u - v \rangle \big)$ is the difference between the function value $F(u)$ and its approximation by the first-order Taylor expansion $F(v) + \langle \nabla F(v), u - v \rangle$ around $v$. The first-order Taylor expansion is a linear function tangent to $F$ at the point $v$. Since $F$ is convex, this linear function lies below $F$ and therefore $D_F$ is non-negative. Furthermore, $D_F(u, v) = 0$ implies that $u = v$: because of the strict convexity of $F$, the only point where the linear approximation meets $F$ is $u = v$.
It is a non-trivial fact that the Bregman projection $\Pi_{F,K}(w) = \operatorname{argmin}_{u \in K \cap A} D_F(u, w)$ is well defined. More precisely, it needs to be shown that the minimum $\min_{u \in K \cap A} D_F(u, w)$ is attained and that the minimizer is unique. The former follows from the fact that $K$ is closed, the latter from the strict convexity of $F$.
Lemma 8.4 (Pythagorean inequality). Let F : A → R be Legendre function and let K ⊆ Rd
be a closed convex subset such that K ∩ A 6= ∅. If w ∈ A◦ , w0 = ΠF,K (w), u ∈ K then
DF (u, w) ≥ DF (u, w0 ) + DF (w0 , w) .
Lemma 8.5 (Kolmogorov’s inequality). Let F : A → R be Legendre function and let K ⊆ Rd
be a closed convex subset such that K ∩ A 6= ∅. If u, v ∈ A◦ and u0 = ΠF,K (u), v 0 = ΠF,K (u)
are their projections then
hu0 − w0 , ∇ F (u0 ) − ∇ F (w0 )i ≤ hu0 − w0 , ∇ F (u) − ∇ F (w)i .
Lemma 8.6 (Projection lemma). Let F : A → R, A ⊂ Rd be a Legendre function and
e = argminu∈A F (u). Let K ⊆ Rd be convex closed and set such that K ∩ A 6= ∅. Then,
w
ΠF,K (w)
e = argmin F (u) .
u∈K∩A
Proof. Let w0 = ΠF,K (w) e and w = argminu∈K∩A F (u). We need to prove w = w0 . Since
w is a minimizer, we have F (w) ≤ F (w0 ). If we are able to prove the reversed inequality
F (w0 ) ≤ F (w) then by strict convexity of F , the minimizer of F is unique and hence w = w0 .
It thus remains to prove that F (w0 ) ≤ F (w). By definition of Bregman projection
w0 = argminu∈K∩A DF (u, w) e and therefore
DF (w0 , w)
e ≤ DF (w, w)
e .
Expanding DF on both sides of the inequality, we get
F (w0 ) − F (w) e w0 − wi
e − h∇ F (w), e ≤ F (w) − F (w)
e − h∇ F (w),
e w − wi
e .
e on both sides and note that ∇ F (w)
We cancel F (w) e = 0 since w
e is an unconstrained
minimizer of F . We obtain F (w0 ) ≤ F (w) as promised.
The projection lemma says that the constrained minimizer w can be calculated by first
computing the unconstrained minimizer w e and then projecting it onto K using Bregman
projection. This will be not only useful in the regret analysis of FTRL, but it can be also
used to implement FTRL.
45
This definition depends on the norm used. A function F can be strongly convex with
respect to one norm but not another. Note that strong convexity (with respect to any norm)
implies strict convexity. Convexity means that the function F (u) can be lower bounded a
linear function F (v) + h∇ F (v), u − vi. In contrast, strong convexity means that the function
F (u) can lower bounded a quadratic function F (v) + h∇ F (v), u − vi + 21 ku − vk2 . The
coefficient 21 in front ku − vk2 is an arbitrary choice; it was chosen because of mathematical
convenience.
Let k · k be any norm on Rd . Its dual norm, denoted by k · k∗ , is defined by
1 2 1 2
kuk∗ = sup hu, vi − kvk . (8.1)
2 v∈Rd 2
1 1
The dual norm of k · kp is k · kq where 1 ≤ p, q ≤ ∞ and p
+ q
= 1.
46
Substituting u = wn+1 we get
The right hand side can be upper bounded by ηLn (u) + R(u) − R(w1 ) because wn+1 is the
minimizer of ηLn (u) + R(u). Thus we get
b+ ≤ ηLn (u) + R(u) − R(w1 )
ηL n
bn − Ln (u) ≤ L
L b+ + R(u) − R(w1 ) .
bn − L
n
η
The upper bound consists of two terms: L b+ = Pn [`t (wt ) − `t (wt+1 )] and (R(u) −
bn − L
n t=1
R(w1 ))/η. The first term captures how fast is wt changing and the second is a penalty
that we pay for using too much regularization. There is a trade-off: If we use too much
regularization (i.e. R is too big and/or η is too small) then the second term is too big. On
other hand, if we use too few regularization (i.e. R is too small and/or η is too large) then
wt+1 will too far from wt and the first term will be big. The goal is find a balance between
these two opposing forces and find the right amount of regularization.
The following lemma states that the FTRL solution wt+1 can be obtained by first finding
the unconstrained minimum of ηLt (u) + R(u) over A and then projecting it to K. (Recall
that Lt (u) is the sum of the loss functions up time t.)
Lemma 8.13 (FTRL projection lemma). Let
w
et+1 = argmin [ηLt (u) + R(u)]
u∈A
wt+1 = ΠLηt ,K (w
et+1 )
where Lηt (u) = ηLt (u) + R(u). It is straightforward to verify that DR (u, v) = DLηt (u, v) for
any u, v. Therefore ΠLηt ,K (u) = ΠR,K (u) for any u.
47
Theorem 8.14 (FTRL Regret Bound for Strongly Convex Regularizer). Let R : A → R be
a Legendre function which is strongly convex with respect a norm k · k. Let `t (u) = hu, ft i.
Then, for all u ∈ K ∩ A
n
X R(u) − R(w1 )
bn − Ln (u) ≤ η
L kft k2∗ + .
t=1
η
q
R(u)−R(w1 )
In particular, if kft k∗ ≤ 1 for all 1 ≤ t ≤ n and η = n
then
p
bn − Ln (u) ≤
L n(R(u) − R(w1 )) .
48
We finish the proof by showing that ∇ R(w et ) − ∇ R(w et+1 ) = ηft . In order see that,
note that w et are the unconstrained minimizers of Lt−1 , Lηt respectively and therefore they
et , w η
8.4 Exercises
Exercise 8.1. (EWA as FTRL) The goal of this exercise is to show that Exponen-
tially Weighted Average (EWA) forecaster is, in fact, the Follow The Regular-
ized Leader (FTRL) algorithm with the “un-normalized” negative entropy regularizer.
Assume that `1 , `2 , . . . , `t−1 are vectors in RN . We denote their components by `s,i where
1 ≤ s ≤ t − 1 and 1 ≤ i ≤ N . (You can think of `s,i as the loss of expert i in round s.)
Recall that in round t, EWA chooses the probability vector pt = (pt,1 , pt,2 , . . . , pt,N ), where
the coordinates are
wt,i Pt−1
pt,i = PN (i = 1, 2, . . . , N ) where wt,i = e−η s=1 `s,i
i=1 wt,i
and η > 0 is the learning rate. On the other hand, FTRL chooses the probability vector p0t
defined as !
t−1
X
0
pt = argmin η h`s , pi + R(p) ,
p∈∆N
s=1
PN
where ∆N = {p ∈ RN : i=1 pi = 1, ∀1 ≤ i ≤ N, pi ≥ 0}. Your main goal will be to show
that if the regularizer is the un-normalized negative entropy
N
X
R(p) = pi ln(pi ) − pi
i=1
then pt = p0t (provided, of course, that both EWA and FTRL use the same η). We ask you
to do it in several steps.
49
(b) Prove that the function R(p) defined on the open positive orthant
RN
++ = {p ∈ R
N
: ∀1 ≤ i ≤ N, pi > 0}
(c) Show that R(p) is not strongly convex on the positive orthant with respect to any norm.
(Hint: First, prove the statement for a fixed norm e.g. 1-norm. Then use that on RN
any two norms are equivalent: For any norms k · k♥ , k · k♠ on RN there exists α > 0 such
that for any x ∈ RN , αkxk♠ ≤ kxk♥ . You don’t need to prove this.)
(d) Prove that R(p) is strongly convex with respect to k · k1 on the open probability simplex
( N
)
X
0 N
∆N = p ∈ R : pi > 0, pi = 1 .
i=1
for any p, q ∈ ∆N .)
(e) Show that the Bregman projection ΠR,∆N : RN ++ → ∆N induced by the negative entropy
can be calculated in O(N ) time. That is, find an algorithm that, given an input x ∈ RN
++ ,
0
outputs x = ΠR,∆N (x) using at most O(N ) arithmetic operations. (Hint: Find an
explicit formula for the projection!)
and show that w et = wt , where wt = (wt,1 , . . . , wt,N ) is the EWA weight vector. (Hint:
Set the gradient of the objective function to zero and solve the equation.)
(g) Combining e and f, find the constrained minimum p0t and prove that p0t = pt .
50
Hints: By checking the proof, it is not hard to see that the general upper bound for FTRL
continues to hold even when strong convexity for R holds only over K ∩ A. Identify K
and A and argue (based on what you proved above) that this is indeed the case here.
Then, use Part (h) to upper bound R(p). To deal with R(p1 ), figure out what p1 is and
substitute! Optimize η. You can use that the dual norm of k · k1 is k · k∞ .
(a) Show that the unit ball, BN , is a convex set. (Hint: Use triangle inequality for the
2-norm.)
(b) Show that R(x) = 21 kxk22 defined on all of RN is a Legendre function. (Among other
things, don’t forget to prove that the gradients grow at the “boundary”. That is, don’t
forget to show that k ∇ R(x)k → ∞ as kxk → ∞.)
(c) Show that R(x) is strongly convex with respect to the 2-norm.
(d) Show that the Bregman projection ΠR,BN : RN → BN induced by the quadratic regu-
larizer can be calculated in O(N ) time. That is, find an algorithm that, given an input
0
x ∈ RN ++ , outputs x = ΠR,∆N (x) using at most O(N ) arithmetic operations. Find an
explicit formula.
(Hint: Calculate the gradient of the objective, set it to zero and solve.)
(f) Find the constrained minimum xt using parts (d) and (e).
(g) Design and describe (both in pseudo-code and in English) the FTRL algorithm for this
setting that has O(N ) memory complexity and O(N ) time complexity per step.
51
√
(h) Assuming that `1 , `2 , . . . , `n ∈ BN , prove an O( n) regret on the resulting algorithm.
Use the general upper bound, proved in class, for FTRL with strongly convex regularizer:
n
X R(u) − R(x1 )
bn − L(u) ≤ η
L k`t k2∗ + ∀ u ∈ BN .
t=1
η
Upper bound R(u). To deal with R(x1 ), figure out what x1 is and substitute! Optimize
η. (Hint: The dual norm√of k · k2 is k · k2 .) What is the dependence of your regret bound
on N ? Compare this O( n log N ) from Q1 part (i).
(For those initiated: Does the algorithm from Part (g) remind you of the classical (online)
gradient descent algorithm?)
`t (wt+1 ) ≤ `t (wt ) .
(c) Show that the dual norm of the 2-norm is the 2-norm.
52
Chapter 9
In this lecture, we present a different algorithm for the same online learning scenario as in
last lecture. The algorithm we present is called Proximal Point Algorithm and it goes
back to Martinet (1978) and Rockafellar (1977) in the context of classical (offline) numerical
convex optimization. In the context of online learning, a special case of the algorithm with
quadratic regularizer was rediscovered by Zinkevich (2003).
Recall the online learning scenario from the last chapter, where the learner in round t
chooses a point wt from a closed convex set K ⊆ Rd , and it suffers a loss `t (wt ) where
`t : Rd → R is linear function chosen by the adversary i.e. `t (wt ) = hft , wi.
In the last chapter, we saw that the regret of FTRL is controlled by how much wt+1
differs from wt and the amount of regularization used. The Proximal Point Algorithm
is designed with the explicit intention to keep wt+1 close to wt . This also explains the name
of the algorithm.
Suppose R : A → R is Legendre function, which we can think of as regularizer. The
algorithm first calculates wet+1 ∈ A from wt and then w et+1 is projected to K by a Bregman
projection:
w
et+1 = argmin [η`t (w) + DR (w, wt )]
w∈A
wt+1 = ΠR,K (w
et+1 )
We see that w et+1 is constructed so that it minimizes the current loss penalized so that w et+1
stays close to wt . The algorithm starts with we1 = argminw∈A R(w) and w1 = ΠR,K (w e1 ), the
same as FTRL.
9.1 Analysis
Proposition 9.1.
∇ R(w
et+1 ) − ∇ R(wt ) = −ηft
53
Proof. We have ∇w [η`t (u) + DR (w, wt )] |w=wet+1 = 0 since wet+1 is an unconstrained minimizer
of η`t (w) + DR (w, wt ). We rewrite the condition by calculating the gradient:
0 = ∇w [η`t (u) + DR (w, wt )] w=wet+1
= η ∇ `t (w
et+1 ) + ∇w DR (w, wt )w=wet+1
= ηft + ∇w [R(w) − R(wt ) − h∇ R(wt ), w − wt i] w=wet+1
= ηft + ∇ R(w
et+1 ) − ∇ R(wt )
Proof. The sum Lbn − Ln (u) can be written as a sum Pn (`t (wt ) − `t (u)). We upper bound
t=1
each term in the sum separately:
Adding the inequalities for all t = 1, 2, . . . , n, the terms DR (u, wt ) − DR (u, wet+1 ) telescope:
" n
#
bn − Ln (u) ≤ 1 X
L DR (u, w1 ) − DR (u, wn+1 ) + DR (wt , w
et+1 )
η t=1
We can drop (−DR (u, wn+1 )) since Bregman divergence is always non-negative,
Pn and we get
the first inequality of the lemma. For the second inequality, we upper bound t=1 DR (wt , w
et+1 )
54
term-by-term:
et+1 ) ≤ DR (wt , w
DR (wt , w et+1 ) + DR (wet+1 , wt )
= h∇ R(wt ) − ∇ R(w et+1 ), wt − w et+1 i
= ηhft , wt − w et+1 i (Proposition 9.1)
= η (`t (wt ) − `t (w
et+1 )) .
Theorem 9.3 (Regret Bound for Proximal Point Algorithm with Strongly Convex Regu-
larizer). Let R : A → R is a Legendre function which is strongly convex with respect a norm
k · k. Then, for all u ∈ K ∩ A,
n
X R(u) − R(w1 )
bn − Ln (u) ≤ η
L kft k2∗ + .
t=1
η
q
R(u)−R(w1 )
In particular, if kft k∗ ≤ 1 for all 1 ≤ t ≤ n and η = n
then for any u ∈ K ∩ A,
p
bn − Ln (u) ≤
L n(R(u) − R(w1 )) .
First, note that DR (u, w1 ) ≤ R(u) − R(w1 ) since for any u ∈ K ∩ A, h∇ R(w1 ), u − w1 i ≥ 0
by optimality of w1 . Hence, it remains to show that
n
X n
X
(`t (wt ) − `t (w
et+1 )) ≤ η kft k2∗ .
t=1 t=1
`t (wt ) − `t (w
et+1 ) = hft , wt − w
et+1 i ≤ kft k∗ · kwt − w
et+1 k .
55
By Hölder’s inequality,
et+1 k2 ≤ k ∇ R(wt ) − ∇ R(w
kwt − w et+1 )k∗ · kwt − w
et+1 k .
Dividing both sides by non-negative kwt − w
et+1 k we get
kwt − w
et+1 k ≤ k ∇ R(wt ) − ∇ R(w
et+1 )k∗ .
Finally, by Proposition 9.1, we have k ∇ R(wt ) − ∇ R(w
et+1 )k∗ = ηkft k∗ .
56
Summing the lemma over all t = 1, 2, . . . , n we get the following corollary.
where
`et (w) = `t (wt ) + h∇`t (wt ), w − wt i
is the linearized loss. The name comes from that `et is a linear approximation of `t by its
first order Taylor expansion. We call the resulting algorithm the Linearized Proximal
Point Algorithm.
The crucial property that allows to extend regret bounds for non-linear losses is the
following lemma. The lemma states that the instantaneous regret `t (wt ) − `t (u) for any
convex loss is upper bounded by the instantaneous regret `et (wt )− `et (u) for their linearization.
Proof. The fist inequality from convexity of `t . The second equality follows by the definition
of the linearized loss `et .
Using this lemma it easy to extend the first part of Lemma 9.2 (with constant learning
rate) and Corollary 9.6 (with varying learning rate) to non-linear convex losses. If gradients
∇ `t are bounded, one can even extend Theorem 9.3.
et }∞
Proposition 9.8. The sequences {w ∞
t=1 and {wt }t=1 generated by the Linearized Prox-
imal Point Algorithm satisfy
∇R(w
et+1 ) − ∇R(wt ) = −ηt ∇`t (wt )
Proof. The follows from Proposition 9.4 applied to linearized loss `et .
57
Notice that the linearized loss `et (w) = `t (wt )+h∇`t (wt ), w −wt i is technically not a linear
function. It is an affine function. That is, it is of the form `et (w) = a + hb, wi for some a ∈ R
and b ∈ Rd . It easy to see that the results for linear losses from previous sections extend
without any change to affine losses. Also notice that the intercept a does not play any role in
the algorithms nor the regret. Thus we can define the linearized loss as `et (w) = h∇ `t (wt ), wi.
n
bn − Ln (u) ≤
X DR (wt , w
et+1 )
L .
t=1
ηt
By definition of the linearized losses h∇ `t (wt ), u − wt i = `et (u) − `et (wt ) and thus
σt
`t (wt ) − `t (u) ≤ `et (wt ) − `et (u) − DR (u, wt ) .
2
(Note that this inequality is a strengthening of Lemma 9.7 for strongly convex losses). We
upper bound `et (wt ) − `et (u) using Lemma 9.5:
58
It remains to show that the second sum is non-positive. We can rewrite it as
n
X 1 σt 1
− DR (u, wt ) − DR (u, wt+1 )
t=1
ηt 2 ηt
n−1
1 σ1 1 X 1 σt+1 1
= − DR (u, w1 ) − DR (u, wn+1 ) + − − DR (u, wt+1 ) .
η1 2 ηn t=1
η t+1 2 η t
By the choice of learning rates 1/η1 = σ1 /2 and hence the first term vanishes. We can
drop the second term, because DR is non-negative. Each term in the sum is zero, since the
1
learning rates satisfy the recurrence ηt+1 = η1t + σt+1
2
.
We illustrate the use of Theorem on a special case. Assume that the regularizer is
R(w) = 12 kwk22 , gradients are uniformly bounded k ∇ `t (w)k2 ≤ G for all w ∈ K and all t,
and σt = σ. The algorithm becomes a projected gradient descent algorithm
et+1 = wt − ηt ∇`t (wt )
w
wt+1 = ΠR,K (w
et )
The term DR (wt , wt+1 ) can be expressed as
1 1
et+1 ) = kwt − w
DR (wt , w et+1 k22 = ηt2 k∇`t (wt )k22 .
2 2
Applying Theorem 9.10 we get a O(log n) regret bound:
n
bn − Ln (u) ≤
X DR (wt , w
et+1 )
L
t=1
ηt
n
X ηt k∇`t (wt )k22
=
t=1
2
n
X k∇`t (wt )k22
= Pt
t=1 s=1 σs
n
X k∇`t (wt )k2 2
=
t=1
tσ
n
G2 X 1
≤
σ t=1 t
G2
≤ (1 + ln n) .
σ
9.5 Exercises
59
Exercise 9.2. (Zinkevich’s algorithm) Let K ⊆ Rd be convex and closed containing
the origin. Abusing notation, let ΠK (w) = argminu∈K ku − wk be the Euclidean projection
to K. Zinkevich’s algorithm starts with w1 = 0 and updates
Show that Zinkevich’s algorithm is nothing else than Linearized Proximal Point Al-
gorithm with quadratic regularizer R(w) = 21 kwk22 .
Exercise 9.3. (Comparison with FTRL) Consider FTRL algorithm and the Proximal
Point Algorithm with the same convex closed set K ⊆ Rd , Legendre regularizer R : A →
Rd and the same learning rate η > 0, running on the same sequence {`t }∞
t=1 of linear loss
functions.
(c) Give an example of a bounded K and R such that for any η, {`t }∞ t=1 the algorithms
produce the same sequences of solutions w1 , w2 , . . . , wn+1 . (Hint: Express the Expo-
nentially Weighted Average forectaster both as FTRL and as a Proximal Point
Algorithm.)
(d) Redo (a), (b) and (a) with linearized versions of the algorithms and arbitrary convex
differentiable sequence of {`t }∞
t=1 loss functions.
60
Chapter 10
Least Squares
Linear least squares method is the single most important regression problem in all of statistics
and machine learning. In this chapter we consider the online version of the problem. We
will assume that, in round t, the learner receives a vector xt ∈ Rd , predicts ybt = hwt , xt i,
yt − yt )2 . More compactly, the learner chooses
receives feedback yt ∈ R and suffers a loss 12 (b
wt ∈ Rd and suffers loss `t (wt ) where `t : Rd → R is a loss function defined as
1
`t (w) = (hw, xt i − yt )2 .
2
For dimension d ≥ 2 the loss function `t is not strongly convex; not even strictly convex. To
see that note consider any non-zero w that is perpendicular to xt and note that `t (αw) = 0 for
any α ∈ R. In other words, the loss function is flat (constant) along the line {αw : α ∈ R}.
Our goal is to design an online algorithm for this problem with O(log n) regret under
some natural assumptions on {(xt , yt )}∞
t=1 ; we state those assumptions later. We have seen
O(log n) regret bounds for problems where `t were strongly convex. Despite that in our
case the loss functions are not strongly convex, we show that Follow The Regularized
Leader (FTRL) algorithm with quadratic regularizer R(w) = 21 kwk22 has O(log n) regret.
Recall that in round t + 1 FTRL algorithm chooses
" t #
X 1
wt+1 = argmin η `s (w) + kwk2
w∈R d
s=1
2
where η > 0 is the learning rate. This minimization problem—both in online learning
and classical off-line optimization—is called regularized (linear) least squares problem or
sometimes ridge regression.
10.1 Analysis
We denote by Lt (u) the sum of the first t loss functions, by L
bt the sum of the losses of
η
the FTRL algorithm in the first t rounds, and by Lt (u) the objective function that FTRL
61
minimizes in round t + 1. Formally, for any u ∈ Rd and any 1 ≤ t ≤ n
t
X t
X t
X
Lt (u) = `s (u) L
bt = `s (ws ) Lηt (u) =η `s (u) + R(u) .
s=1 s=1 s=1
Proof. Since the minimization is unconstrained ∇Lηt (wt+1 ) = 0. Therefore, for any u ∈ A
DLηt (u, wt+1 ) = Lηt (u) − Lηt (wt+1 ) (10.1)
= Lηt−1 (u) + η`t (u) − Lηt (wt+1 ) .
Rearranging, for any u ∈ A
η`t (u) = DLηt (u, wt+1 ) + Lηt (wt+1 ) − Lηt−1 (u) (10.2)
Substituting u = wt we get
η`t (wt ) = DLηt (wt , wt+1 ) + Lηt (wt+1 ) − Lηt−1 (wt ) (10.3)
Subtracting (10.2) from (10.3) we get
η (`t (wt ) − `t (u)) = DLηt (wt , wt+1 ) − DLηt (u, wt+1 ) + Lt−1 (u) − Lt−1 (wt )
= DLηt (wt , wt+1 ) + DLηt−1 (u, wt ) − DLηt (u, wt+1 )
where in the last step we have used (10.1) with t shifted by one. If we sum both sides of the
last equation over all t = 1, 2, . . . , n, the differences DLηt−1 (u, wt ) − DLηt (u, wt+1 ) telescope.
The observation that Lη0 = R finishes the proof.
Unsurprisingly, the lemma can be used to upper bound the regret:
n
bn − Ln (u) ≤ 1 DR (u, wn ) + 1
X
L D η (wt , wt+1 ) . (10.4)
η η t=1 Lt
The first divergence is easy to deal with: DR (u, wt ) = 21 ku − w1 k22 = 12 kuk22 . The main
challenge to is to calculate the divergences DLηt (wt , wt+1 ). We do it in the following lemma.
62
Lemma 10.2 (Ridge regression). Let {(xt , yt )}∞ d
t=1 be any sequence, xt ∈ R and yt ∈ R. Let
`t (w) = 12 (hw, xt i − yt )2 be the corresponding sequence of loss functions. Consider the ridge
regression FTRL algorithm wt+1 = argminw∈A Lηt (w). Then,
η2
`t (wt ) xt , A−1
DLηt (wt , wt+1 ) = t x t
2
where
t
X
At = I + η xs x>
s .
s=1
Proof. Define Mt = xt x> t and vt = −yt xt . Using this notation, we can write the loss function
`t (u) as a quadratic form:
1 1 2
`t (u) = hu, xt x>
t ui − hu, yt xt i + yt
2 2
1 1 2
= hu, Mt ui + hu, vt i + yt
2 2
In order to understand DLηt (wt , wt+1 ) we first need to understand the underlying Legendre
function Lηt (u). We can write rewrite by using the quadratic form for `t (u)
t
X 1
Lηt (u) =η `s (u) + kuk2
s=1
2
t
* t
+ * t
! +
X
2
X 1 X
=η ys + η vs , u + u, I +η Ms u
s=1 s=1
2 s=1
1
= Ct + hVt , ui + hu, At ui . (10.5)
2
where Ct = ts=1 ys2 and Vt = η ts=1 vs . Using (10.5) we express the Bregman divergence
P P
DLηt (u, v) as
1
DLηt (u, v) = hu − v, At (u − v)i . (10.6)
2
From ∇Lηt (wt+1 ) = 0 = ∇Lηt−1 (wt ) and (10.5) we derive the following equalities:
Mt wt + vt = xt (hxt , wt i − yt ) (10.9)
63
which follows by definition of Mt and vt . Starting from we calculate (10.6) the Bregman
divergence:
1
DLηt (wt , wt+1 ) = hwt+1 − wt , At (wt+1 − wt )i
2
1
= hwt − wt+1 , η(Mt wt + vt )i by (10.7)
2
η
= hwt − wt+1 , (Mt wt + vt )i
2
η
= hwt − wt+1 , xt (hxt , wt i − yt )i by (10.9)
2
η
= (yt − hxt , wt i)hwt+1 − wt , xt i
2
η2
= − (yt − hxt , wt i)hMt wt + vt , A−1t xt i by (10.8)
2
η2
= (yt − hxt , wt i)2 hxt , A−1
t xt i by (10.9)
2
η2
= `t (wt )hxt , A−1
t xt i
2
xt , A−1
Ln − Ln (u) ≤
b + t xt . (10.10)
2η 2 t=1
Lemma 10.3 (Matrix Determinant Lemma). Let B be a d × d positive definite matrix, let
x ∈ Rd and define A = B + xx> . Then,
det(B)
hx, A−1 xi = 1 − .
det(A)
Proof. Note that since B is positive definite and xx> is positive semi-definite, A must be
positive definite and therefore also invertible. We can write
and hence
det(B) = det(A) det(I − A−1 xx> ) .
64
We focus on the later term and use that A has a square root A1/2
det(I − A−1 xx> ) = det(A1/2 ) det(I − A−1 xx> ) det(A−1/2 )
= det(A1/2 (I − A−1 xx> )A−1/2 )
= det(I − A−1/2 xx> A−1/2 )
= det(I − (A−1/2 x)(A−1/2 x)> )
= det(I − zz > )
Now we use that 1 − x ≤ − ln x for any real number x > 0 and get that
n n
X
−1
X det(At )
η xt , At xt ≤ ln = ln (det(An )) .
t=1 t=1
det(At−1 )
65
In order to upper bound the determinant of An we use that kxt k2 ≤ X and derive from that
an upper bound on the trace of An :
n
X
tr(An ) = tr(I) + η tr(xt x> 2
t ) ≤ d + ηX n .
t=1
In the derivation, we have used that trace is linear and that tr(xt x> 2
t ) = tr(hxt , xt i) = kxt k2 .
Since An is positive definite, if tr(An ) ≤ C then ln(det(An )) ≤ d ln(C/d) and therefore
ηX 2 n
ln(det(An )) ≤ d ln 1 +
d
bn − Ln (u) ≤ L
L b+ + R(u) − R(w1 ) .
bn − L
n
η
Here L bn = Pn `t (wt ) is the loss of the algorithm, L b+ = Pn `t (wt+1 ) is the loss of the
t=1 Pn n t=1
“cheating” algorithm, and Ln (u) = t=1 `t (u) is the sum of the loss functions.
We see that in order to upper bound the regret, we need to upper bound the differences
`t (wt ) − `t+1 (wt+1 ). We do it in the following lemma.
66
Lemma 10.5. The FTRL with convex closed K, regularizer R(w) = 21 kwk22 , learning rate
η > 0, and loss functions of the form `t = ct + κt hw, vt i + β2 hw, vt vt> wi where ct , κt ∈ R,
vt ∈ Rd and β > 0 satisfies, for any t ≥ 1
`t (wt+1 ) − `t (wt ) ≤ ηh∇ `t (wt ), A−1
t ∇ `t (wt )i
where
t
X
At = I + ηβ vt vt> .
s=1
67
To finish the proof of the first part of the lemma (the inequality) it remains to show that
h∇ `t (wt ), A−1
t dt i is non-negative.
By optimality of wt and wt+1 , for any w ∈ K
Substituting wt for w in the first inequality, and substituting wt+1 for w in the second
inequality, we get
where in the last step we have used that At is positive definite. We have thus derived that
ηhdt , A−1
t ∇ `t (wt )i ≥ 0 .
Theorem 10.6 (Regret of Projected Ridge Regression). Let β > 0 and G ≥ 0. Let {`t }nt=1
be a sequence of loss functions of the form `t = ct + κt hw, vt i + β2 hw, vt vt> wi where ct , κt ∈ R,
vt ∈ Rd such that kvt k2 ≤ G for all 1 ≤ t ≤ n. The regret of FTRL on {`t }nt=1 with convex
closed K ⊆ Rd , regularizer R(w) = 21 kwk22 and learning rate η > 0 is upper bounded, for any
u ∈ K, as
kuk22 dBn ηβG2 n
Ln − Ln (u) ≤
b + ln 1 +
2η β d
where
Bn = max (κt + βhvt , wt i)2 .
1≤t≤n
Proof. First, we express the upper bound ηh∇ `t (wt ), A−1 t ∇ `t (wt )i from the previous lemma
>
using that ∇ `t (w) = κt vt + βvt vt w = (κt + βhvt , wi)vt as
68
We can upper bound the regret as
bn − Ln (u) ≤ L b+
bn − L R(u) − R(w1 )
L n + (Corollary 8.12)
η
n
kuk22 X
≤ + (`t (wt ) − `t (wt+1 )) (R(w1 ) ≥ 0)
2η t=1
n
kuk22 X
≤ +η h∇ `t (wt ), A−1
t ∇ `t (wt )i (by Lemma 10.5)
2η t=1
n
kuk22 X
= +η h(κt + βhvt , wt i)2 hvt , A−1
t vt i by (10.14)
2η t=1
n
kuk22 X
≤ + ηBn hvt , A−1
t vt i
2η t=1
Using the matrix determinant lemma, we can deal with the terms hvt , A−1
t vt i:
p p det(At−1 )
ηβhvt , A−1
t vt i = h ηβv t , A−1
t ( ηβvt )i = 1 − .
det(At )
Therefore, using 1 − x ≤ ln x for any x > 0 we have
2 n
bn − Ln (u) ≤ kuk 2 Bn
X det(A t−1 )
L + 1−
2η β t=1 det(At )
n
kuk22 Bn X
det(At )
≤ + ln
2η β t=1 det(At−1 )
kuk22 Bn
= + ln(det(An ))
2η β
It remains to upper bound ln(det(An )). Since, from kvt k2 ≤ G we can derive upper bound
on the trace of An
n
X
tr(An ) = tr(I) + βη tr(vt vt> ) ≤ d + βηG2 n
t=1
βηG2 n
ln(det(An )) ≤ d ln 1 + .
d
Supppose that we assume that the decision set K is bounded and κt ≤ κ for all t. Then,
using the assumption that kvt k2 ≤ G and from that wt ∈ K, we can give an upper bound
on Bn which does not depend on n.
69
10.3 Directional Strong Convexity
We can generalize the analysis to a more general class of loss functions.
Definition 10.7 (Directional Strong Convexity). Let f : K → R be a differentiable function
defined on a convex set K ⊆ Rd . Let β ≥ 0 be a real number. We say that f is β-directionally
strongly convex if for any x, y ∈ K
β
f (y) ≥ f (x) + h∇ f (x), y − xi + h∇ f (x), y − xi2 .
2
We now extend ridge regression to arbitrary directionally strongly convex loss functions.
The idea is similar to that of linearized loss. However, instead of linearizing the loss functions,
we approximate them with quadratic functions of rank one. Consider a sequence {`t }∞ t=1 of
β-directionally strongly convex loss functions defined on a convex closed set K ⊆ Rd . We
define the FTRL with quadratic rank-one approximations:
t
!
X 1
wt+1 = argmin η `es (w) + kwk22
w∈K s=1
2
where
β
`es (w) = `s (ws ) + h∇ `s (ws ), w − ws i + h∇ `s (ws ), w − ws i2 .
2
To analyze the regret of we use a similar trick as for linearized losses (Lemma 9.7).
Proposition 10.8. If `t : K → R is β-directionally strongly convex then for any u ∈ K
`t (wt ) − `t (u) ≤ `et (wt ) − `et (u) .
Proof. Since `t (wt ) = `et (wt ) the inequality is equivalent to `et (u) ≤ `et (wt ) and it follows from
directional strong convexity of `t .
If we define Ln (u) = nt=1 `t (u) and L bn = Pn `t (wt ) we see that
P
t=1
n
X
bn − Ln (u) ≤
L (`et (wt ) − `et (u))
t=1
Combining this inequality with Theorem 10.6 applied to the sequence {`et }nt=1 of approxi-
mated losses, we can upper the regret of the algorithm.
Theorem 10.9. Let β > 0 and G ≥ 0. Let {`t }nt=1 be a sequence of β-directionally strongly
convex loss functions defined on a convex closed set K ⊆ Rd such that k ∇ `t (w)k ≤ G
for any w ∈ K and any 1 ≤ t ≤ n. The regret of FTRL with quadratic rank-
one approximations on the sequence {`t }nt=1 with learning rate η > 0 and regularizer
R(w) = 21 kwk22 is upper bounded for any u ∈ K as
2 2
bn − Ln (u) ≤ kuk 2 dB n ηβG n
L + ln 1 +
2η β d
where
Bn =??? .
70
10.4 Exercises
Exercise 10.1. Let A be a d × d positive definite matrix. Show that tr(A) ≤ C implies
that ln(det(A)) ≤ C ln(C/d).
Exercise 10.2. Show that the ridge regression algorithm can be implemented in O(d2 ) time
per time step. (Hint: Use Sherman-Morrison formula for rank-one updates of the inverse of
a matrix.)
71
Chapter 11
Exp-concave Functions
Proof. Let hα (x) = exp(−αg(x)) and let hβ (x) = exp(−αg(x)) is concave. By assumption
hα is concave. We need to prove that hβ is also concave. Noting that hβ (x) = (hα (x))β/α ,
we have
Proof. Let α = H/G2 and h(x) = exp(−αg(x)). We calculate the Hessian of h and show
that it is negative semi-definite:
∂
∇2 h(x) = (∇h(x))
∂x
∂
= (−α∇g(x) exp(−αg(x)))
∂x
= α∇2 g(x) exp(−αg(x)) + α2 ∇g(x)∇g(x)> exp(−αg(x))
= αh(x) α∇g(x)∇g(x)> − ∇2 g(x)
72
Since αh(x) is a positive scalar, it remains to show that the matrix α∇g(x)∇g(x)> − ∇2 g(x)
is negative semi-definite. We show that all its eigenvalues are non-positive. Using that
λmax (A + B) ≤ λmax (A) + λmax (B) for any symmetric matrices A, B, we have
λmax α∇g(x)∇g(x)> − ∇2 g(x) ≤ λmax α∇g(x)∇g(x)> + λmax −∇2 g(x)
≤ αG2 − H
≤0
where we have used that the only non-zero eigenvalue of vv > is kvk2 with associated eigen-
vector v.
Lemma 11.4. Let G, D, α be positive reals. Let g : K → R be an α-exp-concave function
such that
1. k∇g(x)k2 ≤ G
2. ∀x, y ∈ K, kx − yk2 ≤ D.
Then, g is 21 min α, GD
1
-directionally strongly convex.
1
Proof. Let γ = min α, GD . The function h(x) = exp(−γg(x)) is concave. Thus,
h(x) ≤ h(y) + h∇h(y), x − yi
= h(y) + h−γ∇g(y)h(y), x − yi
= h(y) [1 − γh∇g(y), x − yi]
= exp(−γg(y)) [1 − γh∇g(y), x − yi]
Taking logarithm and dividing by −γ we get
1
g(x) ≥ g(y) − ln [1 − γh∇g(y), x − yi] .
γ
z2
We use that − ln(1 − z) ≥ z + 4
for any z ∈ [−1, 1). For z = γh∇g(y), x − yi we obtain
γ
g(x) ≥ g(y) + h∇g(y), x − yi + (h∇g(y), x − yi)2 .
4
It remains to verify that z lies in the interval [−1, 1). The quantity exp(−γg(y))[1 − z] is
positive, since it is lower-bounded by a positive quantity h(x). Because exp(−γg(y)) is also
positive, [1 − z] is positive. Equivalently, z < 1. Furthermore,
|z| = γ|h∇g(y), x − yi|
≤ γk∇g(y)k2 · kx − yk2
≤ γGD
≤1.
This means that z ≥ −1.
73
11.1 Exercises
Exercise 11.1. Show that if a function is α-exp-concave for some α > 0 then it is convex.
z2
Exercise 11.2. Prove that − ln(1 − z) ≥ z + 4
holds for any z ∈ [−1, 1).
74
Chapter 12
where L bn = Pn `t (wt ) is the loss of algorithm and Ln (u) = Pn `t (u) is the sum of the
t=1 t=1
loss functions. The rest of the chapter is devoted to the careful investigation of the terms
DR (wt , wet+1 ).
75
Definition 12.1 (Legendre dual). Let R : A → R be a Legendre function. Let A∗ =
{∇ R(v) : v ∈ A}. The (Legendre) dual of R, R∗ : A∗ → R, is defined by
The following statement, which is given without proof, follows from the definition.
(ii) R∗∗ = R.
The inverse of the gradient of a Legendre function can be obtained as the gradient of the
function’s dual. This is the subject of the next proposition.
Lemma 12.4. Let R : A → R be Legendre function and R∗ : A∗ → R its Legendre dual. Let
u ∈ A and u0 ∈ A∗ . The following two conditions are equivalent:
2. u0 = ∇ R(u).
Proof. Fix u, u0 . Define the function G : A → R, G(v) = hv, u0 i − R(v). Using the function
G, the first condition can be written as
Since, by definition, R∗ (u0 ) = supv∈A G(v), and G is strictly concave, (12.1) holds if and only
if
∇ G(u) = 0 .
This can be equivalently written as
u0 − ∇ R(u) = 0 ,
76
Proof of Proposition 12.3. Pick u ∈ A and define u0 = ∇R(u). By the previous lemma we
have
R(u) + R∗ (u0 ) = hu, u0 i .
Since R∗∗ = R, we have
R∗∗ (u) + R∗ (u0 ) = hu, u0 i .
Applying the previous lemma to R∗ in place of R, we have
u = ∇R∗ (u0 ) .
This shows that u = ∇ R∗ (∇ R(u)). In other words, ∇ R∗ is the right inverse of ∇ R. Since
∇ R is a surjection (a map onto A∗ ), ∇ R∗ must be also its left inverse.
Equipped with this result, we can prove the following important result which connects
the Bregman divergences underlying a Legendre function and its dual.
Proof. Let u0 = ∇ R(u) and v 0 = ∇ R(v). By Lemma 12.4, R(u) = hu, u0 i − R∗ (u0 ) and
R(v) = hv, v 0 i − R∗ (v 0 ). Therefore,
Furthermore, it is not hard to show that if 1 ≤ p ≤ ∞ then the dual of Rp (w) = 21 kwk2p is
Rq = 12 kwk2q where q satisfies p1 + 1q = 1. Thus, we are left with studying the properties of
DR∗ .
77
12.2 p-Norms and Norm-Like Divergences
We now show that for p ≥ 2, DRp behaves essentially like the p-norm.
Definition 12.6 (Norm-like Bregman divergence). Let R : A → R be a Legendre function
on a domain A ⊂ Rd , let k · k be a norm on Rd and let c > 0. We say that the Bregman
divergence associated with R is c-norm-like with respect to k · k, if for any u, v ∈ A,
Our goal is to show that R ≡ Rp where Rp (w) = 21 kwk2p is (p−1)/2-norm-like with respect
to the p-norm. Recall that for any p ≥ 1, the p-norm of a vector x = (x1 , x2 , . . . , xd )> ∈ Rd
is defined as !1/p
X d
kxkp = |xi |p .
i=1
The definition can be extended to p = ∞ by defining kxk∞ = max1≤i≤d |xi |. We will need
Hölder’s inequality, which is a generalization of Cauchy-Schwarz inequality.
1 1
Lemma 12.7 (Hölder’s inequality). Fix 1 ≤ p, q ≤ ∞ such that p
+ q
= 1. Then, for any
x, y ∈ Rd it holds that
|hx, yi| ≤ kxkp · kykq .
A pair (p, q) ∈ [1, ∞] × [1, ∞] satisfying p1 + 1q = 1 is called a conjugate pair. For example
(1, ∞), (p, q) = (2, 2) and (p, q) = (3, 32 ) are conjugate pairs.
In order to upper bound DR we start with a simple application of Taylor’s theorem. We
keep the argument more general so that they can be reused in studying cases other than
R = Rp . In order to be apply Taylor’s theorem, we will assume that R is twice-differentiable
in A◦ . Since DR (u, v) is the difference between R(u) and its first order Taylor’s expansion at
v, the divergence is nothing else but the remainder of the second order Taylor’s expansion
of R. Thus, there exists ξ on the open line segment between u and v such that
1
u − v, ∇2 R(ξ)u − v .
DR (u, v) =
2
Lemma 12.8. Suppose φ : R → R, ψ : R → R are twice-differentiable
P and ψ is concave.
d
Consider R : Rd → R defined by R(ξ) = ψ i=1 φ(ξi ) for any ξ ∈ Rd . Then, for any
x ∈ Rd !
d
X d
X
0
2
φ00 (ξi )x2i .
x, ∇ R(ξ)x ≤ ψ φ(ξi )
i=1 i=1
Proof. Let ei be i-th vector of the standard orthogonal basis of Rd . The gradient of R is
d
! d
X X
∇ R(ξ) = ψ 0 φ(ξi ) φ0 (ξi )ei .
i=1 i=1
78
The Hessian of R is
∂
∇2 R(ξ) = (∇ R(ξ))
∂ξ
d
! d !
∂ X X
= ψ0 φ(ξi ) φ0 (ξi )ei
∂ξ i=1 i=1
d
!" d
!# d
! d
X
0 ∂ 0
X
0
X ∂ X 0
= φ (ξi )ei ψ φ(ξi ) +ψ φ(ξi ) φ (ξi )ei
i=1
∂ξ i=1 i=1
∂ξ i=1
d
! d
! d d
! d
X X X X X
= φ0 (ξi )ei ψ 00 φ(ξi ) φ0 (ξi )e>i + ψ 0
φ(ξi ) φ00 (ξi )ei e>
i
i=1 i=1 i=1 i=1 i=1
d
! d
!2 d
! d
X X X X
00 0 0
2
hx, ∇ R(ξ)xi = ψ φ(ξi ) φ (ξi )xi +ψ φ(ui ) φ00 (ξi )x2i
i=1 i=1 i=1 i=1
d
! d
X X
≤ ψ0 φ(ui ) φ00 (ξi )x2i .
i=1 i=1
We are now ready to state the upper bound on the Bregman divergence underlying Rp :
Proposition 12.9. Let p ≥ 2 and R(u) = Rp (u) = 21 kuk2p . Then the following bound holds
for the Bregman divergence DR :
p−1
DR (u, v) ≤ ku − vk2p .
2
Proof. Fix u, v ∈ Rd . Clearly, R is twice differentiable on Rd . As explained above, DR (u, v) =
1
u − v, ∇2 R(ξ)u − v for some ξ lying on the open line segment between u and v. We apply
2
the previous lemma to ψ(z) = 12 z 2/p and φ(z) = |z|p . (Note that for p ≥ 2, the function ψ is
concave.) Then,
1 2−p
ψ 0 (z) = z p , φ0 (z) = sign(z)p|z|p−1 , φ00 (z) = p(p − 1)|z|p−2 .
p
79
The previous lemma for x = u − v gives,
1
DR (u, v) = hu − v, ∇2 R(ξ)(u − v)i
2
d
! 2−p
p d
1 X
p
X
≤ |ξi | p(p − 1) |ui |p−2 (ui − vi )2
2p i=1 i=1
d
p−1 2−p
X
= kξkp |ξi |p−2 (ui − vi )2
2 i=1
d
! p−2
p
p−1 2−p
X
p
≤ kξkp |ξi | ku − vk2p (Hölder’s inequality)
2 i=1
p−1
= kξk2−p
p · kξkp−2
p · ku − vk2p
2
p−1
= ku − vk2p .
2
80
Proof. We know that
DR (w, w
et+1 ) = DR∗ (∇R(w et+1 ), ∇R(wt ))
≤ ck∇R(w et+1 ) − ∇R(wt )k2
= cη 2 k∇`t (wt )k2 (Proposition 9.8 with ηt = η)
2
≤ cαη `t (wt )
x2 ≤ 4A(x + Ln ) .
81
q−1
Proof. First note that R∗ = Rq and therefore DR∗ by Proposition 12.9 is 2
-norm-like with
respect q-norm. Second, since w
e1 = 0
1
e1 ) = kuk2p .
DR (u, w1 ) ≤ DR (u, w
2
Third, for α = 2Xq2 we get
Consider the situation when a u with a small loss Ln (u) is sparse and xt ’s are dense. For
example, assume that u is a standard unit vector and xt ∈ {+1, −1}d . Then it’s a good
idea to choose q √ √ consequently p ≈ 1. Then kxt kq ≈ kxt k∞ = 1, kukp = 1 and
= log d and
√ p q−1 ≈ √
therefore Xq kuk log d. If we√were to choose p = q = 2 we would get kukq = 1
but kxt kq = d and Xq kukp q − 1 ≈ d. Thus, q = log d is exponentially better choice
than q = 2.
On the other hand, consider the situation when a u with a small loss Ln (u) is dense and
xt ’s are sparse. For example assume that standard unit vectors and u ∈ {+1, −1}d .
√ xt ’s are √
Then, if we use p = q = 2, then kukp Xq q − 1 = d. If √ √ q = log d and p ≈ 1
were to chooses
then kukp ≈ kuk√ 1 = d, kxt kq = 1 and therefore Xq kukp q − 1 ≈ d log d. Thus p = q = 2
is more than d better choice than q = log d.
12.4 Exercises
Exercise 12.2. Let Rp (u) = 21 kuk2p for some 1 ≤ p ≤ ∞. Show that the Legendre dual of
Rp (u) is Rq (v) = 12 kvk2p where q satisfies p1 + 1q = 1.
Pd
Exercise 12.3. Let R(u) = u
i=1 ei . Show that the Legendre dual of R(u) is R∗ (v) =
Pd
i=1 vi (ln(vi ) − 1).
Exercise 12.4.
82
√
• Show that kxk∞ ≤ kxk2 ≤ dkxk∞ for any x ∈ Rd .
• Show that there exists c > 0 such that for any integer d ≥ 3 if x ∈ Rd and q = ln d
then kxk∞ ≤ kxkq ≤ ckxk∞ . Find the smallest possible value of c.
• Show that there does not exist c > 0 such that for any integer d ≥ 1 if x ∈ Rd then
kxk∞ ≤ kxk2 ≤ ckxk∞ .
Morale: This shows that the infinity-norm is well approximated by (ln d)-norm but only very
poorly approximated by the 2-norm.
Exercise 12.5. Show that kxkp ≤ kxkq for any x ∈ Rd and any 1 ≤ q ≤ p ≤ ∞.
83
Chapter 13
In this chapter, we consider the Proximal Point Algorithm with Linearized Losses
with the unnormalized negative entropy regularizer
d
X
R(w) = wi ln(wi ) − wi
i=1
defined on A = (0, ∞)d and sequence of convex loss functions `1 , `2 , . . . , `n defined the prob-
ability simplex d-dimensional probability simplex
( d
)
X
K = ∆d = w ∈ Rd : wi = 1 and ∀1 ≤ i ≤ d, wi ≥ 0 .
i=1
If the loss functions were linear, we would recover Exponentially Weighted Averages
Forecaster (EWA). (See Exercises 8.1 and 9.3.) Our goal is however to consider non-linear
loss functions such as `t (w) = 21 (hw, xt i − yt )2 . More generally, we consider loss functions
`1 , `2 , . . . , `n for which there exists α > 0 such that k ∇ `t (w)k2∞ ≤ α`t (w) for all 1 ≤ t ≤ n
and all w.
For linear prediction problems where the task is to predict yt from xt by using a linear
predictor ybt = hwt , xt i it is somewhat unnatural to restrict wt to the probability simplex ∆d .
By introducing extra dimensions we can extended the analysis to K 0 = {w : kwk1 ≤ c}.
See Exercise 13.1.
Recall that Proximal Point Algorithm with Linearized Losses in round t + 1
chooses h i
wt+1 = argmin η `et (w) + DR (w, wt )
w∈K∩A
84
Pd
be the unprojected solution. Since R(w) = i=1 wi ln(wi ) − wi and K = ∆d the update can
written as
et+1,i = wt,i · exp(−η ∇i `t (wt ))
w for 1 ≤ i ≤ d,
wet+1
wt+1 = .
kw
et+1 k1
where ∇i denotes the i-th component of the gradient. The resulting algorithm is, for an
obvious reason, called Exponentiated Gradient Algorithm (EG). The update for EG
can be easily derived from the first two parts of the following proposition. We leave its proof
as an exercise for the reader.
Proposition 13.1 (Properties
Pd of Negative Entropy Regularizer). Let K = ∆ and for any
d
w ∈ (0, ∞) let R(w) = i=1 wi ln(wi ) − wi . Then,
w
1. ΠR,K (w) = kwk1
for any w ∈ (0, ∞)d .
p
bn − Ln (u) ≤ 2α ln(d) +
L 2α ln(d)Ln (u) .
Proof. By the same calculation as in Lemma 9.2 we know that for any u ∈ ∆d
η(`t (wt ) − `t (u))
≤ DR (u, wt ) − DR (u, w
et+1 ) + DR (wt , w
et+1 )
= (DR (u, wt ) − DR (u, wt+1 )) + (DR (u, wt+1 ) − DR (u, w
et+1 )) + DR (wt , w
et+1 ) (13.1)
We deal with each of the three terms separately. We leave the first term as it is. We express
the third term using
et+1 ) = hwt , ∇R(wt ) − ∇R(w
DR (wt , w et+1 )i
= ηhwt , ∇ `t (wt )i . (13.2)
where the last equality follows by Proposition 9.8. We start expressing the second term as
DR (u, wt+1 ) − DR (u, w
et+1 ) = hu, ∇R(u) − ∇R(wt+1 ) − ∇R(u) + ∇R(w et+1 )i (by Proposition 13.1)
= hu, ∇R(w et+1 ) − ∇R(wt+1 )i
Xn
= ui (∇i R(w et+1 ) − ∇i R(wt+1 )i)
i=1
85
and since
w
et+1,i wet+1,i
∇i R(w
et+1 ) − ∇i R(wt+1 ) = ln = ln = ln kw
et+1 k1
wt+1,i w et+1 k1
et+1,i /kw
we see that the second term equals
n
X
DR (u, wt+1 ) − DR (u, w
et+1 ) = et+1 ) − ∇i R(wt+1 )i)
ui (∇i R(w
i=1
n
X
= ln kw
et+1 k1 ui
i=1
d
X
= ln kw
et+1 k1 (since ui = 1)
i=1
d
!
X
= ln w
et+1,i
i=1
d
!
X
= ln wt,i · exp(−η ∇i `t (wt ))
i=1
η2
k∇`t (wt )k2∞
≤ −ηhwt , ∇`t (wt )i + (13.3)
2
where the last inequality follows from Hoeffding’s lemma applied to the random variable X
with distribution Pr[X = ∇i `t (wt )] = wt,i .
Returning back to (13.1) and substituting (13.2) and (13.3) we get
η2
η(`t (wt ) − `t (u)) ≤ DR (u, wt ) − DR (u, wt+1 ) + k ∇ `t (wt )k2∞
2
η2
≤ DR (u, wt ) − DR (u, wt+1 ) + α`t (wt ) .
2
Summing over all t = 1, 2, . . . , n the first two terms telescope. If we drop −DR (u, wn+1 ) and
divide by by η we get
bn − Ln (u) ≤ 1 DR (u, w1 ) + ηα L
L bn
η 2
Since w1,i = d1 for all i = 1, 2, . . . , d the first term can be upper bounded using DR (u, w1 ) ≤
ln(d) and thus
bn − Ln (u) ≤ ln d + ηα L
L bn .
η 2
q
Choosing η = 2αln d
b which minimizes the right hand side, we get
L
n
q
bn − Ln (u) ≤ 2α ln(d)L
L bn
Using the quadratic inequality trick, as in the proof of Theorem 12.10, we get the result.
86
For loss functions of the form `t = 12 (hw, xt i − yt )2 we have k∇`t (w)k2∞ = 2`t (w)kxt k2∞
and thus we can take α = 2 max1≤t≤n kxt k2∞ .
13.1 Exercises
Exercise 13.1. Consider the linear prediction problem where we want to predict yt ∈ R
from xt ∈ Rd by using a linear predictions ybt = hwt , xt i where wt ∈ K and
K = {w ∈ Rd : kwk1 ≤ 1} .
and loss that we suffer in each round is `t (wt ) where `t (w) = 12 (yt − hw, xt i)2 .
Show that EG algorithm on the (2d + 1)-dimensional probability simplex can be used to
solve this problem. What regret bound do you get? Generalize the result to the case when
K = {w ∈ Rd : kwk1 ≤ c} .
for some c > 0. How does regret bound changes? How does it depend on c?
(Hint: Define x0t = (xt,1 , xt,2 , . . . , xt,d , 0, −xt,1 , xt,1 ) and `0t (w0 ) = 21 (yt − hw, x0t i)2 for w0 ∈
∆2d+1 .)
87
Chapter 14
In this chapter we connect online learning with the more traditional part of machine learning—
the statistical learning theory. The fundamental problem of statistical learning theory is the
off-line (batch) learning, where we are given a random sample and the goal is to produce a
single predictor that performs well on future, unseen data. The sample and the future data
are connected by the assumption that they both are drawn from the same distribution.
More formally, we consider the scenario where we are given a hypothesis space H and a
independent identically distributed (i.i.d.) sequence of loss functions {`t }∞
t=1 where `t : H →
R for each t ≥ 1. The elements of H are called either hypotheses, predictors, classifiers or
models depending on the context. We will denote a typical element of H by w or W with
various subscripts and superscripts.
Example 14.1 (Linear Prediction).
1
H = Rd `t (w) = (hw, Xt i − Yt )2
2
(Xt , Yt ) ∈ Rd × R {(Xt , Yt )}∞t=1 is i.i.d.
H = Rd `t (w) = I{sign(hw, Xt i) 6= Yt }
(Xt , Yt ) ∈ Rd × {−1, +1} {(Xt , Yt )}∞
t=1 is i.i.d.
The goal in statistical learning is to find w that has small risk. The risk is defined as the
expected loss. The risk captures the performance of w on future data. Formally, the risk of
88
a hypothesis w ∈ H is defined to be
`∗ = inf `(w) .
w∈H
A learning algorithm gets as input (the description of) the loss functions `1 , `2 , . . . , `n
and possibly some randomness if it is a randomized algorithm, and outputs a hypothesis
fn ∈ H. Note that even for a deterministic algorithm its output W
W fn is random, since the
1
input was random to start with. The goal of the algorithm is to minimize excess risk
fn ) − `∗ .
`(W
where A is the class of all online randomized algorithms. An algorithm is consider asymp-
totically optimal if rn (A, P) = O(rn∗ (P)).
1
To distinguish random and non-random elements of H, we denote the non-random elements by w and
random elements W (decorated with various subscripts and super-scripts).
89
Typical results proved in statistical learning theory are a priori generalization bounds for
specific algorithms. These are typically high-probability bounds on the excess risk. Often,
these bounds do not depend on P and are thus called distribution-free bounds.
Second question question studied in statistical learning theory is to give a computable
upper bound on `(Wn ) based on empirical data. Such bounds are called data-dependent
bounds a posteriori bounds. These bounds are good for evaluation purposes and they help
us to choose from hypotheses produced by various algorithms (or the same algorithms with
different settings of parameters).
WUn . (ii) In expectation, the excess loss of WUn is upper bounded by the average per-step
regret of the algorithm. (iii) The conversion is applicable regardless of whether the loss
functions are convex or not. The next theorem formalizes the first two properties.
1
E[`(WUn )] − min `(w) ≤ E [Rn ] .
w∈H n
Proof. Since Wt and `t are independent, E[`(Wt )] = E[`t (Wt )]. The first part of the theorem
90
then follows by straightforward calculation:
" n #
X
E[`(WUn )] = E I{Un = t}`(Wt )
t=1
n
X
= E I{Un = t}`(Wt )
t=1
Xn
= E [I{Un = t}] · E[`(Wt )] (by independence of Un and Wt )
t=1
n
1X
= E[`(Wt )]
n t=1
n
1X
= E[`t (Wt )] (by independence of `t and Wt )
n t=1
" n #
1X
=E `t (Wt )
n t=1
From the first part of the theorem we see that the second part of the theorem, is equivalent
to " #
n
1 X
E inf `t (w) ≤ min `(w) .
n w∈H
t=1
w∈H
91
Pn Pn
Furthermore, if Rn = t=1 `t (Wt ) − inf w∈H t=1 `t (w) denotes the regret, then
1
E[`(W n )] − min `(w) ≤ E [Rn ] .
w∈H n
Proof. Let `n+1 be an independent copy of `1 (independent of `1 , `2 , . . . , `n ). Then, since
`n+1 is convex, by Jensen’s inequality
n
! n
1X 1X
`n+1 (W n ) = `n+1 Wt ≤ `n+1 (Wt ) .
n t=1 n t=1
The first part of the theorem follows from that E[`n+1 (W n )] = E[`(W n )] which in turn
follows by independence of W n and `n+1 , and from that for any t = 1, 2, . . . , n, E[`n+1 (Wt )] =
E[`t (Wt )] which in turn follows by independence of Wt , `t , `n+1 .
The first part of theorem implies that in order to prove the second part of theorem, it
suffices to prove that " #
n
1 X
E inf `t (w) ≤ min `(w) .
n w∈H
t=1
w∈H
" n
# " n #
1 X 1 X
E inf `t (w) ≤ inf E `t (w)
n w∈H
t=1
n w∈H t=1
E[Xt+1 | Y0 , Y1 , Y2 , . . . , Yt ] = Xt . (14.1)
A typical example of a martingale is the amount of money a gambler has after playing t
games, assuming that all games in the casino are fair and the gambler can go negative. In
other words, Xt is the amount of money the gambler has after playing t games and Yt is the
92
outcome of the t-th game. The gambler can play according to any strategy he wishes (that
can depend on past outcomes of his plays), changing between roulette, black jack or slot
machines etc. as he pleases. The condition (14.1) expresses the assumption that all games
in the casino are fair: After playing t games and having Xt dollars, the expected amount of
money after (t + 1)-th game is Xt . Note that, the condition (14.1) implies that
Xt − Xt−1 ∈ [At , At + c]
2ε2
Pr [Xn − X0 ≥ ε] ≤ exp − 2 .
nc
Equivalently, for all δ > 0, with probability at least 1 − δ,
r
n
Xn < X0 + c ln(1/δ) .
2
Typically, we will assume that X0 is some constant (usually zero) and then X0 in Azuma’s
inequality can be replaced by E[Xn ].
93
Furthermore, if Rn = nt=1 `t (Wt ) − inf w∈H nt=1 `t (w) denotes the regret, then for any δ > 0
P P
with probability at least 1 − δ
r
1 2 ln(1/δ)
`(W n ) − inf `(w) < Rn + .
w∈H n n
Proof. First notice that ` : H → R is a convex function. This is easy to see since ` is an
expectation of a (random) convex function `1 . Formally, for any w, w0 ∈ H and α, β ≥ 0
such that α + β = 1,
`(αw + βw0 ) = E[`1 (αw + βw0 )] ≤ E[α`1 (w) + β`1 (w0 )] = α`(w) + β`(w0 ) .
Thus,
Pn in order to Pprove the first
p part of the theorem, we see it suffices to prove that
1 1 n
n t=1 `(Wt ) ≤ n t=1 `t (W
Ptt
) + ln(1/δ)/(2n). In order to prove that, we will use Azuma’s
inequality. The sequence { s=1 (`(Ws )−`s (Ws ))}nt=0 is a martingale with respect to {(`t , Wt+1 )}nt=0 .
Indeed, for any t ≥ 1
" t #
X
E (`(Ws ) − `s (Ws )) W1:t , `0:t−1
s=1
t−1
X
= (`(Ws ) − `s (Ws )) + E `(Wt ) − `t (Wt ) W1:t , `1:t−1
s=1
t−1
X
= (`(Ws ) − `s (Ws ))
s=1
where we have used that `(Wt ) = E[`t (Wt ) | W1:t , `0:t−1 ] which holds because `t is independent
of W1:T , `0:t−1 . Since the losses lie in [0, 1], the increments of the martingale lie in intervals
of length one:
t
X t−1
X
(`(Ws ) − `s (Ws )) − (`(Ws ) − `s (Ws )) = `(Wt ) − `t (Wt ) ∈ [`(Wt ) − 1, `(Wt )] .
s=1 s=1
Therefore, since the zeroth element of the martingale is zero, by Azuma’s inequality with
At = `(Wt ) − 1 and c = 1, for any δ > 0
n r
X n
(`(Wt ) − `t (Wt )) < ln(1/δ) .
t=1
2
Dividing by n and combining with (14.2) gives the first part of the theorem.
94
To prove the second part of theorem let w∗ = argminw∈H `(w).2 Consider the martingale
( t )n
X
`(Wt ) − `t (Wt ) − `(w∗ ) + `t (w∗ )
s=1 t=0
where we have used that `(Wt ) = E[`t (Wt ) | W1:t , `0:t−1 ] and `(w∗ ) = E[`t (w∗ ) | W1:t , `0:t−1 ]
both of which hold since `t is independent of W1:t and `0:t−1 . The increments of the martingale
lie in an interval of length of 2:
Thus, by Azuma’s inequality with At = `(Wt ) − `(w∗ ) − 1 and c = 2, we have that for any
δ > 0, with probability at least 1 − δ,
n
X p
(`(Wt ) − `t (Wt ) − `(w∗ ) + `t (w∗ )) < 2n ln(1/δ) .
t=1
We use (14.2) to lower bound the first term and inf w∈H t=1 `t (w) ≤ nt=1 `t (w∗ ) to lower
Pn
P
bound the fourth term. We obtain that with probability at least 1 − δ,
! n
! r
1 X X 2 ln(1/δ)
`(W n ) + `t (Wt ) − `(w∗ ) + inf `t (w) < .
n t=1 w∈H
t=1
n
Since w∗ is the minimizer or `, the last inequality is equivalent to the second part of the
theorem.
2
If the minimizer does not exists, let w∗ be such that `(w∗ ) < inf w∈H `(w) + ε. Then take ε → 0.
95
Chapter 15
Multi-Armed Bandits
Multi-armed bandit problem. Online learning problem. We do not see the losses for all the
decisions.
Examples: Getting to school. Loss is travel time. Decisions: take the bus, bike, walk,
drive. Also: Clinical trials. Ad-allocation problem. Recommendation system. Adaptive user
interfaces.
Simple case: Decision space is finite.
Full-information setting, EWA, see Chapter 4.
We do not assume anything about D, Y and the loss function ` doesn’t need to convex
anymore. The only assumption that we make is that `(p, y) ∈ [0, 1]. Also note that the
numbers pb1,t , pb2,t , . . . , pbN,t are non-negative and sum to 1 and therefore the distribution of It
is valid probability distribution.
We have N actions.
Initially, wi,0 = 1 for each expert i and W0 = N . Then, in each round t = 1, 2, . . . , n, the
algorithm does the following:
4. It predicts fIt ,t .
6. The algorithm suffers the loss `(fIt ,t , yt ) and each expert i = 1, 2, . . . , N suffers a loss
`(fi,t , yt ).
96
Since we do not have `i,t = `(fi,t , yt ), for i 6= It , we come up with an estimate of it:
(`
i,t
def I{It = i} `i,t pi,t
, if It = i ,
`i,t =
e =
pi,t 0, otherwise .
h i
This estimate is constructed such that E `ei,t = `i,t . Indeed, by the tower rule,
h i h h ii
E `i,t = E E `i,t |I1 , . . . , It−1 = E [`i,t /pi,t E [I{It = i}|I1 , . . . , It−1 ]] = E [`i,t /pi,t pi,t ] = `i,t .
e e
Exp3 stands for exponetial weights for exploration and exploitation. The ending, −γ stands
for not using exploration (wait for the next section to understand this).
Note: pi,t becomes random, whereas in EWA it was not random.
Regret bound for the expected regret.
Assume that `i,t ∈ [0, 1].
ei,n = Pn `ei,t .
Define L t=1
Usual proof:
PN
Wn e−ηLi,n e−ηLi,n
e e
i=1
= ≥ . (15.1)
W0 N N
Now,
N N
Wt X wi,t X wi,t−1
= = e−η`i,t .
e
Wt−1 i=1
Wt−1 i=1
Wt−1
Instead of using Hoeffding as in Chapter 4, we use that ex ≤ 1 + x + x2 holds when x ≤ 1
to get
N N N
Wt X wi,t−1 n 2 e2
o X
2
X
2
≤ 1 − η `i,t + η `i,t = 1 − η
e pi,t `i,t + η
e pi,t `ei,t .
Wt−1 i=1
Wt−1 i=1 i=1
Now, N e = `It ,t . This, and 1 + x ≤ ex (which holds for any x ∈ R) gives Wt /Wt−1 ≤
P
i=1 pi,t `i,tP
exp(−η`It ,t + η 2 N e2
i=1 pi,t `i,t ). Therefore,
n
( N
)! n X N
!
Wn X X X
≤ exp − η `It ,t + η 2 2
pi,t `ei,t bn + η2
≤ exp −η L 2
pi,t `ei,t
W0 t=1 i=1 t=1 i=1
97
The second term in the exponent is upper bounded as follows:
N N N N
X
2
X I{It = i}`i,t e X X
pi,t `ei,t = pi,t `i,t = I{It = i}`i,t `ei,t ≤ `ei,t ,
i=1 i=1
pi,t i=1 i=1
where in the last step we used `i,t ≤ 1. Combining the previous inequality with this bound
and (15.1) and taking logarithms of both sides we get
N
X
−η L bn + η2
ei,n − ln(N ) ≤ −η L L
ei,n .
i=1
h i
Now, by construction E Lei,n = Li,n , therefore taking the expectation of both sides gives
h i N
X
bn + η2
−ηLi,n − ln(N ) ≤ −η E L Li,n .
i=1
Reordering gives
N
hi ln N X ln N √
E Ln − Li,n ≤
b +η Li,n ≤ + ηnN ≤ 2 nN ln N .
η i=1
η
I{It = i}gi,t
gi,t = 1 − `i,t ∈ [0, 1], gei,t = .
pi,t
We also need
0 β
gi,t = gei,t + , β > 0.
pi,t
98
Now,
PN 0 0
Wn i=1eηGi,n eη max1≤i≤N Gi,n
= ≥ . (15.2)
W0 N N
Also,
PN N
Wt wi,t X wi,t−1 0
= i=1
= eηgi,t .
Wt−1 Wt−1 i=1
Wt−1
By definition,
wi,t−1 pi,t − Nγ pi,t
= ≤ .
Wt−1 1−γ 1−γ
Assume
0
ηgi,t ≤ 1. (15.3)
Then using ex ≤ 1 + x + x2 , which holds for x ≤ 1, we get
N
Wt X pi,t − Nγ 0 0 2 wi,t−1
γ
pi,t − N
+ η 2 (gi,t
≤ 1 + ηgi,t ) (because Wt−1
= 1−γ
)
Wt−1 i=1
1−γ
N N
η X 0 η2 X 0 2 γ 0
≤1+ pi,t gi,t + pi,t (gi,t ) (because pi,t − N
≤ pi,t and gi,t ≥ 0)
1 − γ i=1 1 − γ i=1
0 0
By the definition of gi,t , pi,t gi,t ≤ pi,t gei,t + β ≤ I{It = i}gi,t + β. This can be used to
bound both the second and third term above. The third term is further upper bounded by
0 2 0 0 0
pi,t (gi,t ) = (pi,t gi,t )gi,t ≤ (1 + β)gi,t . Therefore,
N
Wt η η2 X
0
≤1+ (gIt ,t + N β) + (1 + β) gi,t
Wt−1 1−γ 1−γ i=1
N
!
η η2 X
0
≤ exp (gIt ,t + N β) + (1 + β) gi,t .
1−γ 1−γ i=1
99
Reordering gives
Pr Gi,n > G0i,n + βnN = Pr Gi,n − G0i,n > βnN = Pr β(Gi,n − G0i,n ) > β 2 nN
where in the last step we have used Markov’s inequality which says that if X ≥ 0 then
Pr (X ≥ a) ≤ E [X] /a. Define Zt = exp(β(Gi,t − G0i,t )). It suffices to prove that E [Zn ] ≤ 1.
0
Let I1:t = (I1 , . . . , It ). Then, E [Zn ] = E [E [Zn |I1:n−1 ]]. Now, Zn = Zn−1 exp(β(gi,n − gi,n ))
0
and since Zn−1 is a function of I1:n−1 , E [Zn |I1:n−1 ] = Zn−1 E exp(β(gi,n − gi,n ))|I1:n−1 .
0
Using the definition of gi,n , we get that
0
β2
E exp(β(gi,n − gi,n )) | I1:n−1 = exp(− ) E [ exp(β(gi,n − gei,n )) | I1:n−1 ] .
pi,n
By assumption, β(gi,n − gei,n ) ≤ 1, since by assumption 0 ≤ β ≤ 1. Therefore, we can use
ex ≤ 1 + x + x2 . Using E [gi,n − gei,n | I1:n−1 ] = 0, which holds by the construction
2 of gei,n , and
E [(gi,n − gei,n )2 | I1:n−1 ] ≤ E gei,n
2
| I1:n−1 (because Var(X) ≤ E [X 2 ]), and E gei,n | I1:n−1 =
2
gi,n /pi,n ≤ 1/pi,n we have
β2 β2
0
E exp(β(gi,n − gi,n )) | I1:n−1 ≤ exp(− ) 1+
pi,n pi,n
2 2
β β
≤ exp(− + ) because 1 + x ≤ ex
pi,n pi,n
= 1.
Thus, E [Zn |I1:n−1 ] ≤ E [Zn−1 ]. Now, taking expectation of both sides and repeating the
above argument, we get that E [Zn ] ≤ E [Zn−1 ] ≤ E [Zn−2 ] ≤ . . . ≤ E [Z0 ] = 1, finishing the
proof of the Lemma.
So, let’s assume that
1 − γ − η(1 + β)N ≥ 0 ,
s
ln(N/δ)
≤ β ≤ 1.
β
100
By the union bound, with probability at least 1 − δ, maxj G0j,n ≥ maxj Gj,n − βnN .
Hence,
Gbn ≥ [1 − γ − η(1 + β)N ] max Gj,n − βnN − ln N − nN β .
j η
Reordering,
101
Chapter 16
Then E [Rn ] ≥ R̄n , therefore it suffices to develop a lower bound on R̄n (thanks to the
randomization device).
Note that E [Gi,n ] = n E [gi,n ]. Let
∆i = max E [gj,t ] − E [gi,t ] .
j
102
(Step 3) Derive the lower bound from this.
Fix 1 ≤ i ≤ N . Let Πi be the joint distribution, where the payoff distributions for all the
arms is a Bernoulli 1/2 distribution, except for arm i, whose payoff has Bernoulli 1/2 + ε
(i)
distribution, where 0 < ε < 1/2 will be chosen later. Let gj,t ∼ Ber(1/2 + I{i = j}ε) be an
i.i.d payoff sequence generated from Πi . All of these are independent.
We further need Π0 : In Π0 all payoffs have Bernoulli 1/2 distributions.
Notation: Pi , E [i] ·, pi the PMF. Let R̄j be R̄ when we are in the j th world.
R̄j = (n − Ej [nj ])ε.
Notice that pi (h1:n−1 ) is completely dependent on A. Step 2.
X n
X
Ei [ni ] − E0 [ni ] = pi (h1:n−1 ) I{A(h1:t−1 ) = i}
h1:n−1 ∈{0,1}n−1 t=1
X n
X
− p0 (h1:n−1 ) I{A(h1:t−1 ) = i}
h1:n−1 ∈{0,1}n−1 t=1
X
≤n (pi (h1:n−1 ) − p0 (h1:n−1 ))+
h1:n−1 ∈{0,1}n−1
n
= kpi (h1:n−1 ) − p0 (h1:n−1 )k1
2
np
≤ 2KL (p0 (h1:n−1 )kpi (h1:n−1 )) ,
2
where the last step used Pinsker’s inequality.
Conditional KL divergence:
X p(x|y)
KL (p(X|Y )kq(X|Y )) = p(x, y) log = E [KL (p(X|y)kq(X|y))] .
x,y
q(x|y)
Chain rule:
KL (p(x, y)kq(x, y)) = KL (p(x)kq(x)) + KL (p(y|x)kq(y|x)) .
By the chain rule,
n
X
KL (p0 (h1:n−1 )kpi (h1:n−1 )) = KL (p0 (ht |h1:t−1 )kpi (ht |h1:t−1 )
t=1
n X
X p0 (ht |h1:t−1 )
= p0 (h1:t ) log .
t=1 h1:t
pi (ht |h1:t−1 )
103
By the choice of $\Pi_0$, the innermost sum is

$$\frac12 \log\frac{1/2}{p_i(0|h_{1:t-1})} + \frac12 \log\frac{1/2}{p_i(1|h_{1:t-1})}\,.$$

If the algorithm does not choose arm $i$ given the history $h_{1:t-1}$, then $p_i(b|h_{1:t-1}) = 1/2$ for $b \in \{0,1\}$, making this expression zero. In the other case, this sum is

$$\frac12 \log\frac{1}{4(1/2-\varepsilon)(1/2+\varepsilon)} = \frac12 \log\frac{1}{1-4\varepsilon^2}\,.$$

Therefore,
$$\mathrm{KL}\left(p_0(h_{1:n-1})\,\|\,p_i(h_{1:n-1})\right) = \frac12 \log\frac{1}{1-4\varepsilon^2} \sum_{t=1}^{n}\sum_{h_{1:t-1}} p_0(h_{1:t-1})\,\mathbb{I}\{A(h_{1:t-1}) = i\} = \frac12 \log\frac{1}{1-4\varepsilon^2}\;\mathbb{E}_0[n_i] \le 4\varepsilon^2\,\mathbb{E}_0[n_i]\,,$$

where the last inequality holds thanks to $-\log(1-x) \le 2x$, which holds when $0 \le x \le 1/2$.
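The elementary inequality used in the last step is easy to confirm numerically; a small hypothetical check:

import numpy as np

# Check -log(1-x) <= 2x on [0, 1/2], i.e., 0.5*log(1/(1-4 eps^2)) <= 4 eps^2.
x = np.linspace(0, 0.5, 1001)
assert np.all(-np.log1p(-x) <= 2 * x + 1e-12)
eps = np.sqrt(x / 4)
assert np.all(0.5 * np.log(1 / (1 - 4 * eps**2)) <= 4 * eps**2 + 1e-12)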
We summarize what we have achieved in the following lemma.

Lemma 16.1. $\mathbb{E}_i[n_i] - \mathbb{E}_0[n_i] \le \sqrt{2}\,\varepsilon n\, \sqrt{\mathbb{E}_0[n_i]}\,.$
Back to the main proof. We have

$$\bar R_i = \varepsilon\left(n - \mathbb{E}_i[n_i]\right) \ge \varepsilon\left(n - \sqrt{2}\,\varepsilon n\sqrt{\mathbb{E}_0[n_i]} - \mathbb{E}_0[n_i]\right).$$

Using the randomization hammer,

$$\bar R \ge \frac1N \sum_{i=1}^{N} \bar R_i \ge \frac\varepsilon N \sum_{i=1}^{N} \left(n - \sqrt2\,\varepsilon n\sqrt{\mathbb{E}_0[n_i]} - \mathbb{E}_0[n_i]\right) = \frac\varepsilon N\left(nN - \sqrt2\,\varepsilon n \sum_{i=1}^{N} \sqrt{\mathbb{E}_0[n_i]} - n\right) \ge \frac\varepsilon N\left(nN - \sqrt2\,\varepsilon n\sqrt{N\sum_{i=1}^{N} \mathbb{E}_0[n_i]} - n\right) = \frac\varepsilon N\left(nN - \sqrt{2}\,\varepsilon n\sqrt{Nn} - n\right) = \varepsilon n - \sqrt2\,\varepsilon^2 N^{-1/2} n^{3/2} - \varepsilon\frac nN\,.$$

(Here we used that $\sum_{i=1}^N \mathbb{E}_0[n_i] = n$, since exactly one arm is pulled per round, and the Cauchy–Schwarz inequality, which gives $\sum_i \sqrt{\mathbb{E}_0[n_i]} \le \sqrt{N\sum_i \mathbb{E}_0[n_i]}$.) Choosing $\varepsilon = \frac12\sqrt{N/(2n)}$, we get $\bar R \ge c\sqrt{nN}$ for a suitable universal constant $c > 0$.
Theorem 16.2. There exists a constant $c > 0$ such that for any $N, n \ge 1$ and any algorithm $A$, there exists a sequence of rewards in $[0,1]$ such that the regret $R_n$ of algorithm $A$ on this sequence of rewards satisfies

$$\mathbb{E}[R_n] \ge c\, \min\left\{\sqrt{nN},\, n\right\}.$$
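The hard instance behind the theorem is easy to set up in code. The following hypothetical sketch builds the environment of the proof (one secretly better Bernoulli arm) and measures the regret of the trivial uniform player; all names and values are illustrative:

import numpy as np

# Sketch of the lower-bound environment: arm `star` pays Ber(1/2 + eps),
# all others Ber(1/2), with eps on the order of sqrt(N / n).
rng = np.random.default_rng(0)
N, n = 10, 10_000
eps = 0.5 * np.sqrt(N / (2 * n))
star = rng.integers(N)                  # the identity of the good arm is hidden
means = np.full(N, 0.5)
means[star] += eps

# Any algorithm can be plugged in here; uniform random play is the trivial
# baseline and already suffers regret about eps * n * (1 - 1/N).
pulls = rng.integers(N, size=n)
gains = rng.random(n) < means[pulls]
regret = n * means[star] - gains.sum()
print(regret, "vs. lower-bound scale", np.sqrt(n * N))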
Chapter 17
Exp3-γ as FTRL
The prediction with expert advice problem can be cast as a prediction problem over the simplex with linear losses as follows (see also Exercise 8.1): The decision set $K_0$ available to the online learner is the set of unit vectors $\{e_1, \ldots, e_d\}$ of the $d$-dimensional Euclidean space, while the loss function in round $t$ is $\ell_t(w) = f_t^\top w$. Choosing expert $i$ is identified with choosing the unit vector $e_i$, and if the loss assigned to expert $i$ in round $t$ is $\ell_{t,i}$, we set $f_{t,i} = \ell_{t,i}$. This way the two games become "isomorphic".¹ From now on we will thus work with the vectorial representation.

In general, the choice $\hat w_t \in K_0$ made in round $t$ is random. Assume that $e_i$ is selected in round $t$ with probability $w_{t,i}$, where $w_t = (w_{t,1}, \ldots, w_{t,d})^\top \in \Delta_d$ and $w_t$ is computed based on past information. Due to the linearity of the losses, $\mathbb{E}[\ell_t(\hat w_t)] = \mathbb{E}[\ell_t(w_t)]$. Thus, the real issue is to select the vectors $w_t \in \Delta_d$ (see also Exercise 17.1).
Our main interest in this chapter is the bandit setting, where the learner receives only $\ell_t(\hat w_t)$ after round $t$, as opposed to the full-information setting, where the learner is told the whole function $\ell_t(\cdot)$. The goal is still to keep the regret,

$$\hat L_n - L_n(u) = \sum_{t=1}^{n} \ell_t(\hat w_t) - \sum_{t=1}^{n} \ell_t(u)\,,$$

small. More precisely, in this chapter we focus on the expected regret only.
[Figure: reduction of the bandit game to a full-information game. Block A is the full-information algorithm, which is fed the estimated losses $\tilde f_t$ and outputs $w_t$; block B samples $\hat w_t$ based on $w_t$, observes $f_t^\top \hat w_t$, and produces the next estimate $\tilde f_t$.]

Let $\tilde f_t$ be the estimate of $f_t$ developed in round $t$. Define $\hat{\tilde L}_n = \sum_{t=1}^{n} \tilde\ell_t(w_t)$ and $\tilde L_n(u) = \sum_{t=1}^{n} \tilde\ell_t(u)$, where $\tilde\ell_t(w) = \tilde f_t^\top w$.
The following result shows that if $\tilde f_t$ is an appropriately constructed estimate of $f_t$, it suffices to study the regret of the full-information algorithm fed with the sequence of losses $\tilde\ell_t$:

Proposition 17.1. Assume that $\mathbb{E}[\hat w_t \mid \hat w_1, \ldots, \hat w_{t-1}] = w_t$ and

$$\mathbb{E}\left[\tilde f_t \mid \hat w_1, \ldots, \hat w_{t-1}\right] = f_t\,. \qquad (17.1)$$
Thus, we see that a small expected regret can be achieved as long as $\mathbb{E}\left[\sum_{t=1}^n \|\tilde f_t\|_\infty^2\right]$ is small.

Consider, for example, the importance weighted estimator

$$\tilde f_{t,i} = \frac{\mathbb{I}\{\hat w_t = e_i\}\, f_{t,i}}{w_{t,i}}\,, \qquad (17.3)$$

which makes the algorithm identical to the Exp3-γ algorithm of Section 15.1. By construction (and as was also shown earlier), $\tilde f_t$ satisfies (17.1). Therefore, it remains to study $\mathbb{E}\left[\|\tilde f_t\|_\infty^2\right]$. Introduce the shorthand notation $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \hat w_1, \ldots, \hat w_{t-1}]$. We have

$$\mathbb{E}_t\left[\|\tilde f_t\|_\infty^2\right] \le \mathbb{E}_t\left[\|\tilde f_t\|_2^2\right] = \sum_{i=1}^{d} \mathbb{E}_t\left[\tilde f_{t,i}^2\right].$$
Since

$$\mathbb{E}_t\left[\tilde f_{t,i}^2\right] = \frac{f_{t,i}^2}{w_{t,i}^2}\, \mathbb{E}_t\left[\mathbb{I}\{\hat w_t = e_i\}\right]$$

and $\mathbb{E}_t[\mathbb{I}\{\hat w_t = e_i\}] = \Pr(\hat w_t = e_i \mid \hat w_1, \ldots, \hat w_{t-1}) = w_{t,i}$, we get $\mathbb{E}_t\left[\tilde f_{t,i}^2\right] = \frac{f_{t,i}^2}{w_{t,i}}$. Thus,

$$\mathbb{E}_t\left[\|\tilde f_t\|_\infty^2\right] \le \sum_{i=1}^{d} \frac{f_{t,i}^2}{w_{t,i}}\,. \qquad (17.4)$$
Unfortunately, this expression is hard to control, since the weights can become arbitrarily close to zero. One possibility then is to bias $\hat w_t$ so that the probability of choosing $\hat w_t = e_i$ is lower bounded. This is explored in Exercises 17.4 and 17.5, though the rates we can obtain with this technique (alone) are suboptimal. However, as we already saw earlier, Exp3-γ enjoys a low regret without any adjustment. This motivates us to refine the above analysis.
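For reference, the biasing idea amounts to the following one-liner (a hypothetical helper, with names not from the notes):

import numpy as np

# Mix w_t with the uniform distribution so that every sampling
# probability is at least gamma/d: w_t(gamma) = gamma/d + (1-gamma) w_t.
def biased_weights(w, gamma):
    w = np.asarray(w, dtype=float)
    return gamma / len(w) + (1.0 - gamma) * w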
To control the regret, we bound the inner product via a pair of dual "local" norms:

$$\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle \le \|\tilde f_t\|_{t,*}\, \|w_t - \tilde w_{t+1}\|_t\,.$$
These norms are called "local" because they will be chosen based on local information so that both of the terms on the right-hand side are tightly controlled.

A simple idea is to try some weighted norms. As it turns out, $\|v\|_{t,*}^2 = \sum_{i=1}^{d} w_{t,i} v_i^2$ is a good choice.² How can we bound $\|w_t - \tilde w_{t+1}\|_t$? The dual of $\|\cdot\|_{t,*}$ is $\|v\|_t^2 = \sum_{i=1}^{d} w_{t,i}^{-1} v_i^2$. Hence, using $\tilde w_{t+1,i} = w_{t,i}\, e^{-\eta \tilde f_{t,i}}$, we get

$$\|w_t - \tilde w_{t+1}\|_t^2 = \sum_{i=1}^{d} w_{t,i}^{-1}\left(w_{t,i} - w_{t,i}\, e^{-\eta\tilde f_{t,i}}\right)^2 = \sum_{i=1}^{d} w_{t,i}\left(1 - e^{-\eta\tilde f_{t,i}}\right)^2 \le \eta^2 \sum_{i=1}^{d} w_{t,i}\, \tilde f_{t,i}^2 = \eta^2\, \|\tilde f_t\|_{t,*}^2\,.$$

In the inequality we used that $1 - e^{-x} \le x$ (this holds for any $x \in \mathbb{R}$), and thus, for any $x \ge 0$, $(1-e^{-x})^2 \le x^2$ holds true. Therefore,

$$\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle \le \eta\, \|\tilde f_t\|_{t,*}^2 = \eta \sum_{i=1}^{d} w_{t,i}\, \tilde f_{t,i}^2\,. \qquad (17.6)$$
By the choice of the estimator, $\mathbb{E}_t\left[\|\tilde f_t\|_{t,*}^2\right] \le \|f_t\|_2^2$, and thus

$$\mathbb{E}_t\left[\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle\right] \le \eta\, \|f_t\|_2^2\,.$$

When the losses are bounded by $1$, $\|f_t\|_2^2 \le d$. Combining these with (17.5), we get the following theorem:

Theorem 17.2. Assume that the losses are in the $[0,1]$ interval and that Exp3-γ is run with $\eta = \sqrt{\frac{\ln d}{nd}}$. Then, its expected regret is bounded by $2\sqrt{nd\ln d}$.
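A minimal simulation of this result is given below; it is a hypothetical sketch (synthetic losses, illustrative seed and horizon) of exponential weights fed with the importance-weighted estimates (17.3), using the step size of Theorem 17.2; the explicit exploration of Exp3-γ is omitted for simplicity:

import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000
eta = np.sqrt(np.log(d) / (n * d))      # step size of Theorem 17.2
f = rng.random((n, d))                  # the true (hidden) loss vectors
L_tilde = np.zeros(d)                   # cumulated estimated losses
loss_alg = 0.0
for t in range(n):
    w = np.exp(-eta * (L_tilde - L_tilde.min()))
    w /= w.sum()                        # exponential-weights iterate w_t
    i = rng.choice(d, p=w)              # draw hat{w}_t = e_i
    loss_alg += f[t, i]                 # only f_{t,i} is observed
    L_tilde[i] += f[t, i] / w[i]        # importance-weighted estimate (17.3)
regret = loss_alg - f.sum(axis=0).min()
print(regret, "vs. bound", 2 * np.sqrt(n * d * np.log(d)))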
17.3 Avoiding local norms

Starting from (17.5), we can arrive at the same bound as before, while avoiding local norms completely. This can be done as follows. Let us upper bound the inner product $\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle$:

$$\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle = \sum_{i=1}^{d} \tilde f_{t,i}\, w_{t,i}\left(1 - e^{-\eta \tilde f_{t,i}}\right) \le \eta \sum_{i=1}^{d} \tilde f_{t,i}^2\, w_{t,i} = \eta \sum_{i=1}^{d} f_{t,i}^2\, \frac{\mathbb{I}\{\hat w_t = e_i\}}{w_{t,i}}\,, \qquad (17.7)$$

where the inequality follows because $\tilde f_{t,i}, w_{t,i} \ge 0$ and $1 - e^{-x} \le x$ holds for any $x \in \mathbb{R}$.
Therefore,

$$\mathbb{E}_t\left[\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle\right] \le \eta \sum_{i=1}^{d} f_{t,i}^2 = \eta\, \|f_t\|_2^2\,.$$

When the estimated losses may take negative values, one can instead use the inequality $x(1-e^{-x}) \le (e-1)\,x^2$, which holds for all $x \ge -1$; hence, with $c = e - 1$,

$$\mathbb{E}_t\left[\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle\right] \le c\,\eta\, \|f_t\|_2^2\,.$$

The condition that $\eta \tilde f_{t,i} \ge -1$ (or the stronger condition that $\eta |\tilde f_{t,i}| \le 1$) can be achieved, for example, by adding exploration. When this condition holds, continuing as before we can bound the expected regret. See Exercise 17.5 for the details.

Yet another possibility is to start from the Cauchy–Schwarz inequality:

$$\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle \le \|\tilde f_t\|_2\, \|w_t - \tilde w_{t+1}\|_2\,.$$
By the same argument as in Section 17.2.1, assuming $\tilde f_t \ge 0$, we also have

$$\|w_t - \tilde w_{t+1}\|_2^2 \le \eta^2 \sum_{j=1}^{d} w_{t,j}^2\, \tilde f_{t,j}^2\,.$$

Now, take the square root of both sides, use $\sqrt{\sum_i |x_i|} \le \sum_i \sqrt{|x_i|}$, and exploit that $\tilde f_t$ has at most one nonzero coordinate, to get

$$\langle \tilde f_t, w_t - \tilde w_{t+1}\rangle \le \eta \sum_{i=1}^{d} w_{t,i}\, \tilde f_{t,i}^2 = \eta \sum_{i=1}^{d} f_{t,i}^2\, \frac{\mathbb{I}\{\hat w_t = e_i\}}{w_{t,i}}\,.$$

To reiterate, the key ideas were that $\tilde f_{i,t}\, \tilde f_{j,t}$ is zero very often, and that one should not take expectations before taking the square root: the lower the power of $\tilde f_{i,t}$ before taking expectations, the better the bound.
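The difference between the two quantities can also be seen numerically. The hypothetical snippet below (names and weights are illustrative) contrasts the expectation in (17.4), which blows up when some weight is tiny, with the local-norm quantity, which stays equal to $\|f\|_2^2$:

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.001, 0.2, 0.2, 0.3, 0.299])   # one tiny weight
f = rng.random(5)
i = rng.choice(5, p=w, size=200_000)          # draws of the chosen coordinate
f_tilde_sq = (f[i] / w[i]) ** 2               # ||f~_t||_inf^2 (single nonzero coord.)
print(f_tilde_sq.mean(), (f**2 / w).sum())    # matches (17.4); huge (~ 1/min_i w_i)
print((w[i] * f_tilde_sq).mean(), (f**2).sum())  # local-norm version; = ||f||_2^2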
17.4 Exercises

(a) Show that for any $u \in K$, the expected regret of the randomizing algorithm equals the regret of algorithm $A$.

(b) Show that for any $u \in K$, with high probability, the regret of the randomizing algorithm is not much larger than that of algorithm $A$.
Exercise 17.3. (Biased estimates) Let $\hat w_t, w_t, f_t, \tilde f_t$ be as in Proposition 17.1. Prove the following extension of Proposition 17.1: Let

$$b_t = \mathbb{E}\left[\tilde f_t \mid \hat w_1, \ldots, \hat w_{t-1}\right] - f_t$$
Exercise 17.4. (Biased estimates, biased predictions) Consider the same setting as in Exercise 17.3, except that now we allow a bias in $\hat w_t$, too. In particular, let

$$d_t = \mathbb{E}\left[\hat w_t \mid \hat w_1, \ldots, \hat w_{t-1}\right] - w_t$$

be the bias of $\hat w_t$. As before, define

$$b_t = \mathbb{E}\left[\tilde f_t \mid \hat w_1, \ldots, \hat w_{t-1}\right] - f_t$$

to be the bias of $\tilde f_t$ at time $t$. Remember that by assumption, for any $t \in \{1,\ldots,n\}$ and $w \in K$, it holds that $\ell_t(w) \in [0,1]$.

Show that for any $u \in K$,

$$\mathbb{E}\left[\hat L_n - L_n(u)\right] \le \mathbb{E}\left[\hat{\tilde L}_n - \tilde L_n(u)\right] + \mathbb{E}\left[\sum_{t=1}^{n} \sup_{w\in K} b_t^\top w\right] + \mathbb{E}\left[\sum_{t=1}^{n} \sup_{f\in F} d_t^\top f\right],$$

where $F = \{f \in \mathbb{R}^d : 0 \le \inf_{w\in K} f^\top w \le \sup_{w\in K} f^\top w \le 1\}$.
Exercise 17.5. Consider the algorithm where FTRL with the un-normalized negentropy is fed with the estimates

$$\tilde f_{t,i} = \frac{\mathbb{I}\{\hat w_t = e_i\}\, f_{t,i}}{w_{t,i}(\gamma)}\,,$$

where $\hat w_t$ is randomly chosen from $\{e_1,\ldots,e_d\}$, the probability of $\hat w_t = e_i$ is $w_{t,i}(\gamma) \stackrel{\mathrm{def}}{=} \gamma/d + (1-\gamma)\, w_{t,i}$, and $0 < \gamma < 1$ is an exploration parameter. Assume that $|f_{t,i}| \le 1$.

(a) Using the result of the previous exercise, combined with (17.2) and (17.4) (applied appropriately), show that the expected regret of the resulting algorithm is not larger than

$$\gamma n + \frac{\eta\, d^2 n}{\gamma} + \frac{\ln d}{\eta}\,.$$

(b) Show that by appropriately choosing $\gamma$ and $\eta$, the regret is bounded by $2 d^{2/3} n^{2/3} (4\ln d)^{1/3}$. (This is thus a suboptimal bound.)
(c)

$$\tilde f_t = C_t^\dagger\, \hat w_t\, \ell_t(\hat w_t)\,,$$

where $C_t^\dagger$ denotes the Moore–Penrose pseudo-inverse of $C_t$.

(a) Assume that $C_t$ is invertible. Show that in this case $\tilde f_t$ is an unbiased estimate of $f_t$.

(b) Now, assume that $\hat w_t$ is a discrete random variable. Show that for any vector $w$ such that $\Pr(\hat w_t = w) > 0$, $\mathbb{E}_t\left[\tilde f_t^\top w\right] = \ell_t(w)$.

(c) Show that $\mathbb{E}_t\left[\hat w_t^\top C_t^\dagger\, \hat w_t\right] \le \mathrm{rank}(C_t) \le d$.
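The claims of this exercise can be checked numerically; the sketch below (with hypothetical names and a synthetic discrete distribution) verifies unbiasedness and the trace identity of Part (c):

import numpy as np

# With C = sum_i p_i v_i v_i^T, the estimate f~ = C^+ w * (f^T w) is unbiased
# for f when C is invertible, and E[w^T C^+ w] = rank(C).
rng = np.random.default_rng(0)
d, m = 4, 6
V = rng.standard_normal((m, d))       # support points of the discrete law of w_hat
p = rng.dirichlet(np.ones(m))         # their probabilities
f = rng.standard_normal(d)            # the unknown loss vector
C = (V.T * p) @ V                     # C = sum_i p_i v_i v_i^T
Cp = np.linalg.pinv(C)                # Moore-Penrose pseudo-inverse
f_mean = sum(p_i * Cp @ v * (f @ v) for p_i, v in zip(p, V))
print(np.allclose(f_mean, f))         # unbiased (C is invertible for generic V)
trace_term = sum(p_i * v @ Cp @ v for p_i, v in zip(p, V))
print(trace_term, np.linalg.matrix_rank(C))   # equal: E[w^T C^+ w] = rank(C)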
Exercise 17.7. (Finitely spanned decision sets) Consider an online learning problem with linear losses, $\ell_t(w) = f_t^\top w$, $t = 1,\ldots,n$. Assume that $\ell_t(w) \in [0,1]$ and that the decision set, $K \subset \mathbb{R}^d$, is the convex hull of finitely many vectors of $\mathbb{R}^d$, i.e.,

$$K = \left\{ \sum_{i=1}^{p} \alpha_i v_i \,:\, \alpha_i \ge 0,\ \sum_{i=1}^{p} \alpha_i = 1 \right\},$$

with some $v_1, \ldots, v_p \in \mathbb{R}^d$.

Consider the following algorithm: Fix $0 < \eta$, $0 < \gamma < 1$ and $\mu = (\mu_i) \in \Delta_p$. Assume that $\mu_i > 0$ holds for all $i \in \{1,\ldots,p\}$. The algorithm keeps a weight vector $\alpha_t \in \Delta_p$. Initially, the weights are uniform: $\alpha_1 = (1/p, \ldots, 1/p)^\top$. In round $t$, the algorithm predicts $\hat w_t \in \{v_1,\ldots,v_p\}$, where the probability of $\hat w_t = v_i$ is $\alpha_{t,i}(\gamma) \stackrel{\mathrm{def}}{=} \gamma\mu_i + (1-\gamma)\alpha_{t,i}$. (In other words, the algorithm chooses a unit vector $\hat\alpha_t \in \mathbb{R}^p$ randomly such that the probability of selecting $e_i$ is $\alpha_{t,i}(\gamma)$, and then predicts $\hat w_t = V\hat\alpha_t$.) Next, $\ell_t(\hat w_t) = f_t^\top \hat w_t$ is observed, based on which the algorithm produces the estimates

$$\tilde f_t = C_t^\dagger\, \hat w_t\, \ell_t(\hat w_t)\,, \qquad \tilde g_t = V^\top \tilde f_t\,,$$

where $V = [v_1, \ldots, v_p] \in \mathbb{R}^{d\times p}$ and $C_t = \sum_{i=1}^{p} \alpha_{t,i}(\gamma)\, v_i v_i^\top$. Making use of these estimates, the weights ($1 \le i \le p$) are updated using

$$\alpha_{t+1,i} = \frac{\tilde\alpha_{t+1,i}}{\sum_{j=1}^{p} \tilde\alpha_{t+1,j}}\,, \qquad \tilde\alpha_{t+1,i} = \alpha_{t,i}\, e^{-\eta \tilde g_{t,i}}\,.$$
Let $\hat L_n = \sum_{t=1}^{n} \ell_t(\hat w_t)$ be the cumulated loss of the algorithm and $L_n(u) = \sum_{t=1}^{n} \ell_t(u)$ the cumulated loss of a competitor $u \in K$. The goal is to show that the expected regret, $\mathbb{E}\left[\hat L_n - L_n(u)\right]$, can be kept "small" independently of $u$, provided that $\mu$, $\eta$ and $\gamma$ are appropriately chosen.
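For concreteness, the algorithm just described can be sketched as follows (a hypothetical implementation with synthetic losses and illustrative parameter values; `pinv` computes the Moore–Penrose pseudo-inverse):

import numpy as np

rng = np.random.default_rng(0)
d, p, n = 3, 5, 2000
V = np.abs(rng.standard_normal((d, p)))       # columns v_1..v_p
V /= V.sum(axis=0)                            # keep f^T v in [0, 1] below
f = rng.random((n, d))                        # losses in [0, 1] on K
mu = np.full(p, 1.0 / p)                      # exploration distribution
eta, gamma = 0.05, 0.1
alpha = np.full(p, 1.0 / p)
for t in range(n):
    a_gamma = gamma * mu + (1 - gamma) * alpha  # sampling dist. alpha_t(gamma)
    i = rng.choice(p, p=a_gamma)                # hat{w}_t = v_i
    w = V[:, i]
    loss = f[t] @ w                             # the only feedback observed
    C = (V * a_gamma) @ V.T                     # C_t = sum_i alpha_{t,i}(gamma) v_i v_i^T
    f_tilde = np.linalg.pinv(C) @ w * loss      # f~_t = C_t^+ w_hat l_t(w_hat)
    g_tilde = V.T @ f_tilde                     # g~_t = V^T f~_t
    alpha = alpha * np.exp(-eta * g_tilde)      # exponential-weights update
    alpha /= alpha.sum()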
(c) Show that for any $u \in K$ and $\alpha \in \Delta_p$ such that $u = V\alpha$, it holds that $L_n(u) = \mathbb{E}\left[\tilde L_n(\alpha)\right]$, where $\tilde L_n(\alpha) = \sum_{t=1}^{n} \tilde g_t^\top \alpha$.
(e) Let $V_{\max}^2 = \max_{1\le i\le p} \|v_i\|_2^2$ and let $\lambda_0$ be the smallest positive eigenvalue of the matrix $\sum_{i=1}^{p} \mu_i v_i v_i^\top$ (why is this well-defined?). Show that $\|\tilde g_t\|_\infty \le V_{\max}^2/(\lambda_0\gamma)$. Therefore, if we choose $\gamma = \eta V_{\max}^2/\lambda_0$, then no matter how we choose $\eta > 0$, it holds that

$$|\eta\, \tilde g_{t,i}| \le 1\,, \qquad 1 \le i \le p,\ 1 \le t \le n\,. \qquad (17.9)$$

(f) Assume that (17.9) holds. Show that $\langle \tilde g_t, \alpha_t - \tilde\alpha_{t+1}\rangle \le c\,\eta \sum_{i=1}^{p} \alpha_{t,i}\, \tilde g_{t,i}^2$, where $c = e - 1 \approx 1.71828182845905$.

(i) Show that by selecting $\eta$ appropriately, the expected regret can be further bounded by

$$\mathbb{E}\left[\hat L_n - L_n(u)\right] \le 2\sqrt{n\left(\frac{V_{\max}^2}{\lambda_0} + cd\right)\ln p} + 4\,\frac{V_{\max}^2}{\lambda_0}\,\ln p\,.$$
Hint: Don't panic! Use the result of this chapter and the previous exercises. In addition, the following might be useful for Part (e): For a matrix $M$, let $\|M\|$ denote its operator norm derived from $\|\cdot\|$: $\|M\| = \max_{\|x\|=1} \|Mx\|$. By its very definition, $\|Mx\| \le \|M\|\,\|x\|$ holds for any vector $x$. Now, let $M$ be a positive semidefinite matrix. Denote by $\lambda_{\max}(M)$ the largest eigenvalue of $M$. Remember that $\|M\| = \lambda_{\max}(M)$ (in general $\|M\|$, also called the spectral norm, equals the largest singular value of $M$). Finally, let $\lambda^+_{\min}(M)$ denote the smallest nonzero (i.e., positive) eigenvalue of $M$, if $M$ has such an eigenvalue, and zero otherwise. Remember that $A \succeq B$ if $A - B \succeq 0$, i.e., if $A - B$ is positive semidefinite. Finally, remember that if $A, B \succeq 0$ and $A \succeq B$, then $\lambda_{\min}(A) \ge \lambda_{\min}(B)$ (this follows from Weyl's inequality).
Exercise 17.8. (Bandits with aggregated reporting) Consider a bandit game with $N$ arms, where the losses are reported in an aggregated fashion. As an example, imagine that your job is to select an ad to put on a webpage from a pool of $N$ ads; however, you learn the number of clicks only at the end of the day. In particular, for simplicity assume that you get the sum of losses after exactly $k$ pulls (and you get no information in the interim period). Design an algorithm for this problem. In particular, can you design an algorithm whose expected regret is $O(\sqrt{n})$ after playing for $n = sk$ rounds? How does the regret scale with $k$ and $N$?

Hint: Consider mapping the problem into an online learning problem with linear losses over a finitely spanned decision set. In particular, consider the space $\mathbb{R}^{Nk}$ and the convex hull generated by the decision set

$$K_0 = \left\{ (e_{i_1}^\top, \ldots, e_{i_k}^\top)^\top : 1 \le i_j \le N,\ 1 \le j \le k \right\}.$$
If the original sequence of losses is $\ell_t(w) = f_t^\top w$ (with the usual identification of the $i$th arm/expert and the $i$th unit vector $e_i \in \mathbb{R}^N$), then the aggregated loss for round $s$ is

$$\ell^A_s(w_1, \ldots, w_k) = f_{(s-1)k+1}^\top w_1 + \cdots + f_{sk}^\top w_k\,.$$

Argue that a small regret in this new game, with decision set $K$ and loss functions $\ell^A_s$ ($s = 1, 2, \ldots$), means a small regret in the original game. Then consider the algorithm of Exercise 17.7. In this algorithm you need to choose a distribution $\mu \in \Delta_p$, where $p = |K_0|$. Consider choosing this distribution as follows: Let $Q \in \mathbb{R}^N$ be a random variable whose distribution is uniform over $\{e_1, \ldots, e_N\}$, the set of unit vectors of $\mathbb{R}^N$ (i.e., $\Pr(Q = e_i) = 1/N$). Let $Q_1, \ldots, Q_k$ be $k$ independent copies of $Q$, and let $R = (Q_1^\top, \ldots, Q_k^\top)^\top$ be the random vector obtained from $Q_1, \ldots, Q_k$ by stacking these vectors on top of each other. Let $\mu_i = \Pr(R = v_i)$, where $K_0 = \{v_1, \ldots, v_p\}$. Notice that $\mu_i > 0$ (in fact, $\mu$ is just the uniform distribution, but you do not actually need this) and notice also that $M = \sum_{i=1}^{p} \mu_i v_i v_i^\top = \mathbb{E}[RR^\top]$. Calculate $\mathbb{E}[RR^\top]$ as a block matrix (what is $\mathbb{E}[Q_i Q_j^\top]$?). To calculate a lower bound on $\lambda_0 = \lambda^+_{\min}(M)$, notice that $\lambda_0$ is the minimum of $x^\top M x$ over unit vectors $x$ in the span of $K_0$ (the range of $M$). Therefore, it suffices to uniformly lower bound $x^\top M x$ for such $x$; write $x = \sum_{i=1}^{p} \alpha_i v_i$ for some $\alpha = (\alpha_i) \in \mathbb{R}^p$. Now write $x^\top M x$ by considering the diagonal and off-diagonal blocks in it (consider $N \times N$ blocks). Finish by completing the square, thereby reducing the problem to calculating the sums of elements in the blocks of $x$ (these sums can be related to $\alpha$).
For the initiated: Is there a way to efficiently implement your algorithm? Keeping around and updating $N^k$ weights is not practical. Can you implement the algorithm using much less memory and time?
Exercise 17.9. (Cognitive radio) Can you help engineers to create really smart cognitive radios? A "cognitive radio" problem is as follows: Engineers have split up the frequency spectrum available for radio communication into a finite number of channels. These days, all these channels are already preallocated to various users who paid for the exclusive right to use them. However, since the demand for communication is still increasing, engineers want to find a way of reusing the already preassigned channels. The main idea is that the primary users of the channels do not necessarily use them all the time; therefore, the unused channels could be used for other purposes. Intelligent protocols make sure that if the primary user uses a channel, no one can interfere, so the primary users are not impacted. The problem then is to design algorithms which efficiently learn which channels to use for what purposes.

Mathematically, the problem can be formulated as follows: In each round $t$, the radio serves $S$ "secondary users" who wish to use the channels available. Let $b_t \in \{0,1\}^N$ be the binary vector which encodes which of the $N$ channels are used at time $t$. Let $c_{s,j} \in [0,1]$ be the loss suffered when secondary user $s$ ($1 \le s \le S$) attempts to broadcast on channel $j$ ($1 \le j \le N$) without success because the $j$th channel is busy ($b_{t,j} = 1$). Let $g_{s,j} \in [0,1]$ be the gain of the user when the $j$th channel is available. An assignment of secondary users to channels can be represented by a mapping $\pi : \{1,\ldots,S\} \to \{1,\ldots,N\}$. Since only one user is allowed per channel, only mappings satisfying $\pi(s) \ne \pi(s')$ for $s \ne s'$ are considered. Given such an assignment $\pi$, the total loss suffered at time $t$ when using this assignment is

$$\ell_t(\pi) = \sum_{s=1}^{S} b_{t,\pi(s)}\, c_{s,\pi(s)} - \left(1 - b_{t,\pi(s)}\right) g_{s,\pi(s)}\,.$$
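In code, this assignment loss reads as follows (a hypothetical helper; `b`, `c`, `g` hold the busy flags, costs, and gains of the current round):

import numpy as np

def assignment_loss(pi, b, c, g):
    """l_t(pi) = sum_s b[pi[s]]*c[s, pi[s]] - (1 - b[pi[s]])*g[s, pi[s]]."""
    s = np.arange(len(pi))
    return (b[pi] * c[s, pi] - (1 - b[pi]) * g[s, pi]).sum()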
More generally, consider the problem when the costs $c_{s,\pi(s)}$ and gains $g_{s,\pi(s)}$ are unknown and can change in time (we may denote these changing quantities by the respective symbols $c_{t,s,j}$, $g_{t,s,j}$). When an assignment $\pi_t$ is selected, only the actual loss $\ell_t(\pi_t)$ is learned.

Design an online learning algorithm for this problem that has a small regret when playing against any fixed assignment. Can you design an algorithm which achieves $O(\sqrt{n})$ expected regret? What is the price of learning the aggregated information only after every $k$th round?

Hint: Try mapping the problem into the setting of Exercise 17.7. First, consider the decision set

$$K_0 = \left\{ (e_{j_1}^\top, \ldots, e_{j_S}^\top)^\top : 1 \le j_s \le N,\ j_s \ne j_{s'} \text{ for } s \ne s' \right\},$$

where $e_i$ is the $i$th unit vector of $\mathbb{R}^N$. Given an assignment $\pi$, let $v_\pi = (e_{\pi(1)}^\top, \ldots, e_{\pi(S)}^\top)^\top$. Note that for any admissible assignment $\pi$, $v_\pi \in K_0$ and vice versa. Now, it is not hard to see that it holds that

$$\ell_t(\pi) = f_t^\top v_\pi\,,$$
where

$$f_t = \begin{pmatrix} b_{t,1}\, c_{t,1,1} - (1-b_{t,1})\, g_{t,1,1} \\ b_{t,2}\, c_{t,1,2} - (1-b_{t,2})\, g_{t,1,2} \\ \vdots \\ b_{t,N}\, c_{t,1,N} - (1-b_{t,N})\, g_{t,1,N} \\ b_{t,1}\, c_{t,2,1} - (1-b_{t,1})\, g_{t,2,1} \\ b_{t,2}\, c_{t,2,2} - (1-b_{t,2})\, g_{t,2,2} \\ \vdots \\ b_{t,N}\, c_{t,2,N} - (1-b_{t,N})\, g_{t,2,N} \\ \vdots \\ b_{t,N}\, c_{t,S,N} - (1-b_{t,N})\, g_{t,S,N} \end{pmatrix}.$$
Chapter 18

Solutions to Selected Exercises
Here, the last step follows since $Y_t$ is independent of $Y_1, \ldots, Y_{t-1}$ and $R_t$. By our previous observation, since $p_t(y_1, \ldots, y_{t-1}, r_t)$ is just a fixed number in the $[0,1]$ interval,

Answer to Exercise 5.2. Obviously, $Z_{i,t}\sigma_t$ is $\{-1,+1\}$-valued and $\Pr(Z_{i,t}\sigma_t = -1) = \Pr(Z_{i,t}\sigma_t = +1) = 1/2$. We also need to show that the elements of this matrix are independent. This should be clear for the elements $Z_{i,t}\sigma_t$ and $Z_{j,s}\sigma_s$ when $s \ne t$. That this also holds when $s = t$ (and $i \ne j$) can be shown by the following calculation: When $s = t$,
$$\Pr\left(Z_{i,t}\sigma_t = v,\ Z_{j,t}\sigma_t = w\right) = \Pr\left(Z_{i,t}\sigma_t = v, Z_{j,t}\sigma_t = w \mid \sigma_t = -1\right)\Pr(\sigma_t = -1) + \Pr\left(Z_{i,t}\sigma_t = v, Z_{j,t}\sigma_t = w \mid \sigma_t = 1\right)\Pr(\sigma_t = 1)$$
$$= \frac12\left(\Pr\left(Z_{i,t} = -v, Z_{j,t} = -w \mid \sigma_t = -1\right) + \Pr\left(Z_{i,t} = v, Z_{j,t} = w \mid \sigma_t = +1\right)\right) = \frac12\left(\Pr\left(Z_{i,t} = -v, Z_{j,t} = -w\right) + \Pr\left(Z_{i,t} = v, Z_{j,t} = w\right)\right)$$
$$= \frac12\left(\frac14 + \frac14\right) = \frac14 = \Pr\left(Z_{i,t}\sigma_t = v\right)\Pr\left(Z_{j,t}\sigma_t = w\right).$$
In fact, a little linear algebra shows that the claim holds even if we only assume pairwise
independence of (Zi,t ), (σs ).
Answer to Exercise 17.3. Define

$$\mathbb{E}_t[\cdot] = \mathbb{E}\left[\cdot \mid \hat w_1, \ldots, \hat w_{t-1}\right].$$

Then we have $f_t + b_t = \mathbb{E}_t[\tilde f_t]$. By definition, $b_t$ is a deterministic function of $\hat w_1, \ldots, \hat w_{t-1}$. Therefore, $\mathbb{E}_t\left[f_t^\top \hat w_t\right] = f_t^\top w_t = (f_t + b_t)^\top w_t - b_t^\top w_t = \mathbb{E}_t\left[\tilde f_t^\top w_t\right] - b_t^\top w_t$. Next, $\mathbb{E}_t\left[\tilde f_t^\top u\right] = \langle \mathbb{E}_t[\tilde f_t], u\rangle = (f_t + b_t)^\top u = f_t^\top u + b_t^\top u$. Therefore,

$$\mathbb{E}\left[\hat L_n - L_n(u)\right] = \mathbb{E}\left[\hat{\tilde L}_n - \sum_{t=1}^{n} b_t^\top w_t\right] - \left\{\mathbb{E}\left[\tilde L_n(u)\right] - \left\langle \mathbb{E}\left[\sum_{t=1}^{n} b_t\right], u\right\rangle\right\} = \mathbb{E}\left[\hat{\tilde L}_n - \tilde L_n(u)\right] + \mathbb{E}\left[\sum_{t=1}^{n} b_t^\top (u - w_t)\right] \le \mathbb{E}\left[\hat{\tilde L}_n - \tilde L_n(u)\right] + \mathbb{E}\left[\sum_{t=1}^{n} \sup_{v,w\in K} b_t^\top (v - w)\right].$$
Answer to Exercise 17.4. We proceed as in the solution to Exercise 17.3. First, define

$$\mathbb{E}_t[\cdot] = \mathbb{E}\left[\cdot \mid \hat w_1, \ldots, \hat w_{t-1}\right].$$

Therefore,
Answer to Exercise 17.5. We apply the result of Exercise 17.4 with $F = \{f \in \mathbb{R}^d : 0 \le \inf_{w\in K} f^\top w \le \sup_{w\in K} f^\top w \le 1\}$, using that by construction the estimates are unbiased (therefore, with the notation of Exercise 17.4, $b_t = 0$). Since $K$ is the probability simplex, for $f \in \mathbb{R}^d$ we have $\sup_{w\in K} f^\top w = \max_i f_i$ and $\inf_{w\in K} f^\top w = -\sup_{w\in K}(-f)^\top w = -\max_i(-f_i) = \min_i f_i$. Therefore, $F = [0,1]^d$ and $\sup_{f\in F} d_t^\top f = \sum_{i=1}^{d} (d_{t,i})_+$ (here, $(x)_+ = \max(x,0)$ denotes the positive part of $x$). Now, $d_t = \mathbb{E}\left[\hat w_t \mid \hat w_1, \ldots, \hat w_{t-1}\right] - w_t = \gamma d^{-1}\mathbf{1} + (1-\gamma)w_t - w_t = \gamma(d^{-1}\mathbf{1} - w_t)$, where $\mathbf{1} = (1,\ldots,1)^\top$. Therefore, $\sum_{i=1}^{d} (d_{t,i})_+ = \gamma \sum_{i=1}^{d} (1/d - w_{t,i})_+ \le \gamma$, since $w_{t,i} \ge 0$. Now, if we use (17.2) to bound $\hat{\tilde L}_n - \tilde L_n(u)$, we get

$$\mathbb{E}\left[\hat L_n - L_n(u)\right] \le \eta \sum_{t=1}^{n} \mathbb{E}\left[\|\tilde f_t\|_\infty^2\right] + \frac{\ln d}{\eta} + n\gamma\,.$$
By (17.4), $\mathbb{E}\left[\|\tilde f_t\|_\infty^2\right] \le \sum_{i=1}^{d} 1/w_{t,i}(\gamma) \le d^2/\gamma$, since $w_{t,i}(\gamma) \ge \gamma/d$. Therefore,

$$\mathbb{E}\left[\hat L_n - L_n(u)\right] \le \frac{d^2 n\, \eta}{\gamma} + \frac{\ln d}{\eta} + n\gamma\,,$$

finishing the first part of the problem.

For the second part, first choose $\eta$ to balance the first two terms of the bound on the regret. This gives $\eta = (\gamma d^{-2} n^{-1} \ln d)^{1/2}$ and results in the bound $\sqrt{\gamma^{-1}\, 4 d^2 n \ln d} + n\gamma$. Now, to balance these terms, we set $\gamma$ so that $\gamma^3 = 4 d^2 n^{-1} \ln d$. Solving for $\gamma$ gives $\gamma = d^{2/3} n^{-1/3} (4\ln d)^{1/3}$, which results in the final bound of $2 d^{2/3} n^{2/3} (4\ln d)^{1/3}$.
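The balancing computation can be double-checked numerically (a hypothetical check; $d$ and $n$ are illustrative):

import numpy as np

d, n = 10, 100_000
gamma = (4 * d**2 * np.log(d) / n) ** (1 / 3)
eta = np.sqrt(gamma * np.log(d) / (d**2 * n))
bound = d**2 * n * eta / gamma + np.log(d) / eta + n * gamma
print(bound, 2 * d**(2/3) * n**(2/3) * (4 * np.log(d))**(1/3))  # the two agree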
Answer to Exercise 17.6. Part (a): We have

$$\mathbb{E}_t\left[\tilde f_t\right] = \mathbb{E}_t\left[C_t^\dagger\, \hat w_t \hat w_t^\top f_t\right] = C_t^\dagger\, \mathbb{E}_t\left[\hat w_t \hat w_t^\top\right] f_t = C_t^\dagger C_t\, f_t\,,$$

where in the first equality we used that $C_t$ (and thus also $C_t^\dagger$) is a deterministic function of $\hat w_1, \ldots, \hat w_{t-1}$ and that $\ell_t(w) = w^\top f_t$, where $f_t$ is a deterministic vector, while the last equality holds since $\mathbb{E}_t[\hat w_t \hat w_t^\top] = C_t$ by the definition of $C_t$. Now, since $C_t$ is non-singular, $C_t^\dagger = C_t^{-1}$. Therefore, $\tilde f_t$ indeed satisfies $\mathbb{E}_t[\tilde f_t] = f_t$.

Part (b): Introduce $0^\dagger = 0$, and for $x \ne 0$ let $x^\dagger = 1/x$. By Part (a), it suffices to show that for $w$ such that $\Pr(\hat w_t = w) > 0$, $w^\top C_t^\dagger C_t = w^\top$. Since $C_t$ is symmetric, so is $C_t^\dagger$, and hence this last equality holds if and only if $C_t C_t^\dagger w = w$. Since $C_t$ is a positive semidefinite matrix, its pseudo-inverse is $C_t^\dagger = \sum_{i=1}^{d} \lambda_i^\dagger u_i u_i^\top$, where $\lambda_i \ge 0$ are the eigenvalues of $C_t$, $u_i \in \mathbb{R}^d$ are the corresponding eigenvectors, and these eigenvectors form an orthonormal basis of $\mathbb{R}^d$. Let $w = \sum_i \alpha_i u_i$. An elementary calculation gives $C_t C_t^\dagger w = \sum_{i=1}^{d} \mathbb{I}\{\lambda_i > 0\}\, \alpha_i u_i$. Hence, $w - C_t C_t^\dagger w = \sum_{i=1}^{d} \mathbb{I}\{\lambda_i = 0\}\, \alpha_i u_i$, and thus it suffices to show that for all $i$ such that $\lambda_i = 0$ we also have $\alpha_i = 0$. Thus, take an index $i$ such that $\lambda_i = 0$. Note that $\alpha_i = \langle w, u_i\rangle$. By the definition of $C_t$, $C_t = C + p\,ww^\top$ for some $p > 0$ and some positive semidefinite matrix $C$. Hence, $u_i^\top C_t u_i = u_i^\top C u_i + p\,\alpha_i^2$. Since $C_t u_i = 0$, $0 \le \alpha_i^2 = -u_i^\top C u_i / p \le 0$, where we used that $u_i^\top C u_i \ge 0$. Thus, $\alpha_i = 0$.
Part (c): Consider the eigendecomposition of $C_t$ as in the previous part: $C_t^\dagger = \sum_{i=1}^{d} \lambda_i^\dagger u_i u_i^\top$. Then,

$$\mathbb{E}_t\left[\hat w_t^\top C_t^\dagger\, \hat w_t\right] = \sum_{i=1}^{d} \lambda_i^\dagger\, \mathbb{E}_t\left[(\hat w_t^\top u_i)^2\right] = \sum_{i=1}^{d} \lambda_i^\dagger\, u_i^\top\, \mathbb{E}_t\left[\hat w_t \hat w_t^\top\right] u_i = \sum_{i=1}^{d} \lambda_i^\dagger\, u_i^\top C_t u_i \quad \text{(by the definition of } C_t\text{)}$$
$$= \sum_{i=1}^{d} \lambda_i^\dagger \lambda_i\, u_i^\top u_i \quad \text{(because } u_i \text{ is an eigenvector of } C_t\text{)} \qquad = \sum_{i=1}^{d} \lambda_i^\dagger \lambda_i \quad \text{(because } u_i \text{ is normed)} \qquad = \mathrm{rank}(C_t)\,.$$

Here, the first equality used the eigendecomposition of $C_t^\dagger$ and the second used that $u_i$ is a deterministic function of $\hat w_1, \ldots, \hat w_{t-1}$.
Answer to Exercise 17.7.

$$\mathbb{E}_t\left[\ell_t(\hat w_t)\right] = \mathbb{E}_t\left[f_t^\top \hat w_t\right] = \mathbb{E}_t\left[f_t^\top V \hat\alpha_t\right] = \mathbb{E}_t\left[g_t^\top \hat\alpha_t\right] = g_t^\top \alpha_t + \gamma\, g_t^\top(\mu - \alpha_t) = \mathbb{E}_t\left[\tilde g_t^\top \alpha_t\right] + \gamma\, g_t^\top(\mu - \alpha_t) \le \mathbb{E}_t\left[\tilde g_t^\top \alpha_t\right] + \gamma\,,$$

where the last inequality used that $g_t^\top \alpha_t = f_t^\top(V\alpha_t) \ge 0$ and $g_t^\top \mu = f_t^\top(V\mu) \le 1$. Now, take the sum and use the tower rule.
Part (c): This follows immediately from Part (a).
Part (d): This follows because the algorithm can be viewed as PPA with the losses $\tilde g_t^\top \alpha$. In particular, the update rule can be written as

$$\tilde\alpha_{t+1} = \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^p_{++}}\left[\eta\, \tilde g_t^\top \alpha + D_R(\alpha, \alpha_t)\right], \qquad \alpha_{t+1} = \mathop{\mathrm{argmin}}_{\alpha \in \Delta_p}\left[\eta\, \tilde g_t^\top \alpha + D_R(\alpha, \alpha_t)\right].$$

Here, $R$ is the un-normalized negentropy regularizer over $\mathbb{R}^p_{++}$. Therefore, Lemma 9.2 is applicable and gives the desired inequality.
Part (e): By definition, $\tilde g_{t,i} = \tilde f_t^\top v_i = v_i^\top C_t^\dagger\, \hat w_t\, \ell_t(\hat w_t)$. Therefore, using $|\ell_t(\hat w_t)| \le 1$ and $|v_i^\top C_t^\dagger \hat w_t| \le \|v_i\|\, \|\hat w_t\|\, \|C_t^\dagger\| \le V_{\max}^2\, \|C_t^\dagger\| \le V_{\max}^2\, \lambda_{\max}(C_t^\dagger)$, we get

$$|\tilde g_{t,i}| \le V_{\max}^2\, \lambda_{\max}(C_t^\dagger)\,.$$

Now, by the definition of the pseudo-inverse, $\lambda_{\max}(C_t^\dagger) = (\lambda_{\min}^+(C_t))^{-1}$. Using the definition of $C_t$, we get $C_t = \sum_{i=1}^{p} (\gamma\mu_i + (1-\gamma)\alpha_{t,i})\, v_i v_i^\top \succeq \gamma \sum_{i=1}^{p} \mu_i v_i v_i^\top$. Therefore, $\lambda_{\min}^+(C_t) \ge \gamma\lambda_0$. Putting together the inequalities obtained, we get

$$|\tilde g_{t,i}| \le \frac{V_{\max}^2}{\gamma\lambda_0}\,.$$
Part (f): The argument of Section 17.3.1 is applicable, since it only needs that $\alpha_{t,i} > 0$ and $\eta\, \tilde g_{t,i} \ge -1$; the first of these is true by the definition of the algorithm, while the second holds since we assumed that (17.9) holds. This argument then gives exactly the desired statement (see (17.8)).
Part (g): Since $\tilde g_t = V^\top \tilde f_t$, we have $\tilde g_{t,i} = v_i^\top \tilde f_t$. Therefore,

$$\sum_{i=1}^{p} \alpha_{t,i}(\gamma)\, \tilde g_{t,i}^2 = \sum_{i=1}^{p} \alpha_{t,i}(\gamma)\, (v_i^\top \tilde f_t)^2 = \sum_{i=1}^{p} \alpha_{t,i}(\gamma)\, \tilde f_t^\top v_i v_i^\top \tilde f_t = \tilde f_t^\top \left( \sum_{i=1}^{p} \alpha_{t,i}(\gamma)\, v_i v_i^\top \right) \tilde f_t = \tilde f_t^\top C_t\, \tilde f_t = \ell_t(\hat w_t)^2\, \hat w_t^\top C_t^\dagger\, \hat w_t \le \hat w_t^\top C_t^\dagger\, \hat w_t\,.$$

Finally, by Part (c) of Exercise 17.6, $(1-\gamma)\, \mathbb{E}_t\left[\hat w_t^\top C_t^\dagger\, \hat w_t\right] \le \mathrm{rank}(C_t) \le d$.
Part (h): Putting together the inequalities obtained so far, for any $u = V\alpha$,

$$\mathbb{E}\left[\hat L_n - L_n(u)\right] \le \mathbb{E}\left[\hat{\tilde L}_n - \tilde L_n(\alpha)\right] + n\gamma \le \frac{\ln p}{\eta} + \frac{c\,\eta n d}{1-\gamma} + n\gamma\,.$$

Using $1/(1-x) \le 1 + 2x$, which holds for $0 \le x \le 1/2$, and plugging in $\gamma = \eta V_{\max}^2/\lambda_0$ gives the desired bound.

Part (i): This follows from the previous bound if we choose $\eta = \sqrt{\frac{\ln p}{n\left(\frac{V_{\max}^2}{\lambda_0} + cd\right)}}$ and note that for $n$ so small that $\gamma \le 1/2$ fails, the last, constant term of the regret bound already upper bounds the regret.
Answer to Exercise 17.8. Following the advice, we can map the problem into the setting of Exercise 17.7 and use the algorithm described there.

To calculate $\lambda_0$, we can follow the hint. If $i \ne j$, then because of independence, $\mathbb{E}[Q_i Q_j^\top] = \mathbb{E}[Q_i]\, \mathbb{E}[Q_j]^\top$. Now, $\mathbb{E}[Q_i] = \frac1N \mathbf{1}$, hence $\mathbb{E}[Q_i Q_j^\top] = \frac{1}{N^2}\mathbf{1}\mathbf{1}^\top$, where $\mathbf{1} = (1,1,\ldots,1)^\top \in \mathbb{R}^N$, while $\mathbb{E}[Q_i Q_i^\top] = \frac1N I$. Therefore, $M = \mathbb{E}[RR^\top]$ is a $k\times k$ block matrix whose diagonal blocks are $\frac1N$ times the $N\times N$ identity matrix and whose off-diagonal blocks are $\frac{1}{N^2}$ times the $N\times N$ matrix $\mathbf{1}\mathbf{1}^\top$. Clearly, $\lambda_0 \ge \min x^\top M x$, where the minimum is over unit vectors $x$ in the span of $\{v_1,\ldots,v_p\}$ (the range of $M$). Therefore, it suffices to lower bound $x^\top M x$ for such $x$; write $x = \sum_{i=1}^{p} \alpha_i v_i$. Let

$$v_i = \begin{pmatrix} v_{i,1} \\ v_{i,2} \\ \vdots \\ v_{i,k} \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix}$$

be the partitioning of $v_i$ and $x$ into $k$ blocks, each of length $N$. Now,

$$x^\top M x = \frac1N \|x\|^2 + \frac{1}{N^2} \sum_{i\ne j} (x_i^\top \mathbf{1})(x_j^\top \mathbf{1}) = \frac1N + \frac{1}{N^2}\left( \left(\sum_{i=1}^{k} x_i^\top \mathbf{1}\right)^2 - \sum_{i=1}^{k} (x_i^\top \mathbf{1})^2 \right),$$

where the second equality follows by completing the square and because, by assumption, $\|x\| = 1$. Now, $x_i = \sum_{j=1}^{p} \alpha_j v_{j,i}$ and therefore $x_i^\top \mathbf{1} = \sum_{j=1}^{p} \alpha_j\, v_{j,i}^\top \mathbf{1}$. By definition, $v_{j,i}^\top \mathbf{1} = 1$. Therefore, $x_i^\top \mathbf{1} = \sum_{j=1}^{p} \alpha_j$. Defining $a = \sum_{j=1}^{p} \alpha_j$, the expression in the bracket becomes $(ka)^2 - ka^2 = a^2 k(k-1) \ge 0$. Thus, $x^\top M x \ge 1/N$, showing that $\lambda_0 \ge 1/N$.
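This computation is easy to verify numerically; the hypothetical snippet below builds $M = \mathbb{E}[RR^\top]$ for small $N$ and $k$ and compares its smallest positive eigenvalue to $1/N$:

import numpy as np
from itertools import product

N, k = 3, 2
M = np.zeros((N * k, N * k))
for idx in product(range(N), repeat=k):     # enumerate K_0 = {v_1, ..., v_p}
    v = np.concatenate([np.eye(N)[i] for i in idx])
    M += np.outer(v, v) / N**k              # mu is uniform over the N^k vectors
eigs = np.linalg.eigvalsh(M)
lam0 = eigs[eigs > 1e-10].min()             # smallest positive eigenvalue
print(lam0, ">=", 1 / N)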
Clearly, $V_{\max}^2 = k$. Now, because the range of the aggregated loss is $[0, k]$, after $s$ rounds in the aggregated game (i.e., $s$ days), disregarding the constant term, the expected regret is bounded by $2k\sqrt{s\left(V_{\max}^2/\lambda_0 + cd\right)\ln p}$, where $c = e - 1$. We have $d = kN$ and $p = N^k$, therefore $\ln p = k\ln N$, and

$$2k\sqrt{s\left(V_{\max}^2/\lambda_0 + cd\right)\ln p} = 2k^{3/2}\sqrt{(1+c)\, nN\ln N}\,,$$

where we used that $s$ aggregated rounds amount to $n = sk$ rounds in the original game. Thus, the price of learning the feedback only after every $k$th round is an increase in the regret by a factor of $k^{3/2}$.
Answer to Exercise 17.9. As suggested in the hint, we can use the algorithm of Exercise 17.7. We have $d = SN$. We also have $p = N(N-1)\cdots(N-S+1) \le N^S$. The range of the loss is $[-S, S]$, i.e., the length of the range is bounded by $2S$. Therefore, the regret is bounded by $2(2S)\sqrt{nd\ln p} \le 4S^2\sqrt{nN\ln N}$.