COS 511: Theoretical Machine Learning

1 Online Linear Regression
In this lecture, we introduce the online linear regression problem. We give an algorithm
for this problem due to Widrow and Hoff, along with its analysis. Finally, as part of a new
topic, we explore the relationship between Online Learning and Batch learning, giving a
possible reduction from the latter to the former.
The vector u that minimizes L_u describes the best weight vector that we could have picked
if we knew all the data points offline. We can think of this smallest achievable loss as the
clairvoyant loss. This inequality ensures that the cumulative loss of our online algorithm
does not exceed the clairvoyant loss by more than a small amount.
One possible instantiation of this framework is the following algorithm:
• Initialize w_1 = 0.
• Choose parameter η > 0.
• For each round t = 1, …, T:
  – Get x_t ∈ R^n
  – Predict ŷ_t = w_t · x_t ∈ R
  – Observe y_t ∈ R
  – Update w_{t+1} = w_t − η(w_t · x_t − y_t) x_t.
The prediction rule and the update rule are the parts that instantiate the generic online
learning template above. This algorithm is called Widrow-Hoff (WH) after its inventors, as
well as Least Mean Squares (LMS).
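For concreteness, here is a minimal sketch of the algorithm in Python (assuming NumPy;
the function name and the representation of the stream as (x, y) pairs are our own, not
from the lecture):

    import numpy as np

    def widrow_hoff(examples, eta, n):
        # Run Widrow-Hoff / LMS over a stream of (x, y) pairs.
        # examples: iterable of (x, y), x an n-dim array, y a scalar; eta > 0.
        w = np.zeros(n)                      # w_1 = 0
        predictions = []
        for x, y in examples:
            y_hat = w @ x                    # predict ŷ_t = w_t · x_t
            predictions.append(y_hat)
            w = w - eta * (y_hat - y) * x    # w_{t+1} = w_t − η(w_t·x_t − y_t) x_t
        return predictions, w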
1.1 Motivation for Update Function
Before analyzing this algorithm, we will motivate why the particular weight update rule
w_{t+1} = w_t − η(w_t · x_t − y_t) x_t was used. We provide two motivations.
Motivation 1. The first motivation is related to performing Gradient Descent on the loss
function. Remember that the loss of a weight vector w on one example (x, y) is given by
L(w, x, y) = (w · x − y)². At a minimum, we expect a good update rule to decrease the loss
on the most recently seen example (i.e. (x_t, y_t) at round t). We know that the gradient of
a continuous, differentiable function points in the direction of fastest increase. Therefore,
it is a natural idea to decrease a function by taking a small step in the opposite direction
of the gradient. The gradient of the loss function with respect to w is given by

    ∇_w L(w, x, y) = (∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂w_n)ᵀ = 2(w · x − y) x.

Taking a step of length η/2 against this gradient at (x_t, y_t), and absorbing the constant
factor 2 into η, gives exactly the update rule w_{t+1} = w_t − η(w_t · x_t − y_t) x_t.
Motivation 2. The second motivation views the update as balancing two competing goals.
When choosing the new weight vector w_{t+1}, we want to:
1. achieve smaller loss on the current example (x_t, y_t), i.e. we want a small L(w_{t+1}, x_t, y_t);
2. stay close to w_t, since w_t embodies all the training examples we have seen so far and
we don’t want to throw away all the progress we have made, i.e. we want a small
‖w_{t+1} − w_t‖₂².
Since we want to minimize the above two things simultaneously, it is natural to minimize
their weighted sum

    ‖w_{t+1} − w_t‖₂² + η L(w_{t+1}, x_t, y_t).

Setting the gradient of this expression with respect to w_{t+1} to zero yields

    w_{t+1} = w_t − η(w_{t+1} · x_t − y_t) x_t.
This is almost the same as the update rule in the algorithm, except that on the right-hand
side we have w_{t+1} instead of w_t. However, since we want w_{t+1} and w_t to be close to
each other anyway, we can use w_t as an approximation for w_{t+1} on the right-hand side,
giving us exactly the update rule used in the algorithm. As an aside, this implicit equation
can also be solved exactly, as sketched below.
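This aside is not from the lecture itself: taking the inner product of the implicit equation
with x_t and solving for the scalar w_{t+1} · x_t yields the closed form

    w_{t+1} = w_t − η (w_t · x_t − y_t) / (1 + η ‖x_t‖₂²) · x_t,

which differs from the Widrow-Hoff update only by the factor 1/(1 + η‖x_t‖₂²) = 1 − O(η),
confirming that the approximation is mild for small η. A minimal NumPy sketch (hypothetical
names) checking this closed form against the implicit equation:

    import numpy as np

    def implicit_update(w, x, y, eta):
        # Closed-form solution of w' = w − η(w'·x − y)x,
        # obtained by first solving for the scalar w'·x.
        return w - eta * (w @ x - y) / (1.0 + eta * (x @ x)) * x

    # Sanity check on random data: w_next should satisfy the implicit equation.
    rng = np.random.default_rng(0)
    w, x, y, eta = rng.normal(size=3), rng.normal(size=3), 1.0, 0.1
    w_next = implicit_update(w, x, y, eta)
    assert np.allclose(w_next, w - eta * (w_next @ x - y) * x)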
1.2 Analysis
In this section, we provide an analysis for the online linear regression algorithm described
above.
Theorem 1. If ‖x_t‖₂ ≤ 1 for all rounds t, then the following bound holds for L_WH, the
cumulative loss of the Widrow-Hoff algorithm:

    L_WH ≤ min_{u ∈ R^n} ( L_u/(1−η) + ‖u‖₂²/η ).
• One can roughly think of u as the best vector that could have been chosen even if
all the data points were known in advance (except there is an additional term that
depends on the length of u so this analogy is not completely accurate).
• If we show the bound is true for all u, it follows that it must be true for the “best”
u (i.e. the one minimizing the right-hand side). Fix some u ∈ R^n, and divide both
sides of the inequality by T to get

    L_WH/T ≤ (L_u/T) · 1/(1−η) + ‖u‖₂²/(ηT).

For sufficiently small η, 1/(1−η) is close to 1. Furthermore, as T → ∞, ‖u‖₂²/(ηT) → 0. It
follows that as T → ∞, the rate of loss of Widrow-Hoff (L_WH/T) approaches the rate
of loss of the “best” vector u (L_u/T); a concrete illustration follows these remarks.
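For a concrete sense of the rates involved (an illustrative choice of parameters, not from
the lecture): suppose the losses are scaled so that L_u ≤ T, and set η = 1/√T. Since
1/(1−η) ≤ 1 + 2η whenever η ≤ 1/2, the bound above becomes

    L_WH/T ≤ (L_u/T)(1 + 2/√T) + ‖u‖₂²/√T ≤ L_u/T + (2 + ‖u‖₂²)/√T,

so the loss rate of Widrow-Hoff exceeds that of the best u by at most O(1/√T).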
Proof. The proof uses a potential function argument. We need to establish some notation
before giving the proof.
• As we argued above, it is enough to show the bound holds for all vectors u to conclude
that the bound holds for the best such vector as well. For the rest of the proof, we
fix some u ∈ R^n.
• Define Φ_t = ‖w_t − u‖₂² to be the potential at round t. The closer w_t is to u, the
lower the potential.
• Let ℓ_t = w_t · x_t − y_t = ŷ_t − y_t. With this notation, ℓ_t² denotes the loss of Widrow-Hoff
at round t.
• Let g_t = u · x_t − y_t. Similarly, g_t² denotes the loss of the weight vector u at round t.
Equipped with the notation above, we can make the following claim:
Claim. Φ_{t+1} − Φ_t ≤ −η ℓ_t² + (η/(1−η)) g_t².

Roughly speaking, the first term −η ℓ_t² corresponds to the loss of the Widrow-Hoff learner:
as the learner suffers loss, the potential goes down. The second term (η/(1−η)) g_t²
corresponds to the loss of u: as u suffers loss, the potential goes up. The potential can
thus be thought of as a budget of how much loss the learner is allowed to suffer in order
not to fall behind u too much.
To prove the claim, let Δ_t = w_{t+1} − w_t = −η ℓ_t x_t denote the change in the weight
vector at round t. Then

    Φ_{t+1} − Φ_t
      = ‖w_{t+1} − u‖₂² − ‖w_t − u‖₂²                                    (2)
      = ‖(w_t − u) + Δ_t‖₂² − ‖w_t − u‖₂²                                (3)
      = ‖w_t − u‖₂² + 2 Δ_t · (w_t − u) + ‖Δ_t‖₂² − ‖w_t − u‖₂²          (4)
      = 2 Δ_t · (w_t − u) + ‖Δ_t‖₂²                                      (5)
      = −2η ℓ_t x_t · (w_t − u) + η² ℓ_t² ‖x_t‖₂²                        (6)
      = −2η ℓ_t (w_t · x_t − u · x_t) + η² ℓ_t² ‖x_t‖₂²                  (7)
      ≤ −2η ℓ_t ((w_t · x_t − y_t) − (u · x_t − y_t)) + η² ℓ_t²          (8)
      = −2η ℓ_t² + 2η ℓ_t g_t + η² ℓ_t²                                  (9)
      ≤ −2η ℓ_t² + η (1−η) ℓ_t² + η g_t²/(1−η) + η² ℓ_t²                 (10)
      = −η ℓ_t² + (η/(1−η)) g_t².                                        (11)

Equality (2) follows from the definition of Φ_t. Equality (3) holds by definition of Δ_t.
Equality (4) is obtained by expanding the first squared term. Equality (5) is due to the
cancellation of the first and last terms. Equality (6) is obtained by replacing Δ_t with
−η ℓ_t x_t. Equality (7) results from multiplying out the first product. For Inequality (8),
we have added and subtracted y_t in the first term, and used the assumption that ‖x_t‖₂ ≤ 1
in the second. For Equality (9), we have used the definitions of ℓ_t and g_t and then
multiplied out. Inequality (10) uses the following trick: ab ≤ (a² + b²)/2 for all a, b; it
specifically holds for a = g_t/√(1−η) and b = ℓ_t √(1−η), giving
ℓ_t g_t ≤ (g_t²/(1−η) + (1−η) ℓ_t²)/2. Lastly, Equality (11) is a result of a few simple
cancellations, and gives us exactly the desired bound in the claim.
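A quick numerical sanity check of the claim on random data (a minimal sketch, assuming
NumPy; the names and the data-generating choices are ours):

    import numpy as np

    rng = np.random.default_rng(1)
    eta, n = 0.1, 5
    w = rng.normal(size=n)
    u = rng.normal(size=n)
    for _ in range(1000):
        x = rng.normal(size=n)
        x /= max(1.0, np.linalg.norm(x))   # enforce ||x_t||_2 <= 1
        y = rng.normal()
        ell = w @ x - y                    # ℓ_t
        g = u @ x - y                      # g_t
        w_next = w - eta * ell * x         # Widrow-Hoff update
        phi_diff = np.sum((w_next - u)**2) - np.sum((w - u)**2)
        assert phi_diff <= -eta * ell**2 + eta / (1 - eta) * g**2 + 1e-9
        w = w_next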
The Theorem follows from the claim because

    −‖u‖₂² = −Φ_1                                                     (12)
           ≤ Φ_{T+1} − Φ_1                                            (13)
           = (Φ_{T+1} − Φ_T) + (Φ_T − Φ_{T−1}) + … + (Φ_2 − Φ_1)      (14)
           = Σ_{t=1}^T (Φ_{t+1} − Φ_t)                                (15)
           ≤ Σ_{t=1}^T (−η ℓ_t² + (η/(1−η)) g_t²)                     (16)
           = −η Σ_{t=1}^T ℓ_t² + (η/(1−η)) Σ_{t=1}^T g_t²             (17)
           = −η L_WH + (η/(1−η)) L_u.                                 (18)
Equality (12) holds because w_1 = 0 and therefore Φ_1 = ‖w_1 − u‖₂² = ‖0 − u‖₂² = ‖u‖₂².
Inequality (13) holds because Φ_{T+1} is a squared norm and therefore non-negative.
Equality (14) holds because each new term is added once and subtracted once. Inequality (16)
holds by the claim shown above. Equality (17) is obtained by distributing the sum.
Equality (18) follows from the observation that the cumulative loss of Widrow-Hoff
(respectively u) is the sum of the losses at every round, i.e. L_WH = Σ_{t=1}^T ℓ_t² and
L_u = Σ_{t=1}^T g_t².
Rearranging the above inequality and dividing by η gives L_WH ≤ L_u/(1−η) + ‖u‖₂²/η for the
fixed u, and hence for the minimizing u, which is exactly the statement of the theorem.
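Theorem 1 can also be checked empirically. The following is a rough simulation sketch
(assuming NumPy; the data-generating process is invented for illustration), verifying that
the cumulative loss never exceeds the bound evaluated at one particular comparator u:

    import numpy as np

    rng = np.random.default_rng(2)
    T, n, eta = 10_000, 5, 0.1
    u_star = rng.normal(size=n)            # hidden vector generating the labels

    w = np.zeros(n)
    L_wh = 0.0
    xs, ys = [], []
    for t in range(T):
        x = rng.normal(size=n)
        x /= max(1.0, np.linalg.norm(x))   # keep ||x_t||_2 <= 1
        y = u_star @ x + 0.01 * rng.normal()
        L_wh += (w @ x - y)**2             # learner's squared loss this round
        w = w - eta * (w @ x - y) * x      # Widrow-Hoff update
        xs.append(x); ys.append(y)

    # Compare against the bound evaluated at the comparator u = u_star.
    X, Y = np.array(xs), np.array(ys)
    L_u = np.sum((X @ u_star - Y)**2)
    bound = L_u / (1 - eta) + (u_star @ u_star) / eta
    print(L_wh <= bound)                   # should print True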
The Widrow-Hoff update is additive: the weights are changed by adding a correction vector.
Many additive-update algorithms have multiplicative-update counterparts:

    Additive                         Multiplicative
    Support Vector Machine (SVM)     AdaBoost
    Perceptron                       Winnow / Weighted Majority Algorithm (WMA)
    Gradient Descent (GD)            Exponentiated Gradient (EG)
2 Online Learning versus Batch Learning

We have now encountered two models of learning:

• Batch learning, including the PAC model, where we are given a set of random
examples offline, and our goal is to minimize the generalization error.
• Online learning, where we are given a stream of possibly adversarial examples, and
our goal is to minimize cumulative loss (so in a sense training and testing are mixed
together).
It is natural to ask whether these two models are related. Intuitively, the online setting is
more demanding, since it makes no randomness assumption. So we can ask whether batch learning
can be reduced to online learning, i.e. given an online learning algorithm, could we use it
to learn in the batch setting? Besides theoretical interest, this reduction turns out to have
practical applications, as online algorithms are often used for offline learning tasks.
More formally, we get a set of examples S = ⟨(x_1, y_1), (x_2, y_2), …, (x_m, y_m)⟩ drawn i.i.d.
from a distribution D as training data. We then get a test example (x, y). Define the risk
(expected loss) of a vector v to be

    R_v = E_{(x,y)∼D} [(v · x − y)²].

The goal is to use an online learning algorithm to find v with small risk.
It turns out that Widrow-Hoff and its analysis yield both an efficient algorithm for this
problem and an immediate means of bounding the risk of the vector that it finds. In
particular, we propose the following algorithm: run Widrow-Hoff on the sequence of training
examples (x_1, y_1), …, (x_m, y_m), obtaining weight vectors w_1, w_2, …, w_m, and output
their average

    v = (1/m) Σ_{t=1}^m w_t.
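A minimal sketch of this online-to-batch conversion in Python (assuming NumPy; the function
name is ours):

    import numpy as np

    def online_to_batch(X, Y, eta):
        # Run Widrow-Hoff once over the sample and return the averaged weights.
        # X: (m, n) array of training inputs, Y: (m,) array of labels.
        m, n = X.shape
        w = np.zeros(n)
        w_sum = np.zeros(n)
        for t in range(m):
            w_sum += w                     # accumulate w_t *before* the update
            w = w - eta * (w @ X[t] - Y[t]) * X[t]
        return w_sum / m                   # v = (1/m) Σ_{t=1}^m w_t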
The following theorem shows that the expected risk of the output vector is low. One can
also show that the risk is low with high probability, but we do not do so in today’s lecture.
Theorem 2.

    E_S[R_v] ≤ min_u ( R_u/(1−η) + ‖u‖₂²/(ηm) ),

where E_S[·] denotes expectation with respect to the random choice of the sample S.
Proof. Fix any vector u ∈ R^n, and let (x, y) be a random test point drawn from the
distribution D. We write E[·], with no subscript, to denote expectation with respect to
both the random sample S and the random test point (x, y).
The following three observations will be needed:
Observation 1. (v · x − y)² ≤ (1/m) Σ_{t=1}^m (w_t · x − y)².

This is due to Jensen’s inequality, since f(z) = z² is a convex function:

    (v · x − y)² = ( (1/m) Σ_{t=1}^m (w_t · x − y) )² ≤ (1/m) Σ_{t=1}^m (w_t · x − y)².
Observation 2. E[(u · x_t − y_t)²] = E[(u · x − y)²].

This is because (x_t, y_t) and (x, y) come from the same distribution.

Observation 3. E[(w_t · x − y)²] = E[(w_t · x_t − y_t)²].

This is because (x_t, y_t) and (x, y) come from the same distribution, and w_t is independent
of both (x_t, y_t) and (x, y), since it was trained only on (x_1, y_1), …, (x_{t−1}, y_{t−1})
and thus depends only on the randomness of the first t − 1 examples.
Putting these observations together, we have

    E_S[R_v] = E[(v · x − y)²]                                             (19)
             ≤ E[(1/m) Σ_{t=1}^m (w_t · x − y)²]                           (20)
             = (1/m) Σ_{t=1}^m E[(w_t · x − y)²]                           (21)
             = (1/m) Σ_{t=1}^m E[(w_t · x_t − y_t)²]                       (22)
             = (1/m) E[ Σ_{t=1}^m (w_t · x_t − y_t)² ]                     (23)
             ≤ (1/m) E[ (Σ_{t=1}^m (u · x_t − y_t)²)/(1−η) + ‖u‖₂²/η ]     (24)
             = (1/m) (Σ_{t=1}^m E[(u · x_t − y_t)²])/(1−η) + ‖u‖₂²/(ηm)    (25)
             = (1/m) (Σ_{t=1}^m E[(u · x − y)²])/(1−η) + ‖u‖₂²/(ηm)        (26)
             = R_u/(1−η) + ‖u‖₂²/(ηm),                                     (27)
with all the expectations taken over the random sample S and the random test point (x, y).
Equality (19) holds by definition of the risk of v. Inequality (20) is an application of
Observation 1. Equality (21) is by linearity of expectation. Equality (22) follows from
Observation 3. Equality (23) is by linearity of expectation. Inequality (24) holds because
the term inside the expectation is exactly the cumulative loss of Widrow-Hoff, to which we
can apply the bound from the first part of lecture. Equality (25) holds by linearity of
expectation and pulling out the constant terms. Equality (26) holds by Observation 2, and
Equality (27) is true by definition of the risk of u. Since the bound holds for every fixed
u ∈ R^n, it holds for the minimizing u, which proves the theorem.
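As with Theorem 1, the guarantee can be probed by simulation. The following rough sketch
(assuming NumPy, the online_to_batch helper sketched above, and an invented distribution D)
estimates E_S[R_v] by Monte Carlo and compares it with the bound at one comparator u:

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, eta = 500, 5, 0.1
    u_star = rng.normal(size=n)               # hypothetical data-generating vector

    def draw(k):
        # Draw k i.i.d. examples with ||x||_2 <= 1 from a synthetic distribution D.
        X = rng.normal(size=(k, n))
        X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
        Y = X @ u_star + 0.01 * rng.normal(size=k)
        return X, Y

    # Estimate E_S[R_v]: average the test risk of v over many independent samples S.
    risks = []
    for _ in range(50):
        X, Y = draw(m)
        v = online_to_batch(X, Y, eta)        # averaged Widrow-Hoff weights
        Xt, Yt = draw(2000)
        risks.append(np.mean((Xt @ v - Yt)**2))

    Xb, Yb = draw(100_000)
    R_u = np.mean((Xb @ u_star - Yb)**2)      # Monte Carlo estimate of R_u at u = u_star
    bound = R_u / (1 - eta) + (u_star @ u_star) / (eta * m)
    print(np.mean(risks), "<=", bound)        # estimated expected risk vs. the bound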