
COS 511: Theoretical Machine Learning

Lecturer: Rob Schapire                                      Lecture #18
Scribe: Maryam Bahrani                                      April 11, 2018

In this lecture, we introduce the online linear regression problem and give an algorithm
for it due to Widrow and Hoff, along with its analysis. Finally, as part of a new
topic, we explore the relationship between online learning and batch learning, giving a
possible reduction from the latter to the former.

1 Online Linear Regression


The goal of online linear regression is to minimize the square loss of a linear function in an
online setting, according to the following framework:
• Initialize w1 = 0
• For each round t = 1, . . . , T :
– Get xt ∈ Rn
– Predict ŷt = wt · xt ∈ R
– Observe yt ∈ R
– Update wt .
We have the following notions of "loss" for this algorithm. The square loss in round t is
given by (ŷt − yt)². The cumulative loss of an algorithm A, denoted by LA, is the sum of the
losses in individual rounds, i.e. $L_A = \sum_{t=1}^{T} (\hat{y}_t - y_t)^2$. The loss of a specific weight vector
u is given by $L_u = \sum_{t=1}^{T} (u \cdot x_t - y_t)^2$. For an algorithm A to have good performance, we
would like it to satisfy an inequality of the form

$$L_A \le \min_u L_u + \text{(small number)}.$$

The vector u that minimizes Lu describes the best weight vector that we could have picked
if we knew all the data points offline. We can think of this smallest achievable loss as the
clairvoyant loss. This inequality ensures that the cumulative loss of our online algorithm
does not exceed the clairvoyant loss by more than a small amount.
One possible instantiation of this framework is the following algorithm:
• Initialize w1 = 0
• Choose parameter η > 0.
• For each round t = 1, . . . , T :
– Get xt ∈ Rn
– Predict ŷt = wt · xt ∈ R
– Observe yt ∈ R
– Update wt+1 = wt − η(wt · xt − yt )xt .
The changes from the template above are the choice of the parameter η > 0 and the explicit
update rule for wt+1. This algorithm is called Widrow-Hoff (WH) after its inventors, as
well as Least Mean Squares (LMS).
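To make the procedure concrete, here is a minimal sketch of the Widrow-Hoff loop in Python
with NumPy. The function name and interface are our own choices for illustration, not part
of the lecture:

```python
import numpy as np

def widrow_hoff(X, y, eta=0.1):
    """Run the Widrow-Hoff (LMS) updates over the rounds in X, y.

    X: (T, n) array whose rows are the inputs x_t.
    y: (T,) array of targets y_t.
    Returns the final weight vector and the cumulative square loss.
    """
    T, n = X.shape
    w = np.zeros(n)                          # initialize w_1 = 0
    cumulative_loss = 0.0
    for t in range(T):
        y_hat = w @ X[t]                     # predict y_hat_t = w_t . x_t
        cumulative_loss += (y_hat - y[t]) ** 2
        w = w - eta * (y_hat - y[t]) * X[t]  # w_{t+1} = w_t - eta (w_t.x_t - y_t) x_t
    return w, cumulative_loss
```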
1.1 Motivation for Update Function
Before analyzing this algorithm, we will motivate why the particular weight update function
wt+1 = wt − η(wt · xt − yt )xt was used. We provide two motivations.

Motivation 1. The first motivation is related to performing Gradient Descent on the loss
function. Remember that the loss of a weight vector w on one example (x, y) is given by
L(w, x, y) = (w · x − y)2 . At a minimum, we expect a good update rule to decrease the loss
on the most recently seen example (i.e. (xt , yt ) at round t). We know that the gradient of
a continuous, differentiable function points in the direction of fastest increase. Therefore,
it is a natural idea to decrease a function by taking a small step in the opposite direction
of the gradient. The gradient of the loss function is given by

$$\nabla_w L(w, x, y) = \begin{pmatrix} \partial L/\partial w_1 \\ \partial L/\partial w_2 \\ \vdots \\ \partial L/\partial w_n \end{pmatrix} = 2(w \cdot x - y)\,x,$$

where n is the number of dimensions.


A good update rule is thus

$$w_{t+1} = w_t - (\text{constant}) \cdot \nabla_w L(w_t, x_t, y_t) = w_t - (\text{constant}) \cdot (w_t \cdot x_t - y_t)\,x_t,$$

which matches the update rule of the algorithm, with the factor of 2 absorbed into the
constant η.
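As a quick sanity check on this gradient formula (our own illustration with arbitrary test
values, not from the lecture), we can compare it against a finite-difference estimate:

```python
import numpy as np

w = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
y = 1.0
loss = lambda v: (v @ x - y) ** 2        # L(w, x, y) = (w.x - y)^2

analytic = 2 * (w @ x - y) * x           # the gradient derived above

eps = 1e-6                               # central finite differences
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])

print(analytic, numeric)                 # the two should agree to ~1e-6
```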

Motivation 2. Ideally, we want the new weight vector wt+1 to

1. achieve smaller loss on the current example (xt , yt ) — so we want a small L(wt+1 , xt , yt );

2. stay close to wt , since wt embodies all the training examples we have seen so far and
we don’t want to throw away all the progress we have made — so we want a small
‖wt+1 − wt‖₂².

Since we want to minimize the above two things simultaneously, it is natural to minimize
their weighted sum

$$\eta\,(w_{t+1} \cdot x_t - y_t)^2 + \|w_{t+1} - w_t\|_2^2. \tag{1}$$

Solving the above optimization for wt+1, we get

$$w_{t+1} = w_t - \eta\,(w_{t+1} \cdot x_t - y_t)\,x_t.$$

This is almost the same as the update rule in the algorithm, except that on the right-hand
side we have wt+1 instead of wt. However, since we want wt+1 and wt to be close to
each other anyway, we can use wt as an approximation for wt+1 on the right-hand side,
giving us exactly the update rule used in the algorithm.
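For completeness, here is the omitted calculus step (a routine derivation we spell out; it
is not in the original notes): setting the gradient of expression (1) with respect to wt+1
to zero gives

$$2\eta\,(w_{t+1} \cdot x_t - y_t)\,x_t + 2\,(w_{t+1} - w_t) = 0,$$

which rearranges to the implicit update $w_{t+1} = w_t - \eta\,(w_{t+1} \cdot x_t - y_t)\,x_t$ above.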

1.2 Analysis
In this section, we provide an analysis for the online linear regression algorithm described
above.

Theorem 1. If ‖xt‖₂ ≤ 1 for all rounds t, then the following bound holds for LWH, the
cumulative loss of the Widrow-Hoff algorithm:

$$L_{WH} \le \min_{u \in \mathbb{R}^n} \left( \frac{L_u}{1-\eta} + \frac{\|u\|_2^2}{\eta} \right).$$

Before going into the proof, we list a few remarks.

• The dependence on the length of u is needed because it is sometimes possible to make
Lu arbitrarily small by making u longer, which is not desirable. An alternative way of
addressing the dependence on length is to take the minimum over all vectors u with
bounded length.

• One can roughly think of u as the best vector that could have been chosen even if
all the data points were known in advance (except there is an additional term that
depends on the length of u so this analogy is not completely accurate).

• If we show the bound is true for all u, it follows that it must be true for the “best”
u (i.e. the one minimizing the right-hand side). Fix some u ∈ Rn , and divide both
sides of the inequality by T to get

$$\frac{L_{WH}}{T} \le \frac{L_u}{T} \cdot \frac{1}{1-\eta} + \frac{\|u\|_2^2}{\eta T}.$$

For sufficiently small η, 1/(1−η) is close to 1. Furthermore, as T → ∞, ‖u‖₂²/(ηT) → 0. It
follows that as T → ∞, the rate of loss of Widrow-Hoff (LWH/T) approaches the rate
of loss of the "best" vector u (Lu/T).

Proof. The proof uses a potential function argument. We need to establish some notation
before giving the proof.

• As we argued above, it is enough to show the bound holds for all vectors u to conclude
that the bound holds for the best such vector as well. For the rest of the proof, we
fix some u ∈ Rn .

• Define Φt = ‖wt − u‖₂² to be the potential at round t. The closer wt and u, the lower
the potential.

• Let ℓt = wt · xt − yt = ŷt − yt. With this notation, ℓt² denotes the loss of Widrow-Hoff
at round t.

• Let gt = u · xt − yt. Similarly, gt² denotes the loss of the weight vector u at round t.

• Let ∆t = η(ŷt − yt)xt = ηℓt xt. It follows that wt+1 = wt − ∆t, so ∆t measures the
change in the weight vector from round t to round t + 1.

Equipped with the notation above, we can make the following claim:

Claim. $\Phi_{t+1} - \Phi_t \le -\eta\,\ell_t^2 + \frac{\eta}{1-\eta}\,g_t^2.$
Roughly speaking, the first term −ηℓt² corresponds to the loss of the Widrow-Hoff learner:
as the learner suffers loss, the potential goes down. The second term (η/(1−η))gt² corresponds
to the loss of u: as u suffers loss, the potential goes up. The potential can thus be thought
of as a budget of how much loss the learner is allowed to suffer without falling too far
behind u.

Proof of Claim. We have

$$\begin{aligned}
\Phi_{t+1} - \Phi_t &= \|w_{t+1} - u\|_2^2 - \|w_t - u\|_2^2 && (2)\\
&= \|w_t - u - \Delta_t\|_2^2 - \|w_t - u\|_2^2 && (3)\\
&= \|w_t - u\|_2^2 - 2(w_t - u) \cdot \Delta_t + \|\Delta_t\|_2^2 - \|w_t - u\|_2^2 && (4)\\
&= -2(w_t - u) \cdot \Delta_t + \|\Delta_t\|_2^2 && (5)\\
&= -2\eta\ell_t(w_t \cdot x_t - u \cdot x_t) + \eta^2 \ell_t^2 \|x_t\|_2^2 && (6)\\
&\le -2\eta\ell_t(w_t \cdot x_t - y_t + y_t - u \cdot x_t) + \eta^2 \ell_t^2 && (7)\\
&= -2\eta\ell_t(\ell_t - g_t) + \eta^2 \ell_t^2 && (8)\\
&= \eta^2 \ell_t^2 - 2\eta\ell_t^2 + 2\eta\ell_t g_t && (9)\\
&\le \eta^2 \ell_t^2 - 2\eta\ell_t^2 + \eta\left(\frac{g_t^2}{1-\eta} + \ell_t^2(1-\eta)\right) && (10)\\
&= -\eta\,\ell_t^2 + \frac{\eta}{1-\eta}\,g_t^2. && (11)
\end{aligned}$$

Equality (2) follows from the definition of Φt. Equality (3) holds by definition of ∆t.
Equality (4) is obtained by expanding the first squared term. Equality (5) is due to the
cancellation of the first and last terms. Equality (6) is obtained by replacing ∆t with
ηℓt xt and multiplying out the first product. In Inequality (7), we have added and
subtracted yt in the first term, and used the assumption that ‖xt‖₂ ≤ 1. In Equality (8),
we have used the definitions of ℓt and gt. Equality (9) is simple algebra. Inequality (10)
uses the following trick: ab ≤ (a² + b²)/2 for all a, b, so it specifically holds for
a = gt/√(1−η) and b = ℓt·√(1−η). Lastly, Equality (11) is a result of a few simple
cancellations, and gives us exactly the desired bound in the claim.

The Theorem follows from the claim because

$$\begin{aligned}
-\|u\|_2^2 &= -\Phi_1 && (12)\\
&\le \Phi_{T+1} - \Phi_1 && (13)\\
&= \Phi_{T+1} - \Phi_T + \Phi_T - \Phi_{T-1} + \Phi_{T-1} - \dots + \Phi_2 - \Phi_1 && (14)\\
&= \sum_{t=1}^{T} (\Phi_{t+1} - \Phi_t) && (15)\\
&\le \sum_{t=1}^{T} \left( -\eta\,\ell_t^2 + \frac{\eta}{1-\eta}\,g_t^2 \right) && (16)\\
&= -\eta \sum_{t=1}^{T} \ell_t^2 + \frac{\eta}{1-\eta} \sum_{t=1}^{T} g_t^2 && (17)\\
&= -\eta\,L_{WH} + \frac{\eta}{1-\eta}\,L_u. && (18)
\end{aligned}$$
Equality (12) holds because w1 = 0 and therefore Φ1 = ‖w1 − u‖₂² = ‖0 − u‖₂² = ‖u‖₂².
Inequality (13) holds because ΦT+1 is a squared norm and therefore non-negative. Equality (14)
holds because each new term is added once and subtracted once. Inequality (16) holds by the
claim shown above. Equality (17) is obtained by distributing the sum. Equality (18) follows
from the observation that the cumulative loss of Widrow-Hoff (respectively, of u) is the sum
of the losses at every round, i.e. $L_{WH} = \sum_{t=1}^{T} \ell_t^2$ and $L_u = \sum_{t=1}^{T} g_t^2$.
Solving for LW H , the above inequality gives exactly the statement of the theorem.
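As an aside, the bound of Theorem 1 is easy to check numerically. The following
self-contained sketch (our own construction, on an arbitrary synthetic data distribution)
runs the Widrow-Hoff loop and compares its cumulative loss against the right-hand side of
the theorem; the minimizing u has a closed form because minimizing Lu/(1−η) + ‖u‖₂²/η over
u is a ridge-regression problem:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, eta = 500, 5, 0.1

# synthetic rounds with ||x_t||_2 <= 1, as Theorem 1 requires
X = rng.normal(size=(T, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=T)

# cumulative loss of Widrow-Hoff (same loop as in Section 1)
w, L_wh = np.zeros(n), 0.0
for t in range(T):
    ell = w @ X[t] - y[t]
    L_wh += ell ** 2
    w = w - eta * ell * X[t]

# minimizer of ||Xu - y||^2 / (1 - eta) + ||u||^2 / eta, via ridge normal equations
u = np.linalg.solve(X.T @ X + ((1 - eta) / eta) * np.eye(n), X.T @ y)
bound = np.sum((X @ u - y) ** 2) / (1 - eta) + (u @ u) / eta

print(L_wh, bound)   # Theorem 1 guarantees L_wh <= bound
```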

1.3 Generalizing the Motivation

As explained in Motivation 2, the goal of the online regression update was to minimize

η · (loss of wt+1 on (xt, yt)) + ("distance" between wt+1 and wt).

In particular, we chose the square loss function and the squared Euclidean distance. However,
there are many different notions of loss and distance that can be used in the above measure.
For example, for an arbitrary loss function L(w, x, y) and the Euclidean norm squared
as a measure of distance, we get the following update rule:

$$w_{t+1} = w_t - \eta \nabla_w L(w_t, x_t, y_t).$$
Note that this update rule is simply performing gradient descent on the loss function.
Another possibility is an arbitrary loss function L(w, x, y) along with relative entropy as
a measure of distance. More formally, we can restrict the w’s to be probability distributions,
and try to minimize RE(wt ||wt+1 ) in every step. The weight update function then becomes
$$w_{t+1,i} = \frac{w_{t,i} \cdot \exp\!\left(-\eta\, \frac{\partial L}{\partial w_i}(w_t, x_t, y_t)\right)}{Z_t}$$

where Zt is a normalization factor to make sure wt+1 is a probability distribution. This
update rule is referred to as Exponentiated Gradient (EG).
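To make the multiplicative update concrete, here is a minimal sketch of one EG step in
Python, instantiated with the square loss (the lecture leaves the loss generic; the function
name and example values are ours):

```python
import numpy as np

def eg_step(w, x, y, eta=0.1):
    """One Exponentiated Gradient update for the square loss (w.x - y)^2.

    w is a probability vector (nonnegative entries summing to 1) and stays one.
    """
    grad = 2 * (w @ x - y) * x          # dL/dw_i for the square loss
    w_new = w * np.exp(-eta * grad)     # multiplicative update
    return w_new / w_new.sum()          # divide by Z_t to renormalize

# example: start from the uniform distribution over 3 coordinates
w = np.ones(3) / 3
w = eg_step(w, x=np.array([1.0, 0.5, -0.2]), y=0.3)
```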
Note that in the case of Euclidean norm squared as a distance measure, the update func-
tion is additive, whereas in the case of Relative Entropy, the update rule is multiplicative.
The following table summarizes the update rules we have seen so far.

Additive                          Multiplicative
Support Vector Machine (SVM)      AdaBoost
Perceptron                        Winnow / Weighted Majority Algorithm (WMA)
Gradient Descent (GD)             Exponentiated Gradient (EG)

2 Relating Batch Learning and Online Learning


So far in the class, we have discussed the following two learning models:

• Batch learning, including the PAC model, where we are given a set of random
examples offline, and our goal is to minimize the generalization error.

• Online Learning, where we are given a stream of possibly adversarial examples, and
our goal is to minimize cumulative loss (so in a sense training and testing are mixed
together).

It is natural to ask whether these two models are related. Intuitively, the online setting is
stronger, since it requires no randomness assumption. So we can ask whether batch learning
can be reduced to online learning, i.e. given an online learning algorithm, could we use it
to learn in the batch setting? Besides theoretical interest, this reduction turns out to have
practical applications, as online algorithms are often used for offline learning tasks.
More formally, we get a set of examples S = ⟨(x1, y1), (x2, y2), . . . , (xm, ym)⟩ drawn i.i.d.
from a distribution D as training data. We then get a test example (x, y). Define the risk
(expected loss) of a vector v to be $R_v = \mathbb{E}_{(x,y)\sim D}\big[(v \cdot x - y)^2\big]$. The goal is to use an online
learning algorithm to find v with small risk.
It turns out that Widrow-Hoff and its analysis yield both an efficient algorithm for this
problem, and an immediate means of bounding the risk of the vector that it finds. In
particular, we propose the following algorithm for solving this problem:

• Run Widrow-Hoff for T = m rounds on S = ⟨(x1, y1), (x2, y2), . . . , (xm, ym)⟩, in the
random order in which the examples were given to us.

• Widrow-Hoff produces a sequence of weight vectors w1, w2, . . . , wm.

• We output the average of these vectors, $v = \frac{1}{m}\sum_{i=1}^{m} w_i$ (instead of the last one).
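A minimal sketch of this reduction in Python (the function name and interface are our own
for illustration):

```python
import numpy as np

def batch_from_online(X, y, eta=0.1):
    """Online-to-batch reduction: run Widrow-Hoff once over the sample,
    in the given (random) order, and return the average of the iterates."""
    m, n = X.shape
    w = np.zeros(n)
    total = np.zeros(n)
    for t in range(m):
        total += w                          # accumulate w_1, ..., w_m
        w = w - eta * (w @ X[t] - y[t]) * X[t]
    return total / m                        # v = (1/m) * sum_i w_i
```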

The following theorem shows that the expected risk of the output vector is low. One can
also show that the risk is low with high probability, but we do not do so in today’s lecture.

Theorem 2.

$$\mathbb{E}_S[R_v] \le \min_u \left( \frac{R_u}{1-\eta} + \frac{\|u\|_2^2}{\eta m} \right),$$

where ES[·] denotes expectation with respect to the random choice of the sample S.

Proof. Fix any vector u ∈ Rn , and let (x, y) be a random test point drawn from the
distribution D. We write E[·], with no subscript, to denote expectation with respect to
both the random sample S and the random test point (x, y).
The following three observations will be needed:

Observation 1. $(v \cdot x - y)^2 \le \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2$.

This is due to Jensen's inequality, since f(z) = z² is a convex function:

$$(v \cdot x - y)^2 = \left( \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y) \right)^2 \le \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2.$$

Observation 2. $\mathbb{E}\big[(u \cdot x_t - y_t)^2\big] = \mathbb{E}\big[(u \cdot x - y)^2\big]$.

This is because (xt, yt) and (x, y) come from the same distribution.

Observation 3. $\mathbb{E}\big[(w_t \cdot x_t - y_t)^2\big] = \mathbb{E}\big[(w_t \cdot x - y)^2\big]$.

This is because (xt, yt) and (x, y) come from the same distribution, and wt is independent
of both (xt, yt) and (x, y), since it was trained only on (x1, y1), . . . , (xt−1, yt−1) and thus
depends only on the randomness of the first t − 1 examples.
Putting these observations together, we have

$$\begin{aligned}
\mathbb{E}_S[R_v] &= \mathbb{E}\big[(v \cdot x - y)^2\big] && (19)\\
&\le \mathbb{E}\left[\frac{1}{m}\sum_{t=1}^{m} (w_t \cdot x - y)^2\right] && (20)\\
&= \frac{1}{m}\sum_{t=1}^{m} \mathbb{E}\big[(w_t \cdot x - y)^2\big] && (21)\\
&= \frac{1}{m}\sum_{t=1}^{m} \mathbb{E}\big[(w_t \cdot x_t - y_t)^2\big] && (22)\\
&= \frac{1}{m}\,\mathbb{E}\left[\sum_{t=1}^{m} (w_t \cdot x_t - y_t)^2\right] && (23)\\
&\le \frac{1}{m}\,\mathbb{E}\left[\frac{\sum_{t=1}^{m} (u \cdot x_t - y_t)^2}{1-\eta} + \frac{\|u\|_2^2}{\eta}\right] && (24)\\
&= \frac{1}{m} \cdot \frac{\sum_{t=1}^{m} \mathbb{E}\big[(u \cdot x_t - y_t)^2\big]}{1-\eta} + \frac{\|u\|_2^2}{\eta m} && (25)\\
&= \frac{1}{m} \cdot \frac{\sum_{t=1}^{m} \mathbb{E}\big[(u \cdot x - y)^2\big]}{1-\eta} + \frac{\|u\|_2^2}{\eta m} && (26)\\
&= \frac{R_u}{1-\eta} + \frac{\|u\|_2^2}{\eta m}, && (27)
\end{aligned}$$

with all the expectations taken over the random sample S and the random test point (x, y).
Equality (19) holds by definition of the risk of v. Inequality (20) is an application of
Observation 1. Equality (21) is by linearity of expectation. Equality (22) follows from
Observation 3. Equality (23) is by linearity of expectation. Inequality (24) holds because
the term inside the expectation is exactly the cumulative loss of Widrow-Hoff, to which we
can apply the bound from the first part of lecture. Equality (25) holds by linearity of
expectation and pulling out the constant terms. Equality (26) holds by Observation 2, and
Equality (27) is true by definition of the risk of u.
