
COS 511: Theoretical Machine Learning

Lecturer: Rob Schapire                                      Lecture #18
Scribe: Maryam Bahrani                                      April 11, 2018

In this lecture, we introduce the online linear regression problem and give an algorithm
for it due to Widrow and Hoff, along with its analysis. Finally, as part of a new
topic, we explore the relationship between online learning and batch learning, giving a
possible reduction from the latter to the former.

1 Online Linear Regression


The goal of online linear regression is to minimize the square loss of a linear function in an
online setting, according to the following framework:
• Initialize w1 = 0
• For each round t = 1, . . . , T :
– Get xt ∈ Rn
– Predict ŷt = wt · xt ∈ R
– Observe yt ∈ R
– Update wt .
We have the following notions of "loss" for this algorithm. The square loss in round t is
given by (ŷt − yt)². The cumulative loss of an algorithm A, denoted by LA, is the sum of the
losses in individual rounds, i.e. $L_A = \sum_{t=1}^{T} (\hat{y}_t - y_t)^2$. The loss of a specific weight vector
u is given by $L_u = \sum_{t=1}^{T} (u \cdot x_t - y_t)^2$. For an algorithm A to have good performance, we
would like it to satisfy an inequality of the form

$$L_A \le \min_u L_u + \text{(small number)}.$$

The vector u that minimizes Lu describes the best weight vector that we could have picked
if we knew all the data points offline. We can think of this smallest achievable loss as the
clairvoyant loss. This inequality ensures that the cumulative loss of our online algorithm
does not exceed the clairvoyant loss by more than a small amount.
One possible instantiation of this framework is the following algorithm:
• Initialize w1 = 0
• Choose parameter η > 0.
• For each round t = 1, . . . , T :
– Get xt ∈ Rn
– Predict ŷt = wt · xt ∈ R
– Observe yt ∈ R
– Update wt+1 = wt − η(wt · xt − yt )xt .
The changes from the template above are the choice of the parameter η > 0 and the explicit
update rule for wt+1. This algorithm is called Widrow-Hoff (WH) after its inventors, as
well as Least Mean Squares (LMS).
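To make the procedure concrete, here is a minimal sketch of the Widrow-Hoff loop in Python
with NumPy. The function name and interface are our own choices for illustration, not part
of the lecture:

```python
import numpy as np

def widrow_hoff(X, y, eta=0.1):
    """Run the Widrow-Hoff (LMS) updates over the rounds in X, y.

    X: (T, n) array whose rows are the inputs x_t.
    y: (T,) array of targets y_t.
    Returns the final weight vector and the cumulative square loss.
    """
    T, n = X.shape
    w = np.zeros(n)                          # initialize w_1 = 0
    cumulative_loss = 0.0
    for t in range(T):
        y_hat = w @ X[t]                     # predict y_hat_t = w_t . x_t
        cumulative_loss += (y_hat - y[t]) ** 2
        w = w - eta * (y_hat - y[t]) * X[t]  # w_{t+1} = w_t - eta (w_t.x_t - y_t) x_t
    return w, cumulative_loss
```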
1.1 Motivation for Update Function
Before analyzing this algorithm, we will motivate why the particular weight update function
wt+1 = wt − η(wt · xt − yt )xt was used. We provide two motivations.

Motivation 1. The first motivation is related to performing Gradient Descent on the loss
function. Remember that the loss of a weight vector w on one example (x, y) is given by
L(w, x, y) = (w · x − y)2 . At a minimum, we expect a good update rule to decrease the loss
on the most recently seen example (i.e. (xt , yt ) at round t). We know that the gradient of
a continuous, differentiable function points in the direction of fastest increase. Therefore,
it is a natural idea to decrease a function by taking a small step in the opposite direction
of the gradient. The gradient of the loss function is given by

$$\nabla_w L(w, x, y) = \begin{pmatrix} \partial L/\partial w_1 \\ \partial L/\partial w_2 \\ \vdots \\ \partial L/\partial w_n \end{pmatrix} = 2(w \cdot x - y)\,x,$$

where n is the number of dimensions.


A good update rule is thus

$$w_{t+1} = w_t - (\text{constant}) \cdot \nabla_w L(w_t, x_t, y_t) = w_t - (\text{constant}) \cdot (w_t \cdot x_t - y_t)\,x_t,$$

which matches the update rule of the algorithm, with the factor of 2 absorbed into the
constant η.
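As a quick sanity check on this gradient formula (our own illustration with arbitrary test
values, not from the lecture), we can compare it against a finite-difference estimate:

```python
import numpy as np

w = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
y = 1.0
loss = lambda v: (v @ x - y) ** 2        # L(w, x, y) = (w.x - y)^2

analytic = 2 * (w @ x - y) * x           # the gradient derived above

eps = 1e-6                               # central finite differences
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])

print(analytic, numeric)                 # the two should agree to ~1e-6
```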

Motivation 2. Ideally, we want the new weight vector wt+1 to

1. achieve smaller loss on the current example (xt , yt ) — so we want a small L(wt+1 , xt , yt );

2. stay close to wt , since wt embodies all the training examples we have seen so far and
we don’t want to throw away all the progress we have made — so we want a small
‖wt+1 − wt‖₂².

Since we want to minimize the above two things simultaneously, it is natural to minimize
their weighted sum

$$\eta\,(w_{t+1} \cdot x_t - y_t)^2 + \|w_{t+1} - w_t\|_2^2. \tag{1}$$

Solving the above optimization for wt+1, we get

$$w_{t+1} = w_t - \eta\,(w_{t+1} \cdot x_t - y_t)\,x_t.$$

This is almost the same as the update rule in the algorithm, except that on the right-hand
side we have wt+1 instead of wt. However, since we want wt+1 and wt to be close to
each other anyway, we can use wt as an approximation for wt+1 on the right-hand side,
giving us exactly the update rule used in the algorithm.
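For completeness, here is the omitted calculus step (a routine derivation we spell out; it
is not in the original notes): setting the gradient of expression (1) with respect to wt+1
to zero gives

$$2\eta\,(w_{t+1} \cdot x_t - y_t)\,x_t + 2\,(w_{t+1} - w_t) = 0,$$

which rearranges to the implicit update $w_{t+1} = w_t - \eta\,(w_{t+1} \cdot x_t - y_t)\,x_t$ above.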

1.2 Analysis
In this section, we provide an analysis for the online linear regression algorithm described
above.

Theorem 1. If ‖xt‖₂ ≤ 1 for all rounds t, then the following bound holds for LWH, the
cumulative loss of the Widrow-Hoff algorithm:

$$L_{WH} \le \min_{u \in \mathbb{R}^n} \left( \frac{L_u}{1-\eta} + \frac{\|u\|_2^2}{\eta} \right).$$

Before going into the proof, we list a few remarks.

• The dependence on the length of u is needed because it is sometimes possible to make
Lu arbitrarily small by making u longer, which is not desirable. An alternative way of
addressing the dependence on length is to take the minimum over all vectors u with
bounded length.

• One can roughly think of u as the best vector that could have been chosen even if
all the data points were known in advance (except there is an additional term that
depends on the length of u so this analogy is not completely accurate).

• If we show the bound is true for all u, it follows that it must be true for the “best”
u (i.e. the one minimizing the right-hand side). Fix some u ∈ Rn , and divide both
sides of the inequality by T to get

$$\frac{L_{WH}}{T} \le \frac{L_u}{T} \cdot \frac{1}{1-\eta} + \frac{\|u\|_2^2}{\eta T}.$$

For sufficiently small η, 1/(1−η) is close to 1. Furthermore, as T → ∞, ‖u‖₂²/(ηT) → 0. It
follows that as T → ∞, the rate of loss of Widrow-Hoff (LWH/T) approaches the rate
of loss of the "best" vector u (Lu/T).

Proof. The proof uses a potential function argument. We need to establish some notation
before giving the proof.

• As we argued above, it is enough to show the bound holds for all vectors u to conclude
that the bound holds for the best such vector as well. For the rest of the proof, we
fix some u ∈ Rn .

• Define Φt = ‖wt − u‖₂² to be the potential at round t. The closer wt and u, the lower
the potential.

• Let ℓt = wt · xt − yt = ŷt − yt. With this notation, ℓt² denotes the loss of Widrow-Hoff
at round t.

• Let gt = u · xt − yt. Similarly, gt² denotes the loss of the weight vector u at round t.

• Let ∆t = η(ŷt − yt)xt = ηℓt xt. It follows that wt+1 = wt − ∆t, so ∆t measures the
change in the weight vector from round t to round t + 1.

Equipped with the notation above, we can make the following claim:

Claim. $\Phi_{t+1} - \Phi_t \le -\eta\,\ell_t^2 + \frac{\eta}{1-\eta}\,g_t^2.$
Roughly speaking, the first term −ηℓt² corresponds to the loss of the Widrow-Hoff learner:
as the learner suffers loss, the potential goes down. The second term (η/(1−η))gt² corresponds
to the loss of u: as u suffers loss, the potential goes up. The potential can thus be thought
of as a budget of how much loss the learner is allowed to suffer without falling too far
behind u.

Proof of Claim. We have

$$\begin{aligned}
\Phi_{t+1} - \Phi_t &= \|w_{t+1} - u\|_2^2 - \|w_t - u\|_2^2 && (2)\\
&= \|w_t - u - \Delta_t\|_2^2 - \|w_t - u\|_2^2 && (3)\\
&= \|w_t - u\|_2^2 - 2(w_t - u) \cdot \Delta_t + \|\Delta_t\|_2^2 - \|w_t - u\|_2^2 && (4)\\
&= -2(w_t - u) \cdot \Delta_t + \|\Delta_t\|_2^2 && (5)\\
&= -2\eta\ell_t(w_t \cdot x_t - u \cdot x_t) + \eta^2 \ell_t^2 \|x_t\|_2^2 && (6)\\
&\le -2\eta\ell_t(w_t \cdot x_t - y_t + y_t - u \cdot x_t) + \eta^2 \ell_t^2 && (7)\\
&= -2\eta\ell_t(\ell_t - g_t) + \eta^2 \ell_t^2 && (8)\\
&= \eta^2 \ell_t^2 - 2\eta\ell_t^2 + 2\eta\ell_t g_t && (9)\\
&\le \eta^2 \ell_t^2 - 2\eta\ell_t^2 + \eta\left(\frac{g_t^2}{1-\eta} + \ell_t^2(1-\eta)\right) && (10)\\
&= -\eta\,\ell_t^2 + \frac{\eta}{1-\eta}\,g_t^2. && (11)
\end{aligned}$$

Equality (2) follows from the definition of Φt. Equality (3) holds by definition of ∆t.
Equality (4) is obtained by expanding the first squared term. Equality (5) is due to the
cancellation of the first and last terms. Equality (6) is obtained by replacing ∆t with
ηℓt xt and multiplying out the first product. In Inequality (7), we have added and
subtracted yt in the first term, and used the assumption that ‖xt‖₂ ≤ 1. In Equality (8),
we have used the definitions of ℓt and gt. Equality (9) is simple algebra. Inequality (10)
uses the following trick: ab ≤ (a² + b²)/2 for all a, b, so it specifically holds for
a = gt/√(1−η) and b = ℓt·√(1−η). Lastly, Equality (11) is a result of a few simple
cancellations, and gives us exactly the desired bound in the claim.

The Theorem follows from the claim because

$$\begin{aligned}
-\|u\|_2^2 &= -\Phi_1 && (12)\\
&\le \Phi_{T+1} - \Phi_1 && (13)\\
&= \Phi_{T+1} - \Phi_T + \Phi_T - \Phi_{T-1} + \Phi_{T-1} - \dots + \Phi_2 - \Phi_1 && (14)\\
&= \sum_{t=1}^{T} (\Phi_{t+1} - \Phi_t) && (15)\\
&\le \sum_{t=1}^{T} \left( -\eta\,\ell_t^2 + \frac{\eta}{1-\eta}\,g_t^2 \right) && (16)\\
&= -\eta \sum_{t=1}^{T} \ell_t^2 + \frac{\eta}{1-\eta} \sum_{t=1}^{T} g_t^2 && (17)\\
&= -\eta\,L_{WH} + \frac{\eta}{1-\eta}\,L_u. && (18)
\end{aligned}$$
Equality (12) holds because w1 = 0 and therefore Φ1 = ‖w1 − u‖₂² = ‖0 − u‖₂² = ‖u‖₂².
Inequality (13) holds because ΦT+1 is a squared norm and therefore non-negative. Equality (14)
holds because each new term is added once and subtracted once. Inequality (16) holds by the
claim shown above. Equality (17) is obtained by distributing the sum. Equality (18) follows
from the observation that the cumulative loss of Widrow-Hoff (respectively, of u) is the sum
of the losses at every round, i.e. $L_{WH} = \sum_{t=1}^{T} \ell_t^2$ and $L_u = \sum_{t=1}^{T} g_t^2$.
Solving for LW H , the above inequality gives exactly the statement of the theorem.
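As an aside, the bound of Theorem 1 is easy to check numerically. The following
self-contained sketch (our own construction, on an arbitrary synthetic data distribution)
runs the Widrow-Hoff loop and compares its cumulative loss against the right-hand side of
the theorem; the minimizing u has a closed form because minimizing Lu/(1−η) + ‖u‖₂²/η over
u is a ridge-regression problem:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, eta = 500, 5, 0.1

# synthetic rounds with ||x_t||_2 <= 1, as Theorem 1 requires
X = rng.normal(size=(T, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=T)

# cumulative loss of Widrow-Hoff (same loop as in Section 1)
w, L_wh = np.zeros(n), 0.0
for t in range(T):
    ell = w @ X[t] - y[t]
    L_wh += ell ** 2
    w = w - eta * ell * X[t]

# minimizer of ||Xu - y||^2 / (1 - eta) + ||u||^2 / eta, via ridge normal equations
u = np.linalg.solve(X.T @ X + ((1 - eta) / eta) * np.eye(n), X.T @ y)
bound = np.sum((X @ u - y) ** 2) / (1 - eta) + (u @ u) / eta

print(L_wh, bound)   # Theorem 1 guarantees L_wh <= bound
```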

1.3 Generalizing the Motivation

As explained in Motivation 2, the goal of the online regression update was to minimize

η · (loss of wt+1 on (xt, yt)) + ("distance" between wt+1 and wt).

In particular, we chose the square loss function and the squared Euclidean distance. However,
there are many different notions of loss and distance that can be used in the above measure.
For example, for an arbitrary loss function L(w, x, y) and the Euclidean norm squared
as a measure of distance, we get the following update rule:

$$w_{t+1} = w_t - \eta \nabla_w L(w_t, x_t, y_t).$$
Note that this update rule is simply performing gradient descent on the loss function.
Another possibility is an arbitrary loss function L(w, x, y) along with relative entropy as
a measure of distance. More formally, we can restrict the w’s to be probability distributions,
and try to minimize RE(wt ||wt+1 ) in every step. The weight update function then becomes
$$w_{t+1,i} = \frac{w_{t,i} \cdot \exp\!\left(-\eta\, \frac{\partial L}{\partial w_i}(w_t, x_t, y_t)\right)}{Z_t}$$

where Zt is a normalization factor to make sure wt+1 is a probability distribution. This
update rule is referred to as Exponentiated Gradient (EG).
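To make the multiplicative update concrete, here is a minimal sketch of one EG step in
Python, instantiated with the square loss (the lecture leaves the loss generic; the function
name and example values are ours):

```python
import numpy as np

def eg_step(w, x, y, eta=0.1):
    """One Exponentiated Gradient update for the square loss (w.x - y)^2.

    w is a probability vector (nonnegative entries summing to 1) and stays one.
    """
    grad = 2 * (w @ x - y) * x          # dL/dw_i for the square loss
    w_new = w * np.exp(-eta * grad)     # multiplicative update
    return w_new / w_new.sum()          # divide by Z_t to renormalize

# example: start from the uniform distribution over 3 coordinates
w = np.ones(3) / 3
w = eg_step(w, x=np.array([1.0, 0.5, -0.2]), y=0.3)
```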
Note that in the case of Euclidean norm squared as a distance measure, the update func-
tion is additive, whereas in the case of Relative Entropy, the update rule is multiplicative.
The following table summarizes the update rules we have seen so far.

Additive                          Multiplicative
Support Vector Machine (SVM)      AdaBoost
Perceptron                        Winnow / Weighted Majority Algorithm (WMA)
Gradient Descent (GD)             Exponentiated Gradient (EG)

2 Relating Batch Learning and Online Learning


So far in the class, we have discussed the following two learning models:

• Batch learning, including the PAC model, where we are given a set of random
examples offline, and our goal is to minimize the generalization error.

• Online Learning, where we are given a stream of possibly adversarial examples, and
our goal is to minimize cumulative loss (so in a sense training and testing are mixed
together).

It is natural to ask whether these two models are related. Intuitively, the online setting is
stronger, since it requires no randomness assumption. So we can ask whether batch learning
can be reduced to online learning, i.e. given an online learning algorithm, could we use it
to learn in the batch setting? Besides theoretical interest, this reduction turns out to have
practical applications, as online algorithms are often used for offline learning tasks.
More formally, we get a set of examples S = ⟨(x1, y1), (x2, y2), . . . , (xm, ym)⟩ drawn i.i.d.
from a distribution D as training data. We then get a test example (x, y). Define the risk
(expected loss) of a vector v to be $R_v = \mathbb{E}_{(x,y)\sim D}\big[(v \cdot x - y)^2\big]$. The goal is to use an online
learning algorithm to find v with small risk.
It turns out that Widrow-Hoff and its analysis yield both an efficient algorithm for this
problem, and an immediate means of bounding the risk of the vector that it finds. In
particular, we propose the following algorithm for solving this problem:

• Run Widrow-Hoff for T = m rounds on S = ⟨(x1, y1), (x2, y2), . . . , (xm, ym)⟩, in the
random order in which the examples were given to us.

• Widrow-Hoff produces a sequence of weight vectors w1, w2, . . . , wm.

• We output the average of these vectors, $v = \frac{1}{m}\sum_{i=1}^{m} w_i$ (instead of the last one).
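A minimal sketch of this reduction in Python (the function name and interface are our own
for illustration):

```python
import numpy as np

def batch_from_online(X, y, eta=0.1):
    """Online-to-batch reduction: run Widrow-Hoff once over the sample,
    in the given (random) order, and return the average of the iterates."""
    m, n = X.shape
    w = np.zeros(n)
    total = np.zeros(n)
    for t in range(m):
        total += w                          # accumulate w_1, ..., w_m
        w = w - eta * (w @ X[t] - y[t]) * X[t]
    return total / m                        # v = (1/m) * sum_i w_i
```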

The following theorem shows that the expected risk of the output vector is low. One can
also show that the risk is low with high probability, but we do not do so in today’s lecture.

Theorem 2.

$$\mathbb{E}_S[R_v] \le \min_u \left( \frac{R_u}{1-\eta} + \frac{\|u\|_2^2}{\eta m} \right),$$

where ES[·] denotes expectation with respect to the random choice of the sample S.

Proof. Fix any vector u ∈ Rn , and let (x, y) be a random test point drawn from the
distribution D. We write E[·], with no subscript, to denote expectation with respect to
both the random sample S and the random test point (x, y).
The following three observations will be needed:

Observation 1. $(v \cdot x - y)^2 \le \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2$.

This is due to Jensen's inequality, since f(z) = z² is a convex function:

$$(v \cdot x - y)^2 = \left( \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y) \right)^2 \le \frac{1}{m} \sum_{t=1}^{m} (w_t \cdot x - y)^2.$$

Observation 2. $\mathbb{E}\big[(u \cdot x_t - y_t)^2\big] = \mathbb{E}\big[(u \cdot x - y)^2\big]$.

This is because (xt, yt) and (x, y) come from the same distribution.

Observation 3. $\mathbb{E}\big[(w_t \cdot x_t - y_t)^2\big] = \mathbb{E}\big[(w_t \cdot x - y)^2\big]$.

This is because (xt, yt) and (x, y) come from the same distribution, and wt is independent
of both (xt, yt) and (x, y), since it was trained only on (x1, y1), . . . , (xt−1, yt−1) and thus
depends only on the randomness of the first t − 1 examples.
Putting these observations together, we have

$$\begin{aligned}
\mathbb{E}_S[R_v] &= \mathbb{E}\big[(v \cdot x - y)^2\big] && (19)\\
&\le \mathbb{E}\left[\frac{1}{m}\sum_{t=1}^{m} (w_t \cdot x - y)^2\right] && (20)\\
&= \frac{1}{m}\sum_{t=1}^{m} \mathbb{E}\big[(w_t \cdot x - y)^2\big] && (21)\\
&= \frac{1}{m}\sum_{t=1}^{m} \mathbb{E}\big[(w_t \cdot x_t - y_t)^2\big] && (22)\\
&= \frac{1}{m}\,\mathbb{E}\left[\sum_{t=1}^{m} (w_t \cdot x_t - y_t)^2\right] && (23)\\
&\le \frac{1}{m}\,\mathbb{E}\left[\frac{\sum_{t=1}^{m} (u \cdot x_t - y_t)^2}{1-\eta} + \frac{\|u\|_2^2}{\eta}\right] && (24)\\
&= \frac{1}{m} \cdot \frac{\sum_{t=1}^{m} \mathbb{E}\big[(u \cdot x_t - y_t)^2\big]}{1-\eta} + \frac{\|u\|_2^2}{\eta m} && (25)\\
&= \frac{1}{m} \cdot \frac{\sum_{t=1}^{m} \mathbb{E}\big[(u \cdot x - y)^2\big]}{1-\eta} + \frac{\|u\|_2^2}{\eta m} && (26)\\
&= \frac{R_u}{1-\eta} + \frac{\|u\|_2^2}{\eta m}, && (27)
\end{aligned}$$

with all the expectations taken over the random sample S and the random test point (x, y).
Equality (19) holds by definition of the risk of v. Inequality (20) is an application of
Observation 1. Equality (21) is by linearity of expectation. Equality (22) follows from
Observation 3. Equality (23) is by linearity of expectation. Inequality (24) holds because
the term inside the expectation is exactly the cumulative loss of Widrow-Hoff, to which we
can apply the bound from the first part of lecture. Equality (25) holds by linearity of
expectation and pulling out the constant terms. Equality (26) holds by Observation 2, and
Equality (27) is true by definition of the risk of u.
