
MA 324, lecture 1

Yohann Tendero
[email protected]

Institut Polytechnique des Sciences Avancées


§ Contact: [email protected]
§ Exam on 16/05/2023 + project
Foreword: Least squares

§ Least squares is a class of methods that allows us to determine
the ”best” (in a sense that we shall make precise soon) model
that fits some given data.
§ We shall start by studying linear least squares (sometimes
called linear least squares regression, or ordinary least squares).
§ Before we start, let us point out an important difference between
least squares and interpolation.
Difference between interpolation and approximation (1)

§ Let us first look at some simple examples of least squares fits.

The green dots represent the given data. The line and the curve are
computed models that ”best” fit the data. Note that the curve and the
line do not necessarily pass through the given data.
Difference between interpolation and approximation (2)
§ For the sake of completeness, let us mention the difference
between least squares and interpolation.

Interpolation: the curve must pass through the given data.


Why should we study least squares ?

§ It captures many of the concepts of learning theory AND it
uses optimization tools to be computed AND it is one of the
most basic tools that an engineer should know!
§ To be slightly more precise, the problem consists in computing
the ”best” values b_1, . . . , b_k such that the linear combination

b_1 f_1(x) + \cdots + b_k f_k(x)    (1)

is as close as possible to given data points (x_1, y_1), . . . , (x_n, y_n).
§ We shall start with the simplest example where k = 2,
f_1(x) = 1 and f_2(x) = x.
Sample mean, variance, covariance
Given a finite sequence of data (or samples) X = (x_1, . . . , x_n) ∈ R^n:
§ We call mean (or expected value) of x_1, . . . , x_n the quantity

x̄ := \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i    (2)

The mean is just the average of the data.


§ We call variance of x_1, . . . , x_n the quantity

σ_X^2 := \frac{1}{n}\sum_{i=1}^n (x_i - x̄)^2    (3)

§ If we consider, in addition, the finite sequence
Y = (y_1, . . . , y_n) ∈ R^n, we can define the covariance, namely

C_{XY} := \frac{1}{n}\sum_{i=1}^n (x_i - x̄)(y_i - ȳ)    (4)
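
As a small illustration, here is a minimal Python sketch of the three quantities
(2), (3) and (4) with the 1/n normalization used above. NumPy is assumed, and the
arrays x and y are made-up sample values, not data from the lecture:

    import numpy as np

    # Made-up sample data, for illustration only.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

    x_bar = x.mean()                                # sample mean, equation (2)
    var_x = np.mean((x - x_bar)**2)                 # sample variance, equation (3)
    c_xy  = np.mean((x - x_bar) * (y - y.mean()))   # sample covariance, equation (4)

    print(x_bar, var_x, c_xy)

Note that np.var(x) would give the same value as var_x here, since NumPy also uses
the 1/n convention by default.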
Simplest model

§ We saw at the beginning that we are looking for a model that
best fits given data. Let us start with the simplest example
where f_1(x) = 1 and f_2(x) = x.
§ We are given the observed data (x_1, y_1), . . . , (x_n, y_n), where
x_i, y_i ∈ R for i = 1, . . . , n.
§ We look for a model b_1 f_1(x) + b_2 f_2(x) = b_1 + b_2 x. This just
means that we look for the coefficients (or parameters)
b_1, b_2 ∈ R.
§ IF a model b_1 f_1(x) + b_2 f_2(x) were perfect then we would have

b_1 f_1(x_1) + b_2 f_2(x_1) = y_1,
b_1 f_1(x_2) + b_2 f_2(x_2) = y_2,
. . . ,
b_1 f_1(x_n) + b_2 f_2(x_n) = y_n.
Simplest model (2)
§ In practice it is almost impossible to expect a perfect equality
(because measurements may be imperfect or the model itself is not
perfect).
Thus we just look for a model so that

b_1 f_1(x_1) + b_2 f_2(x_1) - y_1,
. . . ,
b_1 f_1(x_n) + b_2 f_2(x_n) - y_n

are small. Indeed, each quantity b_1 f_1(x_i) + b_2 f_2(x_i) - y_i for
i = 1, . . . , n can be thought of as an error (and the error is 0 iff
b_1 f_1(x_i) + b_2 f_2(x_i) = y_i).
§ The simplest^1 way to do so is to use a quadratic norm.
§ From now on we shall measure the error of a given model and
samples by summing the squares of all the above error terms,
namely

(b_1 f_1(x_1) + b_2 f_2(x_1) - y_1)^2 + \cdots + (b_1 f_1(x_n) + b_2 f_2(x_n) - y_n)^2

^1 It is easier to optimize than other choices.
Partial conclusion
To sum up: we look for a model so that the sum of the
quadratic error terms is small:

(b_1 f_1(x_1) + b_2 f_2(x_1) - y_1)^2 + \cdots + (b_1 f_1(x_n) + b_2 f_2(x_n) - y_n)^2    (5)
= \sum_{i=1}^n (b_1 f_1(x_i) + b_2 f_2(x_i) - y_i)^2    (6)

With this framework, model 1 is better than model 2 iff the error
of model 1 is smaller than the error of model 2 given the observed
data.
It seems rather intuitive to say that the ”best” of all possible
models is one that minimizes the above error.
Remark: for a reason that will become clearer later on we add a 1/n
multiplicative factor and therefore define the error^2 by

\frac{1}{n}\sum_{i=1}^n (b_1 f_1(x_i) + b_2 f_2(x_i) - y_i)^2

^2 In the machine learning community the above error is called the empirical risk.
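
The empirical risk above is straightforward to evaluate numerically. Below is a
minimal Python sketch for the simplest model f_1(x) = 1, f_2(x) = x; NumPy is
assumed, and both the data arrays and the two candidate parameter pairs are
made-up values used only to illustrate the "smaller error = better model" rule:

    import numpy as np

    def empirical_risk(b1, b2, x, y):
        """Mean squared error (1/n) * sum_i (b1 + b2*x_i - y_i)^2."""
        residuals = b1 + b2 * x - y
        return np.mean(residuals**2)

    # Made-up data and two candidate models, for illustration only.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])
    print(empirical_risk(1.0, 2.0, x, y))   # error of model 1
    print(empirical_risk(0.5, 2.2, x, y))   # error of model 2; the smaller error wins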
Optimizing models (1)
Given the observed data (x_1, y_1), . . . , (x_n, y_n), where x_i, y_i ∈ R for
i = 1, . . . , n, we define the error as

E(b_1, b_2) := \frac{1}{n}\sum_{i=1}^n (y_i - (b_1 + b_2 x_i))^2.    (7)

Indeed, for given data the error only depends on the model
parameters b_1 and b_2.
We look for b_1, b_2 ∈ R such that E(b_1, b_2) is as small as possible.
To find these values we may start by taking the derivatives

\frac{\partial E}{\partial b_1}(b_1, b_2) = \frac{1}{n}\sum_{i=1}^n (y_i - (b_1 + b_2 x_i))(-2)    (8)

\frac{\partial E}{\partial b_2}(b_1, b_2) = \frac{1}{n}\sum_{i=1}^n (y_i - (b_1 + b_2 x_i))(-2 x_i)    (9)

(Be careful here: the function E has variables b_1 and b_2, not x!)
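
As a quick sanity check of formulas (8) and (9), one can compare them against
central finite differences. This is only a sketch: NumPy is assumed, and the data
and the evaluation point (b_1, b_2) are made-up values:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

    def E(b1, b2):
        return np.mean((y - (b1 + b2 * x))**2)

    def grad_E(b1, b2):
        # Analytic partial derivatives, equations (8) and (9).
        r = y - (b1 + b2 * x)
        return np.mean(-2 * r), np.mean(-2 * r * x)

    b1, b2, h = 0.5, 2.0, 1e-6
    fd1 = (E(b1 + h, b2) - E(b1 - h, b2)) / (2 * h)   # finite difference in b1
    fd2 = (E(b1, b2 + h) - E(b1, b2 - h)) / (2 * h)   # finite difference in b2
    print(grad_E(b1, b2), (fd1, fd2))                 # the two pairs should nearly agree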
Optimizing models (2)
We set these derivatives to zero at the optimum, say β_1 and β_2.
Dropping the -2 multiplicative factor, we obtain

0 = \frac{1}{n}\sum_{i=1}^n (y_i - (β_1 + β_2 x_i))    (11)

0 = \frac{1}{n}\sum_{i=1}^n (y_i - (β_1 + β_2 x_i)) x_i    (12)

These equations are often called the normal equations for least
squares estimation. We recall that the unknowns are β_1 and β_2.
(Many people would remove the 1/n factor here. Yet, I think it is
easier to understand if we keep it.)
Indeed, we see that the normal equations rewrite as

ȳ - β_1 - β_2 x̄ = 0    (14)
\overline{xy} - β_1 x̄ - β_2 \overline{x^2} = 0    (15)
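
Numerically, (14)-(15) are just a 2x2 linear system in (β_1, β_2). A minimal sketch
that builds and solves this system directly, assuming NumPy and the same made-up
data as before:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

    # Normal equations (14)-(15):  [[1, x̄], [x̄, mean(x^2)]] @ [β1, β2] = [ȳ, mean(x*y)]
    A = np.array([[1.0,      x.mean()],
                  [x.mean(), np.mean(x**2)]])
    b = np.array([y.mean(), np.mean(x * y)])

    beta1, beta2 = np.linalg.solve(A, b)
    print(beta1, beta2)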
Optimizing models (3)

Hence, from

ȳ - β_1 - β_2 x̄ = 0    (16)
\overline{xy} - β_1 x̄ - β_2 \overline{x^2} = 0    (17)

you will prove in the exercises that the ”best” model, i.e., the one
that minimizes the error, is β_1 + β_2 x, with slope and intercept
given by

β_2 = \frac{C_{XY}}{σ_X^2}    (18)
β_1 = ȳ - β_2 x̄    (19)
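
Putting formulas (18)-(19) together gives a complete least squares fit in a few
lines. A minimal Python sketch (NumPy assumed; the data are made-up values used
only for illustration):

    import numpy as np

    def fit_line(x, y):
        """Least squares fit of y ≈ β1 + β2*x via equations (18)-(19)."""
        x_bar, y_bar = x.mean(), y.mean()
        c_xy  = np.mean((x - x_bar) * (y - y_bar))   # C_XY, equation (4)
        var_x = np.mean((x - x_bar)**2)              # σ_X^2, equation (3)
        beta2 = c_xy / var_x                         # slope, equation (18)
        beta1 = y_bar - beta2 * x_bar                # intercept, equation (19)
        return beta1, beta2

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])
    beta1, beta2 = fit_line(x, y)
    print(beta1, beta2)            # fitted intercept and slope
    print(beta1 + beta2 * 5.0)     # prediction at an unobserved x = 5.0

The result agrees with the normal-equation solve above, and the last line
illustrates the prediction use case discussed next: once β_1 and β_2 are
computed, the model can be evaluated at any desired x.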
Partial conclusion (2)

§ From now on, in this simplest case, you will be able to compute
models that are optimal in the least squares sense using the
formulas above.
§ Depending on the application, given this model you can make
predictions. Indeed, if the model gives the output value of
some device then you can predict the output for any x (and
therefore for the unobserved ones, for instance) by computing
β_1 + β_2 x for the desired x.
§ Conversely, in some cases you may be interested in the
parameters of the model itself. This is model estimation.
§ (Remark) Unfortunately, in more general cases the
optimization of E is more difficult and we will need to rely on
algorithms.
Model estimation

§ Suppose that β_1 and β_2 represent some physical parameters of
a device. We would like to know whether the estimates given by
the above method become ”better and better” as the number
of observed samples increases.
In other words, can we expect that more data yields a
better model?
§ To answer the above, we shall introduce the following
framework.
§ We suppose that we observe

y_i = β_1^* + β_2^* x_i + ε_i  for i = 1, . . . , n,    (20)

where ε_i is a noise variable. Using matrices (as we shall do
later on) this can be written as Y = β_1^* + β_2^* X + ε.
§ We assume that the noise variable ε_i has zero mean, a
constant variance σ^2, is independent of X and independent
across observations i.
§ The slightly more sophisticated model given in (20) may
represent measurement error, for instance (a small simulation
sketch of this data model is given below, after the derivation).
§ It seems rather impossible that the least squares estimates β_1
and β_2 coincide with the true values β_1^* and β_2^*.
§ Yet, under the above assumptions we shall justify that
β_2 → β_2^* when the number of observed samples n → +∞.
§ We recall that β_2 = \frac{C_{XY}}{σ_X^2} and
§ C_{XY} := \frac{1}{n}\sum_{i=1}^n (x_i - x̄)(y_i - ȳ) = \frac{1}{n}\sum_{i=1}^n x_i y_i - x̄ ȳ (proved in the tutorial)
§ Since y_i = β_1^* + β_2^* x_i + ε_i we obtain that

C_{XY} = \frac{1}{n}\sum_{i=1}^n x_i (β_1^* + β_2^* x_i + ε_i) - x̄ (β_1^* + β_2^* x̄ + ε̄)    (21)
       = β_1^* x̄ + β_2^* \overline{x^2} + \overline{xε} - x̄ β_1^* - β_2^* x̄^2 - x̄ ε̄    (22)

In addition, we have that^3

σ_X^2 = \frac{1}{n}\sum_{i=1}^n (x_i - x̄)^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - x̄^2

and therefore obtain that

C_{XY} = β_2^* σ_X^2 + \overline{xε} - x̄ ε̄

^3 Also proved in the tutorial.
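
The identity C_{XY} = β_2^* σ_X^2 + \overline{xε} - x̄ ε̄ can be checked numerically
under the data model (20). This is only a sketch: NumPy is assumed, and the true
parameters, the uniform x's and the Gaussian noise are arbitrary made-up choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    beta1_star, beta2_star = 1.5, -0.7          # made-up "true" parameters
    x   = rng.uniform(0.0, 10.0, size=n)
    eps = rng.normal(0.0, 1.0, size=n)          # zero-mean noise, independent of x
    y   = beta1_star + beta2_star * x + eps     # observation model (20)

    c_xy  = np.mean((x - x.mean()) * (y - y.mean()))
    var_x = np.mean((x - x.mean())**2)
    rhs   = beta2_star * var_x + np.mean(x * eps) - x.mean() * eps.mean()
    print(c_xy, rhs)   # the two numbers coincide up to rounding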
Is β2 consistent ?

§ Since the noise has zero mean by assumption, the law of large
numbers implies that ε̄ → 0 and therefore that ε̄ x̄ → 0.
§ Since the noise and x are independent^4, using the law of large
numbers we also deduce that \overline{xε} → 0.
§ Thus, C_{XY} → β_2^* σ_X^2.
§ We recall that the estimate is β_2 := \frac{C_{XY}}{σ_X^2} → \frac{β_2^* σ_X^2}{σ_X^2} = β_2^*.
§ This means that indeed, at least in the simplest case above,
the estimated parameter converges to the true value. When
an estimator converges to the truth like that we say that the
estimator is consistent. (This is the least we can ask of an
estimator.)

^4 For independent variables X and Y we do have E(XY) = E(X) E(Y).
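
To illustrate the consistency claim, here is a minimal simulation sketch that
recomputes the least squares slope for increasing sample sizes; the estimate
should approach β_2^* as n grows. NumPy is assumed and the true parameters are
made-up values:

    import numpy as np

    rng = np.random.default_rng(1)
    beta1_star, beta2_star = 1.5, -0.7     # made-up "true" parameters

    for n in (10, 100, 1_000, 10_000, 100_000):
        x = rng.uniform(0.0, 10.0, size=n)
        y = beta1_star + beta2_star * x + rng.normal(0.0, 1.0, size=n)
        c_xy  = np.mean((x - x.mean()) * (y - y.mean()))
        var_x = np.mean((x - x.mean())**2)
        beta2 = c_xy / var_x               # estimate, equation (18)
        print(n, beta2)                    # drifts toward beta2_star as n grows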
Is β1 consistent ?

§ Done in the tutorial!
