MA 324, Lecture 1
Yohann Tendero
[email protected]
The green dots represent the given data. The line and the curve are
computed models that "best" fit the data. Note that the curve and the
line do not necessarily pass through the given data.
Difference between interpolation and approximation (2)
§ For the sake of completeness, let us mention the difference
between least squares and interpolation
With this framework, model 1 is better than model 2 iff the error
of model 1 is smaller than the error of model 2 given the observed
data.
It seems rather intuitive to say that the "best" of all possible
models is one that minimizes the above error.
Remark: for a reason that will become clearer later on we add a $\frac{1}{n}$
multiplicative factor and therefore define the error² by
$$
\frac{1}{n}\sum_{i=1}^{n} \bigl( b_1 f_1(x_i) + b_2 f_2(x_i) - y_i \bigr)^2
$$
² In the machine learning community the above error is called the empirical risk.
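As an illustration (not from the original slides), here is a minimal NumPy sketch of this empirical risk, assuming hypothetical data and the basis functions $f_1(x) = 1$, $f_2(x) = x$:

```python
import numpy as np

def empirical_risk(b1, b2, f1, f2, x, y):
    """(1/n) * sum_i (b1*f1(x_i) + b2*f2(x_i) - y_i)^2."""
    residuals = b1 * f1(x) + b2 * f2(x) - y
    return np.mean(residuals ** 2)

# Hypothetical data; f1(x) = 1 (intercept term), f2(x) = x (slope term).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(empirical_risk(0.0, 1.0, np.ones_like, lambda t: t, x, y))
```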
Optimizing models (1)
Given the observed data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i, y_i \in \mathbb{R}$ for
$i = 1, \ldots, n$, we define the error as
$$
E(b_1, b_2) := \frac{1}{n}\sum_{i=1}^{n} \bigl( y_i - (b_1 + b_2 x_i) \bigr)^2. \tag{7}
$$
Indeed, for given data the error only depends on the model
parameters $b_1$ and $b_2$.
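A minimal sketch of the error (7) on hypothetical data; once the data are fixed, $E$ is a function of $(b_1, b_2)$ only and two candidate models can be compared through their errors:

```python
import numpy as np

def E(b1, b2, x, y):
    """Error (7): mean squared residual of the affine model b1 + b2*x."""
    return np.mean((y - (b1 + b2 * x)) ** 2)

# Hypothetical data; for fixed (x, y) the error depends only on (b1, b2).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 1.1, 1.9, 3.2])
print(E(0.0, 1.0, x, y), E(1.0, 0.5, x, y))  # the smaller error wins
```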
We look for $b_1, b_2 \in \mathbb{R}$ such that $E(b_1, b_2)$ is as small as possible.
To find these values we may start by taking the derivatives
$$
\frac{\partial E}{\partial b_1}(b_1, b_2) = \frac{1}{n}\sum_{i=1}^{n} \bigl( y_i - (b_1 + b_2 x_i) \bigr)(-2) \tag{8}
$$
$$
\frac{\partial E}{\partial b_2}(b_1, b_2) = \frac{1}{n}\sum_{i=1}^{n} \bigl( y_i - (b_1 + b_2 x_i) \bigr)(-2 x_i) \tag{9}
$$
(Be careful here: the function $E$ has variables $b_1$ and $b_2$, not $x$!)
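As a sanity check (not part of the slides), the analytic partial derivatives (8)-(9) can be compared against central finite differences on hypothetical data:

```python
import numpy as np

def grad_E(b1, b2, x, y):
    """Analytic partial derivatives (8)-(9) of E with respect to b1 and b2."""
    r = y - (b1 + b2 * x)
    return np.array([np.mean(-2.0 * r), np.mean(-2.0 * r * x)])

def grad_E_numeric(b1, b2, x, y, h=1e-6):
    """Central finite differences, as a check of the analytic formulas."""
    E = lambda a, b: np.mean((y - (a + b * x)) ** 2)
    return np.array([(E(b1 + h, b2) - E(b1 - h, b2)) / (2 * h),
                     (E(b1, b2 + h) - E(b1, b2 - h)) / (2 * h)])

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 1.1, 1.9, 3.2])
print(grad_E(0.5, 0.8, x, y), grad_E_numeric(0.5, 0.8, x, y))  # nearly equal
```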
Optimizing models (2)
We set these derivatives to zero at the optimum, say $\beta_1$ and $\beta_2$.
Dropping the $-2$ multiplicative factor, we obtain
$$
0 = \frac{1}{n}\sum_{i=1}^{n} \bigl( y_i - (\beta_1 + \beta_2 x_i) \bigr) \tag{11}
$$
$$
0 = \frac{1}{n}\sum_{i=1}^{n} \bigl( y_i - (\beta_1 + \beta_2 x_i) \bigr) x_i \tag{12}
$$
These equations are often called the normal equations for least
squares estimation. We recall that the unknowns are $\beta_1$ and $\beta_2$.
(Many people would remove the $\frac{1}{n}$ factor here. Yet, I think it is
easier to understand if we keep it.)
Indeed, we see that the normal equations rewrite as
$$
\bar{y} - \beta_1 - \beta_2 \bar{x} = 0 \tag{14}
$$
$$
\overline{xy} - \beta_1 \bar{x} - \beta_2 \overline{x^2} = 0 \tag{15}
$$
where $\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} := \frac{1}{n}\sum_{i=1}^{n} y_i$, $\overline{xy} := \frac{1}{n}\sum_{i=1}^{n} x_i y_i$ and $\overline{x^2} := \frac{1}{n}\sum_{i=1}^{n} x_i^2$.
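A quick numerical check (illustrative only, with hypothetical data): at the least-squares parameters, here obtained from NumPy's `np.polyfit`, equations (14)-(15) hold up to floating-point error.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.3, 1.2, 1.8, 3.1, 3.9])

# Least-squares fit of y ~ b1 + b2*x (np.polyfit returns [slope, intercept]).
beta2, beta1 = np.polyfit(x, y, 1)

# Empirical means appearing in (14)-(15).
x_bar, y_bar = x.mean(), y.mean()
xy_bar, x2_bar = (x * y).mean(), (x ** 2).mean()

print(y_bar - beta1 - beta2 * x_bar)            # ~ 0, equation (14)
print(xy_bar - beta1 * x_bar - beta2 * x2_bar)  # ~ 0, equation (15)
```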
Optimizing models (3)
Hence, from
$$
\bar{y} - \beta_1 - \beta_2 \bar{x} = 0 \tag{16}
$$
$$
\overline{xy} - \beta_1 \bar{x} - \beta_2 \overline{x^2} = 0 \tag{17}
$$
you will prove in exercise that the "best" model, i.e., the one that
minimizes the error, is therefore $\beta_1 + \beta_2 x$ with slope and intercept
$$
\beta_2 = \frac{C_{XY}}{\sigma_X^2} \tag{18}
$$
$$
\beta_1 = \bar{y} - \beta_2 \bar{x} \tag{19}
$$
where $C_{XY}$ denotes the empirical covariance of the $x_i$ and $y_i$ and $\sigma_X^2$ the empirical variance of the $x_i$.
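A minimal sketch, assuming hypothetical data, of the closed-form solution (18)-(19) computed from the empirical covariance and variance, cross-checked against NumPy's built-in least-squares fit:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.3, 1.2, 1.8, 3.1, 3.9])

# Empirical covariance C_XY and variance sigma_X^2 (1/n convention).
c_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = np.mean((x - x.mean()) ** 2)

beta2 = c_xy / var_x                  # slope, equation (18)
beta1 = y.mean() - beta2 * x.mean()   # intercept, equation (19)

# Cross-check with NumPy's degree-1 polynomial fit ([slope, intercept]).
print(beta1, beta2, np.polyfit(x, y, 1))
```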
Partial conclusion (2)
§ Since the noise has zero mean by assumption, the law of large
numbers implies that $\bar{\varepsilon} \to 0$ and therefore that $\bar{\varepsilon}\,\bar{x} \to 0$.
§ Since the noise and $x$ are independent⁴, using the law of large
numbers we also deduce that $\overline{x\varepsilon} \to 0$.
§ Thus, $C_{XY} \to \beta_2^* \sigma_X^2$.
§ We recall that the estimate is $\beta_2 := \dfrac{C_{XY}}{\sigma_X^2} \to \dfrac{\beta_2^* \sigma_X^2}{\sigma_X^2} = \beta_2^*$.
§ This means that indeed, at least in the simplest case above,
the estimated parameter converges to the true value. When
an estimator converges to the truth like that we say that the
estimator is consistent. (This is the least we can ask for an
estimator.)
⁴ For independent variables $X$ and $Y$ we do have $\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y)$.
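A small simulation sketch of this consistency statement, assuming a model $y_i = \beta_1^* + \beta_2^* x_i + \varepsilon_i$ with zero-mean noise independent of $x$ (all numerical values hypothetical): the slope estimate approaches $\beta_2^*$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1_star, beta2_star = 1.0, 2.0        # assumed "true" parameters

for n in (10, 1_000, 100_000):
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.normal(0.0, 0.5, size=n)   # zero-mean noise, independent of x
    y = beta1_star + beta2_star * x + eps
    c_xy = np.mean((x - x.mean()) * (y - y.mean()))
    var_x = np.mean((x - x.mean()) ** 2)
    print(n, c_xy / var_x)               # estimate of beta2 tends to beta2_star
```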
Is $\beta_1$ consistent?
§ Done in TD!