Lecture-04__Least Squares and Geometry
The design matrix $X$ stacks the training samples as its rows:

$$X = \begin{bmatrix} \cdots & x_1^T & \cdots \\ \cdots & x_2^T & \cdots \\ & \vdots & \\ \cdots & x_n^T & \cdots \end{bmatrix}$$
Linear Model
For each training sample i = 1, … , n:
$$\hat{y}_i = \langle w, x_i \rangle = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_p x_{ip}$$
$$\langle w, x_i \rangle = w^T x_i = x_i^T w$$
In other words, the predicted values $\hat{y}_i$ should be as close as possible to the true labels $y_i$.
In general, the true label vector $y \neq Xw$ for any choice of $w$ (due to model error, measurement error, other sources of noise, etc.)
Residual: $r_i = y_i - \hat{y}_i$
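A minimal NumPy sketch of these definitions; the data values and the candidate weight vector below are made up purely for illustration:

```python
import numpy as np

# Toy data: n = 4 samples, p = 2 features (values made up for illustration).
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0],
              [2.0, 2.0]])      # row i is x_i^T
y = np.array([5.0, 2.0, 4.0, 6.0])
w = np.array([1.0, 2.0])        # some candidate weight vector

y_hat = X @ w                   # predictions: y_hat[i] = <w, x_i>
r = y - y_hat                   # residuals:   r[i] = y[i] - y_hat[i]
print(y_hat)                    # [5. 2. 5. 6.]
print(r)                        # [ 0.  0. -1.  0.]
```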
$$\mathrm{span}(x_1, x_2, \ldots, x_p) = \{\, y \in \mathbb{R}^n : y = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \text{ for some } w_1, w_2, \ldots, w_p \in \mathbb{R} \,\}$$
The span is the set of all vectors that can be “reached” or “constructed” from the given vectors $x_j$ using arbitrary real-valued weights $w_j$.
If $y$ belongs to this set, we say $y$ is in the span of the $x_j$'s.
Ex.
$$x_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}$$
$$w_1 x_1 + w_2 x_2 = \begin{bmatrix} w_1 \\ w_2 \\ w_2 \end{bmatrix}$$
$$\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix},\; \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix},\; \begin{bmatrix} 3 \\ 1 \\ 1 \end{bmatrix} \;\in\; \mathrm{span}(x_1, x_2); \qquad \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \;\notin\; \mathrm{span}(x_1, x_2)$$
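As an illustration (not from the slides), span membership can be checked numerically by solving a least squares problem and testing whether the residual is zero; the helper `in_span` below is a hypothetical name:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])      # columns are x_1 and x_2 from the example

def in_span(A, y, tol=1e-10):
    """True if y is (numerically) a linear combination of the columns of A."""
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.linalg.norm(A @ w - y)) < tol

print(in_span(A, np.array([1.0, 2.0, 2.0])))   # True:  in the span
print(in_span(A, np.array([1.0, 2.0, 3.0])))   # False: not in the span
```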
Ex.
$$x_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$
$$\mathrm{span}(x_1, x_2) = \left\{ \begin{bmatrix} w_1 \\ w_2 \\ 0 \end{bmatrix} : w_1, w_2 \in \mathbb{R} \right\}$$
i.e., vectors with zero in the 3rd coordinate.
Geometry of Least Squares
$\hat{y} \in \mathrm{span}(\mathrm{cols}(X))$

[Figure: the true label $y$ and the plane $\mathrm{span}(\mathrm{cols}(X))$ spanned by two feature vectors $x_1$ and $x_2$.]

We want to find the point $\hat{y}$ inside the span of the columns of $X$ (the green plane in the figure) that is as close as possible to $y$, i.e., the $\hat{y}$ with the smallest distance to the true label $y$.
[Figure: residuals relative to $\mathrm{span}(\mathrm{cols}(X))$.]

Consider some point $\tilde{y}$ in the span with residual $\tilde{r} = y - \tilde{y}$ that is not orthogonal to $\mathrm{span}(\mathrm{cols}(X))$, and let $d$ be the distance from $\tilde{y}$ to the orthogonal projection $\hat{y}$ (whose residual is $r$). By the Pythagorean theorem (i.e., $c^2 = a^2 + b^2$),

$$\|\tilde{r}\|^2 = \|r\|^2 + d^2.$$

Since $d \neq 0$, $d^2 > 0$, therefore $\|\tilde{r}\|^2 > \|r\|^2$: any prediction whose residual is not orthogonal to the span has a strictly larger residual, so the best $\hat{y}$ is the orthogonal projection of $y$ onto $\mathrm{span}(\mathrm{cols}(X))$.
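A small numerical check of this argument, using np.linalg.lstsq to compute the orthogonal projection on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))    # n = 20 samples, p = 3 features (synthetic)
y = rng.normal(size=20)

# Orthogonal projection of y onto span(cols(X)), computed via least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ w_hat

# Any other point in the span has a strictly longer residual.
w_tilde = w_hat + np.array([0.1, -0.2, 0.3])
r_tilde = y - X @ w_tilde
print(np.linalg.norm(r_tilde) > np.linalg.norm(r))   # True
```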
Let
$$\hat{w} = \operatorname*{argmin}_{w} \|r(w)\|^2 = \operatorname*{argmin}_{w} \sum_{i=1}^{n} \bigl(y_i - \langle w, x_i \rangle\bigr)^2$$
We know that $\hat{r} = y - X\hat{w}$ is orthogonal to $\mathrm{span}(\mathrm{cols}(X))$
➔ $x_j^T \hat{r} = 0$ for all $j = 1, 2, \ldots, p$
➔ $X^T \hat{r} = 0$ (in matrix form)
(Recall: two vectors $u$, $v$ are orthogonal if $\langle u, v \rangle = 0$.)
$$\hat{r} = y - X\hat{w}$$
$$X^T \hat{r} = 0$$
$$X^T (y - X\hat{w}) = 0$$
$$X^T y = X^T X \hat{w}$$
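A sketch verifying the orthogonality condition $X^T \hat{r} = 0$ on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))    # synthetic design matrix
y = rng.normal(size=30)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r_hat = y - X @ w_hat

# The least squares residual is orthogonal to every column of X.
print(X.T @ r_hat)              # all entries ~0 (up to floating-point error)
```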
Ex 1.
$$3x_1 + 2x_2 = 1$$
$$3x_1 + x_2 = 0$$
(a) From the second equation, $x_2 = -3x_1$. Substituting into the first: $3x_1 + 2(-3x_1) = 1$, so $-3x_1 = 1$, $x_1 = -\tfrac{1}{3}$, $x_2 = 1$ ➔ one solution
(c) In matrix form:
$$\begin{bmatrix} 3 & 2 \\ 3 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad x_1 = -\tfrac{1}{3},\; x_2 = 1 \;\text{➔ one solution}$$
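The same system solved with NumPy:

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [3.0, 1.0]])
b = np.array([1.0, 0.0])

x = np.linalg.solve(A, b)       # works because A is square and full rank
print(x)                        # [-0.3333...  1.0] -> one solution
```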
Linear Independence
Ex.
$$v_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \; v_2 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \; v_3 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \;\text{➔ Yes, linearly independent (LI)}$$
$$a_1 v_1 + a_2 v_2 + a_3 v_3 = \begin{bmatrix} a_1 \\ a_2 + a_3 \\ a_2 \end{bmatrix}$$
This equals $0$ only if $a_1 = a_2 = a_3 = 0$.
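A numerical independence check for these vectors (the columns are linearly independent iff the rank equals the number of columns):

```python
import numpy as np

V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])   # columns are v1, v2, v3

# Columns are linearly independent iff rank equals the number of columns.
print(np.linalg.matrix_rank(V) == V.shape[1])   # True
```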
Linear Dependence
Ex. n = 3, p = 4
$$v_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \; v_2 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \; v_3 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \; v_4 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}$$
Note $v_4 = v_1 + v_2 - v_3$ ➔ $v_4$ is dependent on the other three vectors $v_1, v_2, v_3$.
We can write one vector as a linear combination (weighted sum) of the others.
➔ This implies linear dependence.
$$a_1 v_1 + a_2 v_2 + a_3 v_3 + a_4 v_4 = \begin{bmatrix} a_1 + a_4 \\ a_2 + a_3 \\ a_2 + a_4 \end{bmatrix} \;\text{➔ if } a_1 = a_2 = -a_3 = -a_4, \text{ then } a_1 v_1 + a_2 v_2 + a_3 v_3 + a_4 v_4 = 0$$
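The dependence can be confirmed numerically; a minimal sketch with the four vectors above as columns:

```python
import numpy as np

V = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])   # columns are v1, v2, v3, v4

# Only 3 rows, so rank <= 3 < 4 columns -> the columns must be dependent.
print(np.linalg.matrix_rank(V))          # 3
print(V[:, 0] + V[:, 1] - V[:, 2])       # [1. 0. 1.] = v4
```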
A matrix $X \in \mathbb{R}^{n \times p}$ is called "full rank" if its rank equals the smallest dimension of $X$:
$$\mathrm{rank}(X) = \min(n, p)$$
Ex.
$$X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \;\text{➔ rank}(X) = 2 \qquad X = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix} \;\text{➔ rank}(X) = 2$$
(The first matrix is full rank since $\min(3,2) = 2$; the second is not, since $\min(3,3) = 3 > 2$.)
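Checking these two ranks with NumPy:

```python
import numpy as np

X1 = np.array([[1, 0],
               [0, 1],
               [0, 1]])
X2 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 1, 1]])

print(np.linalg.matrix_rank(X1))   # 2 = min(3, 2) -> full rank
print(np.linalg.matrix_rank(X2))   # 2 < min(3, 3) -> not full rank
```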
Matrix Inverse
Ex.
$$A = \begin{bmatrix} 1/4 & 0 \\ 0 & 2 \end{bmatrix} \;\text{➔}\; A^{-1} = \begin{bmatrix} 4 & 0 \\ 0 & 1/2 \end{bmatrix}$$
Not all matrices have inverses
Only square matrices can have an inverse – but being square is not enough!
For a matrix A to have an inverse, it must also be full rank
Ex.
$$A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \;\text{have no inverse.}$$
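In NumPy, np.linalg.inv computes the inverse and raises LinAlgError when the matrix is singular (not full rank); here applied to the diagonal example and one of the singular examples above:

```python
import numpy as np

A = np.array([[0.25, 0.0],
              [0.0,  2.0]])
print(np.linalg.inv(A))            # [[4.  0. ], [0.  0.5]]

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])         # rank 1 -> not full rank, no inverse
try:
    np.linalg.inv(B)
except np.linalg.LinAlgError as err:
    print("no inverse:", err)
```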
$X \in \mathbb{R}^{n \times p}$. Assume that $n \geq p$ and $\mathrm{rank}(X) = p$.
Because $X$ has full column rank, $X^T X$ (a $p \times p$ matrix) is also full rank.
$\hat{w}$ satisfies $X^T y = X^T X \hat{w}$, so
$$\hat{w} = (X^T X)^{-1} X^T y$$
Normal equation
Why don't we instead solve $X\hat{w} - y = 0$ directly (i.e., $\hat{w} = X^{-1} y$)? Because $X \in \mathbb{R}^{n \times p}$ is in general not square, $X^{-1}$ does not exist; moreover, $y$ typically does not lie in $\mathrm{span}(\mathrm{cols}(X))$, so $Xw = y$ has no exact solution.
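A sketch of the normal equation in code on synthetic data; in practice, solving the linear system (or using np.linalg.lstsq) is preferred over explicitly forming $(X^T X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))        # n >= p, and rank(X) = p almost surely
y = rng.normal(size=n)

# Closed form from the normal equation: w_hat = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Numerically preferable alternatives: solve X^T X w = X^T y, or use lstsq.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, w_solve), np.allclose(w_solve, w_lstsq))   # True True
```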