Lecture 04: Least Squares and Geometry

Lecture 4 discusses the concepts of least squares and its geometric interpretation in machine learning, focusing on how to predict labels using a linear model. It explains the process of minimizing residuals to find the optimal weight vector, emphasizing the importance of the L2 norm and the geometric projection of true labels onto the span of feature vectors. Additionally, it covers linear independence and dependence of vectors in relation to the solutions of linear systems.


Lecture 4
Least Squares and Geometry in Machine Learning
SWCON253, Machine Learning
Won Hee Lee, PhD
Learning Goals
• Understand the fundamental concepts of least squares and its geometric derivation in machine learning
Given
• y ∈ ℝⁿ = vector of training labels (one label per sample)
• X ∈ ℝⁿˣᵖ = matrix of training feature vectors (with n samples and p features)

      X = [ x₁ᵀ ; x₂ᵀ ; … ; xₙᵀ ]   (rows stacked top to bottom)

• The ith row of X is the feature vector for the ith sample.
• Let Xⱼ = the jth column of X = feature j for all n samples.

Goal: Learn to predict ŷ for a new feature vector x.

Linear Model
For each training sample i = 1, …, n:

  ŷᵢ = ⟨w, xᵢ⟩ = w₁xᵢ₁ + w₂xᵢ₂ + ⋯ + wₚxᵢₚ

  ⟨w, xᵢ⟩ = wᵀxᵢ = xᵢᵀw

We need to find w so that ŷᵢ ≈ yᵢ for all i = 1, …, n.

In other words, the predicted values ŷᵢ should be as close as possible to the
true labels yᵢ.

In general, the true label y ≠ Xw for any choice of w (due to model
error, measurement error, other sources of noise, etc.)
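As a concrete illustration, here is a minimal NumPy sketch of the linear model (the array values and variable names are my own, not from the slides): the prediction for one sample is an inner product, and stacking all samples gives ŷ = Xw.

```python
import numpy as np

# Toy data: n = 3 samples, p = 2 features (values made up for illustration).
X = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [2.0, 0.0]])   # each row is x_i^T
w = np.array([0.3, -0.7])    # weight vector

y_hat_0 = np.dot(w, X[0])    # prediction for the first sample: <w, x_0>
y_hat = X @ w                # predictions for all samples at once: y_hat = X w
print(y_hat_0, y_hat)
```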

We define the residual as:

  rᵢ = yᵢ − ŷᵢ

Since ŷᵢ depends on w, we can write the residual as a function of w:

  rᵢ(w) = yᵢ − ŷᵢ(w)

Thus, for every possible choice of w, we would obtain different
residuals (or errors) between our true labels and model predictions.
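Continuing the toy example above (a sketch; the labels in y are invented), the residual vector r(w) = y − Xw changes as w changes:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [2.0, 0.0]])
y = np.array([1.0, 0.0, 2.0])            # made-up true labels

def residuals(w):
    """r(w) = y - X w, one residual per training sample."""
    return y - X @ w

print(residuals(np.array([0.3, -0.7])))  # one choice of w ...
print(residuals(np.array([1.0, 0.0])))   # ... gives different residuals than another
```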
Least Squares Estimation

Choose w to minimize Σᵢ₌₁ⁿ rᵢ(w)²

  Σᵢ₌₁ⁿ rᵢ(w)² = ‖r‖₂²,   where  r = [r₁, r₂, …, rₙ]ᵀ

  ‖r‖₂² = Σᵢ₌₁ⁿ rᵢ²
  ‖r‖₂ = (Σᵢ₌₁ⁿ rᵢ²)^(1/2) = the l₂ or Euclidean norm

  (In general, the l_p-norm is ‖r‖ₚ = (Σᵢ₌₁ⁿ |rᵢ|ᵖ)^(1/p).)
The l₂ norm is a measure of distance

The l₂ vector norm is a generalization of the Pythagorean theorem to n dimensions.
It can therefore be used as a measure of distance between two vectors.
● For n-dimensional vectors a and b, their distance is ‖a − b‖₂.

Note: The square of the l₂ norm of a vector is the sum of the squares of the
vector's elements:  ‖v‖₂² = v₁² + v₂² + ⋯ + vₙ².
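A quick numerical check of these facts (a sketch with arbitrary vectors of my own choosing), showing that np.linalg.norm matches the Pythagorean formula:

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([0.0, 0.0, 0.0])

dist = np.linalg.norm(a - b)               # l2 distance between a and b
by_hand = np.sqrt(np.sum((a - b) ** 2))    # (sum of squared differences)^(1/2)
print(dist, by_hand)                       # both 5.0, the 3-4-5 right triangle

r = np.array([1.0, -2.0, 2.0])
print(np.linalg.norm(r) ** 2, np.sum(r ** 2))  # squared l2 norm = sum of squares = 9.0
```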
Least squares: Minimizing the sum of squared residuals

To find the optimal weight vector w, we aim to minimize the sum of squared residuals:

  Choose w to minimize Σᵢ₌₁ⁿ rᵢ(w)²

We can think of all the residuals combined as a residual vector r = [r₁, r₂, …, rₙ]ᵀ.

The goal is to minimize the length (size) of this residual vector, specifically its
squared Euclidean norm:

  ‖r‖₂² = Σᵢ₌₁ⁿ rᵢ²,   where  ‖r‖₂ = (Σᵢ₌₁ⁿ rᵢ²)^(1/2)

Thus, minimizing the sum of squared residuals is equivalent to minimizing ‖r‖₂².
Why least squares?

1. Treats negative and positive residuals equally
2. Mathematically convenient (makes the math easy), leading to closed-form solutions
   in linear regression
3. Magnifies large errors, making the model focus on reducing big mistakes
4. Nice geometric interpretation
5. Consistent with the Gaussian noise assumption:

     y = Xw + ε,   ε ~ N(0, σ²I)   (Gaussian noise)

   Then the maximum likelihood estimator of w is the least squares estimator,
   so minimizing squared error is statistically justified under Gaussian noise
   assumptions (see the simulation sketch below).
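A small simulation sketch of point 5 (the sizes, noise level, and random seed are my own, not from the slides): when y = Xw + ε with Gaussian ε, the least squares estimate approximately recovers w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)   # y = Xw + Gaussian noise

# Least squares estimate: np.linalg.lstsq minimizes ||y - Xw||_2^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_true)
print(w_hat)   # close to w_true when the noise is small
```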
Span

The span of a set of vectors x₁, x₂, …, xₚ ∈ ℝⁿ is the set of all vectors
that can be expressed as linear combinations (i.e., weighted sums) of the xⱼ's:

  span(x₁, x₂, …, xₚ) = { y ∈ ℝⁿ : y = w₁x₁ + w₂x₂ + ⋯ + wₚxₚ
                          for some w₁, w₂, …, wₚ ∈ ℝ }

The span is the set of all vectors that can be "reached" or "constructed"
using the given vectors xⱼ, combined with any real-number weights wⱼ.
If y belongs to this set, we say y is in the span of the xⱼ's.

Ex.
  x₁ = [1, 0, 0]ᵀ,  x₂ = [0, 1, 1]ᵀ

  w₁x₁ + w₂x₂ = [w₁, w₂, w₂]ᵀ

  [1, 2, 2]ᵀ, [5, 10, 10]ᵀ, [3, 1, 1]ᵀ  → in the span
  [1, 2, 3]ᵀ                            → not in the span

Ex.
  x₁ = [1, 0, 0]ᵀ,  x₂ = [0, 1, 0]ᵀ

  span(x₁, x₂) = vectors of the form [w₁, w₂, 0]ᵀ for some w₁, w₂,
  i.e., vectors with a zero in the 3rd coordinate.
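One way to check span membership numerically (a sketch; the helper name in_span and the tolerance are my own) is to solve the least squares problem and test whether the residual is essentially zero:

```python
import numpy as np

def in_span(y, columns, tol=1e-10):
    """True if y is (numerically) a linear combination of the given vectors."""
    A = np.column_stack(columns)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.linalg.norm(y - A @ w) < tol

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 1.0])
print(in_span(np.array([5.0, 10.0, 10.0]), [x1, x2]))  # True
print(in_span(np.array([1.0, 2.0, 3.0]),   [x1, x2]))  # False
```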
Geometry of Least Squares

Consider the case where:
• p = 2: two features (columns of X)
• n = 3: three different samples (rows of X)

Recall that our linear model is:

  ŷ = Xw = w₁x₁ + w₂x₂

where x₁ and x₂ here denote the two feature vectors (the columns of X).
In other words, the predicted output ŷ is a linear combination of the columns of X:

  ŷ ∈ span(cols(X))

[Figure: the true label y and the plane span(cols(X)) spanned by the two feature
vectors x₁ and x₂; ŷ lies in the plane, and the residual connects ŷ to y]

We want to find the ŷ that has the smallest distance to the true label y.
By the least squares criterion, we want to minimize the length of the residual vector.

That is, we want to find a point inside the span of the columns of X (the green
plane in the figure) that is as close as possible to y:

  ŷ is the orthogonal projection of y onto the span of X's columns.

Why? Consider another candidate prediction in the span whose residual r̃ is not
perpendicular/orthogonal to span(cols(X)), and let d be the (nonzero) distance
within the plane between that candidate and the orthogonal projection ŷ.
By the Pythagorean theorem (i.e., c² = a² + b²):

  ‖r̃‖² = ‖r‖² + d²

Since d ≠ 0, d² > 0, therefore

  ‖r̃‖² > ‖r‖²

so the orthogonal projection ŷ gives the smallest possible residual. (A numerical
check of this argument follows below.)
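A numerical illustration of this argument (a sketch with made-up data): the least squares prediction is the orthogonal projection of y onto the column span, and any other point in the span has a strictly larger residual.

```python
import numpy as np

# n = 3 samples, p = 2 features (made-up values)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat                 # orthogonal projection of y onto span(cols(X))
r = y - y_hat                     # residual, orthogonal to both columns
print(X.T @ r)                    # ~ [0, 0]

# Any other point X w_other in the span gives a longer residual (Pythagoras):
w_other = w_hat + np.array([0.5, -0.3])
r_tilde = y - X @ w_other
d = np.linalg.norm(X @ w_other - y_hat)
print(np.linalg.norm(r_tilde)**2, np.linalg.norm(r)**2 + d**2)  # equal
print(np.linalg.norm(r_tilde) > np.linalg.norm(r))              # True
```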
Let ŵ = argmin_w ‖r(w)‖₂² = argmin_w Σᵢ₌₁ⁿ (yᵢ − ⟨w, xᵢ⟩)²

("argmin" = the argument w that minimizes: we want to know which w makes the
objective as small as possible.)

We know that r̂ = y − Xŵ is orthogonal to span(cols(X)):
➔ xⱼᵀ r̂ = 0 for all j = 1, 2, …, p (where xⱼ is the jth column of X)
➔ Xᵀ r̂ = 0 (in matrix form)
(Recall: two vectors u, v are orthogonal if ⟨u, v⟩ = 0.)

  r̂ = y − Xŵ
  Xᵀ r̂ = 0
  Xᵀ(y − Xŵ) = 0
  Xᵀy = XᵀXŵ

ŵ (the least squares estimate) satisfies this system of equations,
i.e., ŵ is a solution to the linear system Xᵀy = XᵀXw.
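A sketch of solving this system directly with NumPy (toy data again; np.linalg.solve is applied to the p×p system XᵀX w = Xᵀy):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])

# Solve the normal equations  X^T X w = X^T y  for w.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)
print(X.T @ (y - X @ w_hat))   # ~ [0, 0]: residual orthogonal to the columns of X
```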

Recall what you learned from linear algebra class

Consider the following linear systems. For each, how many solutions are there
(zero, one, or many)? If one or more solutions exist, find one or more.
Why do the different cases have different numbers of solutions?

(a) 3x₁ + 2x₂ = 1      From the second equation, x₂ = −3x₁.
    3x₁ +  x₂ = 0      Substituting: −3x₁ = 1, so x₁ = −1/3, x₂ = 1  ➔ one solution

(b) 3x₁ + x₂ = 0       one equation, two unknowns  ➔ ∞ solutions

(c) [3 2] [x₁]   [1]   the same system as (a) in matrix form:
    [3 1] [x₂] = [0]   x₁ = −1/3, x₂ = 1  ➔ one solution

(d) 3x₁ + 2x₂ = 1      The first two equations give x₁ = −1/3, x₂ = 1,
    3x₁ +  x₂ = 0      but 2(−1/3) + 2(1) ≠ 2, so the third is violated
    2x₁ + 2x₂ = 2      ➔ 0 solutions

(e) [3 2] [x₁]   [1]
    [3 1] [x₂] = [0]   the same system as (d) in matrix form  ➔ 0 solutions
    [2 2]        [2]

(f) 3x₁ +  x₂ = 1      [3 1] [x₁]   [1]   the second row is 2× the first
    6x₁ + 2x₂ = 2      [6 2] [x₂] = [2]   (a rank-1 matrix!)  ➔ ∞ solutions
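These cases can be checked numerically. The sketch below (the helper name count_solutions is my own) uses the standard rank test: the system A x = b is consistent iff rank(A) = rank([A | b]), and the solution is unique iff, in addition, rank(A) equals the number of unknowns.

```python
import numpy as np

def count_solutions(A, b):
    """Return '0', '1', or 'inf' for the linear system A x = b."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    rank_A = np.linalg.matrix_rank(A)
    rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    if rank_A < rank_Ab:
        return "0"                        # inconsistent system
    return "1" if rank_A == A.shape[1] else "inf"

print(count_solutions([[3, 2], [3, 1]], [1, 0]))             # (c): 1
print(count_solutions([[3, 2], [3, 1], [2, 2]], [1, 0, 2]))  # (e): 0
print(count_solutions([[3, 1], [6, 2]], [1, 2]))             # (f): inf
```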
Linear Independence

A collection of vectors v₁, v₂, …, vₚ ∈ ℝⁿ is said to be linearly independent
if the only solution to the equation Σᵢ₌₁ᵖ aᵢvᵢ = 0 is a₁ = a₂ = ⋯ = aₚ = 0.
That is, any weighted sum of the vectors is zero only when all the weights are zero.

Ex. n = 3, p = 2

  v₁ = [1, 0, 0]ᵀ,  v₂ = [0, 1, 1]ᵀ

  a₁v₁ + a₂v₂ = [a₁, a₂, a₂]ᵀ = 0 ?

We want to check whether this can equal the zero vector only when a₁ = a₂ = 0.
Since the only solution is a₁ = a₂ = 0, the vectors v₁ and v₂ are linearly independent.
Linear Independence

Ex. n = 3, p = 3

  v₁ = [1, 0, 0]ᵀ,  v₂ = [0, 1, 1]ᵀ,  v₃ = [0, 1, 0]ᵀ  ➔ Yes, linearly independent (LI)

  a₁v₁ + a₂v₂ + a₃v₃ = [a₁, a₂ + a₃, a₂]ᵀ     This = 0 only if a₁ = a₂ = a₃ = 0
Linear Dependence

Ex. n = 3, p = 4

  v₁ = [1, 0, 0]ᵀ,  v₂ = [0, 1, 1]ᵀ,  v₃ = [0, 1, 0]ᵀ,  v₄ = [1, 0, 1]ᵀ

Note v₄ = v₁ + v₂ − v₃  ➔ v₄ is dependent on the other three vectors v₁, v₂, v₃.
We can write one vector as a linear combination (weighted sum) of the others.
➔ This implies linear dependence.

  a₁v₁ + a₂v₂ + a₃v₃ + a₄v₄ = [a₁ + a₄, a₂ + a₃, a₂ + a₄]ᵀ

  If a₁ = −a₄ = a₂ = −a₃ (e.g., a₁ = a₂ = 1, a₃ = a₄ = −1), then
  a₁v₁ + a₂v₂ + a₃v₃ + a₄v₄ = 0 with not all weights zero

➔ NOT linearly independent
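Both the independence and dependence examples above can be checked numerically (a sketch; the helper name is my own): stack the vectors as columns and compare the matrix rank to the number of vectors — full column rank means linearly independent.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = np.array([0.0, 1.0, 0.0])
v4 = np.array([1.0, 0.0, 1.0])   # = v1 + v2 - v3

def linearly_independent(*vectors):
    """True if the vectors are linearly independent (rank equals their count)."""
    V = np.column_stack(vectors)
    return np.linalg.matrix_rank(V) == len(vectors)

print(linearly_independent(v1, v2))          # True
print(linearly_independent(v1, v2, v3))      # True
print(linearly_independent(v1, v2, v3, v4))  # False (v4 = v1 + v2 - v3)
```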


Matrix Rank

Number of linearly independent (LI) columns of X
= number of linearly independent (LI) rows of X

A matrix X ∈ ℝⁿˣᵖ is called "full rank" if its rank equals the smallest dimension of X:

  rank(X) = min(n, p)

Ex.  X = [1 0]                      Ex.  X = [1 0 1]
         [0 1]  ➔ rank(X) = 2            [0 1 1]  ➔ rank(X) = 2
         [0 1]                            [0 1 1]
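Checking the slide's two example matrices with NumPy (a short sketch): only the first is full rank, since its rank equals its smallest dimension.

```python
import numpy as np

X1 = np.array([[1, 0],
               [0, 1],
               [0, 1]])
X2 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 1, 1]])

print(np.linalg.matrix_rank(X1), np.linalg.matrix_rank(X1) == min(X1.shape))  # 2, True  -> full rank
print(np.linalg.matrix_rank(X2), np.linalg.matrix_rank(X2) == min(X2.shape))  # 2, False -> not full rank
```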
Matrix Inverse

The inverse of a square matrix A is the matrix A⁻¹ that satisfies

  A A⁻¹ = A⁻¹ A = I = [1 0 ⋯ 0]
                      [0 1 ⋯ 0]
                      [⋮ ⋮ ⋱ ⋮]
                      [0 0 ⋯ 1]

For any vector v, Iv = v.

Ex. A = 3, A⁻¹ = 1/3

Ex. A = [1/4  0]  ➔  A⁻¹ = [4   0 ]
        [0    2]           [0  1/2]

Not all matrices have inverses

Only square matrices can have an inverse – but being square is not enough!
For a matrix A to have an inverse, it must also be full rank.

Ex. A = [1 0]   and   A = [1 2]   have no inverse.
        [0 0]             [2 4]
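A sketch of this in NumPy: np.linalg.inv returns the inverse for a full-rank square matrix and raises LinAlgError for a singular (rank-deficient) one.

```python
import numpy as np

A = np.array([[0.25, 0.0],
              [0.0,  2.0]])
print(np.linalg.inv(A))            # [[4, 0], [0, 0.5]]
print(A @ np.linalg.inv(A))        # identity matrix

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])         # rank 1, not full rank
try:
    np.linalg.inv(B)
except np.linalg.LinAlgError as e:
    print("no inverse:", e)        # singular matrix
```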
X ∈ ℝⁿˣᵖ. Assume that n ≥ p and rank(X) = p.

In other words, X has p linearly independent columns/features.

Because X has full column rank, XᵀX (a p × p matrix) is also full rank:

  rank(XᵀX) = p   (XᵀX has p linearly independent columns → XᵀX has an inverse)

X ∈ ℝⁿˣᵖ, n ≥ p, rank(X) = p  →  rank(XᵀX) = p  →  XᵀX has an inverse


Recall from earlier

Least squares estimation:

  ŵ = argmin_w ‖r(w)‖₂² = argmin_w Σᵢ₌₁ⁿ (yᵢ − ⟨w, xᵢ⟩)²

  ("the argument w that minimizes")

ŵ satisfies  Xᵀy = XᵀXŵ.

Multiply both sides by (XᵀX)⁻¹ (if XᵀX is invertible):

  (XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹(XᵀX)ŵ = Iŵ = ŵ

  ŵ = (XᵀX)⁻¹Xᵀy      ← the normal equation
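Putting it together, a sketch with toy random data (sizes and seed are my own): compute ŵ via the closed form and compare with np.linalg.lstsq, which minimizes the same objective but more stably; in practice one usually prefers lstsq or np.linalg.solve on the normal equations over forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))                   # n >= p, columns linearly independent
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))         # p -> X^T X is invertible

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y   # w_hat = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))         # True
```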
Why don't we just solve Xŵ − y = 0 (i.e., ŵ = X⁻¹y)?

In our problem, X is the feature matrix with dimensions roughly n × p
(n = number of training examples, p = feature dimension), so it is generally
not a square matrix.

An inverse exists only for square matrices, so a general n × p matrix has no inverse.

On the other hand, forming XᵀX (or XXᵀ) gives a square matrix, for which an inverse
can be computed (when it is full rank).


Further Readings
• Any linear algebra book should be fine!
• 기계학습 (author: 오일석), Section 2.1 has been posted to e-campus
  • Skip the Section 2.1.3 "Perceptron" part for now
• Mathematics for Machine Learning (MathML)
  • Chapters 2 & 3: Linear Algebra & Analytic Geometry
