Lecture 04: Least Squares and Geometry

Lecture 4 discusses the concepts of least squares and its geometric interpretation in machine learning, focusing on how to predict labels using a linear model. It explains the process of minimizing residuals to find the optimal weight vector, emphasizing the importance of the L2 norm and the geometric projection of true labels onto the span of feature vectors. Additionally, it covers linear independence and dependence of vectors in relation to the solutions of linear systems.


Lecture 4
Least Squares and Geometry in Machine Learning
SWCON253, Machine Learning
Won Hee Lee, PhD
Learning Goals
• Understand the fundamental concepts of least squares and its geometric derivation in machine learning
Given
• y ∈ ℝⁿ = vector of training labels (one label per sample)
• X ∈ ℝⁿˣᵖ = matrix of training feature vectors (with n samples and p features)

      X = [ x₁ᵀ ; x₂ᵀ ; … ; xₙᵀ ]   (rows stacked top to bottom)

• The ith row of X is the feature vector for the ith sample.
• Let Xⱼ = the jth column of X = feature j for all n samples.

Goal: Learn to predict ŷ for a new feature vector x.

Linear Model
For each training sample i = 1, …, n:

  ŷᵢ = ⟨w, xᵢ⟩ = w₁xᵢ₁ + w₂xᵢ₂ + ⋯ + wₚxᵢₚ

  ⟨w, xᵢ⟩ = wᵀxᵢ = xᵢᵀw

We need to find w so that ŷᵢ ≈ yᵢ for all i = 1, …, n.

In other words, the predicted values ŷᵢ should be as close as possible to the
true labels yᵢ.

In general, the true label y ≠ Xw for any choice of w (due to model
error, measurement error, other sources of noise, etc.)
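As a concrete illustration, here is a minimal NumPy sketch of the linear model (the array values and variable names are my own, not from the slides): the prediction for one sample is an inner product, and stacking all samples gives ŷ = Xw.

```python
import numpy as np

# Toy data: n = 3 samples, p = 2 features (values made up for illustration).
X = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [2.0, 0.0]])   # each row is x_i^T
w = np.array([0.3, -0.7])    # weight vector

y_hat_0 = np.dot(w, X[0])    # prediction for the first sample: <w, x_0>
y_hat = X @ w                # predictions for all samples at once: y_hat = X w
print(y_hat_0, y_hat)
```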

We define the residual as:

  rᵢ = yᵢ − ŷᵢ

Since ŷᵢ depends on w, we can write the residual as a function of w:

  rᵢ(w) = yᵢ − ŷᵢ(w)

Thus, for every possible choice of w, we would obtain different
residuals (or errors) between our true labels and model predictions.
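Continuing the toy example above (a sketch; the labels in y are invented), the residual vector r(w) = y − Xw changes as w changes:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [2.0, 0.0]])
y = np.array([1.0, 0.0, 2.0])            # made-up true labels

def residuals(w):
    """r(w) = y - X w, one residual per training sample."""
    return y - X @ w

print(residuals(np.array([0.3, -0.7])))  # one choice of w ...
print(residuals(np.array([1.0, 0.0])))   # ... gives different residuals than another
```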
Least Squares Estimation

Choose w to minimize Σᵢ₌₁ⁿ rᵢ(w)²

  Σᵢ₌₁ⁿ rᵢ(w)² = ‖r‖₂²,   where  r = [r₁, r₂, …, rₙ]ᵀ

  ‖r‖₂² = Σᵢ₌₁ⁿ rᵢ²
  ‖r‖₂ = (Σᵢ₌₁ⁿ rᵢ²)^(1/2) = the l₂ or Euclidean norm

  (In general, the l_p-norm is ‖r‖ₚ = (Σᵢ₌₁ⁿ |rᵢ|ᵖ)^(1/p).)
The l₂ norm is a measure of distance

The l₂ vector norm is a generalization of the Pythagorean theorem to n dimensions.
It can therefore be used as a measure of distance between two vectors.
● For n-dimensional vectors a and b, their distance is ‖a − b‖₂.

Note: The square of the l₂ norm of a vector is the sum of the squares of the
vector's elements:  ‖v‖₂² = v₁² + v₂² + ⋯ + vₙ².
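A quick numerical check of these facts (a sketch with arbitrary vectors of my own choosing), showing that np.linalg.norm matches the Pythagorean formula:

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([0.0, 0.0, 0.0])

dist = np.linalg.norm(a - b)               # l2 distance between a and b
by_hand = np.sqrt(np.sum((a - b) ** 2))    # (sum of squared differences)^(1/2)
print(dist, by_hand)                       # both 5.0, the 3-4-5 right triangle

r = np.array([1.0, -2.0, 2.0])
print(np.linalg.norm(r) ** 2, np.sum(r ** 2))  # squared l2 norm = sum of squares = 9.0
```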
Least squares: Minimizing the sum of squared residuals

To find the optimal weight vector w, we aim to minimize the sum of squared residuals:

  Choose w to minimize Σᵢ₌₁ⁿ rᵢ(w)²

We can think of all the residuals combined as a residual vector r = [r₁, r₂, …, rₙ]ᵀ.

The goal is to minimize the length (size) of this residual vector, specifically its
squared Euclidean norm:

  ‖r‖₂² = Σᵢ₌₁ⁿ rᵢ²,   where  ‖r‖₂ = (Σᵢ₌₁ⁿ rᵢ²)^(1/2)

Thus, minimizing the sum of squared residuals is equivalent to minimizing ‖r‖₂².
Why least squares?

1. Treats negative and positive residuals equally
2. Mathematically convenient (makes the math easy), leading to closed-form solutions
   in linear regression
3. Magnifies large errors, making the model focus on reducing big mistakes
4. Nice geometric interpretation
5. Consistent with the Gaussian noise assumption:

     y = Xw + ε,   ε ~ N(0, σ²I)   (Gaussian noise)

   Then the maximum likelihood estimator of w is the least squares estimator,
   so minimizing squared error is statistically justified under Gaussian noise
   assumptions (see the simulation sketch below).
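A small simulation sketch of point 5 (the sizes, noise level, and random seed are my own, not from the slides): when y = Xw + ε with Gaussian ε, the least squares estimate approximately recovers w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)   # y = Xw + Gaussian noise

# Least squares estimate: np.linalg.lstsq minimizes ||y - Xw||_2^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_true)
print(w_hat)   # close to w_true when the noise is small
```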
Span

The span of a set of vectors x₁, x₂, …, xₚ ∈ ℝⁿ is the set of all vectors
that can be expressed as linear combinations (i.e., weighted sums) of the xⱼ's:

  span(x₁, x₂, …, xₚ) = { y ∈ ℝⁿ : y = w₁x₁ + w₂x₂ + ⋯ + wₚxₚ
                          for some w₁, w₂, …, wₚ ∈ ℝ }

The span is the set of all vectors that can be "reached" or "constructed"
using the given vectors xⱼ, combined with any real-number weights wⱼ.
If y belongs to this set, we say y is in the span of the xⱼ's.

Ex.
  x₁ = [1, 0, 0]ᵀ,  x₂ = [0, 1, 1]ᵀ

  w₁x₁ + w₂x₂ = [w₁, w₂, w₂]ᵀ

  [1, 2, 2]ᵀ, [5, 10, 10]ᵀ, [3, 1, 1]ᵀ  → in the span
  [1, 2, 3]ᵀ                            → not in the span

Ex.
  x₁ = [1, 0, 0]ᵀ,  x₂ = [0, 1, 0]ᵀ

  span(x₁, x₂) = vectors of the form [w₁, w₂, 0]ᵀ for some w₁, w₂,
  i.e., vectors with a zero in the 3rd coordinate.
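One way to check span membership numerically (a sketch; the helper name in_span and the tolerance are my own) is to solve the least squares problem and test whether the residual is essentially zero:

```python
import numpy as np

def in_span(y, columns, tol=1e-10):
    """True if y is (numerically) a linear combination of the given vectors."""
    A = np.column_stack(columns)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.linalg.norm(y - A @ w) < tol

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 1.0])
print(in_span(np.array([5.0, 10.0, 10.0]), [x1, x2]))  # True
print(in_span(np.array([1.0, 2.0, 3.0]),   [x1, x2]))  # False
```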
Geometry of Least Squares

Consider the case where:
• p = 2: two features (columns of X)
• n = 3: three different samples (rows of X)

Recall that our linear model is:

  ŷ = Xw = w₁x₁ + w₂x₂

where x₁ and x₂ here denote the two feature vectors (the columns of X).
In other words, the predicted output ŷ is a linear combination of the columns of X:

  ŷ ∈ span(cols(X))

[Figure: the true label y and the plane span(cols(X)) spanned by the two feature
vectors x₁ and x₂; ŷ lies in the plane, and the residual connects ŷ to y]

We want to find the ŷ that has the smallest distance to the true label y.
By the least squares criterion, we want to minimize the length of the residual vector.

That is, we want to find a point inside the span of the columns of X (the green
plane in the figure) that is as close as possible to y:

  ŷ is the orthogonal projection of y onto the span of X's columns.

Why? Consider another candidate prediction in the span whose residual r̃ is not
perpendicular/orthogonal to span(cols(X)), and let d be the (nonzero) distance
within the plane between that candidate and the orthogonal projection ŷ.
By the Pythagorean theorem (i.e., c² = a² + b²):

  ‖r̃‖² = ‖r‖² + d²

Since d ≠ 0, d² > 0, therefore

  ‖r̃‖² > ‖r‖²

so the orthogonal projection ŷ gives the smallest possible residual. (A numerical
check of this argument follows below.)
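A numerical illustration of this argument (a sketch with made-up data): the least squares prediction is the orthogonal projection of y onto the column span, and any other point in the span has a strictly larger residual.

```python
import numpy as np

# n = 3 samples, p = 2 features (made-up values)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat                 # orthogonal projection of y onto span(cols(X))
r = y - y_hat                     # residual, orthogonal to both columns
print(X.T @ r)                    # ~ [0, 0]

# Any other point X w_other in the span gives a longer residual (Pythagoras):
w_other = w_hat + np.array([0.5, -0.3])
r_tilde = y - X @ w_other
d = np.linalg.norm(X @ w_other - y_hat)
print(np.linalg.norm(r_tilde)**2, np.linalg.norm(r)**2 + d**2)  # equal
print(np.linalg.norm(r_tilde) > np.linalg.norm(r))              # True
```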
Let ŵ = argmin_w ‖r(w)‖₂² = argmin_w Σᵢ₌₁ⁿ (yᵢ − ⟨w, xᵢ⟩)²

("argmin" = the argument w that minimizes: we want to know which w makes the
objective as small as possible.)

We know that r̂ = y − Xŵ is orthogonal to span(cols(X)):
➔ xⱼᵀ r̂ = 0 for all j = 1, 2, …, p (where xⱼ is the jth column of X)
➔ Xᵀ r̂ = 0 (in matrix form)
(Recall: two vectors u, v are orthogonal if ⟨u, v⟩ = 0.)

  r̂ = y − Xŵ
  Xᵀ r̂ = 0
  Xᵀ(y − Xŵ) = 0
  Xᵀy = XᵀXŵ

ŵ (the least squares estimate) satisfies this system of equations,
i.e., ŵ is a solution to the linear system Xᵀy = XᵀXw.
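A sketch of solving this system directly with NumPy (toy data again; np.linalg.solve is applied to the p×p system XᵀX w = Xᵀy):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])

# Solve the normal equations  X^T X w = X^T y  for w.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)
print(X.T @ (y - X @ w_hat))   # ~ [0, 0]: residual orthogonal to the columns of X
```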

Recall what you learned from linear algebra class

Consider the following linear systems. For each, how many solutions are there
(zero, one, or many)? If one or more solutions exist, find one or more.
Why do the different cases have different numbers of solutions?

(a) 3x₁ + 2x₂ = 1      From the second equation, x₂ = −3x₁.
    3x₁ +  x₂ = 0      Substituting: −3x₁ = 1, so x₁ = −1/3, x₂ = 1  ➔ one solution

(b) 3x₁ + x₂ = 0       one equation, two unknowns  ➔ ∞ solutions

(c) [3 2] [x₁]   [1]   the same system as (a) in matrix form:
    [3 1] [x₂] = [0]   x₁ = −1/3, x₂ = 1  ➔ one solution

(d) 3x₁ + 2x₂ = 1      The first two equations give x₁ = −1/3, x₂ = 1,
    3x₁ +  x₂ = 0      but 2(−1/3) + 2(1) ≠ 2, so the third is violated
    2x₁ + 2x₂ = 2      ➔ 0 solutions

(e) [3 2] [x₁]   [1]
    [3 1] [x₂] = [0]   the same system as (d) in matrix form  ➔ 0 solutions
    [2 2]        [2]

(f) 3x₁ +  x₂ = 1      [3 1] [x₁]   [1]   the second row is 2× the first
    6x₁ + 2x₂ = 2      [6 2] [x₂] = [2]   (a rank-1 matrix!)  ➔ ∞ solutions
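These cases can be checked numerically. The sketch below (the helper name count_solutions is my own) uses the standard rank test: the system A x = b is consistent iff rank(A) = rank([A | b]), and the solution is unique iff, in addition, rank(A) equals the number of unknowns.

```python
import numpy as np

def count_solutions(A, b):
    """Return '0', '1', or 'inf' for the linear system A x = b."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    rank_A = np.linalg.matrix_rank(A)
    rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    if rank_A < rank_Ab:
        return "0"                        # inconsistent system
    return "1" if rank_A == A.shape[1] else "inf"

print(count_solutions([[3, 2], [3, 1]], [1, 0]))             # (c): 1
print(count_solutions([[3, 2], [3, 1], [2, 2]], [1, 0, 2]))  # (e): 0
print(count_solutions([[3, 1], [6, 2]], [1, 2]))             # (f): inf
```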
Linear Independence

A collection of vectors v₁, v₂, …, vₚ ∈ ℝⁿ is said to be linearly independent
if the only solution to the equation Σᵢ₌₁ᵖ aᵢvᵢ = 0 is a₁ = a₂ = ⋯ = aₚ = 0.
That is, any weighted sum of the vectors is zero only when all the weights are zero.

Ex. n = 3, p = 2

  v₁ = [1, 0, 0]ᵀ,  v₂ = [0, 1, 1]ᵀ

  a₁v₁ + a₂v₂ = [a₁, a₂, a₂]ᵀ = 0 ?

We want to check whether this can equal the zero vector only when a₁ = a₂ = 0.
Since the only solution is a₁ = a₂ = 0, the vectors v₁ and v₂ are linearly independent.
Linear Independence

Ex. n = 3, p = 3

  v₁ = [1, 0, 0]ᵀ,  v₂ = [0, 1, 1]ᵀ,  v₃ = [0, 1, 0]ᵀ  ➔ Yes, linearly independent (LI)

  a₁v₁ + a₂v₂ + a₃v₃ = [a₁, a₂ + a₃, a₂]ᵀ     This = 0 only if a₁ = a₂ = a₃ = 0
Linear Dependence

Ex. n = 3, p = 4

  v₁ = [1, 0, 0]ᵀ,  v₂ = [0, 1, 1]ᵀ,  v₃ = [0, 1, 0]ᵀ,  v₄ = [1, 0, 1]ᵀ

Note v₄ = v₁ + v₂ − v₃  ➔ v₄ is dependent on the other three vectors v₁, v₂, v₃.
We can write one vector as a linear combination (weighted sum) of the others.
➔ This implies linear dependence.

  a₁v₁ + a₂v₂ + a₃v₃ + a₄v₄ = [a₁ + a₄, a₂ + a₃, a₂ + a₄]ᵀ

  If a₁ = −a₄ = a₂ = −a₃ (e.g., a₁ = a₂ = 1, a₃ = a₄ = −1), then
  a₁v₁ + a₂v₂ + a₃v₃ + a₄v₄ = 0 with not all weights zero

➔ NOT linearly independent
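Both the independence and dependence examples above can be checked numerically (a sketch; the helper name is my own): stack the vectors as columns and compare the matrix rank to the number of vectors — full column rank means linearly independent.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = np.array([0.0, 1.0, 0.0])
v4 = np.array([1.0, 0.0, 1.0])   # = v1 + v2 - v3

def linearly_independent(*vectors):
    """True if the vectors are linearly independent (rank equals their count)."""
    V = np.column_stack(vectors)
    return np.linalg.matrix_rank(V) == len(vectors)

print(linearly_independent(v1, v2))          # True
print(linearly_independent(v1, v2, v3))      # True
print(linearly_independent(v1, v2, v3, v4))  # False (v4 = v1 + v2 - v3)
```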


Matrix Rank

Number of linearly independent (LI) columns of X
= number of linearly independent (LI) rows of X

A matrix X ∈ ℝⁿˣᵖ is called "full rank" if its rank equals the smallest dimension of X:

  rank(X) = min(n, p)

Ex.  X = [1 0]                      Ex.  X = [1 0 1]
         [0 1]  ➔ rank(X) = 2            [0 1 1]  ➔ rank(X) = 2
         [0 1]                            [0 1 1]
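Checking the slide's two example matrices with NumPy (a short sketch): only the first is full rank, since its rank equals its smallest dimension.

```python
import numpy as np

X1 = np.array([[1, 0],
               [0, 1],
               [0, 1]])
X2 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 1, 1]])

print(np.linalg.matrix_rank(X1), np.linalg.matrix_rank(X1) == min(X1.shape))  # 2, True  -> full rank
print(np.linalg.matrix_rank(X2), np.linalg.matrix_rank(X2) == min(X2.shape))  # 2, False -> not full rank
```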
Matrix Inverse

The inverse of a square matrix A is the matrix A⁻¹ that satisfies

  A A⁻¹ = A⁻¹ A = I = [1 0 ⋯ 0]
                      [0 1 ⋯ 0]
                      [⋮ ⋮ ⋱ ⋮]
                      [0 0 ⋯ 1]

For any vector v, Iv = v.

Ex. A = 3, A⁻¹ = 1/3

Ex. A = [1/4  0]  ➔  A⁻¹ = [4   0 ]
        [0    2]           [0  1/2]

Not all matrices have inverses

Only square matrices can have an inverse – but being square is not enough!
For a matrix A to have an inverse, it must also be full rank.

Ex. A = [1 0]   and   A = [1 2]   have no inverse.
        [0 0]             [2 4]
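A sketch of this in NumPy: np.linalg.inv returns the inverse for a full-rank square matrix and raises LinAlgError for a singular (rank-deficient) one.

```python
import numpy as np

A = np.array([[0.25, 0.0],
              [0.0,  2.0]])
print(np.linalg.inv(A))            # [[4, 0], [0, 0.5]]
print(A @ np.linalg.inv(A))        # identity matrix

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])         # rank 1, not full rank
try:
    np.linalg.inv(B)
except np.linalg.LinAlgError as e:
    print("no inverse:", e)        # singular matrix
```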
X ∈ ℝⁿˣᵖ. Assume that n ≥ p and rank(X) = p.

In other words, X has p linearly independent columns/features.

Because X has full column rank, XᵀX (a p × p matrix) is also full rank:

  rank(XᵀX) = p   (XᵀX has p linearly independent columns → XᵀX has an inverse)

X ∈ ℝⁿˣᵖ, n ≥ p, rank(X) = p  →  rank(XᵀX) = p  →  XᵀX has an inverse


Recall from earlier

Least squares estimation:

  ŵ = argmin_w ‖r(w)‖₂² = argmin_w Σᵢ₌₁ⁿ (yᵢ − ⟨w, xᵢ⟩)²

  ("the argument w that minimizes")

ŵ satisfies  Xᵀy = XᵀXŵ.

Multiply both sides by (XᵀX)⁻¹ (if XᵀX is invertible):

  (XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹(XᵀX)ŵ = Iŵ = ŵ

  ŵ = (XᵀX)⁻¹Xᵀy      ← the normal equation
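Putting it together, a sketch with toy random data (sizes and seed are my own): compute ŵ via the closed form and compare with np.linalg.lstsq, which minimizes the same objective but more stably; in practice one usually prefers lstsq or np.linalg.solve on the normal equations over forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))                   # n >= p, columns linearly independent
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))         # p -> X^T X is invertible

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y   # w_hat = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))         # True
```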
Why don't we just solve Xŵ − y = 0 (i.e., ŵ = X⁻¹y)?

In our problem, X is the feature matrix with dimensions roughly n × p
(n = number of training examples, p = feature dimension), so it is generally
not a square matrix.

An inverse exists only for square matrices, so a general n × p matrix has no inverse.

On the other hand, forming XᵀX (or XXᵀ) gives a square matrix, for which an inverse
can be computed (when it is full rank).


Further Readings
• Any linear algebra book should be fine!
• 기계학습 (author: 오일석), Section 2.1 has been posted to e-campus
  • Skip the Section 2.1.3 "Perceptron" part for now
• Mathematics for Machine Learning (MathML)
  • Chapters 2 & 3: Linear Algebra & Analytic Geometry
