
Maths Refresher

A refresher of mathematical tools for Machine Learning

Désiré Sidibé

Professor
Université d’Evry - Paris Saclay
IBISC Lab
[email protected]
https://sites.google.com/view/dsidibe/



Outline

1 Basics of Probability

2 Basics of Linear Algebra

3 Gradient Descent

4 Principal Component Analysis

5 Matrix Differentiation

6 Conclusion



Probability Theory

Why Probability?

Probability is the study of random variables or random processes.

Randomness is everywhere.

Understanding how to model uncertainty and how to analyze its effects is an essential part of every ML system design.

Two main questions:

How to model uncertainty? What do we mean by 'a random experiment'?
How to process observations to reduce uncertainty?



Probability Theory
Probability Space
A probability space is a triplet (Ω, F, P) where
Ω is a nonempty set, called the sample space;
F is a collection of subsets of Ω closed under countable set operations. The elements of F are called events;
P is a countably additive function from F to [0, 1] such that P(Ω) = 1, called a probability measure.

Axioms of Probability
1  0 ≤ P(A) ≤ 1, ∀ A ⊂ Ω
2  P(Ω) = 1
3  If A_1, A_2, . . . , A_n are mutually exclusive events (i.e. P(A_i ∩ A_j) = 0), then:

P(A_1 ∪ A_2 ∪ · · · ∪ A_n) = Σ_{i=1}^n P(A_i).
Probability Theory
Other laws of probability

The complement rule

P(Ā) = 1 − P(A).

The addition rule

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The law of total probability

P(A) = P(A ∩ B) + P(A ∩ B̄).

The chain rule

P(A_1 ∩ A_2 ∩ . . . ∩ A_n) = Π_{k=1}^n P(A_k | A_1 ∩ · · · ∩ A_{k−1})



Probability Theory
Conditional probability
The mathematical notion of conditional probabilities formalizes the idea of
how observations modify our belief about the likelihood of events.

We define the probability of A given B by

P(A | B) = P(A ∩ B) / P(B)

Bayes Theorem

P(A | B) = P(B | A) P(A) / P(B)

Bayes' rule is probably the most important result to know
Think of event A as a possible 'cause' of some effect B
For instance, your alarm can sound either if there is a fire or if there is no fire (a false alarm)
Given that the alarm sounds, what is the probability that there is a fire?
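As a quick numerical illustration of Bayes' rule on this alarm example, here is a minimal Python sketch; the prior P(fire) and the two likelihood values below are made up only for the demonstration.

# Bayes' rule on the fire/alarm example (illustrative numbers only)
p_fire = 0.01               # prior P(fire) -- assumed value
p_alarm_if_fire = 0.95      # P(alarm | fire) -- assumed value
p_alarm_if_no_fire = 0.05   # false-alarm rate P(alarm | no fire) -- assumed value

# Law of total probability: P(alarm)
p_alarm = p_alarm_if_fire * p_fire + p_alarm_if_no_fire * (1 - p_fire)

# Bayes: P(fire | alarm) = P(alarm | fire) P(fire) / P(alarm)
p_fire_given_alarm = p_alarm_if_fire * p_fire / p_alarm
print(f"P(fire | alarm) = {p_fire_given_alarm:.3f}")   # about 0.161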
Probability Theory

Independence
Two events A and B are said to be independent if knowing that A occurs
does not change the probability of B.

Independence
P (A ∩ B ) = P (A )P (B )
Or equivalently
P (A | B ) = P (A )
General definition
A collection of events Ai , i ∈ I are mutually independent if for any
sub-collection {i , j , . . . , k } ⊂ I, one has

P (Ai ∩ Aj ∩ · · · ∩ Ak ) = P (Ai )P (Aj ) · · · P (Ak ).



Probability Theory
Random Variables
A random variable (r.v.) is a function that assigns a real number to each outcome of a random experiment
Discrete r.v.
A r.v. X is discrete if it takes values in a countable set {x_n, n ≥ 1} ⊂ R
We can define P(X = x_n) = p_n with p_n > 0 and Σ_n p_n = 1
The collection {x_n, p_n, n ≥ 1} is then called the Probability Mass Function (pmf) of the r.v. X

Continuous r.v.
A r.v. X is continuous if one can write

P(X ∈ [a, b]) = F_X(b) − F_X(a) = ∫_a^b f_X(x) dx,  ∀ a < b

f_X(·) is a nonnegative function called the probability density function (pdf) of X.
Probability Theory
Examples of random variables
Bernoulli
We say that X has a Bernoulli distribution with parameter p ∈ [0, 1],
and we write X ∼ B(p ) if

P (X = 1) = p and P (X = 0) = 1 − p .

Binomial
X has a binomial distribution with parameters n ∈ {1, 2, . . .} and
p ∈ [0, 1], and we write X ∼ B(n, p ) if
P(X = m) = (n choose m) p^m (1 − p)^(n−m),  for m = 0, 1, . . . , n.

Geometric
X has a geometric distribution with parameter p ∈ (0, 1], and we write X ∼ G(p) if

P(X = n) = p (1 − p)^(n−1)  for n ≥ 1.
Probability Theory
Examples of random variables
Poisson
The r.v. X has a Poisson distribution with parameter λ, and we write
X ∼ P(λ) if
P(X = n) = (λ^n / n!) e^(−λ)  for n ≥ 0.

Uniform
X is uniformly distributed in the interval [a, b] where a < b, and we write X ∼ U[a, b] if

f_X(x) = 1/(b − a) if x ∈ [a, b],  and 0 otherwise.

Exponential
X is exponentially distributed with rate λ > 0, and we write X ∼ Exp(λ), if

f_X(x) = λ e^(−λx) if x > 0,  and 0 otherwise.
Probability Theory

Gaussian random variables


We say that X is a standard Gaussian (or standard normal) r.v., and we write X ∼ N(0, 1), if

f_X(x) = (1/√(2π)) exp(−x²/2),  for x ∈ R.

X is a Gaussian r.v. with mean µ and variance σ², and we write X ∼ N(µ, σ²), if X = µ + σY, where Y ∼ N(0, 1).
The pdf of X is given by

f_X(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  for x ∈ R.

The Gaussian distribution is the most used distribution in practice.
(This is because of the central limit theorem.)



Probability Theory

Gaussian random variables

A univariate Gaussian distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ.



Probability Theory

Multivariate Gaussian

For x ∈ R^d, x is normally distributed, x ∼ N(µ, Σ), if

p(x | µ, Σ) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) )

where
The mean is

µ = E[x] = ∫ x p(x) dx

The covariance matrix is

Σ = E[(x − µ)(x − µ)^T] = ∫ (x − µ)(x − µ)^T p(x) dx
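As a small NumPy sketch (the values of µ and Σ below are arbitrary), this density can be evaluated directly; the quadratic form in the exponent is the squared Mahalanobis distance discussed next.

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Evaluate the multivariate Gaussian density N(mu, Sigma) at x
    d = x.shape[0]
    diff = x - mu
    r2 = diff @ np.linalg.solve(Sigma, diff)     # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * r2) / norm_const

mu = np.array([0.0, 0.0])                        # illustrative parameters
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([1.0, -1.0]), mu, Sigma))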



Probability Theory
Gaussian random variables

The eigenvectors of Σ determine the directions (and the corresponding eigenvalues determine the lengths) of the principal axes.
The quantity r² = (x − µ)^T Σ^(−1) (x − µ) is called the squared Mahalanobis distance from x to µ.
Probability Theory

Central limit theorem

Let {X_n, n ≥ 1} be a set of independent, identically distributed (i.i.d.) r.v.'s governed by an arbitrary probability distribution with mean µ and finite variance σ².
Let's define the r.v. Y_n as:

Y_n ≜ (1/n) Σ_{i=1}^n X_i.

Then, we have

Y_n →_D N(µ, σ²/n) as n → ∞.
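A quick empirical check of this statement, as a NumPy sketch; the exponential distribution, the sample size n and the number of trials below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, trials, lam = 50, 10_000, 2.0                 # illustrative sizes and rate
samples = rng.exponential(scale=1/lam, size=(trials, n))
Y = samples.mean(axis=1)                         # Y_n = (1/n) sum_i X_i, one value per trial

mu, sigma2 = 1/lam, 1/lam**2                     # mean and variance of Exp(lambda)
print("empirical mean of Y_n:", Y.mean(), "  theory:", mu)
print("empirical var  of Y_n:", Y.var(), "  theory sigma^2/n:", sigma2 / n)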



Probability Theory
Expectation and moments of a r.v.
Expected value
For a discrete r.v. X

E(X) = Σ_n x_n P(X = x_n)

For a continuous r.v.

E(X) = ∫_{−∞}^{+∞} x f_X(x) dx

Linearity of expectation
E(af(X) + bg(Y)) = aE(f(X)) + bE(g(Y))

Moments
The n-th moment of a r.v. X is E(X^n)
The expected value is the first moment of the r.v.
The variance of a r.v. X is defined by

V(X) ≜ E[(X − E(X))²] = E(X²) − {E(X)}².



Probability Theory

Useful inequalities

Exponential bound:
1 + x ≤ exp (x ). (1)
Chebychev:
P(|X| ≥ a) ≤ E(X²)/a². (2)
Markov inequality: If f(·) is nonnegative and nondecreasing on [a, +∞), then
P(X ≥ a) ≤ {E(f(X))}/f(a). (3)
Jensen inequality: If f (.) is convex, then

E(f (X )) ≥ f (E(X )). (4)



Joint distribution
Given a pair of r.v. X , Y , the joint distribution is specified by the joint
cumulative distribution function FX ,Y defined as:
FX ,Y (x , y ) = P (X ≤ x and Y ≤ y ), x , y ∈ R.
In the continuous case we have:

F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) du dv,

where the nonnegative function f_{X,Y}(x, y) is called the joint pdf.
Marginalization
In the discrete case:

P_X(x) = Σ_y P_{X,Y}(x, y).

In the continuous case:

f_X(x) = ∫_{−∞}^{+∞} f_{X,Y}(x, y) dy.



Joint distribution
Covariance and correlation
The covariance is a measure of dependence, or joint variability
Cov (X , Y ) = E[(X − E(X ))(Y − E(Y ))] = E(XY ) − E(X )E(Y ).
The correlation coefficient between X and Y is given by:

ρ_{XY} = Cov(X, Y) / √(V(X) V(Y)).

The sign of the correlation shows the direction of the linear relationship between the variables.



Outline

1 Basics of Probability

2 Basics of Linear Algebra

3 Gradient Descent

4 Principal Component Analysis

5 Matrix Differentiation

6 Conclusion



Matrices

What is a matrix?

A matrix is one way of describing (or representing) a linear transformation between two vector spaces.
A general m × n matrix A represents a linear transformation from R^n to R^m.

The matrix acts on vectors x ∈ Rn to produce vectors y ∈ Rm as y = A x.



Linear System

Basic questions

Does the system A x = b have a solution?
If yes, how many solution(s)?
How to find the solution(s)?

For example, can we solve the following system?

[ 1 1 ]     [ 1 ]
[ 1 1 ] x = [ 1 ]

How many solutions, if any?



Column space and nullspace

Column space
The column space of A , denoted by C (A ) and also called range or span of
A , is the subspace of Rm such that:
y ∈ C (A ) if and only if y = Ax for some x ∈ Rn .

Nullspace
The nullspace of A , denoted by N (A ) and also called kernel, is the
subspace of Rn such that:
x ∈ N (A ) if and only if Ax = 0.

C(A) is equal to the set of all linear combinations of the columns of A.
N(A) is exactly the set of vectors which are orthogonal to all the row vectors of A.



Rank of a matrix

Rank
The rank of a matrix is the dimension of its column space.

rank(A) ≜ dim(C(A)).

The rank is the most fundamental notion about a matrix


The rank of A is equal to the maximum number of linearly
independent columns (or rows) of A
What are the ranks of the following matrices?

[ 1 2 ]   [ 1 2 ]
[ 2 1 ] ; [ 1 2 ]



Rank of a matrix
Rank theorem
If A is an m × n matrix, then rank(A) + dim N(A) = n.

Figure: The big picture of linear algebra (from G. Strang)



Solving linear systems

The main problem in linear algebra: solve A x = b


One can solve A x = b iff b ∈ C (A )
The rank of A tells everything

Table: A is an m × n matrix of rank r

r = m = n       A x = b has a unique solution
r = n < m       A x = b has either 0 or a unique solution
r = m < n       A x = b has ∞ many solutions
r < m, r < n    A x = b has either 0 or ∞ solutions



Solving linear systems

What if b ∉ C(A)? Linear Least Squares (LLS)

Find x ∈ R^n such that ‖r‖² = ‖A x − b‖² is minimum.

Project b onto C(A), and solve A x̂ = p
The "best" (minimum mean squared error) solution is given by the normal equation:

A^T A x̂ = A^T b

If A^T A is invertible, then the LLS solution is given by

x̂ = (A^T A)^(−1) A^T b
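A minimal numerical sketch (NumPy, with a small made-up overdetermined system) comparing the normal-equation solution with the library least-squares solver:

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # 3 equations, 2 unknowns (illustrative)
b = np.array([1.0, 2.0, 2.0])

# Normal equations: x_hat = (A^T A)^{-1} A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)           # the two solutions agree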



Eigen-decomposition

Eigenvalues/Eigenvectors
Given a square n × n matrix A, we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector if

A x = λx,  x ≠ 0.

Properties of eigenvalues
The rank of A is equal to the number of non-zero eigenvalues.
If A is a non-singular matrix (all of its eigenvalues are non-zero) then
1/λi is an eigenvalue of A −1 with associated eigenvector xi .



Eigen-decomposition

Properties of eigenvalues

The sum of the eigenvalues of A is equal to its trace

trace(A) = Σ_{i=1}^n A_ii = Σ_{i=1}^n λ_i.

The determinant of A is equal to the product of its eigenvalues

det(A) = |A| = Π_{i=1}^n λ_i.



Eigen-decomposition
Properties of eigenvalues
Different eigenvalues ⇒ linearly independent eigenvectors

λ_i ≠ λ_j ⇒ x_i and x_j are independent

If A has n different eigenvalues, then A can be diagonalized as

A = S Λ S^(−1),  with S = [x_1, . . . , x_n] and Λ = diag(λ_1, . . . , λ_n)

Powers of A are easily obtained as A^k = S Λ^k S^(−1)
useful to solve recurrence equations such as u_{k+1} = A u_k
useful to exponentiate the matrix: e^A = Σ_{k=0}^∞ A^k / k!
If A is symmetric, then we can write A = S Λ S^T
If the eigenvalues of A are not all different, it may or may not be possible to diagonalize A.
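A short NumPy sketch (on an arbitrary 2 × 2 symmetric matrix) checking the decomposition A = S Λ S^(−1) and using it to compute a matrix power:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                       # illustrative symmetric matrix

eigvals, S = np.linalg.eig(A)                    # columns of S are eigenvectors
Lam = np.diag(eigvals)

# Reconstruction A = S Lam S^{-1}
print(np.allclose(A, S @ Lam @ np.linalg.inv(S)))            # True

# A^5 via the decomposition vs. direct matrix power
A5 = S @ np.diag(eigvals**5) @ np.linalg.inv(S)
print(np.allclose(A5, np.linalg.matrix_power(A, 5)))         # True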
Singular value decomposition
SVD: generalization of eigenvalues/eigenvectors concept for non-square
matrices

Any general m × n matrix A of rank r can be decomposed as

A = U ΣV T

with
U an orthogonal m × m matrix: U U^T = I
Σ an m × n diagonal matrix: Σ = diag(σ_1, . . . , σ_r, 0, . . . , 0)
V an orthogonal n × n matrix: V V^T = I



Singular value decomposition

Any general m × n matrix A can be decomposed as : A = U ΣV T

Figure: Geometric interpretation of SVD.



The usefulness of SVD
Probably the most important tool.

A = U ΣV T

Solving linear systems: A x = b

x̂ = A⁺ b, where A⁺ is the pseudo-inverse of A given by

A⁺ = V diag(1/σ_1, . . . , 1/σ_r) U^T

Solving homogeneous systems: A x = 0

x̂ = the right singular vector corresponding to the smallest singular value.
x̂ = V(:, end), in MATLAB notation.

Approximating a matrix A

The best rank-k approximation of A is Â = Σ_{i=1}^k σ_i u_i v_i^T.

Many more ...
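A NumPy sketch of the first two uses, on an arbitrary 3 × 2 matrix A (not the author's example):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 0.0, 1.0])

U, s, Vt = np.linalg.svd(A)                      # A = U diag(s) V^T

# Pseudo-inverse solution of A x = b (same result as np.linalg.pinv(A) @ b)
x_hat = Vt.T @ np.diag(1.0 / s) @ U[:, :len(s)].T @ b
print(np.allclose(x_hat, np.linalg.pinv(A) @ b))             # True

# Homogeneous system A x = 0: right singular vector of the smallest singular value
x0 = Vt[-1]                                      # V(:, end) in MATLAB notation
print(np.linalg.norm(A @ x0))                    # equals the smallest singular value (~0 if A were rank-deficient)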



Norms of Vectors and Matrices
Norm of a vector

‖x‖_p = ( Σ_{i=1}^n |x_i|^p )^(1/p)

If p = 1, we get the L1-norm:

‖x‖_1 = Σ_i |x_i|

If p = 2, we get the L2-norm:

‖x‖_2 = ( Σ_{i=1}^n x_i² )^(1/2)

If p = ∞, we get the L∞-norm:

‖x‖_∞ = max_i |x_i|

It is particularly useful to note that

‖x‖_2² = Σ_{i=1}^n x_i² = x^T x
Norms of Vectors and Matrices

Norm of a matrix

Frobenius norm

‖A‖_F = √( Σ_i Σ_j A_ij² ) = √( trace(A A^T) )

Thinking of a matrix as a linear transformation, we can ask 'how does the length of x grow when it gets transformed by A'?

‖A‖ = max_{x ≠ 0} ‖A x‖ / ‖x‖ = max_{‖x‖ = 1} ‖A x‖
x,0 kxk kxk=1



Norms of Vectors and Matrices

Properties of norms

‖A‖ > 0 ∀ A ≠ 0

‖kA‖ = |k| ‖A‖ ∀ k
Triangle inequality

‖A + B‖ ≤ ‖A‖ + ‖B‖

Cauchy-Schwartz inequality

‖A B‖ ≤ ‖A‖ ‖B‖



SVD applications

SVD is a fundamental tool for data analysis and is often used in computer
vision and machine learning applications
Image compression
Image denoising
Pattern classification
Transformation estimation
etc.



SVD applications
Image denoising
A noisy image X can be decomposed as: X = U Σ V^T = Σ_{i=1}^r σ_i u_i v_i^T, where each σ_i u_i v_i^T is a rank-one component of X.
A noiseless approximation of X is obtained by truncating the sum at k terms: X̂ = Σ_{i=1}^k σ_i u_i v_i^T.

Figure: Image denoising with SVD (k = 10, 50, 100).
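A sketch of the truncation step in NumPy, on a random matrix standing in for a noisy image (the size and k are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))                # stand-in for a noisy image

def rank_k_approx(X, k):
    # Best rank-k approximation: sum of the first k terms sigma_i u_i v_i^T
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

X_hat = rank_k_approx(X, k=10)
print(X_hat.shape, np.linalg.norm(X - X_hat, 'fro'))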



SVD applications

Image denoising
It is better to work with local patches
Denoise each local patch with SVD

Figure: Image denoising with SVD on local patches (k = 1, 2, 10).



Outline

1 Basics of Probability

2 Basics of Linear Algebra

3 Gradient Descent

4 Principal Component Analysis

5 Matrix Differentiation

6 Conclusion



Gradient Descent

A gentle intro to gradient descent methods

Iterative methods
Most optimization techniques are iterative:
starting from an initial point x_0
they produce a series of vectors {x_k}_{k=1,..,N} which hopefully converge towards a stationary point x̂

Gradient Descent
The most popular methods for continuous optimization
Simple and intuitive
Work under very few assumptions
Suitable for large-scale problems
Easy to parallelize for problems with many terms in the objective



Gradient Descent

Gradient Descent Algorithm


Given x0
Iterate until convergence

xk +1 = xk + αk pk

So, at each iteration, we need to


find a descent direction pk , i.e. a direction that ensures that
f (xk +1 ) < f (xk )
find a step size αk , i.e. how far to move along the chosen direction

We also need to decide when to stop iterating (stopping criteria).
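A minimal gradient descent sketch in Python on a toy quadratic f(x) = ½ x^T A x − b^T x; the matrix A, the fixed step size and the stopping tolerance are illustrative choices.

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

grad = lambda x: A @ x - b                       # gradient of f
x = np.zeros(2)                                  # x0
alpha = 0.1                                      # fixed step size (assumed small enough)

for k in range(1000):
    p = -grad(x)                                 # steepest descent direction
    x = x + alpha * p
    if np.linalg.norm(grad(x)) < 1e-8:           # stopping criterion ||grad f(x_k)|| < eps
        break

print(x, np.linalg.solve(A, b))                  # converges to the minimizer A^{-1} b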



Gradient Descent
Descent directions
A direction p_k is a descent direction from x_k if p_k^T ∇f(x_k) < 0.

Steepest descent directions

The direction along which f decreases most rapidly is given by p_k = −∇f(x_k).

If ∇f(x) ≠ 0 and B is a symmetric p.d. matrix, then −B ∇f(x) and −B^(−1) ∇f(x) are descent directions.
Gradient Descent
Step size

Small step size


Pros: iterations are more likely to converge
Cons: need more iterations and thus more evaluations of ∇f

Large step size


Pros: better use of each ∇f(x_k), may reduce the number of iterations
Cons: can cause overshooting and zig-zags;
too large ⇒ divergent iterations

In practice
use a fixed value if ∇f is Lipschitz (bounded rate of change)
apply line search

α_k = arg min_α f(x_k + α p_k)



Gradient Descent

When to stop?

We cannot use ‖x − x*‖ or |f(x) − f(x*)| since we don't know the true minimizer x*.

Instead, in practice, given a small ε > 0:

1  ‖∇f(x_k)‖ < ε

2  ‖x_k − x_{k−1}‖ < ε  or  ‖x_k − x_{k−1}‖ < ε ‖x_k‖

3  |f(x_k) − f(x_{k−1})| < ε  or  |f(x_k) − f(x_{k−1})| < ε |f(x_k)|

4  k ≥ k_max (maximum number of iterations reached)



Gradient Descent
Learning problem
w* = arg min_w (1/N) Σ_{i=1}^N E(x_i; w).

Batch gradient descent

At each step,

w^(k+1) ← w^(k) − α_k [ (1/N) Σ_{i=1}^N ∂E(x_i; w)/∂w ]

Issues
At each step, the gradient is averaged over all training samples,
i = 1, . . . , N.
Very slow when N is large.
Stochastic Gradient Descent

Avoid averaging gradient over all training samples

Stochastic gradient descent (SGD)


At each step,
Randomly select one training example xi and do

w^(k+1) ← w^(k) − α_k ∂E(x_i; w)/∂w

Questions
The gradient of one random sample is not the gradient of the
objective function
Would this work at all?
Yes, SGD converges to the expected loss minimum !
But, convergence is slow



Stochastic Gradient Descent

The mini-batch approach

Mini-Batch SGD
At each step,
Randomly select a subset of training examples S_k ⊂ {1, . . . , N} and do

w^(k+1) ← w^(k) − α_k [ (1/|S_k|) Σ_{i ∈ S_k} ∂E(x_i; w)/∂w ]

Pros
Reduced variance of the stochastic gradient estimates
Offers some degree of parallelization
Widely used in practice
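A mini-batch SGD sketch in Python on least-squares linear regression with synthetic data; the problem sizes, step size and batch size are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(N)

w = np.zeros(d)
alpha, batch_size = 0.05, 32

for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)      # random mini-batch S_k
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size                  # gradient averaged over the batch
    w -= alpha * grad

print(np.linalg.norm(w - w_true))                # small: w is close to w_true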



Stochastic Gradient Descent

SGD, with mini-batch, is the main method used in Deep Learning



Outline

1 Basics of Probability

2 Basics of Linear Algebra

3 Gradient Descent

4 Principal Component Analysis

5 Matrix Differentiation

6 Conclusion



PCA

Principal Component Analysis (PCA) is arguably the most common method for dimensionality reduction.

But, what is PCA?


PCA

What is PCA?

Most common answer would be 'an algorithm for dimensionality reduction'
Yes, but:
Where does the algorithm come from?
What’s the underlying model?
PCA is actually many different things (models)
latent variable model (Hotelling, 1930s)
variance maximization directions (Pearson, 1901)
optimal linear reconstruction (Kosambi-Karhunen-Loève transform in
signal processing)
It just turns out that these different models lead to the same algorithm
(in the linear Gaussian case)



PCA

What is PCA?

Goal of PCA
The main goal of PCA is to express a complex data set in a new set of basis vectors that 'best' explain the data

So, PCA is essentially a change of basis


We want to find the most meaningful basis to re-express the data
such that
the new basis reveals hidden structure
the new basis removes redundancy
Most of the time, we would like a lower dimensional space.



PCA
Digits visualization

Each digit is an image of size 28 × 28, that is, a vector in R^784.


PCA

Digits visualization



PCA

How does it work?

We project the data onto a lower dimensional space

X = [x_1 x_2 . . . x_N] (p × N)  =⇒  Y = [y_1 y_2 . . . y_N] (k × N),  with k ≪ p.

So, we need to find a projection matrix P such that

Y = P^T X,  where P is a matrix of size p × k.



PCA algorithm

The algorithm

Given a set of N data samples x_i ∈ R^d such that Σ_i x_i = 0:

1  Compute the sample covariance matrix C = (1/N) Σ_{i=1}^N x_i x_i^T. Note that C is a d × d matrix.
2  Compute the eigen-decomposition of C: C = U Λ U^T
   U is an orthogonal d × d matrix: U = [u_1, u_2, . . . , u_d]
   Λ is a diagonal matrix: Λ = diag(λ_1, λ_2, · · · , λ_d).
3  Since C is symmetric, its eigenvectors u_1, u_2, . . . , u_d form a basis of R^d.
   The eigenvectors u_1, u_2, . . . , u_d are called principal components
   The corresponding eigenvalues λ_1 > λ_2 > · · · > λ_d give the importance of each principal axis.
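A NumPy sketch of these three steps on synthetic data; note that here rows are samples, whereas the slides stack the samples as columns, and all sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
N, d, k = 500, 10, 2
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))    # correlated data, rows = samples

X = X - X.mean(axis=0)                           # center the data
C = (X.T @ X) / N                                # sample covariance, d x d

eigvals, U = np.linalg.eigh(C)                   # eigen-decomposition of the symmetric C
order = np.argsort(eigvals)[::-1]                # sort by decreasing eigenvalue
eigvals, U = eigvals[order], U[:, order]

Uk = U[:, :k]                                    # top-k principal components
Y = X @ Uk                                       # projected data, N x k
print(Y.shape, eigvals[:k])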



PCA algorithm

The PCA algorithm is pretty simple

First, center the data (if it is not): Σ_i x_i = 0
Then, compute the sample covariance matrix and its eigenvectors
Finally, each sample point x_i can be represented in the new basis (projection onto the eigenspace) as

y_i = U^T x_i

We claim that the new representation makes the data uncorrelated, i.e. Cov(y_i, y_j) = 0 if i ≠ j.



PCA algorithm

We claim that the new representation makes the data uncorrelated

Why?
The sample covariance of the transformed data is

C_new = (1/N) Σ_{i=1}^N y_i y_i^T = (1/N) Σ_{i=1}^N (U^T x_i)(U^T x_i)^T
      = U^T [ (1/N) Σ_{i=1}^N x_i x_i^T ] U
      = U^T C U = U^T (U Λ U^T) U = (U^T U) Λ (U^T U)
      = Λ

Hence, when projected onto the principal components, the data is decorrelated.



PCA algorithm

Dimensionality reduction
We usually want to represent our data in a lower dimensional space R^k, with k ≪ d.
We achieve this by projecting onto the k principal axes which preserve most of the variance in the data.
From the previous analysis, we see that those axes correspond to the eigenvectors associated with the k largest eigenvalues

U = [u_1 u_2 . . . u_d] (d × d)  ⇒  U_k = [u_1 u_2 . . . u_k] (d × k)

The projected data is then y_i = U_k^T x_i, with y_i ∈ R^k.



Data reconstruction

The principal axes are axes of maximum variance


For each data point x_i, the projection y_i = U_k^T x_i is the best k-dimensional approximation to x_i (best in the minimum mean square error sense)
The principal axes (i.e. the matrix U_k) form the best set of orthogonal basis vectors which minimize the average reconstruction error

U_k = arg min_W (1/N) Σ_{i=1}^N ‖x_i − W W^T x_i‖²



Data reconstruction



PCA algorithm

Dual PCA

Suppose we are working with images, each of size M × N

We represent an image as a vector x ∈ R^d, with d = MN
The sample covariance is given by C = (1/N) X X^T
C is a d × d matrix
When the images have high resolution, d is large and so is C
Imagine computing the eigenvalues/eigenvectors of a 1000000 × 1000000 matrix with MATLAB!
Moreover, the number N of images is usually much smaller than d.
The dual PCA algorithm is a small size trick.



PCA algorithm
Dual PCA

Let X be the d × N data matrix X = [x_1, x_2, . . . , x_N], x_i ∈ R^d

The sample covariance can be computed as C = (1/N) X X^T
If N ≪ d, then it is better to work with C′ = (1/N) X^T X
C′ is an N × N matrix
Let C′ = U′ Λ′ U′^T be the eigen-decomposition of C′
We have Λ = Λ′, i.e. the (non-zero) eigenvalues of C and C′ are equal
We have u_i = X u′_i, for all i
Working with C′ is computationally less expensive if N ≪ d.
We get the eigenvectors of C′: u′_i, i = 1, . . . , N
And those of C, the principal components we care about, are given as u_i = X u′_i.

The matrix C′ = (1/N) X^T X is called the Gram (or Gramian) matrix.
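A NumPy sketch of the Gram-matrix trick on synthetic data with N ≪ d (columns are samples, sizes arbitrary); it checks that u = X v is an eigenvector of C without ever forming the d × d covariance matrix.

import numpy as np

rng = np.random.default_rng(0)
d, N = 5000, 50
X = rng.standard_normal((d, N))
X = X - X.mean(axis=1, keepdims=True)            # center

G = (X.T @ X) / N                                # N x N Gram matrix (cheap)
lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

U = X @ V                                        # u_i = X v_i are eigenvectors of C = (1/N) X X^T
U = U / np.linalg.norm(U, axis=0)                # normalize to unit length

# Check C u_1 = lambda_1 u_1, computing C u_1 as X (X^T u_1) / N
Cu1 = X @ (X.T @ U[:, 0]) / N
print(np.allclose(Cu1, lam[0] * U[:, 0]))        # True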
PCA vs Dual PCA

PCA:
Given X = [x_1 · · · x_N] (p × N)
Compute the covariance matrix C = (1/N) X X^T
Find the eigenvalues and eigenvectors of C: u_i

Dual PCA:
Given X = [x_1 · · · x_N] (p × N)
Compute the Gram matrix G = (1/N) X^T X
Find the eigenvalues and eigenvectors of G: v_i

We have u_i = X v_i ∀i.

C is a p × p matrix, while G is an N × N matrix.
We use one or the other depending on the number of data points and the dimension of the data.



PCA algorithm
Connection with SVD
PCA & SVD
There is a direct link between PCA and SVD

Let X be the d × N data matrix X = [x_1, x_2, . . . , x_N]

The sample covariance can be computed as C = (1/N) X X^T
The eigenvectors of C are the principal components
The SVD of X is given as X = U Σ V^T, where U is orthogonal d × d and V is orthogonal N × N.
The columns of U are eigenvectors of X X^T
So, the columns of U are the principal components
The singular values of X are ordered as the eigenvalues of C, since σ_i² = N λ_i
The columns of V are the 'dual' principal components
SVD gives it all!
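A NumPy sketch of this connection on synthetic centered data (arbitrary sizes): the left singular vectors of X match the eigenvectors of C up to sign, and the eigenvalues of C are σ_i²/N.

import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 200
X = rng.standard_normal((d, N))
X = X - X.mean(axis=1, keepdims=True)

C = (X @ X.T) / N                                # sample covariance
lam, U_eig = np.linalg.eigh(C)
lam = lam[::-1]                                  # decreasing order

U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(lam, s**2 / N))                              # True: lambda_i = sigma_i^2 / N
print(np.allclose(np.abs(U_svd[:, 0]), np.abs(U_eig[:, -1])))  # first PC matches up to sign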



Outline

1 Basics of Probability

2 Basics of Linear Algebra

3 Gradient Descent

4 Principal Component Analysis

5 Matrix Differentiation

6 Conclusion



Differentiable Functions
Let f : R^n → R be a function, x ∈ R^n a vector and p ∈ R^n a direction.
f is said to be differentiable at x when f′(x; p), called the directional derivative of f along direction p at point x, defined by the limit

f′(x; p) = lim_{λ↓0} [f(x + λp) − f(x)] / λ,

exists for all directions p ∈ R^n.
If the directional derivative f′(x; p) : R^n → R is linear in p, we can write

f′(x; p) = ∇f(x)^T p  for all p ∈ R^n.

∇f(x) is the gradient of f at x and is given by ∇f(x) = [∂f(x)/∂x_1, . . . , ∂f(x)/∂x_n]^T, where ∂f(x)/∂x_i for i = 1, . . . , n is a partial derivative of f at x given by

∂f(x)/∂x_i = lim_{λ→0} [f(x + λe_i) − f(x)] / λ.
Differentiable Functions

If f : R^n → R has second partial derivatives ∂²f(x)/∂x_i∂x_j for all i, j at a given point x, we say that f is twice differentiable at x.

The matrix ∇²f(x) is called the Hessian; it is the n × n matrix whose (i, j) entry is ∂²f(x)/∂x_i∂x_j:

∇²f(x) = [ ∂²f(x)/∂x_i∂x_j ]_{i,j = 1,...,n}

Since second partial derivatives are symmetric, ∂²f(x)/∂x_i∂x_j = ∂²f(x)/∂x_j∂x_i, the Hessian is a symmetric matrix.



Taylor’s Expansions

Let X ⊆ R^n be an open set and let x, y ∈ X.
Let f : X → R be a function.
1  If f is continuously differentiable over X, then

f(y) = f(x) + ∇f(x)^T (y − x) + o(‖y − x‖).

2  If f is twice continuously differentiable over X, then

f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(x) (y − x) + o(‖y − x‖²),

where o(α) denotes a continuous function such that lim_{α→0} o(α)/α = 0.



Differentiable Maps
Let F : R^n → R^m be a mapping given by

F(x) = [f_1(x), . . . , f_m(x)]^T,

where f_i : R^n → R.
When each f_i is differentiable at a given x, then the mapping F is differentiable at x.
The matrix whose rows are ∇f_1(x)^T, . . . , ∇f_m(x)^T is called the Jacobian of F at x, and denoted J_F(x).
J_F(x) is an m × n matrix given by

J_F(x) = [ ∇f_1(x)^T ; . . . ; ∇f_m(x)^T ].


Matrix Differentiation

In the following, x and y are vectors while x and y are scalars.

Derivative of a scalar w.r.t. vector

If y is a scalar,

∂y/∂x ≜ [∂y/∂x_1, . . . , ∂y/∂x_n]^T.

Derivative of a vector w.r.t. scalar

If x is a scalar,

∂y/∂x ≜ [∂y_1/∂x, · · · , ∂y_m/∂x].



Matrix Differentiation
Derivative of a vector w.r.t. vector
The derivative of vector y w.r.t. vector x is the n × m matrix whose (i, j) entry is ∂y_j/∂x_i:

∂y/∂x ≜ [ ∂y_j/∂x_i ]_{i=1..n, j=1..m}

Useful formulas

y(x)        ∂y/∂x
A x         A^T
x^T A       A
x^T x       2x
x^T A x     A x + A^T x

Table: Some useful derivative formulas.
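A quick numerical check of the last formula in the table, ∂(x^T A x)/∂x = A x + A^T x, by central finite differences (NumPy, with a random A and x):

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda x: x @ A @ x                          # scalar function x^T A x
analytic = A @ x + A.T @ x

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])         # central difference along each coordinate

print(np.allclose(analytic, numeric, atol=1e-5)) # True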



Chain Rule
Chain rule
Let x, y and z be vectors such that z is a function of y, which is in turn a function of x. We have the following rule

∂z/∂x = (∂y/∂x) (∂z/∂y).

Derivative of scalar functions of a matrix

Let X be an m × n matrix, and let y = f(X) be a scalar function.
The derivative of y w.r.t. X is defined as the following m × n matrix

G = ∂y/∂X = [ ∂y/∂x_ij ]_{i,j} = Σ_{i,j} (∂y/∂x_ij) E_ij,

where E_ij denotes the elementary m × n matrix (i.e. a matrix with all entries equal to zero except for the (i, j) entry which is one).
Outline

1 Basics of Probability

2 Basics of Linear Algebra

3 Gradient Descent

4 Principal Component Analysis

5 Matrix Differentiation

6 Conclusion



Conclusions

End of this refresher

To be continued ...
Many useful tools not included:
Automatic differentiation
Numerical optimization
Kernel methods
Gaussian processes
Monte-Carlo methods
etc.



References

M. P. Deisenroth, A. A. Faisal, C. S. Ong (2020). Mathematics for Machine Learning. Cambridge University Press.

G. Strang (2009). Introduction to Linear Algebra. Wellesley-Cambridge Press and SIAM.

D. P. Bertsekas, J. N. Tsitsiklis (2002). Introduction to Probability. Athena Scientific.

