Maths Refresher
Désiré Sidibé
Professor
Université d’Evry - Paris Saclay
IBISC Lab
[email protected]
https://sites.google.com/view/dsidibe/
Outline
1 Basics of Probability
2 Linear Algebra
3 Gradient Descent
4 Principal Component Analysis (PCA)
5 Matrix Differentiation
6 Conclusion
Why Probability?
Axioms of Probability
1 0 ≤ P(A) ≤ 1, ∀ A ⊂ Ω
2 P(Ω) = 1
3 If A1, A2, . . . , An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0 for i ≠ j), then:
P(A1 ∪ A2 ∪ · · · ∪ An) = ∑_{i=1}^{n} P(Ai).
Probability Theory
Other laws of probability
P(Aᶜ) = 1 − P(A), where Aᶜ denotes the complement of A.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
P(A) = P(A ∩ B) + P(A ∩ Bᶜ).
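These identities are easy to verify by Monte-Carlo simulation; below is a minimal sketch (the two die-roll events A and B are an illustrative example, not from the slides), assuming NumPy is available.

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair six-sided die

A = rolls % 2 == 0          # event A: roll is even
B = rolls >= 4              # event B: roll is at least 4

p_union = np.mean(A | B)
p_sum = np.mean(A) + np.mean(B) - np.mean(A & B)
print(p_union, p_sum)       # both close to P(A ∪ B) = 2/3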
Independence
Two events A and B are said to be independent if knowing that A occurs
does not change the probability of B.
Independence
P (A ∩ B ) = P (A )P (B )
Or equivalently
P (A | B ) = P (A )
General definition
A collection of events Ai, i ∈ I, is mutually independent if, for any sub-collection {i, j, . . . , k} ⊂ I, one has
P(Ai ∩ Aj ∩ · · · ∩ Ak) = P(Ai) P(Aj) · · · P(Ak).
Continuous r.v.
A r.v. X is continuous if one can write, for a density function fX ≥ 0,
P(X ∈ [a, b]) = FX(b) − FX(a) = ∫_a^b fX(x) dx, ∀ a < b,
where FX is the cumulative distribution function of X.
Examples of random variables
Bernoulli
X has a Bernoulli distribution with parameter p ∈ [0, 1] if
P(X = 1) = p and P(X = 0) = 1 − p.
Binomial
X has a binomial distribution with parameters n ∈ {1, 2, . . .} and
p ∈ [0, 1], and we write X ∼ B(n, p ) if
P(X = m) = \binom{n}{m} p^m (1 − p)^{n−m}, for m = 0, 1, . . . , n.
Geometric
X has a geometric distribution with parameter p ∈ (0, 1], and we write
X ∼ G(p ) if
P(X = n) = p (1 − p)^{n−1}, for n ≥ 1.
Probability Theory
Examples of random variables
Poisson
The r.v. X has a Poisson distribution with parameter λ, and we write
X ∼ P(λ) if
P(X = n) = (λ^n / n!) e^{−λ}, for n ≥ 0.
Uniform
X is uniformly distributed in the interval [a , b ] where a < b, and we
write X ∼ U[a , b ] if
fX(x) = 1/(b − a) if x ∈ [a, b], and fX(x) = 0 otherwise.
Exponential
X is exponentially distributed with rate λ > 0, and we write
X ∼ Exp (λ), if
fX(x) = λ e^{−λx} if x > 0, and fX(x) = 0 otherwise.
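All of these distributions are available in numpy.random; the sketch below (with arbitrary illustrative parameters) draws samples and compares empirical means with the theoretical values np, 1/p, λ, (a + b)/2 and 1/λ.

import numpy as np

rng = np.random.default_rng(42)
N = 100_000

samples = {
    "binomial B(10, 0.3), mean n*p = 3": rng.binomial(10, 0.3, N),
    "geometric G(0.2), mean 1/p = 5": rng.geometric(0.2, N),
    "poisson P(4), mean lambda = 4": rng.poisson(4.0, N),
    "uniform U[2, 6], mean (a+b)/2 = 4": rng.uniform(2.0, 6.0, N),
    "exponential Exp(2), mean 1/lambda = 0.5": rng.exponential(1 / 2.0, N),
}
for name, x in samples.items():
    print(f"{name:42s} empirical mean = {x.mean():.3f}")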
Probability Theory
Standard Gaussian
X has a standard Gaussian (normal) distribution, X ∼ N(0, 1), if
fX(x) = (1/√(2π)) exp(−x²/2), for x ∈ R.
Gaussian
X has a Gaussian distribution with mean µ and variance σ², X ∼ N(µ, σ²), if
fX(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), for x ∈ R.
Multivariate Gaussian
A random vector x ∈ R^d has a multivariate Gaussian distribution, x ∼ N(µ, Σ), if
p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)),
where
the mean is µ = E[x] = ∫ x p(x) dx,
and the covariance matrix is Σ = E[(x − µ)(x − µ)ᵀ].
Central Limit Theorem: let X1, . . . , Xn be i.i.d. with mean µ and variance σ², and let Yn = (1/n) ∑_{i=1}^{n} Xi be their sample mean. Then, we have
Yn →D N(µ, σ²/n) as n → ∞.
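A minimal simulation of this behaviour, assuming the Xi are exponential (any distribution with finite variance would do):

import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 2.0, 200, 50_000

# sample means of n i.i.d. Exp(lam) variables (mu = 1/lam, sigma^2 = 1/lam^2)
Y = rng.exponential(1 / lam, size=(trials, n)).mean(axis=1)

print(Y.mean(), 1 / lam)             # close to mu
print(Y.var(), 1 / (lam**2 * n))     # close to sigma^2 / n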
Linearity of expectation
E(af (X ) + bg (Y )) = aE(f (X )) + bE(g (Y ))
Moments
The n-th moment of a r.v. X is E(X^n).
The expected value is the first moment of the r.v.
The variance of a r.v. X is defined by Var(X) = E[(X − E[X])²] = E[X²] − (E[X])².
Useful inequalities
Exponential bound:
1 + x ≤ exp(x). (1)
Chebyshev:
P(|X| ≥ a) ≤ E(X²)/a², ∀ a > 0. (2)
Markov inequality: if f(·) is nonnegative and nondecreasing on [a, +∞), then
P(X ≥ a) ≤ E(f(X))/f(a). (3)
Jensen inequality: if f(·) is convex, then
f(E(X)) ≤ E(f(X)).
What is a matrix?
Basic questions
Column space
The column space of A , denoted by C (A ) and also called range or span of
A , is the subspace of Rm such that:
y ∈ C (A ) if and only if y = Ax for some x ∈ Rn .
Nullspace
The nullspace of A , denoted by N (A ) and also called kernel, is the
subspace of Rn such that:
x ∈ N (A ) if and only if Ax = 0.
Rank
The rank of a matrix is the dimension of its column space.
Eigenvalues/Eigenvectors
Given a square n × n matrix A, we say that λ ∈ C is an eigenvalue of A and x ∈ Cⁿ is the corresponding eigenvector if
Ax = λx, x ≠ 0.
Properties of eigenvalues
For a diagonalizable matrix, the rank of A is equal to the number of non-zero eigenvalues (counted with multiplicity).
If A is a non-singular matrix (all of its eigenvalues are non-zero), then 1/λi is an eigenvalue of A⁻¹ with associated eigenvector xi.
Properties of eigenvalues
If A has n linearly independent eigenvectors x1, . . . , xn, it can be diagonalized as
A = S Λ S⁻¹ = [x1, . . . , xn] diag(λ1, . . . , λn) [x1, . . . , xn]⁻¹.
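A quick numerical check of this decomposition (a sketch with a random matrix, assuming NumPy; not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

lam, S = np.linalg.eig(A)              # eigenvalues and eigenvectors (columns of S)
A_rec = S @ np.diag(lam) @ np.linalg.inv(S)

print(np.allclose(A, A_rec))           # True, up to numerical precision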
Singular Value Decomposition (SVD)
Any m × n matrix A can be factored as
A = U Σ Vᵀ
with
U an orthogonal m × m matrix: U Uᵀ = I
Σ an m × n diagonal matrix: Σ = diag(σ1, . . . , σr, 0, . . . , 0), where σ1 ≥ · · · ≥ σr > 0 are the non-zero singular values and r = rank(A)
V an orthogonal n × n matrix: V Vᵀ = I
Pseudo-inverse
Given the SVD A = U Σ Vᵀ, the Moore–Penrose pseudo-inverse of A is
A⁺ = V Σ⁺ Uᵀ, with Σ⁺ = diag(1/σ1, . . . , 1/σr, 0, . . . , 0) an n × m matrix.
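The sketch below builds the pseudo-inverse from the SVD in NumPy and compares it with np.linalg.pinv (a sanity check on a random matrix, not slide material):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))        # m = 5, n = 3

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 5x5, s holds sigma_1..sigma_r, Vt is 3x3
Sigma_plus = np.zeros((3, 5))
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_pinv = Vt.T @ Sigma_plus @ U.T       # V Sigma^+ U^T
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True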
Norm of a matrix
Frobenius norm
‖A‖F = √(∑_i ∑_j A_ij²) = √(trace(A Aᵀ))
Operator (2-)norm
‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖ = max_{‖x‖=1} ‖Ax‖,
which equals the largest singular value σ1 of A.
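Both norms are easy to compute and cross-check numerically (a small sketch with a random matrix, assuming NumPy):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

fro = np.sqrt((A**2).sum())
print(np.isclose(fro, np.sqrt(np.trace(A @ A.T))))     # Frobenius norm identity
print(np.isclose(fro, np.linalg.norm(A, 'fro')))

op = np.linalg.norm(A, 2)                              # operator / spectral norm
print(np.isclose(op, np.linalg.svd(A, compute_uv=False)[0]))  # equals largest singular value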
Properties of norms
‖A‖ ≥ 0 ∀ A, with ‖A‖ = 0 if and only if A = 0
‖kA‖ = |k| ‖A‖ ∀ k
Triangle inequality
‖A + B‖ ≤ ‖A‖ + ‖B‖
Cauchy–Schwarz (sub-multiplicative) inequality
‖AB‖ ≤ ‖A‖ ‖B‖
SVD is a fundamental tool for data analysis and is often used in computer
vision and machine learning applications
Image compression
Image denoising
Pattern classification
Transformation estimation
etc.
Figure: Image denoising with SVD (rank-k approximations, k = 10, 50, 100).
Image denoising
It is better to work with local patches
Denoise each local patch with SVD
Figure: Image denoising with SVD on local patches (k = 1, 2, 10).
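The core operation behind these examples is the truncated SVD; below is a minimal sketch of a rank-k approximation (random data stands in for an image patch, and no image I/O is assumed):

import numpy as np

def rank_k_approx(X, k):
    """Best rank-k approximation of X in the Frobenius/2-norm sense (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64))          # stand-in for a noisy image patch
denoised = rank_k_approx(patch, k=10)          # keep only the 10 largest singular values
print(np.linalg.matrix_rank(denoised))         # 10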
Iterative methods
Most optimization techniques are iterative
starting from an initial point x0,
they produce a sequence of vectors {xk}, k = 1, . . . , N, which hopefully converges
towards a stationary point x̂
Gradient Descent
Among the most popular methods for continuous optimization
Simple and intuitive
Work under very few assumptions
Suitable for large-scale problems
Easy to parallelize for problems with many terms in the objective
xk+1 = xk + αk pk, where pk is a descent direction (for gradient descent, pk = −∇f(xk)) and αk > 0 is the step size.
In practice
use a fixed step size if ∇f is Lipschitz (bounded rate of change)
apply a line search:
αk = arg min_α f(xk + α pk)
When to stop?
Common criteria: a small gradient norm ‖∇f(xk)‖, a small change in f or x between iterations, or a maximum number of iterations (k < kmax).
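A minimal gradient-descent sketch with a fixed step size and a gradient-norm stopping test (the quadratic objective is only an illustration, not from the slides):

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-6, k_max=1000):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for k in range(k_max):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# example: f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0])
print(x_star, np.linalg.solve(A, b))   # both approximate the minimizer A^{-1} b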
Issues
At each step, the gradient is averaged over all training samples,
i = 1, . . . , N.
Very slow when N is large.
Stochastic Gradient Descent
At each step, pick a training sample xi uniformly at random and update
w(k+1) ← w(k) − αk ∂E(xi; w)/∂w
Questions
The gradient of one random sample is not the gradient of the
objective function
Would this work at all?
Yes, SGD converges to the minimum of the expected loss!
But convergence can be slow
Mini-Batch SGD
At each step,
Randomly select a subset of training examples Sk ⊂ {1, . . . , N} and do
w(k+1) ← w(k) − αk (1/|Sk|) ∑_{i∈Sk} ∂E(xi; w)/∂w
(see the sketch after the list of pros below)
Pros
Reduced variance of the stochastic gradient estimates
Offers some degree of parallelization
Widely used in practice
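A minimal mini-batch SGD sketch for a least-squares loss E(xi; w) = ½(wᵀxi − yi)²; the arrays X (N × d) and y, and all names below, are illustrative assumptions rather than the slides' own notation.

import numpy as np

def minibatch_sgd(X, y, batch_size=32, alpha=0.01, epochs=50, seed=0):
    """Mini-batch SGD on the least-squares loss 0.5 * (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), N // batch_size):
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient averaged over the batch
            w = w - alpha * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(1000)
print(minibatch_sgd(X, y))    # close to w_true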
What is PCA?
Goal of PCA
The main goal of PCA is to express a complex data set in terms of a new set of basis vectors that 'best' explain the data.
Digits visualization
Y = PᵀX, where P is a p × k projection matrix.
The algorithm
Compute the sample covariance C = (1/N) ∑_{i=1}^{N} xi xiᵀ of the (centered) data, diagonalize it as C = U Λ Uᵀ, and project each sample onto the eigenvectors:
yi = Uᵀ xi
Why?
The sample covariance of the transformed data is
Cnew = (1/N) ∑_{i=1}^{N} yi yiᵀ = (1/N) ∑_{i=1}^{N} (Uᵀxi)(Uᵀxi)ᵀ
     = Uᵀ ( (1/N) ∑_{i=1}^{N} xi xiᵀ ) U = Uᵀ C U
     = Uᵀ (U Λ Uᵀ) U = (UᵀU) Λ (UᵀU)
     = Λ
Hence, when projected onto the principal components, the data is decorrelated.
Dimensionality reduction
We usually want to represent our data in a lower dimensional space R^k, with k ≪ d.
We achieve this by projecting onto the k principal axes which
preserve most of the variance in the data
From the previous analysis, we see that those axes correspond to the
eigenvectors associated with the k largest eigenvalues
U = [u1 u2 . . . ud] (d × d)   ⇒   Uk = [u1 u2 . . . uk] (d × k)
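A minimal PCA sketch following these steps (center the data, eigendecompose the sample covariance, keep the k leading eigenvectors); samples are stored as rows of X and the data is a random stand-in, both assumptions of the sketch:

import numpy as np

def pca(X, k):
    """X is N x d (one sample per row). Returns the d x k matrix U_k and the projected data."""
    Xc = X - X.mean(axis=0)                      # center the data
    C = Xc.T @ Xc / len(X)                       # sample covariance (d x d)
    _, eigvec = np.linalg.eigh(C)                # eigenvectors, eigenvalues in ascending order
    Uk = eigvec[:, ::-1][:, :k]                  # k eigenvectors with largest eigenvalues
    return Uk, Xc @ Uk                           # projected data Y (N x k)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])  # elongated cloud
Uk, Y = pca(X, k=1)
print(Y.var(axis=0))    # most of the total variance is captured by the first component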
Dual PCA
The matrix C0 = (1/N) XᵀX is called the Gram (or Gramian) matrix.
PCA vs Dual PCA
Matrix Differentiation
If f : Rⁿ → R has second partial derivatives ∂²f(x)/∂xi∂xj for all i, j at a given point x, we say that f is twice differentiable at x.
The Hessian of f at x is the n × n matrix of second partial derivatives
∇²f(x), with entries (∇²f(x))_ij = ∂²f(x)/∂xi∂xj, for i, j = 1, . . . , n.
Since second partial derivatives are symmetric (when they are continuous), ∂²f(x)/∂xi∂xj = ∂²f(x)/∂xj∂xi, the Hessian is a symmetric matrix.
If f is twice differentiable at x, then for y close to x
f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ ∇²f(x)(y − x) + o(‖y − x‖²),
where o(α) denotes a function such that lim_{α→0} o(α)/α = 0.
Useful formulas

y(x)      ∂y/∂x
Ax        Aᵀ
xᵀA       A
xᵀx       2x
xᵀAx      Ax + Aᵀx

Table: Some useful derivative formulas.
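These identities can be checked with finite differences; the sketch below verifies ∂(xᵀAx)/∂x = Ax + Aᵀx on a random example (a sanity check only, assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda x: x @ A @ x            # f(x) = x^T A x
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])          # central differences

print(np.allclose(num_grad, A @ x + A.T @ x, atol=1e-5))   # True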
Chain rule: if z = z(y) and y = y(x), then
∂z/∂x = (∂y/∂x) (∂z/∂y).
where Eij denotes the elementary m × n matrix (i.e. a matrix with all
entries equal to zero except for the (i , j ) entry which is one).
Conclusion
To be continued ...
Many useful tools not included:
Automatic differentiation
Numerical optimization
Kernel methods
Gaussian processes
Monte-Carlo methods
etc.