Maths Refresher
Désiré Sidibé
Professor
Université d’Evry - Paris Saclay
IBISC Lab
[email protected]
https://sites.google.com/view/dsidibe/
Outline
1 Basics of Probability
2 Linear Algebra
3 Gradient Descent
4 Principal Component Analysis (PCA)
5 Matrix Differentiation
6 Conclusion
Why Probability?
Axioms of Probability
1 0 ≤ P(A) ≤ 1, ∀ A ⊂ Ω
2 P(Ω) = 1
3 If A1, A2, . . . , An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0 for i ≠ j), then:
P(A1 ∪ A2 ∪ · · · ∪ An) = ∑_{i=1}^{n} P(Ai).
Probability Theory
Other laws of probability
P(Aᶜ) = 1 − P(A), where Aᶜ denotes the complement of A.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
P(A) = P(A ∩ B) + P(A ∩ Bᶜ).
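These identities are easy to verify by Monte-Carlo simulation; below is a minimal sketch (the two die-roll events A and B are an illustrative example, not from the slides), assuming NumPy is available.

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair six-sided die

A = rolls % 2 == 0          # event A: roll is even
B = rolls >= 4              # event B: roll is at least 4

p_union = np.mean(A | B)
p_sum = np.mean(A) + np.mean(B) - np.mean(A & B)
print(p_union, p_sum)       # both close to P(A ∪ B) = 2/3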
Independence
Two events A and B are said to be independent if knowing that A occurs
does not change the probability of B.
Independence
P (A ∩ B ) = P (A )P (B )
Or equivalently
P (A | B ) = P (A )
General definition
A collection of events Ai, i ∈ I, is mutually independent if, for any sub-collection {i, j, . . . , k} ⊂ I, one has
P(Ai ∩ Aj ∩ · · · ∩ Ak) = P(Ai) P(Aj) · · · P(Ak).
Continuous r.v.
A r.v. X is continuous if one can write, for a density function fX ≥ 0,
P(X ∈ [a, b]) = FX(b) − FX(a) = ∫_a^b fX(x) dx, ∀ a < b,
where FX is the cumulative distribution function of X.
Examples of random variables
Bernoulli
X has a Bernoulli distribution with parameter p ∈ [0, 1] if
P(X = 1) = p and P(X = 0) = 1 − p.
Binomial
X has a binomial distribution with parameters n ∈ {1, 2, . . .} and
p ∈ [0, 1], and we write X ∼ B(n, p ) if
P(X = m) = \binom{n}{m} p^m (1 − p)^{n−m}, for m = 0, 1, . . . , n.
Geometric
X has a geometric distribution with parameter p ∈ (0, 1], and we write
X ∼ G(p ) if
P(X = n) = p (1 − p)^{n−1}, for n ≥ 1.
Probability Theory
Examples of random variables
Poisson
The r.v. X has a Poisson distribution with parameter λ, and we write
X ∼ P(λ) if
P(X = n) = (λ^n / n!) e^{−λ}, for n ≥ 0.
Uniform
X is uniformly distributed in the interval [a , b ] where a < b, and we
write X ∼ U[a , b ] if
fX(x) = 1/(b − a) if x ∈ [a, b], and fX(x) = 0 otherwise.
Exponential
X is exponentially distributed with rate λ > 0, and we write
X ∼ Exp (λ), if
fX(x) = λ e^{−λx} if x > 0, and fX(x) = 0 otherwise.
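All of these distributions are available in numpy.random; the sketch below (with arbitrary illustrative parameters) draws samples and compares empirical means with the theoretical values np, 1/p, λ, (a + b)/2 and 1/λ.

import numpy as np

rng = np.random.default_rng(42)
N = 100_000

samples = {
    "binomial B(10, 0.3), mean n*p = 3": rng.binomial(10, 0.3, N),
    "geometric G(0.2), mean 1/p = 5": rng.geometric(0.2, N),
    "poisson P(4), mean lambda = 4": rng.poisson(4.0, N),
    "uniform U[2, 6], mean (a+b)/2 = 4": rng.uniform(2.0, 6.0, N),
    "exponential Exp(2), mean 1/lambda = 0.5": rng.exponential(1 / 2.0, N),
}
for name, x in samples.items():
    print(f"{name:42s} empirical mean = {x.mean():.3f}")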
Probability Theory
Standard Gaussian
X has a standard Gaussian (normal) distribution, X ∼ N(0, 1), if
fX(x) = (1/√(2π)) exp(−x²/2), for x ∈ R.
Gaussian
X has a Gaussian distribution with mean µ and variance σ², X ∼ N(µ, σ²), if
fX(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), for x ∈ R.
Multivariate Gaussian
A random vector x ∈ R^d has a multivariate Gaussian distribution, x ∼ N(µ, Σ), if
p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)),
where
the mean is µ = E[x] = ∫ x p(x) dx,
and the covariance matrix is Σ = E[(x − µ)(x − µ)ᵀ].
Central Limit Theorem: let X1, . . . , Xn be i.i.d. with mean µ and variance σ², and let Yn = (1/n) ∑_{i=1}^{n} Xi be their sample mean. Then, we have
Yn →D N(µ, σ²/n) as n → ∞.
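A minimal simulation of this behaviour, assuming the Xi are exponential (any distribution with finite variance would do):

import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 2.0, 200, 50_000

# sample means of n i.i.d. Exp(lam) variables (mu = 1/lam, sigma^2 = 1/lam^2)
Y = rng.exponential(1 / lam, size=(trials, n)).mean(axis=1)

print(Y.mean(), 1 / lam)             # close to mu
print(Y.var(), 1 / (lam**2 * n))     # close to sigma^2 / n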
Linearity of expectation
E(af (X ) + bg (Y )) = aE(f (X )) + bE(g (Y ))
Moments
The n-th moment of a r.v. X is E(X^n).
The expected value is the first moment of the r.v.
The variance of a r.v. X is defined by Var(X) = E[(X − E[X])²] = E[X²] − (E[X])².
Useful inequalities
Exponential bound:
1 + x ≤ exp(x). (1)
Chebyshev:
P(|X| ≥ a) ≤ E(X²)/a², ∀ a > 0. (2)
Markov inequality: if f(·) is nonnegative and nondecreasing on [a, +∞), then
P(X ≥ a) ≤ E(f(X))/f(a). (3)
Jensen inequality: if f(·) is convex, then
f(E(X)) ≤ E(f(X)).
What is a matrix?
Basic questions
Column space
The column space of A , denoted by C (A ) and also called range or span of
A , is the subspace of Rm such that:
y ∈ C (A ) if and only if y = Ax for some x ∈ Rn .
Nullspace
The nullspace of A , denoted by N (A ) and also called kernel, is the
subspace of Rn such that:
x ∈ N (A ) if and only if Ax = 0.
Rank
The rank of a matrix is the dimension of its column space.
Eigenvalues/Eigenvectors
Given a square n × n matrix A, we say that λ ∈ C is an eigenvalue of A and x ∈ Cⁿ is the corresponding eigenvector if
Ax = λx, x ≠ 0.
Properties of eigenvalues
For a diagonalizable matrix, the rank of A is equal to the number of non-zero eigenvalues (counted with multiplicity).
If A is a non-singular matrix (all of its eigenvalues are non-zero), then 1/λi is an eigenvalue of A⁻¹ with associated eigenvector xi.
Properties of eigenvalues
If A has n linearly independent eigenvectors x1, . . . , xn, it can be diagonalized as
A = S Λ S⁻¹ = [x1, . . . , xn] diag(λ1, . . . , λn) [x1, . . . , xn]⁻¹.
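A quick numerical check of this decomposition (a sketch with a random matrix, assuming NumPy; not part of the original slides):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

lam, S = np.linalg.eig(A)              # eigenvalues and eigenvectors (columns of S)
A_rec = S @ np.diag(lam) @ np.linalg.inv(S)

print(np.allclose(A, A_rec))           # True, up to numerical precision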
Singular Value Decomposition (SVD)
Any m × n matrix A can be factored as
A = U Σ Vᵀ
with
U an orthogonal m × m matrix: U Uᵀ = I
Σ an m × n diagonal matrix: Σ = diag(σ1, . . . , σr, 0, . . . , 0), where σ1 ≥ · · · ≥ σr > 0 are the non-zero singular values and r = rank(A)
V an orthogonal n × n matrix: V Vᵀ = I
Pseudo-inverse
Given the SVD A = U Σ Vᵀ, the Moore–Penrose pseudo-inverse of A is
A⁺ = V Σ⁺ Uᵀ, with Σ⁺ = diag(1/σ1, . . . , 1/σr, 0, . . . , 0) an n × m matrix.
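The sketch below builds the pseudo-inverse from the SVD in NumPy and compares it with np.linalg.pinv (a sanity check on a random matrix, not slide material):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))        # m = 5, n = 3

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 5x5, s holds sigma_1..sigma_r, Vt is 3x3
Sigma_plus = np.zeros((3, 5))
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_pinv = Vt.T @ Sigma_plus @ U.T       # V Sigma^+ U^T
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True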
Norm of a matrix
Frobenius norm
‖A‖F = √(∑_i ∑_j A_ij²) = √(trace(A Aᵀ))
Operator (2-)norm
‖A‖ = max_{x ≠ 0} ‖Ax‖/‖x‖ = max_{‖x‖=1} ‖Ax‖,
which equals the largest singular value σ1 of A.
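Both norms are easy to compute and cross-check numerically (a small sketch with a random matrix, assuming NumPy):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

fro = np.sqrt((A**2).sum())
print(np.isclose(fro, np.sqrt(np.trace(A @ A.T))))     # Frobenius norm identity
print(np.isclose(fro, np.linalg.norm(A, 'fro')))

op = np.linalg.norm(A, 2)                              # operator / spectral norm
print(np.isclose(op, np.linalg.svd(A, compute_uv=False)[0]))  # equals largest singular value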
Properties of norms
‖A‖ ≥ 0 ∀ A, with ‖A‖ = 0 if and only if A = 0
‖kA‖ = |k| ‖A‖ ∀ k
Triangle inequality
‖A + B‖ ≤ ‖A‖ + ‖B‖
Cauchy–Schwarz (sub-multiplicative) inequality
‖AB‖ ≤ ‖A‖ ‖B‖
SVD is a fundamental tool for data analysis and is often used in computer
vision and machine learning applications
Image compression
Image denoising
Pattern classification
Transformation estimation
etc.
Figure: Image denoising with SVD (rank-k approximations, k = 10, 50, 100).
Image denoising
It is better to work with local patches
Denoise each local patch with SVD
Figure: Image denoising with SVD on local patches (k = 1, 2, 10).
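The core operation behind these examples is the truncated SVD; below is a minimal sketch of a rank-k approximation (random data stands in for an image patch, and no image I/O is assumed):

import numpy as np

def rank_k_approx(X, k):
    """Best rank-k approximation of X in the Frobenius/2-norm sense (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64))          # stand-in for a noisy image patch
denoised = rank_k_approx(patch, k=10)          # keep only the 10 largest singular values
print(np.linalg.matrix_rank(denoised))         # 10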
Iterative methods
Most optimization techniques are iterative
starting from an initial point x0,
they produce a sequence of vectors {xk}, k = 1, . . . , N, which hopefully converges
towards a stationary point x̂
Gradient Descent
Among the most popular methods for continuous optimization
Simple and intuitive
Work under very few assumptions
Suitable for large-scale problems
Easy to parallelize for problems with many terms in the objective
xk+1 = xk + αk pk, where pk is a descent direction (for gradient descent, pk = −∇f(xk)) and αk > 0 is the step size.
In practice
use a fixed step size if ∇f is Lipschitz (bounded rate of change)
apply a line search:
αk = arg min_α f(xk + α pk)
When to stop?
Common criteria: a small gradient norm ‖∇f(xk)‖, a small change in f or x between iterations, or a maximum number of iterations (k < kmax).
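A minimal gradient-descent sketch with a fixed step size and a gradient-norm stopping test (the quadratic objective is only an illustration, not from the slides):

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-6, k_max=1000):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for k in range(k_max):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# example: f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0])
print(x_star, np.linalg.solve(A, b))   # both approximate the minimizer A^{-1} b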
Issues
At each step, the gradient is averaged over all training samples,
i = 1, . . . , N.
Very slow when N is large.
Stochastic Gradient Descent
At each step, pick a training sample xi uniformly at random and update
w(k+1) ← w(k) − αk ∂E(xi; w)/∂w
Questions
The gradient of one random sample is not the gradient of the
objective function
Would this work at all?
Yes, SGD converges to the minimum of the expected loss!
But convergence can be slow
Mini-Batch SGD
At each step,
Randomly select a subset of training examples Sk ⊂ {1, . . . , N} and do
w(k+1) ← w(k) − αk (1/|Sk|) ∑_{i∈Sk} ∂E(xi; w)/∂w
(see the sketch after the list of pros below)
Pros
Reduced variance of the stochastic gradient estimates
Offers some degree of parallelization
Widely used in practice
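A minimal mini-batch SGD sketch for a least-squares loss E(xi; w) = ½(wᵀxi − yi)²; the arrays X (N × d) and y, and all names below, are illustrative assumptions rather than the slides' own notation.

import numpy as np

def minibatch_sgd(X, y, batch_size=32, alpha=0.01, epochs=50, seed=0):
    """Mini-batch SGD on the least-squares loss 0.5 * (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), N // batch_size):
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient averaged over the batch
            w = w - alpha * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(1000)
print(minibatch_sgd(X, y))    # close to w_true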
What is PCA?
Goal of PCA
The main goal of PCA is to express a complex data set in terms of a new set of basis vectors that 'best' explain the data.
Digits visualization
Y = PᵀX, where P is a p × k projection matrix.
The algorithm
Compute the sample covariance C = (1/N) ∑_{i=1}^{N} xi xiᵀ of the (centered) data, diagonalize it as C = U Λ Uᵀ, and project each sample onto the eigenvectors:
yi = Uᵀ xi
Why?
The sample covariance of the transformed data is
Cnew = (1/N) ∑_{i=1}^{N} yi yiᵀ = (1/N) ∑_{i=1}^{N} (Uᵀxi)(Uᵀxi)ᵀ
     = Uᵀ ( (1/N) ∑_{i=1}^{N} xi xiᵀ ) U = Uᵀ C U
     = Uᵀ (U Λ Uᵀ) U = (UᵀU) Λ (UᵀU)
     = Λ
Hence, when projected onto the principal components, the data is decorrelated.
Dimensionality reduction
We usually want to represent our data in a lower dimensional space R^k, with k ≪ d.
We achieve this by projecting onto the k principal axes which
preserve most of the variance in the data
From the previous analysis, we see that those axes correspond to the
eigenvectors associated with the k largest eigenvalues
U = [u1 u2 . . . ud] (d × d)   ⇒   Uk = [u1 u2 . . . uk] (d × k)
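A minimal PCA sketch following these steps (center the data, eigendecompose the sample covariance, keep the k leading eigenvectors); samples are stored as rows of X and the data is a random stand-in, both assumptions of the sketch:

import numpy as np

def pca(X, k):
    """X is N x d (one sample per row). Returns the d x k matrix U_k and the projected data."""
    Xc = X - X.mean(axis=0)                      # center the data
    C = Xc.T @ Xc / len(X)                       # sample covariance (d x d)
    _, eigvec = np.linalg.eigh(C)                # eigenvectors, eigenvalues in ascending order
    Uk = eigvec[:, ::-1][:, :k]                  # k eigenvectors with largest eigenvalues
    return Uk, Xc @ Uk                           # projected data Y (N x k)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])  # elongated cloud
Uk, Y = pca(X, k=1)
print(Y.var(axis=0))    # most of the total variance is captured by the first component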
Dual PCA
The matrix C0 = (1/N) XᵀX is called the Gram (or Gramian) matrix.
PCA vs Dual PCA
Matrix Differentiation
If f : Rⁿ → R has second partial derivatives ∂²f(x)/∂xi∂xj for all i, j at a given point x, we say that f is twice differentiable at x.
The Hessian of f at x is the n × n matrix of second partial derivatives
∇²f(x), with entries (∇²f(x))_ij = ∂²f(x)/∂xi∂xj, for i, j = 1, . . . , n.
Since second partial derivatives are symmetric (when they are continuous), ∂²f(x)/∂xi∂xj = ∂²f(x)/∂xj∂xi, the Hessian is a symmetric matrix.
If f is twice differentiable at x, then for y close to x
f(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ ∇²f(x)(y − x) + o(‖y − x‖²),
where o(α) denotes a function such that lim_{α→0} o(α)/α = 0.
Useful formulas

y(x)      ∂y/∂x
Ax        Aᵀ
xᵀA       A
xᵀx       2x
xᵀAx      Ax + Aᵀx

Table: Some useful derivative formulas.
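These identities can be checked with finite differences; the sketch below verifies ∂(xᵀAx)/∂x = Ax + Aᵀx on a random example (a sanity check only, assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda x: x @ A @ x            # f(x) = x^T A x
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])          # central differences

print(np.allclose(num_grad, A @ x + A.T @ x, atol=1e-5))   # True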
Chain rule: if z = z(y) and y = y(x), then
∂z/∂x = (∂y/∂x) (∂z/∂y).
where Eij denotes the elementary m × n matrix (i.e. a matrix with all
entries equal to zero except for the (i , j ) entry which is one).
Conclusion
To be continued ...
Many useful tools not included:
Automatic differentiation
Numerical optimization
Kernel methods
Gaussian processes
Monte-Carlo methods
etc.