Mathematical background
Prof. LI Hongsheng
Office: SHB 428
e-mail: [email protected]
web: https://blackboard.cuhk.edu.hk
Department of Electronic Engineering
The Chinese University of Hong Kong
Jan. 2024
Outline
2 Mathematical background
Linear Algebra
Probability theory review
Multivariate Gaussian distributions
• Example of classification
• Example of regression
Note that a function can have a vector output or matrix output. For
instance, the following formula is still a function
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 + x_2 \\ x_1 x_2 \end{bmatrix}
Function estimation
Action recognition
Speech recognition
Statistical learning
Make the optimal decision from the statistical point of view (with
statistical explanation)
Bayesian decision theory
Classification model
Feature matters
Fish classification
Features
Length of the fish & average lightness of the fish
Models
Sea bass have some typical length (lightness) and it is greater than that for
salmon
f (x) = “sea bass” if x > x∗
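As a toy sketch of this threshold rule; the cutoff value below is invented for illustration, not taken from the lecture:

```python
# Minimal sketch of the rule f(x) = "sea bass" if x > x*.
# X_STAR is a hypothetical length threshold, not a value from the lecture.
X_STAR = 40.0  # assumed decision threshold on fish length

def classify(length):
    """Predict the fish class from its length via the threshold rule."""
    return "sea bass" if length > X_STAR else "salmon"

print(classify(55.0))  # a long fish is predicted "sea bass"
print(classify(20.0))  # a short fish is predicted "salmon"
```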
Training set (with training samples)
Used to tune the parameters of an adaptive model
Minimize the classification errors on the training set
The data with annotations should be separated into training set, validation
set (optional), and testing set
Reaching 100% accuracy on the training set cannot guarantee good
performance on general unseen samples
The model must have the capability of generalizing to unseen (test) samples
The ability of the learner (also called the model) to discover a function taken
from a family of functions. Examples:
Linear predictor
y = wx + b
Quadratic predictor
y = w_2 x^2 + w_1 x + b
Degree-10 polynomial predictor
y = b + \sum_{i=1}^{10} w_i x^i
The latter family is richer, allowing it to capture more complex functions
Capacity can be measured by the number of training examples {x^{(i)}, y^{(i)}}
that the learner can always fit, regardless of how the values of
x^{(i)} and y^{(i)} are chosen
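One way to see this notion of capacity numerically: a degree-10 polynomial can fit 11 arbitrary targets (near) exactly, while a linear predictor generally cannot. A minimal sketch with NumPy; the data are arbitrary:

```python
import numpy as np

# Sketch: fit 11 arbitrary targets with a linear predictor vs. a
# degree-10 polynomial, and compare the worst-case training error.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 11)
y = rng.normal(size=11)            # arbitrary made-up targets

lin = np.polyfit(x, y, 1)          # linear predictor y = w x + b
deg10 = np.polyfit(x, y, 10)       # degree-10 polynomial predictor

err_lin = np.max(np.abs(np.polyval(lin, x) - y))
err_10 = np.max(np.abs(np.polyval(deg10, x) - y))
print(err_lin, err_10)  # the degree-10 family interpolates; the linear one cannot
```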
Underfitting
The learner cannot find a solution that fits training examples well
For example, use linear regression to fit training examples {x^{(i)}, y^{(i)}} where
y^{(i)} is a quadratic function of x^{(i)}
Underfitting means the learner cannot capture some important aspects of
the data
Reasons why underfitting happens
Model is not rich enough
Difficult to find the global optimum of the objective function on the training
set or easy to get stuck at local minimum
Limitation on the computation resources (not enough training iterations of
an iterative optimization procedure)
Underfitting commonly happens in non-deep-learning approaches with
large-scale training data and can be an even more serious problem than
overfitting in some cases
Overfitting
The learner fits the training data well, but loses the ability to generalize
well, i.e. it has small training error but larger generalization error
A learner with large capacity tends to overfit
The family of functions is too large (compared with the size of the training
data) and it contains many functions which all fit the training data well.
Without sufficient data, the learner cannot distinguish which one is most
appropriate and would make an arbitrary choice among these apparently
good solutions
A separate validation set helps to choose a more appropriate one
In most cases, data is contaminated by noise. The learner with large
capacity tends to describe random errors or noise instead of the underlying
models of data (classes)
The goal is to classify novel examples not seen yet, but not the training
examples!
Generalization. The ability to correctly classify new examples that differ
from those used for training
Overly complex models lead to complicated decision boundaries. They yield perfect classification on the training examples, but poor performance on new patterns.
The decision boundary might represent the optimal tradeoff between performance on the training set and simplicity of the classifier, therefore giving the highest accuracy on new patterns.
Optimal capacity
Typical relationship between capacity and both training and generalization (or
test) error. As capacity increases, training error can be reduced, but the
optimism (difference between training and generalization error) increases. At
some point, the increase in optimism is larger than the decrease in training
error (typically when the training error is low and cannot go much lower), and
we enter the overfitting regime, where capacity is too large, above the optimal
capacity. Before reaching optimal capacity, we are in the underfitting regime.
(Bengio et al. Deep Learning 2014)
Curse of dimensionality
The more training samples in each cell, the more robust the classifier
The number of cells grows exponentially with the dimensionality of the
feature space
† D. Chen et al., "Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification," CVPR 2013.
Feature extraction
Discriminative features
Invariant features with respect to certain transformation
A small number of features
Classifier/regressor
Tradeoff of classification errors on the training set and the model complexity
Decide the form of the classifier
Tune the parameters of the classifier by training
Post processing
Risk: the cost of mis-classifying sea bass differs from that of
mis-classifying salmon
Context: it is more likely for a fish to be the same class as its previous one
Integrating multiple classifiers: classifiers are based on different sets of
features
Training cycle
[Diagram: data collection produces a training set; a learning algorithm fits a hypothesis h on the training inputs and outputs; h then maps testing inputs to testing outputs, which are evaluated.]
Design cycle
F-1 Measure tries to balance the contributions of precision and recall, and
gives a single value
F_1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{\#\text{All samples} + TP - TN}
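A quick numeric check of the two equivalent forms of F1; the confusion-matrix counts below are made up:

```python
# Sketch: F1 from precision/recall vs. from raw counts (TP/FP/TN/FN made up).
TP, FP, TN, FN = 40, 10, 45, 5

P = TP / (TP + FP)                   # precision
R = TP / (TP + FN)                   # recall
f1_from_pr = 2 * P * R / (P + R)

# Equivalent form from the slide: 2*TP / (#All samples + TP - TN)
n_all = TP + FP + TN + FN
f1_from_counts = 2 * TP / (n_all + TP - TN)

print(f1_from_pr, f1_from_counts)    # the two forms agree
```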
Receiver Operating Characteristic (ROC) curve. Different from the P-R
curve, the tested samples are ranked according to their positive
confidences.
The horizontal axis is False Positive Rate (FPR) and the vertical axis is
True Positive Rate (TPR)
TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}
Similar to the P-R curve, if one curve encloses another entirely, the former
represents better performance than the latter
It is still difficult to compare two models if their curves intersect
In that case, it is more appropriate to use the Area Under ROC Curve (AUC)
AUC = \frac{1}{2} \sum_{i=1}^{m-1} (x_{i+1} - x_i) \cdot (y_i + y_{i+1})
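The AUC formula above is the trapezoidal rule applied to the ROC points. A small sketch with made-up ROC points, where x_i is FPR and y_i is TPR:

```python
# Sketch: trapezoidal AUC over a few made-up ROC points (sorted by FPR).
xs = [0.0, 0.1, 0.4, 1.0]   # FPR values x_i
ys = [0.0, 0.6, 0.9, 1.0]   # TPR values y_i

auc = 0.5 * sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1])
                for i in range(len(xs) - 1))
print(auc)  # 0.825 for these points
```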
Learning schemes
Supervised learning
An “expert” provides a category label for each pattern in the training set
It may be cheap to collect patterns but expensive to obtain the labels
Unsupervised learning
The system automatically learns feature transformation from the training
samples without any annotation to best represent them
Weakly supervised learning (not covered in this course)
The supervision is inexact or coarse
Example: learning image segmentation by providing image-level annotations
Semi-supervised learning (not covered in this course)
Some samples have labels, while some do not
Deep learning
Neural networks
Speech recognition
Linear Algebra
Vector-vector multiplication
Matrix-vector multiplication
y = Ax = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}
The ith entry of y is equal to the inner product of the ith row of A and x
If we write A in column form,
y = Ax = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n
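Both views of y = Ax can be verified numerically; a small NumPy sketch with a made-up matrix:

```python
import numpy as np

# Sketch: y = Ax as inner products with the rows of A, and as a linear
# combination of the columns of A (the values are made up).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([1.0, 0.0, -1.0])

y = A @ x                                                 # direct product
y_rows = np.array([A[i] @ x for i in range(A.shape[0])])  # y_i = a_i^T x
y_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))   # sum_j a_j x_j

print(y, y_rows, y_cols)  # all three agree
```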
Matrix-vector multiplication
The ith entry of y^T = x^T A is equal to the inner product of x and the ith column
of A.
Express A in terms of rows
Matrix multiplication
Norms
Lp norm
\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}
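A short sketch of the Lp norm formula for a couple of values of p; the vector is made up:

```python
import numpy as np

# Sketch: L1, L2 norms via the Lp formula, and the p -> infinity limit.
x = np.array([3.0, -4.0])

def lp_norm(x, p):
    """(sum_i |x_i|^p)^(1/p), the Lp norm."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(lp_norm(x, 1))       # 7.0  (L1 norm)
print(lp_norm(x, 2))       # 5.0  (L2 norm)
print(np.max(np.abs(x)))   # 4.0  (L-infinity norm, the limit p -> infinity)
```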
Linear Independence
Inverse
Orthogonal matrices
Quadratic forms
Positive definite and negative definite matrices are always full rank
\mathrm{tr} A = \sum_{i=1}^{n} \lambda_i, \qquad |A| = \prod_{i=1}^{n} \lambda_i
The rank of A is equal to the number of non-zero eigenvalues of A
If A is non-singular, then 1/\lambda_i is an eigenvalue of A^{-1} with associated
eigenvector x_i, i.e., A^{-1} x_i = (1/\lambda_i) x_i
The eigenvalues of a diagonal matrix D = \mathrm{diag}(d_1, \cdots, d_n) are just the
diagonal entries d_1, \cdots, d_n
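The trace and determinant identities above are easy to verify numerically; a sketch with a small made-up symmetric matrix:

```python
import numpy as np

# Sketch: check tr A = sum of eigenvalues and |A| = product of eigenvalues.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])       # made-up symmetric matrix
lam = np.linalg.eigvalsh(A)      # eigenvalues of a symmetric matrix

print(np.trace(A), lam.sum())        # trace equals the eigenvalue sum
print(np.linalg.det(A), lam.prod())  # determinant equals the product
```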
Because y_i^2 is always nonnegative, the sign of this expression depends entirely
on the \lambda_i's
Application of eigenvalues and eigenvectors. For a matrix A ∈ Sn , the
solutions of the following problems
\max_{x \in \mathbb{R}^n} x^T A x \quad \text{subject to } \|x\|_2^2 = 1
\min_{x \in \mathbb{R}^n} x^T A x \quad \text{subject to } \|x\|_2^2 = 1
are the eigenvectors corresponding to the maximal and minimal eigenvalues
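A numeric sketch of this fact: for a symmetric matrix, the extreme values of x^T A x on the unit sphere are the extreme eigenvalues, attained at the corresponding eigenvectors. The matrix below is made up:

```python
import numpy as np

# Sketch: extreme values of x^T A x over unit-norm x are the extreme
# eigenvalues of the symmetric matrix A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
w, V = np.linalg.eigh(A)         # eigenvalues ascending; columns are eigenvectors

v_min, v_max = V[:, 0], V[:, -1]            # unit eigenvectors
print(v_min @ A @ v_min, w[0])              # minimum value = smallest eigenvalue
print(v_max @ A @ v_max, w[-1])             # maximum value = largest eigenvalue
```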
Derivatives
\frac{d f(x)}{dx} = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}
Common functions
\frac{d}{dx} c = 0, \qquad \frac{d}{dx} ax = a, \qquad \frac{d}{dx} x^a = a x^{a-1}, \qquad \frac{d}{dx} \log x = \frac{1}{x}
Rules
Product rule: \frac{d}{dx} f(x) g(x) = f(x) g'(x) + f'(x) g(x)
Quotient rule: \frac{d}{dx} \frac{1}{f(x)} = \frac{-f'(x)}{f(x)^2}
Chain rule: \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)
Computational graph
u = bc, v = a + u, J = 3v
\frac{\partial J}{\partial v}, \quad \frac{\partial J}{\partial u} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial u}, \quad \frac{\partial J}{\partial a} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial a}, \quad \frac{\partial J}{\partial b} = \frac{\partial J}{\partial u} \frac{\partial u}{\partial b}, \quad \frac{\partial J}{\partial c} = \frac{\partial J}{\partial u} \frac{\partial u}{\partial c}
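The chain-rule computation on this graph can be written out directly; a sketch with made-up values for a, b, c:

```python
# Sketch of reverse-mode differentiation on the graph
# u = b*c, v = a + u, J = 3*v (the input values are made up).
a, b, c = 5.0, 3.0, 2.0

# forward pass
u = b * c          # 6.0
v = a + u          # 11.0
J = 3 * v          # 33.0

# backward pass: apply the chain rule node by node
dJ_dv = 3.0                # dJ/dv
dJ_du = dJ_dv * 1.0        # dv/du = 1
dJ_da = dJ_dv * 1.0        # dv/da = 1
dJ_db = dJ_du * c          # du/db = c
dJ_dc = dJ_du * b          # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```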
Matrix calculus
where (\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}
Properties
∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x)
For t ∈ R, ∇x (tf (x)) = t∇x f (x)
= \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k = 2 \sum_{i=1}^{n} A_{ki} x_i
We have \nabla_x x^T A x = 2Ax (for symmetric A)
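A finite-difference check of this gradient identity, using a random symmetric matrix:

```python
import numpy as np

# Sketch: central-difference check of grad_x x^T A x = 2 A x for symmetric A.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
A = (M + M.T) / 2                    # symmetrize a random matrix
x = rng.normal(size=3)

f = lambda x: x @ A @ x
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
# difference is at floating-point noise level
print(np.max(np.abs(num_grad - 2 * A @ x)))
```

For a quadratic function the central difference is exact up to rounding, so the check is very tight.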
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Probability theory review
Elements of probability
Sample space Ω: The set of all the outcomes of a random experiment.
Each ω ∈ Ω is a complete outcome at the end of the experiment.
Set of events or event space F: A set whose elements A ∈ F (called
events) are subsets of Ω.
Axioms of probability: A function P : F → R that satisfies the following
properties
P(A) \geq 0, for all A \in \mathcal{F}
P(\Omega) = 1
If A_1, A_2, \ldots are disjoint events, then P(\cup_i A_i) = \sum_i P(A_i)
Random variables
Random variable X is a function X : Ω → R.
For a discrete random variable,
P(X = k) \overset{\mathrm{def}}{=} P(\{\omega : X(\omega) = k\})
For a continuous random variable,
P(a \leq X \leq b) \overset{\mathrm{def}}{=} P(\{\omega : a \leq X(\omega) \leq b\})
Joint PMF
If X and Y are discrete random variables, the joint probability mass
function pXY : R × R → [0, 1]
pXY (x, y) = P (X = x, Y = y)
Marginal probability mass function p_X(x)
p_X(x) = \sum_{y} p_{XY}(x, y)
Joint PDF
If X and Y are continuous random variables, the joint probability density
function f_{XY} : \mathbb{R} \times \mathbb{R} \to [0, \infty)
f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x \partial y}
Marginal probability density function fX (x)
f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y) \, dy
Conditional distributions
Intuitive understanding: Given X = x, the probability mass function or
probability density function of Y
Discrete cases:
p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}
Continuous cases:
f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}
Bayes’ rule
Discrete cases:
p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)} = \frac{p_{X|Y}(x|y) p_Y(y)}{\sum_{y'} p_{X|Y}(x|y') p_Y(y')}
Continuous cases:
f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y) f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y') f_Y(y') \, dy'}
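A small worked example of the discrete Bayes' rule, reusing the fish-classification setting; all probabilities below are made up for illustration:

```python
# Sketch: posterior over the fish class Y given a binary feature X = "long".
p_Y = {"sea bass": 0.4, "salmon": 0.6}           # made-up prior P_Y(y)
p_X_given_Y = {"sea bass": 0.8, "salmon": 0.3}   # made-up P(X = long | y)

# evidence P(X = long) = sum_y' P(X = long | y') P_Y(y')
evidence = sum(p_X_given_Y[y] * p_Y[y] for y in p_Y)

# Bayes' rule: P(Y = y | X = long)
posterior = {y: p_X_given_Y[y] * p_Y[y] / evidence for y in p_Y}
print(posterior)  # observing "long" shifts belief toward sea bass
```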
Independence
Discrete cases:
p_{XY}(x, y) = p_X(x) p_Y(y) for all possible x and y
p_{Y|X}(y|x) = p_Y(y) for all possible y, wherever p_X(x) \neq 0
Continuous cases:
f_{XY}(x, y) = f_X(x) f_Y(y) for all possible x and y
f_{Y|X}(y|x) = f_Y(y) for all possible y, wherever f_X(x) \neq 0
[Figure: two bivariate Gaussian densities with \mu = [3, 2]^T; left: \Sigma = \begin{bmatrix} 25 & 0 \\ 0 & 9 \end{bmatrix}; right: \Sigma = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix}]
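A sketch evaluating the bivariate Gaussian density for the two parameter sets above; the helper gaussian_pdf is written from the standard multivariate normal formula, not taken from the slides:

```python
import numpy as np

# Sketch: evaluate N(x; mu, Sigma) at the mean for the two covariances shown.
def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma) in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([3.0, 2.0])
Sigma1 = np.array([[25.0, 0.0], [0.0, 9.0]])
Sigma2 = np.array([[10.0, 5.0], [5.0, 5.0]])

print(gaussian_pdf(mu, mu, Sigma1))  # peak height 1/(2*pi*sqrt(225))
print(gaussian_pdf(mu, mu, Sigma2))  # peak height 1/(2*pi*sqrt(25))
```

The second density has a smaller determinant, so it is more concentrated and its peak is higher.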
Basic optimization
Pattern recognition systems usually involve optimizing some cost (or loss,
energy) function
If a function J(\theta) has a global minimum or maximum, the minimal or
maximal point \hat{\theta} must satisfy
\nabla_\theta J(\hat{\theta}) = 0
so \hat{\theta} can be found among the stationary points
For functions with many local minima and maxima, however, we cannot simply
find the optimal \hat{\theta} by setting the gradient to 0
Gradient descent
\theta^{(i+1)} = \theta^{(i)} - \gamma \nabla J(\theta^{(i)})
\gamma is called the step size (or learning rate), and -\nabla J(\theta^{(i)}) is the negative
gradient direction.
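A minimal sketch of this update rule on a one-variable quadratic; the objective and step size are made up:

```python
# Sketch of gradient descent theta <- theta - gamma * grad J(theta)
# on J(theta) = (theta - 4)^2, whose minimum is at theta = 4.
grad_J = lambda t: 2 * (t - 4.0)   # gradient of the made-up objective

theta = 0.0        # initial guess
gamma = 0.1        # step size (learning rate)
for _ in range(200):
    theta = theta - gamma * grad_J(theta)
print(theta)       # converges to ~4.0
```

With this step size the distance to the minimum shrinks by a factor of 0.8 per iteration, so 200 iterations are far more than enough.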
[Figure: gradient descent on functions of 1 variable and 2 variables.]