
Introduction to pattern recognition

Mathematical background

ENGG 5202: Pattern Recognition


Introduction and Mathematical Background

Prof. LI Hongsheng
Office: SHB 428
e-mail: [email protected]
web: https://blackboard.cuhk.edu.hk
Department of Electronic Engineering
The Chinese University of Hong Kong

Jan. 2024

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Outline

1 Introduction to pattern recognition

2 Mathematical background
Linear Algebra
Probability theory review
Multivariate Gaussian distributions

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

1 Introduction to pattern recognition

2 Mathematical background
Linear Algebra
Probability theory review
Multivariate Gaussian distributions

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

The objective of pattern recognition


The terms pattern recognition and machine learning are generally used
interchangeably; nowadays, machine learning is the more popular term
Learn a mapping function or probability distributions from training data

• Example of classification

• Example of regression

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Pattern Recognition aims at learning a function or a distribution

The course mostly focuses on function learning or function fitting


Given an input value or vector, a function assigns it a value or vector
“One-to-many” mapping is not a function. “Many-to-one” mapping is a
function.

[Figures: an example mapping that is not a function (one-to-many) and one that is a function (many-to-one)]

Note that a function can have a vector output or matrix output. For
instance, the following formula is still a function:

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 + x_2 \\ x_1 x_2 \end{bmatrix}

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Function estimation

We are interested in predicting y from input x and assume there exists a


function that describes the relationship between y and x, e.g., y = f (x)
If the parametric form of the prediction function f is fixed, f can be
parameterized by a parameter vector θ
Estimating fˆ from a training set D = {(x(1) , y (1) ), (x(2) , y (2) ), · · · ,
(x(n) , y (n) )}
With a better design of the parametric form of the function, the learner
could achieve better performance
This design process typically involves domain knowledge

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Face recognition in smart surveillance for crossing at a red light in China

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

An interesting failure case

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Face recognition helps capture criminal suspects in Jacky Cheung’s


concerts

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Object detection for autonomous driving

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Action recognition

Sample video frames from UCF-101 dataset

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Email spam classification

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Speech recognition

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Computer-aided medical diagnosis

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Function of gene sequence classification

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Example of pattern recognition

Financial time series prediction

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Pattern recognition is a sub-field of artificial intelligence

Artificial intelligence: general reasoning


Pattern recognition or machine learning: learn to obtain a function with
expected outputs
Deep learning: machine learning with deep neural networks

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Pattern recognition systems

Classification of types of fishes

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Statistical learning

Make the optimal decision from the statistical point of view (with
statistical explanation)
Bayesian decision theory

What’s the statistical model (the probability distributions of samples


belonging to each class)?
Manually specified or learned from data
Maximum likelihood estimation, Bayesian parameter estimation,
non-parametric density estimation

Directly model the parametric form of decision boundary


Support Vector Machine

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Classification model

Each sample is represented by a d-dimensional feature vector.

The goal of classification is to establish decision boundaries in the feature
space to separate samples belonging to different classes

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Feature matters

Properly choosing features for different classification/regression problems is
one of the key problems in pattern recognition applications
Computer vision applications
Histogram of Oriented Gradients (HOG) features
Scale-invariant Feature Transform (SIFT) features
Oriented FAST and rotated BRIEF (ORB) features
Speech recognition
Linear Predictive Codes (LPC) features
Perceptual Linear Prediction (PLP) features
Mel Frequency Cepstral Coefficients (MFCC) features
If discriminative (good) enough features exist, even a very simple linear
classifier can perform well
From the 1970s to the early 2010s, features were mostly designed manually by
humans based on experience

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Pattern recognition systems

Classification of types of fishes

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Fish classification
Features
Length of the fish & average lightness of the fish
Models
Sea bass have a typical length (lightness), and it is greater than that of
salmon
f (x) = “sea bass” if x > x∗
Training set (with training samples)
Used to tune the parameters of an adaptive model
Minimize the classification errors on the training set

[Figures: histograms of the length feature and of the lightness feature for the two categories]
None of the features alone will serve to unambiguously
discriminate between the two categories!
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Introduction to pattern recognition
Mathematical background

Jointly use two features


The classification error on the training data becomes lower than using only
one feature

Use more features?

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Linearly separable features

What makes good features: linearly separable features


A linear classifier (decision boundary) exists that correctly classifies all training
samples

However, such a property cannot be met for most scenarios

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Training set and testing set

The data with annotations should be separated into training set, validation
set (optional), and testing set
Reaching 100% accuracy on the training set cannot guarantee good
performance on unseen samples

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Training set and testing set

The data with annotations should be separated into training set, validation
set (optional), and testing set
Reaching 100% accuracy on the training set cannot guarantee good
performance on unseen samples
The model must have the capability to generalize to unseen (test) samples

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Model (learner) capacity

The ability of the learner (also called the model) to discover a function taken
from a family of functions. Examples:
Linear predictor
y = wx + b
Quadratic predictor
y = w_2 x^2 + w_1 x + b
Degree-10 polynomial predictor
y = b + \sum_{i=1}^{10} w_i x^i
The latter family is richer, allowing it to capture more complex functions
Capacity can be measured by the number of training examples {x^{(i)}, y^{(i)}}
that the learner could always fit, no matter how the values of
x^{(i)} and y^{(i)} are changed
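
As an illustrative sketch (assuming NumPy; the data, noise level, and degrees are hypothetical, not from the slides), the snippet below fits predictors of increasing capacity to the same small training set and reports the training error, showing that the richer polynomial family fits the training data more closely:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)                        # 20 training inputs (hypothetical)
y = 2.0 * x**2 - x + 0.1 * rng.standard_normal(20)    # noisy quadratic targets

for degree in (1, 2, 10):
    # Fit a polynomial predictor of the given degree by least squares
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    print(f"degree {degree:2d}: training MSE = {train_mse:.5f}")
```

The degree-10 family attains the lowest training error, but, as discussed in the following slides, it is also the most prone to overfitting.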

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Underfitting

The learner cannot find a solution that fits training examples well
For example, use linear regression to fit training examples {x^{(i)}, y^{(i)}} where
y^{(i)} is a quadratic function of x^{(i)}
Underfitting means the learner cannot capture some important aspects of
the data
Reasons why underfitting happens
The model is not rich enough
It is difficult to find the global optimum of the objective function on the training
set, or easy to get stuck at a local minimum
Limited computational resources (not enough training iterations of
an iterative optimization procedure)
Underfitting commonly happens in non-deep-learning approaches with
large-scale training data and can be an even more serious problem than
overfitting in some cases

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Overfitting

The learner fits the training data well, but loses the ability to generalize
well, i.e. it has small training error but larger generalization error
A learner with large capacity tends to overfit
The family of functions is too large (compared with the size of the training
data) and it contains many functions which all fit the training data well.
Without sufficient data, the learner cannot distinguish which one is most
appropriate and would make an arbitrary choice among these apparently
good solutions
A separate validation set helps to choose a more appropriate one
In most cases, data is contaminated by noise. The learner with large
capacity tends to describe random errors or noise instead of the underlying
models of data (classes)

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Model complexity (capacity)

The goal is to classify novel examples not seen yet, but not the training
examples!
Generalization. The ability to correctly classify new examples that differ
from those used for training

Overly complex models lead to complicated decision boundaries. This gives perfect
classification on the training examples, but would lead to poor performance on new patterns.
The decision boundary might represent the optimal tradeoff between performance on the
training set and simplicity of the classifier, therefore giving the highest accuracy on new patterns.

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Optimal capacity

Difference between training error and generalization error increases with


the capacity of the learner
Generalization error is a U-shaped function of capacity
Optimal capacity is associated with the transition from
underfitting to overfitting
One can use a validation set to monitor generalization error empirically
Optimal capacity should increase with the number of training examples

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Optimal capacity

Typical relationship between capacity and both training and generalization (or
test) error. As capacity increases, training error can be reduced, but the
optimism (difference between training and generalization error) increases. At
some point, the increase in optimism is larger than the decrease in training
error (typically when the training error is low and cannot go much lower), and
we enter the overfitting regime, where capacity is too large, above the optimal
capacity. Before reaching optimal capacity, we are in the underfitting regime.
(Bengio et al. Deep Learning 2014)

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Curse of dimensionality

In general, the classification errors on the training set can decrease by


simply increasing the number (dimension) of features
But it is not the case on the testing set which includes unseen samples

Scatter plot of the training data of three classes. Two features are used. The goal
is to classify the new testing point denoted by ‘x’.

The feature space is uniformly divided into cells.
A cell is labeled with a class if the majority of
training examples in that cell are from that class.
The testing point is classified according to the
label of the cell it falls in.

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Curse of dimensionality

The more training samples in each cell, the more robust the classifier
The number of cells grows exponentially with the dimensionality of the
feature space

If each dimension is divided into three intervals, the number of cells is
N = 3^D
Some cells are empty when the number of cells is very large!
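
A minimal sketch in plain Python (the sample count is hypothetical) of how the number of cells N = 3^D grows with the dimensionality D, quickly outpacing any fixed training set:

```python
n_samples = 1000              # hypothetical number of training samples
for D in (1, 2, 3, 5, 10):
    n_cells = 3 ** D          # each dimension divided into three intervals
    print(f"D = {D:2d}: {n_cells:6d} cells, "
          f"average {n_samples / n_cells:8.3f} samples per cell")
```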

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Machine learning with big data

Machine learning with small data: overfitting, reducing model complexity


(capacity)
Machine learning with big data: underfitting, need to increase model
complexity, difficult optimization, high computation resources
Curse of dimensionality

Blessing of dimensionality†

Learning hierarchical feature transforms
(with deep learning)


D. Chen et al. “Blessing of dimensionality: High dimensional feature and its efficient compression
for face verification,” CVPR 2013.

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Pattern recognition systems

Classification of types of fishes

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Pattern recognition systems

Feature extraction
Discriminative features
Invariant features with respect to certain transformation
A small number of features
Classifier/regressor
Tradeoff of classification errors on the training set and the model complexity
Decide the form of the classifier
Tune the parameters of the classifier by training
Post processing
Risk: the cost of mis-classifying sea bass is different than that of
mis-classifying salmon
Context: it is more likely for a fish to be the same class as its previous one
Integrating multiple classifiers: classifiers are based on different sets of
features

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Training cycle

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Data collection

Collect training data, validation data, and test data


Label the ground truth annotations
Is the training set large enough?
Is the training set representative enough?
Are the training data and the testing data collected under the same
condition?
Initial examination of the data to get a feel of data structure
Summary of statistics
Producing plots
The analysis of the evaluation results may require further data collection

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Problem setup for supervised learning

Given pairs of inputs and outputs, learn a function to map inputs to


outputs
Function inputs: features x(i) (x(i) ∈ Rd for general problems)
Function outputs: target outputs y (i) ∈ R
One training sample: (x(i) , y (i) )
Training set of m samples: {(x(1) , y (1) ), (x(2) , y (2) ), · · · , (x(m) , y (m) )}
Hypothesis h : Rd → R: the function to be learned to map a general input
x to expected output y

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Training and testing

The parametric form of h is fixed


Training: find the optimal parameters θ of function h based on the
training set {(x(1) , y (1) ), (x(2) , y (2) ), · · · , (x(m) , y (m) )}, usually by
minimizing some cost function
Testing: fix the found optimal parameters θ, given the input features of
one unseen example x, predict the output value y

[Diagram: in the training stage, the training inputs and outputs from the training set are fed into the learning algorithm to produce the hypothesis h; in the testing stage, a testing input is passed through h to produce the testing output]

If target variables y are continuous, the learning is a regression problem


If target variables y can only take a small number of discrete values
(classes), it is a classification problem
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Introduction to pattern recognition
Mathematical background

Evaluation

Apply the trained classifier to an independent validation set of labeled


samples
It is important both to measure the performance of the system and to
identify the need for improvements in its components
Compare the error rates on the training set and the validation set to
decide if it is overfitting or underfitting
High error rates on both the training set and the validation set: underfitting
Low error rate on the training set and high error rate on the validation set:
overfitting

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Design cycle

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Binary Classification Evaluation Criteria

A binary classifier classifies each sample into two classes (positive or


negative)
In general, the binary classifier outputs a continuous value, usually (but
not necessarily) in [0, 1]. One can choose a threshold value to
determine whether a sample is considered positive or negative
Error rate on the test sample set D = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})}:

E(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left( h(x^{(i)}) \neq y^{(i)} \right)

Accuracy on the test sample set:

acc(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left( h(x^{(i)}) = y^{(i)} \right) = 1 - E(h)
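
A small sketch of these two quantities, assuming NumPy and hypothetical 0/1 label arrays:

```python
import numpy as np

def error_rate(y_pred, y_true):
    """E(h): fraction of test samples where the prediction differs from the label."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(y_pred != y_true)

def accuracy(y_pred, y_true):
    """acc(h) = 1 - E(h): fraction of correctly predicted test samples."""
    return 1.0 - error_rate(y_pred, y_true)

y_true = np.array([1, 0, 1, 1, 0, 1])   # hypothetical ground-truth labels
y_pred = np.array([1, 0, 0, 1, 0, 1])   # hypothetical classifier outputs
print(error_rate(y_pred, y_true), accuracy(y_pred, y_true))
```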

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Classification Evaluation Criteria


Precision:
P = \frac{TP}{TP + FP}
Recall:
R = \frac{TP}{TP + FN}
Precision-Recall (P-R) curve

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Classification Evaluation Criteria

The F-1 measure tries to balance the contributions of precision and recall, and
gives a single value:

F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{\#\text{all samples} + TP - TN}

Receiver Operating Characteristic (ROC) curve. Different from the P-R
curve, the tested samples are ranked according to their positive
confidences.
The horizontal axis is the False Positive Rate (FPR) and the vertical axis is the
True Positive Rate (TPR):

TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{TN + FP}
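
A minimal sketch computing the quantities above from binary predictions, assuming NumPy and hypothetical labels (1 = positive, 0 = negative):

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # hypothetical ground truth
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # hypothetical thresholded predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives

precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # recall is also the true positive rate (TPR)
f1        = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)          # false positive rate

print(precision, recall, f1, fpr)
```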

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Classification Evaluation Criteria

Similar to the P-R curve, if one curve entirely encloses another, the former
represents better performance than the latter
It still poses difficulties if two curves intersect each other
It is more appropriate to use the Area Under the ROC Curve (AUC):

AUC = \frac{1}{2} \sum_{i=1}^{m-1} (x_{i+1} - x_i) \cdot (y_i + y_{i+1})
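
A sketch of this trapezoidal AUC formula, assuming NumPy and a hypothetical ROC curve given as (FPR, TPR) points sorted by increasing FPR:

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule:
    AUC = 1/2 * sum_i (x_{i+1} - x_i) * (y_i + y_{i+1})."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    return 0.5 * np.sum((fpr[1:] - fpr[:-1]) * (tpr[:-1] + tpr[1:]))

# Hypothetical ROC points, sorted by increasing FPR
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.5, 0.7, 0.9, 1.0]
print(auc_trapezoid(fpr, tpr))   # a value between 0 and 1
```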

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Learning schemes

Supervised learning
An “expert” provides a category label for each pattern in the training set
It may be cheap to collect patterns but expensive to obtain the labels
Unsupervised learning
The system automatically learns feature transformation from the training
samples without any annotation to best represent them
Weakly supervised learning (not covered in this course)
The supervision is inexact or only coarse
Example: learning image segmentation by providing image-level annotations
Semi-supervised learning (not covered in this course)
Some samples have labels, while some do not

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Deep learning

Deep learning aims at learning better feature representations

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Neural networks

Deep learning is based on neural networks

Neural networks originates back to 1970s-1980s

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Neural networks

A network of interconnected artificial neurons

It simulates some properties of biological neural networks: learning,
generalization, adaptivity, fault tolerance, distributed computation
Low dependence on domain-specific knowledge

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Neural networks

They provide a new suite of nonlinear algorithms for feature extraction
(using hidden layers) and classification (e.g., multi-layer perceptrons)
Existing feature extraction and classification algorithms can also be
mapped onto neural network architectures for efficient hardware
implementation
In spite of the seemingly different underlying principles, most of the
well-known neural network models are implicitly equivalent or similar to
classical statistical machine learning methods

Link between statistical learning and neural networks

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

What makes the difference?

Deep learning becomes popular again in 2010s


Large-scale training data
Super parallel computing power (e.g. GPU and TPU)

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Deep learning in neural networks

Became a hot topic in 2006


Hinton et al., “A Fast Learning Algorithm for Deep Belief Nets,” Neural
Computation, 2006
Other famous researchers in deep learning
Andrew Ng (Stanford), Yann LeCun (NYU), Yoshua Bengio (U of Montreal)
MIT Technology Review listed deep learning as one of the top-10
breakthrough technologies in 2013
Neural networks with more hidden layers
Many existing statistical models can be approximated as neural networks
with one or two hidden layers

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Success of deep learning

Speech recognition

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Success of deep learning

Object classification over 1 million images of 1000 classes


ImageNet Challenge 2012

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Success of deep learning

ImageNet Challenge 2013


All teams used deep learning
MSRA, IBM, Adobe, NEC, Clarifai, Berkley, U. Tokyo, UCLA, UIUC,
Toronto, etc.

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Introduction to pattern recognition
Mathematical background

Different types of deep learning

Back in 2006, the name “deep learning” proposed by G. Hinton mostly
described deep neural networks trained in the unsupervised learning setting
Restricted Boltzmann Machine
Deep Belief Network
Auto-encoder
Nowadays, deep learning research is dominated by supervised learning
approaches
Multi-layer perceptron (MLP)
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
In this course, we only focus on the basics of multi-layer perceptron

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

1 Introduction to pattern recognition

2 Mathematical background
Linear Algebra
Probability theory review
Multivariate Gaussian distributions

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Vector-vector multiplication

Inner product for two vectors x, y ∈ R^n. It can be used to measure the
similarity of two vectors:

x^T y \in \mathbb{R} = [x_1, x_2, \cdots, x_n] \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^{n} x_i y_i

Outer product of x ∈ R^m and y ∈ R^n:

xy^T \in \mathbb{R}^{m \times n} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} [y_1, y_2, \cdots, y_n] = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}
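
A minimal NumPy sketch of both products (the example vectors are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y                 # scalar: sum_i x_i * y_i
outer = np.outer(x, y)        # 3x3 matrix with entries x_i * y_j

print(inner)
print(outer)
```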

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Matrix-vector multiplication

Given a matrix A ∈ Rm×n and a vector x ∈ Rn , their product is a vector


y = Ax ∈ Rm .
If we write A by rows, Ax can be expressed as

y = Ax = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}

The ith entry of y is equal to the inner product of the ith row of A and x
If we write A in column form,

y = Ax = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n

y is a linear combination of the columns of A
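
A short NumPy sketch (with a hypothetical A and x) verifying the two viewpoints: each entry of y is the inner product of a row of A with x, and y is a linear combination of the columns of A:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # A in R^{2x3}
x = np.array([1.0, 0.5, -1.0])         # x in R^3

y = A @ x                                                    # standard product
y_rows = np.array([A[i] @ x for i in range(A.shape[0])])     # row viewpoint
y_cols = sum(A[:, j] * x[j] for j in range(A.shape[1]))      # column viewpoint

print(np.allclose(y, y_rows), np.allclose(y, y_cols))        # True True
```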


Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Matrix-vector multiplication

If we multiply A on the left by a row vector x^T, then y^T = x^T A ∈ R^n for
A ∈ R^{m×n}, x ∈ R^m
Expressing A in terms of its columns,
the ith entry of y^T is equal to the inner product of x and the ith column
of A.
Expressing A in terms of rows,
y^T is a linear combination of the rows of A


Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Matrix multiplication

The product of two matrices A ∈ R^{m×n} and B ∈ R^{n×p} is

C = AB \in \mathbb{R}^{m \times p}, \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

The (i, j)th entry of C is equal to the inner product of the ith row of A
and the jth column of B

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Matrix multiplication

If we represent B by columns, we can view C as matrix-vector products
between A and the columns of B
These matrix-vector products can in turn be interpreted using both
viewpoints given in the previous slides
If we represent A by rows, we can view the rows of C as vector-matrix
products between the rows of A and B

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Transpose, symmetric matrix, trace

The transpose of a matrix “flips” the rows and columns of a matrix.


Given A ∈ Rm×n , its transpose AT ∈ Rn×m
(AT )ij = Aji
(AT )T=A
(AB)T = B T AT
(A + B)T = AT + B T
Symmetric matrix: a square matrix A ∈ Rn×n is symmetric if A = AT
Trace of a square matrix A ∈ R^{n×n}, denoted as tr(A):

tr A = \sum_{i=1}^{n} A_{ii}

For A ∈ R^{n×n}, tr A = tr A^T
For A, B ∈ R^{n×n}, tr(A + B) = tr A + tr B
For A, B such that AB is square, tr AB = tr BA

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Norms

A norm of a vector, ∥x∥, is informally a measure of the “length” of the vector

Euclidean (or L2) norm:

\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}

A norm can be viewed as a distance and is any function f : R^n → R satisfying

For all x ∈ R^n, f(x) ≥ 0 (non-negativity)
f(x) = 0 if and only if x = 0 (definiteness)
For all x ∈ R^n, t ∈ R, f(tx) = |t| f(x) (homogeneity)
For all x, y ∈ R^n, f(x + y) ≤ f(x) + f(y) (triangle inequality)
L1 norm:

\|x\|_1 = \sum_{i=1}^{n} |x_i|

L∞ norm:
\|x\|_\infty = \max_i |x_i|

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Norms

Lp norm:

\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}

For matrices, we have the Frobenius norm:

\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\operatorname{tr}(A^T A)}
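
A brief NumPy sketch of these norms on a hypothetical vector and matrix:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l2   = np.sqrt(np.sum(x**2))               # Euclidean norm, same as np.linalg.norm(x)
l1   = np.sum(np.abs(x))                   # L1 norm
linf = np.max(np.abs(x))                   # L-infinity norm
p = 3
lp   = np.sum(np.abs(x)**p) ** (1.0 / p)   # Lp norm
frob = np.sqrt(np.trace(A.T @ A))          # Frobenius norm, same as np.linalg.norm(A, 'fro')

print(l2, l1, linf, lp, frob)
```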

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Linear Independence

A set of vectors {x1 , x2 , · · · , xn } ⊂ Rn is said to be (linearly) independent


if no vector can be represented as a linear combination of the remaining
vectors
Conversely, if one vector belonging to the set can be represented as a
linear combination of the remaining vectors, the vectors are said to be
(linearly) dependent:

x_n = \sum_{i=1}^{n-1} \alpha_i x_i

for some scalar values α_1, · · · , α_{n−1} ∈ R

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Inverse

The inverse of a matrix A ∈ Rn×n denoted as A−1 , such that


A−1 A = I = AA−1
A is invertible or non-singular if A−1 exists and non-invertible or
singular otherwise
Assume A, B ∈ Rn×n are non-singular
(A−1 )−1 = A
(AB)−1 = B −1 A−1
(A−1 )T = (AT )−1

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Orthogonal matrices

Two vectors x, y ∈ Rn are orthogonal if xT y = 0. A vector x ∈ Rn is


normalized if ∥x∥2 = 1.
A square matrix U ∈ Rn×n is orthogonal if all columns are orthogonal to
each other and are normalized. Its columns are referred to be
orthonormal.
In other words, the inverse of an orthogonal matrix is its transpose:
U^T U = I = U U^T
Operating on a vector with an orthogonal matrix does not change its
Euclidean norm:
\|Ux\|_2 = \|x\|_2
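
A small NumPy sketch (using a hypothetical 2-D rotation matrix, which is orthogonal) verifying U^T U = I and norm preservation:

```python
import numpy as np

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrices are orthogonal

x = np.array([3.0, -1.0])

print(np.allclose(U.T @ U, np.eye(2)))            # True: U^T U = I
print(np.linalg.norm(U @ x), np.linalg.norm(x))   # equal Euclidean norms
```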

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Quadratic forms

Given a square matrix A ∈ R^{n×n} and a vector x ∈ R^n, the scalar value
x^T A x is called a quadratic form:

x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j

Note that

x^T A x = x^T \left( \frac{1}{2} A + \frac{1}{2} A^T \right) x

Only the symmetric part of A contributes to the quadratic form
A symmetric matrix A ∈ S^n is positive definite if for all non-zero vectors
x ∈ R^n, x^T A x > 0, written as A ≻ 0
A symmetric matrix A ∈ S^n is positive semidefinite if for all vectors
x ∈ R^n, x^T A x ≥ 0, written as A ⪰ 0
A symmetric matrix A ∈ S^n is negative definite if for all non-zero vectors
x ∈ R^n, x^T A x < 0, written as A ≺ 0
A symmetric matrix A ∈ S^n is negative semidefinite if for all vectors
x ∈ R^n, x^T A x ≤ 0, written as A ⪯ 0

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Quadratic forms

Finally, a symmetric matrix A ∈ S n is indefinite, if it is neither positive


semidefinite nor negative semidefinite

Positive definite and negative definite matrices are always full rank

Gram matrix: given any matrix A ∈ R^{m×n}, the matrix G = A^T A is
always positive semidefinite. If m ≥ n and A is full rank, then G = A^T A is
positive definite

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Eigenvalues and eigenvectors

Given a square matrix A ∈ R^{n×n}, λ ∈ C is an eigenvalue of A and
x ∈ C^n is the corresponding eigenvector if
Ax = λx,  x ≠ 0
Solve the following equation:
(λI − A)x = 0,  x ≠ 0
(λI − A)x = 0 has a non-zero solution x if and only if (λI − A) has a
non-empty nullspace, i.e., (λI − A) is singular, i.e.,
|λI − A| = 0
Solving this equation leads to n (possibly complex) eigenvalues
λ_1, λ_2, · · · , λ_n. Solving (λ_i I − A)x = 0 leads to the n associated eigenvectors.

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Properties of eigenvalues and eigenvectors

tr A = \sum_{i=1}^{n} \lambda_i

|A| = \prod_{i=1}^{n} \lambda_i
The rank of A is equal to the number of non-zero eigenvalues of A
If A is non-singular then 1/λi is an eigenvalue of A−1 with associated
eigenvector xi , i.e., A−1 xi = (1/λi )xi
The eigenvalues of a diagonal matrix D = diag(d1 , · · · , dn ) are just the
diagonal entries d1 , · · · , dn

All the eigenvector equations can be formulated as


AX = XΛ
If the eigenvectors of A are linearly independent, then the matrix X will
be invertible, so A = XΛX −1 . A is called diagonalizable.
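
A NumPy sketch (with a hypothetical matrix) computing eigenvalues and eigenvectors and checking the properties above (trace, determinant, diagonalization):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                   # hypothetical square matrix

eigvals, X = np.linalg.eig(A)                # columns of X are the eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A @ X, X @ Lam))                      # A X = X Lambda
print(np.isclose(np.trace(A), eigvals.sum()))           # tr A = sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigvals.prod()))     # |A| = product of eigenvalues
print(np.allclose(A, X @ Lam @ np.linalg.inv(X)))       # A = X Lambda X^{-1}
```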

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Eigenvalues and eigenvectors of symmetric matrices

All the eigenvalues of a symmetric matrix A are real


The eigenvectors of A are orthonormal, i.e., the matrix X is an
orthogonal matrix (re-written as U )
A = U ΛU T
The definiteness of a matrix depends entirely on the sign of its eigenvalues
x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2, \quad \text{where } y = U^T x

Because y_i^2 is always nonnegative, the sign of this expression depends entirely
on the λ_i’s
Application of eigenvalues and eigenvectors: for a matrix A ∈ S^n, the
solutions of the following problems
max_{x ∈ R^n} x^T A x subject to \|x\|_2^2 = 1
min_{x ∈ R^n} x^T A x subject to \|x\|_2^2 = 1
are the eigenvectors corresponding to the maximal and minimal eigenvalues

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Derivatives

Recall that the derivative of a function f(x) is defined as

\frac{df(x)}{dx} = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}
Common functions:
\frac{d}{dx} c = 0
\frac{d}{dx} (ax) = a
\frac{d}{dx} x^a = a x^{a-1}
\frac{d}{dx} \log x = \frac{1}{x}
Rules:
Product rule: \frac{d}{dx} [f(x) g(x)] = f(x) g'(x) + f'(x) g(x)
Quotient rule: \frac{d}{dx} \frac{1}{f(x)} = \frac{-f'(x)}{f(x)^2}
Chain rule: \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Computational graph

Computational graph is a graphical representation of a function


composition
Example

u = bc, v = a + u, J = 3v

Calculate derivatives backward sequentially

\frac{\partial J}{\partial v}, \quad \frac{\partial J}{\partial u} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial u}, \quad \frac{\partial J}{\partial a} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial a}, \quad \frac{\partial J}{\partial b} = \frac{\partial J}{\partial u} \frac{\partial u}{\partial b}, \quad \frac{\partial J}{\partial c} = \frac{\partial J}{\partial u} \frac{\partial u}{\partial c}
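
A tiny sketch of this backward pass in plain Python (the input values are hypothetical), computing the derivatives in the order listed above:

```python
# Forward pass: u = b*c, v = a + u, J = 3*v
a, b, c = 5.0, 3.0, 2.0
u = b * c
v = a + u
J = 3.0 * v

# Backward pass: apply the chain rule from the output back to the inputs
dJ_dv = 3.0                 # J = 3v
dJ_du = dJ_dv * 1.0         # v = a + u  ->  dv/du = 1
dJ_da = dJ_dv * 1.0         # v = a + u  ->  dv/da = 1
dJ_db = dJ_du * c           # u = b*c    ->  du/db = c
dJ_dc = dJ_du * b           # u = b*c    ->  du/dc = b

print(dJ_dv, dJ_du, dJ_da, dJ_db, dJ_dc)   # 3.0 3.0 3.0 6.0 9.0
```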

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Matrix calculus

Suppose that f : Rm×n → R is a function that takes as input a matrix


A ∈ Rm×n and returns a scalar real value
The gradient of f with respect to A ∈ R^{m×n} is the matrix of partial
derivatives,

\nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}

where (\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Matrix calculus

The gradient of f with respect to x ∈ R^n is the vector of partial
derivatives,

\nabla_x f(x) \in \mathbb{R}^n = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}

Properties
∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x)
For t ∈ R, ∇x (tf (x)) = t∇x f (x)

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Gradients of linear and quadratic functions


For x ∈ R^n, let f(x) = b^T x for some known vector b ∈ R^n. Then

f(x) = \sum_{i=1}^{n} b_i x_i

so

\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k

We have
\nabla_x b^T x = b
For the quadratic function f(x) = x^T A x with A ∈ S^n,

\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j
= \frac{\partial}{\partial x_k} \Big[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \Big]
= \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k = 2 \sum_{i=1}^{n} A_{ki} x_i

We have \nabla_x x^T A x = 2Ax
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Least squares

Given A ∈ R^{m×n} and a vector b ∈ R^m, solve for x in

Ax = b

When m > n and A is full rank, there might not be an exact solution. We
minimize the following objective function instead:

\min_x \|Ax - b\|_2^2

We have
\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b
Set the gradient to 0:
\nabla_x (x^T A^T A x - 2 b^T A x + b^T b) = \nabla_x x^T A^T A x - \nabla_x 2 b^T A x + \nabla_x b^T b
= 2 A^T A x - 2 A^T b = 0

Solving for x yields

x = (A^T A)^{-1} A^T b
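
A NumPy sketch of this normal-equation solution on hypothetical data, compared against the library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))       # m = 50 > n = 3, full rank with probability 1
b = rng.standard_normal(50)

# Normal equations: x = (A^T A)^{-1} A^T b
x_normal = np.linalg.inv(A.T @ A) @ (A.T @ b)

# Library least-squares solver for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True
```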

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Elements of probability

Elements of probability
Sample space Ω: The set of all the outcomes of a random experiment.
Each ω ∈ Ω is a complete outcome at the end of the experiment.
Set of events or event space F: A set whose elements A ∈ F (called
events) are subsets of Ω.
Axioms of probability: A function P : F → R that satisfies the following
properties
P(A) ≥ 0, for all A ∈ F
P(Ω) = 1
If A_1, A_2, · · · are disjoint events, then P(∪_i A_i) = \sum_i P(A_i)

Conditional probability and independence


If event B has non-zero probability, the conditional probability of any event A
given B is

P(A|B) = \frac{P(A \cap B)}{P(B)}
Two events are independent if and only if P (A ∩ B) = P (A)P (B) (or
equivalently P (A|B) = P (A)).
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Random variables

Random variables
Random variable X is a function X : Ω → R.
For discrete random variable,
def
P (X = k) = P ({ω : X(ω) = k})
For continuous random variable,
def
P (a ≤ X ≤ b) = P ({ω : a ≤ X(ω) ≤ b})

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

PMF and PDF

Probability mass functions


If X is a discrete random variable, a probability mass function (PMF) is
a function pX : Ω → R
def
pX (x) = P (X = x)
0 ≤ p_X(x) ≤ 1
\sum_{\text{all } x} p_X(x) = 1
\sum_{x \in A} p_X(x) = P(X \in A)

Probability density functions


If X is a continuous random variable, a probability density function
(PDF) is a function fX : Ω → R
f_X(x) \overset{\text{def}}{=} \frac{dF_X(x)}{dx}
f_X(x) ≥ 0
\int_{-\infty}^{\infty} f_X(x)\, dx = 1
\int_{x \in A} f_X(x)\, dx = P(X \in A)

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Two random variables

Joint PMF
If X and Y are discrete random variables, the joint probability mass
function pXY : R × R → [0, 1]
pXY (x, y) = P (X = x, Y = y)
Marginal probability mass function p_X(x):
p_X(x) = \sum_y p_{XY}(x, y)

Joint PDF
If X and Y are continuous random variables, the joint probability density
function fXY : R × R → [0, 1]
f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x \partial y}
Marginal probability density function f_X(x):
f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Two random variables

Conditional distributions
Intuitive understanding: Given X = x, the probability mass function or
probability density function of Y
Discrete case:
p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}
Continuous case:
f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}

Bayes’ rule
Discrete case:
p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)} = \frac{p_{X|Y}(x|y)\, p_Y(y)}{\sum_{y'} p_{X|Y}(x|y')\, p_Y(y')}
Continuous case:
f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y)\, f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y')\, f_Y(y')\, dy'}

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Two random variables

Independence
Discrete case:
p_{XY}(x, y) = p_X(x) p_Y(y) for all possible x and y
p_{Y|X}(y|x) = p_Y(y) for all possible y wherever p_X(x) ≠ 0
Continuous case:
f_{XY}(x, y) = f_X(x) f_Y(y) for all possible x and y
f_{Y|X}(y|x) = f_Y(y) for all possible y wherever f_X(x) ≠ 0

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Maximum Likelihood Estimation


In statistics, maximum likelihood estimation (MLE) is a method of
estimating the parameters of a distribution by maximizing a likelihood
function, so that under the assumed statistical model the observed data is
most probable
Take a continuous distribution as an example, the probability density
function of a distribution f (y; θ) can be parameterized by parameters
θ = [θ1 , θ2 , · · · , θk ]T
Given all the observed data samples from the distribution
y = (y1 , · · · , yn ), the joint density of the samples is
Ln (θ) = Ln (θ; y) = f (y; θ)
The goal of maximum likelihood estimation is to find the values of the
model parameters that maximize the likelihood function over the
parameter space
L_n(\hat{\theta}; y) = \sup_{\theta \in \Theta} L_n(\theta; y)

In practice, it is often convenient to work with the natural logarithm of the


likelihood function, called the log-likelihood function
ℓ(θ; y) = ln Ln (θ; y)
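
A minimal sketch of MLE for a univariate Gaussian (assuming NumPy; the data are hypothetical). The log-likelihood is maximized in closed form by the sample mean and the biased sample variance, which we check by comparing the log-likelihood at the MLE against another parameter setting:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)     # hypothetical observed samples

def log_likelihood(mu, sigma2, y):
    """Log-likelihood of i.i.d. Gaussian samples: sum of log N(y_i; mu, sigma2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

# Closed-form MLE for a Gaussian: sample mean and (biased) sample variance
mu_hat = np.mean(y)
sigma2_hat = np.mean((y - mu_hat) ** 2)

print(mu_hat, sigma2_hat)
# The MLE attains a log-likelihood at least as high as other parameter values
print(log_likelihood(mu_hat, sigma2_hat, y) >= log_likelihood(1.5, 2.0, y))   # True
```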
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Multivariate Gaussian (normal) distribution

Multivariate Gaussian distribution


A random vector X = [X_1, · · · , X_n]^T has a multivariate Gaussian
distribution with mean µ ∈ R^n and covariance matrix Σ ∈ S^n_{++} (a positive
definite n × n matrix), denoted by X ∼ N(µ, Σ), if its PDF is given by

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

The argument of the exponential function, -\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu), is a
quadratic form in the vector variable x. For any vector x ≠ µ,

-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) < 0

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Interpretation of the covariance matrix


A diagonal covariance matrix can be viewed as a collection of n
independent Gaussian random variables with means µ_i and variances σ_i^2:

p(x; \mu, \Sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{1}{2\sigma_i^2} (x_i - \mu_i)^2 \right)
Shape of isocontours:
Diagonal covariance matrix: axis-aligned ellipsoids in Rn centered at µ with
axis length proportional to σ1 , σ2 , · · · , σn
Non-diagonal covariance matrix: rotated ellipsoids in Rn centered at µ with
axis length proportional to Σ’s eigenvalues.

   
[Figures: isocontours for µ = [3, 2]^T with Σ = [[25, 0], [0, 9]] (axis-aligned ellipses) and for µ = [3, 2]^T with Σ = [[10, 5], [5, 5]] (rotated ellipses)]
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Basic optimization
Pattern recognition systems usually involve optimizing some cost (or loss,
energy) function
If a function J(θ) has a global minimum or maximum, the minimal or
maximal point θ̂ must be reached where
∇_θ J(θ̂) = 0
Then θ̂ must be the global minimum or maximum

For functions with local minima and maxima, we cannot simply find the
optimal θ̂ by setting the gradient to 0

Prof. LI Hongsheng ENGG 5202: Pattern Recognition


Linear Algebra
Introduction to pattern recognition
Probability theory review
Mathematical background
Multivariate Gaussian distributions

Gradient descent

To find a local minimum, we can use the gradient descent algorithm
with an initial parameter vector θ^{(0)}

Gradient descent algorithm


For i = 1, 2, 3, · · ·
θ(i+1) = θ(i) − γ∇θ J(θ(i) )
Terminate iterations if i is large enough or ∥∇θ J(θ(i) )∥ is small
enough

γ is called the step size (or learning rate), −∇J(θ(i) ) is the negative
gradient direction.
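
A minimal sketch of this algorithm (assuming NumPy; the cost function, step size, and tolerance are hypothetical), applied to a simple quadratic cost J(θ) = ∥θ − θ*∥²:

```python
import numpy as np

theta_star = np.array([2.0, -1.0])           # hypothetical minimizer

def J(theta):
    return np.sum((theta - theta_star) ** 2)

def grad_J(theta):
    return 2.0 * (theta - theta_star)

theta = np.zeros(2)                          # initial parameters theta^(0)
gamma = 0.1                                  # step size (learning rate)

for i in range(1000):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-6:             # terminate when the gradient is small
        break
    theta = theta - gamma * g                # theta^(i+1) = theta^(i) - gamma * grad J

print(theta)   # close to theta_star
```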

[Figures: gradient descent iterations for a cost function of 1 variable and of 2 variables]
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
