Lecture W12ab

The document covers Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) as techniques for dimensionality reduction. PCA aims to project high-dimensional data onto a lower-dimensional surface to minimize projection error, while LDA focuses on maximizing class separation in supervised learning contexts. It also discusses various feature scaling methods and the mathematical formulations involved in both PCA and LDA.


CS-871: Machine Learning

Fall 2024 - Week 12


Principal Component Analysis – Linear Discriminant Analysis

Dr. M. Daud Abdullah Asif


Assistant Professor
Faculty of Computing, SEECS
Email: [email protected]
Dimensionality Reduction - Agenda
• Motivation I: Data Compression
• Motivation II: Visualization
• PCA Problem Formulation
• PCA Algorithm
• Reconstruction
• Number of PCs
• Applications of PCA
If we approximate the original data set by projecting all of the original examples onto this green line, we need only one number to represent the location of each training example after it has been projected onto that green line.
How can we visualize this data when it has a large number of features?
Can we simplify the features so that, instead of 50 values for a country, we can represent the information in 2D?
Reduce from 2D to 1D: project onto a straight line
The length of the blue line segments is the projection error. PCA finds a lower-dimensional surface (e.g., a line) onto which to project the data so that the sum of squares of these blue line segments is minimized.
Before applying PCA, it is standard practice to first perform mean normalization and feature scaling so that the features x1 and x2 have zero mean and comparable ranges of values.
Larger projection errors result from an irrelevant / inaccurate choice of projection direction.
Note the direction of the projection lines and the error lines.
Note the vertical axis
Summary
• PCA tries to find a lower dimensional surface onto which to
project the data, to minimize the squared projection error
• The goal is to minimize the square distance between each point
and the location of where it gets projected
• How to find the lower dimensional surface onto which to project
the data?
Different Types of Feature Scaling:

1. Mean normalization
Replace x_i with x_i − μ_i to make features have approximately zero mean (do not apply to the intercept feature x_0), then divide by max − min:

   x_i := (x_i − μ_i) / (max(x_i) − min(x_i))

2. Standardization
Subtract the mean and divide by the standard deviation:

   x_i := (x_i − μ_i) / s_i

3. Min-max Normalization

   x_i := (x_i − min(x_i)) / (max(x_i) − min(x_i)),   so that 0 ≤ x_i ≤ 1
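As an illustration, a minimal NumPy sketch of the three scaling variants above (assuming x is a 1-D array holding one feature across all training examples; the function names are illustrative, not from the lecture):

```python
import numpy as np

def mean_normalize(x):
    # zero mean, then divide by the feature's range (max - min)
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    # zero mean, unit standard deviation
    return (x - x.mean()) / x.std()

def min_max_normalize(x):
    # rescale so the result lies in [0, 1]
    return (x - x.min()) / (x.max() - x.min())
```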
Find U’s and Z’s
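A minimal sketch of how the U's (principal directions) and z's (lower-dimensional projections) could be computed with NumPy, assuming the data has already been mean-normalized; the names pca, U_reduce and Z are illustrative, not from the lecture:

```python
import numpy as np

def pca(X, k):
    # X: (m, n) mean-normalized data matrix, one example per row
    m = X.shape[0]
    Sigma = (X.T @ X) / m              # n x n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)     # columns of U are the principal directions
    U_reduce = U[:, :k]                # keep the first k directions
    Z = X @ U_reduce                   # project each example onto the k directions
    return Z, U_reduce

# Example: reduce 2-D data to 1-D
X = np.random.randn(100, 2)
X = X - X.mean(axis=0)                 # mean normalization
Z, U_reduce = pca(X, k=1)              # Z has shape (100, 1)
```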
Reconstruction
• So, given an unlabeled data set, we know how to apply PCA to take the high-dimensional features x and map them to a lower-dimensional representation z
• We also know how to take this lower-dimensional representation z and map it back to an approximation of the original high-dimensional data
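Continuing the pca sketch above, the approximate reconstruction is a single matrix product (a sketch, not the lecture's notation):

```python
# Map the compressed Z back to an approximation of the original data.
# X_approx has the original dimensionality; it equals X exactly only
# when no variance was discarded.
X_approx = Z @ U_reduce.T
```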
Linear Discriminant Analysis
• Supervised dimensionality reduction technique.

• Pre-processing step in many pattern recognition problems.

• Can be used for feature extraction.

• A linear transformation that maximizes the separation between multiple classes and minimizes the within-class variability.
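In practice, this kind of supervised projection is often obtained from a library. A minimal sketch using scikit-learn's LinearDiscriminantAnalysis on a toy dataset (assuming scikit-learn is available; the dataset is only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 3 classes, 4 features

# With g = 3 classes, LDA yields at most g - 1 = 2 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)            # supervised: uses the class labels y
print(Z.shape)                         # (150, 2)
```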
LDA
• Let us start with a data set which we can write as a matrix:
X = [ x_{1,1}  x_{1,2}  ...  x_{1,N}
      x_{2,1}  x_{2,2}  ...  x_{2,N}
        ...      ...           ...
      x_{n,1}  x_{n,2}  ...  x_{n,N} ]
• Each column is one data point and each row is a variable, but take care: sometimes the transpose convention is used
The mean adjusted data matrix
• We form the mean adjusted data matrix by subtracting the mean
of each variable
U = [ x_{1,1} - m_1  x_{1,2} - m_1  ...  x_{1,N} - m_1
      x_{2,1} - m_2  x_{2,2} - m_2  ...  x_{2,N} - m_2
           ...            ...                ...
      x_{n,1} - m_n  x_{n,2} - m_n  ...  x_{n,N} - m_n ]

• m_i is the mean of the data items in row i



Covariance Matrix

• The covariance matrix can be formed from the product:

  S = (1/N) U Uᵀ
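A quick NumPy check of this formula (a sketch; here U is the mean-adjusted matrix with one variable per row, as defined above):

```python
import numpy as np

n, N = 3, 100                                # 3 variables, 100 data points
X = np.random.randn(n, N)
U = X - X.mean(axis=1, keepdims=True)        # subtract each row's mean
S = (U @ U.T) / N                            # covariance matrix as on the slide

# np.cov with bias=True uses the same 1/N normalization
assert np.allclose(S, np.cov(X, bias=True))
```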
Geometric Idea

[Figure: class-labelled data plotted in the (x1, x2) plane, with the PCA directions f1, f2 and the LDA direction u marked.]

• PCA: (f1, f2), the directions of maximum variance
• LDA: u, the single direction that best separates the classes
Method (Additional Notes)
• Let the between-class scatter matrix S_b be defined as

  S_b = Σ_{i=1}^{g} N_i (x̄_i − x̄)(x̄_i − x̄)ᵀ

• and the within-class scatter matrix S_w be defined as

  S_w = Σ_{i=1}^{g} (N_i − 1) S_i = Σ_{i=1}^{g} Σ_{j=1}^{N_i} (x_{i,j} − x̄_i)(x_{i,j} − x̄_i)ᵀ

• where x_{i,j} is the j-th n-dimensional data point from class p_i, x̄_i is the mean of class p_i, x̄ is the overall mean, N_i is the number of training examples from class p_i, and g is the total number of classes or groups
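A sketch of how S_b and S_w could be computed from a labelled data set (names are illustrative; X holds one n-dimensional example per row and y holds the class labels):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (Sb) and within-class (Sw) scatter matrices."""
    overall_mean = X.mean(axis=0)
    n = X.shape[1]
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]                        # examples of class c
        Nc = Xc.shape[0]
        mc = Xc.mean(axis=0)                  # class mean
        d = (mc - overall_mean).reshape(-1, 1)
        Sb += Nc * (d @ d.T)                  # N_i (class mean - overall mean)(...)^T
        Dc = Xc - mc
        Sw += Dc.T @ Dc                       # sum over j of (x_{i,j} - class mean)(...)^T
    return Sb, Sw
```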
Method (Additional Notes cont.)
• It has been shown that P_lda is in fact the solution of the following eigensystem problem:

  S_b P − S_w P Λ = 0

• Multiplying both sides by the inverse of S_w:

  S_w⁻¹ S_b P − S_w⁻¹ S_w P Λ = 0
  S_w⁻¹ S_b P − P Λ = 0
  (S_w⁻¹ S_b) P = P Λ
Standard LDA (Additional Notes)
• If S_w is a non-singular matrix, then Fisher's criterion is maximised when the projection matrix P_lda is composed of the eigenvectors of

  S_w⁻¹ S_b

• with at most (g − 1) nonzero corresponding eigenvalues.

• (since there are only g points to estimate S_b)
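Continuing the scatter_matrices sketch above, the projection matrix can be obtained from the eigenvectors of S_w⁻¹ S_b (a sketch that assumes S_w is non-singular and reuses X, y, Sb and Sw from that sketch):

```python
import numpy as np

g = len(np.unique(y))                                       # number of classes
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))   # eig of Sw^-1 Sb
order = np.argsort(eigvals.real)[::-1]                      # largest eigenvalues first
P_lda = eigvecs[:, order[:g - 1]].real                      # at most g - 1 useful directions
Z = X @ P_lda                                               # project the data
```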
Questions?
