Data Pre-Processing-IV

(Feature Extraction: PCA)

DR. JASMEET SINGH
ASSISTANT PROFESSOR, CSED
TIET, PATIALA
Feature Extraction
Feature extraction creates new features from combinations of the original features.

For a given feature set (F1, F2, F3, …, Fn), feature extraction finds a mapping function f that maps it to a new feature set (F1′, F2′, F3′, …, Fm′) such that Fi′ = f(Fi) and m < n.

 For instance, F1′ = k1·F1 + k2·F2 (a minimal sketch of such a linear combination follows the list of methods below).

 Some commonly used methods are:


Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Linear Discriminant Analysis (LDA)
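As a small illustration of the linear-combination mapping above, here is a minimal NumPy sketch (the feature values and the weights k1, k2 are assumed, chosen purely for illustration):

import numpy as np

# Two original features F1 and F2 for a handful of examples (values are illustrative only)
F = np.array([[2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 7.0],
              [7.0, 8.0]])

k1, k2 = 0.8, 0.6                       # assumed weights of the linear combination
F1_new = k1 * F[:, 0] + k2 * F[:, 1]    # F1' = k1*F1 + k2*F2

print(F1_new)                           # one extracted feature replaces two original ones (m < n)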
Principal Component Analysis
Principal Component Analysis (PCA): It is a dimensionality-reduction technique that reduces a higher-dimensional feature space to a lower-dimensional feature space. It also helps to make the visualization of large datasets simpler.
Principal Component Analysis
Some of the major facts about PCA are:
 Principal components are new features that are constructed as linear combinations (mixtures) of the initial feature set.
 These combinations are formed in such a manner that all the newly constructed principal components are uncorrelated.
 Along with the reduction task, PCA also preserves as much information of the original data set as possible.
Principal Component Analysis
Some of the major facts about PCA are:
 Principal components are usually denoted by PCi, where i can be 1, 2, 3, …, n (depending on the number of features in the original dataset).
 The major proportion of the information about the original feature set is explained by the first principal component alone, i.e., PC1.
 The remaining information is captured by the other principal components, in proportions that decrease as i increases.
Principal Component Analysis
PCA- Geometrical Interpretation
 Geometrically, it can be said that principal components are lines pointing in the directions that capture the maximum amount of information about the data.
 Principal components also aim to minimize the error between the true locations of the data points (in the original feature space) and their projected locations (in the projected feature space).
 The larger the variance carried by a line, the larger the dispersion of the data points along it; and the larger the dispersion along a line, the more information it carries.

 Simply, principal components are new axes that give better data visibility, with clear differences between observations.
PCA- Geometrical Interpretation
 Suppose we have the following standardized data (as shown in Figure 1).
 If we have to choose one feature out of X1 and X2, we will choose X1 (the one which explains the maximum variation in the data).
 This is exactly what PCA does. It finds the features which have the maximum spread and drops the others, with the aim of minimizing information loss.
 Let us take a slightly more complex example where we cannot simply drop one feature (as shown in Figure 2).
 Here, both features X1 and X2 have equal spread, so we cannot tell which feature is more important.
 But if we try to find a direction (or axis) which explains the variation in the data, we can find a line which fits the data very well. So if we rotate our axes slightly by an angle θ, we get f1 and f2 (perpendicular to f1). We can then drop f2 and say that f1 is the most important feature. This is what PCA does, as illustrated in the sketch below.
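The rotation idea can be checked numerically. The sketch below (the data values are assumed, chosen only so that X1 and X2 have roughly equal spread) measures the variance of the data projected onto a direction u(θ) = (cos θ, sin θ) and picks the angle that maximizes it, which is exactly the rotated axis f1 described above:

import numpy as np

# Assumed 2-D data with similar spread along both axes (illustrative only)
X = np.array([[1.0, 1.2], [2.0, 1.8], [3.0, 3.3],
              [4.0, 3.9], [5.0, 5.1], [6.0, 5.8]])
Xc = X - X.mean(axis=0)                 # centre the data

thetas = np.linspace(0.0, np.pi, 180)   # candidate rotation angles
variances = []
for theta in thetas:
    u = np.array([np.cos(theta), np.sin(theta)])  # unit vector along this angle
    variances.append(np.var(Xc @ u))              # variance of the projections onto u

best = thetas[int(np.argmax(variances))]
print(f"direction of maximum spread: theta = {np.degrees(best):.1f} degrees")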
Mathematics behind PCA
Let X be a feature matrix, centered around the mean, of size n × k, where n denotes the number of examples and k denotes the number of features.
Let F1 be the direction along which the variance of the data is maximum, and let u be a unit vector in the direction of F1 (for simplicity, two-dimensional data is shown in the figure).
Let xi be any data point from X.

Projection of xi on u = |xi| cos θ = (xi · u)/|u| = xi · u

We have to find u such that the variance of the projected points xi · u is maximum over all xi.
Mathematics behind PCA (Contd…..)
Total variance of all data points xi:

V = (1/n) Σi (xi · u − x̄ · u)²

Since the feature matrix is centered around the mean, x̄ = 0 (the sum of deviations from the mean is zero).

Therefore, total variance = V = (1/n) Σi (xi · u)²

This approach is called the variance maximization approach.
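A minimal numerical sketch of this quantity (the data points are the ones used in the worked example later; the unit vector u is an arbitrary assumed direction): once the data are mean-centred, the projected variance (1/n) Σi (xi · u)² is simply the mean of the squared scalar projections.

import numpy as np

X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
Xc = X - X.mean(axis=0)                  # centre around the mean, so x_bar = 0

u = np.array([1.0, 1.0]) / np.sqrt(2.0)  # an arbitrary unit vector (assumed)
proj = Xc @ u                            # scalar projections x_i . u

V = np.mean(proj ** 2)                   # (1/n) * sum_i (x_i . u)^2
print(V)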


Mathematics behind PCA (Contd…..)
Another way to think of PCA is that it fits the best line passing through our data, with the aim of minimizing the projection error d for each point. This approach is called the distance minimization approach.

For each point, di² = |xi|² − (xi · u)²

So our optimization problem becomes

min over u of (1/n) Σi [ |xi|² − (xi · u)² ]

Notice that the two optimization problems, though they look different, are the same. Since the |xi|² term is independent of u, in order to minimize the function we have to maximize Σi (xi · u)², which is the same as our first optimization problem.
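A quick numerical check of this equivalence (same assumed data and arbitrary unit vector as before): the squared distance from a point to its projection onto the line through u equals |xi|² − (xi · u)², so minimizing the mean distance is the same as maximizing the mean squared projection.

import numpy as np

X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
Xc = X - X.mean(axis=0)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)       # arbitrary unit direction (assumed)

proj_len = Xc @ u                             # x_i . u
d_sq = np.sum((Xc - np.outer(proj_len, u))**2, axis=1)  # squared distance to the line

# Pythagoras: d_i^2 == |x_i|^2 - (x_i . u)^2 for every point
print(np.allclose(d_sq, np.sum(Xc**2, axis=1) - proj_len**2))   # True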
Mathematics behind PCA (Contd…..)
So, using both objectives, our optimization problem is

max over unit vectors u of (1/n) Σi (xi · u)²

Writing out all the summations grows tedious, so let us do our algebra in matrix form. If we stack our n data vectors into an n × k matrix X, then the projections are given by Xu, which is an n × 1 matrix.

Total variance = (1/n) |Xu|² = (1/n) (Xu)ᵀ(Xu) = uᵀ ((1/n) XᵀX) u = uᵀ S u

where S = (1/n) XᵀX is the covariance matrix, X being centered around the mean.
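A short check of this matrix identity (same assumed data and direction; S here uses the 1/n convention of this derivation): (1/n)|Xu|² equals uᵀSu.

import numpy as np

X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
u = np.array([1.0, 1.0]) / np.sqrt(2.0)   # arbitrary unit vector (assumed)

S = Xc.T @ Xc / n                         # covariance matrix, 1/n convention
lhs = np.linalg.norm(Xc @ u) ** 2 / n     # (1/n) |Xu|^2
rhs = u @ S @ u                           # u^T S u
print(np.isclose(lhs, rhs))               # True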


Mathematics behind PCA (Contd…..)
 The given optimization problem is solved using Lagrange optimization (which is used for constrained optimization).
 The method can be summarized as follows: in order to find the maximum or minimum of a function f(x) subject to the equality constraint g(x) = 0, form the Lagrangian function
L(x, λ) = f(x) − λ g(x)
and find the stationary points of L considered as a function of x and the Lagrange multiplier λ.
Mathematics behind PCA (Contd…..)
L(u, λ) = uᵀSu − λ(uᵀu − 1)

∂L/∂λ = −(uᵀu − 1)

∂L/∂u = 2Su − 2λu = 2(Su − λu)

For L to be maximum or minimum, set ∂L/∂u = 0 and ∂L/∂λ = 0.

Therefore, uᵀu = 1 and Su = λu.
Thus, u is an eigenvector corresponding to an eigenvalue of the covariance matrix S.
Since S is k × k, there will be k eigenvectors.
Let λ1 > λ2 > … > λk be the k eigenvalues of S.
Mathematics behind PCA (Contd…..)
At a stationary point with λ = λ1,

∂²L/∂u² = 2(S − λI) = 2(S − λ1 I)

The eigenvalues of 2(S − λ1 I) are 2(λi − λ1), where 1 ≤ i ≤ k
(because if λ1, λ2, …, λn are the eigenvalues of A, then λ1 − c, λ2 − c, …, λn − c are the eigenvalues of A − cI).
Hence all eigenvalues of 2(S − λ1 I) are negative or zero, so L is maximized when λ = λ1.

Thus, the principal components that capture the maximum variance of the input data points are the eigenvectors of the covariance matrix of the input feature matrix corresponding to the largest eigenvalues.
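This conclusion can be verified directly with NumPy's eigendecomposition routines (a sketch using the same assumed data; the sample-covariance 1/(n−1) convention of the worked example is used here): the eigenvector with the largest eigenvalue gives the largest value of uᵀSu, and that eigenvalue equals the variance it captures.

import numpy as np

X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (Xc.shape[0] - 1)         # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)      # eigh: ascending eigenvalues, orthonormal vectors
u1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue = PC1

print(eigvals[-1])                        # largest eigenvalue
print(u1 @ S @ u1)                        # variance along u1 (same value)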
Principal Component Analysis- Step-wise Working
Step 1: Construct the covariance matrix, named S.
The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see if there is any relationship between them.

S = (1/(n − 1)) XcᵀXc, where Xc is the feature matrix centered around the mean.
Alternatively, the covariance matrix S can be computed entry by entry, using cov(x, y) = (1/(n − 1)) Σ (x − x̄)(y − ȳ) for each pair of variables x, y.
Principal Component Analysis- Step-wise Working
Step 2: Compute the eigenvalues of the covariance matrix, using the equation
det(S − λI) = 0
The eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.

Step 3: Sort the eigenvalues in decreasing order and choose the largest ones; their share of the total variance decides how many principal components are retained.

Step 4: Compute the eigenvectors corresponding to the eigenvalues chosen in Step 3.
The eigenvectors of the covariance matrix are the directions of the axes along which there is the most variance (the most information), and these are what we call the principal components.

Step 5: Transform the data along the principal component axes.
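Putting the five steps together, here is a minimal NumPy sketch (not the author's code; variable names are illustrative, and the covariance uses the 1/(n−1) convention of the worked example that follows):

import numpy as np

def pca(X, m):
    """Project the n x k data matrix X onto its first m principal components."""
    Xc = X - X.mean(axis=0)                   # centre around the mean
    S = Xc.T @ Xc / (Xc.shape[0] - 1)         # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # Step 2: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]         # Step 3: sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :m]                        # Step 4: keep the top-m eigenvectors
    return Xc @ W, eigvals                    # Step 5: transform the data

X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
Z, eigvals = pca(X, m=1)
print(eigvals)    # variance captured by each principal component
print(Z)          # data expressed along the first principal component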
PCA –Numerical Example
Check (mathematically) whether the following two-dimensional data points can be transformed to one dimension using Principal Component Analysis.
If yes, determine the magnitude and percentage of the variance captured along the new principal component, and find the new principal component.
Data points (x, y): {(2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8)}
PCA –Numerical Example (Solution)
Step 1: Compute covariance matrix, S:

Feature matrix X =
[ 2  1
  3  5
  4  3
  5  6
  6  7
  7  8 ]

Mean vector μ = [4.5  5]

Feature matrix centered around the mean, Xc =
[ −2.5  −4
  −1.5   0
  −0.5  −2
   0.5   1
   1.5   2
   2.5   3 ]
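The mean vector and the centred matrix above can be reproduced directly (a minimal sketch):

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
print(X.mean(axis=0))        # [4.5  5. ]  -> mean vector
print(X - X.mean(axis=0))    # centred matrix Xc shown above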
PCA –Numerical Example (Solution)
Covariance matrix S = (1/(n − 1)) XcᵀXc = (1/5) XcᵀXc

Xcᵀ = [ −2.5  −1.5  −0.5   0.5   1.5   2.5
        −4     0    −2     1     2     3  ]

S = (1/5) [ 17.5  22 ]  =  [ 3.5  4.4 ]
          [ 22    34 ]     [ 4.4  6.8 ]

Alternately, the covariance between each pair of variables can be computed using the equation cov(x, y) = (1/(n − 1)) Σ (x − x̄)(y − ȳ), as shown on the next slide.
PCA –Numerical Example (Solution)
Step 1 (second method):

 x    y    x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
 2    1    −2.5    −4       6.25       16          10
 3    5    −1.5     0       2.25        0           0
 4    3    −0.5    −2       0.25        4           1
 5    6     0.5     1       0.25        1           0.5
 6    7     1.5     2       2.25        4           3
 7    8     2.5     3       6.25        9           7.5

x̄ = 4.5,  ȳ = 5

var(x) = (1/(n − 1)) Σ (x − x̄)² = 17.5/5 = 3.5
var(y) = (1/(n − 1)) Σ (y − ȳ)² = 34/5 = 6.8
cov(x, y) = (1/(n − 1)) Σ (x − x̄)(y − ȳ) = 22/5 = 4.4

S = [ 3.5  4.4 ]
    [ 4.4  6.8 ]
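Both routes give the same matrix as NumPy's sample covariance (np.cov divides by n − 1 by default), a one-line check:

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
print(np.cov(X, rowvar=False))   # [[3.5 4.4]
                                 #  [4.4 6.8]]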
PCA –Numerical Example (Solution)
Step 2: Find the eigenvalues of the covariance matrix.
Characteristic equation: |S − λI| = 0

| 3.5 − λ    4.4     |
|  4.4      6.8 − λ  |  = 0

(3.5 − λ)(6.8 − λ) − 19.36 = 0
23.8 − 6.8λ − 3.5λ + λ² − 19.36 = 0
λ² − 10.3λ + 4.44 = 0
λ = (10.3 ± √(106.09 − 17.76)) / 2 = (10.3 ± 9.40) / 2 ≈ 9.85, 0.45

Step 3: Magnitude of the variance captured along the first principal component ≈ 9.85
Percentage of the variance captured along the first principal component = 9.85 / (9.85 + 0.45) × 100% ≈ 95.6%
Yes, the data can be transformed to one dimension because most of the variance is captured along the first principal component.
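The eigenvalues and the explained-variance percentage can be checked numerically (a short sketch):

import numpy as np

S = np.array([[3.5, 4.4], [4.4, 6.8]])
eigvals = np.linalg.eigvalsh(S)            # ascending order: [~0.45, ~9.85]
print(eigvals)
print(eigvals[-1] / eigvals.sum() * 100)   # ~95.6 % of the variance lies along PC1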
PCA –Numerical Example (Solution)
Step 4: First principal component, i.e., the eigenvector for λ1 ≈ 9.85

(S − λ1 I) u = 0

( [ 3.5  4.4 ]  −  9.85 [ 1  0 ] ) u = 0
  [ 4.4  6.8 ]          [ 0  1 ]

[ −6.35   4.4  ] u = 0
[  4.4   −3.05 ]

[ 1     −0.69 ] u = 0
[ 4.4   −3.05 ]

u ≈ [ 0.57 ]   (length normalized)
    [ 0.82 ]
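Finally, the principal component itself and the one-dimensional transform can be verified (a sketch; the sign of an eigenvector is arbitrary, so NumPy may return the negated direction):

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)            # [[3.5 4.4] [4.4 6.8]]

eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
print(u1)                              # ~[0.57, 0.82] (up to sign)
print(Xc @ u1)                         # the data reduced to one dimension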
