
Dimension Reduction

Feature Selection, Dimensionality Reduction

Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
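A minimal sketch of the cleaning and transformation tasks listed above, assuming a small hypothetical pandas DataFrame; the median imputation, z-score threshold, and min-max normalization are illustrative choices, not prescribed methods.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with a missing value and an extreme income
df = pd.DataFrame({"age": [25, 32, None, 41, 29],
                   "income": [40_000, 52_000, 48_000, 1_000_000, 45_000]})

# Data cleaning: fill missing values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Identify outliers with a simple z-score rule (illustrative threshold)
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() < 1.5]

# Data transformation: min-max normalization to [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
print(df)
```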

Data Reduction

• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results

• Why data reduction? A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

– Dimensionality reduction, e.g., remove unimportant attributes

Data Reduction : Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distances between points, which are critical to clustering and outlier analysis, become less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

Dimensionality Reduction
• Significant improvements can be achieved by first mapping the data into a lower-dimensional space.

• Dimensionality can be reduced by:
− Combining features using linear or non-linear transformations.
− Selecting a subset of features (i.e., feature selection).

Feature Selection
In the presence of millions of features/attributes/inputs/variables,
select the most relevant ones.
Advantages: build better, faster, and easier to understand learning
machines.
[Figure: feature selection maps the data matrix X with m features to a reduced matrix with m' < m selected features]
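A minimal feature-selection sketch in scikit-learn, assuming a labelled dataset; SelectKBest with the ANOVA F-score stands in for the generic idea of keeping the m' most relevant of the m original features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)   # m = 30 original features

# Keep the m' = 10 features with the highest ANOVA F-score w.r.t. the label
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)        # (569, 30) -> (569, 10)
```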
Principal Component Analysis
• PCA is the most commonly used dimension reduction technique.
– PCA is used to reduce the dimensions of the data without much loss of information.
– Used in machine learning, in signal processing, and in image compression.
– https://www.youtube.com/watch?v=BfTMmoDFXyE
– https://setosa.io/ev/principal-component-analysis/
• What happens when a data set has too many variables? Here are a few possible situations you might come across:
• You find that most of the variables are correlated.
• You lose patience and decide to run a model on the whole data. This returns poor accuracy and you feel terrible.
• You become indecisive about what to do.
• You start thinking of some strategic method to find a few important variables.
PCA is “an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (the first principal component), the second greatest variance on the second coordinate (the second principal component), and so on.”
• Principal Component Analysis (PCA) is a multivariate technique that
allows us to summarize the systematic patterns of variations in the
data.

• From a data analysis standpoint, PCA is used for studying one table
of observations and variables with the main idea of transforming
the observed variables into a set of new variables, the principal
components, which are uncorrelated and explain the variation in
the data.

• For this reason, PCA allows us to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variation in both the observations and the variables.
• In simple words, principal component analysis is a method
of extracting important variables (in form of components)
from a large set of variables available in a data set.
• It extracts a low-dimensional set of features from a high-dimensional data set, with the aim of capturing as much information as possible.
• With fewer variables, visualization also becomes much
more meaningful. PCA is more useful when dealing with 3
or higher dimensional data.
• It is always performed on a symmetric correlation or covariance matrix, which means the data must be numeric and should be standardized.
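A minimal sketch of the points above, assuming the iris data as a stand-in: standardize the variables first, then let PCA extract a small number of uncorrelated components.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                         # 4 correlated measurements

# PCA expects numeric, standardized data (zero mean, unit variance)
X_std = StandardScaler().fit_transform(X)

# Extract 2 uncorrelated principal components from the 4 original variables
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                           # (150, 2)
```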
PCA
Background for PCA
• https://www.youtube.com/watch?v=HMOI_lkzW08

• Suppose the attributes are A1 and A2, and we have n training examples. x’s denote values of A1 and y’s denote values of A2 over the training examples.

• Variance of an attribute:
var(X) = Σ (xi − x̄)² / (n − 1)

• Covariance of two attributes:
cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

• If the covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. If zero, the two dimensions are uncorrelated (there is no linear relationship between them).
• Covariance matrix
– Suppose we have n attributes, A1, ..., An.
– The covariance matrix is the n x n matrix whose (i, j) entry is cov(Ai, Aj); its diagonal entries are the variances var(Ai).
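A minimal numpy sketch of the variance and covariance definitions above, on hypothetical x and y columns; np.cov uses the same (n − 1) denominator.

```python
import numpy as np

# Hypothetical values of two attributes over n training examples
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

n = len(x)
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)                 # variance of x
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)    # covariance of x and y

# The 2 x 2 covariance matrix: variances on the diagonal, covariances off it
C = np.cov(x, y)
print(var_x, cov_xy)
print(C)
```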
PCA - Steps
− Suppose x1, x2, ..., xM are N x 1 vectors.
− Step 1: compute the mean x̄ = (1/M) Σ xi.
− Step 2: subtract the mean: Φi = xi − x̄ (i.e., center at zero).
− Step 3: form the N x M matrix A = [Φ1 Φ2 ... ΦM] and compute the covariance matrix C = (1/M) A Aᵀ.
− Step 4: compute the eigenvalues of C: λ1 ≥ λ2 ≥ ... ≥ λN.
− Step 5: compute the eigenvectors of C: u1, u2, ..., uN. Since C is symmetric, u1, ..., uN form an orthogonal basis, so any centered vector can be written as
x − x̄ = b1 u1 + b2 u2 + ... + bN uN, where bi = (x − x̄)ᵀ ui if ui has unit length.
− Step 6 (dimensionality reduction): keep only the K eigenvectors corresponding to the K largest eigenvalues, so that x − x̄ ≈ b1 u1 + ... + bK uK.

PCA – Linear Transformation
• The linear transformation RN → RK that performs the dimensionality reduction is:
y = [b1, b2, ..., bK]ᵀ = Uᵀ (x − x̄), where U = [u1 u2 ... uK].
PCA
1. Given the original data set S = {x1, ..., xk}, produce a new set by subtracting the mean of attribute Ai from each value of that attribute.

(In the worked example, the means of x and y are 1.81 and 1.91; after the adjustment both means are 0.)


2. Calculate the covariance matrix: for two attributes x and y this is the 2 x 2 matrix whose entries are cov(x, x), cov(x, y), cov(y, x), and cov(y, y).
3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix:
https://www.youtube.com/watch?v=IdsV0RaC9jM&t=8s
The eigenvector with the largest eigenvalue traces the linear pattern in the data.
4. Order the eigenvectors by eigenvalue, highest to lowest. In general you get n components. Construct the new feature vector from the p eigenvectors you decide to keep:
Feature vector = (v1, v2, ..., vp)
5. Derive the new data set:
TransformedData = RowFeatureVector × RowDataAdjust
This gives the original data in terms of the chosen components (eigenvectors), that is, along these axes.
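A minimal numpy sketch of steps 1-5 above, on a hypothetical 2-attribute data set; the eigenvectors are sorted by eigenvalue and the top one plays the role of the feature vector.

```python
import numpy as np

# Hypothetical 2-attribute data set (rows = examples, columns = attributes x, y)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                 [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Step 1: subtract the mean of each attribute
data_adjust = data - data.mean(axis=0)

# Step 2: covariance matrix of the mean-adjusted data
cov = np.cov(data_adjust, rowvar=False)

# Step 3: unit eigenvectors and eigenvalues (eigh handles symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: order eigenvectors by eigenvalue, highest to lowest, keep the top p
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order][:, :1]    # keep p = 1 component here

# Step 5: derive the new data set along the chosen component
transformed = data_adjust @ feature_vector
print(transformed)
```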
Dimensionality Reduction:
Nonlinear (Kernel) Principal Components Analysis

• Start with the original dataset X.
• Map X to a HIGHER-dimensional space and carry out LINEAR PCA in that space.
• (If necessary) map the resulting principal components back to the original space.
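A minimal kernel PCA sketch in scikit-learn, assuming the usual two-circles toy data; the RBF kernel and gamma value are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: structure that linear PCA cannot separate
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Implicitly map X to a higher-dimensional space (RBF kernel) and do linear PCA there
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10,
                 fit_inverse_transform=True)   # allows mapping back if needed
X_kpca = kpca.fit_transform(X)

# (If necessary) map the components back towards the original space
X_back = kpca.inverse_transform(X_kpca)
print(X_kpca.shape, X_back.shape)
```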
PRINCIPAL COMPONENT ANALYSIS: PCA

From a set of N correlated descriptors, we can derive a set of N uncorrelated descriptors (the principal components). Each principal component (PC) is a suitable linear combination of all the original descriptors. PCA reduces the dimensionality of the information needed from vast arrays of data in a way that minimizes the loss of information.

(from Nature Reviews Drug Discovery 1, 882-894 (2002): INTEGRATION OF VIRTUAL AND HIGH THROUGHPUT SCREENING, Jürgen Bajorath; and Materials Today: MATERIALS INFORMATICS, K. Rajan, October 2005)
I. Functionality 1 = F(x1, x2, x3, x4, x5, x6, x7, x8, ...)
   Functionality 2 = F(x1, x2, x3, x4, x5, x6, x7, x8, ...)
   ...
II. X1 = f(x2), X2 = g(x3), X3 = h(x4), ...
III. PC 1 = A1 X1 + A2 X2 + A3 X3 + A4 X4
     PC 2 = B1 X1 + B2 X2 + B3 X3 + B4 X4
     PC 3 = C1 X1 + C2 X2 + C3 X3 + C4 X4
     ...
Principal component analysis (PCA) involves a mathematical procedure
that transforms a number of (possibly) correlated variables into a
(smaller) number of uncorrelated variables called principal components.

The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
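A minimal sketch of this property, assuming the iris data again: explained_variance_ratio_ reports the fraction of total variability captured by each successive component, in decreasing order.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA().fit(X)                       # keep all components
# First entry is the share of variance captured by PC1, then PC2, and so on
print(pca.explained_variance_ratio_)
```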
