Module 3
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
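As a minimal illustration of the cleaning and transformation tasks above, the following NumPy sketch fills missing values with the attribute mean and then applies min-max normalization (all data values are invented for illustration):

```python
import numpy as np

# Hypothetical attribute with missing values encoded as NaN
# (values invented for illustration).
ages = np.array([23.0, np.nan, 31.0, 40.0, np.nan, 26.0])

# Data cleaning: fill missing values with the attribute mean.
mean_age = np.nanmean(ages)                        # mean over non-missing entries
cleaned = np.where(np.isnan(ages), mean_age, ages)

# Data transformation: min-max normalization to [0, 1].
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())

print(cleaned)      # missing entries replaced by the attribute mean (30.0)
print(normalized)   # all values rescaled into [0, 1]
```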
Data Reduction
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
– The number of possible combinations of subspaces grows exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
Dimensionality Reduction
• Significant improvements can be achieved by first mapping
the data into a lower-dimensional space.
Feature Selection
In the presence of millions of features/attributes/inputs/variables,
select the most relevant ones.
Advantages: better, faster, and easier-to-understand learning
machines.
[Figure: feature selection maps the original m features to a smaller set of m’ selected features.]
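A minimal sketch of feature selection, assuming relevance is scored by absolute Pearson correlation with a target; the data and the choice of m’ = 2 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m = 5 features, but the target depends only on features 0 and 2
# (coefficients and noise level invented for illustration).
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Score each feature by |Pearson correlation| with the target,
# then keep the m' = 2 most relevant ones.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = sorted(np.argsort(scores)[-2:].tolist())

print(selected)  # indices of the two most relevant features
```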
Principal Component Analysis
• PCA is the most commonly used dimensionality
reduction technique.
– PCA is used to reduce the dimensions of data without much
loss of information.
– https://youtu.be/BfTMmoDFXyE
– https://setosa.io/ev/principal-component-analysis/
• What happens when a data set has too many
variables? Here are a few situations you
might come across:
• You find that most of the variables are
correlated.
• You lose patience and run a model on the
whole data, which returns poor accuracy.
• You become indecisive about what to do.
• You start thinking of a strategic method to
find a few important variables.
PCA is “an orthogonal linear transformation
that transforms the data to a new coordinate
system such that the greatest variance by any
projection of the data comes to lie on the first
coordinate (first principal component), the
second greatest variance on the second
coordinate (second principal component), and
so on.”
• Principal Component Analysis (PCA) is a multivariate technique that
allows us to summarize the systematic patterns of variations in the
data.
• From a data analysis standpoint, PCA is used for studying one table
of observations and variables with the main idea of transforming
the observed variables into a set of new variables, the principal
components, which are uncorrelated and explain the variation in
the data.
• Variance of an attribute A with N values a_1, ..., a_N:
Var(A) = \frac{1}{N-1}\sum_{i=1}^{N}(a_i - \bar{a})^2
• Covariance of two attributes A and B:
Cov(A, B) = \frac{1}{N-1}\sum_{i=1}^{N}(a_i - \bar{a})(b_i - \bar{b})
– Covariance matrix: for attributes A_1, ..., A_m, the m × m matrix whose
(j, k) entry is Cov(A_j, A_k); its diagonal entries are the variances.
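These definitions can be checked numerically; the following sketch computes the sample covariance from the formula and compares it against NumPy's covariance matrix (the attribute values are invented for illustration):

```python
import numpy as np

# Two small attributes; each index is one observation
# (values invented for illustration).
A = np.array([2.0, 4.0, 6.0, 8.0])
B = np.array([1.0, 3.0, 5.0, 7.0])

n = len(A)
# Sample covariance straight from the definition (divide by N - 1).
cov_AB = np.sum((A - A.mean()) * (B - B.mean())) / (n - 1)

# np.cov builds the full covariance matrix: variances on the diagonal,
# covariances off the diagonal.
C = np.cov(A, B)

print(cov_AB)    # hand-computed covariance
print(C[0, 1])   # same value from the covariance matrix
```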
PCA - Steps
− Suppose x_1, x_2, ..., x_M are N × 1 vectors.
− Step 1: compute the mean \bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i
− Step 2: subtract the mean: \Phi_i = x_i - \bar{x}
− Step 3: form the N × M matrix A = [\Phi_1 \; \Phi_2 \; \cdots \; \Phi_M],
then compute the covariance matrix C = \frac{1}{M} A A^T
PCA – Steps (cont’d)
− Step 4: compute the eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N
and the corresponding eigenvectors u_1, u_2, ..., u_N of C.
− Step 5: since C is symmetric, u_1, ..., u_N form an orthogonal basis, so each
centered vector can be written as x - \bar{x} = \sum_{i=1}^{N} b_i u_i,
where the b_i are the coordinates in the new basis.
PCA – Linear Transformation
If u_i has unit length, the coefficients follow directly:
b_i = u_i^T (x - \bar{x})
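The steps above can be sketched end-to-end in NumPy; this example uses synthetic 2-D data and `numpy.linalg.eigh` for the eigen-decomposition (all data and sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# M = 100 synthetic 2-D samples where the second attribute is roughly
# twice the first (data invented for illustration).
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2.0 * x1 + rng.normal(scale=0.3, size=100)])

# Steps 1-2: subtract the mean of each attribute.
mean = X.mean(axis=0)
Phi = X - mean

# Step 3: covariance matrix of the centered data.
C = Phi.T @ Phi / len(X)

# Step 4: eigen-decomposition; eigh suits symmetric matrices and returns
# orthonormal eigenvectors. Sort eigenvalues in decreasing order.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 5: coordinates in the new basis, b_i = u_i^T (x - mean);
# keeping only the first column reduces the data to one dimension.
b = Phi @ vecs[:, :1]

explained = vals[0] / vals.sum()
print(explained)  # fraction of variance captured by the first component
```

For this strongly correlated data the first component captures nearly all of the variance, which is why discarding the second costs little information.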
PCA
1. Given the original data set S = {x1, ..., xk},
produce a new data set by subtracting the mean
of attribute Ai from each xi.
Each principal component is a linear combination of the original
attributes; for example, with four attributes X1, X2, X3, X4:
PC1 = A1·X1 + A2·X2 + A3·X3 + A4·X4
PC2 = B1·X1 + B2·X2 + B3·X3 + B4·X4
PC3 = C1·X1 + C2·X2 + C3·X3 + C4·X4
Principal component analysis (PCA) involves a mathematical procedure
that transforms a number of (possibly) correlated variables into a
(smaller) number of uncorrelated variables called principal components.
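As a quick sanity check of this claim, the following sketch projects synthetic correlated data onto its principal components and verifies that the component scores are uncorrelated (the data construction is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Three correlated variables built from two hidden factors
# (construction invented for illustration).
f = rng.normal(size=(300, 2))
X = np.column_stack([f[:, 0], f[:, 0] + 0.5 * f[:, 1], f[:, 1]])

# PCA via eigen-decomposition of the covariance matrix.
Phi = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Phi.T))
scores = Phi @ vecs            # principal-component scores

# The components are uncorrelated: the covariance matrix of the
# scores is (numerically) diagonal.
S = np.cov(scores.T)
off_diag = np.abs(S - np.diag(np.diag(S))).max()
print(off_diag)  # effectively zero
```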