Module 3
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
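As a minimal illustration of the cleaning and transformation tasks above, the following NumPy sketch fills missing values with the attribute mean and then applies min-max normalization (all data values are invented for illustration):

```python
import numpy as np

# Hypothetical attribute with missing values encoded as NaN
# (values invented for illustration).
ages = np.array([23.0, np.nan, 31.0, 40.0, np.nan, 26.0])

# Data cleaning: fill missing values with the attribute mean.
mean_age = np.nanmean(ages)                        # mean over non-missing entries
cleaned = np.where(np.isnan(ages), mean_age, ages)

# Data transformation: min-max normalization to [0, 1].
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())

print(cleaned)      # missing entries replaced by the attribute mean (30.0)
print(normalized)   # all values rescaled into [0, 1]
```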
Data Reduction
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
– The number of possible combinations of subspaces grows exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
Dimensionality Reduction
• Significant improvements can be achieved by first mapping
the data into a lower-dimensional space.
Feature Selection
In the presence of millions of features/attributes/inputs/variables,
select the most relevant ones.
Advantages: better, faster, and easier-to-understand learning
machines.
[Figure: feature selection maps the original m features to a smaller set of m’ selected features.]
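A minimal sketch of feature selection, assuming relevance is scored by absolute Pearson correlation with a target; the data and the choice of m’ = 2 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m = 5 features, but the target depends only on features 0 and 2
# (coefficients and noise level invented for illustration).
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Score each feature by |Pearson correlation| with the target,
# then keep the m' = 2 most relevant ones.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = sorted(np.argsort(scores)[-2:].tolist())

print(selected)  # indices of the two most relevant features
```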
Principal Component Analysis
• PCA is the most commonly used dimensionality
reduction technique.
– PCA is used to reduce the dimensions of data without much
loss of information.
– https://youtu.be/BfTMmoDFXyE
– https://setosa.io/ev/principal-component-analysis/
• What happens when a data set has too many
variables? Here are a few situations you
might come across:
• You find that most of the variables are
correlated.
• You lose patience and run a model on the
whole data, which returns poor accuracy.
• You become indecisive about what to do.
• You start thinking of a strategic method to
find a few important variables.
PCA is “an orthogonal linear transformation
that transforms the data to a new coordinate
system such that the greatest variance by any
projection of the data comes to lie on the first
coordinate (first principal component), the
second greatest variance on the second
coordinate (second principal component), and
so on.”
• Principal Component Analysis (PCA) is a multivariate technique that
allows us to summarize the systematic patterns of variations in the
data.
• From a data analysis standpoint, PCA is used for studying one table
of observations and variables with the main idea of transforming
the observed variables into a set of new variables, the principal
components, which are uncorrelated and explain the variation in
the data.
• Variance of an attribute A with N values a_1, ..., a_N:
Var(A) = \frac{1}{N-1}\sum_{i=1}^{N}(a_i - \bar{a})^2
• Covariance of two attributes A and B:
Cov(A, B) = \frac{1}{N-1}\sum_{i=1}^{N}(a_i - \bar{a})(b_i - \bar{b})
– Covariance matrix: for attributes A_1, ..., A_m, the m × m matrix whose
(j, k) entry is Cov(A_j, A_k); its diagonal entries are the variances.
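These definitions can be checked numerically; the following sketch computes the sample covariance from the formula and compares it against NumPy's covariance matrix (the attribute values are invented for illustration):

```python
import numpy as np

# Two small attributes; each index is one observation
# (values invented for illustration).
A = np.array([2.0, 4.0, 6.0, 8.0])
B = np.array([1.0, 3.0, 5.0, 7.0])

n = len(A)
# Sample covariance straight from the definition (divide by N - 1).
cov_AB = np.sum((A - A.mean()) * (B - B.mean())) / (n - 1)

# np.cov builds the full covariance matrix: variances on the diagonal,
# covariances off the diagonal.
C = np.cov(A, B)

print(cov_AB)    # hand-computed covariance
print(C[0, 1])   # same value from the covariance matrix
```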
PCA - Steps
− Suppose x_1, x_2, ..., x_M are N × 1 vectors.
− Step 1: compute the mean \bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i
− Step 2: subtract the mean: \Phi_i = x_i - \bar{x}
− Step 3: form the N × M matrix A = [\Phi_1 \; \Phi_2 \; \cdots \; \Phi_M],
then compute the covariance matrix C = \frac{1}{M} A A^T
PCA – Steps (cont’d)
− Step 4: compute the eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N
and the corresponding eigenvectors u_1, u_2, ..., u_N of C.
− Step 5: since C is symmetric, u_1, ..., u_N form an orthogonal basis, so each
centered vector can be written as x - \bar{x} = \sum_{i=1}^{N} b_i u_i,
where the b_i are the coordinates in the new basis.
PCA – Linear Transformation
If u_i has unit length, the coefficients follow directly:
b_i = u_i^T (x - \bar{x})
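The steps above can be sketched end-to-end in NumPy; this example uses synthetic 2-D data and `numpy.linalg.eigh` for the eigen-decomposition (all data and sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# M = 100 synthetic 2-D samples where the second attribute is roughly
# twice the first (data invented for illustration).
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2.0 * x1 + rng.normal(scale=0.3, size=100)])

# Steps 1-2: subtract the mean of each attribute.
mean = X.mean(axis=0)
Phi = X - mean

# Step 3: covariance matrix of the centered data.
C = Phi.T @ Phi / len(X)

# Step 4: eigen-decomposition; eigh suits symmetric matrices and returns
# orthonormal eigenvectors. Sort eigenvalues in decreasing order.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 5: coordinates in the new basis, b_i = u_i^T (x - mean);
# keeping only the first column reduces the data to one dimension.
b = Phi @ vecs[:, :1]

explained = vals[0] / vals.sum()
print(explained)  # fraction of variance captured by the first component
```

For this strongly correlated data the first component captures nearly all of the variance, which is why discarding the second costs little information.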
PCA
1. Given the original data set S = {x1, ..., xk},
produce a new data set by subtracting the mean
of attribute Ai from each xi.
Each principal component is a linear combination of the original
attributes; for example, with four attributes X1, X2, X3, X4:
PC1 = A1·X1 + A2·X2 + A3·X3 + A4·X4
PC2 = B1·X1 + B2·X2 + B3·X3 + B4·X4
PC3 = C1·X1 + C2·X2 + C3·X3 + C4·X4
Principal component analysis (PCA) involves a mathematical procedure
that transforms a number of (possibly) correlated variables into a
(smaller) number of uncorrelated variables called principal components.
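As a quick sanity check of this claim, the following sketch projects synthetic correlated data onto its principal components and verifies that the component scores are uncorrelated (the data construction is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Three correlated variables built from two hidden factors
# (construction invented for illustration).
f = rng.normal(size=(300, 2))
X = np.column_stack([f[:, 0], f[:, 0] + 0.5 * f[:, 1], f[:, 1]])

# PCA via eigen-decomposition of the covariance matrix.
Phi = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Phi.T))
scores = Phi @ vecs            # principal-component scores

# The components are uncorrelated: the covariance matrix of the
# scores is (numerically) diagonal.
S = np.cov(scores.T)
off_diag = np.abs(S - np.diag(np.diag(S))).max()
print(off_diag)  # effectively zero
```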