Week 3
Data Preprocessing

Learning Objectives
Upon completion, you will be able to:

● Explain data pre-processing tasks.


● Illustrate methods to handle missing values and noisy data.
● Explain the importance of outlier removal and redundant data removal from datasets.
● List the methods for dimensionality reduction and numerosity reduction.
● Define data discretization and its methods.

● Explain data transformation and the importance of normalization.


● Demonstrate typical data pre-processing tasks in Python.

Data Quality and Data Format
Overview

Agenda
In this session, we will discuss:
● Concepts of Data Pre-processing:
○ Data Quality
○ Data Formats
○ Major Tasks in Data Pre-processing

Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view

● Accuracy: whether values are correct or incorrect, accurate or not.
● Completeness: not recorded, unavailable, missing values, important variables not included.
● Consistency: dangling references; some features are modified while others are not.
● Interpretability: how easily the data can be understood; codes as variable names or coded values, and nominal values can carry semantic ambiguity in the data.
● Timeliness: is the data updated in a timely manner?
● Believability: how much of the data is trusted, as perceived by the end user.
● Evaluate all of the above to assess the data’s fitness for the task.
Data Formats: Tidy Data
1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.


        Var 1   Var 2   …       …       Var n
Obs 1   2.3     34      Yes     123.45  0.3
Obs 2   3.6     23      No      567.34  0.7
Obs n   5.6     56      No      112.7   0.56

● Provides a standard way of structuring a dataset.


● Makes it easier to extract needed variables for analysis.

Data Formats: Wide Format vs Long Format
● “wide” format: consider variable “Math/English”

Name    Math    English
Anna    86      90
John    43      75
Cath    80      82

● “long” format: consider the variable “Subject”


Name    Subject    Grade
Anna    Math       86
Anna    English    90
John    Math       43
John    English    75
Cath    Math       80
Cath    English    82
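Below is a minimal pandas sketch of converting between the two formats; the DataFrame and column names follow the example tables above.

import pandas as pd

# Wide format: one row per student, one column per subject.
wide = pd.DataFrame({
    "Name": ["Anna", "John", "Cath"],
    "Math": [86, 43, 80],
    "English": [90, 75, 82],
})

# Wide -> long: each (student, subject) pair becomes its own row.
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Grade")

# Long -> wide: pivot the Subject values back into columns.
wide_again = long.pivot(index="Name", columns="Subject", values="Grade").reset_index()

print(long)
print(wide_again)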
Data Pre-processing: Major Tasks
● Data cleaning
○ Handling missing values and noisy data, resolving inconsistencies, and identifying or removing outliers
● Data integration
○ Integration of multiple databases, data cubes, or files
● Data reduction

○ Dimensionality reduction (PCA)


○ Numerosity reduction
● Data transformation
○ Normalization
○ Data discretization
○ Concept hierarchy generation
Data Pre-processing: Major Tasks

Tasks                                     Methods
Missing values, Noisy data                Binning, Histogram analysis, Regression, Clustering, Classification
Outliers, Redundancy                      Box plots, Correlation/covariance
Dimensionality reduction                  PCA, Feature selection
Numerosity reduction                      Sampling, Data compression
Data discretization, Scale differences    Data Normalization, Concept hierarchy

Summary
In this session, we discussed:

● Data quality: format, accuracy, completeness, consistency, timeliness, believability, interpretability.


● Tidy data provides a standard way of structuring a dataset.
● Major pre-processing tasks: data cleaning, data integration, data reduction, and data transformation.

Tasks and Methods:
Missing Values and Noisy Data

Agenda
In this session, we will discuss:
● Different Tasks and Methods
○ Missing values
○ How to handle missing data?
○ Simple Linear Regression
○ Multiple Linear Regression

○ Noisy data

Missing Values
● Empty cells or cells filled with “NA”-like tokens.
● Semantics of missing data
○ An empty data cell could mean:
■ Value exists
● Value is available but not recorded due to human error, for example
○ Negative findings are left empty (e.g., negative for asymmetric binary variables)
● Value is not available (e.g., I don’t know my grandpa’s birthday)
■ Value does not exist:
● Absence of a value (I don’t have a middle name)
● Not applicable (I don’t have a tail)
○ Different semantics should be encoded as different values: NA (not applicable), Missing (applicable but not available), etc.
How to Handle Missing Data?
● Ignore the tuples with missing value
○ when the class label is missing (when doing classification)
○ not effective when the percentage of missing information varies greatly per attribute - resulting
in a large number of tuples not being included in analyses.
● Fill in the missing value manually: major feasibility issue

● Replace empty cells with “NA”, “Missing”, etc. For more, see https://support.datacite.org/docs/schema-values-unknown-information-v42

How to Handle Missing Data?
● Fill in automatically (imputation) with:
○ A global constant: e.g., NA. Not ideal but often done
○ The attribute mean/median/mode
○ The mean/median/mode for all data objects in the same class (smarter)
○ The most probable value: regression- or inference-based, such as Bayesian inference or a decision tree. Best, but is this problem-free?
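A small pandas sketch of these imputation options; the DataFrame and column names ("income", "segment") are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "income": [52000, None, 61000, None, 48000],
    "segment": ["A", "A", "B", "B", "A"],
})

# Global constant (not ideal, but often done)
df["income_const"] = df["income"].fillna(-1)

# Attribute mean / median
df["income_mean"] = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())

# Class-conditional mean: fill with the mean of objects in the same class/segment
df["income_by_class"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean")
)

print(df)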

Simple Linear Regression
● A statistical method that summarizes and studies the relationships between two continuous
(quantitative) variables
○ Independent (predictor) variable: x = height
○ Dependent (response) variable: y = weight
● Goal: find the best straight line that fits the data
○ y = bx + a
● Method: find a and b that minimize the objective function, the sum of squared residuals Σi (yi − (b·xi + a))².

● How good is the fit? Coefficient of determination (R squared; 1 is the best fit), adjusted R squared.
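A short NumPy sketch of fitting y = bx + a by least squares and computing R²; the height/weight values are made up for illustration.

import numpy as np

height = np.array([50, 55, 60, 62, 65, 70])         # x (inches), illustrative values
weight = np.array([120, 135, 150, 158, 170, 185])   # y (lbs), illustrative values

# Least-squares fit of y = b*x + a
b, a = np.polyfit(height, weight, deg=1)

pred = b * height + a
residuals = weight - pred

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot
r_squared = 1 - np.sum(residuals**2) / np.sum((weight - weight.mean())**2)

print(f"y = {b:.2f}x + {a:.2f}, R^2 = {r_squared:.3f}")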
Simple Linear Regression
[Scatter plot of weight (lbs) vs. height (inches) with the fitted line y = bx + a. The observed point (55, 100) lies below the line; ‘r’ marks its residual, the difference between the true value and the predicted value: r = 100 − 150 = −50.]
Multiple Linear Regression
● Multiple linear regression (more than one independent variable; X and β are vectors).
● Tips on choosing the best model:
○ http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-choose-the-best-regression-model
● Use for:
○ missing values: use predicted values to replace missing values (see the sketch after this list).

○ data smoothing: use predicted values to replace original data.


○ data reduction: save only the function, parameters, and outliers (not the original data for the predicted dimensions).
○ outlier detection: identify (visualize) data that are far away from the predicted values.
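A hedged sketch of the first use above, replacing missing values with regression predictions; the DataFrame, the column names, and the choice of scikit-learn's LinearRegression are assumptions for illustration.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "height": [55, 60, 62, 65, 70, 58],
    "age":    [20, 25, 30, 35, 40, 22],
    "weight": [100, 150, 160, None, 190, None],   # target with missing values
})

known = df[df["weight"].notna()]
unknown = df[df["weight"].isna()]

# Fit weight ~ height + age on the complete rows
model = LinearRegression().fit(known[["height", "age"]], known["weight"])

# Use predicted values to replace the missing ones
df.loc[df["weight"].isna(), "weight"] = model.predict(unknown[["height", "age"]])
print(df)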

Noisy Data: Noise
● Noise has two main sources:
○ Implicit inaccuracies caused by measuring devices
○ Random errors caused by human errors or other issues
● Noise can occur in attribute names and attribute values, including class labels


Noisy Data: How to Handle Noisy Data?
● Binning/Histogram analysis
○ First, sort data and partition it into (e.g., equal-frequency) bins.
○ Then smooth by bin means, smooth by bin median, or smooth by bin borders.
● Regression
○ Smooth by fitting the data to regression functions

● Clustering
○ Smooth data by cluster centres
○ Detect and remove outliers/errors

Noisy Data: How to Handle Noisy Data?
● Truncation
○ Truncate the least significant digits in a real number
● Combined computer and human inspection
○ Detect suspicious values and check by humans

Noisy Data: Smooth by Binning
● Divide sorted data into bins.
● Partitioning rules:
○ Equal-width: equal bin range
○ Equal-frequency (or equal-depth): equal # of
data points in the bins
● For data smoothing/discretization, replace data with the bin mean, median, etc., or the bin label.
● In effect, this also reduces the number of distinct data values (the cardinality of the variable).

Noisy Data: Equal-width binning
● Equal-width (interval) partitioning
○ Divides the range into N bins of equal intervals.
○ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.

○ In practice, the Freedman–Diaconis rule works well (among other rules): W = 2 × IQR × n^(−1/3), N = (B − A)/W

○ The most straightforward, but outliers may dominate the presentation.


○ Skewed data is not handled well.

Noisy Data: Equal-Depth Binning
● Equal-depth (count, frequency) partitioning
○ Divides the entire range into N bins of equal number of data points.
○ Good data scaling with varied bin width


Noisy Data: Example Equal-Depth Binning for Data Smoothing
● Sorted data for the price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
○ Partition into equal-frequency (equi-depth) bins:
■ Bin 1: 4, 8, 9, 15
■ Bin 2: 21, 21, 24, 25
■ Bin 3: 26, 28, 29, 34
○ Smoothing by bin boundaries:
■ Bin 1: 4, 4, 4, 15
■ Bin 2: 21, 21, 25, 25
■ Bin 3: 26, 26, 26, 34
○ Smoothing by bin means:
■ Bin 1: 9, 9, 9, 9
■ Bin 2: 23, 23, 23, 23
■ Bin 3: 29, 29, 29, 29
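A short pandas sketch reproducing this example with equal-frequency binning (pd.qcut) and smoothing by bin means; note the exact bin means are 9, 22.75, and 29.25, which the slide rounds to 9, 23, and 29.

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equal-depth) partitioning into 3 bins
bins = pd.qcut(prices, q=3, labels=["Bin 1", "Bin 2", "Bin 3"])

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))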
Noisy Data: Clustering
● Partition continuous, discrete, or mixed datasets into clusters
based on similarity [distance].
○ There are many choices of distance functions, clustering
definitions, and clustering algorithms
● Can be used for smoothing noisy data, outlier detection, numerosity reduction, and data discretization.


○ Data smoothing/discretization: take cluster means,
median, etc.
○ Data reduction: store cluster representation only
○ Outlier detection: visualize data points far away
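A hedged sketch using scikit-learn's KMeans for smoothing by cluster centres and flagging far-away points; the synthetic data and the 3-standard-deviation distance threshold are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial clusters plus one far-away point
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    [[12.0, -3.0]],   # likely outlier
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Data smoothing: replace each point by its cluster centre
smoothed = km.cluster_centers_[km.labels_]

# Outlier detection: flag points far from their cluster centre
dist = np.linalg.norm(data - smoothed, axis=1)
outliers = data[dist > dist.mean() + 3 * dist.std()]
print("flagged outliers:\n", outliers)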

Noisy Data: Clustering
● Can be very useful if the data is clustered, but it cannot be effective if the data is “splattered.”
● Can use hierarchical clustering and be stored in multi-dimensional index tree structures.
● A non-parametric method: no assumptions. Let the data tell the story.

Summary
In this session, we discussed,

● Empty cells or cells filled with “NA”-like tokens are referred to as missing data.
● Noisy Data can be implicit errors introduced by measurement tools, such as different types of
sensors, or random errors.
● There are different ways to handle missing data and noisy data, including various imputation methods and data smoothing methods.

Tasks and Methods:
Outliers and Data Redundancy

Agenda
In this session, we will discuss:
● Tasks and Methods
○ Outliers
○ Data Redundancy


Outliers: Outlier Detection
● Exploratory data analysis:
○ Data summary plots – boxplots
○ Histogram analysis
● Regression
○ Data that doesn’t fit the known distribution model are outliers.
● Clustering
○ Outliers form small and distant clusters or are not included in any cluster.
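A minimal sketch of the boxplot (1.5 × IQR) rule for flagging outliers; the data values are invented for illustration.

import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 120])   # 120 looks suspicious

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)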

Data Redundancy: Data Integration
● Data integration:
○ Data from multiple sources is combined into a coherent storage.
● Database schema integration
○ Challenging; requires carefully examining metadata that originates from various sources.

● Data redundancy, e.g., entity identification problem:


○ Identify real-world entities from a variety of data sources
● Detecting and resolving data value conflicts and scale differences.
○ Attribute values from different sources differ for the same real-world
item.
○ Possible reasons: different representations (e.g., date, GPA), different scales, e.g., metric vs. British units.
Data Redundancy: Handling Redundancy in Data Integration
● Redundant attributes may be detected by correlation analysis or covariance analysis.

● Redundant attributes should be removed

● Attributes that are correlated but not redundant should often be kept.

● Careful integration of data from various sources may aid in the reduction/avoidance of redundancies
and inconsistencies, as well as the improvement of mining speed and quality.

Data Redundancy: Correlation Analysis (Nominal Data)
                               Play chess [c1]   Not play chess [c2]   Sum (row)
Like science fiction [r1]      250 (90)          200 (360)             450
Not like science fiction [r2]  50 (210)          1000 (840)            1050
Sum (col.)                     300               1200                  1500 [n]

(Expected counts under independence are shown in parentheses.)

Data Redundancy: Chi-Square Calculation
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

● H0: A and B are not correlated. alpha = 0.001


● Χ² (chi-square) value calculation: Χ² = Σ (observed − expected)² / expected = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 507.93

● Using the Χ2 table (next slide), we find the critical value=10.828 for the alpha and d.f.=1
● Χ2 > 10.828, reject H0, so A and B are correlated.
● Most tests will give you a p-value; if p-value < alpha, reject H0.
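A quick check of this example with SciPy's chi2_contingency; correction=False is passed so the 2×2 table is evaluated without Yates' continuity correction, matching the hand calculation.

import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / not like science fiction; columns: play / not play chess
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
print("expected counts:\n", expected)
# chi2 is about 507.9, far above 10.828 (critical value at alpha = 0.001),
# so we reject H0: the two attributes are correlated.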

Data Redundancy: Critical Value Table
Critical values of the Chi-square distribution with d degrees of freedom. [Table not reproduced; for alpha = 0.001 and d.f. = 1, the critical value is 10.828.]

Data Redundancy: Correlation Analysis (Numeric Data)
● Correlation coefficient (also called Pearson’s product moment coefficient) [-1, 1]

r(A,B) = Σ(ai − Ā)(bi − B̄) / (n·σA·σB) = (Σ(ai·bi) − n·Ā·B̄) / (n·σA·σB)

○ where n is the number of tuples, Ā and B̄ are the respective means of A and B,
○ σA and σB are the respective standard deviations of A and B,
○ Σ(ai·bi) is the sum of the AB cross-product.

● If r(A,B) > 0, A and B are positively linearly correlated (A’s values increase as B’s do). The higher the value of r(A,B), the stronger the correlation.
● r(A,B) = 0: not linearly correlated; may still be associated in other ways.
● r(A,B) < 0: negatively linearly correlated.
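A small NumPy sketch computing r both from the formula above and with np.corrcoef; the data values are illustrative.

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Pearson correlation via the formula above (population standard deviations, ddof=0)
r = ((A - A.mean()) * (B - B.mean())).sum() / (len(A) * A.std() * B.std())
print(r)                        # manual computation
print(np.corrcoef(A, B)[0, 1])  # same value from NumPy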

Data Redundancy: Visually Evaluating Correlation

[Scatter plots showing Pearson correlation coefficients ranging from −1 to 1.]

Data Redundancy: Covariance (Numeric Data)
● Covariance is similar to correlation:

Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ(ai − Ā)(bi − B̄) / n = E(A·B) − Ā·B̄

Contrast with the correlation coefficient: r(A,B) = Cov(A, B) / (σA·σB)

○ where n is the number of tuples, Ā and B̄ are the respective means or expected values (E) of A and B, and σA and σB are the respective standard deviations of A and B.

Data Redundancy: Covariance (Numeric Data)
● Positive covariance: Cov(A, B) > 0, indicating A and B both tend to be larger than their expected values.

● Negative covariance: Cov(A, B) < 0, indicating the two variables change in different directions: one is larger and the other is smaller than its expected value.

● Independence: Cov(A, B) = 0, but the reverse is not true:

○ Some random variable pairings may have a covariance of zero but they are not independent. A
covariance of 0 implies independence only under certain additional conditions (for example,
the data have multivariate normal distributions).

Data Redundancy: Co-Variance: An Example

● Suppose two stocks A and B have the following values in one week: (2,5), (3, 8), (5, 10), (4,
11), (6, 14).
● Question: Do the prices of A and B rise or fall together?


● E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
● E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
● Cov(A,B) = E(A·B) − E(A)·E(B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
● Thus, A and B rise together since Cov(A, B) > 0.
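A quick NumPy check of this calculation; bias=True is passed to np.cov so it uses the population formula (division by n), as in the example above.

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

cov_manual = np.mean(A * B) - A.mean() * B.mean()   # E(AB) - E(A)E(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]           # population covariance

print(cov_manual, cov_numpy)   # both 4.0 -> A and B tend to rise together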

Summary
In this session, we discussed,

● Outliers can be detected with boxplots, histogram analysis, regression, and clustering.


● Data redundancy occurs mostly because of data integration, and redundant attributes may be
detected by correlation or covariance analysis.
● Redundant attributes should be removed.

● Correlated attributes are often useful in mining tasks.

Tasks and Methods:
Dimensionality Reduction and Numerosity Reduction

Agenda
In this session, we will discuss:
● Tasks and Methods
○ Dimensionality reduction
○ Curse of Dimensionality and data sparseness
○ PCA – Principal Component Analysis
○ Numerosity reduction and random sampling methods

Data Reduction Strategies
● Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

● Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a long time on the complete data set.

Data Reduction Strategies
● Data reduction strategies
○ Dimensionality reduction, e.g., removing or merging attributes
■ Principal Components Analysis (PCA).
■ Feature subset selection, feature creation
○ Numerosity reduction (reduce data volume, use smaller forms of data representation)
■ Regression

■ Histograms/binning, clustering, sampling


■ Data cube aggregation
○ Data compression

Dimensionality Reduction: Curse of Dimensionality
● Curse of dimensionality
○ When dimensionality of features in the dataset increases, data becomes increasingly sparse in
feature space.
○ Density and distance between points, which are important for grouping and outlier analysis,
become less relevant.
○ The number of possible subspace combinations will expand exponentially.
● Dimensionality reduction

○ Avoid the curse of dimensionality by reducing features


○ Dimensionality reduction helps eliminate irrelevant features and reduce noise.
○ Reduces the time and space required in data mining.
○ Makes the data easier to visualize.
● Dimensionality reduction techniques
○ Principal Component Analysis
○ Supervised techniques
○ Nonlinear techniques (e.g., feature selection)
Curse of Dimensionality: sparseness

● A single feature does not result in a perfect separation of our training data.
● Adding a second feature still does not result in a linearly separable classification problem.
● Adding a third feature results in a linearly separable classification problem in our training data.
Sparseness: More Training Data Needed

Sparseness -> Everything is Equal-Distanced


● With increased dimensionality, the hypersphere occupies only a very small portion of the search space; all training examples are essentially located in the corners.
● When dim -> infinity, all training examples are at the same distance from all other examples.
Principal Component Analysis (PCA): Numeric Data
● Finds the projection that captures the largest amount of variation in the data.
● The original data can be projected onto a much smaller space, which reduces dimensionality while keeping variability. We find the eigenvectors (“characteristic” vectors) of the covariance matrix, and these eigenvectors define the new space.

Principal Component Analysis (PCA)
• First three PCs capture 75% of original
variance based on loadings.
• Component values are weighted sum of
the original dimensions.
• Comp1 = 0.361*Sepal.Length +
0.867*Petal.Length + 0.358*Petal.Width
• Subsequent analysis will use the reduced representation/dimensions.
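A hedged sketch of PCA with scikit-learn on the Iris data; standardizing first is a common choice, and the exact loadings and variance shares depend on that preprocessing, so they may not match the slide's numbers exactly.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 150 x 4 numeric matrix
X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)        # 150 x 3 reduced representation

print(pca.explained_variance_ratio_)   # variance captured by each component
print(pca.components_[0])               # loadings: weights of the original features in PC1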

Numerosity Reduction
● Reduce the size of data volume by choosing alternative smaller forms of data representation.
● Parametric methods (Example: regression)
○ Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
● Non-parametric methods

○ Do not assume parameterized probability distributions.


○ Major families: histograms/binning, clustering, sampling, …

Sampling
● Obtaining a small sample “s” to represent the whole data set “N”.

● Also used in sampling training and test examples.

● Allow mining algorithms to run at a complexity that is possibly sub-linear to data size.

● Key principle: choose a representative subset of the data.



○ In skewed datasets, simple random sampling may perform poorly.

○ Develop adaptive sampling methods, e.g., stratified sampling.
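A small pandas sketch contrasting simple random sampling with stratified sampling on a skewed class column; the column name and sampling fraction are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "label": ["rare"] * 10 + ["common"] * 90,   # skewed class distribution
})

# Simple random sampling: the rare class may be under- or over-represented
srs = df.sample(frac=0.2, random_state=0)

# Stratified sampling: sample the same fraction within each class
stratified = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)

print(srs["label"].value_counts())
print(stratified["label"].value_counts())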

Summary
In this session, we discussed,

● Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
● Data reduction can be done by:
○ Dimensionality reduction - It is the process of removing unimportant attributes.

○ Numerosity reduction - It reduces data volume; uses smaller forms of data representation.
○ Data compression
● Sampling is about obtaining a small sample s to represent the whole data set N.

Tasks and Methods:
Data Transformation

Agenda
In this session, we will discuss:
● Tasks and Methods
○ Data transformation: Normalization
○ Data discretization methods
○ Concept Hierarchy generation

Data Transformation
● Data are transformed or consolidated into forms appropriate for mining.
● Methods
○ Smoothing: Remove noise from data
○ Attribute / feature construction
■ New attributes constructed from the given ones

○ Aggregation: Data cube construction, summarization


○ Normalization: Scaled to fall within a smaller, specified range for more meaningful comparison
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
○ Discretization: Concept hierarchy climbing
Normalization
● Min-max normalization: to [new_minA, new_maxA]

  v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

○ Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716.

● Z-score normalization (μ: mean, σ: standard deviation):

  v′ = (v − μA) / σA

○ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225.
Normalization
● Normalization by decimal scaling:

  v′ = v / 10^j, where j is the smallest integer such that max(|v′|) ≤ 1

○ Ex. (50, 20) -> (0.5, 0.2) with j = 2
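A plain NumPy sketch of the three normalizations, reusing the income example from above.

import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization (here using the slide's mu = 54,000 and sigma = 16,000)
z_score = (income - 54000) / 16000

# Decimal scaling: divide by 10^j so that max(|v'|) <= 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** j)

print(min_max)         # 73,600 -> 0.716
print(z_score)         # 73,600 -> 1.225
print(decimal_scaled)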

Data Discretization
Discretization: Divide the range of a continuous attribute into intervals.

● Actual data values are replaced with interval labels.


● Reduce attribute cardinality
● Handles outliers and skewed data
● Supervised vs. unsupervised

● Prepare data for further analysis, e.g., classification.

Data Discretization Methods
Typical methods:

All the methods mentioned below can be applied recursively.

● Histogram and Binning analysis

○ Top-down split

○ Unsupervised

● Clustering analysis (unsupervised, top-down split, or bottom-up merge)

● Classification analysis, e.g., decision-tree (supervised, top-down split)

● Correlation (e.g., χ2) analysis, e.g., ChiMerge (supervised, bottom-up merge)

Discretization by Correlation Analysis
Correlation analysis (e.g., Chi-merge: χ2-based discretization)

● Exploit the correlation between intervals and class labels.

● "Interval – Class” contingency tables

● If two adjacent intervals have low χ² values (less correlated with the class labels), merge them to form a larger interval (keeping them separate does not offer more information on how to classify objects).

● Merge performed recursively until a predefined stopping condition is met.

Chi-Merge Discretization Example
ChiMerge Discretization

● A statistical approach to data discretization.
● Discretize the data based on class labels, using the Chi-square approach.
● F: attribute, K: class label

Sample   F    K
1        1    1
2        3    2
3        7    1
4        8    1
5        9    1
6        11   2
7        23   2
8        37   1
9        39   2
10       45   1
11       46   1
12       59   1
Chi-Merge Discretization Example
ChiMerge Discretization Example

● Sort and arrange the attribute you want to group (Example: attribute F).
● Begin by having each unique value of the attribute in its own interval.

Sample   F    K   Interval
1        1    1   {0, 2}
2        3    2   {2, 5}
3        7    1   {5, 7.5}
4        8    1   {7.5, 8.5}
5        9    1   {8.5, 10}
6        11   2   {10, 17}
7        23   2   {17, 30}
8        37   1   {30, 38}
9        39   2   {38, 42}
10       45   1   {42, 45.5}
11       46   1   {45.5, 52}
12       59   1   {52, 60}
Chi-Merge Discretization Example
ChiMerge Discretization Example

● Calculate the Chi-square test on every pair of adjacent intervals.
● Interval/class contingency tables, e.g., for samples 2 & 3 and for samples 3 & 4:

Sample   K=1   K=2   Total
2        0     1     1
3        1     0     1
Total    1     1     2

Sample   K=1   K=2   Total
3        1     0     1
4        1     0     1
Total    2     0     2
Chi-Merge Discretization Example
Sample   K=1   K=2   Total
2        0     1     1
3        1     0     1
Total    1     1     2

E11 = (1/2) × 1 = 0.5      E12 = (1/2) × 1 = 0.5
E21 = (1/2) × 1 = 0.5      E22 = (1/2) × 1 = 0.5
X² = (0 − 0.5)²/0.5 + (1 − 0.5)²/0.5 + (1 − 0.5)²/0.5 + (0 − 0.5)²/0.5 = 2

Sample   K=1   K=2   Total
3        1     0     1
4        1     0     1
Total    2     0     2

E11 = (2/2) × 1 = 1        E12 = (0/2) × 1 = 0
E21 = (2/2) × 1 = 1        E22 = (0/2) × 1 = 0
X² = (1 − 1)²/1 + (1 − 1)²/1 = 0   (terms with an expected count of 0 are taken as 0)

At significance level 0.1 with df = 1, the Chi-square critical value is 2.7024. Adjacent intervals with X² below this value are not correlated with the class label and can be merged.
Chi-Merge Discretization Example

● Calculate the Chi-square values for all pairs of adjacent intervals.
● Merge the intervals with the smallest Chi-square values.

Sample   F    K   Interval      Chi² (with next interval)
1        1    1   {0, 2}        2
2        3    2   {2, 5}        2
3        7    1   {5, 7.5}      0
4        8    1   {7.5, 8.5}    0
5        9    1   {8.5, 10}     2
6        11   2   {10, 17}      0
7        23   2   {17, 30}      2
8        37   1   {30, 38}      2
9        39   2   {38, 42}      2
10       45   1   {42, 45.5}    0
11       46   1   {45.5, 52}    0
12       59   1   {52, 60}      -
Chi-Merge Discretization Example
● Repeat: keep merging the adjacent intervals with the smallest X² until all X² > 2.7024.

Interval    Samples (F)     K           Chi² (with next interval)
{0, 2}      1               1           2
{2, 5}      3               2           4
{5, 10}     7, 8, 9         1, 1, 1     5
{10, 30}    11, 23          2, 2        3
{30, 38}    37              1           2
{38, 42}    39              2           4
{42, 60}    45, 46, 59      1, 1, 1     -
Chi-Merge Discretization Example
Interval    Samples (F)          K                Chi² (with next interval)
{0, 10}     1, 3, 7, 8, 9        1, 2, 1, 1, 1    2.72
{10, 42}    11, 23, 37, 39       2, 2, 1, 2       3.93
{42, 60}    45, 46, 59           1, 1, 1          -

● End: there are no more adjacent intervals with X² < 2.7024.
● The resulting intervals are correlated with the class labels.
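A hedged Python sketch of the ChiMerge loop on this example; scipy's chi2_contingency computes the interval-vs-class X², zero-count class columns are dropped before the test (those X² values are taken as 0, as above), and this is a simplified illustration rather than a full ChiMerge implementation.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

F = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
K = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1]
df = pd.DataFrame({"F": F, "K": K}).sort_values("F")

# Start with one interval per distinct value
intervals = [[v] for v in df["F"]]
threshold = 2.7024   # chi-square critical value, alpha = 0.1, df = 1

def chi2_of_pair(left, right):
    # X2 of the 2 x classes contingency table for two adjacent intervals
    classes = sorted(df["K"].unique())
    table = np.array([
        [(df["F"].isin(ivl) & (df["K"] == c)).sum() for c in classes]
        for ivl in (left, right)
    ])
    table = table[:, table.sum(axis=0) > 0]   # drop all-zero class columns
    if table.shape[1] < 2:
        return 0.0                            # identical class distribution
    return chi2_contingency(table, correction=False)[0]

while len(intervals) > 1:
    chi2_values = [chi2_of_pair(intervals[i], intervals[i + 1])
                   for i in range(len(intervals) - 1)]
    i = int(np.argmin(chi2_values))
    if chi2_values[i] >= threshold:
        break                                  # all adjacent pairs are correlated enough
    intervals[i] = intervals[i] + intervals.pop(i + 1)   # merge the least-correlated pair

# Prints the surviving value groups, corresponding to the intervals {0,10}, {10,42}, {42,60}
print([(min(ivl), max(ivl)) for ivl in intervals])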
Concept Hierarchy Generation
● Concept hierarchy organises concepts (attribute values) hierarchically and is typically associated with
each dimension in a data warehouse.
● In data warehouses, concept hierarchies enable drilling and rolling to see data at various
granularities.
● Concept hierarchy generation

○ Specified by domain experts, taxonomies/thesaurus/ ontologies


○ Generated from data sets (for some simple, specific cases)
■ Discretization for numerical or ordinal data
■ Frequency counts for categorical data (limited cases)
○ Concept hierarchy learning
■ Natural language processing and ML approaches.
Concept Hierarchy Generation for Nominal Data
● Specification of a partial/total ordering of attributes explicitly at the schema level by users or
experts.
○ street < city < state < country
● Specification of a hierarchy for a set of values by explicit data grouping.
○ {Urbana, Champaign, Chicago} < Illinois

● Specification of only a partial set of attributes.


○ E.g. only street < city, not others
● Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct
values.
○ E.g. for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation
● Some hierarchies can be built automatically based on a study of the number of distinct values for
each attribute in the data collection.
○ The attribute with the most distinct values is at the bottom of the hierarchy.
○ Exceptions exist, for example: weekday, month, quarter, year.

country               15 distinct values
province_or_state     365 distinct values
city                  3,567 distinct values
street                674,339 distinct values
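A tiny pandas sketch of this heuristic, ordering attributes by their number of distinct values; the example DataFrame is an assumption for illustration.

import pandas as pd

geo = pd.DataFrame({
    "country": ["US", "US", "US", "US", "CA"],
    "state":   ["AZ", "AZ", "CA", "CA", "ON"],
    "city":    ["Tucson", "Phoenix", "Fresno", "Fresno", "Toronto"],
    "street":  ["1st St", "2nd Ave", "3rd Blvd", "4th Rd", "5th Ln"],
})

# Fewer distinct values -> higher in the generated concept hierarchy
hierarchy = geo.nunique().sort_values()
print(hierarchy)   # country < state < city < street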


Summary
In this session, we discussed:

● Normalization – The data is scaled to fall within a smaller, specified range for more meaningful
comparison.
● Discretization divides the range of a continuous attribute into intervals.
● Chi-Merge Discretization example

● Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated
with each dimension in a data warehouse.
● Concept hierarchy generation for nominal data

Learning Outcomes
You should now be able to:

● Apply data pre-processing tasks and methods to prepare data for a data mining task.
● Summarize the importance of outlier removal and redundant data removal from data sets.
● Explain the methods for dimensionality reduction and numerosity reduction.
● Implement data transformation strategies, such as normalization, discretization, and concept hierarchy generation.
● Perform typical data pre-processing tasks in Python.

Thank you!
