Final - Unit 3 Data Preprocessing - Phases

Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for modeling. It addresses issues like missing values, inconsistent data, and noise. Common techniques for handling missing data include replacing it with mean, median, or mode values. Noisy data can be smoothed using binning, where values are grouped into ranges. The goal of preprocessing is to improve data quality and accuracy for analysis.


Data Preprocessing

Why preprocess the data?
Data cleaning
Data integration and transformation
Summary
Data quality
Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
Why: almost all Data Mining algorithms induce knowledge strictly from data.
The quality of the knowledge extracted therefore depends heavily on the quality of the data.
Why Data Preprocessing?
Data in the real world is dirty
◦ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = “ ”
◦ noisy: containing errors or outliers
  e.g., Salary = “-10”
◦ inconsistent: containing discrepancies in codes or names
  e.g., Age = “42” but Birthday = “03/07/1997”
  e.g., rating was “1, 2, 3”, now it is “A, B, C”
  e.g., discrepancy between duplicate records
Why Data Preprocessing?

Training data:

age     income     student   buys_computer
<=30    high       yes       yes
<=30    high       no        yes
>40     medium     yes       no
>40     medium     no        no
>40     low        yes       yes
31…40   (missing)  no        yes
31…40   medium     yes       yes

Data Mining: discover only those rules whose support (frequency) is >= 2
• If age <= 30 and income = high, then buys_computer = yes
• If age > 40 and income = medium, then buys_computer = no

Due to the missing value in the training dataset, the accuracy of prediction decreases and becomes 66.7%.

Testing data (actual data):

age     income   student   buys_computer
<=30    high     no        ?
>40     medium   yes       ?
31…40   medium   yes       ?
Why Is Data Dirty?
• Incomplete data may come from
◦ “not applicable” data values at collection time
◦ different considerations between the time the data was collected and the time it is analyzed
◦ human/hardware/software problems
• Noisy data (incorrect values) may come from
◦ faulty data collection instruments
◦ human or computer error at data entry
◦ errors in data transmission
• Inconsistent data may come from
◦ different data sources
◦ functional dependency violations (e.g., modifying some linked data)
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
◦ Quality decisions must be based on quality data
  e.g., duplicate or missing data may cause incorrect or even misleading statistics
◦ A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
Exercise
Suppose the Employee dataset (Emp_id, Emp_name, Salary, Age, Date_of_birth) contains the record ("EMP_01", "Mr. X", -100, 40, "01/01/1990").

In this dataset, the Salary attribute has
a) noisy data
b) incomplete data
c) inconsistent data
d) both noisy and inconsistent data

For the same record, the Age attribute has
e) noisy data
f) incomplete data
g) inconsistent data
h) both noisy and inconsistent data
Major Tasks in Data Preprocessing
• Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
◦ Integration of multiple databases, data cubes, or files
• Data transformation
◦ Normalization and aggregation
• Data reduction
◦ Obtain a reduced representation in volume that produces the same or similar analytical results
• Data discretization
◦ Data reduction, especially for numerical data
◦ Data discretization converts the attribute values of continuous data into a finite set of intervals with minimal data loss
Forms of Data Preprocessing
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Summary
Data Cleaning
• Importance
◦ “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
◦ “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
1. Fill in missing values
2. Identify outliers and smooth out noisy data
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
Missing Data
• Data is not always available
◦ Many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
◦ equipment malfunction
◦ inconsistency with other recorded data, leading to deletion
◦ data not entered due to misunderstanding
◦ certain data not being considered important at the time of entry
◦ failure to register the history or changes of the data
• Missing data may need to be inferred.
How to Handle Missing Data?
1. Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: time-consuming and often infeasible for large data sets.
3. Fill it in automatically with (see the sketch after this list)
◦ a global constant, e.g., “unknown” or a new class (the mining program may then mistakenly think these tuples form an interesting concept, since they all share the common value “unknown”; this method is simple but not foolproof)
◦ the attribute mean or median
◦ the attribute mean for all samples belonging to the same class: smarter (e.g., if classifying customers according to credit risk, replace the missing value with the mean income of customers in the same credit-risk category as the given tuple)
4. Fill in the most probable value: inference-based methods such as a Bayesian formula or a decision tree.
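To make the automatic fill-in options in point 3 concrete, here is a minimal pandas sketch; the DataFrame, column names, and values below are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical customer table with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [52000, np.nan, 31000, np.nan, 58000],
})

# Fill with a global constant (simple, but not foolproof).
constant_filled = df["income"].fillna(-1)

# Fill with the overall attribute mean (or median).
mean_filled = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples in the same class.
class_mean_filled = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(class_mean_filled)
```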
Handling Missing Values: K-Nearest Neighbor (k-NN) Approach

• k-NN imputes missing attribute values on the basis of the K nearest neighbors. Neighbors are determined using a distance measure.
• Once the K neighbors are determined, the missing value is imputed by taking the mean, median, or mode of the known values of that attribute among the neighbors.
• Pseudo-code and analysis follow after studying distance measures.

[Figure: a record with a missing value compared against the other records in the dataset]
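As one possible concrete implementation (an assumption, not necessarily what the slides had in mind), scikit-learn's KNNImputer performs this kind of neighbor-based mean imputation; the small dataset below is made up:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up records: [age, income]; np.nan marks the missing attribute value.
# In practice the attributes would be normalized first, so that income does
# not dominate the distance computation.
X = np.array([
    [25.0, 50000.0],
    [27.0, np.nan],    # record with a missing income value
    [26.0, 52000.0],
    [45.0, 90000.0],
])

# Impute the missing value with the mean of its k nearest neighbors
# (distances are computed over the attributes that are present).
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```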


Noisy Data
Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
◦ faulty data collection instruments
◦ data entry problems
◦ data transmission problems
◦ technology limitations
◦ inconsistency in naming conventions

Other data problems that require data cleaning
◦ duplicate records
◦ incomplete data
◦ inconsistent data
How to Handle Noisy Data?
Binning
◦ Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it
◦ First sort the data and partition it into (equal-frequency) bins
◦ Then smooth by bin means, bin medians, bin boundaries, etc.
Regression
◦ Smooth by fitting the data to regression functions
Clustering
◦ Detect and remove outliers
Semi-automated method: combined computer and human inspection
◦ Detect suspicious values and check them manually

Cluster Analysis
Detect and remove outliers, where similar values are organized into groups or “clusters”
Noise smoothing - Binning

Equal-frequency binning: each bin contains (approximately) the same number of values.

Equal-width binning: each bin covers an interval of equal width, with bin boundaries at
[min + w], [min + 2w], …, [min + nw]
where w = (max − min) / (number of bins)
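To make the two schemes concrete, here is a minimal Python sketch; the function names and the small value list are illustrative, not from the slides:

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of (roughly) equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins - 1)] + [data[(n_bins - 1) * size:]]

def equal_width_bins(values, n_bins):
    """Split the sorted data into n_bins intervals of equal width w = (max - min) / n_bins."""
    data = sorted(values)
    w = (data[-1] - data[0]) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in data:
        # Index of the interval [min + i*w, min + (i+1)*w) that contains v.
        i = min(int((v - data[0]) / w), n_bins - 1)
        bins[i].append(v)
    return bins

values = [4, 8, 15, 16, 23, 42]          # made-up data
print(equal_frequency_bins(values, 3))   # [[4, 8], [15, 16], [23, 42]]
print(equal_width_bins(values, 3))       # [[4, 8, 15, 16], [23], [42]]
```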


Noise smoothing - Binning

Smoothing by bin boundaries
Find the minimum and maximum values in each bin.
The minimum becomes the left (lower) boundary and the maximum the right (upper) boundary.

How are the middle values smoothed?
Each middle value is replaced by whichever bin boundary is closer to it (the neighbour with the smaller distance).
Example I
Suppose a group of 12 sales price records has been
sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the
following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
Example I solution
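A possible worked answer, computed directly from the sorted data above:

(a) Equal-frequency (equal-depth) partitioning, 4 values per bin:
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 92, 204, 215

(b) Equal-width partitioning, with width w = (215 − 5) / 3 = 70, giving intervals [5, 75), [75, 145), [145, 215]:
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 92
Bin 3: 204, 215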
Example II
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First, sort the data
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data using equal-frequency (equi-depth) bins of depth 4:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Example II
Smoothing by bin means
Bin 1: (8 + 9 + 15 + 16) / 4 = 12  (4 is the number of values in the bin)
Bin 1 becomes: 12, 12, 12, 12

Bin 2: (21 + 21 + 24 + 26) / 4 = 23
Bin 2 becomes: 23, 23, 23, 23

Bin 3: (27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 becomes: 30, 30, 30, 30
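A minimal pure-Python sketch of this bin-means smoothing, using the price data from the example (the function name is illustrative):

```python
def smooth_by_bin_means(values, depth):
    """Equal-frequency binning followed by smoothing with the bin mean."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_values = data[i:i + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))
    return smoothed

prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
print(smooth_by_bin_means(prices, depth=4))
# [12.0, 12.0, 12.0, 12.0, 23.0, 23.0, 23.0, 23.0, 30.25, 30.25, 30.25, 30.25]
```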
Example II
Smoothing by bin boundaries - price data in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First, sort the data
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Example II
Using the same equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34

Smoothing by bin boundaries:
Bin 1 before: 8, 9, 15, 16
Here, 8 is the minimum (left boundary) and 16 is the maximum (right boundary).
9 is closer to 8, so 9 is replaced by 8.
15 is closer to 16 than to 8, so 15 is replaced by 16.
Bin 1 after: 8, 8, 16, 16
Bin 2 before: 21, 21, 24, 26
Bin 2 after: 21, 21, 26, 26
Bin 3 before: 27, 30, 30, 34
Bin 3 after: 27, 27, 27, 34
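A corresponding sketch for smoothing by bin boundaries, where each value snaps to the nearer of its bin's minimum and maximum (in this sketch, ties go to the lower boundary):

```python
def smooth_by_bin_boundaries(values, depth):
    """Equal-frequency binning; each value is replaced by the nearer bin boundary."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_values = data[i:i + depth]
        low, high = bin_values[0], bin_values[-1]   # bin boundaries
        smoothed.extend(low if v - low <= high - v else high for v in bin_values)
    return smoothed

prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
print(smooth_by_bin_boundaries(prices, depth=4))
# [8, 8, 16, 16, 21, 21, 26, 26, 27, 27, 27, 34]
```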
Example III
Data (in increasing order) for the attribute age:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth these
data, using a bin depth of 3. Illustrate your steps.
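A short Python sketch (not from the slides) that can be used to check the answer: it forms equal-frequency bins of depth 3 over the sorted ages and prints each bin together with its mean.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
for i in range(0, len(ages), depth):
    bin_values = ages[i:i + depth]            # one equal-frequency bin
    mean = sum(bin_values) / len(bin_values)  # value used to smooth the bin
    print(bin_values, "->", [round(mean, 2)] * len(bin_values))
```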
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic correction using various tools
◦ to detect violations of known functional dependencies and data constraints
◦ to correct redundant data
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Summary
Data Integration
• Data integration: combines data from multiple sources into a coherent store

Issues to be considered:
• Schema integration, e.g., “cust-id” vs. “cust-no”
◦ Integrate metadata from different sources
• Entity identification problem
◦ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
◦ For the same real-world entity, attribute values from different sources are different
◦ Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Transformation
• Smoothing: remove noise from data using smoothing techniques
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
• Min-max normalization (linear transformation to [new_min_A, new_max_A]):
  v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A
  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  ((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
  v' = (v − μ_A) / σ_A
  Ex. Let μ_A = 54,000 and σ_A = 16,000. Then 73,600 is mapped to
  (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:
  v' = v / 10^j
  where j is the smallest integer such that max(|v'|) < 1
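A minimal Python sketch of the three normalization methods above (plain functions with illustrative names; no external libraries):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear mapping of [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v'| below 1
    (assumes at least one value with |v| >= 1)."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-991, 48, 512]))         # [-0.991, 0.048, 0.512]
```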
Exercise
Which of the following are data transformation techniques?
a) Min-max normalization
b) Missing data identification
c) Outlier detection
d) Both b and c

Which of the following are not data transformation techniques?
e) Min-max normalization
f) Missing data identification
g) Outlier detection
h) Both f and g
Exercise
Normalize the following dataset onto the range [0, 1]:

23, 23, 27, 27, 39, 41, 47, 49, 50
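For example, using min-max normalization with min = 23 and max = 50: 23 maps to 0.0, 39 maps to (39 − 23) / (50 − 23) ≈ 0.593, and 50 maps to 1.0.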


Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
