DEC_Unit II Data Pre-processing
UNIT II
1
Syllabus
• Data Preprocessing: An Overview, Methods: Data Cleaning, Data Integration, Data Reduction, Data Transformation, Data Discretization.
• Data Cleaning: Handling Missing Values, Noisy Data, Data Cleaning as a Process; Data Integration: Entity Identification Problem, Redundancy and Correlation Analysis, Tuple Duplication, Data Value Conflict Detection and Resolution; Data Reduction: Attribute Subset Selection, Histograms, Sampling; Data Discretization: binning, histogram analysis, decision tree and correlation analysis, concept hierarchy for nominal data.
2
Data Pre-processing: An Overview
• Data preprocessing is the process of transforming raw data into an understandable format.
• It is an important step in data mining, because we cannot work with raw data directly.
• The quality of the data should be checked before applying machine learning or data mining algorithms.
• The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.
• Data preprocessing prepares the raw data so that it is suitable for a machine learning model; it is the first and a crucial step when creating a machine learning model.
3
Why is Data Preprocessing Important?
4
Major Tasks in Data Preprocessing
There are 4 major tasks in data
preprocessing – Data cleaning,
Data integration, Data reduction,
and Data transformation.
8
Major Tasks in Data Preprocessing
9
Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data.
• Data cleaning is the process of removing incorrect data, incomplete
data, and inaccurate data from the datasets, and it also replaces the
missing values.
• Topics under Data Cleaning are: Handling missing values, Noisy data,
Data cleaning as a process.
10
Data Cleaning
• Missing or incomplete records: missing data sometimes appears as empty cells, as placeholder values (e.g., NULL or N/A), or as a particular character, such as a question mark.
11
Data Cleaning
• Missing data may be due to:
• Equipment malfunction
• Data that were inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data not considered important at the time of entry
• History or changes of the data not registered
• It is important to note that a missing value does not always imply an error (for example, an attribute that allows nulls).
12
Data Cleaning
• How can you go about filling in the missing values for this attribute? Let’s
look at the following methods:
• 1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification).
• This method is not very effective, unless the tuple contains several
attributes with missing values.
• It is especially poor when the percentage of missing values per attribute
varies considerably.
• By ignoring the tuple, we do not make use of the remaining attributes
values in the tuple.
• 2. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
13
Data Cleaning
• 3. Use a global constant to fill in the missing value: Replace all
missing attribute values by the same constant, such as a label like
“Unknown” or −∞.
• If missing values are replaced by, say, “Unknown,” then the mining
program may mistakenly think that they form an interesting concept,
since they all have a value in common—that of “Unknown.”
• 4. Use a measure of central tendency for the attribute (such as the
mean or median) to fill in the missing value: measures of central
tendency, which indicate the “middle” value of a data distribution.
• For normal (symmetric) data distributions, the mean can be used,
while skewed data distribution should employ the median.
14
Data Cleaning
• 5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple: For example, if classifying customers
according to credit risk, we may replace the missing value with the average
income value for customers in the same credit risk category as that of the
given tuple.
• If the data distribution for a given class is skewed, the median value is a
better choice.
• 6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
• For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
15
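A minimal pandas sketch of methods 3–5 above (method 6 would additionally need a regression or decision-tree model); the table, column names, and values are hypothetical:

```python
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [42000, None, 18000, None, 51000],
})

# (3) Global constant: flag missing values with a sentinel value/label.
df["income_const"] = df["income"].fillna(-1)

# (4) Central tendency: fill with the overall mean (or median for skewed data).
df["income_mean"] = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())

# (5) Class-conditional central tendency: fill with the mean income of the
#     tuples that belong to the same credit_risk class.
df["income_class_mean"] = (df.groupby("credit_risk")["income"]
                             .transform(lambda s: s.fillna(s.mean())))

print(df)
```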
Data Cleaning: Noisy Data
Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins
16
Data Cleaning
• In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
• In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1
is 9.
• Therefore, each original value in this bin is replaced by the value 9.
• Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries.
• Each bin value is then replaced by the closest boundary value.
17
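A small sketch of equal-frequency binning with smoothing by bin means and medians; the nine sorted price values are assumed here (the slide only shows that Bin 1 contains 4, 8, and 15):

```python
import numpy as np

# Assumed sorted price data; Bin 1 = 4, 8, 15 as in the slide.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = prices.reshape(3, 3)              # equal-frequency bins of size 3

smoothed_by_means = np.repeat(bins.mean(axis=1), 3)        # each value -> bin mean
smoothed_by_medians = np.repeat(np.median(bins, axis=1), 3)

print(smoothed_by_means)    # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(smoothed_by_medians)  # [ 8.  8.  8. 21. 21. 21. 28. 28. 28.]
```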
Data Cleaning
• Regression: Data smoothing can also be done by conforming data
values to a function, a technique known as regression.
• Linear regression involves finding the “best” line to fit two attributes
(or variables), so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
18
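A minimal sketch of regression-based smoothing, assuming two hypothetical attributes: fit the least-squares line with numpy and replace the noisy values with the fitted ones:

```python
import numpy as np

# Hypothetical paired attributes, e.g., advertising spend (x) vs. noisy sales (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.4, 9.8, 12.5])

slope, intercept = np.polyfit(x, y, deg=1)   # "best" straight line through the data
y_smoothed = slope * x + intercept           # replace noisy values with fitted values

print(round(slope, 2), round(intercept, 2))
print(y_smoothed.round(2))
```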
Data Cleaning
• Outlier analysis: Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may
be considered outliers
19
Data Cleaning as a Process
• Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have
looked at techniques for handling missing data and for smoothing data. “But data
cleaning is a big job. What about data cleaning as a process? How exactly does one
proceed in tackling this task? Are there any tools out there to help?”
• The first step in data cleaning as a process is discrepancy (inconsistency) detection.
• Discrepancies can be caused by several factors, including poorly designed data entry
forms that have many optional fields, human error in data entry, deliberate errors (e.g.,
respondents not wanting to disclose information about themselves), and data decay (e.g.,
outdated addresses).
• Discrepancies may also arise from inconsistent data representations and the inconsistent
use of codes.
• Errors in instrumentation devices that record data, and system errors, are another source
of discrepancies.
• Errors can also occur when the data are (inadequately) used for purposes other than originally intended. There may also be inconsistencies due to data integration (e.g., where a given attribute can have different names in different databases).
20
Data Cleaning as a Process
• “So, how to proceed with discrepancy detection?” As a starting point, use
any knowledge you may already have regarding properties of the data.
Such knowledge or “data about data” is referred to as metadata.
• For example, what are the data type and domain of each attribute? What are the acceptable values for each attribute? Basic statistical data descriptions are useful here to grasp data trends and identify anomalies.
• For example, find the mean, median, and mode values.
• Are the data symmetric or skewed? What is the range of values? Do all
values fall within the expected range? What is the standard deviation of
each attribute?
• Values that are more than two standard deviations away from the mean for
a given attribute may be flagged as potential outliers.
• Are there any known dependencies between attributes?
21
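A short sketch of the two-standard-deviation rule of thumb above; the attribute values are made up:

```python
import numpy as np

values = np.array([48, 52, 50, 47, 51, 49, 53, 95])   # 95 looks suspicious
mean, std = values.mean(), values.std()

# Flag values more than two standard deviations away from the mean.
outlier_mask = np.abs(values - mean) > 2 * std
print(values[outlier_mask])    # candidate discrepancies to inspect, e.g., [95]
```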
Data Cleaning as a Process
• Field overloading is another source of errors that typically results when
developers squeeze new attribute definitions into unused (bit) portions of
already defined attributes (e.g., using an unused bit of an attribute whose
value range uses only, say, 31 out of 32 bits).
• The data should also be examined regarding unique rules, consecutive rules,
and null rules.
• A unique rule says that each value of the given attribute must be different
from all other values for that attribute.
• A consecutive rule says that there can be no missing values between the
lowest and highest values for the attribute, and that all values must also be
unique (e.g., as in check numbers).
• A null rule specifies the use of blanks, question marks, special characters,
or other strings that may indicate the null condition (e.g., where a value for
a given attribute is not available), and how such values should be handled.
22
Data Cleaning as a Process
• There are a number of different commercial tools that can aid in the step of
discrepancy detection.
• Data scrubbing tools use simple domain knowledge (e.g., knowledge of
postal addresses, and spell-checking) to detect errors and make corrections
in the data.
• These tools rely on parsing and fuzzy matching techniques when cleaning
data from multiple sources.
• Data auditing tools find discrepancies by analyzing the data to discover
rules and relationships, and detecting data that violate such conditions.
• They are variants of data mining tools. For example, they may employ
statistical analysis to find correlations, or clustering to identify outliers.
23
Data Cleaning as a Process
• ETL (extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI).
• These tools typically support only a restricted set of transforms so
that, often, we may also choose to write custom scripts for this step
of the data cleaning process.
• The two-step process of discrepancy detection and data
transformation (to correct discrepancies) iterates.
• This process, however, is error-prone and time consuming: some transformations may introduce more discrepancies, and some nested discrepancies may only be detected after others have been fixed.
24
Data Cleaning as a Process
• New approaches to data cleaning emphasize increased interactivity.
• Potter’s Wheel, for example, is a publicly available data cleaning tool that integrates discrepancy
detection and transformation.
• Users gradually build a series of transformations by composing and debugging individual
transformations, one step at a time, on a spreadsheet-like interface.
• The transformations can be specified graphically or by providing examples.
• Results are shown immediately on the records that are visible on the screen.
• The user can choose to undo the transformations, so that transformations that introduced
additional errors can be “erased.”
• The tool performs discrepancy checking automatically in the background on the latest
transformed view of the data.
• Users can gradually develop and refine transformations as discrepancies are found, leading to
more effective and efficient data cleaning.
• Another approach to increased interactivity in data cleaning is the development of declarative
languages for the specification of data transformation operators.
• Such work focuses on defining powerful extensions to SQL and algorithms that enable users to
express data cleaning specifications efficiently.
25
Exercise
• Suppose a group of 12 sales price records has been sorted as follows: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Partition them into three bins and solve by each of the following methods:
(a) equal-depth (equal-frequency) partitioning
(b) smoothing by bin boundaries
26
Exercise
• Solution:
(a) Equal-depth partitioning: Bin 1: 5, 10, 11, 13; Bin 2: 15, 35, 50, 55; Bin 3: 72, 92, 204, 215
(b) Smoothing by bin boundaries: Bin 1: 5, 13, 13, 13; Bin 2: 15, 15, 55, 55; Bin 3: 72, 72, 215, 215
27
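A quick sketch that reproduces part (b): each value in an equal-depth bin is replaced by the nearer bin boundary (ties resolved toward the lower boundary, matching the solution above):

```python
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

for b in prices.reshape(3, 4):                # three equal-depth bins of 4 values
    lo, hi = b.min(), b.max()
    # Replace each value with the closest bin boundary (<= keeps ties at the low end).
    smoothed = np.where(np.abs(b - lo) <= np.abs(b - hi), lo, hi)
    print(smoothed)
# -> [ 5 13 13 13]   [15 15 55 55]   [ 72  72 215 215]
```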
Data Integration
• The merging of data from multiple data stores.
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent
mining process.
• The semantic heterogeneity and structure of data pose great
challenges in data integration. How can we match schema and
objects from different sources?
28
Data Integration
Data integration techniques:
• Schema matching
• Instance conflict resolution
• Source selection
• Result merging
• Quality composition
29
Data Integration
• Combines data from multiple sources into a coherent store
• Careful integration can help reduce & avoid redundancies and inconsistencies
• This helps to improve accuracy & speed of subsequent data mining
• Heterogeneity & structure of data pose great challenges
• Issues that need to be addressed:
1. How to match schema & objects from different sources? (Entity identification problem)
2. Are any attributes correlated?
3. Tuple duplication
4. Detection & resolution of data value conflicts
30
Data Integration
31
Issues during Data Integration(Contd..)
1. Schema integration:
Integrate metadata from different sources
• e.g. customer_id in one database and cust_number in another
Entity identification problem: Identify real world entities from multiple data sources,
• e.g., Bill Clinton = William Clinton
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Integration (Contd..)
• Data integration: combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Redundancy:
• Inconsistencies in attribute or dimension naming can cause redundancy
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales, e.g., metric vs. British units
34
Data Integration
35
Handling Redundancy in Data Integration
36
Detection of Data Redundancy-Correlation
38
Correlation Analysis (Numeric data )
• Evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (Pearson's product-moment coefficient):
$$ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B} $$
where n is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, \bar{A} and \bar{B} are the respective mean values of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-product (i.e., for each tuple, the value for A is multiplied by the value for B in that tuple).
39
Correlation- Pearson Correlation Coefficient
• Note that $-1 \le r_{A,B} \le +1$
• If the resulting value is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase
• The higher the value, the stronger the correlation
• A higher value may indicate that A (or B) could be removed as a redundancy
• If the resulting value is equal to 0, then A and B are independent and there is no correlation between them
• If the resulting value is less than 0, then A and B are negatively correlated, where
the values of one attribute increase as the values of the other attribute decrease
41
Covariance Analysis(Numeric Data)
• Correlation and covariance are two similar measures for assessing how much two
attributes change together
• Consider two numeric attributes A and B, and a set of n observations {(a_1, b_1), ..., (a_n, b_n)}
• The mean values of A and B, respectively, are also known as the expected values of A and B, that is,
$$ E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n} \qquad E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n} $$
• The covariance between A and B is then defined as
$$ Cov(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n} $$
42
Covariance Analysis(Numeric Data)
43
Correlation
44
Covariance Analysis(Numeric Data)
• The table shows the stock prices of two companies at five time points. If the stocks are affected by the same industry trends, determine whether their prices rise or fall together.
45
Covariance Analysis(Numeric Data)
46
Correlation: example
• An example of stock prices observed at five time points for AllElectronics and HighTech, a high-tech company. Suppose the two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
47
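A short sketch that answers the question with the numbers above, computing the population covariance (divide by n, as in the formula) and Pearson's correlation coefficient:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])     # stock A prices at five time points
B = np.array([5, 8, 10, 11, 14])  # stock B prices at the same time points

cov = ((A - A.mean()) * (B - B.mean())).mean()   # population covariance
r = np.corrcoef(A, B)[0, 1]                      # Pearson correlation coefficient

print(cov)            # 4.0 -> positive, so the prices tend to rise and fall together
print(round(r, 3))    # strong positive correlation (~0.94)
```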
Data Value Conflict Detection and Resolution
• For the same real-world entity, attribute values from different sources
may differ
• Eg. Prices of rooms in different cities may involve different currencies
• Attributes may also differ on the abstraction level, where an attribute
in one system is recorded at, say, a lower abstraction level than the
“same” attribute in another.
• Eg. total sales in one database may refer to one branch of All_Electronics,
while an attribute of the same name in another database may refer to the
total sales for All_Electronics stores in a given region.
• To resolve, data values have to be converted into consistent form
48
Data Transformation
49
Data Transformation: Normalization
• Min-max normalization
• Z-score normalization
50
Data Transformation: Min Max-Normalization
• Min-max normalization: maps values to [new_min_A, new_max_A]
• Performs a linear transformation on the original data.
• Suppose that min_A and max_A are the minimum and maximum values of an attribute, A.
• Min-max normalization maps a value, v_i, of A to v_i' in the range [new_min_A, new_max_A] by computing
$$ v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$
• E.g., suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716.
51
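A minimal sketch of min-max normalization that reproduces the income example above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the slide: $73,600 with min $12,000 and max $98,000.
print(round(min_max(73600, 12000, 98000), 3))   # 0.716
```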
Data Transformation: Z-score Normalization
• In z-score normalization, the values of an attribute A are normalized using the mean \bar{A} and standard deviation \sigma_A of A:
$$ v_i' = \frac{v_i - \bar{A}}{\sigma_A} $$
• E.g., suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. A value of $73,600 for income is then transformed to (73,600 − 54,000) / 16,000 = 1.225.
52
Data Transformation: Decimal scaling Normalization
53
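Minimal sketches of the remaining two normalization methods. The z-score line reproduces the income example from the previous slide; the decimal-scaling function follows the standard definition v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1 (assumed here, since the slide gives only the title):

```python
import numpy as np

# Z-score: (v - mean) / std, using the slide's income statistics.
print(round((73600 - 54000) / 16000, 3))   # 1.225

def decimal_scaling(values):
    """Normalize by moving the decimal point of the attribute values."""
    values = np.asarray(values, dtype=float)
    j = len(str(int(np.max(np.abs(values)))))   # digits in the largest absolute value
    return values / (10 ** j)

print(decimal_scaling([-986, 217, 45]))   # [-0.986  0.217  0.045]
```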
Data Warehouse and Data Mining
• Introduction to KDD
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
54
Exercise
• Define normalization.
• What is the value range of min-max normalization? Use min-max normalization to normalize the following group of data: 8, 10, 15, 20.
• Solution (mapped to the range [0.0, 1.0]):
Marks | Marks after min-max normalization
8 | 0.00
10 | 0.17
15 | 0.58
20 | 1.00
Data Reduction Strategies
56
Data Reduction Strategies
57
Data Cube Aggregation
58
Data Cube Aggregation
60
Data Reduction 1- Dimensionality Reduction
61
Dimensionality reduction :Attribute Subset Selection
• Reduces the data set size by removing irrelevant or redundant attributes (or
dimensions)
• Goal of attribute subset selection is to find a minimum set of attributes
• Improves speed of mining as dataset size is reduced
• Mining on a reduced data set also makes the discovered pattern easier to
understand
• Redundant attributes duplicate information contained in one or more other attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes contain no information that is useful for the data mining task at hand
• E.g., students' telephone numbers are often irrelevant to the task of predicting students' CGPA
62
Attribute Subset Selection
63
Heuristic (Greedy) methods for attribute subset selection
64
Heuristic (Greedy) methods for attribute subset selection(cont)
65
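Since the slides above name heuristic (greedy) methods for attribute subset selection, here is a minimal sketch of stepwise forward selection; the scoring function and attribute names are hypothetical stand-ins for, e.g., a classifier's cross-validated accuracy:

```python
# Greedy (stepwise) forward attribute subset selection.
# score_fn is a hypothetical user-supplied function that rates an attribute subset.

def forward_selection(attributes, score_fn, max_attrs=None):
    selected = []
    remaining = list(attributes)
    best_score = float("-inf")
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # Try adding each remaining attribute and keep the one that helps most.
        scored = [(score_fn(selected + [a]), a) for a in remaining]
        score, best_attr = max(scored)
        if score <= best_score:        # no improvement -> stop
            break
        best_score = score
        selected.append(best_attr)
        remaining.remove(best_attr)
    return selected

# Toy scoring function (higher is better); 'phone' never pays off and is dropped.
weights = {"income": 3.0, "age": 2.0, "credit": 2.5, "phone": 0.1}
toy_score = lambda subset: sum(weights[a] for a in subset) - 0.5 * len(subset)
print(forward_selection(weights.keys(), toy_score))   # ['income', 'credit', 'age']
```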
Example of Decision Tree Induction
Data Reduction 2: Numerosity Reduction
• Parametric methods
• Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
• E.g., log-linear models: estimate the value (probability) at a point in m-D space as a product of values on appropriate marginal (lower-dimensional) subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling
67
Data Reduction 2: Numerosity Reduction
68
Regression and Log-Linear Models
69
Histograms
• Histograms (or frequency histograms) are at least a century old and are widely used.
• Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.
• Height of the bar indicates the frequency (i.e., count) of that X value
• Range of values for X is partitioned into disjoint consecutive subranges.
• Subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X.
• Range of a bucket is known as the width
• Typically, the buckets are of equal width.
• Eg. a price attribute with a value range of $1 to $200 can be partitioned into subranges 1 to 20, 21
to 40, 41 to 60, and so on.
• For each subrange, a bar is drawn with a height that represents the total count of items observed
within the subrange
70
Histograms
71
Histogram
• Divide data into buckets and store
average (sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth)
72
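A small sketch contrasting the two partitioning rules, reusing the exercise's 12 price records: equal-width buckets span equal ranges, while equal-frequency buckets hold equal counts:

```python
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width: 3 buckets of identical range; store the count per bucket.
counts, edges = np.histogram(prices, bins=3)
print(edges.round(1))   # bucket boundaries, each spanning (215 - 5) / 3 = 70 units
print(counts)           # most values fall in the first bucket: [9 1 2]

# Equal-frequency (equal-depth): 3 buckets with the same number of values.
for bucket in np.array_split(np.sort(prices), 3):
    print(bucket.min(), "-", bucket.max(), ": count", len(bucket))
```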
Histogram Analysis- Explanation & Example
73
Histogram Analysis- Explanation & Example
74
Histogram Analysis- Explanation & Example
75
Histogram of an image : Application
As you can see from the graph, most of the high-frequency bars lie in the first half of the range, which is the darker portion. This means the image we have is predominantly dark.
76
Data Compression
• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless
• But only limited manipulation is possible without expansion
• Audio/video, image compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
• Time sequences are not audio
• They are typically short and vary slowly with time
77
Data Compression
(Figure: lossy compression — the original data can only be approximated after compression.)
78
Clustering
• Partition data set into clusters, and store cluster representation only
• Quality of clusters measured by their diameter (max distance
between any two objects in the cluster) or centroid distance (avg.
distance of each cluster object from its centroid)
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-dimensional
index tree structures (B+-tree, R-tree, quad-tree, etc))
• There are many choices of clustering definitions and clustering
algorithms (further details later)
79
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of
the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling:
• In stratified sampling, researchers divide subjects into subgroups called strata
based on characteristics that they share (e.g., race, gender, educational
attainment). Once divided, each subgroup is randomly sampled using another
probability sampling method.
• Note: Sampling may not reduce database I/Os (page at a time)
80
Sampling
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the
size of the data
• Cost of sampling: proportional to the size of the sample, increases linearly with
the number of dimensions
• Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
• Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the overall
database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
• Sampling: natural choice for progressive refinement of a reduced data set.
81
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
82
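A pandas sketch of the three sampling schemes, using a hypothetical customer table whose skewed credit_risk column serves as the stratification attribute:

```python
import pandas as pd

# Hypothetical, skewed population: far more "low" risk customers than "high".
df = pd.DataFrame({
    "cust_id": range(1, 11),
    "credit_risk": ["low"] * 8 + ["high"] * 2,
})

srswor = df.sample(n=4, replace=False, random_state=0)   # without replacement
srswr  = df.sample(n=4, replace=True,  random_state=0)   # with replacement

# Stratified: sample the same fraction from each credit_risk stratum,
# so the rare "high" class is still represented.
stratified = (df.groupby("credit_risk", group_keys=False)
                .apply(lambda g: g.sample(frac=0.5, random_state=0)))

print(srswor)
print(srswr)
print(stratified)
```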
Types of Sampling
(Figure: raw data reduced by SRSWOR — simple random sampling without replacement — and by SRSWR — simple random sampling with replacement.)
83
Sampling
84
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: divide the range of a continuous attribute into intervals
• Discretization is one form of data transformation. It transforms numeric values into interval labels or conceptual labels; e.g., age can be transformed to intervals (0–10, 11–20, ...) or to conceptual labels such as youth, adult, senior
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
85
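A minimal pandas sketch of the age example above: numeric ages mapped to interval labels and to conceptual labels (the youth/adult/senior cut points are illustrative assumptions):

```python
import pandas as pd

ages = pd.Series([6, 15, 23, 37, 45, 52, 61, 70])

# Interval labels: fixed-width bins such as (0, 10], (10, 20], ...
intervals = pd.cut(ages, bins=range(0, 81, 10))

# Conceptual labels: assumed concept hierarchy youth < adult < senior.
concepts = pd.cut(ages, bins=[0, 20, 60, 120],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```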
Data Discretization
(Figure: a numeric attribute discretized into intervals labeled y1–y6.)
86
Discretization and Concept Hierarchies
• Discretization
• reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values.
• Concept Hierarchies
• reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
87
Discretization and Concept Hierarchies : Numerical data
88
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)
89
Concept of Hierarchy in Nominal Data
95
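A minimal sketch of a concept hierarchy for nominal data, rolling a location attribute up the familiar city < state < country hierarchy; the mapping tables are illustrative:

```python
# Assumed concept hierarchy for the nominal attribute "location":
# city < state < country (the mapping tables below are hypothetical).
city_to_state = {"Pune": "Maharashtra", "Mumbai": "Maharashtra", "Chennai": "Tamil Nadu"}
state_to_country = {"Maharashtra": "India", "Tamil Nadu": "India"}

def generalize(city: str, level: str) -> str:
    """Roll a city value up the concept hierarchy to the requested level."""
    if level == "city":
        return city
    state = city_to_state[city]
    if level == "state":
        return state
    return state_to_country[state]   # level == "country"

print(generalize("Pune", "state"))    # Maharashtra
print(generalize("Pune", "country"))  # India
```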