Unit 2 Data Preprocessing for Students.pptx
KDD Process: Summary
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining as the confluence of database technology, statistics, pattern recognition, algorithms, and other disciplines.]
Why Not Traditional Data Analysis?
Data Mining: On What Kinds of Data?
Data Mining Functionalities
● Cluster analysis
● Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
● Maximizing intra-class similarity & minimizing interclass similarity
● Outlier analysis
● Outlier: Data object that does not comply with the general behavior
of the data
● Noise or exception? Useful in fraud detection, rare events analysis
● Trend and evolution analysis
● Trend and deviation: e.g., regression analysis
● Sequential pattern mining: e.g., digital camera 🡪 large SD memory
● Periodicity analysis
● Similarity-based analysis
● Other pattern-directed or statistical analyses
Major Issues in Data Mining
● Mining methodology
● Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
● Performance: efficiency, effectiveness, and scalability
● Pattern evaluation: the interestingness problem
● Incorporation of background knowledge
● Handling noise and incomplete data
● Parallel, distributed and incremental mining methods
● Integration of the discovered knowledge with existing knowledge: knowledge fusion
● User interaction
● Data mining query languages and ad-hoc mining
● Expression and visualization of data mining results
● Interactive mining of knowledge at multiple levels of abstraction
● Applications and social impacts
● Domain-specific data mining & invisible data mining
● Protection of data security, integrity, and privacy
Architecture: Typical Data Mining System
[Figure: components of a typical data mining system — pattern evaluation, data mining engine, knowledge base, and database or data warehouse server.]
Interval
Variables that have constant, equal distances between values, but the zero point is arbitrary.
Attribute types and the statistics that apply to them:
● Nominal — the values are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color. Statistics: mode, entropy, contingency correlation, χ² test.
● Interval — the differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Statistics: mean, standard deviation, Pearson's correlation, t and F tests.
● Ratio — both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Statistics: geometric mean, harmonic mean, percent variation.
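As a quick illustration of which summary statistics fit which attribute type, here is a minimal Python sketch; the column names and values are invented for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical records: one nominal, one interval, and one ratio attribute
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "brown", "green"],   # nominal
    "temp_celsius": [21.0, 23.5, 19.0, 22.0],            # interval
    "age": [34, 29, 41, 25],                              # ratio
})

# Nominal: only equality comparisons make sense, so report the mode
print("mode of eye_color:", df["eye_color"].mode()[0])

# Interval: differences are meaningful, so mean and standard deviation apply
print("mean temp:", df["temp_celsius"].mean(), "std:", df["temp_celsius"].std())

# Ratio: ratios are meaningful too, so the geometric mean is also valid
print("geometric mean of age:", np.exp(np.log(df["age"]).mean()))
```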
Levels of Measurement
● Higher level variables can always be expressed at a lower level,
but the reverse is not true.
● For example, Body Mass Index (BMI) is typically measured at an
interval-level such as 23.4.
● BMI can be collapsed into lower-level Ordinal categories
such as:
• >30: Obese
• 25-29.9: Overweight
• <25: Underweight
or Nominal categories such as:
• Overweight
• Not overweight
Tip : measure data at the highest level of measurement possible.
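A small pandas sketch of collapsing interval-level BMI values into the lower-level categories above; the sample BMI values are hypothetical and the cut-offs follow the slide:

```python
import pandas as pd

bmi = pd.Series([23.4, 31.2, 27.8, 19.5])  # hypothetical interval-level BMI values

# Ordinal categories using the slide's cut-offs (<25, 25-29.9, >30)
ordinal = pd.cut(bmi, bins=[0, 25, 30, float("inf")],
                 labels=["Underweight", "Overweight", "Obese"], right=False)

# Nominal categories: overweight vs. not overweight
nominal = (bmi >= 25).map({True: "Overweight", False: "Not overweight"})

print(pd.DataFrame({"BMI": bmi, "ordinal": ordinal, "nominal": nominal}))
```

Note that each collapse discards information, which is why the tip above recommends measuring at the highest level possible.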
Discrete Data
Types of Data Sets
● Record
  ● Data Matrix
  ● Document Data
  ● Transaction Data
● Graph
  ● World Wide Web
  ● Molecular Structures
● Ordered
  ● Spatial Data
  ● Temporal Data
  ● Sequential Data
  ● Genetic Sequence Data
Record Data
● Data that consists of a collection of records, each of which consists
of a fixed set of attributes
Data Matrix
● If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a
multi-dimensional space, where each dimension represents a
distinct attribute
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beans, Bread
3 Beans, Coke, Jam, Milk
4 Beans, Bread, Jam, Milk
5 Coke, Jam, Milk
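The transaction table above can be re-expressed as a 0/1 data matrix, with one row per TID and one column per item — a minimal pandas sketch using the slide's transactions:

```python
import pandas as pd

# Transactions taken from the TID/Items table above
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beans", "Bread"],
    3: ["Beans", "Coke", "Jam", "Milk"],
    4: ["Beans", "Bread", "Jam", "Milk"],
    5: ["Coke", "Jam", "Milk"],
}

# Flatten to (TID, item) pairs, then pivot into a binary data matrix
pairs = pd.DataFrame(
    [(tid, item) for tid, items in transactions.items() for item in items],
    columns=["TID", "Item"],
)
matrix = pd.crosstab(pairs["TID"], pairs["Item"])
print(matrix)  # rows = transactions, columns = items, entries = 0/1
```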
Graph Data
Measuring the Central Tendency
● Mean
● Median
● Mode
● Midrange
Mean
Measuring the Dispersion of Data
● Range
● Five number summary (based on Quartiles)
● Inter quartile range
● Standard deviation
Range
● Range = maximum value − minimum value
● Interquartile range: IQR = Q3 – Q1
Quartiles and IQR
● 1, 5, 7, 9, 11, 15, 22, 24, 47 (n=9)
Median = Q2 = M = 11
Q1 = median of the lower half, i.e. 1, 5, 7, 9 = (5+7)/2 = 6
Q3 = median of the upper half, i.e. 15, 22, 24, 47 = (22+24)/2 = 23
Therefore, IQR = Q3 – Q1 = 23 – 6 = 17
Exercise
● 78, 80, 80, 81, 82, 83, 85, 85, 86, 87 (n=10)
Find:
● the median (Q2)
● the quartiles Q1 and Q3
● the smallest and largest individual observations
Solution:
Minimum = 78
Q1 = 80
Q2 = 82.5
Q3 = 85
Maximum = 87
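A minimal numpy sketch that reproduces the exercise; it uses the same "median of each half" convention as the slide, which can differ slightly from np.percentile's default interpolation:

```python
import numpy as np

data = np.sort(np.array([78, 80, 80, 81, 82, 83, 85, 85, 86, 87]))
n = len(data)

# Median = Q2
q2 = np.median(data)

# Quartiles via the "median of each half" rule used on the slide
lower, upper = data[: n // 2], data[(n + 1) // 2:]
q1, q3 = np.median(lower), np.median(upper)

print("min =", data[0], "Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "max =", data[-1])
print("IQR =", q3 - q1)   # expected: 80, 82.5, 85 -> IQR = 5
```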
Graphic Displays
● Boxplot
● Histograms
● Quantile plots
● Quantile-quantile plots (QQ plots)
● Scatter plots (XY plots)
Boxplot
Mild and Extreme Outliers
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7,
14.7, 14.9, 15.1, 15.9, 16.4 (n=15)
Median = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 – Q1 = 0.5
Example contd…
Inner fence:
Q1 – 1.5*IQR = 14.4 – 1.5*0.5 = 13.65
Q3 + 1.5*IQR = 14.9 + 1.5*0.5 = 15.65
Outer fence:
Q1 – 3*IQR = 14.4 – 3*0.5 = 12.9
Q3 + 3*IQR = 14.9 + 3*0.5 = 16.4
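A small Python sketch of the fence computation for the example data, flagging mild outliers (outside the inner fences but within the outer fences) and extreme outliers (outside the outer fences):

```python
import numpy as np

data = np.array([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
                 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

# Quartiles via the median-of-halves rule used on the slide
n = len(data)
q1 = np.median(data[: n // 2])
q3 = np.median(data[(n + 1) // 2:])
iqr = q3 - q1

inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # outer fences

extreme = data[(data < outer_lo) | (data > outer_hi)]
mild = data[((data < inner_lo) | (data > inner_hi))
            & (data >= outer_lo) & (data <= outer_hi)]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("inner fences:", inner_lo, inner_hi)   # ~13.65, 15.65
print("outer fences:", outer_lo, outer_hi)   # ~12.9, 16.4
print("mild outliers:", mild)       # 15.9 and 16.4 (16.4 lies exactly on the outer fence)
print("extreme outliers:", extreme) # 10.2
```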
median = 29.025
Q1 = 27.08
Q3 = 33.28
The interquartile range is IQR = Q3 – Q1 = 33.28 – 27.08 = 6.2
Inner fence: Q1 – 1.5*IQR = 17.78 and Q3 + 1.5*IQR = 42.58
Outer fence: Q1 – 3*IQR = 8.48 and Q3 + 3*IQR = 51.88
This boxplot is clearly not symmetrical.
However, the pattern of its skewness is not straightforward.
The box, corresponding to the middle 50% of the data, appears
to be right-skew, because the line marking the median is
towards the left of the box (so that the right section of the box
is longer than the left).
However, the longer whisker is on the left, indicating a longer
tail towards smaller values, which in turn suggests that the data
are left-skew.
The following example relates to birth weights of infants
exhibiting severe idiopathic respiratory distress syndrome
(SIRDS), and the question ‘Is it possible to relate the
chances of eventual survival to birth weight?’
ETL Process
Data Preprocessing
Data Quality - The Reality
Data Extraction
Major Tasks in Data Preprocessing
● Data cleaning
● Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
● Data integration
● Integration of multiple databases, data cubes, or files
● Data transformation
● Normalization and aggregation
● Data reduction
● Obtains reduced representation in volume but produces the
same or similar analytical results
● Data discretization
● Part of data reduction but of particular importance, especially for
numerical data (includes concept hierarchy generation)
Data Cleaning
● Data in the real world is dirty: lots of potentially incorrect data, e.g., due to faulty instruments, human or
computer error, or transmission error
● incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
● e.g., Occupation=“ ” (missing data)
● noisy: containing noise, errors, or outliers
● e.g., Salary=“−10” (an error)
● inconsistent: containing discrepancies in codes or names, e.g.,
● Age=“42”, Birthday=“03/07/2010”
● Was rating “1, 2, 3”, now rating “A, B, C”
● discrepancy between duplicate records
● Intentional (e.g., disguised missing data)
● Jan. 1 as everyone’s birthday?
Incomplete (missing) Data
● Data is not always available
● E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
● Missing data may be due to
● equipment malfunction
● inconsistent with other recorded data and thus deleted
● data not entered due to misunderstanding
● certain data may not be considered important at the time of entry
● Information is not collected (e.g., people decline to give their age and weight)
● Attributes may not be applicable to all cases (e.g., annual income is not applicable
to children)
● history or changes of the data were not registered
● Missing data may need to be inferred
How to Handle Missing Data?
● Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
● Fill in the missing value manually: tedious + infeasible?
● Fill in it automatically with
● a global constant : e.g., “unknown”, a new class?!
● the attribute mean
● the attribute mean for all samples belonging to the same class: smarter
● the most probable value: inference-based such as Bayesian formula or
decision tree
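A hedged pandas sketch of the automatic fill-in strategies above (global constant, attribute mean, and the class-conditional attribute mean); the toy table and column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (NaN / None) in two attributes
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, np.nan, 40.0],
    "city":   ["Pune", None, "Delhi", "Delhi", None],
})

# 1) Global constant for a nominal attribute
df["city"] = df["city"].fillna("unknown")

# 2) Attribute mean for a numeric attribute
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Smarter: the attribute mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```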
Noisy Data
● Binning
● first sort data and partition into (equal-frequency) bins
● then one can smooth by bin means, bin medians, or bin boundaries, etc. (see the sketch after this list)
● Regression
● smooth by fitting the data into regression functions
● Clustering
● detect and remove outliers
● Combined computer and human inspection
● detect suspicious values and check by human (e.g., deal with possible
outliers)
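A minimal sketch of equal-frequency (equal-depth) binning with smoothing by bin means and by bin boundaries; the sorted values are hypothetical example data:

```python
import numpy as np

# Hypothetical sorted values to be smoothed
prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
n_bins = 3
bins = np.array_split(prices, n_bins)   # equal-frequency partitions

for b in bins:
    by_means = np.full_like(b, b.mean())                       # smooth by bin means
    # smooth by bin boundaries: snap each value to the nearer boundary
    by_bounds = np.where(b - b.min() <= b.max() - b, b.min(), b.max())
    print("bin:", b, "-> means:", by_means, "-> boundaries:", by_bounds)
```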
Binning
Regression
[Figure: regression of salary (y) on age (x) with fitted line y = x + 1 — the value of 'age' can be used to predict the value of 'salary'.]
Clustering
[Figure: clusters of (age, salary) points — the data is smoothed by clustering, and values falling outside the clusters are treated as outliers and removed.]
Duplicate Data
● Data set may include data objects that are duplicates, or
almost duplicates of one another
● Major issue when merging data from heterogenous sources
● Examples:
● Same person with multiple email addresses
● Data cleaning
● Process of dealing with duplicate data issues
Data Cleaning as a Process
● Data discrepancy detection
● Use metadata (e.g., domain, range, dependency, distribution)
● Check field overloading
● Check uniqueness rule, consecutive rule and null rule
● Use commercial tools
● Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect
errors and make corrections [1) Integrate.io · 2) Tibco Clarity · 3) DemandTools · 4) RingLead · 5) Melissa
Clean Suite · 6) WinPure]
● Data auditing: analyzing the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers) [1. Oracle Audit Vault and Database Firewall · 2. IBM Guardium Data Protection · 3. Imperva SecureSphere Database]
● Data migration and integration
● Data migration tools: allow transformations to be specified
● ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
● Integration of the two processes
● Iterative and interactive (e.g., Potter's Wheel)
Data Integration
Data Integration Across Sources
Schema Integration
● Developing a unified representation of semantically similar
information, structured and stored differently in the
individual databases.
Data Integrity Problems
Χ² (Chi-Square) Test
Χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij
● o_ij is the observed frequency (i.e., actual count) of the joint event (A_i, B_j)
● e_ij is the expected frequency of (A_i, B_j): e_ij = count(A = a_i) × count(B = b_j) / n
● The larger the Χ² value, the more likely the variables are related
● The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
● The chi-square statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them
● The test is based on a significance level (SL), with degrees of freedom = (r − 1)(c − 1), where r and c are the numbers of distinct values of A and B
● If the hypothesis can be rejected, then we say that A and B are statistically correlated
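A short numpy sketch of the Χ² computation on a hypothetical 2×2 contingency table; the observed counts are invented, and scipy.stats.chi2_contingency(obs, correction=False) would return the same statistic:

```python
import numpy as np

# Hypothetical observed contingency table for two nominal attributes A and B
#                B = yes   B = no
# A = yes          250       200
# A = no            50      1000
obs = np.array([[250.0, 200.0],
                [50.0, 1000.0]])

n = obs.sum()
row_tot = obs.sum(axis=1, keepdims=True)
col_tot = obs.sum(axis=0, keepdims=True)

# Expected counts under independence: e_ij = count(A=a_i) * count(B=b_j) / n
expected = row_tot @ col_tot / n

chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print("chi-square =", round(chi2, 2), "with", dof, "degree(s) of freedom")
# Compare chi2 against the critical value at the chosen significance level;
# if it is larger, reject the independence hypothesis.
```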
Correlation Analysis (Numeric Data)
Pearson's Coefficient
r_A,B = Σ_i (a_i − Ā)(b_i − B̄) / (n σ_A σ_B) = (Σ_i a_i b_i − n Ā B̄) / (n σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the
respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
● r_A,B > 0: A and B are positively correlated (A's values increase as B's do); the higher the value,
the stronger the correlation.
● r_A,B = 0: no linear correlation (uncorrelated);
● r_A,B < 0: negatively correlated
● The Pearson product-moment correlation coefficient (or Pearson
correlation coefficient, for short) is a measure of the strength of a
linear association between two variables and is denoted by r.
● Pearson's product-moment correlation attempts to draw a line of best fit
through the data of the two variables, and the Pearson correlation
coefficient, r, indicates how far away the data points are from this
line of best fit (i.e., how well the data points fit this new model/line of
best fit).
● The Pearson correlation coefficient, r, can take a range of
values from +1 to -1.
● r = 0 indicates that there is no association between the two
variables.
● r > 0 indicates a positive association; that is, as the value of
one variable increases, so does the value of the other
variable.
● r < 0 indicates a negative association; that is, as the value
of one variable increases, the value of the other variable
decreases.
● The first step is to draw a scatter plot of the variables to
check for linearity.
● The correlation coefficient should not be calculated if the
relationship is not linear.
● For the purposes of correlation alone, it does not really matter on
which axis the variables are plotted. However, conventionally,
the independent (or explanatory) variable is plotted on the
x-axis (horizontally) and the dependent (or response) variable
is plotted on the y-axis (vertically).
● The nearer the scatter of points is to a straight line, the
higher the strength of association between the variables.
Also, it does not matter what measurement units are used.
Formula
Example
r = 0.5298
Note that the strength of the association of the variables
depends on what you measure and sample sizes.
Covariance
Cov(A,B) = (1/n) Σ_i (a_i − Ā)(b_i − B̄)
● Positive covariance: Cov(A,B) > 0 indicates that the two variables tend to move in the same
direction.
● Negative covariance: Cov(A,B) < 0 indicates that the two variables tend to move in opposite
directions.
● Independence: if A and B are independent, then Cov(A,B) = 0, but the converse is not true:
● Some pairs of random variables may have a covariance of 0 and yet not be independent. Only under
some additional assumptions (e.g., the data follow multivariate normal distributions) does a
covariance of 0 imply independence.
● Calculate the mean (average) price for each asset.
● For each security, find the difference between each value and the mean price.
● Multiply the paired differences together and average the products to obtain the covariance.
● The positive covariance indicates that the price of the ABC Corp. stock and
the S&P 500 tend to move in the same direction.
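A small numpy sketch that follows the steps above on hypothetical price series for the two assets (the prices are invented, not the slide's actual figures):

```python
import numpy as np

# Hypothetical closing prices for the two assets
abc = np.array([1.2, 1.4, 1.5, 1.7, 1.9])          # ABC Corp. stock
sp500 = np.array([2600., 2650., 2700., 2750., 2800.])  # S&P 500

# Step 1: mean price of each asset
abc_mean, sp_mean = abc.mean(), sp500.mean()

# Step 2: deviation of each price from its mean
d_abc, d_sp = abc - abc_mean, sp500 - sp_mean

# Step 3: average the products of the paired deviations -> covariance
cov = (d_abc * d_sp).mean()
print("Cov(ABC, S&P 500) =", cov)                   # positive: they move together

# Normalising by the standard deviations gives the correlation coefficient
print("corr =", cov / (abc.std() * sp500.std()))
```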
Covariance vs. Correlation
ρ(X,Y) = Cov(X,Y) / (σX σY)
Where:
● ρ(X,Y) – the correlation between the variables X and Y
● Cov(X,Y) – the covariance between the variables X and Y
● σX – the standard deviation of the X-variable
● σY – the standard deviation of the Y-variable
Sampling
● Sampling is the main technique employed for data selection.
● It is often used for both the preliminary investigation of the data and the final data analysis.
● Statisticians sample because obtaining the entire set of data of interest is too
expensive or time consuming.
● Sampling is used in data mining because processing the entire set of data of interest
is too expensive or time consuming.
Sample Size
● Stratified sampling
● Split the data into several partitions (strata); then draw random samples from each partition (see the sketch below)
● Cluster sampling: split the data into 'm' disjoint clusters, then draw a simple random sample of the clusters
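A minimal pandas sketch of stratified sampling; the 'stratum' column and the 20% sampling fraction are assumptions for this example (GroupBy.sample requires pandas ≥ 1.1):

```python
import pandas as pd

# Hypothetical data with a stratum attribute (e.g., customer segment)
df = pd.DataFrame({
    "stratum": ["gold"] * 10 + ["silver"] * 30 + ["bronze"] * 60,
    "value": range(100),
})

# Stratified sampling: draw the same fraction from every stratum
sample = df.groupby("stratum", group_keys=False).sample(frac=0.2, random_state=42)

print(sample["stratum"].value_counts())   # ~2 gold, 6 silver, 12 bronze
```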
Data Transformation
● Min-max normalization: v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
● Z-score normalization: v' = (v − mean_A) / σ_A
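A short sketch of the two normalizations on a hypothetical income attribute, rescaling min-max to the common [0, 1] range:

```python
import numpy as np

income = np.array([12000., 35000., 54000., 73600., 98000.])  # hypothetical values

# Min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min

# Z-score normalization: centre on the mean, scale by the standard deviation
zscore = (income - income.mean()) / income.std()

print("min-max:", np.round(minmax, 3))
print("z-score:", np.round(zscore, 3))
```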