CH 2

1

Why Data Preprocessing?


• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age="42" Birthday="03/07/1997"
• e.g., Was rating "1,2,3", now rating "A, B, C”
• e.g., discrepancy (disagreement) between duplicate records

2
Why is Data Dirty?
• Incomplete data may come from
• "Not applicable" data values when collected
• Different considerations between the time when the data was collected and when it is analyzed
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer errors at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning

3
Why Is Data Preprocessing
Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the majority
of the work of building a data warehouse

4
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
• Where raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10,
11-20, etc.)

5
Forms of Data Preprocessing

6
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Note: n is the sample size and N is the population size.
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Trimmed mean: chopping extreme values before averaging
• Median:
• Middle value if odd number of values, or average of the middle two values otherwise
• Estimated by interpolation (for grouped data): $\text{median} \approx L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right)\times \text{width}$
• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
7
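As a quick illustration of these measures (not part of the original slides), here is a minimal Python sketch computing the mean, weighted mean, median, and mode; the price list from slide 19 is borrowed as sample data, and uniform weights are an assumption.

```python
# Minimal sketch of central-tendency measures, assuming the slide-19 price data.
from collections import Counter
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

mean = prices.mean()                                   # arithmetic mean
median = np.median(prices)                             # average of the two middle values (even n)
mode = Counter(prices.tolist()).most_common(1)[0][0]   # most frequent value (21 appears twice)

weights = np.ones_like(prices)                         # weighted mean; uniform weights assumed
weighted_mean = np.average(prices, weights=weights)

print(mean, median, mode, weighted_mean)
```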
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively
skewed data

[Figure: symmetric, positively skewed, and negatively skewed distributions]

8
Dispersion
• The word literally means "scattered"
• Dispersion is the measure of the variation of items or observations of
a data set.
• A: 12 12 12 12 12 where mean = 12
• B: 8 10 13 15 14 where mean = 12
• C: 2 10 13 15 20 where mean = 12

9
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, median, Q3, max


• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 x IQR below Q1 or above Q3
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
• Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
• Standard deviation s (or σ) is the square root of variance s2 (or σ2)

10
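The following minimal Python sketch (added for illustration, not from the slides) computes the variance, standard deviation, quartiles, IQR, and five-number summary for data set C from the Dispersion slide, applying the 1.5 x IQR outlier rule described above.

```python
# Sketch of dispersion measures, assuming data set C from the Dispersion slide.
import numpy as np

data = np.array([2, 10, 13, 15, 20])            # data set C, mean = 12

sample_var = data.var(ddof=1)                   # s^2: divides by n - 1
pop_var = data.var(ddof=0)                      # sigma^2: divides by N
sample_std = np.sqrt(sample_var)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
five_number = (data.min(), q1, median, q3, data.max())

# Flag values more than 1.5 * IQR below Q1 or above Q3 as potential outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```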
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

11
Visualization of Data Dispersion: 3-D Boxplots

12
Data Cleaning
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration

13
Why Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• the history or changes of the data were not registered

14
How to Handle Missing Data?
• Ignore the tuple
• usually done when the tuple contains several attributes with missing values
• Fill in the missing value manually
• time-consuming and may not be feasible for a large data set
• Use a global constant to fill in the missing value
• replace all missing values by the same constant, such as a label like "Unknown"
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class
• for example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit-risk category as that of the given tuple
• Use the most probable value to fill in the missing value
• with the help of decision trees, regression, or Bayesian inference (Chap. 6)

15
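Below is a small illustrative sketch (not from the slides) of two of the strategies above: filling missing values with the attribute mean and with the class-conditional mean. The `income` and `credit_risk` arrays are made-up sample data.

```python
# Sketch of mean-based imputation; income and credit_risk are assumed sample data.
import numpy as np

income = np.array([35000.0, 42000.0, np.nan, 61000.0, 58000.0])
credit_risk = np.array(["low", "low", "high", "high", "high"])

# Strategy 1: fill missing values with the overall attribute mean
filled_mean = np.where(np.isnan(income), np.nanmean(income), income)

# Strategy 2: fill with the mean income of tuples in the same credit-risk class
filled_class = income.copy()
for cls in np.unique(credit_risk):
    mask = credit_risk == cls
    class_mean = np.nanmean(income[mask])
    filled_class[mask & np.isnan(income)] = class_mean
```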
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
16
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data to a regression function, i.e., fitting a curve to the set of points and
using the fitted values
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
17
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –
A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


• Divides the range into N intervals, each containing approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
18
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
19
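As an illustration, the following Python sketch (added here, not part of the slides) reproduces the worked example above: equal-frequency partitioning followed by smoothing by bin means and by bin boundaries. Rounding the means to integers and the "replace by the closer boundary" rule are assumptions that match the slide's numbers.

```python
# Sketch reproducing the slide's equal-frequency binning example.
import numpy as np

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(prices, 3)                 # three equal-frequency bins of 4 values

# Smoothing by bin means (rounded to the nearest integer, as on the slide)
smoothed_by_means = [np.full(len(b), int(round(float(np.mean(b))))) for b in bins]

def smooth_by_boundaries(b):
    # replace each value with whichever bin boundary (min or max) is closer
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

smoothed_by_boundaries = [smooth_by_boundaries(list(b)) for b in bins]
# means:      [9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]
# boundaries: [4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]
```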
Regression Analysis
• In a general sense, regression is the estimation of unknown values from a set of
known values.
• Mathematically, it is a measure of the average relationship between two or more
variables in terms of the original units of the data.
• Two types of variables are used here
• Dependent and independent variables
• Types depending on the regression curve
• Linear and non-linear regression

20
Simple Linear Regression

[Figure: data points with fitted line y = x + 1; Y1 is an observed value and Y1' the predicted value at X1]

21
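A minimal sketch of simple linear regression (not from the slides): least-squares fitting of a line to a few made-up points that roughly follow the y = x + 1 line shown in the figure, then predicting Y1' at a chosen X1.

```python
# Sketch of simple linear regression; the data points and X1 are assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.9])          # roughly y = x + 1 with noise

slope, intercept = np.polyfit(x, y, deg=1)       # least-squares fit of a degree-1 polynomial

x1 = 3.5
y1_pred = slope * x1 + intercept                 # Y1': predicted value at X1
```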
Cluster Analysis

22
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Sources may include databases, data cubes, or flat files

23
Issues Related to Data Integration
• Schema (the overall design of a database) integration and object matching can be
tricky
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g.,
Customer_id = Customer_number
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources may differ
• Possible reasons: different representations, different scales, e.g., metric vs. British
units, mm vs. inch
• Data Redundancy
• Annual revenue can be calculated from some other attributes

24
Correlation and Correlation Analysis
• Correlation
• Is an analysis of the co-variation between two or more variables
• Two variables are said to be correlated if a change in one variable results in a
corresponding change in the other variable
• Positive correlation
• Weight vs. Height
• Negative Correlation
• Sale of woolen cloth vs. temperature

25
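For illustration (not part of the slides), here is a short sketch computing the Pearson correlation coefficient on made-up height/weight data; a coefficient near +1 indicates positive correlation, near -1 negative correlation.

```python
# Sketch of correlation analysis; height and weight values are assumed sample data.
import numpy as np

height = np.array([150, 160, 165, 172, 180])     # cm
weight = np.array([50, 58, 63, 70, 80])          # kg

r = np.corrcoef(height, weight)[0, 1]            # Pearson correlation coefficient
# r near +1 -> positive correlation (e.g., weight vs. height)
# r near -1 -> negative correlation (e.g., woolen-cloth sales vs. temperature)
```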
Handling Redundancy in Data
Integration
• Redundant data occur often when multiple databases are integrated
• Object identification: The same attribute or object may have different names
in different databases
• Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
26
Operations in Data Transformation
• Smoothing
• Which works to remove noise from the data
• Such techniques include binning, regression, and clustering
• Aggregation
• Summarization, data cube construction
• Generalization
• Data are replaced by higher-level concepts through the use of concept hierarchies
• For example categorical attributes, like street can be generalized to higher-level concepts, like city or country
• Normalization: scaled to fall within a small specified range
• Min-max Normalization
• Z-score normalization
• Normalization by decimal scaling
• Attributes/ feature construction
• New attributes constructed from the given ones (e.g., age can be constructed from date of birth)

27
Min-max normalization
• It performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an
attribute, A. Min-max normalization maps a value, vi, of A to v'i in
the range [new_minA, new_maxA] by computing
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
• Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

28
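A small sketch of min-max normalization (added for illustration): it maps assumed income values, including the slide's $73,600, from the range [$12,000, $98,000] to [0.0, 1.0].

```python
# Sketch of min-max normalization; the income array is assumed sample data.
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])
new_min, new_max = 0.0, 1.0

v_norm = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min
# 73,600 maps to (73600 - 12000) / (98000 - 12000) = 0.716...
```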
Z-score normalization
• In z-score normalization (or zero-mean normalization), the values for
an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A. A value, vi, of A is normalized to v'i by
computing
$v' = \frac{v - \bar{A}}{\sigma_A}$
• Ex. Suppose that the mean and standard deviation of the values for the
attribute income are $54,000 and $16,000, respectively. With z-score
normalization, a value of $73,600 for income is transformed to
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

29
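A corresponding sketch of z-score normalization (not from the slides); the mean and standard deviation are taken from the slide's example ($54,000 and $16,000), and the other income values are made up.

```python
# Sketch of z-score normalization; mean and std taken as given on the slide.
import numpy as np

income = np.array([31_000.0, 54_000.0, 73_600.0, 98_000.0])   # assumed sample values
mean, std = 54_000.0, 16_000.0

z = (income - mean) / std            # zero-mean, unit-std normalization
# 73,600 -> (73600 - 54000) / 16000 = 1.225
```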
Normalization by decimal scaling
• Normalizes by moving the decimal point of the values of attribute A.
• The number of decimal places moved depends on the maximum
absolute value of A.
• A value, vi, of A is normalized to v'i by computing
$v' = \frac{v}{10^{\,j}}$, where j is the smallest integer such that $\max(|v'|) < 1$

Suppose that the recorded values of A range from −986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we
therefore divide each value by 1000 (i.e., j = 3) so that −986 normalizes to
−0.986 and 917 normalizes to 0.917.
30
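A small sketch of decimal-scaling normalization (added for illustration) using the slide's range of -986 to 917; the middle value 120 is a made-up filler.

```python
# Sketch of decimal-scaling normalization; 120 is an assumed extra value.
import numpy as np

values = np.array([-986.0, 120.0, 917.0])

max_abs = np.abs(values).max()
j = 0
while max_abs / 10 ** j >= 1:        # smallest integer j such that max(|v'|) < 1
    j += 1

scaled = values / 10 ** j
# max |value| = 986 -> j = 3 -> -986 becomes -0.986 and 917 becomes 0.917
```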
Data Reduction Strategies
• Why data reduction?
• A database/data warehouse may store terabytes of data.
• Complex data analysis may take a very long time to run on the complete data set.
• Data reduction:
• Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical
results
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
31
Data Cube Aggregation

32
Dimensionality Reduction
• The process of reducing the number of random variables or attributes
under consideration
• in dimensionality reduction, data encoding or transformations are applied
so as to obtain a reduced or compressed representation of the original
data.
• if the original data can be reconstructed from the compressed data without
any loss of information, the data reduction is called lossless
• If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy
• Two popular and effective methods of lossy dimensionality reduction:
• wavelet transforms and principal components analysis (3.4.2 and 3.4.3)
33
Numerosity Reduction
• These techniques replace the original data volume with alternative,
smaller forms of data representation.
• The techniques may be parametric or nonparametric.
• For parametric methods, log-linear models, which estimate discrete
multidimensional probability distributions, are an example. (3.4.5)
• Nonparametric methods for storing reduced representations of the
data include histograms, clustering, and sampling. (3.4.6-9)

34
Data Discretization and Concept
Hierarchy Generation
• Data discretization techniques can be used to reduce the number of values
for a given continuous attribute by dividing the range of the attribute into
intervals.
• Interval labels can then be used to replace actual data values.
• Replacing numerous values of a continuous attribute by a small number of
interval labels thereby reduces and simplifies the original data.
• This leads to a concise, easy-to-use, knowledge-level representation of
mining results.
• Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts (such as numerical values for the attribute
age) with higher-level concepts (such as youth, middle-aged, or senior).
35
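As a hedged illustration of discretization with interval labels (not from the slides), the following sketch maps made-up age values onto interval labels such as 0-10 and 11-20 using NumPy's digitize; the ages, interval edges, and label strings are assumptions.

```python
# Sketch of discretization by interval labels; ages and edges are assumed sample data.
import numpy as np

ages = np.array([3, 7, 15, 22, 38, 45, 67])
edges = [0, 10, 20, 30, 40, 50, 60, 70]
labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"]

idx = np.digitize(ages, edges[1:], right=True)   # index of the interval each age falls into
age_labels = [labels[i] for i in idx]
# e.g., 15 -> "11-20", 45 -> "41-50"
```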
Attribute Subset Selection
• Feature selection (i.e., attribute subset selection):
• reduces the data set size by removing irrelevant or redundant attributes (or
dimensions)
• find a minimum set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original
distribution obtained using all attributes
• it reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand

36
