CH 2

1

Why Data Preprocessing?


• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age="42" Birthday="03/07/1997"
• e.g., Was rating "1,2,3", now rating "A, B, C”
• e.g., discrepancy (disagreement) between duplicate records

2
Why is Data Dirty?
• Incomplete data may come from
• "Not applicable" data values when collected
• Different considerations between the time when the data was collected and when it is analyzed
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer errors at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning

3
Why Is Data Preprocessing
Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the majority
of the work of building a data warehouse

4
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
• Where raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10,
11-20, etc.)

5
Forms of Data Preprocessing

6
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Note: n is the sample size and N is the population size.
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Trimmed mean: chopping extreme values before averaging
• Median:
• Middle value if odd number of values, or average of the middle two values otherwise
• Estimated by interpolation (for grouped data): $\text{median} \approx L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right)\times \text{width}$
• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
7
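As a quick illustration of these measures (not part of the original slides), here is a minimal Python sketch computing the mean, weighted mean, median, and mode; the price list from slide 19 is borrowed as sample data, and uniform weights are an assumption.

```python
# Minimal sketch of central-tendency measures, assuming the slide-19 price data.
from collections import Counter
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

mean = prices.mean()                                   # arithmetic mean
median = np.median(prices)                             # average of the two middle values (even n)
mode = Counter(prices.tolist()).most_common(1)[0][0]   # most frequent value (21 appears twice)

weights = np.ones_like(prices)                         # weighted mean; uniform weights assumed
weighted_mean = np.average(prices, weights=weights)

print(mean, median, mode, weighted_mean)
```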
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively
skewed data

[Figure: symmetric, positively skewed, and negatively skewed distributions]

8
Dispersion
• The word literally means "scattered"
• Dispersion is the measure of the variation of items or observations of
a data set.
• A: 12 12 12 12 12 where mean = 12
• B: 8 10 13 15 14 where mean = 12
• C: 2 10 13 15 20 where mean = 12

9
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1

• Five number summary: min, Q1, median, Q3, max


• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 x IQR below Q1 or above Q3
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
• Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
• Standard deviation s (or σ) is the square root of variance s2 (or σ2)

10
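The following minimal Python sketch (added for illustration, not from the slides) computes the variance, standard deviation, quartiles, IQR, and five-number summary for data set C from the Dispersion slide, applying the 1.5 x IQR outlier rule described above.

```python
# Sketch of dispersion measures, assuming data set C from the Dispersion slide.
import numpy as np

data = np.array([2, 10, 13, 15, 20])            # data set C, mean = 12

sample_var = data.var(ddof=1)                   # s^2: divides by n - 1
pop_var = data.var(ddof=0)                      # sigma^2: divides by N
sample_std = np.sqrt(sample_var)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
five_number = (data.min(), q1, median, q3, data.max())

# Flag values more than 1.5 * IQR below Q1 or above Q3 as potential outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```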
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

11
Visualization of Data Dispersion: 3-D Boxplots

12
Data Cleaning
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration

13
Why Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• the history or changes of the data were not registered

14
How to Handle Missing Data?
• Ignore the tuple
• usually done when the tuple contains several attributes with missing values
• Fill in the missing value manually
• time-consuming and may not be feasible for a large data set
• Use a global constant to fill in the missing value
• replace all missing values by the same constant, such as a label like "Unknown"
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class
• for example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit-risk category as that of the given tuple
• Use the most probable value to fill in the missing value
• with the help of decision trees, regression, or Bayesian inference (Chap. 6)

15
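Below is a small illustrative sketch (not from the slides) of two of the strategies above: filling missing values with the attribute mean and with the class-conditional mean. The `income` and `credit_risk` arrays are made-up sample data.

```python
# Sketch of mean-based imputation; income and credit_risk are assumed sample data.
import numpy as np

income = np.array([35000.0, 42000.0, np.nan, 61000.0, 58000.0])
credit_risk = np.array(["low", "low", "high", "high", "high"])

# Strategy 1: fill missing values with the overall attribute mean
filled_mean = np.where(np.isnan(income), np.nanmean(income), income)

# Strategy 2: fill with the mean income of tuples in the same credit-risk class
filled_class = income.copy()
for cls in np.unique(credit_risk):
    mask = credit_risk == cls
    class_mean = np.nanmean(income[mask])
    filled_class[mask & np.isnan(income)] = class_mean
```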
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
16
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data to a regression function, i.e., fitting a curve to the set of points and
using the fitted values
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
17
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –
A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


• Divides the range into N intervals, each containing approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
18
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
19
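As an illustration, the following Python sketch (added here, not part of the slides) reproduces the worked example above: equal-frequency partitioning followed by smoothing by bin means and by bin boundaries. Rounding the means to integers and the "replace by the closer boundary" rule are assumptions that match the slide's numbers.

```python
# Sketch reproducing the slide's equal-frequency binning example.
import numpy as np

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(prices, 3)                 # three equal-frequency bins of 4 values

# Smoothing by bin means (rounded to the nearest integer, as on the slide)
smoothed_by_means = [np.full(len(b), int(round(float(np.mean(b))))) for b in bins]

def smooth_by_boundaries(b):
    # replace each value with whichever bin boundary (min or max) is closer
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

smoothed_by_boundaries = [smooth_by_boundaries(list(b)) for b in bins]
# means:      [9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]
# boundaries: [4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]
```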
Regression Analysis
• In a general sense, regression is the estimation of unknown values from a set of
known values.
• Mathematically, it is a measure of the average relationship between two or more
variables in terms of the original units of the data.
• Two types of variables are used here
• Dependent and independent variables
• Types depending on the regression curve
• Linear and non-linear regression

20
Simple Linear Regression

[Figure: data points with fitted line y = x + 1; Y1 is an observed value and Y1' the predicted value at X1]

21
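A minimal sketch of simple linear regression (not from the slides): least-squares fitting of a line to a few made-up points that roughly follow the y = x + 1 line shown in the figure, then predicting Y1' at a chosen X1.

```python
# Sketch of simple linear regression; the data points and X1 are assumptions.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.9])          # roughly y = x + 1 with noise

slope, intercept = np.polyfit(x, y, deg=1)       # least-squares fit of a degree-1 polynomial

x1 = 3.5
y1_pred = slope * x1 + intercept                 # Y1': predicted value at X1
```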
Cluster Analysis

22
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Sources may include databases, data cubes, or flat files

23
Issues Related to Data Integration
• Schema (the overall design of a database) integration and object matching can be
tricky
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g.,
Customer_id = Customer_number
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources may differ
• Possible reasons: different representations, different scales, e.g., metric vs. British
units, mm vs. inch
• Data Redundancy
• Annual revenue can be calculated from some other attributes

24
Correlation and Correlation Analysis
• Correlation
• Is an analysis of the co-variation between two or more variables
• Two variables are said to be correlated if a change in one variable results in a
corresponding change in the other variable
• Positive correlation
• Weight vs. Height
• Negative Correlation
• Sale of woolen cloth vs. temperature

25
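For illustration (not part of the slides), here is a short sketch computing the Pearson correlation coefficient on made-up height/weight data; a coefficient near +1 indicates positive correlation, near -1 negative correlation.

```python
# Sketch of correlation analysis; height and weight values are assumed sample data.
import numpy as np

height = np.array([150, 160, 165, 172, 180])     # cm
weight = np.array([50, 58, 63, 70, 80])          # kg

r = np.corrcoef(height, weight)[0, 1]            # Pearson correlation coefficient
# r near +1 -> positive correlation (e.g., weight vs. height)
# r near -1 -> negative correlation (e.g., woolen-cloth sales vs. temperature)
```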
Handling Redundancy in Data
Integration
• Redundant data occur often when multiple databases are integrated
• Object identification: The same attribute or object may have different names
in different databases
• Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
26
Operations in Data Transformation
• Smoothing
• Which works to remove noise from the data
• Such techniques include binning, regression, and clustering
• Aggregation
• Summarization, data cube construction
• Generalization
• Data are replaced by higher-level concepts through the use of concept hierarchies
• For example categorical attributes, like street can be generalized to higher-level concepts, like city or country
• Normalization: scaled to fall within a small specified range
• Min-max Normalization
• Z-score normalization
• Normalization by decimal scaling
• Attributes/ feature construction
• New attributes constructed from the given ones (e.g., age can be constructed from date of birth)

27
Min-max normalization
• It performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an
attribute, A. Min-max normalization maps a value, vi, of A to v'i in
the range [new_minA, new_maxA] by computing
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
• Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

28
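A small sketch of min-max normalization (added for illustration): it maps assumed income values, including the slide's $73,600, from the range [$12,000, $98,000] to [0.0, 1.0].

```python
# Sketch of min-max normalization; the income array is assumed sample data.
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])
new_min, new_max = 0.0, 1.0

v_norm = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min
# 73,600 maps to (73600 - 12000) / (98000 - 12000) = 0.716...
```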
Z-score normalization
• In z-score normalization (or zero-mean normalization), the values for
an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A. A value, vi, of A is normalized to v'i by
computing
$v' = \frac{v - \bar{A}}{\sigma_A}$
• Ex. Suppose that the mean and standard deviation of the values for the
attribute income are $54,000 and $16,000, respectively. With z-score
normalization, a value of $73,600 for income is transformed to
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

29
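A corresponding sketch of z-score normalization (not from the slides); the mean and standard deviation are taken from the slide's example ($54,000 and $16,000), and the other income values are made up.

```python
# Sketch of z-score normalization; mean and std taken as given on the slide.
import numpy as np

income = np.array([31_000.0, 54_000.0, 73_600.0, 98_000.0])   # assumed sample values
mean, std = 54_000.0, 16_000.0

z = (income - mean) / std            # zero-mean, unit-std normalization
# 73,600 -> (73600 - 54000) / 16000 = 1.225
```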
Normalization by decimal scaling
• Normalizes by moving the decimal point of the values of attribute A.
• The number of decimal places moved depends on the maximum
absolute value of A.
• A value, vi, of A is normalized to v'i by computing
$v' = \frac{v}{10^{\,j}}$, where j is the smallest integer such that $\max(|v'|) < 1$

Suppose that the recorded values of A range from −986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we
therefore divide each value by 1000 (i.e., j = 3) so that −986 normalizes to
−0.986 and 917 normalizes to 0.917.
30
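A small sketch of decimal-scaling normalization (added for illustration) using the slide's range of -986 to 917; the middle value 120 is a made-up filler.

```python
# Sketch of decimal-scaling normalization; 120 is an assumed extra value.
import numpy as np

values = np.array([-986.0, 120.0, 917.0])

max_abs = np.abs(values).max()
j = 0
while max_abs / 10 ** j >= 1:        # smallest integer j such that max(|v'|) < 1
    j += 1

scaled = values / 10 ** j
# max |value| = 986 -> j = 3 -> -986 becomes -0.986 and 917 becomes 0.917
```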
Data Reduction Strategies
• Why data reduction?
• A database/data warehouse may store terabytes of data.
• Complex data analysis may take a very long time to run on the complete data set.
• Data reduction:
• Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical
results
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
31
Data Cube Aggregation

32
Dimensionality Reduction
• The process of reducing the number of random variables or attributes
under consideration
• in dimensionality reduction, data encoding or transformations are applied
so as to obtain a reduced or compressed representation of the original
data.
• if the original data can be reconstructed from the compressed data without
any loss of information, the data reduction is called lossless
• If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy
• Two popular and effective methods of lossy dimensionality reduction:
• wavelet transforms and principal components analysis (3.4.2 and 3.4.3)
33
Numerosity Reduction
• These techniques replace the original data volume with alternative,
smaller forms of data representation.
• The techniques may be parametric or nonparametric.
• For parametric methods, log-linear models, which estimate discrete
multidimensional probability distributions, are an example. (3.4.5)
• Nonparametric methods for storing reduced representations of the
data include histograms, clustering, and sampling. (3.4.6-9)

34
Data Discretization and Concept
Hierarchy Generation
• Data discretization techniques can be used to reduce the number of values
for a given continuous attribute by dividing the range of the attribute into
intervals.
• Interval labels can then be used to replace actual data values.
• Replacing numerous values of a continuous attribute by a small number of
interval labels thereby reduces and simplifies the original data.
• This leads to a concise, easy-to-use, knowledge-level representation of
mining results.
• Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts (such as numerical values for the attribute
age) with higher-level concepts (such as youth, middle-aged, or senior).
35
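As a hedged illustration of discretization with interval labels (not from the slides), the following sketch maps made-up age values onto interval labels such as 0-10 and 11-20 using NumPy's digitize; the ages, interval edges, and label strings are assumptions.

```python
# Sketch of discretization by interval labels; ages and edges are assumed sample data.
import numpy as np

ages = np.array([3, 7, 15, 22, 38, 45, 67])
edges = [0, 10, 20, 30, 40, 50, 60, 70]
labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"]

idx = np.digitize(ages, edges[1:], right=True)   # index of the interval each age falls into
age_labels = [labels[i] for i in idx]
# e.g., 15 -> "11-20", 45 -> "41-50"
```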
Attribute Subset Selection
• Feature selection (i.e., attribute subset selection):
• reduces the data set size by removing irrelevant or redundant attributes (or
dimensions)
• find a minimum set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original
distribution obtained using all attributes
• it reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand

36
