DM Ch3: Data Preprocessing

INF 489: Data Mining
Instructor: Dr. Mohamed H. Farrag

[Textbook cover: DATA MINING - Concepts and Techniques, Jiawei Han | Micheline Kamber | Jian Pei]


Textbook

Main textbook:
Data Mining: Concepts and Techniques (3rd ed.)
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University

Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar

Modified for Introduction to Data Mining by Dr. Mohamed H. Farrag


Chapter 3
Data Preprocessing


Data Preprocessing



Data Preprocessing
• Aggregation

• Sampling

• Dimensionality Reduction

• Feature subset selection

• Feature creation

• Discretization and Binarization

• Attribute Transformation



Aggregation
• Combining two or more attributes (or objects) into a
single attribute (or object)
• Purpose
- Data reduction
• Reduce the number of attributes or objects
- Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
- More "stable" data
• Aggregated data tends to have less variability
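A minimal sketch of aggregation with pandas (the table and column names "city", "state", "date", and "precip" are illustrative assumptions, not from the slides): days are rolled up into months and cities into states, which both reduces the data and changes its scale.

import pandas as pd

# Hypothetical daily data: one row per city per day
df = pd.DataFrame({
    "city":   ["Sydney", "Sydney", "Perth", "Perth"],
    "state":  ["NSW", "NSW", "WA", "WA"],
    "date":   pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"]),
    "precip": [5.0, 0.0, 1.2, 3.4],
})

# Change of scale: aggregate days into months and cities into states
monthly_by_state = (
    df.groupby(["state", df["date"].dt.to_period("M")])["precip"]
      .sum()                                   # data reduction: many rows become one per group
      .reset_index(name="monthly_precip")
)
print(monthly_by_state)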



Example: Precipitation in Australia
[Figure: histograms of the standard deviation of average monthly precipitation and of the standard deviation of average yearly precipitation across Australia]



Sampling
• Sampling is the main technique employed for data reduction.
- It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.



Sampling ...
• The key principle for effective sampling is the following:
- Using a sample will work almost as well as using the entire data set, if the sample is representative
- A sample is representative if it has approximately the same properties (of interest) as the original set of data



Types of Sampling
• Simple random sampling
- There is an equal probability of selecting any particular item
• Sampling without replacement
- Once an object is selected, it is removed from the population
• Sampling with replacement
- A selected object is not removed from the population
• Stratified sampling
- Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
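A brief sketch of these sampling schemes with pandas (the data frame and the "label" column are illustrative assumptions): simple random sampling with and without replacement, and stratified sampling that keeps roughly the same class proportions as the full data.

import pandas as pd

df = pd.DataFrame({"x": range(100), "label": ["a"] * 70 + ["b"] * 30})

without_repl = df.sample(n=10, replace=False, random_state=0)  # selected rows leave the pool
with_repl    = df.sample(n=10, replace=True,  random_state=0)  # the same row may be drawn twice

# Stratified sampling: draw ~10% from each 'label' partition
stratified = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=0))
)
print(stratified["label"].value_counts())  # proportions mirror the full data: 7 'a', 3 'b'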



Sample Size

[Figure: the same two-dimensional data set drawn with 8000 points, 2000 points, and 500 points]



Sampling: With or without Replacement



Sampling: Cluster or Stratified Sampling

[Figure: raw data (left) and a cluster/stratified sample (right)]



Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

[Figure: difference between the maximum and minimum distance between any pair of 500 randomly generated points, plotted against the number of dimensions (5 to 50)]
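A small sketch of the experiment described in the figure (my own reconstruction, not the authors' code): generate 500 random points and watch the relative gap between the maximum and minimum pairwise distance shrink as the number of dimensions grows.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 50):
    points = rng.random((500, dim))              # 500 random points in [0,1]^dim
    d = pdist(points)                            # all pairwise Euclidean distances
    print(dim, (d.max() - d.min()) / d.min())    # relative spread shrinks as dim grows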
Dimensionality Reduction
• Purpose:
—Avoid curse of dimensionality
—Reduce amount of time and memory required by data
mining algorithms
—Allow data to be more easily visualized
—May help to eliminate irrelevant features or reduce noise

• Techniques
—Principal Components Analysis (PCA)
—Singular Value Decomposition
—Others: supervised and non-linear techniques
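As one concrete example, PCA is available in scikit-learn; a minimal sketch, assuming the data are already in a numeric array:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 20)          # 200 objects, 20 attributes
pca = PCA(n_components=2)            # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)     # shape (200, 2): easier to visualize and process
print(pca.explained_variance_ratio_)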



Feature Subset Selection
• Another way to reduce the dimensionality of the data
• Redundant features
- Duplicate much or all of the information contained in
one or more other attributes
- Example: purchase price of a product and the amount of
sales tax paid
• Irrelevant features
- Contain no information that is useful for the data
mining task at hand
- Example: students' ID is often irrelevant to the task of
predicting students' GPA
• Many techniques developed, especially for classification
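A rough sketch of two simple filters in this spirit (the data, column names, and correlation-based rule are my own illustrative assumptions, not a method from the slides): a feature that is almost perfectly correlated with another is redundant, and a feature with essentially no relationship to the target is irrelevant.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, 500)
df = pd.DataFrame({
    "price":      price,
    "sales_tax":  price * 0.07,               # redundant: duplicates 'price'
    "student_id": np.arange(500),             # irrelevant to the target
    "gpa":        rng.normal(3.0, 0.5, 500),  # target
})

corr = df.corr().abs()
print(corr.loc["price", "sales_tax"])   # ~1.0 -> redundant, drop one of the two
print(corr.loc["student_id", "gpa"])    # ~0.0 -> irrelevant, drop
reduced = df.drop(columns=["sales_tax", "student_id"])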



Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• Three general methodologies:
- Feature extraction (creation of a new set of features from the
original data)
• Typically domain-specific: feature extraction techniques apply only in particular domains
- Feature construction (one or more new features constructed
out of the original features can be more useful than the original)
• Example: dividing mass by volume to get density
- Mapping data to new space (totally different view of the data
can reveal important and interesting features)
• Example: Fourier and wavelet analysis
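A tiny sketch of feature construction and of mapping data to a new space (the columns and the signal are made up for illustration): density built from mass and volume, and a Fourier transform that exposes a dominant frequency hidden by noise.

import numpy as np
import pandas as pd

# Feature construction: density = mass / volume
df = pd.DataFrame({"mass": [2.0, 10.0, 4.5], "volume": [1.0, 2.0, 1.5]})
df["density"] = df["mass"] / df["volume"]

# Mapping to a new space: a noisy 5 Hz sine looks simple in the frequency domain
t = np.linspace(0, 1, 400, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(400)
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(400, d=t[1] - t[0])
print(freqs[spectrum.argmax()])   # ~5.0, the hidden frequency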



Discretization
• Discretization is the process of converting a continuous attribute into a categorical attribute
- A potentially infinite number of values are mapped into a small number of categories
- Discretization is commonly used in classification
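A short sketch with pandas (the bin counts and labels are illustrative assumptions): equal-width and equal-frequency discretization of a continuous attribute.

import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).normal(50, 15, 1000))

# Equal-width bins: split the value range into 4 intervals of the same width
equal_width = pd.cut(values, bins=4, labels=["low", "medium", "high", "very high"])

# Equal-frequency bins: each category gets roughly the same number of objects
equal_freq = pd.qcut(values, q=4, labels=["q1", "q2", "q3", "q4"])

print(equal_width.value_counts())
print(equal_freq.value_counts())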



Binarization
• Binarization maps a continuous or categorical attribute into one or more binary variables
• Typically used for association analysis
• Often, a continuous attribute is first converted to a categorical attribute, and the categorical attribute is then converted to a set of binary attributes
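A minimal sketch of both steps with pandas (attribute names and bin edges are invented): a continuous attribute is first discretized, and the resulting categorical attribute is then expanded into one binary attribute per category, the form association analysis expects.

import pandas as pd

df = pd.DataFrame({"income": [12_000, 48_000, 95_000, 30_000]})

# Step 1: continuous -> categorical
df["income_level"] = pd.cut(df["income"], bins=[0, 25_000, 60_000, float("inf")],
                            labels=["low", "medium", "high"])

# Step 2: categorical -> set of binary attributes (one column per category)
binary = pd.get_dummies(df["income_level"], prefix="income")
print(binary.astype(int))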



Attribute Transformation
• An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
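Two common examples, sketched in Python (my own illustrations, not from the slides): a simple functional transform such as log, and standardization (z-score). Each maps every old value to exactly one new value.

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

log_x = np.log10(x)                 # simple functional transform: 0, 1, 2, 3
z     = (x - x.mean()) / x.std()    # standardization: mean 0, standard deviation 1
print(log_x, z)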



Similarity and Dissimilarity Measures
• Similarity measure
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0,1]
• Dissimilarity measure
- Numerical measure of how different two data objects
are
- Lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
- Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Attribute Type      Dissimilarity                            Similarity
Nominal             d = 0 if x = y; d = 1 if x != y          s = 1 if x = y; s = 0 if x != y
Ordinal             d = |x - y| / (n - 1)                    s = 1 - d
                    (values mapped to integers 0 to n-1,
                    where n is the number of values)
Interval or Ratio   d = |x - y|                              s = -d,  s = 1/(1 + d),  s = e^(-d),
                                                             s = 1 - (d - min_d)/(max_d - min_d)
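The table can be read as a small dispatch on the attribute type; a sketch in Python (the function name is mine, and for interval/ratio attributes only one of the listed similarity choices is used):

import math

def similarity(x, y, attr_type, n_values=None):
    """Similarity of two values of a single, simple attribute, per the table above."""
    if attr_type == "nominal":
        return 1.0 if x == y else 0.0
    if attr_type == "ordinal":                  # values assumed mapped to 0..n-1
        d = abs(x - y) / (n_values - 1)
        return 1.0 - d
    if attr_type in ("interval", "ratio"):
        d = abs(x - y)
        return 1.0 / (1.0 + d)                  # one of the listed choices for s
    raise ValueError(attr_type)

print(similarity("red", "blue", "nominal"))      # 0.0
print(similarity(1, 3, "ordinal", n_values=5))   # 0.5
print(similarity(2.0, 4.5, "interval"))          # ~0.29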

Data Mining: Exploring Data



What is data exploration?
A preliminary exploration of the data to better understand its characteristics.

• Key motivations of data exploration include
- Helping to select the right tool for preprocessing or analysis
- Making use of humans' abilities to recognize patterns
• People can recognize patterns not captured by data analysis tools



Techniques Used in Data Exploration

• In our discussion of data exploration, we focus on
- Summary statistics
- Visualization
- Online Analytical Processing (OLAP)



Summary Statistics
• Summary statistics are numbers that summarize properties of the data
- Summarized properties include frequency, location, and spread
• Examples: location - mean; spread - standard deviation
- Most summary statistics can be calculated in a single pass through the data



Frequency and Mode
• The frequency of an attribute value is the percentage of times the value occurs in the data set
- For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• Range is the difference between the max and min
• The variance or standard deviation is the most common measure of the spread of a set of points
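These statistics are one-liners in pandas; a quick sketch on a made-up attribute:

import pandas as pd

values = pd.Series([4, 7, 7, 2, 9, 7, 4])

freq = values.value_counts(normalize=True)  # frequency of each value (as a fraction)
mode = values.mode()[0]                     # most frequent value -> 7
rng  = values.max() - values.min()          # range -> 7
var  = values.var()                         # variance (spread)
std  = values.std()                         # standard deviation
print(freq, mode, rng, var, std)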



Visualization
Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data and
the relationships among data items or attributes can be
analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.
—Humans have a well developed ability to analyze large
amounts of information that is presented visually
—Can detect general patterns and trends
—Can detect outliers and unusual patterns



Visualization Techniques: Histograms
• Histogram
- Usually shows the distribution of values of a single variable
- Divide the values into bins and show a bar plot of the number of objects in each bin
- The height of each bar indicates the number of objects
- Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)

[Figure: histograms of Iris petal width with 10 bins and 20 bins]
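A minimal matplotlib sketch of the idea (random data stands in for the Iris petal widths shown in the figure): the same values plotted with 10 bins and with 20 bins.

import numpy as np
import matplotlib.pyplot as plt

petal_width = np.random.default_rng(0).normal(1.2, 0.75, 150)  # stand-in for Iris petal width

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(petal_width, bins=10)   # coarser shape
axes[1].hist(petal_width, bins=20)   # finer, possibly noisier shape
axes[0].set_xlabel("petal width")
axes[1].set_xlabel("petal width")
plt.show()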



Histogram from Weka

[Screenshot: the Weka Explorer Preprocess tab showing the 'weather' relation (14 instances, 5 attributes) and the bar chart for the nominal attribute 'outlook' with the values overcast, rainy, and sunny]



Two-Dimensional Histograms

[Figure: two-dimensional histogram of petal width and petal length]



Visualization Techniques: Box Plots
• Box Plots
- Invented by J. Tukey
- Another way of displaying the distribution of data
- The following figure shows the basic parts of a box plot

[Figure: box plot annotated with outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, and 10th percentile]



Example of Box Plots
• Box plots can be used to compare attributes

[Figure: box plots of sepal length, sepal width, petal length, and petal width]



Visualization Techniques: Scatter Plots
• Scatter plots
- Attribute values determine the position
- Two-dimensional scatter plots are most common, but there can also be three-dimensional scatter plots
- Often, additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
- Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
• See example on the next slide
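A short matplotlib sketch (with synthetic data standing in for the Iris attributes): position comes from two attributes, while marker color and size encode two more.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sepal_length = rng.normal(5.8, 0.8, 150)
petal_length = rng.normal(3.7, 1.7, 150)
species      = rng.integers(0, 3, 150)        # third attribute -> marker color
petal_width  = rng.normal(1.2, 0.7, 150)      # fourth attribute -> marker size

plt.scatter(sepal_length, petal_length,
            c=species, s=20 + 40 * np.abs(petal_width), cmap="viridis")
plt.xlabel("sepal length")
plt.ylabel("petal length")
plt.show()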



Scatter Plot Array of Iris Attributes
[Figure: matrix of pairwise scatter plots of sepal length, sepal width, petal length, and petal width, with Setosa, Versicolour, and Virginica shown with different markers]



Parallel Coordinates Plots for Iris Data

[Figure: parallel coordinates plots of the Iris data (Setosa, Versicolour, Virginica) for two attribute orderings: sepal length, sepal width, petal length, petal width; and sepal width, sepal length, petal length, petal width]



Star Plots for Iris Data

[Figure: star plots of individual Iris flowers from the Setosa, Versicolour, and Virginica classes]



OLAP
• On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
• Relational databases put data into tables, while OLAP uses a multidimensional array representation.
- Such representations of data previously existed in statistics and other fields
• There are a number of data analysis and data exploration operations that are easier with such a data representation.



Example

[Figure: the Iris data represented as a multidimensional array, with petal width discretized into low, medium, and high and the species (Setosa, Versicolour, Virginica) as another dimension]



OLAP Operations: Data Cube
• A data cube is a multidimensional representation of data, together with all possible aggregates.
• By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.

[Figure: a data cube with Date and Product ID among its dimensions]



OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
• Dicing involves selecting a subset of cells by specifying a range of attribute values.
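With a multidimensional array representation, slicing and dicing are just array indexing; a sketch with NumPy (the cube, its dimension order, and its sizes are illustrative assumptions):

import numpy as np

# Hypothetical sales cube: dimensions (product, location, date)
cube = np.arange(3 * 4 * 12).reshape(3, 4, 12)

slice_ = cube[1, :, :]       # slicing: fix one dimension (product id 1)
dice   = cube[:, 0:2, 0:3]   # dicing: a range of locations and a range of dates

rollup_over_dates = cube.sum(axis=2)   # aggregate (roll up) over the date dimension
print(slice_.shape, dice.shape, rollup_over_dates.shape)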



OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical structure.
- Each date is associated with a year, month, and week.
- A location is associated with a continent, country, state (province, etc.), and city.
- Products can be divided into various categories, such as clothing, electronics, and furniture.
• Note that these categories often nest and form a tree or lattice
- A year contains months, which contain days
- A country contains states, which contain cities



Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
- Basic statistical data description: central tendency, dispersion, graphical displays
- Data visualization: map data onto graphical primitives
- Measuring data similarity
• The above steps are the beginning of data preprocessing.
• Many methods have been developed, but this is still an active area of research.



References
• W. Cleveland. Visualizing Data. Hobart Press, 1993
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990
• H. V. Jagadish, et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997
• D. A. Keim. Information Visualization and Visual Data Mining. IEEE Trans. on Visualization and Computer Graphics, 8(1), 2002
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
• S. Santini and R. Jain. "Similarity Measures". IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(9), 1999
• E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed. Graphics Press, 2001
• C. Yu, et al. Visual Data Mining of Multimedia Data for Social and Behavioral Studies. Information Visualization, 8(1), 2009



References
• D. P. Ballou and G. K. Tayi. Enhancing Data Quality in Data Warehouse Environments. Comm. of ACM, 42:73-78, 1999
• A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet Analysis. IEEE Spectrum, Oct. 1996
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
• J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997
• H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. VLDB'01
• M. Hua and J. Pei. Cleaning Disguised Missing Data: A Heuristic Approach. KDD'07
• H. V. Jagadish, et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997
• H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998
• J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
• V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'01
• T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
• R. Wang, V. Storey, and C. Firth. A Framework for Analysis of Data Quality Research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995

