Unit 2 Data Preprocessing for Students.pptx

The document discusses the necessity and process of data mining, emphasizing the explosive growth of data and the need for automated analysis to extract useful knowledge. It outlines the KDD process, various data types, functionalities of data mining, and the challenges faced in traditional data analysis. Additionally, it covers different types of data attributes and their measurement levels, providing insights into data summarization and analysis techniques.

Why Data Mining?

● The Explosive Growth of Data: from terabytes to petabytes


● Data collection and data availability
● Automated data collection tools, database systems, Web,
computerized society
● Major sources of abundant data
● Business: Web, e-commerce, transactions, stocks, …
● Science: Remote sensing, bioinformatics, scientific simulation, …
● Society and everyone: news, digital cameras, YouTube
● We are drowning in data, but starving for knowledge!
● “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data
What Is Data Mining?

● Data mining (knowledge discovery from data)


● Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data.
● Alternative names
● Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
● Watch out: Is everything “data mining”?
● Simple search and query processing
● (Deductive) expert systems

KDD Process: Summary

● Learning the application domain


● relevant prior knowledge and goals of application
● Creating a target data set: data selection
● Data cleaning and preprocessing: (may take 60% of effort!)
● Data reduction and transformation
● Find useful features, dimensionality/variable reduction, invariant representation
● Choosing functions of data mining
● summarization, classification, regression, association, clustering
● Choosing the mining algorithm(s)
● Data mining: search for patterns of interest
● Pattern evaluation and knowledge presentation
● visualization, transformation, removing redundant patterns, etc.
● Use of discovered knowledge

Data Mining: Confluence of Multiple Disciplines

Data mining lies at the confluence of multiple disciplines: database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.
Why Not Traditional Data Analysis?

● Tremendous amount of data


● Algorithms must be highly scalable to handle terabytes of data
● High-dimensionality of data
● Microarray data may have tens of thousands of dimensions
● High complexity of data
● Data streams and sensor data
● Time-series data, temporal data, sequence data
● Structured data, graphs, social networks and multi-linked data
● Heterogeneous databases and legacy databases
● Spatial, spatiotemporal, multimedia, text and Web data

Data Mining: On What Kinds of Data?

● Database-oriented data sets and applications


● Relational database, data warehouse, transactional database
● Advanced data sets and advanced applications
● Data streams and sensor data
● Time-series data, temporal data, sequence data (incl. bio-sequences)
● Structured data, graphs, social networks and multi-linked data
● Object-relational databases
● Heterogeneous databases and legacy databases
● Spatial data and spatiotemporal data
● Multimedia database
● Text databases
● The World-Wide Web
Data Mining Functionalities

● Multidimensional concept description: Characterization and


discrimination
● Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
● Frequent patterns, association, correlation vs. causality
● Tea 🡪 Sugar [0.5%, 75%] (Correlation or causality?)
● Classification and prediction
● Construct models (functions) that describe and distinguish classes
or concepts for future prediction
● E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
● Predict some unknown or missing numerical values
Data Mining Functionalities

● Cluster analysis
● Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
● Maximizing intra-class similarity & minimizing interclass similarity
● Outlier analysis
● Outlier: Data object that does not comply with the general behavior
of the data
● Noise or exception? Useful in fraud detection, rare events analysis
● Trend and evolution analysis
● Trend and deviation: e.g., regression analysis
● Sequential pattern mining: e.g., digital camera 🡪 large SD memory
● Periodicity analysis
● Similarity-based analysis
● Other pattern-directed or statistical analyses
Major Issues in Data Mining

● Mining methodology
● Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
● Performance: efficiency, effectiveness, and scalability
● Pattern evaluation: the interestingness problem
● Incorporation of background knowledge
● Handling noise and incomplete data
● Parallel, distributed and incremental mining methods
● Integration of the discovered knowledge with existing knowledge: knowledge fusion
● User interaction
● Data mining query languages and ad-hoc mining
● Expression and visualization of data mining results
● Interactive mining of knowledge at multiple levels of abstraction
● Applications and social impacts
● Domain-specific data mining & invisible data mining
● Protection of data security, integrity, and privacy
Architecture: Typical Data Mining System

A typical data mining system has the following layers (top to bottom):
● Graphical user interface
● Pattern evaluation (supported by a knowledge base)
● Data mining engine
● Database or data warehouse server
● Data cleaning, integration, and selection
● Data sources: databases, data warehouse, the World-Wide Web, and other information repositories
Getting to know your data
What is Data?
● A collection of data objects and their attributes
● An attribute is a property or characteristic of an object
● Examples: eye color of a person, temperature, etc.
● An attribute is also known as a variable, field, characteristic, or feature
● A collection of attributes describes an object
● An object is also known as a record, point, case, sample, entity, or instance
● In a tabular data set, the rows are the objects and the columns are the attributes
Types of Variables

Qualitative / Categorical (data that are counted)

• Nominal
• Ordinal

Quantitative or Numerical (data that are measured)

• Interval
• Ratio

The methods used to display, summarize, and analyze data depend on
whether the variables are categorical or quantitative.
Types of Variables:
Categorical
Nominal
Variables that are “named”, i.e. classified into one or more
qualitative categories that describe the characteristic of interest

• no ordering of the different categories


• no measure of distance between values
• categories can be listed in any order without affecting
the relationship between them
Nominal variables are the simplest type of variable
Nominal
In medicine, nominal variables are often used to
describe the patient. Examples of nominal variables
might include:

▪ Gender (male, female)

▪ Eye color (blue, brown, green, hazel)

▪ Surgical outcome (dead, alive)

▪ Blood type (A, B, AB, O)


Note: When only two possible categories exist, the variable is
sometimes called dichotomous, binary, or binomial.
● Is color a nominal attribute?
● Yes: it is possible to represent such symbols or “names” with
numbers, but mathematical operations on values of nominal attributes
are not meaningful.
● A binary (two-valued) attribute may be symmetric, when both states
are equally important, or asymmetric, when one state (e.g., a positive
test result) matters more than the other.
Ordinal
Variables that have an inherent order to the relationship
among the different categories

• an implied ordering of the categories (levels)


• quantitative distance between levels is unknown
• distances between the levels may not be the same
• meaning of different levels may not be the same
for different individuals
Ordinal
In medicine, ordinal variables often describe the patient’s
characteristics, attitude, behavior, or status. Examples of
ordinal variables might include:

▪ Stage of cancer (stage I, II, III, IV)

▪ Education level (elementary, secondary, college)


▪ Pain level (mild, moderate, severe)
▪ Satisfaction level (very dissatisfied, dissatisfied, neutral,
satisfied, very satisfied)

▪ Agreement level (strongly disagree, disagree, neutral, agree,
strongly agree)

Types of Variables:
Quantitative/Numerical

Interval
Variables that have constant, equal distances between
values, but the zero point is arbitrary.

Examples of interval variables:

▪ Intelligence (IQ test score of 100, 110, 120, etc.)


▪ Pain level (1-10 scale)
▪ Body length in infant
Ratio
Variables that have equal intervals between values, a meaningful zero
point, and meaningful numerical relationships between values.
Examples of ratio variables:

▪ Weight (50 kilos, 100 kilos, 150 kilos, etc.)


▪ Pulse rate
▪ Respiratory rate
Properties of Attribute Values
● The type of an attribute depends on which of
the following properties it possesses:
● Distinctness: =, ≠
● Order: <, >
● Addition: +, −
● Multiplication: *, /
● Nominal attribute: distinctness
● Ordinal attribute: distinctness & order
● Interval attribute: distinctness, order & addition
● Ratio attribute: all 4 properties
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). | zip codes, employee ID numbers, eye color | mode, entropy, contingency correlation, χ2 test
Ordinal | The values of an ordinal attribute provide enough information to order objects (<, >). | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −). | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful (*, /). | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Levels of Measurement
● Higher level variables can always be expressed at a lower level,
but the reverse is not true.
● For example, Body Mass Index (BMI) is typically measured at an
interval-level such as 23.4.
● BMI can be collapsed into lower-level Ordinal categories
such as:
• >30: Obese
• 25-29.9: Overweight
• <25: Normal or underweight
or Nominal categories such as:
• Overweight
• Not overweight
Tip : measure data at the highest level of measurement possible.
Discrete Data

Quantitative or Numerical variables that are measured


in each individual in a data set, but can only be whole
numbers.
Examples are counts of objects or occurrences:

▪ Number of children in household


▪ Number of relapses
▪ Number of admissions to a hospital
Continuous Data

Quantitative or Numerical variables that are measured in


each individual in a data set.
Continuous variables can theoretically take on an infinite
number of values - the accuracy of the measurement is
limited only by the measuring instrument.

Note: Continuous data often include decimals or fractions of


numbers.
Continuous Data
Examples of continuous variables:
Height, weight, heart rate, blood pressure, serum
cholesterol, age, temperature

A person’s height may be measured and recorded as 60


cm, but in theory the true height could be an infinite
number of values:
height may be 60.123456789…………..cm
or 59.892345678…………..cm
Types of data sets

● Record
● Data Matrix
● Document Data
● Transaction Data

● Graph
● World Wide Web
● Molecular Structures

● Ordered
● Spatial Data
● Temporal Data
● Sequential Data
● Genetic Sequence Data
Record Data
● Data that consists of a collection of records, each of which consists
of a fixed set of attributes
Data Matrix
● If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a
multi-dimensional space, where each dimension represents a
distinct attribute

● Such a data set can be represented by an m-by-n matrix, where


there are m rows, one for each object, and n columns, one for
each attribute
Text Data
● Each document becomes a `term' vector,
● each term is a component (attribute) of the vector,
● the value of each component is the number of times the corresponding
term occurs in the document.
Transaction Data
● A special type of record data, where
● each record (transaction) involves a set of items.
● For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.

TID Items
1 Bread, Coke, Milk
2 Beans, Bread
3 Beans, Coke, Jam, Milk
4 Beans, Bread, Jam, Milk
5 Coke, Jam, Milk
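To make the representation concrete, here is a minimal sketch in plain Python (not from the slides) that stores the transactions above as itemsets and expands them into a binary record matrix, with one 0/1 attribute per distinct item.

```python
# A minimal sketch: the transaction table above as itemsets, then as a
# binary (0/1) record matrix with one column per distinct item.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beans", "Bread"},
    3: {"Beans", "Coke", "Jam", "Milk"},
    4: {"Beans", "Bread", "Jam", "Milk"},
    5: {"Coke", "Jam", "Milk"},
}

items = sorted(set().union(*transactions.values()))      # all distinct items
matrix = {tid: [int(item in basket) for item in items]
          for tid, basket in transactions.items()}

print(items)                      # ['Beans', 'Bread', 'Coke', 'Jam', 'Milk']
for tid, row in matrix.items():
    print(tid, row)               # e.g. 1 [0, 1, 1, 0, 1]
```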
Graph Data

● Examples: Facebook graph and HTML Links


Ordered Data

● Genomic sequence data


Getting to know your data
● What are the various attribute types?
● What kind of values does an attribute have?
● What does the data look like?
● How are the values distributed? (central tendency, dispersion of data)
● Are there ways we can visualize the data to get a better sense of it
all? (visualization techniques)
Descriptive Data
Summarization

● Measures of central tendency


● Measures of dispersion
● Graphical representations
Measures of central tendency

● Mean
● Median
● Mode
● Midrange
Mean

● The arithmetic mean of n values x1, x2, …, xn is
x̄ = (x1 + x2 + … + xn) / n = (Σ xi) / n

● The mean is sensitive to extreme scores when population


samples are small.
● Means are better used with larger sample sizes.
Trimmed Mean

● Extreme values can affect the mean.


● To offset the effect of extremes, chop off extreme low
and/or high values and then calculate the mean.
Weighted Mean

● When each value xi carries a weight wi, the weighted mean is
x̄ = (Σ wi xi) / (Σ wi)
Median
● The median is the point at which half the scores are
above and half the scores are below.
● Medians are less sensitive to extreme scores and are
probably a better indicator generally of where the middle
of the class is achieving, especially for smaller sample
sizes.
● The larger the population sample (number of scores) the
closer mean and median become.
● In fact, in a perfect bell curve, the mean and median are
identical.
Approximate Median

● For grouped (interval) data, the median can be approximated by
interpolating within the interval that contains it:
median ≈ L1 + ((N/2 − (Σ freq)l) / freqmedian) × width
where L1 is the lower boundary of the median interval, N is the total
number of values, (Σ freq)l is the sum of the frequencies of the
intervals below it, freqmedian is its frequency, and width is its width.
Mode

● The value that occurs most frequently in the data set; a data set may
be unimodal, bimodal, multimodal, or have no mode at all.
Midrange

● Average of smallest and largest values in a dataset
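A small sketch of these central-tendency measures using only the Python standard library; the sample values (and the 12.5% trim) are made up for illustration.

```python
from statistics import mean, median, mode

data = [10, 4, 78, 39, 96, 56, 81, 39]

def trimmed_mean(values, proportion):
    """Chop off the lowest and highest `proportion` of values, then average."""
    values = sorted(values)
    k = int(len(values) * proportion)
    return mean(values[k:len(values) - k]) if k else mean(values)

midrange = (min(data) + max(data)) / 2           # average of smallest and largest

print("mean     :", mean(data))                  # 50.375
print("trimmed  :", trimmed_mean(data, 0.125))   # drops 4 and 96 -> 50.5
print("median   :", median(data))                # 47.5
print("mode     :", mode(data))                  # 39
print("midrange :", midrange)                    # 50.0
```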


Exercise
Measures of dispersion

● Range
● Five number summary (based on Quartiles)
● Inter quartile range
● Standard deviation
Range

● Difference between largest and smallest values.


● Eg. 10, 4, 78, 39, 96, 56, 81
largest value = 96
smallest value = 4
Range = 96 – 4 = 92
Standard Deviation

● shows how much variation there is from the average


(mean).
● A low SD indicates that the data points tend to be close to
the mean
● A high SD indicates that the data are spread out over a
large range of values.
Variance

● Average of the squared differences


from the mean
1) Calculate mean
2) For each number, subtract mean
and square the result
3) Work out average of those squared
differences
Standard Deviation

● measure of how spread out the numbers are


● The standard deviation σ (sigma) is the square root of
the variance of X
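A minimal sketch of the three-step variance calculation above, with the standard deviation as the square root of the variance (population form, dividing by N); the values are illustrative.

```python
from math import sqrt

data = [600, 470, 170, 430, 300]

mean = sum(data) / len(data)                        # step 1: the mean (394.0)
squared_diffs = [(x - mean) ** 2 for x in data]     # step 2: squared differences
variance = sum(squared_diffs) / len(data)           # step 3: their average (21704.0)
std_dev = sqrt(variance)                            # sigma, about 147.32

print(mean, variance, round(std_dev, 2))
```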
Quantiles

● The word “quantile” comes from the word quantity.


● In simple terms, a quantile is where a sample is divided
into equal-sized, adjacent, subgroups
● Also called a “fractile“
● It can also refer to dividing a probability distribution into
areas of equal probability.
Quantiles
● The median is a quantile; the median is
placed in a probability distribution so
that exactly half of the data is lower than
the median and half of the data is above
the median. The median cuts a
distribution into two equal areas and so
it is sometimes called 2-quantile.
Quantiles

● To calculate quantiles, use the formula


ith observation = q (n + 1)
where q is the quantile, the proportion below the ith value
that you are looking for, n is the number of items in a data
set
● Find the number in the following set of data where 20 percent of values fall
below it, and 80 percent fall above:
1 3 5 6 9 11 12 13 19 21 22 32 35 36 45 44 55 68 79 80 81 88 90
91 92 100 112 113 114 120 121 132 145 146 149 150 155 180 189
190

Step 1: Order the data from smallest to largest.


Step 2: Count how many observations you have in your data set. (here n=40)
Step 3: Convert any percentage to a decimal for “q”. We are looking for the
number where 20 percent of the values fall below it, so convert that to 0.2
Step 4: Insert your values into the formula:
ith observation = q (n + 1)
ith observation = .2 (40 + 1) = 8.2
Answer: The ith observation is at 8.2, so we round down to 8 (remembering
that this formula is an estimate). The 8th number in the set is 13, which is the
number where 20 percent of the values fall below it.
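A short sketch reproducing the worked example with the q(n + 1) rule; as in the example, the fractional position is simply rounded down, since the formula only gives an estimate.

```python
data = [1, 3, 5, 6, 9, 11, 12, 13, 19, 21, 22, 32, 35, 36, 45, 44, 55, 68,
        79, 80, 81, 88, 90, 91, 92, 100, 112, 113, 114, 120, 121, 132, 145,
        146, 149, 150, 155, 180, 189, 190]

values = sorted(data)            # step 1: order the data
n = len(values)                  # step 2: n = 40
q = 0.2                          # step 3: 20 percent as a decimal
i = q * (n + 1)                  # step 4: ith observation = q(n + 1) = 8.2

print(i)                         # 8.2 -> round down to the 8th observation
print(values[int(i) - 1])        # 13 (the formula counts from 1)
```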
Quantiles

● The kth q-quantile for a given data distribution is the


value x such that at most k/q of the data values are less
than x and at most (q − k)/q of the data values are more
than x, where k is an integer such that 0 < k < q.
There are q − 1 q-quantiles.
Quartiles

● Quartiles are also quantiles;


they divide the distribution
into four equal parts
● Each part represents
one-fourth of the data
distribution
First Quartile(Q1) = ((n + 1)/4)th Term
Second Quartile(Q2) = ((n + 1)/2)th Term
Third Quartile(Q3) = (3(n + 1)/4)th Term
20, 19, 21, 22, 23, 24, 25, 27, 26

● Arranging the values in ascending order: 19, 20, 21,


22, 23, 24, 25, 26, 27 (n=9)
● Median(Q2) = 5th Term = 23
● Lower Quartile (Q1) = Mean of 2nd and 3rd term = (20 +
21)/2 = 20.5
● Upper Quartile(Q3) = Mean of 7th and 8th term = (25 +
26)/2 = 25.5
Percentiles

● Percentiles are quantiles that divide a distribution into


100 equal parts.
Quantiles, Quartiles, and Percentiles

● The quartiles give an indication of a distribution’s center,


spread, and shape.
● The first quartile, denoted by Q1, is the 25th percentile. It
cuts off the lowest 25% of the data.
● The third quartile, denoted by Q3, is the 75th percentile. It
cuts off the lowest 75% (or highest 25%) of the data.
● The second quartile is the 50th percentile. As the median, it
gives the center of the data distribution.
Inter Quartile Range(IQR)

● The distance between the first and third quartiles is a


simple measure of spread that gives the range covered
by the middle half of the data. This distance is called the
interquartile range (IQR)

IQR = Q3 – Q1
Quartiles and IQR

● 78, 80, 80, 81, 82, 83, 85, 85, 86, 87 (n=10)

Median= Q2 = M = (82+83)/2 = 82.5


Q1 = Median of the lower half, i.e. 78 80 80 81 82 = 80
Q3 = Median of the upper half, i.e. 83 85 85 86 87 = 85
Therefore, IQR = Q3 – Q1 = 85 – 80 = 5
Quartiles and IQR
1, 5, 7, 9, 11, 15, 22, 25, 47 (n=9)

Median = Q2 = M = 11
Q1 = Median of the lower half, i.e. 1, 5, 7, 9 = (5+7)/2 = 6
Q3 = Median of the upper half, i.e. 15, 22, 25, 47 = (22+25)/2 = 23.5
Therefore, IQR = Q3 – Q1 = 23.5 – 6 = 17.5
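A small sketch of the median-of-halves rule used in the two examples above (when n is odd, the median itself is left out of both halves).

```python
def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

def quartiles(values):
    values = sorted(values)
    n = len(values)
    lower, upper = values[: n // 2], values[(n + 1) // 2:]
    return median(lower), median(values), median(upper)

q1, q2, q3 = quartiles([1, 5, 7, 9, 11, 15, 22, 25, 47])
print(q1, q2, q3, "IQR =", q3 - q1)     # 6 11 23.5 IQR = 17.5

q1, q2, q3 = quartiles([78, 80, 80, 81, 82, 83, 85, 85, 86, 87])
print(q1, q2, q3, "IQR =", q3 - q1)     # 80 82.5 85 IQR = 5
```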
Exercise

Find Q1, Q2, Q3, and IQR

17, 21, 22, 22, 26, 30, 38, 59, 67, 85


Five number summary

median (Q2)
the quartiles Q1 and Q3
the smallest and largest individual observations

written in the order of Minimum, Q1, Median, Q3, Maximum.

● gives a fuller summary of the shape of a distribution


Exercise

Give 5 point summary of


78, 80, 80, 81, 82, 83, 85, 85, 86, 87 (n=10)

Minimum = 78
Q1 = 80
Q2 = 82.5
Q3 = 85
Maximum = 87
Graphic Displays

● Boxplot
● Histograms
● Quantile plots
● Quantile-quantile plots (QQ plots)
● Scatter plots (XY plots)
Boxplot
Mild and Extreme Outliers

● Boxplots are also used to plot potential outliers


● Inner fences : below Q1 – 1.5 * IQR
above Q3 + 1.5 * IQR
● Outer fences : below Q1 – 3 * IQR
above Q3 + 3 * IQR
A point beyond an inner fence is considered a mild outlier.
A point beyond an outer fence is considered an extreme outlier.
Box-and-whiskers Plot

In a boxplot, the whiskers are extended to the low (min)


and high (max) observations only if these values are less
than 1.5 IQR beyond the quartiles. Otherwise, the whiskers
terminate at the most extreme observations occurring
within 1.5 IQR of the quartiles. The remaining cases are
plotted individually.
Example

10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7,
14.7, 14.9, 15.1, 15.9, 16.4 (n=15)

Median = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 – Q1 = 0.5
Example contd…

Inner fence :
Q1 – 1.5*IQR Q3+1.5*IQR
14.4 – 1.5*0.5 14.9 + 1.5 * 0.5
13.65 15.65

Mild outliers : 10.2, 15.9, 16.4


Example contd..

Outer Fence
Q1 – 3 * IQR Q3 + 3 * IQR

12.9 16.4

Extreme Outliers : 10.2
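A sketch that recomputes the fences for the example above; mild outliers fall beyond the inner fences and extreme outliers beyond the outer fences (the median-of-halves quartile rule from the earlier sketch is repeated here so the snippet is self-contained).

```python
def median(values):
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

data = sorted([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7,
               14.7, 14.9, 15.1, 15.9, 16.4])

q1 = median(data[: len(data) // 2])            # 14.4
q3 = median(data[(len(data) + 1) // 2:])       # 14.9
iqr = q3 - q1                                  # 0.5

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)       # ~ (13.65, 15.65)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)       # ~ (12.9, 16.4)

mild = [x for x in data if x < inner[0] or x > inner[1]]      # [10.2, 15.9, 16.4]
extreme = [x for x in data if x < outer[0] or x > outer[1]]   # [10.2]
print(mild, extreme)
```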


Try It !!

20.77 22.56 22.71 22.69 26.39 27.08 27.32 27.33


27.57 27.81 28.69 29.36 30.25 31.89 32.88 33.23
33.28 33.40 33.52 33.83 33.95 34.82

median = 29.025
Q1 = 27.08
Q3 = 33.28
The interquartile range is 6.2
Inner fence : 27.08 – 1.5(6.2) = 17.78 and 33.28 + 1.5(6.2) = 42.58
Outer fence : 27.08 – 3(6.2) = 8.48 and 33.28 + 3(6.2) = 51.88
This boxplot is clearly not symmetrical.
However, the pattern of its skewness is not straightforward.
The box, corresponding to the middle 50% of the data, appears
to be right-skew, because the line marking the median is
towards the left of the box (so that the right section of the box
is longer than the left).
However, the longer whisker is on the left, indicating a longer
tail towards smaller values, which in turn suggests that the data
are left-skew.
The following example relates to birth weights of infants
exhibiting severe idiopathic respiratory distress syndrome
(SIRDS), and the question ‘Is it possible to relate the
chances of eventual survival to birth weight?’

1.050* 2.500* 1.890* 1.760 2.830


1.175* 1.030* 1.940* 1.930 1.410
1.230* 1.100* 2.200* 2.015 1.715
1.310* 1.185* 2.270* 2.090 1.720
1.500* 1.225* 2.440* 2.600 2.040
1.600* 1.262* 2.560* 2.700 2.200
1.720* 1.295* 2.730* 2.950 2.400
1.750* 1.300* 1.130 3.160 2.550
1.770* 1.550* 1.575 3.400 2.570
2.275* 1.820* 1.680 3.640 3.005
*child died
The results in this case would show that the mean birth
weight of the infants who survived is considerably higher
than the mean birth weight of the infants who died, and
that the standard deviation of the birth weights of the
infants who survived is also higher.
For the birth weights (in kg) of the infants who survived,
the lower quartile, median and upper quartile are,
respectively, 1.72, 2.20 and 2.83. For the infants who died,
the corresponding quartiles are 1.23, 1.60 and 2.20.
● Comparison of location: the median birth weight of infants who
survived is greater than that of those who died.
● Comparison of dispersion: The interquartile ranges are reasonably
similar (as shown by the lengths of the boxes), though the overall
range of the data set is greater for the surviving infants (as shown
by the distances between the ends of the two whiskers for each
boxplot).
● Comparison of skewness: Though both batches of data appear to
be right-skew, and the batch for the infants who died is slightly
more skewed than that for those who survived, the skewness is
not particularly marked in either case.
● Comparison of potential outliers: Neither data set shows any
suspiciously far out values which might require a closer look.
● General conclusions: Overall, the two batches of data look as if
they were generally distributed in a similar way, but with one batch
located to the right (larger location) of the other. You can see
immediately that the median birth weight of infants who died is
less than the lower quartile of the birth weights of infants who
survived (that is, over three-quarters of the survivors were heavier
than the median birth weight of those who died). So it looks as if
we can safely say that survival is related to birth weight.
Guidelines for comparing boxplots

● Compare the respective medians, to compare location.


● Compare the interquartile ranges (that is, the box lengths), to
compare dispersion.
● Look at the overall spread as shown by the adjacent values.
(This is another aspect of dispersion.)
● Look for signs of skewness. If the data do not appear to be
symmetric, does each batch show the same kind of
asymmetry?
● Look for potential outliers.
In a study of memory recall times, a series of stimulus
words was shown to a subject on a computer screen. For
each word, the subject was instructed to recall either a
pleasant or an unpleasant memory associated with that
word. Successful recall of a memory was indicated by the
subject pressing a bar on the computer keyboard. Table
below shows the recall times (in seconds) for twenty
pleasant and twenty unpleasant memories.
Scatter Plots
A scatter plot reveals relationships or association between
two variables. Such relationships manifest themselves by
any non-random structure in the plot.
A scatter plot is a plot of the values of Y versus the
corresponding values of X:
● Vertical axis: variable Y--usually the response variable
● Horizontal axis: variable X--usually some variable we
suspect may be related to the response
Scatter plots can provide answers to the following
questions:
● Are variables X and Y related?
● Are variables X and Y linearly related?
● Are variables X and Y non-linearly related?
● Does the variation in Y change depending on X?
● Are there outliers?
Typical patterns: no relation, strong positive relation, strong negative relation, and an exact linear relation.


Histograms
● A histogram gives distribution of numerical data.
● It is an estimate of the probability distribution of a continuous variable (quantitative
variable)
● It differs from a bar graph, in the sense that a bar graph relates two variables,
but a histogram relates only one.
● To construct a histogram, the first step is to "bin" (or "bucket") the range of
values—that is, divide the entire range of values into a series of intervals—and
then count how many values fall into each interval.
Choosing the correct bin
width
● There is no right or wrong answer as to how wide a bin
should be, but there are rules of thumb. You need to make
sure that the bins are not too small or too large.
Simple Discretization Methods: Binning
● Equal-width (distance) partitioning:
● It divides the range into N intervals/bins of equal size: uniform grid
● if A and B are the lowest and highest values of the attribute, the width of intervals will
be: W = (B -A)/N, where N is number of bins
● The most straightforward
● But outliers may dominate presentation
● Skewed data is not handled well.
● Equal-depth (frequency) partitioning:
● It divides the range into N intervals, each containing approximately same number of
samples
● Good data scaling
● Managing categorical attributes can be tricky.
Binning
● Attribute values (for one attribute e.g., age):
● 0, 4, 12, 16, 16, 18, 24, 26, 28
● Equi-width binning – for N = 3 (width = (28 − 0)/3 ≈ 9.3, rounded to 10)
● Bin 1: 0, 4 [-,10) bin
● Bin 2: 12, 16, 16, 18 [10,20) bin
● Bin 3: 24, 26, 28 [20,+) bin
● “–” denotes negative infinity, “+” positive infinity
● Equi-frequency binning – for bin density of e.g., 3:
● Bin 1: 0, 4, 12 [-, 14) bin
● Bin 2: 16, 16, 18 [14, 21) bin
● Bin 3: 24, 26, 28 [21,+] bin
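A compact sketch of both partitioning schemes applied to the age values above; the exact equal-width bin size is (28 − 0)/3 ≈ 9.3, which gives the same three bins as the rounded width of 10.

```python
values = sorted([0, 4, 12, 16, 16, 18, 24, 26, 28])

def equal_width_bins(data, n_bins):
    """Equal-width: N intervals of size (max - min) / N."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x in data:
        idx = min(int((x - lo) / width), n_bins - 1)   # clamp the maximum value
        bins[idx].append(x)
    return bins

def equal_frequency_bins(data, depth):
    """Equal-depth: consecutive chunks holding ~`depth` values each."""
    return [data[i:i + depth] for i in range(0, len(data), depth)]

print(equal_width_bins(values, 3))       # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(values, 3))   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```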
Quantile plot (q-plot)

● A quantile plot is a simple and effective way to


have a first look at a univariate data distribution.
1) It displays all of the data for the given attribute
(allowing the user to assess both the overall behavior and
unusual occurrences).
2) It plots quantile information
● Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest for some ordinal or numeric attribute X.
● Each observation, xi, is paired with a percentage, fi, which indicates that approximately
fi × 100% of the data are below the value xi. We say “approximately” because there may
not be a value with exactly a fraction, fi, of the data below xi.
fi = 0.25 corresponds to quartile Q1,
fi = 0.50 to the median, and
fi = 0.75 to Q3.
Let fi = (i - 0.5) / N
● These numbers increase in equal steps of 1/N, ranging from 1/2N (which is slightly above
0) to 1- 1/2N (which is slightly below 1).
● On a quantile plot, xi is graphed against fi .
● This allows us to compare different distributions based on their quantiles.
Example

3.7, 2.7, 3.3, 1.3, 2.2, 3.1

Sort the values : 1.3, 2.2, 2.7, 3.1, 3.3, 3.7


Pair each obs xi with fi
xi 1.3 2.2 2.7 3.1 3.3 3.7
fi 0.08 0.25 0.42 0.58 0.75 0.92

Plot xi against fi.


When the quantiles of two data sets are plotted against each other, points lying on or
close to a straight line at 45 degrees to the x-axis indicate that the two samples have
similar distributions.
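A tiny sketch that reproduces the (xi, fi) pairs above using fi = (i − 0.5)/N; plotting xi against fi gives the quantile plot.

```python
data = [3.7, 2.7, 3.3, 1.3, 2.2, 3.1]

xs = sorted(data)
N = len(xs)
fs = [(i - 0.5) / N for i in range(1, N + 1)]

for x, f in zip(xs, fs):
    print(f"x = {x:.1f}  f = {f:.2f}")
# x = 1.3 f = 0.08, x = 2.2 f = 0.25, ..., x = 3.7 f = 0.92
```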
Right-skewed data
Left-skewed data
Quantile-Quantile plot (q-q plot)
Quantiles for Data Set 1 Versus Quantiles of Data Set 2
The q-q plot is formed by:
● Vertical axis: Estimated quantiles from data set 1
● Horizontal axis: Estimated quantiles from data set 2
● Both axes are in units of their respective data sets. That is, the actual quantile level
is not plotted.
● For a given point on the q-q plot, we know that the quantile level is the same for
both points, but not what that quantile level actually is.
● If the data sets have the same size, the q-q plot is essentially a plot of sorted data
set 1 against sorted data set 2. If the data sets are not of equal size, the quantiles
are usually picked to correspond to the sorted values from the smaller data set and
then the quantiles for the larger data set are interpolated.
Use of Q-Q plots

The q-q plot is used to answer the following questions:


● Do two data sets come from populations with a common
distribution?
● Do two data sets have common location and scale?
● Do two data sets have similar distributional shapes?
● Do two data sets have similar tail behavior?
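One way to form the quantile pairs described above, assuming NumPy is available; the two small data sets here are made up for illustration. The quantile levels come from the smaller set, and the larger set's quantiles are interpolated at those same levels.

```python
import numpy as np

set1 = np.array([10.2, 14.1, 14.4, 14.5, 14.7, 14.9, 15.1, 15.9, 16.4])
set2 = np.array([12.0, 13.5, 14.0, 14.2, 14.4, 14.6, 14.8, 15.0,
                 15.3, 15.7, 16.0, 16.8])

small = np.sort(set1)
n = len(small)
levels = (np.arange(1, n + 1) - 0.5) / n        # f_i = (i - 0.5) / n
large_q = np.quantile(set2, levels)             # interpolated quantiles of set2

for f, x, y in zip(levels, small, large_q):
    print(f"f={f:.2f}  set1={x:.2f}  set2={y:.2f}")
# Plotting the set1 quantiles against the set2 quantiles, together with a
# 45-degree reference line, gives the q-q plot.
```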

ETL Process
Data Preprocessing

● Why preprocess the data?


● Data cleaning
● Data integration and transformation
● Data reduction
● Discretization and concept hierarchy generation
Why Is Data Preprocessing Important?

● No quality data, no quality mining results!


● Quality decisions must be based on quality data
● e.g., duplicate or missing data may cause incorrect or even misleading statistics.
● Data warehouse needs consistent integration of quality data
● Data extraction, cleaning, and transformation comprises the majority of
the work of building a data warehouse
Data Quality - The Reality
● It is tempting to think that creating a data warehouse simply means
extracting operational data and entering it into the data warehouse
● Nothing could be further from the truth
● Warehouse data comes from disparate questionable
sources

Data Quality - The Reality

● Legacy systems no longer documented


● Outside sources with questionable quality procedures
● Production systems with no built-in integrity checks and no
integration
● Operational systems are usually designed to solve a specific business
problem and are rarely developed to a corporate plan
● “And get it done quickly, we do not have time to worry about corporate standards...”

Data Extraction

● Which files and tables are to be accessed in the source


database?
● Which fields are to be extracted from them? This is often done
internally by SQL Select statement.
● What are those to be called in the resulting database?
● What is the target machine and database format of the output?
● On what schedule should the extraction process be repeated?

Major Tasks in Data Preprocessing

● Data cleaning
● Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
● Data integration
● Integration of multiple databases, data cubes, or files
● Data transformation
● Normalization and aggregation
● Data reduction
● Obtains reduced representation in volume but produces the
same or similar analytical results
● Data discretization
● Part of data reduction but with particular importance, especially
for numerical data, concept hierarchy generation
Data Cleaning
● Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or
computer error, transmission error
● incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
● e.g., Occupation=“ ” (missing data)
● noisy: containing noise, errors, or outliers
● e.g., Salary=“−10” (an error)
● inconsistent: containing discrepancies in codes or names, e.g.,
● Age=“42”, Birthday=“03/07/2010”
● Was rating “1, 2, 3”, now rating “A, B, C”
● discrepancy between duplicate records
● Intentional (e.g., disguised missing data)
● Jan. 1 as everyone’s birthday?
Incomplete (missing) Data
● Data is not always available
● E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
● Missing data may be due to
● equipment malfunction
● inconsistent with other recorded data and thus deleted
● data not entered due to misunderstanding
● certain data may not be considered important at the time of entry
● Information is not collected (e.g., people decline to give their age and weight)
● Attributes may not be applicable to all cases (e.g., annual income is not applicable
to children)
● not register history or changes of the data
● Missing data may need to be inferred
How to Handle Missing Data?

● Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
● Fill in the missing value manually: tedious + infeasible?
● Fill in it automatically with
● a global constant : e.g., “unknown”, a new class?!
● the attribute mean
● the attribute mean for all samples belonging to the same class: smarter
● the most probable value: inference-based such as Bayesian formula or
decision tree
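A minimal sketch of the automatic fill-in options listed above, assuming pandas is available; the tiny table and its "class"/"income" columns are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# a global constant
by_constant = df["income"].fillna(-1)
# the attribute mean
by_mean = df["income"].fillna(df["income"].mean())
# the attribute mean for all samples belonging to the same class (smarter)
by_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(by_constant.tolist())     # [30.0, -1.0, 50.0, -1.0, 70.0]
print(by_mean.tolist())         # [30.0, 50.0, 50.0, 50.0, 70.0]
print(by_class_mean.tolist())   # [30.0, 30.0, 50.0, 60.0, 70.0]
```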
Noisy Data

● Noise: random error or variance in a measured variable


● Incorrect attribute values may be due to
● faulty data collection instruments
● data entry problems
● data transmission problems
● technology limitation
● inconsistency in naming convention
● Other data problems which require data cleaning
● duplicate records
● incomplete data
● inconsistent data
How to Handle Noisy Data?
how can we “smooth” out the data to remove the noise?

● Binning
● first sort data and partition into (equal-frequency) bins
● then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
● Regression
● smooth by fitting the data into regression functions
● Clustering
● detect and remove outliers
● Combined computer and human inspection
● detect suspicious values and check by human (e.g., deal with possible
outliers)
Binning

● Smoothing by bin medians: Each value in the bin is replaced


by the bin median.
● Smoothing by bin means: Each value in the bin is replaced by
the mean value of the bin.
● Smoothing by boundaries: The min and max values of a bin
are identified as the bin boundaries.
Each bin value is replaced by the closest
boundary value.
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29,
34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Kamber Ex. 3.3, pg. 121
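A sketch that reproduces the smoothing example above: equi-depth bins of four values, then smoothing by bin means and by bin boundaries.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]    # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# smoothing by bin means (rounded to whole dollars, as in the example)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: replace each value by the closer of min/max
def closest_boundary(x, lo, hi):
    return lo if x - lo <= hi - x else hi

by_bounds = [[closest_boundary(x, b[0], b[-1]) for x in b] for b in bins]

print(bins)        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```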
Regression

● Data can be smoothed by fitting the data to a regression


function.
● Linear Regression : involves finding the “best” line to fit
two attributes so that one can be used to predict the
other.
Regression

Example of linear regression: the value of ‘age’ (plotted on the x-axis) can be used to
predict the value of ‘salary’ (on the y-axis) with a fitted line such as y = x + 1.
Clustering

● Similar values are organized into groups/clusters.


● Values which lie outside clusters are considered as
outliers.
Cluster Analysis

In a plot of salary versus age, similar values group into clusters; the data is smoothed
by removing the outliers that fall outside the clusters.
Duplicate Data
● Data set may include data objects that are duplicates, or
almost duplicates of one another
● Major issue when merging data from heterogeneous sources

● Examples:
● Same person with multiple email addresses

● Data cleaning
● Process of dealing with duplicate data issues
Data Cleaning as a Process
● Data discrepancy detection
● Use metadata (e.g., domain, range, dependency, distribution)
● Check field overloading
● Check uniqueness rule, consecutive rule and null rule
● Use commercial tools
● Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect
errors and make corrections [1) Integrate.io · 2) Tibco Clarity · 3) DemandTools · 4) RingLead · 5) Melissa
Clean Suite · 6) WinPure]
● Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers) [Oracle Audit Vault and Database Firewall · 2. IBM Guardium
Data Protection · 3. Imperva SecureSphere Database]
● Data migration and integration
● Data migration tools: allow transformations to be specified
● ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
● Integration of the two processes
● Iterative and interactive (e.g., Potter’s Wheel)

Data Integration

● The process of merging data from multiple sources.


● Integrity and reliability of data should be maintained
● Two main approaches : schema integration and instance
integration

Data Integration Across Sources

● Savings: same data, different name
● Loans: different data, same name
● Trust: data found here and nowhere else
● Credit card: different keys, same data
Schema Integration
● Developing a unified representation of semantically similar
information, structured and stored differently in the
individual databases.

Schema Integration

● Schema integration: e.g., A.cust-id ≡ B.cust-#


● Integrate metadata from different sources
Instance Integration
● Entity identification problem:
Is A. Smith, Smith A, Andrew Smith, Andrew S same?

● Detecting and resolving data value conflicts


● For the same real world entity, attribute values from different sources are
different
● Possible reasons: different representations, different scales, e.g., metric
vs. British units

Data Integrity Problems

● Same person, different spellings


● Agarwal, Agrawal, Aggarwal etc...
● Multiple ways to denote company name
● Persistent Systems, PSPL, Persistent Pvt. LTD.
● Use of different names
● mumbai, bombay
● Different account numbers generated by different applications for the same
customer
● Required fields left blank
● Invalid product codes collected at point of sale
● manual entry leads to mistakes
“in case of a problem use 9999999”
Handling Redundancy in Data Integration

● Redundant data occur often when integration of multiple databases


● Object identification: The same attribute or object may have different names
in different databases
● Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
● Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
Handling Redundancy in Data
Integration
● Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

● For nominal data, we can use the χ2 (chi-square) test.


● For numeric attributes, we can use the correlation coefficient and
covariance, both of which assess how one attribute’s values vary from those
of another.
CHI-SQUARE TEST

● The X2 test is used to determine whether an association (or relationship)


between 2 categorical variables in a sample is likely to reflect a real
association between these 2 variables in the population.
● In the case of 2 variables being compared, the test can also be interpreted as
determining if there is a difference between the two variables.
● The sample data is used to calculate a single number (or test statistic), the size
of which reflects the probability (p-value) that the observed association
between the 2 variables has occurred by chance, i.e., due to sampling error.
Χ2 (chi-square) test

For two variables A and B,


suppose A has c distinct values a1,a2,……ac
B has r distinct values b1,b2,……br

The data tuples described by A and B are shown as a


contingency table
Null hypothesis: A and B are independent
Χ2 (chi-square) test

χ² = Σi Σj (oij − eij)² / eij

where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and
eij is the expected frequency of (Ai, Bj), computed from the marginal counts as
eij = count(A = ai) × count(B = bj) / n, with n the number of data tuples.
Χ2 (chi-square) test

● The larger the Χ2 value, the more likely the variables are related
● The cells that contribute the most to the Χ2 value are those whose actual count is
very different from the expected count
● The chi-square statistic tests the hypothesis that A and B are independent, that
is, there is no correlation between them.
● The test is based on a significance level (SL)
● degrees of freedom = (r-1)(c-1).
● If the hypothesis can be rejected, then we say that A and B are
statistically correlated.

Chi-Square Calculation: An Example

                         | Play chess | Not play chess | Sum (row)
Like science fiction     | 250 (90)   | 200 (360)      | 450
Not like science fiction | 50 (210)   | 1000 (840)     | 1050
Sum (col.)               | 300        | 1200           | 1500

● Χ2 (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories):

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93

It shows that like_science_fiction and play_chess are correlated in
the group
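A short sketch that recomputes the statistic for the 2×2 table above from the observed counts alone; the expected counts are derived from the row and column totals.

```python
observed = [
    [250,  200],    # like science fiction:     plays chess / does not
    [ 50, 1000],    # not like science fiction: plays chess / does not
]

row_totals = [sum(row) for row in observed]          # 450, 1050
col_totals = [sum(col) for col in zip(*observed)]    # 300, 1200
n = sum(row_totals)                                  # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n        # expected count e_ij
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))    # 507.94 (matches the ~507.93 hand calculation above,
                         # up to rounding of the intermediate terms)
```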
Correlation Analysis (Nominal Data)

● Correlation does not imply causality


● # of hospitals and # of car-theft in a city are correlated
● Both are causally linked to the third variable: population

Correlation Analysis (Numeric Data)

● Correlation coefficient (also called Pearson’s product moment coefficient)

rA,B = Σi (ai − Ā)(bi − B̄) / (n σA σB) = (Σi (ai bi) − n Ā B̄) / (n σA σB)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are
the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product.
● rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher,
the stronger correlation.
● rA,B = 0: independent;
● rAB < 0: negatively correlated
Correlation Analysis (Numeric Data)
Pearson’s Coefficient
● The Pearson product-moment correlation coefficient (or Pearson
correlation coefficient, for short) is a measure of the strength of a
linear association between two variables and is denoted by r.
● Pearson product-moment correlation attempts to draw a line of best fit
through the data of two variables, and the Pearson correlation
coefficient, r, indicates how far away all these data points are to this
line of best fit (i.e., how well the data points fit this new model/line of
best fit).
● The Pearson correlation coefficient, r, can take a range of
values from +1 to -1.
● r = 0 indicates that there is no association between the two
variables.
● r > 0 indicates a positive association; that is, as the value of
one variable increases, so does the value of the other
variable.
● r < 0 indicates a negative association; that is, as the value
of one variable increases, the value of the other variable
decreases.
● The first step is to draw a scatter plot of the variables to
check for linearity.
● The correlation coefficient should not be calculated if the
relationship is not linear.
● For correlation only purposes, it does not really matter on
which axis the variables are plotted. However, conventionally,
the independent (or explanatory) variable is plotted on the
x-axis (horizontally) and the dependent (or response) variable
is plotted on the y-axis (vertically).
● The nearer the scatter of points is to a straight line, the
higher the strength of association between the variables.
Also, it does not matter what measurement units are used.
Formula

r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] × [n Σy² − (Σy)²]}
Example

SUBJECT AGE X GLUCOSE LEVEL Y


1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Σ 247 486
Example contd..

SUBJECT | AGE X | GLUCOSE LEVEL Y | XY    | X²    | Y²
1       | 43    | 99              | 4257  | 1849  | 9801
2       | 21    | 65              | 1365  | 441   | 4225
3       | 25    | 79              | 1975  | 625   | 6241
4       | 42    | 75              | 3150  | 1764  | 5625
5       | 57    | 87              | 4959  | 3249  | 7569
6       | 59    | 81              | 4779  | 3481  | 6561
Σ       | 247   | 486             | 20485 | 11409 | 40022

✔ Σx = 247
✔ Σy = 486
✔ Σxy = 20,485
✔ Σx² = 11,409
✔ Σy² = 40,022
r = [6(20,485) – (247 × 486)] / √{[6(11,409) – 247²] × [6(40,022) – 486²]}

  = 2868 / 5413.27

  ≈ 0.5298
Note that the strength of the association of the variables
depends on what you measure and sample sizes.
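A small sketch that recomputes r for the age/glucose example with the sum-based formula used above.

```python
from math import sqrt

x = [43, 21, 25, 42, 57, 59]             # age
y = [99, 65, 79, 75, 87, 81]             # glucose level
n = len(x)

sx, sy = sum(x), sum(y)                   # 247, 486
sxy = sum(a * b for a, b in zip(x, y))    # 20485
sxx = sum(a * a for a in x)               # 11409
syy = sum(b * b for b in y)               # 40022

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))                        # 0.5298
```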
Covariance

In mathematics and statistics, covariance is a measure of


the relationship between two random variables.
The metric evaluates how much – to what extent – the
variables change together.
In other words, it is essentially a measure of the variance
between two variables.
However, the metric does not assess the dependency
between variables.
Co-Variance (Numeric Data)

Cov(A, B) = E[(A − Ā)(B − B̄)] = Σi (ai − Ā)(bi − B̄) / n = E(A·B) − Ā·B̄

where Ā and B̄ are the means (expected values) of A and B, and n is the number of tuples.
Positive covariance: If CovA,B > 0, Indicates that two variables tend to move in the same
direction
Negative covariance: If CovA,B < 0 Reveals that two variables tend to move in inverse
directions.
Independence: CovA,B = 0 but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent. Only under
some additional assumptions (e.g., the data follow multivariate normal distributions) does a
covariance of 0 imply independence
To compute the covariance of two assets’ prices:
1) Calculate the mean (average) price for each asset.
2) For each security, find the difference between each value and the mean price.
3) Multiply the corresponding differences and average the products.

In the example, the resulting covariance is positive, indicating that the price of the
ABC Corp. stock and the S&P 500 tend to move in the same direction.
Covariance vs. Correlation

● Covariance and correlation both primarily assess the relationship between


variables.
● Covariance measures the total variation of two random variables from their
expected values. Using covariance, we can only gauge the direction of the
relationship (whether the variables tend to move in tandem or show an inverse
relationship). However, it does not indicate the strength of the relationship, nor
the dependency between the variables.
● On the other hand, correlation measures the strength of the relationship
between variables. Correlation is the scaled measure of covariance. It is
dimensionless. In other words, the correlation coefficient is always a pure value
and not measured in any units.
The relationship between the two concepts can be
expressed using the formula below:

ρ(X, Y) = Cov(X, Y) / (σX σY)

Where:
•ρ(X,Y) – the correlation between the variables X and Y
•Cov(X,Y) – the covariance between the variables X and Y
•σX – the standard deviation of the X-variable
•σY – the standard deviation of the Y-variable
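A brief sketch of this relationship, reusing the age/glucose values from the correlation example; the population form (dividing by n) is used throughout, so ρ matches the Pearson r computed earlier.

```python
from math import sqrt

x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sigma_x = sqrt(sum((a - mx) ** 2 for a in x) / n)
sigma_y = sqrt(sum((b - my) ** 2 for b in y) / n)

rho = cov / (sigma_x * sigma_y)      # rho(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)
print(round(cov, 2))                 # 79.67  (positive: the variables move together)
print(round(rho, 4))                 # 0.5298 (the scaled, dimensionless version)
```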
Sampling
● Sampling is the main technique employed for data selection.
● It is often used for both the preliminary investigation of the data and the final data analysis.

● Statisticians sample because obtaining the entire set of data of interest is too
expensive or time consuming.

● Sampling is used in data mining because processing the entire set of data of interest
is too expensive or time consuming.
Sample Size

The same data set drawn at three sample sizes: 8000 points, 2000 points, and 500 points.
Sampling …

● The key principle for effective sampling is the following:


● using a sample will work almost as well as using the entire
data sets, if the sample is representative
● A sample is representative if it has approximately the same
property (of interest) as the original set of data
Types of Sampling
● Simple Random Sampling
● There is an equal probability of selecting any particular item

● Simple Random Sampling without replacement (SRSWOR)


● As each item is selected, it is removed from the population

● Simple Random Sampling with replacement (SRSWR)


● Objects are not removed from the population as they are selected for the sample.
● In sampling with replacement, the same object can be picked up more than once

● Stratified sampling
● Split the data into several partitions/strata ; then draw random samples from each partition

❑ Cluster sampling: split the data into ‘m’ disjoint clusters, then draw a simple
random sample of clusters.
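A compact sketch of the sampling schemes above using the standard library's random module; the population and the "low"/"high" strata are made up for illustration.

```python
import random

random.seed(42)                                         # for repeatability
population = list(range(1, 101))                        # items 1..100
stratum_of = {x: ("low" if x <= 50 else "high") for x in population}

srswor = random.sample(population, 10)                  # without replacement
srswr = [random.choice(population) for _ in range(10)]  # with replacement (repeats possible)

def stratified_sample(pop, labels, per_stratum):
    groups = {}
    for item in pop:
        groups.setdefault(labels[item], []).append(item)
    return {name: random.sample(members, per_stratum)
            for name, members in groups.items()}

print(srswor)
print(srswr)
print(stratified_sample(population, stratum_of, 5))     # 5 items from each stratum
```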
Data Transformation

● Smoothing: remove noise from data


● Aggregation: summarization, data cube construction
● Generalization: concept hierarchy climbing
● Normalization: scaled to fall within a small, specified
range
● min-max normalization
● z-score normalization
● normalization by decimal scaling
Data Transformation: Normalization

● min-max normalization (maps v from [minA, maxA] to [new_minA, new_maxA]):

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

● z-score normalization:

v' = (v − meanA) / σA

● normalization by decimal scaling:

v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1
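A minimal sketch of the three methods above in plain Python; the sample values and the [0, 1] target range are illustrative.

```python
values = [200, 300, 400, 600, 1000]

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

def z_score(v):
    return (v - mean) / std

max_abs = max(abs(v) for v in values)
j = 0
while max_abs / 10 ** j >= 1:              # smallest j such that max(|v'|) < 1
    j += 1
decimal_scaled = [v / 10 ** j for v in values]

print([round(min_max(v, min(values), max(values)), 3) for v in values])
print([round(z_score(v), 3) for v in values])
print(decimal_scaled)                      # [0.02, 0.03, 0.04, 0.06, 0.1]
```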
