IDS3
Data Preprocessing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• We have added and modified slides to suit the requirements of
the course.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Central Tendency
• Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
• Median (for grouped data, estimated by interpolation within the median interval)
• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula (for moderately skewed unimodal data): mean − mode ≈ 3 × (mean − median)
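The measures of central tendency listed above (weighted arithmetic mean, median, mode) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper name `weighted_mean` and the `scores` and `credits` values are hypothetical and are not taken from the slides.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical data, chosen only for illustration.
scores = [2, 3, 3, 4, 5, 5, 5, 9]
credits = [1, 1, 2, 1, 1, 2, 1, 1]

print("mean:", mean(scores))
print("weighted mean:", weighted_mean(scores, credits))
print("median:", median(scores))
# multimode returns every most-frequent value, so bimodal or
# trimodal data yields a list with two or three entries.
print("mode(s):", multimode(scores))
```

`multimode` is used rather than `mode` so that data with several equally frequent values is reported in full.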
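• Variance (sample: s², population: σ²):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$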
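The two forms of the sample variance s² and the population variance σ² can be checked numerically. The sketch below compares the sum-of-squares "shortcut" forms against Python's `statistics.variance` and `statistics.pvariance`; the function names and the `data` list are hypothetical, introduced only for this illustration.

```python
from statistics import pvariance, variance


def sample_variance_shortcut(xs):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    sum_x = sum(xs)
    sum_sq = sum(x * x for x in xs)
    return (sum_sq - sum_x ** 2 / n) / (n - 1)


def population_variance_shortcut(xs):
    """sigma^2 = sum(x^2) / N - mu^2."""
    n = len(xs)
    if n < 1:
        raise ValueError("need at least one value")
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2


# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance_shortcut(data), variance(data))
print(population_variance_shortcut(data), pvariance(data))
```

The shortcut forms need only one pass over the data, but they can lose precision when the mean is large relative to the spread, which is one reason library routines usually work from deviations about the mean.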
[Figure: symmetric, positively skewed, and negatively skewed distributions]
For univariate data Y_1, Y_2, ..., Y_N, the formula for skewness is:
$$g_1 = \frac{\sum_{i=1}^{N}(Y_i-\bar{Y})^3 / N}{s^3}$$
where $\bar{Y}$ is the mean, s is the standard deviation, and N is the number of data points.
The above formula for skewness is referred to as the Fisher-Pearson coefficient of skewness.
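A minimal sketch of the Fisher-Pearson coefficient of skewness follows. It assumes that s is computed with N (not N − 1) in the denominator, which is the convention the cited NIST handbook page uses for this coefficient; the function name and the sample lists are hypothetical.

```python
import math


def fisher_pearson_skewness(ys):
    """g1 = (sum((y - mean)^3) / N) / s^3, with s using divisor N."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two values")
    mean_y = sum(ys) / n
    # Assumption: population-style standard deviation (divisor N).
    s = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    third_moment = sum((y - mean_y) ** 3 for y in ys) / n
    return third_moment / s ** 3


# Hypothetical samples, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]
left_tailed = [-y for y in right_tailed]

print(fisher_pearson_skewness(symmetric))     # close to 0
print(fisher_pearson_skewness(right_tailed))  # positive
print(fisher_pearson_skewness(left_tailed))   # negative
```

A value near zero suggests a symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.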
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewness and Kurtosis
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
https://www.researchgate.net/figure/Statistical-moments-such-as-a-skewness-b-kurtosis-c-variance-and-d-mean_fig4_353016479
More of Skewness & Kurtosis
https://brownmath.com/stat/shape.htm#Kurtosis
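Variance and standard deviation (sample: s, population: σ)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$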
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the
box is the interquartile range (IQR = Q3 − Q1)
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually
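The five-number summary behind a boxplot can be sketched with the standard library. Quartile conventions differ between textbooks and tools, so the `method="inclusive"` choice below is an assumption, and the helper name `five_number_summary` and the input values are hypothetical.

```python
from statistics import quantiles


def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum) for the data."""
    ordered = sorted(xs)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


# Hypothetical data, chosen only for illustration.
values = [7, 1, 9, 3, 5, 2, 8, 4, 6]
minimum, q1, med, q3, maximum = five_number_summary(values)
print(minimum, q1, med, q3, maximum)
print("IQR:", q3 - q1)  # height of the box
```

Other quartile definitions can give slightly different Q1 and Q3 for small samples, so results may not match a hand calculation that uses a different rule.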
Data Mining
Example
Solution:
Min: 13
Q1: 20
Median: 25
Q3: 35
Max: 70
Any possible outliers here?
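One way to approach the outlier question is the common 1.5 × IQR rule. The slide does not state which threshold it intends, so the multiplier `k=1.5` is an assumption, and `iqr_fences` is a hypothetical helper name. The five numbers are the ones given in the solution above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper outlier thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Five-number summary from the example.
minimum, q1, med, q3, maximum = 13, 20, 25, 35, 70

lower, upper = iqr_fences(q1, q3)
print(lower, upper)      # -2.5 57.5
print(minimum < lower)   # False: nothing falls below the lower fence
print(maximum > upper)   # True: the maximum (70) lies beyond the upper fence
```

Under this rule the maximum value 70 would be flagged as a possible outlier, while the minimum 13 would not.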
Histogram Analysis
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For data x_i sorted in increasing order, f_i indicates that
approximately 100·f_i% of the data are below or equal to the value x_i
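The points of a quantile plot can be computed as follows. The slide does not say how f_i is obtained; the sketch assumes the common convention f_i = (i − 0.5) / n, and the function name `quantile_plot_points` and the input values are hypothetical.

```python
def quantile_plot_points(xs):
    """Return (f_i, x_i) pairs for data sorted in increasing order.

    Assumption: f_i = (i - 0.5) / n for i = 1..n.
    """
    ordered = sorted(xs)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical data, chosen only for illustration.
points = quantile_plot_points([30, 10, 20, 40])
for f, x in points:
    print(f"{100 * f:.1f}% of the data are at or below {x}")
```

Plotting x_i against f_i then shows every observation, which is what lets the reader spot both the overall shape and unusual values.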
Data Mining: Concepts and Techniques
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
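The pairing behind a Q-Q plot can be sketched by computing the same quantiles for both samples and matching them up. The branch prices below are hypothetical values invented for illustration (they are not the data from the slide's figure), and `qq_pairs` is a hypothetical helper; the `method="inclusive"` quantile rule is also an assumption.

```python
from statistics import quantiles


def qq_pairs(sample_a, sample_b, n_quantiles=4):
    """Pair corresponding quantiles of two univariate samples."""
    qa = quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical unit prices, chosen only for illustration.
branch1 = [40, 45, 50, 55, 60, 65, 70, 75, 80]
branch2 = [price + 8 for price in branch1]

pairs = qq_pairs(branch1, branch2)
for q_branch1, q_branch2 in pairs:
    print(q_branch1, q_branch2)

# If every point lies above the line y = x, the second sample is
# shifted upward relative to the first at each quantile.
shifted_up = all(b > a for a, b in pairs)
print(shifted_up)
```

With Branch 1 on the x-axis and Branch 2 on the y-axis, points above the diagonal correspond to Branch 1 prices being lower at that quantile.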
Scatter plot
Positively and Negatively Correlated Data
Uncorrelated Data
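Positive, negative, and absent linear correlation can be quantified with Pearson's correlation coefficient. The slides show these cases only as scatter plots and do not name a measure, so using Pearson's r here is an assumption; `pearson_r` and the sample lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # y increases with x
falling = [10, 8, 6, 4, 2]  # y decreases as x increases
no_trend = [1, 2, 3, 2, 1]  # no linear trend

print(pearson_r(x, rising))    # close to +1
print(pearson_r(x, falling))   # close to -1
print(pearson_r(x, no_trend))  # close to 0
```

A coefficient near zero only rules out a linear relationship; the `no_trend` data still follows a clear non-linear pattern.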
Sampling of Data
– Using a sample will work almost as well as using the entire data set, if the
sample is representative
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
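The difference between the two simple random sampling schemes can be sketched with Python's `random` module. The function names, the `seed` parameter, and the `raw_data` list are hypothetical and serve only to illustrate the idea.

```python
import random


def srswor(population, k, seed=None):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def srswr(population, k, seed=None):
    """Simple random sample WITH replacement: items may repeat."""
    rng = random.Random(seed)
    return rng.choices(population, k=k)


# Hypothetical raw data: 100 record identifiers.
raw_data = list(range(100))

without_replacement = srswor(raw_data, 10, seed=0)
with_replacement = srswr(raw_data, 10, seed=0)

print(without_replacement)
print(with_replacement)
```

Without replacement, the sample size cannot exceed the population size; with replacement it can, because each draw is made from the full population.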
Sample Size