
IDS3

The document provides an introduction to data science with a focus on data preprocessing, covering key statistical concepts such as central tendency, variance, standard deviation, skewness, and kurtosis. It discusses various methods for measuring and visualizing data dispersion, including boxplots, histograms, and scatter plots, as well as the importance of sampling techniques in data analysis. The material is tailored for a course at BITS Pilani and acknowledges contributions from various authors.

Uploaded by

AtindranathGhosh

Introduction to Data Science

Data Preprocessing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• We have added and modified slides to suit the requirements of
the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Statistical Descriptions



Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency

• Mean (algebraic measure) (sample vs. population):

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

Note: n is sample size and N is population size.

• Weighted arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

• Trimmed mean: chopping extreme values
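
The three means on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names, the `proportion` parameter, and the salary figures are assumptions introduced here.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, proportion):
    """Mean after chopping `proportion` of the sorted values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values removed per end
    return fmean(ordered[k:len(ordered) - k])


# Illustrative values only; 110 plays the role of an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
plain = fmean(salaries)                # ordinary arithmetic mean
trimmed = trimmed_mean(salaries, 0.1)  # drops one value from each end
weighted = weighted_mean([10, 20, 30], [1, 1, 2])
```

The trimmed mean is lower than the plain mean here because the single large value no longer pulls the average up.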



Measuring the Central Tendency
• Median:
• Middle value if odd number of values, or
average of the middle two values otherwise

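• Estimated by interpolation (for grouped data):

\[ \text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

where L_1 is the lower boundary of the median interval, n is the number of values, (\sum freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.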
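
The interpolation for grouped data can be sketched as follows. The sketch assumes equal-width intervals for simplicity, and the function name and the frequency table are hypothetical.

```python
def grouped_median(lower_bounds, width, freqs):
    """Interpolated median for grouped data with equal-width intervals.

    lower_bounds[i] is the lower boundary of interval i and freqs[i] is the
    number of values that fall in that interval.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    half = n / 2
    cumulative = 0  # sum of frequencies of all intervals below the current one
    for lower, freq in zip(lower_bounds, freqs):
        if freq > 0 and cumulative + freq >= half:
            # Current interval is the median interval: interpolate inside it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


# Hypothetical frequency table: intervals 0-10, 10-20, 20-30, 30-40.
estimate = grouped_median([0, 10, 20, 30], 10, [5, 10, 20, 5])
```

Here n = 40, so n/2 = 20; the third interval is the median interval, with 15 values below it, giving 20 + (20 - 15) / 20 × 10 = 22.5.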



Measuring the Central Tendency

• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula:

\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]
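
A short sketch of both ideas on this slide: finding the mode(s) and applying the empirical relation. The helper names are assumptions, and the empirical estimate is only a rough guide that is usually stated for moderately skewed, unimodal data.

```python
from statistics import fmean, median, multimode


def describe_modality(values):
    """Return the list of modes and a label based on how many there are."""
    modes = multimode(values)  # all values tied for the highest frequency
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, names.get(len(modes), "multimodal")


def empirical_mode(values):
    """Estimate the mode from mean - mode ≈ 3 * (mean - median)."""
    m = fmean(values)
    return m - 3 * (m - median(values))


modes, label = describe_modality([1, 2, 2, 3, 3])
symmetric_estimate = empirical_mode([1, 2, 3, 4, 5])
```

Note that when every value is distinct, `multimode` returns all of them, so the label is only meaningful for data with repeated values.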



Probability Distribution

Probability distributions help us model and quantify uncertainty and variability in data.
Probability distributions also help us to analyze data and draw conclusions by describing the likelihood of different outcomes or events.
A frequently used probability density function (pdf) is the normal (Gaussian) function.
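
The slide names the normal pdf without writing it out. A minimal sketch, assuming the standard definition f(x) = exp(-(x - μ)² / (2σ²)) / (σ√(2π)), with the cumulative distribution obtained from the error function; the function and parameter names are chosen here for illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a normal variable is <= x, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


# Probability mass within one standard deviation of the mean.
within_one_sigma = normal_cdf(1) - normal_cdf(-1)
```

The last line gives the familiar result that roughly 68% of normally distributed values lie within one standard deviation of the mean.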



Measuring Data Distribution:
Variance and Standard Deviation

• Variance and standard deviation (sample: s, population: σ)


• Variance: (algebraic, scalable computation)
• Q: Can you compute it incrementally and efficiently?

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]

Note: The subtle difference of formulae for sample vs. population
• n : the size of the sample
• N : the size of the population

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]

• Standard deviation s (or σ) is the square root of variance s² (or σ²)
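
One possible answer to the question about incremental computation: the second form of each formula needs only the count, the running sum, and the running sum of squares, so each new value can be folded in with constant work. The class below is a sketch under that reading; its name and methods are assumptions. This form can lose precision when the mean is large relative to the spread, and Welford's online algorithm is a commonly used alternative in that case.

```python
class RunningVariance:
    """Keeps n, sum(x) and sum(x^2) so variance can be updated per value."""

    def __init__(self):
        self.n = 0
        self.total = 0.0     # running sum of x_i
        self.total_sq = 0.0  # running sum of x_i squared

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def sample_variance(self):
        """s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
        if self.n < 2:
            raise ValueError("need at least two values")
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

    def population_variance(self):
        """sigma^2 = sum(x^2) / N - mu^2."""
        if self.n < 1:
            raise ValueError("need at least one value")
        mean = self.total / self.n
        return self.total_sq / self.n - mean ** 2


rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # illustrative values
    rv.add(value)
```

For these eight values the population variance is 4 and the sample variance is 32/7, which shows the effect of dividing by n - 1 instead of N.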



Properties of Normal Distribution Curve

[Figure: normal distribution curve. The width of the curve represents data dispersion (spread); its center represents central tendency.]


Properties of Normal Distribution Curve

[Figure: cumulative distribution function and probability density function of the normal distribution]


Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three panels showing symmetric, positively skewed, and negatively skewed data]

February 26, 2025
Skewness and Kurtosis

Skewness is a measure of symmetry (more precisely, the lack of symmetry).

For univariate data Y_1, Y_2, ..., Y_N, the formula for skewness is:

\[ g_1 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3} \]

where \bar{Y} is the mean, s is the standard deviation, and N is the number of data points. In this formula, s is computed with N in the denominator rather than N - 1.
The above formula for skewness is referred to as the Fisher-Pearson coefficient of skewness.
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewness and Kurtosis

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
• Negative values for the skewness indicate data that are skewed left (long tail to the left)
• Positive values for the skewness indicate data that are skewed right (long tail to the right)
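
The sign behaviour described above can be checked with a small sketch of the Fisher-Pearson coefficient, taken here as the third central moment divided by s³ with s computed using N in the denominator (the definition in the NIST handbook page cited on these slides). The function name and the data are illustrative assumptions.

```python
import math


def skewness(values):
    """Fisher-Pearson coefficient: mean cubed deviation divided by s^3."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation with N (not N - 1) in the denominator.
    s = math.sqrt(sum((y - mean) ** 2 for y in values) / n)
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum((y - mean) ** 3 for y in values) / n / s ** 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail to the right
left_skewed = [-v for v in right_skewed]     # mirror image, tail to the left

g_sym = skewness(symmetric)
g_right = skewness(right_skewed)
g_left = skewness(left_skewed)
```

Mirroring the data flips the sign of the coefficient, which matches the left/right interpretation on the slide.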

Skewness and Kurtosis

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
• Data sets with high kurtosis tend to have heavy tails, or outliers.
• Data sets with low kurtosis tend to have light tails, or lack of outliers.
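
The slides do not give a formula for kurtosis. The sketch below assumes the common moment-based definition (fourth central moment divided by the squared variance, both with N in the denominator), under which a normal distribution has kurtosis 3. The function name and the data sets are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2^2, where m_k is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((y - mean) ** 2 for y in values) / n
    m4 = sum((y - mean) ** 4 for y in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / m2 ** 2


light_tailed = [1, 2, 3, 4, 5]                      # evenly spread values
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]    # mostly central, two extremes

k_light = kurtosis(light_tailed)
k_heavy = kurtosis(heavy_tailed)
```

Values below 3 suggest lighter tails than a normal distribution and values above 3 suggest heavier tails; subtracting 3 gives what is often called excess kurtosis.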

Skewness and Kurtosis

Figure source: https://www.researchgate.net/figure/Statistical-moments-such-as-a-skewness-b-kurtosis-c-variance-and-d-mean_fig4_353016479
More of Skewness & Kurtosis

https://brownmath.com/stat/shape.htm#Kurtosis


Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers
individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]

• Standard deviation s (or σ) is the square root of variance s² (or σ²)

Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum

• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the
box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

Example

Following is an ordered list of observations of a variable. Compute the five-number summary.
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70

Solution:
Min: 13
Q1: 20
Median: 25
Q3: 35
Max: 70
Any possible outliers here?
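
The solution above, and the closing question about outliers, can be checked with a short sketch. Quartile conventions differ between textbooks and libraries; this sketch assumes the "exclusive" method of Python's `statistics.quantiles` and the 1.5 × IQR rule, and the function names are chosen here for illustration.

```python
from statistics import quantiles

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


summary = five_number_summary(data)
outliers = iqr_outliers(data)
```

With this convention the summary matches the solution (13, 20, 25, 35, 70). The IQR is 15, so the outlier thresholds are -2.5 and 57.5, and 70 is the only value flagged.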



Graphic Displays of
Statistical Descriptions



Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: the x-axis shows values, the y-axis represents frequencies

• Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis

• Histogram: Graph display of tabulated frequencies, shown as bars


• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are not
of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
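
The point about area versus height can be made concrete with a small sketch that bins data into intervals of unequal width and scales each bar so that its area equals the proportion of cases in the bin. The function name, bin edges, and data are assumptions for illustration.

```python
def histogram_heights(values, edges):
    """Return (counts, widths, heights) for adjacent bins defined by `edges`.

    Heights are scaled so that height * width equals the proportion of the
    binned values that fall in each bin.
    """
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    counts = [0] * len(widths)
    for v in values:
        for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if lo <= v < hi or (i == len(widths) - 1 and v == hi):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bins")
    heights = [c / (total * w) for c, w in zip(counts, widths)]
    return counts, widths, heights


values = [1, 2, 3, 4, 12, 15, 18, 25, 35, 38]
counts, widths, heights = histogram_heights(values, [0, 10, 20, 40])
areas = [h * w for h, w in zip(heights, widths)]
```

The last two bins hold the same number of values, but the wider bin gets half the height, so both bars have the same area.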
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
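
The slide does not say how fi is computed. A minimal sketch, assuming the common convention fi = (i - 0.5) / n for the i-th smallest of n values; the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the quantile plot; with this convention the fractions run from 0.5/n up to 1 - 0.5/n in equal steps.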

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
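
The pairing of quantiles can be sketched as below. The prices are hypothetical and constructed so that Branch 2 is uniformly more expensive; the function name and the choice of quartiles (n = 4) with the "inclusive" method are assumptions.

```python
from statistics import quantiles


def qq_pairs(sample_a, sample_b, n=4):
    """Pair the same quantiles of two samples (n=4 gives the three quartiles)."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical unit prices; Branch 2 is shifted up by 10 for illustration.
branch1 = [40, 45, 50, 55, 60, 65, 70, 75, 80]
branch2 = [price + 10 for price in branch1]
pairs = qq_pairs(branch1, branch2)
```

Each pair is one point on the q-q plot. Points above the line y = x, as here, indicate that the second distribution is shifted toward higher values.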

Scatter plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negatively correlated
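
The slide shows correlation visually. One common way to quantify it, not named in the slides and introduced here as an assumption, is the Pearson correlation coefficient, whose sign matches the direction of the trend in a scatter plot. The data below are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]            # tends to increase with x
falling = [-y for y in rising]      # tends to decrease with x

r_pos = pearson_r(xs, rising)
r_neg = pearson_r(xs, falling)
```

A value near +1 or -1 indicates a strong linear trend, while a value near 0 corresponds to the uncorrelated case.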

Uncorrelated Data

Sampling of Data



Sampling

• Sampling is the main technique employed for data reduction.
– It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.



Sampling …

• The key principle for effective sampling is the following:

– Using a sample will work almost as well as using the entire data set, if the
sample is representative

– A sample is representative if it has approximately the same properties (of interest) as the original set of data



Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
– Sampling without replacement
• As each item is selected, it is removed from the population
– Sampling with replacement
• Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up
more than once
• Stratified sampling
– Split the data into several partitions; then draw random samples from each
partition
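
The sampling schemes listed above map onto Python's standard `random` module roughly as follows. This is a sketch: the function names, the `fraction` parameter, and the proportional allocation used for the stratified case are assumptions, and other allocation rules are possible.

```python
import random


def srswor(population, k, rng):
    """Simple random sample without replacement: no item appears twice."""
    return rng.sample(population, k)


def srswr(population, k, rng):
    """Simple random sample with replacement: items may repeat."""
    return rng.choices(population, k=k)


def stratified_sample(items, key, fraction, rng):
    """Partition items by key(item), then sample the same fraction per partition."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))
without = srswor(population, 10, rng)
with_repl = srswr(population, 10, rng)

# Hypothetical skewed data: 90 items in group "a", 10 in group "b".
items = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
strat = stratified_sample(items, key=lambda item: item[0], fraction=0.1, rng=rng)
```

Stratified sampling guarantees that the small group "b" is represented, which a simple random sample of the same size does not.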



Sampling: With or without Replacement

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
Sample Size

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points]


Sample Size
What sample size is necessary to get at least one object from each of 10 equal-sized groups?
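
One way to approach this question is to compute the probability that a sample of a given size contains every group. The sketch below assumes each draw is equally likely to come from any of the groups, independently of the others (sampling with replacement, or groups much larger than the sample), and uses inclusion-exclusion; the function names and the 0.95 target are illustrative assumptions.

```python
from math import comb


def prob_all_groups(sample_size, groups=10):
    """P(every group appears at least once) by inclusion-exclusion."""
    return sum(
        (-1) ** k * comb(groups, k) * (1 - k / groups) ** sample_size
        for k in range(groups + 1)
    )


def min_sample_size(target, groups=10):
    """Smallest sample size whose coverage probability reaches `target`."""
    size = groups  # fewer draws than groups can never cover all of them
    while prob_all_groups(size, groups) < target:
        size += 1
    return size


p10 = prob_all_groups(10)
needed = min_sample_size(0.95)
```

With exactly 10 draws, covering all 10 groups is very unlikely (10!/10^10), so the required sample is considerably larger than the number of groups.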



Text Books

T1 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar

T2 Introducing Data Science, by Cielen, Meysman and Ali

T3 Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley

T4 Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber; Morgan Kaufmann Publishers



Thank You
