IDS3

The document provides an introduction to data science with a focus on data preprocessing, covering key statistical concepts such as central tendency, variance, standard deviation, skewness, and kurtosis. It discusses various methods for measuring and visualizing data dispersion, including boxplots, histograms, and scatter plots, as well as the importance of sampling techniques in data analysis. The material is tailored for a course at BITS Pilani and acknowledges contributions from various authors.

Uploaded by

AtindranathGhosh

Introduction to Data Science

Data Preprocessing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• We have added and modified slides to suit the requirements of
the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Statistical Descriptions

Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
– Median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency

• Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$
Note: n is sample size and N is population size.

• Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Trimmed mean: chopping extreme values
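
The three means on this slide can be sketched in a few lines of Python. This is only an illustration: the salary figures and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for the example, not part of the course material.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    xs = sorted(values)
    return mean(xs[k:len(xs) - k])


# Hypothetical salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain = mean(salaries)                       # pulled upward by 110
trimmed = trimmed_mean(salaries, 1)          # drops 30 and 110
weighted = weighted_mean([80, 90], [1, 3])   # second value counts three times
```

Comparing `plain` and `trimmed` shows how a single extreme value shifts the ordinary mean.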

Measuring the Central Tendency
• Median:
• Middle value if odd number of values, or
average of the middle two values otherwise

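• Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
where L1 is the lower boundary of the median interval, n is the number of values, (Σ freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.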
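
A minimal sketch of the grouped-data interpolation, assuming the data arrive as `(lower, upper, freq)` triples sorted by lower boundary. The function name and the age intervals are made up for illustration.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    intervals: list of (lower, upper, freq) tuples sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies of the intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: L1 + ((n/2 - cumulative) / freq) * width
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no intervals given")


# Hypothetical age groups: 5 values in [0, 20), 10 in [20, 40), 5 in [40, 60)
ages = [(0, 20, 5), (20, 40, 10), (40, 60, 5)]
approx_median = grouped_median(ages)
```

The result is an estimate, because the values are assumed to be spread evenly inside the median interval.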

Measuring the Central Tendency

• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula:

$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
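
As a rough illustration, the mode and the empirical relation can be checked with Python's standard `statistics` module. The data values are hypothetical, and the relation is only a rule of thumb for unimodal, moderately skewed data.

```python
from statistics import mean, median, multimode

# Hypothetical, moderately right-skewed data with a single mode
data = [1, 2, 2, 2, 3, 3, 4, 5, 6, 8]

modes = multimode(data)                  # every value tied for most frequent
bimodal = multimode([1, 1, 2, 3, 3])     # two modes: the data are bimodal

# Rearranging mean - mode = 3 * (mean - median) gives an estimate of the mode
estimated_mode = mean(data) - 3 * (mean(data) - median(data))
```

Here the estimate lands near the true mode but does not match it exactly, which is what an empirical formula would suggest.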

Probability Distribution

Probability distributions help us model and quantify uncertainty and variability in data.
They also help us analyze data and draw conclusions by describing the likelihood of different outcomes or events.
A frequently used probability density function (pdf) is the normal (Gaussian) function.
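
For reference, the standard form of the normal density is written out below, using μ for the mean and σ for the standard deviation as elsewhere in these slides. The slide itself does not show the formula, so this is added only as a reminder.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```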

Measures of Data Distribution:
Variance and Standard Deviation

• Variance and standard deviation (sample: s, population: σ)

• Variance: (algebraic, scalable computation)
• Q: Can you compute it incrementally and efficiently?

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

Note: The subtle difference between the formulae for sample vs. population
• n : the size of the sample
• N : the size of the population

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

• Standard deviation s (or σ) is the square root of variance s² (or σ²)
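
One way to approach the question of incremental computation is to keep only three running totals: the count, the sum, and the sum of squares. The sketch below does that; the class name `RunningVariance` and the data are assumptions for illustration. Note that this sum-of-squares form can lose precision when the values are large relative to their spread, and Welford's algorithm is a commonly used, more stable alternative.

```python
from statistics import variance


class RunningVariance:
    """Keeps n, sum(x) and sum(x^2) so the variance can be updated per value."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def sample_variance(self):
        # s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

    def population_variance(self):
        # sigma^2 = sum(x^2) / N - mu^2
        mu = self.total / self.n
        return self.total_sq / self.n - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
rv = RunningVariance()
for x in data:
    rv.add(x)                      # one pass, constant memory
```

Because only the three totals are stored, partial results from separate chunks of data could be combined by adding their totals.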

Properties of Normal Distribution Curve

[Figure: normal distribution curve. The horizontal extent represents data dispersion (spread); the center represents central tendency.]

Properties of Normal Distribution Curve

[Figure: cumulative distribution function and probability density function.]

Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distributions labelled symmetric, positively skewed, and negatively skewed.]

February 26, 2025

Skewness and Kurtosis

Skewness is a measure of symmetry (more precisely, the lack of symmetry).

For univariate data Y1, Y2, ..., YN, the formula for skewness is:

$g_1 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$

where $\bar{Y}$ is the mean, s is the standard deviation, and N is the number of data points.
The above formula for skewness is referred to as the Fisher-Pearson coefficient of skewness.
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewness and Kurtosis

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
• Negative values for the skewness indicate data
that are skewed left (long tail to the left) and
• Positive values for the skewness indicate data
that are skewed right (long tail to the right)
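
The sign behaviour described above can be checked with a small sketch of the Fisher-Pearson coefficient. The function name and the three data sets are illustrative assumptions; the standard deviation is computed with N in the denominator, following the cited NIST page.

```python
from math import sqrt


def skewness(values):
    """Fisher-Pearson coefficient: [sum((y - mean)^3) / N] / s^3.

    s is the standard deviation computed with N (not N - 1) in the denominator.
    """
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((y - m) ** 2 for y in values) / n)
    return sum((y - m) ** 3 for y in values) / n / s ** 3


# Hypothetical data sets
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 3, 3, 20]          # long tail to the right
left_tail = [-20, -3, -3, -3, -2, -2, -1]    # mirror image: long tail to the left

g_sym = skewness(symmetric)
g_right = skewness(right_tail)
g_left = skewness(left_tail)
```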

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewness and Kurtosis

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
• Data sets with high kurtosis tend to have heavy tails, or outliers.
• Data sets with low kurtosis tend to have light tails, or lack of outliers.
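
The slide gives no formula for kurtosis, so the sketch below assumes the usual moment-based definition, the fourth central moment divided by the squared variance, under which a normal distribution scores 3. The function name and both data sets are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: [sum((y - mean)^4) / N] / (variance^2).

    Variance uses N in the denominator. A normal distribution has kurtosis 3.
    """
    n = len(values)
    m = sum(values) / n
    var = sum((y - m) ** 2 for y in values) / n
    return sum((y - m) ** 4 for y in values) / n / var ** 2


light = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # flat, no extreme values
heavy = [5, 5, 5, 5, 5, 5, 5, 5, -40, 50]    # tight cluster plus two outliers

k_light = kurtosis(light)
k_heavy = kurtosis(heavy)
```

Some libraries report excess kurtosis instead, which subtracts 3 so that a normal distribution scores 0.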

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewness and Kurtosis

https://www.researchgate.net/figure/Statistical-moments-such-as-a-skewness-b-kurtosis-c-variance-and-d-mean_fig4_353016479
More of Skewness & Kurtosis

https://brownmath.com/stat/shape.htm#Kurtosis

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

• Standard deviation s (or σ) is the square root of variance s² (or σ²)

Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum

• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the
box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

Example

Following is an ordered list of observations of a variable. Compute the five-number summary.
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70

Solution:
Min: 13
Q1: 20
Median: 25
Q3: 35
Max: 70
Any possible outliers here?
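
The summary and the outlier question can be checked with a short sketch. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data (excluding the overall median when the count is odd), which reproduces the values above. Other quartile conventions can give slightly different numbers. The function name is an assumption.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For an odd count the middle value belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return min(xs), median(lower), median(xs), median(upper), max(xs)


data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

summary = five_number_summary(data)
_, q1, _, q3, _ = summary
iqr = q3 - q1
# Flag values more than 1.5 x IQR below Q1 or above Q3
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

With IQR = 15 the upper fence is 35 + 22.5 = 57.5, so under the 1.5 × IQR rule only the value 70 is flagged.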


Graphic Displays of
Statistical Descriptions

Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: x-axis shows values, y-axis represents frequencies

• Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis

• Histogram: Graph display of tabulated frequencies, shown as bars

• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are not
of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
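
The area-versus-height distinction can be made concrete with bins of unequal width. In the sketch below each bar's height is count / (n × width), so that its area equals the proportion of cases in the bin. The function name, the prices, and the bin edges are assumptions for illustration.

```python
def histogram_density(values, edges):
    """Return (count, height) per bin, with height = count / (n * width).

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    """
    n = len(values)
    last = len(edges) - 2
    result = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        count = sum(1 for v in values
                    if lo <= v < hi or (i == last and v == hi))
        result.append((count, count / (n * (hi - lo))))
    return result


prices = [12, 15, 18, 22, 25, 27, 31, 38, 45, 90]   # hypothetical unit prices
edges = [10, 20, 30, 50, 100]                        # adjacent bins, unequal widths

bins = histogram_density(prices, edges)
counts = [count for count, _ in bins]
areas = [height * (edges[i + 1] - edges[i])
         for i, (_, height) in enumerate(bins)]
```

The second and third bins hold the same number of values, but the third is twice as wide, so its bar is half as tall while its area is the same.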
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi indicates that approximately 100 fi % of the data are below or equal to the value xi
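
The slide does not say how fi is computed. A common convention, assumed here, is fi = (i − 0.5) / n for the i-th smallest of n values. The sketch below returns the (fi, xi) pairs that a quantile plot would draw; the function name and data are illustrative.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


# Hypothetical observations, deliberately unsorted
points = quantile_points([40, 10, 30, 20])
```

Plotting fi on the x-axis against xi on the y-axis gives the quantile plot. Every data point appears, so unusual values remain visible.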

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
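
When both samples have the same size, a q-q plot reduces to pairing the sorted values. The sketch below makes that simplifying assumption (unequal sizes would need interpolated quantiles); the branch prices and the function name are made up.

```python
def qq_pairs(sample_a, sample_b):
    """Pair the i-th smallest value of one sample with the i-th smallest of the other.

    Simplification: both samples must have the same number of values.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical unit prices at two branches
branch1 = [40, 52, 60, 68, 75, 80, 95]
branch2 = [45, 58, 64, 72, 80, 88, 110]

pairs = qq_pairs(branch1, branch2)
# Points above the line y = x mean the Branch 2 quantile is the larger one.
above_diagonal = sum(1 for a, b in pairs if b > a)
```

If every point lies above the diagonal, as in this made-up data, each quantile of Branch 1 is lower than the matching quantile of Branch 2.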

Scatter plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negatively correlated
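
The slides show correlation only visually. One common numeric counterpart, not named in the slides and used here as an assumption, is the Pearson correlation coefficient. The sketch below computes it for two hypothetical series.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) *
                  sum((y - my) ** 2 for y in ys))
    return cov / spread


x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]     # tends to increase with x
falling = [6, 4, 5, 4, 2]    # tends to decrease with x

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
```

A value near +1 or −1 indicates a strong linear trend, and a value near 0 indicates little linear relationship.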

Uncorrelated Data

Sampling of Data

Sampling

• Sampling is the main technique employed for data reduction.
– It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Sampling …

• The key principle for effective sampling is the following:

– Using a sample will work almost as well as using the entire data set, if the
sample is representative

– A sample is representative if it has approximately the same properties (of interest) as the original set of data

Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
– Sampling without replacement
• As each item is selected, it is removed from the population
– Sampling with replacement
• Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up
more than once
• Stratified sampling
– Split the data into several partitions; then draw random samples from each
partition
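
The three schemes can be sketched with Python's standard `random` module. The population, the strata (blocks of 25 consecutive integers), and the helper `stratified_sample` are assumptions made for the example.

```python
import random
from collections import defaultdict

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(100))     # hypothetical population of 100 items

# Simple random sampling without replacement: no item can appear twice
without_replacement = rng.sample(population, 10)

# Simple random sampling with replacement: the same item may be picked again
with_replacement = rng.choices(population, k=10)


def stratified_sample(items, key, per_stratum, rng):
    """Partition items by key(item), then draw a random sample from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Four strata: 0-24, 25-49, 50-74, 75-99; three items drawn from each
stratified = stratified_sample(population, key=lambda v: v // 25,
                               per_stratum=3, rng=rng)
```

Stratified sampling guarantees that every partition is represented, which simple random sampling does not.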

Sampling: With or without Replacement

[Figure: raw data sampled as SRSWOR (simple random sample without replacement) and as SRSWR (simple random sample with replacement).]
Sample Size

[Figure: the same data set drawn with 8000 points, 2000 points, and 500 points.]

Sample Size
What sample size is necessary to get at least one object from each of 10 equal-sized groups?
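
One way to explore this question is to compute the probability that a sample covers all groups. The sketch below assumes objects are drawn independently with replacement (a reasonable approximation when the groups are large), so each draw lands in any group with probability 1/10, and applies inclusion-exclusion. The function names and the 90% target are assumptions.

```python
from math import comb


def prob_all_groups(sample_size, groups=10):
    """P(the sample contains at least one object from every group).

    Assumes equal-sized groups and independent draws with replacement.
    Inclusion-exclusion over the number k of groups that are missed.
    """
    return sum((-1) ** k * comb(groups, k) * (1 - k / groups) ** sample_size
               for k in range(groups + 1))


def min_sample_size(target, groups=10):
    """Smallest sample size whose coverage probability reaches the target."""
    size = groups          # fewer draws than groups can never cover them all
    while prob_all_groups(size, groups) < target:
        size += 1
    return size


p_ten = prob_all_groups(10)          # exactly one object from each group
needed_for_90 = min_sample_size(0.9)
```

With only 10 draws the chance of hitting every group is tiny, so the required sample is several times larger than the number of groups.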

Text Books

T1 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar

T2 Introducing Data Science by Cielen, Meysman and Ali

T3 Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley

T4 Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber; Morgan Kaufmann Publishers

Thank You
