Exploratory Data Analysis: M. Srinath

This document provides an overview of exploratory data analysis (EDA) techniques. EDA involves visually examining datasets without hypotheses to identify patterns and relationships. Common EDA techniques include histograms, box plots, scatter plots, stem-and-leaf plots, and cross tabulations. These techniques allow researchers to summarize variable characteristics, identify outliers, and assess distribution shapes and relationships between variables in an exploratory manner.

Exploratory Data Analysis

M. Srinath
Exploratory Data Analysis
Introduction
• Exploratory data analysis was promoted by John Tukey in 1977 to
encourage statisticians to examine their data sets visually and to
formulate hypotheses that could then be tested

• Exploratory data analysis (EDA) is an approach to analysing data that
summarizes the main characteristics of variables in an easy-to-understand
form, often with visual graphs, without using a statistical model or
having formulated a hypothesis

• EDA techniques are generally graphical. They include scatter plots,
stem-and-leaf plots, box plots, histograms, quantile plots, residual
plots, and mean plots

• EDA methods are generally cross-classified in two ways. First, each
method is either non-graphical or graphical. Second, each method is
either univariate or multivariate (usually just bivariate)

Exploratory Data Analysis
• EDA offers several techniques to comprehend data
• But EDA is more than a library of data analysis techniques
• EDA is an approach to data analysis
• EDA involves inspecting data without any assumptions
– Mostly using information graphics

Exploratory Data Analysis
Univariate non-graphical EDA
• Categorical data
– The only useful univariate non-graphical technique for categorical
variables is some form of tabulation of the frequencies, usually along
with calculation of the fraction (or percent) of data that falls in each
category
• Quantitative data
– Univariate non-graphical EDA generally focuses on measures of central
tendency (mean, median and mode), quartiles, spread (variance, SD and
IQR), skewness and kurtosis
– These descriptive statistics quantitatively describe the main features
of the data
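
The descriptives listed above can be computed with the Python standard library alone. The following is a minimal sketch, not the method used to produce the deck's output: the function name `describe` and the sample values are hypothetical, quartiles use the "inclusive" interpolation rule, and skewness and kurtosis are the plain moment-based forms, so statistical packages that apply small-sample corrections may report slightly different numbers.

```python
import statistics as st

def describe(values):
    """Return common univariate descriptives for a list of numbers."""
    n = len(values)
    mean = st.mean(values)
    # Quartiles: one of several conventions; other software may differ slightly.
    q1, _, q3 = st.quantiles(values, n=4, method="inclusive")
    # Central moments (divide by n, no small-sample correction).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": st.median(values),
        "mode": st.mode(values),
        "variance": st.variance(values),   # sample variance (n - 1)
        "sd": st.stdev(values),            # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,        # > 0 means a longer right tail
        "kurtosis": m4 / m2 ** 2 - 3.0,    # excess kurtosis, 0 for a normal
    }

sample = [2, 4, 4, 5, 7, 9, 11]            # hypothetical data
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:>9}: {value}")
```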

Univariate non-graphical EDA

A typical output of Descriptive Statistics
Variable: Di (Development index)

Statistic                   Value
N Valid                      1201
N Missing                       0
Mean                     0.260333
Median                   0.261697
Mode                     0.214959
Std. Deviation           0.086778
Skewness                 0.086567
Std. Error of Skewness   0.070593
Kurtosis                 -0.88541
Std. Error of Kurtosis    0.14107
Percentile 25            0.186396
Percentile 50            0.261697
Percentile 75            0.330004

• When data has outliers, the median is more robust than the mean
• When the data distribution is skewed, the median is more meaningful
• IQR = 0.330004 - 0.186396 = 0.143608
• IQR is also a robust measure of spread
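
As a quick check of the notes on this slide, the sketch below recomputes the IQR from the reported 25th and 75th percentiles and then illustrates why the median is called robust. The two short lists are hypothetical values chosen only for illustration; they are not part of the Di data.

```python
import statistics as st

# Quartiles of Di as reported in the descriptive statistics output.
q1, q3 = 0.186396, 0.330004
iqr = q3 - q1
print(f"IQR = {iqr:.6f}")

# Hypothetical values: one extreme observation is appended to a small sample.
clean = [10, 12, 13, 15, 16]
with_outlier = clean + [150]

mean_shift = st.mean(with_outlier) - st.mean(clean)
median_shift = st.median(with_outlier) - st.median(clean)
print(f"mean moves by {mean_shift:.1f}, median moves by {median_shift:.1f}")
```

The single extreme value pulls the mean far more than the median, which is the sense in which the median (and likewise the IQR) is robust.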

Univariate graphical EDA - Histogram

• Graphical display of a frequency distribution
– Counts of data falling in various ranges (bins)
– Used for numeric data
• Bin size selection is important
– Too small: may show spurious patterns
– Too large: may hide important patterns
• Several variations are possible
– Plot relative frequencies instead of raw frequencies
– Make the height of each bar equal to ‘relative frequency / bin width’
• The area under the histogram is then 1
• When observations come from a continuous scale, histograms can be
approximated by continuous curves
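
The three bar heights mentioned above (raw count, relative frequency, and relative frequency divided by bin width) can be sketched as follows. This is an illustrative helper, not a plotting routine: the function name `histogram`, the bin width of 10 and the starting edge of 10 are assumptions, and the data are the 20 values from the stem-and-leaf slide of this deck.

```python
def histogram(values, bin_width, start):
    """Bin values into equal-width intervals [left, left + bin_width).

    Assumes start <= min(values). Returns one dict per bin.
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in values:
        counts[int((x - start) // bin_width)] += 1
    n = len(values)
    rows = []
    for i, count in enumerate(counts):
        rel_freq = count / n
        rows.append({
            "left": start + i * bin_width,
            "count": count,                      # raw frequency
            "rel_freq": rel_freq,                # fraction of all observations
            "density": rel_freq / bin_width,     # height that makes total area 1
        })
    return rows

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
bin_width = 10
rows = histogram(data, bin_width, start=10)
for row in rows:
    print(f"[{row['left']}, {row['left'] + bin_width}): "
          f"count={row['count']} rel={row['rel_freq']:.2f} "
          f"density={row['density']:.3f}")

# Sum of (height * width) over all bars when heights are densities.
area = sum(row["density"] * bin_width for row in rows)
```

Re-running with a much smaller or larger `bin_width` is a simple way to see the spurious-pattern and hidden-pattern effects described on the slide.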

Stem and Leaf Plot
• This plot organizes data for easy visual inspection
– Min and max values
– Data distribution
• Unlike descriptive statistics, this plot shows all the data
– No information loss
– Individual values can be inspected
• Structure of the plot
– Stem: the digits in the largest place (e.g. tens place)
– Leaves: the digits in the smallest place (e.g. ones place)
– Leaves are listed to the right of the stem, separated by ‘|’
• Possible to place leaves from another data set to the left of the
stem for comparing two data distributions

Data
29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
17, 24, 27, 32, 34, 15, 42, 21, 28, 37

Stem and Leaf Plot
1|275
2|91534718
3|49247
4|482
5|3
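
The construction described on this slide can be sketched in a few lines. The helper name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative two-digit integers so that the tens digit is the stem and the ones digit is the leaf. Leaves are kept in data order, which reproduces the plot shown on the slide; passing `sort_leaves=True` gives the more common sorted form.

```python
from collections import defaultdict

def stem_and_leaf(values, sort_leaves=False):
    """Return the lines of a stem-and-leaf plot for two-digit integers."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)   # tens digit is the stem, ones digit the leaf
        leaves[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = sorted(leaves[stem]) if sort_leaves else leaves[stem]
        lines.append(f"{stem}|{''.join(str(d) for d in digits)}")
    return lines

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
plot = stem_and_leaf(data)
print("\n".join(plot))
# 1|275
# 2|91534718
# 3|49247
# 4|482
# 5|3
```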

Stem and leaf plot
Di Stem-and-Leaf Plot

Frequency Stem & Leaf

1.00 0. &
10.00 0 . 999&
32.00 1 . 0000001111
66.00 1 . 2222222222223333333333
59.00 1 . 4444444445555555555
104.00 1 . 66666666666666666777777777777777777
81.00 1 . 888888888888899999999999999
76.00 2 . 00000000000000011111111111
82.00 2 . 2222222222222333333333333333
82.00 2 . 444444444444444445555555555
96.00 2 . 66666666666667777777777777777777
91.00 2 . 888888888888888899999999999999
79.00 3 . 00000000000000111111111111
90.00 3 . 222222222222223333333333333333
82.00 3 . 4444444444444555555555555555
67.00 3 . 6666666666667777777777
38.00 3 . 888888889999
33.00 4 . 00000011111
18.00 4 . 222233
9.00 4 . 445
4.00 4 . 6&
1.00 4. &

Stem width: .1000000

Each leaf: 3 case(s)
& denotes fractional leaves.

Box Plot
• A five-number summary plot of the data
– Minimum, maximum
– Median
– 1st and 3rd quartiles
• Often used in conjunction with a histogram in EDA
• Structure of the plot
– Box represents the IQR (the middle 50% of values)
– The horizontal line in the box shows the median
– Vertical lines extend above and below the box
– Ends of the vertical lines, called whiskers, indicate the max and min
values if these fall within 1.5*IQR of the box
– Values farther than 1.5*IQR from the box are shown as outliers
above/below the whiskers
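
The numbers a box plot draws can be computed without any plotting library. The sketch below is one possible convention, not the definition used by any particular package: the name `box_plot_summary` is hypothetical, quartiles use the standard library's "inclusive" rule, and the whiskers end at the most extreme data values that lie within 1.5*IQR of the box. The data are the 20 values from the stem-and-leaf slide, and the extra value 95 is a made-up observation added to show an outlier.

```python
import statistics as st

def box_plot_summary(values):
    """Return quartiles, whisker ends and outliers using the 1.5*IQR rule."""
    q1, median, q3 = st.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest value inside the fences
        "whisker_high": max(inside),   # largest value inside the fences
        "outliers": sorted(x for x in values
                           if x < lower_fence or x > upper_fence),
    }

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
plain = box_plot_summary(data)
with_extreme = box_plot_summary(data + [95])   # 95 is a hypothetical value
print(plain)
print(with_extreme)
```

In the first call every value lies inside the fences, so the whiskers reach the minimum and maximum; in the second, the upper whisker stops at 53 and 95 is reported separately.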

Quantile-Normal plot
• Used to see how well a particular sample follows a particular
theoretical distribution
• Many statistical tests assume that the outcome for any set of values
of the explanatory variables is approximately normally distributed.
That is why QN plots are useful: if the assumption is grossly violated,
the p-values and confidence intervals of those tests are wrong
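
The points of a quantile-normal plot can be sketched as pairs of (theoretical normal quantile, observed value). The function name `normal_quantile_pairs` and the plotting position (i - 0.5)/n are assumptions made here for illustration; other plotting positions are also in common use. If the sample is roughly normal, the pairs fall close to a straight line.

```python
from statistics import NormalDist

def normal_quantile_pairs(values):
    """Pair each sorted observation with a standard normal quantile."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()            # mean 0, standard deviation 1
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n                # assumed plotting position
        pairs.append((std_normal.inv_cdf(p), x))
    return pairs

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
pairs = normal_quantile_pairs(data)
for z, x in pairs:
    print(f"{z:7.3f}  {x}")
```

Plotting the second column against the first gives the QN plot; marked curvature or isolated points far from the line suggest skewness, heavy tails or outliers.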

Scatter Plot
• Scatter plots are two-dimensional graphs with
– Explanatory attribute plotted on the x-axis
– Response attribute plotted on the y-axis
• Useful for understanding the relationship between two attributes
• Features of the relationship
– Strength
– Shape (linear or curved)
– Direction
– Outliers
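
A numeric companion to the scatter plot is the Pearson correlation coefficient, which summarizes the strength and direction of a linear relationship. It is offered here as an illustration and is not named in the slides; the function `pearson_r` and all data values are hypothetical. The third example shows why the plot itself still matters: a clearly curved relationship can have a correlation of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives strength."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                      # hypothetical explanatory values
y_up = [2, 4, 6, 8, 10]                  # increasing, perfectly linear
y_down = [10, 8, 6, 4, 2]                # decreasing, perfectly linear
y_curve = [(v - 3) ** 2 for v in x]      # U-shaped: strong but not linear

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)
print(r_up, r_down, r_curve)
```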

Scatter Plot Matrix
• When multiple attributes need to be visualized all at once
– Scatter plots are drawn for every pair of attributes and arranged
into a 2D matrix
• Useful for spotting relationships among attributes
– Each panel is read in the same way as a single scatter plot
– Attribute names are shown on the diagonal

Cross tabulation
• For categorical data (and quantitative data with only a few
different values) an extension of tabulation called cross-
tabulation is very useful.
• For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one
variable and row headings that match the levels of the other
variable, then filling in the counts of all subjects that share a
pair of levels.
• The two variables might be both explanatory, both outcome, or
one of each. Depending on the goals, row percentages (which add
to 100% for each row), column percentages (which add to 100%
for each column) and/or cell percentages (which add to 100%
over all cells) are also useful.
• Cross-tabulation can be extended to three (and sometimes
more) variables by making separate two-way tables for two
variables at each level of a third variable. Cross-tabulation is
the basic bivariate non-graphical EDA technique.

Cross tabulation
MainOccupation * Castehierarchy Crosstabulation

                                    Castehierarchy
MainOccupation                       SC/ST  Backward    OBC  Upper caste  Total
Labour    Count                        148        33     35           12    228
          % within MainOccupation     64.9      14.5   15.4          5.3  100.0
          % within Castehierarchy     42.2      26.2   15.0          2.4   19.0
          % of Total                  12.3       2.7    2.9          1.0   19.0
Business  Count                         20         6     26           26     78
          % within MainOccupation     25.6       7.7   33.3         33.3  100.0
          % within Castehierarchy      5.7       4.8   11.1          5.3    6.5
          % of Total                   1.7       0.5    2.2          2.2    6.5
Service   Count                         21         5      4           37     67
          % within MainOccupation     31.3       7.5    6.0         55.2  100.0
          % within Castehierarchy      6.0       4.0    1.7          7.6    5.6
          % of Total                   1.7       0.4    0.3          3.1    5.6
Farming   Count                        162        82    169          415    828
          % within MainOccupation     19.6       9.9   20.4         50.1  100.0
          % within Castehierarchy     46.2      65.1   72.2         84.7   68.9
          % of Total                  13.5       6.8   14.1         34.6   68.9
Total     Count                        351       126    234          490   1201
          % within MainOccupation     29.2      10.5   19.5         40.8  100.0
          % within Castehierarchy    100.0     100.0  100.0        100.0  100.0
          % of Total                  29.2      10.5   19.5         40.8  100.0
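
The row, column and cell percentages in this cross-tabulation can be reproduced from the counts alone. The sketch below uses the counts from the table; the function name `crosstab_percentages` and the dictionary layout are illustrative choices, not the output format of any statistics package.

```python
# Counts from the MainOccupation by Castehierarchy cross-tabulation.
counts = {
    "Labour":   {"SC/ST": 148, "Backward": 33, "OBC": 35,  "Upper caste": 12},
    "Business": {"SC/ST": 20,  "Backward": 6,  "OBC": 26,  "Upper caste": 26},
    "Service":  {"SC/ST": 21,  "Backward": 5,  "OBC": 4,   "Upper caste": 37},
    "Farming":  {"SC/ST": 162, "Backward": 82, "OBC": 169, "Upper caste": 415},
}

def crosstab_percentages(counts):
    """Return per-cell percentages plus row, column and grand totals."""
    row_totals = {row: sum(cols.values()) for row, cols in counts.items()}
    col_totals = {}
    for cols in counts.values():
        for col, k in cols.items():
            col_totals[col] = col_totals.get(col, 0) + k
    grand_total = sum(row_totals.values())
    cells = {}
    for row, cols in counts.items():
        for col, k in cols.items():
            cells[(row, col)] = {
                "row_pct": round(100 * k / row_totals[row], 1),   # sums to 100 across a row
                "col_pct": round(100 * k / col_totals[col], 1),   # sums to 100 down a column
                "cell_pct": round(100 * k / grand_total, 1),      # sums to 100 over all cells
            }
    return cells, row_totals, col_totals, grand_total

cells, row_totals, col_totals, grand_total = crosstab_percentages(counts)
print(cells[("Labour", "SC/ST")])
print(cells[("Farming", "Upper caste")])
```

Which percentage to read depends on the question: row percentages describe the caste composition of each occupation, while column percentages describe the occupational mix within each caste group.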

Univariate statistics by category

• For one categorical variable (usually explanatory) and one
quantitative variable (usually outcome), it is common to produce some of
the standard univariate non-graphical statistics for the quantitative
variable separately for each level of the categorical variable, and then
compare the statistics across levels of the categorical variable

Univariate statistics of Di by category
Statecode         Mean      SD  Median     Min     Max  Skewness  Kurtosis
Andhra Pradesh  0.1901  0.0592  0.1787  0.0947  0.3947    0.6399    0.1279
Assam           0.2080  0.0569  0.1970  0.0878  0.3664    0.2354   -0.5341
Haryana         0.2706  0.0684  0.2853  0.1135  0.3997   -0.4030   -0.7584
HP              0.3319  0.0617  0.3353  0.1559  0.4862   -0.1965    0.0755
Karnataka       0.1782  0.0586  0.1716  0.0781  0.4674    1.5890    4.5150
Maharashtra     0.2537  0.0778  0.2434  0.0975  0.4318    0.1728   -0.6878
Punjab          0.3342  0.0676  0.3346  0.1623  0.4694   -0.1837   -0.5428
Uttrakhand      0.3144  0.0552  0.3216  0.1864  0.4416   -0.3060   -0.6048
Total           0.2603  0.0868  0.2617  0.0781  0.4862    0.0866   -0.8854
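
Computing statistics separately for each level of a categorical variable amounts to grouping the observations first and then summarizing each group. The sketch below shows the idea with the standard library; the function name `describe_by_category`, the category labels "A" and "B" and all values are hypothetical, and each group is assumed to have at least two observations so that a standard deviation exists.

```python
from collections import defaultdict
import statistics as st

def describe_by_category(records):
    """Summarize a quantitative variable within each category.

    records: iterable of (category, value) pairs.
    """
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "mean": st.mean(values),
            "sd": st.stdev(values),      # needs at least two values per group
            "median": st.median(values),
            "min": min(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }

# Hypothetical (category, value) records.
records = [("A", 0.20), ("A", 0.25), ("A", 0.30),
           ("B", 0.10), ("B", 0.15), ("B", 0.35)]
by_category = describe_by_category(records)
for category, stats in by_category.items():
    print(category, stats)
```

Comparing the rows side by side, as in the table above, is what turns these per-group summaries into a bivariate exploration.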

Univariate graph by category
Bar plot

Univariate graph by category
Box plot

EDA summary
• All the techniques presented so far are the
tools useful for EDA
• But without an understanding built from the
EDA, effective use of tools is not possible
• EDA helps to answer a lot of questions
– What is a typical value?
– What is the uncertainty of a typical value?
– What is a good distributional fit for the data?
– What are the relationships between two
attributes?
– etc

The obvious is that which is never seen until someone expresses it simply.
- Kahlil Gibran

The greatest value of a picture is when it forces us to notice what we never expected to see.
- John W. Tukey

The best thing about being a statistician is that you get to play in everyone’s backyard.
- John W. Tukey
