Exploratory Data Analysis: M. Srinath
Introduction
• Exploratory data analysis was promoted by John Tukey in 1977 to
encourage statisticians to examine their data sets visually and to
formulate hypotheses that could be tested on data sets
• EDA offers several techniques to comprehend data
• But EDA is more than a library of data analysis techniques
• EDA is an approach to data analysis
• EDA involves inspecting data without any assumptions
– Mostly using information graphics
Univariate non-graphical EDA
Categorical data
The only useful univariate non-graphical technique for categorical
variables is some form of tabulation of the frequencies, usually along
with the fraction (or percent) of data that falls in each category
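The tabulation described above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name `frequency_table` and the sample list are assumptions made for the example and are not taken from the slides' data set.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]

# Hypothetical sample of a categorical variable (illustrative only).
occupations = ["Farming", "Labour", "Farming", "Service",
               "Farming", "Business", "Labour", "Farming"]

for category, count, percent in frequency_table(occupations):
    print(f"{category:<10}{count:>5}{percent:>8.1f}%")
```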
Quantitative data
Univariate non-graphical EDA generally focuses on measures of central
tendency (mean, median and mode), quartiles, spread (variance, standard
deviation and IQR), skewness and kurtosis
These descriptive statistics quantitatively describe the main features
of the data
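One way to compute these descriptive statistics is sketched below with Python's `statistics` module. The function name `describe` and the sample values are hypothetical. Skewness and kurtosis here are the simple moment-based versions, which is an assumption of this sketch; statistical packages often apply small-sample corrections and may report slightly different values.

```python
import statistics

def describe(data):
    """Return basic univariate descriptive statistics as a dict."""
    n = len(data)
    mean = statistics.fmean(data)
    # Quartiles; the "inclusive" method is one of several conventions.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    # Population central moments, used for skewness and kurtosis.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "sd": statistics.stdev(data),           # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }

# Hypothetical sample (illustrative only).
sample = [0.19, 0.21, 0.21, 0.26, 0.30, 0.33, 0.41]
for name, value in describe(sample).items():
    print(f"{name:<16}{value:>10.4f}")
```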
Statistic                      Value
N (valid)                       1201
N (missing)                        0
Mean                        0.260333
Median                      0.261697
Mode                        0.214959
Std. Deviation              0.086778
Skewness                    0.086567
Std. Error of Skewness      0.070593
Kurtosis                    -0.88541
Std. Error of Kurtosis       0.14107
Percentile 25               0.186396
Percentile 50               0.261697
Percentile 75               0.330004

• When data has outliers, the median is more robust
• When the data distribution is skewed, the median is more meaningful
• IQR = 0.143608
• IQR is also a robust measure of spread
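As a small check on the points above, the sketch below recomputes the IQR from the reported 25th and 75th percentiles and shows how a single outlier moves the mean but not the median. The two short lists are hypothetical values chosen for illustration; only the two percentiles come from the summary.

```python
import statistics

# 25th and 75th percentiles as reported in the summary statistics.
q1, q3 = 0.186396, 0.330004
iqr = q3 - q1
print(f"IQR = {iqr:.6f}")

# Hypothetical values: the second list replaces the largest value
# with an extreme outlier.
clean = [0.18, 0.22, 0.26, 0.30, 0.34]
with_outlier = clean[:-1] + [3.40]

print("mean  :", statistics.fmean(clean), "->", statistics.fmean(with_outlier))
print("median:", statistics.median(clean), "->", statistics.median(with_outlier))
```

The mean shifts markedly once the outlier is present, while the median is unchanged, which is the sense in which the median is robust.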
Univariate graphical EDA - Histogram
Stem and leaf plot
Di Stem-and-Leaf Plot
1.00 0. &
10.00 0 . 999&
32.00 1 . 0000001111
66.00 1 . 2222222222223333333333
59.00 1 . 4444444445555555555
104.00 1 . 66666666666666666777777777777777777
81.00 1 . 888888888888899999999999999
76.00 2 . 00000000000000011111111111
82.00 2 . 2222222222222333333333333333
82.00 2 . 444444444444444445555555555
96.00 2 . 66666666666667777777777777777777
91.00 2 . 888888888888888899999999999999
79.00 3 . 00000000000000111111111111
90.00 3 . 222222222222223333333333333333
82.00 3 . 4444444444444555555555555555
67.00 3 . 6666666666667777777777
38.00 3 . 888888889999
33.00 4 . 00000011111
18.00 4 . 222233
9.00 4 . 445
4.00 4 . 6&
1.00 4. &
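A stem-and-leaf display of this kind can be approximated with a few lines of Python. The sketch below assumes non-negative values recorded to two decimals, takes the tenths digit as the stem and the hundredths digit as the leaf, and prints one row per stem. Unlike the output above, it does not split each stem across several rows or let one leaf stand for several cases. The function name and sample are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tenths digit) to its sorted leaves (hundredths digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        # Work in integer hundredths to avoid floating-point digit errors.
        stem, leaf = divmod(round(v * 100), 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical sample (illustrative only).
sample = [0.09, 0.16, 0.17, 0.26, 0.26, 0.31]

for stem, leaves in stem_and_leaf(sample).items():
    # Columns: frequency, stem, leaves.
    print(f"{len(leaves):>3}  {stem} . {''.join(map(str, leaves))}")
```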
Quantile-Normal plot
• Used to see how well a
particular sample follows a
particular theoretical
distribution
• Many statistical tests have
the assumption that the
outcome for any set of
values of the explanatory
variables is approximately
normally distributed, and
that is why QN plots are
useful: if the assumption is
grossly violated, the p-value
and confidence intervals of
those tests are wrong
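The points of a quantile-normal plot can be computed without a plotting library, as sketched below. Each sorted observation is paired with the quantile of a normal distribution fitted to the sample mean and standard deviation. The plotting position (i - 0.5)/n is one common convention and is an assumption here, as is the function name.

```python
from statistics import NormalDist, fmean, stdev

def normal_quantile_pairs(sample):
    """Return (theoretical quantile, observed value) pairs, smallest first."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist(mu=fmean(ordered), sigma=stdev(ordered))
    pairs = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position (assumed convention)
        pairs.append((reference.inv_cdf(p), observed))
    return pairs

# Hypothetical sample (illustrative only).
sample = [0.19, 0.21, 0.24, 0.26, 0.27, 0.30, 0.33]
for theoretical, observed in normal_quantile_pairs(sample):
    print(f"{theoretical:8.4f}  {observed:8.4f}")
```

If the sample is close to normal, the pairs fall near the line where the theoretical and observed values are equal; systematic curvature suggests skewness or heavy tails.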
Scatter Plot
• Scatter plots are two-dimensional graphs with
– explanatory attribute plotted on the x-axis
– response attribute plotted on the y-axis
• Useful for understanding the relationship between two attributes
• Features of the relationship
– strength
– shape (linear or curved)
– direction
– outliers
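The strength and direction of a linear relationship are often summarised numerically with the Pearson correlation coefficient, sketched below. This is offered as a companion to the scatter plot, not a replacement: it says nothing about shape or outliers, which still have to be judged from the plot. The function name and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: the sign gives direction, the magnitude gives strength."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical explanatory (x) and response (y) attributes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"r = {pearson_r(x, y):.3f}")
```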
Scatter Plot Matrix
• When multiple
attributes need to be
visualized all at once
– Scatter plots are drawn
for every pair of
attributes and arranged
into a 2D matrix.
• Useful for spotting
relationships among
attributes
– Similar to a scatter plot
– Attributes are shown on
the diagonal
Cross tabulation
• For categorical data (and quantitative data with only a few
different values) an extension of tabulation called cross-
tabulation is very useful.
• For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one
variable and row headings that match the levels of the other
variable, then filling in the counts of all subjects that share a
pair of levels.
• The two variables might be both explanatory, both outcome, or
one of each. Depending on the goals, row percentages (which add
to 100% for each row), column percentages (which add to 100%
for each column) and/or cell percentages (which add to 100%
over all cells) are also useful.
• Cross-tabulation can be extended to three (and sometimes
more) variables by making separate two-way tables for two
variables at each level of a third variable. Cross-tabulation is
the basic bivariate non-graphical EDA technique.
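A two-way table of counts can be built from paired observations as in the minimal sketch below, which uses only the standard library. The function name `cross_tabulate` and the small data set are hypothetical.

```python
from collections import Counter

def cross_tabulate(row_values, col_values):
    """Return (row levels, column levels, table of counts) for two variables."""
    counts = Counter(zip(row_values, col_values))
    row_levels = sorted(set(row_values))
    col_levels = sorted(set(col_values))
    # Counter returns 0 for pairs of levels that never occur together.
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table

# Hypothetical paired observations (illustrative only).
occupation = ["Farming", "Labour", "Farming", "Service", "Labour", "Farming"]
region = ["North", "South", "South", "North", "South", "North"]

rows, cols, table = cross_tabulate(occupation, region)
print(f"{'':<10}" + "".join(f"{c:>8}" for c in cols))
for level, counts_row in zip(rows, table):
    print(f"{level:<10}" + "".join(f"{n:>8}" for n in counts_row))
```

For a third variable, the same function can be applied separately to the subset of observations at each of its levels.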
MainOccupation * Castehierarchy Crosstabulation

                                    Castehierarchy
MainOccupation                        SC/ST  Backward    OBC  Upper caste   Total
Labour    Count                         148        33     35           12     228
          % within MainOccupation      64.9      14.5   15.4          5.3   100.0
          % within Castehierarchy      42.2      26.2   15.0          2.4    19.0
          % of Total                   12.3       2.7    2.9          1.0    19.0
Business  Count                          20         6     26           26      78
          % within MainOccupation      25.6       7.7   33.3         33.3   100.0
          % within Castehierarchy       5.7       4.8   11.1          5.3     6.5
          % of Total                    1.7       0.5    2.2          2.2     6.5
Service   Count                          21         5      4           37      67
          % within MainOccupation      31.3       7.5    6.0         55.2   100.0
          % within Castehierarchy       6.0       4.0    1.7          7.6     5.6
          % of Total                    1.7       0.4    0.3          3.1     5.6
Farming   Count                         162        82    169          415     828
          % within MainOccupation      19.6       9.9   20.4         50.1   100.0
          % within Castehierarchy      46.2      65.1   72.2         84.7    68.9
          % of Total                   13.5       6.8   14.1         34.6    68.9
Total     Count                         351       126    234          490    1201
          % within MainOccupation      29.2      10.5   19.5         40.8   100.0
          % within Castehierarchy     100.0     100.0  100.0        100.0   100.0
          % of Total                   29.2      10.5   19.5         40.8   100.0
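The three kinds of percentages in this table can be reproduced from the counts alone. The sketch below takes the cell counts from the crosstabulation and computes row percentages (% within MainOccupation), column percentages (% within Castehierarchy) and cell percentages (% of Total). The variable and function names are illustrative choices.

```python
# Cell counts from the crosstabulation; columns are
# SC/ST, Backward, OBC, Upper caste.
counts = {
    "Labour":   [148, 33, 35, 12],
    "Business": [20, 6, 26, 26],
    "Service":  [21, 5, 4, 37],
    "Farming":  [162, 82, 169, 415],
}

def percentages(counts):
    """Return (row %, column %, cell %) dicts with the same layout as counts."""
    col_totals = [sum(col) for col in zip(*counts.values())]
    grand_total = sum(col_totals)
    row_pct, col_pct, cell_pct = {}, {}, {}
    for level, row in counts.items():
        row_total = sum(row)
        row_pct[level] = [100.0 * n / row_total for n in row]
        col_pct[level] = [100.0 * n / t for n, t in zip(row, col_totals)]
        cell_pct[level] = [100.0 * n / grand_total for n in row]
    return row_pct, col_pct, cell_pct

row_pct, col_pct, cell_pct = percentages(counts)
print("Labour, % within MainOccupation:", [round(p, 1) for p in row_pct["Labour"]])
print("Labour, % within Castehierarchy:", [round(p, 1) for p in col_pct["Labour"]])
print("Labour, % of Total:             ", [round(p, 1) for p in cell_pct["Labour"]])
```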
Univariate statistics by category
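Univariate statistics by category amount to splitting a quantitative variable by the levels of a categorical one and summarising each group separately. A minimal sketch, assuming the data arrive as (category, value) pairs; the function name and the records are hypothetical.

```python
from collections import defaultdict
import statistics

def summarize_by_category(records):
    """Group (category, value) pairs and summarise each group."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
        }
        for category, values in groups.items()
    }

# Hypothetical records (illustrative only).
records = [("Farming", 0.31), ("Labour", 0.19), ("Farming", 0.27),
           ("Labour", 0.22), ("Service", 0.35), ("Farming", 0.29)]

for category, stats in summarize_by_category(records).items():
    print(f"{category:<10} n={stats['n']}  mean={stats['mean']:.3f}  "
          f"median={stats['median']:.3f}")
```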
Univariate graph by category
Bar plot
Box plot
EDA summary
• All the techniques presented so far are
useful tools for EDA
• But without an understanding built from the
EDA, effective use of tools is not possible
• EDA helps to answer a lot of questions
– What is a typical value?
– What is the uncertainty of a typical value?
– What is a good distributional fit for the data?
– What are the relationships between two
attributes?
– etc.
The obvious is that which is never seen until someone
expresses it simply.
Kahlil Gibran