Exploratory Data Analysis: M. Srinath

This document provides an overview of exploratory data analysis (EDA) techniques. EDA involves visually examining datasets without hypotheses to identify patterns and relationships. Common EDA techniques include histograms, box plots, scatter plots, stem-and-leaf plots, and cross tabulations. These techniques allow researchers to summarize variable characteristics, identify outliers, and assess distribution shapes and relationships between variables in an exploratory manner.

Exploratory Data Analysis

M. Srinath
Exploratory Data Analysis
Introduction
• Exploratory data analysis was promoted by John Tukey in 1977 to
encourage statisticians to examine their data sets visually and to
formulate hypotheses that could then be tested on data sets

• Exploratory data analysis (EDA) is an approach for analysing data to
summarize the main characteristics of variables in an easy-to-understand
form, often with visual graphs, without using a statistical model or
having formulated a hypothesis

• EDA techniques are generally graphical. They include scatter plots,
stem-and-leaf plots, box plots, histograms, quantile plots, residual
plots, and mean plots

• EDA methods are generally cross-classified in two ways.
First, each method is either non-graphical or graphical. Second,
each method is either univariate or multivariate (usually just bivariate)


Exploratory Data Analysis
• EDA offers several techniques to comprehend data
• But EDA is more than a library of data analysis techniques
• EDA is an approach to data analysis
• EDA involves inspecting data without any assumptions
– Mostly using information graphics

Exploratory Data Analysis
Univariate non-graphical EDA
• Categorical data
– The only useful univariate non-graphical technique for categorical variables
is some form of tabulation of the frequencies, usually along with
calculation of the fraction (or percent) of data that falls in each
category
• Quantitative data
– Univariate non-graphical EDA generally focuses on measures of central
tendency (mean, median and mode), quartiles, spread (variance, SD and IQR),
skewness and kurtosis
– These descriptive statistics quantitatively describe the main features of the data
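
The two kinds of summary above can be sketched with the Python standard library. This is only an illustration: the sample values, the function names `tabulate` and `describe`, and the choice of simple moment formulas for skewness and kurtosis are assumptions, and statistical packages often apply small-sample corrections that give slightly different numbers.

```python
from collections import Counter
import statistics


def tabulate(values):
    """Frequency table: category -> (count, percent of all observations)."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, 100.0 * c / n) for cat, c in counts.items()}


def describe(x):
    """Central tendency, quartiles, spread, skewness and excess kurtosis."""
    n = len(x)
    mean = statistics.fmean(x)
    # Central moments (population form, no small-sample correction).
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": statistics.median(x),
        "mode": statistics.mode(x),
        "variance": statistics.variance(x),  # sample variance
        "sd": statistics.stdev(x),           # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Hypothetical data, for illustration only.
occupation = ["Farming", "Labour", "Farming", "Service",
              "Farming", "Business", "Labour", "Farming"]
scores = [2, 4, 4, 5, 7, 9, 11]

freq = tabulate(occupation)
stats = describe(scores)
print(freq)
print(stats)
```

Quartile conventions differ between packages, so the `method="inclusive"` choice here is one option among several.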

Univariate non-graphical EDA

A typical output of descriptive statistics
Variable: Di (Development index)

N Valid                   1201
N Missing                 0
Mean                      0.260333
Median                    0.261697
Mode                      0.214959
Std. Deviation            0.086778
Skewness                  0.086567
Std. Error of Skewness    0.070593
Kurtosis                  -0.88541
Std. Error of Kurtosis    0.14107
Percentile 25             0.186396
Percentile 50             0.261697
Percentile 75             0.330004

• When data has outliers, the median is more robust than the mean
• When the data distribution is skewed, the median is more meaningful
• IQR = 0.330004 - 0.186396 = 0.143608
• IQR is also a robust measure of spread
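
As a rough check on the output above, the IQR and the two standard errors can be recomputed from N and the percentiles. The standard-error formulas below are the commonly used ones that depend only on the sample size. That the software producing this output uses exactly these formulas is an assumption, although the results agree with the reported values to the printed precision.

```python
import math

# Values taken from the descriptive statistics output above.
n = 1201
p25, p75 = 0.186396, 0.330004

# Interquartile range: distance between the 75th and 25th percentiles.
iqr = p75 - p25


def se_skewness(n):
    """Standard error of skewness as a function of sample size only."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of kurtosis, built on the skewness standard error."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


print(f"IQR                = {iqr:.6f}")
print(f"SE of skewness     = {se_skewness(n):.6f}")
print(f"SE of kurtosis     = {se_kurtosis(n):.5f}")
```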

Univariate graphical EDA -Histogram

• Graphical display of a frequency distribution
– Counts of data falling in various ranges (bins)
– Histogram for numeric data
• Bin size selection is important
– Too small – may show spurious patterns
– Too large – may hide important patterns
• Several variations are possible
– Plot relative frequencies instead of raw frequencies
– Make the height of each bar equal to ‘relative frequency / bin width’
• With that scaling, the area under the histogram is 1
• When observations come from a continuous scale, histograms can be
approximated by continuous curves
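
The 'relative frequency / bin width' scaling can be checked numerically. The sketch below is an illustration under assumptions: the function name `density_histogram` and the bin edges are made up, and the values are the same 20 numbers used on the stem-and-leaf slide. It counts values per bin, converts the counts to heights, and confirms that the total area is 1.

```python
def density_histogram(data, bin_edges):
    """Return (counts, heights), with height = relative frequency / bin width.

    Bins are half-open [lo, hi), except the last one, which also
    includes its right edge.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in data:
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    n = len(data)
    heights = [
        counts[i] / n / (bin_edges[i + 1] - bin_edges[i]) for i in range(n_bins)
    ]
    return counts, heights


# Illustrative values and bin edges (bin width 10).
data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
edges = [10, 20, 30, 40, 50, 60]

counts, heights = density_histogram(data, edges)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(counts, heights, area)
```

Re-running with narrower or wider `edges` shows how the apparent shape depends on bin size.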
Stem and Leaf Plot
• This plot organizes data for easy visual inspection
– Min and max values
– Data distribution
• Unlike descriptive statistics, this plot shows all the data
– No information loss
– Individual values can be inspected
• Structure of the plot
– Stem – the digits in the largest place (e.g. tens place)
– Leaves – the digits in the smallest place (e.g. ones place)
– Leaves are listed to the right of the stem, separated by ‘|’
• Possible to place leaves from another data set to the left of
the stem for comparing two data distributions

Data
29, 44, 12, 53, 21, 34, 39, 25, 48, 23, 17, 24, 27, 32, 34, 15, 42, 21, 28, 37

Stem and Leaf Plot
1|275
2|91534718
3|49247
4|482
5|3
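
The plot for the 20 values on this slide can be rebuilt with a few lines of code. This is a minimal sketch that assumes non-negative integers with the tens digit as stem and the ones digit as leaf. The function name `stem_and_leaf` and the `sort_leaves` option are illustrative choices, not part of the original material.

```python
from collections import defaultdict


def stem_and_leaf(values, sort_leaves=False):
    """Build stem-and-leaf lines such as '1|275' (stem = tens, leaf = ones)."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        stems[stem].append(str(leaf))
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = sorted(stems[stem]) if sort_leaves else stems[stem]
        lines.append(f"{stem}|{''.join(leaves)}")
    return lines


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

plot = stem_and_leaf(data)                        # leaves in data order
ordered = stem_and_leaf(data, sort_leaves=True)   # leaves sorted within a stem
print("\n".join(plot))
```

With leaves kept in data order the output matches the slide. Sorting the leaves makes the minimum, maximum and median easier to read off.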

Stem and leaf plot
Di Stem-and-Leaf Plot

Frequency Stem & Leaf

1.00 0. &
10.00 0 . 999&
32.00 1 . 0000001111
66.00 1 . 2222222222223333333333
59.00 1 . 4444444445555555555
104.00 1 . 66666666666666666777777777777777777
81.00 1 . 888888888888899999999999999
76.00 2 . 00000000000000011111111111
82.00 2 . 2222222222222333333333333333
82.00 2 . 444444444444444445555555555
96.00 2 . 66666666666667777777777777777777
91.00 2 . 888888888888888899999999999999
79.00 3 . 00000000000000111111111111
90.00 3 . 222222222222223333333333333333
82.00 3 . 4444444444444555555555555555
67.00 3 . 6666666666667777777777
38.00 3 . 888888889999
33.00 4 . 00000011111
18.00 4 . 222233
9.00 4 . 445
4.00 4 . 6&
1.00 4. &

Stem width: .1000000
Each leaf: 3 case(s)
& denotes fractional leaves.
Box Plot
• A five-value summary plot of data
– Minimum, maximum
– Median
– 1st and 3rd quartiles
• Often used in conjunction with a histogram in EDA
• Structure of the plot
– Box represents the IQR (the middle 50% of values)
– The horizontal line in the box shows the median
– Vertical lines extend above and below the box
– Ends of the vertical lines, called whiskers, indicate the max and
min values if these fall within 1.5*IQR of the box
– Values beyond 1.5*IQR are shown as outliers above/below the
whiskers
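
The numbers behind a box plot can be computed directly. The sketch below assumes the common 1.5*IQR rule measured from the quartiles and the "inclusive" quartile method of Python's `statistics` module; plotting software may use a slightly different quartile convention. The function name `box_plot_summary` is illustrative, and the value 95 is a hypothetical extreme observation appended to the slide's 20 data values.

```python
import statistics


def box_plot_summary(data):
    """Quartiles, whisker ends and outliers under the 1.5*IQR rule."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in xs if lower_fence <= v <= upper_fence]
    outliers = [v for v in xs if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": outliers,
    }


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37,
        95]  # 95 is a hypothetical extreme value added for illustration

summary = box_plot_summary(data)
print(summary)
```

Here the upper whisker stops at 53, the largest value within the fence, and 95 is reported separately as an outlier.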

Quantile-Normal plot
• Used to see how well a particular sample follows a
particular theoretical distribution (here, the normal
distribution)
• Many statistical tests assume that the outcome for any
set of values of the explanatory variables is approximately
normally distributed, which is why QN plots are useful:
if the assumption is grossly violated, the p-values
and confidence intervals of
those tests are wrong
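
The coordinates of a quantile-normal plot can be sketched as follows. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, since software packages use several variants. The function name `quantile_normal_points` is illustrative, and the sample is the 20 values from the stem-and-leaf slide.

```python
import statistics


def quantile_normal_points(sample):
    """Pair each sorted observation with a standard normal quantile.

    Returns a list of (theoretical_quantile, observed_value) tuples.
    Points lying close to a straight line suggest approximate normality.
    """
    xs = sorted(sample)
    n = len(xs)
    std_normal = statistics.NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))


data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

points = quantile_normal_points(data)
for theo, obs in points:
    print(f"{theo:7.3f}  {obs}")
```

Plotting the observed values against the theoretical quantiles gives the QN plot; systematic curvature indicates skewness or heavy tails.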

Scatter Plot
• Scatter plots are two-dimensional graphs with
– Explanatory attribute plotted on the x-axis
– Response attribute plotted on the y-axis
• Useful for understanding the relationship between two attributes
• Features of the relationship
– Strength
– Shape (linear or curved)
– Direction
– Outliers
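
Strength and direction are often summarized numerically with the Pearson correlation coefficient, which is a technique brought in here for illustration and not one named on the slide. It only captures linear relationships, so shape and outliers still need the plot itself. The data and the function name `pearson_r` below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: the sign gives direction, the magnitude gives strength."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical explanatory (x) and response (y) values.
x = [1, 2, 3, 4, 5]
y_noisy = [2, 1, 4, 3, 5]        # increasing, with some scatter
y_decreasing = [10, 8, 6, 4, 2]  # perfectly linear, negative direction

r_noisy = pearson_r(x, y_noisy)
r_decreasing = pearson_r(x, y_decreasing)
print(r_noisy, r_decreasing)
```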

Scatter Plot Matrix
• When multiple attributes need to be visualized all at once
– Scatter plots are drawn for every pair of attributes and arranged
into a 2D matrix
• Useful for spotting relationships among attributes
– Similar to a scatter plot
– Attributes are shown on the diagonal

Cross tabulation
• For categorical data (and quantitative data with only a few
different values) an extension of tabulation called
cross-tabulation is very useful.
• For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one
variable and row headings that match the levels of the other
variable, then filling in the counts of all subjects that share a
pair of levels.
• The two variables might be both explanatory, both outcome, or
one of each. Depending on the goals, row percentages (which add
to 100% for each row), column percentages (which add to 100%
for each column) and/or cell percentages (which add to 100%
over all cells) are also useful.
• Cross-tabulation can be extended to three (and sometimes
more) variables by making separate two-way tables for two
variables at each level of a third variable. Cross-tabulation is
the basic bivariate non-graphical EDA technique.
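
Building a two-way table from raw observations amounts to counting each pair of levels. The sketch below uses a handful of hypothetical records and an illustrative function name, `cross_tabulate`.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Two-way table: row level -> column level -> count of subjects."""
    cell_counts = Counter(pairs)
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    # Counter returns 0 for pairs of levels that never occur together.
    return {r: {c: cell_counts[(r, c)] for c in col_levels} for r in row_levels}


# Hypothetical (row variable, column variable) observations.
records = [
    ("Farming", "OBC"),
    ("Labour", "SC/ST"),
    ("Farming", "Upper caste"),
    ("Farming", "OBC"),
    ("Service", "Upper caste"),
    ("Labour", "SC/ST"),
]

table = cross_tabulate(records)
for row, cells in table.items():
    print(row, cells)
```

A three-way table follows the same pattern: apply `cross_tabulate` separately to the records at each level of the third variable.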

Cross tabulation
MainOccupation * Castehierarchy Crosstabulation

                                         Castehierarchy
MainOccupation                            SC/ST  Backward   OBC  Upper caste  Total
Labour          Count                       148        33    35           12    228
                % within MainOccupation    64.9      14.5  15.4          5.3  100.0
                % within Castehierarchy    42.2      26.2  15.0          2.4   19.0
                % of Total                 12.3       2.7   2.9          1.0   19.0
Business        Count                        20         6    26           26     78
                % within MainOccupation    25.6       7.7  33.3         33.3  100.0
                % within Castehierarchy     5.7       4.8  11.1          5.3    6.5
                % of Total                  1.7       0.5   2.2          2.2    6.5
Service         Count                        21         5     4           37     67
                % within MainOccupation    31.3       7.5   6.0         55.2  100.0
                % within Castehierarchy     6.0       4.0   1.7          7.6    5.6
                % of Total                  1.7       0.4   0.3          3.1    5.6
Farming         Count                       162        82   169          415    828
                % within MainOccupation    19.6       9.9  20.4         50.1  100.0
                % within Castehierarchy    46.2      65.1  72.2         84.7   68.9
                % of Total                 13.5       6.8  14.1         34.6   68.9
Total           Count                       351       126   234          490   1201
                % within MainOccupation    29.2      10.5  19.5         40.8  100.0
                % within Castehierarchy   100.0     100.0 100.0        100.0  100.0
                % of Total                 29.2      10.5  19.5         40.8  100.0
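
The three kinds of percentages in this table can be reproduced from the counts alone. The sketch below takes the counts from the table and recomputes the row, column and cell percentages. The function name `percentages` and the nested-dictionary layout are illustrative choices.

```python
# Counts copied from the crosstabulation above.
counts = {
    "Labour":   {"SC/ST": 148, "Backward": 33, "OBC": 35,  "Upper caste": 12},
    "Business": {"SC/ST": 20,  "Backward": 6,  "OBC": 26,  "Upper caste": 26},
    "Service":  {"SC/ST": 21,  "Backward": 5,  "OBC": 4,   "Upper caste": 37},
    "Farming":  {"SC/ST": 162, "Backward": 82, "OBC": 169, "Upper caste": 415},
}


def percentages(counts):
    """Return row %, column % and cell % tables plus the grand total."""
    row_totals = {r: sum(cells.values()) for r, cells in counts.items()}
    col_totals = {}
    for cells in counts.values():
        for c, v in cells.items():
            col_totals[c] = col_totals.get(c, 0) + v
    grand = sum(row_totals.values())
    row_pct = {r: {c: 100.0 * v / row_totals[r] for c, v in cells.items()}
               for r, cells in counts.items()}
    col_pct = {r: {c: 100.0 * v / col_totals[c] for c, v in cells.items()}
               for r, cells in counts.items()}
    cell_pct = {r: {c: 100.0 * v / grand for c, v in cells.items()}
                for r, cells in counts.items()}
    return row_pct, col_pct, cell_pct, grand


row_pct, col_pct, cell_pct, grand = percentages(counts)
print(f"N = {grand}")
print(f"Labour, SC/ST: {row_pct['Labour']['SC/ST']:.1f}% of row, "
      f"{col_pct['Labour']['SC/ST']:.1f}% of column, "
      f"{cell_pct['Labour']['SC/ST']:.1f}% of total")
```

Which percentage to read depends on the question: row percentages describe the caste composition of an occupation, column percentages describe the occupations within a caste group.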

Univariate statistics by category

• For one categorical variable (usually explanatory) and one
quantitative variable (usually outcome), it is common to
produce some of the standard univariate non-graphical
statistics for the quantitative variable separately for each
level of the categorical variable, and then compare
the statistics across levels of the categorical variable

Univariate statistics of Di by category

Statecode            Mean       SD   Median      Min      Max Skewness Kurtosis
Andhra Pradesh     0.1901   0.0592   0.1787   0.0947   0.3947   0.6399   0.1279
Assam              0.2080   0.0569   0.1970   0.0878   0.3664   0.2354  -0.5341
Haryana            0.2706   0.0684   0.2853   0.1135   0.3997  -0.4030  -0.7584
HP                 0.3319   0.0617   0.3353   0.1559   0.4862  -0.1965   0.0755
Karnataka          0.1782   0.0586   0.1716   0.0781   0.4674   1.5890   4.5150
Maharashtra        0.2537   0.0778   0.2434   0.0975   0.4318   0.1728  -0.6878
Punjab             0.3342   0.0676   0.3346   0.1623   0.4694  -0.1837  -0.5428
Uttrakhand         0.3144   0.0552   0.3216   0.1864   0.4416  -0.3060  -0.6048
Total              0.2603   0.0868   0.2617   0.0781   0.4862   0.0866  -0.8854
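
A table of this kind comes from grouping the quantitative values by category and summarizing each group. The sketch below shows the idea on hypothetical data; the category labels "A" and "B", the values, and the function name `stats_by_category` are all made up for illustration.

```python
from collections import defaultdict
import statistics


def stats_by_category(pairs):
    """Summarize a quantitative variable separately for each category level."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    summary = {}
    for category, values in groups.items():
        summary[category] = {
            "n": len(values),
            "mean": statistics.fmean(values),
            # The sample SD needs at least two observations.
            "sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Hypothetical (category, value) observations.
observations = [
    ("A", 0.10), ("A", 0.20), ("A", 0.30),
    ("B", 0.25), ("B", 0.35), ("B", 0.45),
]

by_group = stats_by_category(observations)
for category, s in by_group.items():
    print(category, s)
```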

Univariate graph by category
Bar plot

Univariate graph by category
Box plot

EDA summary
• All the techniques presented so far are tools useful for EDA
• But without an understanding built from the EDA, effective use of
tools is not possible
• EDA helps to answer a lot of questions
– What is a typical value?
– What is the uncertainty of a typical value?
– What is a good distributional fit for the data?
– What are the relationships between two attributes?
– etc.

The obvious is that which is never seen until someone
expresses it simply.
- Kahlil Gibran

The greatest value of a picture is when it forces us to notice what we
never expected to see.
- John W. Tukey

The best thing about being a statistician is that you get to
play in everyone’s backyard.
- John W. Tukey
