Exploratory Data Analysis Reference
Data Visualization
Credits: Chris Volinsky - Columbia University
Outline
• EDA
• Visualization
– One variable
– Two variables
– More than two variables
– Other types of data
– Dimension reduction
EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are very
important steps in any analysis task.
Data Visualization – cake bakery
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
– means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn something!
• data-driven (model-free)
• Think interactive and visual
– Humans are the best pattern recognizers
– You can use more than 2 dimensions!
• x, y, z, space, color, time, ...
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$
– mode: most common value in X
– median: sort X, then median $= X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value $= X_{0.25n}$, Q3 value $= X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) $= X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\sum_i (X_i - \mu)^3 / \left[ \left( \sum_i (X_i - \mu)^2 \right)^{3/2} \right]$
• zero if symmetric; right-skewed more common (what kind of data is
right skewed?)
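The definitions above can be checked on a small sample. This is a minimal sketch on made-up, right-skewed data (the vector `x` is hypothetical). Note that R's `quantile()` interpolates between order statistics by default, so its quartiles can differ slightly from the simple index rule, and `var()` divides by n - 1 rather than n.

```r
# Toy right-skewed sample (hypothetical values, for illustration only)
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 18)
n <- length(x)

mu     <- sum(x) / n                                # mean
med    <- median(x)                                 # averages the two middle values when n is even
tab    <- table(x)
mode_x <- as.numeric(names(tab)[which.max(tab)])    # most common value
q      <- quantile(x, c(0.25, 0.75), names = FALSE) # Q1, Q3 (R's default interpolation)
iqr    <- q[2] - q[1]                               # interquartile range
rng    <- max(x) - min(x)                           # range
sigma2 <- sum((x - mu)^2) / n                       # variance with divisor n

# Skewness written exactly as in the list above
skew <- sum((x - mu)^3) / sum((x - mu)^2)^(3 / 2)
# The usual moment coefficient of skewness is the same quantity times sqrt(n)
skew_g1 <- sqrt(n) * skew

c(mean = mu, median = med, mode = mode_x, IQR = iqr,
  range = rng, variance = sigma2, skewness = skew_g1)
```

The mean sits above the median here, and the skewness is positive, which is the typical signature of right-skewed data.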
Issues with Histograms
But be careful with axes and scales!
Smoothed Histograms - Density Estimates
• Kernel estimates smooth out the contribution of each
datapoint over a local neighborhood of that point.
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
h is the kernel width
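To make the kernel estimate concrete, here is a minimal sketch that writes the sum out directly. The function name `kde`, the Gaussian kernel (`dnorm`), the simulated sample and the value h = 0.4 are all assumptions chosen for illustration. In practice R's built-in `density()` would normally be used instead.

```r
# Kernel density estimate written out term by term:
#   f_hat(x) = 1 / (n * h) * sum_i K((x - x_i) / h)
# `kde` is a hypothetical helper; K defaults to the Gaussian kernel.
kde <- function(x, data, h, K = dnorm) {
  sapply(x, function(x0) sum(K((x0 - data) / h)) / (length(data) * h))
}

set.seed(1)                            # arbitrary seed, for reproducibility only
xi    <- rnorm(200)                    # simulated sample
grid  <- seq(-6, 6, length.out = 1201) # evaluation points, spacing 0.01
f_hat <- kde(grid, xi, h = 0.4)

# Sanity check: a density should integrate to roughly 1
area <- sum(f_hat) * (grid[2] - grid[1])

plot(grid, f_hat, type = "l", xlab = "x", ylab = "estimated density")
rug(xi)                                # mark the raw datapoints on the axis
```

Each datapoint contributes one small bump of width h, and the estimate is the average of those bumps.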
Bandwidth choice is an art
Usually want to try several
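Trying several bandwidths is cheap with R's `density()`. The sketch below uses a simulated bimodal sample and three arbitrary bandwidths (both are assumptions for illustration): a very small value gives a spiky curve, a very large one can smooth the two modes into one.

```r
set.seed(2)                                   # arbitrary seed
# Simulated bimodal sample (illustrative only)
x <- c(rnorm(150, mean = 0), rnorm(150, mean = 4))

bws  <- c(0.1, 0.5, 2)                        # hypothetical bandwidths: small, moderate, large
fits <- lapply(bws, function(b) density(x, bw = b))

plot(fits[[1]], main = "Same data, three bandwidths", col = "grey60")
lines(fits[[2]], col = "blue", lwd = 2)
lines(fits[[3]], col = "red", lty = 2)
legend("topright", legend = paste("bw =", bws),
       col = c("grey60", "blue", "red"), lty = c(1, 1, 2), lwd = c(1, 2, 1))

# A rule-of-thumb value can serve as a starting point for the search
bw_start <- bw.nrd0(x)
```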
Boxplots
Time Series
If your data has a temporal component, be sure to exploit it
steady growth
trend
Time-Series Example 3
• Data from cities/states/zip codes
– easy to get lat/long
Spatial data: choropleth Maps
interesting?
2D Scatterplots
interesting?
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter Plot: Homoscedastic
Scatter Plot: Heteroscedastic
Two variables - continuous
• Scatterplots
– But can be bad with lots of data
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)
Jittering
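Jittering adds a small amount of random noise so that points sharing the same coordinates no longer sit exactly on top of each other. A minimal sketch with base R's `jitter()` follows. The simulated integer data and the noise amount of 0.2 are assumptions chosen for illustration.

```r
set.seed(3)                                   # arbitrary seed
# Simulated integer-valued data: many observations share the same (x, y)
x <- sample(1:5, 500, replace = TRUE)
y <- sample(1:5, 500, replace = TRUE)

# Add uniform noise of at most 0.2 in each direction
xj <- jitter(x, amount = 0.2)
yj <- jitter(y, amount = 0.2)

op <- par(mfrow = c(1, 2))
plot(x, y, pch = 16, main = "Raw: at most 25 distinct points")
plot(xj, yj, pch = 16, col = "#0000ff44", main = "Jittered, with alpha")
par(op)
```

Jittering combines well with transparent plotting: the noise separates the points, and the alpha channel shows where they are dense.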
Displaying Two Variables
• If one variable is categorical, use small multiples
library('lattice')
histogram(~DiastolicBP | TimesPregnant==0)
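The call above depends on a dataset that is not shown here. As a self-contained sketch of the same small-multiples idea, the example below uses R's built-in `iris` data as a stand-in (an assumption, not the dataset from the slide) and draws one histogram panel per level of the categorical variable.

```r
library(lattice)   # ships with standard R installations

# One histogram panel per species: `| Species` conditions on the categorical variable
p <- histogram(~ Sepal.Length | Species, data = iris,
               layout = c(3, 1), xlab = "Sepal length (cm)")
print(p)           # lattice objects are drawn when printed
```

Because all panels share the same axes, differences between groups can be read off by eye.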
Two Variables - one categorical
Barcharts and Spineplots
orange=M blue=F
Spineplots show proportions well, but can be hard to interpret
More than two variables
Pairwise scatterplots
Can be somewhat ineffective for categorical data
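A pairwise scatterplot matrix is a one-liner in base R. In this sketch (using the built-in `iris` data as an assumed example) the categorical variable is shown as point colour rather than as its own panel, which is one way around the weakness noted above. The colour choices are arbitrary.

```r
# Map each species to a colour (arbitrary choice of colours)
species_cols <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]

# Scatterplot matrix of the four numeric measurements
pairs(iris[, 1:4], col = species_cols, pch = 16, cex = 0.6)
```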
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
– trellis or lattice plots
– Cleveland models on human perception, all based on conditioning
– Infinite possibilities
• Earthquake data:
– locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964
– Data collected on the severity of the earthquake
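The description above matches the `quakes` dataset that ships with R, so a conditioning plot can be sketched directly. Here the latitude/longitude scatterplot is conditioned on depth. The choice of depth as the conditioning variable, six intervals and a 10% overlap are assumptions for illustration.

```r
library(lattice)

# `quakes` ships with R: columns lat, long, depth, mag, stations
str(quakes)

# Cut depth into six slightly overlapping intervals ("shingles")
depth_grp <- equal.count(quakes$depth, number = 6, overlap = 0.1)

# One lat/long panel per depth interval
p <- xyplot(lat ~ long | depth_grp, data = quakes,
            pch = 16, cex = 0.4, xlab = "Longitude", ylab = "Latitude")
print(p)
```

Reading the panels in order shows how the spatial pattern of events changes with depth, which a single scatterplot would hide.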
How many dimensions are represented here?
Petal, a non-reproductive part of the flower
Sepal, a non-reproductive part of the flower
Parallel Coordinates
[Figure: a single flower drawn as a line across parallel axes, built up one axis at a time: Sepal Length = 5.1, Sepal Width = 3.5, then 1.4 and 0.2 on the two remaining axes]
Multivariate: Parallel coordinates
Alpha blending can be effective
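A parallel coordinates plot with alpha blending can be sketched with `parcoord()` from the MASS package, which is bundled with standard R installations. The `iris` data and the semi-transparent colours below are assumptions chosen for illustration.

```r
library(MASS)      # recommended package bundled with R; provides parcoord()

# Semi-transparent colour per species: the last two hex digits are the alpha channel
cols <- c("#FF000040", "#00800040", "#0000FF40")[as.integer(iris$Species)]

# One polyline per flower across the four measurements;
# parcoord() rescales each column to a common range before drawing
parcoord(iris[, 1:4], col = cols, var.label = TRUE)
```

With 150 overlapping lines, transparency lets the dense bundles stand out while isolated lines stay faint.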
Networks and Graphs
Network Visualization
• Graphviz (open source software) is a nice layout tool for big and small graphs
What’s missing?
• pie charts
– very popular
– good for showing simple relations of proportions
– Human perception not good at comparing arcs
– barplots, histograms usually better (but less pretty)
• 3D
– nice to be able to show three dimensions
– hard to do well
– often done poorly
– 3D best shown through “spinning” in 2D
• uses various types of projecting into 2D
• http://www.stat.tamu.edu/~west/bradley/
Worst graphic in the world?
Dimension Reduction
– Variable selection
• e.g. stepwise
– Principal Components
• find linear projection onto p-space with maximal variance
– Multi-dimensional scaling
• takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
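Both projection methods are available in base R. The sketch below applies them to the four `iris` measurements (an assumed example dataset) with p = 2: `prcomp()` for principal components and `cmdscale()` for classical multidimensional scaling. With Euclidean distances on centred data the two embeddings agree up to the sign of each axis, which makes this a convenient sanity check rather than a general property of MDS.

```r
X <- scale(iris[, 1:4])                 # centre and scale the four measurements

# Principal components: linear projection with maximal variance
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Classical MDS: embed points in 2-D from a matrix of dissimilarities
d   <- dist(X)                          # Euclidean distances
mds <- cmdscale(d, k = 2)

op <- par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = iris$Species, pch = 16, main = "PCA")
plot(mds, col = iris$Species, pch = 16, main = "Classical MDS",
     xlab = "Coordinate 1", ylab = "Coordinate 2")
par(op)

round(var_explained, 3)                 # share of variance per component
```

MDS becomes the more useful tool when only dissimilarities are available, for example survey-based similarity ratings, and there are no coordinates to run PCA on.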
Visualization done right
• http://www.youtube.com/watch?v=jbkSRLYSojo