
Ellison - 2001 - Exploratory Data Analysis and Graphical Display

This document discusses exploratory data analysis and graphic displays for analyzing ecological experiments. It begins by explaining that graphics are useful both for exploring patterns in data before formal statistical analysis and for clearly communicating results. The document then provides guidelines for effective exploratory data analysis graphics and presentation graphics, emphasizing showing underlying patterns while maintaining data integrity and using simple, uncluttered designs. Specific graph types are described and examples provided.


3

Exploratory Data Analysis and Graphic Display

AARON M. ELLISON

Ellison, A. M. 2001. Exploratory data analysis and graphic display. In:
Scheiner, S. M.; Gurevitch, J. (eds.), Design and analysis of ecological
experiments. Oxford. pp. 37-62.

3.1 Introduction

You have designed your experiment, collected the data, and are now confronted
with a tangled mass of information that must be analyzed, presented, and pub-
lished. Turning this heap of raw spaghetti into an elegant fettucine alfredo will
be immensely easier if you can visualize the message buried in your data. Data
graphics, the visual "display [of] measured quantities by means of the combined
use of points, lines, a coordinate system, numbers, symbols, words, shading, and
color" (Tufte 1983, p. 9) provide the means for this visualization.
Graphics serve two general functions in the context of data analysis. First,
graphics are a tool used to explore patterns in data before the formal statistical
analysis (Exploratory Data Analysis, or EDA, Tukey 1977). Second, graphics
communicate large amounts of information clearly, concisely, and rapidly, and
illuminate complex relationships within data sets.
Graphic EDA yields rough sketches to help guide you to appropriate, often
counterintuitive, formal statistical analyses. In contrast to EDA, presentation
graphics are final illustrations suitable for publication. Presentation graphics of
high quality can leave a lasting impression on readers or audiences, whereas
vague, sloppy, or overdone graphics easily can obscure valuable information and
engender confusion. Ecological researchers should view EDA and sound presen-
tation graphic techniques as essential components of data analysis, presentation,
and publication.
This chapter provides an introduction to graphic EDA, and some guidelines
for clear presentation graphics. More detailed discussions of these and related
topics can be found in texts by Tukey (1977), Tufte (1983, 1990), and Cleveland
(1985). These techniques are illustrated for univariate, bivariate, multivariate, and
classified quantitative (ANOVA) data sets that exemplify the types of data sets
encountered commonly in ecological research. Sample data sets are described
briefly in section 3.3; formal analyses of three of the illustrated data sets can be
found in chapters 14 (univariate data set) and 10 (predator-prey data set), and Pot-
vin (1993, chapter 3, ANOVA data set). You may find some of the graphics types
presented unfamiliar or puzzling, but consider them seriously as alternatives to
the more common bar charts, histograms, and pie charts. The majority of these
graphs can be produced by readily available Windows-based software (Kardia
1998). I used S-Plus (MathSoft, Inc.) and SYSTAT (SPSS, Inc.) to construct the
figures in this chapter.
Guiding Principles. The question or hypothesis guiding the experimental de-
sign also should guide the decision as to which graphics are appropriate for ex-
ploring or illustrating the data set. Sketching a mock graph, without data points,
before beginning the experiment usually will clarify experimental design and al-
ternative outcomes. This procedure also clarifies a priori hypotheses that will
prevent inappropriately considering a posteriori hypotheses (suggested by EDA)
as a priori. Often, the simplest graph, without frills, is the best. However, graphs
do not have to be simple-minded, conveying only a single type of information,
and they need not be assimilated in a single glance. Tufte (1983) and Cleveland
(1985) provide numerous examples of graphs that require detailed inspection be-
fore they reveal their messages. Besides the aesthetic and cognitive interest they
provoke, complex graphs that are information-rich can save publication costs and
time in presentations.
Regardless of the complexity of your illustrations, you should adhere to the
following four guidelines in EDA and production graphics:

1. Underlying patterns of interest should be illuminated, while not compromising
the integrity of the data.
2. The data structure should be maintained, so that readers can reconstruct the data
from the figure.
3. Figures should have a high data-to-ink ratio and no "chartjunk"—"graphical par-
aphernalia routinely added to every display" (Tufte 1983, p. 107), including ex-
cessive shading, grid lines, ticks, special effects, and unnecessary three-dimen-
sionality.
4. Figures should not distort, exaggerate, or censor the data.

With the increasing availability of hardware and software capable of digitizing
information directly from published sources, adherence to these guidelines has
become increasingly important. Gurevitch (chapter 18; Gurevitch et al. 1992), for
example, relied extensively on information gleaned by digitizing data from many
different published figures to explore common ecological effects across many
experiments via meta-analysis. Readers will be better able to compare published
data sets that are represented clearly and accurately.

3.2 Graphic Approaches

3.2.1 Exploratory Data Analysis (EDA)


Tukey (1977) established many of the principles of EDA, and his book is an
indispensable guide to EDA techniques. You should view EDA as a first pass
through your data set prior to formal statistical analysis. EDA is particularly ap-
propriate when there is a large amount of variability in the data (low signal-to-
noise ratio) and when treatment effects are not immediately apparent. You can then
proceed to explore, through formal analysis, the patterns illuminated by graphic
EDA.
Since EDA is designed to illuminate underlying patterns in noisy data, it is
imperative that the underlying data structure not be obscured or hidden com-
pletely in the process. Also, because EDA is the predecessor to formal analysis,
it should not be time-consuming. Personal computer-based packages permit rapid,
interactive graphic construction with little of the effort necessary in formal analy-
sis. Finally, EDA should lead you to appropriate formal analyses and models. A
common use of EDA is to determine whether the raw data satisfy the assumptions
of the statistical tests suggested by the experimental design (see sections 3.3.1
and 3.3.4). Violation of assumptions revealed by EDA may lead you to use differ-
ent statistical models from those you had intended to employ a priori. For exam-
ple, Antonovics and Fowler (1985) found unanticipated effects of planting posi-
tion in their studies of plant competitive interactions in hexagonal planting arrays.
These results led to a new appreciation for neighborhood interactions in plant
assemblages (e.g., Czaran and Bartha 1992).

3.2.2 Production Graphics


Graphics are an essential medium of communication in scientific literature and at
seminars and meetings. In a small amount of space or time, it is imperative to
deliver the message and fix it clearly and memorably in the audience's mind.
Numerous authors have investigated and analyzed how individuals perceive dif-
ferent types of graphs, and what makes a "good" and "bad" graph from a cogni-
tive perspective (reviewed concisely by Wilkinson 1990 and in depth by Cleve-
land 1985). It is not my intention to review this material; rather, through example,
I hope to change the way we as ecologists display our data to maximize the
amount of information communicated while minimizing distraction.
Cleveland (1985) presented a hierarchy of graphic elements used to construct
data graphics that satisfy the guidelines suggested in section 3.1 (figure 3.1).
Although there is no simple way to distinguish good graphics from bad graphics,
we can derive general principles from Cleveland's ranking. First, color, shading,
and other chartjunk effects do not as a rule enhance the information content of
graphs. They may look snazzy in a seminar, but they lack substance and use a
lot of ink. Second, three-dimensional graphs that are mere extensions of two-
dimensional graphs (e.g., ribbon charts, three-dimensional histograms, or pie

BETTER
1. Position along a common scale
2. Position along identical scales
3. Length
4. Angle/Slope
5. Area
6. Volume
7. Shading: color, saturation, density
WORSE

Figure 3.1 Ordering of graphic features according to their relative accuracy in representing
quantitative variation (after Cleveland 1985).

charts) not only do not increase the information content available, but often ob-
scure the message (a dramatic, if unfortunate, set of examples can be found in
Benditt 1992). These graphics, common in business presentations and increas-
ingly rife at scientific meetings, violate all of the suggested guidelines. Finally,
more dimensions often are used than are necessary, for example, "areas" and
lines where a point would do. Simkin and Hastie (1987) discuss exceptions to
Cleveland's graphic hierarchy. In general, when designing graphics, adhere to the
Shaker maxim: form follows function.
High-quality graphical elements can be assembled into effective graphic dis-
plays of data (Cleveland 1985). First, emphasize the data. Lines drawn through
data points should not hide the points themselves. Second, data points should
never lie on axes themselves, as the axes can obscure data points. If, for example,
there are many points that would fall along a zero line, then extend that axis
beyond zero (figure 3.6). Third, reference lines, if needed (which they rarely are),
should be deemphasized relative to the data. This can be accomplished with dif-
ferent line types (variable thicknesses; dotted, dashed, or solid) or shading. Fourth,
overlapping data symbols or data sets should be clearly distinguishable. You can
increase data visibility and minimize overlap by varying symbol size or position,
separating data sets to be compared into multiple plots, or changing from arithmetic
to logarithmic scales. Exemplars include the jitter plot, which avoids overlap
of identical values (figure 3.3B) and spreading of responses to categories across
an axis (figure 3.12D). Fifth, the plot must be easily readable following reduction
for publication or when projected as a slide to a seminar audience. Finally, Cleveland
recommends using a full rectangular plot frame, not the more common bottom
axis/left axis only combination seen in many articles. A full frame, together with tick
marks outside the plot frame, (1) emphasizes the data and (2) helps the reader
accurately place individual data points. Tufte (1983) disagrees, as the extra axes
are an excessive use of ink and convey no information. Examples in this chapter
illustrate most of these possibilities. In the final analysis, many of these rules
reflect not only insight into cognitive perception, but also aesthetic judgments by
you, the author.
From this discussion, we could ask, Isn't all this too much trouble? Should
we dispense with graphs altogether in favor of tables? Because of their concise-
ness, graphics are almost always preferable in oral presentations. Graphs illustrate
more clearly relationships among variables and are a quick means of displaying
multivariate information. However, where exact values are important (as in final
publications), tables are more precise. Although the need for precise tables has
been obviated by the increasing availability of digitizing software and on-line
data archives, data presented graphically must be unbiased and uncensored. A
discussion of what data should be provided, in either graphs or tables, follows in
section 3.4.

3.3 Examples

3.3.1 Univariate Data: Frequency (Density) Distributions


Distributions of height, biomass, or other size metrics are often the primary
descriptor of populations or communities. As an example of size distributions, I
use a data set containing the number of leaf nodes of 75 Ailanthus altissima
plants. The experimental design and formal analysis of these data are given in
chapter 14.
With univariate data, two questions are paramount: (1) How are the data dis-
tributed (including summary statistics such as the mean, variance, and median)?
and (2) Are the data normally distributed or can they be transformed to make
them amenable to parametric analyses? Investigators often explore these ques-
tions via histograms or normality plots.

A histogram is an example of a density plot; that is, each bar illustrates the
frequency, or density, of the values occurring in the data set between the lower
bound and the upper bound of each bar. Histograms are commonly confused with
bar charts (see section 3.3.4). The latter are used to illustrate some summary mea-
sure (often the mean, sum, or percentage) of all the values within a given treat-
ment category. Histograms of the Ailanthus data are shown in figure 3.2.
A histogram is not the best method for answering the two questions posed
previously, for three reasons. First, the raw data are hidden. In this example, there
are 75 plants, which have been divided into 12 biomass groups, or bins (figure
3.2A). It is impossible to know, for example, if the third bar (range 12-14 nodes)
contains 10 observations of 12 nodes, 10 observations of 14 nodes, or any other
of the possible combinations of 12-14 nodes in 10 observations. Second, the di-
vision into 12 bins is arbitrary; it was the default of the graphics program. We
could just as easily use 24 or 6 bins, both of which change the apparent shape of
the distribution (figures 3.2B,C) without conveying additional information. Third,
summary statistics cannot be computed from the data illustrated in the histogram.
Thus, a histogram does not enable us to answer key questions about univariate
data. In addition, histograms fall low on Cleveland's hierarchy of graphic primi-
tives. Bars in a histogram use vertical lines, horizontal lines, and shading in con-
cert to present information embodied in the single point indicated by the top of
the bar.
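
The dependence of a histogram's shape on bin width can be checked numerically. The following is a minimal sketch, assuming a small set of hypothetical node counts (illustrative values only, not the Ailanthus data) and a hypothetical helper named bin_counts; it tabulates the same observations at three bin widths.

```python
from collections import Counter

# Hypothetical node counts per plant (illustrative values only, not the Ailanthus data).
nodes = [8, 10, 11, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 18, 18, 19, 21, 22, 24, 27]


def bin_counts(values, width, start=0):
    """Count values in half-open bins [lo, lo + width), keyed by each bin's lower bound."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))


# The same data, three different pictures of the distribution.
for width in (2, 4, 8):
    print(f"bin width {width}: {bin_counts(nodes, width)}")
```

Every tabulation contains the same 20 observations, yet the number of bars and their relative heights differ with each width, which is the arbitrariness described above.
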
Tukey (1977) introduced the stem-and-leaf diagram as the simplest alternative
to the histogram (figure 3.3A). The main advantage of the stem-and-leaf diagram
is that the raw data are presented in toto. Summary statistics can be derived easily
from or incorporated into the figure. Nevertheless, stem-and-leaf diagrams suffer
visually from one of the same drawbacks as histograms: the number of bins is
arbitrary. Two other alternatives to histograms are jitter plots (figure 3.3B) and
dit plots (figure 3.3C). These two figures preserve the underlying data structure
(all values are presented), do not use arbitrary bins, and can be constructed
quickly without additional preparation (e.g., sorting) of the data set. Both plots
permit rapid assessment of density patterns and are simple to understand.
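
A stem-and-leaf diagram is simple enough to build by hand or in a few lines of code. The sketch below assumes non-negative integer data, uses the tens digit as the stem and the units digit as the leaf, and relies on a hypothetical helper named stem_and_leaf. The first five values follow the example in the caption of figure 3.3; the remaining values are made up for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines with the tens digit as the stem and the units digit as the leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Empty stems are kept so that gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


# 8, 10, 11, 11, 11 follow the caption of figure 3.3; the rest are hypothetical.
nodes = [8, 10, 11, 11, 11, 23, 25, 41]
print("\n".join(stem_and_leaf(nodes)))
```

Because every leaf is an observed digit, the raw data can be read back from the display, which is the property that distinguishes this diagram from a histogram.
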
Stem-and-leaf plots and the density diagrams presented in figure 3.3 can be
used as simple alternatives to histograms. However, these plots do not clearly
convey some of the information that ecologists may want to communicate, and it
is difficult to compare the information in two or more of these plots. I suggest
the box-and-whisker plot (Tukey 1977), often called simply a box plot, as a pre-
sentation alternative to the univariate histogram (figures 3.2 and 3.4A). An advan-
tage of the box plot is that it provides more summary statistical information than
a histogram—it includes medians, quartiles, ranges, and outliers (extreme vari-
ates)—in much less space and with much less ink. Box plot construction is not
dependent on arbitrary bins, so these plots do not exaggerate or distort the data
distribution. By notching the box plot (figure 3.12E), you can easily add confi-
dence intervals so that plots of several distributions can be compared easily.
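
The summary statistics behind a box plot follow directly from the construction rules given with figure 3.4 (hinges, hspread, inner fences at 1.5 hspreads, outer fences at 3 hspreads). The sketch below is one possible implementation under stated assumptions: hinges are computed as Tukey's medians of the lower and upper halves of the sorted data, which can differ slightly from percentile-based quartiles in some software, and the function name and the data are hypothetical.

```python
from statistics import median


def box_plot_summary(values):
    """Median, hinges, hspread, whisker ends, and the two classes of outliers."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # when n is odd, the median belongs to both halves
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    hspread = upper_hinge - lower_hinge
    inner = (lower_hinge - 1.5 * hspread, upper_hinge + 1.5 * hspread)
    outer = (lower_hinge - 3.0 * hspread, upper_hinge + 3.0 * hspread)
    inside = [v for v in data if inner[0] <= v <= inner[1]]
    return {
        "median": median(data),
        "hinges": (lower_hinge, upper_hinge),
        "hspread": hspread,
        # Whiskers end at the last observations that lie within the inner fences.
        "whiskers": (inside[0], inside[-1]),
        # Between the inner and outer fences (drawn as asterisks in figure 3.4).
        "outside": [v for v in data
                    if outer[0] <= v < inner[0] or inner[1] < v <= outer[1]],
        # Beyond the outer fences (drawn as open circles in figure 3.4).
        "far_outside": [v for v in data if v < outer[0] or v > outer[1]],
    }


# Hypothetical observations, chosen to include one point of each outlier class.
values = [8, 10, 11, 11, 12, 13, 14, 15, 16, 18, 30, 45]
summary = box_plot_summary(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```
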
Wilkinson (1990; Haber and Wilkinson 1982) developed the fuzzygram (figure
3.4B), another alternative to the histogram. Fuzzygrams are histograms with probability
distributions superimposed on each bar. Consequently, fuzzygrams present

Figure 3.2 Histograms of the number of nodes per plant of 75 surviving Ailanthus altissima
individuals grown in a 5 × 20 plant rectangular array. Each bar represents the frequency
or count (right axis) of observations within the bounds indicated by the ticks on the
x-axis, and the proportion of the total sample (left axis) represented by each bar. The three
plots illustrate the variation in histogram presentation obtained by changing the bin width:
(A) default (bin width = 4); (B) bin width = 2; (C) bin width = 8. At the top of the figure,
a box plot (see figure 3.4 for construction details) illustrates summary statistics and is a
better indication of the true data distribution.

Figure 3.3 Alternative density plots that convey more information than a histogram. (A)
A stem-and-leaf plot. In this plot, each line is a stem, and each datum on a stem is a leaf.
The label for the stem is the first digit (starting part) of the number, followed by the value
of the leaf. On the first line, the starting part is 0 and the only leaf is 8, indicating a value
of 08 nodes. On the second line, the starting part is 1, and there are four leaves, indicating
four data points: 10, 11, 11, and 11 nodes. The location of the sample median (M) and
upper and lower quartiles (H) are also marked on this plot. (B) A jittered density plot.
Each point is placed along the horizontal scale at the exact location of its value. To keep
points of equal value from overlapping, they are located at random heights above the
x-axis. (C) A dit plot. Each point indicates an individual observation, stacked along the
y-axis at its location along the x-axis. In essence, a dit plot is a stem-and-leaf plot with
symbols substituted for leaves.

Figure 3.4 Information-rich production alternatives to histograms. (A) A box-and-whisker
plot. The vertical line in the center of the box plot indicates the sample median. The left
and right vertical sides of the box indicate, respectively, the location of the 25th and 75th
percentile of the data (lower and upper quartiles, or hinges). The absolute value of the
distance between the hinges (obtained by subtracting the value of the lower quartile from
the value of the upper quartile) is the hspread. The whiskers of the box extend to the last
point occurring between each hinge and its inner fence, a distance 1.5 hspreads from the
hinge. Two kinds of outliers can be distinguished on a box plot. Points occurring between
1.5 hspreads and 3 hspreads (the outer fence) are indicated by an asterisk (see figure
3.12E). Points occurring beyond the outer fence are indicated by an open circle. The
various summary statistics are clearly seen in relation to the raw data, which are overlain
on this box plot as a symmetric dit plot. The distance encompassed by the whiskers includes
≈90% of the data (Norusis 1990). (B) A fuzzygram (Wilkinson 1990). This plot is
a standard histogram (counts and proportions of each bin indicated by the height of the
vertical line), with a probability distribution superimposed on each bar. The shading of the
bars is based on a gray-scale distribution according to the probability that the ith observation
will occur in that region: $P_i = P(p_i > \pi_i)$, where $p_i = n_i/n$ is the sample estimate of
$\pi_i$ (the expected proportion of a sample of n values from a continuous distribution to fall
in the ith bin of the histogram). The more likely that $p_i > \pi_i$, the lighter the bar. Consequently,
for large sample sizes, the bars will appear in sharp focus, whereas for small
counts, the bars will be fuzzy. See Haber and Wilkinson (1982) for a discussion of the
cognitive perception of fuzzygrams.

not only the data, but also some estimation of how realistically the data represent
the actual population distribution. Such a presentation is particularly useful in
concert with results derived from sensitivity analyses (Ellison and Bedford 1991)
or resampling methods (Efron 1982; chapters 7 and 14). Haber and Wilkinson
(1982) discuss, from a cognitive perspective, the merits of fuzzygrams and other
density plots relative to traditional histograms. Histograms (figure 3.2), stem-and-
leaf plots (figure 3.3A), dit plots (figure 3.3C), and fuzzygrams (figure 3.4B) can
indicate possible bimodality in the data. Bimodal data, observed commonly in
plant ecology, are obscured by box plots and jittered density diagrams.
Probability plots are common features of most statistical packages, and they
provide a visual estimate of whether the data fit a given distribution. The most
common probability plot is the normal probability plot (figure 3.5A). Here, the
observed values are plotted against their expected values if the data are from a
normal distribution; if the data are derived from an approximately normal distri-
bution, the points will fall along a relatively straight diagonal line. There are also
numerical statistical tests for normality (e.g., Sokal and Rohlf 1995; Zar 1996).
If, for biological reasons, the investigator believes the data come from a popula-
tion with a known distribution different from a normal one, it is similarly possible
to construct probability plots for other distribution functions (figure 3.5B).
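
The coordinates of a normal probability plot can be computed without a graphics package. The sketch below pairs each sorted observation with its expected value under a normal distribution fitted by the sample mean and standard deviation. The plotting position (i + 0.5)/n is an assumption (statistical packages use several conventions), and the function name and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_probability_points(values):
    """Pair each sorted observation with its expected value under a fitted normal."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting position (i + 0.5) / n is one common convention (an assumption here).
    expected = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, data))


# Hypothetical observations (illustrative only, not the Ailanthus data).
values = [8, 10, 11, 11, 12, 13, 14, 15, 16, 18, 21, 27]
points = normal_probability_points(values)
for expected, observed in points:
    print(f"expected {expected:6.2f}   observed {observed:3d}")
```

Plotting observed against expected values gives the diagonal display described above; points that bend away from the line at either end suggest skewness or heavy tails.
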

3.3.2 Bivariate Data: Examining Relationships Between Variables

Ecological experiments often explore relationships between two or more continu-
ous variables. Two general questions related to bivariate data can be addressed
with graphical EDA: (1) What is the general relationship between the two vari-
ables? and (2) Are there any outliers—points that disproportionately affect the
apparent relationship between the two variables? The answers to these questions
lead, in formal analyses, to investigations of the strength and significance of the
relationship (chapters 6, 9, and 10). Scatterplots and generalized smoothing rou-
tines are illustrated here for exploring and presenting bivariate data. Extensions
of these techniques to multivariate data are presented in section 3.3.3.
Bivariate data sets can be grouped into two types: (1) those for which we have
a priori knowledge about which variable ought to be considered independent,
leading us to consider formal regression models (chapters 8 and 10), and (2) those
for which such a priori knowledge is lacking, leading us to examine correlation
coefficients and subsequent a posteriori analyses. The functional response of No-
tonecta glauca, a predatory aquatic hemipteran, presented experimentally with
varying numbers of the isopod Asellus aquaticus is used to illustrate the first type
of data set; these data are described in detail in chapter 10. For the latter type of
data, I use a data set consisting of the height, diameter (diameter at breast height, dbh), and
distance to nearest neighbor of 41 trees in a 625-m2 plot within an approximately
75-year-old mixed hardwood stand in South Hadley, Massachusetts (A. M. El-
lison, unpubl. data, 1993). Data sets of this type are commonly used to construct
forestry yield tables (e.g., Tritton and Hornbeck 1982) and have been used to infer

Figure 3.5 Probability plots of the Ailanthus data. (A) A normal probability plot. (B) A
probability plot with the predicted values coming from a Weibull distribution:
$f(y) = 1 - \exp[-(y/s)^t]$, where s is a spread parameter and t is a shape parameter. In this
probability plot, the slope of the line is an estimate of 1/t, and the intercept is an estimate of
ln(s). See Gnanadesikan (1977) for a general discussion of probability plots.

competitive interactions among trees (e.g., Weller 1987) and forest successional
dynamics (e.g., Horn et al. 1989).
For both exploration and presentation, scatterplots are the most straightforward
way of displaying bivariate data (figure 3.6A). However, because scatterplots are
merely a display, they do not necessarily reveal pattern. Figure 3.6A clearly
illustrates this idea. Three functional response curves (Holling 1966; chapter 10)
could be fit to these data, but it is not clear from the scatterplot itself which curve
would best fit the data. EDA is particularly useful for dealing with these data,
which show high variability and no obvious best relationship between the two
variables.
Recent computer-intensive innovations in smoothing techniques (reviewed by
Efron and Tibshirani 1991) have expanded the palette of smoothers developed by
Tukey (1977). Basically, to construct a smoothed curve through the data, a best-
fit line is constructed through a subset of the data, local to each point along the
x-axis. This process is repeated for each point, and a smooth line is constructed
by connecting the intersections of each local regression line. The result of this
process, using LOWESS (robust LOcally WEighted regrESSion: Cleveland 1979;
Efron and Tibshirani 1991), is shown for the predator-prey data in figure 3.6B.
In this case, 50% of the data were used to construct each segment of the smoothed
curve. That is, to construct the first segment, the response data from 0 < N0 < 50
were used; to construct the second segment, the response data from 1 < N0 < 51
were used, and so forth. The apparent type III functional response observed in
the smoothed curve is supported by the formal analysis of these data (chapter 10).
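
The idea of fitting a line to a local subset of the data at each point along the x-axis can be sketched in a few lines. The code below is a simplified stand-in for LOWESS, not the full algorithm: it uses a nearest-neighbor window containing a chosen fraction of the data and tricube weights, it omits the robustness iterations of Cleveland (1979), and it evaluates each local line at its own x value. It assumes distinct x values; the function name and the data are hypothetical.

```python
import math


def local_linear_smooth(xs, ys, fraction=0.5):
    """Locally weighted linear fit evaluated at each x (no robustness iterations)."""
    n = len(xs)
    k = max(2, math.ceil(fraction * n))  # number of points in each local window
    smoothed = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        # Tricube weights: near points count most; points at or beyond h count zero.
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Hypothetical prey densities and numbers eaten (not the Notonecta data).
prey = [5, 10, 20, 30, 40, 50, 60, 70, 80, 100]
eaten = [1, 1, 3, 6, 9, 12, 13, 15, 14, 16]
fitted = local_linear_smooth(prey, eaten, fraction=0.5)
for x, y_hat in zip(prey, fitted):
    print(f"N0 = {x:3d}   smoothed response = {y_hat:5.2f}")
```

The fraction argument plays the role of the 50% window mentioned above; changing it changes the smoothness of the curve.
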
The lack of underlying assumptions about the distribution and variance of the
data and the ability to elucidate patterns in very noisy data are two advantages of
smoothing over traditional regression techniques. One disadvantage of smoothing
is that relative weighting of data used for each segment must be specified in
advance, usually with little or no rational basis for the decision. Moreover, statis-
tical comparison of different smoothed curves is virtually impossible. Most statis-
tical software packages compute a variety of smoothers (see reviews by Ellison
1992; Kardia 1998).
Smoothers are used appropriately only when there is clear a priori knowledge
of an independent variable and a corresponding dependent variable or variables.
When this is not the case, other exploratory techniques are more appropriate for
examining relationships between variables. In addition, smoothing does not pro-
vide information about potential outliers in the data set. To examine correlations
between variables and to search a posteriori for outliers, influence plots and con-
vex hulls are useful exploratory tools.
A scatterplot of the relationship between tree height and stem diameter (A. M.
Ellison, unpubl. data, 1993) is illustrated in figure 3.7A. The raw data are shown,
and there is an apparent outlier (a 30-m-tall tree with a dbh > 70 cm).
In an influence plot of these data (figure 3.7B), the size of each point becomes
directly proportional to the magnitude of the change its removal would have on
the Pearson correlation coefficient (r) between the two variables. By overlaying
a bivariate 50% confidence ellipse, it becomes obvious that outlying points have
greater influence on r than do points within the ellipse.
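
The quantity displayed in an influence plot can be approximated by a leave-one-out calculation: remove each point in turn and record how much the Pearson correlation coefficient changes. The sketch below is one way to do this, with an assumed sign convention (r without the point minus r with all points) that may differ from the convention used by a given package. The function names and the data are hypothetical and are not the tree data of figure 3.7.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def influence_on_r(xs, ys):
    """Change in r when each point is removed (r without the point minus r with all)."""
    r_all = pearson_r(xs, ys)
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) - r_all
        for i in range(len(xs))
    ]


# Hypothetical diameters (cm) and heights (m); the last tree is an apparent outlier.
dbh = [10, 15, 20, 25, 30, 35, 72]
height = [12, 15, 17, 21, 24, 26, 30]
influence = influence_on_r(dbh, height)
for d, h, change in zip(dbh, height, influence):
    print(f"dbh {d:2d}  height {h:2d}  change in r if removed: {change:+.3f}")
```

In an influence plot, the absolute value of each change would set the symbol size and its sign would set the symbol fill.
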

Figure 3.6 Scatterplots of the functional response of Notonecta to varying levels of
Asellus. (A) A simple scatterplot showing the raw data. (B) A scatterplot with a lowess
smooth fitted to the data. Note the apparent type III functional response revealed by the
smoother (see chapter 10).

In an influence plot of the logarithmically transformed data (figure 3.7C), the
apparent outliers have all but disappeared (the large outlier in figure 3.7B now
has an influence on r of only .01), and the data are better distributed for formal
analysis. Figure 3.7D supports this notion. The outer ellipse is a 95% confidence
ellipse centered on the sample (dbh and height) means, with the ellipse's major
and minor axes equal in length to the unbiased sample standard deviations of
height and dbh, respectively. The orientation of the ellipse is determined by the
sample covariance. All of the points, except the apparent outlier, fall within this
confidence ellipse. For comparison, the inner ellipse is a 95% confidence ellipse
with axes computed from the standard errors of the means of each variable and
centered on the sample centroid—a graphic illustration of the real difference
between the standard deviation and the standard error (see section 3.4).
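
The size difference between the two ellipses follows from the usual relationship between the standard deviation s and the standard error of the mean. As a rough worked example for the n = 41 trees, and assuming both ellipses are scaled by the same confidence multiplier:

```latex
s_{\bar{x}} = \frac{s}{\sqrt{n}},
\qquad
\frac{s_{\bar{x}}}{s} = \frac{1}{\sqrt{41}} \approx 0.156
```

Under that assumption, each axis of the inner ellipse is roughly one-sixth the length of the corresponding axis of the outer ellipse.
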
Convex hulls and subsequent peeled convex hulls (Barnett 1976) are useful
exploratory tools when the distribution underlying the data is not normal or not
known. Convex hulls illustrate order in bivariate or multivariate data, and they
are used to distinguish distinct groups, outliers, and general shapes of multivariate
distributions (for a detailed discussion, see Barnett 1976). Peeled convex hulls
are essentially bivariate smoothers. Figure 3.8 illustrates a convex hull and a
subsequent peel around the same data set illustrated in figure 3.7. The initial hull
(figure 3.8A) describes the boundaries of the data—it encompasses the full range
of variation in the data set. The peeled hull, referred to as "peeled to depth 2"
(figure 3.8B), includes all but the most extreme values of the data set (compare
the points outside the peeled hull of figure 3.8B to the points with strong influ-
ence on r in figure 3.7B). This process can be repeated ad infinitum, but normally
does not proceed beyond depth 3. This is analogous to Tukey's (1977) running
median (3R) smoother, extended in two dimensions. Like smoothers, convex hulls
are constructed most easily with pencil and paper, or fast, interactive computer
software (S-Plus). Convex hulls are useful for highlighting patterns within noisy
data; they make no assumptions about the underlying distribution of the data.
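
Hull peeling is straightforward to sketch in code. The example below builds the hull with Andrew's monotone chain algorithm, a standard computational method swapped in here for the centroid-based description in the caption of figure 3.8, and then peels by deleting the hull vertices and recomputing. As a simplification, points lying on a hull edge between two vertices are not deleted. The function names and the points are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def peel(points, depth):
    """Return the hull at the given depth (depth 1 is the outermost hull)."""
    remaining = list(points)
    hull = convex_hull(remaining)
    for _ in range(depth - 1):
        on_hull = set(hull)
        remaining = [p for p in remaining if p not in on_hull]
        hull = convex_hull(remaining)
    return hull


# Hypothetical (x, y) points as tuples: an outer square, an inner square, a center.
points = [(0, 0), (10, 0), (10, 10), (0, 10),
          (3, 3), (7, 3), (7, 7), (3, 7), (5, 5)]
for depth in (1, 2, 3):
    print(f"depth {depth}: {peel(points, depth)}")
```
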
Bivariate plots suitable for EDA are also suitable for final presentation. In
preparing these plots for publication, however, several conventions often ob-
served in the literature should be dropped in favor of clarity of presentation. First,
it is common in scatterplots to always start each axis at the origin (0, 0). In fact,
closely adhering to the actual range of the data when scaling axes is far more
useful and informative than always including 0, especially if the extreme value
of either variable is ≪ 0 or ≫ 0. Restricting the values on the axes to just
beyond the extreme values of the data improves clarity and highlights pattern.
Axis breaks do not always help, and changing the relative scaling after an axis
break usually hinders accurate perception of the data and can stymie future digi-
tizers.

Figure 3.7 Scatterplots of tree diameter versus tree height for 41 trees in a mixed hardwood
stand. (A) Raw data. (B) An influence plot, where the size of each point is directly propor-
tional to the magnitude of its influence on r. Shading of the points indicates the direction
of the influence (open circles have a positive influence on r, solid circles a negative influ-
ence). In this case, the putative outlier is shown as a large solid point (influence × 100 =
11). Removal of this point alone, therefore, would increase the value of r from 0.72 to
0.83. A 50% bivariate confidence ellipse is overlain on the figure. (C) An influence plot
of the data following log transformation. (D) Two different 95% confidence ellipses, the
outer constructed based on the variables' standard deviations, and the inner constructed
based on the standard errors of the means of the variables.

Figure 3.8 A convex hull (A and B, solid line) and a depth-2 peel (B, dotted line) around
the tree size data. The hull is constructed by determining which points are farthest from
the centroid of the data and by joining those points to form a polygon that envelops the
other points. To peel the hull, all the points that lie on the initial convex hull are deleted,
and a new convex hull is constructed for the remaining points.


3.3.3 Extensions of Bivariate Techniques to Multivariate Data Sets

For data sets that include a number of continuous variables, it may not be clear
which, if any, pair(s) of variables should be subjected to bivariate correlation or
regression analysis, or whether you should resort to multivariate techniques (chap-
ter 7). Three-dimensional plots (e.g., figure 3.11A) are often used to examine and
illustrate higher dimensional data. Although these graphs are aesthetically pleas-
ing and easy to produce with current graphic software, accurate interpretation and
digitizing depend on the perspective and orientation of the plot.
The scatterplot matrix, whose origins are shrouded in mystery, provides an
alternative exploratory and presentation tool for higher dimensional data. A sym-
metrical scatterplot matrix of the tree data is shown in figure 3.9. This is simply
a plot of all possible bivariate combinations of the variables in the data set. Plots
above the diagonal have x- and y-axes transposed relative to those below the di-
agonal, which frees the investigator from preconceived notions of dependent and
independent variables. We can, of course, apply the bivariate exploratory tech-
niques described previously to each of the scatterplots within the matrix. The
possible addition of density plots of each variable along the diagonal gives the
investigator a simultaneous feel for the distribution of individual variables (El-
lison and Bedford 1991). The final construction provides an information-rich, but
rapidly comprehensible, picture of the overall data set. Advanced, interactive data
exploration and visualization techniques have been extended to n-dimensional
data by Cook et al. (1995) and Buja et al. (1996).

3.3.4 Classified Quantitative Data: Alternatives to Bars and Pies

Classified quantitative data are common in many experimental situations. This
type of data set consists of responses of a given parameter to discrete treatments.
Such experiments may be analyzed by ANOVA (chapters 4 and 5), and the results
expressed in terms of the significance of treatment effects or interaction effects.
Data from these types of experiments often are not explored before the formal
analysis, although the univariate techniques described in section 3.3.1 are appro-
priate for examining the data structure of individual treatment groups. The excep-
tion to this generalization is common tests of the critical assumptions of ANOVA:
homoscedasticity (variances among treatment groups are equal) and normal distri-
bution of residuals within treatment groups. In particular, failure to test for homo-
scedasticity is one of the most common statistical errors (Fowler 1990). Hetero-
scedastic (unequal variances) data can complicate or compromise results obtained
from ANOVA (Sokal and Rohlf 1995).

Figure 3.9 A scatterplot matrix of the tree size data. This plot illustrates bivariate relation-
ships between all possible combinations of variables in a multivariate dataset. The variable
name in the boxes along the diagonal corresponds to x-axis variables of plots below the
diagonal and y-axis variables above the diagonal.

To illustrate EDA and graphical presentation of classified quantitative data, I
used two data sets from Potvin (1993, tables 4.2 and 4.3; see also
[http://www.oup-usa.org/sc/0195131878/]). In one data set, the effects of genotype (the
classifying variable) on fresh mass of Plantago major were examined. The sec-
ond set comprised data on the interaction effects of bench position and genotype
on stem dry mass of Helianthus annuus grown in a Latin square design. In each
of these data sets, there is only one response variable: plant mass. More complex
data sets include responses of several variables to multiple levels of a given treat-
ment. As an example of this latter type of data set, I use data from Ellison et al.
(1993). We measured a number of growth and morphological characteristics of
Nepsera aquatica (an herbaceous species of disturbed areas in tropical wet for-
ests) in response to varying light levels (2%, 20%, and 40% of full sunlight).

Spread (some measure of variance) versus level (mean, median) plots (Norusis
1990) are a rapid, graphic way to examine the within- and between-treatment
group variances, as well as to provide clues as to appropriate data transformations
to bring heteroscedastic data into line. Norusis (1990), modifying the technique
of Box et al. (1978), suggests plotting the natural log of the interquartile distance
(i.e., the hspread; fig. 3.4A) versus the natural log of the median for each treat-
ment group. An appropriate transformation of the data to remove dependency of
the spread on the level is then given as 1 minus the slope of the linear regression
line fit to the spread versus level plot. Figure 3.10A illustrates a spread versus
level plot for Potvin's Plantago data. Note that the raw data are not homoscedas-
tic; the variance increases with the mean. Following Norusis (1990) and Box et
al. (1978), the slope of the regression line for this plot is 1.71, suggesting that
the data be transformed by raising each observation to the -0.71 power. After
such a transformation, the spread versus level plot (figure 3.10B) illustrates that
the strict dependency of spread on level no longer exists, and the data are some-
what more suitable for ANOVA (the variances are no longer correlated with the
mean, although they are still not equalized). Plant size data are often subject to
logarithmic transformations to equalize variances within treatment groups. A log
transformation of these data is almost as good as the negative exponential trans-
formation in equalizing these variances (table 3.1). Box and Cox (1964) and Zar
(1996) provide detailed methods on determining the "best" transformation to be
used on heteroscedastic data. Such transformations may not make biological
sense, but keep in mind that the role of transformation is to bring your data in
line with the assumptions and requirements of the statistical model(s) you are
testing.
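
The spread-versus-level calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the routine used to produce figure 3.10: the function name spread_level_power is hypothetical, the quartiles returned by Python's statistics.quantiles stand in for the hinges that define the hspread (the two can differ slightly in small samples), and all observations are assumed to be positive so that logarithms exist.

```python
import math
import statistics


def spread_level_power(groups):
    """Suggest a power transformation from a spread-versus-level plot.

    groups maps a treatment label to a list of positive observations.
    Returns (slope, power), where power = 1 - slope.
    """
    xs, ys = [], []
    for values in groups.values():
        # Quartiles are used here as a stand-in for the hinges.
        q1, _, q3 = statistics.quantiles(values, n=4)
        xs.append(math.log(statistics.median(values)))  # level
        ys.append(math.log(q3 - q1))                    # spread
    # Ordinary least-squares slope of ln(spread) on ln(level).
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, 1.0 - slope
```

With the slope of 1.71 reported for the Plantago data, the suggested power is 1 - 1.71 = -0.71, as in the text. A slope near 1 gives a power near 0, which is conventionally read as a recommendation to use a log transformation.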
Graphic EDA can also be used to examine interaction effects in data. An
example is illustrated in figure 3.11 for Potvin's Helianthus data. In this experi-
ment, Potvin illustrates how position on a greenhouse bench interacts with geno-
type to determine plant mass. The top figure illustrates the relatively small size
of genotype A and the relatively large size of genotype E. Although a scatterplot
matrix might have made this pattern clearer, there is no real reason to plot row ×
column, or row × genotype, or column × genotype when the point is to illustrate
the row × column interaction effect on genotype. The lower figure, a contour plot
of the top figure, illustrates the clear "hot spot" in the upper left corner of the
bench. Because interaction effects often involve visualizing data in more than two
dimensions, you can use many of the techniques normally applied to multivariate
data in the exploration of interactions.
Classified quantitative data are presented poorly in the ecological literature.
These problems are illustrated with the data of Ellison et al. (1993) on resource
allocation and morphological responses to light by Nepsera (figure 3.12). The
most common ways of presenting classified quantitative data are bar charts, sepa-
rated or stacked (figures 3.12A,B), and pie charts (figure 3.12C). Separated bar
charts (figure 3.12A), where a single bar represents the results of a single treat-
ment, suffer from the same problems as histograms. The bars themselves use a
lot of ink—horizontal lines, vertical lines, shading of bars of arbitrary width—to
convey information about only a single point at the top of the bar (compare figure
3.12A with 3.12D). Stacked bar charts (figure 3.12B), where treatment groups are
divided into subsets and groups are compared against one another, are virtually
unintelligible and never should be used. In this example, the percent allocation to
leaves, roots, and stems sums to roughly 100% (allowing for error and missing
values). Figure 3.12A (bars side by side) at least clearly illustrates the relative
allocation to each part. It is not so simple, on the other hand, to determine the
relative allocation in figure 3.12B.

Figure 3.10 Spread versus level plots of the Plantago data. Values plotted on (A) are
ln(interquartile distance) on the y-axis versus ln(median plant mass) on the x-axis of seven
replicate individuals of each of five genotypes. Genotype number is indicated on the plot.
(B) Spread versus level plot of data following a negative exponential transformation (Nor-
usis 1990). See text and table 3.1 for further explanation.

Table 3.1 Variance (s²) of n = 7 observations per genotype of Plantago fresh massᵃ

                                          Variance
                      ----------------------------------------------------------
                                          Log              Negative exponential
Genotype    Mean      Untransformed      transformation   transformation (y^-0.71)

1           0.198     0.006              0.179            1.245
2           0.309     0.034              0.440            1.798
3           0.109     0.008              0.151            0.710
4           0.298     0.029              0.354            1.302
5           0.412     0.039              0.196            0.392

ᵃVariances are shown (1) before transformation, (2) after transformation by natural logarithms, and (3) after trans-
formation by the negative exponential suggested by the spread versus level plot (figure 3.10).

Because we use 0 as our reference point, the first guess would be that the
allocation to roots in 2% light is approximately 70% and that to stems is 100%,
when clearly this cannot be true. However, it is difficult to determine visually the
beginning point of any of the stacked segments beyond the lowest one. Although
measures of variance can be placed clearly on side-by-side bar charts, error bars
cannot be placed on stacked bar charts (see section 3.4). Shading, hatching, and
other chartjunk used in bar charts also can interfere with accurate perception of
the data and decrease the data-to-ink ratio. Pies share all of the problems of
stacked bar charts, and none of the advantages of side-by-side bar charts. I can
think of no cases in which a pie chart should be used.
There are several alternatives to bar charts and pie charts. Plots in which the
mean value of the response variable is plotted as a single point, along with some
measure of error, clearly illustrate the same data as in a bar chart with greater
clarity and less ink (figure 3.12D). Sets of box plots better illustrate the underly-
ing data structure and convey more information with less ink and confusion (fig-
ure 3.12E). These box plots have been "notched" (McGill et al. 1978) to show
95% confidence intervals. Polar category plots (with or without error bars; the
latter are shown in figure 3.12F) are the minimalist alternative to bar charts and
are a visually comparable substitute for pie charts. These polar category plots
illustrate the response of eight measured variables to the three light environments
and clearly convey overall differences between treatment groups.

Figure 3.11 Two ways of visualizing the effect of bench position and genotype on stem
dry weight of Helianthus. The top figure is a three-dimensional scatterplot, with genotype
letter (A-F) as the plotting symbol. The addition of sticks connecting each point to its
position on the x-y plane permits more accurate perception of the true height along the
z-axis of each point. The lower figure is a contour plot, with intensity of shading indicating
the biomass at a particular row × column location on the bench. These contours were
determined by a negative exponential smoothing routine, where the influence of neighbor-
ing values decreases exponentially with distance. Shading density increases with biomass.

3.4 A Word About Error Bars

Any reported parameter must include a measure of the reliability of that para-
meter, as well as the sample size. For example, sample means, whether reported
graphically or in tables, must be accompanied by the sample size and some esti-
mator of the variance. Error bars on graphs must be correctly identified. Three
kinds of error bars are seen commonly in the ecological literature: standard devia-
tions, standard errors, and n% confidence intervals. Strictly speaking, the first is
the sample standard deviation. The second, more properly referred to as the stan-
dard error of the mean, is an estimate of the accuracy of the estimate of the
mean. We compute it as the standard deviation of a distribution of means of
samples of identical sizes from the underlying population (see Zar 1996, section
6.3 for a complete description). Thus, calling error bars simply standard deviation
bars confounds the two. Measures of error are used to calculate n% confidence
intervals. We can easily compute confidence intervals of normally distributed
data from the standard error of the mean (Sokal and Rohlf 1995). For other distri-
butions, approximations of confidence intervals can be computed using boot-
straps, jackknifes, or other resampling techniques (Efron 1982; chapter 13). All
of these measures require information about sample size, which must be reported
to ensure accurate interpretation of results.
In general, error bars are useful only when they convey information about
confidence intervals. Typically, in the ecological literature, means are plotted
along with error bars illustrating 1 standard error of the mean. For suitably large
n, or for samples from a normal distribution, 1 standard error bar approximates a
68% confidence interval. This conveys little information of interest, since we are
accustomed to thinking in terms of 50%, 90%, 95%, or 99% confidence intervals.
Further, most ecological samples are small, or the underlying data distributions
are unknown. In those cases, error bars representing 1 standard error of the mean
convey no useful information at all. In keeping with the guidelines for graphical
display presented at the beginning of the chapter, I suggest that sample standard
deviations or 95% confidence intervals be the error bars of choice. Two-tiered
error bars (Cleveland 1985) that display both quantities are an excellent compro-
mise. Meta-analysis (chapter 18) requires sample standard deviations, and if re-
ported together with sample size, they permit rapid calculation of confidence
intervals, standard errors, or most other measures of variation. In the end, the
choice of error bar lies with you. It is most important that they be identified
accurately.
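
The relationships among the three kinds of error bars can be made concrete with a short sketch. The function name error_bar_summary is hypothetical, and the confidence interval uses the normal approximation; for the small samples typical of ecological studies, the critical value of Student's t with n - 1 degrees of freedom should replace the normal quantile, a step omitted here because the Python standard library does not provide it.

```python
import math
import statistics
from statistics import NormalDist


def error_bar_summary(sample, level=0.95):
    """Return n, mean, SD, SE, and a normal-approximation CI half-width."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    # Two-sided normal quantile, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {"n": n, "mean": mean, "sd": sd, "se": se, "ci_half_width": z * se}


# Coverage of a bar of +/- 1 SE when the sampling distribution is normal.
coverage_1se = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)
```

Here coverage_1se evaluates to roughly 0.68, the 68% figure quoted above. Because the function returns n and the standard deviation together, a reader can recover the standard error or a confidence interval from them, which is the argument for reporting those two quantities.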
If you transformed the data before analysis, your calculated standard deviation
will be symmetrical only with respect to the transformed mean. If you present
the results back-transformed (as is common practice), the error bars may be asym-
metric.
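
A small sketch shows why back-transformed error bars become asymmetric. It assumes a natural-log transformation and strictly positive data; the function name back_transformed_bars is hypothetical.

```python
import math
import statistics


def back_transformed_bars(sample):
    """Mean +/- 1 SD on the natural-log scale, back-transformed with exp()."""
    logs = [math.log(x) for x in sample]   # requires strictly positive data
    m = statistics.fmean(logs)
    s = statistics.stdev(logs)
    center = math.exp(m)                   # back-transformed mean (the geometric mean)
    lower = math.exp(m - s)
    upper = math.exp(m + s)
    # Return the center and the lengths of the lower and upper arms.
    return center, center - lower, upper - center
```

For the sample [1, 10, 100], the bar is symmetric on the log scale, but after back-transformation the center is 10, the lower arm has length 9, and the upper arm has length 90.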

3.5 Conclusion

Ecologists traditionally have used a limited palette of graphic elements and tech-
niques for exploring and presenting data. We must refocus our vision to grasp
new or unfamiliar graphic elements and techniques that will permit clear commu-
nication of our data. We can now use available computer hardware and software
with expanded EDA and presentation capabilities to display our results accu-
rately, concisely, and in aesthetically pleasing ways (Ellison 1992; Kardia 1998).
We can improve our comprehension and appreciation of data by using many of
the graphic techniques presented in this chapter, just as we can increase our ap-
preciation of the diversity of pasta entrees with a trip to a fine Italian restaurant.

Figure 3.12 Six alternatives for presenting classified quantitative data. Data are from an
experiment examining the effect of three different light levels (2%, 20%, and 40% of full sun)
on growth, resource allocation, and morphology of Nepsera aquatica. Each treatment consisted
of 20 individually potted plants, harvested after 6 months of growth (Ellison et al. 1993).
(A) A side-by-side bar chart illustrating percent allocation to leaves, roots, and stems by
plants in each light treatment. Height of the bar indicates mean percent allocation, and error
bars indicate 1 standard deviation of the mean. (B) A stacked bar chart illustrating the same
data. (C) Pie charts illustrating the relative resource allocation in the three light
environments (dark shading: 2% light; intermediate shading: 20% light; no shading: 40% light).
Note that it is not possible to place error bars on stacked bar charts or pie charts.
(D) Simple category plot of the data illustrated in figure 3.12A. Each point represents the
mean percent allocation to leaves (circles), roots (squares), and stems (triangles); error bars
are 1 standard deviation. (E) Notched box plots of the data. Box plot construction as in
figure 3.4A. Plots are "notched" to illustrate 95% confidence intervals. Where the box reaches
full width on either side of the median indicates the limits of the confidence interval.
(F) Polar projections of category plots (also known as star plots) of the response of eight
measured parameters to the three light treatments. The radius of the circle is equivalent to
the y-axis of a rectangular plot; the distance from the center of the circle to each vertex of
the polygon is the mean response of each variable to the treatment. Variables are arranged
equidistantly around the perimeter of the circle (equivalent to the x-axis of a rectangular
plot). One obtains a picture of the overall response of the plant to each light treatment by
constructing a polygon whose vertices are equal to the value of the response variable.
Different shapes in the different light treatments indicate overall treatment effects. For
this type of plot to be effective, all data must be similarly scaled; for this plot,
root-to-shoot ratio (g g⁻¹) was multiplied by 10², and specific leaf weight (g cm⁻²) was
multiplied by 10⁴. Leaf area (cm²) is a measure of total leaf area per plant.

Acknowledgments I am grateful to the late Deborah Rabinowitz for introducing me to
EDA and data-rich graphic techniques. Philip Dixon, Steve Juliano, and Catherine Potvin
generously shared data from their respective chapters. The data on tree size were collected
by the 1992 population ecology class at Mount Holyoke College. The work on Nepsera
was supported by NSF Grant BSR-8605106 to Julie Denslow. Technical support personnel
at Systat, Inc. (now SPSS, Inc.) and Statistical Sciences, Inc. (now MathSoft, Inc.) helped
immensely with final graphics production. Philip Dixon, Elizabeth Farnsworth, Jessica
Gurevitch, Catherine Potvin, Sam Scheiner, and one anonymous reviewer provided con-
structive reviews of early drafts of this chapter that resulted in a much-improved final
version. Hardware for graphics production was provided by the BioCIS grant from IBM
Corporation. Additional support was provided by NSF Grant BSR-9107195 and the In-
ternet.
