A Practical Guide to Data Analysis Using R: An Example-Based Approach, John H. Maindonald
A PRACTICAL GUIDE TO DATA ANALYSIS USING R
Using diverse real-world examples, this text examines what models used for data analysis
mean in a specific research context. What assumptions underlie analyses, and how can you
check them?
Building on the successful Data Analysis and Graphics Using R, third edition (Cam-
bridge, 2010), it expands upon topics including cluster analysis, exponential time series,
matching, seasonality, and resampling approaches. An extended look at p-values leads to an
exploration of replicability issues and of contexts where numerous p-values exist, including
gene expression.
Developing practical intuition, this book assists scientists in the analysis of their own
data, and familiarizes students in statistical theory with practical data analysis. The worked
examples and accompanying commentary teach readers to recognize when a method works
and, more importantly, when it doesn’t. Each chapter contains copious exercises. Selected
solutions, notes, slides, and R code are available online, with extensive references pointing
to detailed guides to R.
JOHN H. MAINDONALD
Statistics Research Associates, Wellington, New Zealand
W. JOHN BRAUN
University of British Columbia, Okanagan
JEFFREY L. ANDREWS
University of British Columbia, Okanagan
Cambridge University Press is part of Cambridge University Press & Assessment, a department
of the University of Cambridge.
We share the University’s mission to contribute to society through the pursuit of
education, learning and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781009282277
DOI: 10.1017/9781009282284
© John H. Maindonald, W. John Braun, and Jeffrey L. Andrews 2024
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
First published 2024
Printed in the United Kingdom by CPI Group Ltd, Croydon CR0 4YY
A catalogue record for this publication is available from the British Library
A Cataloging-in-Publication data record for this book is available from the Library of Congress
ISBN 978-1-009-28227-7 Hardback
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will
remain, accurate or appropriate.
For my family (Irene, Charlie, and Mia) and my parents (Dave and Marleen)
Epilogue 467
Appendix A The R System: a Brief Overview 469
A.1 Getting Started with R 469
A.2 R Data Structures 473
A.3 Functions and Operators 483
A.4 Calculations with Matrices, Arrays, Lists, and Data Frames 487
A.5 Brief Notes on R Graphics Packages and Functions 490
A.6 Plotting Characters, Symbols, Line Types, and Colors 493
References 495
References to R Packages 508
Index of R Functions 514
Index of Terms 519
1.1 (A) Dotplot and (B) boxplot displays of cuckoo egg lengths 4
1.2 (A) Boxplot with annotation, compared with (B) histogram with over-
laid density plot 12
1.3 Total lengths of possums, by sex and geographical location 13
1.4 Mortality from measles, London: (A) 1629–1939; (B) 1841–1881 14
1.5 Brain vs. body weight: (A) untransformed; (B) log transformed scales 15
1.6 Distance traveled up a 20° ramp, vs. starting point 16
1.7 Quarterly labor force numbers, by Canadian region, 1995–1996: (A)
same log scale; (B) sliced log scale 18
1.8 Alternative logarithmic scale labeling choices, labor force numbers 19
1.9 Outcomes for two different surgery types – Simpson’s paradox example 20
1.10 Boxplot showing weights (inverse sampling fractions), in the dataset
DAAG::nassCDS 23
1.11 Individual plot-level yields of kiwifruit, by season and by block 25
1.12 Different y vs. x relationships, and Pearson vs. Spearman correlation 29
1.13 Normal density plot, with associated statistical measures 34
1.14 Plots for five samples of 50 from a normal distribution 35
1.15 Quantile–quantile plots – data vs. simulated normal values 36
1.16 Simulations of the sampling distribution of the mean 38
1.17 Normal densities with t8 and t3 overlaid 40
1.18 A fitted line, as against a fitted lowess curve 44
1.19 Quantile–quantile plots – regression residuals vs. normal samples 46
1.20 Boxplots for 200 simulated p-values – one-sided one-sample t-test 52
1.21 Post-study probability (PPV) vs. pre-study odds, given power 55
1.22 Sampling distribution of difference in AIC statistics 60
1.23 Alternative Cauchy priors, and posteriors, for the sleep data 62
1.24 Change in Bayes Factor with sample size, for different p-values 64
1.25 Permutation distribution density curves 67
2.1 Female vs. male admission rates – Simpson’s paradox example 89
2.2 Second vs. first member of paired data – two examples 92
2.3 Quantile–quantile and worm plots for binomial and beta-binomial fits 97
2.4 Worm plots for Poisson and negative binomial type I fits 98
2.5 Chemical vs. magnetic measure – line vs. loess smooth 105
2.6 Weight vs. volume, for eight softback books, with regression line 107
2.7 Diagnostic plots for Figure 2.6 108
2.8 Pointwise bounds for line, and for new predicted values 110
2.9 Confidence bounds – pairwise differences vs. difference of means 111
2.10 Regression lines – y on x and x on y 112
2.11 Graphs that illustrate the use of power transformations 113
2.12 Heart weight vs. body weight, for 30 Cape fur seals 115
2.13 Graphical summary of three-fold cross-validation – house sale data 117
2.14 Plots that relate to bootstrap distributions of prediction errors 120
2.15 LSD and HSD comparisons of means for three treatments 122
2.16 Test for linear trend vs. anova test – p-value comparison 125
2.17 False-color image of two channel microarray gene expression values 126
2.18 Rice shoot dry mass data – plots that show interactions 129
2.19 Diagnostic plots – MCMCregress() Bayesian analysis 135
3.1 Weight vs. volume, for seven hardback and eight softback books 145
3.2 Diagnostic plots – lm(weight ~ 0+volume+area) 147
3.3 Scatterplot matrices for Northern Ireland hill race data 149
3.4 Variation in distance per unit time with distance 151
3.5 Diagnostic plots – lm(mph ~ log(dist)+log(gradient)) 152
3.6 Diagnostic plots – lm(logtime ~ logdist + logclimb) 153
3.7 Scatterplot matrices – log transformed oddbooks data 154
3.8 Scatterplot matrix for the DAAG::litters data 157
3.9 Termplots for regression with oddbooks data 164
3.10 Confidence intervals, compared with prediction intervals 167
3.11 Scatterplot matrix with power transformations – hurricane deaths data 169
3.12 Diagnostic plots – model for hurricane death data 170
3.13 Scatterplot matrix for hills2000 data, logarithmic scales 172
3.14 Residuals vs. fitted – least squares compared with resistant fit 173
3.15 (A) A 2D plot that shows leverages; (B) a 3D dynamic graphic plot 175
3.16 Standardized changes in regression coefficients 175
3.17 Increase in penalty term difference for unit increase in the number of
parameters p, for AIC, BIC, and AICc 177
3.18 Diagnostic plot, compared with simulated diagnostic plots 182
3.19 p-Values vs. number of variables available for selection 186
3.20 Scatterplot matrix for Coxite data 187
3.21 Observed porosities, and fitted values with 95 percent confidence bounds 188
3.22 Change in regression line as error in x changes 192
3.23 Apparent differences between groups, resulting from errors in x 194
3.24 Does preoperative baclofen reduce pain – Simpson’s paradox example? 196
3.25 Added variable plots (a termplot variant) 198
3.26 Residuals vs. fitted values, for each of the three regressions 199
4.1 Weights of extracted sugar – wild-type plant vs. other types 209
4.2 Apple taste scores – panelist and product effects 215
4.3 Plots relate to alternative models fitted to the leaftemp data 219
4.4 Diagnostic plots for the parallel line model – leaftemp data 219
4.5 Number of grains per head vs. barley seeding rate 221
4.6 Line vs. quadratic curve, and residual plots, for barley seeding rate
data 223
4.7 Resistance vs. apparent juice content for kiwifruit slabs 226
4.8 Thin plate spline basis curves, and contributions to fitted curve 228
9.19 (A) Mean–variance relationship for cancer gene expression data; (B)
use MDS to locate samples in 2D space 437
9.20 Different accuracy measures, in the development of a discriminant rule 440
9.21 How effective is linear discriminant in distinguishing known groups? 442
9.22 Overlaid density plots – treatment groups and experimental controls 447
9.23 Are observations for which re74 is available detectably different? 448
9.24 Random forest propensity scores – treated vs. controls? 451
9.25 Propensity scores for treatment and control groups after matching 454
9.26 (A) “Love plot”; (B) treatment/control differences for matched items 454
9.27 Love plots for different numbers (5,6) of cutpoints 456
9.28 Term plots for checking GAM model with straight line terms 459
9.29 Means of overimputations (solid points), with confidence bounds 461
A.1 Worldwide annual totals of CO2 emissions – 1900, 1920, ..., 2020 471
A.2 Fonts, symbols, and line types 493
This text is designed as an aid, for learning and for reference, in the navigation
of a world in which unprecedented new data sources, and tools for data analysis,
are pervasive. It aims to teach, using real-world examples, a style of analysis and
critique that, given meaningful data, can generate defensible analysis results. Its
focus is on ideas and concepts, with extensive use of graphical presentation. It may
be used to give students who have taken courses in statistical theory exposure to
practical data analysis. It is designed, also, as a resource for scientists who wish
to do statistical analyses on their own data, preferably with reference as necessary
to professional statistical advice. It emphasizes the role of statistical design and
analysis as part of the wider scientific process.
As far as possible, our account of statistical methodology comes from the coalface,
where the quirks of real data must be faced and addressed. Experience in
consulting with researchers in many different areas of application, in supervising
research students, and in lecturing to researchers, has been a strong influence on
the text's style and content. We comment extensively on analysis results, noting
inferences that seem well founded, and noting limitations on inferences that can be
drawn. We emphasize the use of graphs for gaining insight into data – in advance
of any formal analysis, for understanding the analysis, and for presenting analysis
results. The project has been a tremendous learning experience for all three of us.
As is usual, the more we learn, the more we appreciate how much more we have to
learn.
The text is suitable for a style of learning where readers work through the text
with a computer at their side, running the R code as and when this seems helpful.
It complements more mathematically oriented accounts of statistical methodology.
The appendix provides a brief account of R, primarily as a starting point for learn-
ing. We encourage readers with limited R experience to avail themselves of the
wealth of instructional material on the web as well as the hardcopy resources listed
in Section 1.11.
While no prior knowledge of specific statistical methods or theory is assumed,
readers will need to bring with them, or quickly acquire, a modest level of statis-
tical sophistication. Prior experience with real data, prior exposure to statistical
methodology, and some prior familiarity with regression methods, will all be helpful.
... Statistics is a science ... and it is no more a branch of mathematics than are physics,
chemistry and economics; for if its methods fail the test of experience – not the test of
logic – they are discarded.
[Tukey (1953), quoted by Brillinger (2002)]
The methods that we cover have wide application. The datasets, many of which
have featured in published papers, are drawn from many different fields. They reflect
a journey in learning and understanding, alike for the authors and for those with
whom they have worked, that has ranged widely over many different research areas.
We hope that our text will stimulate the cross-fertilization that occurs when ideas
and applications that have proved effective in one area find use elsewhere, perhaps
even leading to new lines of investigation.
To summarize: The strengths of this book include the directness of its encounter
with research data, its advice on practical data analysis issues, careful critiques
of analysis results, the use of modern data analysis tools and approaches, the use
of simulation and other computer-intensive methods where these provide insight
or give results that are not otherwise available, attention to graphical and other
presentation issues, the use of examples drawn from across the range of statistical
applications, the links that it makes into the debate over reproducibility in science,
and the inclusion of code that reproduces analyses.
A substantial part of the first edition of Data Analysis and Graphics Using R
(Maindonald and Braun, 2003) was derived, initially, from the lecture notes of
courses for researchers that the first author presented, at the University of New-
castle (Australia) over 1996–1997 and at Australian National University from 1998,
through until formal retirement and beyond. It was a privilege to have contacts,
arising from consulting work and lectures, across the University. Those contacts
were extended as a result of short courses on R-based analysis that were offered,
1 For an overview of the theory of statistical inference, see, for example, Cox (2006).
encounter R. It is finding its way into the upper levels of secondary schools. While
this is to be encouraged, students do need to understand that such courses are at
the start of an adventure in statistical understanding. There is no good substitute
for professional training in modern tools for data analysis, and experience in using
those tools with a wide range of datasets. No one should be embarrassed that they
have difficulty with analyses that involve ideas that professional statisticians may
take seven or eight years of training and experience to master.
The questions that data analysis is designed to answer can often be stated simply.
This may encourage the layperson, or even scientists doing their own analyses, to
believe that the answers are similarly simple. Commonly, they are not. Be prepared
for unexpected subtleties. Comments made by Stephen Senn are apt:
I’ve been studying statistics for over 40 years and still don’t understand it. The ease with
which non-statisticians master it is staggering.
The R System
Work on R started in the early 1990s, as a project of Ross Ihaka and Robert Gentle-
man, when both were at the University of Auckland (New Zealand). The R system
implements a dialect of the S language, developed at AT&T by John Chambers
and colleagues. Section 1.4 in Chambers (2008) describes the history. Versions of
R are available, at no charge, for Microsoft Windows, for Linux and other Unix
systems, and for Macintosh systems. It is available through the Comprehensive R
Archive Network (CRAN). Go to http://cran.r-project.org/, and find the nearest
mirror site. A huge range of packages, contributed by specialists in many different
areas, supplement base R. The development model has proved effective in marshal-
ing high levels of computing expertise for continuing improvement, for identifying
and fixing bugs, and for responding quickly to the evolving needs and interests of
the statistical community. The R Task Views web page2 lists packages that handle
some of the more common R applications. It has become an increasing challenge to
keep pace with the new and/or improved abilities that R packages, new and old,
continue to develop. Those who rely heavily on R for their day-to-day work will do
well to keep attuned to major changes and developments.
The R system has brought into a common framework a huge range of abili-
ties that extend beyond the data analysis and associated data manipulation and
graphics abilities that are the focus of this text. Examples include drawing and
coloring maps, reading and handling shapefiles, map projections, plotting data col-
lected by balloon-borne weather instruments, creating color palettes, manipulating
bitmap images, solving sudoku puzzles, creating magic squares, solving ordinary
differential equations, and processing various types of genomic data. Help files and
2 https://cran.r-project.org/web/views/.
vignettes that are included with packages are a large reservoir of information on
the methodologies that they implement.
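For example, vignettes and help files can be listed and opened from the R command line (the package named here, base R's grid, is just an illustration):

```r
## List the vignettes that ship with an installed package,
## then show an overview of the package's help files
vignette(package="grid")   # 'grid' is part of the standard R distribution
help(package="grid")       # index of the package's documentation
```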
There are several graphical user interfaces (GUIs) that can be highly helpful in
accessing a restricted range of R abilities – examples are BlueSky, Rcmdr, R-Instat,
jamovi, and rattle. Access to the full range of abilities that R and R packages make
available will require use of the command line.
RStudio is a widely used R interactive development environment (IDE) for tasks
that include viewing history, debugging, managing the workspace, package man-
agement, and data input and output. It has features that greatly assist project
management and package development.
Among systems that have the potential to challenge R’s dominance for data
analysis, Julia (julialang.org/) seems particularly interesting. Relative to R, it
has high computational efficiency. It has the potential to develop or adapt a range
of packages that together match what R packages offer.
Acknowledgements
The prefaces to the three editions of Data Analysis and Graphics Using R give names
of those who provided helpful comment. For this new text, James Cone has provided
useful comments. Trish Scott has helped with copyediting. Discussions on the R-
help and R-devel email lists have contributed greatly to insight and understanding.
The failings that remain are, naturally, our responsibility.
This text has drawn on data from many different sources. Following the references
is a list of data sources (individuals and/or organizations) that we wish to thank and
acknowledge. Thanks are due also to the many researchers whose discussions with
us have helped stimulate thinking and understanding, and who in many instances
have given us access to their data. We apologize to anyone that we may have
inadvertently failed to acknowledge.
Too often, data that have become the basis for a published paper are not made
available in any form of public record. The data may not find their way into any
permanent record, and cease to be available for checking the analysis, for work
that builds on what can be learned when data from multiple sources are brought
together, for trying a new form of analysis, or for use in teaching. In areas where data
are as a matter of course kept available for future researchers to use, this has been
a major contributor to advances in scientific understanding. Those benefits can and
should extend more widely. Thanks are due to Beverley Lawrence for her efforts
as copy-editor, and to Cambridge University Press staff who assisted us through
the copy-editing and publication process – Roger Astley, Natalie Tomlinson, Anna
Scriven, and Clare Dennison.
Conventions
Starred headings identify more technical discussions that can be skipped at a first
reading. Item numbers for more technical and/or challenging exercises are likewise
starred.
Comments, prefaced by # or for extra emphasis by ##, will often be included in
code chunks. Where code is included in comments, it will be surrounded by back
quotes, as in `species ~ length` in the final line of code that now follows:
## Code for a stripped down version of Figure 1.1A
library(latticeExtra) # The 'lattice' package will be loaded & attached also
cuckoos <- DAAG::cuckoos
## Panel A: Dotplot without species means added
dotplot(species ~ length, data=cuckoos) ## `species ~ length` is a 'formula'
Chapter Summary
We begin by illustrating the interplay between questions driven by scientific curios-
ity and the use of data in seeking the answers to such questions. Graphs provide a
useful window through which meaning can be extracted from data. Numeric sum-
mary statistics and probability distributions provide a form of quantitative scaf-
folding for models of random as well as nonrandom variation. Simple regression
models foreshadow the issues that arise in the more complex models considered
later in the book. Frequentist and Bayesian approaches to statistical inference are
touched upon, the latter primarily using the Bayes Factor as a summary statistic
which moves beyond the limited perspective that p-values offer. Resampling meth-
ods, where the one available dataset is used to provide an empirical substitute for
a theoretical distribution, are also introduced. Remaining topics are of a more gen-
eral nature. Section 1.9 will discuss the use of RStudio and other such tools for
organizing and managing work. Section 1.10 will include a discussion on the impor-
tant perspective that replication studies provide, for experimental studies, on the
interplay between statistical analysis and scientific practice. The checks provided
by independent replication at another time and place are an indispensable comple-
ment to statistical analysis. Chapter 2 will extend the discussion of this chapter to
consider a wider class of models, methods, and model diagnostics.
Suppose, for example, that names on an electoral roll are numbered from 1 to
9384. The following uses the function sample() to obtain a random sample of 12
individuals:
## For the sequence below, precede with set.seed(3676)
sample(1:9384, 12, replace=FALSE) # NB: `replace=FALSE` is the default
[1] 2263 9264 4490 8441 1868 3073 5430 19 1305 2908 5947 915
The numbers are the numerical labels for the 12 individuals who are included in the
sample. The task is then to find them! The option replace=FALSE gives a without
replacement sample, that is, it ensures that no one is included more than once.
A more realistic example might be the selection of 1200 individuals, perhaps for
purposes of conducting an opinion poll, from names numbered 1 to 19,384, on an
electoral roll. Suitable code is:
chosen1200 <- sample(1:19384, 1200, replace=FALSE)
The following randomly assigns 10 plants (labeled from 1 to 10, inclusive) to one
of two equal-sized groups, control and treatment:
## For the sequence below, precede with set.seed(366)
split(sample(1:10), rep(c("Control","Treatment"), 5))
$Control
[1] 5 7 1 10 4
$Treatment
[1] 8 6 3 2 9
Cluster Sampling
Cluster sampling is one of many probability-based variants on simple random sam-
pling. See Barnett (2002). The function sample() can be used as before, but now
the numbers from which a selection is made correspond to clusters. For example,
households or localities may be selected, with multiple individuals from each. Stan-
dard inferential methods then require adaptation to account for the fact that it is
the clusters that are independent, not the individuals within the clusters. Donner
and Klar (2000) describe methods that are designed for use in health research.
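As a minimal base-R sketch (the household labels and sizes below are invented for illustration, not taken from any dataset in the text), one can sample clusters first and then include every individual from the chosen clusters:

```r
## Hypothetical cluster sample: 120 households, each with 1 to 6 residents
set.seed(2024)
size <- sample(1:6, 120, replace=TRUE)   # residents per household
chosen <- sample(1:120, 10)              # select 10 households (clusters)
## All residents of each chosen household enter the sample
sum(size[chosen])                        # resulting number of individuals
```

Note that the default `replace=FALSE` ensures no household is selected twice; the inferential unit is then the household, not the individual.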
[Figure 1.1 display: Panel A (dotplot) and Panel B (boxplot) of egg lengths,
with species (hedge sparrow, meadow pipit, pied wagtail, robin, tree pipit,
wren) on the y-axis and length (20 to 25 mm) on the x-axis.]
Figure 1.1 Dotplot (Panel A) and boxplot (Panel B) displays of cuckoo egg
lengths. In Panel A, points that overlap have a more intense color. Means are
shown as +. The boxes in Panel B take in the central 50 percent of the data, from
25 percent of the way through the data to 75 percent of the way through. The
dot marks the median. Data are from Latter (1902).
repeatedly. The distribution that results can be an empirical substitute for the use
of a theoretical distribution as a basis for inference.
We can randomly sample from the set {1, 2, . . . , 10}, allowing repeats, thus:
sample(1:10, replace=TRUE)
[1] 1 3 7 5 5 10 3 3 2 9
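A short sketch of this idea, applied to the cuckoo egg lengths (this assumes the DAAG package is installed):

```r
## Bootstrap distribution of the mean egg length
lengths <- DAAG::cuckoos$length
boot.means <- replicate(2000, mean(sample(lengths, replace=TRUE)))
quantile(boot.means, c(0.025, 0.975))   # a simple percentile interval
```

Each call to `sample(lengths, replace=TRUE)` draws a resample of the same size as the data; the 2000 resampled means then approximate the sampling distribution of the mean.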
Table 1.1 Mean lengths of cuckoo eggs, compared with mean lengths of eggs laid by
the host bird species. The table combines information from the two DAAG data
frames cuckoos and cuckoohosts.
Host species     Meadow pipit  Hedge sparrow  Robin      Wagtails   Tree pipit  Wren       Yellow hammer
Length (cuckoo)  22.3 (45)     23.1 (14)      22.5 (16)  22.6 (26)  23.1 (15)   21.1 (15)  22.6 (9)
Length (host)    19.7 (74)     20.0 (26)      20.2 (57)  19.9 (16)  20 (27)     17.7 (-)   21.6 (32)
(Numbers in parentheses are numbers of eggs)
display of the raw data. Panel B is the more summary boxplot form of display (to
be discussed further in Section 1.1.5) that is designed to give a rough indication of
how variation between groups compares with variation within groups. 1
Table 1.1 adds information that suggests a relationship between the size of the
host bird’s eggs and the size of the cuckoo eggs that were laid in that nest. Observe
that apart from several outlying egg lengths in the meadow pipit nests, the length
variability within each host species’ nest is fairly uniform.
In the paper (Latter, 1902) that supplied the cuckoo egg data of Figure 1.1 and
Table 1.1, the interest was in whether cuckoos do in fact match the eggs that they
lay to the host eggs, and if so, in assessing which features match and to what extent.
Uniquely among the birds listed, the architecture of wren nests makes it impossi-
ble for the host birds to see the cuckoo’s eggs, and the cuckoo’s eggs do not match
the wren’s eggs in color. For the other species the color does mostly match. Latter
concluded that the claim in Newton and Gadow (1896) is correct, that the eggs
that cuckoos lay tend to match the eggs of the host bird in ways that will make it
difficult for hosts to distinguish their own eggs from the cuckoo eggs.
Issues with the data in Table 1.1 and Figure 1.1 are as follows.
• The cuckoo eggs and the host eggs are from different nests, collected over the
course of several investigations. Data on the host eggs are from various sources.
• The host egg lengths for the wren are indicative lengths, from Gordon (1894).
There is thus a risk of biases, different for the different sources of data, that limit
the inferences that can be drawn. How large, then, relative to statistical variation,
is the difference between wrens and other species? Would it require an implausibly
large bias to explain the difference? A more formal comparison between lengths for
the different species based on an appropriate statistical model will be a useful aid
to informed judgment.
Stripped down code for Figure 1.1 is:
library(latticeExtra) # Lattice package will be loaded and attached also
cuckoos <- DAAG::cuckoos
## Panel A: Dotplot without species means added
dotplot(species ~ length, data=cuckoos) ## `species ~ length` is a 'formula'
## Panel B: Box and whisker plot
bwplot(species ~ length, data=cuckoos)
## The following shows Panel A, including species means & other tweaks
av <- with(cuckoos, aggregate(length, list(species=species), FUN=mean))
1 Subsection A.5.1 has the code that combines the two panels, for display as one graph.
1.1.2, special care is required to ensure that hidden biases induced by the method
of data collection do not lead to incorrect conclusions. Biases are likely when data
are obtained from “convenience” samples that have the appearance of surveys but
which are really poorly designed observational studies. Online voluntary surveys
are of this type. Similar biases can arise in experimental studies if care is not taken.
For example, an agricultural experimenter may pick one plant from each of several
parts of a plot. If the choice is not made according to an appropriate randomization
mechanism, a preference bias can easily be introduced.
Nonresponse, so that responses are missing for some respondents, is endemic in
most types of sample survey data. Or responses may be incomplete, with answers
not provided to some questions. Dietary studies based on the self-reports of partic-
ipants are prone to measurement error biases. With experimental data on crop or
fruit yields, results may be missing for some plots because of natural disturbances
caused by animals or harsh weather. One ignores the issue at a certain risk, but
treating the problem is nontrivial, and the analyst is advised to determine as well
as possible the nature of the missingness. It can be tempting simply to replace
a missing height value for a male adult in a dataset by the average of the other
male heights. Such a single imputation strategy will readily create unwanted bi-
ases. Males that are of smaller than average weight and chest measurement are
likely to be of smaller than average height. Multiple imputation is a generic name
for methodologies that, by matching incomplete observations as closely as possible
to other observations on the variables for which values are available, aim to fill in
the gaps.
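As an illustrative sketch only (the mice package is one widely used implementation of multiple imputation; it is not otherwise used in this text, and is assumed installed here), such an analysis typically imputes several completed datasets, fits the model to each, and pools the results:

```r
## Sketch of a multiple-imputation workflow with the 'mice' package;
## nhanes is a small example dataset with missing values shipped with mice
library(mice)
imp <- mice(nhanes, m=5, printFlag=FALSE, seed=101)  # 5 imputed datasets
fits <- with(imp, lm(bmi ~ age))                     # fit model to each
summary(pool(fits))                                  # pooled estimates
```

Pooling combines the five sets of estimates in a way that reflects both within-imputation and between-imputation variability, which single imputation cannot do.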
Causal Inference
With data from carefully designed experiments, it is often possible to infer causal
relationships. Perhaps the most serious danger is that the results will be generalized
beyond the limits imposed by the experimental conditions.
Observational data, or data from experiments where there have been failures
in design or execution, are another matter. Correlations do not directly indicate
causation. A and B may be correlated because A drives B, or because B drives A,
or because A and B change together, in concert with a third variable. For inferring
causation, other sources of evidence and understanding must come into play.
and sounds perhaps) and field trips? Answers to other questions included in the
survey shed some limited light.
In the socsupport dataset, an important variable is the Beck Depression Inven-
tory or BDI, which is based on a 21-question multiple-choice self-report. It is the
outcome of a rigorous process of development and testing. Since its first publication
in 1961, it has been extensively used, critiqued, and modified. Its results have been
well validated, at least for populations on which it has been tested. It has become
a standard psychological measure of depression (see, e.g., Streiner et al., 2014).
For therapies that are designed to prolong life, what is the relevant measure? Is
it survival time from diagnosis? Or is a measure that takes account of quality of
life over that time more appropriate? Two such measures are “Disability Adjusted
Life Years” (DALYs) and “Quality Adjusted Life Years” (QALYs). Quality of life
may differ greatly between the therapies that are compared.
the manner of use of results. (If, for example, predictions are made that will be
applied a year into the future, check how predictions made a year ahead panned
out for historical data.)
• For experimental data, have the work replicated independently by another re-
search group, from generation of data through to analysis.
In areas where the nature of the work requires cooperation between scientists
with a wide range of skills, and where data are shared, researchers provide checks
on each other. For important aspects of the work, the most effective critiques are
likely to come from fellow researchers rather than from referees who are inevitably
more remote from the details of what has been done. Failures of scientific processes
are a greater risk where scientists work as individuals or in small groups with limited
outside checks.
There are commonalities with the issues of legal and medical decision making that
receive extensive attention in Kahneman et al. (2021, p. 372), on the benefits of
“averaging,” that is, using the perspectives of multiple judges as a basis for decision
making when sentencing; the authors comment:
The advantage of averaging is further enhanced when judges have diverse skills and com-
plementary judgment patterns.
Graphical Comparisons
Figure 1.1 was a graphical comparison between the lengths of cuckoo eggs that had
been laid in the nests of different host species. The boxes that give boxplots their
name focus attention on quartiles of the data, that is, the three points on the axis
that split the data into four equal parts. The lower end of the box marks the first
quartile, the dot marks the median, and the upper end of the box marks the third
quartile. Points that lie out beyond the “whiskers” are plotted individually, and are
candidates to be considered outliers. The widths of the boxes will of course vary
randomly, leading in some cases to the flagging of points that should not be treated
as extreme. The narrow box may largely account for the five values that are flagged
for meadow pipit.
Figure 1.1 strongly suggested that eggs planted in wrens’ nests were substantially
smaller than eggs planted in other birds’ nests. The upper quartile (75 percent
point) for eggs in wrens’ nests lies below all the lower quartiles for other eggs.
The model postulates that the length of a cuckoo egg found in a given nest de-
pends in some way on the host species. There are likely to be additional factors
that have not been observed but which also influence the egg length. The variation
due to these unobserved factors is aggregated into one term which is referred to
as statistical error or random variation. Where none of these observed factors pre-
dominates and their effects add, a normal distribution will often be effective as a
model for the random variation.
The species means are estimated from the data and are called fitted values. The
differences between the data values and those means are called residuals. For
example, suppose ℓi is the length of the ith egg in the nest of a wren, and ℓ̄ is the
average length of all eggs in the wrens' nests. Then the ith residual for this group is
ei = ℓi − ℓ̄.
The scale() function provides a convenient way to calculate such residuals; its
usage below centers the data by subtracting the average from each data point.
Thus, the residuals for the wren length model are:
with(DAAG::cuckoos, scale(length[species=="wren"], scale=FALSE))[,1]
[1] -1.32 0.98 0.38 -0.22 0.88 -0.12 1.18 -0.12 -0.82 -0.22 0.88
[12] -1.12 -0.32 0.08 -0.12
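The same residuals fall out of an explicit model fit. As a sketch (using a small hypothetical dataset in place of the cuckoos data, so that it runs stand-alone), lm() with a factor on the right-hand side fits a separate mean for each species:

```r
## Hypothetical egg lengths for two species
eggs <- data.frame(
  species = rep(c("wren", "robin"), each = 3),
  length  = c(21.0, 22.2, 21.3, 22.5, 23.1, 22.8))
fit <- lm(length ~ species, data = eggs)
## Fitted values are the within-species means ...
fitted(fit)[eggs$species == "wren"]   # each equals mean(c(21.0, 22.2, 21.3))
## ... and residuals are the data values minus those means
resid(fit)[eggs$species == "wren"]
```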
Is the variability different for different species? The boxes in Figure 1.1, with
endpoints set for each species to contain the central 50 percent of the data, hint
that variation may be greater for the pied wagtail than for other species. (The box
widths equal the inter-quartile range, or IQR. See further, Subsection 1.3.4.)
[Figure 1.2 appears here. Panel A annotations mark the smallest value
(outliers excepted), lower quartile, median, upper quartile, largest value
(no outliers), and a possible outlier. Panel B shows density on the left
axis and histogram frequency on the right axis, over the range 75 to 95.]
Figure 1.2 Panel A shows a boxplot, with annotation that explains boxplot fea-
tures. Panel B shows a density plot, with a histogram overlaid. Histogram fre-
quencies are shown on the right axis of Panel B. In both panels, the individual
data points appear as a “rug” along the lower side of the bounding box. Where
necessary, they have been moved slightly apart to avoid overlap.
One data point lies outside the boxplot “whiskers” to the left, and is flagged
as a possible outlier. An outlier is a point that is determined to be far from the
main body of the data. Under the default criterion, about 1 percent of normally
distributed data would be judged as outlying.
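The default criterion flags points that lie more than 1.5 times the inter-quartile range beyond the quartiles. A quick simulation (a sketch; the sample size is an arbitrary choice) confirms the rough outlier rate for normal data:

```r
set.seed(123)
x <- rnorm(100000)
## boxplot.stats() applies the default 1.5 x IQR rule
n_out <- length(boxplot.stats(x)$out)
n_out / length(x)   # roughly 0.007, i.e. a little under 1 percent
```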
A histogram is a crude form of density estimate. A smooth density estimate is,
often, a better alternative. The height of the density curve at any point is an esti-
mate of the proportion of sample values per unit interval, locally at that point. Both
histograms and density curves involve an element of subjective choice. Histograms
require the choice of breakpoints, while density estimates require the choice of a
bandwidth parameter that controls the amount of smoothing. In both cases, the
software has default choices that should be used with care.
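With base R's density(), the default bandwidth can be inspected, and the adjust argument scales it. A sketch with hypothetical data:

```r
set.seed(7)
x <- c(rnorm(60, 85, 2), rnorm(15, 92, 1.5))  # hypothetical measurements
d_default <- density(x)               # default bandwidth rule (bw.nrd0)
d_smooth  <- density(x, adjust = 2)   # twice as much smoothing
c(default = d_default$bw, doubled = d_smooth$bw)
```

Too small a bandwidth chases noise; too large a bandwidth smooths away genuine structure, such as the second mode here.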
Code for a slightly simplified version of Figure 1.2B is:
## Requires the lattice and latticeExtra packages
fossum <- subset(DAAG::possum, sex=="f")
densityplot(~totlngth, plot.points=TRUE, pch="|", data=fossum) +
  latticeExtra::layer_(panel.histogram(x, type="density",
                                       breaks=c(75,80,85,90,95,100)))
Figure 1.3 Total lengths of possums, by sex and (within panels) by geographical
location (Victorian or other).
Univariate summaries can be broken down by one or more factors between and/or
within panels. Figure 1.3 overlays dotplots on boxplots of the distributions of Aus-
tralian possum lengths, broken down by sex and (within panels) by geographical
region (Victoria or other).
## Create boxplot graph object --- Simplified code
gph <- bwplot(Pop~totlngth | sex, data=DAAG::possum)
## plot graph, with dotplot distribution of points below boxplots
gph + latticeExtra::layer(panel.dotplot(x, unclass(y)-0.4))
The normal distribution is not necessarily the appropriate reference. Points may
be identified as outliers because the distribution is skew (usually, with a tail to the
right). Any needed action will depend on the context, requiring the user to exercise
good judgement. Subsection 1.2.8 will comment in more detail.
Figure 1.4 The two panels provide different insights into data on mortality from
measles, in London over 1629–1939. Panel A uses a logarithmic scale to show
the numbers of deaths from measles in London for the period from 1629 through
1939 (black curve). The black dots show, for the period 1800 to 1939, the London
population in thousands. Panel B shows, on the linear scale (black curve), the
subset of the measles data for the period 1840 through 1882 together with the
London population (in thousands, black dots).
Panel A uses a logarithmic vertical scale while Panel B uses a linear scale and takes
advantage of the fact that annual deaths from measles were of the order of one in
500 of the population. Thus, deaths in thousands and population in half millions
can be shown on the same scale.
Panel A shows broad trends over time, but is of no use for identifying changes
on the time-scale of a year or two. In Panel B, the lines that show such changes
are, mostly, at an angle that is in the approximate range of 20◦ to 70◦ from the
horizontal. A sawtooth pattern is evident, indicating that years in which there were
many deaths were often followed by years in which there were fewer deaths. To
obtain this level of detail for the whole period from 1629 until 1939, multiple panels
would be necessary.
Figure 1.5 Brain weight versus body weight, for 28 animals that vary greatly in
size. Panel A has untransformed scales, while Panel B has logarithmic scales, on
both axes.
For details of the data, and commentary, see Guy (1882), Stocks (1942), and
Senn (2003) where interest was in the comparison with smallpox mortality. The
population estimates (londonpop) are from Mitchell (1988).
marks are separated by an amount that, when translated back from log(weight)
to weight, corresponds to a change by a factor of 100. The argument aspect="iso" has ensured that
these correspond to the same physical distance on both axes of the graph. Code is:
## Untransformed vs log transformed scales
Animals <- MASS::Animals
asp <- with(Animals, sapply(list(log(brain/100), log(body/100)),
function(x)diff(range(x)))) |> (\(d)d[1]/d[2])()
xlab <- "Body weight (unit=100kg)"; ylab <- "Brain (unit=100g)"
gphA <- xyplot(I(brain/100) ~ I(body/100), data=Animals, aspect=asp,
               xlab=xlab, ylab=ylab)
gphB <- xyplot(log(brain/100) ~ log(body/100), data=MASS::Animals, # Panel B
               aspect='iso', xlab=xlab, ylab=ylab)
labx <- 10^((-3):3); laby <- 10^((-2):2)
gphB <- update(gphB, scales=list(x=list(at=log(labx), labels=labx, rot=20),
               y=list(at=log(laby), labels=laby)))
For these data, the physics suggests the likely form of response. Where no such
help is available, careful examination of the graph, followed by systematic examina-
tion of plausible forms of response, may suggest a suitable form of response curve.
With a logarithmic scale, as in Figure 1.7A, similar changes on the scale corre-
spond to similar proportional changes. The regions have been taken in order of the
number of workers in December 1996 (or, in fact, at any other time). This ensures
that the order of the labels in the key matches the positioning of the points for the
different regions. Code that has been used to create and update the graphics object
basicGphA, then updating it to obtain the labeling on the x- and y-axes is:
## Panel A: Basic plot; all series in a single panel; use log y-scale
formRegions <- Ontario+Quebec+BC+Alberta+Prairies+Atlantic ~ Date
basicGphA <-
  xyplot(formRegions, outer=FALSE, data=DAAG::jobs, type="l", xlab="",
         ylab="Number of workers", scales=list(y=list(log="e")),
         auto.key=list(space="right", lines=TRUE, points=FALSE))
## `outer=FALSE`: plot all columns in one panel
## Create improved x- and y-axis tick labels
datelabpos <- seq(from=95, by=0.5, length=5)
datelabs <- format(seq(from=as.Date("1Jan1995", format="%d%b%Y"),
                       by="6 month", length=5), "%b%y")
## y-axis labels show the numbers, with log values underneath
## (this construction of ylabelsA is one possibility; the original is not shown)
ylabposA <- exp(pretty(log(unlist(DAAG::jobs[,-7])), 5))
ylabelsA <- paste0(round(ylabposA), "\n(", round(log(ylabposA), 2), ")")
gphA <- update(basicGphA, scales=list(x=list(at=datelabpos, labels=datelabs),
               y=list(at=ylabposA, labels=ylabelsA)))
Because the labor forces in the various regions do not have similar sizes, it is
impossible to discern any differences among the regions from this plot. Plotting
on the logarithmic scale was not enough on its own. Figure 1.7B, where the six
different panels use different slices of the same logarithmic scale, is an informative
alternative. Simplified code is:
## Panel B: Separate panels (`outer=TRUE`); sliced log scale
## (The remainder of this call is missing from the source; the
## `relation="sliced"` setting shown here is one way to obtain
## separate slices of the same logarithmic scale.)
basicGphB <-
  xyplot(formRegions, outer=TRUE, data=DAAG::jobs, type="l", xlab="",
         ylab="Number of workers",
         scales=list(y=list(log="e", relation="sliced")))
Figure 1.7 Data are labor force numbers (thousands) for various regions of
Canada, at quarterly intervals over 1995–1996. Panel A uses the same logarith-
mic y-scale for all regions. Panel B shows the same data, but now with separate
(“sliced”) logarithmic y-scales on which the same percentage increase, for exam-
ple, by 1 percent, corresponds to the same distance on the scale, for all plots.
Distances between ticks are 0.02 on the log_e scale, that is, a change of close
to 2 percent.
Figure 1.8 Labeling of the values for Alberta (1366, 1436) and BC (1752, 1840),
with alternative logarithmic scale choices.
Between the beginning of 1995 and the end of 1996, the increase of 70 in Alberta,
from 1366 to 1436, is by a factor of 1436/1366 ≃ 1.051. For BC, an increase by 88,
from 1752 to 1840, is by a factor of 1.050. The proper comparison is not between
the absolute increases, but between the very nearly identical multipliers of 1.051
and 1.050.
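The near-equality of the multipliers is immediate on the log scale, where equal distances correspond to equal proportional changes (numbers as given in the text):

```r
alberta <- c(1366, 1436); bc <- c(1752, 1840)
## Multiplicative increases over the period
c(Alberta = alberta[2]/alberta[1], BC = bc[2]/bc[1])
## Equivalently, the distances on the natural log scale are nearly equal
c(Alberta = diff(log(alberta)), BC = diff(log(bc)))
```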
Even better than using a logarithmic y-scale, particularly if ready comprehen-
sion is important, would be to standardize the labor force numbers by dividing,
for example, by the respective number of persons aged 15 years and over at that
time. Scales would then be directly comparable. (The plot method for time se-
ries could then suitably be used to plot the data as a multivariate time series. See
?plot.ts.)
Figure 1.9 Outcomes are for two different types of surgery for kidney stones.
The overall (apparent) success rates (78 percent for open surgery as against 83
percent for ultrasound) favor ultrasound. The success rate for each size of stone
separately favors, in each case, open surgery.
If we consider small stones and large stones separately, it appears that open surgery
is more successful than ultrasound. The blue vertical bar in Figure 1.9 is in each case
to the right of the corresponding red vertical bar. The overall counts, which favor
ultrasound, are thus misleading. For open surgery, the larger number of operations
for large stones (263 large, 87 small) weights the overall success rate towards the low
overall success rate for large stones. For ultrasound surgery (red bars), the weighting
(80 large, 280 small) is towards the high success rate for small stones. This is an
example of the phenomenon called the Simpson or Yule–Simpson paradox. (See also
Subsection 2.1.2.)
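The reversal can be reproduced from the group sizes given above, together with hypothetical within-size success rates chosen so that open surgery is better for each size separately (the rates here are illustrative, not taken from the source):

```r
## Hypothetical within-size success rates (open surgery better in each)
rateOpen  <- c(small = 0.93, large = 0.73)
rateUltra <- c(small = 0.87, large = 0.69)
## Numbers of operations, as given in the text
nOpen  <- c(small = 87,  large = 263)
nUltra <- c(small = 280, large = 80)
## Overall rates are weighted averages of the within-size rates
overallOpen  <- sum(rateOpen  * nOpen)  / sum(nOpen)
overallUltra <- sum(rateUltra * nUltra) / sum(nUltra)
round(c(open = overallOpen, ultrasound = overallUltra), 2)
```

Open surgery wins within each size, yet its overall rate is dragged down by its heavy weighting towards large stones, so the overall comparison reverses.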
Note that without additional information, the results are not interpretable from
a medical standpoint. Different surgeons will have preferred different surgery types,
and the prior condition of patients will have affected the choice of surgery type.
The consequences of unsuccessful surgery may have been less serious for ultrasound
than for open surgery.
The table stones, shown to the right of Figure 1.9, has three margins – Success,
Method, and Size. The table margin that results from adding over Size retains
the first two of these.
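A sketch of the margin computation (the table construction here is hypothetical; the counts are chosen to be consistent with the group sizes and overall rates quoted above):

```r
## Three-way table with margins Success, Method, and Size
stones <- array(c(81, 6, 244, 36,      # small stones: yes/no by method
                  192, 71, 55, 25),    # large stones: yes/no by method
                dim = c(2, 2, 2),
                dimnames = list(Success = c("yes", "no"),
                                Method  = c("open", "ultrasound"),
                                Size    = c("small", "large")))
## Add over the Size margin, retaining Success and Method
margin.table(stones, margin = 1:2)
```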
Mosaic plots are an alternative type of display that can be obtained using either
mosaicplot() from base graphics or vcd::mosaic(). Figure 1.9 makes the point
of interest for the kidney stone surgery data more simply and directly.
Outliers
Outliers are points that appear, or are judged, isolated from the main body of the
data. Such points, whether errors or genuine values, can indicate departure from
model assumptions, and may distort any model that is fitted.
Boxplots, and the normal quantile–quantile plot that will be discussed in Sub-
section 1.4.3, are useful for highlighting outliers in one dimension. Scatterplots may
highlight outliers in two dimensions. Some outliers will, however, be apparent only
in three or more dimensions.
Changes in Variability
Boxplots and histograms readily convey an impression of the extent of variability or
scatter in the data. Side-by-side boxplots, such as in Figure 1.1B, or dotplots such as
in Figure 1.1A, allow rough comparisons of the variability across different samples
or treatment groups. They provide a visual check on the assumption, common in
many statistical models, that variability is constant across treatment groups.
It is easy to over-interpret such plots. Statistical theory offers useful and necessary
warnings about the potential for such over-interpretation. (The variability in a
sample, typically measured by the variance, is itself highly variable under repeated
sampling. Measures of variability will be discussed in Subsection 1.3.3.)
When variability increases as data values increase, the logarithmic transformation
will often help. Constant relative variability on the original scale becomes constant
absolute variability on a logarithmic scale.
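A quick check of that claim, with simulated data (the means and the 10 percent relative spread are arbitrary choices):

```r
set.seed(42)
means <- c(10, 100, 1000)
## Standard deviation proportional to the mean: constant *relative* spread
dat <- lapply(means, function(m) rnorm(5000, mean = m, sd = 0.1 * m))
sapply(dat, sd)                        # roughly 1, 10, 100
sapply(dat, function(x) sd(log(x)))    # roughly 0.1 in every group
```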
Clustering
Clusters in scatterplots may suggest features of the data that may or may not
have been expected. Upon proceeding to a formal analysis, any clustering must be
taken into account. Do the clusters correspond to different values of some relevant
variable? Outliers are a special form of clustering.
Nonlinearity
Where it seems clear that one or more relationships are nonlinear, a transformation
may make it possible to model the relevant effects as linear. Where none of the
1.3.1 Counts
The data frame DAAG::nswpsid1 is from a study (Lalonde, 1986) that compared two
groups of individuals with a history of unemployment problems – one an “untreated”
control group and the other a “treatment” group whose members were exposed to a
labor training program. The data include measures that can be used for checks on
whether the two groups were, aside from exposure (or not) to the training program,
otherwise plausibly similar. The following compares the relative numbers of those
who had completed high school (nodeg = 0) and those who had not (nodeg = 1).
## Table of counts example: data frame nswpsid1 (DAAG)
## Specify `useNA="ifany"` to ensure that any NAs are tabulated
tab <- with(DAAG::nswpsid1, table(trt, nodeg, useNA="ifany"))
dimnames(tab) <- list(trt=c("none", "training"), educ = c("completed", "dropout"))
tab
educ
trt completed dropout
none 1730 760
training 80 217
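Row proportions make the comparison direct; a sketch that reconstructs the table from the counts shown above (so that it runs without the DAAG package):

```r
tab <- matrix(c(1730, 80, 760, 217), nrow = 2,
              dimnames = list(trt  = c("none", "training"),
                              educ = c("completed", "dropout")))
## Proportions within each row (margin = 1)
round(prop.table(tab, margin = 1), 2)
```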
Figure 1.10 Boxplot showing weights (inverse sampling fractions), in the dataset
DAAG::nassCDS. A log(weight+1) scale has been used.
The training group has a much higher proportion of dropouts. Similar compar-
isons are required for other factors, variables, and combinations of two factors or
variables. The data will be investigated further in Section 9.7.1.
                 alive     dead
Sample           25037     1180
Total number  12067937    65595
This might suggest that the fitting of an airbag substantially reduces the risk of
mortality. Consider, however:
SAtab <- xtabs(weight ~ seatbelt + airbag + dead, data=DAAG::nassCDS)
## SAtab <- addmargins(SAtab, margin=3, FUN=list(Total=sum)) ## Get totals
## DeadPer1000: a summary function, assumed defined earlier in the text
SAtabf <- ftable(addmargins(SAtab, margin=3, FUN=DeadPer1000), col.vars=3)
print(SAtabf, digits=2, method="compact", big.mark=",")
The Total column gives the weights that are, effectively, applied to the values in the
DeadPer1000 column when the raw numbers are added over the seatbelt margin. In
the earlier table (Atab), the results for airbag=none were mildly skewed (4119:1366)
to those for belted. Results with airbags were strongly skewed (5763:886) to those
for seatbelt=none. Hence, adding over the seatbelt margin gave a spuriously large
advantage to the presence of an airbag.
The reader may wish to try an analysis that accounts, additionally, for estimated
force of impact (dvcat):
FSAtab <- xtabs(weight ~ dvcat + seatbelt + airbag + dead, data=DAAG::nassCDS)
FSAtabf <- ftable(addmargins(FSAtab, margin=4, FUN=DeadPer1000), col.vars=3:4)
print(FSAtabf, digits=1)
Figure 1.11 Individual yields and plot-level mean yields of kiwifruit (in kg) for
each of four treatments (season) and blocks (exposure).
there was a driver airbag. Farmer found a ratio of driver fatalities to passenger
fatalities that was 11 percent lower in the cars with driver airbags. Factors that
have a large effect on the absolute risk can be expected to have a much smaller
effect on the relative risk.
In addition to the functions discussed, note the function gmodels::CrossTable(),
which offers a choice of SPSS-like and SAS-like output formats.