
A Practical Guide to Data Analysis Using R: An Example-Based Approach

The document is a practical guide to data analysis using R, focusing on real-world examples to illustrate statistical models and their assumptions. It covers various topics including regression models, time series analysis, and multilevel models, providing exercises and online resources for further learning. The authors, John H. Maindonald, W. John Braun, and Jeffrey L. Andrews, bring extensive academic and practical experience to the text.

A PRACTICAL GUIDE TO DATA ANALYSIS USING R

Using diverse real-world examples, this text examines what models used for data analysis
mean in a specific research context. What assumptions underlie analyses, and how can you
check them?
Building on the successful Data Analysis and Graphics Using R, third edition (Cambridge,
2010), it expands upon topics including cluster analysis, exponential time series,
matching, seasonality, and resampling approaches. An extended look at p-values leads to an
exploration of replicability issues and of contexts where numerous p-values exist, including
gene expression.
Developing practical intuition, this book assists scientists in the analysis of their own
data, and familiarizes students in statistical theory with practical data analysis. The worked
examples and accompanying commentary teach readers to recognize when a method works
and, more importantly, when it doesn’t. Each chapter contains copious exercises. Selected
solutions, notes, slides, and R code are available online, with extensive references pointing
to detailed guides to R.

John H. Maindonald is Contract Associate at Statistics Research Associates and
was previously Visiting Fellow at the Australian National University. He has had wide
experience both as a university lecturer and as a quantitative problem solver, working with
researchers in diverse areas. He is the author of Statistical Computation (1984), and the
senior author of Data Analysis and Graphics Using R (third edition, 2010).
W. John Braun is Professor at the University of British Columbia, where he is Director
of the UBCO campus of the Banff International Research Station for Mathematical Innovation
and Discovery. In 2020, he received the Statistical Society of Canada Award for Impact
of Applied and Collaborative Work.
Jeffrey L. Andrews is Associate Professor at the University of British Columbia.
He currently serves as Principal Co-director of the Master of Data Science program and
President-elect of The Classification Society (TCS). He is the 2013 Distinguished Dissertation
Award winner from TCS and a recipient of the 2017 Chikio Hayashi Award for Young
Researchers from the International Federation of Classification Societies.

Published online by Cambridge University Press


A PRACTICAL GUIDE TO DATA
ANALYSIS USING R
An Example-Based Approach

JOHN H. MAINDONALD
Statistics Research Associates, Wellington, New Zealand

W. JOHN BRAUN
University of British Columbia, Okanagan

JEFFREY L. ANDREWS
University of British Columbia, Okanagan



Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of Cambridge University Press & Assessment, a department
of the University of Cambridge
We share the University’s mission to contribute to society through the pursuit of
education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781009282277
DOI: 10.1017/9781009282284
© John H. Maindonald, W. John Braun, and Jeffrey L. Andrews 2024
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
First published 2024
Printed in the United Kingdom by CPI Group Ltd, Croydon CR0 4YY
A catalogue record for this publication is available from the British Library
A Cataloging-in-Publication data record for this book is available from the Library of Congress
ISBN 978-1-009-28227-7 Hardback
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will
remain, accurate or appropriate.



For my grandchildren Luke, Amelia, and Ted

For my children, Matthew, Phillip, and Reese

For my family (Irene, Charlie, and Mia) and my parents (Dave and Marleen)

Contents

List of Figures page xi


Preface xvii
1 Learning from Data, and Tools for the Task 1
1.1 Questions, and Data That May Point to Answers 2
1.2 Graphical Tools for Data Exploration 12
1.3 Data Summary 22
1.4 Distributions: Quantifying Uncertainty 30
1.5 Simple Forms of Regression Model 42
1.6 Data-Based Judgments – Frequentist, in a Bayesian World 48
1.7 Information Statistics and Bayesian Methods with Bayes Factors 58
1.8 Resampling Methods for SEs, Tests, and Confidence Intervals 66
1.9 Organizing and Managing Work, and Tools That Can Assist 70
1.10 The Changing Environment for Data Analysis 72
1.11 Further, or Supplementary, Reading 79
1.12 Exercises 80
2 Generalizing from Models 88
2.1 Model Assumptions 88
2.2 t-Statistics, Binomial Proportions, and Correlations 91
2.3 Extra-Binomial and Extra-Poisson Variation 95
2.4 Contingency Tables 100
2.5 Issues for Regression with a Single Explanatory Variable 104
2.6 Empirical Assessment of Predictive Accuracy 116
2.7 One- and Two-Way Comparisons 121
2.8 Data with a Nested Variation Structure 130
2.9 Bayesian Estimation – Further Commentary and Approaches 131
2.10 Recap 136
2.11 Further Reading 137
2.12 Exercises 137
3 Multiple Linear Regression 144
3.1 Basic Ideas: the allbacks Book Weight Data 144
3.2 The Interpretation of Model Coefficients 148

3.3 Choosing the Model, and Checking It Out 161
3.4 Robust Regression, Outliers, and Influence 171
3.5 Assessment and Comparison of Regression Models 176
3.6 Problems with Many Explanatory Variables 183
3.7 Errors in x 191
3.8 Multiple Regression Models – Additional Points 195
3.9 Recap 201
3.10 Further Reading 202
3.11 Exercises 203
4 Exploiting the Linear Model Framework 208
4.1 Levels of a Factor – Using Indicator Variables 209
4.2 Block Designs and Balanced Incomplete Block Designs 213
4.3 Fitting Multiple Lines 216
4.4 Methods for Fitting Smooth Curves 219
4.5* Quantile Regression 238
4.6 Further Reading and Remarks 240
4.7 Exercises 240
5 Generalized Linear Models, and Survival Analysis 245
5.1 Generalized Linear Models 245
5.2 Logistic Multiple Regression 250
5.3 Logistic Models for Categorical Data – an Example 260
5.4 Models for Counts – Poisson, Quasipoisson, and Negative Binomial 261
5.5 Fitting Smooths 274
5.6 Additional Notes on Generalized Linear Models 276
5.7 Models with an Ordered Categorical or Categorical Response 278
5.8 Survival Analysis 281
5.9 Transformations for Proportions and Counts 288
5.10 Further Reading 289
5.11 Exercises 290
6 Time Series Models 292
6.1 Time Series – Some Basic Ideas 293
6.2 Regression Modeling with ARIMA Errors 304
6.3* Nonlinear Time Series 313
6.4 Further Reading 314
6.5 Exercises 315
7 Multilevel Models, and Repeated Measures 318
7.1 Corn Yield Data – Analysis Using aov() 320
7.2 Analysis Using lme4::lmer() 325
7.3 Survey Data, with Clustering 329
7.4 A Multilevel Experimental Design 335
7.5 Within- and Between-Subject Effects 344
7.6 A Mixed Model with a Betabinomial Error 349

7.7 Observation-Level Random Effects – the moths Dataset 356
7.8 Repeated Measures in Time 357
7.9 Further Notes on Multilevel Models 367
7.10 Recap 371
7.11 Further Reading 371
7.12 Exercises 371
8 Tree-Based Classification and Regression 373
8.1 Tree-Based Methods – Uses and Basic Notions 374
8.2 Splitting Criteria, with Illustrative Examples 378
8.3 The Practicalities of Tree Construction – Two Examples 384
8.4 From One Tree to a Forest – a More Global Optimality 390
8.5 Additional Notes – One Tree, or Many Trees? 393
8.6 Further Reading and Extensions 395
8.7 Exercises 396
9 Multivariate Data Exploration and Discrimination 400
9.1 Multivariate Exploratory Data Analysis 401
9.2 Principal Component Scores in Regression 408
9.3 Cluster Analysis 412
9.4 Discriminant Analysis 422
9.5* High-Dimensional Data – RNA-Seq Gene Expression 429
9.6 High-Dimensional Data from Expression Arrays 433
9.7 Balance and Matching – Causal Inference from Observational Data 443
9.8 Multiple Imputation 457
9.9 Further Reading 462
9.10 Exercises 463

Epilogue 467
Appendix A The R System: a Brief Overview 469
A.1 Getting Started with R 469
A.2 R Data Structures 473
A.3 Functions and Operators 483
A.4 Calculations with Matrices, Arrays, Lists, and Data Frames 487
A.5 Brief Notes on R Graphics Packages and Functions 490
A.6 Plotting Characters, Symbols, Line Types, and Colors 493
References 495
References to R Packages 508
Index of R Functions 514
Index of Terms 519

Figures

1.1 (A) Dotplot and (B) boxplot displays of cuckoo egg lengths 4
1.2 (A) Boxplot with annotation, compared with (B) histogram with overlaid
density plot 12
1.3 Total lengths of possums, by sex and geographical location 13
1.4 Mortality from measles, London: (A) 1629–1939; (B) 1841–1881 14
1.5 Brain vs. body weight: (A) untransformed; (B) log transformed scales 15
1.6 Distance traveled up a 20° ramp, vs. starting point 16
1.7 Quarterly labor force numbers, by Canadian region, 1995–1996: (A)
same log scale; (B) sliced log scale 18
1.8 Alternative logarithmic scale labeling choices, labor force numbers 19
1.9 Outcomes for two different surgery types – Simpson’s paradox example 20
1.10 Boxplot showing weights (inverse sampling fractions), in the dataset
DAAG::nassCDS 23
1.11 Individual plot-level yields of kiwifruit, by season and by block 25
1.12 Different y vs. x relationships, and Pearson vs. Spearman correlation 29
1.13 Normal density plot, with associated statistical measures 34
1.14 Plots for five samples of 50 from a normal distribution 35
1.15 Quantile–quantile plots – data vs. simulated normal values 36
1.16 Simulations of the sampling distribution of the mean 38
1.17 Normal densities with t8 and t3 overlaid 40
1.18 A fitted line, as against a fitted lowess curve 44
1.19 Quantile–quantile plots – regression residuals vs. normal samples 46
1.20 Boxplots for 200 simulated p-values – one-sided one-sample t-test 52
1.21 Post-study probability (PPV) vs. pre-study odds, given power 55
1.22 Sampling distribution of difference in AIC statistics 60
1.23 Alternative Cauchy priors, and posteriors, for the sleep data 62
1.24 Change in Bayes Factor with sample size, for different p-values 64
1.25 Permutation distribution density curves 67
2.1 Female vs. male admission rates – Simpson’s paradox example 89
2.2 Second vs. first member of paired data – two examples 92
2.3 Quantile–quantile and worm plots for binomial and betabinomial fits 97
2.4 Worm plots for Poisson and negative binomial type I fits 98
2.5 Chemical vs. magnetic measure – line vs. loess smooth 105
2.6 Weight vs. volume, for eight softback books, with regression line 107
2.7 Diagnostic plots for Figure 2.6 108

2.8 Pointwise bounds for line, and for new predicted values 110
2.9 Confidence bounds – pairwise differences vs. difference of means 111
2.10 Regression lines – y on x and x on y 112
2.11 Graphs that illustrate the use of power transformations 113
2.12 Heart weight vs. body weight, for 30 Cape fur seals 115
2.13 Graphical summary of three-fold cross-validation – house sale data 117
2.14 Plots that relate to bootstrap distributions of prediction errors 120
2.15 LSD and HSD comparisons of means for three treatments 122
2.16 Test for linear trend vs. anova test – p-value comparison 125
2.17 False-color image of two channel microarray gene expression values 126
2.18 Rice shoot dry mass data – plots that show interactions 129
2.19 Diagnostic plots – MCMCregress() Bayesian analysis 135
3.1 Weight vs. volume, for seven hardback and eight softback books 145
3.2 Diagnostic plots – lm(weight ∼ 0+volume+area) 147
3.3 Scatterplot matrices for Northern Ireland hill race data 149
3.4 Variation in distance per unit time with distance 151
3.5 Diagnostic plots – lm(mph ∼ log(dist)+log(gradient)) 152
3.6 Diagnostic plots – lm(logtime ∼ logdist + logclimb) 153
3.7 Scatterplot matrices – log transformed oddbooks data 154
3.8 Scatterplot matrix for the DAAG::litters data 157
3.9 Termplots for regression with oddbooks data 164
3.10 Confidence intervals, compared with prediction intervals 167
3.11 Scatterplot matrix with power transformations – hurricane deaths data 169
3.12 Diagnostic plots – model for hurricane death data 170
3.13 Scatterplot matrix for hills2000 data, logarithmic scales 172
3.14 Residuals vs. fitted – least squares compared with resistant fit 173
3.15 (A) A 2D plot that shows leverages; (B) a 3D dynamic graphic plot 175
3.16 Standardized changes in regression coefficients 175
3.17 Increase in penalty term difference for unit increase in the number of
parameters p, for AIC, BIC, and AICc 177
3.18 Diagnostic plot, compared with simulated diagnostic plots 182
3.19 p-Values vs. number of variables available for selection 186
3.20 Scatterplot matrix for Coxite data 187
3.21 Observed porosities, and fitted values with 95 percent confidence bounds 188
3.22 Change in regression line as error in x changes 192
3.23 Apparent differences between groups, resulting from errors in x 194
3.24 Does preoperative baclofen reduce pain – Simpson’s paradox example? 196
3.25 Added variable plots (a termplot variant) 198
3.26 Residuals vs. fitted values, for each of the three regressions 199
4.1 Weights of extracted sugar – wild-type plant vs. other types 209
4.2 Apple taste scores – panelist and product effects 215
4.3 Plots relate to alternative models fitted to the leaftemp data 219
4.4 Diagnostic plots for the parallel line model – leaftemp data 219
4.5 Number of grains per head vs. barley seeding rate 221
4.6 Line vs. quadratic curve, and residual plots, for barley seeding rate
data 223
4.7 Resistance vs. apparent juice content for kiwifruit slabs 226
4.8 Thin plate spline basis curves, and contributions to fitted curve 228

4.9 Use of gam.check() with model for fruitohms data 229
4.10 Plots that relate to a monotonic decreasing spline fit 231
4.11 Gas consumption vs. external temperature – before and after insulation 232
4.12 Minimum and maximum temperature effects on dewpoint 234
4.13 Residuals vs. maximum temperature, for three minimum temperature
ranges 235
4.14 Hurricane deaths – plots for fitted terms, and for residuals 236
4.15 Hurricane deaths – logarithmic vs. untransformed base damage measure 236
4.16 Plots show quantile curves (A) the 50 percent curve with two SE
bounds; (B) 10 percent and 90 percent curves, unweighted and weighted
by population 239
5.1 Plot illustrating the logit link function 246
5.2 Proportion moving vs. alveolar concentration – anesthetic data 249
5.3 Empirical log(odds) vs. concentration – anesthetic data 250
5.4 Location of sites for DAAG::frogs data 251
5.5 Scatterplot matrices that relate to frogs data 252
5.6 Scatterplot matrix, with suggested transformations – frogs data 254
5.7 Color density scale shows predicted probability of finding a frog 256
5.8 Explanatory variable contributions to fit, linear predictor scale 256
5.9 Contributions of model terms to fit, relative to means from other terms 258
5.10 Number of simple aberrant crypt foci, plotted against time 262
5.11 Dotplot summaries of numbers of two moth species, by habitat type 264
5.12 Dispersion estimates vs. mean, for moths data 267
5.13 Diagnostic plots – model for numbers of species A moths 269
5.14 Diagnostic plots – hurricane death model with quasipoisson error 271
5.15 Fitted values for NBI model, and quantile–quantile plot of residuals 273
5.16 Proportion of lefthanders, as smooth function of year of birth 275
5.17 Leverage vs. fitted proportion, for three common link functions 278
5.18 Graphical representation of survival data collection process 282
5.19 Survival curves – female vs. male AIDS contaminated blood infections 284
5.20 Survival curve for males who contracted AIDS from sexual contact 285
5.21 Time-dependent coefficients – Cox proportional hazards model 287
5.22 Time-dependent coefficients – Cox proportional hazards, cricketers 289
6.1 Trace plot of annual Lake Huron depth measurements 294
6.2 (A) First four lag plots of Lake Huron depth data; (B) autocorrelations
for AR(1) and AR(2) fits vs. data; (C) partial autocorrelations 295
6.3 Autocorrelations and partial autocorrelations for an MA process 298
6.4 Predictions with pointwise CIs – ARIMA(1,1,2) vs. ETS 301
6.5 Two simulation runs each for alternative MA3 processes 302
6.6 Original and seasonally adjusted series, and plot of seasonal component 303
6.7 mdbrtRain and mdbAVt, and SOI and IOD yearly values 304
6.8 Termplots for model gam(mdbrtRain ∼ s(CO2)+s(SOI)+s(IOD)) 305
6.9 Termplots for model gam(mdbAVt ∼ s(CO2)+s(SOI)) 307
6.10 (A) Rainfall; (B) temperature vs. year, with fitted values 307
6.11 Scatterplot matrix for air quality data 310
6.12 Predicted values of IA400/Lab ratio – ARIMA vs. ETS model 312
7.1 Stripplots – corn yields for four parcels on each of eight sites 319

7.2 Profile likelihoods – model fitted to Antiguan corn data 328
7.3 Boxplots for average class scores (like) – public vs. private schools 329
7.4 Plots of parameter estimates for fit to DAAG::science data 333
7.5 Field layout for the kiwifruit shading trial 336
7.6 Variation at the different levels, for the kiwifruit shading data 340
7.7 Plots of residuals, of plot effects, and of simulated plot effects 343
7.8 Effects of car window tinting on visual performance, plots of data 345
7.9 Cold-storage fruitfly mortality, fitted curves and 95 percent bounds 351
7.10 Fruitfly mortality model – intra-class correlations for different links 352
7.11 Diagnostics for model fitted to insect cool-storage time–mortality data:
(A) quantile–quantile plot of quantile residuals; (B) boxplots comparing
treatment groups; (C) data-based quartiles vs. model-based; (D)
quartiles as a function of number of insects 354
7.12 LT99 95 percent CIs – complementary log–log link and logit link 355
7.13 Oxygen intake vs. power output, for five athletes in the Daedalus
project 360
7.14 Distance between two positions on the skull vs. age, for 27 children 363
7.15 Slopes of profiles, vs. means of distance and log(distance) 364
8.1 Boxplots for six selected variables, from 500 rows in the SPAM database 375
8.2 Tree diagram, from use of rpart() with email spam data 376
8.3 Illustrative tree from rpart() output 378
8.4 Mileage (mpg) vs. Weight, for 60 cars, with loess curve 381
8.5 Tree-based model for Mileage given Weight, for 60 cars 381
8.6 CV error eventual increase vs. error decrease, with later splits 384
8.7 CV error vs. cp, for female heart attack data 386
8.8 Tree from use of one standard error rule for email spam data 388
8.9 Error rates – random forest OOB vs. test set and rpart() test set 394
9.1 Brushtail possum morphometric measurements: (A) scatterplot matrix;
(B) cloud plot 402
9.2 Second vs. first principal component, columns 6–14 of possum data 404
9.3 This repeats Figure 9.2, now for bootstrap data 406
9.4 Two-dimensional, obtained from nine-dimensional Euclidean, by two
different scaling methods 407
9.5 Pairs plot of first three principal components 410
9.6 Plot of BDI against scores on first principal component 411
9.7 Four “blobs” of bivariate normal data, with different layouts of means 412
9.8 Single linkage hierarchical clustering plot, and plot that checks results 413
9.9 Cluster dendrograms for Panel B of Figure 9.7 414
9.10 Dendrograms shown are from moving clusters closer together 415
9.11 Four-group k-means makes implicit equal-sized assumption, example 417
9.12 Different two-component mixtures of univariate Gaussians 418
9.13 BIC values (BIC as used elsewhere), plotted against number of groups 420
9.14 Density contours of the fitted mixture model 421
9.15 Leaf length vs. leaf width – untransformed vs. logarithmic scales 423
9.16 Scatterplot matrix for the first three LDA canonical variates 428
9.17 (A) Mean–variance relationship for mRNA gene expression data; (B)
use MDS to locate samples in 2D space 431
9.18 (A) LDA analysis for Golub data; (B) repeat for random normal data 436

9.19 (A) Mean–variance relationship for cancer gene expression data; (B)
use MDS to locate samples in 2D space 437
9.20 Different accuracy measures, in the development of a discriminant rule 440
9.21 How effective is linear discriminant in distinguishing known groups? 442
9.22 Overlaid density plots – treatment groups and experimental controls 447
9.23 Are observations for which re74 is available detectably different? 448
9.24 Random forest propensity scores – treated vs. controls? 451
9.25 Propensity scores for treatment and control groups after matching 454
9.26 (A) “Love plot”; (B) treatment/control differences for matched items 454
9.27 Love plots for different numbers (5,6) of cutpoints 456
9.28 Term plots for checking GAM model with straight line terms 459
9.29 Means of overimputations (solid points), with confidence bounds 461
A.1 Worldwide annual totals of CO2 emissions – 1900, 1920, ..., 2020 471
A.2 Fonts, symbols, and line types 493

Preface

This text is designed as an aid, for learning and for reference, in the navigation
of a world in which unprecedented new data sources, and tools for data analysis,
are pervasive. It aims to teach, using real-world examples, a style of analysis and
critique that, given meaningful data, can generate defensible analysis results. Its
focus is on ideas and concepts, with extensive use of graphical presentation. It may
be used to give students who have taken courses in statistical theory exposure to
practical data analysis. It is designed, also, as a resource for scientists who wish
to do statistical analyses on their own data, preferably with reference as necessary
to professional statistical advice. It emphasizes the role of statistical design and
analysis as part of the wider scientific process.
As far as possible, our account of statistical methodology comes from the coalface,
where the quirks of real data must be faced and addressed. Experience in
consulting with researchers in many different areas of application, in supervising
research students, and in lectures to researchers, has been a strong influence on
the text’s style and content. We comment extensively on analysis results, noting
inferences that seem well founded, and noting limitations on inferences that can be
drawn. We emphasize the use of graphs for gaining insight into data – in advance
of any formal analysis, for understanding the analysis, and for presenting analysis
results. The project has been a tremendous learning experience for all three of us.
As is usual, the more we learn, the more we appreciate how much more we have to
learn.
The text is suitable for a style of learning where readers work through the text
with a computer at their side, running the R code as and when this seems helpful.
It complements more mathematically oriented accounts of statistical methodology.
The appendix provides a brief account of R, primarily as a starting point for learning.
We encourage readers with limited R experience to avail themselves of the
wealth of instructional material on the web as well as the hardcopy resources listed
in Section 1.11.
While no prior knowledge of specific statistical methods or theory is assumed,
readers will need to bring with them, or quickly acquire, a modest level of statistical
sophistication. Prior experience with real data, prior exposure to statistical
methodology, and some prior familiarity with regression methods, will all be helpful.

https://doi.org/10.1017/9781009282284.001


Important technical terms will include random sample, independence, dependence,
standard deviation, and normal distribution, with limited attention to formal definition.
Our primary concern is with the role and meaning of this language in practical
data analysis. While there will be references to theoretical results, it is not our purpose
to provide a systematic account of statistical theory.1 We make only limited
use of mathematical symbolism.
Statistical analysis relies heavily on mathematical models. An understanding of
the mathematics underlying a model is important only to the extent that it helps
in understanding, and where possible in checking, what the model means in the
context from which the data came. Is it reasonable to assume that observations are
independent? What are the influences, perhaps the time sequence in which the data
were collected, that might place this assumption in question? This is just one of the
issues, but a very important one, that data analysts need to consider. Comments
made by John W. Tukey emphasize the importance, in statistical training and
practice, of wrestling with what the models used mean in the context of data that
has been presented for analysis:

... Statistics is a science ... and it is no more a branch of mathematics than are physics,
chemistry and economics; for if its methods fail the test of experience – not the test of
logic – they are discarded.
[Tukey (1953), quoted by Brillinger (2002)]

The methods that we cover have wide application. The datasets, many of which
have featured in published papers, are drawn from many different fields. They reflect
a journey in learning and understanding, alike for the authors and for those with
whom they have worked, that has ranged widely over many different research areas.
We hope that our text will stimulate the cross-fertilization that occurs when ideas
and applications that have proved effective in one area find use elsewhere, perhaps
even leading to new lines of investigation.
To summarize: The strengths of this book include the directness of its encounter
with research data, its advice on practical data analysis issues, careful critiques
of analysis results, the use of modern data analysis tools and approaches, the use
of simulation and other computer-intensive methods where these provide insight
or give results that are not otherwise available, attention to graphical and other
presentation issues, the use of examples drawn from across the range of statistical
applications, the links that it makes into the debate over reproducibility in science,
and the inclusion of code that reproduces analyses.
A substantial part of the first edition of Data Analysis and Graphics Using R
(Maindonald and Braun, 2003) was derived, initially, from the lecture notes of
courses for researchers that the first author presented, at the University of New-
castle (Australia) over 1996–1997 and at Australian National University from 1998,
through until formal retirement and beyond. It was a privilege to have contacts,
arising from consulting work and lectures, across the University. Those contacts
were extended as a result of short courses on R-based analysis that were offered,
across a wide variety of Australian government and academic institutions, between
2003 and 2014.
1 For an overview of the theory of statistical inference, see, for example, Cox (2006).

https://doi.org/10.1017/9781009282284.001 Published online by Cambridge University Press

Influences on the Modern Practice of Statistics


Statistics is a young discipline. Only in the 1920s and 1930s did the modern frame-
work of statistical theory, including ideas of hypothesis testing and estimation, begin
to take shape. As documented in Gigerenzer et al. (1989, The Empire of Chance),
differences in historical development have led to some differences in practice be-
tween research areas.
Statistical methods have found wide use, but they have also been widely misused.
There has been a widespread reliance on “black box” approaches, used without due
consideration of the reasonableness of assumptions made, or attention to diagnostic
checks, or attention to the processes that generated the data. In experimental work,
the use of p-values and other statistics has too often become a substitute for the
checks that independent replication provides on the total experimental process.
There has been a renewed attention, both in the wider scientific community and
in the statistical community, to the interplay between scientific methodology and
statistical design and analysis. Critical reexamination of current scientific processes,
and of the role of statistical analysis within those processes, can help ensure that
the demands of scientific rationality do in due course win out over accidents of
historical development and all-too-human failures to maintain critical standards.

New Data Analysis Tools


The methodology has developed in a synergy with the relevant supporting mathe-
matical theory and, more recently, with computing. This has led to major advances
on the methodologies of the precomputer era. “Data Science,” or perhaps “Statis-
tical Science,” is a good name for the mix of tools and skills required for effective
data analysis. Data analysts now have at their disposal vastly more powerful tools
than were available even 20 years ago, for exploratory analysis of regression data,
for choosing between alternative models, for diagnostic checks, for handling nonlin-
earity, for assessing the predictive power of models, and for graphical presentation.
New computing tools make it straightforward to move data between different sys-
tems, to keep a record of calculations, to retrace or adapt earlier calculations, and to
edit output and graphics into a form that can be incorporated into published doc-
uments. Machine learning and related methodologies emphasize new types of data,
new data analysis demands, new data analysis tools, and datasets that may be of
unprecedented size. Textual data and image data offer interesting new challenges.
The traditional concerns of professional data analysts remain as important as
ever. Irrespective of the size of dataset, questions of data quality, of relevance to the
issues that are under investigation, and of the way that the data have been sampled,
still have to be addressed. Implicit or explicit claims that results generalize to a
relevant wider target population must be justified.
Students in first or second year university courses, in such areas as geography
or biology or politics or psychology or business studies, are increasingly likely to
encounter R. It is finding its way into the upper levels of secondary schools. While
this is to be encouraged, students do need to understand that such courses are at
the start of an adventure in statistical understanding. There is no good substitute
for professional training in modern tools for data analysis, and experience in using
those tools with a wide range of datasets. No one should be embarrassed that they
have difficulty with analyses that involve ideas that professional statisticians may
take seven or eight years of training and experience to master.
The questions that data analysis is designed to answer can often be stated simply.
This may encourage the layperson, or even scientists doing their own analyses, to
believe that the answers are similarly simple. Commonly, they are not. Be prepared
for unexpected subtleties. Comments made by Stephen Senn are apt:
I’ve been studying statistics for over 40 years and still don’t understand it. The ease with
which non-statisticians master it is staggering.

No amount of statistical or computing technology can be a substitute for good
design of data collection, for understanding the context in which data are to be
interpreted, or for skill in using available analysis tools. The best any analysis can
do is to highlight the information in the data.

The R System
Work on R started in the early 1990s, as a project of Ross Ihaka and Robert Gentle-
man, when both were at the University of Auckland (New Zealand). The R system
implements a dialect of the S language, developed at AT&T by John Chambers
and colleagues. Section 1.4 in Chambers (2008) describes the history. Versions of
R are available, at no charge, for Microsoft Windows, for Linux and other Unix
systems, and for Macintosh systems. It is available through the Comprehensive R
Archive Network (CRAN). Go to http://cran.r-project.org/, and find the nearest
mirror site. A huge range of packages, contributed by specialists in many different
areas, supplement base R. The development model has proved effective in marshal-
ing high levels of computing expertise for continuing improvement, for identifying
and fixing bugs, and for responding quickly to the evolving needs and interests of
the statistical community. The R Task Views web page2 lists packages that handle
some of the more common R applications. It has become an increasing challenge to
keep pace with the new and/or improved abilities that R packages, new and old,
continue to develop. Those who rely heavily on R for their day-to-day work will do
well to keep attuned to major changes and developments.
The R system has brought into a common framework a huge range of abili-
ties that extend beyond the data analysis and associated data manipulation and
graphics abilities that are the focus of this text. Examples include drawing and
coloring maps, reading and handling shapefiles, map projections, plotting data col-
lected by balloon-borne weather instruments, creating color palettes, manipulating
bitmap images, solving sudoku puzzles, creating magic squares, solving ordinary
differential equations, and processing various types of genomic data. Help files and
vignettes that are included with packages are a large reservoir of information on
the methodologies that they implement.
2 https://cran.r-project.org/web/views/.
There are several graphical user interfaces (GUIs) that can be highly helpful in
accessing a restricted range of R abilities – examples are BlueSky, Rcmdr, R-Instat,
jamovi, and rattle. Access to the full range of abilities that R and R packages make
available will require use of the command line.
RStudio is a widely used R integrated development environment (IDE) for tasks
that include viewing history, debugging, managing the workspace, package man-
agement, and data input and output. It has features that greatly assist project
management and package development.
Among systems that have the potential to challenge R’s dominance for data
analysis, Julia (julialang.org/) seems particularly interesting. Relative to R, it
has high computational efficiency. It has the potential to develop or adapt a range
of packages that together match what R packages offer.

Changes and Additions from Data Analysis and Graphics Using R


Chapters 1–5 of Data Analysis and Graphics Using R, third edition (Maindonald
and Braun, 2010) have been amalgamated and condensed somewhat into Chapters
1–3 of the present book. Here, the focus has moved, from including extensive R
tutorial content in the text, to pointing users to the extensive R help resources now
available both on the web and in printed form. Supplementary content available
online includes R Markdown scripts, one for each chapter, that can be processed
to reproduce all computer output, including tables and graphs. This content is
available at https://jhmaindonald.github.io/PGRcode.
Concerns about reproducibility (or, in the terminology we prefer, “replicability”),
especially in wet laboratory biology and in psychology, have attracted extensive
attention in the pages of Nature, Science, The Economist, psychology journals, and
elsewhere. The uses and limitations of p-values have been an important part of
the discussion. Chapter 1 now has a much extended discussion of their use and
role, leading on to the wider discussion of replicability issues. Information statistics
(AIC, AICc, and BIC) get more detailed attention.
The treatment of p-values extends to noting the new possibilities that arise when
there are, potentially, hundreds, or thousands, or more, p-values. The false discovery
rate estimates that are then available are more informative, and relate more directly
to the questions that are commonly of experimental interest, than p-values. The
new Section 9.5 takes up these ideas as they apply to the analysis of RNA-Seq gene
expression data.
Other topics that get new or increased attention include: the modeling of extra-
binomial or extra-Poisson variation; exponential time series, including their use in
forecasting; seasonality; spline smooths with time series error terms; fitting monotonic
increasing or decreasing response curves; and quantile regression with automatic
choice of smoothing parameter.
Changes in the lme4 package for fitting mixed-effects models, and the implementation
of the Kenward–Roger approach that is now available in the afex package,
have required substantial rewrites. In Chapter 7, there is a new section on “A Mixed
Model with a Betabinomial Error.” The treatment of Principal Component Analysis
and of multi-dimensional scaling is now followed by a new section on hierarchical
and other forms of clustering.
The treatment of causal inference from observational data has been greatly ex-
tended to discuss the role of matching. There is some limited attention to the use
of multiple imputation to fill in missing values in data where some observations are
incomplete.

Source Files That Combine Text and R Code


Drafts of this text were created from Sweave source files that combine marked up
code and text into one document, in a form that could then be processed using Yihui
Xie’s knitr package to give the LaTeX files and associated R output and figures from
which this text was generated. Rerunning and checking of code is a built-in part
of the process, making the revising and updating of text and code easier and less
error prone. The R Markdown plain text format, designed to be easier for novices
to learn and master, can be processed using knitr abilities in a very similar
way. R Markdown is widely used for creating online content, for papers and books,
and for the vignettes that many R packages use to supplement help pages. See
https://rmarkdown.rstudio.com/.

Acknowledgements
The prefaces to the three editions of Data Analysis and Graphics Using R give names
of those who provided helpful comment. For this new text, James Cone has provided
useful comments. Trish Scott has helped with copyediting. Discussions on the R-
help and R-devel email lists have contributed greatly to insight and understanding.
The failings that remain are, naturally, our responsibility.
This text has drawn on data from many different sources. Following the references
is a list of data sources (individuals and/or organizations) that we wish to thank and
acknowledge. Thanks are due also to the many researchers whose discussions with
us have helped stimulate thinking and understanding, and who in many instances
have given us access to their data. We apologize to anyone that we may have
inadvertently failed to acknowledge.
Too often, data that have become the basis for a published paper are not made
available in any form of public record. The data may not find their way into any
permanent record, and cease to be available for checking the analysis, for work
that builds on what can be learned when data from multiple sources are brought
together, to try a new form of analysis, or for use in teaching. In areas where data
are as a matter of course kept available for future researchers to use, this has been
a major contributor to advances in scientific understanding. Those benefits can and
should extend more widely. Thanks are due to Beverley Lawrence for her efforts
as copy-editor, and to Cambridge University Press staff who assisted us through
the copy-editing and publication process – Roger Astley, Natalie Tomlinson, Anna
Scriven, and Clare Dennison.


Notes for Readers


For many readers, a largely “learn as one goes” approach to mastering what they
need to know of R will work well. For this, they can look for the mix of sources of
tutorial content that works best for them – online tutorial content such as is noted
in Section 1.11, books and other printed material, results from web searches, and
such guidance as is provided in Appendix A. We encourage readers who are new to
R to skim over the content of Appendix A before or as they work through the first
chapter.
A complete set of R code, together with other supplementary material, is available
from https://jhmaindonald.github.io/PGRcode.

Graphs and Graphics Packages


In Chapter 1, simplified code is given for figures that do not involve relatively
complicated code. In later chapters, code is given only for those figures that are
specifically targeted at the methodology under discussion.
The main graphics packages that will be used are the base graphics package,
lattice and latticeExtra, and ggplot2. The plot() and related functions in base
graphics directly generate a plot. With lattice and ggplot2 functions, an alternative
to directly creating a plot is to save the output as a graphics object that can be
further updated and/or modified before use to create a plot.
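As a minimal sketch of the second approach, the following saves a lattice plot as a graphics object and modifies it before it is used. The built-in cars dataset and the axis labels are chosen here purely for illustration; they are assumptions, not taken from the text.

```r
library(lattice)
## Save the graphics object; assigning it does not draw anything
gph <- xyplot(dist ~ speed, data=cars)
## Modify the saved object; update() returns a modified 'trellis' object
gph2 <- update(gph, xlab="Speed (mph)", ylab="Stopping distance (ft)")
class(gph2)
## print(gph2)   # Printing the object (or typing its name) creates the plot
```

The same idea applies with ggplot2, where further components are added to a saved object before it is printed.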

Accessing Data and Functions from Packages


A number of packages are automatically loaded, with their functions and datasets
then available, at the start of a new R session. For functions and datasets in pack-
ages that are not already available, there is a choice between using library() or
an equivalent to make all datasets and functions from the package available, or
using code such as lattice::xyplot() (execute the xyplot function from the lattice
package) or DAAG::cuckoos (the cuckoos dataset from the DAAG package)
whenever such a function or dataset is required.
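The two choices can be sketched as follows. The example assumes only the lattice package, which is supplied with standard R installations, and uses the built-in cars dataset for illustration.

```r
## Option 1: attach the package, so that its functions can be called by name
library(lattice)
gphA <- xyplot(dist ~ speed, data=cars)
## Option 2: give the package name explicitly, using `::`
## (this form works whether or not library() has been called)
gphB <- lattice::xyplot(dist ~ speed, data=datasets::cars)
## Both calls return an object of the same class
identical(class(gphA), class(gphB))
```

The `::` form makes the source of each function or dataset explicit, at the cost of some extra typing.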

Conventions
Starred headings identify more technical discussions that can be skipped at a first
reading. Item numbers for more technical and/or challenging exercises are likewise
starred.
Comments, prefaced by # or for extra emphasis by ##, will often be included in
code chunks. Where code is included in comments, it will be surrounded by back
quotes, as in `species ~ length` in the final line of code that now follows:
## Code for a stripped down version of Figure 1.1A
library(latticeExtra) # The 'lattice' package will be loaded & attached also
cuckoos <- DAAG::cuckoos
## Panel A: Dotplot without species means added
dotplot(species ~ length, data=cuckoos) ## `species ~ length` is a 'formula'

1 Learning from Data, and Tools for the Task

Chapter Summary
We begin by illustrating the interplay between questions driven by scientific curios-
ity and the use of data in seeking the answers to such questions. Graphs provide a
useful window through which meaning can be extracted from data. Numeric sum-
mary statistics and probability distributions provide a form of quantitative scaf-
folding for models of random as well as nonrandom variation. Simple regression
models foreshadow the issues that arise in the more complex models considered
later in the book. Frequentist and Bayesian approaches to statistical inference are
touched upon, the latter primarily using the Bayes Factor as a summary statistic
which moves beyond the limited perspective that p-values offer. Resampling meth-
ods, where the one available dataset is used to provide an empirical substitute for
a theoretical distribution, are also introduced. Remaining topics are of a more gen-
eral nature. Section 1.9 will discuss the use of RStudio and other such tools for
organizing and managing work. Section 1.10 will include a discussion on the impor-
tant perspective that replication studies provide, for experimental studies, on the
interplay between statistical analysis and scientific practice. The checks provided
by independent replication at another time and place are an indispensable comple-
ment to statistical analysis. Chapter 2 will extend the discussion of this chapter to
consider a wider class of models, methods, and model diagnostics.

A Note on Terminology – Variables, Factors and More!


Much of data analysis is concerned with the statistical modeling of relationships or
associations that can be gleaned from data, with a mathematical formula used to
specify the model. There is an example at the beginning of Section 1.1.6.
The word variable will be used when data values are numeric. These include
counts, as for example in count from the DAAG::ACF1 data frame which has num-
bers of aberrant lesions in the lining of a rat’s colon. The term factor will be used
when values are on a categorical scale. Thus, in the data frame DAAG::kiwishade,
yield is a variable with values such as 101.11, and block is a factor with levels
east, north, and west. A factor may also represent values on an ordinal scale.
Thus the factor tint in the data frame DAAG::tinting has ordered levels no, lo,
and hi.
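The distinction can be illustrated without loading DAAG. In the sketch below, the values are hypothetical (only 101.11 and the level names are taken from the text), and the object names simply mirror the columns mentioned above.

```r
## Hypothetical values, patterned on the examples in the text
yield <- c(101.11, 108.02, 97.58)              # a numeric variable
block <- factor(c("east", "north", "west"))    # an unordered factor
tint <- factor(c("no", "hi", "lo", "no"),
               levels=c("no", "lo", "hi"), ordered=TRUE)  # an ordered factor
is.numeric(yield)
levels(block)
is.ordered(tint)
tint < "hi"    # Comparisons respect the level order no < lo < hi
```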

https://doi.org/10.1017/9781009282284.002 Published online by Cambridge University Press

Continuous measurements can be further classified as having either an interval
scale or a ratio scale. Variables defined on an interval scale can take positive or
negative values, and differences in the data values are meaningful. Variables defined
on a ratio scale are usually positive only, so quotients are more meaningful.
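Temperature offers a familiar illustration, used here as an assumed example that is not in the text: degrees Celsius form an interval scale, while kelvins form a ratio scale. The sketch shows that differences agree between the two scales, while quotients do not.

```r
celsius <- c(10, 20)           # interval scale: the zero point is a convention
kelvin <- celsius + 273.15     # ratio scale: zero is a true zero
## Differences agree between the two scales
all.equal(diff(celsius), diff(kelvin))
## Quotients do not: 20/10 = 2 in degrees Celsius, about 1.035 in kelvins
celsius[2]/celsius[1]
kelvin[2]/kelvin[1]
```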

1.1 Questions, and Data That May Point to Answers


Accounts of observed phenomena become part of established science once we know
the circumstances under which they will recur. This process is relatively straight-
forward when applied to the study of regular events, such as a solar eclipse or the
ocean tide levels in the Bay of Fundy in eastern Canada. Mathematical models
that are based on sound physical principles can provide very accurate predictions
for such events. Not everything is so readily predictable. How effective is a partic-
ular vaccine in preventing COVID-19-associated hospitalizations? How fast will a
wildfire spread through a region with known topography and vegetation under given
wind, temperature, and moisture conditions? Data from a suitable experiment or
series of experiments may be able to go at least part of the way towards providing
an answer. Thus, results from prescribed burns in designated forest stands where
all relevant variables have been measured can provide a starting point for assessing
the rate of spread of surface fires.
Or it may be necessary to rely on whatever data are already available. How
effective are airbags in reducing the risk of death in car accidents? Data on car
accidents in the United States over the period 1997–2002 are available. While careful
and critical analyses of these data can help answer the question, caveats apply
when the interest is in effectiveness at a later time and in another country. There
have been important advances in the subsequent two decades in airbag design,
manufacture, and systems that control deployment.
In Canada, there is a tendency for car passengers to use seatbelts at a higher
rate than in the United States, so that efficacy assessments based on the American
data have to be tempered when applied to the Canadian experience. The decision
on which of the available datasets is best designed to provide an answer, and the
choice of model, have called for careful and critical assessment. The help pages
?DAAG::nassCDS and ?gamclass::FARS provide further commentary. There is a
strong interplay between the questions that can reasonably be asked, and the data
that are available or can be collected. Keep in mind, also, that different questions,
asked of the same data, may demand different analyses.

1.1.1 A Sample Is a Window into the Wider Population


The population comprises all the data that might have been. The sample is the data
that we have. Subjects for a sample to be surveyed should be selected randomly. In
a clinical trial, it is important to allocate subjects randomly to different treatment
groups.


Suppose, for example, that names on an electoral roll are numbered from 1 to
9384. The following uses the function sample() to obtain a random sample of 12
individuals:
## For the sequence below, precede with set.seed(3676)
sample(1:9384, 12, replace=FALSE) # NB: `replace=FALSE` is the default

[1] 2263 9264 4490 8441 1868 3073 5430 19 1305 2908 5947 915

The numbers are the numerical labels for the 12 individuals who are included in the
sample. The task is then to find them! The option replace=FALSE gives a without
replacement sample, that is, it ensures that no one is included more than once.
A more realistic example might be the selection of 1200 individuals, perhaps for
purposes of conducting an opinion poll, from names numbered 1 to 19,384, on an
electoral roll. Suitable code is:
chosen1200 <- sample(1:19384, 1200, replace=FALSE)
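The comments in the code above suggest calling set.seed() before sampling. The following sketch, in which the seed 42 is an arbitrary choice and not from the text, shows that resetting the seed reproduces the same sample, and checks that a without-replacement sample has no repeats.

```r
set.seed(42)                       # arbitrary seed, chosen for illustration
first <- sample(1:19384, 1200, replace=FALSE)
set.seed(42)                       # same seed again
second <- sample(1:19384, 1200, replace=FALSE)
identical(first, second)           # The same seed gives the same sample
anyDuplicated(first)               # 0 indicates that no one appears twice
```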

The following randomly assigns 10 plants (labeled from 1 to 10, inclusive) to one
of two equal-sized groups, control and treatment:
## For the sequence below, precede with set.seed(366)
split(sample(1:10), rep(c("Control","Treatment"), 5))

$Control
[1] 5 7 1 10 4

$Treatment
[1] 8 6 3 2 9

# sample(1:10) gives a random re-arrangement (permutation) of 1, 2, ..., 10


This assigns plants 5, 7, 1, 10, and 4 to the control group. This mechanism avoids
any unwitting preference for placing healthier-looking plants in the treatment group.
The simple independent random sampling scheme can be modified or extended in
ways that take account of structure in the data, with random sampling remaining
a part of the data-selection process.

Cluster Sampling
Cluster sampling is one of many probability-based variants on simple random sam-
pling. See Barnett (2002). The function sample() can be used as before, but now
the numbers from which a selection is made correspond to clusters. For example,
households or localities may be selected, with multiple individuals from each. Stan-
dard inferential methods then require adaptation to account for the fact that it is
the clusters that are independent, not the individuals within the clusters. Donner
and Klar (2000) describe methods that are designed for use in health research.
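A minimal sketch of the selection step follows. The sampling frame is simulated: the 200 households, their sizes, the seed, and all object names are hypothetical choices made for illustration only.

```r
set.seed(2024)                                   # arbitrary seed
## Hypothetical frame: 200 households (clusters), each with 1 to 5 members
hhsize <- sample(1:5, 200, replace=TRUE)
people <- data.frame(household=rep(1:200, times=hhsize))
people$person <- seq_len(nrow(people))
## Select 20 households at random, then take every member of each
chosen <- sample(1:200, 20)
clusterSample <- subset(people, household %in% chosen)
table(clusterSample$household)                   # Members per chosen household
```

Here the randomly selected units are the households. An analysis of clusterSample would need to treat the households, not the individuals, as the independent units.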

∗A Note on With-Replacement Samples


[Figure 1.1 image: Panel A (Dotplot) and Panel B (Boxplot); vertical axis: host species; horizontal axis: Length of egg (mm)]
Figure 1.1 Dotplot (Panel A) and boxplot (Panel B) displays of cuckoo egg
lengths. In Panel A, points that overlap have a more intense color. Means are
shown as +. The boxes in Panel B take in the central 50 percent of the data, from
25 percent of the way through the data to 75 percent of the way through. The
dot marks the median. Data are from Latter (1902).

For data that can be treated as a random sample from the population, one way to
get an idea of the extent to which it may be affected by random variation is to take
with-replacement random samples from the one available sample, and to do this
repeatedly. The distribution that results can be an empirical substitute for the use
of a theoretical distribution as a basis for inference.
We can randomly sample from the set {1, 2, . . . , 10}, allowing repeats, thus:
sample(1:10, replace=TRUE)

[1] 1 3 7 5 5 10 3 3 2 9

## sample(1:10, replace=FALSE) returns a random permutation of 1,2,...10

With-replacement sampling is the basis of bootstrap sampling. The effect is that
of repeating each value an infinite number of times, and then taking a without-
replacement sample. Subsections 1.8.3 and 1.8.4 will demonstrate the methodology.
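As a rough sketch of the idea, and not the code used in those subsections, the following resamples a small set of hypothetical measurements 1000 times and uses the spread of the resampled means as an empirical estimate of the standard error of the mean. The data values, seed, and number of resamples are all assumptions.

```r
set.seed(29)                         # arbitrary seed
## Hypothetical sample of 10 measurements
y <- c(21.7, 22.6, 20.9, 21.6, 22.2, 22.5, 22.2, 24.3, 22.3, 22.6)
## Each bootstrap sample is a with-replacement sample of the same size as y
bootMeans <- replicate(1000, mean(sample(y, replace=TRUE)))
sd(bootMeans)                        # Bootstrap estimate of the SE of the mean
sd(y)/sqrt(length(y))                # Usual theory-based estimate, for comparison
```

The two estimates should be broadly similar here, though the bootstrap value will vary somewhat with the seed.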

1.1.2 Formulating the Scientific Question


Questions should be structured with a view both to the intended use of results, and
to the limits of what the available data allow. Predictions of numbers in hospital
from COVID-19 two weeks into the future do not demand the same level of scientific
understanding or detailed data as needed to judge who among those infected are
most likely to require hospitalization.

Example: A Question About Cuckoo Eggs


Cuculus canorus is one of several species of cuckoos that lay eggs in the nests
of other birds. The eggs are then unwittingly adopted and hatched by the hosts.
Latter (1902) collected the data in DAAG::cuckoos as shown in Figure 1.1 in order
to investigate claims in Newton and Gadow (1896, p. 123) that the cuckoo eggs tend
to match the eggs of the host bird in size, shape, and color. Panel A is a dotplot
display of the raw data. Panel B is the more summary boxplot form of display (to
be discussed further in Section 1.1.5) that is designed to give a rough indication of
how variation between groups compares with variation within groups.1

Table 1.1 Mean lengths of cuckoo eggs, compared with mean lengths of eggs laid by
the host bird species. The table combines information from the two DAAG data
frames cuckoos and cuckoohosts.
Host species      Meadow pipit  Hedge sparrow  Robin      Wagtails   Tree pipit  Wren       Yellow hammer
Length (cuckoo)   22.3 (45)     23.1 (14)      22.5 (16)  22.6 (26)  23.1 (15)   21.1 (15)  22.6 (9)
Length (host)     19.7 (74)     20.0 (26)      20.2 (57)  19.9 (16)  20 (27)     17.7 (-)   21.6 (32)
(Numbers in parentheses are numbers of eggs)
Table 1.1 adds information that suggests a relationship between the size of the
host bird’s eggs and the size of the cuckoo eggs that were laid in that nest. Observe
that apart from several outlying egg lengths in the meadow pipit nests, the length
variability within each host species’ nest is fairly uniform.
In the paper (Latter, 1902) that supplied the cuckoo egg data of Figure 1.1 and
Table 1.1, the interest was in whether cuckoos do in fact match the eggs that they
lay to the host eggs, and if so, in assessing which features match and to what extent.
Uniquely among the birds listed, the architecture of wren nests makes it impossi-
ble for the host birds to see the cuckoo’s eggs, and the cuckoo’s eggs do not match
the wren’s eggs in color. For the other species the color does mostly match. Latter
concluded that the claim in Newton and Gadow (1896) is correct, that the eggs
that cuckoos lay tend to match the eggs of the host bird in ways that will make it
difficult for hosts to distinguish their own eggs from the cuckoo eggs.
Issues with the data in Table 1.1 and Figure 1.1 are as follows.

• The cuckoo eggs and the host eggs are from different nests, collected over the
course of several investigations. Data on the host eggs are from various sources.
• The host egg lengths for the wren are indicative lengths, from Gordon (1894).
There is thus a risk of biases, different for the different sources of data, that limit
the inferences that can be drawn. How large, then, relative to statistical variation,
is the difference between wrens and other species? Would it require an implausibly
large bias to explain the difference? A more formal comparison between lengths for
the different species based on an appropriate statistical model will be a useful aid
to informed judgment.
Stripped down code for Figure 1.1 is:
library(latticeExtra) # Lattice package will be loaded and attached also
cuckoos <- DAAG::cuckoos
## Panel A: Dotplot without species means added
dotplot(species ~ length, data=cuckoos) ## `species ~ length` is a 'formula'
## Panel B: Box and whisker plot
bwplot(species ~ length, data=cuckoos)
## The following shows Panel A, including species means & other tweaks
av <- with(cuckoos, aggregate(length, list(species=species), FUN=mean))
dotplot(species ~ length, data=cuckoos, alpha=0.4, xlab="Length of egg (mm)") +
  as.layer(dotplot(species ~ x, pch=3, cex=1.4, col="black", data=av))
# Use `+` to indicate that more (another 'layer') is to be added.
# With `alpha=0.4`, 40% is the point color with 60% background color
# `pch=3`: Plot character 3 is '+'; `cex=1.4`: Default char size X 1.4
1 Subsection A.5.1 has the code that combines the two panels, for display as one graph.

1.1.3 Planning for a Statistical Analysis


First steps in any coordinated scientific endeavor must include clear identification
of the question of interest, followed by careful planning. Consultation with subject-
matter specialists, as well as with specialists in statistical aspects of study design,
will help avoid obvious mistakes in any of the steps: designing the study, collecting
and/or collating data, carrying out analyses, and interpreting results.
If new data are to be acquired, one must decide if a designed experiment is
feasible. In human or animal experimentation, such as in clinical trials to test a
new drug therapy, ethics are an immediate concern. Data from experiments appear
throughout this text – examples are the data on the tinting of car windows that is
used for Figure 7.8 in Section 7.5, and the kiwifruit shading data that is discussed
in Subsection 1.3.2. Such data can, if the experiment has been well designed with
a view to answering the questions of interest, give reliable results. Always, the
question must be asked: “How widely do the results generalize?” For example, we
might be interested in knowing to what extent the results for the kiwifruit shading
conditions can be generalized to other locations with different soil types and weather
conditions.

Understand the Data


Most standard elementary statistical methods assume that sample values were all
chosen independently and with equal probability from the relevant population. If
the data were from an observational study, such as in the cuckoo eggs example of
Subsection 1.1.2, special care is required to consider what biases may have been
induced by the method of data collection, and to ensure that they do not lead
to incorrect conclusions.
Temporal and spatial dependence are common forms of departure from indepen-
dence, often leading to more complicated analyses. Data points originating from
points that are close together in time and/or space are often more similar. Tests
and graphical checks for dependence are necessarily designed to detect specific forms
of dependence. Their effectiveness relies on recognizing forms of dependence that
can be expected in the specific context.
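As a minimal sketch of one such check, the following compares the lag 1 sample autocorrelation of independent values with that of a series in which neighboring values are alike. The data are simulated, and the autoregressive coefficient 0.7 and all object names are assumptions made for illustration only.

```r
## Sketch: a lag-1 autocorrelation check, using simulated (hypothetical) data
set.seed(17)
n <- 500
indep <- rnorm(n)                                    # independent values
dep <- as.numeric(arima.sim(list(ar = 0.7), n = n))  # AR(1): neighbors are alike
## acf() with plot=FALSE returns the autocorrelations; element 2 is lag 1
r1_indep <- acf(indep, plot = FALSE)$acf[2]
r1_dep <- acf(dep, plot = FALSE)$acf[2]
round(c(independent = r1_indep, dependent = r1_dep), 2)
## Approximate 95% limits for the autocorrelation of independent data
c(-2, 2) / sqrt(n)
```

A check of this kind targets one specific form of dependence (between successive values). It would not, for example, detect clustering by location.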
If the data were acquired earlier and for a different purpose, details of the cir-
cumstances that surrounded the data collection are especially important. Were they
from a designed experiment? If so, how was the randomization carried out? What
factors were controlled? Was there a hierarchical structure to the data, such as
would occur in a survey of students, randomly selected from classes, which are
themselves randomly selected from schools, and so on? If the data were collected
as part of an observational study, such as in the cuckoos example of Subsection
1.1.2, special care is required to ensure that hidden biases induced by the method
of data collection do not lead to incorrect conclusions. Biases are likely when data
are obtained from “convenience” samples that have the appearance of surveys but
which are really poorly designed observational studies. Online voluntary surveys
are of this type. Similar biases can arise in experimental studies if care is not taken.
For example, an agricultural experimenter may pick one plant from each of several
parts of a plot. If the choice is not made according to an appropriate randomization
mechanism, a preference bias can easily be introduced.
Nonresponse, so that responses are missing for some respondents, is endemic in
most types of sample survey data. Or responses may be incomplete, with answers
not provided to some questions. Dietary studies based on the self-reports of partic-
ipants are prone to measurement error biases. With experimental data on crop or
fruit yields, results may be missing for some plots because of natural disturbances
caused by animals or harsh weather. One ignores the issue at a certain risk, but
treating the problem is nontrivial, and the analyst is advised to determine as well
as possible the nature of the missingness. It can be tempting simply to replace
a missing height value for a male adult in a dataset by the average of the other
male heights. Such a single imputation strategy will readily create unwanted bi-
ases. Males that are of smaller than average weight and chest measurement are
likely to be of smaller than average height. Multiple imputation is a generic name
for methodologies that, by matching incomplete observations as closely as possible
to other observations on the variables for which values are available, aim to fill in
the gaps.
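The effect of single (mean) imputation can be seen in a small simulation. The following is a sketch only: the heights and weights are simulated, and the missingness mechanism (lighter men more likely to have a missing height) and all names are assumptions chosen for illustration.

```r
## Sketch: single (mean) imputation with simulated, hypothetical data
set.seed(29)
n <- 200
weight <- rnorm(n, mean = 78, sd = 10)                  # kg
height <- 175 + 0.5 * (weight - 78) + rnorm(n, sd = 5)  # cm; related to weight
## Heights are made more likely to be missing for lighter men
miss <- runif(n) < plogis(-(weight - 70) / 5)
obs <- ifelse(miss, NA, height)
## Replace each missing height by the mean of the observed heights
imputed <- ifelse(miss, mean(obs, na.rm = TRUE), obs)
c(true_mean = mean(height), imputed_mean = mean(imputed))
c(observed_sd = sd(obs, na.rm = TRUE), imputed_sd = sd(imputed))
c(true_cor = cor(weight, height), imputed_cor = cor(weight, imputed))
```

Filling gaps with the mean of the observed values necessarily makes the standard deviation smaller than that of the observed values, and it ignores what the weights say about the likely heights.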

Causal Inference
With data from carefully designed experiments, it is often possible to infer causal
relationships. Perhaps the most serious danger is that the results will be generalized
beyond the limits imposed by the experimental conditions.
Observational data, or data from experiments where there have been failures
in design or execution, is another matter. Correlations do not directly indicate
causation. A and B may be correlated because A drives B, or because B drives A,
or because A and B change together, in concert with a third variable. For inferring
causation, other sources of evidence and understanding must come into play.
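The third possibility can be illustrated with simulated data. In this sketch, which uses hypothetical variables and arbitrary coefficients, a variable z drives both A and B, and neither of A and B has any effect on the other.

```r
## Sketch: correlation induced by a third variable (simulated, hypothetical data)
set.seed(41)
n <- 5000
z <- rnorm(n)           # third variable that drives both A and B
A <- 2 * z + rnorm(n)   # A depends on z, not on B
B <- 2 * z + rnorm(n)   # B depends on z, not on A
cor(A, B)               # substantial, although neither drives the other
## Correlation that remains once the effect of z is removed from each
cor(resid(lm(A ~ z)), resid(lm(B ~ z)))
```

Here the adjustment is possible only because z has been observed. With real observational data, the relevant third variable may be unknown or unmeasured.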

What Was Measured? Is It the Relevant Measure?


The DAAG::science and DAAG::socsupport data frames are both from surveys.
The former concerns student attitudes towards science in Australian private and
public school systems. The latter concerns social and emotional support resources
as they might relate to psychological depression in a sample of individuals.
In either case it is necessary to ask: “What was measured?” This question is
itself amenable to experimental investigation. For the dataset science, what did
students understand by “science”? Was science, for them, a way to gain and test
knowledge of the world? Or was it a body of knowledge? Or, more likely, was it
a label for their experience of science laboratory classes (interesting sights, smells
and sounds perhaps) and field trips? Answers to other questions included in the
survey shed some limited light.
In the socsupport dataset, an important variable is the Beck Depression Inven-
tory or BDI, which is based on a 21-question multiple-choice self-report. It is the
outcome of a rigorous process of development and testing. Since its first publication
in 1961, it has been extensively used, critiqued, and modified. Its results have been
well validated, at least for populations on which it has been tested. It has become
a standard psychological measure of depression (see, e.g., Streiner et al., 2014).
For therapies that are designed to prolong life, what is the relevant measure? Is
it survival time from diagnosis? Or is a measure that takes account of quality of
life over that time more appropriate? Two such measures are “Disability Adjusted
Life Years” (DALYs) and “Quality Adjusted Life Years” (QALYs). Quality of life
may differ greatly between the therapies that are compared.

Use Relevant Prior Information in the Planning Stages


Information from the analysis of earlier data may be invaluable both for the design
of data collection for the new study and for planning data analysis. When prior data
are not available, a pilot study involving several experimental runs can sometimes
provide such information.
Graphical and other checks are needed to identify obvious mistakes and/or quirks
in the data. Graphs that draw attention to inadequacies may be suggestive of
remedies. For example, they may indicate a need to numerically transform the
data, such as by taking a logarithm or square root, in order to more accurately
meet the assumptions underlying a more formal analysis. At the same time, one
should keep in mind the risk that use of the data to influence the analysis may bias
results.

Subject Area Knowledge and Judgments


Data analysis results must be interpreted against a background of subject area
knowledge and judgment. Some use of qualitative judgment is inevitable, relating
to such matters as the weight that can be placed on claimed subject area knowledge,
the measurements that are taken, the details of study design, the analysis choices,
and the interpretation of analysis results. These, while they should be as informed as
possible, involve elements of qualitative judgment. A well-designed study will often
lead to results that challenge the insights and understandings that underpinned the
planning.

The Importance of Clear Communication


When there are effective lines of communication, the complementary skills of a data
analyst and a subject matter expert can result in effective and insightful analyses.
When unclear about the question of interest, or about some feature of the data,
analysts should be careful not to appear to know more than is really the case. The
subject-matter specialist may be so immersed in the details of their problem that,
without clear signals to the contrary, they may assume similar knowledge on the
part of the analyst.


Data-Based Selection of Comparisons


In carefully designed studies where subjects have been assigned to different groups,
with each group receiving a different treatment, comparisons of outcomes between
the various groups, and of subgroups within those groups (e.g., female/male, old/
young) will be of interest. Among what may be many possible comparisons, the
comparisons that will be considered should be specified in advance. Prior data, if
available, can provide guidance. Any investigation of other comparisons may be
undertaken as an exploratory investigation, a preliminary to the next study.
Data-based selection of one or two comparisons from a much larger number is
not appropriate, since large biases may be introduced. Alternatively, there must be
allowance for such selection in the assessment of model accuracy. The issues here
are nontrivial, and we defer further discussion until later.
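A simulation gives a sense of the size of the bias. The sketch below assumes 20 independent two-group comparisons in data where there are no real differences, and keeps only the comparison with the smallest p-value. The numbers of comparisons, observations, and simulations are arbitrary choices made for illustration.

```r
## Sketch: the effect of choosing the "best" of many comparisons
## (simulated data in which no real differences exist)
set.seed(53)
n_comparisons <- 20   # hypothetical number of candidate comparisons
n_per_group <- 15
n_sim <- 500
min_p <- replicate(n_sim, {
  p <- replicate(n_comparisons,
                 t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
  min(p)              # keep only the most "significant" comparison
})
## Proportion of simulated studies where the selected comparison has p < 0.05
mean(min_p < 0.05)
## Value expected for independent comparisons
1 - 0.95^n_comparisons
```

The selected comparison appears "significant" at the 5 percent level far more often than 5 percent of the time, which is the bias that has to be allowed for.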

Models Must Be Fit for Their Intended Use


Statistical models must, along with the data upon which they rely, be applied ac-
cording to their intended use. Architects and engineers have in the past relied heav-
ily on scale models for giving a sense of important features of a planned building.
For checking routes through the building, for the plumbing as well as for humans,
such models can be very useful. They will not give much insight on how buildings
in earthquake-prone regions are likely to respond to a major earthquake – a lively
concern in Wellington, New Zealand, where the first author now lives. For that
purpose, engineers use mathematical equations that are designed to reflect the rel-
evant physical processes. The credibility of predictions will strongly depend on the
accuracy with which the models can be shown to represent those processes.

1.1.4 Results That Withstand Thorough and Informed Challenge


Statistical models aim to give real-world descriptions that are adequate for the
purposes for which the model will be used. What checks will give confidence that
a model will do the task asked of it? As argued in Tukey (1997), there must be
exposure to diverse challenges that can build (or destroy!) confidence in model-
based inferences. We should trust those results that have withstood thorough and
informed challenge.
A large part of our task in this text is to suggest effective forms of challenge.
Specific types of challenge may include the following.

• For experiments, carefully check and critique the design.


• Look into what is known of the processes that generated the data, and consider
critically how this may affect its use and the reliance placed on it. Are there
possible or likely biases?
• Look for inadequacies in laboratory procedure.
• Use all relevant graphical or other summary checks to critique the model that
underpins the analysis.
• Where possible, check the performance of the model on test data that reflects
the manner of use of results. (If, for example, predictions are made that will be
applied a year into the future, check how predictions made a year ahead panned
out for historical data.)
• For experimental data, have the work replicated independently by another re-
search group, from generation of data through to analysis.
In areas where the nature of the work requires cooperation between scientists
with a wide range of skills, and where data are shared, researchers provide checks
on each other. For important aspects of the work, the most effective critiques are
likely to come from fellow researchers rather than from referees who are inevitably
more remote from the details of what has been done. Failures of scientific processes
are a greater risk where scientists work as individuals or in small groups with limited
outside checks.
There are commonalities with the issues of legal and medical decision making that
receive extensive attention in Kahneman et al. (2021). On the benefits of
“averaging,” that is, using the perspectives of multiple judges as a basis for decision
making when sentencing, the authors comment (p. 372):
The advantage of averaging is further enhanced when judges have diverse skills and com-
plementary judgment patterns.

Also needed is a high level of shared understanding.


For observational data, the challenges that are appropriate will depend strongly
on the nature of the claims made as a result of any analysis. Dangers of over-
interpretation and/or misinterpretation of results gleaned from observational data
will be exemplified later in the text.

1.1.5 Using Graphs to Make Sense of Data


Ideas of Exploratory Data Analysis (EDA), as formalized by John W. Tukey, have
been a strong influence in the development of many of the forms of graphical display
that are now in wide use. See Hoaglin (2003). A key concern is that the data should
as far as possible speak for itself, prior to or as part of a formal analysis.
A use of graphics that is broadly in an EDA tradition continues to develop and
evolve. The best modern statistical software makes a strong connection between
data analysis and graphics, combining the computer’s ability to crunch numbers
and present graphs with that of a trained human eye to detect pattern. Statistical
theory has an important role in suggesting forms of display that may be helpful
and interpretable.

Graphical Comparisons
Figure 1.1 was a graphical comparison between the lengths of cuckoo eggs that had
been laid in the nests of different host species. The boxes that give boxplots their
name focus attention on quartiles of the data, that is, the three points on the axis
that split the data into four equal parts. The lower end of the box marks the first
quartile, the dot marks the median, and the upper end of the box marks the third
quartile. Points that lie out beyond the “whiskers” are plotted individually, and are
candidates to be considered outliers. The widths of the boxes will of course vary
randomly, leading in some cases to the flagging of points that should not be treated
as extreme. The narrow box may largely account for the five values that are flagged
for meadow pipit.
Figure 1.1 strongly suggested that eggs planted in wrens’ nests were substantially
smaller than eggs planted in other birds’ nests. The upper quartile (75 percent
point) for eggs in wrens’ nests lies below all the lower quartiles for other eggs.

1.1.6 Formal Model-Based Comparison


For comparing lengths between species in the cuckoo eggs data, we use the model:
Egg length = Mean for species + Random variation.
The means in the dataset cuckoos are:
av <- with(cuckoos, aggregate(length, list(species=species), FUN=mean))
setNames(round(av[["x"]],2), abbreviate(av[["species"]],10))

hedgsparrw meadowpipt piedwagtal      robin tree pipit       wren
     23.11      22.29      22.89      22.56      23.08      21.12

The model postulates that the length of a cuckoo egg found in a given nest de-
pends in some way on the host species. There are likely to be additional factors
that have not been observed but which also influence the egg length. The variation
due to these unobserved factors is aggregated into one term which is referred to
as statistical error or random variation. Where none of these unobserved factors
predominates and their effects add, a normal distribution will often be effective as a
model for the random variation.
The species means are estimated from the data and are called fitted values. The
differences between the data values and those means are called residuals. For example,
suppose $\ell_i$ is the length of the $i$th egg in the nest of a wren, and $\bar{\ell}$ is the
average length of all eggs in the wrens’ nests. Then the $i$th residual for this group is
$e_i = \ell_i - \bar{\ell}$.
The scale() function provides a convenient way to calculate such residuals; its
usage below centers the data by subtracting the average from each data point.
Thus, the residuals for the wren length model are:
with(cuckoos, scale(length[species=="wren"], scale=FALSE))[,1]

[1] -1.32 0.98 0.38 -0.22 0.88 -0.12 1.18 -0.12 -0.82 -0.22 0.88
[12] -1.12 -0.32 0.08 -0.12

Is the variability different for different species? The boxes in Figure 1.1, with
endpoints set for each species to contain the central 50 percent of the data, hint
that variation may be greater for the pied wagtail than for other species. (The box
widths equal the inter-quartile range, or IQR. See further, Subsection 1.3.4.)
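Fitted values, residuals, and a measure of spread can be obtained for all groups in one step. The sketch below uses the PlantGrowth data frame that ships with R, rather than the cuckoos data, so that it runs without additional packages. The function ave() is used here as an alternative to the scale() approach shown above.

```r
## Sketch: fitted values, residuals and spread for every group at once,
## using the PlantGrowth data that ships with R (not the cuckoos data)
pg <- datasets::PlantGrowth
fitted_vals <- with(pg, ave(weight, group))  # group mean, repeated per row
resids <- pg$weight - fitted_vals            # data value minus group mean
## Residuals sum to zero (up to rounding error) within each group
round(tapply(resids, pg$group, sum), 10)
## Inter-quartile range of the response, by group
with(pg, tapply(weight, group, IQR))
```

With the DAAG package installed, the same pattern applied to the cuckoo eggs would be `with(DAAG::cuckoos, length - ave(length, species))`.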

[Figure 1.2: two panels that share the horizontal axis “Total length of female possums (cm)”. Panel A: Boxplot, with annotation added (smallest value, outliers excepted; lower quartile; median; upper quartile; largest value; one point marked “Outlier?”). Panel B: Density curve, with histogram overlaid.]

Figure 1.2 Panel A shows a boxplot, with annotation that explains boxplot features.
Panel B shows a density plot, with a histogram overlaid. Histogram frequencies
are shown on the right axis of Panel B. In both panels, the individual
data points appear as a “rug” along the lower side of the bounding box. Where
necessary, they have been moved slightly apart to avoid overlap.

1.2 Graphical Tools for Data Exploration


In this section, we illustrate basic approaches to the graphical exploration of data.
Three R static graphics systems enjoy wide use. These are: base (or “traditional”)
graphics using plot() and associated commands, lattice which offers more stylized
types of graphs, and ggplot2 whose rich array of features comes at the cost of extra
graphics language complexity.
Later chapters will make extensive use both of base graphics and of lattice graph-
ics, resorting to ggplot2 on those occasions when features are needed that are not
readily available in the other packages. Some lattice graphs will be printed in a style
(set by using a theme) akin to the default ggplot2 style. Section A.5 has further details.

1.2.1 Displays of a Single Variable


A basic form of display for a single numeric variable is the dotplot, which plots the
individual data points along a number line or single axis. The boxplot provides a
coarser summary of univariate data. The histogram and density curve offer more
fine-grained alternatives.
Figure 1.2A shows a boxplot of total lengths of females in the possum dataset,
with annotation added that explains the interpretation of boxplot features. Fig-
ure 1.2B shows a density curve, with a histogram overlaid, for the same data. Both
panels contain rug plots which are essentially dotplots consisting of vertical bars
added along the lower edge.


One data point lies outside the boxplot “whiskers” to the left, and is flagged
as a possible outlier. An outlier is a point that is determined to be far from the
main body of the data. Under the default criterion, about 1 percent of normally
distributed data would be judged as outlying.
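The proportion can be checked directly. The sketch below assumes the usual default rule, under which points more than 1.5 times the inter-quartile range beyond the quartiles are flagged (the `coef = 1.5` default of boxplot.stats()).

```r
## Sketch: fraction of normally distributed values flagged by the default rule
## (points more than 1.5 x IQR beyond the quartiles)
q <- qnorm(c(0.25, 0.75))      # quartiles of the standard normal
fence <- q[2] + 1.5 * diff(q)  # upper fence; by symmetry, lower fence is -fence
p_flag <- 2 * pnorm(-fence)    # probability of falling outside either fence
round(100 * p_flag, 2)         # as a percentage
```

The result is a little under 1 percent. This is the figure for the theoretical quartiles. In small samples, where the quartiles are themselves estimated, the proportion flagged will vary.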
A histogram is a crude form of density estimate. A smooth density estimate is,
often, a better alternative. The height of the density curve at any point is an esti-
mate of the proportion of sample values per unit interval, locally at that point. Both
histograms and density curves involve an element of subjective choice. Histograms
require the choice of breakpoints, while density estimates require the choice of a
bandwidth parameter that controls the amount of smoothing. In both cases, the
software has default choices that should be used with care.
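The following sketch shows how much these choices can matter. The data are simulated (40 hypothetical measurements), and the particular breakpoints and bandwidth multipliers are arbitrary choices made for illustration.

```r
## Sketch: sensitivity of histograms and density estimates to user choices
## (simulated, hypothetical measurements)
set.seed(67)
x <- rnorm(40, mean = 87, sd = 4)
## Two choices of breakpoints for the same data
h1 <- hist(x, breaks = seq(60, 115, by = 5), plot = FALSE)
h2 <- hist(x, breaks = seq(62.5, 112.5, by = 5), plot = FALSE)
h1$counts
h2$counts
## Two choices of bandwidth: `adjust` multiplies the default bandwidth
d_narrow <- density(x, adjust = 0.5)
d_wide <- density(x, adjust = 2)
c(narrow = d_narrow$bw, wide = d_wide$bw)
## Overlay the two density estimates, with a rug of the data points
plot(d_wide, main = "", ylim = range(0, d_narrow$y))
lines(d_narrow, lty = 2)
rug(x)
```

Shifting the breakpoints by half a bin width changes the apparent shape of the histogram, while the narrow bandwidth gives a much bumpier curve than the wide one.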
Code for a slightly simplified version of Figure 1.2B is:
fossum <- subset(DAAG::possum, sex=="f")
densityplot(~totlngth, plot.points=TRUE, pch="|", data=fossum) +
  layer_(panel.histogram(x, type="density", breaks=c(75,80,85,90,95,100)))

Comparing Univariate Displays Across Factor Levels

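[Figure 1.3: boxplots with dotplots overlaid, in separate panels for females (f) and males (m). Within each panel, the rows are “other” and “Vic”. Horizontal axis: Total length (cm).]

Figure 1.3 Total lengths of possums, by sex and (within panels) by geographical
location (Victorian or other).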

Univariate summaries can be broken down by one or more factors between and/or
within panels. Figure 1.3 overlays dotplots on boxplots of the distributions of Aus-
tralian possum lengths, broken down by sex and (within panels) by geographical
region (Victoria or other).
## Create boxplot graph object --- Simplified code
gph <- bwplot(Pop ~ totlngth | sex, data=DAAG::possum)
## plot graph, with dotplot distribution of points below boxplots
gph + latticeExtra::layer(panel.dotplot(x, unclass(y)-0.4))

The normal distribution is not necessarily the appropriate reference. Points may
be identified as outliers because the distribution is skew (usually, with a tail to the
right). Any needed action will depend on the context, requiring the user to exercise
good judgement. Subsection 1.2.8 will comment in more detail.


1.2.2 Patterns in Univariate Time Series


Figure 1.4 shows time plots of historical deaths from measles in London. (Here,
“measles” includes both what is nowadays called measles and the closely related
rubella or German measles.)

[Figure 1.4: time plots in two panels. Panel A (1629–1939): deaths and population against year, on a logarithmic vertical scale. Panel B (1841–1881): deaths and population (“Pop (1000s)”) against year, on a linear vertical scale.]

Figure 1.4 The two panels provide different insights into data on mortality from
measles, in London over 1629–1939. Panel A uses a logarithmic scale to show
the numbers of deaths from measles in London for the period from 1629 through
1939 (black curve). The black dots show, for the period 1800 to 1939, the London
population in thousands. Panel B shows, on the linear scale (black curve), the
subset of the measles data for the period 1840 through 1882 together with the
London population (in thousands, black dots).

Panel A uses a logarithmic vertical scale while Panel B uses a linear scale and takes
advantage of the fact that annual deaths from measles were of the order of one in
500 of the population. Thus, deaths in thousands and population in half millions
can be shown on the same scale.
Panel A shows broad trends over time, but is of no use for identifying changes
on the time-scale of a year or two. In Panel B, the lines that show such changes
are, mostly, at an angle that is in the approximate range of 20° to 70° from the
horizontal. A sawtooth pattern is evident, indicating that years in which there were
many deaths were often followed by years in which there were fewer deaths. To
obtain this level of detail for the whole period from 1629 until 1939, multiple panels
would be necessary.

[Figure 1.5: brain weight (unit=100g) plotted against body weight (unit=100kg). Panel A: Linear scales. Panel B: Logarithmic scales.]

Figure 1.5 Brain weight versus body weight, for 28 animals that vary greatly in
size. Panel A has untransformed scales, while Panel B has logarithmic scales, on
both axes.

Simplified code for Figure 1.4 is:


measles <- DAAG::measles
## Panel A
plot(log10(measles), xlab="", ylim=log10(c(1,5000*540)),
     ylab=" Deaths; Population", yaxt="n")
ytiks1 <- c(1, 10, 100, 1000); ytiks2 <- c(1000000, 5000000)
## London population in thousands
londonpop <- ts(c(1088,1258,1504,1778,2073,2491,2921,3336,3881,
                  4266,4563,4541,4498,4408), start=1801, end=1931, deltat=10)
points(log10(londonpop*600), pch=16, cex=.5)
abline(h=log10(ytiks1), lty = 2, col = "gray", lwd = 2)
abline(h=log10(ytiks2*0.5), lty = 2, col = "gray", lwd = 2)
axis(2, at=log10(ytiks1), labels=paste(ytiks1), lwd=0, lwd.ticks=1)
axis(2, at=log10(ytiks2*0.5), labels=paste(ytiks2), tcl=0.3,
     hadj=0, lwd=0, lwd.ticks=1)
## Panel B
plot(window(measles, start=1840, end=1882), ylim=c(0, 4600), yaxt="n")
points(londonpop, pch=16, cex=0.5)
axis(2, at=(0:4)*1000, labels=paste(0:4), las=2)

For details of the data, and commentary, see Guy (1882), Stocks (1942), and
Senn (2003) where interest was in the comparison with smallpox mortality. The
population estimates (londonpop) are from Mitchell (1988).

1.2.3 Visualizing Relationships Between Pairs of Variables


Patterns and relationships linking multiple variables are a primary focus of data
analysis. The following example is concerned with the relationship between two
variables and illustrates an important question that often arises: What is the ap-
propriate scale?
Figures 1.5A and B plot brain weight (g) against body weight (kg), for 28 animals.
Panel A indicates that the distributions of data values are highly positively skew,
on both axes, but is otherwise unhelpful. Panel B’s logarithmic scales spread points
out more evenly, and the graph tells a clearer story. Note that, on both axes,
successive tick marks are separated by an amount that, when translated back from
log(weight) to weight, corresponds to a factor of 10. The argument aspect="iso" has
ensured that these correspond to the same physical distance on both axes of the
graph. Code is:
## Untransformed vs log transformed scales
Animals <- MASS::Animals
asp <- with(Animals, sapply(list(log(brain/100), log(body/100)),
                            function(x)diff(range(x)))) |> (\(d)d[1]/d[2])()
xlab <- "Body weight (unit=100kg)"; ylab <- "Brain (unit=100g)"
gphA <- xyplot(I(brain/100) ~ I(body/100), data=Animals, aspect=asp,  # Panel A
               xlab=xlab, ylab=ylab)
gphB <- xyplot(log(brain/100) ~ log(body/100), data=MASS::Animals,    # Panel B
               aspect='iso', xlab=xlab, ylab=ylab)
labx <- 10^c((-3):3); laby <- 10^c((-2):2)
gphB <- update(gphB, scales=list(x=list(at=log(labx), labels=labx, rot=20),
                                 y=list(at=log(laby), labels=laby)))

[Figure 1.6: plot of distance traveled (cm) against distance up ramp (cm), with the data listed alongside:]

Starting point   Distance traveled
       3         31.38   30.38   33.63
       6         26.63   25.75   27.13
       9         18.75   22.50   21.63
      12         13.88   11.75   14.88

Figure 1.6 Distance traveled (distance.traveled) by model car, as a function
of starting point (starting.point), up a 20° ramp.

A logarithmic scale is appropriate for quantities that change multiplicatively.


Thus, if cells in a growing organism divide and produce new cells at a constant
rate, then the total number of cells changes multiplicatively, resulting in what is
termed exponential growth. Large organisms may similarly increase in a given time
interval by the same approximate fraction as smaller organisms. Growth rate on a
natural logarithmic scale ($\log_e$) equals the relative growth rate.
Anyone who works with real data – biologists, economists, physical scientists –
will do well to make themselves comfortable with the use and interpretation of
logarithmic scales. See Subsection 2.5.6 for a brief discussion of other commonly
used transformations.
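A few lines of R make the connection between multiplicative change and the logarithmic scale concrete. The cell counts below are hypothetical, with growth fixed at 5 percent per time step.

```r
## Sketch: multiplicative change becomes additive on a logarithmic scale
## (hypothetical cell counts that grow by 5% per time step)
cells <- 1000 * 1.05^(0:10)
diff(cells)        # absolute increases grow over time
diff(log(cells))   # constant on the log_e scale
log(1.05)          # close to the relative growth rate of 0.05
```

The constant step on the natural logarithmic scale is close to, though slightly less than, the 5 percent relative increase. The two agree more closely as the relative change becomes smaller.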

1.2.4 Response Lines (and/or Curves)


The data shown on the right-hand side of Figure 1.6, and plotted in the figure, were
generated by releasing a model car three times at each of four different distances
(starting.point) up a 20° ramp. The experimenter recorded distances traveled
from the bottom of the ramp across a concrete floor. Response curve analysis, using
regression, is appropriate. It would be a mistake to treat the four starting points
as factor levels in a one-way analysis. Data are available in DAAG::modelcars.


For these data, the physics suggests the likely form of response. Where no such
help is available, careful examination of the graph, followed by systematic examina-
tion of plausible forms of response, may suggest a suitable form of response curve.
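As a sketch of what treating the starting point as a numeric variable looks like, the following enters the twelve values listed in Figure 1.6 and fits a straight line, followed by a simple check for curvature. The straight line is an assumption made here as a first candidate. It is not claimed to be the form that the physics suggests.

```r
## Sketch: starting point as a numeric predictor, not as a factor
## Data values are those listed in Figure 1.6
modelcars <- data.frame(
  starting.point = rep(c(3, 6, 9, 12), each = 3),
  distance.traveled = c(31.38, 30.38, 33.63, 26.63, 25.75, 27.13,
                        18.75, 22.50, 21.63, 13.88, 11.75, 14.88))
line_fit <- lm(distance.traveled ~ starting.point, data = modelcars)
coef(line_fit)
## A quadratic term gives one simple check for curvature
curve_fit <- lm(distance.traveled ~ poly(starting.point, 2), data = modelcars)
anova(line_fit, curve_fit)
```

A one-way analysis with four factor levels would use three degrees of freedom to describe the means and would ignore the ordering and spacing of the starting points. The response line uses that information directly.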

1.2.5 *Multiple Variables and Times


Overlaying plots of several time series (sequences of measurements taken at regular
intervals) might seem appropriate for making direct comparisons. However, this
approach will only work if the scales are comparable for the different series.
Figures 1.7A and B show alternative views of labor force numbers (thousands),
for various regions of Canada, at monthly intervals over the 24-month period from
January 1995 to December 1996. Over this time, Canada was emerging from a deep
economic recession. The ranges of values, for each of the six regions, are:
## Apply function range to columns of data frame jobs (DAAG)
sapply(DAAG::jobs, range) ## NB: `BC` = British Columbia

       BC Alberta Prairies Ontario Quebec Atlantic  Date
[1,] 1737    1366      973    5212   3167      941 95.00
[2,] 1840    1436      999    5360   3257      968 96.92

With a logarithmic scale, as in Figure 1.7A, similar changes on the scale corre-
spond to similar proportional changes. The regions have been taken in order of the
number of workers in December 1996 (or, in fact, at any other time). This ensures
that the order of the labels in the key matches the positioning of the points for the
different regions. Code that creates the graphics object basicGphA, then
updates it to obtain the labeling on the x- and y-axes, is:
## Panel A: Basic plot; all series in a single panel; use log y-scale
formRegions <- Ontario+Quebec+BC+Alberta+Prairies+Atlantic ~ Date
basicGphA <-
  xyplot(formRegions, outer=FALSE, data=DAAG::jobs, type="l", xlab="",
         ylab="Number of workers", scales=list(y=list(log="e")),
         auto.key=list(space="right", lines=TRUE, points=FALSE))
## `outer=FALSE`: plot all columns in one panel
## Create improved x- and y-axis tick labels; will update to use
datelabpos <- seq(from=95, by=0.5, length=5)
datelabs <- format(seq(from=as.Date("1Jan1995", format="%d%b%Y"),
                       by="6 month", length=5), "%b%y")
## Now create $y$-labels that have numbers, with log values underneath
ylabposA <- exp(pretty(log(unlist(DAAG::jobs[,-7])), 5))
## Assumed definition of `ylabelsA` (not shown in the simplified code),
## chosen to match labels such as "4915 (8.5)" in Figure 1.7A
ylabelsA <- paste0(round(ylabposA), "\n(", round(log(ylabposA), 2), ")")
gphA <- update(basicGphA, scales=list(x=list(at=datelabpos, labels=datelabs),
                                      y=list(at=ylabposA, labels=ylabelsA)))

Because the labor forces in the various regions do not have similar sizes, it is
impossible to discern any differences among the regions from this plot. Plotting
on the logarithmic scale was not enough on its own. Figure 1.7B, where the six
different panels use different slices of the same logarithmic scale, is an informative
alternative. Simplified code is:
## Panel B: Separate panels (`outer=TRUE`); sliced log scale
basicGphB <-
  xyplot(formRegions, data=DAAG::jobs, outer=TRUE, type="l", layout=c(3,2),
         xlab="", ylab="Number of workers",
         scales=list(y=list(relation="sliced", log=TRUE)))

[Figure 1.7: number of workers against date (Jan95 to Jan97), with tick labels that show numbers with $\log_e$ values in parentheses underneath. Panel A: Same vertical log scale, all six regions (Ontario, Quebec, BC, Alberta, Prairies, Atlantic) in one panel. Panel B: Sliced vertical log scale, with a separate panel for each region.]

Figure 1.7 Data are labor force numbers (thousands) for various regions of
Canada, at monthly intervals over 1995–1996. Panel A uses the same logarithmic
y-scale for all regions. Panel B shows the same data, but now with separate
(“sliced”) logarithmic y-scales on which the same percentage increase, for example,
by 1 percent, corresponds to the same distance on the scale, for all plots.
Distances between ticks are 0.02 on the $\log_e$ scale, that is, a change of close to 2
percent.

Use of outer=TRUE causes separate columns (regions) to be plotted on separate
panels. As before, equal distances on the scale correspond to equal relative changes.
It is now clear that Alberta and BC experienced the fastest job growth and that
there was little or no job growth in Quebec and the Atlantic region.
The following are the changes in numbers employed, in each of Alberta and BC,
from January 1995 to December 1996. The changes are shown in actual numbers,
and on scales of log_2, log_e, and log_10. Figure 1.8 shows this graphically.
                                                        Increase
                                        Rel. change   log_2   log_e   log_10
Alberta (1366 to 1436; increase=70)        1.051      0.072   0.050   0.022
BC      (1752 to 1840; increase=88)        1.050      0.070   0.049   0.021

[Figure 1.8: a single axis, with ticks at 1250, 1500, and 1800, on which the four
values are labeled under each choice of logarithmic scale:]

Value      1366       1436       1752       1840
log=2      2^10.42    2^10.49    2^10.78    2^10.85
log="e"    e^7.220    e^7.270    e^7.469    e^7.518
log=10     10^3.135   10^3.157   10^3.244   10^3.265

Figure 1.8 Labeling of the values for Alberta (1366, 1436) and BC (1752, 1840),
with alternative logarithmic scale choices.

Between the beginning of 1995 and the end of 1996, the increase of 70 in Alberta
from 1366 to 1436 is by a factor of 1436/1366 ≃ 1.051. For BC, an increase by 88
from 1752 to 1840 is by a factor of 1.050. The proper comparison is not between
the absolute increases, but between very nearly identical multipliers of 1.051 and
1.050.
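The comparison can be checked directly. The following is a minimal sketch in base R that uses only the four labor force values quoted in the text; the object names (start, end, relchange, changes) are chosen here for illustration.

```r
## Labor force numbers (thousands), as quoted in the text
start <- c(Alberta=1366, BC=1752)    # January 1995
end   <- c(Alberta=1436, BC=1840)    # December 1996
relchange <- end/start               # multiplier over the two years
## Absolute increase, multiplier, and the change on three log scales
changes <- cbind(increase=end-start, relchange=relchange,
                 log2=log2(relchange), loge=log(relchange),
                 log10=log10(relchange))
print(round(changes, 3))
```

The change on any log scale is the log of the multiplier, so the three log columns differ from one another only by a constant factor.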
Even better than using a logarithmic y-scale, particularly if ready comprehen-
sion is important, would be to standardize the labor force numbers by dividing,
for example, by the respective number of persons aged 15 years and over at that
time. Scales would then be directly comparable. (The plot method for time se-
ries could then suitably be used to plot the data as a multivariate time series. See
?plot.ts.)

1.2.6 ∗ Labeling Technicalities


For lattice functions, the arguments log=2 or log="e" or log=10 are available.
The latter two scales are referred to as natural and common log scales, respectively.
These use the relevant logarithmic axis labeling, as in Figure 1.8, for axis labels.
In base graphics, with one of the arguments log="x" or log="y" or log="xy", the
default is to label the specified axis or axes in the original units.
An alternative, both for traditional and lattice graphics, is to enter the log-
transformed values, using whatever base is preferred (2 or "e" or 10), into the
graphics formula. Unless other tick labels are manually entered, ticks will be auto-
matically transformed to the correct scale.
Note again the reason for placing y-axis tick marks a distance 0.02 apart on the
log_e linear scale used in Figure 1.7. On the log_e scale a change of 0.02 is very nearly
a 2 percent change.
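The arithmetic behind this statement can be sketched as follows. The tick positions 7.22, 7.24, and 7.26 are those shown for Alberta in Figure 1.7B; the object names are illustrative only.

```r
## A step of 0.02 on the log_e scale multiplies the value by exp(0.02)
step <- 0.02
pctChange <- 100*(exp(step) - 1)
print(round(pctChange, 2))           # percent change for one tick step
## Tick positions 0.02 apart on the log_e scale, converted to original units
logpos <- seq(from=7.22, to=7.26, by=step)
ticks <- exp(logpos)
print(round(ticks))
## Successive ticks always differ by the same ratio
ratios <- ticks[-1]/ticks[-length(ticks)]
print(round(ratios, 4))
```

For small steps, exp(step) is close to 1 + step, which is why a change of 0.02 on the log_e scale is so nearly a 2 percent change.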

1.2.7 Graphical Displays for Categorical Data


Figure 1.9 illustrates the possible hazards of adding values in a multiway table over
one of its margins. Data are from a study (Charig, 1986) that compared the use of
open surgery for kidney stones with a method that made a small incision and used
ultrasound to destroy the stone. Stones were classified by diameter: either at least
2 cm or less than 2 cm. For each subject, the outcome was assessed as successful
(“yes”) or unsuccessful (“no”).


[Figure 1.9: plot of number of operations (y-axis, 0 to 300) against success
rate in percent (x-axis, 50 to 90), for open surgery and ultrasound, with bars
labeled by stone size (<2cm, >=2cm). Marks at the top show the overall rates:
"All open" (78) and "All ultrasound" (82.6). The table shown to the right of
the plot is:]

                        Success
Method       Size       yes    no   Total   %Yes
open         <2cm        81     6      87   93.1
             >=2cm      192    71     263   73.0
ultrasound   <2cm       234    36     270   86.7
             >=2cm       55    25      80   68.8

Add over Size
open                    273    77     350   78.0
ultrasound              289    61     350   82.6

Figure 1.9 Outcomes are for two different types of surgery for kidney stones.
The overall (apparent) success rates (78 percent for open surgery as against 83
percent for ultrasound) favor ultrasound. The success rate for each size of stone
separately favors, in each case, open surgery.

If we consider small stones and large stones separately, it appears that open surgery
is more successful than ultrasound. The blue vertical bar in Figure 1.9 is in each case
to the right of the corresponding red vertical bar. The overall counts, which favor
ultrasound, are thus misleading. For open surgery, the larger number of operations
for large stones (263 large, 87 small) weights the overall success rate towards the low
overall success rate for large stones. For ultrasound surgery (red bars), the weighting
(80 large, 270 small) is towards the high success rate for small stones. This is an
example of the phenomenon called the Simpson or Yule–Simpson paradox. (See also
Subsection 2.1.2.)
Note that without additional information, the results are not interpretable from
a medical standpoint. Different surgeons will have preferred different surgery types,
and the prior condition of patients will have affected the choice of surgery type.
The consequences of unsuccessful surgery may have been less serious for ultrasound
than for open surgery.
The table stones, shown to the right of Figure 1.9, has three margins – Success,
Method, and Size. The table margin12 that results from adding over Size retains
the first two of these. Code used is:

stones <- array(c(81,6,234,36,192,71,55,25), dim=c(2,2,2),
                dimnames=list(Success=c("yes","no"),
                              Method=c("open","ultrasound"), Size=c("<2cm", ">=2cm")))
margin12 <- margin.table(stones, margin=1:2)

Mosaic plots are an alternative type of display that can be obtained using either
mosaicplot() from base graphics or vcd::mosaic(). Figure 1.9 makes the point
of interest for the kidney stone surgery data more simply and directly.
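The reversal can also be checked numerically. The following sketch, in base R, recreates the stones array from the counts given in the text and calculates success rates before and after adding over Size. The names nOps, rateBySize, and rateOverall are introduced here for illustration.

```r
## Counts as given in the text: Success x Method x Size
stones <- array(c(81,6,234,36,192,71,55,25), dim=c(2,2,2),
                dimnames=list(Success=c("yes","no"),
                              Method=c("open","ultrasound"),
                              Size=c("<2cm",">=2cm")))
## Success rate (%) for each Method and Size combination
nOps <- apply(stones, c(2,3), sum)          # operations per Method x Size
rateBySize <- 100*stones["yes",,]/nOps
print(round(rateBySize, 1))
## Success rate (%) after adding over Size
margin12 <- margin.table(stones, margin=1:2)
rateOverall <- 100*margin12["yes",]/colSums(margin12)
print(round(rateOverall, 1))
## Share of each method's operations that were on small or large stones
print(round(prop.table(nOps, margin=1), 2))
```

The final table shows the weighting: most open operations were on large stones, and most ultrasound operations were on small stones.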

1.2.8 What to Look for in Plots


We now note points to keep in mind when examining data.


Outliers
Outliers are points that appear, or are judged, isolated from the main body of the
data. Such points, whether errors or genuine values, can indicate departure from
model assumptions, and may distort any model that is fitted.
Boxplots, and the normal quantile–quantile plot that will be discussed in Sub-
section 1.4.3, are useful for highlighting outliers in one dimension. Scatterplots may
highlight outliers in two dimensions. Some outliers will, however, be apparent only
in three or more dimensions.
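As a small illustration of one-dimensional outlier detection, the sketch below uses boxplot.stats() from base R on a made-up vector of measurements (the values are hypothetical, chosen so that one point is clearly isolated).

```r
## Hypothetical measurements; the last value is isolated from the rest
x <- c(10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 15.3)
## boxplot.stats() returns the values that boxplot() would plot as outliers,
## that is, points more than 1.5 times the box length beyond the hinges
out <- boxplot.stats(x)$out
print(out)
```

Whether a flagged point is an error or a genuine value is a judgment that the data analyst still has to make.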

Asymmetry of the Distribution


Positive skewness (a tail to the right) is a common form of departure from nor-
mality. The largest values are widely dispersed, and values near the minimum are
likely to be bunched up together. Provided that all values are greater than zero,
a logarithmic transformation typically makes such a distribution more symmetric.
Negative skewness (a tail to the left) is less common. Severe skewness is typically
a more serious problem for the validity of analysis results than other types of
nonnormality.
If values of a variable that takes positive values range by a factor of more than
10:1 then, depending on the application area context, positive skewness is to be
expected. A logarithmic transformation should be considered.
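The effect of a logarithmic transformation on positive skewness can be seen with simulated data. The sketch below assumes lognormal data (an illustrative choice, not data from the text) and defines a simple moment-based skewness function, skew(), since base R does not supply one.

```r
set.seed(29)                          # arbitrary seed, for reproducibility
x <- rlnorm(1000, meanlog=2, sdlog=1) # simulated positive, right-skewed values
## Moment-based sample skewness; defined here, not a base R function
skew <- function(z) mean((z - mean(z))^3)/mean((z - mean(z))^2)^1.5
rangeFactor <- max(x)/min(x)          # ratio of largest to smallest value
skewRaw <- skew(x)
skewLog <- skew(log(x))
print(round(c(rangeFactor=rangeFactor, raw=skewRaw, logged=skewLog), 2))
```

Values that range by a large factor, with strong positive skewness on the original scale, are close to symmetric after taking logarithms.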

Changes in Variability
Boxplots and histograms readily convey an impression of the extent of variability or
scatter in the data. Side-by-side boxplots, such as in Figure 1.1B, or dotplots such as
in Figure 1.1A, allow rough comparisons of the variability across different samples
or treatment groups. They provide a visual check on the assumption, common in
many statistical models, that variability is constant across treatment groups.
It is easy to over-interpret such plots. Statistical theory offers useful and necessary
warnings about the potential for such over-interpretation. (The variability in a
sample, typically measured by the variance, is itself highly variable under repeated
sampling. Measures of variability will be discussed in Subsection 1.3.3.)
When variability increases as data values increase, the logarithmic transformation
will often help. Constant relative variability on the original scale becomes constant
absolute variability on a logarithmic scale.
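The last point can be illustrated by simulation. The following sketch assumes three groups whose means differ by factors of 10 and whose noise is multiplicative; the group means and the noise standard deviation are arbitrary choices made for illustration.

```r
set.seed(41)                               # arbitrary seed
groupMean <- rep(c(10, 100, 1000), each=200)
group <- factor(groupMean)
## Multiplicative noise: the spread is proportional to the mean
y <- groupMean * exp(rnorm(length(groupMean), sd=0.2))
sdRaw <- tapply(y, group, sd)              # grows with the mean
sdLog <- tapply(log(y), group, sd)         # roughly constant
print(round(rbind(sdRaw, sdLog), 2))
```

On the original scale the standard deviations differ by a factor of around 100; on the logarithmic scale they are all close to 0.2.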

Clustering
Clusters in scatterplots may suggest features of the data that may or may not
have been expected. Upon proceeding to a formal analysis, any clustering must be
taken into account. Do the clusters correspond to different values of some relevant
variable? Outliers are a special form of clustering.

Nonlinearity
Where it seems clear that one or more relationships are nonlinear, a transformation
may make it possible to model the relevant effects as linear. Where none of the
common standard transformations meets requirements, methodology is available
that will fit quite general nonlinear curves. See Subsection 4.4.2.
If there is a theory that suggests the form of model, then this is a good starting
point. Available theory may, however, incorporate various approximations, and the
data may tell a story that does not altogether match the available theory. The data,
unless they are flawed, have the final say!

Time Trends in the Data


It is common to find time trends that are associated with order of data collection.
It can be enlightening to plot residuals, or other quantities, against time. Patterns
of increase or decrease are common and are readily recognized, but one should also
be alert to the possibility of seasonality or periodic behavior.

1.3 Data Summary


Data summaries may: (1) be of interest in themselves; (2) give insight into aspects
of data structure that may affect further analysis; (3) be used as data for further
analysis. In case (3), it is necessary to ensure that important information, relevant
to the analysis, is not lost. Before adding counts across the margins of multiway
tables, or otherwise pooling data across different groups, it is important to check
the potential for distortions that are artifacts of the way that the data have been
summarized. Examples will be given.
If there is no loss of information, use of summary data can allow a helpful
simplicity of analysis and interpretation. Do not, however, proceed without careful
consideration!

1.3.1 Counts
The data frame DAAG::nswpsid1 is from a study (Lalonde, 1986) that compared two
groups of individuals with a history of unemployment problems – one an “untreated”
control group and the other a “treatment” group whose members were exposed to a
labor training program. The data include measures that can be used for checks on
whether the two groups were, aside from exposure (or not) to the training program,
otherwise plausibly similar. The following compares, for the two groups, the numbers
of those who had completed high school (nodeg = 0) and those who had not (nodeg = 1).
## Table of counts example: data frame nswpsid1 (DAAG)
## Specify `useNA="ifany"` to ensure that any NAs are tabulated
tab <- with(DAAG::nswpsid1, table(trt, nodeg, useNA="ifany"))
dimnames(tab) <- list(trt=c("none", "training"), educ = c("completed", "dropout"))
tab

educ
trt completed dropout
none 1730 760
training 80 217

[Figure 1.10: horizontal boxplot. The x-axis, "Inverse sampling weights", has
tick labels 0, 1, 10, 100, 1000, 10000, 100000.]

Figure 1.10 Boxplot showing weights (inverse sampling fractions), in the dataset
DAAG::nassCDS. A log(weight+1) scale has been used.

The training group has a much higher proportion of dropouts. Similar compar-
isons are required for other factors, variables, and combinations of two factors or
variables. The data will be investigated further in Section 9.7.1.
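The difference in proportions can be calculated from the table of counts. This sketch re-enters the four counts shown in the output, so that it runs without the DAAG package, and uses prop.table() to obtain proportions within each treatment group.

```r
## Counts copied from the table of counts for nswpsid1
tab <- matrix(c(1730, 80, 760, 217), nrow=2,
              dimnames=list(trt=c("none", "training"),
                            educ=c("completed", "dropout")))
## Proportions within each row (treatment group)
rowProp <- prop.table(tab, margin=1)
print(round(rowProp, 3))
```

With the data frame available, the same call can be applied directly to the table tab that was created from DAAG::nswpsid1.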

Tabulation That Accounts for Frequencies or Weights – the xtabs() Function


Each year the National Highway Traffic Safety Administration in the United States
uses a stratified random sampling method to collect data from all police-reported
collisions in which there is an injury to people or property and where at least one
vehicle is towed. Sampling fractions differ according to class of accident. The subset
in DAAG::nassCDS is restricted to front-seat occupants.2
Factors whose effects warrant investigation include: airbag (was an airbag
fitted?), seatbelt (was a seatbelt used?), and dvcat (a force of impact measure).
The column weight (national inflation factor) holds the inverses of the sampling
fraction estimates. The less accurate estimates that come where the sampling frac-
tion is small have to be given an accordingly greater weight in the calculation of
overall estimates, in order to fairly represent the population. Very large weights,
for some classes of accident, will exaggerate the effect, both of any mistakes in data
collection, and of deviations from the prescribed (and relatively complex) sampling
scheme. The following contrasts numbers in the sample with estimated total num-
bers of collisions, obtained by applying the sampling weights:
nassCDS <- DAAG::nassCDS
sampNum <- table(nassCDS$dead)
popNum <- as.vector(xtabs(weight ~ dead, data=nassCDS))
rbind(Sample=sampNum, "Total number"=round(popNum,1))

alive dead
Sample 25037 1180
Total number 12067937 65595
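The difference between counting records and summing weights is easiest to see on a very small example. The data frame toy below is hypothetical (six made-up records), with a weight column that plays the same role as weight in nassCDS.

```r
## Hypothetical sample of six records; `weight` is the inverse sampling fraction
toy <- data.frame(
  dead   = factor(c("alive","alive","alive","dead","alive","dead"),
                  levels=c("alive","dead")),
  weight = c(500, 500, 20, 20, 20, 5))
sampNum <- table(toy$dead)                        # counts of sampled records
popNum <- as.vector(xtabs(weight ~ dead, data=toy))   # sums of weights
print(rbind(Sample=sampNum, "Estimated total"=popNum))
```

A formula with a variable on the left-hand side causes xtabs() to sum that variable within each cell, instead of counting rows as table() does.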

2 It holds a subset of the columns from a corrected version of the data analyzed in Meyer and
Finney (2005). See also Farmer (2005) and Meyer (2006). More complete data are available
from one of the web pages noted on the help page for nassCDS.

Use of xtabs() to classify the estimated population numbers (in thousands) by
airbag use, and adding the marginal death rates per 1000 to the table, gives:
nassCDS <- DAAG::nassCDS
Atab <- xtabs(weight ~ airbag + dead, data=nassCDS)/1000
## Define a function that calculates Deaths per 1000
DeadPer1000 <- function(x)1000*x[2]/sum(x)
Atabm <- ftable(addmargins(Atab, margin=2, FUN=DeadPer1000))
print(Atabm, digits=2, method="compact", big.mark=",")

airbag | dead     alive   dead   DeadPer1000
none            5,445.2   39.7           7.2
airbag          6,622.7   25.9           3.9

This might suggest that the fitting of an airbag substantially reduces the risk of
mortality. Consider, however:
SAtab <- xtabs(weight ~ seatbelt + airbag + dead, data=nassCDS)
## SAtab <- addmargins(SAtab, margin=3, FUN=list(Total=sum))  ## Get Totals
SAtabf <- ftable(addmargins(SAtab, margin=3, FUN=DeadPer1000), col.vars=3)
print(SAtabf, digits=2, method="compact", big.mark=",")

seatbelt airbag | dead        alive       dead   DeadPer1000
none     none           1,342,021.9   24,066.7          17.6
         airbag           871,875.4   13,759.9          15.5
belted   none           4,103,224.0   15,609.4           3.8
         airbag         5,750,815.6   12,159.2           2.1

The row totals (alive plus dead), which the commented-out line would add as a
Total column, give the weights that are, effectively, applied to the values in the
DeadPer1000 column when the raw numbers are added over the seatbelt margin. In
the earlier table (Atab), the results for airbag=none were mildly skewed (4119:1366,
in thousands) to those for belted. Results with airbags were strongly skewed
(5763:886) to those for belted. Hence, adding over the seatbelt margin gave a
spuriously large advantage to the presence of an airbag.
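The weighting can be made explicit. The sketch below re-enters the estimated numbers printed in the seatbelt by airbag table (so that it runs without the DAAG package) and shows that the rates obtained by adding over seatbelt are weighted means of the seatbelt-specific rates. The names alive, dead, total, rate, wt, and pooled are introduced here for illustration.

```r
## Estimated numbers copied from the seatbelt by airbag table
alive <- matrix(c(1342021.9, 4103224.0, 871875.4, 5750815.6), nrow=2,
                dimnames=list(seatbelt=c("none","belted"),
                              airbag=c("none","airbag")))
dead <- matrix(c(24066.7, 15609.4, 13759.9, 12159.2), nrow=2,
               dimnames=dimnames(alive))
total <- alive + dead
rate <- 1000*dead/total               # deaths per 1000, by seatbelt and airbag
## Within each airbag category, the share of occupants in each seatbelt group
wt <- prop.table(total, margin=2)
## Rates after adding over seatbelt are weighted means of the rates above
pooled <- colSums(wt*rate)
print(round(wt, 2))
print(round(pooled, 1))
```

The belted share is larger where there was an airbag, so that the pooled airbag rate is pulled further towards the low rates for belted occupants.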
The reader may wish to try an analysis that accounts, additionally, for estimated
force of impact (dvcat):
FSAtab <- xtabs(weight ~ dvcat + seatbelt + airbag + dead, data=nassCDS)
FSAtabf <- ftable(addmargins(FSAtab, margin=4, FUN=DeadPer1000), col.vars=3:4)
print(FSAtabf, digits=1)

There is no consistent pattern in the difference between "none" and "airbag".
Further terms, including the age of vehicle and the age of driver, demand
consideration. The estimated effect of airbag, or of any factor other than seatbelt, varies
depending on what further terms are included in the model. Seatbelts have such
a large effect that their contribution stands out irrespective of what other terms
appear in the model. These data, tabulated as above, have too many uncertainties
and potential sources of bias to give reliable answers.
A better starting point for investigation is the data from the Fatality Analysis
Reporting System (FARS). The gamclass::FARS dataset has data for the years
1998 to 2010. This has, in principle at least, a complete set of records for the more
limited class of accidents where there was at least one fatality.
Farmer (2005) used the FARS data for an analysis, limited to cars without passenger
airbags, that used front-seat passenger mortality as a standard against which
to compare driver mortality. In the absence of any effect from airbags, the ratio of
driver mortality to passenger mortality should be the same, irrespective of whether
there was a driver airbag. Farmer found a ratio of driver fatalities to passenger
fatalities that was 11 percent lower in the cars with driver airbags. Factors that
have a large effect on the absolute risk can be expected to have a much smaller
effect on the relative risk.
In addition to the functions discussed, note the function gmodels::CrossTable(),
which offers a choice of SPSS-like and SAS-like output formats.

[Figure 1.11: dotplots of yield (x-axis, 85 to 110) for each shading treatment
(none, Aug2Dec, Dec2Feb, Feb2May), in three panels (west, north, east). Key:
"Individual vine yields" and "Plot means (4 vines)".]

Figure 1.11 Individual yields and plot-level mean yields of kiwifruit (in kg) for
each of four treatments (season) and blocks (exposure).

1.3.2 Summaries of Information from Data Frames


The data frame DAAG::kiwishade has yield measurements from 48 kiwifruit vines.
Plots, made up of four vines each, were the experimental units. Figure 1.11 plots
both the aggregated means and the individual vine results.
The 12 plots were divided into three blocks of four plots each. One block of four
was north-facing, a second block west-facing, and a third block east-facing. (Because
the trial was conducted in the Southern hemisphere, there is no south-facing block.)
Shading treatments were applied to whole plots, that is, to groups of four vines,
with each treatment occurring once per block. The shading treatments were applied
either from August to December, December to February, February to May, or not
at all. For more details of the experiment, look ahead to Figure 7.5.
As treatments were applied to whole plots, a focus on the individual vines ex-
aggerates the extent of information that is available, in each block, for comparing
treatments. To gain an accurate impression of the strength of the evidence, focus on
the means, represented by +. The code is given as a footnote.3 The code includes
3 ## Individual vine yields, with means by block and treatment overlaid
library(lattice)    # provides dotplot(), panel.dotplot() and lpoints()
kiwishade <- DAAG::kiwishade
kiwishade$block <- factor(kiwishade$block, levels=c("west","north","east"))
keyset <- list(space="top", columns=2,
               text=list(c("Individual vine yields", "Plot means (4 vines)")),
               points=list(pch=c(1,3), cex=c(1,1.35), col=c("gray40","black")))
panelfun <- function(x,y,...){
  panel.dotplot(x,y, pch=1, ...)
  av <- sapply(split(x,y),mean); ypos <- unique(y)
  lpoints(ypos~av, pch=3, col="black")
}
dotplot(shade~yield | block, data=kiwishade, col="gray40", aspect=0.65,
        panel=panelfun, key=keyset, layout=c(3,1))
## Note that parameter settings were given both in the calls
## to the panel functions and in the list supplied to key.
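Plot means of the kind that Figure 1.11 overlays can be calculated with aggregate(). The sketch below uses a simulated stand-in, here called vines, with the same layout as kiwishade (3 blocks, 4 shading treatments, 4 vines per plot); the yields are made up, so that the code runs without the DAAG package.

```r
set.seed(7)                                   # arbitrary seed
## Simulated stand-in with the same layout as kiwishade: 3 blocks x 4
## shading treatments x 4 vines; the yields are made up
vines <- expand.grid(vine=1:4,
                     shade=c("none","Aug2Dec","Dec2Feb","Feb2May"),
                     block=c("west","north","east"))
vines$yield <- round(rnorm(nrow(vines), mean=97, sd=5), 1)
## One mean per plot, where a plot is a block by shade combination
plotMeans <- aggregate(yield ~ block + shade, data=vines, FUN=mean)
print(dim(plotMeans))
print(head(plotMeans, 4))
```

The 48 vine-level values reduce to 12 plot means, which are the values that carry the information for comparing treatments.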



Random documents with unrelated
content Scribd suggests to you:
générosité m’a permis de compléter une somme de mille francs, je
te l’adresse en un mandat du receveur-général de Tours sur le
Trésor.»

—La belle avance! dit Constance en regardant Césarine.

«En retranchant quelques superfluités dans ma vie, je pourrai


rendre en trois ans à madame de Listomère les quatre cents francs
qu’elle m’a prêtés, ainsi ne t’en inquiète pas, mon cher César. Je
t’envoie tout ce que je possède dans le monde, en souhaitant que
cette somme puisse aider à une heureuse conclusion de tes
embarras commerciaux, qui sans doute ne seront que momentanés.
Je connais ta délicatesse, et veux aller au devant de tes objections.
Ne songe ni à me donner aucun intérêt de cette somme, ni à me la
rendre dans un jour de prospérité qui ne tardera pas à se lever pour
toi, si Dieu daigne entendre les prières que je lui adresserai
journellement. D’après ta dernière reçue il y a deux ans, je te croyais
riche, et pensais pouvoir disposer de mes économies en faveur des
pauvres; mais maintenant, tout ce que j’ai t’appartient. Quand tu
auras surmonté ce grain passager de ta navigation, garde encore
cette somme pour ma nièce Césarine, afin que, lors de son
établissement, elle puisse l’employer à quelque bagatelle qui lui
rappelle un vieil oncle dont les mains se lèveront toujours au ciel
pour demander à Dieu de répandre ses bénédictions sur elle et sur
tous ceux qui lui seront chers. Enfin, mon cher César, songe que je
suis un pauvre prêtre qui va à la grâce de Dieu comme les alouettes
des champs, marchant dans mon sentier, sans bruit, tâchant d’obéir
aux commandements de notre divin Sauveur, et à qui
conséquemment il faut peu de chose. Ainsi, n’aie pas le moindre
scrupule dans la circonstance difficile où tu te trouves, et pense à
moi comme à quelqu’un qui t’aime tendrement. Notre excellent abbé
Chapeloud, auquel je n’ai point dit ta situation, et qui sait que je
t’écris, m’a chargé de te transmettre les plus aimables choses pour
toutes les personnes de ta famille et te souhaite la continuation de
tes prospérités. Adieu, cher et bien-aimé frère, je fais des vœux pour
que, dans les conjonctures où tu te trouves, Dieu te fasse la grâce
de te conserver en bonne santé, toi, ta femme et ta fille; je vous
souhaite à tous patience et courage en vos adversités.
»François Birotteau,

»Prêtre, vicaire de l’église cathédrale et


paroissiale de Saint-Gatien de Tours.»

—Mille francs! dit madame Birotteau furieuse.


—Serre-les, dit gravement César, il n’a que cela. D’ailleurs, ils
sont à notre fille, et doivent nous faire vivre sans rien demander à
nos créanciers.
—Ils croiront que tu leur as soustrait des sommes importantes.
—Je leur montrerai la lettre.
—Ils diront que c’est une frime.
—Mon Dieu, mon Dieu, cria Birotteau terrifié. J’ai pensé cela de
pauvres gens qui sans doute étaient dans la situation où je me
trouve.
Trop inquiètes de l’état où se trouvait César, la mère et la fille
travaillèrent à l’aiguille auprès de lui, dans un profond silence. A
deux heures du matin, Popinot ouvrit doucement la porte du salon et
fit signe à madame César de descendre. En la voyant, son oncle ôta
ses besicles.
—Mon enfant, il y a de l’espoir, lui dit-il, tout n’est pas perdu;
mais ton mari ne résisterait pas aux alternatives des négociations à
faire et qu’Anselme et moi nous allons tenter. Ne quitte pas ton
magasin demain, et prends toutes les adresses des billets; nous
avons jusqu’à quatre heures. Voici mon idée. Ni monsieur Ragon ni
moi ne sommes à craindre. Supposez maintenant que vos cent mille
francs déposés chez Roguin aient été remis aux acquéreurs, vous ne
les auriez pas plus que vous ne les avez aujourd’hui. Vous êtes en
présence de cent quarante mille francs souscrits à Claparon, que
vous deviez toujours payer en tout état de cause; ainsi ce n’est pas
la banqueroute de Roguin qui vous ruine. Je vois, pour faire face à
vos obligations, quarante mille francs à emprunter tôt ou tard sur
vos fabriques et soixante mille francs d’effets Popinot. On peut donc
lutter, car après vous pourrez emprunter sur les terrains de la
Madeleine. Si votre principal créancier consent à vous aider, je ne
regarderai pas à ma fortune, je vendrai mes rentes, je serai sans
pain. Popinot sera entre la vie et la mort; quant à vous, vous serez à
la merci du plus petit événement commercial. Mais l’huile rendra
sans doute de grands bénéfices. Popinot et moi nous venons de
nous consulter, nous vous soutiendrons dans cette lutte. Ah! je
mangerai bien gaiement mon pain sec si le succès poind à l’horizon.
Mais tout dépend de Gigonnet et des associés Claparon. Popinot et
moi, nous irons chez Gigonnet de sept à huit heures, et nous
saurons à quoi nous en tenir sur leurs intentions.
Constance se jeta tout éperdue dans les bras de son oncle, sans
autre voix que des larmes et des sanglots. Ni Popinot ni Pillerault ne
pouvaient savoir que Bidault dit Gigonnet, et Claparon étaient du
Tillet sous une double forme, que du Tillet voulait lire dans les
Petites-Affiches ce terrible article:

«Jugement du tribunal de commerce qui déclare le sieur


César Birotteau, marchand parfumeur, demeurant à Paris, rue
Saint-Honoré, no 397, en état de faillite, en fixe
provisoirement l’ouverture au 16 janvier 1819. Juge-
commissaire, monsieur Gobenheim-Keller. Agent, monsieur
Molineux.»

Anselme et Pillerault étudièrent jusqu’au jour les affaires de


César. A huit heures du matin, ces deux héroïques amis, l’un vieux
soldat, l’autre sous-lieutenant d’hier, qui ne devaient jamais
connaître que par procuration les terribles angoisses de ceux qui
avaient monté l’escalier de Bidault dit Gigonnet, s’acheminèrent,
sans se dire un mot, vers la rue Grenétat. Ils souffraient. A plusieurs
reprises, Pillerault passa sa main sur son front.
La rue Grenétat est une rue où toutes les maisons, envahies par
une multitude de commerces, offrent un aspect repoussant; les
constructions y ont un caractère horrible, l’ignoble malpropreté des
fabriques y domine. Le vieux Gigonnet habitait le troisième étage
d’une maison dont toutes les fenêtres étaient à bascule et à petits
carreaux sales. Son escalier descendait jusque sur la rue. Sa portière
était logée à l’entresol, dans une cage qui ne tirait son jour que de
l’escalier et d’une échappée sur la rue. Excepté Gigonnet, tous les
locataires exerçaient un état. Il venait, il sortait continuellement des
ouvriers. Les marches étaient donc revêtues d’une couche de boue
dure ou molle, au gré de l’atmosphère, et où séjournaient des
immondices. Sur ce fétide escalier, chaque palier offrait aux yeux les
noms du fabricant écrits en or sur une tôle peinte en rouge et
vernie, avec des échantillons de ses chefs-d’œuvre. La plupart du
temps, les portes ouvertes laissaient voir la bizarre union du ménage
et de la fabrique, il s’en échappait des cris et des grognements
inouïs, des chants, des sifflements qui rappelaient l’heure de quatre
heures chez les animaux du Jardin des Plantes. Au premier se
faisaient, dans un taudis infect, les plus belles bretelles de l’article
Paris. Au second se confectionnaient, au milieu des plus sales
ordures, les plus élégants cartonnages qui parent au jour de l’an les
montres de Suisse. Gigonnet mourut riche de dix-huit cent mille
francs dans le troisième de cette maison, sans qu’aucune
considération eût pu l’en faire sortir, malgré l’offre de madame
Saillard, sa nièce, de lui donner un appartement dans un hôtel de la
place Royale.
—Du courage, dit Pillerault en tirant le pied de biche pendu par
un cordon à la porte grise et propre de Gigonnet.
Gigonnet vint ouvrir lui-même, et les deux parrains du parfumeur,
en lice dans le champ des faillites, traversèrent une première
chambre correcte et froide, sans rideaux aux croisées. Tous trois
s’assirent dans la seconde où se tenait l’escompteur devant un foyer
plein de cendres au milieu desquelles le bois se défendait contre le
feu. Popinot eut l’âme glacée par les cartons verts de l’usurier, par la
rigidité monastique de ce cabinet aéré comme une cave; il regarda
d’un air hébété le petit papier bleuâtre semé de fleurs tricolores collé
sur les murs depuis vingt-cinq ans, et reporta ses yeux attristés sur
la cheminée ornée d’une pendule en forme de lyre, et des vases
oblongs en bleu de Sèvres richement montés en cuivre doré. Cette
épave, ramassée par Gigonnet dans le naufrage de Versailles où la
populace brisa tout, venait du boudoir de la reine; elle était
accompagnée de deux chandeliers du plus misérable modèle en fer
battu.
—Je sais que vous ne pouvez pas venir pour vous, dit Gigonnet,
mais pour le grand Birotteau. Eh? bien, qu’y a-t-il, mes amis?
—Je sais qu’on ne vous apprend rien, ainsi nous serons brefs, dit
Pillerault: vous avez des effets ordre Claparon?
—Oui.
—Voulez-vous échanger les cinquante premiers mille contre des
effets de monsieur Popinot que voici, moyennant escompte, bien
entendu.
Gigonnet ôta sa terrible casquette verte qui semblait née avec lui,
montra son crâne couleur beurre frais dénué de cheveux, fit sa
grimace voltairienne et dit:—Vous voulez me payer en huile pour les
cheveux, quéque j’en ferais?
—Quand vous plaisantez, il n’y a qu’à tirer ses grègues, dit
Pillerault.
—Vous parlez comme un sage que vous êtes, lui dit Gigonnet
avec un sourire flatteur.
—Eh! bien, si j’endossais les effets de monsieur Popinot? dit
Pillerault en faisant un dernier effort.
—Vous êtes de l’or en barre, monsieur Pillerault, mais je n’ai pas
besoin d’or, il me faut seulement mon argent.
Pillerault et Popinot saluèrent et sortirent. Au bas de l’escalier, les
jambes de Popinot flageolaient encore sous lui.
—Est-ce un homme? dit-il à Pillerault.
—On le prétend, fit le vieillard. Souviens-toi toujours de cette
courte séance, Anselme! Tu viens de voir la Banque sans la
mascarade de ses formes agréables. Les événements imprévus sont
la vis du pressoir, nous sommes le raisin, et les banquiers sont les
tonneaux. L’affaire des terrains est sans doute bonne, Gigonnet veut
étrangler César pour se revêtir de sa peau: tout est dit, il n’y a plus
de remède. Voilà la Banque, n’y recours jamais.
After that dreadful morning when, for the first time, Madame
Birotteau took down the addresses of those who came to collect their
money and sent the Bank's messenger away unpaid, at eleven o'clock
this courageous woman, happy to have spared her husband these
sorrows, saw Anselme and Pillerault return, for whom she had been
waiting in the grip of growing anxiety: she read her sentence on
their faces. The filing of the balance sheet was inevitable.
"He will die of grief," said the poor woman.
"I wish it for him," said Pillerault gravely; "but he is so religious
that, in the present circumstances, his confessor, the Abbé Loraux,
alone can save him."
Pillerault, Popinot, and Constance waited for a clerk to fetch the
Abbé Loraux before presenting the balance sheet that Célestin was
preparing for César's signature. The clerks were in despair; they
loved their employer. At four o'clock the good priest arrived;
Constance told him of the misfortune bearing down on them, and the
abbé went up as a soldier mounts to the breach.
"I know why you have come," cried Birotteau.
"My son," said the priest, "your feelings of resignation to the
divine will have long been known to me; but now they must be applied:
keep your eyes always on the cross, never cease to look at it while
thinking of the humiliations heaped upon the Savior of mankind and of
how cruel his passion was, and you will thus be able to bear the
mortifications God sends you..."
"My brother the abbé had already prepared me," said César, showing
him the letter he had reread and which he handed to his confessor.
"You have a good brother," said Monsieur Loraux, "a gentle and
virtuous wife, a loving daughter, two true friends in your uncle and
dear Anselme, and two indulgent creditors in the Ragons; these kind
hearts will ceaselessly pour balm on your wounds and help you carry
your cross. Promise me to have the firmness of a martyr, to face the
blow without faltering."
The abbé coughed to alert Pillerault, who was in the drawing room.
"My resignation is boundless," said César calmly. "Dishonor has come;
I am thinking of reparation."
The poor perfumer's voice and manner surprised Césarine and the
priest. Yet nothing was more natural. All men bear a known, defined
misfortune better than the cruel alternations of a fate that, from
one moment to the next, brings either excessive joy or extreme
sorrow.
"I dreamed for twenty-two years; today I wake with my cudgel in my
hand," said César, once more a peasant of Touraine.
On hearing these words, Pillerault clasped his nephew in his arms.
César caught sight of his wife, Anselme, and Célestin. The papers the
head clerk was holding were telling enough. César calmly contemplated
this group, in which every look was sad but friendly.
"One moment!" he said, unfastening his cross, which he held out to
the Abbé Loraux. "You will give it back to me when I can wear it
without shame. Célestin," he added, addressing his clerk, "write my
resignation as deputy mayor. The abbé will dictate the letter to you;
you will date it the fourteenth and have Raguet take it to Monsieur
de La Billardière."
Célestin and the Abbé Loraux went downstairs. For about a quarter of
an hour a deep silence reigned in César's study. His firmness
surprised his family. Célestin and the abbé came back, and César
signed his resignation. When Uncle Pillerault presented him with the
balance sheet, the poor man could not suppress a horrible nervous
spasm.
"My God, have mercy on me," he said, signing the terrible document
and handing it to Célestin.
"Monsieur," Anselme Popinot then said, a flash of light crossing his
clouded brow, "Madame, do me the honor of granting me the hand of
Mademoiselle Césarine."
At these words, everyone present had tears in their eyes except
César, who rose, took Anselme's hand, and said in a hollow voice: "My
child, you shall never marry the daughter of a bankrupt."
Anselme looked steadily at Birotteau and said: "Monsieur, do you
pledge, in the presence of your whole family, to consent to our
marriage, if mademoiselle accepts me as her husband, on the day you
are discharged from your bankruptcy?"
There was a moment of silence during which everyone was moved by the
emotions that showed on the perfumer's sunken face.
"Yes," he said at last.
Anselme made an indescribable gesture to take Césarine's hand; she
gave it to him, and he kissed it.
"Do you consent too?" he asked Césarine.
"Yes," she said.
"Then I am at last one of the family; I have the right to concern
myself with its affairs," he said with a strange expression.
Anselme hurried out so as not to show a joy that contrasted too
sharply with his employer's grief. Anselme was not exactly happy
about the bankruptcy, but love is so absolute, so selfish! Césarine
herself felt in her heart an emotion at odds with her bitter sadness.
"Since we are at it," Pillerault said in Césarine's ear, "let us
strike every blow."
Madame Birotteau let slip a sign of pain, not of assent.
"Nephew," said Pillerault, addressing César, "what do you intend to
do?"
"Carry on the business."
"That is not my opinion," said Pillerault. "Liquidate and distribute
your assets to your creditors, and do not show yourself again on the
Paris market. I have often imagined myself in a position like
yours... (Ah! one must foresee everything in business! The merchant
who does not think about bankruptcy is like a general who counts on
never being beaten; he is only half a merchant.) I would never have
carried on. What! Always blushing before men I had wronged, meeting
their distrustful looks and their silent reproaches? I can conceive
of the guillotine!... one instant, and all is over. But to have a
head that grows back and to feel it cut off every day is a torture I
would have escaped. Many people resume business as if nothing had
happened to them! So much the better! They are stronger than
Claude-Joseph Pillerault. If you deal in cash, and you are obliged
to, people say you managed to keep resources in reserve; if you are
penniless, you can never get back on your feet. Good night! So give
up your assets, let your business be sold, and do something else."
"But what?" said César.
"Eh!" said Pillerault, "look for a post. Have you no patrons? The Duc
and Duchesse de Lenoncourt, Madame de Mortsauf, Monsieur de
Vandenesse; write to them, see them, they will find you a place in
the King's Household at a thousand crowns or so; your wife will earn
as much, your daughter perhaps too. The position is not desperate.
Among the three of you, you will bring in nearly ten thousand francs
a year. In ten years you can pay off a hundred thousand francs, for
you will take nothing from what you earn: your two women will have
fifteen hundred francs from me for their expenses, and as for you, we
shall see!"
It was Constance, not César, who pondered these wise words.
Pillerault made his way to the Bourse, which was then held in a
temporary structure of planks and timber framing, forming a round
hall entered from the Rue Feydeau. The bankruptcy of the prominent
and envied perfumer, already known, was stirring general talk in high
commerce, which was then constitutionalist. Liberal merchants saw in
Birotteau's ball a bold encroachment on their sentiments. The men of
the opposition wanted a monopoly on love of country. Royalists were
permitted to love the king, but loving the fatherland was the
privilege of the left: the people belonged to it. The government had
been wrong to rejoice, through its newspapers, at an event of which
the liberals wanted exclusive exploitation. The fall of a protégé of
the court, a ministerialist, an incorrigible royalist who, on 13
Vendémiaire, had insulted liberty by fighting against the glorious
French Revolution, this fall excited the gossip and the applause of
the Bourse. Pillerault wanted to know and study opinion. He found, in
one of the most animated groups, du Tillet, Gobenheim-Keller,
Nucingen, old Guillaume and his son-in-law Joseph Lebas, Claparon,
Gigonnet, Mongenod, Camusot, Gobseck, Adolphe Keller, Palma,
Chiffreville, Matifat, Grindot, and Lourdois.
"Well, how careful one must be!" said Gobenheim to du Tillet. "It
hung by a thread that my in-laws did not grant Birotteau credit!"
"As for me, I am in for ten thousand francs that he asked me for a
fortnight ago; I gave them to him on his signature alone," said du
Tillet. "But he once did me a service; I shall lose them without
regret."
"He did what all the others do, your nephew," said Lourdois to
Pillerault; "he gave parties! That a rogue should try to throw dust
in people's eyes to stimulate confidence, I can understand; but that
a man who passed for the cream of honest folk should resort to the
tricks of that old charlatanism by which we are always taken in!"
"Like fools," said Gobseck.
"Trust only those who live in hovels, like Claparon," said Gigonnet.
"Well," said the fat Baron Nucingen to du Tillet in his thick German
accent, "you meant to play a trick on me by sending me Birotteau. I
do not know why," he said, turning to Gobenheim, the manufacturer,
"he did not send to my house for fifty thousand francs; I would have
handed them over to him."
"Oh, no, Monsieur le Baron," said Joseph Lebas. "You must have known
very well that the Bank had refused his paper; you had it rejected in
the discount committee. The affair of this poor man, for whom I still
profess high esteem, presents singular circumstances..."
Pillerault's hand was clasping Joseph Lebas's.
"It is indeed impossible," said Mongenod, "to explain what is
happening, unless one believes that there are, hidden behind
Gigonnet, bankers who want to kill the Madeleine deal."
"What has happened to him is what will always happen to those who
stray outside their specialty," said Claparon, interrupting Mongenod.
"If he had launched his Huile Céphalique himself instead of coming to
drive up the price of land in Paris by pouncing on it, he would have
lost his hundred thousand francs with Roguin, but he would not have
failed. He is going to work under Popinot's name."
"Watch out for Popinot," said Gigonnet.
Roguin, according to this crowd of merchants, was the unfortunate
Roguin; the perfumer was that poor Birotteau. The one seemed excused
by a great passion; the other seemed more guilty because of his
pretensions. On leaving the Bourse, Gigonnet went by way of the Rue
Perrin-Gasselin before returning to the Rue Grenétat, and called on
Madame Madou, the dried-fruit seller.
"Well, my stout old dear," he said with his cruel good nature, "how
is our little business going?"
"Gently does it," Madame Madou said respectfully, offering the usurer
her only armchair with an affectionate servility she had shown only
to her dear departed.
Mother Madou, who would floor a recalcitrant or overly playful
carter, who would not have feared to storm the Tuileries on the tenth
of October, who bantered with her best customers, who was capable, in
short, of addressing the king without trembling in the name of the
ladies of the Halle, Angélique Madou received Gigonnet with profound
respect. Powerless in his presence, she shivered under his harsh
gaze. The common people will long go on trembling before the
executioner, and Gigonnet was the executioner of this trade. At the
Halle, no power is more respected than that of the man who sets the
price of money. Other human institutions are nothing beside it.
Justice itself is translated, in the eyes of the Halle, into the
police commissioner, a figure with whom it grows familiar. But usury
seated behind its green cardboard files, usury implored with fear in
one's heart, dries up jesting, parches the throat, lowers the pride
of the gaze, and makes the people respectful.
"Have you something to ask of me?" she said.
"A trifle, a mere nothing: be ready to repay the Birotteau bills. The
fellow has gone bankrupt, everything falls due, and I will send you
the account tomorrow morning."
Madame Madou's eyes first narrowed like a cat's, then spewed flames.
"Ah! the beggar! ah! the scoundrel! He came here himself to tell me
he was deputy mayor, to spin me yarns! Matigot, so that's how trade
goes! There is no faith left among mayors; the government is cheating
us. Wait, I am going to get myself paid, I am..."
"Well, in affairs like these everyone gets out as best he can, dear
child!" said Gigonnet, lifting his leg with that little dry movement,
like a cat trying to get past a wet spot, to which he owed his name.
"There are bigwigs who are thinking of pulling their stake out of the
game."
"Good! good! I'll pull out my hazelnut. Marie-Jeanne! My clogs and my
rabbit-hair cashmere, and quick, or I'll warm your cheek with all
five fingers."
"Things are going to heat up at the top of the street," Gigonnet said
to himself, rubbing his hands. "Du Tillet will be pleased; there will
be a scandal in the neighborhood. I do not know what that poor devil
of a perfumer did to him; I pity him as I would a dog that breaks its
paw. He is not a man; he has not the strength."
Madame Madou burst, like an insurrection from the Faubourg
Saint-Antoine, at about seven in the evening upon the door of poor
Birotteau, which she opened with excessive violence, for the walk had
further roused her spirits.
"Pack of vermin, I need my money, I want my money! You will give me
my money, or I'll carry off sachets, satin knickknacks, fans, goods
enough for my two thousand francs! Has anyone ever seen mayors rob
the people they administer! If you don't pay me, I'll send him to the
galleys, I'll go to the king's prosecutor, the whole rumbling
machinery of justice will take its course! In short, I am not leaving
here without my cash."
She made as if to lift the glass panes of a cabinet containing
valuable objects.
"La Madou is catching fire," Célestin said in a low voice to his
neighbor.
The dried-fruit seller heard the remark, for in paroxysms of passion
the senses are either dulled or sharpened according to one's
constitution, and she landed on Célestin's ear the most vigorous slap
ever given in a perfumer's shop.
"Learn to respect women, my angel," she said, "and not to mangle the
names of those you rob."
"Madame," said Madame Birotteau, coming out of the back shop, where
her husband happened to be, whom Uncle Pillerault wanted to take away
and who, to obey the law, carried humility so far as to want to let
himself be put in prison, "madame, in heaven's name, do not stir up
the passersby."
"Eh! let them come in," said the woman, "I'll tell them the whole
thing, just for a laugh! Yes, my goods and my coins, scraped together
by the sweat of my brow, go to pay for your balls. Why, you go about
dressed like a queen of France in the wool you shear from poor lambs
like me! Jesus! Stolen goods would burn my shoulders, they would; I
have nothing but rabbit hair on my carcass, but it is mine! Thieving
brigands, my money or..."
She pounced on a pretty marquetry box containing valuable toilet
articles.
"Leave that, madame," said César, showing himself; "nothing here is
mine; everything belongs to my creditors. I have nothing left but my
person, and if you wish to seize it, to put me in prison, I give you
my word of honor" (a tear fell from his eyes) "that I will wait for
your bailiff and his men..."
The tone and gesture, in harmony with the act, made Madame Madou's
anger subside.
"My funds were carried off by a notary, and I am innocent of the
disasters I am causing," César went on; "but you will be paid in
time, even if I have to die in the attempt and work like a laborer at
the Halle, taking up the trade of porter."
"Come, you are a good man," said the woman of the Halle. "Forgive my
words, madame; but I shall have to throw myself in the river, for
Gigonnet is going to come after me, and I have only ten-month paper
with which to repay your damned bills."
"Come and see me tomorrow morning," said Pillerault, showing himself;
"I will arrange your business at five percent with a friend of mine."
"Well, well! It's good old Father Pillerault. Why, he is your uncle,"
she said to Constance. "Come, you are honest people, I shan't lose
anything, shall I? Until tomorrow, old fellow," she said to the
former ironmonger.
César insisted absolutely on remaining amid his ruins, saying that in
this way he could explain himself to all his creditors. Despite his
niece's pleas, Uncle Pillerault sided with César and sent him back up
to his rooms. The wily old man hurried to Monsieur Haudry, explained
Birotteau's condition to him, obtained a prescription for a sleeping
draught, went to have it made up, and returned to spend the evening
with his nephew. In concert with Césarine, he compelled César to
drink as they did. The narcotic put the perfumer to sleep, and he
awoke fourteen hours later in his uncle Pillerault's bedroom in the
Rue des Bourdonnais, held prisoner by the old man, who himself slept
on a camp bed in his drawing room. When Constance heard the cab roll
away in which her uncle Pillerault was carrying César off, her
courage deserted her. Our strength is often spurred by the need to
support someone weaker than ourselves. The poor woman wept at finding
herself alone at home with her daughter, as she would have wept for
César dead.
"Mama," said Césarine, sitting on her mother's knees and caressing
her with those kittenish graces that women display well only among
themselves, "you told me that if I bore up bravely, you would find
strength against adversity. So do not cry, dear mother. I am ready to
go into some shop, and I shall think no more about what we were. I
shall be, like you in your youth, a head saleswoman, and you will
never hear a complaint or a regret. I have one hope. Did you not hear
Monsieur Popinot?"
"The dear boy, he will not be my son-in-law..."
"Oh! mama..."
"He will truly be my son."
"Misfortune," said Césarine, kissing her mother, "has this much good
in it: it teaches us to know our true friends."
Césarine eventually soothed the poor woman's grief by playing the
part of a mother to her. The next morning, Constance went to the Duc
de Lenoncourt, one of the first gentlemen of the king's bedchamber,
and left a letter asking him for an audience at a certain hour of the
day. In the interval she called on Monsieur de La Billardière, laid
before him the situation in which the notary's flight had placed
César, and begged him to support her with the duke and to speak for
her, as she was afraid of explaining herself badly. She wanted a post
for Birotteau. Birotteau would be the most upright of cashiers, if
there were degrees to be distinguished in uprightness.
"The king has just appointed the Comte de Fontaine to a
directorship-general in the ministry of his household; there is no
time to lose."
At two o'clock, La Billardière and Madame César climbed the grand
staircase of the Hôtel de Lenoncourt, in the Rue Saint-Dominique, and
were shown in to the gentleman whom the king preferred above all
others, if indeed King Louis XVIII ever had preferences. The gracious
welcome of this great lord, who belonged to the small number of true
gentlemen that the previous century bequeathed to this one, gave
Madame César hope. The perfumer's wife showed herself great and
simple in her sorrow. Sorrow ennobles the most ordinary people, for
it has its own grandeur, and to take luster from it, one need only be
genuine. Constance was an essentially genuine woman. The point was to
speak to the king promptly. In the middle of the conference, Monsieur
de Vandenesse was announced, and the duke exclaimed: "Here is your
savior!" Madame Birotteau was not unknown to this young man, who had
come to her shop once or twice to ask for those trifles that are
often as important as great things. The duke explained La
Billardière's intentions. On learning of the misfortune weighing on
the godson of the Marquise d'Uxelles, Vandenesse went at once with La
Billardière to the Comte de Fontaine, asking Madame Birotteau to wait
for him. The Comte de Fontaine was, like La Billardière, one of those
brave provincial gentlemen, almost unknown heroes, who made the
Vendée. Birotteau was no stranger to him; he had seen him in former
days at the Reine des Roses. Those who had shed their blood for the
royal cause enjoyed at that time privileges that the King kept secret
so as not to alarm the Liberals. Monsieur de Fontaine, one of Louis
XVIII's favorites, was thought to be fully in his confidence. Not
only did the count positively promise a post, but he went to the Duc
de Lenoncourt, then in attendance, to ask him to obtain a moment's
audience for him that evening, and to request for La Billardière an
audience with Monsieur, who was particularly fond of this former
Vendéen diplomat. That very evening, the Comte de Fontaine went from
the Tuileries to Madame Birotteau's to tell her that her husband
would, after his concordat, be officially appointed to a post at two
thousand five hundred francs in the Caisse d'Amortissement, all the
departments of the king's household being at that time burdened with
noble supernumeraries to whom commitments had been made. This success
was only part of Madame Birotteau's task. The poor woman went to the
Rue Saint-Denis, to the Chat qui pelote, to find Joseph Lebas. On the
way, she met Madame Roguin in a gleaming carriage, no doubt out
shopping. Her eyes met those of the notary's beautiful wife. The
shame that the fortunate woman could not suppress on seeing the
ruined woman gave Constance courage.
"Never will I ride in a carriage on other people's property," she
said to herself.
Well received by Joseph Lebas, she asked him to find her daughter a
position in a respectable commercial house. Lebas promised nothing;
but a week later Césarine had board, lodging, and a thousand crowns
in the richest drapery and fancy-goods house in Paris, which was
founding a new establishment in the Italiens quarter. The cash desk
and the supervision of the shop were entrusted to the perfumer's
daughter, who, placed above the head saleswoman, stood in for the
master and mistress of the house. As for Madame César, she went that
same day to Popinot to ask to keep his cash, his books, and his
household. Popinot understood that his house was the only one where
the perfumer's wife could find the respect due to her and a position
without inferiority. The noble young man gave her three thousand
francs a year, her board, and his own lodging, which he had fitted
up, and took a clerk's attic for himself. Thus the beautiful
perfumer's wife, after enjoying the splendors of her apartment for a
month, had to live in the dreadful room, looking out on the dark,
damp courtyard, where Gaudissart, Anselme, and Finot had launched the
Huile Céphalique.
When Molineux, appointed Agent by the commercial court, came to take
possession of César Birotteau's assets, Constance, helped by
Célestin, checked the inventory with him. Then mother and daughter
left on foot, plainly dressed, and went to their uncle Pillerault's
without looking back, after living in that house for a third of their
lives. They walked in silence toward the Rue des Bourdonnais, where
they dined with César for the first time since their separation. It
was a sad dinner. Each had had time to reflect, to measure the extent
of his obligations, and to sound his courage. All three were like
sailors ready to battle foul weather without hiding the danger from
themselves. Birotteau took heart on learning with what solicitude
great personages had arranged a future for him; but he wept when he
learned what was to become of his daughter. Then he held out his hand
to his wife on seeing the courage with which she was beginning life
anew. Uncle Pillerault's eyes were wet for the last time in his life
at the touching picture of these three beings united, merged in an
embrace in the midst of which Birotteau, the weakest of the three,
the most crushed, raised his hand and said: "Let us hope!"
"To save money," said the uncle, "you will lodge with me; keep my
bedroom and share my bread. I have long been weary of being alone;
you will take the place of that poor child I lost. From here it is
only a step to your Caisse in the Rue de l'Oratoire."
"God of goodness," cried Birotteau, "at the height of the storm a
star is guiding me."
By resigning himself, the unfortunate man consummates his misfortune.
Birotteau's fall was from then on accomplished; he gave his consent
to it, and he became strong again.
After filing his balance sheet, a merchant ought to concern himself
only with finding an oasis in France or abroad where he can live
without meddling in anything, like the child he is: the law declares
him a minor and incapable of any legal, civil, or civic act. But
nothing of the kind happens. Before reappearing, he waits for a
safe-conduct that no juge-commissaire or creditor has ever refused,
for if he were found without this exeat he would be put in prison,
whereas, armed with this safeguard, he walks as an envoy under a flag
of truce through the enemy camp, not out of curiosity but to thwart
the ill intentions of the law toward bankrupts. The effect of any law
that touches private fortune is to develop prodigiously the trickery
of the mind. The thought of bankrupts, as of all those whose
interests are thwarted by any law, is to nullify it so far as they
are concerned. The state of civil death, in which the bankrupt
remains like a chrysalis, lasts about three months, the time required
by the formalities before reaching the congress at which creditors
and debtor sign a peace treaty, a settlement called the Concordat.
The word indicates well enough that concord reigns after the storm
raised between violently opposed interests.
On seeing the balance sheet, the commercial court at once appoints a
juge-commissaire, who watches over the interests of the body of
unknown creditors and must also protect the bankrupt against the
vexatious undertakings of his angry creditors: a double role that
would be magnificent to play, if juges-commissaires had the time.
This juge-commissaire invests an agent with the right to lay hands on
the funds, securities, and goods, while verifying the assets listed
in the balance sheet; finally the clerk's office announces a meeting
of all the creditors, which is summoned by the trumpet blast of
notices in the newspapers. The creditors, false or genuine, are
required to hasten and assemble in order to appoint provisional
syndics who replace the agent, step into the bankrupt's shoes, become
by a legal fiction the bankrupt himself, and may liquidate
everything, sell everything, compromise on everything, in short melt
down the bell for the creditors' benefit, if the bankrupt does not
object. Most Parisian bankruptcies stop at the provisional syndics,
and here is why.
The appointment of one or more permanent syndics is one of the most
passionate acts in which creditors can indulge when they are
thirsting for vengeance, having been played, flouted, hoaxed, caught,
gulled, robbed, and deceived. Although as a rule creditors are
deceived, robbed, gulled, caught, hoaxed, flouted, and played, there
is no commercial passion in Paris that lives for ninety days. In
trade, only bills of exchange know how to rise up, thirsting for
payment, at three months. At ninety days all the creditors, worn out
by the marches and countermarches a bankruptcy demands, are asleep
beside their excellent little wives. This may help foreigners
understand how far, in France, the provisional is permanent: out of a
thousand provisional syndics, not five become permanent. The reason
for this renunciation of the hatreds stirred up by bankruptcy will
become clear. But it becomes necessary to explain the drama of a
bankruptcy to those who do not have the good fortune to be merchants,
so that they may understand how it constitutes in Paris one of the
most monstrous of legal jokes, and how César's bankruptcy was to be
an enormous exception.
This fine commercial drama has three distinct acts: the act of the
Agent, the act of the Syndics, and the act of the Concordat. Like all
stage plays it offers a double spectacle: it has its staging for the
public and its hidden machinery; there is the performance seen from
the pit and the performance seen from the wings. In the wings are the
bankrupt and his Agréé (the merchants' attorney), the Syndics and the
Agent, and finally the Juge-Commissaire. No one outside Paris knows,
and no one in Paris is unaware, that a judge of the commercial court
is the strangest magistrate a society has ever allowed itself to
create. This judge may at any moment fear his own justice for
himself. Paris has seen the president of its court forced to file his
balance sheet. Instead of being an old merchant retired from
business, for whom this magistracy would be the reward of a blameless
life, this judge is a trader overloaded with enormous undertakings,
at the head of an immense house. The sine qua non of the election of
this judge, who is required to judge the avalanches of commercial
lawsuits that roll incessantly through the capital, is that he have
great difficulty managing his own affairs. This commercial court,
instead of having been instituted as a useful transition from which
the merchant might rise without ridicule to the regions of the
nobility, is composed of practicing merchants, who may suffer for
their rulings when they meet their dissatisfied litigants, as
Birotteau met du Tillet.
The Juge-Commissaire is therefore necessarily a personage before whom
a great many words are spoken, who listens to them while thinking
about his own business, and who leaves the public interest to the
syndics and the agréé, except in a few strange and bizarre cases in
which the thefts present curious circumstances and make him say that
the creditors or the debtor are clever people. This personage, placed
in the drama like a royal bust in a courtroom, is to be seen in the
morning, between five and seven o'clock, at his yard if he is a
timber merchant; in his shop if, like Birotteau formerly, he is a
perfumer; or in the evening after dinner, between the pear and the
cheese, and in any case always horribly rushed. Thus this personage
is generally mute. Let us do the law justice: the hastily drafted
legislation governing the matter has tied the Juge-Commissaire's
hands, and in several circumstances he sanctions frauds without being
able to prevent them, as you are about to see.
The Agent, instead of being the creditors' man, may become the
debtor's man. Everyone hopes to increase his share by getting
preferential treatment from the bankrupt, who is always assumed to
have hidden treasure. The Agent can make himself useful on both
sides, either by not setting fire to the bankrupt's affairs or by
snatching something for influential people: he therefore runs with
the hare and hunts with the hounds. Often a skillful Agent has had
the judgment revoked by buying up the claims and setting the merchant
back on his feet, who then rebounds like a rubber ball. The Agent
turns toward the best-stocked manger, whether that means covering the
largest creditors and exposing the debtor, or sacrificing the
creditors to the merchant's future. Thus the act of the Agent is the
decisive act. This man, like the Agréé, plays the leading utility
part in this play, in which neither of them accepts his role unless
sure of his fees. On an average of a thousand bankruptcies, the Agent
is the bankrupt's man nine hundred and fifty times. At the time when
this story took place, the Agréés almost always went to the
Juge-Commissaire and presented him with an Agent to appoint, their
own, a man who knew the merchant's affairs and who would be able to
reconcile the interests of the body of creditors with those of the
honorable man fallen into misfortune. For some years now, clever
judges have had the desired Agent pointed out to them so as not to
choose him, and try to appoint a quasi-virtuous one.
During this act the creditors, false or genuine, appear in order to
designate the provisional syndics, who are, as has been said,
permanent. In this electoral assembly those who are owed fifty sous
have the right to vote just as creditors owed fifty thousand francs
do: votes are counted, not weighed. This assembly, which includes the
false electors brought in by the bankrupt, the only ones who never
miss the election, proposes as candidates the creditors from among
whom the Juge-Commissaire, a president without power, is required to
choose the syndics. Thus the Juge-Commissaire almost always takes
from the bankrupt's hand the Syndics it suits the bankrupt to have:
another abuse, which makes this catastrophe one of the most burlesque
dramas that justice can protect. The honorable man fallen into
misfortune, master of the field, then legalizes the theft he has
planned. As a rule, the small tradesmen of Paris are free of all
blame. When a shopkeeper comes to file his balance sheet, the poor
honest man has sold his wife's shawl, pawned his silver, used every
means at hand, and gone under empty-handed, ruined, without money
even for the Agréé, who cares very little about him.
The law requires that the concordat, which remits to the merchant a
part of his debt and restores his business to him, be voted by a
certain majority of sums and of persons. This great work demands
skillful diplomacy, conducted amid the conflicting interests that
cross and collide, by the bankrupt, his syndics, and his Agréé. The
usual, commonplace maneuver consists in offering, to the portion of
the creditors that makes up the majority required by law, premiums to
be paid by the debtor over and above the dividends agreed in the
concordat. For this immense fraud there is no remedy. The thirty
commercial tribunals that have succeeded one another know it from
having practiced it. Enlightened by long usage, they have lately
decided to annul bills tainted with fraud, and since bankrupts have
an interest in complaining of this extortion, the judges hope thereby
to moralize bankruptcy; but they will end by making it still more
immoral: the creditors will invent some still more knavish acts,
which the judges will brand as judges and profit by as merchants.
Another maneuver in very common use, to which we owe the expression
"serious and legitimate creditor," consists in creating creditors, as
du Tillet had created a banking house, and in introducing a certain
number of Claparons, under whose skin hides the bankrupt, who thereby
diminishes by so much the dividend of the real creditors, and thus
creates resources for himself for the future, while securing the
number of votes and sums needed to obtain his concordat. The jocular
and illegitimate creditors are like false electors brought into the
Electoral College. What can the serious and legitimate creditor do
against the jocular and illegitimate creditors? Get rid of them by
attacking them! Very well. To drive out the intruder, the serious and
legitimate creditor must abandon his own business and entrust his
case to an Agréé, who, earning almost nothing from it, prefers to
manage bankruptcies and conducts this petty lawsuit with little
dispatch. To dislodge the jocular creditor, one must enter the
labyrinth of the transactions, go back to distant periods, search the
books, obtain by court order the production of the false creditor's
books, discover the improbability of the fiction, demonstrate it to
the judges of the tribunal, plead, come and go, warm many cold
hearts; then play this Don Quixote's part against each illegitimate
and jocular creditor, who, if he comes to be convicted of jocularity,
withdraws bowing to the judges and says: "Excuse me, you are
mistaken, I am very serious." All this without prejudice to the
rights of the Bankrupt, who may take the Don Quixote to the Royal
Court. Meanwhile the Don Quixote's affairs go badly, and he is liable
to file his own schedule.
Moral: The debtor appoints his Syndics, verifies his claims, and
arranges his Concordat himself.
From these facts, who cannot guess the intrigues, Sganarelle's
tricks, Frontin's inventions, Mascarille's lies, and Scapin's empty
sacks to which these two systems give rise? There is no bankruptcy
that does not breed enough of them to furnish the matter of the
fourteen volumes of Clarissa Harlowe to the author who would describe
them. A single example will suffice. The illustrious Gobseck, master
of the Palmas, the Gigonnets, the Werbrusts, the Kellers, and the
Nucingens, finding himself in a bankruptcy in which he intended to
deal roughly with a merchant who had managed to swindle him,
received, in bills falling due after the concordat, the sum which,
added to that of the dividends, made up the whole of his claim.
Gobseck brought about the acceptance of a concordat that granted the
bankrupt a remission of seventy-five percent. There were the
creditors tricked for Gobseck's benefit. But the merchant had signed
the illicit bills in the name of his bankrupt firm; he was able to
apply to these bills the deduction of seventy-five percent. Gobseck,
the great Gobseck, received scarcely fifty percent. He always greeted
his debtor with ironic respect.
Since all transactions undertaken by a bankrupt in the ten days
before his bankruptcy may be impeached, some prudent men take care to
open certain dealings with a certain number of creditors whose
interest, like that of the bankrupt, is to reach a prompt concordat.
Very shrewd creditors go to very simple or very busy creditors, paint
the bankruptcy in dark colors, and buy their claims for half of what
they will be worth at the liquidation, and so recover their money
through the dividend on their own claims and the half, the third, or
the quarter gained on the purchased claims.
Bankruptcy is the more or less hermetic closing of a house in which
pillage has left a few bags of money. Happy the merchant who slips in
by the window, by the roof, by the cellars, by a hole, who takes a
bag and swells his share! In this rout, where the
every-man-for-himself of the Berezina is cried, everything is illegal
and legal, false and true, honest and dishonest. A man is admired if
he covers himself. To cover oneself is to seize certain assets to the
detriment of the other creditors. France has rung with the debates
over an immense bankruptcy that broke out in a town where a Royal
Court sat, and where the magistrates, having running accounts with
the bankrupts, had given themselves rubber cloaks so heavy that the
cloak of justice was worn through by them. It became necessary, on
grounds of legitimate suspicion, to transfer the judgment of the
bankruptcy to another Court. There was neither judge-commissioner,
nor agent, nor sovereign court possible in the place where the
bankruptcy broke out.
This appalling commercial mess is so well understood in Paris that,
unless he is involved in the bankruptcy for a capital sum, every
merchant, however little occupied he may be, accepts the bankruptcy
as a disaster without insurers, passes the loss to the "profit and
loss" account, and does not commit the folly of spending his time on
it; he goes on running his business. As for the small trader, harried
by his month-end payments, busy following the chariot of his fortune,
a lawsuit frightening in its length and costly to begin terrifies
him; he gives up trying to see clearly, imitates the great merchant,
and bows his head as he writes off his loss.
The great merchants no longer file their schedules; they liquidate by
private arrangement: the creditors give a receipt in full on taking
what they are offered. Dishonor, judicial delays, Agréés' fees, and
the depreciation of merchandise are thus avoided. Everyone believes
that bankruptcy would yield less than liquidation. There are more
liquidations than bankruptcies in Paris.
The act of the Syndics is intended to prove that every Syndic is
incorruptible, that there is never the slightest collusion between
them and the bankrupt. The pit, which has more or less been a syndic
itself, knows that every Syndic is a covered creditor. It listens,
believes what it likes, and arrives at the day of the concordat after
three months spent verifying the liabilities and the assets. The
Provisional Syndics then make to the assembly a little report, of
which this is the general formula:
"Gentlemen, there was owing to us all, taken together, a million; we
have broken up our man like a sunken frigate: the nails, the iron,
the timber, the copper have yielded three hundred thousand francs. We
therefore have thirty percent of our claims. Happy to have found this
sum when our debtor might have left us only a hundred thousand
francs, we declare him an Aristides, we vote him premiums of
encouragement and crowns, and propose to leave him his assets,
granting him ten or twelve years to pay us the fifty percent he
deigns to promise us. Here is the concordat; go to the desk and sign
it!"
At this speech the happy merchants congratulate and embrace one
another. After the confirmation of this concordat by the court, the
bankrupt becomes a merchant again as before; his assets are returned
to him, he resumes his business, without being deprived of the right
to go bankrupt on the promised dividends, a great-grandchild
bankruptcy that is often seen, like a child brought into the world by
a mother nine months after her daughter's marriage.
If the Concordat does not take, the creditors then appoint definitive
Syndics and take extreme measures, associating to exploit the
property and business of their debtor, seizing everything he will
come into, the inheritance from his father, his mother, his aunt, and
so on. This rigorous measure is carried out by means of a contract of
union.
There are thus two bankruptcies: the bankruptcy of the merchant who
means to take hold of his affairs again, and the bankruptcy of the
merchant who, having fallen into the water, is content to go to the
bottom of the river. Pillerault knew this difference well. It was, in
his opinion as in Ragon's, as difficult to come out of the first pure
as to come out of the second rich. After advising a general surrender
of assets, he went to the most honest Agréé in the place to have it
carried out by liquidating the bankruptcy and putting the assets at
the creditors' disposal. The law requires that the creditors provide,
for the duration of this drama, maintenance for the bankrupt and his
family. Pillerault informed the Judge-Commissioner that he would
provide for the needs of his niece and nephew.
Everything had been arranged by du Tillet to make the bankruptcy a
constant agony for his former master. This is how. Time is so
precious in Paris that generally, in bankruptcies, of two Syndics
only one attends to the business. The other is there for form's sake:
he