SCIENTIFIC INFERENCE

Providing the knowledge and practical experience to begin analysing scientific
data, this book is ideal for physical sciences students wishing to improve their data
handling skills.
The book focuses on explaining and developing the practice and understanding
of basic statistical analysis, concentrating on a few core ideas, such as the visual
display of information, modelling using the likelihood function, and simulating
random data.
Key concepts are developed through a combination of graphical explanations,
worked examples, example computer code and case studies using real data.
Students will develop an understanding of the ideas behind statistical methods and
gain experience in applying them in practice. Further resources are available at
www.cambridge.org/9781107607590, including data files for the case studies so
students can practice analysing data, and exercises to test students’ understanding.

Simon Vaughan is a Reader in the Department of Physics and Astronomy,
University of Leicester, where he has developed and runs a highly regarded course
for final year physics students on the subject of statistics and data analysis.
SCIENTIFIC INFERENCE
Learning from data

SIMON VAUGHAN
University of Leicester
University Printing House, Cambridge CB2 8BS, United Kingdom

Cambridge University Press is a part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107607590
© S. Vaughan 2013
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2013
Printed in the United Kingdom by TJ International Ltd. Padstow Cornwall
A catalogue record for this publication is available from the British Library
Library of Congress Cataloguing in Publication data
Vaughan, Simon, 1976– author.
Scientific inference : learning from data / Simon Vaughan.
pages cm
Includes bibliographical references and index.
ISBN 978-1-107-02482-3 (hardback) – ISBN 978-1-107-60759-0 (paperback)
1. Mathematical statistics – Textbooks. I. Title.
QA276.V34 2013
519.5 – dc23 2013021427
ISBN 978-1-107-02482-3 Hardback
ISBN 978-1-107-60759-0 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
For my family
Contents

For the student
For the instructor
1 Science and statistical data analysis
    1.1 Scientific method
    1.2 Inference
    1.3 Scientific inference
    1.4 Data analysis in a nutshell
    1.5 Random samples
    1.6 Know your data
    1.7 Language
    1.8 Statistical computing using R
    1.9 How to use this book
2 Statistical summaries of data
    2.1 Plotting data
    2.2 Plotting univariate data
    2.3 Centre of data: sample mean, median and mode
    2.4 Dispersion in data: variance and standard deviation
    2.5 Min, max, quantiles and the five-number summary
    2.6 Error bars, standard errors and precision
    2.7 Plots of bivariate data
    2.8 The sample correlation coefficient
    2.9 Plotting multivariate data
    2.10 Good practice in statistical graphics
    2.11 Chapter summary
3 Simple statistical inferences
    3.1 Inference about the mean of a sample
    3.2 Difference in means from two samples
    3.3 Straight line fits
    3.4 Linear regression in practice
    3.5 Residuals: what lies beneath
    3.6 Case study: regression of Reynolds’ data
    3.7 Chapter summary
4 Probability theory
    4.1 Experiments, outcomes and events
    4.2 Probability
    4.3 The rules of the probability calculus
    4.4 Random variables
    4.5 The visual perception of randomness
    4.6 The meaning of ‘probability’ and ‘random’
    4.7 Chapter summary
5 Random variables
    5.1 Properties of random variables
    5.2 Discrete random variables
    5.3 Continuous random variables
    5.4 Change of variables
    5.5 Approximate variance relations (or the propagation of errors)
    5.6 Chapter summary
6 Estimation and maximum likelihood
    6.1 Models
    6.2 Case study: Rutherford & Geiger data
    6.3 Maximum likelihood estimation
    6.4 Weighted least squares
    6.5 Case study: pion scattering data
    6.6 Chapter summary
7 Significance tests and confidence intervals
    7.1 A thought experiment
    7.2 Significance testing and test statistics
    7.3 Pearson’s χ² test
    7.4 Fixed-level tests and decisions
    7.5 Interpreting test results
    7.6 Confidence intervals on MLEs
    7.7 Chapter summary
8 Monte Carlo methods
    8.1 Generating pseudo-random numbers
    8.2 Estimating sampling distributions by Monte Carlo
    8.3 Computing confidence by bootstrap
    8.4 The power of Monte Carlo
    8.5 Further reading
    8.6 Chapter summary
Appendix A Getting started with statistical computation
    A.1 What is R?
    A.2 A first R session
    A.3 Entering data
    A.4 Quitting R
    A.5 More mathematics
    A.6 Writing your own R scripts
    A.7 Producing graphics in R
    A.8 Saving graphics in R
    A.9 Good practice with R
Appendix B Data case studies
    B.1 Michelson’s speed of light data
    B.2 Rutherford–Geiger radioactive decay
    B.3 A study of fluid flow
    B.4 The HR diagram
    B.5 A particle physics experiment
    B.6 Atmospheric conditions in New York City
Appendix C Combinations and permutations
    C.1 Permutations
    C.2 Combinations
    C.3 Probability of combinations
Appendix D More on confidence intervals
Appendix E Glossary
Appendix F Notation
References
Index
For the student

Science is not about certainty, it is about dealing rigorously with uncertainty. The
tools for this are statistical. Statistics and data analysis are therefore an essential
part of the scientific method and modern scientific practice, yet most students of
physical science get little explicit training in statistical practice beyond basic error
handling. The aim of this book is to provide the student with both the knowledge and
the practical experience to begin analysing new scientific data, to allow progress
to more advanced methods and to gain a more statistically literate approach to
interpreting the constant flow of data provided by modern life.
More specifically, if you work through the book you should be able to accomplish
the following.
• Explain aspects of the scientific method, types of logical reasoning and data
  analysis, and be able to critically analyse statistical and scientific arguments.
• Calculate and interpret common quantitative and graphical statistical summaries.
• Use and interpret the results of common statistical tests for difference and
  association, and straight line fitting.
• Use the calculus of probability to manipulate basic probability functions.
• Apply and interpret model fitting, using e.g. least squares, maximum likelihood.
• Evaluate and interpret confidence intervals and significance tests.

Students have asked me whether this is a book about statistics or data analysis or
statistical computing. My answer is that they are so closely connected it is difficult
to untangle them, and so this book covers areas of all three.
The skills and arguments discussed in the book are highly transferable: statistical
presentations of data are used throughout science, business, medicine, politics and
the news media. An awareness of the basic methods involved will better enable you
to use and critically analyse such presentations – this is sometimes called statistical
literacy.

In order to understand the book, you need to be familiar with the mathematical
methods usually taught in the first year of a physics, engineering or chemistry
degree (differential and integral calculus, basic matrix algebra), but this book is
designed so that the probability and statistics content is entirely self-contained.
For the instructor

This book was written because I could not find a suitable textbook to use as the
basis of an undergraduate course on scientific inference, statistics and data analysis.
Although there are good books on different aspects of introductory statistics, those
intended for physicists seem to target a post-graduate audience and cover either
too much material or too much detail for an undergraduate-level first course. By
contrast, the ‘Intro to stats’ books aimed at a broader audience (e.g. biologists,
social scientists, medics) tend to cover topics that are not so directly applicable
for physical scientists. And the books aimed at mathematics students are usually
written in a style that is inaccessible to most physics students, or in a recipe-book
style (aimed at science students) that provides ready-made solutions to common
problems but develops little understanding along the way.
This book is different. It focuses on explaining and developing the practice and
understanding of basic statistical analysis, concentrating on a few core ideas that
underpin statistical and data analysis, such as the visual display of information,
modelling using the likelihood function, and simulating random data. Key
concepts are developed using several approaches: verbal exposition in the main text,
graphical explanations, case studies drawn from some of history’s great physics
experiments, and example computer code to perform the necessary calculations.¹
The result is that, after following all these approaches, the student should both
understand the ideas behind statistical methods and have experience in applying
them in practice.

¹ These are based on R, a freely available software package for data analysis and statistics that is used in many
statistics textbooks.

The book is intended for use as a textbook for an introductory course on data
analysis and statistics (with a bias towards students in physics) or as a self-study
companion for professionals and graduate students. The book assumes familiarity
with calculus and linear algebra, but no previous exposure to probability or statistics
is assumed. It is suitable for a wide range of undergraduate and postgraduate science
students.
The book has been designed with several special features to improve its value
and effectiveness with students:
• several complete data analysis case studies using real data from some of history’s
  great experiments
• ‘example boxes’ – approximately 20 boxes throughout the text that give specific,
  worked examples for concepts as they are discussed
• ‘computer practice boxes’ – approximately 90 boxes throughout the text that give
  working R code to perform the calculations discussed in the text or produce the
  plots shown
• graphical explanations of important concepts
• appendices that provide technical details supplementary to the main text
• a well-populated glossary of terms and list of notational conventions.

The emphasis on a few core ideas and their practical applications means that
some subjects usually covered in introductory statistics texts are given little or
no treatment here. Rigorous mathematical proofs are not covered – the interested
reader can easily consult any good reference work on probability theory or
mathematical statistics to check these. In addition, we do not cover some topics of
‘classical’ statistics that are dealt with in other introductory works. These topics
include
• more advanced distribution functions (beta, gamma, multinomial, ...)
• ANOVA and the generalised linear model
• characteristic functions and the theory of moments
• decision and information theories
• non-parametric tests
• experimental design
• time series analysis
• multivariate analysis (principal components, clustering, ...)
• survival analysis
• spatial data analysis.
Upon completion of this book the student should be in a much better position to
understand any of these topics from any number of more advanced or comprehensive
texts.
Perhaps the ‘elephant in the room’ question is: what about Bayesian methods?
Unfortunately, owing to practical limitations there was not room to include full
chapters developing Bayesian methods. I hope I have designed the book in such a
way that it is not wholly frequentist or Bayesian. The emphasis on model fitting
using the likelihood function (Chapter 6) could be seen as the first step towards a
Bayesian analysis (i.e. implicitly using flat priors and working towards the posterior
mode). Fortunately, there are many good books on Bayesian data analysis that can
then be used to develop Bayesian ideas explicitly. I would recommend Gelman et al.
(2003) generally and Sivia and Skilling (2006) or Gregory (2005) for physicists in
particular. Albert (2007) also gives a nice ‘learn as you compute’ introduction to
Bayesian methods using R.
1 Science and statistical data analysis

It is remarkable that a science which began with the consideration of
games of chance should have become the most important object of human
knowledge.
Pierre-Simon Laplace (1812)
Théorie Analytique des Probabilités

Why should a scientist bother with statistics? Because science is about dealing
rigorously with uncertainty, and the tools to accomplish this are statistical. Statistics
and data analysis are an indispensable part of modern science.
In scientific work we look for relationships between phenomena, and try to
uncover the underlying patterns or laws. But science is not just an ‘armchair’
activity where we can make progress by pure thought. Our ideas about the workings
of the world must somehow be connected to what actually goes on in the world.
Scientists perform experiments and make observations to look for new
connections, test ideas, estimate quantities or identify qualities of phenomena. However,
experimental data are never perfect. Statistical data analysis is the set of tools that
helps scientists handle the limitations and uncertainties that always come with data.
The purpose of statistical data analysis is insight, not just numbers. (That’s why
the book is called Scientific Inference and not something more like Statistics for
Physics.)

1.1 Scientific method

Broadly speaking, science is the investigation of the physical world and its
phenomena by experimentation. There are different schools of thought about the philosophy
of science and the scientific method, but there are some elements that almost
everyone agrees are components of the scientific method.

Figure 1.1 A cartoon of a simplified model of the scientific method.

Hypothesis A hypothesis or model is an explanation of a phenomenon in terms
of others (usually written in terms of relations or equations), or the suggestion
of a connection between phenomena.
Prediction A useful hypothesis will allow predictions to be made about the
outcome of experiments or observations.
Observation The collection of experimental data in order to investigate a
phenomenon.
Inference A comparison between predictions and observations that allows us
to learn about the hypothesis or model.
What distinguishes science from other disciplines is the insistence that ideas be
tested against what actually happens in Nature. In particular, hypotheses must
make predictions that can be tested against observations. Observations that match
closely the predictions of a hypothesis are considered as evidence in support of
the hypothesis, but observations that differ significantly from the predictions count
as evidence against the hypothesis. If a hypothesis makes no predictions about
possible observations, how can we learn about it through observation?
Figure 1.1 gives a summary of a simplified scientific method. Models and
hypotheses¹ can be used to make predictions about what we can observe.
Hypotheses may come from some more general theory, or may be more ad hoc,
based on intuition or guesswork about the way some phenomenon might work.
Experiments or observations of the phenomenon can be made, and the results
compared with the predictions of the hypothesis. This comparison allows one to test
the model and/or estimate any unknown parameters. Any mismatch between data
and model predictions, or other unpredicted findings in the data, may suggest ways
to revise or change the model. This process of learning about hypotheses from data
is scientific inference. One may enter the cycle at any point: by proposing a model,
making predictions from an existing model, collecting data on some phenomenon
or using data to test a model or estimate some of its parameters. In many areas of
modern science, the different aspects have become so specialised that few, if any,
researchers practice all of these activities (from theory to experiment and back),
but all scientists need an appreciation of the other steps in order to understand the
‘big picture’. This book focuses on the induction/inference part of the chain.

¹ The terms ‘hypothesis’, ‘model’ and ‘theory’ have slightly different meanings but are often used
interchangeably in casual discussions. A theory is usually a reasonably comprehensive, abstract framework (of definitions,
assumptions and relations or equations) for describing generally a set of phenomena, that has been tested and
found at least some degree of acceptance. Examples of scientific theories are classical mechanics,
thermodynamics, germ theory, kinetic theory of gases, plate tectonics etc. A model is usually more specific. It might be
the application of a theory to a particular situation, e.g. a classical mechanics model of the orbit of Jupiter. Some
authors go on to distinguish hypotheses as models, and their parameters, which may be speculative, as they are
used in statistical inference. For now we have no need to distinguish between models and hypotheses.

1.2 Inference
The process of drawing conclusions based on what is already known is called
inference. There are two types of reasoning process used in inference: deductive
and non-deductive.

1.2.1 Deductive reasoning (from general to specific)

The first kind of reasoning is deductive reasoning. This starts with premises and
follows the rules of logic to arrive at conclusions. The conclusions are therefore
true as long as the premises are true. Philosophers say the premises entail the
conclusion. Mathematics is based on deductive reasoning: we start from axioms,
follow the rules of logic and arrive at theorems. (Theorems should be distinguished
from theories – the former are the product of deductive reasoning; the latter are
not.) For example, the two propositions ‘A is true implies B is true’ and ‘A is true’
together imply ‘B is true’. This type of argument is a simple deduction known as
a syllogism, which comprises a major premise and a minor premise; together they
imply a conclusion:

    Major premise : A ⇒ B   (read: A is true implies B is true)
    Minor premise : A       (read: A is true)
    Conclusion    : B       (read: B is true).

Deductive reasoning leads to conclusions, or theorems, that are inescapable given
the axioms. One can then use the axioms and theorems together to deduce more
theorems, and so on. A theorem² is something like ‘A ⇒ B’, which simply says
that the truth value of A is transferred to B, but it does not, in and of itself, assert
that A or B are true. If we happen to know that A is indeed true, the theorem tells
us that B must also be true. The box gives a simple proof that there is no largest
prime number, a purely deductive argument that leads to an ineluctable conclusion.
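
The syllogism above, and the point that ‘A ⇒ B’ transfers truth from A to B without asserting either, can be checked by brute force over the four possible truth assignments. What follows is a minimal Python sketch and is not taken from the book (whose own examples use R); the function names are illustrative assumptions.

```python
from itertools import product

# All four combinations of truth values for the propositions A and B.
ASSIGNMENTS = list(product([False, True], repeat=2))


def implies(a, b):
    """Material implication: 'a => b' is false only when a is true and b is false."""
    return (not a) or b


def modus_ponens_is_valid():
    """Whenever both premises (A => B, and A) hold, the conclusion B holds."""
    return all(b for a, b in ASSIGNMENTS if implies(a, b) and a)


def implication_matches_conjunction_reading():
    """'A => B' holds exactly when 'A' and 'A and B' have the same truth value."""
    return all(implies(a, b) == (a == (a and b)) for a, b in ASSIGNMENTS)


def affirming_the_consequent_is_valid():
    """Checks the reversed argument: from A => B and B, conclude A."""
    return all(a for a, b in ASSIGNMENTS if implies(a, b) and b)


print(modus_ponens_is_valid())
print(implication_matches_conjunction_reading())
print(affirming_the_consequent_is_valid())
```

The first two checks succeed on every assignment. The third fails for A false and B true, which is the formal reason why knowing ‘A ⇒ B’ and B does not, by deduction alone, establish A.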

Box 1.1
Deduction example – proof of no largest prime number
• Suppose there is a largest prime number; call this $p_N$, the Nth prime.
• Make a list of each and every prime number: $p_1 = 2$, $p_2 = 3$, $p_3 = 5$, until $p_N$.
• Now form a new number q from the product of the N primes in the list, and add one:

  $$ q = 1 + \prod_{i=1}^{N} p_i = 1 + (p_1 \times p_2 \times p_3 \times \cdots \times p_N) \tag{1.1} $$

  which is either prime or it is not.
• This new number q is larger than every prime in the list, but it is not divisible by
  any prime in the list – it always leaves a remainder of one.
• This means q is prime since it has no prime factors (the fundamental theorem of
  arithmetic says that any integer larger than 1 has a unique prime factorisation).
• But this is a contradiction. We have found a prime number q that is larger than
  every number in our list, in contradiction with our definition of $p_N$. Therefore our
  original assumption – that there is a largest prime, $p_N$ – must be false.
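
The construction in Box 1.1 can be tried numerically. The sketch below is in Python, with helper names that are assumptions made for illustration and not part of the book. It builds q from the first N primes for a few small N, confirms that q leaves a remainder of one on division by every prime in the list, and finds the smallest prime factor of q.

```python
from math import prod


def first_primes(n):
    """Return the first n primes by trial division (adequate for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime if no smaller prime divides it
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def smallest_prime_factor(m):
    """Return the smallest prime factor of an integer m >= 2."""
    d = 2
    while d * d <= m:
        if m % d == 0:
            return d
        d += 1
    return m  # no divisor found up to sqrt(m), so m itself is prime


def euclid_number(n):
    """q = 1 + (product of the first n primes), as in equation (1.1)."""
    return 1 + prod(first_primes(n))


for n in range(1, 8):
    primes = first_primes(n)
    q = euclid_number(n)
    remainders = {q % p for p in primes}  # expected to be {1} every time
    print(n, q, remainders, smallest_prime_factor(q))
```

When the list is not in fact complete, q need not itself be prime: for N = 6, q = 30031 = 59 × 509. Its prime factors are nevertheless absent from the list, so in every case the list of primes turns out to be incomplete, which is what the proof requires.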

Deduction involves reasoning from the general to the specific. If a general
principle is true, we can conclude that any particular cases satisfying the general
principle are true. For example:

    Major premise : All monkeys like bananas
    Minor premise : Zippy is a monkey
    Conclusion    : Zippy likes bananas.

The conclusion is unavoidable given the premises. (This type of argument is given
the technical name modus ponens by philosophers of logic.) If some theory is true
we can predict that its consequences must also be true. This applies to probabilistic
as well as deterministic theories. Later on we consider flipping coins, rolling dice,
and other random events. Although we cannot precisely predict the outcome of
individual events (they are random!), we can derive frequencies for the various
outcomes in repeated events.

² It is worth noting here that the logical implication used above, e.g. B ⇒ A, does not mean that A can be derived
from B, but only that if B is true then A must also be true, or that the propositions ‘B is true’ and ‘B and A are
both true’ must have the same truth value (both true, or both false).
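
The point that frequencies in repeated random events are predictable, even though single outcomes are not, can be illustrated by simulation. This is a minimal Python sketch assuming a fair six-sided die, for which the deduced frequency of each face is 1/6; the function name, the seed and the numbers of rolls are arbitrary choices made for illustration.

```python
import random
from collections import Counter


def roll_frequencies(n_rolls, seed=1):
    """Simulate n_rolls of a fair six-sided die; return the relative frequency of each face."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


for n in (60, 6_000, 600_000):
    freqs = roll_frequencies(n)
    # largest departure of any observed frequency from the predicted value 1/6
    worst = max(abs(f - 1 / 6) for f in freqs.values())
    print(n, round(worst, 4))
```

No single roll can be predicted, but the largest deviation from 1/6 should typically shrink as the number of rolls grows.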

1.2.2 Inductive reasoning (from specific to general)


Inductive reasoning is a type of non-deductive reasoning. Induction is often said to
describe arguments from special cases to general ones, or from effects to causes.
For example, if we observe that the Sun has risen every day for many days, we can
inductively reason that it will continue to do so. We cannot directly deduce that the
Sun will rise tomorrow (there is no logical contradiction implied if it does not).
The basic point about the limited power of our inferences about the real world
(i.e. our inductive reasoning) was made most forcefully by the Scottish philosopher
David Hume (1711–1776), and is now known as the problem of induction. The
philosopher and mathematician Bertrand Russell furnished us with a memorable
example in his book The Problems of Philosophy (Russell, 1997, ch. 4):
imagine a chicken that gets fed by the farmer every day and so, quite understandably,
imagines that this will always be the case . . . until the farmer wrings its neck! The chicken
never expected that to happen; how could it? – given it had no experience of such an event
and the uniformity of its previous experience had been so great as to lead it to assume the
pattern it had always observed (chicken gets fed every day) was universally true. But the
chicken was wrong.³

You can see that inductive reasoning does not have the same power as deductive
reasoning: a conclusion arrived at by deductive reasoning is necessarily true if the
premises are true, whereas a conclusion arrived at by inductive reasoning is not
necessarily true, because it is based on incomplete information. We cannot deduce (prove)
that the Sun will rise tomorrow, but nevertheless we do have confidence that it
will. We might say that deductive reasoning concerns statements that are either
true or false, whereas inductive reasoning concerns statements whose truth value
is unknown, about which we are better to speak in terms of ‘degree of belief’ or
‘confidence’. Let’s see an example:
    Major premise : All monkeys we have studied like grapes
    Minor premise : Zippy is a monkey
    Conclusion    : Zippy likes grapes.

The conclusion is not unavoidable; other conclusions are allowed. There is no
logical contradiction in concluding

    Conclusion    : Zippy does not like grapes.

But the premises do give us some information. It seems plausible, even probable,
that Zippy likes grapes.

³ By permission of Oxford University Press.

1.2.3 Abductive reasoning (inference to the best explanation)

There is another kind of non-deductive inference, called abduction, or inference to
the best explanation. For our purposes, it does not matter whether abduction is a
particular type of induction, or another kind of non-deductive inference alongside
induction. Let’s go straight to an example:

    Premise    : Nelly likes bananas
    Premise    : The banana left near to Nelly has been eaten
    Conclusion : Nelly ate the banana.

Again the conclusion is not unavoidable; other conclusions are possible. Perhaps
someone else ate the banana. But the original conclusion seems to be in some sense
the simplest of those allowed. This kind of reasoning, from observed data to an
explanation, is used all the time in science.
Induction and abduction are closely related. When we make an inductive
inference from the limited observed data (‘the monkeys in our sample like grapes’) to
unobserved data (‘Zippy likes grapes’) it is as if we implicitly passed through a
theory (‘all monkeys like grapes’) and then deduced the conclusion from this.

1.3 Scientific inference

Scientific work employs all the above forms of reasoning. We use deductive
reasoning to go from general theories to specific predictions about the data we could
observe, and non-deductive reasoning to go from our limited data to general
conclusions about unobserved cases or theories.
Imagine A is the theory of classical mechanics and B is the predicted path of a
rocket deduced from the theory and the details of the launch. Now, we make some
observations and find the rocket did indeed follow the predicted path B (as well
as we can determine). Can we conclude that A is true? We may infer A, but not
deductively. Other conclusions are possible. In fact, the observational confirmation
of one prediction (or even a thousand) does not prove the theory in the same sense
as a deductive proof. A different theory may make indistinguishable predictions in
all of the cases considered to date, but differ in its predictions for other (e.g. future)
observations.
Experimental and observational science is all about inductive reasoning, going
from a finite number of observations or results to a general conclusion about
unobserved cases (induction), or a theory that explains them (abduction). In recent
years, there has been a lot of interest in showing that inductive reasoning can be
formalised in a manner similar to deductive reasoning, so long as one allows for
the uncertainty in the data and therefore in the conclusions (Jeffreys, 1961; Jaynes,
2003).
You might still have reservations about the need for statistical reasoning. After
all, the great experimental physicist Ernest Rutherford is supposed to have said

If your experiment needs statistics, you ought to have done a better experiment!⁴

Rutherford probably didn’t say this, or didn’t mean for it to be taken at face value.
Nevertheless, statistician Bradley Efron, about a hundred years later, contrasted this
simplistic view with the challenges of modern science (Efron, 2005):

Rutherford lived in a rich man’s world of scientific experimentation, where nature
generously provided boatloads of data, enough for the law of large numbers to squelch any
noise. Nature has gotten more tight-fisted with modern physicists. They are asking harder
questions, ones where the data is thin on the ground, and where efficient inference becomes
a necessity. In short, they have started playing in our ball park.

But it is not just scientists who use (or should use) statistical data analysis. Any
time you have to draw conclusions from data you will make use of these skills.
This is as true for journalism as it is for particle physics. Whether the data form
part of your research or come from a medical test you were given, you need to be
able to understand and interpret them properly, making inferences using methods
built on the same basic principles.

1.4 Data analysis in a nutshell


The analysis of data⁵ can be broken into different modes that are employed either
individually or in combination; the outcome of one mode of analysis may inform
the application of other modes.

Data reduction This is the process of converting raw data into something more
useful or meaningful to the experimenter: for example, converting the voltage
changes in a particle detector (e.g. a proportional counter) into the records of
the times and energies of individual particle detections. In turn, these may be
further reduced into an energy spectrum for a specific type of particle.

4 The earliest reference to this phrase I can find is Bailey (1967, ch. 2, p. 23).
5 ‘Data’ is the plural of ‘datum’ and means ‘items of information’, although it has now become acceptable to use
‘data’ as a singular mass noun rather like ‘information’.
Exploratory data analysis (EDA) is an approach to data analysis that uses
quantitative and graphical methods in an attempt to reveal new and inter-
esting patterns in the data. One does not test a particular hypothesis, but
instead ‘plays around with the data’, searching for patterns suggestive of new
hypotheses.
Inferential data analysis Sometimes known as ‘confirmational data analysis’.
We can divide this into two main tasks: model checking and parameter esti-
mation. The former is the process of choosing which of a set of models
provides the most convincing explanation of the data; the latter is the process
of estimating values of a model’s unknown parameters.

Exploratory data analysis is all about summarising the data in ways that might
provide clues about their nature, and inferential data analysis is about making
reasonable and justified inferences based on the data and some set of hypotheses.

1.5 Random samples


Our data about the real world are almost always incomplete, affected by random
errors, or both. Let’s say we wanted to find the answer to some important question:
does the UK population prefer red or green sweets? We could survey the entire
population and in principle get a complete answer, but this would normally be
impractical. So we settle for a subset of the population, and assume this is
representative of the population at large. The results from the subset of people we
actually survey form a sample, and this is drawn from some population (of all the
responses from the entire population). The sample is just one of the many possible
samples that could be obtained from the same population.
But what we’re interested in is the population, so we need to use what we know
about the sample to infer something about the population. A small sample is easy
to collect, but smaller samples are also more susceptible to random fluctuations
(think of surveying just one person and extrapolating his/her answer to the entire
population); a larger sample is less prone to such fluctuations but is also harder to
collect. We also need to be sure to sample randomly and in an unbiased fashion – if
we only sample younger people, or people in certain counties, these may not reflect
the wider population. We need ways to quantify the properties of the sample, and
also to quantify what we can learn about the population. This is statistics.
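The trade-off between sample size and random fluctuation can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that 60% of the population prefers red sweets (a made-up figure, not a survey result) and repeats the survey many times for two sample sizes; the names p_red, n_surveys and survey_fraction are hypothetical.

```r
set.seed(1)                # make the simulation reproducible
p_red <- 0.6               # assumed (hypothetical) population fraction preferring red
n_surveys <- 2000          # number of repeated, independent surveys to simulate

# Fraction preferring red in each simulated survey of n people
survey_fraction <- function(n) rbinom(n_surveys, size = n, prob = p_red) / n

small <- survey_fraction(10)     # surveys of 10 people
large <- survey_fraction(1000)   # surveys of 1000 people

# The spread of results across repeated surveys shrinks as the sample grows
sd_small <- sd(small)
sd_large <- sd(large)
print(c(sd_small = sd_small, sd_large = sd_large))
```

Each simulated survey is one possible sample from the same population; the scatter between surveys is the random fluctuation described above.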
You may be left thinking: what’s this got to do with experiments in the physical
sciences? We often don’t have a simple population from which we pull a random
sample. Each time we perform some measurement (or series of measurements) we
are collecting a sample of possible data. We can think of our sample as being drawn
from a population, a hypothetical population of all the possible data that could be
produced from our measurement(s). The differences between samples are due to
randomness in the experiment or measurement processes.

Figure 1.2 Illustration of the distinct concepts of accuracy and precision as applied
to the positions of ‘shot’ on a target.

1.5.1 Errors and uncertainty


The type of randomness described above is usually called random error (or
measurement error) by physicists (the term error is used differently by statisticians⁶).
Here, error does not mean a mistake as in the usual sense. To most scientists the
‘measurement error’ is an estimate of the repeatability of a measurement. If we take
some data and use them to infer the speed of sound through air, what is the error
on our measurement? If we repeat the entire experiment – under almost identical
conditions – chances are the next measurements will be slightly different, by some
unpredictable amount. As will further repeats. The ‘random error’ is a quantitative
indication of how close repeated results will be. Data with small errors are said to
have high precision – if we repeat the measurement the next value is likely to be
very close to the previous value(s).
In addition to random errors, there is another type of error called systematic
error. A systematic error is a bias in a measurement that leads to the values being
systematically either too low or too high, and may arise from the selection of
the sample under study or the calibration of the instrument used. Data with small
systematic error are said to be accurate; if only we could reduce the random error
we could get a result extremely close to the ‘true’ value. Figure 1.2 illustrates
the difference between precision and accuracy. The experimenter usually works
to reduce the impact of both random and systematic errors (by ‘beating down the
errors’) in the design and execution of the experiment, but the reality is that such
errors can never be completely eliminated.

6 To a statistician, ‘error’ is a technical term for the discrepancy between what is observed and what is expected.

It is important to distinguish between accuracy and precision. These two con-
cepts are illustrated in Figure 1.2. Precise data are narrowly spread, whereas accu-
rate data have values that fall (on average) around the true value. Precision is an
indicator of variation within the data and accuracy is a measure of variation between
the data and some ‘true’ value. These apply to direct measurements of simple
quantities and also to more complicated estimates of derived quantities (Chapters 6
and 7).
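The distinction can be made concrete with simulated measurements. The following is a minimal sketch, assuming a hypothetical ‘true’ value of 343 (think of a speed of sound in m/s) and two invented instruments: one with a small random error but a systematic offset, and one with no offset but a large random error.

```r
set.seed(2)
true_value <- 343    # hypothetical 'true' value being measured
n <- 1000            # number of repeated measurements per instrument

# Precise but inaccurate: small random error, large systematic offset
precise_biased <- rnorm(n, mean = true_value + 10, sd = 1)
# Accurate but imprecise: no systematic offset, large random error
accurate_noisy <- rnorm(n, mean = true_value, sd = 10)

# Offset of the average from the true value (accuracy) and scatter (precision)
summarise <- function(x) c(offset = mean(x) - true_value, spread = sd(x))
res_a <- summarise(precise_biased)
res_b <- summarise(accurate_noisy)
print(rbind(precise_biased = res_a, accurate_noisy = res_b))
```

The first set has the smaller spread (higher precision) but the larger offset (lower accuracy); the second is the other way round.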

1.6 Know your data


There are several types of data you may be confronted with. The main types are as
follows.
Categorical data take on values that are not numerical but can be placed in
distinct categories. For example, records of gender (male, female) and particle
type (electron, pion, muon, proton etc.) are categorical data.
Ordinal data have values that can be ranked (put in order) or have a rating
scale attached, but the differences between the ranks cannot be compared. An
example is the Likert-type scale that you see on many surveys: 1, strongly
disagree; 2, disagree; 3, neutral; 4, agree; 5, strongly agree. These have a
definite order, but the difference between options 1 and 2 might not be the
same as between options 3 and 4.
Discrete data have numerical values that are distinct and separate (e.g. 1, 2,
3, . . . ). Examples from physics might be the number of planets around stars,
or the number of particles detected in a certain time interval.
Continuous data may take on any value within a finite or infinite interval. You
can count, order and measure continuous data: for example, the energy of
an accelerated particle, temperature of a star, ocean depth, magnetic field
strength etc.
Furthermore, data may have many dimensions.

Univariate data concern only one variable (e.g. the temperature of each star in
a sample).
Bivariate data concern two variables (e.g. the temperatures and luminosity of
stars in a sample). Each data point contains two values, like the coordinates
of a point on a plane.
Multivariate data concern several variables (e.g. temperature, luminosity, dis-
tance etc. of stars). Each data point is a point in an N-dimensional space, or
an N-dimensional vector.
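The data types listed above map fairly naturally onto R's basic structures. The sketch below is one possible representation, with invented values; the variable names (particle, opinion, n_detected, energy) are hypothetical and chosen only for illustration.

```r
# Categorical: unordered labels
particle <- factor(c("electron", "muon", "pion", "electron"))
# Ordinal: ordered labels (Likert-type scale)
opinion <- factor(c("agree", "neutral", "strongly agree", "disagree"),
                  levels = c("strongly disagree", "disagree", "neutral",
                             "agree", "strongly agree"),
                  ordered = TRUE)
# Discrete: integer counts
n_detected <- c(3L, 0L, 7L, 2L)
# Continuous: real-valued measurements
energy <- c(1.27, 0.94, 3.10, 2.25)

# Multivariate data: one row per data point, one column per variable
d <- data.frame(particle, opinion, n_detected, energy)
str(d)

# Ordered factors can be ranked, but differences between levels are not defined
opinion[1] > opinion[2]
```

Storing ordinal data as an ordered factor rather than as the numbers 1 to 5 guards against accidentally treating the gaps between ranks as equal.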
As mentioned previously, there are two main roles that variables play.
Explanatory variables (sometimes known as independent variables) are
manipulated or chosen by the experimenter/observer in order to examine
change in other variables.
Response variables (sometimes known as dependent variables) are observed in
order to examine how they change as a function of the explanatory variables.
For example, if we recorded the voltage across a circuit element as we drive it with
different AC frequencies, the frequency would be the explanatory variable, and
the response variable would be the voltage. Usually the error in the explanatory
variable is far smaller than, and can be neglected by comparison with, the error on
the response variables.

1.7 Language
The technical language used by statisticians can be quite different from that com-
monly used by scientists, and this language barrier is one of the reasons that science
students (and professional researchers!) have such a hard time with statistics books
and papers. Even within disciplines there are disagreements over the meaning and
uses of particular terms.
For example, physicists often say they measure or even determine the value of
some physical quantity. A statistician might call this estimation. Physicists tend
to use words like error and uncertainty interchangeably and rather imprecisely.
In these cases, where conventional statistical language or notation offers a more
precise definition, we shall use it. This is a deliberate choice. By using terminology
and notation more like that of a formal statistics course, and less like that of an
undergraduate laboratory manual, we hope to give the readers more scope for using
and developing their knowledge and skills. It should be easier to understand more
advanced texts on aspects of data analysis or statistics, and understand analyses
from other fields (e.g. biology, medicine).
This means that we do not explicitly make use of the definitions set out in the
Guide to the Expression of Uncertainty in Measurement (GUM, 2008). The doc-
ument (now with revisions and several supplements) is intended to establish an
industrial standard for the expression of uncertainty. Its recommendations included
categorising uncertainty into ‘type A’ (estimated based on statistical treatment of
a sample of data) and ‘type B’ (evaluated by other means), using ‘standard uncer-
tainty’ for the standard deviation of an estimator, ‘coverage factor’ for a multiplier
on the ‘combined standard uncertainty’. And so on. These recommendations may
be valuable within some fields such as metrology, but they are not standard in most
physics laboratories (research or teaching) as of 2013, and are unlikely to be taken
up by the broader community of researchers using and researching statistics and
data analysis.

1.8 Statistical computing using R


You will need to be able to use a computer to do statistical data analysis on all
but the smallest datasets. It is still possible to understand the ideas and methods of
statistical data analysis in purely theoretical terms, without learning how to perform
the analysis using a computer. The purpose of this book is to help you not only
understand and interpret simple statistical analyses, but also perform analyses on
data, and that means using a computer.
Throughout this book we give examples of statistical computing using the R
environment (see Appendix A). R is an environment for statistical computation
and data analysis. It is really a programming language with an integrated suite of
software for manipulating data, producing plots and performing calculations, and
has a very wide range of powerful statistical tools ‘built in’. Using R it is relatively
simple to perform statistical calculations accurately – this means you can spend
less time worrying about the computational details, and more time thinking about
the data and the statistical concepts. Appendix A provides a gentle introduction
and a walkthrough of R.
Throughout the text are shaded boxes (R.boxes) containing the R code to carry
out or demonstrate the procedures discussed in the accompanying text. Lines of
R are written with typewriter font; these are meant to be typed at the R
command line. As you progress through the book, working through the examples
of R code, you will acquire the skills necessary to complete the data analysis
case studies (and hopefully more besides). Of course, R is just one of the options
you have for carrying out statistical computing. If your preferences lie elsewhere
you should still be able to gain from the book by skipping past the R.boxes, or
translating their contents into your favourite computing language.

1.9 How to use this book


This book is intended to provide a reasonably self-contained introduction to design-
ing, performing and presenting statistical analyses of experimental data. Several
devices are used to encourage you, the reader, to engage with the material rather
than just read it. When a new term is used for the first time it usually appears in
italics and is then defined, and to aid your memory there is a glossary of statistical
terms towards the back of the book, along with a crib sheet for the mathematical
notation. Dotted throughout the notes are two types of text box: white boxes contain
examples or applications of ideas discussed in the text; shaded boxes (‘R.boxes’)
contain examples using the R computing environment for you to work through
yourself. We rely heavily on examples to illustrate the main ideas, and these are
based on real data. The datasets are discussed in Appendix B.
In outline, the rest of the book is organised as follows.
• Chapter 2 discusses numerical and graphical summaries of data, and the basics
of exploratory data analysis.
• Chapter 3 introduces some of the basic recipes of statistical analyses, such as
looking for difference of the mean, or estimating the gradient of a straight line
relationship.
• Chapter 4 introduces the concept of probability, starting with discrete, random
events. We then discuss the rules of the probability calculus and develop the
theory of random variables.
• Chapter 5 extends the discussion of probability to discuss some of the most
frequently encountered distributions (and also mentions, in passing, the central
limit theorem).
• Chapter 6 discusses the fitting of simple models to data and the estimation of
model parameters.
• Chapter 7 considers the uncertainty on the parameter estimates, and model testing
(i.e. comparing predictions of hypotheses to data).
• Chapter 8 discusses Monte Carlo methods, computer simulations of random
experiments that can be used to solve difficult statistical problems.
• Appendix A describes how to get started in the computer environment R used in
the examples throughout the text.
• Appendix B introduces the data case studies used throughout the text.
• Appendix C provides a refresher on combinations and permutations.
• Appendix D discusses the construction of confidence intervals (extending the
discussion from Chapter 7).
• A glossary can be found on p. 217.
• A list of the notation can be found on p. 224.
2
Statistical summaries of data

The greatest value of a picture is when it forces us to notice what we
never expected to see.
John Tukey (1977),
statistician and pioneer of exploratory data analysis

How should you summarise a dataset? This is what descriptive statistics and
statistical graphics are for. A statistic is just a number computed from a data
sample. Descriptive statistics provide a means for summarising the properties of
a sample of data (many numbers or values) so that the most important results
can be communicated effectively (using few numbers). Numerical and graphical
methods, including descriptive statistics, are used in exploratory data analysis
(EDA) to simplify the uninteresting and reveal the exceptional or unexpected in
data.

2.1 Plotting data


One of the basic principles of good data analysis is: always plot the data. The
brain–eye system is incredibly good at recognising patterns, identifying outliers
and seeing the structure in data. Visualisation is an important part of data analysis,
and when confronted with a new dataset the first step in the analysis should be to
plot the data. There is a wide array of different types of statistical plot useful in data
analysis, and it is important to use a plot type appropriate to the data type. Graphics
are usually produced for screen or paper and so are inherently two dimensional,
even if the data are not.
The variables can often be classified as explanatory or response. We are usually
interested in understanding the behaviour of the response variable as a function of
the explanatory variable, where the explanatory variable is usually controlled by
the experimenter. Different plots are suitable depending on the number and type of
the variables.

• Data with one variable (univariate)
– If the data are continuous, we can make a histogram showing how the data are
distributed. A smooth density curve is an alternative to a histogram.
– If the data are discrete or categorical, we could produce a bar chart,
similar to a histogram but with gaps between the bars to indicate their
discreteness.
– If the data are a time series (a series of points taken at distinct times), we can
make a time series plot by marking them as points on the x–y plane with y the
data and x the time corresponding to each data point.
– If the data are fractions of a whole, we may use compositional plots such as the
pie chart; however, these are rarely used in scientific and statistical graphics
(it is usually more efficient to present the proportions in a table or a bar
chart).
• Data with two variables (bivariate)
– If both variables are continuous, we may use a scatter plot where the data are
plotted as points on the x–y plane.
– There are many ways of augmenting a standard scatter plot, such as joining the
points with lines (if the order is important or if it improves clarity), overlaying
a smoothed curve or theoretical prediction curve and including error bars to
indicate the precisions of the measurements.
– If the explanatory variable is discrete (or binned continuous), we may choose
from a dotchart, boxplot, stripchart or others.
• Data with many variables (multivariate)
– A matrix of several scatter plots, each showing a different pair of variables,
may be used to illustrate the dependence of each variable upon each of the
others.
– A coplot shows several scatter plots of the same two variables, where the data
in each panel of the plot differ by the values of a third variable.
– With three continuous variables we can make a projection of the three-
dimensional equivalent of the scatter plot.
– Another variation on the three-dimensional scatter plot is the bubble plot, which
uses differently sized symbols to represent a third variable.
– If we have one response variable and two explanatory variables, we can make
an image using either greyscale, colours or contours to indicate the values of
the response variable over the explanatory dimensions, or we can construct a
projection of the surface, e.g. z = f (x, y).
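Most of the plot types in the list above are available as single commands in base R. The following sketch runs through several of them. Apart from morley, the datasets used here (chickwts, faithful, iris, quakes and volcano, all supplied with R) are our own choice for illustration and are not among the case studies of this book.

```r
# Univariate, continuous: histogram and smooth density curve
hist(morley$Speed)
plot(density(morley$Speed))

# Univariate, categorical: bar chart of the number of items per category
barplot(table(chickwts$feed))

# Bivariate, both continuous: scatter plot
plot(faithful$eruptions, faithful$waiting)

# Bivariate, discrete explanatory variable: boxplot and stripchart
boxplot(Speed ~ Expt, data = morley)
stripchart(Speed ~ Expt, data = morley, vertical = TRUE)

# Multivariate: matrix of scatter plots, and a conditioning plot (coplot)
pairs(iris[, 1:4])
coplot(lat ~ long | depth, data = quakes)

# One response over two explanatory dimensions: image with contours overlaid
image(volcano)
contour(volcano, add = TRUE)
```

Each command opens a new plot with default settings; labels, colours and symbols can be adjusted through further arguments (see the help page for each command).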

Figure 2.1 Histogram of the 100 Michelson speed-of-light data points. (Horizontal axis: speed − 299 000, in km s−1; vertical axis: frequency.)

2.2 Plotting univariate data


Michelson’s data – see Appendix B, section B.1 – records 100 experimental values
from his speed-of-light experiment. For compactness the tabulated data have had
the leading three digits removed (i.e. 299 000 km s−1 subtracted). How should we
plot these data? One option is an index plot, which plots points on the x–y plane at
coordinates (1, y1 ), (2, y2 ) and so on, one point for each data value yi . The order
of the points is simply the order they occur in the table, which may (or may not)
be the order they were obtained. Such a plot would make it much easier to see the
‘centre’ and ‘spread’ of the sample, compared with a table of raw numbers. But
there are more revealing ways to view the data.
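In R an index plot is what plot() produces when given a single numeric vector. A minimal sketch, using the morley$Speed values supplied with R; the dashed line at the sample mean is our own addition, as a rough guide to the ‘centre’:

```r
# Index plot: each data value plotted against its position in the table
plot(morley$Speed, xlab = "Index", ylab = "Speed - 299 000 (km/s)")

# Dashed horizontal line at the sample mean
abline(h = mean(morley$Speed), lty = 2)
```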

2.2.1 Histogram
One way to simplify univariate data is to produce a histogram. A histogram is
a diagram that uses rectangles to represent frequency, where the area of each
rectangle is proportional to the frequency in its bin. To produce a histogram one must
first choose the locations of the bins into which the data are to be divided, then one
simply counts the number of data points that fall within each bin. See Figure 2.1
(and R.box 2.1).
A histogram contains less information than the original data – we know how
many data points fell within a particular bin (e.g. the 700–800 bin in Figure 2.1),
but we have lost the information about which points and their exact values. What
we have lost in information we hope to gain in clarity; looking at the histogram it
is clear how the data are distributed, where the ‘central’ value is and how the data
points are spread around it.
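The binning step can be inspected directly, without drawing anything. The sketch below assumes bin edges at 600, 700, . . . , 1100 (our choice, matching the axis of Figure 2.1 but not necessarily its bins) and uses hist() with plot=FALSE to return the counts.

```r
# Bin edges chosen by hand: 600, 700, ..., 1100 (speeds with 299 000 km/s subtracted)
edges <- seq(600, 1100, by = 100)
h <- hist(morley$Speed, breaks = edges, plot = FALSE)

# Counts per bin; by default each bin includes its right-hand edge
data.frame(lower = head(h$breaks, -1), upper = tail(h$breaks, -1),
           count = h$counts)

# Every data point falls in exactly one bin, so the counts sum to the sample size
sum(h$counts)
```

The table of counts is all that the histogram retains: the individual values within each bin are no longer recoverable from it.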
R.Box 2.1
Histograms
The R command to produce and plot a histogram is hist(). The following shows
how to produce a basic histogram from Michelson’s data (see Appendix B,
section B.1):
hist(morley$Speed)

We can specify (roughly) how many histogram bins to use by using the breaks
argument, and we can also alter the colour of the histogram and the labels as follows:

# breaks: suggested number of bins; col: fill colour; main, xlab: title and x-axis label
hist(morley$Speed, breaks=25, col="darkgray",
     main="", xlab="speed - 299,000 (km/s)")

This hist() command is quite flexible. See the help pages for more information
(type ?hist).

2.2.2 Bar chart


The bar chart is a relative of the histogram. Frequencies are indicated by the
lengths of bars, which should be of equal width. Bar charts are used for discrete
or categorical data, and a histogram is used for continuous data; neighbouring
histogram bins touch each other, bar chart bars do not. For example, measurements
of the speed of light are (in principle) continuous since the measured value can
take any real number over some range, and so a histogram may be used. But if we
were to plot data from a poll of support for different political parties, we should
use a bar chart, since the data are categorical (different parties).
Figure 2.2 shows a bar chart for the data recorded by Rutherford and Geiger (see
Appendix B, section B.2). The data record the number of intervals during which
there were zero scintillations, one scintillation, two scintillations, up to 14 (there
were no intervals with 15 or more scintillations). The data are discrete – the number
of scintillations per interval, shown along the horizontal axis, must be an integer –
and so a bar chart is appropriate.

Figure 2.2 Bar chart showing the Rutherford and Geiger (1910) data of the frequency
of alpha particle decays. The data comprise recordings of scintillations in
7.5 s intervals, over 2608 intervals, and this plot shows the frequency distribution
of scintillations per interval. (Horizontal axis: rate, in counts/interval; vertical axis: frequency.)

R.Box 2.2
A simple bar chart
There are two simple ways to produce bar charts using R. Let’s illustrate this using the
Rutherford and Geiger data (see Appendix B, section B.2), with the number of
scintillations per interval stored in rate and the corresponding frequencies in freq:

plot(rate, freq, type="h")

plot(rate, freq, type="h", bty="n",
     xlab="Rate (counts/interval)",
     ylab="Frequency", lwd=5)

The first command produces a very basic plot using the type="h" argument. The second
command produces an improved plot with user-defined axis labels, thicker lines/bars and no
box enclosing the data area. An alternative is to use the specialised command
barplot().

barplot(freq, names.arg=rate, space=0.5,
        xlab="Rate (cts/interval)",
        ylab="Frequency")

Here the argument space=0.5 determines the sizes of the gaps between the bars, and
names.arg defines the labels for the x-axis. If the data were categorical, we could
produce a bar chart by setting the names.arg argument to the list of categories.
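As an illustration of the categorical case, here is a minimal sketch using invented poll results; the party names and numbers are made up and the variable names votes and share are hypothetical. When the data vector is named, barplot() uses the names as bar labels, so names.arg need not be given separately.

```r
# Hypothetical poll results (made-up numbers, for illustration only)
votes <- c(Red = 412, Green = 371, Blue = 145, Other = 72)

# Convert counts to percentages of all responses
share <- 100 * votes / sum(votes)

# Gaps between the bars emphasise that the categories are distinct
barplot(share, space = 0.5,
        xlab = "Party", ylab = "Support (%)")
```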

2.3 Centre of data: sample mean, median and mode


Probably the first conclusion we might draw from looking at Michelson’s data is
that the measured values lie close to 299 800 km s−1 . What we have just done
is make a numerical summary of the data – if we needed to communicate the
most important aspects of this dataset to a colleague in the smallest amount of
information, a sensible place to start would be with a summary like this, which
gives some idea of the ‘centre’ of the data. But instead of making a quick informal
guess of the centre we could calculate and quote the mean of the sample,
defined by

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2.1} \]

where xi (i = 1, 2, . . . , n) are the individual data points in the sample and n is the
size of the sample. If x are our data, then x̄ is the conventional symbol for the
sample mean. The sample mean is just the sum of all the data points, divided by
the number of data points. Strictly, this is the arithmetic mean. The mean of the
first 20 Michelson data values (with 299 000 km s−1 subtracted) is 909 km s−1:

\[ \bar{x} = \frac{1}{20} (850 + 740 + 900 + 1070 + 930 + 850 + \ldots + 960) = 909. \]

One way to view the mean is as the balancing point of the data stretched out
along a line. If we have n equal weights and place them along a line at locations
corresponding to each data point, the mean is the one location along the line where
the weights balance, as illustrated in Figure 2.3.

Figure 2.3 Illustration of the mean as the balance point of a set of weights. The
data are the first 20 of the Michelson data points. (Horizontal axis: speed, in km s−1.)

The mean is not the only way to characterise the centre of a sample. The sample
median is the middle point of the data. If the size of the sample, n, is odd, the
median is the middle value, i.e. the ((n + 1)/2)th largest value. If n is even, the
median is the mean of the middle two values (the (n/2)th and (n/2 + 1)th ordered
values). The median has the sometimes desirable property that it is not so easily
swayed by a few extreme points. A single outlying point in a dataset could have a
dramatic effect on the sample mean, but for moderately large n one outlier will have
little effect on the median. The median of the first 20 light speed measurements is
940 km s−1 , which is not so different from the mean – take a look at Figure 2.1 and
notice that the histogram is quite symmetrical about the mean.
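These values can be checked with a few lines of R, and the same sketch shows how differently the mean and median respond to an outlier. The outlying value of 10 000 is invented for the purpose; x and x_out are hypothetical variable names.

```r
x <- morley$Speed[1:20]   # first 20 values (km/s, with 299 000 subtracted)

mean(x)                   # sum of the values divided by n
sort(x)[10:11]            # the two middle values of the ordered sample (n = 20)
median(x)                 # mean of those two middle values

# Replace the largest value with a wild (hypothetical) outlier and compare
x_out <- x
x_out[which.max(x_out)] <- 10000
c(mean = mean(x_out), median = median(x_out))
```

The single outlier drags the mean upwards by several hundred km/s, while the median does not move at all.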
The last measure of the centre we shall discuss is the sample mode, which is
simply the value that occurs most frequently. If the variable is continuous, with no
repeating values, the peak of a histogram is taken to be the mode. Often there is
more than one mode; in the case of the 100 speed of light values, there are two
values that occur most frequently (810 and 880 km s−1 occur 10 times each). Once
we bin the Michelson data into a histogram it becomes clear that the distribution
has a single mode around 800–850 km s−1 (see Figure 2.1).

Figure 2.4 Illustration of the locations of the mean, median and mode for an
asymmetric distribution, p(x), where x is some random variable.

Now we have three measures of centrality, but the one that is used the most is
the mean, often just called the average. If we have some theoretical distribution of
data spread over some range, we may calculate the mean, median and mode using
methods discussed in Chapter 5.
Figure 2.4 illustrates how the three different measures differ for some theoretical
distribution. The mean is like the centre of gravity of the distribution (if we imagine
it to be a distribution of mass density along a line); the median is simply the 50%
point, i.e. the point that divides the curve into halves with equal areas (equal
mass) on each side; the mode is the peak of the distribution (the densest point).
If the distribution is symmetrical about some point, the mean and median will be
the same, and if it is symmetrical about a single peak then the mode will also
be the same, but in general the three measures differ.

R.Box 2.3
Mean, median and mode in R
We can use R to calculate means and medians quite easily using the appropriately
named mean() and median() commands. The variable morley$Speed contains
the 100 speed values of Michelson. To calculate the mean and median, and add on the
offset (299 000 km s−1 ), type
mean(morley$Speed) + 299000
median(morley$Speed) + 299000

The modal value is not quite as easy to calculate as the mean or median since there is
no built-in function for this. One simple way to find the mode is to view a histogram
of the data and select the value corresponding to the peak.
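For data with repeated values, another option is to tabulate the values and pick out the most frequent. The following is one possible sketch using table(); it is not a built-in ‘mode’ command, and the names counts and modes are our own.

```r
counts <- table(morley$Speed)      # how many times each distinct value occurs
max(counts)                        # the highest frequency

# All values that share the highest frequency (there may be more than one)
modes <- as.numeric(names(counts)[counts == max(counts)])
modes
```

Note that R does have a function called mode(), but it reports how an object is stored, not the most frequent value.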
Box 2.1
Different averages
Imagine a room containing 100 working adults randomly selected from the
population. Then Bill Gates walks into the room. What happens to the mean wealth of
the people in the room? What about the median or mode? These different measures of
‘centre’ react very differently to an extreme outlier (such as Bill Gates). What will
happen to the average height (mean, median and mode) of the people in the room if
the world’s tallest man walks in?
What is the average number of legs for an adult human? The mode and the median
are surely two, but the mean number of legs is slightly less than two!

2.4 Dispersion in data: variance and standard deviation


The sample mean is a very simple and useful single-number summary of a sample,
and it gives us an idea of the typical location of the data. If we required slightly more
information about the sample a good place to start would be with some measure of
the spread of the data around this central location: the dispersion around the mean.
We could start by calculating the mean of the deviations between each data value
and the sample mean. But this is useless as it always equals zero. Take another look
at the definition for the sample mean (equation 2.1) and notice how the sample
mean is the one value that ensures the (data – mean) deviations sum to zero (recall
the balance of Figure 2.3):
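\[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) = \frac{1}{n} \sum_{i=1}^{n} x_i - \frac{1}{n} \sum_{i=1}^{n} \bar{x} = \bar{x} - \frac{n}{n} \bar{x} = \bar{x} - \bar{x} = 0. \tag{2.2} \]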
The negative deviations exactly cancel the positive deviations.
If instead we square the deviations, then all the elements of the sum are positive
(or zero), so the average of the squared deviation seems like a more useful measure
of the spread in a sample. The sample variance is defined as
\[ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{2.3} \]
This is almost the mean of the squared deviations. But notice that we have divided
by n − 1 rather than n: the story behind this is sketched out in the box. Table 2.1
illustrates explicitly the steps involved in calculating the variance using the first 20
values from the Michelson dataset: first we compute the sample mean, then subtract
this from the data, and compute the sum of the squared data − mean deviations.
Of course, in real data analysis this calculation is always performed by computer.
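The steps just described can be followed one at a time in R. A minimal sketch, using the first 20 morley$Speed values; the variable names (xbar, dev, ss, s2) are our own.

```r
x <- morley$Speed[1:20]        # first 20 values (km/s, with 299 000 subtracted)
n <- length(x)

xbar <- mean(x)                # sample mean
dev  <- x - xbar               # deviations from the mean
sum(dev)                       # zero, up to rounding error
ss   <- sum(dev^2)             # sum of squared deviations
s2   <- ss / (n - 1)           # sample variance, equation (2.3)

c(mean = xbar, sum_sq = ss, variance = s2)
all.equal(s2, var(x))          # the built-in var() uses the same n - 1 definition
```

In practice one would simply call var(x); the longer version is useful only for seeing where the number comes from.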
Table 2.1 Illustration of the computation of variance using the first n = 20 data
values from Michelson’s speed of light data. Here xi are the data values, and the
sample mean is their sum divided by n: x̄ = 18 180/20 = 909 km s⁻¹. The
xi − x̄ are the deviations, which always sum to zero. The squared deviations are
positive (or zero) valued and sum to a non-negative number. The sum of squared
deviations divided by n − 1 gives the sample variance:
s² = 209 180/19 = 11 009.47 km² s⁻².

i                        1       2        3     4        5     ···   20     sum

Data xi (km s⁻¹)         850     740      900   1 070    930   ···   960    18 180
xi − x̄ (km s⁻¹)          −59     −169     −9    161      21    ···   51     0
(xi − x̄)² (km² s⁻²)      3481    28 561   81    25 921   441   ···   2601   209 180

The sample variance is always non-negative (i.e. either zero or positive), and
will not have the same units as the data. If the xi are in units of kg, the sample mean
will have the same units (kg) but the sample variance will be in units of kg². The
standard deviation is the positive square root of the sample variance, i.e. s = √(s²),
and has the same units as the data xi . Standard deviation is a measure of the typical
deviation of the data points from the sample mean. Sometimes this is called the
RMS: the root mean square (of the data after subtracting the mean).

Box 2.2
Why 1/(n − 1) in the sample variance?
The sample variance is normalised by a factor 1/(n − 1), where a factor 1/n might
seem more natural if we want the mean of the squared deviations. As discussed above,
the sum of the deviations (x − x̄) is always zero. If we have the sample mean the last
deviation can be found once we know the other n − 1 deviations, and so when we
average the square deviation we divide by the number of independent elements, i.e.
n − 1. This is known as Bessel’s correction.
Using 1/(n − 1) makes the resulting estimate unbiased. Bias is the difference
between the average value of a statistic and the true value that it is supposed to
estimate, and an unbiased statistic gives the right result on average, i.e. when
averaged over many repeated samples of the same size. For more details of the bias
in the variance, see section 5.2.2 of
Barlow (1989), or any good book on mathematical statistics.
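
The effect of the 1/(n − 1) factor can be illustrated with a small simulation. This is only a sketch under assumed settings: normally distributed data with a true variance of 4, samples of size n = 5, and an arbitrary random seed. None of these choices come from the text.

```r
set.seed(1)         # arbitrary seed, only for reproducibility
n <- 5              # small samples make the bias easy to see
true.var <- 4       # assumed population variance (sd = 2)
nsim <- 20000       # number of simulated samples

# var() divides by n - 1; rescaling gives the version that divides by n
v.unbiased <- replicate(nsim, var(rnorm(n, mean = 0, sd = sqrt(true.var))))
v.biased <- v.unbiased * (n - 1) / n

mean(v.unbiased)    # close to true.var
mean(v.biased)      # close to true.var * (n - 1) / n, i.e. too small
```

Averaged over many samples, the 1/n version falls short of the true variance by roughly the factor (n − 1)/n, while the 1/(n − 1) version does not.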

The variance, or standard deviation, gives us a measure of the spread of the data
in the sample. If we had two samples, one with s² = 1.0 and one with s² = 1.7,
we would know that the typical deviation (from the mean) is about 30% larger
in the second sample (recall that s = √(s²), and √1.7 ≈ 1.30).

R.Box 2.4
Variance and standard deviation
R has functions to calculate variances and standard deviations. For example, in order
to calculate the mean, variance and standard deviation of the numbers 1, 2, . . . , 50:

x <- 1:50
mean(x)
var(x)
sd(x)

Likewise to calculate the variance of the entire Michelson sample

Speed <- morley$Speed
var(Speed)

The first line defines a new array in order to save us having to use the prefix
morley$. . . every time we wish to access these data.

R.Box 2.5
Calculating with subarrays
If we want to calculate the variance for each of Michelson’s five ‘experiments’ (each
one is a block of 20 consecutive values) individually, we could use

mask <- morley$Expt == 2
mask
Speed[mask]
var(Speed[mask])

Note the use of the double equals sign (==) in testing for equality. The first line forms
an array mask, the same size as the Speed array, with values that are TRUE where the
condition is met (i.e. Expt == 2), and FALSE elsewhere. The second line prints mask.
The third line forms a subarray from Speed by taking only those elements that occur
where mask is TRUE. The fourth line shows how to compute the variance of this subset
of the original data. We can repeat this process using a loop as follows:
for (i in 1:5) {
    # variance of the 20 values belonging to experiment i
    print(var(Speed[morley$Expt == i]))
}

This looks quite complicated, so let’s unpack it. The first part, for (i in 1:5)
{...}, defines a loop. The second part (inside the curly brackets) defines what is to
happen each time around the loop. The loop runs once for each of i = 1, 2, 3, 4, 5,
and each time round it prints the variance of the sample of data with the corresponding
experiment number i. The following may help illustrate the way loops are written
in R:

for (i in 1:10) { print(i) }

2.5 Min, max, quantiles and the five-number summary


A simple two-point indicator of the spread of a data sample is the pair (minimum,
maximum). Other measures of a sample commonly used in descriptive statistics
are quantiles. The α quantile is simply the data point below which a fraction α of
the data occur. The 0.25 quantile is then simply the value for which 25% of the
data points are lower. The 0.5 quantile is the median. Some quantiles have special
names, for example the 0.25, 0.5 and 0.75 quantiles are called the first, second
and third quartiles, respectively. The median is the second quartile. The difference
between the 0.75 and 0.25 quantiles is called the interquartile range (IQR). (Note
that the first and third quartiles can be obtained by splitting the data about the
median, and then finding the medians of the lower and upper halves.)
John Tukey (see Tukey, 1977) suggested a simple and compact five-number
summary of a univariate dataset, now known as the Tukey five-number summary.
This comprises the minimum, first quartile, median (second quartile), third quartile
and maximum values of a sample. From these five numbers, one can get a reasonable
impression of the way the data are distributed: the centre of the sample (median),
the way the central 50% of the data are spread around the median (IQR) and the
most extreme (lowest, highest) values in the sample.

R.Box 2.6
Tukey’s five-number summary
There are two functions in R to calculate variations on Tukey’s five-number summary.
The first is

fivenum(0:100)
fivenum(Speed)

Here the reported values for the first, second (median) and third quartiles are given as
the closest actual data values. There is a variation on this:

summary(0:100)
summary(Speed)

The two methods differ slightly in how the quartiles are calculated. Note that the
summary() command calculates the mean for free.
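
Individual quantiles and the interquartile range can also be requested directly. The following sketch uses the base R functions quantile() and IQR(); the object name q is illustrative only.

```r
Speed <- morley$Speed

# The 0.25, 0.5 and 0.75 quantiles (first, second and third quartiles)
q <- quantile(Speed, probs = c(0.25, 0.5, 0.75))
q

IQR(Speed)                 # interquartile range
unname(q[3] - q[1])        # the same thing from the quartiles

fivenum(Speed)             # compare the middle three values with q
```

By default quantile() interpolates between data values, so its quartiles need not coincide exactly with those reported by fivenum().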

2.6 Error bars, standard errors and precision


From the above, we now have some numerical and graphical ways to summarise
data, and in particular its centre and spread. However, we still have not made any
attempt to quantify how precise these summaries might be. There are 100 values
in the Michelson dataset, divided into five experiments, each of 20 measurements.
For each of the experiments, we can calculate a mean and variance for the
20 measurements. From these, we may calculate the standard error on the sample
mean. Here it is:

\[
\mathrm{SE}_{\bar{x}} = \sqrt{\frac{s_x^2}{n}} \tag{2.4}
\]
which is just the square root of the ratio of the sample variance, s_x² (equation 2.3),
to the size of the sample, n. We shall not be concerned with where this formula comes
from until later chapters. For now, we consider it a useful, simple, approximate
formula for the uncertainty on the sample mean, x̄.
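
As a worked example, taking the variance quoted in Table 2.1 for the first n = 20 Michelson values (the rounding in the last step is ours):

```latex
\[
\mathrm{SE}_{\bar{x}}
  = \sqrt{\frac{s_x^2}{n}}
  = \sqrt{\frac{11\,009.47\ \mathrm{km^2\,s^{-2}}}{20}}
  \approx \sqrt{550.47\ \mathrm{km^2\,s^{-2}}}
  \approx 23.5\ \mathrm{km\,s^{-1}}
\]
```

So the mean of 909 km s⁻¹ for this experiment carries an uncertainty of roughly 23 km s⁻¹.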
What is the meaning of the standard error? Imagine repeating an experiment n
times and, to get the ‘best’ result, taking the sample mean of the measurements,
x̄1 . We could repeat the whole set of n experiments and calculate another sample
mean, x̄2 , and so on. If we do this many times, we have a sample of mean values,
x̄j , each of which is an independent estimate of the population (‘true’) mean, μ.
The standard error is an estimate of the standard deviation of the sample means
from the expected (population) mean value. In other words, we expect the sample
means to be about one standard error from the population mean. Thus the standard
error gives us an idea of the precision of the sample mean. You can see that as
n increases, the standard error decreases; one would expect the precision of the
mean to improve as more data are acquired. In statistics the word precision (see
section 1.6) is sometimes used for the reciprocal of the variance of the data. The
precision of the mean x̄ is 1/SE_x̄².
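
The thought experiment above (repeating the whole set of n measurements many times) can be imitated by simulation. This is a sketch under assumed conditions: normally distributed measurements with made-up values mu = 900 and sigma = 100, and an arbitrary seed. These are not Michelson's data.

```r
set.seed(2)         # arbitrary seed, only for reproducibility
n <- 20             # measurements per simulated experiment
mu <- 900           # assumed population mean
sigma <- 100        # assumed population standard deviation
nrep <- 5000        # number of repeated experiments

# One sample mean per simulated experiment
xbar.rep <- replicate(nrep, mean(rnorm(n, mean = mu, sd = sigma)))

sd(xbar.rep)        # observed scatter of the sample means
sigma / sqrt(n)     # scatter predicted from the population values

# In practice we have one sample only, and estimate the same quantity from it
one.sample <- rnorm(n, mean = mu, sd = sigma)
sqrt(var(one.sample) / n)
```

The first two numbers should agree closely; the third is the standard error of equation 2.4 computed from a single sample, and will scatter around the same value.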
Let’s look at the sample means and standard errors for the Michelson data divided
into five ‘experiments’. Figure 2.5 shows the sample means and their standard
errors. The standard errors are illustrated by error bars, which run from x̄ − SEx̄
to x̄ + SEx̄ . This figure summarises each of the five experiments in terms of two
numbers each, the mean and a measure of its precision, and the five experiments
can easily be compared with each other and the modern, accepted value.

[Figure 2.5: sample means of Speed − 299 000 (km s⁻¹) plotted against Experiment (1 to 5), with error bars.]

Figure 2.5 The sample means for each of the five ‘experiments’ of Michelson,
each comprising 20 measurements. The standard errors for each mean are indicated
by the error bars. Notice the sidebars at the end of each error bar. These help define
the ends of each error bar, but may clutter the graphic when there are a lot of data to
present. The dotted line shows the modern value for the speed of light in air. From
this graphic, one can start to make inferences about Michelson’s measurements.

R.Box 2.7
Standard errors in R
There is no single command to compute the standard error in R, but one may make
use of the var() function to make the calculation simple. For example, to compute
the mean, variance and standard error of the Michelson data:

x <- morley$Speed
mean(x)
var(x)
sqrt( var(x) / length(x) )

where the length(x) function returns the number of data points.
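
Since base R has no standard error command, it can be convenient to wrap the calculation in a small function. The name std.err below is hypothetical (it is not a built-in R function), and dropping missing values first is our own design choice.

```r
# Hypothetical helper: standard error of the mean (equation 2.4)
std.err <- function(x) {
  x <- x[!is.na(x)]            # ignore missing values, if any
  sqrt(var(x) / length(x))
}

std.err(morley$Speed)          # all 100 measurements
std.err(morley$Speed[1:20])    # first experiment only
```

Once defined, the function can be used wherever mean() or var() would be.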

R.Box 2.8
Standard errors by group, part 1
It is possible to calculate a statistic (e.g. mean or variance) for each of the five
experiments in an efficient manner by first re-organising the data into a matrix. Once
this is done we can make use of some powerful matrix tools in R. In the following
example, the speed data are converted to a matrix with 20 rows (and therefore five
columns, one for each ‘experiment’) called speed.
speed <- matrix(morley$Speed, nrow=20)
speed
      [,1] [,2] [,3] [,4] [,5]
 [1,]  850  960  880  890  890
 [2,]  740  940  880  810  840
 [3,]  900  960  880  810  780
 [4,] 1070  940  860  820  810
 [5,]  930  880  720  800  760
 [6,]  850  800  720  770  810
  ...  ...  ...  ...  ...  ...

It is important to check that the matrix is arranged in the right way. Here we see all the
data from the first experiment in the first column; compare with the output of

morley$Speed[morley$Expt == 1]

R.Box 2.9
Standard errors by group, part 2
With the Michelson data arranged in a matrix, we can use the apply() command to
apply any function, e.g. mean() or var(), to every row or column of the matrix. For
example, to calculate the mean and variance of the data in each column, and then store
the results in new data objects, we can use
speed.mean <- apply(speed, 2, mean)
speed.var <- apply(speed, 2, var)
speed.var

The command apply(speed, 2, var) takes the matrix called speed and applies
the function var() to each of its columns to calculate the variance. You could also
use mean, sd, sum, or any other suitable R function. The second argument (i.e. 2)
specifies that columns should be analysed. If instead we used 1, we would get the
variance over each row. This approach, applying the same function over rows or
columns of an array, is usually faster (on large datasets) and more elegant than using
loops.

R.Box 2.10
Standard errors by group, part 3
Finally, the standard errors for the five ‘experiments’ are just the square roots of these
variances divided by the number of data points in each experiment. We find the
number of data points in each column using the command apply() to apply the
length() function (we know the answer is 20).

speed.n <- apply(speed, 2, length)
se <- sqrt(speed.var / speed.n)
se
data.frame(speed.mean, speed.var, speed.n, se)

Remember that R is case sensitive, so se is not the same object as SE. The last line
uses the four new vectors (of the means, variances, lengths and standard errors) as
columns of a new object, a data frame (similar to a matrix but the columns may be
formed from different types of data).
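
One way to confirm that the matrix was arranged correctly is to repeat the calculation without reshaping the data at all, grouping directly on the Expt column. The sketch below assumes a small helper std.err (a hypothetical name, not a built-in function) and compares the two routes.

```r
# Hypothetical helper: standard error of the mean (equation 2.4)
std.err <- function(x) sqrt(var(x) / length(x))

# Route 1: group on the experiment number, no reshaping needed
se.grouped <- tapply(morley$Speed, morley$Expt, std.err)

# Route 2: the matrix approach of the boxes above
speed <- matrix(morley$Speed, nrow = 20)
se.matrix <- sqrt(apply(speed, 2, var) / apply(speed, 2, length))

all.equal(as.numeric(se.grouped), se.matrix)
```

If the two disagree, the likely cause is that the data were not stored in blocks of 20 consecutive values per experiment, in which case the matrix columns would mix experiments.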

R.Box 2.11
Plotting error bars
There are several ways to add error bars to a graphic in R. One way is using the
segments() command to draw a series of line segments between x − error and
x + error. If we have sample means with standard errors (as in the previous box), we
may plot them as follows:
Expt <- 1:length(speed.mean)
plot(Expt, speed.mean, ylim=c(780,950), pch=16,
bty="l", xlab="Experiment",
ylab="Speed - 299,000 (km/s)")
segments(Expt, speed.mean-se, Expt, speed.mean+se)

where the plot() command (which spans the second to fourth lines) plots the data and
the segments() command (last line) adds the error bars. The segments command takes
as its input segments(x0,y0,x1,y1) and draws lines between coordinates (x0,y0) and
(x1,y1). A variation on this is to use the arrows command to give each error bar a
sidebar (as in Figure 2.5):

arrows(Expt, speed.mean-se, Expt, speed.mean+se,
       code=3, angle=90, length=0.1)

where the first four arguments give the coordinates of the endpoints (as for the
segments() command), and the last three define two-sided arrows (code=3 means
draw an arrow head at both ends of the arrow) with flat arrow heads (angle=90) of
a given extent (length=0.1).
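
For convenience, the pieces from the last few boxes can be gathered into one self-contained script. This is a sketch rather than the code used to make Figure 2.5: the axis range is taken from the data instead of being fixed by hand, the dotted reference line is omitted, and pdf(NULL) is used so that the script also runs where no screen is available.

```r
# Per-experiment means and standard errors (20 rows = one column per experiment)
speed <- matrix(morley$Speed, nrow = 20)
speed.mean <- apply(speed, 2, mean)
se <- sqrt(apply(speed, 2, var) / nrow(speed))
Expt <- seq_along(speed.mean)

pdf(NULL)   # null graphics device; remove this line and dev.off() to draw on screen
plot(Expt, speed.mean, pch = 16, bty = "l",
     ylim = range(speed.mean - se, speed.mean + se),
     xlab = "Experiment", ylab = "Speed - 299,000 (km/s)")
# Error bars with flat sidebars at both ends
arrows(Expt, speed.mean - se, Expt, speed.mean + se,
       code = 3, angle = 90, length = 0.1)
dev.off()
```

Setting ylim from the error bar endpoints ensures that no bar is clipped by the plot region.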

It is common in physical science to expect error bars accompanying data whenever
appropriate; they immediately allow the viewer to gauge the precision of the
estimate or measurement. What use is an estimate without any measure of how
reliable it is?

2.7 Plots of bivariate data


2.7.1 Scatter plot
So far we have considered only data that are records of the values of a single variable,
such as Michelson’s speed of light measurements. However, a great deal of
data analysis concerns data with more than one variable, often one or more response
variables, observed or measured at different values of one or more explanatory
variables.

R.Box 2.12
Scatter plots in R
The R command plot() will produce a basic scatter plot from two (equal length)
arrays of numbers. The Hipparcos data shown in Figure 2.6 are described in
Appendix B (section B.4). Using the reduced data file hip_clean.txt we can
produce a simple plot

hip <- read.table("hip_clean.txt", header=TRUE)

This creates a data array called hip that contains the contents of the file: 14 columns
and 5740 rows of data. A simple scatter plot may be produced using

plot(hip$BV, hip$V)

However, with a little more effort we can do much better than this.

The simplest way to visualise data with two continuous variables is a scatter plot,
where each data point (pair of numbers) is treated as a coordinate and is marked
with a symbol on the x–y plane. Scatter diagrams are used to reveal relationships
between pairs of variables, and are among the most widely used diagrams in all
of science. They can be enormously powerful; indeed, some of the most important
diagrams and relations in science were discovered by examination of scatter plots.
Figure 2.6 shows one such example from astronomy. This is a Hertzsprung–
Russell diagram (sometimes known as a colour–magnitude diagram) and shows the
luminosity against colour index for a sample of nearby stars. Each point represents
a star, the horizontal position of the points represents the B − V colour index
(a simple measure of the colour of the star, which depends on its temperature),
and the vertical position represents the absolute magnitude (an upside-down and
logarithmic measure of the luminosity). When these two variables are used to
construct a scatter diagram for a sample of stars, it is clear there is a great deal of
structure in the data, patterns that would not be at all obvious by examination of a
table of numbers, or of graphical examination of either variable separately.

R.Box 2.13
Basic scatter plot design
The following command shows how to produce a better scatter plot:

plot(hip$BV, hip$V.abs, pch=1, cex=0.5, bty="n",
     ylim=c(16, -3), xlim=c(-0.3, 2.0),
     ylab="V.abs (mag)", xlab="B-V (mag)")
[Figure 2.6: scatter plot of V (mag) on the vertical axis against B − V (mag) on the horizontal axis.]

Figure 2.6 Example of a scatter plot showing data on 5740 stars using data from
the Hipparcos astronomy satellite. Plotted is the V -band (green) absolute (distance
corrected) magnitude against the B − V colour index (difference between B and
V -band magnitudes, a blue–green colour). Each point represents a star: brighter
(smaller magnitude) stars are at the top, bluer stars are on the left. The plot clearly
reveals structure in the data: most stars fall in the band from top left to bottom
right, with a small island in the top right. This type of diagram is of fundamental
importance in stellar astrophysics. For comparison we also show the histograms
of each of the two variables (V and B − V ) separately. The structure in the data
is only apparent when looking at the two variables together using a scatter plot.

Here we have plotted V_abs, the absolute magnitude stored in the V.abs column (not
the apparent magnitude in the V column), against B − V . The pch=1 argument
selects a plot symbol (1 is a hollow circle); cex=0.5 makes the symbols smaller than
default. A small, hollow symbol was chosen here to reduce the clutter from the large
number of points to be plotted.
The option ylim=c(16, -3) sets the range of the vertical axis to run from 16 at
the bottom to −3 at the top. The xlim argument is used to control the horizontal axis
span. The arguments xlab and ylab are for setting the axis labels, and finally
bty="n" defines what type of box to enclose the plot in ("n" means no box).
For more information on the arguments that can be changed within the plot()
command, try ?plot and ?par.
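
The reversed vertical axis is the least familiar part of this command, so here is a small sketch that isolates it. The data are synthetic (random numbers invented for this example, not the Hipparcos measurements), and pdf(NULL) is used only so that the script runs without a screen.

```r
set.seed(3)                                  # arbitrary seed
colour <- runif(200, min = -0.3, max = 2.0)  # made-up 'colour index' values
mag <- 1 + 6 * colour + rnorm(200, sd = 1)   # made-up 'magnitude' values

pdf(NULL)   # null graphics device; remove this line and dev.off() to draw on screen
plot(colour, mag, pch = 1, cex = 0.5, bty = "n",
     ylim = rev(range(mag)),                 # larger value first: axis runs downwards
     xlab = "colour (synthetic)", ylab = "magnitude (synthetic)")
usr <- par("usr")                            # plot region limits: x1, x2, y1, y2
dev.off()

usr[3] > usr[4]                              # bottom of the axis is the larger value
```

Giving ylim with the larger value first, as in ylim=c(16, -3) above, is all that is needed to flip an axis; rev(range(...)) does the same without hard-coding the limits.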

How does one decide which observable to plot on the horizontal axis, and
which on the vertical axis? In an experiment one usually studies the response of
some variable(s) to changes in experimenter-controlled explanatory variables, in
which case the explanatory variable is plotted along the horizontal axis and the
response variable plotted along the vertical axis. However, it is often the case that
neither variable is obviously an explanatory variable. For example, if we recorded

And the stout seaman, paying no heed to the immense ice fields or to the far from
reassuring appearance of the monsters, strode boldly forward.

His attention was drawn to a large male that lay peacefully asleep; he crept softly
up to it and struck the animal, so unexpectedly attacked, squarely between the eyes.

Severely wounded, the walrus took flight. But James did not let his prey go: he
overtook the animal within a few moments and drove the lance in under its shoulder
blade. At that moment Ford came up too and, with a deft thrust of the lance, cut
through the neck of the monster, which struggled mightily.

“Hurrah!” cried James. “But let us make haste, for these animals will eventually
begin to understand what we intend to do with them.”

Emboldened by this success, the mate went into the midst of the herd and attacked
an old walrus of extraordinary size.

But the animal parried the attack with its tusks and, letting out a furious howl,
threw itself in turn upon its assailant. Then something strange happened. Suddenly
the whole herd, as if at a given signal, surrounded our travellers. James, who had
just escaped one monster that was charging him, found himself in trouble with
another, and very soon the two travellers saw themselves surrounded by a multitude
of round and hideous heads, defending themselves with their long white tusks.

James did not let the danger bewilder him. Defending himself with his long lance,
he retreated slowly towards the high part of the ice field, where he would easily be
able to reach safety. Hard pressed, he dealt desperate blows among the enraged
animals. Sturdy as his lance was, he could not keep the walruses at bay with it. Just
at the moment when the brave mate was in danger of falling beneath the tusks of one
of the monsters, he felt himself stagger. For a moment he waved his arms, trying in
this way to keep his balance; then he plunged into a yawning crevasse.

Now that James had vanished from the battlefield, the captain's position became
more perilous still. In vain the brave seaman swung his lance. His arm grew tired,
and the wounds he inflicted served only to increase the fury of his attackers.

Several times the captain tried to force his way through the dense ranks of his
attackers; but at once he was obliged to fall back before their terrible tusks.

After a lance thrust that missed, our seaman immediately disappeared amid the
black brutes that surrounded him.

But at that moment four revolver shots rang out in quick succession; the walrus
whose tusks had felled the captain raised itself for a moment with all its remaining
strength and then fell dead: it had taken the bullets in its ear. Gromski turned this
brief respite to account: having fired his last bullet, which struck between the round
eyes of the nearest animal, he seized the captain in his powerful arms and dragged
him out from under the motionless body of the monster.

“Run!” he cried, pointing him towards the top of the ice field, which rose like a
pyramid.

Soon the two travellers were at the top of this pyramid, where no danger threatened
them any longer.

This precaution proved to have been unnecessary; for the walruses, frightened by
the revolver shots, gave up the fight and returned in all haste to their element, the
sea.

“But James was left behind over there, in that crevasse,” cried Ford, letting himself
slide down from the pyramid. “We must save him!”
The captain's anxiety about the mate seemed justified. The two companions had seen
him fall into the crevasse between the icebergs, which was also where the walruses
had fled.

These animals, rather slow in their movements on the ice, move nimbly in the water
and even fight there with success against the polar bear. If they reached the spot
where the mate lay, he would not be able to escape death.

Ford stopped at the edge of the crevasse and looked into the murky water. But
nowhere among the walruses did he see James.

“Lost!”

This cry of despair, however, was followed at once by a hurrah! that rang out from
the other side of the crevasse. Looking in that direction, Ford saw to his great joy
the mate who, though soaked to the skin, was laughing merrily. When James had
fallen into the water he had not lost his presence of mind. Being a good swimmer,
he reached within a few moments a block of ice, on which he settled himself at his
ease. When he heard the revolver shots and saw the walruses take flight, he soon
understood that the captain too was out of danger.

The engineer now ran to the crevasse and held out his stick to the old seaman, with
the help of which the latter was able to reach the ice field.

“A thousand devils! I did not think those brutes would be so vicious,” he said,
sitting down beside Ford. “You have, I hope, come away from their tusks safe and
sound, captain?”

Alas! the brave seaman could not give an affirmative answer to the mate's anxious
question; for he bore on his leg and in his side the marks of the tusks of the walrus
that the engineer had killed just in time.

“Who would have thought it? The walruses look so harmless!” remarked James,
after he had examined the captain's wounds.

“You relied on that too much,” answered Gromski. “Perhaps they had had dealings
with enemies before; for they guessed our intentions at once.”

The captain's wounds were not very serious, however. James bandaged them in
haste, and they returned to the place where the balloon lay.

Only the next day did Gromski and James, having built a stove out of pieces of rock
and set the two kettles on it, go to the ice field to skin the dead sea cows and
walruses.

This troublesome work cost them much time. On the 28th of February the mate began
to render the oil, using as fuel the poorest pieces of the blubber and the moss he had
gathered in the rock clefts.

The old seaman had kept the promise he had made to Gromski; for eight days after
the dangerous hunt he had at his disposal three sea-cow hides filled with oil, as well
as a large heap of moss.

During this time the engineer busied himself with making the necessary
preparations. He sawed the bamboo poles of which the car was made in half, which
made it much lighter, and collected a certain quantity of fresh water.

By the 5th of March everything needed to produce steam was ready. The engineer
was waiting only for the moment when the barometer would fall, in order then to
light a great fire in the stove.
SIXTEENTH CHAPTER.
Carried away by the storm.
The sky remained unchangingly clear, as if it wished to mock our travellers. Soon,
around the 15th of March, the first frost announced that winter was approaching.
When the engineer consulted the thermometer, which had fallen to 2 degrees below
zero, he was seized with fear; for at a low temperature filling the balloon would be
very difficult.

The captain noted the barometer reading several times a day, so as to be able to
announce to Gromski the first autumn storm, which was to carry the balloon away
from the ice desert.

At last, after ten days of waiting, the weather began to change. During the night of
the 19th the barometer fell no less than 15 millimetres in six hours. This sudden
increase in humidity pointed to great changes in the atmosphere. The captain
reported this to Gromski, who was keeping watch by the balloon.

With James's help the engineer put the boiler in order and soaked the moss with oil
so that, thanks to this measure, the fire could be lit as soon as it was needed.

When on the morning of the 24th the barometer showed 745 millimetres, they began,
without waiting any longer, to fill the balloon.

“If only it does not turn into a cyclone!” said the mate anxiously, looking at the
sky, across which dark grey clouds were drifting. “With a cyclone we would not get
far.”

The moss soaked in oil was, as soon became clear, a good fuel; it caught quickly and
gave a strong, high flame. The engineer hoped to fill the balloon in six hours. He
did not spare the oil, for the aim was to obtain the necessary quantity of steam as
quickly as possible.

Towards noon the sky began to cover itself with yellowish clouds, as with a
half-transparent veil. This, according to the mate, was the harbinger of a hurricane.

The engineer rubbed his hands with joy when he saw the clouds piling up on the
horizon.

“Let all the winds break loose!” he said, pouring some more oil on the fire.

Ford kept his eyes fixed on the barometer, which was still falling steadily. There
could be no doubt that a storm was on its way.

“Make haste!” said the captain, as he laid the stock of smoked geese and fresh eggs
in the car. “Within three hours the storm will rise; I fear it will be a blizzard, which
would be fatal for us.”

[Illustration caption: Gromski took advantage of this moment. Page 225.]

The stout seaman had guessed right. At about 4 o'clock in the afternoon the first
gusts of wind made themselves felt, rocking the balloon, which was now
three-quarters full, to and fro. Gromski raised his head and saw the sky taking on
leaden hues. Against this sombre background a few small white clouds drifted slowly
along.

“I have often seen little clouds like those before storms that lasted several days on
end,” said James, who had climbed onto a rock.

Ford made haste, for he clearly foresaw the danger that would threaten the balloon
if they were obliged to ascend above the little valley during the storm.

“There is no danger,” said the engineer. “Our balloon will rise so quickly that we
shall not even notice the rocks around the car.”

During this time the hurricane grew more violent with every moment. Its shrill
whistling became louder than the sound of the steam escaping from the boiler. The
balloon gradually resumed its former shape of a gigantic cigar. At six o'clock, when
the rest of the moss had been carried to the car, the engineer asked his companions
to take their places.

“Hold on tight!” he called to them, throwing some more ballast out of the car.

Hardly had he spoken these words when the balloon began to move and in an instant
rose above the rocks that surrounded it. The jolt this caused threw the mate, who had
not been prepared for it, down onto the damp moss. The captain, who had grasped
the rim, saw the valley suddenly vanish. Half a minute later the balloon was already
floating over the ice fields that bordered the coasts.

“We are heading north-east,” cried Ford. “Just look at the ice!”

Indeed the balloon was flying over the vast ice field that they had crossed on foot
with so much effort a few weeks before. Soon the long chain of mountains
disappeared as if in a mist.
The captain made out in the midst of it the fatal spot that had almost become his
grave. But the ice fields soon disappeared from view. The hurricane carried the
balloon over them at a speed that Gromski estimated at 180 kilometres.

“The deuce!” cried the captain. “How splendid it is that we now cover in a quarter
of an hour the distance that once took us a month!”

The memories of the hardships of the journey, of the superhuman exertion and of
the terrible moments spent amid the icebergs now seemed laughable to the
travellers, compared with the astonishing speed of the balloon, which now covered
in one minute the same distance that had formerly taken them a whole day.

Het oppervlak van den oceaan bedekte zich al spoedig met dreigende
wolken. Een halfuur na de opstijging kwam de luchtballon geheel in de
zwarte wolken, en dat wel op het oogenblik, waarop Ford de zee hoopte te
zien, waarvoor hij weinig tijds geleden had moeten zwichten, zonder de
pool te kunnen bereiken, waarvan hij nauwelijks 15 kilometers verwijderd
was.

Nu de ballon eenmaal in deze verdunde en van vochtigheid verzadigde lucht


gekomen was, steeg hij langzaam; gelukkig was de temperatuur te midden
der wolken hoog genoeg, anders zouden onze reizigers met een plotselingen
val bedreigd zijn. De ingenieur stak het waterstofgas onder den stoomketel
aan om het opstijgingsvermogen te doen toenemen. Hij slaagde daarin
volkomen. Om 7 uur wees de barometer eene hoogte van 2400 meters aan.
De wolken eindigden nauwelijks 500 meters hooger.

„Wie weet?” zeide James met een zucht. „Misschien hebben wij wel boven
de pool zelf gezweefd.”

“I doubt it,” replied Gromski. “The wind is carrying us north-east and not south.”

“Does that mean we are not returning straight to America?”

“That does not matter, so long as we reach some mainland. For my part I would have no objection even if we came down in Africa, even among the Hottentots.”

“Nor would I,” muttered the sailor. “But I have a foreboding that we shall end up coming down in the ocean after all.”

Our travellers were far too aware of their perilous situation to have any illusions on this point. They knew only too well that their journey was merely a desperate attempt. Nevertheless, none of them lost his composure or his courage.

Gromski did not count much on the balloon, which was now deprived of hydrogen; but he wished nonetheless to stay in the air as long as possible.

Fortunately the balloon, thanks to its relatively small surface area, lost its heat only slowly. At midnight the vapour it contained had not yet passed into the liquid state.

Only at four o'clock in the morning did the first drops begin to fall from the interior of the balloon.

When Gromski noticed this, he lit the furnace and fed a considerable quantity of warm vapour into the balloon.

From that moment on, the fire under the boiler had to be kept going continuously; the helmsman carefully caught, in the hide of a sea cow, the water into which the vapour condensed. Otherwise the supply of water, which was used as ballast, would have been exhausted within a few hours.

Our travellers saw with anxiety that the contents of the small balloon holding the hydrogen were diminishing rapidly. Fifteen hours after their departure the engineer had already used half of this precious fuel.

“I should like to know at what speed we are travelling,” said the captain. “Were it not for those black clouds, I could never have believed that a storm was carrying us along.”

Indeed, no change could be perceived in the surroundings of the balloon. The atmosphere seemed perfectly calm. This storm was not even accompanied by lightning and thunder. Through the dense mass of clouds drifting beneath our travellers' feet, nothing could be distinguished. The engineer thought that it must be raining and snowing in the lower layers of air; consequently they had to stay above them as far as possible.

But the engineer could not keep his balloon at the desired altitude for long, for the temperature fell considerably after a few hours. The vapour condensed more and more, so that the boiler had to remain in operation without pause. On 21 March, at 10 o'clock in the morning, the hydrogen was completely used up. At once the altitude shown by the barometer began to fall rapidly. The danger grew more and more threatening.

Half an hour later the helmsman tugged Gromski by the sleeve and pointed out several white flakes falling on his clothes.

“Snow,” he muttered.

So what the engineer had so greatly feared had happened: the balloon was in the lower layers of air, where a snowstorm was raging.

“James, put the rest of the moss in the furnace,” he said, “otherwise we shall fall.”

“We shall not be able to prevent that anyway,” replied the sailor, as he carried out the order.

The small quantity of moss soon disappeared; the helmsman, having thrown the last of it on the dying fire, shrugged his shoulders.

“Now the covering of the rim of the car will have to go,” he said.

And as he received no answer, he took a knife and began to cut away the silk that surrounded the rim. The engineer did not stop him.

“Let it burn! What does it matter?”

But the hard bamboo would not burn. So James put his knife back in his pocket and, with his hands behind his back, began to whistle a sailors' song, whose cheerful melody made a strange contrast with the ominous roar of the ocean. The captain anxiously watched the snowflakes, which were forming an ever thicker layer on the balloon.

“Well, comrades, we had better take leave of one another!” he said suddenly, in a voice trembling with emotion. “Soon the waves will close our mouths for ever, and I do not want to leave this earth without having asked your forgiveness, sir. It is my obstinacy that has plunged us all into ruin. I know it … for … if I …”

But Gromski interrupted him.

“You had to … it was your duty to act as you did,” said Gromski with tears in his eyes, as he shook the sturdy seaman's hand. “It is a small sacrifice on the altar of science. All I wish is that it may not remain fruitless.”

“It will not!” replied Ford, producing the tin box in which his journal was enclosed. “Sooner or later someone will find this! You may rest assured of that!”

“Yes,” said James, blowing his nose. “It is only a pity that we cannot tell it ourselves … Oh, heavens!”

In this last exclamation lay everything the old sailor could not put into words: bitterness, regret, disappointment, despair and, finally, resignation.

This sad scene did not last long, however. Our travellers mastered their emotions and awaited the impending death with resignation.

The dull roar of the ocean, whipped on by the storm, grew clearer and clearer. Gromski, barometer in hand, calculated the altitude at which they found themselves.

“Eight hundred metres, seven hundred …”

“The ocean!” cried James, leaning over the rim of the car.

And at the same moment the balloon emerged from the dense cloud that had surrounded it until then. Through the snowflakes whirling in the air, amid the roar of the hurricane, one could easily see the dark surface of the ocean and its millions of waves, following one another without interruption. Against the dark ocean stood out the numerous white silhouettes of the icebergs carried along by the waves. This scene grew more distinct with every moment, for the balloon, already covered with a layer of snow several centimetres thick, was descending at tremendous speed.

“Now the end has come!” muttered James.

The car was already touching the foam of the waves. The balloon skimmed over the turbulent surface of the ocean. Our travellers felt a violent shock and immediately afterwards an icy cold through their limbs. It was all over. The balloon wrestled with the waves. Driven on by the wind, it rose at times, like a wounded bird, but fell back again at once. The engineer and his companions, drenched and blinded by the foam of the waves, seized the ropes as if by instinct to get out of the car, which had fallen completely on its side.

After superhuman efforts they succeeded in reaching the interior of the balloon. The latter shrank visibly at every contact with the water. The water vapour was rapidly turning to liquid.

“Higher, higher!” cried the engineer, as he crawled into the great hollow of the envelope.

Here the waves could not reach them. Having fully recovered his composure, he noticed that the balloon was not motionless, but was drifting swiftly towards an enormous mass whose outlines were only dimly visible through the veil of snow. The captain noticed it too.

“An iceberg!” he cried.

Scarcely had these words passed Ford's lips when the battered balloon came down on the iceberg. Our travellers heard the dull cracking of the breaking bamboo. The car had struck a sharp edge of the ice field and remained hooked on it for a moment.

Gromski took advantage of this moment. Leaving his shelter, he reached, after a few fruitless attempts, the rugged surface of the iceberg and soon found himself on the edge on which the balloon had caught.

The captain and James followed him mechanically.

Meanwhile, as soon as they had set foot on the iceberg, the balloon suddenly rose and disappeared into the sky.

For a few moments it could still be seen as a black speck against the falling snow; then it vanished among the dark clouds.

The engineer, sitting at the edge of a wide crevasse, gazed after it to the last moment. Then he hid his face in his hands, and a few tears trickled down his cheeks.

The creation of his genius, the source of his fame, was lost for ever.
CHAPTER SEVENTEEN.
On an Iceberg in the Ocean.
The iceberg onto which our travellers had been so unexpectedly thrown was no safe refuge. It was a fragment of an ice field, pierced by numerous deep fissures that widened with every new assault of the waves. The whole mass was drifting on the surface of the ocean and was at every moment in danger of losing its balance and capsizing. The engineer and his companions clung with the utmost difficulty to a spot that was repeatedly covered by the foam of the waves. The very next wave to roll in might hurl them into the sea. Foreseeing this, the captain looked round for a safer place. A few feet higher up there was a hollow, surrounded by several hummocks and situated in the middle of the iceberg. The rugged surface of the berg offered a fairly firm foothold.

Without further thought, Ford began to climb towards the chosen place. After a few moments he was sitting in this kind of shelter, which the highest waves could no longer reach.

At the captain's urging, the engineer and James resolved to try their luck likewise. But the venture in which Ford had fortunately succeeded nearly proved fatal to his two companions. An enormous wave broke over the helmsman and would without doubt have swept him into the sea, had Gromski not managed to hold on to him.

Taking advantage of the interval between two waves, however, the travellers reached the cleft, where they were completely safe from the waves breaking on the sides of the iceberg.

A high wall of ice, rising on the windward side, fully protected our heroes from the driving snow. At the foot of this wall the helmsman discovered a small grotto and crept into it at once.

The castaways of the air, stiff with cold and soaked to the skin, cast despairing glances at the ocean from their shelter. Why had they warded off the death that was already opening its jaws for them? Was it only to die another death shortly afterwards, a hundred times more terrible: death from hunger and cold?

The helmsman, who was usually inclined to take things rather lightly, loudly cursed his cowardice, which had driven him to seek a doubtful refuge on the iceberg.

“Otherwise we should have perished in the water long ago,” he said; “and now we must wait for the end all over again.”

The old sailor was evidently prey to the same torment endured by a man awaiting the execution of his death sentence.

“Keep calm, James!” said the engineer with a bitter smile. “The moment is near when this berg will overturn: a single large wave will be enough.”

“That will not happen so soon,” observed Ford. “You forget that the iceberg we are on rises more than ten metres above the water; the part that lies under water is about eight times larger. Neither the ocean, nor the waves, nor the hurricane will be able to overturn such a mass. Icebergs are often seen at 54 degrees of latitude; in the northern hemisphere they often reach the 46th degree. The cold winds that blow from the poles towards the equator often drive them even into the temperate zone. It is therefore possible that we shall make a long voyage, one that will last until our iceberg has melted entirely under the influence of the sun's rays.”

“But is it certain that we are moving?”

“The wind is without any doubt driving us along, or perhaps some current; otherwise this iceberg would not have drifted away from the coast of the southern continent, where it broke off from the ice field.”

“So we have time to die of hunger and cold three times over,” said Gromski.

James backed the engineer's remark with a vigorous sailor's oath.

“Would the distance we have covered with this storm be considerable?” he asked after a few moments' silence.

Gromski shrugged his shoulders.

“What do I know of that?” he replied. “Perhaps a mile, perhaps as much as three thousand kilometres. Within a few hours we shall know for certain from the length of the night. If night does not fall, that will be proof that we have not yet even crossed the polar circle.”

“Yes, that is true, and I should like to know as soon as possible where we are,” said the captain pensively. “What a pity that the chronometer was left behind in the car!”

“What does all that matter to us?” muttered the helmsman.

“It matters more to us than you think; for if our balloon was wrecked in a region frequented by whalers …”

The captain did not dare to finish this sentence. The supposition he wished to put forward seemed to him all too bold.

After this conversation a deep silence reigned on the iceberg. Our castaways, huddled against one another and shivering with cold, listened to the roar of the hurricane and the ominous pounding of the waves against the sides of the iceberg.

Meanwhile, after some time they noticed to their joy that it had stopped snowing. The dark clouds let some light through, by which one could see the vague line of the horizon, on which white dots appeared here and there. These were icebergs. The captain,