M348 Applied Statistical Modelling - Linear Models
Book 1
Linear models
This publication forms part of the Open University module M348 Applied statistical modelling. Details of this
and other Open University modules can be obtained from Student Recruitment, The Open University, PO Box
197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)300 303 5303; email [email protected]).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn more about
the wide range of modules and packs offered at all levels by The Open University.
Contents
Introduction to Unit 1
1 Computing preliminaries
1.1 Getting started with Jupyter
1.2 Introducing R
8 Looking ahead . . .
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities
Introduction
3 Diagnostics
3.1 Checking the model assumptions
3.2 Leverage
3.3 Cook’s distance
3.4 Using R to perform diagnostic checks
References
Acknowledgements
Introduction
Summary
Acknowledgements
Introduction
Summary
References
Acknowledgements
Introduction
7 Replication
7.1 Perils of multiple testing
7.2 Overcoming the problem of multiple testing
Summary
References
Acknowledgements
Index
Unit 1
Introduction to statistical modelling and R
Setting the scene
[Figure: the statistical modelling process as a cycle – pose questions, design study, collect data, explore data, make assumptions, formulate model, fit model, check model, improve model, choose model, report results]
Introduction to Unit 1
In Unit 1, we will focus on linear regression with a single explanatory
variable and also introduce the software used throughout the module for
the practical computing work. You will get hands-on experience of using
the software, with the aim that you should feel confident using it for linear
regression with a single explanatory variable by the end of the unit.
The module uses the statistical programming language R (yes, its name is
just this one letter!) for statistical modelling. R is open source and is
widely used by practising statisticians and researchers working in many
different fields. The primary difference between statistical modelling using
R and statistical modelling using a statistical package such as Minitab or
SPSS is the fact that R is not menu-driven: in statistical packages such as
Minitab and SPSS, models can be specified by selecting options from
menus on the toolbar and then completing dialogue boxes, whereas in R,
models are specified by typing commands. Although this may sound like
harder work, the advantage with using typed commands instead of menus
is that there is far more flexibility as to what can be done and how it is
done.
If you are new to using computer code, you may feel a little nervous about
using R. Do not worry, you will not be expected to write a lot of code; in
the majority of cases the code will be provided for you, and any code that
you do need to produce yourself will be easily copied and adapted from
existing code.
To help you work with the R code, the module will use R via the Jupyter
Notebook application. Jupyter Notebook is often simply referred to as
‘Jupyter’ and we will do so throughout the module.
The easiest way to get a feel for using R via Jupyter is to try it! You will
do that next in Section 1, starting with the basics of using Jupyter and
then an introduction to R. Section 2 will focus on exploratory data analysis
and you will then try some exploratory data analysis using R in Section 3.
Sections 4, 5 and 6 cover various aspects of linear regression for modelling
the relationship between two variables. Some of the content of these
sections will be a review for you, while some of the content is likely to be
new to you. You will then learn how to use R for linear regression in
Section 7.
Finally, the unit rounds off with a (very short!) section looking ahead to
what’s to come in the rest of the module.
The structure of Unit 1, in terms of how the unit’s sections fit together, is represented diagrammatically in what we’ve called a ‘route map’ (shown
next). This is accompanied by a note of any sections or subsections that
need a computer or other resources. The aim of the route map is to help
you to navigate your way through the unit. Each unit will have its own
route map, which will be used in the same way.
[Route map]
• Section 1: Computing preliminaries
• Section 2: Exploratory data analysis
• Section 3: Using R for exploratory data analysis
• Section 4: Simple linear regression
• Section 5: Checking the simple linear regression model assumptions
• Section 6: Prediction in simple linear regression
• Section 7: Using R for simple linear regression
• Section 8: Looking ahead . . .
Note that for Sections 1, 3 and 7 you will need to use your computer
as well as the written unit in order to complete activities using the
module software. In Section 1, you will also need to access other
resources on the module website (such as screencasts or written
instructions) as part of setting up and familiarising yourself with the
module software.
1 Computing preliminaries
This section introduces Jupyter and R, which together form the software
you will be using throughout the module for the practical computing work.
Jupyter is used in this module as a way to work with the statistical
programming language R.
We will focus on Jupyter first, in Subsection 1.1, before adding R into the
mix in Subsection 1.2. For both subsections, you will need to refer to the
module website alongside this printed text. Before starting Subsection 1.1,
make sure that you have installed both Jupyter and R on your computer.
Whenever you need to work through a notebook for the module, we have
signalled a ‘Notebook activity’ in the printed text, such as that given
below for Notebook activity 1.1. The heading for each of these notebook
activities indicates which notebook you will need, and is followed by a brief
explanation of what is covered in the notebook.
So now work through Notebook activity 1.1 (for which you opened the
associated notebook in Activity 1) to get a feel for the main features of
Jupyter.
All of the notebooks associated with Unit 1 can be found in the folder
called ‘Unit 1’ on the Jupyter dashboard. (Similarly, all the notebooks
associated with Unit 2 can be found in the folder called ‘Unit 2’ on the
Jupyter dashboard, and so on.) Work through Notebook activity 1.2 to
explore Jupyter a little further.
You will use Jupyter notebooks throughout this module, and at times you
will need to manipulate your own notebooks. In order to do this, you need
to know how to write text in notebooks. As has already been mentioned,
Jupyter uses Markdown to do this.
Activity 2 introduces you to Markdown, before you practise using it in
Notebook activity 1.3.
Watch Screencast 1.2 on the module website, which shows you how to use
Markdown. There are also some written instructions provided on the
module website to use instead of, or as well as, the screencast.
You should by now have a fairly good idea of how you can create text
documents in Jupyter. However, we haven’t yet considered the
all-important question of how to use the programming language R! This
will be considered in the next subsection.
1.2 Introducing R
Although you met R very briefly while using Jupyter in Subsection 1.1,
this section provides a proper introduction to R. We’ll get started with R
in Notebook activity 1.4.
A data frame is a collection of vectors, not necessarily of the same type, and each vector becomes a separate column in an array of data. The vectors in a data frame must all have the same length, with their values in a common order, so that the first values in each vector correspond to the values for the first observation, the second values in each vector correspond to the second observation, and so on.
Most of the datasets used in M348 are stored in data frames, which have
been created for you. When a dataset is introduced, the name of the data
frame is also given. You will learn how to load and view the data frames
for M348 in the next notebook activity. We will then focus on the
individual vectors within a data frame.
Although the datasets that you’ll use in M348 are all stored in data frames
which have already been created for you, there will be times when you will
need to create a data frame for use in calculations. For example, you may
want to create a data frame which contains only a subset of the vectors
from a data frame. You will see how to do this in the next – and final –
notebook activity of this section.
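To make this concrete, here is a minimal sketch of creating and subsetting a data frame in R. The data frame and variable names are invented for illustration and are not the module’s own:

```r
# Create a small data frame by hand: each argument becomes a column,
# and all columns must have the same length
trees <- data.frame(
  height   = c(9, 8, 10),               # heights in metres
  diameter = c(0.23, 0.20, 0.28),       # diameters in metres
  side     = c("west", "east", "west")  # side of the road
)

trees$height   # extract a single vector (column) from the data frame

# Create a new data frame containing only a subset of the vectors
trees_subset <- trees[, c("height", "diameter")]
```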
This section has introduced some of the basics of R, but we haven’t yet
used R to do very much. I’m sure that you will be glad to hear that you
will be using R to do more exciting things as we move through the module,
and you will discover that R is, in fact, a very powerful tool for analysing
data and building statistical models.
We will return to using R later in this unit, but in the meantime you may
choose to watch Screencast 1.3 (on the module website) for a ‘sneak peek’
at some of the things that R can do.
Now that you have been introduced to both Jupyter and R, we can move
on to considering statistical modelling in R. A crucial first step in any
statistical modelling is to get a feel for the data through some exploratory
data analysis; this is the subject of the next section.
2 Exploratory data analysis
The data used in this module will all be secondary data. Although there
are advantages to using primary data in terms of data quality, primary
data can be expensive and time-consuming to obtain. As such, secondary
data are commonly used by many researchers.
As was seen in Example 3, the fact that a study is called ‘an experiment’
does not necessarily mean that the resulting data are always experimental
data. The key to whether the data are observational or experimental lies in
whether or not the researcher has control over the value a variable takes.
The final distinction for data that we’ll consider here is between natural
science data and social science data.
Although both natural science and social science have ‘science’ in common,
the focus of each is different: natural science focuses on the physical world,
while social science focuses on the study of society and how humans
behave and influence the world around us. (Natural science disciplines
include biology, chemistry, Earth science and physics. Social science
disciplines include economics, education, human geography, law, politics,
psychology and sociology.) This difference in focus means that
natural science data are generated from well-defined laws of nature,
whereas social science data are generated from situations involving people.
Natural science data are objective and you would expect to get very
similar results when repeating an experiment. Social science data tend to
be subjective, often relying on opinion, which means that the resulting
data can vary a lot from sample to sample. What’s more, it is not usually
possible for the researcher to control a social science data variable in the
same way that it often is in natural science. This means that social science
data are invariably observational rather than experimental.
There are many data sources which provide freely available data. We met
one of these, the OECD, earlier in Example 1. Here are examples of some
others.
• National agencies, such as the Office for National Statistics (ONS) in the
UK, collect and analyse data on many aspects of national life including
the economy, population and society at national, regional and local
levels.
As already mentioned, the type of data in a dataset can affect its quality
and reliability. We will discuss data quality and reliability next.
However, primary data may not always be ideal for addressing the problem
at hand. For example, it may be impractical or too expensive to collect
data on a variable of interest. Indeed, it may not even be possible to
observe the variable that we are really interested in, and instead we need to
use a less ideal alternative. This problem is illustrated in the next example.
Data on gender identity are important for issues of equality, diversity and
inclusion, as well as to support policy development and service provision.
At the time of writing, where data on gender are available, the majority of
existing datasets use just a binary female/male classification for gender.
You’ll notice this prevalence of the binary classification for gender reflected
in the datasets that you’ll meet in this module. However, there are many
terms that someone may use to identify their gender. Alternatively, they
may use more than one term or they may not use any specific term.
Given this, what problems might arise with a dataset that only allows
gender to take the values male and female?
Figure 3 Treezilla map showing the manna ash trees along Walton
Drive on the OU campus (as at 10 February 2022), with labels added to
indicate the west and east sides of Walton Drive
A tree’s diameter and height may be affected by its location – for example, some trees may be in sunnier locations than others. So, the location of the tree may be useful information for any statistical analysis of these data.
For the manna ash trees dataset, while measuring tree diameter (1.3 m
above ground) is fairly straightforward, foresters use special equipment to
measure tree height. (For example, a laser rangefinder measures distances
and a clinometer measures the angle between the person and the top of a
tree. Trigonometry can then be used to calculate the height of the tree.)
There are also smartphone apps which can measure tree heights, but, at
the time of writing, these can be very inaccurate.
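As an aside, the trigonometry described above is simple enough to sketch in R. This is not module code, and the numbers are made up for illustration:

```r
# Height of a tree from a distance and an angle of elevation:
# height = eye height + horizontal distance * tan(angle)
distance_m <- 20    # horizontal distance to the tree (rangefinder)
angle_deg  <- 25    # angle of elevation to the treetop (clinometer)
eye_m      <- 1.5   # height of the observer's eyes above the ground

angle_rad <- angle_deg * pi / 180    # tan() in R expects radians
eye_m + distance_m * tan(angle_rad)  # approximately 10.8 m
```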
(a) Some of the individual trees in the Treezilla data have inaccurate or
missing information regarding which species the tree is. Why might
that be?
(b) Why might some of the individual tree height measurements be
inaccurate or missing?
One of the data quality problems mentioned in Activity 8 was the problem
of missing data. There is no hard-and-fast rule as to how to deal with
missing data values. If there are only a few values of a variable which seem
to be missing randomly, then it can be sensible to simply drop these
observations. When there are a substantial number of values missing for a
variable, then it might be wiser to drop that particular variable. On the
other hand, there could be an underlying reason why data values are
missing – for example, a particular group of people may refuse to answer a
particular question. In this case, the missing values should not be ignored
since the fact that they are missing is important. We will consider the
problem of missing data in more detail in Unit 5.
Another potential problem with large-scale databases such as Treezilla is
the potential for there to be bias in the data. You will see an example of
such bias in the next activity.
Figure 4(a) shows the Treezilla map of trees for an area of London, taken
from the Treezilla website on 10 February 2022, and Figure 4(b) shows the
Treezilla map of a slightly larger area around Loch Tummel (a rural area
in Scotland), taken from the same website on the same day.
Figure 4 Map showing the locations of trees with data (denoted by coloured
dots) in the Treezilla database in (a) an area of London, and (b) an area
around Loch Tummel in Scotland.
24
2 Exploratory data analysis
In the map showing the area of London, there are data for a lot of trees, as
indicated by the many dots on the map. In contrast, there are no dots on
the map showing the area around Loch Tummel, which indicates that no
data have been recorded for trees in that area. As a result, from the
Treezilla maps, it looks like there are far more trees in London than the
area around Loch Tummel. However, as can be seen from Figure 5, there
are many trees around Loch Tummel.
Why might the data collection method used in the Treezilla citizen science
project lead to more tree data being collected in London than in the Loch
Tummel area?
3 Using R for exploratory data analysis
The first thing that we need to do when presented with a new dataset is to
consider the quality of the data. We will do this in the next activity.
As you saw in Activity 11, it is possible that some of the variables in the
FIFA 19 dataset may not be precisely measured, and/or may also exhibit
bias. Equally, it is possible that there are no problems with the data.
Without further details regarding how the data were obtained for these
variables, all we can do is use the data as given in our exploratory data
analysis, and flag up any potential problems in any conclusions/discussions
that we present.
In Subsection 2.3 you were reminded that visual summaries of a dataset
are an important part of exploratory data analysis. So, in the next activity
you will consider which graphics would be suitable to start exploring the
FIFA 19 dataset.
Once we have a good feel for the data through an exploratory data
analysis, we are ready to start modelling the data; simple linear regression
is the topic of the next section.
4 Simple linear regression
We will start this section by reviewing the basic idea of simple linear
regression in Subsection 4.1. We will then consider estimating the
parameters in the simple linear regression model in Subsection 4.2, and
testing for a relationship between the two variables in Subsection 4.3.
Figure 6  A scatterplot of data (x1, y1), (x2, y2), . . . , (xn, yn) with a regression function h(x); the random part represents the scatter about h(x)
[Figure: the p.d.f. of a normal distribution for y, with the horizontal axis marked at µ − 3σ, µ − 2σ, µ − σ, µ, µ + σ, µ + 2σ and µ + 3σ]
Figure 8  Illustration of simple linear regression for data (x1, y1), (x2, y2), . . . , (xn, yn), with two values x1 and x2 marked on the x-axis
The notation introduced in Box 3 will be used throughout the module and
you will also see that it is used for specifying models in R.
It is often helpful to use informative variable names rather than Y and x
when specifying a model, as illustrated in the final activity of this
subsection.
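For instance, a simple linear regression model is specified in R with the formula notation response ~ explanatory. A minimal sketch, assuming the manna ash trees data are in a data frame called trees (the data frame name is an assumption, not the module’s):

```r
# Specify and fit the model height ~ diameter
fit <- lm(height ~ diameter, data = trees)
summary(fit)  # parameter estimates, standard errors, t-values, p-values
```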
The method of maximum likelihood finds the values of α and β which maximise the likelihood. Note that you
do not need to know the technical details of obtaining these estimates for
either of these methods in this module; R will be used to obtain the
estimated values of α and β, and indeed to obtain the estimated values of
all parameters in all of the regression models that you will meet in this
module.
Terminology and notation used to describe the fitted simple linear regression model are given in Box 4.
This is also known as the equation of the least squares line or the
equation of the fitted line.
The line itself is usually referred to simply as the fitted line.
Figure 9  A scatterplot of height (m) against diameter (m), with the fitted line added
(a) Write down the fitted simple linear regression model for these data.
(b) According to the fitted simple linear regression model, if the diameter
increased by 0.1 m, how would the height change?
In the solution of Activity 14, the fitted simple linear regression model
y = 5.05 + 12.27x
was written in an alternative form by replacing Y and x by their more
informative names height and diameter, respectively, so that the fitted
model becomes:
height = 5.05 + 12.27 diameter.
Writing the fitted model using more informative variable names can help
make sense of the model more easily: in simple linear regression with only
one explanatory variable it isn’t difficult to remember what x is, but things
can get rather complicated when not using informative variable names for
some of the models you will meet in this module.
The fitted line for a simple linear regression model, together with two
fitted values and residuals (one positive and one negative), are illustrated
in Figure 10.
Figure 10  Illustration of the fitted line, with equation ŷ = α̂ + β̂x, for a simple linear regression model, together with two fitted values and residuals: a positive residual rj = yj − ŷj and a negative residual ri = yi − ŷi
In the next activity you will calculate a fitted value and a residual.
The residuals are the differences between the observed data points and the fitted regression line, and as such, ri is essentially an estimate of the random term Wi. Now, σ² = V(Wi) and so we can use the (sample) variance of the residuals r1, r2, . . . , rn as an estimate of σ²; this is summarised in Box 6.
Box 6  Estimate of σ²
In simple linear regression, the (sample) mean of the residuals is 0. So an estimate of the variance σ² = V(Wi), i = 1, 2, . . . , n, denoted by σ̂², is given by
σ̂² = (r1² + r2² + . . . + rn²)/(n − 2),    (4)
where ri is the residual for the ith data point.
More informally, this is
σ̂² = (sum of squared residuals)/(n − 2).
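A sketch of Equation (4) in R, assuming a fitted simple linear regression model object called fit (for example, the one fitted to the manna ash trees data):

```r
r <- residuals(fit)  # the residuals r_1, ..., r_n
n <- length(r)
sum(r^2) / (n - 2)   # the estimate of sigma^2 from Equation (4)

# R reports the square root of this as the 'residual standard error',
# so the following should agree with the value above
summary(fit)$sigma^2
```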
[Cartoon: ‘Of course there’s a relationship between these two variables – just look at how well the line fits!’]
In previous study, you will have met hypothesis testing when using a
normal distribution to test your hypotheses, and when using another
distribution – the t-distribution. The hypothesis test for testing whether
the slope parameter β in a simple linear regression model is zero is based
on the t-distribution. A reminder of the t-distribution is given in Box 7.
[Figure: p.d.f.s of the t-distributions t(1), t(3) and t(7), together with the standard normal distribution N(0, 1)]
You may not have seen the test statistic t expressed in the form given in
Box 8 before. There is, however, a good reason for writing the test statistic
in this form, as you will discover when we use R for regression and as you
learn about other regression models in the module.
[Figure: the p.d.f. of the null distribution t(n − 2); the p-value is the area of the two shaded tails beyond −t and t, where t is the observed value of the test statistic]
So, following on from Box 9, the question is: how small does the p-value need to be to conclude that β ≠ 0? Well, unfortunately there is no hard-and-fast answer to this question!
In previous study, you may well have been given some rough guidelines on how to interpret p-values. For example, a value of p < 0.05 is often taken as enough evidence against H0 to conclude that β ≠ 0, while a value of p < 0.01 is often taken as strong evidence against H0. However, as Example 6 shows, what is considered small enough to suggest that β ≠ 0 depends on the context of the data and the research question of interest.
In the next activity you will interpret a p-value obtained for a simple linear
regression model.
5 Checking the simple linear regression model assumptions
[Figure: four example residual plots, labelled (a) to (d), each showing residuals scattered about the zero residual line]
In the next activity you will examine residual plots resulting from simple
linear models fitted to some more datasets.
[Figure: eight residual plots, labelled (a) to (h), each showing residuals about the zero residual line]
Residual plots sometimes also include a trend line, which indicates how the
mean of the residuals varies across the values of the fitted values. This can
be useful for checking the assumption that the mean of the Wi ’s is zero
(and hence that the linearity assumption is reasonable): if the trend stays
roughly around the zero residual line, then the assumption is reasonable.
(Note that R adds a trend line to residual plots, but we have not included
this line in the figures in the units.)
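For reference, a one-line sketch of the plot R produces, assuming a fitted model object called fit:

```r
# Residuals against fitted values; R adds the trend line automatically
plot(fit, which = 1)
```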
[Figure: residuals plotted against fitted values for the simple linear regression model fitted to the manna ash trees data]
You may not have agreed with the solution provided for Activity 20. There
is no definitive ‘correct’ conclusion in cases like this, where a plot is not
presenting a clear-cut picture. If the residual plot had a very clear pattern
and looked like the one in Figure 13(c), for example, then this would
change our conclusions. As with much of statistics, interpretation of the
plots and results can be very subjective and often the best we can do is to
present our conclusions together with the reasons for coming to those
conclusions.
There isn’t any particularly obvious ordering of the data that would be
useful for checking the independence of the observations in the manna ash
trees dataset. So Figure 21 shows the residuals observed from fitting a
simple linear regression model to the manna ash trees data in Activity 14
plotted in the order given by the identification numbers (treeID). From
this plot, does the independence assumption seem reasonable?
[Figure: residuals from the simple linear regression model for the manna ash trees data, plotted in treeID order]
Figure 17  A normal probability plot (standardised residuals against theoretical quantiles) where the points lie approximately along a straight line, indicating that it is reasonable to assume that the residuals are normally distributed
Figure 18  Normal probability plot of the residuals of the simple linear regression model for the manna ash trees
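A normal probability plot like those in Figures 17 and 18 can be produced in R from a fitted model object, here assumed to be called fit:

```r
# Standardised residuals against theoretical quantiles (normal Q-Q plot)
plot(fit, which = 2)
```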
6 Prediction in simple linear regression
In the next activity you will use the result given in Box 12 to consider
predicting the heights of manna ash trees.
Consider once again the manna ash trees dataset. From Activity 14
(Subsection 4.2.1), the fitted simple linear regression model for these data
is
height = 5.05 + 12.27 diameter.
(a) According to the model, what is the predicted height for a manna ash
tree with diameter 0.20 m?
(b) Explain why the fitted model may not be appropriate to predict the
value of height for young trees with very small diameters. (Hint: you
might find it useful to consider Figure 9 in Subsection 4.2.1 and what
values the data take in the sample.)
Activity 23 raises an important issue. The fitted model is valid for the
values of x used when fitting the model, but may not be valid beyond the
range of these values. This is illustrated in Example 7.
Figure 19  Scatterplot of weight (kg) and age (months), together with the fitted simple linear regression line for these data
Despite how well the simple linear regression model fits the data in
Figure 19, this model is a poor fit when considering weight and age
for values of age from 1 to 12 months. A scatterplot of these data
(with values of age between 1 and 12 months) is shown in Figure 20,
together with the fitted line from Figure 19 (based on values of age
between 6 and 10 months). Clearly, the simple linear regression line
based on data for values of age between 6 and 10 months does not fit
the relationship between weight and age across all values of age up
to 12 months.
Figure 20  Scatterplot of weight and age for ages up to 12 months, together with the fitted simple linear regression line from Figure 19
So, in linear regression, interpolation uses the fitted line within the range
of x values used for fitting the line, while extrapolation uses the same
fitted line outside the range of x values used for fitting the line. As we saw
in Example 7, caution is needed when using extrapolation: extrapolating
just outside the range of x values may well be fine, but extrapolating
further outside the range may produce very misleading results.
Figure 21  Scatterplot of height and diameter, together with the fitted simple linear regression line and the 90%, 95% and 99% prediction intervals at x0 = 0.20 marked; the point prediction of Y0 at x0 = 0.20 is ŷ0 = 7.50
For the manna ash trees dataset, the values of diameter range from 0.15 m
up to 0.35 m. The 95% prediction interval for the height of a tree with
diameter 0.15 m (the low end of the range of diameter) is (3.4, 10.3) and
the 95% prediction interval for the height of a tree with diameter 0.35 m
(the high end of the range of diameter) is (5.9, 12.8).
If instead we consider a diameter in the middle of the range of values for
diameter, then the 95% prediction interval for the height of a tree with
diameter 0.25 m is (4.8, 11.4).
By looking at the widths of these three prediction intervals, how do the
widths of the prediction intervals vary across the range of values of
diameter?
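The prediction intervals quoted above can be computed in R with predict(). A minimal sketch, assuming the fitted model object fit and the data frame trees as before:

```r
# Point predictions and 95% prediction intervals for three new diameters
new_trees <- data.frame(diameter = c(0.15, 0.25, 0.35))
predict(fit, newdata = new_trees, interval = "prediction", level = 0.95)
```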
7 Using R for simple linear regression
You will notice, as we move through the module, that the output produced
by R is often given to several decimal places. This degree of accuracy is
usually not required when writing down the fitted model, since the model
is just that – a model, giving a simplified representation of the underlying
process which generated the data. So, it is common practice to round the
parameter estimates given by R when quoting a fitted model. For example, if R produces the estimates α̂ = 17.12938 and β̂ = 0.32173, then it would be reasonable to quote the fitted model as any of the following:
y = 17.129 + 0.322x,
y = 17.13 + 0.322x,
y = 17.13 + 0.32x,
and so on.
There is no hard-and-fast rule on how much rounding is reasonable, and it
usually depends on what seems sensible in the context of the data, the
quality of the data, the sample size, and so on. Statisticians are not
particularly fussy about degrees of rounding, since any rounding errors are
usually insignificant in comparison to other assumptions and
approximations involved in the modelling process.
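A sketch of extracting and rounding the parameter estimates, again assuming a fitted model object called fit:

```r
coef(fit)            # estimates reported to several decimal places
round(coef(fit), 2)  # e.g. 17.13 for the intercept and 0.32 for the slope
```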
Hopefully, you should now feel comfortable with using both Jupyter and R
for exploratory data analysis and simple linear regression. Bear in mind
that code can be copied and pasted between notebooks, so do not worry if
you can’t remember the required code – simply copy, paste and adapt code!
8 Looking ahead . . .
To round off this unit, we’ll take a very brief look at some of the main
ways in which this module will go beyond simple linear regression.
Figure 22  Fitted simple linear regression lines for the manna ash trees dataset using trees on the west side of Walton Drive only (blue triangles and blue dashed fitted line) and using trees on the east side of Walton Drive only (red circles and red solid fitted line)
Summary
In this unit, you have been introduced to the software to be used for
statistical modelling throughout this module – namely Jupyter notebooks
using the statistical programming language R. We started off by exploring
Jupyter, learning the basics of how to create, use and edit notebooks.
There then followed an introduction to R, where you learnt how to run R
code that you were given, as well as writing (and running) some of your
own code.
The unit then moved on to exploratory data analysis. We considered some
different types of data and discussed how the quality and reliability of data
can vary, both between and within datasets. The module assumes that you
have knowledge of various visual and numerical data summaries from prior
study, and the unit provided a list of these. You then had the opportunity
to use R to do some exploratory data analysis.
After this, we moved on to the topic of simple linear regression to model a
linear relationship between a response variable Y and an (assumed known
and fixed) explanatory variable x. We discussed the basic idea of simple
linear regression, which expresses the response in terms of an underlying
systematic straight-line relationship with the explanatory variable,
together with a random element to represent how the observed data values
vary around this straight line. Estimating the model parameters and
testing for a relationship between the response and the explanatory
variable were then considered, before moving on to the use of residual plots
and normal probability plots for checking the model assumptions (of
linearity, independence, zero mean, constant variance and normality of the
residuals), rounding off with a look at prediction. The unit finished with
using R for simple linear regression.
The Unit 1 route map, repeated from the introduction, provides a nice
reminder of what has been studied and how the different sections link
together.
[Route map]
• Section 1: Computing preliminaries
• Section 2: Exploratory data analysis
• Section 3: Using R for exploratory data analysis
• Section 4: Simple linear regression
• Section 5: Checking the simple linear regression model assumptions
• Section 6: Prediction in simple linear regression
• Section 7: Using R for simple linear regression
• Section 8: Looking ahead . . .
Learning outcomes
After you have worked through this unit, you should be able to:
• open and work through an M348 Jupyter notebook
• create a new Jupyter notebook
• add and edit text in a Jupyter notebook using Markdown
• run, write and adapt R code given in a notebook
• appreciate what R objects, functions and vectors are
• load M348 data frames into a notebook and create new data frames
• appreciate the differences between primary and secondary data,
observational and experimental data, and natural science and social
science data
• appreciate that the quality and reliability of data varies between, and
within, datasets
• use R to produce a variety of visual and numerical data summaries, and
be able to interpret these summaries
• appreciate that simple linear regression models a straight line
relationship between a response variable (the variable we would like to
model) and an explanatory variable (the variable which can be thought
of as ‘explaining’ the response)
• use R to fit a simple linear regression model and interpret the resulting
output produced by R
• use the fitted model output to test for a relationship between the
response variable and the explanatory variable
• use R to produce residual plots and normal probability plots for fitted
models
• appreciate that if the points in a residual plot show a pattern, then the
model assumptions of linearity, zero mean and constant variance might
not be justified
• appreciate that if the points in a normal probability plot do not fall
close to a straight line, then the model assumption of normality might
not be justified
• appreciate that if the residuals show a pattern when ordered, then the
independence assumption might not be justified
• use R to calculate point predictions and prediction intervals for the
response when given new values of the explanatory variable
• appreciate that predictions calculated from the fitted model are only
valid for new values of the explanatory variable within the range of
values used to fit the model, and there is no guarantee that the same
relationship between the response variable and the explanatory variable
will hold outside this range.
References
Coughlan, S. (2020) ‘Most children sleep with mobile phone beside bed’,
BBC News, 30 January. Available at: https://www.bbc.co.uk/news/
education-51296197 (Accessed: 8 February 2022).
Gadiya, K. (2019) FIFA 19 complete player dataset. Available at:
https://www.kaggle.com/karangadiya/fifa19 (Accessed: 13 March 2019).
OECD (no date) OECD skills surveys. Available at:
https://www.oecd.org/site/piaac (Accessed: 8 February 2022).
OECD (2013) OECD skills outlook 2013: first results from the Survey of
Adult Skills. Paris: OECD Publishing. doi:10.1787/9789264204256-en.
ONS (2020) Census 2021 paper questionnaires. Available at:
https://www.ons.gov.uk/census/censustransformationprogramme/
questiondevelopment/census2021paperquestionnaires
(Accessed: 8 February 2022).
Treezilla (2012) Treezilla. Available at: https://treezilla.org
(Accessed: 19 July 2019).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Setting the scene, child measuring height: © Sergey Novikov /
www.123rf.com
Subsection 1.1, Jupiter: © tristan3d / www.123rf.com
Subsection 2.1, work collaboration: © rawpixel / www.123rf.com
Subsection 2.1, part of the LHC: © 2021 CERN
Subsection 2.1: young person using her phone in bed © Ian Iankovskii /
www.123rf.com
Subsection 2.1, UN building: © Calapre Pocholo www.123rf.com
Subsection 2.1, satellite image: © Alexander Koltyrin / www.123rf.com
Subsection 2.2.2, laser rangefinder: Taken from:
https://www.ebay.ca/itm/124860962209?oid=321358572410
Subsection 2.2.2, clinometer: Taken from:
https://treezilla.org/assets/downloads/tree-survey-guide.pdf
Figure 4(a): https://www.treezilla.org
Figure 4(b): https://www.treezilla.org
Figure 5: the B8019 crossing Allt Charmaig, by Loch Tummel in Scotland:
© Peter Wood / https://www.geograph.org.uk/photo/6152160. This file
is licensed under the Creative Commons
Attribution-Noncommercial-ShareAlike Licence
http://creativecommons.org/licenses/by-sa/3.0/
Section 3, FIFA World Cup 2018: © Russian Presidential Press and Information Office / https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_Final. This file is licensed under the Creative Commons Attribution Licence http://creativecommons.org/licenses/by/4.0/
Section 3, FIFA Women’s World Cup 2019: © Romain Biard /
www.shutterstock.com
Section 3, footballer showing skill move: © sportgraphic / www.123rf.com
Subsection 4.2.1, tartan fabric: © emqan / www.123rf.com
Section 6, crystal ball: © alexkich / www.123rf.com
Subsection 6.1, baby being weighed: © liudmilachernetska /
www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
From following Screencast 1.1 and/or the written instructions on the
module website, you should have:
• launched Jupyter
• opened the ‘Jupyter dashboard’
• navigated to the ‘Unit 1’ folder
• opened your first Jupyter notebook.
Solution to Activity 2
From following Screencast 1.2 and/or the written instructions on the
module website, you should now feel ready to try using Markdown to:
• create new cells
• add and format text
• add lists and tables.
Solution to Activity 3
(a) Since the physicists are running the experiment to generate the data
and then analysing the data, these are primary data for this team of
physicists.
(b) Since the data were collected by the CERN physicists, and not the
university researcher, these are secondary data for the researcher.
Solution to Activity 4
These data are observational, because the participants’ answers were
simply observed and recorded, and the OECD had no control over the
answers each participant would give.
Solution to Activity 5
Eurostat collects and analyses international data on aspects of the economy, population and society, and is therefore likely to provide social science data.
NASA’s Earthdata Search provides satellite data about the Earth, and is
therefore likely to provide natural science data.
The IMF collects data with a focus on finance and the economy, and is
therefore likely to provide social science data.
Solution to Activity 6
Two possible problems that the researcher may have are:
• The question on the census may not be exactly the question that the
researcher is interested in. For example, on the UK’s 2021 census
(ONS, 2020), a question asking respondents whether they had done any
paid work in the previous seven days told respondents to include ‘casual
or temporary work, even if only for one hour’. This may not be the
definition of work that the researcher wants to use.
• The released data may not be aggregated in the way that the researcher
would prefer. For example, the data might be aggregated into larger
geographical areas than the researcher would like, or the boundaries of
the geographical areas may not correspond with the boundaries that the
researcher would like to use.
You may have thought of other potential problems.
Solution to Activity 7
There is the obvious problem that a dataset which only allows gender to
take the values male and female cannot be used to learn about gender
identity beyond this binary classification.
Another problem is that using the female/male classification can lead to
unreliable data. A person may select one of these options despite feeling
that neither option describes their identity. Alternatively, the person may
choose not to answer that question and so the gender information for that
person is simply missing.
Solution to Activity 8
(a) The Treezilla data are collected by a large number of different people,
including members of the public who are not tree specialists.
Therefore, the species may be incorrectly identified by the person
entering the data, or information regarding tree species could be
missing altogether if the species is unknown by the person entering
the data.
(b) Foresters use laser rangefinders and clinometers to measure tree
height. However, members of the general public may not have either
of these available to use. This makes measuring tree height difficult.
Although there are smartphone apps which can measure tree heights,
because these apps can give inaccurate measurements this means that
the height data for individual trees may be inaccurate. If the person
collecting the data does not have access to either special equipment or
a smartphone app for measuring tree height, then the height
measurement may be missing altogether.
Solution to Activity 9
The Treezilla citizen science project database relies on many different
people (‘citizens’) to collect the data. As such, the amount of tree data
collected in an area will depend, to a certain extent, on the number of
people collecting the data. So, since a rural, sparsely populated area such
as that around Loch Tummel will have fewer people who might potentially
collect the data for this project than in a densely populated area such as
London, it is perhaps no surprise that the data for more trees are collected
in London than around Loch Tummel.
Solution to Activity 10
Some of the potential problems which may arise are as follows.
• Observations may be duplicated if they are recorded in more than one of
the sources. For example, a local branch of a shop may hold a
customer’s details, which may also be held on a different database of
customers purchasing from the shop via the internet.
• The same variable may be recorded in different sources using a different
unit of measure. For example, height may be measured in metres in one
data source, but centimetres in another.
• Categorical variables may have different naming conventions in different
datasets. For example, one data source may record the gender ‘male’ as
‘male’, and another as ‘m’. When analysing the data, these may appear
as different values when in fact they are the same.
You may well have thought of different potential problems.
Solution to Activity 11
(a) It is not clear from the data source how the scores for strength,
marking and skillMoves were calculated, and it is possible that they
were not precisely measured. For example, the scores may simply be
subjective assessments and therefore the opinion of the assessor(s).
On the other hand, despite being rounded to the nearest inch and
pound, respectively, the variables height and weight will have been
measured using measuring equipment, and are therefore more likely to
be fairly accurate and precise.
The variable preferredFoot is not something that can be measured,
but is also likely to be accurate for most players, since it is usually a
fairly clear-cut decision as to which foot a footballer prefers to use.
There may, however, be players who play with both feet equally, and
it isn’t obvious how this might be recorded from the description of the
dataset.
(b) If the scores for strength, marking and skillMoves are subjective
assessments, then there may be bias in the data from individual
assessors. Given the size of the database, it is also likely that there
were many different assessors, and it is possible that there would be
differences between the scores given by different assessors. For
example, some may be more generous than others.
Solution to Activity 12
Bar charts are suitable for displaying categorical or discrete variables.
There are two categorical variables in the FIFA 19 dataset –
preferredFoot (with two levels left and right) and skillMoves (with five
levels labelled 1, 2, 3, 4 and 5) – and so both of these can be displayed in
bar charts.
Both of the variables strength and marking are discrete scores between 0
and 100. Since they are discrete, in theory they could also be represented
by bar charts. However, because there are so many possible values for each
variable (potentially all the integers between 0 and 100), it would be
difficult to see the overall shape of the distribution of the data in a bar
chart. For data such as these, it is usually more sensible to present the
data in histograms by grouping the possible data values into bins.
Although the variables height and weight look discrete because their
values have been rounded to integers, they are in fact continuous, because
they have been measured on continuous scales. As such, both of these
variables are suitable for displaying in histograms.
Solution to Activity 13
Using height for Y and diameter for x, the model can be expressed as
height ∼ diameter.
Solution to Activity 14
(a) Since the estimates of the parameters α and β are 5.05 and 12.27, respectively, the fitted simple linear regression model is
y = 5.05 + 12.27x,
that is,
height = 5.05 + 12.27 diameter.
(b) According to the fitted model, if the diameter increased by 0.1 m, then the height would increase by 12.27 × 0.1 ≈ 1.23 m.
Solution to Activity 15
(a) Using Equation (2) and the fitted simple linear regression model given in the question, the fitted value for the first observation is
ŷ1 = 5.05 + 12.27x1 = 5.05 + (12.27 × 0.23) = 7.8721 = 7.87 (to 2 d.p.).
(b) From Equation (3), the residual for the first observation is
r1 = y1 − ŷ1 = 9 − 7.8721 = 1.1279 = 1.13 (to 2 d.p.).
The values of height were recorded to the nearest metre, and so the value of y1 was rounded and could actually be anywhere between 8.50 m and 9.49 m (taking two decimal places to match those for r1 and ŷ1). Hence the calculated value of r1 is only approximate.
Solution to Activity 16
If β = 0, then the model becomes
Yi = α + Wi,   Wi ∼ N(0, σ²).
Since this is not a function of xi, this means that according to the model the value of Yi is unaffected by xi’s value. So, if β = 0, then there is no relationship between Y and x.
On the other hand, if β ≠ 0, then according to the model, the value of xi does affect the value of Yi. So, if β ≠ 0, then there is a relationship between Y and x.
Solution to Activity 17
(a) From the equation of the fitted line, β̂ = 12.27.
(b) The observed value t is calculated as
t = β̂/(standard error of β̂) = 12.27/4.98 ≈ 2.46.
(c) Since the fitted model was based on data for n = 42 trees, the null
distribution for this test is t(n − 2) = t(42 − 2) = t(40).
Solution to Activity 18
Given the context of the data, the p-value of 0.018 is small enough to
suggest to some analysts that β is not zero, so that there is a linear
relationship between tree height and diameter for these manna ash trees.
Note that there may not be universal agreement about this, and some
analysts may not agree that the p-value is small enough to draw this
conclusion. However, quoting the p-value allows anyone reading the
analysis to see for themselves the strength of evidence against the null
hypothesis that β = 0, so that they can draw their own conclusion.
Solution to Activity 19
Residual plots (c) and (h) are unpatterned, suggesting that the linearity,
zero mean and constant variance assumptions seem reasonable.
Residual plot (e) seems to have a potential outlier, but is otherwise
unpatterned. So this also suggests that the linearity, zero mean and
constant variance assumptions seem reasonable.
Neither of residual plots (a) and (d) is randomly scattered about the horizontal line: instead each follows a curve, suggesting that the zero mean and linearity assumptions are not reasonable. (Residual plot (d) also has two potential outliers which stand out from the rest of the pattern.)
Finally, the vertical spreads in residual plots (b), (f) and (g) all change with the fitted values, suggesting that the constant variance assumption is not reasonable. Further, residual plot (f) suggests that the linearity assumption is also not reasonable, since the points follow a curve rather than being scattered about the zero residual line.
Solution to Activity 20
The points in the residual plot are randomly scattered about the zero
residual line, suggesting that the assumption that the Wi ’s have zero
mean, and hence linearity, is reasonable. There is, however, perhaps a hint
of decreasing spread as fitted values increase, which could suggest that the
assumption that the Wi ’s have constant variance may be in question.
However, the sample size of 42 trees is not large, and so it is difficult to
draw firm conclusions from the residual plot. Also, if the two large positive and two large negative residuals at fitted values ŷi of around 7.3 to 7.5 were slightly smaller, would the residual plot still have a hint of decreasing spread?
On balance, it looks like the linearity, zero mean and constant variance
assumptions could be considered to be reasonable.
Solution to Activity 21
There is perhaps a hint of curvature in the residuals going from left to
right, which might mean that the independence assumption could be
questionable. Perhaps the identification numbers represent the order that
the trees are situated along the road and the heights of trees are affected
by the heights of their neighbours? From the given data, we do not know
the answer to this and we would need further information regarding how
the data were collected if we were to investigate the independence
assumption further.
Any curvature in the plot is, however, only slight and, as mentioned in the
solution to Activity 20, the plot is based on only 42 observations. So, on
balance overall, for us (the module team) Figure 21 wouldn’t rule out the
independence assumption.
Solution to Activity 22
Most of the points in the normal probability plot lie roughly on a straight
line, so the assumption of normality seems plausible.
Solution to Activity 23
(a) Using Equation (6), the predicted height, in metres, for a manna ash
tree with diameter x0 = 0.20 m is
5.05 + (12.27 × 0.20) = 7.504 ≃ 7.50.
(b) The diameters of trees range from about 0.15 m up to 0.35 m, and
young trees with very small diameters will be outside of this range.
So, it may not be appropriate to use the fitted model to predict the
height for such trees. In particular, if the tree had a diameter of 0 m,
then the fitted model would predict its height to be 5.05 m, which
clearly does not make sense!
Solution to Activity 24
The width of the 95% prediction interval at the low end of the range
(when the diameter is 0.15 m) is 10.3 − 3.4 = 6.9, at the middle of the
range (when the diameter is 0.25 m) is 11.4 − 4.8 = 6.6, and at the high
end of the range (when the diameter is 0.35 m) is 12.8 − 5.9 = 6.9. So, the
widths are the same at the low and high ends of the range, but slightly
narrower in the middle of the range.
Unit 2
Multiple linear regression
Introduction
Section 4 of Unit 1 reviewed regression in its simplest form – namely,
simple linear regression. This is a technique for modelling a linear
relationship between two variables, where one is a response variable and
the other is an explanatory variable which helps to ‘explain’ the variation
in the response.
The response variable is in fact usually affected by more than a single
explanatory variable. For example, you saw in Notebook activity 1.20 (in
Unit 1) that the strength score of football players is affected by their
weight. However, you can also think of the players’ heights or their skills
as other explanatory variables that could ‘explain’ the variation in their
strength, and it is possible that the model would be improved if these
other explanatory variables were considered too. Regression with more
than one explanatory variable is called multiple linear regression or,
more simply, multiple regression (so called because there are ‘multiple’
explanatory variables). Multiple regression is a very common tool in
statistical data analysis.
This unit explores the basic properties and uses of multiple linear
regression. The unit starts in Section 1 with a formal definition of the
multiple linear regression model as an extension of the simple linear
regression model. Using the model for prediction is discussed in Section 2.
Section 3 considers how to assess how good your fitted model is, while
Section 4 discusses the use of transformations of variables in multiple
regression to address problems with the model. Working with more than
one explanatory variable raises the question of which of the explanatory
variables should be included in the model; methods for choosing the
explanatory variables are discussed in Section 5.
The following route map shows how the sections connect to each other.
[Route map]
• Section 1: The multiple linear regression model
• Section 2: Prediction in multiple regression
• Section 3: Diagnostics
• Section 4: Transformations in multiple regression
• Section 5: Choosing explanatory variables
1 The multiple linear regression model
Figure 1  Residuals from strength ∼ weight, plotted against weight (lb)
(b) The same residuals for this fitted model (with weight as the
explanatory variable) are now plotted against height in Figure 2.
Based on this plot, has the fitted model explained all of the variation
in strength?
Figure 2  Residuals from strength ∼ weight, plotted against height (in)
Figure 3  Scatterplot of strength (strength score) against height (in)
(b) The fitted equation for the model is
strength = −23.146 + 1.318 height.
Interpret the value of the regression coefficient of height.
(c) The p-value associated with the regression coefficient of height is less
than 0.001. What does this tell you about the significance of height
as an explanatory variable?
(d) A plot of the residuals and height is given in Figure 4. Based on this
plot, comment on how well the model fits.
(e) A plot of the residuals and weight is given in Figure 5. Based on this
plot, after fitting the model with height as the explanatory variable,
does there seem to be any remaining variation in strength which
looks to be associated with weight?
Figure 4 Residuals from strength ∼ height, plotted against height
[Figure 5 Residuals from strength ∼ height, plotted against weight]
The ability to mark other players is one of the crucial skills in football. It
is sometimes thought that the footballers’ scores of marking ability could
explain the differences in their strength scores. In this activity, you will
investigate a multiple regression model for footballers’ strength scores as a
response variable, with their scores of marking ability as an extra potential
explanatory variable in addition to their weight and height. (In football, it
usually takes only one player to mark another.) So, we will consider the model
strength ∼ weight + height + marking.
The resulting fitted equation is given as
strength = −28.305 + 0.273 weight + 0.681 height + 0.085 marking.
(a) Interpret the values of the regression coefficients.
(b) Compare the estimated value and interpretation of the regression
coefficient of weight obtained from the model here (when using three
explanatory variables) to its corresponding estimated value and
interpretation of the regression coefficient obtained from the model
strength ∼ weight + height
considered in Example 1.
[Figure: probability density functions of F distributions for various degrees of freedom, including F(2, 2), F(5, 10), F(10, 2), F(10, 10) and F(20, 20)]
The test statistic for the F -test is often called the F -statistic, and the
p-value associated with the F -statistic is obtained by calculating
probabilities from F (ν1 , ν2 ), where
ν1 = the number of explanatory variables in the model = q,
ν2 = n − (the number of parameters in the model) = n − (q + 1).
Notice that the number of parameters in the model is q + 1, since there are
q regression coefficients (β1 , β2 , . . . , βq ), and one intercept parameter α.
We will discuss using the F -test in a multiple regression model with two
explanatory variables next in Example 3.
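In R, the F-statistic and its p-value are reported in the final line of the summary() output for a fitted model. A minimal sketch, assuming the FIFA 19 data are in a data frame called fifa19 (an illustrative name, not necessarily the one used in the module notebooks):

    # Fit the multiple regression model strength ~ weight + height
    fit <- lm(strength ~ weight + height, data = fifa19)

    # The final line of the summary reports the F-statistic together
    # with its degrees of freedom, nu1 = q and nu2 = n - (q + 1)
    summary(fit)

    # The p-value can also be reproduced directly as an upper-tail
    # probability from the F(nu1, nu2) distribution
    fstat <- summary(fit)$fstatistic
    pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)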
t-value = \frac{\hat{\beta}_j}{\text{standard error of } \hat{\beta}_j},
and the t-value follows a t(n − (q + 1)) distribution.
In the next activity, we will look at these individual tests for the fitted
multiple regression model of footballers’ strength scores considered in
Example 3.
The fitted equation of footballers’ strength scores with the two explanatory
variables weight and height was given in Example 1 as
strength = −10.953 + 0.252 weight + 0.558 height.
In Example 3, we tested the hypotheses
H0 : β1 = β2 = 0,
H1 : at least one of the two coefficients differs from zero,
where β1 and β2 are the regression coefficients of weight and height,
respectively. The resulting p-value was very small, and so we concluded
that at least one of the regression coefficients is different from zero.
We now wish to investigate which of the regression coefficients differs from
zero. Is it both β1 and β2 , or just one of them?
To investigate, we will test the two sets of hypotheses
H0: β1 = 0, H1: β1 ≠ 0 (assuming β2 = β̂2),
H0: β2 = 0, H1: β2 ≠ 0 (assuming β1 = β̂1).
(a) Given that the standard error of β̂1 is 0.0535 (to three significant
figures), calculate the test statistic for testing the hypotheses
H0: β1 = 0, H1: β1 ≠ 0 (assuming β2 = β̂2).
(b) Given that the standard error of β̂2 is 0.262 (to three significant
figures), calculate the test statistic for testing the hypotheses
H0: β2 = 0, H1: β2 ≠ 0 (assuming β1 = β̂1).
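If you want to check calculations like these in R, the coefficient table produced by summary() contains the estimates, their standard errors and the resulting t-values. A sketch, again assuming the illustrative data frame name fifa19:

    # Coefficient table: estimates, standard errors, t-values, p-values
    fit <- lm(strength ~ weight + height, data = fifa19)
    coef(summary(fit))

    # Each t-value is just estimate / standard error; for example,
    # for weight (row 2 of the table, after the intercept):
    est <- coef(summary(fit))[2, "Estimate"]
    se  <- coef(summary(fit))[2, "Std. Error"]
    est / se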
Notice that in the multiple regression context, the conclusion of the pair of
tests you performed in Activity 5 can be expressed as follows. There is
evidence to suggest that each of the two explanatory variables, weight and
height, influences the footballers’ strength scores, given the presence of
the other explanatory variable in the model. This should not be confused
with the two individual tests for the regression coefficients in the simple
linear regression models discussed in Activities 1 and 2. A test for a
regression coefficient in simple linear regression tests the individual
influence of each explanatory variable on the response variable. But the
test considered in Activity 5 tests the significance of each explanatory
variable when both of them are included in the model.
The regression coefficient of weight turns out to be highly significant in
both the simple linear regression model and the multiple regression model.
But the significance of the regression coefficient of height is different
between the two models. In the simple linear regression model, there is
strong evidence that the coefficient of height is not zero (the p-value was
less than 0.001). However, in the multiple regression model, when weight
is also in the model, the corresponding evidence for the coefficient of
height is not so strong (the p-value was 0.036).
Figure 7 Visual summary of the testing procedures for multiple regression models
It is worth mentioning that, although the F -test may not seem as useful as
the individual t-tests at this stage, the F -test will become particularly
useful as we move through the module.
In the next activity, you will test the significance of the regression
coefficients of a multiple regression model that contains three explanatory
variables.
level for 380 areas and tried to relate their voting behaviour to their
fundamental socio-economic features.
The Brexit dataset (brexit)
The data considered in this dataset are a subset of the data collected
in the study, comprising data from 237 of the areas. The dataset
contains data on the following variables for each area:
• leave: percentage who voted ‘Leave’ in the referendum
• age: proportion who are in the age group 30 to 44 years old (based
on the 2001 census)
• income: mean hourly pay in £ (based on the 2005 Annual Survey
of Hours and Earnings)
• migrant: proportion who are EU migrants (based on the 2001
census)
• satisfaction: mean life satisfaction score (from the Annual
Population Survey 2015)
• noQual: proportion with no educational qualification (based on the
2001 census).
[Map of the UK showing the results from the referendum; key: majority leave, majority remain (BBC, 2016)]
In order to try to understand voting behaviour in the referendum, we
can take leave to be the response variable, and use the other
variables as potential explanatory variables.
The data for the first five areas from the Brexit dataset are shown in
Table 1.
Table 1 First five areas from brexit
In Activity 7, we will use the Brexit dataset to fit two simple linear
regression models and one multiple regression model. You will see that the
significance of one explanatory variable can markedly change, depending
on whether or not another variable is added to the model.
(i) Write down the fitted equation of the estimated model and
interpret the estimated regression coefficients.
(ii) Do the data suggest that both age and income together influence
the percentage of ‘Leave’ voters?
(c) Do you think there is any contradiction among the results you
obtained in parts (a)(ii) and (b)(ii)? Discuss why or why not.
To round off this section, we will now use R for multiple regression.
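The details are left for the notebook activities, but as a rough sketch (assuming the Brexit data have been read into a data frame called brexit; the file name below is illustrative), the models of Activity 7 would be fitted along the following lines:

    # Read the data (file name assumed for illustration)
    brexit <- read.csv("brexit.csv")

    # Simple linear regressions, one explanatory variable each
    fit_age    <- lm(leave ~ age, data = brexit)
    fit_income <- lm(leave ~ income, data = brexit)

    # Multiple regression with both explanatory variables
    fit_both <- lm(leave ~ age + income, data = brexit)
    summary(fit_both)   # coefficients, t-tests and the F-test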
2 Prediction in multiple regression
You will be using the roller coasters dataset in the next activity to predict
a roller coaster’s speed given its height and length.
The model
speed ∼ height + length
was fitted to data from the roller coasters dataset. The results from fitting
the model are summarised in Table 4.
Table 4 Coefficients for speed ∼ height + length
(a) Write down the estimated regression equation of the fitted model.
(b) Interpret the values of the regression coefficients in the fitted equation
from part (a).
(c) According to the fitted regression equation obtained in part (a), what
is the predicted value of speed for a new roller coaster that is 200 ft
high and 4000 ft long?
In the next activity, you will use a multiple regression model with three
explanatory variables to predict the response variable.
Use this fitted equation to predict the strength score of a newly registered
footballer who weighs 170 lb, has a height of 72 in and has obtained a
marking score of 65.
• Even if we did know the true regression equation, we still wouldn’t know
the value of Y0 with certainty, since we don’t know the value of the
random term (W0 ) for Y0 .
One way that these uncertainties can be taken into account is to calculate
a prediction interval for the response variable.
Prediction intervals for multiple regression are used and interpreted in
exactly the same way as they were in simple linear regression, as
summarised in Box 6.
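In R, prediction intervals are obtained from predict() with the argument interval = "prediction". A sketch, assuming the model from Activity 9 has been fitted and stored as fit_coaster (an illustrative name):

    # New roller coaster: 200 ft high and 4000 ft long
    new_coaster <- data.frame(height = 200, length = 4000)

    # Point prediction together with a 95% prediction interval
    predict(fit_coaster, newdata = new_coaster,
            interval = "prediction", level = 0.95)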
3 Diagnostics
Once we have fitted a multiple regression model, we need to assess how
good the fitted model is. The techniques used for doing this are often
referred to as diagnostics, or diagnostic checks, because they help us
diagnose what might be wrong with a model or with particular data points.
The multiple regression model makes several assumptions that need to
hold so that the conclusions we obtain from the analysis of the model are
accurate and valid. It is therefore always important to check that the
assumptions seem reasonable for a particular dataset. Checking the
assumptions for the multiple regression model is the focus of
Subsection 3.1.
In addition to checking the model assumptions, there are other diagnostic
tools that can be used to check the adequacy of the model. We will
introduce two of these in this section: leverage in Subsection 3.2 and
Cook’s distance in Subsection 3.3.
Finally, in Subsection 3.4, we’ll use all of these diagnostic tools, by using R
to produce diagnostic plots that incorporate them.
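As a preview of what Subsection 3.4 will cover, the following sketch shows the standard R commands (the data frame name is assumed, as before); plot() applied to a fitted lm object produces the diagnostic plots used throughout this section:

    fit <- lm(strength ~ weight + height + marking, data = fifa19)

    plot(fit, which = 1)  # residuals versus fitted values
    plot(fit, which = 2)  # normal probability (Q-Q) plot
    plot(fit, which = 4)  # Cook's distance plot
    plot(fit, which = 5)  # residuals versus leverage plot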
As for simple linear regression, we can check the linearity, zero mean and
constant variance assumptions for the multiple regression model using a
residual plot of the residuals and the fitted values. The residual plots used
in multiple regression are the same as those used for simple linear
regression, as described in Subsection 5.1 of Unit 1, and are also
interpreted in the same way, as summarised in Box 8.
Figure 8 The residual plot (a) and normal probability plot (b) for strength ∼ weight + height + marking
(a) Does the residual plot in Figure 8(a) suggest that the assumptions of
linearity, zero mean and constant variance are satisfied?
(b) Does the normal probability plot in Figure 8(b) suggest that the
normality assumption is satisfied?
(c) Based on your answers to parts (a) and (b), comment on the
appropriateness of the multiple regression model.
Figure 9 The residual plot (a) and normal probability plot (b) for speed ∼ height + length
(a) Use the residual plot in Figure 9(a) to discuss whether the
assumptions of linearity, zero mean and constant variance seem to be
satisfied.
(b) Use the normal probability plot in Figure 9(b) to discuss whether the
normality assumption seems to be satisfied.
(c) Considering your answers to parts (a) and (b), does this multiple
regression model seem appropriate for the roller coasters dataset?
3.2 Leverage
In this subsection, we will introduce another diagnostic tool for multiple
regression, known as leverage. Leverage identifies particular data points
which have the potential to have a major impact on the regression model.
• Data points with high leverage have the potential to substantially alter
the fitted regression model if that particular data point were changed or
removed from the dataset.
• On the other hand, if a data point with low leverage were changed or
removed from the dataset, then it would make little difference to the
fitted regression model.
The word ‘potential’ is important here: leverage does not tell us whether a
data point has had a major impact on the regression model, it just tells us
whether a data point could have a major impact on the regression model.
It is easiest to get an understanding of what we mean by high and low
leverage by looking at a specific example.
In Activity 2, we fitted the simple linear regression model
strength ∼ height
using data from the FIFA 19 dataset. The scatterplot of the data together
with the fitted line for this model is given in Figure 10. In Example 6 we
illustrate high and low leverage by considering how the fitted line would
change if a couple of data points change.
Figure 10 Scatterplot of strength and height, together with the fitted line
for strength ∼ height
Figure 11 Scatterplot of strength and height, together with two
fitted lines for strength ∼ height
Notice that, by changing the response for just that single far-left
observation, the fitted line has altered and has ‘tilted’ towards the
new value.
Let’s have a look at what happens to the fitted regression line if
instead of changing the response of the far-left observation, we change
the response for one of the more central observations.
One of the footballers has a value of 72 for height and has a strength
score of 77 (which is only 1 larger than the strength score of the
far-left observation). Let’s now try changing the value for this
footballer’s strength score from 77 to 55 (so that the change that
we’re making to the response is comparable to the change that we
made for the far-left observation).
Figure 12 shows a scatterplot of the data points, this time with the
changed response for one of the data points for which height is 72.
(The new data point is the lowest data point in the plot. The original
data point is also included on the plot so that you can see how the
data point has changed.) The fitted line using the data points in the
scatterplot (which uses the different response for one of the
observations for which height is 72) is shown on the plot as a solid
line, whereas the fitted line using the original data (that is, the fitted
line shown in Figure 10) is shown on the plot as a dashed line.
Figure 12 Scatterplot of strength and height, together with two
fitted lines for strength ∼ height
Comparing Figures 11 and 12, a comparable change to the response has
had a substantial effect on the fitted line for the far-left point, but very
little effect on the fitted line for the central point. As such, the data point
on the far left has high leverage, whereas the central data point has low
leverage.
An easy way to identify which data points have high leverage in a dataset
is by using a residuals versus leverage plot, which plots the
standardised residuals – that is, residuals which have been scaled so that
their standard deviation is 1 – for each of the observations against their
leverage values. You will see a residuals versus leverage plot in Example 7.
Figure 13 The residual versus leverage plot for strength ∼ height
From Figure 13, we can see that in a residuals versus leverage plot the
standardised residuals are plotted on the vertical axis and the leverage
values are plotted on the horizontal axis. The zero residual line is
included so that it is easy to see at a glance which residuals are
relatively small and which are relatively large.
In the residuals versus leverage plot given in Figure 13, there is one
data point (numbered 62) whose leverage stands out as being quite a
bit higher than the leverage of the other data points. This data point
therefore has the most potential to affect the regression model.
The observation with the next highest leverage, which also stands
apart from the rest of the data points, is the data point numbered 15.
So, this data point also has the potential to affect the regression
model.
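In R, the leverage values themselves can be extracted with hatvalues(), and plot() with which = 5 draws the residuals versus leverage plot. A sketch, with the data frame name assumed as before; the 'three times the average leverage' cut-off used here is just one common rule of thumb:

    fit <- lm(strength ~ height, data = fifa19)

    # Leverage (hat) values, one per observation; their average
    # is always (q + 1)/n
    h <- hatvalues(fit)

    # Observations whose leverage stands out from the rest
    which(h > 3 * mean(h))

    # Residuals versus leverage plot
    plot(fit, which = 5)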
Even if a data point has very high leverage, meaning that it has the
potential to affect the regression model, it doesn’t necessarily mean that
the data point actually has affected the regression model.
If the observed response for a data point with high leverage is not
consistent with the fitted model (so the standardised residual is large),
then it is likely that the refitted model will be different to the original
fitted model, since the refitted model will no longer be ‘pulled’ towards the
inconsistent data point. (We saw this ‘pulling’ effect in Example 6 for the
far-left data point in Figure 11.) So, if a data point has both high leverage
and a large residual, then it is likely that that particular data point has
affected the regression model.
On the other hand, if the observed response for a data point with high
leverage is consistent with the fitted model (so the standardised residual is
small), then it is unlikely that the refitted model will be very different,
since the data point won’t really have ‘pulled’ the original fitted model
away from where the refitted line is. So, if a data point has high leverage
but a small residual, then it is unlikely that that particular data point has
affected the regression model, although it could do if the leverage was high
enough.
Since the size of the residuals plays an important role in determining
whether a data point actually has affected the regression model, it is also
possible for a data point without high leverage to have affected the model
if the residual is large enough.
Data points which have affected the regression model are known as
influential points, as summarised in Box 12.
In the next two activities, we will use residuals versus leverage plots to
identify data points which are likely to be influential points.
[Figure 14 The residuals versus leverage plot for speed ∼ height + length, with points 1, 2, 3, 4, 43, 44 and 77 labelled]
Points with both high leverage and a large standardised residual are
likely to be highly influential to the model, in the sense that they can
alter the results of the regression model if they are changed. These
influential points appear towards the lower-right or upper-right
corners of the plot.
Box 14 Cook's (squared) distance using standardised residuals

For the ith data point, i = 1, 2, . . . , n, Cook's (squared) distance, Di², is given by

D_i^2 = \frac{1}{q+1} \, (r_i')^2 \, \frac{h_i}{1 - h_i},

where q is the number of explanatory variables in the model (so that
q + 1 is the number of parameters in the model), ri′ is the standardised
residual for the ith data point and hi is its leverage.

(We're not 'cooking the analysis'! These ideas are named after their
originator, Professor R.D. Cook.)
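In R, cooks.distance() returns these values directly. As a sketch (data frame name assumed), they can also be reproduced from the Box 14 formula using rstandard() and hatvalues():

    fit <- lm(strength ~ height, data = fifa19)

    d <- cooks.distance(fit)      # Cook's distances, one per data point

    # Reproduce them from the Box 14 formula
    r <- rstandard(fit)           # standardised residuals
    h <- hatvalues(fit)           # leverages
    q <- length(coef(fit)) - 1    # number of explanatory variables
    d_by_hand <- r^2 * h / ((q + 1) * (1 - h))

    all.equal(d, d_by_hand)       # TRUE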
The question now is: how large does the Cook’s distance for a specific data
point need to be in order for it to be considered an influential point?
Unfortunately, there is no definitive answer to this question as there is no
standard rule of thumb. A threshold of 0.5 is sometimes considered, but a
data point with a Cook’s distance of less than 0.5 could also be considered
to be influential if its Cook’s distance is large in comparison to the other
Cook’s distance values.
Box 15 introduces a new diagnostic plot called the Cook’s distance plot.
This plot gives a visual representation of the Cook’s distance values for the
data points in the dataset, so that it is easier to identify the influential
points. The Cook’s distance plot is summarised in Box 15.
Figure 15 The Cook’s distance plot of strength ∼ height
The most influential data points in a dataset will have the tallest bars
in a Cook’s distance plot. Therefore, for this dataset, the three most
influential data points are those numbered 15, 62 and 92. These are,
in fact, three of the data points identified as possible influential points
from the residuals versus leverage plot in Activity 16.
Notice that the other two data points with similar leverage to data
point 92, but with small residuals, do not appear to have high Cook's
distances.
We mentioned that a value of 0.5 is sometimes used as a threshold
value for deciding if the Cook’s distance is large enough to conclude
that a data point is an influential point. It is clear from the Cook’s
distance plot in Figure 15 that none of the points for this dataset has
a Cook’s distance of 0.5 or above. However, because the highest
Cook’s distance values in the Cook’s distance plot are so high in
comparison to the Cook’s distance values for the other data points in
the sample, these data points can be considered as being influential,
even though they do not exceed the threshold of 0.5.
In multiple regression, a Cook’s distance plot is the same as that used for
simple linear regression and should be interpreted in the same way. We
will use the Cook’s distance plot to identify the influential points in a
multiple regression model next in Activity 18.
Figure 16 A Cook’s distance plot for speed ∼ height + length
(a) Based on the Cook's distance plot, which data points appear to be the
most influential in the data? Are they the same points that were
identified in Activity 17?
(b) How do you interpret the fact that the data point with the highest
leverage in Figure 14 does not appear as one of the three values with
the highest Cook’s distance in Figure 16?
(a) The residual versus leverage plot of this multiple regression model is
given in Figure 17. Explain why the data points numbered 96, 124
and 228 might have higher Cook’s distances than other points. Your
explanation should be in terms of the leverage and/or residuals of
these points.
[Figure 17 The residuals versus leverage plot for the multiple regression model of the Brexit dataset, with points 96, 124 and 228 labelled]
(b) The Cook’s distance plot for this multiple regression model is given in
Figure 18. Based on this plot, which data point seems to be the most
influential point in the data?
Figure 18 A Cook’s distance plot for the multiple regression model of the
Brexit dataset
Having identified influential points, we are then left with the problem of
what to do about them. Usually, statisticians either present analyses both
with and without the influential points, or omit the influential points from
the analysis and explain why they are thought to be invalid for inclusion.
The fact that they are influential is not by itself a valid reason for
excluding points.
Finally, it is worth noting that the residual versus leverage plot and the
Cook’s distance plot are not the only available plots of their kind that may
be obtained from a statistical package. For example, another useful plot
for detecting influential points that you may meet elsewhere is the Cook’s
distance versus leverage plot, which plots the Cook’s distance on the
vertical axis and the leverage on the horizontal axis. We won’t, however,
be discussing this plot further in this unit.
We will finish this section by carrying out diagnostic checks in R.
4 Transformations in multiple regression
There are times when we would like to use a multiple regression model but
our data are not suitable for such a model – either because the model
assumptions for multiple regression are unrealistic for our data, or the
relationship between the response and the explanatory variables is not
linear. In cases such as these, we can sometimes transform one or more of
the variables so that a model using the transformed data is suitable for
modelling using multiple regression.
We will start this section by considering how transformations can be useful
in multiple regression in Subsection 4.1, before discussing how to find
suitable transformations in Subsection 4.2. We’ll then finish the section by
using transformations in R in Subsection 4.3.
[Figure: for the model y ∼ x1 + x2 + · · · + xj + · · · + xq, transform the response y to strengthen the model assumptions; transform an explanatory variable xj to enhance linearity between xj and the response]
So, suppose that we’ve decided that we want to transform the response
and/or one (or more) of the explanatory variables. The next question is:
which transformation(s) should we try? We will address this question in
the next subsection.
[Figure: the ladder of powers, showing transformations reached by going down or up the ladder]
Now, the variable make is in fact a categorical variable, which means that
the regression model needs to treat this explanatory variable in a different
way to the two (continuous) explanatory variables mileage and liter.
You won’t be learning about how to model with categorical explanatory
variables until Units 3 and 4, but you don’t actually need to know about
modelling with categorical explanatory variables for this example.
Therefore, for now, just think of make as ‘one of the three explanatory
variables’.
So, we’re interested in fitting the model
price ∼ mileage + liter + make.
However, the residual plot for this fitted model, shown in Figure 21,
indicates that there is a problem with using this model. The points in the
residual plot show an increase in spread with increasing fitted values,
indicating that the constant variance assumption may not be valid, and
also there is a hint of curvature in the plot, suggesting that the zero mean
and linearity assumptions may also be questionable. Therefore, as
suggested in Box 16, since the problem seems to be that one of the model
assumptions may not be valid, we should consider transforming the
response variable, price.
[Figure 21 Residual plot for price ∼ mileage + liter + make]
[Figure 22 Histogram of price (US$)]
Going down the ladder of powers suggests that the first transformation we
should try is √price; a histogram of the transformed values √price is
shown in Figure 23.
Figure 23 Histogram of √price
The histogram of √price still looks rather skew, and so it's likely that the
assumptions may also not be valid for the model
√price ∼ mileage + liter + make.
Indeed this does turn out to be the case, since the points in the residual
plot of this fitted model, given in Figure 24, indicate that the model
assumptions still look questionable.
[Figure 24 Residual plot for √price ∼ mileage + liter + make]
So, let’s go further down the ladder of powers and try the next
transformation: log(price).
A histogram of the transformed values log(price) is shown in Figure 25.
This time, the histogram for the transformed response looks much more
symmetric, so this transformation looks more promising!
[Figure 25 Histogram of log(price)]
[Figure 26 Residual plot for log(price) ∼ mileage + liter + make]
This time the points in the residual plot do seem to be randomly scattered
about the zero residual line, indicating that the assumptions of constant
variance, zero mean and linearity now seem to be reasonable. So, by using
the transformed variable log(price) as our response instead of the original
response price, we now have data for which we can use multiple regression!
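Transformations like these can be applied directly inside an R model formula. A sketch, assuming the car price data are in a data frame called cars_data (an illustrative name):

    # Response transformed down the ladder of powers
    fit_sqrt <- lm(sqrt(price) ~ mileage + liter + make, data = cars_data)
    fit_log  <- lm(log(price) ~ mileage + liter + make, data = cars_data)

    # Recheck the assumptions after transforming
    plot(fit_log, which = 1)   # residual plot for the log model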
Popularity of films
A research project was conducted to determine whether the
conventional features of a film (such as its budget and the number of
screens on which it is initially launched) are more important in
predicting the popularity of the film than the social media features
(such as the number of ‘likes’ and ‘dislikes’ the film trailer receives on
the online video sharing and social media platform YouTube). In
particular, the researchers were interested in predicting the gross income
of a film using a set of explanatory variables. (How will online streaming
affect cinemas in the future?)
The films dataset (films)
There are 169 films in this dataset, which is a subset of the data
collected by the researchers for 231 films produced between 2014
and 2015.
The response variable is:
• income: the gross income (in millions of US dollars) for each film.
The explanatory variables of interest are:
• budget: the budget (in millions of US dollars) for the film
• rating: the rating of the film on a scale from 1 to 10, collected
from the films website IMDb
• screens: the number of screens (in thousands) on which the film
was initially launched in the USA
• views: the number of views (in millions) for the film trailer on
YouTube
• likes: the number of likes (in thousands) for the film trailer on
YouTube
• dislikes: the number of dislikes (in thousands) for the film trailer
on YouTube
• comments: the number of comments (in thousands) on the film
trailer on YouTube
• followers: the aggregate number of followers (in millions) on
Twitter for the top three actors in the film.
The data for the first five films from this dataset, ordered by income,
are shown in Table 8.
Table 8 First five films (ordered by income) from films
comments followers
0.230 2.815
0.001 10.280
0.864 4.520
0.250 1.198
0.464 0.006
[Figure 27 Residual plot for income ∼ budget + screens]
[Figure 28 Histogram of income (millions of US$)]
(c) The histograms of √income and log(income) are given in Figure 29.
Based on these two histograms, which transformation would you
recommend?
Figure 29 Histograms of (a) √income, and (b) log(income)
Figure 30 Residual plot for √income ∼ budget + screens
Figure 31 Scatterplots of (a) √income and budget, and (b) √income and screens
(a) From the plots in Figure 31, explain why it is reasonable to decide to
transform the explanatory variable screens, but not the explanatory
variable budget.
(b) The pattern of points in the scatterplot shown in Figure 31(b) can be
interpreted as suggesting that the relationship between √income and
screens may be quadratic or cubic. Therefore, scatterplots of the
transformed response √income and each of the two transformed
variables, screens² and screens³, are given in Figure 32.
Given these scatterplots, explain why it is reasonable to transform
screens to screens³.
Figure 32 Scatterplots of (a) √income and screens², and (b) √income and screens³
(a) For the fitted model using the transformed response and explanatory
variable,
√income ∼ budget + screens³,
the residual plot is shown in Figure 33.
Do you think this regression model with the transformed variables has
resolved the problem with the non-constant variance (apparent in
Figure 27) and the problem with the assumption of linearity (raised
by the hint of a downward trend in Figure 30)?
Figure 33 Residual plot for √income ∼ budget + screens³
(b) The suggested model with transformations has the following fitted
equation:
√income = 3.01 + 0.025 budget + 0.107 screens³.
Use the fitted equation to predict the income of a new film that has a
production budget of 50 million US dollars and is to be initially
launched in the USA on 2500 screens.
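A sketch of the same kind of prediction in R (data frame name assumed), remembering that screens is measured in thousands, so 2500 screens enters as 2.5, and that the model predicts √income, so the prediction must be squared to return to the income scale:

    fit <- lm(sqrt(income) ~ budget + I(screens^3), data = films)

    # Budget of 50 million US dollars; 2500 screens = 2.5 thousand
    new_film <- data.frame(budget = 50, screens = 2.5)

    sqrt_income_hat <- predict(fit, newdata = new_film)
    sqrt_income_hat^2   # predicted income, in millions of US dollars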
5 Choosing explanatory variables
So, as we saw in Example 9, the fit of a model can’t get any worse by
adding extra explanatory variables; the fit can either improve or, at worst,
stay the same. However, fitting the data superbly is not the be-all and
end-all. After all, the data themselves provide a perfect model for the data!
Cutting down the number of explanatory variables is desirable for two
reasons.
• The first reason concerns the purpose of modelling. By modelling, we
are trying to understand the situation in simplified terms by identifying
the main sources of influence on the response variable.
• The second reason concerns prediction. Just because we can fit the data
we have, with all its particular idiosyncrasies, does not mean that the
resulting model will necessarily be a good fit for further data that arise
(under the same situation). We return to this issue in Unit 5. (After all,
the surface of the Earth is itself a perfect model – a map – of the Earth!)
So, it is preferable to use just enough explanatory variables to capture the
main features of the data. Doing this will also often lead to improved
prediction as well as a model that is easier to interpret. Underlying these
ideas is the principle of parsimony, which is given in Box 18.
In this section, we will introduce some tools and techniques for selecting
the explanatory variables that enhance the fit of a multiple regression
model, while keeping it as simple as possible according to the principle of
parsimony.
When choosing the explanatory variables, problems can occur if two of the
explanatory variables are highly correlated. As a result, it is important
that any such pairs of explanatory variables are identified. To do this,
Subsection 5.1 discusses two simple tools that can be used to help explore
the relationships, not only between each explanatory variable and the
response, but also between the explanatory variables themselves.
In deciding which explanatory variables to include in a model, we need to
be able to compare how well models with different combinations of the
explanatory variables fit the data. Being able to measure a model’s fit is
therefore crucial. We will introduce some such measures of fit in
Subsection 5.2 and then, in Subsection 5.3, we’ll see how one of these
measures is used in a widely used procedure for selecting explanatory
variables, called stepwise regression.
Finally, in Subsection 5.4, we will use all of these tools and techniques for
choosing explanatory variables in R.
For the scatterplot matrices that we will use in the context of multiple
regression in this module:
• each row, apart from the last row, will include the scatterplots of one
explanatory variable against the other explanatory variables
• the last row will include scatterplots of the response variable versus each
explanatory variable.
You will see a scatterplot matrix in action next in Example 10.
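In R, a scatterplot matrix is produced by pairs(). A sketch for the films dataset (data frame name assumed), ordering the columns so that the response appears in the last row, following the convention described above:

    # Put the response last so the bottom row shows income against
    # each explanatory variable
    vars <- c("budget", "rating", "screens", "views",
              "likes", "dislikes", "comments", "followers", "income")

    # Show the lower triangle only, as in the printed figures
    pairs(films[, vars], upper.panel = NULL)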
[Figure 34 A scatterplot matrix of the data in the films dataset]
Figure 35 A scatterplot matrix of the data in the films dataset with the
response transformed
The individual scatterplots of √income and each individual
explanatory variable still fail to indicate any strong relationships, with
the possible exceptions of budget, rating and screens, as before.
This time, however, the relationships with budget and rating now
seem more linear, so let’s stick with this transformation of income –
at least for now!
For the remainder of the scatterplot matrix, we need to scan the plots to
look for any pairs of explanatory variables which seem to be related. We
will do this in Activity 24.
In the correlation matrix for the films dataset, the correlations are given in
essentially the same matrix format as the scatterplot matrix, except that
the response variable income has been left out of the correlation matrix.
Notice, also, that this correlation matrix is in fact only showing half of the
matrix! This is because the ‘missing’ values are simply mirror images of
the values already there (since the correlation of two variables x1 and x2 is
the same as the correlation of x2 and x1 ). So, like the scatterplot matrix,
the missing elements above and to the right of the main diagonal do not
provide any more information regarding the correlations between the
variables.
So, once we have a correlation matrix, what are we looking for? Well, as
you will find in Activity 25, we need to look for pairs of explanatory
variables which have high correlations. In this context, we’ll use the
following rule of thumb:
• a correlation with an absolute value ≥ 0.7 is considered to be high.
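In R, the correlation matrix comes from cor() applied to the explanatory variables. A sketch (data frame name assumed), including a quick way of flagging the pairs that breach the rule of thumb:

    expl <- films[, c("budget", "rating", "screens", "views",
                      "likes", "dislikes", "comments", "followers")]

    # Correlations to two decimal places, lower triangle only
    cmat <- round(cor(expl), 2)
    cmat[upper.tri(cmat)] <- NA
    cmat

    # Pairs of distinct variables with |correlation| >= 0.7
    which(abs(cmat) >= 0.7 & row(cmat) != col(cmat), arr.ind = TRUE)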
Using the correlation matrix for the films dataset, which pairs of
explanatory variables are highly correlated?
Figure 36 Diagnostic plots for √income with all eight explanatory variables:
(a) residuals versus fitted values, (b) normal probability plot, (c) residuals
versus leverage, (d) Cook's distance plot
So, let’s summarise what our exploration of the films dataset has so far
suggested regarding the choice of explanatory variables for modelling
income:
• The bottom row of the scatterplot matrix suggested that income might
depend on budget, rating and screens.
• The scatterplot matrix and the correlation matrix flagged up that the
inclusion in the model of all four explanatory variables comments,
dislikes, likes and views might be unnecessary (and also possibly
both explanatory variables screens and budget might be unnecessary).
• The fitted multiple regression model suggested that all of the
explanatory variables should be included in the model, except perhaps
for followers.
But can we do something more formal to select explanatory variables?
Indeed we can – and in very many ways! However, in this module we will
only consider one of these methods, known as stepwise regression.
Before introducing stepwise regression (in Subsection 5.3), we will first
introduce some ways to measure how well a model fits.
Figure 37 Boxplot of strength
[Figure 38 Scatterplot of the strength scores]
[Figure 39 Scatterplot of the strength scores, annotated with the overall scatter of strength score and the scatter of strength score about the fitted line]
So, as we’ve seen in Example 12, a model can help to explain the variation
in the response variable. The measure of model fit being considered in this
subsection is a measure of the percentage of variation in the response
variable that can be accounted for – or explained – by the model.
In order to make it easier to understand the ideas being presented here, we
will look at a subset of just 10 of the observations from the FIFA 19
dataset (taking every 10th observation in the dataset only). A scatterplot
of strength and weight for this small dataset is given in Figure 40.
Figure 40 Scatterplot of strength and weight for a subset of
10 observations from the FIFA 19 dataset
where ȳ is the sample mean of the responses. The numerator of the sample
variance is often referred to as the total sum of squares, which we'll
abbreviate to TSS, so

TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2.
Figure 41 Scatterplot of strength and weight: the dotted vertical lines
show the distances used to calculate TSS
Now, it can be shown (although we won't do so here) that the total sum of
squares can be partitioned into two separate sums of squares:
• a sum of squares associated with the variation that can be explained by
our model – the explained sum of squares, which we'll abbreviate to
ESS
• a sum of squares associated with the variation that cannot be explained
by our model – the residual sum of squares, which we'll abbreviate to
RSS.
Let's look at each of these in turn.
We'll start by considering the ESS – the sum of squares associated with
the variation that can be explained by our model. As with any regression
model, for given value(s) of any explanatory variable(s), the corresponding
response Yi is modelled to be the fitted value ŷi. So, a measure of how
these fitted values ŷ1, ŷ2, . . . , ŷn vary about the overall mean ȳ will give us
a measure of the variation explained by our model, so that

ESS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.
The distances on which the ESS is based (that is, ŷ1 − ȳ, ŷ2 − ȳ, . . . , ŷn − ȳ)
are illustrated for the reduced version of the FIFA 19 dataset in Figure 42.
Figure 42 Scatterplot of strength and weight: the fitted values are shown
as red dots and the dotted vertical lines show the distances used to calculate
ESS
We'll now consider the RSS – the sum of squares associated with the
variation which can't be explained by the model. Now, Yi is modelled to be
the fitted value ŷi using our model. So, a measure of how the actual
response values y1, y2, . . . , yn vary around their fitted values ŷ1, ŷ2, . . . , ŷn
will give us a measure of the variation which still remains after fitting our
model, that is, the residual variation:

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
[Figure 43 Scatterplot of strength and weight: the dotted vertical lines show the distances yi − ŷi used to calculate RSS]
So, putting this all together, we get the result that the TSS can be
partitioned into the ESS and the RSS:
TSS = ESS + RSS.
• The RSS is the residual sum of squares and gives a measure of how
the observed responses y1, y2, . . . , yn vary about the fitted values
ŷ1, ŷ2, . . . , ŷn:

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
Recall that the measure of model fit being considered in this subsection is
a measure of the percentage of variation accounted for by the model. The
values of ESS and TSS provide us with information regarding what
proportion of the total variation is explained by the fitted model, so one
way to measure the percentage of variation accounted for by the model is
by comparing the values of ESS and TSS. One such measure – known as
the R2 statistic (pronounced ‘R squared statistic’), or simply R2 – does
precisely that. The R2 statistic is defined in Box 22.
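Taking the definition R2 = ESS/TSS (equivalently 1 − RSS/TSS), a sketch of the calculation in R for the reduced dataset (data frame name fifa19_small assumed), checked against the value that summary() reports:

    fit <- lm(strength ~ weight, data = fifa19_small)

    y     <- fifa19_small$strength
    y_hat <- fitted(fit)

    TSS <- sum((y - mean(y))^2)
    ESS <- sum((y_hat - mean(y))^2)
    RSS <- sum((y - y_hat)^2)

    all.equal(TSS, ESS + RSS)   # the partition TSS = ESS + RSS
    ESS / TSS                   # R^2, matching summary(fit)$r.squared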
For a model which fits the data well, would the value of R2 be large or
small?
When looking at the sums of squares for the fitted model using the
reduced FIFA 19 dataset in Activity 27, we concluded that the sums of
squares suggested that the model seems to explain the variation in the
response fairly well since the ESS was almost four times the value of the
RSS. Let’s see what the value of R2 is for this model.
You will see both R2 and Ra2 in action next for comparing models in
Example 13.
Model 2
Fitted in Example 1 with the following estimated equation:
strength = −10.953 + 0.252 weight + 0.558 height.
Model 3
Fitted in Activity 4 with the following estimated equation:
strength = − 28.305 + 0.273 weight + 0.681 height
+ 0.085 marking.
The R2 and Ra2 statistics for these models (in percentages) are given
as follows.
• Model 1: R2 = 36.54%, Ra2 = 35.89%.
• Model 2: R2 = 39.36%, Ra2 = 38.11%.
• Model 3: R2 = 48.83%, Ra2 = 47.23%.
As expected, the values of R2 increase from Model 1 to Model 3 as a
result of increasing the number of explanatory variables. Notice also
that, as expected, the value of Ra2 in each model is less than its
corresponding R2 value. The values of Ra2 are also increasing from
Model 1 to Model 3, but since they were adjusted for the number of
explanatory variables in each model, these increases reflect actual
increases in the model fit.
As discussed before, we will be using Ra2 as a measure for the
percentage of variance accounted for.
Comparing Ra2 for the three models, it is clear that adding extra
explanatory variables has improved the fit: the percentage of variance
of strength that is accounted for by the model increased from 35.89%
to 38.11% when height was added to the model, and increased
from 38.11% to 47.23% when marking was also added to the model.
You have just seen in Example 13 that the values of R2 and Ra2 can both
increase by adding more explanatory variables to the regression model. But
this is not always the case. If the additional variable is not actually adding
to the model fit, the value of R2 will still increase, but the value of Ra2 may
decrease to reflect the fact that the model fits better without adding that
extra explanatory variable. You will see this happening in Activity 31.
The R2 and Ra2 statistics for these models are given as follows:
Model 1: R2 = 39.38%, Ra2 = 38.86%.
Model 2: R2 = 74.66%, Ra2 = 74.33%.
Model 3: R2 = 74.67%, Ra2 = 74.23%.
(a) Comment on what the increases or decreases in the R2 and Ra2 values
between the three models mean.
(b) Based on the percentage of variance accounted for, which of these
three models fits the data best? Explain your choice.
We will consider one further measure of model fit in this section; this is
introduced next in Subsection 5.2.2.
In the next activity, you will explore how the AIC can be used in practice
to select the best regression model from the three models given in
Activity 31.
Although the Ra2 and AIC might suggest the same model, be aware that
this is not always the case.
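In R, the AIC of a fitted model is returned by AIC(), so candidate models can be compared side by side. A sketch in which the three model formulas (and the variable inversions) are illustrative assumptions, not taken from Activity 31:

    fit1 <- lm(speed ~ height, data = coasters)
    fit2 <- lm(speed ~ height + length, data = coasters)
    fit3 <- lm(speed ~ height + length + inversions, data = coasters)

    # Smaller AIC indicates the better trade-off of fit and complexity
    AIC(fit1, fit2, fit3)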
The AIC values obtained from each of these Step 1 models are listed
in ascending order in the following table.
Change AIC
Adding budget 376.32
Adding screens 391.77
Adding rating 474.36
Adding followers 487.00
Adding likes 489.69
Adding comments 491.15
Adding dislikes 491.15
Adding views 494.40
None 495.19
The last line, denoted by ‘None’, gives the AIC of the current model,
that is, the model with the intercept only.
Note that all of the eight models have AIC values lower than that of
the null model (495.19). This means that adding any individual
explanatory variable to the intercept will give a better model than
just having the intercept.
The model with the smallest AIC value (with AIC = 376.32) is
obtained by adding budget to the intercept. The best model in Step 1
based on the AIC values is thus the model containing budget: this
represents the outcome of Step 1.
Step 2
In Step 2, we start with the model including the intercept and budget
(with AIC = 376.32), and then we fit seven extra models, each with
one of the remaining explanatory variables added to the intercept and
budget. So, we have the fitted models:
√income = 2.181 + 0.039 budget + 1.187 screens,
√income = −0.674 + 0.054 budget + 0.737 rating,
and so on.
The AIC values obtained from each of these Step 2 models are listed
in ascending order in the following table.
Change AIC
Adding screens 336.54
Adding rating 369.50
Adding likes 371.87
Adding dislikes 375.50
Adding followers 375.76
None 376.32
Adding views 377.49
Adding comments 377.76
There are now five explanatory variables (those in the first five lines of
Step 2) each giving a smaller AIC when added to the model. The
sixth line, denoted by ‘None’, gives the AIC of the model obtained in
Step 1 which includes the intercept and budget only.
The model with smallest AIC (with AIC = 336.54) is that obtained
when screens is added to the model with both the intercept and
budget. This is therefore the best model in Step 2 and will be the
starting point for Step 3.
Step 3
Similarly at Step 3, we fit six models, each with one of the remaining
explanatory variables. That is, we have the fitted models:
√income = −3.627 + 0.033 budget + 1.268 screens + 0.923 rating,
√income = 2.008 + 0.039 budget + 1.134 screens + 0.028 likes,
and so on.
The AIC values obtained from each of these Step 3 models are listed
in ascending order in the following table.
Change AIC
Adding rating 320.59
Adding likes 336.45
None 336.54
Adding followers 336.87
Adding comments 338.53
Adding dislikes 338.53
Adding views 338.54
Step 4
The process continues in the same way in Step 4. That is, we have the
fitted models:
√income = −3.672 + 0.032 budget + 1.253 screens + 0.914 rating + 0.077 followers,
√income = −3.632 + 0.033 budget + 1.226 screens + 0.903 rating + 0.021 likes,
and so on.
The AIC values obtained from each of these Step 4 models are listed
in ascending order in the following table.
Change AIC
None 320.59
Adding followers 321.14
Adding likes 321.23
Adding dislikes 321.39
Adding views 322.59
Adding comments 322.59
Here, the AIC of the starting model, denoted by 'None' in the first line
of Step 4, is actually the smallest (with AIC = 320.59). This means
that adding any of the explanatory variables to the model already
containing budget, screens and rating simply increases the AIC. As
such, we cannot further improve the quality of the model.
Conclusion
Our final conclusion is that the forward stepwise regression procedure
suggests that the best model is the one including an intercept,
budget, screens and rating, with the fitted equation
√income = −3.627 + 0.033 budget + 1.268 screens + 0.923 rating.
Given our earlier exploration, in Example 10 (Subsection 5.1), of
which variables might have been important, this choice seems pretty
sensible. Certainly, the three selected variables – budget, screens
and rating – appeared important from looking at the scatterplot
matrix, and their coefficients also appeared highly significant from
Table 10 in Activity 26 (Subsection 5.1).
Note that the correlation between screens and budget is 0.58, but
this level of correlation should not preclude the two variables from
appearing together.
Working your way through Example 14, you might be able to now guess
how the backward stepwise regression procedure would work. Let’s see
whether you are correct!
Change AIC
Dropping followers 316.24
None 316.34
Dropping comments 319.46
Dropping dislikes 320.21
Dropping views 320.35
Dropping likes 320.44
Dropping rating 334.74
Dropping screens 352.32
Dropping budget 362.37
Change AIC
None 316.24
Dropping comments 319.41
Dropping dislikes 319.45
Dropping views 319.53
Dropping likes 320.71
Dropping rating 334.14
Dropping screens 353.01
Dropping budget 364.82
model should be
√income ∼ budget + rating + screens + views + likes + dislikes + comments.
The model suggested by the backward stepwise regression procedure
is different from the corresponding model suggested by the forward
stepwise regression procedure in Example 14.
So which model is better? Well, dropping followers from the full
model is consistent with its coefficient being non-significant in
Table 10 of Activity 26 (Subsection 5.1). However, from the
correlation matrix for this dataset, it turned out (in Activity 25) that
the block of the four explanatory variables – comments, dislikes,
likes and views – are highly correlated and therefore should not
appear in a model together.
It would therefore be sensible to conclude that the three-variable
model suggested by forward stepwise regression may be better for the
data than the seven-variable model suggested here. Moreover,
selecting the former model also fulfils the principle of parsimony given
in Box 18 at the start of Section 5.
In Examples 14 and 15 you have seen that forward and backward stepwise
regression do not necessarily suggest the same final model. This leads to
the recommendation given in Box 25.
Is it usually the case that the two procedures suggest different models?
Actually, no – achieving the same result with both methods often happens
and is reassuring when it does.
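In R, both procedures are carried out by step(), which prints a table like those above at each step. A sketch for the films models (data frame name assumed):

    null_model <- lm(sqrt(income) ~ 1, data = films)   # intercept only
    full_model <- lm(sqrt(income) ~ budget + rating + screens + views +
                       likes + dislikes + comments + followers,
                     data = films)

    # Forward stepwise regression, starting from the null model
    step(null_model, scope = formula(full_model), direction = "forward")

    # Backward stepwise regression, starting from the full model
    step(full_model, direction = "backward")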
In Activity 33, stepwise regression will be used to select the explanatory
variables for modelling the data from the Brexit dataset that you first used
in Activity 7.
[Figure: a scatterplot matrix of the data in the Brexit dataset, showing the response leave and the explanatory variables age, income, migrant, satisfaction and noQual]
Step 2
Change AIC
None 612.18
Dropping income 613.92
Dropping satisfaction 615.04
Dropping age 617.15
Dropping noQual 814.16
Step 2
Change AIC
Adding income 617.18
Adding age 617.30
None 622.16
Adding satisfaction 622.30
Adding migrant 624.15
Step 3
Change AIC
Adding age 615.04
Adding satisfaction 617.15
None 617.18
Adding migrant 619.17
Step 4
Change AIC
Adding satisfaction 612.18
None 615.04
Adding migrant 616.90
Step 5
Change AIC
None 612.18
Adding migrant 612.49
Source: Baker and Beall, 1982, cited in The Open University, 2009, p. 11
Summary
In this unit you have been learning about multiple linear regression – an
extension of simple linear regression to situations where there is more
than one explanatory variable. These extra explanatory variables may
include derived variables, such as those obtained using transformations.
You have seen that although much is the same as with simple linear
regression, the interpretation of the coefficients is not. The coefficients in
multiple regression are partial regression coefficients, reflecting the effect of
the corresponding explanatory variable on the response in the presence of
the other explanatory variables with their values held fixed. Furthermore,
as well as using a t-test to test whether individual coefficients are zero
(assuming the other coefficients take their estimates), we can test whether
all coefficients are zero using an F -test.
In this unit you have also been introduced to two new types of diagnostics:
leverage and Cook’s distance. Leverage measures the potential for an
observation to affect a regression model. The leverage of a data point is
determined by the values of its explanatory variables: an observation has
high leverage – and hence potentially affects the regression model – if the
values of its explanatory variables are far away from the ‘centre’ of the
pattern of the explanatory variables. Cook’s distance measures the
influence of an observation on the regression model. It can be thought of
as a combination of an observation’s residual and its leverage. An
observation is potentially highly influential if it has very high leverage, a
very large residual, or both high leverage and a large residual.
You have also learnt how transformations can be useful for multiple
regression when the model assumptions do not seem reasonable for the
data. Transforming the response can sometimes help to fulfil or strengthen
the model assumptions, whereas transforming the explanatory variable can
sometimes help to enhance linearity between the response and the
explanatory variables.
The unit concluded by considering the selection of explanatory variables to
include in a final multiple regression model. This is because it is preferable
to only include those variables necessary for capturing the main features of
the data. You have seen that how good a model is can be measured using
the adjusted R2 statistic and the Akaike information criterion (AIC). Both
these measures adjust for the number of explanatory variables in the
model. The variables to include in a final model can be selected using
scatterplot matrices, correlation matrices and a fit of the full model, along
with stepwise regression (both forward stepwise regression and backward
stepwise regression).
The Unit 2 route map, repeated from the introduction, provides a visual
reminder of what has been studied in this unit and how the different
sections link together.
[Unit 2 route map: Section 1 (The multiple linear regression model) → Section 2 (Prediction in multiple regression) → Section 3 (Diagnostics) → Section 4 (Transformations in multiple regression) → Section 5 (Choosing explanatory variables)]
Learning outcomes
After you have worked through this unit, you should be able to:
• explain how simple linear regression of a response variable with a single
explanatory variable can be extended to regression with more than one
explanatory variable
• interpret the coefficients in multiple regression
• fit multiple regression models in R and be able to interpret the resulting
output
• interpret leverage
• interpret Cook’s distance
• explain informally the relationship between an observation’s Cook’s
distance and its residual and leverage
• use plots to identify points that have the potential to dominate an
analysis and to identify any points that are influential
• obtain and interpret diagnostic plots in R
• use transformations and extra variables which are derived from other
variables to possibly improve the regression model
• use R for transformations in multiple regression
• use a scatterplot matrix, a matrix of correlations between explanatory
variables, a fit of the full model and regressions between the response
and each explanatory variable individually to select the explanatory
variables that are likely to be in a good final model
• use R to produce a scatterplot matrix and a correlation matrix
• compare different regression models for the same data
• use stepwise regression, from both the null model and the full model, to
arrive at a final regression model
• perform stepwise regression in R.
References
Ahmed, M., Jahangir, M., Afzal, H., Majeed, A. and Siddiqi, I. (2015)
‘Using crowd-source based features from social media and conventional
features to predict the movies popularity’, The 8th IEEE International
Conference on Social Computing and Networking, China, December 2015.
Available at: https://www.researchgate.net/publication/298352830_Using_Crowd-source_based_features_from_social_media_and_Conventional_features_to_predict_movies_popularity
(Accessed: 21 March 2022).
Baker, P.T. and Beall, C.M. (1982) ‘The biology and health of Andean
migrants: a case study in south coastal Peru’, Mountain Research and
Development, 2(1), pp. 81–95.
BBC (2016) EU Referendum Results. Available at:
https://www.bbc.co.uk/news/politics/eu_referendum/results
(Accessed: 28 February 2022).
Becker, S.O., Fetzer, T. and Novy, D. (2017) ‘Who voted for Brexit? A
comprehensive district-level analysis’, Economic Policy, 32(92),
pp. 601–650. Available at: https://doi.org/10.1093/epolic/eix012.
Kuiper, S. (2008) ‘Introduction to multiple regression: how much is your
car worth?’, Journal of Statistics Education, 16(3). Available at:
https://doi.org/10.1080/10691898.2008.11889579.
Stewart, L. (2021) Roller Coaster Data. Available at:
https://www.kaggle.com/lyallstewart/roller-coaster-data (Accessed:
18 March 2022).
Sullivan III, M. (2020) ‘Sullystats/Statistics6e’. Available at:
https://github.com/sullystats/Statistics6e/blob/master/docs/Data/SullivanStatsSurveyI.csv
(Accessed: 9 June 2022).
Sullivan III, M. (2021) Statistics: informed decisions using data,
6th edition. Illinois, USA: Pearson. Accompanying dataset for Chapter 5
available at: https://sullystats.github.io/Statistics6e/Data/Chapter5
(Accessed: 6 June 2022).
The Open University (2009) M346 Unit 5 Multiple linear regression.
Milton Keynes: The Open University.
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, Adebayo Akinfenwa: © Cosmin Iftode / www.123rf.com
Subsection 1.2, marking footballer: © Stef22 | Dreamstime.com
Subsection 1.3.2, marking Lionel Messi: © Matthew Trommer /
www.123rf.com
Subsection 1.3.2, map of referendum results: Taken from:
https://www.bbc.co.uk/news/politics/eu referendum/results
Subsection 2.1, Formula Rossa: © Chrisstanley | Dreamstime.com
Subsection 3.1, excited baby: © angelsimon / www.123rf.com
Subsection 3.3, boiling pan: © Akhararat Wathanasing / www.123rf.com
Subsection 4.2.2, cinema audience: © Vadymvdrobot | Dreamstime.com
Subsection 4.3, Facebook icon: © bigtunaonline / www.123rf.com
Section 5, the Earth: © abidal / www.123rf.com
Subsection 5.3.1, tea and biscuits: © olegdudko / www.123rf.com
Subsection 5.4, Machu Picchu: © paltitaviajeratrip / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
(a) The plot of residuals against weight does not indicate any serious
grounds for concern about the model. There is no clear sign of a
pattern, and no sign of unequal variance can be detected. Therefore,
the fitted model seems adequate.
(b) The plot in Figure 2 shows a clear positive relationship between the
residuals from the fitted regression model (with weight as the
explanatory variable) and footballers’ height, and it seems that
footballers with lower height tend to have negative residuals.
To understand what this means, suppose there are two players with
the same weight but with different heights. The model
strength ∼ weight
would imply that these two players would have the same strength
score. However, the residual plot in Figure 2 indicates that the
shorter player is more likely to have a negative residual; in other
words, the strength score of the shorter player is likely to be less than
that of the taller player.
Therefore, the model (with weight as the explanatory variable) does
not adequately explain all of the variation in strength, since there
seems to be further variation in players’ strength scores due to their
heights.
Solution to Activity 2
(a) The scatterplot in Figure 3 suggests an increasing linear relationship
between strength and height. There are a few points that are far
from the main scatter of the data but these should not cause concern
at this stage. Data points are evenly scattered around the main trend
with slightly more variation at the higher values of height. A simple
linear regression model therefore seems appropriate for the data.
(b) The regression coefficient value of 1.318 indicates that each
additional inch of height is associated, on average, with an increase
of 1.318 in a footballer’s strength score.
(c) Since the p-value for the regression coefficient of height is very small,
there is strong evidence that the regression coefficient is non-zero, and
therefore height is significant in explaining strength.
(d) The plot of the residuals and height in Figure 4 shows that there is a
tendency for the residuals to be lower for low values of height.
However, it seems to be mainly footballers with height 68 in that are
giving this impression, and overall this tendency doesn’t seem to be
very marked. So, although the fitted model might not be perfect for
this dataset, overall the fit is not too bad.
(e) The plot in Figure 5 shows a clear positive relationship between the
residuals from the fitted regression model (with height as the
explanatory variable) and footballers’ weight, with footballers with
lower weight tending to have negative residuals.
This means that, for two players with the same height but with
different weights, the lighter player is more likely to have a negative
residual; in other words, the strength score of the lighter player is
likely to be less than that of the heavier player.
Therefore, the model (with height as the explanatory variable) does
not adequately explain all of the variation in strength, since there
seems to be further variability in players’ strength scores due to their
weights.
Solution to Activity 3
The estimated value (0.322) of the regression coefficient of weight in the
simple linear regression model in Activity 1 is different from its
corresponding value (0.252) in the multiple regression model in Example 1.
Similarly, the two corresponding values of the coefficient of height are
different – it is estimated as 1.318 in the simple linear regression model in
Activity 2, and 0.558 in the multiple regression model in Example 1. These
coefficient values are summarised in the following table.

Variable   Simple linear regression   Multiple regression
weight     0.322                      0.252
height     1.318                      0.558
Note from the table that both explanatory variables have lower values for
their coefficients in the multiple regression model than the corresponding
coefficients in the simple linear regression models. The reasons for these
differences are discussed in Subsection 1.2.
Solution to Activity 4
(a) The regression coefficient of weight is 0.273. This means that a
footballer’s strength score is expected to increase by 0.273 if their
weight increases by one lb, and both their height and their score of
marking ability remain fixed.
The regression coefficient of height means that a footballer’s strength
score is expected to increase by 0.681 if their height increases by
one inch, and both their weight and score of marking ability remain
fixed.
Similarly, the regression coefficient of marking means that a
footballer’s strength score is expected to increase by 0.085 if their
score of marking skill increases by one unit, and both their weight and
height remain fixed.
(b) The estimated values of the regression coefficients for weight in the
model here and the model considered in Example 1 are not the same
(0.273 here, and 0.252 in Example 1). This difference is to be
expected, in the same way that a regression coefficient in a model
with two explanatory variables is expected to be different to its
corresponding coefficient in a simple linear regression model.
The two regression coefficients also have different interpretations. The
value of the regression coefficient of weight in the model with three
explanatory variables represents the partial effect of weight on
strength, given both height and marking. However, the value of the
regression coefficient of weight in the model with two explanatory
variables in Example 1 represents the partial effect of weight on
strength, given height only.
Solution to Activity 5
(a) The test statistic is calculated as
\[ t\text{-value} = \frac{\widehat{\beta}_1}{\text{standard error of } \widehat{\beta}_1} = \frac{0.252}{0.0535} \simeq 4.71. \]
(c) The p-values associated with the two tests are both obtained from
t(n − (q + 1)) = t(100 − (2 + 1)) = t(97).
Since the p-value associated with the regression coefficient β1 of
weight is so small (< 0.001), there is strong evidence to suggest that
β1 differs from zero.
The p-value associated with the regression coefficient β2 of height is
not so small (0.036), but we (the module team) would still judge this
to be small enough to conclude that there is evidence to suggest that
β2 also differs from zero.
So, overall, there is evidence to suggest that both β1 and β2 differ
from zero.
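As a quick numerical check, this calculation is easily reproduced in R
(the module’s software); this is just a sketch of the arithmetic, not
module-supplied code.

    # t-statistic for the coefficient of weight, and its two-sided
    # p-value from the t(97) distribution
    t_val <- 0.252 / 0.0535       # approximately 4.71
    2 * pt(-abs(t_val), df = 97)  # well below 0.001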
Solution to Activity 6
(a) Let β1 , β2 and β3 be the regression coefficients of weight, height and
marking, respectively. The test procedure being used here tests the
hypotheses
H0 : β1 = β2 = β3 = 0,
H1 : at least one of the three coefficients differs from zero.
(b) Since the p-value associated with this test is less than 0.001, there is
strong evidence that at least one of the three regression coefficients is
different from zero. Hence, there is strong evidence that the regression
model contributes information to explain the variability in the
footballers’ strength scores.
(c) This time, the test procedure being used tests three sets of hypotheses:
$H_0\colon \beta_1 = 0$, $H_1\colon \beta_1 \neq 0$ (assuming $\beta_2 = \widehat{\beta}_2$ and $\beta_3 = \widehat{\beta}_3$),
$H_0\colon \beta_2 = 0$, $H_1\colon \beta_2 \neq 0$ (assuming $\beta_1 = \widehat{\beta}_1$ and $\beta_3 = \widehat{\beta}_3$),
$H_0\colon \beta_3 = 0$, $H_1\colon \beta_3 \neq 0$ (assuming $\beta_1 = \widehat{\beta}_1$ and $\beta_2 = \widehat{\beta}_2$).
For each of these tests, the p-value is based on the same
t-distribution, namely t(n − (q + 1)) = t(96).
(d) For the hypotheses
$H_0\colon \beta_1 = 0$, $H_1\colon \beta_1 \neq 0$ (assuming $\beta_2 = \widehat{\beta}_2$ and $\beta_3 = \widehat{\beta}_3$),
the p-value is very small (< 0.001). Therefore, there is strong evidence
to suggest that the regression coefficient of weight (β1 ) is not zero
when height and marking are in the model.
For the hypotheses
$H_0\colon \beta_2 = 0$, $H_1\colon \beta_2 \neq 0$ (assuming $\beta_1 = \widehat{\beta}_1$ and $\beta_3 = \widehat{\beta}_3$),
the p-value is small (0.006). So, there is evidence to suggest that the
regression coefficient of height (β2 ) is also not zero when weight and
marking are in the model.
For the hypotheses
$H_0\colon \beta_3 = 0$, $H_1\colon \beta_3 \neq 0$ (assuming $\beta_1 = \widehat{\beta}_1$ and $\beta_2 = \widehat{\beta}_2$),
the p-value is again very small (< 0.001). So, there is strong evidence
to suggest that the regression coefficient of marking (β3 ) is also not
zero when weight and height are in the model.
From these three tests, we can conclude that, in a model that contains
these three explanatory variables, there is strong evidence to suggest
that each of the explanatory variables, individually, explains the
variability in the footballers’ strength scores, given that the other
variables are in the model. This means that they are significant
together in explaining the variability in the footballers’ strength
scores.
(e) In part (d), you assessed the individual influence on the response
variable for each of weight and height when they are both included
Solution to Activity 7
(a) (i) The regression coefficient of age in the first model is −81.426,
indicating that a one percentage point (1% = 0.01) increase in
the percentage of people in the 30 to 44 age group in an area is
associated, on average, with a decrease of
0.01 × 81.426 = 0.81426%
in the percentage of ‘Leave’ voters in this area.
The regression coefficient of income (−2.085) in the second
model means that a one pound (£1) increase in the mean hourly
income of people in an area is associated with an average decrease
of 2.085% in the percentage of ‘Leave’ voters in this area.
(ii) The p-value associated with the test of the regression coefficient
of age is 0.004. This means that there is evidence to suggest that
the regression coefficient is not zero. Hence, there is evidence
that a voter’s age explains their tendency to vote ‘Leave’.
The p-value associated with the regression coefficient of income
is less than 0.001, showing evidence that a voter’s income also
explains their tendency to vote ‘Leave’.
(b) (i) The fitted regression equation can be written as
leave = 78.005 + 31.768 age − 2.180 income.
The regression coefficient of age is 31.768. This means that a
one percentage point (1% = 0.01) increase in the percentage of
people in the 30 to 44 age group in an area, with the mean hourly
income in this area being fixed, is associated with an increase of
0.01 × 31.768 = 0.31768%
in its ‘Leave’ voters.
Similarly, the regression coefficient of income indicates that a
one pound (£1) increase in the mean hourly income of people in
an area, with its population age being fixed, is accompanied by a
decrease of 2.18% in the percentage of ‘Leave’ voters in this area.
Solution to Activity 8
(a) Using the information given in Table 4, the fitted regression equation
is
speed = 23.58 + 0.218 height + 0.00182 length.
Solution to Activity 9
Denoting the predicted value of strength for this newly registered
footballer by $\widehat{y}_0$, we can calculate $\widehat{y}_0$ by substituting the values of its
explanatory variables (weight = 170, height = 72 and marking = 65)
into the fitted model, so that
\[ \widehat{y}_0 = -28.305 + (0.273 \times 170) + (0.681 \times 72) + (0.085 \times 65) = 72.662 \simeq 73. \]
So, a point prediction of the strength score of the newly registered
footballer is 73. (The predicted strength score has been rounded to the
nearest integer to match the accuracy of the variable strength in the
original data.)
Solution to Activity 10
The larger the confidence level, the wider the corresponding prediction
interval will be. Therefore, (57.3, 91.6) must be the 99% prediction interval
since it is the widest, (61.4, 87.5) must be the 95% prediction interval
because it is the next widest, and (63.5, 85.4) must be the 90% prediction
interval because it is the narrowest.
Solution to Activity 11
The 95% prediction interval tells us that it is predicted that the strength
score of the newly registered footballer is somewhere between 64 and 81.
Solution to Activity 12
(a) There are two potential outliers in the residual plot shown in
Figure 8(a): the two points with lowest residuals. However, overall,
the residual plot shows no particular pattern. The points are
randomly scattered around zero, and so the assumptions of linearity
and zero mean seem to be reasonably satisfied. Also, the vertical
scatter of the points seems to be fairly constant, and so the
assumption of constant variance is also plausible.
(b) The two potential outliers identified in part (a) are also evident in the
normal probability plot given in Figure 8(b): the two points in the
bottom left-hand corner. However, in the plot as a whole, nearly all
the points lie on, or are very close to, a straight line in the plot. The
normality assumption is therefore also very plausible.
(c) Since all of the assumptions are reasonably satisfied, the multiple
regression model seems to provide an adequate model for this dataset.
Solution to Activity 13
(a) The residual plot in Figure 9(a) shows a very clear curved pattern. As
such, the assumptions of linearity and zero mean do not seem to be
satisfied. What’s more, the vertical scatter does not look to be
constant, since the scatter for the middle fitted values seems higher
than for the small and large values. This means that the assumption
of constant variance is also in doubt.
(b) In the normal probability plot in Figure 9(b), most of the points lie
roughly on a straight line. There are, however, a few points which
deviate from the line in the top and bottom corners of the plot. So,
overall, we (the module team) wouldn’t rule out the normality
assumption based on the normal probability plot, but it also looks like
the assumption may be questionable. (You may, of course, disagree!)
(c) Since the assumptions of linearity, zero mean and constant variance
are in doubt, and the normality assumption may be questionable, this
model does not seem adequate for the roller coasters dataset.
Solution to Activity 14
The only real difference between the two original data points and the two
changed data points is the value of the explanatory variable height. The
far-left data point has a low value of height (66) in comparison to the rest
of the values of height in the dataset, whereas the value of height is one
of the more central values (72) for the other data point.
This suggests that the value of a data point’s explanatory variable might
determine whether a data point has high or low leverage.
Solution to Activity 15
Data points with both high leverage and a large residual are likely to be
influential points, and these would appear towards the lower-right or
upper-right corners of the plot.
If a data point has very high leverage, this is also likely to be an influential
point, in which case this would appear at the far right of the plot.
Data points with very large residuals are also likely to be influential points,
and these would appear near the top or the bottom of the plot.
Solution to Activity 16
When looking for influential points, we want to be looking at those points
with high leverage and/or large residuals.
Example 7 identified two of the data points as having high leverage – the
data points numbered 62 and 15. So, both of these data points have the
potential to be influential points.
The standardised residual for the data point numbered 62 isn’t quite large
enough to be classified as being ‘large’ using our general rule of thumb
(its value is just above −2), but, combined with its high leverage, it’s
possible that this data point is an influential point.
In contrast, the standardised residual for the data point numbered 15 can
be classified as being ‘large’ using our general rule of thumb (its value is
roughly 2.5). Therefore, given the high leverage for this data point, it
looks likely that this data point is an influential point.
There are two other data points with ‘large’ residuals: the data point
numbered 92 (whose standardised residual is roughly −2.5) and the data
point to its left in the plot (whose standardised residual is close to −3).
Since these residuals are large, then it’s possible that these data points are
also influential points.
Notice that there are two other data points with similar leverage to the
data point numbered 92. However, since their standardised residuals are
small (they are both fairly close to zero), it is unlikely that these data
points will be influential points.
Solution to Activity 17
(a) Figure 14 clearly shows one data point (numbered 77) with high
leverage to the far right of the data.
There is then a group of five data points with relatively high leverage
in comparison to the bulk of the data points – these are the data
points numbered 1, 2, 3, 4 and 44.
(b) Although the data point at the far right numbered 77 has the highest
leverage, its standardised residual is small (its value is less than 1),
and therefore this data point is unlikely to be an influential point.
The data point with the next highest leverage is the point
numbered 2. The standardised residual for this data point seems to be
roughly −2, and, as such, could be just about considered as ‘large’
using our rule of thumb. Since this data point also has high leverage,
it is possible that this data point is an influential point.
The standardised residuals for the points numbered 1, 3 and 4 are all
above −2, and so wouldn’t be considered as being large by our rule of
thumb. However, combined with their high leverage values, it is
possible that these are influential points.
The standardised residual for the point numbered 44 is so small (its
value is fairly close to zero) that it’s unlikely that this data point is an
influential point, even though its leverage is fairly high.
There remains one data point which could also be influential: the
data point numbered 43 at the upper-left corner of the plot, which has
a very large standardised residual value greater than 4! Although the
leverage for this data point is very small, the residual is so large that
this data point could also possibly be an influential point.
Solution to Activity 18
(a) From the Cook’s distance plot in Figure 16, the three most influential
points are those with the tallest bars – namely the data points
numbered 1, 2 and 43. They all have Cook’s distances of about 0.05
or more, which is considerably higher than the Cook’s distances of all
other data points in the sample.
Three out of the five points identified in Activity 17 as being
potentially influential points have turned out to be influential points
in terms of having highest Cook’s distances in the sample. Although
point 43 has a very small leverage, its very extreme standardised
residual has led to the third highest Cook’s distance.
(b) The data point numbered 77 with the highest leverage in Figure 14
has turned out not to be one of the three most influential points.
Although it has the maximum leverage in the sample, it does not have
one of the highest Cook’s distances since it has a standardised
residual near zero. The combination of the very high leverage with a
very small standardised residual has not led to a high Cook’s distance.
Solution to Activity 19
(a) The data points numbered 96, 124 and 228 have high leverage values
together with large standardised residuals. Although there are three
other points with even higher leverage, they have smaller standardised
residuals than the three identified points. There are also many other
points with higher standardised residuals than the identified points,
but those tend to have smaller leverage, hence their Cook’s distance
values are not expected to be high.
(b) Based on the Cook’s distance plot given in Figure 18, data point
number 124 is the most potentially influential point in the data since
it has the highest value of the Cook’s distance. The Cook’s distance
values of the other two points identified in part (a) are close to the
Cook’s distance values of many other data points in the sample, and
so these data points don’t seem to be influential.
(c) There are not any considerable differences in the estimated regression
coefficients (compared with their standard errors) and their p-values
after removing data point number 124. This data point, therefore,
does not seem to have a considerable influence on the regression
model. Although it has the highest Cook’s distance, it does not
automatically mean that it will be influential to the model. After all,
its Cook’s distance is only about 0.06 and is not that far from the
Cook’s distance values of some of the other points.
Solution to Activity 20
(a) The most noticeable feature of the residual plot is that the spread of
points about the zero residual line seems to increase as the fitted
values increase. This suggests that the assumption that the variance
of the random terms is constant may not hold. Problems with the
model assumptions can sometimes be solved by transforming the
response variable, and so this is why it is reasonable to try
transforming the response.
(b) The histogram of income in Figure 28 is very right-skew. This
suggests that transformations of income down the ladder of powers
might help.
The first transformation down the ladder of powers from $x^1$ is the
square root transformation and the one down from that is the log
transformation, which explains why it is reasonable to consider the
two transformations $\sqrt{\texttt{income}}$ and $\log(\texttt{income})$.
(c) The histogram of income in Figure 28 is very right-skew, but the
histogram of $\sqrt{\texttt{income}}$ in Figure 29 seems much more symmetric. On
the other hand, the histogram of $\log(\texttt{income})$ is very left-skew,
indicating that the data have been transformed too far down the
ladder of powers. Therefore, based on these two histograms, we (the
module team) would recommend applying the square root
transformation to income.
Solution to Activity 21
(a) To enhance linearity between the response and any of the explanatory
variables, we can try transforming the associated explanatory
variables. The scatterplot in Figure
√ 31(a) shows a roughly linear
increasing relationship between income and budget. On the other
hand, although the scatterplot in √Figure 31(b) seems to show an
increasing relationship between income and screens, this
relationship is clearly not linear. So, a transformation of screens
may improve the linearity assumption, but there √ is no need to
transform budget since the relationship between income and
budget already seems to be roughly linear.
√
(b) From the scatterplot of √ income and screens given in Figure 31(b),
the relationship between income and√screens is certainly
non-linear. The relationship between income and the transformed
explanatory variable screens2 , shown in Figure 32(a), seems to be
more linear, although√still a little non-linear. However, the
relationship between income and the transformed explanatory
variable screens3 , shown in Figure 32(b), does seem to be roughly
linear, and so transforming screens to screens3 seems to be a
sensible way forwards.
Solution to Activity 22
(a) In the residual plot given in Figure 33 of the fitted model with the
transformed variables, the points seem to be fairly randomly scattered
about the zero residual line. As such, it looks like both the assumption
of constant variance of the random terms and the linearity assumption
are plausible for the regression model with the transformed variables.
(b) For the new film, since budget is given in millions of US dollars, the
value of budget for this new film is 50.
Since the variable screens is given in thousands and the new film is
to be initially launched on 2500 screens, the value of screens for this
film is 2.5. Therefore, the value of $\texttt{screens}^3$ is $2.5^3 = 15.625$.
We can now use the fitted equation to predict $\sqrt{\texttt{income}}$ for this film:
\[ \sqrt{\texttt{income}} = 3.01 + (0.025 \times 50) + (0.107 \times 15.625) \simeq 5.932. \]
The predicted income of the new film is therefore $5.932^2 \simeq 35$.
So, for the new film with a production budget of 50 million US dollars
and initially being launched on 2500 screens, the predicted income is
35 million US dollars.
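The arithmetic can be checked with a few lines of R (a sketch only; the
variable names here are illustrative).

    # Prediction for the new film (budget in millions of US dollars,
    # screens in thousands)
    budget <- 50
    screens <- 2.5
    sqrt_income <- 3.01 + 0.025 * budget + 0.107 * screens^3
    sqrt_income^2  # predicted income: approximately 35 (million dollars)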
Solution to Activity 23
There do not seem to be many strong single relationships between income
and the explanatory variables; only the relationships with budget, rating
and screens seem fairly clear. However, these relationships do not seem to
be linear.
Solution to Activity 24
The explanatory variables comments, dislikes, likes and views seem to
be very closely related to each other, since the scatterplots between this set
of explanatory variables show clear relationships between each pair of
explanatory variables.
It also looks like screens and budget may be related, since the scatterplot
for these two explanatory variables also shows a clear relationship.
Solution to Activity 25
Using the rule of thumb that a correlation is high if the absolute value of
the correlation is 0.7 or more, the only correlations which we would
identify as being high are those between the four explanatory variables
comments, dislikes, likes and views.
Solution to Activity 26
(a) (i) From Table 10, it is easy to cast your eye down the list of
p-values given for testing the regression coefficients individually.
On this basis, the strongest dependence is on budget, rating
and screens, since all of these explanatory variables have
p-values very close to zero.
All of the other explanatory variables, except followers, also
have fairly small p-values, which suggests that their associated
regression coefficients are also non-zero.
(ii) While it may be sensible to include all of the explanatory
variables (except followers) in the model, it is not guaranteed
that this is a sensible thing to do. This is because the values of
the regression coefficients depend on the values of the others
being fixed, and so, when some explanatory variables are absent
from the model, the remaining regression coefficients might be
quite different.
(b) The residual plots do not cast any doubts on the model assumptions.
The points in the plots of residuals against fitted values look fairly
randomly scattered, and so the assumptions of linearity, zero mean
and constant variance seem reasonable.
The normality assumption also seems reasonable, since, although
there are two outliers in the plot, overall the points in the normal
probability plot generally follow a straight line.
In the residuals versus leverage plot, there are no points with high
leverage and large residuals, and so this plot does not raise any
concerns either.
In the Cook’s distance plot, none of the points stand out as having a
Cook’s distance value much higher than the rest. This suggests that
none of the points are much more influential on the model than the
others.
Solution to Activity 27
(a) We know that
TSS = ESS + RSS,
so
ESS = TSS − RSS
= 515.60 − 111.44
= 404.16.
(b) The value of ESS is almost four times larger than the value of RSS,
and so a fairly large proportion of the response variation can be
explained by the model. Therefore, the sums of squares do suggest
that the model seems to explain the variation in the response fairly
well.
Solution to Activity 28
If a model fits the data well, then we would expect the model to explain
the variation in the response well. In this case, we’d expect the explained
variation to be large in comparison to the residual (unexplained) variation,
so that ESS is large in comparison to RSS.
But, since TSS = ESS + RSS, if ESS is large in comparison to RSS, then
the value of TSS will be not much larger than ESS, and so
\[ R^2 = \frac{\text{ESS}}{\text{TSS}} \]
will be large.
Solution to Activity 29
From Equation (1), the $R^2$ statistic is calculated as
\[ R^2 = \frac{\text{ESS}}{\text{TSS}} = \frac{404.16}{515.60} \simeq 0.784. \]
Or, as a percentage, the $R^2$ statistic is 78.4%.
Solution to Activity 30
In this module, we are taking $R_a^2$ to be the percentage of variance
accounted for. So, using Equation (2), $R_a^2$ is calculated as
\[ R_a^2 = 1 - \frac{n-1}{n-(q+1)}\,(1 - R^2) = 1 - \frac{10-1}{10-(1+1)}\,(1 - 0.784) \simeq 0.757. \]
As a percentage, the value of $R_a^2$ is 75.7%.
(Notice that, as expected, the value of $R_a^2$ is less than the value of $R^2$.)
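This calculation is easily verified in R; a minimal sketch of the
arithmetic, assuming the values given above:

    # Adjusted R^2 from Equation (2), for n = 10 observations and
    # q = 1 explanatory variable
    n <- 10; q <- 1; R2 <- 0.784
    1 - (n - 1) / (n - (q + 1)) * (1 - R2)  # approximately 0.757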
Solution to Activity 31
(a) The values of $R^2$ increase from Model 1 to Model 3. In contrast, even
though the value of $R_a^2$ increases from 38.86% for Model 1 to 74.33%
for Model 2, which has noQual added to the model, $R_a^2$ reduces to
74.23% for Model 3, which has migrant added to the model.
Since $R_a^2$ adjusts for the number of explanatory variables, the decrease
in the value of $R_a^2$ for Model 3 means that adding migrant as an
explanatory variable does not enhance the fit of the model. So, the
increase in the value of $R^2$ seen for Model 3 must have been due
simply to the increase in the number of explanatory variables from
Model 2 to Model 3, rather than a real increase in fit.
(Notice that, as expected, the value of $R_a^2$ in each model is less than
its corresponding $R^2$ value.)
(b) The value of $R_a^2$ should be used as a measure for the percentage of
variance accounted for, especially in this case, where $R_a^2$ shows that
adding an extra explanatory variable in Model 3 does not in fact
enhance the model fit.
Comparing $R_a^2$ for the three models, it is clear that adding noQual in
Model 2 markedly increases the percentage of variance of leave that
is accounted for by the model – from 38.86% to 74.33%. But adding
migrant in Model 3 slightly reduced the variance accounted for,
to 74.23%. Therefore, based on the values of $R_a^2$, of these three
models, Model 2 seems to fit the data best.
Solution to Activity 32
(a) The best model in the given set is the one with the lowest AIC.
Comparing the AIC of the three models, it is clear that adding
noQual in Model 2 markedly decreased the AIC of the model
from 819.7 to 615.0; but adding migrant in Model 3 slightly increased
the AIC to 616.9. Therefore, based on the AIC values for these three
models, Model 2 seems to be the best.
(b) Model 2 is the same model that we chose in part (b) of Activity 31,
based on the adjusted $R^2$. This means that both $R_a^2$ and AIC
suggested the same model for these data.
Solution to Activity 33
(a) From the last row of the scatterplots, there seems to be a strong
relationship between leave and noQual. It is clear that leave tends
to increase as noQual increases.
Although there do not seem to be any other quite so strong
relationships between leave and each of the other four explanatory
variables, the plots suggest that leave does tend to decrease as each
of income, migrant or satisfaction increases.
The relationships between leave and each of income and migrant
both seem to be fairly linear, and the relationship between leave and
noQual is clearly linear.
The scatter of the values of leave seems constant for different values
of any of the explanatory variables.
From the correlation matrix, the highest correlation between the
explanatory variables is between income and noQual (−0.79). Each of
these two explanatory variables is also correlated with migrant.
Because these correlations are high, we would not expect all three of
these explanatory variables to appear together in a regression model.
So, summarising this discussion, a good regression model might
include noQual and perhaps satisfaction as explanatory variables.
The remaining explanatory variables are not expected to be in the
model because income and migrant are highly correlated with
noQual.
The best model in Step 3 is the one that contains age together with
the intercept, noQual and income.
Similarly, Step 4 suggests adding satisfaction to the model.
But finally, Step 5 indicates that adding migrant to the model
increases the AIC from 612.18 to 612.49, which means that adding
migrant does not further improve the quality of the model.
The forward stepwise regression procedure therefore suggests that the
best model is the one including an intercept, noQual, income, age and
satisfaction. This is exactly the set of explanatory variables
suggested by the backward stepwise regression procedure in part (c).
Unit 3
Regression with a categorical explanatory variable
Introduction
So far in this module, the focus has been on statistical models for a
continuous response variable Y with either one numerical explanatory
variable (using simple linear regression in Unit 1) or more than one
numerical explanatory variable (using multiple regression in Unit 2). In
the real world, however, there are many situations where we require a
statistical model for a continuous response variable, but one of the possible
explanatory variables is categorical rather than numerical. In this unit we
will consider the situation where there is just one explanatory variable that
happens to be categorical.
[Unit 3 route map: explanatory variables divide into covariates and factors. Section 1 (Regression with a factor: the basic idea) → Section 2 (Developing the model further) → Section 3 (Using the proposed model) → Section 4 (Using R to fit a regression with a factor) → Section 5 (Analysis of variance (ANOVA)) → Section 6 (Using R to produce ANOVA tables) → Section 7 (Analysing the effects of the factor levels further) → Section 8 (Using R to produce extended ANOVA tables)]
1 Regression with a factor: the basic idea
Wages
The Office for National Statistics (ONS) is the UK’s largest
independent producer of official statistics. They are responsible for
collecting, analysing and publishing statistics about the UK’s
economy, society and population.
The ONS run a number of surveys, one of which is the Labour Force
Survey (LFS). This data source uses international definitions of
employment, unemployment and economic inactivity, together with
information on a wide range of related topics such as occupation,
training, hours of work and personal characteristics of household
members aged 16 years and over at private addresses in the UK.
[Margin cartoon: ‘I don’t know about you, but I find it easier to spend a wage than earn it!’]
The LFS was first conducted biennially in 1973, and over the years has
increased to annually and then quarterly. Government departments
use the results of the survey to identify how and where they should be
using public resources, to check how different groups in the community
are affected by existing policies and to inform future policy changes.
We’ll start with a quick look at the wages dataset in Activities 1 and 2,
before we try to model any of the data.
As mentioned in Activity 1, both edLev and occ from the wages dataset
are categorical variables that represent the different categories by using
numerical codes (1, 2, . . . , 17 for edLev and 1, 2, . . . , 7 for occ). It is quite
common for factors to use numerical codes like these to represent the
different possible values, and this is often done for convenience to avoid
long-winded labels; for example, the numerical code ‘3’ is used to represent
‘Intermediate non-manual’ for the factor occ.
Looking at the data for the first five observations from the wages dataset
given in Table 1, can you spot a potential error for one of the values of one
of the variables?
Figure 2 Scatterplot of the response hourlyWageSqrt and the occupation codes of the factor occ from the wages dataset
Since there are a lot of data points in the scatterplot given in Figure 2, but
only seven possible values for occ, it is difficult to get a clear picture of any
relationship between the two variables. For example, there are 956 data
points plotted for which occ takes the value 5! So, instead, let’s look at a
comparative boxplot.
Figure 3 Comparative boxplot of the response hourlyWageSqrt over the different level codes of the factor occ from the wages dataset
[Figure 4: the simple linear regression model, showing a data point $(x_i, y_i)$, the line value $\alpha + \beta x_i$, and the random term $W_i = y_i - (\alpha + \beta x_i)$ as the vertical distance between them]
If two observations i and j have the same value of the covariate, so that
xi = xj , then
α + βxi = α + βxj .
In this case, the models for Yi and Yj only differ in the values of their
random terms Wi and Wj , as illustrated in Figure 5.
[Figure 5: the simple linear regression model from Figure 4 for $Y_i$ and $Y_j$ when $x_i = x_j$. The two observations share the mean $\alpha + \beta x_i$ and differ only through their random terms $W_i = y_i - (\alpha + \beta x_i)$ and $W_j = y_j - (\alpha + \beta x_j) = y_j - (\alpha + \beta x_i)$.]
When the explanatory variable is a factor, we can base the model on the means of the responses for the
different levels of the factor (as mentioned at the end of Subsection 1.2).
To help visualise what such a model would look like, Figure 6 shows a
scatterplot of hourlyWageSqrt and the level codes of occ, together with
the (sample) means of the responses associated with each of the seven
levels of occ; the model is based on these means rather than on a fitted
straight line.
Figure 6 Scatterplot of hourlyWageSqrt and the level codes of occ with the (sample) means of the responses associated with each of the seven levels of occ indicated by the large red circles
Consider once again the data given in Table 1. Write Model (2) in a form
specifically for Y2 .
In Model (2), the value of the variance σ 2 relates to the vertical spread of
the data about the mean, just as σ 2 in Model (1) relates to the vertical
spread of the data about the regression line in simple linear regression.
Notice that, for all responses Y1 , Y2 , . . . , Yn , the model has the same
variance σ 2 . This means that, for each level of the factor, the variation of
the associated responses about their mean is roughly the same across the
different levels of the factor. This is illustrated in Figure 7, which follows.
Figure 7 How σ² relates to the vertical spread of observations about (a) a regression line when the explanatory variable is a covariate, and (b) mean responses (indicated by large red circles) when the explanatory variable is a factor
Model (2) gives us a possible model for each response Yi based on the
mean response for all observations taking the same factor level as the ith
observation. There is, however, a problem with using this to model the
relationship between the response and the factor. This is because
Model (2) essentially models the responses associated with each level of the
factor separately, whereas, in order to model the relationship between the
response and the factor, we really need a model that links the responses
across the different levels of the factor. To do this, we need to adapt the
model slightly. We consider this in the next subsection.
[Figure: effect terms relative to the baseline mean µ. For level 1, α₁ = 0, so µ + α₁ = µ (the baseline mean response). Here α₂ is negative, as level 2 of the factor decreases the response in comparison to level 1, giving mean µ + α₂; α₃ is positive, as level 3 of the factor increases the response in comparison to level 1, giving mean µ + α₃.]
In the next example we show how this works in practice for the first
individual in the wages dataset, before you rewrite the models for the next
four individuals in that dataset in Activity 7.
For the data given in Table 1 (Subsection 1.1), taken from the wages
dataset, use Model (3) to write model forms for Y2 , Y3 , Y4 and Y5 .
2 Developing the model further
• If the ith tree is on the west side of the road (so that side takes
level 2), then the regression model for Yi is
\[ Y_i = \mu + \alpha_2 + W_i, \quad W_i \sim N(0, \sigma^2). \]
So, the observations for which side takes level 2 have the extra
parameter α2 in their model.
The value of zi2 therefore indicates whether or not the ith observation
takes level 2 of side:
• if zi2 = 1, then we know that the ith observation does take level 2
of side
• if zi2 = 0, then we know that the ith observation does not take
level 2 of side.
Data for five observations from the manna ash trees dataset (for the
trees numbered 1, 2, 28, 29 and 39), including the indicator variable
Z2 , are given in Table 2.
Table 2 Five observations from the manna ash trees dataset, including the
indicator variable Z2
Therefore:
• If zi3 = 1, then we know that the ith observation takes level 3 of
occ.
• If zi3 = 0, then we know that the ith observation does not take
level 3 of occ.
Similarly, we could define four more indicator variables to identify
which observations take each of the remaining four levels of occ; that
is, levels 4, 5, 6 and 7. (You will do this soon in Activity 8.)
Data for the first five observations from the wages dataset for the
variables hourlyWageSqrt and occ, together with the indicator
variables Z2 and Z3 , are given in Table 3.
Table 3 First five observations from wages, giving hourlyWageSqrt and
occ together with Z2 and Z3
Complete Table 4 to show the first five observations from the wages
dataset for the variables hourlyWageSqrt and occ, together with the
indicator variables Z2 , Z3 , . . . , Z7 .
Table 4 Incomplete table showing the first five observations from wages dataset,
giving hourlyWageSqrt and occ together with Z2 , Z3 , . . . , Z7
Model (4) for the response Y and factor A with K levels can be rewritten
in terms of a set of indicator variables. This allows the model to be
represented, for i = 1, 2, . . . , n, by a single model equation of the form
\[ Y_i = \mu + \alpha_2 z_{i2} + \alpha_3 z_{i3} + \cdots + \alpha_K z_{iK} + W_i, \quad W_i \sim N(0, \sigma^2), \tag{5} \]
where
\[ z_{i2} = \begin{cases} 1 & \text{if the $i$th observation takes level 2 of $A$,} \\ 0 & \text{otherwise,} \end{cases} \qquad z_{i3} = \begin{cases} 1 & \text{if the $i$th observation takes level 3 of $A$,} \\ 0 & \text{otherwise,} \end{cases} \]
and so on, down to
\[ z_{iK} = \begin{cases} 1 & \text{if the $i$th observation takes level $K$ of $A$,} \\ 0 & \text{otherwise.} \end{cases} \]
[Margin cartoon: pH indicator paper. ‘Not an indicator variable. But this pH indicator paper changes colour depending on the acidity, so does this make it a “variable indicator”?’]
Notice that we don’t have an indicator variable associated with level 1 of
factor A, since we are assuming that the baseline mean µ represents the
mean response when A takes level 1.
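In R, this coding is constructed automatically whenever a factor appears
in a model formula. As a minimal sketch (with a small, hypothetical
factor), model.matrix() displays the indicator variables R creates:

    # A factor with three levels; the first level acts as the baseline
    A <- factor(c("level1", "level2", "level3", "level2", "level1"))

    # model.matrix() shows the columns R builds for a model formula:
    # a column of 1s for the baseline mean, plus a 0/1 indicator
    # column for each level of A other than level 1
    model.matrix(~ A)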
(a) Explain why Model (5) reduces to the form given in Model (4) when
the ith observation takes level k of factor A, for k = 2, 3, . . . , K.
(b) Confirm that Model (5) reduces to the form
Yi = µ + Wi
when the ith observation takes level 1 of factor A. (Note that this is
also the form given in Model (4) when the ith observation takes
level 1, since we have set α1 to be 0.)
As you saw in Activity 9, Model (5) reduces to the form given in Model (4)
when the ith observation takes level k of factor A. The indicator variables
therefore allow us to use the same model equation for all of the
observations, by basically switching effect terms (α2 , α3 , . . . , αK ) on and off
depending on which level of factor A is taken by each observation; this is
illustrated in Figure 9.
[Figure 9: switching effect terms on and off. When observation $i$ takes level 1 of $A$, the effect terms for levels 2 to $K$ are all switched off, leaving $Y_i = \mu + W_i$. When it takes level 2 of $A$, only $\alpha_2$ is switched on, giving $Y_i = \mu + \alpha_2 + W_i$. And so on, up to level $K$, for which only $\alpha_K$ is switched on, giving $Y_i = \mu + \alpha_K + W_i$.]
(a) Use the indicator variable Z2 to specify a model for height as a single
equation (that is, in the form of Model (5)).
(b) The first tree in the manna ash trees dataset is located on the west
side of the road, whereas the 28th tree is located on the east side of
the road. Write the models for Y1 and Y28 in the form of Model (4).
[Summary diagram: the model for response $Y$ with factor $A$ is based on the response means $\mu, \mu + \alpha_2, \ldots, \mu + \alpha_K$ for levels $1, 2, \ldots, K$ of $A$. Written as a separate model equation for each level of $A$, this gives $Y_i = \mu + W_i$ for level 1, $Y_i = \mu + \alpha_2 + W_i$ for level 2, and so on up to $Y_i = \mu + \alpha_K + W_i$ for level $K$. Defining the $K - 1$ indicator variables $z_{i2}, z_{i3}, \ldots, z_{iK}$ combines these into the one model equation $Y_i = \mu + \alpha_2 z_{i2} + \alpha_3 z_{i3} + \cdots + \alpha_K z_{iK} + W_i$.]
We have written the regression model for response Y with a factor A that
has K levels as the single model equation
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , i = 1, 2, . . . , n,
where zi2 , zi3 , . . . , ziK are all numerical variables (each taking the value 0
or 1), and µ, α2 , α3 , . . . , αK are model parameters to be estimated. Which
regression model has a model equation with the same form?
[Diagram: define $K - 1$ indicator variables $z_{i2}, z_{i3}, \ldots, z_{iK}$; model equation: $Y_i = \mu + \alpha_2 z_{i2} + \alpha_3 z_{i3} + \cdots + \alpha_K z_{iK} + W_i$]
3 Using the proposed model
Table 5 Parameter estimates for the model hourlyWageSqrt ∼ occ

Parameter            Estimate
µ (baseline mean)      4.489
α2 (occ level 2)      −0.304
α3 (occ level 3)      −0.515
α4 (occ level 4)      −1.011
α5 (occ level 5)      −1.022
α6 (occ level 6)      −1.383
α7 (occ level 7)      −1.435
From Table 1 (Subsection 1.1), the value of occ for the first
observation is 5, and so the fitted value for $Y_1$ is
\[ \widehat{y}_1 = \widehat{\mu} + \widehat{\alpha}_5 = 4.489 + (-1.022) = 3.467 = 3.47 \text{ (to 2 d.p.)}. \]
Indeed, the fitted values for all observations which take level 5 of occ
will also be 3.47.
As another example, the value of occ for the second observation is 3,
and so the fitted value for $Y_2$ is
\[ \widehat{y}_2 = \widehat{\mu} + \widehat{\alpha}_3 = 4.489 + (-0.515) = 3.974 = 3.97 \text{ (to 2 d.p.)}. \]
The fitted values for all observations which take level 3 of occ will
also be 3.97.
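To connect these estimates to the module’s software, here is a sketch of
the corresponding R commands; the data frame name wages is an
assumption, but the variable names match those used in this unit.

    # Fit the regression of hourlyWageSqrt on the factor occ
    wages$occ <- factor(wages$occ)  # ensure occ is treated as a factor
    fit <- lm(hourlyWageSqrt ~ occ, data = wages)

    # The intercept estimates the baseline mean (level 1 of occ); the
    # remaining coefficients estimate the effect terms for levels 2 to 7
    coef(fit)

    # Fitted values are identical for all observations sharing a level
    head(fitted(fit))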
From Table 1 (Subsection 1.1), the values of occ from the wages dataset for
the third, fourth and fifth observations are, respectively, 2, 6 and 1. Using
the parameter estimates produced by fitting the model
hourlyWageSqrt ∼ occ
given in Table 5 (in Example 5), calculate the fitted values $\widehat{y}_3$, $\widehat{y}_4$ and $\widehat{y}_5$.
Considering the context of the data, interpret the parameter estimates for
µ, α2 , α3 , . . . , α7 given in Table 5 (in Example 5).
In the next activity, you will look at the fitted model for modelling height
from the manna ash trees dataset using the factor side as the explanatory
variable.
Table 6 Parameter estimates for the model height ∼ side

Parameter            Estimate
µ (baseline mean)      6.636
α2 (side level 2)      1.880
(a) Considering the context of the data, interpret the parameter estimates
for µ and α2 given in Table 6.
(b) The first tree in the manna ash trees dataset is located on the west
side of the road, whereas the 28th tree is located on the east side of
the road. Calculate the fitted values for the first and 28th trees.
(c) What will the fitted values be for all of the trees on the west side of
the road? What will the fitted values be for all of the trees on the east
side of the road?
In addition to calculating fitted values for the observed data, we can use
the fitted model to predict the response $Y_0$ for a new observation whose
level of the factor we know. If we know the level of the factor for the new
observation, then we know the values of the indicator variables for the new
observation; that is, we know $z_{01}, z_{02}, \ldots, z_{0K}$. For example, if the new
observation takes level $k$ of the factor, then $z_{0k} = 1$, whereas the values of
the other indicator variables will all be zero. The predicted value $\widehat{y}_0$ of $Y_0$ is then
\[ \widehat{y}_0 = \widehat{\mu} + (\widehat{\alpha}_2 \times 0) + (\widehat{\alpha}_3 \times 0) + \cdots + (\widehat{\alpha}_k \times 1) + \cdots + (\widehat{\alpha}_K \times 0) = \widehat{\mu} + \widehat{\alpha}_k. \]
(If the new observation takes a completely different level of the factor, say
level ‘K + 1’, then this model does not help us predict the response $Y_0$.)
As for simple linear regression and multiple regression, prediction intervals
can also be calculated easily in R for the new response Y0 , but we won’t go
into the details of how these are calculated here.
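As a sketch of how such a prediction might be obtained in R (continuing
the hypothetical fit object from the earlier sketch):

    # Predict the response for a new observation taking level 3 of occ,
    # together with a 95% prediction interval
    new_obs <- data.frame(occ = factor(3, levels = levels(wages$occ)))
    predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)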
In this subsection, we’ve looked at the fitted model for regression with a
factor. We’ve used the context of the data to interpret the estimated
model parameters and used the fitted model to calculate both fitted values
and predicted values of the response. However, as with simple linear
regression, we need to test whether there is actually a relationship between
the response and the explanatory variable to model! We will do this next.
We will look at testing the relationships between the response and the
factor in the next two activities. Activity 18 will use the fitted model
presented in Activity 15 (Subsection 3.1) for the manna ash trees dataset,
and Activity 19 will use the fitted model presented in Example 5 (also in
Subsection 3.1) for the wages dataset.
The method for testing a relationship between the response and a factor
explanatory variable is summarised in Box 3.
Recall that Unit 2 also carried out individual tests (based on the
t-distribution) to test whether each regression coefficient is zero, to help
decide which covariates should be kept in the model. The same tests can
be done for each of the effect terms when the explanatory variable is a
factor – indeed, R automatically calculates the test statistics (t-statistics)
and their associated p-values for each of these tests when fitting the model.
However, as mentioned already in this unit, when the explanatory variable
is a factor, we cannot remove individual level effect terms from the model:
either all of the effect terms are in the model or none of the effect terms
are in the model. So, even if individual p-values suggest that there is no
evidence that a particular effect term should be in the model, the
associated effect term cannot be removed from the model (unless there is
no evidence from the F-statistic’s p-value of a relationship between the
response and the factor, in which case none of the effect terms would be in
the model).
Despite this, the p-values associated with testing whether each individual
effect term is zero are still useful, as they provide information regarding
the extent to which each level of the factor affects the response in
comparison to how the first level of the factor affects the response.
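In R, both sets of tests are readily available; a minimal sketch, again
using the hypothetical fit object from the earlier sketch:

    # Individual t-tests for each effect term (and the baseline mean):
    # one row per parameter, with t-statistics and p-values
    summary(fit)

    # The F-test of whether the factor as a whole is related to the
    # response appears in the analysis of variance table
    anova(fit)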
In Activities 20 and 21, we will consider the test statistics and p-values for
the individual level effect parameters for the fitted models for the wages
and manna ash trees datasets.
Explain why the p-values suggest that all of the level effect terms for the
factor occ are non-zero. Interpret what this means in the context of the
data.
Notice that the p-value associated with the effect term for level 2 of the
factor side is the same as the p-value associated with the F-statistic given
in Activity 18 when testing for a relationship between the response height
and the factor side. Explain why this is so.
4 Using R to fit a regression with a factor
Figure 12 Diagnostic plots for hourlyWageSqrt ∼ occ: (a) the residual plot, (b) the normal probability plot
We are ready to put these regression models into practice using R. We will
do this in the next section.
5 Analysis of variance (ANOVA)
◦ 5 (Skilled manual)
◦ 6 (Semi-skilled manual)
◦ 7 (Unskilled manual).
In Activity 19, we tested for a relationship between hourlyWageSqrt
and occ and concluded that there was indeed evidence for a
relationship between the two. Additionally, in Activity 20, we tested
the individual effect terms and concluded that there was strong
evidence that they were all non-zero.
The individual effect terms compare the effects on the response of
each occupation group with the effect on the response of being in the
‘professional’ occupation group. However, there might be other issues
which we’d like to investigate for these data. For example, we might
be interested in issues such as:
• comparing the effect on the response of manual occupations (that
is, occ levels 5, 6 and 7) with the effect of the other occupations
(that is, occ levels 1 to 4)
• comparing the effect on the response of the more senior occupations
(that is, occ levels 1 and 2) with the effect of the other occupations
(that is, occ levels 3 to 7)
• comparing the effect on the response within manual occupations, by
comparing the response for individuals classed as having a skilled
manual occupation (that is, occ level 5) with the response for
individuals classed as having an unskilled manual occupation (that
is, occ level 7).
Before going any further, let’s have a quick look at the lagoon dataset. The first few observations are:

salinity   waterMass
37.54      1
37.01      1
36.71      1
37.03      1
37.32      1
We are interested in taking the variable salinity from the lagoon dataset
as our response variable and the variable waterMass as a factor.
A comparative boxplot of the response over the three levels of waterMass
is shown in Figure 13.
Figure 13 A comparative boxplot of salinity (parts per thousand) over the three levels of waterMass (water mass code)
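A comparative boxplot like Figure 13 can be produced with one line of R; the data frame name lagoon is an assumption about how the data are stored.

# Sketch: comparative boxplot of salinity over the levels of waterMass
boxplot(salinity ~ waterMass, data = lagoon,
        xlab = "Water mass code",
        ylab = "Salinity (parts per thousand)")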
As the model
salinity ∼ waterMass
is likely to be a useful model for salinity for data from the lagoon
dataset, we will use these data to help us think about the model in a
different way.
Now, as we saw in Subsection 5.2.1 in Unit 2, one of the aims of modelling
the relationship between an explanatory variable and the response is to
explain some of the variation in the response. We illustrated what we
meant by this when an explanatory variable is a covariate, in Example 12
of Unit 2. In Example 7, we will illustrate how a regression model can help
to ‘explain the variation in the response’ when the explanatory variable is
a factor, using data from the lagoon dataset.
Figure 14 A boxplot of all values of salinity (parts per thousand)
Figure 15 Scatterplot of salinity (parts per thousand) against the associated observation number for the data given in Table 10, with the points identified by the level of waterMass (1, 2 or 3)
So, now that we have a (very small!) dataset to think about the model
salinity ∼ waterMass,
and Figure 15 to help us visualise things, we are ready to look at the ideas
behind what ANOVA is and how it can be used.
Recall from Subsection 5.2.1 in Unit 2 that when trying to assess the
extent to which the variation in the response can be explained by a
regression model with a covariate, we broke the variation of the response –
the total variation – into two types of variation:
• the explained variation – the variation that can be explained by our
model
• the residual variation – the variation that still remains and can’t be
explained by our model.
We then used the sums of squares – TSS, ESS and RSS – as measures of
the three types of variation, as follows.
• The TSS is the total sum of squares and gives a measure of how the
observed responses y₁, y₂, . . . , yₙ vary about the overall response mean ȳ:
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)².
• The ESS is the explained sum of squares and gives a measure of how the
fitted values ŷ₁, ŷ₂, . . . , ŷₙ vary about the overall response mean ȳ:
ESS = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)².
• The RSS is the residual sum of squares and gives a measure of how the
observed responses y₁, y₂, . . . , yₙ vary about the fitted values
ŷ₁, ŷ₂, . . . , ŷₙ:
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)².
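These three sums of squares are easy to compute directly from a fitted model. The sketch below uses the lagoon data (the data frame name is an assumption) and confirms that TSS = ESS + RSS.

# Sketch: computing TSS, ESS and RSS for salinity ~ waterMass
fit  <- lm(salinity ~ waterMass, data = lagoon)
y    <- lagoon$salinity
yhat <- fitted(fit)              # fitted values
TSS  <- sum((y - mean(y))^2)     # total variation about the mean
ESS  <- sum((yhat - mean(y))^2)  # variation explained by the model
RSS  <- sum((y - yhat)^2)        # residual variation
c(TSS = TSS, ESS = ESS, RSS = RSS)  # note that TSS = ESS + RSS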
Let’s start by looking at the total variation. The TSS gives a measure of
how the observed responses vary about the overall response mean ȳ, and is
based on the distances
y₁ − ȳ, y₂ − ȳ, . . . , yₙ − ȳ.
Figure 16 illustrates these distances for the reduced version of the lagoon
dataset, where the explanatory variable is a factor.
Figure 16 Scatterplot of salinity against observation number: the horizontal line marks the overall mean ȳ and the dotted vertical lines show the distances yᵢ − ȳ used to calculate the TSS
Now let’s consider the variation that can be explained by our model. The
ESS gives a measure of the variation explained by our model by
considering how the fitted values for the model vary about the overall
response mean ȳ. The ESS is therefore based on the distances
ŷ₁ − ȳ, ŷ₂ − ȳ, . . . , ŷₙ − ȳ.
These distances are illustrated for the reduced version of the lagoon
dataset in Figure 17, which follows. Remember that, when the explanatory
variable is a factor, observations which have the same level of the factor
have the same model and hence the same fitted values.
Figure 17 Scatterplot of salinity against observation number: the fitted values are shown as red dots and the dotted vertical lines show the distances ŷᵢ − ȳ used to calculate the ESS
Finally, we’ll consider the variation which can’t be explained by the model.
The RSS gives a measure of the residual variation by considering how the
observed values of the response vary about the fitted values, and is based
on the distances
y₁ − ŷ₁, y₂ − ŷ₂, . . . , yₙ − ŷₙ.
These distances are illustrated for the reduced version of the lagoon
dataset in Figure 18. (Once again, remember that when we have a factor,
observations which take the same level of the factor have the same fitted
values.)
Figure 18 Scatterplot of salinity against observation number: the fitted values are shown as red dots and the dotted vertical lines show the distances yᵢ − ŷᵢ used to calculate the RSS
By considering Figures 16, 17 and 18, which of the explained and residual
sum of squares seems to contribute more to the total sum of squares?
Following on from Activity 25, it seems that the ESS is larger than the
RSS for the model
salinity ∼ waterMass.
So, does that mean that the variation in the response explained by the
model must be larger than the residual variation, indicating that the
model may be useful for the data? Well, not necessarily. The ESS and RSS
are only part of the story, and in order to compare the explained variation
with the residual variation, we need to scale both the ESS and the RSS so
that we have variance estimates of the two sources of variation which can
then be compared. (This scaling is so that we have unbiased variance
estimates of each type of variation, in the same way that the total variation
in the n responses is estimated by dividing the TSS by n − 1.) In the next
subsection, you will see how these two variance estimates can be compared.
For the response salinity and factor waterMass (which has three levels),
the model
salinity ∼ waterMass
was fitted using the 30 observations in the lagoon dataset.
(a) For these data, the ESS is calculated to be 38.80, whereas the RSS is
calculated to be 7.93 (both values given to two decimal places). Show
that the F -value associated with the ANOVA test for this model and
these data is approximately 66.
(b) The p-value associated with this F -value is reported to be less
than 0.001. What do you conclude?
You may well be wondering why we would bother with ANOVA when the
F -value and its associated p-value are the same values as the F -statistic
and its associated p-value! Well, you’ll be glad to know that there is a
good reason to bother with ANOVA, so studying this section has not been
a complete waste of time!
It turns out that the ESS can be partitioned further into more sums of
squares, each of which relates to the response variation associated with
subsets of A’s factor levels. To illustrate, think about the model
hourlyWageSqrt ∼ occ
for the wages dataset considered in Sections 1 to 3. For this model, one of
the sums of squares in the partitioning of ESS could relate to the variation
in hourlyWageSqrt associated with whether an occupation can be
described as manual or non-manual. ANOVA techniques can then be used
to test whether an occupation being manual or being non-manual helps to
explain the variation in hourlyWageSqrt. This test will generate an
F -value and associated p-value, but these values will not be the same as
the F -statistic and its associated p-value generated by the regression
model. This is because we’re no longer testing whether factor occ explains
the variation in hourlyWageSqrt, but instead looking at the factor more
closely and testing whether the fact that the occupation is manual or
non-manual explains the variation in hourlyWageSqrt.
We will consider partitioning the ESS and the subsequent use of ANOVA
techniques in Section 7, but first, in the next subsection, we will introduce
a widely used way for presenting ANOVA results.
Let’s consider the different columns of Table 11, the general form of an
ANOVA table, in turn.
• The first column, labelled ‘Source of variation’, identifies the type of
variation which each row in the ANOVA table refers to. The first row is
the explained variation, which is labelled ‘Factor A’ here because A is
the only explanatory variable for this model. The next row then refers
to the residual variation, while the last row refers to the total variation.
• The second column, labelled ‘Degrees of freedom’, gives the
denominators for the variance estimates for the associated sources of
variation. (The name ‘degrees of freedom’ refers to the associated null
distribution of the F -value.)
• The third column, labelled ‘Sum of squares’, gives the sum of squares for
each of the associated sources of variation.
• The fourth column, labelled ‘Mean square’, gives the variance estimates
for the associated sources of variation (and, as you have no doubt
gathered, each of these estimates is known as a ‘mean square’). Note
that ANOVA tables only include mean square values for the factor and
the residual.
• Finally, the columns labelled ‘F -value’ and ‘p-value’ give, respectively,
the F -value and p-value associated with the ANOVA test.
Table 11 The general form of the ANOVA table for the model Y ∼ A

Source of variation   Degrees of freedom   Sum of squares   Mean square    F-value                       p-value
Factor A              K − 1                ESS              ESS/(K − 1)    (ESS/(K − 1))/(RSS/(n − K))   p
Residual              n − K                RSS              RSS/(n − K)
Total                 n − 1                TSS
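In R, a table with these columns is produced by applying anova() to a fitted model; a minimal sketch follows (the data frame name lagoon is an assumption). Note that R's table omits the 'Total' row, which can be recovered by summing the other rows.

# Sketch: an ANOVA table for salinity ~ waterMass
fit <- lm(salinity ~ waterMass, data = lagoon)
anova(fit)  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)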
length treatment
75 none
67 none
70 none
75 none
65 none
The model
length ∼ treatment
was fitted using data from the pea growth dataset.
The ESS and RSS were calculated to be 1077.3 and 245.5, respectively.
Complete the ANOVA table given in Table 15 for this model and these
data. What do you conclude?
Table 15 Incomplete ANOVA table for length ∼ treatment for the pea growth
dataset
Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment
Residual
Total
7 Analysing the effects of the factor levels further
Let
θsugar = µsugar − µno sugar .
The parameter θsugar is known as a contrast because it allows us to
compare – that is, contrast – the mean responses of two groups of levels of
our original factor in order to address a research question; if θsugar is large
in magnitude, then this would suggest that the means for the two groups
are different, implying that sugar does affect pea growth. As Section 7
progresses, you will see how we can learn about this contrast as part of an
ANOVA analysis.
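As a sketch of the sample version of this contrast (the data frame name peas and the treatment level labels are assumptions, not code given by the module):

# Sketch: estimating theta_sugar as the difference between the mean
# response for the sugar treatments and for the 'none' treatment
sugar_levels <- c("glucose", "fructose", "gluc + fruct", "sucrose")
in_sugar <- peas$treatment %in% sugar_levels
theta_hat <- mean(peas$length[in_sugar]) - mean(peas$length[!in_sugar])
theta_hat  # a large magnitude suggests sugar affects pea growth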
3. Then we have:
θcontrast = µ1 − µ2 .
That is, the contrast θcontrast compares the response means for the
two groups of data.
So far in this section we have just been considering the pea growth dataset.
This dataset is not unique in raising research questions that can be
answered using contrasts. In Activity 31 you will define a contrast to
address a question about the wages dataset.
So, once we have defined a contrast to compare the response means of the
two groups of levels, we then need to decide whether or not the contrast is
large enough to conclude that the response means of the two groups are
different. We therefore want to test the hypotheses:
H0 : θcontrast = 0 (that is, the response means are the same),
H1 : θcontrast ̸= 0 (that is, the response means are different).
We will do this in the next subsection.
Model: Y ∼ A
For this model, the following three tests are equivalent: testing whether
the effect terms are non-zero, testing whether A explains the variation
in Y, and testing whether the response means differ across the levels of A.
The F -statistic from the regression equals the F -value from the ANOVA,
and the two p-values are the same.
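This equivalence is easy to verify in R (a sketch; the data frame name is an assumption): the F -statistic reported by summary() for the regression and the F -value in the ANOVA table are the same number.

# Sketch: the regression F-statistic equals the ANOVA F-value
fit <- lm(salinity ~ waterMass, data = lagoon)
summary(fit)$fstatistic[1]  # F-statistic from the regression output
anova(fit)$"F value"[1]     # F-value from the ANOVA table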
In the next activity you will consider what a p-value from such an analysis
tells us about the variation of the response means across the factor levels.
We can then carry out an ANOVA analysis of a model using the new
factor sugar, that is, using the model
length ∼ sugar,
to test whether or not the response means for the two levels of sugar
differ – in other words, whether or not µsugar differs from µno sugar –
and therefore whether or not sugar affects pea growth.
Since the contrast θsugar was defined by partitioning the levels of
treatment, the level of our new factor sugar for each observation is
directly determined by the level of treatment for that observation. We
can therefore specify the value of sugar for each observation simply from
the values of treatment. This is what we will do next in Activity 33.
(According to the Yes Peas website (British Growers Association, 2022),
the world record for eating peas is held by Janet Harris of Sussex who, in
1984, ate 7175 peas one by one in 60 minutes using chopsticks!)
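A sketch of this step in R (the data frame name and the level labels are assumptions): the new factor is derived directly from treatment.

# Sketch: defining sugar to be 'no' for the 'none' treatment and
# 'yes' for the four treatments containing sugar
peas$sugar <- factor(ifelse(peas$treatment == "none", "no", "yes"))
table(peas$treatment, peas$sugar)  # check the derived levels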
Table 16 gives the values of the response length and the factor treatment
for some of the observations recorded for the pea growth dataset. For these
observations, specify the values of the new factor sugar.
Table 16 Some of the observations from the pea growth dataset
So far, in our quest to test the contrast θsugar using data from the pea
growth dataset, we have defined the new factor sugar. We now will carry
out an ANOVA analysis for the model
length ∼ sugar.
The resulting explained sum of squares for this fitted model – which we’ll
denote here by ESSsugar – will then give a measure of the variation in the
data which is explained by whether or not a treatment contains sugar.
Because the contrast θsugar was defined by partitioning the levels of the
factor treatment, the explained sum of squares associated with θsugar
(that is, ESSsugar ) is in fact part of the explained sum of squares for our
original model length ∼ treatment (that is, ESS). So, rather like TSS
partitions into ESS and RSS, ESS can be partitioned into the explained
sums of squares associated with a set of contrasts defined by partitioning
the levels of the original factor.
Figure 21 Scatterplots of length against observation number: the fitted values are shown as red dots and the dotted vertical lines show the distances used to calculate (a) ESS and (b) ESSsugar
We can now use ESSsugar to assess whether the variation in pea growth can
be explained by whether or not a treatment contains sugar. If we conclude
that the factor sugar does explain length, then we can conclude that the
two means – µsugar and µno sugar – are different, and so sugar does affect
growth of pea sections.
Now, we could compare ESSsugar with the residual sum of squares for the
model
length ∼ sugar.
However, our new factor sugar was defined from the original factor
treatment, and so we compare ESSsugar with the residual sum of squares
for the original model length ∼ treatment instead.
For the pea growth dataset (which includes data for 50 pea sections), we
have
ESSsugar = 832.3,
and the residual sum of squares for the model length ∼ treatment is
RSS = 245.5.
The ANOVA table can be extended to include the results for testing a
contrast, in addition to the usual ANOVA table information, so that all of
the results are summarised in one place.
The general form of the extended ANOVA table is summarised in Box 8.
Notice that the extended ANOVA table shown in Box 8 is the same as the
ANOVA table for the model Y ∼ A (given in Box 5, Subsection 5.4), but
with an additional row relating to the contrast θcontrast . The total row still
relates to the model Y ∼ A, so it is still true that TSS = ESS + RSS.
The notation ‘A: θcontrast ’ has been used in the first column of the extended
ANOVA table in Table 17, to emphasise that the contrast θcontrast was
defined from factor A and the ANOVA analysis associated with θcontrast is
assessed as part of the ANOVA analysis for the model Y ∼ A.
Example 11 shows what an extended ANOVA table looks like in practice.
Table 18 The extended ANOVA table for length ∼ treatment and the contrast θsugar

Source of variation    Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment              4                    1077.3           269.33        49.37     < 0.001
treatment: θsugar      1                    832.3            832.30        152.56    < 0.001
Residual               45                   245.5            5.46
Total                  49                   1322.8
Notice that the extended ANOVA table is exactly the same as the
ANOVA table obtained in Activity 30, except that there is an extra
row presenting the results for the contrast θsugar considered in the
analysis.
Notice also that ESSsugar is less than ESS. This is because, as
mentioned earlier, ESS can be partitioned into ESSsugar and other
explained sums of squares associated with a set of contrasts.
Next, in Activity 36 you will use an extended ANOVA table that has
already been created to draw conclusions about a research question.
Consider once again the wages dataset of 3331 individuals and the model
hourlyWageSqrt ∼ occ.
where
µmanual = mean response across levels 5, 6 and 7 of occ,
µnon-manual = mean response across levels 1, 2, 3 and 4 of occ.
The extended ANOVA table for this model and the contrast θmanual is
given in Table 19.
Table 19 The extended ANOVA table for hourlyWageSqrt ∼ occ and the
contrast θmanual
(Figure: the groups of treatment containing sugar, partitioned into ‘no sucrose’ and ‘sucrose’.)
Define a contrast, such as θmix , which can be used to investigate the final
question concerning the pea growth data:
• How does the effect of a treatment containing a mix of both glucose and
fructose (that is, treatment gluc + fruct) compare with the effects of
treatments containing glucose or fructose alone?
Following on from Activity 37, note that, as with θsucrose , we are not
considering all of the levels of treatment when specifying the contrast
θmix ; this time, we are partitioning the levels in the group ‘no sucrose’
defined with our contrast θsucrose . This is illustrated in Figure 23.
Figure 23 Partitioning the levels of treatment when defining the contrasts (a) θsucrose and (b) θmix: for θsucrose, the groups of treatment containing sugar are ‘no sucrose’ and ‘sucrose’; for θmix, the groups of treatment containing sugar, but no sucrose, are ‘no mix’ and ‘mix’
So, we have now defined two contrasts, θsucrose and θmix , and to help us
investigate the two questions of interest we can test the two sets of
hypotheses:
H0 : θsucrose = 0, H1 : θsucrose ̸= 0,
and
H0 : θmix = 0, H1 : θmix ̸= 0.
Following the methods described for testing θsucrose , briefly outline how we
can test the contrast θmix .
All three contrasts, θsugar , θsucrose and θmix , can be added to an extended
ANOVA table as shown in Table 20.
Table 20 The extended ANOVA table for length ∼ treatment and the
contrasts θsugar , θsucrose and θmix
Source of variation    Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment              4                    1077.3           269.33        49.37     < 0.001
treatment: θsugar      1                    832.3            832.30        152.56    < 0.001
treatment: θsucrose    1                    235.2            235.20        43.11     < 0.001
treatment: θmix        1                    9.3              9.30          1.70      0.198
Residual               45                   245.5            5.46
Total                  49                   1322.8
Notice that the extended ANOVA table is once again the same as the
ANOVA table for the model length ∼ treatment obtained in Activity 30
(Subsection 5.4), but with three additional rows relating to the three
contrasts.
Also notice that, because ESS partitions into the explained sums of
squares for a set of contrasts,
ESSsugar + ESSsucrose + ESSmix = 832.3 + 235.2 + 9.3
= 1076.8,
which is less than the ESS value of 1077.3.
In the next activity you will make use of this extended ANOVA table to
draw conclusions about the pea growth dataset.
Given Table 20, what do you conclude about the effects of the treatments
on pea growth?
When analysing the pea growth dataset, we used the three contrasts θsugar ,
θsucrose and θmix . We could, however, have used different contrasts – the
choice of contrasts depends on the research questions of interest. For
example, we might have wanted to define a contrast to compare the effects
of using a treatment with a mixture of sugars (that is, gluc + fruct) with
the treatments which only contain single sugars (that is, glucose, fructose
and sucrose), or we might have wanted to define a contrast to compare the
effects of treatments containing glucose (that is, glucose and gluc + fruct)
with the other sugar treatments.
Whichever set of contrasts is used for an analysis, there are rules which
need to be followed when specifying multiple contrasts, to ensure that the
associated tests are valid, as summarised in Box 9.
(b) Explain why we couldn’t use either of the contrasts θmanual or θunskilled
to define a third contrast to compare the effects on the square root of
the hourly wage of the more senior occupations (that is, occ levels 1
and 2) with the effects of the other occupations (that is, occ levels 3
to 7).
(c) What is the maximum number of contrasts that could be defined for
this model and these data?
We will use R to obtain the extended ANOVA table for the wages dataset
and test the two contrasts θmanual and θunskilled , considered in Activity 40,
in the (very short!) final section of this unit next.
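As a preview, one common way of doing this in R is sketched below; it assumes the wages data frame stores occ as a factor with its levels in the order 1 to 7, and it is not necessarily the code the module itself uses.

# Sketch: planned contrasts for theta_manual (levels 5-7 vs 1-4) and
# theta_unskilled (level 7 vs levels 5 and 6)
manual    <- c(-1/4, -1/4, -1/4, -1/4, 1/3, 1/3, 1/3)
unskilled <- c(0, 0, 0, 0, -1/2, -1/2, 1)
contrasts(wages$occ) <- cbind(manual, unskilled)

fit <- aov(hourlyWageSqrt ~ occ, data = wages)
# 'split' adds a row to the ANOVA table for each named contrast
summary(fit, split = list(occ = list(manual = 1, unskilled = 2)))

The two coefficient columns each sum to zero and are orthogonal to each other, which is the usual requirement when several contrasts are specified at once.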
Summary
This unit has focused on regression when there is a single factor (a
categorical explanatory variable). A regression model with a factor needs
to be treated differently to a regression model with a covariate: the focus
of regression with a covariate is a fitted line, whereas the focus of
regression with a factor is based on the response means for the different
levels of the factor.
In order to model the relationship between the response and the factor
across the factor’s levels, we use a baseline mean which is common to all
levels, together with level effects which, for each level, model how the
mean response for that level differs from the baseline mean. In this unit,
we have used R’s default convention of defining the baseline mean to be
the mean response for level 1 of the factor.
Indicator variables can be used to identify which level of the factor each
observation takes. We can then use the indicator variables to express the
different model equations associated with the different levels of the factor,
by a single model equation. By doing this, our model has the form of a
multiple regression model, where the indicator variables are the covariates.
As such, we can use multiple regression techniques to fit our model and
check the model assumptions.
However, the model with a factor is not interpreted or used in the same
way as a multiple regression model is. When we have a factor, each
regression coefficient is an effect term representing how the associated level
of the factor affects the response in comparison to how level 1 of the factor
affects the response. Also, even though we have multiple indicator
variables (for the levels of the factor), we still only have one explanatory
variable, and so we either need to include all of the indicator variables as
covariates in the model, or none of the indicator variables.
We can then test for a relationship between the response and the factor by
calculating the F -statistic for testing whether all of the effect terms are
zero; if the associated p-value is small, then this implies that there is a
relationship between the response and the factor.
In the second half of the unit, we introduced the idea of ANOVA (analysis
of variance). The focus of ANOVA is to assess whether the factor helps to
explain the variation in the response by comparing the variation which can
be explained by the model, with the variation which is left unexplained by
the model, through the F -value. If the p-value associated with the F -value
is small, then this implies that the factor does help to explain the variation
in the response. The ANOVA table is widely used to summarise the results
from an ANOVA analysis.
We can use ANOVA techniques to analyse the effects of the factor levels
further. To do this, we define contrasts, which compare the mean responses
for two (non-overlapping) groups of levels of the factor; a large value of a
contrast would imply that there is a difference in the mean responses for
the two groups, implying that the difference between the levels in the two
groups affects the response. The contrast is tested using another F -value;
if the associated p-value is small, then this implies that the difference
between the levels in the two groups does affect the response. The results
from testing contrasts can be combined with the ANOVA table for the
factor in an extended ANOVA table.
In this unit, R was used for regression with a factor, for obtaining ANOVA
tables, and for specifying contrasts and obtaining the associated extended
ANOVA table.
As a reminder of what has been studied in Unit 3 and how the different
sections link together, the route map for the unit is repeated below.
Section 1: Regression with a factor: the basic idea
Section 2: Developing the model further
Section 3: Using the proposed model
Section 4: Using R to fit a regression with a factor
Section 5: Analysis of variance (ANOVA)
Section 6: Using R to produce ANOVA tables
Section 7: Analysing the effects of the factor levels further
Section 8: Using R to produce extended ANOVA tables
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate the difference between a covariate and a factor
• appreciate that regression with a factor is based on the response means
for the levels of a factor
• explain how the relationship between the response and the factor can be
modelled through a baseline mean (common to all factor levels) and
individual effect terms, representing how the response mean for each
level differs from the baseline mean
• interpret the baseline mean and individual effect terms for a model fitted
to a specific dataset
• use indicator variables to express the model as a multiple regression
model
• explain the differences in interpretation and use of a multiple regression
model (with q covariates) and a regression model with a single factor
using indicator variables
• use parameter estimates to calculate fitted values and point predictions
• appreciate that the F -statistic (used for testing whether all of the effect
terms are zero) can be used to test for a relationship between the
response and the factor
• explain why we either need to include all of the effect terms in the
model, or none of them
• check the model assumptions using residual plots and normal probability
plots
• fit a regression model with a factor in R and be able to interpret the
output
• appreciate the idea behind ANOVA (analysis of variance) as a method of
assessing the extent to which the variation in the response can be
explained by the factor
• explain the ideas behind the sums of squares TSS, ESS and RSS when
the explanatory variable is a factor
• interpret the results from the ANOVA test based on the F -value test
statistic and its associated p-value
• complete and interpret an ANOVA table
• use R to produce an ANOVA table
• appreciate the ideas behind contrasts for investigating the effects on the
response of groups of levels of the factor
• define a contrast to address a particular research question
• interpret the results from testing a contrast based on the test statistic
Fcontrast and its associated p-value
References
British Growers Association (2022) Pea facts. Available at:
https://peas.org/pea-facts/ (Accessed: 13 February 2022).
Oppenheim, M. (2020) ‘Surge in women applying for manual jobs after
wording in adverts made less “masculine”’, The Independent, 22 June.
Available at: https://www.independent.co.uk/news/uk/home-news/
women-job-advert-london-thames-water-process-technician-a9579251.html
(Accessed: 13 February 2022).
Sokal, R.R. and Rohlf, F.J. (1981) Biometry: the principles and practice of
statistics in biological research. 2nd edn. San Francisco: W.H. Freeman.
Taylor, K. (1999) Male earnings dispersion over the period 1973 to 1995 in
four industries. Unpublished PhD thesis. The Open University.
Till, R. (1974) Statistical methods for the Earth scientist. London:
Macmillan.
Woodland Trust (no date) Ash. Available at:
https://www.woodlandtrust.org.uk/trees-woods-and-wildlife/british-
trees/a-z-of-british-trees/ash/ (Accessed: 28 January 2022).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, spending a wage: © Hanna Kuprevich / www.123rf.com
Subsection 2.1, variable dummy: © Prapass Wannapinij / www.123rf.com
Subsection 2.1, manager: © Wavebreak Media Ltd / www.123rf.com
Subsection 2.1, pH indicator paper: © Bjoern Wylezich / www.123rf.com
Subsection 2.1, group of individuals: © rawpixel / www.123rf.com
Subsection 2.2, fireworks: © Maksim Pasko / www.123rf.com
Subsection 3.1, olive oil: © rrraven / www.123rf.com
Subsection 3.2, pay day: © EdZbarzhyvetsky / www.create.vista.com
Subsection 5.1, Bimini Lagoon: © andydidyk / iStock / Getty Images Plus
Subsection 5.2, excited person: © deagreez / www.123rf.com
Subsection 5.3, jumping for joy: © nasrul0412 / www.123rf.com
Subsection 5.4, dressed as a pea: © Mark Bowden / www.123rf.com
Subsection 7.1, manual worker: © Visoot Uthairam | Dreamstime.com
Subsection 7.2, peas: © izikmd / www.123rf.com
Subsection 7.3, jeweller: © Olga Yastremska / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
Both workHrs and educAge are numerical variables, and so are potential
covariates.
Although edLev and occ have numerical values in the dataset, these are in
fact just numerical codes that represent the different values of categorical
variables. As such, both edLev and occ are potential factors.
The remaining two variables, gender and computer, are clearly categorical
and not numerical, and so these two are also potential factors.
Solution to Activity 2
There seems to be a potential error for the fifth individual in the wages
dataset where the value of educAge – the age, in years, at which the
individual ceased education – is just 2. This seems extremely unlikely,
especially given the individual’s education is high (since edLev is 4) and
they are in a professional occupation (since occ is 1).
Solution to Activity 3
From the description of the wages dataset, occ takes seven possible values
and so the factor occ has seven levels.
Solution to Activity 4
Both the boxes and the median lines in the comparative boxplot seem to
follow the general trend – that hourlyWageSqrt seems to increase as the
level codes of occ decrease. Therefore, there seems to be a negative
relationship between hourlyWageSqrt and the level codes of occ.
Solution to Activity 5
The value of occ for the ith individual is 1 and the (population) mean
response for individuals for which occ takes the value 1 is µ1 .
From Figure 6, we can see that the observed responses for individuals for
which occ is 1 vary about the sample mean (the large red circle). So,
following the simple linear regression model given in Model (1) and
illustrated in Figures 4 and 5, we can capture this variation by using a
random term.
A possible model for Yi is therefore given by
Yi = µ1 + Wi , Wi ∼ N (0, σ 2 ).
Solution to Activity 6
For the second observation in Table 1, the value of occ is 3. Therefore,
Model (2) can be written as
Y2 = µ3 + W2 , W2 ∼ N (0, σ 2 ),
where µ3 is the (population) mean of the responses for observations for
which occ is 3.
Solution to Activity 7
The value of occ for the second observation in Table 1 is 3, and so
Model (3) for Y2 has the form
Y2 = µ + α3 + W2 , W2 ∼ N (0, σ 2 ),
where
µ = mean response for observations for which occ is 1,
α3 = effect on the response of occ being 3 in comparison to
when occ is 1.
Similarly, since the values of occ are 2 and 6 for Y3 and Y4 , respectively,
we have the model forms
Y3 = µ + α2 + W3 , W3 ∼ N (0, σ 2 ),
Y4 = µ + α6 + W4 , W4 ∼ N (0, σ 2 ),
where µ is as defined above and
α2 = effect on the response of occ being 2 in comparison to
when occ is 1,
α6 = effect on the response of occ being 6 in comparison to
when occ is 1.
The model form for Y5 , however, looks slightly different: this is because
the value of occ for the fifth observation is 1. This means that the model
for Y5 is given simply as
Y5 = µ + W5 , W5 ∼ N (0, σ 2 ),
since the baseline mean has been defined to be the mean of level 1 and
α1 = 0.
Solution to Activity 8
The indicator variable Z4 indicates whether or not the ith observation
takes level 4 of occ or not. As a result, Z4 will only be 1 for those
observations which take level 4 of occ, whereas the rest will be 0. But
none of these five observations take level 4 of occ, and so all of the values
of Z4 in Table 4 will be 0. The other values of the indicator variables are
found similarly.
Notice that the values of all of the indicator variables are 0 for the fifth
observation, which takes level 1 of occ. This is because we don’t need an
indicator variable for level 1 of occ, since the effect of level 1 is assumed to
be part of the baseline mean µ.
Solution to Activity 9
(a) If the ith observation takes level k of factor A, then the indicator
variable zik takes the value 1, whereas the other indicator variables all
take the value 0. Model (5) then becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αk × 1) + · · · + (αK × 0) + Wi
= µ + αk + Wi ,
which is the form given in Model (4).
(b) If the ith observation takes level 1 of factor A, then all of the indicator
variables zi2 , zi3 , . . . , ziK will be zero. In this case, Model (5) becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αK × 0) + Wi
= µ + Wi ,
as required.
Solution to Activity 10
(a) Using the indicator variable Z2 , for each i = 1, 2, . . . , 42, we can write
the model
height ∼ side
in the form
Yi = µ + α2 zi2 + Wi , Wi ∼ N (0, σ 2 ).
(b) The first tree in the manna ash trees dataset is located on the west
side of the road. This means that side takes level 2 for this
observation, so that z12 = 1. Therefore, using the answer to part (a),
the model for Y1 can be written as
Y1 = µ + (α2 × 1) + W1 , W1 ∼ N (0, σ 2 ),
that is,
Y1 = µ + α2 + W1 , W1 ∼ N (0, σ 2 ).
The 28th tree in the manna ash trees dataset is located on the east
side of the road. This means that side takes level 1 for this
observation, so that z28,2 = 0. Therefore, using the answer to part (a),
the model for Y28 can be written as
Y28 = µ + (α2 × 0) + W28 , W28 ∼ N (0, σ 2 ),
that is,
Y28 = µ + W28 , W28 ∼ N (0, σ 2 ).
Solution to Activity 11
(a) Using the indicator variables Z2 , Z3 , . . . , Z7 , for each i = 1, 2, . . . , 3331,
the model can be written as a single equation with the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + α7 zi7 + Wi , Wi ∼ N (0, σ 2 ).
(b) For the first observation, occ takes level 5. Therefore, the value of z15
is 1, whereas z12 , z13 , z14 , z16 , z17 are all zero. Therefore, the model
from part (a) for Y1 becomes
Y1 = µ + (α2 × 0) + (α3 × 0) + (α4 × 0) + (α5 × 1)
+ (α6 × 0) + (α7 × 0) + W1
= µ + α5 + W1 ,
which is indeed the same as the model for Y1 obtained in Example 2.
Solution to Activity 12
A multiple regression model equation has the same form. We can think of
µ as the intercept parameter, α2 , α3 , . . . , αK as the regression coefficients
and zi2 , zi3 , . . . , ziK as covariates.
Solution to Activity 13
The value of occ for the third observation is 2, and so the fitted value for
Y3 is
ŷ₃ = µ̂ + α̂₂ = 4.489 + (−0.304) = 4.185 = 4.19 (to 2 d.p.).
The value of occ for the fourth observation is 6, and so the fitted value for
Y4 is
ŷ₄ = µ̂ + α̂₆ = 4.489 + (−1.383) = 3.106 = 3.11 (to 2 d.p.).
The value of occ for the fifth observation is 1, and as α1 is set to zero this
means the fitted value for Y5 is simply
ŷ₅ = µ̂ = 4.489 = 4.49 (to 2 d.p.).
Solution to Activity 14
The parameter µ is our baseline mean for level 1 of occ – this is the mean
of the square root of the hourly wage for individuals in a professional
occupation.
Each of the α parameters (effect terms) is the mean of the square root of
the hourly wage for individuals with the corresponding level of occ
compared with the mean of the square root of the hourly wage for those in
professional occupations.
Since all of the α parameters are negative, this means that the mean of the
square root of the hourly wages for individuals in occupations other than a
professional occupation are lower than for individuals in a professional
occupation.
What’s more, the estimates of the α parameters are decreasing as the level
codes of occ increase. This means that the mean of the square root of the
hourly wage decreases as the level codes of occ increase. But, since the
coding for occ is such that the occupation skill level decreases as the
associated codes increase, this in turn means that the mean of the square
root of the hourly wage decreases as the occupation skill level also
decreases.
Solution to Activity 15
(a) The parameter µ is our baseline mean for level 1 of side. Since ‘east’
has been taken to be level 1 of side, this means that µ represents the
mean height of trees on the east side of the road.
The α2 parameter is the effect term for the second level of side.
Therefore, α2 provides a measure of how the mean height of trees on
the west side of the road compares with the mean height of trees on
the east side of the road.
Since the estimated value of α2 is positive, this means that the mean
height of trees on the west side of the road is higher than the mean
height of trees on the east side of the road.
(b) The first tree is located on the west side of the road. This means that
this tree takes level 2 of side, and so z12 = 1. Therefore, the fitted
value, in metres, for Y1 is
ŷ₁ = µ̂ + α̂₂ = 6.636 + 1.880 = 8.516 = 8.52 (to 2 d.p.).
The 28th tree is located on the east side of the road, so this tree takes
level 1 of side and z28,2 = 0. Therefore, the fitted value, in metres,
for Y28 is
ŷ₂₈ = µ̂ = 6.636 = 6.64 (to 2 d.p.).
(c) Since the first tree is located on the west side of the road, the fitted
values for all of the trees on the west side of the road will be the same
as ŷ₁, that is, 8.52 (to 2 d.p.).
Similarly, the 28th tree is located on the east side of the road, and so
the fitted values for all of the trees on the east side of the road will be
the same as ŷ₂₈, that is, 6.64 (to 2 d.p.).
Solution to Activity 16
The value of occ is 7, and so the predicted value for Y0 is
ŷ₀ = µ̂ + α̂₇ = 4.489 + (−1.435) = 3.054.
This gives us the predicted value of the response, which is the square root
of the hourly wage. We were asked to predict the hourly wage for this
individual (rather than the square root), and so we need to square the
value of yb0 to obtain the value required. So the predicted hourly wage in £
for this individual is
3.054² ≃ 9.33 (to 2 d.p.).
Solution to Activity 17
When the regression model with a factor is expressed in terms of indicator
variables, we can think of the model as a multiple regression model where
the effect parameters α2 , α3 , . . . , αK are regression coefficients. The
hypotheses are then equivalent to the hypotheses (in Subsection 1.3.1 of
Unit 2) which were formulated to test if all of the regression coefficients in
a multiple regression model are zero.
The p-value associated with this test was calculated using the
F -distribution with q and n − (q + 1) degrees of freedom, where q is the
number of regression coefficients in the model. Therefore, using the same
distribution here, the p-value associated with our test will be based on the
F -distribution with K − 1 and n − ((K − 1) + 1) = n − K degrees of
freedom (since there are K − 1 regression coefficients for the K − 1
indicator variables).
Solution to Activity 18
The p-value is 0.00107, which is small. There is therefore evidence to reject
H0 and we conclude that there is a relationship between height and side.
Solution to Activity 19
The p-value is less than 0.001, and so is very small. Therefore, there is
evidence to reject H0 and we can conclude that there is a relationship
between hourlyWageSqrt and occ.
Solution to Activity 20
All of the p-values in the table are very small. There is therefore (strong)
evidence to suggest that all of the level effect terms are non-zero.
This means that the square root of the hourly wage is significantly
different for individuals in each of the occupation groups in comparison to
the first occupation group – that is, in comparison to individuals classed as
being in a professional occupation. What’s more, since each of the level
effect terms is negative, the square root of the hourly wage is significantly
lower for individuals in each of the occupation groups in comparison to
individuals in a professional occupation.
Solution to Activity 21
Since the factor side has two levels, the associated model using indicator
variables for the levels of side has just one indicator variable (for level 2
of side). Therefore, a test of whether all of the effect terms are zero is
testing the same thing as a test of whether the individual effect term for
level 2 of side is zero. Since we are testing the same thing using the same
data, we should get the same result – that is, the same p-value.
(Note however, that the values of the associated test statistics are not the
same values. This is because the two tests are calculating different test
statistics and calculating the associated p-values using different
distributions – the t-distribution for the t-value and the F -distribution for
the F -statistic.)
Solution to Activity 22
(a) From Unit 2 (Box 7, Subsection 3.1), the model assumptions
underlying multiple regression are that:
• the relationship between each of the explanatory variables
x1 , x2 , . . . , xq and Y is linear
• the random terms Wi , i = 1, 2, . . . , n, are independent
• the random terms Wi , i = 1, 2, . . . , n, all have the same variance σ 2
across the values of x1 , x2 , . . . , xq
• the random terms Wi , i = 1, 2, . . . , n, are normally distributed with
zero mean and constant variance, N (0, σ 2 ).
When each explanatory variable is an indicator variable, the linearity
assumption is automatically satisfied. This is because each indicator
variable only takes two possible values (0 or 1) and so a straight line
will always go through the fitted values associated with these two
values. Also, because the fitted value for each level of the factor is the
same as the sample mean, the assumption of zero mean is also
automatically satisfied.
(b) As for multiple regression (and indeed simple linear regression), a plot
of the residuals against the fitted values can be used to check that it
is reasonable to assume that all the random terms Wi , i = 1, 2, . . . , n,
have the same variance, and a normal probability plot of the residuals
can be used to check the assumption that the random terms are
normally distributed.
Solution to Activity 23
In Figure 12(a), the residuals seem to be fairly randomly scattered either
side of the zero residual line, and the scatter seems to be roughly constant
across the fitted values. So the assumption that the residuals have
constant variance seems reasonable.
In Figure 12(b), the points in the centre of the plot follow the line very
closely, but then they deviate from the line at either end. The assumption
of normality may therefore be questionable.
Solution to Activity 24
When the explanatory variable is a factor, the model is based on the mean
responses for each level of the factor and is most useful when there are
differences in the mean response for at least one level of the factor.
From Figure 13, it’s clear that the mean of salinity is different for all
three levels of waterMass (since the boxes in the comparative boxplot do
not overlap with each other). As such, modelling salinity according to
which level is taken by waterMass is likely to be a useful model.
Solution to Activity 25
Comparing Figures 16, 17 and 18, it seems that the ESS is larger than the
RSS, and so the explained sum of squares seems to contribute more to the
total sum of squares than the residual sum of squares does.
Solution to Activity 26
(a) There are 30 observations in this dataset, and so n = 30. Also, the
factor waterMass has three levels, and so K = 3. The F -value is
therefore calculated as
F = (ESS/(K − 1)) / (RSS/(n − K)) = (38.80/2) / (7.93/27) ≃ 66,
as required.
(b) The reported p-value is very small, and so we conclude that the factor
waterMass does help to explain the variation in the response
salinity.
Solution to Activity 27
(a) There are 3331 observations in this dataset, and so n = 3331. Also,
the factor occ has seven levels, and so K = 7. The F -value is
therefore calculated as
F = (ESS/(K − 1)) / (RSS/(n − K)) = (648.4/6) / (3035.7/3324) ≃ 118.3,
as required.
(b) The reported p-value is very small, and so we conclude that the factor
occ does help to explain the variation in the response
hourlyWageSqrt.
Solution to Activity 28
Using information from Activity 26 and the fact that
TSS = ESS + RSS,
the completed ANOVA table for the model salinity ∼ waterMass is
given below.
Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
waterMass             2                    38.80            19.40         66.05     < 0.001
Residual              27                   7.93             0.29
Total                 29                   46.73
Solution to Activity 29
Using information from Activity 27 and the fact that
TSS = ESS + RSS,
the completed ANOVA table for hourlyWageSqrt ∼ occ is given below.

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
occ                   6                    648.4            108.07        118.3     < 0.001
Residual              3324                 3035.7           0.91
Total                 3330                 3684.1
Solution to Activity 30
Data were recorded for ten pea sections for each of the five treatments, and
so n = 50 and K = 5. Therefore, the degrees of freedom column should
have K − 1 = 4 in the treatment row, n − K = 45 in the residual row and
n − 1 = 49 in the total row.
To fill in entries for the sum of squares column, the values of ESS and RSS
were given in the question, and TSS can be calculated by using the
formula TSS = ESS + RSS.
For the entries in the mean square column, the mean square associated
with treatment is calculated as ESS/(K − 1), while the mean square
associated with the residuals is calculated as RSS/(n − K).
The F -value is calculated by dividing the mean square for treatment by
the mean square for the residual.
The completed ANOVA table for length ∼ treatment is given below.
Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment             4                    1077.3           269.33        49.37     < 0.001
Residual              45                   245.5            5.46
Total                 49                   1322.8
The p-value is very small, and so we’d reject H0 and conclude that the
factor treatment does help to explain the variation in length.
Solution to Activity 31
The researchers need to compare the square root of the hourly wage for
those occupations which are manual with those occupations which are not
manual.
Let µmanual denote the mean response across all levels of occ which
represent manual occupations, and let µnon-manual denote the mean
response across those levels of occ which represent non-manual
occupations. Then to investigate whether the hourly wage is affected by
whether or not an occupation is manual, we need to compare µmanual
with µnon-manual .
There are three levels of occ which represent manual occupations:
5 (skilled manual), 6 (semi-skilled manual) and 7 (unskilled manual), and
the other levels represent non-manual occupations. This means that
µmanual = mean response across levels 5, 6 and 7 of occ,
µnon-manual = mean response across levels 1, 2, 3 and 4 of occ.
So, to help the researcher to answer the question of whether the hourly
wage is affected by whether or not their occupation is manual, define the
contrast
θmanual = µmanual − µnon-manual .
If θmanual is large in magnitude, then this suggests that the means for the
two groups are not the same – in which case, whether or not an occupation
is manual does affect the square root of the hourly wage.
Solution to Activity 32
If the p-value in an ANOVA analysis is small, then the p-value in a
regression analysis of the same model will also be small (because it’s the
same value).
But a small p-value in a regression analysis indicates that the effect terms
in the model are non-zero, and so that the response means are different
across the factor levels.
Therefore, if the p-value is small in an ANOVA analysis of the model, then
this tells us that the response means are different across the factor levels.
Solution to Activity 33
The factor sugar will take the value ‘yes’ for any observation which used a
treatment containing sugar, and will take the value ‘no’ for any
observation which used a treatment not containing sugar. This means that
sugar will take the value ‘yes’ for any observation for which treatment
takes one of the four levels glucose, fructose, gluc + fruct and sucrose, and
sugar will take the value ‘no’ for any observation for which treatment
takes the level ‘none’. The values of sugar for the observations in the
question are given in the completed version of Table 16 below.
Solution to Activity 34
The factor treatment has five levels, so K = 5, and there are
50 observations in the dataset. Therefore, Fsugar is calculated as
Fsugar = (ESSsugar/1) / (RSS/(n − K))
= (832.3/1) / (245.5/45) ≃ 152.56.
Solution to Activity 35
The p-value corresponding to Fsugar is the p-value associated with the
contrast θsugar . As it is very small, it suggests that µsugar differs from
µno sugar , and so sugar does indeed affect the growth of pea sections.
Solution to Activity 36
The p-value associated with the contrast θmanual is very small. We therefore
conclude that the response mean for manual occupations is different to the
response mean for non-manual occupations, and so whether or not an
occupation is manual does indeed affect the square root of the hourly wage.
Solution to Activity 37
We want to compare the effects of the treatment gluc + fruct (the only
treatment which is a mix of glucose and fructose), with the effects of the
two treatments glucose and fructose (the treatments containing glucose or
fructose alone). So let
θmix = µmix − µno mix ,
where
µmix = mean response when treatment takes level gluc + fruct,
µno mix = mean response when treatment takes one of the levels
glucose or fructose.
Solution to Activity 38
Define a new factor – mix, say – from the levels of treatment which relate
to θmix :
• mix: a factor identifying whether or not the sugar treatments containing
glucose and/or fructose is a mix of glucose and fructose, taking the
possible values yes and no.
Carry out an ANOVA analysis for the model
length ∼ mix
using only data for those observations taking levels glucose, gluc + fruct or
fructose of treatment.
The resulting explained sum of squares for this fitted model on the reduced
dataset, ESSmix , then provides us with a measure of the variation in the
responses for the sugar treatments containing glucose and/or fructose
which is explained by whether or not the sugar treatment is a mix. The
test statistic for this test, Fmix , compares ESSmix with the overall residual
sum of squares and is calculated as
Fmix = (ESSmix/1) / (RSS/(n − K)),
where K is the number of levels of treatment.
Solution to Activity 39
The p-values associated with treatment, and the contrasts θsugar and
θsucrose , are all very small (p < 0.001). This means that there is strong
evidence to suggest that treatment helps to explain the variation in
length, as does whether or not the treatment contains sugar, and whether
or not the sugar treatment contains sucrose.
However, the p-value associated with the contrast θmix is large (p = 0.198),
and so there is no evidence to suggest that pea growth is affected by
whether a treatment is a mixture of glucose and fructose, or just either
glucose or fructose alone.
Solution to Activity 40
(a) There are three levels of occ which represent manual occupations: 5
(skilled manual), 6 (semi-skilled manual) and 7 (unskilled manual),
whereas the other four occupations (occ levels 1 to 4) are all
non-manual. So, the contrast θmanual partitions the seven levels of occ
into two groups: a ‘manual’ group for levels 5, 6 and 7 of occ, and a
‘non-manual’ group for levels 1, 2, 3 or 4 of occ.
For the contrast θunskilled , we are only interested in manual
occupations – that is, levels 5, 6 and 7 of occ. Of these, only level 7
(unskilled manual) is classed as being unskilled, whereas levels 5
and 6 are both classed as skilled. So, the contrast θunskilled partitions
the manual levels of occ into two groups: an ‘unskilled’ group for
level 7, and a ‘skilled’ group for levels 5 and 6.
Unit 4
Multiple regression with both covariates and factors
Introduction
In Unit 2, we considered regression models for a continuous response where
the explanatory variables were all covariates (that is, numerical
explanatory variables), whereas in Unit 3 we introduced regression models
for a continuous response where the single explanatory variable was a
factor (that is, a categorical explanatory variable). In real-world
applications, however, it is often the case that the potential explanatory
variables can include a mixture of both covariates and factors.
For example, when analysing the wages dataset in Unit 3, we took the
variable hourlyWageSqrt as the response and occ as a single factor
explanatory variable. However, the dataset also contains data on several
additional potential explanatory variables – both potential covariates
(namely, workHrs and educAge) and potential factors (namely, gender,
edLev and computer).
In this unit we will introduce regression models which can accommodate
any number of covariates in addition to any number of factors as
explanatory variables.
The route map shows how the sections of the unit connect to each other.
Section 1: Regression with one covariate and one factor
Section 2: Modelling using parallel slopes
Section 3: Modelling using non-parallel slopes
Section 4: Regression with two factors that do not interact
Section 5: Regression with two factors that interact
Section 6: Regression with any number of covariates and factors
Note that you will need to switch between the written unit and your
computer for Subsections 2.4, 3.4, 4.5, 5.3 and 6.4.
1 Regression with one covariate and one factor
Figure 1 Repeat of Figure 9 from Unit 1 showing a scatterplot of diameter (m) and height (m) together with the fitted line
In Section 8 of Unit 1, two other regression lines were also fitted to these
data – one line was fitted using only data for the trees on the west side of
Walton Drive, while the other regression line was fitted using only data for
the trees on the east side of Walton Drive. The resulting equations of those
fitted lines are
height = 4.50 + 17.26 diameter, for trees on the west side,
height = −0.81 + 27.59 diameter, for trees on the east side.
Figure 2 shows these two fitted lines on a scatterplot of height and
diameter (this is in fact a repeat of Figure 22 in Unit 1). It is clear from
both the equations of the fitted lines and Figure 2 that the fitted line for
trees on the west side of the road is not the same as the fitted line for trees
on the east side of the road. As such, the different levels of the factor side
are affecting the relationship between height and diameter and we ought
to accommodate this in our regression model.
Figure 2 Repeat of Figure 22 from Unit 1 showing the fitted regression lines
for the manna ash trees dataset using trees on the west side of Walton Drive
only (blue triangles and dashed blue fitted line) and using trees on the east
side of Walton Drive only (red circles and solid red fitted line)
Activity 1
In Model (1), the term α1 is set to be zero. Explain why this is so.
Response: Y; factor A with K levels; covariate x. Define the K − 1 indicator variables zi2, zi3, . . . , ziK. The model Y ∼ A + x is then fitted as a multiple regression with covariates zi2, zi3, . . . , ziK and xi.
Model: Y ∼ A + x
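In R, this translation can be inspected directly: the function model.matrix() displays the design matrix that lm() builds from a model formula. Here is a minimal sketch, assuming the manna ash trees data are stored in a data frame called mannaAsh (the data frame name is our assumption):

# Inspect the indicator coding R uses for the factor side
# (column names depend on how the factor's levels are coded)
head(model.matrix(~ side + diameter, data = mannaAsh))
# Columns: (Intercept), the indicator z2 for level 2 of side, and diameter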
2 Modelling using parallel slopes
and the ith tree is the tree given in the ith row of the dataset (as in
Example 3 of Unit 3 (Subsection 2.1)).
(a) Write down the fitted model for Yi when the ith tree is on the east
side of Walton Drive.
(b) Write down the fitted model for Yi when the ith tree is on the west
side of Walton Drive.
(c) What do you notice about the two fitted regression lines for the
different levels of the factor side?
(d) Compared with being on the east side of Walton Drive, what effect
does being on the west side of the road have on height after
controlling for diameter?
(e) What effect does an increase in diameter by 0.1 m have on height
after controlling for side?
It turns out that Model (3) will always produce fitted regression lines for
the individual levels of the factor A which are parallel, because there is
only one slope parameter β for all of the levels. For this reason, this model
is known as the parallel slopes model.
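In R, the parallel slopes model is specified using the same model formula notation; a minimal sketch, again assuming the data frame name mannaAsh:

# Fit the parallel slopes model height ~ side + diameter
fit_ps <- lm(height ~ side + diameter, data = mannaAsh)
summary(fit_ps)  # estimates of mu, alpha2 and the common slope beta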
A scatterplot of the response and the covariate with the two fitted lines obtained for the manna ash trees dataset in Activity 3 is given in Figure 4, next, together with a visual representation of how the fitted values µ̂, α̂2 and β̂ relate to the fitted lines.
The data points and regression lines in Figure 4 are identified according to
the two levels of side.
Figure 4 Scatterplot of height and diameter, together with the fitted regression lines for the two levels of the factor side (annotated with µ̂ = 1.29, α̂2 = 2.62 and the common slope β̂ = 19.80 of both lines)
We’ll finish this subsection with an example and an activity using a parallel
slopes model for the FIFA 19 dataset first introduced in Unit 1 (Section 3).
[Figure 5: scatterplot of strength score against weight, with the data points identified by the five levels of skillMoves, together with the fitted line for strength ∼ weight]
Notice that in Figure 5, the data points associated with the different
levels of skillMoves don’t seem to be totally randomly scattered
about the fitted line. For example, the data points associated with
level 1 of skillMoves are generally below the fitted line, while those
for level 2 of skillMoves are generally above the fitted line. This
suggests that the different levels of the factor skillMoves may also
affect the response strength. Therefore, the model
strength ∼ skillMoves + weight
was fitted to the data, producing the following fitted regression lines
for the five different levels of skillMoves:
strength = 6.94 + 0.36 weight, for level 1,
strength = 13.75 + 0.36 weight, for level 2,
strength = 12.53 + 0.36 weight, for level 3,
strength = 9.00 + 0.36 weight, for level 4,
strength = 9.14 + 0.36 weight, for level 5.
These fitted lines are shown on a scatterplot of strength and weight
in Figure 6.
[Figure 6: scatterplot of strength score against weight, together with the five fitted parallel regression lines for the levels of skillMoves]
So, we have a model for fitting parallel regression lines associated with the
levels of a factor, but, as usual for regression models, we now need to check
whether or not all of the explanatory variables need to be in the model.
We will consider how to do this in the next subsection.
In the next activity, we will use the test described in Activity 5 for the
FIFA 19 dataset.
So, the F-test tests whether all of the regression coefficients are zero. However, it doesn't tell us which, if any, of the individual parameters are zero.
Now, the model that we’re considering here, Y ∼ A + x, is a multiple
regression model. As such, separate t-tests can be used to test whether
each individual parameter is zero after controlling for the other explanatory
variables; R carries out these t-tests automatically when fitting the model.
While an individual t-test is fine for assessing β (the regression coefficient
for the covariate x), individual t-tests for assessing the level effect
parameters associated with the factor A aren’t always particularly useful.
This is because, as mentioned in Unit 3 (Subsection 3.2), either all or none
of the level effect parameters associated with the factor A need to be
included in the model, and so the level effects really ought to be assessed
‘as a whole’ (which is indeed what we did by using the F -test to test
whether all of the level effect parameters are zero for the model Y ∼ A).
So, for the model Y ∼ A + x, rather than relying on the individual t-tests
for deciding whether A should be included in the model, we will take a
different approach and instead consider how well the model with both A
and x fits the data in comparison to how well the model with only x fits
the data. If the fit of the model is significantly improved by including A
(in addition to x), then this would suggest that A should be included in
the model.
This leads us to the question: ‘How should we compare the fits of the two
models Y ∼ x and Y ∼ A + x?’ To answer this question, you first need to
know what nested models are. We explain these next in Box 2, and
demonstrate them for one particular case in Example 2.
In the next activity, you will practise identifying pairs of nested models.
You may be wondering what all this talk of nested models has to do with
comparing the fits of the two models Y ∼ x and Y ∼ A + x. Well, it turns
out that, when we have two nested models (which indeed we do have with
models Y ∼ x and Y ∼ A + x), we can compare the fits of the two models
by considering the difference between the values of the residual sum of
squares (RSS) for the two models. (Basically, when we have two nested
models, the maths works nicely!)
Now, recall from Subsection 5.2 in Unit 3 that a model’s RSS provides a
measure of how much variation in the response is left unexplained by the
fitted model. Also recall from Section 5 in Unit 2 that model fit gets better
as we increase the number of explanatory variables, which means that the
model Y ∼ A + x will fit the data better than the model Y ∼ x. Of course,
the better the fit of a model, the better the model explains the data, which
in turn means that the unexplained variation decreases. It therefore follows
that the RSS for the (better fitting) model Y ∼ A + x will be less than the
RSS for the model Y ∼ x. Thus it is the size of the difference in RSS
between these two models that matters, as you will explore in Activity 8.
We know from Section 5 in Unit 2 that model fit isn’t the ‘be all and end
all’; the principle of parsimony is also important. This means we only want
to include in the model those explanatory variables which significantly
increase the fit of the model. Therefore we only want to include A in the
model (in addition to x) if there is a significant gain in fit from doing so.
As you found in Activity 8, we can use the difference between the values of
the RSS for the two models Y ∼ x and Y ∼ A + x to assess the gain in fit
by including A in the model.
• If the difference is large enough, then this suggests that the gain in fit
when A is included (in addition to x) is significant enough to suggest
that A should be in the model as well as x.
• If the difference is small, then this suggests that there isn’t much gain in
fit when A is included (in addition to x), and so, for parsimony, it would
be better to use the simpler model Y ∼ x.
You will probably not be surprised to learn that a test can be used to
decide whether the difference in the values of the RSS is ‘large enough’ to
suggest that A should be in the model in addition to x. The test is in fact
another ANOVA test, because it compares different sources of response
variation. As such, the test is based on the F -distribution and is referred
to as an F -test. (Yes, yet another ‘F -test’ !)
As usual, R can easily calculate the test statistic for this test, together with its associated p-value. For this module, all you need to know about the details of the test are summarised in Box 3. You will then use this F-test in Activities 9 and 10.
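In R, this comparison is carried out by fitting both models and passing them to anova(); a minimal sketch for the manna ash trees example (data frame name assumed, as before):

fit_x  <- lm(height ~ diameter, data = mannaAsh)         # model Y ~ x
fit_Ax <- lm(height ~ side + diameter, data = mannaAsh)  # model Y ~ A + x
anova(fit_x, fit_Ax)  # F-test based on the difference in the two RSS values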
Activity 9
The model
height ∼ side + diameter
was fitted to data from the manna ash trees dataset and produced the
output shown in Table 1.
Table 1 Output produced when fitting the parallel slopes model for the manna
ash trees dataset
A second model
height ∼ diameter
was fitted to the data. An ANOVA test to compare the RSS values for the
two fitted models for height was carried out: the test statistic was
calculated to be F = 35.1 and the associated p-value was less than 0.001.
Are diameter and side both required in the model?
Activity 10
The model
strength ∼ skillMoves + weight
was fitted to data from the FIFA 19 dataset and produced the output
shown in Table 2.
Table 2 Output produced when fitting the parallel slopes model for the FIFA 19
dataset
A second model
strength ∼ weight
was also fitted to the data. An ANOVA test to compare the RSS values for
the two fitted models for strength was carried out: the test statistic was
calculated to be F = 7.4 and the associated p-value was less than 0.001.
Given these results, are weight and skillMoves both required in the
model?
Model: Y ∼ A + x — is A required in addition to x? Is x required in addition to A?
Following the usual methods for regression, once we have decided which
explanatory variables should be included in the model, we need to check
whether the assumptions underlying the model are reasonable for the fitted
model and data. Checking the assumptions of the parallel slopes model is
the focus of the next subsection.
The last three assumptions given in Box 4 regarding the random terms
W1 , W2 , . . . , Wn of the parallel slopes model can be checked using methods
from multiple regression given in Subsection 3.1 of Unit 2 (and also used in
Subsection 3.3 of Unit 3). The first assumption that the regression lines for
the different levels of the factor all have the same slope can be checked
informally by visual inspection of a scatterplot of the response variable and
the covariate, together with the fitted regression lines. (A formal method
to test whether or not the slopes are the same will be introduced later, in
Subsection 3.2.)
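A minimal sketch of producing these two plots in R for a fitted parallel slopes model (the data frame name mannaAsh is, as before, our assumption):

fit_ps <- lm(height ~ side + diameter, data = mannaAsh)
# Residual plot: residuals against fitted values
plot(fitted(fit_ps), resid(fit_ps), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal probability plot of the standardised residuals
qqnorm(rstandard(fit_ps))
qqline(rstandard(fit_ps))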
In the next example, we will check whether or not the assumptions of the
parallel slopes model are reasonable for the fitted model for the manna ash
trees dataset.
Figure 8 Checking the parallel slopes model assumptions for the manna
ash trees dataset: (a) residual plot; (b) normal probability plot
In the residual plot, there are three residuals (numbered 13, 14 and 16
on the plot) which seem rather large in comparison to the other
residuals, suggesting that perhaps the assumption that the variance of
the Wi ’s is constant may be questionable. However, the rest of the
points in the residual plot seem to be scattered fairly randomly about
zero, suggesting that, overall, the assumption that the Wi ’s have zero
mean and constant variance does seem reasonable. Additionally, the
residuals in the normal probability plot lie roughly along a straight
line, and so the assumption of normality of the residuals also seems
plausible.
A scatterplot of height and diameter, together with the two fitted
regression lines for the levels of side, was given in Figure 4. The two
regression lines seem to fit the associated data points for the two
levels of side fairly well, and so the assumption that the two
regression lines are parallel also seems to be reasonable.
We will finish this subsection with an activity checking the assumptions for
the parallel slopes model for the FIFA 19 dataset.
Figure 9 Checking the parallel slopes model assumptions for the FIFA 19 dataset: (a) residual plot;
(b) normal probability plot
Activity 11
(a) Do the plots given in Figure 9 suggest any problems with the
assumptions of the random terms in the parallel slopes model for
these data?
(b) By considering Figure 6, is it reasonable to assume that the regression
lines for the five levels of the factor skillMoves are parallel?
3 Modelling using non-parallel slopes
Figure 10 Scatterplot of hourlyWageSqrt and workHrs from the wages
dataset, together with the fitted regression lines for the two levels of
gender after fitting a parallel slopes model
Figure 11(a) Scatterplot of hourlyWageSqrt and workHrs together
with a fitted regression line, using only data for which gender takes the
value male
Figure 11(b) Scatterplot of hourlyWageSqrt and workHrs together with a fitted regression line, using only data for which gender takes the value female
Figure 11(c) Showing Figures 11(a) and (b) on the same plot
It is clear from Figure 11(c) that the fitted lines for the two levels of
gender are not parallel. What’s more, there seems to be a negative
relationship between hourlyWageSqrt and workHrs for males, but a
positive relationship between hourlyWageSqrt and workHrs for
females. As such, the assumption that the fitted lines are parallel
(required for the parallel slopes model) doesn’t appear to hold.
Activity 12
What needs to change in Model (4) in order for the slopes to not be parallel?
Instead of using the single regression coefficient β for each xi , we now set
the slope for level 1 of A to be the regression coefficient β and then adjust
the slopes for levels k = 2, 3, . . . , K by adding an amount γk to β, where
γk = added effect on the regression slope when A takes level k,
in comparison to the regression slope when A takes level 1.
This allows the slopes to have different values and hence for the lines not
to be parallel.
The adjustment γk can be either positive (meaning that the slope for level
k is an increase on the slope for level 1) or negative (meaning that the
slope for level k is a reduction on the slope for level 1). This means that
the regression coefficient for x will not be the same for the different levels
of A, hence this will give different slopes for the regression lines for the
different levels of factor A.
So, Model (4) is adapted to become
Yi = (µ + αk) + (β + γk)xi + Wi, Wi ∼ N(0, σ2), (5)
with intercept µ + αk and slope β + γk,
where α1 and γ1 are both set to be zero. (This is because the intercept and
slope of the line for level k are compared to the intercept and slope of the
line for level 1.)
Notice that, for level 1, we have exactly the same model as we have for the
parallel slopes model given in Model (4).
Because Model (5) can accommodate slopes which are not parallel for the
different levels of A, this model is known as the non-parallel slopes
model.
In the next activity, we will use the non-parallel slopes model for the wages
dataset considered in Example 4.
Activity 13
A non-parallel slopes model (as given by Model (5)) was fitted to data in
the wages dataset, taking hourlyWageSqrt as the response (Y ), gender as
a factor (A) and workHrs as a covariate (x). Level 1 of gender was set to
be male, while level 2 was set to be female.
The following parameter estimates for the fitted model were obtained:
µ̂ = 4.575, β̂ = −0.0173, α̂2 = −1.816, γ̂2 = 0.0335.
(a) Write down the fitted line for individuals who are male and the fitted
line for individuals who are female.
(b) How does gender affect the slopes of the fitted lines?
(c) The value of workHrs for two of the individuals in the dataset is 35. One of these individuals is male and the other is female. According to the fitted non-parallel slopes model, what is the fitted value of hourlyWageSqrt, and hence of the hourly wage (to the nearest 10p), for each of these individuals?
Figure 12 Scatterplot of hourlyWageSqrt and workHrs, together with the fitted regression lines for the two levels of gender (the line for males has slope β̂ = −0.017)
Now, in Model (3) we used the indicator random variables zi2 , zi3 , . . . , ziK
to express the parallel slopes model by a single equation. Similarly, for
i = 1, 2, . . . , n, we can use the same indicator variables to express the
model for non-parallel slopes given in Model (5) by an equation of the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK
+ (β + γ2 zi2 + γ3 zi3 + · · · + γK ziK )xi + Wi , (6)
where, for k = 2, 3, . . . , K,
zik = 1 if the ith observation takes level k of A, and 0 otherwise.
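The resulting design matrix can again be inspected in R; a minimal sketch, assuming a data frame wages containing gender and workHrs:

# The design matrix for Y ~ A * x contains the indicator columns
# and their products with the covariate
head(model.matrix(~ gender * workHrs, data = wages))
# Columns: (Intercept), the indicator for level 2 of gender, workHrs,
# and the product of the indicator with workHrs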
We can extend the model notation used for the parallel slopes model to
include an interaction between x and A for a non-parallel slopes model.
Following the notation used in R, we will denote this interaction by A:x,
and then the non-parallel slopes model (as given in Model (5)) is denoted
by
Y ∼ A + x + A:x.
This is often written in abbreviated form as
Y ∼ A ∗ x.
The ‘∗’ between A and x simply tells us that both A and x are explanatory
variables in the model, and there is also an interaction between A and x.
The non-parallel slopes model is summarised in Box 5.
So, when the ith observation takes level 1 of factor A, the model
becomes
Yi = µ + βxi + Wi , i = 1, 2, . . . , n,
and when the ith observation takes level k = 2, 3, . . . , K of factor A,
the model becomes
Yi = µ + αk + (β + γk )xi + Wi , i = 1, 2, . . . , n.
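Fitting the non-parallel slopes model in R uses the '∗' notation directly; a minimal sketch, again assuming the wages data frame:

fit_nps <- lm(hourlyWageSqrt ~ gender * workHrs, data = wages)
summary(fit_nps)  # estimates of mu, alpha2, beta and gamma2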
Response: Y; factor A with K levels; covariate x; interaction A:x. Define the K − 1 indicator variables zi2, zi3, . . . , ziK. The model Y ∼ A + x + A:x is then fitted as a multiple regression with covariates zi2, zi3, . . . , ziK, xi and the products (zi2 × xi), (zi3 × xi), . . . , (ziK × xi).
Model: Y ∼ A + x + A:x
In the next activity, we will fit a non-parallel slopes model using data from
the FIFA 19 dataset.
(a) Given the fitted regression lines for the two levels of preferredFoot, what are the values of the estimated parameters µ̂, β̂, α̂2 and γ̂2?
(b) How does preferredFoot affect the slopes of the fitted lines?
(c) The value of height for two of the football players in the dataset
is 75. One of these individuals is left-footed and the other is
right-footed. According to the fitted non-parallel slopes model, what
is the fitted value of strength for each of these individuals?
As you have found in Activity 16, the size of the difference in the RSS values
indicates if the interaction term A:x should be in the model. We can use
another ANOVA test (similar to that used in Subsection 2.2) to test
whether the difference in the RSS values is ‘large enough’ to suggest that
there is an interaction between A and x (meaning that the interaction term
should be included in the model). Once again, this is a form of F -test,
with a test statistic F and p-value calculated using an F -distribution.
The test is summarised in Box 6.
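As before, the test is carried out in R by passing the two nested fits to anova(); a minimal sketch, assuming a data frame fifa19 containing strength, preferredFoot and height:

fit_par    <- lm(strength ~ preferredFoot + height, data = fifa19)  # parallel
fit_nonpar <- lm(strength ~ preferredFoot * height, data = fifa19)  # non-parallel
anova(fit_par, fit_nonpar)  # is the interaction preferredFoot:height required?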
Model: Y ∼ A + x + A:x — is the interaction A:x required?
Figure 15 Checking the non-parallel slopes model assumptions for the wages dataset: (a) residual plot;
(b) normal probability plot
Figure 16 Checking the non-parallel slopes model assumptions for data from the FIFA 19 dataset: (a) residual
plot; (b) normal probability plot
4 Regression with two factors that do not interact
Employment rates
As mentioned in Example 1 of Unit 1 (Subsection 2.1), the
Organisation for Economic Co-operation and Development (OECD) is
an international organisation which works with governments, policy
makers and citizens to find solutions to social, economic and
environmental challenges. The OECD website states that it has
‘helped to place issues relating to well-being and inequalities of
income and opportunities high on the international agenda’ and that
its goal ‘is to shape policies that foster prosperity, equality,
opportunity and well-being for all’ (OECD, 2020).
The OECD collects a vast amount of data concerning many different issues. The dataset we introduce here contains data on the 2019 employment rates in 37 countries for people educated to a Bachelors degree or equivalent level, broken down by gender and age grouping.
The employment rates dataset (employmentRate)
Data are available for each of the 37 countries for the following
variables:
• country: the name of the country
• rate: the 2019 employment rate (as a percentage) for people
educated to a Bachelors degree or equivalent level
• gender: the gender the individual identifies with, taking the values
male and female
• age: the age groupings (in years), taking the values 25 to 34,
35 to 44, 45 to 54, and 55 to 64.
The first three and last three observations from this dataset are given
in Table 3. So, for example, the employment rate for women aged 25
to 34 in Australia was 82.5% in 2019.
Table 3 The first three and last three observations in employmentRate
where
• the ‘baseline mean’ is the mean response when A takes level 1 (this was
denoted µ in Unit 3)
• the ‘effect of kth level of A’ is the effect on the mean response of the
kth level of A in comparison to the effect on the mean response of
level 1 of A (this was denoted αk in Unit 3)
• each ‘random term’ is a normal random variable with zero mean and
constant variance σ 2 (this was denoted Wi in Unit 3).
Notice that, since each level effect term is the added effect on the response
in comparison to the effect of level 1, when the ith observation takes
level 1 of A, the level effect term is simply zero. Model (8) then reduces to
Yi = baseline mean + random term. (9)
We can extend and adapt this general model form for Y ∼ A to the
situation where we instead have two factors A and B. This time, each
observation will take one level of A and one level of B. So, if the ith
observation takes level k of A and level l of B, for k = 1, 2, . . . , K,
l = 1, 2, . . . , L, then we will write our model for the response Yi in the
following general form:
Yi = baseline mean + effect of kth level of A + effect of lth level of B + random term, (10)
Write down what the general form given in Model (10) reduces to for each
of the following situations.
(a) Factors A and B both take level 1.
(b) Factor A takes level 1 and factor B takes level l, where l = 2, 3, . . . , L.
(c) Factor B takes level 1 and factor A takes level k, where
k = 2, 3, . . . , K.
In Activity 23, you will see the general form given in Model (10) in action
for the model
rate ∼ gender + age
using data from the employment rates dataset.
Activity 23
The model
rate ∼ gender + age
was fitted to data from the employment rates dataset, taking level 1 of
gender to be male and level 2 to be female, and taking levels 1, 2, 3 and 4
of age to be, respectively, 25 to 34, 35 to 44, 45 to 54 and 55 to 64.
The output from fitting the model is given in Table 4.
Table 4 Output produced from fitting the model rate ∼ gender + age
We have already seen in this unit and Unit 3 that when there is a factor,
indicator random variables can be used to express the model as a single
equation (as is done, for example, in Model (2), Subsection 1.2). It will
therefore probably come as no surprise to you to learn that when there are
two factors, we can simply use a set of indicator variables for each of the
factors so that the model with two factors is equivalent to a multiple
regression model with (K − 1) + (L − 1) indicator variables as covariates.
The model with two factors (and no interaction) is summarised in Box 7.
where
• the ‘baseline mean’ is the mean response when A takes level 1 and
B takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of
the kth level of A in comparison to the effect on the mean response
of level 1 of A (after controlling for B)
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of
level 1 of B (after controlling for A)
• each ‘random term’ is a normal random variable with zero mean
and constant variance σ 2 .
By using indicator variables for each factor, the model is equivalent to
a multiple regression model with (K − 1) + (L − 1) indicator variables
as covariates.
The translation of the model terms for regression with two factors into
terms fitted in a multiple regression model is illustrated in Figure 17.
Response: Y; factor A with K levels; factor B with L levels. Define K − 1 indicator variables for A and L − 1 indicator variables for B.
Model: Y ∼ A + B
Figure 17 Summary of regression with two factors
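In R, the model with two factors is specified in the now-familiar way; a minimal sketch, assuming a data frame employmentRate containing rate, gender and age:

fit_AB <- lm(rate ~ gender + age, data = employmentRate)
summary(fit_AB)  # (K - 1) + (L - 1) = 1 + 3 = 4 indicator coefficients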
[Figure 18: plots of the fitted mean responses for rate: (a) gender on the horizontal axis, with separate lines for the four age groups, annotated with the fitted mean response when gender and age both take level 1 (µ̂), the effect of level 2 (female) of gender and the effect of level 4 (55 to 64) of age; (b) age on the horizontal axis, with separate lines for the two genders, annotated with the effect of level 2 (female) of gender and the effect of level 2 (35 to 44) of age]
Figure 18 shows visual representations of the fitted model for the response
rate with the two factors gender and age as explanatory variables. The
fitted model assumed that there was no interaction between the two
factors.
Now consider the visualisation of the fitted model for the parallel slopes
model for the manna ash trees dataset shown in Figure 4 (Subsection 2.1).
What do the fitted lines in Figure 4 have in common with the lines joining
the fitted mean responses in Figure 18?
The lines joining the fitted mean responses in both of the plots in
Figure 18 are parallel for the different levels of the second factor. The fact
that the lines are parallel means that the effect of each factor on the
response is the same across the different levels of the other factor, as
summarised in Box 8. For example, in Figure 18(a), the differences
between the employment rates for the different age groups are the same for
each gender, and likewise, in Figure 18(b), the difference between the male
and female employment rates is the same for each of the four age groups.
whether both factors A and B are required in the model. For this we need
two nested models, the identification of which you will consider in the next
activity.
The model
Y ∼A+B
has been fitted to some data and we would like to decide whether both A
and B should be included in the model.
(a) For which two nested models could we compare RSS values in order
to test whether A should be included in the model in addition to B?
(b) For which two nested models could we compare RSS values in order
to test whether B should be included in the model in addition to A?
The method for testing whether both factors A and B should be in the
model is summarised in Box 9.
In the final activity of this subsection, we will use these tests to see
whether both gender and age are needed in the model for rate using data
from the employment rates dataset.
Model: Y ∼ A + B — is A required in addition to B? Is B required in addition to A?
Figure 20 Means plots for rate from the employment rates dataset,
with: (a) gender on the horizontal axis and separate lines for levels of
age; (b) age on the horizontal axis and separate lines for levels of gender
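Means plots like these can be produced with the base R function interaction.plot(); a minimal sketch, again assuming the employmentRate data frame:

# (a) gender on the horizontal axis, one line per age group
with(employmentRate,
     interaction.plot(gender, age, rate, ylab = "Employment rate (%)"))
# (b) age on the horizontal axis, one line per gender
with(employmentRate,
     interaction.plot(age, gender, rate, ylab = "Employment rate (%)"))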
[Figure: means plots of weight gain (g)]
5 Regression with two factors that interact
[Figure: the response plotted for levels 1 and 2 of factor A, with separate lines for levels 1 and 2 of factor B, annotated with the baseline mean, the effect of level 2 of A, the effect of level 2 of B, and the interaction effect of level 2 of A and level 2 of B]
Figure 22 Visualisation of the model Y ∼ A ∗ B
In the next example and activity, we’ll see the general form given in
Model (11) in action for modelling data from the rats and protein dataset.
On the other hand, the 21st rat in the dataset was fed a low cereal
protein diet and so takes level 2 for both source and amount.
Therefore, this time, there is an added interaction term in the model.
The model form for this rat is given next.
Y21 = baseline mean + effect of cereal source + effect of low amount + interaction effect of cereal source and low amount + random term.
(a) The first rat in the dataset was fed a low beef protein diet. What will be the fitted value ŷ1 for this rat?
(b) The 11th rat in the dataset was fed a high beef protein diet. What will be the fitted value ŷ11 for this rat?
(c) The 21st rat in the dataset was fed a low cereal protein diet. What will be the fitted value ŷ21 for this rat?
(d) The 31st rat in the dataset was fed a high cereal protein diet. What will be the fitted value ŷ31 for this rat?
Recall from Section 3 (and in particular, Model (7)) that when we have
one factor and one covariate for the explanatory variables, and there is an
interaction between the two explanatory variables, we can express our
model as a multiple regression model with the covariates
zi2 , zi3 , . . . , ziK , xi , (zi2 × xi ), (zi3 × xi ), . . . , (ziK × xi ),
where zi2 , zi3 , . . . , ziK are indicator variables associated with the factor
and xi is the covariate.
Now, in this section where we instead have two factors, a set of indicator
variables is used for each of the factors. So, following the ideas from
Section 3, for factor A with K levels and factor B with L levels, the model
Y ∼ A + B + A:B
can be expressed as a multiple regression model with the following
covariates:
• K − 1 indicator variables for factor A
• L − 1 indicator variables for factor B
• (K − 1) × (L − 1) indicator variables associated with forming the
product of the kth indicator variable for A with the lth indicator
variable for B, for k = 2, 3, . . . , K, l = 2, 3, . . . , L.
The model with two factors and an interaction is summarised in Box 11.
where
• the ‘baseline mean’ is the mean response when A takes level 1 and
B takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of
the kth level of A in comparison to the effect on the mean response
of level 1 of A, when B takes level 1
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of
level 1 of B, when A takes level 1
• the ‘interaction effect of kth level of A and lth level of B’ is the
added effect on the mean response of the interaction between the
kth level of A and the lth level of B
• each ‘random term’ is a normal random variable with zero mean
and constant variance σ 2 .
Define K − 1 indicator variables for A and L − 1 indicator variables for B, together with the (K − 1) × (L − 1) indicator variables formed as the products of the (K − 1) indicator variables for A and the (L − 1) indicator variables for B.
Model: Y ∼ A + B + A:B
Model: Y ∼ A + B + A:B — is the interaction A:B required?
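This test again compares two nested fits in R; a minimal sketch, assuming the rats and protein data are in a data frame ratsProtein with response gain and factors source and amount (all three names are our assumptions):

fit_add <- lm(gain ~ source + amount, data = ratsProtein)  # Y ~ A + B
fit_int <- lm(gain ~ source * amount, data = ratsProtein)  # Y ~ A + B + A:B
anova(fit_add, fit_int)  # is the interaction source:amount required?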
The model including an interaction for the rats and protein dataset
discussed in Activity 30 illustrates an important point: even though a
means plot may suggest that there is an interaction between two factors, it
may turn out that the interaction isn’t actually judged to be significant.
When this happens, the more parsimonious model without an interaction
is preferable.
In the next activity, we will revisit the wages dataset and fit a model for
the response hourlyWageSqrt using two of the factors in the dataset as
explanatory variables.
[Figure 25: means plots for hourlyWageSqrt from the wages dataset, involving the factors gender (male, female) and computer (yes, no)]
(a) Explain why the means plots given in Figure 25 suggest that a model
for hourlyWageSqrt using gender and computer as the explanatory
variables may need to include their interaction.
(b) An ANOVA test was carried out to compare the RSS values of the
two models
hourlyWageSqrt ∼ gender + computer
and
hourlyWageSqrt ∼ gender + computer + gender:computer.
The resulting p-value was less than 0.001.
Should the interaction gender:computer be included in the model?
(c) A particular individual in the dataset identifies as male and has a
computer at home. According to the model that includes the
interaction, what is the fitted hourly wage for this individual?
(d) A different male individual doesn’t have a computer at home.
According to the model that includes the interaction, what is the
fitted hourly wage for this individual?
(e) According to the model that includes the interaction, what effect does having a computer at home have on a male's hourly wage?
Table 8 The first three and last three observations from ouStudents
6 Regression with any number of covariates and factors
where
• the ‘baseline mean’ is the mean response when A takes level 1 and B
takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of the kth
level of A in comparison to the effect on the mean response of level 1 of
A (after controlling for B)
• the ‘effect of lth level of B’ is the effect on the mean response of the lth
level of B in comparison to the effect on the mean response of level 1 of
B (after controlling for A)
• each ‘random term’ is a normal random variable with zero mean and
constant variance σ 2 .
This model form can be extended in a natural way to accommodate any
number of factors as explanatory variables, as summarised in Box 13 next.
where
• the ‘baseline mean’ is the mean response when A, B, . . . , Z all take
level 1
• the ‘effect of kth level of A’ is the effect on the mean response of
the kth level of A in comparison to the effect on the mean response
of level 1 of A (after controlling for all the other factors)
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of
level 1 of B (after controlling for all the other factors)
. . .
• the ‘effect of rth level of Z’ is the effect on the mean response of the
rth level of Z in comparison to the effect on the mean response of
level 1 of Z (after controlling for all the other factors)
• each ‘random term’ is a normal random variable with zero mean
and constant variance σ 2 .
So, Yi is modelled as the baseline mean when all of the factors take
level 1, with added individual effects for the other levels of each factor.
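In R, a model with several factors is fitted by simply adding each factor to the model formula; a minimal sketch using the four factors of the wages dataset named below (data frame name assumed):

fit4 <- lm(hourlyWageSqrt ~ gender + computer + edLev + occ, data = wages)
summary(fit4)  # one set of indicator coefficients per factor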
In the next activity, we will revisit the wages dataset, this time fitting a
model for the response hourlyWageSqrt using four of the factors in the
dataset as explanatory variables.
Use the fitted model (rounding all estimates to two decimal places) to
calculate the following.
(a) The fitted value of hourlyWageSqrt for a male who has a computer,
and takes level 3 of edLev and level 2 of occ.
(b) The fitted value of hourlyWageSqrt for a female who hasn’t got a
computer, and takes level 17 of edLev and level 7 of occ.
For which two nested models could we compare RSS values in order to test
whether the factor occ needs to be in the model for hourlyWageSqrt in
addition to gender, computer and edLev?
Consider once again the wages dataset with the response hourlyWageSqrt
and the four factors gender, computer, edLev and occ. The following
stepwise regression procedures were used to choose which of these factors
ought to be in our model.
(a) A forward stepwise regression procedure starting from the null
regression model was performed. The results from the procedure are
given below. Which explanatory variables are selected by this
procedure?
where Wi ∼ N (0, σ 2 ).
The interpretation of the model parameters continues to follow the
same principles as we’ve seen before, so that:
• a baseline mean is the mean response when all of the factors take
level 1 and all of the covariates are zero
• for each factor, the effect term in the model is the effect on the
mean response of the associated level of the factor in comparison to
the effect on the mean response of level 1 of the factor, after
controlling for the other factors and the covariates
• for each covariate, its regression coefficient represents the effect on
the mean response, after controlling for the other covariates and the
factors.
The model can be expressed as a multiple regression model (with
indicator variables as covariates), and so the usual multiple regression
model assumptions hold.
(a) Explain why the output for the fitted model suggests that both
covariates and their interaction should be included in the model.
(b) The first student listed in the dataset has observed values of 89.2 for
bestPrevModScore and 32 for age. According to the model, what is
the fitted value (to the nearest whole mark) for examScore, ŷ1, for
this student?
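A minimal sketch of fitting this model and obtaining the fitted value in R, assuming the OU students data are in a data frame ouStudents (as in Table 8) containing examScore, bestPrevModScore and age:

fit_exam <- lm(examScore ~ bestPrevModScore * age, data = ouStudents)
new_student <- data.frame(bestPrevModScore = 89.2, age = 32)
round(predict(fit_exam, newdata = new_student))  # fitted exam score, nearest mark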
So, as you have seen, it is possible to build regression models with any
number of covariates and factors, including interactions between them.
A collective name often used for such models is general linear models.
As the name suggests, a wide range of models are general linear models. One thing they all have in common is that the random term is assumed to have a normal distribution.
We will round off this unit by using general linear models in R. It is worth
noting that R automatically uses the hierarchical principle when choosing
a model using stepwise regression.
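A minimal sketch of stepwise selection with step(), assuming the wages data frame; the scope formula with '^2', which allows all two-way interactions, is our illustrative choice:

null_mod <- lm(hourlyWageSqrt ~ 1, data = wages)
full_mod <- lm(hourlyWageSqrt ~ (gender + computer + edLev + occ)^2,
               data = wages)
# Forward stepwise search; step() respects the hierarchical principle
step(null_mod, scope = formula(full_mod), direction = "forward")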
We will revisit modelling the exam scores from the OU students dataset by
considering an alternative model later in the module (in Unit 7).
Summary
This unit brought together the regression ideas presented in Units 2 and 3
to develop regression with any number of covariates and factors as the
explanatory variables.
We began the unit by considering regression when there is just a single
covariate, x, and a single factor, A, as the explanatory variables. In this
case, separate regression lines are fitted for the different levels of the
factor. These lines can either be parallel or non-parallel.
If the fitted lines are parallel, then the lines for the different levels of the
factor only differ in terms of their intercepts: this is the parallel slopes
model, which we denoted
Y ∼ A + x.
On the other hand, if the fitted lines are non-parallel, then the lines for the
different levels of the factor differ in terms of both their intercepts and
their slopes. In this case, the covariate and the factor are said to interact,
so that the effect on the response of one of the explanatory variables
depends on the value of the other explanatory variable. For example, the
relationship between the response and x may be positive for level 1 of A,
but may be negative for level 2 of A. The non-parallel slopes model
accommodates the interaction between the covariate and factor by
including an extra interaction term, A:x, in the model so that the model
becomes
Y ∼ A + x + A:x or, equivalently, Y ∼ A ∗ x.
We then moved on to consider regression when there are two factors,
A and B, as the explanatory variables. We started with the case where
A and B do not interact to affect the response:
Y ∼ A + B.
In this case, the model is an extension of the model presented in Unit 3,
where:
• instead of having a baseline mean as the mean response when the
(single) factor (A) takes level 1, the baseline mean is now the mean
response when both factors (A and B) take level 1
• in addition to a separate added effect of each level of A, there is also a
separate added effect for each level of B.
All of these models were brought together and extended in a natural way
towards the end of the unit where we considered regression with any
number of covariates, x1 , x2 , . . . , xq , and any number of factors,
A, B, . . . , Z. The simplest model here is
Y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq ,
where there are no interactions between any of the explanatory variables.
Interactions can also be added into the model, and can be:
• between covariates, between factors or between a mixture of covariates
and factors
• between two explanatory variables (two-way interactions), between three
explanatory variables (three-way interactions), and so on.
When adding interactions into the model, the hierarchical principle needs
to be respected, which says that if an interaction is included in a model,
then the model must also include:
• the individual explanatory variables for each variable in the interaction
• any lower-order interactions involving any of the variables in the
interaction.
Stepwise regression can be useful to help with deciding which explanatory
variables and which interactions should be included in the model.
For the models in this unit:
• A set of indicator variables can be used to represent each factor, so that
all of the models can be expressed as multiple regression models (with
the usual model assumptions).
• Individual t-tests (to test whether the (partial) regression coefficient is
zero) can be used to test whether each covariate should be included in
the model (in addition to the other explanatory variables in the model).
• ANOVA tests comparing the fit of the model with the factor to the fit of
the model without the factor can be used to test whether each factor
should be included in the model (in addition to the other explanatory
variables in the model).
• ANOVA tests comparing the fit of the model with the interaction to the
fit of the model without the interaction can be used to test whether an
individual interaction should be included in the model (in addition to
the other explanatory variables in the model).
R was used to fit the models introduced in this unit, to test whether
individual covariates, factors or interaction terms should be included in the
model, and to carry out stepwise regression in order to choose which
covariates, factors and interactions should be included in the model.
As a reminder of what has been studied in Unit 4 and how the sections in
the unit link together, the route map for the unit is repeated below.
Section 1
Regression with one
covariate and one factor
Section 2 Section 3
Modelling using Modelling using
parallel slopes non-parallel slopes
Section 4 Section 5
Regression with Regression with
two factors that two factors that
do not interact interact
Section 6
Regression with any number
of covariates and factors
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate that regression with one covariate and one factor as the
explanatory variables produces separate fitted lines for different levels of
the factor; the fitted lines are parallel if there is no interaction, and
non-parallel if there is an interaction
• appreciate that for regression with multiple factors, the fitted model is
based on the mean responses for the different level combinations of the
factors
• use means plots to informally decide whether there may be an
interaction between two factors
• appreciate that the ideas of multiple regression, regression with one
covariate and one factor, and regression with multiple factors, can be
combined to produce a regression model with any number of covariates
and factors
• understand that sets of indicator variables can be used to represent any
factors in the model so that a regression model with any number of
covariates and factors can be expressed as a multiple regression model
• interpret the baseline mean, individual effect terms and (partial)
regression coefficients in a data context
• understand that an interaction between explanatory variables means
that the explanatory variables in the interaction work together to affect
the response
• appreciate that interactions can be between covariates, between factors
or between a mixture of covariates and factors
• appreciate that interactions can be between two explanatory variables
(two-way interactions), between three explanatory variables (three-way
interactions), and so on
• understand and be able to use the hierarchical principle
• use parameter estimates to calculate fitted values for the response for a
model with any number of covariates, factors and interactions
• use individual t-tests to test whether each covariate should be in the
model in addition to the other explanatory variables
• use ANOVA tests comparing the RSS values of two nested models to
test whether individual factors should be in the model in addition to the
other explanatory variables
• use ANOVA tests comparing the RSS values of two nested models to
also test whether interactions involving one or more factors should be in
the model in addition to the other explanatory variables
• appreciate that stepwise regression can be used to choose which
covariates, factors and interactions should be included in a model
• use R to fit regression models with any number of covariates, factors and
interactions
• use the summary output from R to test whether individual covariates
and individual interactions between covariates should be included in the
model in addition to the rest of the explanatory variables
• use R to fit nested models and carry out ANOVA tests to test whether
individual factors, and individual interactions involving factors, should
be included in the model in addition to the rest of the explanatory
variables
• use R to produce means plots
• use R to carry out stepwise regression to choose which explanatory
variables to include in the model when there are multiple covariates,
multiple factors and multiple interactions.
References
OECD (2020) OECD 60th anniversary. Available at:
https://www.oecd.org/60-years (Accessed: 20 March 2022).
OECD (2021) Trends in employment, unemployment and inactivity rates,
by educational attainment and age group. Available at:
https://stats.oecd.org (Accessed: 4 October 2020).
Snedecor, G.W. and Cochran, W.G. (1967) Statistical methods. 6th edn.
Ames, IA: Iowa State University Press.
The Open University (2020) ‘Internal bespoke anonymised extract M348’,
OU administration data (Accessed: 25 October 2020).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, a manna ash tree flowering: © zanozaru / www.123rf.com
Subsection 1.2, ‘hooray!’: © Roman Samborskyi / www.123rf.com
Section 2, scenic parallel slopes: © bsvet / www.123rf.com
Subsection 2.2, dog showing skill move: © alexeitm / www.123rf.com
Section 3, scenic non-parallel slopes: © Kotenko / www.123rf.com
Subsection 3.1, musicians performing: © razihusin / www.123rf.com
Subsection 3.2, Andreas Brehme: © Getty Images
Subsection 4.1, diverse and inclusive workforce: © langstrup /
www.123rf.com
Subsection 4.1, older worker: © Wavebreakmedia Ltd | Dreamstime.com
Subsection 4.2, railway lines: © konradbak / www.123rf.com
Subsection 6.1, researchers: © dragoscondrea / www.123rf.com
Subsection 6.2, older learner: © Fizkes | Dreamstime.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
The term α1 is set to be zero because µ is the baseline mean of Y for
level 1 of A, and so the effect of level 1 of A is already accommodated in µ.
Besides, the α1 term compares the effect on Y of level 1 of A with the
effect on Y of level 1 of A, which must be zero!
Solution to Activity 2
Since side is a factor with two levels, K − 1 = 2 − 1 = 1 and we can define one indicator variable zi2, where zi2 = 1 if the ith tree takes level 2 of side (that is, if the ith tree is on the west side of Walton Drive), and zi2 = 0 otherwise.
Then, since zi2 is numerical, we can use zi2 and diameter as the two
covariates in a multiple regression model.
Solution to Activity 3
(a) When the ith tree is on the east side of Walton Drive, it takes level 1
of side and so zi2 = 0. The fitted model is then
height = 1.29 + 19.80 diameter.
(b) When the ith tree is on the west side of Walton Drive, it takes level 2
of side and so zi2 = 1. Therefore, this time the fitted model is
height = 1.29 + 2.62 + 19.80 diameter = 3.91 + 19.80 diameter.
(c) These two regression lines both have the same slope (19.80) but
different intercepts (1.29 for east and 3.91 for west). The lines are
therefore parallel to each other. This can be clearly seen in Figure S1,
which follows.
Figure S1 Scatterplot of height and diameter, together with the two fitted
(parallel) regression lines for height ∼ side + diameter
(d) Since α̂2 = 2.62, after controlling for diameter, the effect on height of being on the west side of Walton Drive, in comparison to being on the east side, is that height would be expected to increase by 2.62 m.
(e) Since β̂ = 19.80, after controlling for side, the effect on height of an increase in diameter of 0.1 m is that height would be expected to increase by 0.1 m × 19.80 = 1.98 m.
Solution to Activity 4
(a) The parameter µ is the baseline intercept when skillMoves takes level 1 and weight is zero, and so µ̂ is the intercept for the fitted regression line for level 1, that is, µ̂ = 6.94.
The parameters α2, α3, α4 and α5 are the added effects on strength of skillMoves levels 2, 3, 4 and 5, respectively, after controlling for weight, in comparison to the effect of level 1. (The effect of level 1 on the relationship between strength and weight is accounted for in the baseline intercept µ.) Therefore,
α̂2 = 13.75 − 6.94 = 6.81,
α̂3 = 12.53 − 6.94 = 5.59,
α̂4 = 9.00 − 6.94 = 2.06,
α̂5 = 9.14 − 6.94 = 2.20.
(b) The values of α̂2 and α̂3 are both positive and relatively large, which suggests that, after controlling for weight, strength is generally higher for players whose values of skillMoves are 2 or 3, in comparison to players whose values of skillMoves are 1.
Although the values of α̂4 and α̂5 are also both positive, their values are smaller than α̂2 and α̂3, which suggests that, after controlling for weight, strength is generally a little higher for players whose values of skillMoves are 4 or 5, in comparison to players whose values of skillMoves are 1, but not as high as for players whose values of skillMoves are 2 or 3.
The values of α̂4 and α̂5 are also very close to each other, suggesting that, after controlling for weight, strength is very similar for players whose values of skillMoves are 4 or 5. This can be seen in Figure 6, where the regression lines associated with these two levels of skillMoves are very close to each other.
Solution to Activity 5
For this model, we have K − 1 regression coefficients associated with A
(that is, α2 , α3 , . . . , αK ), and one regression coefficient associated with x
(that is, β). We therefore have K regression coefficients altogether. So,
from multiple regression, the p-value associated with this test is based on
an F -distribution with K and n − (K + 1) degrees of freedom.
Solution to Activity 6
The p-value associated with the test is very small, so we should reject H0
and conclude that there is evidence to suggest that at least one of the
regression coefficients is non-zero.
Solution to Activity 7
The explanatory variables in MC are x2 and x5 . Those in MA are x2 , x5
and also x4 . So MC is nested within MA .
MC is also nested within MB , since, in addition to MC ’s explanatory
variables (x2 and x5 ), MB ’s explanatory variables also include x1 and x4 .
Finally, MA is also nested within MB , since MB ’s explanatory variables
include x1 in addition to MA ’s explanatory variables (x2 , x4 and x5 ).
Solution to Activity 8
We know that the RSS for the model Y ∼ A + x will be less than the RSS
for the model Y ∼ x. However, if the difference between the RSS values for
the two models is large, this suggests that the RSS for the model
Y ∼ A + x is quite a bit smaller than the RSS for the model Y ∼ x, which
in turn suggests that the fit of the model Y ∼ A + x is quite a bit better
than the fit of the model Y ∼ x.
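In R, this comparison of RSS values for two nested models can be carried out with anova(). A minimal sketch, assuming a data frame dat containing a response y, a factor A and a covariate x (all names hypothetical):

fit_small <- lm(y ~ x, data = dat)
fit_large <- lm(y ~ A + x, data = dat)
# anova() reports the RSS of each model and the F-test p-value for
# the reduction in unexplained variation
anova(fit_small, fit_large)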
Solution to Activity 9
From Table 1, the p-value associated with diameter is very small
(< 0.001), which suggests that diameter is significant after controlling for
side and so should be included in the model.
The p-value from the ANOVA test for whether side should be included in
the model in addition to diameter is also very small (< 0.001), which
suggests that side should also be included in the model in addition to
diameter.
Solution to Activity 10
From Table 2, the p-value associated with weight is very small (< 0.001),
which suggests that weight is significant after controlling for skillMoves
and so should be included in the model.
The p-value from the ANOVA test for whether skillMoves should be
included in the model in addition to weight is also very small (< 0.001),
which suggests that skillMoves should also be included in the model in
addition to weight.
Solution to Activity 11
(a) With the exception of one large (negative) residual, the points in the
residual plot seem to be scattered randomly about zero, suggesting
that the assumption that the Wi ’s have zero mean and constant
variance seems reasonable. Also, the residuals in the normal
probability plot lie roughly along a straight line, and so the
assumption of normality of residuals seems plausible as well.
So, neither of the plots suggests any problems with the assumptions of
the random terms in the parallel slopes model for these data.
Note, however, that neither of these plots gives us any information
about the reasonableness of the independence assumption.
(b) The five regression lines for the levels of skillMoves seem to fit the
associated data points fairly well in Figure 6, and so the assumption
that the five lines are parallel seems to be reasonable.
Solution to Activity 12
In order for the slopes not to be parallel, we need different slopes for the
different factor levels. This means that we need the regression coefficient,
β, to differ across the K levels of A.
Solution to Activity 13
(a) When individual i is male, gender takes level 1. The fitted model
then has the form
hourlyWageSqrt = µ̂ + β̂ workHrs,
and so the fitted line for individuals who are male is
hourlyWageSqrt = 4.575 − 0.0173 workHrs.
(b) Since β̂ = −0.0173, the slope for the fitted line when gender takes
level 1 is negative; that is, when gender is male, the fitted line is a
downwards slope.
However, since β̂ + γ̂2 is positive (0.0162), the slope for the fitted line
when gender takes level 2 is positive; that is, when gender is female,
the fitted line is an upwards slope.
(c) The fitted model is given by:
hourlyWageSqrt = 4.575 − 0.0173 workHrs, for level 1 (male),
hourlyWageSqrt = 2.759 + 0.0162 workHrs, for level 2 (female).
For the male individual, the fitted value of hourlyWageSqrt when the
value of workHrs is 35 is, therefore,
hourlyWageSqrt = 4.575 − (0.0173 × 35) = 3.9695 ≃ 3.97,
while for the female individual, the fitted value of hourlyWageSqrt
when the value of workHrs is 35 is
hourlyWageSqrt = 2.759 + (0.0162 × 35) = 3.3260 ≃ 3.33.
The fitted values of their respective hourly wages are the squares of
the fitted values for hourlyWageSqrt. Therefore, the fitted value of
the hourly wage in £, to the nearest 10p, for the male individual is
3.9695² ≃ 15.80
and for the female individual is
3.3260² ≃ 11.10.
Solution to Activity 14
If the ith observation takes level 1 of factor A, then the indicator variables
zi2 , zi3 , . . . , ziK all take the value 0. Model (6) then becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αK × 0)
+ (β + (γ2 × 0) + (γ3 × 0) + · · · + (γK × 0))xi + Wi
= µ + βxi + Wi.
If the ith observation takes level k of factor A, for k = 2, 3, . . . , K, then the
indicator variable zik takes the value 1, while the other indicator variables
all take the value 0. Model (6) then becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αk × 1) + · · · + (αK × 0)
+ (β + (γ2 × 0) + (γ3 × 0) + · · · + (γk × 1) + · · · + (γK × 0))xi + Wi
= µ + αk + (β + γk ) xi + Wi .
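As an aside, if you would like to see the indicator variables that R constructs for a model like this, model.matrix() will display them. A small sketch with an illustrative three-level factor:

# An illustrative factor A with K = 3 levels and a covariate x
A <- factor(c(1, 1, 2, 3))
x <- c(1.2, 0.7, 2.1, 1.5)
# R's default (treatment) coding creates the indicators z_i2 and z_i3,
# and the interaction columns contain the products z_i2 * x_i and z_i3 * x_i
model.matrix(~ A + x + A:x)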
Solution to Activity 15
(a) When observation i takes level 1 of preferredFoot – that is, when
the player is left-footed – the fitted model for strength has the form
strength = µ̂ + β̂ height.
Therefore, since the fitted line for level 1 of preferredFoot is
strength = −86.2 + 2.22 height,
this means µ̂ = −86.2 and β̂ = 2.22.
When observation i takes level 2 of preferredFoot – that is, when
the player is right-footed – the fitted model for strength has the form
strength = µ̂ + α̂2 + (β̂ + γ̂2) height.
The fitted line for level 2 of preferredFoot is
strength = −6.5 + 1.08 height,
and so
µ̂ + α̂2 = −6.5
−86.2 + α̂2 = −6.5
α̂2 = −6.5 + 86.2 = 79.7
and
β̂ + γ̂2 = 1.08
2.22 + γ̂2 = 1.08
γ̂2 = 1.08 − 2.22 = −1.14.
(Note that when using parameter estimates given to full computer
accuracy, it turns out that γ̂2 is −1.13 to two decimal places.)
(b) Since β̂ = 2.22, the slope for the regression equation when
preferredFoot takes level 1 is positive; that is, when preferredFoot
is left, there is an upwards slope.
The value of γ̂2 represents the added effect that level 2 (right) of
preferredFoot has on the slope in comparison to the regression slope
when preferredFoot is level 1 (left). Since β̂ + γ̂2 is positive, the
slope for the regression equation is still positive, but since γ̂2 is
negative, the slope is less steep when preferredFoot is right than
when preferredFoot is left.
(c) We have the fitted model
strength = −86.2 + 2.22 height, for level 1 (left),
strength = −6.5 + 1.08 height, for level 2 (right).
For the left-footed player, the fitted value of strength when the value
of height is 75 is therefore
strength = −86.2 + (2.22 × 75) ≃ 80,
while for the right-footed player, the fitted value of strength when
the value of height is 75 is
strength = −6.5 + (1.08 × 75) ≃ 75.
Solution to Activity 16
If there is an interaction between A and x, we would expect the model
Y ∼ A + x + A:x
to fit the data much better than the model
Y ∼ A + x.
In this case, we would expect a fairly large reduction in the amount of
unexplained variation when the interaction term is added into the model,
and so we might expect the difference in the RSS values for the two models
to be large.
Solution to Activity 17
The p-value for testing whether the interaction term gender:workHrs
should be included in the model is very small (< 0.001), which means that
there is strong evidence to suggest that the interaction term should be
included in the model. Therefore, it was indeed wise to use a non-parallel
slopes model in Activity 13, since there would have been a significant
decrease in fit if we’d simply used the parallel slopes model.
Solution to Activity 18
A p-value of 0.039 is quite small, which suggests that the interaction term
preferredFoot:height should probably be included in the model,
meaning that a non-parallel slopes model is preferable to a parallel slopes
model for these data.
Solution to Activity 19
The model
height ∼ side + diameter
is a parallel slopes model, whereas the model
height ∼ side + diameter + side:diameter
is a non-parallel slopes model. So, deciding whether a non-parallel slopes
model or a parallel slopes model is better is equivalent to deciding
whether or not the interaction term side:diameter should be included in
the model.
The p-value for testing whether the interaction term side:diameter should
be included in the model was 0.256. This value is quite large and so there
is not enough evidence to suggest that the interaction term side:diameter
should be included in the model. The more parsimonious parallel slopes
model is therefore better for modelling height for these data.
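In R, this test can be performed with anova(), assuming the trees data are held in a data frame called, say, walton (a hypothetical name):

fit_parallel <- lm(height ~ side + diameter, data = walton)
fit_nonparallel <- lm(height ~ side + diameter + side:diameter, data = walton)
# The p-value in the final row tests the interaction term side:diameter
anova(fit_parallel, fit_nonparallel)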
Solution to Activity 20
There seems to be a decreasing trend in the residual plot for the high
fitted values. There are, however, not very many of these higher values,
and for the vast majority of the fitted values the residuals appear to be
fairly randomly scattered about zero. So, overall (to the module team) the
assumption that the random terms have zero mean and constant variance
seems to be reasonable.
The majority of the points in the normal probability plot lie along the
straight line, which is consistent with the assumption that the random
terms are normally distributed. At the ends of the plot, however, the
points systematically deviate slightly from the line, so the assumption of
normality may therefore be questionable. Despite this, since the deviation
is only slight and the majority of points lie close to the line, the normality
assumption can’t be ruled out.
Solution to Activity 21
Although there are a few large negative residuals in the residual plot, and
also a hint of curvature in the residual plot, the points generally seem to
be scattered fairly randomly about zero, so the assumption that the Wi ’s
have zero mean and constant variance seems reasonable.
In the normal probability plot, the residuals lie roughly along a straight
line but systematically depart from the line for the higher residual values.
The assumption of normality may, therefore, be questionable, but (to the
module team) the normal probability plot suggests that the assumption of
normality is still plausible.
So, both plots raise some potential issues, but neither of the plots suggests
any major problems with the assumptions of the non-parallel slopes model
for these data.
Solution to Activity 22
(a) Since each level effect term for A is the added effect on the response in
comparison to the effect of level 1 of A, when k = 1, the ‘effect of kth
level of A’ term in Model (10) will simply be zero. Similarly, when
l = 1, the ‘effect of lth level of B’ term in Model (10) will also be zero.
Therefore, when both factors A and B take level 1, Model (10)
reduces to
Yi = baseline mean + random term.
(b) From part (a), when k = 1, the ‘effect of kth level of A’ term in
Model (10) is zero, and so Model (10) reduces to
Yi = baseline mean + effect of lth level of B + random term.
(c) Also from part (a), when l = 1, the ‘effect of lth level of B’ term in
Model (10) is zero, and so Model (10) reduces to
Yi = baseline mean + effect of kth level of A + random term.
Solution to Activity 23
(a) (i) For males aged 55 to 64, gender takes level 1 and so we do not
have a gender effect in the model, but age does not take level 1
and so we do need to include the level effect for age being
55 to 64. So, the fitted employment rate for males aged 55 to 64
in the country is calculated as
87.777 − 11.340 ≃ 76.4.
(ii) For females aged 25 to 34, gender takes level 2 and so we need to
include the level effect for gender taking the value female in the
model, but age takes level 1, and so we do not have an age
effect. So, the fitted employment rate for females aged 25 to 34
in the country is calculated as
87.777 − 8.068 ≃ 79.7.
(iii) For males aged 35 to 44, gender takes level 1 and so we do not
have a gender effect in the model, but age does not take level 1
and so we do need to include the level effect for age being
35 to 44. So, the fitted employment rate for males aged 35 to 44
in the country is calculated as
87.777 + 5.332 ≃ 93.1.
(iv) For females aged 45 to 54, neither gender nor age take level 1,
and so we need to include both of the level effect terms for
gender taking the value female and age being 45 to 54. So, the
fitted employment rate for females aged 45 to 54 in the country is
calculated as
87.777 − 8.068 + 5.657 ≃ 85.4.
(b) Since the effect term when gender takes the value female is
approximately −8.1, after controlling for age, the employment rate is
expected to be lower by 8.1% for females in comparison to the
employment rate for males.
(c) The effect term when age is 35 to 44 is approximately 5.3 and the
effect term when age is 45 to 54 is approximately 5.7. These values
are both positive, and so, after controlling for gender, in comparison
to the employment rate for the age group 25 to 34, the employment
rate is expected to be higher on average by 5.3% for the age group
35 to 44 and higher on average by 5.7% for the age group 45 to 54.
Notice that since the values of these level effect terms are very similar,
the employment rates for these two age groups, after controlling for
gender, are also very similar.
On the other hand, since the effect term when age is 55 to 64 is
approximately −11.3, and this value is negative, after controlling for
gender, the employment rate is expected to be lower by 11.3% for
those aged 55 to 64 in comparison to those aged 25 to 34.
Solution to Activity 24
In both figures, the lines for the different levels of the factor are parallel to
each other.
Solution to Activity 25
(a) In order to decide whether A should be included in the model in
addition to B, we could compare the values of the RSS for the two
nested models
Y ∼B and Y ∼ A + B.
Then, if the difference is large enough, this would suggest that adding
A into the model (in addition to B) significantly reduces the
unexplained response variation (and therefore significantly increases
the model fit).
(b) Similarly, in order to decide whether B should be included in the
model in addition to A, we could compare the values of the RSS for
the two nested models
Y ∼A and Y ∼ A + B.
Then, if the difference is large enough, this would suggest that adding
B into the model (in addition to A) significantly reduces the
unexplained response variation (and therefore significantly increases
the model fit).
Solution to Activity 26
The p-value from the first test is very small, and so there is evidence to
suggest that adding age into the model in addition to gender significantly
increases the model fit.
The p-value from the second test is also very small, and so there is also
evidence to suggest that adding gender into the model in addition to age
significantly increases the model fit.
Therefore, we should include both gender and age in the model.
Solution to Activity 27
If the model is suitable for these data, then we might expect the lines in
the means plots to be roughly parallel to each other. However, this is not
the case for either of the means plots in Figure 21 and so this model may
not be suitable for these data.
Solution to Activity 28
(a) The first rat takes level 1 of source and level 2 of amount. Therefore,
there isn’t an added interaction term in the model, nor the individual
effect term for source. The fitted value is therefore
ŷ1 = 100.0 − 20.8 = 79.2.
(b) The 11th rat takes level 1 of both source and amount. Therefore,
there isn’t an added interaction term in the model, nor either of the
individual effect terms for source and amount. The fitted value is
therefore simply the fitted baseline mean
ŷ11 = 100.0.
(c) The 21st rat takes level 2 of both source and amount. Therefore,
there is an added interaction term in the model, and also individual
effect terms for both source and amount. The fitted value is therefore
ŷ21 = 100.0 − 14.1 − 20.8 + 18.8 = 83.9.
(d) The 31st rat takes level 2 of source and level 1 of amount. Therefore,
there isn’t an added interaction term in the model, nor the individual
effect term for amount. The fitted value is therefore
ŷ31 = 100.0 − 14.1 = 85.9.
Solution to Activity 29
In order to decide whether the interaction A:B should be included in the
model in addition to factors A and B, we could compare the values of the
RSS for the two nested models
Y ∼A+B and Y ∼ A + B + A:B.
If the difference is large enough, this would suggest that adding A:B into
the model (in addition to A and B) significantly reduces the unexplained
response variation (and therefore significantly increases the model fit).
Solution to Activity 30
The p-value for testing whether the interaction should be included in the
model in addition to source and amount is 0.054. As such, there is some
evidence to suggest that the fit of the model including an interaction is
significantly better than the model without an interaction, although the
evidence isn’t very strong.
In situations such as this, the context of the research problem and the
opinion of the researcher are important to be able to judge how ‘small’ the
p-value needs to be for us to conclude that the interaction term should be
included in the model. If the researcher would only consider a p-value to
be ‘small’ if it is less than 0.01, say, then they would conclude that the
interaction should not be in the model. On the other hand, if they
considered any p-value less than 0.1, say, to be ‘small’, then they
would conclude that the interaction should be included in the model.
Solution to Activity 31
(a) The lines in both of the means plots are not parallel, which suggests
that the effect on the response of each factor depends on the other
factor. In particular, whether or not the individual has a computer at
home seems to have little effect on the hourly wage for females, but
makes quite a difference for males.
(b) The p-value associated with the interaction is very small (< 0.001).
There is therefore strong evidence to suggest that the interaction
gender:computer should be included in the model.
(c) For a male individual who has a computer, both gender and
computer take level 1, and so the fitted value for the hourlyWageSqrt
is simply the baseline mean
ŷ ≃ 4.073.
We want the fitted value of the hourly wage (in £), rather than its
square root, which is 4.073² ≃ 16.59.
(d) For a male individual who doesn’t have a computer, gender takes
level 1 and computer takes level 2. So, since gender takes level 1,
there is neither an interaction in the model nor an effect for gender.
The fitted value of hourlyWageSqrt for this individual is then
ŷ ≃ 4.073 − 0.384 = 3.689.
So, the fitted value of the hourly wage (in £) is 3.689² ≃ 13.61.
(e) From parts (c) and (d), the fitted value of hourly wage for a male
with a computer is £16.59, while that for a male without a computer
is £13.61. Therefore, according to the fitted model, having a
computer at home increases a male’s hourly wage by
£16.59 − £13.61 = £2.98, on average.
Solution to Activity 32
(a) This individual takes level 1 of both gender and computer, so that
the individual effects of these factors are accounted for in the baseline
mean for this individual. However, the other two factors, edLev and
occ, do not take level 1, and so their individual level effects need to
be added into the model. The fitted value is then calculated as
ŷ = baseline mean + effect of level 3 of edLev + effect of level 2 of occ.
(b) This individual doesn’t take level 1 for any of the four factors, and so
the individual effects of the factor levels need to be added into the
model. The fitted value is therefore calculated as
ŷ = baseline mean + effect of level 2 of gender + effect of level 2 of computer
+ effect of level 17 of edLev + effect of level 7 of occ.
Solution to Activity 33
To test whether the factor occ needs to be in the model in addition to
gender, computer and edLev, we can compare the two nested models
hourlyWageSqrt ∼ gender + computer + edLev
and
hourlyWageSqrt ∼ gender + computer + edLev + occ.
Solution to Activity 34
(a) The outcome at Step 1 is to add occ to the model (because this factor
has the smallest AIC value). So, Step 2 starts with the model with
the single factor occ. Repeating this method at each step, we have
that the outcome at Step 2 is to add edLev to the model, then add
gender at Step 3, and finally add computer at Step 4. There are then
no further explanatory variables to try adding into the model, and so
all four factors – gender, computer, edLev and occ – were selected by
this forward stepwise regression procedure.
(b) The model with no change has the smallest AIC value in Step 1 of the
backward stepwise regression procedure, and so we can’t improve the
model by removing any of the four factors. Therefore, the backward
stepwise regression procedure selected all four factors, and so forward
and backward stepwise regression both suggested the same factors for
the model.
Solution to Activity 35
Following the ideas presented in Subsection 6.1, we can define a set of
indicator variables for each factor. Then, since each indicator variable is a
covariate, and x1 , x2 , . . . , xq are also covariates, we have a multiple
regression model.
Solution to Activity 36
(a) The first student takes level 2 (m) of gender and also level 2 (maths)
of qualLink. Therefore, we need to include the effects of both of
these factors in the model. The fitted exam score for the first student
is therefore calculated as
ŷ1 = −24.320 + 0.783 − 2.709 + (1.087 × 89.2) + (−0.113 × 32)
≃ 67.
Solution to Activity 37
(a) The p-values for each of the covariates bestPrevModScore and age
are both very small (p < 0.001 for bestPrevModScore and p = 0.001
for age), as is the p-value for their interaction bestPrevModScore:age
(p = 0.003). This suggests that both covariates bestPrevModScore
and age should be included in the model, together with their
interaction.
(b) The value of xi1 is 89.2 and the value of xi2 is 32, and so the fitted
value for this student is
ŷ1 = 13.193 + (0.644 × 89.2) + (−1.151 × 32) + (0.012 × 89.2 × 32)
≃ 68.
Solution to Activity 38
(a) There are six possible two-way interactions – gender:computer,
gender:edLev, gender:occ, computer:edLev, computer:occ and
edLev:occ.
(b) There are four possible three-way interactions –
gender:computer:edLev, gender:computer:occ, gender:edLev:occ
and computer:edLev:occ.
(c) There are six two-way interactions, four three-way interactions and a
single four-way interaction (between all four of the factors). So, in
total there are 6 + 4 + 1 = 11 possible interactions for this model.
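These counts are binomial coefficients, and can be checked in R with choose():

choose(4, 2)  # two-way interactions: 6
choose(4, 3)  # three-way interactions: 4
choose(4, 4)  # the single four-way interaction: 1
choose(4, 2) + choose(4, 3) + choose(4, 4)  # 11 possible interactions in total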
Solution to Activity 39
(a) If the interaction edLev:occ is in the model, then the individual
effects for each of the factors edLev and occ also need to be in the
model. Since the interaction is a two-way interaction, there aren’t any
lower-order interactions to consider.
(b) If the interaction gender:computer:occ is in the model, then the
individual effects of each of the factors gender, computer and occ
need to be in the model, together with the two-way interactions
gender:computer, gender:occ, and computer:occ.
(c) If the interaction between all four of the factors is in the model, then
the individual effects of each of the four factors need to be in the
model, together with all six of the two-way interactions between each
pair of the four factors, and all four of the three-way interactions
between each subset of three of the four factors.
Unit 5
Linear modelling in practice
Introduction
So far in this module, we have seen how multiple regression models
including any number of covariates and factors, also known as general
linear models, can be used to model the relationships between a response
variable and various explanatory variables. General linear modelling
techniques represent a powerful set of
tools in the statistician’s toolbox, enabling a wide variety of data to be
analysed. In Units 2, 3 and 4, the focus has been on the fitting and
checking of such models. In this unit, although there will still be some
fitting and checking, the focus will be on other aspects of the statistical
modelling process, both before a model is formulated and after a good
model has been identified.
The statistical modelling process was discussed back at the start of Unit 1
and represented by a diagram (Figure 1 in Unit 1). For convenience, this
diagram is repeated shortly (Figure 1 in this unit) with an indication of
the steps that we will be focusing on in this unit: ‘Pose questions’, ‘Design
study’, ‘Collect data’, ‘Report results’ and, in part, ‘Formulate model’.
For the majority of the unit, we’ll use the statistical modelling process in
the context of predicting Olympic success. We’ll start in Section 1 by
translating a general description of the question of interest, expressed
using non-technical language, into a well-defined statistical modelling task.
Statistical modelling obviously requires some data to work with. So, in
Section 2, we’ll deal with sourcing suitable data and then preparing the
data ready for model fitting. This includes putting the data into a form
suitable for whatever statistics software we are using. Preparing the data
for modelling can be a non-trivial process which, if not handled
appropriately, can compromise, or even invalidate, the entire analysis.
Figure 1 The statistical modelling process: the steps of the process that will
be focused on in this unit are circled
Section 1: Specifying our problem
Section 2: Sourcing and preparing the data
Section 3: Building a statistical model
Section 4: Further modelling issues
Section 5: Missing data
Section 6: Documenting the analysis
Section 7: Replication
Note that Subsections 2.2, 3.2, 3.3, 4.3 and 6.2 contain a number of
notebook activities, so you will need to switch between the written
unit and your computer to complete these.
Additionally, in Subsections 1.3, 2.1, 6.1, 6.3 and 6.4, you will need to
access other resources on the module website (such as videos and
articles) to complete a number of other activities.
Subsection 1.4 deals with the ‘Design study’ step of the modelling process.
In this case, as we will be using secondary data, this corresponds to
identifying exactly what we are trying to achieve with our modelling.
For each sport at the Olympics, a number of events are held. For example,
at recent summer Olympic Games, athletics has included events such
as 100 m, 200 m, high jump, long jump, 4 × 100 m relay and 4 × 400 m
relay, both for men and for women. Each event awards gold, silver and
bronze medals for first, second and third place, respectively.
Throughout this unit, we will use the term ‘athlete’ to include any
competitor at an Olympics, not just those taking part in athletics events.
For many athletes, winning an Olympic gold medal is seen as the pinnacle
of sporting achievement. But in addition to being extremely important to
individual athletes, success at the Olympics is often seen as a matter of
national pride. It is therefore not surprising that, before an Olympic
Games, there is speculation about which nations will do well and which will
not. This is something that statistics can help to try to shed some light on!
With the use of appropriate data, statistical models can be used to model
success at previous Olympic Games, which can then provide a basis for
predictions of what will happen in future Olympic Games. Such statistical
modelling is the focus of this unit. In particular, we will be developing a
statistical model for predicting which countries are likely to do well and
which countries are likely to not do so well, at the Paris 2024 Summer
Olympics. (At the time of writing this unit, this Olympics is yet to be
held.)
As you will have noticed while working on Activity 1, there is more than
one way to measure a nation’s success at the Olympics. Even if you just
regard ‘winning a medal’ as the key criterion of success (and hence all the
athletes from fourth place downwards are regarded as having not
succeeded), there is still the issue of how much more weight, if any, to
place on a gold medal as opposed to a silver medal, and a silver medal as
opposed to a bronze medal.
For the purposes of this unit, we will select a single (simple) measure of a
nation’s Olympic success, given in Box 1.
A gold medal from the Tokyo 2020 Summer Olympics
Notice that in the measure of Olympic success given in Box 1, it does not
matter which medal (gold, silver or bronze) an athlete wins, just that they
win a medal. Then, using this measure, any model we consider will take
the total number of medals won (or a transformation of the total number
of medals won) as the response variable.
Watch ‘Video for Activity 2’ provided on the module website and then
answer the questions below. The short video describes some of the
explanatory variables a team of modellers used when trying to predict the
number of medals a nation would win at Rio 2016. (Note that this video
was made in 2016, before those Olympic Games had taken place.)
(a) What explanatory variables are suggested in the video?
(b) Which of the suggested explanatory variables are likely to affect a
nation’s Olympic success directly? And which are likely to affect a
nation’s Olympic success more indirectly?
(c) Suggest at least one other potential explanatory variable.
So, as you have seen in Activity 2, knowledge of the context in which the
modelling sits plays a part in deciding on potential explanatory variables.
Another important factor in the selection of potential explanatory
variables is pragmatism. For any variable we wish to use, we need to be
able to source data for it. For example, a good explanatory variable to
predict a nation’s success at the Olympics might be one that directly
measures the importance a nation gives to sporting success. However, data
for this variable are difficult, if not impossible, to obtain! In contrast, data
on a nation’s success at the previous Olympics (as measured by the
number of medals won, for example) are easy to obtain and could offer an
alternative explanatory variable as a useful approximation to the one for
which we cannot obtain data.
For predictive models, a particular concern is whether, for any prediction
we wish to make, we will be able to obtain the corresponding values of the
explanatory variables. As you will see in Activity 3, this can depend on
when we want to make the predictions, as well as which predictions we
wish to make.
We will tackle this prediction task by building a model for the number of
medals won by nations at summer Olympics, considering all of the summer
Olympics held between 1996 and 2016, basing the predictions for each of
these summer Olympics on what is known at the end of the associated
previous summer Olympics. (Data for the summer Olympics in 2020 will
be used later on in the unit to assess how well our model performs.)
In Subsection 1.3, a number of potential explanatory variables to help us
with the task have already been suggested. For the rest of the unit, we will
just focus on the ones listed in Box 3, which follows.
2 Sourcing and preparing the data
Also recall that secondary data are data that have been collected for other
purposes. These data might be data that are in the public domain. Or the
data may be confidential to the organisation that the statistician works in.
Alternatively, they might be data from elsewhere that the statistician has
obtained permission to use.
As you saw in Subsection 2.2.1 of Unit 1, if the data used are secondary
data, it can affect the usefulness and quality of the data for the particular
problem at hand. For example, the data collected may not adequately
address the researcher’s problem, the data may be out of date, it might be
difficult to decipher definitions, and so on.
One source of information about the number of medals won at the summer
Olympic Games is Wikipedia, which we will consider now.
Read the copy of the Wikipedia page for Rio 2016 given on the module
website (‘2016 Summer Olympics medal table’, 2021) and then answer the
following questions.
(a) Do you trust that the medals table given on this Wikipedia page
correctly gives the number of medals that have been awarded to each
nation?
(b) Is it true that every event awarded a gold, silver and bronze medal?
(c) At Rio 2016, did every athlete who won a medal represent a nation?
(d) Has the winner of each medal remained the same over time?
As Activity 4 shows, even if the source of the data is trusted, and the data
are correct, it might still be easy to make false assumptions about the
data. (For example, it would be wrong to assume that every event awarded
a gold, silver and bronze medal, or that every athlete represented a
nation.) Making a wrong assumption can lead to errors in building and
interpreting the model. Activity 5 highlights another problem that arose at
Rio 2016 and can affect our model to predict success at future Olympics.
In the solution to Activity 4, you saw that Kuwaiti athletes were not able
to compete for their nation at Rio 2016. Unfortunately, these were not the
only athletes that faced problems: the situation for athletes from Russia
was also not straightforward.
Read the BBC News article ‘Rio 2016 Olympics: Russians “have cleanest
team” as 271 athletes cleared to compete’ (BBC News, 2016) provided on
the module website, then consider the following question.
When using data from Rio 2016 to build a model to predict success at
future Olympics, why is it important to know about the problems faced by
the Russian team?
Russia’s team after winning silver in women’s gymnastics at Rio 2016
As you have seen in Activities 4 and 5, the question ‘how many medals did
each nation win at Rio 2016?’ does not have as straightforward an answer
as it first might seem. Rio 2016 is not unique in that respect. Similar
issues surround other summer Olympics.
Some of the other problems that have arisen since 1980 are listed below.
• Moscow 1980: more than 60 nations boycotted the Olympics.
• Los Angeles 1984: 14 nations boycotted the Olympics.
• Barcelona 1992: following the break-up of the Soviet Union, Russia and
other former Soviet republics competed as a ‘Unified Team’.
• Sydney 2000: seven medals were reallocated. This included one, initially
won by Lance Armstrong, which was reallocated over a decade later
in 2013.
• Beijing 2008 and London 2012: in 2016, a wave of retesting of athletes
for potential doping violations from both these Olympic Games led to a
number of medals getting reallocated.
• Tokyo 2020: Russia was not allowed to compete as a nation. However,
athletes from that country were able to compete representing the
‘Russian Olympic Committee (ROC)’ instead.
When issues with a variable are identified, such as we have seen with the
number of medals each nation has won, a decision has to be taken about
what to do.
Sometimes it will be decided that a variable is sufficiently unreliable that it
is better not to use it. In other cases, we can still continue to use the data
after resolving any ambiguities.
For example, in terms of medal allocations, in this unit we will use the
medal allocations as they stood on 9 August 2021 (just after the end of
Tokyo 2020), as this was around the time this unit was written. This has
the advantage of making the medal allocation used unambiguous. However,
this means that for more recent Olympic Games, in particular Tokyo 2020,
there has been less time afterwards for medal reallocations to occur.
So, now that we have decided how we’re going to measure population, the
next task is to source some data! One source for population data is the
World Bank. You will consider this in the next activity.
Sources of data such as the World Bank tend to be very reliable. However,
since maintaining data and making them available costs money, there is no
guarantee that such data will be kept up-to-date or in formats that are
easy to use.
When it comes to data about the wealth of a nation, the World Bank is
again a useful source. It provides several different measures of national
wealth. The World Bank’s description of these measures (The World Bank
Group, 2020b) is provided on the module website.
In this unit we will use gross domestic product (GDP) per capita as a
measure of a nation’s wealth. (An explanation of GDP per person is given
in Example 4 of Unit 1.) However, even having made this decision, there is
still ambiguity, because there are different units which are used to measure
GDP. For example, it might be measured in terms of each nation’s own
currency or relative to a single standard currency such as US dollars.
It is not clear which are the most appropriate units to use when it comes
to predicting the impact of a nation’s wealth on Olympic success. So, here
we will take the arbitrary decision to focus on GDP per capita relative to
the US dollar in 2010. Similarly to population size, a country’s GDP per
capita changes over time. Therefore, the best time to measure GDP –
relative to each summer Olympics – needs to be decided. This could be
based on considerations such as when increased government-level
investment in sport might be thought to translate to better performance at
the Olympics. However, in this unit, we simply choose to use the same
year as the population estimate, which is four years beforehand.
Now that you have seen in Activity 7 what data in CSV format can look
like, the time has come to read such data into a statistical package.
We will start in Notebook activity 5.1 by reading into R the CSV file
containing data from the medals table for Tokyo 2020 (part of which was
shown in Figure 2). We’ll then save the data in a new file that is in R’s
data format. Notebook activity 5.2 explains how the data from the medals
table for Tokyo 2020 can be read into R from the saved R data file created
in Notebook activity 5.1.
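A minimal sketch of what these two notebook activities involve, using hypothetical file names:

# Read the CSV file into a data frame (the file name here is hypothetical)
medals2020 <- read.csv("tokyo2020_medals.csv")
# Save the data frame in one of R's own data formats, then read it back
saveRDS(medals2020, "tokyo2020_medals.rds")
medals2020 <- readRDS("tokyo2020_medals.rds")
head(medals2020)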
Although the resulting data frame and the data for each variable in
Notebook activity 5.1 are ‘ready for use’, this is not always the case. For
example, we may find that the variable names given by the original data
source are not ones that would be helpful for us to use. Alternatively,
there may be problems with using the data frame because of the way the
data were defined in the data source, or because of the way that R read
the data. In such cases, the analyst may want to adapt the data frame to
make it ‘ready for use’ for their particular purpose. In Notebook
activity 5.3, the final notebook activity in Subsection 2.2.1, we will create
a data frame which is not ‘ready for use’, using data (from a CSV file) on
national population estimates for various years from the World Bank. This
data frame will then be adapted until it suits our needs.
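As a flavour of the kind of adaptation involved, the sketch below (with made-up names and values, not the actual World Bank file) renames a variable and reshapes a wide-format extract, with one column per year, into long format using base R's reshape():

# A made-up wide-format extract: one row per country, one column per year
pop_wide <- data.frame(country = c("France", "Brazil"),
                       pop2012 = c(65.7, 199.2),   # populations in millions
                       pop2016 = c(66.9, 206.2))
# Rename a variable to something more convenient ...
names(pop_wide)[1] <- "nation"
# ... and reshape to long format: one row per nation-year combination
pop_long <- reshape(pop_wide, direction = "long",
                    varying = c("pop2012", "pop2016"), v.names = "population",
                    times = c(2012, 2016), timevar = "year", idvar = "nation")
pop_long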
Activity 8 discussed how we could combine the data from the two medals
tables for Rio 2016 and Tokyo 2020 into a single table. We will put this
into practice in Notebook activity 5.4 by combining a data frame that
contains the Tokyo 2020 medals table with a data frame that contains the
Rio 2016 medals table.
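In R, two data frames with the same columns can be stacked with rbind(); a minimal sketch, assuming data frames rio2016 and tokyo2020 (hypothetical names):

# Add a column identifying the Games before stacking
rio2016$year <- 2016
tokyo2020$year <- 2020
# Stack the two medals tables into a single data frame
medalsTables <- rbind(rio2016, tokyo2020)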
Suppose that the data on population sizes and GDP are added to the data
in the medals tables for each country. In what way should the structure of
the combined data differ from the medals tables data?
After working through Notebook activities 5.4 and 5.5, we have ended up
with two datasets: one containing the information about the medals tables
and one for the World Bank data. These two datasets can be merged using
the same approach taken in Notebook activity 5.5. However, it turns out
that matching rows does not work well. In Activity 10, you will explore
why there can be problems.
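Row matching in R is typically done with merge(); a minimal sketch, again with hypothetical data frame names:

# Match rows of the two datasets on the country name;
# all.x = TRUE keeps medals rows even when no match is found
combined <- merge(medalsTables, worldbank, by = "country", all.x = TRUE)
# Rows where matching failed show up with NAs, often because the two
# sources spell country names differently (e.g. 'Korea, Rep.')
combined[!complete.cases(combined), ]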
(Remember ‘garbage in, garbage out’ means that it is important that any
sorting out of the data is done correctly.) So this has been done for you
(because we’re kind like that!).
The dataset has been split into two parts – the ‘Olympics dataset’ and the
‘Olympics 2020’ dataset – so that the data from the 2020 Olympics can be
held back to use later when models are compared. We will explain in
Subsection 4.3 why this is a good idea.
A description of the Olympics dataset is given next. In Section 3, we will
move on to analysing the data.
3 Building a statistical model
Table 3 The first three and last three observations from olympic
Consider the models you have already learnt about in this module, and in
previous modules.
(a) What sort of response variable are the models suitable for?
(b) What sort of explanatory variable(s) are the models suitable for?
A statistical toolbox
So, as Activity 12 will have reminded you, you have learnt how to fit
models that contain a wide variety of explanatory variables, such as when
an explanatory variable is categorical or when it is numerical. You can also
fit models with lots of explanatory variables as well as just one.
However, when it comes to the response variable, the situation is more
limited. The variable needs to be continuous, or at least not too discrete.
Furthermore, in all of the models we have met so far, there is an
assumption that the random part of the model is based on the normal
distribution. So, the starting point is to identify the response variable and
check that this assumption is not unreasonable.
[Figure: histogram of the number of medals won, with frequency on the vertical axis]
You saw in Unit 2 that a regression model involving more than one
explanatory variable can be better than a simple linear regression model.
Furthermore, while the significance or otherwise of an explanatory variable
in a simple linear regression model can indicate whether that explanatory
variable should be in a multiple regression model, it doesn’t guarantee it.
Explanatory variables that are not significant in a simple linear regression
model may be so in a multiple regression model, while explanatory
variables that are significant in a simple linear regression model may not
be when fitted with other explanatory variables.
We end this subsection with Notebook activity 5.8, where we will try using
a multiple regression model for medals using all three of the covariates
considered in separate simple linear regression models in Notebook
activities 5.6 and 5.7 – namely, lagMedals, population and gdp.
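A sketch of the corresponding fit in R, using the olympic data frame described above:

fit_multi <- lm(medals ~ lagMedals + population + gdp, data = olympic)
summary(fit_multi)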
The variable nextHost – whether or not the country will be the next host
of the Olympic Games – can be treated in the same way as host. You will
consider the remaining variable, year, in Activity 15.
In olympic, year can only take one of six values: 1996, 2000, 2004, 2008,
2012 or 2016. So, in a linear model do you think it is better to treat year
as a covariate or as a factor? Or does it not matter? Justify your opinion.
In all of this discussion about how to treat year in any prospective model,
you may be wondering why the year in which an Olympic Games is held
might make a difference. One reason is that the events at the Olympic
Games have not remained the same, so the total number of medals that
can be won has also varied.
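Both treatments of year are straightforward to fit in R, which is one reason the choice needs some thought rather than being forced by the software; a sketch:

# year as a covariate: a single slope representing a linear trend over time
fit_covariate <- lm(medals ~ year, data = olympic)
# year as a factor: a separate level effect for each Games
fit_factor <- lm(medals ~ factor(year), data = olympic)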
4 Further modelling issues
There can also be occasions when a variable, or variables, are always in the
model, regardless of their statistical significance. Often these are variables
that are already known to be associated with the response variable.
Including such variables can simplify the modelling process by reducing the
number of models that are considered.
The different status of variables can impact on the way a model is
described. When a key explanatory variable, say X1 , is added to a model
containing a set of other variables, say X2 , X3 , . . . , Xq , reference might be
made to the effect of X1 adjusted for X2 , X3 , . . . , Xq .
‘But that’s the same interpretation as we already use for multiple
regression!’ I hear you cry. Well yes it is, but the main difference in this
subsection from what we’ve done before is the fact that we are only
interested in the effect of one (or more) key explanatory variable(s) – and
the other explanatory variables are simply there to ensure that we can
assess the effects of the key variable(s) from a good model baseline. If the
model representing the relationship between the response variable and
X2 , X3 , . . . , Xq is a good one, then we should be able to assess the effect of
the key variable well. This is described in Box 6.
So, what might this mean for our researcher from Example 3 who is most
interested in the effect of being the host nation on the number of Olympic
medals won? Well, we have already seen in Notebook activity 5.6 that
there is a (very) strong relationship between the number of medals won by
a nation at an Olympics (the response medals) and the number of medals
won by the nation at the previous Olympics (the explanatory variable
lagMedals). It therefore looks likely that any good model for medals
should have lagMedals as an explanatory variable. So, in order to assess
the effect of host on medals, the researcher should also include (at least)
lagMedals in the model.
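A sketch of this fit in R; the summary output then gives the effect of host adjusted for lagMedals:

fit_host <- lm(medals ~ lagMedals + host, data = olympic)
summary(fit_host)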
We will interpret the effect of host on the response medals using ‘adjusted
for’ next in Activity 16.
(a) Table 4 shows the resulting regression coefficients from fitting the
following model to the Olympics dataset
medals ∼ lagMedals + host.
4.2 Is it an outlier?
In this module you have already come across the terms ‘outlier’ or
‘potential outlier’ applied to observations in a dataset. However, before
considering whether an observation is an outlier (or potential outlier), in
the next activity it is worth revisiting what this label implies about an
observation.
Take a little bit of time to think about what you understand by the term
‘outlier’. What makes an observation an outlier?
Figure 4 Boxplot of the medals data (medals won at Tokyo 2020)
(b) Figure 5 shows a boxplot of the same data after a log transformation
has been applied. Do any of the data points appear to be outliers in
this boxplot? If so, which ones and in what way are they outlying?
Figure 5 Boxplot of the data after a log transformation
[Figures: scatterplot of ammonia absorbed (%) and a residual plot]
(c) Figure 8 is a residual plot from a regression including air flow and two
other explanatory variables. Now do any of the data points appear to
be outliers? If so, which ones and in what way are they outlying?
Figure 8 Residual plot from a regression of percentage of ammonia absorbed
including three explanatory variables
Box 7 Outliers
An outlier is an observation that does not follow the same pattern as
the majority of the data. As such, whether or not an observation is
regarded as an outlier depends on the values the observation has for all
of the variables and the model, or models, that are being considered.
In your study of statistics before, and in the module so far, you have
already met some methods for spotting outliers. For example:
• looking at boxplots of the data
• looking at scatterplots
• looking at a plot of residuals against fitted values from a regression
model.
However, with all of these methods, it is worth noting that they are not
foolproof. Just because a particular plot does not show an outlier does not
mean that one is not there. For example, in regression, if an outlying
observation is sufficiently influential, the fitted regression line will go close
to that observation even if it does not reflect the pattern in the rest of the
data!
So, given all this uncertainty, what should be done about it? Box 8 gives
some suggestions.
As Box 8 suggests, one strategy for dealing with outliers is to compare the
results of fitting the model both with and without the outlier(s) included.
Activities 20 to 22 focus on three situations where this is done. For the
purposes of making teaching points, we will make the arbitrary decision to
use a significance level of 0.05 throughout these three activities and we
will be focusing on three separate small subsets of data from the Olympics
dataset.
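A sketch of this ‘with and without’ strategy in R, where both the data frame name and the row numbers of the suspected outliers are hypothetical:

fit_all <- lm(medals ~ gdp + population, data = olympicSubset)
# Suppose rows 12 and 27 were flagged as outliers (illustrative only)
outlier_rows <- c(12, 27)
fit_dropped <- lm(medals ~ gdp + population,
                  data = olympicSubset[-outlier_rows, ])
# Compare the coefficient p-values and the overall F-test of the two fits
summary(fit_all)
summary(fit_dropped)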
Figure 9 Residual plot for medals ∼ gdp + population using a subset of the data
After fitting the model to the data in the second subset, this time the
p-value associated with the coefficient for gdp was 0.054, and that
associated with the coefficient for population was 0.001. The p-value
for the F -statistic for the model using these new data was 0.003.
Once again using a significance level of 0.05, do either of the two
covariates gdp and population appear to be significantly related to
the number of medals won for this second subset of data?
Figure 10 Residual plot for medals ∼ gdp + population using a different
subset of data to Activity 20
(c) The statistician decided that there was one outlier in the data. The
regression was refitted after having dropped this outlier from the
dataset.
The revised p-value associated with the coefficient for gdp was 0.037,
the revised p-value associated with the coefficient for population was
less than 0.001, and the revised p-value for the F -statistic was less
than 0.001.
Have the general conclusions changed? If so, in what way? Hence,
does it matter whether the outlier is included or not?
(a) After fitting the model to the data in the third subset of data, the
p-value associated with the coefficient for gdp was 0.947, and that
associated with the coefficient for population was 0.001. The p-value
for the F -statistic for the model using this third subset of data was
0.005.
Once again using a significance level of 0.05, do either of the two
covariates gdp and population appear to be significantly related to
the number of medals won for this third subset of data?
(b) The corresponding residuals against fitted values plot is given in
Figure 11. Based on this plot, how many outliers do you think there
are? Which points are they on this plot?
[Figure 11: residuals plotted against fitted values]
(c) The statistician decided that there were two outliers in this third
subset of data. The regression was refitted after having dropped these
outliers from the dataset.
The revised p-value associated with the coefficient for gdp was 0.958,
the revised p-value associated with the coefficient for population was
0.203, and the revised p-value for the F -statistic was 0.411.
Have the general conclusions changed? If so, in what way? Hence,
does it matter whether the outliers are included or not?
Use this model to predict the number of medals that will be won by a
nation who is hosting the summer Olympics and who won 17 medals
at the previous summer Olympics.
(c) Using this model, the 95% prediction interval for the number of
medals won by a nation described in part (b) is estimated to
be (22.5, 36.0). That nation turns out to be Brazil at Rio 2016.
At Rio 2016, the actual number of medals that Brazil won was
19 medals. Did the actual observed response lie inside the 95%
prediction interval or not?
Activity 23 shows that there are two interlinked aspects to prediction. One
is how close the prediction is to the actual result. The other is whether any
prediction interval we give reflects the true uncertainty about the predicted
value.
Box 9 introduces two ways of measuring how close the predictions are to
the actual results: the mean squared error and the mean absolute
percentage error.
Note that, for both measures, the smaller the value, the closer the
predicted values tend to be to the actual values. Also, in the unusual
situation that all the predictions match the corresponding actual
values, both the MSE and the MAPE take their minimum value of 0.
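Assuming the usual definitions – the MSE as the mean of the squared prediction errors, and the MAPE as the mean of the absolute percentage errors – both measures are easily computed in R:

# MSE: the mean of the squared prediction errors
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}
# MAPE: the mean of the absolute percentage errors
mape <- function(actual, predicted) {
  100 * mean(abs((actual - predicted) / actual))
}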
If there are too few or too many actual observations in the prediction
intervals, this suggests the prediction intervals do not have the width they
should have.
• If there are too few observations in their respective intervals, then it
means that the intervals are not capturing all of the uncertainty
surrounding the predicted values. In this case, there is overconfidence
about the range of values the predicted value might take.
• If there are too many observations in their respective intervals, then it
means that the intervals are too cautious about the range of values the
predicted value might take.
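A sketch of how this coverage could be checked in R, assuming a fitted model and a test dataset (both names hypothetical):

# 95% prediction intervals for the observations in the test dataset
pred <- predict(fit_host, newdata = testData, interval = "prediction")
# Proportion of actual responses falling inside their intervals;
# for well-calibrated intervals this should be close to 0.95
mean(testData$medals >= pred[, "lwr"] & testData$medals <= pred[, "upr"])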
Box 10 Over-fitting
When fitting a model, it is possible to have a model that fits the data
too well. When this happens, the model reflects peculiarities that are
specific to the data used. As these peculiarities are not then reflected
in other data that the model might be applied to, the applicability of
the model is compromised. Such a model is said to over-fit the data.
You will explore fitting and over-fitting a little more in the next activity.
Figure 12 A dataset to which regression models with various numbers of
parameters have been fitted: (a) a simple linear regression model, depending
on two terms (parameters), with (b) to (d) showing regression models with an
increasing number of terms fitted to the data
(a) Which regression model appears to provide the closest fit to the data
so that the curve tends to be closest to the observed points?
(b) The MSE values for each of these models are given in Table 6. Are
these values to be expected? Why or why not? (Hint: think about
how the MSE values are calculated.)
Table 6 MSEs for the models shown in Figure 12
(c) Which regression model do you think best represents the relationship
between x and y over the range displayed?
All of the datasets you have been fitting models to so far in this module
have been training datasets. As such, the training dataset has observed
values for the response variable as well as the explanatory variables. It
turns out that in the test dataset we also need observed values for the
response variable, along with observed values for the explanatory variables.
This is because, although we don’t need the values of the response variable
when making predictions, we do need them to compare the predictions
from the model with the actual observed values. In the next activity you
will use a test dataset to assess models.
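The sketch below illustrates the idea with simulated data: polynomial regression models of increasing complexity are fitted to a training dataset, and each is then judged by its MSE on a separate test dataset. Typically the most complex model fits the training data best but predicts the test data worst.

set.seed(1)
x <- runif(40, 2, 10)
y <- 3 + 0.3 * x + rnorm(40, sd = 0.5)
train <- data.frame(x = x[1:20], y = y[1:20])
test <- data.frame(x = x[21:40], y = y[21:40])
for (degree in c(1, 3, 9)) {
  fit <- lm(y ~ poly(x, degree), data = train)
  train_mse <- mean(residuals(fit)^2)
  test_mse <- mean((test$y - predict(fit, newdata = test))^2)
  cat("degree", degree, "train MSE", round(train_mse, 3),
      "test MSE", round(test_mse, 3), "\n")
}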
[Figure: four panels (a)–(d) plotting y against x]
5 Missing data
So far in this unit, we have been focusing on the problem of developing a
model for predicting the number of medals a nation will win at Paris 2024.
The data we have been using for this in Notebook activities 5.6 to 5.11 are
complete, which means that for each observation in the dataset, all values
of the variables are known. However, often we do not know every value for
every observation in a dataset. Indeed, we had this problem for some of
the values of the variables considered in Notebook activity 5.5. When such
values are not available, or not known, this is referred to as missing data.
The presence of missing data matters since it can put at risk the
representativeness of the data to the wider population of interest. This is
illustrated in Example 6.
Example 6 illustrated how missing data can cause problems with how
representative the data are. But what about data for the summer
Olympics? What missing data are there that can impact on
representativeness? You’ll consider examples of this in Activity 26.
Here are two examples of where there are missing data connected with
data about the Olympics that we have been analysing. In each case, briefly
discuss what impact on representativeness such missing data might have.
(a) No GDP per capita values being available for the Democratic People’s
Republic of Korea.
(b) The number of medals won at the previous Olympics not being known
for the Czech Republic in 1996. (In 1992, Czech athletes competed as
part of a Czechoslovakian team.)
In Subsection 5.1 you’ll learn about strategies for handling missing data in
analyses. Then, in Subsection 5.2, we’ll focus on why the data are missing.
As you will discover in Subsection 5.3, considering why data are missing
allows us to then decide whether our chosen strategy for dealing with
missing data is appropriate.
In the next activity, you will consider the placebo effect dataset in more
detail.
The values of the variables for the first six room attendants are given in
Table 8 above. What do you notice while looking at this table?
There are many missing values across the whole dataset of 75 room
attendants. The number of missing values for each variable is given in
Table 9. From this, we can see that, although there are a couple of
variables for which missing data are not a problem, most of the variables
in this dataset have at least one missing value.
Table 9 Number of missing values for each variable in placeboEffect
In the rest of this subsection, we will introduce three strategies for dealing
with such data: complete case analysis, available case analysis and
imputation.
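To make the first two strategies concrete, here is a brief R sketch; it assumes the placebo effect data are in a data frame called placeboEffect, and it is illustrative code rather than the module's notebook code.

# Complete case analysis: keep only room attendants with no missing
# values for any variable.
complete_cases <- na.omit(placeboEffect)

# Available case analysis: lm() does this by default, because its
# standard na.action drops only those rows with missing values among
# the variables that actually appear in the model formula.
fit <- lm(percent2 ~ informed + percent, data = placeboEffect)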
Recall that the placebo effect dataset contains data gathered about a total
of 75 room attendants and that – excluding attID, which just gives
identification numbers – there are a total of nine variables.
(a) Suppose that a value was recorded for every variable for every room
attendant. How many individual recorded data values would there be
in the total dataset? (Ignore the values given for attID as these are
just identification numbers.)
(b) Table 9 gave the number of missing values for each variable in the
placebo effect dataset.
In total, how many values were not recorded? Hence, what percentage
of the total number of values that could have been recorded in the
dataset actually were recorded? (In calculating this percentage, ignore
the values given for attID as these are just identification numbers.)
(c) The number of room attendants with 0, 1, 2, 3, 4 or 5 missing values
is given in Table 10. (Each room attendant had no more than
5 missing values.)
Table 10 Number of room attendants in placeboEffect with 0, 1, 2, 3, 4 or 5
missing values
Table 8 gives the data for the first six room attendants in the placebo
effect dataset. Due to the way this dataset is organised, all of these first
six room attendants were in the ‘Not informed’ group (that is, they all had
the value 0 for informed).
The data for the first six room attendants in the ‘Informed’ group (that is,
for attID 35 to 40) are given in Table 11.
Table 11 The first six room attendants in the ‘Informed’ group
For each of the following models, give the identification numbers of the
room attendants whose data will be included in an available cases analysis.
(a) percent ∼ informed
(b) percent2 ∼ informed
(c) percent2 ∼ informed + percent
(d) percent2 ∼ informed + percent + age + wt + bmi + ratio
+ syst + diast
5.1.3 Imputation
For both a complete case analysis and an available case analysis, missing
data can be handled by dropping cases which contain missing values.
Imputation takes a different approach – it aims to replace missing data
with suitable values, so that the dataset can then be analysed in the same
way as if there had been no missing values.
In Example 9 we show how this works with respect to the first six rows of
the placebo effect dataset that was given in Table 8.
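As a sketch of the simplest form of imputation in R, mean imputation replaces each missing value of a variable by the sample mean of its observed values; again this assumes a data frame called placeboEffect, and is our illustration rather than the module's notebook code.

# Replace the missing values of a variable by the mean of its
# observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
placeboEffect$percent2 <- impute_mean(placeboEffect$percent2)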
• For some observations, it does not make sense for there to be a value.
For example, for a variable defined as ‘age when first married’ it only
makes sense to have a value for those people who have been married.
For those who have never been married, this value is necessarily missing.
Knowing why the data are missing is important as it can help us to decide
whether we are dealing with the missing data appropriately.
Now, for any given dataset, the precise reasons why data are missing will
be specific to the context in which the data sit. However, using a
categorisation first proposed in Rubin (1976), it is helpful to classify the
missing data as one of these three types:
• missing completely at random (MCAR)
• missing at random (MAR)
• not missing at random (NMAR).
The idea underpinning these different types of missing data is the extent to
which ‘everything we do know’ tells us something about the values that are
missing.
Before we get to the definitions of these types a little later on in the
subsection, it is useful to think about what is ‘everything we do know’.
Well first, but easily overlooked, we know that for every missing value that
we have, we do not have a value for it! You may be thinking that just
knowing that a value is missing could not possibly tell us anything about
what that missing value should be. You would be right in some
circumstances, but not in all. To see why this is so, it is helpful to consider
a concrete example.
Suppose that in a study about the effectiveness of teaching material, we
are interested in how successfully students have studied a module. Further,
suppose that we take the performance in the module’s exam as our
measure of how successful students have been. Unfortunately, unless a
module only has relatively few students, it is unusual for exam scores to be
recorded for all students. The reasons why an exam score may be missing
for a student are many and varied. As you will see in Example 10, some of
the reasons will not tell us anything about what a score might have been.
However, as you will see in Example 11, other reasons can tell us
something about what a score might have been.
So far, we have just been considering whether the mere fact a value is
missing tells us anything about what value it would have been. However,
unless we only collect data relating to just one variable, ‘everything we do
know’ includes values for other variables too. As Example 12 illustrates,
sometimes the values of these variables tell us something about what the
missing values would have been, and sometimes they don’t.
We now return to the three missing data types from Rubin (1976).
A missing value is missing completely at random (MCAR) if nothing
that we know tells us anything about what the value would have been. So,
for example, a missing exam score would be MCAR if we knew that the
student had gone down with flu the day before the exam and the only
other data we’d collected about the student was their age and which region
they were based in.
Both missing at random (MAR) and not missing at random
(NMAR) assume that the values of other variables tell us something
about what the missing value would be. So, a missing exam score would be
MAR or NMAR if we had also collected data about the first TMA. The
distinction between the two then lies in whether the reason that the exam
score is missing tells us anything extra.
For example, if we had the score for the first TMA for the student who had
gone down with flu, the missing exam score would be MAR. This is
because knowing that the student had gone down with flu doesn’t provide
us with any further information on what the student’s exam score would
have been.
However, the missing exam score for the student who gave up on the
module is likely to be NMAR. This is because the reason for the missing
exam score gives us additional information that can inform us about what
the value of the exam score would have been, over and above the score for
the first TMA.
Unfortunately, the assumptions for each of the missing data types cannot
be checked since such checking would rely on knowing the actual values of
the missing data – which, of course, are not known because they’re
missing! So we can only use the context that the data arose from to decide
which of these missing data types is the most appropriate. You will
practise doing this for a hypothetical drug trial in the next activity.
Which type of missing data we decide we have has an impact on how the
analysis can be done while avoiding a biased result. We will consider this
very briefly next.
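The following small simulation, entirely our own illustration, hints at why the missing data type matters: when exam scores are NMAR because low scorers are more likely to be missing, the mean of the observed scores is biased upwards.

set.seed(2)
scores <- rnorm(10000, mean = 60, sd = 10)  # 'true' exam scores
# Low scorers are far more likely to have a missing score
# (for example, because they gave up on the module).
missing <- runif(10000) < ifelse(scores < 50, 0.8, 0.1)
observed <- scores[!missing]
mean(scores)    # close to 60
mean(observed)  # noticeably higher: a complete case analysis is biased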
Box 13 Reproducibility
In research, reproducibility means that sufficient detail is given so
that others are able to recreate the results of the original researchers.
For data analysis, this means that sufficient detail needs to be given
to enable others working from the same original data to follow the
same analysis and obtain the same results.
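In R, two small habits go a long way towards this; the sketch below is our own suggestion rather than module code.

set.seed(348)  # fix the random number stream, so any random steps repeat exactly
sessionInfo()  # record the version of R, and of any packages, used for the analysis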
6 Documenting the analysis
This audit trail, and the analysis more generally, can be documented in
different ways: either as personal notes or in a form intended to be read by
others. Although personal notes do not need to be as tidy as those for use
by others, they still need to be good enough to be informative. For
instance, the handwritten notes in Figure 15 may make sense to the
statistician who wrote them, but they are fairly useless for anyone else
(and probably just as useless for the statistician to look back on too!).
The strategies used in Notebook activities 5.12 and 5.13 can be easily
extended to other aspects of the analysis beyond preparing the data, and
they can be used to document the entire analysis.
Keeping all of the code, right from reading in the data through to fitting
and checking the last model, also helps enormously when data are updated,
for example, if more data are gathered. Then, it is simply a case of reading
in the updated dataset and re-running all of the rest of the code.
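A minimal sketch of such a start-to-finish script is given below; the file name and the renaming step are hypothetical, but medals, lagMedals and host are variables used earlier in this unit.

olympic <- read.csv("olympic.csv")                      # read in the (possibly updated) data
names(olympic)[names(olympic) == "Team"] <- "country"   # prepare the data (hypothetical rename)
model <- lm(medals ~ lagMedals + host, data = olympic)  # fit the model ...
summary(model)                                          # ... and examine it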
In the article ‘Explaining the gender wage gap in STEM: does field sex
composition matter?’ (Michelmore and Sassler, 2016), the authors looked
for evidence of a pay gap between men and women working in the USA.
This article is available via the module website.
Read the ‘Data and measurement’ section of the article (pp. 197–200) and
then answer the following questions. (When reading the article, remember
that ‘dependent variable’ is simply another name for the response variable,
and ‘independent variable’ is another name for an explanatory variable.)
(a) From what source did the authors obtain the data?
(b) Were all the people in the original data source included in the
analysis? If not, who were excluded?
(c) What variable did the authors use as the response variable? Did they
transform this variable? If so, in what way?
(d) Which explanatory variable were the authors most interested in?
(e) From the information given, do you think this analysis is
reproducible?
The article you read while working on Activity 35 was not the only place
that the authors’ work was described. In the following activity, you will
read another account. This time, the description is aimed at a general,
non-statistical audience.
As you have seen in Activities 35 and 36, the audience a report is written
for makes a difference to what gets described in an article, as well as the
terminology used.
7 Replication
In Subsections 6.1 and 6.3 you learnt about the importance of
documenting an analysis so that, given access to the data, it is possible to
repeat what was done exactly. In science there is also the related notion of
replication, as described in Box 14.
Notice from Box 14 that replication is not about being able to use the
same data to reproduce the results. Instead, it is about being able to
generate fresh data that produce similar general results. At the heart of
this is the notion that when results are truly telling us something about
the real world, they are not one-offs: other researchers should be able to
repeat – that is, replicate – what was done.
In the next example we describe a couple of studies that aim to replicate a
study that has already been considered in this unit.
• For a single location, the significance level of the test is set to be 0.01.
This means that, for each of the 10 million tests, the probability of
rejecting H0 when it is in fact true (a ‘false positive’), is 0.01.
• The size of the study is such that, for a single location, the power of the
test is 0.9. Recall from your previous study of statistics that this means
that, for each of the 10 million tests, the probability of rejecting H0
when it is in fact false is 0.9.
Based on this scenario, answer the following questions.
(a) For how many of the hypothesis tests would we expect to reject H0
when it is in fact false (and hence correctly decide that there is an
association between the location and Type 1 diabetes)?
(b) For how many of the hypothesis tests would we expect to reject H0
when it is in fact true (and hence incorrectly decide that there is an
association between the location and Type 1 diabetes)?
(c) Using your answers to parts (a) and (b), calculate the proportion of
the hypothesis tests that reject H0 which correspond to locations
actually associated with Type 1 diabetes.
As you have seen in Activity 37, when carrying out tests of many
hypotheses, it is quite easy to get into a situation where the null
hypotheses that really should be rejected are swamped by all the null
hypotheses that were spuriously rejected. This issue with multiple testing
is well-known in genetics research and means that, in practice, geneticists
would not set up a study exactly like this.
You may think that multiple testing only applies in studies where it is
obvious that very many hypotheses are being tested, such as in gene
association studies. Unfortunately, its effect is more widespread. It is easy
to get into the situation of carrying out lots of tests – enough to be a
problem. In particular, model building of the sort described in Section 3
also falls within the realm of doing lots of hypothesis testing, as you will
see in Activity 38.
In Activities 37 and 38, we have seen how multiple testing can arise within
a single study or data analysis. The fact that studies with statistically
significant results tend to gain far more attention than other studies – for
example, being published in peer-reviewed journals, or even just to get
written up – means that the problem with false positive results plays out
across the totality of studies being done. This has led some to argue that
many of the statistically significant results that are published are likely to
be spurious results, particularly for studies that are small (Ioannidis, 2005).
Summary
In this unit, we have been applying the techniques learnt in Units 1 to 4 to
a real data analysis task: modelling national Olympic success and using
that model to predict success at future Olympics. In doing so, you saw
that fitting a linear model or models is not necessarily the most difficult or
time-consuming aspect of the task!
Data need to be read into whatever statistical package is being used. A
format such as comma-separated values (CSV) can be read by many
different statistical packages. However, the resulting data frame may not
be ‘ready to use’ for the statistician’s particular problem and may need to
be adapted. Merging data from different sources also needs to be done
with care. It is important to ensure that information about observations
from different sources gets correctly linked together, and in some cases it
may not be possible to make that linkage with confidence.
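In R, these steps correspond to functions such as read.csv(), rbind() and merge(); the sketch below uses hypothetical file names and a hypothetical key variable, country.

medals2016 <- read.csv("medals2016.csv")
medals2020 <- read.csv("medals2020.csv")
medals <- rbind(medals2016, medals2020)        # add observations (rows)
gdp <- read.csv("gdp.csv")
olympic <- merge(medals, gdp, by = "country")  # add variables (columns), linking
                                               # observations through a shared key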
The issue of missing data is another problem that often occurs in data
analysis. There are three types of missing data: missing completely at
random (MCAR), missing at random (MAR) and not missing at random
(NMAR). Strategies for dealing with missing data include complete case
analysis (keeping only those observations with no missing values), available
case analysis (dropping only those observations with missing values for
variables in the model) and imputation (replacing each missing value with
an estimate of the missing value). These strategies work best when missing
data are MCAR. When missing data are NMAR, all of these strategies will
lead to bias.
However an analysis is done, it is important to document it in sufficient
detail to be able to repeat what was done exactly: the analysis needs to be
reproducible. A statistician may need to return to the analysis long after
they have had time to forget the exact details. Documenting the analysis
is also important to allow others to reproduce the results.
Finally, the volume of significance testing conducted across the world has
led to what is known as the replication crisis. Many hypotheses, including
some which have become high-profile, are likely to be associated with
spuriously significant p-values. This has led to a greater recognition of the
need to replicate results, so that studies are repeated with the aim of
seeing whether a significant result is again obtained.
As a reminder of what has been studied in Unit 5 and how the sections in
the unit link together, the route map is repeated below.
Section 1: Specifying our problem
Section 2: Sourcing and preparing the data
Section 3: Building a statistical model
Section 4: Further modelling issues
Section 5: Missing data
Section 6: Documenting the analysis
Section 7: Replication
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate that sourcing and preparing data ready for modelling can be
a difficult and time-consuming task
• understand the importance of the data preparation stage: ‘garbage in,
garbage out’
• appreciate that it is not always obvious how the response and
explanatory variables should be specified, and that decisions sometimes
need to be made based on pragmatism
• appreciate that some explanatory variables can be treated as either a
factor or a covariate, and the pros and cons of each need to be
considered when choosing between the two
• appreciate that the explanatory variables may not always be of equal
importance to the analyst
• understand that when the primary focus of a study is to assess the
impact on the response of one key explanatory variable (of several), a
good strategy is to find the best model without including that key
explanatory variable, and then fit this best model with the key
explanatory variable also added in
• understand that whether or not an observation is considered to be an
outlier depends not only on the other observations, but also on the
assumed model for the data
• assess the impact of an outlier (or outliers) by fitting the model both
with and without the outlier(s) included, so that the results can be
compared
• use the mean squared error (MSE), the mean absolute percentage error
(MAPE) and prediction intervals to help assess predictions
• appreciate that a more complicated model is not necessarily better than
a simpler one
• understand the uses of training datasets and test datasets
• identify three types of missing data: missing completely at random
(MCAR), missing at random (MAR) and not missing at random
(NMAR)
• use and appreciate the pros and cons of three strategies for dealing with
missing data: complete case analysis, available case analysis and
imputation
• understand what reproducibility means and its importance
• appreciate that how an analysis is documented should depend on the
audience
• understand the notion of replication
• understand what the replication crisis is and its link to multiple testing
and the interpretation of p-values
• appreciate how replication can help to prevent false significant results
drowning out the true significant results
• read a CSV file into an R data frame
• create an R data file
• read an R data file into R
• change variable names in R
• change variable types in R
• combine data frames in R by adding rows of data (that is, by adding
observations) or by adding columns of data (that is, by adding variables
for existing observations)
• use R to calculate the MSE, the MAPE and the percentage of
observations contained in prediction intervals
• use R to assess predictions using a test dataset
• document an analysis in Jupyter.
References
‘2016 Summer Olympics medal table’ (2021) Wikipedia. Available at:
https://en.wikipedia.org/w/index.php?title=2016_Summer_Olympics_medal_table&oldid=1039342470 (Accessed: 24 August 2021).
‘2020 Summer Olympics medal table’ (2021) Wikipedia. Available at:
https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table
(Accessed: 9 August 2021).
BBC News (2016) ‘Rio 2016 Olympics: Russians “have cleanest team” as
271 athletes cleared to compete’, 5 August. Available at:
https://www.bbc.co.uk/sport/olympics/36970627 (Accessed: 21 March
2022).
Bredtmann, J., Crede, C.J. and Otten, S. (2016) ‘Olympic medals: Does
the past predict the future?’, Significance, 13(3), pp. 22–25.
doi:10.1111/j.1740-9713.2016.00915.x
Brownlee, K.A. (1965) Statistical theory and methodology in science and
engineering, 2nd edn. New York: John Wiley and Sons, pp. 454–455.
‘Cost of the Olympic Games’ (2022) Wikipedia. Available at:
https://en.wikipedia.org/wiki/Cost_of_the_Olympic_Games (Accessed:
18 March 2022).
Crum, A.J. and Langer, E.J. (2007) ‘Mind-set matters: exercise and the
placebo effect’, Psychological Science, 18(2), pp. 165–171. Data obtained
from: https://dasl.datadescription.com/datafile/hotel-maids
(Accessed: 5 May 2020).
Crum, A.J., Corbin, W.R., Brownell, K.D. and Salovey, P. (2011) ‘Mind
over milkshakes: mindsets, not just nutrients, determine ghrelin response’,
Health Psychology, 30(4), pp. 424–429.
Curtice, J. et al. (1994) ‘The Opinion Polls and the 1992 General Election
(Full Report)’. Available at: https://amsr.contentdm.oclc.org/digital/
collection/p21050coll1/id/669 (Accessed: 11 July 2022).
Draganich, C. and Erdal, K. (2014) ‘Placebo sleep affects cognitive
functioning’, Journal of Experimental Psychology: Learning, Memory, and
Cognition, 40(3), pp. 857–864.
Dunbar, J. (2016) ‘Predicting the Rio Olympic medal table’, BBC News,
3 August. Available at: https://www.bbc.co.uk/news/magazine-36955132
(Accessed: 2 April 2022).
Gardner, M.J., Snee, M.P., Hall, A.J., Powell, C.A., Downes, S. and
Terrell, J.D. (1990a) ‘Results of case-control study of leukaemia and
lymphoma among young people near Sellafield nuclear plant in West
Cumbria’, British Medical Journal, 300(6722), pp. 423–429.
Gardner, M.J., Hall, A.J., Snee, M.P., Downes, S. Powell, C.A. and Terrell,
J.D. (1990b) ‘Methods and basic data of case-control study of leukaemia
and lymphoma among young people near Sellafield nuclear plant in West
Cumbria’, British Medical Journal, 300(6722), pp. 429–434.
‘Independent Olympic Athletes at the 2016 Summer Olympics’ (2021)
Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Independent_Olympic_Athletes_at_the_2016_Summer_Olympics&oldid=999908107 (Accessed: 20 August 2021).
Ioannidis, J.P.A. (2005) ‘Why most published research findings are false’,
PLoS Medicine, 2(8), e124.
International Olympic Committee (2021) Factsheet: The Games of the
Olympiad. Available at: https://stillmed.olympics.com/media/
Documents/Olympic-Games/Factsheets/The-Games-of-the-Olympiad.pdf
(Accessed: 30 June 2022).
Michelmore, K. and Sassler, S. (2016) ‘Explaining the gender wage gap in
STEM: does field sex composition matter?’, RSF: The Russell Sage
Foundation Journal of the Social Sciences, 2(4), pp. 194–215.
Royal Statistical Society (2022) RSS – Journal Series C. Available at:
https://rss.org.uk/news-publication/publications/journals/series-c
(Accessed: 22 March 2022).
Rubin, D.B. (1976) ‘Inference and missing data’, Biometrika, 63(3),
pp. 581–592.
The World Bank Group (2020a) World Development Indicators – Themes –
People. Available at: http://datatopics.worldbank.org/world-development-
indicators/themes/people.html#population (Accessed: 13 August 2020).
The World Bank Group (2020b) World Development Indicators – Themes
– Economy. Available at: http://datatopics.worldbank.org/world-
development-indicators/themes/economy.html (Accessed: August 2020).
Wakeford, R. and Tawn, E.J. (1994) ‘Childhood leukaemia and Sellafield:
the legal cases’, Journal of Radiological Protection, 14, pp. 293–316.
‘Wikipedia:About’ (2021) Wikipedia. Available at:
https://en.wikipedia.org/w/index.php?title=Wikipedia:About&oldid=1037955346
(Accessed: 20 August 2021).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.2, a gold medal from Tokyo 2020: Taken from:
https://triathlonmagazine.ca/racing/tokyo-2020-olympic-medals-revealed/
Subsection 2.1.2, Russia’s women’s gymnastics team at Rio 2016:
© Agência Brasil Fotografias / https://commons.wikimedia.org/wiki/File:Russia_takes_silver_in_womens_artistic_gymnastics.jpg This file is
licensed under the Creative Commons Attribution Licence
http://creativecommons.org/licenses/by/3.0/
Subsection 2.1.3, Mary Hanna: © Franz Verhaus /
https://www.flickr.com/photos/franz-josef/7776871064/ This file is
licensed under the Creative Commons Attribution Licence
http://creativecommons.org/licenses/by/3.0/
Subsection 2.1.3, Hend Zaza: © Tim Clayton - Corbis / Contributor /
Getty Images
Subsection 2.2.3, somewhere in the Bahamas: © Lawrence Malvin /
www.123rf.com
Subsection 2.2.4, Olympic rings: https://olympics.com/ioc/olympic-rings
Subsection 2.2.4, person falling asleep in front of a computer © fizkes /
www.123rf.com
Subsection 4.1, COVID-19 vaccination: © milkos / www.123rf.com
Section 5, a ‘shy’ voter? © Prostockstudio — Dreamstime.com
Subsection 5.1, room attendants at work: © Olga Yastremska /
www.123rf.com
Subsection 5.2, students taking an exam © Wavebreak Media Ltd /
www.123rf.com
Figure 14: © Hjem / https://www.flickr.com/photos/hjem/1876641223/
This file is licensed under the Creative Commons
Attribution-Noncommercial-ShareAlike Licence
http://creativecommons.org/licenses/by-nc-sa/2.0/
Subsection 6.3, Sellafield nuclear site: © Simon Ledingham /
https://en.wikipedia.org/wiki/Sellafield#/media/File:Aerial_view_Sellafield,_Cumbria_-_geograph.org.uk_-_50827.jpg This file is licensed under the
Creative Commons Attribution-Noncommercial-ShareAlike Licence
http://creativecommons.org/licenses/by-sa/2.0/
Section 7, milkshake: © Brent Hofacker / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
There is no single ‘correct’ answer here. These are some of the possible
ways to measure a nation’s success, but you may have thought of others.
• Count the number of medals (gold, silver or bronze) that a nation wins
overall. The more medals, the more successful the nation is.
• Assign a points score to each type of medal: 3 points for a gold, say,
2 points for a silver and 1 point for a bronze. Then calculate the total
points scored for each nation; the higher the score, the more successful
the nation is.
• For each sport, calculate the percentage of medals each nation wins, and
average these percentages across all of the sports. The higher the
average, the more successful the nation is.
• Calculate the average position of each nation’s athletes across all events.
This time, the lower the average, the more successful the nation is.
Solution to Activity 2
(a) The following explanatory variables are mentioned.
• The nation’s wealth. The richer a nation is, the more success they
are likely to have.
• The population size. The more populous a nation is, the more
success they are likely to have (although the video points out that
this is not necessarily the case!).
• Previous success at the Olympics. Nations that were successful
before are more likely to be successful again.
• Whether the nation is hosting the Olympics. The host nation does
better than might otherwise be expected.
• Whether a nation is going to host the next Olympics. The nation
that is going to host next is also likely to do better than might
otherwise be expected.
(b) Of the explanatory variables listed in part (a), arguably only
population size and hosting the Olympics are likely to affect a nation’s
Olympic success directly. A larger population means that there is a
greater pool of people from which to pick athletes, and so, all other
things being equal, the better someone has to be in order to be
amongst the very best in that nation. Also, it is suggested that being
a host nation – and therefore competing on home ground – inspires
those athletes to do better than they would if they were competing
elsewhere, because they don’t want to fail in front of home crowds.
The other explanatory variables listed in part (a) affect a nation’s
Olympic success more indirectly. Richer nations have more money to
invest in sport, and it is this extra money which drives the Olympic success.
Solution to Activity 3
(a) That close to the Paris 2024 start, the teams are likely to have been
selected and all the previous Olympic Games will have happened. So,
yes, the values of both explanatory variables are likely to be known.
(b) Previous success at the Olympics will be known one year before
Paris 2024 starts. However, it is unlikely the selection of teams will be
confirmed that far in advance of Paris 2024, because the time period is
long enough for the form of individual athletes to change significantly.
(c) In this situation, only previous success at the Olympics will be a useful
potential explanatory variable. Just after Tokyo 2020 ended, the size
of teams for Paris 2024 was not known. So, if on 9 August 2021 we
did try to use a model that included the size of the team as an
explanatory variable to predict success at Paris 2024, we would first
need to predict what the sizes of all the teams for Paris 2024 will be!
Solution to Activity 4
(a) Wikipedia is a website based on the principle that what is contributed
is more important than who contributes it (‘Wikipedia:About’, 2021).
So there are few restrictions on who can add or amend content, or on
what changes they make. As such, there is no guarantee that what is
there is correct. However, this community aspect is also a strength as
it makes it easy for others to go in and correct material – particularly
for entries which are not contentious. So, we (the module team) think
the information on this Wikipedia page can be trusted to be correct.
(b) No, it is not. The article states that for boxing, judo, taekwondo and
wrestling, each event awarded two bronze medals. In addition, for a
few events, a tied result for a medal occurred. The website states that
a tie for a gold meant the tied athletes were each awarded the gold
medal but no silver medal was awarded. Similarly, a tie for silver
meant the tied athletes were each awarded the silver medal but no
bronze medal was awarded.
(c) No, they did not. At these Olympics there was an ‘Independent
Olympic Athletes’ team who won a gold medal and a bronze medal.
This team was composed of athletes from Kuwait. They were unable
to represent Kuwait because of the suspension of the Kuwait Olympic
Committee from the International Olympic Committee. (‘Independent
Olympic Athletes at the 2016 Summer Olympics’, 2021)
There was also a Refugee Olympic Team. However, this team,
consisting of 10 athlete refugees from a variety of nations, did not win
any medals and hence does not appear in the medals table.
(d) No, it has not. The winners of four medals – one silver and three
bronzes – are listed as having been officially changed.
Solution to Activity 5
At Rio 2016, Russia’s team was depleted. At the time the article was
written, only 70% of Russia’s original team were judged eligible to
compete. Russian athletes in athletics and weightlifting faced a complete
ban, while athletes in seven other sports faced a partial ban. It is likely
that some of these banned athletes would have won medals if they had
been allowed to participate. Thus, Russia’s medal-winning potential was
diminished.
It is also worth noting that, by the same token, it means that the number
of medals won by other countries will be increased as a result of Russia’s
diminished participation. (The medals were still won by somebody!)
However, as this increase probably was spread across many other
countries, the impact on other countries is likely to have been much
smaller than that on Russia.
Solution to Activity 6
(a) The population size data from the World Bank appear to be very
reliable. They have collated data from national and international
bodies. They claim that the primary source they have used is the
most reliable dataset.
(b) They state that the estimates are based on census data. That is,
attempts by countries to count all of their population. However, even
in high-income countries they acknowledge that censuses do not
include everybody. Nor do censuses occur every year. So, for years
when a census has not occurred, the population of a country has to be
estimated. Besides, even if censuses were carried out every year,
populations are changing all the time as people die, are born, or
migrate.
478
Solutions to activities
Solution to Activity 7
(a) No, they do not. As you can see in Figure 2, it is possible to include
data that are composed of text, such as the name of a nation.
(b) The first line, which is often known as the ‘header’, gives names for
each of the variables in the file. This is not always done in such files,
but can be useful when it is. It provides an indication of what each
variable is for someone looking at the file, and can save the task of
giving variables sensible names once they have been read into a
statistical package.
(c) Each line represents the data for a single nation. More generally, in
such files each line gives the data for one (and only one) observation –
even if that means the line is long!
(d) The commas separate the information about each variable. Note that
this can be a problem if a variable can take values that contain a
comma.
Solution to Activity 8
(a) The best way is by adding the rows. This is because the data from
Rio 2016 effectively adds more observations about how many medals a
country actually gets at an Olympics. However, when doing this, it is
helpful to add an extra column that indicates which Olympic Games
the data come from.
(b) The medals tables only list those countries that have won at least one
medal. Those countries that took part, but did not win anything, are
not included. If a good model is to be built, then data from these
countries are also required. Just because a country did not win a
medal at Rio 2016 and Tokyo 2020 does not mean it will never win a
medal!
For example, at Tokyo 2020 three countries won their first ever
Olympic medals: Burkina Faso, San Marino and Turkmenistan.
Solution to Activity 9
Adding information about the population size and the GDP corresponds to
adding more variables. So more columns need to be added.
Solution to Activity 10
(a) It is likely that none of these would be automatically regarded as the
same. A person looking at these is likely to say they are the same
because they all contain ‘Japan’. However, the inclusion of ‘*’,
‘(JPN)’ or just an extra space, or spaces, could be enough for a match
not to be found by a computer. In these cases, though, it is possible
to pre-process the country names so that such differences do not cause
a match to fail.
Solution to Activity 11
The answer to this question partly depends on how common your name is,
and on your own preferences.
For instance, if you have a relatively common name, there will likely be
situations where others have the same name as you.
You might prefer to use a different name with friends, say, to that used
with relatives and/or on official forms.
Solution to Activity 12
(a) In this module you have so far learnt about:
• multiple regression (in Unit 2)
• regression with one factor and ANOVA (in Unit 3)
• regression with any number of covariates and factors (in Unit 4).
In addition, you should have met simple linear regression in your
previous study of statistics.
For all of these models, the response variable is assumed to be
continuous. (Note this does not mean that the response variable is
always continuous in a strict sense – just that it is not ‘too discrete’.)
Furthermore, for all of these models it is assumed that the deviations
of observations from their fitted or predicted values can be assumed to
have a normal distribution, with zero mean, and that the variance of
the deviations is the same for all observations.
Solution to Activity 13
(a) No, it does not look symmetric. There are lots of instances where
countries won a relatively low number of medals, and very few
countries that won a large number of medals. However, this by itself
is not incompatible with the random variation being normally
distributed. It could be that the expected number of medals produced
by the model is similarly highly skewed.
(b) The data are discrete. The number of medals won by a nation is
necessarily an integer. However, it is reasonable to treat the data as
continuous since there are a hundred or more different values that the
number of medals could take.
(c) The lowest number of medals that a nation could win is zero. It
appears that a data value corresponds to this lower bound a lot of the
time. This could be a problem because the assumption that the
random variation is normally distributed means that, in theory at
least, there is no lower bound to the number of medals that could be
won. With much of the data at this bound, the model is liable to
imply that values less than the lower bound often happen.
Solution to Activity 14
The variable host takes two (coded) values:
• 0 if the nation is not the current host
• 1 if the nation is the current host.
So, it does not matter whether host is treated as a factor or as a covariate.
If host is declared to be a factor, all that would happen is that an
indicator variable would be created which would have exactly the same
(coded) values as host currently does.
Solution to Activity 15
Since year can take more than two different values, it matters whether
year is treated as a factor or as a covariate.
By treating year as a factor, all that is implied in the model is that the
number of medals a nation wins (partly) depends on which Olympics we
are talking about. It does not impose any ordering on this effect of year.
(See, for example, Unit 3, Subsection 1.3.)
Solution to Activity 16
(a) Hosting the Olympics is associated with an increase of 12.6 medals
after adjusting for the number of medals the country won at the
previous Olympics.
(b) The model included the explanatory variables host along with
lagMedals, population and gdp. However, it is not possible to tell
exactly how they were included in the model. For example, whether it
was just population that had been included in the model, or just
log(population), or both population and log(population), or
indeed another transformation of population.
Solution to Activity 17
You have probably come across this term in the context of an observation
being noticeably different to the other observations. Often this is because
the value associated with this observation is particularly big or particularly
small in relation to the others.
When more than one variable is observed, it could be that the values for
each of the variables are not individually noticeably different to the rest of
the data. However, the combination of the values might be enough to
make an observation look different to the rest.
Solution to Activity 18
(a) There appear to be up to nine outliers in these data: these are the
observations indicated by individual points. All of these observations
correspond to nations that have high numbers of medals won. One of
these points, the one corresponding to over 100 medals (and checking
back in the data, this turns out to correspond to the USA), seems to
be particularly outlying.
(b) In this boxplot, none of the data points appears to be outlying!
Solution to Activity 19
(a) There appear to be two outliers, both corresponding to low
percentages of ammonia removed. Since the dataset only contains
21 observations, this corresponds to almost 10% of the observations.
(b) In the scatterplot, there appear to be two points that might be
regarded as outliers. The point corresponding to an air flow of just
over 60 and just over 97.0% of ammonia absorbed looks a bit lower
than other readings with similar air flow. More obviously, the point
corresponding to an air flow of 70 appears to have an unusually high
percentage of ammonia absorbed. The residual plot makes it clearer
that one point seems unusually high: the point with a fitted value of
between 97.0 and 97.5. This is the point corresponding to an air flow
of 70.
(c) From this residual plot it could be argued that there is just one
outlier, the point with the highest residual. The residual with the
lowest value is not very much less than some of the other residuals.
Solution to Activity 20
(a) Using a significance level of 0.05, the individual p-values suggest that
the number of medals depends on a country’s population but not on
its GDP per capita. However, if we also use a significance level of 0.05
for the p-value associated with the F -statistic, then that seems to
indicate that the model overall isn’t significant.
(b) There appears to be at least one outlier: the point with the highest
residual. This is because its residual is much bigger than the residuals
for the other points.
(c) The general conclusions have not changed. The revised individual
p-values suggest that the number of medals depends on a country’s
population but not its GDP per capita, and the revised p-value
associated with the F -statistic still indicates that the model overall
isn’t significant. It therefore doesn’t matter whether or not the outlier
is included in the analysis.
Solution to Activity 21
(a) This time, there again appears to be a significant effect of
population. However, the p-value for gdp is just above the p < 0.05
threshold that we’re using here. This suggests that there is evidence
that a model should include population, but not gdp. There is also
strong evidence from the p-value associated with the F -statistic that
the model is significant.
It is worth noting here that, although we’ve specifically chosen to use
p < 0.05 as our threshold for deciding on significance (to help make a
teaching point about outliers!), this activity – with an individual
p-value just above 0.05 – illustrates just how arbitrary picking a
significance level can be!
(b) Again there appears to be one clear outlier. This time it corresponds
to the point with a residual of about 40.
(You may have thought that the point relating to the large fitted
value greater than 50 is an outlier because this point lies far away
from the other points in the plot. However, this point is not an
outlier, because the value of the residual for this point is very small,
which means that this particular point fitted the model well.)
(c) This time, whether the general conclusions have changed is debatable.
In particular, the p-value for gdp is now below the cut-off of 0.05
(slightly). So, without the outlier, it looks like the number of medals
won may depend on both GDP per capita and population size.
However, we still have the same conclusion that the overall model is
significant. So whether or not the outlier is included (and, of course,
how we might interpret the results) may well affect our conclusions.
Again it is worth noting how the arbitrariness of picking a significance
level can affect the decisions that we may make!
Solution to Activity 22
(a) For this third subset of data, the individual p-values suggest there is a
significant effect for population but not for gdp. The p-value
associated with the F -statistic also suggests that overall the model is
significant.
(b) This time there appear to be two outliers, both having fitted values
over 25.
(c) This time, it matters a lot whether the outliers are included or not!
This is because without them in the dataset, neither gdp nor
population appear to be needed in the model. What’s more, while
the overall model was significant when the outliers were included
(with p = 0.005), it was no longer so when the outliers were removed
(where p increased up to 0.411!).
Solution to Activity 23
(a) A predicted value can be calculated by substituting the corresponding
values of the explanatory variables into the regression equation. See
Section 6 in Unit 1 and Section 2 in Unit 2.
(b) For this model, the regression equation is as follows:
medals = 0.258 + 0.961 lagMedals + 12.639 host.
Now, when lagMedals = 17 and host = 1 are substituted into this
equation, we get
predicted medals = 0.258 + 0.961 × 17 + 12.639 = 29.234 ≃ 29.2.
In other words, using this model such a nation is predicted to win
29 medals.
(c) The actual number of medals that Brazil won at Rio 2016 is below the
lower limit of the 95% prediction interval. So no, the observed value
of the response did not lie in the 95% prediction interval.
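(For interest, the same prediction can be obtained in R using predict(); this sketch is ours and assumes the fitted model object is called model, with host coded 0/1.)

predict(model, newdata = data.frame(lagMedals = 17, host = 1),
        interval = "prediction", level = 0.95)
# The 'fit' component reproduces the predicted value of about 29.2 medals;
# 'lwr' and 'upr' give the 95% prediction interval.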
Solution to Activity 24
(a) The regression model with the most terms provides the closest fit to
the data. The curve appears to go through more than half of the
points – more than for any of the other curves. The curve is also close
to the other points.
(b) The MSE associated with each of the models goes down as the
number of terms increases. This is to be expected. Having extra
terms in the regression model gives the curve more flexibility to get
closer to the data points.
(c) Based on Figure 12, there isn’t a definitive answer to this. However,
we (the module team) would say that either the regression with 2
terms or the regression with 4 terms best represents the relationship.
In both cases, the change in y for a small change in x remains similar
across the range of values depicted.
When 8 and 16 terms are used for the regression model, the
relationship between x and y between observed data points varies
much more. Most noticeably, with 16 terms, the regression model
implies some sharp changes in the value of y associated with very
small changes in x, for example when x < 2.5. Such dramatic changes
in the relationship between x and y are possible, but would generally
be accepted as unlikely without further evidence for it.
Solution to Activity 25
In all four cases, the regression model follows the data reasonably closely.
In the case of the regression models with 8 and 16 terms, the data do not
give compelling evidence for the complexity both of these models provide.
Out of the remaining two regression models, there is some suggestion that
the relationship between x and y is non-linear, which is reflected in the
regression model with four terms but not the one with two terms. So, for
the module team, the best model (just) appears to be one with four terms.
Solution to Activity 26
(a) The Democratic People’s Republic of Korea, with its strongly
authoritarian regime and closed society (at least for the years 1996 to
2016), is not likely to follow the pattern set by most other countries.
So, by not having the data from this nation in the dataset, the range
of variation is likely to be less than it should be.
(b) If anything, not including the data for the Czech Republic might
enhance the representativeness of the data (at least, if it is intended
that the model be applied when nations are stable). This is because,
although the dissolution of Czechoslovakia into the Czech Republic
and Slovakia was a relatively peaceful transition, it means that any
short-term effects of the transition will not impact on the modelling.
Solution to Activity 27
From the table it is clear that not all of the measurements were recorded
for all of the room attendants. For example, the variable percent2 was
only recorded for two of these six room attendants.
Solution to Activity 28
(a) If a value was recorded for every variable for every room attendant,
then the total number of values in the dataset would be 75 × 9 = 675.
(b) Overall, 0 + 1 + · · · + 9 + 9 = 48 values are missing in the dataset.
Therefore, the number of values from the possible 675 which were
recorded is 627, which corresponds to 92.9% of the total number of
values that could have been recorded.
(c) Only those room attendants who did not have any missing values
would be included in the complete case analysis, which means that
47 room attendants would be included in the analysis. So, since there
were 75 room attendants altogether, 62.7% of the individual data
values would be used in a complete case analysis.
(d) A complete case analysis would use only 62.7% of the values that would
ideally have been there, whereas 92.9% of the values were actually
collected. So, a complete case analysis fails to make use of about 30%
of the values.
Solution to Activity 29
(a) Data from the room attendants with the following identification
numbers would be included: 35, 36, 37, 39 and 40. The data from
room attendant 38 would be dropped because the value for percent is
missing.
(b) Data from all six of these room attendants would be included since all
values for informed and percent2 were recorded.
(c) Data from the same group of room attendants as in part (a) would be
included: room attendants 35, 36, 37, 39 and 40. Again the lack of a
percent value for room attendant 38 means that data from her would
be dropped.
(d) Data from room attendants 35, 37, 39 and 40 would be included when
this model is fitted. This is because having a missing value for any of
the variables would be enough for the data from a room attendant to
be dropped.
Solution to Activity 30
(a) For percent2 the sample mean based on the first six room attendants
is
(32.8 + 34.8 + 34.8 + 42.4 + 34.8 + 34.8)/6 = 214.4/6 ≃ 35.7.
(b) The sample mean based on all the room attendants will be 34.8.
This is because this mean can be thought of being a weighted average
of the mean for the values that were not missing and the mean of the
imputed values. By assuming that the missing values are in fact equal
to the sample mean for all the observed values (34.8 in this case), the
mean of the imputed values will be the same as the sample mean for
all the observed values (because all of the imputed values have the
same value 34.8). So, since the weighted average of 34.8 and 34.8 is
equal to 34.8, the sample mean based on all values must be 34.8.
(c) The standard deviation based on the imputed version of the dataset
will be less than 6.13. The imputation is making available some values
that are identical to the sample mean, so less variable than the
observed values. Thus, including the imputed values will reduce the
variability and hence the standard deviation. In fact, it turns out to
be 5.28.
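(A quick numerical check of parts (b) and (c) can be made in R; the sketch below uses a few illustrative values with missing entries, not the full dataset.)

x <- c(32.8, 34.8, 34.8, NA, 42.4, NA, 34.8, 34.8)  # illustrative values only
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)
mean(x, na.rm = TRUE); mean(x_imputed)  # the means agree
sd(x, na.rm = TRUE); sd(x_imputed)      # the standard deviation shrinks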
Solution to Activity 31
(a) MCAR. There is no reason that the participant’s non-attendance, and
hence lack of a blood pressure reading, is linked to what the blood
pressure reading actually is, or measurements taken at previous clinic
visits. Unless forgetfulness is a side effect of the drug in the trial!
(b) MAR. The non-attendance is linked to something that is already
known (that is, which drug the participant received as part of the
trial). (If the illness were linked to blood pressure, then the missing
data type would be NMAR.)
(c) NMAR. The reason why the participant did not attend the clinic is
linked to their blood pressure.
(d) MCAR. There is no reason why the participant getting a new job
should be linked to their blood pressure or the drugs they were given.
(e) NMAR. In this case, the failure of the participant to attend the
follow-up clinic is clearly linked to their blood pressure.
Solution to Activity 32
(a) The issue that we (the module team) chose was volume 64, issue 5
(November 2015).
(b) There were six articles in our chosen issue. (You will probably have
chosen a different issue and hence get different results.)
(c) In our chosen issue, the articles, along with the dates on which they
were first received and revised, are as follows.
(d) The length of time (in months) between the two dates for the six
articles is as follows:
14 13 16 15 10 15.
The authors were therefore likely to be returning to their analysis
about a year after they first analysed the data for submission to the
journal. That is lots of time for them to forget the details if they have
not been recorded somewhere!
You may well have found similar results for other issues of the journal,
at least for issues published when this unit was written.
Solution to Activity 33
One issue was whether all the cases were living in West Cumbria at the
time of diagnosis. Although this might seem an easy thing to define, here
the difficulty arose in deciding the place of residence of somebody who was
living away from home while studying at university.
The other issue was whether the individuals in the study needed to have
been born in West Cumbria or whether it was enough that they had lived
there.
Solution to Activity 34
(a) The data came from the Scientists and Engineers Statistical Data
System (SESTAT). In particular, they used six waves of SESTAT
from 1995 to 2008. SESTAT combines data from three surveys: the
National Survey of College Graduates Science and Engineering Panel,
the National Survey of Recent College Graduates, and the Survey of
Doctoral Recipients.
(b) No, not all of them were included. They had to have received their
bachelor’s degree between 1970 and 2004, have majored in STEM and
be working in STEM. They also needed to be working at least
35 hours per week.
The analysis was run separately for what the authors regarded as the
four main STEM categories, and so only data for a single category
was considered at once.
(c) The authors used the hourly wage of the individuals as the response.
They transformed this variable by calculating the log of it. The
authors also noted that they had to calculate the hourly wage from
the annual salary by dividing by the weeks worked per year and hours
worked per week.
(d) The authors state that the gender of the respondent was the
explanatory variable they were most interested in. (In fact, this is
made clear further on in the paper as the gender of the respondent
was the only variable included in Model 1.)
(e) Yes, it does appear that the analysis will be reproducible.
Solution to Activity 35
(a) The response variable is the total number of medals won. This is the
same response variable that we have been using.
(b) The authors considered a total of eight explanatory variables: Lag
Medals, ln GDP, ln Pop, Host, Next Host, Planned, Muslim and Year.
Of these, Host, Next Host, Planned and Muslim seem to have been
treated as categorical (because the article says that these are
‘indicator variables’), whereas Lag Medals, ln GDP, ln Pop and Year
appear to have been treated as continuous variables.
Note that each of the categorical variables Host, Next Host, Planned
and Muslim can each only take one of two values. So, as you found in
Activity 14 (Subsection 3.3.1) for host in olympic, it does not matter
whether they are treated as being continuous or categorical.
(c) The difference between the two models lies with the explanatory
variables included. In the ‘naive’ model, just Lag Medals and Year are
included. In the ‘sophisticated’ model, all eight explanatory variables
are included.
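Although the authors' own code is not given, a model of the
'sophisticated' form could be specified in R along the following lines;
the data frame name medals and the variable names are assumptions made
purely for this sketch, not taken from the article.

# Sketch only: fit the 'sophisticated' model by least squares
fit <- lm(totalMedals ~ lagMedals + lnGDP + lnPop + host
                      + nextHost + planned + muslim + year,
          data = medals)
summary(fit)
# host, nextHost, planned and muslim are 0/1 indicators, so
# wrapping them in factor() would give an equivalent fit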
Solution to Activity 36
(a) The descriptions of the variables are similar in the two articles. In
both cases, all of the variables are listed along with explanations of
why it is reasonable to include them. Notice that in both cases, the
authors make it clear that the explanatory variables are not intended
to represent direct causes, but rather to serve as indirect measures of
other effects.
(b) The article written for the BBC News website says very little about
the fitted models. All the article really describes is which variables
were included in the model. It does not say anything about the type
of model. In the other article, there is a box giving the equation of
the models fitted, along with how they were fitted (‘using ordinary
least squares’).
(c) In the BBC News website article, no statistical technical terms were
used. The article therefore does not require the reader to have any
statistical training. Instead, the article sticks with language which a
reader proficient in English should be able to understand.
Solution to Activity 37
(a) H0 is false for the 3000 locations associated with Type 1 diabetes, and
the probability of rejecting H0 when it is indeed false is the power of
the test, 0.9. So, the number of hypothesis tests for which we would
expect to reject H0 when it is false is
3000 × 0.9 = 2700.
(b) H0 is true for the 9 997 000 locations which are not associated with
Type 1 diabetes, and the probability of rejecting H0 when it is true is
the significance level, 0.01. So, the number of hypothesis tests for
which we would expect to reject H0 when in fact it is true is
9 997 000 × 0.01 = 99 970.
(c) From part (a), we expect to reject H0 in 2700 hypothesis tests when
H0 is false, and from part (b), we expect to reject H0 in 99 970
hypothesis tests when H0 is true. So, in total, we expect to reject H0
2700 + 99 970 = 102 670 times.
Therefore, the proportion of these that are actually associated with
Type 1 diabetes is
2700/102 670 ≃ 0.0263.
In other words, fewer than 3% of the hypothesis tests expected to
have significant results in this study (that is, the ‘positives’) are likely
to correspond to locations that are actually associated with Type 1
diabetes!
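The expected counts in this solution are easily reproduced in R; the
following sketch simply repeats the arithmetic above.

n.assoc <- 3000       # locations truly associated with Type 1 diabetes
n.null <- 9997000     # locations with no association
power <- 0.9          # P(reject H0 when H0 is false)
alpha <- 0.01         # significance level, P(reject H0 when H0 is true)

true.pos <- n.assoc * power    # expected true positives: 2700
false.pos <- n.null * alpha    # expected false positives: 99970
true.pos / (true.pos + false.pos)   # proportion: about 0.0263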
Solution to Activity 38
(a) As you saw in Activity 26 (Subsection 5.1) of Unit 2, in order to help
decide whether an explanatory variable should be in a model, we can
test whether the associated regression coefficient, βj say, is zero. We
therefore test the hypotheses
H0 : βj = 0,   H1 : βj ≠ 0.
The test is completed by comparing a test statistic of the form
β̂j/(standard error of β̂j) against a t-distribution.
(b) Such a test is likely to be done many times. When there are q
variables in a model, the p-values from q such tests are presented.
Furthermore, it is unlikely that just one such model would be fitted to
the data. So, with several candidate models being fitted, the situation
where many tests are performed quickly arises.
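In R, the t-tests described in this solution appear in the coefficients
table produced by summary() for a fitted lm object. The following sketch
uses R's built-in mtcars dataset purely for illustration; it is not one
of the module's datasets.

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$coefficients
# Each row gives the estimate, its standard error, the t-value
# (estimate divided by standard error) and the associated p-value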
Solution to Activity 39
(a) The probability of rejecting H0 when it is indeed false is still 0.9.
So, the expected number of locations associated with Type 1 diabetes
where the null hypothesis would be rejected a second time is
2700 × 0.9 = 2430.
(b) The probability of rejecting H0 when it is true is still 0.01.
So, the expected number of locations not associated with Type 1
diabetes where the null hypothesis would be rejected a second time is
99 970 × 0.01 = 999.7.
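Continuing the sketch given after Activity 37, the effect of a second,
independent round of testing can be seen with two more lines of
arithmetic in R: the proportion of 'positives' that are genuine rises
from under 3% to about 71%.

true.pos.2 <- 2700 * 0.9       # expected true positives: 2430
false.pos.2 <- 99970 * 0.01    # expected false positives: 999.7
true.pos.2 / (true.pos.2 + false.pos.2)   # about 0.71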
Index
: (model notation) 324, 350, 368, 370
∗ (model notation) 325, 350, 369

adjusted for 427
adjusted R² statistic 162
AIC 165
Akaike information criterion 165
analysis of variance 239
analysis of variance table 251
ANOVA 239
    extended table 265
    table 251
    test 248
assumption (model) 47, 79, 110, 236, 314, 331
athlete (definition) 400
available case analysis 449

backward stepwise regression 171
baseline mean 215
brexit 101

carPrices 132
comma-delimited file 410
complete case analysis 447
confidence level 60
constant variance assumption 47
contrast 256, 257
    maximum number 272
    multiple 271
    testing 264
Cook’s distance 123
    formula 123
    plot 124
correlation matrix 151
covariate 203
CSV file 410

data
    experimental 16
    missing 23
    natural science 17
    observational 16
    primary 14
    secondary 14
    social science 17
data frame 12
data sources 17
dataset
    Brexit 101
    car prices 132
    employment rates 334
    Facebook 144
    FIFA 19 28
    films 137
    lagoon 240
    manna ash trees 22
    Olympics 416
    Olympics 2020 442
    OU students 359
    pea growth 253
    Peru 177
    placebo effect 446
    rats and protein 346
    roller coasters 105
    test 440
    training 440
    wages 206
diagnostic checks 109
diagnostics 109
distribution
    t 43
    normal 34
    standard normal 34
dummy variable 218

effect
    parameter 215
    term 215
employmentRate 334
error term 33
ESS 158, 161, 244
experimental data 16
explained sum of squares 158, 161, 244
explained variation 157, 244
explanatory variable 5
extended ANOVA table 265
extrapolation 59

F-distribution 94
F-ratio 248
F-statistic 94, 233
F-test 311
    for an interaction 328, 355
Soundex 415
standard error 43
standard normal distribution 34
standardised residual 54
statistical modelling process 3
stepwise regression 154
    backward 171
    forward 167
    strategy 173
sum of squares 244, 251
    partition 160
    partitioning 250, 261

t-distribution 43
t-test in multiple regression 96, 99
t-value 44
test dataset 440
theoretical quantile 54
total sum of squares 157, 161, 244
total variation 157, 244
training dataset 440
transformation 130
Treezilla 21
TSS 157, 161, 244
two factor model 338, 353
two-sided test 45

wages 206