
M348

Applied statistical modelling

Book 1
Linear models
This publication forms part of the Open University module M348 Applied statistical modelling. Details of this
and other Open University modules can be obtained from Student Recruitment, The Open University, PO Box
197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)300 303 5303; email [email protected]).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn more about
the wide range of modules and packs offered at all levels by The Open University.

The Open University, Walton Hall, Milton Keynes, MK7 6AA.


First published 2022.
Copyright © 2022 The Open University
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, transmitted or
utilised in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without
written permission from the publisher or a licence from The Copyright Licensing Agency Ltd. Details of such
licences (for reprographic reproduction) may be obtained from The Copyright Licensing Agency Ltd, 5th Floor,
Shackleton House, 4 Battle Bridge Lane, London, SE1 2HX (website www.cla.co.uk).
Open University materials may also be made available in electronic formats for use by students of the
University. All rights, including copyright and related rights and database rights, in electronic materials and
their contents are owned by or licensed to The Open University, or otherwise used by The Open University as
permitted by applicable law.
In using electronic materials and their contents you agree that your use will be solely for the purposes of
following an Open University course of study or otherwise as licensed by The Open University or its assigns.
Except as permitted above you undertake not to copy, store in any medium (including electronic storage or use
in a website), distribute, transmit or retransmit, broadcast, modify or show in public such electronic materials in
whole or in part without the prior written consent of The Open University or in accordance with the Copyright,
Designs and Patents Act 1988.
Edited, designed and typeset by The Open University, using LATEX.
Printed in the United Kingdom by Bell & Bain Ltd, Glasgow.

ISBN 978 1 4730 3552 2


1.1
Contents

Unit 1 Introduction to statistical modelling and R

Setting the scene
  The statistical modelling process
  Working with a computer
  Terminology for modelling
  Overview of the units
Introduction to Unit 1
1 Computing preliminaries
  1.1 Getting started with Jupyter
  1.2 Introducing R
2 Exploratory data analysis
  2.1 Types of data
  2.2 Data quality and reliability
    2.2.1 Primary and secondary data
    2.2.2 Experimental and observational data
    2.2.3 Combining data sources
  2.3 Summaries of data
3 Using R for exploratory data analysis
4 Simple linear regression
  4.1 Review of the basic idea
  4.2 Estimating the model parameters
    4.2.1 Estimating α and β
    4.2.2 Estimating the variance σ²
  4.3 Testing whether a relationship exists
5 Checking the simple linear regression model assumptions
  5.1 Residual plots
    5.1.1 Checking for linearity, zero mean and constant variance
    5.1.2 Checking independence
  5.2 Normal probability plots
    5.2.1 Checking normality
6 Prediction in simple linear regression
  6.1 Predicting an individual response
  6.2 Prediction intervals
7 Using R for simple linear regression
8 Looking ahead . . .
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Unit 2 Multiple linear regression

Introduction
1 The multiple linear regression model
  1.1 Introducing the model
  1.2 Interpreting the coefficients
  1.3 Testing the regression coefficients
    1.3.1 Testing all regression coefficients simultaneously
    1.3.2 Testing regression coefficients individually
    1.3.3 Using both types of testing
  1.4 Using R to fit multiple regression models
2 Prediction in multiple regression
  2.1 Point prediction of the response variable
  2.2 Prediction intervals in multiple regression
  2.3 Using R to obtain predictions using multiple regression
3 Diagnostics
  3.1 Checking the model assumptions
  3.2 Leverage
  3.3 Cook’s distance
  3.4 Using R to perform diagnostic checks
4 Transformations in multiple regression
  4.1 The use of transformations in multiple regression
  4.2 Finding suitable transformations
    4.2.1 An example of applying transformations
    4.2.2 Another example of applying transformations
  4.3 Using R to perform transformations
5 Choosing explanatory variables
  5.1 Scatterplot matrix and correlation matrix
  5.2 Measuring how well a model fits
    5.2.1 Percentage of variance accounted for
    5.2.2 The Akaike information criterion (AIC)
  5.3 Stepwise regression for choosing explanatory variables
    5.3.1 Forward stepwise regression
    5.3.2 Backward stepwise regression
  5.4 Using R to choose explanatory variables
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Unit 3 Regression with a categorical explanatory variable

Introduction
1 Regression with a factor: the basic idea
  1.1 Introducing some data with covariates and factors
  1.2 Visualising the relationship between the response and a factor
  1.3 Adapting ideas from simple linear regression
  1.4 Adapting the model based on means
2 Developing the model further
  2.1 Introducing indicator variables into the model
  2.2 Examining the model with indicator variables
3 Using the proposed model
  3.1 The fitted model
  3.2 Testing whether there is a relationship
  3.3 Checking the model assumptions
4 Using R to fit a regression with a factor
5 Analysis of variance (ANOVA)
  5.1 A different way of thinking about the model
  5.2 ANOVA: the basic idea
  5.3 The ANOVA test
  5.4 The ANOVA table
6 Using R to produce ANOVA tables
7 Analysing the effects of the factor levels further
  7.1 Introducing contrasts
  7.2 Testing a contrast
  7.3 Specifying further contrasts
8 Using R to produce extended ANOVA tables
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Unit 4 Multiple regression with both covariates and factors

Introduction
1 Regression with one covariate and one factor
  1.1 The basic idea
  1.2 Building a model
2 Modelling using parallel slopes
  2.1 The parallel slopes model
  2.2 Testing the model parameters
  2.3 Checking the assumptions of the parallel slopes model
  2.4 Using R to fit parallel slopes models
3 Modelling using non-parallel slopes
  3.1 The non-parallel slopes model
  3.2 Testing for an interaction
  3.3 Checking the assumptions of the non-parallel slopes model
  3.4 Using R to fit non-parallel slopes models
4 Regression with two factors that do not interact
  4.1 The model for two factors that do not interact
  4.2 Visualising the model
  4.3 Testing whether both factors are required
  4.4 Informally checking the assumption of no interaction
  4.5 Using R for regression with two factors that do not interact
5 Regression with two factors that interact
  5.1 Including an interaction
  5.2 Testing for an interaction
  5.3 Using R for regression with two factors with an interaction
6 Regression with any number of covariates and factors
  6.1 Regression with more than two factors
  6.2 Regression with multiple covariates and factors
  6.3 Including interactions
  6.4 Using R for modelling using general linear models
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Unit 5 Linear modelling in practice

Introduction
1 Specifying the problem
  1.1 The background context
  1.2 Identifying a response variable
  1.3 Which explanatory variables might be useful?
  1.4 Summary of the modelling task
2 Sourcing and preparing the data
  2.1 Sourcing data for analysis
    2.1.1 Primary and secondary data revisited
    2.1.2 Sourcing data about Olympic medals
    2.1.3 Sourcing data about population and wealth
    2.1.4 Sourcing the remaining variables
  2.2 Preparing data for analysis
    2.2.1 Reading data into R from a file
    2.2.2 Using R to combine data
    2.2.3 Using R to merge data
    2.2.4 Finishing the preparation of the data
3 Building a statistical model
  3.1 Which models to consider?
  3.2 Using R for an initial data analysis
  3.3 Expanding the model
    3.3.1 Deciding whether to treat an explanatory variable as a covariate or a factor
    3.3.2 Using R to explore adding more variables to the model
4 Further modelling issues
  4.1 When explanatory variables vary in importance
  4.2 Is it an outlier?
  4.3 Are the predictions any good?
    4.3.1 Measures for the quality of predictions
    4.3.2 Using R to calculate MSE and MAPE
    4.3.3 Using a test dataset to assess predictions
    4.3.4 Assessing predictions through the use of a test dataset in R
  4.4 Model exploration or model confirmation?
5 Missing data
  5.1 Dealing with missing data
    5.1.1 Complete case analysis
    5.1.2 Available case analysis
    5.1.3 Imputation
  5.2 Why missing data arise
  5.3 Impact of the missing data type
6 Documenting the analysis
  6.1 Documenting for yourself
  6.2 Creating documentation in Jupyter notebooks
  6.3 Documenting for others
  6.4 Describing a model
7 Replication
  7.1 Perils of multiple testing
  7.2 Overcoming the problem of multiple testing
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Index
Unit 1 Introduction to statistical modelling and R

Setting the scene


This module is all about statistical modelling in practice. A statistical
model is a simplified representation of the underlying process that
generated a sample of data. As such, the statistical model might be the
underlying probability distribution from which the data were sampled, or
the statistical model might include other components, such as regression
relationships between variables. This module focuses on regression-type
statistical models. Throughout the module, the term ‘model’ will always
refer to a statistical model.

The statistical modelling process


The identification of an appropriate statistical model for some data fits
into a wider statistical modelling process, illustrated in Figure 1. There is
a cycle of improvement at the heart of the process and we will describe the
steps shown in Figure 1 next.

Figure 1 The statistical modelling process: a flow from ‘Pose
questions’, through ‘Design study’, ‘Collect data’, ‘Explore data’,
‘Make assumptions’, ‘Formulate model’, ‘Fit model’, ‘Check model’ and
‘Choose model’, to ‘Report results’, with an ‘Improve model’ cycle
returning to the assumption and formulation steps.



The steps of the process


The statistical modelling process begins with a practical problem. This
problem needs to be formulated in terms of statistical questions. A study
can then be designed and appropriate data collected to answer these
questions. These steps of the process (‘Pose questions’, ‘Design study’ and
‘Collect data’ in Figure 1) typically involve specialists, such as medics,
economists, biologists, and so on, with knowledge of the particular
application area, and ideally an analyst.
The next step of the modelling process is to explore the data. Exploring
the data allows the analyst to ‘get a feel for the data’ and to then make
assumptions about the data, such as ‘there seems to be a linear
relationship between variables X and Y ’. The assumptions made regarding
the data then help the analyst to formulate the general form of a potential
statistical model. For example, if we are assuming that there is a linear
relationship between variables X and Y , then our formulated model should
represent this linear relationship.
The next step in the process is to fit the proposed model to the data so
that the model is fully specified with numerical values for all of its
parameters. Once the model is fitted, the model then needs to be checked
in terms of how well the model fits the data, and also in terms of whether
or not any assumptions required for the model seem reasonable.
After checking the model, the analyst needs to consider whether or not the
model can be improved. If they think that perhaps it can be, they can
make different model assumptions, such as ‘the relationship between
variables X and Y may be non-linear, rather than linear’. A new model is
then formulated to be consistent with these model assumptions, fitted to
the data and checked. This process is repeated until a final model is
chosen that the analyst is satisfied with.
The final step in the modelling process is to report the results, usually in
the form of a written report.

Other considerations for the model


Knowledge of the application area of the data may drive the formulation of
the statistical model. For example, a statistical model of child height may
make use of the fact that a child’s height is related to their age. In other
situations, little may be known about the relationships between variables,
and the statistical model may be entirely driven by the data.
The purpose of the model may also vary. It could be used as an
explanatory tool: for example, a statistical model could be used to
understand which of a set of variables affect the value of another variable
and also how they affect the variable. Alternatively, or additionally, a
statistical model could be used to predict the value of a variable on the
basis of values of other variables. Or a statistical model could be used to
test out whether or not various theories are reasonable. Whatever the
purpose, there is no ‘correct’ model, and a statistical model shouldn’t be
expected to fit perfectly.


Working with a computer


The focus of this module is on the practical use of statistical models. Even
though it is possible to fit very simple models ‘by hand’, most of the
models considered in this module are far too computationally difficult for
that and the use of a computer is essential. As such, computer work
is an integral part of the module (including assessment) – not only in
terms of you being able to use the statistical models presented, but also to
help your understanding of the material in the units. It is therefore very
important that you keep up with the practical computing as you go along,
and you are strongly advised to complete the computing work for each unit
before moving on to the next unit.

Terminology for modelling


This module assumes that in previous study you have met one of the
simplest statistical models: the linear regression model for modelling a
linear relationship between two variables. Linear regression models are
used in many different subject and application areas. The terminology
associated with the models can vary, not only between different subject
areas, but also, confusingly, within the same subject area!
Throughout Units 1 to 8 of this module, we will refer to the variable of
interest that we are modelling as the response variable, or simply the
response, while the variable that we are using to model the response will
be referred to as the explanatory variable. For example, if we wanted a
model of child height using the relationship between child height and age,
then ‘child height’ would be the response variable and ‘age’ would be the
explanatory variable.
Terminology that you may come across elsewhere for the response and
explanatory variables is summarised in Table 1. Although the terms paired
together on rows in Table 1 are often used together, this is not always the
case: the terms are sometimes used interchangeably.
Table 1 Different terminology used for the response and explanatory variables

Response variable        Explanatory variable

Dependent variable       Independent variable
Explained variable       Explanatory variable
Predictand               Predictor
Regressand               Regressor
Response                 Stimulus
Endogenous variable      Exogenous variable
Outcome                  Covariate
Controlled variable      Control variable


Overview of the units


In Unit 1, we will consider the main ideas of regression when the
relationship between the response and explanatory variables is assumed to
be linear. From Unit 2 onwards, regression models are extended beyond
modelling linear relationships between two variables to deal with a greater
range of situations.
When using linear regression in your studies so far, it is very likely that you
have only used numerical explanatory variables. This means each variable
is a measurable quantity, such as height, age, and so on. It is, however,
also possible to have categorical explanatory variables. These are variables
that can only take values which are names or labels, such as eye colour.
(Categorical variables are also known in computer science as ‘enumerated
type’, ‘enumeration’ or ‘enum’.) Units 2 to 5 extend the regression model
reviewed in Unit 1, which uses a single numerical explanatory variable, to
regression models with any number of explanatory variables, which can be
a mixture of both numerical and categorical variables.
In Units 1 to 5, the response variable is assumed to be modelled (at least
approximately) by a normal distribution. In Units 6 to 8, the regression
models developed in Units 1 to 5 are extended further to model response
variables which cannot be assumed to be modelled by a normal
distribution.
The module then divides into separate strands, each strand considering
statistical modelling issues from the perspective of a different discipline;
you will study one of these strands. The final unit of the module, which is
for everyone regardless of which strand was studied, then brings everything
together to help prepare you for the EMA.
Figure 2 summarises diagrammatically what is covered in M348 and how
the units fit together.


Figure 2 Summary of M348 and how the units fit together: regression
with one explanatory variable (Unit 1); regression with multiple
explanatory variables (Unit 2); regression with one categorical
explanatory variable (Unit 3); regression with any number of numerical
and categorical explanatory variables (Unit 4); putting Units 1 to 4
into practice (Unit 5); regression with non-normal response variables
(Units 6 to 8); the separate strands; and bringing it all together
(Unit 13).


Introduction to Unit 1
In Unit 1, we will focus on linear regression with a single explanatory
variable and also introduce the software used throughout the module for
the practical computing work. You will get hands-on experience of using
the software, with the aim that you should feel confident using it for linear
regression with a single explanatory variable by the end of the unit.
The module uses the statistical programming language R (yes, its name is
just this one letter!) for statistical modelling. R is open source and is
widely used by practising statisticians and researchers working in many
different fields. The primary difference between statistical modelling using
R and statistical modelling using a statistical package such as Minitab or
SPSS is the fact that R is not menu-driven: in statistical packages such as
Minitab and SPSS, models can be specified by selecting options from
menus on the toolbar and then completing dialogue boxes, whereas in R,
models are specified by typing commands. Although this may sound like
harder work, the advantage with using typed commands instead of menus
is that there is far more flexibility as to what can be done and how it is
done.
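To give a flavour of what a typed command looks like, here is a minimal
sketch of fitting a straight-line model to some made-up data (the names
dat, x and y are hypothetical, not module code):

  # Made-up data for illustration only
  dat <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
  # The model is specified by typing a formula, not by completing a dialogue box
  model <- lm(y ~ x, data = dat)
  summary(model)  # display the fitted coefficients and related output

A typed command like this can be copied, adapted and re-run at will,
which is where much of the extra flexibility comes from.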
If you are new to using computer code, you may feel a little nervous about
using R. Do not worry, you will not be expected to write a lot of code; in
the majority of cases the code will be provided for you, and any code that
you do need to produce yourself will be easily copied and adapted from
existing code.
To help you work with the R code, the module will use R via the Jupyter
Notebook application. Jupyter Notebook is often simply referred to as
‘Jupyter’ and we will do so throughout the module.
The easiest way to get a feel for using R via Jupyter is to try it! You will
do that next in Section 1, starting with the basics of using Jupyter and
then an introduction to R. Section 2 will focus on exploratory data analysis
and you will then try some exploratory data analysis using R in Section 3.
Sections 4, 5 and 6 cover various aspects of linear regression for modelling
the relationship between two variables. Some of the content of these
sections will be a review for you, while some of the content is likely to be
new to you. You will then learn how to use R for linear regression in
Section 7.
Finally, the unit rounds off with a (very short!) section looking ahead to
what’s to come in the rest of the module.
The structure of Unit 1, in terms of how the unit’s sections fit together, is
represented diagrammatically in what we’ve called a ‘route map’ (shown
next). This is accompanied by a note of any sections or subsections that
need a computer or other resources. The aim of the route map is to help
you to navigate your way through the unit. Each unit will have its own
route map, which will be used in the same way.


The Unit 1 route map: Section 1 (Computing preliminaries); Section 2
(Exploratory data analysis); Section 3 (Using R for exploratory data
analysis); Section 4 (Simple linear regression); Section 5 (Checking
the simple linear regression model assumptions); Section 6 (Prediction
in simple linear regression); Section 7 (Using R for simple linear
regression); and Section 8 (Looking ahead . . . ).

Note that for Sections 1, 3 and 7 you will need to use your computer
as well as the written unit in order to complete activities using the
module software. In Section 1, you will also need to access other
resources on the module website (such as screencasts or written
instructions) as part of setting up and familiarising yourself with the
module software.


1 Computing preliminaries
This section introduces Jupyter and R, which together form the software
you will be using throughout the module for the practical computing work.
Jupyter is used in this module as a way to work with the statistical
programming language R.
We will focus on Jupyter first, in Subsection 1.1, before adding R into the
mix in Subsection 1.2. For both subsections, you will need to refer to the
module website alongside this printed text. Before starting Subsection 1.1,
make sure that you have installed both Jupyter and R on your computer.

Installing Jupyter and R


Follow the installation instructions provided on the module website.

1.1 Getting started with Jupyter


Jupyter is a web-based interactive computational environment which
combines explanatory text (using something called Markdown) with live
computer code (written in R in this module) which can be run within the
Jupyter web document. As such, a complete data analysis – including
descriptions of the problem of interest and the data analysed, the results
for the analysis including the code used to generate the results, and the
conclusions of the analysis – can be kept together in a single document.
Jupyter therefore makes it easier to share and reproduce the analysis. It is
also easy to add to an analysis or tweak code and re-run the analysis,
OK, so this Jupiter has a which can be very helpful for working collaboratively with other analysts.
different spelling to Jupyter, Each document within Jupyter is called a ‘notebook’ (and has the suffix
but it makes a nice picture! ‘.ipynb’). All the notebooks that you need for M348 are provided on the
module website. Activity 1 guides you through finding and opening your
first M348 Jupyter notebook.

Activity 1 Launching Jupyter and opening a notebook


If you haven’t already done so, you should install Jupyter and R by
following the instructions on the module website.
Watch Screencast 1.1 on the module website, which shows you how to get
started with Jupyter and how to open an M348 Jupyter notebook. There
are also some written instructions provided on the module website to use
instead of, or as well as, the screencast.

You should now be ready to ‘work through a notebook’. To do this, you
will need to read through the text in the notebook (in logical order from
top to bottom, as you would when reading a book or document), and
follow any instructions given in the notebook as you come to them.


Whenever you need to work through a notebook for the module, we have
signalled a ‘Notebook activity’ in the printed text, such as that given
below for Notebook activity 1.1. The heading for each of these notebook
activities indicates which notebook you will need, and is followed by a brief
explanation of what is covered in the notebook.
So now work through Notebook activity 1.1 (for which you opened the
associated notebook in Activity 1) to get a feel for the main features of
Jupyter.

Notebook activity 1.1 Introducing Jupyter


This notebook introduces you to the main features of Jupyter that
you will use in this module.

All of the notebooks associated with Unit 1 can be found in the folder
called ‘Unit 1’ on the Jupyter dashboard. (Similarly, all the notebooks
associated with Unit 2 can be found in the folder called ‘Unit 2’ on the
Jupyter dashboard, and so on.) Work through Notebook activity 1.2 to
explore Jupyter a little further.

Notebook activity 1.2 Exploring Jupyter further


This notebook explores Jupyter’s menu bar and toolbar.

You will use Jupyter notebooks throughout this module, and at times you
will need to manipulate your own notebooks. In order to do this, you need
to know how to write text in notebooks. As has already been mentioned,
Jupyter uses Markdown to do this.
Activity 2 introduces you to Markdown, before you practise using it in
Notebook activity 1.3.

Activity 2 Introducing Markdown

Watch Screencast 1.2 on the module website, which shows you how to use
Markdown. There are also some written instructions provided on the
module website to use instead of, or as well as, the screencast.

Notebook activity 1.3 Using Markdown


Before you work through this notebook, make sure you have worked
through Activity 2.
This notebook gives you the opportunity to try using Markdown.


You should by now have a fairly good idea of how you can create text
documents in Jupyter. However, we haven’t yet considered the
all-important question of how to use the programming language R! This
will be considered in the next subsection.

1.2 Introducing R
Although you met R very briefly while using Jupyter in Subsection 1.1,
this section provides a proper introduction to R. We’ll get started with R
in Notebook activity 1.4.

Notebook activity 1.4 Getting started with R


This notebook considers using R to do simple calculations, including
using multiple lines of R code and the use of comments with code.
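As a taste of what Notebook activity 1.4 covers, the following lines are
a minimal sketch (the particular calculations are made up):

  (3 + 4) * 2     # R evaluates arithmetic typed at the prompt: 14
  sqrt(16) + 2^3  # built-in functions combine with operators: 4 + 8 = 12
  # Anything after a hash symbol is a comment and is ignored by R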

Although Notebook activity 1.4 focused on using R to do calculations, R is
so much more than a calculator! One reason for this is that R operates on
what are known as objects. The use of objects is one of the things which
is very different from a menu-driven statistical package such as Minitab or
SPSS, in which the user refers to columns in a data table instead of objects.
R objects are introduced in Notebook activity 1.5. R functions, which
can execute several instructions using a single command, are introduced in
Notebook activity 1.6. You will meet two particular functions: one which
allows you to create and store a one-dimensional array of data (a vector),
and the other which can calculate the mean of these data. Then
Notebook activity 1.7 considers vectors in a little more detail.
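As a brief illustration of these two functions, presumably along the
following lines (the values are made up):

  heights <- c(9, 8, 7, 8, 9)  # create a vector and store it as 'heights'
  mean(heights)                # calculate the mean of these data: 8.2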

Notebook activity 1.5 R objects


This notebook introduces R objects.

Notebook activity 1.6 R functions


This notebook introduces R functions.

Notebook activity 1.7 R vectors


This notebook introduces some different types of R vectors.

Many datasets (indeed, all of the datasets in this module) consist of
observations on more than one variable, which are not ideally represented
by a one-dimensional array. When storing data observed on several
variables in R, vectors are amalgamated together to form what is known as
a data frame. The vectors in a data frame are all the same length, but


not necessarily the same type, and each vector becomes a separate column
in an array of data. The sequence of values within each vector in a data
frame must be the same, so that the first values in each vector correspond
to the values for the first observation, the second values in each vector
correspond to the second observation, and so on.
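A minimal sketch of the idea, with made-up values: three vectors of equal
length but different types are amalgamated into a data frame, and each
becomes a column.

  species <- c("manna ash", "oak", "birch")  # character vector
  height  <- c(8, 21, 15)                    # numeric vector
  mature  <- c(FALSE, TRUE, TRUE)            # logical vector
  trees <- data.frame(species, height, mature)
  trees  # row i holds the values for observation i across all three vectors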
Most of the datasets used in M348 are stored in data frames, which have
been created for you. When a dataset is introduced, the name of the data
frame is also given. You will learn how to load and view the data frames
for M348 in the next notebook activity. We will then focus on the
individual vectors within a data frame.

Notebook activity 1.8 M348 data frames


This notebook explains how to load and view the data frames created
for M348.

Notebook activity 1.9 Data frame vectors


This notebook considers individual vectors stored within a data frame.

Although the datasets that you’ll use in M348 are all stored in data frames
which have already been created for you, there will be times when you will
need to create a data frame for use in calculations. For example, you may
want to create a data frame which contains only a subset of the vectors
from a data frame. You will see how to do this in the next – and final –
notebook activity of this section.

Notebook activity 1.10 Creating data frames


This notebook shows you how to create a data frame.
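One way this can be done, as a hedged sketch reusing the hypothetical
trees data frame idea from above:

  trees <- data.frame(species = c("manna ash", "oak", "birch"),
                      height  = c(8, 21, 15),
                      mature  = c(FALSE, TRUE, TRUE))
  # A new data frame containing only a subset of the vectors (columns)
  trees_small <- trees[, c("species", "height")]
  trees_small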

This section has introduced some of the basics of R, but we haven’t yet
used R to do very much. I’m sure that you will be glad to hear that you
will be using R to do more exciting things as we move through the module,
and you will discover that R is, in fact, a very powerful tool for analysing
data and building statistical models.
We will return to using R later in this unit, but in the meantime you may
choose to watch Screencast 1.3 (on the module website) for a ‘sneak peek’
at some of the things that R can do.
Now that you have been introduced to both Jupyter and R, we can move
on to considering statistical modelling in R. A crucial first step in any
statistical modelling is to get a feel for the data through some exploratory
data analysis; this is the subject of the next section.


2 Exploratory data analysis


Exploratory data analysis – often referred to as ‘getting a feel for the data’
– involves learning about a dataset and preparing the data ready for
modelling.
There are different types of data. These can affect both the quality and
reliability of the data, as well as which models are considered in the
modelling process. We will consider some different types of data in
Subsection 2.1, before discussing data quality and reliability in
Subsection 2.2. To round off the section, Subsection 2.3 considers some of
the visual and numerical data summaries that are useful for exploratory
data analysis. We will then be ready to try some exploratory data analysis
using R in Section 3.
We will return to considering the process of preparing the data ready for
modelling in more detail in Unit 5.

2.1 Types of data


In this subsection, we will consider some different types of data. We’ll
start with primary and secondary data.
Primary data are data collected for the first time directly from the data
source. Common sources of primary data include observations, surveys,
experiments, questionnaires and personal interviews. Primary data are
collected for a specific purpose, often a research project with a particular
goal in mind. As such, there is a degree of control over the choice of data
to be collected so as to best address the problem at hand.
Secondary data are data which already exist and were collected in the
past by someone else but made available for a third party to use. These
data usually start out as primary data and then become secondary data
when used by a third party.
Some primary and secondary data are illustrated in Example 1 and
Activity 3.

Example 1 Survey of Adult Skills


The Organisation for Economic Co-operation and Development –
OECD, for short – is an international organisation that works with
governments, policy makers and citizens to find solutions to social,
economic and environmental challenges. The OECD collects and
analyses a large amount of data on many different issues.
The Survey of Adult Skills is an example of primary data for the
OECD. The survey is conducted (every 10 years) in over 40 countries
as part of the OECD’s Programme for the International Assessment of
Adult Competencies (PIAAC). The survey measures the proficiency of
working-age adults (16- to 65-year-olds) in literacy, numeracy and
problem-solving in technology-rich environments, and gathers
information and data on how adults use their skills at home, at work
and in the wider community. The survey is also interested in other
employment skills, such as collaboration. In each country, between
about 4000 and 27 000 individuals take part in the survey, answering
questions via a computer or using pencil and paper. The survey is
designed to be valid internationally to obtain comparable results
across different cultures and national languages. (OECD, no date)
OECD’s analyses of the survey data are published in various
publications: for example, the first results from the survey were
published in the OECD Skills Outlook 2013 (OECD, 2013). Since the
data are collected and analysed by the OECD, they are primary data
for the OECD.
The OECD’s publications on the Survey of Adult Skills aim to help
countries better understand how education and training systems can
develop skills. Additionally, the published data themselves are also
available for use by others outside the OECD. In this case, the data
become secondary data when being used by a third party.

Activity 3 Primary or secondary data?

In December 2020, the European Organization for Nuclear Research in
Geneva, known as CERN, announced a new open data policy which
publicly releases data for the scientific experiments at the Large Hadron
Collider (LHC).
(a) A team of CERN physicists run an experiment at the LHC and
analyse their results. For this team of physicists, are these data
primary or secondary?
(b) A university researcher in particle physics hears about the LHC
experiment and would like to have access to the resulting data for
their own research. For this researcher, are these data primary or
secondary?

The data used in this module will all be secondary data. Although there
are advantages to using primary data in terms of data quality, primary
data can be expensive and time-consuming to obtain. As such, secondary
data are commonly used by many researchers.


Next we’ll consider observational and experimental data. As the name
suggests, observational data are data that have simply been observed
and recorded. For observational data, the researcher has no control over
any of the variables and there is no manipulation of the data environment.
In contrast, experimental data are generated through a controlled
experiment, where the researcher can control the explanatory variable.
Both types of data are illustrated in Examples 2 and 3, and then
considered in Activity 4.

Example 2 Phone use and sleep


Suppose that the BBC News headline ‘Most children sleep with
mobile phone beside bed’ (Coughlan, 2020) prompts a researcher to
investigate the relationship between the amount of phone use before
bed and the amount of sleep a person gets. Further, suppose that the
researcher takes ‘amount of sleep’ as the response variable and decides
that this will be measured by the time spent asleep, in hours.
(Participants were left to wake naturally.) Assume that the
explanatory variable is taken to be ‘phone use’ and the researcher
decides that this will be measured by the number of minutes spent
using the phone in the hour before bed.
One way for the researcher to proceed would be to observe and record
the number of minutes spent using the phone in the hour before bed
and the time spent asleep for each participant in the study. The
resulting data would then be observational data, since the researcher
is simply observing the data and is not controlling the participants’
phone use.
An alternative way for the researcher to proceed would be to control
the value of the explanatory variable ‘phone use’ for each participant.
For example, the researcher could set the amount of phone use in the
hour before bed to be zero for one group of the participants,
15 minutes for another group of participants, 30 minutes for another,
and so on. This time, the resulting data would be experimental data,
since the researcher controls what the value of the explanatory
variable is for each participant.

Example 3 Data from the LHC experiment


Activity 3 considered data generated from an experiment at the LHC
at CERN. Despite the fact that this is an experiment, the resulting
data could be either experimental or observational! If the team of
physicists is controlling a variable, then the resulting data will be

experimental, but if the team are simply observing the variables
(which often happens when studying a new phenomenon), then the
resulting data will be observational.

As was seen in Example 3, the fact that a study is called ‘an experiment’
does not necessarily mean that the resulting data are always experimental
data. The key to whether the data are observational or experimental lies in
whether or not the researcher has control over the value a variable takes.

Activity 4 Observational or experimental data?

Example 1 considered The Survey of Adult Skills conducted by the OECD.
This survey involved participants being interviewed and answering
questions (on a computer or using pencil and paper). The participants’
answers were then observed and recorded. Are the resulting data
observational or experimental? Explain your answer.

The final distinction for data that we’ll consider here is between natural
science data and social science data.
Although both natural science and social science have ‘science’ in common,
the focus of each is different: natural science focuses on the physical world,
while social science focuses on the study of society and how humans
behave and influence the world around us. (Natural science disciplines
include biology, chemistry, Earth science and physics. Social science
disciplines include economics, education, human geography, law, politics,
psychology and sociology.) This difference in focus means that
natural science data are generated from well-defined laws of nature,
whereas social science data are generated from situations involving people.
Natural science data are objective and you would expect to get very
similar results when repeating an experiment. Social science data tend to
be subjective, often relying on opinion, which means that the resulting
data can vary a lot from sample to sample. What’s more, it is not usually
possible for the researcher to control a social science data variable in the
same way that it often is in natural science. This means that social science
data are invariably observational rather than experimental.
There are many data sources which provide freely available data. We met
one of these, the OECD, earlier in Example 1. Here are examples of some
others.
• National agencies, such as the Office for National Statistics (ONS) in the
UK, collect and analyse data on many aspects of national life including
the economy, population and society at national, regional and local
levels.


• International agencies, such as Eurostat and the United Nations (UN),
collect and analyse data on many international issues, such as the
economy, population, society and the environment.
• Satellites collect Earth data for weather, climate, and environmental
monitoring applications including sea surface temperatures, sea ice
extent, forest fires, volcanic eruptions, and so on. NASA’s Earthdata
Search is an example of a source of such data.
• International health data are collected by organisations such as the
World Health Organization (WHO) and The World Bank.
• International data with a focus on finance and the economy are collected
by organisations such as The World Bank and the International
Monetary Fund (IMF).
• Citizen science projects allow the general public to contribute to data
collection for scientific research by, for example, searching for exoplanets
using human eyes, sharing observations on biodiversity across the globe,
and playing computer games to map retinal neurons, to name but a few.
A number of such projects can be found on the Zooniverse web portal –
go and check it out if you fancy being involved!
Some of the data provided by these sources are primary to the data
provider (as seen, for instance, in Example 1), whereas others are
secondary to the provider, using data collected from different sources, such
as different government agencies.
The majority of the data from the providers listed above are observational
data. Experimental data (where the researcher controls the value of the
explanatory variable) are more likely to be published in specialist journal
articles (although observational data can also be found in such articles).
We’ll finish this subsection with an activity identifying which data sources
are likely to provide social science data, and which are likely to provide
natural science data.

Activity 5 Sources of social science data and natural science data
Consider the following international data sources (mentioned above):
• Eurostat
• NASA’s Earthdata Search
• the IMF.
Which of these data sources are likely to provide social science data and
which are likely to provide natural science data?

As already mentioned, the type of data in a dataset can affect its quality
and reliability. We will discuss data quality and reliability next.


2.2 Data quality and reliability


The quality and reliability of data varies between, and indeed within,
datasets. We’ll consider this for primary and secondary data in
Subsection 2.2.1. Then, in Subsection 2.2.2, we’ll touch on experimental
data before focusing on observational data. Finally, in Subsection 2.2.3,
we’ll consider what might happen when data sources are combined.

2.2.1 Primary and secondary data


Primary data are usually of higher quality and more reliable than
secondary data. This is because primary data are collected for a specific
purpose, with a degree of control over which data to collect for the
particular problem at hand. Problems that may occur with secondary data
are explored in the next activity.

Activity 6 Potential problems with secondary data

Many countries around the world conduct censuses of their populations.
These data, in an aggregated form that ensures that confidentiality of
individuals is maintained, can often then be obtained by researchers for
use as secondary data. Can you think of any possible problems that the
researcher may have when using such data?

However, primary data may not always be ideal for addressing the problem
at hand. For example, it may be impractical or too expensive to collect
data on a variable of interest. Indeed, it may not even be possible to
observe the variable that we are really interested in, and instead we need to
use a less ideal alternative. This problem is illustrated in the next example.

Example 4 Measuring the standard of living


A country’s standard of living is often of interest to researchers, but
isn’t something that can be easily measured. So, researchers often use
a country’s gross domestic product (GDP) per capita, which measures
the value of goods and services produced for each resident, to
represent a country’s standard of living. However, using GDP per
capita to represent standard of living is less than ideal because the
value of goods and services does not capture various aspects of a
society’s quality of life, such as health and access to education. As
such, GDP per capita could be a misleading measure of standard of
living since, for example, an increasing quantity of goods produced
could decrease a society’s quality of life through increased pollution
and environmental damage, and increased income inequality.


The problem illustrated in Example 4 is not uncommon for social science
data, since, unlike natural science data, it can be difficult to observe a
‘measure’ of some of the variables of interest. For example, it is difficult to
give a quantitative measure of things like ‘happiness’ or ‘teamwork skills’;
any measure of these is likely to be imprecise and just a rough
approximation. As a consequence, natural science data – where more
precise measurements are (usually) possible – are generally of higher quality
and more reliable than social science data.
Regardless of whether the data are primary or secondary, or from social or
natural science, how the variables in a dataset are defined can affect the
usefulness of the data. This is illustrated in the next activity.

Activity 7 Classifying gender identity

Data on gender identity are important for issues of equality, diversity and
inclusion, as well as to support policy development and service provision.
At the time of writing, where data on gender is available, the majority of
existing datasets use just a binary female/male classification for gender.
You’ll notice this prevalence of the binary classification for gender reflected
in the datasets that you’ll meet in this module. However, there are many
terms that someone may use to identify their gender. Alternatively, they
may use more than one term or they may not use any specific term.
Given this, what problems might arise with a dataset that only allows
gender to take the values male and female?

As a follow-up to Activity 7, it may interest you to know that, in recent
years, several countries have taken steps to include gender identity options
beyond male and female in official surveys. For example, the UK census
for England and Wales asked about gender beyond the binary female/male
classification for the first time in 2021. So, moving into the future, more
data on gender identity should become available.

2.2.2 Experimental and observational data


Whether data are experimental or observational also impacts on how
reliable the data are likely to be.
For experimental data (that is, the data collected as part of a planned
experiment) there is a degree of control of the data collection methods and
measurements. This means that experimental data tend to be more
reliable than observational data.
We will consider the issue of reliability for observational data by exploring
some data collected as part of the Treezilla project, described next.


Treezilla data on manna ash trees


Treezilla is a citizen science project whose aim is to create a database
and map of Great Britain’s trees through collaboration between
organisations, local governments and the general public. Anyone can
add tree data to the database. The location of each tree is entered
into the database by the user zooming into the location using a map
on the Treezilla website. The user can then add information about the
tree, such as the name of the species of tree, the diameter of the tree
(in metres, to two decimal places) at 1.3 m above the ground, and its
height (in metres, rounded to the nearest metre). Information for each
tree added to the map and database does not need to be complete –
for example, the species of the tree may be unknown by the user. In
this case, other users can add further information regarding the tree.
As part of the Treezilla database, there are data on 42 trees of the
species manna ash (Fraxinus ornus) located along Walton Drive on
The Open University (OU) campus in Milton Keynes. Figure 3 shows
a screenshot of the map of these trees taken from the Treezilla
website. The orange and red dots on the map denote the locations of
the manna ash trees in the database. The orange is used to represent
‘good’ data and the red for ‘needs updating’.

Figure 3 Treezilla map showing the manna ash trees along Walton
Drive on the OU campus (as at 10 February 2022), with labels added to
indicate the west and east sides of Walton Drive
A tree’s diameter and height may be affected by its location – for
example, some trees may be in sunnier locations than others. So, the
location of the tree may be useful information for any statistical


modelling of the tree diameter and height data. Looking at the
locations of the trees given in Figure 3, the trees are either located on
the west side of Walton Drive (at the left-hand side of the road on the
map in Figure 3), or on the east side of Walton Drive (at the
right-hand side of the road on the map). Therefore, for each tree we
have also recorded which side of Walton Drive (west or east) the tree
is on. Each of the manna ash trees along Walton Drive in the
Treezilla database has an identification number.
The manna ash trees dataset (mannaAsh)
The database contains data on many variables, but we will focus on
the ones listed below (including the location details that we have
added):
• treeID: the identification number for the tree
• diameter: the diameter of the tree (in metres, to two decimal
places) at 1.3 m above the ground
• height: the height of the tree (in metres), rounded to the nearest
metre
• side: the side of Walton Drive that the tree is located on, taking
possible values west and east.
Note that ‘mannaAsh’ given in brackets in the heading above is the
name of the corresponding data frame. (As mentioned in
Subsection 1.2, when we introduce a dataset we will give the
corresponding name of the data frame. We will do this in a similar
way for other datasets.)
The data for the first five manna ash trees (taken from the Treezilla
database) are given in Table 2. (Note that the order in which the
observations are given in the data frame is not the same as the
ordering given by treeID.)
Table 2 First five observations from mannaAsh

treeID   diameter   height   side

271      0.23       9        west
270      0.21       8        west
269      0.20       7        west
268      0.21       8        west
272      0.21       9        west

Source: Treezilla, 2012, accessed 19 July 2019
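Assuming the mannaAsh data frame has been loaded (as covered in
Notebook activity 1.8), two standard R commands give a quick view of it;
the output of the first corresponds to Table 2:

  head(mannaAsh, 5)  # the first five observations, as in Table 2
  str(mannaAsh)      # the structure: one vector (column) per variable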


In Activity 8, you will consider the potential issue of inaccurate or missing
information for observational data.

Activity 8 Quality of the Treezilla data

For the manna ash trees dataset, while measuring tree diameter (1.3 m
above ground) is fairly straightforward, foresters use special equipment to
measure tree height. (For example, a laser rangefinder measures distances
and a clinometer measures the angle between the person and the top of a
tree. Trigonometry can then be used to calculate the height of the tree.)
There are also smartphone apps which can measure tree heights, but, at
the time of writing, these can be very inaccurate.
(a) Some of the individual trees in the Treezilla data have inaccurate or
missing information regarding which species the tree is. Why might
that be?
(b) Why might some of the individual tree height measurements be
inaccurate or missing?

One of the data quality problems mentioned in Activity 8 was the problem
of missing data. There is no hard-and-fast rule as to how to deal with
missing data values. If there are only a few values of a variable which seem
to be missing randomly, then it can be sensible to simply drop these
observations. When there are a substantial number of values missing for a
variable, then it might be wiser to drop that particular variable. On the
other hand, there could be an underlying reason why data values are
missing – for example, a particular group of people may refuse to answer a
particular question. In this case, the missing values should not be ignored
since the fact that they are missing is important. We will consider the
problem of missing data in more detail in Unit 5.
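A minimal sketch of the ‘drop the observations’ option, assuming a small
hypothetical data frame df with one missing height:

  df <- data.frame(treeID = 1:4, height = c(9, NA, 7, 8))
  na.omit(df)                    # drops any observation with a missing value
  mean(df$height, na.rm = TRUE)  # or ignore the NAs in a single calculation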
Another potential problem with large-scale databases such as Treezilla is
the potential for there to be bias in the data. You will see an example of
such bias in the next activity.


Activity 9 Bias in the Treezilla data

Figure 4(a) shows the Treezilla map of trees for an area of London, taken
from the Treezilla website on 10 February 2022, and Figure 4(b) shows the
Treezilla map of a slightly larger area around Loch Tummel (a rural area
in Scotland), taken from the same website on the same day.


Figure 4 Map showing the locations of trees with data (denoted by coloured
dots) in the Treezilla database in (a) an area of London, and (b) an area
around Loch Tummel in Scotland.


In the map showing the area of London, there are data for a lot of trees, as
indicated by the many dots on the map. In contrast, there are no dots on
the map showing the area around Loch Tummel, which indicates that no
data have been recorded for trees in that area. As a result, from the
Treezilla maps, it looks like there are far more trees in London than the
area around Loch Tummel. However, as can be seen from Figure 5, there
are many trees around Loch Tummel.

Figure 5 The B8019 crossing Allt Charmaig, by Loch Tummel in Scotland

Why might the data collection method used in the Treezilla citizen science
project lead to more tree data being collected in London than in the Loch
Tummel area?

With the advances in computing and technology, there is an increasing
number of (often very large) observational datasets. The data for many of
these are generated via automated data collection processes, or through
collaborative data collection methods, such as that used for Treezilla.
These datasets can contain a wealth of information and can be extremely
useful. However, because of the potential problems with data accuracy and
possible reporting bias, when analysing such datasets it is always
important to consider how reliable each particular dataset is, and to bear
this in mind when analysing the data.


2.2.3 Combining data sources


So far in Subsection 2.2 we have considered the quality and reliability of
just a single data source. The increase in the number of data sources
available means that many datasets are created by combining data from
different sources. For example, a multinational company will have offices in
different countries, each of which may collect and report data in different
ways depending on the country’s requirements. The local company data
for each country can be combined to analyse data for the company as a
whole. Creating datasets by combining different data sources can, however,
create problems, as considered in the next activity.

Activity 10 Considering combined data sources

If a dataset is created by combining data from different sources, what
potential problems may arise?

Any data problems created by combining data sources should be identified
and then corrected, if possible, before analysing the data. For example,
duplicate observations should be deleted, a variable should be measured
using the same unit throughout the dataset, and the same naming
conventions should be used. We will return to the creation of a dataset by
combining data sources in Unit 5.
Regardless of the source of a dataset, there may be statistical outliers
which are considered as ‘atypical’ observations. Such observations could
simply be data errors or anomalous data values, but could instead be an
indication that we have chosen the ‘wrong’ model. When there are
outliers, it is often sensible to analyse the data both including and
excluding the outliers. If the results are very similar for both analyses,
then the outliers are unlikely to be of concern. However, if the results are
different for the two analyses, then this would suggest that the outliers are
influential, and in that case it is wise to report the results obtained both
with and without the outliers. Outliers will be considered further in Unit 5.
For small datasets, it may not be too difficult to spot any data problems,
errors or potential outliers. For larger datasets, summaries of a dataset can
help to reveal potential data problems; some of the commonly used
summaries are considered next.


2.3 Summaries of data


This module assumes that you are familiar with various graphics and
numerical measures which can be used to summarise data. This subsection
will identify exactly which visual and numerical summaries we are
assuming that you are familiar with. If you find that you are unfamiliar
with any of these, or you would like a reminder about any (or indeed all!)
of them, see the module website for a review of them.
Visual summaries of a dataset, through graphics, are an important part of
exploratory data analysis. Graphics can help to identify patterns in the
data and any relationships between variables. They can also provide
information regarding what potential distribution may best represent the
data, as well as help to identify any potential outliers or data errors. As a
starting point for this module, we assume that you are familiar with using
and interpreting the following:
• bar charts, for displaying categorical and discrete variables
• histograms (both frequency histograms and unit-area
histograms), for displaying continuous variables or discrete variables
with a large number of possible values
• boxplots (also known as box-and-whisker plots), for displaying
continuous variables
• comparative boxplots, for comparing boxplots of more than one
continuous variable
• scatterplots, for displaying two linked variables.
In addition to using visual summaries of a dataset, there are various
numerical summaries which can be calculated. For this module, we assume
that you are familiar with the following:
• the sample mean, often denoted by x̄
• the sample median, often denoted by m
• the sample variance, often denoted by s2
• the sample standard deviation, often denoted by s
• the interquartile range, often denoted by IQR
• the lower quartile or first quartile, often denoted by qL or Q1
• the upper quartile or third quartile, often denoted by qU or Q3
• the correlation coefficient between two variables, often denoted by r.
It is worth noting that we will be using R throughout this module for
calculating any numerical summaries, and so what matters is that you are
familiar with their use rather than the formulas for calculating them.
We will be using these visual and numerical summaries in R next, and so,
if any of them are unfamiliar to you, or you would like a reminder, refresh
your knowledge on these before moving on to the next section.
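As a quick illustration of how these summaries are obtained in R, the sketch below applies base R functions to a hypothetical numeric sample; the data here are simulated purely for illustration.

    # hypothetical sample, for illustration only
    x <- rnorm(100, mean = 50, sd = 10)
    mean(x); median(x)            # sample mean and sample median
    var(x); sd(x)                 # sample variance and standard deviation
    IQR(x)                        # interquartile range
    quantile(x, c(0.25, 0.75))    # lower and upper quartiles
    hist(x)                       # frequency histogram
    boxplot(x)                    # boxplot

For two linked numeric vectors x and y, cor(x, y) gives the correlation coefficient r.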


3 Using R for exploratory data analysis
In this section, we will use R to put the exploratory data analysis
techniques reviewed in Section 2 into practice. We can customise the
default plots in R to produce very nice graphics – indeed, R was used to
produce some of the graphics in this module’s printed materials. However,
for exploratory data analysis, R’s default plots are usually all that we
require to get a feel for the data. We will therefore simply focus on
producing R’s default plots, which are quick and easy to produce.
We will explore a dataset regarding footballer performance, which is
described next, and then explore the manna ash trees dataset introduced
in Section 2.

Attributes of football players


Fédération Internationale de Football Association, more commonly
known as FIFA, describes itself as the international governing body of
association football (also known as soccer). One of FIFA’s
responsibilities is the organisation and promotion of football’s major
international tournaments, including the FIFA World Cup and the
FIFA Women’s World Cup.
FIFA maintains a huge database containing various attributes of
football players. The FIFA 19 database includes data for 2019 on
more than 18 000 footballers. For each player, the FIFA 19 database
includes data regarding each individual footballer’s price, wage,
country, club, height, weight and assessed scores for many different
skills.
The FIFA 19 dataset (fifa19)
In this module, we will be using the data from a subset of 100 football
players taken from this database. The database contains data on
many variables, but we will focus on just a few here, as listed below:
• strength: a score of strength, expressed as an integer between 0
and 100
• height: the player’s height, measured in inches (in) to the nearest
inch (1 in = 2.54 cm)
• weight: the player’s weight, measured in pounds (lb) to the nearest
pound (1 lb ≃ 0.45 kg)
• marking: a score of marking ability, expressed as an integer
between 0 and 100. (In football, marking is a defence strategy
which aims to prevent a member of the opposing team from taking
control of the ball.)


• preferredFoot: the foot preferred by each player, taking possible values left and right
• skillMoves: an assessment of each player’s football ‘skill’ moves,
taking possible values 1, 2, 3, 4 and 5 (with 5 being the highest
level).
From the data source, it is unclear exactly how the scores for
strength, marking and skillMoves were obtained, or who calculated
these scores.
The data for the first five observations from the FIFA 19 dataset
(from the subset of 100 football players taken from the database) are
shown in Table 3.
Table 3 First five observations from fifa19

strength  height  weight  marking  preferredFoot  skillMoves
      67      69     159       45          right           3
      66      72     163       79          right           4
      72      72     161       81           left           3
      59      68     150       66          right           4
      70      69     157       38           left           4

Source: Gadiya, 2019, accessed 13 March 2019

The first thing that we need to do when presented with a new dataset is to
consider the quality of the data. We will do this in the next activity.

Activity 11 Quality of the FIFA 19 dataset

(a) By considering the description of the FIFA 19 dataset, which of the variables, if any, may not be precisely measured?
(b) Are there any potential sources of bias in the dataset?

As you saw in Activity 11, it is possible that some of the variables in the
FIFA 19 dataset may not be precisely measured, and/or may also exhibit
bias. Equally, it is possible that there are no problems with the data.
Without further details regarding how the data were obtained for these
variables, all we can do is use the data as given in our exploratory data
analysis, and flag up any potential problems in any conclusions/discussions
that we present.
In Subsection 2.3 you were reminded that visual summaries of a dataset
are an important part of exploratory data analysis. So, in the next activity
you will consider which graphics would be suitable to start exploring the
FIFA 19 dataset.


Activity 12 Selecting suitable graphics

In the FIFA 19 dataset, which of the variables strength, weight, height, marking, preferredFoot and skillMoves are suitable for displaying in bar charts, and which are suitable for displaying in histograms?

In Subsection 2.3, we described the visual and numerical summaries of
data that you should be familiar with. Bar charts are useful for displaying
categorical and discrete variables, whereas histograms are useful for
displaying continuous or discrete variables with a large number of possible
values. Boxplots are useful for getting a feel for the symmetry or otherwise
of a distribution, whereas comparative boxplots are particularly useful for
comparing the distributions of several variables.
When the relationships between variables are of particular interest, in
statistical modelling, scatterplots play a major role. When there is a
categorical variable in addition to the variables that we wish to display in
a scatterplot, it is possible that the relationship between the two variables
differs for the different values of the categorical variable. A good way to
investigate this is to use different colours for the points in a scatterplot
according to the value of the categorical variable for that observation.
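As a rough base R sketch of this idea (assuming the fifa19 data frame has been loaded; the notebook activities cover the module’s own approach), the points in a scatterplot of strength against weight can be coloured by preferredFoot as follows.

    # colour points according to the categorical variable preferredFoot
    plot(fifa19$weight, fifa19$strength,
         col = ifelse(fifa19$preferredFoot == "left", "red", "blue"),
         xlab = "Weight (lb)", ylab = "Strength")
    legend("topleft", legend = c("left", "right"),
           col = c("red", "blue"), pch = 1)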
Notebook activities 1.11 to 1.14 show you how to produce these visual
summaries in R for the FIFA 19 dataset. In Notebook activity 1.15, you
will move on to producing numerical summaries for the FIFA 19 dataset.
Then, in Notebook activity 1.16, you will use R to perform an exploratory
data analysis of the manna ash trees dataset described in Subsection 2.2.2.

Notebook activity 1.11 Bar charts in R


This notebook explains how to create bar charts in R for the
categorical variables in fifa19.

Notebook activity 1.12 Histograms in R


This notebook explains how to create histograms in R for the
remaining variables in fifa19.

Notebook activity 1.13 Boxplots in R


This notebook explains how to create boxplots and comparative
boxplots in R.


Notebook activity 1.14 Scatterplots in R


This notebook explains how to produce scatterplots for two variables
in R, as well as how to use colour in a scatterplot to distinguish data
points associated with different values of a categorical variable.

Notebook activity 1.15 Numerical summaries in R


This notebook explains how to obtain numerical summaries of data
in R.

Notebook activity 1.16 Exploratory data analysis of the manna ash trees dataset
This notebook carries out an exploratory data analysis of mannaAsh.

Once we have a good feel for the data through an exploratory data
analysis, we are ready to start modelling the data; simple linear regression
is the topic of the next section.

4 Simple linear regression


As already mentioned when setting the scene at the beginning of this unit,
linear regression is widely used in many disciplines, and the associated
terminology can vary between, and within, disciplines. In this section, we
will specify the regression notation and terminology which will be used
throughout Units 1 to 8. Please be aware that you may have used different
notation and terminology in previous study, and you may meet different
notation and terminology beyond Unit 8 and M348.
Remember that in Units 1 to 8 we’ll use the convention that the variable
to be modelled is called the response variable, or simply the response, and
a variable that we are using to model the response is called an explanatory
variable. We can think of the explanatory variable as the variable which
explains the other variable, and the response variable as the variable which
responds to the explanatory variable. A summary of the commonly used
names for the response and explanatory variables was given in Table 1 (in
‘Setting the scene’).
In this section, we will focus on regression where there are just two
variables – the response variable and a single numerical explanatory
variable – and there is a linear relationship between the two variables.
Here (and throughout Units 1 to 8), we will refer to this type of regression
as simple linear regression. The ‘simple’ in this title refers to the fact
that there is just one numerical explanatory variable.


We will start this section by reviewing the basic idea of simple linear
regression in Subsection 4.1. We will then consider estimating the
parameters in the simple linear regression model in Subsection 4.2, and
testing for a relationship between the two variables in Subsection 4.3.

4.1 Review of the basic idea


For experimental data, we have control over the value of the explanatory
variable so that its value is known and fixed in advance of collecting the
data. For observational data, we do not have control over the value of the
explanatory variable and its value is observed, along with that of the
response, when collecting the data. Despite this difference, in statistics the
response variable is regarded as a random variable, while the explanatory
variable is regarded as fixed and non-random (whether its data values were
known in advance of collecting the data or not). So, even if we are
observing the values of the response and explanatory variable at the same
time, the focus is on the variability exhibited by the response assuming
that the explanatory variable is non-random.
The explanatory variable is usually denoted by x (lower case is used to
indicate that the explanatory variable is assumed to be fixed and
non-random), while the response variable is usually denoted by Y (upper
case is used to indicate that the response is random). Our observed data
are then the n data pairs, (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).
Regression aims to model the relationship between the response Y and the
explanatory variable x. A regression model has two parts – a systematic
(deterministic) part and a random part.
• The systematic part defines the line or curve representing the
relationship between Y and x. We will refer to this as the regression
function, h(x).
• The random part represents how the data points (x1 , y1 ), (x2 , y2 ), . . . ,
(xn , yn ) are scattered about the regression function.
A regression function and scatter about that function are illustrated in
Figure 6.


Figure 6 A scatterplot of data (x1, y1), (x2, y2), . . . , (xn, yn) with a regression function h(x); the random part represents the scatter about h(x)

In simple linear regression, the relationship between Y and x is linear, so that the regression function, h(x), has the form
h(x) = α + βx,
which, on a scatterplot of the data, corresponds to a straight line with
(unknown) intercept parameter α and (unknown) slope parameter β. Note
that we have used the Greek letters α and β in this straight line equation
(rather than a and b, for example). This follows the convention, which will
be used throughout the module, that Greek letters will be used to denote
unknown parameters.
The simple linear regression model for the ith response Yi , with
associated observed explanatory variable xi , can then be written as
Yi = α + βxi + Wi,   i = 1, 2, . . . , n,
where α + βxi is the systematic part of the model and Wi is the random part, known as a random term (or error term).
The random terms W1 , W2 , . . . , Wn are independent random variables. (If
W1 , W2 , . . . , Wn are independent of one another, then each Wi doesn’t
affect any of the other random terms.) We would ideally like the random
terms to be zero (which would mean that there is no scatter about the
regression function). However, this is usually not possible. So, instead we
would like their values to be zero on average, so that the expected value of
each random term, E(Wi ), is zero. In simple linear regression, in addition
to being independent and having zero mean, the Wi ’s are also assumed to
follow normal distributions, all with the same variance.
A reminder of the normal distribution is given in Box 1.


Box 1 The normal distribution


A random variable Y has a normal (or Gaussian) distribution
with mean µ and standard deviation σ (and hence variance σ 2 ) if the
probability density function (p.d.f.) of Y is given by
f(y) = (1/(σ√(2π))) exp{−(1/2)((y − µ)/σ)²}, where −∞ < y < ∞.

This is written Y ∼ N (µ, σ 2 ). Here, −∞ < µ < ∞ and σ > 0.


(Remember that the p.d.f. defines a curve representing the variation
of a random variable. You do not need to know the formula for this
p.d.f. – it is given here for completeness.)
The p.d.f. of N (µ, σ 2 ) is symmetric about the mean µ and is always
‘bell-shaped’. The value of σ determines the spread of the p.d.f., so
values of Y less than µ − 3σ or greater than µ + 3σ are unlikely.
The p.d.f. of N (µ, σ 2 ) is shown in Figure 7. Note that the normal
p.d.f. is often referred to as the normal curve.

Figure 7 The p.d.f. of a normal distribution N(µ, σ²)

The normal distribution with mean 0 and standard deviation 1, N(0, 1), is known as the standard normal distribution.

The simple linear regression model is summarised in Box 2. Then, in Example 5, we will apply this model to the manna ash data.


Box 2 The simple linear regression model


If Y is a response variable and x is an explanatory variable, then the
simple linear regression model for the ith response Yi , with
associated (fixed) explanatory variable xi , can be written as
Yi = α + βxi + Wi , i = 1, 2, . . . , n. (1)
The parameters α and β are, respectively, the intercept and slope of
the straight line relating the response to the explanatory variable.
These are the regression coefficients of the regression model.
The random terms Wi are independent normally distributed random
variables with zero mean and constant variance, σ 2 ; that is,
Wi ∼ N (0, σ 2 ), i = 1, 2, . . . , n.
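To make the structure of Model (1) concrete, the following minimal R sketch simulates data from a simple linear regression model; the parameter values are purely illustrative and are not taken from any dataset in this unit.

    # simulate Yi = alpha + beta * xi + Wi, with Wi ~ N(0, sigma^2)
    set.seed(1)                            # for reproducibility
    n <- 50
    alpha <- 2; beta <- 0.5; sigma <- 1    # illustrative parameter values
    x <- runif(n, 0, 10)                   # treated as fixed and non-random
    w <- rnorm(n, mean = 0, sd = sigma)    # independent random terms
    y <- alpha + beta * x + w
    plot(x, y)                             # scatter about the line alpha + beta * x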

Example 5 A simple linear regression model for the manna ash trees data
Consider once again the manna ash trees dataset (introduced in
Subsection 2.2.2). In that dataset, there are two numerical variables
measured on manna ash trees along Walton Drive on The Open
University campus:
• height: the height of the tree, rounded to the nearest metre
• diameter: the diameter of the tree (in metres, to two decimal
places) at 1.3 m above the ground.
As remarked in Activity 8 (also in Subsection 2.2.2), measuring the
height of a tree can be difficult without specialist equipment, whereas
measuring a tree’s diameter (1.3 m above the ground) is easier. It
would therefore be useful to be able to predict the height of a tree
given its diameter. This can be done using a simple linear regression
model with diameter as the explanatory variable x, and height as
the response variable Y . (And we will do this in Section 6.) The
simple linear regression model in this case then has the form
Yi = α + βxi + Wi , i = 1, 2, . . . , n,
where the terms Wi are independent with Wi ∼ N (0, σ 2 ),
i = 1, 2, . . . , n.


As already mentioned, the values of the explanatory variable x1, x2, . . . , xn are assumed known and non-random in simple linear regression. As a
result, since E(Wi ) = 0,
E(Yi ) = E(α + βxi + Wi ) = α + βxi , i = 1, 2, . . . , n.

Also, since V(Wi), the variance of Wi, is equal to σ², the Wi are independent of one another, and the variance of anything non-random (such as α + βxi) is 0,
V (Yi ) = V (α + βxi + Wi ) = σ 2 , i = 1, 2, . . . , n.
So, each Yi has the same variance σ 2 , but each has a mean (α + βxi ) that
depends on the value of the explanatory variable xi . Furthermore, since
the random terms W1 , W2 , . . . , Wn are independent normally distributed
random variables, results from statistical theory then mean that the
responses Y1 , Y2 , . . . , Yn are also independent normally distributed random
variables.
The simple linear regression model is illustrated diagrammatically in
Figure 8.

Figure 8 Illustration of simple linear regression for data (x1, y1), (x2, y2), . . . , (xn, yn): each data point (xi, yi) scatters about the line h(x) = α + βx, each Yi has mean E(Yi) = α + βxi, and V(Yi) = σ² means that the scatter about the line is roughly constant

The simple linear regression model, Model (1) in Box 2, is fairly straightforward with a single explanatory variable and just two regression
coefficients (α and β). You will, however, meet models in this module with
many explanatory variables and associated parameters. These models can
look very complicated when written in a form similar to that given in
Box 2, and so we will introduce a simpler way to write the model, as
described in Box 3.


Box 3 Simpler model notation


For response variable Y and explanatory variable x, we will use the
notation
y ∼ x
to denote the model
Yi = α + βxi + Wi , i = 1, 2, . . . , n.

The notation identifies that y is the response variable, x is the explanatory variable and we are using a linear regression model for y (with an intercept parameter and a slope parameter).

The notation introduced in Box 3 will be used throughout the module and
you will also see that it is used for specifying models in R.
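For instance, the model formula appears directly in R’s model-fitting functions; the one-line sketch below uses hypothetical variable and data frame names.

    # 'regress y on x'; an intercept is included by default
    fit <- lm(y ~ x, data = mydata)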
It is often helpful to use informative variable names rather than Y and x
when specifying a model, as illustrated in the final activity of this
subsection.

Activity 13 Informative variable names

Example 5 specified the following simple linear regression model:
Yi = α + βxi + Wi,   i = 1, 2, . . . , n,
where Y denotes the response variable height and x denotes the
explanatory variable diameter. Using informative variable names, express
this model using the notation introduced in Box 3.

4.2 Estimating the model parameters


From Box 2, there are three unknown parameters in the simple linear
regression model: the intercept α, the slope β and the (common)
variance σ 2 of each of the random terms. We’ll consider estimating α
and β next in Subsection 4.2.1, and then in Subsection 4.2.2 we’ll consider
estimating σ 2 .

4.2.1 Estimating α and β


The observed data, (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), are used to estimate the
values of α and β that produce the line which is ‘closest’ to the data
points; that is, the data are used to find the line which ‘fits’ the data best.
In simple linear regression, the estimates of α and β can be obtained by
the ‘method of least squares’, also known as ‘ordinary least squares’ (which
minimises the sum of squared vertical distances between the data points
and the regression line) or using ‘maximum likelihood estimation’ (which


finds the values of α and β which maximise the likelihood). Note that you
do not need to know the technical details of obtaining these estimates for
either of these methods in this module; R will be used to obtain the
estimated values of α and β, and indeed to obtain the estimated values of
all parameters in all of the regression models that you will meet in this
module.
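As a minimal sketch (assuming the manna ash trees data frame is loaded as mannaAsh, the name used in Notebook activity 1.16), the estimates can be obtained as follows.

    fit <- lm(height ~ diameter, data = mannaAsh)
    coef(fit)   # the estimated intercept and slope

For these data the estimates turn out to be approximately 5.05 and 12.27, as you will see in Activity 14.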
Terminology and notation used to describe the fitted simple linear
regression model is given in Box 4.

Box 4 The fitted simple linear regression model


Denoting the estimated values of α and β by α̂ and β̂, respectively, the fitted simple linear regression model is
y = α̂ + β̂x.

This is also known as the equation of the least squares line or the
equation of the fitted line.
The line itself is usually referred to simply as the fitted line.

The fitted simple linear regression model provides the following information regarding the relationship between Y and x:
• α̂ is the intercept of the line relating Y and x, so it’s the value that we might expect Y to take when x = 0
• β̂ is the slope of the line relating Y and x, so it’s the value that we might expect Y to change by if x were to increase by one unit.
In Activity 14 you will use this information to interpret a simple linear
regression model that has been fitted to the manna ash trees data.

Activity 14 The fitted simple linear regression model for the manna ash trees data
From Example 5 (Subsection 4.1), the simple linear regression model for
the response variable height (Y ) and explanatory variable diameter (x)
has the form
Yi = α + βxi + Wi , i = 1, 2, . . . , n,
where the terms Wi are independent with Wi ∼ N (0, σ 2 ), i = 1, 2, . . . , n.
The estimates of the parameters α and β are 5.05 and 12.27, respectively.
The resulting fitted line has been added to the scatterplot of height
against diameter given in Figure 9. (Note there is some overplotting of
points on Figure 9, as some trees in the dataset have the same values for
height and diameter.)


Figure 9 A scatterplot of height against diameter, with the fitted line added
(a) Write down the fitted simple linear regression model for these data.
(b) According to the fitted simple linear regression model, if the diameter
increased by 0.1 m, how would the height change?

In the solution of Activity 14, the fitted simple linear regression model
y = 5.05 + 12.27x
was written in an alternative form by replacing Y and x by their more
informative names height and diameter, respectively, so that the fitted
model becomes:
height = 5.05 + 12.27 diameter.
Writing the fitted model using more informative variable names can help
make sense of the model more easily: in simple linear regression with only
one explanatory variable it isn’t difficult to remember what x is, but things
can get rather complicated when not using informative variable names for
some of the models you will meet in this module.


4.2.2 Estimating the variance σ²

The remaining unknown parameter in the simple linear regression model is σ² = V(Wi), i = 1, 2, . . . , n. In order to estimate σ², Box 5 gives a
reminder of what fitted values and residuals are.

Box 5 Fitted values and residuals


For each xi, i = 1, 2, . . . , n, the fitted simple linear regression model can be used to calculate the fitted value of Yi, denoted ŷi. For the ith data point, the value of x is xi, so that
ŷi = α̂ + β̂xi.   (2)
The residuals for the fitted model are then calculated as
ri = yi − ŷi,   i = 1, 2, . . . , n.   (3)

The fitted line for a simple linear regression model, together with two fitted values and residuals (one positive and one negative), is illustrated in Figure 10.

Figure 10 Illustration of the fitted line for a simple linear regression model (with equation y = α̂ + β̂x), together with two fitted values and residuals: a positive residual rj = yj − ŷj and a negative residual ri = yi − ŷi
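In R, once a model has been fitted with lm(), the fitted values and residuals of Equations (2) and (3) are available directly; this sketch again assumes the mannaAsh data frame of Notebook activity 1.16.

    fit <- lm(height ~ diameter, data = mannaAsh)
    head(fitted(fit))   # fitted values of Equation (2)
    head(resid(fit))    # residuals of Equation (3)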

In the next activity you will calculate a fitted value and a residual.


Activity 15 Fitted values and residuals for the manna ash trees model
Following on from Activity 14, the fitted simple linear regression model for
the manna ash trees is
height = 5.05 + 12.27 diameter.
(a) The first manna ash tree in the dataset has a diameter of 0.23 m.
What is the fitted value for this observation? (Give your answer to
two decimal places.)
(b) Given that the height for the first manna ash tree is recorded as 9 m,
calculate the residual for this observation, giving your answer to
two decimal places. Explain why this calculated residual is only
approximate. (Hint: consider how the values of height were recorded
in the manna ash trees dataset.)

The residuals are the differences between the observed data points and the
fitted regression line, and as such, ri is essentially an estimate of the
random term Wi . Now, σ 2 = V (Wi ) and so we can use the (sample)
variance of the residuals r1 , r2 , . . . , rn as an estimate of σ 2 ; this is
summarised in Box 6.

Box 6 Estimate of σ²
In simple linear regression, the (sample) mean of the residuals is 0. So
an estimate of the variance σ² = V(Wi), i = 1, 2, . . . , n, denoted
by σ̂², is given by
σ̂² = (r1² + r2² + · · · + rn²)/(n − 2),   (4)
where ri is the residual for the ith data point.
More informally, this is
σ̂² = (sum of squared residuals)/(n − 2).

In previous study, instead of the divisor n − 2 in Equation (4), you may have seen n or n − 1. We use the divisor n − 2 here in order to make the estimate of the variance unbiased – that is, on average, the value of σ̂² would equal σ². The divisor is n − 2 because there are n observations in the sample and two other parameters, α̂ and β̂, being estimated.
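Equation (4) translates directly into R; the sketch below assumes that fit is a fitted lm object, as in the earlier sketches.

    rss <- sum(resid(fit)^2)               # sum of squared residuals
    sigma2.hat <- rss / df.residual(fit)   # divisor n - 2 in simple linear regression
    # equivalently: summary(fit)$sigma^2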
Once a simple linear regression model has been fitted, we need to check
whether there is a linear relationship between Y and x; after all, there is
no point in using a simple linear regression model if there isn’t a linear
relationship between the two variables to model! The next subsection
considers how we can test whether a linear relationship exists.


4.3 Testing whether a relationship exists


Consider once again the simple linear regression model for i = 1, 2, . . . , n:
Yi = α + βxi + Wi , Wi ∼ N (0, σ 2 ).
We would like to test whether a relationship exists between Y and x.


To do this, in Activity 16 we will test the hypotheses
H0 : β = 0,  H1 : β ≠ 0.

Activity 16 Why these hypotheses?


Explain why testing whether a relationship exists between Y and x is equivalent to testing the hypotheses
H0 : β = 0,  H1 : β ≠ 0.

In previous study, you will have met hypothesis testing when using a
normal distribution to test your hypotheses, and when using another
distribution – the t-distribution. The hypothesis test for testing whether
the slope parameter β in a simple linear regression model is zero is based
on the t-distribution. A reminder of the t-distribution is given in Box 7.


Box 7 The t-distribution


The t-distribution with ν degrees of freedom, denoted t(ν), is like the
standard normal distribution, N (0, 1), but with heavier (that is,
thicker) tails.
Like the standard normal, the p.d.f. of t(ν) is symmetric about 0.
The degrees of freedom (taking the possible values ν = 1, 2, . . .)
determine the rest of the shape of the p.d.f. curve: the lower the
degrees of freedom, the heavier the tails. As the degrees of freedom
increase, t(ν) becomes closer to a standard normal distribution. To
illustrate, the p.d.f.s of t(1), t(3), t(7) and the standard normal are
shown in Figure 11.

Figure 11 The p.d.f.s of t(1), t(3), t(7) and N(0, 1)

So, we wish to test the hypotheses
H0 : β = 0,  H1 : β ≠ 0.
To do this, we use the fact (presented here without proof) that in simple linear regression,
(β̂ − β)/(standard error of β̂) ∼ t(n − 2),   (5)
where
• β̂ is the estimator of β
• the standard error of β̂ is the standard deviation of β̂’s sampling distribution
• t(n − 2) denotes the t-distribution with n − 2 degrees of freedom.


Now, when H0 is true, β = 0, so Result (5) becomes
β̂/(standard error of β̂) ∼ t(n − 2).
This result forms the basis of the test for whether a relationship exists in
simple linear regression, as summarised in Box 8. You will practise using
this in Activity 17.

Box 8 Testing whether a relationship exists


Testing whether a relationship exists between response Y and
explanatory variable x in simple linear regression involves testing the
hypotheses
H0 : β = 0,  H1 : β ≠ 0,
using the test statistic t, where
t = β̂/(standard error of β̂).
The null distribution is t(n − 2). (That is, the distribution of the test
statistic when the null hypothesis H0 is true is t(n − 2).)
The observed value of this test statistic is often called the t-value.

You may not have seen the test statistic t expressed in the form given in
Box 8 before. There is, however, a good reason for writing the test statistic
in this form, as you will discover when we use R for regression and as you
learn about other regression models in the module.
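In practice, R reports the t-value for you: for a fitted lm object, the coefficient table produced by summary() contains the estimate, its standard error, the t-value and the associated p-value. A minimal sketch, reusing the assumed fit object from earlier:

    summary(fit)$coefficients
    # columns: Estimate, Std. Error, t value, Pr(>|t|)
    # 't value' is Estimate divided by Std. Error, as in Box 8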

Activity 17 Test statistic for the manna ash trees model

From Activity 14 (Subsection 4.2.1), the fitted simple linear regression model for the response variable height and explanatory variable diameter turned out to be
height = 5.05 + 12.27 diameter.
This model was fitted using the data for 42 trees.
(a) What is the estimated value β̂ of the slope parameter in this fitted simple linear regression model?
(b) Given that the standard error of β̂ is calculated to be 4.98, calculate the observed value of the test statistic t for testing the hypotheses
H0 : β = 0,  H1 : β ≠ 0.

(c) What is the null distribution for this test?


The hypothesis test summarised in Box 8 is usually completed by considering the p-value for the test. The p-value is often simply denoted by p.
Recall that the p-value of a test is the probability that, under the null
distribution, the test statistic is at least as extreme as the value observed.
In the test described in Box 8, the alternative hypothesis H1 is two-sided,
meaning that both large positive and large negative values of the test
statistic would cast doubt on the null hypothesis H0 . So, if the observed
value of the test statistic T is t > 0, then all values of T ≥ t would be
considered ‘at least as extreme as’ t, but equally, since the p.d.f. of
t(n − 2) is symmetric about 0, all values of T ≤ −t would also be
considered ‘at least as extreme as’ t in the other direction. The p-value for
this two-sided test is then
p = P (T ≥ t) + P (T ≤ −t).
This two-sided p-value is illustrated in Figure 12. By symmetry, the
p-value for this two-sided test if t < 0 would be exactly the same.

Figure 12 Illustration of the p-value for testing whether there is a relationship between Y and x: the p-value is the area of the two shaded tails of the p.d.f. of the null distribution t(n − 2), beyond the observed value of the test statistic t and beyond −t

The p-value is therefore obtained by calculating probabilities from the t(n − 2) distribution. In this module, the p-value for this test is calculated
automatically by R when the model is fitted (you will see this for yourself
soon in Section 7), which means that you won’t need to calculate any
p-values yourself. You do, however, need to be able to interpret what the
p-value tells you about your model; this is described in Box 9.
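For completeness, the two-sided p-value p = P(T ≥ t) + P(T ≤ −t) can also be computed directly from the t-distribution function pt(); the sketch below uses the t-value and degrees of freedom for the manna ash trees model (t = 2.46 with n − 2 = 40).

    2 * pt(-abs(2.46), df = 40)   # two-sided p-value, approximately 0.018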


Box 9 Interpreting the p-value


In simple linear regression, the p-value calculated for the test of the
hypotheses
H0 : β = 0,  H1 : β ≠ 0
is interpreted as follows.
• If the p-value is small, then the data which have been observed are
very unlikely to have occurred when the null hypothesis is true.
This suggests that there is evidence against H0 and we should
conclude that β ≠ 0 (that is, there is a relationship between Y
and x).
• On the other hand, if the p-value is not small, then the data which
have been observed could well have occurred when the null
hypothesis is true. This suggests that there isn’t enough evidence
against H0 and so we should conclude that β = 0 (that is, there is
not a relationship between Y and x).

So, following on from Box 9, the question is: how small does the p-value need to be to conclude that β ≠ 0? Well, unfortunately there is no hard-and-fast answer to this question!
In previous study, you may well have been given some rough guidelines on how to interpret p-values. For example, a value of p < 0.05 is often taken as enough evidence against H0 to conclude that β ≠ 0, while a value of p < 0.01 is often taken as strong evidence against H0. However, as Example 6 shows, what is considered small enough to suggest that β ≠ 0 depends on the context of the data and the research question of interest.

Example 6 How small is small?


In a medical context where an incorrect model can have serious
consequences, the researcher usually needs to have strong evidence
against H0, and may only conclude that β ≠ 0 if p is very small (for
example, p < 0.005).
Other contexts will conclude that β ≠ 0 for larger values of p. For
example, in econometrics, when modelling financial variables,
particularly asset prices, low p-values (for example, p < 0.05 or
p < 0.01) are generally taken as evidence against H0 . On the other
hand, when looking at long-term relationships between two
macroeconomic variables, p-values as large as 0.1 can be regarded as
strong evidence against H0 .


In the next activity you will interpret a p-value obtained for a simple linear
regression model.

Activity 18 Testing whether there is a relationship between tree height and diameter
From Activity 14 (Subsection 4.2.1), the fitted simple linear regression
model for the response variable height and explanatory variable diameter
is
height = 5.05 + 12.27 diameter.
In Activity 17, the test statistic for testing the hypotheses
H0 : β = 0,  H1 : β ≠ 0
was calculated to be 2.46.
Given that the associated p-value is 0.018, what do you conclude about the
relationship between tree height and diameter for these manna ash trees?

Activity 18 highlights the important message that statistical analyses do not lead to definite ‘correct’ conclusions. All that we can do is present the
evidence produced by the statistical analysis – for example, by reporting
the p-value – and use this evidence to draw conclusions which we believe to
be sensible for the context of the data and the research question being
addressed.

5 Checking the simple linear regression model assumptions
The simple linear regression model makes several important assumptions,
as summarised in Box 10.

Box 10 Simple linear regression model assumptions


For response Y and explanatory variable x, there are four main
assumptions of the simple linear regression model
Yi = α + βxi + Wi , i = 1, 2, . . . , n.
These are:
• Linearity: the relationship between x and Y is linear.
• Independence: the random terms Wi , i = 1, 2, . . . , n, are
independent.
• Constant variance: the random terms Wi , i = 1, 2, . . . , n, all have
the same variance σ 2 across the values of x.


• Normality: the random terms Wi, i = 1, 2, . . . , n, are normally distributed with zero mean and constant variance, N(0, σ²).
Together, these four assumptions are equivalent to the following two
assumptions:
• Y1 , Y2 , . . . , Yn are independent random variables.
• Each Yi is normally distributed with mean α + βxi and variance σ 2 .

After fitting a simple linear regression model, it is necessary to check that the model assumptions seem to be reasonable. This is important because if the model assumptions do not hold, then using the model could lead to misleading or incorrect conclusions. Indeed, in some application areas (for example, in econometrics) the model assumptions often do not hold. This has led to the development of models that do not require all these assumptions to be made.
Since the residuals r1 , r2 , . . . , rn can be thought of as estimates of the
random terms W1 , W2 , . . . , Wn , the residuals play a key role in checking the
assumptions on the random terms and hence on the assumptions placed on
Y1, Y2, . . . , Yn. (Recall from Equation (3) in Subsection 4.2.2, ri = yi − ŷi,
i = 1, 2, . . . , n.) In this section, we’ll consider two types of plots useful for
using the residuals to check the model assumptions: residual plots in
Subsection 5.1, and normal probability plots in Subsection 5.2.

5.1 Residual plots


A residual plot is a scatterplot which plots the residuals against various
things. Residual plots can be useful for checking the assumptions of
linearity and that the variance remains constant across all values of x. For
some datasets, they may also be helpful for checking the assumption of
independence.

5.1.1 Checking for linearity, zero mean and constant variance

To check the assumptions about linearity, zero mean and constant variance, it is common to plot the residuals against the explanatory variable or against the fitted values. In simple linear regression, the fitted values are a linear function of the explanatory variable. This means that a residual plot with the residuals against the explanatory variable will look the same as one which plots the residuals against the fitted values (apart from a change in the horizontal axis). When using such residual plots in this module, we will plot the residuals against the fitted values because this sort of plot is easier to extend to situations where there are several explanatory variables.
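In base R, such a residual plot can be produced as sketched below (again assuming a fitted lm object called fit); alternatively, plot(fit, which = 1) gives R’s built-in residuals-versus-fitted plot, complete with its trend line.

    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0)   # the zero residual line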


If the assumptions of linearity, zero mean and constant variance are reasonable, then the points in this type of residual plot should be scattered
about the zero residual line in a random, unpatterned way. An example of
such a plot is given in Figure 13(a).
If there is a systematic pattern in the points in a residual plot, then one or
more of the model assumptions may be questionable. For example, the
residual plot in Figure 13(b) exhibits a systematic pattern which suggests
that the assumption that the Wi ’s have zero mean is not reasonable
(because the ‘band’ of points moving from left to right does not remain
centred about the line ri = 0 throughout the plot). This particular kind of
systematic pattern usually suggests that the regression function isn’t in
fact linear, because the linear regression function isn’t accommodating the
systematic relationship between Y and x.
As another example, the residual plot in Figure 13(c) exhibits a systematic
pattern which suggests that the assumption of constant variance of the
Wi ’s is not reasonable (because the ‘band’ of points moving from left to
right does not have constant spread).
A residual plot can also help to spot any potential outliers in the data: an
outlier will produce a residual whose magnitude is much larger than the
other residuals. An example of such a plot is shown in Figure 13(d).
Figure 13 Examples of residual plots: (a) residuals are unpatterned, (b) residuals are not scattered about zero, (c) variance of the residuals is not constant, (d) there is a single potential outlier

In the next activity you will examine residual plots resulting from simple
linear models fitted to some more datasets.


Activity 19 Practise checking residual plots


Figure 14 shows residual plots obtained after the simple linear regression
model was fitted to eight different datasets. For each plot, decide whether
the linearity, zero mean and constant variance assumptions seem
reasonable for that dataset. For those residual plots suggesting that the
model assumptions are not reasonable, explain which of the assumptions
seem to be in question.

Figure 14 Residual plots (a)–(h) for Activity 19


Residual plots sometimes also include a trend line, which indicates how the
mean of the residuals varies across the values of the fitted values. This can
be useful for checking the assumption that the mean of the Wi ’s is zero
(and hence that the linearity assumption is reasonable): if the trend stays
roughly around the zero residual line, then the assumption is reasonable.
(Note that R adds a trend line to residual plots, but we have not included
this line in the figures in the units.)

Activity 20 Residual plot for the manna ash trees model

In Activity 14 (Subsection 4.2.1), the fitted simple linear regression model for modelling the response height with explanatory variable diameter for the manna ash trees dataset (consisting of 42 trees) was
height = 5.05 + 12.27 diameter.
The residual plot for this model is given in Figure 15. (As points with the same values for height and diameter will have the same fitted values and residuals, there is some overplotting of points on this plot.)
Does this residual plot suggest that the linearity, zero mean and constant
variance assumptions seem reasonable?

Figure 15 Residual plot of the simple linear regression model for the manna ash trees dataset


You may not have agreed with the solution provided for Activity 20. There
is no definitive ‘correct’ conclusion in cases like this, where a plot is not
presenting a clear-cut picture. If the residual plot had a very clear pattern
and looked like the one in Figure 13(c), for example, then this would
change our conclusions. As with much of statistics, interpretation of the
plots and results can be very subjective and often the best we can do is to
present our conclusions together with the reasons for coming to those
conclusions.

5.1.2 Checking independence


A different sort of residual plot can be used to check the independence assumption. Independence of the Wi’s (and hence the Yi’s) involves the data collection process and the design of the experiment. For example, if multiple measurements are made on the same individual, or made on the same variable over time, then the resulting measurements may not be independent.
Checking for independence requires some notion of ordering the
observations. Where data are collected over time, then ordering according
to time is an obvious way forwards. However, in situations where time is
not involved, it is often not obvious what a relevant ordering should be.
Plotting the residuals in time order, such as the order that the observations
were collected (if known), may help to identify any trends in the residuals,
and hence any potential problems with the assumption of independence: if
the assumption of independence is reasonable, then we would expect the
ordered residuals to be randomly scattered about the zero residual line. It
is, however, important to bear in mind that if it is not obvious what the
ordering of the residuals should be to check for independence, then such
plots may not be particularly informative, in which case independence is
generally assumed unless there are reasons to believe otherwise.
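Where an ordering is available, such a plot can be sketched in base R as follows (fit is again an assumed lm object, with the residuals taken in the order in which the observations appear).

    plot(resid(fit), type = "b",
         xlab = "Order of observation", ylab = "Residuals")
    abline(h = 0)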
In the next activity you will consider whether the independence
assumption is reasonable for the manna ash trees data.

Activity 21 Checking the independence assumption

There isn’t any particularly obvious ordering of the data that would be useful for checking the independence of the observations in the manna ash trees dataset. So Figure 16 shows the residuals observed from fitting a simple linear regression model to the manna ash trees data in Activity 14, plotted in the order given by the identification numbers (treeID). From this plot, does the independence assumption seem reasonable?


Figure 16 The residuals from the model described in Activity 14, plotted in the order given by the identification number for each tree

5.2 Normal probability plots


In the previous subsection, we used residual plots for checking the assumptions of linearity, independence, zero mean and constant variance across the values of x. There remains one final assumption to check – this is the assumption that the Wi’s are normally distributed.
Following on from the methods used to check the other assumptions, since the residual ri can be thought of as an estimate of the random term Wi, we will use the residuals to check the normality assumption. In particular, if it is reasonable to assume that the residuals r1, r2, . . . , rn follow a normal distribution, then we will conclude that it is reasonable to assume that the random terms W1, W2, . . . , Wn also follow a normal distribution.
To check the normality of the residuals, we can use a normal probability
plot. Normal probability plots are often also referred to as QQ plots.
Note that, although this section will only talk in terms of using normal
probability plots for checking the normality assumption of the residuals,
these plots can in fact be used to check the normality of any observed
variable.


5.2.1 Checking normality


The idea behind a normal probability plot is to plot each residual ri
against the value that we’d expect this residual to take assuming that the
residuals follow a normal distribution, N (0, σ 2 ). If the plotted points lie
roughly along a straight line, then we are observing what we would expect
to observe, and so it is reasonable to assume that the residuals are
normally distributed.
For normal probability plots, instead of plotting the calculated residuals
(often referred to as the ‘raw residuals’) it is common to plot
standardised residuals: these are scaled so that their standard deviation
is 1 (and hence their variance is also 1). In this case, if the normality
assumption is correct, then the standardised residuals should follow the
standard normal distribution N (0, 1). This distribution is then used to
calculate the expected values of the standardised residuals. The values of
the standardised residuals that we’d expect, assuming that they follow the
standard normal distribution, are often referred to as the theoretical
quantiles or the normal scores. Once again, if the points in the plot lie
roughly along a straight line, then we are observing what we expect to
observe, and so it is reasonable to assume that the standardised residuals
follow a standard normal distribution and we can conclude that the
normality assumption is reasonable.
In this module, R is used to produce normal probability plots which plot
the standardised residuals against the theoretical quantiles. You do not
need to know the details of how the plots are produced, but it is important
that you know how to use normal probability plots to check that the
normality assumption is reasonable.
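One way to produce such a plot in base R is sketched below, with fit again an assumed lm object: rstandard() returns the standardised residuals, and plot(fit, which = 2) gives R’s built-in normal probability plot.

    qqnorm(rstandard(fit))   # standardised residuals against theoretical quantiles
    qqline(rstandard(fit))   # reference straight line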
Normal probability plots for checking the assumption of normality are
summarised in Box 11.

Box 11 Checking the normality assumption using normal probability plots
In a normal probability plot (QQ plot) for checking the normality of
the residuals, the observed standardised residuals are plotted against
the theoretical quantiles (normal scores) calculated assuming that the
standardised residuals follow a standard normal distribution.
If the assumption of normality seems reasonable, then the points in
the normal probability plot should lie approximately along a straight
line. An example is shown in Figure 17.


Figure 17 A normal probability plot where the points lie approximately along a straight line, indicating that it is reasonable to assume that the residuals are normally distributed

It is important to note that the points in a normal probability plot only need to approximately lie along a straight line to suggest normality, and
slight deviations from the line, such as those shown in the top right-hand
and bottom left-hand corners of Figure 17, aren’t a concern. Indeed, the
standardised residuals in the normal probability plot in Figure 17 were in
fact simulated from a standard normal distribution, and so we know that
they are normally distributed.
In this module, the theoretical quantiles are plotted on the horizontal axis
and the standardised residuals are plotted on the vertical axis (to match
how R draws normal probability plots). You may have seen the axes the
other way around in previous study. Either way is fine – the key thing to
look out for in normal probability plots is whether or not the points lie
approximately on a straight line, and this can be done whichever way
round the axes are.
You are now in a position, in Activity 22, to be able to check whether the
assumption of normality of the Wi ’s is reasonable for the simple linear
regression model of the manna ash trees dataset.


Activity 22 Is the normality assumption reasonable for the manna ash trees model?
In Activity 14 (Subsection 4.2.1), a simple linear regression model was
fitted to the manna ash trees data using height as the response variable
and diameter as the explanatory variable. You considered the resulting
residual plot in Activity 20 (Subsection 5.1.1). Now consider the normal
probability plot, given in Figure 18.
Does this normal probability plot suggest that the assumption that the
Wi ’s are normally distributed is reasonable?

Figure 18 Normal probability plot of the residuals of the simple linear regression model for the manna ash trees

If the points in a normal probability plot do not lie approximately on a straight line, then this suggests that the assumption that the residuals are normally distributed may be questionable and the fitted simple linear regression model may not be valid. However, all is not lost! In this case, it might be possible to transform the response so that the normality assumption is reasonable for a model using the transformed response instead. (Using transformations in regression will be discussed in detail in Unit 2.) Alternatively, we may need to consider using a different model instead (such as one of the models explored later in the module).


6 Prediction in simple linear regression
Very often the main aim of fitting a regression model to data is to produce an equation that can be used to predict new responses from given values of the explanatory variable. In Subsection 6.1, you will see how we can use a fitted simple linear regression model to generate a single value that represents our best prediction for the response variable based on the value of the explanatory variable; that is, to produce a point prediction for the response given the value of the explanatory variable. However, by itself, a point prediction is not that useful. Some indication of the uncertainty associated with the prediction is important, and in Subsection 6.2 you will see how this can be done via a prediction interval.

6.1 Predicting an individual response


If all that is required is a point estimate of the response, things are very
easy: the given value of the explanatory variable can simply be substituted
into the fitted regression equation as shown in Box 12.

Box 12 Point prediction in simple linear regression


If x0 is the value of the explanatory variable for an individual data
point where the response value of Y0 is not known, then the point
prediction of Y0 is
ŷ0 = α̂ + β̂x0.   (6)
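In R, point predictions from a fitted lm object are obtained with predict(); the sketch below uses hypothetical names (fit fitted as lm(y ~ x, data = mydata), with 0.5 standing in for an arbitrary new value x0).

    predict(fit, newdata = data.frame(x = 0.5))   # point prediction of Y0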

In the next activity you will use the result given in Box 12 to consider
predicting the heights of manna ash trees.

Activity 23 Prediction for the manna ash trees

Consider once again the manna ash trees dataset. From Activity 14
(Subsection 4.2.1), the fitted simple linear regression model for these data
is
height = 5.05 + 12.27 diameter.
(a) According to the model, what is the predicted height for a manna ash
tree with diameter 0.20 m?
(b) Explain why the fitted model may not be appropriate to predict the
value of height for young trees with very small diameters. (Hint: you
might find it useful to consider Figure 9 in Subsection 4.2.1 and what
values the data take in the sample.)


Activity 23 raises an important issue. The fitted model is valid for the
values of x used when fitting the model, but may not be valid beyond the
range of these values. This is illustrated in Example 7.

Example 7 Extending the range of a fitted model


Suppose that we would like to model the relationship between the
response variable weight, the median weight of baby girls (in kg), and
the explanatory variable age, the babies’ ages (in months). A simple
linear regression model was fitted to data for values of age between 6
and 10 months: these data, together with the fitted simple linear
regression line, are shown in Figure 19. From the plot, the fitted line
seems to fit the relationship between weight and age very well.

Figure 19 Scatterplot of weight and age, together with the fitted simple linear regression line for these data

Despite how well the simple linear regression model fits the data in
Figure 19, this model is a poor fit when considering weight and age
for values of age from 1 to 12 months. A scatterplot of these data
(with values of age between 1 and 12 months) is shown in Figure 20,
together with the fitted line from Figure 19 (based on values of age
between 6 and 10 months). Clearly, the simple linear regression line
based on data for values of age between 6 and 10 months does not fit
the relationship between weight and age across all values of age up
to 12 months.


Figure 20 Scatterplot of weight and age for ages up to 12 months, together with the fitted simple linear regression line from Figure 19

As you have seen in Activity 23 and Example 7, the validity of a prediction


using a simple linear regression model depends in part on the value of the
explanatory variable used for the prediction. This is summarised in Box 13.

Box 13 Range of model validity


The fitted model is valid for prediction for values of x within the
range of x values used to fit the model, but there is no guarantee that
the same relationship between Y and x will hold outside of the range
of data used to calculate the fitted model.

For an unknown individual response value of Y0 with an associated explanatory variable value of x0, the point prediction ŷ0 = α̂ + β̂x0 is said to be
• interpolated (or found by interpolation) if the value of x0 lies within the range of values of x used to fit the model
• extrapolated (or found by extrapolation) if the value of x0 lies outside the range of values of x used to fit the model.


So, in linear regression, interpolation uses the fitted line within the range
of x values used for fitting the line, while extrapolation uses the same
fitted line outside the range of x values used for fitting the line. As we saw
in Example 7, caution is needed when using extrapolation: extrapolating
just outside the range of x values may well be fine, but extrapolating
further outside the range may produce very misleading results.
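One simple safeguard is to compare a new value x0 with the range of the data used for fitting. A minimal sketch in R (illustrative only, assuming the mannaAsh data frame is loaded; the value 0.40 is a hypothetical new diameter):

# A prediction is an extrapolation if x0 lies outside the observed range
x0 <- 0.40
rng <- range(mannaAsh$diameter)  # roughly 0.15 m to 0.35 m for these data
x0 >= rng[1] && x0 <= rng[2]     # FALSE here, so this would be extrapolation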

6.2 Prediction intervals


Calculating ŷ0, the point prediction of Y0, on its own does not take account of the following two issues.
• The true regression line is not known: we only have an estimate for it (that is, we have the estimate y = α̂ + β̂x).
• Even if the true line were somehow known (that is, we knew the true
values of α and β), then we still wouldn’t know the true value of Y0 with
certainty, since we do not have a value for the random term W0 that
defines how far the actual value of Y0 will be from the regression line.
(Remember that Y0 = α + βx0 + W0 .)
One way of taking these uncertainties into account is through prediction
intervals, as described in Box 14.

Box 14 Prediction intervals in simple linear regression


If x0 is the value of the explanatory variable for an individual
response value of Y0 that is not known, then a prediction interval is
an interval estimate of Y0 .
A prediction interval provides a range of plausible values for the value
of Y0 .
Prediction intervals are calculated according to different confidence
levels, which indicate how ‘confident’ we are that Y0 will lie in the
prediction interval.
The commonly used confidence levels are 90%, 95% and 99%. The
higher the confidence level, the more confident we are that Y0 will lie
in the interval, but also the wider – and hence less informative – the
prediction interval is.

Prediction intervals can be produced easily in R, and so we won’t go into


the formulas used to calculate them in this module. Prediction intervals
are illustrated in Example 8 and Activity 24.
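For a flavour of how easy this is, here is a minimal sketch (illustrative only, not the module's notebook code; it reuses the fitted model object fit from the earlier sketch):

# 95% prediction interval for the height of a tree with diameter 0.20 m
predict(fit, newdata = data.frame(diameter = 0.20),
        interval = "prediction", level = 0.95)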


Example 8 Prediction intervals for the manna ash trees


Once again consider the manna ash trees dataset and the fitted simple
linear regression model
height = 5.05 + 12.27 diameter.
Using this fitted model, in Activity 23 (Subsection 6.1) we calculated
that the predicted height for a manna ash tree of diameter 0.20 m is
7.50 m. This gives a single-value point prediction ŷ0.
The prediction interval for the height of a manna ash tree with
diameter 0.20 m is calculated to be (4.7, 10.3) at the 90% level,
(4.2, 10.8) at the 95% level and (3.0, 12.0) at the 99% level. These
prediction intervals are illustrated in Figure 21. Notice how the
prediction intervals are centred around the point prediction
ŷ0 = 7.50 m, and how, with the increase in confidence level, these
prediction intervals increase in width, making them less informative.

Figure 21 Scatterplot of height and diameter, together with the fitted simple linear regression line and the 90%, 95% and 99% prediction intervals marked


Activity 24 Prediction intervals for different tree diameters

For the manna ash trees dataset, the values of diameter range from 0.15 m
up to 0.35 m. The 95% prediction interval for the height of a tree with
diameter 0.15 m (the low end of the range of diameter) is (3.4, 10.3) and
the 95% prediction interval for the height of a tree with diameter 0.35 m
(the high end of the range of diameter) is (5.9, 12.8).
If instead we consider a diameter in the middle of the range of values for
diameter, then the 95% prediction interval for the height of a tree with
diameter 0.25 m is (4.8, 11.4).
By looking at the widths of these three prediction intervals, how do the
widths of the prediction intervals vary across the range of values of
diameter?

Following on from Activity 24, the widths of prediction intervals always vary slightly across the range of x used for fitting the model, with the prediction intervals being slightly wider for values of x0 towards the ends of the range of x (when x0 is far from the sample mean of the x values, x̄) than they are for values of x0 in the middle of the range (when x0 is close to x̄). This means that predictions are more informative for Y0 when x0 is close to the centre of its observed range (because the prediction intervals are narrower) than when x0 is at the edges.
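This pattern can be seen directly from the interval widths. A sketch in R (illustrative only, reusing the fitted model object fit from earlier):

# 95% prediction intervals at the low end, middle and high end of the range
new_d <- data.frame(diameter = c(0.15, 0.25, 0.35))
pi <- predict(fit, newdata = new_d, interval = "prediction", level = 0.95)
pi[, "upr"] - pi[, "lwr"]  # widest at the ends, narrowest in the middle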

7 Using R for simple linear regression
In this section, we explain how to use R for simple linear regression.
Starting with Notebook activity 1.17, we focus on the simple linear
regression model for the manna ash trees dataset (discussed in Section 4).
Then, in Notebook activity 1.18, we explain how to use R to obtain
residual plots and normal probability plots (considered in Section 5) for
checking the assumptions of simple linear regression. This is followed by
Notebook activity 1.19, where we explain how R can be used to obtain
predictions (considered in Section 6).
Finally, in Notebook activity 1.20, you will fit a simple linear regression
model, check the model assumptions and obtain predictions for the
FIFA 19 dataset, which you first met in Section 3.

Notebook activity 1.17 Fitting a simple linear regression model in R
This notebook explains how to use R for simple linear regression,
focusing on the simple linear regression model for mannaAsh.


Notebook activity 1.18 Using R to check the model assumptions
In this notebook, we will use R to produce residual plots and normal
probability plots for simple linear regression, focusing in particular on
the fitted model from Notebook activity 1.17.

Notebook activity 1.19 Prediction in R


This notebook explains how to use R to obtain point predictions for
the response and interval predictions, using the fitted model from
Notebook activity 1.17.

Notebook activity 1.20 Modelling the FIFA 19 dataset using simple linear regression
In this notebook, you will fit a simple linear regression model using
data from fifa19, check the assumptions for this model, and obtain
predictions for some new players.

You will notice, as we move through the module, that the output produced
by R is often given to several decimal places. This degree of accuracy is
usually not required when writing down the fitted model, since the model
is just that – a model, giving a simplified representation of the underlying
process which generated the data. So, it is common practice to round the
parameter estimates given by R when quoting a fitted model. For example,
if R produces the estimates α̂ = 17.12938 and β̂ = 0.32173, then it would
be reasonable to quote the fitted model as any of the following:
y = 17.129 + 0.322x,
y = 17.13 + 0.322x,
y = 17.13 + 0.32x,
and so on.
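For instance, rounding could be applied directly to the estimated coefficients (a sketch only, assuming the fifa19 data frame is loaded and recalling the model strength ∼ weight from Notebook activity 1.20):

# Extract and round the estimated coefficients of a fitted model
fit_sw <- lm(strength ~ weight, data = fifa19)
coef(fit_sw)            # full precision, e.g. 17.12938 and 0.32173
round(coef(fit_sw), 2)  # rounded for quoting the fitted model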
There is no hard-and-fast rule on how much rounding is reasonable, and it
usually depends on what seems sensible in the context of the data, the
quality of the data, the sample size, and so on. Statisticians are not
particularly fussy about degrees of rounding, since any rounding errors are
usually insignificant in comparison to other assumptions and
approximations involved in the modelling process.
Hopefully, you should now feel comfortable with using both Jupyter and R
for exploratory data analysis and simple linear regression. Bear in mind
that code can be copied and pasted between notebooks, so do not worry if
you can’t remember the required code – simply copy, paste and adapt code!


8 Looking ahead . . .
To round off this unit, we’ll take a very brief look at some of the main
ways in which this module will go beyond simple linear regression.

Including multiple explanatory variables


In simple linear regression, there is a single explanatory variable. However,
many regression applications have many possible explanatory variables.
For example, in the FIFA 19 dataset, there are six variables, and the
database from which these data came has data for many more variables! It
is possible that several explanatory variables work together to influence the
response variable. In this case, it would be useful to be able to include all
of these explanatory variables in a regression model. Such models will be
developed next in Unit 2.

Including categorical explanatory variables


As has already been mentioned, not all explanatory variables are
continuous – some are factors (that is, categorical). The different factor
levels can affect the relationship between the response variable and the
explanatory variable(s) in different ways.
For example, for the manna ash trees dataset with the response variable
height and explanatory variable diameter (see, for example, Activity 14),
the fitted line using trees from only the west side of Walton Drive is
different to the fitted line using trees from only the east side of Walton
Drive, as can be seen in Figure 22. As such, the different levels of the
factor side affect the relationship between height and diameter.
In this module we will develop regression models which can include both
continuous explanatory variables and factors, and which can accommodate
different relationships between the response and explanatory variables for
different levels of the factors within a single model.


Figure 22 Fitted simple linear regression lines for the manna ash trees dataset using trees on the west side of Walton Drive only (blue triangles and blue dashed fitted line) and using trees on the east side of Walton Drive only (red circles and red solid fitted line)

Modelling non-normal response variables


Simple linear regression assumes that the random terms are normally
distributed, which in turn means that each response variable Yi is also
normally distributed. However, not all potential response variables are
normally distributed.
For example, if a medical researcher is conducting a trial to test out a new
treatment to cure a disease, then they may have several explanatory
variables, such as the patient’s age, their gender, and so on, which affect
whether the treatment is successful or not. In this case, they may be
interested in modelling a response variable Yi which can only take one of
two values for patient i, either Yi = ‘successful’ or Yi = ‘not successful’,
which clearly isn’t normally distributed!
There are many situations in which the response variable doesn’t have a
normal distribution. Later in the module we will develop regression models
suitable for modelling such non-normal response variables.
The statistical modelling techniques that you will meet in this module
provide a powerful toolkit for tackling many of the statistical modelling
problems encountered by researchers and analysts. We hope that you
enjoy learning about them!


Summary
In this unit, you have been introduced to the software to be used for
statistical modelling throughout this module – namely Jupyter notebooks
using the statistical programming language R. We started off by exploring
Jupyter, learning the basics of how to create, use and edit notebooks.
There then followed an introduction to R, where you learnt how to run R
code that you were given, as well as writing (and running) some of your
own code.
The unit then moved onto exploratory data analysis. We considered some
different types of data and discussed how the quality and reliability of data
can vary, both between and within datasets. The module assumes that you
have knowledge of various visual and numerical data summaries from prior
study, and the unit provided a list of these. You then had the opportunity
to use R to do some exploratory data analysis.
After this, we moved onto the topic of simple linear regression to model a
linear relationship between a response variable Y and an (assumed known
and fixed) explanatory variable x. We discussed the basic idea of simple
linear regression, which expresses the response in terms of an underlying
systematic straight-line relationship with the explanatory variable,
together with a random element to represent how the observed data values
vary around this straight line. Estimating the model parameters and
testing for a relationship between the response and the explanatory
variable were then considered, before moving onto the use of residual plots
and normal probability plots for checking the model assumptions (of
linearity, independence, zero mean, constant variance and normality of the
residuals), rounding off with a look at prediction. The unit finished with
using R for simple linear regression.
The Unit 1 route map, repeated from the introduction, provides a nice
reminder of what has been studied and how the different sections link
together.


The Unit 1 route map

Section 1: Computing preliminaries
Section 2: Exploratory data analysis
Section 3: Using R for exploratory data analysis
Section 4: Simple linear regression
Section 5: Checking the simple linear regression model assumptions
Section 6: Prediction in simple linear regression
Section 7: Using R for simple linear regression
Section 8: Looking ahead . . .


Learning outcomes
After you have worked through this unit, you should be able to:
• open and work through an M348 Jupyter notebook
• create a new Jupyter notebook
• add and edit text in a Jupyter notebook using Markdown
• run, write and adapt R code given in a notebook
• appreciate what R objects, functions and vectors are
• load M348 data frames into a notebook and create new data frames
• appreciate the differences between primary and secondary data,
observational and experimental data, and natural science and social
science data
• appreciate that the quality and reliability of data varies between, and
within, datasets
• use R to produce a variety of visual and numerical data summaries, and
be able to interpret these summaries
• appreciate that simple linear regression models a straight line
relationship between a response variable (the variable we would like to
model) and an explanatory variable (the variable which can be thought
of as ‘explaining’ the response)
• use R to fit a simple linear regression model and interpret the resulting
output produced by R
• use the fitted model output to test for a relationship between the
response variable and the explanatory variable
• use R to produce residual plots and normal probability plots for fitted
models
• appreciate that if the points in a residual plot show a pattern, then the
model assumptions of linearity, zero mean and constant variance might
not be justified
• appreciate that if the points in a normal probability plot do not fall
close to a straight line, then the model assumption of normality might
not be justified
• appreciate that if the residuals show a pattern when ordered, then the
independence assumption might not be justified
• use R to calculate point predictions and prediction intervals for the
response when given new values of the explanatory variable
• appreciate that predictions calculated from the fitted model are only
valid for new values of the explanatory variable within the range of
values used to fit the model, and there is no guarantee that the same
relationship between the response variable and the explanatory variable
will hold outside this range.


References
Coughlan, S. (2020) ‘Most children sleep with mobile phone beside bed’,
BBC News, 30 January. Available at: https://www.bbc.co.uk/news/
education-51296197 (Accessed: 8 February 2022).
Gadiya, K. (2019) FIFA 19 complete player dataset. Available at:
https://www.kaggle.com/karangadiya/fifa19 (Accessed: 13 March 2019).
OECD (no date) OECD skills surveys. Available at:
https://www.oecd.org/site/piaac (Accessed: 8 February 2022).
OECD (2013) OECD skills outlook 2013: first results from the Survey of
Adult Skills. Paris: OECD Publishing. doi:10.1787/9789264204256-en.
ONS (2020) Census 2021 paper questionnaires. Available at:
https://www.ons.gov.uk/census/censustransformationprogramme/
questiondevelopment/census2021paperquestionnaires
(Accessed: 8 February 2022).
Treezilla (2012) Treezilla. Available at: https://treezilla.org
(Accessed: 19 July 2019).


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Setting the scene, child measuring height: © Sergey Novikov /
www.123rf.com
Subsection 1.1, Jupiter: © tristan3d / www.123rf.com
Subsection 2.1, work collaboration: © rawpixel / www.123rf.com
Subsection 2.1, part of the LHC: © 2021 CERN
Subsection 2.1: young person using her phone in bed © Ian Iankovskii /
www.123rf.com
Subsection 2.1, UN building: © Calapre Pocholo www.123rf.com
Subsection 2.1, satellite image: © Alexander Koltyrin / www.123rf.com
Subsection 2.2.2, laser rangefinder: Taken from:
https://www.ebay.ca/itm/124860962209?oid=321358572410
Subsection 2.2.2, clinometer: Taken from:
https://treezilla.org/assets/downloads/tree-survey-guide.pdf
Figure 4(a): https://www.treezilla.org
Figure 4(b): https://www.treezilla.org
Figure 5: the B8019 crossing Allt Charmaig, by Loch Tummel in Scotland:
© Peter Wood / https://www.geograph.org.uk/photo/6152160. This file
is licensed under the Creative Commons
Attribution-Noncommercial-ShareAlike Licence
http://creativecommons.org/licenses/by-sa/3.0/
Section 3, FIFA World Cup 2018: © Russian Presidential Press and Information Office / https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_Final. This file is licensed under the Creative Commons Attribution Licence http://creativecommons.org/licenses/by/4.0/
Section 3, FIFA Women’s World Cup 2019: © Romain Biard /
www.shutterstock.com
Section 3, footballer showing skill move: © sportgraphic / www.123rf.com
Subsection 4.2.1, tartan fabric: © emqan / www.123rf.com
Section 6, crystal ball: © alexkich / www.123rf.com
Subsection 6.1, baby being weighed: © liudmilachernetska /
www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
From following Screencast 1.1 and/or the written instructions on the
module website, you should have:
• launched Jupyter
• opened the ‘Jupyter dashboard’
• navigated to the ‘Unit 1’ folder
• opened your first Jupyter notebook.

Solution to Activity 2
From following Screencast 1.2 and/or the written instructions on the
module website, you should now feel ready to try using Markdown to:
• create new cells
• add and format text
• add lists and tables.

Solution to Activity 3
(a) Since the physicists are running the experiment to generate the data
and then analysing the data, these are primary data for this team of
physicists.
(b) Since the data were collected by the CERN physicists, and not the
university researcher, these are secondary data for the researcher.

Solution to Activity 4
These data are observational, because the participants’ answers were
simply observed and recorded, and the OECD had no control over the
answers each participant would give.

Solution to Activity 5
Eurostat collect and analyse international data on aspects of the economy,
population and society, and is therefore likely to provide social science
data.
NASA’s Earthdata Search provides satellite data about the Earth, and is
therefore likely to provide natural science data.
The IMF collects data with a focus on finance and the economy, and is
therefore likely to provide social science data.


Solution to Activity 6
Two possible problems that the researcher may have are:
• The question on the census may not be exactly the question that the
researcher is interested in. For example, on the UK’s 2021 census
(ONS, 2020), a question asking respondents whether they had done any
paid work in the previous seven days told respondents to include ‘casual
or temporary work, even if only for one hour’. This may not be the
definition of work that the researcher wants to use.
• The released data may not be aggregated in the way that the researcher
would prefer. For example, the data might be aggregated into larger
geographical areas than the researcher would like, or the boundaries of
the geographical areas may not correspond with the boundaries that the
researcher would like to use.
You may have thought of other potential problems.

Solution to Activity 7
There is the obvious problem that a dataset which only allows gender to
take the values male and female cannot be used to learn about gender
identity beyond this binary classification.
Another problem is that using the female/male classification can lead to
unreliable data. A person may select one of these options despite feeling
that neither option describes their identity. Alternatively, the person may
choose to not answer that question and so the gender information for that
person is simply missing.

Solution to Activity 8
(a) The Treezilla data are collected by a large number of different people,
including members of the public who are not tree specialists.
Therefore, the species may be incorrectly identified by the person
entering the data, or information regarding tree species could be
missing altogether if the species is unknown by the person entering
the data.
(b) Foresters use laser rangefinders and clinometers to measure tree
height. However, members of the general public may not have either
of these available to use. This makes measuring tree height difficult.
Although there are smartphone apps which can measure tree heights,
because these apps can give inaccurate measurements this means that
the height data for individual trees may be inaccurate. If the person
collecting the data does not have access to either special equipment or
a smartphone app for measuring tree height, then the height
measurement may be missing altogether.


Solution to Activity 9
The Treezilla citizen science project database relies on many different
people (‘citizens’) to collect the data. As such, the amount of tree data
collected in an area will depend, to a certain extent, on the number of
people collecting the data. So, since a rural, sparsely populated area such
as that around Loch Tummel will have fewer people who might potentially
collect the data for this project than in a densely populated area such as
London, it is perhaps no surprise that the data for more trees are collected
in London than around Loch Tummel.

Solution to Activity 10
Some of the potential problems which may arise are as follows.
• Observations may be duplicated if they are recorded in more than one of
the sources. For example, a local branch of a shop may hold a
customer’s details, which may also be held on a different database of
customers purchasing from the shop via the internet.
• The same variable may be recorded in different sources using a different
unit of measure. For example, height may be measured in metres in one
data source, but centimetres in another.
• Categorical variables may have different naming conventions in different
datasets. For example, one data source may record the gender ‘male’ as
‘male’, and another as ‘m’. When analysing the data, these may appear
as different values when in fact they are the same.
You may well have thought of different potential problems.

Solution to Activity 11
(a) It is not clear from the data source how the scores for strength,
marking and skillMoves were calculated, and it is possible that they
were not precisely measured. For example, the scores may simply be
subjective assessments and therefore the opinion of the assessor(s).
On the other hand, despite being rounded to the nearest inch and
pound, respectively, the variables height and weight will have been
measured using measuring equipment, and are therefore more likely to
be fairly accurate and precise.
The variable preferredFoot is not something that can be measured,
but is also likely to be accurate for most players, since it is usually a
fairly clear-cut decision as to which foot a footballer prefers to use.
There may, however, be players who play with both feet equally, and
it isn’t obvious how this might be recorded from the description of the
dataset.


(b) If the scores for strength, marking and skillMoves are subjective
assessments, then there may be bias in the data from individual
assessors. Given the size of the database, it is also likely that there
were many different assessors, and it is possible that there would be
differences between the scores given by different assessors. For
example, some may be more generous than others.

Solution to Activity 12
Bar charts are suitable for displaying categorical or discrete variables.
There are two categorical variables in the FIFA 19 dataset –
preferredFoot (with two levels left and right) and skillMoves (with five
levels labelled 1, 2, 3, 4 and 5) – and so both of these can be displayed in
bar charts.
Both of the variables strength and marking are discrete scores between 0
and 100. Since they are discrete, in theory they could also be represented
by bar charts. However, because there are so many possible values for each
variable (potentially all the integers between 0 and 100), it would be
difficult to see the overall shape of the distribution of the data in a bar
chart. For data such as these, it is usually more sensible to present the
data in histograms by grouping the possible data values into bins.
Although the variables height and weight look discrete because their
values have been rounded to integers, they are in fact continuous, because
they have been measured on continuous scales. As such, both of these
variables are suitable for displaying in histograms.

Solution to Activity 13
Using height for Y and diameter for x, the model can be expressed as
height ∼ diameter.

Solution to Activity 14
(a) Since the estimates of the parameters α and β are 5.05 and 12.27,
respectively, the fitted simple linear regression model is
y = 5.05 + 12.27x,
that is,
height = 5.05 + 12.27 diameter.

(b) If the diameter increased by 1 m, then, according to the model, the


value of the height would increase by 12.27 m. Therefore, if diameter
increased by 0.1 m, then, according to the model, the value of height
would increase by 12.27 × 0.1 m = 1.227 m.


Solution to Activity 15
(a) Using Equation (2) and the fitted simple linear regression model given in the question, the fitted value for the first observation is
ŷ1 = 5.05 + 12.27x1 = 5.05 + (12.27 × 0.23) = 7.8721 = 7.87 (to 2 d.p.).
(b) From Equation (3), the residual for the first observation is
r1 = y1 − ŷ1 = 9 − 7.8721 = 1.1279 = 1.13 (to 2 d.p.).
The values of height were recorded to the nearest metre, and so the value of y1 was rounded and could actually be anywhere between 8.50 m and 9.49 m (taking two decimal places to match those for r1 and ŷ1). Hence the calculated value of r1 is only approximate.

Solution to Activity 16
If β = 0, then the model becomes
Yi = α + Wi,  Wi ∼ N(0, σ²).
Since this is not a function of xi, this means that according to the model the value of Yi is unaffected by xi's value. So, if β = 0, then there is no relationship between Y and x.
On the other hand, if β ≠ 0, then according to the model, the value of xi does affect the value of Yi. So, if β ≠ 0, then there is a relationship between Y and x.

Solution to Activity 17
(a) From the equation of the fitted line, β̂ = 12.27.
(b) The observed value t is calculated as
t = β̂ / (standard error of β̂) = 12.27/4.98 ≃ 2.46.
(c) Since the fitted model was based on data for n = 42 trees, the null distribution for this test is t(n − 2) = t(42 − 2) = t(40).

Solution to Activity 18
Given the context of the data, the p-value of 0.018 is small enough to
suggest to some analysts that β is not zero, so that there is a linear
relationship between tree height and diameter for these manna ash trees.
Note that there may not be universal agreement about this, and some
analysts may not agree that the p-value is small enough to draw this
conclusion. However, by quoting the p-value, it allows anyone reading the


analysis to see for themselves the strength of evidence against the null
hypothesis that β = 0, so that they can draw their own conclusion.

Solution to Activity 19
Residual plots (c) and (h) are unpatterned, suggesting that the linearity,
zero mean and constant variance assumptions seem reasonable.
Residual plot (e) seems to have a potential outlier, but is otherwise
unpatterned. So this also suggests that the linearity, zero mean and
constant variance assumptions seem reasonable.
Neither of the residual plots (a) and (d) are randomly scattered about the
horizontal line: instead each follows a curve, suggesting that the zero mean
and linearity assumptions are not reasonable. (Residual plot (d) also has
two potential outliers which stand out from the rest of the pattern.)
Finally, the vertical spreads in the residual plots (b), (f) and (g) all change
with the fitted values, suggesting that the constant variance assumption is
not reasonable. Further, residual plot (f) also suggests that the linearity
assumption is also not reasonable since the points follow a curve rather
than being scattered about the zero residual line.

Solution to Activity 20
The points in the residual plot are randomly scattered about the zero
residual line, suggesting that the assumption that the Wi ’s have zero
mean, and hence linearity, is reasonable. There is, however, perhaps a hint
of decreasing spread as fitted values increase, which could suggest that the
assumption that the Wi ’s have constant variance may be in question.
However, the sample size of 42 trees is not large, and so it is difficult to
draw firm conclusions from the residual plot. Also, if the two large positive and two large negative residuals at fitted values ŷi of around 7.3 to 7.5 were slightly smaller, would the residual plot still have a hint of decreasing spread?
On balance, it looks like the linearity, zero mean and constant variance
assumptions could be considered to be reasonable.

Solution to Activity 21
There is perhaps a hint of curvature in the residuals going from left to
right, which might mean that the independence assumption could be
questionable. Perhaps the identification numbers represent the order that
the trees are situated along the road and the heights of trees are affected
by the heights of their neighbours? From the given data, we do not know
the answer to this and we would need further information regarding how
the data were collected if we were to investigate the independence
assumption further.
Any curvature in the plot is, however, only slight and, as mentioned in the
solution to Activity 20, the plot is based on only 42 observations. So, on


balance overall, for us (the module team) Figure 21 wouldn’t rule out the
independence assumption.

Solution to Activity 22
Most of the points in the normal probability plot lie roughly on a straight
line, so the assumption of normality seems plausible.

Solution to Activity 23
(a) Using Equation (6), the predicted height, in metres, for a manna ash
tree with diameter x0 = 0.20 m is
5.05 + (12.27 × 0.20) = 7.504 ≃ 7.50.

(b) The diameters of trees range from about 0.15 m up to 0.35 m, and
young trees with very small diameters will be outside of this range.
So, it may not be appropriate to use the fitted model to predict the
height for such trees. In particular, if the tree had a diameter of 0 m,
then the fitted model would predict its height to be 5.05 m, which
clearly does not make sense!

Solution to Activity 24
The width of the 95% prediction interval at the low end of the range
(when the diameter is 0.15 m) is 10.3 − 3.4 = 6.9, at the middle of the
range (when the diameter is 0.25 m) is 11.4 − 4.8 = 6.6, and at the high
end of the range (when the diameter is 0.35 m) is 12.8 − 5.9 = 6.9. So, the
widths are the same at the low and high ends of the range, but slightly
narrower in the middle of the range.

Unit 2
Multiple linear regression
Introduction

Introduction
Section 4 of Unit 1 reviewed regression in its simplest form – namely,
simple linear regression. This is a technique for modelling a linear
relationship between two variables, where one is a response variable and
the other is an explanatory variable which helps to ‘explain’ the variation
in the response.
The response variable is in fact usually affected by more than one single
explanatory variable. For example, you saw in Notebook activity 1.20 (in
Unit 1) that the strength score of football players is affected by their
weight. However, you can also think of the players’ heights or their skills
as other explanatory variables that could ‘explain’ the variation in their
strength, and it is possible that the model would be improved if these
other explanatory variables were considered too. Regression with more
than one explanatory variable is called multiple linear regression or,
more simply, multiple regression (so called because there are ‘multiple’
explanatory variables). Multiple regression is a very common tool in
statistical data analysis.

How Unit 2 relates to the module so far

Moving on from . . . regression with one explanatory variable (Unit 1).
What's next? Regression with multiple explanatory variables.

This unit explores the basic properties and uses of multiple linear
regression. The unit starts in Section 1 with a formal definition of the
multiple linear regression model as an extension of the simple linear
regression model. Using the model for prediction is discussed in Section 2.
Section 3 considers how to assess how good your fitted model is, while
Section 4 discusses the use of transformations of variables in multiple
regression to address problems with the model. Working with more than
one explanatory variable raises the question of which of the explanatory
variables should be included in the model; methods for choosing the
explanatory variables are discussed in Section 5.
The following route map shows how the sections connect to each other.


The Unit 2 route map

Section 1: The multiple linear regression model
Section 2: Prediction in multiple regression
Section 3: Diagnostics
Section 4: Transformations in multiple regression
Section 5: Choosing explanatory variables

Note that each section ends with a number of notebook activities.


This means you will need to switch between the written unit and your
computer for Subsections 1.4, 2.3, 3.4, 4.3 and 5.4.

1 The multiple linear regression model
This section introduces multiple linear regression. We’ll start in
Subsection 1.1 by building the multiple linear regression model. Then, in
Subsection 1.2, we will discuss the interpretation of the coefficients in a
multiple regression model, since it is more nuanced than for coefficients in
a simple linear regression model. Subsection 1.3 then turns to the issue of
testing regression coefficients arising from multiple regression models.
Finally, in Subsection 1.4, you will learn how to implement multiple
regression in R.


1.1 Introducing the model


Recall from Unit 1 that the simple linear regression model with one
response variable Y and one explanatory variable x can be written as
Yi = α + βxi + Wi , i = 1, 2, . . . , n,
for the collection of data points (x1 , Y1 ), (x2 , Y2 ), . . . , (xn , Yn ). The Wi are
independent normal random variables with zero mean and constant
variance. The parameters α and β are the intercept and the regression
coefficient, respectively.
The next two activities consider two separate simple linear regression
models: each model has the same response variable, but uses a different
explanatory variable. Consideration of these two simple linear regression
models will pave the way to building a multiple linear regression model
which uses both of the explanatory variables in a single regression model.
In both activities, we will be using the FIFA 19 dataset described in
Section 3 of Unit 1. This dataset lists 100 footballers, where the response
variable strength gives the footballers’ strength scores (out of 100) as
assigned by FIFA: the higher the score, the stronger the footballer is
judged to be. Some other variables are also included in the dataset, such
as weight, for the footballers’ weights in pounds (lb), and height, for the
footballers’ heights in inches (in).
In Notebook activity 1.20 (Unit 1), we assumed that footballers' weights can explain the differences in their strength scores and we fitted the simple linear regression model
strength ∼ weight.
[Margin photo caption: Adebayo Akinfenwa is the world's strongest footballer in FIFA 19. He is also the heaviest!]
The model seemed to fit the data fairly well, with a plot of the resulting residuals and fitted values showing randomly scattered points where the residuals had no specific pattern with the fitted values.
Now, as was mentioned in Subsection 5.1.1 of Unit 1, in simple linear
regression, a plot of the residuals and the fitted values will have the same
pattern as a plot of the residuals and the explanatory variable. So, if the
model is a good fit, then a plot of the residuals against the values of the
explanatory variable weight should also show the same random pattern.
What’s more, if the simple linear regression model of strength with the
explanatory variable weight explained all of the variation in strength,
then we would expect the residuals to show no pattern if they were plotted
against any other explanatory variables: in particular, we would expect
the residuals for the model strength ∼ weight to show no pattern if they
were plotted against height as well. We will investigate whether or not
this is the case next in Activity 1.
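A plot of this kind could be produced along the following lines (a sketch only, not the module's notebook code; it assumes the fifa19 data frame is loaded):

# Fit strength ~ weight, then plot its residuals against height
fit_w <- lm(strength ~ weight, data = fifa19)
plot(fifa19$height, resid(fit_w),
     xlab = "Height (in)", ylab = "Residuals")
abline(h = 0, lty = 2)  # zero residual line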


Activity 1 Modelling footballers' strength scores with weight as an explanatory variable
From Notebook activity 1.20, the model
strength ∼ weight
has the fitted equation
strength = 17.129 + 0.322 weight.
(a) A plot of the residuals for this fitted model against weight is given in
Figure 1. Based on this plot, comment on the fit of the regression
model.

Figure 1 Residuals from strength ∼ weight, plotted against weight


(b) The same residuals for this fitted model (with weight as the
explanatory variable) are now plotted against height in Figure 2.
Based on this plot, has the fitted model explained all of the variation
in strength?

Figure 2 Residuals from strength ∼ weight, plotted against height

In Activity 1, you considered a plot of the residuals from the model


strength ∼ weight
against height. In the next activity, we will investigate using height as
the explanatory variable in a simple linear regression model, instead of
weight.


Activity 2 Modelling footballers' strength scores with height as an explanatory variable
In this activity, we are interested in fitting the model
strength ∼ height
using data from the FIFA 19 dataset.
(a) A scatterplot of strength and height is given in Figure 3. In view of
this scatterplot, comment on the appropriateness of a simple linear
regression model for these data.

Figure 3 Scatterplot of strength and height
(b) The fitted equation for the model is
strength = −23.146 + 1.318 height.
Interpret the value of the regression coefficient of height.
(c) The p-value associated with the regression coefficient of height is less
than 0.001. What does this tell you about the significance of height
as an explanatory variable?
(d) A plot of the residuals and height is given in Figure 4. Based on this
plot, comment on how well the model fits.
(e) A plot of the residuals and weight is given in Figure 5. Based on this
plot, after fitting the model with height as the explanatory variable,
does there seem to be any remaining variation in strength which
looks to be associated with weight?


Figure 4 Residuals from strength ∼ height, plotted against height

Figure 5 Residuals from strength ∼ height, plotted against weight


The conclusions from Activities 1 and 2 suggest that it might be better to


simultaneously include both explanatory variables in one model than either
explanatory variable individually. Also, since both weight and height
individually ‘explain’ strength, it could be that they will work well
together to explain strength. We consider this in Example 1.

Example 1 Modelling footballers' strength scores with two explanatory variables
Consider once again the FIFA 19 dataset.
In Activities 1 and 2, we saw that weight seems to explain strength
fairly well, and height also seems to explain strength fairly well. So,
let’s add both weight and height as explanatory variables in a model
for strength, so that we have the model
strength ∼ weight + height.
The resulting fitted model has the regression equation
strength = −10.953 + 0.252 weight + 0.558 height.

This fitted equation indicates now that a footballer’s strength score


depends on both their weight and their height. Since the coefficients
of weight and height are both positive, it means that an increase in
either of the explanatory variables is expected to be associated with
an increase in the response variable, strength. Interpreting these
coefficients will be discussed in detail in Subsection 1.2.
It is also possible here, as in the simple linear regression model, to test
the significance of each coefficient. We will discuss hypothesis testing
for multiple regression in more detail in Subsection 1.3. The p-value
associated with the coefficient of weight is less than 0.001, and the
p-value associated with the coefficient of height is 0.036. This
suggests there is evidence that the coefficients for weight and height
are both non-zero. So, in this multiple regression model, it seems that
both explanatory variables are significant in explaining the variation
in footballers’ strength.
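A sketch of how this model might be fitted in R (illustrative only; it assumes the fifa19 data frame is loaded):

# Fit the multiple regression model strength ~ weight + height
fit_wh <- lm(strength ~ weight + height, data = fifa19)
coef(fit_wh)     # estimates of the intercept and the two coefficients
summary(fit_wh)  # also gives p-values for the individual coefficients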

In Example 1, you have just seen an example of a multiple regression


model, with two explanatory variables. Later in this subsection, you will
learn about the multiple regression model in its general form with any
number of explanatory variables. But before doing so, it is helpful to think
a little bit more about the fitted model from Example 1; we will do this in
the following activity.


Activity 3 Comparing coefficients of the simple and multiple regression models
Compare the estimated values of the regression coefficients of weight and
height in the simple linear regression models in Activities 1 and 2,
respectively, to the values of their corresponding coefficients in the multiple
regression model in Example 1. What do you notice?

We will now introduce multiple regression in a more formal way. In a


multiple regression model there is still only one response variable, which is
still denoted Yi for observation i, as in the simple linear regression model.
However, the response variable Yi is now assumed to be dependent on q
explanatory variables, xi1 , xi2 , . . . , xiq , where xij denotes the ith
observation of the jth explanatory variable. This dependence is linear and
the model for Yi can be written as
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi ,
where Wi ∼ N (0, σ 2 ) and the Wi ’s are independent of one another.
The multiple regression model forms the basis for many widely used
statistical methods. Notice that it differs from the simple linear regression
model of Unit 1 in only one respect: the response variable depends on
more than one explanatory variable. The model is summarised in Box 1.

Box 1 The multiple linear regression model


If Y is the response variable and x1 , x2 , . . . , xq are q explanatory
variables, then the multiple linear regression model, or more
simply, the multiple regression model, for a collection of n data
points can be written as
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi , for i = 1, 2, . . . , n,
where the Wi ’s are independent normal random variables with zero
mean and constant variance σ 2 , and xi1 , xi2 , . . . , xiq are values of the q
explanatory variables.
If α̂, β̂1, β̂2, . . . , β̂q are estimates of α, β1, β2, . . . , βq, respectively, then the fitted multiple regression model is
ŷ = α̂ + β̂1 x1 + β̂2 x2 + · · · + β̂q xq.
This is also referred to as the fitted equation, or the estimated
equation, of the multiple regression model.

There are some similarities between the interpretation of the coefficients of


both the simple linear regression model and the multiple regression model.
For both models, a positive value of a coefficient indicates that the
response variable increases with an increase of the corresponding


explanatory variable, whereas a negative value of a coefficient indicates


that the response variable decreases with an increase of the explanatory
variable. So for both models, the coefficients represent the effects of the
corresponding explanatory variables on the response variable. This effect
is, however, interpreted differently in the two models. We will consider
how to interpret the coefficients in a multiple regression model next.

1.2 Interpreting the coefficients


As we noticed in Activity 3, the estimated coefficient of each explanatory
variable in a multiple regression model is not necessarily equal to its
corresponding coefficient of the same explanatory variable in a simple
linear regression model. This is because the coefficients have a rather
different interpretation in the two regression models.
For simplicity, for now consider the multiple linear regression model with
just two explanatory variables:
Yi = α + β1 xi1 + β2 xi2 + Wi , for i = 1, 2, . . . , n.
Here, the parameter α is still called the intercept parameter because it still
gives the value of the response variable when the two explanatory variables
are zero. However, the coefficients β1 and β2 are partial regression
coefficients. This is because β1 measures the effect of an increase of one
unit in x1 treating the value of the other variable x2 as fixed. Contrast this
with the simple linear regression model
Yi = α + βxi1 + Wi , for i = 1, 2, . . . , n,
in which β represents an increase of one unit in x1 assuming that x2 is not
in the model.
Similarly, β2 in the multiple linear regression model measures the effect of
an increase of one unit in x2 treating the value of the other variable x1 as
fixed. On the other hand, in the simple linear regression model
Yi = α + βxi2 + Wi , for i = 1, 2, . . . , n,
the coefficient β represents an increase of one unit in x2 assuming that x1
is not in the model.
The next example explains this idea in more detail.

Example 2 Interpreting coefficients for two explanatory variables
In Example 1, the model
strength ∼ weight + height
was fitted to data from the FIFA 19 dataset.


The resulting fitted equation was given as


strength = −10.953 + 0.252 weight + 0.558 height.
The value of the coefficient of weight (0.252) represents the partial
effect of weight on strength, given height. This means that a
footballer’s strength score is expected to increase by 0.252 if their
weight increases by one lb, and their height remains fixed. Notice that
the different estimated value (0.322) of the coefficient of weight in the
simple linear regression model in Activity 1 represents the individual
effect of weight on strength.
You can also use the same reasoning with the two different values of
the coefficient of height. In this case, the value of the coefficient of
height in the multiple regression model (0.558) represents the partial
effect of height on strength, given weight. This means that a
footballer’s strength score is expected to increase by 0.558 if their
height increases by one inch, and their weight remains fixed. Again,
this is different from the estimated value (1.318) of the coefficient of
height in the simple linear regression model in Activity 2, which
represents the individual effect of height on strength.
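The contrast between the individual and partial effects can be seen directly by comparing the two fitted models. A minimal sketch (assuming the fifa19 data frame is loaded):

# Individual effect of weight (simple linear regression): about 0.322
coef(lm(strength ~ weight, data = fifa19))["weight"]

# Partial effect of weight, given height (multiple regression): about 0.252
coef(lm(strength ~ weight + height, data = fifa19))["weight"]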

The partial regression coefficients can be interpreted in the same way in


multiple regression with more than two explanatory variables.
Although the name ‘partial regression coefficients’ is a useful reminder of
the partial nature of the coefficients’ interpretation, they are usually
simply referred to as regression coefficients. Box 2 provides a summary
of interpreting regression coefficients in multiple regression.

Box 2 Interpreting regression coefficients in multiple regression
In a multiple regression model for response Y and explanatory
variables x1 , x2 , . . . , xq , each Yi , i = 1, 2, . . . , n, is modelled as
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi , Wi ∼ N (0, σ 2 ).
The coefficients α, β1 , β2 , . . . , βq can then be interpreted as follows:
• α is called the intercept parameter as it gives the value of the
response variable, Y , when the explanatory variables x1 , x2 , . . . , xq
are all zeros.
• β1 is the partial regression coefficient that measures the effect on Y
of an increase of one unit in x1 , treating the values of the other
variables x2 , x3 , . . . , xq as fixed.


• β2 is the partial regression coefficient that measures the effect on Y of an increase of one unit in x2, treating the values of the other variables x1, x3, x4, . . . , xq as fixed.
⋮
• βq is the partial regression coefficient that measures the effect on Y of an increase of one unit in xq, treating the values of the other variables x1, x2, . . . , xq−1 as fixed.

In the next activity, we will interpret the regression coefficients of a


multiple regression model with three explanatory variables.

Activity 4 Interpreting coefficients in a model with three explanatory variables

The ability to mark other players is one of the crucial skills in football. It is sometimes thought that the footballers' scores of marking ability could explain the differences in their strength scores. In this activity, you will investigate a multiple regression model for footballers' strength scores as a response variable with their scores of marking ability as an extra potential explanatory variable in addition to their weight and height. So, we will consider the model
strength ∼ weight + height + marking.
The resulting fitted equation is given as
strength = −28.305 + 0.273 weight + 0.681 height + 0.085 marking.
(a) Interpret the values of the regression coefficients.
(b) Compare the estimated value and interpretation of the regression
coefficient of weight obtained from the model here (when using three
explanatory variables) to its corresponding estimated value and
interpretation of the regression coefficient obtained from the model
strength ∼ weight + height
considered in Example 1.

So, now that we know how to interpret the regression coefficients in a


multiple regression model, next we will consider how to test the regression
coefficients.


1.3 Testing the regression coefficients


For the multiple regression model
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi , i = 1, 2, . . . , n,
we are usually interested in testing whether or not the regression
coefficients β1 , β2 , . . . , βq , are zero. To do this, there are two testing
procedures that we can use.
• We can test whether all of the regression coefficients are zero by testing
the hypotheses
H0 : β1 = β2 = · · · = βq = 0,
H1 : at least one of the q coefficients differs from zero.
Testing these hypotheses requires a test commonly known as the F -test.
• We can test whether each individual regression coefficient is zero by
testing q sets of hypotheses:
H0: β1 = 0, H1: β1 ≠ 0 (assuming β2 = β̂2, β3 = β̂3, . . . , βq = β̂q),
H0: β2 = 0, H1: β2 ≠ 0 (assuming β1 = β̂1, β3 = β̂3, β4 = β̂4, . . . , βq = β̂q),
⋮
H0: βq = 0, H1: βq ≠ 0 (assuming β1 = β̂1, β2 = β̂2, . . . , βq−1 = β̂q−1).
Notice that, because of the partial nature of the regression coefficients,
we cannot consider individual regression parameters in isolation (hence
the assumptions above in brackets).
Although these q sets of hypotheses look rather complicated (because of the assumptions in brackets), they are in fact straightforward to test by simply adapting the t-test methods that we've already met for simple linear regression (in Subsection 4.3 of Unit 1).

1.3.1 Testing all regression coefficients simultaneously
We will start by introducing the F -test to test the hypotheses
H0 : β1 = β2 = · · · = βq = 0,
H1 : at least one of the q coefficients differs from zero.
In order to test these hypotheses, we need to use another probability
distribution, known as the F -distribution. If you are not familiar with the
F -distribution, you can see some of its properties summarised in Box 3.
However, in this module, we will only be using the associated p-value (as
calculated by R) for the test, to assess the evidence against the null
hypothesis.


Box 3 The F-distribution

The F-distribution, sometimes written F(ν1, ν2), has the following properties:
• It is a probability distribution for a continuous random variable that takes only positive values.
• The distribution is right-skew, not symmetric like the normal and t-distributions.
• It has two positive degrees of freedom, ν1 and ν2.
• ν1 and ν2 are typically integers.
• The order of ν1 and ν2 is important; that is, for ν1 ≠ ν2, F(ν1, ν2) is not the same distribution as F(ν2, ν1).
• The shape of the F-distribution density depends on ν1 and ν2. Figure 6 illustrates the F-distribution density for some different combinations of ν1 and ν2.

[Figure: density curves for the F(2, 2), F(5, 10), F(10, 2), F(10, 10) and F(20, 20) distributions.]
Figure 6 F-distribution densities for some different combinations of ν1 and ν2
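If you are curious how curves like those in Figure 6 can be drawn, here is a minimal R sketch using df(), the F-distribution density function in R; the degrees of freedom chosen simply mirror two of the curves in the figure.

curve(df(x, df1 = 2, df2 = 2), from = 0, to = 3,
      ylim = c(0, 1), xlab = "x", ylab = "density")   # F(2, 2) density
curve(df(x, df1 = 10, df2 = 10), add = TRUE, lty = 2)  # F(10, 10) density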

The test statistic for the F-test is often called the F-statistic, and the p-value associated with the F-statistic is obtained by calculating probabilities from F(ν1, ν2), where
ν1 = the number of explanatory variables in the model = q,
ν2 = n − (the number of parameters in the model) = n − (q + 1).


Notice that the number of parameters in the model is q + 1, since there are
q regression coefficients (β1 , β2 , . . . , βq ), and one intercept parameter α.
We will discuss using the F-test in a multiple regression model with two explanatory variables next in Example 3.

Example 3 Testing regression coefficients with two explanatory variables

The fitted equation of footballers’ strength scores with the two
explanatory variables weight and height was given in Example 1 as
strength = −10.953 + 0.252 weight + 0.558 height.
We wish to test the hypotheses
H0 : β1 = β2 = 0,
H1 : at least one of the two coefficients differs from zero,
where β1 and β2 are the regression coefficients of weight and height,
respectively.
The value of the F-statistic for this test is 31.49, and the associated p-value is calculated from an F(ν1, ν2) distribution where
ν1 = q = 2,
ν2 = n − (q + 1) = 100 − 3 = 97.
The resulting p-value is calculated to be less than 0.001, indicating
that there is strong evidence to reject H0 and conclude that at least
one of the two regression coefficients is different from zero.
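As a quick check, a p-value like the one in Example 3 can be reproduced in R from the F-statistic using pf(), the distribution function of the F-distribution:

# Upper-tail probability of F(2, 97) beyond the observed F-statistic
pf(31.49, df1 = 2, df2 = 97, lower.tail = FALSE)
# far smaller than 0.001, so there is strong evidence against H0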

1.3.2 Testing regression coefficients individually


The F-test is essentially a test of the overall usefulness of the regression model, because it tests whether the model contributes information for explaining the variation in the response variable, Y. If there is evidence from the data against the null hypothesis, then at least one of the explanatory variables significantly explains the variation in the response variable. However, the F-test doesn't tell us which of the explanatory variables does so. Do all of them explain the variation, or just a subset of them?
To investigate this, we now consider our second test procedure.


We will test the following q sets of hypotheses:
H0 : β1 = 0, H1 : β1 ≠ 0
(assuming β2 = β̂2, β3 = β̂3, . . . , βq = β̂q),
H0 : β2 = 0, H1 : β2 ≠ 0
(assuming β1 = β̂1, β3 = β̂3, β4 = β̂4, . . . , βq = β̂q),
...
H0 : βq = 0, H1 : βq ≠ 0
(assuming β1 = β̂1, β2 = β̂2, . . . , βq−1 = β̂q−1).
To do this, we’ll adapt methods from simple linear regression.
Recall (from Box 8 in Subsection 4.3 of Unit 1) that in simple linear
regression we test the hypotheses
H0 : β = 0, H1 : β ≠ 0,
using the t-value, where
t-value = β̂ / (standard error of β̂).
The t-value then follows a t(n − 2) distribution, where n is the number of observations. (There are n − 2 degrees of freedom because there are two parameters, α and β, in the simple linear regression model.)
We can use the same idea in multiple regression to test each individual
partial regression coefficient. Therefore, to test the hypotheses
H0 : β1 = 0, H1 : β1 ≠ 0
(assuming β2 = β̂2, β3 = β̂3, . . . , βq = β̂q),
we use the test statistic
t-value = β̂1 / (standard error of β̂1).
As with simple linear regression, the t-value follows a t-distribution. There are n − (q + 1) degrees of freedom for this t-distribution (rather than n − 2, which we had for simple linear regression). This is because there are q + 1 parameters in the multiple regression model – namely, α and β1, β2, . . . , βq.
Similarly, to test the hypotheses
H0 : β2 = 0, H1 : β2 ≠ 0
(assuming β1 = β̂1, β3 = β̂3, β4 = β̂4, . . . , βq = β̂q),
we use the test statistic
t-value = β̂2 / (standard error of β̂2),
and the t-value also follows a t(n − (q + 1)) distribution.


The rest of the regression coefficients (β3, β4, . . . , βq) can be tested similarly, so that the test statistic for testing the significance of the jth regression coefficient βj is
t-value = β̂j / (standard error of β̂j),
and the t-value follows a t(n − (q + 1)) distribution.
In the next activity, we will look at these individual tests for the fitted
multiple regression model of footballers’ strength scores considered in
Example 3.

Activity 5 Testing individual partial regression coefficients

The fitted equation of footballers’ strength scores with the two explanatory
variables weight and height was given in Example 1 as
strength = −10.953 + 0.252 weight + 0.558 height.
In Example 3, we tested the hypotheses
H0 : β1 = β2 = 0,
H1 : at least one of the two coefficients differs from zero,
where β1 and β2 are the regression coefficients of weight and height,
respectively. The resulting p-value was very small, and so we concluded
that at least one of the regression coefficients is different from zero.
We now wish to investigate which of the regression coefficients differs from
zero. Is it both β1 and β2 , or just one of them?
To investigate, we will test the two sets of hypotheses
H0 : β1 = 0, H1 : β1 ≠ 0 (assuming β2 = β̂2),
H0 : β2 = 0, H1 : β2 ≠ 0 (assuming β1 = β̂1).

(a) Given that the standard error of β̂1 is 0.0535 (to three significant figures), calculate the test statistic for testing the hypotheses
H0 : β1 = 0, H1 : β1 ≠ 0 (assuming β2 = β̂2).

(b) Given that the standard error of β̂2 is 0.262 (to three significant figures), calculate the test statistic for testing the hypotheses
H0 : β2 = 0, H1 : β2 ≠ 0 (assuming β1 = β̂1).

(c) The p-value associated with the regression coefficient β1 of weight is


less than 0.001, and the p-value associated with the regression
coefficient β2 of height is 0.036. Which distribution was used when
calculating these p-values? What do you conclude about the
regression coefficients β1 and β2 ?
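Once you have attempted Activity 5, you may like to check your calculations in R; the sketch below uses the estimates and standard errors quoted in the activity, with n = 100 and q = 2, so the degrees of freedom are 97.

# Estimated coefficients and their standard errors from Activity 5
estimates  <- c(weight = 0.252, height = 0.558)
std_errors <- c(weight = 0.0535, height = 0.262)

# t-values and two-sided p-values from the t(97) distribution
t_values <- estimates / std_errors
p_values <- 2 * pt(abs(t_values), df = 97, lower.tail = FALSE)
round(t_values, 2)
round(p_values, 3)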


Notice that in the multiple regression context, the conclusion of the pair of
tests you performed in Activity 5 can be expressed as follows. There is
evidence to suggest that each of the two explanatory variables, weight and
height, influences the footballers’ strength scores, given the presence of
the other explanatory variable in the model. This should not be confused
with the two individual tests for the regression coefficients in the simple
linear regression models discussed in Activities 1 and 2. A test for a
regression coefficient in simple linear regression tests the individual
influence of each explanatory variable on the response variable. But the
test considered in Activity 5 tests the significance of each explanatory
variable when both of them are included in the model.
The regression coefficient of weight turns out to be highly significant in
both the simple linear regression model and the multiple regression model.
But the significance of the regression coefficient of height is different
between the two models. In the simple linear regression model, there is
strong evidence that the coefficient of height is not zero (the p-value was
less than 0.001). However, the corresponding evidence for height in the multiple regression model, when weight is also in the model, is not so strong (the p-value was 0.036).

1.3.3 Using both types of testing


The procedures for testing regression coefficients are given formally in
Box 4. These procedures are summarised visually in Figure 7.

Box 4 Testing in multiple regression models


In a multiple regression model for response Y and explanatory
variables x1 , x2 , . . . , xq , each Yi , i = 1, 2, . . . , n, is modelled as
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi , Wi ∼ N (0, σ 2 ).
There are two procedures for testing the regression coefficients in this
model.
Test whether all regression coefficients are zero
• Test the hypotheses
H0 : β1 = β2 = · · · = βq = 0,
H1 : at least one of the q coefficients differs from zero.

• The p-value associated with this test is based on the


F (q, n − (q + 1)) distribution.
• If the p-value is small, then we conclude that the regression model
contributes information for explaining the variation in the response,
Y.
• This is called the F -test.


Test whether each individual regression coefficient is zero


• Test q sets of hypotheses:
H0 : β1 = 0, H1 : β1 ≠ 0
(assuming β2 = β̂2, β3 = β̂3, . . . , βq = β̂q),
H0 : β2 = 0, H1 : β2 ≠ 0
(assuming β1 = β̂1, β3 = β̂3, β4 = β̂4, . . . , βq = β̂q),
...
H0 : βq = 0, H1 : βq ≠ 0
(assuming β1 = β̂1, β2 = β̂2, . . . , βq−1 = β̂q−1).

• The p-value associated with each of the q tests is based on the


t(n − (q + 1)) distribution.
• For each small p-value, we conclude that the corresponding
explanatory variable helps to explain the variation in the response,
Y , when the other explanatory variables are in the model.
• Each individual test is a t-test.

[Figure: flowchart. For the q regression coefficients β1, β2, . . . , βq, either test H0 : β1 = β2 = · · · = βq = 0 against H1 : at least one regression coefficient is non-zero, using the F-test with p-value based on the F(q, n − (q + 1)) distribution; or test the q sets of hypotheses H0 : βj = 0, H1 : βj ≠ 0 (assuming the other coefficients are fixed at their estimates), using q individual t-tests with p-values each based on the t(n − (q + 1)) distribution.]
Figure 7 Visual summary of the testing procedures for multiple regression models

It is worth mentioning that, although the F-test may not seem as useful as the individual t-tests at this stage, the F-test will become particularly useful as we move through the module.


In the next activity, you will test the significance of the regression
coefficients of a multiple regression model that contains three explanatory
variables.

Activity 6 Testing when there are three explanatory variables

Activity 4 considered the model
strength ∼ weight + height + marking,
and the resulting fitted equation was given as
strength = −28.305 + 0.273 weight + 0.681 height + 0.085 marking.
[Photo caption: It might take six players to mark Lionel Messi, the player with the highest skill score in the FIFA 19 dataset.]
(a) The p-value for testing that all coefficients are zero is less than 0.001. Which distribution is this p-value obtained from?
(b) Is there evidence that the regression model contributes information to
explain the variability in the footballers’ strength scores?
(c) The p-value associated with the regression coefficient of height is
0.006 and the p-values associated with weight and marking are both
less than 0.001. Which distribution are these p-values calculated from?
(d) Comment on the individual significance of the regression coefficients
of weight, height and marking, and give a clear conclusion about
each regression coefficient.
(e) Compare the significance of the regression coefficients of weight and
height here with the significance of the regression coefficients of the
same variables in the model with two explanatory variables in
Activity 5.

We have seen so far that adding more explanatory variables to a regression


model did not change the significance of the individual explanatory
variables already in the model. In Activity 7, to follow, we will model some
data concerning a UK referendum, which will demonstrate that this is not
always the case. The data are described next.

Voter behaviour in the Brexit referendum


A referendum was held on 23 June 2016 in which the UK electorate
was asked to vote on whether the UK should remain in the European
Union (EU) or leave the EU: 51.9% of the participating UK electorate
voted for the UK to leave (BBC, 2016). The process of leaving the EU
has been called ‘Brexit’.
Many observers and researchers have been interested in studying and
analysing the referendum results to understand who voted to leave the
EU. A study collected data on the vote shares at the local authority


level for 380 areas and tried to relate their voting behaviour to their fundamental socio-economic features.
[Map of the UK showing the results from the referendum (BBC, 2016); key: majority leave / majority remain.]

The Brexit dataset (brexit)
The data considered in this dataset are a subset of the data collected in the study, comprising data from 237 of the areas. The dataset contains data on the following variables for each area:
• leave: percentage who voted 'Leave' in the referendum
• age: proportion who are in the age group 30 to 44 years old (based on the 2001 census)
• income: mean hourly pay in £ (based on the 2005 Annual Survey of Hours and Earnings)
• migrant: proportion who are EU migrants (based on the 2001 census)
• satisfaction: mean life satisfaction score (from the Annual Population Survey 2015)
• noQual: proportion with no educational qualification (based on the 2001 census).
can take leave to be the response variable, and use the other
variables as potential explanatory variables.
The data for the first five areas from the Brexit dataset are shown in
Table 1.
Table 1 First five areas from brexit

leave age income migrant satisfaction noQual


65.48 0.2164 10.02 0.0067 7.41 0.4368
66.19 0.2133 11.45 0.0054 7.57 0.4451
61.73 0.2317 12.15 0.0057 7.69 0.3924
56.18 0.2238 11.03 0.0096 7.64 0.3984
57.42 0.2215 10.50 0.0048 7.33 0.4196

Source: Becker, Fetzer and Novy, 2017, including ‘Supplementary data’,


accessed 14 March 2022

In Activity 7, we will use the Brexit dataset to fit two simple linear
regression models and one multiple regression model. You will see that the
significance of one explanatory variable can markedly change, depending
on whether or not another variable is added to the model.


Activity 7 Testing the coefficients for models of the Brexit dataset

In this activity, regression models using data from the Brexit dataset are
fitted. We will be modelling the response leave (the percentage of ‘Leave’
voters in the UK 2016 (Brexit) referendum), with age and income as two
possible explanatory variables. The emphasis of this activity is on the
change of the regression coefficient testing results for different regression
models.
(a) Two simple linear regression models,
leave ∼ age and leave ∼ income,
were fitted to the data. The resulting two fitted equations are
leave = 74.732 − 81.426 age
and
leave = 83.772 − 2.085 income.
(i) Interpret the regression coefficients for the two models.
(ii) The p-value associated with the regression coefficient of age in
the first model is 0.004 and the p-value associated with income in
the second model is less than 0.001. Is age required in the first
model? Is income required in the second?
(b) The multiple regression model
leave ∼ age + income
was also fitted. The estimated regression coefficients, their standard
errors, t-values and associated p-values are given in Table 2.
Table 2 Coefficients for leave ∼ age + income

Parameter    Estimate   Standard error   t-value    p-value
Intercept     78.005      4.943           15.780    < 0.001
age           31.768     24.295            1.308      0.192
income        −2.180      0.185          −11.776    < 0.001

(i) Write down the fitted equation of the estimated model and
interpret the estimated regression coefficients.
(ii) Do the data suggest that both age and income together influence
the percentage of ‘Leave’ voters?
(c) Do you think there is any contradiction among the results you
obtained in parts (a)(ii) and (b)(ii)? Discuss why or why not.

To round off this section, we will now use R for multiple regression.


1.4 Using R to fit multiple regression models

In this subsection, we will use the Brexit dataset (which was introduced in
Subsection 1.3) to show you how to implement multiple regression in R.
Starting with Notebook activity 2.1, you will take leave as the response
variable, and use the explanatory variables satisfaction and noQual; in
Notebook activity 2.2, you will use the explanatory variables age, migrant
and noQual.

Notebook activity 2.1 Multiple regression in R


This notebook explains how to use R for multiple regression, with a
focus on fitting a multiple regression model using data from brexit.

Notebook activity 2.2 More multiple regression in R


This notebook once again considers data from brexit, but this time
fits a multiple regression model with different explanatory variables.
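For reference, a minimal sketch of the kind of R call used in these notebook activities follows; it assumes the Brexit data have already been read into a data frame called brexit (how the data file is loaded depends on your setup).

# Fit the multiple regression model leave ~ satisfaction + noQual
fit <- lm(leave ~ satisfaction + noQual, data = brexit)

# summary() reports each coefficient with its standard error, t-value
# and p-value (the individual t-tests), together with the F-statistic
# and its p-value (the F-test)
summary(fit)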

2 Prediction in multiple regression


As was mentioned in Unit 1, the main aim of fitting a regression model to
data is often to produce an equation that can be used to predict new
responses for given value(s) of the explanatory variables.
In simple linear regression, we obtained a single value prediction – a point
prediction – of a new response Y0 , by simply substituting the value of the
explanatory variable for the new response into the fitted equation. It is
straightforward to extend this method to obtain a point prediction in
multiple regression as well; this is discussed in Subsection 2.1.
We also saw in Unit 1 that in simple linear regression we can calculate
prediction intervals for a new response given the value of its explanatory
variable. The resulting prediction interval then provides a range of
plausible values for the new response Y0 . Prediction intervals can also be
calculated in multiple regression; these are the topic of Subsection 2.2.
To round off this section, in Subsection 2.3 we’ll use R for prediction in
multiple regression.


2.1 Point prediction of the response variable


We saw in Unit 1 that it is very straightforward in simple linear regression
to obtain a point prediction of a (new) response given the value of its
explanatory variable: we simply need to substitute the value of the
explanatory variable for the new response into the fitted equation. It is
equally simple to obtain a point prediction for a new response in multiple
regression! The only difference is that we now substitute the given values
of multiple explanatory variables for the new response into the fitted
multiple regression equation. This is summarised in Box 5.

Box 5 Point prediction in multiple regression


Consider a multiple regression model for response Y with the q
explanatory variables x1 , x2 , . . . , xq . Let x01 , x02 , . . . , x0q be the values
of the explanatory variables for a new response Y0 whose value is not
known. Then a point prediction of Y0 is
ŷ0 = α̂ + β̂1 x01 + β̂2 x02 + · · · + β̂q x0q.

The next example illustrates point prediction in a multiple regression


model with two explanatory variables.

Example 4 Predicting a footballer’s strength score


In this example, we will once again consider the model
strength ∼ weight + height
using data from the FIFA 19 dataset. (This model was discussed in
Examples 1 to 3.)
From Example 1, the fitted regression equation is
strength = −10.953 + 0.252 weight + 0.558 height.

Suppose that a newly registered footballer has a weight of 162 lb and


a height of 68 in. Denoting the point prediction of the strength score
for this new footballer by ŷ0, this footballer is predicted to have a strength score of
ŷ0 = −10.953 + (0.252 × 162) + (0.558 × 68) = 67.815 ≃ 68.
Here, we have rounded the predicted strength score to the nearest
integer to match the level of accuracy of strength in the FIFA 19
dataset.
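The same point prediction can be obtained in R with predict(); this is just a sketch, assuming the FIFA 19 data are in a data frame called fifa19 with columns strength, weight and height.

# Fit the model and predict for a new footballer (162 lb, 68 in)
fit <- lm(strength ~ weight + height, data = fifa19)
new_player <- data.frame(weight = 162, height = 68)
predict(fit, newdata = new_player)  # approximately 67.8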


The following dataset gives another context in which a multiple regression


model is useful for predicting the response variable.

Attributes of roller coasters


The Roller Coaster DataBase website contains data on roller coasters
around the world. Newly opened roller coasters are added to the
database on a regular basis.
The roller coasters dataset (rcoaster)
This dataset contains data on a subset from the database. The dataset contains 236 observations on three variables:
• speed: maximum speed of the roller coaster, measured in miles per hour (mph) and given to one decimal place
• height: height of the roller coaster, measured in feet (ft) and given to two decimal places
• length: length of the roller coaster, also measured in feet and given to two decimal places.
(Note that 1 mile ≃ 1.61 km and 1 ft ≃ 0.3048 m.)
[Photo caption: Formula Rossa in Ferrari World, Abu Dhabi, is the world's speediest roller coaster, according to the Roller Coaster DataBase website.]
The first five observations from the roller coasters dataset are given
in Table 3.
Table 3 First five observations from rcoaster

speed height length

95.0 325.00 6602.00


95.0 318.25 8133.10
93.0 310.00 6595.00
90.0 305.00 5100.00
85.0 245.00 5312.00

Source: Stewart, 2021, accessed 18 March 2022

You will be using the roller coasters dataset in the next activity to predict
a roller coaster’s speed given its height and length.


Activity 8 Modelling the speed of roller coasters

The model
speed ∼ height + length
was fitted to data from the roller coasters dataset. The results from fitting
the model are summarised in Table 4.
Table 4 Coefficients for speed ∼ height + length

Parameter    Estimate   Standard error   t-value   p-value
Intercept    23.58       0.975           24.193    < 0.001
height        0.218      0.009           23.242    < 0.001
length        0.00182    0.00035          5.159    < 0.001

(a) Write down the estimated regression equation of the fitted model.
(b) Interpret the values of the regression coefficients in the fitted equation
from part (a).
(c) According to the fitted regression equation obtained in part (a), what
is the predicted value of speed for a new roller coaster that is 200 ft
high and 4000 ft long?

In the next activity, you will use a multiple regression model with three
explanatory variables to predict the response variable.

Activity 9 Predicting a footballer's strength score in a model with three explanatory variables

Activity 4 considered the model
strength ∼ weight + height + marking,
and the resulting fitted equation was given as:
strength = −28.305 + 0.273 weight + 0.681 height + 0.085 marking.

Use this fitted equation to predict the strength score of a newly registered
footballer who weighs 170 lb, has a height of 72 in and has obtained a
marking score of 65.
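Once you have attempted the activity, the arithmetic is quick to check in R (or on a calculator):

# Point prediction: intercept plus each coefficient times its new value
-28.305 + 0.273 * 170 + 0.681 * 72 + 0.085 * 65  # 72.662, i.e. about 73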

Next we will consider prediction intervals.


2.2 Prediction intervals in multiple regression

As you have just seen in Subsection 2.1, in multiple regression it is simple
to obtain a point prediction of the response variable Y0 for a new
observation when we know the values of the explanatory variables for this
observation. However, it is generally not terribly helpful to have a point
prediction without any indication of the error or variation associated with
it.
In particular, as a reminder from Subsection 6.2 in Unit 1, a point
prediction does not take account of two issues:
• The true regression equation is not known and we only have an estimate
of it through the fitted equation
y = α̂ + β̂1 x1 + β̂2 x2 + · · · + β̂q xq.

• Even if we did know the true regression equation, we still wouldn’t know
the value of Y0 with certainty, since we don’t know the value of the
random term (W0 ) for Y0 .
One way that these uncertainties can be taken into account is to calculate
a prediction interval for the response variable.
Prediction intervals for multiple regression are used and interpreted in
exactly the same way as they were in simple linear regression, as
summarised in Box 6.

Box 6 Prediction intervals in multiple regression


In multiple regression:
• The prediction interval of the response variable Y0 for a new
observation is an interval estimate of Y0 , giving a range of plausible
values for the value of Y0 that will actually be observed, rather than
just a single point prediction.
• Prediction intervals are calculated according to different confidence
levels – commonly 90%, 95% and 99% – indicating how ‘confident’
we are that Y0 will lie in the interval.
• The higher the confidence level, the more confident we are that Y0
will lie in the interval, but also the wider – and therefore less
informative – the prediction interval is.

As with simple linear regression in Unit 1, prediction intervals are easily


produced in R and so we won’t go into the details of how prediction
intervals in multiple regression are calculated.
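For illustration, the following sketch shows the relevant arguments to predict(), continuing the hypothetical fifa19 model fit from the sketch after Example 4; interval = "prediction" requests a prediction interval and level sets the confidence level.

# 95% prediction interval for a new footballer (162 lb, 68 in)
new_player <- data.frame(weight = 162, height = 68)
predict(fit, newdata = new_player,
        interval = "prediction", level = 0.95)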
In Example 5 and Activities 10 and 11 that follow, we’ll take a look at
some prediction intervals associated with the point predictions that we
calculated in Subsection 2.1.


Example 5 A prediction interval for a newly registered footballer's strength score

In this example, we will once again consider the model
strength ∼ weight + height
using data from the FIFA 19 dataset.
In Example 4, we found that the point prediction of the strength score
for a newly registered footballer with a weight of 162 lb and a height
of 68 in is 67.815, which we rounded to 68 (to match the rounding of
strength in the original data). So, if we needed to give a single value
prediction of the footballer’s strength score, we would predict their
score to be 68.
A prediction interval can give us some idea of the uncertainty
associated with this point prediction. The 95% prediction interval for
this footballer’s strength score turns out to be (58, 77). Therefore, it
is predicted that the strength score of the newly registered footballer
is somewhere between 58 and 77.

Activity 10 Prediction intervals for the roller coaster's speed

In Activity 8, the model
speed ∼ height + length
was fitted to data from the roller coasters dataset.
In that activity, we found that the predicted value of speed for a new
roller coaster that is 200 ft high and 4000 ft long is 74.5 mph.
Three prediction intervals were obtained for the speed of the new roller
coaster:
(63.5, 85.4), (61.4, 87.5) and (57.3, 91.6).
One of these is the 99% prediction interval, one is the 95% prediction
interval and one is the 90% prediction interval. Which is which?

Activity 11 Prediction interval for another newly registered footballer

Consider once again the model
strength ∼ weight + height + marking
for modelling data from the FIFA 19 dataset.


In Activity 9, we found that, for a newly registered footballer who weighs


170 lb, has a height of 72 in and has obtained a marking score of 65, the
point prediction of their strength score is 73.
The 95% prediction interval for the newly registered footballer’s strength
score is (64, 81). What does this tell us about the strength score of the
newly registered footballer?

We are now ready to use R for prediction.

2.3 Using R to obtain predictions using multiple regression

In this subsection, you will learn how to use R to obtain both point
predictions and prediction intervals in multiple regression. In Notebook
activities 2.1 and 2.2, we fitted multiple regression models for data from
the Brexit dataset by using two explanatory variables and then three
explanatory variables. In Notebook activity 2.3 we will use R for
prediction, again considering data from the Brexit dataset (using a
different set of three explanatory variables).

Notebook activity 2.3 Using R for prediction in multiple regression

This notebook explains how to use R to obtain predictions in multiple
regression, using data from brexit.

3 Diagnostics
Once we have fitted a multiple regression model, we need to assess how
good the fitted model is. The techniques used for doing this are often
referred to as diagnostics, or diagnostic checks, because they help us
diagnose what might be wrong with a model or with particular data points.
The multiple regression model makes several assumptions that need to
hold so that the conclusions we obtain from the analysis of the model are
accurate and valid. It is therefore always important to check that the
assumptions seem reasonable for a particular dataset. Checking the
assumptions for the multiple regression model is the focus of
Subsection 3.1.
In addition to checking the model assumptions, there are other diagnostic
tools that can be used to check the adequacy of the model. We will
introduce two of these in this section: leverage in Subsection 3.2 and
Cook’s distance in Subsection 3.3.
Finally, in Subsection 3.4, we’ll use all of these diagnostic tools, by using R
to produce diagnostic plots that incorporate them.


3.1 Checking the model assumptions


The assumptions of the multiple regression model are summarised in
Box 7. (The assumptions should look familiar, since they are basically the
same as those for simple linear regression.)

Box 7 Multiple regression model assumptions


For response Y and explanatory variables x1 , x2 , . . . , xq , there are four
main assumptions of the multiple regression model
Yi = α + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi , i = 1, 2, . . . , n.
These are:
• Linearity: the relationship between each of the explanatory
variables x1 , x2 , . . . , xq and Y is linear.
• Independence: the random terms Wi , i = 1, 2, . . . , n, are
independent.
• Constant variance: the random terms Wi , i = 1, 2, . . . , n, all have
the same variance σ 2 across the values of x1 , x2 , . . . , xq .
• Normality: the random terms Wi , i = 1, 2, . . . , n, are normally
distributed with zero mean and constant variance, N (0, σ 2 ).


As for simple linear regression, we can check the linearity, zero mean and
constant variance assumptions for the multiple regression model using a
residual plot of the residuals and the fitted values. The residual plots used
in multiple regression are the same as those used for simple linear
regression, as described in Subsection 5.1 of Unit 1, and are also
interpreted in the same way, as summarised in Box 8.

Box 8 Interpreting residual plots in multiple regression


Residual plots are interpreted in multiple regression as follows.
• If the assumptions are reasonable, then the points in the plot
should be randomly scattered about the zero residual line.
• Any systematic pattern in the plot indicates that the assumptions
may not hold:
◦ If the vertical spread of the points varies across the fitted values,
then the constant variance assumption may not be reasonable.
◦ If the points are not scattered randomly about the zero residual
line, then the zero mean and linearity assumptions may not be
reasonable.

The normality of the Wi ’s is also checked in the same way as it is in simple


linear regression: namely, using a normal probability plot of the residuals.
Normal probability plots for multiple regression are exactly the same as
those used in simple linear regression as described in Subsection 5.2 of
Unit 1 and are interpreted in the same way, as summarised in Box 9.

Box 9 Interpreting normal probability plots in multiple regression

Normal probability plots are interpreted in multiple regression as
follows: if the normality assumption is reasonable, then the points in
a normal probability plot of the residuals should lie roughly on a
straight line.

Checking the independence assumption in multiple regression can be


tricky, just as it was for simple linear regression. If there is some sort of
ordering for the observations, then any pattern in a plot of the residuals in
this order may flag up that there is a problem with the independence
assumption. Otherwise, as for simple linear regression, if there is no
obvious ‘ordering’ of the observations, then the independence assumption
is generally assumed to hold unless there is reason to believe that the
assumption is unrealistic.
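In R, both plots can be produced directly from a fitted lm() object; a minimal sketch, assuming a fitted model called fit as in the earlier sketches, is:

plot(fit, which = 1)  # residuals versus fitted values
plot(fit, which = 2)  # normal probability (Q-Q) plot of the residuals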
We’ll use residual plots and normal probability plots to check the
assumptions for two multiple regression models in the next two activities.


Activity 12 Checking the assumptions for a multiple regression model of footballers' strength scores

Activity 4 considered the model
strength ∼ weight + height + marking,
using data from the FIFA 19 dataset.
Figure 8 gives the residual plot and the normal probability plot for the
fitted model.

[Figure: two diagnostic plots. Panel (a): residuals plotted against fitted values. Panel (b): normal probability plot of the standardised residuals against theoretical quantiles.]
Figure 8 The residual plot (a) and normal probability plot (b) for strength ∼ weight + height + marking

(a) Does the residual plot in Figure 8(a) suggest that the assumptions of
linearity, zero mean and constant variance are satisfied?
(b) Does the normal probability plot in Figure 8(b) suggest that the
normality assumption is satisfied?
(c) Based on your answers to parts (a) and (b), comment on the
appropriateness of the multiple regression model.

Activity 13 Checking the assumptions for a multiple regression model for roller coasters' speed

Activity 8 considered the model
speed ∼ height + length,
using data from the roller coasters dataset.
Figure 9 gives the residual plot and the normal probability plot for the
fitted model.


[Figure: two diagnostic plots. Panel (a): residuals plotted against fitted values. Panel (b): normal probability plot of the standardised residuals against theoretical quantiles.]
Figure 9 The residual plot (a) and normal probability plot (b) for speed ∼ height + length

(a) Use the residual plot in Figure 9(a) to discuss whether the
assumptions of linearity, zero mean and constant variance seem to be
satisfied.
(b) Use the normal probability plot in Figure 9(b) to discuss whether the
normality assumption seems to be satisfied.
(c) Considering your answers to parts (a) and (b), does this multiple
regression model seem appropriate for the roller coasters dataset?

It is worth mentioning here that the residual plot discussed in this subsection is by no means the only graphical representation of the residuals and fitted values which can be used. Instead of plotting the residuals, some statistical software packages plot the standardised residuals, where the residuals are scaled so that their standard deviation is 1 (and hence their variance is also 1).
Another alternative to plotting residuals, which is used in some statistical packages, is to plot the square roots of the absolute values of the standardised residuals; in this case, the resulting plot is called the scale-location plot. You will meet the scale-location plot soon when we use R for diagnostics in Subsection 3.4. We will also be using standardised residuals in another new type of plot (the residuals versus leverage plot) in the next subsection.


3.2 Leverage
In this section, we will introduce another diagnostic tool for multiple
regression, known as leverage. Leverage identifies particular data points
which have the potential to have a major impact on the regression model.
• Data points with high leverage have the potential to substantially alter
the fitted regression model if that particular data point were changed or
removed from the dataset.
• On the other hand, if a data point with low leverage were changed or
removed from the dataset, then it would make little difference to the
fitted regression model.
The word 'potential' is important here: leverage does not tell us whether a data point has had a major impact on the regression model; it just tells us whether a data point could have a major impact on the regression model.
It is easiest to get an understanding of what we mean by high and low
leverage by looking at a specific example.
In Activity 2, we fitted the simple linear regression model
strength ∼ height
using data from the FIFA 19 dataset. The scatterplot of the data together
with the fitted line for this model is given in Figure 10. In Example 6 we
illustrate high and low leverage by considering how the fitted line would
change if a couple of data points change.

[Figure: scatterplot of strength score against height (in), with the fitted regression line.]
Figure 10 Scatterplot of strength and height, together with the fitted line for strength ∼ height


Example 6 Illustrating high and low leverage


In Figure 10, the first data point at the far left has a value of 66 for
height and a strength score of 76. Suppose that instead of a strength
score of 76, this football player had a strength score of 55. How might
the change in the response for this observation affect the fitted line?
Well, in Figure 11, we can see the effect on the fitted line that this
change produces. Figure 11 shows a scatterplot of the data points
with the changed response for the first data point at the far left. (The
original data point is also included so that you can see how the data
point has changed.) On the scatterplot are two fitted lines: the fitted
line using the changed response for the far-left observation is shown as
a solid line, whereas the fitted line using the original data (that is, the
fitted line shown in Figure 10) is shown as a dashed line.

[Figure: scatterplot of strength and height showing the original far-left data point and its changed version, with the fitted line using the original data (dashed) and the fitted line using the changed data point (solid).]
Figure 11 Scatterplot of strength and height, together with two fitted lines for strength ∼ height

Notice that, by changing the response for just that single far-left
observation, the fitted line has altered and has ‘tilted’ towards the
new value.
Let’s have a look at what happens to the fitted regression line if
instead of changing the response of the far-left observation, we change
the response for one of the more central observations.


One of the footballers has a value of 72 for height and has a strength
score of 77 (which is only 1 larger than the strength score of the
far-left observation). Let’s now try changing the value for this
footballer’s strength score from 77 to 55 (so that the change that
we’re making to the response is comparable to the change that we
made for the far-left observation).
Figure 12 shows a scatterplot of the data points, this time with the
changed response for one of the data points for which height is 72.
(The new data point is the lowest data point in the plot. The original
data point is also included on the plot so that you can see how the
data point has changed.) The fitted line using the data points in the
scatterplot (which uses the different response for one of the
observations for which height is 72) is shown on the plot as a solid
line, whereas the fitted line using the original data (that is, the fitted
line shown in Figure 10) is shown on the plot as a dashed line.

[Figure: scatterplot of strength and height showing the original central data point and its changed version, with the fitted line using the original data (dashed) and the fitted line using the changed data point (solid).]
Figure 12 Scatterplot of strength and height, together with two fitted lines for strength ∼ height

This time, although we changed the response of the more central


observation by even more than we did for the far-left observation, the
fitted line using the changed data point is very close to the fitted line
for the original data!
So, although we changed both of the response values of the two
observations in a very similar way, the change had quite a large effect


on the fitted line for the far-left point, but very little effect on the
fitted line for the central point. As such, the data point on the far left
has high leverage, whereas the central data point has low leverage.

In Example 6, we identified the data point to the far left in Figure 10 as


having high leverage because changes to the value of its response had quite
a large effect on the fitted line, and we identified one of the central data
points in Figure 10 as having low leverage because changes to the value of
its response had very little effect on the fitted line. So, what is it about
these two points which means that one of them has high leverage and one
has low leverage? We will investigate this in the next activity.

Activity 14 What makes a data point have high leverage?


The original data points considered in Example 6 both have very similar
values of the response (a strength score of 76 for the far-left data point,
and a strength score of 77 for the central data point), and the changed
data points both have the same value of the response (both have a
strength score of 55). So, the responses of the two original data points and
the two changed data points are very similar. But in what way do the two
original data points and the two changed data points differ?
Hence, can you suggest what might determine whether a data point has
high or low leverage?

In Example 6, we saw that:


• The data point at the far left in Figure 10 has the lowest value of the
explanatory variable height and is not close to the ‘centre’ of the
distribution of values of height in the dataset; we concluded that this
data point has high leverage.
• The data point whose value of height is 72 is at the ‘centre’ of the
distribution of values of height in the dataset; we concluded that this
data point has low leverage.
This reflects that it is the value of an observation’s explanatory variable
that determines whether an observation has high or low leverage. In
particular, the further the value of the explanatory variable is from the
‘centre’ of the distribution of values of the explanatory variable, the higher
the leverage is.
The formula for calculating the leverage for each data point in simple linear regression is given in Box 10. The term (xi − x̄)² in the formula ensures that the leverage is high when xi is a long way from x̄ (the 'centre' of the distribution of values of x) and the leverage is low when xi is close to x̄. (Note that you won't need to use this formula; it is just important that you understand the concept of leverage.)


Box 10 Formula for leverage in a simple linear regression model

In a simple linear regression model for the response Y with the explanatory variable x, the leverage hi of the ith data point is given by the equation
hi = 1/n + (xi − x̄)² / ((n − 1)s²x),
where n is the number of observations in the dataset, and x̄ and s²x are, respectively, the mean and variance of the observed x values.
Note that the leverage hi only takes values between 0 and 1.
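Although you won't need to apply the formula by hand, it can be checked in R against the built-in hatvalues() function; the sketch below assumes the FIFA 19 data are in a data frame called fifa19.

# Leverages computed from the Box 10 formula...
fit_slr <- lm(strength ~ height, data = fifa19)
x <- fifa19$height
h_manual <- 1 / length(x) + (x - mean(x))^2 / ((length(x) - 1) * var(x))

# ...agree with the leverages extracted from the fitted model
all.equal(unname(hatvalues(fit_slr)), h_manual)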

Although we have focused on leverage in simple linear regression, the


general idea extends directly to multiple regression. A data point with
high leverage in multiple regression has the potential to substantially alter
the fitted multiple regression model if that particular data point were
changed or removed from the dataset. However, because there are several
explanatory variables, the variables together determine whether or not a
data point has high leverage. This makes things a bit more complicated
than simple linear regression (and the formula won’t be given here), but
the general idea remains that a data point has high leverage if the values
of its explanatory variables are a long way from the centre of the
distribution of values of the explanatory variables in the dataset.
So, now we know what leverage is, you may be wondering why we might
care whether a data point has high leverage or not. Well, since we use the
regression model to decide whether there is a linear relationship between
the response variable and the explanatory variables, and to predict the
responses for new values of the explanatory variables, it is important that
the model is reliable. However, if a high-leverage point had a response that
did not seem to fit the pattern of the rest of the data, then the worry
would be that the fitted equation might be very inaccurate. Thus, any
conclusions about which explanatory variables were related to the response
variable could be misleading and the predictions made using the model
would be wrong. Leverage is summarised in Box 11.

Box 11 Summary of leverage


Data points with high leverage have the potential to substantially
alter the fitted regression model if that particular data point were
changed or removed from the dataset.
Leverage depends on the values of the explanatory variable(s) only, as
described next.


• Data points with values of the explanatory variables a long way from the centre of the distribution of values of the explanatory variables in the dataset have high leverage.
• Data points with values of the explanatory variables close to the centre of the distribution of values of the explanatory variables in the dataset have low leverage.
Leverage only takes values between 0 and 1.

An easy way to identify which data points have high leverage in a dataset
is by using a residuals versus leverage plot, which plots the
standardised residuals – that is, residuals which have been scaled so that
their standard deviation is 1 – for each of the observations against their
leverage values. You will see a residuals versus leverage plot in Example 7.

Example 7 A residuals versus leverage plot for a simple linear regression model

In Activity 2, we fitted the model
strength ∼ height
using data from the FIFA 19 dataset.
The residuals versus leverage plot for this model is given in Figure 13.

[Figure: standardised residuals plotted against leverage; points 15, 62 and 92 are labelled.]
Figure 13 The residuals versus leverage plot for strength ∼ height


From Figure 13, we can see that in a residuals versus leverage plot the
standardised residuals are plotted on the vertical axis and the leverage
values are plotted on the horizontal axis. The zero residual line is
included so that it is easy to see at a glance which residuals are
relatively small and which are relatively large.
In the residuals versus leverage plot given in Figure 13, there is one
data point (numbered 62) whose leverage stands out as being quite a
bit higher than the leverage of the other data points. This data point
therefore has the most potential to affect the regression model.
The observation with the next highest leverage, which also stands
apart from the rest of the data points, is the data point numbered 15.
So, this data point also has the potential to affect the regression
model.

Even if a data point has very high leverage, meaning that it has the
potential to affect the regression model, it doesn’t necessarily mean that
the data point actually has affected the regression model.
If the observed response for a data point with high leverage is not
consistent with the fitted model (so the standardised residual is large),
then it is likely that the refitted model will be different to the original
fitted model, since the refitted model will no longer be ‘pulled’ towards the
inconsistent data point. (We saw this ‘pulling’ effect in Example 6 for the
far-left data point in Figure 11.) So, if a data point has both high leverage
and a large residual, then it is likely that that particular data point has
affected the regression model.
On the other hand, if the observed response for a data point with high
leverage is consistent with the fitted model (so the standardised residual is
small), then it is unlikely that the refitted model will be very different,
since the data point won’t really have ‘pulled’ the original fitted model
away from where the refitted line is. So, if a data point has high leverage
but a small residual, then it is unlikely that that particular data point has
affected the regression model, although it could do if the leverage was high
enough.
Since the size of the residuals plays an important role in determining
whether a data point actually has affected the regression model, it is also
possible for a data point without high leverage to have affected the model
if the residual is large enough.
Data points which have affected the regression model are known as
influential points, as summarised in Box 12.


Box 12 Influential points


A data point is an influential point if it affects the fitted regression
model – in other words, if the data point exerts a big influence on the
model.
A data point is likely to be an influential point if the data point has:
• high leverage and a large residual
• very high leverage
• a very large residual.

One important question remains: how do we decide that a standardised residual is 'large'? Well, because we are using standardised residuals (rather than (raw) residuals, which can have any scale), the value of each standardised residual can be compared on a fixed scale. There's no hard-and-fast rule for deciding if a standardised residual is large, but as a general rule of thumb:
• standardised residuals > 2 or < −2 are considered to be large
• standardised residuals > 3 or < −3 are considered to be very large.
Using the residuals versus leverage plot, we can get a feel for both the size
of a data point’s standardised residual and its leverage. Therefore, as you
will see in Activity 15, the residuals versus leverage plot can be useful for
seeing at a glance which data points are likely to have actually affected the
regression, as well as seeing which particular data points have the potential
to affect the regression model.

Activity 15 Where are we likely to find influential points?

Whereabouts in a residuals versus leverage plot are we likely to find


influential data points?

In the next two activities, we will use residuals versus leverage plots to
identify data points which are likely to be influential points.

Activity 16 Any influential points?

In Example 7, we considered the residuals versus leverage plot for the


fitted model
strength ∼ height.
The residuals versus leverage plot was given in that example in Figure 13.
Do there seem to be any influential data points for this model?


Activity 17 Identifying influential data points for a multiple regression model

Recall from Activity 8 that we fitted the model
speed ∼ height + length
using data from the roller coasters dataset. Figure 14 shows the residuals
versus leverage plot for this model.

[Figure: standardised residuals plotted against leverage; points 1, 2, 3, 4, 43, 44 and 77 are labelled.]
Figure 14 A residuals versus leverage plot for speed ∼ height + length
(a) Which points can be identified as having high leverage?
(b) Are any of the data points influential points for this model?

The residuals versus leverage plot is summarised in Box 13.

Box 13 The residuals versus leverage plot


The residuals versus leverage plot is a graphical representation
that plots the standardised residuals of each data point on the vertical
axis against its leverage value on the horizontal axis.
The plot is a useful tool for identifying the likely influential points.
These are points with very high leverage values, those with very large
standardised residuals or, more importantly, points with high leverage
and large standardised residuals.


Points with both high leverage and a large standardised residual are
likely to be highly influential to the model, in the sense that they can
alter the results of the regression model if they are changed. These
influential points appear towards the lower-right or upper-right
corners of the plot.
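In R, this plot is one of the standard diagnostics available for an lm() object; a one-line sketch, assuming a fitted model called fit, is:

plot(fit, which = 5)  # standardised residuals versus leverage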

As Activities 16 and 17 demonstrated, it is not always easy to decide


whether a data point is influential from the residuals versus leverage plots.
It would therefore be useful to have some sort of formal measure of how
influential a data point is. We will introduce one such measure next in
Subsection 3.3.

3.3 Cook’s distance


In this subsection, we will introduce what is known as Cook’s distance,
which uses the values of the standardised residuals and leverage to provide
a measure of how influential a data point is. The formula of Cook’s
(squared) distance for the ith data point is given in Box 14. (As with the
formula for leverage, you won’t need to use this formula in this module;
understanding the concept of Cook’s distance, and how to use it, is what’s
important.)

Box 14 Cook's (squared) distance using standardised residuals

(We're not 'cooking the analysis'! These ideas are named after their originator, Professor R.D. Cook.)
For the ith data point, i = 1, 2, . . . , n, Cook's (squared) distance, Di², is given by
Di² = (1/(q + 1)) ri′² hi/(1 − hi),
where q is the number of explanatory variables in the model (so that q + 1 is the number of parameters in the model), ri′ is the standardised residual for the ith data point and hi is its leverage.
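As with leverage, the formula can be verified in R against the built-in cooks.distance() function; a sketch, assuming a fitted lm() object called fit, is:

r_std <- rstandard(fit)         # standardised residuals
h     <- hatvalues(fit)         # leverages
q     <- length(coef(fit)) - 1  # number of explanatory variables

# Cook's distances from the Box 14 formula match the built-in values
d_manual <- r_std^2 * h / ((q + 1) * (1 - h))
all.equal(d_manual, cooks.distance(fit))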

The formula given in Box 14 makes intuitive sense in terms of what we want Cook's distance to measure. We know that a data point with high leverage can have a big impact on the regression coefficients, but that in practice it will do so only if it has a large standardised residual. In this case, both the leverage hi and the absolute value of the standardised residual ri′ are large. For such a data point, the values of ri′² and hi/(1 − hi) will both be large too, which means it will then have a large value of Di². On the other hand, a data point with large leverage but a small standardised residual will have a smaller Cook's distance (since ri′² will be small), as will a data point with a large standardised residual but a small leverage (since hi/(1 − hi) will be small).


The question now is: how large does the Cook’s distance for a specific data
point need to be in order for it to be considered an influential point?
Unfortunately, there is no definitive answer to this question as there is no
standard rule of thumb. A threshold of 0.5 is sometimes considered, but a
data point with a Cook’s distance of less than 0.5 could also be considered
to be influential if its Cook’s distance is large in comparison to the other
Cook’s distance values.
Box 15 introduces a new diagnostic plot called the Cook's distance plot. This plot gives a visual representation of the Cook's distance values for the data points in the dataset, so that it is easier to identify the influential points.

Box 15 The Cook’s distance plot


The Cook’s distance plot is a graphical representation that plots
the Cook’s distance values on the vertical axis against an index of the
data points on the horizontal axis. The Cook’s distance values are
plotted as vertical bars that start from zero.
The plot is a useful tool for identifying the highly influential data
points in a regression model: these are the points with the highest
bars in the Cook’s distance plot.
The numerical values of the Cook’s distance values in the plot help to
determine how influential the extreme data points are. In addition,
the plot shows how influential those points are relative to the rest of
the data points.
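In R, the Cook's distance values and this plot are both available directly from a fitted lm() object; a short sketch, again assuming a fitted model called fit, is:

d <- cooks.distance(fit)             # Cook's distance for each point
head(sort(d, decreasing = TRUE), 3)  # the three largest values
plot(fit, which = 4)                 # Cook's distance plot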

You will see a Cook’s distance plot in action in Example 8.

Example 8 Exploring Cook's distance in a simple linear regression model

In Example 7, we considered the residuals versus leverage plot for the
model
strength ∼ height
fitted using data from the FIFA 19 dataset, and in Activity 16 you
used this plot to identify likely influential points. In this example,
we’ll use the Cook’s distance plot to identify influential points. The
Cook’s distance plot for this model is given in Figure 15.


[Figure: Cook's distance plotted against observation number; the bars for points 15, 62 and 92 stand out, all below 0.25.]
Figure 15 The Cook's distance plot of strength ∼ height

The most influential data points in a dataset will have the tallest bars
in a Cook’s distance plot. Therefore, for this dataset, the three most
influential data points are those numbered 15, 62 and 92. These are,
in fact, three of the data points identified as possible influential points
from the residuals versus leverage plot in Activity 16.
Notice that the other two data points with similar leverage to data point 92, but with small residuals, do not appear to have high Cook's distances.
We mentioned that a value of 0.5 is sometimes used as a threshold for
deciding whether a Cook's distance is large enough for a data point to
be declared an influential point. It is clear from the Cook's distance
plot in Figure 15 that none of the points for this dataset has a Cook's
distance of 0.5 or above. However, because the highest Cook's
distance values are so large in comparison to those of the other data
points in the sample, these data points can still be considered
influential, even though they do not exceed the threshold of 0.5.

In multiple regression, a Cook’s distance plot is the same as that used for
simple linear regression and should be interpreted in the same way. We
will use the Cook’s distance plot to identify the influential points in a
multiple regression model next in Activity 18.


Activity 18 Exploring Cook's distance in a multiple regression model
In Activity 17, we considered the residuals versus leverage plot for the
model
speed ∼ height + length
using data from the roller coasters dataset.
In that activity, we identified the data points numbered 1, 2, 3, 4 and 43 as
possibly being influential points. In this activity, we’ll use the Cook’s
distance plot for the model to identify influential points; the Cook’s
distance plot is given in Figure 16.

[Figure: a Cook's distance plot, with Cook's distance (0 to 0.08) on the vertical axis and observation number (0 to 200) on the horizontal axis; observations 1 and 43 are labelled as having the tallest bars.]
Figure 16 A Cook's distance plot for speed ∼ height + length
(a) Based on the Cook’s distance plot, what are the highest influential
points in the data? Are they the same points that were identified in
Activity 17?
(b) How do you interpret the fact that the data point with the highest
leverage in Figure 14 does not appear as one of the three values with
the highest Cook’s distance in Figure 16?

In the next activity, you will explore a reanalysis of another multiple


regression model that you have already met.


Activity 19 Reanalysing the Brexit dataset

Recall the Brexit dataset that you analysed earlier in Activity 7. In


part (b) of Activity 7, the multiple regression model
leave ∼ age + income
was fitted. The estimated regression coefficients, their standard errors,
t-values and associated p-values were obtained and are repeated here, as
Table 5, for convenience.
Table 5 Coefficients for leave ∼ age + income

Parameter   Estimate   Standard error   t-value    p-value
Intercept     78.005            4.943    15.780    < 0.001
age           31.768           24.295     1.308      0.192
income        −2.180            0.185   −11.776    < 0.001

(a) The residuals versus leverage plot of this multiple regression model is
given in Figure 17. Explain why the data points numbered 96, 124
and 228 might have higher Cook's distances than other points. Your
explanation should be in terms of the leverage and/or residuals of
these points.

[Figure: a residuals versus leverage plot, with standardised residuals (−3 to 2) on the vertical axis and leverage (0.01 to 0.05) on the horizontal axis; points 96, 124 and 228 are labelled.]
Figure 17 A residuals versus leverage plot for leave ∼ age + income


(b) The Cook’s distance plot for this multiple regression model is given in
Figure 18. Based on this plot, which data point seems to be the most
influential point in the data?

[Figure: a Cook's distance plot, with Cook's distance (0 to 0.06) on the vertical axis and observation number (0 to 200) on the horizontal axis; points 124, 96 and 228 are labelled as having the tallest bars.]
Figure 18 A Cook's distance plot for the multiple regression model of the Brexit dataset

(c) The multiple regression model


leave ∼ age + income
was refitted after removing data point number 124 from the dataset.
The estimated regression coefficients, their standard errors, t-values
and associated p-values were obtained and are given in Table 6.
Table 6 Coefficients for leave ∼ age + income after removing data point
number 124

Parameter   Estimate   Standard error   t-value    p-value
Intercept     77.998            4.913    15.877    < 0.001
age           36.030           24.239     1.486      0.139
income        −2.256            0.188   −12.007    < 0.001

Given the change in the estimated regression coefficients after


removing data point number 124, do you think that this point has a
considerable influence on the regression coefficients? What do you
conclude?


Having identified influential points, we are then left with the problem of
what to do about them. Usually, statisticians either present analyses both
with and without the influential points, or omit the influential points from
the analysis and explain why they are thought to be invalid for inclusion.
The fact that they are influential is not, by itself, a valid reason for
excluding points.
Finally, it is worth noting that the residuals versus leverage plot and the
Cook’s distance plot are not the only available plots of their kind that may
be obtained from a statistical package. For example, another useful plot
for detecting influential points that you may meet elsewhere is the Cook’s
distance versus leverage plot, which plots the Cook’s distance on the
vertical axis and the leverage on the horizontal axis. We won’t, however,
be discussing this plot further in this unit.
We will finish this section by carrying out diagnostic checks in R.

3.4 Using R to perform diagnostic checks


In this subsection, we will perform diagnostic checks for multiple regression
in R. Following on from Notebook activity 2.1, we will once again use the
Brexit dataset, taking leave as the response variable, and satisfaction
and noQual as the explanatory variables.

Notebook activity 2.4 Diagnostics for multiple regression in R
This notebook explains how to use R to perform diagnostic checks for
multiple regression.
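As a flavour of what the notebook covers, the following commands sketch how such diagnostic checks can be carried out in base R. This is only a sketch: the data frame name brexitData is an assumption here, and the notebook itself gives the module's definitive approach.

# Fit the multiple regression model (brexitData is an assumed name)
fit <- lm(leave ~ satisfaction + noQual, data = brexitData)

# The standard diagnostic plots, including residuals versus leverage
par(mfrow = c(2, 2))
plot(fit)

# A Cook's distance plot on its own
par(mfrow = c(1, 1))
plot(fit, which = 4)

# The Cook's distance value for each data point
cooks.distance(fit)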

4 Transformations in multiple regression
There are times when we would like to use a multiple regression model but
our data are not suitable for such a model – either because the model
assumptions for multiple regression are unrealistic for our data, or because
the relationship between the response and the explanatory variables is not
linear. In cases such as these, we can sometimes transform one or more of
the variables so that a model using the transformed data is suitable for
modelling using multiple regression.
We will start this section by considering how transformations can be useful
in multiple regression in Subsection 4.1, before discussing how to find
suitable transformations in Subsection 4.2. We’ll then finish the section by
using transformations in R in Subsection 4.3.


4.1 The use of transformations in multiple regression
In simple linear regression, it is sometimes helpful to use transformations
of variables. Appropriate transformations can be used to reformulate the
relationship between the explanatory variable and the response variable to
obtain or enhance linearity; in this case, the explanatory variable is
transformed. In some situations where the simple linear regression
assumptions do not seem to hold, the response variable can be transformed
so that the assumptions seem to be satisfied for the transformed data. In
some other cases, it may be useful to transform both the response and the
explanatory variable.
Transformations of variables can be used in the same manner in multiple
regression, as stated in Box 16. This is illustrated in Figure 19.

Box 16 Transformations of variables in multiple regression
In multiple regression, transformations are sometimes useful to:
• attain or enhance linearity between the response variable and one
or more of the explanatory variables by transforming the
corresponding explanatory variable(s)
• fulfil or strengthen the assumptions of the multiple regression model
by transforming the response variable.

[Figure: a diagram of the model y ∼ x1 + x2 + · · · + xj + · · · + xq, indicating that the response y may be transformed to strengthen the model assumptions, while an explanatory variable xj may be transformed to enhance the linearity between xj and the response.]
Figure 19 Summary of the use of transformations for the multiple regression model y ∼ x1 + x2 + · · · + xj + · · · + xq

So, suppose that we’ve decided that we want to transform the response
and/or one (or more) of the explanatory variables. The next question is:
which transformation(s) should we try? We will address this question in
the next subsection.


4.2 Finding suitable transformations


Unfortunately, there are no hard-and-fast rules for finding suitable
transformations for the response or any of the explanatory variables, and
the process usually involves some trial and error. We can, however,
sometimes make use of something called the ladder of powers to help
suggest possible transformations to try. The ladder of powers is
summarised in Box 17.

Box 17 The ladder of powers

The ladder of powers is a list of possible transformations:
$$\ldots, \; x^{-2}, \; x^{-1}, \; x^{-1/2}, \; \log x, \; x^{1/2}, \; x^{1}, \; x^{2}, \; x^{3}, \; \ldots$$
The ladder of powers lists the transformations in ascending powers
of x, where $\log x$ takes the place of $x^0$.
The ladder of powers can be used to suggest transformations to try to
make a variable more symmetric:
• for right-skew data, start from $x^1$ and go down the ladder of powers
• for left-skew data, start from $x^1$ and go up the ladder of powers.
The transformations become 'stronger' the further they are from $x^1$.
The use of the ladder of powers for transformations is illustrated
visually in Figure 20.

[Figure: for right-skew data, go down the ladder of powers; for left-skew data, go up the ladder of powers.]
Figure 20 Illustration of how the ladder of powers can be used for transformations
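If you want to experiment with the ladder of powers yourself, comparing histograms of a variable under successive transformations shows the ladder in action. A minimal sketch (the simulated right-skew variable x is purely illustrative):

# Compare a right-skew variable with two transformations
# further down the ladder of powers
x <- rexp(200)
par(mfrow = c(1, 3))
hist(x, main = "x")
hist(sqrt(x), main = "sqrt(x)")
hist(log(x), main = "log(x)")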

We will use the ladder of powers for transformations in multiple regression


in Subsections 4.2.1 and 4.2.2.


4.2.1 An example of applying transformations


In this subsection, we will start by considering transformations in multiple
regression using data concerning car prices; the dataset is described next.

The value of used General Motors cars


General Motors is an American multinational corporation that
encompasses different makes of cars. Data are available for several
hundred used General Motors cars.
The condition of a car is thought to greatly affect its price, and so it
might be possible to develop a multiple regression model to predict a
car’s price based on a variety of characteristics such as its mileage,
make, model, engine size and interior style.
The car prices dataset (carPrices)
The dataset considered in this unit focuses on the following variables:
• price: suggested retail price of each car in US dollars (US$)
• mileage: number of miles that the car has been driven
• make: manufacturer of the car such as Saturn, Pontiac and
Chevrolet
• liter: size of engine, measured in litres (note ‘liter’ is the
American spelling)
All of the cars in this dataset were less than one year old when priced
in 2005.
The data for the first five cars in the dataset are shown in Table 7.
Table 7 First five cars from carPrices

price      mileage   make        liter
49248.16      6685   Cadillac      5.7
48365.98      2616   Cadillac      4.6
48310.33       788   Cadillac      4.6
47065.21      5239   Chevrolet     6.0
46747.67     15343   Cadillac      5.7

Source: Kuiper, 2008, including 'Supplementary material', accessed 7 June 2022

We will work through the process of finding an appropriate transformation


in a multiple regression model based on the car prices dataset.
Suppose that we are interested in using data from the car prices dataset to
model the response price, with the three explanatory variables mileage,
liter and make.


Now, the variable make is in fact a categorical variable, which means that
the regression model needs to treat this explanatory variable in a different
way to the two (continuous) explanatory variables mileage and liter.
You won't learn how to model with categorical explanatory variables until
Units 3 and 4, but you don't need those details for this example.
Therefore, for now, just think of make as 'one of the three explanatory
variables'.
So, we’re interested in fitting the model
price ∼ mileage + liter + make.
However, the residual plot for this fitted model, shown in Figure 21,
indicates that there is a problem with using this model. The points in the
residual plot show an increase in spread with increasing fitted values,
indicating that the constant variance assumption may not be valid, and
also there is a hint of curvature in the plot, suggesting that the zero mean
and linearity assumptions may also be questionable. Therefore, as
suggested in Box 16, since the problem seems to be that one of the model
assumptions may not be valid, we should consider transforming the
response variable, price.

[Figure: a residual plot, with residuals (−10000 to 10000) on the vertical axis and fitted values (10000 to 40000) on the horizontal axis; the spread of the residuals increases with the fitted values.]
Figure 21 Residual plot for price ∼ mileage + liter + make

So, how might we go about finding a suitable transformation for price?


Well, since price is the response, a good place to start is by looking at a
histogram of price – if this is very skew, then that could be causing the
problems with the model assumptions.


A frequency histogram of price is shown in Figure 22. This is clearly


right-skew! So, let’s try transforming price by going down the ladder of
powers.

[Figure: a frequency histogram of price, with frequency (0 to 200) on the vertical axis and price in US$ (10000 to 50000) on the horizontal axis; the distribution is clearly right-skew.]
Figure 22 Histogram of price

Going down the ladder of powers suggests that the first transformation we
should try is √price; a histogram of the transformed values √price is
shown in Figure 23.

[Figure: a frequency histogram of √price, with frequency (0 to 100) on the vertical axis and √price (100 to 220) on the horizontal axis; the distribution is still right-skew.]
Figure 23 Histogram of √price


The histogram of √price still looks rather skew, and so it's likely that the
assumptions may also not be valid for the model
√price ∼ mileage + liter + make.
Indeed this does turn out to be the case, since the points in the residual
plot of this fitted model, given in Figure 24, indicate that the model
assumptions still look questionable.

[Figure: a residual plot, with residuals (−30 to 30) on the vertical axis and fitted values (100 to 220) on the horizontal axis; the pattern of points still suggests the assumptions are questionable.]
Figure 24 Residual plot for √price ∼ mileage + liter + make

So, let’s go further down the ladder of powers and try the next
transformation: log(price).
A histogram of the transformed values log(price) is shown in Figure 25.
This time, the histogram for the transformed response looks much more
symmetric, so this transformation looks more promising!


[Figure: a frequency histogram of log(price), with frequency (0 to 150) on the vertical axis and log(price) (9.0 to 11.0) on the horizontal axis; the distribution looks roughly symmetric.]
Figure 25 Histogram of log(price)

Fitting the model


log(price) ∼ mileage + liter + make
results in the residual plot shown in Figure 26.

[Figure: a residual plot, with residuals (−0.3 to 0.3) on the vertical axis and fitted values (9.5 to 11.0) on the horizontal axis; the points are randomly scattered about the zero residual line.]
Figure 26 Residual plot for log(price) ∼ mileage + liter + make


This time the points in the residual plot do seem to be randomly scattered
about the zero residual line, indicating that the assumptions of constant
variance, zero mean and linearity now seem to be reasonable. So, by using
the transformed variable log(price) as our response instead of the original
response price, we now have data for which we can use multiple regression!
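The whole exploration above is quick to reproduce in R. A sketch, assuming the data are in a data frame called carPrices (an assumed name; lm() handles the categorical variable make appropriately once it is stored as a factor):

# Histograms used to choose a transformation of the response
par(mfrow = c(1, 3))
hist(carPrices$price)
hist(sqrt(carPrices$price))
hist(log(carPrices$price))

# Fit the model with the log-transformed response and check
# the residuals versus the fitted values
fit <- lm(log(price) ~ mileage + liter + make, data = carPrices)
plot(fit, which = 1)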

4.2.2 Another example of applying transformations


We’ll consider transformations in multiple regression further in a series of
activities using data concerning films; the dataset is described next.

Popularity of films
A research project was conducted to determine whether the
conventional features of a film (such as its budget and the number of
screens on which it is initially launched) are more important in
predicting the popularity of the film than the social media features
(such as the number of ‘likes’ and ‘dislikes’ the film trailer receives on
the online video sharing and social media platform YouTube). In
particular, the researchers were interested in predicting the gross
income of a film using a set of explanatory variables.
The films dataset (films)
There are 169 films in this dataset, which is a subset of the data
collected by the researchers for 231 films produced between 2014
and 2015.
The response variable is:
• income: the gross income (in millions of US dollars) for each film.
The explanatory variables of interest are:
• budget: the budget (in millions of US dollars) for the film
• rating: the rating of the film on a scale from 1 to 10, collected
from the film website IMDb
• screens: the number of screens (in thousands) on which the film
was initially launched in the USA
• views: the number of views (in millions) for the film trailer on
YouTube
• likes: the number of likes (in thousands) for the film trailer on
YouTube
• dislikes: the number of dislikes (in thousands) for the film trailer
on YouTube
• comments: the number of comments (in thousands) on the film
trailer on YouTube
• followers: the aggregate number of followers (in millions) on
Twitter for the top three actors in the film.


The data for the first five films from this dataset, ordered by budget,
are shown in Table 8.
Table 8 First five films (ordered by budget) from films

income  budget  rating  screens  views  likes  dislikes  comments  followers
 245.0     210     5.8    4.233  0.171  0.791     0.362     0.230      2.815
 234.0     200     8.1    3.996  0.002  0.009     0.000     0.001     10.280
 203.0     200     6.8    4.324  0.387  4.996     0.233     0.864      4.520
  93.2     190     6.6    3.972  1.000  4.212     0.066     0.250      1.198
 241.0     180     7.0    3.948  0.557  3.528     0.135     0.464      0.006

Source: Ahmed et al., 2015, including 'Supplementary resource', accessed 21 March 2022

Suppose that we wish to use the films dataset in a multiple regression


model to predict a film’s income (income) using two explanatory variables,
budget and screens. Unfortunately, it turns out that some of the
assumptions associated with the multiple regression model are not
reasonable for these data. In Activities 20 to 22 we will investigate these
problems and use appropriate variable transformations for solving them.
We will then use our fitted model for the transformed variables to predict
a film’s income.

Activity 20 Transforming the response variable

A multiple regression model is to be fitted for the response income from


the films dataset, using budget and screens as the explanatory variables.
(a) Consider the model
income ∼ budget + screens.
The residual plot for this fitted model is given in Figure 27.
Explain why the residual plot in Figure 27 suggests that it is
reasonable to try transforming the response.
(b) The histogram of income is shown in Figure 28.
Explain why it is reasonable to consider the following two
transformations of the response: √income and log(income).


[Figure: a residual plot, with residuals (−200 to 200) on the vertical axis and fitted values (0 to 200) on the horizontal axis.]
Figure 27 Residual plot for income ∼ budget + screens

[Figure: a frequency histogram of income, with frequency (0 to 100) on the vertical axis and income in millions of US$ (0 to 350) on the horizontal axis.]
Figure 28 Histogram of income


(c) The histograms of √income and log(income) are given in Figure 29.
Based on these two histograms, which transformation would you
recommend?

[Figure: two frequency histograms, (a) of √income (0 to 20) and (b) of log(income) (−6 to 6).]
Figure 29 Histograms of (a) √income, and (b) log(income)

In Activity 20, we concluded that we should try transforming the response


income from the films dataset using the square root transformation in
order to address the fact that the assumption of constant error variance
may not be valid. In the next activity, we will investigate whether this
transformation has solved the problem or whether we additionally need to
transform either of the explanatory variables budget and screens.

Activity 21 Any further transformations?


Following on from Activity 20, the model
√income ∼ budget + screens
was fitted to the data from the films dataset. The residual plot for this
fitted model is given in Figure 30.
A reasonable interpretation of this residual plot is as follows:
Transforming the response has led to the assumption of constant
variance of the random terms being plausible. However, the plot
raises concerns about the linearity assumption, as the plotted points
in the residual plot show a hint of a downward trend (rather than
being equally scattered about the zero residual line).
As a result of this interpretation of the residual plot, scatterplots of the
transformed response (√income) against each of the explanatory variables,
budget and screens, were produced. The resulting plots are given in
Figure 31.


[Figure: a residual plot, with residuals (−5 to 5) on the vertical axis and fitted values (5 to 15) on the horizontal axis; the points show a hint of a downward trend.]
Figure 30 Residual plot for √income ∼ budget + screens

[Figure: two scatterplots of √income (0 to 15, vertical axis): (a) against budget (0 to 200) and (b) against screens (0 to 4).]
Figure 31 Scatterplots of (a) √income and budget, and (b) √income and screens

(a) From the plots in Figure 31, explain why it is reasonable to decide to
transform the explanatory variable screens, but not the explanatory
variable budget.
(b) The pattern of points in the scatterplot shown in Figure 31(b) can be
interpreted as suggesting that the relationship between √income and
screens may be quadratic or cubic. Therefore, scatterplots of the

transformed response √income and each of the two transformed
variables, screens² and screens³, are given in Figure 32.
Given these scatterplots, explain why it is reasonable to transform
screens to screens³.

[Figure: two scatterplots of √income (0 to 15, vertical axis): (a) against screens² (0 to 15) and (b) against screens³ (0 to 80).]
Figure 32 Scatterplots of (a) √income and screens², and (b) √income and screens³

Activities 20 and 21 suggested that an appropriate model for modelling
income from the films dataset should transform the response to √income,
transform the explanatory variable screens to screens³, and keep the
explanatory variable budget untransformed. In Activity 22, you will see
how the suggested transformations affect the residual plot of the regression
model and how to use the fitted regression model for prediction.

Activity 22 Modelling the films' income

(a) For the fitted model using the transformed response and explanatory
variable,
√income ∼ budget + screens³,
the residual plot is shown in Figure 33.
Do you think this regression model with the transformed variables has
resolved the problem with the non-constant variance (apparent in
Figure 27) and the problem with the assumption of linearity (raised
by the hint of a downward trend in Figure 30)?


[Figure: a residual plot, with residuals (−5 to 5) on the vertical axis and fitted values (5 to 15) on the horizontal axis.]
Figure 33 Residual plot for √income ∼ budget + screens³

(b) The suggested model with transformations has the following fitted
equation:
√income = 3.01 + 0.025 budget + 0.107 screens³.
Use the fitted equation to predict the income of a new film that has a
production budget of 50 million US dollars and is to be initially
launched in the USA on 2500 screens.
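If you would like to check your answer to part (b), the calculation can be sketched in R. Remember that budget is recorded in millions of US dollars, screens in thousands, and that the fitted equation is for √income, so the result must be squared:

# Predicted income for a US$50 million budget and 2500 screens
budget <- 50     # millions of US dollars
screens <- 2.5   # thousands of screens
sqrt_income <- 3.01 + 0.025 * budget + 0.107 * screens^3
sqrt_income^2    # predicted income, in millions of US dollars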

To finish this section, we will look at transformations for multiple


regression in R.


4.3 Using R to perform transformations


In this subsection, we will use data from a new dataset concerning the
well-known social media platform Facebook. The dataset is described next.

Patterns of social media usage


In a study on patterns of social media usage among adults, data on
a number of Facebook users were collected. For each person, data on
some explanatory variables were also recorded.
The Facebook dataset (facebook)
We consider a subset of these data, which contains data on 135 people
for three variables.
The response variable is:
• friends: the number of Facebook ‘friends’ each person has.
The two potential explanatory variables are:
• age: the person’s age in years
• visits: the number of visits the person makes in a day to the
Facebook website.
Data for the first five people in this dataset are given in Table 9.
Table 9 First five people from facebook

friends   age   visits
    164    31        3
    184    42        5
    588    20       13
    485    20       15
    205    20       20

Source: Sullivan III, 2020 and 2021, both accessed 9 June 2022

In Notebook activity 2.5, we will use R for transformations in multiple


regression, focusing in particular on data from the Facebook dataset.

Notebook activity 2.5 Transformations in R


This notebook explains how to use R for transformations in multiple
regression, using the data given in facebook.
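As a taste of what the notebook covers, transformations can be written directly inside an R model formula. The particular models below are illustrative only; which transformations actually suit the Facebook data is explored in the notebook itself:

# Transform the response within the formula
fit1 <- lm(log(friends) ~ age + visits, data = facebook)

# Use I() when raising an explanatory variable to a power
fit2 <- lm(sqrt(friends) ~ age + I(visits^3), data = facebook)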


5 Choosing explanatory variables


The more explanatory variables you add to a model, the better the model
will fit the data. To see this, in Example 9 we will revisit two of the
models for the FIFA 19 dataset used at the start of this unit.

Example 9 Model fit after adding an explanatory variable


In Activity 1, we fitted the model
strength ∼ weight.
Then, in Example 1, we added a second explanatory variable and
fitted the model
strength ∼ weight + height.
We can think of the first (simpler) model as a particular case of the
second model, in which the regression coefficient for height has the
value zero.
Of course, when fitting the model strength ∼ weight + height, we
do not restrict the regression coefficient for height to be zero.
Instead, its value is estimated to give the model that provides the
best fit to the data. The fitted model must therefore fit at least as
well as the model in which the regression coefficient is set to zero.

So, as we saw in Example 9, the fit of a model can’t get any worse by
adding extra explanatory variables; the fit can either improve or, at worst,
stay the same. However, fitting the data superbly is not the be-all and
end-all. After all, the data themselves provide a perfect model for the data!
Cutting down the number of explanatory variables is desirable for two
reasons.
• The first reason concerns the purpose of modelling. By modelling, we
are trying to understand the situation in simplified terms by identifying
the main sources of influence on the response variable.
• The second reason concerns prediction. Just because we can fit the data
we have, with all its particular idiosyncrasies, does not mean that the
resulting model will necessarily be a good fit for further data that arise
(under the same situation). We return to this issue in Unit 5.
So, it is preferable to use just enough explanatory variables to capture the
main features of the data. Doing this will also often lead to improved
prediction as well as a model that is easier to interpret. Underlying these
ideas is the principle of parsimony, which is given in Box 18.


Box 18 The principle of parsimony


The principle of parsimony states that if two models explain the
data roughly equally well, you should prefer the simpler model.

In this section, we will introduce some tools and techniques for selecting
the explanatory variables that enhance the fit of a multiple regression
model, while keeping it as simple as possible according to the principle of
parsimony.
When choosing the explanatory variables, problems can occur if two of the
explanatory variables are highly correlated. As a result, it is important
that any such pairs of explanatory variables are identified. To do this,
Subsection 5.1 discusses two simple tools that can be used to help explore
the relationships, not only between each explanatory variable and the
response, but also between the explanatory variables themselves.
In deciding which explanatory variables to include in a model, we need to
be able to compare how well models with different combinations of the
explanatory variables fit the data. Being able to measure a model’s fit is
therefore crucial. We will introduce some such measures of fit in
Subsection 5.2 and then, in Subsection 5.3, we’ll see how one of these
measures is used in a widely used procedure for selecting explanatory
variables, called stepwise regression.
Finally, in Subsection 5.4, we will use all of these tools and techniques for
choosing explanatory variables in R.

5.1 Scatterplot matrix and correlation matrix
In this subsection, we will discuss two simple tools – the scatterplot matrix
and the correlation matrix – which can be helpful for exploring the
relationships between the explanatory variables and the relationships
between each of these and the response variable.
We are obviously interested in exploring the relationships between each of
the explanatory variables and the response, since we want our regression
model to capture these relationships. However, why would we want to
explore the relationships between the explanatory variables?
The reason stems from the fact that, in multiple regression, each partial
regression coefficient is associated with a variable’s contribution after
allowing for the contributions of the other explanatory variables. As a
result, if an explanatory variable is highly correlated with another
explanatory variable (with an absolute correlation value of 0.7 or more,
say), it will have little or no additional contribution to the regression
model over and above that of the other. In this case, the associated
p-values can give an inaccurate picture of the significance of the
explanatory variables. For example, if a group of explanatory variables


have mutually high correlations, it is possible that none of them has a
significant p-value, even though their joint contribution may be very
substantial. This problem is called multicollinearity.
Luckily, we can easily solve multicollinearity problems by only including in
our model one of the explanatory variables of a highly correlated pair. For
example, if x1 and x2 are two highly correlated explanatory variables, then
to solve any multicollinearity problems we would include either x1 or x2 in
our model, but not both x1 and x2 together. But, of course, to be able to
solve any potential multicollinearity problems, we need to be able to
identify which pairs of variables are highly correlated!
To help us to do this, we’ll start by introducing the scatterplot matrix, as
described in Box 19.

Box 19 The scatterplot matrix


The scatterplot matrix is a graphical representation that can be
helpful for exploring the relationships between a set of variables.
It consists of a matrix of subplots, each of which is a (small)
scatterplot of a pair of variables.

For the scatterplot matrices that we will use in the context of multiple
regression in this module:
• each row, apart from the last row, will include the scatterplots of one
explanatory variable against the other explanatory variables
• the last row will include scatterplots of the response variable versus each
explanatory variable.
You will see a scatterplot matrix in action next in Example 10.

Example 10 Scatterplot matrix of the films dataset


In this example, we will revisit the films dataset, first introduced in
Subsection 4.2.2. The dataset contains data on 169 films produced
between 2014 and 2015. As a reminder, the response variable is
income and there are eight possible explanatory variables: budget,
rating, screens, views, likes, dislikes, comments and followers.
In Subsection 4.2.2, we focused on modelling income with just two
explanatory variables – namely, budget and screens. Here, we will
consider all eight of the possible explanatory variables.
The scatterplot matrix for the eight explanatory variables and the
response is given in Figure 34, next.


[Figure: a scatterplot matrix of the variables followers, comments, dislikes, likes, views, screens, rating, budget and income; only the lower triangle of plots is shown, with income in the bottom row.]
Figure 34 A scatterplot matrix of the data in the films dataset

The following points explain what the scatterplot matrix shows.


• For each scatterplot, the variable at the end of each row is on the
vertical axis and the variable at the top of each column is on the
horizontal axis.
For example, the top-left plot is a scatterplot of comments on the
vertical axis against followers on the horizontal axis; the plot
directly below that is a scatterplot of dislikes on the vertical axis
against followers on the horizontal axis; and so on.
• For each of the scatterplots on the bottom row, the response is
on the vertical axis and is plotted against the corresponding
explanatory variable on the horizontal axis.
For example, the first plot on the bottom row is a scatterplot of
income on the vertical axis against followers on the horizontal
axis; the second plot on the bottom row is a scatterplot of income
on the vertical axis against comments on the horizontal axis; and so
on.
(If you find yourself forgetting which variable is on the vertical axis
and which is on the horizontal axis, just remember that the
response is always plotted on the vertical axis, and the response is
the variable at the end of the last row – that is, the row variable
goes on the vertical axis.)
• The scatterplot matrix in Figure 34 only actually shows half of the
matrix of possible scatterplots for the variables. This is because the
‘missing’ scatterplots would simply be the same plots which are
shown, only with the axes reversed.
For example, the top-left plot is a scatterplot of comments on the
vertical axis and followers on the horizontal axis. The plot which
would be to the right of followers and above comments would
simply be a scatterplot with followers on the vertical axis and
comments on the horizontal axis.
Including these ‘reverse’ scatterplots doesn’t provide us with any
more information concerning the relationships between the
variables, and so they haven’t been included in Figure 34. This
helps to keep the scatterplot matrix as simple as it can be.

A scatterplot matrix can feel like an overload of information! Luckily, you


are not meant to study every single plot in detail.
So, what should you pay attention to? Well, a good place to start is to
look at the bottom row of plots, in which the response is plotted on the
vertical axis against each of the explanatory variables. (We will return to
the rest of the scatterplot matrix soon.)

Activity 23 Any relationships between each explanatory variable and the response?
Using Figure 34, which of the explanatory variables seem to be related to
the response income?

As we saw in Activity 23, the relationships between the explanatory


variables and the response in Figure 34 that are fairly clear do not seem to
be linear. This suggests that we should try some transformations for this
dataset.
Since there are several relationships which do not appear to be linear, we’ll
try transforming the response income as a starting point. Back in
Activity 20, we found that transforming income using the square root
transformation produced a distribution which was more symmetric, so let’s
try that transformation here and have a look at the scatterplot matrix
with the transformed response.


Example 11 Scatterplot matrix of the films dataset with income transformed
The scatterplot matrix with the square root transformation on income
is shown in Figure 35.

[Figure: a scatterplot matrix of the variables followers, comments, dislikes, likes, views, screens, rating, budget and √income; only the lower triangle of plots is shown, with √income in the bottom row.]
Figure 35 A scatterplot matrix of the data in the films dataset with the response transformed

The individual scatterplots of √income against each individual
explanatory variable still fail to indicate any strong relationships, with
the possible exceptions of budget, rating and screens, as before.
This time, however, the relationships with budget and rating now
seem more linear, so let's stick with this transformation of income –
at least for now!

For the remainder of the scatterplot matrix, we need to scan the plots to
look for any pairs of explanatory variables which seem to be related. We
will do this in Activity 24.


Activity 24 Any relationships between the explanatory variables?
From the scatterplot matrix given in Figure 35, do any of the explanatory
variables seem to be closely related to each other?

Another way of investigating the relationships between explanatory


variables is to work out their correlation matrix, as described next in
Box 20.

Box 20 The correlation matrix


The correlation matrix is a tool that can be helpful for exploring
the correlation relationships between a set of explanatory variables.
Each row of the correlation matrix gives the correlations of one
explanatory variable with the other explanatory variables.
Correlations between each variable and itself are on the main
(leading) diagonal of the matrix (and are all equal to 1).

The correlation matrix for the films dataset is as follows.


            followers  comments  dislikes  likes  views  screens  rating  budget
followers        1.00
comments         0.13      1.00
dislikes         0.09      0.84      1.00
likes            0.17      0.81      0.76   1.00
views            0.18      0.78      0.78   0.86   1.00
screens          0.19      0.23      0.29   0.23   0.18     1.00
rating           0.09      0.03     −0.19   0.08   0.04     0.08    1.00
budget           0.22      0.21      0.14   0.10   0.11     0.58    0.29    1.00

In the correlation matrix for the films dataset, the correlations are given in
essentially the same matrix format as the scatterplot matrix, except that
the response variable income has been left out of the correlation matrix.
Notice, also, that this correlation matrix is in fact only showing half of the
matrix! This is because the ‘missing’ values are simply mirror images of
the values already there (since the correlation of two variables x1 and x2 is
the same as the correlation of x2 and x1 ). So, like the scatterplot matrix,
the missing elements above and to the right of the main diagonal do not
provide any more information regarding the correlations between the
variables.
So, once we have a correlation matrix, what are we looking for? Well, as
you will find in Activity 25, we need to look for pairs of explanatory
variables which have high correlations. In this context, we’ll use the
following rule of thumb:
• a correlation with an absolute value ≥ 0.7 is considered to be high.
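Both tools are easy to produce in R. A sketch, assuming the films data are in a data frame called films whose first column is the response income (both the name and the column order are assumptions here):

# Scatterplot matrix of all the variables
pairs(films)

# Correlation matrix of the explanatory variables, to 2 decimal places
round(cor(films[, -1]), 2)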


Activity 25 Correlation matrix of the films dataset

Using the correlation matrix for the films dataset, which pairs of
explanatory variables are highly correlated?

How does Activity 25 help in selecting explanatory variables to include in


the model? Well, because the correlations between comments, dislikes,
likes and views are of the order of 0.7 or 0.8, we should consider the
possibility of omitting one or more of these variables from the model to
avoid possible problems with multicollinearity.
So far, we have looked at the scatterplot matrix to see which explanatory
variables seem to be related to the response, and we’ve used both the
scatterplot matrix and the correlation matrix to investigate which
explanatory variables might be highly correlated (and therefore might
cause multicollinearity problems). In Activity 26, we will consider one more way
to help us in our selection of explanatory variables.

Activity 26 Modelling a film's income using all eight explanatory variables

The model of the transformed response (√income) with all eight
explanatory variables,
√income ∼ budget + rating + screens + views + likes + dislikes + comments + followers,
was fitted to data from the films dataset.
(a) The estimated regression coefficients, their standard errors, t-values
and associated p-values are given in Table 10.

Table 10 Coefficients for √income ∼ budget + rating + screens +
views + likes + dislikes + comments + followers

Parameter   Estimate   Standard error   t-value   p-value
Intercept     −4.518            1.509    −2.995     0.003
budget         0.035            0.005     7.251   < 0.001
rating         1.066            0.235     4.530   < 0.001
screens        1.088            0.171     6.350   < 0.001
views         −0.286            0.119    −2.407     0.017
likes          0.095            0.039     2.424     0.016
dislikes       1.585            0.667     2.377     0.019
comments      −0.548            0.247    −2.217     0.028
followers      0.087            0.065     1.345     0.180


(i) On the basis of Table 10, which explanatory variables appear to
be most important?
(ii) Why is it not appropriate simply to settle on the regression
model that includes only those variables with the smallest
p-values in Table 10?
(b) The diagnostic plots for the fitted model are shown in Figure 36. Do
these plots raise any concerns?

[Figure: four diagnostic plots: (a) residuals versus fitted values, (b) normal probability plot of the standardised residuals against theoretical quantiles, (c) standardised residuals versus leverage, and (d) a Cook's distance plot against observation number.]
Figure 36 Diagnostic plots for √income with all eight explanatory variables


So, let’s summarise what our exploration of the films dataset has so far
suggested regarding the choice of explanatory variables for modelling
income:
• The bottom row of the scatterplot matrix suggested that income might
depend on budget, rating and screens.
• The scatterplot matrix and the correlation matrix flagged up that the
inclusion in the model of all four explanatory variables comments,
dislikes, likes and views might be unnecessary (and also possibly
both explanatory variables screens and budget might be unnecessary).
• The fitted multiple regression model suggested that all of the
explanatory variables should be included in the model, except perhaps
for followers.
But can we do something more formal to select explanatory variables?
Indeed we can – and in very many ways! However, in this module we will
only consider one of these methods, known as stepwise regression.
Before introducing stepwise regression (in Subsection 5.3), we will first
introduce some ways to measure how well a model fits.

5.2 Measuring how well a model fits


We will now discuss two measures that can be used to assess how well a
regression model fits the data: the percentage of variance accounted for
and the Akaike information criterion (AIC). It is important to note that
each of these measures is often used to compare regression models, rather
than as an absolute measure of fit.

5.2.1 Percentage of variance accounted for


One of the purposes of statistical modelling is to build a model which can
help to explain the variation in the response variable. The next example
illustrates what we mean by this.

Example 12 Explaining the variation


Consider the FIFA 19 dataset once again and the model
strength ∼ weight.
From the boxplot of strength shown in Figure 37, we can get a feel
for the variation in the response: it looks like the footballers' strength
scores vary roughly from just under 60 to just over 80.


[Figure: a boxplot of strength score, ranging roughly from just under 60 to just over 80.]
Figure 37 Boxplot of strength

Now, the scatterplot of strength and weight given in Figure 38


shows that lighter footballers tend to have lower strength scores,
whereas heavier players tend to have higher strength scores. As a
result, the variation in strength scores for footballers of a particular
weight is actually much smaller than the variation in strength scores
across all of the footballers.

[Figure: a scatterplot of strength score (60 to 80, vertical axis) against weight in lb (150 to 190, horizontal axis).]
Figure 38 Scatterplot of strength and weight

This difference in variation of strength scores is illustrated in


Figure 39, in which the fitted line for the model strength ∼ weight
has been added to a scatterplot of strength and weight: the scatter
of strength values about the fitted line at any particular value of
weight is clearly smaller than the scatter of strength values overall.
As such, some of the variation in strength has been explained or
accounted for by the model
strength ∼ weight.


[Figure: the scatterplot of strength score against weight with the fitted line added; annotations contrast the overall scatter of strength score with the smaller scatter of strength score about the fitted line.]
Figure 39 Scatterplot of strength and weight, together with the fitted line for strength ∼ weight

So, as we’ve seen in Example 12, a model can help to explain the variation
in the response variable. The measure of model fit being considered in this
subsection is a measure of the percentage of variation in the response
variable that can be accounted for – or explained – by the model.
In order to make it easier to understand the ideas being presented here, we
will look at a subset of just 10 of the observations from the FIFA 19
dataset (taking every 10th observation in the dataset only). A scatterplot
of strength and weight for this small dataset is given in Figure 40.


[Figure: a scatterplot of strength score (60 to 80, vertical axis) against weight in lb (150 to 180, horizontal axis) for the 10 observations.]
Figure 40 Scatterplot of strength and weight for a subset of 10 observations from the FIFA 19 dataset

Now, we would like a measure of the percentage of variation in the


response that can be accounted for by the model. To do this, we’ll break
the variation in the response – the total variation – into two types of
variation:
• the variation that can be explained by our model – the explained
variation
• the variation that still remains and can’t be explained by our model –
the residual variation.
Let’s start by looking at the total variation in the response variable.
Denoting our observed responses as $y_1, y_2, \ldots, y_n$, an obvious measure of
the total variation in these responses is the sample variance, which can be
calculated using the (usual) formula:
$$\text{sample variance} = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \overline{y})^2,$$
where $\overline{y}$ is the sample mean of the responses. The numerator of the sample
variance is often referred to as the total sum of squares, which we'll
abbreviate to TSS, so
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \overline{y})^2.$$


The distances on which the TSS is based (that is,
$y_1 - \overline{y}, y_2 - \overline{y}, \ldots, y_n - \overline{y}$)
are illustrated for the reduced version of the FIFA 19 dataset in Figure 41.

[Figure: the scatterplot of strength against weight for the 10 observations, with a horizontal line at the mean $\overline{y}$ and dotted vertical lines showing the distances $y_i - \overline{y}$.]
Figure 41 Scatterplot of strength and weight: the dotted vertical lines show the distances used to calculate TSS

Now, it can be shown (although we won't do so here) that the total sum of
squares can be partitioned into two separate sums of squares:
• a sum of squares associated with the variation that can be explained by
our model – the explained sum of squares, which we'll abbreviate to
ESS
• a sum of squares associated with the variation that cannot be explained
by our model – the residual sum of squares, which we'll abbreviate to
RSS.
Let's look at each of these in turn.
We’ll start by considering the ESS – the sum of squares associated with
the variation that can be explained by our model. As with any regression
model, for given value(s) of any explanatory variable(s), the corresponding
response Yi is modelled to be the fitted value ybi . So, a measure of how
these fitted values yb1 , yb2 , . . . , ybn vary about the overall mean y will give us
a measure of the variation explained by our model, so that
n
X
ESS = yi − y)2 .
(b
i=1


The distances on which the ESS is based (that is, $\widehat{y}_1 - \overline{y}, \widehat{y}_2 - \overline{y}, \ldots, \widehat{y}_n - \overline{y}$)
are illustrated for the reduced version of the FIFA 19 dataset in Figure 42.

[Figure: the scatterplot of strength against weight for the 10 observations, with the fitted values shown as red dots and dotted vertical lines showing the distances $\widehat{y}_i - \overline{y}$.]
Figure 42 Scatterplot of strength and weight: the fitted values are shown as red dots and the dotted vertical lines show the distances used to calculate ESS

We’ll now consider the RSS – the sum of squares associated with the
variation which can’t be explained by the model. Now, Yi modelled to be
the fitted value ybi using our model. So, a measure of how the actual
response values y1 , y2 , . . . , yn vary around their fitted values yb1 , yb2 , . . . , ybn
will give us a measure of the variation which still remains after fitting our
model, that is, the residual variation:
n
X
RSS = (yi − ybi )2 .
i=1

The distances on which the RSS is based (that is,
$y_1 - \widehat{y}_1, y_2 - \widehat{y}_2, \ldots, y_n - \widehat{y}_n$)
are illustrated for the reduced version of the FIFA 19 dataset in Figure 43.


[Figure: the scatterplot of strength against weight for the 10 observations, with the fitted values shown as red dots and dotted vertical lines showing the distances $y_i - \widehat{y}_i$.]
Figure 43 Scatterplot of strength and weight: the fitted values are shown as red dots and the dotted vertical lines show the distances used to calculate the RSS

So, putting this all together, we get the result that the TSS can be
partitioned into the ESS and the RSS:
TSS = ESS + RSS.

Activity 27 The sums of squares for the reduced FIFA 19 dataset
The model
strength ∼ weight
was fitted using the reduced version of the FIFA 19 dataset.
The total sum of squares (TSS) for this fitted model was calculated to be
515.60 and the residual sum of squares (RSS) was calculated to be 111.44.
(a) What is the explained sum of squares (ESS) for this model?
(b) Based on the values of these sums of squares, does this model seem to
explain the variation in the response fairly well?

The sums of squares are summarised in Box 21.


Box 21 TSS, ESS and RSS

In regression, TSS, ESS and RSS all represent sums of squares.
• The TSS is the total sum of squares and gives a measure of how
the observed responses $y_1, y_2, \ldots, y_n$ vary about the overall response
mean $\overline{y}$:
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \overline{y})^2.$$
• The ESS is the explained sum of squares and gives a measure of
how the fitted values $\widehat{y}_1, \widehat{y}_2, \ldots, \widehat{y}_n$ vary about the overall response
mean $\overline{y}$:
$$\mathrm{ESS} = \sum_{i=1}^{n} (\widehat{y}_i - \overline{y})^2.$$
• The RSS is the residual sum of squares and gives a measure of how
the observed responses $y_1, y_2, \ldots, y_n$ vary about the fitted values
$\widehat{y}_1, \widehat{y}_2, \ldots, \widehat{y}_n$:
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2.$$
The three sums of squares are related:
$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}.$$
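These sums of squares are straightforward to compute in R. A sketch, assuming the reduced dataset is in a data frame called fifa10 (an assumed name) with variables strength and weight:

# Fit the model and extract the observed and fitted responses
fit <- lm(strength ~ weight, data = fifa10)
y <- fifa10$strength
yhat <- fitted(fit)

# The three sums of squares
TSS <- sum((y - mean(y))^2)
ESS <- sum((yhat - mean(y))^2)
RSS <- sum((y - yhat)^2)

# The partition TSS = ESS + RSS should hold
all.equal(TSS, ESS + RSS)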

Recall that the measure of model fit being considered in this subsection is
a measure of the percentage of variation accounted for by the model. The
values of ESS and TSS provide us with information regarding what
proportion of the total variation is explained by the fitted model, so one
way to measure the percentage of variation accounted for by the model is
by comparing the values of ESS and TSS. One such measure – known as
the R2 statistic (pronounced ‘R squared statistic’), or simply R2 – does
precisely that. The R2 statistic is defined in Box 22.

Box 22 The $R^2$ statistic

The $R^2$ statistic, or just $R^2$ for short, is calculated as
$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}}. \qquad (1)$$
Since TSS = ESS + RSS, the value of $R^2$ lies between 0 and 1.
$R^2$ is often given as a percentage (between 0 and 100%), rather than
as a proportion.
$R^2$ is also sometimes called the multiple correlation coefficient.


Activity 28 Large or small $R^2$?

For a model which fits the data well, would the value of $R^2$ be large or
small?

When looking at the sums of squares for the fitted model using the
reduced FIFA 19 dataset in Activity 27, we concluded that the model
seems to explain the variation in the response fairly well, since the ESS
was almost four times the value of the RSS. Let's see what the value
of $R^2$ is for this model.

Activity 29 $R^2$ for the reduced FIFA 19 dataset

The sums of squares were calculated for the fitted model
strength ∼ weight
(using the reduced version of the FIFA 19 dataset), resulting in the values
TSS = 515.60, ESS = 404.16 and RSS = 111.44.
What is the value of $R^2$ in this case?

So, can we use $R^2$ as our measure of 'the percentage of variation accounted
for' to compare the fits of different models? Well, many statisticians do,
but there is a problem with doing so. Since the fit of a model can't get any
worse by adding extra explanatory variables to the model, the value of the
ESS, and hence the $R^2$ statistic, can't get smaller as explanatory variables
are added to the model. As a result, the value of $R^2$ must always be at
least as large for a model with more explanatory variables as for one with
fewer, and so we'd always prefer the model with the most explanatory
variables. That, however, goes against the principle of parsimony (that is,
the idea that we'd like to find the model which fits well, but is as simple as
possible).
Luckily, we can adjust the $R^2$ statistic so that it does take into account the
principle of parsimony. This new value is (very imaginatively!) called the
adjusted $R^2$ statistic. In this module, it is the adjusted $R^2$ statistic that
will be used as a measure of 'the percentage of variance accounted for'. The
adjusted $R^2$ statistic is usually denoted by $R_a^2$ and is defined in Box 23.

Box 23 The adjusted $R^2$ statistic ($R_a^2$)

The adjusted $R^2$ statistic, usually denoted by $R_a^2$, is a modified
version of $R^2$ that adjusts for the number of explanatory variables in
the model, which makes it useful for comparing the fit of multiple
regression models that contain different numbers of explanatory
variables. $R_a^2$ increases with the addition of a new explanatory
variable to the model only if the new explanatory variable enhances
the model fit more than would be expected by chance.
The formula for $R_a^2$ is given by
$$R_a^2 = 1 - \frac{n-1}{n-(q+1)}\,(1 - R^2), \qquad (2)$$
where, as usual, n is the number of observations and q is the number
of explanatory variables in the model. As with $R^2$, the value of $R_a^2$ is
often reported as a percentage.
For a model fitted to some data, $R_a^2$ is always less than $R^2$.
In this module, we'll use $R_a^2$ as a measure of 'the percentage of
variance accounted for' by our model.

Activity 30 $R_a^2$ for the reduced FIFA 19 dataset

For the model
strength ∼ weight,
fitted using the reduced FIFA 19 dataset of 10 observations, the $R^2$
statistic was calculated in Activity 29 to be 0.784.
For this model and these data, what is the value of the percentage of
variance accounted for?

You will see both $R^2$ and $R_a^2$ in action next for comparing models in
Example 13.

Example 13 Percentage of variance accounted for: comparing models for the FIFA 19 dataset
Recall the series of activities and examples where different regression
models have been fitted to model footballers’ strength scores using the
FIFA 19 dataset. Specifically, these models are summarised as follows.
Model 1
Fitted in Unit 1 with the following estimated equation:
strength = 17.129 + 0.322 weight.

Model 2
Fitted in Example 1 with the following estimated equation:
strength = −10.953 + 0.252 weight + 0.558 height.


Model 3
Fitted in Activity 4 with the following estimated equation:
strength = −28.305 + 0.273 weight + 0.681 height + 0.085 marking.

The $R^2$ and $R_a^2$ statistics for these models (in percentages) are given
as follows.
• Model 1: $R^2$ = 36.54%, $R_a^2$ = 35.89%.
• Model 2: $R^2$ = 39.36%, $R_a^2$ = 38.11%.
• Model 3: $R^2$ = 48.83%, $R_a^2$ = 47.23%.
As expected, the values of $R^2$ increase from Model 1 to Model 3 as a
result of increasing the number of explanatory variables. Notice also
that, as expected, the value of $R_a^2$ in each model is less than its
corresponding $R^2$ value. The values of $R_a^2$ also increase from
Model 1 to Model 3, but since they have been adjusted for the number
of explanatory variables in each model, these increases reflect actual
improvements in model fit.
As discussed before, we will be using $R_a^2$ as our measure of the
percentage of variance accounted for.
Comparing $R_a^2$ for the three models, it is clear that adding the extra
explanatory variables has improved the fit: the percentage of variance
of strength that is accounted for by the model increased from 35.89%
to 38.11% when height was added to the model, and increased
from 38.11% to 47.23% when marking was also added to the model.

You have just seen in Example 13 that the values of $R^2$ and $R_a^2$ can both
increase when more explanatory variables are added to the regression model.
But this is not always the case. If an additional variable does not actually
improve the model fit, the value of $R^2$ will still increase, but the value of
$R_a^2$ may decrease, reflecting the fact that the model fits better without
that extra explanatory variable. You will see this happening in Activity 31.

Activity 31 Percentage of variance accounted for: comparing models for the Brexit dataset

The Brexit dataset was used to fit three different regression models:
Model 1: leave ∼ age + income
Model 2: leave ∼ age + income + noQual
Model 3: leave ∼ age + income + noQual + migrant.


The $R^2$ and $R_a^2$ statistics for these models are given as follows:
Model 1: $R^2$ = 39.38%, $R_a^2$ = 38.86%.
Model 2: $R^2$ = 74.66%, $R_a^2$ = 74.33%.
Model 3: $R^2$ = 74.67%, $R_a^2$ = 74.23%.
(a) Comment on what the increases or decreases in the $R^2$ and $R_a^2$ values
between the three models mean.
(b) Based on the percentage of variance accounted for, which of these
three models fits the data best? Explain your choice.

We will consider one further measure of model fit in this section; this is
introduced next in Subsection 5.2.2.

5.2.2 The Akaike information criterion (AIC)


A statistical model aims to represent the underlying process that
generated the data. This is never going to be exact, and by representing
the underlying process by a model, we inevitably lose information about
the underlying process. Some models lose a lot of information about the
underlying process, whereas others lose less. Models which lose less
information are more likely to produce reliable predictions for any new
data generated by the underlying process, and, as such, can be considered
to be better models.
The Akaike information criterion, commonly written in its abbreviated
form as the AIC, assesses the quality of a model by giving a measure of
the amount of information lost by a given model relative to the amount of
information lost by other models. Naturally, we prefer models that lose less information, and hence models with lower values of AIC. The AIC also incorporates the principle of parsimony: it penalises model complexity, so that, for a comparable quality of fit, models with fewer explanatory variables have lower values of AIC. As a result, the model with the lowest AIC should be selected as the preferred model.
The formula for calculating the AIC will not be given in this module; you
will not be asked to calculate the AIC yourself, and most statistical
software packages can easily calculate AIC for you. It is just important
that you know how to use the AIC to select a model.
The AIC is summarised in Box 24.

Box 24 The Akaike information criterion (AIC)


The Akaike information criterion (AIC) is a measure of the amount of
information lost by a given model relative to the amount of
information lost by other models. It also considers the number of
explanatory variables in the model, preferring simpler models.
The best model from a set of alternatives is the model with the lowest AIC.
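In R, the AIC of one or more fitted lm objects can be obtained with the AIC() function. A minimal sketch, where the data frame name brexit is an assumption chosen to match the dataset used in the next activity:

# Compare three candidate models by AIC; lower values are preferred
m1 <- lm(leave ~ age + income, data = brexit)
m2 <- lm(leave ~ age + income + noQual, data = brexit)
m3 <- lm(leave ~ age + income + noQual + migrant, data = brexit)
AIC(m1, m2, m3)  # returns a small table of df and AIC values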


In the next activity, you will explore how the AIC can be used in practice
to select the best regression model from the three models given in
Activity 31.

Activity 32 AIC for comparing models for the Brexit dataset
In Activity 31, the Brexit dataset was used to fit three different regression
models:
Model 1: leave ∼ age + income
Model 2: leave ∼ age + income + noQual
Model 3: leave ∼ age + income + noQual + migrant.
The AIC for each of these models turned out to be:
Model 1: AIC = 819.7
Model 2: AIC = 615.0
Model 3: AIC = 616.9.
(a) Based on the AIC of each model, which model should have the best
fit? Explain your choice.
(b) In part (a) of this activity, did you choose the same model as in
part (b) of Activity 31?

Although Rₐ² and the AIC might suggest the same model, be aware that this is not always the case.

5.3 Stepwise regression for choosing explanatory variables
In this subsection you will learn how to use stepwise regression as a tool
for selecting explanatory variables in multiple regression.
Stepwise regression is a step-by-step iterative procedure in which a
regression model is built at each step with a different selection of
explanatory variables. The procedure continues by adding or removing
potential explanatory variables in turn and calculating a selection criterion
– for example, the AIC – to assess how well the model fits at each step. By comparing the values of the selection criterion across steps, the best model is finally determined.
The stepwise regression method itself can be performed in three different
ways: forward (by adding one extra explanatory variable to the model at
each step), backward (by removing one explanatory variable from the
model at each step), or both. We will start by considering the forward
stepwise regression procedure in the next subsection.


5.3.1 Forward stepwise regression


The main idea of the forward stepwise regression procedure is to start
with fitting what is known as a null regression model; the null model is
a model with an intercept only and none of the explanatory variables.
Then we move forward in steps by adding one extra explanatory variable
and fitting the new model at each step. The variable which is added
should ‘improve’ the model in some sense. The procedure stops when no
more explanatory variables can be added to improve the model further.
There are, in fact, many criteria that can be used to measure the
improvement of a model as a result of adding (or removing) an
explanatory variable. In this module, we will use the AIC as our criterion.
So, in terms of forward stepwise regression, at each step we select the
model with the explanatory variable that gives the lowest value of the AIC
when added to the model.
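Before working through the procedure step by step, it is worth noting that in practice the search is automated: in R, for instance, the step() function carries out the whole search and prints a table of AIC values at each step, much like the tables in the example that follows. A minimal sketch, assuming a data frame films containing the variables used in the example below:

# Forward stepwise selection by AIC, from the null model towards the full model
null <- lm(sqrt(income) ~ 1, data = films)
full <- lm(sqrt(income) ~ budget + screens + rating + followers +
             likes + comments + dislikes + views, data = films)
step(null, scope = formula(full), direction = "forward")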
The following example illustrates how to do this with the films dataset.

Example 14 Modelling a film’s income using forward stepwise regression
In this example we’ll apply the forward stepwise regression procedure to data from the films dataset. Following Activity 26, we will take √income as our response, and then use forward stepwise regression to select the explanatory variables. This example is quite long! You might want to get yourself a nice cup of tea and a biscuit or two . . .
Step 0
We start the procedure by fitting a null model for the response √income with only an intercept term. That is, we obtain the fitted model
√income = 6.812.
This null model gives an AIC of 495.19.
Step 1
There are eight possible explanatory variables in this dataset, so in
Step 1 of the procedure, we fit eight extra models, each with only one
explanatory variable added to the intercept in turn.


We then have the eight fitted models:
√income = 3.872 + 0.058 budget,
√income = 2.321 + 2.012 screens,
√income = −3.425 + 1.593 rating,
√income = 6.050 + 0.336 followers,
√income = 5.987 + 0.082 likes,
√income = 6.142 + 0.471 comments,
√income = 6.136 + 1.243 dislikes,
√income = 6.275 + 0.160 views.

The AIC values obtained from each of these Step 1 models are listed
in ascending order in the following table.

Change AIC
Adding budget 376.32
Adding screens 391.77
Adding rating 474.36
Adding followers 487.00
Adding likes 489.69
Adding comments 491.15
Adding dislikes 491.15
Adding views 494.40
None 495.19

The last line, denoted by ‘None’, gives the AIC of the current model,
that is, the model with the intercept only.
Note that all of the eight models have AIC values lower than that of
the null model (495.19). This means that adding any individual
explanatory variable to the intercept will give a better model than
just having the intercept.
The model with the smallest AIC value (with AIC = 376.32) is
obtained by adding budget to the intercept. The best model in Step 1
based on the AIC values is thus the model containing budget: this
represents the outcome of Step 1.
Step 2
In Step 2, we start with the model including the intercept and budget
(with AIC = 376.32), and then we fit seven extra models, each with
one of the remaining explanatory variables added to the intercept and
budget. So, we have the fitted models:

√income = 2.181 + 0.039 budget + 1.187 screens,
√income = −0.674 + 0.054 budget + 0.737 rating,
and so on.


The AIC values obtained from each of these Step 2 models are listed
in ascending order in the following table.

Change AIC
Adding screens 336.54
Adding rating 369.50
Adding likes 371.87
Adding dislikes 375.50
Adding followers 375.76
None 376.32
Adding views 377.49
Adding comments 377.76

There are now five explanatory variables (those in the first five lines of
Step 2) each giving a smaller AIC when added to the model. The
sixth line, denoted by ‘None’, gives the AIC of the model obtained in
Step 1 which includes the intercept and budget only.
The model with the smallest AIC (with AIC = 336.54) is that obtained
when screens is added to the model with both the intercept and
budget. This is therefore the best model in Step 2 and will be the
starting point for Step 3.
Step 3
Similarly at Step 3, we fit six models, each with one of the remaining
explanatory variables. That is, we have the fitted models:

√income = −3.627 + 0.033 budget + 1.268 screens + 0.923 rating,
√income = 2.008 + 0.039 budget + 1.134 screens + 0.028 likes,
and so on.
The AIC values obtained from each of these Step 3 models are listed
in ascending order in the following table.

Change AIC
Adding rating 320.59
Adding likes 336.45
None 336.54
Adding followers 336.87
Adding comments 338.53
Adding dislikes 338.53
Adding views 338.54

You should be familiar by now with what is going on here!


The best model in Step 3 is the one that contains rating together
with the intercept, budget and screens.


Step 4
The process continues in the same way in Step 4. That is, we have the
fitted models:

√income = −3.672 + 0.032 budget + 1.253 screens + 0.914 rating + 0.077 followers,
√income = −3.632 + 0.033 budget + 1.226 screens + 0.903 rating + 0.021 likes,
and so on.
The AIC values obtained from each of these Step 4 models are listed
in ascending order in the following table.

Change AIC
None 320.59
Adding followers 321.14
Adding likes 321.23
Adding dislikes 321.39
Adding views 322.59
Adding comments 322.59

Here, the AIC of the starting model, denoted by ‘None’ in the first line of Step 4, is actually the smallest (with AIC = 320.59). This means
that adding any of the explanatory variables to the model already
containing budget, screens and rating simply increases the AIC. As
such, we cannot further improve the quality of the model.
Conclusion
Our final conclusion is that the forward stepwise regression procedure
suggests that the best model is the one including an intercept,
budget, screens and rating, with the fitted equation

√income = −3.627 + 0.033 budget + 1.268 screens + 0.923 rating.
Given our earlier exploration, in Example 10 (Subsection 5.1), of
which variables might have been important, this choice seems pretty
sensible. Certainly, the three selected variables – budget, screens
and rating – appeared important from looking at the scatterplot
matrix, and their coefficients also appeared highly significant from
Table 10 in Activity 26 (Subsection 5.1).
Note that the correlation between screens and budget is 0.58, but
this level of correlation should not preclude the two variables from
appearing together.


Working your way through Example 14, you might be able to now guess
how the backward stepwise regression procedure would work. Let’s see
whether you are correct!

5.3.2 Backward stepwise regression


As you might have guessed, the backward stepwise regression
procedure works by successively dropping variables rather than adding
them.
The starting point, Step 0, is the full regression model; that is, the
model containing all of the explanatory variables. In each successive step, the set of reduced models obtained by dropping, in turn, each of the explanatory variables from the model obtained in the previous step is considered.
• If dropping an explanatory variable leads to a better model than the
model from the previous step, then the best model is selected and we
move on to the next step.
• If the model from the previous step is better than all of the models
obtained by dropping an explanatory variable, then the backward
stepwise procedure stops.
As with forward stepwise regression, in this module we will use AIC as our
criterion of how good a model is.
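In R, the same step() function used for the forward procedure can run the backward search; a minimal sketch, again assuming a data frame films containing the variables used below:

# Backward stepwise selection by AIC, starting from the full model
full <- lm(sqrt(income) ~ budget + rating + screens + views +
             likes + dislikes + comments + followers, data = films)
step(full, direction = "backward")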
The next example illustrates backward stepwise regression.

Example 15 Modelling a film’s income using backward stepwise regression
In Example 14, we applied the forward stepwise regression procedure
to data from the films dataset. We will now apply the backward
stepwise regression procedure to the same data.
Step 0
We start the procedure by fitting the full model with all eight
explanatory variables, that is:

√income ∼ budget + rating + screens + views + likes + dislikes + comments + followers.
Fitting the full model, we obtain an AIC of 316.34.
Step 1
In Step 1 we fit eight extra models, each with only one explanatory
variable dropped in turn from the full model, so that each model
contains seven explanatory variables.
The AIC values obtained from these models in Step 1 of the backward
stepwise regression are given next.


Change AIC
Dropping followers 316.24
None 316.34
Dropping comments 319.46
Dropping dislikes 320.21
Dropping views 320.35
Dropping likes 320.44
Dropping rating 334.74
Dropping screens 352.32
Dropping budget 362.37

At Step 1, the smallest AIC corresponds to dropping followers from the model. Notice that followers is the only explanatory variable whose removal produces a smaller AIC value than that of the full model (which is in the second line, denoted ‘None’).
This suggests that the full model can be improved by simply dropping
followers. The improved model includes the seven explanatory
variables budget, rating, screens, views, likes, dislikes and
comments.
Step 2
We repeat the same process starting with the improved seven-variable
model resulting from Step 1. But now we drop each of the remaining
seven explanatory variables in turn from the model and monitor how
this affects the AIC. If the AIC becomes smaller, it means that the
dropped explanatory variable does not contribute much to the model.
The AIC values in Step 2 are as follows.

Change AIC
None 316.24
Dropping comments 319.41
Dropping dislikes 319.45
Dropping views 319.53
Dropping likes 320.71
Dropping rating 334.14
Dropping screens 353.01
Dropping budget 364.82

It turns out from Step 2 that dropping any of the explanatory variables from the model will give a value of AIC that is greater than that of the model with all seven variables (with AIC = 316.24).
This means that the model cannot be improved further by dropping
any more explanatory variables, and so the backward stepwise
regression procedure should stop here.
Conclusion
The backward stepwise regression procedure suggests that the best


model should be

√income ∼ budget + rating + screens + views + likes + dislikes + comments.
The model suggested by the backward stepwise regression procedure
is different from the corresponding model suggested by the forward
stepwise regression procedure in Example 14.
So which model is better? Well, dropping followers from the full
model is consistent with its coefficient being non-significant in
Table 10 of Activity 26 (Subsection 5.1). However, from the
correlation matrix for this dataset, it turned out (in Activity 25) that
the block of the four explanatory variables – comments, dislikes,
likes and views – are highly correlated and therefore should not
appear in a model together.
It would therefore be sensible to conclude that the three-variable
model suggested by forward stepwise regression may be better for the
data than the seven-variable model suggested here. Moreover,
selecting the former model also fulfils the principle of parsimony given
in Box 18 at the start of Section 5.

In Examples 14 and 15 you have seen that forward and backward stepwise
regression do not necessarily suggest the same final model. This leads to
the recommendation given in Box 25.

Box 25 A recommended stepwise regression strategy


It is recommended that, in performing stepwise regression, you use
both the ‘forward’ and ‘backward’ procedures, because there can be
useful information in the results when they differ.

Is it usually the case that the two procedures suggest different models?
Actually, no – achieving the same result with both methods often happens
and is reassuring when it does.
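For completeness, R’s step() function can also perform the combined search mentioned at the start of Subsection 5.3, in which a variable may be added or removed at each step. A minimal sketch, reusing the hypothetical null and full model objects from the earlier sketches:

# Stepwise search allowing both additions and removals at each step
step(null, scope = formula(full), direction = "both")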
In Activity 33, stepwise regression will be used to select the explanatory
variables for modelling the data from the Brexit dataset that you first used
in Activity 7.


Activity 33 Explanatory variable selection for the Brexit dataset
Recall that the Brexit dataset introduced earlier in Subsection 1.3.3
contains the response variable leave, which represents the percentage of
those who voted ‘Leave’ in the 2016 UK referendum for each local
authority area, along with the five explanatory variables: age, income,
migrant, satisfaction and noQual. The aim is to determine whether,
and how, the percentage of the ‘Leave’ voters is related to the other five
variables.
(a) The scatterplot matrix, including all of the variables, is given in
Figure 44. The correlation matrix showing the correlations between
age, income, migrant, satisfaction and noQual is also given below.

Figure 44 A scatterplot matrix of the Brexit dataset (variables: leave, age, income, migrant, satisfaction and noQual)

The correlation matrix showing the correlations between age, income, migrant, satisfaction and noQual is as follows.


              age  income  migrant  satisfaction  noQual
age          1.00
income       0.40    1.00
migrant      0.12    0.50     1.00
satisfaction −0.11   0.37     0.12          1.00
noQual       −0.31  −0.79    −0.68         −0.48    1.00

On the basis of the scatterplot matrix in Figure 44 and the correlation matrix, carry out a preliminary exploration of the relationships
between leave and the explanatory variables, and of the relationships
between the five explanatory variables. Regarding leave as the
response, which of the explanatory variables would you expect to see
in a good multiple regression model?
(b) A multiple regression model was fitted for the response variable,
leave, with all of the five variables as explanatory variables. The
results from fitting this model, the full model, are given in Table 11.
Table 11 Results from fitting the full model for the response variable leave

Parameter      Estimate   Standard error   t-value   p-value
Intercept −43.454 18.739 −2.319 0.021
age 48.464 16.917 2.865 0.005
income 0.362 0.185 1.956 0.052
migrant 89.392 69.541 1.285 0.200
satisfaction 4.863 1.932 2.517 0.013
noQual 131.058 8.846 14.816 < 0.001

Based on these results, which variables seem to be important in this model?
(c) A backward stepwise regression procedure starting from the full
regression model was performed. The procedure then stopped after
two steps, giving the results listed below.
Step 1
Change AIC
Dropping migrant 612.18
None 612.49
Dropping income 614.38
Dropping satisfaction 616.90
Dropping age 618.77
Dropping noQual 768.81

Step 2
Change AIC
None 612.18
Dropping income 613.92
Dropping satisfaction 615.04
Dropping age 617.15
Dropping noQual 814.16


Which explanatory variables are selected by this procedure? Does this make sense in the light of your initial data exploration in parts (a) and (b)?
(d) A forward stepwise regression procedure starting from the null
regression model was also performed. The results are as follows.
Step 1
Change AIC
Adding noQual 622.16
Adding income 819.45
Adding migrant 837.43
Adding satisfaction 900.88
Adding age 928.02
None 934.34

Step 2
Change AIC
Adding income 617.18
Adding age 617.30
None 622.16
Adding satisfaction 622.30
Adding migrant 624.15

Step 3
Change AIC
Adding age 615.04
Adding satisfaction 617.15
None 617.18
Adding migrant 619.17

Step 4
Change AIC
Adding satisfaction 612.18
None 615.04
Adding migrant 616.90

Step 5
Change AIC
None 612.18
Adding migrant 612.49

Based on these results of the forward stepwise regression, does the procedure suggest the same model as the backward stepwise regression procedure performed in part (c)?


Stepwise regression, as developed in Examples 14 and 15, and in Activity 33, is particularly useful when the focus is on getting an
understanding of which variables mainly affect the response. When the
focus is on prediction, however, this kind of variable selection approach has
its opponents. Two alternative approaches are worthy of a (very!) brief
mention.
The first is an argument that says that, for prediction purposes, it may be
better to retain all the explanatory variables but to change the estimates
of the βj’s. Instead of setting some of the β̂j’s to zero, as we have just been doing by selecting a subset of variables, ‘shrinkage’ estimators that modify all the β̂j’s without reducing any of them to zero have been proposed.
The second alternative for prediction argues against selecting a single best model at all. Instead, there may well be several plausible models which each fit the data reasonably well (for example, they might all have fairly similar values of Rₐ²). In the absence of compelling subject-matter knowledge that allows us to choose between them, it may be better to average our predictions over all of the reasonable models, perhaps assigning a different weight to each model according to how good it is.
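To make the weighting idea concrete, one common choice (though by no means the only one) is to base the weights on the models’ AIC values, so that better models contribute more to the averaged prediction. A minimal sketch with hypothetical AIC values:

# Akaike weights: turn AIC differences into weights that sum to 1
aic <- c(m1 = 320.6, m2 = 321.1, m3 = 321.2)   # hypothetical AIC values
delta <- aic - min(aic)                        # differences from the best model
w <- exp(-delta / 2) / sum(exp(-delta / 2))    # smaller AIC gets larger weight
# A model-averaged prediction would then be sum(w * individual predictions)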
We will finish this section by using R for choosing explanatory variables.

5.4 Using R to choose explanatory variables


In the notebook activities in this subsection, we will use data from a new
dataset; the dataset is described next.

Blood pressure of Indigenous Peruvians


A case study was designed to investigate the long-term impact of
migration on blood pressure, including migration from living at high
altitude to living at low altitude.
To perform this study, data were collected from 38 Indigenous Peruvians. All of these individuals were males and over 21. They were all born at high altitude in the Peruvian Andes, and their parents were born at high altitude too. However, all these people had since moved into the mainstream of Peruvian society at a much lower altitude.
Machu Picchu, built 2400 m above sea level between the Peruvian Andes and the Amazon Basin by the Inca civilisation
The Peru dataset (peru)
For each individual in the dataset, the response variable is:
• sbp: the man’s systolic blood pressure.
There are eight potential explanatory variables:
• age: the man’s age in years
• years: the number of years since he migrated
• weight: the man’s weight, in kilograms (kg)
• height: the man’s height, in millimetres (mm)


• chin: a measure of his chin skinfold, in mm


• forearm: a measure of his forearm skinfold, in mm
• calf: a measure of his calf skinfold, in mm
• pulse: his pulse rate in beats per minute.
(Note that years was the explanatory variable which motivated the
study.)
The data for the first five men in this dataset are given in Table 12.
Table 12 First five men from peru

age years weight height chin forearm calf pulse sbp

22 6 56.5 1569 3.3 5.0 8.0 64 120
24 5 56.0 1561 3.3 1.3 4.3 68 125
24 1 61.0 1619 3.7 3.0 4.3 52 148
25 1 65.0 1566 9.0 12.7 20.7 72 140
27 19 62.0 1639 3.0 3.3 5.7 72 106

Source: Baker and Beall, 1982, cited in The Open University, 2009, p. 11

We will start in Notebook activity 2.6 by using R to produce a scatterplot matrix and a correlation matrix for data in the Peru dataset. Then, in
Notebook activity 2.7, we will use stepwise regression for choosing the
explanatory variables for a multiple regression model, focusing in
particular on the Peru dataset.

Notebook activity 2.6 Producing a scatterplot matrix and a correlation matrix in R
This notebook explains how to produce a scatterplot matrix and a
correlation matrix in R, focusing on peru.

Notebook activity 2.7 Choosing explanatory variables for multiple regression in R
In this notebook we will use stepwise regression procedures in R to
search for the best model for sbp in peru.
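If you would like a preview of the commands used in these notebooks, they are along the following lines; this is a sketch which assumes the data have been loaded into a data frame called peru.

# Exploratory matrices for the Peru dataset
pairs(peru)          # scatterplot matrix of all variables
round(cor(peru), 2)  # correlation matrix, rounded to 2 decimal places

# Backward stepwise search by AIC for a model for sbp
full <- lm(sbp ~ ., data = peru)    # '.' stands for all the other variables
step(full, direction = "backward")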


Summary
In this unit you have been learning about multiple linear regression – an
extension of simple linear regression to situations where there is more than one explanatory variable. These extra explanatory variables may
include derived variables, such as those obtained using transformations.
You have seen that although much is the same as with simple linear
regression, the interpretation of the coefficients is not. The coefficients in
multiple regression are partial regression coefficients, reflecting the effect of
the corresponding explanatory variable on the response in the presence of
the other explanatory variables with their values held fixed. Furthermore,
as well as using a t-test to test whether individual coefficients are zero
(assuming the other coefficients take their estimates), we can test whether
all coefficients are zero using an F -test.
In this unit you have also been introduced to two new types of diagnostics:
leverage and Cook’s distance. Leverage measures the potential for an
observation to affect a regression model. The leverage of a data point is
determined by the values of its explanatory variables: an observation has
high leverage – and hence potentially affects the regression model – if the
values of its explanatory variables are far away from the ‘centre’ of the
pattern of the explanatory variables. Cook’s distance measures the
influence of an observation on the regression model. It can be thought of
as a combination of an observation’s residual and its leverage. An
observation is potentially highly influential if it has very high leverage, a
very large residual, or both high leverage and a large residual.
You have also learnt how transformations can be useful for multiple
regression when the model assumptions do not seem reasonable for the
data. Transforming the response can sometimes help to fulfil or strengthen
the model assumptions, whereas transforming the explanatory variable can
sometimes help to enhance linearity between the response and the
explanatory variables.
The unit concluded by considering the selection of explanatory variables to
include in a final multiple regression model. This is because it is preferable
to only include those variables necessary for capturing the main features of
the data. You have seen that how good a model is can be measured using
the adjusted R² statistic and the Akaike information criterion (AIC). Both
these measures adjust for the number of explanatory variables in the
model. The variables to include in a final model can be selected using
scatterplot matrices, correlation matrices and a fit of the full model, along
with stepwise regression (both forward stepwise regression and backward
stepwise regression).
The Unit 2 route map, repeated from the introduction, provides a visual
reminder of what has been studied in this unit and how the different
sections link together.


The Unit 2 route map

Section 1: The multiple linear regression model
Section 2: Prediction in multiple regression
Section 3: Diagnostics
Section 4: Transformations in multiple regression
Section 5: Choosing explanatory variables


Learning outcomes
After you have worked through this unit, you should be able to:
• explain how simple linear regression of a response variable with a single
explanatory variable can be extended to regression with more than one
explanatory variable
• interpret the coefficients in multiple regression
• fit multiple regression models in R and be able to interpret the resulting
output
• interpret leverage
• interpret Cook’s distance
• explain informally the relationship between an observation’s Cook’s
distance and its residual and leverage
• use plots to identify points that have the potential to dominate an
analysis and to identify any points that are influential
• obtain and interpret diagnostic plots in R
• use transformations and extra variables which are derived from other
variables to possibly improve the regression model
• use R for transformations in multiple regression
• use a scatterplot matrix, a matrix of correlations between explanatory
variables, a fit of the full model and regressions between the response
and each explanatory variable individually to select the explanatory
variables that are likely to be in a good final model
• use R to produce a scatterplot matrix and a correlation matrix
• compare different regression models for the same data
• use stepwise regression, from both the null model and the full model, to
arrive at a final regression model
• perform stepwise regression in R.


References
Ahmed, M., Jahangir, M., Afzal, H., Majeed, A. and Siddiqi, I. (2015)
‘Using crowd-source based features from social media and conventional
features to predict the movies popularity’, The 8th IEEE International
Conference on Social Computing and Networking, China, December 2015.
Available at: https://www.researchgate.net/publication/298352830_Using_Crowd-source_based_features_from_social_media_and_Conventional_features_to_predict_movies_popularity (Accessed: 21 March 2022).
Baker, P.T. and Beall, C.M. (1982) ‘The biology and health of Andean
migrants: a case study in south coastal Peru’, Mountain Research and
Development, 2(1), pp. 81–95.
BBC (2016) EU Referendum Results. Available at: https://www.bbc.co.uk/news/politics/eu_referendum/results (Accessed: 28 February 2022).
Becker, S.O., Fetzer, T. and Novy, D. (2017) ‘Who voted for Brexit? A
comprehensive district-level analysis’, Economic Policy, 32(92),
pp. 601–650. Available at: https://doi.org/10.1093/epolic/eix012.
Kuiper, S. (2008) ‘Introduction to multiple regression: how much is your
car worth?’, Journal of Statistics Education, 16(3).
Available at: https://doi.org/10.1080/10691898.2008.11889579.
Stewart, L. (2021) Roller Coaster Data. Available at:
https://www.kaggle.com/lyallstewart/roller-coaster-data (Accessed:
18 March 2022).
The Open University (2009) M346 Unit 5 Multiple linear regression.
Milton Keynes: The Open University.
Sullivan III, M. (2020) ‘Sullystats/Statistics6e’. Available at:
https://github.com/sullystats/Statistics6e/blob/master/docs/Data/SullivanStatsSurveyI.csv (Accessed: 9 June 2022).
Sullivan III, M. (2021) Statistics: informed decisions using data,
6th Edition. Illinois, USA: Pearson. Accompanying dataset for Chapter 5
available at: https://sullystats.github.io/Statistics6e/Data/Chapter5
(Accessed: 6 June 2022).


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, Adebayo Akinfenwa: © Cosmin Iftode / www.123rf.com
Subsection 1.2, marking footballer: © Stef22 | Dreamstime.com
Subsection 1.3.2, marking Lionel Messi: © Matthew Trommer /
www.123rf.com
Subsection 1.3.2, map of referendum results: Taken from:
https://www.bbc.co.uk/news/politics/eu_referendum/results
Subsection 2.1, Formula Rossa: © Chrisstanley | Dreamstime.com
Subsection 3.1, excited baby: © angelsimon / www.123rf.com
Subsection 3.3, boiling pan: © Akhararat Wathanasing / www.123rf.com
Subsection 4.2.2, cinema audience: © Vadymvdrobot | Dreamstime.com
Subsection 4.3, Facebook icon: © bigtunaonline / www.123rf.com
Section 5, the Earth: © abidal / www.123rf.com
Subsection 5.3.1, tea and biscuits: © olegdudko / www.123rf.com
Subsection 5.4, Machu Picchu: © paltitaviajeratrip / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
(a) The plot of residuals against weight does not indicate any serious
grounds for concern about the model. There is no clear sign of a
pattern, and no sign of unequal variance can be detected. Therefore,
the fitted model seems adequate.
(b) The plot in Figure 2 shows a clear positive relationship between the
residuals from the fitted regression model (with weight as the
explanatory variable) and footballers’ height, and it seems that
footballers with lower height tend to have negative residuals.
To understand what this means, suppose there are two players with
the same weight but with different heights. The model
strength ∼ weight
would imply that these two players would have the same strength
score. However, the residual plot in Figure 2 indicates that the
shorter player is more likely to have a negative residual; in other
words, the strength score of the shorter player is likely to be less than
that of the taller player.
Therefore, the model (with weight as the explanatory variable) does
not adequately explain all of the variation in strength, since there
seems to be further variation in players’ strength scores due to their
heights.

Solution to Activity 2
(a) The scatterplot in Figure 3 suggests an increasing linear relationship
between strength and height. There are a few points that are far
from the main scatter of the data but these should not cause concern
at this stage. Data points are evenly scattered around the main trend
with slightly more variation at the higher values of height. A simple
linear regression model therefore seems appropriate for the data.
(b) The regression coefficient value of 1.318 indicates that for each
additional inch in their height, a footballer’s strength score is
associated with an increase of 1.318 on average.
(c) Since the p-value for the regression coefficient of height is very small,
there is strong evidence that the regression coefficient is non-zero, and
therefore height is significant in explaining strength.
(d) The plot of the residuals and height in Figure 4 shows that there is a
tendency for the residuals to be lower for low values of height.
However, it seems to be mainly footballers with a height of 68 in that are giving this impression, and overall this tendency doesn’t seem to be
very marked. So, although the fitted model might not be perfect for
this dataset, overall the fit is not too bad.


(e) The plot in Figure 5 shows a clear positive relationship between the
residuals from the fitted regression model (with height as the
explanatory variable) and footballers’ weight, with footballers with
lower weight tending to have negative residuals.
This means that, for two players with the same height but with
different weights, the lighter player is more likely to have a negative
residual; in other words, the strength score of the lighter player is
likely to be less than that of the heavier player.
Therefore, the model (with height as the explanatory variable) does
not adequately explain all of the variation in strength, since there
seems to be further variability in players’ strength scores due to their
weights.

Solution to Activity 3
The estimated value (0.322) of the regression coefficient of weight in the
simple linear regression model in Activity 1 is different from its
corresponding value (0.252) in the multiple regression model in Example 1.
Similarly, the two corresponding values of the coefficient of height are
different – it is estimated as 1.318 in the simple linear regression model in
Activity 2, and 0.558 in the multiple regression model in Example 1. These
coefficient values are summarised in the following table.

Coefficient Simple Multiple


weight 0.322 0.252
height 1.318 0.558

Note from the table that both explanatory variables have lower values for
their coefficients in the multiple regression model than the corresponding
coefficients in the simple linear regression models. The reasons for these differences are discussed in Subsection 1.2.

Solution to Activity 4
(a) The regression coefficient of weight is 0.273. This means that a
footballer’s strength score is expected to increase by 0.273 if their
weight increases by one lb, and both their height and their score of
marking ability remain fixed.
The regression coefficient of height means that a footballer’s strength
score is expected to increase by 0.681 if their height increases by
one inch, and both their weight and score of marking ability remain
fixed.
Similarly, the regression coefficient of marking means that a
footballer’s strength score is expected to increase by 0.085 if their
score of marking skill increases by one unit, and both their weight and
height remain fixed.


(b) The estimated values of the regression coefficients for weight in the
model here and the model considered in Example 1 are not the same
(0.273 here, and 0.252 in Example 1). This difference is to be
expected, in the same way that a regression coefficient in a model
with two explanatory variables is expected to be different to its
corresponding coefficient in a simple linear regression model.
The two regression coefficients also have different interpretations. The
value of the regression coefficient of weight in the model with three
explanatory variables represents the partial effect of weight on
strength, given both height and marking. However, the value of the
regression coefficient of weight in the model with two explanatory
variables in Example 1 represents the partial effect of weight on
strength, given height only.

Solution to Activity 5
(a) The test statistic is calculated as
t-value = β̂1/(standard error of β̂1) = 0.252/0.0535 ≃ 4.71.

(b) The test statistic is calculated as


t-value = β̂2/(standard error of β̂2) = 0.558/0.262 ≃ 2.13.

(c) The p-values associated with the two tests are both obtained from
t(n − (q + 1)) = t(100 − (2 + 1)) = t(97).
Since the p-value associated with the regression coefficient β1 of
weight is so small (< 0.001), there is strong evidence to suggest that
β1 differs from zero.
The p-value associated with the regression coefficient β2 of height is
not so small (0.036), but we (the module team) would still judge this
to be small enough to conclude that there is evidence to suggest that
β2 also differs from zero.
So, overall, there is evidence to suggest that both β1 and β2 differ
from zero.

Solution to Activity 6
(a) Let β1 , β2 and β3 be the regression coefficients of weight, height and
marking, respectively. The test procedure being used here tests the
hypotheses
H0 : β1 = β2 = β3 = 0,
H1 : at least one of the three coefficients differs from zero.


The p-value for this test is calculated from an F(ν1, ν2) distribution where
ν1 = q = 3,
ν2 = n − (q + 1) = 100 − 4 = 96.

(b) Since the p-value associated with this test is less than 0.001, there is
strong evidence that at least one of the three regression coefficients is
different from zero. Hence, there is strong evidence that the regression
model contributes information to explain the variability in the
footballers’ strength scores.
(c) This time, the test procedure being used tests three sets of hypotheses:
H0 : β1 = 0, H1 : β1 ≠ 0 (assuming β2 = β̂2 and β3 = β̂3),
H0 : β2 = 0, H1 : β2 ≠ 0 (assuming β1 = β̂1 and β3 = β̂3),
H0 : β3 = 0, H1 : β3 ≠ 0 (assuming β1 = β̂1 and β2 = β̂2).
For each of these tests, the p-value is based on the same
t-distribution, namely t(n − (q + 1)) = t(96).
(d) For the hypotheses
H0 : β1 = 0, H1 : β1 ≠ 0 (assuming β2 = β̂2 and β3 = β̂3),
the p-value is very small (< 0.001). Therefore, there is strong evidence
to suggest that the regression coefficient of weight (β1 ) is not zero
when height and marking are in the model.
For the hypotheses
H0 : β2 = 0, H1 : β2 ≠ 0 (assuming β1 = β̂1 and β3 = β̂3),
the p-value is small (0.006). So, there is evidence to suggest that the
regression coefficient of height (β2 ) is also not zero when weight and
marking are in the model.
For the hypotheses
H0 : β3 = 0, H1 : β3 ≠ 0 (assuming β1 = β̂1 and β2 = β̂2),
the p-value is again very small (< 0.001). So, there is strong evidence
to suggest that the regression coefficient of marking (β3 ) is also not
zero when weight and height are in the model.
From these three tests, we can conclude that, in a model that contains
these three explanatory variables, there is strong evidence to suggest
that each of the explanatory variables, individually, explains the
variability in the footballers’ strength scores, given that the other
variables are in the model. This means that they are significant
together in explaining the variability in the footballers’ strength
scores.
(e) In part (d), you assessed the individual influence on the response
variable for each of weight and height when they are both included


in a model together with marking. But in Activity 5, we tested their significance in a model that contained only weight and height.
Because of the partial nature of the regression coefficients in the
multiple regression context, tests in the two models could yield
different results. In particular, there is strong evidence (p = 0.006)
that the coefficient of height is not zero in the model
strength ∼ weight + height + marking,
but not so strong evidence (p = 0.036) that the coefficient of height
is not zero in the model
strength ∼ weight + height.

Solution to Activity 7
(a) (i) The regression coefficient of age in the first model is −81.426,
indicating that a one percentage point (1% = 0.01) increase in
the percentage of people in the 30 to 44 age group in an area is
associated, on average, with a decrease of
0.01 × 81.426 = 0.81426%
in the percentage of ‘Leave’ voters in this area.
The regression coefficient of income (−2.085) in the second
model means that a one pound (£1) increase in the mean hourly
income of people in an area is associated with an average decrease
of 2.085% in the percentage of ‘Leave’ voters in this area.
(ii) The p-value associated with the test of the regression coefficient
of age is 0.004. This means that there is evidence to suggest that
the regression coefficient is not zero. Hence, there is evidence
that a voter’s age explains their tendency to vote ‘Leave’.
The p-value associated with the regression coefficient of income
is less than 0.001, showing evidence that a voter’s income also
explains their tendency to vote ‘Leave’.
(b) (i) The fitted regression equation can be written as
leave = 78.005 + 31.768 age − 2.180 income.
The regression coefficient of age is 31.768. This means that a
one percentage point (1% = 0.01) increase in the percentage of
people in the 30 to 44 age group in an area, with the mean hourly
income in this area being fixed, is associated with an increase of
0.01 × 31.768 = 0.31768%
in its ‘Leave’ voters.
Similarly, the regression coefficient of income indicates that a
one pound (£1) increase in the mean hourly income of people in
an area, with its population age being fixed, is accompanied by a
decrease of 2.18% in the percentage of ‘Leave’ voters in this area.


(ii) The p-value associated with testing the coefficient of age is 0.192. Since the p-value is relatively large, there is no evidence to
suggest that the regression coefficient of age is not zero, when
income is also in the model.
The p-value associated with testing the coefficient of income is
0.001, and so, since the p-value is small, there is evidence that
the regression coefficient of income is not zero, when age is also
in the model.
Therefore, since there is no evidence that the coefficient of age is
not zero, overall there is no evidence from the data that both age
and income together explain the percentage of ‘Leave’ voters.
(c) There is in fact no contradiction between the hypotheses tests
performed in parts (a)(ii) and (b)(ii). In part (a)(ii) we concluded
that there is strong evidence that each of age and income has an
individual impact on leave if each is the only explanatory variable in
the model. Part (b)(ii) then tests whether only one (or both) of the
two explanatory variables is significant given the other. The result in
part (b)(ii) shows that when both age and income are together as
explanatory variables in a model, there is no evidence from the data
that age has any impact on leave.
(This suggests that, although each of age and income individually has
a strong impact on leave, they should not appear together in a
multiple regression model. Choosing which explanatory variables to
include in a multiple regression model will be covered in detail in
Section 5.)

Solution to Activity 8
(a) Using the information given in Table 4, the fitted regression equation
is
speed = 23.58 + 0.218 height + 0.00182 length.

(b) The regression coefficient of height is 0.218. This means that a 1 ft increase in a roller coaster’s height, with its length being fixed, is
expected to be associated with an increase of 0.218 mph in its speed.
Similarly, the regression coefficient of length indicates that a 1 ft
increase in a roller coaster’s length, with its height being fixed, is
expected to be accompanied by an increase of 0.00182 mph in its
speed.
(c) Denoting the predicted value of speed for this new roller coaster by
yb0 , we can calculate yb0 by substituting the values of its explanatory
variables (height = 200 and length = 4000) into the fitted model in
part (a). Then
yb0 = 23.58 + (0.218 × 200) + (0.00182 × 4000)
= 74.46 ≃ 74.5.


So, for this roller coaster, the maximum speed is predicted to be 74.5 mph. (This predicted speed has been rounded to match the
accuracy of the variable speed in the original data.)

Solution to Activity 9
Denoting the predicted value of strength for this newly registered
footballer by yb0 , we can calculate yb0 by substituting the values of its
explanatory variables (weight = 170, height = 72 and marking = 65)
into the fitted model, so that
yb0 = −28.305 + (0.273 × 170) + (0.681 × 72) + (0.085 × 65)
= 72.662 ≃ 73.
So, a point prediction of the strength score of the newly registered
footballer is 73. (The predicted strength score has been rounded to the
nearest integer to match the accuracy of the variable strength in the
original data.)

Solution to Activity 10
The larger the confidence level, the wider the corresponding prediction
interval will be. Therefore, (57.3, 91.6) must be the 99% prediction interval
since it is the widest, (61.4, 87.5) must be the 95% prediction interval
because it is the next widest, and (63.5, 85.4) must be the 90% prediction
interval because it is the narrowest.

Solution to Activity 11
The 95% prediction interval tells us that it is predicted that the strength
score of the newly registered footballer is somewhere between 64 and 81.

Solution to Activity 12
(a) There are two potential outliers in the residual plot shown in
Figure 8(a): the two points with lowest residuals. However, overall,
the residual plot shows no particular pattern. The points are
randomly scattered around zero, and so the assumptions of linearity
and zero mean seem to be reasonably satisfied. Also, the vertical
scatter of the points seems to be fairly constant, and so the
assumption of constant variance is also plausible.
(b) The two potential outliers identified in part (a) are also evident in the
normal probability plot given in Figure 8(b): the two points in the
bottom left-hand corner. However, in the plot as a whole, nearly all
the points lie on, or are very close to, a straight line in the plot. The
normality assumption is therefore also very plausible.
(c) Since all of the assumptions are reasonably satisfied, the multiple
regression model seems to provide an adequate model for this dataset.


Solution to Activity 13
(a) The residual plot in Figure 9(a) shows a very clear curved pattern. As
such, the assumptions of linearity and zero mean do not seem to be
satisfied. What’s more, the vertical scatter does not look to be
constant, since the scatter for the middle fitted values seems higher
than for the small and large values. This means that the assumption
of constant variance is also in doubt.
(b) In the normal probability plot in Figure 9(b), most of the points lie
roughly on a straight line. There are, however, a few points which
deviate from the line in the top and bottom corners of the plot. So,
overall, we (the module team) wouldn’t rule out the normality
assumption based on the normal probability plot, but it also looks like
the assumption may be questionable. (You may, of course, disagree!)
(c) Since the assumptions of linearity, zero mean and constant variance
are in doubt, and the normality assumption may be questionable, this
model does not seem adequate for the roller coasters dataset.

Solution to Activity 14
The only real difference between the two original data points and the two
changed data points is the value of the explanatory variable height. The
far-left data point has a low value of height (66) in comparison to the rest
of the values of height in the dataset, whereas the value of height is one
of the more central values (72) for the other data point.
This suggests that the value of a data point’s explanatory variable might
determine whether a data point has high or low leverage.

Solution to Activity 15
Data points with both high leverage and a large residual are likely to be
influential points, and these would appear towards the lower-right or
upper-right corners of the plot.
If a data point has very high leverage, this is also likely to be an influential
point, in which case this would appear at the far right of the plot.
Data points with very large residuals are also likely to be influential points,
and these would appear near the top or the bottom of the plot.

Solution to Activity 16
When looking for influential points, we want to be looking at those points
with high leverage and/or large residuals.
Example 7 identified two of the data points as having high leverage – the
data points numbered 62 and 15. So, both of these data points have the
potential to be influential points.
The standardised residual for the data point numbered 62 isn’t quite large
enough to be classified as being ‘large’ using our general rule of thumb


(its value is just above −2), but, combined with its high leverage, it’s
possible that this data point is an influential point.
In contrast, the standardised residual for the data point numbered 15 can
be classified as being ‘large’ using our general rule of thumb (its value is
roughly 2.5). Therefore, given the high leverage for this data point, it
looks likely that this data point is an influential point.
There are two other data points with ‘large’ residuals: the data point
numbered 92 (whose standardised residual is roughly −2.5) and the data
point to its left in the plot (whose standardised residual is close to −3).
Since these residuals are large, then it’s possible that these data points are
also influential points.
Notice that there are two other data points with similar leverage to the
data point numbered 92. However, since their standardised residuals are
small (they are both fairly close to zero), it is unlikely that these data
points will be influential points.

Solution to Activity 17
(a) Figure 14 clearly shows one data point (numbered 77) with high
leverage to the far right of the data.
There is then a group of five data points with relatively high leverage
in comparison to the bulk of the data points – these are the data
points numbered 1, 2, 3, 4 and 44.
(b) Although the data point at the far right numbered 77 has the highest
leverage, its standardised residual is small (its value is less than 1),
and therefore this data point is unlikely to be an influential point.
The data point with the next highest leverage is the point
numbered 2. The standardised residual for this data point seems to be
roughly −2, and, as such, could be just about considered as ‘large’
using our rule of thumb. Since this data point also has high leverage,
it is possible that this data point is an influential point.
The standardised residuals for the points numbered 1, 3 and 4 are all
above −2, and so wouldn’t be considered as being large by our rule of
thumb. However, combined with their high leverage values, it is
possible that these are influential points.
The standardised residual for the point numbered 44 is so small (its
value is fairly close to zero) that it’s unlikely that this data point is an
influential point, even though its leverage is fairly high.
There remains one data point which could also be influential: the
data point numbered 43 at the upper-left corner of the plot, which has
a very large standardised residual value greater than 4! Although the
leverage for this data point is very small, the residual is so large that
this data point could also possibly be an influential point.


Solution to Activity 18
(a) From the Cook’s distance plot in Figure 16, the three most influential
points are those with the tallest bars – namely the data points
numbered 1, 2 and 43. They all have Cook’s distances of about 0.05
or more, which is considerably higher than the Cook’s distances of all
other data points in the sample.
Three out of the five points identified in Activity 17 as being
potentially influential points have turned out to be influential points
in terms of having highest Cook’s distances in the sample. Although
point 43 has a very small leverage, its very extreme standardised
residual has led to the third highest Cook’s distance.
(b) The data point numbered 77 with the highest leverage in Figure 14
has turned out not to be one of the three most influential points.
Although it has the maximum leverage in the sample, it does not have
one of the highest Cook’s distances since it has a standardised
residual near zero. The combination of the very high leverage with a
very small standardised residual has not led to a high Cook’s distance.

Solution to Activity 19
(a) The data points numbered 96, 124 and 228 have high leverage values
together with large standardised residuals. Although there are three
other points with even higher leverage, they have smaller standardised
residuals than the three identified points. There are also many other
points with higher standardised residuals than the identified points,
but those tend to have smaller leverage, hence their Cook’s distance
values are not expected to be high.
(b) Based on the Cook’s distance plot given in Figure 18, data point
number 124 is the most potentially influential point in the data since
it has the highest value of the Cook’s distance. The Cook’s distance
values of the other two points identified in part (a) are close to the
Cook’s distance values of many other data points in the sample, and
so these data points don’t seem to be influential.
(c) There are not any considerable differences in the estimated regression
coefficients (compared with their standard errors) and their p-values
after removing data point number 124. This data point, therefore,
does not seem to have a considerable influence on the regression
model. Although it has the highest Cook’s distance, it does not
automatically mean that it will be influential to the model. After all,
its Cook’s distance is only about 0.06 and is not that far from the
Cook’s distance values of some of the other points.


Solution to Activity 20
(a) The most noticeable feature of the residual plot is that the spread of
points about the zero residual line seems to increase as the fitted
values increase. This suggests that the assumption that the variance
of the random terms is constant may not hold. Problems with the
model assumptions can sometimes be solved by transforming the
response variable, and so this is why it is reasonable to try
transforming the response.
(b) The histogram of income in Figure 28 is very right-skew. This
suggests that transformations of income down the ladder of powers
might help.
The first transformation down the ladder of powers from x¹ is the square root transformation, and the one down from that is the log transformation, which explains why it is reasonable to consider the two transformations √income and log(income).
(c) The histogram of income in Figure 28 is very right-skew, but the histogram of √income in Figure 29 seems much more symmetric. On the other hand, the histogram of log(income) is very left-skew, indicating that the data have been transformed too far down the ladder of powers. Therefore, based on these two histograms, we (the module team) would recommend applying the square root transformation to income.

Solution to Activity 21
(a) To enhance linearity between the response and any of the explanatory variables, we can try transforming the associated explanatory variables. The scatterplot in Figure 31(a) shows a roughly linear increasing relationship between √income and budget. On the other hand, although the scatterplot in Figure 31(b) seems to show an increasing relationship between √income and screens, this relationship is clearly not linear. So, a transformation of screens may improve the linearity assumption, but there is no need to transform budget since the relationship between √income and budget already seems to be roughly linear.

(b) From the scatterplot of √income and screens given in Figure 31(b),
the relationship between √income and screens is certainly
non-linear. The relationship between √income and the transformed
explanatory variable screens², shown in Figure 32(a), seems to be
more linear, although still a little non-linear. However, the
relationship between √income and the transformed explanatory
variable screens³, shown in Figure 32(b), does seem to be roughly
linear, and so transforming screens to screens³ seems to be a
sensible way forwards.
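(A minimal R sketch of how these transformations might be compared and then used in a fitted model, assuming a data frame filmData containing income, budget and screens; these object names are our assumptions:)

# Histograms of the response under the two candidate transformations
hist(sqrt(filmData$income))
hist(log(filmData$income))
# Fit the model with transformed response and explanatory variable;
# I() makes R read ^ as arithmetic rather than formula syntax
filmModelTrans <- lm(sqrt(income) ~ budget + I(screens^3),
                     data = filmData)
summary(filmModelTrans)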


Solution to Activity 22
(a) In the residual plot given in Figure 33 of the fitted model with the
transformed variables, the points seem to be fairly randomly scattered
about the zero residual line. As such, it looks like both the assumption
of constant variance of the random terms and the linearity assumption
are plausible for the regression model with the transformed variables.
(b) For the new film, since budget is given in millions of US dollars, the
value of budget for this new film is 50.
Since the variable screens is given in thousands and the new film is
to be initially launched on 2500 screens, the value of screens for this
film is 2.5. Therefore, the value of screens³ is 2.5³ = 15.625.

We can now use the fitted equation to predict √income for this film:
√income = 3.01 + (0.025 × 50) + (0.107 × 15.625) ≃ 5.932.
The predicted income of the new film is therefore 5.932² ≃ 35.
So, for the new film with a production budget of 50 million US dollars
and initially being launched on 2500 screens, the predicted income is
35 million US dollars.
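(In R, the same prediction might be obtained from the fitted model directly, squaring the result to return to the original scale; filmModelTrans is the assumed name of the fitted model from the earlier sketch:)

# Predict sqrt(income) for the new film, then square it
newFilm <- data.frame(budget = 50, screens = 2.5)
predSqrt <- predict(filmModelTrans, newdata = newFilm)
predSqrt^2   # predicted income, in millions of US dollars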

Solution to Activity 23
There do not seem to be many strong single relationships between income
and the explanatory variables; only the relationships with budget, rating
and screens seem fairly clear. However, these relationships do not seem to
be linear.

Solution to Activity 24
The explanatory variables comments, dislikes, likes and views seem to
be very closely related to each other, since the scatterplots between this set
of explanatory variables show clear relationships between each pair of
explanatory variables.
It also looks like screens and budget may be related, since the scatterplot
for these two explanatory variables also shows a clear relationship.

Solution to Activity 25
Using the rule of thumb that a correlation is high if the absolute value of
the correlation is 0.7 or more, the only correlations which we would
identify as being high are those between the four explanatory variables
comments, dislikes, likes and views.


Solution to Activity 26
(a) (i) From Table 10, it is easy to cast your eye down the list of
p-values given for testing the regression coefficients individually.
On this basis, the strongest dependence is on budget, rating
and screens, since all of these explanatory variables have
p-values very close to zero.
All of the other explanatory variables, except followers, also
have fairly small p-values, which suggests that their associated
regression coefficients are also non-zero.
(ii) While it may be sensible to include all of the explanatory
variables (except followers) in the model, it is not guaranteed
that this is a sensible thing to do. This is because each
regression coefficient is interpreted with the values of the other
explanatory variables held fixed, and so, when some explanatory
variables are absent from the model, the remaining regression
coefficients might be quite different.
(b) The residual plots do not cast any doubts on the model assumptions.
The points in the plots of residuals against fitted values look fairly
randomly scattered, and so the assumptions of linearity, zero mean
and constant variance seem reasonable.
The normality assumption also seems reasonable, since, although
there are two outliers in the plot, overall the points in the normal
probability plot generally follow a straight line.
In the residuals versus leverage plot, there are no points with high
leverage and large residuals, and so this plot does not raise any
concerns either.
In the Cook’s distance plot, none of the points stand out as having a
Cook’s distance value much higher than the rest. This suggests that
none of the points are much more influential on the model than the
others.

Solution to Activity 27
(a) We know that
TSS = ESS + RSS,
so
ESS = TSS − RSS
= 515.60 − 111.44
= 404.16.

(b) The value of ESS is almost four times larger than the value of RSS,
and so a fairly large proportion of the response variation can be
explained by the model. Therefore, the sums of squares do suggest
that the model seems to explain the variation in the response fairly
well.


Solution to Activity 28
If a model fits the data well, then we would expect the model to explain
the variation in the response well. In this case, we’d expect the explained
variation to be large in comparison to the residual (unexplained) variation,
so that ESS is large in comparison to RSS.
But, since TSS = ESS + RSS, if ESS is large in comparison to RSS, then
the value of TSS will be not much larger than ESS, and so
R2 = ESS/TSS
will be large.

Solution to Activity 29
From Equation (1), the R2 statistic is calculated as
R2 = ESS/TSS = 404.16/515.60 ≃ 0.784.
Or, as a percentage, the R2 statistic is 78.4%.

Solution to Activity 30
In this module, we are taking Ra2 to be the percentage of variance
accounted for. So, using Equation (2), Ra2 is calculated as
Ra2 = 1 − [(n − 1)/(n − (q + 1))] × (1 − R2)
    = 1 − [(10 − 1)/(10 − (1 + 1))] × (1 − 0.784) ≃ 0.757.
As a percentage, the value of Ra2 is 75.7%.
(Notice that, as expected, the value of Ra2 is less than the value of R2.)
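(In R, both statistics are reported by summary.lm, so the hand calculation above can be checked directly; here fit stands for any fitted lm object, an assumption for illustration:)

summary(fit)$r.squared       # R^2
summary(fit)$adj.r.squared   # adjusted R^2
# Or from the sums of squares used above, with n = 10 and q = 1
ESS <- 515.60 - 111.44
ESS / 515.60                               # R^2
1 - (10 - 1)/(10 - 2) * (1 - ESS/515.60)   # adjusted R^2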

Solution to Activity 31
(a) The values of R2 increase from Model 1 to Model 3. In contrast, even
though the value of Ra2 increases from 38.86% for Model 1 to 74.33%
for Model 2, which has noQual added to the model, Ra2 reduces to
74.23% for Model 3, which has migrant added to the model.
Since Ra2 adjusts for the number of explanatory variables, the decrease
in the value of Ra2 for Model 3 means that adding migrant as an
explanatory variable does not enhance the fit of the model. So, the
increase in the value of R2 seen for Model 3 must have been due
simply to the increase in the number of explanatory variables from
Model 2 to Model 3, rather than a real increase in fit.
(Notice that, as expected, the value of Ra2 in each model is less than
its corresponding R2 value.)


(b) The value of Ra2 should be used as a measure for the percentage of
variance accounted for, especially in this case, where Ra2 shows that
adding an extra explanatory variable in Model 3 does not in fact
enhance the model fit.
Comparing Ra2 for the three models, it is clear that adding noQual in
Model 2 markedly increases the percentage of variance of leave that
is accounted for by the model – from 38.86% to 74.33%. But adding
migrant in Model 3 slightly reduced the variance accounted for,
to 74.23%. Therefore, based on the values of Ra2 , of these three
models, Model 2 seems to fit the data best.

Solution to Activity 32
(a) The best model in the given set is the one with the lowest AIC.
Comparing the AIC of the three models, it is clear that adding
noQual in Model 2 markedly decreased the AIC of the model
from 819.7 to 615.0; but adding migrant in Model 3 slightly increased
the AIC to 616.9. Therefore, based on the AIC values for these three
models, Model 2 seems to be the best.
(b) Model 2 is the same model that we chose in part (b) of Activity 31,
based on the adjusted R2 . This means that both Ra2 and AIC
suggested the same model for these data.
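(A minimal R sketch: the AIC values for several models fitted to the same data can be compared in one call, assuming model1, model2 and model3 are the fitted lm objects for Models 1 to 3:)

# Compare the AIC of the three candidate models
AIC(model1, model2, model3)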

Solution to Activity 33
(a) From the last row of the scatterplots, there seems to be a strong
relationship between leave and noQual. It is clear that leave tends
to increase as noQual increases.
Although there do not seem to be any other quite so strong
relationships between leave and each of the other four explanatory
variables, the plots suggest that leave does tend to decrease as each
of income, migrant or satisfaction increases.
The relationships between leave and each of income and migrant
both seem to be fairly linear, and the relationship between leave and
noQual is clearly linear.
The scatter of the values of leave seems constant for different values
of any of the explanatory variables.
From the correlation matrix, the highest correlation between the
explanatory variables is between income and noQual (−0.79). Each of
these two explanatory variables is also correlated with migrant.
Because these correlations are high, we would not expect all three of
these explanatory variables to appear together in a regression model.
So, summarising this discussion, a good regression model might
include noQual and perhaps satisfaction as explanatory variables.
The remaining explanatory variables are not expected to be in the
model because income and migrant are highly correlated with
noQual.


The variable age is not expected to be in the model because it does
not have a strong linear relationship with the response variable,
leave.
(b) Based on the p-values of the parameter estimates, age, satisfaction
and noQual appear to be important because they have small p-values.
(c) In Step 1, five extra models were fitted, each with only one
explanatory variable dropped in turn from the full model. The
smallest AIC, corresponding to dropping migrant from the model,
suggests that the full model can be improved by dropping migrant.
The same process starting with the improved four-variable model
resulting from Step 1 is detailed in Step 2. It turns out from Step 2
that dropping any of the four explanatory variables from the model
will give a value of AIC that is greater than that of the model with all
of the four variables (with AIC = 612.18). This means that the model
cannot be improved further by dropping any more explanatory
variables.
So, after two steps, a model containing the variables age, income,
satisfaction and noQual is selected.
It is not surprising that noQual is retained in the model, since in the
scatterplot matrix in part (a) it has a strong positive relationship
with leave. The variable noQual is also highly significant in the full
regression model in part (b).
It is also not surprising to see satisfaction in the model. It is
borderline significant in the full regression model in part (b) and
shows a vaguely negative relationship with leave in the scatterplot
matrix.
Although income is highly correlated to noQual, it seems to be
significant for leave as it is also borderline significant in the full
regression model in part (b). So it is not unreasonable that this is
retained in the model.
We would not have initially expected age to be retained in the model.
Although it is significant in the full regression model in part (b), it
does not show any clear relationship with leave in the scatterplot
matrix.
(d) The forward stepwise regression procedure started with fitting a null
model – the model that only includes an intercept. This gives an AIC
of 934.34.
Then, in Step 1, five extra models were fitted, each with only one
explanatory variable added to the intercept in turn. The model with
the smallest AIC value (622.16) is obtained by adding noQual to the
intercept. So, the best model in Step 1 is the model containing
noQual.
In Step 2, four extra models were fitted and the model with smallest
AIC (617.18) is that obtained when income is added to both the
intercept and noQual.

199
Unit 2 Multiple linear regression

The best model in Step 3 is the one that contains age together with
the intercept, noQual and income.
Similarly, Step 4 suggests adding satisfaction to the model.
But finally, Step 5 indicates that adding migrant to the model
increases the AIC from 612.18 to 612.49, which means that adding
migrant does not further improve the quality of the model.
The forward stepwise regression procedure therefore suggests that the
best model is the one including an intercept, noQual, income, age and
satisfaction. This is exactly the set of explanatory variables
suggested by the backward stepwise regression procedure in part (c).
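(A minimal sketch of how both procedures might be run in R with step(), assuming the data are in a data frame called euRef, which is our assumed name; the variable names are those used in the activity:)

# Backward stepwise regression from the full model
fullModel <- lm(leave ~ age + income + migrant + satisfaction + noQual,
                data = euRef)
step(fullModel, direction = "backward")
# Forward stepwise regression from the null (intercept-only) model
nullModel <- lm(leave ~ 1, data = euRef)
step(nullModel,
     scope = ~ age + income + migrant + satisfaction + noQual,
     direction = "forward")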

Unit 3
Regression with a categorical
explanatory variable

Introduction
So far in this module, the focus has been on statistical models for a
continuous response variable Y with either one numerical explanatory
variable (using simple linear regression in Unit 1) or more than one
numerical explanatory variable (using multiple regression in Unit 2). In
the real world, however, there are many situations where we require a
statistical model for a continuous response variable, but one of the possible
explanatory variables is categorical rather than numerical. In this unit we
will consider the situation where there is just one explanatory variable that
happens to be categorical.

How Unit 3 relates to the module so far


Moving on from . . . : Regression with multiple numerical explanatory variables (Unit 2).
What’s next? Regression with one categorical explanatory variable.

We have already considered datasets with both numerical and categorical


potential explanatory variables. For example, the manna ash trees dataset
(first introduced in Unit 1) includes data for two numerical variables:
• height: the height of the tree (in metres), rounded to the nearest metre
• diameter: the diameter of the tree (in metres, to two decimal places) at
1.3 m above the ground
and one categorical variable:
• side: the side of the Walton Drive that the tree is located on, taking the
possible values west and east.
In Section 5 of Unit 1, we used simple linear regression to model the
response height taking diameter as a (continuous) explanatory variable.
But is it also possible that the categorical variable side affects the
response height? We can investigate this question by using a regression
model for height taking side as a (categorical) explanatory variable.
A categorical explanatory variable needs to be treated differently to a
numerical explanatory variable. As such, it is useful to be able to
distinguish between the two types of explanatory variables. As with much
of regression, the terminology used for the two types of explanatory
variable varies, but here (and throughout the module up to Unit 8) we will
refer to a numerical explanatory variable as a covariate and a categorical
explanatory variable as a factor. This is shown diagrammatically in
Figure 1, which follows.


[Diagram: explanatory variables split into numerical variables (covariates) and categorical variables (factors)]
Figure 1 The two types of explanatory variable

We will start in Section 1 by introducing the basic idea behind regression


with a factor, before developing the model in Section 2. In Section 3, we
will focus on how the proposed model is used, before using R for regression
with a factor in Section 4. In the second half of the unit, we will think
about the model in a different way to enable us to learn more about how
the different possible values of the factor affect the response. To do this,
we need to use a technique called analysis of variance, or ANOVA for
short. ANOVA is introduced and then developed in Sections 5 and 7,
respectively, and used in R in Sections 6 and 8. The structure of the unit
is illustrated in the following route map.

The Unit 3 route map

[Route map diagram: Section 1 (Regression with a factor: the basic idea) → Section 2 (Developing the model further) → Section 3 (Using the proposed model) → Section 4 (Using R to fit a regression with a factor) → Section 5 (Analysis of variance (ANOVA)) → Section 6 (Using R to produce ANOVA tables) → Section 7 (Analysing the effects of the factor levels further) → Section 8 (Using R to produce extended ANOVA tables)]

Note that Sections 4, 6 and 8 contain a number of notebook activities,


which means you will need to switch between the written unit and
your computer to complete these sections.


1 Regression with a factor: the basic idea
In this section, we will focus on the basic idea of regression when there is a
single explanatory variable that happens to be a factor.
In Subsection 1.1 we will introduce a dataset that can give rise to such a
regression model. In Subsection 1.2 we will explore how such data can be
plotted so that the relationship between the response and the factor can be
visualised. We then turn to developing a model for such data. First, in
Subsection 1.3, you will see how ideas from simple linear regression can be
adapted to generate a model suitable for the situation when the
explanatory variable is a factor. Then, in Subsection 1.4, we will consider a
rewritten version of this model that makes it easier to explore the
relationship between the response and the factor.

1.1 Introducing some data with covariates and factors
As has already been mentioned in the introduction, this unit is about
regression with a factor. For such modelling we need some data. We’ll
focus on a particular dataset concerning UK wages in 1994, which is
described next.

Wages
The Office for National Statistics (ONS) is the UK’s largest
independent producer of official statistics. They are responsible for
collecting, analysing and publishing statistics about the UK’s
economy, society and population.
The ONS run a number of surveys, one of which is the Labour Force
Survey (LFS). This data source uses international definitions of
employment, unemployment and economic inactivity, together with
information on a wide range of related topics such as occupation,
training, hours of work and personal characteristics of household
members aged 16 years and over at private addresses in the UK.
The LFS was first conducted biennially in 1973, and over the years has
increased to annually and then quarterly. Government departments
use the results of the survey to identify how and where they should be
using public resources, to check how different groups in the community
are affected by existing policies and to inform future policy changes.


The wages dataset (wages)


The wages dataset contains a subset of the data available from the
1994 UK LFS. There are 3331 individuals in the dataset and the data
includes the following variables:
• hourlyWageSqrt: the square root of the individual’s hourly wage
(in £)
• workHrs: the average number of hours the individual works each
week
• educAge: the age, in years, at which the individual ceased education
• gender: the gender the individual identifies with, taking the values
male and female
• edLev: the education level attained by the individual, taking
17 possible values with codes 1 (Higher degree) to 17 (No
qualifications)
• occ: the occupation of the individual, taking the value codes
1 (Professional), 2 (Employer/Manager), 3 (Intermediate
non-manual), 4 (Junior non-manual), 5 (Skilled manual),
6 (Semi-skilled manual) and 7 (Unskilled manual)
• computer: whether the individual has access to a computer at
home, taking the values yes and no.
Data for the variable hourlyWageSqrt is not given directly in the
LFS. Instead, the hourly wage was estimated for each individual using
their annual earnings from their main job. This variable was then
transformed by taking its square root, and only individuals with
hourly wages less than £50 were included in the wages dataset; this
was to remove a large amount of skew in the data, so that the dataset
could be analysed in this unit (and Unit 4 to follow).
The data for the first five observations from the wages dataset are
shown in Table 1.
Table 1 First five observations from wages

hourlyWageSqrt workHrs educAge gender edLev occ computer


3.58 39 22 male 7 5 no
2.45 60 15 male 8 3 no
2.80 32 15 female 17 2 no
2.46 38 15 female 17 6 no
5.68 50 2 male 4 1 yes

Source: Taylor, 1999


We’ll start with a quick look at the wages dataset in Activities 1 and 2,
before we try to model any of the data.

Activity 1 Identifying covariates and factors

Taking hourlyWageSqrt from the wages dataset as the response variable,


which of the other variables are potential covariates and which are
potential factors?

As mentioned in Activity 1, both edLev and occ from the wages dataset
are categorical variables that represent the different categories by using
numerical codes (1, 2, . . . , 17 for edLev and 1, 2, . . . , 7 for occ). It is quite
common for factors to use numerical codes like these to represent the
different possible values, and this is often done for convenience to avoid
long-winded labels; for example, the numerical code ‘3’ is used to represent
‘Intermediate non-manual’ for the factor occ.

Activity 2 Identifying a possible data entry error

Looking at the data for the first five observations from the wages dataset
given in Table 1, can you spot a potential error for one of the values of one
of the variables?

In Activity 2 we identified a potential error for one of the recorded values


of educAge. In large datasets like this, such errors can, and do, occur.
Potential errors can be difficult to spot when there are lots of data; they
may only appear as potential outliers in plots, or they may go undetected
altogether if the incorrect value isn’t unusual. It is therefore always
important to consider how reliable a dataset is (as discussed in
Subsection 2.2 of Unit 1). Datasets like the wages dataset, which are
produced by organisations such as the ONS, do tend to be generally
reliable, and so we would hope that there aren’t many errors in this
particular dataset.
We will use the wages dataset to illustrate the basic idea of regression when
the explanatory variable is a categorical variable. To do this, we’ll take
hourlyWageSqrt as our response variable and the factor occ as our single
explanatory variable. (We will consider regression models with the other
possible explanatory variables as we move through this unit and Unit 4.)
Before we take a closer look at the data for hourlyWageSqrt and occ,
there’s one more bit of terminology that will be needed for regression with
a single factor explanatory variable: the possible values that a factor can
take, known as its levels. You’ll consider levels of occ in the following
activity.


Activity 3 How many levels?


When taking the categorical variable occ as a factor in a regression model,
how many levels does occ have?

Now we have a response variable, hourlyWageSqrt, and a factor, occ, we


are ready to start exploring the relationship between them.

1.2 Visualising the relationship between the response and a factor
When we have a single covariate explanatory variable in simple linear
regression, a good starting point is to produce a scatterplot of the response
and the covariate to give an idea of the relationship between the two
variables. So, let’s start by doing the same thing here with the response
and factor: a scatterplot of the response hourlyWageSqrt and the
occupation codes of the factor occ is shown in Figure 2.

Figure 2 Scatterplot of the response hourlyWageSqrt and the occupation
codes of the factor occ from the wages dataset

Since there are a lot of data points in the scatterplot given in Figure 2, but
only seven possible values for occ, it is difficult to get a clear picture of any
relationship between the two variables. For example, there are 956 data
points plotted for which occ takes the value 5! So, instead, let’s look at a


comparative boxplot of hourlyWageSqrt over the different level codes of


occ to see whether that gives us a clearer picture; this is shown in Figure 3.

Figure 3 Comparative boxplot of the response hourlyWageSqrt over the
different level codes of the factor occ from the wages dataset
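(A comparative boxplot like Figure 3 might be produced in R along the following lines, assuming the data frame is loaded as wages and occ is stored as a factor:)

# Comparative boxplot of hourlyWageSqrt over the levels of occ
boxplot(hourlyWageSqrt ~ occ, data = wages, horizontal = TRUE,
        xlab = "hourlyWageSqrt", ylab = "occ")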

Activity 4 Describing the relationship between


hourlyWageSqrt and occ
By considering the comparative boxplot given in Figure 3, describe how
the response hourlyWageSqrt and the level codes of the factor occ seem to
be related.

Following on from Activity 4, how might the relationship described there


between the response hourlyWageSqrt and the factor occ be modelled?
Looking back at Figures 2 and 3, we can see that for these data the
relationship between hourlyWageSqrt and the levels of occ only really
becomes clear when we represent the data in a comparative boxplot.
In particular, by allowing us to focus on how the medians of
hourlyWageSqrt change with the level of occ, we can see that the medians
of hourlyWageSqrt for each occupation group seem to increase as the level
of occ decreases. So, can we use the medians of the response
hourlyWageSqrt for each of the levels of the factor occ to help us model
the relationship between hourlyWageSqrt and occ?
Well, it turns out that, rather than using the medians of hourlyWageSqrt,
we can use the means of hourlyWageSqrt for each of the levels of occ to


help us model the relationship between hourlyWageSqrt and occ. We will


develop these ideas next, in Subsection 1.3.

1.3 Adapting ideas from simple linear regression
In order to build a regression model when the explanatory variable is a
factor, we’ll start by adapting ideas from simple linear regression.
In simple linear regression, each response Yi , with covariate value xi , is
modelled as
Yi = α + βxi + Wi ,  Wi ∼ N (0, σ 2 ),  (1)
where α + βxi is the value of the line at xi and Wi is the random term.
This is illustrated in Figure 4. The red line (y = α + βx) gives the value
of Y for each value of x. So for the ith data point, the value of Y is
α + βxi and the distance of the ith data point vertically above or below
that value is given by Wi .

Figure 4 An example of Yi ’s model in simple linear regression


If two observations i and j have the same value of the covariate, so that
xi = xj , then
α + βxi = α + βxj .
In this case, the models for Yi and Yj only differ in the values of their
random terms Wi and Wj , as illustrated in Figure 5.

Figure 5 The simple linear regression model from Figure 4 for Yi and Yj
when xi = xj

When the explanatory variable is a factor, we invariably have a similar


situation where several of the observations have the same value of the
explanatory variable. For example, in Figure 2 there are 266 observations
for which occ takes the value 1. So, in theory, we could use simple linear
regression to model hourlyWageSqrt by taking the level code for the factor
occ as the value of xi in a linear regression model.
However, the level codes are just that: they are codes with no real
meaning. Although there is a logical ordering to the level codes and what
they represent for the factor occ, this is certainly not the case for many
factors. For example, there is no such logical ordering to assign numerical
codes for the variable make in the car prices dataset (Subsection 4.2.1 in
Unit 2) which gives the manufacturer of the car, such as Saturn, Pontiac
and Chevrolet. As a result, level codes for factors are often assigned
arbitrarily, and assigning different codes would invariably lead to a
different model. We therefore need to take a slightly different approach
when the explanatory variable is a factor.
Instead of basing the model on a line (yi = α + βxi ), when we have a


factor we can base the model on the means of the responses for the
different levels of the factor (as mentioned at the end of Subsection 1.2).
To help visualise what such a model would look like, Figure 6 shows a
scatterplot of hourlyWageSqrt and the level codes of occ, together with
the (sample) means of the responses associated with each of the seven
levels of occ; the model is based on these means rather than on a fitted
straight line.

Figure 6 Scatterplot of hourlyWageSqrt and the level codes of occ with the
(sample) means of the responses associated with each of the seven levels of
occ indicated by the large red circles
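(The sample means shown as the large red circles might be computed in R as follows, again assuming the data frame is loaded as wages:)

# Sample mean of hourlyWageSqrt for each level of occ
tapply(wages$hourlyWageSqrt, wages$occ, mean)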

Activity 5 A model based on the means

Let Yi denote the response hourlyWageSqrt of the ith individual in the


wages dataset. Also, let µ1 denote the (population) mean response for
individuals for which occ takes the value 1, µ2 denote the (population)
mean response for individuals for which occ takes the value 2, and so on.
Suppose that the value of occ for the ith individual is 1.
By considering the simple linear regression model given in Model (1) and
illustrated in Figures 4 and 5, together with the scatterplot of
hourlyWageSqrt and the level codes of occ given in Figure 6, suggest a
possible model for Yi based on µ1 .


In Activity 5 you suggested a possible model for Yi based on µ1 . This is


fine for individuals whose value of occ is 1, but what about the individuals
whose value of occ is something other than 1? In Example 1 and
Activity 6, we will consider just such individuals. It turns out that, when
the ith individual takes level k of a factor explanatory variable, a possible
model for the associated response, Yi , has the form
Yi = µk + Wi ,  Wi ∼ N (0, σ 2 ),  (2)
where µk is the mean response when the factor takes level k.

Example 1 A model for Y1


For the first observation of the wages dataset (given in Table 1,
Subsection 1.1), the value of occ is 5. Therefore, for Y1 , Model (2)
can be written as
Y1 = µ5 + W1 , W1 ∼ N (0, σ 2 ),
where µ5 is the (population) mean of the responses for observations
for which occ takes the value 5.

Activity 6 A model for Y2

Consider once again the data given in Table 1. Write Model (2) in a form
specifically for Y2 .

In Model (2), the value of the variance σ 2 relates to the vertical spread of
the data about the mean, just as σ 2 in Model (1) relates to the vertical
spread of the data about the regression line in simple linear regression.
Notice that, for all responses Y1 , Y2 , . . . , Yn , the model has the same
variance σ 2 . This means that, for each level of the factor, the variation of
the associated responses about their mean is roughly the same across the
different levels of the factor. This is illustrated in Figure 7, which follows.


Figure 7 How σ 2 relates to the vertical spread of observations about (a) a regression line when the
explanatory variable is a covariate, and (b) mean responses (indicated by large red circles) when the
explanatory variable is a factor

Model (2) gives us a possible model for each response Yi based on the
mean response for all observations taking the same factor level as the ith
observation. There is, however, a problem with using this to model the
relationship between the response and the factor. This is because
Model (2) essentially models the responses associated with each level of the
factor separately, whereas, in order to model the relationship between the
response and the factor, we really need a model that links the responses
across the different levels of the factor. To do this, we need to adapt the
model slightly. We consider this in the next subsection.

1.4 Adapting the model based on means


Suppose that we rewrite each µk (that is, the mean response for those
observations which take level k of the factor) in the form
µk = µ + αk ,
where
µ = a ‘baseline’ mean response common to all levels of the factor,
αk = a measure of the effect that level k of the factor has on the response.
Then, Model (2) can be rewritten as:
Yi = µ + αk + Wi ,  Wi ∼ N (0, σ 2 ).  (3)
Here µ is the baseline mean, αk is the effect of level k and Wi is the random term.


The parameter µ in Model (3) will be referred to as the baseline mean


(throughout Units 3 to 8) and αk will be referred to as the effect term or
effect parameter for level k. The baseline mean is the same for each of
the responses Y1 , Y2 , . . . , Yn ; the effect terms provide information about
how the responses differ across the levels of the factor, by comparing the
mean response for each level with the baseline mean.
There are various ways in which the baseline mean µ can be defined, but in
this module we’ll use the same convention as used by default in R:
µ = mean of the responses for the first level of the factor.
This means that µ = µ1 and α1 = 0. Then,
αk = the effect that level k of the factor has on the response
in comparison to the effect that level 1 has on the response.
These parameters are illustrated for a factor with three levels in Figure 8.

Figure 8 Illustration of the baseline mean µ and the effect terms α2 and α3
for a factor with three levels

In the next example we show how this works in practice for the first
individual in the wages dataset, before you rewrite the models for the next
four individuals in that dataset in Activity 7.


Example 2 Adapted model form for Y1


In Example 1, we proposed a model for Y1 from the wages dataset so
that
Y1 = µ5 + W1 , W1 ∼ N (0, σ 2 ),
where µ5 is the (population) mean of the responses for observations
for which occ takes the value 5.
We can adapt this model by writing it in the form of Model (3), so
that
Y1 = µ + α5 + W1 , W1 ∼ N (0, σ 2 ),
where
µ = mean response for observations for which occ is 1
(that is, the first level of occ),
α5 = effect on the response of occ being 5 in comparison to
when occ is 1.

Activity 7 Adapted model forms for Y2 , Y3 , Y4 and Y5

For the data given in Table 1 (Subsection 1.1), taken from the wages
dataset, use Model (3) to write model forms for Y2 , Y3 , Y4 and Y5 .

In Unit 1 (Box 3, Subsection 4.1), we used notation of the form


y∼x
to denote a simple linear regression model for the response Y with
covariate (numerical explanatory variable) x. We will use the same
notation when the explanatory variable is a factor. For example, a
regression model for the response hourlyWageSqrt from the wages dataset,
with the factor explanatory variable occ, will be denoted by
hourlyWageSqrt ∼ occ.
Remember, however, that despite using the same notation regardless of
whether the explanatory variable is a covariate or a factor, the associated
regression model when the explanatory variable is a covariate is not the
same as the regression model when the explanatory variable is a factor.
The basic idea of the regression model when we have a single factor
explanatory variable is summarised in Box 1. We will develop this model
further in the next section.


Box 1 The basic model idea behind regression with a


single factor
Suppose that we have the (continuous) responses Y1 , Y2 , . . . , Yn and a
single factor explanatory variable. Then, if the ith observation takes
level k of the factor, we set the regression model for Yi to have the
form
Yi = µ + αk + Wi , Wi ∼ N (0, σ 2 ),
where
µ = mean of the responses for the first level of the factor,
αk = the effect that level k of the factor has on the response
in comparison to the effect that level 1 has on the response.
Using these definitions of µ and αk means that α1 is set to be zero.

2 Developing the model further


In this section, we will develop the model summarised in Box 1 further. In
Subsection 2.1, we will introduce what are known as indicator variables or
dummy variables into the model: these allow us to represent the model by
a single equation. We will then look more closely at this model in
Subsection 2.2, in preparation for using the model in Section 3.

2.1 Introducing indicator variables into the model
We wish to model the response Y using regression with a factor
explanatory variable. Let’s call the factor A and denote the number of
levels of A by K.
Using the model specified in Box 1, if the ith observation takes level k of
factor A, then the regression model for each Yi , i = 1, 2, . . . , n, has the form
Yi = µ + αk + Wi , Wi ∼ N (0, σ 2 ), (4)
where
µ = mean response for level 1 of factor A,
α1 = 0,
and, for k = 2, 3, . . . , K,
αk = the effect on the response of level k of factor A
in comparison to the effect of level 1 of factor A.
Although this model links the responses across the K different levels of the
factor A through the (common) baseline mean µ, the model still has


separate model equations associated with each of the K levels of A. This


means that the observations are essentially split up into K groups, so that
the observations which all take the same level of the factor are in the same
group and use the same model equation (which is different to the model
equations used by the other groups of observations). In this section, we
will express the K separate model equations in the form of a single
equation, so that all observations can use the same model equation.
To do this, we will make use of what are known as indicator variables or
dummy variables. An indicator (or dummy) variable is a binary variable
which can take either the value 1, if a particular condition is true, or the
value 0, if the condition is not true. Indicator variables are often defined as
additional variables in a dataset, as a way of identifying which observations
satisfy certain conditions. This is illustrated in the next example.

Example 3 Indicator variable to identify levels of a factor


Suppose that we would like to use regression with data from the
manna ash trees dataset using the response:
• height: the height of the tree (in metres), rounded to the nearest
metre
and the factor explanatory variable:
• side: the side of Walton Drive that the tree is located on, taking
possible values west and east.
We will take east to be level 1 of side and west to be level 2. Also, in
what follows, we will simply number our observations (trees) in the
order they are listed in mannaAsh, from i = 1, 2 . . . , 42, instead of
using treeID.
Let’s consider the model
height ∼ side.
Now, each observation takes one of the two levels of side which
determines the model equation to be used for that observation.
• If the ith tree is on the east side of the road (so that side takes
level 1), then the regression model for Yi is
Yi = µ + Wi ,  Wi ∼ N (0, σ 2 ).

• If the ith tree is on the west side of the road (so that side takes
level 2), then the regression model for Yi is
Yi = µ + α2 + Wi , Wi ∼ N (0, σ 2 ).

So, the observations for which side takes level 2 have the extra
parameter α2 in their model.


To help us identify which observations take level 2 (and therefore


which observations need the extra parameter α2 in their model), we
can define an indicator variable, Z2 , say, so that, for each observation
i = 1, 2, . . . , 42, the observed value of Z2 is:

1 if the ith observation takes level 2 of side





 (that is, if the ith tree is on the west side of the road),
zi2 =

 0 if the ith observation does not take level 2 of side
(that is, if the ith tree is on the east side of the road).

The value of zi2 therefore indicates whether or not the ith observation
takes level 2 of side:
• if zi2 = 1, then we know that the ith observation does take level 2
of side
• if zi2 = 0, then we know that the ith observation does not take
level 2 of side.
Data for five observations from the manna ash trees dataset (for the
trees numbered 1, 2, 28, 29 and 39), including the indicator variable
Z2 , are given in Table 2.
Table 2 Five observations from the manna ash trees dataset, including the
indicator variable Z2

Observation (i) treeID height side Z2


1 271 9 west 1
2 270 8 west 1
28 274 4 east 0
29 275 4 east 0
39 236 10 west 1

Notice that, out of these observations, the trees numbered 1, 2 and 39


take level 2 of side (that is, they are on the west side), and so the
value of Z2 is 1 for these observations; the value of Z2 is 0 for the
other two trees. That is,
z12 = z22 = z39,2 = 1 and z28,2 = z29,2 = 0.
(Where the observation numbers are double digits, we have used a
comma to distinguish them from the level number.)
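(In R, such an indicator variable might be created along these lines, assuming the data frame is loaded as mannaAsh:)

# Indicator variable for level 2 ('west') of side
mannaAsh$Z2 <- as.integer(mannaAsh$side == "west")
head(mannaAsh[, c("treeID", "height", "side", "Z2")])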


In Example 3, we saw how an indicator variable could be used to identify


which of the two levels of side is taken by each observation. But what
happens when the factor has more than two levels? Can we still use
indicator variables to identify factor levels then? Well, yes we can, but in
this case we need to use more than one indicator variable. This is
illustrated in Example 4 and Activity 8.

Example 4 Indicator variables for two of the levels of occ


For the wages dataset, consider once again the model
hourlyWageSqrt ∼ occ,
where hourlyWageSqrt is a (continuous) response and occ is a factor
with seven levels.
Each observation takes one of the seven levels of occ, which
determines the model equation to be used for that observation. For
example, if the ith observation takes level 2 of occ, then the
regression model for Yi is
Yi = µ + α2 + Wi , Wi ∼ N (0, σ 2 ).

Now, this time we cannot use a single indicator variable to identify


which level of occ is taken by each observation, because an indicator
variable is binary, whereas occ has seven levels. We can, however, use
an indicator variable to identify which observations take level 2 of
occ, and use a second indicator variable to identify which
observations take level 3 of occ, and so on.
So, let’s start with an indicator variable to help us identify which
observations take level 2 of occ. To do this, define the indicator
variable, Z2 , say, so that, for each observation i = 1, 2, . . . , 3331, the
observed value of Z2 is:

zi2 = 1 if the ith observation takes level 2 of occ,
zi2 = 0 if the ith observation does not take level 2 of occ.
Then, the value of zi2 indicates whether or not the ith observation
takes level 2 of occ:
• If zi2 = 1, then we know that the ith observation takes level 2 of
occ.
• If zi2 = 0, then we know that the ith observation does not take
level 2 of occ.
We can then define the indicator variable Z3 to identify which
observations take level 3 of occ, so that, for each observation
i = 1, 2, . . . , 3331, the observed value of Z3 is:

zi3 = 1 if the ith observation takes level 3 of occ,
zi3 = 0 if the ith observation does not take level 3 of occ.


Therefore:
• If zi3 = 1, then we know that the ith observation takes level 3 of
occ.
• If zi3 = 0, then we know that the ith observation does not take
level 3 of occ.
Similarly, we could define four more indicator variables to identify
which observations take each of the remaining four levels of occ; that
is, levels 4, 5, 6 and 7. (You will do this soon in Activity 8.)
Data for the first five observations from the wages dataset for the
variables hourlyWageSqrt and occ, together with the indicator
variables Z2 and Z3 , are given in Table 3.
Table 3 First five observations from wages, giving hourlyWageSqrt and
occ together with Z2 and Z3

Observation (i) hourlyWageSqrt occ Z2 Z3


1 3.58 5 0 0
2 2.45 3 0 1
3 2.80 2 1 0
4 2.46 6 0 0
5 5.68 1 0 0

Notice that, out of these first five observations:


• Only the third observation takes level 2 of occ, so the value of Z2
is 1 for this observation but 0 for the rest; that is,
z32 = 1 and z12 = z22 = z42 = z52 = 0.

• Only the second observation takes level 3 of occ, so the value of Z3


is 1 for this observation but 0 for the rest; that is,
z23 = 1 and z13 = z33 = z43 = z53 = 0.

Activity 8 Indicator variables for the remaining levels of occ

Once again we will consider the model


hourlyWageSqrt ∼ occ,
using data from the wages dataset.
In Example 4, we defined the indicator variables Z2 and Z3 . Suppose that
we now define four further indicator variables Z4 , Z5 , Z6 and Z7 as follows.


For each observation i = 1, 2, . . . , 3331, the observed values of Z4 , Z5 , Z6


and Z7 are:

zi4 = 1 if the ith observation takes level 4 of occ, and 0 otherwise,
zi5 = 1 if the ith observation takes level 5 of occ, and 0 otherwise,
zi6 = 1 if the ith observation takes level 6 of occ, and 0 otherwise,
zi7 = 1 if the ith observation takes level 7 of occ, and 0 otherwise.

Complete Table 4 to show the first five observations from the wages
dataset for the variables hourlyWageSqrt and occ, together with the
indicator variables Z2 , Z3 , . . . , Z7 .
Table 4 Incomplete table showing the first five observations from wages dataset,
giving hourlyWageSqrt and occ together with Z2 , Z3 , . . . , Z7

Observation (i) hourlyWageSqrt occ Z2 Z3 Z4 Z5 Z6 Z7


1 3.58 5 0 0
2 2.45 3 0 1
3 2.80 2 1 0
4 2.46 6 0 0
5 5.68 1 0 0

Model (4) for the response Y and factor A with K levels can be rewritten
in terms of a set of indicator variables. This allows the model to be
represented, for i = 1, 2, . . . , n, by a single model equation of the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi ,  Wi ∼ N (0, σ 2 ),  (5)
where
zi2 = 1 if the ith observation takes level 2 of A, and 0 otherwise,
zi3 = 1 if the ith observation takes level 3 of A, and 0 otherwise,
...
ziK = 1 if the ith observation takes level K of A, and 0 otherwise.
Notice that we don’t have an indicator variable associated with level 1 of
factor A, since we are assuming that the baseline mean µ represents the
mean response when A takes level 1.


Activity 9 Equivalence of the two model forms

(a) Explain why Model (5) reduces to the form given in Model (4) when
the ith observation takes level k of factor A, for k = 2, 3, . . . , K.
(b) Confirm that Model (5) reduces to the form
Yi = µ + Wi
when the ith observation takes level 1 of factor A. (Note that this is
also the form given in Model (4) when the ith observation takes
level 1, since we have set α1 to be 0.)

As you saw in Activity 9, Model (5) reduces to the form given in Model (4)
when the ith observation takes level k of factor A. The indicator variables
therefore allow us to use the same model equation for all of the
observations, by basically switching effect terms (α2 , α3 , . . . , αK ) on and off
depending on which level of factor A is taken by each observation; this is
illustrated in Figure 9.

[Diagram: for an observation taking level k of A, all effect terms except αk are switched off in the model equation]
Figure 9 Demonstrating the role of indicator variables

Activities 10 and 11 will give you some practice at using indicator


variables for regression models with a factor.


Activity 10 Using indicator variables for the model


height ∼ side
Consider once again the model
height ∼ side
for modelling data from the manna ash trees dataset.
In Example 3, we defined the indicator variable Z2 so that, for each
observation i = 1, 2, . . . , 42, the observed value of Z2 is:
zi2 = 1 if the ith observation takes level 2 of side
      (that is, if the ith tree is on the west side of the road),
zi2 = 0 if the ith observation does not take level 2 of side
      (that is, if the ith tree is on the east side of the road).

(a) Use the indicator variable Z2 to specify a model for height as a single
equation (that is, in the form of Model (5)).
(b) The first tree in the manna ash trees dataset is located on the west
side of the road, whereas the 28th tree is located on the east side of
the road. Write the models for Y1 and Y28 in the form of Model (4).

Activity 11 Using indicator variables for the model


hourlyWageSqrt ∼ occ
Once again we will consider the model
hourlyWageSqrt ∼ occ,
using data from the wages dataset.
In Example 4 and Activity 8, we defined the indicator variables
Z2 , Z3 , . . . , Z7 , so that, for each observation i = 1, 2, . . . , 3331,
zi2 = 1 if the ith observation takes level 2 of occ, and 0 otherwise,
zi3 = 1 if the ith observation takes level 3 of occ, and 0 otherwise,
...
zi7 = 1 if the ith observation takes level 7 of occ, and 0 otherwise.
(a) Use the indicator variables Z2 , Z3 , . . . , Z7 to write the model
hourlyWageSqrt ∼ occ
as a single equation of the form given in Model (5).
(b) Confirm that your model from part (a) reduces to the model form
for Y1 obtained in Example 2, namely,
Y1 = µ + α5 + W1 , W1 ∼ N (0, σ 2 ).


The model using indicator variables is summarised in Box 2.

Box 2 Introducing indicator variables into the model


Suppose that we have the (continuous) responses Y1 , Y2 , . . . , Yn and a
single factor explanatory variable A with K levels.
We can use K − 1 indicator variables to express the model in a single
equation of the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , i = 1, 2, . . . , n,
where Wi ∼ N (0, σ 2 ) and, for k = 2, 3, . . . , K,
µ = mean of the responses for the first level of the factor,
αk = the effect that level k of the factor has on the response
in comparison to the effect that level 1 has on the response,
and

zi2 = 1 if the ith observation takes level 2 of A, and 0 otherwise,
zi3 = 1 if the ith observation takes level 3 of A, and 0 otherwise,
...
ziK = 1 if the ith observation takes level K of A, and 0 otherwise.
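(You can see the indicator variables that R itself constructs for a factor by inspecting the model matrix; a minimal sketch, assuming the data frame wages is loaded and occ is stored as a factor:)

# The first column is the intercept; level 1 of occ has no indicator
head(model.matrix(~ occ, data = wages))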

We’ll consider this model further in the next subsection.


2.2 Examining the model with indicator variables
Before we go any further, Figure 10 gives a quick recap of the ‘story’ so far.

[Recap diagram: a model based on the response means for the levels of A → K separate model equations, one for each level → one model equation using the K − 1 indicator variables zi2 , zi3 , . . . , ziK ]
Figure 10 Regression with a single factor: the story so far

We used indicator variables to express the model for regression with a


factor explanatory variable with multiple levels as a single model equation
(summarised in Box 2); in this way, all of the responses can use the same
model. You will begin to explore this in Activity 12.

Activity 12 Does the model form look familiar?

We have written the regression model for response Y with a factor A that
has K levels as the single model equation
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , i = 1, 2, . . . , n,
where zi2 , zi3 , . . . , ziK are all numerical variables (each taking the value 0
or 1), and µ, α2 , α3 , . . . , αK are model parameters to be estimated. Which
regression model has a model equation with the same form?


Following on from Activity 12, this similarity in model equation form is


helpful because it means we can use regression techniques we’re already
familiar with. Indicator variables can be used to express a regression model
with a single factor with K levels as a multiple regression model with
K − 1 covariates, and so we can simply use the techniques for multiple
regression from Unit 2. This modelling approach is illustrated in Figure 11.

[Diagram: model Y ∼ A → define K − 1 indicator variables zi2 , zi3 , . . . , ziK → single model equation → use multiple regression techniques]
Figure 11 The modelling approach for regression with a single factor

Although we can use multiple regression techniques to estimate the


parameters in the model which uses indicator variables (as summarised in
Box 2) in the same way as we would do when using multiple regression
with several covariates (as used in Unit 2), there are some important
differences in the interpretation of the two models.
When using multiple regression in Unit 2:
• there were two or more possible covariates
• the focus was often on choosing a subset of the possible covariates to
include in our final model.
In Model (5) (repeated in Box 2):
• although there are K − 1 indicator variables in the model, there is
actually just the single factor
• the indicator variables represent the different levels of just one factor,
and so selecting a subset of the indicator variables isn’t possible because
we would be removing some of the levels of the factor from the model!
There are also differences between the interpretation of the parameters in
the two models, which we’ll describe next.


In multiple regression with covariate explanatory variables:


• the intercept parameter is the expected value of the response when all of
the covariates are zero
• the regression coefficients measure the expected change in the response if
the associated covariate increases by one unit while the other covariates
are all kept fixed.
In Model (5):
• the equivalent intercept parameter is the expected value of the response
when the factor takes the first level
• the regression coefficients measure the effect on the response that each
level of the factor has in comparison to the effect on the response of the
first level of the factor.
Bearing these differences in mind, we are now ready to look at how we
would use the proposed model. We will do this in the next section.

3 Using the proposed model


In this section, we will use the model developed in Section 2. We’ll start
by looking at the fitted model in Subsection 3.1, and then use the model to
test for a relationship between the response and the factor in
Subsection 3.2. We round the section off in Subsection 3.3 by discussing
the model assumptions and how we can check them.

3.1 The fitted model


In this section, we will look at the fitted model for regression for a
response Y with a factor explanatory variable A with K levels using a
model of the form given in Model (5), that is, for i = 1, 2, . . . , n,
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , Wi ∼ N (0, σ 2 ),
where, for k = 2, 3, . . . , K,

zik = 1 if the ith observation takes level k of A, and 0 otherwise.
There are K + 1 unknown parameters altogether in the model: the
baseline mean µ, the effect terms α2 , α3 , . . . , αK (remember that α1 is set
to be zero, and so is not unknown), and the variance σ 2 .
As discussed in Subsection 2.2, we can use multiple regression techniques
when the model is expressed in terms of indicator variables. As such,
estimates of the parameters are calculated in R in exactly the same way as
they are for multiple regression, and you do not need to know the details of
how these estimates are obtained for this module. As usual, though, you
do need to know how to interpret and use the parameter estimates
produced by R.


So, denoting the estimate of µ by µ̂ and the estimates of α2 , α3 , . . . , αK by
α̂2 , α̂3 , . . . , α̂K , when, for k = 2, 3, . . . , K, the ith observation takes level k
of the factor A, we have that ŷi , the fitted value of Yi , is given by
ŷi = µ̂ + (α̂2 × 0) + (α̂3 × 0) + · · · + (α̂k × 1) + · · · + (α̂K × 0)
   = µ̂ + α̂k .
Notice that all observations which take the same level of A will have the
same fitted value.
Next, in Example 5 and Activity 13, we will take a look at the fitted model
for the response hourlyWageSqrt and the factor occ using data from the
wages dataset.

Example 5 Fitted model for hourlyWageSqrt


The model
hourlyWageSqrt ∼ occ
was fitted using data from the wages dataset. The resulting estimates
of the parameters (to three decimal places) are given in Table 5.
Table 5 Parameter estimates for hourlyWageSqrt ∼ occ

Parameter Estimate
µ (baseline mean) 4.489
α2 (occ level 2) −0.304
α3 (occ level 3) −0.515
α4 (occ level 4) −1.011
α5 (occ level 5) −1.022
α6 (occ level 6) −1.383
α7 (occ level 7) −1.435

From Table 1 (Subsection 1.1), the value of occ for the first
observation is 5, and so the fitted value for Y1 is
ŷ1 = µ̂ + α̂5 = 4.489 + (−1.022) = 3.467 = 3.47 (to 2 d.p.).
Indeed, the fitted values for all observations which take level 5 of occ
will also be 3.47.
As another example, the value of occ for the second observation is 3,
and so the fitted value for Y2 is
ŷ2 = µ̂ + α̂3 = 4.489 + (−0.515) = 3.974 = 3.97 (to 2 d.p.).
The fitted values for all observations which take level 3 of occ will
also be 3.97.


Activity 13 More on the fitted model for hourlyWageSqrt

From Table 1 (Subsection 1.1), the values of occ from the wages dataset for the third, fourth and fifth observations are, respectively, 2, 6 and 1. Using the parameter estimates produced by fitting the model
hourlyWageSqrt ∼ occ
given in Table 5 (in Example 5), calculate the fitted values ŷ3 , ŷ4 and ŷ5 .

In Example 5 and Activity 13, we used the parameter estimates given in


Table 5 to calculate the fitted values for Y1 , Y2 , . . . , Y5 from the wages
dataset. In the next activity, we will interpret what the parameter
estimates mean when considering the context of the data.

Activity 14 Interpreting the parameter estimates

Considering the context of the data, interpret the parameter estimates for
µ, α2 , α3 , . . . , α7 given in Table 5 (in Example 5).

In the next activity, you will look at the fitted model for modelling height
from the manna ash trees dataset using the factor side as the explanatory
variable.

Activity 15 Fitted model for regression with a factor for


the manna ash trees dataset
In Activity 10 (Subsection 2.1), we wrote the model
height ∼ side
in the form, for each i = 1, 2, . . . , n,

Yi = µ + α2 zi2 + Wi ,   Wi ∼ N (0, σ²),
where
zi2 = 1 if the ith observation takes level 2 of side (that is, if the ith tree is on the west side of the road), and 0 otherwise.
(Marginal note: olive oil is chemically similar to the oil produced by ash trees, which are in the olive family, Oleaceae, according to the Woodland Trust (n.d.).)

As a reminder, we took ‘east’ to be the first level of the factor side.


The estimates of the parameters after fitting the model are given in
Table 6.
Table 6 Parameter estimates for height ∼ side

Parameter Estimate
µ (baseline mean) 6.636
α2 (side level 2) 1.880


(a) Considering the context of the data, interpret the parameter estimates
for µ and α2 given in Table 6.
(b) The first tree in the manna ash trees dataset is located on the west
side of the road, whereas the 28th tree is located on the east side of
the road. Calculate the fitted values for the first and 28th trees.
(c) What will the fitted values be for all of the trees on the west side of
the road? What will the fitted values be for all of the trees on the east
side of the road?

In addition to calculating fitted values for the observed data, we can use the fitted model to predict the response Y0 for a new observation for which we know which of the K levels of the factor it takes. If we know the level of
the factor for the new observation, then we know the values of the
indicator variables for the new observation; that is, we know
z01 , z02 , . . . , z0K . For example, if the new observation takes level k of the
factor, then z0k = 1, whereas the values of the other indicator variables will
all be zero. The predicted value ŷ0 of Y0 is then

ŷ0 = µ̂ + (α̂2 × 0) + (α̂3 × 0) + · · · + (α̂k × 1) + · · · + (α̂K × 0)
   = µ̂ + α̂k .
(If the new observation takes a completely different level of the factor, say level ‘K + 1’, then this model does not help us predict the response Y0 .)
As for simple linear regression and multiple regression, prediction intervals
can also be calculated easily in R for the new response Y0 , but we won’t go
into the details of how these are calculated here.
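In R, such predictions reduce to a call to predict(); a minimal sketch, reusing the hypothetical fit object from the earlier sketch:

# A new individual taking level 7 of occ (an unskilled manual worker)
new_obs <- data.frame(occ = factor(7, levels = levels(wages$occ)))

# Point prediction for Y0, together with a 95% prediction interval
predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)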

Activity 16 Predicting the hourly wage for an unskilled


manual worker
Table 5 in Example 5 gives parameter estimates for the model
hourlyWageSqrt ∼ occ.
What would be the predicted hourly wage for a new individual who is an
unskilled manual worker (and so takes level 7 of occ)?

In this subsection, we’ve looked at the fitted model for regression with a
factor. We’ve used the context of the data to interpret the estimated
model parameters and used the fitted model to calculate both fitted values
and predicted values of the response. However, as with simple linear
regression, we need to test whether there is actually a relationship between
the response and the explanatory variable to model! We will do this next.


3.2 Testing whether there is a relationship


In this subsection, we’ll discuss how we can use our regression model to
test for a relationship between the response and the factor.
Let’s start by thinking about how we test for a relationship between the
response and the explanatory variable in simple linear regression when the
explanatory variable is a covariate. In this case, we have the regression
model equation
Yi = α + βxi + Wi , Wi ∼ N (0, σ 2 ),
and in order to test for a relationship between the response Y and the
covariate x, we test the hypotheses
 
H0 : β = 0 (which means that the value of x does not affect Y , and so there isn’t a relationship between Y and x),
H1 : β ≠ 0 (which means that the value of x does affect Y , and so there is a relationship between Y and x).

Now, when the explanatory variable is a factor A with K levels (instead of


a covariate), we have seen that we can use K − 1 indicator variables to
write the regression model equation as
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , i = 1, 2, . . . , n,
where, for k = 2, 3, . . . , K,

zik = 1 if the ith observation takes level k of A, and 0 otherwise.
So, following the ideas from simple linear regression, there isn’t a
relationship between the response and the factor if the different levels of
the factor do not affect the response, in which case all of the effect terms
α2 , α3 , . . . , αK are zero. Therefore, to test for a relationship between the
response Y and the factor A, we can use the null hypothesis
H0 : α2 = α3 = · · · = αK = 0.

On the other hand, if there is a relationship between the response Y and


the factor A, then at least one of the levels of the factor affects the
response, which means that at least one of the effect terms is not zero.
This therefore suggests the alternative hypothesis
H1 : at least one of the effect terms α2 , α3 , . . . , αK is non-zero.

Activity 17 Identifying the distribution for the test

We wish to test the hypotheses


H0 : α2 = α3 = · · · = αK = 0,
H1 : at least one of the effect terms α2 , α3 , . . . , αK is non-zero.


By thinking back to testing similar hypotheses for multiple regression in


Unit 2, what distribution do you think our test will be based on?

We will look at testing the relationships between the response and the
factor in the next two activities. Activity 18 will use the fitted model
presented in Activity 15 (Subsection 3.1) for the manna ash trees dataset,
and Activity 19 will use the fitted model presented in Example 5 (also in
Subsection 3.1) for the wages dataset.
The method for testing a relationship between the response and a factor
explanatory variable is summarised in Box 3.

Box 3 Testing for a relationship between the response


and the factor
For the response Y and factor explanatory variable A with K levels,
we can test for a relationship between Y and A using the hypotheses
H0 : α2 = α3 = · · · = αK = 0,
H1 : at least one of the effect terms α2 , α3 , . . . , αK is non-zero.
The test statistic for this test is called the F -statistic and its
associated p-value is based on the F -distribution with K − 1 and
n − K degrees of freedom.
If the p-value is small, then there is evidence to reject H0 and we
conclude that there is a relationship between the response and the
factor.
If the p-value is not small, then there is not enough evidence to reject
H0 and we conclude that there is not a relationship between the
response and the factor.
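In practice, this F -statistic and its p-value are reported at the foot of the summary() output for the fitted model; a minimal sketch, again using the hypothetical fit object from earlier:

# The final line of the summary reports the F-statistic on K - 1 and
# n - K degrees of freedom, together with its p-value
summary(fit)

# The statistic and its degrees of freedom can also be extracted directly
summary(fit)$fstatistic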

Activity 18 Testing for a relationship between height and


side
When the model
height ∼ side
is fitted to the manna ash trees data, the value of the F -statistic for
testing for a relationship between height and side is 12.43 with
associated p-value 0.00107. Is there evidence to suggest that there is a
relationship between height and side?


Activity 19 Testing for a relationship between


hourlyWageSqrt and occ
When the model
hourlyWageSqrt ∼ occ
is fitted to the wages dataset, the value of the F -statistic for testing for a
relationship between hourlyWageSqrt and occ is 118.3, with associated p-value less than 0.001. Is there evidence to suggest that there is a relationship between hourlyWageSqrt and occ?

Recall that Unit 2 also carried out individual tests (based on the
t-distribution) to test whether each regression coefficient is zero, to help
decide which covariates should be kept in the model. The same tests can
be done for each of the effect terms when the explanatory variable is a
factor – indeed, R automatically calculates the test statistics (t-statistics)
and their associated p-values for each of these tests when fitting the model.
However, as mentioned already in this unit, when the explanatory variable
is a factor, we cannot remove individual level effect terms from the model:
either all of the effect terms are in the model or none of the effect terms
are in the model. So, even if individual p-values suggest that there is no
evidence that a particular effect term should be in the model, the
associated effect term cannot be removed from the model (unless there is
no evidence from the F -statistic’s p-value of a relationship between the
response and the factor, in which case none of the effect terms would be in
the model).
Despite this, the p-values associated with testing whether each individual
effect term is zero are still useful, as they provide information regarding
the extent to which each level of the factor affects the response in
comparison to how the first level of the factor affects the response.
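These individual t-statistics and p-values can be read from the coefficient table of the summary() output; for example (still assuming the hypothetical fit object):

# One row per parameter: estimate, standard error, t-value and p-value.
# Each effect term's t-test compares that level with the first (baseline) level.
summary(fit)$coefficients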
In Activities 20 and 21, we will consider the test statistics and p-values for
the individual level effect parameters for the fitted models for the wages
and manna ash trees datasets.

Activity 20 Testing the individual effect terms for the


wages dataset
The model
hourlyWageSqrt ∼ occ
was fitted using data from the wages dataset. The resulting estimates of the parameters (to two decimal places), together with the associated individual test statistics and p-values, are given in Table 7.


Table 7 Parameter estimates for hourlyWageSqrt ∼ occ

Parameter Estimate t-value p-value


µ (baseline mean) 4.49 76.61 < 0.001
α2 (occ level 2) −0.30 −4.47 < 0.001
α3 (occ level 3) −0.52 −6.95 < 0.001
α4 (occ level 4) −1.01 −13.01 < 0.001
α5 (occ level 5) −1.02 −15.43 < 0.001
α6 (occ level 6) −1.38 −18.35 < 0.001
α7 (occ level 7) −1.43 −14.24 < 0.001

Explain why the p-values suggest that all of the level effect terms for the
factor occ are non-zero. Interpret what this means in the context of the
data.

Activity 21 Testing the individual effect terms for the


manna ash trees dataset
The model
height ∼ side
was fitted using data from the manna ash trees dataset. The resulting
estimates of the parameters, together with the associated individual test
statistics and p-values, are given in Table 8.
Table 8 Parameter estimates for height ∼ side

Parameter Estimate t-value p-value


µ (baseline mean) 6.636 14.49 < 0.001
α2 (side level 2) 1.880 3.53 0.00107

Notice that the p-value associated with the effect term for level 2 of the
factor side is the same as the p-value associated with the F -statistic given
in Activity 18 when testing for a relationship between the response height
and the factor side. Explain why this is so.

Just as we make certain model assumptions for simple linear regression


and multiple regression, so we make model assumptions for regression
when we have a factor. We will consider what these assumptions are and
how to check that they are reasonable for the data in the next subsection.


3.3 Checking the model assumptions


In this subsection, we will look at checking the model assumptions for
regression with a factor. You’ll be glad to know that you actually already
know how to do this!

Activity 22 Model assumptions and how to test them

In Section 2, we wrote the regression model with a factor explanatory


variable A with K levels in the form, for i = 1, 2, . . . , n:
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , Wi ∼ N (0, σ 2 ),
where, for k = 2, 3, . . . , K,

zik = 1 if the ith observation takes level k of A, and 0 otherwise.
(a) Given that the model is in the form of a multiple regression model
where the indicator variables zi2 , zi3 , . . . , ziK are the explanatory
variables, what are the model assumptions underlying the model?
(b) Explain how you might check each of the assumptions that you
identified in part (a).

So, as seen in Activity 22, the assumptions underlying the regression


model with a factor are the same as those underlying multiple regression,
and we can check the assumptions in the same way.
We will check the assumptions for a fitted regression model in Activity 23.
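As a reminder, diagnostic plots like those used in Units 1 and 2 could be produced along the following lines (a minimal sketch for a fitted model object fit):

# Residual plot: residuals against fitted values, to check the assumptions
# of zero mean and constant variance for the random terms
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal probability plot of the standardised residuals,
# to check the normality assumption
qqnorm(rstandard(fit))
qqline(rstandard(fit))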

Activity 23 Checking the model assumptions for


hourlyWageSqrt ∼ occ
Figure 12 shows the residual plot and normal probability plot for the fitted
model
hourlyWageSqrt ∼ occ
discussed in Example 5.
Do the underlying model assumptions seem reasonable?


[Figure: panel (a) plots residuals against fitted values; panel (b) plots standardised residuals against theoretical quantiles.]

Figure 12 Diagnostic plots for hourlyWageSqrt ∼ occ: (a) the residual plot, (b) the normal probability plot

We are ready to put these regression models into practice using R. We will
do this in the next section.

4 Using R to fit a regression with a


factor
In this section of the unit, we will fit and interpret regression models with a
single factor using R. We will start in Notebook activity 3.1 by explaining
how to use R to fit and interpret a model for the response hourlyWageSqrt
from the wages dataset, with the factor edLev as the explanatory variable.
Then, in Notebook 3.2, you will practise by modelling hourlyWageSqrt
but this time using a different factor as the explanatory variable.
Finally, in Notebook activity 3.3, we will revisit the FIFA 19 dataset. In
Units 1 and 2, we used data from the FIFA 19 dataset to fit regression
models using strength as the response variable and the numerical
variables (weight, height and marking) as covariates. The FIFA 19
dataset also contains data for two categorical variables: preferredFoot
and skillMoves. In Notebook activity 3.3, you will use skillMoves as a
factor in a regression model for strength.


Notebook activity 3.1 Regression with a factor in R


In this notebook you will use R to fit a regression model with a factor
explanatory variable, using data from wages.

Notebook activity 3.2 Regression with a different factor


for the wages dataset
This notebook will give you further practice at using R for regression
with a factor, again using data from wages.

Notebook activity 3.3 Regression with a factor for the


FIFA 19 dataset
This notebook will give you further practice at using R for regression
with a factor, this time using data from fifa19.

5 Analysis of variance (ANOVA)


In Subsection 3.2, we used our regression model to:
• test for a relationship between the response and the factor (using the
F -statistic and its associated p-value)
• compare the effects of levels of the factor on the response to the effect of
the first level of the factor on the response (using the individual
t-statistics for each level and their associated p-values).
However, we might want to learn more than this about how the various
levels of the factor affect the response, as illustrated in Example 6.

Example 6 What else might we want to know about the


effects of occupation?
Consider once again the wages dataset with the response:
• hourlyWageSqrt: the square root of the individual’s hourly wage
(in £)
and the factor:
• occ: the occupation of the individual, with the codes
◦ 1 (Professional)
◦ 2 (Employer/Manager)
◦ 3 (Intermediate non-manual)
◦ 4 (Junior non-manual)


◦ 5 (Skilled manual)
◦ 6 (Semi-skilled manual)
◦ 7 (Unskilled manual).
In Activity 19, we tested for a relationship between hourlyWageSqrt
and occ and concluded that there was indeed evidence for a
relationship between the two. Additionally, in Activity 20, we tested
the individual effect terms and concluded that there was strong
evidence that they were all non-zero.
The individual effect terms compare the effects on the response of
each occupation group with the effect on the response of being in the
‘professional’ occupation group. However, there might be other issues
which we’d like to investigate for these data. For example, we might
be interested in issues such as:
• comparing the effect on the response of manual occupations (that
is, occ levels 5, 6 and 7) with the effect of the other occupations
(that is, occ levels 1 to 4)
• comparing the effect on the response of the more senior occupations
(that is, occ levels 1 and 2) with the effect of the other occupations
(that is, occ levels 3 to 7)
• comparing the effect on the response within manual occupations, by
comparing the response for individuals classed as having a skilled
manual occupation (that is, occ level 5) with the response for
individuals classed as having an unskilled manual occupation (that
is, occ level 7).

In order to be able to investigate the issues raised in Example 6, we need


to use a technique called analysis of variance, or ANOVA for short. When
introducing ANOVA, it is helpful to start by thinking about our model in
terms of how well the model explains the variation in the response: we will
do this next, in Subsection 5.1, before introducing the basic idea of
ANOVA in Subsection 5.2. In order to assess how much response variation
is explained by the model, we can use an ANOVA test: testing in ANOVA
is the subject of Subsection 5.3. To finish this section, in Subsection 5.4,
we’ll introduce what is known as the ANOVA table, which is commonly
used to present a summary of the ANOVA results.


5.1 A different way of thinking about the


model
To help explain the ideas being presented in this section, we will use some
data concerning the salinity of water in the Bimini Lagoon in the
Bahamas: this dataset is described next.

The salinity of water in the Bimini Lagoon


A study was conducted which measured the salinity in samples of
water from three separate water masses in the Bimini Lagoon in the
Bahamas.
The lagoon dataset (lagoon)
For each sample, two variables were recorded:
• salinity: the salinity of the water, measured in parts per thousand
• waterMass: a categorical variable indicating which of three separate water masses the sample came from, taking the (coded) values 1, 2 and 3.
(Photo caption: a view across part of Bimini Lagoon, looking towards Alice Town on the island of Bimini.)
The dataset contains 30 observations: 12 of these are measurements
taken from the water mass coded as 1, eight are taken from the water
mass coded as 2, and 10 from the water mass coded as 3.
The first five observations from the dataset are given in Table 9.
Table 9 First five observations from lagoon

salinity waterMass
37.54 1
37.01 1
36.71 1
37.03 1
37.32 1

Source: Till, 1974

Before going any further, let’s have a quick look at the lagoon dataset.


Activity 24 A quick visual look at the data

We are interested in taking the variable salinity from the lagoon dataset
as our response variable and the variable waterMass as a factor.
A comparative boxplot of the response over the three levels of waterMass
is shown in Figure 13.
[Figure: comparative boxplot of salinity (parts per thousand) by water mass code.]
Figure 13 A comparative boxplot of salinity over the levels of waterMass

Explain why this plot suggests that the model


salinity ∼ waterMass
is likely to be a useful model for salinity.
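Incidentally, a comparative boxplot like Figure 13 can be produced with a single call; a minimal sketch, assuming a data frame lagoon containing salinity and waterMass:

# Comparative boxplot of salinity for each level of waterMass
boxplot(salinity ~ waterMass, data = lagoon,
        horizontal = TRUE,
        xlab = "Salinity (parts per thousand)",
        ylab = "Water mass code")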

As the model
salinity ∼ waterMass
is likely to be a useful model for salinity for data from the lagoon
dataset, we will use these data to help us think about the model in a
different way.
Now, as we saw in Subsection 5.2.1 in Unit 2, one of the aims of modelling
the relationship between an explanatory variable and the response is to
explain some of the variation in the response. We illustrated what we
meant by this when an explanatory variable is a covariate, in Example 12
of Unit 2. In Example 7, we will illustrate how a regression model can help
to ‘explain the variation in the response’ when the explanatory variable is
a factor, using data from the lagoon dataset.


Example 7 Using a model with a factor to explain


variation in the response
Figure 14 shows a boxplot of all the values of the response salinity.
From this boxplot, it looks like most of the values of salinity vary
between about 37 and 40.

[Figure: boxplot of salinity, in parts per thousand.]
Figure 14 A boxplot of all values of salinity

However, when we also take the values of waterMass into


consideration, we can see from Figure 13 that the values of salinity
within the same level of waterMass don’t actually vary much.
So, the large variation in salinity seen in Figure 14 is actually due
to differences in the locations of the values of salinity for the three
levels of waterMass. The variable waterMass therefore helps to
explain some of the variation seen across the values of salinity.
Likewise, the model
salinity ∼ waterMass,
which explicitly models the differences in location of salinity for the
different levels of waterMass through the mean responses for the three
levels, can be thought of as explaining some of the variation in the
values of salinity.

As we saw in Example 7, a regression model with a factor can help to


explain some of the variation observed in a response variable. ANOVA
gives us a method for formally assessing the extent to which the variation
in the response can be explained by the model.


5.2 ANOVA: the basic idea


To help keep things simple and aid the understanding of the ideas being
presented, we will only use the first three observations for each of the three
levels of waterMass from the lagoon dataset when presenting the main
ideas; the reduced dataset is given in Table 10. (We will return to
considering the full lagoon dataset later.)
Table 10 Observations included in the reduced lagoon dataset

Observation number   salinity   waterMass
1 37.54 1
2 37.01 1
3 36.71 1
4 40.17 2
5 40.80 2
6 39.76 2
7 39.04 3
8 39.21 3
9 39.05 3

In order to help visualise the ideas about to be discussed, we will use


Figure 15, which shows a scatterplot of each of the observed response
values given in Table 10 against their observation number, with the level of
waterMass for each observation identified.

[Figure: scatterplot of salinity (parts per thousand) against observation number (1–9), with points identified by water mass 1, 2 or 3.]
Figure 15 Scatterplot of salinity against the associated observation
number for the data given in Table 10

So, now that we have a (very small!) dataset to think about the model
salinity ∼ waterMass,
and Figure 15 to help us visualise things, we are ready to look at the ideas
behind what ANOVA is and how it can be used.
Recall from Subsection 5.2.1 in Unit 2 that when trying to assess the
extent to which the variation in the response can be explained by a
regression model with a covariate, we broke the variation of the response – the total variation – into two types of variation:
• the explained variation – the variation that can be explained by our
model
• the residual variation – the variation that still remains and can’t be
explained by our model.
We then used the sums of squares – TSS, ESS and RSS – as measures of
the three types of variation, as follows.
• The TSS is the total sum of squares and gives a measure of how the observed responses y1 , y2 , . . . , yn vary about the overall response mean ȳ:
TSS = Σi (yi − ȳ)², where the sum runs over i = 1, 2, . . . , n.
• The ESS is the explained sum of squares and gives a measure of how the fitted values ŷ1 , ŷ2 , . . . , ŷn vary about the overall response mean ȳ:
ESS = Σi (ŷi − ȳ)².
• The RSS is the residual sum of squares and gives a measure of how the observed responses y1 , y2 , . . . , yn vary about the fitted values ŷ1 , ŷ2 , . . . , ŷn :
RSS = Σi (yi − ŷi )².
• The three sums of squares are related, so that
TSS = ESS + RSS.
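All three sums of squares are easy to compute directly in R from a fitted model; a minimal sketch, assuming a data frame lagoon containing salinity and waterMass:

fit <- lm(salinity ~ factor(waterMass), data = lagoon)

y    <- lagoon$salinity
yhat <- fitted(fit)

TSS <- sum((y - mean(y))^2)     # total variation
ESS <- sum((yhat - mean(y))^2)  # variation explained by the model
RSS <- sum((y - yhat)^2)        # residual variation

# Check that the identity TSS = ESS + RSS holds (up to rounding error)
all.equal(TSS, ESS + RSS)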

In Subsection 5.2.1 in Unit 2, we illustrated these three measures in


relation to regression models with covariates. In this subsection, we will
illustrate TSS, ESS and RSS when the regression model has a single factor.


Let’s start by looking at the total variation. The TSS gives a measure of how the observed responses vary about the overall response mean ȳ, and is based on the distances
y1 − ȳ, y2 − ȳ, . . . , yn − ȳ.
These distances are illustrated when we have a factor for the reduced
version of the lagoon dataset in Figure 16.

[Figure: scatterplot of salinity against observation number, with the overall mean ȳ marked as a horizontal line and dotted vertical lines showing the distances yi − ȳ.]
Figure 16 Scatterplot of salinity against observation number: the dotted
vertical lines show the distances used to calculate the TSS

Now let’s consider the variation that can be explained by our model. The ESS gives a measure of the variation explained by our model by considering how the fitted values for the model vary about the overall response mean ȳ. The ESS is therefore based on the distances
ŷ1 − ȳ, ŷ2 − ȳ, . . . , ŷn − ȳ.
These distances are illustrated when we have a factor for the reduced
version of the lagoon dataset in Figure 17, which follows. Remember that,
when the explanatory variable is a factor, observations which have the same level of the factor have the same mean response and hence the same fitted values.


[Figure: scatterplot of salinity against observation number; fitted values shown as red dots, with dotted vertical lines showing the distances ŷi − ȳ.]
Figure 17 Scatterplot of salinity against observation number: the fitted
values are shown as red dots and the dotted vertical lines show the distances
used to calculate the ESS

Finally, we’ll consider the variation which can’t be explained by the model.
The RSS gives a measure of the residual variation by considering how the
observed values of the response vary about the fitted values, and is based
on the distances
y1 − ŷ1 , y2 − ŷ2 , . . . , yn − ŷn .
These distances are illustrated when we have a factor for the reduced
version of the lagoon dataset in Figure 18. (Once again, remember that
when we have a factor, observations which take the same level of the factor
have the same fitted values.)

246
5 Analysis of variance (ANOVA)

[Figure: scatterplot of salinity against observation number; fitted values shown as red dots, with dotted vertical lines showing the distances yi − ŷi .]
Figure 18 Scatterplot of salinity against observation number: the fitted
values are shown as red dots and the dotted vertical lines show the distances
used to calculate the RSS

Activity 25 Which sum of squares contributes more?

By considering Figures 16, 17 and 18, which of the explained and residual sums of squares seems to contribute more to the total sum of squares?

Following on from Activity 25, it seems that the ESS is larger than the
RSS for the model
salinity ∼ waterMass.
So, does that mean that the variation in the response explained by the
model must be larger than the residual variation, indicating that the
model may be useful for the data? Well, not necessarily. The ESS and RSS
are only part of the story, and in order to compare the explained variation
with the residual variation, we need to scale both the ESS and the RSS so
that we have variance estimates of the two sources of variation which can
then be compared. (This scaling is so that we have unbiased variance
estimates of each type of variation, in the same way that the total variation
in the n responses is estimated by dividing the TSS by n − 1.) In the next
subsection, you will see how these two variance estimates can be compared.


5.3 The ANOVA test


ANOVA provides us with a formal method to test whether the explained
variation is large enough in comparison with the residual variation to
conclude that the variation in the responses Y1 , Y2 , . . . , Yn can be explained
by the model
Y ∼ A,
where A is a factor with K levels.
To do this, ANOVA uses the test statistic
F = [ESS/(K − 1)] / [RSS/(n − K)].
The test statistic F is often called the F -value or the F -ratio. The
p-value associated with the test statistic is calculated using the
F (K − 1, n − K) distribution. Both the F -value and its p-value are easy to
obtain in R, and the p-value is interpreted in the usual way.
The numerator of the test statistic F , ESS/(K − 1), is a variance estimate
of the explained variation, whereas RSS/(n − K), the denominator of F , is
a variance estimate of the residual variation. The fact that the test
statistic compares variance estimates is the reason for the name ‘analysis
of variance’.
Box 4 gives a summary of the ANOVA test.

Box 4 The ANOVA test


For response Y and factor A with K levels, in an ANOVA test we
have the hypotheses
H0 : factor A does not help to explain the variation in Y ,
H1 : factor A does help to explain the variation in Y .
The ANOVA test uses as its test statistic the F -value given by
F = [ESS/(K − 1)] / [RSS/(n − K)]
  = (estimated variance in response explained by the model) / (estimated variance in response not explained by the model).
If the p-value is small, then we reject H0 . We then conclude that
factor A does help to explain the variation in Y .
If the p-value is not small, then we do not reject H0 . We then
conclude that factor A does not help to explain the variation in Y .

The ANOVA test is illustrated in Activities 26 and 27.
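Before those activities, here is a minimal sketch of how the ANOVA test might be carried out in R (assuming the hypothetical lagoon data frame as before):

# waterMass is coded 1, 2, 3, so we must declare it as a factor
fit <- lm(salinity ~ factor(waterMass), data = lagoon)

# The factor's row reports the F-value and its p-value, based on the
# F-distribution with K - 1 and n - K degrees of freedom
anova(fit)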


Activity 26 Using an ANOVA test for the lagoon dataset

For the response salinity and factor waterMass (which has three levels),
the model
salinity ∼ waterMass
was fitted using the 30 observations in the lagoon dataset.
(a) For these data, the ESS is calculated to be 38.80, whereas the RSS is
calculated to be 7.93 (both values given to two decimal places). Show
that the F -value associated with the ANOVA test for this model and
these data is approximately 66.
(b) The p-value associated with this F -value is reported to be less
than 0.001. What do you conclude?

Activity 27 The ANOVA test using data from the wages


dataset
Back in Subsection 1.1, we introduced the wages dataset which includes
data for 3331 individuals from the 1994 UK Labour Force Survey.
Throughout Sections 1 to 3 we considered the model
hourlyWageSqrt ∼ occ,
where hourlyWageSqrt is the response and occ is a factor with seven
levels.
The ESS for this model and these data is calculated to be 648.4, whereas
the RSS is calculated to be 3035.7.
(a) Show that the F -value associated with the ANOVA test for this
model and these data is 118.3.
(b) The p-value associated with this F -value is reported to be less
than 0.001. What do you conclude?

Now, if the model


Y ∼ A
explains the response variation well, then we would want to include the
factor A as an explanatory variable in the model. On the other hand, if
the model doesn’t explain the response variation well, then we would not
want to include the factor A in our model. But, from Subsection 3.2, we
already know how to test whether or not factor A should be included in the model. In fact, for the model Y ∼ A, the F -value is actually the same as the F -statistic used earlier in Subsection 3.2! What’s
more, the p-value associated with the F -value is based on the same null
distribution: the F -distribution with K − 1 and n − K degrees of freedom.
As a result, both the F -value and the F -statistic have the same associated
p-value. This is demonstrated in Example 8 for a model you have already
considered.


Example 8 F -value and F -statistic when modelling


hourlyWageSqrt
In Activity 27, we considered the model
hourlyWageSqrt ∼ occ
for the wages dataset. In that activity, we saw that the F -value for
this model and these data is 118.3, with the associated p-value
reported to be less than 0.001.
Back in Activity 19 (Subsection 3.2), when using the same model and
data, the F -statistic for testing for a relationship between
hourlyWageSqrt and occ was 118.3, and the associated p-value was
reported to be less than 0.001.
So, the F -value is indeed the same value as the F -statistic, and both
have the same associated p-value.

You may well be wondering why we would bother with ANOVA when the
F -value and its associated p-value are the same values as the F -statistic
and its associated p-value! Well, you’ll be glad to know that there is a
good reason to bother with ANOVA, so studying this section has not been
a complete waste of time!
It turns out that the ESS can be partitioned further into more sums of
squares, each of which relates to the response variation associated with
subsets of A’s factor levels. To illustrate, think about the model
hourlyWageSqrt ∼ occ
for the wages dataset considered in Sections 1 to 3. For this model, one of
the sums of squares in the partitioning of ESS could relate to the variation
in hourlyWageSqrt associated with whether an occupation can be
described as manual or non-manual. ANOVA techniques can then be used
to test whether an occupation being manual or being non-manual helps to
explain the variation in hourlyWageSqrt. This test will generate an
F -value and associated p-value, but these values will not be the same as
the F -statistic and its associated p-value generated by the regression
model. This is because we’re no longer testing whether factor occ explains
the variation in hourlyWageSqrt, but instead looking at the factor more
closely and testing whether the fact that the occupation is manual or
non-manual explains the variation in hourlyWageSqrt.
We will consider partitioning the ESS and the subsequent use of ANOVA techniques in Section 7, but first, in the next subsection, we will introduce a widely used way for presenting ANOVA results.


5.4 The ANOVA table


ANOVA results are often summarised in an analysis of variance table,
or an ANOVA table for short. The general form and layout of an
ANOVA table for a model for the response Y with the single factor A with
K levels as the explanatory variable is given in Box 5.

Box 5 The ANOVA table


For the response Y and single explanatory variable, factor A, with
K levels, the ANOVA table for the model
Y ∼A
has the generic form as given in Table 11.
Table 11 The general form of an ANOVA table for a model with factor A
with K levels

Source of variation   Degrees of freedom   Sum of squares   Mean square    F -value                        p-value
Factor A              K − 1                ESS              ESS/(K − 1)    [ESS/(K − 1)]/[RSS/(n − K)]     pA
Residual              n − K                RSS              RSS/(n − K)
Total                 n − 1                TSS

Let’s consider the different columns of Table 11, the general form of an
ANOVA table, in turn.
• The first column, labelled ‘Source of variation’, identifies the type of
variation which each row in the ANOVA table refers to. The first row is
the explained variation, which is labelled ‘Factor A’ here because A is
the only explanatory variable for this model. The next row then refers
to the residual variation, while the last row refers to the total variation.
• The second column, labelled ‘Degrees of freedom’, gives the
denominators for the variance estimates for the associated sources of
variation. (The name ‘degrees of freedom’ refers to the associated null
distribution of the F -value.)
• The third column, labelled ‘Sum of squares’, gives the sum of squares for
each of the associated sources of variation.
• The fourth column, labelled ‘Mean square’, gives the variance estimates
for the associated sources of variation (and, as you have no doubt
gathered, each of these estimates is known as a ‘mean square’). Note
that ANOVA tables only include mean square values for the factor and
the residual.
• Finally, the columns labelled ‘F -value’ and ‘p-value’ give, respectively, the F -value and p-value associated with the ANOVA test.


Activities 28 and 29 will give you some practice at completing ANOVA


tables.

Activity 28 ANOVA table for the lagoon dataset

Use the information given in Activity 26 (Subsection 5.3) to complete the


ANOVA table given in Table 12 for the model
salinity ∼ waterMass.

Table 12 Incomplete ANOVA table for salinity ∼ waterMass

Source of variation   Degrees of freedom   Sum of squares   Mean square   F -value   p-value
waterMass
Residual
Total

Activity 29 ANOVA table for the wages dataset

Use the information given in Activity 27 (Subsection 5.3) to complete the


ANOVA table given in Table 13 for the model
hourlyWageSqrt ∼ occ.

Table 13 Incomplete ANOVA table for hourlyWageSqrt ∼ occ

Source of variation   Degrees of freedom   Sum of squares   Mean square   F -value   p-value
occ
Residual
Total

We’ll finish this subsection by completing and interpreting the ANOVA table for a dataset concerning an experiment on pea growth. The dataset is described next.


The effect of sugars on pea growth


Data from an experiment in plant physiology recorded the lengths of
pea sections. The pea sections were grown in an appropriate tissue
culture in the presence of each of five different treatments. The aim
was to test the effects of different sugars on pea growth. The lengths
were given in ‘coded units’, which implies that, while meaning
something to the experimenter, the units do not mean anything to us.
The five treatments differed only in the presence or absence of
different sugars: one treatment contained no sugars, whereas the
others contained different quantities of different sugars. For each of
the five treatments, data for 10 pea sections were recorded.
The pea growth dataset (pea)
The data for 50 pea sections were recorded for the following variables:
• length: the length of the pea section (measured in ‘coded units’)
• treatment: a factor identifying the treatment used on each pea section, taking the possible values:
◦ none (treatment with no sugars)
◦ glucose (treatment containing 2% glucose)
◦ fructose (treatment containing 2% fructose)
◦ gluc + fruct (treatment containing 1% glucose and 1% fructose)
◦ sucrose (treatment containing 2% sucrose).
The first five observations from the dataset are given in Table 14.
Table 14 First five observations from pea

length treatment
75 none
67 none
70 none
75 none
65 none

Source: Sokal and Rohlf, 1981

In Activity 30 you will complete an ANOVA table based on a model for


these data and then interpret it.


Activity 30 ANOVA for the pea growth dataset

The model
length ∼ treatment
was fitted using data from the pea growth dataset.
The ESS and RSS were calculated to be 1077.3 and 245.5, respectively.
Complete the ANOVA table given in Table 15 for this model and these
data. What do you conclude?
Table 15 Incomplete ANOVA table for length ∼ treatment for the pea growth
dataset

Source of variation   Degrees of freedom   Sum of squares   Mean square   F -value   p-value
treatment                                                                            < 0.001
Residual
Total

In the next section, we’ll produce ANOVA tables using R.

6 Using R to produce ANOVA


tables
In this section, we’ll start in Notebook activity 3.4 by producing the
ANOVA table for the regression model
hourlyWageSqrt ∼ edLev
that we considered in Notebook activity 3.1 for the wages dataset.
In Notebook activity 3.5, we’ll again consider the wages dataset, but this
time, we’ll produce the ANOVA table for a regression model for the
response hourlyWageSqrt with a different factor.
In Notebook activity 3.3, we used R to fit a regression model for strength
from the FIFA 19 dataset using the factor skillMoves as the explanatory
variable. Finally, in Notebook activity 3.6, we will use R to produce the
ANOVA table for a regression model for strength using preferredFoot –
the other categorical variable for this dataset – as a factor.


Notebook activity 3.4 Producing ANOVA tables in R


In this notebook, you will learn how to produce the ANOVA table
associated with a regression model with a single factor.

Notebook activity 3.5 Another ANOVA table for the


wages dataset
In this notebook, we’ll produce the ANOVA table associated with a
regression model with a different factor from the wages dataset.

Notebook activity 3.6 ANOVA table for the FIFA 19


dataset
This notebook will give you further practice at using R to produce an
ANOVA table, this time using data from the FIFA 19 dataset.

7 Analysing the effects of the factor


levels further
Consider once again the pea growth dataset. As a reminder, the factor
treatment has five levels: none, glucose, fructose, gluc + fruct and sucrose.
In Activity 30 (Subsection 5.4), we concluded that the factor treatment
helps to explain the variation in length in the pea growth dataset. We
can, however, use these data to learn more about how the different
treatments affect the growth of pea sections. For example, a researcher
may also be interested in investigating further questions, such as:
• Does sugar (in general) affect the growth of pea sections?
• How do the effects of the treatments containing sucrose compare with
the effects of the treatments containing other sugars?
• How does the effect of a treatment containing a mix of both glucose and
fructose (that is, treatment gluc + fruct) compare with the effects of
treatments containing glucose or fructose alone?
In this section, you will discover how ANOVA techniques can be extended
to answer such questions.
In Subsection 7.1, you will see how such questions can be framed in terms
of groups of means and you will be introduced to contrasts – differences in
groups of means. In Subsection 7.2, you will see how it is possible to test a
single contrast and draw conclusions about the question it represents.
As you have seen with the pea growth data, there might be more than one
question that we are interested in. So this section finishes by considering
how multiple contrasts should be handled, in Subsection 7.3.


7.1 Introducing contrasts


Continuing with the pea growth data, we’ll start by considering the
question:
• Does sugar (in general) affect the growth of pea sections?
Now, if sugar does affect the growth of pea sections, then we would expect
the mean pea growth when using treatments containing sugar, which we’ll
denote as µsugar , to differ from the mean pea growth when using
treatments which don’t contain sugar, which we’ll denote as µno sugar . So,
to investigate whether sugar affects pea growth, we need to compare µsugar
with µno sugar .
Four of the levels of treatment are for treatments which contain sugar:
glucose, fructose, gluc + fruct and sucrose. So,
µsugar = mean response across the four levels glucose,
fructose, gluc + fruct and sucrose of treatment.
In contrast, only one of the levels of treatment – level none – is for a
treatment which doesn’t contain sugar. Therefore,
µno sugar = mean response for level ‘none’ of treatment.
So, in order to compare µsugar with µno sugar , we need to partition the
five levels of treatment into two sets, as illustrated diagrammatically in
Figure 19, and we can then compare the mean responses across the levels
in these two groups.

[Diagram: the level ‘none’ forms the ‘no sugar’ group, while the levels glucose, fructose, gluc + fruct and sucrose form the ‘sugar’ group.]
Figure 19 Partitioning the levels of treatment to investigate the effects of sugar on pea growth

Let
θsugar = µsugar − µno sugar .
The parameter θsugar is known as a contrast because it allows us to
compare – that is, contrast – the mean responses of two groups of levels of
our original factor in order to address a research question; if θsugar is large
in magnitude, then this would suggest that the means for the two groups
are different, implying that sugar does affect pea growth. As Section 7
progresses, you will see how we can learn about this contrast as part of an
ANOVA analysis.
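As a concrete illustration, a sample estimate of θsugar could be computed directly in R (a minimal sketch, assuming a data frame pea containing length and treatment); the formal test comes in Subsection 7.2:

# Group 1: the four treatments containing sugar; group 2: no sugars
sugar_levels <- c("glucose", "fructose", "gluc + fruct", "sucrose")

# Level means of length for each treatment
level_means <- tapply(pea$length, pea$treatment, mean)

mu1_hat <- mean(level_means[sugar_levels])  # mean across the sugar levels
mu2_hat <- level_means["none"]              # mean for the no-sugar level

# Sample estimate of the contrast theta_sugar = mu_sugar - mu_no_sugar
unname(mu1_hat - mu2_hat)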


Before we go any further, Box 6 gives a more general explanation of how to


define a contrast.

Box 6 Defining a contrast


To define a contrast, θcontrast , when the model can be written as
Y ∼ A,
where A is a factor:
1. Partition the levels of A, or a subset of the levels, into two distinct
groups: group 1 and group 2, say. (Note that there may be levels
of A which are in neither group.)
2. Denote the mean for each group:
µ1 = mean of the responses across all levels of A
in group 1,
µ2 = mean of the responses across all levels of A
in group 2.

3. Then we have:
θcontrast = µ1 − µ2 .
That is, the contrast θcontrast compares the response means for the
two groups of data.

So far in this section we have just been considering the pea growth dataset.
This dataset is not unique in raising research questions that can be
answered using contrasts. In Activity 31 you will define a contrast to
address a question about the wages dataset.

Activity 31 Defining a contrast for the wages dataset


Consider once again the wages dataset and the model
hourlyWageSqrt ∼ occ,
where the response is:
• hourlyWageSqrt: the square root of the individual’s hourly wage (in £)
and the factor explanatory variable is:
• occ: the occupation of the individual, with the codes
◦ 1 (Professional)
◦ 2 (Employer/Manager)
◦ 3 (Intermediate non-manual)
◦ 4 (Junior non-manual)
◦ 5 (Skilled manual)
◦ 6 (Semi-skilled manual)
◦ 7 (Unskilled manual).
(Marginal note: an article in The Independent (Oppenheim, 2020) reported that the company Thames Water saw a surge in female applicants for front-line manual jobs after changing the wording in job adverts from ‘masculine coded’ phrasing.)


Suppose researchers analysing these data are interested in the following


question.
• Is the square root of the individual’s hourly wage affected by whether or
not an occupation is manual?
Define a contrast to help the researchers to address this question.

So, once we have defined a contrast to compare the response means of the
two groups of levels, we then need to decide whether or not the contrast is
large enough to conclude that the response means of the two groups are
different. We therefore want to test the hypotheses:
H0 : θcontrast = 0 (that is, the response means are the same),
H1 : θcontrast ≠ 0 (that is, the response means are different).
We will do this in the next subsection.

7.2 Testing a contrast


So far we have used ANOVA to test whether or not the factor helps to
explain the response variation; if the p-value produced in the ANOVA
analysis is small, then we conclude that the factor does help to explain the
variation in the response. However, the ANOVA test also tests whether or
not the response means are different across the factor levels. The reason
for this is as follows.
Recall that the F -value and p-value for an ANOVA analysis of a model
with a single factor are, respectively, the same values as the F -statistic and
p-value for a regression analysis of the same model, as summarised in
Figure 20.
Therefore, since the F -statistic is used for testing whether or not the
response means are different across the factor levels (by testing whether
there is evidence that the effect terms in the regression model are
non-zero), the F -value and p-value from an ANOVA analysis can also be
used for testing whether or not the response means are different across the
factor levels.


[Diagram: for the model Y ∼ A, the multiple regression test of whether the effect terms are non-zero is the same as the ANOVA test of whether A explains the variation in Y , and the same as the test of whether the response means differ across the levels of A; the F -statistic equals the F -value, and the associated p-values are equal.]
Figure 20 Summary of testing for Y ∼ A

In the next activity you will consider what a p-value from such an analysis
tells us about the variation of the response means across the factor levels.

Activity 32 Small p-value in an ANOVA analysis

If an ANOVA analysis produces a small p-value, what does this tell us


about the response means across the factor levels?

In Example 9 (next), we illustrate how an ANOVA analysis (which is


based on the explained sum of squares and the residual sum of squares for
a fitted model) can be used to come to conclusions regarding whether or
not the response means differ across the factor levels.


Example 9 Conclusions from an ANOVA analysis


In Activity 30 (Subsection 5.4), we completed the ANOVA table for
the model
length ∼ treatment,
using data from the pea growth dataset.
The p-value from this ANOVA analysis was very small (< 0.001), and
so we concluded that treatment helps to explain the variation in
length. But this very small p-value also tells us that there is evidence
that the means of the response length differ across the five levels of
the factor treatment.

Although the results of an ANOVA analysis for the model


length ∼ treatment
mean that we can conclude that the means of the response length differ
across the five levels of treatment, unfortunately this particular ANOVA
analysis doesn’t tell us anything about whether µsugar differs from µno sugar
(which in turn would tell us about the contrast θsugar and whether sugar
affects pea growth). We can, however, learn about these means by using an
ANOVA analysis for a model with a new factor which relates directly to
µsugar and µno sugar . We will do this in the next example.

Example 10 Introducing a new factor to learn about the


contrast θsugar
We are interested in comparing µsugar (the mean of the response
length for those treatments which contain sugar) with µno sugar (the
mean of the response length for those treatments which do not
contain sugar). To do this, we can define a new factor:
• sugar: a factor identifying whether or not the treatment used
contains sugar, taking the possible values yes and no.
Then,
µsugar = the response mean for those observations for which
sugar takes the value yes,
µno sugar = the response mean for those observations for which
sugar takes the value no.


We can then carry out an ANOVA analysis of a model using the new
factor sugar, that is, using the model
length ∼ sugar,
to test whether or not the response means for the two levels of sugar
differ – in other words, whether or not µsugar differs from µno sugar –
and therefore whether or not sugar affects pea growth.
(Marginal note: according to the Yes Peas website (British Growers Association, 2022), the world record for eating peas is held by Janet Harris of Sussex who, in 1984, ate 7175 peas one by one in 60 minutes using chopsticks!)
Since the contrast θsugar was defined by partitioning the levels of treatment, the level of our new factor sugar for each observation is directly determined by the level of treatment for that observation. We can therefore specify the value of sugar for each observation simply from the values of treatment. This is what we will do next in Activity 33.
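In R, this recoding could be done in one line; a minimal sketch, again assuming the hypothetical pea data frame (Activity 33 asks you to do the same by hand for a few observations):

# sugar is 'no' for the treatment containing no sugars, 'yes' otherwise
pea$sugar <- factor(ifelse(pea$treatment == "none", "no", "yes"))

# Cross-tabulate treatment against sugar to check the coding
table(pea$treatment, pea$sugar)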

Activity 33 Specifying the values for the new factor sugar

Table 16 gives the values of the response length and the factor treatment
for some of the observations recorded for the pea growth dataset. For these
observations, specify the values of the new factor sugar.
Table 16 Some of the observations from the pea growth dataset

Observation length treatment sugar


10 68 none
20 61 glucose
30 58 fructose
40 59 gluc + fruct
50 67 sucrose

So far, in our quest to test the contrast θsugar using data from the pea
growth dataset, we have defined the new factor sugar. We now will carry
out an ANOVA analysis for the model
length ∼ sugar.
The resulting explained sum of squares for this fitted model – which we’ll
denote here by ESSsugar – will then give a measure of the variation in the
data which is explained by whether or not a treatment contains sugar.
Because the contrast θsugar was defined by partitioning the levels of the
factor treatment, the explained sum of squares associated with θsugar
(that is, ESSsugar ) is in fact part of the explained sum of squares for our
original model length ∼ treatment (that is, ESS). So, rather like TSS
partitions into ESS and RSS, ESS can be partitioned into the explained
sums of squares associated with a set of contrasts defined by partitioning


the levels of treatment, so that


 
ESS = ESSsugar + (explained sums of squares of further contrasts).
(Note that, for this partition to hold, there are rules concerning how the
‘further contrasts’ can be defined – we will consider these later in the
section (in Box 9).)
Figure 21 illustrates the distances used for calculating the ESS for our
original model
length ∼ treatment,
and the distances used for calculating ESSsugar for our new model
length ∼ sugar.
In Figure 21(a), the five distinct fitted values for the five levels of
treatment can be clearly seen. On the other hand, since the associated
factor sugar only has two levels, there are only two distinct fitted values in
Figure 21(b).

[Figure: two scatterplots of length against observation number, with fitted values shown as red dots; panel (a) shows the five distinct fitted values for the levels of treatment and the ESS distances, panel (b) the two distinct fitted values for the levels of sugar and the ESSsugar distances.]
Figure 21 Scatterplots of length against observation number: the fitted values are shown as red dots and the
dotted vertical lines show the distances used to calculate (a) ESS and (b) ESSsugar

We can now use ESSsugar to assess whether the variation in pea growth can
be explained by whether or not a treatment contains sugar. If we conclude
that the factor sugar does explain length, then we can conclude that the
two means – µsugar and µno sugar – are different, and so sugar does affect
growth of pea sections.
Now, we could compare ESSsugar with the residual sum of squares for the
model
length ∼ sugar.
However, our new factor sugar was defined from the original factor


treatment by partitioning the levels of treatment into two groups, and


ESSsugar is, in fact, a measure of the explained variation associated with
the model length ∼ treatment, which is actually due to whether or not
the treatment contains sugar. So, we really want to assess ESSsugar as part
of the explained variation due to the factor treatment.
To do this, we compare ESSsugar with RSS, the residual sum of squares for
the fitted model
length ∼ treatment.
The associated F -value for testing whether µsugar is different to µno sugar
(that is, for testing whether the contrast θsugar is large) is then calculated
as
Fsugar = (ESSsugar /1) / (RSS/(n − K)),
where K is the number of levels of the factor treatment.
As with the F -value for an ANOVA analysis, the test statistic Fsugar is the
ratio of two variance estimates: the numerator provides an estimate of the
variance explained by the factor sugar (the divisor is 1 in the numerator
because the factor sugar has two levels), whereas the denominator
provides an estimate of the overall residual variance.

Activity 34 F -value associated with the contrast θsugar

For the pea growth dataset (which includes data for 50 pea sections), we
have
ESSsugar = 832.3,
and the residual sum of squares for the model length ∼ treatment is
RSS = 245.5.

Calculate Fsugar , the F -value associated with the contrast θsugar .

In Activity 30 (Subsection 5.4), the F-value for an ANOVA analysis of the model
length ∼ treatment
using the pea growth dataset was calculated to be 49.37, whereas the value of Fsugar associated with the contrast θsugar was calculated to be 152.56 in Activity 34. Notice that these values are not the same: although the F-value for the ANOVA analysis with the factor treatment is the same as the F-statistic from using multiple regression (as mentioned briefly towards the end of Subsection 5.3), this is not the case for Fsugar, which is associated with a contrast.

The p-value associated with Fsugar is calculated using the F1,n−K distribution and is interpreted in the usual way:
• If the p-value is small, then conclude that µsugar differs from µno sugar
and so sugar does (in general) affect growth of pea sections.
• If the p-value is not small, then conclude that µsugar does not differ from
µno sugar and so sugar does not (in general) affect growth of pea sections.
You will use this in Activity 35, before the general method for testing a
contrast is summarised in Box 7.

Activity 35 Does sugar affect the growth of pea sections?


The p-value corresponding to Fsugar (calculated in Activity 34) is less
than 0.001. What do you conclude regarding whether or not sugar
generally affects the growth of pea sections?

Box 7 Testing a contrast


Suppose that for a response Y and factor A with K levels we have the
contrast
θcontrast = µ1 − µ2 ,
where µ1 and µ2 are, respectively, the response means for levels of A
in group 1 and group 2.
To test the hypotheses:
H0 : θcontrast = 0 (that is, the response means are the same),
H1 : θcontrast ̸= 0 (that is, the response means are different),
use the test statistic
Fcontrast = (ESScontrast / 1) / (RSS / (n − K)),
where
• ESScontrast is the explained sum of squares associated with fitting
the model
Y ∼ contrast,
and contrast is a factor associated with θcontrast
• RSS is the residual sum of squares associated with fitting the model
Y ∼ A.

If the p-value is small, then reject H0 – and conclude that µ1 is different to µ2.
If the p-value is not small, then do not reject H0 – and conclude that
µ1 is the same as µ2 .


The ANOVA table can be extended to include the results for testing a
contrast, in addition to the usual ANOVA table information, so that all of
the results are summarised in one place.
The general form of the extended ANOVA table is summarised in Box 8.

Box 8 The extended ANOVA table


Suppose that for a response Y and factor A with K levels we have the
contrast
θcontrast = µ1 − µ2 ,
where µ1 and µ2 are, respectively, the response means for levels of A
in group 1 and group 2.
The general form of the extended ANOVA table is given in Table 17.
Table 17 The general form of an extended ANOVA table for a model for
response Y with factor A with K levels, and contrast θcontrast

Source of variation   Degrees of freedom   Sum of squares   Mean square       F-value                            p-value
A                         K − 1            ESS              ESS/(K − 1)       [ESS/(K − 1)] / [RSS/(n − K)]      pA
A: θcontrast              1                ESScontrast      ESScontrast /1    [ESScontrast /1] / [RSS/(n − K)]   pcontrast
Residual                  n − K            RSS              RSS/(n − K)
Total                     n − 1            TSS

Notice that the extended ANOVA table shown in Box 8 is the same as the
ANOVA table for the model Y ∼ A (given in Box 5, Subsection 5.4), but
with an additional row relating to the contrast θcontrast . The total row still
relates to the model Y ∼ A, so it is still true that TSS = ESS + RSS.
The notation ‘A: θcontrast ’ has been used in the first column of the extended
ANOVA table in Table 17, to emphasise that the contrast θcontrast was
defined from factor A and the ANOVA analysis associated with θcontrast is
assessed as part of the ANOVA analysis for the model Y ∼ A.
Example 11 shows what an extended ANOVA table looks like in practice.

Example 11 The extended ANOVA table for the pea growth data
In Activity 30 (Subsection 5.4), we obtained the ANOVA table for the model
length ∼ treatment,
using data from the pea growth dataset. The extended ANOVA table for this model, including the results for the contrast θsugar from Activity 34, is given in Table 18.
Table 18 The extended ANOVA table for length ∼ treatment and the contrast θsugar

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment                      4               1077.3          269.3       49.37    < 0.001
treatment: θsugar              1                832.3          832.3      152.56    < 0.001
Residual                      45                245.5            5.5
Total                         49               1322.8

Notice that the extended ANOVA table is exactly the same as the
ANOVA table obtained in Activity 30, except that there is an extra
row presenting the results for the contrast θsugar considered in the
analysis.
Notice also that ESSsugar is less than ESS. This is because, as
mentioned earlier, ESS can be partitioned into ESSsugar and other
explained sums of squares associated with a set of contrasts.

Next, in Activity 36 you will use an extended ANOVA table that has
already been created to draw conclusions about a research question.

Activity 36 The extended ANOVA table for the wages data

Consider once again the wages dataset of 3331 individuals and the model
hourlyWageSqrt ∼ occ.

In Activity 31 (Subsection 7.1), the following contrast was defined:


θmanual = µmanual − µnon-manual ,


where
µmanual = mean response across levels 5, 6 and 7 of occ,
µnon-manual = mean response across levels 1, 2, 3 and 4 of occ.
The extended ANOVA table for this model and the contrast θmanual is
given in Table 19.
Table 19 The extended ANOVA table for hourlyWageSqrt ∼ occ and the contrast θmanual

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
occ                            6                648.4          108.1      118.3     < 0.001
occ: θmanual                   1                417.1          417.1      456.77    < 0.001
Residual                    3324               3035.7            0.9
Total                       3330               3684.1

What do you conclude regarding whether or not an occupation being manual affects the square root of the hourly wage?

7.3 Specifying further contrasts


At the start of Section 7, three questions concerning the data from the pea
growth dataset were posed, namely:
• Does sugar (in general) affect the growth of pea sections?
• How do the effects of the treatments containing sucrose compare with
the effects of the treatments containing other sugars?
• How does the effect of a treatment containing a mix of both glucose and
fructose (that is, treatment gluc + fruct) compare with the effects of
treatments containing glucose or fructose alone?
In Subsection 7.1, we defined the contrast θsugar to investigate the first of
these questions (and we concluded in Activity 35 that sugar does indeed
affect pea growth). In this subsection, you will see that we can define
further contrasts to investigate the other two questions.
Let’s start by considering the second question, concerning the effects of
sucrose compared with the effects of the other sugars. For this question,
we are only interested in treatments which contain sugar. So, we will
define a contrast using only those levels of treatment which contain sugar;
that is, we will define a contrast θsucrose using only the four levels glucose,
gluc + fruct, fructose and sucrose of treatment.
Out of these four sugar treatments, just the one treatment – sucrose –
contains sucrose.


So, define the contrast θsucrose to be


θsucrose = µsucrose − µno sucrose ,
where
µsucrose = mean response when treatment takes level sucrose,
µno sucrose = mean response when treatment takes one of the other sugar
treatments (that is, glucose, gluc + fruct or fructose).
Then, if θsucrose is large in magnitude, this would imply that the effect of
sucrose on pea growth differs from the effect of the other sugars.
Note that for our new contrast θsucrose we are not considering all of the
levels of treatment; instead, we are partitioning the levels in the group of
‘sugar treatments’ defined with our first contrast θsugar . This is illustrated
in Figure 22.

Figure 22 Partitioning the levels of treatment when defining the contrasts: (a) for θsugar, the levels of treatment are partitioned into the groups ‘no sugar’ (none) and ‘sugar’ (glucose, gluc + fruct, fructose, sucrose); (b) for θsucrose, the levels containing sugar are partitioned into the groups ‘no sucrose’ (glucose, gluc + fruct, fructose) and ‘sucrose’ (sucrose)

In Activity 37, we will specify another contrast to help us to answer the third question posed.


Activity 37 Specifying another contrast

Define a contrast – θmix, say – which can be used to investigate the final
question concerning the pea growth data:
• How does the effect of a treatment containing a mix of both glucose and
fructose (that is, treatment gluc + fruct) compare with the effects of
treatments containing glucose or fructose alone?

Following on from Activity 37, note that, as with θsucrose , we are not
considering all of the levels of treatment when specifying the contrast
θmix ; this time, we are partitioning the levels in the group ‘no sucrose’
defined with our contrast θsucrose . This is illustrated in Figure 23.

Figure 23 Partitioning the levels of treatment when defining the contrasts: (a) for θsucrose, the levels containing sugar are partitioned into the groups ‘no sucrose’ (glucose, gluc + fruct, fructose) and ‘sucrose’ (sucrose); (b) for θmix, the levels containing sugar but no sucrose are partitioned into the groups ‘no mix’ (glucose, fructose) and ‘mix’ (gluc + fruct)

So, we have now defined two contrasts, θsucrose and θmix , and to help us
investigate the two questions of interest we can test the two sets of
hypotheses:
H0 : θsucrose = 0, H1 : θsucrose ̸= 0,
and
H0 : θmix = 0, H1 : θmix ̸= 0.

To do this, for each set of hypotheses we basically follow the same procedure as we did for testing θsugar.
So, in order to test θsucrose , we can define another factor – sucrose, say –
from the levels of treatment which relates to θsucrose :
• sucrose: a factor identifying whether or not a sugar treatment contains
sucrose, taking the possible values yes and no.
Then carry out an ANOVA analysis for the model
length ∼ sucrose
using only data for those observations which have a sugar treatment. The
resulting explained sum of squares for this fitted model on the reduced
dataset, ESSsucrose , then provides us with a measure of the variation in the
responses for the sugar treatments which is explained by whether or not
the sugar treatment contains sucrose. The test statistic for this test,
Fsucrose , compares ESSsucrose with the overall residual sum of squares and is
calculated as
Fsucrose = (ESSsucrose / 1) / (RSS / (n − K)),
where, as usual, K is the number of levels of treatment.
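The quantities needed here can be obtained in R along the following lines (a sketch only: the data frame name peas, its column names, and the level label ‘none’ for the control are assumptions):

    # Restrict attention to the sugar treatments, define the factor
    # sucrose on this subset, and read off ESS_sucrose from its ANOVA table.
    sugar.only <- subset(peas, treatment != "none")
    sugar.only$sucrose <- factor(ifelse(sugar.only$treatment == "sucrose",
                                        "yes", "no"))
    anova(aov(length ~ sucrose, data = sugar.only))["sucrose", "Sum Sq"]

Note that only ESSsucrose is taken from this reduced fit: the denominator of Fsucrose still uses the RSS from the full model length ∼ treatment.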
Next, in Activity 38 you will consider how to test θmix .

Activity 38 Testing θmix

Following the methods described for testing θsucrose , briefly outline how we
can test the contrast θmix .

All three contrasts, θsugar , θsucrose and θmix , can be added to an extended
ANOVA table as shown in Table 20.
Table 20 The extended ANOVA table for length ∼ treatment and the contrasts θsugar, θsucrose and θmix

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment                      4               1077.3          269.3       49.37    < 0.001
treatment: θsugar              1                832.3          832.3      152.56    < 0.001
treatment: θsucrose            1                235.2          235.2       43.11    < 0.001
treatment: θmix                1                  3.8            3.8        0.69      0.411
Residual                      45                245.5            5.5
Total                         49               1322.8

Notice that the extended ANOVA table is once again the same as the
ANOVA table for the model length ∼ treatment obtained in Activity 30 (Subsection 5.4), but with three additional rows relating to the three
contrasts.
Also notice that, because ESS partitions into the explained sums of squares for a set of contrasts,
ESSsugar + ESSsucrose + ESSmix = 832.3 + 235.2 + 3.8 = 1071.3,
which is less than the ESS value of 1077.3. (Only three of the maximum possible K − 1 = 4 contrasts have been defined here, so the remaining part of ESS is not attributed to any of them.)
In the next activity you will make use of this extended ANOVA table to
draw conclusions about the pea growth dataset.

Activity 39 Conclusions for the pea growth dataset

Given Table 20, what do you conclude about the effects of the treatments
on pea growth?

When analysing the pea growth dataset, we used the three contrasts θsugar ,
θsucrose and θmix . We could, however, have used different contrasts – the
choice of contrasts depends on the research questions of interest. For
example, we might have wanted to define a contrast to compare the effects
of using a treatment with a mixture of sugars (that is, gluc + fruct) with
the treatments which only contain single sugars (that is, glucose, fructose
and sucrose), or we might have wanted to define a contrast to compare the
effects of treatments containing glucose (that is, glucose and gluc + fruct)
with the other sugar treatments.
Whichever set of contrasts is used for an analysis, there are rules which need to be followed when specifying multiple contrasts, to ensure that the associated tests are valid, as summarised in Box 9.

Box 9 Defining multiple contrasts


In order for the tests in an extended ANOVA table to be valid, each contrast must be defined by partitioning a group of the factor's levels that has been kept together in a previous contrast (with the first contrast partitioning the full set of levels).
For a factor with K levels, the maximum number of such possible
contrasts is K − 1.

To illustrate the rules for defining multiple contrasts, if we defined a contrast to compare the effects of a mixture of sugars (that is, gluc + fruct) with the treatments which only contain single sugars (that is, glucose, fructose and sucrose), then we wouldn't be able to go on to use the contrast θmix (which compared the effects of gluc + fruct with the effects of glucose and fructose), since glucose and fructose would already have been partitioned into a different group to gluc + fruct.

Also, the number of contrasts defined depends on what is of particular interest to the researcher. For example, the researcher may be interested in
θsugar and θsucrose , but not interested in θmix , in which case there is no need
to consider θmix at all. And, of course, if the researcher doesn’t have any
research questions regarding the factor levels, then there is no need to
define any contrasts! There is, however, an upper limit to the number of
contrasts which can be defined: because of the way they are defined, the
maximum number of possible contrasts for an analysis is always K − 1.
Have a go at applying Box 9 to a data analysis in Activity 40.

Activity 40 Defining some contrasts for the wages data

Consider once again the wages dataset and the model


hourlyWageSqrt ∼ occ,
where the response is:
• hourlyWageSqrt: the square root of the individual’s hourly wage (in £)
and the factor explanatory variable is:
• occ: the occupation of the individual, with the codes
◦ 1 (Professional)
◦ 2 (Employer/Manager)
◦ 3 (Intermediate non-manual)
◦ 4 (Junior non-manual)
◦ 5 (Skilled manual)
◦ 6 (Semi-skilled manual)
◦ 7 (Unskilled manual).
In Activity 31 (Subsection 7.1), you defined the contrast θmanual to
investigate the question:
• Is the square root of the hourly wage affected by whether or not an
occupation is manual?
The researchers were also interested in a second question:
• Is the square root of the hourly wage affected by whether or not a
manual occupation is unskilled?
To help answer this, define the contrast:
θunskilled = µunskilled − µskilled ,
where
µunskilled = mean response for unskilled manual occupations,
µskilled = mean response for skilled manual occupations.
(a) Explain how the levels of occ are partitioned when defining θmanual ,
and when defining θunskilled . Do these two contrasts follow the rules
for multiple contrasts to ensure that the tests are valid?


(b) Explain why we couldn’t use either of the contrasts θmanual or θunskilled
to define a third contrast to compare the effects on the square root of
the hourly wage of the more senior occupations (that is, occ levels 1
and 2) with the effects of the other occupations (that is, occ levels 3
to 7).
(c) What is the maximum number of contrasts that could be defined for
this model and these data?

We will use R to obtain the extended ANOVA table for the wages dataset
and test the two contrasts θmanual and θunskilled , considered in Activity 40,
in the (very short!) final section of this unit next.

8 Using R to produce extended ANOVA tables
In this section you will obtain an extended ANOVA table using R. We
won’t go into the details of how to use contrasts in R, as this is actually
rather fiddly. However, Notebook activity 3.7 does provide the code
required so that you can specify the contrasts and obtain the resulting
extended ANOVA table for yourself.

Notebook activity 3.7 Obtaining an extended ANOVA table in R
In this notebook we will use R to specify contrasts and obtain an
associated extended ANOVA table, focusing on data from wages.
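As a flavour of the approach taken there, one common way to do this in base R is sketched below. The contrast coefficients and labels are our own assumptions, not the notebook's code (and with unbalanced data such as wages, the details of the contrast coding matter – the notebook covers these points properly):

    # theta_manual compares levels 5, 6, 7 of occ with levels 1 to 4;
    # theta_unskilled compares level 7 with levels 5 and 6.
    # (Assumes occ is stored as a factor in the data frame wages.)
    manual    <- c(-3, -3, -3, -3, 4, 4, 4)
    unskilled <- c( 0,  0,  0,  0, -1, -1, 2)
    contrasts(wages$occ) <- cbind(manual, unskilled)

    # Fit the one-way ANOVA and split the occ sum of squares by contrast
    # to obtain an extended ANOVA table.
    model <- aov(hourlyWageSqrt ~ occ, data = wages)
    summary(model, split = list(occ = list(manual = 1, unskilled = 2)))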


Summary
This unit has focused on regression when there is a single factor (a
categorical explanatory variable). A regression model with a factor needs
to be treated differently to a regression model with a covariate: the focus
of regression with a covariate is a fitted line, whereas the focus of
regression with a factor is based on the response means for the different
levels of the factor.
In order to model the relationship between the response and the factor
across the factor’s levels, we use a baseline mean which is common to all
levels, together with level effects which, for each level, model how the
mean response for that level differs from the baseline mean. In this unit,
we have used R’s default convention of defining the baseline mean to be
the mean response for level 1 of the factor.
Indicator variables can be used to identify which level of the factor each
observation takes. We can then use the indicator variables to express the
different model equations associated with the different levels of the factor,
by a single model equation. By doing this, our model has the form of a
multiple regression model, where the indicator variables are the covariates.
As such, we can use multiple regression techniques to fit our model and
check the model assumptions.
However, the model with a factor is not interpreted or used in the same
way as a multiple regression model is. When we have a factor, each
regression coefficient is an effect term representing how the associated level
of the factor affects the response in comparison to how level 1 of the factor
affects the response. Also, even though we have multiple indicator
variables (for the levels of the factor), we still only have one explanatory
variable, and so we either need to include all of the indicator variables as
covariates in the model, or none of the indicator variables.
We can then test for a relationship between the response and the factor by
calculating the F -statistic for testing whether all of the effect terms are
zero; if the associated p-value is small, then this implies that there is a
relationship between the response and the factor.
In the second half of the unit, we introduced the idea of ANOVA (analysis
of variance). The focus of ANOVA is to assess whether the factor helps to
explain the variation in the response by comparing the variation which can
be explained by the model, with the variation which is left unexplained by
the model, through the F -value. If the p-value associated with the F -value
is small, then this implies that the factor does help to explain the variation
in the response. The ANOVA table is widely used to summarise the results
from an ANOVA analysis.
We can use ANOVA techniques to analyse the effects of the factor levels
further. To do this, we define contrasts, which compare the mean responses
for two (non-overlapping) groups of levels of the factor; a large value of a
contrast would imply that there is a difference in the mean responses for the two groups, implying that the difference between the levels in the two
groups affects the response. The contrast is tested using another F -value;
if the associated p-value is small, then this implies that the difference
between the levels in the two groups does affect the response. The results
from testing contrasts can be combined with the ANOVA table for the
factor in an extended ANOVA table.
In this unit, R was used for regression with a factor, for obtaining ANOVA
tables, and for specifying contrasts and obtaining the associated extended
ANOVA table.
As a reminder of what has been studied in Unit 3 and how the different
sections link together, the route map for the unit is repeated below.

The Unit 3 route map
• Section 1: Regression with a factor: the basic idea
• Section 2: Developing the model further
• Section 3: Using the proposed model
• Section 4: Using R to fit a regression with a factor
• Section 5: Analysis of variance (ANOVA)
• Section 6: Using R to produce ANOVA tables
• Section 7: Analysing the effects of the factor levels further
• Section 8: Using R to produce extended ANOVA tables


Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate the difference between a covariate and a factor
• appreciate that regression with a factor is based on the response means
for the levels of a factor
• explain how the relationship between the response and the factor can be
modelled through a baseline mean (common to all factor levels) and
individual effect terms, representing how the response mean for each
level differs from the baseline mean
• interpret the baseline mean and individual effect terms for a model fitted
to a specific dataset
• use indicator variables to express the model as a multiple regression
model
• explain the differences in interpretation and use of a multiple regression
model (with q covariates) and a regression model with a single factor
using indicator variables
• use parameter estimates to calculate fitted values and point predictions
• appreciate that the F -statistic (used for testing whether all of the effect
terms are zero) can be used to test for a relationship between the
response and the factor
• explain why we either need to include all of the effect terms in the
model, or none of them
• check the model assumptions using residual plots and normal probability
plots
• fit a regression model with a factor in R and be able to interpret the
output
• appreciate the idea behind ANOVA (analysis of variance) as a method of
assessing the extent to which the variation in the response can be
explained by the factor
• explain the ideas behind the sums of squares TSS, ESS and RSS when
the explanatory variable is a factor
• interpret the results from the ANOVA test based on the F -value test
statistic and its associated p-value
• complete and interpret an ANOVA table
• use R to produce an ANOVA table
• appreciate the ideas behind contrasts for investigating the effects on the
response of groups of levels of the factor
• define a contrast to address a particular research question
• interpret the results from testing a contrast based on the test statistic
Fcontrast and its associated p-value

• complete and interpret an extended ANOVA table which includes the results for any contrasts
• specify multiple contrasts so that the associated tests are valid
• use (given) code to define contrasts and produce an extended ANOVA
table in R.



Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, spending a wage: © Hanna Kuprevich / www.123rf.com
Subsection 2.1, variable dummy: © Prapass Wannapinij / www.123rf.com
Subsection 2.1, manager: © Wavebreak Media Ltd / www.123rf.com
Subsection 2.1, pH indicator paper: © Bjoern Wylezich / www.123rf.com
Subsection 2.1, group of individuals: © rawpixel / www.123rf.com
Subsection 2.2, fireworks: © Maksim Pasko / www.123rf.com
Subsection 3.1, olive oil: © rrraven / www.123rf.com
Subsection 3.2, pay day: © EdZbarzhyvetsky / www.create.vista.com
Subsection 5.1, Bimini Lagoon: © andydidyk / iStock / Getty Images Plus
Subsection 5.2, excited person: © deagreez / www.123rf.com
Subsection 5.3, jumping for joy: © nasrul0412 / www.123rf.com
Subsection 5.4, dressed as a pea: © Mark Bowden / www.123rf.com
Subsection 7.1, manual worker: © Visoot Uthairam | Dreamstime.com
Subsection 7.2, peas: © izikmd / www.123rf.com
Subsection 7.3, jeweller: © Olga Yastremska / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
Both workHrs and educAge are numerical variables, and so are potential
covariates.
Although edLev and occ have numerical values in the dataset, these are in
fact just numerical codes that represent the different values of categorical
variables. As such, both edLev and occ are potential factors.
The remaining two variables, gender and computer, are clearly categorical
and not numerical, and so these two are also potential factors.

Solution to Activity 2
There seems to be a potential error for the fifth individual in the wages
dataset where the value of educAge – the age, in years, at which the
individual ceased education – is just 2. This seems extremely unlikely,
especially given the individual’s education is high (since edLev is 4) and
they are in a professional occupation (since occ is 1).

Solution to Activity 3
From the description of the wages dataset, occ takes seven possible values
and so the factor occ has seven levels.

Solution to Activity 4
Both the boxes and the median lines in the comparative boxplot seem to
follow the general trend – that hourlyWageSqrt seems to increase as the
level codes of occ decrease. Therefore, there seems to be a negative
relationship between hourlyWageSqrt and the level codes of occ.

Solution to Activity 5
The value of occ for the ith individual is 1 and the (population) mean
response for individuals for which occ takes the value 1 is µ1 .
From Figure 6, we can see that the observed responses for individuals for
which occ is 1 vary about the sample mean (the large red circle). So,
following the simple linear regression model given in Model (1) and
illustrated in Figures 4 and 5, we can capture this variation by using a
random term.
A possible model for Yi is therefore given by
Yi = µ1 + Wi ,   Wi ∼ N (0, σ 2 ).


Solution to Activity 6
For the second observation in Table 1, the value of occ is 3. Therefore,
Model (2) can be written as
Y2 = µ3 + W2 , W2 ∼ N (0, σ 2 ),
where µ3 is the (population) mean of the responses for observations for
which occ is 3.

Solution to Activity 7
The value of occ for the second observation in Table 1 is 3, and so
Model (3) for Y2 has the form
Y2 = µ + α3 + W2 , W2 ∼ N (0, σ 2 ),
where
µ = mean response for observations for which occ is 1,
α3 = effect on the response of occ being 3 in comparison to
when occ is 1.
Similarly, since the values of occ are 2 and 6 for Y3 and Y4 , respectively,
we have the model forms
Y3 = µ + α2 + W3 , W3 ∼ N (0, σ 2 ),
Y4 = µ + α6 + W4 , W4 ∼ N (0, σ 2 ),
where µ is as defined above and
α2 = effect on the response of occ being 2 in comparison to
when occ is 1,
α6 = effect on the response of occ being 6 in comparison to
when occ is 1.
The model form for Y5 , however, looks slightly different: this is because
the value of occ for the fifth observation is 1. This means that the model
for Y5 is given simply as
Y5 = µ + W5 , W5 ∼ N (0, σ 2 ),
since the baseline mean has been defined to be the mean of level 1 and
α1 = 0.

Solution to Activity 8
The indicator variable Z4 indicates whether or not the ith observation
takes level 4 of occ or not. As a result, Z4 will only be 1 for those
observations which take level 4 of occ, whereas the rest will be 0. But
none of these five observations take level 4 of occ, and so all of the values
of Z4 in Table 4 will be 0. The other values of the indicator variables are
found similarly.

280
Solutions to activities

The completed table is shown below.

Observation (i) hourlyWageSqrt occ Z2 Z3 Z4 Z5 Z6 Z7


1 3.58 5 0 0 0 1 0 0
2 2.45 3 0 1 0 0 0 0
3 2.80 2 1 0 0 0 0 0
4 2.46 6 0 0 0 0 1 0
5 5.68 1 0 0 0 0 0 0

Notice that the values of all of the indicator variables are 0 for the fifth
observation, which takes level 1 of occ. This is because we don’t need an
indicator variable for level 1 of occ, since the effect of level 1 is assumed to
be part of the baseline mean µ.
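(As an aside, R's model.matrix() function constructs exactly this indicator coding from a factor, with level 1 absorbed into the intercept. A minimal sketch using the five occ values from the table above:

    occ <- factor(c(5, 3, 2, 6, 1), levels = 1:7)
    model.matrix(~ occ)   # columns occ2, ..., occ7 are the indicators Z2, ..., Z7

The column of 1s labelled (Intercept) corresponds to the baseline mean µ.)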

Solution to Activity 9
(a) If the ith observation takes level k of factor A, then the indicator
variable zik takes the value 1, whereas the other indicator variables all
take the value 0. Model (5) then becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αk × 1) + · · · + (αK × 0) + Wi
= µ + αk + Wi ,
which is the form given in Model (4).
(b) If the ith observation takes level 1 of factor A, then all of the indicator
variables zi2 , zi3 , . . . , ziK will be zero. In this case, Model (5) becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αK × 0) + Wi
= µ + Wi ,
as required.

Solution to Activity 10
(a) Using the indicator variable Z2 , for each i = 1, 2, . . . , 42, we can write
the model
height ∼ side
in the form
Yi = µ + α2 zi2 + Wi , Wi ∼ N (0, σ 2 ).

(b) The first tree in the manna ash trees dataset is located on the west
side of the road. This means that side takes level 2 for this
observation, so that z12 = 1. Therefore, using the answer to part (a),
the model for Y1 can be written as
Y1 = µ + (α2 × 1) + W1 , W1 ∼ N (0, σ 2 ),
that is,
Y1 = µ + α2 + W1 , W1 ∼ N (0, σ 2 ).


The 28th tree in the manna ash trees dataset is located on the east
side of the road. This means that side takes level 1 for this
observation, so that z28,2 = 0. Therefore, using the answer to part (a),
the model for Y28 can be written as
Y28 = µ + (α2 × 0) + W28 , W28 ∼ N (0, σ 2 ),
that is,
Y28 = µ + W28 , W28 ∼ N (0, σ 2 ).

Solution to Activity 11
(a) Using the indicator variables Z2 , Z3 , . . . , Z7 , for each i = 1, 2, . . . , 3331,
the model can be written as a single equation with the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + α7 zi7 + Wi , Wi ∼ N (0, σ 2 ).

(b) For the first observation, occ takes level 5. Therefore, the value of z15
is 1, whereas z12 , z13 , z14 , z16 , z17 are all zero. Therefore, the model
from part (a) for Y1 becomes
Y1 = µ + (α2 × 0) + (α3 × 0) + (α4 × 0) + (α5 × 1)
+ (α6 × 0) + (α7 × 0) + W1
= µ + α5 + W1 ,
which is indeed the same as the model for Y1 obtained in Example 2.

Solution to Activity 12
A multiple regression model equation has the same form. We can think of
µ as the intercept parameter, α2 , α3 , . . . , αK as the regression coefficients
and zi2 , zi3 , . . . , ziK as covariates.

Solution to Activity 13
The value of occ for the third observation is 2, and so the fitted value for Y3 is
ŷ3 = µ̂ + α̂2 = 4.489 + (−0.304) = 4.185 = 4.19 (to 2 d.p.).
The value of occ for the fourth observation is 6, and so the fitted value for Y4 is
ŷ4 = µ̂ + α̂6 = 4.489 + (−1.383) = 3.106 = 3.11 (to 2 d.p.).
The value of occ for the fifth observation is 1, and as α1 is set to zero this means the fitted value for Y5 is simply
ŷ5 = µ̂ = 4.489 = 4.49 (to 2 d.p.).


Solution to Activity 14
The parameter µ is our baseline mean for level 1 of occ – this is the mean
of the square root of the hourly wage for individuals in a professional
occupation.
Each of the α parameters (effect terms) is the mean of the square root of
the hourly wage for individuals with the corresponding level of occ
compared with the mean of the square root of the hourly wage for those in
professional occupations.
Since all of the α parameters are negative, this means that the mean of the
square root of the hourly wages for individuals in occupations other than a
professional occupation are lower than for individuals in a professional
occupation.
What’s more, the estimates of the α parameters are decreasing as the level
codes of occ increase. This means that the mean of the square root of the
hourly wage decreases as the level codes of occ increase. But, since the
coding for occ is such that the occupation skill level decreases as the
associated codes increase, this in turn means that the mean of the square
root of the hourly wage decreases as the occupation skill level also
decreases.

Solution to Activity 15
(a) The parameter µ is our baseline mean for level 1 of side. Since ‘east’
has been taken to be level 1 of side, this means that µ represents the
mean height of trees on the east side of the road.
The α2 parameter is the effect term for the second level of side.
Therefore, α2 provides a measure of how the mean height of trees on
the west side of the road compares with the mean height of trees on
the east side of the road.
Since the estimated value of α2 is positive, this means that the mean
height of trees on the west side of the road is higher than the mean
height of trees on the east side of the road.
(b) The first tree is located on the west side of the road. This means that
this tree takes level 2 of side, and so z12 = 1. Therefore, the fitted
value, in metres, for Y1 is
ŷ1 = µ̂ + α̂2 = 6.636 + 1.880 = 8.516 = 8.52 (to 2 d.p.).

The 28th tree is located on the east side of the road, so this tree takes
level 1 of side and z28,2 = 0. Therefore, the fitted value, in metres,
for Y28 is
ŷ28 = µ̂ = 6.636 = 6.64 (to 2 d.p.).

(c) Since the first tree is located on the west side of the road, the fitted values for all of the trees on the west side of the road will be the same as ŷ1, that is, 8.52 (to 2 d.p.).

Similarly, the 28th tree is located on the east side of the road, and so the fitted values for all of the trees on the east side of the road will be the same as ŷ28, that is, 6.64 (to 2 d.p.).

Solution to Activity 16
The value of occ is 7, and so the predicted value for Y0 is
ŷ0 = µ̂ + α̂7 = 4.489 + (−1.435) = 3.054.
This gives us the predicted value of the response, which is the square root
of the hourly wage. We were asked to predict the hourly wage for this
individual (rather than the square root), and so we need to square the
value of yb0 to obtain the value required. So the predicted hourly wage in £
for this individual is
3.054² ≃ 9.33 (to 2 d.p.).

Solution to Activity 17
When the regression model with a factor is expressed in terms of indicator
variables, we can think of the model as a multiple regression model where
the effect parameters α2 , α3 , . . . , αK are regression coefficients. The
hypotheses are then equivalent to the hypotheses (in Subsection 1.3.1 of
Unit 2) which were formulated to test if all of the regression coefficients in
a multiple regression model are zero.
The p-value associated with this test was calculated using the
F -distribution with q and n − (q + 1) degrees of freedom, where q is the
number of regression coefficients in the model. Therefore, using the same
distribution here, the p-value associated with our test will be based on the
F -distribution with K − 1 and n − ((K − 1) + 1) = n − K degrees of
freedom (since there are K − 1 regression coefficients for the K − 1
indicator variables).

Solution to Activity 18
The p-value is 0.00107, which is small. There is therefore evidence to reject
H0 and we conclude that there is a relationship between height and side.

Solution to Activity 19
The p-value is less than 0.001, and so is very small. Therefore, there is
evidence to reject H0 and we can conclude that there is a relationship
between hourlyWageSqrt and occ.

284
Solutions to activities

Solution to Activity 20
All of the p-values in the table are very small. There is therefore (strong)
evidence to suggest that all of the level effect terms are non-zero.
This means that the square root of the hourly wage is significantly
different for individuals in each of the occupation groups in comparison to
the first occupation group – that is, in comparison to individuals classed as
being in a professional occupation. What’s more, since each of the level
effect terms is negative, the square root of the hourly wage is significantly
lower for individuals in each of the occupation groups in comparison to
individuals in a professional occupation.

Solution to Activity 21
Since the factor side has two levels, the associated model using indicator
variables for the levels of side has just one indicator variable (for level 2
of side). Therefore, a test of whether all of the effect terms are zero is
testing the same thing as a test of whether the individual effect term for
level 2 of side is zero. Since we are testing the same thing using the same
data, we should get the same result – that is, the same p-value.
(Note however, that the values of the associated test statistics are not the
same values. This is because the two tests are calculating different test
statistics and calculating the associated p-values using different
distributions – the t-distribution for the t-value and the F -distribution for
the F -statistic.)

Solution to Activity 22
(a) From Unit 2 (Box 7, Subsection 3.1), the model assumptions
underlying multiple regression are that:
• the relationship between each of the explanatory variables
x1 , x2 , . . . , xq and Y is linear
• the random terms Wi , i = 1, 2, . . . , n, are independent
• the random terms Wi , i = 1, 2, . . . , n, all have the same variance σ 2
across the values of x1 , x2 , . . . , xq
• the random terms Wi , i = 1, 2, . . . , n, are normally distributed with
zero mean and constant variance, N (0, σ 2 ).
When each explanatory variable is an indicator variable, the linearity
assumption is automatically satisfied. This is because each indicator
variable only takes two possible values (0 or 1) and so a straight line
will always go through the fitted values associated with these two
values. Also, because the fitted value for each level of the factor is the
same as the sample mean, the assumption of zero mean is also
automatically satisfied.
(b) As for multiple regression (and indeed simple linear regression), a plot
of the residuals against the fitted values can be used to check that it
is reasonable to assume that all the random terms Wi , i = 1, 2, . . . , n, have constant variance. If the assumption of constant variance for the
error terms is reasonable, then we would expect the amount of scatter
about the zero residual line to be roughly constant across the fitted
values.
A normal probability plot of the residuals can be used to check that it
is reasonable to assume that the error terms are normally distributed.
If the normality assumption seems reasonable, then the points in the
normal probability plot should lie roughly along the straight line in
the plot.
We can check that the assumption of independence is reasonable by
plotting the residuals in some sort of order (perhaps in time order or
in the order the data were collected), to check that the residuals don’t
exhibit any noticeable patterns which might indicate lack of
independence. But, as pointed out in Subsection 3.1 of Unit 2, it isn’t
always obvious what order we could plot the residuals for a multiple
regression model to check for independence, and independence is
generally assumed unless there is reason to doubt it.
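In R, these checks take only a few lines for a fitted model, fit say (a sketch, not module-supplied code):

    plot(fitted(fit), resid(fit))   # residuals against fitted values
    abline(h = 0, lty = 2)          # zero residual line
    qqnorm(resid(fit))              # normal probability plot
    qqline(resid(fit))              # reference line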

Solution to Activity 23
In Figure 12(a), the residuals seem to be fairly randomly scattered either
side of the zero residual line, and the scatter seems to be roughly constant
across the fitted values. So the assumption that the residuals have
constant variance seems reasonable.
In Figure 12(b), the points in the centre of the plot follow the line very
closely, but then they deviate from the line at either end. The assumption
of normality may therefore be questionable.

Solution to Activity 24
When the explanatory variable is a factor, the model is based on the mean
responses for each level of the factor and is most useful when there are
differences in the mean response for at least one level of the factor.
From Figure 13, it’s clear that the mean of salinity is different for all
three levels of waterMass (since the boxes in the comparative boxplot do
not overlap with each other). As such, modelling salinity according to
which level is taken by waterMass is likely to be a useful model.

Solution to Activity 25
Comparing Figures 16, 17 and 18, it seems that the ESS is larger than the
RSS, and so the explained sum of squares seems to contribute more to the
total sum of squares than the residual sum of squares does.


Solution to Activity 26
(a) There are 30 observations in this dataset, and so n = 30. Also, the
factor waterMass has three levels, and so K = 3. The F -value is
therefore calculated as
F = [ESS/(K − 1)] / [RSS/(n − K)] = (38.80/2) / (7.93/27) ≃ 66,
as required.
(b) The reported p-value is very small, and so we conclude that the factor
waterMass does help to explain the variation in the response
salinity.
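(These calculations can be checked in R:

    (38.80 / 2) / (7.93 / 27)                         # F-value, about 66
    pf(66.02, df1 = 2, df2 = 27, lower.tail = FALSE)  # p-value, far below 0.001

where the second line computes the upper-tail p-value from the F-distribution.)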

Solution to Activity 27
(a) There are 3331 observations in this dataset, and so n = 3331. Also,
the factor occ has seven levels, and so K = 7. The F -value is
therefore calculated as
F = [ESS/(K − 1)] / [RSS/(n − K)] = (648.4/6) / (3035.7/3324) ≃ 118.3,
as required.
(b) The reported p-value is very small, and so we conclude that the factor
occ does help to explain the variation in the response
hourlyWageSqrt.

Solution to Activity 28
Using information from Activity 26 and the fact that
TSS = ESS + RSS,
the completed ANOVA table for the model salinity ∼ waterMass is
given below.

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
waterMass                      2                38.80          19.40       66.02    < 0.001
Residual                      27                 7.93           0.29
Total                         29                46.73


Solution to Activity 29
Using information from Activity 27 and the fact that
TSS = ESS + RSS,
the completed ANOVA table for hourlyWageSqrt ∼ occ is given below.

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
occ                            6                648.4         108.07      118.33    < 0.001
Residual                    3324               3035.7           0.91
Total                       3330               3684.1

Solution to Activity 30
Data were recorded for ten pea sections for each of the five treatments, and
so n = 50 and K = 5. Therefore, the degrees of freedom column should
have K − 1 = 4 in the treatment row, n − K = 45 in the residual row and
n − 1 = 49 in the total row.
To fill in entries for the sum of squares column, the values of ESS and RSS
were given in the question, and TSS can be calculated by using the
formula TSS = ESS + RSS.
For the entries in the mean square column, the mean square associated
with treatment is calculated as ESS/(K − 1), while the mean square
associated with the residuals is calculated as RSS/(n − K).
The F -value is calculated by dividing the mean square for treatment by
the mean square for the residual.
The completed ANOVA table for length ∼ treatment is given below.

Source of variation   Degrees of freedom   Sum of squares   Mean square   F-value   p-value
treatment                      4               1077.3         269.33       49.37    < 0.001
Residual                      45                245.5           5.46
Total                         49               1322.8

The p-value is very small, and so we’d reject H0 and conclude that the
factor treatment does help to explain the variation in length.


Solution to Activity 31
The researchers need to compare the square root of the hourly wage for
those occupations which are manual with those occupations which are not
manual.
Let µmanual denote the mean response across all levels of occ which
represent manual occupations, and let µnon-manual denote the mean
response across those levels of occ which represent non-manual
occupations. Then to investigate whether the hourly wage is affected by
whether or not an occupation is manual, we need to compare µmanual
with µnon-manual .
There are three levels of occ which represent manual occupations:
5 (skilled manual), 6 (semi-skilled manual) and 7 (unskilled manual), and
the other levels represent non-manual occupations. This means that
µmanual = mean response across levels 5, 6 and 7 of occ,
µnon-manual = mean response across levels 1, 2, 3 and 4 of occ.

So, to help the researcher to answer the question of whether the hourly
wage is affected by whether or not their occupation is manual, define the
contrast
θmanual = µmanual − µnon-manual .
If θmanual is large in magnitude, then this suggests that the means for the
two groups are not the same – in which case, whether or not an occupation
is manual does affect the square root of the hourly wage.

Solution to Activity 32
If the p-value in an ANOVA analysis is small, then the p-value in a
regression analysis of the same model will also be small (because it’s the
same value).
But a small p-value in a regression analysis indicates that the effect terms
in the model are non-zero, and so that the response means are different
across the factor levels.
Therefore, if the p-value is small in an ANOVA analysis of the model, then
this tells us that the response means are different across the factor levels.

Solution to Activity 33
The factor sugar will take the value ‘yes’ for any observation which used a
treatment containing sugar, and will take the value ‘no’ for any
observation which used a treatment not containing sugar. This means that
sugar will take the value ‘yes’ for any observation for which treatment
takes one of the four levels glucose, fructose, gluc + fruct and sucrose, and
sugar will take the value ‘no’ for any observation for which treatment
takes the level ‘none’. The values of sugar for the observations in the
question are given in the completed version of Table 16 below.


Observation length treatment sugar


10 68 none no
20 61 glucose yes
30 58 fructose yes
40 59 gluc + fruct yes
50 67 sucrose yes

Solution to Activity 34
The factor treatment has five levels, so K = 5, and there are
50 observations in the dataset. Therefore, Fsugar is calculated as
Fsugar = (ESSsugar / 1) / (RSS/(n − K)) = (832.3/1) / (245.5/45) ≃ 152.56.

Solution to Activity 35
The p-value corresponding to Fsugar is the p-value associated with the
contrast θsugar . As it is very small, it suggests that µsugar differs from
µno sugar , and so sugar does indeed affect the growth of pea sections.

Solution to Activity 36
The p-value associated with the contrast θmanual is very small. We therefore
conclude that the response mean for manual occupations is different to the
response mean for non-manual occupations, and so whether or not an
occupation is manual does indeed affect the square root of the hourly wage.

Solution to Activity 37
We want to compare the effects of the treatment gluc + fruct (the only
treatment which is a mix of glucose and fructose), with the effects of the
two treatments glucose and fructose (the treatments containing glucose or
fructose alone). So let
θmix = µmix − µno mix ,
where
µmix = mean response when treatment takes level gluc + fruct,
µno mix = mean response when treatment takes one of the levels
glucose or fructose.


Solution to Activity 38
Define a new factor – mix, say – from the levels of treatment which relate
to θmix :
• mix: a factor identifying whether or not the sugar treatments containing
glucose and/or fructose is a mix of glucose and fructose, taking the
possible values yes and no.
Carry out an ANOVA analysis for the model
length ∼ mix
using only data for those observations taking levels glucose, gluc + fruct or
fructose of treatment.
The resulting explained sum of squares for this fitted model on the reduced
dataset, ESSmix , then provides us with a measure of the variation in the
responses for the sugar treatments containing glucose and/or fructose
which is explained by whether or not the sugar treatment is a mix. The
test statistic for this test, Fmix , compares ESSmix with the overall residual
sum of squares and is calculated as
Fmix = (ESSmix / 1) / (RSS/(n − K)),
where K is the number of levels of treatment.

Solution to Activity 39
The p-values associated with treatment, and the contrasts θsugar and
θsucrose , are all very small (p < 0.001). This means that there is strong
evidence to suggest that treatment helps to explain the variation in
length, as does whether or not the treatment contains sugar, and whether
or not the sugar treatment contains sucrose.
However, the p-value associated with the contrast θmix is large (p = 0.411), and so there is no evidence to suggest that pea growth is affected by whether a treatment is a mixture of glucose and fructose, or just either glucose or fructose alone.

Solution to Activity 40
(a) There are three levels of occ which represent manual occupations: 5
(skilled manual), 6 (semi-skilled manual) and 7 (unskilled manual),
whereas the other four occupations (occ levels 1 to 4) are all
non-manual. So, the contrast θmanual partitions the seven levels of occ
into two groups: a ‘manual’ group for levels 5, 6 and 7 of occ, and a
‘non-manual’ group for levels 1, 2, 3 or 4 of occ.
For the contrast θunskilled , we are only interested in manual
occupations – that is, levels 5, 6 and 7 of occ. Of these, only level 7
(unskilled manual) is classed as being unskilled, whereas levels 5
and 6 are both classed as skilled. So, the contrast θunskilled partitions the manual occupations into two groups: an ‘unskilled’ group for
level 7, and a ‘skilled’ group for levels 5 and 6.
These two contrasts do follow the rules for multiple contrasts since
levels 5, 6 and 7 appear together in one of the groups for the contrast
θmanual (the ‘manual’ group).
(b) We couldn’t use either of the contrasts θmanual or θunskilled to define a
third contrast to compare the effects of levels 1 and 2 of occ with the
effects of levels 3 to 7 of occ because, although levels 1 and 2 of occ
are both contained in the same group for θmanual (the ‘non-manual’
group), levels 3 to 7 of occ are not in the same group for either of the
contrasts.
(c) There are seven levels of occ, so K = 7. Therefore, the maximum
number of contrasts that could be defined for this model and these
data is K − 1 = 6.

Unit 4
Multiple regression with both covariates and factors
Introduction
In Unit 2, we considered regression models for a continuous response where
the explanatory variables were all covariates (that is, numerical
explanatory variables), whereas in Unit 3 we introduced regression models
for a continuous response where the single explanatory variable was a
factor (that is, a categorical explanatory variable). In real-world
applications, however, it is often the case that the potential explanatory
variables can include a mixture of both covariates and factors.
For example, when analysing the wages dataset in Unit 3, we took the
variable hourlyWageSqrt as the response and occ as a single factor
explanatory variable. However, the dataset also contains data on several
additional potential explanatory variables – both potential covariates
(namely, workHrs and educAge) and potential factors (namely, gender,
edLev and computer).
In this unit we will introduce regression models which can accommodate
any number of covariates in addition to any number of factors as
explanatory variables.

How Unit 4 relates to the module so far
Moving on from . . .
• Regression with any number of numerical explanatory variables (Unit 2)
• Regression with one categorical variable (Unit 3)
What's next?
• Regression with any number of numerical and categorical explanatory variables

Section 1 starts by setting up regression models for the situation in which there is just a single covariate and a single factor as potential explanatory
variables. In this case, separate regression lines fit data for different
observed levels of the factor. Section 2 develops these models in the case
where the separate regression lines are parallel to each other, whereas
Section 3 develops the models in the case where the separate regression
lines are not parallel to each other. Section 4 then considers the case
where there are two factors (but no covariates) as the explanatory
variables, and these models are extended in Section 5. Finally, in
Section 6, we’ll bring everything together and consider regression models
with any number of covariates and factors as the explanatory variables.


The route map shows how the sections of the unit connect to each other.

The Unit 4 route map
• Section 1: Regression with one covariate and one factor
• Section 2: Modelling using parallel slopes
• Section 3: Modelling using non-parallel slopes
• Section 4: Regression with two factors that do not interact
• Section 5: Regression with two factors that interact
• Section 6: Regression with any number of covariates and factors

Note that you will need to switch between the written unit and your
computer for Subsections 2.4, 3.4, 4.5, 5.3 and 6.4.

1 Regression with one covariate and one factor
In this section, we will lay the foundations for regression models which can
have both multiple covariates and multiple factors as explanatory
variables. We’ll consider the situation where there is a continuous response
variable and just a single covariate and a single factor as potential
explanatory variables. In Subsection 1.1 we’ll introduce the main idea
behind the model, before we build the model in Subsection 1.2.


1.1 The basic idea


To help explain the basic idea behind incorporating both a covariate and a
factor into a regression model, we’ll revisit the manna ash trees dataset
introduced in Unit 1 (Subsection 2.2.2). This dataset contains data
collected as part of a citizen science project concerning 42 manna ash trees
located on Walton Drive on The Open University campus in Milton
Keynes. Three of the variables for which data were collected for each tree
are:
• height: the height of the tree (in metres), rounded to the nearest metre
• diameter: the diameter of the tree (in metres, to two decimal places) at 1.3 m above the ground
• side: the side of Walton Drive the tree is on, taking the possible values
west and east.
In Activity 14 of Unit 1 (Subsection 4.2.1), a simple linear regression
model was fitted to these data taking height as the response variable and
diameter as a single explanatory variable, with resulting fitted model
height = 5.05 + 12.27 diameter.
A scatterplot of the data, together with the fitted regression line, was
given in Figure 9 in that activity: this is repeated here as Figure 1 for your
convenience.

Figure 1 Repeat of Figure 9 from Unit 1 showing a scatterplot of diameter and height together with the fitted line


In Section 8 of Unit 1, two other regression lines were also fitted to these
data – one line was fitted using only data for the trees on the west side of
Walton Drive, while the other regression line was fitted using only data for
the trees on the east side of Walton Drive. The resulting equations of those
fitted lines are
height = 4.50 + 17.26 diameter, for trees on the west side,
height = −0.81 + 27.59 diameter, for trees on the east side.
Figure 2 shows these two fitted lines on a scatterplot of height and
diameter (this is in fact a repeat of Figure 22 in Unit 1). It is clear from
both the equations of the fitted lines and Figure 2 that the fitted line for
trees on the west side of the road is not the same as the fitted line for trees
on the east side of the road. As such, the different levels of the factor side
are affecting the relationship between height and diameter and we ought
to accommodate this in our regression model.

Figure 2 Repeat of Figure 22 from Unit 1 showing the fitted regression lines for the manna ash trees dataset using trees on the west side of Walton Drive only (blue triangles and dashed blue fitted line) and using trees on the east side of Walton Drive only (red circles and solid red fitted line)

In order to do this, we could continue using the approach of splitting up the data according to the levels of the factor and then fitting separate
regression models for each factor level. This may be fine when we just have
a single factor with only two levels (as we have here), but things can soon
get tricky when we have a factor with multiple levels or several factors.


Additionally, by using separate models, we are losing information about the ‘bigger picture’ of the data and model as a whole. For example, since
each model has its own error variance (each estimated using only a subset
of the data), the error variances for the different models are assumed to be
unrelated, but we are usually interested in estimating the error variance
across the whole model and data.
To address these issues, in the next subsection we will introduce another
regression model that can still allow for different regression lines for the
levels of a factor, but can be expressed in the form of a single model
equation (with a single error variance) to allow us to gain more insight into
the data and model as a whole.

1.2 Building a model


We wish to build a regression model for a (continuous) response
variable Y , using two explanatory variables – a single covariate, x, and a
single factor, A, with K levels. (Although it looks rather odd with a
lower-case x and an upper-case A when they are both (known) explanatory
variables, it is conventional to denote factors by upper-case letters.)
Recall from Unit 2 that two covariates can be introduced into a multiple
regression model for response Y by adding each covariate into the model
equation. For the model we are considering now, we also have the situation
where there are two explanatory variables, but this time one of the
explanatory variables is a factor. Despite this difference, we can still use
the same idea from multiple regression of adding the two explanatory
variables into the model equation. To see how we may do this, let’s first
revisit the model
Y ∼ A.
Now, in Unit 3, we saw that the general form of this model for the ith
observation is
Yi = µ + αk + Wi , i = 1, 2, . . . , n. (1)
Here, the Wi ’s are independent normal random variables with zero mean
and constant variance σ 2 , µ is the baseline mean for Y for level 1 of A, αk
is the effect on Y of level k of factor A in comparison to the effect of
level 1 of A, and α1 is set to be zero.


Activity 1 Why is α1 set to be zero?

In Model (1), the term α1 is set to be zero. Explain why this is so.

We saw in Subsection 2.1 in Unit 3 that Model (1) can be rewritten in
terms of a set of indicator variables which allow the model to be
represented by an equation of the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + Wi , i = 1, 2, . . . , n, (2)
where
zi2 = 1 if the ith observation takes level 2 of A, and 0 otherwise,
zi3 = 1 if the ith observation takes level 3 of A, and 0 otherwise,
. . .
ziK = 1 if the ith observation takes level K of A, and 0 otherwise.
Each of the indicator variables is numerical, and therefore a covariate. So,
as we saw in Subsection 2.2 in Unit 3, by expressing the regression model
with a single factor explanatory variable in terms of indicator variables, we
equivalently have a multiple regression model with K − 1 covariates.
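To see what these indicator variables look like in practice, here is a minimal R sketch using a small made-up factor (not one of the module datasets); model.matrix() displays the columns that R constructs when a factor appears in a regression formula.

# A made-up factor A with K = 3 levels, observed on four units
A <- factor(c("level1", "level2", "level3", "level2"))

# The model matrix for ~ A contains an intercept column plus the
# K - 1 = 2 indicator columns, corresponding to zi2 and zi3
model.matrix(~ A)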
Now, we would like a regression model with x (a covariate) as an
explanatory variable in addition to having A (a factor) as an explanatory
variable. Since we can express the model with A as a multiple regression
model (using indicator variables as covariates), and x is also a covariate,
we can simply add the extra covariate (x) into Model (2) and then once
again use multiple regression.
Combining the model notation from Unit 2 for multiple regression with the
notation from Unit 3 for a regression model for a single factor explanatory
variable, we can then represent this model as
Y ∼ A + x.
Figure 3 illustrates diagrammatically how we are going to build the model.
In Activity 2, we will then use this approach to build a model for the
variable height using the manna ash trees dataset.


Figure 3 Illustration of how we are going to build the model Y ∼ A + x: the response Y, the factor A with K levels and the covariate x; A is converted into the K − 1 indicator variables zi2, zi3, . . . , ziK, giving a multiple regression with covariates zi2, zi3, . . . , ziK; adding x then gives a multiple regression with covariates zi2, zi3, . . . , ziK and xi, that is, the model Y ∼ A + x

Activity 2 Multiple regression for the manna ash trees data


Consider once again the manna ash trees dataset with response height,
covariate diameter and factor side. The two levels of the factor side are
east and west. Let east be level 1 of side and west be level 2.
We can express the model
height ∼ side + diameter
as a multiple regression model with two covariates. What are these two
covariates?

In this section, we have introduced the idea of using multiple regression
when we have a single factor and a single covariate as the explanatory
variables. How we might do this depends on how the regression lines
associated with the different levels of the factor are related to one another.
The simplest situation is when the regression lines for the levels of the
factor are parallel to one another. In this case, the slopes of all of the
regression lines are the same, but the intercepts of the regression lines are
not the same. We will develop a model for this in Section 2.
The alternative situation is when the regression lines for the levels of the
factor are not parallel to one another, so that neither the slopes nor the
intercepts of the regression lines are the same across the levels. We’ll
consider a model for this in Section 3.


2 Modelling using parallel slopes


In this section, we will introduce what is known as the parallel slopes model
for modelling factor-level regression lines which are parallel to one another.
We’ll start in Subsection 2.1 by specifying the model, before discussing
tests for the parameters for this model in Subsection 2.2 and the model
assumptions in Subsection 2.3. We’ll finish the section with Subsection 2.4
where we’ll implement the model in R.

2.1 The parallel slopes model


We wish to include both a covariate x and a factor A (with K levels) as
explanatory variables in a linear regression model. Using the model ideas
presented in Subsection 1.2, we can do this using multiple regression
with x and the indicator variables associated with the factor (that is,
zi2, zi3, . . . , ziK) as covariates in the model.
So, let’s add x as an extra covariate into Model (2), and write our model
equation as
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + βxi + Wi , i = 1, 2, . . . , n, (3)
where the Wi ’s are independent normal random variables with zero mean
and constant variance σ 2 , and zi2 , zi3 , . . . , ziK are indicator variables so
that, for k = 2, 3, . . . , K,

1 if the ith observation takes level k of A,
zik =
0 otherwise.

The parameter µ in Model (3) has a different interpretation to that in
Model (2). This time, µ is a baseline mean for Y when factor A is level 1
and the value of x is zero. As such, µ is an intercept parameter for the
regression line associated with level 1 of A.
The parameters α2 , α3 , . . . , αK in Model (3) also have a different
interpretation to those in Model (2). This is because we now also have a
covariate (x) in the model and so, as was the case with multiple regression,
each parameter cannot be interpreted in isolation from the rest of the
explanatory variables. This time, αk is the effect of level k of factor A on
the response (in comparison to the effect of level 1 of A) after controlling
for the covariate x; that is, the effect of level k of factor A when treating
the effect of x as fixed. Similarly, the parameter β is the expected increase
in the response due to a unit increase in x after controlling for the
factor A; that is, the effect of x when treating the effects of all the different
levels of factor A as fixed.
You will investigate how the model works in the next activity.


Activity 3 A linear regression model for the manna ash trees dataset
Consider once again the manna ash trees dataset and the model
height ∼ side + diameter.
The fitted model, for i = 1, 2, . . . , 42, turns out to be
height = 1.29 + 2.62 zi2 + 19.80 diameter,
where zi2 is an indicator variable for the factor side given by
zi2 = 1 if the ith tree takes level 2 of side (that is, if the ith tree is on the west side of Walton Drive), and 0 otherwise,

and the ith tree is the tree given in the ith row of the dataset (as in
Example 3 of Unit 3 (Subsection 2.1)).
(a) Write down the fitted model for Yi when the ith tree is on the east
side of Walton Drive.
(b) Write down the fitted model for Yi when the ith tree is on the west
side of Walton Drive.
(c) What do you notice about the two fitted regression lines for the
different levels of the factor side?
(d) Compared with being on the east side of Walton Drive, what effect
does being on the west side of the road have on height after
controlling for diameter?
(e) What effect does an increase in diameter by 0.1 m have on height
after controlling for side?

It turns out that Model (3) will always produce fitted regression lines for
the individual levels of the factor A which are parallel, because there is
only one slope parameter β for all of the levels. For this reason, this model
is known as the parallel slopes model.
A scatterplot of the response and the covariate with the two fitted lines
obtained for the manna ash trees dataset in Activity 3 is given in Figure 4,
next, together with a visual representation of how the fitted values µ̂, α̂2 and β̂ relate to the fitted lines.


The data points and regression lines in Figure 4 are identified according to
the two levels of side.

Figure 4 Scatterplot of height and diameter, together with the fitted regression lines for the two levels of the factor side (both lines have slope β̂ = 19.80; the east-side line has intercept µ̂ = 1.29, and the west-side line lies α̂2 = 2.62 above it)

The parallel slopes model is summarised in Box 1.

Box 1 Regression with one covariate and one factor: the parallel slopes model
The model
Y ∼A+x
for response Y , covariate x and factor A with K levels has the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + βxi + Wi , i = 1, 2, . . . , n,
where the Wi ’s are independent normal random variables with zero
mean and constant variance σ 2 , and zi2 , zi3 , . . . , ziK are indicator
variables so that, for k = 2, 3, . . . , K,
zik = 1 if the ith observation takes level k of A, and 0 otherwise.


Therefore, the model becomes


Yi = µ + αk + βxi + Wi , i = 1, 2, . . . , n,
where α1 is set to be zero.
The fitted lines for the different levels of factor A are parallel to each
other, and so the model is called the parallel slopes model.
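Fitting this model in R is covered properly in Subsection 2.4, but the following minimal sketch gives a flavour of what is to come, assuming the manna ash trees data are in a data frame mannaAsh with columns height, diameter and side (with east coded as level 1, as in Activity 2).

# Ensure east is level 1 and west is level 2 of the factor side
mannaAsh$side <- factor(mannaAsh$side, levels = c("east", "west"))

# Fit the parallel slopes model height ~ side + diameter
fit <- lm(height ~ side + diameter, data = mannaAsh)
summary(fit)  # estimates of the baseline mean, alpha2 and beta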

We’ll finish this subsection with an example and an activity using a parallel
slopes model for the FIFA 19 dataset first introduced in Unit 1 (Section 3).

Example 1 A parallel slopes model for the FIFA 19 dataset
The FIFA 19 dataset contains data on various attributes of 100
football players. Three of the variables for which there are data are:
• strength: a score of strength, expressed as an integer between 0
and 100
• weight: the player’s weight, measured in pounds (lb), to the
nearest pound
• skillMoves: an assessment of each player’s football ‘skill’ moves,
taking possible values 1, 2, 3, 4 and 5 (with 5 being the highest
level).
In Notebook activity 1.20 of Unit 1, R was used to fit a simple linear
regression model to the data taking strength as the response variable
and the covariate weight as the explanatory variable. The fitted line
for this model is shown on a scatterplot of strength and weight in
Figure 5, next, where the plotting symbol indicates the corresponding
level of skillMoves for each player.


Figure 5 Scatterplot of strength and weight with data for the different levels of the factor skillMoves identified, together with the fitted line for the model strength ∼ weight

Notice that in Figure 5, the data points associated with the different
levels of skillMoves don’t seem to be totally randomly scattered
about the fitted line. For example, the data points associated with
level 1 of skillMoves are generally below the fitted line, while those
for level 2 of skillMoves are generally above the fitted line. This
suggests that the different levels of the factor skillMoves may also
affect the response strength. Therefore, the model
strength ∼ skillMoves + weight
was fitted to the data, producing the following fitted regression lines
for the five different levels of skillMoves:
strength = 6.94 + 0.36 weight, for level 1,
strength = 13.75 + 0.36 weight, for level 2,
strength = 12.53 + 0.36 weight, for level 3,
strength = 9.00 + 0.36 weight, for level 4,
strength = 9.14 + 0.36 weight, for level 5.
These fitted lines are shown on a scatterplot of strength and weight
in Figure 6.


Figure 6 Scatterplot of strength and weight with data for the different levels of the factor skillMoves identified, together with the five fitted regression lines for the model strength ∼ skillMoves + weight

Activity 4 Interpreting a parallel slopes model

Consider the regression model


strength ∼ skillMoves + weight
that was introduced in Example 1.
(a) Given the fitted regression lines for the five different levels of
skillMoves, what are the values of the estimated parameters µ̂, α̂2, α̂3, α̂4, α̂5 and β̂?
(b) After controlling for weight, describe how the different levels of
skillMoves affect strength.

So, we have a model for fitting parallel regression lines associated with the
levels of a factor, but, as usual for regression models, we now need to check
whether or not all of the explanatory variables need to be in the model.
We will consider how to do this in the next subsection.


2.2 Testing the model parameters


Recall from Subsection 1.3 in Unit 2 that, in multiple regression with
q regression coefficients, we can test whether all of the regression
coefficients are zero by using a p-value for the F -statistic which is based on
the F -distribution with q and n − (q + 1) degrees of freedom. We used this
same test in Unit 3 to test whether all of the regression coefficients for the
K − 1 indicator variables are zero in a regression model with a single factor
(as summarised in Box 3 of Unit 3). It should therefore come as no surprise
to you that we can use the same test (although with different degrees of
freedom) to test if all of the regression coefficients are zero in the model
Y ∼ A + x.

Activity 5 Testing if all regression coefficients are zero

For the model


Y ∼A+x
we wish to test the hypotheses
H0 : α2 = α3 = · · · = αK = β = 0,
H1 : at least one of the regression coefficients is not zero.
Considering the tests used in Units 2 and 3, on what distribution do you
think the p-value associated with this test is based?

In the next activity, we will use the test described in Activity 5 for the
FIFA 19 dataset.

Activity 6 Testing the regression coefficients for the FIFA 19 dataset
When the model
strength ∼ skillMoves + weight
is fitted to data from the FIFA 19 dataset it produces an F -statistic value
of 20.2 (with 5 and 94 degrees of freedom), with an associated p-value that
is less than 0.001. Is there evidence to suggest that at least one of the
regression coefficients is non-zero?

So, the F -test tests whether all of the regression coefficients are zero.
However, this doesn’t tell us whether there is evidence that any of the
individual parameters are zero.
Now, the model that we’re considering here, Y ∼ A + x, is a multiple
regression model. As such, separate t-tests can be used to test whether


each individual parameter is zero after controlling for the other explanatory
variables; R carries out these t-tests automatically when fitting the model.
While an individual t-test is fine for assessing β (the regression coefficient
for the covariate x), individual t-tests for assessing the level effect
parameters associated with the factor A aren’t always particularly useful.
This is because, as mentioned in Unit 3 (Subsection 3.2), either all or none
of the level effect parameters associated with the factor A need to be
included in the model, and so the level effects really ought to be assessed
‘as a whole’ (which is indeed what we did by using the F -test to test
whether all of the level effect parameters are zero for the model Y ∼ A).
So, for the model Y ∼ A + x, rather than relying on the individual t-tests
for deciding whether A should be included in the model, we will take a
different approach and instead consider how well the model with both A
and x fits the data in comparison to how well the model with only x fits
the data. If the fit of the model is significantly improved by including A
(in addition to x), then this would suggest that A should be included in
the model.
This leads us to the question: ‘How should we compare the fits of the two
models Y ∼ x and Y ∼ A + x?’ To answer this question, you first need to
know what nested models are. We explain these next in Box 2, and
demonstrate them for one particular case in Example 2.

Box 2 Nested models


Two linear regression models, M1 and M2 , say, are nested if the
explanatory variables in M1 are a subset of the explanatory variables
in M2 . This means that M2 has all of M1 ’s explanatory variables, as
well as one or more extra explanatory variables.
M1 is then said to be nested within M2 .

Example 2 Identifying nested models


In this section, we are interested in the two models
Y ∼x and Y ∼ A + x.
The model Y ∼ x has the single explanatory variable x, and the
model Y ∼ A + x has both x and A as explanatory variables.
As the explanatory variables of Y ∼ x (just x) form a subset of those of
Y ∼ A + x, the two models are nested and Y ∼ x is nested within Y ∼ A + x.

In the next activity, you will practise identifying pairs of nested models.


Activity 7 Which pairs of models are nested?

Suppose that we have three linear regression models, MA , MB and MC , for
the same dataset with response variable Y and possible explanatory
variables x1 , x2 , . . . , x5 , where:
• MA has explanatory variables x2 , x4 , x5
• MB has explanatory variables x1 , x2 , x4 , x5
• MC has explanatory variables x2 , x5 .
List the pairs of nested models amongst these three models.

You may be wondering what all this talk of nested models has to do with
comparing the fits of the two models Y ∼ x and Y ∼ A + x. Well, it turns
out that, when we have two nested models (which indeed we do have with
models Y ∼ x and Y ∼ A + x), we can compare the fits of the two models
by considering the difference between the values of the residual sum of
squares (RSS) for the two models. (Basically, when we have two nested
models, the maths works nicely!)
Now, recall from Subsection 5.2 in Unit 3 that a model’s RSS provides a
measure of how much variation in the response is left unexplained by the
fitted model. Also recall from Section 5 in Unit 2 that model fit gets better
as we increase the number of explanatory variables, which means that the
model Y ∼ A + x will fit the data better than the model Y ∼ x. Of course,
the better the fit of a model, the better the model explains the data, which
in turn means that the unexplained variation decreases. It therefore follows
that the RSS for the (better fitting) model Y ∼ A + x will be less than the
RSS for the model Y ∼ x. Thus it is the size of the difference in RSS
between these two models that matters, as you will explore in Activity 8.
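In R, the RSS of a fitted linear model can be inspected directly, so the difference discussed here is easy to compute by hand. A minimal sketch for the manna ash trees data (assuming, as before, a data frame mannaAsh):

fit_x  <- lm(height ~ diameter, data = mannaAsh)         # model Y ~ x
fit_Ax <- lm(height ~ side + diameter, data = mannaAsh)  # model Y ~ A + x

# For an lm() fit, deviance() returns the residual sum of squares (RSS)
deviance(fit_x) - deviance(fit_Ax)  # decrease in unexplained variation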

Activity 8 What does the difference between the RSS values tell us?
Suppose that the difference between the RSS values for the two models
Y ∼ x and Y ∼ A + x is large. What does this suggest about the relative
fits of the two models?

We know from Section 5 in Unit 2 that model fit isn’t the ‘be all and end
all’; the principle of parsimony is also important. This means we only want
to include in the model those explanatory variables which significantly
increase the fit of the model. Therefore we only want to include A in the
model (in addition to x) if there is a significant gain in fit from doing so.


As you found in Activity 8, we can use the difference between the values of
the RSS for the two models Y ∼ x and Y ∼ A + x to assess the gain in fit
by including A in the model.
• If the difference is large enough, then this suggests that the gain in fit
when A is included (in addition to x) is significant enough to suggest
that A should be in the model as well as x.
• If the difference is small, then this suggests that there isn’t much gain in
fit when A is included (in addition to x), and so, for parsimony, it would
be better to use the simpler model Y ∼ x.
You will probably not be surprised to learn that a test can be used to
decide whether the difference in the values of the RSS is ‘large enough’ to
suggest that A should be in the model in addition to x. The test is in fact
another ANOVA test, because it compares different sources of response
variation. As such, the test is based on the F -distribution and is referred
to as an F -test. (Yes, yet another ‘F -test’ !)
As usual, R can easily calculate the test statistic for this test, together
with its associated p-value. For this module, all you need to know about
the details of the test is summarised in Box 3. You will then use this
F -test in Activities 9 and 10.

Box 3 Testing whether factor A should be in the model


To test the hypotheses
H0 : A should not be included in the model,
H1 : A should be included in the model,
for the model
Y ∼ A + x,
a test statistic is used that is based on the difference in the values of
the RSS for the two models
Y ∼x and Y ∼ A + x.

The test statistic, F , can be expressed as
F = {[(RSS for Y ∼ x) − (RSS for Y ∼ A + x)]/(K − 1)} / {(RSS for Y ∼ A + x)/(n − K − 1)}
  = (estimated decrease in unexplained variation due to including A) / (estimated variation not explained by Y ∼ A + x).
If the p-value is small, then we reject H0 . We then conclude that the
factor A should be included in the model together with the
covariate x.
If the p-value is not small, then there is not enough evidence to
reject H0 . We then conclude that the factor A should not be included
in the model together with the covariate x.
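Reusing the two fits from the RSS sketch earlier, this F -test can be obtained in R by passing the two nested models to anova(); a minimal sketch:

# anova() computes the F-statistic of Box 3 from the difference in the
# RSS values of the two nested fits, together with its p-value
anova(fit_x, fit_Ax)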


Activity 9 Are diameter and side both required?

The model
height ∼ side + diameter
was fitted to data from the manna ash trees dataset and produced the
output shown in Table 1.
Table 1 Output produced when fitting the parallel slopes model for the manna
ash trees dataset

Parameter                 Estimate   Standard error   t-value   p-value
µ (baseline mean)             1.29            1.106     1.167      0.25
β (slope for diameter)       19.80            3.876     5.108   < 0.001
α2 (side level 2)             2.62            0.442     5.924   < 0.001

A second model
height ∼ diameter
was fitted to the data. An ANOVA test to compare the RSS values for the
two fitted models for height was carried out: the test statistic was
calculated to be F = 35.1 and the associated p-value was less than 0.001.
Are diameter and side both required in the model?

Activity 10 Are skillMoves and weight both required?

The model
strength ∼ skillMoves + weight
was fitted to data from the FIFA 19 dataset and produced the output
shown in Table 2.
Table 2 Output produced when fitting the parallel slopes model for the FIFA 19
dataset

Parameter                  Estimate   Standard error   t-value   p-value
µ (baseline mean)              6.94            7.897     0.879     0.382
β (slope for weight)           0.36            0.044     8.239   < 0.001
α2 (skillMoves level 2)        6.81            1.653     4.121   < 0.001
α3 (skillMoves level 3)        5.59            1.512     3.695   < 0.001
α4 (skillMoves level 4)        2.06            1.511     1.365     0.176
α5 (skillMoves level 5)        2.20            2.839     0.775     0.440


A second model
strength ∼ weight
was also fitted to the data. An ANOVA test to compare the RSS values for
the two fitted models for strength was carried out: the test statistic was
calculated to be F = 7.4 and the associated p-value was less than 0.001.
Given these results, are weight and skillMoves both required in the
model?

Testing for the parallel slopes model is summarised in Figure 7.

Figure 7 Summary of testing for the parallel slopes model Y ∼ A + x: whether A is required in addition to x is tested by an F -test (ANOVA test) comparing the RSS values of the models Y ∼ x and Y ∼ A + x; whether x is required in addition to A is tested by the individual t-test from multiple regression

Following the usual methods for regression, once we have decided which
explanatory variables should be included in the model, we need to check
whether the assumptions underlying the model are reasonable for the fitted
model and data. Checking the assumptions of the parallel slopes model is
the focus of the next subsection.

2.3 Checking the assumptions of the parallel slopes model
The assumptions underlying the parallel slopes model are basically those
underlying multiple regression, together with an additional assumption
that each of the regression lines for the different levels of the factor has the
same slope. The assumptions are summarised in Box 4, next.


Box 4 Assumptions of the parallel slopes model


For response Y , covariate x and factor A with K levels, there are four
main assumptions of the parallel slopes model
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + βxi + Wi , i = 1, 2, ..., n.
These are:
• Parallel regression lines: the regression lines for the different
levels of the factor all have the same slope.
• Independence: the random terms Wi , i = 1, 2, . . . , n, are
independent.
• Constant variance: the random terms Wi , i = 1, 2, . . . , n, all have
the same variance σ 2 across the values of x1 , x2 , . . . , xn .
• Normality: the random terms Wi , i = 1, 2, . . . , n, are normally
distributed with zero mean and constant variance, N (0, σ 2 ).

The last three assumptions given in Box 4 regarding the random terms
W1 , W2 , . . . , Wn of the parallel slopes model can be checked using methods
from multiple regression given in Subsection 3.1 of Unit 2 (and also used in
Subsection 3.3 of Unit 3). The first assumption that the regression lines for
the different levels of the factor all have the same slope can be checked
informally by visual inspection of a scatterplot of the response variable and
the covariate, together with the fitted regression lines. (A formal method
to test whether or not the slopes are the same will be introduced later, in
Subsection 3.2.)
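As a reminder of what those checks involve, here is a minimal R sketch of the two standard diagnostic plots for a fitted model fit (for example, the parallel slopes fit of height ∼ side + diameter sketched earlier):

# Residual plot: residuals against fitted values
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal probability plot of the standardised residuals
qqnorm(rstandard(fit))
qqline(rstandard(fit))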
In the next example, we will check whether or not the assumptions of the
parallel slopes model are reasonable for the fitted model for the manna ash
trees dataset.

Example 3 Checking the assumptions of the parallel slopes model for the manna ash trees dataset
The model
height ∼ side + diameter
was fitted in Activity 3 using data from the manna ash trees dataset.
In Activity 21 of Unit 1 (Subsection 5.1.2) you checked the
independence assumption of a simple linear regression model fitted to
these data by plotting the residuals against identification numbers
(treeID). We could do the same using the residuals for the model
fitted in Activity 3 to check the independence assumption. However,
as was noted in the Solution to Activity 21 in Unit 1, it is not clear
what the identification numbers actually represent. So we will assume


that the independence assumption is reasonable, given the absence of any
reason why it shouldn’t be. Figure 8 shows the resulting residual
plot and normal probability plot of the residuals.

Figure 8 Checking the parallel slopes model assumptions for the manna ash trees dataset: (a) residual plot of residuals against fitted values, with three large residuals labelled 13, 14 and 16; (b) normal probability plot of standardised residuals against theoretical quantiles


In the residual plot, there are three residuals (numbered 13, 14 and 16
on the plot) which seem rather large in comparison to the other
residuals, suggesting that perhaps the assumption that the variance of
the Wi ’s is constant may be questionable. However, the rest of the
points in the residual plot seem to be scattered fairly randomly about
zero, suggesting that, overall, the assumption that the Wi ’s have zero
mean and constant variance does seem reasonable. Additionally, the
residuals in the normal probability plot lie roughly along a straight
line, and so the assumption of normality of the residuals also seems
plausible.
A scatterplot of height and diameter, together with the two fitted
regression lines for the levels of side, was given in Figure 4. The two
regression lines seem to fit the associated data points for the two
levels of side fairly well, and so the assumption that the two
regression lines are parallel also seems to be reasonable.

We will finish this subsection with an activity checking the assumptions for
the parallel slopes model for the FIFA 19 dataset.

Activity 11 Checking the assumptions of the parallel slopes model for the FIFA 19 dataset
In Activity 4, we fitted the model
strength ∼ skillMoves + weight
using data from the FIFA 19 dataset. Figure 9 shows the resulting residual
plot and normal probability plot of the residuals.

Figure 9 Checking the parallel slopes model assumptions for the FIFA 19 dataset: (a) residual plot; (b) normal probability plot


(a) Do the plots given in Figure 9 suggest any problems with the
assumptions of the random terms in the parallel slopes model for
these data?
(b) By considering Figure 6, is it reasonable to assume that the regression
lines for the five levels of the factor skillMoves are parallel?

As we have seen, the parallel slopes model produces separate parallel
regression lines for the different levels of the factor. However, there will be
situations where, although each of the levels of the factor do require
different regression lines, it is not appropriate that these lines are parallel.
To accommodate such situations, we need to adapt the parallel slopes
model to form the non-parallel slopes model; this is the subject of the next
section. First, we’ll complete this section by modelling parallel slopes in R.

2.4 Using R to fit parallel slopes models


In this subsection, we’ll use the parallel slopes model in R. First, in
Notebook activity 4.1, we will explain how to do this using the parallel
slopes model specified in Activity 3 for modelling the manna ash trees
dataset. Then, in Notebook activity 4.2, we will use a parallel slopes model
for data from the wages dataset (first introduced in Unit 3).

Notebook activity 4.1 A parallel slopes model in R


This notebook explains how to use R for a parallel slopes model,
focusing on mannaAsh.

Notebook activity 4.2 A parallel slopes model for the wages dataset
In this notebook we’ll use R for a parallel slopes model using data
from wages.


3 Modelling using non-parallel slopes
In this section, we will consider once again the situation where we have the
response variable Y , and there are two explanatory variables: a covariate x
and a factor A with K levels. However, this time we will look at what
happens when the regression lines associated with the different levels of the
factor are not parallel, and so a parallel slopes model isn’t a good model
for the data. Such a situation is illustrated in the following example.

Example 4 Checking the parallel regression lines assumption for a model fitted to wages
Consider once again the wages dataset first introduced in Unit 3.
A parallel slopes model was fitted to data from this dataset in
Notebook activity 4.2, where the response variable is
• hourlyWageSqrt: the square root of the individual’s hourly wage
(in £)
and the explanatory variables are the covariate educAge and the
factor gender. We’ll now consider modelling the same response
(hourlyWageSqrt), but this time using the covariate
• workHrs: the average number of hours the individual works each
week
along with the factor
• gender: the gender the individual identifies with, taking the values
male and female.
We’ll start by fitting the parallel slopes model
hourlyWageSqrt ∼ gender + workHrs.
The fitted lines for the two levels of gender are shown in Figure 10.
From Figure 10, we can see that the slope of the two fitted parallel
lines is negative. However, the data themselves tell a different story.
To see this, we’ll first fit the model
hourlyWageSqrt ∼ workHrs
using only data for which gender takes the value male, and then we’ll
fit the same model using only data for which gender takes the value
female. The two resulting fitted lines are shown on the scatterplots of
hourlyWageSqrt and workHrs given in Figures 11(a) and 11(b), and
Figure 11(c) shows the two groups of data and fitted lines together on
the same plot.


Figure 10 Scatterplot of hourlyWageSqrt and workHrs from the wages dataset, together with the fitted regression lines for the two levels of gender after fitting a parallel slopes model

Figure 11(a) Scatterplot of hourlyWageSqrt and workHrs together with a fitted regression line, using only data for which gender takes the value male


Figure 11(b) Scatterplot of hourlyWageSqrt and workHrs together with a fitted regression line, using only data for which gender takes the value female

Figure 11(c) Showing Figures 11(a) and (b) on the same plot


It is clear from Figure 11(c) that the fitted lines for the two levels of
gender are not parallel. What’s more, there seems to be a negative
relationship between hourlyWageSqrt and workHrs for males, but a
positive relationship between hourlyWageSqrt and workHrs for
females. As such, the assumption that the fitted lines are parallel
(required for the parallel slopes model) doesn’t appear to hold.

Example 4 illustrated a situation in which we need to accommodate fitted
lines for the levels of a factor which are not parallel to one another. We
will develop such a model in the next subsection. We’ll then explain how
to test the model parameters in Subsection 3.2 and consider the
assumptions of the model in Subsection 3.3. Finally, in Subsection 3.4, the
model is implemented in R.

3.1 The non-parallel slopes model


In order to build a regression model suitable for modelling non-parallel
slopes such as those seen in Example 4, we will adapt the parallel slopes
model.
Suppose that the ith observation takes level k of factor A. The form of the
parallel slopes model for Yi is given by
Yi = (µ + αk ) + βxi + Wi ,   Wi ∼ N (0, σ 2 ),   (4)
(here µ + αk is the intercept and β the slope of the regression line)
where α1 is set to be zero. This model specifies a different intercept for the
regression lines associated with each of the K levels of factor A (that is, µ
for level 1 and µ + αk for levels k = 2, 3, . . . , K). The fact that the
regression coefficient, β, is the same for all K levels of A ensures that all of
the regression lines have the same slope and are therefore parallel.
Now, let’s consider the situation where the regression lines associated with
the different levels of the factor are not parallel.

Activity 12 What needs to change?

What needs to change in Model (4) in order for the slopes to not be
parallel?


Instead of using the single regression coefficient β for each xi , we now set
the slope for level 1 of A to be the regression coefficient β and then adjust
the slopes for levels k = 2, 3, . . . , K by adding an amount γk to β, where
γk = added effect on the regression slope when A takes level k,
in comparison to the regression slope when A takes level 1.
This allows the slopes to have different values and hence for the lines not
to be parallel.
The adjustment γk can be either positive (meaning that the slope for level
k is an increase on the slope for level 1) or negative (meaning that the
slope for level k is a reduction on the slope for level 1). This means that
the regression coefficient for x will not be the same for the different levels
of A, hence this will give different slopes for the regression lines for the
different levels of factor A.
So, Model (4) is adapted to become
Yi = (µ + αk ) + (β + γk ) xi + Wi ,   Wi ∼ N (0, σ 2 ),   (5)
(here µ + αk is the intercept and β + γk the slope of the regression line)
where α1 and γ1 are both set to be zero. (This is because the intercept and
slope of the line for level k are compared to the intercept and slope of the
line for level 1.)
Notice that, for level 1, we have exactly the same model as we have for the
parallel slopes model given in Model (4).
Because Model (5) can accommodate slopes which are not parallel for the
different levels of A, this model is known as the non-parallel slopes
model.
In the next activity, we will use the non-parallel slopes model for the wages
dataset considered in Example 4.

Activity 13 A fitted model for non-parallel slopes

A non-parallel slopes model (as given by Model (5)) was fitted to data in
the wages dataset, taking hourlyWageSqrt as the response (Y ), gender as
a factor (A) and workHrs as a covariate (x). Level 1 of gender was set to
be male, while level 2 was set to be female.
The following parameter estimates for the fitted model were obtained:
µ̂ = 4.575, β̂ = −0.0173, α̂2 = −1.816, γ̂2 = 0.0335.
(a) Write down the fitted line for individuals who are male and the fitted
line for individuals who are female.
(b) How does gender affect the slopes of the fitted lines?
(c) The value of workHrs for two of the individuals in the dataset is 35.
One of these individuals is male and the other is female. According to


the fitted non-parallel slopes model, what is the fitted value of
hourlyWageSqrt for each of these individuals? Hence what are the
fitted values of their respective hourly wages?

Figure 12 shows a scatterplot of the response hourlyWageSqrt and
covariate workHrs, together with the fitted regression lines for the two
levels of the factor gender presented in Activity 13. Notice how the fitted
lines now match more closely to the slopes fitted to the male and female
individuals separately that we saw earlier in Figures 11(a)–(c).

Figure 12 Scatterplot of hourlyWageSqrt and workHrs, together with the fitted regression lines for the two levels of gender (the line for males has slope β̂ = −0.017, while the line for females has slope β̂ + γ̂2 = 0.016)

Now, in Model (3) we used the indicator variables zi2 , zi3 , . . . , ziK
to express the parallel slopes model by a single equation. Similarly, for
i = 1, 2, . . . , n, we can use the same indicator variables to express the
model for non-parallel slopes given in Model (5) by an equation of the form
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK
+ (β + γ2 zi2 + γ3 zi3 + · · · + γK ziK )xi + Wi , (6)
where, for k = 2, 3, . . . , K,
zik = 1 if the ith observation takes level k of A, and 0 otherwise.

You will justify why we can do this in the next activity.


Activity 14 Equivalence of the two model forms

Explain why the model representation given in Model (6) reduces to


Yi = µ + βxi + Wi ,
when the ith observation takes level 1 of factor A, and reduces to
Yi = µ + αk + (β + γk )xi + Wi ,
when the ith observation takes level k = 2, 3, . . . , K of factor A.

Model (6) can be rearranged (for i = 1, 2, . . . , n) to the form


Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + βxi
+ γ2 (zi2 × xi ) + γ3 (zi3 × xi ) + · · · + γK (ziK × xi ) + Wi . (7)
Now, since zi2 , zi3 , . . . , ziK and xi are all numerical, then so are
(zi2 × xi ), (zi3 × xi ), . . . , (ziK × xi ). Therefore, by using indicator variables,
the non-parallel slopes model has been expressed as a multiple regression
model with covariates
zi2 , zi3 , . . . , ziK , xi , (zi2 × xi ), (zi3 × xi ), . . . , (ziK × xi ).
And, of course, we know how to do multiple regression!
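To make this concrete, here is a minimal R sketch of the construction for the wages data, where the factor gender has the two levels male (level 1) and female (level 2) and the covariate is workHrs; the variable names z2 and zx are just illustrative.

# Indicator variable zi2: 1 for level 2 (female), 0 for level 1 (male)
wages$z2 <- as.numeric(wages$gender == "female")

# Product covariate (zi2 x xi) carrying the interaction
wages$zx <- wages$z2 * wages$workHrs

# The non-parallel slopes model as an ordinary multiple regression
fit <- lm(hourlyWageSqrt ~ z2 + workHrs + zx, data = wages)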
When the value of γk is non-zero for at least one of k = 2, 3, . . . , K, there is
said to be an interaction between the covariate x and the factor A,
because the two explanatory variables interact with one another to affect
the response Y . The parameters γ2 , γ3 , . . . , γK are then the interaction
parameters.
It is important to note here that an interaction between x and A does not
mean that x affects A or that A affects x. Instead, it means that the effect
on the response of one of the explanatory variables depends on the value of
the other explanatory variable, as you will see in Example 5.

Example 5 The interaction between workHrs and gender


In Example 4 and Activity 13, we saw how the relationship between
the response hourlyWageSqrt and the covariate workHrs depends on
the value of the factor gender. As such, there is an interaction
between workHrs and gender.

We can extend the model notation used for the parallel slopes model to
include an interaction between x and A for a non-parallel slopes model.
Following the notation used in R, we will denote this interaction by A:x,
and then the non-parallel slopes model (as given in Model (5)) is denoted


by
Y ∼ A + x + A:x.
This is often written in abbreviated form as
Y ∼ A ∗ x.
The ‘∗’ between A and x simply tells us that both A and x are explanatory
variables in the model, and there is also an interaction between A and x.
The non-parallel slopes model is summarised in Box 5.

Box 5 Regression with one covariate and one factor: modelling non-parallel slopes
The model
Y ∼A∗x
for response Y , covariate x and factor A with K levels, has the form,
for i = 1, 2, . . . , n,
Yi = µ + α2 zi2 + α3 zi3 + · · · + αK ziK + βxi
+ γ2 (zi2 × xi ) + γ3 (zi3 × xi ) + · · · + γK (ziK × xi ) + Wi ,
where the Wi ’s are independent normal random variables with zero
mean and constant variance σ 2 , and zi2 , zi3 , . . . , ziK are indicator
variables so that, for k = 2, 3, . . . , K,
zik = 1 if the ith observation takes level k of A, and 0 otherwise.

So, when the ith observation takes level 1 of factor A, the model
becomes
Yi = µ + βxi + Wi , i = 1, 2, . . . , n,
and when the ith observation takes level k = 2, 3, . . . , K of factor A,
the model becomes
Yi = µ + αk + (β + γk )xi + Wi , i = 1, 2, . . . , n.

The parameters γ2 , γ3 , . . . , γK are the interaction parameters.


If γk is non-zero for at least one of k = 2, 3, . . . , K, then there is said
to be an interaction between x and A.
For this model, the fitted lines for the different levels of factor A are
not parallel to each other, and as a result, this model is called the
non-parallel slopes model.

The translation of the model terms into terms fitted in a multiple
regression model is illustrated in Figure 13, next.


Figure 13 Summary of the non-parallel slopes model: for response Y, the factor A with K levels gives the K − 1 indicator variables zi2, zi3, . . . , ziK, and the interaction A:x gives the product covariates (zi2 × xi), (zi3 × xi), . . . , (ziK × xi); the model Y ∼ A + x + A:x is then a multiple regression with covariates zi2, zi3, . . . , ziK, xi, (zi2 × xi), (zi3 × xi), . . . , (ziK × xi)
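In R this translation is automatic: the model formula mirrors the model notation. As a minimal sketch for the wages data, the two calls below specify exactly the same non-parallel slopes model.

fit1 <- lm(hourlyWageSqrt ~ gender + workHrs + gender:workHrs, data = wages)
fit2 <- lm(hourlyWageSqrt ~ gender * workHrs, data = wages)  # abbreviated form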

In the next activity, we will fit a non-parallel slopes model using data from
the FIFA 19 dataset.

Activity 15 Non-parallel slopes model for the FIFA 19 dataset
A parallel slopes model was fitted to data from the FIFA 19 dataset in
Example 1, taking strength as the response variable, weight as a
covariate and skillMoves as a factor. We’ll now consider modelling the
same response (strength), but this time using the covariate
• height: the player’s height, measured in inches to the nearest inch
and the factor
• preferredFoot: the foot preferred by each player, taking possible values
left (level 1) and right (level 2).
The model
strength ∼ preferredFoot ∗ height
was fitted to the data; the fitted regression equations for the two levels of
the factor preferredFoot are
strength = −86.2 + 2.22 height, for level 1 (left),
strength = −6.5 + 1.08 height, for level 2 (right).


(a) Given the fitted regression lines for the two levels of preferredFoot,
what are the values of the estimated parameters µ̂, β̂, α̂2 and γ̂2?
(b) How does preferredFoot affect the slopes of the fitted lines?
(c) The value of height for two of the football players in the dataset
is 75. One of these individuals is left-footed and the other is
right-footed. According to the fitted non-parallel slopes model, what
is the fitted value of strength for each of these individuals?

We now know how to include an interaction between A and x in the
model. So, now we need to investigate how to test whether the interaction
is in fact needed in the model. We will do this in the next subsection.

3.2 Testing for an interaction


In deciding whether or not an interaction should be included in a model,
attention focuses on testing the interaction parameters γ2 , γ3 , . . . , γK . If
there is no interaction between the covariate and the factor, then the
interaction parameters will all be zero. However, if there is an interaction,
then this means that at least one of the interaction parameters is non-zero.
In this case, we need to include all of the individual interaction parameters
in the model (in the same way that we have to include all of the level effect
parameters α2 , α3 , . . . , αK in the model if at least one of them is non-zero).
So, in order to test for an interaction, we need to test the hypotheses
H0 : γ2 = γ3 = · · · = γK = 0,
H1 : at least one of the interaction parameters is not zero.
To do this, we’ll build on the method used in Subsection 2.2 to test
whether or not the factor A should be included in addition to the
covariate x. Recall that we tested this by considering the difference
between the values of the RSS for the two nested models
Y ∼x and Y ∼ A + x.
We then concluded that A should be included in the model (in addition
to x) if the difference in the RSS values is large enough to indicate a
significant gain in fit when including A in addition to x. We will use this
same type of idea to test whether the interaction term A:x should be
included in the model, by considering the fits of the two models
Y ∼A+x (a parallel slopes model)
and
Y ∼ A + x + A:x (a non-parallel slopes model).
Notice that these two models are nested, and so we can compare the fits of
the models by considering the difference in their RSS values. We will
consider this approach in the next activity.


Activity 16 How does an interaction affect the difference?

The two models


Y ∼A+x and Y ∼ A + x + A:x
were fitted to some data and the values of the RSS were calculated for
each model.
If there is an interaction between A and x, would you expect the difference
in RSS values for these two models to be large or small?

As you have found in Activity 16, the size of difference in the RSS values
indicates if the interaction term A:x should be in the model. We can use
another ANOVA test (similar to that used in Subsection 2.2) to test
whether the difference in the RSS values is ‘large enough’ to suggest that
there is an interaction between A and x (meaning that the interaction term
should be included in the model). Once again, this is a form of F -test,
with a test statistic F and p-value calculated using an F -distribution.
The test is summarised in Box 6.

Box 6 Testing for an interaction A:x


To test the hypotheses
H0 : A:x should not be included in the model,
H1 : A:x should be included in the model,
for the model
Y ∼ A + x + A:x,
we use a test statistic based on the difference in the values of the RSS
for the two models
Y ∼A+x and Y ∼ A + x + A:x.
The test statistic, F , can be expressed as
F = (estimated decrease in unexplained variation due to including A:x) / (estimated variation not explained by Y ∼ A + x + A:x).
If the p-value is small, then we reject H0 and conclude that the
interaction term A:x should be included in the model in addition to A
and x.
If the p-value is not small, then there is not enough evidence to reject
H0 and so we conclude that the interaction term A:x need not be
included in the model together with A and x.
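As with the F -test of Box 3, this test can be obtained in R by passing the two nested fits to anova(); a minimal sketch for the wages data:

fit_parallel     <- lm(hourlyWageSqrt ~ gender + workHrs, data = wages)
fit_non_parallel <- lm(hourlyWageSqrt ~ gender * workHrs, data = wages)

# A small p-value suggests the interaction gender:workHrs is needed
anova(fit_parallel, fit_non_parallel)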


In the last subsection, we fitted non-parallel slopes models to data from
the wages and FIFA 19 datasets (in Activities 13 and 15, respectively). We
will investigate whether the interaction is required for each of these two
models in Activities 17 and 18.

Activity 17 Non-parallel or parallel slopes for modelling data from the wages dataset?
The following two models were fitted to data from the wages dataset:
hourlyWageSqrt ∼ gender + workHrs
and
hourlyWageSqrt ∼ gender + workHrs + gender:workHrs.
The p-value for an ANOVA test testing whether the interaction term
gender:workHrs should be included in the model was calculated to be less
than 0.001.
Do you think it was wise to use a non-parallel slopes model in Activity 13,
or could we have simply used the more parsimonious parallel slopes model?

Activity 18 Non-parallel or parallel slopes for modelling data from the FIFA 19 dataset?
The two models
strength ∼ preferredFoot + height
and
strength ∼ preferredFoot + height + preferredFoot:height
were fitted to data from the FIFA 19 dataset.
An ANOVA test was carried out for testing whether the interaction term
preferredFoot:height should be included in the model. The resulting
value of the test statistic was F = 4.4, and its associated p-value was 0.039.
Do you think it is better to use a non-parallel slopes model or a parallel
slopes model for modelling strength for these data?

Earlier in this unit, in Activity 3 (Subsection 2.1), a parallel slopes model
was used for modelling the response height with covariate diameter and
factor side from the manna ash trees dataset. However, when modelling
the response height with covariate diameter using data for each level of
side (east and west) separately, the fitted regression lines were not
parallel, as was shown in Figure 2 (Subsection 1.1). So, might the
non-parallel slopes model be better for these data? Or, is the difference in
the slopes small enough to suggest that the more parsimonious parallel
slopes model is preferable? We will investigate this in the next activity.


Activity 19 Is an interaction needed to model the manna ash trees data?
The following two models were fitted to data from the manna ash trees
dataset:
height ∼ side + diameter
and
height ∼ side + diameter + side:diameter.
A test was carried out for testing whether the interaction term
side:diameter should be included in the model. The resulting p-value for
this test was 0.256.
Do you think it is better to use a non-parallel slopes model or a parallel
slopes model for modelling height for these data?

Testing for an interaction is summarised in Figure 14.

Figure 14 Summary of testing for an interaction: for the model Y ∼ A + x + A:x, whether the interaction A:x is required is tested by an F -test (ANOVA test) comparing the RSS values of the models Y ∼ A + x and Y ∼ A + x + A:x

We’ll look at checking the assumptions of the non-parallel slopes model in the next subsection.


3.3 Checking the assumptions of the non-parallel slopes model
The assumptions of the non-parallel slopes model are the same as those for
the parallel slopes model given in Box 4, with the obvious exception that
there is no longer the assumption that the regression slopes for the
different levels of A are parallel!
In Activities 20 and 21, we will check the assumptions for the non-parallel
slopes models fitted (in Activities 13 and 15) to data from the wages and
FIFA 19 datasets.

Activity 20 Checking the assumptions of the non-parallel slopes model for the wages dataset
The non-parallel slopes model
hourlyWageSqrt ∼ gender ∗ workHrs
was fitted to data from the wages dataset in Activity 13. Figure 15 shows
the resulting residual plot and normal probability plot of the residuals.
Do the plots given in Figure 15 suggest any problems with the assumptions
of the non-parallel slopes model for these data?

Figure 15 Checking the non-parallel slopes model assumptions for the wages dataset: (a) residual plot; (b) normal probability plot


Activity 21 Checking the assumptions of the non-parallel slopes model for the FIFA 19 data
The non-parallel slopes model
strength ∼ preferredFoot ∗ height
was fitted in Activity 15 for data from the FIFA 19 dataset. Figure 16
shows the resulting residual plot and normal probability plot of the
residuals.
Do the plots given in Figure 16 suggest any problems with the assumptions
of the non-parallel slopes model for these data?

Figure 16 Checking the non-parallel slopes model assumptions for data from the FIFA 19 dataset: (a) residual plot; (b) normal probability plot

We’ll complete this section by fitting some non-parallel slopes using R.


3.4 Using R to fit non-parallel slopes models
In this subsection, R will be used to explore two non-parallel slopes
models. First, Notebook activity 4.3 considers the non-parallel slopes
model specified in Activity 13 (Subsection 3.1) for modelling the wages
dataset. Then, in Notebook activity 4.4, you will consider another
non-parallel slopes model using data from the wages dataset, this time
using different explanatory variables.

Notebook activity 4.3 Fitting a non-parallel slopes model in R
This notebook explains how to fit a non-parallel slopes model in R,
focusing on data from wages.

Notebook activity 4.4 Fitting another non-parallel slopes model in R
In this notebook you’ll use R to fit another non-parallel slopes model
using data from wages.
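As a taste of what these notebooks cover, here is a minimal sketch of fitting
the non-parallel slopes model from Activity 13, assuming the data frame is
called wages:

  # A minimal sketch, assuming a data frame called wages with variables
  # hourlyWageSqrt, gender (a factor) and workHrs.
  # The * operator expands to the main effects plus their interaction,
  # so these two calls fit the same non-parallel slopes model.
  fit1 <- lm(hourlyWageSqrt ~ gender * workHrs, data = wages)
  fit2 <- lm(hourlyWageSqrt ~ gender + workHrs + gender:workHrs,
             data = wages)
  summary(fit1)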

4 Regression with two factors that do not interact
In this section, we are still interested in modelling a response Y with two
explanatory variables, but this time both of the explanatory variables are
factors. We will denote the two factors by A and B, where A has K levels
and B has L levels.
We have seen in Section 3 that when a covariate x and factor A are used
together as the explanatory variables in a regression model, they can
interact with each other to affect the response Y . Likewise, when the
explanatory variables are two factors A and B, they can also interact with
each other to affect the response.
In this section, we will consider the simpler situation in which it is
assumed that there isn’t an interaction between A and B. The situation in
which A and B interact with each other to affect the response is the focus
of the next section.
We’ll start by specifying the general form for this model in Subsection 4.1,
then show how the model can be visualised in Subsection 4.2. We’ll then
consider how to test whether both factors should be in the model, in
Subsection 4.3, and introduce a plot which can be used to informally check
the assumption that there is no interaction between the two factors, in
Subsection 4.4. Finally, in Subsection 4.5 we’ll implement the model in R.


4.1 The model for two factors that do not interact
We’ll start by considering a dataset with a continuous response variable
and two possible factors; the dataset is described next.

Employment rates
As mentioned in Example 1 of Unit 1 (Subsection 2.1), the
Organisation for Economic Co-operation and Development (OECD) is
an international organisation which works with governments, policy
makers and citizens to find solutions to social, economic and
environmental challenges. The OECD website states that it has
‘helped to place issues relating to well-being and inequalities of
income and opportunities high on the international agenda’ and that
its goal ‘is to shape policies that foster prosperity, equality,
opportunity and well-being for all’ (OECD, 2020).
[Image: some members of a diverse and inclusive workforce]
The OECD collects a vast amount of data concerning many different
issues. The dataset we introduce here contains data on the 2019
employment rates in 37 countries for people educated to a Bachelors
degree or equivalent level, broken down by gender and age grouping.
The employment rates dataset (employmentRate)
Data are available for each of the 37 countries for the following
variables:
• country: the name of the country
• rate: the 2019 employment rate (as a percentage) for people
educated to a Bachelors degree or equivalent level
• gender: the gender the individual identifies with, taking the values
male and female
• age: the age groupings (in years), taking the values 25 to 34,
35 to 44, 45 to 54, and 55 to 64.
The first three and last three observations from this dataset are given
in Table 3. So, for example, the employment rate for women aged 25
to 34 in Australia was 82.5% in 2019.
Table 3 The first three and last three observations in employmentRate

country          rate   gender   age
Australia        82.5   female   25 to 34
Austria          78.0   female   25 to 34
Belgium          85.7   female   25 to 34
Turkey           53.8   male     55 to 64
United Kingdom   72.1   male     55 to 64
United States    76.8   male     55 to 64

Source: OECD, 2021


We wish to model a response Y using two factors A and B, and we are assuming in this section that there is no interaction between A and B.
Following our model notation convention, we can represent this model by
Y ∼ A + B.
In Example 6 you will see how this general formulation translates into a
model for the employment rates dataset.

Example 6 What do we want to model using the employment rates dataset?
For the employment rates dataset, suppose that we would like to take
the variable rate as our response and the factors gender and age as
two explanatory variables. We are therefore interested in the model
rate ∼ gender + age,
where gender has two levels (male and female) and age has four
levels (25 to 34, 35 to 44, 45 to 54, and 55 to 64).
(These levels for gender and age correspond to the levels given in the
employment rates dataset. To include other levels for gender and age,
we would need to obtain data for these other levels.)
[Image: a worker who may not be represented in the employment rates dataset]

To help us build the model
Y ∼ A + B,
let’s first think back to the model from Unit 3 where there was just the
single factor explanatory variable A, that is, the model
Y ∼ A.
For that model, if the ith observation takes level k of A, for
k = 1, 2, . . . , K, then our model for the response Yi can be expressed in the
following general form:
Yi = (baseline mean) + (effect of kth level of A) + (random term),   (8)

where
• the ‘baseline mean’ is the mean response when A takes level 1 (this was
denoted µ in Unit 3)
• the ‘effect of kth level of A’ is the effect on the mean response of the
kth level of A in comparison to the effect on the mean response of
level 1 of A (this was denoted αk in Unit 3)
• each ‘random term’ is a normal random variable with zero mean and
constant variance σ 2 (this was denoted Wi in Unit 3).

335
Unit 4 Multiple regression with both covariates and factors

Notice that, since each level effect term is the added effect on the response
in comparison to the effect of level 1, when the ith observation takes
level 1 of A, the level effect term is simply zero. Model (8) then reduces to
Yi = (baseline mean) + (random term).   (9)
We can extend and adapt this general model form for Y ∼ A to the
situation where we instead have two factors A and B. This time, each
observation will take one level of A and one level of B. So, if the ith
observation takes level k of A and level l of B, for k = 1, 2, . . . , K,
l = 1, 2, . . . , L, then we will write our model for the response Yi in the
following general form:
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B) + (random term),   (10)

where, this time,


• the ‘baseline mean’ is the mean response when A takes level 1 and B
takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of the
kth level of A in comparison to the effect on the mean response of
level 1 of A (after controlling for B)
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of level 1
of B (after controlling for A)
• each ‘random term’ is a normal random variable with zero mean and
constant variance σ 2 .
Notice that the baseline mean now considers both A and B, and that we
also have an added effect term for B.
You have already seen how the model for only one factor, Model (8),
reduces to Model (9) when A takes level 1. In Activity 22 you will consider
how the general form given by Model (10) reduces when one or both of A
and B take level 1.

Activity 22 Model when A and/or B are level 1

Write down what the general form given in Model (10) reduces to for each
of the following situations.
(a) Factors A and B both take level 1.
(b) Factor A takes level 1 and factor B takes level l, where l = 2, 3, . . . , L.
(c) Factor B takes level 1 and factor A takes level k, where
k = 2, 3, . . . , K.


In Activity 23, you will see the general form given in Model (10) in action
for the model
rate ∼ gender + age
using data from the employment rates dataset.

Activity 23 Fitting a model for employment rates

The model
rate ∼ gender + age
was fitted to data from the employment rates dataset, taking level 1 of
gender to be male and level 2 to be female, and taking levels 1, 2, 3 and 4
of age to be, respectively, 25 to 34, 35 to 44, 45 to 54 and 55 to 64.
The output from fitting the model is given in Table 4.
Table 4 Output produced from fitting the model rate ∼ gender + age

Parameter       Estimate   Standard error   t-value   p-value
Baseline mean     87.777        1.005        87.324   < 0.001
gender female     −8.068        0.902        −8.949   < 0.001
age 35 to 44       5.332        1.271         4.197   < 0.001
age 45 to 54       5.657        1.271         4.452   < 0.001
age 55 to 64     −11.340        1.279        −8.863   < 0.001

Suppose that a researcher is interested in the employment rates for different gender and age groups for one of the countries in the dataset.
(a) Use the fitted model (rounding all estimates to one decimal place) to
calculate the following.
(i) The fitted employment rate for males aged 55 to 64 in the
country.
(ii) The fitted employment rate for females aged 25 to 34 in the
country.
(iii) The fitted employment rate for males aged 35 to 44 in the
country.
(iv) The fitted employment rate for females aged 45 to 54 in the
country.
(b) According to the fitted model, what effect does gender have on a
country’s employment rate, after controlling for age?
(c) According to the fitted model, what effect does age have on a
country’s employment rate, after controlling for gender?


We have already seen in this unit and Unit 3 that, when there is a factor,
indicator random variables can be used to express the model as a single
equation (as is done, for example, in Model (2), Subsection 1.2). It will
therefore probably come as no surprise to you to learn that when there are
two factors, we can simply use a set of indicator variables for each of the
factors so that the model with two factors is equivalent to a multiple
regression model with (K − 1) + (L − 1) indicator variables as covariates.
The model with two factors (and no interaction) is summarised in Box 7.

Box 7 Regression with two factors that do not interact


Consider the model
Y ∼A+B
for response Y , factor A (with K levels) and factor B (with L levels).
When the ith observation takes level k of A and level l of B, the
model for Yi then has the general form
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B) + (random term),

where
• the ‘baseline mean’ is the mean response when A takes level 1 and
B takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of
the kth level of A in comparison to the effect on the mean response
of level 1 of A (after controlling for B)
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of
level 1 of B (after controlling for A)
• each ‘random term’ is a normal random variable with zero mean
and constant variance σ 2 .
By using indicator variables for each factor, the model is equivalent to
a multiple regression model with (K − 1) + (L − 1) indicator variables
as covariates.
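In R, the model in Box 7 can be fitted directly with lm(); here is a minimal
sketch, assuming the employment rates data are in a data frame called
employmentRate, with gender and age stored as factors:

  # A minimal sketch of fitting Y ~ A + B, assuming a data frame called
  # employmentRate in which rate is numeric and gender and age are factors.
  fit <- lm(rate ~ gender + age, data = employmentRate)
  summary(fit)
  # R creates the (K - 1) + (L - 1) indicator variables automatically;
  # model.matrix() shows the columns actually used in the regression.
  head(model.matrix(fit))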

The translation of the model terms for regression with two factors into
terms fitted in a multiple regression model is illustrated in Figure 17.


[Figure 17 is a diagram: the response Y, factor A (with K levels) and
factor B (with L levels) lead to the model Y ∼ A + B. Defining K − 1
indicator variables for A and L − 1 indicator variables for B gives a
multiple regression with (K − 1) + (L − 1) indicator variables as covariates.]
Figure 17 Summary of regression with two factors

4.2 Visualising the model


In Sections 2 and 3 you have seen how regression models with one
covariate and one factor can be visualised as parallel and non-parallel lines
on a scatterplot. In this subsection we will consider the visualisation of
regression models with two factors.
For response Y , factor A (with K levels) and factor B (with L levels), the
model Y ∼ A + B produces K × L fitted mean responses (since there is a
fitted mean response for each of the K × L possible combinations of level k
of A and level l of B).
To help us to visualise the model, we will use two plots of the fitted mean
responses for the model rate ∼ gender + age, which was fitted to data
from the employment rates dataset in Activity 23 (Subsection 4.1). As a
reminder, gender has two levels (with male taken to be level 1) and age
has four levels (with 25 to 34 taken to be level 1).
The first plot is shown in Figure 18(a). In this plot, the fitted mean
responses for rate for the two levels of gender are plotted. Since there are
four levels of age, there are four fitted mean responses plotted for each
level of gender. The fitted mean responses for the same level of age have
then been joined by a line to help visualise how the fitted mean responses
for the levels of age differ between the two levels of gender.
The second plot is shown in Figure 18(b). This plot swaps the factors
around and plots the fitted mean responses for rate for the four levels of
age. This time, since there are two levels of gender, there are only two
fitted mean responses plotted for each of the four levels of age. Again, to
help visualise how the fitted mean responses differ for the levels of gender
across the four levels of age, the fitted mean responses for the same level
of gender have been joined by lines.


[Figure 18 shows two plots of the fitted mean responses for employment
rate (%). Panel (a) has gender on the horizontal axis, with the fitted mean
responses for each level of age joined by a line; panel (b) has age on the
horizontal axis, with a line for each level of gender. In each panel, the
fitted mean response µ when gender and age both take level 1 is labelled,
together with the effects of the other levels (for example, the effect of
level 2 (female) of gender).]
Figure 18 Visualisation of rate ∼ gender + age, with (a) gender on the horizontal axis and separate lines for the fitted mean responses for levels of age; (b) age on the horizontal axis and separate lines for the fitted mean responses for levels of gender
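The K × L fitted mean responses plotted in Figure 18 can be computed with
predict(); here is a minimal sketch, assuming the fitted model object fit
from the model rate ∼ gender + age:

  # A minimal sketch, assuming fit <- lm(rate ~ gender + age,
  # data = employmentRate), with gender and age stored as factors.
  grid <- expand.grid(gender = levels(employmentRate$gender),
                      age    = levels(employmentRate$age))
  grid$fitted <- predict(fit, newdata = grid)
  grid  # the 2 x 4 = 8 fitted mean responses plotted in Figure 18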

Activity 24 What do you notice?

Figure 18 shows visual representations of the fitted model for the response
rate with the two factors gender and age as explanatory variables. The
fitted model assumed that there was no interaction between the two
factors.
Now consider the visualisation of the fitted model for the parallel slopes
model for the manna ash trees dataset shown in Figure 4 (Subsection 2.1).
What do the fitted lines in Figure 4 have in common with the lines joining
the fitted mean responses in Figure 18?

The lines joining the fitted mean responses in both of the plots in
Figure 18 are parallel for the different levels of the second factor. The fact
that the lines are parallel means that the effect of each factor on the
response is the same across the different levels of the other factor, as
summarised in Box 8. For example, in Figure 18(a), the differences
between the employment rates for the different age groups are the same for
each gender, and likewise, in Figure 18(b), the difference between the male
and female employment rates is the same for each of the four age groups.

Box 8 Fitted mean responses for model with two factors that do not interact
When the two explanatory variables are both factors, and we assume
there is no interaction between these two factors, then the lines
joining the fitted mean responses associated with the levels of each
factor are parallel to each other. As such, the effect of each factor on
the response is the same across the different levels of the other factor.
[Image: putting parallel lines into perspective]

4.3 Testing whether both factors are required
Once we have fitted a model for two factors that do not interact, we can
then test whether or not both factors should be in the model.
Recall that in Subsection 2.2 we tested whether a factor A should be in a
model in addition to a covariate x by carrying out an ANOVA test
comparing the values of the RSS for the two nested models
Y ∼x and Y ∼ A + x.
If the difference between the two RSS values is large enough, then that
suggests that adding A into the model has significantly reduced the
amount of unexplained response variation, and so A should be included in
the model (because we’ve significantly increased the fit by including A in
addition to x). We can use the same ANOVA testing method here to test


whether both factors A and B are required in the model. For this we need
two nested models, the identification of which you will consider in the next
activity.

Activity 25 Which nested models to compare?

The model
Y ∼A+B
has been fitted to some data and we would like to decide whether both A
and B should be included in the model.
(a) For which two nested models could we compare RSS values in order
to test whether A should be included in the model in addition to B?
(b) For which two nested models could we compare RSS values in order
to test whether B should be included in the model in addition to A?

The method for testing whether both factors A and B should be in the
model is summarised in Box 9.

Box 9 Testing whether both factors should be in the model
To test whether both A and B should be included in the model
Y ∼ A + B,
we need to carry out two ANOVA tests.
Test whether A should be included in addition to B
A test statistic is used which is based on the difference in the values of
the RSS for the two nested models
Y ∼B and Y ∼ A + B.
The test statistic, F, can be expressed as
F = (estimated decrease in unexplained variation due to including A) / (estimated variation not explained by Y ∼ A + B).
If the p-value is small, then we conclude that A should be included in
the model in addition to B.
If the p-value is not small, then we conclude that A need not be
included in the model in addition to B.


Test whether B should be included in addition to A


A test statistic is used which is based on the difference in the values of
the RSS for the two nested models
Y ∼A and Y ∼ A + B.
The test statistic, F, can be expressed as
F = (estimated decrease in unexplained variation due to including B) / (estimated variation not explained by Y ∼ A + B).
If the p-value is small, then we conclude that B should be included in
the model in addition to A.
If the p-value is not small, then we conclude that B need not be
included in the model in addition to A.
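Here is a minimal sketch of how these two tests could be carried out in R,
again assuming a data frame called employmentRate:

  # Fit the three models needed for the two ANOVA tests in Box 9.
  fit.A  <- lm(rate ~ gender, data = employmentRate)
  fit.B  <- lm(rate ~ age, data = employmentRate)
  fit.AB <- lm(rate ~ gender + age, data = employmentRate)
  anova(fit.B, fit.AB)  # is gender required in addition to age?
  anova(fit.A, fit.AB)  # is age required in addition to gender?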

In the final activity of this subsection, we will use these tests to see
whether both gender and age are needed in the model for rate using data
from the employment rates dataset.

Activity 26 Are gender and age both needed?


The three models rate ∼ gender, rate ∼ age and rate ∼ gender + age
were fitted to data from the employment rates dataset.
An ANOVA test compared the RSS values of the two models
rate ∼ gender and rate ∼ gender + age.
A second ANOVA test compared the RSS values of the two models
rate ∼ age and rate ∼ gender + age.
Both ANOVA tests had a resulting p-value of less than 0.001. Given these
results, should gender and age both be included in the model?

Testing whether both factors are required in the model is summarised in Figure 19.
[Figure 19 is a flowchart. For the model Y ∼ A + B, the question 'Is A
required in addition to B?' is answered by an ANOVA test comparing the
RSS values of the models Y ∼ B and Y ∼ A + B, and the question 'Is B
required in addition to A?' by an ANOVA test comparing the RSS values
of the models Y ∼ A and Y ∼ A + B.]
Figure 19 Summary of testing whether to include both factors A and B



4.4 Informally checking the assumption of no interaction
Since the model Y ∼ A + B can be expressed as a multiple regression
model (with indicator variables as covariates), the model assumptions are
once again those underlying multiple regression, except with the added
assumption that there is no interaction between A and B. The multiple
regression assumptions can be checked using the usual diagnostic plots,
but we need a new plot – a means plot – to help us to informally check the
assumption of no interaction.
In a means plot, the sample mean responses for each level k of A
and l of B, for k = 1, 2, . . . , K, l = 1, 2, . . . , L, are plotted against the levels
of one of the factors. The sample mean responses for the same level of the
other factor are then joined together by lines. Basically, a means plot is a
sample version of the plots shown in Figure 18 (Subsection 4.2).
Now, from Box 8 (Subsection 4.2), we know that, when there is no
interaction between A and B in the model, the lines joining the fitted
mean responses associated with the levels of each factor are parallel to
each other. So, if the lines joining the sample mean responses in a means
plot show a similar pattern and are also roughly parallel, then this would
suggest that the model without an interaction appears to be reasonable –
that is, this would suggest that it is reasonable to assume that there is no
interaction between A and B. (We will introduce a formal test for this in
Subsection 5.2.)
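A means plot can be produced directly with base R's interaction.plot()
function; here is a minimal sketch, again assuming the data frame
employmentRate:

  # A minimal sketch of a means plot, assuming a data frame called
  # employmentRate with gender and age stored as factors.
  with(employmentRate,
       interaction.plot(x.factor = gender, trace.factor = age,
                        response = rate,
                        ylab = "Employment rate (%)"))
  # Swapping x.factor and trace.factor gives the companion plot with
  # age on the horizontal axis.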
The means plots for the employment rates dataset are given and discussed
in the next example.

Example 7 The means plots for the employment rates dataset
The means plots for rate across the factors gender and age for the
employment rates data (Subsection 4.1) are given in Figure 20. In
each plot, the lines are roughly parallel to each other, and they also
look similar to the plots visualising the fitted model given in Figure 18
(Subsection 4.2). Therefore, the assumption that there is no
interaction between gender and age seems reasonable.


[Figure 20 shows two means plots of employment rate (%): panel (a) has
gender on the horizontal axis with a line for each age group; panel (b)
has age on the horizontal axis with a line for each gender. In each panel
the lines are roughly parallel.]
Figure 20 Means plots for rate from the employment rates dataset, with: (a) gender on the horizontal axis and separate lines for levels of age; (b) age on the horizontal axis and separate lines for levels of gender


We’ll now use the model Y ∼ A + B for data concerning an experiment investigating the effect of different protein-based diets on rats. The dataset is described first, before we consider a model for the data in Activity 27.

Rats and protein


An experiment was carried out to investigate the effect of different
protein-based diets on rats. The gain in weight was measured after
the type and amount of protein were varied in the rats’ diets.
The rats and protein dataset (ratProtein)
This dataset includes data about 40 rats for the following variables:
• gain: the weight gain of the rat in grams (g)
• source: the source of protein, taking the values beef or cereal
• amount: the amount of protein, taking the values high or low.
The first three and last three observations of this dataset are given in
Table 5. So, for example, the first rat gained 90 g on a low beef
protein diet.
Table 5 The first three and last three observations in ratProtein

gain   source   amount
90     beef     low
76     beef     low
90     beef     low
77     cereal   high
86     cereal   high
92     cereal   high

Source: Snedecor and Cochran, 1967

Activity 27 Model for the rats and protein dataset assuming no interaction
The model
gain ∼ source + amount
was fitted to data from the rats and protein dataset, taking beef as level 1
of source, cereal as level 2, and taking high as level 1 of amount, low as
level 2.
The means plots for the data are given in Figure 21. Explain why this
model may not be suitable for these data.


[Figure 21 shows two means plots of weight gain (g): panel (a) has source
of protein (beef, cereal) on the horizontal axis with a line for each amount
of protein (high, low); panel (b) has amount of protein on the horizontal
axis with a line for each source of protein. In each panel the lines are far
from parallel.]
Figure 21 Means plots for gain from the rats and protein dataset, with: (a) source on the horizontal axis and separate lines for levels of amount; (b) amount on the horizontal axis and separate lines for levels of source

All the models we have considered so far in Section 4 can be visualised as having parallel lines in the type of plot introduced but, as you have seen in Activity 27, such models may not be suitable for all data.
In Section 3 when there was one covariate and one factor as the
explanatory variables, we accommodated the need for non-parallel fitted
regression lines by introducing an interaction term into the model.
Likewise, when both explanatory variables are factors, we can introduce an
interaction term in order to produce lines joining the fitted mean responses
which aren’t parallel for the different factor levels. So, if the lines in a
means plot are clearly not parallel, then this indicates that we may need to
introduce an interaction term into the model.
How means plots can be used is summarised in Box 10. Note that this
makes reference to a model you will find out about in the next section.

Box 10 Means plots


Means plots can be used as an informal guide for identifying whether
or not there may be an interaction between A and B:
• If the lines in the means plots are roughly parallel to each other,
this suggests there is not an interaction. In this case, the model
from Subsection 4.1 and summarised in Box 7 may be appropriate.
• If the lines in the means plots are clearly not parallel to each other,
this suggests there may be an interaction. In this case, the model
that will be described in Subsection 5.1 may be appropriate.


4.5 Using R for regression with two factors that do not interact
In this subsection, we’ll use R for regression with two factors. To start off,
Notebook activity 4.5 explains how to fit a regression model with two
factors in R, and how to use R to test whether both factors are required in
the model. The notebook focuses in particular on the model considered in
Activity 23 (Subsection 4.1), using data from the employment rates
dataset.
Subsection 4.4 introduced means plots for informally checking that the
assumption of no interaction between the two factors is reasonable.
Notebook activity 4.6 explains how we can use R to produce means plots,
focusing in particular on the means plots presented in Activity 27
(Subsection 4.4) for the data in the rats and protein dataset.
Finally, in Notebook activity 4.7, you’ll consider a regression model with
two factors using data from the wages dataset.

Notebook activity 4.5 Regression with two factors in R


This notebook explains how to use R for regression with two factors,
focusing in particular on data from employmentRate.

Notebook activity 4.6 Means plots in R


This notebook explains how to use R to obtain means plots, using
data from ratProtein.

Notebook activity 4.7 A regression model with two factors for the wages dataset
This notebook uses a regression model with two factors using data
from wages.

5 Regression with two factors that interact
In Activity 27 (Subsection 4.4), the means plots associated with the rats
and protein dataset suggested that there may be an interaction between
the two factors source and amount. In this section, we will consider how
we can accommodate such an interaction between the two factors in the
model.
The general form for the model which allows for an interaction between
factors A and B is specified in Subsection 5.1, while Subsection 5.2


discusses how to test whether the interaction is required in the model. We round off the section by using the model in R in Subsection 5.3.

5.1 Including an interaction


Once again consider the situation in which we have a response Y and two
factors A and B as explanatory variables. If we believe that A and B
interact with each other to affect the response Y , then we can
accommodate this by adding an interaction term into the general form
given in Model (10) in Subsection 4.1, so that when the ith observation
takes level k for factor A and level l for factor B, a model for Yi has the
form
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
     + (interaction effect of kth level of A and lth level of B) + (random term),   (11)

where, this time,


• the ‘baseline mean’ is still the mean response when A takes level 1 and
B takes level 1
• the ‘effect of kth level of A’ is now the effect on the mean response of
the kth level of A in comparison to the effect on the mean response of
level 1 of A, when B takes level 1
• the ‘effect of lth level of B’ is now the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of level 1
of B, when A takes level 1
• the ‘interaction effect of kth level of A and lth level of B’ is the added
effect on the mean response of the interaction between the kth level of A
and the lth level of B
• each ‘random term’ is a normal random variable with zero mean and
constant variance σ 2 .
Notice that, when we have an interaction term in the model, the effect of
the kth level of A assumes that B takes level 1, and likewise, the effect of
the lth level of B assumes that A takes level 1. The interaction effect of
the kth level of A and the lth level of B is then the added effect of the
interaction between A and B (in the same way that αk in the models of
Unit 3 is the additional effect on the response of level k of factor A on top
of the effect on the response of level 1 of A). As a result, if either A or B
takes level 1, then the associated interaction term is simply zero (since the
individual effect terms have already accounted for the other factor taking
level 1).
Following on from the notation introduced in Subsection 3.1 (see Box 5)


for models with an interaction, we can denote this model by
Y ∼ A + B + A:B,
or, more simply, as
Y ∼ A ∗ B.
Introducing an interaction into the model means that the effect of one of
the factors on the response is no longer the same across all of the levels of
the other factor. This is illustrated in Figure 22, which gives a visual
representation of the model
Y ∼ A + B + A:B
for two factors A and B each with two levels. In Figure 22, we can clearly
see that the effect of factor B on the response is not the same for the two
levels of factor A.

[Figure 22 shows the four fitted mean responses for two factors A and B,
each with two levels, with factor A on the horizontal axis and a line
joining the fitted mean responses for each level of B. The baseline mean,
the effect of level 2 of A, the effect of level 2 of B (at each level of A)
and the interaction effect of level 2 of A and level 2 of B are labelled;
the two lines are not parallel.]
Figure 22 Visualisation of the model Y ∼ A ∗ B

In the next example and activity, we’ll see the general form given in
Model (11) in action for modelling data from the rats and protein dataset.


Example 8 Model (with interaction) for the rats and protein dataset
The model
gain ∼ source + amount + source:amount
was fitted to data from the rats and protein dataset, taking beef as
level 1 of source, cereal as level 2, and taking high as level 1 of
amount, low as level 2.
Since source and amount each have two levels, there is only one
individual interaction term in the model – the interaction between
level 2 of source (cereal) and level 2 of amount (low). This is because
• the interaction effect of level 1 of source (beef) and level 1 of
amount (high) is already accounted for in the baseline mean
• the interaction effect of level 1 of source (beef) and level 2 of
amount (low) is already accounted for in the effect of level 2 of
amount (low)
• the interaction effect of level 2 of source (cereal) and level 1 of
amount (high) is already accounted for in the effect of level 2 of
source (cereal).
To illustrate, the first observation in the rats and protein dataset is
for a rat fed with a low beef protein. Therefore, for this observation,
source takes level 1 and amount takes level 2. Since source takes
level 1, there isn’t an added interaction term in the model nor an
individual effect term for source.
The model form for this rat is
Y1 = (baseline mean) + (effect of low amount) + (random term).

As another example, the last observation (out of 40 observations) is for a rat fed with a high cereal protein. Therefore, for this
observation, source takes level 2 and amount takes level 1. Again, one
of the factors takes level 1, and so there isn’t an added interaction
term in the model nor an individual effect term (for amount).
The model form for this rat is
Y40 = (baseline mean) + (effect of cereal source) + (random term).

On the other hand, the 21st rat in the dataset was fed a low cereal
protein diet and so takes level 2 for both source and amount.
Therefore, this time, there is an added interaction term in the model.
The model form for this rat is given next.


     
Y21 = (baseline mean) + (effect of cereal source) + (effect of low amount)
      + (interaction effect of cereal source and low amount) + (random term).

Activity 28 Some fitted values for the rats and protein dataset
The model
gain ∼ source + amount + source:amount
was fitted to data from the rats and protein dataset, taking beef as level 1
of source, cereal as level 2, and taking high as level 1 of amount, low as
level 2. The output for the fitted model is given in Table 6.
Table 6 Output produced for the model gain ∼ source ∗ amount

Parameter                   Estimate   Standard error   t-value   p-value
Baseline mean                 100.0        4.729         21.148   < 0.001
source cereal                 −14.1        6.687         −2.109     0.042
amount low                    −20.8        6.687         −3.110     0.004
source cereal, amount low      18.8        9.457          1.988     0.054

(a) The first rat in the dataset was fed a low beef protein diet. What will
be the fitted value ŷ1 for this rat?
(b) The 11th rat in the dataset was fed a high beef protein diet. What
will be the fitted value ŷ11 for this rat?
(c) The 21st rat in the dataset was fed a low cereal protein diet. What
will be the fitted value ŷ21 for this rat?
(d) The 31st rat in the dataset was fed a high cereal protein diet. What
will be the fitted value ŷ31 for this rat?

Recall from Section 3 (and in particular, Model (7)) that when we have
one factor and one covariate for the explanatory variables, and there is an
interaction between the two explanatory variables, we can express our
model as a multiple regression model with the covariates
zi2 , zi3 , . . . , ziK , xi , (zi2 × xi ), (zi3 × xi ), . . . , (ziK × xi ),
where zi2 , zi3 , . . . , ziK are indicator variables associated with the factor
and xi is the covariate.


Now, in this section where we instead have two factors, a set of indicator
variables is used for each of the factors. So, following the ideas from
Section 3, for factor A with K levels and factor B with L levels, the model
Y ∼ A + B + A:B
can be expressed as a multiple regression model with the following
covariates:
• K − 1 indicator variables for factor A
• L − 1 indicator variables for factor B
• (K − 1) × (L − 1) indicator variables associated with forming the
product of the kth indicator variable for A with the lth indicator
variable for B, for k = 2, 3, . . . , K, l = 2, 3, . . . , L.
The model with two factors and an interaction is summarised in Box 11.

Box 11 Regression with two factors and an interaction


Consider the model
Y ∼ A ∗ B, or equivalently, Y ∼ A + B + A:B,
for response Y , factor A (with K levels) and factor B (with L levels).
When the ith observation takes level k of A and level l of B, the
model for Yi has the general form
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
     + (interaction effect of kth level of A and lth level of B) + (random term),

where
• the ‘baseline mean’ is the mean response when A takes level 1 and
B takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of
the kth level of A in comparison to the effect on the mean response
of level 1 of A, when B takes level 1
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of
level 1 of B, when A takes level 1
• the ‘interaction effect of kth level of A and lth level of B’ is the
added effect on the mean response of the interaction between the
kth level of A and the lth level of B
• each ‘random term’ is a normal random variable with zero mean
and constant variance σ 2 .


By using indicator variables for each factor, the model is equivalent to a multiple regression model with (K − 1) + (L − 1) + (K − 1)(L − 1)
indicator variables as covariates. (As such, the model has the usual
multiple regression model assumptions.)

The translation of the model terms into terms fitted in a multiple regression model is illustrated in Figure 23.

[Figure 23 is a diagram: the response Y, factor A (with K levels), factor B
(with L levels) and the interaction A:B lead to the model Y ∼ A + B + A:B.
Defining K − 1 indicator variables for A, L − 1 indicator variables for B,
and the (K − 1) × (L − 1) products of the indicator variables for A with the
indicator variables for B gives a multiple regression with
(K − 1) + (L − 1) + (K − 1)(L − 1) indicator variables as covariates.]
Figure 23 Summary of regression with two factors and an interaction
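The indicator variables summarised in Figure 23 can be inspected by applying
model.matrix() to the fitted model; here is a minimal sketch, assuming the
rats and protein data are in a data frame called ratProtein:

  # A minimal sketch, assuming a data frame called ratProtein in which
  # gain is numeric and source and amount are factors.
  fit <- lm(gain ~ source * amount, data = ratProtein)
  # With K = L = 2, the columns are: the intercept, one indicator for
  # source, one indicator for amount, and their product (the interaction).
  head(model.matrix(fit))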

5.2 Testing for an interaction


In Subsection 4.4, Activity 27 considered the means plots (in Figure 21)
for data from the rats and protein dataset. Because the lines in the two
means plots suggested that there may be an interaction between the two
factors source and amount, we concluded that the model
gain ∼ source + amount
may not be suitable for the data.
So, in Activity 28 (Subsection 5.1), the model
gain ∼ source + amount + source:amount
was fitted to the data.


But do we need to include the interaction in the model?


In order to test whether the interaction should be included in the model,
we will once again use the ANOVA testing method (used in
Subsections 2.2, 3.2 and 4.3) which compares the RSS values, and hence
the fits, of two nested models. In the next activity you will consider which
two models these two nested models are.

Activity 29 Testing for an interaction – which two models should we compare?
Suggest two nested models for which we could compare RSS values in
order to test whether the interaction A:B should be included in the model
in addition to both A and B.
What would a large difference in RSS between these models suggest? Why?

The method for testing whether an interaction should be included in a model in addition to the factors A and B is given in Box 12.

Box 12 Testing for an interaction between two factors


To test whether the interaction A:B should be included in the model
Y ∼ A + B + A:B
in addition to both A and B, use an ANOVA test statistic based on
the difference in the values of the RSS for the two nested models
Y ∼A+B and Y ∼ A + B + A:B.
The test statistic, F, can be expressed as
F = (estimated decrease in unexplained variation due to including A:B) / (estimated variation not explained by Y ∼ A + B + A:B).
If the p-value is small, then conclude that A:B should be included in
the model in addition to A and B.
If the p-value is not small, then conclude that A:B need not be
included in the model in addition to A and B.
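In R, the test in Box 12 is again a single call to anova(); here is a minimal
sketch, assuming the data frame ratProtein:

  # A minimal sketch of the F-test in Box 12, assuming a data frame
  # called ratProtein with factors source and amount.
  fit.main <- lm(gain ~ source + amount, data = ratProtein)
  fit.int  <- lm(gain ~ source + amount + source:amount, data = ratProtein)
  anova(fit.main, fit.int)  # F-test for the source:amount interaction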

The method given in Box 12 is summarised in Figure 24, which follows.


[Figure 24 is a flowchart. For the model Y ∼ A + B + A:B, the question
'Is the interaction A:B required?' is answered by an ANOVA test comparing
the RSS values of the models Y ∼ A + B and Y ∼ A + B + A:B.]
Figure 24 Summary of testing for an interaction between two factors

In the next activity, we’ll look at testing whether the interaction source:amount is required when modelling the data from the rats and protein dataset.

Activity 30 Is an interaction required for the rats and protein dataset?
Two models were fitted to data from the rats and protein dataset:
gain ∼ source + amount
and
gain ∼ source + amount + source:amount.
An ANOVA test was carried out to compare the RSS values of these two
models. The resulting p-value for this test was 0.054.
For a model with both factors source and amount, should an interaction
term also be included in the model?

The model including an interaction for the rats and protein dataset
discussed in Activity 30 illustrates an important point: even though a
means plot may suggest that there is an interaction between two factors, it
may turn out that the interaction isn’t actually judged to be significant.
When this happens, the more parsimonious model without an interaction
is preferable.
In the next activity, we will revisit the wages dataset and fit a model for
the response hourlyWageSqrt using two of the factors in the dataset as
explanatory variables.


Activity 31 A model for the wages dataset using two factors
Suppose we wish to model the response hourlyWageSqrt (the square root
of the individual’s hourly wage, in £) from the wages dataset using two
factors as explanatory variables:
• gender: the gender the individual identifies with, taking the values male
and female
• computer: whether the individual has access to a computer at home,
taking the values yes and no.
Here, and in the rest of this unit, we will take level 1 of gender to be male
and level 1 of computer to be yes.
The means plots across the factors gender and computer are given in
Figure 25 (and also in Notebook activity 4.7), and output produced after
fitting the model
hourlyWageSqrt ∼ gender ∗ computer
is given in Table 7.

[Figure 25 shows two means plots of hourlyWageSqrt: panel (a) has gender
on the horizontal axis with a line for each level of computer; panel (b)
has computer on the horizontal axis with a line for each level of gender.
In each panel the lines are not parallel.]
Figure 25 Means plots for hourlyWageSqrt from the wages dataset, with (a) gender on the horizontal axis and separate lines for levels of computer; (b) computer on the horizontal axis and separate lines for levels of gender

Table 7 Parameter estimates for hourlyWageSqrt ∼ gender ∗ computer

Parameter                    Estimate   Standard error   t-value   p-value
Baseline mean                   4.073        0.031       129.441   < 0.001
gender female                  −0.820        0.097        −8.486   < 0.001
computer no                    −0.384        0.040        −9.546   < 0.001
gender female, computer no      0.391        0.109         3.592   < 0.001


(a) Explain why the means plots given in Figure 25 suggest that a model
for hourlyWageSqrt using gender and computer as the explanatory
variables may need to include their interaction.
(b) An ANOVA test was carried out to compare the RSS values of the
two models
hourlyWageSqrt ∼ gender + computer
and
hourlyWageSqrt ∼ gender + computer + gender:computer.
The resulting p-value was less than 0.001.
Should the interaction gender:computer be included in the model?
(c) A particular individual in the dataset identifies as male and has a
computer at home. According to the model that includes the
interaction, what is the fitted hourly wage for this individual?
(d) A different male individual doesn’t have a computer at home.
According to the model that includes the interaction, what is the
fitted hourly wage for this individual?
(e) According to the model that includes the interaction, what effect does
having a computer at home have on a male’s hourly wage?

5.3 Using R for regression with two factors with an interaction
In this subsection, which includes two notebook activities, we’ll use R for
regression when we have two factors with an interaction. Notebook
activity 4.8 explains how to use R for this type of regression model, using
data from the wages dataset. Then, in Notebook activity 4.9, we will apply
this type of regression model to data from the dataset described next.

Open University students


In common with other universities, The Open University (OU)
collects data concerning its students to help inform its teaching, with
the aim of improving student outcomes for all.
We’ll consider a dataset containing data for students who have studied
Level 3 OU statistics modules on presentations with October start
dates between 2015 and 2020. There is one observation associated
with each student who studied on each presentation of each statistics
module, so students who studied more than one OU statistics module
and/or presentation over the data collection period will have a data
entry for each module and/or presentation they studied.
[Margin note: The Open University is one of the largest universities in Europe, with over 205 000 students.]


All data have been anonymised so that no individual student can be identified.
The OU students dataset (ouStudents)
For the original data (as collected by the OU), there are some
incomplete data cases with missing values for some variables. For our
dataset here, we have only included complete data cases, of which
there are a total of 1796. (The problem of missing data will be
considered in Unit 5.) Data are available for the following variables:
• modResult: overall final module result, taking the values 1 for pass
and 0 for fail
• examScore: the final exam score, taking values from 0 to 100
(rounded to the nearest integer)
• contAssScore: the final continuous assessment score, taking values
from 0 to 100 (rounded to the nearest integer)
• region: the geographical (UK) region linked to the student, taking
the possible values east midlands, east of england, london, northern
ireland, north west, north, scotland, south, south east, south west,
west midlands, wales and yorkshire
• gender: the gender the individual identifies with, taking the values
f (female) and m (male)
• imd: the index of multiple deprivation (IMD), which is a measure of
the level of deprivation for the student’s (UK) postcode address,
taking the values most (for the most deprived areas with IMD
values 0% to 35%), middle (for areas with IMD values 35% to 65%),
least (for the least deprived areas with IMD values 65% to 100%)
and other (for non-UK students or where the value of IMD is
unknown)
• qualLink: the OU qualification the student is linked to, taking
possible values maths (for qualifications containing substantial
mathematical content) and not (for all other qualifications or no
qualification link)
• bestPrevModScore: the best previous overall final module score,
taking values from 0 to 100 (rounded to one decimal place)
• age: the age of the student (in years). So that no individual
student can be identified, the age recorded in this dataset is the
student’s true age (from the original data collected by the OU) with
one of the values −2, −1, 0, 1 or 2 randomly added. Thus, the real
age of a student whose value of age in the dataset is 36 is equally
likely to be 34, 35, 36, 37 or 38.
The data for the first three and last three observations from the
OU students dataset are given in Table 8, next.


Table 8 The first three and last three observations from ouStudents

modResult   examScore   contAssScore   region          gender   imd
1           76          77             west midlands   m        middle
1           68          82             south west      m        least
1           89          97             west midlands   m        least
1           91          72             north           m        least
1           62          60             yorkshire       f        middle
1           73          86             north           m        other

qualLink   bestPrevModScore   age
maths      89.2               32
not        100.0              36
maths      96.0               52
maths      98.5               34
maths      93.5               44
maths      95.5               43

Source: The Open University, 2020

In Notebook activity 4.9 we’ll take examScore to be the response variable and the factors qualLink and imd to be the explanatory variables. We will
return to modelling more of the variables from the OU students dataset
later in Section 6 – and, indeed, also in later units!

Notebook activity 4.8 Regression with two factors with an interaction in R
This notebook explains how to use R for regression with two factors
with an interaction, focusing in particular on data from wages.

Notebook activity 4.9 A regression model with two factors for the OU students dataset
In this notebook, we consider a regression model with two factors for
data from ouStudents.


6 Regression with any number of covariates and factors
So far in this unit we have considered models for the response variable Y
where there are two explanatory variables which are either one covariate
and one factor, or both factors. In this final section of the unit, we’ll
extend these models to allow any number of covariates and factors as the
explanatory variables. These models are known collectively as general
linear models.
We’ll start in Subsection 6.1 by considering the case in which there are
more than two factors as explanatory variables, but no covariates and no
interactions between any of the factors. We’ll then consider regression
models with multiple covariates and factors (assuming no interactions) in
Subsection 6.2, before introducing interactions into the model in
Subsection 6.3. Finally, in Subsection 6.4, we’ll implement the models in R.

6.1 Regression with more than two factors


Section 4 introduced regression where there are two explanatory variables
which are both factors, without an interaction, so that when the ith
observation takes level k for factor A and level l for factor B, a model for
Yi has the general form given in Model (10). This general model form is
repeated here for your convenience:
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B) + (random term),

where
• the ‘baseline mean’ is the mean response when A takes level 1 and B
takes level 1
• the ‘effect of kth level of A’ is the effect on the mean response of the kth
level of A in comparison to the effect on the mean response of level 1 of
A (after controlling for B)
• the ‘effect of lth level of B’ is the effect on the mean response of the lth
level of B in comparison to the effect on the mean response of level 1 of
B (after controlling for A)
• each ‘random term’ is a normal random variable with zero mean and
constant variance σ 2 .
This model form can be extended in a natural way to accommodate any
number of factors as explanatory variables, as summarised in Box 13 next.


Box 13 Regression with multiple factors, assuming no interactions
Suppose that for modelling a response Y , we have the factors A (with
K levels), B (with L levels), . . . , Z (with R levels). Suppose further
that there are no interactions between the factors.
When the ith observation takes the kth level of factor A, the lth level
of factor B, . . . , the rth level of factor Z, a model for the response Yi
has the general form
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
     + · · · + (effect of rth level of Z) + (random term),   (12)

where
• the ‘baseline mean’ is the mean response when A, B, . . . , Z all take
level 1
• the ‘effect of kth level of A’ is the effect on the mean response of
the kth level of A in comparison to the effect on the mean response
of level 1 of A (after controlling for all the other factors)
• the ‘effect of lth level of B’ is the effect on the mean response of the
lth level of B in comparison to the effect on the mean response of
level 1 of B (after controlling for all the other factors)
⋮
• the ‘effect of rth level of Z’ is the effect on the mean response of the
rth level of Z in comparison to the effect on the mean response of
level 1 of Z (after controlling for all the other factors)
• each ‘random term’ is a normal random variable with zero mean
and constant variance σ 2 .
So, Yi is modelled as the baseline mean when all of the factors take
level 1, with added individual effects for the other levels of each factor.

Following the notation used earlier, we denote the model in Box 13 as
Y ∼ A + B + · · · + Z.
As usual, for each factor we can define a set of indicator variables, so that
this model is equivalent to a multiple regression model with
(K − 1) + (L − 1) + · · · + (R − 1)
indicator variables as covariates. As such, the usual multiple regression
assumptions hold and can be checked in the usual way with diagnostic
plots.
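Here is a minimal sketch of fitting such a model in R, assuming the wages
data are in a data frame called wages with the four factors used in the next
activity:

  # A minimal sketch of Y ~ A + B + ... + Z, assuming a data frame
  # called wages with factors gender, computer, edLev and occ.
  fit <- lm(hourlyWageSqrt ~ gender + computer + edLev + occ, data = wages)
  summary(fit)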


In the next activity, we will revisit the wages dataset, this time fitting a
model for the response hourlyWageSqrt using four of the factors in the
dataset as explanatory variables.

Activity 32 A model for the wages dataset using four factors
Model (12) was used to model the response hourlyWageSqrt (the square
root of the individual’s hourly wage, in £) from the wages dataset using
these four factors as explanatory variables:
• gender: the gender the individual identifies with, taking the values male and female
• computer: whether or not the individual has access to a computer at home, taking the values yes and no
[Image: researchers who look like they are level 1 for both edLev and occ]

• edLev: the education level attained by the individual, taking 17 possible values with codes 1 (Higher degree) to 17 (No qualifications)
• occ: the occupation of the individual, taking the values with codes
1 (Professional), 2 (Employer/Manager), 3 (Intermediate non-manual),
4 (Junior non-manual), 5 (Skilled manual), 6 (Semi-skilled manual) and
7 (Unskilled manual).
Table 9 gives part of the output produced after fitting the model
hourlyWageSqrt ∼ gender + computer + edLev + occ.

Table 9 Some of the parameter estimates for hourlyWageSqrt ∼ gender + computer + edLev + occ

Parameter       Estimate   Standard error   t-value   p-value
Baseline mean      4.699        0.114        41.148   < 0.001
gender female     −0.484        0.046       −10.611   < 0.001
computer no       −0.117        0.035        −3.374   < 0.001
edLev 2            0.052        0.118         0.439     0.661
edLev 3           −0.128        0.178        −0.718     0.473
edLev 17          −0.628        0.120        −5.244   < 0.001
occ 2             −0.094        0.067        −1.394     0.163
occ 3             −0.217        0.074        −2.919     0.004
occ 7             −0.792        0.104        −7.612   < 0.001

Use the fitted model (rounding all estimates to two decimal places) to
calculate the following.
(a) The fitted value of hourlyWageSqrt for a male who has a computer,
and takes level 3 of edLev and level 2 of occ.
(b) The fitted value of hourlyWageSqrt for a female who hasn’t got a
computer, and takes level 17 of edLev and level 7 of occ.


As usual, we can assess whether each individual factor needs to be in the model in addition to the rest by using ANOVA tests to compare the fits of
two nested models. You will identify which models these are in the next
activity.

Activity 33 Identifying nested models

For which two nested models could we compare RSS values in order to test
whether the factor occ needs to be in the model for hourlyWageSqrt in
addition to gender, computer and edLev?

As we saw in Section 5 of Unit 2, when there are several possible explanatory variables, we need to choose which explanatory variables to
include in our model; we need to choose a model which fits the data well,
but is also as parsimonious as possible. Subsection 5.3 of that unit
introduced stepwise regression as a procedure that can be used to select
our final model. We can also use stepwise regression when we have factors.
We will do this for the wages dataset in the next activity, using AIC as our
criterion. (AIC, the Akaike information criterion, was introduced in
Subsection 5.2.2 of Unit 2.)
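In R, AIC-based stepwise selection can be carried out with the step()
function; here is a minimal sketch, assuming the data frame wages:

  # A minimal sketch of AIC-based stepwise selection, assuming a data
  # frame called wages with factors gender, computer, edLev and occ.
  null <- lm(hourlyWageSqrt ~ 1, data = wages)  # null regression model
  full <- lm(hourlyWageSqrt ~ gender + computer + edLev + occ, data = wages)
  step(null, scope = formula(full), direction = "forward")  # forward
  step(full, direction = "backward")                        # backward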

Activity 34 Stepwise regression with factors

Consider once again the wages dataset with the response hourlyWageSqrt
and the four factors gender, computer, edLev and occ. The following
stepwise regression procedures were used to choose which of these factors
ought to be in our model.
(a) A forward stepwise regression procedure starting from the null
regression model was performed. The results from the procedure are
given below. Which explanatory variables are selected by this
procedure?

Step 1
change             AIC
Adding occ       −295.27
Adding edLev     −245.12
Adding gender     174.35
Adding computer   226.22
None              337.57

Step 2
change             AIC
Adding edLev     −478.13
Adding gender    −443.22
Adding computer  −334.90
None             −295.27


Step 3
change             AIC
Adding gender    −599.61
Adding computer  −499.50
None             −478.13

Step 4
change             AIC
Adding computer  −609.06
None             −599.61

(b) A backward stepwise regression procedure starting from the full regression model was also performed, and the results from the
procedure are given below. Did forward and backward stepwise
regression both suggest the same factors for the model?

Step 1
Change                 AIC
None                −609.06
Dropping computer   −599.61
Dropping gender     −499.50
Dropping edLev      −464.38
Dropping occ        −394.12

6.2 Regression with multiple covariates and factors
Now we will turn to the more general situation where we have both multiple covariates and multiple factors. For now, we’ll assume that there are no interactions between any of the explanatory variables. (We will come to the situation where there are interactions shortly.)
From Unit 2 you know that multiple regression allows the covariates
x1 , x2 , . . . , xq to be included as the explanatory variables in the model
Y ∼ x1 + x2 + · · · + xq .
You also know, from Subsection 6.1, about regression with multiple factors
A, B, . . . , Z as the explanatory variables in the model
Y ∼ A + B + · · · + Z.
In addition, from Section 2 you know about regression with one covariate
and one factor as the explanatory variables in the model
Y ∼ A + x.
Here, we can bring all of these ideas together for a regression model where
we have both multiple covariates x1 , x2 , . . . , xq and multiple factors
A, B, . . . , Z as the explanatory variables in the model
Y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq .


Activity 35 Yes, it’s another multiple regression model . . .


Explain how the model
Y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq
can be expressed as a multiple regression model.

Box 14 summarises how a regression model with multiple factors and multiple covariates as the explanatory variables can be expressed as a multiple regression model.

Box 14 Regression with multiple covariates and factors, assuming no interactions
Suppose that we wish to specify a regression model for response Y
with covariates x1 , x2 , . . . , xq and factors A, B, . . . , Z as explanatory
variables.
When the ith observation takes the kth level of factor A, the lth level
of factor B, . . . , the rth level of factor Z, and the values of the
covariates are xi1, xi2, . . . , xiq, a model for the response Yi has the form
Yi = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
     + · · · + (effect of rth level of Z) + β1 xi1 + β2 xi2 + · · · + βq xiq + Wi ,
where Wi ∼ N(0, σ²).
The interpretation of the model parameters continues to follow the
same principles as we’ve seen before, so that:
• a baseline mean is the mean response when all of the factors take
level 1 and all of the covariates are zero
• for each factor, the effect term in the model is the effect on the
mean response of the associated level of the factor in comparison to
the effect on the mean response of level 1 of the factor, after
controlling for the other factors and the covariates
• for each covariate, its regression coefficient represents the effect on the mean response of a unit increase in that covariate, after controlling for the other covariates and the factors.
The model can be expressed as a multiple regression model (with
indicator variables as covariates), and so the usual multiple regression
model assumptions hold.
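
In R, a model of this form is specified with the usual formula notation, and R constructs the indicator variables for each factor automatically. As a minimal sketch with hypothetical names (a response y, factors A and B, and covariates x1 and x2, all columns of a data frame dat):

    # A and B must be stored as factors for R to create the indicator variables
    fit <- lm(y ~ A + B + x1 + x2, data = dat)
    summary(fit)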

You will consider an application of this model in the next activity.


Activity 36 A model for the OU students dataset

The following model was fitted to the OU students dataset


examScore ∼ gender + qualLink + bestPrevModScore + age.
As a reminder, these variables are described as follows:
• examScore: the final exam score, taking values from 0 to 100 (rounded
to the nearest integer)
• gender: the gender the individual identifies with, taking the values f
(female) and m (male)
• qualLink: the OU qualification the student is linked to, taking possible
values maths (for qualifications containing substantial mathematical
content) and not (for all other qualifications or no qualification link)
• bestPrevModScore: the best previous overall final module score, taking values from 0 to 100 (rounded to one decimal place)
• age: the age of the student (in years), with one of the values −2, −1, 0, 1 or 2 randomly added.
(You’re never too old to learn: the values of age in the OU students dataset vary from 19 right up to 93!)
Here, level ‘f’ for gender and level ‘not’ for qualLink are each taken to be
level 1. The output for the fitted model is given in Table 10.
Table 10 Coefficients for the fitted model
examScore ∼ gender + qualLink + bestPrevModScore + age

Parameter           Estimate   Standard error   t-value   p-value
Baseline mean       −24.320       4.250          −5.722   < 0.001
bestPrevModScore      1.087       0.048          22.844   < 0.001
age                  −0.113       0.035          −3.241     0.001
gender m              0.783       0.822           0.953     0.341
qualLink maths       −2.709       1.009          −2.684     0.007
(a) The first student in the OU students dataset is male, is linked to a
maths-based qualification, has a score of 89.2 as his best previous
module score and is recorded as 32 years old. Use the output from the
fitted model to calculate the fitted exam score for this student.
(b) What does the fitted model tell us about the expected effect on the
exam score of a student being linked to a maths-based qualification?
(c) Thinking back to the tests discussed in this unit, explain how we
could test whether the covariate age should be included in the model
in addition to gender, qualLink and bestPrevModScore.
(d) Again thinking back to the tests discussed in this unit, explain how
we could test whether the factor qualLink should be included in the
model in addition to gender, bestPrevModScore and age.
(e) Which procedure could be used to help with choosing which of the
four explanatory variables should be included in the model?


To decide which explanatory variables need to be included in a model with multiple covariates and factors:
• use individual t-tests (testing whether the individual (partial) regression
coefficients are zero) for each covariate
• use ANOVA tests to compare the RSS values for nested models for each
factor.
However, when there are several possible explanatory variables, we need to
choose which ones to include in the model: stepwise regression can be used
to do this. (We will use R to carry out stepwise regression for the OU
students dataset soon, in Subsection 6.4.)
So far in this section, we have assumed that there are no interactions
between any of the explanatory variables; we will introduce interactions
into the model in the next subsection.

6.3 Including interactions


We have already considered models which include interactions between a
covariate and a factor in Section 3, and between two factors in Section 5.
It is also possible to have interactions between two covariates. To see how
this is done, we will first revisit interactions between one covariate and one
factor.
Recall from Section 3 (on non-parallel slopes models) that when we have one covariate and one factor as the explanatory variables, we can express the model in terms of indicator variables. In the simplest case in which the factor has just two levels, using Model (6) from Subsection 3.1, the model including an interaction term has the form
Yi = µ + α2 zi2 + β xi + γ2 zi2 xi + Wi ,   Wi ∼ N(0, σ²),
where
zi2 = 1 if the ith observation takes level 2 of the factor, and zi2 = 0 otherwise.
Here, the interaction is the product of the covariate xi and the indicator
variable zi2 associated with the factor. When both of the explanatory
variables are covariates, we also use their product to form the interaction.
So, for example, for the two covariates x1 and x2 , a model for the response
Yi including an interaction term has the following form:
Yi = µ + β1 xi1 + β2 xi2 + γ xi1 xi2 + Wi ,   Wi ∼ N(0, σ²),
where γ is the interaction parameter for the interaction between xi1 and
xi2 .
Following the notation convention that we’ve been using, we’ll denote this
model as
Yi ∼ x1 + x2 + x1:x2


or, more simply,
Yi ∼ x1 ∗ x2.
Notice that, since the interaction x1:x2 is simply the product of two
covariates, the interaction is also a single covariate. As such, we can use
the individual t-test for its regression parameter to test whether the
interaction should be included in addition to the other explanatory
variables. (Remember that we can’t do that for interactions involving
factors, because we need to consider all of the factor’s levels as a whole,
rather than individually.)
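
In R, the interaction between two covariates can be requested with : (or with *, which also includes the individual effect terms), and the summary() output then contains the individual t-test for the interaction parameter. A hedged sketch, assuming the OU students data are in a data frame called ouStudents:

    # Equivalent to examScore ~ bestPrevModScore + age + bestPrevModScore:age
    fit <- lm(examScore ~ bestPrevModScore * age, data = ouStudents)
    summary(fit)  # the bestPrevModScore:age row tests the interaction parameter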
The model including an interaction between two covariates is illustrated in
the next activity.

Activity 37 Modelling the OU students dataset including an interaction between two covariates
Consider once again the OU students dataset introduced in Subsection 5.3.
The following model was fitted to the data:
examScore ∼ bestPrevModScore + age + bestPrevModScore:age.
The output for the fitted model is given in Table 11.
Table 11 Output produced when fitting a regression model using two covariates
and an interaction using data from the OU students dataset

Parameter                Estimate   Standard error   t-value   p-value
Baseline mean             13.193      13.451           0.981     0.327
bestPrevModScore           0.644       0.149           4.333   < 0.001
age                       −1.151       0.358          −3.216     0.001
bestPrevModScore:age       0.012       0.004           2.983     0.003

(a) Explain why the output for the fitted model suggests that both
covariates and their interaction should be included in the model.
(b) The first student listed in the dataset has observed values of 89.2 for
bestPrevModScore and 32 for age. According to the model, what is
the fitted value (to the nearest whole mark) for examScore, ŷ1, for
this student?

The interactions considered so far involve two explanatory variables (where both are factors, both are covariates, or one is a covariate and one is a factor). As such, they are known as two-way interactions. However, it is
possible for more than two explanatory variables to interact with each
other to affect the response; we could have a three-way interaction
between three explanatory variables, a four-way interaction between
four explanatory variables, and so on. The more variables involved in an
interaction, the higher order the interaction is said to have. So a three-way


interaction has a higher order than two-way interactions. Similarly, a


two-way interaction has a lower order than four-way and three-way
interactions. The notation for two-way interactions is extended in a
natural way, so that, for example, the three-way interaction between three
factors A, B and C is denoted A:B:C. You will consider such interactions
in the context of a specific dataset in the next activity.

Activity 38 Possible interactions when there are four explanatory variables
Consider once again modelling the response hourlyWageSqrt (the square
root of the individual’s hourly wage, in £) from the wages dataset using
the four factors gender, computer, edLev and occ, as described in
Activity 32 (Subsection 6.1). This time, however, we will allow the model
to include interactions between the factors.
(a) What are the possible two-way interactions for this model?
(b) What are the possible three-way interactions for this model?
(c) How many possible interactions are there for this model altogether?

Although there could be many possible interactions between the explanatory variables, we only want the model to include those
interactions which significantly improve the fit of the model, as assessed
either by ANOVA tests to compare the RSS values of nested models, or by
stepwise regression.
There is an important rule, called the hierarchical principle, which is also
used when deciding which interaction terms to include in a model, as
summarised in Box 15.

Box 15 The hierarchical principle


If an interaction is included in a model, then the model must also
include:
• the individual effect terms for each of the variables in the interaction
• any lower-order interactions involving any of the variables in the
interaction.

The hierarchical principle given in Box 15 is illustrated in the next example and activity.


Example 9 Which terms to include?


Consider once again modelling the response hourlyWageSqrt from the
wages dataset using the four factors gender, computer, edLev and
occ, together with their eleven possible interaction terms, as
described in Activity 38.
Suppose that the three-way interaction
gender:computer:edLev
significantly improves the fit of the model and so is included in our
parsimonious model. Our parsimonious model should then also
include:
• the individual effect terms for each of the factors gender, computer
and edLev
• any lower-order interactions involving any of gender, computer and
edLev – that is, the two-way interactions gender:computer,
gender:edLev, and computer:edLev.
Note that the terms listed in the bullet points are those which, by the
hierarchical principle, must be included in the model if the interaction
between gender, computer and edLev is included. It is possible that
the model also needs to include other terms, such as the individual
effect term for the factor occ and other two-way or three-way
interactions involving occ.
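
It is worth noting that, in R’s model formula notation, the * operator expands to exactly the set of terms that the hierarchical principle requires. For instance, as a sketch reusing the assumed wages data frame:

    # gender * computer * edLev expands to the three individual effect terms,
    # the three two-way interactions and the three-way interaction
    fit <- lm(hourlyWageSqrt ~ gender * computer * edLev, data = wages)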

Activity 39 Which terms to include this time?

Consider once again modelling the response hourlyWageSqrt from the wages dataset using the four factors gender, computer, edLev and occ,
together with their eleven possible interaction terms, as described in
Activity 38.
(a) Which terms need to be included in the model if the interaction
edLev:occ is in the model?
(b) Which terms need to be included in the model if the interaction
gender:computer:occ is in the model?
(c) Which terms need to be included in the model if the interaction
between all four of the factors is in the model?

So, as you have seen, it is possible to build regression models with any
number of covariates and factors, including interactions between them.
A collective name often used for such models is general linear models.


As the name suggests, general linear models encompass a wide range of models. One thing that all general linear models have in common is that the random term is assumed to have a normal distribution.
We will round off this unit by using general linear models in R. It is worth
noting that R automatically uses the hierarchical principle when choosing
a model using stepwise regression.

6.4 Using R for modelling using general linear models
The first two notebook activities in this subsection both focus on general
linear models for the response hourlyWageSqrt from the wages dataset,
using two covariates (workHrs and educAge) and two factors (gender and
computer) as the explanatory variables. In Notebook activity 4.10, we will
consider the situation in which there are no interactions between the
explanatory variables, whereas in Notebook activity 4.11 we will consider
using the same explanatory variables, but this time including interactions.
Finally, in Notebook activity 4.12 we will return to modelling data from
the OU students dataset. As in Activity 36 (Subsection 6.2), we will
consider the same response variable, examScore, and explanatory
variables, gender, qualLink, bestPrevModScore and age. However, in the
notebook we will use stepwise regression in order to choose which
explanatory variables and interactions ought to be in the model.

Notebook activity 4.10 Multiple covariates and factors without interactions in R
This notebook considers regression with multiple covariates and
multiple factors, but no interactions.

Notebook activity 4.11 Multiple covariates and factors with interactions in R
This notebook considers the same response and explanatory variables
as previously, with interactions between the explanatory variables.

Notebook activity 4.12 Choosing a model for the OU students dataset
This notebook uses stepwise regression to choose a model for the
response examScore using data from ouStudents.

We will revisit modelling the exam scores from the OU students dataset by
considering an alternative model later in the module (in Unit 7).


Summary
This unit brought together the regression ideas presented in Units 2 and 3
to develop regression with any number of covariates and factors as the
explanatory variables.
We began the unit by considering regression when there is just a single
covariate, x, and a single factor, A, as the explanatory variables. In this
case, separate regression lines are fitted for the different levels of the
factor. These lines can either be parallel or non-parallel.
If the fitted lines are parallel, then the lines for the different levels of the
factor only differ in terms of their intercepts: this is the parallel slopes
model, which we denoted
Y ∼ A + x.
On the other hand, if the fitted lines are non-parallel, then the lines for the
different levels of the factor differ in terms of both their intercepts and
their slopes. In this case, the covariate and the factor are said to interact,
so that the effect on the response of one of the explanatory variables
depends on the value of the other explanatory variable. For example, the
relationship between the response and x may be positive for level 1 of A,
but may be negative for level 2 of A. The non-parallel slopes model
accommodates the interaction between the covariate and factor by
including an extra interaction term, A:x, in the model so that the model
becomes
Y ∼ A + x + A:x or, equivalently, Y ∼ A ∗ x.
We then moved on to consider regression when there are two factors,
A and B, as the explanatory variables. We started with the case where
A and B do not interact to affect the response:
Y ∼ A + B.
In this case, the model is an extension of the model presented in Unit 3,
where:
• instead of having a baseline mean as the mean response when the
(single) factor (A) takes level 1, the baseline mean is now the mean
response when both factors (A and B) take level 1
• in addition to a separate added effect of each level of A, there is also a
separate added effect for each level of B.


A means plot can be used to informally check that there is no interaction between A and B. In a means plot, the sample mean responses are plotted against the levels of one of the factors, and the sample mean responses for the same level of the other factor are then joined together by lines.
If there is no interaction between the two factors, then the resulting lines
are expected to be roughly parallel, because the effect of each factor on the
response is the same across the levels of the other factor.
In contrast, if there is an interaction between A and B, then the effect of
one of the factors on the response is not the same across all of the levels of
the other factor. For example, level k of A may be expected to increase the
response when B takes level 1, but the same level k of A might be
expected to decrease the response when B takes level 2. In this case, an
extra interaction term, A:B, can be added to the model, so that the model
becomes
Y ∼ A + B + A:B or, equivalently, Y ∼ A ∗ B.

All of these models were brought together and extended in a natural way
towards the end of the unit where we considered regression with any
number of covariates, x1 , x2 , . . . , xq , and any number of factors,
A, B, . . . , Z. The simplest model here is
Y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq ,
where there are no interactions between any of the explanatory variables.
Interactions can also be added into the model, and can be:
• between covariates, between factors or between a mixture of covariates
and factors
• between two explanatory variables (two-way interactions), between three
explanatory variables (three-way interactions), and so on.
When adding interactions into the model, the hierarchical principle needs
to be respected, which says that if an interaction is included in a model,
then the model must also include:
• the individual explanatory variables for each variable in the interaction
• any lower-order interactions involving any of the variables in the
interaction.
Stepwise regression can be useful to help with deciding which explanatory
variables and which interactions should be included in the model.
For the models in this unit:
• A set of indicator variables can be used to represent each factor, so that
all of the models can be expressed as multiple regression models (with
the usual model assumptions).
• Individual t-tests (to test whether the (partial) regression coefficient is
zero) can be used to test whether each covariate should be included in
the model (in addition to the other explanatory variables in the model).


• ANOVA tests comparing the fit of the model with the factor to the fit of
the model without the factor can be used to test whether each factor
should be included in the model (in addition to the other explanatory
variables in the model).
• ANOVA tests comparing the fit of the model with the interaction to the
fit of the model without the interaction can be used to test whether an
individual interaction should be included in the model (in addition to
the other explanatory variables in the model).
R was used to fit the models introduced in this unit, to test whether
individual covariates, factors or interaction terms should be included in the
model, and to carry out stepwise regression in order to choose which
covariates, factors and interactions should be included in the model.
As a reminder of what has been studied in Unit 4 and how the sections in
the unit link together, the route map for the unit is repeated below.

The Unit 4 route map
• Section 1: Regression with one covariate and one factor
• Section 2: Modelling using parallel slopes
• Section 3: Modelling using non-parallel slopes
• Section 4: Regression with two factors that do not interact
• Section 5: Regression with two factors that interact
• Section 6: Regression with any number of covariates and factors


Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate that regression with one covariate and one factor as the
explanatory variables produces separate fitted lines for different levels of
the factor; the fitted lines are parallel if there is no interaction, and
non-parallel if there is an interaction
• appreciate that for regression with multiple factors, the fitted model is
based on the mean responses for the different level combinations of the
factors
• use means plots to informally decide whether there may be an
interaction between two factors
• appreciate that the ideas of multiple regression, regression with one
covariate and one factor, and regression with multiple factors, can be
combined to produce a regression model with any number of covariates
and factors
• understand that sets of indicator variables can be used to represent any
factors in the model so that a regression model with any number of
covariates and factors can be expressed as a multiple regression model
• interpret the baseline mean, individual effect terms and (partial)
regression coefficients in a data context
• understand that an interaction between explanatory variables means
that the explanatory variables in the interaction work together to affect
the response
• appreciate that interactions can be between covariates, between factors
or between a mixture of covariates and factors
• appreciate that interactions can be between two explanatory variables
(two-way interactions), between three explanatory variables (three-way
interactions), and so on
• understand and be able to use the hierarchical principle
• use parameter estimates to calculate fitted values for the response for a
model with any number of covariates, factors and interactions
• use individual t-tests to test whether each covariate should be in the
model in addition to the other explanatory variables
• use ANOVA tests comparing the RSS values of two nested models to
test whether individual factors should be in the model in addition to the
other explanatory variables
• use ANOVA tests comparing the RSS values of two nested models to
also test whether interactions involving one or more factors should be in
the model in addition to the other explanatory variables
• appreciate that stepwise regression can be used to choose which
covariates, factors and interactions should be included in a model


• use R to fit regression models with any number of covariates, factors and
interactions
• use the summary output from R to test whether individual covariates
and individual interactions between covariates should be included in the
model in addition to the rest of the explanatory variables
• use R to fit nested models and carry out ANOVA tests to test whether
individual factors, and individual interactions involving factors, should
be included in the model in addition to the rest of the explanatory
variables
• use R to produce means plots
• use R to carry out stepwise regression to choose which explanatory
variables to include in the model when there are multiple covariates,
multiple factors and multiple interactions.

References
OECD (2020) OECD 60th anniversary. Available at:
https://www.oecd.org/60-years (Accessed: 20 March 2022).
OECD (2021) Trends in employment, unemployment and inactivity rates,
by educational attainment and age group. Available at:
https://stats.oecd.org (Accessed: 4 October 2020).
Snedecor, G.W. and Cochran, W.G. (1967) Statistical methods. 6th edn.
Ames, IA: Iowa State University Press.
The Open University (2020) ‘Internal bespoke anonymised extract M348’,
OU administration data (Accessed: 25 October 2020).


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, a manna ash tree flowering: © zanozaru / www.123rf.com
Subsection 1.2, ‘hooray!’: © Roman Samborskyi / www.123rf.com
Section 2, scenic parallel slopes: © bsvet / www.123rf.com
Subsection 2.2, dog showing skill move: © alexeitm / www.123rf.com
Section 3, scenic non-parallel slopes: © Kotenko / www.123rf.com
Subsection 3.1, musicians performing: © razihusin / www.123rf.com
Subsection 3.2, Andreas Brehme: © Getty Images
Subsection 4.1, diverse and inclusive workforce: © langstrup /
www.123rf.com
Subsection 4.1, older worker: © Wavebreakmedia Ltd | Dreamstime.com
Subsection 4.2, railway lines: © konradbak / www.123rf.com
Subsection 6.1, researchers: © dragoscondrea / www.123rf.com
Subsection 6.2, older learner: © Fizkes | Dreamstime.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
The term α1 is set to be zero because µ is the baseline mean of Y for
level 1 of A, and so the effect of level 1 of A is already accommodated in µ.
Besides, the α1 term compares the effect on Y of level 1 of A with the
effect on Y of level 1 of A, which must be zero!

Solution to Activity 2
Since side is a factor with two levels, K − 1 = 2 − 1 = 1 and we can define one indicator variable zi2, where
zi2 = 1 if the ith tree takes level 2 of side (that is, if the ith tree is on the west side of Walton Drive), and zi2 = 0 otherwise.
Then, since zi2 is numerical, we can use zi2 and diameter as the two covariates in a multiple regression model.

Solution to Activity 3
(a) When the ith tree is on the east side of Walton Drive, it takes level 1
of side and so zi2 = 0. The fitted model is then
height = 1.29 + 19.80 diameter.

(b) When the ith tree is on the west side of Walton Drive, it takes level 2
of side and so zi2 = 1. Therefore, this time the fitted model is
height = 1.29 + 2.62 + 19.80 diameter = 3.91 + 19.80 diameter.

(c) These two regression lines both have the same slope (19.80) but
different intercepts (1.29 for east and 3.91 for west). The lines are
therefore parallel to each other. This can be clearly seen in Figure S1,
which follows.


[Figure S1 shows a scatterplot of height (m) against diameter (m), with points distinguished by Side (East/West).]
Figure S1 Scatterplot of height and diameter, together with the two fitted (parallel) regression lines for height ∼ side + diameter

(d) Since α̂2 = 2.62, after controlling for diameter, the effect on height of being on the west side of Walton Drive, in comparison to being on the east side, is that height would be expected to increase by 2.62 m.
(e) Since β̂ = 19.80, after controlling for side, the effect on height of an increase in diameter of 0.1 m is that height would be expected to increase by 19.80 × 0.1 m = 1.98 m.

Solution to Activity 4
(a) The parameter µ is the baseline intercept when skillMoves takes level 1 and weight is zero, and so µ̂ is the intercept for the fitted regression line for level 1, that is, µ̂ = 6.94.
The parameters α2, α3, α4 and α5 are the added effects on strength of skillMoves levels 2, 3, 4 and 5, respectively, after controlling for weight, in comparison to the effect of level 1. (The effect of level 1 on the relationship between strength and weight is accounted for in the baseline intercept µ.) Therefore,
α̂2 = 13.75 − 6.94 = 6.81,
α̂3 = 12.53 − 6.94 = 5.59,
α̂4 = 9.00 − 6.94 = 2.06,
α̂5 = 9.14 − 6.94 = 2.20.
Finally, the parameter β is the slope parameter, and so β̂ = 0.36.


(b) The values of α̂2 and α̂3 are both positive and relatively large, which suggests that, after controlling for weight, strength is generally higher for players whose values of skillMoves are 2 or 3, in comparison to players whose values of skillMoves are 1.
Although the values of α̂4 and α̂5 are also both positive, their values are smaller than α̂2 and α̂3, which suggests that, after controlling for weight, strength is generally a little higher for players whose values of skillMoves are 4 or 5, in comparison to players whose values of skillMoves are 1, but not as high as for players whose values of skillMoves are 2 or 3.
The values of α̂4 and α̂5 are also very close to each other, suggesting that, after controlling for weight, strength is very similar for players whose values of skillMoves are 4 or 5. This can be seen in Figure 6, where the regression lines associated with these two levels of skillMoves are very close to each other.

Solution to Activity 5
For this model, we have K − 1 regression coefficients associated with A
(that is, α2 , α3 , . . . , αK ), and one regression coefficient associated with x
(that is, β). We therefore have K regression coefficients altogether. So,
from multiple regression, the p-value associated with this test is based on
an F -distribution with K and n − (K + 1) degrees of freedom.

Solution to Activity 6
The p-value associated with the test is very small, so we should reject H0
and conclude that there is evidence to suggest that at least one of the
regression coefficients is non-zero.

Solution to Activity 7
The explanatory variables in MC are x2 and x5 . Those in MA are x2 , x5
and also x4 . So MC is nested within MA .
MC is also nested within MB , since, in addition to MC ’s explanatory
variables (x2 and x5 ), MB ’s explanatory variables also include x1 and x4 .
Finally, MA is also nested within MB , since MB ’s explanatory variables
include x1 in addition to MA ’s explanatory variables (x2 , x4 and x5 ).

Solution to Activity 8
We know that the RSS for the model Y ∼ A + x will be less than the RSS
for the model Y ∼ x. However, if the difference between the RSS values for
the two models is large, this suggests that the RSS for the model
Y ∼ A + x is quite a bit smaller than the RSS for the model Y ∼ x, which
in turn suggests that the fit of the model Y ∼ A + x is quite a bit better
than the fit of the model Y ∼ x.


Solution to Activity 9
From Table 1, the p-value associated with diameter is very small
(< 0.001), which suggests that diameter is significant after controlling for
side and so should be included in the model.
The p-value from the ANOVA test for whether side should be included in
the model in addition to diameter is also very small (< 0.001), which
suggests that side should also be included in the model in addition to
diameter.

Solution to Activity 10
From Table 2, the p-value associated with weight is very small (< 0.001),
which suggests that weight is significant after controlling for skillMoves
and so should be included in the model.
The p-value from the ANOVA test testing whether skillMoves should be
included in the model in addition to weight is also very small (< 0.001),
which suggests that skillMoves should also be included in the model in
addition to weight.

Solution to Activity 11
(a) With the exception of one large (negative) residual, the points in the
residual plot seem to be scattered randomly about zero, suggesting
that the assumption that the Wi ’s have zero mean and constant
variance seems reasonable. Also, the residuals in the normal
probability plot lie roughly along a straight line, and so the
assumption of normality of residuals seems plausible as well.
So, neither of the plots suggest any problems with the assumptions of
the random terms in the parallel slopes model for these data.
Note, though, that neither of these plots gives us any information about the reasonableness of the independence assumption.
(b) The five regression lines for the levels of skillMoves seem to fit the
associated data points fairly well in Figure 6, and so the assumption
that the five lines are parallel seems to be reasonable.

Solution to Activity 12
In order for the slopes to not be parallel, we need different slopes for the
different factor levels. This means that we need the regression coefficient,
β, to differ across the K levels of A.


Solution to Activity 13
(a) When individual i is male, gender takes level 1. The fitted model then has the form
hourlyWageSqrt = µ̂ + β̂ workHrs,
and so the fitted line for individuals who are male is
hourlyWageSqrt = 4.575 − 0.0173 workHrs.
When individual i is female, gender takes level 2. The fitted model then has the form
hourlyWageSqrt = µ̂ + α̂2 + (β̂ + γ̂2) workHrs,
and so the fitted line for individuals who are female is
hourlyWageSqrt = 4.575 − 1.816 + (−0.0173 + 0.0335) workHrs
               = 2.759 + 0.0162 workHrs.
(b) Since β̂ = −0.0173, the slope for the fitted line when gender takes level 1 is negative; that is, when gender is male, the fitted line is a downwards slope.
However, since β̂ + γ̂2 is positive (0.0162), the slope for the fitted line when gender takes level 2 is positive; that is, when gender is female, the fitted line is an upwards slope.
(c) The fitted model is given by:
hourlyWageSqrt = 4.575 − 0.0173 workHrs, for level 1 (male),
hourlyWageSqrt = 2.759 + 0.0162 workHrs, for level 2 (female).
For the male individual, the fitted value of hourlyWageSqrt when the value of workHrs is 35 is, therefore,
hourlyWageSqrt = 4.575 − (0.0173 × 35) = 3.9695 ≃ 3.97,
while for the female individual, the fitted value of hourlyWageSqrt when the value of workHrs is 35 is
hourlyWageSqrt = 2.759 + (0.0162 × 35) = 3.3260 ≃ 3.33.
The fitted values of their respective hourly wages are the squares of the fitted values for hourlyWageSqrt. Therefore, the fitted value of the hourly wage in £, to the nearest 10p, for the male individual is
3.9695² ≃ 15.80
and for the female individual is
3.3260² ≃ 11.10.


Solution to Activity 14
If the ith observation takes level 1 of factor A, then the indicator variables
zi2 , zi3 , . . . , ziK all take the value 0. Model (6) then becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αK × 0)
     + (β + (γ2 × 0) + (γ3 × 0) + · · · + (γK × 0))xi + Wi
   = µ + β xi + Wi .
If the ith observation takes level k of factor A, for k = 2, 3, . . . , K, then the
indicator variable zik takes the value 1, while the other indicator variables
all take the value 0. Model (6) then becomes
Yi = µ + (α2 × 0) + (α3 × 0) + · · · + (αk × 1) + · · · + (αK × 0)
+ (β + (γ2 × 0) + (γ3 × 0) + · · · + (γk × 1) + · · · + (γK × 0))xi + Wi
= µ + αk + (β + γk ) xi + Wi .

Solution to Activity 15
(a) When observation i takes level 1 of preferredFoot – that is, when the player is left-footed – the fitted model for strength has the form
strength = µ̂ + β̂ height.
Therefore, since the fitted line for level 1 of preferredFoot is
strength = −86.2 + 2.22 height,
this means µ̂ = −86.2 and β̂ = 2.22.
When observation i takes level 2 of preferredFoot – that is, when the player is right-footed – the fitted model for strength has the form
strength = µ̂ + α̂2 + (β̂ + γ̂2) height.
The fitted line for level 2 of preferredFoot is
strength = −6.5 + 1.08 height,
and so
µ̂ + α̂2 = −6.5
−86.2 + α̂2 = −6.5
α̂2 = −6.5 + 86.2 = 79.7
and
β̂ + γ̂2 = 1.08
2.22 + γ̂2 = 1.08
γ̂2 = 1.08 − 2.22 = −1.14.
(Note that when using parameter estimates given to full computer accuracy, it turns out that γ̂2 is −1.13 to two decimal places.)
(b) Since β̂ = 2.22, the slope for the regression equation when preferredFoot takes level 1 is positive; that is, when preferredFoot is left, there is an upwards slope.
The value of γ̂2 represents the added effect that level 2 (right) of preferredFoot has on the slope in comparison to the regression slope when preferredFoot is level 1 (left). Since β̂ + γ̂2 is positive, the slope for the regression equation is still positive, but since γ̂2 is negative, the slope is less steep when preferredFoot is right than when preferredFoot is left.
(c) We have the fitted model
strength = −86.2 + 2.22 height, for level 1 (left),
strength = −6.5 + 1.08 height, for level 2 (right).
For the left-footed player, the fitted value of strength when the value of height is 75 is therefore
strength = −86.2 + (2.22 × 75) ≃ 80,
while for the right-footed player, the fitted value of strength when the value of height is 75 is
strength = −6.5 + (1.08 × 75) ≃ 75.

Solution to Activity 16
If there is an interaction between A and x, we would expect the model
Y ∼ A + x + A:x
to fit the data much better than the model
Y ∼ A + x.
In this case, we would expect a fairly large reduction in the amount of
unexplained variation when the interaction term is added into the model,
and so we might expect the difference in the RSS values for the two models
to be large.

Solution to Activity 17
The p-value for testing whether the interaction term gender:workHrs
should be included in the model is very small (< 0.001), which means that
there is strong evidence to suggest that the interaction term should be
included in the model. Therefore, it was indeed wise to use a non-parallel
slopes model in Activity 13, since there would have been a significant
decrease in fit if we’d simply used the parallel slopes model.

385
Unit 4 Multiple regression with both covariates and factors

Solution to Activity 18
A p-value of 0.039 is quite small, which suggests that the interaction term
preferredFoot:height should probably be included in the model,
meaning that a non-parallel slopes model is preferable to a parallel slopes
model for these data.

Solution to Activity 19
The model
height ∼ side + diameter
is a parallel slopes model, whereas the model
height ∼ side + diameter + side:diameter
is a non-parallel slopes model. So, deciding whether a non-parallel slopes
model or a parallel slopes model is better, is equivalent to deciding
whether or not the interaction term side:diameter should be included in
the model.
The p-value for testing whether the interaction term side:diameter should
be included in the model was 0.256. This value is quite large and so there
is not enough evidence to suggest that the interaction term side:diameter
should be included in the model. The more parsimonious parallel slopes
model is therefore better for modelling height for these data.

Solution to Activity 20
There seems to be a decreasing trend in the residual plot for the high
fitted values. There are, however, not very many of these higher values,
and for the vast majority of the fitted values the residuals appear to be
fairly randomly scattered about zero. So, overall (to the module team) the
assumption that the random terms have zero mean and constant variance
seems to be reasonable.
The majority of the points in the normal probability plot lie along the
straight line, which is consistent with the assumption that the random
terms are normally distributed. At the ends of the plot, however, the
points systematically deviate slightly from the line, so the assumption of
normality may therefore be questionable. Despite this, since the deviation
is only slight and the majority of points lie close to the line, the normality
assumption can’t be ruled out.


Solution to Activity 21
Although there are a few large negative residuals in the residual plot, and
also a hint of curvature in the residual plot, the points generally seem to
be scattered fairly randomly about zero, so the assumption that the Wi ’s
have zero mean and constant variance seems reasonable.
In the normal probability plot, the residuals lie roughly along a straight
line but systematically depart from the line for the higher residual values.
The assumption of normality may, therefore, be questionable, but (to the
module team) the normal probability plot suggests that the assumption of
normality is still plausible.
So, both plots raise some potential issues, but neither of the plots suggest
any major problems with the assumptions of the non-parallel slopes model
for these data.

Solution to Activity 22
(a) Since each level effect term for A is the added effect on the response in
comparison to the effect of level 1 of A, when k = 1, the ‘effect of kth
level of A’ term in Model (10) will simply be zero. Similarly, when
l = 1, the ‘effect of lth level of B’ term in Model (10) will also be zero.
Therefore, when both factors A and B take level 1, Model (10)
reduces to
Yi = (baseline mean) + (random term).

(b) From part (a), when k = 1, the ‘effect of kth level of A’ term in
Model (10) is zero, and so Model (10) reduces to
Yi = (baseline mean) + (effect of lth level of B) + (random term).

(c) Also from part (a), when l = 1, the ‘effect of lth level of B’ term in
Model (10) is zero, and so Model (10) reduces to
Yi = (baseline mean) + (effect of kth level of A) + (random term).

Solution to Activity 23
(a) (i) For males aged 55 to 64, gender takes level 1 and so we do not
have a gender effect in the model, but age does not take level 1
and so we do need to include the level effect for age being
55 to 64. So, the fitted employment rate for males aged 55 to 64
in the country is calculated as
87.777 − 11.340 ≃ 76.4.


(ii) For females aged 25 to 34, gender takes level 2 and so we need to
include the level effect for gender taking the value female in the
model, but age takes level 1, and so we do not have an age
effect. So, the fitted employment rate for females aged 25 to 34
in the country is calculated as
87.777 − 8.068 ≃ 79.7.

(iii) For males aged 35 to 44, gender takes level 1 and so we do not
have a gender effect in the model, but age does not take level 1
and so we do need to include the level effect for age being
35 to 44. So, the fitted employment rate for males aged 35 to 44
in the country is calculated as
87.777 + 5.332 ≃ 93.1.

(iv) For females aged 45 to 54, neither gender nor age take level 1,
and so we need to include both of the level effect terms for
gender taking the value female and age being 45 to 54. So, the
fitted employment rate for females aged 45 to 54 in the country is
calculated as
87.777 − 8.068 + 5.657 ≃ 85.4.

(b) Since the effect term when gender takes the value female is
approximately −8.1, after controlling for age, the employment rate is
expected to be lower by 8.1% for females in comparison to the
employment rate for males.
(c) The effect term when age is 35 to 44 is approximately 5.3 and the
effect term when age is 45 to 54 is approximately 5.7. These values
are both positive, and so, after controlling for gender, in comparison
to the employment rate for the age group 25 to 34, the employment
rate is expected to be higher on average by 5.3% for the age group
35 to 44 and higher on average by 5.7% for the age group 45 to 54.
Notice that since the values of these level effect terms are very similar,
the employment rates for these two age groups, after controlling for
gender, are also very similar.
On the other hand, since the effect term when age is 55 to 64 is
approximately −11.3, and this value is negative, after controlling for
gender, the employment rate is expected to be lower by 11.3% for
those aged 55 to 64 in comparison to those aged 25 to 34.


Solution to Activity 24
In both figures, the lines for the different levels of the factor are parallel to
each other.

Solution to Activity 25
(a) In order to decide whether A should be included in the model in
addition to B, we could compare the values of the RSS for the two
nested models
Y ∼B and Y ∼ A + B.
Then, if the difference is large enough, this would suggest that adding
A into the model (in addition to B) significantly reduces the
unexplained response variation (and therefore significantly increases
the model fit).
(b) Similarly, in order to decide whether B should be included in the
model in addition to A, we could compare the values of the RSS for
the two nested models
Y ∼A and Y ∼ A + B.
Then, if the difference is large enough, this would suggest that adding
B into the model (in addition to A) significantly reduces the
unexplained response variation (and therefore significantly increases
the model fit).

Solution to Activity 26
The p-value from the first test is very small, and so there is evidence to
suggest that adding age into the model in addition to gender significantly
increases the model fit.
The p-value from the second test is also very small, and so there is also
evidence to suggest that adding gender into the model in addition to age
significantly increases the model fit.
Therefore, we should include both gender and age in the model.

Solution to Activity 27
If the model is suitable for these data, then we might expect the lines in
the means plots to be roughly parallel to each other. However, this is not
the case for either of the means plots in Figure 21 and so this model may
not be suitable for these data.


Solution to Activity 28
(a) The first rat takes level 1 of source and level 2 of amount. Therefore, there isn’t an added interaction term in the model, nor the individual effect term for source. The fitted value is therefore
ŷ1 = 100.0 − 20.8 = 79.2.
(b) The 11th rat takes level 1 of both source and amount. Therefore, there isn’t an added interaction term in the model, nor either of the individual effect terms for source and amount. The fitted value is therefore simply the fitted baseline mean
ŷ11 = 100.0.
(c) The 21st rat takes level 2 of both source and amount. Therefore, there is an added interaction term in the model, and also individual effect terms for both source and amount. The fitted value is therefore
ŷ21 = 100.0 − 14.1 − 20.8 + 18.8 = 83.9.
(d) The 31st rat takes level 2 of source and level 1 of amount. Therefore, there isn’t an added interaction term in the model, nor the individual effect term for amount. The fitted value is therefore
ŷ31 = 100.0 − 14.1 = 85.9.

Solution to Activity 29
In order to decide whether the interaction A:B should be included in the
model in addition to factors A and B, we could compare the values of the
RSS for the two nested models
Y ∼A+B and Y ∼ A + B + A:B.
If the difference is large enough, this would suggest that adding A:B into
the model (in addition to A and B) significantly reduces the unexplained
response variation (and therefore significantly increases the model fit).

Solution to Activity 30
The p-value for testing whether the interaction should be included in the
model in addition to source and amount is 0.054. As such, there is some
evidence to suggest that the fit of the model including an interaction is
significantly better than the model without an interaction, although the
evidence isn’t very strong.
In situations such as this, the context of the research problem and the
opinion of the researcher are important to be able to judge how ‘small’ the
p-value needs to be for us to conclude that the interaction term should be
included in the model. If the researcher would only consider a p-value to
be ‘small’ if it is less than 0.01, say, then they would conclude that the
interaction should not be in the model. On the other hand, if they


consider a p-value of 0.054 to be ‘small enough’, then they would conclude that the interaction should be included in the model.

Solution to Activity 31
(a) The lines in both of the means plots are not parallel, which suggests
that the effect on the response of each factor depends on the other
factor. In particular, whether or not the individual has a computer at
home seems to have little effect on the hourly wage for females, but
makes quite a difference for males.
(b) The p-value associated with the interaction is very small (< 0.001).
There is therefore strong evidence to suggest that the interaction
gender:computer should be included in the model.
(c) For a male individual who has a computer, both gender and computer take level 1, and so the fitted value for the hourlyWageSqrt is simply the baseline mean
ŷ ≃ 4.073.
We want the fitted value of the hourly wage (in £), rather than its square root, which is 4.073² ≃ 16.59.
(d) For a male individual who doesn’t have a computer, gender takes level 1 and computer takes level 2. So, since gender takes level 1, there is neither an interaction in the model nor an effect for gender. The fitted value of hourlyWageSqrt for this individual is then
ŷ ≃ 4.073 − 0.384 = 3.689.
So, the fitted value of the hourly wage (in £) is 3.689² ≃ 13.61.
(e) From parts (c) and (d), the fitted value of hourly wage for a male with a computer is £16.59, while that for a male without a computer is £13.61. Therefore, according to the fitted model, having a computer at home increases a male’s hourly wage by £16.59 − £13.61 = £2.98, on average.

Solution to Activity 32
(a) This individual takes level 1 of both gender and computer, so that the individual effects of these factors are accounted for in the baseline mean for this individual. However, the other two factors, edLev and occ, do not take level 1, and so their individual level effects need to be added into the model. The fitted value is then calculated as
ŷ = (baseline mean) + (effect of level 3 of edLev) + (effect of level 2 of occ)
  = 4.699 − 0.128 − 0.094
  = 4.477 = 4.48 (to two decimal places).
(b) This individual doesn’t take level 1 for any of the four factors, and so the individual effects of the factor levels need to be added into the model. The fitted value is therefore calculated as
ŷ = (baseline mean) + (effect of level 2 of gender) + (effect of level 2 of computer)
    + (effect of level 17 of edLev) + (effect of level 7 of occ)
  = 4.699 − 0.484 − 0.117 − 0.628 − 0.792
  = 2.678 = 2.68 (to two decimal places).

Solution to Activity 33
To test whether the factor occ needs to be in the model in addition to
gender, computer and edLev, we can compare the two nested models
hourlyWageSqrt ∼ gender + computer + edLev
and
hourlyWageSqrt ∼ gender + computer + edLev + occ.

Solution to Activity 34
(a) The outcome at Step 1 is to add occ to the model (because this factor
has the smallest AIC value). So, Step 2 starts with the model with
the single factor occ. Repeating this method at each step, we have
that the outcome at Step 2 is to add edLev to the model, then add
gender at Step 3, and finally add computer at Step 4. There are then
no further explanatory variables to try adding into the model, and so
all four factors – gender, computer, edLev and occ – were selected by
this forward stepwise regression procedure.
(b) The model with no change has the smallest AIC value in Step 1 of the
backward stepwise regression procedure, and so we can’t improve the
model by removing any of the four factors. Therefore, the backward
stepwise regression procedure selected all four factors, and so forward
and backward stepwise regression both suggested the same factors for
the model.

Solution to Activity 35
Following the ideas presented in Subsection 6.1, we can define a set of
indicator variables for each factor. Then, since each indicator variable is a
covariate, and x1 , x2 , . . . , xq are also covariates, we have a multiple
regression model.


Solution to Activity 36
(a) The first student takes level 2 (m) of gender and also level 2 (maths)
of qualLink. Therefore, we need to include the effects of both of
these factors in the model. The fitted exam score for the first student
is therefore calculated as
ŷ1 = −24.320 + 0.783 − 2.709 + (1.087 × 89.2) + (−0.113 × 32)
   ≃ 67.

(b) The estimated effect associated with a maths-based qualification in the fitted model is −2.709. Therefore, we would expect the exam
score for a student linked to a maths-based qualification to be 2.709
lower than the exam score for a student (of the same gender, with the
same best previous module score and the same age) who is not linked
to a maths-based qualification.
(c) From Subsection 2.2, to test whether the covariate age should be
included in the model in addition to the three other explanatory
variables, we can use the p-value associated with the individual t-test
for testing whether the (partial) regression coefficient for age is zero.
(From Table 10, this p-value is 0.001, which suggests that age should
be included in the model in addition to the other three explanatory
variables.)
(d) From Subsection 4.3, to test whether the factor qualLink should be
included in the model in addition to the three other explanatory
variables, we can carry out an ANOVA test to compare the RSS
values of the two nested models
examScore ∼ gender + bestPrevModScore + age
and
examScore ∼ gender + qualLink + bestPrevModScore + age.

(e) A stepwise regression procedure could be used to help with choosing which of the four explanatory variables should be included in the final model.

Solution to Activity 37
(a) The p-values for each of the covariates bestPrevModScore and age
are both very small (p < 0.001 for bestPrevModScore and p = 0.001
for age), as is the p-value for their interaction bestPrevModScore:age
(p = 0.003). This suggests that both covariates bestPrevModScore
and age should be included in the model, together with their
interaction.
(b) The value of xi1 is 89.2 and the value of xi2 is 32, and so the fitted
value for this student is
ŷ1 = 13.193 + (0.644 × 89.2) + (−1.151 × 32) + (0.012 × 89.2 × 32)
   = 68.


(Note that when using parameter estimates given to full computer accuracy, it turns out that this fitted value is 67 to the nearest whole mark.)

Solution to Activity 38
(a) There are six possible two-way interactions – gender:computer,
gender:edLev, gender:occ, computer:edLev, computer:occ and
edLev:occ.
(b) There are four possible three-way interactions –
gender:computer:edLev, gender:computer:occ, gender:edLev:occ
and computer:edLev:occ.
(c) There are six two-way interactions, four three-way interactions and a
single four-way interaction (between all four of the factors). So, in
total there are 6 + 4 + 1 = 11 possible interactions for this model.

Solution to Activity 39
(a) If the interaction edLev:occ is in the model, then the individual
effects for each of the factors edLev and occ also need to be in the
model. Since the interaction is a two-way interaction, there aren’t any
lower-order interactions to consider.
(b) If the interaction gender:computer:occ is in the model, then the
individual effects of each of the factors gender, computer and occ
need to be in the model, together with the two-way interactions
gender:computer, gender:occ, and computer:occ.
(c) If the interaction between all four of the factors is in the model, then
the individual effects of each of the four factors needs to be in the
model, together with all six of the two-way interactions between each
pair of the four factors, and all four of the three-way interactions
between each subset of three of the four factors.

Unit 5
Linear modelling in practice
Introduction
So far in this module, we have seen how multiple regression – also known
as general linear modelling – can be used to model the relationships
between a response variable and any number of explanatory variables,
which may be either covariates or factors. General linear modelling
techniques represent a powerful set of
tools in the statistician’s toolbox, enabling a wide variety of data to be
analysed. In Units 2, 3 and 4, the focus has been on the fitting and
checking of such models. In this unit, although there will still be some
fitting and checking, the focus will be on other aspects of the statistical
modelling process, both before a model is formulated and after a good
model has been identified.

How Unit 5 relates to the module so far


Moving on from: the modelling process (Unit 1), and regression with any
number of numerical and categorical explanatory variables (Unit 4).
What’s next: the modelling process from problem definition to dealing
with results.

The statistical modelling process was discussed back at the start of Unit 1
and represented by a diagram (Figure 1 in Unit 1). For convenience, this
diagram is repeated shortly (Figure 1 in this unit) with an indication of
the steps that we will be focusing on in this unit: ‘Pose questions’, ‘Design
study’, ‘Collect data’, ‘Report results’ and, in part, ‘Formulate model’.
For the majority of the unit, we’ll use the statistical modelling process in
the context of predicting Olympic success. We’ll start in Section 1 by
translating a general description of the question of interest, expressed
using non-technical language, into a well-defined statistical modelling task.
Statistical modelling obviously requires some data to work with. So, in
Section 2, we’ll deal with sourcing suitable data and then preparing the
data ready for model fitting. This includes putting the data into a form
suitable for whatever statistics software we are using. Preparing the data
for modelling can be a non-trivial process which, if not handled
appropriately, can compromise, or even invalidate, the entire analysis.


The steps of the process are: Pose questions; Design study; Collect data;
Explore data; Make assumptions; Formulate model; Fit model; Check
model; Choose model; Report results – together with an ‘Improve model’
feedback step.

Figure 1 The statistical modelling process: the steps of the process that will
be focused on in this unit are circled

We’ll then move on to building a statistical model for these data in
Section 3. This will involve choosing which models to fit to the data, and
then selecting which model, or indeed models, seem to be the best for the
data. Some further modelling issues are then discussed in Section 4.
In Section 5, we’ll take a bit of a detour to consider the common data
quality issue of missing data, before we return to the more general
statistical modelling process in Section 6, where we focus on reporting the
results from the analysis. This reporting usually needs to be done using
non-technical language. Finally, in Section 7, we’ll consider the reliability
of our results.
The structure of Unit 5 in terms of how the unit’s sections fit together is
represented diagrammatically in the route map.


The Unit 5 route map

Section 1: Specifying our problem
Section 2: Sourcing and preparing the data
Section 3: Building a statistical model
Section 4: Further modelling issues
Section 5: Missing data
Section 6: Documenting the analysis
Section 7: Replication

Note that Subsections 2.2, 3.2, 3.3, 4.3 and 6.2 contain a number of
notebook activities, so you will need to switch between the written
unit and your computer to complete these.
Additionally, in Subsections 1.3, 2.1, 6.1, 6.3 and 6.4, you will need to
access other resources on the module website (such as videos and
articles) to complete a number of other activities.

1 Specifying the problem


As mentioned in the introduction, this unit will focus on using linear
modelling to predict Olympic success. We won’t, however, be jumping
straight into fitting linear models! In this section, we’ll be considering the
first two steps of the modelling process – ‘Pose questions’ and ‘Design
study’ – to address the problem of how to predict Olympic success.
Subsection 1.1 starts with some background information about the
Olympic Games. Subsection 1.2 relates to the ‘Pose questions’ step of the
modelling process: we translate the general aim of predicting Olympic
success into something specific we can model by identifying a suitable
response variable. We continue this step in Subsection 1.3 by identifying
possible explanatory variables to use in our modelling. (Activity 2 in this
subsection involves watching a short video on the module website.)


Subsection 1.4 deals with the ‘Design study’ step of the modelling process.
In this case, as we will be using secondary data, this corresponds to
identifying exactly what we are trying to achieve with our modelling.

1.1 The background context


The Olympic Games, often referred to simply as the Olympics or the
Games, is a multi-sport tournament and one of the biggest sports
tournaments in the world. Since the first modern Olympic Games in 1896,
the Olympics have generally been held every four years, hosted by different cities
around the world. There is a summer Olympics, in which sports such as
athletics, football, tennis, cycling, swimming and rowing feature. There is
also a winter Olympics, where sports that rely on snow and ice feature. In
this unit, we’re focusing only on the summer Olympics. Table 1 lists the
summer Olympics that have been held since 1980, as well as planned
future ones.
Table 1 Past and future summer Olympics from 1980 to 2032 (International
Olympic Committee, 2021)

Common name Year Host city Host country


Moscow 1980 1980 Moscow Soviet Union
Los Angeles 1984 1984 Los Angeles United States
Seoul 1988 1988 Seoul South Korea
Barcelona 1992 1992 Barcelona Spain
Atlanta 1996 1996 Atlanta United States
Sydney 2000 2000 Sydney Australia
Athens 2004 2004 Athens Greece
Beijing 2008 2008 Beijing China
London 2012 2012 London United Kingdom
Rio 2016 2016 Rio de Janeiro Brazil
Tokyo 2020 2021 Tokyo Japan
Paris 2024 2024 Paris France
Los Angeles 2028 2028 Los Angeles United States
Brisbane 2032 2032 Brisbane Australia

For each sport at the Olympics, a number of events are held. For example,
at recent summer Olympic Games, athletics has included events such
as 100 m, 200 m, high jump, long jump, 4 × 100 m relay and 4 × 400 m
relay, both for men and for women. Each event awards gold, silver and
bronze medals for first, second and third place, respectively.
Throughout this unit, we will use the term ‘athlete’ to include any
competitor at an Olympics, not just those taking part in athletics events.
For many athletes, winning an Olympic gold medal is seen as the pinnacle
of sporting achievement. But in addition to being extremely important to
individual athletes, success at the Olympics is often seen as a matter of
national pride. It is therefore not surprising that, before an Olympic
Games, there is speculation about which nations will do well and which will
not. This is something that statistics can help to try to shed some light on!


With the use of appropriate data, statistical models can be used to model
success at previous Olympic Games, which can then provide a basis for
predictions of what will happen in future Olympic Games. Such statistical
modelling is the focus of this unit. In particular, we will be developing a
statistical model for predicting which countries are likely to do well and
which countries are likely to not do so well, at the Paris 2024 Summer
Olympics. (At the time of writing this unit, this Olympics is yet to be
held.)

1.2 Identifying a response variable


The first issue that we need to address is how we are going to measure a
nation’s Olympic success. We will consider this question in Activity 1 next.

Activity 1 Measuring a nation’s success at an Olympics

What approaches might be used for measuring a nation’s Olympic success?

As you will have noticed while working on Activity 1, there is more than
one way to measure a nation’s success at the Olympics. Even if you just
regard ‘winning a medal’ as the key criterion of success (and hence all the
athletes from fourth place downwards are regarded as having not
succeeded), there is still the issue of how much more weight, if any, to
place on a gold medal as opposed to a silver medal, and a silver medal as
opposed to a bronze medal.
For the purposes of this unit, we will select a single (simple) measure of a
nation’s Olympic success, given in Box 1.

A gold medal from the Tokyo 2020 Summer Olympics

Box 1 A measure of a nation’s Olympic success


In this unit, the performance of a nation at an Olympic Games will be
measured by the total number of medals won by the athletes
competing in that nation’s team.

Notice that in the measure of Olympic success given in Box 1, it does not
matter which medal (gold, silver or bronze) an athlete wins, just that they
win a medal. Then, using this measure, any model we consider will take
the total number of medals won (or a transformation of the total number
of medals won) as the response variable.


1.3 Which explanatory variables might be useful?
Now that we have identified our response variable, what about the
explanatory variables? For the possible explanatory variables, we need to
consider which variables are likely to affect the response (that is, the
number of medals won). So, in order to identify possible explanatory
variables, in Activity 2 you will consider what aspects of a nation are
associated with winning medals.

Activity 2 What makes a nation successful at an Olympics?

Watch ‘Video for Activity 2’ provided on the module website and then
answer the questions below. The short video describes some of the
explanatory variables a team of modellers used when trying to predict the
number of medals a nation would win at Rio 2016. (Note that this video
was made in 2016, before those Olympic Games had taken place.)
(a) What explanatory variables are suggested in the video?
(b) Which of the suggested explanatory variables are likely to affect a
nation’s Olympic success directly? And which are likely to affect a
nation’s Olympic success more indirectly?
(c) Suggest at least one other potential explanatory variable.

So, as you have seen in Activity 2, knowledge of the context in which the
modelling sits plays a part in deciding on potential explanatory variables.
Another important factor in the selection of potential explanatory
variables is pragmatism. For any variable we wish to use, we need to be
able to source data for it. For example, a good explanatory variable to
predict a nation’s success at the Olympics might be one that directly
measures the importance a nation gives to sporting success. However, data
for this variable are difficult, if not impossible, to obtain! In contrast, data
on a nation’s success at the previous Olympics (as measured by the
number of medals won, for example) are easy to obtain and could offer an
alternative explanatory variable as a useful approximation to the one for
which we cannot obtain data.
For predictive models, a particular concern is whether, for any prediction
we wish to make, we will be able to obtain the corresponding values of the
explanatory variables. As you will see in Activity 3, this can depend on
when we want to make the predictions, as well as which predictions we
wish to make.


Activity 3 Explanatory variables for prediction


As mentioned in Subsection 1.1, we wish to build a model for predicting
sporting success at the Paris 2024 Summer Olympics. Two of the potential
explanatory variables suggested in Activity 2 were the size of the nation’s
team and its previous success at the Olympics. How useful these two
explanatory variables might be for our problem depends on when we would
like to make the predictions.
(a) Are values for each of these two explanatory variables likely to be
known one week before Paris 2024 starts? Why or why not?
(b) Are values for each of these two explanatory variables likely to be
known one year before Paris 2024 starts? Why or why not?
(c) Suppose that we wish to predict the number of medals each nation
wins at Paris 2024 on 9 August 2021 (the day after the closing
ceremony for Tokyo 2020). Given your answers to parts (a) and (b),
are either of these variables useful potential explanatory variables?

As Activity 3 demonstrated, when it comes to selecting potential
explanatory variables to help us predict performance at a summer
Olympics, it matters when we are making that prediction – or more
accurately, how far off the next Olympics is. In this unit, we will focus on
predicting national success at a summer Olympics just after the previous
summer Olympics has ended.

1.4 Summary of the modelling task


The specific prediction task which we’ll be focusing on in this unit is
specified in Box 2.

Box 2 Prediction task


In this unit, we wish to build a model which can be used to predict
the total number of medals won by individual nations at Paris 2024,
based on what is known straight after the end of Tokyo 2020 on
9 August 2021.

We will tackle this prediction task by building a model for the number of
medals won by nations at summer Olympics, considering all of the summer
Olympics held between 1996 and 2016, basing the predictions for each of
these summer Olympics on what is known at the end of the associated
previous summer Olympics. (Data for the summer Olympics in 2020 will
be used later on in the unit to assess how well our model performs.)
In Subsection 1.3, a number of potential explanatory variables to help us
with the task have already been suggested. For the rest of the unit, we will
just focus on the ones listed in Box 3, which follows.


Box 3 Potential variables for predicting Olympic success


In this unit, for each nation and for each summer Olympics between
1996 and 2016, the following variables will be considered for use in
predicting Olympic success.
Response variable:
• the number of medals won at a national level at a summer
Olympics.
Explanatory variables:
• the number of medals won at a national level at the previous
summer Olympics
• the nation’s population size
• the wealth of the nation
• whether or not the nation is the host nation, taking possible values
yes and no
• whether or not the nation is going to host the following summer
Olympics, taking possible values yes and no.

Having decided on the response variable and potential explanatory
variables, we are now ready to source and prepare the data, which is
covered in the next section.

2 Sourcing and preparing the data


In Subsection 1.4 the modelling task was summarised, accomplishing the
‘Pose questions’ and ‘Design study’ part of the statistical modelling
process. We move on to the ‘Collect data’ step of the statistical modelling
process in this section. In Subsection 2.1 we’ll source some data to tackle the
modelling task, and then in Subsection 2.2 we’ll prepare these data ready
for analysing.

2.1 Sourcing data for analysis


So far in the module, you have not had to worry about sourcing data, or
getting the data into a form ready for data analysis. Instead, data in a
suitable format have been supplied in order to allow you to concentrate on
the statistical modelling. However, sourcing and preparing data for
analysis are non-trivial tasks. It is not unusual for this to take far longer
than the statistical modelling! It is also something that is important to get
right. Sophisticated statistical modelling cannot overcome problems that
occurred when the data were gathered and prepared. This is summarised
by the commonly quoted phrase ‘garbage in, garbage out’.



2.1.1 Primary and secondary data revisited


In Subsection 2.2.1 of Unit 1 we introduced primary and secondary data.
Recall that primary data are data that have been specifically gathered
with particular statistical modelling goals in mind. For example, in drug
trials, data are specifically collected with the aim of being able to explore
efficacy and safety. This enables the researchers, including any statistician
included in the team, to have control of which variables are collected, and
to have awareness of any data quality issues.
As another example of primary data, Example 1 considers a statistician’s
involvement with data collection for tax purposes.

Example 1 Being involved with data collection


Watch ‘Video for Example 1’ provided on the module website. In this
short video, a statistician talks about the work that they do for the
Internal Revenue Service (a department of the United States federal
government that deals with tax collection).
Note that the statistician mentions that their work involves designing
samples, developing data-entry programs and training ‘editors’
(employees in the field who are responsible for collecting the data), as
well as doing some data analysis.

Also recall that secondary data are data that have been collected for other
purposes. These data might be data that are in the public domain. Or the
data may be confidential to the organisation that the statistician works in.
Alternatively, they might be data from elsewhere that the statistician has
obtained permission to use.
As you saw in Subsection 2.2.1 of Unit 1, if the data used are secondary
data, it can affect the usefulness and quality of the data for the particular
problem at hand. For example, the data collected may not adequately
address the researcher’s problem, the data may be out of date, it might be
difficult to decipher definitions, and so on.


2.1.2 Sourcing data about Olympic medals


One of the explanatory variables in Box 3 (Subsection 1.4) is the number
of medals won at the previous summer Olympics. Sourcing data for this is
what we will consider in this subsection.
With secondary data, defining exactly what is meant by the ‘correct
values’ for a variable may not be as straightforward as it first seems, as
you will see in Activity 4.

Activity 4 Medals won at Rio 2016

One source of information about the number of medals won at the summer
Olympic Games is Wikipedia, which we will consider now.
Read the copy of the Wikipedia page for Rio 2016 given on the module
website (‘2016 Summer Olympics medal table’, 2021) and then answer the
following questions.
(a) Do you trust that the medals table given on this Wikipedia page
correctly gives the number of medals that have been awarded to each
nation?
(b) Is it true that every event awarded a gold, silver and bronze medal?
(c) At Rio 2016, did every athlete who won a medal represent a nation?
(d) Has the winner of each medal remained the same over time?

As Activity 4 shows, even if the source of the data is trusted, and the data
are correct, it might still be easy to make false assumptions about the
data. (For example, it would be wrong to assume that every event awarded
a gold, silver and bronze medal, or that every athlete represented a
nation.) Making a wrong assumption can lead to errors in building and
interpreting the model. Activity 5 highlights another problem that arose at
Rio 2016 and can affect our model to predict success at future Olympics.

Activity 5 Russia at Rio 2016

In the solution to Activity 4, you saw that Kuwaiti athletes were not able
to compete for their nation at Rio 2016. Unfortunately, these were not the
only athletes that faced problems: the situation for athletes from Russia
was also not straightforward.
Read the BBC News article ‘Rio 2016 Olympics: Russians “have cleanest
team” as 271 athletes cleared to compete’ (BBC News, 2016) provided on
the module website, then consider the following question.
When using data from Rio 2016 to build a model to predict success at
future Olympics, why is it important to know about the problems faced by
the Russian team?

Russia’s team after winning silver in women’s gymnastics at Rio 2016


As you have seen in Activities 4 and 5, the question ‘how many medals did
each nation win at Rio 2016?’ does not have as straightforward an answer
as it first might seem. Rio 2016 is not unique in that respect. Similar
issues surround other summer Olympics.
Some of the other problems that have arisen since 1980 are listed below.
• Moscow 1980: more than 60 nations boycotted the Olympics.
• Los Angeles 1984: 14 nations boycotted the Olympics.
• Barcelona 1992: following the break-up of the Soviet Union, Russia and
other former Soviet republics competed as a ‘Unified Team’.
• Sydney 2000: seven medals were reallocated. This included one, initially
won by Lance Armstrong, which was reallocated over a decade later
in 2013.
• Beijing 2008 and London 2012: in 2016, a wave of retesting of athletes
for potential doping violations from both these Olympic Games led to a
number of medals being reallocated.
• Tokyo 2020: Russia was not allowed to compete as a nation. However,
athletes from that country were able to compete representing the
‘Russian Olympic Committee (ROC)’ instead.
When issues with a variable are identified, such as we have seen with the
number of medals each nation has won, a decision has to be taken about
what to do.
Sometimes it will be decided that a variable is sufficiently unreliable that it
is better not to use it. In other cases, we can still continue to use the data
after resolving any ambiguities.
For example, in terms of medal allocations, in this unit we will use the
medal allocations as they stood on 9 August 2021 (just after the end of
Tokyo 2020), as this was around the time this unit was written. This has
the advantage of making the medal allocation used unambiguous. However,
this means that for more recent Olympic Games, in particular Tokyo 2020,
there has been less time afterwards for medal reallocations to occur.

2.1.3 Sourcing data about population and wealth


In Subsection 2.1.2, we considered sourcing Olympic medal allocations
data. In this subsection, we turn to sourcing two other variables mentioned
in Box 3 (Subsection 1.4): the nation’s population size and the nation’s
wealth.
In the case of population, we can use the total population: that is, how
many people are living in a country. However, when thinking about the
link with potential Olympic success, maybe it would be better to focus on
the population under 65, or under 40, or maybe even under 25? This is
because older people are generally at a physical disadvantage compared to
younger people, and so it is less common to have older athletes.

Mary Hanna, the oldest person to have competed at Tokyo 2020 (aged 66)
Also, maybe children shouldn’t be counted because it takes time for a child
to develop into a world-class athlete? Furthermore, although there is not


an overall minimum age for competitors, some individual sports impose a


minimum age limit. For example, at the time of writing, diving
competitors have to be at least 14 in that calendar year, and gymnastics
competitors must be at least 16 in that calendar year.
Despite these differing possibilities of how population may be defined for
our prediction task, in this unit, we will simply take the total population of
the nation as our population value.
There is, however, another consideration regarding population. Because a
nation’s population size changes over time, it matters which year the
population size corresponds to. Is population size for the year in which the
Olympics took place the most appropriate year to use? It takes time, often
years, for an athlete to progress to the point that they are good enough to
compete at an Olympics. So maybe the population a year or two earlier
would be better?

Hend Zaza, the youngest person to have competed at Tokyo 2020 (aged 12)
Now, in this unit, we’re focusing on predicting national success at
Paris 2024, but we need to do this using only data that were available in
2021 (when this unit was written!). So, as given in Box 4, we will use the
(total) population size four years beforehand. This corresponds to when
the previous summer Olympics was scheduled to take place.

Box 4 A measure of a nation’s population


In this unit, the population size of a nation at an Olympic games will
be measured by the (total) population size four years beforehand.
This corresponds to when the previous summer Olympics was
scheduled to take place.

So, now that we have decided how we’re going to measure population, the
next task is to source some data! One source for population data is the
World Bank. You will consider this in the next activity.

Activity 6 How reliable are population estimates from the World Bank?
Read the World Bank’s description of population estimates (The World
Bank Group, 2020a) provided on the module website, before answering the
following questions.
(a) How reliable are the data from the World Bank? Could more reliable
data be obtained from elsewhere?
(b) Why is it correct to say that the data are estimates of the population
size, not exact counts?


Sources of data such as the World Bank tend to be very reliable. However,
since maintaining data and making them available costs money, there is no
guarantee that such data will be kept up-to-date or in formats that are
easy to use.
When it comes to data about the wealth of a nation, the World Bank is
again a useful source. It provides several different measures of national
wealth. The World Bank’s description of these measures (The World Bank
Group, 2020b) is provided on the module website.
In this unit we will use gross domestic product (GDP) per capita as a
measure of a nation’s wealth. (An explanation of GDP per person is given
in Example 4 of Unit 1.) However, even having made this decision, there is
still ambiguity, because there are different units which are used to measure
GDP. For example, it might be measured in terms of each nation’s own
currency or relative to a single standard currency such as US dollars.
It is not clear which are the most appropriate units to use when it comes
to predicting the impact of a nation’s wealth on Olympic success. So, here
we will take the arbitrary decision to focus on GDP per capita relative to
the US dollar in 2010. Similarly to population size, a country’s GDP per
capita changes over time. Therefore, the best time to measure GDP –
relative to each summer Olympics – needs to be decided. This could be
based on considerations such as when increased government-level
investment in sport might be thought to translate to better performance at
the Olympics. However, in this unit, we simply choose to use the same
year as the population estimate, which is four years beforehand.

2.1.4 Sourcing the remaining variables


In Subsections 2.1.2 and 2.1.3 we found sources for the number of medals
won at the previous summer Olympics, population size estimates and
national wealth estimates. In this subsection we will consider sourcing the
remaining variables.
Looking back at Box 3 (Subsection 1.4), these are:
• whether or not the nation is the host nation, taking possible values yes
and no
• whether or not the nation is going to host the following summer
Olympics, taking possible values yes and no.
These are actually easy variables to source. The host nation of an
Olympics gains great prominence in the run-up to the Games and during
them. Therefore, even if this is not remembered, it is something easily found
via a search on the internet. Hosting the Games also requires a huge
amount of preparation by the country concerned, and so future hosts are
decided many years in advance. Recall that Table 1 (Subsection 1.1)
includes the host cities and countries for the summer Olympics from 1980
right up to 2032.
So, we now have sources for all the data we require. The next task is to get
the data ready for analysis. We will deal with this in the next subsection.


2.2 Preparing data for analysis


Subsection 2.1 described data sources for the potential variables that we
listed in Box 3 (Subsection 1.4). However, to fit a statistical model to
these data, it is necessary to ensure that the data are in a format that can
be used by a statistical package.
For example, in this module the format you have been using is data frames
contained within an R package. However, what happens when we want to
analyse data that are not contained in one of these data frames? One way
forward, and the one we will pursue in this subsection, is to create a single
data frame that contains all the data we require.
In Subsection 2.2.1, you will see how data can be read into R from a file,
and a data frame created. In Subsection 2.2.2, you will start combining
data frames, in this case to create a data frame that contains data about
the medals won. You will continue combining data frames in
Subsection 2.2.3, this time by merging them. This is a process that
requires more care than that used in Subsection 2.2.2 to ensure that
observations from different data frames are matched correctly. Finally, in
Subsection 2.2.4, we will describe the remaining steps necessary to have a
data frame ready to use.

2.2.1 Reading data into R from a file


There are many formats in which data can be saved and transferred
electronically. Statistical packages often have a particular format that they
use natively. For example, R, Minitab and SPSS all have particular
formats that they use for datasets. Generally, these formats are the best
ones to use when it is known that whoever will be working with the data
will be using the same piece of software. That way, information about the
variables (such as more informative labels and/or the source of the data)
can be stored along with the actual data. Statistical software will often also
read data formats used by other statistical packages. For example, in R it
is possible to read in (some) formats produced by Minitab and SPSS.
When it is not known what software will be used to read the data, a
format based on plain text is often used instead. One such format, and the
one we will introduce in Activity 7, is known as comma-delimited, or
CSV. In a CSV file, each data value is ‘delimited’ – that is, separated – by
a comma: ‘CSV’ stands for ‘comma-separated values’. Such a format has
the advantage that it is far more likely that statistical and other software
will be able to read the file. This greatly reduces the risk that
insurmountable problems will occur when transferring the data between
different computer set-ups.
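As a minimal sketch (the file name here is hypothetical), a CSV file can be
read into an R data frame with read.csv():

# Read a comma-separated file into a data frame and inspect the first rows
medalsTokyo <- read.csv("tokyo2020_medals.csv")
head(medalsTokyo)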

Activity 7 Looking at a CSV file


Figure 2 shows a screenshot of the first 13 lines of the medals table from
Tokyo 2020 (derived from ‘2020 Summer Olympics medal table’, 2021)
saved in a CSV format.


Figure 2 The first 13 lines of a CSV file

Using Figure 2, answer the following questions.


(a) In CSV files, do all the data have to be numerical?
(b) What do you think the first line of this CSV file represents?
(c) What do each of the other lines of this CSV file represent?
(d) What role are the commas playing in this CSV file?

Now that you have seen in Activity 7 what data in CSV format can look
like, the time has come to read such data into a statistical package.
We will start in Notebook activity 5.1 by reading into R the CSV file
containing data from the medals table for Tokyo 2020 (part of which was
shown in Figure 2). We’ll then save the data in a new file that is in R’s
data format. Notebook activity 5.2 explains how the data from the medals
table for Tokyo 2020 can be read into R from the saved R data file created
in Notebook activity 5.1.
Although the resulting data frame and the data for each variable in
Notebook activity 5.1 are ‘ready for use’, this is not always the case. For
example, we may find that the variable names given by the original data
source are not ones that would be helpful for us to use. Alternatively,
there may be problems with using the data frame because of the way the
data were defined in the data source, or because of the way that R read
the data. In such cases, the analyst may want to adapt the data frame to
make it ‘ready for use’ for their particular purpose. In Notebook
activity 5.3, the final notebook activity in Subsection 2.2.1, we will create
a data frame which is not ‘ready for use’, using data (from a CSV file) on
national population estimates for various years from the World Bank. This
data frame will then be adapted until it suits our needs.


Notebook activity 5.1 Reading a CSV file into R and creating an R data file
In this notebook, we will create a data frame for data given in a CSV
file and then save the data frame in a file in R’s data format.

Notebook activity 5.2 Reading an R data file into R


This notebook explains how to read an R data file into R.

Notebook activity 5.3 Adapting a data frame created from a CSV file
In this notebook we will adapt a data frame created from a CSV file
so that it better suits our modelling task.
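The notebook activities walk through these steps in detail. Purely as an
illustration of the general idea (object and file names are hypothetical),
one way that base R supports saving a data frame in an R data format and
reading it back is:

# Save a data frame in R's serialised data format, then re-load it
saveRDS(medalsTokyo, file = "medalsTokyo.rds")
medalsTokyo <- readRDS("medalsTokyo.rds")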

2.2.2 Using R to combine data


Having worked through Notebook activities 5.1 to 5.3 in Subsection 2.2.1,
you now have data frames for the medals table from Tokyo 2020 and for
national population estimates for the years 1992, 1996, 2000, 2004, 2008,
2012, 2016 and 2020. A similar process can be done for the medals tables
for the summer Olympic Games between 1992 and 2016, and for the GDP
data from the World Bank.
However, producing data frames for all of these different datasets is not
the end of the process; the data need to be combined if we are going to fit
a model to them. This, unfortunately, is not a trivial process! You will
start in Activity 8 by considering how the combined data frame should be
structured.

Activity 8 A structure for combining data from medals tables
Table 2 shows the first six rows of the medals table for Tokyo 2020 given in
Figure 2 (in Activity 7).
Table 2 The first six rows of the medals table for Tokyo 2020

rank country countryCode gold silver bronze total


1 United States USA 39 41 33 113
2 China CHN 38 32 18 88
3 Japan∗ JPN 27 14 17 58
4 Great Britain GBR 22 21 22 65
5 ROC ROC 20 28 23 71
6 Australia AUS 17 7 22 46


Using the information from the Wikipedia page referred to in Activity 4,
the medals table for Rio 2016 has also been put in the same format.
(a) We wish to combine the data from the two medals tables for Rio 2016
and Tokyo 2020 into a single table. Do you think that it would be
better to add the rows of one table to the rows of the other, or add
the columns of one table to the columns of the other?
(b) Once the two medals tables have been combined into a single data
structure, should data about other countries be added before the
combined data are used for modelling? (Hint: think about what a
country needed to do to get included in a medals table.)

Activity 8 discussed how we could combine the data from the two medals
tables for Rio 2016 and Tokyo 2020 into a single table. We will put this
into practice in Notebook activity 5.4 by combining a data frame that
contains the Tokyo 2020 medals table with a data frame that contains the
Rio 2016 medals table.

Notebook activity 5.4 Combining Olympic medals tables


In this notebook, we will create a single data frame which combines
data from two data frames by adding the rows of one data frame to
the rows of the second data frame.
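A minimal sketch of this row-wise combination (data frame names are
hypothetical); note that a column identifying the Games is needed so that
rows from different Olympics can be told apart:

# Add a year column to each table, then stack the rows
medals2016$year <- 2016
medals2020$year <- 2020
combinedMedals <- rbind(medals2016, medals2020)  # columns must match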

A similar process to Notebook activity 5.4 can be used to add in the
medals tables from the Olympics held between 1992 and 2012. We have,
however, done that task for you using ‘2016 Summer Olympics medal
table’ (2021) plus other similarly named Wikipedia pages for the 1992,
1996, 2000, 2004, 2008 and 2012 Olympics.

2.2.3 Using R to merge data


At the end of Subsection 2.2.2, all the medals tables from the summer
Olympics held between 1992 and 2020 were combined into one data frame.
However, we are not finished yet! Other data still need to be added,
primarily about population sizes and GDP per capita.
In Activity 9, you will consider first how to structure a data frame that
contains the data we sourced from the World Bank – that relate to
population sizes and GDP per capita – and the medals table data.

Activity 9 Adding data to the combined medals tables

Suppose that the data on population sizes and GDP are added to the data
in the medals tables for each country. In what way should the structure of
the combined data differ from the medals tables data?


Adding columns to a data structure is a trickier task than adding more
rows, because care needs to be taken to make sure that the data in each
row for the new columns matches up correctly. We will tackle this problem
in Notebook activity 5.5 using data obtained from the World Bank, which
have been saved as two separate data frames.

Notebook activity 5.5 Merging population and GDP data


In this notebook, we will merge two data frames by adding columns
from one of the data frames to the other data frame.
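A minimal sketch of such a merge (data frame and column names are
hypothetical):

# Match rows on country code and year, combining the columns of both tables
worldBank <- merge(populationData, gdpData, by = c("countryCode", "year"))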

After working through Notebook activities 5.4 and 5.5, we have ended up
with two datasets: one containing the information about the medals tables
and one for the World Bank data. These two datasets can be merged using
the same approach taken in Notebook activity 5.5. However, it turns out
that matching rows does not work well. In Activity 10, you will explore
why there can be problems.

Activity 10 Matching countries by name


Matching the Olympic medals tables and World Bank data relies on being
able to match countries and years in the two databases. In particular,
matching relies on the computer being able to recognise when rows of data
relate to the same country.
(a) If we start by considering Japan, which of the following would be
regarded as the same in the matching process?
• ‘Japan*’ and ‘Japan’
• ‘Japan (JPN)’ and ‘Japan’
• ‘Japan ’ and ‘Japan’
(Note that in the medals tables given on Wikipedia, an asterisk by a
country’s name was used to indicate the host nation.)
(b) Now consider the following countries. Which of these would be
regarded as the same in the matching process? Should they be?
• ‘Bahamas’ and ‘Bahamas, The’
• ‘Great Britain’ and ‘United Kingdom’
• ‘Czechoslovakia’ and ‘Czech Republic’

Whether it’s ‘Bahamas’ or ‘Bahamas, The’, it’s still a great view!
So, as you have seen in Activity 10, it can be difficult to automatically
match rows in different datasets, even when you know that rows should
match. It is possible to overcome this to a certain extent by clever coding,
but other instances might require additional intervention from the
statistician (for example, to match ‘Great Britain’ with ‘United Kingdom’).
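As a small illustration of the kind of ‘clever coding’ involved (a sketch;
the names and patterns are illustrative), base R string functions can
remove some of the mismatches seen in Activity 10:

# Tidy up country names before attempting to match them
name <- c("Japan*", "Japan (JPN)", "Japan ")
name <- gsub("*", "", name, fixed = TRUE)  # remove host-nation asterisks
name <- gsub(" *\\(.*\\)", "", name)       # remove bracketed codes
name <- trimws(name)                       # remove stray whitespace
name  # all three entries are now "Japan"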
However, sometimes it will not be possible to be sure whether or not there
is a match. In Activity 11, you will consider an example of this when
matching information about people using their name.


Activity 11 Matching people by name


Think about your own name. Do you know someone else who has the same
name as you? Do you always use exactly the same name in all situations,
or does the name you use differ according to the context? For example, do
you use the same name with friends, with relatives and on official forms?

Activity 11 implicitly assumed that someone’s name would always be spelt
correctly. There are algorithms, such as Soundex, that aim to overcome
the problem that the same spelling might not always be used for someone’s
name. Some other examples of problems that can arise when trying to
match names relate to different practices between countries. For instance:
• Chinese full names usually have the surname first and the given name
last, but in the ‘West’ the given name is usually first and the surname
last. This can lead to inconsistencies in how a name is recorded.
• Hispanic tradition is to have two surnames (without a hyphen), one from
the father and one from the mother. However, sometimes only one of
these is used in contexts where single surnames are more common,
potentially leading to inconsistencies in how a name is recorded.

Aside 1 Are you Dave Gorman?


Tracking down people with the same name proved to be an inspiration
for comedian Dave Gorman. For a bet, he set out on a quest (before
social media) to find other people with the same name. The goal was
to find 54 other Dave Gormans: one for every card in a deck of playing
cards (including two jokers) – a goal that was met. A successful stage
show, ‘Are you Dave Gorman?’, was written about the quest. This
stage show was later turned into a book and television series.

2.2.4 Finishing the preparation of the data


The matching of the medals table data with the World Bank data can be
sorted out with time and effort. However, for the purpose of this unit, we
will not ask you to take this any further, nor will we cover the final steps of
preparing the data for analysis, which are as follows:
• adding information about countries who were at the Olympics but did
not win any medals
• adding information about the number of medals won at the preceding
Olympics
• adding information about whether or not a country was the host country
or would be the next host country.
These steps can be achieved using techniques you have seen across
Subsection 2.2. However, it is tedious and time-consuming to ensure that
it is done correctly.

(Remember ‘garbage in, garbage out’ means that it is important that any
sorting out of the data is done correctly.) So this has been done for you
(because we’re kind like that!).
The dataset has been split into two parts – the ‘Olympics dataset’ and the
‘Olympics 2020’ dataset – so that the data from the 2020 Olympics can be
held back to use later when models are compared. We will explain in
Subsection 4.3 why this is a good idea.
A description of the Olympics dataset is given next. In Section 3, we will
move on to analysing the data.

The Olympics dataset (olympic)


This dataset includes data for the competing nations for each summer
Olympics between 1996 and 2016.
The response variable of interest is:
• medals: the number of medals won by a nation at a summer
Olympics, as it stood on 9 August 2021 (the day after the end of
Tokyo 2020).
For each observation in the dataset, data are also available for the
following variables:
• country: the name of the nation competing
• countryCode: a three-letter abbreviation of the nation’s name
• year: the year of the summer Olympics to which the observation
relates
• lagMedals: the number of medals won by a nation at the previous
summer Olympics, as it stood on 9 August 2021 (the day after the
end of Tokyo 2020)
• host: whether a nation is the current host, taking the possible
coded values 0 (for not current host) and 1 (for current host)
• nextHost: whether a nation is going to be the next host, taking the
possible coded values 0 (for not next host) and 1 (for next host)
• population: total population size (in millions) of a nation in the
year of the previous Olympics
• gdp: a nation’s GDP per capita (in thousands), relative to the US
dollar in 2010, in the year of the previous Olympics.
The data for the first three and last three observations in the
Olympics dataset are given in Table 3.


Table 3 The first three and last three observations from olympic

country countryCode year medals lagMedals host


Albania ALB 1996 0 0 0
Algeria ALG 1996 3 2 0
Andorra AND 1996 0 0 0
Yemen YEM 2016 0 0 0
Zambia ZAM 2016 0 0 0
Zimbabwe ZIM 2016 0 0 0

nextHost population gdp


0 3.25 1.24
0 27.03 3.42
0 0.06 33.93
0 24.47 1.13
0 14.47 1.59
0 13.12 1.22
Sources: The World Bank Group (2020a, 2020b), and Wikipedia pages for
the summer Olympics between 1996 and 2016, as described at the end of
Subsection 2.2.2

3 Building a statistical model


Section 1 introduced the general modelling task of trying to predict success
at the Olympics, and Section 2 went through some of the steps necessary
to obtain data in a form ready to analyse. In this section, we will actually
be fitting some models to the data!
A first necessary step, and the one covered in Subsection 3.1, is to decide
on which model, or models, to consider. We’ll then use R to try fitting
some initial simple models in Subsection 3.2, before using R to expand our
model in Subsection 3.3.

3.1 Which models to consider?


In this subsection, we will explore the task of translating the problem of
interest into a set of possible statistical models to fit.
A useful first step is to think about the types of model that you have
experience of fitting. You will do this in Activity 12, next.


Activity 12 What is in the toolbox?

Consider the models you have already learnt about in this module, and in
previous modules.

(a) What sort of response variable are the models suitable for?
(b) What sort of explanatory variable(s) are the models suitable for?

A statistical toolbox
So, as Activity 12 will have reminded you, you have learnt how to fit
models that contain a wide variety of explanatory variables, such as when
an explanatory variable is categorical or when it is numerical. You can also
fit models with lots of explanatory variables as well as just one.
However, when it comes to the response variable, the situation is more
limited. The variable needs to be continuous, or at least not too discrete.
Furthermore, in all of the models we have met so far, there is an
assumption that the random part of the model is based on the normal
distribution. So, the starting point is to identify the response variable and
check that this assumption is not unreasonable.

Aside 2 When response variables are not normal


You may be wondering what happens when the response variable is
not close enough to being continuous, or what happens when the
assumption of normality is clearly not appropriate.
One possibility, that you met in Unit 2, is to transform the response
variable. However, it is not always possible to identify a
transformation that will result in the assumption of normality being
reasonable.
Another option (apart from giving up!) is to learn about other types
of model. In Units 6 to 8 you will learn about models that can cope
with types of response variable that are not normally distributed.
There are also other types of model that we don’t have room to fit
into this module.
Finally, there remain situations for which existing techniques do not
result in an adequate model. Developing new techniques to handle
these situations is an aspect of statistical research.

In Box 3 (Subsection 1.4), it was mentioned that in the case of modelling
success at a summer Olympics, we will take the response variable to be the
total number of medals won by a nation at a summer Olympics. So, the
first question that we need to investigate is whether it is reasonable to
assume that the random variation about the model with this response
variable is normally distributed. This is something you will consider in
Activity 13.


Activity 13 Is assuming a normal distribution reasonable?

Figure 3 shows a histogram of the total number of medals won by nations
in the years 1996 to 2020.

Figure 3 A histogram of the number of medals won by countries for the
years 1996 to 2020
(a) Does the distribution look symmetric? Is this shape compatible with
the random variation about the model being normally distributed?
(b) Are the data discrete or continuous? If you think the data are
discrete, do you think it is reasonable to treat the data as close
enough to being continuous?
(c) What is the lowest possible value for the number of medals? How
often is a data value equal to this lower bound? Why might this cast
doubt on the assumption of normality?
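For reference, a histogram like Figure 3 can be drawn with base R’s hist()
function; here medalCounts stands for a vector of national medal totals (a
hypothetical name):

# Histogram of the number of medals won, as in Figure 3
hist(medalCounts, xlab = "Number of medals", ylab = "Frequency", main = "")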

As Activity 13 shows, the assumption of the normal distribution for the
random variation of the response about the model may be questionable for
these data. Nevertheless, since the models currently in your toolbox all
require this normality assumption, we will pursue fitting such models in
this unit. Although we might not expect such a model to be a particularly
good fit to the data, the model could still turn out to be good enough to
provide useful predictions. After all, as Box 5 points out next, it is
unrealistic to expect a model to be perfect (unless the model is as
complicated as the data themselves).


Box 5 No model is perfect


With any model it will always be possible to find reasons why it does
not fit perfectly. For example, for the simple linear model, the random
terms will not be exactly normally distributed and the relationship
between the response variable and the explanatory variable will not
be exactly linear. However, that won’t necessarily stop the model
producing results that are good enough to be useful.
This idea is neatly summed up by the quote: ‘All models are wrong,
but some are useful,’ commonly ascribed to the statistician George
Box (1919–2013).

3.2 Using R for an initial data analysis


In Subsection 3.1, it was decided that it is worth trying linear models to
model the number of medals won by a nation at a summer Olympic Games.
In this subsection, we are actually going to start fitting such models!
In Notebook activities 5.6 to 5.8 shortly (and Notebook activity 5.9 in
Subsection 3.3.2), we will fit models for the number of medals won at the
1996, 2000, 2004, 2008, 2012 and 2016 summer Olympics. The data from
the 1992 Olympics will be used simply to provide some of the values for
the number of medals won at a previous Olympics.
We will also only include data about a nation at a particular Olympics for
which the following is known:
• the nation’s population
• the nation’s GDP per capita
• the number of medals that nation won at the previous summer Olympics.
We will return to the implications of this decision in Section 5.
The data and variables that we will be using to build our models are
contained in the Olympics dataset described in Subsection 2.2.4.
When building a model, we want to use explanatory variables which are
going to help us predict the response medals. So, a good way to start is
by trying some individual simple linear regression models to get a feel for
which explanatory variables are likely to be useful in our model.
The first explanatory variable that we’ll consider in a simple linear
regression model is lagMedals (the number of medals won at the previous
summer Olympics). So in Notebook activity 5.6, we will fit a model with
just the covariate lagMedals as an explanatory variable.
There are also several other potential explanatory variables listed in the
description of the Olympics dataset. In Notebook activity 5.7, we will use
two more covariates from the list of possible explanatory variables –
namely, population and gdp – in two further simple linear regression
models for medals.

420
3 Building a statistical model

You saw in Unit 2 that a regression model involving more than one
explanatory variable can be better than a simple linear regression model.
Furthermore, while the significance or otherwise of an explanatory variable
in a simple linear regression model can indicate whether that explanatory
variable should be in a multiple regression model, it doesn’t guarantee it.
Explanatory variables that are not significant in a simple linear regression
model may be so in a multiple regression model, while explanatory
variables that are significant in a simple linear regression model may not
be when fitted with other explanatory variables.
We end this subsection with Notebook activity 5.8, where we will try using
a multiple regression model for medals using all three of the covariates
considered in separate simple linear regression models in Notebook
activities 5.6 and 5.7 – namely, lagMedals, population and gdp.

Notebook activity 5.6 A simple linear regression model for medals
In this notebook we will fit a model for medals using a single
covariate.

Notebook activity 5.7 Trying other explanatory variables in a simple linear regression model
This notebook considers two further simple linear regression models
for medals, each using a different potential covariate.

Notebook activity 5.8 Fitting a multiple regression model


In this notebook we will fit a multiple regression model for medals
using three covariates.
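The following is a minimal sketch of the kinds of model fitted in these
notebook activities, assuming the Olympics dataset has been loaded as the
data frame olympic:

# A simple linear regression and a multiple regression for medals
fitLag <- lm(medals ~ lagMedals, data = olympic)
fitAll <- lm(medals ~ lagMedals + population + gdp, data = olympic)
summary(fitAll)  # individual t-tests for the (partial) regression coefficients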

3.3 Expanding the model


In Subsection 3.2, we fitted a multiple regression model for medals using
the three covariates lagMedals, population and gdp as explanatory
variables. There are three more variables in the Olympics dataset that
could be included as explanatory variables: year (which summer Olympics
the observation relates to), host (whether a nation is the current host) and
nextHost (whether a nation will be the next host).
For each of these variables we need to decide whether we are going to treat
it as a factor or covariate. This is what we will do in Subsection 3.3.1.
Then, in Subsection 3.3.2, we move on to considering models that include
some or all of these variables (in addition to lagMedals, population
and gdp).


3.3.1 Deciding whether to treat an explanatory variable as a covariate or a factor
In this subsection we are going to consider whether to treat the
explanatory variables year, host and nextHost as covariates or factors in
our modelling.
First, in Activity 14, we will consider the variable host.

Activity 14 Variable host – factor or covariate?

Explain why host could be treated either as a factor or as a covariate.

The variable nextHost – whether or not the country will be the next host
of the Olympic games – can be treated in the same way as host. You will
consider the remaining variable, year, in Activity 15.

Activity 15 Variable year – factor or covariate?

In olympic, year can only take one of six values: 1996, 2000, 2004, 2008,
2012 or 2016. So, in a linear model do you think it is better to treat year
as a covariate or as a factor? Or does it not matter? Justify your opinion.

Activity 15 shows that deciding whether an explanatory variable should be
treated as a factor or as a covariate in a model may not be a clear-cut
decision.
• If we regard an explanatory variable as a factor, there is no assumption
about the ‘shape’ of the relationship between this variable and the
response, and so no concern about whether such an assumption is
appropriate. However, it makes prediction difficult for an observation
whose level isn’t in our data.
• If we regard an explanatory variable as a covariate, making predictions
about new observations is possible whatever value the explanatory
variable happens to take. However, we have to make assumptions about
the shape of the relationship between the explanatory variable and the
response – which may be inappropriate.
So, in such situations, it is important to be aware that a choice is being
made and that the pros and cons of this decision need to be considered.
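In R, this choice is made explicit when the model is specified: a numeric
variable is treated as a covariate unless it is wrapped in factor(). A
sketch, again assuming the data frame olympic:

# year as a covariate (a single linear trend across Games)
fitCov <- lm(medals ~ year, data = olympic)
# year as a factor (a separate effect for each Games)
fitFac <- lm(medals ~ factor(year), data = olympic)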
Sometimes the decisions about how to treat a variable don’t stop with just
deciding whether to regard it as a factor or a covariate, as Example 2
shows.


Example 2 Postponement of Tokyo 2020


If year is treated as a covariate, the postponement of Tokyo 2020
adds a further complication. This postponement means that the
spacing between successive summer Olympics is no longer four years
(which it had been since 1948). In particular, it would make a
difference whether we are interested in trends over successive Olympic
Games or trends over calendar time.

In all of this discussion about how to treat year in any prospective model,
you may be wondering why the year in which an Olympic Games is held
might make a difference. One reason is that the events at the Olympic
Games have not remained the same, so the total number of medals that
can be won has also varied.

Aside 3 Events at summer Olympics


The events that are included at the summer Olympics have varied
over time. Early Olympic Games included sports such as tug-of-war
and lacrosse, whereas more recent summer Olympics feature triathlon
and boxing. Also, within individual sports, the events have not always
remained the same. For example, in athletics, it was once thought
that women should not compete in running races longer than 3000 m.
Nowadays, women compete in the marathon, just like the men. These changes have meant that the number of medals available at a summer Olympic Games has not remained the same over time.

[Image: Tug-of-war final at the 1912 summer Olympics in Stockholm]

3.3.2 Using R to explore adding more variables to the model
For the purposes of modelling, we will treat the three variables host,
nextHost and year as covariates. Next, in Notebook activity 5.9, we will
look for a parsimonious model that includes some, or all, of these
covariates, in addition to the covariates considered so far.
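
One common way of carrying out such a search in R is stepwise selection based on the AIC via the step() function; the sketch below (again assuming the olympic data frame, with all six explanatory variables treated as covariates) is one possible starting point, not necessarily the notebook’s exact approach.

# Sketch: stepwise search for a parsimonious model using AIC.
# Assumes the 'olympic' data frame with the variables named below.

full <- lm(medals ~ lagMedals + population + gdp + year + host + nextHost,
           data = olympic)

# step() adds or drops terms for as long as the AIC keeps improving
best <- step(full, direction = "both")
summary(best)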

Notebook activity 5.9 Fitting a multiple regression model with more explanatory variables
This notebook uses R to find a parsimonious model for medals based
on all of the potential explanatory variables available in olympic.


4 Further modelling issues


In Section 3, we fitted several potential models for the response medals
from the Olympics dataset. There are, however, further issues which need
to be considered as part of the statistical modelling process. We will
discuss some of these in this section.
In our modelling so far, we have implicitly assumed that all potential
explanatory variables are of equal importance to the statistician. However,
this is not always the case. We will consider how we can deal with such
situations in Subsection 4.1. We’ll then go on to discuss the issue of
outliers in Subsection 4.2, before exploring how we can assess whether the
predictions from our model are any good in Subsection 4.3. The section
rounds off with a very brief introduction to the idea of model confirmation
(which we’ll revisit in Section 7).

4.1 When explanatory variables vary in importance
So far, we have used statistical considerations to decide which explanatory
variables should be in the model. For example, in stepwise regression, the
change in AIC (Akaike information criterion) has been used to decide
which explanatory variables should be included in the model, so that only
variables that are statistically significant are included. This is justified via
the principle of parsimony: the simplest model that fits the data is best.
However, just relying on statistical considerations implicitly assumes that
all of the explanatory variables are of equal importance. This can,
however, be an inappropriate assumption to make. The context of the data
and the modelling problem can mean that some variables are not
important to the statistician, while other variables relate to the key
purpose of the analysis. This is illustrated in Example 3.

Example 3 Is hosting the Olympics worth it?


Hosting the Olympic Games is expensive. The London 2012 Olympics
was estimated to have had an operating budget of approximately
$10bn and the Rio 2016 Olympics an operating budget of
approximately $13bn (‘Cost of the Olympic Games’, 2022).
One of the things that host nations hope to gain in return is a general
‘feel good’ factor for its population, something which is likely to be
affected by how well its athletes do. So, a researcher investigating the
economic impact of the Olympic Games might be most interested in
how being the host nation affects the number of medals won.


In other words, the researcher might be most interested in learning about the effect that the explanatory variable host has on the response medals.

The researcher in Example 3 is particularly interested in learning about
the effect on the response (medals) of just one of the six potential
explanatory variables (host). To do this, they could in theory fit a simple
linear regression model with host as the explanatory variable. However,
because there are other potential explanatory variables which can also
affect medals, a simple linear regression model won’t necessarily give us an
accurate picture of the effect on medals of host.
Example 4 illustrates another application (in a medical context) where the
researchers are particularly interested in the effect on the response of just
one of a number of explanatory variables.

Example 4 Vaccination and the risk of long COVID


SARS-CoV-2, the virus that causes COVID-19, was first identified as
a new virus in December 2019. The virus rapidly spread across the
world and the World Health Organization (WHO) declared a global
pandemic on 11 March 2020. For some people, COVID-19 can cause
symptoms that can last for weeks or months after the infection has
gone. This is sometimes referred to as ‘long COVID’.

[Image: Administering the COVID-19 vaccine]
The first COVID vaccine was administered outside of clinical trials on
8 December 2020. Suppose that a group of researchers wishes to
assess the impact of vaccination against COVID-19 on the risk of
developing long COVID after catching COVID. So, in statistical
modelling terms, the researchers are interested in the effect on the
response (risk of developing long COVID) of a person’s vaccination
status (an explanatory variable).
At the time of writing this example, it is known (or at least heavily
suspected) that age, gender and socio-economic group all affect a
person’s risk of developing long COVID. So, even though the
researchers are only actually interested in assessing the effect on the
risk of long COVID of vaccination status, there are in fact four
explanatory variables for the researchers to consider – age, gender,
socio-economic group and vaccination status.


When there is an explanatory variable whose effect is the key purpose of the analysis, there are two models which can be particularly useful:
analysis, there are two models which can be particularly useful:
• the best model not involving this key variable
• the model that corresponds to adding in the key variable to this best
model.
This then allows the effect of the key variable to be estimated using a good
baseline. This is illustrated in the next example.

Example 5 Adding vaccination status into the model


Consider once again the researchers from Example 4 who wish to
assess the effect of vaccination against COVID-19 on the risk of
developing long COVID.
Suppose that the best model without including vaccination status as
an explanatory variable has age, gender and socio-economic group as
the explanatory variables. Then a good way to assess the impact of
vaccination is by adding vaccination status as an additional
explanatory variable to this ‘best’ model.
So, using the usual interpretation of (partial) regression coefficients in
multiple regression, the regression coefficient for vaccination status
then represents the effect of vaccination on the risk of long COVID
after adjusting for age, gender and socio-economic group.

There can also be occasions when a variable, or variables, are always in the
model, regardless of their statistical significance. Often these are variables
that are already known to be associated with the response variable.
Including such variables can simplify the modelling process by reducing the
number of models that are considered.
The different status of variables can impact on the way a model is described. When a key explanatory variable, say X1, is added to a model containing a set of other variables, say X2, X3, . . . , Xq, reference might be made to the effect of X1 adjusted for X2, X3, . . . , Xq.
‘But that’s the same interpretation as we already use for multiple
regression!’ I hear you cry. Well yes it is, but the main difference in this
subsection from what we’ve done before is the fact that we are only
interested in the effect of one (or more) key explanatory variable(s) – and
the other explanatory variables are simply there to ensure that we can
assess the effects of the key variable(s) from a good model baseline. If the
model representing the relationship between the response variable and
X2 , X3 , . . . , Xq is a good one, then we should be able to assess the effect of
the key variable well. This is described in Box 6.


Box 6 ‘Adjusted for’


If an interpretation of a model contains a statement of the form
‘the effect of (the key variable of interest) X1 adjusted for X2, X3, . . . , Xq’
then it means
‘the effect of X1 when X2, X3, . . . , Xq (or any transformations of X2, X3, . . . , Xq) are also in the model’.

So, what might this mean for our researcher from Example 3 who is most
interested in the effect of being the host nation on the number of Olympic
medals won? Well, we have already seen in Notebook activity 5.6 that
there is a (very) strong relationship between the number of medals won by
a nation at an Olympics (the response medals) and the number of medals
won by the nation at the previous Olympics (the explanatory variable
lagMedals). It therefore looks likely that any good model for medals
should have lagMedals as an explanatory variable. So, in order to assess
the effect of host on medals, the researcher should also include (at least)
lagMedals in the model.
We will interpret the effect of host on the response medals using ‘adjusted
for’ next in Activity 16.

Activity 16 Interpreting models using ‘adjusted for’

(a) Table 4 shows the resulting regression coefficients from fitting the
following model to the Olympics dataset
medals ∼ lagMedals + host.

Table 4 Parameter estimates for medals ∼ lagMedals + host

Parameter       Estimate   Standard error   t-value   p-value
Baseline mean      0.258        0.104          2.489     0.013
lagMedals          0.961        0.008        127.423   < 0.001
host              12.639        1.345          9.400   < 0.001

Interpret the regression coefficient for host using ‘adjusted for’.


(b) Suppose that the effect of hosting the Olympics is said to be ‘an
increase of 12.4 medals adjusted for the number of medals won
previously, the population size and the GDP per capita’. What can
you say about this fitted model?
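
In R, this two-model strategy might look like the following minimal sketch (assuming the olympic data frame, and taking medals ∼ lagMedals as the best model without the key variable host):

# Sketch: assessing a key explanatory variable from a good baseline.
# Assumes the 'olympic' data frame from the notebook activities.

# Best model found without the key variable
fit.base <- lm(medals ~ lagMedals, data = olympic)

# The same model with the key variable added
fit.key <- lm(medals ~ lagMedals + host, data = olympic)

# The coefficient for host is then the effect of host adjusted
# for lagMedals
summary(fit.key)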


4.2 Is it an outlier?
In this module you have already come across the terms ‘outlier’ and ‘potential outlier’ applied to observations in a dataset. However, before considering whether any particular observation is an outlier (or potential outlier), it is worth revisiting, in the next activity, what this label implies about an observation.

Activity 17 What is an outlier?

Take a little bit of time to think about what you understand by the term
‘outlier’. What makes an observation an outlier?

As you saw in Activity 17, an observation can be an outlier based on its value for a single variable, or based on its values for a number of variables.
The key thing is that outliers don’t fit the pattern of the other
observations. This is implicitly linked to whatever model the data are
assumed to belong to. So, whether or not an observation is an outlier
depends on the model that is being considered (or implicitly being
considered). This will be demonstrated in Activity 18 using data from the
Olympics dataset, and in Activity 19 using data from a completely
different application concerning industrial absorption columns.

Activity 18 Number of medals won at Tokyo 2020 – are there any outliers?
(a) Figure 4 shows a boxplot of the number of medals won by countries at
Tokyo 2020. The countries chosen were all those that won at least one
medal at Tokyo 2020 and won at least one medal at Rio 2016. From
this boxplot, do any of the data points appear to be outliers? If so,
which ones and in what way are they outlying?

Figure 4 Boxplot of the medals data (horizontal axis: medals won at Tokyo 2020, from 0 to 120)


(b) Figure 5 shows a boxplot of the same data after a log transformation
has been applied. Do any of the data points appear to be outliers in
this boxplot? If so, which ones and in what way are they outlying?

Figure 5 Boxplot of the data after a log transformation (horizontal axis: log(medals won at Tokyo 2020), from 0 to 5)

Activity 19 Absorption of ammonia – are there any outliers?
In industry, absorption columns can be used to remove impurities from a
gas. In this activity, you will be considering data about the performance of
one such column with respect to the removal of ammonia. (This is a
standard dataset in R, from Brownlee (1965).)
The response variable we will consider is the percentage of ammonia that
is absorbed by the column. So, higher values correspond to better
performance because a higher percentage of ammonia has been removed.
One of the explanatory variables, and the one we will consider in this
activity, is air flow. This is a measure of how fast the absorption column is
operating. There are 21 observations in this dataset.
(a) Figure 6 shows a boxplot of the percentage of ammonia absorbed. Do
any of the data points appear to be outliers in this boxplot? If so,
which ones and in what way are they outlying?

Figure 6 Boxplot of the percentage of ammonia absorbed (horizontal axis: ammonia absorbed (%), from 95.5 to 99.5)

(b) Figure 7(a), which follows, shows a scatterplot of the percentage of ammonia absorbed and the air flow, together with the corresponding
simple linear regression line fitted to these data. Figure 7(b) shows a
residual plot obtained for this fitted model. From these plots, do any
of the data points appear to be outliers? If so, which ones and in
what way are they outlying?


Figure 7 (a) Scatterplot of the percentage of ammonia absorbed against air flow, together with the fitted line, and (b) residual plot from a regression of percentage of ammonia absorbed with air flow as an explanatory variable

(c) Figure 8 is a residual plot from a regression including air flow and two
other explanatory variables. Now do any of the data points appear to
be outliers? If so, which ones and in what way are they outlying?

Figure 8 Residual plot from a regression of percentage of ammonia absorbed including three explanatory variables


So, as we have seen in Activities 18 and 19, whether or not an observation appears to be an outlier depends on whether, and in what way, the data have been transformed, and also depends on what model is fitted to the data. This is summarised in Box 7.

Box 7 Outliers
An outlier is an observation that does not follow the same pattern as
the majority of the data. As such, whether or not an observation is
regarded as an outlier depends on the values the observation has for all
of the variables and the model, or models, that are being considered.

In your study of statistics before, and in the module so far, you have
already met some methods for spotting outliers. For example:
• looking at boxplots of the data
• looking at scatterplots
• looking at a plot of residuals against fitted values from a regression
model.
However, with all of these methods, it is worth noting that they are not
foolproof. Just because a particular plot does not show an outlier does not
mean that one is not there. For example, in regression, if an outlying
observation is sufficiently influential, the fitted regression line will go close
to that observation even if it does not reflect the pattern in the rest of the
data!
So, given all this uncertainty, what should be done about it? Box 8 gives
some suggestions.

Box 8 Dealing with outliers


When analysing the data, the following points are useful.
• Be alert to the possibility that there might be outliers; do not just
assume that it will all be fine.
• Consider using robust techniques that are less sensitive to outliers.
For example, using the median instead of the mean.
In terms of regression, however, using robust techniques is more
difficult. The linear models you have learnt about in this module
are fitted using ‘least squares’. This means that the values for all
observations count in the calculation. So, like the mean, the
estimates are not robust.
(Robust versions of regression have been developed, but they are
outside the scope of this module.)
• Fit the model with and without any potential outliers that you have
identified in the dataset, and compare the results.
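
As an illustration of the last suggestion in Box 8, a minimal R sketch might look like the following (the olympic data frame is assumed, and the cut-off of 3 for the standardised residuals is an arbitrary illustrative choice, not a rule):

# Sketch: refitting a model with and without suspected outliers.
# The cut-off used to flag observations is illustrative only.

fit.all <- lm(medals ~ gdp + population, data = olympic)

# Flag observations with unusually large standardised residuals
suspect <- abs(rstandard(fit.all)) > 3

# Refit without the flagged observations and compare the two fits
fit.reduced <- lm(medals ~ gdp + population, data = olympic[!suspect, ])

summary(fit.all)
summary(fit.reduced)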


As Box 8 suggests, one strategy for dealing with outliers is to compare the
results of fitting the model both with and without the outlier(s) included.
Activities 20 to 22 focus on three situations where this is done. For the
purposes of making teaching points, we will make the arbitrary decision to
use a significance level of 0.05 throughout these three activities and we
will be focusing on three separate small subsets of data from the Olympics
dataset.

Activity 20 Modelling the number of medals won using a subset of the data
A statistician working with a small subset of the data from the Olympics
dataset decided to fit a model of the form
medals ∼ gdp + population.
(a) After fitting the model, the p-value associated with the coefficient for
gdp was 0.075, whereas the p-value associated with the coefficient for
population was 0.046. The p-value for the F -statistic for the model
was 0.061.
Using a significance level of 0.05, do either of the two covariates gdp
and population appear to be significantly related to the number of
medals won?
(b) Figure 9 (which follows) gives the corresponding plot of residuals
against fitted values. Based on this plot, how many outliers do you
think there are? Which points are they on this plot?
(c) The statistician decided that there was one outlier in the data. The
regression model was refitted after having dropped this outlier from
the dataset.
The revised p-value associated with the coefficient for gdp was 0.094,
the revised p-value associated with the coefficient for population was
0.036, and the revised p-value for the F -statistic was 0.062.
Have the general conclusions changed? If so, in what way? Hence,
does it matter whether the outlier is included or not?


Figure 9 Residual plot for medals ∼ gdp + population using a subset of the data

Activity 21 Modelling the number of medals won using a different subset of data
The subset of data analysed in Activity 20 was not the only small subset of
the data picked by the statistician. This activity will consider a second
subset of the data from the Olympics dataset.
(a) The statistician again decided to fit a model of the form
medals ∼ gdp + population.

After fitting the model to the data in the second subset, this time the
p-value associated with the coefficient for gdp was 0.054, and that
associated with the coefficient for population was 0.001. The p-value
for the F -statistic for the model using these new data was 0.003.
Once again using a significance level of 0.05, do either of the two
covariates gdp and population appear to be significantly related to
the number of medals won for this second subset of data?


(b) The corresponding residuals against fitted values plot is given in Figure 10. Based on this plot, how many outliers do you think there are? Which points are they on this plot?

Figure 10 Residual plot for medals ∼ gdp + population using a different subset of data to Activity 20

(c) The statistician decided that there was one outlier in the data. The
regression was refitted after having dropped this outlier from the
dataset.
The revised p-value associated with the coefficient for gdp was 0.037,
the revised p-value associated with the coefficient for population was
less than 0.001, and the revised p-value for the F -statistic was less
than 0.001.
Have the general conclusions changed? If so, in what way? Hence,
does it matter whether the outlier is included or not?

Activity 22 Modelling the number of medals won using a third subset of data
This activity will consider a third subset of data picked by the statistician
from the Olympics dataset. Once again they decided to fit a model of the
form
medals ∼ gdp + population.


(a) After fitting the model to the data in the third subset of data, the
p-value associated with the coefficient for gdp was 0.947, and that
associated with the coefficient for population was 0.001. The p-value
for the F -statistic for the model using this third subset of data was
0.005.
Once again using a significance level of 0.05, do either of the two
covariates gdp and population appear to be significantly related to
the number of medals won for this third subset of data?
(b) The corresponding residuals against fitted values plot is given in
Figure 11. Based on this plot, how many outliers do you think there
are? Which points are they on this plot?

Figure 11 Residual plot for medals ∼ gdp + population using a different subset of data to Activities 20 and 21

(c) The statistician decided that there were two outliers in this third
subset of data. The regression was refitted after having dropped these
outliers from the dataset.
The revised p-value associated with the coefficient for gdp was 0.958,
the revised p-value associated with the coefficient for population was
0.203, and the revised p-value for the F -statistic was 0.411.
Have the general conclusions changed? If so, in what way? Hence,
does it matter whether the outliers are included or not?


So, as Activities 20 to 22 demonstrate, repeating an analysis without outliers may still result in the same general conclusions being reached. This
is the best situation to be in (assuming there are outliers). It means that
the decision as to whether or not to include the outliers is not a critical
one. Instead all that needs to be reported is that it makes little difference.
In situations where it does make a big difference whether or not the
outliers are included, it is important to be aware that the results are not
robust and to report accordingly.

4.3 Are the predictions any good?


The focus in Subsections 3.2 and 3.3 was on seeking a good model for the
number of medals won at a summer Olympics. In this subsection, we’ll
consider using such a model to predict the number of medals a nation will
get at the next summer Olympics. At the time of writing, this corresponds
to Paris 2024.
In Subsection 4.3.1, we will introduce a couple of measures of the quality
of predictions, before showing you how to calculate these using R in
Subsection 4.3.2. Then, in Subsection 4.3.3, you will see how a new
dataset, the test dataset, enables us to better estimate how well a model
predicts new values. In Subsection 4.3.4, we will use R to assess
predictions from a model using a test dataset.

4.3.1 Measures for the quality of predictions


In this subsection we will describe a couple of ways we can measure how
close predicted values are, in general, to the actual values. However, first
complete Activity 23 to remind yourself about using a regression model to
obtain predictions.

Activity 23 Predicting from a regression model

(a) In general, how can a regression equation be used to calculate predicted values?
(b) Table 5 shows the results from Activity 16 (Subsection 4.1), when the
following model was fitted to the Olympics dataset
medals ∼ lagMedals + host.

Table 5 Repeat of Table 4 showing parameter estimates for medals ∼ lagMedals + host

Parameter       Estimate   Standard error   t-value   p-value
Baseline mean      0.258        0.104          2.489     0.013
lagMedals          0.961        0.008        127.423   < 0.001
host              12.639        1.345          9.400   < 0.001


Use this model to predict the number of medals that will be won by a
nation who is hosting the summer Olympics and who won 17 medals
at the previous summer Olympics.
(c) Using this model, the 95% prediction interval for the number of
medals won by a nation described in part (b) is estimated to
be (22.5, 36.0). That nation turns out to be Brazil at Rio 2016.
At Rio 2016, the actual number of medals that Brazil won was
19 medals. Did the actual observed response lie inside the 95%
prediction interval or not?
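
For reference, the calculations in this activity could be done in R along the following lines (a minimal sketch, assuming fit is the fitted lm object for medals ∼ lagMedals + host):

# Sketch: predicting from a fitted regression model.
# Assumes 'fit' is the fitted lm object for medals ~ lagMedals + host.

new.nation <- data.frame(lagMedals = 17, host = 1)

# Point prediction: 0.258 + 0.961 * 17 + 12.639
predict(fit, newdata = new.nation)

# 95% prediction interval for the same nation
predict(fit, newdata = new.nation, interval = "prediction")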

Activity 23 shows that there are two interlinked aspects to prediction. One
is how close the prediction is to the actual result. The other is whether any
prediction interval we give reflects the true uncertainty about the predicted
value.
Box 9 introduces two ways of measuring how close the predictions are to
the actual results: the mean squared error and the mean absolute
percentage error.

Box 9 Measures of how good predicted values are


Suppose that there are $n$ predicted values, $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$, and the corresponding actual values are $y_1, y_2, \ldots, y_n$.

The mean squared error (MSE) is defined as
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

The mean absolute percentage error (MAPE) is defined as
$$\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i}.$$

Note that, for both measures, the smaller the value, the closer the
predicted values tend to be to the actual values. Also, in the unusual
situation that all the predictions match the corresponding actual
values, both the MSE and the MAPE take their minimum value of 0.
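
These two measures are straightforward to compute; here is a minimal R sketch (actual and predicted are assumed numeric vectors of equal length, and the MAPE additionally assumes that no actual value is zero):

# Sketch: computing the MSE and the MAPE for a set of predictions.

mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

mape <- function(actual, predicted) {
  mean(abs(actual - predicted) / actual)
}

# Illustration with made-up values
actual    <- c(19, 25, 8)
predicted <- c(29.2, 22.0, 10.5)
mse(actual, predicted)    # 0 only if every prediction is exact
mape(actual, predicted)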

When considering whether the prediction intervals reflect the true uncertainty about the predicted values, the key consideration is whether
the right percentage of such intervals include the actual values. For
example:
• for 95% prediction intervals, we want to check that 95% of the actual
observations are in their corresponding prediction intervals
• for 99% prediction intervals, we want to check that 99% of the actual
observations are in the corresponding prediction intervals.


If there are too few or too many actual observations in the prediction
intervals, this suggests the prediction intervals do not have the width they
should have.
• If there are too few observations in their respective intervals, then it
means that the intervals are not capturing all of the uncertainty
surrounding the predicted values. In this case, there is overconfidence
about the range of values the predicted value might take.
• If there are too many observations in their respective intervals, then it
means that the intervals are too cautious about the range of values the
predicted value might take.
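
A check of this kind can be sketched in R as follows (a minimal sketch: fit is an assumed fitted lm object, dat a data frame of observations with the actual responses in column y):

# Sketch: checking the coverage of 95% prediction intervals.
# 'fit', 'dat' and the response column 'y' are assumed names.

pred.int <- predict(fit, newdata = dat,
                    interval = "prediction", level = 0.95)

# Proportion of actual observations inside their intervals;
# for intervals of the right width this should be close to 0.95
mean(dat$y >= pred.int[, "lwr"] & dat$y <= pred.int[, "upr"])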

4.3.2 Using R to calculate MSE and MAPE


In Subsection 4.3.1 we introduced MSE and MAPE as measures of the
quality of predictions. Now it is time to start calculating these for a model.
In Notebook activity 5.10, we’ll calculate the MSE, the MAPE and the
proportion of observations that are in their corresponding prediction
intervals for the model you found when working through
Notebook activity 5.9.

Notebook activity 5.10 Assessing predictions


This notebook will explain how to calculate the MSE, the MAPE, and
the proportion of observations which are in their corresponding
prediction intervals for a model.

4.3.3 Using a test dataset to assess predictions


In Subsection 4.3.2 you explored how good the predictions from one model
are by looking at the data from 2016. Now, the data from 2016 was part of
the data that was used to build (or ‘train’) the model. Those data helped
both select the best model and estimate the parameters of this model.
Unfortunately, this means that the assessment of the prediction capability
of the model is likely to be over-optimistic – at least when it comes to
predicting the number of medals won in future Olympics.
If, for example, the allocation of medals at Rio 2016 was a bit unusual and unlike what normally happens, then models which favour these unusual features of the data will look better than they really are for future data. This is due to a problem known as over-fitting, described in Box 10.


Box 10 Over-fitting
When fitting a model, it is possible to have a model that fits the data
too well. When this happens, the model reflects peculiarities that are
specific to the data used. As these peculiarities are not then reflected
in other data that the model might be applied to, the applicability of
the model is compromised. Such a model is said to over-fit the data.

You will explore fitting and over-fitting a little more in the next activity.

Activity 24 Is more complicated better?

Figure 12 shows four different regression models fitted to a dataset containing twenty observations, with each regression model dependent on a
different number of terms. (Note that the apparent break in the curve in
Figure 12(d) around the value x = 2 is simply due to the curve dipping to
a minimum below the lower limit plotted on the y-axis.)

Figure 12 A dataset to which regression models with various numbers of
parameters have been fitted: (a) a simple linear regression model, depending
on two terms (parameters), with (b) to (d) showing regression models with an
increasing number of terms fitted to the data
(a) Which regression model appears to provide the closest fit to the data
so that the curve tends to be closest to the observed points?


(b) The MSE values for each of these models are given in Table 6. Are
these values to be expected? Why or why not? (Hint: think about
how the MSE values are calculated.)
Table 6 MSEs for the models shown in Figure 12

Plot   Number of terms   MSE
(a)           2          0.1449
(b)           4          0.1321
(c)           8          0.1203
(d)          16          0.0839

(c) Which regression model do you think best represents the relationship
between x and y over the range displayed?

As Activity 24 has demonstrated, it is possible for a model to look as though it is fitting the data too well. However, given that by definition
models that over-fit the data fit the data better than competing simpler
models, how can over-fitting be detected? One way is by the use of a
training dataset and a test dataset, as described in Box 11.

Box 11 Training datasets and test datasets


A training dataset is a dataset that is used to select the most
appropriate model, and to obtain estimates of model parameters.
A test dataset is a dataset that has not been used to select the
model or fit it. Such test datasets are used to assess how good
predictions are likely to be when applied to situations where the
outcomes are not known.

All of the datasets you have been fitting models to so far in this module
have been training datasets. As such, the training dataset has observed
values for the response variable as well as the explanatory variables. It
turns out that in the test dataset we also need observed values for the
response variable, along with observed values for the explanatory variables.
This is because, although we don’t need the values of the response variable
when making predictions, we do need them to compare the predictions
from the model with the actual observed values. In the next activity you
will use a test dataset to assess models.
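
To make the idea concrete, here is a minimal R sketch of forming a training and a test dataset by randomly splitting a larger dataset (the data frame mydata, the variables y and x, and the 70/30 split are all illustrative assumptions):

# Sketch: a random training/test split, followed by assessing
# test-set predictions. All names and the 70/30 split are illustrative.

set.seed(1)  # for a reproducible split
n <- nrow(mydata)
train.rows <- sample(n, size = round(0.7 * n))

train <- mydata[train.rows, ]
test  <- mydata[-train.rows, ]

# Select and fit the model using the training data only
fit <- lm(y ~ x, data = train)

# Assess predictions on the test data, for example via the test-set MSE
test.pred <- predict(fit, newdata = test)
mean((test$y - test.pred)^2)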


Activity 25 Is more complicated better? Using a test dataset to decide
In Activity 24, you considered four regression models fitted to a set of
data. In this activity, you will consider exactly the same four models, so
that the estimates for the curves are the same, as well as the terms
included in each model. However, this time the points are from a test
dataset and so these data weren’t used to fit any of the models. (The data
are all simulated so that it is known that the same underlying model
actually applies to the data used in Activity 24 and this activity.)
The fitted curves (from Activity 24) and the data points from the test
dataset are shown in Figure 13.
Now which regression model appears to be the best?


Figure 13 The four regression models from Activity 24 shown relative to a test dataset


4.3.4 Assessing predictions through the use of a test dataset in R
So far in this section, we have been fitting models to data from the
summer Olympics held between 1996 and 2016. The data from these years
therefore form the training dataset. (Since data from Barcelona 1992 are
only used as information about ‘numbers of medals won at the previous
summer Olympics’, they can be considered to be part of the 1996 data.)
The data from Tokyo 2020 have not so far been used in the modelling.
This leaves these data available to form a test dataset. These additional data are provided in the Olympics 2020 dataset, which is described next.
The test dataset is then used in Notebook activity 5.11.

The Olympics 2020 dataset (olympic2020)


This dataset includes data for the competing nations for Tokyo 2020.
The variables are the same as those in the Olympics dataset, given in
Subsection 2.2.4.
The data for the first three and last three observations in the
Olympics 2020 dataset are given in Table 7.
Table 7 The first three and last three observations from olympic2020

country      countryCode  year  medals  lagMedals  host  nextHost  population  gdp
Afghanistan  AFG          2020  0       0          0     0          35.38      0.57
Albania      ALB          2020  0       0          0     0           2.88      4.68
Algeria      ALG          2020  0       2          0     0          40.55      4.83
Yemen        YEM          2020  0       0          0     0          27.17      0.69
Zambia       ZAM          2020  0       0          0     0          16.36      1.65
Zimbabwe     ZIM          2020  0       0          0     0          14.03      1.22
Sources: ‘2016 Summer Olympics medal table’ (2021), ‘2020 Summer
Olympics medal table’ (2021) and The World Bank Group (2020a, 2020b)


Notebook activity 5.11 Assessing predictions using a test dataset
In this notebook you will assess a model for the number of medals
using a test dataset.

In practice, training and test datasets are usually formed by splitting a larger dataset. This is because both the training and test datasets require that values for the response variable are known. The allocation of individual observations to the training and test datasets is a balancing act. The bigger the test dataset is, the more information there is to use to assess how good the predictions are. However, an increase in the size of the test dataset means there is less information available to estimate a good model in the first place!

[Figure: a larger dataset split into ‘Data to fit model’ and ‘Data to test model’]

4.4 Model exploration or model confirmation?
So far in this section, we have been focusing on model building. As you
have seen, this process is not quick. The variety of different models that
could be fitted to data is huge. Even focusing on just a few possibilities
can still lead to many models being fitted. So, it is just as well that
modern computing is fast enough so that fitting any particular model for a
dataset of the size of the Olympics dataset takes very little time. This type
of data analysis is sometimes referred to as model exploration.
Sometimes, however, the structure of the model has already been proposed
and the focus is on whether this model is appropriate for the data being
analysed. Such data analysis can then be referred to as model
confirmation. Model confirmation does not involve fitting multiple
models. Instead, the emphasis is on assessing how well the model fits the
data. At first sight it may seem that model confirmation is an unnecessary
duplication of resources. However, as you will discover in Section 7, such
replication plays a vital role in helping to distinguish between real and
spurious findings.

5 Missing data
So far in this unit, we have been focusing on the problem of developing a
model for predicting the number of medals a nation will win at Paris 2024.
The data we have been using for this in Notebook activities 5.6 to 5.11 are
complete, which means that for each observation in the dataset, all values
of the variables are known. However, often we do not know every value for
every observation in a dataset. Indeed, we had this problem for some of
the values of the variables considered in Notebook activity 5.5. When such
values are not available, or not known, this is referred to as missing data.


The presence of missing data matters since it can put at risk the
representativeness of the data to the wider population of interest. This is
illustrated in Example 6.

Example 6 Impact of ‘shy’ Conservative Party voters


In the 1992 UK general election, there were two main political parties
battling to win: the Labour Party and the Conservative Party.
In the run-up to the election, opinion polls were indicating a slight lead for the Labour Party over the Conservative Party. Overall, the polls were indicating that 39% of voters would vote for the Labour Party compared with 38% voting for the Conservative Party. It therefore came as a surprise when the Conservative Party won the election with 42% of the votes, compared with 34% of the votes for the Labour Party.

[Image: A ‘shy’ voter?]
Following the election, the possible reasons why the opinion polls
failed to predict the Conservative Party’s success were investigated.
One such investigation was carried out by a group of experts set up by The Market Research Society who, in their report back to the society (Curtice et al., 1994), ascribed some of the discrepancy to the fact that, in comparison to other voters, Conservative Party voters were more reluctant to reveal to pollsters how they intended to vote. This
meant that the sample of respondents on which the pollsters based
their analysis was not representative of voters, since their sample
contained fewer Conservative Party supporters than a representative
sample would have done.

Example 6 illustrated how missing data can cause problems with how
representative the data are. But what about data for the summer
Olympics? What missing data are there that can impact on
representativeness? You’ll consider examples of this in Activity 26.

Activity 26 Missing Olympics data

Here are two examples of where there are missing data connected with
data about the Olympics that we have been analysing. In each case, briefly
discuss what impact on representativeness such missing data might have.
(a) No GDP per capita values being available for the Democratic People’s
Republic of Korea.
(b) The number of medals won at the previous Olympics not being known
for the Czech Republic in 1996. (In 1992, Czech athletes competed as
part of a Czechoslovakian team.)


In Subsection 5.1 you’ll learn about strategies for handling missing data in
analyses. Then, in Subsection 5.2, we’ll focus on why the data are missing.
As you will discover in Subsection 5.3, considering why data are missing
allows us to then decide whether our chosen strategy for dealing with
missing data is appropriate.

5.1 Dealing with missing data


To make the ideas on dealing with missing data more concrete, it is helpful
to consider a dataset in which there are missing values. Such a dataset is
described next.

Investigating the placebo effect


In medicine, the placebo effect is a well-known, but not
well-understood, phenomenon. It refers to the situation where
patients experience a benefit from a treatment, known as a placebo,
even though in theory that treatment should not have any real effect.
The placebo is just effectively a fake treatment. For example, the
placebo could be a ‘sugar pill’, that is, a pill that does not contain any
active ingredient.
The data we will explore come from a study investigating whether it
is possible to induce a placebo effect by just changing someone’s
mindset (Crum and Langer, 2007). Specifically, the study aimed to
see whether participants could gain noticeable physical health benefits
from exercise simply by believing that they were doing more exercise.
This study was conducted on a group of 75 hotel room attendants. All of the participants happened to be female, though the researchers did not specifically exclude men from taking part. Room attendants were chosen because the nature of their work (cleaning hotel rooms) meant that they were almost certainly fulfilling the government’s recommended target for daily activity.

[Image: Room attendants busy at work]
The researchers were interested in what the impact would be on some
key health measures if the room attendants were simply made more
aware of just how active their job required them to be. The room
attendants were not asked to change the amount of activity (at work
or outside of work) they did. Furthermore, when the researchers
analysed the data, they did not pick up any tangible differences in
lifestyle by the end of the study.


The placebo effect dataset (placeboEffect)


The dataset from the study contains data on the following variables:
• attID: an identification number for the room attendant
• informed: whether the room attendant was informed that she was
meeting the exercise guidelines through her work, taking the
(coded) values 0 (for no) and 1 (for yes)
• age: the room attendant’s age, in years
• wt: the room attendant’s weight at the start of the study, measured
in pounds (lb) to the nearest pound
• bmi: the room attendant’s body mass index at the start of the study
• percent: the room attendant’s body fat as a percentage of total
weight, at the start of the study
• percent2: the room attendant’s body fat as a percentage of total
weight, at the end of the study
• ratio: the ratio of the room attendant’s waist to hips, at the start
of the study
• syst: the room attendant’s systolic blood pressure at the start of
the study
• diast: the room attendant’s diastolic blood pressure at the start of
the study.
The first six rows of this dataset are given in Table 8.
Table 8 The first six rows from placeboEffect

attID  informed  age  wt   bmi   percent  percent2  ratio  syst  diast
1      0         43   137  25.1  31.9     32.8      0.79   124   70
2      0         42   150  29.3  35.5               0.81   119   80
3      0         41   124  26.9  35.1               0.84   108   59
4      0         40   173  32.8  41.9     42.4      1.00   116   71
5      0         33   163  37.9  41.7               0.86   113   73
6      0         24    90  16.5                     0.73         78

In the next activity, you will consider the placebo effect dataset in more
detail.

Activity 27 Missing values in the placebo effect dataset

The values of the variables for the first six room attendants are given in
Table 8 above. What do you notice while looking at this table?


There are many missing values across the whole dataset of 75 room
attendants. The number of missing values for each variable is given in
Table 9. From this, we can see that, although there are a couple of
variables for which missing data are not a problem, most of the variables
in this dataset have at least one missing value.
Table 9 Number of missing values for each variable in placeboEffect

Variable   Number of missing values
informed    0
age         1
wt          0
bmi         1
percent     7
percent2   19
ratio       2
syst        9
diast       9

In the rest of this subsection, we will introduce three strategies for dealing
with such data: complete case analysis, available case analysis and
imputation.

5.1.1 Complete case analysis


A complete case analysis refers to the strategy of analysing only those
observations (that is, cases) for which there were no missing data recorded.
In the next example, we show how this works with respect to the first six
rows of the placebo effect dataset that was given in Table 8.

Example 7 Complete cases in the placebo effect dataset


In Activity 27, you considered some data relating to room attendants.
For these data, an observation or case refers to a single room
attendant – or, more accurately, the data about a single room
attendant.
The complete cases therefore provide data about only those room
attendants for whom values were recorded for all of the variables. For
example, out of the first six room attendants (given in Table 8) only
data from the first and fourth room attendants would be used.

There is one big advantage to a complete case analysis – it is easy to implement. Firstly, it is straightforward to remove from the dataset any cases that have a missing value. And then, having done so, the analysis can proceed as if there were no missing data.
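
In R, a complete case analysis can be set up in a single step, as in this minimal sketch (assuming the placeboEffect data frame):

# Sketch: extracting the complete cases from a dataset.
# Assumes the 'placeboEffect' data frame.

placebo.cc <- na.omit(placeboEffect)
# Equivalently: placeboEffect[complete.cases(placeboEffect), ]

nrow(placeboEffect)  # all 75 room attendants
nrow(placebo.cc)     # only those with no missing values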


However, there is a big downside to a complete case analysis – it can be an inefficient use of the data that were collected.
Activity 28 explores this issue with respect to the placebo effect dataset.

Activity 28 Estimating the efficiency of data use

Recall that the placebo effect dataset contains data gathered about a total
of 75 room attendants and that – excluding attID, which just gives
identification numbers – there are a total of nine variables.
(a) Suppose that a value was recorded for every variable for every room
attendant. How many individual recorded data values would there be
in the total dataset? (Ignore the values given for attID as these are
just identification numbers.)
(b) Table 9 gave the number of missing values for each variable in the
placebo effect dataset.
In total, how many values were not recorded? Hence, what percentage
of the total number of values that could have been recorded in the
dataset actually were recorded? (In calculating this percentage, ignore
the values given for attID as these are just identification numbers.)
(c) The number of room attendants with 0, 1, 2, 3, 4 or 5 missing values
is given in Table 10. (Each room attendant had no more than
5 missing values.)
Table 10 Number of room attendants in placeboEffect with 0, 1, 2, 3, 4 or 5 missing values

Number of missing values   Number of room attendants
0                          47
1                          16
2                           6
3                           5
4                           0
5                           1

How many room attendants would be included in a complete case analysis? Hence what percentage of individual data values would be used in a complete case analysis?
(d) Compare your answers to parts (b) and (c), and hence comment on
the use that a complete case analysis is making of the data that are
available.


5.1.2 Available case analysis


You saw in Activity 28 that a complete case analysis can be inefficient.
Such an analysis drops all cases where not everything was observed, even if
this means the variables for which there are missing values are not used in
the analysis! An alternative is to use an available case analysis, which
only drops those cases that have a missing value for one (or more) of the
variables in the model being fitted.
In Example 8, we show how this works with respect to the first six rows of
the placebo effect dataset that was given in Table 8. Then, in Activity 29,
you will consider how this works for some other rows in the same dataset.

Example 8 Available cases in the placebo effect dataset


Suppose that the following multiple regression models are being
considered for the data given in the placebo effect dataset:
• Model 1: percent ∼ informed
• Model 2: percent2 ∼ informed + percent.
Looking back at the table of values for the first six room attendants in
the dataset given in Table 8, notice that we have the following pattern
of missing data:
• all variables collected: room attendants 1 and 4
• just percent2 missing: room attendants 2, 3 and 5
• percent, percent2 and syst missing: room attendant 6.
When fitting Model 1, only the variables percent and informed are
used. So the only cases dropped are those which have a missing value
for percent or informed (or both). Therefore, out of the first six
room attendants, when using Model 1, the data from room attendants
1, 2, 3, 4 and 5 are used in an available cases analysis.
Model 2 also involves percent and informed, but also involves
percent2. So, when fitting this model, dropped cases have a missing
value for at least one of these variables. This means that in the
available cases analysis using Model 2, only the data from room
attendants 1 and 4 are used.


Activity 29 Data used in an available case analysis

Table 8 gives the data for the first six room attendants in the placebo
effect dataset. Due to the way this dataset is organised, all of these first
six room attendants were in the ‘Not informed’ group (that is, they all had
the value 0 for informed).
The data for the first six room attendants in the ‘Informed’ group (that is,
for attID 35 to 40) are given in Table 11.
Table 11 The first six room attendants in the ‘Informed’ group

attID  informed  age  wt   bmi   percent  percent2  ratio  syst  diast
35     1         48   170  28.3  43.3     44.1      0.91   195   107
36     1         50   133  26.1  34.2     33.6      0.87
37     1         40   109  21.9  26.2     25.3      0.83   132    99
38     1         32   178  33.6  39.9               0.91   128   109
39     1         54   137  22.7  33.7     30.5      0.85   132    69
40     1         24   166  28.5  41.3     41.1      0.88   114    56

For each of the following models, give the identification numbers of the
room attendants whose data will be included in an available cases analysis.
(a) percent ∼ informed
(b) percent2 ∼ informed
(c) percent2 ∼ informed + percent
(d) percent2 ∼ informed + percent + age + wt + bmi + ratio
+ syst + diast

An available case analysis has the same advantage as a complete case analysis, namely that it is easy to implement. Furthermore, it is less
wasteful of data. A case is only dropped if there is a missing value for one
or more of the variables in the model being fitted. Missing values for any
of the other variables are ignored.
A downside of an available case analysis compared with a complete case
analysis is that it is more difficult to compare results from different
models. For example, in Activity 29, the difference between models in
parts (b) to (d) was simply which explanatory variables they contained in
addition to informed. However, in an available case analysis, the fitting of
the model in part (b) would be based on data from more room attendants.
So, any comparison of model fit between these models will partly
reflect differences in the data used to fit the models.
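
It is worth knowing that, by default, R’s lm() function effectively performs an available case analysis: its default na.action drops only those rows with a missing value among the variables named in the formula. A minimal sketch (again assuming the placeboEffect data frame):

# Sketch: available case analyses as performed by lm()'s default
# na.action, which drops rows with NAs only among the model's variables.

fit.b <- lm(percent2 ~ informed, data = placeboEffect)
fit.c <- lm(percent2 ~ informed + percent, data = placeboEffect)

# The fits may be based on different numbers of room attendants,
# which complicates comparing them
nobs(fit.b)
nobs(fit.c)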


5.1.3 Imputation
For both a complete case analysis and an available case analysis, missing
data can be handled by dropping cases which contain missing values.
Imputation takes a different approach – it aims to replace missing data
with suitable values, so that the dataset can then be analysed in the same
way as if there had been no missing values.
In Example 9 we show how this works with respect to the first six rows of
the placebo effect dataset that was given in Table 8.

Example 9 A basic form of imputation


Recall from Table 8 that for the first six room attendants, there is one
missing value of percent, four missing values of percent2 and one
missing value of syst.
One way of imputing values for these missing values is to use the
mean of the values that are recorded. It turns out that over the entire
dataset, the mean of the observed values for percent is 35.3, for
percent2 it is 34.8 and for syst it is 129. This leads to the dataset in
Table 12 after the imputation.
Table 12 The first six rows from placeboEffect after imputation.
(The imputed values are marked with an asterisk.)

attID  informed  age  wt   bmi   percent  percent2  ratio  syst  diast
1      0         43   137  25.1  31.9     32.8      0.79   124   70
2      0         42   150  29.3  35.5     34.8*     0.81   119   80
3      0         41   124  26.9  35.1     34.8*     0.84   108   59
4      0         40   173  32.8  41.9     42.4      1.00   116   71
5      0         33   163  37.9  41.7     34.8*     0.86   113   73
6      0         24    90  16.5  35.3*    34.8*     0.73   129*  78

When imputation is used as described in Example 9, all of the missing values are replaced by imputed values. This means that, like complete case analysis
and available case analysis, it then becomes straightforward to implement
statistical techniques, as you will see in Activity 30.


Activity 30 Performing calculations on the imputed dataset

In Example 9, an imputed version of the placebo effect dataset was created by replacing missing values by sample means.
(a) Using only data after imputation for the six room attendants in
Table 12, what is the sample mean for percent2?
(b) Based on all 75 of the room attendants, what is the sample mean for
percent2?
(c) In the original dataset (that is, before imputation), the standard
deviation of percent2 based on only the values that were not missing
is 6.13. Will the standard deviation of percent2 based on this
imputed version of the dataset be more than 6.13, less than 6.13 or
exactly 6.13? (To create the imputed dataset, 19 missing values for
percent2 were replaced by the mean.)
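
A minimal R sketch of this kind of mean imputation (again assuming the placeboEffect data frame, with missing values coded as NA) is as follows:

# Sketch: mean imputation for one variable with missing values.

imputed <- placeboEffect

# Replace each missing percent2 value by the mean of the observed ones
m <- mean(imputed$percent2, na.rm = TRUE)
imputed$percent2[is.na(imputed$percent2)] <- m

# After imputation the variable has no missing values left
sum(is.na(imputed$percent2))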

Example 9 describes one imputation method for deciding what values should replace missing values. There are, however, many other ways of
imputing values. Each of these methods tries to balance producing a good
estimate of what the missing values would have been if observed, against
the complexity of implementing the imputation method. We will not,
however, be covering these other methods in M348.

5.2 Why missing data arise


Subsection 5.1 introduced three methods for dealing with missing values.
In this subsection, we will take a step back to consider why missing data
arise.
In your study of statistics so far, you are likely to have analysed many
datasets. In most of these datasets, if not all, it was unlikely that there
were any missing data. This is because, when datasets are chosen for
teaching purposes, there is often the desire to avoid the complications that
missing data bring!
However, in practice, it is common for there to be some missing data in a
dataset. Reasons for this include the following.
• A value gets lost or is not recorded. Or the value is so obviously wrong,
such as the age of a child being recorded as −7 years, that the only
sensible course is to treat it as not known.
• Someone might choose not to reveal the value. For example, in a survey,
respondents may be happy to answer some questions but refuse to
answer others. Furthermore, someone’s willingness to respond might
depend on what that answer actually is. For example, in the opinion
polls described in Example 6, those supporting the Labour Party were
thought to have been more willing to answer questions regarding their
voting intention than those supporting the Conservative Party.


• For some observations, it does not make sense for there to be a value.
For example, for a variable defined as ‘age when first married’ it only
makes sense to have a value for those people who have been married.
For those who have never been married, this value is necessarily missing.
Knowing why the data are missing is important as it can help us to decide
whether we are dealing with the missing data appropriately.
Now, for any given dataset, the precise reasons why data are missing will
be specific to the context in which the data sit. However, using a
categorisation first proposed in Rubin (1976), it is helpful to classify the
missing data as one of these three types:
• missing completely at random (MCAR)
• missing at random (MAR)
• not missing at random (NMAR).
The idea underpinning these different types of missing data is the extent to
which ‘everything we do know’ tells us something about the values that are
missing.
Before we get to the definitions of these types a little later on in the
subsection, it is useful to think about what is ‘everything we do know’.
Well first, but easily overlooked, we know that for every missing value that
we have, we do not have a value for it! You may be thinking that just
knowing that a value is missing could not possibly tell us anything about
what that missing value should be. You would be right in some
circumstances, but not in all. To see why this is so, it is helpful to consider
a concrete example.
Suppose that in a study about the effectiveness of teaching material, we
are interested in how successfully students have studied a module. Further,
suppose that we take the performance in the module’s exam as our
measure of how successful students have been. Unfortunately, unless a
module only has relatively few students, it is unusual for exam scores to be
recorded for all students. The reasons why an exam score may be missing
for a student are many and varied. As you will see in Example 10, some of
the reasons will not tell us anything about what a score might have been. However, as you will see in Example 11, other reasons can tell us something about what a score might have been.

[Image: We wouldn’t expect any of these students to have missing exam scores]


Example 10 Missing exam scores – when missingness tells us nothing
The following could be thought of as reasons why knowing that an
exam score is missing does not tell us anything about what the score
would have been.
• On the day before the exam, the student comes down with flu and
is too ill to take the exam. (Remember: if you find yourself in a
similar situation, contact the OU as soon as you can.)
• The exam script is lost before it could be marked. (At the OU this
is a very rare occurrence.)
In both of these cases, the reason why there is no exam score for the student is unconnected with how successfully the student had studied the module and hence with what the exam score would be if we
had it. The lack of an exam score is due to something that happened
‘out of the blue’.

Example 11 Missing exam scores – when missingness tells us something
In Example 10, the reason underlying the missing exam score was assumed
to be some ‘out of the blue’ event. Another reason why an exam score
is missing is as follows:
• The student dislikes the final part of the teaching material and
gives up on the module before the exam.
Now, if a student has given up on the module before the exam, then it
is likely that their exam score would have been relatively low, had
they taken the exam. (Though, of course, we can’t know for sure.) So,
in this case, the reason why the exam score is missing does provide
some information about what that score might have been.
So far, we have just been considering whether the mere fact a value is
missing tells us anything about what value it would have been. However,
unless we only collect data relating to just one variable, ‘everything we do
know’ includes values for other variables too. As Example 12 illustrates,
sometimes the values of these variables tell us something about what the
missing values would have been, and sometimes they don’t.

Example 12 Missing exam scores – using other variables
Suppose in our study about the effectiveness of teaching material we
also collect the following information about students:
• in which country the student is based
• the age of the student
• the score on the first TMA for that module.
It is hoped that neither the country a student is based in nor their age affects how effective they find the teaching material.
So, knowing these two pieces of information about a student with a
missing exam score is unlikely to help us guess what the score would
have been for them.
In contrast, it is reasonable to assume that TMA scores and exam
scores are related. In particular, it is often the case that students with
a higher score for the first TMA tend to be the students with higher
exam scores. (Though, of course, some students still get high exam
scores even with a relatively low score for the first TMA.) So, knowing
a student’s score for the first TMA does provide information about
what a missing exam score might have been.

Figure 14 illustrates another situation where ‘everything we do know’ helps: we might be able to guess what the pattern is on a missing jigsaw piece by looking at the surrounding jigsaw pieces.

Figure 14 Another situation where ‘everything we do know’ helps . . .
We now return to the three missing data types from Rubin (1976).
A missing value is missing completely at random (MCAR) if nothing
that we know tells us anything about what the value would have been. So,
for example, a missing exam score would be MCAR if we knew that the
student had gone down with flu the day before the exam and the only other data we’d collected about the student were their age and which country they were based in.
Both missing at random (MAR) and not missing at random
(NMAR) assume that the values of other variables tell us something
about what the missing value would be. So, a missing exam score would be
MAR or NMAR if we had also collected data about the first TMA. The
distinction between the two then lies in whether the reason that the exam
score is missing tells us anything extra.
For example, if we had the score for the first TMA for the student who had
gone down with flu, the missing exam score would be MAR. This is
because knowing that the student had gone down with flu doesn’t provide
us with any further information on what the student’s exam score would
have been.
However, the missing exam score for the student who gave up on the
module is likely to be NMAR. This is because the reason for the missing
exam score gives us additional information that can inform us about what
the value of the exam score would have been, over and above the score for
the first TMA.
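To make these distinctions concrete, here is a minimal R sketch – not module code – that simulates the exam-score example. The variable names, sample size and probabilities are all made up for illustration; only the way the chance of being missing is generated differs between the three types.

# Simulating the three missing data types for hypothetical TMA and
# exam scores. (Illustrative values only.)
set.seed(42)
n <- 200
tma <- rnorm(n, mean = 70, sd = 10)          # score on the first TMA
exam <- 10 + 0.8 * tma + rnorm(n, sd = 8)    # exam score, related to TMA

# MCAR: every exam score has the same chance of being missing,
# regardless of anything else we know.
exam_mcar <- ifelse(runif(n) < 0.1, NA, exam)

# MAR: the chance of being missing depends only on the observed
# TMA score (lower TMA scores are more likely to be missing).
exam_mar <- ifelse(runif(n) < plogis(-0.2 * (tma - 60)), NA, exam)

# NMAR: the chance of being missing depends on the exam score
# itself, even after allowing for the TMA score.
exam_nmar <- ifelse(runif(n) < plogis(-0.2 * (exam - 55)), NA, exam)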
Unfortunately, the assumptions for each of the missing data types cannot
be checked since such checking would rely on knowing the actual values of
the missing data – which, of course, are not known because they’re
missing! So we can only use the context that the data arose from to decide
which of these missing data types is the most appropriate. You will
practise doing this for a hypothetical drug trial in the next activity.

Activity 31 MCAR, MAR or NMAR?
Suppose that in a drug trial, participants attend a follow-up clinic every three months over the course of a year. At the clinic, various
measurements are taken, including participants’ blood pressure.
For each of the scenarios described below, say whether you think the lack
of blood pressure reading from the final follow-up clinic for a participant is
MCAR (missing completely at random), MAR (missing at random) or
NMAR (not missing at random).
(a) The participant forgets to attend the follow-up clinic.
(b) The drug given to the participant makes them too ill to attend the
follow-up clinic (but the illness is not linked to blood pressure).
(c) At the previous clinic visit, the participant’s blood pressure was
worrying enough that they were taken off the trial.
(d) The participant gets a new job which stops them attending the
follow-up clinics and so withdraws from the trial.
(e) The participant suffers a complication as a result of uncontrolled
blood pressure, which makes them too ill to attend.

A summary of the three missing data types is given in Box 12.

Box 12 Missing data types
Missing data can be categorised into the following three types.
• Missing completely at random (MCAR).
In this situation, it is assumed that the reason why a value is
recorded as missing is independent of its value. Additionally, why
the value is recorded as missing is independent of everything else
that is known about the observation.
• Missing at random (MAR).
Here, the actual value of a missing value depends on other values that have been recorded for the observation. However, after taking into account everything else that is known about the observation, the fact that the value is missing tells us nothing further about what the value would have been.
• Not missing at random (NMAR).
The fact that a value is missing provides information about what its actual value would have been, over and above everything else that is known about the observation.

Which type of missing data we decide we have has an impact on how the
analysis can be done while avoiding a biased result. We will consider this
very briefly next.

5.3 Impact of the missing data type
Subsection 5.1 introduced three strategies for dealing with missing data:
namely, complete case analysis, available case analysis and imputation.
You have already seen that a complete case analysis can be wasteful of the
data collected. Observed data for some variables can be thrown away due
to missing values on other variables, even if those variables are not directly
used in the analysis. Available case analysis is less wasteful because it only
throws away cases that have a missing value for one (or more) of the
variables used in the model. Imputation is also less wasteful: replacing
missing values by imputed values means that all cases are retained in the
analysis.
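The difference between these strategies can be seen directly in R. The following is a minimal sketch using a small made-up data frame; none of it is module code.

# Hypothetical data with some missing values.
dat <- data.frame(
  exam = c(62, NA, 75, 81, 58),
  tma  = c(55, 60, NA, 78, 49),
  age  = c(23, 31, 45, NA, 52)
)

# Complete case analysis: keep only rows with no missing values at all.
nrow(na.omit(dat))    # only 2 of the 5 observations survive

# Available case analysis: by default, lm() drops only those rows with
# a missing value for a variable actually used in the model, so the
# row with a missing age is still used here.
fit <- lm(exam ~ tma, data = dat)

# Simple (mean) imputation: replace each missing tma value by the mean
# of the observed tma values, so that no rows are lost.
dat$tma[is.na(dat$tma)] <- mean(dat$tma, na.rm = TRUE)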
However, making the best use of data is not the only – or even the most
important – consideration. Whether the strategy leads to biased estimates
also matters. This depends on the missing data type.
When the missing data are assumed to be missing completely at random
(MCAR), both a complete case analysis and an available case analysis
should lead to estimates that are as unbiased as they would be if there
were no missing data. A well-designed imputation method for MCAR data
should not bias estimates either.

When the missing data are assumed to be missing at random (MAR),
complete case analysis and available case analysis can, in some situations,
also lead to unbiased estimates. This happens, for example, when only
values for the response are missing. However, such analyses do not always
lead to unbiased estimates. Well-designed imputation also leads to
unbiased estimates for MAR data.
However, when the missing data are assumed to be not missing at random
(NMAR), biased estimates are likely to result whether complete case
analysis, available case analysis or imputation is used. Because of this, it is
best to avoid ending up with data that are NMAR at the data collection
stage if at all possible!

6 Documenting the analysis
So far, this module has largely focused on analysing the data. Another
part of the statistical modelling process is documenting the analysis. There
are two main aspects to this: documenting the analysis for yourself and
then documenting the analysis for others.
At the heart of documenting the analysis is the requirement that we should
be able to recreate the same results if we have the same data and use the
same methodology. After all, it is no good being able to produce really
insightful results if these cannot be recreated from the same data! This
requirement is referred to as reproducibility, which is described in Box 13.

Box 13 Reproducibility
In research, reproducibility means that sufficient detail is given so
that others are able to recreate the results of the original researchers.
For data analysis, this means that sufficient detail needs to be given
to enable others working from the same original data to follow the
same analysis and obtain the same results.

In Subsection 6.1 we will look at documenting data analyses for yourself.
In particular, we’ll look at how to create documentation using R in
Jupyter notebooks in Subsection 6.2. We will then consider documenting
for others in Subsection 6.3. To round off the section, we’ll discuss
describing a model in Subsection 6.4.
For all the subsections, you will need to refer to the module website
alongside this printed text.

6.1 Documenting for yourself
For data analysis, ensuring reproducibility at the most basic level means keeping an audit trail of all the actions taken from receipt of the data to obtaining the results. This includes answers to questions such as the following.
• Where did the data come from?
• What changes, if any, were made to individual data values? Why were
these changes made?
• Which transformations were applied to the data?
• Have any observations been excluded from the analysis? If so, why?
• Which model, or models, have been deemed appropriate for these data?
How were these models found?
Software can help with reproducibility, as Example 13 shows.

Example 13 Reproducibility and Jupyter notebooks
In this module you have been using Jupyter notebooks to implement
data analyses. The design of the notebooks encourages data analyses
to be recorded in a reproducible way. The inclusion of the code used
to generate results, along with the results and conclusions drawn,
facilitates keeping all of the information that might be needed to go
back and repeat an analysis.
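One low-tech but effective way of keeping such an audit trail is to record the answers to the questions above as comments alongside the code itself. Here is a hedged sketch of what that might look like; the data frame stands in for data that would normally be read from a file, and all names and values are hypothetical.

# Data source: (hypothetical) medals data, downloaded 1 March 2022.
olympics <- data.frame(
  medals     = c(10, 4, 0, 25, 7, 2),
  population = c(5.2e6, 6.1e7, -1, 3.3e8, 1.1e7, 9.8e6),
  gdp        = c(4.3e11, 2.9e12, 2.1e10, 2.1e13, 5.5e11, 3.7e10)
)

# Change to data values: a negative population is impossible, so it
# was treated as missing.
olympics$population[olympics$population < 0] <- NA

# Transformation: population is heavily skewed, so logs were taken.
olympics$log_pop <- log(olympics$population)

# Exclusions: none beyond the missing value above; lm() drops that row.
# Model: chosen after comparing candidate models (not shown here).
fit <- lm(medals ~ log_pop + gdp, data = olympics)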
This audit trail, and the analysis more generally, can be documented in
different ways: either as personal notes or in a form intended to be read by
others. Although personal notes do not need to be as tidy as those for use
by others, they still need to be good enough to be informative. For
instance, the handwritten notes in Figure 15 may make sense to the
statistician who wrote them, but they are fairly useless for anyone else
(and probably just as useless for the statistician to look back on too!).

Analysis
194 of the 200 observations.
Take logs?
Explanatory variables: age, sct, soc, num
Significant!
Normality assumption ok?

Figure 15 Some notes handwritten by a statistician during an analysis

Every line of the notes in Figure 15 is ambiguous and therefore essentially uninformative. For example, what does the statistician mean by their statement ‘194 of the 200 observations’? Is the analysis only based on 194
observations? And if so, why weren’t all 200 used? And as another
example, what does ‘Take logs?’ mean? That logs were taken? And if so,
take logs of what?
However the analysis is documented, it is important that the notes made will be understandable weeks, months or even years afterwards, when you may no longer remember the details of the analysis, however ingrained they might seem at the time. This is because we might need
to return to an analysis some time in the future, as will be demonstrated
in Activity 32.

Activity 32 Revising journal papers
One way in which research is disseminated is by getting it published as one
or more articles in a peer-reviewed journal. ‘Peer-reviewed’ means that the
article has been read by other researchers (the ‘reviewers’ or ‘referees’)
working in the same area and judged by them to be worthy of publication
in that journal. The Journal of the Royal Statistical Society – Series C
(Applied Statistics) is a respected peer-reviewed journal. An overview of
the journal notes that:
The emphasis is on statistical analyses in practice.
(Royal Statistical Society, 2022)
Frequently, the acceptance of an article by a journal occurs only after the
authors of the article have addressed all of the issues raised by the
reviewers. However, this review process may not be a quick one.
You will estimate the review time allowed for the articles in one issue of
the Journal of the Royal Statistical Society – Series C (Applied Statistics)
next. To do this, you will need to search online using OU Library Services.
Note that there is accompanying guidance provided on the module website,
should you need it.
(a) Access one issue of the Journal of the Royal Statistical Society –
Series C (Applied Statistics) on the OU Library Services website.
(b) How many articles does your selected issue contain? (Do not count
items such as ‘Front Matter’, ‘Back matter’ or any indexes.)
(c) For each article, note down the date when it was first received by the journal (from the authors) and the date of the final revision. (You should find this information just beneath the authors’ names at the beginning of each article.)
(d) Estimate the amount of time that the authors of papers might have
had between completing the analysis sufficiently for an article to be
submitted to this journal and (potentially) having to repeat the
analysis in order to address reviewers’ comments.

6.2 Creating documentation in Jupyter notebooks
It has already been noted in Example 13 that Jupyter notebooks can help
with keeping sufficient notes so that reproducibility is achieved. This
subsection contains two notebook activities. Notebook activity 5.12
describes some ways of adding notes to Jupyter notebooks. You are then
given the opportunity to practise these methods in Notebook activity 5.13.

Notebook activity 5.12 Documenting in Jupyter notebooks
This notebook explains how you can document data preparation in Jupyter notebooks.

Notebook activity 5.13 Further practice at documenting in Jupyter
In this notebook you can practise documenting in Jupyter.
The strategies used in Notebook activities 5.12 and 5.13 can be easily
extended to other aspects of the analysis beyond preparing the data, and
they can be used to document the entire analysis.
Keeping all of the code, right from reading in the data through to fitting
and checking the last model, also helps enormously when data are updated,
for example, if more data are gathered. Then, it is simply a case of reading
in the updated dataset and re-running all of the rest of the code.
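As a sketch of this idea (with a hypothetical function name and file names, not module code), the whole analysis can be wrapped up so that an updated dataset is analysed with a single call:

# Wrapping an analysis so it can be re-run on updated data.
run_analysis <- function(file) {
  dat <- read.csv(file)
  dat$log_pop <- log(dat$population)    # same preparation every time
  fit <- lm(medals ~ log_pop + gdp, data = dat)
  summary(fit)
}

# Original analysis:
#   run_analysis("medals_2016.csv")
# After more data are gathered, simply re-run on the updated file:
#   run_analysis("medals_2016_and_2020.csv")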
6.3 Documenting for others
Being able to reproduce analyses is not the only reason that the decisions taken during analyses should be documented. Making it explicit what decisions have been taken allows others to judge how reasonable those decisions are. While in many situations this simply leads to interesting
debate, there have been examples when a statistical analysis has come
under intense scrutiny. One such situation is described in Example 14,
next, and you are then asked to comment on an extract of the subsequent
court case in Activity 33.

Example 14 Statistical analysis in court
In 1990, the British Medical Journal (BMJ) published details of a
study investigating possible causes of a cluster of cases of childhood
leukaemia and lymphoma at Seascale in the UK, near to the Sellafield
nuclear plant (Gardner et al., 1990a, 1990b). The study concluded that exposure to radiation in men prior to conception was associated
with an increased risk of leukaemia in the child conceived.
Subsequent to this study being published, families of two of the
children in the study took the owner of Sellafield nuclear plant to
court in order to seek damages. As the study formed part of the
evidence in this high-profile case, the data and their analysis were
examined by both the prosecution and defence in great detail.
Although the eventual judgment of the court was that, on the balance
of probabilities, a causal link between pre-conceptual exposure to
radiation in the fathers and subsequent leukaemia and lymphoma in
the children had not been proved, the judge was quoted as describing
the study as ‘a good study, well carried out and presented’ (Wakeford
and Tawn, 1994, p. 311).

Activity 33 Preparing data for analysis
The extract below is taken from the write-up of the court case described in Example 14. Read this extract and identify two issues connected with sourcing and cleaning the data.
Dr K D MacRae (for the Defendants) had examined closely the
relevant documents made available by the MRC[*]. He observed that
PPI[**] was one of a number of factors under examination in the
study as a potential explanation for the Seascale excess and that
multiple statistical testing must be taken into account when
interpreting the results. This observation was not contested although
the details of an adjustment to statistical significance to deal with
multiple comparisons (whether qualitative or quantitative) were the
subject of some debate. More controversial were Dr MacRae’s
findings on two other points: the inclusion of a case who was
diagnosed apparently while living outside West Cumbria and who
should have been excluded from a case-control study of cases
diagnosed while resident in West Cumbria; and the restriction of the
study to individuals who were diagnosed during 1950–85 while
resident in West Cumbria and who were also born in the district,
whereas the original study protocol did not stipulate the criterion of
birth in West Cumbria.
(Source: Wakeford and Tawn, 1994, p. 296)
* The MRC (Medical Research Council) was the organisation that conducted the original study.
** PPI stands for ‘paternal pre-conceptual irradiation’, which is the father’s exposure to radiation.

The documentation in the court case described in Example 14 and
Activity 33 allowed issues to do with sourcing and cleaning the data to be
brought out into the open. This enabled the consequent impact on the
results to be properly discussed.
Activity 34, the final activity in this subsection, involves reading a section
of a research article so that you can assess for yourself whether or not the
authors have documented their analysis well enough for it to be
reproducible.

Activity 34 Reproducible results?
In the article ‘Explaining the gender wage gap in STEM: does field sex
composition matter?’ (Michelmore and Sassler, 2016), the authors looked
for evidence of a pay gap between men and women working in the USA.
This article is available via the module website.
Read the ‘Data and measurement’ section of the article (pp. 197–200) and
then answer the following questions. (When reading the article, remember
that ‘dependent variable’ is simply another name for the response variable,
and ‘independent variable’ is another name for an explanatory variable.)
(a) From what source did the authors obtain the data?
(b) Were all the people in the original data source included in the
analysis? If not, who were excluded?
(c) What variable did the authors use as the response variable? Did they
transform this variable? If so, in what way?
(d) Which explanatory variable were the authors most interested in?
(e) From the information given, do you think this analysis is
reproducible?

6.4 Describing a model
In Subsection 6.3 you saw how important it is to describe an analysis for
others in such a way that it is possible to reproduce the analysis. The
audience for such a description is likely to have a similar statistical
background to students on this module, so the description might be quite
technical.
In the next activity you will consider an article written for people like you:
those who have an understanding and appreciation of statistics.

Activity 35 Predicting Olympic success – a statistical description
In Section 3, we built models to predict the number of medals won by
countries at the Olympics. In the run-up to Rio 2016, three researchers
published details about a statistical model to predict the number of medals
each country would win: ‘Olympic medals: Does the past predict the
future?’ (Bredtmann, Crede and Otten, 2016). This article is available via
the module website.
Read the article and answer the following questions.
(a) What is the response variable used in this analysis?
(b) Which explanatory variables did the authors consider? Which of these
were continuous and which were categorical?
(c) In the article, the authors mention that they tried two different
models: a ‘naive’ model and a ‘sophisticated’ model. What is the
difference between the two models?
The article you read while working on Activity 35 was not the only place
that the authors’ work was described. In the following activity, you will
read another account. This time, the description is aimed at a general,
non-statistical audience.

Activity 36 Predicting Olympic success – a non-statistical description
On 3 August 2016, three days before the opening ceremony for Rio 2016,
the BBC News website also had an article about the same work in its
‘Magazine’ section: ‘Predicting the Rio Olympic medal table’ (Dunbar,
2016). You have already encountered one element from this article on the
module website – the video you watched while completing Activity 2
(Subsection 1.3). Now read the text from this article provided on the
module website and compare it with the article you read in Activity 35.
(a) Is the description of the variables, both the response and the
explanatory variables, the same? If not, in what way, or ways, do they
differ?
(b) Is the description of models fitted the same? Again, if not, in what
way, or ways, do they differ?
(c) Which, if any, statistical technical terms are used in the BBC News
website article?
As you have seen in Activities 35 and 36, the audience a report is written
for makes a difference to what gets described in an article, as well as the
terminology used.

7 Replication
In Subsections 6.1 and 6.3 you learnt about the importance of
documenting an analysis so that, given access to the data, it is possible to
repeat what was done exactly. In science there is also the related notion of
replication, as described in Box 14.

Box 14 A general notion of replication
In research, results from a study are regarded as having been
replicated if another study following the same methodology has been
able to produce similar general results.
Notice from Box 14 that replication is not about being able to use the
same data to reproduce the results. Instead, it is about being able to
generate fresh data that produce similar general results. At the heart of
this is the notion that when results are truly telling us something about
the real world, they are not one-offs: other researchers should be able to
repeat – that is, replicate – what was done.
In the next example we describe a couple of studies that aim to replicate a
study that has already been considered in this unit.

Example 15 Replicating the effect of changing mindsets
In Subsection 5.1, we considered the placebo effect dataset which
contained data from a study investigating whether people could
experience noticeable physical health benefits from exercise simply by
believing that they were doing more exercise. From these data, the
researchers concluded that changing the mindset of the room
attendants about the amount of exercise they did resulted in positive
health changes. But can this finding be replicated? A couple of
studies that tried to do so (and in fact succeeded!) are as follows.
• A study investigating whether levels of the hunger hormone ghrelin differ if someone believes that they have consumed an ‘indulgent’
milkshake instead of a ‘sensible’ one (even though the milkshakes
were otherwise identical) (Crum et al., 2011).
• A study investigating whether someone performs better on a test if
they are told they have had good-quality sleep (regardless of how
well they actually slept) (Draganich and Erdal, 2014).
Notice that neither of these studies exactly duplicates the study conducted on room attendants. Instead, they looked to replicate the
general finding that changing someone’s mindset can produce
measurable physical effects.

Unfortunately, many research discoveries – including some interesting and
high-profile ones – have not been replicated by others, despite efforts to do
so. This has contributed to what has been labelled in recent years as the
replication crisis in science. Although fraud or poor practice may
account for some discoveries that others have failed to replicate, it is
thought that the reason for much of it is actually statistical and is linked
to multiple testing and the interpretation of p-values. In Subsection 7.1 we
will consider why multiple testing is a problem from a statistical point of
view, and in Subsection 7.2 we will consider how replication can help.

7.1 Perils of multiple testing
Recall that a result showing up as statistically significant does not
necessarily mean that the underlying null hypothesis is false. A
statistically significant result simply means that the data look unlikely if
the null hypothesis is true, but it is still possible that the null hypothesis is
true none the less.
You might think that this is not something to worry about because getting
a set of data that produces a result that is statistically significant, even
though the null hypothesis is true (that is, a false positive), won’t happen
very often. Unfortunately, this may not be the case, as you will discover by
doing Activity 37.

Activity 37 Predictivity of significant results
In genetic research, ‘association’ studies investigate the locations on the
genome of variations associated with a particular disease. Such knowledge
is helpful for suggesting potential treatments, or for identifying individuals
at increased risk.
Suppose that a genetic association study is carried out focusing on Type 1
diabetes, and that for each location on the genome we have the following
null and alternative hypotheses:
H0 : no association between variations at this location and Type 1 diabetes,
H1 : an association between variations at this location and Type 1 diabetes.
Suppose also that:
• A total of 10 million locations of interest on the genome are to be
investigated. This means that there will be 10 million hypothesis tests
carried out (one for each location).
• Out of these 10 million locations, 3000 of them are actually associated
with Type 1 diabetes: that is, variation at 3000 of the locations is
associated with differences in the risk or severity of Type 1 diabetes. So,
this means that for 3000 of the hypothesis tests, H0 is in fact false.
Then, for the other 9 997 000 locations, there is no association between
the location and Type 1 diabetes. So, this means that for 9 997 000 of
the hypothesis tests, H0 is in fact true.
• For a single location, the significance level of the test is set to be 0.01.
This means that, for each of the 10 million tests, the probability of
rejecting H0 when it is in fact true (a ‘false positive’) is 0.01.
• The size of the study is such that, for a single location, the power of the
test is 0.9. Recall from your previous study of statistics that this means
that, for each of the 10 million tests, the probability of rejecting H0
when it is in fact false is 0.9.
Based on this scenario, answer the following questions.
(a) For how many of the hypothesis tests would we expect to reject H0
when it is in fact false (and hence correctly decide that there is an
association between the location and Type 1 diabetes)?
(b) For how many of the hypothesis tests would we expect to reject H0
when it is in fact true (and hence incorrectly decide that there is an
association between the location and Type 1 diabetes)?
(c) Using your answers to parts (a) and (b), calculate the proportion of
hypothesis tests which reject H0 that are actually associated with
Type 1 diabetes.
As you have seen in Activity 37, when carrying out multiple tests with lots
of hypotheses, it is quite easy to get in a situation where the null
hypotheses that really should be rejected are swamped by all the null
hypotheses that were spuriously rejected. This issue about multiple testing
is well-known in genetics research and means that, in practice, geneticists
would not set up a study exactly like this.
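The expected counts behind this swamping effect can be checked with a few lines of R, using the numbers from the scenario in Activity 37.

# Expected outcomes for the genetic association scenario.
n_assoc <- 3000        # locations where H0 is in fact false
n_null  <- 9997000     # locations where H0 is in fact true
alpha   <- 0.01        # significance level of each test
power   <- 0.9         # power of each test

true_pos  <- n_assoc * power    # expected correct rejections: 2700
false_pos <- n_null * alpha     # expected spurious rejections: 99 970

# Proportion of rejections that are genuine associations:
true_pos / (true_pos + false_pos)    # about 0.026, i.e. under 3%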
You may think that multiple testing only applies in studies where it is
obvious that very many hypotheses are being tested, such as in gene
association studies. Unfortunately, its effect is more widespread. It is easy
to get into the situation of carrying out lots of tests – enough to be a
problem. In particular, model building of the sort described in Section 3
also falls within the realm of doing lots of hypothesis testing, as you will
see in Activity 38.

Activity 38 Hypothesis testing in model building
One key activity in building a statistical model is deciding which
explanatory variables to include.
(a) Which hypothesis test is used to help decide whether an explanatory
variable should be in a model?
(b) When building a model, how often is a hypothesis test of the kind
identified in part (a) likely to be done: only once, a few times or many
times? Also, does this depend on the number of explanatory variables
that are being considered for the model?
In Activities 37 and 38, we have seen how multiple testing can arise within
a single study or data analysis. The fact that studies with statistically
significant results tend to gain far more attention than other studies – for
example, being published in peer-reviewed journals, or even just to get
written up – means that the problem with false positive results plays out
across the totality of studies being done. This has led some to argue that
many of the statistically significant results that are published are likely to
be spurious results, particularly for studies that are small (Ioannidis, 2005).
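One standard response within a single analysis is to adjust the p-values for the number of tests carried out. Base R provides p.adjust() for this; the p-values below are made up purely for illustration.

# Adjusting a set of p-values for multiple testing.
p_values <- c(0.001, 0.008, 0.012, 0.030, 0.045, 0.300)

# Bonferroni: each p-value is multiplied by the number of tests
# (and capped at 1), a simple but conservative adjustment.
p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead.
p.adjust(p_values, method = "BH")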
7.2 Overcoming the problem of multiple testing
In Subsection 7.1, you saw that multiple testing can lead to a large
proportion of false positive results (that is, hypothesis tests giving
significant results when in fact the null hypothesis is true).
One way of overcoming the problem of too many spurious results being
generated is to reduce the significance level. For example, in gene association studies the significance level might be of the order of 5 × 10⁻⁸.
However, if the size of the studies is kept the same (and there might be
reasons why a larger study is not feasible), a lower significance level means
that the power of the test is lower. This means that we become more likely
to have non-significant results when in fact we should be rejecting the null
hypothesis. For example, using the scenario in Activity 37, we become
more likely to miss locations on the genome where variations are
associated with Type 1 diabetes.
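This trade-off can be illustrated with base R’s power.t.test(), here for a two-sample t-test with a fixed sample size; the numbers are purely illustrative.

# Power of a two-sample t-test, 50 per group, medium-sized effect.
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power
# roughly 0.7

# The same study with a genome-scale significance level:
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 5e-8)$power
# close to zero: almost all true effects would now be missed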
An alternative way of overcoming the multiple testing problem is where
replication comes in. This can help to identify which statistically
significant results are true significant results and which are not, as you will
discover in Activity 39.

Activity 39 Predictivity of significant results – using replication
Consider once again the scenario given in Activity 37 in which a genetic
association study is focusing on Type 1 diabetes. In that activity, we
assumed the following.
• A total of 10 million locations of interest on the genome are to be
investigated.
• 3000 of these are actually associated with Type 1 diabetes.
• At each location, the significance level of the associated test is set
at 0.01.
• The size of the study is such that, for a single location, the power of the
test is 0.9.
In Activity 37, we found that fewer than 3% of the locations where we
rejected the null hypothesis actually correspond to true significant results.

Now, suppose that a follow-up association study is conducted. In this
study, only those locations where we rejected the null hypothesis in the
first study are considered (and nothing else), but the significance level and
power at each location remains the same: 0.01 and 0.9, respectively.
(a) Suppose that in the first study, we rejected H0 for 2700 of the
locations actually associated with Type 1 diabetes (this is the same
number as was expected in that study). For how many of these would
we expect to reject H0 a second time?
(b) Suppose that in the first study, we rejected H0 for 99 970 of the
locations which are not associated with Type 1 diabetes (this is again
the same number as was expected in the study). For how many of
these would we expect to reject H0 a second time?
(c) What proportion of the locations where the null hypothesis is rejected
a second time actually are associated with Type 1 diabetes?

So, as Activity 39 demonstrates, replication also helps to stop false


significant results drowning out the true significant results. Replication can
also provide evidence about the extent to which results can be generalised
by looking at similar, but not identical, set-ups. This means that research
which aims to replicate what others have done has its place in science, as
well as research that is doing something ground-breaking.
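Continuing the arithmetic from Activities 37 and 39 in R shows just how dramatic the improvement from one replication study can be.

# Numbers carried forward from the first association study.
alpha <- 0.01
power <- 0.9
true_pos_1  <- 2700     # truly associated locations that were rejected
false_pos_1 <- 99970    # spurious rejections

# The second study retests only these locations.
true_pos_2  <- true_pos_1 * power     # 2430
false_pos_2 <- false_pos_1 * alpha    # about 1000

# Proportion of second-time rejections that are genuine:
true_pos_2 / (true_pos_2 + false_pos_2)    # about 0.71, up from 0.026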
Summary
In this unit, we have been applying the techniques learnt in Units 1 to 4 to
a real data analysis task: modelling national Olympic success and using
that model to predict success at future Olympics. In doing so, you saw
that fitting a linear model or models is not necessarily the most difficult or
time-consuming aspect of the task!
Data need to be read into whatever statistical package is being used. A
format such as comma-separated values (CSV) can be read by many
different statistical packages. However, the resulting data frame may not
be ‘ready to use’ for the statistician’s particular problem and may need to
be adapted. Merging data from different sources also needs to be done
with care. It is important to ensure that information about observations
from different sources gets correctly linked together, and in some cases it
may not be possible to make that linkage with confidence.
The issue of missing data is another problem that often occurs in data
analysis. There are three types of missing data: missing completely at
random (MCAR), missing at random (MAR) and not missing at random
(NMAR). Strategies for dealing with missing data include complete case
analysis (keeping only those observations with no missing values), available
case analysis (dropping only those observations with missing values for
variables in the model) and imputation (replacing each missing value with

469
Unit 5 Linear modelling in practice

an estimate of the missing value). These strategies work best when missing
data are MCAR. When missing data are NMAR, all of these strategies are likely to lead to bias.
However an analysis is done, it is important to document it in sufficient
detail to be able to repeat what was done exactly: the analysis needs to be
reproducible. A statistician may need to return to the analysis long after
they have had time to forget the exact details. Documenting the analysis
is also important to allow others to reproduce the results.
Finally, the volume of significance testing conducted across the world has
led to what is known as the replication crisis. Many hypotheses, including
some which have become high-profile, are likely to be associated with
spuriously significant p-values. This has led to a greater recognition of the
need to replicate results, so that studies are repeated with the aim of
seeing whether a significant result is again obtained.
As a reminder of what has been studied in Unit 5 and how the sections in
the unit link together, the route map is repeated below.

The Unit 5 route map
Section 1: Specifying our problem
Section 2: Sourcing and preparing the data
Section 3: Building a statistical model
Section 4: Further modelling issues
Section 5: Missing data
Section 6: Documenting the analysis
Section 7: Replication
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate that sourcing and preparing data ready for modelling can be
a difficult and time-consuming task
• understand the importance of the data preparation stage: ‘garbage in,
garbage out’
• appreciate that it is not always obvious how the response and
explanatory variables should be specified, and that decisions sometimes
need to be made based on pragmatism
• appreciate that some explanatory variables can be treated as either a
factor or a covariate, and the pros and cons of each need to be
considered when choosing between the two
• appreciate that the explanatory variables may not always be of equal
importance to the analyst
• understand that when the primary focus of a study is to assess the
impact on the response of one key explanatory variable (of several), a
good strategy is to find the best model without including that key
explanatory variable, and then fit this best model with the key
explanatory variable also added in
• understand that whether or not an observation is considered to be an outlier depends not only on the other observations, but also on the assumed model for the data
• assess the impact of an outlier (or outliers) by fitting the model both
with and without the outlier(s) included, so that the results can be
compared
• use the mean squared error (MSE), the mean absolute percentage error
(MAPE) and prediction intervals to help assess predictions
• appreciate that a more complicated model is not necessarily better than
a simpler one
• understand the uses of training datasets and test datasets
• identify three types of missing data: missing completely at random
(MCAR), missing at random (MAR) and not missing at random
(NMAR)
• use and appreciate the pros and cons of three strategies for dealing with
missing data: complete case analysis, available case analysis and
imputation
• understand what reproducibility means and its importance
• appreciate that how an analysis is documented should depend on the
audience
• understand the notion of replication
• understand what the replication crisis is and its link to multiple testing
and the interpretation of p-values
• appreciate how replication can help to prevent false significant results
drowning out the true significant results
• read a CSV file into an R data frame
• create an R data file
• read an R data file into R
• change variable names in R
• change variable types in R
• combine data frames in R by adding rows of data (that is, by adding
observations) or by adding columns of data (that is, by adding variables
for existing observations)
• use R to calculate the MSE, the MAPE and the percentage of
observations contained in prediction intervals
• use R to assess predictions using a test dataset
• document an analysis in Jupyter.
References
‘2016 Summer Olympics medal table’ (2021) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=2016_Summer_Olympics_medal_table&oldid=1039342470 (Accessed: 24 August 2021).
‘2020 Summer Olympics medal table’ (2021) Wikipedia. Available at: https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table (Accessed: 9 August 2021).
BBC News (2016) ‘Rio 2016 Olympics: Russians “have cleanest team” as
271 athletes cleared to compete’, 5 August. Available at:
https://www.bbc.co.uk/sport/olympics/36970627 (Accessed: 21 March
2022).
Bredtmann, J., Crede, C.J. and Otten, S. (2016) ‘Olympic medals: Does
the past predict the future?’, Significance, 13(3), pp. 22–25.
doi:10.1111/j.1740-9713.2016.00915.x
Brownlee, K.A. (1965) Statistical theory and methodology in science and
engineering, 2nd edn. New York: John Wiley and Sons, pp. 454–55.
‘Cost of the Olympic Games’ (2022) Wikipedia. Available at: https://en.wikipedia.org/wiki/Cost_of_the_Olympic_Games (Accessed: 18 March 2022).
Crum, A.J. and Langer, E.J. (2007) ‘Mind-set matters: exercise and the
placebo effect’, Psychological Science, 18(2), pp. 165–171. Data obtained
from: https://dasl.datadescription.com/datafile/hotel-maids
(Accessed: 5 May 2020).
Crum, A.J., Corbin, W.R., Brownell, K.D. and Salovey, P. (2011) ‘Mind
over milkshakes: mindsets, not just nutrients, determine ghrelin response’,
Health Psychology, 30(4), pp. 424–429.
Curtice, J. et al. (1994) ‘The Opinion Polls and the 1992 General Election (Full Report)’. Available at: https://amsr.contentdm.oclc.org/digital/collection/p21050coll1/id/669 (Accessed: 11 July 2022).
Draganich, C. and Erdal, K. (2014) ‘Placebo sleep affects cognitive
functioning’, Journal of Experimental Psychology: Learning, Memory, and
Cognition, 40(3), pp. 857–864.
Dunbar, J. (2016) ‘Predicting the Rio Olympic medal table’, BBC News,
3 August. Available at: https://www.bbc.co.uk/news/magazine-36955132
(Accessed: 2 April 2022).
Gardner, M.J., Snee, M.P., Hall, A.J., Powell, C.A., Downes, S. and
Terrell, J.D. (1990a) ‘Results of case-control study of leukaemia and
lymphoma among young people near Sellafield nuclear plant in West
Cumbria’, British Medical Journal, 300(6722), pp. 423–429.
Gardner, M.J., Hall, A.J., Snee, M.P., Downes, S. Powell, C.A. and Terrell,
J.D. (1990b) ‘Methods and basic data of case-control study of leukaemia
and lymphoma among young people near Sellafield nuclear plant in West
Cumbria’, British Medical Journal, 300(6722), pp. 429–434.
‘Independent Olympic Athletes at the 2016 Summer Olympics’ (2021) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Independent_Olympic_Athletes_at_the_2016_Summer_Olympics&oldid=999908107 (Accessed: 20 August 2021).
Ioannidis, J.P.A. (2005) ‘Why most published research findings are false’,
PLoS Medicine, 2(8), e124.
International Olympic Committee (2021) Factsheet: The Games of the Olympiad. Available at: https://stillmed.olympics.com/media/Documents/Olympic-Games/Factsheets/The-Games-of-the-Olympiad.pdf (Accessed: 30 June 2022).
Michelmore, K. and Sassler, S. (2016) ‘Explaining the gender wage gap in
STEM: does field sex composition matter?’, RSF: The Russell Sage
Foundation Journal of the Social Sciences, 2(4), pp. 194–215.
Royal Statistical Society (2022) RSS – Journal Series C. Available at:
https://rss.org.uk/news-publication/publications/journals/series-c
(Accessed: 22 March 2022).
Rubin, D.B. (1976) ‘Inference and missing data’, Biometrika, 63(3),
pp. 581–592.
The World Bank Group (2020a) World Development Indicators – Themes – People. Available at: http://datatopics.worldbank.org/world-development-indicators/themes/people.html#population (Accessed: 13 August 2020).
The World Bank Group (2020b) World Development Indicators – Themes – Economy. Available at: http://datatopics.worldbank.org/world-development-indicators/themes/economy.html (Accessed: August 2020).
Wakeford, R. and Tawn, E.J. (1994) ‘Childhood leukaemia and Sellafield:
the legal cases’, Journal of Radiological Protection, 14, pp. 293–316.
‘Wikipedia:About’ (2021) Wikipedia. Available at: https://en.wikipedia.org/w/index.php?title=Wikipedia:About&oldid=1037955346 (Accessed: 20 August 2021).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.2, a gold medal from Tokyo 2020: Taken from:
https://triathlonmagazine.ca/racing/tokyo-2020-olympic-medals-revealed/
Subsection 2.1.2, Russia’s women’s gymnastics team at Rio 2016: © Agência Brasil Fotografias / https://commons.wikimedia.org/wiki/File:Russia_takes_silver_in_womens_artistic_gymnastics.jpg This file is licensed under the Creative Commons Attribution Licence http://creativecommons.org/licenses/by/3.0/
Subsection 2.1.3, Mary Hanna: © Franz Verhaus /
https://www.flickr.com/photos/franz-josef/7776871064/ This file is
licensed under the Creative Commons Attribution Licence
http://creativecommons.org/licenses/by/3.0/
Subsection 2.1.3, Hend Zaza: © Tim Clayton - Corbis / Contributor /
Getty Images
Subsection 2.2.3, somewhere in the Bahamas: © Lawrence Malvin /
www.123rf.com
Subsection 2.2.4, Olympic rings: https://olympics.com/ioc/olympic-rings
Subsection 2.2.4, person falling asleep in front of a computer © fizkes /
www.123rf.com
Subsection 4.1, COVID-19 vaccination: © milkos / www.123rf.com
Section 5, a ‘shy’ voter? © Prostockstudio — Dreamstime.com
Subsection 5.1, room attendants at work: © Olga Yastremska /
www.123rf.com
Subsection 5.2, students taking an exam © Wavebreak Media Ltd /
www.123rf.com
Figure 14: © Hjem / https://www.flickr.com/photos/hjem/1876641223/
This file is licensed under the Creative Commons
Attribution-Noncommercial-ShareAlike Licence
http://creativecommons.org/licenses/by-nc-sa/2.0/
Subsection 6.3, Sellafield nuclear site: © Simon Ledingham / https://en.wikipedia.org/wiki/Sellafield#/media/File:Aerial_view_Sellafield,_Cumbria_-_geograph.org.uk_-_50827.jpg This file is licensed under the Creative Commons Attribution-Noncommercial-ShareAlike Licence http://creativecommons.org/licenses/by-sa/2.0/
Section 7, milkshake: © Brent Hofacker / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
There is no single ‘correct’ answer here. These are some of the possible
ways to measure a nation’s success, but you may have thought of others.
• Count the number of medals (gold, silver or bronze) that a nation wins
overall. The more medals, the more successful the nation is.
• Assign a points score to each type of medal: 3 points for a gold, say,
2 points for a silver and 1 point for a bronze. Then calculate the total
points scored for each nation; the higher the score, the more successful
the nation is.
• For each sport, calculate the percentage of medals each nation wins, and
average these percentages across all of the sports. The higher the
average, the more successful the nation is.
• Calculate the average position of each nation’s athletes across all events.
This time, the lower the average, the more successful the nation is.
Solution to Activity 2
(a) The following explanatory variables are mentioned.
• The nation’s wealth. The richer a nation is, the more success they
are likely to have.
• The population size. The more populous a nation is, the more
success they are likely to have (although the video points out that
this is not necessarily the case!).
• Previous success at the Olympics. Nations that were successful
before are more likely to be successful again.
• Whether the nation is hosting the Olympics. The host nation does
better than might otherwise be expected.
• Whether a nation is going to host the next Olympics. The nation
that is going to host next is also likely to do better than might
otherwise be expected.
(b) Of the explanatory variables listed in part (a), arguably only
population size and hosting the Olympics are likely to affect a nation’s
Olympic success directly. A larger population means that there is a
greater pool of people from which to pick athletes, and so, all other
things being equal, the better someone has to be in order to be
amongst the very best in that nation. Also, it is suggested that being
a host nation – and therefore competing on home ground – inspires
those athletes to do better than they would if they were competing
elsewhere, because they don’t want to fail in front of home crowds.
The other explanatory variables listed in part (a) affect a nation’s
Olympic success more indirectly. Richer nations have more money to
invest in sport, and it is this extra money which drives the Olympic

476
Solutions to activities

success. Nations which have been successful at previous Olympics are


likely to be nations which invest in sport and feel that the Olympics is
important, and it is this investment and national sporting culture
which will directly affect the Olympic success. And finally, a nation
which is going to host the next Olympics needs to invest in the
Olympics for some time before hosting, and it is this investment
which will directly affect the nation’s Olympic success.
(c) Some more potential explanatory variables follow, but you may have
thought of others.
• The size of the nation’s team.
• Performance at the immediately preceding world championships or
other international competitions for the different sports in the
Olympics.
Solution to Activity 3
(a) That close to the Paris 2024 start, the teams are likely to have been
selected and all the previous Olympic Games will have happened. So,
yes, the values of both explanatory variables are likely to be known.
(b) Previous success at the Olympics will be known one year before
Paris 2024 starts. However, it is unlikely the selection of teams will be
confirmed that far in advance of Paris 2024, because the time period is
long enough for the form of individual athletes to change significantly.
(c) In this situation, only previous success at the Olympics will be a useful
potential explanatory variable. Just after Tokyo 2020 ended, the size
of teams for Paris 2024 was not known. So, if on 9 August 2021 we
did try to use a model that included the size of the team as an
explanatory variable to predict success at Paris 2024, we would first
need to predict what the sizes of all the teams for Paris 2024 will be!
Solution to Activity 4
(a) Wikipedia is a website based on the principle that what is contributed
is more important than who contributes it (‘Wikipedia:About’, 2021).
So there are few restrictions on who can add or amend content, or on
what changes they make. As such, there is no guarantee that what is
there is correct. However, this community aspect is also a strength as
it makes it easy for others to go in and correct material – particularly
for entries which are not contentious. So, we (the module team) think
the information on this Wikipedia page can be trusted to be correct.
(b) No, it is not. The article states that for boxing, judo, taekwondo and
wrestling, each event awarded two bronze medals. In addition, for a
few events, a tied result for a medal occurred. The website states that
a tie for a gold meant the tied athletes were each awarded the gold medal but no silver medal was awarded. Similarly, a tie for silver
meant the tied athletes were each awarded the silver medal but no
bronze medal was awarded.
(c) No, they did not. At these Olympics there was an ‘Independent
Olympic Athletes’ team who won a gold medal and a bronze medal.
This team was composed of athletes from Kuwait. They were unable
to represent Kuwait because of the suspension of the Kuwait Olympic
Committee from the International Olympic Committee. (‘Independent
Olympic Athletes at the 2016 Summer Olympics’, 2021)
There was also a Refugee Olympic Team. However, this team,
consisting of 10 refugee athletes from a variety of nations, did not win
any medals and hence does not appear in the medals table.
(d) No, it has not. The winners of four medals – one silver and three
bronzes – are listed as having been officially changed.
Solution to Activity 5
At Rio 2016, Russia’s team was depleted. At the time the article was
written, only 70% of Russia’s original team were judged eligible to
compete. Russian athletes in athletics and weightlifting faced a complete
ban, while athletes in seven other sports faced a partial ban. It is likely
that some of these banned athletes would have won medals if they had
been allowed to participate. Thus, Russia’s medal-winning potential was
diminished.
It is also worth noting that, by the same token, it means that the number
of medals won by other countries will be increased as a result of Russia’s
diminished participation. (The medals were still won by somebody!)
However, as this increase probably was spread across many other
countries, the impact on other countries is likely to have been much
smaller than that on Russia.
Solution to Activity 6
(a) The population size data from the World Bank appears to be very
reliable. They have collated data from national and international
bodies. They claim that the primary source they have used is the
most reliable dataset.
(b) They state that the estimates are based on census data. That is,
attempts by countries to count all of their population. However, even
in high-income countries they acknowledge that censuses do not
include everybody. Nor do censuses occur every year. So, for years
when a census has not occurred, the population of a country has to be
estimated. Besides, even if censuses were carried out every year,
populations are changing all the time as people die, are born, or
migrate.
Solution to Activity 7
(a) No, they do not. As you can see in Figure 2, it is possible to include
data that are composed of text, such as the name of a nation.
(b) The first line, which is often known as the ‘header’, gives names for
each of the variables in the file. This is not always done in such files,
but can be useful when it is. It provides an indication of what each
variable is for someone looking at the file, and can save the task of
giving variables sensible names once they have been read into a
statistical package.
(c) Each line represents the data for a single nation. More generally, in
such files each line gives the data for one (and only one) observation –
even if that means the line is long!
(d) The commas separate the information about each variable. Note that
this can be a problem if a variable can take values that contain a
comma.
Solution to Activity 8
(a) The best way is by adding the rows. This is because the data from
Rio 2016 effectively adds more observations about how many medals a
country actually gets at an Olympics. However, when doing this, it is
helpful to add an extra column that indicates which Olympic Games
the data come from.
(b) The medals tables only list those countries that have won at least one
medal. Those countries that took part, but did not win anything, are
not included. If a good model is to be built, then data from these
countries are also required. Just because a country did not win a
medal at Rio 2016 and Tokyo 2020 does not mean it will never win a
medal!
For example, at Tokyo 2020 three countries won their first ever
Olympic medals: Burkina Faso, San Marino and Turkmenistan.
Solution to Activity 9
Adding information about the population size and the GDP corresponds to
adding more variables. So more columns need to be added.
Solution to Activity 10
(a) It is likely that none of these would be automatically regarded as the
same. A person looking at these is likely to say they are the same
because they all contain ‘Japan’. However, the inclusion of ‘*’,
‘(JPN)’ or just an extra space, or spaces, could be enough for a match
not to be found by a computer. In these cases, though, it is possible
to pre-process the country names so that such differences do not cause
a match to fail.


(b) None of these are likely to be automatically regarded as matches,
since they look too different. However, the first two should be
matched. In the case of ‘Bahamas’ and ‘Bahamas, The’, the match
probably looks obvious to you because – unlike a computer – you are
able to bring meaning to the letters.
Knowing that in the Olympics ‘Great Britain’ and ‘United Kingdom’
are the same, but that ‘Czechoslovakia’ and ‘Czech Republic’ are not,
requires extra knowledge about the countries.
In most usages, ‘Great Britain’ includes England, Scotland and Wales,
whereas ‘United Kingdom’ includes England, Scotland, Wales and
Northern Ireland. However, at the Olympics, a team representing
Great Britain and Northern Ireland is formed which is called ‘Great
Britain’. So, in the context of predicting Olympic success, it is
reasonable to regard ‘Great Britain’ in terms of data about the
Olympics to be the same as ‘United Kingdom’ in terms of World
Bank data.
On the other hand, ‘Czechoslovakia’ and ‘Czech Republic’ should be
regarded as different in the context of predicting Olympic success,
because Czechoslovakia formally dissolved on 1 January 1993 and two
new independent states – the Czech Republic and Slovakia – were
formed in its place.
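One possible pre-processing step is sketched in R below. The clean function and the example names are our own illustration, not code from the module.

clean <- function(x) {
  x <- gsub("\\*", "", x)        # drop footnote markers such as '*'
  x <- gsub("\\(.*\\)", "", x)   # drop abbreviations such as '(JPN)'
  x <- gsub("\\s+", " ", x)      # collapse runs of spaces
  trimws(x)                      # remove leading and trailing spaces
}
clean(c("Japan*", "Japan (JPN)", " Japan "))
# All three reduce to "Japan", so an automatic match now succeeds.
# Matches such as 'Great Britain'/'United Kingdom' would still need a
# manually curated lookup table, since they require extra knowledge.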

Solution to Activity 11
The answer to this question partly depends on how common your name is,
and on your own preferences.
For instance, if you have a relatively common name, there will likely be
situations where others have the same name as you.
You might prefer to use a different name with friends, say, to that used
with relatives and/or on official forms.

Solution to Activity 12
(a) In this module you have so far learnt about:
• multiple regression (in Unit 2)
• regression with one factor and ANOVA (in Unit 3)
• regression with any number of covariates and factors (in Unit 4).
In addition, you should have met simple linear regression in your
previous study of statistics.
For all of these models, the response variable is assumed to be
continuous. (Note this does not mean that the response variable is
always continuous in a strict sense – just that it is not ‘too discrete’.)
Furthermore, for all of these models it is assumed that the deviations
of observations from their fitted or predicted values can be assumed to
have a normal distribution, with zero mean, and that the variance of
the deviations is the same for all observations.


(b) Simple linear regression is suitable when there is a single covariate.
Multiple regression includes models where there are multiple
covariates. In Unit 3, you saw how multiple regression can be used to
model an explanatory variable that is a factor through the use of
indicator variables. Finally, in Unit 4, we showed how multiple
regression can be set up so that it incorporates a number of covariates
and factors.

Solution to Activity 13
(a) No, it does not look symmetric. There are lots of countries that won a
relatively low number of medals, and very few countries that won a
large number of medals. However, this by itself
is not incompatible with the random variation being normally
distributed. It could be that the expected number of medals produced
by the model is similarly highly skewed.
(b) The data are discrete. The number of medals won by a nation is
necessarily an integer. However, it is reasonable to treat the data as
continuous since there are a hundred or more different values that the
number of medals could take.
(c) The lowest number of medals that a nation could win is zero. It
appears that a data value corresponds to this lower bound a lot of the
time. This could be a problem because the assumption that the
random variation is normally distributed means that, in theory at
least, there is no lower bound to the number of medals that could be
won. With much of the data at this bound, the model is liable to
imply that values less than the lower bound often happen.

Solution to Activity 14
The variable host takes two (coded) values:
• 0 if the nation is not the current host
• 1 if the nation is the current host.
So, it does not matter whether host is treated as a factor or as a covariate.
If host is declared to be a factor, all that would happen is that an
indicator variable would be created which would have exactly the same
(coded) values as host currently does.
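This is easy to verify in R. The sketch below uses a small invented data frame rather than the olympic data described in this unit.

toy <- data.frame(medals = c(10, 25, 12, 30),
                  host   = c(0, 1, 0, 1))
coef(lm(medals ~ host, data = toy))
coef(lm(medals ~ factor(host), data = toy))
# Both fits have the same fitted values: the indicator variable created
# for factor(host) takes exactly the same 0/1 values as host itself.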

Solution to Activity 15
Since year can take more than two different values, it matters whether
year is treated as a factor or as a covariate.
By treating year as a factor, all that is implied in the model is that the
number of medals a nation wins (partly) depends on which Olympics we
are talking about. It does not impose any ordering on this effect of year.
(See, for example, Unit 3, Subsection 1.3.)


By treating year as a covariate, the model assumes that there is a linear
effect of year. So, for example, if the increase in the estimated number of
medals won by a particular country from Atlanta 1996 to Sydney 2000
is 1.5, then the increase in the estimated number of medals from
London 2012 to Rio 2016 is also 1.5. (See, for example, Unit 1,
Subsection 4.1.)
Whether it is better to treat year as a factor or as a covariate is actually a
question without a clear answer.
By treating year as a factor we do not assume what ‘shape’ the
relationship between year and medals is. As such, there isn’t an
assumption to get wrong!
However, treating year as a covariate involves fitting fewer parameters,
and so the model is simpler. Furthermore, it means that we are able to
predict values for future Olympics. For example, if we want to predict
what will happen at the Paris 2024 Olympics, treating year as a covariate
means we just have to substitute ‘year = 2024’ into our regression
equation. If year is treated as a factor, then all we would know is that the
effect of year would be different to that for previous Olympics, but we
would not know in what way.
So in this unit we will treat year as a covariate.
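The distinction is easy to see in R. In this sketch the data are invented; only the modelling point matters.

toy <- data.frame(medals = c(5, 8, 9, 14),
                  year   = c(1996, 2000, 2004, 2008))
lm(medals ~ year, data = toy)           # covariate: a single linear effect
lm(medals ~ factor(year), data = toy)   # factor: a separate effect per Games
# Only the covariate version allows
# predict(fit, newdata = data.frame(year = 2024)) for a future Olympics.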

Solution to Activity 16
(a) Hosting the Olympics is associated with an increase of 12.6 medals
after adjusting for the number of medals the country won at the
previous Olympics.
(b) The model included the explanatory variables host along with
lagMedals, population and gdp. However, it is not possible to tell
exactly how they were included in the model. For example, whether it
was just population that had been included in the model, or just
log(population), or both population and log(population), or
indeed another transformation of population.

Solution to Activity 17
You have probably come across this term in the context of an observation
being noticeably different to the other observations. Often this is because
the value associated with this observation is particularly big or particularly
small in relation to the others.
When more than one variable is observed, it could be that the values for
each of the variables are not individually noticeably different to the rest of
the data. However, the combination of the values might be enough to
make an observation look different to the rest.


Solution to Activity 18
(a) There appear to be up to nine outliers in these data: these are the
observations indicated by individual points. All of these observations
correspond to nations that have high numbers of medals won. One of
these points, the one corresponding to over 100 medals (and checking
back in the data, this turns out to correspond to the USA), seems to
be particularly outlying.
(b) In this boxplot, none of the data points appears to be outlying!

Solution to Activity 19
(a) There appear to be two outliers, both corresponding to low
percentages of ammonia removed. Since the dataset only contains
21 observations, this corresponds to almost 10% of the observations.
(b) In the scatterplot, there appear to be two points that might be
regarded as outliers. The point corresponding to an air flow of just
over 60 and just over 97.0% of ammonia absorbed looks a bit lower
than other readings with similar air flow. More obviously, the point
corresponding to an air flow of 70 appears to have an unusually high
percentage of ammonia absorbed. The residual plot makes it clearer
that one point seems unusually high: the point with a fitted value of
between 97.0 and 97.5. This is the point corresponding to an air flow
of 70.
(c) From this residual plot it could be argued that there is just one
outlier, the point with the highest residual. The residual with the
lowest value is not very much less than some of the other residuals.

Solution to Activity 20
(a) Using a significance level of 0.05, the individual p-values suggest that
the number of medals depends on a country’s population but not on
its GDP per capita. However, if we also use a significance level of 0.05
for the p-value associated with the F -statistic, then that seems to
indicate that the model overall isn’t significant.
(b) There appears to be at least one outlier: the point with the highest
residual. This is because its residual is much bigger than the residuals
for the other points.
(c) The general conclusions have not changed. The revised individual
p-values suggest that the number of medals depends on a country’s
population but not its GDP per capita, and the revised p-value
associated with the F -statistic still indicates that the model overall
isn’t significant. It therefore doesn’t matter whether or not the outlier
is included in the analysis.


Solution to Activity 21
(a) This time, there again appears to be a significant effect of
population. However, the p-value for gdp is just above the p < 0.05
threshold that we’re using here. This suggests that there is evidence
that a model should include population, but not gdp. There is also
strong evidence from the p-value associated with the F -statistic that
the model is significant.
It is worth noting here that, although we’ve specifically chosen to use
p < 0.05 as our threshold for deciding on significance (to help make a
teaching point about outliers!), this activity – with an individual
p-value just above 0.05 – illustrates just how arbitrary picking a
significance level can be!
(b) Again there appears to be one clear outlier. This time it corresponds
to the point with a residual of about 40.
(You may have thought that the point relating to the large fitted
value greater than 50 is an outlier because this point lies far away
from the other points in the plot. However, this point is not an
outlier, because the value of the residual for this point is very small,
which means that the model fitted this particular point well.)
(c) This time, whether the general conclusions have changed is debatable.
In particular, the p-value for gdp is now below the cut-off of 0.05
(slightly). So, without the outlier, it looks like the number of medals
won may depend on both GDP per capita and population size.
However, we still have the same conclusion that the overall model is
significant. So whether or not the outlier is included (and, of course,
how we might interpret the results) may well affect our conclusions.
Again it is worth noting how the arbitrariness of picking a significance
level can affect the decisions that we may make!

Solution to Activity 22
(a) For this third subset of data, the individual p-values suggest there is a
significant effect for population but not for gdp. The p-value
associated with the F -statistic also suggests that overall the model is
significant.
(b) This time there appear to be two outliers, both having fitted values
over 25.
(c) This time, it matters a lot whether the outliers are included or not!
This is because, without them in the dataset, neither gdp nor
population appears to be needed in the model. What's more, while
the overall model was significant when the outliers were included
(with p = 0.005), it was no longer so when the outliers were removed
(with p increasing to 0.411!).


Solution to Activity 23
(a) A predicted value can be calculated by substituting the corresponding
values of the explanatory variables into the regression equation. See
Section 6 in Unit 1 and Section 2 in Unit 2.
(b) For this model, the regression equation is as follows:
medals = 0.258 + 0.961 lagMedals + 12.639 host.
Now, when lagMedals = 17 and host = 1 are substituted into this
equation, we get
predicted medals = 0.258 + 0.961 × 17 + 12.639 = 29.234 ≃ 29.2.
In other words, using this model such a nation is predicted to win
29 medals.
(c) The actual number of medals that Brazil won at Rio 2016 is below the
lower limit of the 95% prediction interval. So no, the observed value
of the response did not lie in the 95% prediction interval.
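The same calculation, sketched in R using the coefficients quoted above:

b0 <- 0.258; b1 <- 0.961; b2 <- 12.639
b0 + b1 * 17 + b2 * 1    # 29.234, so about 29 medals
# Given the fitted lm object, say fit, the point prediction and the 95%
# prediction interval could instead be obtained with
# predict(fit, newdata = data.frame(lagMedals = 17, host = 1),
#         interval = "prediction").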

Solution to Activity 24
(a) The regression model with the most terms provides the closest fit to
the data. The curve appears to go through more than half of the
points – more than for any of the other curves. The curve is also close
to the other points.
(b) The MSE associated with each of the models goes down as the
number of terms increases. This is to be expected. Having extra
terms in the regression model gives the curve more flexibility to get
closer to the data points.
(c) Based on Figure 12, there isn’t a definitive answer to this. However,
we (the module team) would say that either the regression with 2
terms or the regression with 4 terms best represents the relationship.
In both cases, the change in y for a small change in x remains similar
across the range of values depicted.
When 8 and 16 terms are used for the regression model, the
relationship between x and y between observed data points varies
much more. Most noticeably, with 16 terms, the regression model
implies some sharp changes in the value of y associated with very
small changes in x, for example when x < 2.5. Such dramatic changes
in the relationship between x and y are possible, but would generally
be accepted as unlikely without further evidence for it.
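The fall in MSE as terms are added can be demonstrated with simulated data. This sketch does not use the data behind Figure 12; it simply fits polynomial regressions of increasing size to invented data.

set.seed(1)
x <- runif(20, 0, 10)
y <- 3 + 0.5 * x + rnorm(20)
for (p in c(1, 3, 7, 15)) {   # 2, 4, 8 and 16 terms, counting the intercept
  fit <- lm(y ~ poly(x, p))
  cat(p + 1, "terms: MSE =", round(mean(resid(fit)^2), 3), "\n")
}
# The training MSE always decreases as terms are added, even though the
# more flexible fits are chasing the random scatter in the data.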


Solution to Activity 25
In all four cases, the regression model follows the data reasonably closely.
In the case of the regression models with 8 and 16 terms, the data do not
give compelling evidence for the extra complexity that these models provide.
Out of the remaining two regression models, there is some suggestion that
the relationship between x and y is non-linear, which is reflected in the
regression model with four terms but not the one with two terms. So, for
the module team, the best model (just) appears to be one with four terms.

Solution to Activity 26
(a) The Democratic People’s Republic of Korea, with its strongly
authoritarian regime and closed society (at least for the years 1996 to
2016), is not likely to follow the pattern set by most other countries.
So, by not having the data from this nation in the dataset, the range
of variation is likely to be less than it should be.
(b) If anything, not including the data for the Czech Republic might
enhance the representativeness of the data (at least, if it is intended
that the model be applied when nations are stable). This is because,
although the dissolution of Czechoslovakia into the Czech Republic
and Slovakia was a relatively peaceful transition, it means that any
short-term effects of the transition will not impact on the modelling.

Solution to Activity 27
From the table it is clear that not all of the measurements were recorded
for all of the room attendants. For example, the variable percent2 was
only recorded for two of these six room attendants.

Solution to Activity 28
(a) If a value was recorded for every variable for every room attendant,
then the total number of values in the dataset would be 75 × 9 = 675.
(b) Overall, 0 + 1 + · · · + 9 + 9 = 48 values are missing in the dataset.
Therefore, the number of values from the possible 675 which were
recorded is 627, which corresponds to 92.9% of the total number of
values that could have been recorded.
(c) Only those room attendants who did not have any missing values
would be included in the complete case analysis, which means that
47 of the 75 room attendants would be included. Since each room
attendant contributes nine values, this corresponds to
47 × 9 = 423 of the 675 values, that is, 62.7% of the individual
data values.
(d) A complete case analysis would therefore use only 62.7% of the values
that would ideally have been there, whereas 92.9% of the values were
actually collected. So, a complete case analysis fails to make use of
about 30% of the values.
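The arithmetic can be checked with a few lines of R (the counts here are those given in the activity):

n_attendants <- 75
n_vars <- 9
total <- n_attendants * n_vars         # 675 possible values
n_missing <- 48
100 * (total - n_missing) / total      # 92.9% of values recorded
n_complete <- 47                       # attendants with no missing values
100 * n_complete / n_attendants        # 62.7% used in a complete case analysis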


Solution to Activity 29
(a) Data from the room attendants with the following identification
numbers would be included: 35, 36, 37, 39 and 40. The data from
room attendant 38 would be dropped because the value for percent is
missing.
(b) Data from all six of these room attendants would be included since all
values for informed and percent2 were recorded.
(c) Data from the same group of room attendants as in part (a) would be
included: room attendants 35, 36, 37, 39 and 40. Again the lack of a
percent value for room attendant 38 means that data from her would
be dropped.
(d) Data from room attendants 35, 37, 39 and 40 would be included when
this model is fitted. This is because having a missing value for any of
the variables would be enough for the data from a room attendant to
be dropped.
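The behaviour described here is what lm() does by default: a row is dropped only when a variable actually used in the model formula is missing. A sketch with an invented fragment of data:

toy <- data.frame(percent  = c(80, NA, 75),
                  percent2 = c(32.8, 34.8, NA),
                  informed = c(1, 1, 0))
nobs(lm(percent ~ informed, data = toy))   # 2: row with missing percent dropped
nobs(lm(percent2 ~ informed, data = toy))  # 2: row with missing percent2 dropped
complete.cases(toy)                        # only the first row is complete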

Solution to Activity 30
(a) For percent2 the sample mean based on the first six room attendants
is
(1/6) × (32.8 + 34.8 + 34.8 + 42.4 + 34.8 + 34.8) = 214.4/6 ≃ 35.7.
(b) The sample mean based on all the room attendants will be 34.8.
This is because this mean can be thought of as a weighted average
of the mean for the values that were not missing and the mean of the
imputed values. By assuming that the missing values are in fact equal
to the sample mean for all the observed values (34.8 in this case), the
mean of the imputed values will be the same as the sample mean for
all the observed values (because all of the imputed values have the
same value 34.8). So, since the weighted average of 34.8 and 34.8 is
equal to 34.8, the sample mean based on all values must be 34.8.
(c) The standard deviation based on the imputed version of the dataset
will be less than 6.13. The imputation is making available some values
that are identical to the sample mean, so less variable than the
observed values. Thus, including the imputed values will reduce the
variability and hence the standard deviation. In fact, it turns out to
be 5.28.
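The effect of mean imputation on the mean and the standard deviation can be seen directly in R. This sketch imputes three hypothetical missing values using just the six observed values above; the figures of 34.8, 6.13 and 5.28 in the solution refer to the full dataset.

x_obs <- c(32.8, 34.8, 34.8, 42.4, 34.8, 34.8)
x_imp <- c(x_obs, rep(mean(x_obs), 3))  # impute the mean for 3 missing values
c(mean(x_obs), mean(x_imp))             # the mean is unchanged
c(sd(x_obs), sd(x_imp))                 # the standard deviation shrinks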


Solution to Activity 31
(a) MCAR. There is no reason that the participant’s non-attendance, and
hence lack of a blood pressure reading, is linked to what the blood
pressure reading actually is, or measurements taken at previous clinic
visits. Unless forgetfulness is a side effect of the drug in the trial!
(b) MAR. The non-attendance is linked to something that is already
known (that is, which drug the participant received as part of the
trial). (If the illness were linked to blood pressure, then the missing
data type would be NMAR.)
(c) NMAR. The reason why the participant did not attend the clinic is
linked to their blood pressure.
(d) MCAR. There is no reason why the participant getting a new job
should be linked to their blood pressure or the drugs they were given.
(e) NMAR. In this case, the failure of the participant to attend the
follow-up clinic is clearly linked to their blood pressure.

Solution to Activity 32
(a) The issue that we (the module team) chose was volume 64, issue 5
(November 2015).
(b) There were six articles in our chosen issue. (You will probably have
chosen a different issue and hence get different results.)
(c) In our chosen issue, the articles, along with the dates on which they
were first received and revised, are as follows.

• Modelling the type and timing of consecutive events: application to
predicting preterm birth in repeated pregnancies (received
September 2013, revised November 2014)
• Modelling short- and long-term characteristics of follicle stimulating
hormone as predictors of severe hot flashes in the Penn Ovarian
Aging Study (received October 2013, revised November 2014)
• A spatiodynamic model for assessing frost risk in south-eastern
Australia (received September 2013, revised January 2015)
• Measuring and accounting for strategic abstentions in the
US Senate, 1989–2012 (received August 2013, revised November 2014)
• Heteroscedastic conditional auto-regression models for areally
referenced temporal processes for analysing California asthma
hospitalization data (received April 2014, revised February 2015)
• A marginalized zero-inflated Poisson regression model with random
effects (received October 2013, revised January 2015)


(d) The length of time (in months) between the two dates for the six
articles is as follows:
14 13 16 15 10 15.
The authors were therefore likely to be returning to their analysis
about a year after they first analysed the data for submission to the
journal. That is lots of time for them to forget the details if they have
not been recorded somewhere!
You may well have found similar results for other issues of the journal,
at least for issues published when this unit was written.

Solution to Activity 33
One issue was whether all the cases were living in West Cumbria at the
time of diagnosis. Although this might seem an easy thing to define, here
the difficulty arose in deciding the place of residence of somebody who was
living away from home while studying at university.
The other issue was whether the individuals in the study needed to have
been born in West Cumbria or whether it was enough that they had lived
there.

Solution to Activity 34
(a) The data came from the Scientists and Engineers Statistical Data
System (SESTAT). In particular, they used six waves of SESTAT
from 1995 to 2008. SESTAT combines data from three surveys: the
National Survey of College Graduates Science and Engineering Panel,
the National Survey of Recent College Graduates, and the Survey of
Doctoral Recipients.
(b) No, not all of them were included. They had to have received their
bachelor’s degree between 1970 and 2004, have majored in STEM and
be working in STEM. They also needed to be working at least
35 hours per week.
The analysis was run separately for what the authors regarded as the
four main STEM categories, and so only data for a single category
were considered at a time.
(c) The authors used the hourly wage of the individuals as the response.
They transformed this variable by calculating the log of it. The
authors also noted that they had to calculate the hourly wage from
the annual salary by dividing by the weeks worked per year and hours
worked per week.
(d) The authors state that the gender of the respondent was the
explanatory variable they were most interested in. (In fact, this is
made clear further on in the paper as the gender of the respondent
was the only variable included in Model 1.)
(e) Yes, it does appear that the analysis will be reproducible.


Solution to Activity 35
(a) The response variable is the total number of medals won. This is the
same response variable that we have been using.
(b) The authors considered a total of eight explanatory variables: Lag
Medals, ln GDP, ln Pop, Host, Next Host, Planned, Muslim and Year.
Of these, Host, Next Host, Planned and Muslim seem to have been
treated as categorical (because the article says that these are
‘indicator variables’), whereas Lag Medals, ln GDP, ln Pop and Year
appear to have been treated as continuous variables.
Note that the categorical variables Host, Next Host, Planned and
Muslim can each take only one of two values. So, as you found in
Activity 14 (Subsection 3.3.1) for host in olympic, it does not matter
whether they are treated as being continuous or categorical.
(c) The difference between the two models lies with the explanatory
variables included. In the ‘naive’ model, just Lag Medals and Year are
included. In the ‘sophisticated’ model, all eight explanatory variables
are included.

Solution to Activity 36
(a) The descriptions of the variables are similar in the two articles. In
both cases, all of the variables are listed along with explanations of
why it is reasonable to include them. Notice that in both cases, the
authors make it clear that the explanatory variables are not intended
to represent direct causes, but rather indirect measures of other effects.
(b) The article written for the BBC News website says very little about
the fitted models. All the article really describes is which variables
were included in the model. It does not say anything about the type
of model. In the other article, there is a box giving the equation of
the models fitted, along with how they were fitted (‘using ordinary
least squares’).
(c) In the BBC News website article, no statistical technical terms were
used. The article therefore does not require the reader to have any
statistical training. Instead, the article sticks with language which a
reader proficient in English should be able to understand.

Solution to Activity 37
(a) H0 is false for the 3000 locations associated with Type 1 diabetes, and
the probability of rejecting H0 when it is indeed false is the power of
the test, 0.9. So, the number of hypothesis tests for which we would
expect to reject H0 when it is false is
3000 × 0.9 = 2700.


(b) H0 is true for the 9 997 000 locations which are not associated with
Type 1 diabetes, and the probability of rejecting H0 when it is true is
the significance level, 0.01. So, the number of hypothesis tests for
which we would expect to reject H0 when in fact it is true is
9 997 000 × 0.01 = 99 970.

(c) From part (a), we expect to reject H0 in 2700 hypothesis tests when
H0 is false, and from part (b), we expect to reject H0 in 99 970
hypothesis tests when H0 is true. So, in total, we expect to reject H0
2700 + 99 970 = 102 670 times.
Therefore, the proportion of these that are actually associated with
Type 1 diabetes is 2700/102 670 ≃ 0.0263.
In other words, fewer than 3% of the hypothesis tests expected to
have significant results in this study (that is, the ‘positives’) are likely
to correspond to locations that are actually associated with Type 1
diabetes!
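The expected counts in this solution are quickly reproduced in R:

power <- 0.9
alpha <- 0.01
true_pos  <- 3000 * power           # 2700 correct rejections
false_pos <- 9997000 * alpha        # 99 970 false rejections
true_pos / (true_pos + false_pos)   # about 0.026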

Solution to Activity 38
(a) As you saw in Activity 26 (Subsection 5.1) of Unit 2, in order to help
decide whether an explanatory variable should be in a model, we can
test whether the associated regression coefficient, βj say, is zero. We
therefore test the hypotheses
H0: βj = 0,   H1: βj ≠ 0.
The test is completed by comparing a test statistic of the form
β̂j / (standard error of β̂j) against a t-distribution.
(b) Such a test is likely to be done many times. When there are q
variables in a model, the p-values from q such tests are presented.
Furthermore, it is unlikely that just one such model would be fitted to
the data. So, with multiple such models likely to be fitted to the data,
the situation where many tests are considered quickly arises.

Solution to Activity 39
(a) The probability of rejecting H0 when it is indeed false is still 0.9.
So, the expected number of locations associated with Type 1 diabetes
where the null hypothesis would be rejected a second time is
2700 × 0.9 = 2430.
(b) The probability of rejecting H0 when it is true is still 0.01.
So, the expected number of locations not associated with Type 1
diabetes where the null hypothesis would be rejected a second time is
99 970 × 0.01 = 999.7.


(c) The expected proportion is 2430/(2430 + 999.7) ≃ 0.709.
So, under these conditions, approximately 71% of the locations where
the null hypothesis is rejected twice will correspond to locations which
are associated with Type 1 diabetes.
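Continuing the R sketch from Activity 37, requiring rejection in a replication as well gives:

true_pos2  <- 2700 * 0.9              # 2430 locations correctly rejected twice
false_pos2 <- 99970 * 0.01            # 999.7 locations falsely rejected twice
true_pos2 / (true_pos2 + false_pos2)  # about 0.71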

Index

: (model notation) 324, 350, 368, 370
∗ (model notation) 325, 350, 369

adjusted for 427
adjusted R2 statistic 162
AIC 165
Akaike information criterion 165
analysis of variance 239
analysis of variance table 251
ANOVA 239
   extended table 265
   table 251
   test 248
assumption (model) 47, 79, 110, 236, 314, 331
athlete (definition) 400
available case analysis 449

backward stepwise regression 171
baseline mean 215
brexit 101

carPrices 132
comma-delimited file 410
complete case analysis 447
confidence level 60
constant variance assumption 47
contrast 256, 257
   maximum number 272
   multiple 271
   testing 264
Cook's distance 123
   formula 123
   plot 124
correlation matrix 151
covariate 203
CSV file 410

data
   experimental 16
   missing 23
   natural science 17
   observational 16
   primary 14
   secondary 14
   social science 17
data sources 17
dataset
   Brexit 101
   car prices 132
   employment rates 334
   Facebook 144
   FIFA 19 28
   films 137
   lagoon 240
   manna ash trees 22
   Olympics 416
   Olympics 2020 442
   OU students 359
   pea growth 253
   Peru 177
   placebo effect 446
   rats and protein 346
   roller coasters 105
   test 440
   training 440
   wages 206
diagnostic checks 109
diagnostics 109
distribution
   t 43
   normal 34
   standard normal 34
dummy variable 218

effect
   parameter 215
   term 215
employmentRate 334
error term 33
ESS 158, 161, 244
experimental data 16
explained sum of squares 158, 161, 244
explained variation 157, 244
explanatory variable 5
extended ANOVA table 265
extrapolation 59

F-distribution 94
F-ratio 248
F-statistic 94, 233
F-test 311
   for an interaction 328, 355
   in multiple regression 93, 98
   whether a factor should be in a model 311
   whether both factors are needed 342
F-value 248
facebook 144
factor 203
   code for level 207, 211
false positive 466
fifa19 28
films 137
fitted line 38
forward stepwise regression 166, 167
full model 171
function (in R) 12

general linear model 371
graphics 27

hierarchical principle 370

imputation 451
independence assumption 47
indicator variable 218, 225
influential points 121
interaction 324, 325, 349, 368
   four-way 369
   notation 324
   order 369
   three-way 369
   two-way 369
interaction parameter 324, 325
interpolation 59
interpretation 46

Jupyter Notebook 8

ladder of powers 131
lagoon 240
level 207
leverage 114, 118
   formula 118
LFS 205
linearity assumption 47

mannaAsh 22
MAPE 437
MAR 457
Markdown 10, 11
MCAR 457
mean absolute percentage error 437
mean square 251
mean squared error 437
means plot 344
missing at random 457
missing completely at random 457
missing data 23, 443
   impact 457
   reasons 452
missing data type 457
   MAR 457
   MCAR 457
   NMAR 457
model
   confirmation 443
   exploration 443
   full 171
   general linear 371
   multiple covariates and factors 366
   multiple factors but no interactions 362
   multiple regression 89
   non-parallel slopes 322, 325
   null 167
   parallel slopes 303, 304
   regression with a factor 217
   simple linear regression 35
   two factors 353
   two factors that do not interact 338
   two factors that interact 353
modelling process 3
MSE 437
multicollinearity 147
multiple contrasts 271
multiple correlation coefficient 161
multiple linear regression 81
multiple linear regression model 89
multiple regression 81
   assumptions 110
   estimated equation 89
   fitted equation 89
   model 89
   prediction 104, 107
   testing all regression coefficients 93, 98
   testing individual regression coefficients 95, 99
   transformation 130

natural science data 17
nested model 309
NMAR 457
non-parallel slopes model 322, 325
   assumptions 331
normal curve 34
normal distribution 34
normal probability plot 53, 111
normal score 54
normality assumption 48, 54
not missing at random 457
notebook 10
null regression model 167
numerical summaries 27

object (in R) 12
observational data 16
olympic 416
olympic2020 442
ONS 205, 207
ouStudents 359
outlier 26, 431
over-fit 439

p-value 45
   interpretation 46
parallel slopes model 303, 304
   assumptions 314
parsimony 145
partial regression coefficients 90
pea 253
percentage of variance accounted for 162
peru 177
placeboEffect 446
point prediction 57, 104
prediction interval 60, 107
primary data 14
principle of parsimony 146

QQ plot 53

R 8
   dataframe 12
   function 12
   object 12
   vector 12
R2a statistic 162
R2 statistic 161
random term 33
ratProtein 346
raw residual 54
rcoaster 105
regression coefficients 91
   interpretation 91
   partial 90
regression function 32
regression with a factor
   fitted value 229
   model 217, 225
   model notation 216
   predicted value 231
   testing effect terms 234
   testing relationship 233
regression, simple linear 31
relationship 44
replication 465
reproducibility 458
residual 40
   raw 54
   standardised 54
residual plot 48, 111
residual sum of squares 158, 159, 161, 244
residual variation 157, 244
residuals versus leverage plot 119, 122
response 5
response variable 5
RSS 158, 159, 161, 244
rule of thumb
   correlation 151
   standardised residuals 121

scatterplot matrix 147
secondary data 14
simple linear regression 31
   assumptions 47
   coefficients 35
   fitted model 38
   fitted value 40
   model 35
   model notation 37
   p-value 46
   prediction 57, 60
   residual 40
   residual plot 48
   t-value 44
   test statistic 44
   testing 44
social science data 17
Soundex 415
standard error 43
standard normal distribution 34
standardised residual 54
statistical modelling process 3
stepwise regression 154
   backward 171
   forward 167
   strategy 173
sum of squares 244, 251
   partition 160
   partitioning 250, 261

t-distribution 43
t-test in multiple regression 96, 99
t-value 44
test dataset 440
theoretical quantile 54
total sum of squares 157, 161, 244
total variation 157, 244
training dataset 440
transformation 130
Treezilla 21
TSS 157, 161, 244
two factor model 338, 353
two-sided test 45

variable selection 154
variance estimate 248
variation 244
   explained 157
   residual 157
   total 157
vector (in R) 12
visual summaries 27

wages 206
