
M348

Applied statistical modelling

Book 2
Generalised linear models
This publication forms part of the Open University module M348 Applied statistical modelling. Details of this
and other Open University modules can be obtained from Student Recruitment, The Open University, PO Box
197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)300 303 5303; email [email protected]).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn more about
the wide range of modules and packs offered at all levels by The Open University.

The Open University, Walton Hall, Milton Keynes, MK7 6AA.


First published 2022.
Copyright © 2022 The Open University
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, transmitted or
utilised in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without
written permission from the publisher or a licence from The Copyright Licensing Agency Ltd. Details of such
licences (for reprographic reproduction) may be obtained from The Copyright Licensing Agency Ltd, 5th Floor,
Shackleton House, 4 Battle Bridge Lane, London, SE1 2HX (website www.cla.co.uk).
Open University materials may also be made available in electronic formats for use by students of the
University. All rights, including copyright and related rights and database rights, in electronic materials and
their contents are owned by or licensed to The Open University, or otherwise used by The Open University as
permitted by applicable law.
In using electronic materials and their contents you agree that your use will be solely for the purposes of
following an Open University course of study or otherwise as licensed by The Open University or its assigns.
Except as permitted above you undertake not to copy, store in any medium (including electronic storage or use
in a website), distribute, transmit or retransmit, broadcast, modify or show in public such electronic materials in
whole or in part without the prior written consent of The Open University or in accordance with the Copyright,
Designs and Patents Act 1988.
Edited, designed and typeset by The Open University, using LaTeX.
Printed in the United Kingdom by Halstan Printing Group, Amersham.

ISBN 978 1 4730 3553 9


Contents

Unit 6 Regression for a binary response variable
Introduction
1 Setting the scene
  1.1 Binary response variables
  1.2 Initial modelling ideas
    1.2.1 Modelling data on companies across Europe
    1.2.2 Modelling data on patient survival
2 Introducing logistic regression
  2.1 Modelling the success probability
  2.2 The logistic function
  2.3 The logistic regression model
3 Interpreting the logistic regression model
  3.1 Odds and log odds
  3.2 Interpreting a regression coefficient
  3.3 Interpreting more than one regression coefficient
4 Using the logistic regression model
  4.1 Exploring some fitted logistic regression models
  4.2 Using R for logistic regression
5 Assessing model fit
  5.1 The likelihood as a measure of model fit
  5.2 The residual deviance
  5.3 The null deviance
  5.4 Using R to assess model fit
6 Choosing a logistic regression model
  6.1 Comparing nested models
  6.2 Comparing non-nested models
  6.3 Using R to compare model fits
7 Checking the logistic regression model assumptions
  7.1 The model assumptions
  7.2 Deviance residuals
  7.3 Diagnostic plots for logistic regression
    7.3.1 Standardised deviance residuals against a transformation of µ̂
    7.3.2 Standardised deviance residuals against index
    7.3.3 Squared standardised deviance residuals against index
    7.3.4 Normal probability plot
  7.4 Using R to produce diagnostic plots
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Unit 7 Regression for other response variables
Introduction
1 Setting the scene
  1.1 Some non-normal response variables
  1.2 A closer look at count response variables
2 Building a model
  2.1 A unified model form for linear regression and logistic regression
  2.2 Using the model form for count responses
3 The generalised linear model (GLM)
  3.1 The model
  3.2 Link functions
  3.3 Inverse link functions
  3.4 Fitted mean responses and prediction
  3.5 Using R for Poisson GLMs
4 GLMs for two more response variable distributions
  4.1 GLMs for responses with exponential distributions
  4.2 GLMs for responses with binomial distributions
  4.3 Using R for exponential and binomial GLMs
5 Assessing model fit and choosing a GLM
  5.1 Assessing model fit
  5.2 Comparing GLMs
  5.3 Using R to assess model fit and choose a GLM
6 Checking the GLM model assumptions
  6.1 The GLM model assumptions
  6.2 Diagnostic plots for GLMs
    6.2.1 Standardised deviance residuals against a transformation of µ̂
    6.2.2 Standardised deviance residuals against index
    6.2.3 Squared standardised deviance residuals against index
    6.2.4 Normal probability plot
  6.3 Using R to produce diagnostic plots for GLMs
7 Common issues in practice
  7.1 Overdispersion
  7.2 Modelling Poisson rates
  7.3 Poisson rates and overdispersion in action
    7.3.1 A dataset involving counts over varying time lengths
    7.3.2 Poisson rates and overdispersion in R
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Unit 8 Log-linear models for contingency tables
Introduction
1 The modelling problem
  1.1 Contingency table data
  1.2 A modelling strategy
2 Introducing log-linear models for two-way contingency tables
  2.1 The response variable for two-way contingency table data
  2.2 The log-linear model for independent variables
  2.3 Does the fitted log-linear model work?
3 Are the classifying variables in a two-way table independent?
  3.1 Visualising two-way contingency tables
  3.2 Testing for independence
  3.3 Two-way contingency tables in R
4 Contingency tables with more than two variables
  4.1 Extending the log-linear model
  4.2 Choosing a log-linear model
  4.3 Some restrictions when choosing a log-linear model
  4.4 No diagnostic plots?
  4.5 Using R for log-linear models for more than two variables
5 How are the classifying variables related?
  5.1 Types of relationships for three-way tables
  5.2 Using R to identify relationships
6 Logistic and log-linear models
  6.1 Logistic regression for contingency table data
  6.2 Relationship between logistic and log-linear models
  6.3 Which model to use: logistic or log-linear?
Summary
Learning outcomes
References
Acknowledgements
Solutions to activities

Index
Unit 6
Regression for a binary response variable

Introduction
The models considered so far in this module (namely, simple linear
regression, multiple regression and ANOVA) are all forms of linear models.
Although there has been variety in the type and number of explanatory
variables that we've been able to incorporate into these linear models, the
choice of response variable has been restricted by two conditions: the
response must be continuous (or at least not too discrete), and its random
variation about the model must be plausibly normally distributed.
Under this normality assumption, the response variable itself is usually
(although not necessarily) approximately normally distributed. But what
if the response variable is not even close to being approximately normal?
Is it still possible to build a statistical model then? Well, for many
response variables, yes it is! The focus of Units 6, 7 and 8 is the
development of statistical models for such non-normal response variables.
Collectively, these models are known as generalised linear models.
We’ll start in this unit by considering just one sort of non-normal response
variable: a binary response variable; that is, a response variable which can
only take one of two possible values.

How Unit 6 relates to the module so far: moving on from regression with a
normally distributed response variable (Unit 4) to what's next – regression
with a binary response variable.

The type of modelling problem that we'll be focusing on in this unit is
illustrated in Example 1. (We'll develop models for other non-normal
response variables in Unit 7.)


Example 1 Credit card fraud?


Credit card fraud costs banks and card companies large sums of
money. Fraud detection is therefore extremely important for credit
card companies. One fraud prevention method currently used by card
companies is based on the use of models and real-time data.
When a credit card transaction takes place, the bank collects real-time
data on various features associated with the transaction, such as the
amount of money, where the transaction takes place, what type of
purchase has been made, and so on. This helps to build up a picture
of the customer’s spending habits over time so that, together with
information about the person making the transaction, such as their
credit rating, any unusual behaviour can be flagged up as potentially
fraudulent. Indeed, you may yourself have received a call from your
credit card company in the past if you have made a transaction which
is unusual based on your regular credit card spending habits.
So, for each credit card transaction, the card company wants to
predict whether the transaction is legitimate or fraudulent, based on
data regarding the transaction and the person making the
transaction. For this situation, the credit card company is interested
in modelling the response variable:
• transaction: a binary variable taking the two possible values
‘legitimate’ or ‘fraudulent’
using several possible explanatory variables, such as:
• amount: the transaction amount
• place: where the transaction takes place
• type: the type of purchase
• score: the credit score rating of the person making the transaction.

In Section 1 of the unit, we'll consider binary response variables further
and explore some initial ideas for modelling them. We'll see that it is
possible to model a binary response by adapting the linear regression
modelling approach used so far in M348; the resulting model is called the
logistic regression model, or simply logistic regression. We’ll introduce
logistic regression in Section 2.
In linear regression, the fitted model tells us about the relationship
between the explanatory variable(s) and the response. Although a fitted
logistic regression model is also informative regarding the relationship
between the explanatory variable(s) and the binary response, it is not
interpreted in the same way as a linear model is. The interpretation of
logistic regression models is the subject of Section 3.


Section 4 of the unit focuses on using logistic regression, while Section 5
considers how we can assess the fit of a logistic regression model. This idea
is then extended in Section 6 to compare the model fits of logistic
regression models to help choose a suitable logistic regression model for the
data. Finally, in Section 7, we’ll look at the model assumptions for logistic
regression and how we can check these assumptions.
The following route map illustrates how the sections fit together for this
unit.

The Unit 6 route map: Section 1 (Setting the scene) is followed by
Section 2 (Introducing logistic regression), which leads into both
Section 3 (Interpreting the logistic regression model) and Section 4
(Using the logistic regression model); these are followed in turn by
Section 5 (Assessing model fit), Section 6 (Choosing a logistic regression
model) and Section 7 (Checking the logistic regression model assumptions).

Note that you will need to switch between the written unit and your
computer for Subsections 4.2, 5.4, 6.3 and 7.4.


1 Setting the scene


In this section, we will start in Subsection 1.1 by considering binary
response variables, before exploring some initial modelling ideas for binary
responses in Subsection 1.2.

1.1 Binary response variables


There are many situations where the outcome of an investigation is a
binary response variable that may depend on several explanatory variables.
We described one such situation concerning credit card fraud in Example 1
(in the introduction), but numerous situations arise in areas such as
medicine, economics, marketing, law enforcement, ecology and many more.
In Activity 1, you will be asked to think of some possible situations.

Activity 1 Datasets with a binary response

Think of three situations where the dataset contains a binary response
variable that could depend on one or more explanatory variables. In each
situation, clearly identify two possible values that your binary response
variable can take, together with the possible explanatory variables.

In addition to datasets with binary responses, there are some situations
where, although the response is a continuous variable, it is actually
sensible to replace it with a binary one. This is often done in situations
where the main interest is in a particular threshold value. Two situations
where it might be helpful to create a binary response variable from a
continuous variable are given in the following two examples.

Example 2 Under or over the alcohol limit?


Drinking alcohol affects a driver’s ability to drive safely, and so there
are alcohol limits for drivers in most countries. The limits are based
on the amount of alcohol detected in someone’s breath, blood or
urine.
However, when the police measure a driver's alcohol level, the
outcome of interest is usually whether the driver's alcohol level is
above or below the legal limit, rather than the numerical value of the
alcohol level. So, in this case, although the response is a continuous
variable (the alcohol level), the real variable of interest is a binary
variable with two possible values: ‘at or below the legal alcohol limit’
or ‘above the legal alcohol limit’.


Example 3 Pass or fail?


In Unit 4, we introduced the OU students dataset, which contains
data for students who studied Level 3 OU statistics modules. In that
unit, we fitted several linear regression models taking examScore as
the response variable.
Suppose now that we are only interested in whether or not students
passed a module, rather than their individual scores.
In this case, we could take the overall final module score and create a
binary variable so that anyone with a score greater than or equal to
40 is categorised as having passed and anyone with a score less than
40 is categorised as having not passed.
This binary variable has in fact already been created as part of the
OU students dataset through the variable:
• modResult: overall final module result, taking the values 1 for pass
and 0 for fail.
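In R, a binary variable like this can be created from the continuous score in a single step. The following is a minimal sketch using a hypothetical vector score of made-up overall module scores (the OU students dataset itself already contains modResult):

score <- c(35, 40, 72, 39.5, 58)        # illustrative (made-up) overall scores
modResult <- ifelse(score >= 40, 1, 0)  # 1 = pass (score of 40 or more), 0 = fail
modResult                               # 0 1 1 0 1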

As you have seen in Activity 1 and Examples 2 and 3, there are many
examples of situations where datasets have a binary response variable. It is
therefore important that we have a method for modelling such responses!

1.2 Initial modelling ideas


We would like to develop a model for a binary response variable. Now, we
know how to model a normal response, but obviously, the distribution for a
binary response looks very different to the distribution for a normal
response! So, in our quest for a suitable model for a binary response, we’ll
explore the properties of two datasets, both of which have a binary
response. We'll start in Subsection 1.2.1 by considering a dataset
concerning data from companies across Europe, before moving on to
consider a dataset concerning patient survival from third-degree burns in
Subsection 1.2.2.

1.2.1 Modelling data on companies across Europe


In this subsection, we’ll be focusing on some data on companies across
Europe taken from a large database called Amadeus. In M348, we’ll be
considering two datasets taken from this database. We’ll introduce both of
these datasets next.


Public and private companies across Europe


The data in this dataset are taken from Bureau van Dijk’s database
called Amadeus, which contains comprehensive information on public
and private companies across Europe. There are around 21 million
companies in the database, with information on financial indicators,
stock prices, directors and many other attributes. The database is
published by Bureau van Dijk, which has 900 employees working
across the globe to collect data on corporate businesses.
The European companies dataset (europeanCompanies)
In this module, we will be considering a subset of the data contained
in the Amadeus database for the year 2018. The European companies
dataset contains data for 270 companies on the following variables:
• country: a categorical variable taking the values GB (for United
Kingdom of Great Britain and Northern Ireland), FR (for France)
and DE (for Germany)
• product: a categorical variable giving the industrial classification
linked to what a company produces, taking the following coded
values: 10 (for manufacture of food products), 20 (for manufacture
of chemicals and chemical products), 22 (for manufacture of rubber
and plastic products), 26 (for manufacture of computer, electronic
and optical products) and 27 (for manufacture of electrical
equipment)
• averageWage: average wage of an employee in 2018 (calculated by
taking the cost of the employees for that company in 2018 (in
thousands of euro) and dividing by the number of employees in
2018)
• resAndDev: a binary variable taking the values 1 (if the company
participates in research and development) and 0 (if the company
doesn’t participate in research and development).
The data for the first five observations from the European companies
dataset (from the subset of 270 companies taken from the Amadeus
database) are shown in Table 1.
Table 1 First five observations from europeanCompanies

country product averageWage resAndDev


GB 10 34.3 0
GB 10 18.2 1
GB 10 56.4 1
GB 10 45.1 0
GB 10 46.1 1
Source: Amadeus, 2020, accessed 22 November 2022


The GB companies dataset (gbCompanies)


The GB companies dataset is a subset of the European companies
dataset including only those data for which country takes the
value GB and product takes the value 22; that is, companies in the
GB which manufacture rubber and plastic products. This restricts the
dataset to just 28 companies.
The data for the first five observations in the GB companies dataset
(from the subset of 28 companies taken from the European companies
dataset) are shown in Table 2.
Table 2 First five observations from gbCompanies

country product averageWage resAndDev


GB 22 40.3 0
GB 22 37.4 0
GB 22 33.5 0
GB 22 45.5 1
GB 22 64.3 1
Source: Amadeus, 2020, accessed 22 November 2022

We are initially going to consider the GB companies dataset: the data
regarding those companies for which country takes the value GB and
product takes the value 22; that is, companies in the GB which
manufacture rubber and plastic products. This means that initially we’ll
be considering data for just 28 companies. (We shall consider the
European companies dataset later in the unit.) We are also going to keep
things simple by focusing only on two of the variables for which there are
data: we’ll take the binary variable resAndDev to be our response variable
and just use the single explanatory variable averageWage.
The question we are interested in exploring, using data for these
28 companies, is whether engagement in research and development
depends on the average wage of the employees within that company.
Let’s start in Activity 2 by looking at the data for the response and
explanatory variable for these 28 companies.


Activity 2 Plots for a binary response variable

To help with getting a feel for the data, Figure 1 shows the comparative
boxplot of the average wage, averageWage, separating the 28 companies
into those which participate in research and development (so that
resAndDev = 1), and those which do not participate in research and
development (so that resAndDev = 0).

Figure 1 Comparative boxplot showing the average wage for 28 companies

From the boxplot given in Figure 1, is it plausible that there is a
relationship between whether a company participates in research and
development and the average wage of its employees?

In Activity 2, we saw that there does seem to be a relationship between
the covariate averageWage and the response resAndDev. But how can we
model this relationship?
So far in this module, all of the regression models that we’ve used have
been linear regression models. So, an obvious starting place is to
investigate whether we can use linear regression to model the relationship
between averageWage and resAndDev. When exploring whether there is a
linear relationship between a covariate and a (continuous) response, we
would usually start by looking at a scatterplot of the data. So, looking at a
scatterplot of the covariate averageWage and the response resAndDev
sounds sensible. There is the problem that scatterplots need both of the
variables to be continuous, but we can get around this simply by treating
our binary variable with the code values 0 and 1 as a continuous variable
with the numerical values 0 and 1. (Remember that a categorical
explanatory variable with two levels can be treated either as a factor or a
covariate.) The resulting scatterplot is shown in Figure 2.


Figure 2 Scatterplot of averageWage and resAndDev (treating resAndDev
as a continuous variable)

From the scatterplot, we can see that there certainly appears to be some
sort of relationship between averageWage and resAndDev, with generally
higher values for averageWage when resAndDev = 1. But would a linear
relationship be appropriate? Let’s explore a few options for fitting a
straight line to these data.
Firstly, if we treat resAndDev as a continuous response (which happens to
take the numerical values 0 and 1), there is nothing to stop us from fitting
a linear regression model to these data. So, let’s try fitting the (linear)
model
resAndDev ∼ averageWage.
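In R, this is just an ordinary least squares fit; a minimal sketch, assuming the GB companies data are already loaded as a data frame named gbCompanies:

# Fit a simple linear regression, treating the binary response as numeric 0/1
fit <- lm(resAndDev ~ averageWage, data = gbCompanies)
coef(fit)  # intercept and slope of the fitted straight line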


The fitted regression line for this model is shown in Figure 3.

Figure 3 Scatterplot of averageWage and resAndDev with the fitted
regression line for the linear model resAndDev ∼ averageWage

We’ll consider the fit of the regression line shown in Figure 3 in the next
activity.

Activity 3 Does linear regression fit?


Considering the fitted regression line in Figure 3, do you think the linear
model
resAndDev ∼ averageWage
is a good fit for the data?

From Activity 3, it looks like the linear regression models that we’ve used
so far are not going to work with a binary response variable. However, if
we look at the scatterplot given in Figure 3 again, whilst the points do not
lie on the fitted linear regression line, they do all lie on one of the two lines
added to the scatterplot of resAndDev and averageWage shown in
Figure 4.
• The bottom line on the plot (resAndDev = 0) is a perfect fit for all of
the points relating to companies who do not engage in research and
development (but doesn’t fit any of the points relating to companies for
which resAndDev = 1).


• The top line on the plot (resAndDev = 1) is a perfect fit for all of the
points relating to companies who do engage in research and development
(but doesn’t fit any of the points relating to companies for which
resAndDev = 0).

Figure 4 Scatterplot of averageWage and resAndDev with two possible
fitted lines

Although we can fit each data point perfectly using the two lines shown in
Figure 4, having two fitted lines in our model like this isn’t much use to us!
For example, for an observation with a value of 30 for averageWage, which
of the two lines would we choose to predict resAndDev?
So, what about using shorter versions of both lines? For example, we could
consider a cut-off at averageWage = 40, say, so that all values of
averageWage less than or equal to 40 would use the bottom line for
predicting resAndDev, while all values of averageWage greater than 40
would use the top line for predicting resAndDev. In other words, we could
use the two fitted lines given by

resAndDev = 0 if averageWage ≤ 40 (this relates to the bottom line),
resAndDev = 1 if averageWage > 40 (this relates to the top line).
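This piecewise rule is easily expressed in R with a single ifelse() call; a minimal sketch, again assuming the gbCompanies data frame:

# Predict 0 (bottom line) when averageWage <= 40, and 1 (top line) otherwise
predicted <- ifelse(gbCompanies$averageWage <= 40, 0, 1)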


These two shorter fitted lines are shown in Figure 5. Might these help us to
predict the value of resAndDev from a given value of averageWage better?

Figure 5 Scatterplot of averageWage and resAndDev with the two fitted
lines from Figure 4 shortened

Well, having the two shorter fitted lines shown in Figure 5 would certainly
be better than using the two full lines shown in Figure 4: there is now only
one line associated with each value of averageWage and only three data
points which don’t lie on one of the lines. This kind of approach – where
we move between the two lines for the different values of the explanatory
variable – therefore appears to give us a way forwards.
However, the cut-off value of 40 for averageWage was chosen fairly
arbitrarily, and so, if we are to use this type of approach, we need a
method of deciding how we can use the values of the explanatory variable
to move from one line to the other. To investigate this further, we’ll
explore another new dataset in the next subsection.


1.2.2 Modelling data on patient survival


The dataset to be explored in this subsection is described next.

Patient survival from third-degree burns


Severe burns need immediate treatment as they can lead to lasting
damage to skin, muscle and bones, and can sometimes lead to death.
Burns are categorised into different types depending on what caused
the burn, together with how severely the skin has been hurt. The
most severe type of burn is a third-degree burn which destroys two
full layers of skin and damages nerve endings – a third-degree burn
may often appear black, brown, white or yellow.
The burns dataset (burns)
Researchers in the mid 1990s collected data on 435 adults who were
treated for third-degree burns by the University of Southern
California General Hospital Burn Center. The burns dataset contains
the following variables for the 435 adults:
• logArea: calculated as log(area of third-degree burns + 1); the
reason for adding the '+1' is to avoid any problems with the need
to evaluate log 0
• survival: a binary variable which records whether the patient
survived, taking the value 1 if the patient survived and 0 if the
patient didn’t survive.
The first five observations extracted from this dataset can be seen in
Table 3.
Table 3 The first five observations from burns

logArea survival
2.301 0
1.903 1
2.039 1
2.221 0
1.725 1
Source: Fan, Heckman and Wand, 1995

When trying to model the binary variable resAndDev using the
explanatory variable averageWage, we used a two-line approach (as
illustrated in Figure 5). Might the same approach work here? We shall
investigate this in Activity 4.


Activity 4 Can we use the two-line approach?

(a) A comparative boxplot of logArea by survival for the burns dataset
is given in Figure 6. Using the comparative boxplot, does it seem
likely that the amount of body area covered by third-degree burns
affects whether or not a patient survives?

Figure 6 A comparative boxplot of logArea split by whether the patient
survived or not

(b) A scatterplot of survival and logArea is given in Figure 7 (treating
survival as a continuous variable). Using the scatterplot, can you
suggest suitable straight lines which might be useful to model these
data?

Figure 7 Scatterplot of survival and logArea


Following on from Activity 4, it appears that fitting lines using the
two-line approach used in Figure 5 is not going to be suitable for the whole
range of data in the burns dataset. We therefore need to find a different
way to model the data.
In order to work out how we might go about modelling survival, let’s
start by thinking about binary responses more generally. Using our usual
notation for a response variable, let’s denote our binary response by Y .
Then, since Y is binary, we can always code the two possible values that Y
can take as 1 and 0, and refer to a value of 1 as being a success and a value
of 0 as being a failure. Y can then be modelled using a Bernoulli
distribution. A summary of the Bernoulli distribution is given in Box 1.

Box 1 The Bernoulli distribution


The binary random variable Y taking the two possible values 1 (for
success) and 0 (for failure) is said to have a Bernoulli distribution
with parameter p, where 0 < p < 1, if it has the probability mass
function

p(y) = 1 − p for y = 0, and p(y) = p for y = 1,
or equivalently
p(y) = p^y (1 − p)^(1−y), y = 0, 1,
so that
P (Y = 1) = p and P (Y = 0) = 1 − p.

This is written as Y ∼ Bernoulli(p).


The parameter p is referred to as the success probability.
The mean and variance of a Bernoulli random variable are,
respectively,
E(Y ) = p and V (Y ) = p(1 − p).
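As an aside, a Bernoulli(p) variable can be simulated in R as a binomial variable with a single trial. This small sketch, using the illustrative value p = 0.7, checks the mean and variance formulas empirically:

p <- 0.7
y <- rbinom(10000, size = 1, prob = p)  # 10 000 Bernoulli(p) draws
mean(y)  # close to E(Y) = p = 0.7
var(y)   # close to V(Y) = p(1 - p) = 0.21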

Example 4 Distribution for the burns dataset response


In the context of modelling the binary response survival using data
from the burns dataset, for each patient i, we observe the response Yi
which takes the value 1 if patient i survives (a success) and takes the
value 0 if patient i doesn’t survive (a failure). Then, each Yi has a
Bernoulli distribution, where the probability that patient i survives is
the success probability pi ; that is,
Yi ∼ Bernoulli(pi ), 0 < pi < 1.


Now, from Activity 4 we know that the value of the explanatory variable
logArea seems to affect the value of the response survival. As such, pi ,
the success probability for patient i, depends on the value of logArea for
patient i, as demonstrated in the next activity.

Activity 5 Estimates of individual patient success probabilities
By considering the scatterplot of survival and logArea shown in
Figure 7 in Activity 4, answer the following.
(a) The observed value of logArea for patient 24 is 1.223. What does this
tell us about the likely value of the probability of survival, p24 , for
patient 24?
(b) The observed value of logArea for patient 1 is 2.301. Would you
estimate the likely value of p1 , the probability of survival for
patient 1, to be greater than p24 , equal to p24 , or less than p24 ?
Explain your reasoning.

We’ve seen in Activity 5 how the value of the explanatory variable for the
ith observation affects the value of the corresponding success probability pi
associated with the response Yi . In fact, it turns out that the success
probabilities (p1 , p2 , . . . , pn ) are key to building a model which allows us to
use the value of the explanatory variable to predict the response. This is
because the value of the success probability pi can guide our prediction for
the response Yi as follows.
• If pi > 0.5, then it is more likely that Yi is a success than a failure and
so it would be sensible to predict success for Yi .
• If pi < 0.5, then it is more likely that Yi is a failure than a success and
so it would be sensible to predict failure for Yi .
What’s more, the success probabilities show us what our fitted regression
line, on which predictions are based, should look like. The reason for this
is as follows.
Right back in Unit 1, we saw that in linear regression, all of the values
E(Y1 ), E(Y2 ), . . . , E(Yn ) lie along the regression line. So, we should also
expect that E(Y1 ), E(Y2 ), . . . , E(Yn ) lie along the regression line when we
have a binary response. But, for a binary response, we know from Box 1
that
E(Yi ) = pi .
So, the success probabilities p1 , p2 , . . . , pn for the n binary responses should
also lie along the regression line. We can therefore use p1 , p2 , . . . , pn to
show us what our fitted regression line should look like.


Unfortunately, we don't actually know the values of p1, p2, . . . , pn. We can,
however, use our data to estimate them. In particular, we want to use our
data to model the relationship between the success probability and the
explanatory variable, so that we can use the known value of the
explanatory variable for the ith observation (xi ) to estimate the success
probability for that observation (pi ). Modelling how the success
probability varies with the explanatory variable therefore provides us with
a method for deciding whether the line Y = 1 (success) or Y = 0 (failure)
should be used for predicting Yi .
We shall focus on building such a model in the next section.

2 Introducing logistic regression


As mentioned at the end of the previous section, in order to use an
explanatory variable x to predict a binary response Y , we wish to model
how the success probability varies across the values of x. Modelling the
success probability is the focus of Subsection 2.1, and we shall see in
Subsection 2.2 that a function called the logistic function is promising for
this. We can then use the logistic function to adapt linear regression to
accommodate a binary response. The resulting model – the logistic
regression model – is formally introduced in Subsection 2.3.

2.1 Modelling the success probability


To get a feel for what kind of model we need to fit for modelling the
success probability, we’ll start by looking at a plot of estimates of the
success probability for various values of logArea using data from the
burns dataset.
Now, we can estimate pi – the probability that patient i survives – by
using the observed survival proportion for those patients with the same
value of logArea as patient i. But, in the case of the burns dataset, we
have a continuous explanatory variable logArea, which means that it is
unlikely that another patient will have exactly the same value of logArea.
As such, using the observed survival proportion for patients with the exact
same value of logArea as patient i to estimate pi won’t be terribly useful!
We can, however, partition the values of logArea into intervals, look at
the observed survival proportions of patients with values of logArea
within each interval, and then use these proportions as estimates of the
survival probabilities for the different possible values of logArea. This is
illustrated in Activity 6.


Activity 6 Estimating survival probabilities

To estimate the probability of survival for values of logArea using data
from the burns dataset, we will start by partitioning the values of the
variable logArea into intervals: Table 4 shows the resulting intervals.
(Note that we could have used different intervals; there is nothing special
about the ones used in the table!) For each of these intervals, the table
also shows the number of surviving patients with a value of logArea
taking a value somewhere between the lowest value in the interval and up
to, but not including, the highest value in the interval together with the
total number of patients with a value of logArea in the interval. For
example, 13 of the patients had a value of logArea between 1.2 and up to,
but not including, 1.5, and all 13 of these patients survived. The resulting
proportions of patients surviving for the intervals are also given in Table 4.
Table 4 Number and proportion of patients surviving third-degree burns
grouped by values of logArea

logArea interval   Number surviving   Total number   Proportion surviving
1.2 to 1.5 13 13 1.000
1.5 to 1.7 19 19 1.000
1.7 to 1.8 67 69 0.971
1.8 to 1.9 45 50 0.900
1.9 to 2.0 71 79 0.899
2.0 to 2.1 50 70 0.714
2.1 to 2.2 35 66
2.2 to 2.3 7 56 0.125
2.3 to 2.4 1 13 0.077

(a) Complete Table 4 by calculating the proportion of patients who
survive when logArea is between 2.1 and 2.2.
(b) For the fifth patient, logArea = 1.725, and for the seventh patient,
logArea = 2.295. Using the observed proportions surviving for each
of the intervals of logArea given in Table 4 as estimates of the
survival probabilities, estimate E(Y5 ) and E(Y7 ).
(c) Figure 8 shows a scatterplot of logArea and survival, together with
estimates of the survival probabilities (the observed proportions
surviving) plotted at the midpoint of each corresponding logArea
interval from Table 4. Does it look like there is a linear relationship
between the estimated survival probabilities and logArea?


Figure 8 Scatterplot of logArea and survival, together with estimates of
the survival probabilities plotted at the midpoint of each corresponding
logArea interval
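Estimates like those in Table 4 can be computed in R by cutting logArea into intervals and averaging the 0/1 response within each interval; a minimal sketch, assuming the burns data are loaded as a data frame named burns:

# Partition logArea into the Table 4 intervals: [1.2, 1.5), [1.5, 1.7), ...
breaks <- c(1.2, 1.5, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4)
interval <- cut(burns$logArea, breaks = breaks, right = FALSE)
# The mean of a binary variable is the proportion surviving in each interval
tapply(burns$survival, interval, mean)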

From Activity 6, it seems unlikely that a linear relationship between the
estimated survival probabilities and the explanatory variable is
appropriate. To show that this is indeed the case, let’s fit a straight line to
the data using a simple linear regression model where the response variable
is the proportion of patients who survived and the explanatory variable is
the midpoint of each logArea interval. The resulting fitted regression line
is shown in Figure 9 (overleaf). As we expected, this fitted regression line
does not appear to fit the estimated probabilities well. In fact, for patients
with logArea < 1.6, this fitted line would predict that the survival
probability is greater than 1!


Figure 9 Scatterplot of logArea and survival, together with estimates of
the survival probabilities and the fitted linear regression line

A better fit to the estimated probabilities might be a curve with some sort
of elongated backwards S-shape, such as the one shown in Figure 10. This
curve certainly appears to offer a better fit than the fitted linear regression
line shown in Figure 9 did.


Figure 10 Scatterplot of logArea and survival, together with estimates of
the survival probabilities and a backward S-shaped curve

It turns out that for binary responses in general, the estimated success
probabilities across the values of an explanatory variable exhibit similar
S-shaped curves – either backward-facing, so that the curve is decreasing
as in Figure 10, or forward-facing, so that the curve is increasing. So, one
way forwards might be to use a curve to model the relationship between
the success probability and an explanatory variable.
In the next subsection, we shall introduce the logistic function which has
the required type of S-shaped curve to fit the success probabilities typically
associated with a binary response.

2.2 The logistic function


Although there are a number of mathematical functions that produce
S-shaped curves, for fitting the success probability typically associated
with a binary response, we shall just focus on the logistic function given
by the following equation:
f(x) = 1 / (1 + exp(−(α + βx))), −∞ < x < ∞, (1)
where α and β are parameters specifying the exact shape of the curve.
(We’ll look at how the values of α and β affect the shape of the curve
soon.)


A plot of the logistic function when α = 0 and β = 1, for values of x such
that −10 ≤ x ≤ 10, is given in Figure 11. (When α = 0 and β = 1, this is
known as the standard logistic function, or simply the sigmoid function.)

Figure 11 Logistic function f(x) for α = 0 and β = 1
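In R, the logistic function is available as plogis(), since plogis(z) = 1/(1 + exp(−z)); a plot like Figure 11 can therefore be sketched as follows:

alpha <- 0
beta <- 1
# Plot f(x) = 1/(1 + exp(-(alpha + beta * x))) over -10 <= x <= 10
curve(plogis(alpha + beta * x), from = -10, to = 10, xlab = "x", ylab = "f(x)")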

There are several properties of the logistic function which make it
promising for modelling success probabilities:
• the values of f (x) are constrained to lie between 0 and 1 for all x
• the relationship between f (x) and x is almost linear from about
f (x) = 0.2 to about f (x) = 0.8
• the curve flattens off at either end.
These properties are potentially useful since we want the curve to fit points
that estimate E(Yi ) = pi , which, since they are probabilities, lie between 0
and 1. The properties also match what we noted in Activity 6, namely
that, for the success probabilities in the burns dataset, the middle part of
the curve was roughly linear and the curve flattened off at either end.
There are, however, some problems with using the curve given in Figure 11
(with α = 0 and β = 1) for fitting the success probabilities for the burns
dataset! We will consider some of these in the next activity.

Activity 7 The logistic function for the burns dataset

What problems can you see with fitting the logistic function shown in
Figure 11 (where α = 0 and β = 1) to the survival probabilities in the
burns dataset? (It might be helpful to look back at Figure 8.)


From Activity 7, it is obvious that there are problems with using the
logistic function shown in Figure 11 for modelling the survival probabilities
in the burns dataset. However, Figure 11 is showing the logistic function
from Equation (1) for the specific values α = 0 and β = 1, and we can
overcome these problems simply by changing the values of these
parameters.
The effect that the parameter α has on the shape of the logistic function is
summarised in Box 2.

Box 2 Effect of α on the logistic function shape


Consider the logistic function given by the equation
f(x) = 1 / (1 + exp(−(α + βx))), −∞ < x < ∞,
for parameters α and β.
The parameter α affects the location of the logistic function.
• If α > 0, then the curve shifts to the left by α units.
• If α < 0, then the curve shifts to the right by |α| units.
This is illustrated in Figure 12.

Figure 12 Illustrating the effect on the logistic function of changing the
value of α

In the next activity, you will consider the effect on the location of the
curve in Figure 11 for different values of α.


Activity 8 Changing the α parameter

Keeping the value of β fixed at 1, describe how the location of the curve in
Figure 11 changes when:
(a) α = 2
(b) α = −4

Now we’ll consider the effect that the parameter β has on the shape of the
logistic function; this is summarised in Box 3.

Box 3 Effect of β on the logistic function shape


Consider the logistic function given by the equation
f(x) = 1 / (1 + exp(−(α + βx))), −∞ < x < ∞,
for parameters α and β.
The sign of the parameter β affects the direction of the logistic
function.
• If β > 0, then the curve is increasing.
• If β < 0, then the curve is decreasing.
This is illustrated for the values β = +1 and β = −1 in Figure 13.

Figure 13 Illustrating the effect on the logistic function of changing the
sign of β


The magnitude of the parameter β affects the steepness and spread of
the logistic function.
• The larger |β| is, the steeper and less spread the curve is.
• The smaller |β| is, the shallower and more spread the curve is.
This is illustrated for the values β = 0.25, β = 1 and β = 4 in
Figure 14.

Figure 14 Illustrating the effect on the logistic function of changing the
magnitude of β
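The effects described in Boxes 2 and 3 can be explored by overlaying curves for different parameter values; a short R sketch:

curve(plogis(0 + 1 * x), from = -10, to = 10, ylab = "f(x)")  # alpha = 0, beta = 1
curve(plogis(2 + 1 * x), add = TRUE, lty = 2)  # alpha = 2: curve shifts left
curve(plogis(0 + 4 * x), add = TRUE, lty = 3)  # beta = 4: steeper, less spread
curve(plogis(0 - 1 * x), add = TRUE, lty = 4)  # beta = -1: decreasing curve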

In the next activity, you will consider the effect on the slope and spread of
the curve in Figure 11 for different values of β.

Activity 9 Changing the β parameter

Keeping the value of α fixed at 0, describe how the slope and spread of the
curve in Figure 11 changes when:
(a) β = 2
(b) β = −0.5


From Boxes 2 and 3, the logistic function allows both increasing and
decreasing relationships to be modelled and can be flexible with respect to
both the scale and range of x. By adjusting the values of the parameters
α and β, we can directly model the relationship between the success
probability pi (for the ith response Yi ) and the explanatory variable xi by
using an equation of the form
pi = 1 / (1 + exp(−(α + βxi))). (2)

So, now that we have found an equation suitable for modelling the
relationship between a binary response variable’s success probability and
an explanatory variable, we are ready to build a regression model for data
with a binary response! We shall introduce such a model next.

2.3 The logistic regression model


In the previous subsection, we saw how the logistic function is promising
for modelling the relationship between the success probability and an
explanatory variable. In this subsection, we’ll use this logistic function to
specify a regression model for a binary response. Because of the role
played by the logistic function, the resulting model is known as the logistic
regression model, or simply logistic regression.
Consider once again the model equation for pi using the logistic function
given in Equation (2). This equation doesn’t look very much like the
equations we have been using when constructing linear models! Luckily,
with a little manipulation, we can make the equation look a little more like
the linear models with which we are familiar. We shall start this
manipulation in Activity 10.

Activity 10 Manipulating our equation for pi

(a) Show that Equation (2), given by
pi = 1 / (1 + exp(−(α + βxi))),
can be rewritten in the form
pi = exp(α + βxi) / (exp(α + βxi) + 1).
(Hint: remember that exp(−z) = 1/exp(z).)
(b) Hence show that
pi / (1 − pi) = exp(α + βxi).
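A quick numerical check of both identities in R, for the illustrative parameter values α = −1 and β = 0.5:

alpha <- -1
beta <- 0.5
x <- seq(-5, 5, by = 0.5)
p1 <- 1 / (1 + exp(-(alpha + beta * x)))
p2 <- exp(alpha + beta * x) / (exp(alpha + beta * x) + 1)
all.equal(p1, p2)                                # TRUE: the two forms agree
all.equal(p1 / (1 - p1), exp(alpha + beta * x))  # TRUE: the odds form holds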

We saw in Activity 10 that Equation (2) can be rewritten in the form
pi / (1 − pi) = exp(α + βxi). (3)

But how does this help us? Well, if we now take the logarithm of both
sides in Equation (3), we end up with the equation
 
log(pi / (1 − pi)) = α + βxi (4)
and this equation has the familiar form of a simple linear regression model
on the right-hand side. (Hooray!) Note that from this point onwards, when
we use the logarithm function in M348 we always mean log to base e.
Although Equation (4) has a familiar form on the right-hand side,
unfortunately the left-hand side certainly doesn’t look like the left-hand
side of a linear model. However, by looking at the left-hand side more
closely, we’ll find that Equation (4) does in fact have a similar form to a
linear model!
To see this, recall that for our binary response variable we have E(Yi ) = pi .
So, Equation (4) can be rewritten as
 
log(E(Yi) / (1 − E(Yi))) = α + βxi. (5)
But, we also know from Unit 1 that for the simple linear regression model
we have
E(Yi ) = α + βxi . (6)
The only real difference between Equations (5) and (6) is the fact that
Equation (5) has a function of E(Yi ) on the left-hand side, rather than just
E(Yi ).
The function of E(Yi ) in Equation (5) is known as the logit function,
where the term ‘logit’ is short for ‘logistic unit’. The logit function is often
denoted by logit(), so that
 
logit(E(Yi)) = log(E(Yi) / (1 − E(Yi))).
(The logit function is in fact the inverse of the logistic function when α = 0
and β = 1.) So, using the logit function, Equation (5) can be written as
logit(E(Yi )) = α + βxi
and equivalently, Equation (4) can be expressed as
logit(pi ) = α + βxi .
When modelling a binary response, the logit function plays a special role
in the model and therefore has a special name – the logit link function,
often shortened to logit link. This is because the logit function links the
expected value of the response variable E(Yi ) (which in the case of a
binary response is the success probability pi ) with the linear component of
the model (that is, α + βxi ).
We are now in a position to be able to specify the logistic regression model
for a binary response variable with just one covariate explanatory variable,
as described in Box 4.


Box 4 The logistic regression model


Consider the binary response Y which depends on just one covariate
explanatory variable x.
For the ith observation, we have
Yi ∼ Bernoulli(pi ), 0 < pi < 1,
where pi is the probability of success for the ith observation.
Then, denoting the value of x for the ith observation by xi , the
logistic regression model models the success probability using the
logistic function given by
pi = 1 / (1 + exp(−(α + βxi))) = exp(α + βxi) / (exp(α + βxi) + 1), (7)
which can be rearranged to the form
 
logit(pi) = log(pi / (1 − pi)) = α + βxi.
The function logit(pi ) is called the logit link function because it links
a function of E(Yi ) = pi to a linear function of the explanatory
variable xi .
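The module's own R instructions for fitting appear in Subsection 4.2, but as a preview: a model of this form can be fitted in R with glm() using the binomial family, whose default link is the logit. A minimal sketch, assuming the burns data are loaded as a data frame named burns:

# Fit the logistic regression model survival ~ logArea with the logit link
fit <- glm(survival ~ logArea, family = binomial(link = "logit"), data = burns)
coef(fit)  # estimates of alpha and beta on the log odds scale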

Note that it is in fact possible to use other link functions for binary
response variables, and if you explore more regression models for binary
responses in your work or through further reading, you might also
encounter the probit link function and the complementary log-log link
function. In this unit, however, we shall only be using the logit link
function, since it is the one most commonly used with binary response
variables and its results are relatively easy to interpret.
Whilst we are currently using the logit link function for binary response
data, in Unit 7 you will use other link functions for different types of
response variables. In each case, the link function links E(Yi ) to a linear
function of the explanatory variable(s); these models (including the logistic
regression model) are known collectively as generalised linear models. We
shall return to looking at generalised linear models in Unit 7.
Simple linear regression and logistic regression (as specified in Box 4) both
have one covariate explanatory variable, but each assumes a different
distribution for the response. In the next activity, we’ll consider what
other similarities and differences the two models have.


Activity 11 Comparing the logistic regression and simple linear regression models
The simple linear regression model for the response variable Y and
(covariate) explanatory variable x is given, for i = 1, 2, . . . , n, by
Yi = α + βxi + Wi ,
where the random terms W1 , W2 , . . . , Wn are independent normal random
variables with zero mean and constant variance σ^2. Then, since xi is
assumed fixed and non-random,
E(Yi ) = α + βxi .
For the binary response Y , such that, for i = 1, 2, . . . , n,
Yi ∼ Bernoulli(pi ), 0 < pi < 1,
the logistic regression model for Yi and (covariate) explanatory variable xi
is given by
 
log(pi / (1 − pi)) = α + βxi.
What similarities and differences do you notice between these two models?

In logistic regression we are modelling E(Yi) rather than each individual
value Yi. In other words, rather than modelling each value of Yi with the
predicted mean (that is, α + βxi ) plus a random term (as is the case for
simple linear regression), logistic regression simply models the predicted
mean α + βxi (albeit through the logit function logit(pi )). This difference
between the two models will be important when we come to think about
how we assess how well the model fits.
As with linear regression models, the model notation can get rather messy
when there are several explanatory variables. So, we shall continue to use
the simpler notation of the form
y∼x
to denote that the response is Y and we have just one explanatory variable
x, but in doing so, we need to make it clear that we are using a logistic
regression model. For example, we could say ‘the logistic regression model
y ∼ x’ or ‘the model y ∼ x for the binary response with the logit link’.
Just as we moved from simple linear regression to multiple regression, we
can extend the linear part of a logistic regression model to take account of
any number of explanatory variables. These explanatory variables can be
factors A, B, . . . , Z, or covariates x1 , x2 , . . . , xq . Again, we can use the
simpler notation
y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq ,
but we also need to make it clear that we are using a logistic regression
model.

3 Interpreting the logistic regression model
Recall that for the simple linear regression model, where
E(Yi ) = α + βxi ,
the regression coefficient β is interpreted as giving the average change in Y
which is associated with a one-unit increase in x. Can we come up with an
analogous interpretation for the logistic regression model with the logit
link function? Well, yes we can! In fact, this is one of the reasons why the
logit link function is often preferred to other possible link functions for
binary responses!
In order to be able to interpret the logistic regression model when using
the logit link function, we first need to consider two quantities associated
with logistic regression, namely the odds and the log odds. We shall discuss
these in Subsection 3.1. We’ll then consider how we can interpret logistic
regression models when there is a single covariate explanatory variable in
Subsection 3.2, and when there are several explanatory variables in
Subsection 3.3.

3.1 Odds and log odds


For a binary random variable with success probability p and failure
probability 1 − p, the odds is the ratio of the probability of success to the
probability of failure, so that
odds = p / (1 − p). (8)
We can think of odds as the ratio of the number of outcomes which result
in success to the number of outcomes which result in failure – that is,
odds of a success = (number of outcomes resulting in success) / (number of outcomes resulting in failure).
Probability, on the other hand, is the ratio of the number of outcomes
which result in success to the total number of outcomes – that is,
probability of a success = (number of outcomes resulting in success) / (total number of outcomes).
So, the odds of a success and the probability of a success are not the same,
but they are related.
We shall consider the link between odds and probability in Example 5 and
Activity 12.


Example 5 Odds of being October


The odds that a randomly chosen month of the year is October is 1/11.
This is because there is one outcome where the event occurs (that is,
there is one month October in the year) and 11 outcomes where the
event doesn’t occur (that is, there are 11 months of the year which are
not October).
On the other hand, the probability that a randomly chosen month of
the year is October is 1/12, because there is one outcome when the
event occurs, and a total of 12 months in a year.
We can see that Equation (8) yields the same odds value calculated
by using the numbers of outcomes which result in success and failure,
since
odds = P(October) / (1 − P(October)) = (1/12) / (1 − 1/12) = 1/11.

Activity 12 Relating probability and odds

(a) Suppose that the probability of a student passing a module is 0.9, so
that p = 0.9. What are the odds of the student passing the module?
(b) Using Equation (8), produce a formula for the probability of a success
in terms of the odds of a success.
(c) If the odds of a company going into receivership is 0.6, what is the
probability that the company goes into receivership?
The values of probabilities are constrained to lie between 0 and 1. But
what about the values that the odds can take? We consider this in the
next activity.
Activity 13 What values can odds take?
(a) The odds can take values between 0 and ∞. Explain why this is so.
(b) What value of the success probability gives a value 1 for the odds?
(c) What does a value of odds less than 1 tell us about the success
probability? What about a value of odds greater than 1?
As we saw in Activity 13:
• odds < 1 when p < 1/2 (so that success is less likely than failure)
• odds = 1 when p = 1/2 (so that success and failure are equally likely)
• odds > 1 when p > 1/2 (so that success is more likely than failure).
Then, since the odds can take values between 0 and ∞:
• if the odds of success is close to 0, then there is a very low probability of
success
• if the odds of success is large, then there is a very high probability of
success.
You may be wondering why we’re interested in the odds. Well, the logit
function is a function of the odds, since
logit(p) = log( p / (1 − p) ).
So, logit(p) is the log of the odds of success. Because of this, logit(p) is
often simply called the log odds, and the logit function is sometimes called
the log odds function. Odds and log odds are summarised in Box 5.
Box 5 Odds and log odds
For a binary random variable with success probability p and failure
probability 1 − p, the odds of success is
odds = p / (1 − p)
or equivalently
odds = (number of outcomes which result in success) / (number of outcomes which result in failure).
The odds can take the following values:
• between 0 and 1 when p < 1/2
• 1 when p = 1/2
• between 1 and ∞ when p > 1/2.
The log odds is simply log(odds) (taking logs to base e), and
log(odds) = log( p / (1 − p) ) = logit(p).
You will see shortly that the odds and log odds are key to interpreting the
logistic regression model.
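As a quick aside, these conversions between probability, odds and log odds are easy to carry out in R. Here is a minimal sketch; the helper functions and their names are ours, not part of the module’s notebooks.

odds_from_prob <- function(p) p / (1 - p)          # the odds, as in Equation (8)
prob_from_odds <- function(odds) odds / (1 + odds) # rearranging Equation (8) for p

odds_from_prob(0.9)       # 9: the odds when the success probability is 0.9
prob_from_odds(9)         # 0.9: recovering the probability from the odds
log(odds_from_prob(0.5))  # 0: the log odds when success and failure are equally likely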
3.2 Interpreting a regression coefficient
Consider once again the logistic regression model with the logit link
function, so that
log( pi / (1 − pi) ) = α + βxi ,   i = 1, 2, . . . , n.
Now that we know what the odds and log odds of a success are, we are in a
position to be able to interpret the regression coefficient β for a unit
increase in x.
To illustrate how we might do this, in Example 6 we’ll consider a binary
response Y and a covariate x which only takes the two possible values
0 or 1, and is therefore a binary explanatory variable. (Remember that a
binary explanatory variable can be classed either as a covariate (with
numerical values 0 and 1) or as a factor (with coded values 0 and 1).)
Example 6 Spam email
Spam email is generally unsolicited and unwanted junk email. Such
emails are typically sent out to a large number of people and often for
commercial reasons. Suppose that we want to explore the probability
that an email is spam. We can therefore define our response variable
to be
Y = 1 if the email is spam,
    0 if the email is not spam.
We are going to assume that the probability that an email is spam
depends on just one covariate x which can take two values such that,
x = 1 if the email contains the phrase ‘Call free’,
    0 if the email does not contain the phrase ‘Call free’.
This scenario is clearly highly simplified, but its simplicity will help
with our illustration.
Denote the values of Y and x for the ith email by Yi and xi ,
respectively. Then, using a logistic regression model with the logit
link function, we have that
log( pi / (1 − pi) ) = α + βxi .
A unit increase in the explanatory variable x corresponds to an
increase from x = 0 (the email does not contain the phrase ‘Call free’)
to x = 1 (the email does contain the phrase ‘Call free’). So, in order
to interpret β, we need to see what happens to the model when x = 0
and when x = 1.
Now, the value of pi will depend on the value of xi . So, let’s define
p(Y = 1 | x = 0) to be the probability that an email is spam (Y = 1),
given that it does not contain the phrase ‘Call free’ (x = 0), so that
when xi = 0,
pi = p(Y = 1 | x = 0).
Similarly, define p(Y = 1 | x = 1) to be the probability that an email
is spam (Y = 1), given that it does contain the phrase ‘Call free’
(x = 1), so that when xi = 1,
pi = p(Y = 1 | x = 1).
Then, substituting values of pi and xi into our model, when xi = 0 we
have that
log( p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)) ) = α
and when xi = 1, we have that
log( p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1)) ) = α + β.
Therefore
β = log( p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1)) ) − log( p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)) ).
In other words, the coefficient β represents the difference between:
• the log odds for emails with x = 1 (emails containing the phrase
‘Call free’), and
• the log odds for emails with x = 0 (emails without the phrase ‘Call
free’).
So β represents the difference between the log odds when there is a
unit increase in x.
Following on from Example 6 (and continuing the notation used there), we
saw that β is the difference between the log odds when there is a unit
increase in x, so that
β = log( p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1)) ) − log( p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)) ).
Now, you will probably recall that
log a − log b = log(a/b).
Using this knowledge, we see that the difference between the log odds is
the same as the log of the ratio of the two odds – that is,
β = log( p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1)) ) − log( p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)) )
  = log( [ p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1)) ] / [ p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)) ] )
  = log( (odds when x = 1) / (odds when x = 0) ).
The ratio of odds in this equation is, perhaps unsurprisingly, called the
odds ratio and is often written simply as OR. The equation for β
therefore becomes
β = log(OR).
The odds ratio is a useful measure of the association between the
explanatory variable and the response variable, as demonstrated next in
Example 7 and Activity 14.
Example 7 The odds ratio for spam email
Consider once again the scenario concerning spam emails discussed in
Example 6. Suppose that the probability that an email is spam given
that it contains the phrase ‘Call free’ is 0.9, and the probability that
an email is spam given that it doesn’t contain the phrase ‘Call free’
is 0.4.
Then, the odds that an email is spam given that it contains the
phrase ‘Call free’ (so that x = 1) is
odds when x is 1 = p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1)) = 0.9/0.1 = 9,
and the odds that an email is spam given that it does not contain the
phrase ‘Call free’ (so that x = 0) is
odds when x is 0 = p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)) = 0.4/0.6 = 2/3.
The odds ratio is then calculated as
OR = (odds when x is 1) / (odds when x is 0) = 9 / (2/3) = 13.5.
This means that the odds that an email is spam given that it contains
the phrase ‘Call free’ is increased by a factor of 13.5 compared to the
odds that an email is spam given that it doesn’t contain the phrase
‘Call free’.
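The arithmetic in Example 7 can be checked with a couple of lines of R; the odds() helper here is our own shorthand.

odds <- function(p) p / (1 - p)
OR <- odds(0.9) / odds(0.4)  # odds when x is 1 divided by odds when x is 0
OR                           # 13.5
log(OR)                      # the corresponding beta = log(OR), about 2.60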
Activity 14 A different explanatory variable for predicting spam emails
Once again consider the response variable
Y = 1 if the email is spam,
    0 if the email is not spam.
This time, however, consider a different (but also highly simplified!)
explanatory variable x2 , say, which can also only take two values so that
x2 = 1 if the email contains at most two spelling mistakes,
     0 if the email contains three or more spelling mistakes.
Suppose that the probability that an email is spam given that it has at
most two spelling mistakes is 0.1, and the probability that an email is
spam given that it has three or more spelling mistakes is 0.55.
Calculate the odds ratio that an email is spam for the explanatory variable
x2 , and interpret its value.
We saw in Example 7 and Activity 14 that the value of the OR is
meaningful in terms of how a unit increase in x affects the odds, and hence
the probability of success. Interpreting the odds ratio is summarised in
Box 6.
Box 6 Interpreting the odds ratio (OR)
The odds ratio (OR) can be interpreted as follows.
• If OR > 1, then the odds of success increases by a factor equal to
the value of the OR when x is increased by one unit.
• If OR < 1, then the odds of success decreases by a factor equal to
the value of the reciprocal of the OR when x is increased by one
unit.
So, how does the value of the OR help us to interpret a logistic regression
model? Well, we know that
β = log(OR),
which means that
OR = exp(β).
So, the value of exp(β) is the OR, and we know how to interpret the OR!
The quantity exp(β) is known as the odds multiplier, because it is the
number that the odds are multiplied by when increasing the value of x by
one unit.
Of course, the spam email example only had a single binary explanatory
variable, and we would really like an interpretation for β when we have a
more general explanatory variable. Luckily, we can extend the idea for a
general covariate x. To do this, let’s consider a logistic regression model
when x takes some value w, say, and when this value is increased by 1 to
the value w + 1. Let p(Y = 1 | x = w) and p(Y = 1 | x = w + 1) denote the
success probabilities when the explanatory variable x takes values w and
w + 1, respectively. Then, when x = w,
log( p(Y = 1 | x = w) / (1 − p(Y = 1 | x = w)) ) = α + βw
and when x = w + 1,
log( p(Y = 1 | x = w + 1) / (1 − p(Y = 1 | x = w + 1)) ) = α + β(w + 1).
The difference in these two log odds is then
log( p(Y = 1 | x = w + 1) / (1 − p(Y = 1 | x = w + 1)) ) − log( p(Y = 1 | x = w) / (1 − p(Y = 1 | x = w)) )
  = (α + β(w + 1)) − (α + βw) = β.   (9)
So, as we have just seen, when we have a more general covariate, β still
represents the difference in log odds (or the log of the odds ratio) given a
unit increase in the explanatory variable. As such, β has the same
interpretation for any covariate x.
Now, the odds multiplier exp(β) tells us how the odds change for a unit
increase in x. But what if we’d like to know how the odds change for
an increase in x of c units, for some value c? To answer this, let’s consider
the logistic regression model when we increase x from w to w + c (instead
of increasing x from w to w + 1). In this case, Equation (9) becomes
log( p(Y = 1 | x = w + c) / (1 − p(Y = 1 | x = w + c)) ) − log( p(Y = 1 | x = w) / (1 − p(Y = 1 | x = w)) )
  = (α + β(w + c)) − (α + βw) = cβ.
So
exp(cβ) = odds ratio for an increase of c units in x, (10)
that is, exp(cβ) is the number that the odds of success is multiplied by for
an increase in x of c units.
The odds multiplier is summarised in Box 7.
Box 7 The odds multiplier
For the logistic regression model with logit link function, so that
log( pi / (1 − pi) ) = α + βxi ,   i = 1, 2, . . . , n,
the odds multiplier is exp(β), and
exp(β) = OR,
where OR is the odds ratio for a unit increase in the explanatory
variable.
The odds multiplier tells us about the relationship between the
response and the explanatory variable, because it is the number that
the odds are multiplied by for a unit increase in the explanatory
variable.
Similarly, exp(cβ) is the number that the odds are multiplied by for
an increase of c units in the explanatory variable.
Activity 15 provides some practice at interpreting β.
Activity 15 Interpreting β
For a logistic regression model with the logit link function:
(a) What can we say about the relationship between the response variable
and the explanatory variable when β = 0.7?
(b) If β = −1, what does this tell us about the relationship between the
response variable and the explanatory variable?
(c) When β = −0.2, what can we say about the odds of success when the
explanatory variable is increased by 10 units?
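If you would like to check your answers to Activity 15 numerically, here is a minimal R sketch; the function name is ours.

odds_multiplier <- function(beta, c = 1) exp(c * beta)  # odds ratio for a c-unit increase

odds_multiplier(0.7)           # about 2.01: the odds of success roughly double per unit increase
odds_multiplier(-0.2, c = 10)  # exp(-2), about 0.14: the odds multiplier over a 10-unit increase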
The odds multiplier allows us to interpret the regression coefficient in a
logistic regression model and therefore makes the logit link function both
intuitive and popular. As an additional bonus, the logistic regression
model often also fits the data well!
Next we’ll consider how to interpret the regression parameters when a
logistic regression model has more than one explanatory variable.
3.3 Interpreting more than one regression coefficient
Having seen how to interpret the logistic regression model when there is
only one covariate (and hence only one regression coefficient), we will now
consider how to interpret a logistic regression model with more than one
regression coefficient.
Consider a linear multiple regression model with q covariates x1 , x2 , . . . , xq .
Recall from Unit 2 that when we have q covariates, we have q partial
regression coefficients β1 , β2 , . . . , βq . Each βj is then interpreted as the
effect on the response for a unit increase in xj , assuming that all the other
covariates are fixed. We have an analogous interpretation for logistic
regression with q covariates.
As in multiple regression, when there are q covariates in logistic regression,
there are q associated coefficients β1 , β2 , . . . , βq . Each of these coefficients
is interpreted as increasing or decreasing the log odds of success, just as we
saw when we had a logistic regression model with one explanatory variable.
However, as in (linear) multiple regression, these parameters are partial
regression coefficients and explain what happens to the log odds if all the
other explanatory variables are held constant. So, for each j = 1, 2, . . . , q:
• if βj > 0, then the odds multiplier, exp(βj ), is greater than 1 and
therefore a unit increase in xj (and fixed values for the other covariates)
increases the odds of success by a factor equal to the odds multiplier
• if βj < 0, then the odds multiplier, exp(βj ), is less than 1 and therefore a
unit increase in xj (and fixed values for the other covariates) decreases
the odds of success by a factor equal to the reciprocal of the odds
multiplier.
Note that, as with multiple regression, although the model has partial
regression coefficients, they are usually referred to simply as ‘regression
coefficients’. We’ll use these ideas to interpret a logistic regression model
in Activity 16.
Activity 16 Interpreting a logistic regression model with multiple covariates
Interpret a logistic regression model with three covariates x1 , x2 and x3 ,
when β1 = 0.2, β2 = −2.5 and β3 = 1.5.
So far, we’ve only considered logistic regression models with covariate
explanatory variables, but of course, explanatory variables can also be
factors. Recall from Units 3 and 4 how a factor A with K levels can be
included in a linear regression model through the use of K − 1 indicator
variables. The same approach is used for including factors in logistic
regression. The only difference is the way that the K − 1 regression
coefficients for the indicator variables are interpreted.
In logistic regression, the coefficient for the indicator variable for the kth
level of the factor is interpreted as the difference in log odds between the
kth level of the factor and the first level of the factor, with all other
explanatory variables held fixed. Therefore, the odds multiplier of the
coefficient for the kth level of the factor is interpreted to be the OR for the
kth level of the factor relative to the first level of the factor. We’ll consider
this idea further in Activity 17.
Activity 17 Interpreting a logistic regression model with a factor
Interpret a logistic regression model where the explanatory variable is the
factor A with three levels, where level 1 of A is taken to be the reference
level, and the regression coefficient associated with the indicator variable
for level 2 is −0.4 and the regression coefficient associated with the
indicator variable for level 3 is 3.
Note that in Activity 17 there was just the one explanatory variable A, but
there was more than one regression coefficient because A was a factor.
Logistic regression models can be extended in a natural way to include any
number of covariates and factors as the explanatory variables. As for the
linear models with any number of factors and covariates that we studied in
Unit 4, the interpretation of the regression coefficients in logistic regression
assumes that the values of any other explanatory variables in the model
are fixed.
Interpreting a logistic regression model with more than one regression
coefficient is summarised in Box 8.
Box 8 Interpreting logistic regression with multiple coefficients
Suppose that we have a logistic regression model with the factors
A, B, . . . , Z and covariates x1 , x2 , . . . , xq as explanatory variables.
For the regression coefficient βj associated with xj , the odds
multiplier exp(βj ) is the odds ratio (OR) associated with a unit
increase in the covariate xj , assuming fixed values for the other
covariates and all of the factors.
For each factor, the odds multiplier of the coefficient for the kth level
of the factor is interpreted to be the OR for the kth level of the factor
relative to the first level of the factor, assuming that the values of all
the covariates and other factors in the model are fixed.
So, we have introduced the logistic regression model for a binary response
in Sections 1 and 2, and we have discussed how to interpret the resulting
model in this section. Now we’re ready to use logistic regression!
4 Using the logistic regression model
In order to fit a model to a dataset, logistic regression uses the method of
maximum likelihood estimation, which selects the values of the model
coefficients under which the data that have been observed are most likely. In this module, we
will not concern ourselves with the mathematics behind maximum
likelihood estimation as this is essentially a computational problem.
Instead, we shall trust the statistical software packages at our disposal to
use tried-and-tested algorithms for performing such calculations.
Once we have our fitted model, we can then use it to calculate p̂i, the
fitted value of the success probability for the ith observation, and p̂0, the
predicted value of the success probability for a new response Y0 .
In this section, we shall explore using the fitted logistic regression models
for some datasets in Subsection 4.1, before using R for logistic regression in
Subsection 4.2.
4.1 Exploring some fitted logistic regression models
We’ll start with an example of a logistic regression model with one
explanatory variable using data from the burns dataset (which was
introduced back in Section 1).
Example 8 A fitted logistic regression model for the burns dataset
The burns dataset contains data for 435 adult patients suffering from
third-degree burns. As a reminder, we have the response variable
• survival: a binary variable which records whether the patient
survived, taking the value 1 if the patient survived and 0 if the
patient didn’t survive
and the covariate
• logArea: calculated as log(area of third-degree burns + 1).
Recall that, for the ith patient, survival has a Bernoulli(pi )
distribution, where pi is the probability of success – that is, pi is the
probability of survival – for the ith patient.
The data were used to fit the logistic regression model
survival ∼ logArea,
with a logit link function.
The estimated model parameters (obtained from R) are α̂ = 22.223
and β̂ = −10.453, so that the fitted model is given by
log( p̂ / (1 − p̂) ) = 22.223 − 10.453 logArea.
Now, β̂ = −10.453 and therefore the odds multiplier is calculated as
exp(β̂) = exp(−10.453) ≈ 0.000 03.
Since
exp(β) = OR,
the estimated value of the OR of success is approximately 0.000 03.
So, for a unit increase in logArea, the odds of survival decreases by a
factor of
1/OR = 1/exp(−10.453) ≈ 34 648.
This represents a very large decrease in the odds of survival!
The reason that the odds multiplier is so small (and so the decrease in
the odds is by such a large factor) is because the scale of logArea
means that a unit increase represents a huge change (as we can see in
Figures 6 and 7, for example). Because of this, it is more meaningful
to consider the effect of a smaller increase in logArea: given the scale
of logArea, it looks like it would be more sensible to consider an
increase of something like 0.1.
To do this, we need to use the result (given in Equation (10) in
Subsection 3.2) that
exp(cβ)
represents the odds ratio for an increase of c units in the explanatory
variable. So, setting c to be 0.1,
exp(0.1 × (−10.453)) ≈ 0.352.
This means that the odds of survival is multiplied by a factor of 0.352
for an increase of 0.1 in logArea. Or alternatively, the odds of survival
decreases by a factor of
1/OR = 1/exp(−1.0453) ≈ 2.844
as the value of logArea increases by 0.1.
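The odds multipliers quoted in Example 8 are easily reproduced; a quick sketch of the arithmetic in R:

beta_hat <- -10.453
exp(beta_hat)        # about 0.00003: odds multiplier for a unit increase in logArea
1 / exp(beta_hat)    # about 34648: factor by which the odds of survival decrease
exp(0.1 * beta_hat)  # about 0.352: odds multiplier for an increase of 0.1 in logArea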
The logistic regression model for the burns dataset just had one covariate.
Next we’ll revisit the OU students dataset and consider a logistic
regression model with more than one explanatory variable.
In Example 3, we introduced the idea of using a binary response variable
to indicate whether a student passed or failed a module. In Activity 18,
we’ll look at a fitted logistic regression model using this response.
Activity 18 A fitted logistic regression model for data from the OU students dataset
Consider once again the OU students dataset. In Example 3, we
introduced the response variable
• modResult: overall final module result, taking the value 1 if the student
passes and 0 if the student fails.
So, if we are interested in whether the probability of passing a module
depends on the values of various explanatory variables for that student,
then we can model modResult using a logistic regression model. Here, we’ll
consider three of the possible explanatory variables, namely
• bestPrevModScore: the best previous overall final module score, taking
values from 0 to 100 (rounded to one decimal place)
• age: the age of the student (in years), with one of the values −2, −1, 0,
1 or 2 randomly added
• qualLink: a factor classifying the OU qualification the student is linked
to, taking possible values maths (for qualifications containing substantial
mathematical content) and not (for all other qualifications or no
qualification link). Level ‘not’ is taken to be level 1 of the factor.
The parameter estimates for the coefficients for the fitted logistic
regression model
modResult ∼ bestPrevModScore + age + qualLink
are given in Table 5.
Table 5 Parameter estimates for the logistic regression model
modResult ∼ bestPrevModScore + age + qualLink
Parameter           Estimate
Intercept            −4.273
bestPrevModScore      0.099
age                  −0.020
qualLink maths       −0.696
(a) Interpret the odds multiplier associated with a unit increase in a
student’s best previous module score.
(b) Interpret the odds multiplier associated with a unit increase in a
student’s age.
(c) Interpret the odds multiplier associated with being linked to a
maths-based qualification.
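One way to obtain the odds multipliers needed for Activity 18 is simply to exponentiate the estimates in Table 5; in R:

estimates <- c(bestPrevModScore = 0.099, age = -0.020, qualLinkMaths = -0.696)
exp(estimates)  # odds multipliers: about 1.104, 0.980 and 0.499, respectively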
It is not always convenient to think of our outcomes in terms of the log of
the odds ratio, or even in terms of the odds ratio. Instead we can
manipulate our fitted model equation to obtain the fitted value of the
success probability p̂i. To see this, consider the fitted logistic regression
model for response Yi with associated covariate xi given by
log( p̂i / (1 − p̂i) ) = α̂ + β̂xi.
Taking the exponential of both sides of this equation gives
p̂i / (1 − p̂i) = exp(α̂ + β̂xi),
which can be rearranged to the form
p̂i = exp(α̂ + β̂xi) / (1 + exp(α̂ + β̂xi)).   (11)
Note that this equation is simply a fitted version of Equation (7) from
Subsection 2.3.
We can therefore use Equation (11) to obtain an estimate of the success
probability for the ith response Yi , and we can also use the same equation
to predict the success probability for a new response Y0 .
Of course, we won’t always have just a single explanatory variable in our
model. However, regardless of how many explanatory variables there are,
the fitted logistic regression model will have the general form
log( p̂ / (1 − p̂) ) = fitted linear function of the explanatory variables,
which can be rearranged to the form
p̂ = exp(linear function of the explanatory variables) / (1 + exp(linear function of the explanatory variables)).   (12)
So, for any logistic regression model, Equation (12) can be used to
calculate the fitted value of the success probability as well as the predicted
probability of success for a new response Y0 . We shall see Equations (11)
and (12) in action in the next two activities.
Activity 19 Fitted and predicted survival probabilities for the burns dataset
From Example 8, the fitted logistic regression model for the burns data is
log( p̂ / (1 − p̂) ) = 22.223 − 10.453 logArea,
where p̂ is the fitted probability of survival.
(a) What is the fitted probability of survival for the second patient,
whose value of logArea is 1.903?
(b) For the third patient, the recorded value of logArea is 2.039.
Calculate p̂3.
(c) What is the predicted probability of survival for a new patient whose
value of logArea is 2.3?
Activity 20 Fitted and predicted probability of passing
Activity 18 considered the model
modResult ∼ bestPrevModScore + age + qualLink
using data from the OU students dataset. The parameter estimates for the
fitted model were given in Table 5.
(a) The eleventh student in the dataset is aged 36, is not studying for a
maths-based qualification, and has a score of 92 for their best
previous module score. What is the fitted probability of passing the
module for this student?
(b) Predict the probability that a student aged 60, studying for a maths
qualification, whose best previous module score was 80, will pass the
level 3 module.
The method for calculating fitted and predicted success probabilities is
summarised in Box 9.
Box 9 Fitted and predicted success probabilities
Suppose that a fitted logistic regression model has the general form
log( p̂ / (1 − p̂) ) = fitted linear function of explanatory variables.
Then, p̂i, the fitted success probability for Yi , can be calculated using
the equation
p̂i = exp(linear function of explanatory variables for Yi) / (1 + exp(linear function of explanatory variables for Yi)).
Similarly, p̂0, the predicted success probability for the new response
Y0 , can be calculated using the equation
p̂0 = exp(linear function of explanatory variables for Y0) / (1 + exp(linear function of explanatory variables for Y0)).
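In R, the rearrangement in Box 9 is a one-liner: the built-in function plogis() computes exp(η)/(1 + exp(η)) for any value η of the linear predictor. A minimal sketch using the fitted burns model from Example 8:

eta <- function(logArea) 22.223 - 10.453 * logArea  # the fitted linear predictor
p_hat <- function(logArea) plogis(eta(logArea))     # equivalent to exp(eta)/(1 + exp(eta))

p_hat(2.3)  # about 0.14: predicted survival probability when logArea is 2.3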
4.2 Using R for logistic regression
The time has come for you to use logistic regression in R! We’ll start with
Notebook activity 6.1, which explains how to fit a logistic regression model
in R and how to interpret the output. Then, in Notebook activity 6.2, we
shall see how to predict success probabilities for new responses when using
logistic regression in R.
Notebook activity 6.1 Fitting a logistic regression model in R
This notebook explains how to fit a logistic regression model in R.
Notebook activity 6.2 Predicting success probabilities in R
This notebook explains how predicted success probabilities can be
calculated in R.
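To give a flavour of what these notebooks cover, a minimal sketch of fitting and using a logistic regression model in R might look as follows; the data frame dat and the variables y and x here are hypothetical.

fit <- glm(y ~ x, data = dat, family = binomial(link = "logit"))

summary(fit)    # parameter estimates, plus residual and null deviances
exp(coef(fit))  # odds multipliers for the intercept and for x
predict(fit, newdata = data.frame(x = 2), type = "response")  # predicted success probability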
Back in Subsection 1.2, we introduced the European companies dataset,
which contains data for 2018 on 270 public and private companies across
Europe. A subset of these data are in the GB companies dataset, which we
used when we were looking at some initial modelling ideas, taking the
response variable to be
• resAndDev: a binary variable taking the values 1 if the company
participates in research and development and 0 otherwise
and the single (covariate) explanatory variable to be
• averageWage: average wage of an employee in 2018.
In doing so, we focused on a subset of data for only 28 companies – these
were the companies which had the value GB for the factor country and
22 for the factor product, where these two factors are defined as
• country: a categorical variable taking the values GB, FR and DE
• product: a categorical variable giving the industrial classification linked
to what a company produces, taking the following coded values: 10, 20,
22, 26 and 27.
Although we explored the data on the 28 GB companies for the purposes
of developing model ideas, we haven’t yet fitted a logistic regression model
to these data! We shall do this next in Notebook activity 6.3. Then, in
Notebook activity 6.4, we shall consider the European companies dataset
(rather than the GB companies dataset considered in Notebook
activity 6.3). Once again, we’ll take resAndDev to be our response
variable, but this time we shall include both factors country and product
as explanatory variables in addition to the covariate averageWage.
Notebook activity 6.3 A logistic regression model for the GB companies dataset
In this notebook we shall fit a logistic regression model
to gbCompanies.
Notebook activity 6.4 A logistic regression model for the European companies dataset
In this notebook we shall fit a logistic regression model
to europeanCompanies, using multiple explanatory variables.
So far in this unit, we have explored a few different logistic regression
models and interpreted these. What we now require is a way of assessing
the fit of a logistic regression model. We shall do this next in Section 5.
5 Assessing model fit
When fitting linear models with several explanatory variables, we saw that
there was more to fitting a model than just estimating the coefficients in
the model: we also assessed how well our chosen model fitted the data and
which explanatory variables, or transformations of them, were worth
including in the model. We will use these same ideas when constructing a
logistic regression model. In this section, we’ll look at how we can assess
the fit of a logistic regression model, then we’ll see in Section 6 how these
measures of fit can be used to help us choose which explanatory variables
to include in our model.
In linear regression, t-values, F -values and their associated p-values are
used for assessing model fit; the distribution theory underlying these
methods is exact. However, in logistic regression, no such exact theory
exists. Luckily, all is not lost, since there are some approximations which
can be used instead. (Phew!)
The section starts by discussing how the likelihood can be used as a
measure of model fit in Subsection 5.1. Subsection 5.2 then introduces a
measure of fit based on likelihoods called the residual deviance, and
Subsection 5.3 introduces a related measure called the null deviance. The
section rounds off with Subsection 5.4 in which we look at using R to
assess model fit in logistic regression.
5.1 The likelihood as a measure of model fit
For any fitted model, the likelihood gives us a measure of how likely we are
to observe the data that we have in fact observed, for that particular fitted
model. Now, if a model fits our data well, then the data that have been
observed should be expected, in which case the likelihood will be high.
But, if the model doesn’t fit the data well, then the data won’t match
what is expected under the model very well, in which case the likelihood
will be lower. The likelihood therefore provides us with a useful way of
comparing the fits of different models. Basically, the better the model fits
the data, the more likely we are to observe the data that have been
observed, and therefore the higher the likelihood will be.
Let’s denote the likelihood for our proposed model by
L(proposed model).
You don’t need to worry about how this likelihood is calculated, but there
are two facts about L(proposed model) which are important for us here:
• The better the proposed model fits, the higher L(proposed model) is.
• The value of L(proposed model) is not an absolute measure of fit, and
on its own doesn’t tell us very much about the model fit. However, we
can use L(proposed model) to assess model fit by comparing its value
with the likelihoods for different models.
So, this leads us to the question: which model, or models, should we
choose to compare likelihoods?
Well, there are two models which can provide useful reference likelihood
values for the observed data. These are:
• the saturated model, which is the best model we could use in terms of fit
• the null model, which is the worst model we could use in terms of fit.
The saturated model has a separate parameter for every single
observation, and models each fitted value ybi by its observed value yi . This
means that the saturated model provides a perfect fit to the data. Despite
this ‘perfect fit’, the saturated model is of little use for modelling because
it doesn’t actually summarise any relationship between the response Y and
the explanatory variables (which, of course, is one of the main purposes of
modelling!). The saturated model, however, can be useful as the ‘best’
fitting model, so that, the likelihood of the saturated model,
L(saturated model), will be the largest value that the likelihood could take
for our data.
We have already met the null (regression) model in Unit 2 when learning
about stepwise regression: it’s the model with an intercept only and none
of the possible explanatory variables. So, since the null model is the worst
model that we could use in terms of fit, the likelihood of the null model,
L(null model), will be the smallest value that the likelihood could take for
our data.
Because the fit of our proposed model must lie somewhere between the
worst-fit model (the null model) and the best-fit model (the saturated
model), we know that L(proposed model) must lie somewhere between
L(null model) and L(saturated model). So, comparing the likelihoods of
these three models can tell us about how well (or not!) the proposed model
fits the data. In particular, we shall introduce a measure of fit in
Subsection 5.2 which is based on the idea that, by comparing
L(proposed model) with L(saturated model), we can assess how much fit
to the data is lost by fitting the proposed model. This is illustrated in
Figure 15.
[Figure 15 shows the possible likelihood values for the data on a line running from the smallest possible likelihood, L(null model), to the largest possible likelihood, L(saturated model), with L(proposed model) somewhere in between; the gap between L(proposed model) and L(saturated model) measures the lost fit due to the proposed model.]
Figure 15 Illustration of how the likelihoods compare for the null, proposed and saturated models

Instead of comparing model likelihoods for measuring the lost fit of the
proposed model, for mathematical reasons it is actually more convenient to
compare log-likelihoods (that is, the log of the likelihoods). We’ll denote
the log-likelihood for a model by
l(model) = log(L(model)).
Since the log transformation is a monotonic increasing transformation, a
high value of L(model) will produce a high value of l(model), and so we
can still use the log-likelihoods for comparing model fits, but with the
added bonus that the maths works nicely!
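In R, the maximised log-likelihood of a fitted model can be inspected directly with logLik(); a sketch, again using hypothetical data frame and variable names:

fit_proposed <- glm(y ~ x, data = dat, family = binomial)
fit_null     <- glm(y ~ 1, data = dat, family = binomial)  # intercept-only null model

logLik(fit_proposed)  # a higher (less negative) value indicates a better fit
logLik(fit_null)      # the smallest log-likelihood we could obtain for these data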
5.2 The residual deviance
In order to measure how much fit to the data is lost by using our proposed
model, we shall compare the log-likelihoods of the saturated and proposed
models using a measure known as the residual deviance, D, where
D = 2 × (l(saturated model) − l(proposed model)).   (13)
The reason that the residual deviance is twice the difference between the
log-likelihoods (rather than simply their difference) is for mathematical
convenience, as you will see shortly. The residual deviance can be thought
of as an indication of how much fit is still ‘available’ after the model has
been fitted, and can be considered to be analogous to the RSS in linear
regression.
Activity 21 What does a small or large residual deviance mean?
Suppose that we fit our proposed logistic regression model and the residual
deviance is calculated using Equation (13).
(a) What would a small value of the residual deviance mean in terms of
the fit of our proposed model?
(b) What would a large residual deviance mean?
The residual deviance can be used to help with deciding whether a
proposed model is a good fit or not:
• A small value of the residual deviance would indicate that there is not
much fit to the data lost when using the proposed model. This in turn
means that the fits of the proposed and saturated models are
comparable, and so we would conclude that the fit of the proposed
model is good.
• On the other hand, a large value of the residual deviance would indicate
that too much fit to the data is lost when using the proposed model, and
so we would conclude that the fit of the proposed model is inadequate.
But how do we decide that a residual deviance is ‘large enough’ to indicate
that our proposed model may not be a good fit? To answer this question,
we need to use the chi-squared distribution, also often written as the χ2
distribution. A summary of the chi-squared distribution is given in Box 10.
Box 10 The chi-squared distribution
The chi-squared distribution, also often written as the χ2 distribution,
with ν degrees of freedom is a probability distribution for a
continuous random variable which only takes positive values. This
distribution is usually denoted χ2 (ν).
The probability density function (p.d.f.) of the χ2 (ν) distribution is
right-skew.
The degrees of freedom (taking possible values ν = 1, 2, . . .)
determines the shape of the curve: the lower the degrees of freedom,
the more right-skew the p.d.f. is. As the degrees of freedom increases,
χ2 (ν) becomes more symmetric.
So, how can we use the chi-squared distribution to assess the fit of a
logistic regression model? Well, it turns out that, if the proposed model is
a good fit, then the residual deviance D is approximately distributed as a
χ2 (r) distribution, which is denoted as D ≈ χ2 (r) (in this module, the ≈
symbol means ‘is approximately distributed as’), with
r = number of parameters in the saturated model
− number of parameters in the proposed model. (14)
This distributional result is in fact the reason why D is defined as twice
the difference in the log-likelihoods.
Now, we’ve already noted that a large value of D indicates that the
proposed logistic regression model may not be a good fit. So, since
D ≈ χ2 (r) when the model is a good fit, if the value of D falls in the upper
tail of the χ2 (r) distribution, then this suggests that D is larger than we’d
expect, and so the proposed model loses too much fit. This is illustrated in
Figure 16.
[Figure 16 shows the p.d.f. of the χ2 (r) distribution: values of D in the main body of the curve are as expected or smaller, so the proposed model doesn’t lose much fit, whereas values of D in the right-hand tail are larger than expected, so the proposed model loses too much fit.]
Figure 16 How D can be used to decide whether the proposed model is a good fit (here r has been taken to be 10)
So in order to test the hypotheses
H0 : proposed model is a good fit,
H1 : proposed model is not a good fit,
we can use D as the test statistic and χ2 (r) as the null distribution. If D is
large, then we’d reject H0 and conclude that the proposed model is not a
good fit, otherwise we wouldn’t reject H0 and therefore conclude that the
model is a good fit. As usual, the decision of whether to reject H0 or not
can be based on the p-value for the observed value of the test statistic
(D here) using the null distribution (χ2 (r) here). Since we’d only reject
H0 for large values of D, the p-value is the area under the χ2 (r) curve to
the right of the observed value of D: H0 should be rejected if the p-value is
small.
Calculating the value of the degrees of freedom r for the residual deviance
for a particular logistic regression model is illustrated in the next activity.
Activity 22 How many degrees of freedom?
Suppose that the observations y1 , y2 , . . . , yn are used to fit a logistic
regression model of the form
log( pi / (1 − pi) ) = α + β1 xi1 + β2 xi2 + · · · + βq xiq ,   i = 1, 2, . . . , n,
where x1 , x2 , . . . , xq are covariates.
(a) How many parameters are there in this proposed model?
(b) How many parameters are there in the saturated model?
(c) How many degrees of freedom r should be used so that, if the
proposed model is a good fit, then D ≈ χ2 (r)?
The residual deviance is summarised in Box 11.
Box 11 The residual deviance
Suppose that the observations y1 , y2 , . . . , yn are used to fit a logistic
regression model – the proposed model.
Then the residual deviance, denoted D, is defined to be
D = 2 × (l(saturated model) − l(proposed model)).
If the proposed model is a good fit, then
D ≈ χ2 (r),
where
r = number of parameters in the saturated model
− number of parameters in the proposed model
= n − number of parameters in the proposed model.
If D is large (so that the p-value is small), we then conclude that the
proposed model is a poor fit.
If D is small (so that the p-value is not small), we then conclude that
the proposed model is a good fit.
We’ll use the residual deviance in Example 9 and Activity 23.
Example 9 Residual deviance for the burns dataset
The burns dataset contains data on 435 patients. In Example 8, the
following logistic regression model was fitted to the data:
log( p̂ / (1 − p̂) ) = 22.223 − 10.453 logArea.
The residual deviance for this model is calculated to be D = 336.97.
Since there are data for 435 patients in this dataset, the saturated
model for these data must have 435 parameters.
The proposed model has two parameters: the intercept parameter α
(with α̂ = 22.223) and one regression coefficient β (with β̂ = −10.453)
for the covariate logArea.
Then D ≈ χ2 (r), where
r = 435 − 2 = 433.
The p-value for the observed value D = 336.97 is calculated using the
χ2 (433) distribution to be 0.9998. This p-value is very large (close
to 1), and so the value of D is nowhere near the right-hand tail of
χ2 (433). Therefore, we can conclude that the proposed model is a
good fit for the data.
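The p-value quoted in Example 9 is simply the upper-tail probability of the χ2 (433) distribution beyond the observed residual deviance, and can be reproduced in R:

D <- 336.97   # residual deviance from Example 9
r <- 435 - 2  # degrees of freedom: n minus the number of parameters in the model
pchisq(D, df = r, lower.tail = FALSE)  # about 0.9998: no evidence of poor fit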
Activity 23 Residual deviance for the OU students dataset
In Activity 18, we fitted the following logistic regression model
modResult ∼ bestPrevModScore + age + qualLink
using data from the OU students dataset. The explanatory variables
bestPrevModScore and age are both covariates, whereas qualLink is a
factor with two levels (‘maths’ and ‘not’).
The fitted model was
log( p̂ / (1 − p̂) ) = −4.273 + 0.099 bestPrevModScore − 0.020 age − 0.696 qualLinkMaths,
where qualLinkMaths = 1 for the level associated with a maths
qualification for the factor qualLink.
There were data for 1796 students in this dataset, and the response
variable modResult was 1 if the student passed the module and 0 if they
failed the module.
(a) Calculate the number of degrees of freedom for the appropriate χ2
distribution used to test whether the residual deviance D suggests
that the proposed model fits the data.
(b) The residual deviance for the model given in part (a) is calculated to
be D = 581.31, and the associated p-value is approximately 1. Does
this suggest that the proposed model is a good fit to the data?
The p-value associated with a residual deviance isn’t always routinely
reported by statistical software when fitting a logistic regression model.
We don’t, however, necessarily always need to calculate the p-value, since
there is a useful result which means that we can informally assess the fit of
a model using the residual deviance and the degrees of freedom of the null
χ2 distribution.
From standard distribution theory, the mean of a χ2 distribution is equal
to the distribution’s degrees of freedom. Now, the residual deviance D is
approximately distributed as χ2 (r) when the model is a good fit, and so
when the model is a good fit we have the result that
E(D) ≃ r.   (15)
We shall see how we can use this result to informally assess the fit of the
proposed model next in Activity 24.
Activity 24 Using D and r to informally assess model fit
Suppose that a logistic regression model was fitted to some data and the
residual deviance D for the model was calculated, so that D ≈ χ2 (r) if the
model is a good fit.
(a) If D < r, then is it likely that the model is a good or poor fit to the
data? Explain your reasoning.
(b) Is the model likely to be a good or poor fit to the data if D is much
larger than r? Explain your reasoning.
In Activity 24, we saw that we can use the value of D and the associated
degrees of freedom r to informally assess the fit of the proposed model.
This leads us to the general ‘rule of thumb’ for informally assessing the fit
of logistic regression models given in Box 12.
Box 12 ‘Rule of thumb’ for informally assessing fit using the residual deviance
Suppose that a proposed logistic regression model has residual
deviance D, so that D ≈ χ2 (r) if the model is a good fit.
We can informally assess the fit of the model by using the following
‘rule of thumb’.
• If D ≤ r, then the model is likely to be a good fit to the data.
• If D is ‘much larger’ than r, then the model is likely to be a poor fit
to the data.
The ‘rule of thumb’ given in Box 12 is useful for giving an informal
assessment of model fit. Basically, if D < r or D ≃ r, then the associated
p-value will be large enough for us to conclude that the model is an
adequate fit. However, if D > r, then D might be large enough to be in
the right-hand tail of the χ2 (r) distribution, and so we should calculate the
p-value to check whether or not D is ‘large enough’ to conclude that the
model is a poor fit. We shall assess the model fit of some logistic regression
models informally in Example 10 and Activity 25.
Example 10 Informal assessment of model fit for the burns dataset
In Example 9, we had the fitted model
log( p̂ / (1 − p̂) ) = 22.223 − 10.453 logArea,
using data from the burns dataset. In that example, we saw that the
residual deviance for this model is D = 336.97 and we calculated the
degrees of freedom for the associated χ2 null distribution to be
r = 433.
So, D < r for this model, which, by the ‘rule of thumb’ given in
Box 12, implies that the model is a good fit.
In Example 9, we also concluded that the model was a good fit, since
the p-value for the model given in that example was calculated to be
0.9998.
Activity 25 Assessing fit informally for the GB companies and European companies datasets
(a) In Notebook activity 6.3, we fitted the logistic regression model
resAndDev ∼ averageWage
using data from the GB companies dataset.
The residual deviance for the fitted model is D = 25.221 with
associated degrees of freedom r = 26. Do these values of D and r
suggest that the model is a good fit or a poor fit?
(b) In Notebook activity 6.4, we fitted the logistic regression model
resAndDev ∼ averageWage + country + product
using the European companies dataset.
The residual deviance for the fitted model is D = 275.41 with
associated degrees of freedom r = 262. Use these values to informally
assess the fit of the model.
5.3 The null deviance
As mentioned in Subsection 5.1, while the saturated model is the ‘best’
model in terms of fit, the null model (which doesn’t include any
explanatory variables and simply models each Yi by the intercept) is the
‘worst’ model in terms of fit.
The null deviance is the residual deviance of the null model. This means
that the null model is now our ‘proposed model’ in the residual deviance
formula, so that
null deviance = 2 × (l(saturated model) − l(null model)).
The null deviance is a measure of how much fit is lost by not modelling the
data at all. As such, we can think of the null deviance as an indication of
the total fit ‘available’ by modelling the data, and can be considered to be
analogous to the TSS in linear regression.
The next activity considers the null deviance further.
Activity 26 What does a small or large null deviance mean?
What does a small value of the null deviance mean? And what does a
large value mean?
The null deviance is the residual deviance for the null model. As such, if
the null model is a good fit to the data, then the null deviance will follow a
χ2 distribution.
Activity 27 Degrees of freedom for the null deviance’s distribution
What would be the degrees of freedom for the χ2 distribution associated
with the null deviance?
The null deviance is summarised in the next box.
Box 13 The null deviance
For observations y1 , y2 , . . . , yn , the null deviance is the residual
deviance of the null model:
null deviance = 2 × (l(saturated model) − l(null model)).
If the null model is a good fit, then
null deviance ≈ χ2 (n − 1).
We’ll consider the null deviance for the burns dataset next in Activity 28.
Activity 28 Null deviance for the burns dataset
The logistic regression model
survival ∼ logArea
was fitted to data from the burns dataset for 435 patients, and the null
deviance was calculated to be 525.39.
(a) Explain how we can informally assess whether the data can simply be
adequately described by the null model.
(b) The p-value associated with the observed value of the null deviance is
calculated to be approximately 0.002. What do you conclude?
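The p-value given in part (b) of Activity 28 is again an upper-tail chi-squared probability; in R:

null_deviance <- 525.39
df_null <- 435 - 1  # the null deviance has n - 1 degrees of freedom
pchisq(null_deviance, df = df_null, lower.tail = FALSE)  # about 0.002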
The null deviance is used as a measure of how much fit there is without
any explanatory variables in the model. It is also commonly used for
assessing the amount of fit gained by a proposed model in comparison to a
model with no explanatory variables, by considering the difference between
the residual deviance for the proposed model and the null deviance. This is
part of a wider strategy of comparing the fits of logistic regression models,
which is the subject of the next section. But first, we’ll complete this
section by using R to assess model fit.
5.4 Using R to assess model fit
You are now in a position to use your knowledge about assessing the fit of
models in R.
In Notebook activity 6.4 (in Subsection 4.2), we fitted the logistic
regression model
resAndDev ∼ averageWage + country + product
using data from the European companies dataset. In Notebook
activity 6.5, we’ll consider the fit of this model as measured by the residual
deviance. Notebook activity 6.6 provides further practice at assessing the
model fit of a logistic regression model.
Notebook activity 6.5 The residual deviance in R
In this notebook, we’ll use R to assess the model fit of a logistic
regression model using the residual deviance.
Notebook activity 6.6 Assessing another model fit
In this notebook, we’ll assess the fit of another logistic regression
model.
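For reference, the quantities used throughout this section are all available from a fitted glm object in R; a sketch, where fit is a hypothetical fitted logistic regression model:

deviance(fit)      # residual deviance D for the proposed model
df.residual(fit)   # its degrees of freedom r
fit$null.deviance  # the null deviance
fit$df.null        # its degrees of freedom, n - 1
pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)  # p-value for D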
6 Choosing a logistic regression model
In this section, we’ll look at how we can choose a logistic regression model
for a dataset by comparing model fits. There are two different strategies
that we’ll discuss – the first compares nested models in Subsection 6.1, and
the second compares non-nested models in Subsection 6.2. We’ll then put
these methods into practice using R in Subsection 6.3.
6.1 Comparing nested models
In the previous section, we saw how the residual deviance for a proposed
logistic regression model can be used to compare the fit of the proposed
model with the ‘perfect fit’ of the saturated model. In this subsection, we’ll
see how the residual deviance can also be used to compare the fits of two
proposed logistic regression models, as long as the two models are nested.
We met nested linear regression models in Unit 4; the same basic concept
relates to nested logistic regression models, as summarised in Box 14.
Box 14 Nested logistic regression models
Two logistic regression models, M1 and M2 say, are nested if the
linear part of the simpler model M1 is a special case of the linear part
of the more general model M2 . This means that M2 has all of M1 ’s
explanatory variables, as well as some extra explanatory variables.
Nested logistic regression models are illustrated in the next example.
Example 11 Which models are nested?
Suppose that a dataset has data for the response y, four possible
covariates, x1 , x2 , x3 , x4 , and three possible factors A, B, C.
Suppose that there are three logistic regression models, M1 , M2 and
M3 , being considered for this dataset:
• Model M1 : y ∼ A + x1 + x2 + x3 .
• Model M2 : y ∼ A + B + x1 .
• Model M3 : y ∼ A + B + C + x1 + x2 + x3 .
Then, models M1 and M3 are nested, with M1 nested within M3 , since
M1 ’s explanatory variables are a subset of M3 ’s explanatory variables.
Models M2 and M3 are also nested, with M2 nested within M3 , since
M2 ’s explanatory variables are also a subset of M3 ’s explanatory
variables.
However, M1 and M2 are not nested models, since neither model’s
explanatory variables are a subset of the other model’s explanatory
variables.
Suppose that we have two logistic regression models, M1 and M2, where
M1 is nested within M2. Denote the log-likelihood of the data under model
M1 by l(M1), and the log-likelihood of the data under model M2 by l(M2).
We can then calculate the residual deviance for each of these models. So,
denoting the residual deviance for M1 by D(M1 ), and the residual
deviance for M2 by D(M2 ), we have
D(M1) = 2 × (l(saturated model) − l(M1))
and
D(M2) = 2 × (l(saturated model) − l(M2)).
Activity 29 Which residual deviance is larger?
For the nested logistic regression models M1 and M2, where M1 is nested
within M2, which residual deviance – D(M1) or D(M2) – will be the
larger, and why?
Now, D(M1 ) and D(M2 ) give measures of the fit lost for models M1 and
M2 , respectively, in comparison to the ‘perfect fit’ saturated model.
Therefore, the difference between D(M1 ) and D(M2 ) gives a measure of
the fit lost due to using the smaller model M1 in comparison to using the
larger model M2 . This measure is called the deviance difference and is
defined as
deviance difference = D(M1 ) − D(M2 ).
Figure 17 illustrates D(M1 ), D(M2 ) and the deviance difference.

Deviance difference
is twice this

D(M2 ) is
twice this

D(M1 ) is
twice this

l(null model) l(M1 ) l(M2 ) l (saturated model)


Possible log-likelihood values for the data
Figure 17 D(M1 ), D(M2 ) and the deviance difference
Notice that from Figure 17 it looks like the deviance difference is twice the
difference between the log-likelihoods for the larger model M2 and the
smaller model M1 . This is indeed the case, which we can see
mathematically:
deviance difference = D(M1) − D(M2)
  = 2 × (l(saturated model) − l(M1)) − 2 × (l(saturated model) − l(M2))
  = 2 × (l(M2) − l(M1)).
Activity 30 considers what a small or large deviance difference means.
Activity 30 What does a small or large deviance difference mean?
(a) If the logistic regression model M1 is nested within the logistic
regression model M2 , what would a small value of the deviance
difference mean in terms of choosing between models M1 and M2 ?
(b) What would a large deviance difference mean?
As we saw in Activity 30, the deviance difference for two proposed nested
models, M1 and M2 , can help us choose between the models as follows.
• A small value of the deviance difference indicates that there is not much
fit to the data lost when using the smaller model M1 in comparison to
using the larger model M2 . As such, the fits of the two models are
comparable, and so for parsimony, we’d choose the smaller model M1 .
• A large value of the deviance difference indicates that too much fit to
the data is lost when using the smaller model M1 in comparison to the
larger model M2 . In this case, we’d choose the larger model M2 .
In the same way as the residual deviance is analogous to the RSS in linear
models, so the deviance difference is analogous to the difference in the RSS
values of two nested linear models.
In linear regression, we used the difference in RSS values as part of the
F -test statistic for deciding whether the gain in fit when using the larger
of the nested models was big enough to conclude that the extra parameters
should be included in the model. We can use the deviance difference in a
similar way, since the fit lost by using the smaller model M1 in comparison
to the larger model M2 is the same as the fit gained by using M2 in
comparison to M1 .
We can therefore use the deviance difference to test the following
hypotheses for logistic regression models M1 and M2 , where M1 is nested
within M2 .
H0 : the extra parameters in M2 should not be included
(that is, M1 is an adequate fit in comparison to M2 ),
H1 : the extra parameters in M2 should be included
(that is, M1 is not an adequate fit in comparison to M2 ).

A large value of the deviance difference would suggest that there is a substantial gain in fit when the extra parameters are added, and so we
would want to reject H0 . On the other hand, a small value of the deviance
difference would suggest that there is little gain in fit with the extra
parameters, in which case we would not want to reject H0 .
In order to decide whether the test statistic (the deviance difference here)
is ‘large enough’ to suggest that we should reject H0 , we need the null
distribution for the deviance difference; this turns out to be
(approximately) another χ2 distribution. As usual, we can use the p-value
associated with the observed test statistic to decide whether or not to
reject H0 . Since we’d only want to reject H0 for large values of the
deviance difference, the p-value is the area under the χ2 null distribution
curve to the right of the observed deviance difference; H0 should then be
rejected if the p-value is small.

Using the deviance difference to compare the fits of nested logistic regression models is summarised in Box 15.

Box 15 Comparing the fits of nested logistic regression models
Suppose that we have two logistic regression models, M1 and M2 ,
where M1 is nested within M2 .
Denoting M1 ’s residual deviance by D(M1 ) and M2 ’s by D(M2 ), the
deviance difference is then defined to be
deviance difference = D(M1 ) − D(M2 ).
If both M1 and M2 are good fits, then
deviance difference ≈ χ2 (d),
where
d = difference in degrees of freedom for D(M1 ) and D(M2 )
= number of extra parameters in the larger model M2 .
To decide whether or not the deviance difference is large enough to
suggest a significant gain in model fit when including M2 ’s extra
parameters, use the deviance difference as a test statistic and χ2 (d) as
the null distribution.
If the deviance difference is large (so that the p-value is small), then
there is a significant gain in model fit when including M2 ’s extra
parameters, which indicates we should choose M2 .
If the deviance difference is small (so that the p-value is large), then
there is not a significant gain in model fit when including M2 ’s extra
parameters, which indicates we should choose M1 for parsimony.


Using the deviance difference for comparing nested logistic regression models is illustrated in the following example and activity.

Example 12 Deviance difference in action


The OU students dataset contains data on 1796 students. The
following logistic regression models were fitted to these data.
• Model M1 : modResult ∼ bestPrevModScore.
The residual deviance for this model, D(M1 ), is 588.82; the
associated approximate null distribution is χ2 (1794).
• Model M2 : modResult ∼ bestPrevModScore + age + qualLink.
The residual deviance for this model, D(M2 ), is 581.31; the
associated approximate null distribution is χ2 (1792).
As a reminder, bestPrevModScore and age are both covariates, and
qualLink is a factor with two levels (‘maths’ and ‘not’).
Since M1 is nested within M2 , we can calculate the deviance
difference between these two models by
deviance difference = D(M1 ) − D(M2 )
= 588.82 − 581.31 = 7.51.
Model M2 has two more parameters than M1 (one for the covariate
age and one for the factor qualLink), and so this deviance difference
is approximately distributed as χ2 (2). We can also calculate the
degrees of freedom as the difference between the degrees of freedom of
the χ2 null distributions for D(M1 ) and D(M2 ), namely
1794 − 1792 = 2.
The p-value associated with this deviance difference is calculated to
be 0.023. Since this is small, there is evidence to suggest that there is
a significant gain in fit by including age and qualLink in the model,
in addition to bestPrevModScore.
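A quick R check of this calculation is sketched below; model1 and model2 are hypothetical names for the two fitted glm objects, and the residual deviances and degrees of freedom are extracted with deviance() and df.residual().

  # Deviance difference between two nested logistic regression models
  dev.diff <- deviance(model1) - deviance(model2)
  df.diff  <- df.residual(model1) - df.residual(model2)
  # p-value: area under the chi-squared(df.diff) curve to the right
  pchisq(dev.diff, df = df.diff, lower.tail = FALSE)
  # With the values from Example 12:
  pchisq(7.51, df = 2, lower.tail = FALSE)  # approximately 0.023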

Activity 31 Comparing two more models for the OU students dataset
Consider once again the OU students dataset which contains data for 1796
students. The following two logistic regression models were fitted to the
data (note that M1 is the same model considered in Example 12).
• Model M1 : modResult ∼ bestPrevModScore.
The residual deviance for this model, D(M1 ), is 588.82; the associated
approximate null distribution is χ2 (1794).


• Model M2 : modResult ∼ bestPrevModScore + age.


The residual deviance for this model, D(M2 ), is 586.54; the associated
approximate null distribution is χ2 (1793).
(a) Calculate the deviance difference for the nested models M1 and M2 .
(b) What are the degrees of freedom of the χ2 null distribution for the
deviance difference that you calculated in part (a)?
(c) The p-value associated with the deviance difference is 0.131. What do
you conclude?

As mentioned at the end of Subsection 5.3, the null deviance is commonly used for assessing the amount of fit gained by a proposed model in
comparison to a model with no explanatory variable. We can do this using
the deviance difference as described in Box 15, by thinking of our null
model as being model M1 and our proposed model as being M2 (since the
null model is nested within all of the other possible models). In particular,
this gives us a method of assessing whether or not individual explanatory
variables are useful for modelling the response. This is demonstrated next
in Example 13 and Activity 32.

Example 13 Is logArea useful for modelling survival?


Consider once again the logistic regression model
survival ∼ logArea,
which was fitted to data from the burns dataset.
In Example 9, we saw that the residual deviance for this model is
D = 336.97 with 433 associated degrees of freedom, and in
Activity 28, we saw that the null deviance for these data is calculated to
be 525.39 with 434 associated degrees of freedom.
Now, since the null model is nested within the proposed model, we can
think of the null model as M1 and the proposed model as M2. Then
deviance difference = D(M1 ) − D(M2 )
= null deviance − D
= 525.39 − 336.97 = 188.42.
We can also calculate the degrees of freedom as the difference between
the degrees of freedom of the χ2 null distributions for D(M1 ) and
D(M2 ), namely
434 − 433 = 1.


Here, the deviance difference is 188.42 and the associated degrees of freedom for the χ2 null distribution is 1, which means that the
deviance difference is much larger than the expected deviance
difference (that is, the degrees of freedom) if both models are a good
fit. As such, we can be pretty sure that the deviance difference is
‘large enough’ to suggest that there is a significant increase in fit when
logArea is added to the model (and indeed, the p-value is
approximately 0!).
Therefore, it looks like logArea is indeed useful for modelling
survival.
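In R, this kind of comparison can be carried out with the anova function applied to a fitted glm, which reports the drop in deviance as each term is added; test = "Chisq" supplies the corresponding p-value. A minimal sketch, where burns.glm and the data frame name burns are hypothetical stand-ins for however the model and data are set up in the notebooks:

  # burns.glm is a hypothetical fitted model, e.g.
  # burns.glm <- glm(survival ~ logArea, family = binomial, data = burns)
  anova(burns.glm, test = "Chisq")  # change in deviance as each term is added
  # Equivalently, by hand with the values from Example 13:
  pchisq(525.39 - 336.97, df = 434 - 433, lower.tail = FALSE)  # essentially 0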

Activity 32 Is age useful for modelling modResult?

The logistic regression model


modResult ∼ age
was fitted to data from the OU students dataset.
The residual deviance for this model is D = 678.30 with 1794 associated
degrees of freedom, and the null deviance is 678.51 with 1795 associated
degrees of freedom.
Does age seem to be useful for modelling modResult?

6.2 Comparing non-nested models


The deviance difference can only be used for comparing the fits of nested
models. Non-nested logistic regression models require an alternative
measure. When comparing the fits of non-nested multiple regression
models in Unit 2, we used the Akaike information criterion (AIC). This
same measure can be used for comparing the fits of logistic regression
models. The AIC is useful for comparing the fits of models since it
balances model fit (which is measured by the log-likelihood) against model
complexity. A brief reminder of the AIC is given in Box 16.
Box 16 A brief reminder of the AIC
• The Akaike information criterion (AIC) calculates the amount of
information lost by a given model relative to the amount of
information lost by other models.
• The AIC also considers the number of explanatory variables in the
model, preferring simpler models.
• The best model from a set of alternatives is the model with the lowest AIC.
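In R, the AIC of one or more fitted models can be obtained with the AIC function; a minimal sketch, with m1 and m2 hypothetical fitted glm objects:

  # Compare the AIC values of two fitted logistic regression models
  AIC(m1, m2)  # the preferred model is the one with the lower AIC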


Note that, like the likelihood, the value of the AIC is not an absolute
measure of the fit of a model. As such, the AIC can only be used to
compare the fits of a set of possible models. This, however, means that if
all of the potential models are a bad fit, then we won’t know this from the
values of the AIC.
The AIC is used for selecting the ‘best’ logistic regression model from a
group of alternatives in exactly the same way as it was used for selecting
the ‘best’ multiple regression model in Unit 2. The AIC is used to compare
two logistic regression models in Activity 33.

Activity 33 Using AIC to compare non-nested models

Two logistic regression models, M1 and M2, were fitted to the OU students dataset:
• Model M1 : modResult ∼ bestPrevModScore.
• Model M2 : modResult ∼ contAssScore.
The AIC for M1 was calculated to be 592.82, while the AIC for M2 was
calculated to be 411.26. Based on this information, which of the two
models is better?

When using linear regression in this module, we have been using stepwise
regression as an automated procedure for selecting which explanatory
variables should be included in our model, and at each stage in the
stepwise regression procedure, the AIC has been used to compare the
model fits. Stepwise regression can be used in exactly the same way in
logistic regression; we shall see stepwise regression in action using R in the
next subsection.

6.3 Using R to compare model fits


We shall start this subsection by seeing how R can be used to compare the
fits of logistic regression models. In Notebook activity 6.7, we’ll fit some
models using data from the OU students dataset, and we’ll compare the
fits of these models using both the deviance difference (which can only be
used to compare nested models) and the AIC (which can be used to
compare both nested and non-nested models).
Notebook activity 6.8 then gives further practice at using R to compare
the fits of (nested) logistic regression models, this time using data from the
European companies dataset. Finally, in Notebook activity 6.9, we’ll use
stepwise regression to choose a logistic regression model for the response
modResult using data from the OU students dataset.
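As a flavour of what Notebook activity 6.9 does, here is a minimal R sketch of a stepwise search; step() starts from a fitted glm and adds or drops terms using the AIC. The data frame name students is a hypothetical stand-in for however the OU students dataset is loaded in the notebook.

  # Stepwise selection for a logistic regression model using the AIC
  # 'students' is a hypothetical name for the OU students data frame
  full.glm <- glm(modResult ~ bestPrevModScore + age + qualLink,
                  family = binomial, data = students)
  step(full.glm, direction = "both")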


Notebook activity 6.7 Comparing the fits of logistic regression models in R
This notebook explains how to use R to compare the fits of logistic
regression models.

Notebook activity 6.8 Comparing the fits of more (nested) logistic regression models
This notebook gives further practice at using R to compare the fits of
nested logistic regression models.

Notebook activity 6.9 Choosing a logistic regression model in R
This notebook uses stepwise regression to choose a logistic regression
model.

7 Checking the logistic regression model assumptions
We’ve seen how we can assess the fit of a logistic regression model, and
how we can compare the fits of different models in order to choose the
‘best’ logistic regression model for our data. However, we also need to be
sure that our chosen model satisfies the assumptions that are needed in
order to use logistic regression; we shall consider what these assumptions
are in Subsection 7.1.
Subsection 7.2 introduces what are known as deviance residuals. These are
a type of residual which work better than the standard residuals used so
far in M348 for checking the model assumptions in logistic regression.
We’ll then look at some diagnostic plots which can be used with logistic
regression in Subsection 7.3, before using R to produce these plots in
Subsection 7.4.


7.1 The model assumptions


We’ll start by explicitly stating the assumptions of the logistic regression
model; these are given in Box 17.

Box 17 Assumptions of the logistic regression model


Suppose that a response Y is modelled by a logistic regression model,
with possible factors A, B, . . . , Z and covariates x1 , x2 , . . . , xq as the
explanatory variables. Then the following assumptions must hold.
• Response distribution: The response Yi is a binary variable
which follows a Bernoulli distribution, so that, for the ith
observation
Yi ∼ Bernoulli(pi ), 0 < pi < 1,
where pi is the probability of success.
• Linearity: The relationship between the log odds, log(pi /(1 − pi )),
and the explanatory variables is linear.
• Independence: The response variables Y1 , Y2 , . . . , Yn are
independent of each other.

The independence assumption given in Box 17 is also an assumption of linear regression, and so this one should be familiar to you. The other
assumptions are, however, different to those for linear regression. Let’s
look briefly at each of these in turn.
The first assumption in Box 17 concerns the response distribution. The
fact that the response is binary, so that linear regression is no longer
suitable, has been driving the development of logistic regression
throughout this unit. So, hopefully this assumption is no surprise to you!
The second assumption in Box 17 concerns linearity. In linear regression,
the linearity assumption involves the explanatory variables and the
response, whereas in logistic regression, the linearity assumption involves
the explanatory variables and a function of the response mean,
E(Yi) (= pi). Note, however, that, as in linear regression, the linearity assumption applies to linearity in the parameters. This means, for example, that the model
 
log(pi/(1 − pi)) = α + β1 xi1 + β2 xi2
and a logistic regression model with the explanatory variables transformed
 
log(pi/(1 − pi)) = α + β1 xi1² + β2 √xi2
both satisfy the linearity assumption, since both equations are linear in the
parameters α, β1 and β2 .


For linear regression, checking the model assumptions primarily revolves around diagnostic plots. There is a similar set of diagnostic plots for
logistic regression. Unfortunately, these diagnostic plots are not quite as
helpful as the diagnostic plots that we’ve been using for linear regression.
They can, however, be useful for flagging up when things are going
horribly wrong!

7.2 Deviance residuals


In linear regression, the residuals
ri = yi − ŷi,  i = 1, 2, . . . , n,
play a crucial role in checking whether or not the assumptions underlying
the linear regression model seem reasonable.
Residuals can be defined in exactly the same way for fitted logistic
regression models. However, it turns out that, for checking the assumptions
in logistic regression, it is better to use what are known as deviance
residuals. It is unfortunate that we have the confusing terminology of
‘deviance residuals’ and ‘residual deviance’! Both are standard
terminology and mean quite different things. If you find yourself getting
confused, focus on the second word for each – that is, ‘deviance residuals’
are residuals, while ‘residual deviance’ is a deviance measure of fit.
As the name suggests, deviance residuals are related to the residual
deviance D. In fact, denoting the deviance residual for the ith observation
by di , we have that
D = d1² + d2² + · · · + dn².

You do not need to know the formula for calculating di for this module,
since R automatically calculates the deviance residuals when fitting a
logistic regression model. Basically, each deviance residual di is analogous
to the residual ri in multiple linear regression, and each deviance residual
can be thought of as the contribution to the residual deviance D for the
ith data point.
If the model assumptions for the logistic regression model are adequate,
then the deviance residuals should be approximately normally distributed
(provided that the number of observations for each combination of factor
levels is not small). In particular, this means that the standardised
deviance residuals should approximately follow the standard normal
N (0, 1) distribution. (As a reminder, a variable can be ‘standardised’ by
transforming it so that it has mean 0 and variance 1.)
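In R, once a logistic regression model has been fitted with glm(), the deviance residuals and their standardised versions can be extracted directly; a minimal sketch, assuming a hypothetical fitted model object called fit:

  # Deviance residuals and standardised deviance residuals for a fitted glm
  dev.res <- residuals(fit, type = "deviance")
  std.dev.res <- rstandard(fit)   # standardised deviance residuals
  sum(dev.res^2)                  # equals the residual deviance, deviance(fit)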
There are several plots of the deviance residuals which can be used to
highlight any potential problems with the fitted model. We shall introduce
you to four of these plots briefly in Subsection 7.3. However, be warned!
For the logistic regression model, many of these plots look rather odd due
to the fact that the response variable is binary.


7.3 Diagnostic plots for logistic regression


In this subsection, we’ll consider four diagnostic plots for checking logistic
regression models. As has already been mentioned, the logistic regression
model is a specific type of a more general class of models called generalised
linear models. It turns out that the diagnostic plots that we’ll introduce
for logistic regression can also be used more generally for other types of
generalised linear models. As such, we shall also use these same plots when
considering other types of generalised linear models in Unit 7.
We’ll start in Subsection 7.3.1 by considering a plot which is analogous to
a plot of the residuals against fitted values in linear models.
Subsections 7.3.2 and 7.3.3 then introduce two plots which can be used to
check the independence assumption. Finally, in Subsection 7.3.4, we’ll
discuss how normal probability plots can be used for checking the model
assumptions in logistic regression.

7.3.1 Standardised deviance residuals against a transformation of µ̂
The first plot that we’ll consider is analogous to a plot of the residuals
against fitted values in linear models and is useful for checking the
linearity assumption. However, unlike residual plots for linear models
where we hope that the points are randomly scattered, there is a very
distinct pattern in the analogous plot for logistic regression!
Instead of the residuals, when we have logistic regression we plot the standardised deviance residuals. Also, since logistic regression focuses on modelling the mean response E(Yi), rather than the response itself, the plot uses the fitted mean (that is, p̂i) instead of the fitted response (that is, ŷi). However, rather than using the notation p̂i (the specific notation used for the fitted mean for logistic regression), for consistency with what’s to come in Unit 7 the plots will use the more general notation µ̂i to represent the fitted mean of the ith observation. What’s more, the standardised deviance residuals will be plotted against a transformation of the fitted mean µ̂, namely 2 arcsin(√µ̂). (Note that arcsin is the same as sin⁻¹.) You do not need to worry about the details of this transformation, but basically, it helps to make the plot more interpretable.
We’ll take a look at this plot for the logistic regression model fitted to data
from the burns dataset in Example 14.


Example 14 Standardised deviance residuals against a transformation of µ̂ for the burns dataset
Figure 18 shows a plot of the standardised deviance residuals against a transformation of µ̂ for the logistic regression model fitted to data from the burns dataset.

Figure 18  Plot of the standardised deviance residuals against a transformation of µ̂ for the burns dataset
In the plot shown in Figure 18, there are two distinct ‘lines’ of points.
These lines are due to the fact that the response variable is binary.
The upper line of positive values corresponds to data with Yi = 1 and
the lower line of negative values corresponds to the data with Yi = 0.
The pattern of lines is then a result of the fact that the fitted logistic
curve runs between 1 and 0, whereas the Yi values are always exactly
1 or 0. The points follow on from each other smoothly because they
are ordered by the estimated values of p̂i (= µ̂i).
The most useful thing to concentrate on in Figure 18 is the smoothed
red line. If the linearity assumption is fine, then the smoothed red line
should be roughly a horizontal straight line. In the plot shown, the
red line is fairly constant with only a slight curve, and so the linearity
assumption seems reasonable.

We saw that the plot of the standardised deviance residuals against the
transformation of µ̂i resulted in two ‘lines’ of points in Example 14. This
type of pattern in the standardised deviance residuals is typical for this
plot in logistic regression.

As mentioned in Example 14, the smoothed red line is the most useful
thing to concentrate on in the plot. Ideally, we want this line to be roughly
a horizontal straight line: this would indicate that the linearity assumption
is reasonable. Curvature in the smoothed red line could indicate that the
model may need a different link function, rather than log(pi /(1 − pi )), or
that a transformation of one of the explanatory variables may be needed.
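A minimal R sketch of this plot, assuming a hypothetical fitted glm object called fit (the module’s plots are produced in the notebooks, so this is just an illustration of the idea):

  # Standardised deviance residuals against 2 arcsin(sqrt(mu-hat)),
  # with a smoother for judging the linearity assumption
  mu.hat <- fitted(fit)
  plot(2 * asin(sqrt(mu.hat)), rstandard(fit),
       xlab = "2 arcsin(sqrt(mu-hat))",
       ylab = "Standardised deviance residual")
  lines(lowess(2 * asin(sqrt(mu.hat)), rstandard(fit)), col = "red")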
Next, in Activity 34, we’ll use this type of plot to check the assumption of
linearity for a logistic regression model fitted to data in the GB companies
dataset.

Activity 34 Standardised deviance residuals against a transformation of µ̂ for the GB companies dataset
In Notebook activity 6.3, we fitted the model
resAndDev ∼ averageWage
using data from the GB companies dataset. In Activity 25, we concluded
that this model was a good fit by informally comparing the values of the
residual deviance for this fitted model (D = 25.221) with its associated
degrees of freedom (r = 26).
Figure 19 shows a plot of the standardised deviance residuals against the
transformed fitted mean for this model.
Does this plot indicate any problems with the model assumptions?

Figure 19  Plot of the standardised deviance residuals against a transformation of µ̂ for the GB companies dataset


7.3.2 Standardised deviance residuals against index


A plot of the standardised deviance residuals against index number
produces a plot of the standardised deviance residuals in the order that the
data were collected. This plot can be used to check for independence of the
responses Y1 , Y2 , . . . , Yn : if they are independent, the standardised
deviance residuals in the plot should fluctuate randomly and there
shouldn’t be systematic patterns in the plot. We shall see this plot in
action in Example 15.

Example 15 Standardised deviance residuals against index for the burns dataset
Figure 20 shows a plot of the standardised deviance residuals against
index for the logistic regression model fitted to data from the burns
dataset.

Figure 20  Plot of standardised deviance residuals against index for the burns dataset
In the plot shown in Figure 20, we can see two groups of points. This
is again due to the fact that the response variable is binary.
We want to check that the standardised deviance residuals fluctuate
randomly in the plot. However, the spread of the negative deviance
residuals appears to increase as the index number increases. Since the
data concern individual patients, it is unlikely that the observations
are not independent. However, given that a pattern has been seen in
the plot, it would be worth checking how the observations were
recorded.
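A minimal R sketch of an index plot, again with fit a hypothetical fitted glm object:

  # Standardised deviance residuals in data order (index plot)
  plot(rstandard(fit), xlab = "Index",
       ylab = "Standardised deviance residual")
  abline(h = 0, lty = 2)  # reference line at zero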


We’ll look at the plot of the standardised deviance residuals against index
for the GB companies dataset next.

Activity 35 Standardised deviance residuals against index for the GB companies dataset
Following on from Activity 34, Figure 21 shows a plot of the standardised
deviance residuals against index for the fitted model
resAndDev ∼ averageWage
fitted to the GB companies dataset.
Does this plot indicate any problems with the independence assumption
for the model?

Figure 21  Plot of standardised deviance residuals against index for the GB companies dataset

7.3.3 Squared standardised deviance residuals against index
Like the previous plot, a scatterplot of the squared standardised deviance
residuals against index can be used to check the independence assumption.
By squaring the standardised deviance residuals, all of the residuals can be
compared on the same scale, and, since the standardised deviance residuals
relating to observations for which Yi = 1 are distinguished from those
relating to the observations for which Yi = 0, the plot can help us to
compare the standardised deviance residuals for the two groups more easily.
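A minimal R sketch of this plot, assuming a hypothetical fitted glm object fit and y, a hypothetical name for the vector of observed binary responses:

  # Squared standardised deviance residuals against index,
  # with different plotting symbols for y = 1 and y = 0
  plot(rstandard(fit)^2, pch = ifelse(y == 1, 1, 2),
       xlab = "Index",
       ylab = "Squared standardised deviance residuals")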

The plot is shown for the logistic regression model fitted to the burns
dataset in Example 16.

Example 16 Squared standardised deviance residuals against index for the burns dataset
Figure 22 shows a plot of the squared standardised deviance residuals
against index for the logistic regression model fitted to the burns
dataset.

Figure 22  Squared standardised deviance residuals against index for the burns dataset. The red circles denote observations for which Yi = 1 and the blue triangles denote observations for which Yi = 0.

In this plot, we can clearly see the different spreads of the standardised deviance residuals relating to Yi = 1 and Yi = 0 that
were identified in the plot of the standardised deviance residuals
against index considered in Example 15. In fact, it could be argued
that the increasing scatter with increasing index looks more
pronounced in this plot.

We’ll look at the plot of the squared standardised deviance residuals against index for the GB companies dataset in the next activity.


Activity 36 Squared standardised deviance residuals against index for the GB companies dataset
Following on from Activities 34 and 35, Figure 23 shows a plot of the
squared standardised deviance residuals against index for the fitted model
resAndDev ∼ averageWage.
Does this plot indicate any problems with the model assumptions?

Figure 23  Squared standardised deviance residuals against index for the GB companies dataset. The red circles denote observations for which Yi = 1 and the blue triangles denote observations for which Yi = 0.

7.3.4 Normal probability plot


If the response distribution assumption is correct, then the standardised
deviance residuals should be approximately distributed as N (0, 1). This
can be checked through a normal probability plot of the standardised
deviance residuals. As usual, the normality assumption is reasonable if the
points lie roughly on the straight line. We’ll consider the normal
probability plot for the logistic regression model fitted to the burns dataset
in Example 17.


Example 17 Normal probability plot for the burns dataset


Figure 24 shows the normal probability plot of the standardised
deviance residuals for the logistic regression model fitted to data from
the burns dataset.

Figure 24  Normal probability plot of the standardised deviance residuals for the burns dataset

In this normal probability plot, we again notice two distinct ‘lines’ of points. As with the plot of the standardised deviance residuals against
a transformation of the fitted mean, these two lines are due to the fact
that the response variable is binary.
The points in the normal probability plot are generally close to the
diagonal line, although there is some deviation in the lower end of the
first ‘line’ and in both ends in the second ‘line’. However, since the
normality assumption is only approximate and the deviations from the
line are not terribly marked, the assumption of normality of the
standardised deviance residuals does not seem unreasonable.
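A minimal R sketch of the normal probability plot, again with fit a hypothetical fitted glm object:

  # Normal probability plot of the standardised deviance residuals
  qqnorm(rstandard(fit))
  qqline(rstandard(fit))  # reference line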

Finally, we shall consider the normal probability plot for the logistic
regression model fitted to the GB companies dataset in Activity 37.


Activity 37 Normal probability plot for the GB companies dataset
Following on from Activities 34, 35 and 36, Figure 25 shows the normal
probability plot of the standardised deviance residuals for the logistic
regression model fitted to the GB companies dataset.
Does this plot indicate any problems with the model assumptions?

Figure 25  Normal probability plot for the GB companies dataset

Now that we have introduced the four diagnostic plots that we’ll be using
for logistic regression, we can move onto using R to obtain the plots.


7.4 Using R to produce diagnostic plots


We’ll start in Notebook activity 6.10 by explaining how R can be used to
produce diagnostic plots for fitted logistic regression models. In doing so,
we’ll focus on producing diagnostic plots for the logistic regression model
resAndDev ∼ averageWage + country + product,
which we fitted in Notebook activity 6.5 (in Subsection 5.4) using data
from the European companies dataset.
Then, in Notebook activity 6.11, we shall produce the diagnostic plots for
the logistic regression model chosen by the stepwise regression procedure
carried out in Notebook activity 6.9 (in Subsection 6.3) for the OU
students dataset.

Notebook activity 6.10 Diagnostic plots for logistic regression in R
This notebook explains how to produce diagnostic plots for logistic
regression in R.

Notebook activity 6.11 Checking the assumptions of another logistic regression model
In this notebook, we shall check the assumptions for another logistic
regression model.


Summary
In this unit, we have developed logistic regression for modelling binary
variables.
For binary responses Y1 , Y2 , . . . , Yn , it is assumed that for each
i = 1, 2, . . . , n,
Yi ∼ Bernoulli(pi ), 0 < pi < 1,
where pi is the probability of success for the ith observation. Logistic
regression then models the relationship between E(Yi ) = pi and the
explanatory variables.
In the model, a link function links E(Yi ) to a linear combination of the
explanatory variables. The link function used in M348 (and which is also
most commonly used) is the logit link function, which is defined as
 
logit(E(Yi)) = log(E(Yi)/(1 − E(Yi))).
Our regression model is then
log(E(Yi)/(1 − E(Yi))) = linear function of explanatory variables.
One of the reasons that the logit link function is popular for logistic
regression is because of the interpretability of the model. Let β denote the
regression coefficient for a covariate x. Then
β = log(OR),
where OR is the odds ratio given a unit increase in x, so that
OR = (odds of success when x takes value w + 1)/(odds of success when x takes value w).
The odds ratio can be interpreted as follows.
• If OR > 1, then the odds of success increases by a factor equal to the
value of the OR when x is increased by one unit.
• If OR < 1, then the odds of success decreases by a factor equal to the
value of the reciprocal of the OR when x is increased by one unit.
A logistic regression model is then interpreted through the result that
OR = exp(β).
The quantity exp(β) is known as the odds multiplier. It tells us about the
relationship between Y and x, because it is the number that the odds are
multiplied by for a unit increase in x.
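In R, the odds multipliers can be read off a fitted model directly; a minimal sketch, with fit a hypothetical fitted glm object:

  # Odds multipliers: exponentiate the estimated regression coefficients
  exp(coef(fit))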


When there is more than one explanatory variable, each odds multiplier is
partial and assumes that the other explanatory variables are fixed for a
unit increase in the associated explanatory variable. For a factor, there’s
an odds multiplier associated with each of the indicator variables for the
levels of the factor: so, for the kth level of the factor, the odds multiplier is
interpreted as the OR for the kth level of the factor relative to the first
level of the factor.
Once we have a fitted logistic regression model, we are often interested in
p̂i, the fitted success probability for Yi, and p̂0, the predicted success probability for a new response Y0. These are obtained easily by using
p̂i = exp(fitted linear function of explanatory variables)/(1 + exp(fitted linear function of explanatory variables)).
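In R, these probabilities are available without applying the formula by hand; a minimal sketch, with fit a hypothetical fitted glm object and new.data a hypothetical data frame of new explanatory-variable values:

  # Fitted success probabilities for the observed data
  fitted(fit)
  # Predicted success probability for a new observation
  predict(fit, newdata = new.data, type = "response")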

The fit of the proposed logistic regression model can be assessed using the
residual deviance D, where

D = 2 × (l(saturated model) − l(proposed model)).
If the proposed model is a good fit, then
D ≈ χ2 (r),
where
r = n − number of parameters in proposed model.
A useful ‘rule of thumb’ is
• if D ≤ r, then conclude that the model is a good fit.
The fits of two nested logistic regression models M1 and M2 , with M1
nested within M2 , can be compared using the deviance difference, where
deviance difference = D(M1 ) − D(M2 ).
If both M1 and M2 are a good fit, then
deviance difference ≈ χ2 (d),
where
d = difference in the degrees of freedom for D(M1 ) and D(M2 ).
The AIC can be used to compare non-nested models; we choose the model
with the smallest AIC.
In order to check the model assumptions in logistic regression, diagnostic
plots focus on deviance residuals – these are analogous to the standard
residuals used in linear regression. If the logistic regression model
assumptions are reasonable, then the standardised deviance residuals are
approximately distributed as N (0, 1).


A reminder of what has been studied in Unit 6 and how the sections link
together is shown in the following route map.

The Unit 6 route map

Section 1: Setting the scene
Section 2: Introducing logistic regression
Section 3: Interpreting the logistic regression model
Section 4: Using the logistic regression model
Section 5: Assessing model fit
Section 6: Choosing a logistic regression model
Section 7: Checking the logistic regression model assumptions


Learning outcomes
After you have worked through this unit, you should be able to:
• explain why linear regression is not appropriate for modelling a binary
response variable
• understand how modelling the success probability is key to building a
model for binary responses
• appreciate the role of the logit link function in logistic regression
• interpret a logistic regression model
• obtain fitted and predicted success probabilities for given values of the
explanatory variable(s)
• assess the fit of a logistic regression model
• compare the fits of two logistic regression models, both in the case of
nested models and non-nested models
• identify if there are problems with the assumptions of logistic regression
• fit a logistic regression model in R
• use R to predict success probabilities
• use R to assess model fit
• compare model fits in R
• use stepwise regression for logistic regression in R
• use R to produce diagnostic plots for logistic regression.

References
Bureau van Dijk (2020) Amadeus. Available at:
https://www.open.ac.uk/libraryservices/resource/database:350727&f=33492
(Accessed: 22 November 2022). (The Amadeus database can be accessed
from The Open University Library using the institution login.)
Fan, J., Heckman, N.E. and Wand, M.P. (1995) ‘Local polynomial kernel
regression for generalized linear models and quasi-likelihood functions’,
Journal of the American Statistical Association, 90 (429), pp. 141–150.
doi:10.1080/01621459.1995.10476496.


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Introduction, contactless payment: © rh2010 / www.123rf.com
Subsection 1.1, driver’s alcohol level: © Stuart Pearcey / Dreamstime.com
Subsection 1.2.1, plastics: © Cassidy Karasek / www.freepik.com
Subsection 1.2.2, hospital sign: © Sherry Young / www.123rf.com
Subsection 2.2, flexible dancer: © Aleksandr Doodko / www.123rf.com
Subsection 2.3, man and dog: © damedeeso / www.123rf.com
Subsection 3.1, betting chips: © Rawf88 / Dreamstime.com
Subsection 3.2, spam email: © gilc / www.123rf.com
Subsection 3.2, woman extending: © Fizkes / Dreamstime.com
Subsection 3.3, excited: © Joe Caione / www.unsplash.com
Subsection 4.1, student passing: © sebra / Shutterstock
Subsection 5.2, degree of freedom: © Nirat Makjantuk / www.123rf.com
Subsection 6.1, Russian dolls: © Rui Santos / www.123rf.com
Subsection 6.1, fish bowls: © Oleg Dudko / Dreamstime.com
Subsection 6.1, χ2 relief: © Mykola Kravchenko / Dreamstime.com
Subsection 6.2, non-nested models: © chianim8r / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
Some examples of datasets with a binary response are given below. You
probably thought of different situations, since binary response variables are
common in many areas!
• There are several medical examples where the binary response variable is
the presence or absence of a disease. So, in this case, the two possible
values for the binary response could be ‘disease present’ and ‘disease
absent’.
This response could depend on a range of explanatory variables which
describe the symptoms of the patient, such as their blood pressure,
together with a range of demographic variables, such as age and gender.
• An ecological study may be interested in whether or not a particular
species is in an area. So, in this case, the two possible values for the
binary response could be ‘species present’ and ‘species absent’.
This response could depend on a range of explanatory variables which
describe the conditions of the area, such as the abundance of food, the
climate, the habitat and the presence of any predators in the area.
• A company might be interested in marketing a new type of phone, and a
binary response variable could indicate whether or not a person buys the
new phone. So, in this case, the two possible values for the binary
response could be ‘purchase new phone’ and ‘don’t purchase new phone’.
This response could depend on a range of explanatory variables such as
the demographics of the potential customer (their age, salary, and so on)
and certain aspects of the phone (even down to things such as the
phone’s size, weight and colour).
• An email client needs to decide whether each incoming email is spam or
not. So, in this case, the two possible values for the binary response
could be ‘spam’ and ‘not spam’.
Whether an email is categorised as spam or not could depend on
explanatory variables such as the number of typing errors in the email
and the number of times particular words or phrases occur (such as
‘offer’, ‘prize’ or ‘free gift’).

Solution to Activity 2
The box associated with companies which engage in research and
development (so that resAndDev = 1) lies to the right of the box
associated with companies which do not engage in research and
development (so that resAndDev = 0). Therefore, from the boxplot, it
appears that companies that pay a higher average wage tend to engage in
research and development, and so it is plausible that there is a relationship
between the two variables.


Solution to Activity 3
The fitted linear regression line does give the impression that companies
with a higher average wage are more likely to engage in research and
development. This is indicated by the increasing fitted regression line.
However, none of the observed points is particularly close to the fitted
linear regression line.
Furthermore, the fitted line doesn’t go through resAndDev = 1 for the
range of observed values for averageWage, which is a problem since many
of the observed values of resAndDev are, of course, equal to 1!
So, overall, the fitted linear regression line is not a good fit for these data.

Solution to Activity 4
(a) From the comparative boxplot in Figure 6, the patients that survived
appear to have a distribution of logArea which is more spread out
than the distribution of logArea for patients who didn’t survive,
especially for the lower values of logArea. So, it appears that
logArea does have some effect on survival, with, unsurprisingly, the
patients with greater burn areas appearing to have less chance of
survival. There is, however, overlap in the distributions of logArea
for the two groups of patients.
(b) It is not obvious from Figure 7 how two straight lines could model
these data. For logArea < 1.7 the line survival = 1 fits the data
perfectly. But it is not obvious how a single straight line would best
fit the data when logArea > 1.7.

Solution to Activity 5
(a) In Figure 7, all of the points take the value 1 for survival when
logArea is less than about 1.7. Therefore, since logArea = 1.223 for
patient 24, it looks like this patient is almost certain to survive, so
that the likely value of p24 is approximately 1.
(b) In Figure 7, survival takes both the value 1 and the value 0 when
logArea > 1.7. Therefore, since logArea = 2.301 for patient 1, it’s
possible that patient 1 may survive, but it’s also possible that they
may not survive. As such, it is likely that p1 < 1, and so p1 is likely to
be less than p24 .


Solution to Activity 6
(a) There are 66 patients whose value of logArea is in the interval
2.1 to 2.2, of which 35 survived. Therefore, the proportion who
survived is 35/66 ≈ 0.530.
(b) From Box 1,
E(Y5 ) = p5 and E(Y7 ) = p7 .
Now, the fifth patient has logArea = 1.725 and so their value of
logArea is in the interval 1.7 to 1.8. From Table 4, the observed
proportion of patients surviving with values of logArea in this interval
is 0.971. So, using the observed proportions surviving as estimates of
the survival probabilities, we have the estimate E(Y5 ) = 0.971.
For the seventh patient, logArea = 2.295. So, their value of logArea
is in the interval 2.2 to 2.3. From Table 4, the observed proportion of
patients surviving with values of logArea in this interval is 0.125. So,
using the observed proportions surviving as estimates of the survival
probabilities, we have the estimate E(Y7 ) = 0.125.
(c) The estimated survival probabilities shown in Figure 8 do not appear
to lie on a single straight line. It may be possible to fit a straight line
through the points corresponding to values of logArea between about
1.9 and 2.3. For these points, the proportion of patients who survive
in each logArea interval rapidly decreases from around 1 to close to
0. However, the points are flat for lower values of logArea and also
(though less obviously) level off to 0 for higher values of logArea.

Solution to Activity 7
Some of the problems with using this curve for modelling the survival
probabilities are as follows.
• The relationship between the survival probabilities and logArea in the
burns dataset is decreasing, whereas the logistic function shown in
Figure 11 is increasing.
• The logistic function shown in Figure 11 is centred on x = 0, whereas
the S-shaped curve for survival probabilities is roughly centred on the
value logArea = 2.1.
• The value of the logistic function shown in Figure 11 is roughly 0 when
x is less than about −5 and roughly 1 when x is greater than about 5.
For the survival probabilities, we require the curve to be roughly 1 for
values of logArea less than about 1.6. We can’t see from Figure 8 what
happens to the curve for values of logArea greater than 2.4, but the
curve seems to be decreasing towards 0 for values of logArea greater
than about 2.4.


Solution to Activity 8
(a) When α = 2, the curve is positioned two units to the left of the curve
in Figure 11.
(b) When α = −4, the curve is positioned four units to the right of the
curve in Figure 11.
Although the question didn’t ask for sketches of the new curves, to
help you visualise them, both curves, together with the curve from
Figure 11, are shown in Figure S1.

Figure S1  Logistic function for different values of α when β = 1


Solution to Activity 9
(a) When β = 2, the curve is still increasing (since 2 > 0), but is steeper
and less spread out.
(b) When β = −0.5, the curve is now decreasing (since −0.5 < 0), and is
shallower and more spread out.
Although the question didn’t ask for sketches of the new curves, to
help you visualise them, both curves, together with the curve from
Figure 11, are shown in Figure S2.

Figure S2  Logistic function for different values of β when α = 0

Solution to Activity 10
(a) We have that
pi = 1/(1 + exp(−(α + βxi))).
But, since exp(−z) = 1/exp(z), this can be rewritten as
pi = 1/(1 + 1/exp(α + βxi)).
Then, multiplying every term in the fraction by exp(α + βxi), this becomes
pi = exp(α + βxi)/(exp(α + βxi) + 1),
as required.


(b) From part (a),
pi = exp(α + βxi)/(exp(α + βxi) + 1)
and so
pi (exp(α + βxi) + 1) = exp(α + βxi)
pi exp(α + βxi) + pi = exp(α + βxi)
pi = exp(α + βxi)(1 − pi),
which gives
pi/(1 − pi) = exp(α + βxi)
as required.

Solution to Activity 11
Probably the most obvious similarity is the fact that both models have the
same linear function of the explanatory variable (α + βxi ).
One of the obvious differences (that we’ve already discussed) is the fact
that E(Yi ) is equal to a linear function of the explanatory variable in the
simple linear regression model, but in the logistic regression model it is the
logit function of E(Yi ) = pi which is equal to a linear function of the
explanatory variable.
A more subtle difference is that in the simple linear regression model there
is an additive random term Wi , whereas the logistic regression model is
expressed directly in terms of the distribution of Yi (and there is no
additive random term).

Solution to Activity 12
(a) If the probability of a student passing a module is 0.9, then the odds
of the student passing is
odds = p/(1 − p) = 0.9/(1 − 0.9) = 9.

(b) We have that
odds = p/(1 − p).
Rearranging this, we get
odds × (1 − p) = p
odds − odds × p = p
odds = p (1 + odds).
So
p = odds/(1 + odds).

92
Solutions to activities

(c) If the odds of a company going into receivership is 0.6, then, using the
result from part (b), the probability that the company goes into
receivership is
p = odds/(1 + odds) = 0.6/(1 + 0.6) = 0.375.

Solution to Activity 13
(a) The success probability p takes a value between 0 and 1. Now, when p = 0,
odds = 0/(1 − 0) = 0
and when p = 1,
odds = 1/(1 − 1) = ∞.
For any value of p such that 0 < p < 1, the odds will be positive (since it is the ratio of two positive numbers). Therefore, the odds can take values between 0 and ∞.
(b) If odds = 1, then p = 1 − p, and so p = 1/2.
(c) The odds will be less than 1 if p < 1 − p. In this case, p < 1/2, and so success is less likely than failure.
The odds will be greater than 1 if p > 1 − p. In this case, p > 1/2, and so success is more likely than failure.

Solution to Activity 14
Let P (Y = 1 | x2 = 1) denote the probability that an email is spam
(Y = 1) given that it has at most two spelling mistakes (x2 = 1), and let
P (Y = 1 | x2 = 0) denote the probability that an email is spam (Y = 1)
given that it has three or more spelling mistakes (x2 = 0).
Then
odds when x2 is 1 = P(Y = 1 | x2 = 1)/(1 − P(Y = 1 | x2 = 1)) = 0.1/0.9 = 1/9
and
odds when x2 is 0 = P(Y = 1 | x2 = 0)/(1 − P(Y = 1 | x2 = 0)) = 0.55/0.45 = 11/9.
So, the odds ratio is
OR = (odds when x2 is 1)/(odds when x2 is 0) = (1/9)/(11/9) = 1/11.
This means that the odds that an email is spam given that it contains at most two spelling mistakes is increased by a factor of 1/11 – or equivalently, is decreased by a factor of 11 (the reciprocal of 1/11) – compared to the odds that an email is spam given that it contains three or more spelling mistakes.


Solution to Activity 15
(a) If β = 0.7, then the odds multiplier is
exp(0.7) ≈ 2.014.
This means that
OR ≈ 2.014
and so the odds of success are increased by a factor of 2.014 for a unit increase in the explanatory variable.
(b) If β = −1, then the odds multiplier is
exp(−1) ≈ 0.368.
So
OR ≈ 0.368
and the odds of success are increased by a factor of 0.368 for a unit increase in the explanatory variable. Or equivalently, the odds of success are decreased by a factor of
1/OR = 1/exp(−1) ≈ 2.718
for a unit increase in the explanatory variable.
(c) When β = −0.2,
exp(10 × (−0.2)) ≈ 0.135.
So, when β = −0.2, the odds of success are increased by a factor of 0.135 for an increase of 10 units in the explanatory variable. Or equivalently, the odds of success are decreased by a factor of
1/OR = 1/exp(−2) ≈ 7.389
for an increase of 10 units in the explanatory variable.

Solution to Activity 16
If β1 = 0.2, then the odds multiplier is
exp(0.2) ' 1.221.
So
OR ' 1.221
and the odds of success are increased by a factor of 1.221 for a unit
increase in x1 (assuming x2 and x3 are both fixed).
If β2 = −2.5, then the odds multiplier is
exp(−2.5) ' 0.082.

94
Solutions to activities

So
OR ' 0.082
and the odds of success are increased by a factor of 0.082 for a unit
increase in x2 (assuming x1 and x3 are both fixed). Or equivalently, the
odds of success are decreased by a factor of
1 1
= ' 12.182
OR exp(−2.5)
for a unit increase in the explanatory variable (assuming x1 and x3 are
both fixed).
If β3 = 1.5, then the odds multiplier is
exp(1.5) ' 4.482.
So
OR ' 4.482
and the odds of success are increased by a factor of 4.482 for a unit
increase in x3 (assuming x1 and x2 are both fixed).

Solution to Activity 17
If the regression coefficient associated with the indicator variable for level 2
is −0.4, then the odds multiplier for this indicator variable is
exp(−0.4) ≈ 0.670.
So, the odds of success is increased by a factor of 0.670 for level 2 of A in comparison to the odds of success for level 1 of A. Or equivalently, the odds of success is decreased by a factor of
1/OR = 1/exp(−0.4) ≈ 1.492
for level 2 of A in comparison to the odds of success for level 1 of A.
If the regression coefficient associated with the indicator variable for level 3 is 3, then the odds multiplier for this indicator variable is
exp(3) ≈ 20.086.
So, the odds of success is increased by a factor of 20.086 for level 3 of A in comparison to the odds of success for level 1 of A.


Solution to Activity 18
(a) The regression coefficient related to bestPrevModScore is 0.099. Therefore the odds multiplier for a unit increase in bestPrevModScore is exp(0.099) ≈ 1.104. This means that, for a unit increase in bestPrevModScore, the odds of passing the module increases by a factor of 1.104 (assuming age and qualLink are fixed).
(b) The regression coefficient related to age is −0.020. Therefore the odds multiplier for a unit increase in age is exp(−0.020) ≈ 0.980. This means that, for a unit increase in age, the odds of passing the module increases by a factor of 0.98, or equivalently, decreases by a factor of
1/OR = 1/exp(−0.02) ≈ 1.02
(assuming bestPrevModScore and qualLink are fixed).
(c) The regression coefficient related to level ‘maths’ of the factor qualLink is −0.696. Therefore, the odds multiplier associated with a student who is studying for a maths-based qualification compared to a non-maths qualification is exp(−0.696) ≈ 0.499. This means that the odds of passing the module for students studying a maths-based qualification are 0.499 of the odds of passing the module for students who are not studying a maths-based qualification (assuming bestPrevModScore and age are fixed). In other words, the odds of passing decreases by a factor of
1/OR = 1/exp(−0.696) ≈ 2.006
for maths students compared to non-maths students!

Solution to Activity 19
(a) For patient 2, logArea = 1.903. So, using Equation (11),
p̂2 = exp(22.223 − (10.453 × 1.903)) / (1 + exp(22.223 − (10.453 × 1.903))) ≈ 0.911.
So, the fitted probability of survival for patient 2 is 0.911.
(b) For patient 3, logArea = 2.039. So, using Equation (11),
p̂3 = exp(22.223 − (10.453 × 2.039)) / (1 + exp(22.223 − (10.453 × 2.039))) ≈ 0.713.
(c) Denote the predicted probability of survival for the new patient by p̂0. For this new patient, the value of logArea is 2.3. Once again using Equation (11), we have
p̂0 = exp(22.223 − (10.453 × 2.3)) / (1 + exp(22.223 − (10.453 × 2.3))) ≈ 0.140.
So, the predicted probability of survival for this new patient is 0.140.


Solution to Activity 20
(a) We wish to calculate p̂11, which we can do using Equation (12). For this student, bestPrevModScore takes the value 92, age = 36 and qualLink takes the value ‘not’. Substituting these values into Equation (12) and using the parameter estimates for the fitted model given in Table 5, p̂11 is calculated as
p̂11 = exp(−4.273 + (0.099 × 92) − (0.02 × 36)) / (1 + exp(−4.273 + (0.099 × 92) − (0.02 × 36))) ≈ 0.984.
The fitted probability of passing for this student is almost 1, and so they are almost certain to pass the module.
(b) We wish to calculate p̂0, which we can also do using Equation (12). For this student, bestPrevModScore takes the value 80, age = 60 and qualLink takes the value ‘maths’. So
p̂0 = exp(−4.273 + (0.099 × 80) − (0.02 × 60) − 0.696) / (1 + exp(−4.273 + (0.099 × 80) − (0.02 × 60) − 0.696)) ≈ 0.852.
So, the predicted probability that this student will pass the module is 0.852.

Solution to Activity 21
(a) If the residual deviance is small, the difference between the
log-likelihood of the saturated model and the log-likelihood of the
proposed model is small. Therefore, we are not losing much fit by
using the proposed model in comparison to the more complicated
saturated model. So, since there is not much fit to the data lost by
choosing the proposed model, our proposed model appears to be a
good fit.
(b) On the other hand, if the residual deviance is large, then there is a
noticeable loss in fit to the data when the simpler proposed model is
used. So, in this case, the proposed model does not appear to be a
good fit.


Solution to Activity 22
(a) There is a single α parameter, and q of the βj parameters. So the
number of parameters in the proposed model is q + 1.
(b) In the saturated model, each response Yi is essentially modelled by
the observed value of Yi , therefore the saturated model has n
parameters. In other words, the number of parameters in the
saturated model is equal to the number of observations.
(c) From Equation (14), the degrees of freedom is calculated as
r = number of parameters in the saturated model
− number of parameters in the proposed model.
So, from parts (a) and (b),
r = n − (q + 1).

Solution to Activity 23
(a) There are n = 1796 parameters in the saturated model (since there
are data for 1796 students in the dataset).
For the proposed model, there are four parameters: one intercept
parameter, one regression coefficient for each of the two covariates
(bestPrevModScore and age) and one parameter for the indicator
variable representing level ‘maths’ of qualLink.
Therefore, if the proposed model is a good fit, D ≈ χ2 (r), where
r = 1796 − 4 = 1792.

(b) Since the p-value is approximately 1, which is (very) large, we


conclude that the proposed model is a good fit to the data.

Solution to Activity 24
(a) If D < r, then using Result (15) we know that
D < E(D).
This means that the value of D is smaller than the value of the
expected residual deviance if the model is a good fit, and therefore the
observed value of D will be towards the left-hand side of the χ2 (r)
distribution. In this case, the proposed model is not losing much fit
(see, for example, the illustration given in Figure 16) and so it’s likely
that the model is a good fit.
(b) If D is much larger than r, then, again using Result (15), we know
that D is much larger than E(D), so that D is much larger than we’d
expect the residual deviance to be if the model was a good fit. In this
case, the proposed model is losing too much fit (again, see the
illustration given in Figure 16) and so, this time, it’s likely that the
model is a poor fit.


Solution to Activity 25
(a) For this model, D = 25.221 and r = 26. This means that D ≈ r, so
that D is close to the value we’d expect if the model is a good fit,
suggesting that the model is a good fit.
(b) For this model, D = 275.41 and r = 262. Here, D > r, so that D is
larger than we’d expect if the model is a good fit. It’s therefore
possible that the model is a poor fit, but it’s also possible that D isn’t
large enough for us to conclude that the model is a poor fit. So, in
order to make a decision regarding this model’s fit, we’d need to
calculate the p-value.

Solution to Activity 26
A small value of the null deviance means that not much fit is lost by using
the null model rather than the saturated model. If this is the case, then,
for parsimony, the data can be adequately described simply by their mean.
On the other hand, a large value of the null deviance means that a lot of
fit is lost by using the null model, and so the data need to be described by
a more complicated model than the null model.

Solution to Activity 27
Since the null deviance is the residual deviance for the null model, the null
model plays the role of the ‘proposed model’ in the residual deviance
formula. Therefore, if the null model is a good fit, then
null deviance ≈ χ²(r),
where (from Equation (14))
r = number of parameters in the saturated model
− number of parameters in the proposed model.

Now, the number of parameters in the saturated model is n, while the null
model has just one parameter (the intercept). Therefore, the degrees of
freedom for the χ2 distribution associated with the null deviance is n − 1.


Solution to Activity 28
(a) If the null model is a good fit, then, since there are data for 435 patients,
null deviance ≈ χ²(n − 1) = χ²(434).
Now, we know that the mean of a χ² distribution is equal to its
degrees of freedom. Therefore
E(null deviance) ≈ 434.
But we’ve observed the null deviance to be 525.39, which means that
the null deviance is larger than expected. As such, the null deviance
may be large enough to suggest that the null model is a poor fit (that
is, the data can’t be adequately described by the null model). To be
sure, we need to calculate the p-value associated with the null
deviance.
(b) The p-value is very small, and so we’d conclude that the null model is
a poor fit to the data, and so the data cannot be adequately described
by the null model.
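The p-value in part (b) comes from comparing the observed null deviance with the χ²(434) distribution. A minimal R sketch using the values quoted above:

  pchisq(525.39, df = 434, lower.tail = FALSE)   # very small, so the null model is a poor fit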

Solution to Activity 29
Adding extra parameters into a model improves the model fit, and so the
larger model with more parameters (M2 ) must provide a better fit to the
data than the smaller model with fewer parameters (M1 ). As such, the fit
to the data lost due to M1 will be greater than the fit to the data lost due
to M2 . A greater loss in fit relates to a larger residual deviance, and so
D(M1 ) will be larger than D(M2 ).

Solution to Activity 30
(a) If the deviance difference is small, then there is not much fit lost when
using the smaller model M1 , which has fewer parameters. Therefore,
for parsimony, it would be wise to choose the smaller model M1 .
(b) On the other hand, if the deviance difference is large, then there is a
noticeable loss in fit when using the simpler model M1 – or
equivalently, there is a noticeable gain in fit when the extra
parameters of the larger model M2 are included. In this case, to avoid
losing too much fit, it is worth including these extra parameters and it
would be wise to choose the larger model M2 .


Solution to Activity 31
(a) Model M1 is nested within M2 , so
deviance difference = D(M1 ) − D(M2 )
= 588.82 − 586.54 = 2.28.

(b) Model M2 has one more parameter than M1 (for the covariate age),
and so this deviance difference is approximately distributed as χ²(1).
We can also calculate the degrees of freedom as the difference between
the deviance degrees of freedom for M1 and M2 , namely
1794 − 1793 = 1.

(c) The p-value associated with the deviance difference is large
(p = 0.131), so there is no evidence to suggest that
there is a significant gain in fit by including age in the model, in
addition to bestPrevModScore. So, for parsimony, the smaller model
M1 is preferable.
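These calculations can be reproduced in R. A minimal sketch using the deviances quoted above; alternatively, if M1 and M2 are fitted glm objects, anova(M1, M2, test = "Chisq") performs the same test in one call:

  dev.diff <- 588.82 - 586.54                          # deviance difference = 2.28
  df.diff <- 1794 - 1793                               # degrees of freedom = 1
  pchisq(dev.diff, df = df.diff, lower.tail = FALSE)   # p = 0.131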

Solution to Activity 32
Label the null model as M1 and the proposed model as M2 . Then
deviance difference = D(M1 ) − D(M2 )
= null deviance − D
= 678.51 − 678.30 = 0.21.
The degrees of freedom associated with this deviance difference is the
difference between the degrees of freedom of the χ² null distributions for
D(M1) and D(M2), namely
1795 − 1794 = 1.
Here, the deviance difference (0.21) is quite a bit smaller than the degrees
of freedom (1), which means that the deviance difference is smaller than
the expected value, and therefore there is no evidence to suggest that M2
is better than M1 ; that is, there is no evidence to suggest that age is useful
for modelling modResult. (Note, however, that this doesn’t mean that age
won’t be useful for modelling modResult in combination with other
explanatory variables.)

Solution to Activity 33
Model M2 is better, since it has the lower AIC.


Solution to Activity 34
In this plot, the two ‘lines’ of points can be clearly seen. There is some
curvature in the smoothed red line, which might suggest that the link
function may not be appropriate, or a higher-order function of
averageWage may be needed in the model. However, the curvature may be
due to the fact that the dataset is only small.

Solution to Activity 35
The positive standardised deviance residuals appear to have a larger
scatter than the negative ones do, although there doesn’t seem to be any
particular pattern to the standardised deviance residuals across the index.
As such, this plot doesn’t indicate any serious problems with the
independence assumption.

Solution to Activity 36
This plot allows us to easily compare the relative spreads for the
standardised deviance residuals for Yi = 1 and Yi = 0. The plot merely
confirms what we noted in Activity 35: namely, that one of the groups of
standardised deviance residuals appears to have a larger scatter than the
other group does, but there doesn’t appear to be a pattern to the
standardised deviance residuals across the index.

Solution to Activity 37
If the logistic regression model assumptions hold, then the standardised
deviance residuals should be approximately distributed as N (0, 1), so that
the points in Figure 25 should lie roughly along the straight line. However,
many of the points in the normal probability plot do not lie close to the
line, and the first ‘line’ seems to systematically deviate away from the line
in the middle of the plot. As such, this plot raises concerns that there
could be a problem with using the logistic regression model for these data.

Unit 7
Regression for other response variables
Introduction
As mentioned in the introduction to Unit 6, in Units 6 to 8 of this module
we’ll develop statistical models – known collectively as generalised linear
models, or GLMs for short – for modelling non-normal response variables.
In Unit 6, we focused on one particular type of generalised linear model for
modelling binary response variables: the logistic regression model. In this
unit, we shall consider modelling some other non-normal response
variables, and generalised linear models will be introduced more formally.

How Unit 7 relates to the module so far
Moving on from regression with a normal response variable (Unit 4) and regression with a binary response variable (Unit 6) to . . . regression for responses with other distributions (including Poisson, exponential, binomial).

We’ll start in Section 1 by setting the scene for this unit. We’ll focus in
particular on a dataset with a count response variable, discussing why
linear regression isn’t ideal for modelling these data and which non-normal
distribution might be more suitable for the response.
Although generalised linear models are used for modelling non-normal
response variables, it turns out that the linear regression model (with its
assumed normal distribution for the response) is also a generalised linear
model. As such, linear regression models and logistic regression models
have some features in common which we can use as a framework for
building a model form suitable for modelling both normal and binary
response variables. We’ll see how this model form can also be used to
model count response variables; indeed, the model form can be used for
many non-normal responses and provides the basis for the generalised
linear model. Developing these ideas is the focus of Section 2.
The generalised linear model is formally defined in Section 3. In that
section, we’ll focus on using generalised linear models for modelling
responses with normal, Bernoulli and Poisson distributions only.
Generalised linear models are then used for modelling responses with
exponential and binomial distributions in Section 4. Assessing model fit
and choosing a GLM are the subject of Section 5, while the focus of
Section 6 is on checking the GLM model assumptions. The unit rounds off
with Section 7 by considering two common issues which can arise when
using generalised linear models in practice.


The following route map illustrates how the sections fit together for this
unit.

The Unit 7 route map
• Section 1: Setting the scene
• Section 2: Building a model
• Section 3: The generalised linear model (GLM)
• Section 4: GLMs for two more response variable distributions
• Section 5: Assessing model fit and choosing a GLM
• Section 6: Checking the GLM model assumptions
• Section 7: Common issues in practice

Note that you will need to switch between the written unit and your
computer for Subsections 3.5, 4.3, 5.3, 6.3 and 7.3.2.

1 Setting the scene


We’ll start this section with a look at some of the non-normal response
variables of interest in this unit in Subsection 1.1. We’ll then look more
closely at the specific problem of modelling one of these response variables
in Subsection 1.2.

1.1 Some non-normal response variables


In linear regression, the response variables are assumed to be continuous,
normally distributed random variables which can take any values in the
range −∞ to +∞. In this module, we have already seen that response
variables of this type are encountered in many application areas of
statistical modelling. In Unit 6, however, we saw that linear regression is
usually not very useful in applications where the response variables are
binary: in this case, logistic regression is more appropriate. There are
other types of response variables which arise in a number of practical
situations for which linear regression might also not be useful. Three
particular situations are discussed in Activity 1.

Activity 1 Is linear regression appropriate?

For each of the following response variables, explain why linear regression
might not be appropriate in a statistical analysis.
(a) The waiting times between the occurrence of serious earthquakes
worldwide.
(b) The number of traffic accidents per year at a particular road junction.
(c) The number of insects surviving out of N insects exposed to an insecticide.

In Activity 1, we met three scenarios where linear regression may not be


appropriate. In earlier units, when there were problems with the
distribution assumptions for the response, we tried transforming the
response variable so that the assumptions seemed more plausible. So,
could we do the same here?


Well, for the waiting times between earthquakes mentioned in part (a) of
Activity 1, transforming the waiting times might well be an option. For
example, since the waiting times are positive and continuous, it would be
sensible to take logs of the waiting times so that the transformed values
are on a continuous scale between −∞ and +∞. However, it’s not clear
how the non-negative discrete count data in part (b) of Activity 1, nor the
non-negative integers between 0 and N in part (c) of Activity 1, might be
transformed to a continuous scale between −∞ and +∞. So, we need to
find an alternative way forwards.
As an alternative to transforming the response, we shall use a general
model which is suitable for modelling many non-normal responses, and
indeed, also normal responses; we shall introduce a framework for building
such a model in Section 2. But first, for the rest of this section we’ll look
more closely at the specific problem of modelling responses which are
counts: a situation where transforming the response so that we can use
linear regression is not an ideal option.

1.2 A closer look at count response


variables
Count response variables arise in many situations: examples include the
number of passengers per flight who miss the departure of their booked
flight, the number of computer virus infections per year in a company, and
the scenario that we considered in part (b) of Activity 1 – namely, the
number of traffic accidents per year at a particular road junction.
To help us to get a feel for what a dataset with a count response looks like,
we’ll focus on one particular dataset involving data from a survey
conducted on households in the Philippines; the dataset is described next.


Survey data from households in the Philippines


The Republic of the Philippines, with a population of 100.98 million
in 2015, is an archipelago in Southeast Asia consisting of 7641 islands.
The country has 17 regions; Figure 1 shows a map of the Philippines
with the different regions.

Figure 1 A map of the Philippines and its 17 regions (in different colours)

The Philippines Statistics Authority conducts a nationwide survey


every three years called the Family Income and Expenditure Survey
(FIES). The survey provides a wealth of information on households in
the Philippines, such as a household’s family size, their income and
expenditure, as well as information regarding their dwelling, such as
how many bedrooms there are and what type of roof their dwelling
has.


The Philippines survey dataset (philippines)


The data we will be using in this unit are a random sample of 1500
households taken from the 2015 FIES for the five regions in the
Philippines with the most household members, namely the regions
Calabarzon, National Capital Region, Central Luzon, Western Visayas
and Central Visayas.
This dataset contains data for the five selected regions on the
following variables:
• familySize: the total number of family members, excluding the
head of the household
• employed: the total number of family members employed
• bedrooms: the number of bedrooms in the household
• age: the age of the head of the household
• income: the total annual income of the household (in Philippine
pesos (PHP))
• expend: the total annual expenditure on food of the household (in
PHP)
• bread: the total annual expenditure on bread and cereals of the
household (in PHP)
• roof: the roof type in the household, classified into six categories
according to the material the roof is made of: mixed strong, mixed
light, salvaged, strong, light or not applicable
• region: the name of the region.
The data for the first five observations from the Philippines survey
dataset are shown in Table 1.
Table 1 First five observations from philippines

familySize employed bedrooms age income expend bread


3 1 1 27 109 447 57 549 22 813
1 0 3 53 284 183 83 425 13 655
4 2 2 72 197 586 78 618 17 307
3 0 2 84 154 811 97 541 27 657
3 2 1 51 158 262 80 211 23 508

roof region
mixed strong central luzon
strong central luzon
strong central luzon
strong central luzon
strong central luzon
Source: Flores, 2017, accessed 26 June 2022


In this unit, we’ll take familySize as our response variable, and to keep
things simple, for now we’ll just consider one of the possible explanatory
variables, age. Then a question of interest in the analysis of the Philippines
survey dataset is whether the age of the family head helps to explain
family size. Can we use regression to help answer this question? (We’ll be
using the rest of the possible explanatory variables later in the unit.)
We’ll start by investigating (in Activity 2) whether linear regression might
be useful for modelling familySize.

Activity 2 Linear regression for familySize?

The linear regression model


familySize ∼ age
was fitted to data from the Philippines survey dataset. The resulting
normal probability plot for the fitted model is given in Figure 2.
(The largest household family size in the Philippines survey dataset is 15.)
Figure 2 Normal probability plot (standardised residuals against theoretical quantiles) for the fitted linear regression model familySize ∼ age
(a) By considering Figure 2, explain why the linear regression normality
assumption is questionable for these data.


(b) Figure 3 shows the relative frequency bar chart of the response
familySize from the Philippines survey dataset. Because regression
assumes that the response variable is influenced by the explanatory
variable(s), we wouldn’t expect a plot of the relative frequencies of
familySize to necessarily look like a typical plot of the distribution
being used to model the response. However, Figure 3 does highlight
some of the issues which make the assumption of normality for the
response variable familySize less than ideal for these data.
By considering Figure 3, explain why it is unlikely that the
assumption of a normal response will be ideal for these data.

Figure 3 A relative frequency bar chart of familySize

As we saw in Activity 2, assuming that the response count variable


familySize has a normal distribution is problematic. So, what
distribution might work better? For count data such as these, the Poisson
distribution is often used. A summary of the Poisson distribution is given
in Box 1. (Note that the Poisson distribution has a capital ‘P’ because the
distribution was named after the French mathematician Siméon Denis
Poisson.)


Box 1 The Poisson distribution


The random variable Y is said to have a Poisson distribution with
parameter λ, where λ > 0, if it has probability mass function (p.m.f.)
P(Y = y) = λ^y e^(−λ)/y!, where y = 0, 1, . . . .
This is written Y ∼ Poisson(λ). The Poisson distribution is often used
for count variables.
When λ < 1, the Poisson p.m.f. is decreasing and zero is the most
likely value of Y ; see Figure 4(a) for an example. When λ > 1, the
Poisson p.m.f. has an up-then-down shape; see Figure 4(b) for an
example.

Figure 4 Two Poisson(λ) p.m.f.s (P(Y = y) against y) with (a) λ = 0.5 and (b) λ = 3


The mean and the variance of a Poisson(λ) random variable are both
equal to the parameter λ, so that for Y ∼ Poisson(λ),
E(Y ) = V (Y ) = λ.
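The Poisson p.m.f., and the fact that its mean and variance are both λ, are easy to explore numerically. A minimal R sketch reproducing the two cases plotted in Figure 4:

  dpois(0:4, lambda = 0.5)   # decreasing p.m.f.: zero is the most likely value
  dpois(0:4, lambda = 3)     # up-then-down shape, peaking near lambda
  # Mean and variance both equal lambda (checked here by simulation)
  y <- rpois(100000, lambda = 3)
  mean(y); var(y)            # both approximately 3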

In the next activity, we shall investigate whether a Poisson distribution


might be a suitable distribution for the response familySize from the
Philippines survey dataset.

Activity 3 A Poisson distribution for familySize?

Figure 5 shows a side-by-side bar chart of the observed relative frequencies


for familySize from the Philippines survey dataset together with the
expected relative frequencies assuming a Poisson distribution fitted to
these data. As mentioned in part (b) of Activity 2, a plot of the relative
frequencies of familySize won’t necessarily look like a typical plot of the
distribution being used to model the response, but Figure 5 is helpful for
seeing whether it might be sensible to assume a Poisson distribution for
the response.

Figure 5 A side-by-side bar chart of the observed relative frequencies for familySize and the expected relative frequencies assuming a Poisson distribution fitted to the same data
By considering Figure 5, would the Poisson distribution or the normal
distribution be better as the distribution for the response familySize?


From Activities 2 and 3, it looks like a Poisson distribution is more


appropriate than a normal distribution for the count response variable
familySize. It would therefore be useful to have a regression model which
could accommodate a Poisson distribution for the response.
Now, so far in this module, we’ve seen examples of response variables with
a normal distribution (in Units 1 to 5), response variables with a Bernoulli
distribution (in Unit 6), and now we have just seen a response variable
with a Poisson distribution. There are, however, many other distributions
that response variables can take. So, rather than developing a different
regression model for each of these distributions separately, what we’d really
like is a general regression model which can accommodate a variety of
possible distributions – both normal and non-normal – which can be
assumed for response variables. We shall develop a framework for such a
model next.

2 Building a model
In this unit, we’d like to develop a regression model which can
accommodate a variety of different distributions for the response.
So far in this module, we have used linear regression for modelling
responses which are assumed to follow a normal distribution, and logistic
regression for modelling responses which are assumed to follow a Bernoulli
distribution. So, in order to build a general model which can cope with
responses from many different distributions, we shall start by considering
the similarities between linear regression and logistic regression, so that we
can build a unified model form which can represent both linear regression
and logistic regression. We’ll then see that this unified model form can in
fact be used to model responses from a whole host of different distributions.


The strategy that we’ll use for building a model capable of accommodating
a variety of different distributions for the response is represented in
Figure 6.

Figure 6 Illustration of a strategy for building a model for responses from a variety of distributions: the similarities between linear regression (for a normal response) and logistic regression (for a binary response) lead to a general model for responses from various distributions, including the normal, Bernoulli, Poisson and exponential

We’ll start in Subsection 2.1 by discussing the similarities between linear


regression and logistic regression, and we’ll use these similarities to specify
a unified model form which can represent both linear regression and logistic
regression. This model form will then be used to build a regression model
for a response with a Poisson distribution in Subsection 2.2, and forms the
basis for the generalised linear model (to be presented in Section 3) which
can accommodate both normal and non-normal response variables.
To help keep things simple, in this section we’ll restrict attention to the
situation in which there is just the single covariate explanatory variable x
to model the response Y .

2.1 A unified model form for linear regression and logistic regression
Consider the response Y and the covariate x. Let’s start by specifying a
linear regression model for Y , and then a logistic regression model for Y ,
so that we can clearly see any similarities between the two models.
• A linear regression model for Y is given, for i = 1, 2, . . . , n, by
Yi = α + βxi + Wi,   Wi ∼ N(0, σ²).
Then
Yi ∼ N(µi, σ²)   (1)


where, for known xi ,


E(Yi ) = α + βxi . (2)

• A logistic regression model for Y is given, for i = 1, 2, . . . , n, by
Yi ∼ Bernoulli(pi)   (3)
and
logit(pi) = log(pi/(1 − pi)) = α + βxi.
Then, since E(Yi) = pi, we have that
log(E(Yi)/(1 − E(Yi))) = α + βxi.   (4)
We have, in fact, considered the similarities and differences between
Equations (2) and (4) already in Activity 11 in Unit 6. We’ll consider
these again briefly in Activity 4, before we discuss a further similarity
between the two models which we haven’t yet considered.

Activity 4 Any similarities and differences?

In what way(s) are Equations (2) and (4) similar, and in what way(s) are
they different?

Following on from Activity 4, we can make the following observations.


• In linear regression
E(Yi ) = α + βxi .

• In logistic regression
g(E(Yi )) = α + βxi ,
where g is the logit function such that
 
g(E(Yi)) = log(E(Yi)/(1 − E(Yi))).

So, Equations (2) and (4) would have exactly the same form if we had a
function g(E(Yi )) instead of E(Yi ) in Equation (2) for linear regression;
we’ll consider which function g could be used for linear regression in
Activity 5.

Activity 5 Which function for linear regression?


For simple linear regression, suggest a function g for which
g(E(Yi )) = α + βxi .


Linear regression and logistic regression also have another feature in


common. To see this, consider the following observations regarding the two
models.
• In linear regression, we have (from Model (1)) that, for i = 1, 2, . . . , n,
Yi ∼ N (µi , σ 2 ),
or equivalently
Yi ∼ N (E(Yi ), σ 2 ),
so that Y1 , Y2 , . . . , Yn each has a normal distribution, but with different
means.
• In logistic regression, we have (from Model (3)) that, for i = 1, 2, . . . , n,
Yi ∼ Bernoulli(pi )
so that Y1 , Y2 , . . . , Yn each has a Bernoulli distribution, but with
different success probabilities. However, since E(Yi ) = pi , this
equivalently means that Y1 , Y2 , . . . , Yn each has a Bernoulli distribution,
but with different means.
So, for both linear regression and logistic regression, the responses being
modelled – Y1 , Y2 , . . . , Yn – each have the same distribution, but with
different means.
We can bring these similarities together to specify a unified model form
which can represent both linear regression and logistic regression. This is
summarised in Box 2.

Box 2 A unified model form for linear regression and logistic regression
• Y1 , Y2 , . . . , Yn each has the same distribution but with different
means.
◦ In linear regression, Y1 , Y2 , . . . , Yn all have normal distributions,
but each Yi has a different mean µi (= E(Yi )) (but a common
variance σ²):
Yi ∼ N(µi, σ²).

◦ In logistic regression, Y1 , Y2 , . . . , Yn all have Bernoulli


distributions, but each Yi has a different mean pi (= E(Yi )):
Yi ∼ Bernoulli(pi ).

• For each Yi , i = 1, 2, . . . , n, and covariate x, the regression equation


has the form
g(E(Yi )) = α + βxi
for some function g.
◦ In linear regression, g is the identity function.
◦ In logistic regression, g is the logit function.


In Unit 6, the logit function for logistic regression was referred to as the
logit link function. The reason for the term ‘link function’ is that the
function logit (E(Yi )) links the mean of the response E(Yi ) (which, in the
case of a binary response, is the success probability pi ) to the linear
component of the model (that is, α + βxi ). Similarly, in linear regression,
the identity function is also a link function, because it links the mean of
the response E(Yi ) to the linear component of the model. The idea of link
functions is summarised in Box 3.

Box 3 Link functions


A link function is a function which, for i = 1, 2, . . . , n, links the
mean response E(Yi ) to the linear component of the model.

Now that we have a model form which can accommodate both linear
regression and logistic regression, we shall see in the next subsection how
we can also use the same model form when we have a response from a
Poisson distribution by using a different link function.

2.2 Using the model form for count responses
In this subsection, we’ll see that the model form which can represent both
linear regression and logistic regression, given in Box 2, can also be used as
a model form for a count response. In doing so, we’ll focus on the
Philippines survey dataset introduced in Section 1, taking our response
variable to be
• familySize: the total number of family members, excluding the head of
the household
and our covariate to be
• age: the age of the head of the household.
As usual, the first step when building a regression model is to explore the
relationship between the response and the explanatory variable visually.
When the explanatory variable is a covariate, we would usually use a
scatterplot of the response and the explanatory variable to do this.
However, when the sample size is large, and the response and the
explanatory variable only take a limited number of possible values, it can
be difficult to use a scatterplot to visually determine clear relationships
between the variables.
This is the case in the Philippines survey dataset: the sample size is 1500,
and for each value of the explanatory variable age (which takes integer
values between 18 and 93), there are multiple values of the response
familySize. For example, there are 30 families in the dataset for which
the age of the head of the household is 40.


Because there is only a limited number of (integer) values that the


response familySize can take, plotting all 30 of these data points in a
scatterplot will result in many data points plotted on top of each other –
for example, all families of size 3 whose head is aged 40 would be plotted
at the same point. As a result, it is not easy to clearly see any relationship
between the two variables using a scatterplot – as can be seen in Figure 7.

Figure 7 Scatterplot of familySize (family size, excluding household head) against age of household head (years)

One way around this problem, which is commonly used in regression


analysis, is to plot the sample mean of all the values of the response for
which the value of the explanatory variable is the same, instead of plotting
all of the individual values. For example, instead of plotting all 30 values
of familySize for those households whose head is aged 40, we could just
plot the sample mean of the 30 familySize values (which equals 3.53)
against age = 40 in a single data point. So, we can calculate the sample
mean of familySize for each value of age (that is, calculate the sample
mean of familySize for all households where age = 18, then calculate the
sample mean of familySize for all households where age = 19, and so
on), and then plot these sample means against age.
Although by doing this we lose any sense of the variability of the response,
it will greatly simplify the scatterplot and help us to see any relationship
between the variables more clearly. What’s more, as we have just seen in
this section, in regression we model the relationship between E(Yi ) and a
linear combination of the explanatory variable(s), and so plotting the
sample means for each x value is a sensible thing to do in order to get a
feel for the relationship.
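A minimal R sketch of this calculation, assuming the data are held in a data frame called philippines (matching the dataset name above) with columns familySize and age:

  means.by.age <- aggregate(familySize ~ age, data = philippines, FUN = mean)
  plot(means.by.age$age, means.by.age$familySize,
       xlab = "Age of household head (years)",
       ylab = "Sample mean of familySize")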


A scatterplot of the sample means of familySize, calculated for each individual age, is plotted against age in Figure 8.
Figure 8 Scatterplot of sample means of familySize against age

The pattern of points in Figure 8 looks quadratic. There are ways that we
could model the relationship as such, but in this unit, we are focusing on
modelling linear relationships, and so instead we shall think in terms of
straight line segments. Here, we’ll focus on three possible line segments:
the line for values of age up to 40, the line for values of age between 40
and 80, and the line for ages over 80. These values have been chosen fairly
arbitrarily to coincide with the straight line segments which are roughly
apparent in Figure 8. We’ll consider these line segments in the next
activity.

Activity 6 Straight line segments for different ages

Based on Figure 8, how does the linear relationship between the sample
mean of familySize and age for age < 40 differ from the linear
relationship between the sample mean of familySize and age for
40 ≤ age < 80 and for age ≥ 80?


Following on from Activity 6, let’s just consider the households for which
age is in the range from 40 to 80 years, so that we can focus on only one of
the linear relationships visible in Figure 8. This reduced dataset is
described next.

The Philippines 40 to 80 dataset (philippines40to80)


This dataset is a subset of the Philippines survey dataset given in
Subsection 1.2. The variables in the Philippines 40 to 80 dataset are
the same as those in the Philippines survey dataset, but the
Philippines 40 to 80 dataset only includes data for households where
the household head is aged 40 to 80 years.
The first five observations in the Philippines 40 to 80 dataset are
given in Table 2.
Table 2 First five observations from philippines40to80

familySize employed bedrooms age income expend bread


1 0 3 53 284 183 83 425 13 655
4 2 2 72 197 586 78 618 17 307
3 2 1 51 158 262 80 211 23 508
2 2 1 69 210 784 99 007 24 822
0 0 2 66 89 748 63 253 16 685

roof region
strong central luzon
strong central luzon
strong central luzon
strong central luzon
strong central luzon
Source: Flores, 2017, accessed 26 June 2022

A scatterplot of the sample means of familySize against age for


households in the Philippines 40 to 80 dataset is shown in Figure 9.


Figure 9 Scatterplot of sample means of familySize against age for households in the Philippines 40 to 80 dataset

As already discussed in Section 1, the Poisson distribution is usually


adopted for modelling count variables such as the response variable
familySize. So, let’s see if we can define a model for familySize from
the Philippines 40 to 80 dataset using a Poisson distribution in the model
form specified for linear regression and logistic regression given in Box 2.
First, let’s denote the response familySize by Y and the covariate age by
x, to match up with the notation used in Box 2. Then, from Box 2:
• Y1 , Y2 , . . . , Yn all have Poisson distributions, where each Yi has a
different mean.
But, from Box 1, we know that if Y ∼ Poisson(λ), then E(Y ) = λ.
Therefore, equivalently we have that Y1 , Y2 , . . . , Yn all have Poisson
distributions, where each Yi has a different parameter λi > 0, so that
Yi ∼ Poisson(λi ).

• For each Yi , i = 1, 2, . . . , n, the regression equation has the form:


g(E(Yi )) = α + βxi ,
where g is a link function which relates g(E(Yi )) linearly with xi (for
values of xi from 40 to 80).


Now, Figure 9, which showed a scatterplot of the sample means of


familySize against age for households in the Philippines 40 to 80 dataset,
suggests that there is actually a fairly good linear relationship between the
mean of familySize, E(Yi ), and age, xi . However, there is the potential
problem that E(Yi ) must be greater than zero (since E(Yi ) = λi > 0),
whereas there is no such constraint on α + βxi . So, if we can find a link
function g which makes sure that E(Yi ) > 0 and for which there is a linear
relationship between g(E(Yi )) and xi , then it would be better to use this
link function so that we can’t predict negative values of E(Yi ).
One possible link function g which will ensure that E(Yi) > 0 is the log
function, so that g(E(Yi)) = log(E(Yi)) (taking logs to base e).
So, the next question is whether it is reasonable to assume that there is a
linear relationship between log(E(Yi )) (that is, the log of the mean of
familySize) and xi , the explanatory variable age. We shall address this
question in the next activity.

Activity 7 Log function as the link function?

Figure 10 shows a scatterplot of the logs of the sample means of


familySize against age (for ages from 40 to 80 years). Based on
Figure 10, does it seem reasonable to assume that there is a linear
relationship between log(E(Yi )) and xi ?

Figure 10 Scatterplot of the logs of the sample means of familySize against age, for households in the Philippines 40 to 80 dataset


Activity 7 concluded that it seems reasonable to assume a linear


relationship between log(E(Yi )) and xi for the response variable
familySize and the covariate age. So, a possible link function g for these
data which ensures that the predicted values of familySize cannot be
negative, but which also gives a ‘roughly’ linear relationship between
g(E(Yi )) and xi , is the log function.
This is often true in general, and the log function is indeed a possible link
function when the response has a Poisson distribution. This model for a
response with a Poisson distribution is commonly known as Poisson
regression. Poisson regression is summarised in Box 4.

Box 4 Poisson regression


• The responses Y1 , Y2 , . . . , Yn all have Poisson distributions, but each
Yi has a different parameter λi > 0, so that, for i = 1, 2, . . . , n,
Yi ∼ Poisson(λi ).

• For each Yi , i = 1, 2, . . . , n, and covariate xi , the regression equation


has the form
g(E(Yi )) = α + βxi ,
where g is a link function.
One such link function is the log function so that
g(E(Yi )) = log(E(Yi )).
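Although fitting GLMs in R is the subject of Subsection 3.5, it may help to see now how a Poisson regression of this kind is specified. A minimal sketch, assuming the reduced data are in a data frame called philippines40to80 (the dataset name given above):

  fit <- glm(familySize ~ age, data = philippines40to80,
             family = poisson(link = "log"))
  summary(fit)

The family argument specifies both the assumed response distribution (Poisson) and the link function (log).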

So, we have used the unified model form representing both linear
regression and logistic regression to define Poisson regression (where the
response has a Poisson distribution). This model form can, in fact, be used
to specify regression models for responses from a whole host of
distributions, and it forms the basis of the generalised linear model. We
shall look at the generalised linear model more closely in the next section.

3 The generalised linear model (GLM)

This section formally defines and explores the generalised linear model, which is commonly known as the GLM for short. The model has linear regression, logistic regression and Poisson regression as special cases, and as such, a GLM can model responses with normal, Bernoulli and Poisson distributions. The applications for which the generalised linear model can be used, however, go much further than modelling responses from these three distributions, since the GLM can model responses from many other distributions. (So far, we have three response variable distributions that we can put in our GLM toolbox!)


We’ll start the section by formally defining the generalised linear model in
Subsection 3.1. Subsection 3.2 then considers link functions for the model,
and inverse link functions are introduced in Subsection 3.3. Subsection 3.4
looks at how fitted values and predictions are obtained for a GLM. Finally,
in Subsection 3.5, we’ll do some practical work and use GLMs in R.

3.1 The model


In Section 2, we saw how linear regression, logistic regression and Poisson
regression can all be written in a unified model form. Specifically, for
response Y and covariate x, we have the following unified model form:
• The response variables Y1 , Y2 , . . . , Yn all have the same distribution but
each Yi has a different mean.
◦ In linear regression, Y1 , Y2 , . . . , Yn all have a normal distribution, but
the mean µi (= E(Yi )) is different for each i = 1, 2, . . . , n.
◦ In logistic regression, Y1 , Y2 , . . . , Yn all have a Bernoulli distribution,
but the parameter pi (= E(Yi )) is different for each i = 1, 2, . . . , n.
◦ In Poisson regression, Y1 , Y2 , . . . , Yn all have a Poisson distribution,
but the parameter λi (= E(Yi )) is different for each i = 1, 2, . . . , n.
• For each Yi , i = 1, 2, . . . , n, and covariate xi , the regression equation has
the form
g(E(Yi )) = α + βxi
for some link function g.
◦ In linear regression, g is the identity function
g(E(Yi )) = E(Yi ).

◦ In logistic regression, we took g to be the logit function
g(E(Yi)) = log(E(Yi)/(1 − E(Yi))).
◦ In Poisson regression, we took g to be the log function
g(E(Yi )) = log(E(Yi )).

This model form extends naturally to the case where there are multiple
factors A, B, . . . , Z and multiple covariates x1 , x2 , . . . , xq , so that the
regression equation has the general form
g(E(Yi )) = linear combination of the explanatory variables
for observation i.
The linear component of the regression equation is known as the linear
predictor for observation i, and is often denoted by ηi . (η is the Greek
letter eta.) The regression equation can therefore be written more
succinctly as
g(E(Yi )) = ηi .


We’ll look at some linear predictors in Example 1 and Activity 8.

Example 1 Linear predictor for modelling familySize


Consider once again the Philippines 40 to 80 dataset, and take
familySize to be the response Y , and age to be a single covariate x.
In Subsection 2.2, the following regression equation was suggested for
modelling the data:
g(E(Yi )) = α + βxi ,
where xi is the age of the head of the ith household. So, for this
model, the linear predictor is
ηi = α + βxi .

Activity 8 Linear predictor for a logistic regression model

One of the models considered in Activity 31 in Unit 6 for modelling the


binary response modResult from the OU students dataset was the logistic
regression model
modResult ∼ bestPrevModScore + age,
where bestPrevModScore and age are both covariates.
Denoting modResult, bestPrevModScore and age by Y , x1 and x2 ,
respectively, write down the linear predictor for this model.

We are now in a position to specify the generalised linear model, as given


in Box 5.

Box 5 The generalised linear model (GLM)


The relationship between a response variable Y and a set of
explanatory variables follows a generalised linear model, or GLM
for short, if
• Y1 , Y2 , . . . , Yn all have the same distribution, but each Yi has a
different mean
• for i = 1, 2, . . . , n, the regression equation has the form
g(E(Yi )) = ηi ,
where g is a link function and ηi is the linear predictor.


As with all of the models that have been introduced in this module, the
notation can get messy when there are several explanatory variables. So,
when using a GLM to model a response Y with the factors A, B, . . . , Z and
covariates x1 , x2 , . . . , xq as explanatory variables, we’ll continue to use the
simpler notation
y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq .
In doing so, however, we also need to be clear as to which distribution is
being assumed for the response and which link function is being used.
We’ll use the convention that a GLM for a response with a Poisson
distribution will be referred to as ‘a Poisson GLM’, and a GLM for a
response with a Bernoulli distribution will be referred to as ‘a Bernoulli
GLM’, and so on. (Of course, a Poisson GLM is also a Poisson regression
model, and a Bernoulli GLM is also a logistic regression model!) The
notation is illustrated in Example 2.

Example 2 Simpler notation for a GLM for familySize


For the model for familySize from the Philippines 40 to 80 dataset
with the single covariate age, when using a GLM to model
familySize, we can say something along the lines of
‘the model
familySize ∼ age
is a Poisson GLM with a log link’.

For a given dataset, the model parameters in the linear predictor (for
example, α and β in Example 1, and α, β1 and β2 in the solution to
Activity 8) are estimated using the method of maximum likelihood
estimation. We won’t go into the details of the mathematics behind how
the estimates are calculated in this module, since these parameter
estimates are easily obtained in R (as we saw when fitting logistic
regression models in Unit 6).
Once we have our parameter estimates, we can use their values in the
linear predictor to give us the fitted linear predictor, ηb, which is also
commonly referred to as the fitted model. The fitted linear predictor is
illustrated in Example 3 and Activity 9.


Example 3 Fitted linear predictor for modelling familySize
Consider once again the Philippines 40 to 80 dataset and the problem
of modelling the response familySize (Y ) using the covariate age
(x). In Example 1, we saw that the linear predictor for this model is
ηi = α + βxi ,
where xi is the age of the head of the ith household.
A Poisson GLM with a log link was fitted to these data and the
estimates of α and β were calculated to be approximately 1.97 and
−0.01, respectively. Therefore, the fitted linear predictor for this
model is
η̂ = 1.97 − 0.01x,
which can alternatively be written as
η̂ = 1.97 − 0.01 age.
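If the model has been fitted in R (as sketched after Box 4), the estimates forming the fitted linear predictor can be extracted with coef(); a brief sketch, assuming fit is the fitted Poisson GLM:

  coef(fit)   # intercept approx 1.97, coefficient of age approx -0.01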

Activity 9 Fitted linear predictor for modelling modResult

Consider once again the OU students dataset and the logistic regression
model of Activity 8:
modResult ∼ bestPrevModScore + age.
In the solution to that activity, we saw that the linear predictor for this
model is
ηi = α + β1 xi1 + β2 xi2 ,
where xi1 and xi2 are the values of bestPrevModScore and age,
respectively, for the ith student.
Given that the parameter estimates for this model are calculated to be
approximately α̂ = −4.45, β̂1 = 0.09 and β̂2 = −0.01, write down the fitted
linear predictor for this model.

Although we have only considered responses with normal, Bernoulli and


Poisson distributions here (so far!), there are many other distributions
which are possible for the response in a GLM. However, generalised linear
models usually only use distributions for the response which are in the
exponential family of distributions.


The exponential family covers a wide range of practically useful continuous


and discrete distributions, including the normal, Bernoulli, binomial,
Poisson and exponential distributions, amongst others. The advantage of
confining interest to the exponential family of distributions is because then
there is a particularly useful unified theory for estimation, inference and
computational algorithms. (The details of this theory are not covered in
this module.) All of the GLMs in this module involve response variables
with distributions belonging to the exponential family of distributions, and
you will not need to be aware of the general formula for the exponential
family in M348.
Next, we’ll see how a link function is chosen for a GLM.

3.2 Link functions


A link function g establishes a link between the mean response E(Yi ) and
the linear component of the model – that is, the linear predictor ηi – so
that
g(E(Yi )) = ηi .
This is illustrated in Figure 11.

Figure 11 The link function g provides the link between the mean response E(Yi) and the linear predictor ηi, via g(E(Yi)) = ηi

So, what properties do we want a link function g to have?


Well, since
g(E(Yi )) = ηi ,
we’d like g to transform E(Yi ) so that there is a linear relationship
between g(E(Yi )) and the explanatory variables. This is illustrated in
Figure 12 and Example 4.

Figure 12 The link function g turns the relationship between E(Yi) and the explanatory variables into a linear relationship between g(E(Yi)) and the explanatory variables


Example 4 A linear relationship with a logit link


Consider once again the OU students dataset. We’ll again take the
binary variable modResult to be our response Y , but this time we’ll
only use the covariate bestPrevModScore as an explanatory
variable x.
Suppose that we’d like to fit the model
modResult ∼ bestPrevModScore
using a Bernoulli GLM with a logit link. In this case, we have the
regression equation
logit(E(Yi )) = α + βxi ,
where xi is the value of bestPrevModScore for the ith student. This
regression equation means that there should be a (roughly) linear
relationship between logit(E(Yi )) and xi . Is this a reasonable
assumption for these data?
Now, E(Yi ) is the probability that the ith student passes the module
(that is, the success probability for the ith student). These
probabilities vary from student to student, and we don’t know the
true values of them. For our model here, however, we do expect the
probability that the ith student passes to depend on their value of
bestPrevModScore. So, we can estimate the probability that the ith
student passes – and hence E(Yi ) – by using the observed proportion
of students who passed with the same value of bestPrevModScore as
the ith student’s value.
However, the explanatory variable bestPrevModScore is continuous,
and so it is unlikely that many students in the dataset will have
exactly the same value of bestPrevModScore as the ith student. We
had a similar situation in Activity 6 of Unit 6 where we had a
continuous explanatory variable (logArea). So, following the methods
presented in that activity, we’ll partition the values of
bestPrevModScore into the intervals 45 to 50, 50 to 55, . . . , 95 to
100, look at the observed proportions of students passing with values
of bestPrevModScore within each interval, and then use these
proportions as estimates of the probability of passing – and hence
E(Yi ) – for the different possible values of bestPrevModScore.
We can now use these estimates of E(Yi ) to investigate the
relationship between logit(E(Yi )) and xi for these data. A scatterplot
of logit(E(Yi )) for each of the estimated values of E(Yi ) plotted
against the midpoints of the corresponding intervals for
bestPrevModScore is given in Figure 13. From this plot, it seems that
the relationship between logit(E(Yi )) and xi does indeed seem to be
roughly linear for these data.


Figure 13 Scatterplot of estimates of logit(E(Yi)) (the logit of the observed proportion passing) plotted at the midpoint of each corresponding bestPrevModScore interval
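A plot like Figure 13 can be constructed by binning the covariate and computing empirical logits. A minimal R sketch, assuming the data are in a data frame called ouStudents (an illustrative name) with a 0/1 response column modResult:

  bins <- cut(ouStudents$bestPrevModScore, breaks = seq(45, 100, by = 5))
  prop <- tapply(ouStudents$modResult, bins, mean)   # proportion passing per bin
  plot(seq(47.5, 97.5, by = 5), log(prop / (1 - prop)),
       xlab = "Best previous module score",
       ylab = "logit(observed proportion passing)")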

Next, in Activity 10, we’ll consider the log link for a Poisson GLM.

Activity 10 Log link for a GLM for familySize

Consider once again the data from the Philippines 40 to 80 dataset.


Suppose that we’d like to fit the model
familySize ∼ age
using a Poisson GLM with a log link.
Using Figure 10 in Activity 7, explain why it seems reasonable to assume a
linear relationship between g(E(Yi )) and xi for these data and this model.

So, we’d like our link function g to transform E(Yi ) so that the
relationship between g(E(Yi )) and the explanatory variables is linear. But
there’s also a second property that we’d like a link function g to have. The
range of possible values that E(Yi ) can take is often restricted (for
example, E(Yi ) might always be positive). However, there are no such
restrictions on the value that ηi can take.


Therefore, since
g(E(Yi )) = ηi ,
we need a link function which allows g(E(Yi )) to take any value. This is
illustrated in Figure 14 and Example 5.

Figure 14 The link function g transforms E(Yi), which takes restricted values, into g(E(Yi)), which can take any value

Example 5 Logit link to tackle restricted values of E(Yi )


Consider once again the OU students dataset. Following on from
Example 4, we’ll again take the binary variable modResult to be our
response Y , and the covariate bestPrevModScore as an explanatory
variable x.
In this case, the linear predictor for a GLM for modResult is
ηi = α + βxi ,
where xi is the value of bestPrevModScore for the ith student.
Now, E(Yi ) is the probability that the ith student passes the module,
and as such, E(Yi ) is restricted to lie between 0 and 1. However, it is
possible for the value of α + βxi to take a value which does not lie
between 0 and 1.
To see this, suppose, for example, that the value of
bestPrevModScore for the ith student is 0, so that xi = 0. (Although
this scenario is unlikely, it is possible.) In this case, if α < 0, then
ηi = α + βxi < 0
and if α > 1, then
ηi = α + βxi > 1.
So, we need a link function g which transforms E(Yi ) (which is
restricted to taking values between 0 and 1 only) to g(E(Yi )), which
can take any value between −∞ and +∞ to match the possible values
that can be taken by ηi .
The logit link function fulfils that role, since
log(E(Yi)/(1 − E(Yi)))
can take any value between −∞ and +∞.


The next activity looks at the effect of the log link function on the values
that E(Yi ) can take in a Poisson GLM.

Activity 11 Effect of the log link function

A GLM with a response Yi ∼ Poisson(λi ), for λi > 0, i = 1, 2, . . . , n, and


linear predictor ηi , uses the log link. For this model, what possible values
can the following take?
(a) The response Yi .
(b) The mean response E(Yi ).
(c) The transformed value g(E(Yi )).

One consequence of using GLMs for responses with distributions which are
in the exponential family is that certain link functions, called canonical
link functions, have a special status. They provide simplifications (which
we will not go into) for the theory and analysis of GLMs. All of the link
functions that you have seen so far in this unit are in fact canonical link
functions; these are listed in Table 3.
Table 3 Some canonical link functions

Response    Canonical link function g              Link function name
Normal      g(E(Yi)) = E(Yi)                       identity
Bernoulli   g(E(Yi)) = log(E(Yi)/(1 − E(Yi)))      logit
Poisson     g(E(Yi)) = log(E(Yi))                  log

Note that other link functions which are not canonical link functions can
be, and are, used in practice. This is because canonical link functions are
not always the most sensible link functions to use. You will meet an
example of such a non-canonical link function in Subsection 4.1 later in
this unit.
GLM link functions are summarised in the following box.

Box 6 GLM link functions


We would like a GLM link function g to transform E(Yi ) so that:
• there’s a linear relationship between g(E(Yi )) and the explanatory
variables
• g(E(Yi )) can take any value between −∞ and +∞.
Canonical link functions provide simplifications for the theory and
analysis of GLMs, but non-canonical link functions can be, and are,
used in practice.


3.3 Inverse link functions


We have seen that an important property of many link functions is that
they transform the mean response E(Yi ) from whatever scale it is naturally
on (such as from 0 to ∞) to the whole real line, so that g(E(Yi )) can take
any possible value to match the unrestricted values of ηi . However, we
would want the outputs of the GLM to be on the same scale as E(Yi ),
rather than on the scale of the unrestricted values of ηi . For example, in a
GLM with Yi ∼ Poisson(λi ), we are ultimately interested in the value of
E(Yi ) (which is positive), rather than g(E(Yi )) = log(E(Yi )) (which can
take any real value). In order to produce output of the GLM on the same
scale as E(Yi ), the inverse transformation of the link function is required.
The inverse link function g⁻¹ arises from solving the equation
g(E(Yi)) = ηi
so that
E(Yi) = g⁻¹(ηi).
This is illustrated in Figure 15.

Figure 15 The link function g, mapping E(Yi) to ηi via g(E(Yi)), and the inverse link function g⁻¹, mapping ηi back to E(Yi) via g⁻¹(ηi), relating the mean response to the linear predictor in a GLM

We’ll find the inverse link function for the logit link in the following
example.


Example 6 Inverse link function for the logit link


The logit link is the canonical link function for a Bernoulli GLM (that
is, logistic regression), so that
g(E(Yi)) = log(E(Yi)/(1 − E(Yi))) = ηi.
We wish to find the function g⁻¹ so that E(Yi) = g⁻¹(ηi).
So, let’s start with the equation
log(E(Yi)/(1 − E(Yi))) = ηi
and take exponentials of both sides. This becomes
E(Yi)/(1 − E(Yi)) = exp(ηi)
E(Yi) = (1 − E(Yi)) exp(ηi)
E(Yi) = exp(ηi) − E(Yi) exp(ηi)
E(Yi)(1 + exp(ηi)) = exp(ηi)
and so
E(Yi) = exp(ηi)/(1 + exp(ηi)).
Therefore, the inverse link function for the logit link is
g⁻¹(ηi) = exp(ηi)/(1 + exp(ηi)).

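As a quick numerical check, here is a minimal R sketch (the helper name inv_logit is ours) confirming that the inverse logit maps any real-valued linear predictor into the interval (0, 1):

# Minimal sketch: the inverse logit maps any real eta into (0, 1)
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))
inv_logit(c(-10, 0, 10))
# [1] 4.539787e-05 5.000000e-01 9.999546e-01
# Base R's plogis() computes the same function
plogis(c(-10, 0, 10))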
We can follow the same approach in the next activity to obtain the inverse
link functions for the (canonical) link functions of GLMs for responses
with normal and Poisson distributions (that is, linear regression and
Poisson regression, respectively).
Activity 12 Some more inverse link functions
For each of the following models, find the inverse link function g −1 (ηi ) for
the canonical link function.
(a) A GLM with a normal response.
(b) A Poisson GLM.
Note that we can specify a GLM in terms of either the link function or the
inverse link function. For instance, for a Poisson GLM with a log link, we
could say that Yi ∼ Poisson(λi ) and E(Yi ) = exp(ηi ), instead of
log(E(Yi )) = ηi .
Inverse link functions are summarised in Box 7.
Box 7 Inverse link functions

For a GLM with link function g(E(Yi)), such that
g(E(Yi)) = ηi,
the inverse link function is the function g⁻¹ so that
E(Yi) = g⁻¹(ηi).
The canonical link functions that you’ve seen so far, together with their
inverse link functions, are summarised in Table 4.
Table 4 Some canonical link functions and their inverse link functions

Response    Canonical link function g              Inverse link function g⁻¹
Normal      g(E(Yi)) = E(Yi)                       g⁻¹(ηi) = ηi
Bernoulli   g(E(Yi)) = log(E(Yi)/(1 − E(Yi)))      g⁻¹(ηi) = exp(ηi)/(1 + exp(ηi))
Poisson     g(E(Yi)) = log(E(Yi))                  g⁻¹(ηi) = exp(ηi)
Inverse link functions are important for calculating fitted and predicted
mean responses, and we shall see them in action for doing just that in the
next subsection!
3.4 Fitted mean responses and prediction
We saw in Subsection 3.1 that the fitted linear predictor, η̂, replaces the parameters in the linear predictor by their estimated values (see, for example, Example 3 and Activity 9). But what we're really interested in is the fitted value of the mean response E(Yi). This is known as the fitted mean response for the ith observation.
Following our usual convention of adding a 'hat' symbol to denote a fitted value, we are therefore interested in Ê(Yi). This notation is, however, a bit cumbersome! So, since E(Yi) is conventionally denoted by µi, we'll instead use the notation µ̂i to denote the fitted mean response.
So, how do we obtain µ̂i? Well, when the values of the observed explanatory variables for the ith observation are put into the fitted linear predictor, we obtain η̂i, the fitted linear predictor for the ith observation. We then need to use the inverse link function to find µ̂i, the fitted mean response for the ith observation, so that
µ̂i = g⁻¹(η̂i).
The process used to obtain fitted mean responses in GLMs is illustrated in Figure 16.
[Figure: flow diagram in which the mean response E(Yi) passes through the link function g to give the linear predictor g(E(Yi)) = ηi, and the fitted linear predictor η̂i passes back through the inverse link function g⁻¹ to give the fitted mean response Ê(Yi) = µ̂i = g⁻¹(η̂i)]
Figure 16 The process of calculating fitted mean responses in GLMs
Calculating a fitted mean response for a Poisson GLM is demonstrated in Example 7.
Example 7 A fitted mean response for familySize
Consider the Philippines 40 to 80 dataset. In Example 3, we saw that the fitted linear predictor for the Poisson GLM
familySize ∼ age
(with a log link) is
η̂ = 1.97 − 0.01 age.
The age of the head of the first household in the dataset is 53. So, the fitted linear predictor for the first household, η̂1, is
η̂1 = 1.97 − (0.01 × 53) = 1.44.
From Table 4, the inverse link function for the log link is
g⁻¹(ηi) = exp(ηi).
So, using our model, the fitted mean response of familySize for the first household is calculated as
µ̂1 = g⁻¹(η̂1) = exp(η̂1) = exp(1.44) ≈ 4.22.
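In R, the same calculation can be carried out with predict(). The following is a hedged sketch, assuming the data sit in a data frame called philippines (the fitted coefficients will carry more decimal places than the rounded values used above):

# Sketch: fit the Poisson GLM, then compute the fitted mean response
# for a household whose head is aged 53
fit <- glm(familySize ~ age, family = poisson(link = "log"),
           data = philippines)
predict(fit, newdata = data.frame(age = 53), type = "link")      # eta-hat
predict(fit, newdata = data.frame(age = 53), type = "response")  # mu-hat
# type = "response" applies the inverse link exp() to eta-hat for us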
We’ll calculate more fitted mean responses in the next two activities.
Activity 13 Another fitted mean response for familySize
Following on from Example 7 and using the same fitted model, what is the
fitted mean response of familySize for the second household in the
Philippines 40 to 80 dataset, whose household head is aged 72?
Activity 14 A fitted mean response for modResult
Consider the OU students dataset. In Activity 9, we saw that the logistic regression model
modResult ∼ bestPrevModScore + age
(with a logit link) has the fitted linear predictor given by
η̂ = −4.45 + 0.09 bestPrevModScore − 0.01 age.
The first student in the OU students dataset has values 89.2 for bestPrevModScore and 32 for age. What is µ̂1, the fitted mean response of modResult for this student?
In addition to calculating fitted mean responses, the fitted linear predictor can also be used to predict the value of the mean response for a new response Y0, given values of the associated explanatory variables. The fitted linear predictor for the new data, η̂0, can then be used to calculate µ̂0, the predicted mean response of Y0, using
µ̂0 = g⁻¹(η̂0).
Essentially, the process of obtaining predicted mean responses for new
responses is the same as that for obtaining fitted mean responses for the
observed responses.
We’ll calculate the predicted mean response for a Bernoulli GLM next in
Activity 15.
Activity 15 A predicted mean response for modResult
Following on from Activity 14, the fitted linear predictor for a logistic
regression model for the binary response modResult with explanatory
variables bestPrevModScore and age is
η̂ = −4.45 + 0.09 bestPrevModScore − 0.01 age.
For a new student aged 49 with a value of 74.2 for bestPrevModScore,
what is the predicted mean response of modResult for this student?
It is also possible to calculate prediction intervals which give an indication of the precision of the predicted mean responses in a GLM. There are,
however, a few issues with calculating these intervals for GLMs, and we
won’t be considering them further in M348. (Basically, prediction intervals
for GLMs tend to be ‘optimistic’, leading to intervals smaller than they
should be.)
Fitted and predicted mean responses for GLMs are summarised in Box 8.
Box 8 Fitted and predicted mean responses
Suppose that a GLM has inverse link function g⁻¹, so that
E(Yi) = g⁻¹(ηi).
For the ith observation with observed values of the explanatory variables, the fitted mean response of Yi is
Ê(Yi) = µ̂i = g⁻¹(η̂i),
where η̂i is the fitted linear predictor for the ith observation.
For a new response Y0 with given values of the explanatory variables, the predicted mean response of Y0 is
Ê(Y0) = µ̂0 = g⁻¹(η̂0),
where η̂0 is the fitted linear predictor for this new observation.
So far in this unit, we’ve looked at the theory behind GLMs, focusing in
particular on GLMs for responses with normal, Bernoulli and Poisson
distributions. We will consider GLMs for responses with two other
non-normal distributions soon in Section 4, but before we do that, it’s
about time that we did some practical work on the computer!
You have actually already used R to fit a Bernoulli GLM when using logistic regression in Unit 6, and, as we now know from this unit, the linear regression models from Units 1 to 5 are also GLMs (with normal response variables). So, for now, we'll just focus on using R for Poisson GLMs.
3.5 Using R for Poisson GLMs
We’ll start in Notebook activity 7.1 by using R to fit the model
familySize ∼ age
using data from the Philippines 40 to 80 dataset and a Poisson GLM with
a log link. Then, in Notebook activity 7.2, we’ll use R to calculate
predicted mean responses for the same model.
For the final notebook activity in this subsection, we’re going to return to
the problem of modelling data from the Olympics dataset which we worked
on in Unit 5. In Unit 5, we focused on the problem of modelling the
response medals, which is defined as
• medals: the number of medals won by a nation at a summer Olympics
(as it stood at the end of Tokyo 2020).
In that unit, even though the response medals is a count variable, we used
a linear model, because at that stage we only had linear models in our
statistical modelling toolbox! In Notebook activity 7.3, we shall instead
model medals using a Poisson GLM with a log link.
Notebook activity 7.1 Fitting a Poisson GLM in R
This notebook explains how to use R to fit a Poisson GLM.
Notebook activity 7.2 Prediction for a Poisson GLM in R
In this notebook, we’ll use R to calculate predicted mean responses
for a Poisson GLM.
Notebook activity 7.3 A GLM for predicting Olympic medals
This notebook uses a Poisson GLM with a log link to model data
from the Olympics dataset.
4 GLMs for two more response variable distributions
In this section, we shall look at how two other distributions – namely, the
exponential and binomial distributions – are commonly used as the
distribution for the response in a GLM. We shall start in Subsection 4.1 by
focusing on GLMs for responses with exponential distributions, before
moving on to GLMs for responses with binomial distributions in
Subsection 4.2. To finish the section, we’ll use these two GLMs in R in
Subsection 4.3.
4.1 GLMs for responses with exponential distributions
In this subsection, we’ll consider response variables that measure the time
between occurrences of successive events. This type of variable appears in
many practical situations. We have already seen an example of one such
situation in part (a) of Activity 1, where waiting times between serious
earthquakes were mentioned. Other examples include times until recovery
or remission for patients suffering from a disease, lifetimes of mechanical or
electrical components until they fail, and customer waiting times to speak
to someone at a call centre.
In order to get a feel for what datasets with this type of response variable
look like, we’ll focus on a (historical) dataset concerning the survival of
leukaemia patients. This dataset is described next.
Leukaemia patient survival in the 1960s
Back in the 1960s, the survival rate for leukaemia patients was very
low. Luckily, with the development of modern treatments, the
survival rate for leukaemia is now much higher. The data in this
dataset were collected in the 1960s and, as such, are of historical
interest, rather than practical interest.
The leukaemia survival dataset (survival)
Researchers collected data on the lengths of survival, in weeks, of
33 patients suffering from acute myelogenous leukaemia. Patients were
classified into two groups according to the presence or absence of a
morphological characteristic of the leukaemic white cells in the bone
marrow: patients with the morphological characteristic were termed
‘AG positive’ and those without were termed ‘AG negative’. Data on
the white blood cell count were also collected for these patients. The
numbers of white blood cells are so large that the logs of these values were taken so as to make the numbers easier to handle.
The dataset contains data on the following variables:
• survivalTime: the survival time of the patient in weeks (rounded to the nearest week)
• logWbc: the log (to base e) of the number of white blood cells
counted
• ag: the AG condition, taking the possible values pos (for positive)
or neg (for negative).
The data for the first five observations from the leukaemia survival
dataset are given in Table 5.
Table 5 First five observations from survival
survivalTime logWbc ag
65 7.7407 pos
156 6.6201 pos
100 8.3664 pos
134 7.8633 pos
16 8.6995 pos
Source: Feigl and Zelen, 1965
The researchers analysing the leukaemia survival dataset were interested in investigating how white blood cell counts (logWbc) affected patient
survival times (survivalTime) for each of the two levels of the factor ag.
We’ll therefore take survivalTime as our response variable, and logWbc
and ag as the explanatory variables.
At first sight, it looks like the response survivalTime is a discrete variable
because the survival times in the leukaemia survival dataset were rounded
to the nearest week. However, survivalTime is in fact continuous since
weeks occur on a continuous scale and fractions of a week are therefore
interpretable (which is not the case for ‘fractions of people’ when
considering the variable familySize from the Philippines survey dataset!).
Survival times are necessarily non-negative and so, when modelling
survivalTime, we need a distribution suitable for a continuous,
non-negative response. Although there are several possible candidate
distributions for this, here we’ll investigate using an exponential
distribution for survivalTime. This is because the exponential
distribution is commonly used for waiting time variables similar to
survivalTime, and also the exponential distribution is one of the
distributions which can be assumed for the response in a GLM.
A summary of the exponential distribution is given in Box 9.
Box 9 The exponential distribution
A random variable Y is said to have an exponential distribution
with parameter λ, where λ > 0, if it has probability density
function (p.d.f.)
f(y) = λe^(−λy),  y > 0.
This is written Y ∼ M (λ).
The exponential distribution is used for non-negative continuous
variables whose p.d.f. decreases for increasing y. It has what is often
called a ‘long tail’, allowing occasional large values of y to occur.
The exponential p.d.f. always has the same general shape, regardless
of the value of λ. Figure 17 shows the p.d.f. of an exponential
distribution with parameter λ = 1.
[Figure: plot of f(y) against y for y from 0 to 10; f(y) decreases steadily from 1.0 at y = 0 towards 0]
Figure 17 The p.d.f. of an exponential distribution with λ = 1
The parameter λ is an event rate, and is related to both the mean and the variance of the distribution, so that
E(Y) = 1/λ  and  V(Y) = 1/λ².
Therefore,
V(Y) = (E(Y))².
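As a quick numerical illustration of this variance–mean relationship, here is a minimal R sketch (the rate λ = 0.5 is an arbitrary choice of ours):

# Sketch: for Y ~ M(lambda) with lambda = 0.5, E(Y) = 2 and V(Y) = 4,
# so that V(Y) = (E(Y))^2; we check this by simulation
set.seed(1)
y <- rexp(100000, rate = 0.5)
mean(y)  # close to 2
var(y)   # close to 4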
In the next activity, we'll investigate whether the exponential distribution might be a suitable distribution for the response survivalTime.
Activity 16 An exponential distribution for survivalTime?
Figure 18 shows a unit-area histogram of survivalTime from the leukaemia survival dataset with an overlaid exponential distribution curve
fitted to the same data. As has already been pointed out earlier in this
unit, because the response is influenced by the explanatory variable(s), we
wouldn’t expect a plot of the response to necessarily look like a typical plot
of the distribution being used to model the response. Figure 18 is, however,
helpful in giving us a general feel for whether the exponential distribution
looks promising as the assumed distribution for survivalTime.
[Figure: unit-area histogram of survival time in weeks (0 to 150), relative frequency from 0 to 0.020, with a decreasing fitted exponential curve overlaid]
Figure 18 A unit-area histogram of survivalTime together with an overlaid fitted exponential distribution curve
By considering Figure 18, does the exponential distribution look like it might be promising as the assumed distribution for the response survivalTime?
From Activity 16, it looks reasonable to assume an exponential distribution for the response survivalTime. The
other important component of a GLM which we need to specify so that we
can model survivalTime is the link function g, so that
g(E(Yi )) = ηi ,
where ηi is the linear predictor.
It turns out that the canonical link function for an exponential GLM (that
is, a GLM for a response with an exponential distribution) is
g(E(Yi)) = −1/E(Yi).
This is known as the negative reciprocal link. There is, however, a
problem with this canonical link function which means that it isn’t
necessarily the most sensible link function to use for this GLM! We’ll
consider this problem in Activity 17.
Activity 17 Why is the canonical link function not ideal?
From Box 6, there are two properties that we’d like a link function g to
have:
• we’d like a linear relationship between g(E(Yi )) and the explanatory
variables
• we’d like g(E(Yi )) to be able to take any value between −∞ and +∞ (to
match the possible values that ηi can take).
Explain why the canonical link function for an exponential GLM may not
be an ideal link function to use.
So, if the canonical link function is not ideal for an exponential GLM,
which link function should be used instead? Well, when we’re assuming an
exponential distribution for the response, the log link is commonly used
instead of the negative reciprocal link, so that
g(E(Yi )) = log(E(Yi )).
Activity 18 explores why this link might be more sensible.
Activity 18 Why might the log link be more sensible?
Explain why the log link may be more sensible than the (canonical)
negative reciprocal link for an exponential GLM.
Since the log link is commonly used as the link function in an exponential
GLM, we shall try using it when modelling survivalTime.
We know from Activity 18 that the log link satisfies one of the properties
that we’d like our link function to have. But what about the other
property? Is it reasonable to assume a linear relationship between g(E(Yi ))
and the explanatory variables? We shall investigate this question in the
next activity.
Activity 19 Can we assume a linear relationship?
Suppose that we wish to use data from the leukaemia survival dataset to
fit the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a log link.
Figure 19 shows a plot of log(survivalTime) and logWbc, with the
different levels of ag indicated. From this plot, does it look like it might be
reasonable to assume a linear relationship between g(E(Yi )) and the
explanatory variables?
[Figure: scatterplot of log(survivalTime) against logWbc (about 7 to 11), with separate plotting symbols for the two levels of ag (neg and pos)]
Figure 19 Scatterplot of log(survivalTime) and logWbc, with the different levels of ag indicated
So, it looks like an exponential GLM with a log link could be a good way
forward for the model
survivalTime ∼ logWbc + ag.
We shall consider this fitted model in the next activity.
Activity 20 A fitted model for survivalTime
Data from the leukaemia survival dataset were used to fit the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a log link.
The parameter estimates of the coefficients for the fitted model are given
in Table 6.
Table 6 Parameter estimates for the model survivalTime ∼ logWbc + ag, using an exponential GLM with a log link

Parameter   Estimate
Intercept     5.8154
logWbc       −0.3044
ag pos        1.0176
(a) The factor ag has two levels: ‘pos’ and ‘neg’. Which level has been set to be level 1 in this fitted model?
(b) The first patient in the leukaemia survival dataset has a value of 7.7407 for logWbc and ‘pos’ for ag. What is the value of η̂1, the fitted linear predictor for this patient?
(c) Hence, what is µ̂1, the fitted mean response of survivalTime for this patient?
(d) The 18th patient in the leukaemia survival dataset has a value of 8.3894 for logWbc and ‘neg’ for ag. Calculate µ̂18, the fitted mean response of survivalTime for this patient.
The exponential GLM is summarised in Box 10.
Box 10 Exponential GLMs

• The responses Y1, Y2, . . . , Yn all have an exponential distribution, but each Yi has a different parameter λi > 0, so that
Yi ∼ M(λi),
where E(Yi) = 1/λi.
• The canonical link function for this model is the negative reciprocal link and is given by
g(E(Yi)) = −1/E(Yi) = ηi,
where ηi is the linear predictor. The negative reciprocal link function, however, is not ideal because it only allows g(E(Yi)) to take negative values, which doesn't match with the possible values that ηi can take.
• When we assume an exponential distribution for the response, the log link is commonly used (rather than the canonical link), so that
g(E(Yi)) = log(E(Yi)) = ηi.
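Before moving on, a practical note: base R's glm() has no 'exponential' family. Since the exponential distribution is a Gamma distribution with shape parameter 1, one common workaround (a hedged sketch on our part, not necessarily how the module's notebooks do it) is to fit a Gamma GLM with a log link and fix the dispersion at 1 when summarising:

# Sketch: exponential GLM via the Gamma family, assuming a data frame
# called leukaemia with variables survivalTime, logWbc and ag
fit <- glm(survivalTime ~ logWbc + ag,
           family = Gamma(link = "log"), data = leukaemia)
summary(fit, dispersion = 1)  # inference with the dispersion fixed at 1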
The GLM for a response with a binomial distribution is introduced next.
4.2 GLMs for responses with binomial distributions
In this subsection, we shall consider GLMs where each response Yi is the
number of successes in Ni trials.
We met a response variable of this kind in part (c) of Activity 1, where we
considered the number of insects surviving out of N insects exposed to an
insecticide as a response variable. In this context, each surviving insect
would be labelled as a success. Here, we’re considering the scenario in
which there are n responses Y1 , Y2 , . . . , Yn – in other words, there are n
groups of insects – and the size of the ith group of insects is Ni , for
i = 1, 2, . . . , n. Then each Yi is the number of insects surviving out of the
Ni insects in the ith group.
For response variables like these, a binomial distribution is usually
assumed. A summary of the binomial distribution is given in Box 11.
Box 11 The binomial distribution
A random variable Y is said to have a binomial distribution with parameters N and p, where 0 < p < 1 and N is a positive integer, if it has p.m.f.
p(y) = C(N, y) p^y (1 − p)^(N−y),  y = 0, 1, 2, . . . , N,
where C(N, y) denotes the binomial coefficient 'N choose y'.
(Note that N is known, but p is unknown.)
This is written as Y ∼ B(N, p).
The binomial distribution is used as a distribution for the number of
successes in a sequence of N Bernoulli trials, in which the probability
of success in a single trial is p.
The mean and variance of Y are
E(Y) = Np and V(Y) = Np(1 − p).
In a binomial GLM, the responses Y1, Y2, . . . , Yn all have binomial distributions, but each Yi has different parameters Ni and pi, so that
Yi ∼ B(Ni , pi ).
Remember that for each Yi , the number of Bernoulli trials, Ni , is known,
but the success probability, pi , is unknown.
For a binomial GLM, we’re ultimately interested in E(Yi ), where, for a
binomial response Yi , we have that
E(Yi ) = Ni pi .
Now, Ni is a known constant, and so modelling E(Yi ) is effectively the
same as modelling pi . But pi is the success probability for each of the
Ni Bernoulli trials, and we already know how to model the success
probabilities for Bernoulli trials – by using logistic regression! So, a
binomial GLM is essentially a logistic regression model.
Let's look at how a binomial GLM works in a little more detail.
For each response Yi ∼ B(Ni , pi ), we start by modelling the success
probability pi using the logit link. This is the link function used for logistic
regression in Unit 6, and the canonical link function for the equivalent
Bernoulli GLM. We therefore have a regression equation for pi of the form
log(pi/(1 − pi)) = ηi,
where ηi is the linear predictor.
From Table 4, the inverse link function for the logit link is
g⁻¹(ηi) = exp(ηi)/(1 + exp(ηi)).
Therefore, the fitted value of pi is
p̂i = g⁻¹(η̂i) = exp(η̂i)/(1 + exp(η̂i)),
where η̂i is the fitted linear predictor. The fitted mean response, µ̂i, of Yi can then be calculated as
µ̂i = Ni × p̂i = Ni × exp(η̂i)/(1 + exp(η̂i)).
Similarly, the predicted mean response, µ̂0, for a new response Y0 with associated fitted linear predictor η̂0, can be calculated as
µ̂0 = N0 × exp(η̂0)/(1 + exp(η̂0)).
So, a binomial GLM is essentially the same as a Bernoulli GLM, except that fitted and predicted mean responses are scaled up by the number of trials, Ni.
How a binomial GLM works for calculating fitted mean responses is
summarised in Figure 20.
[Figure: flow diagram for a binomial response Yi ∼ B(Ni, pi) with mean response E(Yi) = Ni × pi, where Ni is known and pi is unknown; pi is modelled through the logit link log(pi/(1 − pi)) = ηi, the fitted linear predictor η̂i gives the fitted p̂i = exp(η̂i)/(1 + exp(η̂i)), and the fitted mean response is Ê(Yi) = µ̂i = Ni × p̂i]
Figure 20 Illustration of how a binomial GLM works for calculating fitted mean responses
Now it’s time to see a binomial GLM in action! For this, we’ll once again
consider data from the OU students dataset.
In Unit 4, examScore was treated as a normal response variable and, in Unit 4's computing work, stepwise regression was used to select a linear
regression model with various factors and covariates as explanatory
variables. The resulting normal probability plot for the fitted model is
shown in Figure 21; from this plot, we concluded that the assumption of
normality looks questionable.
[Figure: normal probability plot of standardised residuals against theoretical quantiles, with clear downward curvature at the lower end]
Figure 21 The normal probability plot for a linear regression model for examScore selected using stepwise regression
If the normality assumption is questionable, maybe we can use a GLM with a different distribution for the response instead? In particular, can we
model examScore using a binomial GLM?
The short answer is yes – if we make a couple of assumptions! There are
100 marks available for each exam, so we can think of these 100 marks as
being 100 ‘trials’. There are two possible outcomes for each of these trials:
either the mark is awarded (a success) or the mark is not awarded (a
failure). In this case, the exam score is the number of successes in
100 trials. We can then consider examScore to be a binomial random
variable if we make two assumptions:
• for a particular student, the probability of obtaining each mark is the
same for each of the 100 available marks
• each ‘trial’ is independent.
Okay, so these assumptions may be a bit unrealistic for some (most?)
students, but let’s assume that they hold so that we can try modelling
examScore using a binomial GLM.
Now, the probability of being awarded each mark might be expected to vary from student to student: for one student, the probability of being
awarded each mark could be high, while for another student, the
probability could be low. So, for the ith student, we assume that the
response examScore follows a B(100, pi ) distribution, where pi is the
probability that, for each of the 100 marks available, student i is awarded
the mark.
In Activity 21, we’ll assume a binomial distribution for examScore and fit
a binomial GLM with a logit link. Although there are several possible
explanatory variables that we could include in our model, following
Activity 9 (which modelled the response modResult using a Bernoulli
GLM) we will consider using only the two covariates bestPrevModScore
and age as the explanatory variables. (We will use more explanatory
variables to model examScore later in the unit when a computer can do
the hard work for us!)
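As an aside, here is a hedged sketch of how such a binomial GLM can be fitted in R (assuming a data frame called ouStudents): glm() accepts a two-column matrix giving the numbers of successes and failures for each observation.

# Sketch: binomial GLM with a logit link, with Ni = 100 marks per exam
fit <- glm(cbind(examScore, 100 - examScore) ~ bestPrevModScore + age,
           family = binomial(link = "logit"), data = ouStudents)
coef(fit)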
Activity 21 A binomial GLM for examScore
Data from the OU students dataset were used to fit the model
examScore ∼ bestPrevModScore + age
using a binomial GLM with a logit link.
The parameter estimates for the coefficients for the fitted model are given
in Table 7.
Table 7 Parameter estimates for examScore ∼ bestPrevModScore + age, using a binomial GLM with a logit link

Parameter           Estimate
Intercept            −3.3387
bestPrevModScore      0.0472
age                  −0.0040
(a) The first student in the dataset has a value of 89.2 for bestPrevModScore and 32 for age. What is the value of the fitted linear predictor, η̂1, for this student?
(b) Calculate p̂1, the fitted success probability for this first student.
(c) Hence, what is the fitted mean response, µ̂1, for this student? Round your answer to the nearest integer.
(d) Rounding your answer to the nearest integer, use the fitted model to calculate the predicted mean response, µ̂0, for a new student who is aged 64 and has a value of 79.2 for bestPrevModScore.
Binomial GLMs are summarised in Box 12.
Box 12 Binomial GLMs

• The responses Y1, Y2, . . . , Yn all have binomial distributions, but each Yi has different parameters Ni and pi, so that
Yi ∼ B(Ni, pi).
The number of trials, Ni, is known, but the success probability, pi, is unknown.
• The canonical link function is the logit link, which is used to model the success probability pi, so that the link function g is
g(pi) = log(pi/(1 − pi)) = ηi,
where ηi is the linear predictor.
• Denoting the fitted success probability by p̂i, the fitted mean response for the binomial response is then
µ̂i = Ni × p̂i.
Computers at the ready – now it’s time for some practical work!
4.3 Using R for exponential and binomial GLMs
We’ll start with Notebook activity 7.4, which explains how we can use R to
fit an exponential GLM. In doing so, we shall revisit the leukaemia
survival dataset and use R to fit the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a (non-canonical) log link.
We’ll then move on to Notebook activity 7.5, which explains how we can
fit a binomial GLM in R. In this notebook activity, we’ll consider once
again the OU students dataset and use R to fit the model
examScore ∼ bestPrevModScore + age
using a binomial GLM with a (canonical) logit link.
Notebook activity 7.4 Fitting an exponential GLM with a non-canonical link in R
This notebook explains how R can be used to fit an exponential GLM
with a (non-canonical) log link.
Notebook activity 7.5 Fitting a binomial GLM in R
This notebook explains how R can be used to fit a binomial GLM.
5 Assessing model fit and choosing a GLM
In Unit 6, we saw that the methods used for assessing model fit in logistic
regression are not the same as those used in linear regression. This is
because in logistic regression the methods are based on approximations,
rather than the exact distribution theory available in linear regression.
Since a logistic regression model is just one particular GLM, it should
come as no surprise to learn that assessing the model fit of GLMs in
general follows the same methods as those presented in Unit 6 for logistic
regression.
We shall start in Subsection 5.1 by discussing how the fit of a GLM can be
assessed. Then, in Subsection 5.2, we focus on the wider issue of
comparing the fits of GLMs so that we can choose which explanatory
variables to include in our model. We’ll round off the section by using R to
assess fit and choose a GLM in Subsection 5.3.
5.1 Assessing model fit
As already mentioned, we use the same methods for assessing model fit for
GLMs in general as we did for assessing model fit in logistic regression. So,
we’ll start with a reminder of what those methods are, before looking at
the fit of some of the GLMs that have been considered so far.
Recall from Unit 6 that the saturated model is the best model we can use
in terms of fit, so that the log-likelihood of the saturated model,
l(saturated model), is the largest possible log-likelihood. Comparing the
log-likelihood of our proposed model, l(proposed model), with
l(saturated model) then gives us a measure of how much fit is lost by using
our proposed model.
The measure used for comparing these two log-likelihoods is the residual
deviance D, where
D = 2 × (l(saturated model) − l(proposed model)).
A large value of D then indicates that too much fit has been lost, and so
our proposed model is not a good fit.
A summary of the residual deviance and how it can be used to assess the
fit of a GLM is given in Box 13.
Box 13 Residual deviance for GLMs
Suppose that the observations y1 , y2 , . . . , yn are used to fit a GLM –
the proposed model.
The residual deviance for the proposed model is given by
D = 2 × (l(saturated model) − l(proposed model)).
If the proposed GLM is a good fit, then
D ≈ χ2 (r),
where
r = number of parameters in the saturated model
− number of parameters in the proposed model
= n − number of parameters in the proposed model.
This result is used to assess the fit of the proposed GLM, as
illustrated in Figure 22.
[Figure: a χ²(r) density with the residual deviance D marked on the horizontal axis; when D is not large the p-value is not small, indicating a good fit, and when D is large the p-value is small, indicating a poor fit]
Figure 22 Illustration of how we can use D to assess the fit of a proposed GLM (here, we've taken r to be 10)
Unit 6 also introduced a 'rule of thumb' for informally assessing the residual deviance. A summary reminder of this is given in Box 14.
Box 14 ‘Rule of thumb’ for residual deviance
Suppose that a proposed GLM has residual deviance D, with
D ≈ χ2 (r). If the model is a good fit, then we have the following ‘rule
of thumb’.
• If D ≤ r, then the GLM is likely to be a good fit to the data.
• If D is ‘much larger’ than r, then the model is likely to be a poor fit
to the data.
We’ll look at the model fit of two of the GLMs that we’ve considered so far
in this unit in Example 8 and Activity 22.
Example 8 Assessing the fit of a GLM for familySize
In Notebook activity 7.1, we fitted the model
familySize ∼ age
using data from the Philippines 40 to 80 dataset and a Poisson GLM
with a log link.
The residual deviance, D, for this fitted model is 1731.8 and the
associated degrees of freedom, r, is 1157.
Using the ‘rule of thumb’ given in Box 14, it is not possible to
informally assess the model as having a good fit since D > r. We
therefore need the associated p-value calculated from the χ2 (1157)
distribution to assess whether D is ‘large enough’ in comparison to r
to suggest that the model is a poor fit.
It turns out that the associated p-value for D is close to 0. So, since
the p-value is so small, this means that the value of D is large and is
in the upper tail of the distribution. We can therefore conclude that
the residual deviance is large and so the GLM is a poor fit.
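The p-value in Example 8 is just an upper-tail χ² probability, which can be computed directly in R (a minimal sketch; for a fitted glm object, D and r are available as fit$deviance and fit$df.residual):

# Sketch: p-value for the residual deviance in Example 8
D <- 1731.8
r <- 1157
pchisq(D, df = r, lower.tail = FALSE)  # effectively 0, so a poor fit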
Activity 22 Assessing the fit of a GLM for survivalTime
In Notebook activity 7.4, we fitted the model
survivalTime ∼ logWbc + ag
using data from the leukaemia survival dataset and an exponential GLM
with a log link.
The residual deviance, D, for this fitted model is 40.319, and the
associated degrees of freedom, r, is 30. The p-value for this value of D is
calculated to be 0.099. What do you conclude about the fit of this GLM?
Next we’ll consider how we can compare the fits of two GLMs to help us
choose which explanatory variables should be included in our model.
5.2 Comparing GLMs
The residual deviance compares the fit of a proposed GLM and the
saturated model. We can also use the residual deviance to compare the
model fits of two GLMs, M1 and M2 , say, as long as M1 is nested within
M2 . We do this using the deviance difference, as summarised in Box 15.
Box 15 Comparing the fits of nested GLMs
Suppose that we have two GLMs, M1 and M2 , where M1 is nested
within M2 .
The deviance difference for these two models is
deviance difference = D(M1 ) − D(M2 ),
where D(M1 ) and D(M2 ) denote the residual deviances for models
M1 and M2 , respectively.
If both M1 and M2 are good fits, then
deviance difference ≈ χ2 (d),
where
d = difference in the degrees of freedom for D(M1 ) and D(M2 )
= the number of extra parameters in the larger model M2 .
This result is used to compare the model fits of M1 and M2 , as
illustrated in Figure 23.
[Figure: a χ²(d) density with the deviance difference marked; when it is not large the p-value is not small, there is no significant gain in fit for M2 and M1 is chosen, while when it is large the p-value is small, there is a significant gain in fit for M2 and M2 is chosen]
Figure 23 Illustration of how we can use the deviance difference to compare the fits of nested GLMs (here, we've taken d to be 5)
Next, we’ll use the deviance difference to compare the model fits of GLMs
for modelling survivalTime from the leukaemia survival dataset.
So far, for these data we’ve considered the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a log link. Now, in Activity 22, we saw
that the p-value for the residual deviance, D, for this fitted model is 0.099.
So, since this p-value is quite large, we concluded in that activity that the
model was an adequate fit to the data.
But, do we need both of the explanatory variables in the model? We shall
consider this question in the next example and activity: we’ll investigate
whether logWbc is needed in the model in addition to ag in Example 9,
and then we’ll investigate whether ag is needed in the model in addition to
logWbc in Activity 23.
Example 9 Is logWbc needed as well as ag?
The leukaemia survival dataset contains data on 33 patients. The
following GLMs with an exponential response and log link were fitted
to these data.
• Model M1 : survivalTime ∼ ag.
The residual deviance for this model, D(M1 ), is 46.198; the
associated approximate null distribution is χ2 (31).
• Model M2 : survivalTime ∼ logWbc + ag.
The residual deviance for this model, D(M2 ), is 40.319; the
associated approximate null distribution is χ2 (30).
Since M1 is nested within M2 , we can calculate the deviance
difference between these two models:
deviance difference = D(M1 ) − D(M2 )
= 46.198 − 40.319 = 5.879.
Model M2 has one more parameter than M1 (the regression coefficient
for the covariate logWbc), and so this deviance difference is
approximately distributed as χ2 (1). We can also calculate the degrees
of freedom as the difference between the degrees of freedom of the
χ2 null distributions for D(M1 ) and D(M2 ), namely
31 − 30 = 1.
The p-value associated with this deviance difference is calculated to
be 0.020. Since this is small, there is evidence to suggest that there is
a significant gain in fit by including logWbc in the model, in addition
to ag. So, we should choose model M2 over model M1 .
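In R, this nested comparison is a one-liner on the two fitted model objects, anova(fit1, fit2, test = "Chisq"), where fit1 and fit2 are hypothetical fitted versions of M1 and M2. Equivalently, by hand from the residual deviances quoted above:

# Sketch: deviance difference test from the quoted residual deviances
dev_diff <- 46.198 - 40.319                          # 5.879
pchisq(dev_diff, df = 31 - 30, lower.tail = FALSE)   # small p-value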
Activity 23 Is ag needed as well as logWbc?
Consider once again the leukaemia survival dataset. The following GLMs
with an exponential response and log link were fitted to these data. (Note
that M2 is the same model M2 considered in Example 9.)
• Model M1 : survivalTime ∼ logWbc.
The residual deviance for this model, D(M1 ), is 47.808; the associated
approximate null distribution is χ2 (31).
• Model M2 : survivalTime ∼ logWbc + ag.
The residual deviance for this model, D(M2 ), is 40.319; the associated
approximate null distribution is χ2 (30).
(a) Calculate the deviance difference for the nested models M1 and M2 .
(b) What are the degrees of freedom of the χ2 null distribution for the
deviance difference that you calculated in part (a)?
(c) The p-value associated with the deviance difference is 0.009. What do
you conclude?
In Subsection 6.1 of Unit 6, we saw that in logistic regression we can also use the deviance difference for assessing whether or not individual
explanatory variables are useful for modelling the response, by comparing
the fit of the model containing just the single explanatory variable of
interest with the fit of the null model (the model with no explanatory
variables and hence the worst fit). We can assess individual explanatory
variables in the same way for GLMs in general. We shall see this process in
action next in Activity 24, where we’ll assess whether age is useful for
modelling familySize for the Philippines 40 to 80 dataset.
Activity 24 Is age useful for modelling familySize?
In Example 8, we used the residual deviance to assess the fit of the model
familySize ∼ age
fitted to data from the Philippines 40 to 80 dataset using a Poisson GLM
with a log link. In that example, we concluded that this model was a poor
fit to the data, since the p-value for the residual deviance is close to 0.
This, however, doesn’t necessarily mean that age is not going to be useful
for modelling familySize. It’s possible, for example, that age is useful for
modelling familySize, but that the model is missing some other extra key
explanatory variables to improve the model fit.
So, to assess whether age is useful for modelling familySize, we’ll use the
deviance difference to compare the fit of our proposed model with the fit of
the null model.
Let the null model be M1 and our proposed model be M2 . The null model
is, of course, nested within the proposed model, so that M1 is nested
within M2 . The residual deviance of the null model, D(M1 ), is then 1795.8
with 1158 associated degrees of freedom, while the residual deviance of the
proposed model, D(M2 ), is 1731.8 with 1157 associated degrees of freedom.
(a) Calculate the deviance difference for the nested models M1 and M2 .
(b) What are the degrees of freedom of the χ2 null distribution for the
deviance difference that you calculated in part (a)?
(c) The p-value associated with the deviance difference is close to 0.
What do you conclude?
The deviance difference can only be used to compare the fits of nested
GLMs. When comparing the fits of non-nested GLMs, the AIC can
instead be used (as it was for non-nested models in logistic regression).
A reminder of the AIC was given in Box 16 in Subsection 6.2 of Unit 6.
The AIC works in exactly the same way for GLMs in general. The key
things to remember when assessing non-nested GLMs are summarised in
Box 16 and illustrated by comparing two non-nested GLMs in Activity 25.
Box 16 Comparing the fits of non-nested GLMs
• The AIC can be used to compare the fits of non-nested GLMs.
• For a set of possible GLMs, the preferred GLM is the model with
the smallest AIC value.
Activity 25 Comparing the fits of non-nested GLMs
The following two models, M1 and M2 , were fitted to data from the
leukaemia survival dataset using an exponential GLM with a log link.
• Model M1 : survivalTime ∼ logWbc.
• Model M2 : survivalTime ∼ ag.
The value of the AIC for M1 is 306.22, while the value of the AIC for M2
is 304.85.
Which of models M1 and M2 is preferable, based on their values of
the AIC?
When using both linear regression and logistic regression in this module,
we have been using stepwise regression as an automated procedure for
selecting which explanatory variables should be included in our model, and
at each stage in the stepwise regression procedure the AIC has been used to
compare the model fits. As you have probably guessed, stepwise regression
can be used in exactly the same way for GLMs in general; we shall see
stepwise regression for GLMs in action using R in the next subsection.
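For reference, both tools are direct one-liners in R. Here is a self-contained sketch on simulated Poisson data (the variables and models are ours, purely for illustration):

# Sketch: AIC comparison of non-nested GLMs, and stepwise selection
set.seed(2)
x1 <- runif(100)
x2 <- runif(100)
y  <- rpois(100, exp(0.5 + x1))
fit1 <- glm(y ~ x1, family = poisson)
fit2 <- glm(y ~ x2, family = poisson)
AIC(fit1, fit2)                            # prefer the smaller AIC
step(glm(y ~ x1 + x2, family = poisson))   # stepwise selection by AIC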
5.3 Using R to assess model fit and choose a GLM
In Notebook activity 7.3, we considered a Poisson GLM with a log link as
a model for predicting Olympic medals using data from the Olympics
dataset. In Notebook activity 7.6 in this subsection, we’ll use this model’s
residual deviance to assess the model’s fit.
We’ll then use R to compare the fits of GLMs in Notebook activity 7.7. In
this notebook activity, we’ll compare the fits of binomial GLMs for
modelling examScore from the OU students dataset so that we can decide
which of the two covariates bestPrevModScore and age should be included
in the model.
So far in this unit, when using a binomial GLM for modelling examScore
from the OU students dataset, we’ve only considered two possible
covariates – namely, bestPrevModScore and age. There are, however,
several other potential explanatory variables that we could use for
modelling examScore.
Now, in Unit 4, we used stepwise regression with the OU students dataset
to select the explanatory variables to use for modelling examScore using a
linear regression model. In Notebook activity 7.8, we’ll use stepwise
regression to choose the explanatory variables to include for modelling
examScore using a binomial GLM with a logit link. For comparison
purposes, we’ll consider the same full model form as was considered when
using stepwise regression for the linear regression model in Unit 4, namely,
our full binomial GLM for examScore will include the four explanatory
variables gender, qualLink, bestPrevModScore and age, together with all
of their possible interactions.
In the final notebook activity in this subsection, we’ll once again model the
Philippines 40 to 80 dataset. So far, we have modelled familySize using
only one explanatory variable – age. There are, however, data for several
other explanatory variables (as described in Subsection 1.2). In Notebook
activity 7.9, we shall use stepwise regression to select which explanatory
variables should be included when modelling familySize using a Poisson
GLM with a log link.
Notebook activity 7.6 Assessing the fit of a GLM in R
This notebook uses R to assess the fit of a GLM for predicting
Olympic medals.
Notebook activity 7.7 Comparing the fits of GLMs in R
In this notebook, we’ll use R to compare the fits of both nested and
non-nested GLMs.
Notebook activity 7.8 Choosing a GLM for examScore
This notebook will use stepwise regression to choose a binomial GLM
with a logit link for examScore.
Notebook activity 7.9 Choosing a GLM for familySize
This notebook will use stepwise regression to choose a Poisson GLM
with a log link for familySize.
6 Checking the GLM model assumptions
We discussed checking the logistic regression model assumptions in
Section 7 of Unit 6. The model assumptions for GLMs in general are
(unsurprisingly) generalisations of those presented for logistic regression in
Box 17 in Subsection 7.1 of Unit 6, while the diagnostic plots are the same
as those used for logistic regression in Unit 6.
We’ll start this section by looking at what the model assumptions are for
GLMs in Subsection 6.1, before discussing how diagnostic plots are used
for GLMs in Subsection 6.2. We finish the section by using R to check the
model assumptions for some of the GLMs which have been fitted in this
unit.
6.1 The GLM model assumptions
As already mentioned, the assumptions for GLMs are generalisations of
the model assumptions for logistic regression. The assumptions are given
in Box 17.
Box 17 Model assumptions for GLMs
Suppose that a response Y is modelled by a GLM, with possible
factors A, B, . . . , Z and covariates x1 , x2 , . . . , xq as the explanatory
variables. Then the following assumptions must hold.
• Response distribution: The responses Y1 , Y2 , . . . , Yn all have the
same distribution, but each Yi has a different mean, E(Yi ).
• Linearity: For the link function g, there is a linear relationship
between g(E(Yi )) and the explanatory variables.
• Independence: The response variables Y1 , Y2 , . . . , Yn are
independent of each other.
The independence assumption in Box 17 is our usual regression
independence assumption, while the response distribution and linearity
assumptions are simply the same as those for logistic regression, only
written in a general form so that they are applicable to all GLMs.
There is another property of GLMs which hasn’t yet been mentioned.
Recall that in linear regression, Yi ∼ N (µi , σ 2 ), where the variance
V (Yi ) = σ 2 is assumed constant for all i = 1, 2, . . . , n, regardless of the
value of the mean E(Yi ) = µi . An important characteristic of some GLMs,
however, is that the magnitude of the variance of the response variable is a
function of its mean, which means that the variance is allowed to vary with
the mean and does not necessarily have to be constant. This is commonly
referred to as the variance–mean relationship and is summarised in
Box 18.
Box 18 The variance–mean relationship
For some GLMs, V (Yi ) is a function of E(Yi ), so that the variance of
the response variable may vary with the mean.
The following example shows how the variance relates to the mean of a
response variable with a Bernoulli distribution.
Example 10 Variance–mean relationship for a Bernoulli response
For the response variable Yi ∼ Bernoulli(pi ), for i = 1, 2, . . . , n,
E(Yi ) = pi and V (Yi ) = pi (1 − pi ).
So, the variance of the response is a function of the mean of the
response. This means that if the mean of Yi changes – that is, if pi
changes – then the variance will also change. Therefore, in logistic
regression, the assumption of constant V (Yi ), for i = 1, 2, . . . , n, is
replaced by the assumption that
V (Yi ) = E(Yi )(1 − E(Yi )).
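A quick numerical illustration of this in R (a minimal sketch):

# Sketch: for a Bernoulli response, V(Y) = p(1 - p) changes with p
p <- c(0.1, 0.5, 0.9)
p * (1 - p)
# [1] 0.09 0.25 0.09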
The next two activities will look at the variance–mean relationships for
Poisson GLMs and for exponential GLMs.
Activity 26 Variance–mean relationship for a Poisson GLM
Determine how the variance relates to the mean of the response variable
Yi ∼ Poisson(λi ), for λi > 0, i = 1, 2, . . . , n. (You might find Box 1 in
Subsection 1.2 useful here.)
Activity 27 Variance–mean relationship for an exponential GLM
Determine how the variance relates to the mean of the response variable Yi
in an exponential GLM where Yi ∼ M (λi ), with λi > 0, i = 1, 2, . . . , n.
(You might find Box 9 in Subsection 4.1 useful here.)
Note that, since the variance of the response is assumed constant in linear
regression, if the variance of the response changes with changes in the mean
of that response, then linear regression may not be appropriate. Since the
assumption of constant variance can be relaxed for some GLMs, this
widens the variety of datasets for which GLMs can be applied in practice.
6.2 Diagnostic plots for GLMs
We have already met the diagnostic plots which we’ll use for GLMs in
M348, since they are the same plots which were introduced for logistic
regression in Subsection 7.3 of Unit 6.
As a reminder, the diagnostic plots for logistic regression, and hence GLMs
in general, centre on standardised deviance residuals. These are basically
GLM standardised versions of the residuals used in linear regression, and
each deviance residual can be thought of as the ith data point’s
contribution to the residual deviance D.
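In R, these residuals are available directly for any fitted GLM (a minimal sketch with a toy Poisson fit of our own):

# Sketch: standardised deviance residuals via rstandard()
fit <- glm(c(2, 4, 3, 6, 8) ~ c(1, 2, 3, 4, 5), family = poisson)
rstandard(fit, type = "deviance")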
Four diagnostic plots were introduced in Unit 6; we shall take each of these
in turn in Subsections 6.2.1 to 6.2.4.
6.2.1 Standardised deviance residuals against a transformation of µ̂
The first diagnostic plot that we’ll consider is a plot of the standardised
deviance residuals against a transformation of µ̂, the fitted mean response.
The transformation which is recommended varies depending on which
distribution is being assumed for the response; Table 8 gives the
transformations which are recommended for the distributions that you
have met in this unit. Notice that the same transformation is used for
both Bernoulli and binomial responses, since pi , a Bernoulli trial success
probability, is modelled for each.
Table 8 Transformations of µ̂ recommended for diagnostic plots for different GLMs

Response distribution   Transformation of µ̂
Normal                  µ̂
Bernoulli               2 arcsin(√µ̂)
Poisson                 2√µ̂
Exponential             2 log µ̂
Binomial                2 arcsin(√µ̂)
The plot of the standardised deviance residuals against a transformation of µ̂ as used for GLMs is summarised in Box 19.
Box 19 Standardised deviance residuals against a transformation of µ̂

• This plot is analogous to a plot of the residuals against fitted values in linear regression.
• The transformation of µ̂ used for each distribution assumed for the response is given in Table 8.
• The linearity assumption is reasonable for the GLM if the smoothed red line is roughly a horizontal straight line.
• Curvature in the smoothed red line could indicate that the model needs a different link function, or that a transformation of one (or more) of the explanatory variables is needed.
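As a sketch of how this plot can be produced for a Poisson GLM in R (reusing the toy fit object from the earlier sketch; lowess() stands in for the module's smoothed red line):

# Sketch: standardised deviance residuals against 2 * sqrt(mu-hat)
x <- 2 * sqrt(fitted(fit))
y <- rstandard(fit, type = "deviance")
plot(x, y, xlab = "2 * sqrt(fitted mean)",
     ylab = "Standardised deviance residual")
lines(lowess(x, y), col = "red")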
We’ll see this diagnostic plot in action in Example 11 and Activity 28.
Example 11 Standardised deviance residuals against a transformation of µ̂ for modelling familySize
The model
familySize ∼ age
was fitted to data from the Philippines 40 to 80 dataset using a
Poisson GLM with a log link. Figure 24 shows a plot of the
standardised deviance residuals against a transformation of µ̂ for the
fitted model.
[Figure: scatterplot of standardised deviance residuals against 2√µ̂ (about 3.4 to 4.2), with a smoothed red line]
Figure 24 Plot of the standardised deviance residuals against a transformation of µ̂ for modelling familySize
The smoothed red line in the plot in Figure 24 is roughly a horizontal straight line with only slight curvature. As such, the assumption of linearity seems reasonable.
Note that the appearance of ‘lines’ in the plot is due to the fact that
the response is discrete.
Activity 28 Standardised deviance residuals against a transformation of µ̂ for modelling survivalTime

The model
survivalTime ∼ logWbc + ag
was fitted to data from the leukaemia survival dataset using an exponential GLM with a log link. Figure 25 shows a plot of the standardised deviance residuals against a transformation of µ̂ for the fitted model.
[Figure: scatterplot of standardised deviance residuals against 2 log µ̂ (about 5 to 9), with a smoothed red line]
Figure 25 Plot of the standardised deviance residuals against a transformation of µ̂ for modelling survivalTime
Does this plot indicate any problems with the model assumptions?
6.2.2 Standardised deviance residuals against index
The plot of the standardised deviance residuals against index as used for
GLMs is summarised in Box 20.
Box 20 Standardised deviance residuals against index
• This is a plot of the standardised deviance residuals in the order
that the data were collected.
• The plot can be used to check for independence of the responses
Y1 , Y2 , . . . , Yn : if they are independent, the standardised deviance
residuals in the plot should fluctuate randomly across the index.
We’ll use this diagnostic plot in Example 12 and Activity 29.
Example 12 Standardised deviance residuals against index for modelling familySize
Following on from Example 11, Figure 26 shows a plot of the
standardised deviance residuals against index for the model
familySize ∼ age
fitted to data from the Philippines 40 to 80 dataset using a Poisson
GLM with a log link.
[Figure: scatterplot of standardised deviance residuals against index (0 to about 1200)]
Figure 26 Plot of standardised deviance residuals against index for modelling familySize
The points in the plot shown in Figure 26 seem to be randomly scattered across the index, and so the plot doesn't indicate that there are any problems with the independence assumption.
Activity 29 Standardised deviance residuals against index for modelling survivalTime
Following on from Activity 28, Figure 27 shows a plot of the standardised
deviance residuals against index for the model
survivalTime ∼ logWbc + ag
fitted to data from the leukaemia survival dataset using an exponential
GLM with a log link.
[Figure: scatterplot of standardised deviance residuals against index (0 to 30)]
Figure 27 Plot of standardised deviance residuals against index for modelling survivalTime
Does this plot indicate any problems with the independence assumption
for the model?
6.2.3 Squared standardised deviance residuals against index
A summary of the plot of the squared standardised deviance residuals
against index as used for GLMs is given in Box 21.
Box 21 Squared standardised deviance residuals against index
• This is a plot of the squared standardised deviance residuals in
the order that the data were collected.
• By squaring the standardised deviance residuals, all of the residuals
can be compared on the same scale.
• The squared standardised deviance residuals relating to positive
residuals are distinguished from those relating to negative residuals.
• The plot can be used to check for independence of the responses
Y1 , Y2 , . . . , Yn : if they are independent, the squared standardised
deviance residuals in the plot should fluctuate randomly across the
index number.
We’ll see the plot in action in Example 13 and Activity 30.
Example 13 Squared standardised deviance residuals against index for modelling familySize
Following on from Examples 11 and 12, Figure 28 shows a plot of the
squared standardised deviance residuals against index for the model
familySize ∼ age
fitted to data from the Philippines 40 to 80 dataset using a Poisson
GLM with a log link.

Figure 28 Plot of squared standardised deviance residuals against index
for modelling familySize. The red circles denote positive residuals and
the blue triangles denote negative residuals.

A few of the positive standardised deviance residuals stand out as being
unusually large. There aren't many of these though, and there doesn't
seem to be any particular pattern to them across the index. As such, this
plot doesn't indicate any problems with the independence assumption.


Activity 30 Squared standardised deviance residuals against index
for modelling survivalTime
Following on from Activities 28 and 29, Figure 29 shows a plot of the
squared standardised deviance residuals against index for the model
survivalTime ∼ logWbc + ag
fitted to data from the leukaemia survival dataset using an exponential
GLM with a log link.

Figure 29 Plot of squared standardised deviance residuals against index for
modelling survivalTime. The red circles denote positive residuals and the
blue triangles denote negative residuals.

Does this plot indicate any problems with the independence assumption?

6.2.4 Normal probability plot


The final diagnostic plot that we’ll consider for GLMs is the normal
probability plot. How this plot is used for GLMs is summarised in Box 22.


Box 22 Normal probability plot


• If the response distribution is correct for a GLM, then the
standardised deviance residuals should be approximately
distributed as N (0, 1).
• This can be checked by using a normal probability plot of the
standardised deviance residuals for the fitted model.
• The assumption of normality is reasonable if the points in the
normal probability plot lie along the line.
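In R, such a plot might be obtained as follows; this is a sketch only (not
the module's own code), assuming a fitted GLM object called fit.

# A sketch: normal probability plot of the standardised deviance residuals
resids <- rstandard(fit)
qqnorm(resids, ylab = "Standardised deviance residuals")
qqline(resids)  # reference line through the quartiles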

Normal probability plots for GLMs are illustrated in Example 14 and
Activity 31.

Example 14 Normal probability plot for modelling familySize
Following on from Examples 11, 12 and 13, Figure 30 shows the
normal probability plot of the standardised deviance residuals for the
model
familySize ∼ age
fitted to data from the Philippines 40 to 80 dataset using a Poisson
GLM with a log link.

Figure 30 Normal probability plot of the standardised deviance
residuals for modelling familySize

The points in the normal probability plot are generally quite close to
the diagonal line, although there is some curvature at the ends of the
plot. On the whole, though, the assumption of normality of the
deviance residuals seems reasonable. This in turn means that the
assumption that the response distribution is Poisson also seems
reasonable.

Activity 31 Normal probability plot for modelling survivalTime
Following on from Activities 28, 29 and 30, Figure 31 shows the normal
probability plot of the standardised deviance residuals for the model
survivalTime ∼ logWbc + ag
fitted to data from the leukaemia survival dataset using an exponential
GLM with a log link.

Figure 31 Normal probability plot of the standardised deviance residuals for
modelling survivalTime

Does the normal probability plot indicate any problems with the
assumption that the response has an exponential distribution?

Now we’re ready to use R to obtain diagnostic plots for GLMs.



6.3 Using R to produce diagnostic plots for GLMs
In Notebook activity 7.8, we used R to carry out stepwise regression to
choose a binomial GLM with a logit link for the response examScore from
the OU students dataset. In that notebook activity, the model chosen
when using stepwise regression starting from the full model was slightly
different to the model chosen when starting from the null model. In
Notebook activity 7.10, we’ll obtain diagnostic plots to assess the GLM
assumptions for one of these proposed models.
In Notebook activity 7.9, we used stepwise regression to choose a Poisson
GLM with a log link for the response familySize using data from the
Philippines 40 to 80 dataset. The stepwise regression procedure chose the
same model when starting with the full model as it did when starting with
the null model. In the final notebook activity of this subsection, we’ll
obtain diagnostic plots to assess the model assumptions for the selected
GLM.

Notebook activity 7.10 Diagnostic plots for a GLM for examScore
This notebook will produce diagnostic plots to check the model
assumptions of a GLM for examScore.

Notebook activity 7.11 Diagnostic plots for a GLM for familySize
This notebook will produce diagnostic plots to check the model
assumptions of a GLM for familySize.

7 Common issues in practice


In this final section of the unit, we will consider two common issues which
can arise when using GLMs ‘in the real world’. The first issue, known as
overdispersion, is often encountered when modelling responses with
Poisson or binomial distributions. Overdispersion is the subject of
Subsection 7.1. The second issue concerns using count data to model
Poisson rates, rather than Poisson means; GLMs for Poisson rates will be
considered in Subsection 7.2. Finally, in Subsection 7.3 we shall use R to
model Poisson rates and accommodate overdispersion.


7.1 Overdispersion
When the variance of the observed data is larger than the model’s
variance, we say that there is overdispersion. This is a common problem
in practice when using Poisson GLMs or binomial GLMs. It is also
possible for the variance of the observed data to be smaller than the
model’s variance: in this case, there is underdispersion. However,
underdispersion is a lot less common than overdispersion in practice, so we
shall only focus on overdispersion here.
For a GLM for Poisson responses Y1 , Y2 , . . . , Yn , so that, for i = 1, 2, . . . , n,
    Yi ∼ Poisson(λi ),   λi > 0,
the model mean and variance for Yi are driven by the single parameter λi ,
since
    E(Yi ) = λi   and   V (Yi ) = λi .
(In the context of infectious diseases, overdispersion is the idea that one
infected person in a crowd could infect many.)
The model mean and variance are also driven by a single parameter in a
binomial GLM. In this case, for i = 1, 2, . . . , n and Ni known,
Yi ∼ B(Ni , pi ), 0 < pi < 1,
with
E(Yi ) = Ni pi and V (Yi ) = Ni pi (1 − pi ).
As a result, the model variances for Poisson and binomial GLMs are
constrained and don’t always allow for the amount of variability which can
occur in real datasets.
To correct for this, an extra dispersion parameter, φ > 0 say, can be
introduced into the model which scales the model’s variance to fit the
observed data better, so that, for a Poisson response
V (Yi ) = φλi
and for a binomial response
V (Yi ) = φNi pi (1 − pi ).
If there is overdispersion, then φ > 1 so that the model’s variance is
increased. If φ = 1, then the observed and model variances are the same,
while if φ < 1, then there is underdispersion and the model’s variance is
decreased.
Introducing the dispersion parameter φ does not affect the maximum
likelihood estimates of the model parameters, and the parameter estimates
are the same values whether or not φ is in the model. What φ does affect,
though, is the standard errors of the parameter estimates. If there is
overdispersion which isn’t accommodated in the model, then the standard
errors of the parameters will be underestimated; introducing a dispersion
parameter into the model corrects this.
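As a small illustration of this point, consider two fits of the same model in
R, one without and one with the dispersion parameter; one common way of
fitting the latter in R is via the quasipoisson family. This is only a sketch
(with hypothetical object names fit and fit2, from glm() with the poisson
and quasipoisson families, respectively).

# A sketch: same parameter estimates, different standard errors
coef(fit) - coef(fit2)    # identical estimates, so all differences are zero
summary(fit)$dispersion   # fixed at 1 for the poisson family
summary(fit2)$dispersion  # the estimated dispersion parameter phi
summary(fit2)             # standard errors rescaled by sqrt(phi)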
As you will discover in Subsection 7.3, it is very easy to use R to fit a
GLM with the extra dispersion parameter φ. But how do we know that
there is a problem with overdispersion so that the dispersion parameter is
required in the model?

If there is overdispersion, then the estimated model will provide a poor fit
to the data. We have already met a measure of model fit for GLMs – the
residual deviance, D. Therefore, when fitting Poisson or binomial GLMs, if
the model’s residual deviance is large (indicating a poor model fit), then
this could be an indication of overdispersion. However, a large residual
deviance could also be an indication that one or more important
explanatory variables are missing from the model. So, in order to assess
whether there might be a problem with overdispersion, a GLM is fitted
using all of the available explanatory variables, and the residual deviance
for this model is used to assess whether or not there could be a problem
with overdispersion.
Now, we know from Box 13 in Subsection 5.1 that, if a model is a good fit
to the data, then the model’s residual deviance D approximately follows a
χ2 (r) distribution, where
r = n − number of parameters in the proposed model.
The mean of this χ2 distribution is r. So, if there isn't a problem with
overdispersion, then we would expect
    D/r ≈ 1.
But if
    D/r > 1,
then this could be an indication of overdispersion. (Conversely, if D/r < 1,
then this could be an indication of underdispersion.)
Due to random variation, the value of D/r could well be slightly greater
than 1 without there being a problem with overdispersion, but we'd
certainly expect D/r to be less than 2 if overdispersion is not a problem.
How to detect overdispersion is summarised in Box 23.

Box 23 Detecting overdispersion

To detect possible overdispersion when using Poisson or binomial
GLMs:
• fit a GLM including all of the possible explanatory variables
• if overdispersion is not a problem, then
      D/r ≈ 1,
  and certainly
      D/r < 2,
  where D is the model's residual deviance and r is the associated
  degrees of freedom.
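For a GLM fitted in R, the quantities in Box 23 are easy to extract. The
following is a minimal sketch (not the module's own code), assuming a
fitted model object called fullModel which includes all of the available
explanatory variables.

# A sketch: the ratio D/r for a fitted GLM
D <- deviance(fullModel)     # residual deviance
r <- df.residual(fullModel)  # associated degrees of freedom
D / r                        # values much greater than 1 (certainly
                             # greater than 2) suggest overdispersion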


We’ll finish this subsection with two activities to give you some practice at
detecting possible overdispersion.

Activity 32 Overdispersion when modelling familySize?

Using data from the Philippines 40 to 80 dataset, a Poisson GLM with a
log link was fitted for the response familySize. This model included all of
the available factors and covariates as explanatory variables. The residual
deviance for this fitted model is D = 962.05, with associated degrees of
freedom r = 1144. Is there an indication of overdispersion when using this
GLM?

Activity 33 Overdispersion or not?

Three GLMs for different binomial responses were fitted, each using all of
the associated available explanatory variables. For each of the resulting
residual deviances D and associated degrees of freedom r given below,
could there be a problem with overdispersion?
(a) Model 1: D = 4.21, r = 3.
(b) Model 2: D = 634.81, r = 864.
(c) Model 3: D = 146.8, r = 26.

It is important to remember that, in addition to a possible problem with
overdispersion, poor model fit could also be due to important explanatory
variables missing from the model for which there are no data, or it could
be due to the model needing a different link function.

7.2 Modelling Poisson rates


When modelling count responses, Poisson rates, rather than Poisson
means, are often of interest so that the observed data are comparable
between units. For example, if counts are measured over varying lengths of
time – the first count could be the number of events in one day, say, while
the second count could be the number of events in three days, and so on –
then the rate per unit time will be of interest.
Suppose that we have count responses Y1 , Y2 , . . . , Yn with associated means
λ1 , λ2 , . . . , λn . A Poisson rate is specified in terms of units of exposure. For
example, for counts Y1 , Y2 , . . . , Yn measured over varying lengths of time
t1 , t2 , . . . , tn , the exposure associated with the count Yi is the time
length ti . The Poisson rate for the ith response per unit exposure, θi say,
is then
    θi = λi /ti .


Example 15 Rate of calls arriving at a call centre


Suppose that Yi is the number of calls received by a customer service
call centre each work day between 8 a.m., when the call centre opens,
and 8 p.m., when the call centre closes. Suppose further that the
associated mean of Yi is λi . In this case, the exposure is 12 hours, the
length of time over which the number of calls received are counted.
So, the rate of calls received in one hour, θi say, is given by
    θi = λi /12.

We want to model the rates θ1 , θ2 , . . . , θn . Now, each θi is a function of λi ,
and we know how to model each mean response E(Yi ) = λi using a Poisson
GLM with a log link. So, can we use what we know about modelling λi to
help us model θi ? Yes, indeed we can! Be warned that things are going to
look a bit mathematical for the next page or so. Don't worry if you can't
follow everything: the main ideas that you need to grasp are summarised
in Box 24 to follow.
So, let’s start by looking at our model for λi . Using a Poisson GLM with a
log link, we have the regression equation for λi given by
log(λi ) = ηi ,
where ηi is the linear predictor. Now
    θi = λi /ti
and so, if we take logs of both sides of this equation, we get
    log(θi ) = log(λi /ti ) = log(λi ) − log(ti ).
But we know that log(λi ) = ηi , and so
log(θi ) = ηi − log(ti ).
Therefore, ηi − log(ti ) can be thought of as a linear predictor for a model
for the Poisson rate θi .
But, since ti (the known exposure time) is a given value in the dataset,
log(ti ) is simply a known constant term. This term therefore doesn’t have
an unknown coefficient parameter and consequently doesn’t need to be
estimated. This extra term in θi ’s linear predictor is called the offset.


Now, if
    log(θi ) = ηi − log(ti ),
then, since log(ti ) is a known constant, the fitted rate θ̂i must satisfy
    log(θ̂i ) = η̂i − log(ti ),
where η̂i is the fitted linear predictor for λi (calculated using a Poisson
GLM with a log link). But, since
    log(λ̂i ) = η̂i ,
we then have that
    log(θ̂i ) = log(λ̂i ) − log(ti ) = log(λ̂i /ti ),
and so, taking exponentials of both sides, we get
    θ̂i = λ̂i /ti .
Modelling Poisson rates using a GLM is summarised in Box 24.

Box 24 GLM for a Poisson rate

Suppose that we have the count responses Y1 , Y2 , . . . , Yn observed
from (known) exposures t1 , t2 , . . . , tn , so that, for i = 1, 2, . . . , n,
    Yi ∼ Poisson(λi ),   λi > 0,
and the rate θi is given by
    θi = λi /ti .
Then, we can model θi as
    log(θi ) = ηi − log(ti ),
where ηi = log(λi ) is the linear predictor for a GLM for the Poisson
response Yi , and the (constant) term log(ti ) is called the offset.
The fitted rate parameter, θ̂i , is then calculated as
    θ̂i = λ̂i /ti ,
where λ̂i is the fitted mean response of Yi .

It is straightforward to include an offset in a Poisson GLM in R, as
demonstrated next.


7.3 Poisson rates and overdispersion in action
We’ll finish this unit by using R to model a dataset involving counts of
epileptic seizures over varying time lengths. As you will discover in
Notebook activity 7.12, we’ll require an offset to model these data, and
we’ll also need to tackle an overdispersion problem. The dataset is
described in Subsection 7.3.1 and modelled using R in Subsection 7.3.2.

7.3.1 A dataset involving counts over varying time lengths
The dataset to be considered in this subsection is described next.

Clinical trial for testing a drug to reduce epileptic seizures


The data in this dataset come from a clinical trial of a new drug for
reducing the number of seizures suffered by patients with epilepsy.
The experiment was carried out on 15 patients attending the
Westmead Hospital in Sydney, Australia.
The patients were monitored for two periods of time. During one of
these periods they took their usual drug, and during the other period
they took the new drug. Seven patients were given the new drug
during the period coded 0 and their usual drug in the period coded 1,
and the other eight were given the new drug in the period coded 1
and their usual drug in the period coded 0. There was a sufficient gap
between time periods for there to be no residual effect of the
treatment given in the first period carrying over to the second period.
Patients were randomly allocated to treatment orderings. (Such a
trial is called a crossover clinical trial.)
The epileptic seizures dataset (seizures)
This dataset contains data for 15 patients on the following variables:
• patient: the patient’s identification number i
• numSeizures: the number of epileptic seizures experienced by the
patient in the specified period when receiving the specified
treatment; this is the observed response yi
• exposure: the exposure time (in days), which is the length of the
period of observation
• treatment: an indicator of treatment given, coded 0 for the usual
drug and 1 for the new drug
• period: an indicator of the period during which the patient was
monitored, coded 0 for one of the two periods and 1 for the other.
This dataset is a little different to those that we’ve met so far in
M348. So, to help to understand the data a bit better, Table 9 shows
data for nine selected observations from the dataset.


Table 9 Nine selected observations from seizures

patient numSeizures exposure treatment period


3 152 56 1 0
8 10 56 1 0
6 10 42 0 0
7 81 56 0 0
11 1161 56 0 0
3 149 56 0 1
8 0 56 0 1
9 4 56 1 1
11 854 56 1 1

Source: unpublished data provided by A.D. Lunn, taken from
(discontinued) Open University module M346

The main question of interest here is whether the new drug is effective
in significantly reducing the number of seizures for these patients
relative to their usual drug.

There are a few things to note about the epileptic seizures dataset.
• Each of the 15 patients has two entries in the dataset. For example,
patient 3 in period 0 was observed to suffer 152 seizures in 56 days using
the new drug (as shown in the first row of Table 9), and the same
patient 3 was observed to suffer 149 seizures in 56 days using their usual
drug in period 1 (in a later row of Table 9).
• The exposure time for a treatment was 56 days for most patients, but a
few were observed over shorter time spans. For instance, in period 0,
patient 6 was observed to suffer 10 seizures in 42 days of exposure using
their usual drug.
• Patient 11 is the one with by far the largest number of seizures amongst
all 15 patients, with 1161 seizures during 56 days using their usual drug
in period 0 and 854 seizures during 56 days using the new drug in
period 1.
We’ll consider a potential model for the epileptic seizures dataset in the
next activity.

Activity 34 A model for numSeizures

What kind of model comes to mind as a good first model for modelling the
response variable numSeizures from the epileptic seizures dataset?

Note that patient is listed as a variable in the epileptic seizures dataset.


Usually, patient number is ignorable as a potential explanatory variable
because it just identifies individuals, but not so in this case. This is
because there are two observations per patient, and so the patient number
reflects part of the design of the study.

Our primary question of interest for these data is whether the new drug
reduces the number of epileptic seizures in comparison to a patient’s usual
drug. There is, however, a lot of variability in the numbers of seizures
amongst patients. This can be seen in Figure 32, which shows a scatterplot
of log(numSeizures + 0.5) against patient. (A plot of numSeizures
against patient is dominated by the unusually large values for patient 11.
Plotting log(numSeizures + 0.5) against patient makes the patterns
easier to see: the ‘0.5’ has been added to avoid the problem of zero
numSeizures.) So, given that there is so much variability in the response
across patients, patient should be included as an explanatory variable in
a GLM for numSeizures, although the effect of this explanatory variable is
not of major interest per se. We shall do this by treating patient as a
factor with 15 levels (a level for each patient).

Figure 32 Scatterplot of log(numSeizures + 0.5) against patient, with
points distinguished by treatment (usual drug or new drug)

In trying to fit the model mentioned in Activity 34, we immediately come
up against the difficulty of trying to take proper account of the
information (in the variable exposure) on differing exposure times. Rather
than simply fitting a Poisson response distribution where, for the ith
response, Yi ∼ Poisson(λi ), we can take exposure properly into account by
including an offset in the GLM for a Poisson rate, as summarised in
Box 24 in Subsection 7.2. We shall use R to do this next.


7.3.2 Poisson rates and overdispersion in R


In the final notebook activity of this unit, we will model the response
numSeizures from the epileptic seizures dataset using a Poisson GLM with
an offset. You will also see that overdispersion is an issue which we will
tackle in the notebook activity.
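To give a flavour of what's involved, the commands are along the following
lines. This is only a sketch of the approach, under the assumption that the
data frame is called seizures with the variables listed in Subsection 7.3.1;
the notebook gives the module's own code.

# A sketch: Poisson GLM with an offset for the exposure time
fit <- glm(numSeizures ~ factor(patient) + treatment + period,
           family = poisson(link = "log"),
           offset = log(exposure), data = seizures)
deviance(fit) / df.residual(fit)  # check the ratio D/r for overdispersion

# If D/r is well above 2, refit allowing an extra dispersion parameter:
# the parameter estimates are unchanged, but the standard errors are
# rescaled to reflect the overdispersion.
fit2 <- glm(numSeizures ~ factor(patient) + treatment + period,
            family = quasipoisson(link = "log"),
            offset = log(exposure), data = seizures)
summary(fit2)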

Notebook activity 7.12 Using a GLM with an offset and tackling
overdispersion in R
In this notebook, we’ll use R to fit a GLM to data from the epileptic
seizures dataset using an offset and tackle a problem with
overdispersion.

Summary
This unit formally introduced generalised linear models, commonly referred
to as GLMs, for modelling response variables from a variety of different
distributions.
Both linear regression (which assumes a normal distribution for the
response) and logistic regression (which assumes a Bernoulli distribution
for a binary response) are particular types of generalised linear model. In
this unit, we also used GLMs for modelling responses with Poisson,
exponential and binomial distributions, although these are not the only
distributions that can be assumed for the response in a GLM.
The relationship between the response Y and a set of explanatory variables
follows a GLM if:
• Y1 , Y2 , . . . , Yn all have the same distribution, but each Yi has a different
mean
• for i = 1, 2, . . . , n, the regression equation has the form
g(E(Yi )) = ηi ,
where g is a link function and ηi is the linear predictor, a linear
combination of the explanatory variables.
The link function, g, therefore provides the link between the mean
response, E(Yi ), and the linear predictor, ηi . We’d like g to transform
E(Yi ) so that:
• there’s a linear relationship between g(E(Yi )) and the explanatory
variables
• g(E(Yi )) can take any real value (to match the possible values that ηi
can take).


Canonical link functions provide simplifications for the theory and analysis
of GLMs, but non-canonical link functions can be, and are, used in
practice. The link functions used in this unit are summarised in Table 10.
(Remember that a binomial GLM models the success probability pi rather
than E(Yi ) directly.)
Table 10 The link functions used in M348

Response      Link function g                        Link name   Canonical?
Normal        g(E(Yi )) = E(Yi )                     identity    yes
Bernoulli     g(E(Yi )) = log(E(Yi )/(1 − E(Yi )))   logit       yes
Poisson       g(E(Yi )) = log(E(Yi ))                log         yes
Exponential   g(E(Yi )) = log(E(Yi ))                log         no
Binomial      g(pi ) = log(pi /(1 − pi ))            logit       yes

The inverse link function, g −1 , links the linear predictor, ηi , back to the
mean response E(Yi ), so that
E(Yi ) = g −1 (ηi ).
The inverse link function for the link functions used in this unit are
summarised in Table 11.
Table 11 The link functions and associated inverse link functions used in M348

Response      Link function g                        Inverse link function g −1
Normal        g(E(Yi )) = E(Yi )                     g −1 (ηi ) = ηi
Bernoulli     g(E(Yi )) = log(E(Yi )/(1 − E(Yi )))   g −1 (ηi ) = exp(ηi )/(1 + exp(ηi ))
Poisson       g(E(Yi )) = log(E(Yi ))                g −1 (ηi ) = exp(ηi )
Exponential   g(E(Yi )) = log(E(Yi ))                g −1 (ηi ) = exp(ηi )
Binomial      g(pi ) = log(pi /(1 − pi ))            g −1 (ηi ) = exp(ηi )/(1 + exp(ηi ))


Inverse link functions are important for calculating µ̂i , the fitted mean
response for Yi , and µ̂0 , the predicted mean response for a new response Y0 .
Then
    µ̂i = g −1 (η̂i )   and   µ̂0 = g −1 (η̂0 ),
where η̂i and η̂0 are the fitted linear predictors for Yi and Y0 , respectively.
The fit of the proposed GLM can be assessed using the residual
deviance D, where
D = 2 × (l(saturated model) − l(proposed model)).
If the proposed model is a good fit, then
D ≈ χ2 (r),
where
r = n − number of parameters in the proposed model.
A useful ‘rule of thumb’ is that
• If D ≤ r, then the model is likely to be a good fit to the data.
The fits of two nested GLMs M1 and M2 , with M1 nested within M2 , can
be compared using the deviance difference, where
deviance difference = D(M1 ) − D(M2 ).
If both M1 and M2 are a good fit, then
deviance difference ≈ χ2 (d),
where
d = difference in the degrees of freedom for D(M1 ) and D(M2 ).
The AIC can be used to compare non-nested models: we choose the model
with the smallest AIC.
In order to check the model assumptions for GLMs, diagnostic plots focus
on deviance residuals – these are analogous to the standard residuals used
in linear regression. If the GLM assumptions are reasonable, then the
standardised deviance residuals are approximately distributed as N (0, 1).
Overdispersion is a common problem for Poisson and binomial GLMs in
practice. This occurs when the observed response variance is larger than
the model’s variance. Overdispersion can be corrected by introducing an
extra dispersion parameter (φ > 0) to scale the model’s variance. To
detect possible overdispersion:
• fit a GLM including all of the possible explanatory variables
• overdispersion could be a problem if
      D/r > 2,
  where D is the GLM's residual deviance and r is the associated degrees
  of freedom.


Another problem that can arise for Poisson GLMs is when the Poisson
responses Y1 , Y2 , . . . , Yn are observed counts over varying lengths of time
t1 , t2 , . . . , tn . In this case, the Poisson rate
    θi = λi /ti
is of interest; log(ti ) is called the offset. The fitted rate θ̂i is calculated as
    θ̂i = λ̂i /ti ,
where λ̂i is the fitted mean response of Yi .
A reminder of what has been studied in Unit 7 and how the sections link
together is shown in the following route map.

The Unit 7 route map

Section 1: Setting the scene
Section 2: Building a model
Section 3: The generalised linear model (GLM)
Section 4: GLMs for two more response variable distributions
Section 5: Assessing model fit and choosing a GLM
Section 6: Checking the GLM model assumptions
Section 7: Common issues in practice


Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate the different types of non-normal response variables which
can be of interest
• understand the roles of the link function and the inverse link function in
a GLM
• obtain fitted and predicted mean responses for given values of the
explanatory variable(s)
• assess the fit of a GLM
• compare the fits of two GLMs, both in the case of nested GLMs and
non-nested GLMs
• identify potential problems with the model assumptions for a GLM
• appreciate what overdispersion is, how we can detect it, and how we can
correct for it
• appreciate how we can model Poisson rates instead of Poisson response
means
• fit Poisson, exponential and binomial GLMs in R
• use R to predict mean responses
• use R to assess the fit of GLMs
• compare the fits of GLMs in R
• use stepwise regression for GLMs in R
• use R to produce diagnostic plots for GLMs
• use R to correct for overdispersion
• use R to model Poisson rates.

References
Feigl, P. and Zelen, M. (1965) ‘Estimation of exponential survival
probabilities with concomitant information’, Biometrics, 21(4),
pp. 826–838, doi:10.2307/2528247.
Flores, F.P. (2017) ‘Filipino family income and expenditure’. Available at:
https://www.kaggle.com/datasets/grosvenpaul/family-income-and-expenditure
(Accessed: 26 June 2022).


Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, crashed cars: © stockbroker / www.123rf.com
Figure 1: © Iryna Volina / Alamy Stock Photo
Subsection 1.2, Filipino family: Public domain
Subsection 3.1, students passing exams: © stylephotographs /
www.123rf.com
Subsection 3.4, laptop being opened: © dragoscondrea / www.123rf.com
Subsection 4.1, survival rates: © yasemin / www.123rf.com
Subsection 4.2, exploding party popper: © slavadumchev /
www.123rf.com
Subsection 5.1, child in oversized clothes: © Ferreira / www.123rf.com
Subsection 5.2, choosing clothes: © maridav / www.123rf.com
Subsection 7.1, crowd: © nd3000 / www.123rf.com
Subsection 7.2, call centre: © fizkes / www.123rf.com
Subsection 7.3.1, an electroencephalogram (EEG): © Phanie /
Alamy Stock Photo
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
(a) The waiting time between serious earthquakes is a non-negative
continuous response variable since waiting times cannot take negative
values. Linear regression includes negative values as possible
outcomes of the response, and so may not be appropriate to model
this type of response variable.
(b) The number of traffic accidents per year at a particular road junction
is a count response variable that can only take non-negative integer
values. Linear regression would include negative values as possible
outcomes of the response and would also include decimal values. Both
negative values and decimals are not possible for count (integer) data,
and so linear regression may not be appropriate.
(c) The number of insects surviving out of N insects exposed to an
insecticide is a non-negative integer that can only take values between
0 and N . Linear regression would include negative values, decimal
values, and also values greater than N as possible outcomes of the
response, which are all not possible for this particular response. As
such, linear regression may not be appropriate for modelling this
response.

Solution to Activity 2
(a) The points in the normal probability plot given in Figure 2 deviate
systematically from the line at both ends of the plot, suggesting that the
residuals may not follow a normal distribution. The linear regression
normality assumption is therefore questionable for these data.
(b) Figure 3 highlights several issues which make it unlikely that the
assumption of a normal distribution for the response will be ideal for
these data.
Firstly, the response familySize only takes integer values, whereas
the normal distribution is continuous. Also, the distribution for
familySize looks skew, whereas the normal distribution is symmetric.
What’s more, the observed family size values are non-negative low
counts with most frequent values occurring in the range from 0 to 7,
as can be seen in the bar chart in Figure 3. As a result, if a normal
distribution for the response is assumed, then the possible values of
the response would include negative values, giving non-zero
probabilities of occurrence for negative values of the response.


Solution to Activity 3
A Poisson distribution seems to be better than a normal distribution as
the distribution for the response familySize. In the side-by-side bar chart
in Figure 5, the observed relative frequencies match fairly closely to those
expected when assuming a Poisson distribution. The Poisson distribution
is also discrete, to match the discrete data. In addition, the probability of
a negative value for familySize is zero under the Poisson distribution, but
is non-zero under the normal distribution.

Solution to Activity 4
Similarities:
• Both equations have a linear function of the explanatory variable
(α + βxi ) on the right-hand side of the equation.
• Both equations involve E(Yi ) on the left-hand side of the equation.
Difference:
• The equation for logistic regression has a function of E(Yi ) on the
left-hand side, whereas linear regression just has E(Yi ).

Solution to Activity 5
In linear regression
E(Yi ) = α + βxi .
So in order for
g(E(Yi )) = α + βxi ,
the function g must be the identity function, so that
g(E(Yi )) = E(Yi ).

Solution to Activity 6
There seem to be very different linear relationships between the sample
mean of familySize and age for age < 40, 40 ≤ age < 80 and age ≥ 80.
In general, the sample mean of familySize seems to increase from age 18
to approximately age 40, after which it starts to decrease towards lower
mean values for ages up to approximately age 80, after which the decrease
seems even steeper.

Solution to Activity 7
The relationship between the logs of the sample means of familySize and
age seems fairly linear in Figure 10, and so it does seem reasonable to
assume a linear relationship between log(E(Yi )) and xi .


Solution to Activity 8
The regression equation for this logistic regression model for the ith
student has the form
 
pi
log = α + β1 xi1 + β2 xi2 ,
1 − pi
where pi is the success probability of the ith student, and xi1 and xi2 are
their values of x1 and x2 , respectively.
Here
 
pi
log = g(E(Yi ))
1 − pi
and so the linear predictor for this model is
ηi = α + β1 xi1 + β2 xi2 .

Solution to Activity 9
Substituting the values of the estimates into the linear predictor, the fitted
linear predictor for this model is
    η̂ = −4.45 + 0.09 x1 − 0.01 x2
or equivalently
    η̂ = −4.45 + 0.09 bestPrevModScore − 0.01 age.

Solution to Activity 10
Figure 10 shows a scatterplot of the logs of the sample means of
familySize against age.
The sample means are estimates of E(Yi ) for the different values of age.
Therefore, the scatterplot is actually plotting the estimated values
of log(E(Yi )) against the different values that age can take.
But for a Poisson GLM with a log link,
log(E(Yi )) = g(E(Yi )).
Therefore, Figure 10 shows a scatterplot of the estimated values of
g(E(Yi )) and xi for these data, and since the scatterplot shows a fairly
linear relationship, it does indeed seem reasonable to assume a linear
relationship between g(E(Yi )) and xi for these data and this model.


Solution to Activity 11
(a) The response Yi can only take non-negative integer values 0, 1, 2, . . . .
(b) If Yi ∼ Poisson(λi ), then E(Yi ) = λi . So, since λi > 0, then E(Yi ) > 0
and the mean response can only take positive real values.
(c) For the log link,
g(E(Yi )) = log(E(Yi )).
But since E(Yi ) = λi if Yi ∼ Poisson(λi ), this means that
g(E(Yi )) = log(λi ).
Therefore, g(E(Yi )) can take any value between −∞ and +∞. This is
illustrated in the plot of log(λi ) against λi in Figure S3, where λi is
restricted to positive values, but log(λi ) can take any real value.

Figure S3 Plot of log(λi ) against λi


Solution to Activity 12
(a) The canonical link function for a GLM with a normal response is the
identity link function, so that
g(E(Yi )) = E(Yi ).
So,
E(Yi ) = ηi
and the inverse link function is therefore
g −1 (ηi ) = ηi .
That is, the inverse link function g −1 is also the identity function.
(b) The canonical link function for a Poisson GLM is the log link
g(E(Yi )) = log(E(Yi )).
So,
log(E(Yi )) = ηi
and therefore, taking exponentials of both sides, we have
E(Yi ) = exp(ηi ).
So the inverse link function is
g −1 (ηi ) = exp(ηi ).

Solution to Activity 13
From Example 7, the fitted linear predictor for the model is
    η̂ = 1.97 − 0.01 age.
So, the fitted linear predictor for the second household, η̂2 , is
    η̂2 = 1.97 − (0.01 × 72) = 1.25.
We need to use the inverse link function to calculate the fitted mean
response of familySize for the second household, so that
    µ̂2 = g −1 (η̂2 ) = exp(η̂2 ) = exp(1.25) ≈ 3.49.


Solution to Activity 14
Given η̂, the fitted linear predictor for the first student, η̂1 , is
    η̂1 = −4.45 + (0.09 × 89.2) − (0.01 × 32) = 3.258.
From Table 4, the inverse link function for the logit link is
    g −1 (ηi ) = exp(ηi )/(1 + exp(ηi )).
So
    µ̂1 = g −1 (η̂1 ) = exp(η̂1 )/(1 + exp(η̂1 ))
        = exp(3.258)/(1 + exp(3.258)) ≈ 0.963.
Now, for the logistic regression model, µ̂1 is the fitted probability that the
first student passes the module. So, since µ̂1 ≈ 0.963, the fitted probability
of passing for this student is close to 1 – in other words, our fitted model
estimates that they are almost certain to pass the module.
Notice that in this activity, η̂1 = 3.258, which lies outside the possible
values that E(Y1 ) can take. However, the inverse link function transforms
η̂1 to a value in the correct range of possible values for µ̂1 .

Solution to Activity 15
The fitted linear predictor for this student, η̂0 , is
    η̂0 = −4.45 + (0.09 × 74.2) − (0.01 × 49) = 1.738.
From Table 4, the inverse link function for the logit link is
    g −1 (ηi ) = exp(ηi )/(1 + exp(ηi )).
So, the predicted mean response for this student, µ̂0 , is
    µ̂0 = g −1 (η̂0 ) = exp(1.738)/(1 + exp(1.738)) ≈ 0.85.


Solution to Activity 16
An exponential distribution does look like it might be promising as the
assumed distribution for survivalTime. The fitted exponential
distribution curve (in Figure 18) generally follows the shape of the
histogram, and the probability of a negative value for survivalTime is
zero under the exponential distribution.

Solution to Activity 17
Since each response Yi has an exponential distribution, each E(Yi ) must be
positive. As a result, the canonical link function
    g(E(Yi )) = −1/E(Yi )
can only take negative values.
But this means that the canonical link function doesn’t satisfy the second
property that we’d like the link function g to have, since g(E(Yi )) can’t
take any value between −∞ and +∞.
As such, the canonical link function for an exponential GLM may not be
an ideal link function to use.

Solution to Activity 18
As we saw in Activity 17, one of the problems with using the negative
reciprocal link for an exponential GLM is that the values of
g(E(Yi )) = −1/E(Yi ) must be negative, since E(Yi ) must be positive.
This is not the case when using the log link, since log(E(Yi )) can take any
value between −∞ and +∞ for positive E(Yi ). As such, the log link may
be more sensible than the canonical link when we’re assuming an
exponential distribution for the response.

Solution to Activity 19
Although there aren’t many observations in this dataset, Figure 19
suggests that there does seem to be a roughly linear relationship between
log(survivalTime) and logWbc for each of the levels of ag (or at least, the
plot doesn’t indicate that the linearity assumption is unreasonable!).
Therefore, it does seem reasonable to assume a linear relationship between
g(E(Yi )) and the explanatory variables.


Solution to Activity 20
(a) In the table, there is a parameter estimate for level ‘pos’ of ag, but
not one for level ‘neg’. Therefore, level ‘neg’ has been set to be level 1
of ag.
(b) The first patient takes level ‘pos’ for ag and has a value of 7.7407 for
logWbc. So, the fitted linear predictor for this patient is
    η̂1 = 5.8154 + (−0.3044 × 7.7407) + 1.0176 ≈ 4.4767.
(c) From Box 8, the fitted mean response for the first patient is
    µ̂1 = g −1 (η̂1 ),
where g −1 is the inverse link function.
Now, for the log link, we have that
    g(E(Yi )) = log(E(Yi )) = ηi .
So, taking exponentials of each side, we have that
    E(Yi ) = exp(ηi ),
so that the inverse link function is
    g −1 (ηi ) = exp(ηi ).
Therefore, the fitted mean response for the first patient is
    µ̂1 = g −1 (η̂1 ) = exp(η̂1 ) ≈ exp(4.4767) ≈ 87.94.
(d) The 18th patient takes level ‘neg’ for ag and has a value of 8.3894 for
logWbc. So, the fitted linear predictor for this patient is
    η̂18 = 5.8154 + (−0.3044 × 8.3894) ≈ 3.2617.
Therefore, the fitted mean response for the 18th patient is
    µ̂18 = g −1 (η̂18 ) = exp(η̂18 ) ≈ exp(3.2617) ≈ 26.09.


Solution to Activity 21
(a) The first student in the dataset has a value of 89.2 for
bestPrevModScore and 32 for age. So, using the parameter estimates
given, the fitted linear predictor for this student is
    η̂1 = −3.3387 + (0.0472 × 89.2) + (−0.0040 × 32) = 0.74354.
(b) In this GLM, we are using the logit link so that
    log(pi /(1 − pi )) = ηi .
Then, using the inverse link function for the logit link, the value of p̂1
is calculated as
    p̂1 = g −1 (η̂1 ) = exp(η̂1 )/(1 + exp(η̂1 ))
        = exp(0.74354)/(1 + exp(0.74354)) ≈ 0.6778.
(c) The fitted mean response for the first student is calculated using the
equation
    µ̂1 = N1 × p̂1 ,
where N1 is the number of ‘trials’ for the first student. In our scenario
here, N1 is 100, the number of exam questions. Therefore, the fitted
mean response for the first student is
    µ̂1 ≈ 100 × 0.6778 = 67.78,
that is, µ̂1 is 68, rounded to the nearest integer.
(d) The new student has a value of 79.2 for bestPrevModScore and 64 for
age. So, η̂0 , the fitted linear predictor for this student, is
    η̂0 = −3.3387 + (0.0472 × 79.2) + (−0.0040 × 64) = 0.14354.
The predicted success probability, p̂0 , for this student is therefore
    p̂0 = g −1 (η̂0 ) = exp(0.14354)/(1 + exp(0.14354)) ≈ 0.5358.
The predicted mean response for this student is then
    µ̂0 ≈ 100 × 0.5358 = 53.58,
that is, µ̂0 is 54, rounded to the nearest integer.


Solution to Activity 22
The p-value is 0.099, which is quite large. This suggests that the value of
D is not large enough to suggest that the model is a poor fit. We therefore
conclude that the model seems to be an adequate fit to the data.

Solution to Activity 23
(a) Model M1 is nested within M2 , so
deviance difference = D(M1 ) − D(M2 )
= 47.808 − 40.319 = 7.489.

(b) Model M2 has one more parameter than M1 (the regression coefficient
for level pos of the factor ag), and so this deviance difference is
approximately distributed as χ2 (1). We can also calculate the degrees
of freedom as the difference between the deviance degrees of freedom
for M1 and M2 , namely
31 − 30 = 1.

(c) The p-value associated with the deviance difference is small


(p = 0.009). Therefore, there is evidence to suggest that there is a
significant gain in fit by including ag in the model, in addition to
logWbc. So, we should choose model M2 over model M1 .

Solution to Activity 24
(a) Model M1 is nested within M2 , so
deviance difference = D(M1 ) − D(M2 )
= 1795.8 − 1731.8 = 64.

(b) Model M2 has one more parameter than M1 (the regression coefficient
for the covariate age), and so this deviance difference is
approximately distributed as χ2 (1). We can also calculate the degrees
of freedom as the difference between the deviance degrees of freedom
for M1 and M2 , namely
1158 − 1157 = 1.

(c) Since the p-value is so small (close to 0), there is evidence to suggest
that there is a significant gain in fit by using our proposed model in
comparison to the null model – in other words, there is a significant
gain in fit when age is included in the model. It therefore looks like
age is useful for modelling familySize.


Solution to Activity 25
The preferred model is the one with the smallest AIC, so M2 is the
preferred model.

Solution to Activity 26
If Yi ∼ Poisson(λi ), then
E(Yi ) = λi and V (Yi ) = λi ,
that is, the variance is the same as the mean. So, if the mean of the
response Yi changes, then the variance will also change and be equal to the
mean.

Solution to Activity 27
If Yi ∼ M (λi ), then
    E(Yi ) = 1/λi   and   V (Yi ) = 1/λ2i .
So,
V (Yi ) = (E(Yi ))2 .
This means that as E(Yi ) changes, so does V (Yi ).

Solution to Activity 28
There is some slight curvature in the smoothed red line in the plot in
Figure 25. However, this could be a result of the fact that the dataset is
small. Therefore, the plot doesn’t raise any alarm bells to indicate that the
linearity assumption for the model seems unreasonable.

Solution to Activity 29
The points in the plot seem to be fairly randomly scattered across the
index, and so the plot doesn’t indicate that there are any problems with
the independence assumption.

Solution to Activity 30
Two of the negative standardised deviance residuals are much larger than
the others, which, in itself, doesn’t indicate any problems with the
independence assumption. However, the fact that both of these unusually
large points are right next to each other in index order and they’re also
both negative residuals suggests that perhaps there might possibly be an
issue with independence. Given that the data points related to different
patients, it is likely that they are independent, but it would be worth
checking how the data were collected.


Solution to Activity 31
Although the points deviate from the line slightly at either end of the
normal probability plot, the points in the plot are generally quite close to
the diagonal line, and so the assumption of normality of the deviance
residuals seems reasonable. This in turn, means that the assumption of an
exponential distribution for the response also seems reasonable.

Solution to Activity 32
Here
    D/r = 962.05/1144 ≈ 0.84.
So, since 0.84 < 1 < 2, overdispersion is not a problem when using this
GLM.

Solution to Activity 33
(a) For Model 1
    D/r = 4.21/3 ≈ 1.4.
Since 1 < 1.4 < 2, there could be some overdispersion but probably
not enough to be a problem.
(b) For Model 2
    D/r = 634.81/864 ≈ 0.73.
Since 0.73 < 1 < 2, there certainly doesn’t seem to be a problem with
overdispersion.
(c) For Model 3
    D/r = 146.8/26 ≈ 5.65.
Since 5.65 > 2, there could well be a problem with overdispersion.

Solution to Activity 34
A Poisson distribution would be naturally considered for the response
numSeizures, since these are counts. So, a first good model to try for
numSeizures is a Poisson GLM with a log link, with treatment and
period as explanatory variables.

Unit 8
Log-linear models for contingency tables

Introduction
In Unit 7, we learned about generalised linear models – that is, GLMs –
and used them to model data with various assumed distributions for the
response. This unit continues that thread, but involves a different
format of data from those we have been working with so far.
In this unit, we shall concentrate on data which are in the form of
contingency tables. Contingency tables are tables of counts showing how
often within a given sample each combination of the different values of
various categorical random variables occurs. One of the questions often of
interest for contingency table data is whether there are any relationships
between the categorical variables represented in the contingency table. In
this unit, we’ll introduce a GLM, known as the log-linear model, for
modelling the contingency table data in order to learn about these
relationships.

How Unit 8 relates to the module so far

Moving on from . . .
• Factors and interactions in regression (Unit 4)
• Regression for when the response variable has a Poisson or binomial
  distribution (Unit 7)

What's next?
• Models of counts classified according to two or more variables

In Section 1, we'll introduce the type of data we'll be focusing on in this
unit together with the associated modelling problem, and we'll outline the
modelling strategy that we'll use.
In Section 2, the proposed model for contingency table data – the
log-linear model – is introduced for the simplest situation in which the
contingency table being modelled is based on data for just two categorical
variables. Section 3 then considers how this log-linear model can be used
to test whether or not the two classifying variables are independent.
In Section 4, we move onto using log-linear models for contingency tables
with more than two variables categorising the data. Then, in Section 5,
we’ll discuss what the fitted log-linear model can tell us about the
relationships between the classifying variables.


We’ll finish the unit in Section 6 by considering how a logistic regression


model can be used instead of a log-linear model for modelling some
contingency tables. We’ll discuss how this can be done, and the
relationship between log-linear and logistic regression models.
The following route map illustrates how the sections fit together for this
unit.

The Unit 8 route map

Section 1: The modelling problem
Section 2: Introducing log-linear models for two-way contingency tables
Section 3: Are the classifying variables in a two-way table independent?
Section 4: Contingency tables with more than two variables
Section 5: How are the classifying variables related?
Section 6: Logistic and log-linear models

Note that you will need to switch between the written unit and your
computer for Subsections 3.3, 4.5 and 5.2.


1 The modelling problem


In this section, we’ll explore the modelling problem of interest in this unit.
We’ll start in Subsection 1.1 by illustrating the type of data that we’ll be
focusing on, and we’ll introduce some of the associated questions that are
of interest to us. Then, in Subsection 1.2, we’ll outline the modelling
strategy that we’ll use for these data.

1.1 Contingency table data


To illustrate the type of data being modelled in this unit, we’ll consider
some data from a UK survey on living costs. The dataset is described next.

Living Costs and Food Survey


Surveys on household expenditure in the UK have been conducted
regularly since 1957. These surveys have gone by various names over
the years, but became the Living Costs and Food Survey in 2008. The
surveys collect information for individuals (over 16) in a UK
household about income and expenditure, together with personal
information such as age, gender, marital status and employment
status. The survey is carried out by the Office for National Statistics
in Great Britain and by the Northern Ireland Statistics and Research
Agency in Northern Ireland.
The data collected from the survey are used for various purposes, such
as providing information on household spending patterns for the
Retail Prices Index, studying how government taxes and benefits
affect households, and monitoring food purchasing trends.
The UK survey dataset (ukSurvey)
The data that we’ll be using in this unit are a sample of 5144
households taken from the 2013 Living Costs and Food Survey. The
UK survey dataset contains data on four of the many variables for
which data were collected, as follows:
• employment: the type of employment of the household reference
person (HRP), taking the values full-time, part-time, unemployed
and inactive (that is, retired or not working and not looking for
work)
• gender: the gender that the HRP identifies with, taking the values
male and female
• incomeSource: the main source of household income, taking the
two possible values earned and other
• tenure: the type of tenancy for the household’s accommodation,
taking the three possible values public rented, private rented and
owned.


The data for the first five observations from the UK survey dataset
are shown in Table 1. From this table of data, we see that for each
individual completing the 2013 Living Costs and Food Survey, the
observed category was recorded for each of the four categorical
variables; for example, the first individual in the dataset recorded
inactive for employment, female for gender, earned for incomeSource
and public rented for tenure. (Note that the HRPs in the survey may
not be the main source of household income, so it is possible for
incomeSource to take the value earned even if the HRP is unemployed
or inactive – as is the case for the first individual in the dataset.)
Table 1 First five observations from ukSurvey

employment gender incomeSource tenure


inactive female earned public rented
full-time female earned owned
full-time female earned owned
full-time male earned owned
full-time male earned owned

Source: Cathie Marsh Institute for Social Research, 2019, accessed
9 September 2022

The data in the UK survey dataset record which categories each individual
takes for the four categorical variables employment, gender, incomeSource
and tenure. In this unit, we are not so interested in these individual data
values, but rather we are interested in modelling the counts of individuals
in the dataset who take combinations of the different values of the
categorical variables. For example, we’re interested in modelling the counts
of individuals (out of the 5144 who completed the survey) who took the
values female for gender and earned for incomeSource, or took the values
female for gender and other for incomeSource, and so on. These counts
can be represented in a contingency table.
An example of a contingency table showing data from the UK survey
dataset is given in Table 2. This table shows the numbers of individuals
classified according to the two categorical variables gender and
incomeSource. For example, of the 5144 individuals in the dataset, the
household income source was earned and the gender of the HRP was male
for 1894 individuals.
Table 2 The UK survey dataset classified by gender and incomeSource

incomeSource
gender earned other Total
female 947 1041 1988
male 1894 1262 3156
Total 2841 2303 5144
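As an aside, a contingency table such as Table 2 is easily produced in R
from the raw data. The following is a minimal sketch (not the module's
own code), assuming the data frame is called ukSurvey as above.

# A sketch: cross-tabulate the raw survey data
tab <- table(ukSurvey$gender, ukSurvey$incomeSource)
tab              # the cell counts, as in Table 2
addmargins(tab)  # adds the row and column totals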


Since the data in Table 2 are categorised in terms of two variables, it is
referred to as a two-way contingency table.
A question of interest regarding the data in Table 2 may be whether or not
the HRP’s gender is independent of their household income source. In
previous study, you may have investigated such a question using a
hypothesis test. In this module, we shall instead use a GLM to answer the
same question. Using a GLM allows us to continue the module’s theme of
taking a modelling, rather than a testing, approach to answer a question of
interest. What’s more, a GLM can be easily extended to answer questions
regarding more complicated contingency tables involving more than two
categorical variables.
An example of such a table (a three-way contingency table) can be seen in
Table 3. Table 3 also shows data from the UK survey dataset, but this
time the households are classified according to the three variables
employment, gender and incomeSource.
Table 3 The UK survey dataset classified by employment, gender and
incomeSource

                           incomeSource
                    earned              other
                    gender              gender
employment      female    male     female    male
full-time          626    1688         31      95
part-time          235     112        123      66
unemployed          18      16         72      58
inactive            68      78        815    1043
Total              947    1894       1041    1262

There are many questions that we might want to answer for the three-way
contingency table given in Table 3. Not only might we be interested in the
question of whether gender and incomeSource are independent (as we
were for Table 2), but we might also be interested in whether gender and
employment are independent of each other, or whether employment and
incomeSource are independent of each other.
Furthermore, we might want to consider whether the relationship between
any pair of variables (independent or not) differs according to the other
remaining variable. For example, we might be interested in whether the
relationship between employment and gender differs according to whether
incomeSource takes the value earned or other. We’ll consider what else
might be of interest in Activity 1.

Activity 1 What else might be of interest?


With reference to Table 3, can you think of any other questions regarding
the categorical variables which might be of interest?


Investigating these relationships is relatively straightforward using a GLM
– and indeed we shall investigate such questions regarding Table 3 later in
the unit.

1.2 A modelling strategy


We saw in Subsection 1.1 that we’re considering contingency table data in
this unit, and we would like to learn about the relationships between the
categorical variables represented in the contingency table.
When building a linear model or GLM, one of the variables is clearly
labelled as a response variable. However, in a contingency table, the
variables categorising the data are on an equal footing, and it’s not obvious
why we might label one as a response variable. So, in order to use a GLM
modelling framework, we treat the counts (or frequencies of occurrence) in
the cells of the contingency table as values of the response variable, rather
than using either of the variables defining the rows and columns. A GLM is then set up, with the counts as the response variable and the row and column variables as factor explanatory variables.
This idea is illustrated in Activity 2 and summarised in Box 1.

Activity 2 Modelling strategy for two-way contingency table
Consider once again the contingency table for the UK survey dataset
shown in Table 2. Using the proposed modelling strategy, what could we
use for the response variable and the explanatory variables in a GLM for
these data?

Box 1 Modelling strategy for contingency table data


In order to model contingency table data, we’ll use a GLM with:
• the counts in the contingency table as values of the response
• the categorical variables presented in the contingency table as
factor explanatory variables.
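As a quick illustration of this strategy in R (a minimal sketch, using the counts from Table 2; the data frame name survey_counts is our own choice, not one from the module's notebooks):

    # Each row is one cell of the contingency table: the count is the
    # response, and the classifying variables are factors
    survey_counts <- data.frame(
      gender       = factor(c("female", "female", "male", "male")),
      incomeSource = factor(c("earned", "other", "earned", "other")),
      count        = c(947, 1041, 1894, 1262)
    )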

To keep things simple, for now we’ll restrict our attention to modelling
two-way contingency tables only, so that the data are categorised in terms
of just two categorical variables. Modelling contingency table data for
more than two categorical variables will build on these models; we’ll
consider these more complicated contingency tables later in the unit.


2 Introducing log-linear models for two-way contingency tables
In this section, we will build a model for two-way contingency tables. As
has already been mentioned, modelling contingency table data can help us
to investigate the relationships between the variables categorising the data.
In a two-way contingency table, we are often primarily interested in
whether or not the two variables, A and B say, are independent. So, in
this section, we’ll build a model for the situation where A and B are
independent.
We’ll start by looking at the response variable for two-way contingency
table data in a little more detail in Subsection 2.1, before introducing the
log-linear model for independent variables in Subsection 2.2. To finish the
section, in Subsection 2.3 we’ll take a look at the fitted log-linear model to
demonstrate that the fitted model works as we would like it to work.

2.1 The response variable for two-way contingency table data
Let’s start by defining the general notation that we’ll use for two-way
contingency tables, as follows.
• A represents the row variable and B represents the column variable.
• K represents the number of levels (that is, categories) for variable A,
and L represents the number of levels for variable B (so, K and L are
the number of rows and columns, respectively, in the contingency table).
• Ykl is the variable representing the count in the kth row and lth column
of the contingency table, for k = 1, 2, . . . , K, l = 1, 2, . . . , L.
• Yk+ represents the total for row k. (The ‘+’ represents the index that
we’re summing over – in this case, we’re summing over the columns
indexed by l.)
• Y+l represents the total for column l.
• Y++ represents the total of all the counts in the table. (So, we are
summing over both the rows and the columns.)


Table 4 shows this notation in a two-way contingency table.


Table 4 Notation used in a two-way contingency table

Levels of      Levels of variable B
variable A     1     2     ···   l     ···   L     Total
1              Y11   Y12   ···   Y1l   ···   Y1L   Y1+
2              Y21   Y22   ···   Y2l   ···   Y2L   Y2+
⋮              ⋮     ⋮           ⋮           ⋮     ⋮
k              Yk1   Yk2   ···   Ykl   ···   YkL   Yk+
⋮              ⋮     ⋮           ⋮           ⋮     ⋮
K              YK1   YK2   ···   YKl   ···   YKL   YK+
Total          Y+1   Y+2   ···   Y+l   ···   Y+L   Y++
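In R, the totals Yk+, Y+l and Y++ can be read straight off a table of counts using addmargins(); here is a minimal sketch using the Table 2 counts (the object name tab is our own choice):

    # A 2 x 2 table of counts, gender in rows and incomeSource in columns
    tab <- as.table(matrix(c(947, 1041,
                             1894, 1262),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(gender = c("female", "male"),
                                           incomeSource = c("earned", "other"))))
    addmargins(tab)  # appends the row, column and overall totals as 'Sum'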

The next activity considers this notation in the context of a two-way


contingency table representing data from the UK survey dataset.

Activity 3 Using the notation

Table 5 shows the contingency table of counts of households from the UK


survey dataset classified by employment and gender.
Table 5 The UK survey dataset classified by employment and gender

gender
employment female male Total
full-time 657 1783 2440
part-time 358 178 536
unemployed 90 74 164
inactive 883 1121 2004
Total 1988 3156 5144

With reference to this contingency table, answer the following questions.


(a) What are the values of K and L?
(b) What is the observed value y32 ? Which employment type and gender
does this count relate to?
(c) What are the values of y+2 , y4+ and y++ ?

From Box 1, in our modelling strategy, we are taking the counts in a


contingency table as values of our response variable. As such, using the
notation shown in Table 4, as our responses we’ll take the variables
Ykl , for k = 1, 2, . . . , K, l = 1, 2, . . . , L.


Notice that our response variables are indexed by two subscripts k and l,
representing, respectively, the kth row and lth column in the contingency
table. This is, of course, different to the notation that we’ve used so far for
responses; previously we’ve had the responses Y1 , Y2 , . . . , Yn , where n is the
number of individual observations in the dataset. So, previously we’ve had
a response variable to represent each of the n observations, whereas here
we are no longer considering responses relating to individual observations,
but instead we’re considering responses representing the counts in the
contingency table after the individual observations have been categorised.
Activity 4 considers these responses further.

Activity 4 How many responses?

Consider a contingency table with the classifying variables A with K levels


and B with L levels. If we use a GLM to model this contingency table,
how many values of the response will we have in our model?

Box 2 gives a summary of the responses used for modelling two-way


contingency table data.

Box 2 Responses for modelling contingency table data


For a contingency table with K and L levels for categorical variables
A and B, respectively, the response variables representing the counts
in the contingency table are denoted
Ykl , for k = 1, 2, . . . , K, l = 1, 2, . . . , L,
and there are K × L responses for the data.

We’ll finish this subsection with an activity identifying the responses for
modelling a two-way contingency table for the UK survey dataset.

Activity 5 Responses for a two-way contingency table

Consider once again the contingency table given in Table 5 (in Activity 3).
Using the notation summarised in Box 2, write down the responses for
modelling the counts in this contingency table.

Now that we have a modelling strategy (as summarised in Box 1) and we


have specified what our responses are (as summarised in Box 2), we are
ready to build a model for two-way contingency table data.


2.2 The log-linear model for independent variables
The data for contingency tables can be collected in a number of different
ways. The size of the sample of individual observations, n, is often fixed in
advance, so that the total count Y++ is fixed to be n. This puts constraints
on the cell counts (because they must then all add up to n). Additionally,
the number of individual observations taking each level of A or B may be
fixed in advance, so that the row totals Y1+ , Y2+ , . . . , YK+ , or the column
totals Y+1 , Y+2 , . . . , Y+L , are fixed, imposing further constraints on the cell
counts. Alternatively, none of the totals may be fixed, so that the
contingency table simply records the counts for each cell in a certain time period with no constraints. Although each of these different methods of collecting the data leads to a different assumed distribution for the responses Ykl, k = 1, 2, . . . , K, l = 1, 2, . . . , L, it turns out that the resulting data can all be modelled by the same model.
We’ll start by building a model assuming that the data have been collected
so that the size of the sample of individual observations, n, has been fixed
in advance – that is, Y++ has been fixed to be n. We’re also making the
assumption in this subsection that A and B are independent.
For the individuals in our sample, suppose that
• individuals are independent of each other
• the probability of falling into a given cell in a contingency table is the
same for each individual.
Then, let
pkl = P (individual takes level k of A and level l of B),
pk+ = P (individual takes level k of A),
p+l = P (individual takes level l of B).
So, pkl is the joint probability associated with the (k, l)-cell of the
contingency table, while pk+ is the marginal probability associated with
row k of the contingency table and p+l is the marginal probability
associated with column l of the contingency table. These probabilities are
shown in Table 6. Note that
∑_{k=1}^{K} ∑_{l=1}^{L} pkl = ∑_{k=1}^{K} pk+ = ∑_{l=1}^{L} p+l = 1.


Table 6 Cell probabilities associated with a two-way contingency table

Levels of      Levels of variable B
variable A     1     2     ···   l     ···   L     Total
1              p11   p12   ···   p1l   ···   p1L   p1+
2              p21   p22   ···   p2l   ···   p2L   p2+
⋮              ⋮     ⋮           ⋮           ⋮     ⋮
k              pk1   pk2   ···   pkl   ···   pkL   pk+
⋮              ⋮     ⋮           ⋮           ⋮     ⋮
K              pK1   pK2   ···   pKl   ···   pKL   pK+
Total          p+1   p+2   ···   p+l   ···   p+L   1

Since we’re building a GLM for each response Ykl , we’re interested in
E(Ykl ), the expected cell count for level k of A and level l of B. This is
directly related to the cell probability pkl , and, since we’re assuming that
the total number of observations is fixed to be n, is given by
E(Ykl ) = n × pkl . (1)
But, if A and B are independent, then
P (A = k and B = l) = P (A = k) × P (B = l),
which means that the cell probability pkl is the product of the marginal
probabilities pk+ and p+l – that is,
pkl = pk+ × p+l .
So, if A and B are independent, then Equation (1) becomes
E(Ykl ) = n × pk+ × p+l . (2)
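As a quick numerical illustration of Equation (2), suppose we use the observed marginal proportions from Table 2 as stand-ins for p1+ and p+1. Then the expected count of households with a female HRP and earned income source would be
E(Y11) = 5144 × (1988/5144) × (2841/5144) = (1988 × 2841)/5144 ≈ 1097.96.
(We'll meet this value again as a fitted value in Subsection 2.3.)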
Now, for a GLM for Ykl , we need to have a regression equation of the form
g(E(Ykl )) = ηkl ,
where ηkl is a linear function. Although the right-hand side of
Equation (2) isn’t linear, if we take logs of both sides, then the right-hand
side will be linear. In this case, Equation (2) becomes
log(E(Ykl)) = log(n) + log(pk+) + log(p+l)
            = (constant term) + (term associated with kth level of A)
              + (term associated with lth level of B).   (3)
So, could we use this as a regression equation for modelling the counts?
Well, we could in theory, but the model wouldn’t be easy to work with
because the responses would not be independent since they need to add up
to the fixed total n.


Equation (3) does, however, look similar to the regression equation of


another model which we’ve seen before and which is easier to work with!
The left-hand side looks like a log link function, while the right-hand side
looks like the form of a linear predictor when we have two factors as the
explanatory variables. In particular, Equation (3) looks similar to the form
of a regression equation for a Poisson GLM with a (canonical) log link for
the model
Y ∼ A + B,
since the regression equation for this model has the general form
log(E(Ykl)) = (baseline mean) + (effect of kth level of A) + (effect of lth level of B).
As usual, in this model, the baseline mean assumes that A and B are both
level 1, and the effect of the kth level of A is the effect in comparison to
the effect of level 1 of A, while the effect of the lth level of B is the effect
in comparison to the effect of level 1 of B.
So, a Poisson GLM – which is straightforward to use – may be a way
forward to model the counts in a two-way contingency table when the two
classifying variables A and B are independent. There is, however, a
problem! We’ll look at this problem in the next activity.

Activity 6 A problem with the Poisson GLM

When specifying the regression equation presented in Equation (3), we


used the assumption that the total sample size Y++ is fixed in advance to
be n. If we use a Poisson GLM to model the cell counts, then the responses
Ykl , k = 1, 2, . . . , K, l = 1, 2, . . . , L, are assumed to be independent of each
other, and each Ykl is assumed to have a Poisson distribution.
Explain why fixing the total sample size Y++ in advance to be n causes a
problem with the Poisson GLM’s assumption that the responses have
independent Poisson distributions.

As we’ve just seen in Activity 6, there is a problem with the Poisson GLM
model assumptions if the sample size Y++ is fixed to be n in advance.
However, all is not lost! It turns out that if we fit a Poisson GLM to data
from a contingency table (with the canonical log link), assuming
independent Poisson responses, then maximum likelihood estimation
ensures that the fitted values for the cell counts add up to the actual total
count that was observed in the first place! As a result, it makes no
difference whether or not we fix Y++ to be n in advance, and we can
simply assume that the cell counts are independent Poisson responses so that everything can fit nicely into the standard (and easy-to-use!) Poisson GLM framework.


What’s more, even though the assumed distributions for the response are
also different when the row totals Y1+ , Y2+ , . . . , YK+ are fixed, or when the
column totals Y+1 , Y+2 , . . . , Y+L are fixed, again it turns out that
maximum likelihood estimation for a Poisson GLM also gives exactly the
same fitted values as these constrained models, and that the fitted values
of the cell counts add up to the actual row and column totals that were
observed. Therefore, again we can assume that the cell counts are
independent Poisson responses, so that a Poisson GLM can also be used
for contingency tables where the row and column totals are fixed.
When using a Poisson GLM to model counts in a contingency table, the
model is usually referred to as a log-linear model. As you’ve probably
guessed, the ‘log’ part of this title refers to the fact that we’re using a log
link function, and the ‘linear’ part refers to the fact that the multiplicative
relationship between the marginal probabilities becomes linear in the
model.
The log-linear model for the counts in a two-way contingency table when A
and B are independent is summarised in Box 3.

Box 3 Log-linear model for two-way contingency table data when A and B are independent
Consider a two-way contingency table categorised by variables A with
K levels and B with L levels. The cell counts in the table are the
values of the response Y .
Then, if A and B are independent, we can model Y as
Y ∼ A + B
using a Poisson GLM with the canonical log link, so that
Ykl ∼ Poisson(λkl ), for λkl > 0, k = 1, 2, . . . , K, l = 1, 2, . . . , L,
and
log(E(Ykl )) = ηkl ,
where ηkl is the linear predictor of the form
ηkl = (baseline mean) + (effect of kth level of A) + (effect of lth level of B).
The baseline mean assumes that A and B are both level 1, and the
effect of the kth level of A is the effect in comparison to the effect of
level 1 of A, while the effect of the lth level of B is the effect in
comparison to the effect of level 1 of B.
In this context, the model is usually referred to as a log-linear model.
A log-linear model can be used for modelling contingency table data
regardless of whether or not the total number of observations, the row
totals, or the column totals, were fixed and known in advance.
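In R, this model can be fitted with glm() (a minimal sketch, reusing the hypothetical survey_counts data frame from Section 1; fitting log-linear models in R is covered properly in Notebook activity 8.3):

    # Poisson GLM with the canonical log link for the independence model
    m1 <- glm(count ~ gender + incomeSource,
              family = poisson(link = "log"),
              data = survey_counts)
    summary(m1)  # parameter estimates and residual deviance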


Let’s take a closer look at the log-linear model in the simplest situation in
which we have a two-way contingency table as shown in Table 7, where the
two classifying variables each has just two levels.
Table 7 General form of a contingency table with observed counts for two
variables A and B, each with two levels

Levels of Levels of variable B


variable A 1 2 Total
1 y11 y12 y1+
2 y21 y22 y2+
Total y+1 y+2 y++

We saw an example of a contingency table with this structure in Table 2,


where the data in the UK survey dataset were classified according to the
two variables gender (factor A) and incomeSource (factor B), both of
which have two levels.
We’ll model the four responses Y11 , Y12 , Y21 and Y22 assuming that A and
B are independent using a log-linear model as defined in Box 3, so that
Ykl ∼ Poisson(λkl ), for λkl > 0, k = 1, 2, l = 1, 2,
and
log(E(Ykl )) = ηkl . (4)

Following the usual convention of representing factors by indicator


variables, we can write the linear predictor in the form
ηkl = µ + αA zA + αB zB , (5)
where
µ = baseline mean count for level 1 of both A and B,

zA = 1 for counts where A is level 2, and 0 otherwise,
zB = 1 for counts where B is level 2, and 0 otherwise.

So, substituting the linear predictor given in Equation (5) into


Equation (4), we have that
log(E(Ykl )) = µ + αA zA + αB zB .
The expected count for the (k, l)-cell is therefore
E(Ykl ) = exp(µ + αA zA + αB zB ). (6)
We’ll look at the expected cell counts more closely in Activity 7.


Activity 7 Expressions for the expected cell counts

Show that the expected cell counts for the contingency table in Table 7 can
be written using the expressions given in Table 8.
Table 8 Expressions for the expected cell counts E(Ykl ) for the contingency
table in Table 7

Levels of Levels of variable B


variable A 1 2
1 exp(µ) exp(µ + αB )
2 exp(µ + αA ) exp(µ + αA + αB )

The previous activity considered expressions for the expected cell counts
for a contingency table where each of the classifying variables has just two
levels. Notice that µ, the baseline mean, is in all of the expressions for the
expected cell counts in Table 8. On the other hand, αA , the level 2 effect
parameter for A, only appears in expressions for the expected cell counts
in the second row in Table 8; that is, the row associated with level 2 of A.
Likewise, αB , the level 2 effect parameter for B, only appears in
expressions for the expected cell counts in the second column in Table 8;
that is, the column associated with level 2 of B.
This idea extends naturally to the general case where A has K levels and
B has L levels: µ is in all K × L expressions for the expected cell counts,
the level k effect parameter for A only appears in the expressions for the
expected cell counts for row k of the table, while the level l effect
parameter for B only appears in the expressions for the expected cell
counts for column l of the table.
We’ll finish this subsection by using the expressions for the expected cell
counts from Activity 7 to calculate the fitted expected cell counts for
contingency table data from the UK survey dataset.

Activity 8 Fitted expected cell counts

A two-way contingency table classifying the UK survey dataset according


to the two factors gender and incomeSource was given in Table 2; this
table is repeated here in Table 9 for convenience.
Table 9 Repeat of Table 2

incomeSource
gender earned other Total
female 947 1041 1988
male 1894 1262 3156
Total 2841 2303 5144


Let the response variable be count, representing the cell counts in the
contingency table. The model
count ∼ gender + incomeSource
was fitted to the data using a log-linear model taking female to be level 1
of gender and earned to be level 1 of incomeSource. The resulting output
from fitting the model is given in Table 10.
Table 10 Parameter estimates for the fitted log-linear model
count ∼ gender + incomeSource

Parameter Estimate
Intercept 7.001
gender male 0.462
incomeSource other −0.210

Complete Table 11 by calculating the fitted expected cell counts for this
model.
Table 11 Fitted expected cell counts for the log-linear model
count ∼ gender + incomeSource

incomeSource
gender earned other

female

male

2.3 Does the fitted log-linear model work?


When introducing the log-linear model
Y ∼ A + B
in Subsection 2.2, we claimed that A and B are treated as being independent in this model. We also claimed that, although the log-linear model assumes that the cell counts are independent Poisson responses, the fitted expected cell counts for the log-linear model add up to the actual row, column and overall totals (and we used this fact to justify using the easy-to-use log-linear model, regardless of whether or not the row, column or overall totals in the contingency table were fixed in advance). In this subsection, we’ll use a fitted log-linear model to check that these claims are indeed true!
We’ll start in Activity 9 by considering whether the fitted expected cell
counts add up to the actual row, column and overall totals.


Activity 9 Do the fitted expected cell counts add up?

Activity 8 considered the log-linear model


count ∼ gender + incomeSource
fitted to data from the UK survey dataset.
The fitted expected cell counts calculated using R are shown in the
contingency table in Table 12. (Note that these fitted values will differ
slightly from the same fitted values that we calculated in Activity 8. This
is simply due to the fact that in Activity 8 we calculated the fitted values
using the rounded parameter estimates given in Table 10.)
Table 12 Fitted expected cell counts for the log-linear model
count ∼ gender + incomeSource

incomeSource
gender earned other Total
female 1097.96 890.04
male 1743.04 1412.96
Total

Complete Table 12 by calculating the row, column and overall totals of the
fitted values, and confirm that these totals are the same as the totals
displayed in Table 2 (and repeated in Table 9 in Activity 8).

In Activity 9, we confirmed our earlier claim that, although the log-linear model assumes independent Poisson responses, the fitted log-linear model gives the same row, column and overall totals as those observed in the sample.
Next we’ll consider our claim from Subsection 2.2 that the classifying variables are modelled as being independent in the log-linear model
Y ∼ A + B.
Now, if our log-linear model is modelling the classifying variables as being
independent, then we’d expect
p̂kl = p̂k+ × p̂+l,   (7)
where p̂kl is the fitted value of the joint cell probability pkl, and p̂k+ and p̂+l are the estimated values of the marginal probabilities pk+ and p+l, respectively.
So, we can conclude that the log-linear model Y ∼ A + B is modelling A
and B as being independent, if we can show that Equation (7) holds for
the fitted model.


In order to do this, we first need the values of p̂kl, p̂k+ and p̂+l for our
fitted model. We can use the expected cell counts to estimate these, as
demonstrated in Example 1.

Example 1 Calculating fitted cell probabilities


Consider once again the log-linear model
count ∼ gender + incomeSource
fitted to data from the UK survey dataset.
Using the fitted expected cell counts from Table 12 in Activity 9, the
joint probability p12 is estimated as
p̂12 = 890.04/5144 ≈ 0.173
and the marginal probabilities p1+ and p+2 are estimated to be
p̂1+ = (1097.96 + 890.04)/5144 ≈ 0.386,
p̂+2 = (890.04 + 1412.96)/5144 ≈ 0.448.

In Activity 10, we’ll estimate some more joint and marginal probabilities,
and we’ll use these estimates to investigate whether Equation (7) is
satisfied for our fitted model.

Activity 10 Is the joint probability the product of the marginals?
Following on from Activity 9, the fitted expected cell counts, together with
their row, column and overall totals are given in Table 13 for the fitted
log-linear model
count ∼ gender + incomeSource.

Table 13 Fitted expected cell counts, together with their row, column and
overall totals, for the fitted log-linear model count ∼ gender + incomeSource

incomeSource
gender earned other Total
female 1097.96 890.04 1988.00
male 1743.04 1412.96 3156.00
Total 2841.00 2303.00 5144.00


(a) In Example 1, we saw that p̂12 ≈ 0.173, p̂1+ ≈ 0.386 and p̂+2 ≈ 0.448.
Confirm that, to three decimal places,
p̂12 = p̂1+ × p̂+2.
(b) (i) Calculate the fitted joint cell probability p̂21.
(ii) Calculate the two fitted marginal probabilities p̂2+ and p̂+1.
(iii) Hence show that, to three decimal places,
p̂21 = p̂2+ × p̂+1.

In Activity 10, we showed that


p̂12 = p̂1+ × p̂+2 and p̂21 = p̂2+ × p̂+1
for the fitted log-linear model
count ∼ gender + incomeSource.
In fact, it can be shown that the equation
p̂kl = p̂k+ × p̂+l
holds for all k = 1, 2, . . . , K, l = 1, 2, . . . , L for any two-way contingency
table data modelled by a log-linear model. As such, the log-linear model
does indeed model the two classifying variables as being independent.
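This can also be checked numerically in R (a sketch based on the fitted counts in Table 13; the tolerance allows for those counts being rounded to two decimal places):

    fitted_counts <- matrix(c(1097.96,  890.04,
                              1743.04, 1412.96), nrow = 2, byrow = TRUE)
    p_hat <- fitted_counts / sum(fitted_counts)        # fitted joint probabilities
    prod_marg <- outer(rowSums(p_hat), colSums(p_hat)) # products of fitted marginals
    all.equal(p_hat, prod_marg, tolerance = 1e-4)      # TRUE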
So, in this subsection we’ve seen that the log-linear model
Y ∼ A + B
does fulfil the claims made about the model in Subsection 2.2 – namely:
• the model gives fitted values whose row, column and overall totals are the same as those in the dataset
• the model treats A and B as being independent.
But how do we know whether the classifying variables are actually
independent? We shall investigate this question next in Section 3.


3 Are the classifying variables in a two-way table independent?
In this section, we’ll discuss how we can decide whether or not the
classifying variables are independent.
We’ll start in Subsection 3.1 by introducing a plot to visualise two-way
contingency table data which can help to assess informally whether or not
the classifying variables seem to be independent. We’ll then look at a more
formal method in Subsection 3.2. To round off the section, we’ll use R for
two-way contingency table data in Subsection 3.3.

3.1 Visualising two-way contingency tables


In order to visualise two-way contingency table data, we can use a plot
known as a mosaic plot. A mosaic plot is a square which is subdivided
into rectangles, where each rectangle represents a cell in the contingency
table. The rectangles are arranged horizontally according to the levels of
one of the variables in the contingency table, and arranged vertically
according to the levels of the other variable in the contingency table. This
basic structure is illustrated in Example 2.

Example 2 The basic structure of mosaic plots


Consider once again the contingency table classifying the UK survey
dataset according to gender and incomeSource. This contingency
table was given in Table 2 and repeated in Table 9, but to save you
searching for it, it’s repeated (yet again!) in Table 14.
Table 14 Repeat of Table 2

incomeSource
gender earned other Total
female 947 1041 1988
male 1894 1262 3156
Total 2841 2303 5144

The mosaic plot for this contingency table, taking gender as the
horizontal variable and incomeSource as the vertical variable, is
shown in Figure 1. Notice that there are four rectangles, one to
represent each cell in Table 14. The rectangles in the first column are
associated with the first level of the horizontal variable gender (that
is, female), and the rectangles in the second column are associated
with the second level of gender (that is, male).


For the vertical variable incomeSource, the rectangles in the first row
are associated with the first level of incomeSource (that is, earned),
while the rectangles in the second row are associated with the second
level (that is, other).
Figure 1 Mosaic plot representing the contingency table data given in Table 14, taking gender as the horizontal variable and incomeSource as the vertical variable

Alternatively, we could take incomeSource to be the horizontal


variable and gender to be the vertical variable. The mosaic plot in
this case is shown in Figure 2.

Figure 2 Mosaic plot representing the contingency table data given in Table 14, taking incomeSource as the horizontal variable and gender as the vertical variable


For both of the mosaic plots given in Figures 1 and 2 in Example 2, the
horizontal width of each rectangle represents the proportion of observations
taking the associated level for the horizontal variable. The vertical length
of each rectangle within each column represents the proportion of
observations taking the associated level for the vertical variable,
conditional on the observation taking the associated level of the horizontal
variable. So, if the two variables are independent of one another, we’d
expect the vertical lengths of the rectangles to be roughly the same across
the horizontal variable. This is illustrated in Example 3.

Example 3 Mosaic plot when the variables are independent
Figure 3 shows mosaic plots which again classify the data according to
gender and incomeSource. This time however, rather than
representing the observed data, the mosaic plots show the fitted cell
probabilities p̂kl, k = 1, 2, l = 1, 2, for the log-linear model
count ∼ gender + incomeSource.
Since we know that this model treats gender and incomeSource as
being independent, these mosaic plots will show us what the plots
should look like if gender and incomeSource are independent.


Figure 3 Mosaic plots representing the fitted cell probabilities for the model count ∼ gender + incomeSource with (a) gender as the horizontal variable, and (b) incomeSource as the horizontal variable

Notice that when the variables are independent, the rectangles across
rows have the same vertical length. For example, in Figure 3(a) the
vertical lengths of the rectangles for earned are the same for both
female and male, as are the vertical lengths of the rectangles for other.


Using this idea, a mosaic plot can help to informally assess whether or not
the two variables are likely to be independent. What’s more, mosaic plots
also provide us with a visualisation of the proportions observed for each
level within each variable.
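Producing such a plot in R is straightforward (a minimal sketch, again using the hypothetical survey_counts data frame; Notebook activity 8.1 shows this in full):

    # Build the contingency table from the counts, then draw the mosaic plot
    tab <- xtabs(count ~ gender + incomeSource, data = survey_counts)
    mosaicplot(tab, main = "gender and incomeSource")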
Interpreting a mosaic plot is illustrated in Example 4.

Example 4 Interpreting a mosaic plot when classifying by gender and incomeSource
Consider the mosaic plot given in Figure 1. In this plot, the horizontal
widths of the rectangles in the first and second columns – that is, the
rectangles associated with the two levels of gender – represent,
respectively, the proportions of the total sample for which gender
takes the levels female and male. So, since the widths of the rectangles
representing female are smaller than the widths of the rectangles
representing male, we can see that the proportion of individuals in the
sample who are female is smaller than the proportion who are male.
The vertical lengths of the rectangles in the first column represent the
proportions of individuals taking levels earned and other for
incomeSource from those with female for gender, while the vertical
lengths of the rectangles in the second column represent the
proportions of individuals taking levels earned and other for
incomeSource from those with male for gender.
The vertical lengths of the rectangles for earned are not the same in
the two gender columns, and it looks like the probability that
incomeSource takes level earned is not the same for the two levels of
gender. In other words, there might be a relationship between gender
and incomeSource. However, it’s possible that the differences are
simply what we would expect with random variation, and so these two
variables could be independent.

In Activity 11, we’ll interpret the other mosaic plot from Example 2.

Activity 11 Interpreting another mosaic plot when classifying by gender and incomeSource
Consider the mosaic plot given in Figure 2 in Example 2, which has
incomeSource as the horizontal variable and gender as the vertical
variable. What does this mosaic plot tell us about incomeSource and
gender?

We’ll round off this subsection with an activity looking at the mosaic plots
for another contingency table taken from the UK survey dataset.


Activity 12 Mosaic plots when classifying by incomeSource and employment
Table 15 shows a contingency table classifying the UK survey dataset
according to the two variables incomeSource and employment.
Table 15 The UK survey dataset classified by incomeSource and employment

incomeSource
employment earned other
full-time 2314 126
part-time 347 189
unemployed 34 130
inactive 146 1858

Mosaic plots representing this contingency table are given in Figure 4.


Figure 4 Mosaic plots representing Table 15 with (a) incomeSource as the horizontal variable, and (b) employment as the horizontal variable

Does it look like the two variables incomeSource and employment are
independent?

The vertical lengths of the rectangles in the mosaic plots given in Figure 4
in Activity 12 seem to be very different across the horizontal variable.
From these plots, it certainly looks like incomeSource and employment are
not independent. However, things are not so clear for the mosaic plots in
Figures 1 and 2 in Example 2. Although the vertical lengths of the
rectangles in these plots are not the same across the horizontal variable,
the differences may not be large enough to rule out the independence
assumption. We therefore need a formal method to help us to decide
whether or not the two variables are independent. We shall introduce such
a method in the next subsection.

Before we leave mosaic plots, it is worth mentioning that mosaic plots can
also be used to visualise contingency tables with more than two classifying
variables. However, when there are more than two variables, mosaic plots
can be difficult to interpret, and therefore are not particularly helpful. As
a result, we shall only consider mosaic plots for two-way tables in this
module.

3.2 Testing for independence


In order to decide whether or not the variables are independent, we’ll
compare the fit of the log-linear model which assumes independence (as
given in Box 3) with the fit of the log-linear model which doesn’t assume
independence. So, first we need to know what the form of the log-linear
model is when the classifying variables are not independent.
We’ve seen that the log-linear model
Y ∼ A + B
can model the cell counts when A and B are independent. In this case, the
linear predictor ηkl, k = 1, 2, . . . , K, l = 1, 2, . . . , L, has the general form
ηkl = (baseline mean) + (effect of kth level of A) + (effect of lth level of B).   (8)
Now, if A and B are not independent, then Equation (8) will no longer be
a reasonable model. In this case, we need to add an extra term into the
linear predictor which captures how each ηkl differs from the independence
model. In particular, we can add an interaction term into the model so
that
ηkl = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
      + (interaction effect of kth level of A and lth level of B).


We have, of course, used models which include interaction terms for two
factors before: they were first introduced in Unit 4. Following the same
convention as we’ve used before, the interaction effect of the kth level of A
and the lth level of B is the added effect of the interaction between A and
B. As a result, if either A or B take level 1, then the associated interaction
term is simply zero (since the individual effect terms assume that the other
variable takes level 1, and so any interaction between level 1 of either
variable is already accounted for).
Following our usual notation, we’ll denote the log-linear model which
includes an interaction by
Y ∼ A + B + A:B
or, equivalently,
Y ∼ A ∗ B,
where A:B represents the interaction between A and B.
So, we have two models for the cell counts – M1 and M2 , say – where
• M1 is the log-linear model when A and B are independent given by
Y ∼ A + B

• M2 is the log-linear model when A and B are not independent given by


Y ∼ A + B + A:B.

Then testing whether A and B are independent is equivalent to testing


whether the interaction term A:B is required in the model, which we can
test by comparing the fits of models M1 and M2 . Then:
• if there is a significant increase in fit when we include the interaction
A:B, then we should choose M2 , and conclude therefore that there is
evidence to suggest that A and B are not independent
• if there is not a significant increase in fit when we include the interaction
A:B, then we should choose M1 for parsimony, and conclude therefore
that there is evidence to suggest that A and B are independent.


This strategy for testing whether A and B are independent is summarised


in Figure 5.

Figure 5 Summary of strategy for testing whether A and B are independent: starting from a contingency table with classifying variables A and B, the question ‘Are A and B independent?’ (or, equivalently, ‘Is the interaction A:B required?’) is answered by comparing the fits of Y ∼ A + B and Y ∼ A + B + A:B, choosing the model with the interaction, and concluding that A and B are not independent, only if including A:B gives a significant increase in fit
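In R, this comparison of nested models can be carried out with anova() (a sketch, using the hypothetical survey_counts data frame from earlier):

    m1 <- glm(count ~ gender + incomeSource, family = poisson, data = survey_counts)
    m2 <- glm(count ~ gender * incomeSource, family = poisson, data = survey_counts)
    anova(m1, m2, test = "Chisq")  # deviance difference D(M1) - D(M2)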

So, in order to test whether A and B are independent, we want to compare


the model fits of M1 and M2 . Now, models M1 and M2 are nested, where
M1 is nested within M2 . We therefore need to compare the model fits of
two nested GLMs, and we know (from Subsection 5.2 in Unit 7) how to do
that using the deviance difference given by
deviance difference = D(M1 ) − D(M2 ), (9)
where D(M1 ) and D(M2 ) denote the residual deviances for models M1 and
M2 , respectively.
It turns out, however, that for these particular models, the deviance
difference takes a simpler form. To see this, we’ll first need to take a closer
look at model M2 ; we’ll do this in Activity 13.


Activity 13 A closer look at the model Y ∼ A + B + A:B


Consider the log-linear model M2 given by
Y ∼ A + B + A:B,
where the factors A and B have K and L levels, respectively, and the
linear predictor has the form
ηkl = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
      + (interaction effect of kth level of A and lth level of B).
(a) For this model, how many parameters are associated with each of the
following in the linear predictor?
(i) The ‘baseline mean’ term.
(ii) The ‘effect of kth level of A’ term.
(iii) The ‘effect of lth level of B’ term.
(iv) The ‘interaction effect of kth level of A and lth level of B’ term.
(b) Hence, how many parameters altogether does model M2 have?
(c) Explain why the log-linear model M2 is therefore the saturated model.
(Saturated models were first introduced in Subsection 5.1 in Unit 6.)

In Activity 13, we saw that the log-linear model


Y ∼ A + B + A:B
is the saturated model. We’ll use this fact in Activity 14 to revisit the
formula for the deviance difference given in Equation (9).

Activity 14 Deviance difference revisited

(a) The residual deviance D for a proposed model (from Subsection 5.2 of
Unit 6) is given by
D = 2 × (l(saturated model) − l(proposed model)).
Show that, for the log-linear model M2 given by
Y ∼ A + B + A:B,
the residual deviance, D(M2 ), is zero.
(b) Hence, find an expression for the deviance difference for comparing
the fits of the log-linear model M1 , given by
Y ∼ A + B,
and the log-linear model M2 .


The results from Activity 14 mean that we can test whether the
interaction A:B should be included in the model, and therefore whether A
and B are independent, using D(M1 ), the residual deviance of the model
without the interaction A:B. And we already know from Subsection 5.1 in
Unit 7 how to use the residual deviance to assess the fit of a GLM!
From Box 13 in Unit 7, if a proposed GLM is a good fit, then D, the
residual deviance for the proposed model, satisfies
D ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model.
We’ll use this result next in Activity 15 to find the distribution for D(M1 )
when M1 is a good fit.

Activity 15 Distribution for D(M1 )

As usual, suppose that the classifying variables A and B have K and L


levels, respectively.
Show that, if the log-linear model M1 given by
Y ∼ A + B
is a good fit, then
D(M1 ) ≈ χ2 ((K − 1)(L − 1)).

We are now in a position to be able to test whether the classifying


variables A and B in a two-way contingency table are independent; the
method is summarised in Box 4.

Box 4 Testing whether the classifying variables are independent in a two-way contingency table
Consider a two-way contingency table categorised by variables A with
K levels and B with L levels. The cell counts in the table are the
values of the response Y .
We wish to test the hypotheses
H0 : A and B are independent,
H1 : A and B are not independent.


Then:
• fit the model M1 given by
Y ∼ A + B,
which assumes that A and B are independent
• obtain the residual deviance for this model, D(M1 )
• if M1 is a good fit, then
D(M1 ) ≈ χ2 ((K − 1)(L − 1))

• assess the value of D(M1 ) and complete the test as illustrated in


Figure 6.

Figure 6 Illustration of how we can use D(M1) to assess whether A and B are independent (here, we’ve taken (K − 1)(L − 1) to be 8): if D(M1) is not large, the p-value is not small, so M1 is an adequate fit and we conclude that A and B are independent; if D(M1) is large, the p-value is small, so M1 is not an adequate fit and we conclude that A and B are not independent

As we saw in both Units 6 and 7, there’s a ‘rule of thumb’ for informally


assessing the residual deviance. This ‘rule of thumb’ in the context of
testing the independence of the classifying variables in a two-way
contingency table is summarised in Box 5.


Box 5 ‘Rule of thumb’ for D(M1 )


Suppose that the model M1 given by
Y ∼ A + B
has residual deviance D(M1); if M1 is a good fit, then D(M1) ≈ χ2((K − 1)(L − 1)).
We then have the following ‘rule of thumb’.
• If D(M1 ) ≤ (K − 1)(L − 1), then model M1 is likely to be a good fit
to the data, and so we conclude that A and B are independent.
• If D(M1 ) is ‘much larger’ than (K − 1)(L − 1), then M1 is likely to
be a poor fit to the data, and so we conclude that A and B are
likely to not be independent.
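In R, the quantities needed for this test come straight from the fitted model (a sketch: deviance() gives D(M1), df.residual() gives (K − 1)(L − 1) here, and pchisq() gives the p-value):

    m1 <- glm(count ~ gender + incomeSource, family = poisson, data = survey_counts)
    deviance(m1)     # the residual deviance D(M1)
    df.residual(m1)  # (K - 1)(L - 1) for this model
    pchisq(deviance(m1), df = df.residual(m1), lower.tail = FALSE)  # p-value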

So, we are now ready to do some testing!


In Example 4, we used mosaic plots of the UK survey dataset classifying
by gender and incomeSource to informally assess whether gender and
incomeSource are independent. In that example, we concluded that,
although the vertical lengths of the rectangles in the associated mosaic
plots did differ, the differences were perhaps small enough to be due to
random variation so that the two variables could possibly be independent.
In Example 5, we shall use the residual deviance to test whether these two
variables are independent.

Example 5 Are gender and incomeSource independent?


Consider once again the contingency table classifying the UK survey
dataset according to gender and incomeSource, which was first given
in Table 2 (and was repeated in Tables 9 and 14).
Here, we wish to test whether gender and incomeSource are
independent. So, first we need to fit the log-linear model
count ∼ gender + incomeSource.
In the notation from Boxes 4 and 5, this is our model M1 .
The residual deviance D(M1 ) for the fitted model M1 is 75.5, and the
associated chi-squared distribution is χ2 (1) (since for this contingency
table, K = L = 2, and so (K − 1)(L − 1) = 1).
By the ‘rule of thumb’, since D(M1 ) is much larger than
(K − 1)(L − 1), it looks like M1 is a poor fit. This can be confirmed by
checking the p-value, which is < 0.0001, and so M1 is indeed a poor fit.
So, because we’ve concluded that M1 is a poor fit, we’ll therefore
conclude that gender and incomeSource are not independent.


We’ll finish this subsection with an activity testing for independence


between the classifying variables in a contingency table involving the
births of babies. The dataset is described next.

Study on the birth process for newborn babies


This dataset contains data taken from a medical study on the birth
process for 738 newborn babies. For each birth, the gender of the
baby was recorded, along with information on the birth process for
each baby, including whether the birth was induced, whether
membranes had ruptured before the beginning of labour and whether
a caesarean section had been performed.
A baby can hear sounds from
The newborn births dataset (newbornBirths) about 23 weeks of pregnancy
The newborn births dataset contains contingency table data for these
738 newborn babies after classifying the data according to the two
variables:
• gender: the gender of the baby, taking the values male and female
• induced: whether or not the birth was induced, taking the values
no and yes.
The counts in the cells of the contingency table for these classifying
variables are stored in the variable count.
Data from the study are shown in the two-way contingency table
given in Table 16.
Table 16 Counts of newborn babies classified by gender and induced

induced
gender no yes Total
male 327 86 413
female 243 82 325
Total 570 168 738

Source: Tutz, 2011

Activity 16 will investigate whether gender and induced from the


newborn births dataset are independent.


Activity 16 Are gender and induced independent?

Using the data from the newborn births dataset, we wish to test the
hypotheses
H0 : gender and induced are independent,
H1 : gender and induced are not independent.
(a) Let the response variable be count, representing the cell counts in the
contingency table given in Table 16. The log-linear model
count ∼ gender + induced
was fitted to these data and the residual deviance for this fitted model
is 2.00. Which distribution should this residual deviance be compared
to in order to carry out the test?
(b) The associated p-value for this test is 0.157. What do you conclude
about whether or not gender and induced are independent?

We now know how we can use mosaic plots to visualise contingency tables
and consider informally whether or not it looks like the two classifying
variables are independent, and we also know how to use a log-linear model
to test independence more formally. So, we are now ready to put these
ideas into practice in R. We shall do this next.

3.3 Two-way contingency tables in R


We’ll start in Notebook activity 8.1 by using R to produce a contingency
table for a dataset, and then produce a mosaic plot to visualise the
contingency table. In doing so, we shall once again use data from the UK
survey dataset, and in particular we shall focus on producing and
visualising the contingency table classified using employment and gender
given earlier in Table 5 in Activity 3.
Notebook activity 8.2 also considers data from the UK survey dataset, but
this time we’ll produce a contingency table and its associated mosaic plot
for different classifying variables.
Mosaic plots allow us to informally assess whether or not the classifying
variables seem to be independent. Notebook activity 8.3 explains how we
can use R to fit a log-linear model and test for independence formally. In
doing so, we’ll revisit the newborn births dataset.
Notebook activity 8.1 produces a mosaic plot for the contingency table
given in Table 5 (in Activity 3) which classifies the UK survey dataset
according to employment and gender. In the final notebook activity of
this section, we’ll fit a log-linear model to these data and formally test
whether employment and gender are independent.


Notebook activity 8.1 Contingency tables and mosaic plots in R
This notebook explains how to use R to produce a contingency table
for a dataset, and how to produce a mosaic plot for this contingency
table.

Notebook activity 8.2 Another contingency table and mosaic plot
This notebook produces another contingency table and its associated
mosaic plot.

Notebook activity 8.3 Fitting a log-linear model and testing for independence in R
This notebook explains how to use R to fit a log-linear model and test
whether the classifying variables are independent.

Notebook activity 8.4 Testing whether employment and gender are independent
In this notebook, we’ll use R to test whether employment and gender
from ukSurvey are independent.


4 Contingency tables with more than two variables
Contingency tables can categorise data in terms of more than two variables.
An example of such a contingency table categorising data in terms of three
variables – a three-way contingency table – was shown in Table 3 in
Subsection 1.1; this table is repeated here in Table 17 for your convenience.
This contingency table shows data from the UK survey dataset categorised
in terms of the variables employment, gender and incomeSource.
Table 17 Repeat of Table 3

incomeSource
earned other
gender gender
employment female male female male
full-time 626 1688 31 95
part-time 235 112 123 66
unemployed 18 16 72 58
inactive 68 78 815 1043
Total 947 1894 1041 1262

We’ll start in Subsection 4.1 by extending the saturated log-linear model


for tables categorised according to two variables, to tables categorised
according to more than two variables.
When using log-linear models for two-way contingency tables, we
compared the fit of the independence model which includes just the
individual effects of the two variables, with the (perfect-fit) saturated
model which includes the two-way interaction term. Of these two models,
only the independence model could be a potentially useful model.
In contrast, when there are three or more variables in a contingency table,
there are several potential models which could be useful for modelling the
counts in the table, and so there is more of an element of model choice to
the modelling process. Choosing a log-linear model is the subject of
Subsection 4.2. There are, however, some restrictions on the log-linear
models that we can choose between when there are more than two
classifying variables; we shall consider these in Subsection 4.3.
To finish the section, we’ll discuss diagnostic plots for log-linear models
very briefly in Subsection 4.4, before we do some practical work using R in
Subsection 4.5.


4.1 Extending the log-linear model


To help keep things simple, we’ll start by just considering three-way
contingency tables. We’ll use the following general notation:
• A, B and C represent the three categorical variables
• K, L and S represent the number of categories for variables A, B and C,
respectively
• Ykls represents the count for the kth level of A, the lth level of B, and
the sth level of C, for k = 1, 2, . . . , K, l = 1, 2, . . . , L, s = 1, 2, . . . , S.
Since there are K levels of A, L levels of B and S levels of C, there will be K × L × S cells in the contingency table. So for a three-way contingency table, there are K × L × S observed values of the response.
Table 17 in the next activity.

Activity 17 A closer look at a three-way table

For the contingency table shown in Table 17, let A represent the variable
employment, B represent the variable incomeSource and C represent the
variable gender.
(a) What are the values of K, L and S for this contingency table? Hence
confirm that there are K × L × S observed values of the response.
(b) Table 17 has four rows and four columns. How are the data arranged
in the table?

In Subsection 3.2, we discussed the saturated log-linear model


Y ∼ A + B + A:B
for a two-way contingency table with response Y and classifying variables
A and B. We saw that ηkl , the linear predictor for the kth level of A and
the lth level of B, has the form
ηkl = (baseline mean) + (effect of kth level of A) + (effect of lth level of B)
      + (interaction effect of kth level of A and lth level of B).
The individual effects of the classifying variables are often referred to as
the main effects so that the linear predictor can be expressed more
generally as
(linear predictor) = (baseline mean) + (main effect terms) + (interaction effect term).   (10)


When we have a three-way contingency table with classifying variables A,


B and C, the linear predictor for the saturated log-linear model also has
the general form given in Equation (10), although, of course, what each of
the general terms in Equation (10) represents is slightly different when
there are three classifying variables. In this case, for the kth level of A, lth
level of B and sth level of C, in Equation (10) we have
(baseline mean) = mean count when A, B and C are level 1
and
(main effect terms) = (effect of kth level of A) + (effect of lth level of B) + (effect of sth level of C).
As usual, for each level of a factor, the main effect term represents the
effect of that level in comparison to the effect of level 1 of that factor.
For a two-way contingency table, there is just one possible interaction in
the saturated log-linear model (the interaction A:B). However, when there
are three classifying variables, there are more interaction effects in the
saturated log-linear model to consider. There are three two-way
interactions between pairs of classifying variables – that is, there’s an
interaction effect associated with the pair of variables A and B, a second
interaction effect associated with the pair A and C, and a third interaction
effect associated with B and C. There’s also one more interaction effect to
consider: the three-way interaction associated with all three classifying
variables A, B and C.
So, for a three-way contingency table, in Equation (10) we have:
(interaction effect term) = (interaction effect of kth level of A and lth level of B)
      + (interaction effect of kth level of A and sth level of C)
      + (interaction effect of lth level of B and sth level of C)
      + (interaction effect of kth level of A, lth level of B and sth level of C).
As usual, each interaction effect represents the added effect of that
interaction, and if any factor takes level 1 then the associated interaction
term is zero.
Using our usual model notation, the saturated log-linear model for a
three-way contingency table can be written as
Y ∼ A + B + C + A:B + A:C + B:C + A:B:C,
where A + B + C are the main effects, A:B + A:C + B:C are the two-way interactions, and A:B:C is the three-way interaction.


We can also use the ‘∗’ symbol to write this model in shorthand form as
Y ∼ A ∗ B ∗ C.
As usual, the ‘∗’ symbol between factors tells us that these factors are in
the model, as are all of the interactions between them. So, A ∗ B ∗ C
means all the individual factors and all of the possible interactions between
A, B and C.
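In R’s model formula notation, this is exactly the ‘*’ shorthand; for the variables in Table 17, the saturated log-linear model could be fitted as follows (a sketch, reusing the hypothetical survey3_counts data frame):

    # '*' expands to all main effects and all possible interactions
    m_sat <- glm(count ~ employment * gender * incomeSource,
                 family = poisson, data = survey3_counts)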
This log-linear model form can be extended in a natural way to
contingency tables with more than three classifying variables. To
illustrate, in Activity 18 we’ll look at the saturated model for a four-way
contingency table.

Activity 18 Saturated model for a four-way table

Suppose that we have a contingency table classified by the four variables


A, B, C and D.
(a) What are the possible interaction terms for a log-linear model of this
contingency table?
(b) Hence, write down the saturated log-linear model for this contingency
table using the notation ‘Y ∼ ...’.

So, we now know the general form for a saturated log-linear model for a
contingency table. But, of course, a saturated model is of no use for
modelling! So, the first question of interest is: can a more parsimonious
model be obtained? That is, can any of the terms in the saturated model
be removed from the model without significantly reducing the model fit?
We will consider this question next.

4.2 Choosing a log-linear model


The process of choosing an appropriate log-linear model is in some ways
rather different from the modelling that you have done before, although
the general principles are the same. With the other types of generalised
linear model we have met, it is often not possible to fit the data exactly
within the family of models being considered. In these cases, the best
model that can be found might still not fit the data all that well. However,
this is not the case for log-linear models! This is because it is always
possible to fit a log-linear model that fits the data in a contingency table
exactly, by including all of the possible interactions. So, in choosing an
appropriate log-linear model, the focus is on finding a simpler, more
parsimonious alternative to the saturated model, which still fits the data
adequately well.


Suppose that we have a log-linear model, M say, for a three-way


contingency table, which is more parsimonious than the saturated
log-linear model. For example, M could be the log-linear model
Y ∼ A + B + C + A:B + A:C + B:C
or M could be the log-linear model
Y ∼ A + B + C + A:B.
Both of these examples of possible models M are more parsimonious than
the saturated log-linear model (because neither of them includes all of the
possible interactions between A, B and C). But how do
we decide whether the more parsimonious model M fits the data
adequately? The next activity considers this question.

Activity 19 Which test statistic to check model fit?

Suppose that we have a log-linear model M for a contingency table, where


M is more parsimonious than the saturated log-linear model.
Can you suggest a test statistic that we can use to test whether M is an
adequate fit to the data?

Activity 19 suggested that, because the log-linear model is a GLM, we can


use the value of D(M ), the residual deviance of a proposed log-linear
model M , to test whether M is an adequate fit to the data.
Then, using Box 13 in Unit 7, if the more parsimonious log-linear model
M is a good fit, then
D(M ) ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model.
The value of r will, of course, depend on which log-linear model is being
fitted, and therefore doesn’t have one single ‘neat’ formula in terms of K,
L, S, and so on. The value of r is, however, given along with the value of
the residual deviance in the standard output when R fits a log-linear
model, so we’ll simply use this given value. The residual deviance D(M )
can then be used to test the adequacy of M ’s fit in the usual way; this is
summarised in Box 6.

Box 6 Testing the fit of a log-linear model


Let the log-linear model M be a more parsimonious model for a
contingency table than the saturated log-linear model.
We wish to test the hypotheses:
H0 : M is an adequate fit,
H1 : M is not an adequate fit.


Then:
• fit the model M and obtain the residual deviance for this model,
D(M )
• if M is a good fit, then
D(M ) ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model
(the value of r is given with the value of D(M ) as part of R’s
standard output for fitting M )
• assess the value of D(M ) and complete the test as illustrated in
Figure 7.

[Figure: a χ²(r) density curve with the observed value D(M) marked on it. D(M) not large, so p-value not small, thus M is an adequate fit; D(M) large, so p-value small, thus M is not an adequate fit.]
Figure 7  Illustration of how we can use D(M) to assess the fit of a log-linear model M (here, we’ve taken r to be 6)

The usual ‘rule of thumb’ applies:


• If D(M ) ≤ r, then we conclude that M is an adequate fit.
• If D(M ) is ‘much larger’ than r, then M is likely to be a poor fit.
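In R, all of the quantities in Box 6 come straight out of the fitted model object. The following is a minimal sketch, assuming a data frame dat that holds one row per cell, with the cell counts in count and classifying factors A, B and C (the model formula is just an example):

    # fit the proposed log-linear model M
    M <- glm(count ~ A + B + C + A:B, family = poisson, data = dat)
    D <- deviance(M)      # residual deviance D(M)
    r <- df.residual(M)   # associated degrees of freedom r
    # p-value for testing H0: M is an adequate fit
    pchisq(D, df = r, lower.tail = FALSE)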


In the next activity, we’ll consider the fits of some possible models for the
three-way contingency table given in Table 17 (which classifies the UK
survey dataset by employment, gender and incomeSource).

Activity 20 Which models are an adequate fit?

Four possible log-linear models were fitted to the contingency table data
given in Table 17. These models are as follows.
• Model M1 is the log-linear model which only has the main effects:
count ∼ employment + gender + incomeSource.
The residual deviance for this model is 4538.5 and the associated degrees
of freedom is 10.
• Model M2 is the log-linear model which includes all of the main effects
and the two-way interactions:
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource
+ gender:incomeSource.
The residual deviance for this model is 0.29 and the associated degrees
of freedom is 3.
• Model M3 is the log-linear model which includes all of the main effects
and the two-way interaction employment:incomeSource:
count ∼ employment + gender + incomeSource
+ employment:incomeSource.
The residual deviance for this model is 365.07 and the associated degrees
of freedom is 7.
• Model M4 is the log-linear model which includes all of the main effects
and the two-way interactions employment:gender and
employment:incomeSource:
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource.
The residual deviance for this model is 1.22 and the associated degrees
of freedom is 4.
Using the ‘rule of thumb’ from Box 6, which of the models M1 , M2 , M3
and M4 can be considered as an adequate fit to the data?

In Activity 20, we concluded that there are two models out of those listed
which can be considered as being an adequate fit to the data. This brings
us to the question of how to choose a log-linear model from a selection
of alternatives. We’ll consider this in the next activity.


Activity 21 Choosing a log-linear model

(a) Suppose that we have two log-linear models, M1 and M2 , for a


contingency table, where M1 is nested within M2 . Explain, in general
terms, how we could compare the fits of these two models.
(b) Suppose that we have two log-linear models, M3 and M4 , which are
not nested. Explain, in general terms, how we could compare the fits
of these two models.
(c) Which automated procedure could we use for selecting a log-linear
model from a set of alternatives?

Activity 21 considered methods for choosing a log-linear model from a set


of alternatives. We’ll use two of these methods to choose between two
possible models for the UK survey dataset next in Activity 22.

Activity 22 Which log-linear model is preferable?

Activity 20 considered four possible models for the contingency table data
from the UK survey dataset given in Table 17. That activity concluded
that two of these models were an adequate fit to the data. In this activity,
we’ll compare the fits of these two models so that we can select the
preferred model.
The two models which were an adequate fit are:
• Model M2 :
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource
+ gender:incomeSource.
The residual deviance for this model is 0.29 and the associated degrees
of freedom is 3.
• Model M4 :
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource.
The residual deviance for this model is 1.22 and the associated degrees
of freedom is 4.
(a) Explain why we can use the value of
deviance difference = D(M4 ) − D(M2 )
to compare the fits of models M4 and M2 .
(b) Calculate the deviance difference to compare these models. What is
the value of the associated degrees of freedom for the deviance
difference?


(c) Hence explain why, if we use the deviance difference, we prefer model
M4 to model M2 .
(d) The value of the AIC for model M2 is 133.01, while the value of the
AIC for model M4 is 131.94. Using these AIC values, which of models
M2 and M4 is preferable?
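After you have attempted the activity, it may help to see the same comparison done in R. This is a minimal sketch using the values reported above; with fitted glm objects m2 and m4, the calls anova(m4, m2, test = "Chisq") and AIC(m2, m4) produce the equivalent comparisons directly:

    dev.diff <- 1.22 - 0.29   # D(M4) - D(M2)
    df.diff <- 4 - 3          # difference in the degrees of freedom
    # p-value for the deviance difference test
    pchisq(dev.diff, df = df.diff, lower.tail = FALSE)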

In the next activity, we’ll choose a log-linear model for data concerning
litters of sheep; the dataset is described next.

Litters of sheep
Agricultural researchers were interested in investigating the
relationships between the size of litters of lambs, the breed of ewe
giving birth to the lambs (from three possible breeds), and the farm
where the ewe gave birth (from three possible farms).
The sheep litters dataset (sheepLitters)
This dataset contains data on 840 ewes who gave birth to litters of
lambs. For each ewe, the number of lambs in her litter, the ewe’s
breed and the farm where the birth took place were recorded. The
ewes were then classified according to three factors:
• litterSize: the number of lambs born in the litter, taking the
values 0, 1, 2 and ≥ 3
• breed: the breed of the ewe, taking the values a, b and c
• farm: the farm where the birth took place, taking the values 1, 2
and 3.
The counts in the cells of the contingency table for these classifying
variables are stored in the variable count.
Data from the experiment are shown in the three-way contingency
table given in Table 18.
Table 18 Counts of ewes categorised according to litterSize, breed
and farm

farm
1 2 3
breed breed breed
litterSize a b c a b c a b c
0 10 4 6 8 5 1 22 18 4
1 21 6 7 19 17 5 95 49 12
2 96 28 58 44 56 20 103 62 16
≥3 23 8 7 1 1 2 4 0 2
Total 150 46 78 72 79 28 224 129 34

Source: Mead, Curnow and Hasted, 2002


We’ll choose a log-linear model for the sheep litters dataset next in
Activity 23.

Activity 23 Choosing a log-linear model for the sheep


litters dataset
Consider the three-way contingency table shown in Table 18 which shows
data from the sheep litters dataset. We want to choose a log-linear model
for the response count, which represents the counts in this contingency
table.
(a) Let the model M1 be the log-linear model
count ∼ litterSize + farm + breed.
For the fitted model, the residual deviance D(M1 ) is 206.27, and the
associated degrees of freedom is 28. The p-value associated with
D(M1 ) is < 0.0001. Is this model an adequate fit to the data?
(b) Let the model M2 be the log-linear model
count ∼ litterSize + farm + breed
+ litterSize:farm + litterSize:breed
+ farm:breed.
For this fitted model, the value of the residual deviance, D(M2 ), is
14.58 on 12 degrees of freedom, and the associated p-value is 0.265.
Explain why M2 can be considered an adequate fit to the data.
(c) We saw in part (b) that M2 can be considered an adequate fit to the
data. So, now we’ll investigate whether any of the two-way
interactions in M2 can be omitted from the model so that the
resulting model is still an adequate fit to the data. So, let models M3 ,
M4 and M5 be as follows.
• Model M3 is model M2 omitting the interaction term
litterSize:farm, so that M3 is
count ∼ litterSize + farm + breed
+ litterSize:breed + farm:breed.

• Model M4 is model M2 omitting the interaction term


litterSize:breed, so that M4 is
count ∼ litterSize + farm + breed
+ litterSize:farm + farm:breed.

• Model M5 is model M2 omitting the interaction term farm:breed,


so that M5 is
count ∼ litterSize + farm + breed
+ litterSize:farm + litterSize:breed.


Table 19 shows the deviance difference for each of the models M3 , M4


and M5 in comparison to model M2 (which includes all three two-way
interactions). Table 19 also shows the associated degrees of freedom
and p-values for each of the deviance differences.
Table 19 Comparing models M3 , M4 and M5 with model M2

Model Interaction omitted Deviance difference Degrees of freedom p-value


M3 litterSize:farm 101.69 6 < 0.0001
M4 litterSize:breed 4.26 6 0.6409
M5 farm:breed 64.33 4 < 0.0001

Does it look like any of the two-way interactions can be dropped from
model M2 ?
(d) Which of the log-linear models M1 , M2 , . . . , M5 would you choose?
(Note that the p-values for each of the log-linear models with the
main effects and only one two-way interaction are all extremely small
(p < 0.0001), and so the models with one two-way interaction and no
interactions will all be inadequate fits to the data.)

4.3 Some restrictions when choosing a


log-linear model
In the last subsection, we discussed how we can choose a log-linear model
so that the model is an adequate fit to the data, but is as simple as
possible. There are, however, some restrictions regarding the models that
we can choose from. We shall look at these restrictions in this subsection.
We have already met (in Unit 4) one of the restrictions regarding the
log-linear models that we can choose from: this is the rule known as the
hierarchical principle. A summary reminder of the hierarchical principle is
given in Box 7.

Box 7 The hierarchical principle


If an interaction is included in a model, then the model must also
include:
• the individual effect terms for each of the variables in the interaction
• any lower-order interactions involving any of the variables in the
interaction.

The hierarchical principle is illustrated in Example 6 and Activity 24.


Example 6 Hierarchical log-linear models


Suppose that we have a three-way contingency table with classifying
variables A, B and C.
The log-linear model
Y ∼ A + B + C + B:C
is hierarchical, because the only interaction in the model is the
two-way interaction B:C, and the main effects for both B and C are
in the model.
On the other hand, the log-linear model
Y ∼ A + B + C + A:B:C
is not hierarchical. This is because this model includes the three-way
interaction A:B:C, but doesn’t include all of the lower-order
interactions involving these variables. In particular, the model doesn’t
include any of the three two-way interactions A:B, A:C and B:C,
which would all need to be included for the model to be hierarchical.

Activity 24 Hierarchical or not?

The data in a three-way contingency table with factors A, B and C are to


be modelled by a log-linear model. The following three log-linear models
are considered for these data:
• Model M1 : Y ∼ A + B + A:B + A:C + B:C
• Model M2 : Y ∼ A + B + C + A:B
• Model M3 : Y ∼ A + B + C + A:B + A:B:C.
For each of these models, state whether or not the model is hierarchical.
Explain your answers.


In this module, we shall only be using hierarchical models, because they
are often easier to interpret than non-hierarchical models. What’s more,
when choosing a log-linear model in this module, we shall not consider
leaving out any of the main effects. This, of course, goes against the
parsimony principle. However, it is more important that a log-linear model
is interpretable than the model is parsimonious.
So, the first restriction on our choice of log-linear models is that we only
want to choose between hierarchical log-linear models. The second
restriction arises when some of the totals in the contingency table are fixed
in advance of collecting the data. For example, in the UK survey dataset,
the total number of individuals in the survey may be fixed in advance, or
in the sheep litters dataset, the number of ewes for each breed may be
fixed in advance.
Back in Subsection 2.2, it was stated that two-way tables in which the row,
column or overall totals are fixed can be analysed using log-linear models
in exactly the same way as two-way tables in which the counts are all
independent of each other. However, this is not so for contingency tables
with three or more classifying variables. In this case, if the total number of
observations is fixed, and/or the totals for counts across one or more of the
variable categories are fixed, then this imposes constraints on which terms
must be included in any log-linear model.
Box 8 summarises which terms need to be included in a log-linear model
for the different possible fixed totals in a contingency table.

Box 8 Log-linear models with fixed totals in a


contingency table
For contingency tables with three or more classifying variables, fixed
totals impose constraints on which terms must be included in a
log-linear model for the data. The terms needed when various totals
are fixed are as follows.
• Total number of observations is fixed:
The baseline mean term needs to be included (but this term is in all
log-linear models anyway).
• Totals of each level of an individual variable are fixed:
The main effect term for that variable needs to be included (but
note that in this module we are only considering log-linear models
including all main effect terms anyway).
• Totals of combinations of levels of two or more variables
are fixed:
The interaction term for that combination of variables needs to be
included.


To finish this subsection, the ‘rules’ given in Box 8 for contingency tables
with fixed totals are illustrated in the next example and following activity.

Example 7 Which terms should be included?


Consider a four-way contingency table where the data are classified
according to the four variables A, B, C and D.
If the overall total of the sample is fixed, then we need the baseline
mean term in the model (which we would have in the model anyway!).
In addition, in this module we’re always including all of the main
effects in the log-linear model. So, if the overall total is fixed, then the
simplest possible model we could use is
Y ∼ A + B + C + D.
We’d then want to test which interactions need to also be included in
the model.
If the totals for each combination of the levels of B and C are fixed in
advance, then we need to include the interaction B:C in the model.
Additionally, in this module we’re including all of the main effects.
So, if the totals for each combination of the levels of B and C are
fixed, then the simplest possible model we could use is
Y ∼ A + B + C + D + B:C.
We’d then want to test which other interactions need to also be
included in the model.
Now suppose that the totals for each combination of the levels of A, C
and D are fixed in advance. In this case, we’d need to include the
three-way interaction A:C:D in the model. But in addition, to ensure
that we have a hierarchical model, we also need to include all
lower-order interactions involving A, C and D, as well as their main
effects. As usual, we also need to include the main effect B, since
we’re including all main effects in this module.
So, when the totals for each combination of the levels of A, C and D
are fixed in advance, the simplest possible model we could use is
Y ∼ A + B + C + D + A:C + A:D + C:D + A:C:D.
We’d then need to test whether any of the other interactions need to
be included. (And, of course, if another interaction is included, we
then also need to make sure that the model is hierarchical.)


Activity 25 Identifying the simplest possible models

Consider once again the four-way contingency table from Example 7,


where the data are classified according to the four variables A, B, C and
D. For each of the scenarios below, what is the simplest possible log-linear
model we could use?
(a) The totals for each level of B are fixed.
(b) The totals for each combination of the levels of C and D are fixed.
(c) The totals for each combination of the levels of A, B and D are fixed.

4.4 No diagnostic plots?


Once we have chosen our model, our usual next step is to obtain diagnostic
plots to check the assumptions of the fitted model. However, things are a
little different for log-linear models for contingency tables and we will not
be considering any diagnostic plots! There are several reasons for this.
In very many cases, diagnostic plots do not show you anything very useful
for log-linear models. Questions of the model not fitting the data do not
arise in the same way that they do with some other GLMs, because there
is always a model (the saturated model) that fits the data exactly.
As a result, the question of having to transform something or use a
different distribution does not usually arise. What’s more, any problems
with the modelling assumptions are often impossible to investigate once
the data have been recorded as a contingency table. For instance, there
may be dependence between successive observations, but the order of the
observations is not recorded when they are summarised in a table.
To finish this section, we’ll use R for modelling contingency tables with
more than two classifying variables.

4.5 Using R for log-linear models for more


than two variables
In this subsection, we’ll use R to choose log-linear models for two separate
contingency tables, each with more than two classifying variables.
We’ll start by modelling data from a dataset concerning the UK 2019
general election. The dataset is described next.

Demographic characteristics of voters and political parties


In December 2019, the UK held a general election. After the election,
researchers were interested in whether there were any relationships
between various demographic characteristics of voters and the
political party voted for. For example, was there a relationship
between the political party voted for and the voter’s age, or their
gender, or their household earnings? The data contained in this
dataset were collected to address these kinds of questions.

The UK election dataset (ukElection)


The UK election dataset contains data for a random sample of 41 995
UK adults who voted in the UK 2019 general election. For each adult
in this sample, the political party they voted for and which age group
they belonged to were recorded. The votes were then categorised
according to the following three variables:
• party: the political party voted for, taking the values con (for the Conservative Party), lab (for the Labour Party), libdem (for the Liberal Democrat Party), snp (for the Scottish National Party) and other
• gender: the gender the voter identified with, taking the values male and female
• age: the age group the voter belongs to, taking the values 18 to 24, 25 to 49, 50 to 64 and 65+.
(In the event, the Conservative Party won the election, receiving 43.6% of the votes.)
The counts in the cells of the contingency table for these classifying
variables are stored in the variable count.
The counts of votes for the different political parties for separate age
and gender groups can be seen in Table 20. For example, 639 adults
in the sample were male, aged 18 to 24 years and voted for the
Conservative Party.
Table 20 Counts of votes for different political parties for separate age
groups and genders for a random sample of 41 995 UK adults who voted in
the UK’s 2019 general election

age
18 to 24 25 to 49 50 to 64 65+
gender gender gender gender
party male female male female male female male female
con 639 345 3032 2892 2462 2560 2852 3384
lab 1050 1496 3465 4066 1207 1433 668 952
libdem 274 230 1213 1084 579 614 490 529
snp 160 115 433 452 145 205 134 106
other 183 115 606 452 483 307 267 317

Source: YouGov, 2019, accessed 31 July 2022

The counts given in Table 20 were not directly available from the data
source, but were instead estimated from the available data, namely,
the (rounded) percentages of adults voting for each political party for
each gender and age group, together with the total number of adults
in each gender and age group in the sample. The discrepancy between
the sample size (41 995) and the total number of votes in Table 20
(41 996) is due to rounding of the percentages of votes at the data
source and then further rounding of the estimated numbers of votes.


We’ll use R to choose a log-linear model for the UK election dataset in


Notebook activity 8.5.
In the final notebook activity of this subsection, we’ll once again consider
the UK survey dataset. Contingency table data from this dataset have
been used throughout this unit. However, so far we have only considered
three of the possible factors for which there are data available. In
Notebook activity 8.6, we’ll consider the data when classified according to
all four factors.

Notebook activity 8.5 Log-linear model for the UK


election dataset
This notebook uses R to choose a log-linear model for data from
ukElection.

Notebook activity 8.6 Log-linear model for the UK


survey dataset
In this notebook, we’ll use R to choose a log-linear model for data
from ukSurvey classifying the data according to all four factors in the
dataset.
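To give a flavour of what these notebooks involve, the central commands can be sketched as follows. This is a minimal sketch, assuming ukElection is a data frame with the factors party, gender and age and the cell counts in count; the notebooks themselves contain the definitive code:

    # fit the saturated log-linear model
    saturated <- glm(count ~ party * gender * age,
                     family = poisson, data = ukElection)
    # backward stepwise selection from the saturated model, compared by AIC
    chosen <- step(saturated, direction = "backward")
    summary(chosen)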

5 How are the classifying variables


related?
In Section 4, we discussed log-linear models for contingency tables with
more than two classifying variables and considered how to choose a final
log-linear model for a given dataset. Once we have our final log-linear
model, we can use it to tell us about the relationships between the
classifying variables. This section focuses on how we can do this.
We used our final log-linear model to tell us about the variable
relationships when we modelled two-way contingency tables in Sections 2
and 3. If the chosen model didn’t include the interaction, then we
concluded that the two classifying variables are independent, but if the
chosen model did include the interaction, then we concluded that the two
classifying variables are not independent. We can use the same sort of idea
when there are more than two classifying variables, only the relationships
can be more complicated in this case!
Since it is often difficult to interpret the model in terms of the variable
relationships when there are more than three variables, we will only
investigate variable relationships for three-way tables in this module. We’ll
look at the types of relationships that can exist between the classifying
variables in a three-way contingency table in Subsection 5.1, and then we’ll
use R to identify relationships in Subsection 5.2.


5.1 Types of relationships for three-way


tables
We’ll start by looking in Activity 26 at the relationships between the
classifying variables represented by the simplest log-linear model for a
three-way contingency table.

Activity 26 No interactions in the model

Suppose that the final model for a three-way contingency table with
classifying variables A, B and C is the log-linear model with no
interactions given by
Y ∼ A + B + C.
Given what you know about log-linear models for two-way contingency
tables, what do you think this model tells us about the relationships
between the variables A, B and C?

Following on from Activity 26, if our final model is the log-linear model
Y ∼ A + B + C,
then the variables A, B and C are said to be mutually independent.
Next we’ll consider the case in which there is one single two-way
interaction: we’ll use a specific example in the next activity to think things
through.

Activity 27 One two-way interaction in the model

Suppose that our final log-linear model is


Y ∼ A + B + C + A:B.
What do you think this model tells us about the relationships between the
variables A, B and C? (You might find it helpful to think about what it
means when there is an interaction in the model for two-way contingency
tables.)

Following on from Activity 27, if our final model is a log-linear model with
one single two-way interaction, then the two variables in the two-way
interaction are said to be jointly independent of the third variable.
We’ll see two further log-linear models representing joint independence in
Activity 28.


Activity 28 Relationships for models with a single two-way


interaction
(a) Suppose that our final log-linear model is
Y ∼ A + B + C + A:C.
What does this model tell us about the relationships between the
variables A, B and C?
(b) If instead, the final log-linear model is
Y ∼ A + B + C + B:C,
what can we say about the relationships between the variables A, B
and C now?

The next situation that we’ll consider is when there are two two-way
interactions in the model. The independence relationships in this case are
a little trickier to interpret from the model.
It will help to explain things if we consider a specific example. So, suppose
that our chosen model is the log-linear model
Y ∼ A + B + C + A:B + A:C.
Now, since the interaction A:B is in the model, then that suggests that A
and B are not independent. Likewise, since the interaction A:C is in the
model, then that suggests that A and C are not independent. However,
there is no interaction involving both B and C together. Since B and C
are related to each other only through their relationships with A, we say
that B and C are conditionally independent, given A.
We’ll consider the relationships between A, B and C for the other possible
models with two two-way interactions in the next activity.

Activity 29 Relationships for models with two two-way


interactions
(a) Suppose that our final log-linear model is
Y ∼ A + B + C + A:B + B:C.
What does this model tell us about the relationships between the
variables A, B and C?
(b) Suppose instead that the final log-linear model is
Y ∼ A + B + C + A:C + B:C.
What does this alternative final model tell us about the relationships
between the variables A, B and C?


The only other log-linear model not yet considered is the model containing
all three two-way interactions, but missing the three-way interaction; that
is, the log-linear model
Y ∼ A + B + C + A:B + A:C + B:C.
In this model, none of the pairs of variables is independent of each other
and there is said to be uniform association between the variables.
Using the final log-linear model to investigate the relationships between
the variables in a three-way table is summarised in Box 9.

Box 9 Variable relationships in a three-way table

For a three-way contingency table with variables A, B and C, Table 21
summarises the possible interactions included in the log-linear model
and the associated relationships between the variables.

Table 21 Interactions included in the log-linear model and the associated
relationships

Interactions included Relationship

No interactions A, B, C mutually independent
A:B A and B jointly independent of C
A:C A and C jointly independent of B
B:C B and C jointly independent of A
A:B, A:C B and C conditionally independent, given A
A:B, B:C A and C conditionally independent, given B
A:C, B:C A and B conditionally independent, given C
A:B, A:C, B:C Uniform association between A, B, C
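The relationships in Table 21 can also be stated in terms of the cell probabilities. As a sketch (using the same ‘+’ subscript convention for summing over a margin as is used for the marginal counts earlier in the unit), let $p_{kls}$ be the probability that an observation falls in the cell for the kth level of A, the lth level of B and the sth level of C. Then, for example,
$$p_{kls} = p_{k++}\,p_{+l+}\,p_{++s} \quad \text{(mutual independence)},$$
$$p_{kls} = p_{kl+}\,p_{++s} \quad \text{($A$ and $B$ jointly independent of $C$)},$$
$$p_{kls} = \frac{p_{kl+}\,p_{k+s}}{p_{k++}} \quad \text{($B$ and $C$ conditionally independent, given $A$)}.$$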

The next two activities will give you some practice at interpreting
log-linear models in terms of the variable relationships.

Activity 30 Variable relationships represented by models

A three-way contingency table is classified according to three variables A,


B and C. What can you say about the relationships between A, B and C
represented by each of the following log-linear models?
(a) Y ∼ A + B + C + A:B + A:C + B:C.
(b) Y ∼ A + B + C.
(c) Y ∼ A + B + C + A:C + B:C.
(d) Y ∼ A + B + C + A:B.


Activity 31 Relationships between employment, gender and


incomeSource
Consider once again the data in Table 17 which classifies the UK survey
dataset according to the variables employment, gender and incomeSource.
In Activity 20, we saw that the log-linear model
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource
was an adequate fit to the data.
What does this model tell us about the relationships between employment,
gender and incomeSource?

To finish this section, we’ll use R to identify the variable relationships in a


three-way contingency table.

5.2 Using R to identify relationships


In this subsection, we’ll use R to select a log-linear model for a three-way
contingency table so that we can interpret the fitted model in terms of the
variable relationships. The three-way contingency table contains data on
Australian health insurance; the dataset is described next and analysed in
Notebook activity 8.7.

The Australian National Health Survey


The Australian Bureau of Statistics carries out a series of national
health surveys to collect data on the health of Australians as well as
demographic and socio-economic information.
The Australian health insurance dataset (healthInsurance)
This dataset considers data from the 1995 Australian National Health
Survey. As part of this survey, data were collected regarding private
health insurance. (According to the Australian Government Department
of Health and Aged Care website (health.gov.au), more than half of the
Australian population have private health insurance.) The Australian
health insurance dataset contains data from 20 851 respondents from this
survey, classified according to the following three variables:
have private health insurance • insurance: private health insurance type, taking the four possible
values inclusive, hospital, ancillary and none
• age: the age group of the respondent, taking the three possible
values < 35, 35 to 64 and > 64
• seifa: the socio-economic index for the area where each respondent
lives, taking the three possible values 1, 2 to 4 and 5.
The counts in the cells of the contingency table for these classifying
variables are stored in the variable count.


The data for the first five observations from the Australian health
insurance dataset are given in Table 22.
Table 22 First five observations from healthInsurance

count insurance age seifa


598 inclusive < 35 1
127 hospital < 35 1
89 ancillary < 35 1
1065 none < 35 1
406 inclusive 35 to 64 1

Source: de Jong and Heller, 2008

Notebook activity 8.7 Identifying relationships for a


three-way table in R
This notebook selects a log-linear model for data from
healthInsurance and identifies the relationships between the
variables.

To complete this unit, in the final section we’ll see how logistic regression
models can be used instead of log-linear models for some contingency table
data.

6 Logistic and log-linear models


So far, we have always used log-linear models to model contingency table
data. However, sometimes it is also possible to model the data in a
contingency table using a logistic regression model. Each type of model
has its advantages and disadvantages, as we shall see later in this section.
We’ll start in Subsection 6.1 by considering the type of contingency table
data for which logistic regression can be used, and what form a logistic
regression model would take. We’ll then go on to discuss the relationship
between logistic and log-linear models in Subsection 6.2, and finally in
Subsection 6.3, we’ll briefly consider the advantages and disadvantages of
logistic and log-linear models for modelling contingency table data.


6.1 Logistic regression for contingency


table data
Not all contingency table data can be modelled by a logistic regression
model. So, we’ll start this subsection by introducing a contingency table
which can be modelled using logistic regression. The dataset is described
next.

Dental flossing habits


A study on dental flossing habits of schoolchildren was carried out in
São Paulo, Brazil. One of the aims of the study was to investigate
whether a child’s ability to clean their teeth with dental floss is
influenced by various factors.
The dental flossing dataset (dentalFlossing)
The dental flossing dataset contains data from this study. A sample of
120 children from a primary school were classified according to the
following four categorical variables:
• gender: the gender of the child, taking the values male and female
• age: the child’s age (in years) classified into two categories, taking
the values 5 to 8 and 9 to 12
• frequency: the child’s dental flossing frequency, taking the values
rarely and regularly
• able: the child’s ability to hold and manipulate the floss around
his/her teeth properly, taking the values yes and no.
The counts in the cells of the contingency table for these classifying
variables are stored in the variable count.
These categorisations lead to a four-way contingency table containing
2 × 2 × 2 × 2 = 16 cells, as shown in Table 23.
Table 23 Sample of primary schoolchildren classified according to four
categorical variables

able
no yes
frequency frequency
gender age rarely regularly rarely regularly
male 5 to 8 19 4 5 2
9 to 12 5 0 8 17
female 5 to 8 11 7 6 6
9 to 12 2 1 5 22
Total 37 12 24 47

Source: Paulino and Singer, 2006


Let’s first think about how we’d model the four-way contingency table
data from the dental flossing dataset if we were to use a log-linear model.

Activity 32 Response and explanatory variables using a


log-linear model
If we were to model the four-way contingency table from the dental flossing
dataset given in Table 23 using a log-linear model, what would we take as
the response variable? What possible explanatory variables would there
be?

If we were to model the contingency table given in Table 23 using a


log-linear model, then we’d take the cell count as our response variable.
But using the count as a response won’t work for a logistic regression
model, because logistic regression has a binary response! So we need a
different response if we want to use logistic regression.
Now, one of the aims of the study was to investigate whether a child’s
ability to clean their teeth with dental floss is influenced by the other
factors. In other words, one of the aims was to investigate whether the
variable able is influenced by the variables gender, age and frequency.
This would suggest that able seems like a natural response variable, with
the other three variables (and their interactions) as possible explanatory
variables.

Activity 33 Variable able as the response

Explain why we could use logistic regression for these data if we took able
to be our response variable.

If we take able as our response variable to help answer one of the study’s
aims, then we could use logistic regression because able is a binary
variable. A logistic regression model can, in fact, be used to model any
contingency table in which there is a binary categorical variable that could
be treated as the response variable. In this case, the contingency table can
be modelled by either a log-linear model or a logistic regression model.


The main ideas behind using the two models for the same contingency
table are summarised in Box 10.

Box 10 Log-linear and logistic regression models for


contingency tables
A contingency table in which there is a binary categorical variable
which could be considered as a response variable can be modelled
either by a log-linear model or a logistic regression model.
A log-linear model for contingency table data has:
• the counts in the table as the response variable
• the variables categorising the data as potential explanatory
variables, together with their interactions.
A logistic regression model for contingency table data has:
• one binary variable categorising the data in the table which can be
considered as a response variable
• the remaining variables categorising the data as potential
explanatory variables, together with their interactions.

So, if a log-linear model and a logistic regression model can both be used
to model the same data, what is the relationship between the two resulting
models? We will address this question next.

6.2 Relationship between logistic and


log-linear models
In order to investigate the relationship between logistic and log-linear
models, we’ll start by looking at the two models for the dental flossing
dataset: we’ll focus on the logistic regression model in Activity 34 and
then the log-linear model in Example 8.

Activity 34 A logistic regression model for the dental


flossing dataset
Suppose that a logistic regression model is used to model the binary
response variable able, with the remaining variables – gender, age and
frequency – as potential explanatory variables.
Using stepwise regression to select the best logistic regression model for
these data, the selected model is the logistic regression model
able ∼ age + frequency + age:frequency.
What does this logistic regression model tell us about the relationships
between able and the variables gender, age and frequency?


Example 8 A log-linear model for the dental flossing


dataset
In Activity 34, stepwise regression selected the logistic regression
model
able ∼ age + frequency + age:frequency.
The terms in this logistic regression model tell us something about the
terms that we should expect in a log-linear model for the same data.
age and able are related. So, we would therefore expect a log-linear
model for the same data to have the two-way interaction able:age.
Similarly, since frequency is in the logistic regression model for able,
we would expect a log-linear model for the same data to also include
the two-way interaction able:frequency.
What’s more, since the logistic regression model for able also has the
interaction age:frequency, this means that the interaction between
age and frequency is related to able. Therefore, we would expect a
log-linear model for the same data to also include the three-way
interaction able:age:frequency.
However, since gender is not in the logistic regression model for able,
we wouldn’t expect any more interaction terms involving able in a
log-linear model for the same data.
This is, in fact, exactly what we do find if we use stepwise regression
to fit a log-linear model for these data: the selected log-linear model is
count ∼ gender + age + frequency + able
+ gender:frequency + age:frequency + age:able
+ frequency:able + age:frequency:able.
Notice that this log-linear model does include the three interactions
able:age, able:frequency and able:age:frequency, but doesn’t
include the interaction able:gender.
Also notice that the final log-linear model includes interactions which
don’t involve able. This is because the log-linear model is modelling
the relationships between all of the variables, whereas the logistic
regression model is only modelling the relationships between able and
the other variables.

Example 8 illustrates a general result concerning the relationship between


the log-linear and logistic regression models for a dataset, as summarised
in Box 11.


Box 11 Relationship between log-linear and logistic


regression models for contingency table data
Suppose that one of the categorical variables in a contingency table
can be considered to be a binary response variable Y , and the other
categorical variables in the table are labelled A, B, . . . , Z.
If a logistic regression model for response Y includes the factor A,
then a log-linear model for the same data contains the two-way
interaction Y :A.
If a logistic regression model for response Y has the two-way
interaction A:B, then a log-linear model for the same data contains
the three-way interaction Y :A:B.
If a logistic regression model for response Y has the three-way
interaction A:B:C, then a log-linear model for the same data contains
the four-way interaction Y :A:B:C.
And so on.
This is summarised in Table 24.
Table 24 Relationships between terms included in a logistic regression
model and interactions included in a log-linear model for the same data

Interactions in logistic regression model ←→ Interactions in log-linear model
Main effect A, but no interaction ←→ Y:A
A:B ←→ Y:A:B
A:B:C ←→ Y:A:B:C
... ←→ ...

This result is illustrated in the next activity.

Activity 35 Which terms in the model?

(a) A four-way contingency table is such that one of the categorical


variables can be considered to be a binary response variable Y , and
the other three variables in the table are labelled A, B and C.
A logistic regression model for these data has the form
Y ∼ A + B + C + B:C.
What interaction terms involving Y will the corresponding log-linear
model have for these data?


(b) In a four-way contingency table, the data are categorised according to


four variables labelled A, B, C and D. The log-linear model
count ∼ A + B + C + D + A:C + A:D + C:D + A:C:D
is found to have an adequate fit to the data. Suppose that the variable
C can be considered to be a binary response variable for a logistic
regression model for the same data. Given the log-linear model for the
data, write down the corresponding logistic regression model for C.

Although it is possible to use both a log-linear model and a logistic


regression model for contingency table data, it is important to remember
that the two models are not modelling the same thing.
• A log-linear model models the counts in the contingency table, and
considers all of the classifying variables on an equal footing as
explanatory variables. As such, a log-linear model considers the
relationships between all of the classifying variables.
• In contrast, a logistic regression model treats one of the classifying
variables as the response, and considers the relationships between the
response and the other classifying variables. There is no attempt to
simplify the relationships between the explanatory variables.
As such, a log-linear model will invariably have more parameters than the
associated logistic regression model for the same data. Despite this, as
long as the log-linear model includes certain terms, the logistic regression
model and the log-linear model for the same data can produce the same
residual deviance and estimates for the corresponding parameters in the
two models.

6.3 Which model to use: logistic or


log-linear?
As you have seen, when we have data with a binary response variable and
explanatory variables that are all categorical (so that the data could be
summarised in a contingency table), we have a choice of ways to analyse
them. We can use either a logistic regression model or a log-linear model.
We can even ensure that they come to the same answer by including
certain terms in the log-linear model. So which of these two models should
we use?
Well, the choice lies largely in the research question that we would like to
answer and the conclusions that we would like to draw from the data. If
interest is focused on the outcome of a binary categorical variable which is
an obvious response for the research question, then a logistic regression
model would be a sensible way forward. But if there is no obvious response
variable and we are more interested in investigating the relationships
between all of the categorical variables, then a log-linear model is a better
way forward. There are pros and cons for both models, and we’ll finish this
unit by summarising these in Box 12.

Box 12 Pros and cons of logistic and log-linear models


Logistic regression model
Pros:
• The model is simpler and has fewer parameters than the associated
log-linear model that matches it exactly.
• The model can accommodate continuous explanatory variables as
well as categorical variables.
Cons:
• It is not always obvious that one of the categorical variables should
have a special status as the response variable.
• The model only focuses on the relationships between one of the
variables (the response) and the other variables.

Log-linear model
Pros:
• The model involves modelling the relationships between all of the
variables categorising the data, so one variable need not be given
special status over the remaining variables.
• If a log-linear model is used which isn’t necessarily an exact match
with the analogous logistic regression model, then the log-linear
model could end up being a simpler model for the data.
• The model works just as well when the proposed response variable
among the categorical variables has more than two categories. (The
usual logistic regression models that you have studied in this
module do not work in such cases.)
Cons:
• There can be difficulties in fitting a log-linear model if the
contingency table contains many zeros so that certain combinations
of the explanatory factors do not occur in the data, either by
chance or by design.
• The model can’t accommodate continuous explanatory variables.


Summary
This unit focused on the problem of modelling contingency table data. In a
contingency table, the individual data values in a dataset are categorised
according to the levels of two or more categorical variables. The
contingency table then gives the counts of observations for each of the
possible combinations of levels for the categorical variables.
In order to model these data, the counts in the contingency table are taken
to be values of the response, and the variables classifying the data are used
as factor explanatory variables. The counts can then be modelled by a
Poisson GLM with the (canonical) log link. In this context, this GLM is
known as a log-linear model.
A question often of interest for two-way contingency tables with classifying
variables A and B is whether A and B are independent. This question can
be investigated informally using mosaic plots. More formally, we can test
whether A and B are independent by comparing the model fits of the
following two log-linear models for the cell counts Y :
• the log-linear model which assumes that A and B are independent, given
by
Y ∼ A + B

• the saturated log-linear model which assumes that A and B are not
independent, given by
Y ∼ A + B + A:B.

Since the second model is the saturated model, we can compare the fits of
these two models using the residual deviance for the model Y ∼ A + B. If
we conclude that the model Y ∼ A + B is an adequate fit to the data, then
we can conclude that A and B are independent.
When a contingency table has more than two classifying variables, the
log-linear model can be extended to accommodate the extra main effects
and interactions required. In this situation, there are several possible
models to choose from for a given contingency table.
Since it’s always possible to fit contingency table data perfectly using a
saturated log-linear model, the aim when choosing a log-linear model is to
find a model which is more parsimonious than the saturated model that
also fits the data adequately. As for GLMs, we can assess whether a
particular log-linear model is an adequate fit using the model’s residual
deviance, and we can compare the fits of two log-linear models using the
deviance difference (if the models are nested) or the AIC (otherwise).


There are, however, some restrictions with regards to the possible


log-linear models we can choose from.
• The log-linear model must follow the hierarchical principle, which means
that if an interaction is needed in the model, then we must also include
the main effects of the variables in the interaction, together with their
lower-order interactions.
• Any fixed totals for the levels of any individual variables, or
combinations of levels of two or more variables, impose constraints on
the terms which must be included in a log-linear model.
Since we’re using the convention that we always include all of the main
effects in a model, there is only really one restriction that we need to be
careful with, namely:
◦ if the totals of combinations of the levels of two or more variables are
fixed, then the interaction for these variables needs to be included.
The final chosen model can tell us about the relationships between the
classifying variables. In this module, we only consider the relationships
between variables in three-way contingency tables, as summarised in
Table 25.
Table 25 Interactions included in the log-linear model and the associated
relationships

Interactions included Relationship


No interactions A, B, C mutually independent
A:B A and B jointly independent of C
A:C A and C jointly independent of B
B:C B and C jointly independent of A
A:B, A:C B and C conditionally independent, given A
A:B, B:C A and C conditionally independent, given B
A:C, B:C A and B conditionally independent, given C
A:B, A:C, B:C Uniform association between A, B, C


Contingency table data can also be modelled by logistic regression if one of


the classifying variables is binary and can be considered as a response
variable. In this case, the terms in a logistic regression model tell us about
the interactions that we should expect in a log-linear model for the same
data. This is summarised in Table 26.
Table 26 Relationships between terms included in a logistic regression model
and interactions included in a log-linear model for the same data

Interactions in logistic regression model ←→ Interactions in log-linear model
Main effect A, but no interaction ←→ Y:A
A:B ←→ Y:A:B
A:B:C ←→ Y:A:B:C
... ←→ ...

There are various pros and cons of using either logistic regression or a
log-linear model for a contingency table. The choice is often dictated by
the research question of interest and what we’d like to learn from the data.
A reminder of what has been studied in Unit 8 and how the sections link
together is shown in the following route map.

The Unit 8 route map

Section 1: The modelling problem
    ↓
Section 2: Introducing log-linear models for two-way contingency tables
    ↓
Section 3: Are the classifying variables in a two-way table independent?
    ↓
Section 4: Contingency tables with more than two variables
    ↓
Section 5: How are the classifying variables related?
    ↓
Section 6: Logistic and log-linear models


Learning outcomes
After you have worked through this unit, you should be able to:
• understand that a log-linear model takes the counts in a contingency
table as values of the response, and the classifying variables as the
explanatory variables
• understand that a log-linear model is a Poisson GLM with the canonical
log link
• interpret mosaic plots for two-way contingency tables
• appreciate that the log-linear model Y ∼ A + B assumes that A and B
are independent
• understand why the log-linear model Y ∼ A + B + A:B is the saturated
log-linear model for a two-way contingency table
• use the residual deviance of Y ∼ A + B to test whether A and B are
independent
• understand how the log-linear model can be extended to model
contingency tables with more than two classifying variables
• appreciate that choosing a log-linear model involves finding a log-linear
model which is simpler than the saturated log-linear model, but also fits
the data adequately
• compare the fits of log-linear models using the deviance difference and
the AIC
• understand and use the hierarchical principle
• understand which terms must be included in a log-linear model when
totals are fixed in a contingency table
• interpret the final model for a three-way contingency table in terms of
what the model tells us about the relationships between the classifying
variables
• understand when a contingency table can be modelled by logistic
regression
• appreciate the link between a logistic regression model and a log-linear
model for the same contingency table data
• produce a two-way contingency table in R
• obtain a mosaic plot for a two-way contingency table in R
• fit a log-linear model in R
• use R to test whether two classifying variables in a two-way contingency
table are independent
• use stepwise regression in R to choose a log-linear model and interpret
what the chosen model tells us about the relationships between the
classifying variables. (A minimal sketch of this R workflow follows this list.)
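The following is a minimal sketch of the R workflow listed above, using
hypothetical data (the data frame and object names are mine, not the
module's):

# Build a two-way contingency table and its mosaic plot
df <- data.frame(A = factor(rep(c("a1", "a2"), each = 50)),
                 B = factor(rep(c("b1", "b2"), times = 50)))
tab <- table(df$A, df$B)
mosaicplot(tab, main = "A by B")

# Fit the log-linear model assuming independence of A and B
counts <- as.data.frame(tab)   # columns Var1, Var2 and Freq
m <- glm(Freq ~ Var1 + Var2, family = poisson, data = counts)

# Test whether A and B are independent using the residual deviance
pchisq(deviance(m), df.residual(m), lower.tail = FALSE)

# Stepwise selection between the independence and saturated models
step(m, scope = ~ Var1 * Var2)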


References
Cathie Marsh Institute for Social Research (2019) ‘Living Costs and Food
Survey, 2013: Unrestricted Access Teaching Dataset’. 2nd edn. Available
at: https://doi.org/10.5255/UKDA-SN-7932-2 (Accessed: 9 September 2022).
de Jong, P. and Heller, G.Z. (2008) Generalized linear models for insurance
data. Cambridge: Cambridge University Press.
Mead, R., Curnow, R.N. and Hasted, A.M. (2002) Statistical methods in
agriculture and experimental biology. 3rd edn. London: Chapman and
Hall/CRC Press.
Paulino, C.D. and Singer, J.M. (2006) Análise de dados categorizados. São
Paulo: Edgard Blucher.
Tutz, G. (2011) Regression for categorical data. New York: Cambridge
University Press, Chapter 12.
YouGov (2019) ‘How Britain voted in the 2019 general election: YouGov
Survey Results’. Available at:
https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/wl0r2q1sm4/Results_HowBritainVoted_2019_w.pdf
(Accessed: 31 July 2022).

Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, for rent sign: © stockbroker / www.123rf.com
Subsection 1.2, light bulb moment: © ismagilov / www.123rf.com
Subsection 2.2, hot drink: © Alena Ozerova / www.123rf.com
Subsection 2.2, magician: © andrey popov / www.123rf.com
Subsection 3.2, newborn baby: © Jozef Polc / www.123rf.com
Subsection 4.1, chameleon: © Andrey Gudkov / www.123rf.com
Subsection 4.2, fork in road: © varunalight / www.123rf.com
Subsection 4.2, sheep litter: © Aleksandarlittlewolf / Freepik
Subsection 4.5, polling station: © Peter Titmus / www.123rf.com
Subsection 5.2, health check: © Mark Bowden / www.123rf.com
Subsection 6.1, child dental flossing: © wckiw / www.123rf.com
Subsection 6.2, dental floss: © Oleksandr Rybitskyi / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.


Solutions to activities
Solution to Activity 1
Apart from some of the questions identified in the text preceding this
activity, some possible questions of interest include:
• Does the relationship between employment and incomeSource differ
according to gender?
• Does the relationship between incomeSource and gender differ
according to the different values of employment?

Solution to Activity 2
In a GLM for these data, we could treat the counts in the contingency
table as values of our response and the two categorical variables gender
and incomeSource as factor explanatory variables.

Solution to Activity 3
(a) There are four rows in the contingency table representing the four
categories of the variable employment, and there are two columns for
the variable gender. Therefore, K = 4 and L = 2.
(b) The value of y32 is the count given in the third row and second
column of the table – namely, 74. This is the count of individuals out
of the 5144 in the UK survey dataset for which the HRP is
unemployed and male.
(c) y+2 is the sum of the counts in the second column, which is the total
number of individuals in the UK survey dataset for which the HRP is
male – that is, 3156.
y4+ is the sum of the counts in the fourth row, which is the total
number of individuals in the UK survey dataset for which the HRP’s
employment status is inactive – that is, 2004.
y++ is the sum of all the counts in the table. The UK survey dataset
has 5144 observations, and so y++ is 5144.

Solution to Activity 4
There are K levels for variable A, and L levels for variable B. Therefore,
the contingency table will have K × L counts, and therefore we will have
K × L values of the response.


Solution to Activity 5
The responses are
Ykl , for k = 1, 2, . . . , K, l = 1, 2, . . . , L.
In Table 5, K = 4 (since there are four rows) and L = 2 (since there are
two columns). Therefore, the responses for modelling these data are: Y11 ,
Y12 , Y21 , Y22 , Y31 , Y32 , Y41 and Y42 .
Note that, although the UK survey dataset contains data on 5144
observations, when the data are represented by a contingency table with
the classifying variables employment and gender, there are only eight
responses which we wish to model.

Solution to Activity 6
Fixing Y++ to be n in advance imposes constraints on the cell counts in
the contingency table since all the cell counts must then sum to n. But, if
the cell counts must sum to n, then the individual cell counts can’t be
independent of each other (since if one count increases, for example, then
at least one cell count must decrease in order for the total to remain fixed
at n). What’s more, if a count is constrained, then it can’t be assumed to
have a Poisson distribution, since, from Box 1 in Unit 7, a Poisson random
variable Ykl takes unrestricted possible values ykl = 0, 1, . . . .

Solution to Activity 7
From Equation (6), the individual cell counts are calculated as
E(Ykl ) = exp(µ + αA zA + αB zB ).

For level 1 of A, zA = 0, and for level 1 of B, zB = 0. Therefore


E(Y11 ) = exp(µ).

For level 1 of A, zA = 0, and for level 2 of B, zB = 1. Therefore


E(Y12 ) = exp(µ + αB ).

For level 2 of A, zA = 1, and for level 1 of B, zB = 0. Therefore


E(Y21 ) = exp(µ + αA ).

And finally, for level 2 of A, zA = 1, and for level 2 of B, zB = 1. Therefore


E(Y22 ) = exp(µ + αA + αB ).

These are indeed the same expressions for the expected cell counts as given
in Table 8.


Solution to Activity 8
The expressions for the expected cell counts from Activity 7 are shown in
Table S1.
Table S1 Repeat of Table 8

                        Levels of variable B
Levels of variable A    1                  2
1                       exp(µ)             exp(µ + αB)
2                       exp(µ + αA)        exp(µ + αA + αB)

We can therefore calculate the fitted expected cell counts by substituting
the parameter estimates for the fitted model given in Table 10 into the
expressions in Table S1. We then have expressions for the fitted expected
cell counts as shown in Table S2.
Table S2 Expressions for the fitted expected cell counts for the log-linear model
count ∼ gender + incomeSource

                incomeSource
gender          earned                   other
female          exp(7.001)               exp(7.001 − 0.210)
male            exp(7.001 + 0.462)       exp(7.001 + 0.462 − 0.210)

Calculating these expressions, we get the fitted expected cell counts as
shown in Table S3.
Table S3 Fitted expected cell counts for the log-linear model
count ∼ gender + incomeSource

                incomeSource
gender          earned        other
female          1097.73       889.80
male            1742.37       1412.34
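As a quick arithmetic check, the fitted counts in Table S3 can be
reproduced in R by exponentiating the fitted linear predictors. This is a
minimal sketch using the rounded estimates quoted in Table S2 (the object
names are mine):

# Parameter estimates, rounded to three decimal places
mu <- 7.001        # baseline: female, earned
a_male <- 0.462    # gender takes level male
a_other <- -0.210  # incomeSource takes level other

fitted_counts <- matrix(
  exp(c(mu, mu + a_other, mu + a_male, mu + a_male + a_other)),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("female", "male"),
                  incomeSource = c("earned", "other")))
round(fitted_counts, 2)  # 1097.73, 889.80, 1742.37, 1412.34

The small differences from the values in Table S4 (in the solution to
Activity 9) arise only because the estimates used here are rounded to
three decimal places.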

Solution to Activity 9
The completed table is given in Table S4.
Table S4 Completed version of Table 12

incomeSource
gender earned other Total
female 1097.96 890.04 1988.00
male 1743.04 1412.96 3156.00
Total 2841.00 2303.00 5144.00

So, the row, column and overall totals do match the totals displayed in the
data table given in Table 2.


Solution to Activity 10
(a) Using the given fitted values,
p̂1+ × p̂+2 ≈ 0.386 × 0.448 ≈ 0.1729 ≈ 0.173 = p̂12
to three decimal places, as required.
(b) (i) Using the fitted values given in Table 13, p̂21 is
p̂21 = 1743.04/5144.00 ≈ 0.339.
(ii) The fitted marginal probabilities are calculated as
p̂2+ = 3156.00/5144.00 ≈ 0.614,
p̂+1 = 2841.00/5144.00 ≈ 0.552.
(iii) From parts (b)(i) and (b)(ii),
p̂2+ × p̂+1 ≈ 0.614 × 0.552 ≈ 0.3389 ≈ 0.339 = p̂21
to three decimal places, as required.
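These calculations can also be verified numerically. The sketch below is a
minimal R check using the fitted counts from Table S4 (the matrix layout
and object names are mine):

# Fitted counts under the independence model (Table S4)
fitted_counts <- matrix(c(1097.96, 890.04, 1743.04, 1412.96),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(gender = c("female", "male"),
                                        incomeSource = c("earned", "other")))

p_hat <- fitted_counts / sum(fitted_counts)  # fitted joint probabilities

# Under independence each joint probability is the product of its
# marginals, so this outer product reproduces p_hat (up to rounding)
outer(rowSums(p_hat), colSums(p_hat))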

Solution to Activity 11
The horizontal widths of the rectangles in the first and second columns –
that is, the rectangles associated with the two levels of incomeSource –
represent, respectively, the proportions of the total sample for which
incomeSource takes levels earned and other. Since the horizontal widths of
the rectangles in the earned column seem to be slightly larger than those in
the other column, it looks like the proportion of individuals taking earned
for incomeSource is slightly larger than the proportion taking other.
The vertical lengths of the rectangles in the first column represent the
proportions of individuals taking levels female and male for gender from
those with earned for incomeSource, while the vertical lengths of the
rectangles in the second column represent the proportions of individuals
taking levels female and male for gender from those with other for
incomeSource.
The vertical lengths of the rectangles for female are not the same in the
two incomeSource columns and, as such, might indicate a relationship
between gender and incomeSource. But, as in Example 4, it’s possible
that the differences are simply what we would expect with random
variation, and so these two variables could be independent.


Solution to Activity 12
Both of the mosaic plots in Figure 4 show large differences between the
vertical lengths of the rectangles for each level of the vertical variable
across the horizontal variable. As such, these mosaic plots suggest that
there is a relationship between these two variables and they do not seem
to be independent.

Solution to Activity 13
(a) (i) There is just one parameter associated with the ‘baseline mean’
term which is included in each linear predictor ηkl .
(ii) There are K levels of factor A. The baseline mean assumes that
A is level 1, which leaves (K − 1) parameters associated with the
other levels of A.
(iii) There are L levels of factor B. The baseline mean assumes that
B is level 1, which leaves (L − 1) parameters associated with the
other levels of B.
(iv) For the interaction term, since there are K levels of A and L
levels of B, there are (K × L) possible combinations of k and l.
However, since an interaction term is set to be zero when either
A or B take level 1, there are only (K − 1) × (L − 1) possible
combinations of k and l which will have associated parameters.
(b) The total number of parameters in model M2 is the sum of the
numbers of parameters associated with each of the terms in the model
– that is, the sum of the numbers of parameters identified in part (a).
So, using part (a), the number of parameters for model M2 is
1 + (K − 1) + (L − 1) + ((K − 1) × (L − 1))
= K + L − 1 + (KL − K − L + 1)
= KL.

(c) From part (b), there are KL parameters in model M2 . There are also
KL observations in the contingency table (because there are K levels
of A and L levels of B), and so, since the number of model parameters
equals the number of observations, the log-linear model M2 is the
saturated model.

Solution to Activity 14
(a) Using the formula given,
D(M2 ) = 2 × (l(saturated model) − l(M2 )).
But, from Activity 13, we know that M2 is the saturated model, and
so
D(M2 ) = 2 × (l(saturated model) − l(saturated model)) = 0
as required.


(b) From Equation (9),


deviance difference = D(M1 ) − D(M2 ).
But, from part (a), we know that D(M2 ) = 0. Therefore, for models
M1 and M2 ,
deviance difference = D(M1 ).

Solution to Activity 15
We know that if M1 is a good fit, then
D(M1 ) ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model.
Now, if the classifying variables A and B have K and L levels, respectively,
then the associated contingency table has KL cells, and therefore KL
observations.
Also, using the results from the solution to Activity 13 part (a), the
number of parameters in the log-linear model Y ∼ A + B is
1 + (K − 1) + (L − 1) = K + L − 1.
Therefore
r = KL − (K + L − 1)
= KL − K − L + 1
= (K − 1)(L − 1)
as required.

Solution to Activity 16
(a) The residual deviance should be compared to a χ2 (1) distribution,
since K = L = 2 and so (K − 1)(L − 1) = 1.
(b) The p-value is quite large, which indicates that the residual deviance
is not large, suggesting that the fitted model is an adequate fit to the
data. In turn, this means that we can conclude that gender and
induced are independent.


Solution to Activity 17
(a) There are four levels of employment, and so K = 4. Both of the
variables incomeSource and gender have two levels, and so L = 2
and S = 2.
So
K × L × S = 4 × 2 × 2 = 16
and there are indeed 16 counts in the contingency table.
(b) There are four rows in the contingency table representing the four
categories of the variable employment. There are four columns: the
first two columns represent the two genders when incomeSource takes
the value earned, while the last two columns represent the two
genders when incomeSource takes the value other.

Solution to Activity 18
(a) When there are four classifying variables, there are six two-way
interactions:
A:B, A:C, A:D, B:C, B:D, C:D,
four three-way interactions:
A:B:C, A:B:D, A:C:D, B:C:D,
and one four-way interaction:
A:B:C:D.

(b) Hence the saturated model for this contingency table is:
Y ∼ A + B + C + D + A:B + A:C + A:D + B:C + B:D + C:D
+ A:B:C + A:B:D + A:C:D + B:C:D + A:B:C:D.
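As a quick check of this enumeration, R's formula machinery can expand
the saturated model for us; this one-line illustration (my own, not module
code) lists all fifteen terms:

# A*B*C*D expands to the four main effects plus every two-, three-
# and four-way interaction: 4 + 6 + 4 + 1 = 15 terms in total
attr(terms(~ A * B * C * D), "term.labels")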

Solution to Activity 19
The log-linear model is a GLM, and we know from Subsection 5.1 of
Unit 7 that we can use the residual deviance of a proposed GLM to test
whether the proposed model is an adequate fit to the data.
So, we could use the residual deviance of the log-linear model M as a test
statistic to test whether M is an adequate fit to the data.


Solution to Activity 20
From Box 6, a log-linear model can be considered an adequate fit to the
data if the residual deviance is less than or equal to the associated degrees
of freedom.
Therefore, model M2 can be considered to be an adequate fit because its
residual deviance is 0.29, which is much less than the associated degrees of
freedom, which is 3. Model M4 can also be considered to be an adequate
fit because its residual deviance is 1.22, which is also less than the
associated degrees of freedom, which is 4.
In contrast, the residual deviances for models M1 and M3 are both very
large in comparison to their associated degrees of freedom, and so neither
of these models can be considered to be an adequate fit by the ‘rule of
thumb’.

Solution to Activity 21
(a) From Box 15 in Subsection 5.2 of Unit 7, we can compare the fits of
two nested GLMs using the deviance difference given by
deviance difference = D(M1 ) − D(M2 ),
where D(M1 ) and D(M2 ) denote the residual deviances for models
M1 and M2 , respectively.
A large deviance difference means that too much fit is lost by using
the more parsimonious model M1 , and we would therefore prefer M2 .
On the other hand, a small deviance difference means that there isn’t
much fit lost with the more parsimonious model M1 , and so we prefer
M1 .
(b) From Box 16 in Subsection 5.2 of Unit 7, we can use the values of the
AIC to compare the fits of two non-nested GLMs. The preferred
model is the model with the smallest AIC value.
(c) Stepwise regression can be used for selecting a GLM from a set of
alternatives, and therefore can also be used for selecting a log-linear
model.

Solution to Activity 22
(a) M2 is the model with all main effects and all three two-way
interactions, while M4 is the model with all main effects and two of
the two-way interactions. As such, M4 is nested within M2 .
Therefore, from Activity 21 part (a), we can use the deviance
difference between the two models to compare their fits.


(b) Since D(M4 ) = 1.22 and D(M2 ) = 0.29,


deviance difference = 1.22 − 0.29 = 0.93.
The associated degrees of freedom is the difference between the two
degrees of freedom associated with D(M4 ) and D(M2 ), namely
4 − 3 = 1.

(c) The deviance difference is 0.93 and the value of the associated degrees
of freedom is 1. Therefore, since 0.93 < 1 (that is, the deviance
difference is less than the associated degrees of freedom), the deviance
difference is small enough for us to conclude that M4 is an adequate
fit in comparison to M2 . We therefore prefer the more parsimonious
model M4 .
(d) We’d prefer the model with the smaller AIC value. We’d therefore
prefer model M4 in comparison to model M2 .
(As an aside, notice that both the deviance difference and the AIC
values led us to the same conclusion for these data and models.
However, it is worth noting that these two methods don’t necessarily
always lead to the same conclusion.)
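If a formal test were wanted instead of the rule of thumb, the deviance
difference could be compared with a χ²(1) distribution, as in Unit 7; a
one-line check in R (my own illustration):

# P(chi-squared(1) > 0.93) is large, so no significant fit is lost
# by moving from M2 to the more parsimonious model M4
pchisq(0.93, df = 1, lower.tail = FALSE)  # approximately 0.33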

Solution to Activity 23
(a) Since the p-value is extremely small (p < 0.0001), which means the
residual deviance is very large, model M1 is not an adequate fit to the
data.
(b) Since the associated p-value is quite large (p = 0.265), which means
the residual deviance is not large, model M2 does seem to be an
adequate fit to the data.
(c) The only model in Table 19 with a large p-value is model M4 , which is
the model with the interaction litterSize:breed omitted. So, this
means that the fits of the two models M4
count ∼ litterSize + farm + breed
+ litterSize:farm + farm:breed
and M2
count ∼ litterSize + farm + breed
+ litterSize:farm + litterSize:breed
+ farm:breed
are not significantly different. So, if we drop the interaction
litterSize:breed, then the resulting model will still be an adequate
fit.
(d) Since each of the models with the main effects and only one of the
two-way interactions has a p-value which is extremely small, none of
these models is an adequate fit to the data. Therefore, model M4 is
the simplest model with an adequate fit to the data, and so M4 is the
preferable model to choose.

Solution to Activity 24
Model M1 is not hierarchical, because M1 includes the two-way
interactions A:C and B:C, which both involve C, but the model doesn’t
include the main effect C.
Model M2 is hierarchical, because the only interaction included in the
model is A:B, and both A and B are included as main effects in the model.
Model M3 is not hierarchical, because it includes the three-way interaction
A:B:C, but doesn’t include all of the lower-order interactions involving
these variables. In particular, the model doesn’t include the two-way
interactions A:C and B:C.

Solution to Activity 25
(a) If the totals for each level of B are fixed, then the main effect B needs
to be included in the model. In addition, we need to include the main
effects A, C and D, since we’re using the rule that all of the main
effects need to be included in the log-linear model. So, the simplest
possible log-linear model we could use is
Y ∼ A + B + C + D.

(b) If the totals for each combination of the levels of C and D are fixed,
then the interaction C:D needs to be included in the model. In
addition, we’re using the rule that all of the main effects need to be
included in the log-linear model. So, the simplest possible log-linear
model we could use is
Y ∼ A + B + C + D + C:D.

(c) If the totals for each combination of the levels of A, B and D are
fixed, then the interaction A:B:D needs to be included in the model.
Additionally, so that the model is hierarchical, we need to include all
lower-order interactions including A, B and D, as well as their main
effects. As usual, we also need to include all of the main effects. So,
the simplest possible log-linear model we could use is
Y ∼ A + B + C + D + A:B + A:D + B:D + A:B:D.

Solution to Activity 26
The corresponding model with no interaction for two-way contingency
tables is
Y ∼ A + B.
We already know that if this model fits the data adequately, then we
would conclude that the two variables are independent of each other. This
suggests that if our final model for a three-way contingency table is a
log-linear model with no interactions, then we should conclude that the
variables A, B and C are independent of one another.


Solution to Activity 27
For two-way contingency tables, if the interaction is required in the model,
then we conclude that the two variables are not independent. This suggests
that if the interaction A:B is in the three-way model, then A and B are not
independent of one another. However, there are no interactions associated
with C, which would suggest that both A and B are independent of C.

Solution to Activity 28
(a) There is the single two-way interaction A:C, but no interactions
associated with B, and so A and C are jointly independent of B.
(b) This time, there is the single two-way interaction B:C, but no
interactions associated with A, and so B and C are jointly
independent of A.

Solution to Activity 29
(a) Since the interaction A:B is in the model, then that suggests that A
and B are not independent, and since the interaction B:C is in the
model, then that suggests that B and C are not independent.
However, there are no interactions involving both A and C together,
and these are only related to each other through their relationships
with B. So A and C are conditionally independent, given B.
(b) Since the interaction A:C is in the model, then that suggests that A
and C are not independent, and since the interaction B:C is in the
model, then that suggests that B and C are not independent.
However, there are no interactions involving both A and B together,
and these are only related to each other through their relationships
with C. So A and B are conditionally independent, given C.

Solution to Activity 30
(a) This model includes all three of the two-way interactions, and so there
is uniform association between the variables.
(b) There are no interactions in this model, and so the variables A, B and
C are mutually independent.
(c) The interaction A:C is in the model, so that A and C are not
independent, and the interaction B:C is in the model, so that B and
C are not independent. However, there is no two-way interaction term
involving both A and B together, and these are only related to each
other through their relationships with C. So A and B are
conditionally independent, given C.
(d) The single two-way interaction A:B is in the model so that A and B
are not independent of each other, but there is no interaction
involving C. So, A and B are jointly independent of C.


Solution to Activity 31
The interaction employment:gender is in the model, so that employment
and gender are not independent, and the term employment:incomeSource
is in the model, so that employment and incomeSource are not
independent. However, there is no two-way interaction involving both
gender and incomeSource together, and these are only related to each
other through their relationships with employment. So, gender and
incomeSource are conditionally independent, given employment.

Solution to Activity 32
If we were to model this contingency table using a log-linear model, then
the cell counts would be taken as being values of the response, and the
classifying variables gender, age, frequency and able (and their
interactions) would be the possible explanatory variables.

Solution to Activity 33
The variable able is a binary variable, and so can therefore be modelled as
the response variable using logistic regression.

Solution to Activity 34
In logistic regression, each term included in the model indicates that the
associated explanatory variable affects the response variable.
Therefore, age and frequency both affect able, as does their interaction
age:frequency.
However, gender does not appear in the model, and so gender does not
affect able.

Solution to Activity 35
(a) Since the logistic regression model for these data contains the main
effects for A, B and C, the log-linear model for these data will contain
the three two-way interactions Y :A, Y :B and Y :C.
The logistic regression model also contains the two-way interaction
B:C, so the corresponding log-linear model will also contain the
three-way interaction Y :B:C.
(b) In the log-linear model, the interactions involving C are A:C, C:D
and A:C:D. Therefore, a logistic regression model for C would have
the main effects for A and D, and the two-way interaction A:D.
So, the corresponding logistic regression model for C is
C ∼ A + D + A:D.

Index

χ2 distribution 53
AIC 67, 162
Akaike information criterion 67
Bernoulli distribution 17
Bernoulli GLM 128
binary response 6
binomial distribution 150
binomial GLM 150, 154
   overdispersion 177
burns 15
canonical link function 134
cell probability (for contingency table) 214
chi-squared distribution 53
complementary log-log link function 30
conditionally independent 258
contingency table 208
   joint cell probability 214
   log-linear and logistic regression models 264, 266, 268
   marginal cell probability 214
   modelling strategy 210
   notation for three-way table 241
   notation for two-way table 211
   relationships (in three-way table) 259
   response variable 213
   three-way 209, 240
   two-way 209
   visualising the data 224
count response 108, 119
dataset
   Australian health insurance 260
   burns 15
   dental flossing 262
   epileptic seizures 182
   European companies 8
   GB companies 9
   leukaemia survival 142
   newborn births 237
   Philippines 40 to 80 122
   Philippines survey 110
   sheep litters 248
   UK election 255
   UK survey 207
dentalFlossing 262
deviance difference 62, 64, 158, 232
   testing 62
deviance residuals 71
diagnostic plots 72, 166
dispersion parameter 177
distribution
   Bernoulli 17
   binomial 150
   chi-squared 53
   exponential 143
   normal 71
   Poisson 113
europeanCompanies 8
exponential distribution 143
exponential family 129
exponential GLM 149
exposure 179
fitted GLM 128
fitted linear predictor 128
fitted mean response 137, 140
fitted success probability 46, 47
fixed totals in log-linear models 217, 252
gbCompanies 9
generalised linear model 3, 125, 127
GLM 125, 127
   assumptions 164
   diagnostic plots 166
   fitted mean response 137, 140
   for a Poisson rate 179, 181
   Poisson 216
   predicted mean response 139, 140
   prediction interval 140
   with binomial response 150, 154
   with exponential response 149
   with Poisson response 125
healthInsurance 260
hierarchical principle 250
identity link 134
interaction term 230, 242
inverse link function 135, 137
joint probability (for contingency table) 214
jointly independent 257
likelihood 50
linear predictor 126
   fitted 128
linear regression 115, 116
link function 119, 130
   canonical 134
   complementary log-log 30
   identity 134
   inverse 135, 137
   log 124, 134, 146
   logit 29, 134
   negative reciprocal 145
   probit 30
   properties 130, 132, 134
log link 124, 134, 146
log odds 34
log odds function 34
log-likelihood 51
log-linear model 217
   assuming independence 217
   choosing a model 243
   diagnostic plots 254
   fixed totals 217, 252
   interaction term 230, 242
   relationship with logistic regression 264, 266
   response variable 213
   saturated model 233, 241, 242
   testing for independence 231, 232, 234
   testing model fit 244
   variables not independent 230
logistic function 23
   direction of curve 26
   location of curve 25
   spread of curve 27
   standard 24
   steepness of curve 27
logistic regression 28, 115, 117
   assumptions 70
   comparing models 64, 67
   diagnostic plots 72
   fitted model 43
   for contingency tables 264, 268
   independence assumption 70
   interpreting the model 32, 42
   interpreting the regression coefficient(s) 35, 39, 41, 42
   linearity assumption 70
   model 28, 30
   overdispersion 177
   partial regression coefficients 41
   relationship with log-linear model 264, 266
   testing model fit 54
logit function 29
logit link 29, 134
logit link function 29
logit() 29
marginal probability (for contingency table) 214
mosaic plot 224
   interpretation 226
mutually independent 257
negative reciprocal link 145
nested models 61, 158
newbornBirths 237
normal distribution 71
null deviance 59
null model 50
number of successes response 149
odds 32, 34
odds multiplier 38, 40
odds ratio 37
offset 180
OR 37
   interpreting the odds ratio 38
overdispersion 177
   detection 178
   dispersion parameter 177
philippines 110
philippines40to80 122
Poisson distribution 113
Poisson GLM 128, 216
   assuming independence (in contingency table) 217
   modelling Poisson rates 179
   overdispersion 177
Poisson rate 179
Poisson regression 125
   overdispersion 177
predicted mean response 139, 140
predicted success probability 46, 47
probit link function 30
regression
   linear 115, 116
   logistic 115, 117
   Poisson 125
   stepwise 162
residual deviance 52, 55, 156
   rule of thumb 57, 157
   testing 53, 54
response
   binary 6
   binomial 150, 154
   count 108, 119
   exponential 149
   number of successes 149
   time between occurrences 142
saturated log-linear model 233, 241, 242
saturated model 50
seizures 182
sheepLitters 248
sigmoid function 24
standard logistic function 24
standardised deviance residuals 71
stepwise regression 162
success probability 17
   fitted value 46, 47
   predicted value 46, 47
survival 142
testing deviance difference 63, 64
testing for independence 231, 232, 234
   rule of thumb 236
testing log-linear model fit 244
   rule of thumb 245
testing residual deviance 54, 55
time between occurrences response 142
ukElection 255
ukSurvey 207
underdispersion 177
uniform association 259
variance–mean relationship 165
