M348 Applied Statistical Modelling - Generalised Linear Models
Book 2
Generalised linear models
This publication forms part of the Open University module M348 Applied statistical modelling. Details of this
and other Open University modules can be obtained from Student Recruitment, The Open University, PO Box
197, Milton Keynes MK7 6BJ, United Kingdom (tel. +44 (0)300 303 5303; email [email protected]).
Alternatively, you may visit the Open University website at www.open.ac.uk where you can learn more about
the wide range of modules and packs offered at all levels by The Open University.
Unit 6  Regression for a binary response variable

Introduction
The models considered so far in this module (namely, simple linear
regression, multiple regression and ANOVA) are all forms of linear models.
Although there has been variety in the type and number of explanatory
variables that we’ve been able to incorporate into these linear models, the
choice of response variable has been restricted to satisfying the conditions
that the response is both continuous (or at least not too discrete) and the
random variation of the response about the model can be assumed to be
normally distributed.
In practice, this normality assumption means that the response variable itself is usually (though not necessarily) approximately normally distributed. But what if the response variable is not even close to being approximately normal?
Is it still possible to build a statistical model then? Well, for many
response variables, yes it is! The focus of Units 6, 7 and 8 is the
development of statistical models for such non-normal response variables.
Collectively, these models are known as generalised linear models.
We’ll start in this unit by considering just one sort of non-normal response
variable: a binary response variable; that is, a response variable which can
only take one of two possible values.
The route map below shows how the sections of the unit link together.

Section 1: Setting the scene
Section 2: Introducing logistic regression
Section 3: Interpreting the logistic regression model
Section 4: Using the logistic regression model
Section 5: Assessing model fit
Section 6: Choosing a logistic regression model
Section 7: Checking the logistic regression model assumptions
Note that you will need to switch between the written unit and your
computer for Subsections 4.2, 5.4, 6.3 and 7.4.
1 Setting the scene
As you have seen in Activity 1 and Examples 2 and 3, there are many
examples of situations where datasets have a binary response variable. It is
therefore important that we have a method for modelling such responses!
To help with getting a feel for the data, Figure 1 shows the comparative
boxplot of the average wage, averageWage, separating the 28 companies
into those which participate in research and development (so that
resAndDev = 1), and those which do not participate in research and
development (so that resAndDev = 0).
Figure 1 Comparative boxplot showing the average wage for 28 companies
Figure 2 Scatterplot of averageWage and resAndDev (treating resAndDev
as a continuous variable)
From the scatterplot, we can see that there certainly appears to be some
sort of relationship between averageWage and resAndDev, with generally
higher values for averageWage when resAndDev = 1. But would a linear
relationship be appropriate? Let’s explore a few options for fitting a
straight line to these data.
Firstly, if we treat resAndDev as a continuous response (which happens to
take the numerical values 0 and 1), there is nothing to stop us from fitting
a linear regression model to these data. So, let’s try fitting the (linear)
model
resAndDev ∼ averageWage.
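A minimal sketch of fitting and plotting this line in R is given below, assuming the data are held in a data frame called gbCompanies (the data frame name is an assumption; the variable names follow the text).

# A sketch, assuming a data frame 'gbCompanies' (hypothetical name) with
# the variables 'resAndDev' (coded 0/1) and 'averageWage' from the text.
fit.lm <- lm(resAndDev ~ averageWage, data = gbCompanies)

# Scatterplot with the fitted straight line superimposed, as in Figure 3
plot(resAndDev ~ averageWage, data = gbCompanies)
abline(fit.lm)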
Figure 3 Scatterplot of averageWage and resAndDev with the fitted
regression line for the linear model resAndDev ∼ averageWage
We’ll consider the fit of the regression line shown in Figure 3 in the next
activity.
From Activity 3, it looks like the linear regression models that we’ve used
so far are not going to work with a binary response variable. However, if
we look at the scatterplot given in Figure 3 again, whilst the points do not
lie on the fitted linear regression line, they do all lie on one of the two lines
added to the scatterplot of resAndDev and averageWage shown in
Figure 4.
• The bottom line on the plot (resAndDev = 0) is a perfect fit for all of
the points relating to companies who do not engage in research and
development (but doesn’t fit any of the points relating to companies for
which resAndDev = 1).
• The top line on the plot (resAndDev = 1) is a perfect fit for all of the
points relating to companies who do engage in research and development
(but doesn’t fit any of the points relating to companies for which
resAndDev = 0).
Figure 4 Scatterplot of averageWage and resAndDev with two possible
fitted lines
Although we can fit each data point perfectly using the two lines shown in
Figure 4, having two fitted lines in our model like this isn’t much use to us!
For example, for an observation with a value of 30 for averageWage, which
of the two lines would we choose to predict resAndDev?
So, what about using shorter versions of both lines? For example, we could
consider a cut-off at averageWage = 40, say, so that all values of
averageWage less than or equal to 40 would use the bottom line for
predicting resAndDev, while all values of averageWage greater than 40
would use the top line for predicting resAndDev. In other words, we could
use the two fitted lines given by
resAndDev = 0 if averageWage ≤ 40 (this relates to the bottom line),
resAndDev = 1 if averageWage > 40 (this relates to the top line).
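In R, this ad hoc rule could be sketched in one line, assuming a vector averageWage; this is only to make the cut-off idea concrete, not a recommended model.

# A sketch of the ad hoc cut-off rule, assuming 'averageWage' is the
# vector of average wages: predict 0 below the cut-off and 1 above it.
predictedResAndDev <- ifelse(averageWage <= 40, 0, 1)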
These two shorter fitted lines are shown in Figure 5. Might these help us to
predict the value of resAndDev from a given value of averageWage better?
Figure 5 Scatterplot of averageWage and resAndDev with the two fitted
lines from Figure 4 shortened
Well, having the two shorter fitted lines shown in Figure 5 would certainly
be better than using the two full lines shown in Figure 4: there is now only
one line associated with each value of averageWage and only three data
points which don’t lie on one of the lines. This kind of approach – where
we move between the two lines for the different values of the explanatory
variable – therefore appears to give us a way forwards.
However, the cut-off value of 40 for averageWage was chosen fairly
arbitrarily, and so, if we are to use this type of approach, we need a
method of deciding how we can use the values of the explanatory variable
to move from one line to the other. To investigate this further, we’ll
explore another new dataset in the next subsection.
logArea survival
2.301 0
1.903 1
2.039 1
2.221 0
1.725 1
Source: Fan, Heckman and Wand, 1995
Figure 7 Scatterplot of survival and logArea
Now, from Activity 4 we know that the value of the explanatory variable
logArea seems to affect the value of the response survival. As such, pi ,
the success probability for patient i, depends on the value of logArea for
patient i, as demonstrated in the next activity.
We’ve seen in Activity 5 how the value of the explanatory variable for the
ith observation affects the value of the corresponding success probability pi
associated with the response Yi . In fact, it turns out that the success
probabilities (p1 , p2 , . . . , pn ) are key to building a model which allows us to
use the value of the explanatory variable to predict the response. This is
because the value of the success probability pi can guide our prediction for
the response Yi as follows.
• If pi > 0.5, then it is more likely that Yi is a success than a failure and
so it would be sensible to predict success for Yi .
• If pi < 0.5, then it is more likely that Yi is a failure than a success and
so it would be sensible to predict failure for Yi .
What’s more, the success probabilities show us what our fitted regression
line, on which predictions are based, should look like. The reason for this
is as follows.
Right back in Unit 1, we saw that in linear regression, all of the values
E(Y1 ), E(Y2 ), . . . , E(Yn ) lie along the regression line. So, we should also
expect that E(Y1 ), E(Y2 ), . . . , E(Yn ) lie along the regression line when we
have a binary response. But, for a binary response, we know from Box 1
that
E(Yi ) = pi .
So, the success probabilities p1 , p2 , . . . , pn for the n binary responses should
also lie along the regression line. We can therefore use p1 , p2 , . . . , pn to
show us what our fitted regression line should look like.
2 Introducing logistic regression
Figure 8 Scatterplot of logArea and survival, together with estimates of
the survival probabilities plotted at the midpoint of each corresponding
logArea interval
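A sketch of how such interval estimates might be computed in R is given below, assuming the data sit in a data frame called burns with the variables logArea and survival (the data frame name and the interval endpoints are assumptions).

# Group logArea into intervals of width 0.1 and estimate the survival
# probability in each interval by the observed proportion who survived.
breaks <- seq(1.2, 2.4, by = 0.1)               # assumed interval endpoints
interval <- cut(burns$logArea, breaks = breaks)
propSurvived <- tapply(burns$survival, interval, mean)

# Plot the estimated probabilities at the interval midpoints
midpoints <- breaks[-length(breaks)] + 0.05
plot(midpoints, propSurvived, xlab = "logArea", ylab = "survival")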
Figure 9 Scatterplot of logArea and survival, together with estimates of
the survival probabilities and the fitted linear regression line
A better fit to the estimated probabilities might be a curve with some sort
of elongated backwards S-shape, such as the one shown in Figure 10. This
curve certainly appears to offer a better fit than the fitted linear regression
line shown in Figure 9 did.
Figure 10 Scatterplot of logArea and survival, together with estimates of
the survival probabilities and a backward S-shaped curve
It turns out that for binary responses in general, the estimated success
probabilities across the values of an explanatory variable exhibit similar
S-shaped curves – either backward-facing, so that the curve is decreasing
as in Figure 10, or forward-facing, so that the curve is increasing. So, one
way forwards might be to use a curve to model the relationship between
the success probability and an explanatory variable.
In the next subsection, we shall introduce the logistic function which has
the required type of S-shaped curve to fit the success probabilities typically
associated with a binary response.
Figure 11 Logistic function f (x) for α = 0 and β = 1
What problems can you see with fitting the logistic function shown in
Figure 11 (where α = 0 and β = 1) to the survival probabilities in the
burns dataset? (It might be helpful to look back at Figure 8.)
From Activity 7, it is obvious that there are problems with using the
logistic function shown in Figure 11 for modelling the survival probabilities
in the burns dataset. However, Figure 11 is showing the logistic function
from Equation (1) for the specific values α = 0 and β = 1, and we can
overcome these problems simply by changing the values of these
parameters.
The effect that the parameter α has on the shape of the logistic function is
summarised in Box 2.
Figure 12 Illustrating the effect on the logistic function of changing the
value of α
In the next activity, you will consider the effect on the location of the
curve in Figure 11 for different values of α.
Keeping the value of β fixed at 1, describe how the location of the curve in
Figure 11 changes when:
(a) α = 2
(b) α = −4
Now we’ll consider the effect that the parameter β has on the shape of the
logistic function; this is summarised in Box 3.
Figure 13 Illustrating the effect on the logistic function of changing the sign of β (curves shown for β = −1 and β = +1)
Figure 14 Illustrating the effect on the logistic function of changing the magnitude of β (curves shown for β = 0.25, β = 1 and β = 4)
In the next activity, you will consider the effect on the slope and spread of
the curve in Figure 11 for different values of β.
Keeping the value of α fixed at 0, describe how the slope and spread of the curve in Figure 11 change when:
(a) β = 2
(b) β = −0.5
From Boxes 2 and 3, the logistic function allows both increasing and
decreasing relationships to be modelled and can be flexible with respect to
both the scale and range of x. By adjusting the values of the parameters
α and β, we can directly model the relationship between the success
probability pi (for the ith response Yi ) and the explanatory variable xi by
using an equation of the form
pi = 1 / (1 + exp(−(α + βxi))).  (2)
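A short R sketch of Equation (2) can make Boxes 2 and 3 concrete: plotting the function for a few values of α and β reproduces the behaviour described there.

# The logistic function of Equation (2)
logistic <- function(x, alpha, beta) 1 / (1 + exp(-(alpha + beta * x)))

x <- seq(-10, 10, length.out = 201)
plot(x, logistic(x, 0, 1), type = "l", ylab = "f(x)")  # the curve in Figure 11
lines(x, logistic(x, 2, 1), lty = 2)     # alpha = 2: shifted two units left
lines(x, logistic(x, 0, -0.5), lty = 3)  # beta = -0.5: decreasing, more spread out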
So, now that we have found an equation suitable for modelling the
relationship between a binary response variable’s success probability and
an explanatory variable, we are ready to build a regression model for data
with a binary response! We shall introduce such a model next.
But how does this help us? Well, if we now take the logarithm of both
sides in Equation (3), we end up with the equation
log(pi / (1 − pi)) = α + βxi  (4)
and this equation has the familiar form of a simple linear regression model
on the right-hand side. (Hooray!) Note that from this point onwards, when
we use the logarithm function in M348 we always mean log to base e.
Although Equation (4) has a familiar form on the right-hand side,
unfortunately the left-hand side certainly doesn’t look like the left-hand
side of a linear model. However, by looking at the left-hand side more
closely, we’ll find that Equation (4) does in fact have a similar form to a
linear model!
To see this, recall that for our binary response variable we have E(Yi ) = pi .
So, Equation (4) can be rewritten as
log(E(Yi) / (1 − E(Yi))) = α + βxi.  (5)
But, we also know from Unit 1 that for the simple linear regression model
we have
E(Yi ) = α + βxi . (6)
The only real difference between Equations (5) and (6) is the fact that
Equation (5) has a function of E(Yi ) on the left-hand side, rather than just
E(Yi ).
The function of E(Yi ) in Equation (5) is known as the logit function,
where the term ‘logit’ is short for ‘logistic unit’. The logit function is often
denoted by logit(), so that
logit(E(Yi)) = log(E(Yi) / (1 − E(Yi))).
(The logit function is in fact the inverse of the logistic function when α = 0
and β = 1.) So, using the logit function, Equation (5) can be written as
logit(E(Yi )) = α + βxi
and equivalently, Equation (4) can be expressed as
logit(pi ) = α + βxi .
When modelling a binary response, the logit function plays a special role
in the model and therefore has a special name – the logit link function,
often shortened to logit link. This is because the logit function links the
expected value of the response variable E(Yi ) (which in the case of a
binary response is the success probability pi ) with the linear component of
the model (that is, α + βxi ).
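As an aside, R provides the logit function and its inverse directly: qlogis() computes log(p/(1 − p)) and plogis() computes the logistic function, as this small check illustrates.

p <- 0.75
qlogis(p)          # logit(p) = log(0.75/0.25), about 1.099
plogis(qlogis(p))  # applying the inverse (logistic) function recovers 0.75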
We are now in a position to be able to specify the logistic regression model
for a binary response variable with just one covariate explanatory variable,
as described in Box 4.
Note that it is in fact possible to use other link functions for binary
response variables, and if you explore more regression models for binary
responses in your work or through further reading, you might also
encounter the probit link function and the complementary log-log link
function. In this unit, however, we shall only be using the logit link
function, since it is the one most commonly used with binary response
variables and its results are relatively easy to interpret.
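In R, the link is specified through the binomial family when fitting the model with glm(); a hedged sketch follows, reusing the burns data frame assumed earlier (the logit link is the default, so link = "logit" could be omitted).

# Logistic regression for the burns data (logit link)
fit.logit <- glm(survival ~ logArea,
                 family = binomial(link = "logit"), data = burns)

# The probit and complementary log-log links mentioned above are
# requested in the same way:
fit.probit  <- glm(survival ~ logArea,
                   family = binomial(link = "probit"), data = burns)
fit.cloglog <- glm(survival ~ logArea,
                   family = binomial(link = "cloglog"), data = burns)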
Whilst we are currently using the logit link function for binary response
data, in Unit 7 you will use other link functions for different types of
response variables. In each case, the link function links E(Yi ) to a linear
function of the explanatory variable(s); these models (including the logistic
regression model) are known collectively as generalised linear models. We
shall return to looking at generalised linear models in Unit 7.
Simple linear regression and logistic regression (as specified in Box 4) both
have one covariate explanatory variable, but each assumes a different
distribution for the response. In the next activity, we’ll consider what
other similarities and differences the two models have.
3 Interpreting the logistic regression model
(a) The odds can take values between 0 and ∞. Explain why this is so.
(b) What value of the success probability gives a value 1 for the odds?
(c) What does a value of odds less than 1 tell us about the success
probability? What about a value of odds greater than 1?
You will see shortly that the odds and log odds are key to interpreting the
logistic regression model.
Now, the value of pi will depend on the value of xi . So, let’s define
p(Y = 1 | x = 0) to be the probability that an email is spam (Y = 1),
given that it does not contain the phrase ‘Call free’ (x = 0), so that
when xi = 0,
pi = P (Y = 1 | x = 0).
Similarly, define p(Y = 1 | x = 1) to be the probability that an email
is spam (Y = 1), given that it does contain the phrase ‘Call free’
(x = 1), so that when xi = 1,
pi = P (Y = 1 | x = 1).
Then, substituting values of pi and xi into our model, when xi = 0 we
have that
log(p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0))) = α

and when xi = 1, we have that

log(p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1))) = α + β.
Therefore
β = log(p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1))) − log(p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0))).
In other words, the coefficient β represents the difference between:
• the log odds for emails with x = 1 (emails containing the phrase
‘Call free’), and
• the log odds for emails with x = 0 (emails without the phrase ‘Call
free’).
So β represents the difference between the log odds when there is a
unit increase in x.
Using this knowledge, we see that the difference between the log odds is
the same as the log of the ratio of the two odds – that is,
β = log(p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1))) − log(p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0)))
  = log( [p(Y = 1 | x = 1) / (1 − p(Y = 1 | x = 1))] / [p(Y = 1 | x = 0) / (1 − p(Y = 1 | x = 0))] )
  = log( (odds when x = 1) / (odds when x = 0) ).
The ratio of odds in this equation is, perhaps unsurprisingly, called the
odds ratio and is often written simply as OR. The equation for β
therefore becomes
β = log(OR).
The odds ratio is a useful measure of the association between the
explanatory variable and the response variable, as demonstrated next in Example 7 and Activity 14.
Suppose that the probability that an email is spam given that it has at
most two spelling mistakes is 0.1, and the probability that an email is
spam given that it has three or more spelling mistakes is 0.55.
Calculate the odds ratio that an email is spam for the explanatory variable
x2 , and interpret its value.
So, how does the value of the OR help us to interpret a logistic regression
model? Well, we know that
β = log(OR),
which means that
OR = exp(β).
So, the value of exp(β) is the OR, and we know how to interpret the OR!
The quantity exp(β) is known as the odds multiplier, because it is the
number that the odds are multiplied by when increasing the value of x by
one unit.
Of course, the spam email example only had a single binary explanatory
variable, and we would really like an interpretation for β when we have a
more general explanatory variable. Luckily, we can extend the idea for a
general covariate x. To do this, let’s consider a logistic regression model
when x takes some value w, say, and when this value is increased by 1 to
the value w + 1. Let p(Y = 1 | x = w) and p(Y = 1 | x = w + 1) denote the
success probabilities when the explanatory variable x takes values w and
w + 1, respectively. Then, when x = w,
log(p(Y = 1 | x = w) / (1 − p(Y = 1 | x = w))) = α + βw

and when x = w + 1,

log(p(Y = 1 | x = w + 1) / (1 − p(Y = 1 | x = w + 1))) = α + β(w + 1).
The difference in these two log odds is then
log(p(Y = 1 | x = w + 1) / (1 − p(Y = 1 | x = w + 1))) − log(p(Y = 1 | x = w) / (1 − p(Y = 1 | x = w)))
  = (α + β(w + 1)) − (α + βw) = β.  (9)
So, as we have just seen, when we have a more general covariate, β still
represents the difference in log odds (or the log of the odds ratio) given a
unit increase in the explanatory variable. As such, β has the same
interpretation for any covariate x.
Now, the odds multiplier exp(β) tells us how the odds change for a unit
increase in x. But what about if we’d like to know how the odds change for
an increase in x of c units, for some value c? To answer this, let’s consider
the logistic regression model when we increase x from w to w + c (instead
of increasing x from w to w + 1). In this case, Equation (9) becomes
log(p(Y = 1 | x = w + c) / (1 − p(Y = 1 | x = w + c))) − log(p(Y = 1 | x = w) / (1 − p(Y = 1 | x = w)))
  = (α + β(w + c)) − (α + βw) = cβ.
So
exp(cβ) = odds ratio for an increase of c units in x, (10)
that is, exp(cβ) is the number that the odds of success are multiplied by for
an increase in x of c units.
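These odds multipliers are one-line calculations in R; the following sketch uses the illustrative values β = 0.7, and β = −0.2 with c = 10, which also appear in Activity 15 below.

exp(0.7)              # beta = 0.7: odds multiplied by about 2.014 per unit increase
exp(10 * (-0.2))      # beta = -0.2, c = 10: odds multiplied by about 0.135
1 / exp(10 * (-0.2))  # equivalently, odds decreased by a factor of about 7.389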
Activity 15 Interpreting β
In logistic regression, the coefficient for the indicator variable for the kth
level of the factor is interpreted as the difference in log odds between the
kth level of the factor and the first level of the factor, with all other
explanatory variables held fixed. Therefore, the odds multiplier of the
coefficient for the kth level of the factor is interpreted to be the OR for the
kth level of the factor relative to the first level of the factor. We’ll consider
this idea further in Activity 17.
Note that in Activity 17 there was just the one explanatory variable A, but
there was more than one regression coefficient because A was a factor.
Logistic regression models can be extended in a natural way to include any
number of covariates and factors as the explanatory variables. As for the
linear models with any number of factors and covariates that we studied in
Unit 4, the interpretation of the regression coefficients in logistic regression
assumes that the values of any other explanatory variables in the model
are fixed.
Interpreting a logistic regression model with more than one regression
coefficient is summarised in Box 8.
So, we have introduced the logistic regression model for a binary response in Sections 1 and 2, and we have discussed how to interpret the resulting model in this section. Now we're ready to use logistic regression!
4 Using the logistic regression model
The logistic regression model for the burns dataset just had one covariate.
Next we’ll revisit the OU students dataset and consider a logistic
regression model with more than one explanatory variable.
Table 5 Parameter estimates for the fitted model

Parameter          Estimate
Intercept           −4.273
bestPrevModScore     0.099
age                 −0.020
qualLink maths      −0.696
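A sketch of turning these estimates into a fitted probability is shown below; the student's values are illustrative (they match those used in Activity 20), and plogis() is R's logistic function.

# Linear predictor for a student aged 36 with bestPrevModScore 92 and a
# non-maths-linked qualification (illustrative values):
eta <- -4.273 + 0.099 * 92 - 0.020 * 36
plogis(eta)   # fitted probability of passing, about 0.984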
5 Assessing model fit
Because the fit of our proposed model must lie somewhere between the
worst-fit model (the null model) and the best-fit model (the saturated
model), we know that L(proposed model) must lie somewhere between
L(null model) and L(saturated model). So, comparing the likelihoods of
these three models can tell us about how well (or not!) the proposed model
fits the data. In particular, we shall introduce a measure of fit in
Subsection 5.2 which is based on the idea that, by comparing
L(proposed model) with L(saturated model), we can assess how much fit
to the data is lost by fitting the proposed model. This is illustrated in
Figure 15.
[Figure 15: a likelihood scale running from the smallest possible likelihood to the largest possible likelihood; L(proposed model) lies somewhere between L(null model) and L(saturated model), and the gap between L(proposed model) and L(saturated model) is the measure of lost fit due to the proposed model.]
Instead of comparing model likelihoods for measuring the lost fit of the
proposed model, for mathematical reasons it is actually more convenient to
compare log-likelihoods (that is, the log of the likelihoods). We’ll denote
the log-likelihood for a model by
l(model) = log(L(model)).
Since the log transformation is a monotonic increasing transformation, a
high value of L(model) will produce a high value of l(model), and so we
can still use the log-likelihoods for comparing model fits, but with the
added bonus that the maths works nicely!
So, how can we use the chi-squared distribution to assess the fit of a
logistic regression model? Well, it turns out that, if the proposed model is
a good fit, then the residual deviance D is approximately distributed as a
χ2 (r) distribution, which is denoted as D ≈ χ2 (r) (in this module, the ≈
symbol means ‘is approximately distributed as’), with
r = number of parameters in the saturated model
− number of parameters in the proposed model. (14)
This distributional result is in fact the reason why D is defined as twice
the difference in the log-likelihoods.
Now, we’ve already noted that a large value of D indicates that the
proposed logistic regression model may not be a good fit. So, since
D ≈ χ2 (r) when the model is a good fit, if the value of D falls in the upper
tail of the χ2 (r) distribution, then this suggests that D is larger than we’d
expect, and so the proposed model loses too much fit. This is illustrated in
Figure 16.
[Figure 16: the density curve of the χ2(r) distribution, with values of D in its upper tail indicating that the proposed model loses too much fit.]
Calculating the value of the degrees of freedom r for the residual deviance
for a particular logistic regression model is illustrated in the next activity.
(b) The residual deviance for the model given in part (a) is calculated to
be D = 581.31, and the associated p-value is approximately 1. Does
this suggest that the proposed model is a good fit to the data?
In Activity 24, we saw that we can use the value of D and the associated degrees of freedom r to informally assess the fit of the proposed model.
This leads us to the general ‘rule of thumb’ for informally assessing the fit
of logistic regression models given in Box 12.
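In R, the residual deviance and its degrees of freedom can be extracted from a fitted glm object, and the p-value is the upper tail of the corresponding chi-squared distribution; a sketch follows, reusing the assumed fit.logit model from earlier.

D <- deviance(fit.logit)      # residual deviance
r <- df.residual(fit.logit)   # its degrees of freedom
pchisq(D, df = r, lower.tail = FALSE)   # p-value for the fit
D <= r                                  # the rule of thumb in Box 12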
The null deviance is the residual deviance for the null model. As such, if
the null model is a good fit to the data, then the null deviance will follow a
χ2 distribution.
We’ll consider the null deviance for the burns dataset next in Activity 28.
The null deviance is used as a measure of how much fit there is without
any explanatory variables in the model. It is also commonly used for
assessing the amount of fit gained by a proposed model in comparison to a
model with no explanatory variables, by considering the difference between
the residual deviance for the proposed model and the null deviance. This is
part of a wider strategy of comparing the fits of logistic models; this is the
subject of the next section. But first, we’ll complete this section by
using R to assess model fit.
6 Choosing a logistic regression model
Nested logistic regression models are illustrated in the next example.
Now, D(M1 ) and D(M2 ) give measures of the fit lost for models M1 and
M2 , respectively, in comparison to the ‘perfect fit’ saturated model.
Therefore, the difference between D(M1 ) and D(M2 ) gives a measure of
the fit lost due to using the smaller model M1 in comparison to using the
larger model M2 . This measure is called the deviance difference and is
defined as
deviance difference = D(M1 ) − D(M2 ).
Figure 17 illustrates D(M1 ), D(M2 ) and the deviance difference.
[Figure 17: a log-likelihood scale showing l(M1), l(M2) and l(saturated model); D(M1) is twice the distance from l(M1) to l(saturated model), D(M2) is twice the distance from l(M2) to l(saturated model), and the deviance difference is twice the distance from l(M1) to l(M2).]
Notice that from Figure 17 it looks like the deviance difference is twice the
difference between the log-likelihoods for the larger model M2 and the
smaller model M1 . This is indeed the case, which we can see
mathematically:
deviance difference = D(M1) − D(M2)
                    = 2 × (l(saturated model) − l(M1)) − 2 × (l(saturated model) − l(M2))
                    = 2 × (l(M2) − l(M1)).
Activity 30 considers what a small or large deviance difference means.
Using the deviance difference to compare the fits of nested logistic regression models is summarised in Box 15.
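In R, the deviance difference for two nested glm fits is computed by anova(), and test = "Chisq" supplies the chi-squared comparison; a sketch follows (the extra explanatory variable age is hypothetical).

# M1 nested within M2: M2 adds a (hypothetical) extra covariate 'age'
M1 <- glm(survival ~ logArea, family = binomial, data = burns)
M2 <- glm(survival ~ logArea + age, family = binomial, data = burns)
anova(M1, M2, test = "Chisq")   # deviance difference and its p-value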
Note that, like the likelihood, the value of the AIC is not an absolute
measure of the fit of a model. As such, the AIC can only be used to
compare the fits of a set of possible models. This, however, means that if
all of the potential models are a bad fit, then we won’t know this from the
values of the AIC.
The AIC is used for selecting the ‘best’ logistic regression model from a
group of alternatives in exactly the same way as it was used for selecting
the ‘best’ multiple regression model in Unit 2. The AIC is used to compare
two logistic regression models in Activity 33.
When using linear regression in this module, we have been using stepwise
regression as an automated procedure for selecting which explanatory
variables should be included in our model, and at each stage in the
stepwise regression procedure, the AIC has been used to compare the
model fits. Stepwise regression can be used in exactly the same way in
logistic regression; we shall see stepwise regression in action using R in the
next subsection.
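As a preview of that subsection, stepwise selection for a logistic regression uses the same step() function as for linear models; a sketch is below, assuming a data frame ouStudents and a binary response pass (both names are assumptions).

full <- glm(pass ~ bestPrevModScore + age + qualLink,
            family = binomial, data = ouStudents)
step(full)   # adds/drops terms, comparing fits by the AIC at each stage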
7 Checking the logistic regression model assumptions
You do not need to know the formula for calculating di for this module,
since R automatically calculates the deviance residuals when fitting a
logistic regression model. Basically, each deviance residual di is analogous
to the residual ri in multiple linear regression, and each deviance residual
can be thought of as the contribution to the residual deviance D for the
ith data point.
If the model assumptions for the logistic regression model are adequate,
then the deviance residuals should be approximately normally distributed
(provided that the number of observations for each combination of factor
levels is not small). In particular, this means that the standardised
deviance residuals should approximately follow the standard normal
N (0, 1) distribution. (As a reminder, a variable can be ‘standardised’ by
transforming it so that it has mean 0 and variance 1.)
There are several plots of the deviance residuals which can be used to
highlight any potential problems with the fitted model. We shall introduce
you to four of these plots briefly in Subsection 7.3. However, be warned!
For the logistic regression model, many of these plots look rather odd due
to the fact that the response variable is binary.
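For reference, a sketch of how the four plots of Subsection 7.3 might be produced in R is given below, reusing the assumed fit.logit object; rstandard() returns the standardised deviance residuals for a glm, and 2 arcsin √µ̂ is assumed to be the transformation used on the horizontal axis of the first plot.

sdr <- rstandard(fit.logit, type = "deviance")  # standardised deviance residuals
mu  <- fitted(fit.logit)                        # fitted probabilities

plot(2 * asin(sqrt(mu)), sdr)   # residuals against transformed fitted values
plot(sdr)                       # residuals against index
plot(sdr^2)                     # squared residuals against index
qqnorm(sdr)                     # normal probability plot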
Figure 18 Plot of the standardised deviance residuals against a transformation of µ̂ for the burns dataset
In the plot shown in Figure 18, there are two distinct ‘lines’ of points.
These lines are due to the fact that the response variable is binary.
The upper line of positive values corresponds to data with Yi = 1 and
the lower line of negative values corresponds to the data with Yi = 0.
The pattern of lines is then a result of the fact that the fitted logistic
curve runs between 1 and 0, whereas the Yi values are always exactly
1 or 0. The points follow on from each other smoothly because they
are ordered by the estimated values of p̂i (= µ̂i).
The most useful thing to concentrate on in Figure 18 is the smoothed
red line. If the linearity assumption is fine, then the smoothed red line
should be roughly a horizontal straight line. In the plot shown, the
red line is fairly constant with only a slight curve, and so the linearity
assumption seems reasonable.
We saw that the plot of the standardised deviance residuals against the
transformation of µ̂i resulted in two ‘lines’ of points in Example 14. This
type of pattern in the standardised deviance residuals is typical for this
plot in logistic regression.
As mentioned in Example 14, the smoothed red line is the most useful
thing to concentrate on in the plot. Ideally, we want this line to be roughly
a horizontal straight line: this would indicate that the linearity assumption
is reasonable. Curvature in the smoothed red line could indicate that the
model may need a different link function, rather than log(pi /(1 − pi )), or
that a transformation of one of the explanatory variables may be needed.
Next, in Activity 34, we’ll use this type of plot to check the assumption of
linearity for a logistic regression model fitted to data in the GB companies
dataset.
Figure 19 Plot of the standardised deviance residuals against a transformation of µ̂ for the GB companies dataset
Figure 20 Plot of standardised deviance residuals against index for the burns dataset
In the plot shown in Figure 20, we can see two groups of points. This
is again due to the fact that the response variable is binary.
We want to check that the standardised deviance residuals fluctuate
randomly in the plot. However, the spread of the negative deviance
residuals appears to increase as the index number increases. Since the
data concern individual patients, it is unlikely that the observations
are not independent. However, given that a pattern has been seen in
the plot, it would be worth checking how the observations were
recorded.
We’ll look at the plot of the standardised deviance residuals against index
for the GB companies dataset next.
Figure 21 Plot of standardised deviance residuals against index for the GB companies dataset
The plot is shown for the logistic regression model fitted to the burns
dataset in Example 16.
Figure 22 Squared standardised deviance residuals against index for the burns dataset. The red circles denote observations for which Yi = 1 and the blue triangles denote observations for which Yi = 0.
Figure 23 Squared standardised deviance residuals against index for the GB companies dataset. The red circles denote observations for which Yi = 1 and the blue triangles denote observations for which Yi = 0.
Figure 24 Normal probability plot of the standardised deviance residuals for the burns dataset
Finally, we shall consider the normal probability plot for the logistic
regression model fitted to the GB companies dataset in Activity 37.
Figure 25 Normal probability plot for the GB companies dataset
Now that we have introduced the four diagnostic plots that we’ll be using
for logistic regression, we can move onto using R to obtain the plots.
Summary
In this unit, we have developed logistic regression for modelling binary
variables.
For binary responses Y1 , Y2 , . . . , Yn , it is assumed that for each
i = 1, 2, . . . , n,
Yi ∼ Bernoulli(pi ), 0 < pi < 1,
where pi is the probability of success for the ith observation. Logistic
regression then models the relationship between E(Yi ) = pi and the
explanatory variables.
In the model, a link function links E(Yi ) to a linear combination of the
explanatory variables. The link function used in M348 (and which is also
most commonly used) is the logit link function, which is defined as
logit(E(Yi)) = log(E(Yi) / (1 − E(Yi))).
Our regression model is then
log(E(Yi) / (1 − E(Yi))) = linear function of explanatory variables.
One of the reasons that the logit link function is popular for logistic
regression is because of the interpretability of the model. Let β denote the
regression coefficient for a covariate x. Then
β = log(OR),
where OR is the odds ratio given a unit increase in x, so that,
OR = (odds of success when x takes value w + 1) / (odds of success when x takes value w).
The odds ratio can be interpreted as follows.
• If OR > 1, then the odds of success increases by a factor equal to the
value of the OR when x is increased by one unit.
• If OR < 1, then the odds of success decreases by a factor equal to the
value of the reciprocal of the OR when x is increased by one unit.
A logistic regression model is then interpreted through the result that
OR = exp(β).
The quantity exp(β) is known as the odds multiplier. It tells us about the
relationship between Y and x, because it is the number that the odds are
multiplied by for a unit increase in x.
When there is more than one explanatory variable, each odds multiplier is
partial and assumes that the other explanatory variables are fixed for a
unit increase in the associated explanatory variable. For a factor, there’s
an odds multiplier associated with each of the indicator variables for the
levels of the factor: so, for the kth level of the factor, the odds multiplier is
interpreted as the OR for the kth level of the factor relative to the first
level of the factor.
Once we have a fitted logistic regression model, we are often interested in p̂i, the fitted success probability for Yi, and p̂0, the predicted success probability for a new response Y0. These are obtained easily by using

p̂i = exp(fitted linear function of explanatory variables) / (1 + exp(fitted linear function of explanatory variables)).
The fit of the proposed logistic regression model can be assessed using the
residual deviance D, where
D = 2 × (l(saturated model) − l(proposed model)).
If the proposed model is a good fit, then
D ≈ χ2 (r),
where
r = n − number of parameters in proposed model.
A useful ‘rule of thumb’ is
• if D ≤ r, then conclude that the model is a good fit.
The fits of two nested logistic regression models M1 and M2 , with M1
nested within M2 , can be compared using the deviance difference, where
deviance difference = D(M1 ) − D(M2 ).
If both M1 and M2 are a good fit, then
deviance difference ≈ χ2 (d),
where
d = difference in the degrees of freedom for D(M1 ) and D(M2 ).
The AIC can be used to compare non-nested models; we choose the model
with the smallest AIC.
In order to check the model assumptions in logistic regression, diagnostic
plots focus on deviance residuals – these are analogous to the standard
residuals used in linear regression. If the logistic regression model
assumptions are reasonable, then the standardised deviance residuals are
approximately distributed as N (0, 1).
A reminder of what has been studied in Unit 6 and how the sections link
together is shown in the following route map.
Section 1: Setting the scene
Section 2: Introducing logistic regression
Section 3: Interpreting the logistic regression model
Section 4: Using the logistic regression model
Section 5: Assessing model fit
Section 6: Choosing a logistic regression model
Section 7: Checking the logistic regression model assumptions
Learning outcomes
After you have worked through this unit, you should be able to:
• explain why linear regression is not appropriate for modelling a binary
response variable
• understand how modelling the success probability is key to building a
model for binary responses
• appreciate the role of the logit link function in logistic regression
• interpret a logistic regression model
• obtain fitted and predicted success probabilities for given values of the
explanatory variable(s)
• assess the fit of a logistic regression model
• compare the fits of two logistic regression models, both in the case of
nested models and non-nested models
• identify if there are problems with the assumptions of logistic regression
• fit a logistic regression model in R
• use R to predict success probabilities
• use R to assess model fit
• compare model fits in R
• use stepwise regression for logistic regression in R
• use R to produce diagnostic plots for logistic regression.
References
Bureau van Dijk (2020) Amadeus. Available at:
https://www.open.ac.uk/libraryservices/resource/database:350727&f=33492
(Accessed: 22 November 2022). (The Amadeus database can be accessed
from The Open University Library using the institution login.)
Fan, J., Heckman, N.E. and Wand, M.P. (1995) ‘Local polynomial kernel
regression for generalized linear models and quasi-likelihood functions’,
Journal of the American Statistical Association, 90 (429), pp. 141–150.
doi:10.1080/01621459.1995.10476496.
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Introduction, contactless payment: © rh2010 / www.123rf.com
Subsection 1.1, driver’s alcohol level: © Stuart Pearcey / Dreamstime.com
Subsection 1.2.1, plastics: © Cassidy Karasek / www.freepik.com
Subsection 1.2.2, hospital sign: © Sherry Young / www.123rf.com
Subsection 2.2, flexible dancer: © Aleksandr Doodko / www.123rf.com
Subsection 2.3, man and dog: © damedeeso / www.123rf.com
Subsection 3.1, betting chips: © Rawf88 / Dreamstime.com
Subsection 3.2, spam email: © gilc / www.123rf.com
Subsection 3.2, woman extending: © Fizkes / Dreamstime.com
Subsection 3.3, excited: © Joe Caione / www.unsplash.com
Subsection 4.1, student passing: © sebra / Shutterstock
Subsection 5.2, degree of freedom: © Nirat Makjantuk / www.123rf.com
Subsection 6.1, Russian dolls: © Rui Santos / www.123rf.com
Subsection 6.1, fish bowls: © Oleg Dudko / Dreamstime.com
Subsection 6.1, χ2 relief: © Mykola Kravchenko / Dreamstime.com
Subsection 6.2, non-nested models: © chianim8r / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
Some examples of datasets with a binary response are given below. You
probably thought of different situations, since binary response variables are
common in many areas!
• There are several medical examples where the binary response variable is
the presence or absence of a disease. So, in this case, the two possible
values for the binary response could be ‘disease present’ and ‘disease
absent’.
This response could depend on a range of explanatory variables which
describe the symptoms of the patient, such as their blood pressure,
together with a range of demographic variables, such as age and gender.
• An ecological study may be interested in whether or not a particular
species is in an area. So, in this case, the two possible values for the
binary response could be ‘species present’ and ‘species absent’.
This response could depend on a range of explanatory variables which
describe the conditions of the area, such as the abundance of food, the
climate, the habitat and the presence of any predators in the area.
• A company might be interested in marketing a new type of phone, and a
binary response variable could indicate whether or not a person buys the
new phone. So, in this case, the two possible values for the binary
response could be ‘purchase new phone’ and ‘don’t purchase new phone’.
This response could depend on a range of explanatory variables such as
the demographics of the potential customer (their age, salary, and so on)
and certain aspects of the phone (even down to things such as the
phone’s size, weight and colour).
• An email client needs to decide whether each incoming email is spam or
not. So, in this case, the two possible values for the binary response
could be ‘spam’ and ‘not spam’.
Whether an email is categorised as spam or not could depend on
explanatory variables such as the number of typing errors in the email
and the number of times particular words or phrases occur (such as
‘offer’, ‘prize’ or ‘free gift’).
Solution to Activity 2
The box associated with companies which engage in research and
development (so that resAndDev = 1) lies to the right of the box
associated with companies which do not engage in research and
development (so that resAndDev = 0). Therefore, from the boxplot, it
appears that companies that pay a higher average wage tend to engage in
research and development, and so it is plausible that there is a relationship
between the two variables.
Solution to Activity 3
The fitted linear regression line does give the impression that companies
with a higher average wage are more likely to engage in research and
development. This is indicated by the increasing fitted regression line.
However, none of the observed points is particularly close to the fitted
linear regression line.
Furthermore, the fitted line doesn’t go through resAndDev = 1 for the
range of observed values for averageWage, which is a problem since many
of the observed values of resAndDev are, of course, equal to 1!
So, overall, the fitted linear regression line is not a good fit for these data.
Solution to Activity 4
(a) From the comparative boxplot in Figure 6, the patients that survived
appear to have a distribution of logArea which is more spread out
than the distribution of logArea for patients who didn’t survive,
especially for the lower values of logArea. So, it appears that
logArea does have some effect on survival, with, unsurprisingly, the
patients with greater burn areas appearing to have less chance of
survival. There is, however, overlap in the distributions of logArea
for the two groups of patients.
(b) It is not obvious from Figure 7 how two straight lines could model
these data. For logArea < 1.7 the line survival = 1 fits the data
perfectly. But it is not obvious how a single straight line would best
fit the data when logArea > 1.7.
Solution to Activity 5
(a) In Figure 7, all of the points take the value 1 for survival when
logArea is less than about 1.7. Therefore, since logArea = 1.223 for
patient 24, it looks like this patient is almost certain to survive, so
that the likely value of p24 is approximately 1.
(b) In Figure 7, survival takes both the value 1 and the value 0 when
logArea > 1.7. Therefore, since logArea = 2.301 for patient 1, it’s
possible that patient 1 may survive, but it’s also possible that they
may not survive. As such, it is likely that p1 < 1, and so p1 is likely to
be less than p24 .
Solution to Activity 6
(a) There are 66 patients whose value of logArea is in the interval 2.1 to 2.2, of which 35 survived. Therefore, the proportion who survived is

35/66 ≃ 0.530.
(b) From Box 1,
E(Y5 ) = p5 and E(Y7 ) = p7 .
Now, the fifth patient has logArea = 1.725 and so their value of
logArea is in the interval 1.7 to 1.8. From Table 4, the observed
proportion of patients surviving with values of logArea in this interval
is 0.971. So, using the observed proportions surviving as estimates of
the survival probabilities, we have the estimate E(Y5 ) = 0.971.
For the seventh patient, logArea = 2.295. So, their value of logArea
is in the interval 2.2 to 2.3. From Table 4, the observed proportion of
patients surviving with values of logArea in this interval is 0.125. So,
using the observed proportions surviving as estimates of the survival
probabilities, we have the estimate E(Y7 ) = 0.125.
(c) The estimated survival probabilities shown in Figure 8 do not appear
to lie on a single straight line. It may be possible to fit a straight line
through the points corresponding to values of logArea between about
1.9 and 2.3. For these points, the proportion of patients who survive
in each logArea interval rapidly decreases from around 1 to close to
0. However, the points are flat for lower values of logArea and also
(though less obviously) level off to 0 for higher values of logArea.
Solution to Activity 7
Some of the problems with using this curve for modelling the survival
probabilities are as follows.
• The relationship between the survival probabilities and logArea in the
burns dataset is decreasing, whereas the logistic function shown in
Figure 11 is increasing.
• The logistic function shown in Figure 11 is centred on x = 0, whereas
the S-shaped curve for survival probabilities is roughly centred on the
value logArea = 2.1.
• The value of the logistic function shown in Figure 11 is roughly 0 when
x is less than about −5 and roughly 1 when x is greater than about 5.
For the survival probabilities, we require the curve to be roughly 1 for
values of logArea less than about 1.6. We can’t see from Figure 8 what
happens to the curve for values of logArea greater than 2.4, but the
curve seems to be decreasing towards 0 for values of logArea greater
than about 2.4.
Solution to Activity 8
(a) When α = 2, the curve is positioned two units to the left of the curve
in Figure 11.
(b) When α = −4, the curve is positioned four units to the right of the
curve in Figure 11.
Although the question didn’t ask for sketches of the new curves, to
help you visualise them, both curves, together with the curve from
Figure 11, are shown in Figure S1.
Figure S1 Logistic function for α = 2, α = 0 and α = −4 when β = 1
Solution to Activity 9
(a) When β = 2, the curve is still increasing (since 2 > 0), but is steeper
and less spread out.
(b) When β = −0.5, the curve is now decreasing (since −0.5 < 0), and is
shallower and more spread out.
Although the question didn’t ask for sketches of the new curves, to
help you visualise them, both curves, together with the curve from
Figure 11, are shown in Figure S2.
Figure S2 Logistic function for β = 2, β = 1 and β = −0.5 when α = 0
Solution to Activity 10
(a) We have that

pi = 1 / (1 + exp(−(α + βxi))).

But, since exp(−z) = 1/exp(z), this can be rewritten as

pi = 1 / (1 + 1/exp(α + βxi)).
Solution to Activity 11
Probably the most obvious similarity is the fact that both models have the
same linear function of the explanatory variable (α + βxi ).
One of the obvious differences (that we’ve already discussed) is the fact
that E(Yi ) is equal to a linear function of the explanatory variable in the
simple linear regression model, but in the logistic regression model it is the
logit function of E(Yi ) = pi which is equal to a linear function of the
explanatory variable.
A more subtle difference is that in the simple linear regression model there
is an additive random term Wi , whereas the logistic regression model is
expressed directly in terms of the distribution of Yi (and there is no
additive random term).
Solution to Activity 12
(a) If the probability of a student passing a module is 0.9, then the odds
of the student passing is
odds = p/(1 − p) = 0.9/(1 − 0.9) = 9.
(c) If the odds of a company going into receivership is 0.6, then, using the
result from part (b), the probability that the company goes into
receivership is
p = odds/(1 + odds) = 0.6/(1 + 0.6) = 0.375.
Solution to Activity 13
(a) The success probability p takes a value between 0 and 1. Now, when p = 0,

odds = 0/(1 − 0) = 0

and when p = 1,

odds = 1/(1 − 1) = ∞.
For any value of p such that 0 < p < 1, the odds will be positive (since
it is the ratio of two positive numbers). Therefore, the odds can take
values between 0 and ∞.
(b) If odds = 1, then p = 1 − p, and so p = 1/2.
(c) The odds will be less than 1 if p < 1 − p. In this case, p < 1/2, and so success is less likely than failure.
The odds will be greater than 1 if p > 1 − p. In this case, p > 1/2, and so success is more likely than failure.
Solution to Activity 14
Let P (Y = 1 | x2 = 1) denote the probability that an email is spam
(Y = 1) given that it has at most two spelling mistakes (x2 = 1), and let
P (Y = 1 | x2 = 0) denote the probability that an email is spam (Y = 1)
given that it has three or more spelling mistakes (x2 = 0).
Then

odds when x2 is 1 = p(Y = 1 | x2 = 1) / (1 − p(Y = 1 | x2 = 1)) = 0.1/0.9 = 1/9

and

odds when x2 is 0 = p(Y = 1 | x2 = 0) / (1 − p(Y = 1 | x2 = 0)) = 0.55/0.45 = 11/9.

The odds ratio for x2 is therefore

OR = (odds when x2 is 1) / (odds when x2 is 0) = (1/9)/(11/9) = 1/11 ≃ 0.091,

so the odds that an email is spam when it has at most two spelling mistakes are 1/11 of the odds when it has three or more spelling mistakes; that is, the odds of being spam are decreased by a factor of 11.
Solution to Activity 15
(a) If β = 0.7, then the odds multiplier is

exp(0.7) ≃ 2.014.

This means that

OR ≃ 2.014

and so the odds of success are increased by a factor of 2.014 for a unit increase in the explanatory variable.
(b) If β = −1, then the odds multiplier is

exp(−1) ≃ 0.368.

So

OR ≃ 0.368

and the odds of success are increased by a factor of 0.368 for a unit increase in the explanatory variable. Or equivalently, the odds of success are decreased by a factor of

1/OR = 1/exp(−1) ≃ 2.718

for a unit increase in the explanatory variable.
(c) When β = −0.2,

exp(10 × (−0.2)) ≃ 0.135.

So, when β = −0.2, the odds of success are increased by a factor of 0.135 for an increase of 10 units in the explanatory variable. Or equivalently, the odds of success are decreased by a factor of

1/OR = 1/exp(−2) ≃ 7.389

for an increase of 10 units in the explanatory variable.
Solution to Activity 16
If β1 = 0.2, then the odds multiplier is

exp(0.2) ≃ 1.221.

So

OR ≃ 1.221

and the odds of success are increased by a factor of 1.221 for a unit increase in x1 (assuming x2 and x3 are both fixed).
If β2 = −2.5, then the odds multiplier is

exp(−2.5) ≃ 0.082.

So

OR ≃ 0.082

and the odds of success are increased by a factor of 0.082 for a unit increase in x2 (assuming x1 and x3 are both fixed). Or equivalently, the odds of success are decreased by a factor of

1/OR = 1/exp(−2.5) ≃ 12.182

for a unit increase in the explanatory variable (assuming x1 and x3 are both fixed).
If β3 = 1.5, then the odds multiplier is

exp(1.5) ≃ 4.482.

So

OR ≃ 4.482

and the odds of success are increased by a factor of 4.482 for a unit increase in x3 (assuming x1 and x2 are both fixed).
Solution to Activity 17
If the regression coefficient associated with the indicator variable for level 2
is −0.4, then the odds multiplier for this indicator variable is
exp(−0.4) ' 0.670.
So, the odds of success are increased by a factor of 0.670 for level 2 of A in comparison to the odds of success for level 1 of A. Or equivalently, the odds of success are decreased by a factor of
1/OR = 1/exp(−0.4) ≈ 1.492
for level 2 of A in comparison to the odds of success for level 1 of A.
If the regression coefficient associated with the indicator variable for level 3
is 3, then the odds multiplier for this indicator variable is
exp(3) ≈ 20.086.
So, the odds of success are increased by a factor of 20.086 for level 3 of A in comparison to the odds of success for level 1 of A.
Solution to Activity 18
(a) The regression coefficient related to bestPrevModScore is 0.099.
Therefore the odds multiplier for a unit increase in
bestPrevModScore is exp(0.099) ≈ 1.104. This means that, for a unit
increase in bestPrevModScore, the odds of passing the module
increases by a factor of 1.104 (assuming age and qualLink are fixed).
(b) The regression coefficient related to age is −0.020. Therefore the odds
multiplier for a unit increase in age is exp(−0.020) ≈ 0.980. This
means that, for a unit increase in age, the odds of passing the module
increases by a factor of 0.98, or equivalently, decreases by a factor of
1/OR = 1/exp(−0.02) ≈ 1.02
(assuming bestPrevModScore and qualLink are fixed).
(c) The regression coefficient related to level ‘maths’ of the factor
qualLink is −0.696. Therefore, the odds multiplier associated with a
student who is studying for a maths-based qualification compared to a
non-maths qualification is exp(−0.696) ≈ 0.499. This means that the
odds of passing the module for students studying a maths-based
qualification are 0.499 of the odds of passing the module for students
who are not studying a maths-based qualification (assuming
bestPrevModScore and age are fixed). In other words, the odds of
passing decreases by a factor of
1/OR = 1/exp(−0.696) ≈ 2.006
for maths students compared to non-maths students!
Solution to Activity 19
(a) For patient 2, logArea = 1.903. So, using Equation (11),
p̂2 = exp(22.223 − (10.453 × 1.903)) / (1 + exp(22.223 − (10.453 × 1.903))) ≈ 0.911.
So, the fitted probability of survival for patient 2 is 0.911.
(b) For patient 3, logArea = 2.039. So, using Equation (11),
p̂3 = exp(22.223 − (10.453 × 2.039)) / (1 + exp(22.223 − (10.453 × 2.039))) ≈ 0.713.
(c) Denote the predicted probability of survival for the new patient by p̂0 .
For this new patient, the value of logArea is 2.3. Once again using
Equation (11), we have
p̂0 = exp(22.223 − (10.453 × 2.3)) / (1 + exp(22.223 − (10.453 × 2.3))) ≈ 0.140.
So, the predicted probability of survival for this new patient is 0.140.
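These fitted probabilities can be checked in R; plogis() is R's built-in inverse logit function:

    eta2 <- 22.223 - 10.453 * 1.903   # fitted linear predictor for patient 2
    exp(eta2) / (1 + exp(eta2))       # approximately 0.911
    plogis(eta2)                      # the same value, via the inverse logit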
Solution to Activity 20
(a) We wish to calculate p̂11 , which we can do using Equation (12). For this student, bestPrevModScore takes the value 92, age = 36 and qualLink takes the value ‘not’. Substituting these values into Equation (12) and using the parameter estimates for the fitted model given in Table 5, p̂11 is calculated as
p̂11 = exp(−4.273 + (0.099 × 92) − (0.02 × 36)) / (1 + exp(−4.273 + (0.099 × 92) − (0.02 × 36))) ≈ 0.984.
The fitted probability of passing for this student is almost 1, and so
they are almost certain to pass the module.
(b) We wish to calculate p̂0 , which we can also do using Equation (12). For this student, bestPrevModScore takes the value 80, age = 60 and qualLink takes the value ‘maths’. So
p̂0 = exp(−4.273 + (0.099 × 80) − (0.02 × 60) − 0.696) / (1 + exp(−4.273 + (0.099 × 80) − (0.02 × 60) − 0.696)) ≈ 0.852.
So, the predicted probability that this student will pass the module is
0.852.
Solution to Activity 21
(a) If the residual deviance is small, the difference between the
log-likelihood of the saturated model and the log-likelihood of the
proposed model is small. Therefore, we are not losing much fit by
using the proposed model in comparison to the more complicated
saturated model. So, since there is not much fit to the data lost by
choosing the proposed model, our proposed model appears to be a
good fit.
(b) On the other hand, if the residual deviance is large, then there is a
noticeable loss in fit to the data when the simpler proposed model is
used. So, in this case, the proposed model does not appear to be a
good fit.
Solution to Activity 22
(a) There is a single α parameter, and q of the βj parameters. So the
number of parameters in the proposed model is q + 1.
(b) In the saturated model, each response Yi is essentially modelled by
the observed value of Yi , therefore the saturated model has n
parameters. In other words, the number of parameters in the
saturated model is equal to the number of observations.
(c) From Equation (14), the degrees of freedom is calculated as
r = number of parameters in the saturated model
− number of parameters in the proposed model.
So, from parts (a) and (b),
r = n − (q + 1).
Solution to Activity 23
(a) There are n = 1796 parameters in the saturated model (since there
are data for 1796 students in the dataset).
For the proposed model, there are four parameters: one intercept
parameter, one regression coefficient for each of the two covariates
(bestPrevModScore and age) and one parameter for the indicator
variable representing level ‘maths’ of qualLink.
Therefore, if the proposed model is a good fit, D ≈ χ2 (r), where
r = 1796 − 4 = 1792.
Solution to Activity 24
(a) If D < r, then using Result (15) we know that
D < E(D).
This means that the value of D is smaller than the value of the
expected residual deviance if the model is a good fit, and therefore the
observed value of D will be towards the left-hand side of the χ2 (r)
distribution. In this case, the proposed model is not losing much fit
(see, for example, the illustration given in Figure 16) and so it’s likely
that the model is a good fit.
(b) If D is much larger than r, then, again using Result (15), we know
that D is much larger than E(D), so that D is much larger than we’d
expect the residual deviance to be if the model was a good fit. In this
case, the proposed model is losing too much fit (again, see the
illustration given in Figure 16) and so, this time, it’s likely that the
model is a poor fit.
Solution to Activity 25
(a) For this model, D = 25.221 and r = 26. This means that D ≈ r, so
that D is close to the value we’d expect if the model is a good fit,
suggesting that the model is a good fit.
(b) For this model, D = 275.41 and r = 262. Here, D > r, so that D is
larger than we’d expect if the model is a good fit. It’s therefore
possible that the model is a poor fit, but it’s also possible that D isn’t
large enough for us to conclude that the model is a poor fit. So, in
order to make a decision regarding this model’s fit, we’d need to
calculate the p-value.
Solution to Activity 26
A small value of the null deviance means that not much fit is lost by using
the null model rather than the saturated model. If this is the case, then,
for parsimony, the data can be adequately described simply by their mean.
On the other hand, a large value of the null deviance means that a lot of
fit is lost by using the null model, and so the data need to be described by
a more complicated model than the null model.
Solution to Activity 27
Since the null deviance is the residual deviance for the null model, the null
model plays the role of the ‘proposed model’ in the residual deviance
formula. Therefore, if the null model is a good fit, then
null deviance ≈ χ2 (r),
where (from Equation (14))
r = number of parameters in the saturated model
− number of parameters in the proposed model.
Now, the number of parameters in the saturated model is n, while the null
model has just one parameter (the intercept). Therefore, the degrees of
freedom for the χ2 distribution associated with the null deviance is n − 1.
Solution to Activity 28
(a) If the null model is a good fit, then, since there are data for 435 patients,
null deviance ≈ χ2 (n − 1) = χ2 (434).
Now, we know that the mean of a χ2 distribution is equal to its
degrees of freedom. Therefore
E(null deviance) ≈ 434.
But we’ve observed the null deviance to be 525.39, which means that
the null deviance is larger than expected. As such, the null deviance
may be large enough to suggest that the null model is a poor fit (that
is, the data can’t be adequately described by the null model). To be
sure, we need to calculate the p-value associated with the null
deviance.
(b) The p-value is very small, and so we’d conclude that the null model is
a poor fit to the data, and so the data cannot be adequately described
by the null model.
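The p-value here can be computed in R from the upper tail of the χ2 (434) null distribution:

    pchisq(525.39, df = 434, lower.tail = FALSE)   # p-value for the null deviance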
Solution to Activity 29
Adding extra parameters into a model improves the model fit, and so the
larger model with more parameters (M2 ) must provide a better fit to the
data than the smaller model with fewer parameters (M1 ). As such, the fit
to the data lost due to M1 will be greater than the fit to the data lost due
to M2 . A greater loss in fit relates to a larger residual deviance, and so
D(M1 ) will be larger than D(M2 ).
Solution to Activity 30
(a) If the deviance difference is small, then there is not much fit lost when
using the smaller model M1 , which has fewer parameters. Therefore,
for parsimony, it would be wise to choose the smaller model M1 .
(b) On the other hand, if the deviance difference is large, then there is a
noticeable loss in fit when using the simpler model M1 – or
equivalently, there is a noticeable gain in fit when the extra
parameters of the larger model M2 are included. In this case, to avoid
losing too much fit, it is worth including these extra parameters and it
would be wise to choose the larger model M2 .
Solution to Activity 31
(a) Model M1 is nested within M2 , so
deviance difference = D(M1 ) − D(M2 )
= 588.82 − 586.54 = 2.28.
(b) Model M2 has one more parameter than M1 (for the covariate age),
and so this deviance difference is approximately distributed as χ2 (1).
We can also calculate the degrees of freedom as the difference between
the deviance degrees of freedom for M1 and M2 , namely
1794 − 1793 = 1.
Solution to Activity 32
Label the null model as M1 and the proposed model as M2 . Then
deviance difference = D(M1 ) − D(M2 )
= null deviance − D
= 678.51 − 678.30 = 0.21.
The degrees of freedom associated with this deviance difference is the
difference between the degrees of freedom of the χ2 null distributions for
D(M1 ) and D(M2 ), namely
1795 − 1794 = 1.
Here, the deviance difference (0.21) is quite a bit smaller than the degrees
of freedom (1), which means that the deviance difference is smaller than
the expected value, and therefore there is no evidence to suggest that M2
is better than M1 ; that is, there is no evidence to suggest that age is useful
for modelling modResult. (Note, however, that this doesn’t mean that age
won’t be useful for modelling modResult in combination with other
explanatory variables.)
Solution to Activity 33
Model M2 is better, since it has the lower AIC.
Solution to Activity 34
In this plot, the two ‘lines’ of points can be clearly seen. There is some
curvature in the smoothed red line, which might suggest that the link
function may not be appropriate, or a higher-order function of
averageWage may be needed in the model. However, the curvature may be
due to the fact that the dataset is small.
Solution to Activity 35
The positive standardised deviance residuals appear to have a larger
scatter than the negative ones do, although there doesn’t seem to be any
particular pattern to the standardised deviance residuals across the index.
As such, this plot doesn’t indicate any serious problems with the
independence assumption.
Solution to Activity 36
This plot allows us to easily compare the relative spreads for the
standardised deviance residuals for Yi = 1 and Yi = 0. The plot merely
confirms what we noted in Activity 35: namely, that one of the groups of
standardised deviance residuals appears to have a larger scatter than the
other group does, but there doesn’t appear to be a pattern to the
standardised deviance residuals across the index.
Solution to Activity 37
If the logistic regression model assumptions hold, then the standardised
deviance residuals should be approximately distributed as N (0, 1), so that
the points in Figure 25 should lie roughly along the straight line. However,
many of the points in the normal probability plot do not lie close to the
line, and the first ‘line’ seems to systematically deviate away from the line
in the middle of the plot. As such, this plot raises concerns that there
could be a problem with using the logistic regression model for these data.
Unit 7
Regression for other response variables
Introduction
As mentioned in the introduction to Unit 6, in Units 6 to 8 of this module
we’ll develop statistical models – known collectively as generalised linear
models, or GLMs for short – for modelling non-normal response variables.
In Unit 6, we focused on one particular type of generalised linear model for
modelling binary response variables: the logistic regression model. In this
unit, we shall consider modelling some other non-normal response
variables, and generalised linear models will be introduced more formally.
[Diagram: regression with a normal response variable (Unit 4) and regression with a binary response variable (Unit 6) both lead into regression for responses with other distributions (including Poisson, exponential, binomial).]
We’ll start in Section 1 by setting the scene for this unit. We’ll focus in
particular on a dataset with a count response variable, discussing why
linear regression isn’t ideal for modelling these data and which non-normal
distribution might be more suitable for the response.
Although generalised linear models are used for modelling non-normal
response variables, it turns out that the linear regression model (with its
assumed normal distribution for the response) is also a generalised linear
model. As such, linear regression models and logistic regression models
have some features in common which we can use as a framework for
building a model form suitable for modelling both normal and binary
response variables. We’ll see how this model form can also be used to
model count response variables; indeed, the model form can be used for
many non-normal responses and provides the basis for the generalised
linear model. Developing these ideas is the focus of Section 2.
The generalised linear model is formally defined in Section 3. In that
section, we’ll focus on using generalised linear models for modelling
responses with normal, Bernoulli and Poisson distributions only.
Generalised linear models are then used for modelling responses with
exponential and binomial distributions in Section 4. Assessing model fit
and choosing a GLM are the subject of Section 5, while the focus of
Section 6 is on checking the GLM model assumptions. The unit rounds off
with Section 7 by considering two common issues which can arise when
using generalised linear models in practice.
The following route map illustrates how the sections fit together for this
unit.
Section 1: Setting the scene
Section 2: Building a model
Section 3: The generalised linear model (GLM)
Section 4: GLMs for two more response variable distributions
Section 5: Assessing model fit and choosing a GLM
Section 6: Checking the GLM model assumptions
Section 7: Common issues in practice
Note that you will need to switch between the written unit and your
computer for Subsections 3.5, 4.3, 5.3, 6.3 and 7.3.2.
1 Setting the scene
For each of the following response variables, explain why linear regression might not be appropriate in a statistical analysis.
(a) The waiting times between the occurrence of serious earthquakes worldwide.
(b) The number of traffic accidents per year at a particular road junction.
(c) The number of insects surviving out of N insects exposed to an insecticide.
Well, for the waiting times between earthquakes mentioned in part (a) of
Activity 1, transforming the waiting times might well be an option. For
example, since the waiting times are positive and continuous, it would be
sensible to take logs of the waiting times so that the transformed values
are on a continuous scale between −∞ and +∞. However, it’s not clear
how the non-negative discrete count data in part (b) of Activity 1, nor the
non-negative integers between 0 and N in part (c) of Activity 1, might be
transformed to a continuous scale between −∞ and +∞. So, we need to
find an alternative way forwards.
As an alternative to transforming the response, we shall use a general
model which is suitable for modelling many non-normal responses, and
indeed, also normal responses; we shall introduce a framework for building
such a model in Section 2. But first, for the rest of this section we’ll look
more closely at the specific problem of modelling responses which are
counts: a situation where transforming the response so that we can use
linear regression is not an ideal option.
roof           region
mixed strong   central luzon
strong         central luzon
strong         central luzon
strong         central luzon
strong         central luzon
Source: Flores, 2017, accessed 26 June 2022
In this unit, we’ll take familySize as our response variable, and to keep
things simple, for now we’ll just consider one of the possible explanatory
variables, age. Then a question of interest in the analysis of the Philippines
survey dataset is whether the age of the family head helps to explain
family size. Can we use regression to help answer this question? (We’ll be
using the rest of the possible explanatory variables later in the unit.)
We’ll start by investigating (in Activity 2) whether linear regression might
be useful for modelling familySize.
Figure 2 Normal probability plot for the fitted linear regression
model familySize ∼ age
(a) By considering Figure 2, explain why the linear regression normality
assumption is questionable for these data.
(b) Figure 3 shows the relative frequency bar chart of the response
familySize from the Philippines survey dataset. Because regression
assumes that the response variable is influenced by the explanatory
variable(s), we wouldn’t expect a plot of the relative frequencies of
familySize to necessarily look like a typical plot of the distribution
being used to model the response. However, Figure 3 does highlight
some of the issues which make the assumption of normality for the
response variable familySize less than ideal for these data.
By considering Figure 3, explain why it is unlikely that the
assumption of a normal response will be ideal for these data.
Figure 3 A relative frequency bar chart of familySize
[Figure 4: bar charts of the probabilities P(Y = y) against y for two Poisson distributions, panels (a) and (b).]
The mean and the variance of a Poisson(λ) random variable are both
equal to the parameter λ, so that for Y ∼ Poisson(λ),
E(Y ) = V (Y ) = λ.
Figure 5 A side-by-side bar chart of the observed relative frequencies for
familySize and the expected relative frequencies assuming a Poisson
distribution fitted to the same data
By considering Figure 5, would the Poisson distribution or the normal
distribution be better as the distribution for the response familySize?
2 Building a model
In this unit, we’d like to develop a regression model which can
accommodate a variety of different distributions for the response.
So far in this module, we have used linear regression for modelling
responses which are assumed to follow a normal distribution, and logistic
regression for modelling responses which are assumed to follow a Bernoulli
distribution. So, in order to build a general model which can cope with
responses from many different distributions, we shall start by considering
the similarities between linear regression and logistic regression, so that we
can build a unified model form which can represent both linear regression
and logistic regression. We’ll then see that this unified model form can in
fact be used to model responses from a whole host of different distributions.
The strategy that we’ll use for building a model capable of accommodating
a variety of different distributions for the response is represented in
Figure 6.
[Figure 6: diagram of the model-building strategy, starting from the similarities between linear regression and logistic regression.]
In what way(s) are Equations (2) and (4) similar, and in what way(s) are
they different?
• In logistic regression
g(E(Yi )) = α + βxi ,
where g is the logit function such that
g(E(Yi )) = log(E(Yi )/(1 − E(Yi ))).
So, Equations (2) and (4) would have exactly the same form if we had a
function g(E(Yi )) instead of E(Yi ) in Equation (2) for linear regression;
we’ll consider which function g could be used for linear regression in
Activity 5.
In Unit 6, the logit function for logistic regression was referred to as the
logit link function. The reason for the term ‘link function’ is that the
function logit (E(Yi )) links the mean of the response E(Yi ) (which, in the
case of a binary response, is the success probability pi ) to the linear
component of the model (that is, α + βxi ). Similarly, in linear regression,
the identity function is also a link function, because it links the mean of
the response E(Yi ) to the linear component of the model. The idea of link
functions is summarised in Box 3.
Now that we have a model form which can accommodate both linear
regression and logistic regression, we shall see in the next subsection how
we can also use the same model form when we have a response from a
Poisson distribution by using a different link function.
Figure 7 Scatterplot of familySize against age
Figure 8 Scatterplot of sample means of familySize against age
The pattern of points in Figure 8 looks quadratic. There are ways that we
could model the relationship as such, but in this unit, we are focusing on
modelling linear relationships, and so instead we shall think in terms of
straight line segments. Here, we’ll focus on three possible line segments:
the line for values of age up to 40, the line for values of age between 40
and 80, and the line for ages over 80. These values have been chosen fairly
arbitrarily to coincide with the straight line segments which are roughly
apparent in Figure 8. We’ll consider these line segments in the next
activity.
Based on Figure 8, how does the linear relationship between the sample
mean of familySize and age for age < 40 differ from the linear
relationship between the sample mean of familySize and age for
40 ≤ age < 80 and for age ≥ 80?
Following on from Activity 6, let’s just consider the households for which
age is in the range from 40 to 80 years, so that we can focus on only one of
the linear relationships visible in Figure 8. This reduced dataset is
described next.
roof     region
strong   central luzon
strong   central luzon
strong   central luzon
strong   central luzon
strong   central luzon
Source: Flores, 2017, accessed 26 June 2022
Figure 9 Scatterplot of sample means of familySize against age for
households in the Philippines 40 to 80 dataset
Figure 10 Scatterplot of the logs of the sample means of familySize and
age, for households in the Philippines 40 to 80 dataset
So, we have used the unified model form representing both linear
regression and logistic regression to define Poisson regression (where the
response has a Poisson distribution). This model form can, in fact, be used
to specify regression models for responses from a whole host of
distributions, and it forms the basis of the generalised linear model. We
shall look at the generalised linear model more closely in the next section.
3 The generalised linear model (GLM)
We’ll start the section by formally defining the generalised linear model in
Subsection 3.1. Subsection 3.2 then considers link functions for the model,
and inverse link functions are introduced in Subsection 3.3. Subsection 3.4
looks at how fitted values and predictions are obtained for a GLM. Finally,
in Subsection 3.5, we’ll do some practical work and use GLMs in R.
This model form extends naturally to the case where there are multiple
factors A, B, . . . , Z and multiple covariates x1 , x2 , . . . , xq , so that the
regression equation has the general form
g(E(Yi )) = linear combination of the explanatory variables
for observation i.
The linear component of the regression equation is known as the linear
predictor for observation i, and is often denoted by ηi . (η is the Greek
letter eta.) The regression equation can therefore be written more
succinctly as
g(E(Yi )) = ηi .
As with all of the models that have been introduced in this module, the
notation can get messy when there are several explanatory variables. So,
when using a GLM to model a response Y with the factors A, B, . . . , Z and
covariates x1 , x2 , . . . , xq as explanatory variables, we’ll continue to use the
simpler notation
y ∼ A + B + · · · + Z + x1 + x2 + · · · + xq .
In doing so, however, we also need to be clear as to which distribution is
being assumed for the response and which link function is being used.
We’ll use the convention that a GLM for a response with a Poisson
distribution will be referred to as ‘a Poisson GLM’, and a GLM for a
response with a Bernoulli distribution will be referred to as ‘a Bernoulli
GLM’, and so on. (Of course, a Poisson GLM is also a Poisson regression
model, and a Bernoulli GLM is also a logistic regression model!) The
notation is illustrated in Example 2.
For a given dataset, the model parameters in the linear predictor (for
example, α and β in Example 1, and α, β1 and β2 in the solution to
Activity 8) are estimated using the method of maximum likelihood
estimation. We won’t go into the details of the mathematics behind how
the estimates are calculated in this module, since these parameter
estimates are easily obtained in R (as we saw when fitting logistic
regression models in Unit 6).
Once we have our parameter estimates, we can use their values in the
linear predictor to give us the fitted linear predictor, ηb, which is also
commonly referred to as the fitted model. The fitted linear predictor is
illustrated in Example 3 and Activity 9.
Consider once again the OU students dataset and the logistic regression
model of Activity 8:
modResult ∼ bestPrevModScore + age.
In the solution to that activity, we saw that the linear predictor for this
model is
ηi = α + β1 xi1 + β2 xi2 ,
where xi1 and xi2 are the values of bestPrevModScore and age,
respectively, for the ith student.
Given that the parameter estimates for this model are calculated to be approximately α̂ = −4.45, β̂1 = 0.09 and β̂2 = −0.01, write down the fitted linear predictor for this model.
Figure 11 The link function provides the link between E(Yi ) and ηi
[Figure 12: the link function g transforms the relationship between E(Yi ) and the explanatory variables into a linear relationship between g(E(Yi )) and the explanatory variables.]
Figure 13 Scatterplot of estimates of logit(E(Yi )) plotted at the
midpoint of each corresponding bestPrevModScore interval
Next, in Activity 10, we’ll consider the log link for a Poisson GLM.
So, we’d like our link function g to transform E(Yi ) so that the
relationship between g(E(Yi )) and the explanatory variables is linear. But
there’s also a second property that we’d like a link function g to have. The
range of possible values that E(Yi ) can take is often restricted (for
example, E(Yi ) might always be positive). However, there are no such
restrictions on the value that ηi can take.
Therefore, since
g(E(Yi )) = ηi ,
we need a link function which allows g(E(Yi )) to take any value. This is
illustrated in Figure 14 and Example 5.
Figure 14 We want the link function to allow E(Yi ) to take any value
The next activity looks at the effect of the log link function on the values
that E(Yi ) can take in a Poisson GLM.
One consequence of using GLMs for responses with distributions which are
in the exponential family is that certain link functions, called canonical
link functions, have a special status. They provide simplifications (which
we will not go into) for the theory and analysis of GLMs. All of the link
functions that you have seen so far in this unit are in fact canonical link
functions; these are listed in Table 3.
Table 3 Some canonical link functions
Response distribution   Canonical link function
Normal                  identity: g(E(Yi )) = E(Yi )
Bernoulli               logit: g(E(Yi )) = log(E(Yi )/(1 − E(Yi )))
Poisson                 log: g(E(Yi )) = log(E(Yi ))
Note that other link functions which are not canonical link functions can
be, and are, used in practice. This is because canonical link functions are
not always the most sensible link functions to use. You will meet an
example of such a non-canonical link function in Subsection 4.1 later in
this unit.
GLM link functions are summarised in the following box.
[Diagram: the link function g maps E(Yi ) to ηi via g(E(Yi )) = ηi , and the inverse link function maps ηi back to E(Yi ) via g −1 (ηi ).]
Figure 15 The link function and its inverse relating the mean response to
the linear predictor in a GLM
We’ll find the inverse link function for the logit link in the following
example.
We can follow the same approach in the next activity to obtain the inverse
link functions for the (canonical) link functions of GLMs for responses
with normal and Poisson distributions (that is, linear regression and
Poisson regression, respectively).
For each of the following models, find the inverse link function g −1 (ηi ) for
the canonical link function.
(a) A GLM with a normal response.
(b) A Poisson GLM.
Note that we can specify a GLM in terms of either the link function or the
inverse link function. For instance, for a Poisson GLM with a log link, we
could say that Yi ∼ Poisson(λi ) and E(Yi ) = exp(ηi ), instead of
log(E(Yi )) = ηi .
Inverse link functions are summarised in Box 7.
The canonical link functions that you’ve seen so far, together with their
inverse link functions, are summarised in Table 4.
Table 4 Some canonical link functions and their inverse link functions
Response distribution   Link function g(E(Yi ))              Inverse link function g −1 (ηi )
Normal                  identity: E(Yi )                     ηi
Bernoulli               logit: log(E(Yi )/(1 − E(Yi )))      exp(ηi )/(1 + exp(ηi ))
Poisson                 log: log(E(Yi ))                     exp(ηi )
Inverse link functions are important for calculating fitted and predicted
mean responses, and we shall see them in action for doing just that in the
next subsection!
[Diagram: the inverse link function g −1 takes the fitted linear predictor ηi to the fitted mean response E(Yi ) = µi = g −1 (ηi ).]
The age of the head of the first household in the dataset is 53. So, the
fitted linear predictor for the first household, η̂1 , is
η̂1 = 1.97 − (0.01 × 53) = 1.44.
From Table 4, the inverse link function for the log link is
g −1 (ηi ) = exp(ηi ).
So, using our model, the fitted mean response of familySize for the
first household is calculated as
µ̂1 = g −1 (η̂1 ) = exp(η̂1 ) = exp(1.44) ≈ 4.22.
We’ll calculate more fitted mean responses in the next two activities.
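In R, these two steps (fitted linear predictor, then inverse link) are exactly what predict() does for a fitted glm object. A minimal sketch, where fitPois is an assumed name for the fitted Poisson GLM of Example 7:

    # By hand, for the first household:
    eta1 <- 1.97 - 0.01 * 53   # fitted linear predictor
    exp(eta1)                  # fitted mean response, approximately 4.22
    # For a fitted glm object:
    # predict(fitPois, type = "link")      gives the fitted linear predictors
    # predict(fitPois, type = "response")  gives the fitted means g^{-1}(eta-hat)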
Following on from Example 7 and using the same fitted model, what is the
fitted mean response of familySize for the second household in the
Philippines 40 to 80 dataset, whose household head is aged 72?
Following on from Activity 14, the fitted linear predictor for a logistic
regression model for the binary response modResult with explanatory
variables bestPrevModScore and age is
η̂ = −4.45 + 0.09 bestPrevModScore − 0.01 age.
For a new student aged 49 with a value of 74.2 for bestPrevModScore,
what is the predicted mean response of modResult for this student?
So far in this unit, we’ve looked at the theory behind GLMs, focusing in
particular on GLMs for responses with normal, Bernoulli and Poisson
distributions. We will consider GLMs for responses with two other
non-normal distributions soon in Section 4, but before we do that, it’s
about time that we did some practical work on the computer!
You have actually already used R to fit a Bernoulli GLM when using logistic regression in Unit 6, and, as we now know from this unit, the linear regression models from Units 1 to 5 are also GLMs (with normal response variables). So, for now, we'll just focus on using R for Poisson GLMs.
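As a taster for the notebook activities, here is a minimal sketch of fitting a Poisson GLM with a log link in R; the data frame name philippines40to80 is an assumption matching the dataset described earlier:

    # Fit the Poisson GLM familySize ~ age with a log link
    fitPois <- glm(familySize ~ age,
                   family = poisson(link = "log"),
                   data = philippines40to80)
    summary(fitPois)   # parameter estimates, deviances and AIC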
4 GLMs for two more response variable distributions
survivalTime logWbc ag
65 7.7407 pos
156 6.6201 pos
100 8.3664 pos
134 7.8633 pos
16 8.6995 pos
Source: Feigl and Zelen, 1965
The exponential p.d.f. always has the same general shape, regardless
of the value of λ. Figure 17 shows the p.d.f. of an exponential
distribution with parameter λ = 1.
Figure 17 The p.d.f. of an exponential distribution with λ = 1
The parameter λ is an event rate, and is related to both the mean and
the variance of the distribution, so that
E(Y ) = 1/λ and V (Y ) = 1/λ².
Therefore,
V (Y ) = (E(Y ))².
Figure 18 A unit-area histogram of survivalTime together with an overlaid
fitted exponential distribution curve
From Box 6, there are two properties that we’d like a link function g to
have:
• we’d like a linear relationship between g(E(Yi )) and the explanatory
variables
• we’d like g(E(Yi )) to be able to take any value between −∞ and +∞ (to
match the possible values that ηi can take).
Explain why the canonical link function for an exponential GLM may not
be an ideal link function to use.
So, if the canonical link function is not ideal for an exponential GLM,
which link function should be used instead? Well, when we’re assuming an
exponential distribution for the response, the log link is commonly used
instead of the negative reciprocal link, so that
g(E(Yi )) = log(E(Yi )).
Activity 18 explores why this link might be more sensible.
Explain why the log link may be more sensible than the (canonical)
negative reciprocal link for an exponential GLM.
Since the log link is commonly used as the link function in an exponential
GLM, we shall try using it when modelling survivalTime.
We know from Activity 18 that the log link satisfies one of the properties
that we’d like our link function to have. But what about the other
property? Is it reasonable to assume a linear relationship between g(E(Yi ))
and the explanatory variables? We shall investigate this question in the
next activity.
Suppose that we wish to use data from the leukaemia survival dataset to
fit the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a log link.
Figure 19 shows a plot of log(survivalTime) and logWbc, with the
different levels of ag indicated. From this plot, does it look like it might be
reasonable to assume a linear relationship between g(E(Yi )) and the
explanatory variables?
Figure 19 Scatterplot of log(survivalTime) and logWbc, with the different
levels of ag indicated
So, it looks like an exponential GLM with a log link could be a good way
forward for the model
survivalTime ∼ logWbc + ag.
We shall consider this fitted model in the next activity.
Data from the leukaemia survival dataset were used to fit the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a log link.
The parameter estimates of the coefficients for the fitted model are given
in Table 6.
Table 6 Parameter estimates for the model survivalTime ∼ logWbc + ag, using
an exponential GLM with a log link
Parameter Estimate
Intercept 5.8154
logWbc −0.3044
ag pos 1.0176
(a) The factor ag has two levels: ‘pos’ and ‘neg’. Which level has been set to be level 1 in this fitted model?
(b) The first patient in the leukaemia survival dataset has a value of 7.7407 for logWbc and ‘pos’ for ag. What is the value of η̂1 , the fitted linear predictor for this patient?
(c) Hence, what is µ̂1 , the fitted mean response of survivalTime for this patient?
(d) The 18th patient in the leukaemia survival dataset has a value of 8.3894 for logWbc and ‘neg’ for ag. Calculate µ̂18 , the fitted mean response of survivalTime for this patient.
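Note that base R's glm() has no exponential family built in. A common workaround (a sketch, not necessarily how the module's notebooks proceed) is to fit a Gamma GLM with a log link and then fix the dispersion parameter at 1, since the exponential distribution is the special case of the Gamma with dispersion 1:

    # Exponential GLM as a Gamma GLM with dispersion fixed at 1
    # (the data frame name 'leukaemia' is an assumption)
    fitExp <- glm(survivalTime ~ logWbc + ag,
                  family = Gamma(link = "log"),
                  data = leukaemia)
    summary(fitExp, dispersion = 1)   # standard errors computed with dispersion 1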
[Diagram: for a binomial response Yi ∼ B(Ni , pi ), the mean response is E(Yi ) = Ni × pi , where Ni is known and pi is unknown; the model therefore links pi , rather than E(Yi ) directly, to the linear predictor ηi .]
Now it’s time to see a binomial GLM in action! For this, we’ll once again
consider data from the OU students dataset.
Figure 21 The normal probability plot for a linear regression model for
examScore selected using stepwise regression
Data from the OU students dataset were used to fit the model
examScore ∼ bestPrevModScore + age
using a binomial GLM with a logit link.
The parameter estimates for the coefficients for the fitted model are given
in Table 7.
Table 7 Parameter estimates for examScore ∼ bestPrevModScore + age, using
a binomial GLM with a logit link
Parameter Estimate
Intercept −3.3387
bestPrevModScore 0.0472
age −0.0040
(a) The first student in the dataset has a value of 89.2 for bestPrevModScore and 32 for age. What is the value of the fitted linear predictor, η̂1 , for this student?
(b) Calculate p̂1 , the fitted success probability for this first student.
(c) Hence, what is the fitted mean response, µ̂1 , for this student? Round your answer to the nearest integer.
(d) Rounding your answer to the nearest integer, use the fitted model to calculate the predicted mean response, µ̂0 , for a new student who is aged 64 and has a value of 79.2 for bestPrevModScore.
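In R, a binomial GLM with known Ni is specified with a two-column response giving successes and failures. A sketch, assuming examScore is a mark out of 100 and the data frame name ouStudents is illustrative:

    # Binomial GLM with a logit link: response is cbind(successes, failures)
    fitBin <- glm(cbind(examScore, 100 - examScore) ~ bestPrevModScore + age,
                  family = binomial(link = "logit"),
                  data = ouStudents)
    pHat  <- fitted(fitBin)   # fitted success probabilities p-hat
    muHat <- 100 * pHat       # fitted mean responses N x p-hat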
5 Assessing model fit and choosing a GLM
[Diagram: the χ2 (r) null distribution for the residual deviance D. If D is not large, the p-value is not small, indicating a good fit; if D is large, the p-value is small, indicating a poor fit.]
We’ll look at the model fit of two of the GLMs that we’ve considered so far
in this unit in Example 8 and Activity 22.
Next we’ll consider how we can compare the fits of two GLMs to help us
choose which explanatory variables should be included in our model.
[Diagram: the χ2 (d) null distribution for the deviance difference. If the deviance difference is not large, the p-value is not small, so there is no significant gain in fit for M2 and we choose M1 ; if the deviance difference is large, the p-value is small, so there is a significant gain in fit for M2 and we choose M2 .]
Next, we’ll use the deviance difference to compare the model fits of GLMs
for modelling survivalTime from the leukaemia survival dataset.
So far, for these data we’ve considered the model
survivalTime ∼ logWbc + ag
using an exponential GLM with a log link. Now, in Activity 22, we saw
that the p-value for the residual deviance, D, for this fitted model is 0.099.
So, since this p-value is quite large, we concluded in that activity that the
model was an adequate fit to the data.
But, do we need both of the explanatory variables in the model? We shall
consider this question in the next example and activity: we’ll investigate
whether logWbc is needed in the model in addition to ag in Example 9,
and then we’ll investigate whether ag is needed in the model in addition to
logWbc in Activity 23.
Consider once again the leukaemia survival dataset. The following GLMs
with an exponential response and log link were fitted to these data. (Note
that M2 is the same model M2 considered in Example 9.)
• Model M1 : survivalTime ∼ logWbc.
The residual deviance for this model, D(M1 ), is 47.808; the associated
approximate null distribution is χ2 (31).
• Model M2 : survivalTime ∼ logWbc + ag.
The residual deviance for this model, D(M2 ), is 40.319; the associated
approximate null distribution is χ2 (30).
(a) Calculate the deviance difference for the nested models M1 and M2 .
(b) What are the degrees of freedom of the χ2 null distribution for the
deviance difference that you calculated in part (a)?
(c) The p-value associated with the deviance difference is 0.009. What do
you conclude?
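This comparison is straightforward to reproduce in R, assuming fitted model objects with the illustrative names fitM1 and fitM2 for M1 and M2:

    # Deviance difference test for nested GLMs
    anova(fitM1, fitM2, test = "Chisq")
    # Or by hand:
    devDiff <- deviance(fitM1) - deviance(fitM2)   # 47.808 - 40.319 = 7.489
    d <- df.residual(fitM1) - df.residual(fitM2)   # 31 - 30 = 1
    pchisq(devDiff, df = d, lower.tail = FALSE)    # p-value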
In Example 8, we used the residual deviance to assess the fit of the model
familySize ∼ age
fitted to data from the Philippines 40 to 80 dataset using a Poisson GLM
with a log link. In that example, we concluded that this model was a poor
fit to the data, since the p-value for the residual deviance is close to 0.
This, however, doesn’t necessarily mean that age is not going to be useful
for modelling familySize. It’s possible, for example, that age is useful for
modelling familySize, but that the model is missing some other extra key
explanatory variables to improve the model fit.
So, to assess whether age is useful for modelling familySize, we’ll use the
deviance difference to compare the fit of our proposed model with the fit of
the null model.
Let the null model be M1 and our proposed model be M2 . The null model
is, of course, nested within the proposed model, so that M1 is nested
within M2 . The residual deviance of the null model, D(M1 ), is then 1795.8
with 1158 associated degrees of freedom, while the residual deviance of the
proposed model, D(M2 ), is 1731.8 with 1157 associated degrees of freedom.
(a) Calculate the deviance difference for the nested models M1 and M2 .
(b) What are the degrees of freedom of the χ2 null distribution for the
deviance difference that you calculated in part (a)?
(c) The p-value associated with the deviance difference is close to 0.
What do you conclude?
The deviance difference can only be used to compare the fits of nested
GLMs. When comparing the fits of non-nested GLMs, the AIC can
instead be used (as it was for non-nested models in logistic regression).
A reminder of the AIC was given in Box 16 in Subsection 6.2 of Unit 6.
The AIC works in exactly the same way for GLMs in general. The key
things to remember when assessing non-nested GLMs are summarised in
Box 16 and illustrated by comparing two non-nested GLMs in Activity 25.
When using both linear regression and logistic regression in this module,
we have been using stepwise regression as an automated procedure for
selecting which explanatory variables should be included in our model, and
at each stage in the stepwise regression procedure the AIC has been used to
compare the model fits. As you have probably guessed, stepwise regression
can be used in exactly the same way for GLMs in general; we shall see
stepwise regression for GLMs in action using R in the next subsection.
5.3 Using R to assess model fit and choose a GLM
In Notebook activity 7.3, we considered a Poisson GLM with a log link as
a model for predicting Olympic medals using data from the Olympics
dataset. In Notebook activity 7.6 in this subsection, we’ll use this model’s
residual deviance to assess the model’s fit.
We’ll then use R to compare the fits of GLMs in Notebook activity 7.7. In
this notebook activity, we’ll compare the fits of binomial GLMs for
modelling examScore from the OU students dataset so that we can decide
which of the two covariates bestPrevModScore and age should be included
in the model.
So far in this unit, when using a binomial GLM for modelling examScore
from the OU students dataset, we’ve only considered two possible
covariates – namely, bestPrevModScore and age. There are, however,
several other potential explanatory variables that we could use for
modelling examScore.
Now, in Unit 4, we used stepwise regression with the OU students dataset
to select the explanatory variables to use for modelling examScore using a
linear regression model. In Notebook activity 7.8, we’ll use stepwise
regression to choose the explanatory variables to include for modelling
examScore using a binomial GLM with a logit link. For comparison
purposes, we’ll consider the same full model form as was considered when
using stepwise regression for the linear regression model in Unit 4, namely,
our full binomial GLM for examScore will include the four explanatory
variables gender, qualLink, bestPrevModScore and age, together with all
of their possible interactions.
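As a sketch of what this stepwise search might look like in R (the data frame name ouStudents is assumed, and only the two-way interactions are written out here for brevity):

    # Full binomial GLM, then an AIC-based stepwise search
    fullBin <- glm(cbind(examScore, 100 - examScore) ~
                     (gender + qualLink + bestPrevModScore + age)^2,
                   family = binomial, data = ouStudents)
    stepBin <- step(fullBin)   # prints each step and returns the chosen model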
In the final notebook activity in this subsection, we’ll once again model the
Philippines 40 to 80 dataset. So far, we have modelled familySize using
only one explanatory variable – age. There are, however, data for several
other explanatory variables (as described in Subsection 1.2). In Notebook
activity 7.9, we shall use stepwise regression to select which explanatory
variables should be included when modelling familySize using a Poisson
GLM with a log link.
6 Checking the GLM model assumptions
The following example shows how the variance relates to the mean of a
response variable with a Bernoulli distribution.
The next two activities will look at the variance–mean relationships for
Poisson GLMs and for exponential GLMs.
Note that, since the variance of the response is assumed constant in linear
regression, if the variance of the response changes with changes in the mean
of that response, then linear regression may not be appropriate. Since the
assumption of constant variance can be relaxed for some GLMs, this
widens the variety of datasets for which GLMs can be applied in practice.
Response distribution   Transformation of µ̂
Normal                  µ̂
Bernoulli               2 arcsin(√µ̂)
Poisson                 2√µ̂
Exponential             2 log(µ̂)
Binomial                2 arcsin(√µ̂)
We’ll see this diagnostic plot in action in Example 11 and Activity 28.
[Figure: plot of the standardised deviance residuals against a transformation of µ̂ (Example 11).]
Figure 25 Plot of the standardised deviance residuals against a transformation of µ̂ (here 2 log µ̂) for modelling survivalTime
Does this plot indicate any problems with the model assumptions?
[Figure 26: plot of the standardised deviance residuals against index for modelling familySize.]
Figure 27 Plot of standardised deviance residuals against index for
modelling survivalTime
Does this plot indicate any problems with the independence assumption
for the model?
Figure 28 Plot of squared standardised deviance residuals against index
for modelling familySize. The red circles denote positive residuals and
the blue triangles denote negative residuals.
Figure 29 Plot of squared standardised deviance residuals against index for
modelling survivalTime. The red circles denote positive residuals and the
blue triangles denote negative residuals.
Does this plot indicate any problems with the independence assumption?
Figure 30 Normal probability plot of the standardised deviance
residuals for modelling familySize
The points in the normal probability plot are generally quite close to
the diagonal line, although there is some curvature at the ends of the
plot. On the whole, though, the assumption of normality of the
deviance residuals seems reasonable. This in turn means that the
assumption that the response distribution is Poisson also seems
reasonable.
Figure 31 Normal probability plot of the standardised deviance residuals for
modelling survivalTime
Does the normal probability plot indicate any problems with the
assumption that the response has an exponential distribution?
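The diagnostic plots of this section can be produced in R along the following lines, for a fitted glm object with the assumed name fitM (here a Poisson GLM, so the transformation of µ̂ is 2√µ̂):

    sdr <- rstandard(fitM)      # standardised deviance residuals
    muHat <- fitted(fitM)
    plot(2 * sqrt(muHat), sdr)  # residuals against the transformed fitted means
    plot(sdr)                   # residuals against index
    qqnorm(sdr); qqline(sdr)    # normal probability plot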
7 Common issues in practice
7.1 Overdispersion
When the variance of the observed data is larger than the model’s
variance, we say that there is overdispersion. This is a common problem
in practice when using Poisson GLMs or binomial GLMs. It is also
possible for the variance of the observed data to be smaller than the model's variance: in this case, there is underdispersion. However, underdispersion is a lot less common than overdispersion in practice, so we
shall only focus on overdispersion here.
For a GLM for Poisson responses Y1 , Y2 , . . . , Yn , so that, for i = 1, 2, . . . , n,
Yi ∼ Poisson(λi ), λi > 0,
the model mean and variance for Yi are driven by the single parameter λi , since
E(Yi ) = λi and V (Yi ) = λi .
The model mean and variance are also driven by a single parameter in a
binomial GLM. In this case, for i = 1, 2, . . . , n and Ni known,
Yi ∼ B(Ni , pi ), 0 < pi < 1,
with
E(Yi ) = Ni pi and V (Yi ) = Ni pi (1 − pi ).
As a result, the model variances for Poisson and binomial GLMs are
constrained and don’t always allow for the amount of variability which can
occur in real datasets.
To correct for this, an extra dispersion parameter, φ > 0 say, can be
introduced into the model which scales the model’s variance to fit the
observed data better, so that, for a Poisson response
V (Yi ) = φλi
and for a binomial response
V (Yi ) = φNi pi (1 − pi ).
If there is overdispersion, then φ > 1 so that the model’s variance is
increased. If φ = 1, then the observed and model variances are the same,
while if φ < 1, then there is underdispersion and the model’s variance is
decreased.
Introducing the dispersion parameter φ does not affect the maximum
likelihood estimates of the model parameters, and the parameter estimates
are the same values whether or not φ is in the model. What φ does affect,
though, is the standard errors of the parameter estimates. If there is
overdispersion which isn’t accommodated in the model, then the standard
errors of the parameters will be underestimated; introducing a dispersion
parameter into the model corrects this.
As you will discover in Subsection 7.3, it is very easy to use R to fit a
GLM with the extra dispersion parameter φ. But how do we know that
there is a problem with overdispersion so that the dispersion parameter is
required in the model?
If there is overdispersion, then the estimated model will provide a poor fit
to the data. We have already met a measure of model fit for GLMs – the
residual deviance, D. Therefore, when fitting Poisson or binomial GLMs, if
the model’s residual deviance is large (indicating a poor model fit), then
this could be an indication of overdispersion. However, a large residual
deviance could also be an indication that one or more important
explanatory variables are missing from the model. So, in order to assess
whether there might be a problem with overdispersion, a GLM is fitted
using all of the available explanatory variables, and the residual deviance
for this model is used to assess whether or not there could be a problem
with overdispersion.
Now, we know from Box 13 in Subsection 5.1 that, if a model is a good fit
to the data, then the model’s residual deviance D approximately follows a
χ2 (r) distribution, where
r = n − number of parameters in the proposed model.
The mean of this χ2 distribution is r. So, if there isn’t a problem with
overdispersion, then we would expect
D/r ≈ 1.
But if
D/r > 1,
then this could be an indication of overdispersion. (Conversely, if D/r < 1, then this could be an indication of underdispersion.)
Due to random variation, the value of D/r could well be slightly greater than 1 without there being a problem with overdispersion, but we'd certainly expect D/r to be less than 2 if overdispersion is not a problem.
How to detect overdispersion is summarised in Box 23.
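In R, this rule-of-thumb check, and the corrected fit with a dispersion parameter, might look as follows (fitPois is an assumed name for a Poisson GLM fitted with all available explanatory variables):

    deviance(fitPois) / df.residual(fitPois)   # D/r; values well above 1 suggest overdispersion
    # Refit with an estimated dispersion parameter phi:
    fitQuasi <- update(fitPois, family = quasipoisson)
    summary(fitQuasi)   # same estimates, rescaled standard errors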
We’ll finish this subsection with two activities to give you some practice at
detecting possible overdispersion.
Three GLMs for different binomial responses were fitted, each using all of
the associated available explanatory variables. For each of the resulting
residual deviances D and associated degrees of freedom r given below,
could there be a problem with overdispersion?
(a) Model 1: D = 4.21, r = 3.
(b) Model 2: D = 634.81, r = 864.
(c) Model 3: D = 146.8, r = 26.
Now, if
log(θi ) = ηi − log(ti ),
then, since log(ti ) is a known constant, the fitted rate θ̂i must satisfy
log(θ̂i ) = η̂i − log(ti ),
where η̂i is the fitted linear predictor for λi (calculated using a Poisson GLM with a log link). But, since
log(λ̂i ) = η̂i ,
it follows that log(θ̂i ) = log(λ̂i ) − log(ti ), and so the fitted rate is θ̂i = λ̂i /ti .
The main question of interest here is whether the new drug is effective
in significantly reducing the number of seizures for these patients
relative to their usual drug.
There are a few things to note about the epileptic seizures dataset.
• Each of the 15 patients has two entries in the dataset. For example,
patient 3 in period 0 was observed to suffer 152 seizures in 56 days using
the new drug (as shown in the first row of Table 9), and the same
patient 3 was observed to suffer 149 seizures in 56 days using their usual
drug in period 1 (in a later row of Table 9).
• The exposure time for a treatment was 56 days for most patients, but a
few were observed over shorter time spans. For instance, in period 0,
patient 6 was observed to suffer 10 seizures in 42 days of exposure using
their usual drug.
• Patient 11 is the one with by far the largest number of seizures amongst
all 15 patients, with 1161 seizures during 56 days using their usual drug
in period 0 and 854 seizures during 56 days using the new drug in
period 1.
We’ll consider a potential model for the epileptic seizures dataset in the
next activity.
What kind of model comes to mind as a good first model for modelling the
response variable numSeizures from the epileptic seizures dataset?
Our primary question of interest for these data is whether the new drug
reduces the number of epileptic seizures in comparison to a patient’s usual
drug. There is, however, a lot of variability in the numbers of seizures
amongst patients. This can be seen in Figure 32, which shows a scatterplot
of log(numSeizures + 0.5) against patient. (A plot of numSeizures
against patient is dominated by the unusually large values for patient 11.
Plotting log(numSeizures + 0.5) against patient makes the patterns
easier to see: the ‘0.5’ has been added to avoid the problem of zero
numSeizures.) So, given that there is so much variability in the response
across patients, patient should be included as an explanatory variable in
a GLM for numSeizures, although the effect of this explanatory variable is
not of major interest per se. We shall do this by treating patient as a
factor with 15 levels (a level for each patient).
Figure 32 Scatterplot of log(numSeizures + 0.5) against patient
Summary
This unit formally introduced generalised linear models, commonly referred
to as GLMs, for modelling response variables from a variety of different
distributions.
Both linear regression (which assumes a normal distribution for the
response) and logistic regression (which assumes a Bernoulli distribution
for a binary response) are particular types of generalised linear model. In
this unit, we also used GLMs for modelling responses with Poisson,
exponential and binomial distributions, although these are not the only
distributions that can be assumed for the response in a GLM.
The relationship between the response Y and a set of explanatory variables
follows a GLM if:
• Y1 , Y2 , . . . , Yn all have the same type of distribution, but each Yi may
have a different mean
• for i = 1, 2, . . . , n, the regression equation has the form
g(E(Yi )) = ηi ,
where g is a link function and ηi is the linear predictor, a linear
combination of the explanatory variables.
The link function, g, therefore provides the link between the mean
response, E(Yi ), and the linear predictor, ηi . We’d like g to transform
E(Yi ) so that:
• there’s a linear relationship between g(E(Yi )) and the explanatory
variables
• g(E(Yi )) can take any real value (to match the possible values that ηi
can take).
Canonical link functions provide simplifications for the theory and analysis
of GLMs, but non-canonical link functions can be, and are, used in
practice. The link functions used in this unit are summarised in Table 10.
(Remember that a binomial GLM models the success probability pi rather
than E(Yi ) directly.)
Table 10 The link functions used in M348
The inverse link function, g −1 , links the linear predictor, ηi , back to the
mean response E(Yi ), so that
E(Yi ) = g −1 (ηi ).
The inverse link functions for the link functions used in this unit are
summarised in Table 11.
Table 11 The link functions and associated inverse link functions used in M348
Inverse link functions are important for calculating µbi , the fitted mean
response for Yi , and µb0 , the predicted mean response for a new response Y0 .
Then
µbi = g −1 (ηbi ) and µb0 = g −1 (ηb0 ),
where ηbi and ηb0 are the fitted linear predictors for Yi and Y0 , respectively.
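In R, both directions of this relationship are available from predict() for a fitted GLM: type = "link" returns the fitted linear predictor, while type = "response" applies the inverse link function for you. A minimal sketch (the names fit and newData are illustrative):

  eta.hat <- predict(fit, newdata = newData, type = "link")      # fitted linear predictor
  mu.hat  <- predict(fit, newdata = newData, type = "response")  # inverse link applied
  # Equivalently, apply the inverse link function directly:
  family(fit)$linkinv(eta.hat)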
The fit of the proposed GLM can be assessed using the residual
deviance D, where
D = 2 × (l(saturated model) − l(proposed model)).
If the proposed model is a good fit, then
D ≈ χ2 (r),
where
r = n − number of parameters in the proposed model.
A useful ‘rule of thumb’ is that
• If D ≤ r, then the model is likely to be a good fit to the data.
The fits of two nested GLMs M1 and M2 , with M1 nested within M2 , can
be compared using the deviance difference, where
deviance difference = D(M1 ) − D(M2 ).
If both M1 and M2 are a good fit, then
deviance difference ≈ χ2 (d),
where
d = difference in the degrees of freedom for D(M1 ) and D(M2 ).
The AIC can be used to compare non-nested models: we choose the model
with the smallest AIC.
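In R, the AIC values of several fitted GLMs can be compared in a single call; a minimal sketch with illustrative model names:

  AIC(m1, m2)   # prefer the model with the smaller AIC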
In order to check the model assumptions for GLMs, diagnostic plots focus
on deviance residuals – these are analogous to the standard residuals used
in linear regression. If the GLM assumptions are reasonable, then the
standardised deviance residuals are approximately distributed as N (0, 1).
Overdispersion is a common problem for Poisson and binomial GLMs in
practice. This occurs when the observed response variance is larger than
the model’s variance. Overdispersion can be corrected by introducing an
extra dispersion parameter (φ > 0) to scale the model’s variance. To
detect possible overdispersion:
• fit a GLM including all of the possible explanatory variables
• overdispersion could be a problem if
D/r > 2,
where D is the GLM’s residual deviance and r is the associated degrees
of freedom.
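One standard way to introduce such a dispersion parameter in R is to refit the model with a quasi-likelihood family (quasipoisson for a Poisson GLM, quasibinomial for a binomial GLM); a minimal sketch with illustrative formula and data names:

  fit.quasi <- glm(y ~ x1 + x2, family = quasipoisson, data = myData)
  summary(fit.quasi)   # the summary reports the estimated dispersion parameter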
Another problem that can arise for Poisson GLMs is when the Poisson
responses Y1 , Y2 , . . . , Yn are observed counts over varying lengths of time
t1 , t2 , . . . , tn . In this case, the Poisson rate
θi = λi /ti
is of interest: log(ti ) is called the offset. The fitted rate θbi is calculated as
θbi = λbi /ti ,
where λbi is the fitted mean response of Yi .
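In R, the offset log(ti ) can be included directly in the model formula via offset(); a minimal sketch for data like the epileptic seizures dataset (the variable names, such as exposureDays, are illustrative rather than taken from the module files):

  fit.rate <- glm(numSeizures ~ treatment + period + factor(patient)
                  + offset(log(exposureDays)),
                  family = poisson, data = epilepticSeizures)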
A reminder of what has been studied in Unit 7 and how the sections link
together is shown in the following route map.
Section 1: Setting the scene
Section 2: Building a model
Section 3: The generalised linear model (GLM)
Section 4: GLMs for two more response variable distributions
Section 5: Assessing model fit and choosing a GLM
Section 6: Checking the GLM model assumptions
Section 7: Common issues in practice
Learning outcomes
After you have worked through this unit, you should be able to:
• appreciate the different types of non-normal response variables which
can be of interest
• understand the roles of the link function and the inverse link function in
a GLM
• obtain fitted and predicted mean responses for given values of the
explanatory variable(s)
• assess the fit of a GLM
• compare the fits of two GLMs, both in the case of nested GLMs and
non-nested GLMs
• identify potential problems with the model assumptions for a GLM
• appreciate what overdispersion is, how we can detect it, and how we can
correct for it
• appreciate how we can model Poisson rates instead of Poisson response
means
• fit Poisson, exponential and binomial GLMs in R
• use R to predict mean responses
• use R to assess the fit of GLMs
• compare the fits of GLMs in R
• use stepwise regression for GLMs in R
• use R to produce diagnostic plots for GLMs
• use R to correct for overdispersion
• use R to model Poisson rates.
References
Feigl, P. and Zelen, M. (1965) ‘Estimation of exponential survival
probabilities with concomitant information’, Biometrics, 21(4),
pp. 826–838, doi:10.2307/2528247.
Flores, F.P. (2017) ‘Filipino family income and expenditure’. Available at:
https://www.kaggle.com/datasets/grosvenpaul/family-income-and-
expenditure (Accessed: 26 June 2022).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, crashed cars: © stockbroker / www.123rf.com
Figure 1: © Iryna Volina / Alamy Stock Photo
Subsection 1.2, Filipino family: Public domain
Subsection 3.1, students passing exams: © stylephotographs /
www.123rf.com
Subsection 3.4, laptop being opened: © dragoscondrea / www.123rf.com
Subsection 4.1, survival rates: © yasemin / www.123rf.com
Subsection 4.2, exploding party popper: © slavadumchev /
www.123rf.com
Subsection 5.1, child in oversized clothes: © Ferreira / www.123rf.com
Subsection 5.2, choosing clothes: © maridav / www.123rf.com
Subsection 7.1, crowd: © nd3000 / www.123rf.com
Subsection 7.2, call centre: © fizkes / www.123rf.com
Subsection 7.3.1, an electroencephalogram (EEG): © Phanie /
Alamy Stock Photo
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
(a) The waiting time between serious earthquakes is a non-negative
continuous response variable since waiting times cannot take negative
values. Linear regression includes negative values as possible
outcomes of the response, and so may not be appropriate to model
this type of response variable.
(b) The number of traffic accidents per year at a particular road junction
is a count response variable that can only take non-negative integer
values. Linear regression would include negative values as possible
outcomes of the response and would also include decimal values. Neither
negative values nor decimals are possible for count (integer) data, and so
linear regression may not be appropriate.
(c) The number of insects surviving out of N insects exposed to an
insecticide is a non-negative integer that can only take values between
0 and N . Linear regression would include negative values, decimal
values, and also values greater than N as possible outcomes of the
response, which are all not possible for this particular response. As
such, linear regression may not be appropriate for modelling this
response.
Solution to Activity 2
(a) The points in the normal probability plot given in Figure 2 deviate
systematically from the line at both ends of the plot, suggesting that
residuals may not follow a normal distribution. The linear regression
normality assumption is therefore questionable for these data.
(b) Figure 3 highlights several issues which make it unlikely that the
assumption of a normal distribution for the response will be ideal for
these data.
Firstly, the response familySize only takes integer values, whereas
the normal distribution is continuous. Also, the distribution for
familySize looks skew, whereas the normal distribution is symmetric.
What’s more, the observed family size values are non-negative low
counts with most frequent values occurring in the range from 0 to 7,
as can be seen in the bar chart in Figure 3. As a result, if a normal
distribution for the response is assumed, then the possible values of
the response would include negative values, giving non-zero
probabilities of occurrence for negative values of the response.
Solution to Activity 3
A Poisson distribution seems to be better than a normal distribution as
the distribution for the response familySize. In the side-by-side bar chart
in Figure 5, the observed relative frequencies match fairly closely to those
expected when assuming a Poisson distribution. The Poisson distribution
is also discrete, to match the discrete data. In addition, the probability of
a negative value for familySize is zero under the Poisson distribution, but
is non-zero under the normal distribution.
Solution to Activity 4
Similarities:
• Both equations have a linear function of the explanatory variable
(α + βxi ) on the right-hand side of the equation.
• Both equations involve E(Yi ) on the left-hand side of the equation.
Difference:
• The equation for logistic regression has a function of E(Yi ) on the
left-hand side, whereas linear regression just has E(Yi ).
Solution to Activity 5
In linear regression
E(Yi ) = α + βxi .
So in order for
g(E(Yi )) = α + βxi ,
the function g must be the identity function, so that
g(E(Yi )) = E(Yi ).
Solution to Activity 6
There seem to be very different linear relationships between the sample
mean of familySize and age for age < 40, 40 ≤ age < 80 and age ≥ 80.
In general, the sample mean of familySize seems to increase from age 18
to approximately age 40, after which it starts to decrease towards lower
mean values for ages up to approximately age 80, after which the decrease
seems even steeper.
Solution to Activity 7
The relationship between the logs of the sample means of familySize and
age seems fairly linear in Figure 10, and so it does seem reasonable to
assume a linear relationship between log(E(Yi )) and xi .
Solution to Activity 8
The regression equation for this logistic regression model for the ith
student has the form
log(pi /(1 − pi )) = α + β1 xi1 + β2 xi2 ,
where pi is the success probability of the ith student, and xi1 and xi2 are
their values of x1 and x2 , respectively.
Here
log(pi /(1 − pi )) = g(E(Yi ))
and so the linear predictor for this model is
ηi = α + β1 xi1 + β2 xi2 .
Solution to Activity 9
Substituting the values of the estimates into the linear predictor, the fitted
linear predictor for this model is
ηb = −4.45 + 0.09 x1 − 0.01 x2
or equivalently
ηb = −4.45 + 0.09 bestPrevModScore − 0.01 age.
Solution to Activity 10
Figure 10 shows a scatterplot of the logs of the sample means of
familySize against age.
The sample means are estimates of E(Yi ) for the different values of age.
Therefore, the scatterplot is actually plotting the estimated values
of log(E(Yi )) against the different values that age can take.
But for a Poisson GLM with a log link,
log(E(Yi )) = g(E(Yi )).
Therefore, Figure 10 shows a scatterplot of the estimated values of
g(E(Yi )) and xi for these data, and since the scatterplot shows a fairly
linear relationship, it does indeed seem reasonable to assume a linear
relationship between g(E(Yi )) and xi for these data and this model.
Solution to Activity 11
(a) The response Yi can only take non-negative integer values 0, 1, 2, . . . .
(b) If Yi ∼ Poisson(λi ), then E(Yi ) = λi . So, since λi > 0, then E(Yi ) > 0
and the mean response can only take positive real values.
(c) For the log link,
g(E(Yi )) = log(E(Yi )).
But since E(Yi ) = λi if Yi ∼ Poisson(λi ), this means that
g(E(Yi )) = log(λi ).
Therefore, g(E(Yi )) can take any value between −∞ and +∞. This is
illustrated in the plot of log(λi ) against λi in Figure S3, where λi is
restricted to positive values, but log(λi ) can take any real value.
Figure S3 Plot of log(λi ) against λi
Solution to Activity 12
(a) The canonical link function for a GLM with a normal response is the
identity link function, so that
g(E(Yi )) = E(Yi ).
So,
E(Yi ) = ηi
and the inverse link function is therefore
g −1 (ηi ) = ηi .
That is, the inverse link function g −1 is also the identity function.
(b) The canonical link function for a Poisson GLM is the log link
g(E(Yi )) = log(E(Yi )).
So,
log(E(Yi )) = ηi
and therefore, taking exponentials of both sides, we have
E(Yi ) = exp(ηi ).
So the inverse link function is
g −1 (ηi ) = exp(ηi ).
Solution to Activity 13
From Example 7, the fitted linear predictor for the model is
ηb = 1.97 − 0.01 age.
So, the fitted linear predictor for the second household, ηb2 , is
ηb2 = 1.97 − (0.01 × 72) = 1.25.
We need to use the inverse link function to calculate the fitted mean
response of familySize for the second household, so that
b2 = g −1 (b
µ η2 ) = exp(b
η2 )
= exp(1.25) ' 3.49.
Solution to Activity 14
Given ηb, the fitted linear predictor for the first student, ηb1 , is
ηb1 = −4.45 + (0.09 × 89.2) − (0.01 × 32)
= 3.258.
From Table 4, the inverse link function for the logit link is
g −1 (ηi ) = exp(ηi )/(1 + exp(ηi )).
So
µb1 = g −1 (ηb1 ) = exp(ηb1 )/(1 + exp(ηb1 ))
= exp(3.258)/(1 + exp(3.258)) ≈ 0.963.
Now, for the logistic regression model, µb1 is the fitted probability that the
first student passes the module. So, since µb1 ≈ 0.963, the fitted probability
of passing for this student is close to 1 – in other words, our fitted model
estimates that they are almost certain to pass the module.
Notice that in this activity, ηb1 = 3.258, which lies outside the possible
values that E(Y1 ) can take. However, the inverse link function transforms
ηb1 to a value in the correct range of possible values for µb1 .
Solution to Activity 15
The fitted linear predictor for this student, ηb0 , is
ηb0 = −4.45 + (0.09 × 74.2) − (0.01 × 49)
= 1.738.
From Table 4, the inverse link function for the logit link is
g −1 (ηi ) = exp(ηi )/(1 + exp(ηi )).
So, the predicted mean response for this student, µb0 , is
µb0 = g −1 (ηb0 ) = exp(ηb0 )/(1 + exp(ηb0 ))
= exp(1.738)/(1 + exp(1.738)) ≈ 0.85.
Solution to Activity 16
An exponential distribution does look like it might be promising as the
assumed distribution for survivalTime. The fitted exponential
distribution curve (in Figure 18) generally follows the shape of the
histogram, and the probability of a negative value for survivalTime is
zero under the exponential distribution.
Solution to Activity 17
Since each response Yi has an exponential distribution, each E(Yi ) must be
positive. As a result, the canonical link function
g(E(Yi )) = −1/E(Yi )
can only take negative values.
But this means that the canonical link function doesn’t satisfy the second
property that we’d like the link function g to have, since g(E(Yi )) can’t
take any value between −∞ and +∞.
As such, the canonical link function for an exponential GLM may not be
an ideal link function to use.
Solution to Activity 18
As we saw in Activity 17, one of the problems with using the negative
reciprocal link for an exponential GLM is that the values of
g(E(Yi )) = −1/E(Yi ) must be negative, since E(Yi ) must be positive.
This is not the case when using the log link, since log(E(Yi )) can take any
value between −∞ and +∞ for positive E(Yi ). As such, the log link may
be more sensible than the canonical link when we’re assuming an
exponential distribution for the response.
Solution to Activity 19
Although there aren’t many observations in this dataset, Figure 19
suggests that there does seem to be a roughly linear relationship between
log(survivalTime) and logWbc for each of the levels of ag (or at least, the
plot doesn’t indicate that the linearity assumption is unreasonable!).
Therefore, it does seem reasonable to assume a linear relationship between
g(E(Yi )) and the explanatory variables.
Solution to Activity 20
(a) In the table, there is a parameter estimate for level ‘pos’ of ag, but
not one for level ‘neg’. Therefore, level ‘neg’ has been set to be level 1
of ag.
(b) The first patient takes level ‘pos’ for ag and has a value of 7.7407 for
logWbc. So, the fitted linear predictor for this patient is
ηb1 = 5.8154 + (−0.3044 × 7.7407) + 1.0176 ≈ 4.4767.
(c) From Box 8, the fitted mean response for the first patient is
µb1 = g −1 (ηb1 ),
where g −1 is the inverse link function.
Now, for the log link, we have that
g(E(Yi )) = log(E(Yi )) = ηi .
So, taking exponentials of each side, we have that
E(Yi ) = exp(ηi )
so that the inverse link function is
g −1 (ηi ) = exp(ηi ).
(d) The 18th patient takes level ‘neg’ for ag and has a value of 8.3894 for
logWbc. So, the fitted linear predictor for this patient is
ηb18 = 5.8154 + (−0.3044 × 8.3894) ≈ 3.2617.
Therefore, the fitted mean response for the 18th patient is
µb18 = g −1 (ηb18 ) = exp(ηb18 ) ≈ exp(3.2617) ≈ 26.09.
Solution to Activity 21
(a) The first student in the dataset has a value of 89.2 for
bestPrevModScore and 32 for age. So, using the parameter estimates
given, the fitted linear predictor for this student is
ηb1 = −3.3387 + (0.0472 × 89.2) + (−0.0040 × 32) = 0.74354.
(b) Applying the inverse link function for the logit link, the fitted success
probability for the first student is
pb1 = exp(ηb1 )/(1 + exp(ηb1 )) = exp(0.74354)/(1 + exp(0.74354)) ≈ 0.6778.
(c) The fitted mean response for the first student is calculated using the
equation
µb1 = N1 × pb1 ,
where N1 is the number of ‘trials’ for the first student. In our scenario
here, N1 is 100, the number of exam questions. Therefore, the fitted
mean response for the first student is
µb1 ≈ 100 × 0.6778 = 67.78;
that is, µb1 is 68, rounded to the nearest integer.
(d) The new student has a value of 79.2 for bestPrevModScore and 64 for
age. So, ηb0 , the fitted linear predictor for this student, is
ηb0 = −3.3387 + (0.0472 × 79.2) + (−0.0040 × 64) = 0.14354.
Solution to Activity 22
The p-value is 0.099, which is quite large. This suggests that the value of
D is not large enough to suggest that the model is a poor fit. We therefore
conclude that the model seems to be an adequate fit to the data.
Solution to Activity 23
(a) Model M1 is nested within M2 , so
deviance difference = D(M1 ) − D(M2 )
= 47.808 − 40.319 = 7.489.
(b) Model M2 has one more parameter than M1 (the regression coefficient
for level pos of the factor ag), and so this deviance difference is
approximately distributed as χ2 (1). We can also calculate the degrees
of freedom as the difference between the deviance degrees of freedom
for M1 and M2 , namely
31 − 30 = 1.
Solution to Activity 24
(a) Model M1 is nested within M2 , so
deviance difference = D(M1 ) − D(M2 )
= 1795.8 − 1731.8 = 64.
(b) Model M2 has one more parameter than M1 (the regression coefficient
for the covariate age), and so this deviance difference is
approximately distributed as χ2 (1). We can also calculate the degrees
of freedom as the difference between the deviance degrees of freedom
for M1 and M2 , namely
1158 − 1157 = 1.
(c) Since the p-value is so small (close to 0), there is evidence to suggest
that there is a significant gain in fit by using our proposed model in
comparison to the null model – in other words, there is a significant
gain in fit when age is included in the model. It therefore looks like
age is useful for modelling familySize.
Solution to Activity 25
The preferred model is the one with the smallest AIC, so M2 is the
preferred model.
Solution to Activity 26
If Yi ∼ Poisson(λi ), then
E(Yi ) = λi and V (Yi ) = λi ,
that is, the variance is the same as the mean. So, if the mean of the
response Yi changes, then the variance will also change and be equal to the
mean.
Solution to Activity 27
If Yi ∼ M (λi ), then
E(Yi ) = 1/λi and V (Yi ) = 1/λi².
So,
V (Yi ) = (E(Yi ))².
This means that as E(Yi ) changes, so does V (Yi ).
Solution to Activity 28
There is some slight curvature in the smoothed red line in the plot in
Figure 25. However, this could be a result of the fact that the dataset is
small. Therefore, the plot doesn't raise any alarm bells suggesting that the
linearity assumption for the model is unreasonable.
Solution to Activity 29
The points in the plot seem to be fairly randomly scattered across the
index, and so the plot doesn’t indicate that there are any problems with
the independence assumption.
Solution to Activity 30
Two of the negative standardised deviance residuals are much larger than
the others, which, in itself, doesn’t indicate any problems with the
independence assumption. However, the fact that both of these unusually
large points are right next to each other in index order and they're also
both negative residuals suggests that there might be an issue with
independence. Given that the data points relate to different patients, it is
likely that they are independent, but it would be worth
checking how the data were collected.
Solution to Activity 31
Although the points deviate from the line slightly at either end of the
normal probability plot, the points in the plot are generally quite close to
the diagonal line, and so the assumption of normality of the deviance
residuals seems reasonable. This, in turn, means that the assumption of an
exponential distribution for the response also seems reasonable.
Solution to Activity 32
Here
D/r = 962.05/1144 ≈ 0.84.
So, since 0.84 < 1 < 2, overdispersion is not a problem when using this
GLM.
Solution to Activity 33
(a) For Model 1,
D/r = 4.21/3 ≈ 1.4.
Since 1 < 1.4 < 2, there could be some overdispersion but probably
not enough to be a problem.
(b) For Model 2,
D/r = 634.81/864 ≈ 0.73.
Since 0.73 < 1 < 2, there certainly doesn't seem to be a problem with
overdispersion.
(c) For Model 3,
D/r = 146.8/26 ≈ 5.65.
Since 5.65 > 2, there could well be a problem with overdispersion.
Solution to Activity 34
A Poisson distribution would be naturally considered for the response
numSeizures, since these are counts. So, a first good model to try for
numSeizures is a Poisson GLM with a log link, with treatment and
period as explanatory variables.
Unit 8
Log-linear models for contingency
tables
Introduction
In Unit 7, we learned about generalised linear models – that is, GLMs –
and used them to model data with various assumed distributions for the
response. This unit continues that thread, but involves a different
format of data from those we have been working with so far.
In this unit, we shall concentrate on data which are in the form of
contingency tables. Contingency tables are tables of counts showing how
often within a given sample each combination of the different values of
various categorical random variables occurs. One of the questions often of
interest for contingency table data is whether there are any relationships
between the categorical variables represented in the contingency table. In
this unit, we’ll introduce a GLM, known as the log-linear model, for
modelling the contingency table data in order to learn about these
relationships.
Section 1: The modelling problem
Section 2: Introducing log-linear models for two-way contingency tables
Section 3: Are the classifying variables in a two-way table independent?
Section 4: Contingency tables with more than two variables
Section 5: How are the classifying variables related?
Section 6: Logistic and log-linear models
Note that you will need to switch between the written unit and your
computer for Subsections 3.3, 4.5 and 5.2.
1 The modelling problem
The data for the first five observations from the UK survey dataset
are shown in Table 1. From this table of data, we see that for each
individual completing the 2013 Living Costs and Food Survey, the
observed category was recorded for each of the four categorical
variables; for example, the first individual in the dataset recorded
inactive for employment, female for gender, earned for incomeSource
and public rented for tenure. (Note that the HRPs in the survey may
not be the main source of household income, so it is possible for
incomeSource to take the value earned even if the HRP is unemployed
or inactive – as is the case for the first individual in the dataset.)
Table 1 First five observations from ukSurvey
The data in the UK survey dataset record which categories each individual
takes for the four categorical variables employment, gender, incomeSource
and tenure. In this unit, we are not so interested in these individual data
values, but rather we are interested in modelling the counts of individuals
in the dataset who take combinations of the different values of the
categorical variables. For example, we’re interested in modelling the counts
of individuals (out of the 5144 who completed the survey) who took the
values female for gender and earned for incomeSource, or took the values
female for gender and other for incomeSource, and so on. These counts
can be represented in a contingency table.
An example of a contingency table showing data from the UK survey
dataset is given in Table 2. This table shows the numbers of individuals
classified according to the two categorical variables gender and
incomeSource. For example, of the 5144 individuals in the dataset, the
household income source was earned and the gender of the HRP was male
for 1894 individuals.
Table 2 The UK survey dataset classified by gender and incomeSource
incomeSource
gender earned other Total
female 947 1041 1988
male 1894 1262 3156
Total 2841 2303 5144
Table 3 The UK survey dataset classified by employment, gender and incomeSource
incomeSource
earned other
gender gender
employment female male female male
full-time 626 1688 31 95
part-time 235 112 123 66
unemployed 18 16 72 58
inactive 68 78 815 1043
Total 947 1894 1041 1262
There are many questions that we might want to answer for the three-way
contingency table given in Table 3. Not only might we be interested in the
question of whether gender and incomeSource are independent (as we
were for Table 2), but we might also be interested in whether gender and
employment are independent of each other, or whether employment and
incomeSource are independent of each other.
Furthermore, we might want to consider whether the relationship between
any pair of variables (independent or not) differs according to the other
remaining variable. For example, we might be interested in whether the
relationship between employment and gender differs according to whether
incomeSource takes the value earned or other. We’ll consider what else
might be of interest in Activity 1.
To keep things simple, for now we’ll restrict our attention to modelling
two-way contingency tables only, so that the data are categorised in terms
of just two categorical variables. Modelling contingency table data for
more than two categorical variables will build on these models; we’ll
consider these more complicated contingency tables later in the unit.
2 Introducing log-linear models for two-way contingency tables
Table 5 The UK survey dataset classified by employment and gender
gender
employment female male Total
full-time 657 1783 2440
part-time 358 178 536
unemployed 90 74 164
inactive 883 1121 2004
Total 1988 3156 5144
Notice that our response variables are indexed by two subscripts k and l,
representing, respectively, the kth row and lth column in the contingency
table. This is, of course, different to the notation that we’ve used so far for
responses; previously we’ve had the responses Y1 , Y2 , . . . , Yn , where n is the
number of individual observations in the dataset. So, previously we’ve had
a response variable to represent each of the n observations, whereas here
we are no longer considering responses relating to individual observations,
but instead we’re considering responses representing the counts in the
contingency table after the individual observations have been categorised.
Activity 4 considers these responses further.
We’ll finish this subsection with an activity identifying the responses for
modelling a two-way contingency table for the UK survey dataset.
Consider once again the contingency table given in Table 5 (in Activity 3).
Using the notation summarised in Box 2, write down the responses for
modelling the counts in this contingency table.
Since we’re building a GLM for each response Ykl , we’re interested in
E(Ykl ), the expected cell count for level k of A and level l of B. This is
directly related to the cell probability pkl , and, since we’re assuming that
the total number of observations is fixed to be n, is given by
E(Ykl ) = n × pkl . (1)
But, if A and B are independent, then
P (A = k and B = l) = P (A = k) × P (B = l),
which means that the cell probability pkl is the product of the marginal
probabilities pk+ and p+l – that is,
pkl = pk+ × p+l .
So, if A and B are independent, then Equation (1) becomes
E(Ykl ) = n × pk+ × p+l . (2)
Now, for a GLM for Ykl , we need to have a regression equation of the form
g(E(Ykl )) = ηkl ,
where ηkl is a linear function. Although the right-hand side of
Equation (2) isn’t linear, if we take logs of both sides, then the right-hand
side will be linear. In this case, Equation (2) becomes
log(E(Ykl )) = log(n) + log(pk+ ) + log(p+l )
= constant term + term associated with kth level of A
+ term associated with lth level of B.    (3)
So, could we use this as a regression equation for modelling the counts?
Well we could in theory, but the model wouldn’t be easy to work with
because the responses would not be independent since they need to add up
to the fixed total n.
As we’ve just seen in Activity 6, there is a problem with the Poisson GLM
model assumptions if the sample size Y++ is fixed to be n in advance.
However, all is not lost! It turns out that if we fit a Poisson GLM to data
from a contingency table (with the canonical log link), assuming
independent Poisson responses, then maximum likelihood estimation
ensures that the fitted values for the cell counts add up to the actual total
count that was observed in the first place! As a result, it makes no
difference whether or not we fix Y++ to be n in advance, and we can
simply assume that the cell counts are independent Poisson responses so
that everything can fit nicely into the standard (and easy-to-use!) Poisson
GLM framework. (As if by magic, we can use a Poisson GLM for
contingency table data regardless of any constraints on the totals!)
What’s more, even though the assumed distributions for the response are
also different when the row totals Y1+ , Y2+ , . . . , YK+ are fixed, or when the
column totals Y+1 , Y+2 , . . . , Y+L are fixed, again it turns out that
maximum likelihood estimation for a Poisson GLM also gives exactly the
same fitted values as these constrained models, and that the fitted values
of the cell counts add up to the actual row and column totals that were
observed. Therefore, again we can assume that the cell counts are
independent Poisson responses, so that a Poisson GLM can also be used
for contingency tables where the row and column totals are fixed.
When using a Poisson GLM to model counts in a contingency table, the
model is usually referred to as a log-linear model. As you’ve probably
guessed, the ‘log’ part of this title refers to the fact that we’re using a log
link function, and the ‘linear’ part refers to the fact that the multiplicative
relationship between the marginal probabilities becomes linear in the
model.
The log-linear model for the counts in a two-way contingency table when A
and B are independent is summarised in Box 3.
Let’s take a closer look at the log-linear model in the simplest situation in
which we have a two-way contingency table as shown in Table 7, where the
two classifying variables each has just two levels.
Table 7 General form of a contingency table with observed counts for two
variables A and B, each with two levels
Show that the expected cell counts for the contingency table in Table 7 can
be written using the expressions given in Table 8.
Table 8 Expressions for the expected cell counts E(Ykl ) for the contingency
table in Table 7
The previous activity considered expressions for the expected cell counts
for a contingency table where each of the classifying variables has just two
levels. Notice that µ, the baseline mean, is in all of the expressions for the
expected cell counts in Table 8. On the other hand, αA , the level 2 effect
parameter for A, only appears in expressions for the expected cell counts
in the second row in Table 8; that is, the row associated with level 2 of A.
Likewise, αB , the level 2 effect parameter for B, only appears in
expressions for the expected cell counts in the second column in Table 8;
that is, the column associated with level 2 of B.
This idea extends naturally to the general case where A has K levels and
B has L levels: µ is in all K × L expressions for the expected cell counts,
the level k effect parameter for A only appears in the expressions for the
expected cell counts for row k of the table, while the level l effect
parameter for B only appears in the expressions for the expected cell
counts for column l of the table.
We’ll finish this subsection by using the expressions for the expected cell
counts from Activity 7 to calculate the fitted expected cell counts for
contingency table data from the UK survey dataset.
Table 9 The UK survey dataset classified by gender and incomeSource (repeated from Table 2)
incomeSource
gender earned other Total
female 947 1041 1988
male 1894 1262 3156
Total 2841 2303 5144
Let the response variable be count, representing the cell counts in the
contingency table. The model
count ∼ gender + incomeSource
was fitted to the data using a log-linear model taking female to be level 1
of gender and earned to be level 1 of incomeSource. The resulting output
from fitting the model is given in Table 10.
Table 10 Parameter estimates for the fitted log-linear model
count ∼ gender + incomeSource
Parameter Estimate
Intercept 7.001
gender male 0.462
incomeSource other −0.210
Complete Table 11 by calculating the fitted expected cell counts for this
model.
Table 11 Fitted expected cell counts for the log-linear model
count ∼ gender + incomeSource
incomeSource
gender earned other
female
male
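Output like that in Table 10 can be reproduced in R by fitting a Poisson GLM to the cell counts; a minimal sketch (assuming the counts are held in a data frame, here called ukSurvey2way, with one row per cell and variables count, gender and incomeSource):

  fit.ll <- glm(count ~ gender + incomeSource,
                family = poisson, data = ukSurvey2way)
  coef(fit.ll)     # parameter estimates, as in Table 10
  fitted(fit.ll)   # fitted expected cell counts, as needed for Table 11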
Table 12 Fitted expected cell counts for the log-linear model count ∼ gender + incomeSource
incomeSource
gender earned other Total
female 1097.96 890.04
male 1743.04 1412.96
Total
Complete Table 12 by calculating the row, column and overall totals of the
fitted values, and confirm that these totals are the same as the totals
displayed in Table 2 (and repeated in Table 9 in Activity 8).
In order to do this, we first need the values of pbkl , pbk+ and pb+l for our
fitted model. We can use the expected cell counts to estimate these, as
demonstrated in Example 1.
In Activity 10, we’ll estimate some more joint and marginal probabilities,
and we’ll use these estimates to investigate whether Equation (7) is
satisfied for our fitted model.
Table 13 Fitted expected cell counts, together with their row, column and
overall totals, for the fitted log-linear model count ∼ gender + incomeSource
incomeSource
gender earned other Total
female 1097.96 890.04 1988.00
male 1743.04 1412.96 3156.00
Total 2841.00 2303.00 5144.00
(a) In Example 1, we saw that pb12 ≈ 0.173, pb1+ ≈ 0.386 and pb+2 ≈ 0.448.
Confirm that, to three decimal places,
pb12 = pb1+ × pb+2 .
3 Are the classifying variables in a two-way table independent?

Table 14 The UK survey dataset classified by gender and incomeSource
incomeSource
gender earned other Total
female 947 1041 1988
male 1894 1262 3156
Total 2841 2303 5144
The mosaic plot for this contingency table, taking gender as the
horizontal variable and incomeSource as the vertical variable, is
shown in Figure 1. Notice that there are four rectangles, one to
represent each cell in Table 14. The rectangles in the first column are
associated with the first level of the horizontal variable gender (that
is, female), and the rectangles in the second column are associated
with the second level of gender (that is, male).
For the vertical variable incomeSource, the rectangles in the first row
are associated with the first level of incomeSource (that is, earned),
while the rectangles in the second row are associated with the second
level (that is, other).
(Figure 1: mosaic plot with gender as the horizontal variable and
incomeSource as the vertical variable. Figure 2: mosaic plot with
incomeSource as the horizontal variable and gender as the vertical
variable.)
For both of the mosaic plots given in Figures 1 and 2 in Example 2, the
horizontal width of each rectangle represents the proportion of observations
taking the associated level for the horizontal variable. The vertical length
of each rectangle within each column represents the proportion of
observations taking the associated level for the vertical variable,
conditional on the observation taking the associated level of the horizontal
variable. So, if the two variables are independent of one another, we’d
expect the vertical lengths of the rectangles to be roughly the same across
the horizontal variable. This is illustrated in Example 3.
Figure 3 Mosaic plots representing the fitted cell probabilities for the
model count ∼ gender + incomeSource with (a) gender as the
horizontal variable, and (b) incomeSource as the horizontal variable
Notice that when the variables are independent, the rectangles across
rows have the same vertical length. For example, in Figure 3(a) the
vertical lengths of the rectangles for earned are the same for both
female and male, as are the vertical lengths of the rectangles for other.
Using this idea, a mosaic plot can help to informally assess whether or not
the two variables are likely to be independent. What’s more, mosaic plots
also provide us with a visualisation of the proportions observed for each
level within each variable.
Interpreting a mosaic plot is illustrated in Example 4.
In Activity 11, we’ll interpret the other mosaic plot from Example 2.
We’ll round off this subsection with an activity looking at the mosaic plots
for another contingency table taken from the UK survey dataset.
Table 15 The UK survey dataset classified by employment and incomeSource
incomeSource
employment earned other
full-time 2314 126
part-time 347 189
unemployed 34 130
inactive 146 1858
Figure 4 Mosaic plots representing Table 15 with (a) incomeSource as the horizontal variable,
and (b) employment as the horizontal variable
Does it look like the two variables incomeSource and employment are
independent?
The vertical lengths of the rectangles in the mosaic plots given in Figure 4
in Activity 12 seem to be very different across the horizontal variable.
From these plots, it certainly looks like incomeSource and employment are
not independent. However, things are not so clear for the mosaic plots in
Figures 1 and 2 in Example 2. Although the vertical lengths of the
rectangles in these plots are not the same across the horizontal variable,
the differences may not be large enough to rule out the independence
assumption. We therefore need a formal method to help us to decide
whether or not the two variables are independent. We shall introduce such
a method in the next subsection.
Before we leave mosaic plots, it is worth mentioning that mosaic plots can
also be used to visualise contingency tables with more than two classifying
variables. However, when there are more than two variables, mosaic plots
can be difficult to interpret, and therefore are not particularly helpful. As
a result, we shall only consider mosaic plots for two-way tables in this
module.
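For reference, mosaic plots like those in this section can be produced with the base R function mosaicplot(); a minimal sketch (the data frame name ukSurvey2way is illustrative):

  tab <- xtabs(count ~ gender + incomeSource, data = ukSurvey2way)
  mosaicplot(tab)      # gender as the horizontal variable
  mosaicplot(t(tab))   # incomeSource as the horizontal variable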
We have, of course, used models which include interaction terms for two
factors before: they were first introduced in Unit 4. Following the same
convention as we’ve used before, the interaction effect of the kth level of A
and the lth level of B is the added effect of the interaction between A and
B. As a result, if either A or B take level 1, then the associated interaction
term is simply zero (since the individual effect terms assume that the other
variable takes level 1, and so any interaction between level 1 of either
variable is already accounted for).
Following our usual notation, we’ll denote the log-linear model which
includes an interaction by
Y ∼ A + B + A:B
or, equivalently,
Y ∼ A ∗ B,
where A:B represents the interaction between A and B.
So, we have two models for the cell counts – M1 and M2 , say – where
• M1 is the log-linear model when A and B are independent given by
Y ∼ A + B
• M2 is the log-linear model which also includes the interaction A:B,
given by
Y ∼ A + B + A:B.
(The testing procedure then leads to one of two conclusions: that A and B
are independent, or that A and B are not independent.)
(a) The residual deviance D for a proposed model (from Subsection 5.2 of
Unit 6) is given by
D = 2 × (l(saturated model) − l(proposed model)).
Show that, for the log-linear model M2 given by
Y ∼ A + B + A:B,
the residual deviance, D(M2 ), is zero.
(b) Hence, find an expression for the deviance difference for comparing
the fits of the log-linear model M1 , given by
Y ∼ A + B,
and the log-linear model M2 .
The results from Activity 14 mean that we can test whether the
interaction A:B should be included in the model, and therefore whether A
and B are independent, using D(M1 ), the residual deviance of the model
without the interaction A:B. And we already know from Subsection 5.1 in
Unit 7 how to use the residual deviance to assess the fit of a GLM!
From Box 13 in Unit 7, if a proposed GLM is a good fit, then D, the
residual deviance for the proposed model, satisfies
D ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model.
We’ll use this result next in Activity 15 to find the distribution for D(M1 )
when M1 is a good fit.
Then:
• fit the model M1 given by
Y ∼ A + B,
which assumes that A and B are independent
• obtain the residual deviance for this model, D(M1 )
• if M1 is a good fit, then
D(M1 ) ≈ χ2 ((K − 1)(L − 1))
Table 16 The newborn babies dataset classified by gender and induced
induced
gender no yes Total
male 327 86 413
female 243 82 325
Total 570 168 738
Using the data from the newborn babies dataset, we wish to test the
hypotheses
H0 : gender and induced are independent,
H1 : gender and induced are not independent.
(a) Let the response variable be count, representing the cell counts in the
contingency table given in Table 16. The log-linear model
count ∼ gender + induced
was fitted to these data and the residual deviance for this fitted model
is 2.00. Which distribution should this residual deviance be compared
to in order to carry out the test?
(b) The associated p-value for this test is 0.157. What do you conclude
about whether or not gender and induced are independent?
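The quoted p-value can be checked in R from the residual deviance and the χ2 degrees of freedom identified in part (a); a minimal sketch:

  pchisq(2.00, df = 1, lower.tail = FALSE)   # approximately 0.157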
We now know how we can use mosaic plots to visualise contingency tables
and consider informally whether or not it looks like the two classifying
variables are independent, and we also know how to use a log-linear model
to test independence more formally. So, we are now ready to put these
ideas into practice in R. We shall do this next.
4 Contingency tables with more than two variables

Table 17 The UK survey dataset classified by employment, gender and incomeSource
incomeSource
earned other
gender gender
employment female male female male
full-time 626 1688 31 95
part-time 235 112 123 66
unemployed 18 16 72 58
inactive 68 78 815 1043
Total 947 1894 1041 1262
For the contingency table shown in Table 17, let A represent the variable
employment, B represent the variable incomeSource and C represent the
variable gender.
(a) What are the values of K, L and S for this contingency table? Hence
confirm that there are K × L × S observed values of the response.
(b) Table 17 has four rows and four columns. How are the data arranged
in the table?
We can also use the ‘∗’ symbol to write this model in shorthand form as
Y ∼ A ∗ B ∗ C.
As usual, the ‘∗’ symbol between factors tells us that these factors are in
the model, as are all of the interactions between them. So, A ∗ B ∗ C
means all the individual factors and all of the possible interactions between
A, B and C.
This log-linear model form can be extended in a natural way to
contingency tables with more than three classifying variables. To
illustrate, in Activity 18 we’ll look at the saturated model for a four-way
contingency table.
So, we now know the general form for a saturated log-linear model for a
contingency table. But, of course, a saturated model is of no use for
modelling! So, the first question of interest is: can a more parsimonious
model be obtained? That is, can any of the terms in the saturated model
be removed from the model without significantly reducing the model fit?
We will consider this question next.
Then:
• fit the model M and obtain the residual deviance for this model,
D(M )
• if M is a good fit, then
D(M ) ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model
(the value of r is given with the value of D(M ) as part of R’s
standard output for fitting M )
• assess the value of D(M ) and complete the test as illustrated in
Figure 7.
In the next activity, we’ll consider the fits of some possible models for the
three-way contingency table given in Table 17 (which classifies the UK
survey dataset by employment, gender and incomeSource).
Four possible log-linear models were fitted to the contingency table data
given in Table 17. These models are as follows.
• Model M1 is the log-linear model which only has the main effects:
count ∼ employment + gender + incomeSource.
The residual deviance for this model is 4538.5 and the associated degrees
of freedom is 10.
• Model M2 is the log-linear model which includes all of the main effects
and the two-way interactions:
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource
+ gender:incomeSource.
The residual deviance for this model is 0.29 and the associated degrees
of freedom is 3.
• Model M3 is the log-linear model which includes all of the main effects
and the two-way interaction employment:incomeSource:
count ∼ employment + gender + incomeSource
+ employment:incomeSource.
The residual deviance for this model is 365.07 and the associated degrees
of freedom is 7.
• Model M4 is the log-linear model which includes all of the main effects
and the two-way interactions employment:gender and
employment:incomeSource:
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource.
The residual deviance for this model is 1.22 and the associated degrees
of freedom is 4.
Using the ‘rule of thumb’ from Box 6, which of the models M1 , M2 , M3
and M4 can be considered as an adequate fit to the data?
In Activity 20, we concluded that there are two models out of those listed
which can be considered as being an adequate fit to the data. This brings
us to the question of how do we choose a log-linear model from a selection
of alternatives? We’ll consider this in the next activity.
Activity 20 considered four possible models for the contingency table data
from the UK survey dataset given in Table 17. That activity concluded
that two of these models were an adequate fit to the data. In this activity,
we’ll compare the fits of these two models so that we can select the
preferred model.
The two models which were an adequate fit are:
• Model M2 :
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource
+ gender:incomeSource.
The residual deviance for this model is 0.29 and the associated degrees
of freedom is 3.
• Model M4 :
count ∼ employment + gender + incomeSource
+ employment:gender + employment:incomeSource.
The residual deviance for this model is 1.22 and the associated degrees
of freedom is 4.
(a) Explain why we can use the value of
deviance difference = D(M4 ) − D(M2 )
to compare the fits of models M4 and M2 .
(b) Calculate the deviance difference to compare these models. What is
the value of the associated degrees of freedom for the deviance
difference?
(c) Hence explain why, if we use the deviance difference, we prefer model
M4 to model M2 .
(d) The value of the AIC for model M2 is 133.01, while the value of the
AIC for model M4 is 131.94. Using these AIC values, which of models
M2 and M4 is preferable?
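In R, a deviance difference is compared with the appropriate χ2 distribution using pchisq(); a minimal sketch for a difference on 1 degree of freedom (the value of d here is illustrative):

  d <- 0.93
  pchisq(d, df = 1, lower.tail = FALSE)   # p-value for the deviance difference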
In the next activity, we’ll choose a log-linear model for data concerning
litters of sheep; the dataset is described next.
Litters of sheep
Agricultural researchers were interested in investigating the
relationships between the size of litters of lambs, the breed of ewe
giving birth to the lambs (from three possible breeds), and the farm
where the ewe gave birth (from three possible farms).
The sheep litters dataset (sheepLitters)
(On 3 May 2017, the Derbyshire Times reported the birth of a rare litter
of three black and three white lambs.)
This dataset contains data on 840 ewes who gave birth to litters of
lambs. For each ewe, the number of lambs in her litter, the ewe's
breed and the farm where the birth took place were recorded. The
ewes were then classified according to three factors:
• litterSize: the number of lambs born in the litter, taking the
values 0, 1, 2 and ≥ 3
• breed: the breed of the ewe, taking the values a, b and c
• farm: the farm where the birth took place, taking the values 1, 2
and 3.
The counts in the cells of the contingency table for these classifying
variables are stored in the variable count.
Data from the experiment are shown in the three-way contingency
table given in Table 18.
Table 18 Counts of ewes categorised according to litterSize, breed
and farm
farm
1 2 3
breed breed breed
litterSize a b c a b c a b c
0 10 4 6 8 5 1 22 18 4
1 21 6 7 19 17 5 95 49 12
2 96 28 58 44 56 20 103 62 16
≥3 23 8 7 1 1 2 4 0 2
Total 150 46 78 72 79 28 224 129 34
We’ll choose a log-linear model for the sheep litters dataset next in
Activity 23.
Does it look like any of the two-way interactions can be dropped from
model M2 ?
(d) Which of the log-linear models M1 , M2 , . . . , M5 would you choose?
(Note that the p-values for each of the log-linear models with the
main effects and only one two-way interaction are all extremely small
(p < 0.0001), and so the models with one two-way interaction and no
interactions will all be inadequate fits to the data.)
In this module, we shall only be using hierarchical models, because they
are often easier to interpret than non-hierarchical models. What's more,
when choosing a log-linear model in this module, we shall not consider
leaving out any of the main effects. This, of course, goes against the
parsimony principle. However, it is more important that a log-linear model
is interpretable than that the model is parsimonious. (Interpretability wins
over parsimony!)
So, the first restriction on our choice of log-linear models is that we only
want to choose between hierarchical log-linear models. The second
restriction arises when some of the totals in the contingency table are fixed
in advance of collecting the data. For example, in the UK survey dataset,
the total number of individuals in the survey may be fixed in advance, or
in the sheep litters dataset, the number of ewes for each breed may be
fixed in advance.
Back in Subsection 2.2, it was stated that two-way tables in which the row,
column or overall totals are fixed can be analysed using log-linear models
in exactly the same way as two-way tables in which the counts are all
independent of each other. However, this is not so for contingency tables
with three or more classifying variables. In this case, if the total number of
observations is fixed, and/or the totals for counts across one or more of the
variable categories are fixed, then this imposes constraints on which terms
must be included in any log-linear model.
Box 8 summarises which terms need to be included in a log-linear model
for the different possible fixed totals in a contingency table.
To finish this subsection, the ‘rules’ given in Box 8 for contingency tables
with fixed totals are illustrated in the next example and following activity.
Table 20 Estimated numbers of votes classified by party, age and gender
age
18 to 24 25 to 49 50 to 64 65+
gender gender gender gender
party male female male female male female male female
con 639 345 3032 2892 2462 2560 2852 3384
lab 1050 1496 3465 4066 1207 1433 668 952
libdem 274 230 1213 1084 579 614 490 529
snp 160 115 433 452 145 205 134 106
other 183 115 606 452 483 307 267 317
The counts given in Table 20 were not directly available from the data
source, but were instead estimated from the available data, namely,
the (rounded) percentages of adults voting for each political party for
each gender and age group, together with the total number of adults
in each gender and age group in the sample. The discrepancy between
the sample size (41 995) and the total number of votes in Table 20
(41 996) is due to rounding of the percentages of votes at the data
source and then further rounding of the estimated numbers of votes.
5 How are the classifying variables related?
Suppose that the final model for a three-way contingency table with
classifying variables A, B and C is the log-linear model with no
interactions given by
Y ∼ A + B + C.
Given what you know about log-linear models for two-way contingency
tables, what do you think this model tells us about the relationships
between the variables A, B and C?
Following on from Activity 26, if our final model is the log-linear model
Y ∼ A + B + C,
then the variables A, B and C are said to be mutually independent.
Next we’ll consider the case in which there is one single two-way
interaction: we’ll use a specific example in the next activity to think things
through.
Following on from Activity 27, if our final model is a log-linear model with
one single two-way interaction, then the two variables in the two-way
interaction are said to be jointly independent of the third variable.
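In probability terms, joint independence of A and B from C means that
P(A = k, B = l, C = m) = P(A = k, B = l) × P(C = m),
for all levels k, l and m: the joint distribution of A and B together is
unrestricted, but is unrelated to C.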
We’ll see two further log-linear models representing joint independence in
Activity 28.
257
Unit 8 Log-linear models for contingency tables
The next situation that we’ll consider is when there are two two-way
interactions in the model. The independence relationships in this case are
a little trickier to interpret from the model.
It will help to explain things if we consider a specific example. So, suppose
that our chosen model is the log-linear model
Y ∼ A + B + C + A:B + A:C.
Now, since the interaction A:B is in the model, then that suggests that A
and B are not independent. Likewise, since the interaction A:C is in the
model, then that suggests that A and C are not independent. However,
there is no interaction involving both B and C together. Since B and C
are related to each other only through their relationships with A, we say
that B and C are conditionally independent, given A.
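In probability terms, conditional independence of B and C, given A, means
that within each level k of A the joint conditional probabilities factorise as
P(B = l, C = m | A = k) = P(B = l | A = k) × P(C = m | A = k),
for all levels l and m.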
We’ll consider the relationships between A, B and C for the other possible
models with two two-way interactions in the next activity.
258
5 How are the classifying variables related?
The only other log-linear model not yet considered is the model containing
all three two-way interactions, but missing the three-way interaction; that
is, the log-linear model
Y ∼ A + B + C + A:B + A:C + B:C.
In this model, none of the pairs of variables is independent of each other
and there is said to be uniform association between the variables.
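(There is no simple factorisation of the joint probabilities in this case.
Instead, the association between each pair of variables, as measured by the
conditional odds ratios, is the same at every level of the third variable,
which is what 'uniform' refers to.)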
Using the final log-linear model to investigate the relationships between
the variables in a three-way table is summarised in Box 9.
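To make the correspondence between models and independence structures
concrete, here is a minimal R sketch; the data frame tab, with factors A, B
and C and a column of cell counts, is a hypothetical name used only for
this illustration.

# Hypothetical data frame 'tab': factors A, B, C and a column of counts.
# Mutual independence:
m1 <- glm(count ~ A + B + C, family = poisson, data = tab)
# A and B jointly independent of C:
m2 <- glm(count ~ A + B + C + A:B, family = poisson, data = tab)
# B and C conditionally independent, given A:
m3 <- glm(count ~ A + B + C + A:B + A:C, family = poisson, data = tab)
# Uniform association:
m4 <- glm(count ~ A + B + C + A:B + A:C + B:C, family = poisson, data = tab)
# Compare the fits; the model with the smallest AIC is preferred.
AIC(m1, m2, m3, m4)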
The next two activities will give you some practice at interpreting
log-linear models in terms of the variable relationships.
The data for the first five observations from the Australian health
insurance dataset are given in Table 22.
Table 22 First five observations from healthInsurance
To complete this unit, in the final section we’ll see how logistic regression
models can be used instead of log-linear models for some contingency table
data.
6 Logistic and log-linear models
The four-way contingency table for the dental flossing dataset, classifying
the children by gender, age, flossing frequency and whether they were able
to floss, is as follows.

                               able
                         no                  yes
                      frequency           frequency
gender   age       rarely  regularly   rarely  regularly
male     5 to 8        19          4        5          2
         9 to 12        5          0        8         17
female   5 to 8        11          7        6          6
         9 to 12        2          1        5         22
Total                  37         12       24         47
Let’s first think about how we’d model the four-way contingency table
data from the dental flossing dataset if we were to use a log-linear model.
Activity 33
Explain why we could use logistic regression for these data if we took able
to be our response variable.
If we take able as our response variable to help answer one of the study’s
aims, then we could use logistic regression because able is a binary
variable. A logistic regression model can, in fact, be used to model any
contingency table in which there is a binary categorical variable that could
be treated as the response variable. In this case, the contingency table can
be modelled by either a log-linear model or a logistic regression model.
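As a minimal R sketch of this (the data frame flossing and its column
names are assumptions for this illustration, not the module's files),
grouped binary data such as these can be supplied to glm() as a
two-column matrix giving the numbers of 'successes' and 'failures' in
each cell.

# One row of 'flossing' per gender-by-age-by-frequency cell, with the
# numbers of children who were (ableYes) and were not (ableNo) able
# to floss; both column names are hypothetical.
fit <- glm(cbind(ableYes, ableNo) ~ gender + age + frequency,
           family = binomial, data = flossing)
summary(fit)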
The main ideas behind using the two models for the same contingency
table are summarised in Box 10.
So, if a log-linear model and a logistic regression model can both be used
to model the same data, what is the relationship between the two resulting
models? We will address this question next.
The terms in the two models correspond as follows.

Interactions in logistic regression model   ←→   Interactions in log-linear model
Main effect A, but no interaction           ←→   Y:A
A:B                                         ←→   Y:A:B
A:B:C                                       ←→   Y:A:B:C
  ⋮                                                ⋮
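The following self-contained R sketch checks this correspondence
numerically; the 2 × 2 × 2 table uses made-up counts, purely for
illustration.

# Made-up counts for a three-way table with binary Y and factors A, B.
tab <- expand.grid(Y = c("no", "yes"), A = c("a1", "a2"), B = c("b1", "b2"))
tab$count <- c(30, 10, 20, 20, 25, 15, 10, 40)
# Log-linear model with Y:A and Y:B; the A:B term is also included so
# that the A-by-B totals are fitted exactly, since the logistic
# analysis treats those totals as fixed.
loglin <- glm(count ~ Y * A + Y * B + A:B, family = poisson, data = tab)
# The matching logistic regression, with one row per (A, B) cell.
wide <- data.frame(A = c("a1", "a2", "a1", "a2"),
                   B = c("b1", "b1", "b2", "b2"),
                   no = tab$count[tab$Y == "no"],
                   yes = tab$count[tab$Y == "yes"])
logit <- glm(cbind(yes, no) ~ A + B, family = binomial, data = wide)
# The logistic coefficients coincide with the Y-related coefficients
# of the log-linear model, as the correspondence above indicates.
coef(logit)
coef(loglin)[grep("Yyes", names(coef(loglin)))]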
Log-linear model
Pros:
• The model involves modelling the relationships between all of the
variables categorising the data, so one variable need not be given
special status over the remaining variables.
• If a log-linear model is used which isn’t necessarily an exact match
with the analogous logistic regression model, then the log-linear
model could end up being a simpler model for the data.
• The model works just as well when the proposed response variable
among the categorical variables has more than two categories. (The
usual logistic regression models that you have studied in this
module do not work in such cases.)
Cons:
• There can be difficulties in fitting a log-linear model if the
contingency table contains many zeros so that certain combinations
of the explanatory factors do not occur in the data, either by
chance or by design.
• The model can’t accommodate continuous explanatory variables.
Summary
This unit focused on the problem of modelling contingency table data. In a
contingency table, the individual data values in a dataset are categorised
according to the levels of two or more categorical variables. The
contingency table then gives the counts of observations for each of the
possible combinations of levels for the categorical variables.
In order to model these data, the counts in the contingency table are taken
to be values of the response, and the variables classifying the data are used
as factor explanatory variables. The counts can then be modelled by a
Poisson GLM with the (canonical) log link. In this context, this GLM is
known as a log-linear model.
A question often of interest for two-way contingency tables with classifying
variables A and B is whether A and B are independent. This question can
be investigated informally using mosaic plots. More formally, we can test
whether A and B are independent by comparing the model fits of the
following two log-linear models for the cell counts Y :
• the log-linear model which assumes that A and B are independent, given
by
Y ∼ A + B
• the saturated log-linear model which assumes that A and B are not
independent, given by
Y ∼ A + B + A:B.
Since the second model is the saturated model, we can compare the fits of
these two models using the residual deviance for the model Y ∼ A + B. If
we conclude that the model Y ∼ A + B is an adequate fit to the data, then
we can conclude that A and B are independent.
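This test can be sketched in R as follows; tab2, with factors A and B and
a column of counts, is a hypothetical name for this illustration.

# Fit the independence model and test its fit against
# chi-squared((K-1)(L-1)) using the residual deviance.
m.ab <- glm(count ~ A + B, family = poisson, data = tab2)
pchisq(deviance(m.ab), df = df.residual(m.ab), lower.tail = FALSE)
# A large p-value indicates an adequate fit, and hence that A and B
# can be taken to be independent.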
When a contingency table has more than two classifying variables, the
log-linear model can be extended to accommodate the extra main effects
and interactions required. In this situation, there are several possible
models to choose from for a given contingency table.
Since it’s always possible to fit contingency table data perfectly using a
saturated log-linear model, the aim when choosing a log-linear model is to
find a model which is more parsimonious than the saturated model that
also fits the data adequately. As for GLMs, we can assess whether a
particular log-linear model is an adequate fit using the model’s residual
deviance, and we can compare the fits of two log-linear models using the
deviance difference (if the models are nested) or the AIC (otherwise).
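These comparisons can be sketched in R as follows; m.small and m.big
stand for two hypothetical fitted log-linear models, with m.small nested
within m.big.

anova(m.small, m.big, test = "Chisq")  # deviance difference (nested models)
AIC(m.small, m.big)                    # smaller AIC preferred
step(m.big)                            # stepwise search starting from m.big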
A log-linear model and a logistic regression model for the same contingency
table are linked term by term:

Interactions in logistic regression model   ←→   Interactions in log-linear model
Main effect A, but no interaction           ←→   Y:A
A:B                                         ←→   Y:A:B
A:B:C                                       ←→   Y:A:B:C
  ⋮                                                ⋮
There are various pros and cons of using either logistic regression or a
log-linear model for a contingency table. The choice is often dictated by
the research question of interest and what we’d like to learn from the data.
A reminder of what has been studied in Unit 8 and how the sections link
together is shown in the following route map.
Route map
Section 1: The modelling problem
Section 2: Introducing log-linear models for two-way contingency tables
Section 3: Are the classifying variables in a two-way table independent?
Section 4: Contingency tables with more than two variables
Section 5: How are the classifying variables related?
Section 6: Logistic and log-linear models
Learning outcomes
After you have worked through this unit, you should be able to:
• understand that a log-linear model takes the counts in a contingency
table as values of the response, and the classifying variables as the
explanatory variables
• understand that a log-linear model is a Poisson GLM with the canonical
log link
• interpret mosaic plots for two-way contingency tables
• appreciate that the log-linear model Y ∼ A + B assumes that A and B
are independent
• understand why the log-linear model Y ∼ A + B + A:B is the saturated
log-linear model for a two-way contingency table
• use the residual deviance of Y ∼ A + B to test whether A and B are
independent
• understand how the log-linear model can be extended to model
contingency tables with more than two classifying variables
• appreciate that choosing a log-linear model involves finding a log-linear
model which is simpler than the saturated log-linear model, but also fits
the data adequately
• compare the fits of log-linear models using the deviance difference and
the AIC
• understand and use the hierarchical principle
• understand which terms must be included in a log-linear model when
totals are fixed in a contingency table
• interpret the final model for a three-way contingency table in terms of
what the model tells us about the relationships between the classifying
variables
• understand when a contingency table can be modelled by logistic
regression
• appreciate the link between a logistic regression model and a log-linear
model for the same contingency table data
• produce a two-way contingency table in R
• obtain a mosaic plot for a two-way contingency table in R
• fit a log-linear model in R
• use R to test whether two classifying variables in a two-way contingency
table are independent
• use stepwise regression in R to choose a log-linear model and interpret
what the chosen model tells us about the relationships between the
classifying variables.
References
Cathie Marsh Institute for Social Research (2019) ‘Living Costs and Food
Survey, 2013: Unrestricted Access Teaching Dataset’. 2nd edn. Available
at: https://doi.org/10.5255/UKDA-SN-7932-2
(Accessed: 9 September 2022).
de Jong, P. and Heller, G.Z. (2008) Generalized linear models for insurance
data. Cambridge: Cambridge University Press.
Mead, R., Curnow, R.N. and Hasted, A.M. (2002) Statistical methods in
agriculture and experimental biology. 3rd edn. London: Chapman and
Hall/CRC Press.
Paulino, C.D. and Singer, J.M. (2006) Análise de dados categorizados. São
Paulo: Edgard Blucher.
Tutz, G. (2011) Regression for categorical data. New York: Cambridge
University Press, Chapter 12.
YouGov (2019) ‘How Britain voted in the 2019 general election: YouGov
Survey Results’. Available at:
https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/
document/wl0r2q1sm4/Results_HowBritainVoted_2019_w.pdf
(Accessed: 31 July 2022).
Acknowledgements
Grateful acknowledgement is made to the following sources for figures:
Subsection 1.1, for rent sign: © stockbroker / www.123rf.com
Subsection 1.2, light bulb moment: © ismagilov / www.123rf.com
Subsection 2.2, hot drink: © Alena Ozerova / www.123rf.com
Subsection 2.2, magician: © andrew ypopov / www.123rf.com
Subsection 3.2, newborn baby: © Jozef Polc / www.123rf.com
Subsection 4.1, chameleon: © Andrey Gudkov / www.123rf.com
Subsection 4.2, fork in road: © varunalight / www.123rf.com
Subsection 4.2, sheep litter: © Aleksandarlittlewolf / Freepik
Subsection 4.5, polling station: © Peter Titmus / www.123rf.com
Subsection 5.2, health check: © Mark Bowden / www.123rf.com
Subsection 6.1, child dental flossing: © wckiw / www.123rf.com
Subsection 6.2, dental floss: © Oleksandr Rybitskyi / www.123rf.com
Every effort has been made to contact copyright holders. If any have been
inadvertently overlooked, the publishers will be pleased to make the
necessary arrangements at the first opportunity.
Solutions to activities
Solution to Activity 1
Apart from some of the questions identified in the text preceding this
activity, some possible questions of interest include:
• Does the relationship between employment and incomeSource differ
according to gender?
• Does the relationship between incomeSource and gender differ
according to the different values of employment?
Solution to Activity 2
In a GLM for these data, we could treat the counts in the contingency
table as values of our response and the two categorical variables gender
and incomeSource as factor explanatory variables.
Solution to Activity 3
(a) There are four rows in the contingency table representing the four
categories of the variable employment, and there are two columns for
the variable gender. Therefore, K = 4 and L = 2.
(b) The value of y32 is the count given in the third row and second
column of the table – namely, 74. This is the count of individuals out
of the 5144 in the UK survey dataset for which the HRP is
unemployed and male.
(c) y+2 is the sum of the counts in the second column, which is the total
number of individuals in the UK survey dataset for which the HRP is
male – that is, 3156.
y4+ is the sum of the counts in the fourth row, which is the total
number of individuals in the UK survey dataset for which the HRP’s
employment status is inactive – that is, 2004.
y++ is the sum of all the counts in the table. The UK survey dataset
has 5144 observations, and so y++ is 5144.
Solution to Activity 4
There are K levels for variable A, and L levels for variable B. Therefore,
the contingency table will have K × L counts, and therefore we will have
K × L values of the response.
Solution to Activity 5
The responses are
Ykl, for k = 1, 2, . . . , K, l = 1, 2, . . . , L.
In Table 5, K = 4 (since there are four rows) and L = 2 (since there are
two columns). Therefore, the responses for modelling these data are: Y11,
Y12, Y21, Y22, Y31, Y32, Y41 and Y42.
Note that, although the UK survey dataset contains data on 5144
observations, when the data are represented by a contingency table with
the classifying variables employment and gender, there are only eight
responses which we wish to model.
Solution to Activity 6
Fixing Y++ to be n in advance imposes constraints on the cell counts in
the contingency table since all the cell counts must then sum to n. But, if
the cell counts must sum to n, then the individual cell counts can’t be
independent of each other (since if one count increases, for example, then
at least one cell count must decrease in order for the total to remain fixed
at n). What’s more, if a count is constrained, then it can’t be assumed to
have a Poisson distribution, since, from Box 1 in Unit 7, a Poisson random
variable Ykl takes unrestricted possible values ykl = 0, 1, . . . .
Solution to Activity 7
From Equation (6), the individual cell counts are calculated as
E(Ykl ) = exp(µ + αA zA + αB zB ).
These are indeed the same expressions for the expected cell counts as given
in Table 8.
Solution to Activity 8
The expressions for the expected cell counts from Activity 7 are shown in
Table S1.
Table S1 Repeat of Table 8

                        incomeSource
gender   earned                other
female   exp(7.001)            exp(7.001 − 0.210)
male     exp(7.001 + 0.462)    exp(7.001 + 0.462 − 0.210)

Evaluating these expressions gives the expected cell counts.

                incomeSource
gender     earned      other
female    1097.73     889.80
male      1742.37    1412.34
Solution to Activity 9
The completed table is given in Table S4.
Table S4 Completed version of Table 12

                 incomeSource
gender     earned      other      Total
female    1097.96     890.04    1988.00
male      1743.04    1412.96    3156.00
Total     2841.00    2303.00    5144.00

So, the row, column and overall totals do match the totals displayed in the
data table given in Table 2.
Solution to Activity 10
(a) Using the given fitted values,
    p̂1+ × p̂+2 ≈ 0.386 × 0.448 ≈ 0.1729 ≈ 0.173 = p̂12
    to three decimal places, as required.
(b) (i) Using the fitted values given in Table 13, p̂21 is
        p̂21 = 1743.04/5144.00 ≈ 0.339.
    (ii) The fitted marginal probabilities are calculated as
        p̂2+ = 3156.00/5144.00 ≈ 0.614,
        p̂+1 = 2841.00/5144.00 ≈ 0.552.
    (iii) From parts (b)(i) and (b)(ii),
        p̂2+ × p̂+1 ≈ 0.614 × 0.552 ≈ 0.3389 ≈ 0.339 = p̂21
        to three decimal places, as required.
Solution to Activity 11
The horizontal widths of the rectangles in the first and second columns –
that is, the rectangles associated with the two levels of incomeSource –
represent, respectively, the proportions of the total sample for which
incomeSource takes levels earned and other. Since the horizontal widths of
the rectangles in the earned column seem to be slightly larger than those in
the other column, it looks like the proportion of individuals taking earned
for incomeSource is slightly larger than the proportion taking other.
The vertical lengths of the rectangles in the first column represent the
proportions of individuals taking levels female and male for gender from
those with earned for incomeSource, while the vertical lengths of the
rectangles in the second column represent the proportions of individuals
taking levels female and male for gender from those with other for
incomeSource.
The vertical lengths of the rectangles for female are not the same in the
two incomeSource columns and, as such, might indicate a relationship
between gender and incomeSource. But, as in Example 4, it’s possible
that the differences are simply what we would expect with random
variation, and so these two variables could be independent.
Solution to Activity 12
Both of the mosaic plots in Figure 4 show large differences between the
vertical lengths of the rectangles for each level of the vertical variable
across the horizontal variable. As such, these mosaic plots suggest that
there is a relationship between these two variables and they do not seem
to be independent.
Solution to Activity 13
(a) (i) There is just one parameter associated with the ‘baseline mean’
term which is included in each linear predictor ηkl .
(ii) There are K levels of factor A. The baseline mean assumes that
A is level 1, which leaves (K − 1) parameters associated with the
other levels of A.
(iii) There are L levels of factor B. The baseline mean assumes that
B is level 1, which leaves (L − 1) parameters associated with the
other levels of B.
(iv) For the interaction term, since there are K levels of A and L
levels of B, there are (K × L) possible combinations of k and l.
However, since an interaction term is set to be zero when either
A or B takes level 1, there are only (K − 1) × (L − 1) possible
combinations of k and l which will have associated parameters.
(b) The total number of parameters in model M2 is the sum of the
numbers of parameters associated with each of the terms in the model
– that is, the sum of the numbers of parameters identified in part (a).
So, using part (a), the number of parameters for model M2 is
1 + (K − 1) + (L − 1) + ((K − 1) × (L − 1))
= K + L − 1 + (KL − K − L + 1)
= KL.
(c) From part (b), there are KL parameters in model M2 . There are also
KL observations in the contingency table (because there are K levels
of A and L levels of B), and so, since the number of model parameters
equals the number of observations, the log-linear model M2 is the
saturated model.
Solution to Activity 14
(a) Using the formula given,
D(M2 ) = 2 × (l(saturated model) − l(M2 )).
But, from Activity 13, we know that M2 is the saturated model, and
so
D(M2 ) = 2 × (l(saturated model) − l(saturated model)) = 0
as required.
Solution to Activity 15
We know that if M1 is a good fit, then
D(M1 ) ≈ χ2 (r),
where
r = number of observations
− number of parameters in the proposed model.
Now, if the classifying variables A and B have K and L levels, respectively,
then the associated contingency table has KL cells, and therefore KL
observations.
Also, using the results from the solution to Activity 13 part (a), the
number of parameters in the log-linear model Y ∼ A + B is
1 + (K − 1) + (L − 1) = K + L − 1.
Therefore
r = KL − (K + L − 1)
= KL − K − L + 1
= (K − 1)(L − 1)
as required.
Solution to Activity 16
(a) The residual deviance should be compared to a χ2 (1) distribution,
since K = L = 2 and so (K − 1)(L − 1) = 1.
(b) The p-value is quite large, which indicates that the residual deviance
is not large, suggesting that the fitted model is an adequate fit to the
data. In turn, this means that we can conclude that gender and
induced are independent.
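For reference, a p-value of this kind is obtained from the upper tail of the
chi-squared distribution in R; here dev is a placeholder for the reported
residual deviance, not a value from the activity.

# p-value comparing a residual deviance 'dev' with chi-squared(1)
pchisq(dev, df = 1, lower.tail = FALSE)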
Solution to Activity 17
(a) There are four levels of employment, and so K = 4. Both of the
variables incomeSource and gender have two levels, and so L = 2
and S = 2.
So
K × L × S = 4 × 2 × 2 = 16
and there are indeed 16 counts in the contingency table.
(b) There are four rows in the contingency table representing the four
categories of the variable employment. There are four columns: the
first two columns represent the two genders when incomeSource takes
the value earned, while the last two columns represent the two
genders when incomeSource takes the value other.
Solution to Activity 18
(a) When there are four classifying variables, there are six two-way
interactions:
A:B, A:C, A:D, B:C, B:D, C:D,
four three-way interactions:
A:B:C, A:B:D, A:C:D, B:C:D,
and one four-way interaction:
A:B:C:D.
(b) Hence the saturated model for this contingency table is:
Y ∼ A + B + C + D + A:B + A:C + A:D + B:C + B:D + C:D
+ A:B:C + A:B:D + A:C:D + B:C:D + A:B:C:D.
Solution to Activity 19
The log-linear model is a GLM, and we know from Subsection 5.1 of
Unit 7 that we can use the residual deviance of a proposed GLM to test
whether the proposed model is an adequate fit to the data.
So, we could use the residual deviance of the log-linear model M as a test
statistic to test whether M is an adequate fit to the data.
Solution to Activity 20
From Box 6, a log-linear model can be considered an adequate fit to the
data if the residual deviance is less than or equal to the associated degrees
of freedom.
Therefore, model M2 can be considered to be an adequate fit because its
residual deviance is 0.29, which is much less than the associated degrees of
freedom, which is 3. Model M4 can also be considered to be an adequate
fit because its residual deviance is 1.22, which is also less than the
associated degrees of freedom, which is 4.
In contrast, the residual deviances for models M1 and M3 are both very
large in comparison to their associated degrees of freedom, and so neither
of these models can be considered to be an adequate fit by the ‘rule of
thumb’.
Solution to Activity 21
(a) From Box 15 in Subsection 5.2 of Unit 7, we can compare the fits of
two nested GLMs using the deviance difference given by
deviance difference = D(M1 ) − D(M2 ),
where D(M1 ) and D(M2 ) denote the residual deviances for models
M1 and M2 , respectively.
A large deviance difference means that too much fit is lost by using
the more parsimonious model M1 , and we would therefore prefer M2 .
On the other hand, a small deviance difference means that there isn’t
much fit lost with the more parsimonious model M1 , and so we prefer
M1 .
(b) From Box 16 in Subsection 5.2 of Unit 7, we can use the values of the
AIC to compare the fits of two non-nested GLMs. The preferred
model is the model with the smallest AIC value.
(c) Stepwise regression can be used for selecting a GLM from a set of
alternatives, and therefore can also be used for selecting a log-linear
model.
Solution to Activity 22
(a) M2 is the model with all main effects and all three two-way
interactions, while M4 is the model with all main effects and two of
the two-way interactions. As such, M4 is nested within M2 .
Therefore, from Activity 21 part (a), we can use the deviance
difference between the two models to compare their fits.
(c) The deviance difference is 0.93 and the value of the associated degrees
of freedom is 1. Therefore, since 0.93 < 1 (that is, the deviance
difference is less than the associated degrees of freedom), the deviance
difference is small enough for us to conclude that M4 is an adequate
fit in comparison to M2 . We therefore prefer the more parsimonious
model M4 .
(d) We’d prefer the model with the smaller AIC value. We’d therefore
prefer model M4 in comparison to model M2 .
(As an aside, notice that both the deviance difference and the AIC
values led us to the same conclusion for these data and models.
However, it is worth noting that these two methods don’t necessarily
always lead to the same conclusion.)
Solution to Activity 23
(a) Since the p-value is extremely small (p < 0.0001), which means the
residual deviance is very large, model M1 is not an adequate fit to the
data.
(b) Since the associated p-value is quite large (p = 0.265), which means
the residual deviance is not large, model M2 does seem to be an
adequate fit to the data.
(c) The only model in Table 19 with a large p-value is model M4 , which is
the model with the interaction litterSize:breed omitted. So, this
means that the fits of the two models M4
count ∼ litterSize + farm + breed
+ litterSize:farm + farm:breed
and M2
count ∼ litterSize + farm + breed
+ litterSize:farm + litterSize:breed
+ farm:breed
are not significantly different. So, if we drop the interaction
litterSize:breed, then the resulting model will still be an adequate
fit.
(d) Since each of the models with the main effects and only one of the
two-way interactions has a p-value which is extremely small, none of
these models is an adequate fit to the data. Therefore, model M4 is
the simplest model with an adequate fit to the data, and so M4 is the
preferable model to choose.
Solution to Activity 24
Model M1 is not hierarchical, because M1 includes the two-way
interactions A:C and B:C, which both involve C, but the model doesn’t
include the main effect C.
Model M2 is hierarchical, because the only interaction included in the
model is A:B, and both A and B are included as main effects in the model.
Model M3 is not hierarchical, because it includes the three-way interaction
A:B:C, but doesn’t include all of the lower-order interactions involving
these variables. In particular, the model doesn’t include the two-way
interactions A:C and B:C.
Solution to Activity 25
(a) If the totals for each level of B are fixed, then the main effect B needs
to be included in the model. In addition, we need to include the main
effects A, C and D, since we’re using the rule that all of the main
effects need to be included in the log-linear model. So, the simplest
possible log-linear model we could use is
Y ∼ A + B + C + D.
(b) If the totals for each combination of the levels of C and D are fixed,
then the interaction C:D needs to be included in the model. In
addition, we’re using the rule that all of the main effects need to be
included in the log-linear model. So, the simplest possible log-linear
model we could use is
Y ∼ A + B + C + D + C:D.
(c) If the totals for each combination of the levels of A, B and D are
fixed, then the interaction A:B:D needs to be included in the model.
Additionally, so that the model is hierarchical, we need to include all
lower-order interactions including A, B and D, as well as their main
effects. As usual, we also need to include all of the main effects. So,
the simplest possible log-linear model we could use is
Y ∼ A + B + C + D + A:B + A:D + B:D + A:B:D.
Solution to Activity 26
The corresponding model with no interaction for two-way contingency
tables is
Y ∼ A + B.
We already know that if this model fits the data adequately, then we
would conclude that the two variables are independent of each other. This
suggests that if our final model for a three-way contingency table is a
log-linear model with no interactions, then we should conclude that the
variables A, B and C are independent of one another.
Solution to Activity 27
For two-way contingency tables, if the interaction is required in the model,
then we conclude that the two variables are not independent. This suggests
that if the interaction A:B is in the three-way model, then A and B are not
independent of one another. However, there are no interactions associated
with C, which would suggest that both A and B are independent of C.
Solution to Activity 28
(a) There is the single two-way interaction A:C, but no interactions
associated with B, and so A and C are jointly independent of B.
(b) This time, there is the single two-way interaction B:C, but no
interactions associated with A, and so B and C are jointly
independent of A.
Solution to Activity 29
(a) Since the interaction A:B is in the model, then that suggests that A
and B are not independent, and since the interaction B:C is in the
model, then that suggests that B and C are not independent.
However, there are no interactions involving both A and C together,
and these are only related to each other through their relationships
with B. So A and C are conditionally independent, given B.
(b) Since the interaction A:C is in the model, then that suggests that A
and C are not independent, and since the interaction B:C is in the
model, then that suggests that B and C are not independent.
However, there are no interactions involving both A and B together,
and these are only related to each other through their relationships
with C. So A and B are conditionally independent, given C.
Solution to Activity 30
(a) This model includes all three of the two-way interactions, and so there
is uniform association between the variables.
(b) There are no interactions in this model, and so the variables A, B and
C are mutually independent.
(c) The interaction A:C is in the model, so that A and C are not
independent, and the interaction B:C is in the model, so that B and
C are not independent. However, there is no two-way interaction term
involving both A and B together, and these are only related to each
other through their relationships with C. So A and B are
conditionally independent, given C.
(d) The single two-way interaction A:B is in the model so that A and B
are not independent of each other, but there is no interaction
involving C. So, A and B are jointly independent of C.
Solution to Activity 31
The interaction employment:gender is in the model, so that employment
and gender are not independent, and the term employment:incomeSource
is in the model, so that employment and incomeSource are not
independent. However, there is no two-way interaction involving both
gender and incomeSource together, and these are only related to each
other through their relationships with employment. So, gender and
incomeSource are conditionally independent, given employment.
Solution to Activity 32
If we were to model this contingency table using a log-linear model, then
the cell counts would be taken as being values of the response, and the
classifying variables gender, age, frequency and able (and their
interactions) would be the possible explanatory variables.
Solution to Activity 33
The variable able is a binary variable, and so can therefore be modelled as
the response variable using logistic regression.
Solution to Activity 34
In logistic regression, including an explanatory variable in the model
means that this variable is taken to affect the response variable.
Therefore, age and frequency both affect able, as does their interaction
age:frequency.
However, gender does not appear in the model, and so gender does not
affect able.
Solution to Activity 35
(a) Since the logistic regression model for these data contains the main
effects for A, B and C, the log-linear model for these data will contain
the three two-way interactions Y :A, Y :B and Y :C.
The logistic regression model also contains the two-way interaction
B:C, so the corresponding log-linear model will also contain the
three-way interaction Y :B:C.
(b) In the log-linear model, the interactions involving C are A:C, C:D
and A:C:D. Therefore, a logistic regression model for C would have
the main effects for A and D, and the two-way interaction A:D.
So, the corresponding logistic regression model for C is
C ∼ A + D + A:D.
Index
χ2 distribution 53
AIC 67, 162
Akaike information criterion 67
Bernoulli distribution 17
Bernoulli GLM 128
binary response 6
binomial distribution 150
binomial GLM 150, 154
  overdispersion 177
burns 15
canonical link function 134
cell probability (for contingency table) 214
chi-squared distribution 53
complementary log-log link function 30
conditionally independent 258
contingency table 208
  joint cell probability 214
  log-linear and logistic regression models 264, 266, 268
  marginal cell probability 214
  modelling strategy 210
  notation for three-way table 241
  notation for two-way table 211
  relationships (in three-way table) 259
  response variable 213
  three-way 209, 240
  two-way 209
  visualising the data 224
count response 108, 119
dataset
  Australian health insurance 260
  burns 15
  dental flossing 262
  epileptic seizures 182
  European companies 8
  GB companies 9
  leukaemia survival 142
  newborn births 237
  Philippines 40 to 80 122
  Philippines survey 110
  sheep litters 248
  UK election 255
  UK survey 207
dentalFlossing 262
deviance difference 62, 64, 158, 232
  testing 62
deviance residuals 71
diagnostic plots 72, 166
dispersion parameter 177
distribution
  Bernoulli 17
  binomial 150
  chi-squared 53
  exponential 143
  normal 71
  Poisson 113
europeanCompanies 8
exponential distribution 143
exponential family 129
exponential GLM 149
exposure 179
fitted GLM 128
fitted linear predictor 128
fitted mean response 137, 140
fitted success probability 46, 47
fixed totals in log-linear models 217, 252
gbCompanies 9
generalised linear model 3, 125, 127
GLM 125, 127
  assumptions 164
  diagnostic plots 166
  fitted mean response 137, 140
  for a Poisson rate 179, 181
  Poisson 216
  predicted mean response 139, 140
  prediction interval 140
  with binomial response 150, 154
  with exponential response 149
  with Poisson response 125
healthInsurance 260
hierarchical principle 250
identity link 134
interaction term 230, 242
inverse link function 135, 137
joint probability (for contingency table) 214
jointly independent 257
regression
  linear 115, 116
  logistic 115, 117
  Poisson 125
  stepwise 162
residual deviance 52, 55, 156
  rule of thumb 57, 157
  testing 53, 54
response
  binary 6
  binomial 150, 154
  count 108, 119
  exponential 149
  number of successes 149
  time between occurrences 142
ukElection 255
ukSurvey 207
underdispersion 177
uniform association 259