Lecture 10
The data below are from a study conducted by Milicer and Szczotka on pre-teen and teenage
girls in Warsaw. The subjects were classified into 25 age categories. The number of girls in
each group (sample size) and the number that reached menarche (# RM) at the time of the
study were recorded. The age for a group corresponds to the midpoint for the age interval.
The researchers were interested in whether the proportion of girls that reached menarche (
# RM/ sample size ) varied with age. One could perform a test of homogeneity by arranging
the data as a 2 by 25 contingency table with columns indexed by age and two rows: ROW1
= # RM and ROW2 = # that have not RM = sample size − # RM. A more powerful
approach treats these as regression data, using the proportion of girls reaching menarche as
the “response” and age as a predictor.
The data were imported into Stata using the infile command and labelled menarche,
total, and age. A plot of the observed proportion of girls that have reached menarche
(obtained in Stata with the two commands generate phat = menarche / total and
twoway (scatter phat age)) shows that the proportion increases as age increases, but
that the relationship is nonlinear.
[Figure: scatterplot of phat (vertical axis, 0 to 1) against age (horizontal axis, 10 to 18).]
The observed proportions, which are bounded between zero and one, have a lazy S-shape
(a sigmoidal function) when plotted against age. The change in the observed proportions
for a given change in age is much smaller when the proportion is near 0 or 1 than when
the proportion is near 1/2. This phenomenon is common with regression data where the
response is a proportion.
The trend is nonlinear so linear regression is inappropriate. A sensible alternative might
be to transform the response or the predictor to achieve near linearity. A better approach is
to use a non-linear model for the proportions. A common choice is the logistic regression
model.
[Figure: the logistic model on the logit scale (log-odds against X, with straight lines labelled + slope, 0 slope, and - slope) and on the probability scale (probability against X, with curves labelled + slope, 0 slope, and - slope).]
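The logistic regression model assumes that the population proportion p is related to the
predictor X through
$$\log\left(\frac{p}{1-p}\right) = \alpha + \beta X,$$
or, equivalently, as
$$p = \frac{\exp(\alpha + \beta X)}{1 + \exp(\alpha + \beta X)}.$$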
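The two forms of the model can be checked against each other numerically. The following is a minimal sketch in Python; the function names and the parameter values α = −2 and β = 0.5 are hypothetical, chosen only for illustration.

```python
import math

def log_odds(p):
    """Logit transform: log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic_probability(alpha, beta, x):
    """Probability-scale form: exp(a + b x) / (1 + exp(a + b x))."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = -2.0, 0.5
for x in (0.0, 4.0, 8.0):
    p = logistic_probability(alpha, beta, x)
    # Applying the logit to p recovers the linear predictor alpha + beta * x.
    print(x, round(p, 4), round(log_odds(p), 4))
```

The probability is exactly 1/2 where α + βX = 0, which for these illustrative values is at X = 4.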
The logistic regression model is a binary response model, where the response for
each case falls into one of 2 exclusive and exhaustive categories, often called success (cases
with the attribute of interest) and failure (cases without the attribute of interest). In many
biostatistical applications, the success category is presence of a disease, or death from a
disease.
I will often write p as p(X) to emphasize that p is the proportion of all individuals with
score X that have the attribute of interest. In the menarche data, p = p(X) is the population
proportion of girls at age X that have reached menarche.
The odds of success are p/(1 − p). For example, the odds of success are 1 (or 1 to 1) when
p = 1/2. The odds of success are 2 (or 2 to 1) when p = 2/3. The logistic model assumes
that the log-odds of success is linearly related to X. Graphs of the logistic model relating p
to X are given above. The sign of the slope refers to the sign of β.
There are a variety of other binary response models that are used in practice. The probit
regression model or the complementary log-log regression model might be appropriate
when the logistic model does not fit the data.
Grouped data for a simple logistic regression can be summarized in the form

    X     n     D
    X1    n1    d1
    X2    n2    d2
    .     .     .
    .     .     .
    Xm    nm    dm

where di is the number of individuals with the attribute of interest (number of diseased)
among ni randomly selected or representative individuals with predictor variable value Xi .
The subscripts identify the group of cases in the data set. In many situations, the sample
size is 1 in each group, and for this situation di is 0 or 1.
For raw data on individual cases, the sample size column n is usually omitted and D
takes on 1 of two coded levels, depending on whether the case at Xi is a success or not. The
values 0 and 1 are typically used to identify “failures” and “successes” respectively.
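The relationship between the grouped form and the raw form can be sketched in Python. The helper name expand_grouped and the small data set below are hypothetical (these are not the menarche data); the sketch only shows how each grouped row (Xi, ni, di) corresponds to ni individual records coded 0 or 1.

```python
def expand_grouped(x_values, n_values, d_values):
    """Turn grouped rows (X_i, n_i, d_i) into one (X, D) record per case."""
    records = []
    for x, n, d in zip(x_values, n_values, d_values):
        records.extend([(x, 1)] * d)        # d_i successes coded 1
        records.extend([(x, 0)] * (n - d))  # n_i - d_i failures coded 0
    return records

# Hypothetical grouped data, for illustration only.
x_values = [10.0, 12.0, 14.0]
n_values = [4, 5, 3]
d_values = [0, 2, 3]

raw = expand_grouped(x_values, n_values, d_values)
print(len(raw))                  # total number of individual cases
print(sum(d for _, d in raw))    # total number of successes
```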
The maximum likelihood estimates (MLEs) of the regression coefficients are computed
iteratively by maximizing the so-called Binomial likelihood function for the responses, or
equivalently, by minimizing the deviance function (also called the likelihood ratio (LR)
chi-squared statistic)
$$LR = 2\sum_{i=1}^{m}\left\{ d_i \log\left(\frac{d_i}{n_i p_i}\right) + (n_i - d_i)\log\left(\frac{n_i - d_i}{n_i - n_i p_i}\right)\right\}$$
over all possible values of α and β, where the $p_i$'s satisfy
$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta X_i.$$
The ML method also gives standard errors and significance tests for the regression estimates.
The deviance is an analog of the residual sums of squares in linear regression. The choices
for α and β that minimize the deviance are the parameter values that make the observed
and fitted proportions as close together as possible in a “likelihood sense”.
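To make the deviance concrete, here is a minimal Python sketch that evaluates LR for a given (α, β) on grouped data. The data and the two candidate parameter pairs are hypothetical, and the convention 0·log(0) = 0 for groups with di = 0 or di = ni is an assumption of this sketch (it is the usual one). It only evaluates the function; it does not carry out the iterative minimization.

```python
import math

def fitted_prob(alpha, beta, x):
    """p_i implied by log(p_i / (1 - p_i)) = alpha + beta * x."""
    eta = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-eta))

def xlogy(count, ratio):
    """count * log(ratio), with the convention that 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(ratio)

def deviance(alpha, beta, x_values, n_values, d_values):
    total = 0.0
    for x, n, d in zip(x_values, n_values, d_values):
        p = fitted_prob(alpha, beta, x)
        total += xlogy(d, d / (n * p)) + xlogy(n - d, (n - d) / (n - n * p))
    return 2.0 * total

# Hypothetical grouped data, for illustration only.
x_values = [10.0, 12.0, 14.0, 16.0]
n_values = [20, 20, 20, 20]
d_values = [2, 8, 14, 19]

dev_good = deviance(-10.9, 0.857, x_values, n_values, d_values)  # tracks the data
dev_flat = deviance(0.0, 0.0, x_values, n_values, d_values)      # p = 1/2 everywhere
print(round(dev_good, 3), round(dev_flat, 3))
```

The parameter pair whose fitted proportions track the observed proportions gives the smaller deviance, which is the sense in which minimizing LR brings observed and fitted proportions together.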
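Suppose that $\hat\alpha$ and $\hat\beta$ are the MLEs of α and β. The deviance evaluated at the MLEs:
$$LR = 2\sum_{i=1}^{m}\left\{ d_i \log\left(\frac{d_i}{n_i \hat p_i}\right) + (n_i - d_i)\log\left(\frac{n_i - d_i}{n_i - n_i \hat p_i}\right)\right\},$$
where the fitted probabilities $\hat p_i$ satisfy
$$\log\left(\frac{\hat p_i}{1 - \hat p_i}\right) = \hat\alpha + \hat\beta X_i,$$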
is used to test the adequacy of the model. The deviance is small when the data fits the
model, that is, when the observed and fitted proportions are close together. Large values
of LR occur when one or more of the observed and fitted proportions are far apart, which
suggests that the model is inappropriate.
If the logistic model holds and the group sizes ni are reasonably large, then LR has
approximately a chi-squared distribution with m − r degrees of freedom, where m is the
number of groups and r (here 2) is the number of estimated regression parameters. A p-value
for the deviance is given by the area under the chi-squared curve to the right of LR. A small
p-value indicates that the data does not fit the model.
Stata does not provide the deviance statistic, but rather the Pearson chi-squared test
statistic, which is defined similarly to the deviance statistic and is interpreted in the same
manner:
$$X^2 = \sum_{i=1}^{m} \frac{(d_i - n_i \hat p_i)^2}{n_i \hat p_i (1 - \hat p_i)}.$$
This statistic can be interpreted as the sum of standardized, squared differences between
the observed number of successes $d_i$ and expected number of successes $n_i \hat p_i$ for each covariate
$X_i$. When what we expect to see under the model agrees with what we see, the Pearson statistic
is close to zero, indicating good model fit to the data. When the Pearson statistic is large,
we have an indication of lack of fit. Often the Pearson residuals
$$r_i = \frac{d_i - n_i \hat p_i}{\sqrt{n_i \hat p_i (1 - \hat p_i)}}$$
are used to determine exactly where lack of fit occurs. These residuals are obtained in Stata
using the predict command after the logistic command. Examining these residuals is
very similar to looking for large values of $(O - E)^2 / E$ in a $\chi^2$ analysis of a contingency table as
discussed in the last lecture. We will not talk further of logistic regression diagnostics.
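The Pearson residuals and the Pearson statistic can be sketched in a few lines of Python. The group sizes, success counts, and fitted probabilities below are hypothetical; in practice the fitted probabilities would come from the fitted model.

```python
import math

def pearson_residuals(n_values, d_values, fitted_probs):
    """r_i = (d_i - n_i p_i) / sqrt(n_i p_i (1 - p_i)) for each group."""
    return [
        (d - n * p) / math.sqrt(n * p * (1.0 - p))
        for n, d, p in zip(n_values, d_values, fitted_probs)
    ]

# Hypothetical group sizes, success counts, and fitted probabilities.
n_values = [20, 20, 20, 20]
d_values = [2, 8, 14, 19]
fitted_probs = [0.09, 0.35, 0.75, 0.94]

residuals = pearson_residuals(n_values, d_values, fitted_probs)
pearson_x2 = sum(r * r for r in residuals)  # X^2 is the sum of squared residuals

# Index of the group with the largest absolute residual (the worst-fitting group).
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print([round(r, 3) for r in residuals], round(pearson_x2, 3), worst)
```

Because X² is the sum of the squared residuals, a group with a large |ri| is one that contributes heavily to any lack of fit.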
A logistic regression model with a single predictor can be fit using one of the many
commands available in Stata depending on the data type and desired results: logistic
(raw data, outputs odds ratios), logit (raw data, outputs model parameter estimates),
and blogit (grouped data). The logistic command has many more options than either
logit or blogit, but requires you to reformat the data into individual records, one
for each girl. For an example of how to do this, check out the online Stata help at
http://www.stata.com/support/faqs/stat/grouped.html. The Stata command blogit
menarche total age yields the following output:
Logit estimates Number of obs = 3918
LR chi2(1) = 3667.18
Prob > chi2 = 0.0000
Log likelihood = -819.65237 Pseudo R2 = 0.6911
------------------------------------------------------------------------------
_outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 1.631968 .0589509 27.68 0.000 1.516427 1.74751
_cons | -21.22639 .7706558 -27.54 0.000 -22.73685 -19.71594
------------------------------------------------------------------------------
The output tables the MLEs of the parameters: $\hat\alpha = -21.23$ and $\hat\beta = 1.63$. Thus, the
fitted or predicted probabilities satisfy:
$$\log\left(\frac{\hat p}{1 - \hat p}\right) = -21.23 + 1.63\,\mathrm{AGE}$$
or
$$\hat p(\mathrm{AGE}) = \frac{\exp(-21.23 + 1.63\,\mathrm{AGE})}{1 + \exp(-21.23 + 1.63\,\mathrm{AGE})}.$$
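As a worked illustration of the fitted equation, the Python sketch below evaluates the fitted probability at a few ages using the unrounded coefficients from the blogit output. The choice of ages and the name p_hat are for illustration only. It also solves for the age at which the fitted probability equals 1/2, which is where the fitted log-odds is zero.

```python
import math

ALPHA_HAT = -21.22639   # _cons from the blogit output
BETA_HAT = 1.631968     # age coefficient from the blogit output

def p_hat(age):
    """Fitted probability of having reached menarche at a given age."""
    eta = ALPHA_HAT + BETA_HAT * age
    return math.exp(eta) / (1.0 + math.exp(eta))

for age in (10, 12, 13, 14, 16):
    print(age, round(p_hat(age), 3))

# Age at which the fitted probability is 1/2: solve alpha + beta * age = 0.
median_age = -ALPHA_HAT / BETA_HAT
print(round(median_age, 2))
```

Under the fitted model, the estimated age at which half of the girls have reached menarche is about 13.0 years.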
The p-value for testing H0 : β = 0 (i.e. the slope for the regression model is zero) based
upon the Wald z-test (P>|z|) is reported as 0.000, that is, less than 0.0005, which leads to
rejecting H0 at any of the usual test levels. Thus, the proportion of girls that have reached
menarche is not constant across age groups.
The likelihood ratio test statistic of no logistic regression relationship (LR chi2(1) =
3667.18) and its p-value (Prob > chi2 = 0.0000) are the logistic regression analogue of the
overall F-test in multiple regression, which tests that no predictors are important. In general,
the chi-squared statistic provided here is used to test the hypothesis that the regression
coefficients are zero for every predictor in the model. There is a single predictor here, AGE,
so this test and the test for the AGE effect are both testing H0 : β = 0.
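Several entries in the blogit output can be reproduced from one another, which may help in reading the table. The Python sketch below uses only numbers printed in the output; the variable names are illustrative. It relies on three standard relationships: the z column is the coefficient divided by its standard error, LR chi2 is twice the difference between the fitted and null log likelihoods, and Stata's Pseudo R2 is one minus the ratio of those two log likelihoods.

```python
coef, se = 1.631968, 0.0589509     # age row of the blogit output
z = coef / se                      # Wald statistic reported in the z column

log_lik = -819.65237               # "Log likelihood" line
lr_chi2 = 3667.18                  # "LR chi2(1)" line
# LR chi2 = 2 * (log_lik - log_lik_null), so the null log likelihood is:
log_lik_null = log_lik - lr_chi2 / 2.0
pseudo_r2 = 1.0 - log_lik / log_lik_null

# Wald 95% interval for beta: coef +/- 1.96 * se
lower, upper = coef - 1.96 * se, coef + 1.96 * se
print(round(z, 2), round(pseudo_r2, 4), round(lower, 3), round(upper, 3))
```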
To obtain the Pearson goodness of fit statistic and p-value we must reformat the data
and use the logistic command as described in the webpage above:
generate w0 = total - menarche
rename menarche w1
generate id = _n
reshape long w, i(id) j(y)
logistic y age [fw=w]
lfit
We obtain the following output:
Logistic regression Number of obs = 3918
LR chi2(1) = 3667.18
Prob > chi2 = 0.0000
Log likelihood = -819.65237 Pseudo R2 = 0.6911
------------------------------------------------------------------------------
y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 5.113931 .3014706 27.68 0.000 4.555917 5.740291
------------------------------------------------------------------------------
Logistic model for y, goodness-of-fit test
number of observations = 3918
number of covariate patterns = 25
Pearson chi2(23) = 21.87
Prob > chi2 = 0.5281
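The logistic output reports the same fit as the blogit output, but on the odds ratio scale. The Python sketch below shows how the two tables correspond, using numbers printed in the outputs; the variable names are illustrative. The odds ratio is the exponentiated coefficient, its standard error follows from the delta method (odds ratio times the coefficient's standard error), and its confidence limits are the exponentiated limits for the coefficient. The Pearson degrees of freedom are the number of covariate patterns minus the number of estimated parameters.

```python
import math

coef, se = 1.631968, 0.0589509         # logit-scale estimate and SE for age
ci_low, ci_high = 1.516427, 1.74751    # logit-scale 95% limits

odds_ratio = math.exp(coef)            # odds multiplier per one-year increase in age
or_se = odds_ratio * se                # delta-method SE on the odds ratio scale
or_ci = (math.exp(ci_low), math.exp(ci_high))

# Degrees of freedom for the Pearson test: covariate patterns minus parameters.
pearson_df = 25 - 2
print(round(odds_ratio, 4), round(or_se, 4), [round(v, 3) for v in or_ci], pearson_df)
```

Read this way, each additional year of age multiplies the estimated odds of having reached menarche by about 5.1.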
Logistic Regression with Two Effects: Leukemia Data
Feigl and Zelen reported the survival time in weeks and the white cell blood count (WBC)
at time of diagnosis for 33 patients who eventually died of acute leukemia. Each person was
classified as AG+ or AG- (coded as IAG = 1 and 0, respectively), indicating the presence
or absence of a certain morphological characteristic in the white cells. The researchers are
interested in modelling the probability p of surviving at least one year as a function of WBC
and IAG. They believe that WBC should be transformed to a log scale, given the skewness in
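the WBC values. The variable Live equals 1 if the patient survived at least one year and 0
otherwise.
[Data table: Live, IAG, and WBC for the 33 patients.]
The model considered is
$$\log\left(\frac{p}{1-p}\right) = \alpha + \beta_1\,\mathrm{LWBC} + \beta_2\,\mathrm{IAG},$$
where LWBC = log WBC. This is a logistic regression model with 2 effects, fit using the
logistic command. The parameters α, β1 and β2 are estimated by maximum likelihood.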
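The model is best understood by separating the AG+ and AG- cases. For AG- individuals,
IAG=0 so the model reduces to
$$\log\left(\frac{p}{1-p}\right) = \alpha + \beta_1\,\mathrm{LWBC} + \beta_2 \cdot 0 = \alpha + \beta_1\,\mathrm{LWBC}.$$
For AG+ individuals, IAG=1 and the model implies
$$\log\left(\frac{p}{1-p}\right) = \alpha + \beta_1\,\mathrm{LWBC} + \beta_2 \cdot 1 = (\alpha + \beta_2) + \beta_1\,\mathrm{LWBC}.$$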
The model without IAG (i.e. β2 = 0) is a simple logistic model where the log-odds of
surviving one year is linearly related to LWBC, and is independent of AG. The reduced
model with β2 = 0 implies that there is no effect of the AG level on the survival probability
once LWBC has been taken into account.
Including the binary predictor IAG in the model implies that there is a linear
relationship between the log-odds of surviving one year and LWBC, with a constant slope for
the two AG levels. This model includes an effect for the AG morphological factor, but more
general models are possible. Thinking of IAG as a factor, the proposed model is a logistic
regression analog of ANCOVA.
The parameters are easily interpreted: α and α + β2 are intercepts for the population
logistic regression lines for AG- and AG+, respectively. The lines have a common slope, β1 .
The β2 coefficient for the IAG indicator is the difference between intercepts for the AG+
and AG- regression lines. A picture of the assumed relationship is given below for β1 < 0.
The population regression lines are parallel on the logit (i.e. log odds ) scale only, but the
order between IAG groups is preserved on the probability scale.
The data are in the raw data form for individual cases. There are three columns: the
binary or indicator variable iag (with value 1 for AG+, 0 for AG-), wbc (continuous), and
live (with value 1 if the patient lived at least 1 year and 0 if not). Note that a frequency
column is not needed with raw data, which is the form the logistic command expects, and
that the success category corresponds to surviving at least 1 year.
[Figure: the equal slopes model on the logit scale (log-odds against LWBC, with parallel lines for IAG=1 and IAG=0) and on the probability scale (probability against LWBC, with curves for IAG=1 and IAG=0).]
Before looking at output for the equal slopes model, note that the data set has 30 distinct
IAG and WBC combinations, or 30 “groups” or samples that could be constructed from the
33 individual cases. Only two samples have more than 1 observation. The majority of
the observed proportions surviving at least one year (number surviving ≥ 1 year / group
sample size) are 0 (i.e. 0/1) or 1 (i.e. 1/1). This sparseness of the data makes it difficult to
graphically assess the suitability of the logistic model (Why?). Although significance tests on
the regression coefficients do not require large group sizes, the chi-squared approximations to
the deviance and Pearson goodness-of-fit statistics are suspect in sparse data settings. With
small group sizes as we have here, most researchers would not interpret the p-values for the
deviance or Pearson tests literally. Instead, they would use the p-values to informally check
the fit of the model. Diagnostics would be used to highlight problems with the model.
We obtain the following modified output:
LR chi2(2) = 15.18
Prob > chi2 = 0.0005
Log likelihood = -13.416354 Pseudo R2 = 0.3613
------------------------------------------------------------------------------
live | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
iag | 2.519562 1.090681 2.31 0.021 .3818672 4.657257
lwbc | -1.108759 .4609479 -2.41 0.016 -2.0122 -.2053178
_cons | 5.543349 3.022416 1.83 0.067 -.380477 11.46718
------------------------------------------------------------------------------
Logistic model for live, goodness-of-fit test
number of observations = 33
number of covariate patterns = 30
Pearson chi2(27) = 19.81
Prob > chi2 = 0.8387
The large p-value (0.8387) for the lack-of-fit chi-square (i.e. the Pearson statistic)
indicates that there are no gross deficiencies with the model. Given that the model fits reasonably
well, a test of H0 : β2 = 0 might be a primary interest here. This checks whether the
regression lines are identical for the two AG levels, which is a test for whether AG affects the
survival probability, after taking LWBC into account. The test that H0 : β2 = 0 is equivalent
to testing that the odds ratio exp(β2) is equal to 1: H0 : exp(β2) = 1. The p-value for this test
is 0.021. The null hypothesis is rejected at the 5% level, though not at the 1% level,
suggesting that the AG level affects the survival probability (assuming a very specific model).
In fact we estimate that the odds of surviving past a year in the AG+ population is 12.4 times
the odds of surviving past a year in the AG- population, with a 95% CI of (1.5, 105.3),
obtained by exponentiating the confidence limits for β2; see below for
this computation carried out explicitly.
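The estimated survival probabilities satisfy
$$\log\left(\frac{\hat p}{1 - \hat p}\right) = 5.54 - 1.11\,\mathrm{LWBC} + 2.52\,\mathrm{IAG}.$$
For AG- individuals with IAG=0, this reduces to
$$\log\left(\frac{\hat p}{1 - \hat p}\right) = 5.54 - 1.11\,\mathrm{LWBC},$$
or equivalently,
$$\hat p = \frac{\exp(5.54 - 1.11\,\mathrm{LWBC})}{1 + \exp(5.54 - 1.11\,\mathrm{LWBC})}.$$
For AG+ individuals with IAG=1,
$$\log\left(\frac{\hat p}{1 - \hat p}\right) = 5.54 - 1.11\,\mathrm{LWBC} + 2.52 \cdot (1) = 8.06 - 1.11\,\mathrm{LWBC},$$
or
$$\hat p = \frac{\exp(8.06 - 1.11\,\mathrm{LWBC})}{1 + \exp(8.06 - 1.11\,\mathrm{LWBC})}.$$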
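The two fitted curves can be compared numerically. The Python sketch below uses the unrounded coefficients from the output; the LWBC values are hypothetical, chosen only to span a range of fitted probabilities, and the name p_survive is illustrative.

```python
import math

ALPHA, B_LWBC, B_IAG = 5.543349, -1.108759, 2.519562   # _cons, lwbc, iag

def p_survive(lwbc, iag):
    """Fitted probability of surviving at least one year."""
    eta = ALPHA + B_LWBC * lwbc + B_IAG * iag
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical LWBC values, chosen only to illustrate the two curves.
for lwbc in (4.0, 6.0, 8.0, 10.0):
    p_neg, p_pos = p_survive(lwbc, 0), p_survive(lwbc, 1)
    # The gap p_pos - p_neg changes with LWBC, but AG+ stays above AG-.
    print(lwbc, round(p_neg, 3), round(p_pos, 3), round(p_pos - p_neg, 3))
```

The difference between the two fitted probabilities is not constant in LWBC, even though the lines are parallel on the logit scale, while the ordering of the two groups is the same at every LWBC.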
Using the logit scale, the difference between AG+ and AG- individuals in the estimated
log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is the estimated IAG
regression coefficient:
$$(8.06 - 1.11\,\mathrm{LWBC}) - (5.54 - 1.11\,\mathrm{LWBC}) = 2.52.$$
Using properties of exponential functions, the odds that an AG+ patient lives at least one
year is exp(2.5196) = 12.42 times the odds that an AG- patient lives at least one
year, regardless of LWBC.
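The odds ratio for AG and its confidence interval can be computed directly from the iag row of the output by exponentiating the coefficient and its confidence limits. The Python sketch below does this and also checks that the log-odds difference between the groups does not depend on LWBC; the LWBC values used in that check are hypothetical.

```python
import math

b_iag = 2.519562                        # iag coefficient
ci_low, ci_high = 0.3818672, 4.657257   # 95% limits for the iag coefficient

odds_ratio = math.exp(b_iag)
or_ci = (math.exp(ci_low), math.exp(ci_high))
print(round(odds_ratio, 2), [round(v, 2) for v in or_ci])

# The log-odds difference between AG+ and AG- does not depend on LWBC.
alpha, b_lwbc = 5.543349, -1.108759
for lwbc in (4.0, 8.0):   # hypothetical LWBC values
    diff = (alpha + b_lwbc * lwbc + b_iag) - (alpha + b_lwbc * lwbc)
    print(lwbc, round(diff, 6))
```

The interval is wide because the standard error of the iag coefficient is large relative to the estimate, which is unsurprising with only 33 cases.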
Although the equal slopes model appears to fit well, a more general model might fit
better. A natural generalization here would be to add an interaction, or product term,
IAG ∗ LWBC to the model. The logistic model with an IAG effect and the IAG ∗ LWBC
interaction is equivalent to fitting separate logistic regression lines to the two AG groups.
This interaction model provides an easy way to test whether the slopes are equal across AG
levels. I will note that the interaction term is not needed here.
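To see why the interaction model amounts to separate lines for the two AG groups, it can be written out as follows. The symbol β3 for the interaction coefficient is notation introduced here for illustration; it does not appear in the fitted output above.

```latex
\begin{align*}
\text{Interaction model:}\quad
  \log\left(\frac{p}{1-p}\right) &= \alpha + \beta_1\,\mathrm{LWBC} + \beta_2\,\mathrm{IAG}
                                   + \beta_3\,(\mathrm{IAG}\times\mathrm{LWBC}) \\
\text{AG- (IAG = 0):}\quad
  \log\left(\frac{p}{1-p}\right) &= \alpha + \beta_1\,\mathrm{LWBC} \\
\text{AG+ (IAG = 1):}\quad
  \log\left(\frac{p}{1-p}\right) &= (\alpha + \beta_2) + (\beta_1 + \beta_3)\,\mathrm{LWBC}
\end{align*}
```

The two groups then have different intercepts (α versus α + β2) and different slopes (β1 versus β1 + β3), so testing H0 : β3 = 0 is a test of equal slopes, and the equal slopes model is the special case β3 = 0.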