Practical Guide to Logistic Regression
Joseph M. Hilbe
Jet Propulsion Laboratory
California Institute of Technology, USA
and
Contents

Preface
Author
1 Statistical Models
1.1 What Is a Statistical Model?
1.2 Basics of Logistic Regression Modeling
1.3 The Bernoulli Distribution
1.4 Methods of Estimation
SAS Code
Stata Code
References
Preface
This book is aimed at the working analyst or researcher who finds that
they need some guidance when modeling binary response data. It is also of
value for those who have not used logistic regression in the past, and who are
not familiar with how it is to be implemented. I assume, however, that the
reader has taken a basic course in statistics, including instruction on applying
linear regression to study data. It is sufficient if you have learned this on your
own. There are a number of excellent books and free online tutorials related to
regression that can provide this background.
I think of this book as a basic guidebook, as well as a tutorial between you
and me. I have spent many years teaching logistic regression, using logistic-
based models in research, and writing books and articles about the subject.
I have applied logistic regression in a wide variety of contexts—for medical
and health outcomes research, in ecology, fisheries, astronomy, transporta-
tion, insurance, economics, recreation, sports, and in a number of other areas.
Since 2003, I have also taught both the month-long Logistic Regression and
Advanced Logistic Regression courses for Statistics.com, a comprehensive
online statistical education program. Throughout this process I have learned
what the stumbling blocks and problem areas are for most analysts when using
logistic regression to model data. Since those taking my courses are located at
research sites and universities throughout the world, I have been able to gain
a rather synoptic view of the methodology and of its use in research in a wide
variety of applications.
In this volume, I share with you my experiences in using logistic regres-
sion, and aim to provide you with the fundamental logic of the model and its
appropriate application. I have written it to be the book I wish I had read when
first learning about the model. It is much smaller and more concise than my 656-page Logistic Regression Models (Chapman & Hall/CRC, 2009), which is a
general reference to the full range of logistic-based models. Rather, this book
focuses on how best to understand the key points of the basic logistic regres-
sion model and how to use it properly to model a binary response variable. I
do not discuss the esoteric details of estimation or provide detailed analysis of
the literature regarding various modeling strategies in this volume, but rather
I focus on the most important features of the logistic model—how to construct
a logistic model, how to interpret coefficients and odds ratios, how to predict
probabilities based on the model, and how to evaluate the model as to its fit. I
also provide a final chapter on Bayesian logistic regression, providing an overview of how it differs from the traditional frequentist approach. An important
component of our examination of Bayesian modeling will be a step-by-step
guide through JAGS code for modeling real German health outcomes data.
The reader should be able to attain a basic understanding of how Bayesian
logistic regression models can be developed and interpreted—and be able to
develop their own models using the explanation in the book as a guideline.
Resources for learning how to model slightly more complicated models will be provided—where to go for the next step. Bayesian modeling is having a continually increasing role in research, and every analyst should at least
become acquainted with how to understand this class of models, and with how
to program basic Bayesian logistic models when doing so is advisable.
R statistical software is used to display all but one statistical model dis-
cussed in the book—exact logistic regression. Otherwise R is used for all data
management, models, postestimation fit analyses, tests, and graphics related
to our discussion of logistic regression in the book. SAS and Stata code for
all examples is provided at the conclusion of each chapter. Complete Stata
and SAS code and output, including graphics and tables, is provided on the
book’s web site. R code is also provided on the book’s web site, as well as in
the LOGIT package posted on CRAN.
R is used in the majority of newly published texts on statistics, as well as
for examples in most articles found in statistics journals published since 2005.
R is open source, meaning that it is possible for users to inspect the actual code
used in the analysis and modeling process. It is also free, costing nothing to
download into one’s computer. A host of free resources is available to learn R,
and blogs exist that can be used to ask others how to perform various opera-
tions. It is currently the most popular statistical software worldwide; hence, it
makes sense to use it for examples in this relatively brief monograph on logis-
tic regression. But as indicated, SAS and Stata users have the complete code
to replicate all of the R examples in the text itself. The code is in both printed
format as well as electronic format for immediate download and use.
A caveat: Keep in mind that when copying code from a PDF document, or
even from a document using a different font from that which is compatible with
R or Stata, you will likely find that a few characters need to be retyped in order
to successfully execute. For example, when pasting program code from a PDF
or Word document into the R editor, characters such as “quotation marks” and
“minus signs” may not convert properly. To remedy this, you need to retype the
quotation or minus sign in the code you are using.
It is also important to remember that this monograph is not about R, or
any specific statistical software package. We will foremost be interested in
the logic of logistic modeling. The examples displayed are aimed to clarify
the modeling process. The R language, although popular and powerful, is
nevertheless tricky. It is easy to make mistakes, and R is rather unforgiving
when you do. I therefore give some space to explaining the R code used in the
modeling and evaluative process when the code may not be clear. The goal is
to provide you with code you can use directly, or adapt as needed, in order to
make your modeling tasks both easier and more productive.
I have chosen to provide Stata code at the end of each chapter since Stata
is one of the most popular and to my mind powerful statistical packages on the
commercial market. It has free technical support and well-used blog and user
LISTSERV sites. In addition, it is relatively easy to program statistical proce-
dures and tests yourself using Stata’s programming language. As a result, Stata
has more programs devoted to varieties of logistic-based routines than any
other statistical package. Bob Muenchen of the University of Tennessee and I
have pointed out similarities and differences between Stata and R in our 530-page book, R for Stata Users (Springer, 2010). It is a book to help Stata users
learn R, and for R users to more easily learn Stata. The book is published in
hardback, paperback, and electronic formats.
I should acknowledge that I have used Stata for over a quarter of a century,
authoring the initial versions of several procedures now in commercial Stata
including the first logistic (1990) and glm (1993) commands. I also founded the
Stata Technical Bulletin in 1991, serving as its first editor. The STB was later enhanced to become the Stata Journal in 1999. I also used to teach S-Plus courses for
the manufacturer of the package in the late 1980s and early 1990s, traveling
to various sites in the United States and Canada for some 4 years. The S and
S-Plus communities have largely evolved to become R users during the past
decade to decade and a half. In addition, I programmed various macros in SAS and gave presentations at SUGI, and thus have a background in SAS as well.
However, since it has been a while since I have used SAS on a regular basis,
I invited Yang Liu, a professional SAS programmer and MS statistician, to translate the R code used for the examples in the text into SAS. He has provided
the reader with complete programming code, not just snippets of code that
one finds in many other texts. The SAS/Stat GENMOD Procedure and Proc
Logistic were the two most used SAS procedures for this project. Yang also
reviewed proof pages with me, checking for needed amendments.
The R data sets and user authored functions and scripts are available for
download and installation from the CRAN package, LOGIT. The LOGIT pack-
age will also have the data, functions, and scripts for both the first (2009) and
second (forthcoming 2016) edition of the author’s Logistic Regression Models
(Chapman & Hall/CRC). Data files in Stata, SAS, SPSS, Excel and csv format,
as well as Stata commands and ado/do files are located on the author’s web site:
http://works.bepress.com/joseph_hilbe/
as well as on the publisher's web site for the book:
http://www.crcpress.com/product/isbn/9781498709576
An Errata and Comments PDF as well as other resource material and “hand-
outs” related to logistic regression will also be available on my Bepress web site.
I wish to acknowledge the following colleagues for their input into the creation of this book: Rafael S. de Souza (astrophysicist, Eötvös Loránd University, Hungary) and Yang Liu (Baylor Scott & White Health). My collaborative work
Joseph M. Hilbe
Florence, AZ
Author
1 Statistical Models
Statistics: Statistics may generically be understood as the science of collecting and analyzing data for the purpose of classification, prediction, and of attempting to quantify and understand the uncertainty inherent in phenomena underlying data (Hilbe, 2014).
distribution function or PDF. The analyst does not usually observe the entire
range of data defined by the underlying PDF, called the population data, but
rather observes a random sample from the underlying data. If the sample of
data is truly representative of the population data, the sample data will be
described by the same PDF as the population data, and have the same values
of its parameters, which are initially unknown.
Parameters define the specific mean or location (shape) and perhaps scale
of the PDF that best describes the population data, as well as the distribution of
the random sample from the population. A statistical model is the relationship
between the parameters of the underlying PDF of the population data and the
estimates made by an analyst of those parameters.
Regression is one of the most common ways of estimating the true parameters in as unbiased a manner as possible. That is, regression is typically used
to establish an accurate model of the population data. Measurement error can
creep into the calculations at nearly every step, and the random sample we are
testing may not fully resemble the underlying population of data, nor its true
parameters. The regression modeling process is a method used to understand
and control the uncertainty inherent in estimating the true parameters of the
distribution describing the population data. This is important since the predic-
tions we make from a model are assumed to come from this population.
Finally, there are typically only a limited range of PDFs which analysts
use to describe the population data, from which the data we are analyzing is
assumed to be derived. If the variable we are modeling, called the response
term (y), is binary (0,1), then we will want to use a Bernoulli probability distri-
bution to describe the data. The Bernoulli distribution, as we discuss in more
detail in the next section, consists of a series of 1s and 0s. If the variable we
wish to model is continuous and appears normally distributed, then we assume
that it can be best modeled using a Gaussian (normal) distribution. This is a
pretty straightforward relationship. Other probability distributions commonly
used in modeling are the lognormal, binomial, exponential, Poisson, negative
binomial, gamma, inverse Gaussian, and beta PDFs. Mixtures of distributions
are also constructed to describe data. The lognormal, negative binomial, and
beta binomial distributions are such mixture distributions—but they are nev-
ertheless completely valid PDFs and have the same basic assumptions as do
other PDFs.
I should also mention that probability distributions do not all have the
same parameters. The Bernoulli, exponential, and Poisson distributions are
single parameter distributions, and models directly based on them are single
parameter models. That parameter is the mean or location parameter. The nor-
mal, lognormal, gamma, inverse Gaussian, beta, beta binomial, binomial, and
negative binomial distributions are two parameter models. The first four of
these are continuous distributions with mean (shape) and scale (variability)
parameters. The binomial, beta, and beta binomial distributions will be dis-
cussed later when discussing grouped logistic regression.
The catcher in this is that a probability distribution has various assump-
tions. If these assumptions are violated, the estimates we make of the param-
eters are biased, and may be incorrect. Statisticians have worked out a number
of adjustments for what may be called “violations of distributional assump-
tions,” which are important for an analyst to use when modeling data exhibit-
ing problems. I’ll mention these assumptions shortly, and we will address them
in more detail as we progress through the book.
I fully realize that the above description of a statistical model—of a para-
metric statistical model—is not the way we normally understand the modeling
process, and it may be a bit confusing. But it is in general the way statisticians
think of statistical modeling, and is the basis of the frequency-based tradition
of statistical modeling. Keep these relationships in mind as we describe logis-
tic regression.
In this monograph, I assume that the reader is familiar with the basics of
regression. However, I shall address the fundamentals of constructing, inter-
preting, fitting, and evaluating a logistic model in subsequent chapters. I shall
also describe how to predict fitted values from the estimated model. Logistic
regression is particularly valuable in that the predictions made from a fitted
model are probabilities, constrained to be within the range of values 0–1.
More accurately, a logistic regression model predicts the probability that the
response has a value of 1 given a specific set of predictor values. Interpretation
of logistic model coefficients usually involves their exponentiation, which
allows them to be understood as odds ratios. This capability is unique to the
class of logistic models, whether in observation-based or grouped format. The fact that a logistic model can be used to assess the odds ratio of
predictors, and also can be used to determine the probability of the response
occurring based on specific predictor values, called covariate patterns, is the
prime reason it has enjoyed such popularity in the statistical community for
the past several decades.
$$f(y;\,p) = \prod_{i=1}^{n} p_i^{y_i}\,(1 - p_i)^{1 - y_i} \qquad (1.1)$$
where the joint PDF is the product, Π, of each observation in the data being
modeled, symbolized by the subscript i. Usually the product term is dropped
as being understood since all joint probability functions are products across
the independent components of their respective distributions. We may then
characterize the Bernoulli distribution for a single observation as

$$f(y_i;\,p_i) = p_i^{y_i}\,(1 - p_i)^{1 - y_i}$$
where y is the response variable being modeled and p is the probability that y
has the value of 1. Again, 1 generally indicates a success, or that the event of
interest has occurred. y only has values of 1 or 0, whereas p has values ranging from 0 to 1.
Note that “ln” is a symbol for the natural log of an expression. It is also
symbolized as “log.” Keep in mind that it differs from “log to the base 10,”
or “log10.” The exponentiation of a logged value is the value itself; that is,
exp(ln(x)) = x, or eln(x) = x.
Statisticians usually take the log of both sides of the likelihood function,
creating what is called the log-likelihood function. Doing this allows a sum-
mation across observations rather than multiplication. This makes it much
easier for the algorithms used to estimate distribution parameters to converge;
that is, to solve for the estimates. The Bernoulli log-likelihood function can be
displayed as
$$L(p;\,y) = \sum_{i=1}^{n} \left[\, y_i \ln\!\left(\frac{p_i}{1 - p_i}\right) + \ln(1 - p_i) \right] \qquad (1.5)$$
Mean: μ = p
Variance: V(μ) = p(1 − p) = μ(1 − μ)
where y-hat, or ŷ, is the sum of the terms in the regression. The sum of regres-
sion terms is also referred to as the linear predictor, or xb. Each βx is a term
indicating the value of a predictor, x, and its coefficient, β. In linear regression,
which is based in matrix form on the Gaussian or normal probability distri-
bution, ŷ is the predicted value of the regression model as well as the linear
predictor. j indicates the number of predictors in a model. There is a linear
relationship between the predicted or fitted values of the model and the terms
on the right-hand side of Equation 1.6—the linear predictor: ŷ = xb. This is
not the case for logistic regression.
However, the fitted or predicted value of the logistic model is based on the
link function, log(μ/(1 − μ)). In order to establish a linear relationship of the
predicted value, μ, and the linear predictor, we have the following relationship:
$$\ln\!\left(\frac{\mu_i}{1 - \mu_i}\right) = x_i b = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \qquad (1.8)$$
$$\mu = \frac{\exp(xb)}{1 + \exp(xb)} = \frac{1}{1 + \exp(-xb)} \qquad (1.9)$$
The equations in (1.9) above are very important, and will be frequently
used in our later discussion. Once a logistic model is solved, we may calculate
the linear predictor, xb, and then apply either equation to determine the pre-
dicted value, μ, for each observation in the model.
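As a brief illustration of Equation 1.9 (a sketch only; fit is a hypothetical fitted glm object, not one taken from the text):

> xb <- predict(fit, type = "link")   # linear predictor, xb
> mu <- exp(xb)/(1 + exp(xb))         # Equation 1.9, first form
> mu2 <- 1/(1 + exp(-xb))             # Equation 1.9, second form; identical to mu

predict(fit, type = "response") returns the same probabilities directly.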
methods may be used as well, but some variety of MLE is used by nearly all
statistical software for logistic regression. There is a subset of MLE though
that can be used if the underlying model PDF is a member of the single param-
eter exponential family of distributions. The Bernoulli distribution is an expo-
nential family member. As such, logistic regression can also be done from
within the framework of generalized linear models or GLM. GLM allows for
a much simplified manner of calculating parameter estimates, and is used in
R with the glm function as the default method for logistic regression. It is a
function in the R stats package, which is a base R package. Stata also has a
glm command, providing the full range of GLM-based models, as well as the full maximum likelihood estimation commands logit and logistic. The SAS Genmod
procedure is a GLM-based procedure, and Proc Logistic is similar to Stata’s
logit and logistic commands. In Python, one may use the statsmodels Logit
function for logistic regression.
Since R’s default logistic regression is part of the glm function, we shall
examine the basics of how it works. The glm function uses an iterative re-
weighted least squares (IRLS) algorithm to estimate the predictor coefficients
of a logistic regression. The logic of a stand-alone R algorithm that can be
used for logistic regression is given in Table 1.1. It is based on IRLS. I have
annotated each line to assist in understanding how it works. You certainly do
not have to understand the code to continue with the book. I have provided the
code for those who are proficient in R programming. The code is adapted from
Hilbe and Robinson (2013).
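Table 1.1 itself is not reproduced in this extract. The following is my own minimal reconstruction of an IRLS estimator for binary logistic regression, written in the spirit of the book's irls_logit; the actual function in the LOGIT package may differ in its interface and returned values.

# Illustrative reconstruction only, not the book's Table 1.1 code.
irls_logit <- function(formula, data, tol = 1e-8, maxit = 25) {
  mf <- model.frame(formula, data)
  y  <- model.response(mf)                     # binary (0/1) response
  X  <- model.matrix(formula, data)            # design matrix with intercept
  beta <- rep(0, ncol(X))                      # starting values
  for (it in 1:maxit) {
    eta <- as.vector(X %*% beta)               # linear predictor
    mu  <- 1 / (1 + exp(-eta))                 # inverse logit link
    w   <- mu * (1 - mu)                       # IRLS weights (Bernoulli variance)
    z   <- eta + (y - mu) / w                  # working response
    beta_old <- beta
    beta <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta - beta_old)) < tol) break # convergence check
  }
  se <- sqrt(diag(solve(t(X) %*% (w * X))))    # SEs from the information matrix
  list(coef = setNames(beta, colnames(X)), se = setNames(se, colnames(X)))
}

A call along the lines of mylogit <- irls_logit(died ~ hmo + white, data = medpar) (the book's exact call is not shown in this extract) would then return coefficient and standard error vectors comparable to the $se output displayed below.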
R users can paste the code from the table into the “New Script” editor.
The code is an entire function titled irls_logit. The code is also available on
the book’s website, listed as irls_logit.r. Select the entire code, right click your
mouse, and click on “Run line or selection.” This places the code into active
memory. To show what a logistic regression model looks like, we can load
some data and execute the function. We shall use the medpar data set, which
is 1991 Arizona inpatient Medicare (U.S. senior citizen national health plan)
data. The data consist of cardiovascular disease patient information from a
single diagnostic group. For privacy purposes, I did not disclose the diagnostic
group to which the data are classified.
> library(LOGIT)
> data(medpar)
> head(medpar)
los hmo white died age80 type provnum
1 4 0 1 0 0 1 030001
2 9 1 1 0 0 1 030001
3 3 1 1 1 1 1 030001
4 9 0 1 0 0 1 030001
5 1 0 1 1 1 1 030001
6 4 0 1 1 0 1 030001
$se
X(Intercept) Xhmo Xwhite
0.1973903 0.1489251 0.2051795
Just typing the model name we assigned, mylogit, displays the coefficients
and standard errors of the model. We can make a table of estimates, standard
errors, z-statistic, p-value, and confidence intervals by using the code:
Running the same data using R’s glm function produces the following
output. I have deleted some ancillary output.
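The call and its summary output are not reproduced in this extract; judging from the confidence intervals below, the fit is presumably of the form:

> glmlogit <- glm(died ~ hmo + white, family = binomial, data = medpar)
> summary(glmlogit)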
> confint.default(glmlogit)
2.5 % 97.5 %
(Intercept) −1.31306417 −0.5393082
hmo −0.30413424 0.2796413
white −0.09875728 0.7055318
SAS CODE
/* Section 1.4 */
*Import medpar as a temporary dataset;
proc import datafile="c:\data\medpar.dta" out=medpar
   dbms=dta replace;
run;
STATA CODE
. use medpar
. glm died hmo white, fam(bin) nolog
2 Logistic Models: Single Predictor
2.1 MODELS WITH A BINARY PREDICTOR
The simplest way to begin understanding logistic regression is to apply it to a
single binary predictor. That is, the model we shall use will consist of a binary
(0,1) response variable, y, and a binary (0,1) predictor, x. In addition, the data
set we define will have 9 observations. Recall from linear regression that a
response and predictor are paired when setting up a regression. Using R we
assign various 1s and 0s to each y and x.
These values will be placed into a data set named xdta. Then we subject it to
the irls_logit function displayed in the previous chapter.
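The assignment itself is not shown in this extract. The x values below follow the listing given later in Section 2.2; the y values are one assignment consistent with the 2 × 2 table of y by x shown later in this section (the book's exact ordering may differ):

> y <- c(1, 0, 0, 1, 1, 1, 1, 0, 0)
> x <- c(0, 1, 1, 1, 0, 0, 1, 0, 1)
> xdta <- data.frame(y, x)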
The model name is logit1. Using the code to create the nice looking “standard”
regression output that was shown before, we have
Coefficients:
(Intercept) x
1.099 -1.504
Call:
glm(formula = y ~ x, family = binomial, data = xdta)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6651 -1.0108 0.7585 0.7585 1.3537
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.099 1.155 0.951 0.341
x -1.504 1.472 -1.022 0.307
> confint.default(logit2)
2.5 % 97.5 %
(Intercept) -1.164557 3.361782
x -4.389065 1.380910
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.099 1.155 0.951 0.341
x -1.504 1.472 -1.022 0.307
. .
Null deviance: 12.365 on 8 degrees of freedom
Residual deviance: 11.229 on 7 degrees of freedom
AIC: 15.229
There are a number of ancillary statistics which are associated with mod-
eling data with logistic regression. I will show how to do this as we prog-
ress, and functions and scripts for all logistic statistics, fit tests, graphics, and
tables are provided on the book's web site, as well as in the LOGIT package
that accompanies this book. The LOGIT package will also have the data,
functions and scripts for the second edition of Logistic Regression Models
(Hilbe, 2016).
For now we will focus on the meaning of the single binary predictor
model. The coefficient of predictor x is −1.504077. A coefficient is a slope. It is the rate of change in y for a one-unit change in x. When
x is binary, it is the amount of change in y when x moves from 0 to 1 in value.
But what is changed?
Recall that the linear predictor, xb, of a logistic model is defined as
log(μ/(1 − μ)). This expression is called the log-odds or logit. It is the logistic
link function, and is the basis for interpreting logistic model coefficients.
The interpretation of x is that when x changes from 0 to 1, the log-odds of
The odds ratio of x = 1 is the ratio of the odds of x = 1 to the odds of x = 0.
> table(y,x)
x
y 0 1
0 1 3
1 3 2
> addmargins(table(y,x))
x
y 0 1 Sum
0 1 3 4
1 3 2 5
Sum 4 5 9
The odds of x = 1 is defined as “the value of x = 1 when y = 1 divided by
the value of x = 1 when y = 0.” Here the odds of x = 1 is 2/3, or
Odds x = 1
> 2/3
[1] 0.6666667
Odds x = 0
> 3/1
[1] 3
That is…
To obtain the odds of x = 1: for x = 1, take the ratio of y = 1 to y = 0, or
2/3 = 0.666667.
To obtain the odds of x = 0: for x = 0, take the ratio of y = 1 to y = 0, or 3/1 = 3.
ln(Odds Ratio) = coefficient
exp(coefficient) = odds ratio
Calculating the odds ratio and odds-intercept from the logit2 model
results,
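(The output itself is not reproduced in this extract; exponentiating the logit2 coefficients gives the odds-intercept and the odds ratio of x directly.)

> exp(coef(logit2))   # (Intercept) = 3.000; x = 0.2222 = 0.6667/3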
Now we can reverse the relationships by taking the natural log of both.
2.2 PREDICTIONS, PROBABILITIES, AND ODDS RATIOS
I mentioned before that, unlike in linear regression, the linear predictor and the fitted values differ for logistic regression. If μ is understood as the pre-
dicted mean, or fitted value:
For the logistic model, μ is defined as the probability that y = 1, where y is
the symbol for the model response term.
> logit2 <- glm( y ~ x, family = binomial, data = xdta)
> coef(logit2)
(Intercept) x
1.098612 -1.504077
We use R’s post-glm function for calculating the linear predictor. The
code below generates linear predictor values for all observations in the model.
Remember that R has several ways that certain important statistics can be
obtained.
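(The code referred to is not shown in this extract; it is presumably a call such as the following, with xb an assumed name for the linear predictor.)

> xb <- predict(logit2)   # linear predictor, on the log-odds scale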
From the predicted probability that y = 1, or μ, the odds for each level of
x may be calculated.
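(Again the code is not reproduced here; a sketch consistent with the output that follows, where mu is the predicted probability and o the odds, is:)

> mu <- predict(logit2, type = "response")   # predicted probability that y = 1
> o <- mu/(1 - mu)                           # odds for each observation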
Let us now check the relationship of x to o, noting the values of o for the
two values of x.
> check_o <-data.frame(x,o)
> round(check_o, 3)
x o
1 0 3.000
2 1 0.667
3 1 0.667
4 1 0.667
5 0 3.000
6 0 3.000
7 1 0.667
8 0 3.000
9 1 0.667
Recall that the odds ratio of x is the ratio of x = 1/x = 0. The odds of the
intercept is the value of o when x = 0. In order to obtain the odds ratio of x
when x = 1, we divide 0.667/3. So that we do not have rounding problems with
the calculations, o = 0.667 will be indicated as o < 1. We will create a variable
called or that retains the odds-intercept value (x = 0) or 3.0 and selectively
changes each value of o < 1 to 0.667/3. The corresponding model coefficient
may be determined by logging each value of or.
> or <- o
> or[or< 1] <- (.6666667/3)
> coeff <- log(or)
What we find is that from the model linear predictor and probabilities we
calculated the model odds ratios and coefficients. Adding additional predictors
2.3.1 Standard Errors
Standard errors provide the analyst with information concerning the variabil-
ity of the coefficient. If a coefficient is an estimate of the true coefficient or
slope that exists within the underlying probability distribution describing the
data being analyzed, then the standard error tells us about the accuracy of
The diagonal elements are 1.3333 for the intercept and 2.16666 for predic-
tor x. These are the variances of the intercept and of x.
> diag(vcov(logit2))
(Intercept) x
1.333331 2.166664
Taking the square root of the variances gives us the model standard errors.
> sqrt(diag(vcov(logit2)))
(Intercept)           x
1.154700 1.471959
These values are identical to the standard errors shown in the logit2 results
table. Note that when using R’s glm function, the only feasible way to calculate
model standard errors is by use of the sqrt(diag(vcov(modelname)))
method. The modelname$se call made following the irls_logit function
from Table 1.1 cannot be used with glm.
Analysts many times make adjustments to model standard errors when they
suspect excess correlation in the data. Correlation can be derived from a variety
of sources. One of the earliest adjustments made to standard errors was called
scaling. R’s glm function provides built in scaling of binomial and Poisson
regression standard errors through the use of the quasibinomial and quasipois-
son options. Scaled standard errors are produced as the product of the model
standard errors and square root of the Pearson dispersion statistic. Coefficients
are left unchanged. Scaling is discussed in detail in Chapter 3, Section 3.4.1.
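The call producing the scaled output below is not reproduced in this extract; it is presumably the quasibinomial refit of the single-predictor model:

> logitsc <- glm(y ~ x, family = quasibinomial, data = xdta)
> summary(logitsc)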
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.099 1.309 0.839 0.429
x -1.504 1.669 -0.901 0.397
I will explain more about the Pearson statistic, the Pearson dispersion,
scaling, and other ways of adjusting standard errors when we discuss model
fit. However, the scaled standard error for x in the above model logitsc is
calculated from model logit2 by
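(The code itself is omitted from this extract; the following is a sketch of the calculation.)

> pr <- residuals(logit2, type = "pearson")     # Pearson residuals
> disp <- sum(pr^2)/logit2$df.residual          # Pearson dispersion
> sqrt(diag(vcov(logit2)))["x"] * sqrt(disp)    # scaled SE for x; about 1.669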
Standard errors of odds ratios are calculated by multiplying the odds ratio
by the coefficient standard error. Starting from the logit2 model, odds ratios
and their corresponding standard errors may be calculated by:
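(The calculation is not reproduced in this extract; the following sketch uses ose as an assumed variable name, matching the SAS code later in the chapter.)

> or <- exp(coef(logit2))                # odds ratios
> ose <- or * sqrt(diag(vcov(logit2)))   # delta-method SEs of the odds ratios
> data.frame(or, ose)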
2.3.2 z Statistics
The z statistic is the ratio of a coefficient to its standard error.
The reason this statistic is called z is that it is assumed to be normally distributed. For linear regression models, we use the t statistic instead.
The z statistic for odds ratio models is identical to that of standard coefficient
models. Large values of z typically indicate a predictor that significantly con-
tributes to the model; that is, to the understanding of the response.
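For the logit2 model, a sketch of the calculation (the same pattern is used for the multiple-predictor models of Chapter 3) is:

> coef <- coef(logit2)
> se <- sqrt(diag(vcov(logit2)))
> zscore <- coef/se
> zscore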
2.3.3 p-Values
The p-value of a logistic model is usually misinterpreted. It is also typically
given more credence than it should have. First, though, let us look at how it is
calculated.
The p-value is a two-tail test of the z statistic. It tests the null hypothesis
that the associated coefficient value is 0. More exactly, p is the probability of
obtaining a coefficient value at least as extreme as the observed coefficient
given the assumption that β = 0. The smaller the p-value, the more likely β ≠ 0.
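Continuing the sketch above, the two-tail p-value is obtained from the z statistic as:

> pvalue <- 2 * pnorm(abs(zscore), lower.tail = FALSE)
> pvalue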
The standard “level of significance” for most studies is p = 0.05. Values of less than 0.05 indicate that the null hypothesis of no relationship between the predictor and response is rejected. That is, p-values less than 0.05 indicate that the predictor significantly contributes to the model. Values greater than 0.05 indicate that the null hypothesis has not been rejected and that the predictor does not significantly contribute to the model.
A cutoff of 0.05 means that, on average, one out of every 20 times we reject the null hypothesis we will have done so mistakenly; that is, the coefficient will in fact not be significant when we thought it was. For many scientific disciplines,
where qnorm(0.975) is the quantile of the normal distribution beyond which 2.5% of the observations lie in each tail.
> qnorm(.975)
[1] 1.959964
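Continuing the same sketch, 95% confidence intervals of the coefficients, and of the odds ratios after exponentiation, follow the pattern used later in the book:

> loci <- coef - qnorm(.975) * se   # lower 95% bound, coefficient scale
> upci <- coef + qnorm(.975) * se   # upper 95% bound, coefficient scale
> exp(cbind(loci, upci))            # bounds on the odds-ratio scale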
> toOR(logit2)
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.716364 0.218040 -3.285 0.00102 **
white 0.305238 0.208926 1.461 0.14402
los -0.037226 0.007797 -4.775 1.80e-06 ***
factor(type)2 0.416257 0.144034 2.890 0.00385 **
factor(type)3 0.929994 0.228411 4.072 4.67e-05 ***
> toOR(smlogit)
or delta zscore pvalue exp.loci. exp.upci.
(Intercept) 0.4885 0.1065 -3.2855 0.0010 0.3186 0.7490
white 1.3569 0.2835 1.4610 0.1440 0.9010 2.0436
los 0.9635 0.0075 -4.7747 0.0000 0.9488 0.9783
factor(type)2 1.5163 0.2184 2.8900 0.0039 1.1433 2.0109
factor(type)3 2.5345 0.5789 4.0716 0.0000 1.6198 3.9657
I earlier mentioned that the use of confint() following R’s glm displays
profile confidence intervals. confint.default() produces standard confidence
intervals, based on the normal distribution. Profile confidence intervals are
based on the Chi2 distribution. Profile confidence intervals are particularly
important to use when there are relatively few observations in the model, as
well as when the data are unbalanced. For example, if a logistic model has 30
observations, but the response variable consists of 26 ones and only 4 zeros,
the data are unbalanced. Ideally a logistic response variable should have rela-
tively equal numbers of 1s to 0s. Likewise, if a binary predictor has nearly all
1s or 0s, the model is unbalanced, and adjustments may need to be made to
the model.
In any case, profile confidence intervals are derived as the inverse of the
likelihood ratio test defined as
Stata’s pllf command produces profile confidence intervals, but only for con-
tinuous predictors.
Scaled, sandwich or robust, and bootstrapped-based confidence inter-
vals will be discussed in Chapter 4, and compared with profile confidence
intervals. We shall discuss which should be used given a particular type of
data.
2.4 MODELS WITH A CATEGORICAL PREDICTOR
For our discussion of a logistic model with a single categorical predictor I shall
return to the medpar data described in Chapter 1. I provided an introductory
logistic model of died on white and hmo, which are all binary variables. Type,
on the other hand, is a categorical variable with three levels. As indicated ear-
lier, type = 1 signifies a patient who electively chose to be admitted to a hospi-
tal, type = 2 is used for patients who were admitted to the hospital as “urgent,”
and type = 3 is reserved for those patients who were admitted as emergency.
provnum is a string variable designating the hospital provider number of the
patients whose data are given in the respective lines or observations. I will use
only died (1 = died while hospitalized) and type in this section.
> library(LOGIT)
> data(medpar)
> head(medpar)
los hmo white died age80 type provnum
1 4 0 1 0 0 1 030001
2 9 1 1 0 0 1 030001
3 3 1 1 1 1 1 030001
4 9 0 1 0 0 1 030001
5 1 0 1 1 1 1 030001
6 4 0 1 1 0 1 030001
> table(medpar$type)
1 2 3
1134 265 96
1 2 3
0.75852843 0.17725753 0.06421405
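The model call itself is not reproduced in this extract; from the output below it is presumably:

> logit3 <- glm(died ~ factor(type), family = binomial, data = medpar)
> summary(logit3)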
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.74924 0.06361 -11.779 < 2e-16 ***
factor(type)2 0.31222 0.14097 2.215 0.02677 *
factor(type)3 0.62407 0.21419 2.914 0.00357 **
—-
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1911.1 on 1492 degrees of freedom
AIC: 1917.1
Note how the factor function excluded factor type1 (elective) from the
output. It is the reference level though and is used to interpret both type2
(urgent) and type3 (emergency). I shall exponentiate the coefficients of type2
and type3 in order to better interpret the model. Both will be interpreted as
odds ratios, with the denominator of the ratio being the reference level.
> exp(coef(logit3))
(Intercept) factor(type)2 factor(type)3
0.4727273 1.3664596 1.8665158
The interpretation is that urgent patients have about 37% greater odds of dying in the hospital than elective patients, and emergency patients have about 87% greater odds of dying than elective patients.
Analysts many times find that they must change the reference levels of
a categorical predictor. This may be done with the following code. We will
change from the default reference level 1 to a reference level 3 using the relevel
function.
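(The code is not shown in this extract; one way to do this, with logit4 as an assumed model name, is sketched below.)

> medpar$type3 <- relevel(factor(medpar$type), ref = "3")   # emergency becomes the reference
> logit4 <- glm(died ~ type3, family = binomial, data = medpar)
> exp(coef(logit4))   # odds ratios relative to emergency admissions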
• Elective patients have about half the odds of dying in the hospital compared with emergency patients.
• Urgent patients have about three quarters the odds of dying in the hospital compared with emergency patients.
> table(medpar$type)
1 2 3
1134 265 96
1 2
1134 361
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.74924 0.06361 -11.779 < 2e-16 ***
factor(type)2 0.39660 0.12440 3.188 0.00143 **
acronym for Length of Stay, referring to nights in the hospital. los ranges from
1 to 116. A cubic spline is used to smooth the shape of the distribution of los.
This is accomplished by using the S operator.
> summary(medpar$los)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 8.000 9.854 13.000 116.000
> library(mgcv)
> diedgam <- gam(died ~ s(los), family = binomial, data = medpar)
> summary(diedgam)
. . .
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.69195 0.05733 -12.07 <2e-16 ***
Note that no other predictors are in this model. Adding others may well
alter the shape of the splines. The edf statistic indicates the “effective degrees
of freedom.” It is a value that determines the shape of the curves. An edf of 1
indicates a straight line; 8 and higher is a highly curved shape. The graph has
an edf of 7.424, which is rather high. See Zuur (2012) for a complete analysis
of GAM using R.
If this were all the data I had to work with, then based on the change-of-slope points in Figure 2.1 I would be tempted to factor los into four intervals, with cut points at 10, 52, and 90. Each of the four levels would be part of a cat-
egorical predictor with the lowest level as the reference. If the slopes differ
considerably across levels, we should use it for modeling the effect of los rather
than model the continuous predictor.
2.5.3 Centering
A continuous predictor whose lowest value is not close to 0 should likely be
centered. For example, we use the badhealth data from the COUNT package.
> data(badhealth)
> head(badhealth)
[Figure 2.1: GAM cubic spline smooth, s(los, 7.42), plotted against los.]
badh is a binary variable, and indicates that a patient has “bad health,” what-
ever that may mean. numvisit, or number of visits to the doctor during the year
1984, and age, are continuous variables. Number of visits ranges from 0 to 40,
and the age range of patients is from 20 to 60.
> table(badhealth$badh)
0 1
1015 112
> summary(badhealth$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.00 28.00 35.00 37.23 46.00 60.00
> summary(badhealth$numvisit)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 2.353 3.000 40.000
Centering: xi − mean(xi)
> cage <- badhealth$age - mean(badhealth$age)
> summary(cage)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-17.230 -9.229 -2.229 0.000 8.771 22.770
Comparing the coefficients for models with age and centered age (cage):
> bad1 <- glm(badh ~ age, family = binomial, data = badhealth)
> bad2 <- glm(badh ~ cage, family = binomial, data = badhealth)
> badtab <- data.frame(bad1$coefficients, bad2$coefficients)
> badtab
bad1.coefficients bad2.coefficients
(Intercept) -4.58866278 -2.37171785
age 0.05954899 0.05954899
2.5.4 Standardization
Standardization of continuous predictors is important when other continuous
predictors in your model are recorded on entirely different scales. The way this
is done is by dividing the centered variable by the variable standard deviation.
Use of R’s scale function makes this easy:
The standard, centered, and standardized coefficient values for the badhealth data may be summarized in the following table. The intercept changes when a predictor such as age is centered. When a predictor is standardized, both the intercept and predictor coefficients are changed with respect to a standard model. Note that the intercept is the same whether the predictor is centered or standardized.
> badtab2 <- data.frame(bad1$coefficients,
bad2$coefficients, bad3$coefficients)
> badtab2
bad1.coefficients bad2.coefficients bad3.coefficients
(Intercept) -4.58866278 -2.37171785 -2.3717178
age 0.05954899 0.05954899 0.6448512
2.6 PREDICTION
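The first example in this section models died on the single binary predictor white. The call producing the output below is not reproduced in this extract; it is presumably:

> logit7 <- glm(died ~ white, family = binomial, data = medpar)
> summary(logit7)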
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9273 0.1969 -4.710 2.48e-06 ***
white 0.3025 0.2049 1.476 0.14
—-
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1920.6 on 1493 degrees of freedom
AIC: 1924.6
> exp(coef(logit7))
(Intercept) white
0.3956044 1.3532548
White patients have about 35% greater odds of death while hospitalized than do nonwhite patients.
LINEAR PREDICTOR
> etab <- predict(logit7)
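(The code producing fitb is missing from this extract; it is presumably the fitted probabilities from logit7.)

> fitb <- predict(logit7, type = "response")   # predicted probability of death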
TABULATION OF PROBABILITIES
> table(fitb)
fitb
0.283464566929547 0.348684210526331
127 1368
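The next model regresses died on the continuous predictor los. Its call is not reproduced in this extract; it is presumably:

> logit8 <- glm(died ~ los, family = binomial, data = medpar)
> summary(logit8)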
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.361707 0.088436 -4.090 4.31e-05 ***
los -0.030483 0.007691 -3.964 7.38e-05 ***
> exp(coef(logit8))
(Intercept) los
0.6964864 0.9699768
The predicted values of died given los range from 0.02 to 0.40.
If we wish to determine the probability of death while hospitalized for a
patient who has stayed in the hospital for 20 days, multiply the coefficient on
los by 20, add the intercept to obtain the linear predictor for los at 20 days.
Apply the inverse logit link to obtain the predicted probability.
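A sketch of this calculation in R, using the coefficients shown above, is:

> xb20 <- -0.361707 + (-0.030483 * 20)   # linear predictor at los = 20
> 1/(1 + exp(-xb20))                     # inverse logit link; about 0.275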
The probability is 0.275. A patient who stays in the hospital for 20 days has about a 27% probability of dying while hospitalized—given the specific disease group represented in these data.
determining the confidence interval, which means that 0.025 is taken from
each tail of the distribution. In terms of the normal distribution, we see that
We may use the inverse logistic link function, exp(xb)/(1 + exp(xb)) or 1/(1 + exp(−xb)), to convert the above three statistics to the probability scale. It is easier, however, to simply use the model's linkinv function. A summary of each is displayed based on the following code.
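(The code itself is not reproduced in this extract; the sketch below mirrors the Stata code given at the end of the chapter.)

> pred <- predict(logit8, se.fit = TRUE)    # linear predictor and its standard error
> eta <- pred$fit
> lo <- eta - qnorm(.975) * pred$se.fit     # lower bound, link scale
> up <- eta + qnorm(.975) * pred$se.fit     # upper bound, link scale
> linkinv <- logit8$family$linkinv          # inverse logit link function
> mu <- linkinv(eta)
> loci <- linkinv(lo)
> upci <- linkinv(up)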
> summary(loci)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.004015 0.293000 0.328700 0.312900 0.350900 0.364900
> summary(mu)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01988 0.31910 0.35310 0.34310 0.38140 0.40320
> summary(upci)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09265 0.34640 0.37820 0.37540 0.41280 0.44260
The mean of the lower 95% confidence interval is 0.313, the mean of μ is
0.343, and the mean of the upper confidence interval is 0.375. A simple R plot
of the predicted probability of death for days in the hospital for patients in this
data is displayed as (Figure 2.2):
> layout(1)
> plot(medpar$los, mu, col = 1)
> lines(medpar$los, loci, col = 2, type = "p")
> lines(medpar$los, upci, col = 3, type = "p")
We next discuss logistic models with more than one predictor. These are
the types of models that are in fact employed in real-life studies and projects.
Understanding single predictor models, however, provides a solid basis for
understanding more complex models.
[Figure 2.2: Predicted probability of death (mu), with lower and upper 95% confidence bounds, plotted against medpar$los.]
SAS CODE
/* Section 2.1 */
*Generate a table of y by x;
proc freq data=xdta;
tables y*x / norow nocol nocum nopercent;
run;
/* Section 2.2 */
/* Section 2.3 */
se=sqrt(diag(vcov));
print se;
quit;
zscore=coef/se;
delta=ose;
z=zscore[,+];
pvalue=2*(1-probnorm((abs(z))));
print z pvalue;
se1=se[,+];
loci=coef-quantile('normal', 0.975)*se1;
upci=coef+quantile('normal', 0.975)*se1;
expl=exp(loci);
expu=exp(upci);
print or [format=7.4] delta [format=7.4] z [format=7.4]
pvalue [format=7.4] expl [format=7.4] expu [format=7.4];
quit;
/* Section 2.4 */
*Refer to the code in section 1.4 to import and print medpar dataset;
/* Section 2.5 */
/* Section 2.6 */
STATA CODE
2.1
. use xdta
. list
. glm y x, fam(bin) nolog
. table y x
. tab y x
. glm y x, fam(bin) eform nolog nohead
2.2
. glm y x, fam(bin) nolog nohead
. di 1.098612 - 1.504077*1
. di 1.098612 - 1.504077*0
. predict xb, xb
. predict mu
. gen o = mu/(1-mu)
. gen or = .6666667/3 if o < 1
. replace or = o if or==.
. gen coef = log(or)
. l y x xb mu o or coef
2.3
. glm y x, fam(bin) nolog nohead
. estat vce
. glm y x, fam(bin) nolog nohead scale(x2)
. glm y x, fam(bin) nolog nohead eform
. di normal(-abs(_b[x]/_se[x]))*2 // p-value for x
. di normal(-abs(_b[_cons]/_se[_cons]))*2 // p-value for intercept
. use medpar, clear
. glm died white los i.type, fam(bin) nolog nohead
. glm died white los i.type, fam(bin) nolog nohead eform
2.4
. use medpar, clear
. list in 1/6
. tab type
2.5
. use badhealth, clear
. list in 1/6
. tab badh
. summarize age
. summarize numvisit
. egen meanage = mean(age)
. gen cage = age - meanage
. * or: center age, pre(c)
. glm badh cage, fam(bin) nolog nohead
. center age, pre(s) stand
. glm badh sage, fam(bin) nolog nohead
2.6
. glm died white, fam(bin) nolog nohead
. glm died white, fam(bin) nolog nohead eform
. predict etab, xb
. predict fitb, mu
. tab fitb
. glm died los, fam(bin)
. glm died los, fam(bin) eform
. predict etac, xb
. predict fitc
. summarize fitc
. use medpar
. glm died los, family(bin) nolog
. predict eta, xb // linear predictor; eta
. predict se_eta, stdp // standard error of the prediction
. gen mu = exp(eta)/(1 + exp(eta)) // or: predict mu
. gen low = eta - invnormal(0.975) * se_eta
. gen up = eta + invnormal(0.975) * se_eta
. gen lci = exp(low)/(1 + exp(low))
. gen uci = exp(up)/(1 + exp(up))
. sum lci mu uci
. scatter mu lci uci los
3 Logistic Models: Multiple Predictors
3.1 SELECTION AND INTERPRETATION OF PREDICTORS
The logic of modeling data with logistic regression changes very little when
more predictors are added to a model. The basic logistic regression formula we
displayed becomes more meaningful when there is more than one predictor in
a model. Equation 3.1 below expresses the relationship of each predictor to the
predicted linear predictor, xi′β, or ηi. It is more accurate to symbolize the predicted linear predictor as η̂i or as xi′β̂, but we shall not employ the hat symbol on η or β for ease of interpretation, as we have done for the predicted probability, μ̂. We shall remember from the context that the expression is predicted
or estimated, and not simply given as raw data.
$$\ln\!\left(\frac{\mu_i}{1 - \mu_i}\right) = \eta_i = x_i b = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \qquad (3.1)$$
the other terms in the model are held constant. When the logistic regression
term is exponentiated, interpretation is given in terms of an odds ratio, rather
than log-odds. We can see this in Equation 3.2 below, which results from exponentiating each side of Equation 3.1.
$$\frac{\mu_i}{1 - \mu_i} = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \qquad (3.2)$$
or
$$\frac{\mu_i}{1 - \mu_i} = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \qquad (3.3)$$
An example will help clarify what is meant when interpreting a logistic
regression model. Let’s use data from the social sciences regarding the rela-
tionship of whether a person identifies themselves as religious. Our main inter-
est will be in assessing how level of education affects religiosity. We’ll also
adjust by gender (male), age, and whether the person in the study has children
(kids). There are 601 subjects in the study, so there is no concern about sample
size. The data are in the edrelig data set.
A study subject’s level of education is a categorical variable with three
fairly equal-sized levels: AA, BA, and MA/PhD. All subjects have achieved at
least an associate’s degree at a 2-year institution. A tabulation of the educlevel
predictor is shown below, together with the top six values of all variables in
the data.
> data(edrelig)
> head(edrelig)
male age kids educlevel religious
1 1 37 0 MA/PhD 0
2 0 27 0 AA 1
3 1 27 0 MA/PhD 0
4 0 32 1 AA 0
5 0 27 1 BA 0
6 1 57 1 MA/PhD 1
> table(edrelig$educlevel)
AA BA MA/PhD
205 204 192
Male and kids are both binary predictors, having values of 0 and 1. A value of 1 (almost always) indicates that the condition named by the predictor holds. For instance,
the binary predictor male is 1 = male and 0 = female. Kids = 1 if the subject
has children, and 0 if they have no children. Age is a categorical variable whose levels are 5-year age groups; the range is from 17 to 57. I will interpret age, however, as a continuous predictor, with each ascending age value representing a 5-year period.
We model the data as before, but simply add more predictors in the model.
The categorical educlevel predictor is factored into its three levels, with the
lowest level, AA, as the reference. It is not displayed in model output.
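The call producing the output below is not reproduced in this extract; it is presumably:

> ed1 <- glm(religious ~ age + male + kids + factor(educlevel),
     family = binomial, data = edrelig)
> summary(ed1)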
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) −1.43522 0.32996 −4.350 1.36e-05 ***
age 0.03983 0.01036 3.845 0.000121 ***
male 0.18997 0.18572 1.023 0.306381
kids 0.12393 0.21037 0.589 0.555790
factor(educlevel)BA −0.47231 0.20822 −2.268 0.023313 *
factor(educlevel)MA/PhD −0.49543 0.22621 −2.190 0.028513 *
---
Null deviance: 822.21 on 600 degrees of freedom
Residual deviance: 792.84 on 595 degrees of freedom
AIC: 804.84
Or we can view the entire table of odds ratio estimates and associated
statistics using the code developed in the previous chapter.
> coef <- ed1$coef
> se <- sqrt(diag(vcov(ed1)))
> zscore <- coef / se
> or <- exp(coef)
> delta <- or * se
> pvalue <- 2*pnorm(abs(zscore),lower.tail=FALSE)
> loci <- coef - qnorm(.975) * se
> upci <- coef + qnorm(.975) * se
> ortab <- data.frame(or, delta, zscore, pvalue, exp(loci), exp(upci))
> round(ortab, 4)
or delta zscore pvalue exp.loci. exp.upci.
(Intercept) 0.2381 0.0786 -4.3497 0.0000 0.1247 0.4545
age 1.0406 0.0108 3.8449 0.0001 1.0197 1.0620
male 1.2092 0.2246 1.0228 0.3064 0.8403 1.7402
kids 1.1319 0.2381 0.5891 0.5558 0.7495 1.7096
factor(educlevel)BA 0.6236 0.1298 -2.2683 0.0233 0.4146 0.9378
factor(educlevel)MA/PhD 0.6093 0.1378 -2.1902 0.0285 0.3911 0.9493
Since we are including more than a single predictor in this model, it’s wise
to check additional model statistics. Interpretation gives us the following, with
the understanding that the values of the other predictors in the model are held
constant.
> 1/.6235619
[1] 1.60369
> 1/.6093109
[1] 1.641198
such an option, providing substantially more model statistics than are provided
by simply using the glm function when modeling a logistic regression. In this
section, we define and discuss the various GLM statistics that are provided in
the R’s summary function. Stata, SAS, SPSS, Limdep, and other GLM soft-
ware generally provide the same statistics.
I shall display the ed1 model we just estimated, removing the coefficient
table from our view. We are only interested here in the ancillary model statis-
tics that can be used for evaluating model fit.
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6877 -1.0359 -0.8467 1.2388 1.6452
. . .
Null deviance: 822.21 on 600 degrees of freedom
Residual deviance: 792.84 on 595 degrees of freedom
AIC: 804.84
$$L(\mu_i;\,y_i) = \sum_{i=1}^{n} \left[\, y_i \ln\!\left(\frac{\mu_i}{1 - \mu_i}\right) + \ln(1 - \mu_i) \right] \qquad (3.4)$$
observation in the model. This means that a y replaces every μ in the log-
likelihood function.
Logistic Model Pearson Chi2 GOF Statistic (based on the Bernoulli distribution):
$$\sum_{i=1}^{n} \frac{(y_i - \mu_i)^2}{\mu_i(1 - \mu_i)} \qquad (3.8)$$
The degrees of freedom for the Pearson statistic are the same as for the
deviance. For count models, the dispersion statistic is defined as the Pearson
Chi2 statistic divided by the residual dof. Values greater than 1 indicate pos-
sible overdispersion. The same is the case with grouped logistic models—a
topic we shall discuss in Chapter 5. The deviance dispersion can also be used
for binomial models—again a subject to which we shall later return.
I mentioned earlier that raw residuals are defined as “y − μ.” All other
residuals are adjustments to this basic residual. The Pearson residual, for
example, is defined as:
Pearson Residual:
$$\frac{y - \mu}{\sqrt{\mu(1 - \mu)}} \qquad (3.9)$$
$$\text{Pearson Chi2 statistic} = \sum_{i=1}^{n} \left[ \frac{y_i - \mu_i}{\sqrt{\mu_i(1 - \mu_i)}} \right]^2 \qquad (3.10)$$
Unfortunately neither the Pearson Chi2 statistic nor the Pearson disper-
sion is directly available from R. Strangely though, the Pearson dispersion is
used to generate what are called quasibinomial models; that is, logistic models
with too much or too little correlation in the data. See Hilbe (2009) and Hilbe
and Robinson (2013) for a detailed discussion of this topic.
I created a function that calculates the Pearson Chi2 and dispersion fol-
lowing glm estimation. Called P__disp (double underscore), it is a function
in the COUNT and LOGIT packages. If the name of the model of concern is
mymodel, type P__disp(mymodel) on the command line.
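P__disp itself belongs to the COUNT and LOGIT packages; an equivalent hand calculation (a sketch, not the package code) is:

> pr <- residuals(mymodel, type = "pearson")   # Pearson residuals
> pchi2 <- sum(pr^2)                           # Pearson Chi2 statistic
> disp <- pchi2/mymodel$df.residual            # Pearson dispersion
> c(pchi2 = pchi2, dispersion = disp)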
Deviance residuals are calculated on the basis of the deviance statistic
defined above. For binary logistic regression, deviance residuals take the form of
If y = 1,

$$\operatorname{sign}(y - \mu)\,\sqrt{2 \ln\!\left(\frac{1}{\mu}\right)} \qquad (3.11)$$

If y = 0,

$$\operatorname{sign}(y - \mu)\,\sqrt{2 \ln\!\left(\frac{1}{1 - \mu}\right)} \qquad (3.12)$$
Using R’s built-in deviance residual option for glm models, we may calcu-
late a summary of the values as,
Note the closeness of the residual values to what is displayed in the ed1
model output at the beginning of this section.
I should also mention that the above output for Pearson residuals informs us
that the dispersion parameter for the model is 1 (1.008705). The logistic model is
based on the Bernoulli distribution with only a mean parameter. There is no scale
parameter for the Bernoulli distribution. The same is the case for the Poisson
count model. In such a case the software reports that the value is 1, which means
that it cannot affect the other model statistics or the mean parameter. It is sta-
tistically preferred to use the term scale in this context than it is dispersion, for
reasons that go beyond this text. See Hilbe (2011) or Hilbe (2014) for details.
The glm function fails to display or save the log-likelihood function,
although it is used in the calculation of other saved statistics. By back-coding
other statistics an analyst can calculate a statistic such as the log-likelihood
which is given at the start of this section. For the ed1 model,
Log-likelihood:
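(The calculation itself is not reproduced in this extract. For a Bernoulli logistic model the saturated log-likelihood is zero, so the log-likelihood can be recovered directly from the residual deviance; a sketch:)

> ll <- -ed1$deviance/2   # residual deviance = -2 * log-likelihood here
> ll                      # about -396.42, given the deviance of 792.84

The logLik(ed1) extractor returns the same value.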
R’s glm function saves other statistics as well. To identify statistics that
can be used following an R function, type ?glm or ? followed by the function
name, and a help file will appear with information about model used and saved
values. All CRAN functions should have a help document associated with the
function, but packages and functions that are not part of the CRAN family
have no such requirements. Nearly all Stata commands or functions have asso-
ciated help. For general help on glm type, “help glm.”
$$\text{AIC} = -2L + 2k \quad \text{or} \quad -2(L - k) \qquad (3.14)$$

or

$$\text{AIC} = \frac{-2L + 2k}{n} \quad \text{or} \quad \frac{-2(L - k)}{n} \qquad (3.15)$$
> data(medpar)
> summary(mymod <- glm(died ~ white + hmo + los + factor(type),
+ family = binomial,
+ data = medpar))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.720149 0.219073 -3.287 0.00101 **
white 0.303663 0.209120 1.452 0.14647
Using R and the values of the log-likelihood and the number of predictors,
we may calculate the AIC as:
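(The code is not reproduced in this extract; a sketch is:)

> ll <- as.numeric(logLik(mymod))   # model log-likelihood
> k <- length(coef(mymod))          # number of estimated parameters, including the intercept
> aic <- -2 * ll + 2 * k            # Equation 3.14
> aicn <- aic/nobs(mymod)           # AIC divided by n, Equation 3.15
> c(AIC = aic, AICn = aicn)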
This is the same value that is displayed in the glm output. It should be
noted that of all the information criteria that have been formulated, this ver-
sion of the AIC is the only one that does not adjust the log-likelihood by n, the
number of observations in the model. All others adjust by some variation of
number of predictors and observations. If the AIC is used to compare models,
where n is different (which normally should not be the case), then the comparison will be misleading. Using the version of AIC where the statistic is divided by n is
then preferable—and similar to that of other criterion tests. The AIC statistic,
captured from the postestimation statistics following the execution of glm, is
displayed below, as is AICn. These statistics are also part of the modelfit func-
tion described below.
$AICn
[1] 1.266322
$BIC
[1] 1925.01
$BICqh
[1] 1.272677
The BIC statistic is given as 1925.01, which is the same value displayed in
the Stata estat ic post-estimation command. Keep in mind that a size comparison between the AIC and BIC statistics is not statistically meaningful.
$$ \text{AICH} = -2L + \frac{4(p^2 - pk - 2p)(p + k + 1)(p + k + 2)}{n - p - k - 2} \tag{3.18} $$
Next we create Pearson dispersion statistics and multiply their square root
by se above.
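A minimal sketch of that step, where se is assumed to be the vector of model standard errors referred to above, and sdse is an illustrative name for the scaled result:

> pr <- sum(residuals(mymod, type = "pearson")^2)   # Pearson Chi2
> disp <- pr/mymod$df.residual                      # Pearson dispersion
> sdse <- se * sqrt(disp)                           # scaled standard errors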
. . .
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.720149 0.221302 -3.254 0.00116 **
white 0.303663 0.211248 1.437 0.15079
hmo 0.027204 0.152781 0.178 0.85870
los -0.037193 0.007878 -4.721 2.56e-06 ***
factor(type)2 0.417873 0.145786 2.866 0.00421 **
factor(type)3 0.933819 0.231746 4.029 5.87e-05 ***
The robust standard errors are stored in rse. We’ll add those to the table of
standard errors we have been expanding.
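A sketch of how rse might have been obtained, using the sandwich package in the same way it is used in Chapter 5 (an assumption, not necessarily the author's original code):

> library(sandwich)
> rse <- sqrt(diag(vcovHC(mymod, type = "HC0")))   # robust (sandwich) SEs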
3.4.3 Bootstrapping
Bootstrapping is an entire area of statistics in itself. Here we are discussing
bootstrapped standard errors. Statisticians have devised a number of ways to bootstrap. I shall develop a function that will bootstrap the model standard errors. I set the number of bootstrap replications at 100, but it could have been set higher for perhaps a bit more accuracy.
> library(boot)
> bootmod <- glm(died ~ white + hmo + los + factor(type),
                 family = binomial, data = medpar)
> t <- function(x, i) {                      # statistic function for boot()
    xx <- x[i, ]                             # the resampled data
    bsglm <- glm(died ~ white + hmo + los + factor(type),
                 family = binomial, data = xx)
    return(sqrt(diag(vcov(bsglm))))          # SEs from the resampled fit
  }
> bse <- boot(medpar, t, R = 100)
> sqrt(diag(vcov(bootmod)))                  # model-based SEs, for comparison
> bootse <- apply(bse$t, 2, mean)            # bootstrapped SEs
The bootstrapped standard errors are in the vector, bootse. We’ll attach
them to the table of standard errors which we keep expanding as we add more
types of adjustments.
An interaction is needed when the effect of one predictor varies in a manner based on the levels of another predictor. Suppose that the response
term is death and we have predictors white and los. These are variables in
the medpar data. If we believe that the probability of death based on length
of stay in the hospital varies by racial classification, then we need to incor-
porate an interaction term of white × los into the model. The main effects
only model is:
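A sketch of the main-effects-only model whose output follows (assuming the medpar data are loaded, as above):

> summary(glm(died ~ white + los, family = binomial, data = medpar))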
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.598683 0.213268 -2.807 0.005 **
white 0.252681 0.206552 1.223 0.221
los -0.029987 0.007704 -3.893 9.92e-05 ***
Note that los is significant, but white is not. Let’s create an interaction of
white and los called wxl. We insert it into the model, making sure to include
the main effects terms as well.
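A sketch of the interaction model just described (imod is an illustrative object name, not the author's):

> medpar$wxl <- medpar$white * medpar$los
> imod <- glm(died ~ white + los + wxl, family = binomial, data = medpar)
> summary(imod)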
That is, we add the slope of the binary predictor to the product of the slope
of the interaction and the value(s) of the continuous predictor, exponentiating
the whole.
# Odds ratios of death for a white patient for length of stay 1–40 days.
# Note that odds of death decreases with length of stay.
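A minimal sketch of that calculation, using the hypothetical imod object from the sketch above:

> or_white <- exp(coef(imod)["white"] + coef(imod)["wxl"] * (1:40))
> or_white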
White patients who were in the hospital for 14 days have roughly 10% greater odds of death than do non-white patients who were in the hospital for 14 days.
SAS CODE
/* Section 3.1 */
*Refer to the code in section 1.4 to import and print edrelig dataset;
*Refer to proc freq in section 2.4 to generate the frequency table;
*Build logistic model and obtain odds ratio & covariance matrix;
proc genmod data = edrelig descending;
class educlevel (ref = 'AA') / param = ref;
model religious = age male kids educlevel/dist = binomial
link = logit covb;
estimate "Intercept" Intercept 1 / exp;
estimate "Age" age 1 / exp;
estimate "Male" male 1 / exp;
estimate "Kid" kids 1 / exp;
estimate "BA" educlevel 1 0 / exp;
estimate "MA/PhD" educlevel 0 1 / exp;
run;
*Refer to proc iml in section 2.3 and the full code is provided
online;
/* Section 3.2 */
/* Section 3.4 */
*Refer to proc iml in section 2.3 and the full code is provided online;
*Sort the dataset;
proc sort data = medpar;
by descending type;
run;
data est1;
set est;
parameter1 = parameter;
if parameter = "Scale" then delete;
if level1 = 2 then parameter1 = "type2";
else if level1 = 3 then parameter1 = "type3";
run;
/* Section 3.5 */
output;
end;
run;
STATA CODE
3.1
. use edrelig, clear
. glm religious age male kids i.educlevel, fam(bin) nolog nohead eform
. glm religious age male kids i.educlevel, fam(bin) nolog eform
3.2
. e(deviance) // deviance
. e(deviance_p) // Pearson Chi2
. e(dispers_p) // Pearson dispersion
. di e(ll) // log-likelihood
. gen loglike = e(ll)
. scalar loglik = e(ll)
. di loglik
. predict h, hat
. sum h // hat matrix diagonal
. predict stpr, pear stand
. sum stpr // stand. Pearson residual
. predict stdr, dev stand
. sum stdr // stand deviance residual
3.3
. use medpar, clear
. qui glm died white hmo los i.type, fam(bin)
. estat ic
. abic
3.4
. glm died white hmo los i.type, fam(bin) scale(x2) nolog nohead
. glm died white hmo los i.type, fam(bin) vce(robust) nolog nohead
. glm died white hmo los i.type, fam(bin) vce(boot) nolog nohead
3.5
. glm died white los, fam(bin) nolog nohead
. gen wxl = white*los
. glm died white los wxl, fam(bin) nolog nohead
. glm died white los wxl, fam(bin) nolog nohead eform
4  Testing and Fitting a Logistic Model
4.1 CHECKING LOGISTIC MODEL FIT
$$ \frac{\text{Pearson Chi2}}{\text{residual dof}} \approx 1.0 $$
the residual degrees of freedom defining the Chi2 degrees of freedom. The p-value is based on the Chi2(Pearson Chi2, rdof) distribution, calculated in R as 1-pchisq(pchi2, df).
We may code the Pearson Chi2 GOF test, creating a little table based on
the mymod model, as:
> pr <- sum(residuals(mymod, type = "pearson")^2)
> df <- mymod$df.residual
> p_value <- pchisq(pr, mymod$df.residual, lower = F)
> print(matrix(c("Pearson Chi GOF", "Chi2", "df", "p-value", " ",
+   round(pr,4), df, round(p_value,4)), ncol = 2))
[,1] [,2]
[1,] “Pearson Chi GOF” “ ”
[2,] “Chi2” “1519.4517”
[3,] “df” “1489”
[4,] “p-value” “0.2855”
This test is still found in many books, articles, and research reports. Analysts should be aware, however, that many statisticians no longer rely on it as a global fit test. Rather than using a single test to declare a model well or poorly fitted, statisticians now prefer to employ a variety of tests to evaluate a model. The distributional assumptions upon which tests like this are based are not always met, or are only loosely met, which tends to bias the test results. Care needs to be taken when accepting test results.
With p > 0.05, the Pearson Chi2 GOF test fails to reject the hypothesis that the model is well fitted. In short, we may use the test result to support acceptance of the model.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.720149 0.219073 -3.287 0.00101 **
white 0.303663 0.209120 1.452 0.14647
los -0.037193 0.007799 -4.769 1.85e-06 ***
hmo 0.027204 0.151242 0.180 0.85725
factor(type)2 0.417873 0.144318 2.896 0.00379 **
factor(type)3 0.933819 0.229412 4.070 4.69e-05 ***
Model:
died ~ white + los + hmo + factor(type)
Df Deviance AIC LRT Pr(>Chi)
<none> 1881.2 1893.2
white 1 1883.3 1893.3 2.1778 0.1400
los 1 1907.9 1917.9 26.7599 2.304e-07 ***
hmo 1 1881.2 1891.2 0.0323 0.8574
factor(type) 2 1902.9 1910.9 21.7717 1.872e-05 ***
Deviance          r^d    sign(y − µ)·sqrt(2 Σ ln(1/µ))        if y = 1
                         sign(y − µ)·sqrt(2 Σ ln(1/(1 − µ)))  if y = 0
Stand. Pearson    r^sp   r^p / sqrt(1 − h)
Stand. deviance   r^sd   r^d / sqrt(1 − h)
Likelihood        r^l    sgn(y − µ)·sqrt(h·(r^p)^2 + (1 − h)·(r^d)^2)
Anscombe          r^A    (A(y) − A(µ)) / {µ(1 − µ)}^(1/6),
                         where A(z) = Beta(2/3, 2/3)·IncompleteBeta(z, 2/3, 2/3), z = (y; µ), and
                         Beta(2/3, 2/3) = 2.05339. When z = 1, the function reduces to the Beta (see Hilbe, 2009).
Cook's distance   r^CD   h·(r^p)^2 / (C·(1 − h)^2), C = number of coefficients
Delta Pearson     ΔChi2  (r^sp)^2
Delta deviance    ΔDev   (r^d)^2 + h·(r^sp)^2
Delta beta        Δβ     h·(r^p)^2 / (1 − h)^2
> summary(ans)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.5450 -1.0090 -0.9020 -0.0872 1.5310 3.2270
Residual analysis for logistic models is usually based on what are known as
n-asymptotics. However, some statisticians suggest that residuals should be
based on m-asymptotically formatted data. Data in observation-based form;
that is, one observation or case per line, are in n-asymptotic format. The
datasets we have been using thus far for examples are in n-asymptotic form.
m-asymptotic data occur when observations with the same values for all predictors are collected together into single covariate patterns.
There are several ways to reduce the three-variable subset of the medpar data to m-asymptotic form. I will show a way that maintains the died response, renamed dead since it is no longer binary in the grouped data, and then show how to duplicate the above table.
> data(medpar)
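The reshaping code itself is not reproduced in this excerpt. A minimal sketch of one way to collapse medpar to covariate-pattern form (not the author's original code; the names dead, alive, and m match those referred to in the text):

> grp <- aggregate(cbind(dead = medpar$died, alive = 1 - medpar$died),
+                  by = list(white = medpar$white, hmo = medpar$hmo,
+                            type = medpar$type), FUN = sum)
> grp$m <- grp$dead + grp$alive
> grp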
The code above produces the 11 covariate patterns of the m-asymptotic data, but I also provide dead and alive, which can be used for grouped logistic models in the next chapter. m is simply the sum of alive and dead. For example, look at the top line. With m = 72, we know that there were 72 observations in the reduced medpar data for which white=0, hmo=0, and type=1. For that covariate pattern, died=1 (dead) occurred 17 times and died=0 (alive) occurred 55 times.
To obtain the identical covariate pattern list where only white, hmo, type,
and m are displayed, the following code reproduces the table.
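A sketch of one way to produce that listing, assuming the grp object sketched above:

> grp[, c("white", "hmo", "type", "m")]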
8 1 0 3 83
9 1 1 1 197
10 1 1 2 27
11 1 1 3 3
It should be noted that all but one of the possible covariate patterns exist in these data. Only the covariate pattern [white=0, hmo=1, type=3] is not part of the medpar dataset. It is therefore not in the m-asymptotic data format.
I will provide code for Figures 4.1 through 4.3, which are important when evaluating the fit of logistic models. Since R's glm function does not use an m-asymptotic format for residual analysis, I shall discuss the traditional n-asymptotic method. Bear in mind that when there are continuous predictors in the model, m-asymptotic data tend to reduce to n-asymptotic data. Continuous predictors usually have many more distinct values than do binary and categorical predictors. A model with two or three continuous predictors typically shows no difference between the m-asymptotic and n-asymptotic formats. Residual analysis on observation-based data is the traditional manner of executing the plots, and is the standard way of graphing in R. I am adding los (length of stay; number of days in hospital) back into the model to remain consistent with earlier modeling we have done on the medpar data.
You may choose to construct residual graphs using m-asymptotic meth-
ods. The code to do this was provided above. However, we shall keep with
the standard methods in this chapter. In the next chapter on grouped logistic
models, m-asymptotics is built into the model.
R code for creating the standard residuals found in literature related to
logistic regression is given in Table 4.2. Code for creating a simple squared
standardized deviance residual versus mu graphic (Figure 4.1) is given as:
data(medpar)
mymod <- glm(died ~ white + hmo + los + factor(type),
             family = binomial, data = medpar)
summary(mymod)
mu <- mymod$fitted.values             # predicted value; probability that died==1
dr <- resid(mymod, type = "deviance") # deviance residual
hat <- hatvalues(mymod)               # hat matrix diagonal
stdr <- dr/sqrt(1 - hat)              # standardized deviance residual
plot(mu, stdr^2)
abline(h = 4, col = "red")
[Figure 4.1: Squared standardized deviance residuals (stdr^2) versus mu]
Analysts commonly use the plot of the square of the standardized deviance
residuals versus mu to check for outliers in a fitted logistic model. Values in
the plot greater than 4 are considered outliers. The values on the vertical axis
are in terms of standard deviations of the residual. The horizontal axis shows the predicted probabilities. All figures here are based on the medpar data.
Another good way of identifying outliers based on a residual graph is by
use of Anscombe residuals versus mu, or the predicted probability that the
response is equal to 1. Anscombe residuals adjust the residuals so that they are
as normally distributed as possible. This is important when using 2, or 4 when
the residual is squared, as a criterion for specifying an observation as an out-
lier. It is the 95% criterion so commonly used by statisticians for determining
statistical significance. Figure 4.2 is not much different from Figure 4.1 when
squared standardized deviance residuals are used in the graph. The Anscombe
plot is preferred.
Large hat values indicate covariate patterns that differ from the average covariate pattern. Values on the horizontal extremes are high residuals. Values that are high on the hat scale and low on the residual scale, that is, high in the middle and close to the zero line, do not fit the model well. They are also difficult to detect as influential when using other graphics. There are some seven
[Figure 4.2: Squared Anscombe residuals (ans^2) versus mu]
[Figure: hat matrix diagonal (hat) versus standardized Pearson residuals (stpr)]
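The series R1, R2, and R3 plotted below are the predicted probabilities of death by length of stay for type 1, 2, and 3 admissions; the code that created them is not reproduced in this excerpt. A minimal sketch of one way they might be computed (the model and coefficient handling are assumptions, not the author's original code):

> cmod <- glm(died ~ los + factor(type), family = binomial, data = medpar)
> cf <- coef(cmod)
> R1 <- 1/(1 + exp(-(cf[1] + cf[2]*medpar$los)))           # type 1
> R2 <- 1/(1 + exp(-(cf[1] + cf[2]*medpar$los + cf[3])))   # type 2
> R3 <- 1/(1 + exp(-(cf[1] + cf[2]*medpar$los + cf[4])))   # type 3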
layout(1)
plot(medpar$los, R1, col = 1, main = 'P[Death] while hospitalized',
     sub = "Black = 1; Red = 2; Yellow = 3", ylab = 'Type of Admission',
     xlab = 'LOS', ylim = c(0, 0.4))
lines(medpar$los, R2, col = 2, type = 'p')
lines(medpar$los, R3, col = 3, type = 'p')
[Figure: P[Death] while hospitalized, plotted against LOS by type of admission. Black = 1; Gray = 2; Light gray = 3]
determines the optimal probability value with which to separate predicted ver-
sus observed successes (1) or failures (0).
For an example we shall continue with the model used for residual analy-
sis earlier in the chapter. It is based on the medpar data, with died as the
response variable and white, hmo, los and levels of type as the predictors. We
then obtain predicted probabilities that died == 1, which is the definition of mu. The goal is then to determine how well the predicted probabilities actually predict classification as died == 1, and how well they predict died == 0. Analysts are not only interested in correct prediction, though, but also in such issues as the percentage of times the predictor incorrectly classifies the outcome. I advise the reader to remember that logistic models that classify well are not always well-fitted models. If your interest is strictly to produce the best classification scheme, be less concerned about model fit. By the same logic, a well-fitted logistic model may not clearly differentiate the two levels of the response. It is valuable if a model accomplishes both fit and classification power, but it need not be the case.
Now to our example model:
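A minimal sketch of the step being referred to, assuming mymod is the medpar model fit earlier (for a logistic model with an intercept, the mean of mu equals the mean of died, 513/1495 ≈ 0.3431438):

> mu <- predict(mymod, type = "response")
> mean(mu)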
Analysts traditionally use the mean of the predicted value as the cut point. Values greater than 0.3431438 should predict that died == 1; values lower should predict died == 0. For confusion matrices, the mean of the response, or mean of the prediction, is a better cut point than the default 0.5 value set in most software. If the response variable being modeled has substantially more or fewer 1s than 0s, a 0.5 cut point will produce terrible results. I shall provide a better criterion for the cut point shortly, but the mean is a good default criterion.
Analysts can use the percentage at which levels of died relate to mu being
greater or less than 0.3431438 to calculate such statistics as specificity and
sensitivity. These are terms that originate in epidemiology, although tests like
the ROC statistic and curve were first derived in signal theory. Using our
example, we have patients who died (D) and those who did not (~D). The
probability of being predicted to die given that the patient has died is called
model sensitivity. The probability of being predicted to stay alive, given the
fact that the patient remained alive is referred to as model specificity. In epi-
demiology, the term sensitivity refers to the probability of testing positive
for having a disease given that the patient in fact has the disease. Specificity
refers to when a patient tests negative for a disease when they in fact do not
have the disease. The term false positive refers to when a patient tests positive for a disease even though they do not have it. A false negative occurs when a patient tests negative for a disease even though they actually have it.
These are all important statistics in classification analysis, but model sensi-
tivity and specificity are generally regarded as the most important results.
However, false positive and false negative are used with the main statistics
for creating the ROC curve. Each of these statistics can easily be calculated
from a confusion matrix. All three of these classification tools are intimately related to one another.
The key point is that determining the correct cut point provides the
grounds for correctly predicting the above statistics, given an estimated model.
The cut point is usually close to the mean of the predicted values, but is not
usually the same value as the mean. Another way of determining the proper
cut point is to choose the point at which the specificity and sensitivity are closest in value. As you will see, though, formulae have been designed to find the optimal cut point, which is usually at or near the point where the sensitivity and specificity are closest.
The sensitivity–specificity (S–S) plot and the ROC plot and tests are components of the ROCtest function. The classification or confusion matrix is
displayed using the confusion_stat function. Both of these functions are
part of the LOGIT package on CRAN. When LOGIT has been loaded into
memory the functions are automatically available to the analyst.
> library(LOGIT)
> data(medpar)
> mymod <- glm(died ~ los + white + hmo + factor(type),
family=binomial, data=medpar)
We shall start with the S–S plot, which is typically used to establish the
cut point used in ROC and confusion matrix tests. The cut point used in ROC_
test is based on Youden’s J statistic (Youden, 1950). The optimal cut point is
defined as the threshold that maximizes the distance to the identity (diagonal)
line of the ROC curve. The optimality criterion is based on:
max(sensitivities + specificities)
Other criteria have been suggested in the literature. Perhaps the most
noted alternative is:
min((1 - sensitivities)^2 + (1- specificities)^2)
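A minimal sketch of locating a cut point by the first criterion above, using a direct search over candidate thresholds (illustrative only, not the LOGIT package's internal code):

> mu <- predict(mymod, type = "response")
> cuts <- seq(0.01, 0.99, by = 0.001)
> sens <- sapply(cuts, function(cp) mean(mu[medpar$died == 1] >= cp))  # sensitivity
> spec <- sapply(cuts, function(cp) mean(mu[medpar$died == 0] < cp))   # specificity
> cuts[which.max(sens + spec)]   # Youden-style optimal cut point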
[Figure: S–S plot of sensitivity and specificity versus probability cut point. Cut point: 0.364]
When using ROC analysis, the analyst should look at both the ROC sta-
tistic as well as at the plot of the sensitivity versus one minus the specificity.
A model with no predictive power has an ROC curve that follows the diagonal line, representing an ROC statistic of 0.5. Values from 0.5 to 0.65 have little predictive power. Values
from 0.65 to 0.80 have moderate predictive value. Many logistic models fit into
this range. Values greater than 0.8 and less than 0.9 are generally regarded as
having strong predictive power. Values of 0.9 and greater indicate the highest
amount of predictive power, but models rarely achieve values in this range.
The model is a better classifier with greater values of the ROC statistic, or area
under the curve (AUC). Beware of over-fitting with such models. Validating
the model with a validation sample or samples is recommended. See Hilbe
(2009) for details.
ROC is a test on the response term and fitted probability. SAS users should
note that the ROC statistic described here is referred to as Harrell’s C statistic.
The ROCtest function is used to determine that the predictive power of
the model is 0.607. Note that the type = “ROC” option is given to obtain the
test statistic and graphic. Due to the sampling nature of the statistics, the cut
point for the ROC curve differs slightly from that of the S–S plot (Figure 4.6).
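The call that produced the output below is not reproduced in this excerpt; based on the text, it presumably takes a form like the following (only the function name and the type = "ROC" option are stated in the text; any other arguments are assumptions):

> ROCtest(mymod, type = "ROC")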
$cut
[1] 0.3614538
[Figure: ROC curve, sensitivity versus 1 − specificity. AUC: 0.607]
Note that a cutoff value of 0.3615 is used for the AUC statistic. Given
that died indicates that a patient died while hospitalized, the AUC statistic
can be interpreted as follows: The estimated probability is 0.61 that patients
who die have a higher probability of death (higher mu) than patients who are
alive. This value is very low. A ROC statistic of 0.5 indicates that the model
has no predictive power at all. For our model there is some predictive power,
but not a lot.
> confusion_stat(out1$Predicted,out1$Observed)
$matrix
obs 0 1 Sum
pred
0 794 293 1087
1 188 220 408
Sum 982 513 1495
$statistics
Accuracy Sensitivity Specificity
0.6782609 0.4288499 0.8085540
Other statistics that can be drawn from the confusion matrix and that can
be of value in classification analysis are listed below. Recall from earlier dis-
cussion that D = patient died while in hospital (outcome = 1) and ~D = patient did not die in hospital (outcome = 0).
                 Observed
Predicted        1      0    Total
    1          292    378      670
    0          221    604      825
Total          513    982     1495

                 Observed
Predicted        1      0    Total
    1          252    233      485
    0          261    749     1010
Total          513    982     1495
> data(medpar)
> summary(mymod <- glm(died ~ white + los + hmo +
+     factor(type), family = binomial,
+     data = medpar))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.720149 0.219073 -3.287 0.00101 **
white 0.303663 0.209120 1.452 0.14647
los -0.037193 0.007799 -4.769 1.85e-06 ***
hmo 0.027204 0.151242 0.180 0.85725
factor(type)2 0.417873 0.144318 2.896 0.00379 **
factor(type)3 0.933819 0.229412 4.070 4.69e-05 ***
---
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1881.2 on 1489 degrees of freedom
AIC: 1893.2
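The code that produced the goodness-of-fit output below is not shown in this excerpt. A Hosmer–Lemeshow style call of the following form (using, for example, hoslem.test from the ResourceSelection package; g = 12 groups gives df = 10) produces output in the same format; this is an assumption, not necessarily the author's original code:

> library(ResourceSelection)
> hoslem.test(mymod$y, fitted(mymod), g = 12)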
data: mymod
X2 = 84.8001, df = 10, p-value = 5.718e-14
The Chi2 test again indicates that the model is ill fitted.
In order to show how different code can produce different results, I used the code for the H–L test given in Hilbe (2009). Rather than the groups being defined and displayed by range, they are calculated as ranges but the mean of each group is displayed in the output. The number of observations in each group is also given. This code develops three H–L tables, with 8, 10, and 12 groups. The 12-group table is displayed below.
The p-value again tells us that the model is not well fitted. The statistics are similar, but not identical, to the table shown earlier. The H–L test is a nice summary test to use on a logistic model, but interpret it with care.
> library(Hmisc)
> data(hivlgold)
> hiv
infec cases cd4 cd8
1 0 3 0 0
2 0 8 1 1
3 0 2 2 2
4 0 5 1 0
5 0 2 2 0
6 0 13 2 1
7 1 1 0 2
8 1 2 1 2
9 1 4 0 0
10 1 4 1 1
11 1 1 2 2
12 1 2 1 0
. . .
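The model call producing the output below is not reproduced here. A sketch of the form it presumably takes, based on the Stata code at the end of the chapter (glm infec i.cd4 i.cd8 [fw=cases]); not necessarily the author's original call:

> hivglm <- glm(infec ~ factor(cd4) + factor(cd8), weights = cases,
+               family = binomial, data = hiv)
> summary(hivglm)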
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2877 0.7638 0.377 0.706
factor(cd4)1 -1.2040 1.1328 -1.063 0.288
factor(cd4)2 -20.3297 2501.3306 -0.008 0.994
factor(cd8)1 0.2231 1.0368 0.215 0.830
factor(cd8)2 19.3488 2501.3306 0.008 0.994
Look at the highest level of both cd4 and cd8. The coefficient values are
extremely high compared to level 2, and the standard errors of both are over
100 times greater than their associated coefficient. None of the Wald p-values
are significant. The model appears to be ill fitted, to say the least.
Penalized logistic regression was developed to resolve the problem of
perfect prediction. Heinze and Schemper (2002) amended a method designed
by David Firth (1993) to solve the so-called “problem of separation,” which
results in at least one parameter becoming infinite, or very large compared
to other predictors or levels of predictors in a model. See Hilbe (2009) for a
discussion of the technical details of the method.
The same data as above are modeled using Firth’s penalized logistic
regression. The function, logistf() is found in the logistf package on CRAN.
> library(logistf)
> firth <- logistf(infec ~ factor(cd4) + factor(cd8), weights = cases, data = hiv)
> firth
The coefficients appear normal, with nothing out of the ordinary. Interestingly, the p-values of the second level of cd4 and cd8, which failed in standard logistic regression, are statistically significant for the penalized logit model. The likelihood ratio test informs us that the penalized model is also not well fitted.
Penalized logistic regression often produces significant results when standard logistic regression does not. If you find that there is perfect prediction in your model, or that the data are highly unbalanced (for example, nearly all 1s or 0s for a binary variable), penalized logistic regression may be the only viable way of modeling it. Analysts who model mostly small data sets are more likely to have separation problems than those who model larger data.
prediction in the model. When that occurs penalized logistic regression should
be used—as we discussed in the previous section.
For an example of exact logistic regression, I shall use Arizona hospital
data collected in 1991. The data consist of a random sample of heart procedures
referred to as CABG and PTCA. CABG is an acronym meaning coronary artery
bypass grafting surgery and PTCA refers to percutaneous transluminal coronary
angioplasty. It is a nonsurgical method of placing a type of balloon into a coronary
artery in order to clear blockage caused by cholesterol. It is a substantially less
severe procedure than CABG. We will model the probability of death within 48 h of the procedure on 34 patients who underwent either a CABG or a PTCA. The variable procedure is 1 for CABG and 0 for PTCA. It is adjusted in the model by the type of admission: type = 1 is an emergency or urgent admission, and 0 is an elective admission. Other variables in the data are not used. Patients are all older than 65.
> data(azcabgptca34)
> head(azheart)
died procedure age gender los type
1 Died CABG 65 Male 10 Elective
2 Survive CABG 69 Male 7 Emer/Urg
3 Survive PTCA 76 Female 7 Emer/Urg
4 Survive CABG 65 Male 8 Elective
5 Survive PTCA 69 Male 1 Elective
6 Survive CABG 67 Male 7 Emer/Urg
> library(Hmisc)
> table(azheart$died, azheart$procedure)
PTCA CABG
Survive 19 9
Died 1 5
It is clear from the tabulation that more patients died having a CABG than a PTCA. A table of died on type of admission is displayed as:
First we shall use a logistic regression to model died on procedure and type.
The model results are displayed in terms of odds ratios and associated statistics.
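A sketch of the model described (not necessarily the author's original call; the factor codings of died, procedure, and type in the packaged data are assumptions, and toOR is the LOGIT function used throughout the book):

> azmod <- glm(died == "Died" ~ procedure + type, family = binomial, data = azheart)
> toOR(azmod)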
Table Format
                x
             0     1
        0    4     5
   y
        1    6     8
This table has two variables, y and x. It is in summary form. That is, the
above table is a summary of data and can be made into two variables when put
into the following format.
Grouped Format
y x count
0 0 4
0 1 5
1 0 6
1 1 8
The cell (x = 0; y = 0), or (0,0) in the above table has a value of 4; the cell
(x = 1; y = 1), or (1,1) has a value of 8. This indicates that if the data were in
observation-level form, there would be four observations having a pattern of
x,y values of 0,0. If we are modeling the data, with y as the binary response and
x as a binary predictor, the observation-level data appears as:
Observation-Level Format
y x
1. 0 0
2. 0 0
3. 0 0
4. 0 0
5. 0 1
6. 0 1
7. 0 1
8. 0 1
9. 0 1
10. 1 0
11. 1 0
12. 1 0
13. 1 0
14. 1 0
15. 1 0
16. 1 1
17. 1 1
18. 1 1
19. 1 1
20. 1 1
21. 1 1
22. 1 1
23. 1 1
The above data give us the identical information as the "y-x count" table above it, as well as the initial table. Each of these three formats yields the identical information. If the analyst simply sums the values in the cells, or sums the values of the count variable, he or she will know the number of observations in the observation-level data set: 4 + 5 + 6 + 8 indeed sums to 23.
Note that many times we see table data converted to grouped data in the
following format:
y x count
1 1 8
1 0 6
0 1 5
0 0 4
To check the above calculations the odds ratio may be calculated directly
from the original table data as well. Recall that the odds ratio of predictor x is
the ratio of the odds of y = 1 divided by the odds of y = 0. The odds of y = 1 is
the ratio of x = 1 to x = 0 when y = 1, and the odds of y = 0 is the ratio of x = 1
to x = 0 when y = 0.
                x
             0     1
        0    4     5
   y
        1    6     8
> (8/5)/(6/4)
[1] 1.066667
> 6/4
[1] 1.5
                           Gender
                  Female                    Male
             sleep  party  study      sleep  party  study
Grade  fail    3      4      2          2      4      3
       pass    2      1      6          3      2      4
The data have a binary response, Grade, with levels Fail and Pass; Gender has two levels (Female and Male); and student Type has three levels (sleep, party, and study). I suggest that the response of interest, Pass, be given the value 1, with Fail assigned 0. For Gender, Female = 0 and Male = 1. For Type: Sleep = 1, Party = 2, and Study = 3. Multiply the numbers of levels for the total number of groups in the data: 2 * 2 * 3 = 12. The response variable then will have six 0s and six 1s. When a table has predictors with more than two levels, I recommend using the 0,1 format for setting up the data for analysis.
A binary variable splits its values within each level of the next higher variable. Therefore, Gender will have a block of 0s and a block of 1s within each half of Grade. Since Type has three levels, 1–2–3 is assigned within each level of Gender.
Finally, assign the appropriate count value to each pattern of variables. The
first level represents Grade = Fail; Gender = Female; Type = Sleep. We move
from the upper left of the top row across the columns of the row, then move
to the next row.
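A sketch of the resulting data set and model implied by the construction just described (the names mydata2 and mymod3 match those used below and in the Stata/SAS code; this is not necessarily the author's original code):

> grade  <- c(0,0,0,0,0,0, 1,1,1,1,1,1)        # Fail = 0, Pass = 1
> gender <- c(0,0,0,1,1,1, 0,0,0,1,1,1)        # Female = 0, Male = 1
> type   <- c(1,2,3,1,2,3, 1,2,3,1,2,3)        # Sleep = 1, Party = 2, Study = 3
> count  <- c(3,4,2,2,4,3, 2,1,6,3,2,4)        # cell counts from the table
> mydata2 <- data.frame(grade, gender, type, count)
> mymod3 <- glm(grade ~ gender + factor(type), weights = count,
+               family = binomial, data = mydata2)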
> toOR(mymod3)
or delta zscore pvalue exp.loci. exp.upci.
(Intercept) 0.9518 0.6902 -0.0681 0.9457 0.2298 3.9428
gender 1.1039 0.7825 0.1394 0.8891 0.2751 4.4292
factor(type)2 0.3731 0.3461 -1.0628 0.2879 0.0606 2.2983
factor(type)3 2.0074 1.6811 0.8321 0.4053 0.3889 10.3622
SAS CODE
/* Section 4.1 */
*Obstats option provides all the residuals and statistics in Table 4.2;
proc genmod data=medpar descending;
class type (ref='1') / param=ref;
model died=white hmo los type / dist=binomial link=logit obstats;
ods output obstats=stats;
run;
k2=-0.8714+(-0.0376)*los+0.4386*2;
r2=1/(1+exp(-k2));
k3=-0.8714+(-0.0376)*los+0.4386*3;
r3=1/(1+exp(-k3));
run;
/* Section 4.2 */
*Build the logistic model and output classification table & ROC curve;
proc logistic data=medpar descending plots(only)=ROC;
class type (ref='1') / param=ref;
model died=white hmo los type / outroc=ROCdata ctable pprob=(0 to
1 by 0.0025);
ods output classification=ctable;
run;
/* Section 4.3 */
/* Section 4.4 */
/* Section 4.5 */
*Refer to the code in section 1.4 to import and print azheart dataset;
*Build the logistic model and obtain odds ratio & statistics;
proc genmod data=azheart descending;
model died=procedure type / dist=binomial link=logit;
estimate "Intercept" Intercept 1 / exp;
estimate "Procedure" procedure 1 / exp;
estimate "Type" type 1 / exp;
run;
*Refer to proc iml in section 2.3 and the full code is provided online;
/* Section 4.6 */
*Build the logistic model with weight and obtain odds ratio;
proc genmod data=mydata descending;
weight count;
*Build the logistic model with weight and obtain odds ratio;
proc genmod data=mydata2 descending;
class type (ref='1') / param=ref;
weight count;
model grade=gender type / dist=binomial link=logit;
estimate "Intercept" Intercept 1 / exp;
estimate "Gender" gender 1 / exp;
estimate "Type2" type 1 0 / exp;
estimate "Type3" type 0 1 / exp;
run;
STATA CODE
4.1
. use medpar
. xi: logit died white los hmo i.type, nolog
. lrdrop1
. qui logit died white hmo los i.type, nolog
. estimates store A
. qui logit died white hmo los, nolog
. estimates store B
. lrtest A B
. predict mu
. gen raw = died - mu // raw residual
. predict dev, deviance // deviance resid
. predict pear, residuals // Pearson resid
. predict hat, hat // hat matrix diagonal
. gen stddev = dev/sqrt(1-hat) // standardized deviance
. predict stpear, rstandard // standardized Pearson
. predict deltadev, ddeviance // delta deviance
. predict dx2, dx2 // delta Pearson
. predict dbeta, dbeta // delta beta
. gen stddev2 = stddev^2
. scatter stddev2 mu
. scatter hat stpear
. qui glm died los admit, fam(bin)
. gen L1 = _b[_cons] + _b[los]*los + _b[admit]*1 // Cond. effects plot
. gen Y1 = 1/(1+exp(-L1))
. gen L2 = _b[_cons] + _b[los]*los + _b[admit]*0
. gen Y2 = 1/(1+exp(-L2))
. scatter Y1 Y2 age, title("Prob of death w/I 48 hrs by admit type")
4.2
. glm died white hmo los i.type, fam(bin)
. predict mu
. mean died
. logit died white hmo los i.type, nolog
. lsens, genprob(cut) gensens(sen) genspec(spec)
. lroc
. estat classification, cut(.351)
4.3
. estat gof, table group(10)
. estat gof, table group(12)
4.4
. use hiv1gold
. list
. glm infec i.cd4 i.cd8 [fw=cases], fam(bin)
. firthlogit infec i.cd4 i.cd8 [fw=cases], nolog
4.5
. use azcabgptca34
. list in 1/6
. table died procedure
. table died type
. glm died procedure type, fam(bin) nolog
. glm died procedure type,fam(bin) scale(x2) nolog
. exlogistic died procedure type, nolog
4.6
. use pgmydata
. glm y x [fw=count], fam(bin) nolog
. glm y x [fw=count], fam(bin) nolog eform
. use phmydata2
. glm grade gender i.type [fw=count], fam(bin) nolog nohead
. glm grade gender i.type [fw=count], fam(bin) nolog nohead eform
5  Grouped Logistic Regression
5.1 THE BINOMIAL PROBABILITY
DISTRIBUTION FUNCTION
Grouped logistic regression is based on the binomial probability distribution.
Recall that standard logistic regression is based on the Bernoulli distribution,
which is a subset of the binomial. As such, the standard logistic model is a
subset of the grouped. The key concept involved is the binomial probability
distribution function (PDF), which is defined as:
$$ f(y; p, n) = \binom{n}{y} p^{y} (1-p)^{n-y} \tag{5.1} $$
$$ f(y; p, n) = \exp\left\{ y \ln\!\left(\frac{p}{1-p}\right) + n \ln(1-p) + \ln\binom{n}{y} \right\} \tag{5.2} $$
model is the data put into covariate patterns and evaluated by observation-
based residuals. Here the PDF itself is in covariate pattern structure.
The first derivative of the cumulant, −n ln(1 − p), with respect to the link, ln(p/(1 − p)), is the mean, which for the binomial distribution is
$$ \text{Mean} = \mu = np $$
and the second derivative of the cumulant with respect to the link is the variance,
$$ \text{Variance} = V(Y) = np(1 - p) $$
or, in terms of μ,
$$ \text{Variance} = V(\mu) = \mu\left(1 - \frac{\mu}{n}\right) = \frac{\mu(n - \mu)}{n} $$
$$ \text{Link} = \ln\!\left(\frac{\mu}{n - \mu}\right) $$
$$ \text{Inverse link} = \frac{n}{1 + \exp(-xb)} = \frac{n\exp(xb)}{1 + \exp(xb)} $$
$$ \mathcal{L}(\mu_i; y_i, n_i) = \sum_{i=1}^{m}\left\{ y_i \ln\!\left(\frac{\mu_i}{1-\mu_i}\right) + n_i \ln(1-\mu_i) + \ln\binom{n_i}{y_i} \right\} \tag{5.3} $$
$$ D = 2\sum_{i=1}^{m}\left\{ y_i \ln\!\left(\frac{y_i}{\mu_i}\right) + (n_i - y_i)\ln\!\left(\frac{n_i - y_i}{n_i - \mu_i}\right) \right\} \tag{5.4} $$
y cases x1 x2 x3
1 3 1 0 1
1 1 1 1 1
2 2 0 0 1
0 1 0 1 1
2 2 1 0 0
0 1 0 1 0
x1, x2, and x3 are all binary predictors. The variable cases has values that inform us of the number of times these three binary predictors would share the same values if the data were in observation format. y indicates how many of the cases with the same covariate pattern have 1 as a value for y. The first line represents three observations having x1 = 1, x2 = 0, and x3 = 1. One of the three observations has y = 1, and two have y = 0. In observation format the above grouped data set appears as follows.
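The observation-level listing is not reproduced in this excerpt. A sketch of the expanded data, constructed directly from the grouped table above (row order is arbitrary; the object name obser is used later in the chapter):

> x1 <- c(1,1,1, 1, 0,0, 0, 1,1, 0)
> x2 <- c(0,0,0, 1, 0,0, 1, 0,0, 1)
> x3 <- c(1,1,1, 1, 1,1, 1, 0,0, 0)
> y  <- c(1,0,0, 1, 1,1, 0, 1,1, 0)
> obser <- data.frame(y, x1, x2, x3)
> obser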
Grouped Data
> y <- c(1,1,2,0,2,0)
> cases <- c(3,1,2,1,2,1)
> x1 <- c(1,1,0,0,1,0)
> x2 <- c(0,1,0,1,0,1)
> x3 <- c(1,1,1,1,0,0)
> grp <- data.frame(y,cases,x1,x2,x3)
> grp$noty <- grp$cases - grp$y
> xx2 <- glm( cbind(y, noty) ~ x1 + x2 + x3, family = binomial, data = grp)
> summary(xx2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2050 1.8348 0.657 0.511
x1 0.1714 1.4909 0.115 0.908
x2 -1.5972 1.6011 -0.998 0.318
x3 -0.5499 1.5817 -0.348 0.728
R's glm function allows a binomial response to be specified in terms of two columns of data—one for the number of 1s for a given covariate pattern, and the second for the number of 0s (not 1s). It is the only logistic regression software I know of that allows this manner of formatting the binomial response. However, one can create a variable representing the cbind(y, noty) and run it as a single-term response. The results will be identical.
. . .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2050 1.8348 0.657 0.511
x1 0.1714 1.4909 0.115 0.908
x2 -1.5972 1.6011 -0.998 0.318
x3 -0.5499 1.5817 -0.348 0.728
In a manner more similar to that used in other statistical packages, the binomial denominator, cases, may be employed directly in the response—but only if it is also used as a weighting variable. The following code produces the same output as above:
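A sketch of that weighted-proportion form (the response is the proportion y/cases, with cases supplied as the prior weight):

> xx3 <- glm(y/cases ~ x1 + x2 + x3, weights = cases,
+            family = binomial, data = grp)
> summary(xx3)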
The advantage of using this method is that the analyst does not have
to create the noty variable. The downside is that some postestimation func-
tions do not accept being based on a weighted model. Be aware that there are
alternatives and use the one that works best for your purposes. The cbind()
response appears to be the most popular, and seems to be used more in pub-
lished research.
Stata and SAS use the grouping variable; for example, cases, as the variable
n in the binomial formulae listed in the last section and as given in the example
directly above. The binomial response can be thought of as y = numerator and
cases = denominator. Of course these term names will differ depending on
the data being modeled. Check the end of this chapter for how Stata and SAS
handle the binomial denominator.
> library(reshape)
> obser$x1 <- factor(obser$x1)
> obser$x2 <- factor(obser$x2)
> obser$x3 <- factor(obser$x3)
> grp <- na.omit(data.frame(cast(melt(obser, measure = "y"),
+        x1 + x2 + x3 ~ .,
+        function(x) { c(notyg = sum(x == 0), yg = sum(x == 1)) } )))
> grp
x1 x2 x3 notyg yg
1 0 0 1 0 2
2 0 1 0 1 0
3 0 1 1 1 0
4 1 0 0 0 2
5 1 0 1 2 1
6 1 1 1 0 1
. . .
The code used to convert an observation-level data set to a grouped data set is the same code that can convert an n-asymptotic data set to m-asymptotic form for residual analysis. You can use the above code as a paradigm for converting any observation-level data to grouped format.
Pearson Chi2 = 6.630003
Dispersion = 3.315001
Any value of the dispersion greater than 1 indicates extra variation in the data. That is, it indicates more variation than is allowed by the binomial PDF, which underlies the model. Recall that the dispersion statistic is the Pearson statistic divided by the residual degrees of freedom, which is defined as the number of observations in the model less the number of coefficients (predictors, intercept, extra parameters). Multiplying the standard error of each predictor in a grouped logistic model by the square root of the dispersion produces a quasibinomial grouped logistic model; it adjusts the standard errors of the model. Sandwich and bootstrapped standard errors may be used as well to adjust for overdispersed grouped logistic models.
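A minimal sketch of that scaling, assuming the grouped fit xx2 from Section 5.2 (the Pearson Chi2 and dispersion values shown above correspond to that model):

> pr <- sum(residuals(xx2, type = "pearson")^2)   # Pearson Chi2
> disp <- pr/xx2$df.residual                      # Pearson dispersion
> sqrt(diag(vcov(xx2))) * sqrt(disp)              # quasibinomial (scaled) SEs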
A caveat should be given regarding the identification of overdispersed data. I mentioned that for grouped logistic models a dispersion statistic greater than 1 indicates overdispersion, or unaccounted-for variation in the data. However, there are times when models appear to be overdispersed but are in fact not. A grouped logistic model's dispersion statistic may be greater than 1, but the model data can itself be adjusted to eliminate the perceived overdispersion. Apparent overdispersion occurs in the following conditions:
Apparent Overdispersion
• The model is missing a needed predictor.
• The model requires one or more interactions of predictors.
• A predictor needs to be transformed to a different scale; log(x).
• The link is misspecified (the data should be modeled as probit or
cloglog).
• There are existing outliers in the data.
Guideline
If a grouped logistic model has a dispersion statistic greater than 1, check
each of the 5 indicators of apparent overdispersion to determine if applying
them reduces the dispersion to approximately 1. If it does, the data are not
truly overdispersed. Adjust the model accordingly. If the dispersion statistic
of a grouped logistic model is less than 1, the data is under-dispersed. This
type of extra-dispersion is more rare, and is usually dealt with by scaling or
using robust SEs.
[Figure: hat matrix diagonal (hat) versus standardized Pearson residuals (stpr) for the grouped model]
plot(mu, stdr)
abline(h = 4, lty = "dotted", col = "red")
[Figure: standardized deviance residuals (stdr) versus mu for the grouped model]
Binomial PDF
$$ f(y; \mu, n) = \binom{n}{y} \mu^{y} (1-\mu)^{n-y} \tag{5.5} $$
As discussed before, the choose function \(\binom{n}{y}\) is the binomial coefficient, which is the normalization term of the binomial PDF. It guarantees that the function sums to 1.0. This form of the function may also be expressed in terms of factorials:
$$ \binom{n}{y} = \frac{n!}{y!(n-y)!} \tag{5.6} $$
or, equivalently, in terms of gamma functions:
$$ \frac{\Gamma(n+1)}{\Gamma(y+1)\Gamma(n-y+1)} \tag{5.7} $$
The log-likelihood function for the binomial model can then be expressed,
with subscripts, as:
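The equation itself is not legible in this extraction. A sketch of the form it takes, following Equation 5.3 with the binomial coefficient written in gamma form:
$$ \mathcal{L}(\mu_i; y_i, n_i) = \sum_{i=1}^{m}\left\{ y_i \ln\!\left(\frac{\mu_i}{1-\mu_i}\right) + n_i \ln(1-\mu_i) + \ln\Gamma(n_i+1) - \ln\Gamma(y_i+1) - \ln\Gamma(n_i-y_i+1) \right\} $$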
Beta PDF
$$ f(y; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, y^{a-1}(1-y)^{b-1} \tag{5.9} $$
where a is the number of successes and b is the number of failures. The ini-
tial term in the function is the normalization constant, comprised of gamma
functions.
The above function can also be parameterized in terms of μ. Since we
plan on having the binomial parameter, μ, itself distributed as beta, we can
parameterize the beta PDF as:
$$ f(\mu) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1}(1-\mu)^{b-1} \tag{5.10} $$
Notice that the kernel of the beta distribution is similar to the binomial kernel:
$$ \mu^{y}(1-\mu)^{n-y} \sim \mu^{a-1}(1-\mu)^{b-1} \tag{5.11} $$
Even the coefficients of the beta and binomial are similar in structure. In
probability theory such a relationship is termed conjugate. The beta distribu-
tion is conjugate to the binomial. This is a very useful property when mixing
distributions, since it generally allows for easier estimation. Conjugacy plays
a particularly important role in Bayesian modeling where a prior conjugate
(beta) distribution of a model coefficient, which is considered to be a ran-
dom variable, is mixed with the (binomial) likelihood to form a beta posterior
distribution.
The mean and variance of the beta PDF may be given as:
$$ E(y) = \frac{a}{a+b} = \mu \qquad V(y) = \frac{ab}{(a+b)^2(a+b+1)} \tag{5.12} $$
$$ f(y; \mu, a, b) = f(y; \mu, n)\, f(\mu; a, b) \tag{5.13} $$
$$ f(y; \mu, a, b) = \frac{\Gamma(a+b)\Gamma(n+1)}{\Gamma(a)\Gamma(b)\Gamma(y+1)\Gamma(n-y+1)}\, \mu^{y+a-1}(1-\mu)^{n-y+b-1} \tag{5.14} $$
An alternative parameterization may be given in terms of μ and σ, with μ = a/(a + b):
$$ f(y; \mu, \sigma) = \frac{\Gamma(n+1)}{\Gamma(y+1)\Gamma(n-y+1)} \cdot \frac{\Gamma\!\left(\frac{1}{\sigma}\right)\Gamma\!\left(y+\frac{\mu}{\sigma}\right)\Gamma\!\left(n-y+\frac{1-\mu}{\sigma}\right)}{\Gamma\!\left(n+\frac{1}{\sigma}\right)\Gamma\!\left(\frac{\mu}{\sigma}\right)\Gamma\!\left(\frac{1-\mu}{\sigma}\right)} \tag{5.15} $$
$$ E(Y) = n\mu \qquad V(Y) = n\mu(1-\mu)\left[1 + (n-1)\frac{\sigma}{1+\sigma}\right] \tag{5.16} $$
This is the parameterization that is used in R’s gamlss function (Rigby and
Stasinopoulos, 2005) and in the Stata betabin command (Hardin and Hilbe, 2014).
For an example, we shall use the 1912 Titanic shipping disaster passenger data. In grouped format, the data are called titanicgrp. The predictors of the model are age, sex, and (ticket) class; the data are listed below.
> data(titanicgrp)
> titanicgrp ; attach(titanicgrp) ; table(class)
survive cases age sex class
1 1 1 child women 1st class
2 13 13 child women 2nd class
3 14 31 child women 3rd class
4 5 5 child man 1st class
5 11 11 child man 2nd class
6 13 48 child man 3rd class
7 140 144 adults women 1st class
8 80 93 adults women 2nd class
9 76 165 adults women 3rd class
10 57 175 adults man 1st class
11 14 168 adults man 2nd class
12 75 462 adults man 3rd class
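The grouped logit fit whose odds ratios are displayed below is not reproduced in this excerpt. A sketch of a call that produces the coefficient names shown (died and class03 follow the Stata code at the end of the chapter; the factor level orderings of age, sex, and class in the packaged data are assumptions):

> titanicgrp$died <- titanicgrp$cases - titanicgrp$survive
> titanicgrp$class03 <- factor(titanicgrp$class,
+                              levels = c("3rd class", "2nd class", "1st class"))
> jhlogit <- glm(cbind(survive, died) ~ age + sex + class03,
+                family = binomial, data = titanicgrp)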
> toOR(jhlogit)
or delta zscore pvalue exp.loci. exp.upci.
(Intercept) 3.6529 0.9053 5.2271 0 2.2473 5.9374
ageadults 0.3480 0.0844 -4.3502 0 0.2163 0.5599
sexman 0.0935 0.0136 -16.3129 0 0.0704 0.1243
class032nd class 2.1293 0.3732 4.3126 0 1.5103 3.0021
class031st class 5.8496 0.9986 10.3468 0 4.1861 8.1741
> P__disp(jhlogit)
Pearson Chi2 = 100.8828
Dispersion = 14.41183
> library(sandwich)
> or <- exp(coef(jhlogit))
> rse <- sqrt(diag(vcovHC(jhlogit, type = "HC0"))) # robust SEs
> ORrse <- or*rse
> pvalue <- 2*pnorm(abs(or/ORrse), lower.tail = FALSE)
> rotab <- data.frame(or, ORrse, pvalue)
> rotab
or ORrse pvalue
(Intercept) 3.65285874 2.78134854 0.18906811
ageadults 0.34798085 0.24824238 0.16098137
sexman 0.09353076 0.04414139 0.03409974
class032nd class 2.12934342 1.26408082 0.09208519
class031st class 5.84958983 3.05415838 0.05545591
The robust p-values tell us that age and 2nd class are not significant. The 1st class effect is marginal, but given the variability in the data we would keep it in a final model, with a combined 2nd and 3rd class as the reference. That is, it may be preferable to dichotomize class as a binary predictor with 1 = 1st class and 0 = otherwise.
R output for the beta-binomial model using gamlss is displayed below. Note again that there is a slight difference in estimates. Sigma is the dispersion parameter, and can itself be parameterized, having predictors like the mean or location parameter, mu. The dispersion estimates inform the analyst which predictors significantly influence the extra correlation in the data, therefore influencing the value of sigma. In the form below it is only the intercept of sigma that is displayed. In this respect, the beta binomial is analogous to the heterogeneous negative binomial count model (Hilbe, 2011, 2014), and the
Beta Binomial
> library(gamlss)
> summary(mybb <- gamlss(cbind(survive,died) ~ age + sex + class03,
data = titanicgrp, family = BB))
. . .
Notice that the AIC statistic is reduced from 157.77 for the grouped logistic model to 85.80 for the beta-binomial model. This is a substantial improvement in model fit. The heterogeneity or dispersion parameter, sigma, is 0.165. Odds ratios for the beta binomial are inflated compared to the grouped logit, but the p-values are much the same.
> exp(coef(mybb))
(Intercept) ageadults sexman class032nd class
4.4738797 0.1105858 0.1133972 7.5253615
class031st class
15.8044343
SAS CODE
/* Section 5.2 */
/* Section 5.4 */
/* Section 5.5 */
*Build the logistic model and obtain odds ratio & covariance
matrix;
proc genmod data = titanicgrp descending;
class class (ref = '3')/ param = ref;
model survive/cases = age sex class / dist = binomial link = logit
covb;
estimate "Intercept" Intercept 1 / exp;
estimate "ageadults" age 1 / exp;
estimate "sexman" sex 1 / exp;
estimate "class" class 1 0 / exp;
estimate "class" class 0 1 / exp;
run;
*Refer to proc iml in section 2.3 and the full code is provided
online;
STATA CODE
5.1
. use obser
. glm y x1 x2 x3, fam(bin) nolog
5.2
. use obser, clear
. glm y x1 x2 x3, fam(bin) nolog nohead
. use grp, clear
. glm y x1 x2 x3, fam(bin cases) nolog nohead
. use obser
. gen cases = 1
. collapse(sum) cases (sum) yg, by(x1 x2 x3)
. glm yg x1 x2 x3, fam(bin cases) nolog nohead
5.4
. use phmylgg
. gen cases = dead + alive
. glm dead white hmo los i.type, fam(bin cases)
. predict mu
. predict hat, hat
. predict dev, deviance
. gen stdev = dev/sqrt(1-hat)
. predict stpr, rstandard
. scatter stpr hat
. gen stdev2 = stdev^2
. scatter stdev2 mu
5.5
. use titanicgrp
. list
. gen died = cases - survive
. glm died age sex b3.class, fam(bin cases) nolog
. glm, eform
. glm died age sex b3.class, fam(bin cases) vce(robust) nolog
. betabin died age sex b3.class, n(cases) nolog
6  Bayesian Logistic Regression
6.1 A BRIEF OVERVIEW OF
BAYESIAN METHODOLOGY
Bayesian methodology would likely not be recognized by the person who is
regarded as the founder of the tradition. Thomas Bayes (1702–1761) was a
British Presbyterian country minister and amateur mathematician who had a
passing interest in what was called inverse probability. Bayes wrote a paper
on the subject, but it was never submitted for publication. He died without
anyone knowing of its existence. Richard Price, a friend of Bayes, discovered
the paper when going through Bayes’s personal effects. Realizing its impor-
tance, he managed to have it published in the Royal Society’s Philosophical
Transactions in 1764. The method was only accepted as a curiosity and was
largely forgotten until Pierre-Simon Laplace, generally recognized as the
leading mathematician worldwide during this period, discovered it several
decades later and began to employ its central thesis to problems of probability.
However, how Bayes’s inverse probability was employed during this time is
quite different from how analysts currently apply it to regression modeling. For
those who are interested in the origins of Bayesian thinking, and its relation-
ship to the development of probability and statistics in general, I recommend
reading Weisberg (2014) or McGrayne (2011).
Inverse probability is simple in theory. Suppose that we know from epide-
miological records that the probability of a person having certain symptoms S
given that they have disease D is 0.8. This relationship may be symbolized as
Pr(S|D) = 0.8. However, most physicians want to know the probability of having
the disease if a patient displays these symptoms, or Pr(D|S). In order to find this
out additional information is typically required. The idea is that under certain con-
ditions one may find the inverse probability of an event, usually with the additional
information. The notion of additional information is key to Bayesian methodology.
There are six foremost characteristic features that distinguish Bayesian
regression models from the traditional maximum likelihood models such as
logistic regression. Realize though that these features are simplifications. The
details are somewhat more complicated.
1. Regression Models Have Slope, Intercept, and Sigma Parameters: Each parameter has an associated prior.
2. Parameters Are Randomly Distributed: The regression parameters to be estimated are themselves randomly distributed. In traditional, or frequentist-based, logistic regression the estimated parameters are fixed. All main effects parameter estimates are based on the same underlying PDF.
3. Parameters May Have Different Distributions: In Bayesian logistic regression, each parameter is separate, and may be described using a different distribution.
4. Parameter Estimates As the Means of a Distribution: When estimating a Bayesian parameter an analyst develops a posterior distribution from the likelihood and prior distributions. The mean (or median, or mode) of a posterior distribution is regarded as the beta, parameter estimate, or Bayesian coefficient of the variable.
5. Credible Sets Used Instead of Confidence Intervals: Equal-tailed credible sets are usually defined as the outer 0.025 quantiles of the posterior distribution of a Bayesian parameter. Posterior intervals, or highest posterior density (HPD) regions, are used when the posterior is highly skewed or is bi- or multi-modal in shape. There is a 95% probability that the credible set or posterior interval contains the true parameter value. Confidence intervals are based on a frequency interpretation of statistics as defined in Chapter 2, Section 2.3.4.
6. Additional or Prior Information: The distribution used as the basis of a parameter estimate (the likelihood) can be mixed with additional information—information that we know about the variable or parameter that is independent of the data being used in the model. This is called a prior distribution. Priors are PDFs that add information from outside the data into the model.
The basic formula that defines a Bayesian model is:
$$ f(\theta \mid y) = \frac{f(y \mid \theta)\, f(\theta)}{f(y)} = \frac{f(y \mid \theta)\, f(\theta)}{\int f(y \mid \theta)\, f(\theta)\, d\theta} \tag{6.1} $$
where f(y|θ) is the likelihood function and f(θ) is the prior distribution. The
denominator, f(y), is the probability of y over all y. Note that the likelihood
and prior distributions are multiplied together. Usually the denominator,
which is the normalization term, drops out of the calculations so that the
posterior distribution or a model predictor is determined by the product of its
likelihood and prior. Again, each predictor can be comprised of a different
posterior.
If an analyst believes that there is no meaningful outside information that
bears on the predictor, a uniform prior will usually be given. When this hap-
pens the prior is not informative.
A prior having a normal distribution with a mean of 0 and very high vari-
ance will also produce a noninformative or diffuse prior. If all priors in the model are noninformative, the maximum likelihood results will be nearly identical to the Bayesian betas. In our first examples below we will use noninformative priors.
I should mention that priors are a way to provide a posterior distribution
with more information than is available in the data itself, as reflected in the
likelihood function. If a prior is weak it will not provide much additional infor-
mation and the posterior will not be much different than it would be with a
completely noninformative prior. In addition, what may serve as an influential
informative prior in a model with few observations may well be weak when
applied to data with a large number of observations.
It is important to remember that priors are not specific bits of information,
but are rather distributions with parameters which are combined with likeli-
hood distributions. A major difficulty most analysts have when employing a
prior in a Bayesian model is to specify the correct parameters of the prior that
describe the additional information being added to the model. Again, priors are multiplied with the likelihood (or, on the log scale, added to the log-likelihood) to form a posterior for each term in the regression.
There is much more that can be discussed about Bayesian modeling, in
particular Bayesian logistic modeling. But this would take us beyond the scope
we set for this book. I provide the reader with several suggested books on the
subject at the end of the chapter.
To see how Bayesian logistic regression works and is to be understood is
best accomplished through the use of examples. I will show an example using
R’s MCMCpack package (located on CRAN) followed by the modeling of the
same data using JAGS. JAGS is regarded by many in the area as one of the
most powerful, if not the most powerful, Bayesian modeling package. It was
developed from WinBUGS and OpenBUGS and uses much of the same nota-
tion. However, it has more built-in functions and more capabilities than do the
BUGS packages. BUGS is an acronym for "Bayesian inference Using Gibbs Sampling" and was developed by the Medical Research Council (MRC) Biostatistics Unit in Cambridge, UK.
docvis : The number of visits made to a physician during the year, from 0
to 121.
female : 1 = female; 0 = male.
kids : 1 = has children; 0 = no children.
age : age, from 25 to 64.
The data are first loaded and renamed R84. We shall view the data, including other variables in the data set.
> library(MCMCpack)
> library(LOGIT)
> data(rwm1984)
> R84 <- rwm1984
# DATA PROFILE
> head(R84)
docvis hospvis edlevel age outwork female married kids hhninc educ self
1 1 0 3 54 0 0 1 0 3.050 15.0 0
2 0 0 1 44 1 1 1 0 3.050 9.0 0
3 0 0 1 58 1 1 0 0 1.434 11.0 0
4 7 2 1 64 0 0 0 0 1.500 10.5 0
5 6 0 3 30 1 0 0 0 2.400 13.0 0
6 9 0 3 26 1 0 0 0 1.050 13.0 0
edlevel1 edlevel2 edlevel3 edlevel4
1 0 0 1 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
5 0 0 1 0
6 0 0 1 0
> dim(R84)
[1] 3874 15
The response variable, outwork, has 1420 1s and 2454 0s, giving a ratio of 1s to 0s of 0.5786 (and a mean of 0.3666).
> table(R84$outwork)
0 1
2454 1420
> summary(R84$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
25 35 44 44 54 64
We shall first model the data based on a standard logistic regression, and
then by a logistic regression with the standard errors scaled by the square root
of the Pearson dispersion. The scaled logistic model, as discussed in the previ-
ous chapter, is sometimes referred to as a “quasibinomial” model. We model
both to determine if there is extra variability in the data that may require
adjustments. The tables of coefficients for each model are not displayed below,
but are stored in myg and myq, respectively. I shall use the toOR function to
display the odds ratios and associated statistics of both models in close prox-
imity. The analyst should inspect the delta (SEs) values to determine if they
differ from each other by much. If they do, then there is variability in the data.
A scaled logistic model, or other adjusted models, should be used on the data,
including a Bayesian model. Which model we use depends on what we think is
the source of the extra correlation.
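The two fits themselves are not reproduced in this excerpt. A sketch of the calls that the toOR output below implies (cdoc and cage are assumed to be centered versions of docvis and age; the object names myg and myq match the text):

> R84$cdoc <- R84$docvis - mean(R84$docvis)
> R84$cage <- R84$age - mean(R84$age)
> myg <- glm(outwork ~ cdoc + female + kids + cage,
+            family = binomial, data = R84)
> myq <- glm(outwork ~ cdoc + female + kids + cage,
+            family = quasibinomial, data = R84)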
> toOR(myq)
or delta zscore pvalue exp.loci. exp.upci.
(Intercept) 0.1340 0.0113 -23.7796 0e+00 0.1135 0.1581
cdoc 1.0247 0.0067 3.7420 2e-04 1.0117 1.0379
female 9.5525 0.8242 26.1560 0e+00 8.0663 11.3126
kids 1.4304 0.1342 3.8168 1e-04 1.1902 1.7191
cage 1.0559 0.0046 12.5413 0e+00 1.0469 1.0649
A comparison of the standard errors of the two models shows that there is
not much extra variability in the data. The standard errors are nearly the same.
No adjustments need to be made to the model. However, for pedagogical purposes
we shall subject the data to a Bayesian logistic regression.
Recall from Chapter 3, Section 3.4.1 that the quasibinomial “option” in
R’s glm function produces mistaken confidence intervals. Our toOR function
corrects this problem for odds ratios. Taking the log of those intervals yields
correct scaled confidence intervals on the coefficient scale.
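A minimal sketch of that correction, computed directly from the quasibinomial fit myq; this is equivalent to logging the odds-ratio intervals reported by toOR.
# Wald intervals on the coefficient scale using the dispersion-scaled
# standard errors of the quasibinomial fit
est <- coef(myq)
se  <- sqrt(diag(vcov(myq)))   # vcov() already includes the Pearson dispersion
round(cbind(lower = est - 1.96 * se, upper = est + 1.96 * se), 4)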
We use the MCMCpack package, which has the MCMClogit function for
estimating Bayesian logistic models. The algorithms in MCMCpack employ a
random walk version of the Metropolis-Hastings algorithm when estimating a
logistic model. MCMC is an acronym for Markov chain Monte Carlo, a class of
sampling algorithms used to determine the mean, standard deviation, and
quantiles of the distribution from which the data to be modeled are
theoretically derived or, at least, best described. A variety of algorithms
employed by Bayesians are based on MCMC, for example, Metropolis-Hastings and
Gibbs sampling.
For our example I shall employ the default multivariate normal prior
on all of the parameters; it is used because we have more than one parameter
in the model.
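A call along the following lines produces the fit stored in mymc. This is a sketch rather than the book's exact code; the burnin and mcmc values match the description that follows, and the formula matches the model used throughout.
mymc <- MCMClogit(outwork ~ cdoc + female + kids + cage,
                  data = R84,
                  burnin = 5000,    # discard the first 5000 samples
                  mcmc = 100000)    # keep the next 100,000 samples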
burnin is used to tell the algorithm how many of the initial samples should be
discarded before beginning to construct a posterior distribution, from which
the mean, standard deviation, and quantiles are derived. mcmc specifies how
many samples are to be used in the estimation of the posterior. We discard the
first 5000 iterations and keep the next 100,000.
Options often used in the model are b0 and B0, which represent the mean and
precision of the prior(s). The precision is defined as the inverse of the
variance, so B0 is typically written as the inverse of a prior variance. Since we
used the default prior of b0 = 0 and B0 = 0 here, assigning values to b0
and B0 was not required. We could have used b0 = 0 and B0 = 0.00001 as
well, for a mean of 0 and an extremely large variance, which means that
nothing specific is being added to the model. The priors are noninformative,
and therefore do not appreciably influence the model. That is, the data,
or rather the likelihood, is the prime influence on the parameter estimates, not
the priors. An analyst may also use the user.prior.density option to
define their own priors.
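As a sketch of that option (assumed usage, not taken from the book), the following supplies independent normal priors with mean 0 and standard deviation 10 as a user-defined log-density; logfun = TRUE indicates that the function returns the log of the prior density.
# User-defined log-prior: independent normal(0, sd = 10) on all coefficients
logprior <- function(beta) {
  sum(dnorm(beta, mean = 0, sd = 10, log = TRUE))
}
mymc2 <- MCMClogit(outwork ~ cdoc + female + kids + cage, data = R84,
                   burnin = 5000, mcmc = 100000,
                   user.prior.density = logprior, logfun = TRUE)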
The output is given as usual:
> summary(mymc)
Iterations = 5001:105000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 1e+05
Compare the output above for the noninformative prior with SAS output
on the same data and model. The results are remarkably similar.
POSTERIOR SUMMARIES
                                  STANDARD             PERCENTILES
PARAMETER    N          MEAN      DEVIATION     25%        50%        75%
Intercept    100,000    −2.0140   0.0815      −2.0686    −2.0134    −1.9586
Cdoc         100,000     0.0247   0.00632      0.0204     0.0246     0.0289
Female       100,000     2.2605   0.0832       2.2043     2.2602     2.3166
Kids         100,000     0.3596   0.0907       0.2981     0.3590     0.4207
Cage         100,000     0.0545   0.00418      0.0516     0.0545     0.0573

POSTERIOR INTERVALS
PARAMETER    ALPHA    EQUAL-TAIL INTERVAL       HPD INTERVAL
Intercept    0.050    (−2.1755, −1.8557)        (−2.1710, −1.8520)
Cdoc         0.050    ( 0.0124,  0.0373)        ( 0.0124,  0.0373)
Female       0.050    ( 2.0989,  2.4242)        ( 2.0971,  2.4220)
Kids         0.050    ( 0.1831,  0.5382)        ( 0.1838,  0.5386)
Cage         0.050    ( 0.0463,  0.0628)        ( 0.0464,  0.0628)
[FIGURE 6.1: MCMC trace plots (left) and posterior density plots (right) for the model parameters; N = 100,000 iterations per parameter.]
The density plots on the right side of Figure 6.1 display the distributions of
each parameter in the model. The peak of each distribution is at the point that
defines the parameter’s mean. The intercept therefore is about −2.0, the mean
for centered docvis (cdoc) is about 0.025, and that for centered age (cage) is
about 0.055. The trace plots on the left side of Figure 6.1 show time series
plots across all iterations. We are looking for convergence of the estimation
to a single value. When a plot stabilizes, without excessive movement up and
down the y axis, convergence has been achieved. There appears to be no
abnormality in the sampling draws made by the MCMC algorithm in any of the
trace plots, which is what we want to observe. If there were breaks in a trace,
or places where clumps are observed, we would conclude that the sampling
process is not working well.
> geweke.diag(mymc)
JAGS can be used to estimate a large number of different models. Of course, our
example will show its use in creating a Bayesian logistic model.
First, make sure you have installed JAGS to your computer. It is freeware,
as is R. JAGS is similar to WinBUGS and OpenBUGS, which can also be run
as standalone packages or within the R environment. JAGS is often preferred
by those in the hard sciences such as physics, astronomy, ecology, and biology
since it is command-line driven and written in C++ for speed. WinBUGS and
OpenBUGS are written in Pascal, which tends to run slower than C++
implementations, but they can be run within the standalone WinBUGS or
OpenBUGS environments, which include menus, help files, and so forth; the
BUGS programs are therefore more user-friendly. Both OpenBUGS and JAGS are also
able to run on a variety of platforms, which is advantageous to many users. In
fact, WinBUGS is no longer being developed or supported; the developers are
putting all of their attention into OpenBUGS. Lastly, and this is what I like about it,
when JAGS is run from within R, the program actually appears as if it is just
another R package. I do not feel as if I am using an outside program.
To start, it is necessary to have JAGS installed where R can find it, and the
R2jags package needs to be installed and loaded. For the first JAGS example you
should also bring two functions contained in jhbayes.R into memory using the
source function.
> library(R2jags)
> source("c://Rfiles/jhbayes.R")   # or where you store R files; book's website
The code in Table 6.1 is specific to the model we have been working with
in the previous section. However, as you can see, it is easily adaptable for other
logistic models. With a change in the log-likelihood, it can also be used with
other distributions and can be further amended to incorporate random effects,
mixed effects, and a host of other models.
Let us walk through the code in Table 6.1. Doing so will make it much
easier for you to use it for other modeling situations.
The top two lines
X <- model.matrix(~ cdoc + female + kids + cage,
data = R84)
K <- ncol(X)
create a matrix of predictors, X, from the data frame R84, and a variable, K,
which contains the number of columns of X. A column of 1s for the intercept is
also generated by model.matrix().
The next code segment is logit.data, although we may call it anything
we wish. logit.data is a list of the components to be passed to the JAGS model
we are about to estimate.
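A sketch of logit.data, consistent with the quantities the model code below refers to (Y, X, N, K, b0, B0, and LogN), might be:
logit.data <- list(
  Y = R84$outwork,         # binary response
  X = X,                   # predictor matrix, including the intercept column
  N = nrow(R84),           # number of observations
  K = K,                   # number of columns of X
  b0 = rep(0, K),          # prior means for beta
  B0 = diag(0.00001, K),   # prior precisions: variance 100,000, noninformative
  LogN = log(nrow(R84))    # log(N), used in the BIC calculation
)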
sink("LOGIT.txt")   # send the following model code to the file LOGIT.txt
cat("
model{
# Priors
beta ~ dmnorm(b0[], B0[,])
# Likelihood
for (i in 1:N){
Y[i] ~ dbern(p[i])
logit(p[i]) <- max(-20, min(20, eta[i]))
eta[i] <- inprod(beta[], X[i,])
LLi[i] <- Y[i] * log(p[i]) +
(1 - Y[i]) * log(1 - p[i])
}
LogL <- sum(LLi[1:N])
AIC <- -2 * LogL + 2 * K
BIC <- -2 * LogL + LogN * K
}
", fill = TRUE)
sink()
# JAGs
J0 <- jags(data = logit.data,
inits = inits,
parameters = params,
model.file = "LOGIT.txt",
n.thin = 10,
n.chains = 3,
n.burnin = 40000,
n.iter = 50000)
# OUTPUT DISPLAYED
out <- J0$BUGSoutput
myB <- MyBUGSOutput(out, c(uNames("beta", K), "LogL", "AIC", "BIC"))
round(myB, 4)
The sink() and cat() combination at the top of the table puts everything
between the quotation marks, including the model braces, { }, into a text
file called LOGIT.txt.
Priors and the likelihood function are defined within the model braces:
model{
We start by defining the priors. The betas are given a joint multivariate
normal prior. The values supplied for b0 and B0 in logit.data are passed as
the arguments of dmnorm().
beta ~ dmnorm(b0[], B0[,])
If we wanted a uniform prior for each of the coefficients instead, each
coefficient would be given its own univariate prior with dunif(-20, 20) on the
right-hand side.
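Because dunif() is a univariate distribution, each coefficient needs its own line inside the model block; a sketch of how that would look in the JAGS model code:
for (k in 1:K){
  beta[k] ~ dunif(-20, 20)   # independent uniform prior on each coefficient
}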
The following code segment defines the likelihood. This is a crucial seg-
ment. The likelihood is calculated across all observations in the model; that
is, from 1 to N.
The first line in the for-loop specifies each observation to be Bernoulli
distributed; that is, this is a logistic regression. The next two lines provide
the logit link and eta, the linear predictor, which is formed as the inner
product (inprod) of the beta vector and the row X[i,]. The final line within
the loop is the observation-level Bernoulli log-likelihood. The sum of the
observation log-likelihood values produces the model log-likelihood statistic, LogL.
for (i in 1:N){
Y[i] ~ dbern(p[i])
logit(p[i]) <- max(-20, min(20, eta[i]))
eta[i] <- inprod(beta[], X[i,])
LLi[i] <- Y[i] * log(p[i]) +
(1 - Y[i]) * log(1 - p[i])
}
LogL <- sum(LLi[1:N])
The Akaike and Bayesian information criteria (AIC and BIC) statistics
are then calculated, and the model braces close. The fill = TRUE argument
writes the text with line breaks, and the closing sink() call ends the
redirection, saving LOGIT.txt to the working directory.
The inits segment formally defines the initial parameter values, which
are all defined as normally distributed terms with a mean of 0 and variance of
10 (the precision, 1/V, is 0.1). The term params lists the quantities to be
monitored: the coefficients, the log-likelihood, and the AIC and BIC statistics.
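A sketch of the two segments, assuming the form just described (normal starting values with mean 0 and variance 10, that is, precision 0.1):
inits <- function() {
  list(beta = rnorm(K, mean = 0, sd = sqrt(10)))  # variance 10 => precision 0.1
}
params <- c("beta", "LogL", "AIC", "BIC")         # quantities to monitor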
The segment J0 is the call to the jags function, containing the values and
settings we just defined. The JAGS algorithm uses the following values to
define the manner in which MCMC sampling occurs; this call is the core of
the JAGS run.
Terms we have not yet defined include n.thin, which here means that sampling
keeps every 10th value from the MCMC Gibbs sampler, discarding the others.
This is done because successive samples may be autocorrelated; thinning is an
attempt to increase sampling efficiency, and keeping one of every 10 samples
helps reduce the correlation between retained draws. n.chains specifies how
many separate sampling chains are run. The chains are mixed, which assists in
obtaining a distribution that properly characterizes the data. Here we specify
that three chains are to be run. n.burnin indicates how many
sampling values are discarded before values are kept for the posterior
distribution. The initial values can vary widely and skew the results; if all
of the early values were kept, the mean of the posterior distribution could be
severely biased. Discarding a sizeable number of early values helps ensure a
better posterior. Finally, n.iter specifies the total number of iterations per
chain; the values kept for the posterior are those remaining after the burn-in
values are discarded and thinning is applied. With n.iter = 50000, n.burnin =
40000, and n.thin = 10, each chain contributes (50,000 - 40,000)/10 = 1,000
draws, or 3,000 posterior samples across the three chains.
J0 <- jags(data = logit.data,
inits = inits,
parameters = params,
model.file = "LOGIT.txt",
n.thin = 10,
n.chains = 3,
n.burnin = 40000,
n.iter = 50000)
After running the jags function, whose result we have called J0, typing J0 on
the R command line will display the raw model results. The final code in Table
6.1 provides nicer looking output. The source code in jhbayes.R is relevant at
this point: jhbayes.R consists of two small functions taken from the Zuur
support package, MCMCSupportHighstat.R, which comes with Zuur, Hilbe, and Ieno
(2013) and is available for other books by Zuur as well. The posterior means,
or betas, the log-likelihood, and the AIC and BIC statistics are displayed,
together with their standard errors and the 2.5% and 97.5% bounds of the
credible set. We specified that only four decimal digits are displayed.
BUGSoutput is a component of the object returned by R2jags, while MyBUGSOutput
and uNames come from the sourced jhbayes.R file:
out <- J0$BUGSoutput
myB <- MyBUGSOutput(out, c(uNames("beta", K),
                           "LogL", "AIC", "BIC"))
round(myB, 4)
The Bayesian logistic model results are listed in the table below.
> round(myB, 4)
mean se 2.5% 97.5%
beta[1] -2.0193 0.0824 -2.1760 -1.8609
beta[2] 0.0245 0.0063 0.0128 0.0370
beta[3] 2.2569 0.0843 2.0922 2.4216
beta[4] 0.3685 0.0904 0.1920 0.5415
beta[5] 0.0545 0.0042 0.0466 0.0626
LogL -1961.6258 1.5178 -1965.4037 -1959.5816
AIC 3933.2517 3.0357 3929.1632 3940.8074
BIC 3964.5619 3.0357 3960.4734 3972.1176
Compare the above statistics with the summary table of myg, the model
estimated using the glm function. Note that the AIC values are nearly
identical. This output also closely matches the SAS results, estimated using
noninformative priors, displayed earlier.
> summary(myg)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.010276 0.081087 -24.792 < 2e-16 ***
cdoc 0.024432 0.006263 3.901 9.57e-05 ***
female 2.256804 0.082760 27.269 < 2e-16 ***
kids 0.357976 0.089962 3.979 6.92e-05 ***
cage 0.054379 0.004159 13.075 < 2e-16 ***
---
Null deviance: 5091.1 on 3873 degrees of freedom
Residual deviance: 3918.2 on 3869 degrees of freedom
AIC: 3928.2
The example above did not employ an informative prior. For instance,
we could have provided information that reflected our knowledge that docvis
has between 40% and 50% zero counts. We compounded the problem since
docvis was centered, becoming cdoc. The centered values for when docvis = 0
are −3.162881. They are −2.162881 when docvis = 1. We can therefore set up a
prior that we expect 40%–50% zero counts when cdoc is less than −3.
will be between 0.020 and 0.030. Priors are expressed in terms of probability
functions, usually the normal, lognormal, beta, binomial, Bernoulli, Cauchy,
t, gamma, inverse gamma, Poisson, Poisson-gamma, and negative binomial.
The same prior may be set on one or more parameters, and different priors
may be set for separate parameters. Each software package specifies how this
should be coded.
The example below employs a Cauchy prior on all three parameters; that
is, the intercept and the coefficients on cdoc and cage.
beta.0 ~ dt(0, 1/(2.5^2), 1)
beta.1 ~ dt(0, 1/(2.5^2), 1)
beta.2 ~ dt(0, 1/(2.5^2), 1)
where 1/(2.5^2) is equal to 0.16, the precision corresponding to a scale of
2.5. For those readers who have taken a course in probability, recall that the
Cauchy is a Student’s t distribution with one degree of freedom; dt(0,
1/(2.5^2), 1) therefore specifies a Cauchy prior with location 0 and scale 2.5.
Perhaps the normal might be preferable for the intercept; the reader may want
to check whether this is indeed the case (Table 6.2). The code in Table 6.2,
presented in a slightly different manner from Table 6.1, can be used for a wide
variety of models. The output below does not use the MyBUGSOutput function
that produces nicely formatted results.
# load the contents of Table 6.2 into memory prior to running summary() below
> summary(codasamples)
Iterations = 41001:91000
Thinning interval = 1
Number of chains = 3 # <= note that 3 chains are used
Sample size per chain = 50000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
#2. Likelihood
for (i in 1:N){
Y[i] ~ dbern(p[i])
logit(p[i]) <- max(-20, min(20, eta[i]))
eta[i] <- beta.0 + beta.1*cdoc[i] + beta.2*cage[i]
}
"
# INITIAL VALUES - BETAS AND SIGMAS
inits <- function() {
list(
beta.0 = 0.1, beta.1 = 0.1, beta.2 = 0.1
) }
params <- c("beta.0", "beta.1", "beta.2", "LogL", "AIC", "BIC")
# JAGs
J0 <- jags.model(data = logit.data,
inits = inits,
textConnection(GLM.txt),
n.chains = 3,
n.adapt=1000)
update(J0, 40000)
codasamples <- coda.samples(J0, params, n.iter = 50000)
summary(codasamples)
#1. Priors
beta.0 ~ dnorm(0, 0.00001)
beta.1 ~ dnorm(0, 0.00001)
beta.2 ~ dnorm(0, 0.00001)
Notice that the values of the distributional means for each parameter—
intercept, cdoc, and cage—differ, as do other associated statistics. The prior
has indeed changed the model. What this means is that we can provide our
model with a substantial amount of additional information about the predic-
tors used in our logistic model. Generally speaking, it is advisable to have a
prior that is distributionally compatible with the distribution of the parameter
having the prior. The subject is central to Bayesian modeling, but it takes us
beyond the level of this book. My recommendations for taking the next step in
Bayesian modeling include Zuur et al. (2013), Cowles (2013), and Lunn et al.
(2013). More advanced but thorough texts are Christensen et al. (2011) and
Gelman et al. (2014). There are many other excellent texts as well. I should
also mention that Hilbe et al. (2016) will provide a clear analysis of Bayesian
modeling as applied to astronomical data.
SAS CODE
/* Section 6.2 */
*Refer to the code in section 1.4 to import and print rwm1984 dataset;
*Refer to proc freq in section 2.4 to generate the frequency table;
*Summary for continuous variables;
proc means data=rwm1984 min q1 median mean q3 max maxdec=3;
var docvis age;
output out=center mean=;
run;
*Build the logistic model and obtain odds ratio & statistics;
proc genmod data=R84 descending;
model outwork=cdoc female kids cage / dist=binomial link=logit;
estimate “Intercept” Intercept 1 / exp;
estimate “Cdoc” cdoc 1 / exp;
estimate “Female” female 1 / exp;
estimate “Kids” kids 1 / exp;
estimate “Cage” cage 1 / exp;
run;
*Refer to proc iml in section 2.3 and the full code is provided
online;
STATA CODE
. use rwm1984
. center docvis, pre(c)
. rename cdocvis cdoc
. center age, pre(c)
. sum cdoc cage
* Logistic regression: standard and scaled
. glm outwork cdoc female kids cage, fam(bin) eform nolog
. glm outwork cdoc female kids cage, fam(bin) eform scale(x2) nolog
* Non-informative priors, normal(0, 100000)
Equal-tailed
outwork Mean Std. Dev. MCSE Median [95% Cred. Interval]
CONCLUDING COMMENTS
This book is intended as a guidebook to help analysts develop and execute
well-fitted logistic models. In reviewing it now that it is finished, the book can
also be regarded as an excellent way for an analyst to learn R, as well as SAS
and Stata as applied to developing logistic models and associated tests and
data management tasks related to statistical modeling. Several functions
introduced in this book are new to R; they were written to assist
the analyst in producing and testing logistic models. I will frequently use these
functions in my own future logistic modeling endeavors.
I mentioned in the book that when copying code from one electronic for-
mat to another, characters such as quotation marks and minus signs can result
in errors. Even copying code from my own saved Word and PDF documents
to R’s editor caused problems. Many times I had to retype quotation marks,
minus signs, and several other symbols in order for R to run properly. I also
should advise you that when in the R editor, it may be wise to “run” long
stretches of code in segments. That is, rather than select the entire program
code, select and run segments of it. I have had students, and those who have
purchased books of mine that include R code, email me that they cannot run
the code. I advise them to run it in segments. Nearly always they email back
that they now have no problems. Of course, at times in the past there have
indeed been errors in the code, but know that the code in this book has all been
successfully run multiple times. Make sure that the proper libraries and data
have been installed and loaded before executing code.
There is a lot of information in the book. However, I did not discuss issues
such as missing values, survey analysis, validation, endogeneity, and latent
class models. These are left for my comprehensive book titled Logistic
Regression Models (2009, Chapman & Hall), which is over 650 pages in length. A
forthcoming second edition will include both Stata and R code in the text, with SAS
code as it is with this book. Bayesian logistic regression will be more thor-
oughly examined, with Bayesian analysis of grouped, ordered, multinomial,
hierarchical, and other related models addressed.
I primarily wrote this book to go with a month-long web-based course
I teach with Statistics.com. I have taught the course with them since 2003,
three classes a year, and continually get questions and feedback from research-
ers, analysts, and professors from around the world. I have also taught logistic
regression and given workshops on it for over a quarter of a century. In this book,
I have tried to address the most frequent concerns and problem areas that prac-
ticing analysts have informed me about. I feel confident that anyone reading
carefully through this relatively brief monograph will come away from it with
a solid knowledge of how to use logistic regression—both observation based
and grouped. For those who wish to learn more after going through this book,
I recommend my Logistic Regression Models (2009, 2016 in preparation). I
also recommend Bilder and Loughin (2015), which uses R code for exam-
ples, Collett (2003), Dohoo et al. (2012), and for nicely written shorter books
dealing with the logistic regression and GLM in general, Dobson and Barnett
(2008), Hardin and Hilbe (2013), and Smithson and Merkle (2014). Hosmer
et al. (2013) is also a fine reference book on the subject, but there is no code
provided with the book. The other recommended books have code to support
examples, which I very much believe assists the learning process.
I invite readers of this book to email me their comments and suggestions
about it. The book’s web page, works.bepress.com/joseph_hilbe/, has the data
sets used in the book in various formats, and all of the code used in the book
in electronic format. Both SAS and Stata code and output are also provided.
References
Bilder, C.R. and Loughin, T.M. 2015. Analysis of Categorical Data with R. Boca Raton,
FL: Chapman & Hall/CRC.
Christensen, R., Johnson, W., Branscum, A. and Hanson, T.E. 2011. Bayesian Ideas and
Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.
Collett, D. 2003. Modeling Binary Data, 2nd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Cowles, M.K. 2013. Applied Bayesian Statistics. New York, NY: Springer.
De Souza, R.S., Cameron, E., Killedar, M., Hilbe, J., Vilalta, R., Maio, U., Biffi, V., Riggs,
J.D. and Ciardi, B., for the COIN Collaboration. 2015. The overlooked potential
of generalized linear models in astronomy—I: Binomial regression and numeri-
cal simulations, Astronomy & Computing, DOI: 10.1016/j.ascom.2015.04.002.
Dobson, A.J. and Barnett, A.G. 2008. An Introduction to Generalized Linear Models,
3rd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Dohoo, I., Martin, W. and Stryhn, H. 2012. Methods in Epidemiological Research.
Charlottetown, PEI, CA: VER.
Firth, D. 1993. Bias reduction of maximum likelihood estimates, Biometrika 80, 27–38.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. 2014.
Bayesian Data Analysis, 3rd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Geweke, J. 1992. Evaluating the accuracy of sampling-based approaches to calculating
posterior moments. In Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M.
(eds.), Bayesian Statistics, 4th Edn. Oxford, UK: Clarendon Press.
Hardin, J.W. and Hilbe, J.M. 2007. Generalized Linear Models and Extensions, 2nd
edition, College Station, TX: Stata Press.
Hardin, J.W. and Hilbe, J.M. 2013. Generalized Linear Models and Extensions, 3rd
Edn., College Station, TX: Stata Press/CRC (4th edition due out in late 2015 or
early 2016).
Hardin, J. W. and Hilbe, J.M. 2014. Estimation and testing of binomial and beta-binomial
regression models with and without zero inflation, Stata Journal 14(2): 292–303.
Heinze, G. and Schemper, M. 2002. A solution to the problem of separation in logistic
regression. Statistics in Medicine 21, 2409–2419.
Hilbe, J.M. 2009. Logistic Regression Models. Boca Raton, FL: Chapman & Hall/CRC.
Hilbe, J.M. 2011. Negative Binomial Regression, 2nd Edn. Cambridge, UK: Cambridge
University Press.
Hilbe, J.M. 2014. Modeling Count Data. New York, NY: Cambridge University Press.
Hilbe, J.M. and Robinson, A.P. 2013. Methods of Statistical Model Estimation. Boca
Raton, FL: Chapman & Hall/CRC.
Hilbe, J.M., de Souza, R.S. and Ishida, E. 2016. Bayesian Models for Astrophysical
Data: Using R/JAGS and Python/Stan. Cambridge, UK: Cambridge University
Press.
Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. 2013. Applied Logistic Regression,
3rd Edn. Hoboken, NJ: Wiley.
Lunn, D., Jackson, C., Best, N., Thomas, A. and Spiegelhalter, D. 2013. The BUGS
Book. Boca Raton, FL: Chapman & Hall/CRC.
McGrayne, S.B. 2011. The Theory That Would Not Die. New Haven, CT: Yale University
Press.
Morel, G. and Neerchal, N.K. 2012. Overdispersion Models in SAS. Cary, NC: SAS
Publishing.
Rigby, R.A. and Stasinopoulos, D.M. 2005. Generalized additive models for location,
scale and shape, (with discussion). JRSS Applied Statistics 54: 507–554.
Smithson, M. and Merkle, E.C. 2014. Generalized Linear Models for Categorical and
Continuous Limited Dependent Variables. Boca Raton, FL: Chapman & Hall/
CRC.
Weisberg, H.I. 2014. Willful Ignorance. Hoboken, NJ: Wiley.
Youden, W.J. 1950. Index for rating diagnostic tests. Cancer 3: 32–35.
Zuur, A.F. 2012. A Beginner’s Guide to Generalized Additive Models with R. Newburgh,
UK: Highland Statistics.
Zuur, A.F., Hilbe, J.M. and Ieno, E.M. 2013. A Beginner’s Guide to GLM and GLMM
with R: A Frequentist and Bayesian Perspective for Ecologists. Newburgh, UK:
Highland Statistics.
Statistics
Practical Guide to Logistic Regression covers the key points of the basic
logistic regression model and illustrates how to use it properly to model a binary
response variable. This powerful methodology can be used to analyze data from
various fields, including medical and health outcomes research, business analytics
and data science, ecology, fisheries, astronomy, transportation, insurance,
economics, recreation, and sports. By harnessing the capabilities of the logistic
model, analysts can better understand their data, make appropriate predictions
and classifications, and determine the odds of one value of a predictor compared
to another.
Drawing on his many years of teaching logistic regression, using logistic-based
models in research, and writing about the subject, the author focuses on the
most important features of the logistic model. He explains how to construct a
logistic model, interpret coefficients and odds ratios, predict probabilities and
their standard errors based on the model, and evaluate the model as to its fit.
Using a variety of real data examples, mostly from health outcomes, the author
offers a basic step-by-step guide to developing and interpreting observation and
grouped logistic models as well as penalized and exact logistic regression. He
also gives a step-by-step guide to modeling Bayesian logistic regression.
R statistical software is used throughout the book to display the statistical models
while SAS and Stata codes for all examples are included at the end of each
chapter. The example code can be adapted to your own analyses. All the code is
also available on the author’s web site.
Features
• Gives practical guidance on constructing, modeling, interpreting, and
evaluating binary response data using logistic regression
• Explores solutions to common stumbling blocks when using logistic
regression to model data
• Compares Bayesian logistic regression to the traditional frequentist
approach, with R, JAGS, Stata, and SAS codes provided for example
Bayesian logistic models
• Includes complete Stata, SAS, and R codes in the text and on the author’s
website, enabling you to adapt the code as needed and thus make your
modeling tasks easier and more productive
• Provides new R functions and data in the LOGIT package on CRAN