Applied Missing Data Analysis, 2nd Edition
This series provides applied researchers and students with analysis and research design books that
emphasize the use of methods to answer research questions. Rather than emphasizing statistical
theory, each volume in the series illustrates when a technique should (and should not) be used and
how the output from available software programs should (and should not) be interpreted. Common
pitfalls as well as areas of further development are clearly articulated.
Craig K. Enders
People who know me, know how much I love missing data. When Craig Enders agreed
to do the first edition of this book, I was elated. Now that the second edition is here
with so much new trailblazing material, I’m simply tickled pink. This second edition
is like a biblical tome for researchers in any discipline where missing data arise. Craig
Enders is a rock star on the scientific stage, and I’m right there in the proverbial mosh
pit, grooving to every word and idea he presents in this edition. He’s encapsulated new
ideas (e.g., factored regression specifications, multilevel missing data methods, sensitiv-
ity analyses) and expanded on the “analytic pillars” (as he calls them) such as factored
regression specifications for maximum likelihood approaches and three whole chapters
dedicated to Bayesian estimation. This new edition is hands down the most comprehen-
sive, practical, and accessible book devoted to missing data.
As I wrote in the Series Editor’s Note in the first edition, missing data can be a real
bane to researchers across all social science disciplines. For most of our scientific his-
tory, we have approached missing data much like a doctor from the ancient world might
have used bloodletting to cure disease or amputation to stem infection (e.g., removing
the infected parts of one’s data by using listwise or pairwise deletion). My metaphor
should make you feel a bit squeamish, just as you should feel if you see published papers
that dealt with missing data using the antediluvian and ill-advised approaches of old.
When Craig ushered us into the age of modern missing data treatments in the first edi-
tion, I’d hoped we’d see most researchers embrace the modern treatments for missing
data. At the time, Craig captured what we knew then and presented it to us in a refresh-
ing pedagogical manner.
The field of missing data has advanced probably more than any other quantitative
topic area. In the second edition, Craig again captures what we know now and brings
it to us in the most accessible way. As before, he demystifies the arcane discussions of
missing data mechanisms and their labels (e.g., MNAR) and the esoteric acronyms of the
various techniques used to address them (e.g., FIML, MCMC, and the like).
Todd D. Little
Still virtually circumnavigating the world from home
Lubbock, Texas
Preface
Thank you for investigating the second edition of Applied Missing Data Analysis. This is
a brand-new book, written from the ground up with only a small handful of paragraphs
carrying over from the original. In fact, a lot has changed in the missing data world,
and the tools we have at our disposal today represent important leaps forward from
the now-classic methods described in the first edition. A major overhaul was needed
to put these recent innovations front and center. Consistent with the first edition, my
overarching goal for this second edition was to translate the missing data literature into
a comprehensive, accessible resource that serves both substantive researchers who use
sophisticated quantitative methods in their work and quantitative specialists. I
hope you enjoy the new edition and find it useful in your work.
Rewinding to 2007, when I began writing the first edition of Applied Missing
Data Analysis, missing data-handling methods were more primitive than they are today.
For researchers in the social and behavioral sciences, commercial structural equation
modeling programs offered full information maximum likelihood estimation, but this
option was limited to multivariate normal data (Arbuckle, 1996). The predominant mul-
tiple imputation approach at the time—joint model imputation (Schafer, 1997)—was
similarly based on the multivariate normal distribution, and fully conditional speci-
fication imputation for mixed variable types was new to the scene (van Buuren, 2007;
van Buuren & Groothuis-Oudshoorn, 2011). The multivariate normal distribution was
(and still is) a flexible missing data-handling tool, but applying normal curve-based
approaches to real data often required compromises. A few such examples include
imputing categorical variables as though they were normal, then rounding imputes to
achieve discrete values; dummy coding level-2 units to preserve clustering effects with
multilevel missing data; and treating incomplete interaction terms as independent, nor-
mally distributed variables. The list goes on.
Cut to today, and missing data analyses rarely require meaningful compromises.
Perhaps the biggest innovation since the first edition of this book has been the develop-
11 • Wrap-Up
11.1 Chapter Overview
11.2 Choosing a Missing Data-Handling Procedure
11.3 Software Landscape
11.4 Reporting Results from a Missing Data Analysis
11.5 Final Thoughts and Recommended Readings
References
It goes without saying that missing data are a pervasive interdisciplinary problem. Not
surprisingly, how we deal with the issue can have a major impact on the validity of sta-
tistical inferences and the substantive conclusions from a data analysis. In a highly cited
paper nearly 20 years ago, Schafer and Graham (2002) described maximum likelihood
estimation and Bayesian multiple imputation as “state-of-the-art” missing data-handling
procedures. A lot has changed since then, and these approaches are now considerably
more mature and far more capable than they were at the time. The Bayesian paradigm has
simultaneously gained in popularity and is now an important alternative to maximum
likelihood and multiple imputation rather than an estimation method co-opted for the
latter. This trio of contemporary analytic approaches forms the core of the book, which
I’ve rewritten from the ground up to showcase new developments and applications.
Modern missing data-handling procedures have a lot to offer, but we need to under-
stand when and why they work. The first half of this chapter sets the stage with a sum-
mary of Rubin and colleagues’ theoretical framework for missing data problems (Little &
Rubin, 1987, 2020; Mealli & Rubin, 2016; Rubin, 1976). This nearly universal classifica-
tion system comprises three missing data mechanisms or processes that describe differ-
ent ways in which the probability of missing values relates to the data. From a practical
perspective, Rubin’s mechanisms function as data analysis assumptions that dictate the
validity of our statistical inferences. As you will see, these assumptions involve mostly
untestable propositions, although we can take steps to make certain conditions more
plausible. This includes leveraging additional variables that carry information about the
missing values but are not part of the main analysis plan.
The middle section of the chapter describes a small selection of older missing data-
handling methods. Methodologists have been studying missing data problems for the
better part of a century, and the statistical literature is replete with potential solutions,
most of which are historical footnotes. Researchers are now broadly aware that bet-
ter options are available, so I limit this section to a small collection of strategies you
may still encounter in published research articles or statistical software packages. I use
computer simulation studies to highlight the shortcomings of these methods relative to
modern approaches such as maximum likelihood estimation.
The chapter concludes with sections on planned missing data designs that intro-
duce intentional missing values as a device for reducing respondent burden or lowering
research costs. Purposefully creating missing data might seem like a bad idea, but this
strategy is perfectly appropriate and cannot introduce bias. Although analyzing fewer
data points necessarily reduces power, the reduction can be surprisingly small, espe-
cially for longitudinal variants of these designs. I describe strategies for creating good
designs, and I illustrate how to use computer simulations to vet their power.
A missing data pattern refers to the configuration of observed and missing values in
a data set. This term should not be confused with a missing data mechanism, which
describes possible relationships between the data and one’s propensity for missing val-
ues. Roughly speaking, patterns describe where the holes are in the data, whereas mech-
anisms describe why the values are missing. Figure 1.1 shows six prototypical missing
data patterns, with shaded areas representing the location of the missing values. The
univariate pattern in panel a has missing values isolated on a single variable. This pat-
tern could occur, for example, in an experimental setting where outcome scores are
missing for a subset of participants. A univariate pattern is one of the earliest missing
data problems to receive attention in the statistics literature, and a number of classic
resources are devoted to this topic (e.g., Little & Rubin, 2020, Ch. 2). Panel b shows a
monotone missing data pattern from a longitudinal study where individuals with miss-
ing data at a particular measurement occasion are always missing subsequent measure-
ments. Monotone patterns received attention in the early literature, because this con-
figuration of missing values can be treated without complicated iterative estimation
algorithms (Jinadasa & Tracy, 1992; Schafer, 1997, pp. 218–238).
The general pattern in panel c has missing values scattered throughout the entire
data matrix. Importantly, the three contemporary methods that form the core of this
book—maximum likelihood, Bayesian estimation, and multiple imputation—work well
with this configuration, so there is generally no reason to choose an analytic method
based on the missing data pattern alone. Panel d illustrates a planned missing data
pattern where three of the variables are intentionally missing for a large proportion
of respondents (Graham, Hofer, & MacKinnon, 1996; Graham, Taylor, Olchowski, &
Cumsille, 2006). As described later in the chapter, planned missingness designs can
reduce respondent burden and research costs, often with minimal impact on statistical
power. Panel e shows a pattern where a latent variable (denoted Y4*) is missing for the
entire sample. This pattern will surface in Chapter 6 with categorical variable models
that view discrete responses as arising from an underlying latent variable distribution
(Albert & Chib, 1993; Johnson & Albert, 1999).
One final configuration warrants attention, because it can introduce estimation
problems for modern missing data-handling procedures. For lack of a better term, I refer
FIGURE 1.1. Six missing data patterns. The gray shaded areas of each bar represent missing
observations.
Rubin and colleagues (Little & Rubin, 1987; Rubin, 1976) introduced a classification
system for missing data problems that is virtually universal in the literature. This work
outlines three missing data mechanisms or processes that describe different ways in
which the probability of missing values relates to the data: missing completely at ran-
dom (MCAR), missing at random (MAR), and missing not at random (MNAR). From a
practical perspective, these processes are vitally important, because they function as
statistical assumptions for a missing data analysis. However, the terms can be confusing
(e.g., missing at random refers to a systematic process), and published research articles
sometimes conflate their meaning. In the years since Rubin’s seminal work, methodolo-
gists have clarified certain aspects of his original definitions (Mealli & Rubin, 2016;
Raykov, 2011; Seaman, Galati, Jackson, & Carlin, 2013) and have added special sub-
types of processes (Diggle & Kenward, 1994; Little, 1995). As an aside, I mostly avoid
acronyms throughout the book, but I generally refer to missing data mechanisms by
their abbreviations.
distributions for these variables, but they are nevertheless integral to the theory. The
rightmost set of columns in Table 1.1 shows the matrix of binary missing data indicators
M that code whether scores are observed or missing; Mv = 0 if a participant's score on
variable Yv is observed, and Mv = 1 if Yv is missing.
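The indicator matrix M is easy to construct in code. The sketch below is my own illustration (the data values and the NumPy-based representation are hypothetical, not the book's Table 1.1):

```python
import numpy as np

# Hypothetical data matrix with np.nan marking missing scores
# (values are illustrative, not taken from the book's Table 1.1).
Y = np.array([
    [7.0, 12.0, np.nan],
    [np.nan, 9.0, 4.0],
    [6.0, np.nan, np.nan],
    [5.0, 11.0, 3.0],
])

# M codes each score: 0 = observed, 1 = missing
M = np.isnan(Y).astype(int)
print(M)
```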
Missing data mechanisms describe different ways in which the pattern of 0’s and
1’s may relate to the realized data in Y(obs) or Y(mis). Rubin’s framework describes three
possibilities: The MCAR mechanism stipulates that the propensity for missing values is
unrelated to the data; an MAR process posits that missingness is related to the observed
parts of the data only; and an MNAR mechanism allows missingness to depend on the
unseen scores. To make each mechanism more concrete, I used computer simulation to
create bivariate data sets that conform exactly to each process. I modeled the artificial
samples after the perceived control over pain and depression variables from the chronic
pain data set on the companion website. This data set includes psychological correlates
of pain severity (e.g., depression, pain interference with daily life, perceived control
over pain) from a sample of N = 275 individuals suffering from chronic pain. Figure
1.2 shows the scatterplot of the hypothetical complete data (i.e., Y(com)) for an artificial
sample of the same size. The contour rings convey the perspective of a drone hovering
over the peak of the bivariate normal population distribution. I subsequently deleted
50% of the depression scores following each mechanism.
FIGURE 1.2. Complete-data scatterplot showing the would-be values of two variables from a
sample of 275 participants.
Missing Completely at Random

A missing completely at random mechanism stipulates that the propensity for missing
values is unrelated to the data. The formal definition of this process is as follows:

Pr(M = 1 | Y(obs), Y(mis), φ) = Pr(M = 1 | φ)    (1.1)
where φ is a set of missingness model parameters that link the data to the indicators (e.g.,
φ could contain logistic or probit regression coefficients). The left side of the expression,
which contains the full complement of possible associations between the indicators and
the data, says that the probability of a missing score depends on both the observed and
missing parts of the data, as well as some parameters that dictate missingness. The
MCAR process on the right side of the expression simplifies by eliminating all depen-
dence on the realized data. In other words, the equation says that all participants have
the same chance of missing values, and the parameters in φ define the overall probabili-
ties of missing data.
A directed acyclic graph is a useful graphical tool for representing the missing data
mechanism in Equation 1.1 (Mohan, Pearl, & Tian, 2013; Thoemmes & Mohan, 2015).
Figure 1.3a depicts an MCAR process involving a complete variable, X, an incomplete
variable, Y, and a binary missing data indicator, MY. The white circle labeled Y repre-
sents the hypothetically complete variable (i.e., the combination of Y(mis) and Y(obs)), and
the circle labeled Y * represents realized values of Y (i.e., Y * = Y when the missing data
FIGURE 1.3. Directed acyclic graphs that depict missing data processes involving one com-
plete variable, X, one incomplete variable, Y, and a binary missing data indicator, MY. The white
circle labeled Y represents the hypothetically complete variable, and the circle labeled Y * denotes
the realized values of Y.
indicator MY = 0 and is missing whenever MY = 1). Two features of the graph convey
an MCAR mechanism. First, the absence of arrows pointing to MY indicates that all
sources of missingness are contained in the indicator and no other variables predict
nonresponse. Second, directed acyclic graph rules tell us that the unseen values of Y are
unrelated to MY, because the MY → Y * ← Y path connecting the two variables is blocked
by a third variable with two incoming arrows (Y * is a so-called “collider variable”).
Rubin’s missing data mechanisms can further be viewed as distributional assump-
tions for the missing values. The definition in Equation 1.1 implies that the missing and
observed scores share the same overall (marginal) distributions. To illustrate this point,
I randomly removed 50% of the artificial depression scores from the complete data set
in Figure 1.2 (i.e., missingness was determined by an electronic coin toss). Figure 1.4
shows the scatterplot of the resulting data, with gray circles representing complete cases
and black crosshairs denoting partial data records with perceived control over pain
scores but no depression values. The figure shows that missing scores are unsystemati-
cally dispersed throughout the entire distribution, such that the circles and crosshairs
completely overlap, with no differences in their center, spread, or association. The graph
highlights that the observed data are a simple random sample of the hypothetically
complete data set.
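An MCAR deletion process like this one is simple to mimic in code. The following sketch is my own illustration; the sample size matches the text, but the means, standard deviations, and the regression used to generate the artificial scores are made-up values, not the chronic pain data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bivariate normal sample loosely mimicking the chronic pain example
# (distribution parameters are illustrative assumptions).
n = 275
control = rng.normal(20, 6, n)
depress = 30 - 0.5 * control + rng.normal(0, 4, n)

# MCAR: an "electronic coin toss" deletes 50% of the depression scores.
missing = rng.random(n) < 0.5
depress_obs = np.where(missing, np.nan, depress)

# The observed cases are a simple random sample of the complete data,
# so the complete-data and observed-data means should be close.
print(np.nanmean(depress_obs), depress.mean())
```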
FIGURE 1.4. Scatterplot showing an MCAR process where 50% of the scores are missing hap-
hazardly in a way that does not depend on the data. Circles denote complete observations, and
crosshairs denote pairs with missing depression scores.
Missing at Random
A missing at random mechanism states that the probability of missing values is related
to the observed but not the missing parts of the realized data. The formal definition of
this process is as follows:
Pr(M = 1 | Y(obs), Y(mis), φ) = Pr(M = 1 | Y(obs), φ)    (1.2)
The right side of the equation says that the would-be scores in Y(mis) carry no additional
information about missingness above and beyond that in the observed data. The term
missing at random is often misunderstood, because it seems to imply a haphazard pro-
cess instead of a systematic one. Rather, the phrase means that missingness is purely
random after conditioning on or controlling for the observed data. Said differently, two
participants with identical observed score profiles would share the same chance of miss-
ing values, whereas two participants with different observed score profiles would have
different missingness rates. To clarify this idea, Graham (2009) refers to this mecha-
nism as conditionally missing at random (CMAR), and I often do so as well.
The directed acyclic graph in Figure 1.3b depicts an MAR process that features a
directed arrow from X to MY. The graph shows that the unseen values in Y are poten-
tially related to missingness via the MY ← X → Y path (in the parlance of this graphical
framework, Y and MY are said to be d-connected). Graphing rules further tell us that
conditioning on X eliminates the association between Y and MY (i.e., satisfies a condi-
tionally MAR process) by closing the MY ← X → Y path. Procedurally, conditioning on X
means that the missing data-handling procedure leverages all available data, including
the partial records for observations with missing Y values. The three analytic pillars of
this book—maximum likelihood, Bayesian estimation, and multiple imputation—do
just that.
To further illustrate an MAR mechanism, I deleted 50% of the artificial depression
scores in Figure 1.2 following a process where the chance of a missing value increased
as perceived control over pain decreased (e.g., participants with little control over their
pain were more likely to experience pain-related disruptions that could lead them to
drop out of the study). The selection process was relatively strong, with the predicted
probability of missing data increasing from about 16% at one standard deviation above
the perceived control mean to 84% at one standard deviation below the mean. Figure
1.5 shows the scatterplot of the data, with gray circles again representing complete cases
and black crosshairs denoting partial data records with perceived control scores but no
depression values. The figure clearly depicts a systematic process where missing scores
are primarily located on the left side of the contour plot. Unlike Figure 1.4, the gray
circles (cases with complete data on both variables) are no longer representative of the
hypothetically complete data, because there are too many scores at the high end of the
perceived control distribution and too few at the low end.
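The MAR selection process described above can be sketched with a logistic missingness model. This is my own illustration: the 16%/84% probabilities at plus and minus one standard deviation follow the text, but the data-generating parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Complete bivariate data (illustrative parameters, not the book's).
n = 275
control = rng.normal(20, 6, n)
depress = 30 - 0.5 * control + rng.normal(0, 4, n)

# MAR selection: the probability of missing depression rises as perceived
# control falls. The slope is chosen so the probability is about .16 at
# one SD above the control mean and .84 at one SD below it.
z = (control - control.mean()) / control.std()
slope = np.log(0.84 / 0.16)           # ~1.66 on the logit scale
p_miss = 1 / (1 + np.exp(slope * z))  # logistic selection model
depress_obs = np.where(rng.random(n) < p_miss, np.nan, depress)

# The complete cases now over-represent high perceived control.
print(control[~np.isnan(depress_obs)].mean() > control.mean())
```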
An MAR mechanism can also be viewed as a distributional assumption for the
missing values. The definition in Equation 1.2 implies that the observed and unseen
values of a variable share the same distribution after controlling for the observed values
of other variables (i.e., the two sets of scores follow the same conditional distribution).
FIGURE 1.5. Scatterplot showing an MAR process where 50% of the depression scores are
missing for participants with low perceived control over pain values. Circles denote complete
observations, and crosshairs denote pairs with missing depression scores.
Applied to the bivariate normal data in Figure 1.5, this assumption stipulates that the
observed and missing depression scores are normally distributed around points on the
regression line and share the same constant variation (i.e., the depression distribution
is the same for any two individuals with the same perceived control over pain score,
regardless of whether they have missing data). Visually, this feature is evident by the fact
that the circles and crosshairs lock together like puzzle pieces around the regression line
from the hypothetically complete data.
Viewing the MAR process as a distributional assumption provides intuition about
the inner workings of contemporary analytic procedures. Although they do so in dif-
ferent ways, maximum likelihood, Bayesian estimation, and multiple imputation all
attempt to infer the location of the missing values based on the corresponding observed
data. Consider the task of imputing a missing depression score. Given a suitable esti-
mate of the regression line, the MAR process implies that imputations can be sampled
from normal distributions centered along the regression line. To illustrate, Figure 1.6
shows the distribution of plausible imputations at three values of perceived control over
pain. Candidate imputations fall exactly on the vertical hashmarks, but I added horizon-
tal jitter to emphasize that more scores are located at higher contours near the regression
line. Randomly selecting one of the circles from each distribution generates an imputed
depression score (technically, imputations are not restricted to the circles displayed in
the graph and could be selected from anywhere in the normal distribution, but you get
the idea). In fact, Bayesian estimation and multiple imputation both invoke an iterative
version of this exact procedure.
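The core imputation step can be sketched in a few lines. The regression estimates below are hypothetical placeholders, and the function name is mine; the point is only that an imputation is a random normal draw centered on the regression line:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical estimates of the regression of depression on perceived
# control over pain (intercept, slope, residual SD are assumed values).
b0, b1, sigma = 30.0, -0.5, 4.0

def impute_depression(control_score, rng):
    """Sample one plausible imputation for a missing depression score."""
    center = b0 + b1 * control_score  # point on the regression line
    return rng.normal(center, sigma)  # random draw around that point

print(impute_depression(20.0, rng))  # one plausible imputed value
```

Bayesian estimation and multiple imputation repeat a draw like this for every missing score on every iteration, using updated parameter estimates each time.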
Finally, an MAR process is very general and readily extends to multivariate data,
although it is more awkward to think about in this context. Returning to the data in
Table 1.1, the mechanism must be viewed on a pattern-by-pattern basis. Considering the
first row of data (and all other rows where only Y3 is missing), an MAR process requires
that Y3’s missingness is fully explained by Y1 and Y2. Moving to the fourth row of data,
the mechanism requires that the likelihood of a pattern where Y1 and Y3 are both miss-
ing depends only on Y2. Notice that this condition contradicts the statement for the first
row, which allows missing values on Y3 to depend on Y1. As a final example, the mecha-
nism requires the chance of missing both Y1 and Y2 (the pattern in the bottom row of the
table) to depend only on Y3. Again, parts of this proposition are at odds with conditions
that govern other patterns. Despite its somewhat clunky construction with multivariate
data, Little and Rubin (2020, p. 23) argue that an MAR process is a better approximation
to reality than the simpler MCAR mechanism.
FIGURE 1.6. Distributions of plausible imputations at three values of perceived control over
pain (plot residue removed; axes show Depression versus Perceived Control Over Pain).

Missing Not at Random

A missing not at random mechanism allows the probability of missing values to depend
on the unseen scores. The formal definition of this process is as follows:
Pr(M = 1 | Y(obs), Y(mis), φ)    (1.3)
Unlike the previous expressions, the conditional distribution of the missing data indi-
cators doesn’t simplify and features two distinct determinants of missingness. Under
such a process, two participants with identical observed score profiles no longer have
the same chance of a missing value, as the would-be scores themselves carry additional
information above and beyond the observed data. Gomer and Yuan (2021) refer to Equa-
tion 1.3 as diffuse MNAR, because missingness depends on both components of the
hypothetically complete data, and they define a focused MNAR process as one that
depends only on the unseen values in Y(mis).
Pr(M = 1 | Y(mis), φ)    (1.4)
Although there is no way to differentiate MNAR subtypes from the observed data, the
authors argue that the distinction is important, because diffuse and focused processes
can differentially impact one’s analysis results. I return to this issue in Chapter 9.
The directed acyclic graph in Figure 1.3c depicts a (diffuse) MNAR process involv-
ing the same variables as before. The graph suggests that the unseen values in Y are
potentially related to missingness via the MY ← X → Y path and the Y → MY path. As
before, conditioning on X closes the MY ← X → Y path, thereby eliminating part of the
association between Y and its missingness indicator. However, the would-be values of
Y still influence missingness via their direct pathway to MY. Graphing rules tell us that
a pair of connected variables adjacent in a chain cannot be separated, so conditioning
on the observed data does not eliminate the dependence between Y and its missing data
indicator.
To further illustrate an MNAR mechanism, I deleted 50% of the artificial depres-
sion scores in Figure 1.2 following a process where participants with higher levels of
depression were more likely to have missing values (e.g., those with acute symptoms
would leave the study to seek treatment elsewhere). The selection process was relatively
strong, with the predicted probability of missing data increasing from about 16% at
one standard deviation below the depression mean to 84% at one standard deviation
above the mean. Figure 1.7 shows the scatterplot of the data, with gray circles again
representing complete cases and black crosshairs denoting partial data records with
perceived control scores but no depression values. The figure illustrates a systematic
process where missing scores are primarily located in the top half of the contour plot
above the regression line. The gray circles (cases with complete data on both variables)
are clearly unrepresentative of the hypothetically complete data.
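For completeness, the MNAR deletion process differs from the MAR sketch only in that the selection model conditions on the depression scores themselves. Again, this is my own illustration with hypothetical data-generating parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

# Complete bivariate data (illustrative parameters, not the book's).
n = 275
control = rng.normal(20, 6, n)
depress = 30 - 0.5 * control + rng.normal(0, 4, n)

# MNAR selection: missingness depends on the depression scores themselves,
# rising from about .16 one SD below the depression mean to about .84
# one SD above it.
z = (depress - depress.mean()) / depress.std()
slope = np.log(0.84 / 0.16)
p_miss = 1 / (1 + np.exp(-slope * z))
depress_obs = np.where(rng.random(n) < p_miss, np.nan, depress)

# The complete cases now understate the depression mean.
print(np.nanmean(depress_obs) < depress.mean())
```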
Unlike the conditionally MAR mechanism, which stipulates that the observed
and missing scores share the same distribution after controlling for other variables,
FIGURE 1.7. Scatterplot showing an MNAR process where 50% of the depression scores are
missing for participants with high depression. Circles denote complete observations, and cross-
hairs denote pairs with missing depression scores.
an MNAR process implies that the two sets of scores have different distributions. This
situation is clear in Figure 1.7, where the vast majority of the missing scores are above
the regression line, and the complete data are mostly below the line. This feature makes
imputation considerably more difficult, because there are no data with which to esti-
mate the unique parameters of the missing data distribution. For example, leveraging
the perceived control over pain scores alone would create imputations that fall on either
side of the regression line, and there is no way to formulate an appropriate adjustment
without knowing the unseen depression values. As you will see, analytic procedures for
MNAR processes (e.g., selection models or pattern mixture models) can only counteract
this indeterminacy by invoking relatively strong assumptions about the unseen data.
tion and multiple imputation), but this conclusion does not hold for standard errors and
significance tests that rely on the frequentist framework and repeated sampling argu-
ments (Kenward & Molenberghs, 1998; Savalei, 2010).
Getting accurate measures of uncertainty under a particular process requires
a stricter assumption that the same missingness process always generates data sets.
Returning to Equation 1.2, valid inferences require the MAR definition to hold for any
Y(obs) that you could have worked with, not just the Y(obs) in a particular sample of data.
Statisticians refer to this condition as missing always at random (Bojinov, Pillai, &
Rubin, 2020; Mealli & Rubin, 2016) or everywhere missing at random (Seaman et al.,
2013), and Mealli and Rubin (2016) define parallel conditions for MCAR and MNAR pro-
cesses known as missing always completely at random and missing not always at random,
respectively. Because missingness mechanisms are so prevalent throughout the book, I
refer to them by their simpler monikers, with the understanding that measures of uncer-
tainty and significance tests require slightly different definitions.
f(Y(obs), M | θ, φ) = f(M | Y(obs), φ) × f(Y(obs) | θ)    (1.5)
I use generic function notation f(∙) throughout the book to represent distributions in the
abstract without specifying their type or form (e.g., "f of something" could be a normal
curve or a binomial distribution). If the parameters in θ and φ are independent, applying
rules of probability gives the factorization on the right side of the equation. The missing-
ness model is ignorable in this case, because f(M|Y(obs), φ) functions as a constant, and
estimating the focal model parameters from the observed data gives the same results
with or without this term. In contrast, the missingness model is said to be nonignorable
if the missing values follow an MNAR process or the nuisance parameters in φ carry
information about the focal parameters in θ. In this situation, we can only get valid esti-
mates of θ by pairing the focal analysis model with an additional model for missingness
(see Chapter 9).
Ignorability is ultimately something we just take on faith, because there is no
way to evaluate either of its propositions. Referring to distinctness, Schafer (1997,
p. 11) says, "In many situations this is intuitively reasonable, as knowing θ [the focal
model’s parameters] will provide little information about ξ [the missingness model’s
parameters] and vice-versa.” The MAR part of the assumption can be a bit trickier. Van
Buuren (2012, p. 33) warns that “the label ‘ignorable’ does not mean that we can be
entirely careless about the missing data,” and he goes on to emphasize that satisfying
this assumption requires the missing data-handling procedure to condition on all the
important determinants of missingness. The next three sections address this point in
more detail.
Unfortunately, the observed data do not contain the necessary information to evaluate
a conditionally MAR or MNAR mechanism, because both make propositions about the
unseen scores—the former says the would-be values are unrelated to missingness after
conditioning on the observed data, and the latter says they are related. Although meth-
odologists have proposed various diagnostic procedures for evaluating these conditions
(Bojinov et al., 2020; Potthoff, Tudor, Pieper, & Hasselblad, 2006; Yuan, 2009a), the
validity of contemporary missing data-handling procedures ultimately relies on untest-
able assumptions and our own expert knowledge about the data and possible reasons for
missingness. This leaves an unsystematic MCAR process as the only mechanism with
testable propositions.
In truth, evaluating whether missingness is consistent with an unsystematic pro-
cess isn’t necessarily useful, because contemporary methods do not require this strict
assumption, and finding that haphazard missingness is (or is not) plausible does not
change the recommendation to use these approaches. To this point, Raykov (2011,
p. 428) suggests that “the desirability of the MCAR condition has been frequently over-
rated in empirical social and behavioral research,” and I couldn’t agree more. Never-
theless, the logic of evaluating an MCAR process warrants brief discussion, because
applications of MCAR tests abound in published research articles, and it is important to
understand what these tests do and do not tell us about the missing data.
As explained previously, an MCAR process implies that missing and observed scores
share the same overall (marginal) distributions; that is, even without conditioning on
the observed data, the observed and would-be scores have identical means, variances,
and associations with other variables. Kim and Bentler (2002) refer to this condition as
homogeneity of means and covariances. Methodologists have proposed numerous proce-
dures for evaluating the MCAR mechanism (Chen & Little, 1999; Jamshidian & Jalal,
2010; Kim & Bentler, 2002; Little, 1988b; Muthén, Kaplan, & Hollis, 1987; Park & Lee,
1997; Raykov & Marcoulides, 2014), most of which involve comparing features of the
observed data across different missing data patterns. I focus on two simple approaches
that consider group mean differences, as these methods enjoy widespread use and are
readily available in statistical software.
The power of these tests is at a maximum when a variable has 50% missing data, because its missingness indicator has equal group sizes. Conversely, lower (or higher) missing data rates
cause unbalanced group sizes and lower power. To illustrate, consider the conditionally
MAR process depicted in Figure 1.5. Achieving power equal to .80 with a 50% missing data rate requires a standardized pattern mean difference of |d| = 0.34 or larger (a small effect size). Had I instead deleted 10% of the data (i.e., group sizes of n(obs) = 247 and n(mis) = 28), the effect size needed to achieve the same power increases to |d| = 0.56 or larger (a medium effect size). Finally, a pattern mean difference does not automatically
imply that the variable in question is a source of nonresponse bias, as the variable’s cor-
relation with the focal analysis variables also plays an important role (Collins, Schafer,
& Kam, 2001). I return to this point in Section 1.5.
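The power calculation behind these thresholds can be sketched with the usual normal approximation to a two-group mean comparison. The sketch below is mine (the function name and the hard-coded z values for a two-tailed α = .05 test with 80% power are not from the text), but it reproduces the |d| = 0.34 and |d| = 0.56 figures:

```python
from math import sqrt

def min_detectable_d(n_obs, n_mis, alpha_z=1.959964, power_z=0.841621):
    """Smallest standardized mean difference detectable with ~80% power
    (normal approximation to the two-group t test, alpha = .05)."""
    return (alpha_z + power_z) * sqrt(1 / n_obs + 1 / n_mis)

# 50% missing data: two indicator groups of ~137.5 cases each
print(round(min_detectable_d(137.5, 137.5), 2))  # 0.34
# 10% missing data: groups of 247 and 28
print(round(min_detectable_d(247, 28), 2))       # 0.56
```

The unbalanced-group penalty enters entirely through the sqrt(1/n1 + 1/n2) term, which is minimized when the two groups are equal in size.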
TL = Σg=1…G ng (Ȳg − μ̂g)′ Ŝg−1 (Ȳg − μ̂g)    (1.6)
where G is the number of missing data patterns, ng is the number of cases in missing
data pattern g, Ȳg contains the arithmetic means for that group, and μ̂g and Ŝg contain
the rows and columns of μ̂ and Ŝ (the maximum likelihood estimates) that correspond
to the observed variables in Ȳg. The parentheses contain deviations between pattern g’s
arithmetic averages and the corresponding grand means, and these are squared (and
summed) via matrix multiplication. Multiplying by the inverse of the covariance matrix
(the matrix analogue of division) standardizes the discrepancies, such that the numeri-
cal value of TL is a weighted sum of G squared z-scores. If values are missing completely
at random, TL is approximately distributed as a chi-square statistic with Σvg – V degrees
of freedom, where vg is the number of observed variables in pattern g, and V is the total
number of variables. Consistent with the mean difference approach, a significant test
statistic suggests that missingness is not purely random.
To illustrate Little’s test, reconsider the conditionally MAR process depicted in
Figure 1.5. In practice, the primary motivation for using Little’s test is to evaluate a
larger number of variables in Yg, but a bivariate application is useful for illustrating the
mechanics of the equation. To begin, the maximum likelihood estimates of the grand
means and variance–covariance matrix are as follows:
μ̂ = [20.31]    Ŝ = [ 27.27  –13.80]
    [14.29]        [–13.80   36.15]    (1.7)
These means are the benchmark against which to compare pattern-specific means. There
are just two missing data patterns in this example: n(obs) = 141 observations have scores
on both variables (i.e., v1 = 2), and n(mis) = 134 cases have missing depression scores (i.e.,
v2 = 1). The pattern-specific arithmetic means for the two groups are as follows:
Ȳ1 = [23.27]    Ȳ2 = [17.19]
     [12.79]         [ NA  ]    (1.8)
Substituting the estimates into Equation 1.6 gives the following test statistic:

TL = 98.27    (1.9)
If an unsystematic process generated the data, this test statistic should approximate a
chi-square statistic with Σvg – V = (2 + 1) – 2 = 1 degree of freedom. The test is statistically significant, TL(1) = 98.27, p < .001, indicating that the MCAR mechanism is not
plausible for these data. In a multivariate application with more than two variables, a
significant test statistic indicates that two or more patterns differ, but the test gives no
indication about which variables might be responsible.
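To make the mechanics concrete, here is a minimal pure-Python sketch of Equation 1.6 for this two-pattern example. It plugs in the rounded estimates from Equations 1.7 and 1.8, so the resulting statistic differs slightly from the published TL = 98.27, which was computed from unrounded values:

```python
# Little's MCAR test (Equation 1.6) for the two-pattern example
mu = [20.31, 14.29]                      # ML grand means (Equation 1.7)
S = [[27.27, -13.80], [-13.80, 36.15]]   # ML covariance matrix

# Pattern 1: both variables observed (n = 141)
n1, ybar1 = 141, [23.27, 12.79]
d0, d1 = ybar1[0] - mu[0], ybar1[1] - mu[1]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
quad1 = (S[1][1] * d0 ** 2 - 2 * S[0][1] * d0 * d1 + S[0][0] * d1 ** 2) / det

# Pattern 2: only the first variable observed (n = 134)
n2, ybar2 = 134, 17.19
quad2 = (ybar2 - mu[0]) ** 2 / S[0][0]

TL = n1 * quad1 + n2 * quad2             # ~93.1 with these rounded inputs
df = (2 + 1) - 2                         # sum of v_g minus V
print(round(TL, 2), df)
```

Each pattern contributes a sample-size-weighted squared (Mahalanobis) distance between its means and the corresponding grand means, and the distances are summed across patterns.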
Yi = β0 + β1 X i + ε i (1.10)
Moreover, suppose that the outcome is missing due to another measured variable A that
also correlates with Y. Figure 1.8a shows a directed acyclic graph that depicts theoretical
associations among the three variables and the missing data indicator, MY. As before, Y
represents the hypothetically complete variable, and Y * represents realized values of Y
(i.e., Y * equals Y when the missing data indicator MY = 0 and is missing whenever MY =
1). Graphing rules imply that Y is potentially related to missingness via two pathways:
MY ← X → Y and MY ← A → Y.
As explained previously, directed acyclic graphs clarify that conditioning on or
controlling for the middle variable in a path eliminates the dependency between the two
outer variables. The regression model conditions on X and therefore eliminates part of
the association between Y and MY by closing the MY ← X → Y path. However, Y and MY
are still related via the MY ← A → Y path, so the analysis induces an MNAR-by-omission
process, because it fails to condition on A. Whether the open path introduces substantial
bias depends on the magnitude of the associations between A and MY and between A and Y (Collins et
al., 2001), but the analysis is nevertheless at odds with the MAR assumption.
Perhaps the simplest way to condition on A is to simply include it as an additional
covariate in the analysis model as follows:
Yi = β0 + β1 X i + β2 Ai + ε i (1.11)
This analysis is consistent with an MAR process, because it eliminates all sources of
dependency between Y and MY. However, the model achieves this desirable status by
FIGURE 1.8. Directed acyclic graphs that depict missing data processes involving one com-
plete variable, X, one incomplete variable, Y, a binary missing data indicator, MY, and an auxiliary
variable A. The white circle labeled Y represents the hypothetically complete variable, and the
circle labeled Y * denotes the realized values of Y.
statistical software application can estimate these associations, but most do so after
discarding incomplete data records. The advantage of Raykov and West’s approach is
that it leverages maximum likelihood missing data handling (or alternatively, Bayesian
estimation). Again, we don’t know how maximum likelihood estimation works yet, but
for now it is sufficient to know that the estimator leverages the entire sample’s observed
data without discarding any information.
A respectable semipartial correlation signals an auxiliary variable that contains
unique information about the missing values above and beyond that already contained
in the analysis. How large does this correlation need to be in order to reap the benefits
of conditioning on the additional variable or suffering the consequences of ignoring
it? Simulation studies in Collins et al. (2001) provide some insights. The Collins et al.
article examined auxiliary variables with semipartial correlations equal to .32 and .72.
Not surprisingly, failing to condition on a variable with a very strong correlation usu-
ally produced a bias-inducing MNAR-by-omission process. In contrast, ignoring a vari-
able with the smaller correlation often gave acceptable parameter estimates with little
to no bias. Based on these results, it seems reasonable to focus on auxiliary variables
with semipartial correlations at least as strong as Cohen’s (1988) medium effect size
benchmark of ±0.30. Fortunately, we don’t need to be too discerning about this cutoff,
because these simulations showed no serious consequences of overfitting with a large
set of uncorrelated variables. Nevertheless, limiting the number of auxiliary variables
is often necessary in practice, because modeling strategies for introducing these extra
variables can be prone to convergence failures (e.g., the saturated correlates model; Gra-
ham, 2003).
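A simple way to screen candidates is to residualize the auxiliary variable on the analysis variables and correlate the residual with the incomplete variable. The sketch below does this with simulated complete-case data purely for illustration; the variable names and population values are hypothetical, and a real application would use Raykov and West's maximum likelihood approach rather than listwise deletion:

```python
import random
from math import sqrt

random.seed(1)

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

def semipartial(y, aux, x):
    """Correlation between y and the part of aux that is independent of x
    (aux residualized on x via simple least squares)."""
    n = len(x)
    mx, ma = sum(x) / n, sum(aux) / n
    slope = sum((xi - mx) * (ai - ma) for xi, ai in zip(x, aux)) / \
            sum((xi - mx) ** 2 for xi in x)
    intercept = ma - slope * mx
    resid = [ai - (intercept + slope * xi) for ai, xi in zip(aux, x)]
    return corr(y, resid)

# simulated example: aux carries unique information about y beyond x
x = [random.gauss(0, 1) for _ in range(500)]
aux = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.4 * xi + 0.5 * ai + random.gauss(0, 1) for xi, ai in zip(x, aux)]
print(round(semipartial(y, aux, x), 2))
```

An auxiliary variable whose semipartial correlation clears the ±.30 benchmark would be a good candidate for inclusion.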
Finally, although the literature has long favored an inclusive strategy (Collins et al.,
2001; Rubin, 1996; Schafer, 1997; Schafer & Graham, 2002), it is hypothetically possible
that conditioning on an auxiliary variable could enhance rather than reduce nonre-
sponse bias. This could happen, for example, if an auxiliary variable’s correlation with
an analysis variable and its missingness indicator is fully explained by an unmeasured
latent variable. It is unclear how often the constellation of associations needed to cause
this problem actually occurs in practice, but interested readers can find an illustration
of this phenomenon in Thoemmes and Rose (2014).
psychological correlates of pain severity (e.g., depression, pain interference with daily life,
perceived control) for a sample of N = 275 individuals with chronic pain. Because the
missing data mechanism is an assumption for a specific analysis, I build the example
around a linear regression model where depression scores are a function of pain interference with daily life activities and a binary severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain).
Approximately 7.3% of the binary pain ratings are missing, and the missing data rates
for the depression and pain interference scales are 13.5 and 10.5%, respectively. I use
these same variables in Chapter 10 to illustrate missing data handling for a mediation
analysis, and I incorporate auxiliary variables from this illustration.
univariate mean differences do not condition on the focal variables, so the effect sizes
in Table 1.2 do not say whether a given auxiliary variable predicts missingness above
and beyond the variables already in the analysis. Finally, mean differences alone do not
signal a problem, as a bias-inducing MNAR-by-omission process also requires salient
semipartial correlations with the analysis variables.
I’ve repeatedly referenced the analytic trio that forms the basis of this book: maximum
likelihood, Bayesian estimation, and multiple imputation. These methods have been the
“state of the art” for some time (Schafer & Graham, 2002), because they are capable of
producing valid estimates and inferences in a wide range of applications. The literature
describes numerous other approaches to missing data problems, some of which have
enjoyed widespread use, while others are now little more than a historical footnote.
This section describes a small collection of strategies you may still encounter in pub-
lished research articles or statistical software packages: listwise and pairwise deletion,
arithmetic mean imputation, regression imputation, stochastic regression imputation,
and last observation carried forward imputation. These methods deal with missing data
either by removing cases or by filling in the missing values with a single set of replace-
ment scores (a process known as single imputation). Except for stochastic regression
imputation, these methods are potentially problematic, because they invoke restrictive
assumptions about the missing data process or introduce bias regardless of mechanism.
In contrast, stochastic regression imputation gives valid estimates with a conditionally
MAR process, but it inappropriately shrinks standard errors. I return to the artificial
data in Figure 1.5 to illustrate these older approaches. To refresh, the scatterplot depicts
a conditionally MAR process where participants with low perceived control over their
pain were more likely to have missing depression scores.
FIGURE 1.9. Scatterplot showing data points that remain after applying listwise deletion to
an MAR process where 50% of the depression scores are missing for participants with lower
perceived control over pain. The black circle denotes the means of the complete observations.
predictors, in which case deleting incomplete data records gives the optimal maximum
likelihood estimates (Glynn & Laird, 1986; Little, 1992; von Hippel, 2007). The situa-
tion is more complicated with incomplete predictors, but deletion generally works well
if missingness is unrelated to the dependent variables. This includes an MAR process
where a covariate is missing as a function of another predictor, as well as an MNAR
mechanism where missingness is related to the would-be values of a covariate (White
& Carlin, 2010). A complete-case analysis can also provide optimal estimates of logistic
regression slope coefficients in a more limited number of scenarios (Vach, 1994; van
Buuren, 2012, p. 48).
FIGURE 1.10. Scatterplot showing the data that result from applying arithmetic mean impu-
tation to an MAR process where 50% of the depression scores are missing for participants with
lower perceived control over pain. The black crosshairs denote data records with perceived con-
trol scores and imputed depression values.
Mean imputation recoups the full set of perceived control scores, but it does a terrible job of
preserving the depression distribution. As you might expect, imputing missing scores
with values at the center of the distribution artificially reduces variability and attenu-
ates measures of association (mathematically, each missing value contributes a zero to
the sum of squares and sum of cross-products terms). If you focus on just the imputed
score pairs, you’ll notice that their correlation necessarily equals 0, because depression
scores are constant. As such, you can think of mean imputation as filling in the data
with scores that have no variation and no correlation with other variables. If you were
going to be stranded on a desert island with only one missing data-handling procedure
in your analytic suitcase, this is not the one you’d choose for your 3-hour tour.
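The attenuation is easy to demonstrate: every imputed score sits at the observed mean, so it contributes nothing to the sum of squares. A small sketch with simulated scores (the mean of 14 and standard deviation of 6 are illustrative values, loosely based on the depression estimates above):

```python
import random

random.seed(2)

def variance(scores):
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / (len(scores) - 1)

# simulated depression scores; the second half will be "missing"
y = [random.gauss(14, 6) for _ in range(200)]
observed = y[:100]
mean_obs = sum(observed) / len(observed)

# mean imputation: each missing score is replaced by the observed mean,
# so it adds zero to the sum of squares (and to any cross-products)
imputed = observed + [mean_obs] * 100

print(round(variance(y), 1), round(variance(imputed), 1))
```

With half the scores imputed, the filled-in data set's variance is roughly half the complete-data value, which is exactly the attenuation described above.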
A popular variation of mean imputation appears with questionnaire data where
multiple items tap into different aspects of the same construct. For example, the contin-
uous depression scores in the previous scatterplots result from summing item responses
measuring sadness, lack of motivation, sleep difficulties, feelings of low self-worth, and
so on. A common way to deal with item-level missing data is to compute a prorated
scale score that averages the available item responses. For example, if a participant
answered four out of six depression items, the prorated scale score would be the aver-
age of just four responses. The missing data literature often describes this procedure as
Regression Imputation
Regression imputation (also known as conditional mean imputation) replaces miss-
ing values with predicted scores from a regression equation. Regression imputation has
a long history that dates back more than 60 years (Buck, 1960), and the basic idea is
intuitively appealing: Variables tend to be correlated, so replacing missing values with
predicted scores borrows important information from the observed data. Although this
idea makes good sense, the resulting imputations can introduce substantial bias. The
nature and magnitude of these biases depend on the missing data mechanism and vary
across different estimands.
Regression imputation requires regression models that predict the incomplete vari-
ables from the complete variables. A complete-case analysis can generate the necessary
estimates, as can maximum likelihood estimation (e.g., so-called “EM imputation”; von
Hippel, 2004). Returning to the artificial data in Figure 1.5, imputation requires the
regression of depression on perceived control. The following equation generates the predicted scores that serve as imputations:

Ŷi = γ̂0 + γ̂1Xi
I use the γ symbol throughout the book to reference coefficients that are not part of the
focal analysis, and the γ’s in this equation are meant to emphasize that the regression
model is a device for imputing the data. The focal analysis could be something entirely
different (e.g., a correlation; the regression of perceived control on depression). The logic
of regression imputation is largely the same with multivariate data, but the procedure
is more cumbersome to implement, because each missing data pattern requires its own
regression equation.
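A short sketch of the procedure, using simulated data with illustrative parameter values (the coefficient names follow the γ notation above), shows why the imputed score pairs correlate perfectly:

```python
import random
from math import sqrt

random.seed(3)

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

# simulated control (x) and depression (y); second half of y is "missing"
x = [random.gauss(20, 5) for _ in range(200)]
y = [25 - 0.5 * xi + random.gauss(0, 4) for xi in x]
xc, yc = x[:100], y[:100]

# imputation regression estimated from the complete cases
mx, my = sum(xc) / 100, sum(yc) / 100
g1 = sum((a - mx) * (b - my) for a, b in zip(xc, yc)) / \
     sum((a - mx) ** 2 for a in xc)
g0 = my - g1 * mx

# each missing y is replaced by its predicted value on the regression line
imputes = [g0 + g1 * xi for xi in x[100:]]
print(round(abs(corr(x[100:], imputes)), 2))  # 1.0: perfectly correlated
```

Because every impute is an exact linear function of its predictor, the imputed pairs have a correlation of exactly 1 in absolute value, and the imputes have less variability than real scores.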
Figure 1.11 shows the scatterplot of the artificial data after filling in the missing
depression scores with predicted values, with gray circles again representing the complete
cases and black crosshairs denoting score pairs with imputed data. As you can see, the
procedure recoups the full data set, but it does a subpar job of preserving the depression
distribution. In particular, the imputed values lack variation, because they fall directly on
the regression line. This feature also implies that the imputed score pairs have a correla-
tion equal to 1. In effect, regression imputation suffers from the opposite problem as mean
imputation, because it replaces missing values with perfectly correlated scores.
As mentioned previously, a complete-case analysis or maximum likelihood estima-
tion can generate the coefficients for regression imputation. The latter option warrants a
brief discussion, because it often confuses researchers into thinking they are applying a
more sophisticated procedure than they are. This so-called “EM imputation” procedure
FIGURE 1.11. Scatterplot showing the data that result from applying regression imputation
to an MAR process where 50% of the depression scores are missing for participants with lower
perceived control over pain. The black crosshairs denote data records with perceived control
scores and imputed depression values.
Stochastic regression imputation is the only procedure in this section that is generally capable of producing unbiased parameter estimates when scores are conditionally MAR. As
you will see later in the book, the core idea behind stochastic regression imputation—an
imputation equals predicted value plus noise—resurfaces with Bayesian estimation and
multiple imputation. These procedures use iterative algorithms to generate imputations
over many alternate estimates of regression model parameters, but they are fundamen-
tally sophisticated relatives of stochastic regression imputation.
Applying stochastic regression imputation to the bivariate data in Figure 1.6 again
requires the regression of depression on perceived control. The residual variance from
this regression plays an important role, because it defines the spread of the random
noise terms. As before, substituting a participant’s observed data into the right side of
a regression equation gives the predicted value of the missing data point. Next, Monte
Carlo computer simulation creates a synthetic residual term by drawing a random num-
ber from a normal distribution with a mean equal to 0 and spread equal to the residual
variance estimate. Each imputation is then the sum of a predicted value and random
noise term.
Ẏi = γ̂0 + γ̂1Xi + ε̇i
ε̇i ~ N1(0, σ̂ε2)
The bottom row of the expression says that residuals are sampled from a univariate
normal curve, and the dot accent on εi indicates that this is a synthetic value created by
Monte Carlo computer simulation.
I previously introduced the possibility of drawing replacement scores from a nor-
mal curve, and Figure 1.6 shows the distribution of plausible imputations at three values
of perceived control over pain. Candidate imputations fall exactly on the vertical hash-
marks, but I added horizontal jitter to emphasize that more scores are located at higher
contours near the regression line. Randomly selecting one of the circles from each dis-
tribution would generate an imputed depression score (technically, imputations are not
restricted to the circles displayed in the graph and could be selected from anywhere in
the normal distribution).
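A minimal sketch of the procedure, again with simulated data and illustrative parameter values, shows that adding the random residual restores the outcome's spread:

```python
import random
from math import sqrt

random.seed(4)

def variance(scores):
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / (len(scores) - 1)

# simulated control (x) and depression (y); second half of y is "missing"
n = 1000
x = [random.gauss(20, 5) for _ in range(2 * n)]
y = [25 - 0.5 * xi + random.gauss(0, 4) for xi in x]
xc, yc = x[:n], y[:n]

# imputation coefficients and residual variance from the complete cases
mx, my = sum(xc) / n, sum(yc) / n
g1 = sum((a - mx) * (b - my) for a, b in zip(xc, yc)) / \
     sum((a - mx) ** 2 for a in xc)
g0 = my - g1 * mx
res_var = sum((b - (g0 + g1 * a)) ** 2 for a, b in zip(xc, yc)) / (n - 2)

# imputation = predicted value + a draw from N(0, residual variance)
imputes = [g0 + g1 * xi + random.gauss(0, sqrt(res_var)) for xi in x[n:]]

# unlike deterministic regression imputation, the spread is preserved
print(round(variance(yc), 1), round(variance(imputes), 1))
```

The two printed variances are approximately equal, which is why the filled-in values in Figure 1.12 blanket the full contour plot rather than collapsing onto the regression line.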
Figure 1.12 shows the scatterplot of the artificial data after filling in the miss-
ing depression scores with stochastic regression imputes. As before, the gray contour
rings convey the location and elevation of the bivariate normal population distribution.
Unlike the other approaches in this section, stochastic regression imputation disperses
imputations throughout the entire contour plot and doesn’t over- or underrepresent cer-
tain areas of the distribution. Comparing the plot to the hypothetically complete data
set in Figure 1.5, the filled-in values look like good surrogates, because they preserve the
center and spread of the depression scores, as well as their correlation with perceived
control over pain. Although analyzing a stochastically imputed data set can provide
accurate parameter estimates if values are MAR, doing so artificially shrinks standard
errors and distorts significance tests; statistical software applications incorrectly treat
imputes as real data when computing measures of uncertainty, such that standard errors
reflect the hypothetical sampling variation that would have resulted had the data been
complete. Pairing stochastic regression imputation with bootstrap resampling (Efron,
FIGURE 1.12. Scatterplot showing the data that result from applying stochastic regression
imputation to an MAR process where 50% of the depression scores are missing for participants
with lower perceived control over pain. The black crosshairs denote data records with perceived
control scores and imputed depression values.
1987; Efron & Gong, 1983; Efron & Tibshirani, 1993) is one option for estimating mea-
sures of uncertainty (see Chapter 2) and generating and analyzing multiple sets of impu-
tations is another (see Chapter 7).
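A bootstrap pairing might look like the following sketch: resample cases with replacement, re-impute, re-estimate, and take the standard deviation of the estimates as the standard error. Everything here (sample size, population values, the number of bootstrap draws, the focal estimand) is illustrative:

```python
import random
from math import sqrt

random.seed(5)

def impute_and_estimate(pairs):
    """Fit the imputation regression to complete cases, fill missing y
    scores with predicted value + noise, and return the mean of y."""
    comp = [(a, b) for a, b in pairs if b is not None]
    n = len(comp)
    mx = sum(a for a, _ in comp) / n
    my = sum(b for _, b in comp) / n
    g1 = sum((a - mx) * (b - my) for a, b in comp) / \
         sum((a - mx) ** 2 for a, _ in comp)
    g0 = my - g1 * mx
    rv = sum((b - (g0 + g1 * a)) ** 2 for a, b in comp) / (n - 2)
    filled = [b if b is not None else g0 + g1 * a + random.gauss(0, sqrt(rv))
              for a, b in pairs]
    return sum(filled) / len(filled)

# simulated data with roughly 50% of y missing completely at random
data = []
for _ in range(300):
    a = random.gauss(20, 5)
    b = 25 - 0.5 * a + random.gauss(0, 4)
    data.append((a, b if random.random() > 0.5 else None))

# bootstrap: resample cases, re-impute, re-estimate; the standard
# deviation of the estimates is the standard error of the mean of y
estimates = []
for _ in range(200):
    boot = [random.choice(data) for _ in range(len(data))]
    estimates.append(impute_and_estimate(boot))
m = sum(estimates) / len(estimates)
se = sqrt(sum((e - m) ** 2 for e in estimates) / (len(estimates) - 1))
print(round(se, 2))
```

Because each bootstrap replicate repeats the entire imputation step, the resulting standard error reflects imputation uncertainty that a naive analysis of one filled-in data set would ignore.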
Imputed data
1    25    28    28    28
2    22    21    24    26
3    18    18    18    18
4    30    30    31    34
5    20    20    22    21
Last observation carried forward effectively assumes no change after the final obser-
vation or during the intermittent period where scores are missing. The conventional wis-
dom is that imputing the data with stable scores yields a conservative estimate of treat-
ment group differences at the end of a study. However, empirical research shows that this
isn’t necessarily true, as the method can also exaggerate group differences (Cook, Zeng,
& Yi, 2004; Liu & Gould, 2002; Mallinckrodt, Clark, & David, 2001; Molenberghs et al.,
2004). The direction and magnitude of the bias depend on specific characteristics of the
data, but the approach is likely to produce distorted parameter estimates, even with an
unsystematic missingness process (Molenberghs et al., 2004). Suffice to say, there are
much better strategies for dealing with longitudinal missing data.
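For reference, the carry-forward rule itself is trivial to implement. The sketch below fills each missing value (coded None here) with the most recent observed score, mirroring a case like the first row of the table above:

```python
def locf(series):
    """Fill each missing value (None) with the most recent observed score;
    leading missing values remain None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# a participant observed at the first two waves who then drops out
print(locf([25, 28, None, None]))  # [25, 28, 28, 28]
# intermittent missingness is filled the same way
print(locf([22, None, 24, None]))  # [22, 22, 24, 24]
```

The simplicity of the rule is exactly its weakness: it hard-codes the assumption of no change after the last observation.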
The previous scatterplots suggest that older missing data methods can misrepresent
distributions in ways that almost certainly introduce bias. Monte Carlo computer simu-
lations can reveal how the tendencies depicted in the graphs unfold over many different
samples and across different estimands. To this end, I used a series of simulation studies
to compare listwise deletion, arithmetic mean imputation, regression imputation, and
stochastic regression imputation to a “gold standard” maximum likelihood estimator
for missing data. As mentioned previously, maximum likelihood missing data handling
leverages the entire sample’s observed data without discarding any information. The
other “gold standards,” Bayesian estimation and multiple imputation, are equivalent in
this case (Collins et al., 2001; Schafer, 2003).
The first step of a computer simulation is to specify a set of hypothetical parameter
values. Recycling the parameters that created the artificial depression and perceived con-
trol over pain data in the previous scatterplots helps visualize the procedure. Returning
to Figure 1.2, the contour rings convey the perspective of a drone hovering over the peak
of the bivariate normal population distribution, and the gray circles are an artificial
sample of hypothetically complete data. The next step generates many artificial data
sets from the population. Researchers often ask whether contemporary approaches like
maximum likelihood can be used with small samples or large amounts of missing data.
To examine this issue, I programmed a simulation that created 1,000 random samples of
N = 100 from the bivariate normal population, and I deleted 50% of the artificial depres-
sion scores following one of the missing data mechanisms. The missing completely at
random process mimicked Figure 1.4, the conditionally MAR mechanism followed Fig-
ure 1.5, and the MNAR process mirrored Figure 1.7. After deleting scores, I used different
missing data-handling methods to estimate three sets of parameters: the mean vector and
variance–covariance matrix, coefficients from the regression of Y on X (e.g., perceived
control over pain predicting depression), and coefficients from the regression of X on Y
(e.g., depression predicting perceived control over pain). Any discrepancy between the
average estimates and their true values reflects systematic nonresponse bias.
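A scaled-down version of this design is easy to sketch. The code below uses the MCAR condition only, fewer replications, illustrative population values (the X mean of 20 and correlation of –.40 match the true values reported below; the Y mean and standard deviations are my own), and compares listwise deletion with mean imputation on the correlation estimand:

```python
import random
from math import sqrt

random.seed(6)
RHO, REPS, N = -0.40, 300, 100

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

listwise, mean_imp = [], []
for _ in range(REPS):
    # bivariate normal sample with correlation RHO
    z = [random.gauss(0, 1) for _ in range(N)]
    x = [20 + 5 * zi for zi in z]
    y = [15 + 6 * (RHO * zi + sqrt(1 - RHO ** 2) * random.gauss(0, 1))
         for zi in z]
    # delete 50% of the y scores completely at random
    obs = sorted(random.sample(range(N), N // 2))
    obs_set = set(obs)
    # listwise deletion analyzes the complete cases only
    listwise.append(corr([x[i] for i in obs], [y[i] for i in obs]))
    # mean imputation fills missing y scores with the observed mean
    ybar = sum(y[i] for i in obs) / len(obs)
    y_fill = [y[i] if i in obs_set else ybar for i in range(N)]
    mean_imp.append(corr(x, y_fill))

# listwise deletion is unbiased under MCAR; mean imputation attenuates r
print(round(sum(listwise) / REPS, 2), round(sum(mean_imp) / REPS, 2))
```

Averaging the estimates across replications shows the listwise correlations centered near –.40 while the mean-imputed correlations are pulled toward zero, previewing the pattern in the tables below.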
Regression of Y on X
Parameter    True     LWD      AMI      RI       SRI      FIML
β0           25.12    25.31    20.06    25.31    25.36    25.31
β1 (X)       –0.51    –0.52    –0.25    –0.52    –0.52    –0.52
σε2          33.60    33.75    18.30    16.48    33.12    32.39

Regression of X on Y
γ0           24.74    24.76    24.76    28.06    24.80    24.82
γ1 (Y)       –0.32    –0.32    –0.32    –0.54    –0.32    –0.32
σr2          21.00    20.77    22.92    17.69    20.56    20.19

Note. LWD, listwise deletion; AMI, arithmetic mean imputation; RI, regression imputation; SRI, stochastic regression imputation; FIML, full-information maximum likelihood.
Missing data theory predicts that listwise deletion, stochastic regression imputation, and maximum
likelihood estimation are unbiased in large samples. The simulation bears this out, as
the average estimates are effectively identical to the true population parameters, even
with a small sample size and 50% missing data. As you might expect, mean imputation
and regression imputation were prone to substantial biases. To illustrate, the solid curve
in Figure 1.13 shows the sampling distribution of the correlation estimates for regres-
sion imputation, and the dashed curve shows the corresponding distribution for mean
imputation. Neither method did a good job of recovering the population correlation,
as the true value (the vertical line) was in the tails of both distributions. Although the
presence and magnitude of the biases varied across estimands, the simulation results
provide no support for these approaches on balance.
Although deletion appears to be just as good as maximum likelihood, leveraging the
full sample’s observed data generates estimates that are more precise, with less variation
across samples. The precision difference is dramatic for some estimands and modest for
others. To illustrate, the solid curve in Figure 1.14 is a kernel density plot displaying
the sampling distribution of the maximum likelihood mean estimates, and the dashed
curve shows the corresponding distribution for listwise deletion. As you can see, both
distributions are centered at the true value of 20, but the maximum likelihood estimates
are substantially closer to the truth, on average (e.g., the peak of the solid curve is higher
at the true value and its tails are less thick). As a second example, Figure 1.15 shows the
sampling distributions of the covariance. Maximum likelihood is again more precise,
but the difference is quite modest.
FIGURE 1.13. Kernel density plots of the correlation estimates from the MCAR computer sim-
ulation. The solid curve shows the sampling distribution of the regression imputation estimates,
and the dashed curve shows the corresponding mean imputation estimates. Neither distribution
is centered at the true value of –.40, indicating substantial nonresponse bias.
FIGURE 1.14. Kernel density plots of the X mean estimates from the MCAR computer simula-
tion. The solid curve shows the sampling distribution of the maximum likelihood estimates, and
the dashed curve shows the corresponding deletion estimates. Both distributions are centered at
the true value of 20, but the maximum likelihood estimates are substantially closer to the true
value, on average.
FIGURE 1.15. Kernel density plots of the covariance estimates from the MCAR computer
simulation. The solid curve shows the sampling distribution of the maximum likelihood esti-
mates, and the dashed curve shows the corresponding deletion estimates. Both distributions are
centered at the true value of –12.65, but the maximum likelihood estimates are slightly closer to
the true value, on average.
Missing at Random
The second simulation, which mimicked Figure 1.5, modeled a missing (always) at ran-
dom mechanism where the probability of a missing Y score increased as the value of X
decreased (e.g., depression scores were more likely to be missing for participants with
low perceived control over pain). Table 1.6 shows the average parameter estimates for
each method, along with their true values. Following the first simulation, mean impu-
tation and regression imputation estimates were prone to bias, and the results offer no
support for these procedures. A systematic missingness process was generally detrimen-
tal to the listwise deletion estimates as well. The notable exception was the regression
of Y on X, where complete-case analysis gives optimal estimates when missingness does
not depend on the outcome variable (Glynn & Laird, 1986; Little, 1992; von Hippel,
2007; White & Carlin, 2010). Finally, missing data theory again predicts that maximum
likelihood estimation and stochastic regression imputation should be unbiased in large
samples, and they are virtually so here. These results are consistent with published sim-
ulation studies showing that the percentage of missing data is not a strong determinant
of bias provided that the presumed mechanism is correct (Madley-Dowd, Hughes, Tilling, & Heron, 2019). However, although stochastic regression imputation gave point estimates equivalent to maximum likelihood, its standard errors and significance tests are untrustworthy without corrective procedures like the bootstrap.
Regression of Y on X
Parameter    True     LWD      AMI      RI       SRI      FIML
β0           25.12    25.09    17.46    25.09    25.10    25.09
β1 (X)       –0.51    –0.50    –0.19    –0.50    –0.50    –0.50
σε2          33.60    33.76    18.20    16.54    32.98    32.40

Regression of X on Y
γ0           24.74    25.89    23.37    27.99    24.79    24.77
γ1 (Y)       –0.32    –0.25    –0.25    –0.53    –0.32    –0.32
σr2          21.00    16.32    24.04    18.03    20.79    20.46

Note. LWD, listwise deletion; AMI, arithmetic mean imputation; RI, regression imputation; SRI, stochastic regression imputation; FIML, full-information maximum likelihood.
Parameter    True     LWD      AMI      RI       SRI      FIML
Regression of Y on X
β0           25.12    22.97    18.77    22.97    23.00    22.97    25.47
β1 (X)       –0.51    –0.40    –0.19    –0.40    –0.40    –0.40    –0.53
σε2          33.60    26.23    14.02    12.90    25.67    25.18    34.64
Regression of X on Y
γ0           24.74    24.80    24.82    28.61    25.03    25.04    24.94
γ1 (Y)       –0.32    –0.32    –0.32    –0.58    –0.34    –0.34    –0.33
σr2          21.00    21.14    23.71    19.08    21.55    21.21    20.35

Note. LWD, listwise deletion; AMI, arithmetic mean imputation; RI, regression imputation; SRI, stochastic regression imputation; FIML, full-information maximum likelihood.
The remainder of the chapter describes planned missing data designs that introduce
intentional missing values as a device for reducing respondent burden or lowering
research costs. The thought of intentionally creating missing values might seem odd at
first, but you are probably already familiar with the idea. For example, in a randomized
study with two treatment conditions, everyone has a hypothetical score from both con-
ditions, but participants only provide a response to their assigned condition. The unob-
served response to the other condition—the potential outcome or counterfactual—is
missing completely at random. Viewing randomized experiments as a missing data
problem is popular in the statistics literature and is a key component of Rubin’s causal
inference framework (Rubin, 1974; West & Thoemmes, 2010). The fractional factorial
(Montgomery, 2020) is another research design that yields MCAR values. With this
design, you purposefully select a subset of experimental conditions from a full facto-
rial scheme and randomly assign participants to a restricted combination of conditions.
Carefully omitting certain design cells saves resources by eliminating higher-order
effects that are unlikely to be present in the data. Finally, planned missingness designs
have long been a staple in educational testing applications, where examinees are admin-
istered a subset of test questions from a larger item bank (Johnson, 1992; Lord, 1962).
You likely encountered a variant of this approach if you took the Graduate Record Exam.
The advent of sophisticated missing data-handling methods prompted the devel-
opment of planned missingness designs that use intentional missing values to address
logistical and budgetary constraints (Graham, Taylor, & Cumsille, 2001; Graham
et al., 2006; Little & Rhemtulla, 2013; Raghunathan & Grizzle, 1995; Rhemtulla &
Hancock, 2016; Rhemtulla & Little, 2012; Silvia, Kwapil, Walsh, & Myin-Germeys,
2014). I describe three such designs in this section: multiform designs for questionnaire
data, wave missing data designs for longitudinal studies, and two-method measurement
designs that pair expensive and inexpensive measures of a construct. Importantly, these
designs cannot introduce bias, because they create patterns of unsystematically missing
values. Of course, introducing missing data necessarily reduces power, but the loss of
precision is surprisingly low in many cases.
Multiform Designs
Multiform planned missingness designs are most often associated with studies that use
lengthy surveys that comprise several questionnaires and many items. Respondent bur-
den is a major concern in these settings, because the number of items that people can
reasonably answer in a single sitting is limited. A multiform design addresses this issue
by administering multiple questionnaire forms that comprise different subsets of vari-
ables. For example, the classic three-form design (Graham et al., 1996, 2006) distributes
variables into four blocks (X, A, B, and C) that are allocated across three different ques-
tionnaire forms. Each form includes the X set and is missing the A, B, or C set. Table 1.8
shows the distribution of four blocks across the three forms, with O’s denoting observa-
tions and M’s indicating missing values, and Figure 1.1d shows a graphical schematic
of the design. If each variable set contains 25 questionnaire items, survey length is
reduced by 25%, and participants respond to 75 rather than 100 questions. Multiform
designs readily extend to include additional variable sets as needed.
For example, Table 1.9 shows a six-form design from Rhemtulla and Little (2012) where
respondents provide data on three out of five blocks, and Raghunathan and Grizzle
(1995) and Graham et al. (2006) describe designs with even more forms.
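The bookkeeping behind the three-form design is simple enough to sketch in a few lines of Python. The form-to-block mapping below follows the convention that form 1 omits the A block, form 2 omits B, and form 3 omits C (my own assumption about Table 1.8's layout); the helper name `assign_forms` is mine.

```python
import random

random.seed(2)

# The three-form design: every form includes the X block and omits exactly
# one of the A, B, or C blocks (assumed mapping for Table 1.8).
FORMS = {
    1: {"X", "B", "C"},   # form 1 is missing block A
    2: {"X", "A", "C"},   # form 2 is missing block B
    3: {"X", "A", "B"},   # form 3 is missing block C
}

def assign_forms(n):
    """Randomly assign n respondents to the three forms in near-equal thirds."""
    forms = [1, 2, 3] * (n // 3) + [1, 2, 3][: n % 3]
    random.shuffle(forms)
    return forms

# Each respondent answers 3 of the 4 blocks, a 25% reduction in length.
forms = assign_forms(250)
blocks_answered = [len(FORMS[f]) for f in forms]
```

With N = 250 the forms end up with 84, 83, and 83 respondents, and every respondent sees three of the four blocks.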
The main downside to multiform designs (and planned missingness designs in gen-
eral) is a reduction in statistical power. The impact of missing data on power and preci-
sion is complex and depends on the type of model and parameter being estimated (e.g.,
models with latent vs. manifest variables; correlations vs. regression slopes), as well as
the effect sizes within and between blocks (Rhemtulla, Savalei, & Little, 2016). Looking
at the percentage of observed responses for each variable or variable pair (sometimes
called covariance coverage) provides some insight. To illustrate, Table 1.10 shows the
covariance coverage rates for a three-form design with eight variables distributed equally
across four blocks. The cell percentages reflect three tiers of precision. All things being
equal, tests involving members of the X set (e.g., Y1 and Y2) have the most power, because
these variables are complete. Variable pairs with 33% missing data introduce a second,
lower tier of precision and power. This tier includes between-set associations involving
a member of the X set (e.g., Y1 and Y3) and within-set associations between variables in
the A, B, or C blocks (e.g., Y3 and Y4). Finally, the greatest reductions in power occur
when testing associations between variable pairs with 66% missing data. This includes
all between-set associations involving members of A, B, or C (e.g., Y3 and Y5).
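The three coverage tiers can be verified directly from the form layout. The sketch below (using the same assumed form-to-block mapping, with forms of equal size) computes the fraction of respondents observed on both members of a variable pair; the function name `coverage` is my own.

```python
# Which blocks each of the three equal-sized forms observes
# (assumed mapping: form 1 omits A, form 2 omits B, form 3 omits C).
FORMS = [{"X", "B", "C"}, {"X", "A", "C"}, {"X", "A", "B"}]

def coverage(block_a, block_b):
    """Fraction of forms (hence respondents) observing both blocks."""
    n_forms = sum(1 for f in FORMS if block_a in f and block_b in f)
    return n_forms / len(FORMS)

# The three tiers of pairwise coverage described in the text:
full_tier = coverage("X", "X")        # complete pairs within the X set
middle_tier = coverage("X", "A")      # X with A: two-thirds coverage
within_set = coverage("A", "A")       # within-set A pair: also two-thirds
lowest_tier = coverage("A", "B")      # between-set A with B: one-third coverage
```

The between-set pairs among A, B, and C are jointly observed on only one of the three forms, which is why they anchor the lowest precision tier.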
With these percentages in mind, we can devise strategies for distributing variables
to blocks in a way that mitigates rather than exacerbates the design’s natural inefficien-
cies. First, pairs of variables with strong associations should appear in different blocks
(Raghunathan & Grizzle, 1995; Rhemtulla & Little, 2012; Rhemtulla et al., 2016). This
makes intuitive sense, because a large effect size introduces redundancy that offsets
a lack of observations. This principle has important implications for studies that use
multiple-item scales to measure complex constructs, where items from the same scale
tend to have much stronger correlations than items belonging to different scales. Dis-
tributing a scale’s items across different sets maximizes power (Graham et al., 1996,
2006; Rhemtulla & Hancock, 2016; Rhemtulla & Little, 2012), especially when using
a latent variable model to examine associations among constructs (Rhemtulla et al.,
2016).
Pairs of variables with weak associations are good candidates for the fully complete
X set, because small effect sizes naturally require more data to achieve adequate power.
Additionally, Graham et al. (2006) recommend assigning key outcome variables to the
X set, as doing so maximizes power to test a study’s main substantive hypotheses. Ana-
lytic work from Rhemtulla et al. (2016) supports this recommendation, as the strategy
maximizes power to detect non-zero regression slopes. Including outcome variables in
the X set also ensures that two-way interaction effects are estimable (Enders, 2010).
Finally, the X set could also include potential determinants or correlates of unplanned
missing data, as conditioning on such variables is necessary to satisfy the MAR assump-
tion (Rhemtulla & Little, 2012). The power analyses in the next section highlight some
of these principles.
Longitudinal Designs
Respondent burden and budgetary constraints can be particularly acute in longitudi-
nal studies where researchers administer assessments repeatedly over time. Extending
the logic of the three-form design, Graham et al. (2001) described a number of wave
missing data designs where each participant provides data at a subset of measurement
occasions. Table 1.11 shows one such design that features seven random subgroups,
six of which have intentional missing data at one wave. Longitudinal planned missing-
ness designs can be especially efficient relative to their complete-data counterparts. For
example, applying the design in the table to the group-by-time interaction effect from
a linear growth curve model, Graham and colleagues showed that power was 94% as
large as that of a complete-data analysis. Other designs produce comparable power with
even fewer data points. In situations where the total number of assessments is fixed
(e.g., a grant budget can accommodate 1,000 assessments, each costing $100), Graham’s
chapter further showed that wave missing data designs can achieve higher power than
a corresponding complete-data design; that is, collecting incomplete data from 300 par-
ticipants can achieve higher power than collecting complete data from 250 participants.
Myriad configurations of patterns are possible with wave missing data designs, not all of
which are as beneficial as the ones described earlier. Computer simulation studies
provide details on a few possibilities (e.g., Graham et al., 2001; Mistler & Enders,
2011), and methodologists have outlined general strategies for identifying designs that
maximize efficiency in a particular scenario. Wu, Jia, Rhemtulla, and Little (2016)
developed a computer simulation tool for this purpose called SEEDMC (SEarch for Effi-
cient Designs using Monte Carlo Simulation). Their algorithm creates a design pool con-
taining all possible planned missingness designs with a given number of measurement
occasions, and it uses Monte Carlo computer simulations to create many artificial data
sets for each member of the pool. Fitting a longitudinal model to each artificial data set
and computing the sampling variation of the resulting estimates identifies designs with
the highest relative efficiency (i.e., lowest possible sampling variation). More recently,
Brandmaier, Ghisletta, and von Oertzen (2020) developed an analytic approach that
estimates the measurement error of the individual change rates from a given configura-
tion of measurement occasions. Their method selects the same optimal designs as Monte
Carlo computer simulations, but it does so without intensive computations. I illustrate a
combination of these strategies in Section 10.11.
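To illustrate what a "design pool" looks like, the toy enumeration below generates every pattern in which a participant is observed at exactly k of m waves. This is only the enumeration step, in the spirit of SEEDMC rather than the published algorithm; a real search would also simulate data for each design and compare efficiencies. The function name and the assumption that the seven-subgroup design in Table 1.11 comes from a six-wave study are mine.

```python
from itertools import combinations

def design_pool(m_waves, k_observed):
    """Every pattern in which a participant is observed at k of m waves."""
    return [set(waves) for waves in combinations(range(1, m_waves + 1), k_observed)]

# A design like Table 1.11's can be assembled from this pool: one
# complete-data subgroup plus the six patterns that each omit one wave
# (assuming a six-wave study).
complete_pattern = {1, 2, 3, 4, 5, 6}
one_wave_missing = design_pool(6, 5)
subgroups = [complete_pattern] + one_wave_missing
```

Enumerating the pool this way makes it clear why the search space grows quickly: with more waves and more allowable patterns per participant, the number of candidate designs explodes, which is what motivates automated tools like SEEDMC.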
Wave missing data designs are particularly useful for studies that examine change
following an intervention or a treatment. However, many researchers are interested in
developmental processes that involve age-related change (e.g., the development of read-
ing skills in early elementary school, the development of behavioral problems during
the teenage years). Cohort-sequential (Duncan, Duncan, & Hops, 1996; Nesselroade
& Baltes, 1979) or cross-sequential designs (Little, 2013; Little & Rhemtulla, 2013)
are ideally suited for this type of research question. This design requires multiple age
cohorts, each of which is followed over a fixed period. These shorter longitudinal stud-
ies combine to produce a much longer developmental span. To illustrate, Table 1.12
shows a cross-sequential design from a 3-year study with four age cohorts: 12, 13, 14,
and 15. Notice that each cohort has three waves of intentional missing data (e.g., the
12-year-olds have missing data at ages 15, 16, and 17; the 13-year-olds have missing data
at ages 12, 16, and 17; and so on).
The four 3-year studies combine to create a longitudinal design spanning 6 years,
but you must be careful analyzing the data, because several bivariate associations are
inestimable. For example, there are no data with which to estimate the correlation
between scores at ages 12 and 15, 13 and 16, 14 and 17, and so on. This feature rules out
popular multiple imputation procedures that array repeated measurements in columns
(e.g., Schafer, 1997; van Buuren, 2007). However, you can readily use maximum likeli-
hood or Bayesian estimation to fit growth models to the data, and multilevel imputation
schemes that nest repeated measurements within individuals are another possibility
(see Chapter 8).
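A quick way to see which associations the cross-sequential design can and cannot support is to check, for each pair of ages, whether any cohort observes both. The sketch below encodes the Table 1.12 layout (four cohorts, each observed at three consecutive ages); the helper name `pair_is_estimable` is my own.

```python
from itertools import combinations

# Four age cohorts from Table 1.12, each measured for 3 consecutive years:
# the 12-year-old cohort is observed at ages 12-14, and so on.
cohort_ages = {12: {12, 13, 14}, 13: {13, 14, 15},
               14: {14, 15, 16}, 15: {15, 16, 17}}

def pair_is_estimable(age_a, age_b):
    """A bivariate association requires at least one cohort observing both ages."""
    return any(age_a in ages and age_b in ages for ages in cohort_ages.values())

inestimable = [(a, b) for a, b in combinations(range(12, 18), 2)
               if not pair_is_estimable(a, b)]
```

Any two ages more than two years apart share no cohort, so their covariance is inestimable; this is exactly the feature that rules out wide-format imputation for these data.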
Two-Method Measurement Designs
There are at least two ways to analyze data from a two-method measurement design.
One approach is to cast the “gold standard” measure in the focal analysis model and use
the inexpensive measure as an auxiliary variable. As a preview, Figure 1.16a shows a
path diagram of the so-called extra dependent variable model (Graham, 2003) that
features the auxiliary variable (the inexpensive measure) as an additional outcome. The
idea is that the inexpensive measure transmits information to the expensive measure
(and thus enhances the power) via its mutual association with the predictor and a cor-
related residual term (the double-headed curved arrow connecting the residuals). If the
two measures can be cast as multiple indicators of the same construct, a second option
is to analyze the data with a latent variable model similar to the one in Figure 1.16b.
Graham et al. (2006) refer to this diagram as a bias-reduction model, because the cor-
related residual between the two inexpensive measures removes extraneous sources of
FIGURE 1.16. The top panel shows a path diagram of the extra dependent variable model, and
the bottom panel is a diagram of a bias-reduction model for a two-method measurement design
where inexpensive and expensive measures are indicators of a latent factor.
correlation that result from invalidity, thus improving the accuracy of the structural
regression coefficient connecting the covariate to the latent outcome. Graham et al.
(2006) and Rhemtulla and Little (2012) provide guidelines for determining the optimal
sample size ratio for the expensive measure, and Monte Carlo computer simulations are
also ideally suited for this task.
This final section illustrates a power analysis for a three-form design. Section 10.10
presents a similar power study for a longitudinal growth curve model with wave missing
data and unplanned missingness. I use computer simulations for this purpose, because
they are relatively easy to implement and are generally applicable to virtually any analy-
sis model. The goal of a computer simulation is to generate many artificial data sets with
known population parameters and examine the distributions of the estimates across
those many samples. In a power analysis, the focus shifts to significance tests, where the
simulation-based power estimate is the proportion of artificial data sets that produced
a significant test statistic.
The first step of a simulation is to specify hypothetical values for the population
parameters. This is especially important when planning a three-form design, because the
expected effect sizes dictate the assignment of variables to the four sets (e.g., variables
with strong associations can be exposed to large amounts of missingness). I take a some-
what different tack that holds effect size constant to illustrate the design’s natural tenden-
cies and highlight previous findings from the literature. To this end, I considered four nor-
mally distributed variables (one variable per set) with uniformly moderate correlations
equal to .30. The simulation created 5,000 random samples of N = 250 from this popula-
tion, and I subsequently deleted data according to the three-form design in Table 1.8.
Power depends, in part, on the type of parameter being estimated (e.g., the covari-
ance between two variables has different power than a regression slope). To illustrate
this point, I fit two models to each artificial data set: a saturated model consisting of a
mean vector and variance–covariance matrix, and a three-predictor linear regression
model with one of the variables arbitrarily designated as the outcome. The assignment
of the outcome variable to the four sets is an important consideration, so I further exam-
ined two design configurations: one with a complete predictor in the X set, and the other
with a complete outcome in the X set. Figure 1.17 shows path diagrams of the four possi-
bilities, with shaded rectangles representing blocks with missing data. I used maximum
likelihood estimation to fit the analysis models to the artificial data sets, and I recorded
the proportion of the 5,000 samples that produced statistically significant estimates.
This proportion is a simulation-based estimate of the probability of rejecting a false null
hypothesis. Maximum likelihood is the focus of the next two chapters, but for now it is
sufficient to know that the estimator leverages the full sample’s observed data without
discarding any information. Simulation scripts are available on the companion website.
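The logic of a simulation-based power estimate can be conveyed with a stripped-down sketch. The code below is not the companion website's script: it tests a single correlation with a pairwise-complete Pearson test rather than fitting the chapter's maximum likelihood models, so it only approximates the reported power values. The parameter values (ρ = .30, N = 250) follow the text; the coverage argument mimics a between-set pair that is jointly observed on one of three forms, and all function names are my own.

```python
import math
import random

random.seed(3)

def corr(pairs):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def simulate_power(rho=0.30, n=250, coverage=1.0, reps=500):
    """Proportion of replications with a significant correlation test.

    `coverage` is the fraction of the sample observed on both variables,
    e.g., about one-third for a between-set pair in a three-form design."""
    hits = 0
    n_pairs = int(n * coverage)
    for _ in range(reps):
        pairs = []
        for _ in range(n_pairs):
            x = random.gauss(0, 1)
            y = rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
            pairs.append((x, y))
        r = corr(pairs)
        t = r * math.sqrt(n_pairs - 2) / math.sqrt(1 - r ** 2)
        if abs(t) > 1.96:  # large-sample two-tailed criterion, alpha = .05
            hits += 1
    return hits / reps

power_between_set = simulate_power(coverage=1 / 3)  # pair observed on 1 of 3 forms
power_complete = simulate_power(coverage=1.0)       # no missing data
```

Even this crude version reproduces the qualitative pattern in the text: the between-set correlation retains power near .80 despite two-thirds of the score pairs being missing, while the complete-data analysis is essentially at ceiling.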
Table 1.13 gives power estimates for each correlation and regression slope along
with the corresponding power values for optimal analyses with no missing data. To
facilitate interpretation, the power ratios reflect complete-data power relative to that of
FIGURE 1.17. Path diagrams depicting two analysis models and two configurations of planned
missing data. The four sets of the three-form designs are color coded, with shaded rectangles rep-
resenting blocks with missing data.
a planned missingness design (e.g., 1.20 means that a complete-data analysis has 20%
more power). Table 1.13 illustrates several important points, all of which echo findings
from the literature. First, notice that power estimates differ by estimand, with regres-
sion slopes exhibiting lower power than correlations. This isn’t necessarily surprising
given that the coefficients reflect partial associations, but it nevertheless highlights the
importance of considering different analyses that will be performed on the incomplete
data. Second, correlations involving a complete variable in the X set (e.g., the association
in the first row of the table) experienced virtually no reduction in power, even though
Regression slopes
               No missing    Predictor in X set    Outcome in X set
               data          Power     Ratio       Power     Ratio
X1 → Y         .85           .61       1.40        .62       1.37
X2 → Y         .86           .40       2.12        .63       1.36
X3 → Y         .86           .40       2.14        .64       1.35
33% of the other variable’s scores were missing (e.g., the power advantage of a complete-
data analysis was only about 2%). Third, correlations involving variable sets AB, AC, or
BC (e.g., the correlation between X2 and X3) still had adequate power, with values above .80,
even though only 33% of score pairs were complete (see Table 1.10). Finally, the bottom
section of Table 1.13 illustrates that assigning the outcome variable to the complete X set
uniformly improves the power of all regression slopes, whereas assigning a predictor to
the X set benefits only that covariate’s slopes. As noted previously, assigning outcomes
to the X set also ensures that all two-way interactions are estimable.
This chapter described the theoretical underpinnings for missing data analyses, as out-
lined by Rubin and colleagues (Little & Rubin, 1987; Mealli & Rubin, 2016; Rubin,
1976). This work classifies missing data problems according to three different processes
that link missingness to the data: an unsystematic or haphazard missing completely at
random (MCAR) mechanism, a systematic conditionally missing at random (CMAR)
process where missingness relates only to the observed data, and a systematic missing
not at random (MNAR) mechanism where unseen score values determine missingness.
From a practical perspective, these mechanisms function as statistical assumptions for
a missing data analysis, and they also help us understand why not to use older methods
like deletion and single imputation with a mean or predicted value.
Looking forward, most of the book is devoted to methods that naturally require a
conditionally MAR assumption—maximum likelihood, Bayesian estimation, and mul-
tiple imputation. This mechanism is reasonable for many applications, and flexible soft-
ware options abound. Chapter 9 describes how to modify these methods to model different
MNAR processes. In the near term, maximum likelihood estimation is the next topic
on the docket. Chapter 2 describes the full information estimator for complete data,
and Chapter 3 applies the method to missing data problems. Finally, I recommend the
following articles for readers who want additional details on topics from this chapter:
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6, 330–351.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data
designs in psychological research. Psychological Methods, 11, 323–343.
Madley-Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data
should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiol-
ogy, 110, 63–73.
Olinsky, A., Chen, S., & Harlow, L. (2003). The comparative efficacy of imputation methods for
missing data in structural equation modeling. European Journal of Operational Research,
151, 53–79.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychologi-
cal Methods, 7, 147–177.
2
Maximum Likelihood Estimation
Maximum likelihood is the go-to estimator for many common statistical models, and
it is one of the three major pillars of this book. As its name implies, the estimator
identifies the population parameters that are most likely responsible for a particular
sample of data. I spend most of the chapter unpacking this statement for analyses with
normally distributed outcomes. Not only are such models exceedingly common across
many different substantive disciplines, but the normal curve also appears prominently
throughout the book as a distribution for missing values. As such, this chapter sets up
a lot of later material. For now, I focus on complete-data maximum likelihood analyses,
but all the major ideas readily generalize to missing data, and much of Chapter 3 tweaks
concepts from this chapter.
The chapter begins with a simple univariate example that illustrates the mechanics
of estimation and builds to multiple regression. As you will see, maximum likelihood
estimates are equivalent to those of ordinary least squares, as both approaches iden-
tify estimates that minimize squared distances to the data points, albeit in different
ways. After describing significance tests and corrective procedures for non-normal data,
I illustrate estimation for a mean vector and variance–covariance matrix. This multi-
variate analysis lays the groundwork for missing data handling in models with general
missing data patterns. Although I mostly discuss models with analytic solutions for the
estimates, I introduce iterative optimization algorithms in this chapter, as they will be
the norm with missing data.
Probability distributions and likelihood functions play a prominent role throughout the
book, so it is important to introduce these concepts early and establish some recurring
notation. A binary outcome with score values of 0 and 1 provides a simple platform for
exploring some key ideas. As the name implies, a probability distribution is a math-
ematical function that describes the relative frequency of different score values. The
Bernoulli distribution below describes the probability of the two scores:
$$
f(Y_i \mid \pi) = \pi^{Y_i} (1 - \pi)^{(1 - Y_i)} =
\begin{cases} \pi & \text{if } Y_i = 1 \\ 1 - \pi & \text{if } Y_i = 0 \end{cases}
\tag{2.1}
$$
The function on the left side of the equation says that the probability of a particular
score value depends on the unknown population proportion π to the right of the vertical
pipe (the pipe means “conditional on” or “depends on”). The right side of the equation
gives the rules for computing the two probabilities.
To provide a substantive context, I use the math achievement data on the com-
panion website. Among other things, the data set includes pretest and posttest math
achievement scores and academic-related variables (e.g., math self-efficacy, standard-
ized reading scores, sociodemographic variables) for a sample of N = 250 students (see
Appendix). One of the variables in the data is a binary indicator that measures whether
a student is eligible for free or reduced-price lunch (0 = no assistance, 1 = eligible for
free or reduced-price lunch). Hypothetically, suppose we knew that the true proportion
of eligible students in the population is π = .45. Figure 2.1 displays the probability dis-
tribution as a bar graph, and its mathematical description is f(Yi|π = .45). I use generic
function notation f(∙) throughout the book to represent the height of a distribution or
curve at some value on its horizontal axis, so “f of something” always refers to vertical
elevation. In this example, f(Yi|π = .45) is just a fancy way of referencing the vertical
height of the bars in Figure 2.1.
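Equation 2.1 is easy to verify numerically. The sketch below evaluates the Bernoulli distribution at both score values with π = .45; the function name `bernoulli_probability` is my own.

```python
def bernoulli_probability(y, pi):
    """Height of the Bernoulli distribution in Equation 2.1:
    pi when y = 1, and 1 - pi when y = 0."""
    return (pi ** y) * ((1 - pi) ** (1 - y))

# The two bars in Figure 2.1, with true proportion pi = .45:
p_eligible = bernoulli_probability(1, 0.45)      # about 0.45
p_not_eligible = bernoulli_probability(0, 0.45)  # about 0.55
```

The two probabilities sum to 1, which is the defining feature of a probability distribution discussed next.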
The figure and previous equation highlight the defining feature of a probability
distribution: Probabilities must sum to 1. The same is true for continuous probability
distributions like the normal curve, where the area under the curve must equal 1. We
will encounter many different curves and functions throughout the book, not all of
which are probability distributions. The likelihood is one important example. Returning
to Equation 2.1, the function on the left side of the expression has two inputs inside the
parentheses: data values and a parameter. The ordering of the two inputs implies that
the data values vary, but the parameter to the right of the vertical pipe (the “conditional
on” symbol) functions as a known constant; that is, the probability distribution says
how likely certain scores are given an assumed value for π.
After collecting data, the function is “reversed” by treating scores as known and
varying the parameter π. The resulting likelihood function describes the relative fre-
quency of different parameter values given the observed data. For example, suppose that
we collect data from a single student who is eligible for free or reduced-price lunch (i.e.,
Y = 1). Reversing the role of the data and the parameter in the function gives the follow-
ing likelihood expression:
$$
L_i(\pi \mid Y_i = 1) = \pi^{1} (1 - \pi)^{0} = \pi
\tag{2.2}
$$
Maximum Likelihood Estimation 49
FIGURE 2.1. The probability distribution for a binary variable that measures whether a stu-
dent is eligible for free or reduced-price lunch. The bar graph corresponds to a distribution
where the true proportion is π = .45.
The left side of the equation now says that the likelihood of a particular parameter value
depends on the observed data. Consistent with the previous function notation, “L of
something” references the height of the distribution at a particular value on the horizon-
tal axis, but the abscissa now reflects all possible values of π between 0 and 1.
To illustrate the effect of reversing the function’s arguments, Figure 2.2 graphs the
likelihood in Equation 2.2 across the entire range of π. The height of the graph—the
likelihood of the parameter given the observed data— quantifies the data’s support
for every possible value of π. Two points are worth highlighting. First, the probabil-
ity distribution of the data is discrete, but the likelihood function is a continuous
distribution. Second, notice that the function defines a triangle with an area equal to
0.50. Thus, by the previous definition, the likelihood is not a probability distribution,
because the area under the function does not equal 1. This distinction is important,
because it is incorrect to say that Li(π|Yi) describes the probability of the parameter
given the data—that interpretation is reserved for a Bayesian analysis. Rather, you
should view likelihood as a function that describes the data’s evidence or support for
different parameter values. As you will see later in the chapter, the likelihood function
provides the mathematical machinery for identifying parameter values that maximize
fit to the observed data.
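Both points can be checked numerically. The sketch below evaluates Equation 2.2 over a grid of π values and approximates the area under the function with the trapezoid rule, confirming that the triangle's area is 0.50 rather than 1; the grid resolution is an arbitrary choice of mine.

```python
# Evaluate L(pi | Y = 1) = pi^1 * (1 - pi)^0 over a fine grid of pi values.
grid = [i / 1000 for i in range(1001)]
likelihood = [(pi ** 1) * ((1 - pi) ** 0) for pi in grid]  # Equation 2.2

# Trapezoid-rule approximation of the area under the likelihood function.
area = sum((likelihood[i] + likelihood[i + 1]) / 2 * (grid[i + 1] - grid[i])
           for i in range(1000))
```

The area comes out at 0.50, so the likelihood cannot be read as a probability distribution for π, and the function increases monotonically toward π = 1, the value best supported by a single observation of Y = 1.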
FIGURE 2.2. The likelihood function describing the relative frequency of different parameter
values given a single observation where Y = 1. The height of the graph—the likelihood of the
parameter given the observed data—quantifies the data’s support for every possible value of π.
$$
f(Y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left(-\frac{(Y_i - \mu)^2}{2\sigma^2}\right)
\tag{2.3}
$$
where Yi is the outcome score for participant i (e.g., a student’s math posttest score), and
μ and σ2 are the population mean and variance. To reiterate some important notation,
the function on the left side of the equation can be read as “the relative probability of
a score given assumed values for the parameters.” Visually, “f of Y” is the height of the
normal curve at a particular score value on the horizontal axis. Dissecting the right side
of the expression, the kernel inside the exponential function defines the curve’s shape.
Notice that the main component is a squared z-score that quantifies the standardized
distance between a score and the mean. Finally, the fraction to the left of the exponential
function is a scaling term that ensures that the area under the curve sums or integrates
to 1. This scaling term makes the function a probability distribution.
From the previous section, you know that a probability distribution treats scores as
variable and parameters as known constants. To illustrate, assume that the true popula-
tion parameters are μ = 56.79 and σ2 = 87.72 (these happen to be the maximum likeli-
hood estimates). Next, consider two math scores, Y1 = 53 and Y2 = 45. Substituting these
scores and the parameter values into Equation 2.3 gives f(Y = 53|μ, σ2) = 0.039 and
f(Y = 45|μ, σ2) = 0.019. As seen in Figure 2.3, “f of something” refers to the height of the
normal curve at a particular score value on the horizontal axis. Although these verti-
cal coordinates look like probabilities, they are not—the probability of any one score is
effectively 0, because the horizontal axis can be sliced into a countless number of infini-
tesimally small intervals. Rather, the height coordinates represent relative probabilities.
For example, it is incorrect to say that 3.9% of all students from this population have a
FIGURE 2.3. A normal distribution with parameters μ = 56.79 and σ2 = 87.72. The black dots
are the relative probabilities for two math scores: f(Y = 53|μ, σ2) = 0.039 and f(Y = 45|μ, σ2) =
0.019.
test score of 53, but you can say that a score of 53 is about twice as likely as a score of 45,
because its vertical elevation is twice as high.
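The two height coordinates are straightforward to reproduce from Equation 2.3. The sketch below evaluates the normal density at both scores with μ = 56.79 and σ² = 87.72; the function name `normal_density` is my own.

```python
import math

def normal_density(y, mu, sigma2):
    """Height of the normal curve (Equation 2.3) at score y."""
    return (1 / math.sqrt(2 * math.pi * sigma2)) * \
        math.exp(-((y - mu) ** 2) / (2 * sigma2))

# The two math scores from the text, with mu = 56.79 and sigma2 = 87.72:
f53 = normal_density(53, 56.79, 87.72)  # about 0.039
f45 = normal_density(45, 56.79, 87.72)  # about 0.019
ratio = f53 / f45  # a score of 53 is roughly twice as likely as 45
```

The ratio of the two heights is roughly 2, which is exactly the relative-probability interpretation described above.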
$$
L_i(\mu, \sigma^2 \mid Y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left(-\frac{(Y_i - \mu)^2}{2\sigma^2}\right)
\tag{2.4}
$$
where Li represents one observation’s support for a particular combination of the mean
and variance. The likelihood expression might seem like a notational sleight of hand
since the right side of the expression is identical to Equation 2.3. However, the notation
on the left side of the equation signals an important shift: The probability distribu-
tion views scores as hypothetical and parameters as known, whereas likelihood views
parameters as hypothetical and scores as known. Applied to the math achievement data,
Equation 2.4 quantifies the degree to which one observation from this sample supports
different values of μ and σ2.
Identifying the maximum likelihood estimates requires a summary measure that
quantifies the entire sample’s evidence about the unknown parameter values. From
probability theory, the product of individual probabilities describes the joint occurrence
of a set of independent events. For example, the probability of flipping a fair coin twice
and observing two heads in a row is .50 × .50 = .25. Applying this rule to the individual
likelihood expressions gives the sample likelihood function.
\[ L\left(\mu, \sigma^2 \mid \text{data}\right) = \prod_{i=1}^{N} L_i\left(\mu, \sigma^2 \mid Y_i\right) \tag{2.5} \]
Extending previous ideas, the likelihood quantifies a particular sample’s support for dif-
ferent values of μ and σ2. Visually, the likelihood function describes a three-dimensional
surface with the population mean and variance on the horizontal and depth axes and
L as the height of the surface at each unique combination of the two parameters. It
is important to reiterate that the likelihood function is not a probability distribution,
because the area under the surface does not equal 1.
Applying Equation 2.5 to the math data involves multiplying 250 very small num-
bers, each of which requires many decimals to achieve good precision. As you can imag-
ine, the resulting product is infinitesimally small. Taking the natural logarithm of the
relative probabilities provides a more tractable metric. This transformation maps prob-
abilities onto the negative side of the number line, with higher probabilities taking on
“less negative” values than lower probabilities. To illustrate, reconsider the pair of math
scores and the parameters from the previous example: Y1 = 53, Y2 = 45, μ = 56.79, and σ2 = 87.72.

Maximum Likelihood Estimation 53

Transforming the relative probabilities to the logarithmic scale gives ln(0.039) =
–3.24 and ln(0.019) = –3.96. Figure 2.4 shows that –3.24 and –3.96 also represent height
coordinates, but the log transformation has changed the normal curve to a parabola.
Nevertheless, the conclusion is the same: A score of 53 is more likely than a score of 45.
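The underflow problem motivating the log transformation is easy to demonstrate. The Python sketch below is illustrative (not from the book): multiplying 250 small density values collapses to exactly zero in double-precision arithmetic, whereas the equivalent sum of logarithms stays finite.

```python
import math

# Multiplying many small relative probabilities underflows to zero in
# double-precision arithmetic, whereas summing their logarithms does not.
# The density value 0.03 is an illustrative stand-in for a typical height.
density = 0.03
n_obs = 250

product = density ** n_obs           # about 1e-381: smaller than any representable float
log_sum = n_obs * math.log(density)  # the same quantity on the log scale
```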
Working with logarithms changes the structure of the likelihood, because the loga-
rithm product rule says to add rather than multiply the transformed likelihood values
(i.e., ln(A × B) = ln(A) + ln(B)). Applying the product rule gives the log-likelihood func-
tion below.
\[ LL\left(\mu, \sigma^2 \mid \text{data}\right) = \sum_{i=1}^{N} \ln\!\left( L_i\left(\mu, \sigma^2 \mid Y_i\right) \right) = \sum_{i=1}^{N} \ln\!\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(Y_i - \mu)^2}{2\sigma^2}\right) \right) \]
\[ = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\!\left(\sigma^2\right) - \frac{1}{2}\left(\sigma^2\right)^{-1} \sum_{i=1}^{N} (Y_i - \mu)^2 \tag{2.6} \]
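As a check on the algebra, the following Python sketch (with simulated rather than the book's data) confirms that the expanded form of Equation 2.6 matches the sum of the individual log densities.

```python
import math
import random

# Verify that the expanded log-likelihood (Equation 2.6) equals the sum of
# individual log densities. The 250 scores are simulated, not the book's.
random.seed(1)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]

def loglik(mu, sigma2, data):
    n = len(data)
    ss = sum((yi - mu) ** 2 for yi in data)  # sum of squared deviations
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2) - ss / (2 * sigma2)

direct = sum(math.log((1 / math.sqrt(2 * math.pi * 87.72))
                      * math.exp(-(yi - 56.79) ** 2 / (2 * 87.72))) for yi in y)
closed = loglik(56.79, 87.72, y)
```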
FIGURE 2.4. Natural logarithm of a normal distribution with parameters μ = 56.79 and σ2 =
87.72. The black dots represent the natural log of two relative probabilities: ln(.039) = –3.24 and
ln(.019) = –3.96.
Figure 2.5 plots the log-likelihood surface for the math data, and Figure 2.6 displays its contours from the perspective of a drone hovering over the peak of the surface, with smaller contours denoting higher elevation (and vice versa). The data’s support for the parameters increases as
the contours get smaller, and the maximum likelihood estimates are located at the peak
of surface, shown as a black dot. The goal of estimation is to identify the parameter val-
ues at that coordinate.
As you might have surmised, the log-likelihood value will always be a large nega-
tive number, because it sums individual fit values that are themselves usually negative
numbers. For example, the peak of the function in the previous figures has a vertical
elevation of LL = –913.999, and the log-likelihood values decrease (i.e., become more
negative) as μ and σ2 move away from their optimal values for the data. Several factors
influence the log-likelihood value (e.g., the sample size, the number of variables, the
amount of missing data), and there is no cutoff that determines good or bad fit to the
data. However, we can use the log-likelihood to make relative judgments about different
candidate parameter values. These relative fit assessments are an integral part of estima-
tion and hypothesis testing.
FIGURE 2.5. Bivariate log-likelihood surface for different values of μ and σ2. The height of the
surface represents the data’s support for different combinations of the mean and the variance.
Note that the floor of the function is located well below the minimum value on the vertical axis.
FIGURE 2.6. Contours of the log-likelihood surface at different values of μ and σ2. The plot
conveys the perspective of a drone hovering over the peak of the log-likelihood surface, with
smaller contours denoting higher elevation (and vice versa). The height of the surface represents
the data’s support for different combinations of parameter values, and the maximum likelihood
estimates are located at the peak of surface (shown as a black dot).
The key take-home message thus far is that “reversing” a probability distribution by
treating the observed data as known constants defines a log-likelihood function that
measures the data’s support for different candidate parameter values. The goal of esti-
mation is to identify the parameter values that maximize the log-likelihood function,
as these are the values that garner the most support from the data. Visually, this corre-
sponds to finding the peak of the three-dimensional surface in Figures 2.5 and 2.6. The
resulting estimates are optimal in the sense that they minimize the sum of the squared
z-scores in the normal distribution function. There are three main ways to find the max-
imum likelihood estimates: (a) a grid search that computes the log-likelihood value for
each unique combination of the parameter values, (b) an analytic solution that provides
an equation for solving the estimates, and (c) an iterative optimization algorithm. The
first approach is usually too unwieldy and inefficient for practical applications, but it is
a good starting point for this simple example, because it illustrates important concepts.
I describe analytic solutions and optimization algorithms later in the chapter.
To illustrate the mechanics of a grid search, Table 2.1 shows individual and sample
log-likelihood values at five different estimates of the population mean (to keep the
illustration simple, I held the variance constant at its maximum likelihood estimate). As
you might expect, an individual’s contribution to the log-likelihood differs across the
five estimates, because a given score offers more support for some parameter values than
others (i.e., the standardized distances from the scores to the center of the normal curve
change with different values of μ). The summary log-likelihood values in the bottom
row of Table 2.1 similarly fluctuate as a function of the population mean. As explained
previously, the log-likelihood summarizes the data’s support for a particular combina-
tion of parameter values, such that higher (i.e., less negative) values reflect better fit to
the data. If the five means in the table were our only options, we would choose μ̂ = 57
as the maximum likelihood estimate, because this parameter value maximizes fit to the
sample data (i.e., minimizes the sum of the squared z-scores).
Next, I conducted a comprehensive grid search that varied the population mean
in tiny increments of 0.01 and plotted the resulting log-likelihood values in Figure 2.7.
As you can see, the function resembles a hill or a parabola, with the optimal parameter
value located at its peak. This brute-force estimation process revealed that the curve’s
highest elevation, LL = –913.999, is located at μ = 56.79, and no other value of the mean
has more support from the data. As such, μ̂ = 56.79 is the maximum likelihood estimate
of the mean, or the population parameter with the highest probability of producing this
sample of 250 math scores. I applied the same grid search procedure to the variance after
fixing the mean at its maximum likelihood estimate. Figure 2.8 shows the resulting log-
likelihood function.

FIGURE 2.7. Likelihood function with respect to the mean, holding the variance constant at its sample estimate. The log-likelihood on the vertical axis represents the data’s support for a particular parameter value. The peak of the function is the maximum likelihood estimate of the mean.

FIGURE 2.8. Likelihood function with respect to the variance, holding the mean constant at its maximum likelihood estimate. The log-likelihood on the vertical axis represents the data’s support for a particular parameter value. The maximum likelihood estimate of the variance is located at the peak of the function.

Although the function looks very different—the right skew owes to the fact that the variance is bounded at 0 on the low end—the graph nevertheless
displays the data’s support for different parameter values. The brute-force grid search
revealed that σ̂2 = 87.72 is the maximum likelihood estimate of the variance.
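The brute-force procedure can be sketched in a few lines of Python. The data are simulated here, so the recovered estimate matches the simulated sample's mean rather than the book's 56.79; the grid range and 0.01 step mirror the text's illustration.

```python
import math
import random

# Brute-force grid search for the ML estimate of the mean, holding the
# variance fixed at its ML estimate, mirroring the text's illustration.
# The 250 scores are simulated here, not the book's data file.
random.seed(2)
y = [random.gauss(56.79, math.sqrt(87.72)) for _ in range(250)]
n = len(y)
ybar = sum(y) / n
sigma2 = sum((yi - ybar) ** 2 for yi in y) / n  # ML variance estimate

def loglik_mu(mu):
    ss = sum((yi - mu) ** 2 for yi in y)
    return -n / 2 * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

# candidate means from 40 to 75 in steps of 0.01
grid = [40 + 0.01 * k for k in range(3501)]
best_mu = max(grid, key=loglik_mu)
```

Because the log-likelihood for the mean is a concave parabola, the grid's winner lands within one step of the value that an analytic solution returns.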
You can imagine that a grid search quickly becomes impractical as the number of model
parameters increases. A second approach is to derive an equation that gives an ana-
lytic solution for the maximum likelihood estimates. Although this strategy has limited
applications, the mechanics of getting the solution—in particular, leveraging calculus
derivatives—sets the stage for the iterative optimization algorithms that I discuss later
in the chapter.
To begin, a first derivative is a slope coefficient. Returning to Figure 2.7, the log-
likelihood function is a parabolic curve. Imagine using a magnifying glass to zoom in
on the log-likelihood function within a very narrow slice along the horizontal axis.
Although the entire function has substantial curvature, magnifying the log-likelihood
at a particular point on the curve would reveal a straight line. Thus, you can think of
the curved function in Figure 2.7 as stringing together a sequence of very tiny straight
lines, the direction and magnitude of which vary as you move from left to right on the
horizontal axis. These linear slopes are the first derivatives of the function. To infuse a
bit more precision, the first derivative is the slope of a line that is tangent to the function
at a particular value on the horizontal axis. To illustrate, Figure 2.9 shows the deriva-
tives at five values of μ. I refer to these slopes as the first derivatives of the log-likelihood
function with respect to the mean, because the variance (the other unknown quantity in
the function) is held constant. First derivatives are central to finding an equation for
the maximum likelihood estimates, and they also appear prominently in the iterative
optimization algorithms I discuss later in the chapter.
Moving from left to right across Figure 2.9, the derivatives decrease in magnitude
(i.e., the slopes flatten) as elevation rises, and the slope is exactly 0 at the function’s
peak. The fact that the first derivative is 0 at the point on the function directly above the
maximum likelihood estimate suggests that we can set the derivative expression to 0
and solve for the unknown parameter. First, we need the derivative equations. I give the
expressions below, and introductory calculus resources catalog the differential calculus
rules for getting the first derivatives of a function. To begin, the first derivative of the
log-likelihood function with respect to the mean (i.e., the linear slopes in Figure 2.9) is
as follows:
\[ \frac{\partial LL}{\partial \mu} = \left(\sigma^2\right)^{-1} \sum_{i=1}^{N} (Y_i - \mu) \tag{2.7} \]
In words, the left side of the expression reads “the first derivative of the log-likelihood
function with respect to the mean,” where ∂ is a common symbol for a derivative, and
the fraction denotes the differential operator. Setting the right side of the equation equal
to 0 and solving for μ gives the maximum likelihood estimate of the mean.
FIGURE 2.9. Likelihood function with respect to the mean, holding the variance constant at
its maximum likelihood estimate. The dashed lines represent first derivatives, or slopes of lines
tangent to the function at each black dot.
\[ \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \bar{Y} \tag{2.8} \]
Notice that μ̂ is identical to the familiar formula for the arithmetic mean. Consistent
with the previous grid search, applying the expression to the math posttest scores gives
a maximum likelihood estimate of μ̂ = 56.79.
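The analytic logic is easy to verify numerically. The Python sketch below (simulated data, illustrative) shows that the first-derivative expression in Equation 2.7 equals zero at the arithmetic mean and changes sign on either side of it.

```python
import random

# The first derivative of the log-likelihood with respect to the mean
# (Equation 2.7) equals zero at the arithmetic mean, which is therefore
# the ML estimate. The data are simulated for illustration.
random.seed(3)
y = [random.gauss(56.79, 9.37) for _ in range(250)]
sigma2 = 87.72  # held fixed; it only rescales the slope

def score_mu(mu):
    return sum(yi - mu for yi in y) / sigma2

ybar = sum(y) / len(y)  # the analytic solution from Equation 2.8
```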
Differentiating the log-likelihood function with respect to the variance gives the
slopes of tangent lines at different points on the function in Figure 2.8.
\[ \frac{\partial LL}{\partial \sigma^2} = -\frac{N}{2}\left(\sigma^2\right)^{-1} + \frac{1}{2}\left(\sigma^2\right)^{-2} \sum_{i=1}^{N} (Y_i - \mu)^2 \tag{2.9} \]
Again, setting the right side of the equation equal to 0 and solving for σ2 gives the maximum
likelihood estimate of the variance.
\[ \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - \hat{\mu}\right)^2 \tag{2.10} \]
Notice that the maximum likelihood solution has N rather than N – 1 in the denominator. We know that applying the equation for the population variance to sample data produces a negatively biased estimate, although the bias diminishes as the sample size increases.
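The distinction between the N and N – 1 denominators can be illustrated with a short Python sketch on simulated data (illustrative, not from the book).

```python
import random

# The ML variance (Equation 2.10) divides by N, whereas the familiar
# unbiased sample variance divides by N - 1; the two differ by (N - 1)/N.
random.seed(4)
n = 250
y = [random.gauss(56.79, 9.37) for _ in range(n)]
ybar = sum(y) / n

var_ml = sum((yi - ybar) ** 2 for yi in y) / n            # ML estimate
var_unbiased = sum((yi - ybar) ** 2 for yi in y) / (n - 1)  # unbiased estimate
```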
The log-likelihood function provides a mechanism for estimating standard errors, and
this, too, relies on calculus derivatives. The process lends itself well to graphical dis-
plays, so I interleave a conceptual description with the technical details. To set the stage,
Figure 2.10 shows the log-likelihood functions for two data sets with the same mean but
different variance. The solid curve, which is identical to Figure 2.7, corresponds to the
math posttest data, and the flatter dashed function comes from a data set with 50% more
variance (i.e., σ̂2 = 131.58 vs. 87.72).
The curvature of the log-likelihood function (i.e., its steepness or flatness) deter-
mines the precision of the maximum likelihood estimate at its peak. To understand why
this is the case, recall that the log-likelihood quantifies the data’s evidence for different
candidate parameter values. Looking at the solid curve in Figure 2.10, you can see that the
FIGURE 2.10. Log-likelihood functions for two data sets with the same mean but different
variance. The solid curve, which is identical to Figure 2.7, corresponds to the math posttest data,
and the flatter dashed function comes from a data set with 50% more variance.
data’s support for competing parameter values decreases rapidly as μ moves away from its
optimal value in either direction. In contrast, the dashed curve is much flatter, meaning
that the data provide similar support for a range of parameter values near the peak. As
such, the steeper function reflects a more precise estimate with a smaller standard error.
This makes intuitive sense if you think about estimation as a hiker trying to climb to the
highest possible elevation on a mountain. A climber standing at the top of a steep peak
would be very certain about reaching the exact summit, because elevation drops quickly
in every direction, whereas a climber standing on a flatter plateau would be less confident
about the summit’s precise location. To apply this idea to data, we need to figure out how
to quantify curvature of the log-likelihood and translate that into a standard error.
Second Derivatives
Measuring curvature and computing standard errors requires the second derivatives of
the log-likelihood function. These second derivatives, which are also slope coefficients,
have an intuitive visual interpretation. To illustrate, Figure 2.11 displays the first deriva-
tives of the two log-likelihood functions from Figure 2.10. Moving from left to right,
the linear slopes along the steep curve vary substantially, changing from large positive
values on the left to large negative values on the right. Conversely, the slopes along the
FIGURE 2.11. Log-likelihood functions for two data sets with the same mean but different
variance. The straight lines represent first derivatives. The steep function has rapidly changing
first derivatives and thus a large second derivative, whereas the flatter function has a smaller
second derivative, because its slopes don’t change as much near the peak.
flatter curve exhibit less variability, ranging from moderately positive to moderately
negative. Mathematically, a second derivative captures the rate at which the first deriva-
tive slopes change across the log-likelihood function. For example, the steep function
in Figure 2.11 has rapidly changing first derivatives and thus a large second derivative.
Conversely, the flatter function has a smaller second derivative, because its slopes don’t
change as much near the function’s peak.
Second derivatives can be confusing, because they are metaquantities that capture
the rate of change in the linear slopes; that is, they are equations that give the slope of the
slopes. A regression analogy is useful for sorting this out. Returning to Figure 2.9, you can
think of the curve as a nonlinear regression line that predicts the log-likelihood at different
values of the parameter (i.e., the parameter is the predictor variable, and the log-likelihood
is the outcome). The linear term from this regression, which is the first derivative, tells us
how much the log-likelihood changes for an infinitesimally small increase in the parame-
ter. The second derivative is also a slope from a regression, but that regression now predicts
the first derivatives at different values of the parameter (i.e., the parameter is the predictor
variable, and the first derivative is the outcome). Because the linear slopes change at a
constant rate across the parabolic function, the second derivative reflects the change in the
slope for each one-unit increase in the population mean. The regression analogy highlights
that first and second derivatives are just the same concept applied to different variables.
To illustrate second derivatives more concretely, reconsider the first derivative slope
expression from Equation 2.7. We know that substituting μ̂ = 56.79 into the formula (i.e.,
evaluating the function at the maximum likelihood estimate) returns a slope coefficient
of 0. Next, we can use the expression to compute the first derivative after increasing
or decreasing the mean by 1 point. Starting with the steep curve in Figure 2.11, sub-
stituting μ = 55.79 and 57.79 into the equation gives first derivatives equal to +2.85 and
–2.85, respectively. Thus, we can verify that a one-unit increase in the population mean
changes the first derivative (i.e., the slope of the log-likelihood at a particular point) by
–2.85. This value is the second derivative! Moving to the flatter function, substituting
the same two estimates into the equation gives first derivatives equal to +1.90 and –1.90,
respectively. A one-unit increase in the population mean now induces smaller changes
in the linear slopes, because the log-likelihood function is less peaked. As you can see,
larger second derivatives (in absolute value) reflect greater curvature and more preci-
sion, whereas smaller second derivatives imply less curvature.
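The finite-difference check from the preceding paragraph can be reproduced using only the example's summary statistics (N = 250, mean 56.79, variance 87.72); the Python sketch below is illustrative.

```python
# Reproduce the text's finite-difference check using the example's summary
# statistics. Equation 2.7 reduces to N * (ybar - mu) / sigma2 because the
# deviations sum to N times the distance between the sample mean and mu.
n, ybar, sigma2 = 250, 56.79, 87.72

def score_mu(mu):
    return n * (ybar - mu) / sigma2

slope_below = score_mu(55.79)  # one point below the ML estimate
slope_above = score_mu(57.79)  # one point above the ML estimate
second_deriv = (slope_above - slope_below) / 2.0  # change per one-unit step
```

The computed slopes match the ±2.85 values in the text, and their rate of change equals the analytic second derivative –N/σ2.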
I previously explained that second derivatives are the same concept as a first deriva-
tive but applied to a different dependent variable (a function of the original function).
As such, getting the second derivatives involves applying differential calculus rules to
the slope equations from Equations 2.7 and 2.9. To begin, the second derivative of the
log-likelihood function with respect to the mean (i.e., the curvature of the function in
Figure 2.7) is as follows:
\[ \frac{\partial^2 LL}{\partial \mu^2} = -\frac{N}{\sigma^2} \tag{2.11} \]
Substituting σ̂2 = 87.72 (the maximum likelihood estimate) and N = 250 into the expres-
sion verifies the earlier conclusion that the second derivative equals –2.85. The second
derivative of the log-likelihood function with respect to the variance (i.e., the curvature
of the function in Figure 2.8) is as follows:
\[ \frac{\partial^2 LL}{\partial \left(\sigma^2\right)^2} = \frac{N}{2}\left(\sigma^2\right)^{-2} - \left(\sigma^2\right)^{-3} \sum_{i=1}^{N} (Y_i - \mu)^2 \tag{2.12} \]
Substituting σ̂2 = 87.72 and N = 250 in the expression gives a second derivative equal to
–.016. Because the log-likelihood function in Figure 2.8 has multiple bends, the rate of
change in the linear slopes is no longer constant going from left to right. Thus, we need
to view the second derivative as curvature at the function’s peak. Again, you can think
of this number (in absolute value) as the estimate’s precision.
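Plugging the example's values into Equations 2.11 and 2.12 is a one-line computation in each case; the Python sketch below (illustrative) reproduces the –2.85 and –.016 values reported in the text.

```python
# Evaluate the second-derivative formulas (Equations 2.11 and 2.12) at the
# book's estimates. At the ML solution the sum of squared deviations equals
# N times the variance estimate, which supplies the summation term.
n, sigma2 = 250, 87.72
ss = n * sigma2  # sum of squared deviations at the ML estimates

d2_mean = -n / sigma2                                    # Equation 2.11
d2_variance = n / 2 * sigma2 ** -2 - sigma2 ** -3 * ss   # Equation 2.12
```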
You probably noticed that the values of the second derivative were both negative.
In fact, this is not a coincidence, as the sign of the second derivative signals whether a
solution corresponds to the maximum or the minimum of a function. To understand
why this is the case, imagine a U-shaped log-likelihood function that is a mirror image
of the parabola in Figure 2.7. When applied to a U-shaped function, the first derivative
takes on a value of 0 at the lowest point on the curve (i.e., the bottom of a valley instead
of the peak of a hill). The sign of the second derivative differentiates the minimum and
maximum of a function and thus tells us whether an estimate is located at the bottom
of a trough or the peak of a hill. To illustrate, imagine traversing a U-shaped function
moving from left to right. Contrary to the derivatives displayed in Figure 2.9, the linear
slopes from an inverted function change from large negative values to large positive
values; that is, a one-unit increase to the parameter increases rather than decreases the
slopes, thus giving a positive second derivative. Consequently, the fact that the second
derivatives were negative is important, because it signals that the estimates are, in fact,
located at the peak of the surface.
Multiplying a second derivative by –1 and taking the reciprocal of the result gives the sampling variance of the corresponding estimate. Applying this rule to the second derivative with respect to the mean gives

\[ \text{var}(\hat{\mu}) = -\left( -\frac{N}{\hat{\sigma}^2} \right)^{-1} = \frac{\hat{\sigma}^2}{N} \tag{2.13} \]

and applying it to the second derivative with respect to the variance gives

\[ \text{var}\!\left(\hat{\sigma}^2\right) = -\left( \frac{N}{2}\left(\hat{\sigma}^2\right)^{-2} - \left(\hat{\sigma}^2\right)^{-3} \sum_{i=1}^{N} \left(Y_i - \hat{\mu}\right)^2 \right)^{-1} = \left( -\frac{N}{2}\left(\hat{\sigma}^2\right)^{-2} + \left(\hat{\sigma}^2\right)^{-3} \sum_{i=1}^{N} \left(Y_i - \hat{\mu}\right)^2 \right)^{-1} \tag{2.14} \]
Finally, taking the square root of the sampling variance gives the standard error. Notice
that the square root of Equation 2.13 is the familiar formula for the standard error of
the mean, σ̂ ÷ √N.
To illustrate standard error computations, reconsider the two log-likelihood func-
tions in Figure 2.11. The steeper curve corresponds to the math achievement data from
the companion website, which has a variance σ̂2 = 87.72. Substituting this estimate into
Equation 2.13 gives a sampling variance equal to var(μ̂) = 0.35 and a standard error
equal to SEμˆ = 0.59. Consistent with the usual interpretation of a standard error, 0.59 is
the expected difference between the maximum likelihood estimate and the true popula-
tion mean, or the standard deviation of estimates from many random samples of size
250. As a comparison, the dashed curve corresponds to a transformed data set with 50%
more variance. Substituting σ̂2 = 131.58 into Equation 2.13 returns a sampling variance
and standard error equal to var(μ̂) = 0.53 and SEμˆ = 0.73, respectively. These results rein-
force the previous conclusion that steeper functions with more curvature reflect greater
precision and smaller standard errors.
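The two standard errors can be reproduced directly from the curvature formula for the mean; the Python sketch below (illustrative) uses the values reported in the text.

```python
import math

# Standard errors implied by the curvature formulas: the sampling variance
# of the mean is sigma2 / N (Equation 2.13), so flatter log-likelihoods
# (larger variance) give larger standard errors. Values follow the text.
n = 250
se_steep = math.sqrt(87.72 / n)   # math posttest data
se_flat = math.sqrt(131.58 / n)   # transformed data with 50% more variance
```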
With two unknown parameters, there is also a cross-partial derivative obtained by differentiating the log-likelihood with respect to both the mean and the variance.

\[ \frac{\partial^2 LL}{\partial \mu \, \partial \sigma^2} = -\left(\sigma^2\right)^{-2} \sum_{i=1}^{N} (Y_i - \mu) \tag{2.15} \]

The left side of the equation reads “first differentiate the log-likelihood with respect to the mean, then differentiate the resulting expression with respect to the variance” (or vice versa).
Next, the second derivatives and the cross-product terms are stored in a symmetric
matrix known as the Hessian.
\[ \mathbf{H}_O(\theta) = \begin{bmatrix} -\dfrac{N}{\sigma^2} & -\left(\sigma^2\right)^{-2} \displaystyle\sum_{i=1}^{N} (Y_i - \mu) \\[2ex] -\left(\sigma^2\right)^{-2} \displaystyle\sum_{i=1}^{N} (Y_i - \mu) & \dfrac{N}{2}\left(\sigma^2\right)^{-2} - \left(\sigma^2\right)^{-3} \displaystyle\sum_{i=1}^{N} (Y_i - \mu)^2 \end{bmatrix} \tag{2.16} \]
Notice that the diagonal elements contain the second derivatives from Equations 2.11
and 2.12, and the new addition from Equation 2.15 appears in the off-diagonal elements.
The subscript on HO indicates that the derivative equations depend on the observed
data (an alternate approach described below replaces data values with the expectations
or averages), and θ denotes the parameter values. Substituting the maximum likelihood
estimates into the expressions gives HO(θ̂).
Computing standard errors involves the same three steps as before. First, multiply-
ing the matrix of second derivatives by –1 gives the observed information matrix.
\[ \mathbf{I}_O(\hat{\theta}) = -\mathbf{H}_O(\hat{\theta}) \tag{2.17} \]
As before, this step rescales the derivatives so that large positive values reflect greater
precision or confidence in the estimates. Second, taking the inverse of the information
matrix (the matrix analogue of a reciprocal) gives the variance–covariance matrix of
the parameter estimates.
\[ \hat{\mathbf{S}}(\hat{\theta}) = \mathbf{I}_O^{-1}(\hat{\theta}) \tag{2.18} \]
The parameter covariance matrix for the univariate analysis has sampling variances on
the diagonal and the covariance between the two estimates in the off-diagonal elements.
\[ \hat{\mathbf{S}}(\hat{\theta}) = \begin{bmatrix} \text{var}(\hat{\mu}) & \text{cov}\!\left(\hat{\mu}, \hat{\sigma}^2\right) \\[1ex] \text{cov}\!\left(\hat{\sigma}^2, \hat{\mu}\right) & \text{var}\!\left(\hat{\sigma}^2\right) \end{bmatrix} = \begin{bmatrix} 0.35 & 0 \\ 0 & 61.55 \end{bmatrix} \tag{2.19} \]
You can see that the covariance between the mean and variance is 0, because the devia-
tion scores in the Hessian’s off-diagonal sum to 0. The independence of the mean and
variance (or more generally, a model’s mean parameters and its variance–covariance
parameters) is a well-known feature of maximum likelihood estimation. As you will
see in the next chapter, this independence doesn’t necessarily hold with missing data
(Kenward & Molenberghs, 1998; Savalei, 2010). Finally, taking the square root of the
sampling variances on the diagonal of the variance–covariance matrix gives the standard errors (e.g., SEμ̂ = √0.35 = 0.59 and SEσ̂2 = √61.55 = 7.85).

With complete data, standard errors based on the observed and expected information are often equivalent and produce identical standard errors. However, the two approaches are not always the same with missing data (Kenward & Molenberghs, 1998; Savalei, 2010).
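The three-step pipeline (Hessian, information matrix, inverse) can be sketched with the example's summary values. The Python code below is illustrative; the off-diagonal elements are zero because the deviations sum to zero at the ML estimates.

```python
# Assemble the observed information and the parameter covariance matrix for
# the example (Equations 2.16-2.19). At the ML estimates the deviations sum
# to zero, so the Hessian's off-diagonal elements vanish.
n, sigma2 = 250, 87.72
ss = n * sigma2  # sum of squared deviations at the ML solution

hessian = [[-n / sigma2, 0.0],
           [0.0, n / 2 * sigma2 ** -2 - sigma2 ** -3 * ss]]
info = [[-h for h in row] for row in hessian]         # Equation 2.17
cov = [[1 / info[0][0], 0.0], [0.0, 1 / info[1][1]]]  # inverse of a diagonal matrix

se_mean = cov[0][0] ** 0.5
se_variance = cov[1][1] ** 0.5
```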
Revisiting the Hessian matrix in Equation 2.16, the second derivatives reflect sum-
mations across the N scores. To see how expected information works, it is useful to look
at a single observation’s contribution to these sums.
\[ \mathbf{H}_O(\theta) = \sum_{i=1}^{N} \begin{bmatrix} -\left(\sigma^2\right)^{-1} & -\left(\sigma^2\right)^{-2} (Y_i - \mu) \\[1ex] -\left(\sigma^2\right)^{-2} (Y_i - \mu) & \dfrac{1}{2}\left(\sigma^2\right)^{-2} - \left(\sigma^2\right)^{-3} (Y_i - \mu)^2 \end{bmatrix} \tag{2.20} \]
The expected information matrix invokes a computational shortcut that replaces (Yi – μ)
and (Yi – μ)2 with their expectations or long-run averages.
\[ E(Y_i - \mu) = 0 \qquad E\!\left[(Y_i - \mu)^2\right] = \sigma^2 \tag{2.21} \]
\[ \mathbf{H}_E(\theta) = \sum_{i=1}^{N} \begin{bmatrix} -\left(\sigma^2\right)^{-1} & 0 \\[1ex] 0 & -\dfrac{1}{2}\left(\sigma^2\right)^{-2} \end{bmatrix} \tag{2.22} \]
Substituting the maximum likelihood estimates into the Hessian and multiplying the
matrix by –1 gives the expected information matrix.
\[ \mathbf{I}_E(\hat{\theta}) = -\mathbf{H}_E(\hat{\theta}) = \begin{bmatrix} \dfrac{N}{\hat{\sigma}^2} & 0 \\[1ex] 0 & \dfrac{N}{2}\left(\hat{\sigma}^2\right)^{-2} \end{bmatrix} \tag{2.23} \]
Finally, taking the inverse of the information matrix gives the variance–covariance
matrix of the estimates, the diagonal of which contains squared standard errors.
As you can see, the expected information is simpler to compute, because it does
not rely on the raw data. With complete data, standard errors based on the observed
and expected information are often indistinguishable, as they are in this example. This
equality doesn’t necessarily hold with missing data, as the expectations in Equation 2.21
require an MCAR process where missingness is unrelated to the data. In contrast, stan-
dard errors based on the observed information assume a less stringent MAR mechanism
where missingness depends on the observed data. Simulation results favor standard
errors based on observed information (Kenward & Molenberghs, 1998; Savalei, 2010),
so I strictly rely on this approach.
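The complete-data equality is easy to demonstrate: the Python sketch below (simulated data, illustrative) builds the variance element of both information matrices at the ML estimates and shows they coincide, because substituting the estimates makes the data-dependent term equal its expectation.

```python
import random

# With complete data, the observed and expected information coincide at the
# ML estimates: the sum of squared deviations equals N times the variance
# estimate, which is exactly its expectation. Simulated data, illustrative.
random.seed(5)
n = 250
y = [random.gauss(56.79, 9.37) for _ in range(n)]
mu = sum(y) / n
ss = sum((yi - mu) ** 2 for yi in y)  # sum of squared deviations
sigma2 = ss / n                       # ML variance estimate

obs_info_var = -(n / 2 * sigma2 ** -2 - sigma2 ** -3 * ss)  # observed (from Equation 2.12)
exp_info_var = n / 2 * sigma2 ** -2                         # expected (Equation 2.23)
```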
The normal curve plays an integral role in every phase of maximum likelihood estima-
tion, as its log-likelihood function provides a basis for identifying the optimal estimates
for the data and computing standard errors. Of course, non-normal data are exceedingly
common, and some authors argue that normality is the exception rather than the rule
(Micceri, 1989). Depending on the analysis model, maximum likelihood estimates may
still be consistent when normality is violated, meaning that they converge to their true
population values as the sample size increases (Yuan, 2009b; Yuan & Bentler, 2010).
However, standard errors and significance tests are almost certainly compromised.
This section describes two alternate (and very different) strategies for estimating
sampling variation when normality is violated: so-called “robust” or sandwich estimator
standard errors (Freedman, 2006; Greene, 2017; White, 1980) and bootstrap resampling
(Efron, 1987; Efron & Gong, 1983; Efron & Tibshirani, 1993). These methods have a
long history in the literature and a substantial body of literature that generally supports
their use (Arminger & Sobel, 1990; Enders, 2001; Finch, West, & MacKinnon, 1997;
Gold & Bentler, 2000; Hancock & Liu, 2012; Rhemtulla, Brosseau-Liard, & Savalei,
2012; Savalei & Falk, 2014; Yuan, 2009b; Yuan & Bentler, 2000, 2010; Yuan, Bentler,
& Zhang, 2005; Yuan, Yang-Wallentin, & Bentler, 2012). I discuss analogous corrective
procedures for significance tests later in the chapter.
The sandwich estimator gets its name from the structure of its formula: the normal-theory covariance matrix from Equation 2.18 forms the outer pieces of “bread,” and the “meat”
in the middle of the sandwich is a new matrix that captures deviations between the data
and the assumed normal distribution. The sandwich estimator covariance matrix is
\[ \hat{\mathbf{S}}(\hat{\theta}) = \text{bread} \times \text{meat} \times \text{bread} = \mathbf{I}_O^{-1}(\hat{\theta}) \, \hat{\mathbf{S}}_S(\hat{\theta}) \, \mathbf{I}_O^{-1}(\hat{\theta}) \tag{2.24} \]
where IO(θ) is the information matrix from Equation 2.17, and the meat in the middle
term is a new covariance matrix based on first derivatives (described below).
Revisiting Equations 2.7 and 2.9, the first derivative or slope expressions reflect
summations across the N scores. To illustrate the composition of the meat term, we need
to look at a single observation’s contribution to these equations. Arranging the terms in
an array gives the so-called score vector for a single observation.
\[ \mathbf{S}_i(\theta) = \begin{bmatrix} \left(\sigma^2\right)^{-1} (Y_i - \mu) \\[1ex] -\dfrac{1}{2}\left(\sigma^2\right)^{-1} + \dfrac{1}{2}\left(\sigma^2\right)^{-2} (Y_i - \mu)^2 \end{bmatrix} \tag{2.25} \]
The meat of the sandwich is the variance–covariance matrix of these score vectors eval-
uated at the maximum likelihood estimates (i.e., the Sˆ S( θˆ ) term in Equation 2.24).
To understand how the formula works, you need to know that IO(θ̂) and Sˆ S( θˆ ) both esti-
mate the information matrix, albeit in different ways. When the data are normal, the two
matrices are equivalent and effectively cancel out when multiplying one by the inverse
of the other (the resulting product is an inert identity matrix), leaving only the normal-
theory covariance matrix from Equation 2.18. In contrast, when the data are non-normal,
the product of the two matrices has diagonal elements that reflect the relative magnitude
of the two information matrices, and this array serves to rescale the parameter covari-
ance matrix in a way that compensates for kurtosis. Returning to the score vector in
Equation 2.25, notice that the first derivative expressions include deviation scores. When
the data are leptokurtic, the thicker tails produce a higher proportion of large deviation
scores than a normal curve, and multiplying the first piece of bread by the meat returns
a matrix containing large diagonal values that inflate the parameter covariance matrix
(the rightmost piece of bread). In contrast, when the data are platykurtic, the distribution
has fewer extreme scores than a normal curve, and the bread × meat product returns a
matrix with fractional values that attenuate the covariance matrix elements.
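For the univariate mean, the sandwich construction can be sketched in a few lines of Python (simulated data, illustrative names). The computation shows the point made above: for the mean of a normal model, bread × meat × bread collapses back to the normal-theory sampling variance.

```python
import random

# A minimal sandwich-estimator sketch (Equation 2.24) for the normal-model
# mean: bread is the inverse information, and meat is the sum of squared
# score contributions (the first element of Equation 2.25). Simulated data.
random.seed(6)
n = 250
y = [random.gauss(56.79, 9.37) for _ in range(n)]
mu = sum(y) / n
sigma2 = sum((yi - mu) ** 2 for yi in y) / n  # ML variance estimate

bread = sigma2 / n                                 # inverse information for the mean
meat = sum(((yi - mu) / sigma2) ** 2 for yi in y)  # sum of squared scores
sandwich_var = bread * meat * bread

# For the mean, the product reduces to the normal-theory sigma2 / n,
# matching the text's point that this parameter is unaffected.
```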
Recall from the earlier example that the normal-theory standard errors for the
mean and variance were SEμ̂ = √0.35 = 0.59 and SEσ̂2 = √61.55 = 7.85, respectively. The
sandwich estimator covariance matrix for the same data is as follows:
\[ \hat{\mathbf{S}}(\hat{\theta}) = \begin{bmatrix} \text{var}(\hat{\mu}) & \text{cov}\!\left(\hat{\mu}, \hat{\sigma}^2\right) \\[1ex] \text{cov}\!\left(\hat{\sigma}^2, \hat{\mu}\right) & \text{var}\!\left(\hat{\sigma}^2\right) \end{bmatrix} = \begin{bmatrix} 0.35 & 0.17 \\ 0.17 & 60.30 \end{bmatrix} \tag{2.26} \]
Taking the square root of the diagonal elements gives SEμ̂ = √0.35 = 0.59 and SEσ̂2 = √60.30 = 7.77. This example highlights two points. First, the standard error of the mean is the same in both cases, because this parameter is unaffected by the robustification process (White, 1982; Yuan et al., 2005). Second, the standard error of the variance barely
changes, because the data are essentially normal (as noted previously, the sandwich
estimator simplifies to the conventional covariance matrix in this case). More generally,
a divergence between the two covariance matrices would likely signal a model mis-
specification (e.g., the normal distribution is a poor approximation for the data; King &
Roberts, 2015; White, 1982).
Bootstrap Resampling
Bootstrap resampling (Efron, 1987; Efron & Gong, 1983; Efron & Tibshirani, 1993) is a
second approach to generating standard errors that are robust to normality violations.
The bootstrap uses Monte Carlo computer simulation to generate an empirical sampling
distribution of each parameter estimate, the standard deviation of which is the stan-
dard error. This section describes a so-called “naive bootstrap” that generates standard
errors, and modifications to the basic procedure can also generate sampling distribu-
tions of test statistics (Beran & Srivastava, 1985; Bollen & Stine, 1992; Enders, 2002;
Hancock & Liu, 2012; Savalei & Yuan, 2009).
The basic idea behind the bootstrap is to treat the observed data as a surrogate
for the population and draw B samples of size N with replacement; that is, after being
selected for a bootstrap sample, each observation returns to the surrogate population
and is eligible to be chosen again. The sampling with replacement scheme ensures that
some data records appear more than once in each sample, whereas others do not appear
at all. To illustrate, Table 2.2 shows five bootstrap samples from a small toy data set with
10 observations. Drawing many bootstrap samples (e.g., B > 2,000) and fitting a model
to each data set gives an empirical sampling distribution of the estimates. The standard
deviation of B estimates is the bootstrap standard error
$$SE_{\hat\theta} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta_b - \bar\theta\right)^2} = SD_{\hat\theta} \quad (2.27)$$
where θ̂_b is the maximum likelihood estimate from sample b, and θ̄ is the average estimate across the B samples. Finally, the 2.5 and 97.5% quantiles of the empirical distribu-
tion (i.e., the estimates that separate the most extreme 2.5% of the lower and upper tails
of the distribution) define a 95% confidence interval. Unlike their theoretical counter-
parts, bootstrap confidence intervals need not be symmetric around the average point
estimate.
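The naive bootstrap recipe translates directly into code. This Python sketch is an illustration of the procedure, not the companion website's scripts (the name `naive_bootstrap_se` is my own); it draws B samples with replacement and applies Equation 2.27:

```python
import numpy as np

def naive_bootstrap_se(y, estimator, n_boot=2000, seed=0):
    """Naive bootstrap: resample N cases with replacement, re-estimate,
    and take the standard deviation of the B estimates (Equation 2.27).
    Also returns the percentile 95% confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Sampling with replacement: some rows repeat, others never appear
    estimates = np.array([estimator(y[rng.integers(0, n, n)])
                          for _ in range(n_boot)])
    se = estimates.std(ddof=1)
    # 2.5 and 97.5% quantiles need not be symmetric around the estimate
    ci = np.quantile(estimates, [0.025, 0.975])
    return se, ci
```

For a sample mean, the bootstrap standard error should land close to the textbook value s/√N, which provides a quick sanity check on an implementation like this.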
This chapter focuses primarily on analyses with analytic solutions for the maximum
likelihood estimates. Beyond the univariate example, this includes linear regression
models and multivariate analyses involving a mean vector and covariance matrix.
Many, if not most, applications of maximum likelihood do not have analytic solutions, and even the tidy problems from this chapter become messy later with missing data. In those situations, iterative optimization algorithms are needed to locate the estimates. I describe two such algorithms in this chapter, gradient ascent and Newton's method,
and in Chapter 3, I describe the expectation maximization (EM) algorithm (Dempster,
Laird, & Rubin, 1977; Rubin, 1991).
Returning to the log-likelihood function in Figure 2.7, an optimization algorithm
tasked with finding the maximum likelihood estimates is like a hiker trying to reach
the summit of a mountain. The hiker could start the trek at different trailheads, and that
starting point would dictate the direction of travel and rate of ascent. Similarly, optimi-
zation algorithms need initial guesses about the parameter values, and software defaults
could generate starting values on either side of the hill. The first derivative is like a compass
in the sense that its sign tells the algorithm the direction it needs to travel to reach the
curve’s maximum elevation. For example, starting the climb at μ = 45 requires positive
adjustments to the parameter, whereas starting at μ = 65 requires negative adjustments.
The starting coordinates also dictate the size of the hiker’s steps. If the trek begins far
from the peak, the hiker can take big steps without worrying about missing the summit.
In contrast, the surface flattens near the top where very tiny steps are needed to find the
exact location of the peak. The size of each step links to the magnitude of the derivatives
in Figure 2.9, with larger slopes inducing bigger steps, and slopes closer to 0 requiring
very small steps.
Gradient Ascent
Gradient ascent (or equivalently, gradient descent, if you invert the log-likelihood func-
tion) is a good starting point for exploring iterative optimization, because it parallels
the hiking analogy. Starting with an initial guess about the parameter, the algorithm
takes repeated steps in the direction of the maximum until it finds the optimal estimate for
the data. The iterative recipe for gradient ascent is straightforward: At each iteration,
compute an updated estimate that equals the previous estimate plus some adjustment,
the size of which depends on the first derivative or slope. More formally, the updating
step is
$$\text{new estimate} = \text{current estimate} + \text{step size} \quad (2.28)$$

$$\theta^{(t+1)} = \theta^{(t)} + \left(\frac{\partial LL}{\partial \theta} \times \text{constant}\right)$$
where θ denotes the parameter of interest, t indexes the iterations, and the step size term
in parentheses is the first derivative (evaluated at the current estimate) times a small
constant, sometimes referred to as the learning rate.
To illustrate iterative optimization, I applied gradient ascent to the mean (to keep
the illustration simple, I held the variance at its maximum likelihood estimate). A cus-
tom R program is available on the companion website for readers interested in coding
the algorithm by hand. To begin, I initiated the process with a starting value of μ(0) = 0
and a constant learning rate of .25 (the constant is usually some small value between 0
and 1). Substituting the initial parameter value into the first derivative expression from
Equation 2.7 (i.e., evaluating the function at μ = 0) gives a slope equal to 161.86. The
huge positive slope implies a correspondingly large positive adjustment to the param-
eter. Multiplying the derivative by the learning rate gives a step size equal to 161.86 ×
.25 = 40.47 and an updated parameter value equal to μ(1) = 40.47. The new estimate is
closer to the peak, so the slope coefficient decreases in magnitude to 46.53. Repeating
the process gives a step size equal to 46.53 × .25 = 11.63 and an updated estimate equal
to μ(2) = 52.10.
Table 2.3 gives the parameter updates, first derivatives, and log-likelihood values
from 17 iterations. As you can see, the first few cycles produced steep slope coefficients
and large adjustments to the parameter. The vertical elevation of the log-likelihood also
increased rapidly as the algorithm took large strides toward the peak. In contrast, the
final few iterations induced very small adjustments to the mean, and changes to the log-
likelihood were in the 10th decimal. Continuing to iterate until the derivative equals 0 is
inefficient and unnecessary, because any additional improvement to the estimate would
be infinitesimally small (e.g., after 17 iterations, the estimate is changing in the seventh
decimal place). Instead, I terminated the iterations when the estimates from consecutive
steps differed by less than .000001, as changes of this magnitude effectively signal that
the algorithm has reached the summit.
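The updating recipe can be made concrete with a short program. The Python sketch below is my own illustration with synthetic data (not the companion website's R program); it optimizes a normal mean with the variance held fixed, mirroring the example above. One caveat worth encoding: the steps overshoot and diverge unless rate × N/σ² stays below 2.

```python
import numpy as np

def gradient_ascent_mean(y, s2, mu=0.0, rate=0.25, tol=1e-6, max_iter=1000):
    """Gradient ascent for a normal mean with the variance held fixed.

    Update rule (Equation 2.28): mu_new = mu + dLL/dmu * rate.
    Note: requires rate * N / s2 < 2, or the updates overshoot the peak.
    """
    for _ in range(max_iter):
        slope = np.sum(y - mu) / s2  # first derivative of the log-likelihood
        new_mu = mu + slope * rate   # step size = slope times learning rate
        if abs(new_mu - mu) < tol:   # consecutive estimates nearly identical
            return new_mu
        mu = new_mu
    return mu
```

As in the text, the algorithm terminates when consecutive estimates differ by less than the tolerance, at which point the estimate should match the sample mean (the analytic maximum likelihood solution).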
Newton’s Algorithm
Gradient ascent is useful for establishing some intuition about iterative optimization,
but the simple variant I describe here can be slow to converge and may not converge
at all when variables have different scales. Newton’s algorithm (also known as the
Newton–Raphson algorithm) similarly parallels the hiking analogy, but it uses a more
complex formulation for the step size that requires first and second derivatives. The
upside of this additional complexity is that the updating step naturally provides the
building blocks for computing standard errors after the final iteration. To illustrate the
basic ideas, reconsider the log-likelihood function with respect to the variance in Figure
2.8. Although the log-likelihood is a complex curve with multiple bends, magnifying a
graph of the function at its maximum would reveal a simpler curved line that resembles
a quadratic function (i.e., an inverted U, or a parabola). Leveraging this idea, Newton’s
algorithm uses the first and second derivative values (i.e., the linear slope and curvature
at a specific point on the function) to construct a parabolic curve that extends from the
current parameter value toward the log-likelihood’s peak. The apex of each quadratic
function represents the algorithm’s best guess about the maximum likelihood estimate
at a particular iteration, and this temporary peak becomes the updated parameter value
for the next iteration.
Figure 2.12 shows the log-likelihood function, with black dots denoting four con-
secutive parameter values. The three dashed lines are quadratic curves assembled from
the first and second derivative formulas. To illustrate the iterative updates, suppose that
the optimizer begins its ascent from a starting value of σ2(0) = 50. A black dot appears
on the log-likelihood function at this coordinate, and the leftmost dashed curve (the
smallest of the three) is the parabolic function that projects from the starting value. The
dashed curve is trying to approximate what the log-likelihood function looks like near
its summit, and the apex of the curve represents the parabola’s best guess about the
maximum likelihood estimate at the initial iteration. The peak of the quadratic curve,
located at σ2(1) = 65.03, becomes the new estimate for the next iteration. Repeating the
process, the algorithm substitutes the updated estimate into the first and second deriva-
tive expressions and uses the resulting quantities to project another quadratic function
from the new coordinate. The middle of the three dashed curves shows the parabola for
this step, the peak of which is located at σ2(2) = 78.40. Similarly, the rightmost dashed
curve shows the quadratic approximation at the third iteration, the maximum of which
corresponds to σ2(3) = 85.93. You can see that the dashed curves become wider and flatter as elevation increases, such that each successive update does an increasingly better job at approximating the shape of the log-likelihood function near its peak. After a few more iterations, the algorithm locates the summit.

FIGURE 2.12. The likelihood function with respect to the variance, holding the mean constant at its maximum likelihood estimate. The black dots represent four consecutive updates to the variance beginning at the starting value σ2(0) = 50. The three dashed lines are quadratic curves assembled from the first and second derivative formulas, and the peak of each parabola identifies the updated parameter value at the next iteration.
More formally, the jump from the current to the updated parameter value is as fol-
lows:
$$\text{new estimate} = \text{current estimate} + \text{step size} \quad (2.29)$$

$$\theta^{(t+1)} = \theta^{(t)} - \left(\frac{\partial^2 LL}{\partial \theta^2}\right)^{-1} \frac{\partial LL}{\partial \theta}$$
The step size, computed as the ratio of the first and second derivatives at the current parameter value θ^(t), corresponds to the horizontal distance between the current estimate
and the peak of the projected quadratic curve. In effect, Newton’s algorithm is breaking
the total vertical elevation into several smaller hikes, and the derivative terms function
as a wayfinder that plots the route to each intermediate peak. The updating step readily
extends to more complex models with multiple parameters. In this case, the multivari-
ate updating equation is
$$\boldsymbol\theta^{(t+1)} = \boldsymbol\theta^{(t)} - \mathbf{H}^{-1}\left(\boldsymbol\theta^{(t)}\right)\mathbf{S}\left(\boldsymbol\theta^{(t)}\right) \quad (2.30)$$
where θ is a vector of parameter values, t indexes the iterations, S(θ(t)) is the vector of first derivatives (i.e., the score vector, computed as the sum of Equation 2.25 across all N observations), and H(θ(t)) is the Hessian matrix of second derivatives.
To illustrate a multivariate optimization scheme, I used Newton’s algorithm to esti-
mate the mean and variance of the math posttest scores. A custom R program is available
on the companion website for readers interested in coding the algorithm by hand. In this
example, S(θ) is a vector containing the slope expressions from Equations 2.7 and 2.9,
and H(θ) is the second derivative matrix from Equation 2.16. The multivariate updating
scheme is virtually identical to the univariate scheme depicted in Figure 2.12, except
that each parameter’s parabolic approximation now accounts for the associations in the
Hessian’s off-diagonal. Table 2.4 shows the iterative updates from a climb initiated at
(terrible) starting values of μ(0) = 0 and σ2(0) = 1. Notice that the algorithm immediately
locates the optimal estimate of the mean after the first update. Returning to Figure
2.7, the log-likelihood with respect to the mean is itself a parabolic function, so the
optimizer can immediately predict the peak of the function from any starting value. In
contrast, the algorithm requires 17 iterations to locate the optimal value of the variance.
Consistent with gradient ascent, you can see that the optimizer makes large adjustments
at first and very small alterations as it approaches the peak.
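The multivariate updating step can be sketched in code. The Python fragment below is my own illustration with synthetic data (not the math posttest example, and not the companion R program); it also adds a step-halving safeguard that the chapter does not discuss, because a naive Newton step from a terrible starting value can push the variance negative.

```python
import numpy as np

def normal_loglik(y, mu, s2):
    # Log-likelihood of N(mu, s2) for the data vector y
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - (y - mu) ** 2 / (2 * s2))

def newton_normal(y, mu=0.0, s2=1.0, tol=1e-9, max_iter=500):
    """Newton's algorithm (Equation 2.30) for a normal mean and variance,
    with step halving to keep the variance positive and the log-likelihood
    climbing (a safeguard added for this sketch)."""
    n = len(y)
    for _ in range(max_iter):
        dev = y - mu
        # Score vector: first derivatives of the log-likelihood
        s = np.array([dev.sum() / s2,
                      -n / (2 * s2) + (dev**2).sum() / (2 * s2**2)])
        # Hessian matrix: second derivatives
        h = np.array([[-n / s2, -dev.sum() / s2**2],
                      [-dev.sum() / s2**2,
                       n / (2 * s2**2) - (dev**2).sum() / s2**3]])
        step = np.linalg.solve(h, s)  # H^{-1} S
        new_mu, new_s2 = mu - step[0], s2 - step[1]
        # Halve the jump until the variance stays positive and LL improves
        while new_s2 <= 0 or normal_loglik(y, new_mu, new_s2) < normal_loglik(y, mu, s2):
            step = step / 2
            new_mu, new_s2 = mu - step[0], s2 - step[1]
        if max(abs(new_mu - mu), abs(new_s2 - s2)) < tol:
            return new_mu, new_s2
        mu, s2 = new_mu, new_s2
    return mu, s2
```

At convergence the estimates should equal the analytic solutions: the sample mean and the variance with N (not N − 1) in the denominator.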
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = E(Y_i \mid X_i) + \varepsilon_i \quad (2.31)$$

$$Y_i \sim N_1\left(E(Y_i \mid X_i),\; \sigma^2_\varepsilon\right)$$
where E(Y|X) is a predicted value (i.e., the expected value or mean of Y given a particular
X score), the tilde means “distributed as,” N1 denotes the univariate normal distribution
function (i.e., the probability distribution in Equation 2.3), and the conditional mean
and residual variance inside the parentheses are the distribution’s two parameters. The
bottom row of the expression is simply stating our usual assumption that outcome scores
are normally distributed around a regression line with constant residual variation.
Switching gears to a different substantive context, I use the smoking data from
the companion website to illustrate multiple regression. The data set includes several
sociodemographic correlates of smoking intensity from a survey of N = 2,000 young
adults (e.g., age, whether a parent smoked, gender, income). To facilitate graphing, I start
with a simple regression model where the parental smoking indicator (0 = parents did not
smoke, 1 = one or both parents smoked) predicts smoking intensity (higher scores reflect
more cigarettes smoked per day):

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad (2.32)$$

The intercept represents the expected smoking intensity score for a respondent whose
parents did not smoke, and the slope is the group mean difference. The analysis example
later in this section expands the model to include additional explanatory variables.
$$f\left(Y_i \mid \boldsymbol\beta, \sigma^2_\varepsilon, X_i\right) = \frac{1}{\sqrt{2\pi\sigma^2_\varepsilon}} \exp\left(-\frac{1}{2}\,\frac{\left(Y_i - E(Y_i \mid X_i)\right)^2}{\sigma^2_\varepsilon}\right) \quad (2.33)$$
To reiterate recurring notation, the function on the left side of the equation can be read
as “the relative probability of a score given assumed values for the model parameters.”
Visually, “f of Y” is the height of the conditional normal curve that describes the spread
of scores around a particular point on the regression line (e.g., the normal distribution
of smoking intensity scores for participants who share the same value of the parental
smoking indicator). The main component in the kernel is still a squared z-score, but that
quantity now represents the standardized distance between a score and its predicted
value. As before, the fraction to the left of the exponential function is a scaling term that
ensures the area under the probability distribution sums or integrates to 1. Finally, note
that explanatory variables function as fixed constants like the parameters. This feature
will change in Chapter 3, where incomplete predictors appear as variables in a prob-
ability distribution.
As you know, maximum likelihood estimation reverses the probability distribu-
tion to get the likelihood of different combinations of population parameters given the
observed data. Taking the natural logarithm of each observation’s likelihood and sum-
ming the transformed probabilities gives a log-likelihood function that summarizes the
data’s evidence about the coefficients and residual variance.
$$\begin{aligned} LL\left(\boldsymbol\beta, \sigma^2_\varepsilon \mid \text{data}\right) &= \sum_{i=1}^{N} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2_\varepsilon}} \exp\left(-\frac{1}{2}\,\frac{\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2}{\sigma^2_\varepsilon}\right)\right] \\ &= -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\left(\sigma^2_\varepsilon\right) - \frac{1}{2}\left(\sigma^2_\varepsilon\right)^{-1}\sum_{i=1}^{N}\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2 \\ &= -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\left(\sigma^2_\varepsilon\right) - \frac{1}{2}\left(\sigma^2_\varepsilon\right)^{-1}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{Y} - \mathbf{X}\boldsymbol\beta) \end{aligned} \quad (2.34)$$
The compact matrix expression in the bottom row stacks the N outcome scores into a
vector Y, and it uses X to denote a corresponding matrix that contains predictor vari-
ables and a column of ones for the intercept.
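The matrix expression in the bottom row is straightforward to compute. As a minimal Python sketch (the function name `regression_loglik` is my own, and the sum matches the per-observation form in the first row of Equation 2.34):

```python
import numpy as np

def regression_loglik(beta, s2e, X, y):
    """Regression log-likelihood in matrix form (Equation 2.34):
    -N/2 ln(2*pi) - N/2 ln(s2e) - (1/(2*s2e)) * (Y - Xb)'(Y - Xb),
    where X includes a leading column of ones for the intercept."""
    resid = y - X @ beta
    n = len(y)
    return (-n / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(s2e)
            - resid @ resid / (2 * s2e))
```

Summing the individual log-densities observation by observation gives the same number, which is a useful check that the matrix shortcut is coded correctly.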
With only two coefficients, we can visualize the log-likelihood surface of β0 and β1
in three dimensions. Figure 2.13 is a contour plot conveying the perspective of a drone
hovering over the peak of the log-likelihood surface, with smaller contours denoting
higher elevation (and vice versa). The data’s support for the parameters increases as the
contours get smaller, and the maximum likelihood estimates of β0 and β1 are located at
the peak of the surface, shown as a black dot. The angle of the ellipses owes to the fact
that the intercept and slope coefficients are negatively correlated (i.e., the data’s support
for a larger mean difference requires concurrent support for a lower comparison group average). Identifying the optimal parameters for the data is again analogous to a hiker
climbing a mountain peak. Following the univariate example, we can derive an exact
solution or use an iterative optimization approach such as Newton’s algorithm.
$$\frac{\partial LL}{\partial \boldsymbol\beta} = -\frac{1}{\sigma^2_\varepsilon}\left(-\mathbf{X}'\mathbf{Y} + \mathbf{X}'\mathbf{X}\boldsymbol\beta\right) \quad (2.35)$$

$$\frac{\partial LL}{\partial \sigma^2_\varepsilon} = -\frac{N}{2}\left(\sigma^2_\varepsilon\right)^{-1} + \frac{1}{2}\left(\sigma^2_\varepsilon\right)^{-2}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{Y} - \mathbf{X}\boldsymbol\beta) \quad (2.36)$$
Setting these slope equations to 0 and solving for the unknown parameters at the peak of the log-likelihood surface gives the maximum likelihood estimates below.
$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \hat{\boldsymbol\beta}_{OLS} \quad (2.37)$$

$$\hat\sigma^2_\varepsilon = \frac{1}{N}\left(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta}\right)'\left(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta}\right) \quad (2.38)$$
Notice that the coefficients are identical to those of ordinary least squares, but the resid-
ual variance differs, because the sample size is not adjusted for the number of estimates
in β̂. This matches the earlier result for the mean and variance.
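Equations 2.37 and 2.38 are easy to verify numerically. The sketch below uses simulated data (synthetic values of my own choosing, not the smoking example) to confirm that the maximum likelihood coefficients match ordinary least squares while the residual variance divides by N rather than N − p:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic design matrix: intercept column plus a binary predictor
X = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n)])
y = X @ np.array([9.0, 3.0]) + rng.normal(0, 3, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # Equation 2.37: same as OLS
resid = y - X @ beta_hat
s2_ml = resid @ resid / n                     # Equation 2.38: divides by N
s2_ols = resid @ resid / (n - 2)              # OLS divides by N - p
```

Because the only difference is the denominator, the maximum likelihood residual variance is always slightly smaller than its least squares counterpart, mirroring the earlier mean-and-variance result.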
FIGURE 2.13. Contour plot that conveys the perspective of a drone hovering over the peak of the log-likelihood surface for a simple regression model, with smaller contours denoting higher elevation (and vice versa). The maximum likelihood estimates of β0 and β1 are located at the peak of the surface (shown as a black dot).

From Section 2.6, you know that second derivatives quantify the curvature or steepness of the log-likelihood function near its peak (i.e., the rate at which the first-order slopes change across the range of parameter values). These second derivatives are obtained by applying differential calculus rules to Equations 2.35 and 2.36, and the Hessian collects these equations in a matrix.
$$\mathbf{H}_O(\boldsymbol\theta) = \begin{pmatrix} -\left(\sigma^2_\varepsilon\right)^{-1}\mathbf{X}'\mathbf{X} & -\left(\sigma^2_\varepsilon\right)^{-2}\mathbf{X}'(\mathbf{Y} - \mathbf{X}\boldsymbol\beta) \\ -\left(\sigma^2_\varepsilon\right)^{-2}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'\mathbf{X} & \dfrac{N}{2}\left(\sigma^2_\varepsilon\right)^{-2} - \left(\sigma^2_\varepsilon\right)^{-3}(\mathbf{Y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{Y} - \mathbf{X}\boldsymbol\beta) \end{pmatrix} \quad (2.39)$$
Substituting the maximum likelihood estimates into the expression and multiplying
HO(θ̂) by –1 gives the observed information matrix, then taking its inverse (the matrix
analogue of a reciprocal) gives the variance–covariance matrix of the parameter esti-
mates. Equations 2.17 and 2.18 depict these steps. The parameter covariance matrix
for the simple regression analysis is symmetric with three rows and columns, one per
parameter.
$$\hat{\boldsymbol\Sigma}_{\hat\theta} = \begin{pmatrix} \mathrm{var}(\hat\beta_0) & \mathrm{cov}(\hat\beta_0, \hat\beta_1) & \mathrm{cov}(\hat\beta_0, \hat\sigma^2_\varepsilon) \\ \mathrm{cov}(\hat\beta_1, \hat\beta_0) & \mathrm{var}(\hat\beta_1) & \mathrm{cov}(\hat\beta_1, \hat\sigma^2_\varepsilon) \\ \mathrm{cov}(\hat\sigma^2_\varepsilon, \hat\beta_0) & \mathrm{cov}(\hat\sigma^2_\varepsilon, \hat\beta_1) & \mathrm{var}(\hat\sigma^2_\varepsilon) \end{pmatrix} = \begin{pmatrix} 0.018 & -0.018 & 0 \\ -0.018 & 0.039 & 0 \\ 0 & 0 & 0.365 \end{pmatrix} \quad (2.40)$$
Finally, taking the square root of the diagonal elements gives the standard errors (e.g., SE_β̂1 = √0.039 = 0.20). To establish further linkages to ordinary least squares, the expression in the upper left block of Equation 2.39 is a 2 × 2 matrix that contains derivatives
with respect to the two coefficients. Multiplying this submatrix by –1 and taking its
inverse gives an expression that is identical to a parameter covariance matrix from ordi-
nary least squares regression.
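The route from Equation 2.39 to Equation 2.40 — build the Hessian, multiply by –1, invert — can be sketched in a few lines. The fragment below uses synthetic data, and `regression_hessian` is my own illustrative name:

```python
import numpy as np

def regression_hessian(beta, s2e, X, y):
    """Hessian of the regression log-likelihood (Equation 2.39 structure):
    coefficient block, coefficient-by-variance block, and variance term."""
    resid = y - X @ beta
    n = len(y)
    h_bb = -(X.T @ X) / s2e                      # upper left block
    h_bs = -(X.T @ resid) / s2e**2               # off-diagonal block
    h_ss = n / (2 * s2e**2) - resid @ resid / s2e**3
    top = np.column_stack([h_bb, h_bs])
    bottom = np.append(h_bs, h_ss)
    return np.vstack([top, bottom])

# Negating the Hessian at the MLEs gives the observed information matrix;
# its inverse is the parameter covariance matrix (Equations 2.17 and 2.18),
# and square roots of the diagonal are the standard errors.
```

At the maximum likelihood solution the off-diagonal block vanishes (the residuals are orthogonal to the predictors), so the coefficient block of the resulting covariance matrix equals the familiar least squares expression σ̂²ε(X′X)⁻¹.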
Analysis Example
To illustrate maximum likelihood estimation for multiple regression, I expanded the
previous analysis model to include age and income as predictors. I centered the addi-
tional variables at their grand means to maintain the intercept’s interpretation as the
expected smoking intensity score for a respondent whose parents did not smoke.
Importantly, the smoking intensity distribution has substantial positive skewness and
kurtosis, so I used robust (sandwich estimator) standard errors and the bootstrap to
illustrate different corrective procedures. Analysis scripts are available on the compan-
ion website, including a custom R program for readers interested in coding Newton’s
algorithm by hand.
Table 2.5 shows the maximum likelihood estimates, along with ordinary least
squares results as a comparison. As expected, the two estimators produced identical
coefficients, but the maximum likelihood residual variance is very slightly smaller,
because it does not subtract the four degrees of freedom spent estimating the coeffi-
cients. This slight difference aside, the estimates themselves have the same meaning. For
example, the intercept (β̂0 = 9.09, SE = .12) is the expected number of cigarettes smoked
per day for a respondent whose parents didn’t smoke, and the parental smoking indi-
cator slope (β̂1 = 2.91, SE = .19) is the mean difference, controlling for age and income.
The corrective procedures induced relatively minor changes to the coefficients’ standard
errors, but they had a dramatic impact on the standard error of the residual variance. As
is often the case with a reasonably large sample size, sandwich estimator and bootstrap
standard errors were effectively equivalent.
Maximum likelihood estimation offers three significance testing options: the Wald test
(Wald, 1943), likelihood ratio statistic (Wilks, 1938), and the score test or Lagrange
multiplier (Rao, 1948). The latter is commonly referred to as the modification index
in structural equation modeling applications (Saris, Satorra, & Sörbom, 1987; Sörbom,
1989). I describe the first two approaches, because they are widely available in general-
purpose software packages, and Buse (1982) provides a nice tutorial on this “trilogy of
tests” for readers who are interested in additional details.
The Wald test and likelihood ratio statistic can evaluate the same hypotheses, but
they do so in different ways. The Wald test compares the discrepancy between the esti-
mates and hypothesized parameter values (usually zeros) to sampling variation. The
simplest incarnation of the test statistic is just a z-score or chi-square. In contrast, the
likelihood ratio statistic compares log-likelihood values from two competing models,
the simpler of which aligns with the null hypothesis. The two tests are equivalent in
very large samples but can give markedly different answers in small to moderate samples
(Buse, 1982; Fears, Benichou, & Gail, 1996; Greene, 2017; Pawitan, 2000). These differ-
ences are sometimes attributable to the fact that the Wald test inappropriately assumes
that sampling distributions follow a normal curve, but discrepancies can arise for other
reasons that are more difficult to predict. Statistical issues aside, the likelihood ratio test
is somewhat less convenient to implement, because it requires two analyses, but this is
not a compelling disadvantage.
Wald Test
The simplest incarnation of the Wald test is the familiar z-statistic that compares the
difference between an estimate and hypothesized parameter value (e.g., θ0 = 0) to the
estimate’s standard error.
$$z = \frac{\hat\theta - \theta_0}{SE_{\hat\theta}} \quad (2.42)$$
Leveraging the large-sample normality of maximum likelihood estimates, a standard
normal distribution generates a probability value for the test, and symmetrical confi-
dence interval limits are computed by multiplying the standard error by the appropriate
z critical value, then adding and subtracting that product (i.e., the margin of error or
half-width) to the estimate.
$$CI = \hat\theta \pm z_{CV} \times SE_{\hat\theta} \quad (2.43)$$
The z critical values for different alpha levels are available in textbooks and online (e.g.,
zCV = ±1.96 for α = .05).
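Equations 2.42 and 2.43 amount to two lines of arithmetic. As a sketch, the Python function below (the name `wald_z_and_ci` is my own) applies them to the parental smoking slope reported later in this chapter (β̂1 = 2.91, SE = .19):

```python
def wald_z_and_ci(estimate, se, null_value=0.0, z_cv=1.96):
    """Wald z statistic (Equation 2.42) and symmetric confidence
    interval (Equation 2.43) for a single parameter."""
    z = (estimate - null_value) / se
    half_width = z_cv * se  # margin of error, or interval half-width
    return z, (estimate - half_width, estimate + half_width)

# Parental smoking slope from the analysis example: z is about 15.3,
# and the 95% interval is roughly (2.54, 3.28)
z, ci = wald_z_and_ci(2.91, 0.19)
```

Because the z statistic far exceeds the ±1.96 critical value and the interval excludes zero, the slope is clearly significant at α = .05.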
Squaring the z-score gives an alternative expression for the Wald statistic that
instead follows a chi-square distribution with a single degree of freedom.
$$T_W = \left(\frac{\hat\theta - \theta_0}{SE_{\hat\theta}}\right)^2 = \frac{(\hat\theta - \theta_0)(\hat\theta - \theta_0)}{\mathrm{var}(\hat\theta)} \quad (2.44)$$

The multivariate version of the statistic uses the parameter covariance matrix to standardize the discrepancies between a vector of Q estimates and their hypothesized values.

$$T_W = \left(\hat{\boldsymbol\theta}_Q - \boldsymbol\theta_0\right)' \hat{\boldsymbol\Sigma}_{\hat\theta_Q}^{-1} \left(\hat{\boldsymbol\theta}_Q - \boldsymbol\theta_0\right) \quad (2.45)$$
Likelihood Ratio Test

The likelihood ratio statistic compares the fit of the full analysis to a more restrictive version of the model that fixes a subset of parameters to 0.
Returning to the earlier regression analysis, we could use the likelihood ratio statistic
to evaluate the null hypothesis that R2 = 0 by comparing the fit of the analysis model
from Equation 2.41 to that of an empty model that constrains the slope coefficients to
0. A slightly different application of the likelihood ratio test occurs in structural equa-
tion modeling analyses in which a researcher compares the fit of a saturated model
(i.e., a model that places no restrictions on the mean vector and covariance matrix) to
that of a more parsimonious analysis model that imposes a structure on the data (e.g., a
confirmatory factor analysis model). In either scenario, the simpler model with Q fewer
parameters aligns with the null hypothesis, so I denote the restricted model’s param-
eters as θ0 and the full model’s parameters as θ.
The likelihood ratio statistic is
$$T_{LR} = -2\left(LL\left(\hat\theta_0 \mid \text{data}\right) - LL\left(\hat\theta \mid \text{data}\right)\right) \quad (2.46)$$
where LL(θ̂0|data) is the sample log-likelihood value for the restricted model (e.g., an
empty regression model with only an intercept), and LL(θ̂|data) is the log-likelihood
for the more complex model (e.g., the full regression model). The more complex model
with additional parameters will always achieve better fit and a higher log-likelihood,
but that improvement should be very small when the null hypothesis is true. If the two
models are equivalent in the population, the likelihood ratio statistic follows a central
chi-square distribution with Q degrees of freedom, which in this case is the difference
between the number of parameters in the two models. A significant test statistic indi-
cates that the data provide more support for the full model than the restricted model
(e.g., one or more parameters are significantly different from zero).
The Satorra–Bentler rescaled likelihood ratio statistic divides the test statistic by a scaling constant that corrects for non-normality.

$$T_{SB} = \frac{T_{LR}}{c_{LR}} \quad (2.47)$$

$$c_{LR} = \frac{P_0 c_0 - P_F c_F}{P_0 - P_F}$$
where TLR is the likelihood ratio statistic from Equation 2.46, and cLR is a scaling con-
stant that combines the number of parameters in the full and restricted models, PF and
P0, respectively, and model-specific scaling terms, cF and c0.
The scaling term can be understood by revisiting the sandwich estimator cova-
riance matrix in Equation 2.24. As explained previously, the “bread × meat” product
yields a matrix with diagonal elements that reflect the relative magnitude of two infor-
mation matrices, one of which is sensitive to outlier scores. When the data are normal,
the two matrices are equivalent and cancel out when multiplying one by the inverse of
the other (the resulting product is an inert identity matrix). In contrast, when the data
are non-normal, the resulting product contains fractional diagonal terms that can be
smaller or larger than 1, depending on the kurtosis of the data. Multiplying this matrix
by the rightmost piece of “bread” inflates or deflates elements in parameter covariance
matrix accordingly.
The rescaling terms for the likelihood ratio test also leverage discrepancies between
the two information matrices. In the simplest possible univariate application (e.g., the
analysis from Section 2.8), the scaling term is a fraction that compares a single diago-
nal value from each information matrix (Yuan et al., 2005). More generally, cF and c0
pool the elements of the "bread × meat" product into a single scalar value that rescales the test statistic to have the same expected value or mean as its optimal central chi-square distribution (Satorra & Bentler, 1988, 1994, 2001). As such, referencing TSB to
a chi-square distribution with Q degrees of freedom gives an approximate p-value,
and a significant test statistic indicates that the data provide more support for the full
model than the restricted model (e.g., one or more parameters are significantly different
from 0).
A second option for getting a robust significance test is to use the original TLR from
Equation 2.46 but reference the test statistic against a simulation-based bootstrap sam-
pling distribution that honors the data’s shape. This is essentially the opposite tack of
rescaling, which fixes up the test statistic and leaves the theoretical sampling distribu-
tion intact. As explained in Section 2.8, the bootstrap procedure treats the observed
data as a surrogate for the population and draws many samples of size N with replace-
ment. Fitting the analysis model to each data set produces a collection of estimates
that form empirical sampling distributions, the standard deviations of which are robust
standard errors. A slight modification is needed to apply the bootstrap to test statistics.
As you know, a probability value reflects the likelihood that the observed test statistic
originated from a hypothetical population where the null hypothesis is exactly true. To
achieve this interpretation from the bootstrap, you need to first transform the observed
data to match the null hypothesis. Returning to the multiple regression model from
Equation 2.41, a null hypothesis that R2 = 0 implies that all regression slopes equal 0.
The estimated slopes will never be exactly 0, yet the sample data must be exactly consis-
tent with this condition for the bootstrap to work properly.
Beran and Srivastava (1985) and Bollen and Stine (1992) modified the bootstrap
procedure by first applying an algebraic transformation that aligns the mean and covari-
ance structure of the data to the null hypothesis (the procedure is sometimes referred
to as the model-based bootstrap). Importantly, this transformation does not modify dis-
tribution shapes, so drawing bootstrap samples from the rescaled data gives an empiri-
cal sampling distribution that reflects the natural variation of the test statistic with
non-normal data. A robust p-value is then obtained by computing the proportion of
bootstrap samples that give a test statistic larger than TLR from the original analysis. The
transformation expression is
$$\tilde{\mathbf{Y}}_i = \left(\mathbf{Y}_i - \hat{\boldsymbol\mu}\right)' \hat{\mathbf{S}}^{-.5}\,\hat{\mathbf{S}}_0^{.5} + \hat{\boldsymbol\mu}_0' \quad (2.48)$$
Analysis Example
Returning to the multiple regression model from Equation 2.41, I use the Wald test and
likelihood ratio statistic to evaluate the null hypothesis that R2 = 0. Both tests func-
tion like the omnibus F test from ordinary least squares in this context. To begin, the
Wald test standardizes discrepancies between the estimates and null values against the
parameter covariance matrix. The full covariance matrix is a 5 × 5 matrix, but the test
uses only the elements related to the slope coefficients. The composition of the test sta-
tistic for this example is as follows:
$$T_W = \begin{pmatrix} \hat\beta_1 - 0 \\ \hat\beta_2 - 0 \\ \hat\beta_3 - 0 \end{pmatrix}' \begin{pmatrix} \mathrm{var}(\hat\beta_1) & \mathrm{cov}(\hat\beta_1, \hat\beta_2) & \mathrm{cov}(\hat\beta_1, \hat\beta_3) \\ \mathrm{cov}(\hat\beta_2, \hat\beta_1) & \mathrm{var}(\hat\beta_2) & \mathrm{cov}(\hat\beta_2, \hat\beta_3) \\ \mathrm{cov}(\hat\beta_3, \hat\beta_1) & \mathrm{cov}(\hat\beta_3, \hat\beta_2) & \mathrm{var}(\hat\beta_3) \end{pmatrix}^{-1} \begin{pmatrix} \hat\beta_1 - 0 \\ \hat\beta_2 - 0 \\ \hat\beta_3 - 0 \end{pmatrix} \quad (2.49)$$
The diagonal elements of the middle matrix are the sampling variances (i.e., squared
standard errors), and the off-diagonal elements capture the degree to which the esti-
mates covary across repeated samples. Substituting the appropriate estimates into the
previous expression gives TW = 481.19, the value of which represents the sum of squared
standardized differences from zero. Referencing the test statistic to a chi-square dis-
tribution with Q = 3 degrees of freedom gives p < .001; consistent with an analogous F
test, we can conclude that at least one of the slopes is nonzero. The sandwich estimator
(robust) test statistic was markedly lower at TW = 423.15 but gave the same conclusion.
The likelihood ratio statistic evaluates the same hypothesis but requires a nested
or restricted model that aligns with the null. This secondary model is an empty regres-
sion that fixes the three slope coefficients to zero. With complete data, you can get the
restricted model log-likelihood by constraining the slope coefficients to zero during
estimation or by excluding the explanatory variables from the analysis. Although it
makes no difference here, explicitly constraining the slopes to zero is preferable, because that specification generalizes to missing data analyses.
Fitting the two models and substituting the resulting log-likelihood values into Equa-
tion 2.46 gives the following test statistic:
$$T_{LR} = -2\left( LL(\hat{\boldsymbol{\theta}}_0 \mid \text{data}) - LL(\hat{\boldsymbol{\theta}} \mid \text{data}) \right) = -2\left( (-5{,}895.145) - (-5{,}679.545) \right) = 431.20 \quad (2.51)$$
As you can see, fixing the slopes to zero substantially decreased the log-likelihood from
–5,679.545 to –5,895.145, indicating that the restricted model’s parameters are located
at a much lower vertical elevation on the log-likelihood surface. Referencing the test
statistic to a chi-square distribution with Q = 3 degrees of freedom returns a probability
value of p < .001, which, again, indicates that one or more of the slope coefficients are
nonzero. The corresponding rescaled test statistic from Equation 2.47 was markedly
lower at TSB = 173.97 (cLR = 2.48) but gave the same conclusion. Although TW and TLR
produced the same substantive conclusion, their numerical values aren’t particularly
well calibrated. This is not unusual, as the tests often require a much larger sample size
to achieve equivalence.
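To make the arithmetic concrete, the following Python sketch reproduces TLR from the log-likelihood values in Equation 2.51 and computes a Wald statistic for the single-parameter (Q = 1) special case of Equation 2.49; the slope estimate and standard error in the Wald portion are hypothetical values rather than numbers from the example:

```python
# T_LR from the restricted and full log-likelihoods in Equation 2.51
ll_restricted = -5895.145
ll_full = -5679.545
T_LR = -2 * (ll_restricted - ll_full)

# Q = 1 special case of the Wald statistic in Equation 2.49: a squared
# estimate-to-null discrepancy divided by its sampling variance.
# The slope estimate and standard error below are hypothetical.
beta_hat, se = 0.52, 0.04
T_W = (beta_hat - 0) ** 2 / se ** 2

# Referencing each statistic to a chi-square distribution with Q
# degrees of freedom would give the p-value
print(round(T_LR, 2), round(T_W, 2))
```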
The multivariate normal distribution plays an important role throughout the book, and
it appears prominently in Chapter 3, where it provides a flexible framework for miss-
ing data handling. To set the stage for missing data, this section uses the distribution
as a backdrop for estimating a mean vector and variance–covariance matrix. As you
will see, the concepts we’ve already established readily generalize to multivariate data
with virtually no modifications (although some of the equations are messier). I use the
employee data from the companion website to provide a substantive context. The data
set includes several workplace-related variables (e.g., work satisfaction, turnover intention, employee–supervisor relationship quality) for a sample of N = 630 employees (see Appendix). The empty regression models for the multivariate analysis are as follows:

$$\mathbf{Y}_i = \begin{pmatrix} \text{WORKSAT}_i \\ \text{EMPOWER}_i \\ \text{LMX}_i \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \varepsilon_{3i} \end{pmatrix} = \boldsymbol{\mu} + \boldsymbol{\varepsilon} \quad (2.52)$$

$$\mathbf{Y}_i \sim N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
The bottom equation is shorthand notation to reference data that follow a multivariate
normal distribution; N3 denotes a three-dimensional normal distribution, and the first
and second terms in parentheses are the mean vector and variance–covariance matrix
(the multivariate distribution’s parameters).
The multivariate normal distribution function generalizes the normal curve to mul-
tiple variables. In addition to a mean and variance for each variable, the distribution also
incorporates covariances among the variables (or alternatively, correlated residuals).
To illustrate, Figure 2.14 shows an idealized bivariate normal distribution for the employee
empowerment and leader–member exchange composite variables. The distribution retains its familiar
shape and looks like a bell-shaped surface in three-dimensional space. The probability
distribution function that describes the shape of the surface has the same basic struc-
ture as its univariate sibling in Equation 2.3, with vectors and matrices replacing scalar
quantities.
$$f(\mathbf{Y}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-V \times .5}\,|\boldsymbol{\Sigma}|^{-.5} \exp\left( -\frac{1}{2}(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}) \right) \quad (2.53)$$
The column vector Yi now contains V observations for a participant i, μ is the corre-
sponding vector of population means, and Σ is a variance–covariance matrix of the V
variables. As before, the function on the left side of the expression can be read as “the
relative probability of the V observations given assumed values for the model param-
eters.” Visually, the equation describes the height of the surface in Figure 2.14 at the
intersection of score values along the horizontal and depth axes. The term in the expo-
nential function, (Yi – μ)′ Σ–1(Yi – μ), is a key component that equals the sum of squared
86 Applied Missing Data Analysis
standardized differences between the scores and the distribution’s center (a quantity
known as Mahalanobis distance). Finally, the terms to the left of the exponential func-
tion scale the distribution so the area under the surface sums or integrates to 1.
As you know, a probability distribution treats scores as variable and the parameters
as known constants. To illustrate the distribution function’s output, assume that the
true population parameters are as follows (these happen to be the maximum likelihood
estimates for the employee empowerment and leader–member exchange variables):
$$\boldsymbol{\mu} = \begin{pmatrix} 28.61 \\ 9.59 \end{pmatrix} \qquad \boldsymbol{\Sigma} = \begin{pmatrix} 20.38 & 5.37 \\ 5.37 & 9.10 \end{pmatrix} \quad (2.54)$$
The contour plot in Figure 2.15 shows the perspective of a drone hovering over the
peak of the bivariate normal distribution in Figure 2.14, with smaller contours denoting
higher elevation and larger relative probabilities (and vice versa). The overhead perspective better reveals the positive correlation between employee empowerment and leader–member exchange. The black diamond corresponds to empowerment and leader–member exchange scores of Y1 = (32.00, 13.18)′, and the black circle corresponds to Y2 = (33.25, 9)′. Substituting everything into
FIGURE 2.14. An idealized bivariate normal probability distribution for the employee empow-
erment and leader–member exchange variables.
FIGURE 2.15. The contour plot shows the perspective of a drone hovering over the peak of
the bivariate normal distribution in Figure 2.14, with smaller contours denoting higher eleva-
tion and larger relative probabilities (and vice versa). The overhead perspective better reveals the
positive correlation between employee empowerment and leader–member exchange. The black circle and diamond are
two pairs of scores located at the same vertical elevation.
Equation 2.53 returns relative probability values of f(Y1|μ, Σ) = 0.006 and f(Y2|μ, Σ) =
0.006. The two pairs of scores have the same relative probability (i.e., are located at the
same vertical elevation), despite the fact that the straight line connecting Y1 to the center
of the distribution is noticeably shorter than the line connecting Y2 to the peak. This
result happens, because the positive correlation rotates the contours in such a way that
elevation drops rapidly directly above and below the distribution’s peak. This feature is
also apparent in Equation 2.53, where scaling the squared deviation scores relative to
the variance–covariance matrix standardizes the distances in a way that accounts for
the correlations among the variables.
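The equal-elevation result is easy to verify numerically. The following Python sketch evaluates Equation 2.53 for the bivariate case, using the parameter values from Equation 2.54 and the two score vectors marked in Figure 2.15:

```python
import math

def bivnorm_density(y, mu, cov):
    """Bivariate case of Equation 2.53, written out with an explicit
    2 x 2 inverse and determinant."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    d0, d1 = y[0] - mu[0], y[1] - mu[1]
    # Mahalanobis distance: sum of squared standardized differences
    mahal = (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det
    return (2 * math.pi) ** -1 * det ** -0.5 * math.exp(-0.5 * mahal)

# Estimates from Equation 2.54 (empowerment and leader-member exchange)
mu = [28.61, 9.59]
cov = [[20.38, 5.37], [5.37, 9.10]]
f1 = bivnorm_density([32.00, 13.18], mu, cov)  # black diamond
f2 = bivnorm_density([33.25, 9.00], mu, cov)   # black circle
print(round(f1, 3), round(f2, 3))  # both round to 0.006
```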
Following established concepts, estimation “reverses” the probability distribution’s
arguments to get the likelihood of different combinations of population parameters
given the observed data. Taking the natural logarithm gives the log-likelihood contribu-
tion for a single observation:
$$LL_i(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \text{data}) = -\frac{V}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}) \quad (2.55)$$
Numerically, the log-likelihood is a large negative value that summarizes the data’s evi-
dence for a specific combination of parameter values in μ and Σ, with higher or less
negative numbers reflecting better fit (and vice versa). Visually, the log-likelihood cor-
responds to the height of a multidimensional surface at specific values of μ and Σ. As
always, the goal of estimation is to identify the parameter values that maximize fit to
the observed data (or equivalently, minimize the sum of the squared z-scores in the
rightmost term).
The first derivatives of the log-likelihood locate the maximum; the derivative with respect to the covariance matrix, for example, is

$$\frac{\partial LL}{\partial \boldsymbol{\Sigma}} = -\frac{1}{2}\sum_{i=1}^{N}\left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu})(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1} \right) \quad (2.58)$$
Setting these equations to 0 and solving for the parameters gives the following analytic
solutions for the maximum likelihood estimates:
$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{Y}_i \quad (2.59)$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{Y}_i - \hat{\boldsymbol{\mu}})(\mathbf{Y}_i - \hat{\boldsymbol{\mu}})' \quad (2.60)$$
The analytic solutions highlight a recurring theme, which is that maximum likelihood
estimates of variances and covariances do not adjust for the degrees of freedom spent
estimating the means; as such, variance–covariance estimates are biased in small sam-
ples but approach their true population values as sample size increases (i.e., the esti-
mates are said to be consistent).
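A minimal Python sketch of Equations 2.59 and 2.60 makes the denominator issue explicit (the four data rows are hypothetical):

```python
def ml_mean_cov(rows):
    """Maximum likelihood estimates from Equations 2.59-2.60: the
    variance-covariance matrix divides by N, not N - 1."""
    n = len(rows)
    p = len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[j] - mu[j]) * (r[k] - mu[k]) for r in rows) / n
            for k in range(p)] for j in range(p)]
    return mu, cov

# Hypothetical bivariate data
rows = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0], [7.0, 8.0]]
mu, cov = ml_mean_cov(rows)
# The unbiased estimator would multiply each element of cov by N / (N - 1)
print(mu, cov)
```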
The Hessian matrix collects the log-likelihood function's second derivatives:

$$\mathbf{H}_O(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{\partial^2 LL}{\partial \boldsymbol{\mu}^2} & \dfrac{\partial^2 LL}{\partial \boldsymbol{\mu}\,\partial \boldsymbol{\Sigma}} \\[1.5ex] \dfrac{\partial^2 LL}{\partial \boldsymbol{\Sigma}\,\partial \boldsymbol{\mu}} & \dfrac{\partial^2 LL}{\partial \boldsymbol{\Sigma}^2} \end{pmatrix} \quad (2.61)$$
The second derivative equations below are the building blocks for the observed infor-
mation matrix, and analogous expressions for the expected information are available
in the literature (Savalei, 2010; Savalei & Bentler, 2009; Yuan & Hayashi, 2006) and in
Chapter 3.
$$\frac{\partial^2 LL}{\partial \boldsymbol{\mu}^2} = -N\,\boldsymbol{\Sigma}^{-1} \quad (2.62)$$

$$\frac{\partial^2 LL}{\partial \boldsymbol{\Sigma}^2} = \sum_{i=1}^{N} -\mathbf{D}_V'\left\{ \boldsymbol{\Sigma}^{-1} \otimes \left( \boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu})(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1} - .5\,\boldsymbol{\Sigma}^{-1} \right) \right\}\mathbf{D}_V$$

$$\frac{\partial^2 LL}{\partial \boldsymbol{\mu}\,\partial \boldsymbol{\Sigma}} = \sum_{i=1}^{N} -\left( \boldsymbol{\Sigma}^{-1} \otimes (\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1} \right)\mathbf{D}_V$$
The ⊗ symbol is a Kronecker product that multiplies one matrix by each element of
another matrix, and DV is the so-called “duplication matrix” (Magnus & Neudecker,
1999). Each covariance term appears twice in the first derivative matrix from Equation
2.58 but only once in the Hessian (and similarly, only once in the parameter covariance
matrix). The duplication matrix combines these redundant terms into a single value.
Substituting the maximum likelihood estimates into the derivative expressions, multi-
plying HO(θ̂) by –1, then taking its inverse gives the variance–covariance matrix of the
estimates.
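As a quick illustration of how curvature translates into precision, the univariate case of Equation 2.62 recovers the textbook standard error of a mean. The sketch below plugs in the empowerment variance from Equation 2.54 and the N = 630 sample size:

```python
import math

# Univariate case of Equation 2.62: the second derivative of the
# log-likelihood with respect to the mean is -N / sigma^2. Multiplying
# by -1 and inverting gives the sampling variance of the mean, whose
# square root is the familiar standard error sigma / sqrt(N).
n = 630          # employee sample size
sigma2 = 20.38   # empowerment variance from Equation 2.54
hessian = -n / sigma2
sampling_var = 1 / -hessian
se = math.sqrt(sampling_var)
print(round(se, 3))
```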
Analysis Example
Returning to the empty regression models in Equation 2.52, I use work satisfaction,
employee empowerment, and leader–member exchange scales to illustrate maximum
likelihood estimation. Analysis scripts are available on the companion website, includ-
ing a custom R program for readers interested in coding Newton’s algorithm by hand.
Table 2.6 gives the maximum likelihood estimates of the means, standard deviations,
variances and covariances, and correlations (in bold typeface above the diagonal). I
computed the standard deviations and correlations by transforming the maximum like-
lihood estimates of the variances and covariances (e.g., a correlation is a covariance
divided by the square root of the product of two variances). As a comparison, Table 2.6 also gives results from the usual unbiased estimator of the variance–covariance matrix.
Looking ahead to missing data analyses, we now have flexible estimators that accom-
modate mixtures of categorical and continuous incomplete variables. To set the stage for
later examples, I illustrate complete-data maximum likelihood estimation for a binary
outcome variable. Continuing with the employee data set, I use a dichotomous measure
of turnover intention that equals 0 if an employee has no plan to leave his or her position
and 1 if the employee has the intention of quitting. The bar graph in Figure 2.16 shows
the distribution of the discrete responses.
FIGURE 2.16. Bar graph of the dichotomous measure of turnover intention. TURNOVER = 0
if an employee has no plan to leave his or her position, and TURNOVER = 1 if the employee has
intentions of quitting.
Probit and logistic models view each binary response as arising from an underlying latent response variable that a threshold parameter divides into two regions (see Figure 2.17). The areas above and below this threshold correspond to the category proportions in the bar chart: 69% of the area under the curve falls below the threshold, and 31% falls above in the shaded region.
Using generic notation, the link between the latent scores and categorical responses is

$$Y_i = \begin{cases} 0 & \text{if } Y_i^* \le \tau \\ 1 & \text{if } Y_i^* > \tau \end{cases} \quad (2.63)$$
where Yi is the binary outcome for individual i, Yi* is the corresponding latent response
score, and τ is the threshold parameter (the vertical line in Figure 2.17). Fixing the latent
response variable’s mean or its threshold parameter to 0 provides a necessary identifica-
tion constraint, and I always adopt the latter strategy.
Adding an explanatory variable to the latent response model is a relatively small step forward. To illustrate, consider a simple regression with leader–member exchange (employee–supervisor relationship quality) predicting turnover intention, the latent variable model for which is as follows:

$$Y_i^* = \beta_0 + \beta_1(\text{LMX}_i) + \varepsilon_i \quad (2.64)$$
The key difference between logistic and probit regression is the distribution of the resid-
ual term—the probit model defines ϵi as a standard normal variable, whereas logis-
FIGURE 2.17. Latent response distribution for a binary variable. The vertical line at 0 is a
threshold parameter τ that divides the latent distribution into two regions. Employees with no
quitting intentions have latent scores below the threshold, and employees who intend to quit
have scores above the threshold. The area under the shaded region of the curve is the probability
of quitting (the proportion of 1’s in the data).
tic regression defines the residual as a standard logistic variable. To illustrate a probit
regression model, Figure 2.18 shows the latent variable distributions at three values of
the explanatory variable, with the area above the threshold parameter (the predicted
probabilities) shaded in gray. The black dots represent predicted values, and the contour
rings convey the perspective of a drone hovering over the peak of a bivariate normal dis-
tribution, with smaller contours denoting higher elevation (and vice versa). The graph
for a logistic regression is similar, but standard logistic distributions have thicker tails
than the normal curves in the figure.
Going forward, I use the following notation for probit regression models to empha-
size the normally distributed latent response variable, which later functions as an
incomplete variable to be imputed:
$$Y_i^* = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \varepsilon_i \sim N_1(0, 1) \quad (2.65)$$
The second term in the normal distribution function indicates that the latent response
variable’s variance is fixed at 1 to provide a metric. I write the logistic model in its more
usual format as
$$\ln\left( \frac{\Pr(Y_i = 1)}{1 - \Pr(Y_i = 1)} \right) = \beta_0 + \beta_1 X_i \quad (2.66)$$
where the term on the left side of the equation is the log odds or logit. The logistic model
also has a fixed variance, which I omit from the expression.
Both modeling frameworks provide a conversion to the probability metric, albeit
using different functions. The predicted probability of endorsing the highest category
(e.g., the probability of quitting) from the probit model is
$$\Pr(Y_i = 1 \mid \boldsymbol{\beta}, \text{data}) = 1 - \Phi\left( \frac{\tau - \mathbf{X}_i\boldsymbol{\beta}}{\sigma_\varepsilon} \right) = 1 - \Phi(-\mathbf{X}_i\boldsymbol{\beta}) = \Phi(\mathbf{X}_i\boldsymbol{\beta}) = \pi_i \quad (2.67)$$
where Xi is the predictor vector for individual i (including a column of 1’s for the inter-
cept), β contains the coefficients, Xiβ is the predicted latent response, and Φ(·) is the
cumulative distribution function of the standard normal curve. The subtraction inside
the parentheses expresses the threshold as a z-score (recall that τ = 0 and σε² = 1), and
FIGURE 2.18. Latent response distributions at three values of leader–member exchange, with the area above the threshold parameter (the predicted probability of quitting) shaded in gray. The black dots represent predicted values, and the contour rings convey the perspective of a drone hovering over the peak of a bivariate normal distribution, with smaller contours denoting higher elevation (and vice versa).
the function returns the area below this value in a standard normal curve. Subtracting
that result from 1 gives the area above the threshold (e.g., the shaded regions of the nor-
mal curves in Figure 2.18). Similarly, the logit link function translates predicted latent
response scores to the probability metric as follows:
$$\Pr(Y_i = 1 \mid \boldsymbol{\beta}, \text{data}) = \frac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i\boldsymbol{\beta})} = \pi_i \quad (2.68)$$
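The two conversion functions are simple to compute. The sketch below implements the probit conversion from Equation 2.67 (with τ = 0 and unit residual variance, so the probability reduces to a normal distribution function evaluation) via the error function, alongside the logistic conversion from Equation 2.68:

```python
import math

def probit_prob(xb):
    """Equation 2.67 with tau = 0 and unit residual variance:
    Pr(Y = 1) = Phi(X_i * beta), computed from the error function."""
    return 0.5 * (1 + math.erf(xb / math.sqrt(2)))

def logit_prob(xb):
    """Equation 2.68: the logistic conversion to the probability metric."""
    return math.exp(xb) / (1 + math.exp(xb))

# Both links return .50 when the predicted latent response sits exactly
# at the threshold
print(probit_prob(0.0), logit_prob(0.0))  # 0.5 0.5
```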
For the probit model, reversing the probability distribution's arguments gives the likelihood contribution for a single observation:

$$L_i(\boldsymbol{\beta} \mid \text{data}) = \Phi(\mathbf{X}_i\boldsymbol{\beta})^{Y_i} \times \left( 1 - \Phi(\mathbf{X}_i\boldsymbol{\beta}) \right)^{1 - Y_i} = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i} \quad (2.69)$$
In the context of the employee turnover example, the likelihood features the product of
the predicted probability of quitting (left term) and not quitting (right term). The scores
in the exponents act like on–off switches that activate the left term (the predicted prob-
ability that Y = 1) if Y = 1 and trigger the right term (the predicted probability that Y =
0) if Y = 0. Taking the natural logarithm and summing across the N cases gives the fol-
lowing sample log-likelihood expression:
$$LL(\boldsymbol{\beta} \mid \text{data}) = \sum_{i=1}^{N}\left( Y_i \times \ln\Phi(\mathbf{X}_i\boldsymbol{\beta}) + (1 - Y_i) \times \ln\left( 1 - \Phi(\mathbf{X}_i\boldsymbol{\beta}) \right) \right) \quad (2.70)$$
Numerically, the log-likelihood is a large negative number that equals the sum of log-
arithmically transformed probability values. Conceptually, this value represents the
data’s support for a particular combination of population regression coefficients in β.
The log-likelihood for logistic regression has the same form as Equation 2.70 but
uses the Bernoulli distribution probability distribution from Equation 2.1. Reversing
the probability distribution’s arguments by taking data values as given and varying the
parameters gives the likelihood expression for a single observation.
$$L_i(\boldsymbol{\beta} \mid \text{data}) = \left( \frac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i\boldsymbol{\beta})} \right)^{Y_i} \times \left( 1 - \frac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i\boldsymbol{\beta})} \right)^{1 - Y_i} = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i} \quad (2.71)$$
Consistent with Equation 2.70, the likelihood features the product of the predicted
probability of quitting (left term) and not quitting (right term), and the scores in the
exponent activate the probability that corresponds to one’s binary response. Taking
the natural logarithm and summing across the N cases gives the following sample log-
likelihood expression, which again represents the data’s support for a particular combi-
nation of regression parameters:
$$LL(\boldsymbol{\beta} \mid \text{data}) = \sum_{i=1}^{N}\left( Y_i \times \ln\left( \frac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i\boldsymbol{\beta})} \right) + (1 - Y_i) \times \ln\left( \frac{1}{1 + \exp(\mathbf{X}_i\boldsymbol{\beta})} \right) \right) \quad (2.72)$$
Unlike the other models in this chapter, there is no analytic solution for the probit
and logistic regression coefficients, and iterative optimizers such as Newton’s algorithm
are a must. Iterative optimization works the same as it did with normally distributed
data, so I point readers to the literature for additional technical details (Agresti, 2012;
Greene, 2017). Putting aside the technicalities, the process of computing standard errors
follows the same procedure described earlier in the chapter; manipulating the matrix of
second derivatives that quantifies the curvature of the log-likelihood function gives the
variance–covariance matrix of the estimates, the diagonal of which contains squared
standard errors. Similarly, the significance testing options described in Section 2.11 are
no different with categorical variable models.
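To give a feel for what the optimizer is doing, the following Python sketch codes Newton's algorithm for a logistic regression with a single predictor. The data are hypothetical, and the code is a bare-bones illustration rather than production software:

```python
import math

def logistic_newton(x, y, iterations=25):
    """Newton's algorithm for a one-predictor logistic regression:
    repeatedly update the coefficients using the gradient and Hessian
    of the log-likelihood in Equation 2.72."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient elements
        h00 = h01 = h11 = 0.0    # elements of the negative Hessian
        for xi, yi in zip(x, y):
            pi = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - pi
            g1 += (yi - pi) * xi
            w = pi * (1 - pi)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (negative Hessian)^-1 times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

# Hypothetical binary outcomes; at the solution the gradient is ~0,
# so the predicted probabilities sum to the number of 1's
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 1, 0, 1, 1]
b0, b1 = logistic_newton(x, y)
resid = sum(y) - sum(1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x)
```

Inverting the negative Hessian at the solution would give the parameter covariance matrix, exactly as described for the normal-theory models.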
Analysis Example
Expanding on the employee turnover example, I used maximum likelihood estimation
to fit probit and logistic regression models that use leader–member exchange, employee
empowerment, and a male dummy code (0 = female, 1 = male) to predict a binary mea-
sure of turnover intention (TURNOVER = 0 if an employee has no plan to leave his or her
position, and TURNOVER = 1 if the employee has intentions of quitting).
The logistic version of the model is

$$\ln\left( \frac{\Pr(\text{TURNOVER}_i = 1)}{1 - \Pr(\text{TURNOVER}_i = 1)} \right) = \beta_0 + \beta_1(\text{LMX}_i) + \beta_2(\text{EMPOWER}_i) + \beta_3(\text{MALE}_i)$$
The probit model’s residual variance is fixed at 1 for identification, and the model addi-
tionally incorporates a fixed threshold parameter that divides the latent response vari-
able distribution into two segments. The logistic regression can also be viewed as a latent
response model, but it is typical to write the equation without a residual. Note that I
use β’s to represent focal model parameters, but the estimated coefficients will not be
the same (logit coefficients are approximately 1.7 times larger than probit coefficients;
Birnbaum, 1968). As always, analysis scripts are available on the companion website.
Table 2.7 shows the maximum likelihood analysis results for both models. Start-
ing with the probit regression results, the Wald test of the full model was statistically
significant, TW(3) = 20.00, p < .001, meaning that the estimates are at odds with the null
hypothesis that all three population slopes equal zero. Each slope coefficient reflects the
expected z-score change in the latent response variable for a one unit increase in the pre-
dictor, controlling for other regressors. For example, the leader–member exchange coef-
ficient indicates that a one-unit increase in relationship quality is expected to decrease
the latent proclivity to quit by 0.06 z-score units (β̂1 = –0.06, SE = .02), holding other
predictors constant.
Turning to the logistic regression results, the Wald test of the full model was again
statistically significant, and the test statistic’s numerical value was comparable to that
Logistic regression

Parameter       Est.    SE     z      p     OR
β0              1.37    0.60   2.30   .02    —
β1 (LMX)       –0.10    0.04  –2.96   .00   0.90
β2 (EMPOWER)   –0.04    0.02  –1.81   .07   0.96
β3 (MALE)      –0.06    0.18  –0.31   .75   0.95
R2              .05     .02    2.30   .02    —
of the probit model, TW(3) = 19.35, p < .001. Each slope coefficient now reflects the
expected change in the log odds of quitting for a one-unit increase in the predictor,
holding all other covariates constant. For example, the leader–member exchange slope
indicates that a one-unit increase in relationship quality decreases the log odds of quit-
ting by .10 (β̂1 = –0.10, SE = .04), controlling for employee empowerment and gender.
Notice that the logistic coefficients are approximately 1.7 times larger than the probit
slopes, as expected (Birnbaum, 1968). Exponentiating each slope gives an odds ratio
that reflects the multiplicative change in the odds (the probability ratio on the left side
of Equation 2.66) for a one-unit increase in a predictor (e.g., a one-point increase on the
leader–member exchange scale multiplies the odds of quitting by 0.90).
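The conversions in this paragraph amount to two lines of arithmetic, sketched below with the tabled coefficient values:

```python
import math

# Leader-member exchange slope from the logistic results (logit metric)
b1_logit = -0.10
odds_ratio = math.exp(b1_logit)
print(round(odds_ratio, 2))  # 0.9

# Rescaling the probit slope by roughly 1.7 approximates the logit slope
b1_probit = -0.06
print(round(1.7 * b1_probit, 2))  # -0.1
```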
The analysis results highlight that probit and logistic models are effectively equiva-
lent and almost always lead to the same conclusions. Some researchers favor the logistic
framework, because it yields odds ratios, but there is otherwise little reason to prefer
one approach to the other. As you will see, probit regression plays a more central role
with Bayesian estimation and multiple imputation.
Maximum likelihood is the go-to estimator for many common statistical models, and it
is one of the three major pillars of this book. As its name implies, the estimator identi-
fies the population parameters that are most likely responsible for a particular sample of
data. Much of this chapter has unpacked this definition in the context of linear regres-
sion models and multivariate analyses based on the normal distribution, and the last
section has outlined logistic and probit models for categorical outcomes. Having estab-
lished all the major details behind estimation and inference, Chapter 3 applies maxi-
mum likelihood to missing data problems. As you will see, everything from this chapter
carries over to missing data applications, where the goal remains to identify parameter
values that maximize fit to the data—the only difference is that some participants have
more of it than others. Finally, I recommend the following articles for readers who want
additional details on topics from this chapter:
Buse, A. (1982). The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note.
American Statistician, 36, 153–157.
Eliason, S. R. (1993). Maximum likelihood estimation: Logic and practice. Newbury Park, CA:
Sage.
The origins of maximum likelihood missing data handling are quite old and date back
to the 1950s (Anderson, 1957; Edgett, 1956; Hartley, 1958; Lord, 1955). These early solu-
tions were limited in scope and had specialized applications (e.g., bivariate normal data
with a single, incomplete variable). Many important breakthroughs came in the 1970s
when methodologists developed the theoretical underpinnings of modern missing data-
handling techniques, as well as computational methods to implement them (Beale &
Little, 1975; Dempster et al., 1977; Finkbeiner, 1979; Hartley & Hocking, 1971; Orchard
& Woodbury, 1972). For researchers in the social and behavioral sciences, maximum
likelihood missing data handling became a practical reality in the 1990s, when struc-
tural equation modeling software packages began implementing estimators for raw data
(Arbuckle, 1996; Jamshidian & Bentler, 1999; Muthén et al., 1987; Wothke, 2000). As
an aside, researchers often refer to maximum likelihood missing data handling as full-
information maximum likelihood (FIML) estimation. Although the FIML acronym is
often synonymous with missing data applications (Arbuckle, 1996), the name conveys
that estimates are derived from the raw data, just as they were in Chapter 2.
Virtually everything from Chapter 2 carries over to missing data applications,
where the goal remains to identify parameter values that maximize fit to the data—the
only difference is that some participants have more of it than others. Missing data analy-
ses generally require iterative optimization routines, but the nuts and bolts of estima-
tion and inference mirror Chapter 2. As you will see, the missing data-handling aspect
of maximum likelihood all happens behind the scenes; a researcher simply needs to
dial up a capable software package and specify a model. The estimator does not discard
incomplete data records, nor does it impute them. Rather, it identifies the parameter
values with maximum support from whatever data are available. The first part of this
chapter digs under the hood to illustrate how procedures from the previous chapter
accommodate missing data. These changes are intuitive, so readers who aren’t as inter-
ested in the finer details can still get the gist.
Maximum likelihood analyses have evolved considerably in recent years. The
estimators that were widely available when I was writing the first edition of this book
were generally limited to multivariate normal data. This is still a common assump-
tion for missing data analyses, but flexible estimation routines that accommodate mix-
tures of categorical and continuous variables are now widely available (Ibrahim, 1990;
Ibrahim, Chen, Lipsitz, & Herring, 2005; Lüdtke, Robitzsch, & West, 2020a; Muthén,
Muthén, & Asparouhov, 2016; Pritikin, Brick, & Neale, 2018). As you will see, these
approaches generally don’t work from a multivariate distribution, but rather disassem-
ble a model into multiple parts that leverage different probability distributions. This
strategy—factorizing a multivariate distribution into easier component distributions or
submodels—also paves the way for estimating interactions and nonlinear effects with
missing data (Lüdtke et al., 2020a; Robitzsch & Lüdtke, 2021). This is an important
innovation, as classic methods based on multivariate normality are known to introduce
bias (Cham, Reshetnyak, Rosenfeld, & Breitbart, 2017; Enders, Baraldi, & Cham, 2014;
Lüdtke et al., 2020a; Seaman, Bartlett, & White, 2012; Zhang & Wang, 2017). I use sev-
eral data analysis examples throughout the chapter to illustrate these newer methods
and their classic counterparts.
Revisiting maximum likelihood estimation for multivariate normal data is a good start-
ing point that sets the stage for much of this chapter. As you will see, the missing data
methods in this book uniformly require distributional assumptions for incomplete vari-
ables. From a practical perspective, this means that univariate models such as multiple
regression must be specified as multivariate analyses to do any type of missing data
handling. The multivariate normal distribution is often a reasonable way to assign a
distribution to variables that wouldn’t otherwise need one, and it can work surprisingly
well with non-normal variables. The multivariate normal distribution is also founda-
tional to the structural equation modeling approach that I discuss later in the chapter.
The structural modeling framework is an important toolkit for implementing maximum
likelihood estimation, as it accommodates a wide range of models with missing values
on outcomes or predictors.
I use the employee data from the companion website to provide a substantive con-
text. The data set includes several workplace-related variables (e.g., work satisfaction,
turnover intention, employee–supervisor relationship quality) for a sample of N = 630
employees (see Appendix). The illustration uses a 7-point work satisfaction rating (1 =
extremely dissatisfied to 7 = extremely satisfied) and two composite scores that measure
employee empowerment and a construct known as leader–member exchange scale (the
quality of an employee’s relationship with his or her supervisor). I treat work satisfac-
tion as a normally distributed variable, because it has a sufficient number of response
options and a symmetrical distribution (Rhemtulla et al., 2012). The empty regression
models for the multivariate analysis are as follows:
$$\mathbf{Y}_i = \begin{pmatrix} \text{WORKSAT}_i \\ \text{EMPOWER}_i \\ \text{LMX}_i \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \varepsilon_{3i} \end{pmatrix} = \boldsymbol{\mu} + \boldsymbol{\varepsilon} \quad (3.1)$$

$$\mathbf{Y}_i \sim N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
The bottom row of the expression says that the variables follow a three-dimensional
normal distribution with parameters μ and Σ. I used this same example in Section 2.12,
but now there are missing values; work satisfaction ratings have a 4.8% missing data
rate, the employee empowerment variable has 16.2% of its scores missing, and 4.1% of
the leader–member exchange values are incomplete.
Complete‑Data Log‑Likelihood
A quick recap of concepts from Chapter 2 sets the stage for missing data handling. After
collecting data, recall that we “reverse” the probability distribution’s arguments to get
the likelihood of different combinations of population parameters given the observed
data. Taking the natural logarithm of the multivariate normal distribution function
gives the log-likelihood contribution for a single observation.
$$LL_i\left( \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{i(\text{com})} \right) = -\frac{V}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}) \quad (3.2)$$
Summing across the N participants gives the sample log-likelihood:

$$LL\left( \boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathbf{Y}_{(\text{com})} \right) = -\frac{NV}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i=1}^{N}(\mathbf{Y}_i - \boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}) \quad (3.3)$$
Observed‑Data Log‑Likelihood
Returning to the ideas from Chapter 1 (Little & Rubin, 2020; Rubin, 1976), missing
data theory imagines a hypothetically complete data set that partitions into observed
and missing components. Symbolically, this idea is expressed as Y(com) = (Y(obs), Y(mis)).
The values in Y(mis) are essentially latent variable scores that we are unable to collect. To
illustrate, the three variables from the employee data illustration exhibit the five missing
data patterns in Table 3.1: (1) cases with complete data on all three variables, (2) par-
ticipants with missing data on just one of the three variables, and (3) persons missing
both work satisfaction and employee empowerment scores. The contents of Y(obs) and Y(mis)
thus vary across patterns, with Y(obs) containing between one and three scores and Y(mis)
containing one or two unseen values.
The unseen values in Y(mis) cannot function as known constants in the log-likelihood
expression, so the estimator removes the missing parts of the data from the multivariate
normal distribution and identifies the parameter values that maximize fit to the remain-
ing observed data. As a result, each participant with one or more observations contrib-
utes to the analysis, and nothing is wasted. You often see the probability distribution of
the observed data written as follows:
f(Y_(obs) | θ) = ∫ f(Y_(com) | θ) dY_(mis)    (3.4)
The integration operator says that the observed-data distribution on the left side of
the equation is obtained by averaging or marginalizing over the missing values. Mar-
ginalizing is akin to replacing each latent score in Y(mis) with a weighted sum over all
possible values of the missing variable, with higher weights assigned to more plausible
scores and vice versa.
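Equation 3.4 can be verified numerically. The sketch below (Python with numpy; all parameter values are hypothetical) integrates a bivariate normal density over one variable on a fine grid and confirms that the result equals the univariate normal density of the remaining variable:

```python
import numpy as np

# Hypothetical bivariate normal parameters
mu_x, mu_y = 10.0, 30.0
var_x, var_y, cov_xy = 9.0, 16.0, 6.0

def bvn_pdf(x, y):
    """Bivariate normal density f(x, y)."""
    det = var_x * var_y - cov_xy ** 2
    dx, dy = x - mu_x, y - mu_y
    quad = (var_y * dx ** 2 - 2 * cov_xy * dx * dy + var_x * dy ** 2) / det
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(det))

# Marginalize over x at a fixed y: accumulate f(x, y) * dx across a fine grid
y0 = 28.0
x_grid = np.linspace(mu_x - 10 * np.sqrt(var_x), mu_x + 10 * np.sqrt(var_x), 200001)
dx = x_grid[1] - x_grid[0]
marginal = bvn_pdf(x_grid, y0).sum() * dx

# The marginal matches the univariate normal density of y alone (Equation 3.4)
univariate = np.exp(-0.5 * (y0 - mu_y) ** 2 / var_y) / np.sqrt(2 * np.pi * var_y)
print(marginal, univariate)
```

The weighted-sum intuition in the text is literal here: each grid point contributes its density times a small slice of the X axis.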
Next, let’s see how Equation 3.4 translates into a log-likelihood function. When a
participant has missing values, the observed data for that individual no longer contain
information about every model parameter. The log-likelihood equation accommodates
this feature by eliminating the elements in the data and parameter arrays that corre-
spond to the missing variables. A single individual’s contribution to the observed-data
log-likelihood function is
LL_i(μ, Σ | Y_i(obs)) = −(V_i/2) ln(2π) − (1/2) ln|Σ_i| − (1/2)(Y_i − μ_i)′Σ_i^{−1}(Y_i − μ_i)    (3.5)
where Yi contains the participant’s observed data, Vi is the number of scores in the data
vector, and μi and Σi contain the subset of parameters in μ and Σ that correspond to the
observed variables in Yi. The equation says that all participants share the same model
parameters, but the fit of the observed data is restricted to those parameters for which an
individual has scores. Summing across N participants gives the log-likelihood function
for a sample of incomplete data.
LL(μ, Σ | Y_(obs)) = −∑_{i=1}^{N} (V_i/2) ln(2π) − (1/2) ∑_{i=1}^{N} ln|Σ_i| − (1/2) ∑_{i=1}^{N} (Y_i − μ_i)′Σ_i^{−1}(Y_i − μ_i)    (3.6)
To illustrate, consider a bivariate analysis in which X is complete and Y is missing for some participants. A complete case contributes the fit of both scores to a bivariate normal distribution with the full set of parameters.

LL_i(μ, Σ | Y_i(obs) = (X_i, Y_i)) = −ln(2π) − (1/2) ln|[σ²_X  σ_XY; σ_YX  σ²_Y]|
    − (1/2) [X_i − μ_X; Y_i − μ_Y]′ [σ²_X  σ_XY; σ_YX  σ²_Y]^{−1} [X_i − μ_X; Y_i − μ_Y]    (3.7)
In contrast, participants with missing Y scores provide no information about μY, σY2,
or σYX. Dropping these elements from the parameter arrays leaves μi = μX and Σi = σX2,
and the data’s support for these remaining parameters derives from a univariate normal
distribution.
LL_i(μ, Σ | Y_i(obs) = X_i) = −(1/2) ln(2π) − (1/2) ln σ²_X − (1/2)(X_i − μ_X)²/σ²_X    (3.8)
Equation 3.8 is a concrete example of the integration operation from Equation 3.4, as
marginalizing a bivariate normal distribution over one of its variables yields a univariate
normal log-likelihood.
Summing across the N observations gives the observed-data log-likelihood function
for the sample
LL(μ, Σ | Y_(obs)) = −n_C ln(2π) − (n_C/2) ln|[σ²_X  σ_XY; σ_YX  σ²_Y]|
    − (1/2) ∑_{i=1}^{n_C} [X_i − μ_X; Y_i − μ_Y]′ [σ²_X  σ_XY; σ_YX  σ²_Y]^{−1} [X_i − μ_X; Y_i − μ_Y]
    − (n_M/2) ln(2π) − (n_M/2) ln σ²_X − (1/2) ∑_{i=1}^{n_M} (X_i − μ_X)²/σ²_X    (3.9)
where nC and nM are the number of cases with complete data and missing values, respectively. As always, the log-likelihood summarizes the data's evidence for a specific com-
bination of parameter values. The only new wrinkle is that some participants contribute
more information than others. Importantly, the estimator does not discard incomplete
data records, nor does it impute them. The overall goal remains the same, which is to
identify the parameters that maximize fit to the data.
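In code, the estimator's bookkeeping amounts to subsetting μ and Σ to the observed variables before evaluating the normal density, as in Equations 3.5 through 3.8. A minimal Python sketch with numpy, using hypothetical values and NaN codes for missing scores:

```python
import numpy as np

def obs_loglik_i(y, mu, sigma):
    """Observed-data log-likelihood contribution (Equation 3.5).
    Missing entries of y are np.nan; the parameter arrays are subset
    to the observed variables before evaluating the normal density."""
    obs = ~np.isnan(y)
    yi, mui = y[obs], mu[obs]
    sigmai = sigma[np.ix_(obs, obs)]
    vi = obs.sum()
    diff = yi - mui
    return (-0.5 * vi * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigmai))
            - 0.5 * diff @ np.linalg.inv(sigmai) @ diff)

def obs_loglik(Y, mu, sigma):
    """Sum casewise contributions over all cases with at least one score."""
    return sum(obs_loglik_i(y, mu, sigma) for y in Y if (~np.isnan(y)).any())

# Hypothetical bivariate example: the second case is missing its first score
mu = np.array([10.0, 30.0])
sigma = np.array([[9.0, 6.0], [6.0, 16.0]])
Y = np.array([[12.0, 33.0], [np.nan, 27.0]])
print(obs_loglik(Y, mu, sigma))
```

Notice that an incomplete case falls through to a lower-dimensional normal density, exactly as in Equation 3.8; nothing is deleted and nothing is imputed.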
Analysis Example
I use the trivariate analysis model from Equation 3.1 to illustrate maximum likelihood
estimation for a multivariate analysis. Table 3.1 shows the five missing data patterns.
Applying previous ideas, estimation uses all available data, with each missing data pat-
tern contributing different information. For example, nearly 80% of the sample members
have complete data records, and the log-likelihood contributions for these individuals
reflect fit to a trivariate normal distribution with the full collection of parameters (i.e.,
μi contains all three means and Σi is a 3 × 3 matrix). Three patterns comprise individu-
als who contribute two data points, and their fits are gauged relative to the parameters
of a bivariate normal distribution (i.e., μi contains two of three means, and Σi is a 2 ×
2 matrix containing the two relevant variances and a covariance). The final pattern
comprises participants with a single observation. These log-likelihood contributions
reflect fit to a univariate normal distribution, where μi and Σi are both scalar values as
in Equation 3.8.
As explained previously, iterative optimizers such as Newton’s algorithm or the
expectation maximization algorithm (discussed later) are necessary for finding the esti-
mates that maximize fit to the observed data. Analysis scripts are available on the com-
panion website, including a custom R program for readers interested in coding Newton’s
algorithm by hand. Table 3.2 gives the maximum likelihood estimates of the means,
standard deviations, variances and covariances, and correlations. The standard devia-
tions and correlations are not estimated parameters but are instead deterministic func-
tions of the variances and covariances (e.g., a correlation is a covariance divided by
square root of the product of two variances), and their delta method standard errors are
similarly functions of the component standard errors (Raykov & Marcoulides, 2004).
Following ideas established in Chapter 2, maximum likelihood estimates of variances
and covariances are negatively biased in small samples, because they do not subtract
the degrees of freedom spent estimating the means; these biases should be trivial, with
a sample size of N = 630, even with substantial missing data. As an aside, researchers
often ask what sample size they should report for a missing data analysis. While no
single N drives the precision of the estimates, the sample size is the number of cases
with at least one observation. For this analysis, that’s all N = 630 employees.
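The delta method computation for a correlation is straightforward to sketch. In the Python function below (numpy), `acov` is the sampling covariance matrix of the estimated variances and covariance; all numeric inputs are hypothetical illustration values, not the Table 3.2 results:

```python
import numpy as np

def corr_delta(var_x, var_y, cov_xy, acov):
    """Correlation and its delta-method standard error.
    `acov` is the 3 x 3 sampling covariance matrix of the estimates
    (var_x, var_y, cov_xy), in that order."""
    r = cov_xy / np.sqrt(var_x * var_y)
    # Gradient of r with respect to (var_x, var_y, cov_xy)
    grad = np.array([
        -0.5 * cov_xy / (var_x ** 1.5 * np.sqrt(var_y)),
        -0.5 * cov_xy / (var_y ** 1.5 * np.sqrt(var_x)),
        1.0 / np.sqrt(var_x * var_y),
    ])
    se = np.sqrt(grad @ acov @ grad)  # first-order Taylor series variance
    return r, se

# Hypothetical estimates and (diagonal) sampling covariance matrix
acov = np.diag([0.010, 0.020, 0.008])
r, se = corr_delta(1.61, 19.5, 1.70, acov)
print(r, se)
```

The same recipe applies to any smooth function of the estimates, such as a standard deviation computed from a variance.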
Unlike Bayesian estimation and multiple imputation, maximum likelihood does not
explicitly impute the missing data. Rather, the estimator identifies the optimal parameter
values using whatever data it has at its disposal.

TABLE 3.2. Maximum Likelihood Estimates (Excerpt)

Parameter                            Estimate     SE       z       p
Standard deviations
  Work Satisfaction                    1.27      0.04    34.66   < .01
  Empowerment                          4.42      0.14    31.95   < .01
  LMX                                  3.02      0.09    34.89   < .01
Correlations
  Work Satisfaction ↔ Empowerment      0.31      0.04     7.65   < .01
  Work Satisfaction ↔ LMX              0.42      0.03    12.35   < .01
  Empowerment ↔ LMX                    0.42      0.04    11.21   < .01

While the observed-data log-likelihood equation clearly shows that some observations contribute more information than others,
it doesn’t convey how the partial data records help achieve a more accurate answer—
statistical theory and computer simulations like those in Chapter 1 tell us that they
can help, even with very large amounts of missing data. To provide some insight into
how estimation works, I created an artificial data set of employee empowerment and
leader–member exchange scores with estimates like those in Table 3.2. I deleted 50% of
the leader–member exchange values to mimic a conditionally MAR process where par-
ticipants with low empowerment are less likely to report their supervisor relationship
quality. This is a scenario in which maximum likelihood is known to produce accurate
estimates. Figure 3.1 shows the scatterplot of the hypothetical data, with gray circles
representing complete cases and black crosshairs denoting partial data records with
missing leader–member exchange scores. The gray contour rings convey the perspective
of a drone hovering over the peak of the bivariate normal population data, with smaller
contours denoting higher elevation (and vice versa).
FIGURE 3.1. Scatterplot of an artificial data set of leader–member exchange and employee
empowerment scores. Fifty percent of the leader–member exchange scores follow an MAR pro-
cess where participants with low empowerment are more likely to have missing values. Gray
circles represent complete cases, and black crosshairs denote partial data records.
To begin, consider what happens if we discard the incomplete data records and
base the analysis on the 50% of participants with complete data. Figure 3.2 shows the
scatterplot after removing the observations with missing data. Importantly, the black
dot at the means is too high along both axes. This bias makes sense considering Figure
3.1, where the MAR process systematically culls data points from the lower tails of the
distributions in the lower-left quadrant of the plot.
Figure 3.3 adds the partial data records with observed empowerment scores along
the vertical axis (horizontal jitter is added to enhance their visibility). The black dot
now represents the maximum likelihood estimates of the means based on all observed
data. The additional employee empowerment scores from the incomplete data records
exert two forces that steer the inaccurate black dot in Figure 3.2 toward the accurate
black dot in Figure 3.3. First, adding empowerment scores to the low end of the dis-
tribution increases the now-complete variable’s variance and decreases its mean. Visu-
ally, the black dot at the center of the complete-case distribution in Figure 3.2 moves
down to the vertical location of the maximum likelihood estimate in Figure 3.3. The
second, less obvious adjustment comes from the normal curve itself. In a normal distri-
bution, the additional scores at the low end of the employee empowerment distribution
(the crosshair symbols) are only plausible if they are paired with correspondingly low
leader–member exchange scores in the lower-left quadrant of the plot. Although the
matching relationship quality scores are unobserved, the estimator infers their location
and adjusts the parameters to account for the presence of latent or unobserved data in
the lower tail of the distribution. The inaccurate black dot from Figure 3.2, which had
already moved to the correct vertical coordinate, now moves left to the horizontal coor-
dinate of the maximum likelihood estimates (the black dot) in Figure 3.3.
I’ve repeatedly emphasized that maximum likelihood estimation does not impute
missing values, but the illustration suggests that something similar is happening under
the hood. In fact, the normal curve itself functions as an imputation machine in the sense
that the estimator can infer the horizontal location of an unseen data point from its
observed vertical location (and vice versa). Thus, while the estimator doesn’t literally
create a filled-in data set, it does use the normal distribution to deduce plausible values
for the missing data. Widaman (2006) describes this process as implicit imputation, and
I sometimes use his phrase to describe maximum likelihood.
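The "imputation machine" is simply the conditional normal distribution. Assuming hypothetical parameter values, the Python sketch below returns the mean and variance of an unseen X score given an observed Y score, which is exactly the information the estimator exploits:

```python
import numpy as np

def conditional_x_given_y(y, mu, sigma):
    """Conditional distribution of an unseen X score given observed Y
    under a bivariate normal model: returns the mean and residual variance.
    The parameter values below are hypothetical illustration values."""
    mu_x, mu_y = mu
    var_x, var_y = sigma[0, 0], sigma[1, 1]
    cov_xy = sigma[0, 1]
    slope = cov_xy / var_y                  # regression of X on Y
    cond_mean = mu_x + slope * (y - mu_y)   # E(X | Y = y)
    cond_var = var_x - cov_xy ** 2 / var_y  # Var(X | Y = y)
    return cond_mean, cond_var

mu = np.array([10.0, 30.0])
sigma = np.array([[9.0, 6.0], [6.0, 16.0]])
# A low observed score on the second variable implies a low expected
# location for the unseen first variable, pulling the estimates downward
print(conditional_x_given_y(22.0, mu, sigma))
```

A low observed Y drags the expected X below its mean, which is the horizontal adjustment to the black dot described above.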
FIGURE 3.2. Complete-case scatterplot after removing the observations with missing
leader–member exchange scores. The contour rings convey the perspective of a drone hovering
over the bivariate normal population distribution, with smaller contours denoting higher eleva-
tion (and vice versa). The black dot denotes the complete-case means, which are too high along
both axes.
FIGURE 3.3. Scatterplot that adds the partial data records with observed empowerment
scores along the vertical axis (horizontal jitter is added to enhance their visibility). The black dot
represents the maximum likelihood estimates of the means.
Recall from Chapter 2 that the curvature or steepness of the log-likelihood function
near its peak determines the precision of the estimates. A steep function implies high
precision (and a small standard error), because the data’s support for different candi-
date parameter values decreases rapidly as the parameter moves away from its optimal
value in either direction. Measuring curvature and computing standard errors requires
second derivatives of the log-likelihood function. Mathematically, these derivative equa-
tions quantify curvature by measuring the rate at which tangent lines change near the
function’s peak (e.g., lines tangent to a steep function change rapidly near its peak,
whereas lines tangent to a flatter function change very little).
In fact, all the concepts from Chapter 2 work the same with missing data. The
only new detail is that participants with missing data contribute less information to
the derivative expressions. When participants have missing values, their data no lon-
ger contain information about every model parameter. The observed-data log-likelihood
function from Equation 3.5 accommodates this by ignoring parameters that depend on
the missing scores. The incomplete data records similarly contain no information about
the precision of certain estimates, so it makes sense that they wouldn’t contribute to
the corresponding derivative equations (and thus the standard errors). From a practical
perspective, the missing information flattens the log-likelihood function and increases
standard errors, as you might expect. However, the reduction in power may not be com-
mensurate with the missing data rate (e.g., a 20% missing data rate does not imply a 20%
reduction in precision or power). This section outlines changes to the second derivative
matrix that gives rise to standard errors. The steps for converting this matrix into stan-
dard errors—multiplying the Hessian by –1 and computing its inverse—are the same
as in Chapter 2. Although most of the equations are not intuitive, I include them as a
resource for interested readers. Equations aside, the take-home message is straightfor-
ward: When a score is missing, a participant contributes a 0 to any element of the second
derivative matrix that depends on that variable.
A bivariate analysis with a single incomplete variable like that depicted in Figure
3.1 is sufficient for illustrating how the derivative computations change with missing
data. To keep notation simple, I generically refer to the complete and incomplete vari-
ables as Y and X, respectively. Recall from Section 2.12 that second derivatives are stored
in a symmetrical matrix known as the Hessian. The diagonal elements of this matrix
capture the curvature or steepness of the log-likelihood function near its peak, and
the off-diagonal elements measure the degree to which changes to one parameter cor-
respond with changes in another. The Hessian is a symmetrical matrix that comprises
three unique blocks.
H_O(θ) = [ ∂²LL/∂μ²     ∂²LL/∂μ∂Σ
           ∂²LL/∂Σ∂μ    ∂²LL/∂Σ²  ]    (3.10)
The upper-left block is a V × V matrix containing derivatives taken with respect to the
means, the lower-right block is a symmetrical matrix with V × (V + 1) ÷ 2 rows and
columns, one for each unique element of the variance–covariance matrix, and the lower
diagonal (or upper diagonal) is a matrix of cross-product derivatives taken first with
respect to a mean and then with respect to a variance or covariance (or vice versa). The
Hessian for this example is a symmetrical matrix with P = 5 rows and columns, one for
each unique element in μ and Σ.
With complete data, the upper-left block (derivatives taken with respect to the
means) is computed by summing the inverse of the covariance matrix across the N
observations as follows:
∂²LL/∂μ² = −∑_{i=1}^{N} Σ^{−1} = −∑_{i=1}^{N} [σ²_X  σ_XY; σ_YX  σ²_Y]^{−1}    (3.11)
When a participant has missing values, the observed data for that individual no longer
contain information about the precision of certain estimates, so the derivative expression
replaces the appropriate elements of Σ with zeros. In this example, cases with missing X
values contribute zeros in place of σ_YX and σ²_X, as follows:
∂²LL/∂μ² = −∑_{i=1}^{n_C} [σ²_X  σ_XY; σ_YX  σ²_Y]^{−1} − ∑_{i=1}^{n_M} [0  0; 0  (σ²_Y)^{−1}]    (3.12)
The same is true for other parts of the Hessian. For the two off-diagonal blocks, the
incomplete cases contribute information only to the element crossing μY and σY2:
∂²LL/∂μ∂Σ = −∑_{i=1}^{n_C} [Σ^{−1} ⊗ (Y_i − μ)′Σ^{−1}] D_V − ∑_{i=1}^{n_M} [0  0  0; 0  0  (Y_i − μ_Y)(σ²_Y)^{−2}]    (3.13)
and their contribution to the lower-right block of the Hessian is similarly restricted to
the second derivative with respect to σ²_Y.
∂²LL/∂Σ² = −∑_{i=1}^{n_C} D_V′ [Σ^{−1} ⊗ Σ^{−1}(Y_i − μ)(Y_i − μ)′Σ^{−1} − .5Σ^{−1}] D_V
    − ∑_{i=1}^{n_M} [0  0  0; 0  0  0; 0  0  (Y_i − μ_Y)²(σ²_Y)^{−3} − .5(σ²_Y)^{−2}]    (3.14)
The derivative expressions for the complete cases are the same as those in Section 2.12,
and the text adjacent to Equation 2.62 describes the meaning of the previous expres-
sion’s components. Although the equations are complicated, they offer a clear take-
home message: Participants contribute information only about parameters for which
they have data. The expressions also highlight that inserting zeros makes the resulting
sums smaller than they would have been with complete data. As a result, multiplying
the Hessian by –1 and computing its inverse (the matrix analog of a reciprocal) inflates
elements of the parameter covariance matrix, the diagonal of which contains squared
standard errors.
For completeness, the rest of this section provides second derivative equations for
multivariate analyses with more than two variables (e.g., the analysis model in Equa-
tion 3.1). Readers who aren’t interested in these expressions can skip to the next sec-
tion without losing important information. To generalize the previous notation beyond
the bivariate example, we need to introduce a Vi × V matrix τi for each participant that
starts as an identity matrix (a matrix with ones on the diagonal and zeros elsewhere)
and removes the rows corresponding to the missing variables. Incorporating this new
matrix into the complete-data derivative equations from Section 2.12 gives the expres-
sions that follow (Savalei, 2010; Savalei & Bentler, 2009).
∂²LL/∂μ² = −∑_{i=1}^{N} τ_i′Σ_i^{−1}τ_i

∂²LL/∂Σ² = −∑_{i=1}^{N} D_V′ {τ_i′Σ_i^{−1}τ_i ⊗ τ_i′Σ_i^{−1}(Y_i − μ_i)(Y_i − μ_i)′Σ_i^{−1}τ_i − .5τ_i′Σ_i^{−1}τ_i} D_V

∂²LL/∂μ∂Σ = −∑_{i=1}^{N} [τ_i′Σ_i^{−1}τ_i ⊗ (Y_i − μ_i)′Σ_i^{−1}τ_i] D_V    (3.15)
Substituting the maximum likelihood estimates into the derivative expressions, multi-
plying HO(θ̂) by –1, then taking its inverse gives the variance–covariance matrix of the
estimates, the diagonal of which contains squared standard errors.
The parameter arrays in Equation 3.15 align with the observed-data log-likelihood
expression from Equation 3.5, where μi and Σi contain the subset of parameters in μ
and Σ that corresponds to the observed variables in Yi. Pre- and postmultiplying these
arrays by τi fills in the missing elements of those matrices with zeros, just like in Equa-
tions 3.12 through 3.14. To verify, reconsider the previous bivariate example, where the
complete cases have τi = I2 and Σi = Σ and the incomplete cases have τi = (0 1) and Σi =
σY2. Substituting these quantities into the top formula of Equation 3.15 gives the same
result as Equation 3.12.
∂²LL/∂μ² = −∑_{i=1}^{N} τ_i′Σ_i^{−1}τ_i = −∑_{i=1}^{n_C} τ_i′Σ_i^{−1}τ_i − ∑_{i=1}^{n_M} τ_i′Σ_i^{−1}τ_i

    = −∑_{i=1}^{n_C} [1 0; 0 1]′ [σ²_X  σ_XY; σ_YX  σ²_Y]^{−1} [1 0; 0 1] − ∑_{i=1}^{n_M} (0 1)′ (σ²_Y)^{−1} (0 1)

    = −∑_{i=1}^{n_C} [σ²_X  σ_XY; σ_YX  σ²_Y]^{−1} − ∑_{i=1}^{n_M} [0  0; 0  (σ²_Y)^{−1}]    (3.16)
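The τ-matrix bookkeeping in the top line of Equation 3.15 is easy to check numerically. The Python sketch below (numpy; the parameter values and case counts are hypothetical) shows that pre- and postmultiplying by τi reproduces the zero-filled sums in Equation 3.12:

```python
import numpy as np

# Hypothetical bivariate parameters; X (the first variable) is incomplete
sigma = np.array([[9.0, 6.0], [6.0, 16.0]])
tau_complete = np.eye(2)                 # complete cases keep both rows
tau_missing_x = np.array([[0.0, 1.0]])   # cases missing X keep only the Y row

def mean_block(tau, sigma_full):
    """One case's contribution to the mean block of the Hessian:
    -tau' * inv(Sigma_i) * tau (top line of Equation 3.15)."""
    sigma_i = tau @ sigma_full @ tau.T   # covariance of the observed variables
    return -tau.T @ np.linalg.inv(sigma_i) @ tau

n_complete, n_missing = 80, 20
hessian_mu = (n_complete * mean_block(tau_complete, sigma)
              + n_missing * mean_block(tau_missing_x, sigma))
print(hessian_mu)
```

Each incomplete case contributes a matrix that is zero everywhere except the element for μ_Y, so the sums shrink relative to a complete-data analysis and the inverted information matrix (and the standard errors) grow.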
Substituting expectations for the data-dependent terms (replacing (Y_i − μ_i)(Y_i − μ_i)′ with its expected value Σ_i, and the deviation scores with their expected value of zero) yields the expected information versions of the second derivatives.

∂²LL/∂Σ² = −(1/2) ∑_{i=1}^{N} D_V′ {τ_i′Σ_i^{−1}τ_i ⊗ τ_i′Σ_i^{−1}τ_i} D_V    (3.17)

∂²LL/∂μ∂Σ = 0
The derivatives with respect to the means are unchanged, because the precision of these
parameters depends only on Σ.
With complete data, standard errors based on the observed and expected informa-
tion are often indistinguishable. With missing data, however, Kenward and Molenberghs
(1998) showed that standard errors based on the expected information require an unsys-
tematic MCAR mechanism, whereas standard errors based on the observed information are
valid with conditionally MAR processes (technically, a missing always at random pro-
cess). The differential treatment of the deviation scores in Equation 3.15 is the issue, as
assigning zeros to the observed deviation scores works fine if missing values are equally
distributed above and below the estimated means, as would be the case with purely
random missingness. However, this substitution is not optimal for a conditionally MAR
process that culls values from one tail of the distribution, leaving more observations
above the estimated mean than below it (or vice versa). Such is the case in Figure 3.3,
where most of the observed leader–member exchange scores (i.e., the gray circles) are
above the maximum likelihood mean estimate (i.e., to the right of the black dot). Simu-
lation studies show that standard errors based on observed information are preferable
in this case, as the expected information tends to attenuate standard errors and inflate
Type I error rates (Kenward & Molenberghs, 1998; Savalei, 2010).
To illustrate the difference between observed and expected information, I applied
both approaches to the artificial data from Figure 3.1. Table 3.3 shows the variance–
covariance matrices of the estimates, the diagonals of which contain squared standard
errors. Notice that substituting expectations fills the off-diagonal blocks with zeros,
whereas using the observed data produces some nonzero values. Other elements also
differ, as do the standard errors of parameters impacted by missing data. For example,
the standard error of μ̂_X (e.g., the leader–member exchange average) based on observed
information is SE = √0.066 ≈ 0.26, whereas using the expected information shrinks this
value to SE = √0.050 ≈ 0.22.

TABLE 3.3. Variance–Covariance Matrix of the Estimates (Expected Information Block)

        μ_X     μ_Y     σ²_X    σ_XY    σ²_Y
μ_X    0.050
μ_Y    0.023   0.065
σ²_X   0       0       1.065
σ_XY   0       0       0.737   1.200
σ²_Y   0       0       0.340   0.950   2.654

The differences in the table agree with published simulation
studies showing that expected information often attenuates standard errors and inflates
Type I error rates when scores are conditionally MAR (Kenward & Molenberghs, 1998;
Savalei, 2010).
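Converting such a matrix to standard errors only requires square roots of its diagonal elements. A quick Python check using the expected-information values reported for Table 3.3 (entered by hand):

```python
import numpy as np

# Expected-information covariance matrix of the estimates (Table 3.3);
# parameter order: mu_X, mu_Y, var_X, cov_XY, var_Y
acov = np.array([
    [0.050, 0.023, 0.000, 0.000, 0.000],
    [0.023, 0.065, 0.000, 0.000, 0.000],
    [0.000, 0.000, 1.065, 0.737, 0.340],
    [0.000, 0.000, 0.737, 1.200, 0.950],
    [0.000, 0.000, 0.340, 0.950, 2.654],
])
standard_errors = np.sqrt(np.diag(acov))
print(standard_errors.round(2))  # first entry is the SE of the mu_X estimate
```

The first diagonal element reproduces the expected-information standard error of roughly 0.22 quoted in the text.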
Even tidy estimation problems like those in Chapter 2 no longer have analytic solutions
with missing data, making iterative optimization algorithms a necessity. Newton’s algo-
rithm works with derivatives of the observed-data log-likelihood from Equation 3.5, but
the procedure is otherwise the same as that in Section 2.9. The EM algorithm (Dempster
et al., 1977; Rubin, 1991) takes the very different tack of filling in rather than removing
the missing parts of the complete-data log-likelihood from Equation 3.3. Conceptually,
EM is a tool for solving the chicken or the egg dilemma in which knowing the missing
values would lead to solutions for the estimates and having the estimates would provide
the necessary information for predicting the missing values. The algorithm leverages
this interdependence by “imputing” the missing data given the current parameter values
and then updating the parameters given the filled-in data. The idea is that each succes-
sive iteration gives better predictions about the missing values, which in turn improve
the estimates, which in turn sharpen the missing values, and so on.
I use the word “imputing” in air quotes, because EM doesn’t literally fill in the data.
Rather, the algorithm uses the parameter estimates to predict the missing parts of the
complete-data log-likelihood function. In most cases, these imputed terms are functions
of the missing values rather than the missing values themselves (e.g., expected values).
This is an important practical point, because published research articles often describe
EM as an imputation method. Software packages that use EM-generated parameter esti-
mates to implement flawed regression imputation schemes no doubt contribute to this
confusion (e.g., the Missing Values Analysis module in SPSS; von Hippel, 2004).
EM’s origins trace to the early 70s (Baum, Petrie, Soules, & Weiss, 1970; Beale &
Little, 1975; Orchard & Woodbury, 1972), and Dempster et al. (1977) formalized the
algorithm and gave it a name. EM has since evolved into a very general optimization
tool and has enjoyed widespread use with latent variable models that treat unmeasured
latent scores as missing data. Such applications include factor analysis, structural equa-
tion models, multilevel models, finite mixture models, and item response theory models,
to name a few (Bock & Aitkin, 1981; Cai, 2008; Jamshidian & Bentler, 1999; Liang &
Bentler, 2004; McLachlan & Krishnan, 2007; Muthén & Shedden, 1999; Raudenbush
& Bryk, 2002). It is important to highlight that EM’s two-step logic—impute the miss-
ing data, then update the parameters—appears throughout the book, as Markov chain
Monte Carlo (MCMC) algorithms for Bayesian estimation and multiple imputation apply
the same recipe. These procedures estimate parameter values and missing data by draw-
ing the unknown quantities from a probability distribution (i.e., they use computer simu-
lation to generate artificial values), but they are essentially EM algorithms with random
noise added to account for missing data uncertainty. Rubin (1991) provides an excellent
tutorial on the EM algorithm that describes some of these linkages and extensions.
A bivariate analysis with a single incomplete variable (e.g., Figure 3.1) provides
a closer look at the EM algorithm. To keep notation simple, I generically refer to the
incomplete and complete variables as X and Y, respectively. The EM algorithm works
with the hypothetical complete-data log-likelihood that would have resulted had there
been no missing values (see Equation 3.3). We know from Section 2.12 that the maxi-
mum likelihood estimates for this scenario have the following solutions:
μ̂_X = (1/N) ∑_{i=1}^{N} X_i        μ̂_Y = (1/N) ∑_{i=1}^{N} Y_i

σ̂²_X = (1/N) ∑_{i=1}^{N} X_i² − ( (1/N) ∑_{i=1}^{N} X_i )²        σ̂²_Y = (1/N) ∑_{i=1}^{N} Y_i² − ( (1/N) ∑_{i=1}^{N} Y_i )²

σ̂_XY = (1/N) ∑_{i=1}^{N} X_i Y_i − ( (1/N) ∑_{i=1}^{N} X_i )( (1/N) ∑_{i=1}^{N} Y_i )    (3.18)
Missing values introduce holes in the sums and sums of cross-products terms that
define the sufficient statistics for computing μ and Σ. I previously characterized EM as
a tool for solving the chicken or the egg dilemma: Knowing the missing values on the
right side of the equation would lead to solutions for the estimates and having the esti-
mates on the left side of the equation would provide the information necessary to predict
the missing values. The algorithm tackles the dilemma by iterating between two steps:
the expectation step (E-step) addresses the missing values, and the maximization step
(M-step) updates the estimates.
The E-step treats the observed data and current parameter values at iteration t as
known constants and imputes the missing parts of the sums and sums of cross-products
terms with expectations or averages. The bivariate example requires “imputations” for
the missing X, X2, and XY values. With multivariate normal data, a linear regression
model generates predictions for the missing terms.
E(X | Y, μ^(t), Σ^(t)) = γ_0^(t) + γ_1^(t) Y_i    (3.19)

E(X² | Y, μ^(t), Σ^(t)) = (γ_0^(t) + γ_1^(t) Y_i)² + σ²_X|Y^(t)
E(XY | Y, μ^(t), Σ^(t)) = Y_i × E(X | Y_i) = Y_i (γ_0^(t) + γ_1^(t) Y_i)    (3.20)
As you can see, the expectation of X is a predicted value from the regression, and expec-
tation of X2 is a squared predicted score plus a residual variance term that captures the
expected spread of the missing data. Finally, the regression parameters are straightforward
functions of the estimates in μ^(t) and Σ^(t): γ_1^(t) = σ_XY^(t)/σ²_Y^(t), γ_0^(t) = μ_X^(t) − γ_1^(t) μ_Y^(t), and
σ²_X|Y^(t) = σ²_X^(t) − (σ_XY^(t))²/σ²_Y^(t).
The M-step identifies updated parameter values for the next iteration by substitut-
ing the observed data and the expectations into the complete-data formulae in Equation
3.18. The updated estimates for the bivariate example are as follows:
μ_X^(t+1) = (1/N) [ ∑_{i=1}^{n_C} X_i + ∑_{i=1}^{n_M} E(X_i | Y_i) ]

σ²_X^(t+1) = (1/N) [ ∑_{i=1}^{n_C} X_i² + ∑_{i=1}^{n_M} E(X_i² | Y_i) ] − ( (1/N) [ ∑_{i=1}^{n_C} X_i + ∑_{i=1}^{n_M} E(X_i | Y_i) ] )²

σ_XY^(t+1) = (1/N) [ ∑_{i=1}^{n_C} X_i Y_i + ∑_{i=1}^{n_M} Y_i E(X_i | Y_i) ] − (1/N) ∑_{i=1}^{N} Y_i × (1/N) [ ∑_{i=1}^{n_C} X_i + ∑_{i=1}^{n_M} E(X_i | Y_i) ]    (3.21)
These equations clarify that EM is not filling in the missing values themselves, but
rather the missing parts of the sums and sums of cross-products terms needed to com-
pute the complete-data log-likelihood function.
The updated estimates at the M-step carry forward to the next E-step, where the
algorithm uses them to generate new and improved estimates of the missing data, after
which it again updates the parameters. This two-step sequence repeats until the esti-
mates from consecutive M-steps no longer differ. Dempster et al. (1977) and others
(Little & Rubin, 2020; Schafer, 1997) show that the updated estimates from the M-step
always improve on those from the previous iteration in the sense that they increase the
observed-data log-likelihood. As such, EM is achieving the same goal as Newton’s algo-
rithm, albeit with a different approach that simplifies the iterative computations.
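To make the two-step recipe concrete, here is a minimal EM implementation for the bivariate case (Python with numpy). The simulated data, MAR rule, and starting values are my own, not the book's artificial data set; the E-step applies Equation 3.19 and the M-step applies Equation 3.21, and because Y is complete its mean and variance are simply held at their sample values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate bivariate normal data and impose MAR missingness on X:
# cases in the lower half of Y lose their X scores (hypothetical setup)
n = 2000
data = rng.multivariate_normal([10.0, 30.0], [[9.0, 6.0], [6.0, 16.0]], size=n)
x, y = data[:, 0].copy(), data[:, 1]
x[y < np.median(y)] = np.nan            # 50% missing, dependent on observed Y

obs = ~np.isnan(x)
mu_y, var_y = y.mean(), y.var()         # complete-variable parameters
mu_x, var_x, cov_xy = 0.0, 1.0, 0.0     # deliberately poor starting values

for _ in range(200):
    # E-step: expected X, X^2, and XY terms for incomplete cases (Equation 3.19)
    slope = cov_xy / var_y
    intercept = mu_x - slope * mu_y
    resid_var = var_x - cov_xy ** 2 / var_y
    pred = intercept + slope * y
    ex = np.where(obs, x, pred)
    ex2 = np.where(obs, x ** 2, pred ** 2 + resid_var)
    exy = np.where(obs, x * y, y * pred)
    # M-step: update estimates from completed sufficient statistics (Equation 3.21)
    mu_x = ex.mean()
    var_x = ex2.mean() - mu_x ** 2
    cov_xy = exy.mean() - mu_x * mu_y

print(mu_x, var_x, cov_xy)
```

Despite half the X scores being missing and the starting values being far from the truth, the estimates settle near the population values used to generate the data, illustrating the guaranteed hill-climbing behavior described above.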
Analysis Example
To illustrate iterative optimization with missing data, I used the EM algorithm to estimate
the mean vector and covariance matrix of the artificial data from Figure 3.1. A custom R
program is available on the companion website for readers interested in coding EM by
hand, as is a program that implements Newton’s algorithm with missing data. To highlight
the resilience of the algorithm to (really) poor starting values, I fixed the initial means to
0 and set the variance–covariance matrix to an identity matrix (a matrix with ones on
the diagonal and zeros elsewhere). Finally, I terminated the algorithm when the estimates
from consecutive iterations differed by less than .000001, as changes of this magnitude
effectively signal that the algorithm has identified the maximum likelihood estimates.
Table 3.4 shows the iterative updates to the estimates and the observed-data log-
likelihood (the complete-data parameters converge immediately, so I omit these values
from the table). Consistent with Newton’s algorithm from Chapter 2, EM makes large
adjustments to the parameters at first and tiny alterations as it approaches the maximum
likelihood estimates. By the final cycle, the estimates are changing only in the fifth
decimal, so there is no reason to continue iterating. An attractive feature of EM is that
it doesn’t directly manipulate the observed-data log-likelihood function, the composi-
tion of which changes across missing data patterns. As such, the values in the rightmost
column of the table are not a natural by-product of EM, but I include them to highlight
the important conclusion that each iteration is guaranteed to improve on the previous
one (Dempster et al., 1977). Conceptually, these values demonstrate that the algorithm
is “hiking” to a higher elevation on the log-likelihood surface every time it completes an
M-step, just like Newton’s algorithm (albeit with simpler math). Although EM doesn’t
naturally produce standard errors, methodologists have developed procedures for com-
puting these quantities (Cai, 2008; Little & Rubin, 2020; Louis, 1982; McLachlan &
Krishnan, 2007; Meng & Rubin, 1991), and marrying EM and the bootstrap is also an
option.
Having established its core ideas, we can extend maximum likelihood missing data
handling to regression models with directed pathways. Estimation is simple if missing
values are relegated to the outcome variable, in which case deleting incomplete data
records gives the optimal maximum likelihood estimates (Little, 1992; von Hippel,
2007). The situation is more complex and nuanced when explanatory variables are
incomplete, especially when one or more of the predictors are categorical. Rewinding
back to Section 2.10, explanatory variables functioned as known constants in the log-
likelihood expression, and the normal curve assumption applied only to the outcome
(or its residuals, more precisely). The situation changes when predictors are incomplete,
because the covariates require their own probability distribution. This section describes
structural equation (Arbuckle, 1996; Muthén et al., 1987; Wothke, 2000) and factored
regression modeling (Ibrahim et al., 2002; Lipsitz & Ibrahim, 1996; Lüdtke et al., 2020a)
approaches to this problem. As you will see, the former generally foists a normal dis-
tribution on the predictors, whereas the latter offers a more flexible specification that
accommodates mixed response types.
Switching gears to a different substantive context, I use the smoking data from the
companion website to illustrate a multiple regression analysis. The data set includes
several sociodemographic correlates of smoking intensity from a survey of N = 2,000
young adults (e.g., age, whether a parent smoked, gender, income). Piggybacking on
the data analysis example from Section 2.10, the model uses a parental smoking indica-
tor (0 = parents did not smoke, 1 = parent smoked), age, and income to predict smoking
intensity, defined as the number of cigarettes smoked per day. The model and its generic
counterpart are as follows:
INTENSITYi = β0 + β1(PARSMOKEi) + β2(INCOMEi) + β3(AGEi) + εi

Yi = β0 + β1X1i + β2X2i + β3X3i + εi      εi ~ N1(0, σ²ε)      (3.22)
The smoking intensity variable has 21.2% missing data, 3.6% of the parental smoking
indicator scores are missing, and 11.4% of the income values are unknown. Note that I
list the incomplete regressors first, because this will facilitate the factorization process
described below.
f(Y, X1, X2, X3) = f(Y | X1, X2, X3) × f(X1 | X2, X3) × f(X2 | X3) × f(X3)      (3.23)
The first term to the right of the equals sign is the distribution induced by the analysis
model (e.g., Equation 3.22), and the remaining terms are regressions that define predic-
tor distributions. This idea resurfaces later in the context of Bayesian estimation and
multiple imputation, where the multivariate distribution on the left is consistent with
the joint model imputation framework (Asparouhov & Muthén, 2010c; Carpenter &
Kenward, 2013; Schafer, 1997), and the expression on the right is called a sequential
specification (Erler, Rizopoulos, Jaddoe, Franco, & Lesaffre, 2019; Erler et al., 2016;
Ibrahim et al., 2002; Ibrahim et al., 2005; Lipsitz & Ibrahim, 1996; Lüdtke, Robitzsch,
& West, 2020b).
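The factored specification rests on a basic identity: a joint density can always be written as the product of a conditional and a marginal. The snippet below checks this numerically for a bivariate normal with hypothetical parameter values (none of these numbers come from the book's examples).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([10.0, 20.0])
cov = np.array([[4.0, 3.0],
                [3.0, 9.0]])
x_val, y_val = 11.0, 18.0

# joint bivariate normal density at (x, y)
joint = multivariate_normal(mu, cov).pdf([x_val, y_val])

# the linear regression of Y on X implied by the joint normal distribution
b1 = cov[0, 1] / cov[0, 0]                      # slope
b0 = mu[1] - b1 * mu[0]                         # intercept
res_sd = np.sqrt(cov[1, 1] - b1 * cov[0, 1])    # residual standard deviation

# sequential specification: f(y | x) times f(x)
sequential = norm(b0 + b1 * x_val, res_sd).pdf(y_val) * norm(mu[0], np.sqrt(cov[0, 0])).pdf(x_val)
```

The two quantities agree to machine precision, which is the normal-theory case where the joint model and sequential specification coincide.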
Structural equation and factored regression models will give the same answer
when the data are normal, because the multivariate normal distribution always spawns
an equivalent set of linear regression models (Arnold, Castillo, & Sarabia, 2001; Liu,
Gelman, Hill, Su, & Kropko, 2014). The methods diverge with categorical predictors,
because no common joint distribution describes the co-occurrence of discrete and con-
tinuous scores. Therefore, the classic structural equation model effectively foists a nor-
mal distribution on the predictors regardless of their metrics. Limited computer simu-
lations suggest this misspecification might not be detrimental with binary covariates
(Muthén et al., 2016), but it is nonsensical for multicategorical nominal predictors. In
contrast, the factorization on the right side of Equation 3.23 makes no assumption about
the multivariate distribution on the left; f(X1|X2, X3) could be a logistic or probit regres-
sion, f(X2|X3) could be a linear regression, and so on. In fact, the distinction between the
two modeling frameworks is not as clean as my description suggests, as some structural
equation modeling programs accommodate mixtures of categorical and continuous out-
comes in a way that is effectively equivalent to a factored regression model (Muthén et
al., 2016; Pritikin et al., 2018). With this advance organizer in mind, let’s dig into the two
modeling frameworks, paying particular attention to the parental smoking indicator.
To see how this works, consider a simple regression with a single incomplete predictor, where both X and Y are treated as random variables:

Xi = γ0 + ri      ri ~ N1(0, σ²r)      (3.24)
Yi = β0 + β1Xi + εi      εi ~ N1(0, σ²ε)
where γ0 and β0 denote the grand mean and regression intercept, respectively, β1 is the
slope coefficient, and ri and εi are normally distributed residuals with variances σ²r and
σ²ε. I generally use γ's to denote nuisance parameters that we wouldn't have estimated
had the data been complete, and here these coefficients represent features of the predic-
tor distribution.
An important feature of a structural equation model is that its parameters combine
to produce predictions about the population means and variance–covariance matrix. I
refer to these model-predicted or model-implied moments as μ(θ) and Σ(θ) to differenti-
ate them from the μ and Σ arrays in previous sections. The regression model parameters
from Equation 3.24 make the following predictions about the population mean vector
and covariance matrix:
μ(θ) = [ μX(θ) ] = [ γ0         ]      (3.25)
       [ μY(θ) ]   [ β0 + β1γ0  ]

Σ(θ) = [ σ²r      β1σ²r           ]
       [ β1σ²r    β1²σ²r + σ²ε    ]
These expressions have intuitive meaning. For example, the mean of Y is the value
that results from substituting the mean of X into the regression equation, and Y’s vari-
ance has an explained component due to the predictor and leftover residual part. Linear
regression models like this one will always perfectly predict the sample moments (i.e.,
μ(θ̂) = μ̂ and Σ(θ̂) = Σ̂), but this won't generally be true for more complex models (e.g., a
confirmatory factor analysis model, a path model with omitted arrows).
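As a quick numerical illustration of Equation 3.25's logic (all parameter values below are hypothetical), the structural parameters map directly into model-implied moments.

```python
import numpy as np

# structural parameters (hypothetical values)
gamma0 = 10.0            # predictor mean
sigma2_r = 4.0           # predictor variance
beta0, beta1 = 2.0, 0.5  # regression intercept and slope
sigma2_e = 1.0           # residual variance

# model-implied mean vector: mu_Y is the regression line evaluated at X's mean
mu_theta = np.array([gamma0, beta0 + beta1 * gamma0])

# model-implied covariance matrix: Y's variance decomposes into an
# explained part (beta1^2 * sigma2_r) plus the leftover residual part
sigma_theta = np.array([
    [sigma2_r,         beta1 * sigma2_r],
    [beta1 * sigma2_r, beta1 ** 2 * sigma2_r + sigma2_e],
])
```

Here μY(θ) = 2.0 + 0.5 × 10.0 = 7.0, and Y's implied variance is 0.5² × 4.0 + 1.0 = 2.0, matching the intuition spelled out in the text.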
Because it assumes multivariate normality, maximum likelihood estimation for
structural equation models borrows heavily from concepts we’ve already covered. For
example, the observed-data log-likelihood replaces the population mean vector and cova-
riance matrix in Equation 3.6 with their model-implied counterparts (Arbuckle, 1996).
LL(μ(θ), Σ(θ) | Y(obs)) = constant − (1/2) ∑ᵢ₌₁ᴺ ln|Σi(θ)| − (1/2) ∑ᵢ₌₁ᴺ (Yi − μi(θ))′ Σi(θ)⁻¹ (Yi − μi(θ))      (3.26)
Returning to the bivariate log-likelihood expressions in Equations 3.7 and 3.8, the com-
posite or model-implied parameters on the right side of Equation 3.25 replace their
normal distribution counterparts (e.g., individual deviation scores reflect distances
between X and γ0 and Y and β0 + β1γ0), but the expressions are otherwise the same.
Importantly, every variable in the analysis appears in the Y vector regardless of its role
in the model. This feature is vital, because missing data handling requires a distribution
for the outcome and the predictors. As before, the equation says that the data’s evidence
about the model parameters is restricted to those parameters for which an individual
has scores. Some data records provide more information than others, but the overall
goal remains the same—identify the regression model parameters that maximize fit (or
minimize differences) between the observed data and model-implied mean vector and
covariance matrix.
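A bare-bones sketch of the casewise log-likelihood in Equation 3.26: each row contributes a normal density over whichever variables it happens to observe. The parameter values and tiny data set are hypothetical, and the `implied_moments` helper just encodes the simple-regression moments from Equation 3.25.

```python
import numpy as np

def implied_moments(theta):
    """theta = (gamma0, beta0, beta1, sigma2_r, sigma2_e) for a simple regression."""
    g0, b0, b1, s2r, s2e = theta
    mu = np.array([g0, b0 + b1 * g0])
    sigma = np.array([[s2r, b1 * s2r],
                      [b1 * s2r, b1 ** 2 * s2r + s2e]])
    return mu, sigma

def fiml_loglik(theta, data):
    """Sum casewise log densities over each row's observed variables (Eq. 3.26)."""
    mu, sigma = implied_moments(theta)
    ll = 0.0
    for row in data:
        o = ~np.isnan(row)                  # this row's missing data pattern
        if not o.any():
            continue
        d = row[o] - mu[o]                  # deviations for observed variables only
        s = sigma[np.ix_(o, o)]             # matching block of Sigma(theta)
        ll += -0.5 * (o.sum() * np.log(2.0 * np.pi)
                      + np.log(np.linalg.det(s))
                      + d @ np.linalg.solve(s, d))
    return ll

# three rows with different patterns all feed one likelihood
data = np.array([[10.0, 7.5],
                 [12.0, np.nan],    # X only
                 [np.nan, 6.0]])    # Y only
theta = (10.0, 2.0, 0.5, 4.0, 1.0)
ll = fiml_loglik(theta, data)
```

Passing `fiml_loglik` to a numerical optimizer (e.g., `scipy.optimize.minimize` on its negative) would then locate the estimates that maximize fit to the observed data.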
The standard errors for the structural equation model parameters also borrow heav-
ily from previous ideas. As you know, the matrix of second derivatives (the Hessian)
provides the building blocks for standard errors. Replacing μi and Σi in Equation 3.15
with μi(θ) and Σi(θ) gives the second derivative matrix for the model-implied mean vec-
tor and covariance matrix, and multiplying the Hessian by –1 gives the observed infor-
mation matrix.
I_O(μ(θ), Σ(θ)) = −H_O(μ(θ), Σ(θ)) = − [ ∂²LL/∂μ(θ)²        ∂²LL/∂μ(θ)∂Σ(θ) ]      (3.27)
                                       [ ∂²LL/∂Σ(θ)∂μ(θ)    ∂²LL/∂Σ(θ)²     ]
This matrix isn’t exactly what we need, however, because it reflects the data’s informa-
tion about the composite parameters in μ(θ) and Σ(θ). Introducing an additional matrix
that distributes the model-implied information to the appropriate structural model
parameters provides the standard errors.
Returning to Equation 3.26, each mean, variance, and covariance in μ(θ) and Σ(θ) is
a weighted combination of the regression model parameters. The new array in question
summarizes these linkages in a matrix containing weights or coefficients that capture
the amount by which the model-implied moments in the rows change as a function of
the regression parameters in the columns. Table 3.5 shows the coefficient matrix for
the simple regression model. The lone 1 in the first row indicates that γ0 is the sole
determinant of μX(θ) and no other structural model parameters contribute to this mean.
Similarly, the three nonzero coefficients in the second row reflect the amount by which
μY(θ) changes as a function of γ0, β0, and β1 (e.g., the β1 in the first column weights γ0’s
contribution to the mean, the γ0 in the third column reflects β1’s influence). To keep
notation simple, I denote the coefficient matrix in the table as Δ (technically, the weights
are derivatives of the model-implied moments with respect to the regression model
parameters). Finally, substituting the maximum likelihood estimates, pre- and post-
multiplying the information matrix by Δ, then taking the inverse (the matrix analogue
of a reciprocal) gives the variance–covariance matrix of the regression model parameters.
Var(θ̂) = ( Δ̂′ I_O(μ(θ̂), Σ(θ̂)) Δ̂ )⁻¹      (3.28)
As always, the diagonal elements of the parameter covariance matrix contain squared
standard errors. The expression is somewhat complicated, but all it’s doing is reappor-
tioning the data’s information about the model-implied mean vector and covariance
matrix to the appropriate structural model parameters. Interested readers can consult
Savalei and Rosseel (2021) for a more detailed account of estimation for structural equa-
tion models.
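A small finite-difference sketch (with hypothetical parameter values) recovers the mean portion of the Δ matrix described in Table 3.5: rows are model-implied moments, columns are structural parameters.

```python
import numpy as np

def implied_means(theta):
    """Model-implied means [mu_X, mu_Y] for the simple regression (Eq. 3.25)."""
    gamma0, beta0, beta1 = theta
    return np.array([gamma0, beta0 + beta1 * gamma0])

theta = np.array([10.0, 2.0, 0.5])    # hypothetical gamma0, beta0, beta1
eps = 1e-6
delta = np.empty((2, 3))
for j in range(3):
    up, down = theta.copy(), theta.copy()
    up[j] += eps
    down[j] -= eps
    # central difference: how much does each implied moment change
    # per unit change in structural parameter j?
    delta[:, j] = (implied_means(up) - implied_means(down)) / (2 * eps)
```

The first row comes out as [1, 0, 0] (γ0 is the sole determinant of μX(θ)), and the second row as [β1, 1, γ0], exactly the pattern of weights described in the text.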
Returning to the multiple regression from Equation 3.22, the structural model
represents the analysis as four linear regression equations with normally distributed
residuals.
PARSMOKEi = γ01 + r1i
INCOMEi = γ02 + r2i
AGEi = γ03 + r3i
INTENSITYi = β0 + β1(PARSMOKEi) + β2(INCOMEi) + β3(AGEi) + εi
Figure 3.4 depicts the regressions as a path diagram, with rectangles denoting manifest
(measured) variables, circles representing latent variables or residuals, straight arrows
representing regressions, and curved arrows representing the correlations that link the
incomplete predictors.

FIGURE 3.4. Path diagram of a three-predictor regression model treating predictors as ran-
dom variables. Incomplete predictors are linked via correlations (curved arrows).
f(Y, X1, …, XK) = f(Y | X1, …, XK) × f(XK | X1, …, XK−1) × … × f(X2 | X1) × f(X1)      (3.30)
I ultimately drop the rightmost term, because age is complete and does not require a dis-
tribution. The parental smoking distribution, f(PARSMOKE|INCOME, AGE), could be a
logistic or probit regression. I use logistic regression, but the choice is arbitrary, because
this model is not the substantive focus. The generic functions above translate into the
following regression models:
ln[ Pr(PARSMOKEi = 1) / (1 − Pr(PARSMOKEi = 1)) ] = γ01 + γ11(INCOMEi) + γ21(AGEi)
As a reminder, I use γ’s throughout the book to differentiate supporting model param-
eters from the focal model’s coefficients.
Figure 3.5 shows a path diagram of the models in Equation 3.32. As explained in
Section 2.13, categorical variable models envision binary responses originating from
an underlying latent response variable that represents one’s underlying proclivity or
propensity to endorse the highest category (Agresti, 2012; Johnson & Albert, 1999). Fol-
lowing diagramming conventions from Edwards, Wirth, Houts, and Xi (2012), I use an
FIGURE 3.5. Path diagram of a three-predictor regression model that links the predictors
via regressions rather than correlations. The broken arrow denotes a link function that maps
the underlying latent variable to the discrete scores. Parental smoking is a latent variable in its
regression on age but a binary predictor of smoking intensity.
oval and rectangle to differentiate the latent response variable and its binary indicator,
respectively, and the broken arrow connecting the two is the link function that maps
the unobserved continuum to the discrete responses (e.g., the broken arrow reflects the
idea that latent scores above and below the threshold convert to 1’s and 0’s, respectively).
The latent variable has a residual term, but its variance is a fixed constant (π² ÷ 3, the
variance of the standard logistic distribution). The figure highlights that the parental
smoking indicator simultaneously exists in two forms: The latent response variable (the
logit) appears as the outcome in the categorical regression, and the binary indicator is
a predictor in the focal model. The path diagram is inconsistent with the classic struc-
tural equation model formulation that assigns a multivariate normal distribution to the
variables, but newer modeling frameworks accommodate mixtures of categorical and
continuous outcomes in a manner equivalent to factored regressions (Muthén et al.,
2016; Pritikin et al., 2018).
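The latent-response idea is easy to mimic by simulation: logistic propensities above a threshold become 1's, and the latent variance equals π² ÷ 3 when the scale is fixed at 1. The location value below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
# latent propensities with standard logistic residuals (scale = 1,
# so the residual variance is pi^2 / 3, as in the text)
latent = rng.logistic(loc=0.25, scale=1.0, size=n)
# the link: scores above the threshold (here, zero) convert to 1's
binary = (latent > 0).astype(int)
prop_ones = binary.mean()
```

The simulated latent variance lands near π²/3 ≈ 3.29, and the proportion of 1's matches the logistic CDF evaluated at the location.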
The factored regression model is conceptually straightforward, as it simply casts a
set of multivariate associations as a sequence of univariate regression models. However,
estimating the models is not so straightforward, because incomplete variables can appear
in multiple equations, as they do here (e.g., the parental smoking indicator is the depen-
dent variable in one equation and an explanatory variable in another). This means that
maximum likelihood’s “implicit imputation” procedure must simultaneously account
for a variable’s role in two or more distributions. Ibrahim (1990) and Lipsitz and Ibrahim
(1996) use a variant of the EM algorithm called “EM by the method of weights” to obtain
the maximum likelihood estimates. The algorithm uses a procedure known as numeri-
cal integration (Rabe-Hesketh, Skrondal, & Pickles, 2004; Wirth & Edwards, 2007) to
replace each missing value with a weighted set of pseudo-imputations. I sketch the main
ideas behind the procedure and point interested readers to Lüdtke et al. (2020a) for a
worked example.
As described earlier, EM’s E-step treats the observed data and parameter values at
iteration t as known constants and fills in the missing parts of the data with expectations
or averages. Ibrahim’s (1990) method of weights achieves this in an imputation-esque
fashion by filling in each individual’s missing values with more than one replacement
score. However, unlike multiple imputation, all participants share a common, fixed grid
of replacement values or “nodes” that span the incomplete variable’s entire range. For
example, the smoking intensity scores range from 2 to 29, so the procedure could use a
fixed grid of integer pseudo-imputations ranging from 0 to 30. Similarly, missing paren-
tal smoking indicator scores are imputed twice with support nodes of 0 and 1. The data
are stacked, such that participants with missing values have multiple rows, one per each
combination of pseudo-imputations. The primary goal of the E-step is to weight each
row according to the likelihood of its data given the current parameter values. These
weights are derived by substituting observed scores and pseudo-imputations into the
distribution functions depicted in Equation 3.31 and performing the multiplication pre-
scribed by the factorization (e.g., the weights for this example involve the product of a
Bernoulli likelihood for the binary variable and two normal likelihoods).
The M-step updates the parameters for the next iteration by finding the estimates
that maximize a weighted complete-data log-likelihood function that accounts for the
fact that some participants have multiple data records with different pseudo-imputations
(e.g., using Newton’s algorithm; Ibrahim, 1990; Lipsitz & Ibrahim, 1996). For this exam-
ple, the M-step estimates two linear regressions (the focal model and the income model)
and a logistic regression (the parental smoking model) from a stacked data set contain-
ing the newly updated grid of weighted replacement scores. As described earlier, each
successive iteration of the EM algorithm gives better predictions about the missing val-
ues (which in this case are encoded as the weighted pseudo-imputations), which in turn
improve the estimates, which in turn sharpen the missing values, and so on.
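The E-step weighting can be sketched for a single participant with an observed outcome and a missing binary predictor: each support node gets a weight proportional to the product of the likelihoods prescribed by the factorization. All parameter and data values below are hypothetical placeholders, not the book's estimates.

```python
import numpy as np
from scipy.stats import norm

# one participant: outcome observed, binary predictor missing
y_i = 15.0
beta0, beta1, sigma_e = 8.8, 2.7, 5.0   # focal model parameters at iteration t
p_x = 0.45                              # Pr(X = 1) from the predictor model

nodes = np.array([0, 1])                # support grid for the binary variable
like_y = norm(beta0 + beta1 * nodes, sigma_e).pdf(y_i)   # normal f(Y | X = node)
like_x = np.where(nodes == 1, p_x, 1.0 - p_x)            # Bernoulli f(X = node)
weights = like_y * like_x               # product prescribed by the factorization
weights = weights / weights.sum()       # normalize within the participant
```

The participant's two stacked data records carry these weights into the M-step; here the high outcome score makes the X = 1 node the more plausible pseudo-imputation.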
Analysis Example
To illustrate maximum likelihood missing data handling for multiple regression, I used
the structural equation model and factored regression approaches to estimate the smok-
ing intensity model from Equation 3.22. Figures 3.4 and 3.5 show the path diagrams.
I centered the income and age variables at their grand means to maintain
the intercept’s interpretation as the expected smoking intensity score for a respondent
whose parents did not smoke. Importantly, the smoking intensity distribution has sub-
stantial positive skewness and kurtosis, so I used robust (sandwich estimator) standard
errors and test statistics to compensate. Bootstrap resampling is an alternative approach
to addressing this problem. Analysis scripts are available on the companion website.
Table 3.6 shows the maximum likelihood estimates of the focal model parameters.
In the interest of space, I omit the regressor model estimates, because they are not the
substantive focus. The two approaches produced estimates and standard errors that
were equivalent to the third decimal. The performance of the classic structural equation
model might be surprising given that it implicitly imputes the binary predictor with
continuous normal scores, but computer simulation evidence suggests that this may be
fine in some situations (Muthén et al., 2016). This example is probably optimal, because
the binary variable has similar category proportions, and larger differences could result
if group sizes are highly unbalanced. Importantly, the estimates from both models have
the same meaning. For example, the intercept (β̂0 = 8.78, SE = 0.11) is the expected num-
ber of cigarettes smoked per day for a respondent whose parents didn’t smoke, and the
parental smoking indicator slope (β̂1 = 2.66, SE = 0.18) is the mean difference, control-
ling for age and income.
Maximum likelihood estimation offers three significance testing options: the Wald test
(Wald, 1943), likelihood ratio statistic (Wilks, 1938), and the score test or modifica-
tion index (Rao, 1948; Saris et al., 1987; Sörbom, 1989). All three procedures are appli-
cable to missing data, and their details are largely the same as those in Section 2.12.
For example, the Wald test requires parameter estimates and their variance–covariance
matrix. We’ve already seen that the computation of the covariance matrix changes with
missing data (e.g., some cases contribute more information about the parameters than
others), but the composition and interpretation of the test statistic is otherwise identi-
cal to Equation 2.45. Similarly, corrective procedures for non-normality have long been
available for missing data (Arminger & Sobel, 1990; Enders, 2002; Savalei, 2010; Savalei
& Yuan, 2009; Yuan, 2009b; Yuan & Bentler, 2000, 2010; Yuan & Zhang, 2012), and
there is a good deal of literature supporting their use (Enders, 2001; Savalei & Bentler,
2005; Savalei & Falk, 2014; Yuan, Tong, & Zhang, 2014; Yuan et al., 2012). Several
analysis examples in Chapter 10 illustrate significance tests and corrective procedures
for missing data.
Returning to the multiple regression model from Equation 3.22, I use the Wald test
and likelihood ratio statistic to evaluate the null hypothesis that R2 = 0. Both tests func-
tion like the omnibus F test from ordinary least squares in this context. To begin, the
Wald test standardizes discrepancies between the estimates and null values against the
covariance matrix of the parameter estimates, the diagonal of which contains squared
standard errors. The full covariance matrix is a 5 × 5 matrix, but the test uses only the
elements related to the slope coefficients. Equation 2.49 shows the composition of the
test statistic. The test statistic with Q = 3 degrees of freedom was statistically signifi-
cant, TW = 519.43 (p < .001), from which we can conclude that at least one slope is non-
zero. The normal-theory Wald test is not optimal for this example, because the smoking
intensity variable is substantially skewed and kurtotic. The test statistic based on the
sandwich estimator covariance matrix was markedly lower at TW = 449.32 (p < .001) but
gave the same conclusion.
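In matrix form, the omnibus Wald statistic simply standardizes the slope vector against its covariance block. The estimates and covariance entries below are hypothetical placeholders (not the table's values); the computation mirrors Equation 2.49.

```python
import numpy as np
from scipy.stats import chi2

slopes = np.array([2.66, 0.12, 0.35])        # hypothetical slope estimates
vcov = np.array([[0.0324, 0.0010, 0.0020],   # hypothetical 3 x 3 slope block of
                 [0.0010, 0.0041, 0.0010],   # the parameter covariance matrix
                 [0.0020, 0.0010, 0.0110]])  # (diagonal = squared standard errors)

# quadratic form: slopes' * inv(vcov) * slopes, with Q = 3 degrees of freedom
t_wald = slopes @ np.linalg.solve(vcov, slopes)
p_value = chi2.sf(t_wald, df=slopes.size)
```

Substituting a sandwich-estimator covariance matrix for `vcov` gives the robust version of the test described in the text.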
The likelihood ratio statistic evaluates the same hypothesis but requires a nested
or restricted model that aligns with the null. This secondary model is an empty regres-
sion that fixes the three slope coefficients to zero. With complete data, you can get the
restricted model log-likelihood by constraining the slope coefficients to zero during
estimation or by excluding the explanatory variables from the analysis. The latter option
generally isn’t valid with missing data, because the observed data are not constant in the
two models (i.e., incomplete explanatory variables have a distribution and thus contrib-
ute to the log-likelihood). Rather, keeping the explanatory variables in the model and
constraining their coefficients to zero during estimation always gives the correct test
statistic. The appropriate nested model is as follows:

INTENSITYi = β0 + 0(PARSMOKEi) + 0(INCOMEi) + 0(AGEi) + εi
Fitting the two models and applying Equation 2.46 produced a statistically significant
test statistic with Q = 3 degrees of freedom, TLR = 449.38 (p < .001). The validity of this
test is questionable given the non-normal data, so I applied the Satorra–Bentler rescaled
test statistic as a comparison (Satorra & Bentler, 1988, 1994). The rescaled test statistic
from Equation 2.47 was markedly lower at TSB = 325.34 (cLR = 1.38) but gave the same
conclusion.
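The likelihood ratio computation itself is trivial once both models are fit: twice the log-likelihood gap, referred to a chi-square with degrees of freedom equal to the number of constrained slopes. The log-likelihood values below are hypothetical, chosen only to reproduce the reported TLR = 449.38.

```python
from scipy.stats import chi2

# hypothetical maximized log-likelihoods for the full model and the
# slopes-fixed-to-zero restricted model
ll_full = -7210.55
ll_restricted = -7435.24

t_lr = 2.0 * (ll_full - ll_restricted)   # likelihood ratio statistic
p_value = chi2.sf(t_lr, df=3)            # three constrained slope coefficients
```

The Satorra–Bentler correction would then divide `t_lr` by a scaling factor (cLR in the text) before referring it to the chi-square distribution.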
Consider next a moderated regression model, in which the product of a focal predictor X and a moderator M carries the interaction effect:

Yi = β0 + β1Xi + β2Mi + β3XiMi + εi      εi ~ N1(0, σ²ε)      (3.34)
In this model, β1 is a conditional effect that reflects the influence of X when M equals
zero, and β2 is the corresponding conditional effect of M when X equals zero. The β3
coefficient is usually of particular interest, since it captures the change in the β1 slope
for a one-unit increase in M (i.e., the amount by which X’s influence on Y is moderated
by M).
Switching gears to a different substantive context, I use the chronic pain data to illus-
trate a moderated regression analysis with an interaction effect. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life, per-
ceived control) for a sample of N = 275 individuals with chronic pain (see Appendix). The
motivating question is whether gender moderates the influence of depression on psychosocial disability, a construct capturing pain's impact on emotional behaviors (e.g., psychological autonomy and communication; emotional stability). The moderated regression model is

DISABILITYi = β0 + β1(DEPRESSi) + β2(MALEi) + β3(DEPRESSi)(MALEi) + β4(PAINi) + εi      εi ~ N1(0, σ²ε)
where DISABILITY and DEPRESS are scale scores measuring psychosocial disability and
depression, MALE is a gender dummy code (0 = female, 1 = male), and PAIN is a binary
severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain). I centered depres-
sion scores at their grand mean to facilitate interpretation. The disability and depression
scores have 9.1 and 13.5% of their scores missing, respectively, and approximately 7.3%
of the binary pain ratings are missing. By extension, 13.5% of the sample is also missing
the product term.
Just‑Another‑Variable Approach
The so-called “just-another-variable” approach to estimating interactive and curvilinear
effects warrants a brief discussion, because it is easy to implement in the structural
equation modeling framework. The method should be a last resort, because it requires
an MCAR mechanism and is prone to substantial biases otherwise (Enders et al., 2014;
Lüdtke et al., 2020a; Zhang & Wang, 2017). The factored regression model is generally
a better choice, because it requires an MAR process.
As its name implies, the just-another-variable strategy treats the product term like
any other normally distributed predictor (von Hippel, 2009). To apply this method to the
moderated regression analysis in Equation 3.36, you would first compute a new variable
PRODUCT = DEPRESS × MALE, treating the result as missing when either of the com-
ponents is missing. The product would then function like any other variable in analysis.
The structural equation model for the chronic pain example represents the analysis with
four linear regression equations, each with normally distributed residual terms.
The path diagram for this analysis is like the one in Figure 3.4.
The normality assumption is especially problematic here, because the product of
two random variables isn’t normal (Craig, 1936; Lomnicki, 1967; Springer & Thomp-
son, 1966), and the mean and variance of a product are deterministic functions of the
component variables (Aiken & West, 1991; Bohrnstedt & Goldberger, 1969). Seaman
et al. (2012) present analytic arguments showing that the just-another-variable strategy
is approximately unbiased when missingness is completely at random (i.e., does not
depend on the data), and they further show that the procedure is biased with a condi-
tionally MAR process. Several simulation studies support this conclusion (Enders et al.,
2014; Kim, Belin, & Sugar, 2018; Kim, Sugar, & Belin, 2015; Lüdtke et al., 2020a; Sea-
man et al., 2012; von Hippel, 2009; Zhang & Wang, 2017).
f(Y, X, M, Z) = f(Y | X, M, X × M, Z) × f(X | M, Z) × f(M | Z) × f(Z)      (3.37)
The first term to the right of the equals sign is now the distribution induced by the mod-
erated regression, and the regressor models follow the same pattern as before. Impor-
tantly, the product term is not a variable with its own distribution but operates more like
a deterministic function of X or M, either of which could be missing.
Assigning the complete predictor to the rightmost term gives the following factor-
ization for the psychosocial disability analysis:
f(DISABILITY, DEPRESS, PAIN, MALE) = f(DISABILITY | DEPRESS, MALE, DEPRESS × MALE, PAIN) ×
    f(DEPRESS | PAIN, MALE) × f(PAIN* | MALE) × f(MALE*)
The first term to the right of the equals sign corresponds to the analysis model from
Equation 3.35, and the regressor models in the latter three terms translate into a linear
regression for depression, a logistic (or probit) model for the pain severity indicator,
and an empty logistic (or probit) model for the marginal distribution of gender (which
I ignore, because this variable is complete and does not require a distribution). An
asterisk superscript on the variable name denotes a latent response variable (logit or
probit).
ln[ Pr(PAINi = 1) / (1 − Pr(PAINi = 1)) ] = γ02 + γ12(MALEi)
Analysis Example
Continuing with the chronic pain example, I used the factored regression approach
to estimate the models in Equation 3.40. The procedure requires a grid of pseudo-
imputations or nodes for numerical integration. The observed psychosocial disability
scores range from 6 to 32, so I specified a somewhat wider grid consisting of integers
between 0 and 40. The observed depression scores similarly range from 7 to 28, so I
used integer values from 0 to 35 for the pseudo-imputations. Finally, the severe pain
indicator requires just two nodes, 0 and 1. The number of grid points is consistent with
recommendations from the literature (Skrondal & Rabe-Hesketh, 2004), and using more
nodes had no impact on the analysis results. As explained previously, the EM algorithm
derives person-specific weights for the pseudo-imputations that quantify a node’s fit to
the observed data, and missing values are essentially replaced by a weighted average
over the entire grid of plausible values. Finally, I centered the depression scores at the
grand mean to enhance interpretability (Aiken & West, 1991; Cohen et al., 2002), and
doing so required a preliminary maximum likelihood analysis to estimate the mean
vector and covariance matrix of the analysis variables, as the mean of the available data
could be biased. Analysis scripts are available on the companion website.
Table 3.7 gives the parameter estimates from the analysis, and Figure 3.6 plots
the simple slopes for males and females. In the interest of space, Table 3.7 omits the
regressor model estimates, as they are not the substantive focus. Recall that lower-
order terms are conditional effects that depend on scaling; β̂1 = 0.38 (SE = 0.06) is the
effect of depression on psychosocial disability for female participants (the solid line
in Figure 3.6), and β̂2 = –0.79 (SE = 0.55) is the gender difference at the depression
mean (the vertical distance between lines at a value of zero on the horizontal axis).
The interaction effect captures the slope difference for males. The negative coefficient
(β̂3 = –0.24, SE = 0.09) indicates that the male depression slope (the dashed line) was
approximately 0.24 points lower than the female slope (i.e., the male slope is β̂1 + β̂3 =
0.38 – 0.24 = 0.14).
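The simple-slope arithmetic from this paragraph can be scripted directly with the coefficients reported in the text; the `gender_gap` helper is a hypothetical convenience function, not part of the analysis output.

```python
# coefficients reported in the text
b1 = 0.38     # depression slope for females (MALE = 0)
b2 = -0.79    # gender difference at the depression mean
b3 = -0.24    # interaction: change in the depression slope for males

slope_female = b1
slope_male = b1 + b3            # 0.38 - 0.24 = 0.14

def gender_gap(depress_centered):
    """Predicted male-female disability difference at a centered depression value."""
    return b2 + b3 * depress_centered
```

At the depression mean, `gender_gap(0.0)` returns β2 itself; moving up the depression scale widens the gap, which is the visual spread between the two lines in Figure 3.6.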
Although the just-another-variable approach is not advisable, you can use the struc-
tural equation framework to recast the moderated regression as a multiple-group model
that features separate regressions for males and females. This approach isn’t always an
[Figure 3.6 is a line graph of psychosocial disability (vertical axis) against centered depression (horizontal axis, –20 to 20), with separate lines for female and male participants.]
FIGURE 3.6. Simple slopes (conditional effects for males and females) from the moderated
regression analysis example. The zero value on the horizontal axis corresponds to the mean.
option, but it is here, because the moderator variable is categorical and complete. The
multiple-group specification features three regression equations per group.
DEPRESSi = γ01(F) + r1i(F)      (3.40)
PAINi = γ02(F) + r2i(F)
DISABILITYi = β0(F) + β1(F)(DEPRESSi) + β2(F)(PAINi) + εi(F)

DEPRESSi = γ01(M) + r1i(M)
PAINi = γ02(M) + r2i(M)
DISABILITYi = β0(M) + β1(M)(DEPRESSi) + β2(M)(PAINi) + εi(M)
The alphanumeric superscripts on the coefficients and residual terms indicate that
every parameter in the model can potentially differ by gender. The moderated regres-
sion is somewhat more restrictive, because it includes only group-specific intercepts
and slopes. To align the multiple-group model with the factored regression analysis, I
imposed between-group equality constraints on all other parameters, but this specifica-
tion is optional.
Table 3.8 gives the multiple-group model parameter estimates. Although it packages
the results differently, the multiple-group estimates were quite like those of the factored
regression model. To highlight this linkage, I formed two contrasts that compared the
Parameter            Estimate    SE      z       p
Males
β0(M)                  20.95    0.49   42.74    0.00
β1(M) (DEPRESS)         0.15    0.07    2.15    0.03
β2(M) (PAIN)            1.92    0.60    3.20    0.00
σ²ε                    16.45    1.51   10.92    0.00
Group contrasts
β0(M) – β0(F)          –0.54    0.55   –0.98    0.33
β1(M) – β1(F)          –0.24    0.09   –2.67    0.01
groups’ intercept and slope coefficients. The two bottom rows of Table 3.8 show that the
intercept difference is similar to the β2 coefficient from the factored regression and the
slope difference is equal to the β3 interaction effect. Furthermore, the difference between
the male and female slopes was statistically significant, as it was in the factored regres-
sion model.
The factored regression approach readily accommodates other types of nonlinear terms.
Curvilinear regression models with polynomial terms and incomplete predictors are an
important example. To illustrate, consider a prototypical polynomial regression model
that features a squared or quadratic term for X (i.e., the interaction of X with itself).
Yi = β0 + β1Xi + β2Xi² + εi      εi ~ N1(0, σ²ε)      (3.41)
As in a moderated regression analysis, β1 is a conditional effect that captures the influence of X when X itself equals 0 (Aiken & West, 1991; Cohen et al., 2002). The β2 coefficient is of particular interest, because it captures acceleration or deceleration (i.e., curvature) in the trend line. For example, if β1 and β2 are both positive, the influence of X
on Y becomes more positive as X increases, whereas a positive β1 and a negative β2 imply
that X’s influence diminishes as X increases.
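To make the conditional-effect interpretation concrete, the instantaneous slope of X in Equation 3.41 is β1 + 2β2X. The short sketch below evaluates this derivative at several predictor values; the coefficients are invented for illustration, not estimates from any data set in this book.

```python
# Instantaneous slope of X in the quadratic model Y = b0 + b1*X + b2*X^2:
# dY/dX = b1 + 2*b2*X. The coefficients are illustrative, not estimates.
def conditional_slope(b1, b2, x):
    """Derivative of the quadratic trend line at predictor value x."""
    return b1 + 2 * b2 * x

b1, b2 = 0.50, -0.02  # positive linear term, negative curvature

# With b2 < 0, X's influence weakens (and eventually reverses) as X increases
print(round(conditional_slope(b1, b2, 0), 2))   # 0.5 (slope at the centering point)
print(round(conditional_slope(b1, b2, 10), 2))  # 0.1
print(round(conditional_slope(b1, b2, 20), 2))  # -0.3
```

The sign change between the last two values is the "diminishing influence" pattern described in the text.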
Maximum Likelihood Estimation with Missing Data 131
To provide a substantive context, I use the math achievement data set from the
companion website that includes pretest and posttest math scores and several academic-
related variables (e.g., math self-efficacy, anxiety, standardized test scores, sociodemo-
graphic variables) for a sample of N = 250 students. The literature suggests that anxi-
ety could have a curvilinear relation with math performance, such that the negative
influence of anxiety on achievement worsens as anxiety increases. The model below
accommodates this nonlinearity while controlling for a binary indicator that measures
whether a student is eligible for free or reduced-price lunch (0 = no assistance, 1 = eli-
gible for free or reduced-price lunch), math pretest scores, and a gender dummy code (0 =
female, 1 = male):

MATHPOSTi = β0 + β1(ANXIETYi) + β2(ANXIETYi²) + β3(FRLUNCHi) + β4(MATHPREi) + β5(MALEi) + εi    (3.42)
εi ~ N1(0, σε²)

Anxiety scores are centered at their grand mean to facilitate interpretation. Approxi-
mately 16.8% of the posttest math scores and 8.8% of the math anxiety ratings are miss-
ing, as are 5.2% of the lunch assistance indicator codes.
f(MATHPOST, ANXIETY, FRLUNCH, MATHPRE, MALE) =
f(MATHPOST | ANXIETY, ANXIETY², FRLUNCH, MATHPRE, MALE) ×    (3.43)
f(ANXIETY | FRLUNCH, MATHPRE, MALE) ×
f(FRLUNCH* | MATHPRE, MALE) × f(MATHPRE | MALE) × f(MALE*)
The first term to the right of the equals sign is the normal distribution induced by
the curvilinear regression in Equation 3.42, and the regressor models translate into a
linear regression for anxiety and a logistic (or probit) model for the lunch assistance
indicator.
ln[Pr(FRLUNCHi = 1) / (1 − Pr(FRLUNCHi = 1))] = γ02 + γ12(MATHPREi) + γ22(MALEi)
Following earlier examples, I ignore the last two terms in Equation 3.43, because the
pretest scores and gender dummy code are complete and do not require a distribution.
Finally, note that the squared term is not a variable with its own distribution and instead
operates more like a deterministic function of the incomplete anxiety scores.
Analysis Example
Continuing with the math achievement example, I used the factored regression approach
to estimate the curvilinear regression from Equation 3.42 and the supporting predictor
models. To enhance the interpretability of the estimates, I centered math anxiety at its
grand mean after first performing a maximum likelihood analysis to estimate the mean
vector and covariance matrix of the analysis variables. Recall that numerical integra-
tion requires a grid of pseudo-imputations or nodes for any variable with a distribution.
The observed math scores range from 35 to 85, so I specified a somewhat wider grid
consisting of integers between 25 and 95. The observed anxiety scores range from 0 to
56, so I used integer values between –10 and 65 for the pseudo-imputations. Again, the
number of grid points is consistent with recommendations from the literature (Skrondal
& Rabe-Hesketh, 2004), and using more nodes had no impact on the analysis results.
Analysis scripts are available on the companion website.
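The marginalization that numerical integration performs can be sketched with a rectangular grid of nodes. The model and parameter values below are invented, and the simple rectangle rule is a stand-in for the quadrature routines that software actually uses:

```python
import math

# Rectangle-rule sketch of how numerical integration marginalizes over a
# grid of pseudo-imputations (nodes). The model and values are invented:
#   Y = b0 + b1*X + e,  e ~ N(0, s2e),  X ~ N(mux, s2x)

def npdf(v, mean, var):
    """Univariate normal density."""
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

b0, b1, s2e = 2.0, 0.5, 1.0
mux, s2x = 0.0, 4.0

# Grid of nodes spanning well beyond the plausible range of X
nodes = [mux - 10 + 0.1 * k for k in range(201)]  # -10 to 10 in steps of 0.1
width = 0.1

def marginal_y(y):
    """f(y) ~= sum_k f(y | x_k) f(x_k) * width over the node grid."""
    return sum(npdf(y, b0 + b1 * x, s2e) * npdf(x, mux, s2x) * width
               for x in nodes)

# Analytic check: marginally, Y ~ N(b0 + b1*mux, b1^2 * s2x + s2e)
exact = npdf(1.0, b0 + b1 * mux, b1 ** 2 * s2x + s2e)
print(abs(marginal_y(1.0) - exact) < 1e-4)  # the grid sum recovers the marginal density
```

A sufficiently wide and dense grid makes the discrete sum indistinguishable from the exact integral, which is why adding nodes beyond the recommended number had no impact on the analysis results.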
Table 3.9 gives the parameter estimates from the analysis, and Figure 3.7 plots the
regression line (marginalizing over the covariates). In the interest of space, Table 3.9
omits the regressor model estimates, as they are not the substantive focus. Because of
centering, the lower-order anxiety slope (β̂1 = –0.26, SE = 0.08) reflects the influence of
this variable on math achievement at the anxiety mean (i.e., instantaneous rate of change
in the outcome when the predictor equals zero). The negative curvature coefficient (β̂2 =
–0.01, SE = 0.005) indicates that the anxiety slope became more negative as anxiety
increased. This interpretation is clear from the figure, where the regression function is
concave down.
A conditionally MAR mechanism is usually the default assumption for a maximum like-
lihood analysis. This process stipulates that whether a participant has missing values
depends strictly on observed data, and the unseen scores themselves are unrelated to
FIGURE 3.7. Estimated regression line from the curvilinear regression analysis. The predic-
tor variable, math anxiety, is centered at its grand mean. The zero value on the horizontal axis
corresponds to the mean.
missingness. In practical terms, the definition implies that the focal analysis model
should include all important correlates of missingness, as omitting such a variable could
result in a bias-inducing MNAR-by-omission process if the semipartial correlations are
strong enough. Section 1.5 described an inclusive analysis strategy that fine-tunes a
missing data analysis by introducing extraneous auxiliary variables into the model
(Collins et al., 2001; Schafer & Graham, 2002). Adopting such a strategy can reduce
nonresponse bias, improve precision, or both. The analysis example in Section 1.6 illus-
trated a method for selecting auxiliary variables, and this section describes strategies for
incorporating them into a maximum likelihood analysis.
I describe four broad strategies for introducing auxiliary variables, three of which
leverage the flexibility of the structural equation modeling framework. Graham (2003)
outlined two model specification strategies—the saturated correlates and extra depen-
dent variable models—that use a particular configuration of residual correlations and
regression slopes to connect the auxiliary variables to the focal analysis model. Two-
stage estimation is an alternative approach (Savalei & Bentler, 2009; Yuan & Bentler,
2000) that tackles the missing data in two steps. The first stage estimates the mean
vector and variance–covariance matrix of a superset that includes analysis variables and
auxiliary variables, and the second stage uses a subset of these summary statistics as
input data for a complete-data analysis. Two-stage estimation is analogous to multiple
imputation in the sense that it uses a preliminary analysis to fill in the missing data,
after which it estimates the focal model. In this context, the filled-in data are the sum-
mary statistics needed to estimate the complete-data model. The factored regression
specification is a final option that is well suited for analyses with interactions or nonlin-
ear effects or mixtures of categorical and continuous variables.
FIGURE 3.8. Path diagram of a saturated correlates regression model with two auxiliary vari-
ables. The model uses curved arrows (covariances or residual covariances) to connect the auxil-
iary variables to each other and the residuals of all other variables.
FIGURE 3.9. Path diagram of a latent variable saturated correlates model with two auxiliary
variables. The model uses curved arrows (covariances or residual covariances) to connect the
auxiliary variables to the residuals of all other manifest variables.
including them). For path or structural equation models, it is also worth noting that the
saturated correlates model doesn’t affect fit, because it “spends” the degrees of freedom
from the additional variables.
The saturated correlates model is prone to convergence failures, especially with
more than a few auxiliary variables (Graham, Cumsille, & Shevock, 2013; Howard et
al., 2015). Among other reasons, estimation problems can occur, because the model
imposes an awkward pattern on the residual covariance matrix that induces implau-
sible variances and covariances (Savalei & Bentler, 2009). My own experience suggests
that estimation usually tolerates a relatively small number of auxiliary variables (e.g.,
three to five) and almost certainly fails to converge as the number of variables increases.
This isn’t a major practical limitation, because it is often difficult to identify more than
one or two auxiliary variables that explain a meaningful amount of unique variation in
the analysis variables. In situations where it is necessary or desirable to leverage many
extra variables, Howard et al. (2015) describe a strategy that uses principal components
analysis to reduce a large set of auxiliary variables into a smaller number of linear com-
posites. Computer simulation results suggest that a single principal component or linear
combination can effectively replace an entire set of auxiliary variables. If the auxiliary
variables themselves are incomplete, a single imputation method (e.g., stochastic regres-
sion imputation, a single data set from multiple imputation) can fill in the missing val-
ues prior to data reduction.
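The Howard et al. (2015) data-reduction strategy can be sketched with a singular value decomposition. The data below are simulated, so the block illustrates the idea of a single composite rather than reproducing their exact procedure:

```python
import numpy as np

# Sketch of the Howard et al. (2015) idea: replace several correlated
# auxiliary variables with their first principal component. The data are
# simulated for illustration.
rng = np.random.default_rng(1)
n = 500
common = rng.normal(size=n)                      # shared factor
aux = np.column_stack([common + rng.normal(scale=0.5, size=n)
                       for _ in range(4)])       # four noisy auxiliary variables

# Standardize, then take the first right singular vector as component weights
z = (aux - aux.mean(axis=0)) / aux.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
pc1 = z @ vt[0]                                  # first principal component scores

# The single composite should track the shared information closely
r = np.corrcoef(pc1, common)[0, 1]
print(abs(r) > 0.9)
```

In practice the composite would be computed from the (singly imputed, if necessary) auxiliary variables and then carried into the missing data model in place of the full set.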
Two‑Stage Estimation
Two-stage estimation is an alternative approach to introducing auxiliary variables into
a structural equation model (Cai & Lee, 2009; Savalei & Bentler, 2009; Savalei & Falk,
2014; Savalei & Rhemtulla, 2017; Yuan & Bentler, 2000; Yuan et al., 2014). As the name
implies, the procedure requires two steps. The first stage treats the missing data by
estimating the mean vector and variance–covariance matrix of a superset of variables
that includes auxiliary variables and the focal analysis variables. The top panel of Figure
3.12 shows the first-stage missing data model for a multiple regression analysis with two
auxiliary variables. The advantage of this preliminary step is that the saturated model
FIGURE 3.10. Path diagram of a regression model that incorporates two auxiliary variables
as extra dependent variables. The model uses a combination of directed and curved arrows to
connect the auxiliary variables to other variables.
FIGURE 3.11. Path diagram of a latent variable regression model that incorporates two auxil-
iary variables as extra dependent variables. The model uses a combination of directed and curved
arrows to connect the auxiliary variables to other variables.
should be easy to estimate, even with many extra variables. We already know how to
estimate μ and Σ with missing data, so there is nothing new about the initial stage. The
second step ignores the auxiliary variables and uses a subset of the estimates in μ̂ and Ŝ
as input data for the focal analysis shown in the bottom panel of Figure 3.12.
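To see what the second stage does with the summary statistics, the sketch below recovers regression coefficients for Y on two predictors directly from a mean vector and covariance matrix. The numerical values are invented for illustration:

```python
import numpy as np

# Stage-2 sketch: recover regression coefficients for Y on (X1, X2) from a
# first-stage mean vector and covariance matrix. The numerical values are
# invented; the variable order is (Y, X1, X2).
mu = np.array([10.0, 4.0, 2.0])
sigma = np.array([[9.0, 3.0, 1.5],
                  [3.0, 4.0, 1.0],
                  [1.5, 1.0, 2.0]])

S_xx = sigma[1:, 1:]            # covariance matrix of the predictors
s_xy = sigma[1:, 0]             # covariances between the predictors and Y

slopes = np.linalg.solve(S_xx, s_xy)          # b = S_xx^{-1} s_xy
intercept = mu[0] - slopes @ mu[1:]           # b0 = mu_y - b' mu_x
resid_var = sigma[0, 0] - s_xy @ slopes       # residual variance of Y

print(np.round(slopes, 3), round(intercept, 3), round(resid_var, 3))
```

The estimates depend only on μ̂ and Σ̂, which is why the second stage can ignore the auxiliary rows and columns; the missing data work was already done when those summaries were estimated.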
The second estimation stage incorrectly assumes that the summary statistics came
from a data set with N complete cases. This has no bearing on the estimates, but the
standard errors are too small, because they fail to account for the missing information.
Yuan and Bentler (2000) outlined a sandwich estimator covariance matrix that fixes
this problem, and Savalei and Bentler (2009) extended this solution to an MAR process.
For completeness, the rest of this section describes the correction procedure for the
standard errors. Readers who aren’t interested in these finer points can skip to the next
section without losing important information.
A simple regression model is sufficient for describing the standard error adjust-
ment. Equation 3.24 gives the structural equations for the second-stage regression anal-
ysis, and Equation 3.25 shows the model-predicted or model-implied moments, μ(θ) and
Σ(θ). Maximum likelihood estimation from summary data finds the regression model
parameters in θ that minimize the difference between the first-stage estimates in μ̂ and
Ŝ and the model-implied moments in μ(θ) and Σ(θ). The classic maximum likelihood
discrepancy function below is found in most structural equation modeling texts (e.g.,
Bollen, 1989; Kaplan, 2009).
f(θ | μ̂, Ŝ) = ln|Σ(θ)| + (μ̂ − μ(θ))′Σ⁻¹(θ)(μ̂ − μ(θ)) + tr(ŜΣ⁻¹(θ)) + {−ln|Ŝ| − V}    (3.45)
The expression features an offset term in curly braces that makes the function return a
FIGURE 3.12. Path diagrams for two-stage estimation. The first-stage diagram that includes
the auxiliary variables is a saturated model consisting of means, variances, and covariances. The
second-stage (focal) model is a multiple regression.
value of zero when the model-implied moments perfectly match the sample estimates
(as they do in this example), but the optimization process is fundamentally the same as
maximizing a log-likelihood function.
Although they aren’t the parameters of substantive interest, the variance–covariance
matrix of the model-implied mean vector and covariance matrix is a good place to start,
because the precision of these estimates is key to understanding how the sandwich esti-
mator works. Recall that second derivatives are stored in a symmetrical matrix known
as the Hessian. Multiplying this derivative matrix by –1 gives the information matrix
shown below (it is expected information here, because the second stage does not use the
raw data):
IE(μ(θ), Σ(θ)) = −HE(μ(θ), Σ(θ)) = −[ ∂²LL/∂μ²(θ)          0
                                           0          ∂²LL/∂Σ²(θ) ]    (3.46)
The diagonal blocks of the second derivative matrix are as follows:
∂²LL/∂μ²(θ) = −N Σ⁻¹(θ)    (3.47)

∂²LL/∂Σ²(θ) = −(N/2) DV′ (Σ⁻¹(θ) ⊗ Σ⁻¹(θ)) DV    (3.48)
The important thing about the derivative equations is that every participant contributes
the same amount of information regardless of missing data pattern; that is, instead of
summing N individual contributions to the derivative expressions, some of which con-
tain zeros (e.g., Equations 3.12 to 3.14), the equations invoke a constant contribution for
all N participants. As a result, some of the elements in the information matrix will be
too large, and taking the inverse gives sampling variances and standard errors that are
too small.
Revisiting ideas from Section 3.6, pre- and postmultiplying the information matrix
by a coefficient matrix Δ and then taking the inverse (the matrix analogue of a recipro-
cal) gives the variance–covariance matrix of the regression model parameters.
Σ̂θ̂ = [ Δ̂′ IE(μ(θ̂), Σ(θ̂)) Δ̂ ]⁻¹    (3.49)
Recall that the matrix Δ contains weights or coefficients that capture the amount by
which the model-implied moments change as a function of the regression model parameters
(see Table 3.5), and pre- and postmultiplying by Δ reapportions the data's information
about the model-implied mean vector and covariance matrix to the appropriate structural
model parameters. Again, the elements in Σ̂θ̂ are too small, because they incorrectly
assume the data are complete, so taking the square root of the diagonal elements gives
standard errors that are too precise.
The two-stage standard error correction uses a sandwich estimator formulation like
the one for non-normal data back in Section 2.8. In this application, the biased covari-
ance matrix from Equation 3.49 forms the outer pieces of “bread,” and the “meat” of the
sandwich inflates the outer terms to compensate for missing information. The two-stage
variance–covariance matrix of the estimates from Savalei and Bentler (2009) is
Σ̂θ̂(TS) = Σ̂θ̂ × { Δ̂′ IE(μ(θ̂), Σ(θ̂)) [IO(μ̂, Σ̂)]⁻¹ IE(μ(θ̂), Σ(θ̂)) Δ̂ } × Σ̂θ̂    (3.50)
where IO(μ̂, Σ̂) is the observed information matrix from the first stage. The equation
looks complicated, but we can deconstruct the basic ideas. Focusing on the “meat”
inside the curly braces, IE(μ(θ̂), Σ(θ̂)) and IO(μ̂, Σ̂) should be identical with complete
data, because they estimate the same information matrix, albeit in different ways. In this
situation, their product sets off a cascade of identity matrices that reduces the expres-
sion to Equation 3.49. With missing data, premultiplying the inverse of IO(μ̂, Σ̂)
by IE(μ(θ), Σ(θ)) returns a matrix with diagonal values that represent the proportional
increase in information going from incomplete to complete data (e.g., a value of 1.20
means that the complete-data information for a given parameter is about 20% larger
than that of the observed data). This matrix, which effectively captures how far off the
elements in Σ̂θ̂ are in proportional terms, is the crux of the adjustment, and the other
multiplications just distribute the correction terms to the structural model’s standard
errors.
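The arithmetic behind the correction can be sketched in the simplest possible case, a single mean, with the invented counts below. The information ratio inflates the naive complete-data sampling variance back to what the observed data actually support:

```python
# Scalar sketch of the two-stage correction logic (Equation 3.50) for a
# single mean. The counts and variance are invented for illustration.
N, n_obs, sigma2 = 200, 150, 4.0     # total cases, observed cases, score variance

info_expected = N / sigma2           # complete-data (expected) information
info_observed = n_obs / sigma2       # information actually in the observed data

naive_var = 1 / info_expected        # stage 2 wrongly assumes N complete cases
ratio = info_expected / info_observed  # proportional increase in information
corrected_var = naive_var * ratio    # the "meat" inflates the naive variance

# The corrected sampling variance matches what the observed data support
print(abs(corrected_var - 1 / info_observed) < 1e-12)  # True
print(round(ratio, 3))  # 1.333: complete data carry about 33% more information
```

With matrices in place of scalars, the same logic plays out through the pre- and postmultiplications in Equation 3.50.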
Factored Regression
The factored regression framework provides a final method for introducing auxil-
iary variables. As you know, this method factorizes a multivariate distribution into a
sequence of univariate distributions, each of which corresponds to a regression model.
To maintain the desired interpretation of the focal model parameters, it is important to
specify a sequence where the analysis variables predict the auxiliary variables and not
FIGURE 3.13. Path diagram of a factored regression model with two auxiliary variables.
vice versa. To illustrate, the factorization for a multiple regression model with a pair of
auxiliary variables is as follows:
f(A1 | A2, Y, X1, X2) × f(A2 | Y, X1, X2, X3) ×    (3.51)
f(Y | X1, X2, X3) × f(X1 | X2, X3) × f(X2 | X3) × f(X3)
The first two terms are the auxiliary variable regression models, the third term is the
focal model, and the final three terms are the regressor models. The path diagram in
Figure 3.13 shows that the factored regression is similar in spirit to the extra dependent
variable model but replaces curved arrows with straight arrows. This specification is
especially useful for leveraging categorical auxiliary variables, because it isn’t straight-
forward to marry logistic or probit models with the correlated residuals in Graham’s
structural equation models. The strategy is also ideally suited for analyses with inter-
actions or nonlinear effects with incomplete data, as conventional structural equation
models (e.g., the just-another-variable model) generally do a poor job preserving these
effects.
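The factorization principle can be verified numerically: for two normal variables, the joint density equals the product of a conditional (regression) density and a marginal density. The parameter values below are arbitrary:

```python
import math

# Numerical check that a joint density factors into univariate pieces:
# f(x, y) = f(y | x) * f(x) for a bivariate normal. Parameters are arbitrary.
mux, muy = 1.0, 2.0
sx2, sy2, sxy = 4.0, 9.0, 3.0     # variances and covariance

def npdf(v, m, var):
    """Univariate normal density."""
    return math.exp(-(v - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def joint(x, y):
    """Closed-form bivariate normal density."""
    rho = sxy / math.sqrt(sx2 * sy2)
    zx, zy = (x - mux) / math.sqrt(sx2), (y - muy) / math.sqrt(sy2)
    expo = -(zx ** 2 - 2 * rho * zx * zy + zy ** 2) / (2 * (1 - rho ** 2))
    return math.exp(expo) / (2 * math.pi * math.sqrt(sx2 * sy2 * (1 - rho ** 2)))

def factored(x, y):
    """f(y | x) * f(x): a linear regression of Y on X times the X model."""
    beta = sxy / sx2                       # regression slope of Y on X
    cond_mean = muy + beta * (x - mux)
    cond_var = sy2 - sxy ** 2 / sx2        # residual variance
    return npdf(y, cond_mean, cond_var) * npdf(x, mux, sx2)

print(abs(joint(0.5, 3.0) - factored(0.5, 3.0)) < 1e-12)  # True
```

The ordering matters only for interpretation; any complete factorization reproduces the same joint distribution, which is why the auxiliary variable terms can be placed first in the sequence.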
Analysis Example
This example uses the psychiatric trial data on the companion website to illustrate max-
imum likelihood missing data handling with auxiliary variables. The data, which were
collected as part of the National Institute of Mental Health Schizophrenia Collaborative
Study, comprise four illness severity ratings, measured in half-point increments ranging
from 1 (normal, not at all ill) to 7 (among the most extremely ill). In the original study, the
437 participants were assigned to one of four experimental conditions (a placebo condi-
tion and three drug regimens), but the data collapse these categories into a dichotomous
treatment indicator (DRUG = 0 for the placebo group, and DRUG = 1 for the combined
medication group). The researchers collected a baseline measure of illness severity prior
to randomizing participants to conditions, and they obtained follow-up measurements
1 week, 3 weeks, and 6 weeks later. The overall missing data rates for the repeated mea-
surements were 1, 3, 14, and 23%.
The focal regression model predicts illness severity ratings at the 6-week follow-up
assessment from baseline severity ratings, gender, and the treatment indicator.
εi ~ N1(0, σε²)
Centering the baseline scores and male dummy code at their grand means facilitates
interpretation, as this defines β0 and β1 as the placebo group average and group mean
difference, respectively, marginalizing over the covariates.
This small data set offers limited choices for auxiliary variables, but the illness
severity ratings at the 1-week and 3-week follow-up assessments are excellent candi-
dates, because they have strong semipartial correlations with the dependent variable
(r = .40 and .61, respectively) and uniquely predict its missingness. I used the saturated
correlates, extra dependent variable, and factored regression approaches to incorporate
these variables into the analysis. The saturated correlates model was identical to the
path diagram in Figure 3.8, and the extra dependent variable mimicked Figure 3.10.
The factored regression strategy features a sequence of univariate distributions,
each of which corresponds to a regression model. To maintain the desired interpretation
of the focal model parameters, it is important to specify a sequence where the analysis
variables predict the auxiliary variables and not vice versa. The factorization for this
analysis is as follows:
f(DRUG* | MALE) × f(MALE*)
The first two terms are auxiliary variable distributions that derive from linear regression
models, the third term corresponds to the focal linear regression, the fourth term is a
linear regression model for the incomplete baseline scores, and the final two terms are
regressions for the complete predictors (which I ignored, because these variables do not
require distributions). The full complement of regression models is shown below, and
analysis scripts are available on the companion website.
Table 3.10 gives maximum likelihood estimates and standard errors with and with-
out auxiliary variables. In the interest of space, I omit the auxiliary variable and covari-
ate model parameters, because they are not the substantive focus. As you might expect,
the three auxiliary variable models produced identical results (to the third decimal), so
Table 3.10 reports a single set of results. As explained previously, introducing auxiliary
variables does not affect the interpretation of the focal model parameters; the intercept
coefficient is placebo group mean at the 6-week follow-up (β̂0 = 4.41, SE = 0.16), and the
treatment assignment slope is the group mean difference for the medication condition
(β̂1 = –1.46, SE = 0.18), controlling for covariates.
Conditioning on the auxiliary variables had a substantial impact on the numerical
values of key parameter estimates. In particular, the intercept coefficients (placebo
group means) from the two analyses differed by nearly three-fourths of a standard
error unit, and the slope coefficients (medication group mean differences) differed by
nearly 1.2 standard errors. Although the natural inclination is to favor the analysis with
auxiliary variables, there is no way to know for sure which is more correct, because
conditioning on the wrong set of variables can exacerbate nonresponse bias, at least
hypothetically (Thoemmes & Rose, 2014). Nevertheless, the differences are consistent
with the shift from an MNAR-by-omission mechanism to a more MAR-like process. The
fact that the auxiliary variables have strong semipartial correlations with the dependent
variable (r = .40 and .61, respectively) and uniquely predict its missingness reinforces
this conclusion.
In my experience, the results in Table 3.10 are probably on the optimistic side of
what you might expect to see in practice. This example benefited from two variables that
explained a substantial proportion of unique variation above and beyond that already
captured by the focal model. As explained in Section 1.5, this net covariation is what
makes an auxiliary variable useful, and its bivariate associations with the analysis vari-
ables are less diagnostic in this regard. The impact of auxiliary variables also depends on
the amount of missing data. If the analysis variables have relatively small missing data
rates, you might expect to see little to no change in the estimates after adding auxiliary
variables, whereas there is more to gain with high missing data rates. Regardless, even
small gains are worthwhile given the ease with which you can include a few additional
variables.
The probit model’s residual variance is fixed at 1 for identification, and the model addi-
tionally incorporates a fixed threshold parameter that divides the latent response vari-
able distribution into two segments. The logistic regression can also be viewed as a
latent response model, but it is typical to write the equation without a residual. Note
that I use β’s to represent focal model parameters, but the estimated coefficients will not
be the same (with complete data, logit coefficients are approximately 1.7 times larger
than probit coefficients; Birnbaum, 1968). Finally, approximately 5.1% of the turnover
intention scores are missing, and the leader–member exchange and employee empower-
ment scales have 4.1 and 16.2% missing data rates (see Appendix).
As explained previously, factorizing the multivariate distribution such that incom-
plete predictors come before complete regressors simplifies estimation, because the lat-
ter terms can be ignored. Applying this strategy to the employee turnover analysis gives
the following factorization:
f(TURNOVER | LMX, EMPOWER, MALE) ×    (3.56)
f(LMX | EMPOWER, MALE) × f(EMPOWER | MALE) × f(MALE*)
The first term (the employee turnover distribution) corresponds to the probit or logistic
regression in Equation 3.55, and the supporting covariate models are linear regressions
with normally distributed residuals, as follows:
Following earlier examples, I drop the rightmost term in Equation 3.56, because the
gender dummy code is complete and does not require a distribution (i.e., this variable
functions as a known constant).
Analysis Example
Table 3.11 shows the maximum likelihood analysis results for both models. I omit the
supporting predictor model estimates, because they are not the substantive focus. Start-
ing with the probit regression, the Wald test of the full model was statistically significant,
TW(3) = 25.14, p < .001, meaning that the estimates are at odds with the null hypothesis
that all three population slopes equal zero. Each slope coefficient reflects the expected
z-score change in the latent response variable for a one-unit increase in the predictor,
controlling for other regressors. For example, the leader–member exchange coefficient
indicates that a one unit increase in relationship quality is expected to decrease the
latent proclivity to quit by 0.07 z-score units (β̂1 = –0.07, SE = .03), holding other predic-
tors constant.
Logistic regression
Parameter           Est.     RSE      z        p       OR
β0                  1.82     0.66     2.78     .01     —
β1 (LMX)           –0.12     0.04    –3.11    <.01     0.89
β2 (EMPOWER)       –0.05     0.03    –1.94     .05     0.95
β3 (MALE)          –0.09     0.19    –0.47     .64     0.92
R²                  .07      .03      2.66     .01     —
Note. RSE, robust standard error; OR, odds ratio; LMX, leader–member exchange.
Turning to the logistic regression results, the Wald test of the full model was again
statistically significant, and the test statistic’s numerical value was comparable to that
of the probit model, TW(3) = 24.13, p < .001. Each slope coefficient now reflects the
expected change in the log odds of quitting for a one-unit increase in the predictor,
holding all other covariates constant. For example, the leader–member exchange slope
indicates that a one-unit increase in relationship quality decreases the log odds of quit-
ting by 0.12 (β̂1 = –0.12, SE = .04), controlling for employee empowerment and gender.
Although the rule of thumb is not quite as precise with missing data, the logistic coef-
ficients are roughly 1.7 times larger than the probit slopes. Exponentiating each slope
gives an odds ratio that reflects the multiplicative change in the odds for a one-unit
increase in a predictor (e.g., a one-point increase on the leader–member exchange scale
multiplies the odds of quitting by 0.89). The analysis results highlight that probit and
logistic models are effectively equivalent and almost always lead to the same conclu-
sions. Some researchers favor the logistic framework, because it yields odds ratios, but
there is otherwise little reason to prefer one approach to the other.
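Both the 1.7 scaling rule and the odds ratio arithmetic are easy to verify numerically. The sketch below compares the logistic CDF evaluated at 1.7z to the standard normal CDF and exponentiates the leader–member exchange slope from Table 3.11:

```python
import math

# Check the rule of thumb: logistic(1.7 * z) approximates the normal CDF,
# which is why logit coefficients run about 1.7 times larger than probit ones.
def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic(z):
    """Logistic CDF (inverse logit)."""
    return 1 / (1 + math.exp(-z))

worst = max(abs(logistic(1.7 * z / 100) - norm_cdf(z / 100))
            for z in range(-400, 401))   # z from -4 to 4 in steps of 0.01
print(worst < 0.02)  # True: the two link functions never differ by much

# Exponentiating a logit slope gives the odds ratio reported in Table 3.11
print(round(math.exp(-0.12), 2))  # 0.89 for the leader-member exchange slope
```

The small maximum discrepancy between the two curves is why probit and logistic analyses of the same data almost always agree.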
This chapter extended maximum likelihood estimation to missing data problems. The
mechanics of estimation are largely the same as Chapter 2, where the goal was to iden-
tify estimates that maximize fit to the data. When confronted with missing values, the
estimator does not discard incomplete data records, nor does it impute them. Rather, it
identifies the parameter values with maximum support from whatever data are avail-
able, with some participants contributing more than others. Maximum likelihood anal-
yses have evolved considerably in recent years. The estimators that were widely available
when I was writing the first edition of this book were generally limited to multivariate
normal data. This is still a common (and effective) assumption for missing data analy-
ses, but flexible estimation routines that accommodate mixtures of categorical and con-
tinuous variables are now available. In particular, the factored regression strategy intro-
duced in this chapter is pivotal throughout the rest of the book, as Bayesian estimation
and contemporary multiple imputation routines leverage the same specification.
Speaking of which, Bayesian estimation is the next topic. Chapter 4 describes the
philosophical underpinnings of the Bayesian statistical paradigm and the Markov chain
Monte Carlo (MCMC) estimator for complete data. Chapter 5 extends MCMC estima-
tion to missing data problems, and Chapter 6 introduces models for mixtures of numeri-
cal and categorical variables. Like maximum likelihood estimation, the primary goal of
a Bayesian analysis is to fit a model to the data and use the resulting estimates to inform
one’s substantive research questions. However, unlike maximum likelihood, Bayes-
ian estimation repeatedly fills in or imputes the missing values en route to getting the
parameter values. As you will see, multiple imputation—the third pillar of the book—
uses Bayesian MCMC to create several filled-in data sets that are reanalyzed with fre-
quentist methods. Finally, I recommend the following for readers who want additional
details on topics from this chapter.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020). Analysis of interactions and nonlinear effects
with missing data: A factored regression modeling approach using maximum likelihood
estimation. Multivariate Behavioral Research, 55, 361–381.
Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and
nonnormal data. Psychological Methods, 15, 352–367.
Savalei, V., & Rosseel, Y. (2021, April 14). Computational options for standard errors and test
statistics with incomplete normal and nonnormal data. Structural Equation Modeling: A
Multidisciplinary Journal. [Epub ahead of print] https://doi.org/10.31234/osf.io/wmuqj
4
Bayesian Estimation
Bayesian analyses have gained a strong foothold in social and behavioral science disci-
plines in the last decade or so (Andrews & Baguley, 2013; van de Schoot, Winter, Ryan,
Zondervan-Zwijnenburg, & Depaoli, 2017), and user-friendly tools for conducting these
analyses are now widely available in software programs (e.g., Stan: Gelman, Lee, & Guo,
2015; Blimp: Keller & Enders, 2021; Mplus: Muthén & Muthén, 1998–2017; Jeffreys’s
Amazing Statistics Program [JASP]: Wagenmakers, Love, et al., 2018). Whereas the first
edition of this book viewed the Bayesian framework through a narrow lens—as an esti-
mation method co-opted for a specific type of multiple imputation—this edition takes
the much broader view that Bayesian analyses are an alternative to maximum likelihood
estimation.
Stepping back and considering the organization of the book, Bayesian analyses are
a bridge connecting maximum likelihood to multiple imputation. Like maximum like-
lihood estimation, the primary goal of a Bayesian analysis is to fit a model to the data
and use the resulting estimates to inform the substantive research questions. When
confronted with missing values, maximum likelihood uses the normal curve to deduce
the missing parts of the data as it iterates to a solution (technically, the estimator mar-
ginalizes over the missing values). Bayesian estimation has more of a multiple imputa-
tion flavor, because it fills in the missing values en route to getting the parameters. As with maximum likelihood, missing data handling happens behind the scenes, and imputation (implicit or explicit) is just a means to a more important end, which is to learn
something from the parameter estimates.
This chapter takes a hiatus from missing data issues to outline Bayesian analyses
with complete data. The goal of the chapter is to provide a user-friendly introduction
to the Bayesian paradigm that serves as a springboard for accessing any number of the
specialized textbooks (Gelman et al., 2014; Hoff, 2009; Kaplan, 2014; Levy & Mislevy,
2016; Lynch, 2007; Robert & Casella, 2004) and the many tutorial articles on the topic
(Abrams, Ashby, & Errington, 1994; Casella & George, 1992; Jackman, 2000; Kruschke
& Liddell, 2018; Lee & Wagenmakers, 2005; Sorensen & Vasishth, 2015; Stern, 1998;
Wagenmakers, Marsman, et al., 2018). Additionally, I try to provide a practical recipe
for implementing Bayesian analyses that generalizes to different analysis models and
software packages.
Following the structure of Chapter 2, I start with a univariate analysis and build
to a linear regression model. As you will see, the Bayesian analyses give results that are
numerically equivalent to those of maximum likelihood, but the interpretations of the
estimates and measures of uncertainty require a different philosophical lens. I spend a
good deal of time describing Markov chain Monte Carlo (MCMC) estimation, because diagnosing whether the iterative algorithm is working properly is a vital part of applying Bayesian estimation (especially with missing data). The final section of the chapter covers
a multivariate normal data analysis comprising a mean vector and covariance matrix.
Collectively, these examples provide the building blocks for understanding more com-
plex analyses, and this chapter sets up foundational concepts that appear throughout
the remainder of the book. As you will see, all the major ideas readily generalize to
missing data in Chapter 5.
A key distinction between the Bayesian framework and the classic frequentist paradigm
that predominates in many disciplines is how they define a parameter. The frequentist
approach defines a parameter as a fixed value (e.g., the true mean that would result from
collecting data from the entire population of interest). The goal of a frequentist analysis
is to estimate the parameter and establish a confidence interval around that estimate.
The standard error, which is integral to this process, quantifies the expected variability
of an estimate across many different random samples. Defining a parameter as a fixed
quantity leads to some important subtleties. For example, when describing a 95% con-
fidence interval, it is incorrect to say that there is a 95% probability that the parameter
falls between values of A and B, because the confidence interval from any single sample
either contains the parameter or it does not. Rather, the correct interpretation describes
the expected performance of the interval across many repeated samples; if we drew 100
samples from a population and constructed a 95% confidence interval around the esti-
mate from each sample, 95 of those intervals should include the true population param-
eter. In a similar vein, the probability value from a frequentist significance test describes
the proportion of repeated samples that would yield a test statistic greater than or equal
to that of the sample data. In both situations, concepts of variation and probability apply
to the sample data and estimates, not to the parameter itself.
In contrast, the Bayesian paradigm views a parameter as a random variable that has
a distribution, just like any other variable we might measure. For example, a Bayesian
analysis defines the mean as a normally distributed variable where some realizations
are more plausible than others given the data. This distribution is called a posterior,
because it is constructed after observing the data (and in conjunction with a prior distribution). One of the goals of a Bayesian analysis is to use familiar measures of central
tendency and dispersion to describe the shape of the posterior distribution. For exam-
ple, the posterior mean (or median) quantifies the parameter’s most likely value and
functions like a frequentist point estimate. The posterior standard deviation describes
the distribution’s spread and is analogous to a frequentist standard error in the sense
that it describes uncertainty about the parameter after observing the data. However, it
is important to note that the Bayesian notion of uncertainty does not involve the hypo-
thetical process of drawing many random samples and counting the frequency of differ-
ent estimates. Rather, this form of uncertainty reflects our subjective degree of belief or
knowledge about the parameter after collecting and analyzing data (Kopylov, 2008; Levy
& Mislevy, 2016; O’Hagan, 2008).
Viewing a parameter as a random variable also has important implications for infer-
ence. For example, a Bayesian credible interval (the analogue to a frequentist confi-
dence interval) allows us to say that there is a 95% probability that the parameter falls
between values of A and B. This interpretation is very different from that of the frequen-
tist approach, because it attaches the probability statement to the parameter, not to the
hypothetical estimates from different samples of data. By extension, Bayesian probabil-
ity values refer directly to the parameter of interest and not to a collection of hypotheti-
cal estimates from different samples. For example, the probability that a slope parameter
is positive is simply the area of the posterior distribution above 0. These interpretations
are made possible by invoking Bayes’ theorem and a prior distribution that represents
our a priori beliefs about the parameter before data collection. The next section provides
a conceptual, equation-free description of a simple Bayesian analysis on which I expand
in later sections.
A Bayesian analysis consists of three major steps: (1) Specify a prior distribution for
the parameter of interest, (2) use a likelihood function to summarize the data’s evi-
dence about different parameter values, and (3) combine information from the prior and
the likelihood to generate a posterior distribution that describes the relative probabil-
ity of different parameter values given the data. Finally, familiar statistics such as the
mean and standard deviation summarize the posterior’s center and spread, respectively.
Rewinding back to Chapter 2, I emphasized that the likelihood is not a probability dis-
tribution, because the area under the function does not equal 1. Conceptually, you can
think of these three steps as a recipe for converting a likelihood function to a proper
probability distribution.
The remainder of this section gives a conceptual description of the three steps in
the context of a simple univariate analysis in which the goal is to estimate the popula-
tion proportion. Because the goal is to introduce the underlying logic behind Bayesian
estimation, I am purposefully vague about many of the mathematical details. Neverthe-
less, the steps for this simple analysis provide a recipe for applying Bayesian estimation
to more complex analyses.
FIGURE 4.1. Two prior distributions for the population proportion. The solid curve repre-
sents an informative prior that specifies some values as more likely than others. The dashed line
is a noninformative prior where all parameter values are equally likely.
Some readers may be wondering about the origin of the prior distributions in Fig-
ure 4.1. Researchers often adopt a conjugate prior distribution that belongs to the same
family as the likelihood function, as doing so simplifies the conversion of the likeli-
hood function to a probability distribution. The binomial likelihood function for binary
outcome data (see Equation 2.1) is a member of the beta distribution family. The beta
distribution’s shape is proportional to
f(π) ∝ π^(a−1)(1 − π)^(b−1)    (4.1)
where f(π) is the height of the curve or the vertical coordinate at a particular value of π,
and a and b are constants that define the distribution's shape (e.g., larger values of a and
b produce a more peaked distribution with less spread, and the distribution becomes asymmetric
when a ≠ b). The informative prior in Figure 4.1 corresponds to a = 7 and b = 40, and the
flat prior aligns with a = 1 and b = 1. As you know from Chapter 2, a probability distri-
bution contains a scaling factor that makes the area under the curve sum or integrate
to 1. I drop unnecessary scaling terms whenever possible and use a “proportional to”
symbol to indicate that an expression omits one or more constants. This simplifies the
math without affecting the distribution’s shape.
FIGURE 4.2. Likelihood function displaying the relative probability of observing seven
"cases" (e.g., the number of new mothers with a postpartum depression diagnosis) from a sample
of N = 100, given the assumed population value on the horizontal axis.
The posterior distribution combines information from the prior and the likelihood. I describe the posterior in more detail later in the
chapter, but the conceptual idea is to weight each point on the likelihood function by the
magnitude of the corresponding point on the prior distribution. For example, attach-
ing a high prior probability to a particular parameter value would increase the height of
the likelihood function at that point on the horizontal axis. Conversely, assigning a low
prior probability to a particular parameter value would decrease the height of the likeli-
hood function at that point.
To illustrate, recall that Researcher B specified a noninformative prior where all
parameter values are equally likely (the flat dashed line in Figure 4.1). This prior speci-
fication assigns the same constant weight to every point on the likelihood function, so
the resulting posterior distribution is identical to the likelihood. Figure 4.3 shows this
posterior distribution as a dashed curve. Prior to collecting data, Researcher A assigned
a high probability to depression rates between π = .10 and .15, but the maximum likeli-
hood estimate from the data was somewhat lower at π̂ = .07. Researcher A’s posterior
distribution blends her prior beliefs with information from the data, giving the solid
curve in Figure 4.3. Comparing the two distributions, you can see that Researcher A’s
posterior is less elevated at π = .05, because she assigned a low prior weight to this
parameter value, and it is slightly elevated at π = .15, because she assigned a high prior
probability to this value. Both functions describe the probability of different parameter
values given the observed data.
FIGURE 4.3. Two posterior distributions of a population proportion π. The dashed curve cor-
responds to a flat prior that assigns the same constant weight to every point on the likelihood
function (the dashed curve in Figure 4.1). The solid curve assigns a high prior probability to
values between π = 0.10 and 0.15 (the solid curve in Figure 4.1).
Describing the center and spread of the posterior distribution is a primary goal
of a Bayesian analysis; this step is analogous to computing a point estimate and stan-
dard error. Researchers often use the mean or median to characterize the most likely
parameter value (the latter might be preferable, since the distribution is asymmetrical),
and they use the standard deviation to quantify spread or uncertainty. This example
is relatively straightforward, because these summary quantities are simple functions
of the beta distribution’s shape parameters, a and b. Without getting into specifics, the
mean and median of Researcher A’s posterior distribution are Mπ = .095 and Mdnπ = .093,
respectively, and the posterior standard deviation is SDπ = .024. In contrast, Researcher
B’s posterior mean and median are Mπ = .078 and Mdnπ = .076, respectively, and the
standard deviation is SDπ = .026. Researcher A’s posterior distribution is centered at a
somewhat higher value, because her prior distribution assigned high weights to param-
eter values between .10 and .15 (the strong prior information also reduced the spread).
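These summaries can be reproduced (approximately) with a short grid computation that literally follows the three-step recipe: weight the likelihood by the prior at each candidate proportion, normalize so the weights sum to 1, then describe the result. The pure-Python sketch below is my own illustration, not one of the book's companion scripts; the function name and grid size are arbitrary choices.

```python
def posterior_summaries(a, b, y, n, grid_size=20001):
    """Posterior mean, median, and SD for a proportion, combining a
    Beta(a, b) prior with a binomial likelihood (y cases out of n) on a
    discrete grid of candidate proportion values."""
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    # unnormalized posterior = prior kernel * likelihood kernel
    dens = [p ** (a - 1 + y) * (1 - p) ** (b - 1 + n - y) for p in grid]
    total = sum(dens)
    weights = [d / total for d in dens]  # normalize so the weights sum to 1
    mean = sum(p * w for p, w in zip(grid, weights))
    var = sum((p - mean) ** 2 * w for p, w in zip(grid, weights))
    cum, median = 0.0, grid[-1]
    for p, w in zip(grid, weights):  # first point where the CDF reaches .50
        cum += w
        if cum >= 0.5:
            median = p
            break
    return mean, median, var ** 0.5

# Researcher A: Beta(7, 40) prior combined with 7 cases out of N = 100
mean_a, med_a, sd_a = posterior_summaries(7, 40, 7, 100)
print(round(mean_a, 3), round(med_a, 3), round(sd_a, 3))  # → 0.095 0.093 0.024
```

Swapping in the flat Beta(1, 1) prior reproduces Researcher B's values (posterior mean .078, median .076, SD .026).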
As a comparison, a frequentist analysis with maximum likelihood estimation gives
a point estimate and standard error of π̂ = .070 and SE = .026. Although these quantities
are numerically identical to Researcher B’s posterior mode and standard deviation, they
have different interpretations. For example, π̂ is our best guess about the true population
proportion, and its standard error quantifies the expected variability of point estimates
across many different random samples. The Bayesian analysis, on the other hand, makes
no reference to repeated samples. Rather, a probability distribution is the device for con-
veying our current knowledge about the parameter; the posterior distribution’s center
and spread characterize the most likely realization of the parameter and the degree of
knowledge or uncertainty about the parameter after analyzing the data, respectively.
Bayes’ theorem describes the relationship between two conditional probabilities; that
is, the theorem provides a rule that says how to get from the probability of B given A to
the probability of A given B. Applied to statistics, Bayes’ theorem is the machinery that
converts a likelihood function to a probability distribution. For two generic events, A
and B, the theorem is
Pr(B | A) = [Pr(B) × Pr(A | B)] / Pr(A)    (4.2)
where Pr(B|A) is the conditional probability of observing event B given that event A has
already occurred, Pr(A|B) is the conditional probability of A given B, and Pr(A) and Pr(B)
are the marginal probabilities of A and B (i.e., the probability of A without reference to
B and vice versa).
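As a quick numeric illustration of the theorem, consider a hypothetical screening scenario (the events and probabilities below are invented for illustration and do not come from the text):

```python
# Hypothetical events for Bayes' theorem: B = "has the condition" with base
# rate Pr(B) = .10, and A = "positive screen" with Pr(A | B) = .90 and
# Pr(A | not B) = .20.
p_B = 0.10
p_A_given_B = 0.90
p_A_given_notB = 0.20

# Marginal probability of A via the law of total probability
p_A = p_B * p_A_given_B + (1 - p_B) * p_A_given_notB  # 0.27

# Bayes' theorem: Pr(B | A) = Pr(B) * Pr(A | B) / Pr(A)
p_B_given_A = p_B * p_A_given_B / p_A
print(round(p_B_given_A, 3))  # → 0.333
```

Even with a fairly accurate screen, the low marginal probability Pr(B) keeps the updated probability at one-third, the same prior-times-likelihood logic that drives the analyses in this chapter.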
Applying Bayes’ theorem to probability distributions gives
f(θ | data) = [f(θ) × f(data | θ)] / f(data)    (4.3)
where θ is the parameter of interest, and “f of something” references the height of a func-
tion or curve at some point on its horizontal axis. Ignoring the term in the denominator
for the moment, the expression reflects the three components of the previous analysis:
f(θ) is the prior distribution (e.g., Figure 4.1), f(data|θ) corresponds to the likelihood
(e.g., Figure 4.2), and f(θ|data) is the posterior distribution of the parameter given the
data (e.g., Figure 4.3). To clarify some notation, recall that the probability distribution
of the data and the likelihood function are the same function with different arguments
treated as varying and constant; that is, after collecting a sample of data, the observa-
tions in f(data|θ) become fixed and a likelihood function summarizes the data’s evi-
dence about different parameter values. The data distribution and likelihood function
are proportional and differ by some constant that makes f(data|θ) a proper probability
distribution.
As you know, probability distributions such as the normal curve or beta distribu-
tion include scaling terms that do not affect their shape but make the area under the
curve sum or integrate to 1. The denominator of Equation 4.3—the marginal probability
of the data across many different realizations of the parameter—serves this exact pur-
pose. Because the parameter of interest does not appear in the denominator, the scaling
term is usually dropped from the expression as follows:
f(θ | data) ∝ f(θ) × f(data | θ)    (4.4)
In words, Equation 4.4 says that the posterior distribution is proportional to the product
of the prior and the likelihood. The “proportional to” symbol conveys the idea that the
posterior distribution on the left has the same basic shape as the product on the right.
Yi = μ + εi = E(Yi) + εi    (4.5)
Yi ~ N1(E(Yi), σ²)
I used the same model in Chapter 2 to introduce maximum likelihood estimation, and
the Bayesian analysis in this section recycles some of that earlier material. To refresh
notation, N1 denotes a univariate normal distribution function, and the first and sec-
ond terms inside parentheses are the mean and variance parameters, respectively. The
expected value, E(Yi), corresponds to the grand mean in this example, but you can think
about it more generally as a predicted value.
Before getting into specifics, let’s apply the concept from Equation 4.4—the poste-
rior distribution is proportional to the product of the prior and the likelihood—to this
example. Replacing the generic functions with the quantities from the analysis example
gives the following expression:
f(μ, σ² | data) ∝ f(μ) × f(σ²) × f(data | μ, σ²)    (4.6)
Consistent with Equation 4.4, the leftmost term is the posterior distribution, f(μ) and
f(σ2) are prior distributions, and the rightmost term is the probability distribution of the
data (or equivalently, the likelihood once the data are collected). This analysis moves
closer to a realistic application, because there are multiple parameters; the posterior is
a bivariate distribution that describes the relative probability of different combinations
of μ and σ2 given the data.
f(Yi | μ, σ²) = 1/√(2πσ²) × exp(−(1/2) × (Yi − μ)²/σ²)    (4.7)
where Yi is the outcome score for participant i (e.g., a student’s math posttest score), and
μ and σ2 are the population mean and variance, respectively. To reiterate some impor-
tant notation, the function on the left side of the equation can be read as “the relative
probability of a score given assumed values for the parameters.” Visually, “f of Y” is the
height of the normal curve at a particular score value on the horizontal axis. The joint
probability of N observations (or the likelihood of the sample data) is the product of the
individual contributions.
f(data | μ, σ²) ∝ L(μ, σ² | data) = 1/(2πσ²)^(N/2) × ∏(i = 1 to N) exp(−(1/2) × (Yi − μ)²/σ²)    (4.8)
The scaling terms to the left of the product operator ensure that the area under the curve
sums to 1, and I simplify the expression by dropping these terms when possible.
Prior Distributions
Specifying prior distributions for the parameters is a key step in any Bayesian analysis.
We could specify an informative prior if we had a priori knowledge about the mean
(e.g., from a pilot study or meta-analysis), but I focus on noninformative (or weakly
informative) prior distributions that exert as little influence as possible on the results.
Specialized Bayesian texts and methodological research studies describe other possibili-
ties (Chung, Gelman, Rabe-Hesketh, Liu, & Dorie, 2015; Gelman, 2006; Gelman et al.,
2014; Kass & Wasserman, 1996; Liu, Zhang, & Grimm, 2016).
The data distribution and likelihood often inform the choice of prior distribution,
because it is convenient to work from the same distribution family (e.g., the binomial
likelihood and beta prior from the earlier analysis example). There are at least two ways
to implement a noninformative prior for the mean. We know from introductory sta-
tistics that the frequentist sampling distribution of the mean is a normal curve, and it
turns out that the posterior distribution of μ is also normal. To invoke a conjugate prior
distribution that imparts very little information, we could specify a normal prior with
an arbitrary mean and a very large variance (e.g., a normal curve with μ0 = 0 and σ02 =
10,000). The mean and variance of the prior are sometimes called hyperparameters.
Setting the spread to a very large number effectively produces a flat distribution
over the range of the data (e.g., the math posttest scores range from 32 to 85). Consis-
tent with the earlier example, we could also specify a uniform prior that is flat over the
entire range of the mean. I adopt this approach, because it yields the same result as the
conjugate prior but simplifies some of the ensuing math. A flat prior distribution assigns
every possible value of the mean the same a priori weight of 1.
f(μ) ∝ 1    (4.9)
Adopting the normal distribution for the data induces a positively skewed inverse
gamma distribution for the variance. The following expression illustrates the linkage:
f(X) ∝ X^(−(a+1)) exp(−b × 1/X) = (σ²)^(−(N/2 + 1)) exp(−(1/2) ∑(i = 1 to N) (Yi − μ)² × 1/σ²)    (4.10)
The left side of the equation is the generic expression for the inverse gamma distri-
bution, and f(X) means that the height of the function varies with different values of
X. Viewing the variance as the X in the normal distribution function and rearranging
terms gives the expression to the right side of the equals sign (I omit the scaling factor
2π). Visually, the inverse gamma is a positively skewed distribution bounded at 0. The
shape parameter a = N/2 controls the height of the distribution, with larger values of N
(the degrees of freedom) resulting in a more peaked distribution with thinner tails. The
scale parameter b = SS/2 controls the distribution’s spread, which increases as the sum
of squares (SS) increases.
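Because many random number libraries provide gamma but not inverse gamma generators, a standard trick is to invert a gamma draw: if X follows a gamma distribution with shape a and scale 1/b, then 1/X follows an inverse gamma distribution with shape a and scale b. The simulation sketch below (my own illustration, with hypothetical values of a and b) verifies this against the theoretical mean b/(a − 1):

```python
import random

random.seed(1)

def draw_inverse_gamma(shape, scale):
    """Draw from an inverse gamma distribution: if X ~ Gamma(a, 1/b),
    then 1/X ~ Inverse-Gamma(a, b)."""
    return 1.0 / random.gammavariate(shape, 1.0 / scale)

# Hypothetical values: N = 100 scores with a sum of squares of 200 give
# shape a = N/2 = 50 and scale b = SS/2 = 100
a, b = 50.0, 100.0
draws = [draw_inverse_gamma(a, b) for _ in range(100_000)]

# The theoretical mean of an inverse gamma is b / (a - 1), about 2.04 here
mc_mean = sum(draws) / len(draws)
```

The Monte Carlo average of the draws should sit very close to b/(a − 1), and every draw is strictly positive, consistent with a distribution bounded at 0.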
To invoke a conjugate prior distribution for the variance, you would specify an
inverse gamma distribution with hyperparameters df0 and SS0. Conceptually, the degrees
of freedom value df0 can be viewed as the number of imaginary data points used to get
the a priori sum of squares value SS0. Allowing these quantities to approach 0 gives the
so-called Jeffreys prior distribution for the variance (Jeffreys, 1946, 1961; Kass & Was-
serman, 1996; Lynch, 2007).
f(σ²) ∝ 1/σ²    (4.11)
Figure 4.4 depicts this prior distribution for parameter values between 0 and 20. You
can see that the prior is somewhat informative, because the relative probability values
rapidly increase as the variance approaches 0. The Jeffreys prior is equivalent to specify-
ing a uniform distribution for the natural logarithm of the variance, as the natural log
stretches the exponential curve out to negative infinity.
f(μ, σ² | data) ∝ f(μ) × f(σ²) × f(data | μ, σ²)    (4.12)
Note that the function on the left now reads “the relative probability of the parameters
FIGURE 4.4. Jeffreys prior distribution for the variance. The distribution is somewhat infor-
mative, because the relative probability increases rapidly as the variance approaches 0.
given the data.” Substituting the relevant expression for each component on the right
side gives the following posterior distribution:
f(μ, σ² | data) ∝ 1 × 1/σ² × 1/(2πσ²)^(N/2) × ∏(i = 1 to N) exp(−(1/2) × (Yi − μ)²/σ²)    (4.13)
Dropping the scaling constant 2π and combining terms involving the variance further
simplifies the expression.
f(μ, σ² | data) ∝ (σ²)^(−(N/2 + 1)) ∏(i = 1 to N) exp(−(1/2) × (Yi − μ)²/σ²)    (4.14)
Equation 4.14 is a bivariate distribution that describes the relative probability of
different combinations of μ and σ2 given the data. The bivariate distribution isn’t nec-
essarily useful for inference, because it intertwines two parameters. We usually want
univariate summaries that reflect the marginal distribution of one parameter without
regard for the other (e.g., the goal is to characterize the most likely value of the mean
without considering the variance, and vice versa). In general, deriving marginal distribu-
tions requires integral calculus, and the complexity of the calculations quickly becomes
intractable with more than a small handful of parameters. Instead, researchers use an
MCMC procedure called the Gibbs sampler (Casella & George, 1992; Jackman, 2000) to
approximate the marginal posterior distributions.
The Gibbs sampler breaks a complex multivariate problem into a series of simpler uni-
variate steps that iteratively estimate one parameter at a time, treating the current values
of the remaining parameters as known constants. Gelfand and Smith (1990) are often
credited with popularizing the Gibbs sampler as a flexible tool for Bayesian estimation,
and descriptions of the algorithm are widely available in specialized textbooks (Gelman
et al., 2014; Hoff, 2009; Kaplan, 2014; Levy & Mislevy, 2016; Lynch, 2007; Robert &
Casella, 2004) and tutorial articles (Casella & George, 1992; Jackman, 2000; Smith &
Roberts, 1993). I rely heavily on the Gibbs sampler throughout the rest of the book.
To illustrate the underlying logic of MCMC estimation with the Gibbs sampler, con-
sider a statistical analysis with three parameters, θ1, θ2, and θ3. We need to track changes
in the parameters within an iteration and between successive iterations to fully decon-
struct the algorithm, so I use t = 1, 2, . . . , T to index these repetitive computational
cycles. The Gibbs sampler recipe for the three-parameter analysis example is as follows:
1. Estimate θ1^(t) by drawing a value from f(θ1 | θ2^(t−1), θ3^(t−1), data).
2. Estimate θ2^(t) by drawing a value from f(θ2 | θ1^(t), θ3^(t−1), data).
3. Estimate θ3^(t) by drawing a value from f(θ3 | θ1^(t), θ2^(t), data).
With additional parameters, each iteration continues in a round-robin fashion until all
quantities have been updated.
Three points are worth highlighting. First, each step estimates one parameter at a
time, holding all other parameters constant at their current values. The iteration super-
scripts show that some of the current values originate from the same iteration, while oth-
ers carry over from the previous cycle (e.g., the estimation step for θ2 depends on θ1 from
the current iteration and θ3 from the prior iteration). Second, the order of the estimation
steps usually doesn’t matter, and you should get the same results if you estimate θ3 first,
θ1 second, and θ2 third, for example. Third, the algorithm requires initial estimates (i.e.,
starting values) of all model parameters prior to the first iteration. In many situations,
these initial guesses need not be sophisticated (e.g., initial values of the mean and variance
could be set to 0 and 1, respectively). Remarkably, the Gibbs sampler usually converges
around a more reasonable set of estimates in just a few computational cycles, but accurate
starting values can reduce the number of iterations required to achieve this steady state.
Before going further, I need to clarify the meaning of estimation in this context.
Recall from Chapter 2 that maximum likelihood uses iterative algorithms that successively adjust parameters until the estimates no longer change from one iteration to the
next. In contrast, the Gibbs sampler uses Monte Carlo computer simulation (the second
“MC” in MCMC) to “sample” or “draw” plausible parameter values at random from a
probability distribution. For example, the MCMC algorithm in the next section esti-
mates the mean by generating a random number from a normal distribution, and it
estimates the variance by drawing random numbers from a right-skewed inverse gamma
distribution. I sometimes refer to these MCMC-generated estimates as synthetic param-
eter values to differentiate them from estimates produced by an analytic solution such as
least squares. Of course, computers are quite adept at generating random numbers, and
all statistical software programs have built-in functions for this purpose. Several acces-
sible tutorial articles are available for readers who want more information about Monte
Carlo simulation (e.g., Morris, White, & Crowther, 2019; Muthén & Muthén, 2002;
Paxton, Curran, Bollen, Kirby, & Chen, 2001).
A typical MCMC chain consists of thousands of iterations, each computational
cycle producing a unique set of parameter values. Unlike maximum likelihood, which
identifies a single set of optimal estimates for the data, MCMC creates parameter values
that continually change as the algorithm iterates (e.g., running MCMC for T = 10,000
iterations gives a posterior distribution of 10,000 plausible parameter values). Whereas
iterative optimization routines for maximum likelihood estimation are akin to a hiker
trying to climb to the highest possible elevation on a mountain, MCMC estimation is
more like an explorer describing the geography of the entire mountain. The procedure
naturally produces the posterior distributions needed for inference, because each set of
parameter estimates marginalizes over all other parameters (e.g., the posterior distribu-
tion of θ1 is taken over many different realizations of θ2 and θ3, so we can talk about θ1
without regard for the others). As explained previously, familiar descriptive statistics
such as the median and standard deviation describe the center and spread of each pos-
terior distribution, respectively, and the estimates at the 2.5 and 97.5% quantiles of the
distribution define a 95% credible interval.
Having sketched the basic ideas, I show how to use MCMC to estimate the posterior
distributions of the mean and variance. The Gibbs sampler alternates between two steps:
Estimate the mean given the current value of the variance, then update the variance
given the latest value of the mean (the order of the steps typically doesn’t matter). The
recipe below summarizes the algorithmic steps:
1. Estimate μ^(t) by drawing a value from f(μ | σ²^(t−1), data).
2. Estimate σ²^(t) by drawing a value from f(σ² | μ^(t), data).
Each estimation step draws a synthetic parameter value at random from a probabil-
ity distribution that treats the other parameter as a known constant. Mechanically, you
get these full conditional distributions by multiplying the prior and the likelihood, then
doing some tedious algebra to express the product as a function of a single unknown. I
give these distributions below and point readers to specialized Bayesian texts for addi-
tional details on their derivations (e.g., Hoff, 2009; Lynch, 2007).
First, MCMC estimates the mean by drawing a random number from the univariate
normal conditional distribution below:
f(μ | σ̇², data) ∝ N1(Ȳ, σ̇²/N)    (4.15)
The arithmetic mean of the data in the first term defines the distribution’s center, and
the familiar expression for the squared standard error of the mean (i.e., the sampling
variance) defines the spread in the second term. Following van Buuren (2012), the dot
accent on σ̇² indicates a synthetic variance estimate from an earlier MCMC step (this
value functions like a known constant in the normal distribution equation). Figure 4.5
shows the normal distribution that generates the mean estimates.
Next, the algorithm samples an estimate of the variance from a positively skewed
inverse gamma distribution that conditions on (treats as known) the newly minted syn-
thetic mean. The full conditional distribution for this step is as follows:
FIGURE 4.5. Conditional posterior distribution of the mean from the univariate analysis
example. The MCMC algorithm estimates the mean by drawing a number at random from this
normal distribution.
FIGURE 4.6. Conditional posterior distribution of the variance from the univariate analysis
example. The MCMC algorithm estimates the variance by drawing a number at random from this
inverse gamma distribution.
f(σ² | μ, data) ∝ IG(N/2, (1/2) ∑(i = 1 to N) (Yi − μ)²)    (4.16)
The first and second terms in the function are shape and scale parameters, respectively
(sometimes denoted a and b). The shape parameter determines the height of the dis-
tribution, with larger values of N (the degrees of freedom) resulting in a more peaked
distribution with thinner tails. The scale parameter controls the distribution’s spread,
which increases as the sum of squares around μ increases. Visually, the inverse gamma
looks like a chi-square. To illustrate, Figure 4.6 shows the inverse gamma distribution
that generates the variance estimates.
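Putting the two full conditional distributions together gives a complete algorithm. The pure-Python sketch below is an illustrative implementation of this two-step sampler under a flat prior for the mean and a Jeffreys prior for the variance; the simulated scores, seed, and function name are my own stand-ins (the book's actual data and R scripts live on the companion website):

```python
import random

random.seed(2023)

def gibbs_mean_variance(y, n_iter=11000, burn_in=1000):
    """Minimal Gibbs sampler for a normal mean and variance, alternating
    draws from the two full conditional distributions."""
    n = len(y)
    ybar = sum(y) / n
    sigma2 = 1.0  # crude starting value for the variance
    mu_draws, sigma2_draws = [], []
    for t in range(n_iter):
        # Step 1: draw the mean from a normal distribution centered at the
        # sample mean with variance sigma2 / n
        mu = random.gauss(ybar, (sigma2 / n) ** 0.5)
        # Step 2: draw the variance from an inverse gamma distribution with
        # shape N/2 and scale SS/2, via the 1 / Gamma(a, 1/b) relationship
        ss = sum((yi - mu) ** 2 for yi in y)
        sigma2 = 1.0 / random.gammavariate(n / 2.0, 2.0 / ss)
        if t >= burn_in:  # discard the burn-in interval
            mu_draws.append(mu)
            sigma2_draws.append(sigma2)
    return mu_draws, sigma2_draws

# Simulated stand-in scores (not the book's math posttest data)
data = [random.gauss(57, 5) for _ in range(100)]
mu_draws, sigma2_draws = gibbs_mean_variance(data)

# Summarize the posterior: center, spread, and a 95% credible interval
post_mean = sum(mu_draws) / len(mu_draws)
post_sd = (sum((m - post_mean) ** 2 for m in mu_draws) / len(mu_draws)) ** 0.5
mu_sorted = sorted(mu_draws)
ci = (mu_sorted[int(0.025 * len(mu_sorted))], mu_sorted[int(0.975 * len(mu_sorted))])
```

With settings that mirror the chapter's analysis (11,000 iterations, the first 1,000 discarded as burn-in), the retained 10,000 draws form the posterior sample whose mean, standard deviation, and 2.5/97.5% quantiles supply the point estimate, the uncertainty measure, and the 95% credible interval, respectively.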
Analysis Example
I use the math achievement data on the companion website to illustrate a Bayesian anal-
ysis involving the mean and variance. The empty regression model is
MATHPOSTi = μ + εi    (4.17)
Analysis scripts are available on the companion website, including a custom R program
for readers who are interested in coding the algorithm by hand. I specified an MCMC
chain consisting of T = 11,000 iterations, and I discarded the results from the first 1,000
iterations. Discarding estimates from this so-called burn-in interval gives the algorithm
time to recover from its starting values and converge to a trustworthy steady state. I
discuss diagnostic tools for evaluating convergence and determining the length of this
initial interval later in the chapter.
Each MCMC iteration “estimates” the mean and variance by generating random
numbers from a normal curve and inverse gamma distribution (see Figures 4.5 and 4.6).
The normal distribution in Figure 4.5 is centered at the mean of the data, and its spread
is a function of the sample size and the variance estimate from the previous iteration.
Drawing a random number from this distribution gives a new estimate of the mean. The
inverse gamma distribution in Figure 4.6 is right skewed, such that its center is a func-
tion of the sample size or degrees of freedom, and its spread is determined by the sum of
squares of the data around the current mean estimate. Drawing a random number from
this distribution gives a new estimate of the variance. Conceptually, the process of gen-
erating random numbers is akin to wearing a blindfold and throwing a dart at a picture
of each distribution. For any throw that lands under the curve, the location of the dart
on the horizontal axis is the new parameter value. Naturally, you would be more likely
to hit the peaked areas of the curve and less likely to hit the areas in the tails, but over
the course of many throws, the darts would land throughout the entire distribution.
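The dart analogy is not just a metaphor; it is a real sampling scheme known as rejection sampling. The Python sketch below is my own illustration, with the mean, standard deviation, and axis limits loosely matching Figure 4.5: it keeps the horizontal position of every dart that lands under the curve.

```python
import numpy as np

def dart_sampler(pdf, lo, hi, ceiling, n_darts, seed=0):
    """Blindfolded dart throwing at a picture of a density curve: keep the
    horizontal position of every dart that lands under the curve. The kept
    positions follow the target distribution (formally, rejection sampling)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n_darts)      # horizontal landing spot
    y = rng.uniform(0, ceiling, n_darts)  # vertical landing spot
    return x[y < pdf(x)]                  # darts that landed under the curve

# Normal curve roughly matching Figure 4.5 (mean 56.8, SD 0.6; illustrative values)
normal_pdf = lambda x: np.exp(-0.5 * ((x - 56.8) / 0.6) ** 2) / (0.6 * np.sqrt(2 * np.pi))
```

Collecting many accepted darts recovers the normal curve's mean and spread, just as the analogy suggests.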
One feature that sets Bayesian estimation apart from maximum likelihood is that
it yields an entire distribution of estimates for each parameter (i.e., it maps the entire
geography of the hill rather than climbing directly to its peak). Furthermore, because
estimates are drawn at random from a distribution of plausible values, they usually don’t
change in a systematic direction from one iteration to the next. To illustrate, Table 4.1
shows the parameters from the first 10 MCMC iterations. Notice that the estimates oscil-
late up and down, and the changes between successive iterations are seemingly random,
with no pattern to their direction or magnitude. This behavior contrasts with maximum
likelihood estimation, which makes large adjustments early in the iterative process and
very small changes later as estimates approach their optimum values (see Table 2.4).
Turning to the posterior summaries, Figure 4.7 gives a kernel density plot (a fine-
grained histogram with a smoothed line connecting the tops of the bars) of the 10,000
mean estimates, and Figure 4.8 is the corresponding plot of the variances. The mean’s
distribution is approximately normal, whereas the variance’s distribution has a slight
positive skew. The slightly irregular shapes are not a function of my subpar graphing
skills, but instead reflect the random nature of the estimation process. Had I increased
the number of iterations (e.g., from 10,000 to 100,000), the distributions would be
smoother with fewer bumps. The kernel density plots are subtly different from the
conditional distributions in Figures 4.5 and 4.6. Whereas the conditional distributions
reflect variations in one parameter with the other held constant, the kernel density plots
display the marginal posterior distribution of each parameter taken over many different
realizations of the other. Readers familiar with calculus may recognize that the Gibbs
sampler is performing integration (marginalization) via brute force.
As described earlier, simple descriptive statistics characterize the center and spread
of the distributions. The solid vertical lines are the median estimates, and the dashed
lines denote the 95% credible intervals (the parameter values above and below which
2.5% of the distribution falls). The median intercept value is Mdnμ = 56.79, and its stan-
dard deviation is SDμ = 0.59. The 95% credible interval for this parameter spans from
approximately 55.64 to 57.98. The posterior median and standard deviation of the vari-
ance are Mdn σ2 = 88.89 and SDσ2 = 7.96, and its 95% credible interval ranges from 74.63
FIGURE 4.7. Kernel density plot of 10,000 mean estimates. The solid vertical line is the poste-
rior median, and the dashed vertical lines denote the 95% credible interval limits.
FIGURE 4.8. Kernel density plot of 10,000 variance estimates. The solid vertical line is the
posterior median, and the dashed vertical lines denote the 95% credible interval limits.
to 105.89. You can see that the credible interval in Figure 4.8 is not symmetrical, because
the posterior distribution is positively skewed.
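Once the draws are stored, these summaries are one-liners. A Python sketch (the function name and dictionary layout are mine):

```python
import numpy as np

def posterior_summary(draws):
    """Descriptive summaries of a vector of MCMC draws: the posterior median,
    standard deviation, and 95% credible interval limits (the .025 and .975
    quantiles of the distribution of estimates)."""
    draws = np.asarray(draws)
    return {
        "median": np.median(draws),
        "sd": np.std(draws, ddof=1),
        "lcl": np.quantile(draws, 0.025),   # lower credible limit
        "ucl": np.quantile(draws, 0.975),   # upper credible limit
    }
```

Because the credible limits are plain quantiles of the draws, they automatically reflect any skew in the posterior, which is why the variance's interval is asymmetric.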
It is instructive to compare the Bayesian summaries to the corresponding maximum
likelihood estimates from Chapter 2. The maximum likelihood estimate of the mean
was μ̂ = 56.79, its standard error was SE = 0.59, and the 95% confidence interval ranged
from 55.63 to 57.95. The point estimate and standard error are effectively identical to the
posterior mean and standard deviation, and the confidence interval boundaries closely
match the Bayesian credible interval. Despite their numerical similarities, the interpreta-
tions of these quantities differ in important ways. For example, the frequentist standard
error represents the expected variation of the estimate across many different random
samples, whereas the posterior standard deviation reflects subjective uncertainty about
the parameter after analyzing the data. Similarly, the 95% confidence interval conveys the
expected performance of many such intervals computed from different random samples,
whereas the 95% credible interval gives a range of high certainty about the parameter.
Turning to the variance, the maximum likelihood estimate and its standard error were
σ̂2 = 87.72 and SE = 7.85, and the 95% confidence interval ranged from 72.34 to 103.09.
Apart from its interpretation, the width of the 95% confidence interval also differs from
the Bayesian credible interval, because the maximum likelihood interval assumes that
sampling variation follows a normal curve (this is only true at very large sample sizes).
This assumption yields confidence intervals that are symmetrical around the point esti-
mate, whereas the Bayesian interval captures the long right tail of the distribution.
Having established the basics of Bayesian inference, we can readily extend the pro-
cedure to linear regression. As you will see, the previous concepts generalize to this
analysis with virtually no modifications, because estimation still relies on the univariate
normal curve. A single-predictor model is a useful starting point, because the poste-
rior distribution of the coefficients can be visualized in a three-dimensional graph. The
simple regression model is
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = E(Y_i \mid X_i) + \varepsilon_i \quad (4.18)
Y_i \sim N_1\left( E(Y_i \mid X_i),\ \sigma_\varepsilon^2 \right)
where E(Yi|Xi) is the predicted value for individual i (i.e., the expected value or mean of
Y given a particular X score), the tilde means “distributed as,” N1 denotes the univariate
normal distribution function (i.e., the probability distribution in Equation 4.7), and the
conditional mean and residual variance are the distribution’s two parameters. In words,
the bottom row of the expression states that outcome scores are normally distributed
around a regression line with constant residual variation.
Switching gears to a different substantive context, I use the smoking data from the
companion website to illustrate a multiple regression analysis. The data set includes
several sociodemographic correlates of smoking intensity from a survey of N = 2,000
young adults (e.g., age, whether a parent smoked, gender, income). To facilitate graphing,
I start with a simple regression model where the parental smoking indicator (0 = parents
did not smoke, 1 = parent smoked) predicts smoking intensity (higher scores reflect more
cigarettes smoked per day):

\mathrm{INTENSITY}_i = \beta_0 + \beta_1(\mathrm{PARSMOKE}_i) + \varepsilon_i \quad (4.19)
The intercept represents the expected smoking intensity score for a respondent whose
parents did not smoke, and the slope is the group mean difference. The analysis example
later in this section expands the model to include additional explanatory variables. For
now, there is no need to specify a distribution for explanatory variables, so any vari-
ables on the right side of the equation function like constants, as they do in ordinary
least squares and maximum likelihood estimation. This feature has no bearing on a
complete-data regression analysis (Jackman, 2009), but we will need to specify a dis-
tribution for incomplete predictor variables in the next chapter (the same was true for
maximum likelihood).
Before getting into specifics, let’s again apply the idea from Equation 4.4—the pos-
terior distribution is proportional to the product of the prior and the likelihood—to the
regression model. Replacing the generic functions with the quantities from the analysis
example gives the following expression:
f(\beta, \sigma_\varepsilon^2 \mid \mathrm{data}) \propto f(\beta, \sigma_\varepsilon^2) \times f(\mathrm{data} \mid \beta, \sigma_\varepsilon^2) \quad (4.20)
where the leftmost term is the posterior distribution, β represents the vector of regres-
sion coefficients, f(β, σε2) denotes the prior distributions, and the rightmost term is the
distribution of the outcome variable (or equivalently, the likelihood once the data are
collected). The equation gives the relative probability of different combinations of the
coefficients and residual variance given the data (in this case, data refers to the dependent
variable, as the predictor functions like a known constant). Visually, “f of the parameters
given the data” is the height of a multivariate surface at different combinations of values
in β and σε2. Consistent with the previous section, we can use the Gibbs sampler to esti-
mate the marginal posterior distribution of each parameter.
f(\mathrm{data} \mid \beta, \sigma_\varepsilon^2) = L(\beta, \sigma_\varepsilon^2 \mid \mathrm{data}) \propto \frac{1}{(2\pi\sigma_\varepsilon^2)^{N/2}} \prod_{i=1}^{N} \exp\left( -\frac{\left( Y_i - E(Y_i \mid X_i) \right)^2}{2\sigma_\varepsilon^2} \right) \quad (4.21)
To reiterate the notation, the function on the left side of the equation reads “the relative
probability of the data given assumed values for the parameters.” Visually, each individ-
ual’s contribution to “f of Y” is the height of the conditional normal curve that describes
the spread of scores around a particular point on the regression line (e.g., the normal
distribution of smoking intensity scores for participants who share the same value of
the parental smoking indicator). As before, some of the terms to the left of the product
operator comprise a scaling factor that I ignore whenever possible.
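Equation 4.21 maps directly onto code. In the Python sketch below (names are my own), each contribution is evaluated on the log scale, which turns the product into a sum and avoids numerical underflow:

```python
import numpy as np

def reg_loglik(y, X, beta, sigma2):
    """Log of the regression likelihood: each person's contribution is the
    (log) height of the normal curve centered at that person's predicted
    value, and the contributions sum because the likelihood is a product."""
    mu = X @ beta                    # E(Y_i | X_i), points on the regression line
    contrib = -0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)
    return contrib.sum()
```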
Prior Distributions
As explained previously, the data distribution and likelihood often inform the choice
of prior distribution, because it is convenient to work from the same distribution fam-
ily. There are at least two ways to implement a noninformative prior for the regression
coefficients. To invoke conjugate prior distributions that impart very little information,
we could again specify normal priors with μ0 = 0 and σ02 = 10,000, as this would yield
distributions that are effectively flat over the range of the data. Alternatively, we could
adopt a uniform prior that is flat over the entire range of each coefficient in β. Following
the univariate analysis, I use a uniform prior for the coefficients and a Jeffreys prior for
the residual variance (Jeffreys, 1946, 1961). These distributions are f(β) ∝ 1 and f(σε2) ∝
(σε2)–1.
Substituting these separate priors into the earlier proportionality gives

f(\beta, \sigma_\varepsilon^2 \mid \mathrm{data}) \propto f(\beta) \times f(\sigma_\varepsilon^2) \times f(\mathrm{data} \mid \beta, \sigma_\varepsilon^2) \quad (4.22)
To reiterate notation, the function on the left now reads “the relative probability of the
parameters given the data.” Bayes’ theorem has converted the likelihood to a probability
distribution.
As explained previously, the Gibbs sampler breaks a complex multivariate problem
into a series of simpler univariate problems, each of which draws a synthetic param-
eter value at random from a probability distribution that treats all other parameters as
known constants. MCMC estimation for linear regression follows a two-step recipe: estimate the coefficients in β as a block given the current value of the residual variance, then update the variance given the latest coefficients (again, the order of the steps typically doesn't matter). The recipe below summarizes the algorithmic steps.

1. Sample the coefficients in β from their multivariate normal full conditional distribution, treating the residual variance as known.
2. Sample the residual variance from its inverse gamma full conditional distribution, treating the coefficients in β as known.
Each estimation step draws synthetic parameter values at random from a probability dis-
tribution. Mechanically, you get these full conditional distributions by multiplying the
prior and the likelihood, then doing some tedious algebra to express the product as a func-
tion of a single unknown. I give these distributions below and point readers to specialized
Bayesian texts for additional details on their derivations (e.g., Hoff, 2009; Lynch, 2007).
First, the MCMC algorithm estimates regression coefficients by drawing a vector
of random numbers from a multivariate normal conditional distribution. With only two
coefficients, we can visualize the conditional posterior distribution of β0 and β1 in three
dimensions. Figure 4.9 shows the bivariate normal distribution of the intercept and slope.
The angle of the distribution owes to the fact that the coefficients are negatively corre-
lated (i.e., a larger mean difference requires a lower comparison group average). More
formally, the shape of the conditional distribution is given by the following equations:
f(\beta \mid \sigma_\varepsilon^2, \mathrm{data}) \propto N_{K+1}\left( \hat\beta,\ \Sigma_{\hat\beta} \right) \quad (4.23)
\hat\beta = (X'X)^{-1} X'Y = \hat\beta_{\mathrm{OLS}}
\Sigma_{\hat\beta} = \sigma_\varepsilon^2 (X'X)^{-1}
FIGURE 4.9. Conditional distribution of the intercept and slope from a simple regression
analysis. The MCMC algorithm estimates the coefficients by drawing a pair of numbers at ran-
dom from this bivariate normal distribution.
Second, the algorithm draws a residual variance from an inverse gamma full conditional distribution:

f(\sigma_\varepsilon^2 \mid \beta, \mathrm{data}) \propto \mathrm{IG}\left( \frac{N}{2},\ \frac{1}{2} (Y - X\beta)'(Y - X\beta) \right) \quad (4.24)

where (Y − Xβ)′(Y − Xβ) is the matrix expression for the residual sum of squares. This distribution is identical to Equation 4.16 except that the scale parameter in the function's second argument features the residual sum of squares rather than the sum of squares around the mean. Visually, the distribution resembles the curve in Figure 4.6.
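Putting the two full conditionals together yields the complete sampler. This Python sketch is my own minimal version, not the companion R program: it alternates between drawing the coefficient vector from the multivariate normal in Equation 4.23 and drawing the residual variance from an inverse gamma conditional whose scale is half the residual sum of squares.

```python
import numpy as np

def gibbs_regression(X, y, n_iter=11000, burn_in=1000, seed=1):
    """Two-step Gibbs sampler for linear regression with a flat prior on the
    coefficients and a Jeffreys prior on the residual variance."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y      # OLS estimates center the conditional
    sigma2 = 1.0                      # starting value for the residual variance
    betas, sigmas = [], []
    for _ in range(n_iter):
        # Step 1: draw the coefficients as a block from the multivariate
        # normal conditional with mean beta_hat and covariance sigma2*(X'X)^-1
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # Step 2: draw the residual variance from an inverse gamma conditional
        # with shape N/2 and scale equal to half the residual sum of squares
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / (resid @ resid))
        betas.append(beta)
        sigmas.append(sigma2)
    return np.array(betas[burn_in:]), np.array(sigmas[burn_in:])
```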
Analysis Example
To illustrate Bayesian estimation for a multiple regression, I expanded the previous
model to include age and income as predictors. I centered the additional variables at
their grand means to maintain the intercept’s interpretation as the expected smoking
intensity score for a respondent whose parents did not smoke. The expanded model is

\mathrm{INTENSITY}_i = \beta_0 + \beta_1(\mathrm{PARSMOKE}_i) + \beta_2(\mathrm{AGE}_i - \overline{\mathrm{AGE}}) + \beta_3(\mathrm{INCOME}_i - \overline{\mathrm{INCOME}}) + \varepsilon_i \quad (4.25)
You may recall from the corresponding maximum likelihood analysis that the smok-
ing intensity distribution has substantial positive skewness and kurtosis. Frequentist
corrections like robust standard errors do not apply to the Bayesian framework, but the
posterior distributions are not constrained to be a particular shape and will naturally
vary in reaction to the data. Even with this flexibility, the Bayesian estimates won’t
be materially different from those of normal-theory maximum likelihood. Chapter 10
describes a promising procedure for modeling non-normal distributions (Lüdtke et al.,
2020b; Yeo & Johnson, 2000).
Consistent with the previous example, I specified an MCMC chain with T = 11,000
iterations and discarded the results from the first 1,000 iterations. Analysis scripts are
available on the companion website, including a custom R program for readers inter-
ested in coding the algorithm by hand. Discarding the burn-in or warm-up cycles will
be standard practice moving forward, as doing so allows the algorithm to recover from
its starting values and converge to a trustworthy steady state. I discuss convergence
and present diagnostic tools for determining the length of this initial period in the next
section. In the interest of space, I omit kernel density plots, because the distributions
look like Figures 4.7 and 4.8. Instead, Table 4.2 gives a tabular summary of the analysis
that includes the posterior median, standard deviation, and 95% credible interval limits
(the .025 and .975 quantiles of each distribution). From a substantive perspective, the
interpretation of the coefficients is the same as in a least squares or maximum likelihood analysis.
For example, the intercept (Mdnβ0 = 9.09) is the expected number of cigarettes smoked
per day for a respondent whose parents didn’t smoke, and the parental smoking indi-
cator slope (Mdnβ1 = 2.91) is the mean difference, controlling for age and income. The
standard deviations of the coefficients are analogous to frequentist standard errors in
the sense that they reflect our uncertainty or degree of knowledge about the parameters
after analyzing the data, but they do so without reference to other hypothetical samples
from the population.
Table 4.2 also gives maximum likelihood estimates as a comparison. The point
estimates and normal-theory standard errors were effectively numerically equivalent
to the posterior median and standard deviation, respectively, and the 95% confidence
interval boundaries closely match the Bayesian credible intervals. It is important to
Note (Table 4.2). LCL, lower credible or confidence limit (Bayesian and maximum likelihood, respectively); UCL, upper credible or confidence limit (Bayesian and maximum likelihood, respectively).
reiterate that the interpretations of these quantities differ in important ways. For
example, the 95% confidence intervals convey the expected performance of many such
intervals computed from different random samples, whereas the 95% credible inter-
vals give a range of high certainty about the parameters. Frequentist hypothesis tests
would declare the three slope coefficients statistically significant, because a null value
of zero falls outside their 95% confidence intervals. The 95% credible intervals similarly
suggest that zero is an unlikely value for the slope parameters. The Bayesian analysis
also allows you to assign probability values to parameters. For example, the lowest β1
coefficient in the distribution of 10,000 estimates was greater than zero, from which
we can conclude that the probability that the parameter is positive (individuals whose
parents smoked have higher smoking intensity scores) is effectively 100%. This state-
ment makes no sense in the frequentist framework, because the parameter is fixed in
the population.
Recall that the goal of maximum likelihood is to find the optimal parameter values for
the data. An iterative optimization routine like Newton’s algorithm or EM is akin to a
hiker that climbs to the highest possible elevation on a hill as quickly as possible. The
hike ends when parameter estimates no longer change from one iteration to the next.
In contrast, Bayesian estimation generates an entire distribution of plausible parameter
values for the data. The hiker in the analogy is now tasked with mapping the geography
of a hill. Because the algorithm draws parameters at random from a distribution, esti-
mates will continually change for as long as the algorithm is running—the MCMC hiker
will dutifully map the hill until you tell it to stop.
In the context of MCMC estimation, convergence means that the iterative algorithm
is generating estimates that form a stable distribution; that is, running the algorithm for
additional iterations does not change the mean and variance of the estimates. Further-
more, we say that the algorithm is “mixing” well if it’s producing values throughout the
entire range of the distribution. Software programs that implement maximum likeli-
hood have automated rules for determining when the estimator has converged (e.g., stop
when the largest change in any parameter from one iteration to the next is less than
.00001), but we need to be more proactive and involved when using Bayesian estima-
tion. As a rule, it is good practice to perform a preliminary diagnostic run to examine
convergence and mixing.
As you might imagine, Bayesian methodologists have proposed many techniques for
assessing the convergence of iterative algorithms like the Gibbs sampler (Brooks & Gel-
man, 1998; Cowles & Carlin, 1996; Gelman & Rubin, 1992; Geweke, 1992; Geyer, 1992;
Johnson, 1996; Mykland, Tierney, & Yu, 1995; Raftery & Lewis, 1992; Ritter & Tanner,
1992; Zellner & Min, 1995), and authors of specialized texts discuss these methods in
detail (Gelman et al., 2014; Hoff, 2009; Kaplan, 2014; Lynch, 2007; Robert & Casella,
2004). I focus primarily on line graphs known as trace plots and a numerical diagnostic
called the potential scale reduction factor (Brooks & Gelman, 1998; Gelman et al., 2014;
Gelman & Rubin, 1992), as both are readily available in Bayesian analysis software. I use
the regression analysis from Section 4.8 to illustrate these diagnostics.
Trace Plots
A trace plot is a line graph that displays the iterations on the horizontal axis and the
corresponding synthetic parameter values on the vertical axis. To illustrate, consider the
intercept parameter from the smoking intensity regression model. The previous analysis
produced 11,000 estimates of this parameter, and I discarded the results from the first 1,000
iterations. Figure 4.10 shows a trace plot of the parameter values from the first 500 itera-
tions following the burn-in period (i.e., iterations 1,001 to 1,500). The solid horizontal
line denotes the median parameter value across all 10,000 iterations, and the dashed
lines represent the 95% credible interval boundaries. The jagged pattern reflects the
fact that the algorithm is sampling parameter values at random from a distribution. To
emphasize this point, Figure 4.11 superimposes the posterior distribution on the trace
plot. You can see from this graph that the trace plot displays the location of the estimates
in the distribution in serial order.
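The visual check can be supplemented with a simple numerical one. The Python sketch below is a crude stand-in for eyeballing a trace plot, loosely inspired by the diagnostic of Geweke (1992) that the chapter cites; the naive standard errors ignore autocorrelation, so treat it as an illustration rather than a formal test.

```python
import numpy as np

def early_late_z(chain, first=0.1, last=0.5):
    """Compare the mean of the early iterations against the mean of the late
    iterations, scaled by naive standard errors. Values far from zero signal
    the kind of long-term drift a trace plot would reveal."""
    chain = np.asarray(chain, dtype=float)
    t = len(chain)
    a = chain[: int(first * t)]             # early segment of the chain
    b = chain[int((1.0 - last) * t):]       # late segment of the chain
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se
```

A stationary chain like the one in Figure 4.10 gives a value near zero, whereas a drifting chain like the one in Figure 4.14 gives a value far from zero.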
The trace plot illustrates two important features, both of which suggest the algo-
rithm is working well. First, the algorithm appears to have reached a steady state,
because the intercept estimates oscillate around a flat line located at the center of the
distribution. Furthermore, the magnitude and variation of the random shocks are con-
sistent with the 95% credible intervals. The absence of long-term vertical drifts (up or
down) is exactly what you want to see, as this provides evidence that the estimates have
FIGURE 4.10. Trace plot of intercept estimates from the first 500 MCMC iterations following
the burn-in period.
FIGURE 4.11. Posterior distribution of 10,000 estimates superimposed over a trace plot of
intercept estimates from the first 500 MCMC iterations following the burn-in period.
converged around a stable mean (i.e., the algorithm is mapping values around the hill’s
peak). Second, the algorithm appears to be mixing well, because the 500 estimates are
dispersed throughout the distribution’s entire range; that is, the algorithm is mapping
the geography of the entire hill without focusing too much on one region of the surface.
Model parameters usually differ with respect to how quickly they achieve a steady
state and how long it takes MCMC to map their full range of values. For this reason, it is
important to examine trace plots for every model parameter. As a second example, Figure
4.12 shows a trace plot of residual variance estimates from the same 500-iteration interval,
and Figure 4.13 superimposes the full posterior distribution over the trace plot. The plot
looks a bit different, because the posterior is positively skewed, but you can see that the
parameter values have converged to a stable distribution and the algorithm is mixing well.
The previous figures are good prototypes for an ideal trace plot, but it is just as
instructive to see what a problematic plot looks like. Figure 4.14 plots 20,000 parameter
estimates from a different data set. Notice that the parameter, which happens to be a
regression slope, never converges to a stable distribution. Rather, the plot features long
periods of vertical drift (e.g., the increasing trend between iterations 10,000 and 15,000),
and it is nearly impossible to identify the center of the distribution. In this case, the lack
of convergence is caused by missing data, as the observed scores do not contain enough
information to estimate the model. Interestingly, trace plots for other parameters in the
same analysis look just fine, which underscores the importance of examining trace plots
for the full set of model parameters.
FIGURE 4.12. Trace plot of residual variance estimates from the first 500 MCMC iterations
following the burn-in period.
FIGURE 4.13. Posterior distribution of 10,000 estimates superimposed over a vertical trace
plot of residual variance estimates from the first 500 MCMC iterations following the burn-in
period.
[Figure 4.14: trace plot of 20,000 parameter estimates (vertical axis: Parameter Value) from an analysis that never converges to a stable distribution.]
FIGURE 4.15. Trace plot of the residual variance from the first 20 iterations of two MCMC
chains with different starting values.
FIGURE 4.16. Trace plot of the residual variance from the first 200 iterations of two MCMC
chains with different starting values.
FIGURE 4.17. Slope estimates from two MCMC chains composed of 10 iterations each. The
dashed lines denote the mean estimates from each chain, and the solid horizontal line is the
grand mean. The PSRF from the two chains is 1.12.
means should be very similar, particularly when gauged against the magnitude of the
within-chain variation across iterations. In contrast, when two chains have not con-
verged or are not mixing well, the mean difference will be large relative to the within-
chain variation. The PSRF is intuitively appealing, because it defines each chain as a
group of estimates and uses familiar mean square expressions from one-factor analysis
of variance (ANOVA) designs to quantify between-chain mean differences and within-
chain noise variation.
To illustrate the PSRF, consider the first 20 estimates of the parental smoking
regression slope (the β1 coefficient in Equation 4.25). Figure 4.17 shows a trace plot of
the estimates, with dashed horizontal lines showing each chain’s group average. Gelman
et al. (2014, pp. 284–285) recommend splitting MCMC chains in half and using the sec-
ond halves to compute the PSRF. Thus, the diagnostic considers the distributions from
two chains with 10 iterations each (i.e., iterations 11–20). The vertical separation of the
dashed lines is the between-chain mean difference, and the magnitude of the random
spikes around the mean lines reflect within-chain noise variation. The PSRF uses the
between-group mean square from ANOVA to quantify the between-chain mean differ-
ence
\mathrm{Between\ Chain\ Variance} = \frac{T}{C - 1} \sum_{c=1}^{C} \left( \bar\theta_c - \bar\theta \right)^2 \quad (4.26)
where C is the total number of chains, T is the number of iterations per chain, \bar\theta_c is the mean estimate from chain c, and \bar\theta is the grand mean (i.e., the mean of the chain means). The within-group mean square from ANOVA quantifies the pooled within-chain variance.
\mathrm{Within\ Chain\ Variance} = \frac{1}{C} \sum_{c=1}^{C} \sigma^2_{\theta_c} \quad (4.27)
The variance of chain c’s estimates is computed by applying the sample variance formula
to the T estimates within each chain. The total variance of the estimates is a weighted
sum of the between- and within-chain variance.
\mathrm{Total\ Variance} = \frac{T - 1}{T} \times \mathrm{Within\ Chain\ Variance} + \frac{1}{T} \times \mathrm{Between\ Chain\ Variance} \quad (4.28)
Finally, we have the components to define the PSRF:
\mathrm{PSRF} = \sqrt{ \frac{\mathrm{Total\ Variance}}{\mathrm{Within\ Chain\ Variance}} } \quad (4.29)
The idea behind the PSRF is that when the two chains have converged to a stable
distribution and are mixing well, the between-chain mean difference will be very small
relative to within-chain variation, in which case the total variance in the numerator will
be similar to the denominator. In a hypothetically perfect scenario where two chains
have identical means, the between-chain variation vanishes, and the fraction under the
radical is approximately equal to 1. Conversely, the ratio of total variance to within-
chain variance grows increasingly larger as the mean difference increases. From this,
we see that lower PSRF values are better, and the best possible value equals 1. Rules of
thumb from popular Bayesian texts suggest that PSRF values less than 1.05–1.10 are
usually sufficient for practical applications (Gelman et al., 2014).
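Equations 4.26 through 4.29 amount to a few lines of arithmetic. A Python sketch (the array layout and function name are my own):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor computed from the ANOVA-style variance
    components in Equations 4.26-4.29. `chains` is a C x T array holding the
    T retained draws (e.g., split-chain second halves) from each of C chains."""
    chains = np.asarray(chains, dtype=float)
    c, t = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    # Equation 4.26: between-chain variance (between-group mean square)
    between = t / (c - 1.0) * np.sum((chain_means - grand_mean) ** 2)
    # Equation 4.27: pooled within-chain variance (within-group mean square)
    within = np.mean(np.var(chains, axis=1, ddof=1))
    # Equation 4.28: weighted total variance
    total = (t - 1.0) / t * within + between / t
    # Equation 4.29: PSRF is the square root of the variance ratio
    return np.sqrt(total / within)
```

Feeding the sketch two chains drawn from the same distribution returns a value near 1, whereas chains stuck in different regions of the parameter space return a much larger value.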
Returning to the regression analysis, the two chains in Figure 4.17 give a PSRF of
1.10, which is above the recommended threshold. The high PSRF value indicates that
the algorithm has not iterated long enough for the chains to converge and mix well.
Increasing the number of iterations should address this issue and reduce the PSRF. The
between-chain mean difference essentially vanishes when comparing the second halves
of two chains of 200 iterations (i.e., iterations 101–200), and the PSRF drops to near its
theoretical minimum. Consistent with the trace plots, we should examine PSRF values
for all model parameters, as the diagnostic can vary dramatically from one parameter
to the next. Using the largest (worst) PSRF value to specify a burn-in period is often a
safe strategy.
Software packages often compute the PSRF values at regular intervals during the
burn-in period. To illustrate, Table 4.3 shows the PSRF value for each parameter at itera-
tions 20, 30, 40, 50, and 100 (following recommendations, I again compared the second
halves of each interval). Table 4.3 highlights three important points. First, parameters
converge at different rates; the parental smoking and age slopes produce acceptable PSRF
values almost immediately, whereas other parameters require more iterations. Second,
TABLE 4.3. Split-Chain PSRF Comparisons after 20, 30, 40, 50, and 100 Iterations

                      Comparison interval for PSRF computation
Parameter           11 to 20   16 to 30   21 to 40   26 to 50   51 to 100
β0                    1.04       1.03       1.04       1.00       1.08
β1 (PARSMOKE)         1.10       1.02       1.05       1.01       1.02
β2 (AGE)              1.18       1.00       1.00       1.04       1.00
β3 (INCOME)           1.01       1.09       1.07       1.00       1.00
σε2                   1.08       1.04       1.10       1.18       1.01
PSRF values can increase or decrease from one interval to the next (e.g., the PSRF for β1
drops from 1.10 to 1.02, then it increases to 1.05). These oscillations tend to diminish
as the number of iterations increases, because the chain means are less susceptible to
large random shocks and outlier estimates. Increasing the number of chains also stabilizes the PSRF estimates. Third, at the 100th iteration, the highest (worst) PSRF had not
dropped below the conservative threshold of 1.05, but all indices were acceptably low
by the 200th MCMC cycle. Considered as a whole, the numerical diagnostics reinforce
the conclusion from the trace plots, which is that the algorithm converges and mixes thoroughly, well before the end of the 1,000-iteration burn-in period. I could reduce the
length of the burn-in interval if I wanted, but there is no compelling reason to do so.
Recall that N3 denotes a three-dimensional normal distribution, and the first and second
terms in parentheses are the mean vector and variance–covariance matrix (the multi-
variate distribution’s parameters).
Recycling information from Chapter 2, the joint probability of N observations (or the
likelihood of the sample data) is the product of the individual contributions.
f(\mathrm{data} \mid \mu, \Sigma) = (2\pi)^{-(N \times V)/2}\ |\Sigma|^{-N/2} \prod_{i=1}^{N} \exp\left( -\frac{1}{2} (Y_i - \mu)' \Sigma^{-1} (Y_i - \mu) \right) \quad (4.32)
The column vector Yi contains the V observations for participant i, μ is the correspond-
ing vector of population means, and Σ is a variance–covariance matrix of the V vari-
ables. As before, the function on the left side of the expression can be read as “the
relative probability of the data given assumed values for the parameters.” With a bit of
algebra, the expression simplifies to
f(data | μ, Σ) ∝ |Σ|^(−N×.5) exp( −(1/2) tr(SΣ⁻¹) )   (4.33)
where S is the sum of squares and cross-products matrix of the data computed at the
population means in μ (Hoff, 2009, pp. 110–111).
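The move from Equation 4.32 to Equation 4.33 rests on a standard trace identity. Writing out the step, with S defined as the sum of squares and cross-products matrix computed at μ:

```latex
\sum_{i=1}^{N} (\mathbf{Y}_i - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{Y}_i - \boldsymbol{\mu})
  = \sum_{i=1}^{N} \operatorname{tr}\!\left[ \boldsymbol{\Sigma}^{-1} (\mathbf{Y}_i - \boldsymbol{\mu})(\mathbf{Y}_i - \boldsymbol{\mu})' \right]
  = \operatorname{tr}\!\left( \mathbf{S}\boldsymbol{\Sigma}^{-1} \right),
  \qquad \mathbf{S} = \sum_{i=1}^{N} (\mathbf{Y}_i - \boldsymbol{\mu})(\mathbf{Y}_i - \boldsymbol{\mu})'
```

Each quadratic form is a scalar and therefore equals its own trace; cycling terms inside the trace and summing gives the single tr(SΣ⁻¹) term, so the product of exponentials in Equation 4.32 collapses to the one exponential in Equation 4.33.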
Prior Distributions
The simple strategy of specifying a flat distribution over the entire range of the popula-
tion mean readily extends to the mean vector from a multivariate analysis. This prior
is f(μ) ∝ 1. In the univariate case, a positively skewed inverse gamma distribution is a
conjugate prior for the variance. The inverse Wishart distribution is a multivariate gen-
eralization of the inverse gamma to a covariance matrix Σ with V variables. Visually, the
inverse Wishart looks like a three-dimensional skewed distribution like the one in Fig-
ure 4.18 (which shows the distribution for a simple bivariate example that is graphable
FIGURE 4.18. Inverse Wishart distribution for a bivariate example with 25 degrees of freedom and variance–covariance matrix elements σ̂X² = 1, σ̂Y² = 1, and σ̂XY = .30.
in three dimensions). Like other distributions we’ve encountered, the vertical height
gives the relative probability of the parameter values listed along the horizontal and
depth axes. Muthén and Asparouhov (2012) provide a useful appendix that describes
the inverse Wishart distribution, and I summarize some of its main features below.
Ignoring scaling terms, the inverse Wishart prior distribution is
f(Σ) ∝ |Σ|^(−(df0 + V + 1)/2) exp( −(1/2) tr(S0Σ⁻¹) )   (4.34)
where S0 and df0 (the hyperparameters) are prior estimates of the sum of squares and
cross-products matrix and degrees of freedom, respectively. Roughly speaking, the
hyperparameters encode a prior guess about the population covariance matrix, and
the degrees of freedom parameter is essentially the number of imaginary data points
assigned to that matrix. The function on the left side is a height coordinate that summarizes the relative probability of the parameter values.
The full conditional distributions for each estimation step are found by multiplying
the prior and the likelihood, then doing some tedious algebra to express that product as
a function of a single unknown. Specialized Bayesian texts provide additional details on
their derivations (e.g., Hoff, 2009, pp. 109–112). In this scenario, MCMC first estimates
the means by drawing a vector of random numbers from a multivariate normal distribu-
tion. The full conditional distribution is
f(μ | Σ, data) ∝ NV( Ȳ, Σ/N )   (4.37)
where NV denotes a normal distribution with V dimensions or variables, Ȳ is the vector of arithmetic means computed from the sample data, and Σ is a synthetic estimate of the variance–covariance matrix from a prior MCMC step. Dividing Σ by N gives the usual
expression for the covariance matrix of the means, the frequentist version of which has
squared standard errors on the diagonal.
Next, MCMC updates the variance–covariance matrix by drawing a matrix of ran-
dom numbers from an inverse Wishart distribution. The full conditional distribution is
f(Σ | μ, data) ∝ |Σ|^(−(df0 + N + V + 1)/2) exp( −(1/2) tr((S0 + S)Σ⁻¹) )   (4.38)
where S is a sum of squares and cross-products matrix that reflects variation and covari-
ation around the synthetic means from the preceding step. The shorthand notation for
the distribution function is as follows:
f(Σ | μ, data) ∝ IW( df0 + N, (S0 + S)⁻¹ )   (4.39)
The degrees of freedom in the first term—the sum of the sample size and the number
of imaginary observations assigned to the prior—determines the distribution’s center,
and the sum of squares and cross-products matrix in the second term—also the sum of
prior information and information from the data—determines the distribution’s spread.
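The two full conditional distributions are easiest to see in the univariate special case (V = 1), where the multivariate normal draw collapses to a single normal draw and the inverse Wishart collapses to an inverse gamma. The sketch below is illustrative only: it uses made-up data and noninformative (Jeffreys-style) priors, and it is not the book's companion program.

```python
import random

random.seed(2025)

# Toy data standing in for a single scale score
data = [random.gauss(5.0, 1.0) for _ in range(200)]
n = len(data)
ybar = sum(data) / n

mu_draws, var_draws = [], []
sigma2 = 1.0  # arbitrary starting value
for _ in range(2000):
    # Step 1: mu | sigma2, data ~ N(ybar, sigma2 / n)  (Eq. 4.37 with V = 1)
    mu = random.gauss(ybar, (sigma2 / n) ** 0.5)
    # Step 2: sigma2 | mu, data ~ inverse gamma(n/2, SS/2), where SS is the
    # sum of squares around the current mu (Eq. 4.38 with V = 1). An inverse
    # gamma draw is one over a gamma draw; gammavariate takes (shape, scale).
    ss = sum((y - mu) ** 2 for y in data)
    sigma2 = 1.0 / random.gammavariate(n / 2.0, 2.0 / ss)
    mu_draws.append(mu)
    var_draws.append(sigma2)
```

After the loop, the retained draws summarize the joint posterior: the mean of `mu_draws` sits essentially on top of the sample mean, mirroring the multivariate case.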
Analysis Example
Returning to the empty regression models in Equation 4.30, I use the work satisfaction, employee empowerment, and leader–member exchange scales to illustrate Bayesian estimation for a mean vector and variance–covariance matrix. Estimation scripts are avail-
able on the companion website, including a custom R program for readers interested in
coding the MCMC algorithm by hand. To explore the influence of different prior dis-
tributions, I implemented the inverse Wishart specifications described earlier and a separation
strategy that specifies distinct priors for variances and correlations (Merkle & Rosseel,
2018). Following earlier examples, I used the PSRF to determine the burn-in period, and
I based the final analyses on 10,000 MCMC iterations.
Table 4.4 gives Bayesian summaries of the means, standard deviations, variances
and covariances, and correlations (Table 2.6 shows the corresponding maximum likeli-
hood estimates). In the interest of space, Table 4.4 shows the results from an improper
inverse Wishart prior with S0 = 0 and df0 = –V – 1 (Asparouhov & Muthén, 2010a) and a
separation strategy (Merkle & Rosseel, 2018). For the inverse Wishart prior, I computed
the standard deviations and correlations as auxiliary functions of the estimated vari-
ances and covariances (e.g., a correlation is a covariance divided by the square root of the
product of two variances), whereas the correlations and variances were the estimated
parameters for the separate prior strategy. Table 4.4 shows that the choice of prior had no
influence on the variances and covariances (or correlations), as the posterior medians
and standard deviations were effectively identical. This won’t always be the case, but it
is here, because the sample size is very large relative to the prior degrees of freedom.
Bayesian analyses have gained a strong foothold in social and behavioral science disci-
plines in the last decade or so (Andrews & Baguley, 2013; van de Schoot et al., 2017), and
this approach is now a viable alternative to likelihood-based estimation. Like maximum
likelihood, the primary goal of a Bayesian analysis is to fit a model to the data and use
the resulting estimates to inform the substantive research questions. The examples in
this chapter highlight that Bayesian analyses often give results that are numerically
equivalent to those of maximum likelihood, although the interpretations of the esti-
mates and measures of uncertainty require a philosophical lens that views parameters
as random variables instead of fixed quantities. This framework is very different from
the frequentist approach, because it makes no reference to hypothetical estimates from
different samples of data.
Parameter                              Mdn     SD      Mdn     SD
(First Mdn/SD pair: improper inverse Wishart prior; second pair: separation prior.)
Standard deviations
Work Satisfaction 1.27 0.04 1.26 0.04
Empowerment 4.54 0.13 4.52 0.13
LMX 3.04 0.09 3.02 0.09
Correlations
Work Satisfaction ↔ Empowerment .29 .04 .28 .03
Work Satisfaction ↔ LMX .42 .03 .41 .03
Empowerment ↔ LMX .39 .03 .39 .03
Having established the major details behind estimation and inference, Chapter 5
applies Bayesian estimation to missing data problems. As you will see, everything from
this chapter carries over to missing data applications, where missing values are just one
more unknown quantity for MCMC to estimate. After updating the parameters using the
estimation steps from this chapter, each iteration concludes with the algorithm using
the updated parameter estimates to construct a model that imputes the missing values.
At that point, the data are complete, and the next iteration proceeds as if there were no
missing values. Finally, I recommend the following articles for readers who want addi-
tional details on topics from this chapter:
Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychol-
ogy. British Journal of Mathematical and Statistical Psychology, 66, 1–7.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. American Statistician, 46,
167–174.
Jackman, S. (2000). Estimation and inference via Bayesian simulation: An introduction to Mar-
kov chain Monte Carlo. American Journal of Political Science, 44, 375–404.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scien-
tists. Berlin: Springer.
van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017).
A systematic review of Bayesian articles in psychology: The last 25 years. Psychological
Methods, 22, 217–239.
Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., . . . Epskamp, S.
(2018). Bayesian inference for psychology: Part I. Theoretical advantages and practical
ramifications. Psychonomic Bulletin and Review, 25, 35–57.
5
Bayesian Estimation with Missing Data
This chapter shows how Bayesian analyses address missing data. Virtually everything
from Chapter 4 extends to the missing data context with little to no modification. Recall
that MCMC breaks a complex multivariate estimation problem into a series of sim-
pler steps that address one parameter (or block of similar parameters) at a time, while
treating other parameters as known constants. This part of MCMC’s “mathemagical”
machinery carries over with no changes. After updating the parameters, the algorithm
uses the newly minted estimates to construct model-predicted distributions of the miss-
ing values, from which it samples imputations. At this point, the data are complete, and
the next iteration proceeds as if there were no missing values. These two major steps—
update the parameters based on the filled-in data, then impute the data based on the
current parameter values—repeat for many iterations, just as shown in Chapter 4.
Because MCMC estimation doesn’t change, missing data imputation is the focus
of this chapter. I start by describing imputation for an outcome variable, after which I
extend the procedure to incomplete explanatory variables. There are at least two ways to
construct missing data distributions for predictors, both of which readily accommodate
interactions and curvilinear terms. The emergence of missing data-handling methods
for interactive and nonlinear effects is an important recent innovation since the first
edition of this book (Bartlett, Seaman, White, & Carpenter, 2015; Enders, Du, & Keller,
2020; Erler et al., 2016; Goldstein, Carpenter, & Browne, 2014; Kim et al., 2015, 2018;
Lüdtke et al., 2020b; Zhang & Wang, 2017). As you will see, the methodology for treat-
ing nonlinearities is the Bayesian equivalent of the factored regression approach from
Chapter 3.
Bayesian analyses are a bridge connecting maximum likelihood to multiple imputa-
tion. On one side of that bridge is maximum likelihood, which extracts the parameter
estimates of interest directly from the observed data. The other side of the bridge is
multiple imputation, which creates and saves filled-in data sets for later use. A Bayesian
analysis is like maximum likelihood in the sense that model parameters are the focus,
but the machinery that generates missing values is identical to multiple imputation. The
distinction between a Bayesian analysis and multiple imputation can get blurry, because
the latter co-opts the MCMC algorithms from this chapter and Chapter 6. For now, the
goal is to construct temporary imputations that service a particular analysis. The focus
shifts in Chapter 7, where the Bayesian machinery is a mathematical device that creates
suitable imputations for reanalysis in the frequentist framework.
Imputation for an incomplete outcome variable is a good place to start, because the
procedure is relatively straightforward and doesn’t depend on the composition of the
analysis model (e.g., imputation is the same whether the analysis is a simple regression
or a complex model with interaction effects). In truth, there is no need to impute at all in
this situation, because classic regression models are known to produce good estimates
when missing values are restricted to the dependent variable and missingness is due to
predictors (Little, 1992; von Hippel, 2007). However, that limited scenario doesn’t arise
too often in practice, and we’ll generally need to impute outcomes.
To keep the discussion as straightforward as possible, I use a simple regression
model for the first part of the chapter.
Yi = β0 + β1Xi + εi = E(Yi | Xi) + εi   (5.1)
Yi ~ N1( E(Yi | Xi), σε² )
where E(Yi|Xi) is the predicted value for individual i (i.e., the expected value or mean of
Y given a particular X score), the tilde means “distributed as,” N1 denotes the univari-
ate normal distribution function (i.e., the probability distribution in Equation 4.7), and
the conditional mean and residual variance are the distribution’s two parameters. The
bottom row of the expression says that outcome scores are normally distributed around
a regression line with constant residual variation. As you will see, this normal curve
defines the distribution of missing values.
I use the employee data from the companion website to provide a substantive con-
text. The data set includes several workplace-related variables (e.g., work satisfaction,
turnover intention, employee–supervisor relationship quality) for a sample of N = 630
employees. The Appendix describes the data set and variable definitions. The simple
regression model features the leader–member exchange scale (a construct measur-
ing the quality of an employee’s relationship with his or her supervisor) predicting an
employee’s sense of empowerment.
Both variables are incomplete, but I focus on the missing outcome scores for now.
The first two steps condition on the missing values, which means that estimation
is carried out on the filled-in data set from the prior iteration. In fact, these operations
are identical to the complete-data estimation steps for linear regression; the MCMC
algorithm “estimates” regression coefficients by drawing a vector of random numbers
from the multivariate normal distribution in Equation 4.23, after which it updates the
residual variance by drawing a random number from the inverse gamma distribution
in Equation 4.24. The final MCMC step creates new imputations based on the updated
parameter values.
f( Yi(mis) | β, σε², Xi ) = N1( E(Yi | Xi), σε² )   (5.3)
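In code, the imputation step in Equation 5.3 is just one normal draw per missing case. A minimal sketch with hypothetical parameter values (the function name and the numbers are my own, not from the book's analysis):

```python
import random

def impute_outcome(x, beta0, beta1, sigma2_e, rng=random):
    """Draw one imputation for a missing Y from N(E(Y|X), residual variance),
    as in Equation 5.3."""
    return rng.gauss(beta0 + beta1 * x, sigma2_e ** 0.5)

random.seed(7)
# Hypothetical parameter values from one MCMC iteration
imputations = [impute_outcome(x=12.0, beta0=20.0, beta1=0.6, sigma2_e=4.0)
               for _ in range(5000)]
```

Across many draws the imputations center on the predicted value (here 20.0 + 0.6 × 12.0 = 27.2) with spread equal to the residual variance, which is exactly the behavior the normal curve in Equation 5.3 describes.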
The contour rings convey the perspective of a drone hovering over the peak of the bivariate
normal population distribution, with smaller contours denoting higher elevation (and
vice versa). Candidate imputations fall exactly on the vertical hashmarks, but I added
horizontal jitter to emphasize that more scores are located at higher contours near the
regression line. The MCMC algorithm generates an imputation by randomly selecting a
value from the candidate scores along the vertical lines (technically, the algorithm draws
from a full distribution of replacement scores, not just those displayed in the graph).
Figure 5.2 shows a filled-in data set, with black crosshairs denoting cases with
imputed empowerment scores. The concentration of imputed values on the left side of
the graph is consistent with a conditionally MAR process where the probability of miss-
ing data increases as employee–supervisor relationship quality decreases (e.g., because
they are not as engaged or invested). A practical implication of the MAR assumption is that
all participants share the same model; that is, the distribution of empowerment is the
same for any two participants with the same leader–member exchange score, regardless
of the missing data pattern. This feature of imputation is clear in Figure 5.2, because
the filled-in observations blend in seamlessly along the same regression line as the com-
plete scores. Chapter 9 describes models that create imputations according to an MNAR process.
[Figure 5.1 appears here: candidate imputations along the regression line; axes Leader-Member Exchange (horizontal) and Empowerment (vertical).]

FIGURE 5.2. Filled-in data set from one iteration of the MCMC algorithm. The black crosshair symbols denote observations with imputed outcome scores.
Specifying a distribution for complete explanatory variables is unnecessary, but the situ-
ation changes with incomplete regressors, because their values also need to be sampled
from a distribution. Consistent with the maximum likelihood framework, Bayesian
structural equation modeling is one option (Kaplan & Depaoli, 2012; Merkle & Ros-
seel, 2018; Palomo, Dunson, & Bollen, 2007), as are factored regression models (Ibrahim
et al., 2002, 2005). The former generally foists a normal distribution on the predictors,
whereas the latter offers a more flexible specification that accommodates interactive or
nonlinear effects and mixed response types. I focus on factored regression models in
this section and return to multivariate normal data later in the chapter.
Expanding on the previous example, I use a multiple regression model with leader–
member exchange, leadership climate, and a gender dummy code (0 = female, 1 = male)
as predictors.
Yi = β0 + β1X1i + β2X2i + β3X3i + εi = E(Yi | Xi) + εi   (5.4)
Yi ~ N1( E(Yi | Xi), σε² )
As a reminder, E(Yi|Xi) is a predicted value, the tilde means “distributed as,” N1 denotes
the univariate normal distribution function (i.e., the probability distribution in Equa-
tion 4.7), and the terms inside the parentheses define the distribution’s mean and vari-
ance. The bottom row says that the dependent variable is normally distributed around
points on a regression plane with constant variation.
The asterisk superscript in the bottom equation reflects a latent response variable for-
mulation for the dummy code, which I discuss in Chapter 6.
An alternative model specification factorizes the joint distribution of the analysis
variables into a double product featuring the focal model and a multivariate distribu-
tion for the regressors. The factorization for the employee empowerment example is as
follows:
f(EMPOWER, LMX, CLIMATE, MALE) =
f(EMPOWER | LMX, CLIMATE, MALE) × f(LMX, CLIMATE, MALE*)   (5.7)
The second term is a trivariate normal distribution for the explanatory variables.
[ LMXi      ]   [ μ1 ]   [ r1i ]
[ CLIMATEi  ] = [ μ2 ] + [ r2i ]   (5.8)
[ MALEi*    ]   [ μ3 ]   [ r3i ]

[ LMXi      ]        [ μ1 ]   [ σ1²  σ12  σ13 ]
[ CLIMATEi  ] ~ N3 ( [ μ2 ],  [ σ21  σ2²  σ23 ] )
[ MALEi*    ]        [ μ3 ]   [ σ31  σ32  σ3² ]
The asterisk superscript again reflects a latent response variable formulation, which I
discuss in Chapter 6.
For lack of a better term, I refer to this specification as a partially factored regres-
sion model or partially sequential specification. Figure 5.3 depicts the fully factored
and partially factored regression models as path diagrams. It suggests that the two
approaches are equivalent, because they simply swap out a straight arrow for a curved
arrow. The models are, in fact, exchangeable in this example, but that won’t always be
the case. I contrast the two strategies later in this section.
As an aside, an equivalent version of the partially factored model expresses the
multivariate distribution in Equation 5.8 as a series of round-robin linear regression
equations, as follows (Enders et al., 2020; Goldstein et al., 2014):
FIGURE 5.3. The path diagram in panel (a) corresponds to a factored regression or sequential specification, and the diagram in panel (b) is a partially factored regression that specifies a joint distribution for the regressors.
LMXi = μ1 + γ11(CLIMATEi − μ2) + γ21(MALEi* − μ3) + r1i   (5.9)
CLIMATEi = μ2 + γ12(MALEi* − μ3) + γ22(LMXi − μ1) + r2i
MALEi* = μ3 + γ13(LMXi − μ1) + γ23(CLIMATEi − μ2) + r3i
f(X1 | Y, X2, X3) = f(Y, X1, X2, X3) / f(Y, X2, X3)
                  = [ f(Y | X1, X2, X3) × f(X1 | X2, X3) × f(X2, X3) ] / f(Y, X2, X3)   (5.10)
                  ∝ f(Y | X1, X2, X3) × f(X1 | X2, X3)

f(Yi | X1i, X2i, X3i) × f(X1i | X2i, X3i) = N1( E(Yi | Xi), σε² ) × N1( E(X1i | X2i, X3i), σr1² )   (5.11)
Deriving the conditional distribution of X1 involves multiplying the two normal curve
functions and performing some straightforward but tedious algebra that combines the
component functions into a single distribution for X1. The result of that hard work is
a normal distribution with a complex mean and variance that depend on the focal and
regressor model parameters (θ and φ, respectively).
f( X1i(mis) | Yi, X2i, X3i ) = N1( E(X1i | Yi, X2i, X3i), var(X1i | Yi, X2i, X3i) )   (5.12)

E(X1i | Yi, X2i, X3i) = var(X1i | Yi, X2i, X3i) × [ (γ01 + γ11X2i + γ21X3i)/σr1² + β1(Yi − β0 − β2X2i − β3X3i)/σε² ]

var(X1i | Yi, X2i, X3i) = ( 1/σr1² + β1²/σε² )⁻¹
There is nothing especially intuitive about the equation, but its two-part structure
clearly shows that the conditional distribution of X1 depends on the two models in
which it appears.
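Equation 5.12 translates almost line for line into code. The sketch below is my own illustration; the parameter tuples and their values are hypothetical stand-ins, not estimates from the book:

```python
def x1_conditional(y, x2, x3, focal, regressor):
    """Mean and variance of an incomplete predictor's conditional distribution
    (Equation 5.12). `focal` holds (b0, b1, b2, b3, s2e) from the analysis
    model; `regressor` holds (g0, g1, g2, s2r) from the predictor model."""
    b0, b1, b2, b3, s2e = focal
    g0, g1, g2, s2r = regressor
    # Precision-weighted combination of the two models' information
    var = 1.0 / (1.0 / s2r + b1 ** 2 / s2e)
    mean = var * ((g0 + g1 * x2 + g2 * x3) / s2r
                  + b1 * (y - b0 - b2 * x2 - b3 * x3) / s2e)
    return mean, var
```

The two-part structure is easy to verify: setting b1 = 0 removes the focal model's contribution, so the distribution collapses to the regressor model's prediction and residual variance, while any nonzero b1 shrinks the variance because the outcome carries extra information about the missing predictor.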
To further illustrate, Figure 5.4 shows the distribution of plausible leader–member exchange values.

[Figure 5.4 appears here; axes Leader-Member Exchange (horizontal) and Empowerment (vertical).]
Yi = β0 + β1(X1i − μ1) + β2(X2i − μ2) + β3(X3i − μ3) + εi   (5.13)
A partially factored specification is ideally suited for this analysis, because the grand
means are explicit model parameters that MCMC iteratively estimates (see Equations
5.8 and 5.9). This seemingly routine analysis is somewhat harder to implement with a
sequential specification, which requires the following regressions:
X2i = γ02 + γ12(X3i − γ03) + r2i
X3i = γ03 + r3i
MCMC estimation typically estimates each model’s coefficients as a block (see Equa-
tion 4.23), but a more complex algorithm is needed here to account for the fact that two
intercept coefficients appear in multiple equations.
MCMC Algorithm
The estimation recipe below shows the MCMC algorithmic steps in their full generality
where all analysis variables could be missing:
Fundamentally, both the fully factored and partially factored specifications (Equations
5.5 and 5.7, respectively) share the same algorithmic steps, and the primary difference
is the composition of the supporting regressor models. In either case, the focal model
alone determines the distribution of the missing outcome scores, and incomplete predic-
tors depend on two or more sets of model parameters.
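The division of labor described above can be sketched as a loop. Everything below is schematic: the update steps use deliberately simplified conjugate-style draws for a simple regression (they are not the book's algorithm or companion software), the variable names are hypothetical, and `None` marks a missing score.

```python
import random

def update_focal(y, x, rng):
    """Simplified conjugate-style draw for (b0, b1, residual variance)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    b1_hat = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
    b0_hat = ybar - b1_hat * xbar
    ss = sum((y[i] - b0_hat - b1_hat * x[i]) ** 2 for i in range(n))
    s2e = 1.0 / rng.gammavariate(n / 2.0, 2.0 / ss)   # residual variance
    b1 = rng.gauss(b1_hat, (s2e / sxx) ** 0.5)        # slope
    b0 = rng.gauss(b0_hat, (s2e / n) ** 0.5)          # intercept
    return b0, b1, s2e

def update_regressor(x, rng):
    """Simplified draw for the predictor model's mean and variance."""
    n = len(x)
    xbar = sum(x) / n
    ss = sum((v - xbar) ** 2 for v in x)
    s2x = 1.0 / rng.gammavariate(n / 2.0, 2.0 / ss)
    return rng.gauss(xbar, (s2x / n) ** 0.5), s2x

def run_mcmc(y, x, n_iter, rng):
    """Gibbs sampler skeleton for a simple regression with an incomplete
    outcome and an incomplete predictor (None marks a missing score)."""
    y_obs = [v for v in y if v is not None]
    x_obs = [v for v in x if v is not None]
    # Crude starting imputations: observed-data means
    y_fill = [v if v is not None else sum(y_obs) / len(y_obs) for v in y]
    x_fill = [v if v is not None else sum(x_obs) / len(x_obs) for v in x]
    draws = []
    for _ in range(n_iter):
        # 1. Update focal model parameters from the filled-in data
        b0, b1, s2e = update_focal(y_fill, x_fill, rng)
        # 2. Update regressor model parameters
        mu_x, s2x = update_regressor(x_fill, rng)
        # 3. Impute missing predictors; the conditional distribution depends
        #    on BOTH models (structure of Eq. 5.12, no extra covariates)
        var = 1.0 / (1.0 / s2x + b1 ** 2 / s2e)
        for i, v in enumerate(x):
            if v is None:
                mean = var * (mu_x / s2x + b1 * (y_fill[i] - b0) / s2e)
                x_fill[i] = rng.gauss(mean, var ** 0.5)
        # 4. Impute missing outcomes from the focal model alone (Eq. 5.3)
        for i, v in enumerate(y):
            if v is None:
                y_fill[i] = rng.gauss(b0 + b1 * x_fill[i], s2e ** 0.5)
        draws.append((b0, b1, s2e))
    return draws
```

The loop makes the key point concrete: only step 3 needs two sets of parameters, whereas the outcome imputations in step 4 come from the focal model alone.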
Analysis Example
Continuing with the employee data, I applied Bayesian missing data handling to the lin-
ear regression model in Equation 5.4. The missing data rates for the employee empow-
erment and leader–member exchange scales are approximately 16.2 and 4.1%, respec-
tively, and 9.5% of the leadership climate scores are missing. The gender dummy code is
complete. The potential scale reduction factor (Gelman & Rubin, 1992) diagnostic from
Chapter 4 indicated that the MCMC algorithm converged in fewer than 200 iterations,
so I continue using 11,000 total iterations with a conservative 1,000-iteration burn-in
period. Analysis scripts are available on the companion website, including a custom R
program for readers interested in coding the algorithm by hand.
Table 5.1 gives posterior summaries of the analysis model parameters. The sequen-
tial and partially factored specifications gave identical estimates (to the third decimal),
so I report the latter. Furthermore, I omit the regressor model parameters, because they
are not the substantive focus. The interpretation of the regression parameters is the same
as a least squares or maximum likelihood analysis. For example, the leader–member
exchange slope (Mdnβ1 = 0.59, SDβ1 = 0.07) indicates that a one-unit increase in supervi-
sor relationship quality increases employee empowerment by about 0.59, controlling
for other regressors. As you know, the posterior standard deviations are analogous to
frequentist standard errors in the sense that they quantify uncertainty about the param-
eters after analyzing the data, but the subjective definition of uncertainty doesn’t refer-
ence hypothetical estimates from different random samples. Applying null hypothesis-
like logic, the population slope coefficients are unlikely to equal zero, because this null
value falls well outside the 95% credible intervals.
The emergence of Bayesian missing data-handling methods for interactive and nonlin-
ear effects is an important recent development (Bartlett et al., 2015; Enders et al., 2020;
Erler et al., 2016; Kim et al., 2015, 2018; Lüdtke et al., 2020b; Zhang & Wang, 2017).
Moderated regression models are ubiquitous analytic tools, particularly in the social
and behavioral sciences (Aiken & West, 1991; Cohen et al., 2002). A prototypical model
features a focal predictor X, a moderator variable M, the product of X and M, and one or
more covariates like Z below:
Yi = β0 + β1Xi + β2Mi + β3XiMi + β4Zi + εi = E(Yi | Xi, Mi, Xi × Mi, Zi) + εi   (5.15)
Yi ~ N1( E(Yi | Xi, Mi, Xi × Mi, Zi), σε² )
In this model, β1 is a conditional effect that reflects the influence of X when M equals
zero, and β2 is the corresponding conditional effect of M when X equals zero. The β3
coefficient is usually of particular interest, because it captures the change in the β1 slope
for a one-unit increase in M (i.e., the amount by which X’s influence on Y is moderated
by M).
Switching gears to a different substantive context, I use the chronic pain data
to illustrate a moderated regression analysis with an interaction effect. The data set
includes psychological correlates of pain severity (e.g., depression, pain interference
with daily life, perceived control) for a sample of N = 275 individuals with chronic
pain. The motivating question is whether gender moderates the influence of depression
on psychosocial disability, a construct capturing pain’s impact on emotional behaviors
such as psychological autonomy and communication, emotional stability, and so forth.
The moderated regression model is

DISABILITYi = β0 + β1DEPRESSi + β2MALEi + β3(DEPRESSi × MALEi) + β4PAINi + εi   (5.16)

where DISABILITY and DEPRESS are scale scores measuring psychosocial disability and
depression, MALE is a gender dummy code (0 = female, 1 = male), and PAIN is a binary
severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain). I centered depres-
sion scores at their grand mean to facilitate interpretation. The disability and depression
scales are missing 9.1 and 13.5% of their scores, respectively, and approximately 7.3%
of the binary pain ratings are missing. By extension, 13.5% of the sample is also missing
the product term.
f(DISABILITY, DEPRESS, PAIN, MALE) =
f(DISABILITY | DEPRESS, MALE, DEPRESS × MALE, PAIN) × f(DEPRESS | PAIN, MALE) × f(PAIN* | MALE) × f(MALE*)
The first term to the right of the equals sign corresponds to the analysis model from
Equation 5.16. Importantly, the product is not a variable with its own distribution, but
rather a deterministic function of depression and gender, either of which could be miss-
ing. The regressor models in the next three terms translate into a linear regression for
depression, a probit (or logistic) model for the severe pain indicator, and an empty probit
(or logistic) model for the marginal distribution of gender (which I ultimately ignore,
because the variable is complete).
The asterisk superscripts reflect a latent response variable formulation for the binary
variables, which I discuss in Chapter 6.
The partially factored specification has a two-part construction that comprises the
focal model and a multivariate distribution for the predictors (e.g., see Equation 5.7).
The trivariate normal distribution for this example is as follows:
DEPRESSi μ1 r1i
*
PAIN i = μ 2 + r2i (5.19)
*
MALEi μ 3 r3i
The asterisk superscripts again reflect the latent response variable formulation described
in Chapter 6. As noted previously, an equivalent version of this specification expresses
the multivariate normal distribution as a series of round-robin linear regressions like
Equation 5.9 (Enders et al., 2020; Goldstein et al., 2014).
f(X | Y, M, Z) ∝ f(Y | X, M, X × M, Z) × f(X | M, Z) =
N1( E(Yi | Xi, Mi, Xi × Mi, Zi), σε² ) × N1( E(Xi | Mi, Zi), σr1² )   (5.20)
Dropping unnecessary scaling terms and substituting the normal curve’s kernels into
the right side of the expression gives the following:
f(Yi | Xi, Mi, Xi × Mi, Zi) × f(Xi | Mi, Zi) ∝
exp( −(Yi − (β0 + β1Xi + β2Mi + β3XiMi + β4Zi))² / 2σε² ) ×
exp( −(Xi − (γ01 + γ11Mi + γ21Zi))² / 2σr1² )   (5.21)
f( Xi(mis) | Yi, Mi, Zi ) = N1( E(Xi | Yi, Mi, Zi), var(Xi | Yi, Mi, Zi) )   (5.22)

E(Xi | Yi, Mi, Zi) = var(Xi | Yi, Mi, Zi) × [ (γ01 + γ11Mi + γ21Zi)/σr1² + (β1 + β3Mi)(Yi − β0 − β2Mi − β4Zi)/σε² ]

var(Xi | Yi, Mi, Zi) = ( 1/σr1² + (β1 + β3Mi)²/σε² )⁻¹
Kim et al. (2015) give a comparable expression for a partially factored specification that
assigns a multivariate normal distribution to the predictors.
Comparing the distribution above to the one from Equation 5.12, you’ll notice that
the parts of the mean and variance that depend on the focal model expand to incorporate
the interaction effect (e.g., X’s slope is replaced by its simple slope), and the contribution
of the covariate model remains the same. Although it isn't obvious from Equation 5.22,
drawing scores from this distribution yields imputations that are consistent with the
estimated interaction effect. As such, each time MCMC estimates the moderated regres-
sion from the filled-in data, the product of the imputed X and M scores will preserve any
interaction effect in the data, because the posterior predictive distribution constructs
imputations that anticipate this multiplication. This doesn’t mean that imputation will
create an interaction where none exists; the procedure creates imputations that are con-
sistent with the estimated interaction effect, which could be 0.
Looking at the variance of the imputations, the interaction introduces heterosce-
dasticity, such that the distribution’s spread depends on a person’s moderator score
(i.e., each value of M gives a different variance). Applied to the chronic pain data, the
variance expression implies that the spread of the missing depression scores differs
for males and females. Plugging in estimates from the ensuing analysis gives variance
estimates of 33.47 and 26.83 for males and females, respectively. This result highlights
that nonlinearities induce differences in spread that are incompatible with a multivari-
ate normal distribution (i.e., Equation 5.22 is a mixture of normal distributions that
differ with respect to their spread). Classic maximum likelihood and multiple imputa-
tion approaches that assume multivariate normality (e.g., the so-called “just-another-
variable” approach) do a poor job of approximating this heteroscedasticity and are prone
to substantial biases (Bartlett et al., 2015; Kim et al., 2018; Liu et al., 2014; Seaman et al.,
2012; von Hippel, 2009). A growing body of methodological research suggests that fac-
tored regression models and their Bayesian counterparts are superior options for model-
ing incomplete interaction effects (Bartlett et al., 2015; Enders et al., 2020; Erler et al.,
2016; Grund, Lüdtke, & Robitzsch, 2021; Kim et al., 2015, 2018; Lüdtke et al., 2020b;
Zhang & Wang, 2017).
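The variance expression in Equation 5.22 makes this heteroscedasticity easy to verify numerically. In the sketch below, the slope values are the posterior medians reported in this section, but the residual and regressor variances are hypothetical stand-ins, so the resulting numbers are illustrative rather than the book's 33.47 and 26.83:

```python
def x_conditional_var(m, b1, b3, s2e, s2r):
    """Variance of the incomplete focal predictor's conditional distribution
    (bottom line of Equation 5.22); it depends on the moderator score m."""
    return 1.0 / (1.0 / s2r + (b1 + b3 * m) ** 2 / s2e)

# A nonzero interaction makes the spread differ by moderator group
var_female = x_conditional_var(m=0, b1=0.38, b3=-0.24, s2e=25.0, s2r=30.0)
var_male = x_conditional_var(m=1, b1=0.38, b3=-0.24, s2e=25.0, s2r=30.0)
```

Because the male simple slope (0.38 − 0.24 = 0.14) is closer to zero, the outcome carries less information about missing depression scores for males, so their conditional variance is larger; with β3 = 0 the two variances coincide and the distribution is homoscedastic again.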
Analysis Example
Continuing with the chronic pain example, I applied Bayesian missing data handling to
the moderated regression model in Equation 5.16. As explained previously, a partially
factored specification that assigns a multivariate distribution to the predictors is ideally
suited for models with centered predictors, because MCMC iteratively estimates the
grand means. The potential scale reduction factors (Gelman & Rubin, 1992) from a pre-
liminary diagnostic run indicated that MCMC converged in fewer than 400 iterations,
so I continue using 11,000 total iterations with a conservative 1,000-iteration burn-in
period. Analysis scripts are available on the companion website.
Table 5.2 summarizes the posterior distributions of the parameters, and Table 3.7
shows the corresponding maximum likelihood estimates from the factored regression
model. To get a better understanding of the interaction effect, Figure 5.5 uses the pos-
terior medians to plot the regression lines for males and females (averaging over the
severe pain indicator). Recall that lower-order terms are conditional effects that depend
on scaling; Mdnβ1 = 0.38 (SD = 0.06) is the effect of depression on psychosocial disabil-
ity for females (the solid line in the figure), and Mdnβ2 = –0.77 (SD = 0.57) is the gender
difference at the depression mean (the vertical distance between lines at a value of zero
on the horizontal axis). The interaction effect captures the slope difference for males.
The negative coefficient (Mdnβ3 = –0.24, SD = 0.09) indicates that the male depression
slope (the dashed line) was approximately 0.24 points lower than the female slope (i.e.,
the male slope is Mdnβ1 + Mdnβ3 = 0.38 – 0.24 = 0.14). The 95% credible interval for the
interaction does not include 0.
Researchers routinely probe interaction effects by computing the conditional effect
of the focal predictor at different levels of the moderator (i.e., simple slopes; Aiken &
West, 1991; Bauer & Curran, 2005). Following familiar procedures from ordinary least
squares regression, you could compute simple slopes by substituting the point estimates
FIGURE 5.5. Simple slopes (conditional effects) for males and females.
(posterior medians) and dummy codes into the regression equation, but this approach
doesn’t yield posterior standard deviations or credible intervals. A better strategy is to
define the conditional effects as auxiliary parameters that depend on the focal model
parameters from each MCMC iteration (Keller & Enders, 2021). The bottom panel of
Table 5.2 summarizes the posterior distributions of the depression slopes for males and
females. The posterior medians define the slopes of the lines in Figure 5.5.
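The auxiliary-parameter strategy is easy to sketch in Python with stand-in posterior draws (the real draws would come from the MCMC run; the means and standard deviations below just mimic the reported summaries):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in posterior draws; a real analysis would use the saved MCMC output
beta1 = rng.normal(0.38, 0.06, size=10_000)   # female depression slope draws
beta3 = rng.normal(-0.24, 0.09, size=10_000)  # slope-difference draws

# Auxiliary parameter computed within each draw: the male conditional slope
male_slope = beta1 + beta3

summary = {
    "median": float(np.median(male_slope)),
    "sd": float(male_slope.std(ddof=1)),
    "ci95": np.percentile(male_slope, [2.5, 97.5]),
}
print(summary)
```

Because the sum is formed draw by draw, the resulting posterior standard deviation and credible interval automatically reflect the joint uncertainty in β1 and β3, which is exactly what plugging point estimates into the regression equation fails to provide.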
Most methods in this book leverage the normal distribution in important ways; Bayes-
ian estimation makes this dependence explicit by sampling imputations from normal
curves, and maximum likelihood estimation similarly intuits the location of missing
values by assuming they are normal. Of course, the normal distribution is often a rough
approximation for real data where variables are asymmetric and/or kurtotic. Using the
normal curve for missing data handling is fine in many situations, but misspecifications
can introduce bias if the data diverge too much from this ideal (some estimands are
more robust than others).
Bayesian estimation is particularly useful for evaluating the impact of non-normality,
because it produces explicit estimates of the missing values. Graphing imputations next
to the observed data can provide a window into an estimator’s inner machinery, as
severe misspecifications can produce large numbers of out-of-range or implausible val-
ues (e.g., negative imputes for a strictly positive variable). Maximum likelihood estima-
tion is a bit more of a black box in this regard, because it does the same thing—intuits
that missing values extend to a range of implausible score values—without producing
explicit evidence of its assumptions.
Returning to the moderated regression analysis, the observed depression scores
are positively skewed and somewhat platykurtic (skewness = 0.60 and excess kurtosis
= –0.75). To illustrate the impact of sampling normally distributed imputations, I saved
the filled-in data from the final iteration of 10 different MCMC chains. Figure 5.6 shows
overlaid histograms with the observed data as gray bars and the missing values as white
bars with a kernel density function (the graph reflects a stacked data set with all imputa-
tions in the same file). As you can see, the observed data are skewed with scores rang-
ing from 7 to 28, whereas the imputations follow a symmetric distribution that extends
from –5.39 to 32.06. The imputed data are essentially a weighted mixture of a normal
distribution and a skewed distribution, and about 14.6% of the imputed values fall below
the lowest possible score of 7 (about 2% of the imputations are negative).
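A back-of-the-envelope version of this check requires nothing more than the normal cumulative distribution function. The mean and standard deviation below are hypothetical stand-ins for the imputation model's parameters, not the values from this analysis:

```python
from statistics import NormalDist

# Hypothetical imputation-model mean and SD for the depression scores
imp_dist = NormalDist(mu=12.0, sigma=5.0)

p_below_min = imp_dist.cdf(7.0)  # expected share of imputations below the scale minimum of 7
p_negative = imp_dist.cdf(0.0)   # expected share of impossible negative imputations
print(round(p_below_min, 3), round(p_negative, 4))
```

Comparing these model-implied proportions to the observed share of out-of-range imputations is a quick way to judge whether a normal imputation model is behaving as expected.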
Implausible score values clearly offend our aesthetic sensibilities, but out-of-range
imputations don’t necessarily invalidate results and translate into biased estimates;
computer simulation studies show that a normal imputation model can work surpris-
ingly well, especially with means and regression coefficients (Demirtas, Freels, & Yucel,
2008; Lee & Carlin, 2017; von Hippel, 2013; Yuan et al., 2012). To underscore this point,
I reran the analysis applying the Yeo–Johnson power transformation (Lüdtke et al.,
2020b; Yeo & Johnson, 2000) to the depression scale. As described in Chapter 10, the
FIGURE 5.6. Overlaid histograms with the observed data as gray bars and the missing values
as white bars with a kernel density function. The observed data are positively skewed with scores
ranging from 7 to 28, whereas the imputations follow a symmetrical distribution that extends
from –5.39 to 32.06.
FIGURE 5.7. Overlaid histograms with the observed data as gray bars and the missing values
as white bars with a kernel density function. The Yeo–Johnson imputations follow a skewed dis-
tribution that mimics the shape of the observed data.
Yeo–Johnson procedure estimates the shape of the data as MCMC iterates, and it gener-
ally creates imputations that closely match the observed-data distribution. The overlaid
histograms in Figure 5.7 show that the resulting imputations were skewed and more
like the observed scores. However, you might be surprised to find out that the analysis
results were indistinguishable from those in Table 5.2.
On balance, filling in the skewed depression data with normal imputes doesn’t
appear to be problematic, despite the relatively large proportion of out-of-range values. In
my experience, this is often the case. In general, the impact of applying normal imputa-
tions to non-normal data depends on the amount of missing data, and misspecifications
are more likely to introduce bias if the skewed variable has a very high missing data rate.
Because there is no analogue to robust standard errors in the Bayesian framework, the
Yeo–Johnson transformation that I previewed here is a potentially important tool for mod-
eling non-normal missing data, and it has shown great promise when paired with a fac-
tored regression specification (Lüdtke et al., 2020b). Inspecting imputations with simple
frequency distribution tables or graphs such as Figures 5.6 and 5.7 can identify potential
candidates for the procedure, which I describe in more detail later in Section 10.3.
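The transformation itself follows a simple piecewise formula (Yeo & Johnson, 2000). The Python sketch below implements the forward and inverse mappings with an arbitrary λ of 0.5; in the actual procedure, λ is an estimated parameter that MCMC updates along with everything else:

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson power transformation (Yeo & Johnson, 2000)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos, neg = x >= 0, x < 0
    if abs(lam) > 1e-8:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2) > 1e-8:
        out[neg] = -(((-x[neg] + 1) ** (2 - lam)) - 1) / (2 - lam)
    else:
        out[neg] = -np.log1p(-x[neg])
    return out

def yeo_johnson_inverse(z, lam):
    """Back-transform: imputations drawn on the normalized scale return
    to the original metric through this inverse mapping."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos, neg = z >= 0, z < 0
    if abs(lam) > 1e-8:
        out[pos] = (lam * z[pos] + 1) ** (1 / lam) - 1
    else:
        out[pos] = np.expm1(z[pos])
    if abs(lam - 2) > 1e-8:
        out[neg] = 1 - (-(2 - lam) * z[neg] + 1) ** (1 / (2 - lam))
    else:
        out[neg] = -np.expm1(-z[neg])
    return out

x = np.array([-3.0, 0.0, 7.0, 28.0])
z = yeo_johnson(x, lam=0.5)
assert np.allclose(yeo_johnson_inverse(z, 0.5), x)  # round trip recovers the data
```

Sampling imputations on the transformed (approximately normal) scale and back-transforming them through the inverse is what produces filled-in values that mimic the skewed shape of the observed scores.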
you probably have an appreciation for how complicated these functions can be, even
when the analysis model is relatively simple; not surprisingly, the distributions get even
more complex with additional predictors or nonlinear effects (Levy & Enders, 2021). An
alternative imputation strategy uses a Metropolis–Hastings algorithm (Gilks, Richard-
son, & Spiegelhalter, 1996; Hastings, 1970) to approximate these complicated functions
without the need for derivations. This algorithm is a very general and powerful tool for
Bayesian analyses, and it will resurface throughout the remainder of the book.
At its core, the Metropolis–Hastings algorithm performs the same task as the
Gibbs sampler: It draws random numbers from a probability distribution. However,
the algorithm is particularly adept at sampling from complex functions like Equation
5.22, because it works with the simpler component distributions (e.g., a pair of normal
curves). The Metropolis–Hastings algorithm is far more than a convenience feature, as
there are situations in which the product of two distributions is prohibitively complex to
derive or doesn’t have a known form. The curvilinear regression model in the next sec-
tion is one such example (Lüdtke et al., 2020b), and there are many others (e.g., analyses
that use nonconjugate prior distributions).
FIGURE 5.8. The solid line shows the target distribution of missing values, and the dashed
curve is a normal proposal distribution. The black circle and the white circle denote the current
and candidate imputations, respectively. Relative to the current value, the candidate imputation
is at a higher elevation on the target distribution.
as a solid circle. To attach some numbers to the example, I used R’s normal distribution
function to compute the likelihood (relative probability) from each normal distribution.
Multiplying these values as follows gives the height of the target distribution at the cur-
rent imputation (the R function retains the omitted scaling factors):
f\left( Y_i \mid X_i^{(\mathrm{old})}, M_i, X_i^{(\mathrm{old})} \times M_i, Z_i \right) \times f\left( X_i^{(\mathrm{old})} \mid M_i, Z_i \right) = 0.0385 \times 0.0315 = 0.0012 \quad (5.23)
The algorithm can’t sample a new imputation directly from the target function with-
out an equation that defines its exact shape (again, we typically won’t have that). The idea
behind the Metropolis step is to sample a candidate imputation from a simple distribu-
tion known as a proposal distribution or jumping distribution, then determine whether
it is a good match to the unknown target function. The candidate value becomes the new
imputation if it has a high probability of originating from the target distribution. Other-
wise, if the candidate is a bad match, the algorithm discards it and uses the participant’s
current imputation for another iteration. For imputation, the proposal distribution is
just a normal curve centered at the current imputation (as explained later, the variance
is fixed at a value that leads the algorithm to accept new imputations at some optimal
rate). The dashed curve in Figure 5.8 shows a normal proposal distribution centered at
the current imputation (the black dot). Again, the current imputation is the jumping-off
point for evaluating an updated replacement value. As a minor clarification, the proce-
dure I’m describing is technically a Metropolis algorithm, because it uses a symmetrical
proposal distribution, whereas a Metropolis–Hastings algorithm uses an asymmetrical
jumping distribution (Gelman et al., 2014; Lynch, 2007).
For the sake of illustration, suppose that the algorithm draws a random number
from the proposal distribution and gets a value of X(new) = 19 as a candidate imputation.
At this point, the algorithm has proposed a jump from its current position at X(old) = 22
to a new position four points lower on the horizontal axis. The next step is to evaluate
whether the proposed jump is a good one. To do this, we need the candidate imputa-
tion’s location in the target function. Substituting the parameters, data values, and can-
didate imputation into both parts of Equation 5.21 gives the height of the target function
at X = 19. Figure 5.8 shows this relative probability as a solid white circle. I again used
R’s normal distribution function to compute the height of each normal distribution at
this new score, and I multiplied these quantities to get the corresponding height of the
target function.
f\left( Y_i \mid X_i^{(\mathrm{new})}, M_i, X_i^{(\mathrm{new})} \times M_i, Z_i \right) \times f\left( X_i^{(\mathrm{new})} \mid M_i, Z_i \right) = 0.0540 \times 0.0518 = 0.0027 \quad (5.24)
Notice that the candidate imputation’s relative probability is more than twice as large
as that of the current imputation (i.e., 0.0027 vs. 0.0012), which means that its vertical
elevation on the target function is that much higher as well.
Figure 5.8 shows that the proposed jump from X(old) = 22 to X(new) = 19 moves to a
higher elevation on the target distribution. This change suggests that the proposed jump
is a very good one, because the candidate imputation is in a more populated region of
the distribution. As such, we want to accept the candidate and assign it as the current
imputation for the next MCMC iteration. More formally, the relative height of the tar-
get function evaluated at the candidate and current values is known as the importance
ratio. Forming the fraction of Equation 5.24 over Equation 5.23 gives the following ratio:
\mathrm{IR} = \frac{ f\left( Y_i \mid X_i^{(\mathrm{new})}, M_i, X_i^{(\mathrm{new})} \times M_i, Z_i \right) \times f\left( X_i^{(\mathrm{new})} \mid M_i, Z_i \right) }{ f\left( Y_i \mid X_i^{(\mathrm{old})}, M_i, X_i^{(\mathrm{old})} \times M_i, Z_i \right) \times f\left( X_i^{(\mathrm{old})} \mid M_i, Z_i \right) } = \frac{0.0540 \times 0.0518}{0.0385 \times 0.0315} = \frac{0.0027}{0.0012} = 2.31 \quad (5.25)
This value agrees with Figure 5.8, which shows that the white circle (the candidate
imputation) is more than twice as high in elevation as the black circle (the current impu-
tation). An importance ratio greater than one implies that the proposed jump should
automatically be accepted, because the candidate imputation is in a more populated
region of the target distribution.
Because the proposal distribution is symmetrical, it is just as likely for the candi-
date imputation to move to a lower elevation on the target function. To illustrate what
happens in that case, suppose that the algorithm instead draws a random number from
the proposal distribution and gets X(new) = 24 as a candidate imputation. Figure 5.9 also
shows the current and candidate values as black and white circles, respectively. The
FIGURE 5.9. The solid line shows the target distribution of missing values, and the dashed
curve is a normal proposal distribution. The black circle and the white circle denote the current
and candidate imputations, respectively. Relative to the current value, the candidate imputation
is at a lower elevation on the target distribution.
algorithm has now proposed a jump from its current position at X(old) = 22 to a new posi-
tion two points higher on the horizontal axis. Importantly, this jump moves the can-
didate imputation to a lower elevation on the target distribution. We still want to draw
imputations from this region of the distribution, just not as frequently. This is where
the importance ratio comes into play. I again used the R normal distribution function to
compute the importance ratio, which is now 0.4744.
\mathrm{IR} = \frac{ f\left( Y_i \mid X_i^{(\mathrm{new})}, M_i, X_i^{(\mathrm{new})} \times M_i, Z_i \right) \times f\left( X_i^{(\mathrm{new})} \mid M_i, Z_i \right) }{ f\left( Y_i \mid X_i^{(\mathrm{old})}, M_i, X_i^{(\mathrm{old})} \times M_i, Z_i \right) \times f\left( X_i^{(\mathrm{old})} \mid M_i, Z_i \right) } = \frac{0.0294 \times 0.0195}{0.0385 \times 0.0315} = \frac{0.0006}{0.0012} = 0.4744 \quad (5.26)
You can visually verify the importance ratio by noting that the white circle’s elevation is
about 50% as high as the black circle. Although the relative probabilities in the numera-
tor and denominator of the ratio don’t have an absolute interpretation, they do have
a relative one—the candidate imputation is about 47% as likely as the current value.
As such, the probability of accepting the jump to a lower elevation is 0.47. To decide
whether to keep the candidate value, the algorithm generates a random number from
a binomial distribution with a 47% success rate, which is akin to tossing a biased coin
with a 47% chance of turning up heads. If the random draw is a head (i.e., a “success”),
then the candidate imputation becomes the new imputation for the next MCMC itera-
tion. Otherwise, the participant’s current imputation is used for another iteration.
There is one final detail to address. I previously mentioned that the proposal dis-
tribution’s variance is fixed at some predetermined value. It ends up that the spread of
the jumping distribution largely determines the overall rate at which candidate imputa-
tions are accepted; increasing the variance decreases the overall acceptance rate, and
decreasing the variance increases acceptance rates. Although recommendations vary
from one author to the next, common rules of thumb suggest that acceptance rates
between 25 and 50% are ideal (Gelman et al., 2014; Johnson & Albert, 1999; Lynch,
2007). In practice, software programs often start with a preliminary guess about the
variance and then “tune” the parameter periodically by making upward or downward
adjustments to achieve the desired probability. I use a simple tuning scheme in the R
program on the companion website, and dedicated Bayesian texts describe more sophis-
ticated approaches (e.g., Gelman et al., 2014).
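For concreteness, a toy version of such a tuning rule in Python (the ±10% adjustment factors and the pilot acceptance rates below are arbitrary choices for illustration, not the scheme from the companion R program):

```python
def tune_proposal_sd(sd, accept_rate, low=0.25, high=0.50):
    """Periodic tuning of the proposal SD: when acceptance is too high, widen
    the proposal (a larger variance lowers the acceptance rate); when it is
    too low, narrow the proposal (a smaller variance raises acceptance)."""
    if accept_rate > high:
        return sd * 1.1
    if accept_rate < low:
        return sd * 0.9
    return sd

sd = 2.0
for rate in (0.80, 0.65, 0.40):  # pretend acceptance rates from pilot batches
    sd = tune_proposal_sd(sd, rate)
print(sd)
```

Two upward adjustments followed by a no-op leave the proposal standard deviation at 2.42, and the acceptance rate settles into the target band.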
The factored regression approach readily accommodates other types of nonlinear terms.
Curvilinear regression models with polynomial terms and incomplete predictors are an
important example. To illustrate, consider a prototypical polynomial regression model
that features a squared or quadratic term for X (i.e., the interaction of X with itself).
Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i \quad (5.27)

\varepsilon_i \sim N_1\left( 0, \sigma_\varepsilon^2 \right)
Like a moderated regression analysis, β1 is a conditional effect that captures the influ-
ence of X when X itself equals 0 (Aiken & West, 1991; Cohen et al., 2002). The β2 coef-
ficient is of particular interest, because it captures acceleration or deceleration (i.e., cur-
vature) in the trend line. For example, if β1 and β2 are both positive, the influence of X
on Y becomes more positive as X increases, whereas a positive β1 and a negative β2 imply
that X’s influence diminishes as X increases.
To provide a substantive context, I use the math achievement data set from the
companion website that includes pretest and posttest math scores and several academic-
related variables (e.g., math self-efficacy, anxiety, standardized test scores, sociodemo-
graphic variables) for a sample of N = 250 students. The literature suggests that anxi-
ety could have a curvilinear relation with math performance, such that the negative
influence of anxiety on achievement worsens as anxiety increases. The following model
accommodates this nonlinearity while controlling for a binary indicator that measures
whether a student is eligible for free or reduced-price lunch (0 = no assistance, 1 = eli-
gible for free or reduced-price lunch), math pretest scores, and a gender dummy code (0 =
female, 1 = male):
f\left( \mathrm{MATHPOST} \mid \mathrm{ANXIETY}, \mathrm{ANXIETY}^2, \mathrm{FRLUNCH}, \mathrm{MATHPRE}, \mathrm{MALE} \right) \times \quad (5.29)

f\left( \mathrm{ANXIETY} \mid \mathrm{FRLUNCH}, \mathrm{MATHPRE}, \mathrm{MALE} \right) \times

f\left( \mathrm{FRLUNCH}^* \mid \mathrm{MATHPRE}, \mathrm{MALE} \right) \times f\left( \mathrm{MATHPRE} \mid \mathrm{MALE} \right) \times f\left( \mathrm{MALE}^* \right)
The first term to the right of the equals sign is the normal distribution induced by the
curvilinear regression, and the remaining terms are supporting models for the incom-
plete predictors. Importantly, the squared term is not a variable with its own distribu-
tion, but rather a deterministic function of the incomplete anxiety scores. The regressor
models translate into a linear regression for anxiety, a probit (or logistic) model for the
lunch assistance indicator, a linear regression for math pretest scores, and an empty
probit (or logistic) model for the marginal distribution of gender.
I ultimately ignore the bottom two equations, because these variables are complete and
do not require a model. As before, the asterisk superscripts reflect a latent response
variable formulation for the binary variables, which I discuss in Chapter 6. The partially
factored specification replaces the univariate regression with a multivariate normal dis-
tribution (or an equivalent set of round-robin regressions), as shown in Equations 5.8
and 5.9.
Thus far, the conditional distributions of incomplete explanatory variables have
been normal curves (complicated curves, but normal nonetheless). With enough moti-
vation and effort, we could use the factorization as a recipe for deriving an exact equa-
tion for the distribution (e.g., Equation 5.22). The same is not true for curvilinear regres-
sion models, as Lüdtke et al. (2020b) show that a quadratic function induces a quartic
exponential distribution (Cobb, Koppstein, & Chen, 1983; Matz, 1978) that usually isn’t
normal and may even have multiple modes. Depending on the composition of the analy-
sis model, the task of deriving this distribution falls somewhere between very difficult
and intractable. Fortunately, we can use the Metropolis sampler to draw imputations
from this nonstandard target function.
Analysis Example
Continuing with the math achievement example, I used the partially factored regression
specification to illustrate Bayesian estimation for the curvilinear regression model from
Equation 5.28. As explained previously, a partially factored specification that assigns a
multivariate distribution to the predictors is ideally suited for models with centered pre-
dictors, because the grand means are iteratively updated model parameters. The poten-
tial scale reduction factor diagnostic (Gelman & Rubin, 1992) suggested that the MCMC
algorithm converged in fewer than 500 iterations, so I continue using 11,000 total itera-
tions with a conservative 1,000-iteration burn-in period. Analysis scripts are available
on the companion website.
Table 5.3 summarizes the posterior distributions of the model parameters, and Fig-
ure 5.10 shows the regression line based on the posterior medians (averaging over the
other predictors). As a comparison, Table 3.9 gives maximum likelihood estimates from
the factored regression model. I omit the regressor model parameters, because they are
not the substantive focus. Because of centering, the lower-order anxiety slope (Mdnβ1 =
–0.26, SD = 0.09) reflects the influence of this variable on math achievement at the anxi-
ety mean (i.e., instantaneous rate of change in the outcome when the predictor equals 0).
The negative curvature coefficient (Mdnβ2 = –0.01, SD = 0.006) indicates that the anxi-
ety slope became more negative as anxiety increased. This interpretation is clear from
Figure 5.10, where the regression function is concave down. The curvature parameter is
unlikely to equal 0, because this null value falls outside the 95% credible interval (albeit
barely). Finally, note that the maximum likelihood results are numerically equivalent
but carry frequentist interpretations. This has been a recurring theme across
several examples.
FIGURE 5.10. Estimated regression line from the curvilinear regression analysis, averaging
over the covariates.
option that is well suited for analyses with interactions or nonlinear effects or mixtures
of categorical and continuous variables. The mechanics of implementing this strategy
are identical to those used for the interactive and curvilinear models described previ-
ously.
Analysis Example
This example uses the psychiatric trial data on the companion website to illustrate a
Bayesian regression analysis with auxiliary variables. The data, which were collected
as part of the National Institute of Mental Health Schizophrenia Collaborative Study,
consist of four illness severity ratings, measured in half-point increments ranging from
1 (normal, not at all ill) to 7 (among the most extremely ill). In the original study, the 437
participants were assigned to one of four experimental conditions (a placebo condition
and three drug regimens), but the data collapse these categories into a dichotomous
treatment indicator (DRUG = 0 for the placebo group, and DRUG = 1 for the combined
medication group). The researchers collected a baseline measure of illness severity prior
to randomizing participants to conditions, and they obtained follow-up measurements
1 week, 3 weeks, and 6 weeks later. The overall missing data rates for the repeated mea-
surements were 1, 3, 14, and 23%, respectively.
The focal regression model predicts illness severity ratings at the 6-week follow-up
assessment from baseline severity ratings, gender, and the treatment indicator.
\varepsilon_i \sim N_1\left( 0, \sigma_\varepsilon^2 \right)
Centering the baseline scores and male dummy code at their grand means facilitates
interpretation, as this defines β0 and β1 as the placebo group average and group mean
difference, respectively, marginalizing over the covariates.
This small data set offers limited choices for auxiliary variables, but the illness
severity ratings at the 1-week and 3-week follow-up assessments are excellent candi-
dates, because they have strong semipartial correlations with the dependent variable
(r = .40 and .61, respectively) and uniquely predict its missingness. Following estab-
lished procedures, a factored regression specification features a sequence of univariate
distributions, each of which corresponds to a regression model. To maintain the desired
interpretation of the focal model parameters, it is important to specify a sequence where
the analysis variables predict the auxiliary variables and not vice versa. The factoriza-
tion for this analysis is as follows:
f\left( \mathrm{DRUG}^* \mid \mathrm{MALE} \right) \times f\left( \mathrm{MALE}^* \right)
The first two terms are auxiliary variable distributions that derive from linear regres-
sion models, the third term corresponds to the focal analysis, the fourth term is a linear
regression model for the incomplete baseline scores, and the final two terms are regres-
sions for the complete predictors (which I ignore, because these variables do not require
distributions). The auxiliary variable regressions are shown below, and the predictor
distributions follow earlier examples:
Figure 3.13 shows a path diagram of the models for this example.
The PSRF (Gelman & Rubin, 1992) diagnostic indicated that the MCMC algorithm
converged in fewer than 400 iterations, so I continue using 11,000 total iterations with
a conservative 1,000-iteration burn-in period. Analysis scripts are available on the com-
panion website. Table 5.4 summarizes the posterior distributions of the model param-
eters with and without auxiliary variables. In the interest of space, I omit the auxiliary
variable and covariate model parameters, because they are not the substantive focus.
Although the auxiliary variables change the numerical results, they do not affect the
interpretation of the focal model parameters; the intercept coefficient is the placebo
group mean at the 6-week follow-up (Mdnβ0 = 4.41, SD = 0.16), and the posterior median of β1
gives the group mean difference for the medication condition (Mdnβ1 = –1.45, SD = 0.18),
controlling for covariates. Perhaps not surprisingly, maximum likelihood estimates
were numerically equivalent, albeit with frequentist interpretations (see Table 3.10).
Conditioning on the auxiliary variables had a substantial impact on key parameter
estimates; the intercept coefficients (placebo group means) differed by nearly three-
fourths of a posterior standard deviation, and the slope coefficients (medication group
mean differences) differed by more than one standard deviation. Although the natural
inclination is to favor the analysis with auxiliary variables, there is no way to know for
sure which is more correct, as conditioning on the wrong set of variables can exacerbate
nonresponse bias, at least hypothetically (Thoemmes & Rose, 2014). Nevertheless, the
differences are consistent with the shift from an MNAR-by-omission mechanism to a
more MAR-like process. The fact that the auxiliary variables have strong semipartial
correlations with the dependent variable (e.g., greater than .40) and uniquely predict its
missingness reinforces this conclusion.
μ and Σ. Writing the model this way emphasizes that the analysis comprises entirely
dependent variables, and there are no incomplete predictors to worry about.
f\left( Y_{1i}^{(\mathrm{mis})} \mid \mathrm{parameters}, \mathrm{data} \right) = N_1\left( E\left( Y_{1i} \mid Y_{2i}, Y_{3i} \right), \sigma_r^2 \right)
Importantly, the regression coefficients and residual variance are a deterministic trans-
formation of the elements in μ(t) and Σ(t) rather than estimated parameters (the equations
are given below). As a second example, consider participants with missing data on Y1
and Y3. Imputation for this pattern requires the multivariate regression of the incom-
plete pair on Y2 (e.g., the regression of work satisfaction and leader–member exchange
on empowerment). The following bivariate normal distribution generates correlated
pairs of imputations:
related noise term that preserves the unexplained part of its association with the other
missing score.
To reiterate, the regression parameters that define the distributions of imputations
are just transformations of the estimated parameters, which are the mean vector and
covariance matrix. For completeness, the remainder of this section shows how to con-
vert μ and Σ into the necessary quantities. Readers who are not interested in these fine-
grained details can skip to the analysis example without losing important information.
To begin, consider the regression of Y1 on Y2 and Y3 in Equation 5.35. To get these regres-
sion parameters, MCMC partitions the mean vector and covariance matrix at iteration
t into blocks, such that μ(com) and Σ(com) are submatrices corresponding to the complete
variables for this pattern, μ(mis) and Σ(mis) are the parameters of the missing variables, and
Σ(mc) contains covariances between the missing and complete variables. The partitions
for the univariate pattern are as follows:
\mu = \begin{pmatrix} \mu_{(\mathrm{mis})} \\ \mu_{(\mathrm{com})} \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \Sigma_{(\mathrm{mis})} & \Sigma_{(\mathrm{mc})} \\ \Sigma_{(\mathrm{cm})} & \Sigma_{(\mathrm{com})} \end{pmatrix} \quad (5.37)

\mu_{(\mathrm{mis})} = \left( \mu_1 \right) \qquad \mu_{(\mathrm{com})} = \begin{pmatrix} \mu_2 \\ \mu_3 \end{pmatrix}

\Sigma_{(\mathrm{mis})} = \left( \sigma_1^2 \right) \qquad \Sigma_{(\mathrm{mc})} = \begin{pmatrix} \sigma_{12} & \sigma_{13} \end{pmatrix} \qquad \Sigma_{(\mathrm{com})} = \begin{pmatrix} \sigma_2^2 & \sigma_{23} \\ \sigma_{32} & \sigma_3^2 \end{pmatrix}
Similarly, the multivariate regression in Equation 5.36 requires the following partition:

\mu_{(\mathrm{mis})} = \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix} \qquad \mu_{(\mathrm{com})} = \left( \mu_2 \right) \quad (5.38)

\Sigma_{(\mathrm{mis})} = \begin{pmatrix} \sigma_1^2 & \sigma_{13} \\ \sigma_{31} & \sigma_3^2 \end{pmatrix} \qquad \Sigma_{(\mathrm{mc})} = \begin{pmatrix} \sigma_{12} \\ \sigma_{32} \end{pmatrix} \qquad \Sigma_{(\mathrm{com})} = \left( \sigma_2^2 \right)
Finally, the regression model parameters are transformations of these submatrices:
\gamma = \left( \Sigma_{(\mathrm{mc})} \Sigma_{(\mathrm{com})}^{-1} \right)^{\prime} \quad (5.39)

\gamma_0 = \mu_{(\mathrm{mis})} - \gamma^{\prime} \mu_{(\mathrm{com})}

\Sigma_r = \Sigma_{(\mathrm{mis})} - \gamma^{\prime} \Sigma_{(\mathrm{com})} \gamma
where γ contains regression slopes, γ0 contains intercepts, and Σr is the residual covari-
ance matrix (or variance in patterns with a single incomplete variable).
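A compact numpy sketch of the Equation 5.39-style transformation, with a hypothetical mean vector and covariance matrix for (Y1, Y2, Y3) and Y1 treated as the missing variable:

```python
import numpy as np

def regression_from_moments(mu, sigma, mis, com):
    """Convert a mean vector and covariance matrix into the intercepts, slopes,
    and residual covariance for the regression of the missing variables on the
    complete ones."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    s_com = sigma[np.ix_(com, com)]   # covariance block of the complete variables
    s_mc = sigma[np.ix_(mis, com)]    # missing-by-complete covariances
    s_mis = sigma[np.ix_(mis, mis)]   # covariance block of the missing variables
    gamma_t = s_mc @ np.linalg.inv(s_com)           # slopes (gamma transposed)
    gamma0 = mu[mis] - gamma_t @ mu[com]            # intercepts
    s_resid = s_mis - gamma_t @ s_com @ gamma_t.T   # residual (co)variance
    return gamma0, gamma_t, s_resid

# Hypothetical mu and sigma; pattern with Y1 missing and Y2, Y3 complete
mu = [5.0, 4.0, 3.0]
sigma = [[4.0, 1.0, 1.5],
         [1.0, 2.0, 0.5],
         [1.5, 0.5, 3.0]]
gamma0, gamma_t, s_resid = regression_from_moments(mu, sigma, mis=[0], com=[1, 2])
```

The same function handles the bivariate pattern by passing `mis=[0, 2]` and `com=[1]`, in which case `s_resid` is the residual covariance matrix that preserves the association between the two imputed scores.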
Analysis Example
Continuing with the employee data example, I apply Bayesian missing data handling
to estimate the mean vector and variance–covariance matrix from the model in Equa-
tion 5.34. To explore the influence of different prior distributions, I implemented the
Wishart specifications described in Section 4.10 and a separation strategy that specifies
distinct priors for variances and correlations (Merkle & Rosseel, 2018). Following earlier
examples, I used PSRFs to determine the burn-in period, and I based the final analyses
on 10,000 MCMC iterations. Estimation scripts are available on the companion website.
Table 5.5 gives Bayesian summaries of the means, standard deviations, variances
and covariances, and correlations. Because the choice of prior had little impact (e.g.,
differences were similar to those in Table 4.4), Table 5.5 shows results from an improper
inverse Wishart prior with S0 = 0 and df0 = –V – 1 (Asparouhov & Muthén, 2010a).
Note that the standard deviations and correlations are deterministic functions of the
estimated variances and covariances at each iteration (e.g., a correlation is a covariance
divided by square root of the product of two variances). As a comparison, Table 3.2 gives
the corresponding maximum likelihood estimates. Consistent with other examples, the
two estimators produced similar numerical results, albeit with different perspectives on
inference. This won’t necessarily be true in smaller samples, where the choice of prior
distribution could be more impactful.
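A small Python sketch of this draw-by-draw transformation, using stand-in draws rather than real MCMC output:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in posterior draws for two variances and their covariance
var1 = rng.uniform(1.4, 1.8, 2000)    # e.g., work satisfaction variance draws
var2 = rng.uniform(18.0, 23.0, 2000)  # e.g., empowerment variance draws
cov12 = rng.uniform(1.2, 2.2, 2000)   # covariance draws

# Deterministic transformations applied within each draw
corr_draws = cov12 / np.sqrt(var1 * var2)  # correlation per iteration
sd1_draws = np.sqrt(var1)                  # standard deviation per iteration

print(np.median(corr_draws), np.percentile(corr_draws, [2.5, 97.5]))
```

Summarizing `corr_draws` with a median and percentiles yields the correlation rows of a table like Table 5.5 directly, with credible intervals that respect the uncertainty in all three underlying parameters.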
                                     Mdn     SD     LCL     UCL
Standard deviations
  Work Satisfaction                  1.27    0.04   1.21    1.35
  Empowerment                        4.45    0.14   4.19    4.74
  LMX                                3.04    0.09   2.87    3.22
Correlations
  Work Satisfaction ↔ Empowerment    .31     .04    .22     .38
  Work Satisfaction ↔ LMX            .42     .03    .35     .49
  Empowerment ↔ LMX                  .42     .04    .35     .50
Note. LCL, lower credible limit; UCL, upper credible limit; LMX, leader–member exchange.
This chapter has illustrated missing data handling in the Bayesian framework. Follow-
ing ideas established in Chapter 4, MCMC breaks estimation problems into a series of
steps that address one parameter (or block of similar parameters) at a time, while treat-
ing other parameters as known constants. With missing data, the algorithm then uses
the newly updated estimates to construct model-predicted distributions of the missing
data, from which it samples imputations. A Bayesian analysis is like maximum likeli-
hood in the sense that model parameters are the focus, but all missing data handling is
done via imputation; missing values are just another unknown for MCMC to estimate.
Much of the chapter has focused on specifying models as factored regression equa-
tions (Ibrahim et al., 2002; Lüdtke et al., 2020b). I introduced this approach in Chapter
3, where it was a flexible strategy for assigning distributions to incomplete predictor
variables. The procedure is similarly flexible for Bayesian estimation and works in much
the same way. The missing data distributions defined under this approach are usually
quite complicated and may depend on multiple sets of model parameters (e.g., the distri-
bution of an incomplete predictor is always the product of two or more distributions and
corresponding model parameters). Nevertheless, imputations always have a straightfor-
ward interpretation as the sum of a predicted value plus a normally distributed noise
term.
Looking ahead, Chapter 6 describes Bayesian estimation for binary, ordinal, and
multicategorical nominal variables. The procedure builds on the sequential specification
from this chapter, but probit regressions with latent response variables replace linear
models with continuous outcomes. As you will see, the latent response variable frame-
work is convenient, because it reuses MCMC estimation steps for continuous variables.
Finally, I recommend the following articles for readers who want additional details on
topics from this chapter:
Ibrahim, J. G., Chen, M. H., & Lipsitz, S. R. (2002). Bayesian methods for generalized linear
models with covariates missing at random. Canadian Journal of Statistics, 30, 55–78.
Kim, S., Sugar, C. A., & Belin, T. R. (2015). Evaluating model-based imputation methods for
missing covariates in regression models with interactions. Statistics in Medicine, 34, 1876–
1888.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020). Regression models involving nonlinear effects
with missing data: A sequential modeling approach using Bayesian estimation. Psychologi-
cal Methods, 25, 157–181.
McNeish, D. (2016). On using Bayesian methods to address small sample problems. Structural
Equation Modeling: A Multidisciplinary Journal, 23, 750–773.
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psycho-
logical Methods, 22, 649–666.
6
Bayesian Estimation
for Categorical Variables
In the not so distant past, the predominant method for dealing with incomplete cat-
egorical variables was to impute them as though they were normally distributed and
apply a rounding scheme to convert continuous imputes to discrete values (Allison,
2002, 2005; Bernaards, Belin, & Schafer, 2007; Horton, Lipsitz, & Parzen, 2003; Yucel,
He, & Zaslavsky, 2008, 2011). Fortunately, these ad hoc approaches are unnecessary at
this point in the evolutionary tree, as user-friendly tools for conducting Bayesian analy-
ses with categorical variables are widely available. This chapter describes estimation
and missing data handling for binary, ordinal, and multicategorical nominal variables. I
focus primarily on the probit regression framework that views categorical responses as
originating from one or more latent response variables. This approach is widely cited in
the missing data literature and readily integrates with the Bayesian estimation routines
from Chapters 4 and 5.
A good deal of methodological work on Bayesian estimation for categorical vari-
ables traces to a seminal paper by Albert and Chib (1993). They describe a data augmen-
tation approach that supplements the categorical scores with latent response variable
estimates. The appeal of their method is that given a full sample of latent scores, MCMC
can simply recycle estimation steps for continuous variables. This makes dealing with
categorical variables straightforward, because we only need to learn how to create the
underlying latent response scores. As you will see, data augmentation is essentially an
extreme form of imputation in which 100% of the sample has missing data on these
variables. As an aside, the literature also describes latent variable data augmentation
for logistic regression models (Asparouhov & Muthén, 2021b; Frühwirth-Schnatter
& Früwirth, 2010; Holmes & Held, 2006; O’Brien & Dunson, 2004; Polson, Scott, &
Windle, 2013), but the probit model is currently the norm for missing data handling
(Asparouhov & Muthén, 2010c; Carpenter, Goldstein, & Kenward, 2011; Carpenter & Kenward, 2013; Enders et al., 2020; Enders, Keller, & Levy, 2018; Goldstein, Carpenter, Kenward, & Levin, 2009).
The chapter begins by describing the latent response formulation for a binary out-
come. This approach readily extends to ordinal outcomes with relatively little modifica-
tion, and it also provides a foundation for understanding the multinomial probit model
for multicategorical nominal variables. The chapter concludes with a brief discussion
of logistic regression. Extending ideas from earlier chapters, missing data imputation
for categorical explanatory variables requires a distribution for these variables. The fac-
tored regression modeling strategy used throughout the book also applies to categorical
variables, and the only change is that probit models replace linear regressions.
I use the employee data on the companion website to illustrate the latent variable for-
mulation for binary and ordinal variables. The data set includes several work-related
variables (e.g., work satisfaction, turnover intention, employee–supervisor relationship
quality) for a sample of N = 630 employees. I begin with a dichotomous measure of
turnover intention that equals 0 if an employee has no plan to leave his or her position
and 1 if the employee has intentions of quitting. The bar graph in Figure 6.1 shows the
distribution of discrete responses.
FIGURE 6.1. Bar graph of the dichotomous measure of turnover intention (0 = an employee has
no plan to leave her or his position, and 1 = the employee has intentions of quitting).
Probit regression envisions the binary scores as originating from an underlying latent
response variable that represents one’s underlying proclivity or propensity to endorse
the highest category (Agresti, 2012; Johnson & Albert, 1999). Applied to the turnover
intention measure, this latent variable represents an unobserved, continuous dimen-
sion of intentions to quit. To illustrate, Figure 6.2 shows the latent variable distribution
for the bar graph in Figure 6.1. The vertical line represents the precise cutoff point or
threshold in the latent distribution where discrete scores switch from 0 to 1 (or more
generally, from the lowest code to the highest code). The areas under the curve above
and below this threshold correspond to the category proportions in the bar chart; that
is, 69% of the area under the curve falls below the threshold, and 31% falls above in the
shaded region. Using generic notation, the link between the latent scores and categorical
responses is
Yi = 0 if Yi* ≤ τ
     1 if Yi* > τ     (6.1)
where Yi is the binary outcome for individual i, Yi* is the corresponding latent response
score, and τ is the threshold parameter (the vertical line in Figure 6.2).
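The link in Equation 6.1 is easy to sketch in code. The latent mean (–0.50), threshold (0), and N = 630 echo Figure 6.2, but the draws themselves are simulated rather than taken from the data.

```python
import numpy as np

# Sketch of the link in Equation 6.1: discrete responses are 0 when the
# latent score falls at or below the threshold and 1 otherwise.
rng = np.random.default_rng(1)
tau = 0.0
y_star = rng.normal(loc=-0.50, scale=1.0, size=630)  # latent propensities
y = (y_star > tau).astype(int)                       # Equation 6.1
print(y.mean())  # sample proportion intending to quit, roughly .31
```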
I introduced the following notation for the probit model earlier in the book:
Yi* = β0 + εi = E(Yi*) + εi
Yi* ~ N1(E(Yi*), 1)     (6.2)

FIGURE 6.2. Latent response variable distribution for a binary variable. The vertical line at 0 is a threshold parameter that divides the latent distribution into two discrete segments. The shaded region represents the proportion of employees who intend to quit.
To refresh, E(Yi*) is the predicted latent response score (i.e., conditional mean), N1 denotes
a univariate normal distribution, and the first and second terms inside the normal dis-
tribution function are its mean and variance, respectively. Because the latent scores
are completely missing, the probit model requires two identification constraints that
establish a metric for the latent response scores. First, fixing either the latent response
variable’s mean or threshold to 0 establishes the mean structure. I always adopt the
strategy of fixing the threshold and estimating the mean. With no explanatory variables
in the model, β0 is simply the grand mean of the latent response variable, which you can
see is approximately located at –0.50 in Figure 6.2. Second, the model scales the latent
response variable as a z-score by fixing its variance to 1. The second term in the normal
distribution function reflects this constraint.
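A quick sketch shows how the constraints identify the latent mean: with the threshold fixed at τ = 0 and a unit variance, the grand mean is pinned down by the observed category proportions. Using the 69% / 31% split from Figure 6.2 gives Pr(Y = 0) = Φ(τ − β0), so β0 = −Φ⁻¹(.69).

```python
from scipy.stats import norm

# With tau = 0 and variance = 1, the latent grand mean is recovered from
# the proportion of 0 responses: Pr(Y = 0) = Phi(-beta0).
p_zero = 0.69
beta0 = -norm.ppf(p_zero)
print(round(beta0, 2))  # → -0.5
```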
The probit model for ordered categorical variables incorporates additional thresh-
old parameters but is otherwise identical to the binary model. Continuing with the
employee data set, Figure 6.3 shows a bar graph of the 7-point work satisfaction rating
scale (1 = extremely dissatisfied to 7 = extremely satisfied). The latent variable regression
model is the same as Equation 6.2. This variable requires six threshold parameters to
carve the latent response distribution into seven discrete regions. More generally, the number of thresholds is always one fewer than the number of response options.

FIGURE 6.3. Bar graph of a 7-point work satisfaction rating scale ranging from 1 = extremely dissatisfied to 7 = extremely satisfied.

The link
function that relates the latent response scores to the discrete categories is shown below:
Yi = 1 if τ0 < Yi* ≤ τ1
     2 if τ1 < Yi* ≤ τ2
     ⋮
     C if τC−1 < Yi* ≤ τC     (6.3)
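A minimal sketch of this ordinal link, using hypothetical threshold values for a 7-point scale (the endpoints τ0 = −∞ and τC = +∞ are implicit):

```python
import numpy as np

# Hypothetical thresholds for a 7-category scale: C - 1 = 6 cutoffs, with
# the lowest fixed at 0 to anchor the latent mean structure.
tau = np.array([0.0, 0.9, 1.7, 2.6, 3.4, 4.3])

# Equation 6.3: each latent score maps to the category whose threshold
# interval contains it.
y_star = np.array([-0.8, 1.2, 5.0])
y = np.searchsorted(tau, y_star) + 1  # categories coded 1..7
print(y)  # → [1 3 7]
```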
Having covered the basics, we can now add explanatory variables to the latent response
model. To illustrate, consider a simple regression with leader–member exchange
(employee–supervisor relationship quality) predicting turnover intention. The latent
variable regression model is as follows:
Yi* = β0 + β1Xi + εi = E(Yi* | Xi) + εi
Yi* ~ N1(E(Yi* | Xi), 1)     (6.4)
The bottom row of the expression says that the latent distribution for participant i is
now centered at a predicted value, and the residual variance is fixed at 1 to establish a
metric (i.e., the conditional distribution of the latent scores is scaled as a z-score). As
before, the model features a fixed threshold parameter that divides the latent variable
distribution into two segments (see Equation 6.1).
FIGURE 6.4. Latent variable distributions for three participants with different values of
leader–member exchange. The black dots represent the means of the latent distributions, and
the area above the threshold parameter (shaded in gray) conveys predicted probabilities.
Figure 6.4 shows the latent response distributions at three values of the explana-
tory variable. The black dots represent predicted values, and the contour rings convey
the perspective of a drone hovering over the peak of a bivariate normal distribution,
with smaller contours denoting higher elevation (and vice versa). The area above the
threshold parameter (shaded in gray) in each distribution is the predicted probability of
quitting (see Equation 2.67). Figure 6.4 shows that the likelihood of quitting decreases
as relationship quality (leader–member exchange) increases along the horizontal axis
(i.e., the β1 coefficient is negative).
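The predicted probabilities in Figure 6.4 can be sketched as follows; the coefficients are hypothetical stand-ins with a negative slope, not the fitted estimates.

```python
from scipy.stats import norm

# Predicted probability of quitting at three leader-member exchange values,
# with hypothetical coefficients (beta1 < 0) and the threshold fixed at 0.
beta0, beta1, tau = 1.5, -0.20, 0.0
probs = []
for x in (5, 10, 15):
    eta = beta0 + beta1 * x                       # E(Y* | X)
    probs.append(float(1 - norm.cdf(tau - eta)))  # area above the threshold
print([round(p, 3) for p in probs])  # → [0.691, 0.309, 0.159]
```

As in the figure, the probability of quitting declines as relationship quality increases.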
As before, the regression coefficients have a noninformative prior (i.e., f(β) ∝ 1), and the fixed residual variance does not require a prior distribution. The posterior distribution—the product of
the likelihood and the prior—is a multivariate function that describes the relative prob-
ability of different combinations of the coefficients and latent response scores, given the
data.
f(β, Y* | data) = f(β) × ∏i=1,…,N exp(−½(Yi* − (β0 + β1Xi))²)
                         × [I(Yi* ≤ τ)I(Yi = 0) + I(Yi* > τ)I(Yi = 1)]     (6.5)
The likelihood expression to the right of the product operator mimics the one for linear
regression (see Equation 4.21) but features the latent response scores as the outcome
(the residual variance also vanishes, because it is fixed at 1). Visually, the kernel of the
normal curve represents the height of the latent variable distributions in Figure 6.4. The
I(⋅) terms on the right side of the expression are indicator functions that encode the
categorization scheme from Equation 6.1. The function works like a true–false state-
ment, such that each I(⋅) takes on a value of 1 if the condition in parentheses is true and
0 otherwise. The indicator functions are there to ensure that an observation contributes
to the likelihood only if its latent score falls in the region prescribed by the categorical
response.
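As a sketch, one observation's contribution to Equation 6.5 can be coded directly; the values passed in are illustrative.

```python
import math

# One case's contribution to the augmented likelihood in Equation 6.5: a
# normal kernel evaluated at the latent score, multiplied by an indicator
# that the score falls in the region implied by the discrete response.
def likelihood_i(y_star, y, eta, tau=0.0):
    kernel = math.exp(-0.5 * (y_star - eta) ** 2)
    in_region = y_star > tau if y == 1 else y_star <= tau
    return kernel if in_region else 0.0

print(likelihood_i(0.8, 1, 0.5))   # latent score consistent with Y = 1
print(likelihood_i(-0.8, 1, 0.5))  # inconsistent with Y = 1, contributes 0
```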
Assign starting values to all parameters, latent data, and missing values.
Do for t = 1 to T iterations.
> Estimate coefficients conditional on the latent data.
> Estimate latent response scores conditional on the updated coefficients.
Repeat.
Each estimation step draws synthetic parameter values at random from a probability
distribution. Mechanically, you get these full conditional distributions by multiplying
the prior and the likelihood, then doing some tedious algebra to express the product as
a function of a single unknown. I give these distributions below and point readers to
specialized Bayesian texts for additional details on their derivations (e.g., Hoff, 2009;
Lynch, 2007).
First, the MCMC algorithm estimates regression coefficients by drawing a vector
of random numbers from the multivariate normal conditional distribution that follows:
f(β | Y*, data) ∝ NK+1(β̂, Σβ̂)     (6.6)
β̂ = (X′X)⁻¹X′Y*
Σβ̂ = (X′X)⁻¹
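The coefficient step in Equation 6.6 amounts to a single multivariate normal draw. The sketch below simulates stand-in latent data; nothing here uses the book's actual estimates.

```python
import numpy as np

# Gibbs step for the coefficients (Equation 6.6): with the residual variance
# fixed at 1 and a flat prior, beta is drawn from a multivariate normal whose
# mean is the OLS-style estimate and whose covariance is (X'X)^{-1}.
rng = np.random.default_rng(2)
n = 630
X = np.column_stack([np.ones(n), rng.normal(10.0, 3.0, n)])
y_star = X @ np.array([1.5, -0.20]) + rng.normal(size=n)  # simulated latent data

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y_star            # conditional posterior mean
beta_draw = rng.multivariate_normal(beta_hat, XtX_inv)
print(beta_draw.round(2))
```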
Next, MCMC draws latent response scores (and imputations for cases with missing discrete responses) from the distribution below:

f(Yi* | β, data) = N1(E(Yi* | Xi), 1) × I(Yi* > τ)I(Yi = 1)
                   N1(E(Yi* | Xi), 1) × I(Yi* ≤ τ)I(Yi = 0)     (6.7)
                   N1(E(Yi* | Xi), 1) × I(Yi = missing)
Although the indicator functions make it look more complicated than it is, the equation
simply says to draw latent scores from one of two truncated normal curves if the discrete
response is observed and an unrestricted normal distribution otherwise. Specialized
algorithms are available that generate random numbers from truncated normal curves
with no trial and error (Robert, 1995).
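Here is a sketch of the latent-score draw in Equation 6.7, using scipy's truncated normal in place of Robert's (1995) sampler; the conditional mean is a hypothetical value.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Equation 6.7 in code: observed responses restrict the latent draw to one
# side of the threshold; missing responses leave it unrestricted.
eta, tau = -0.5, 0.0
rng = np.random.default_rng(11)

# Observed Y = 1: draw from the normal curve truncated below at tau.
a = (tau - eta) / 1.0  # lower bound, standardized to the N(eta, 1) curve
y_star_obs = truncnorm.rvs(a, np.inf, loc=eta, scale=1.0, random_state=rng)

# Missing discrete response: draw from the unrestricted normal curve.
y_star_mis = norm.rvs(loc=eta, scale=1.0, random_state=rng)
print(y_star_obs > tau)  # → True by construction
```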
To illustrate imputation, Figure 6.5 shows the distribution of unrestricted latent
imputations at three values of leader–member exchange. The contour rings convey the
perspective of a drone hovering over the peak of a bivariate normal distribution, with
smaller contours denoting higher elevation (and vice versa). Candidate imputations fall
exactly on the vertical hashmarks, but I added horizontal jitter to emphasize that more
scores are located at higher contours near the regression line. The MCMC algorithm
generates an imputation by randomly selecting a value from the candidate scores along
the vertical line; for cases with complete data, imputes are sampled from the areas above
or below the threshold, and they are unrestricted if the person’s discrete response is
missing. Figure 6.6 shows a complete set of latent imputations, with black crosshairs
and gray circles denoting the two discrete responses. Although we don't need them right now, Figure 6.6 highlights that the location of the latent imputations relative to the threshold defines a corresponding set of discrete imputes; imputes above and below the threshold are classified as 1's and 0's, respectively. As such, missing data imputation for categorical variables can be viewed as drawing latent and discrete imputes in matched pairs. The discrete imputes will come into play later with categorical regressors.

FIGURE 6.5. Distributions of latent imputations at three values of employee–supervisor relationship quality. Candidate imputations fall exactly on vertical hashmarks, but I added horizontal jitter to emphasize that more scores are located near the regression line.
Analysis Example
Expanding on the employee turnover example, I fit a binary probit model that features
leader–member exchange, employee empowerment, and a gender dummy code (0 =
female, 1 = male) as predictors of binary turnover intention.
The missing data rates were approximately 5.1% for the turnover intention indicator,
4.1% for the employee–supervisor relationship quality scale, and 16.2% for the empowerment scale.
FIGURE 6.6. Scatterplot of imputed latent response data. The dots represent latent scores for
employees who plan to stay at their job (i.e., TURNOVER = 0), the crosshair symbols denote cases
who intend to quit (i.e., TURNOVER = 1), and the dashed horizontal line at 0 is the threshold
parameter.
The sequential specification factors the joint distribution of the analysis variables into a product of univariate models:

f(TURNOVER* | LMX, EMPOWER, MALE) × f(LMX | EMPOWER, MALE)
  × f(EMPOWER | MALE) × f(MALE*)     (6.9)

whereas the partially factored model instead assigns a multivariate distribution to the predictors as follows:

f(TURNOVER* | LMX, EMPOWER, MALE) × f(LMX, EMPOWER, MALE*)     (6.10)
The gender dummy code appears as a latent response variable in models where it func-
tions as a dependent variable (the rightmost terms in both factorizations), but it does not
require a distribution and can be treated as a fixed constant (i.e., change the rightmost
term in 6.10 to a bivariate distribution where leader–member exchange and empower-
ment condition on gender).
The probit model for ordered categorical variables incorporates additional threshold
parameters but is otherwise identical to the binary model. Continuing with the employee
data, consider a model that features leader–member exchange predicting the 7-point
work satisfaction scale (see Figure 6.3). The latent variable regression model is as follows:
Yi* = β0 + β1Xi + εi = E(Yi* | Xi) + εi
Yi* ~ N1(E(Yi* | Xi), 1)     (6.11)
To illustrate the model, Figure 6.7 shows the latent response distribution at three values
of leader–member exchange. The black dots represent predicted scores or conditional
means, and the horizontal dashed lines are threshold parameters (z-score cutoff points)
that carve the continuous distribution into discrete segments. The fact that the thresh-
olds are approximately equidistant is a consequence of the discrete distribution’s sym-
metry and is not an inherent feature of the model. In general, distances between cutoff
points can be quite different, particularly if the discrete distribution is asymmetrical.
Finally, note that the first (lowest) threshold is fixed at z = 0 to anchor the latent mean
structure.
The regression model parameters provide a predicted probability for each categori-
cal response (or equivalently, a set of cumulative probabilities). Visually, the probability
of a categorical response c is the area under a normal curve between two adjacent thresh-
olds in Figure 6.7. More formally, the expression for a predicted probability is
Pr(Yi = c) = Φ(τc − E(Yi* | Xi)) − Φ(τc−1 − E(Yi* | Xi))     (6.12)
where E(Yi*|Xi) is the predicted z-score based on a set of regressors in X, and Φ(·) is the
cumulative distribution function of the standard normal curve. The subtraction inside each function centers the threshold at an individual's conditional mean, and the function Φ returns the area below that result in a standard normal distribution. Subtracting two lower-tailed probabilities gives the area between thresholds. Equation 2.67 gives the comparable expression for a binary outcome.

FIGURE 6.7. Latent variable distributions for three participants with different values of leader–member exchange. The black dots represent the means of the latent distributions, and the horizontal dashed lines are threshold parameters that represent the z-score cutoff points in the continuous distribution where discrete scores switch from one category to the next.
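Equation 6.12 translates directly into code. The thresholds and conditional mean below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Equation 6.12: predicted probabilities for all seven categories. The
# implicit endpoints tau_0 = -inf and tau_C = +inf make the probabilities
# sum to 1.
tau = np.array([-np.inf, 0.0, 0.9, 1.7, 2.6, 3.4, 4.3, np.inf])
eta = 2.0  # E(Y* | X), a hypothetical predicted z-score

probs = norm.cdf(tau[1:] - eta) - norm.cdf(tau[:-1] - eta)
print(probs.round(3))
```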
The joint posterior distribution for the ordinal model adds the threshold parameters:

f(β, τ, Y* | data) = f(β) × f(τ)
                     × ∏i=1,…,N exp(−½(Yi* − (β0 + β1Xi))²) × I(τc−1 < Yi* ≤ τc)I(Yi = c)     (6.13)
The expression is like the one for the binary regression model, but the true–false indi-
cator functions change to accommodate additional response options and threshold
parameters. I adopt noninformative prior distributions for the regression coefficients
and thresholds (i.e., f(β) ∝ 1 and f(τ) ∝ 1), and the residual variance does not require a
prior, because its value is fixed.
Assign starting values to all parameters, latent data, and missing values.
Do for t = 1 to T iterations.
> Estimate coefficients conditional on the latent data.
> Estimate thresholds conditional on coefficients and latent data.
> Estimate latent response scores conditional on the updated coefficients and
thresholds.
Repeat.
The estimation step for the coefficients draws random numbers from the multivariate
normal conditional distribution from Equation 6.6, and the final imputation step for the
latent response scores also mimics the binary model; MCMC samples latent imputations
from a specific region of the normal curve if the categorical response is observed (i.e.,
latent response scores are sampled from a truncated normal distribution), and it draws
scores from the entire distribution if the discrete response is missing. The posterior
predictive distribution below formalizes this idea in an equation:
f(Yi* | β, τ, data) = N1(E(Yi* | Xi), 1) × I(τc−1 < Yi* ≤ τc)I(Yi = c)     (6.14)
                      N1(E(Yi* | Xi), 1) × I(Yi = missing)
Figure 6.8 shows a complete set of latent imputations, and a unique symbol denotes each
discrete response. As noted earlier, missing data imputation for categorical variables can
be viewed as drawing imputes in matched pairs, as the location of each latent score rela-
tive to the threshold parameters implies a corresponding discrete value.
The estimation step for the threshold parameters is the main new detail. Albert and Chib's (1993) seminal paper describes an algorithm that draws threshold parameters
from a uniform distribution bounded on the low end by the highest latent score from the
region below the threshold and bounded on the high end by the lowest latent score from
the region above. For example, returning to Figure 6.8, their procedure draws τ2 (the
second dashed line from the bottom) from a uniform distribution spanning the narrow
vertical interval between the highest circle in the Y = 2 region and the lowest crosshair from the Y = 3 region. Albert and Chib's procedure converges and mixes slowly, because the widths of the uniform intervals tend to be very small, thus limiting the amount that thresholds can change from one iteration to the next (Cowles, 1996; Johnson & Albert, 1999; Nandram & Chen, 1996). Fortunately, other algorithms provide much better performance (Cowles, 1996; Nandram & Chen, 1996). I describe the procedure from Cowles (1996), because it is common in software packages. Readers who are not interested in these technical details can skip to the analysis example without losing important information.

FIGURE 6.8. Scatterplot of a set of imputed latent scores for a 7-point rating scale. Latent scores for a given discrete response fall between two threshold parameters, denoted as horizontal dashed lines.
Cowles (1996) described an algorithm that combines Gibbs sampling and a
Metropolis–Hastings step (Gilks et al., 1996; Hastings, 1970) like the one described in
Section 5.6. Her algorithm first partitions the posterior distribution into two blocks of
unknowns: Regression coefficients form one set, and threshold parameters and latent
scores form the second. She further factors the distribution of τ and Y * into the product
of two univariate distributions as follows:
f(τ, Y* | β, data) ∝ f(Y* | τ, β, data) × f(τ | β, data)     (6.15)
Adopting a flat prior distribution for the thresholds (i.e., f(τ) ∝ 1) gives the conditional
posterior distribution of τ:
f(τ | β, data) ∝ ∏i=1,…,N [Φ(τYi − E(Yi* | Xi)) − Φ(τYi−1 − E(Yi* | Xi))]     (6.16)
where τYi and τYi –1 are the upper and lower threshold boundaries for person i’s categorical
response (e.g., if Y = 2, then τYi = τ2 and τYi –1 = τ1). The terms to the right of the product
operator correspond to the predicted probability expression from Equation 6.12, and
their product is an alternative expression for the likelihood that doesn’t require latent
scores.
Chapter 5 used the Metropolis–Hastings algorithm to draw imputations from a
complex distribution, and Cowles uses the same approach to draw threshold parameters
from the previous distribution. The algorithm performs the following steps: (1) draws
candidate threshold parameters from a normal proposal distribution, (2) computes an
importance ratio that captures the relative height of the target function (Equation 6.16)
evaluated at the candidate and current threshold values, and (3) uses a random number
to accept or reject the candidate parameter values.
The algorithm’s first step draws candidate thresholds one at a time, each from a nor-
mal proposal distribution centered at the current estimate. I refer to each pair of thresh-
olds as τc(new) and τc(old), respectively. The proposal distribution’s standard deviation is
fixed at (or adaptively tuned to) a value that accepts new estimates at a rate of 25–50%
(Gelman et al., 2014; Johnson & Albert, 1999; Lynch, 2007). Finally, to ensure that the
thresholds maintain the correct rank order, the lower tail of τc’s proposal distribution is
truncated at the next lowest threshold (i.e., τc–1(new)), and its upper tail cannot exceed the
next highest threshold (i.e., τc+1(old)).
After drawing a set of candidate thresholds one at a time, the algorithm computes
the importance ratio as follows (recall that τ0 = –∞ and τC = ∞):
IR = ∏i=1,…,N [Φ(τYi(new) − E(Yi* | Xi)) − Φ(τYi−1(new) − E(Yi* | Xi))] /
              [Φ(τYi(old) − E(Yi* | Xi)) − Φ(τYi−1(old) − E(Yi* | Xi))]
     × ∏c=2,…,C−1 [Φ((τc+1(old) − τc(old)) / σMH) − Φ((τc−1(new) − τc(old)) / σMH)] /
                  [Φ((τc+1(new) − τc(new)) / σMH) − Φ((τc−1(old) − τc(new)) / σMH)]     (6.17)
The first term is computed by substituting the candidate and current thresholds into the
likelihood expression from Equation 6.16, and the second term (which is often equal
to or close to 1) adjusts for the proposal distribution’s truncation points. Visually, the
importance ratio is the relative height of the target distribution at two sets of threshold
values. A ratio greater than 1 implies that the candidate thresholds in τ(new) are located
at a higher elevation on the target distribution than those in τ(old) (i.e., a more populated
region of the curve), and a ratio less than unity indicates that the candidate values have
moved to a lower elevation. To decide whether to keep the trial parameters, the algorithm generates a random number from a binomial distribution with success probability equal to the importance ratio (capped at 1). If the random draw is a "success," the candidate thresholds
become the current parameters. Otherwise, τ(old) is used for another iteration.
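The accept/reject logic can be sketched as follows. The data are simulated, the thresholds are hypothetical, and the proposal-truncation correction term from Equation 6.17 is omitted to keep the sketch short, so this illustrates the decision rule rather than providing a full implementation of Cowles's algorithm.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def log_lik(tau, eta, y):
    # Equation 6.16 on the log scale; tau includes the fixed endpoints
    # tau_0 = -inf and tau_C = +inf.
    return np.sum(np.log(norm.cdf(tau[y] - eta) - norm.cdf(tau[y - 1] - eta)))

eta = np.zeros(200)               # predicted latent means E(Y* | X)
y = rng.integers(1, 4, size=200)  # discrete responses on a 3-point scale
tau_old = np.array([-np.inf, 0.0, 0.80, np.inf])
tau_new = tau_old.copy()
tau_new[2] = 0.85                 # candidate value for tau_2

# Importance ratio: relative height of the target at the two threshold sets.
ratio = np.exp(log_lik(tau_new, eta, y) - log_lik(tau_old, eta, y))
accept = rng.uniform() < min(ratio, 1.0)
print(round(float(ratio), 3), bool(accept))
```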
Analysis Example
Expanding on the employee data example, I fit an ordered probit model that features
leader–member exchange, employee empowerment, and a gender dummy code (0 =
female, 1 = male) as predictors of work satisfaction ratings.
The missing data rates are approximately 4.8% for the work satisfaction ratings, 4.1% for
the employee–supervisor relationship quality scale, and 16.2% for the employee empowerment scale. Consistent with previous examples, I used a factored regression specification for the incomplete regressors (see Equations 6.9 and 6.10). Analysis scripts are
available on the companion website, including a custom R program for readers who are
interested in learning to code the algorithm by hand.
This example is useful, because it highlights the importance of checking conver-
gence and mixing. Cowles’s (1996) method for updating threshold parameters is a sub-
stantial improvement over Albert and Chib’s (1993) classic approach, but these param-
eters still require very long burn-in periods. For most of the models we’ve worked with
thus far, MCMC converged very quickly, usually in fewer than 500 iterations. That is not
the case here. To illustrate, Figure 6.9 shows a trace plot of τ4 from the first 1,000 itera-
tions of two MCMC chains. The solid horizontal line is the posterior median from the
final 10,000 iterations, and the dashed lines are the corresponding 95% credible interval
limits. Overall, the trace plot is a far cry from some of the ideal graphs in Chapter 4. For
one, estimates are constrained to a narrow range and are not yet oscillating around a
stable mean. You can also see that the first halves of the two chains exhibit flat plateaus where the Metropolis–Hastings algorithm is not accepting new estimates at the desired rate. These flat spots dissipate after about 500 iterations, suggesting that the tuning mechanism is producing a better acceptance rate.

FIGURE 6.9. Trace plot of threshold τ4 from two MCMC chains that comprise 1,000 iterations each. The solid horizontal line is the median estimate from the final posterior distribution, and the dashed lines are the corresponding 95% credible interval limits. Comparing the second halves of each chain yields a potential scale reduction factor of PSRF = 1.36.
Using the split-chain method (Gelman et al., 2014, pp. 284–285) to compare the
second halves of each chain in Figure 6.9 gives a potential scale reduction factor (PSRF)
value of PSRF = 1.36 (Gelman & Rubin, 1992), well above recommended cutoffs. Diag-
nostic runs indicated that all PSRF values dropped below 1.05 somewhere between
15,000 and 20,000 iterations, so I specified 30,000 computational cycles with a burn-in
period of 20,000 iterations. As a practical aside, the slow convergence is almost certainly
exacerbated by the relatively small number of observations in the lowest and highest
categories (see Figure 6.3). In practice, it may be necessary to collapse adjacent catego-
ries to minimize the impact of sparse data. Furthermore, data screening is especially
important for models with multiple categorical variables, because sparse contingency
tables are a common cause of convergence failures.
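For readers who want to compute the diagnostic by hand, a minimal PSRF calculation for two chains looks like this; the simulated chains deliberately center on different values to mimic the poor mixing in Figure 6.9.

```python
import numpy as np

# Potential scale reduction factor (Gelman & Rubin, 1992): compare
# between-chain and within-chain variability for two chains of draws.
rng = np.random.default_rng(7)
chains = np.stack([rng.normal(2.6, 0.1, 500),
                   rng.normal(2.8, 0.1, 500)])  # poorly mixed chains

m, n = chains.shape
W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
psrf = np.sqrt(var_plus / W)
print(round(float(psrf), 2))  # well above 1.05, flagging non-convergence
```

Applying the same formula to the split halves of each chain yields the split-chain version described in Gelman et al. (2014).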
Table 6.2 gives the posterior summaries for the regression model parameters. In
the interest of space, I omit the supporting regressor model parameters, because they
are not the substantive focus. The intercept coefficient is the predicted latent work satisfaction score for a female employee with 0's on the numeric predictors (essentially, the
lowest possible value of the leader–member exchange and empowerment scales). Each
slope coefficient reflects the expected z-score change in the latent response variable for
a one-unit increase in the predictor, controlling for other regressors. For example, the
leader–member exchange coefficient indicates that a one-unit increase in relationship
quality is expected to increase latent work satisfaction by 0.14 z-score units (Mdnβ1 =
0.14, SD = 0.02), holding other predictors constant. The R2 statistic for the overall model
indicates that the set of predictors explained approximately 22% of the variation in the
latent work satisfaction scores (McKelvey & Zavoina, 1975). Finally, although they are
not necessarily of substantive interest, Table 6.2 also summarizes the threshold param-
eters or z-score cutoffs that divide the latent response distribution into seven segments.
Revisiting ideas from Chapter 5, the factored regression specification defines the distri-
bution of an incomplete predictor variable as a composite function of two or more sets of
model parameters. The same is true for binary and ordinal predictors, but probit regres-
sions replace linear models. Switching gears to a different substantive context, I use
the smoking data from the companion website to illustrate missing data handling for a
binary predictor. The data set includes several sociodemographic correlates of smoking
intensity from a survey of N = 2000 young adults (e.g., age, whether a parent smoked,
gender, income). The model uses a parental smoking indicator (0 = parents did not smoke,
1 = parent smoked), age, and income to predict smoking intensity (a function of the num-
ber of cigarettes smoked per day). The model and its generic counterpart are shown
below:
Yi = β0 + β1X1i + β2X2i + β3X3i + εi
εi ~ N1(0, σ²ε)
I centered the income and age variables at their grand means to define the intercept as
the expected smoking intensity score for a respondent whose parents did not smoke.
The smoking intensity variable has 21.2% missing data, 3.6% of the parental smoking
indicator scores are missing, and 11.4% of the income values are unknown.
A probit regression defines the conditional distribution of the parental smoking indica-
tor, and the income and age distributions are linear models, as follows:
As before, the variance of r1 is fixed at 1 to establish a metric, and the parental smok-
ing model also requires a single, fixed threshold parameter. Figure 6.10a shows a path
diagram of this specification. Following diagramming conventions from Edwards et al.
(2012), I use an oval and a rectangle to differentiate the latent variable and its categori-
cal indicator, respectively, and the broken arrow connecting the two is the link function
that maps the unobserved continuum to the discrete responses (e.g., the broken arrow
reflects the idea that latent and discrete scores interact via threshold parameters). Notice
that the latent response variable links to other regressors, but the binary parental smok-
ing indicator predicts the outcome.
The partially factored model specification instead assigns a multivariate normal
distribution to the regressors.
f(INTENSITY | PARSMOKE, INCOME, AGE) × f(PARSMOKE*, INCOME | AGE)   (6.22)
When predictors have different metrics, the normality assumption applies to numerical
and latent response variables. For example, the trivariate normal distribution for the
smoking intensity analysis is as follows:
(PARSMOKE*i, INCOMEi, AGEi)′ ~ N3(μ, Σ)   (6.23)
Notice that the first element in the covariance matrix is fixed at 1 to establish the
latent variable’s metric (the model also incorporates a single, fixed threshold for the
latent response variable).

FIGURE 6.10. Path diagram of a factored regression (sequential) specification and partially
factored model specification.

Figure 6.10b shows a path diagram of this specification, with
curved arrows (correlated residuals) connecting predictors instead of direct pathways.
The latent response variable again connects to the other predictors, whereas the binary
parental smoking indicator predicts the outcome.
Modeling the multivariate normal distribution is not so easy here, because one
element of the covariance matrix is a fixed constant. More generally, this matrix could
contain fixed parameters, variances and covariances, and correlations between pairs of
latent response variables. This mixture of odds and ends makes it difficult to specify a
prior distribution and update the covariance matrix in a single step. One option is to
use a Metropolis–Hastings algorithm to update covariance matrix elements one at a
time or in blocks (Asparouhov & Muthén, 2010a; Browne, 2006; Carpenter & Kenward,
2013), and another is to leverage a well-known property that a multivariate normal dis-
tribution’s parameters can be expressed as an equivalent set of linear regression models
(Arnold et al., 2001; Liu et al., 2014). The set of round-robin regression equations below
is an alternative parameterization for the multivariate distribution:
PARSMOKE*i = μ1 + γ11(INCOMEi − μ2) + γ21(AGEi − μ3) + r1i
INCOMEi = μ2 + γ12(AGEi − μ3) + γ22(PARSMOKE*i − μ1) + r2i   (6.24)
AGEi = μ3 + γ13(PARSMOKE*i − μ1) + γ23(INCOMEi − μ2) + r3i
Finally, note that the age variable is complete and does not require a distribution in
either the sequential or partially factored specifications.
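The equivalence between the multivariate normal distribution and its round-robin regressions can be verified numerically. The sketch below uses hypothetical covariance values (only the latent variable's unit variance comes from the model); the conditional-normal formulas return each round-robin equation's regression weights and residual variance.

```python
# Round-robin regressions as an equivalent parameterization of a multivariate
# normal distribution: each variable's coefficients are the conditional-normal
# regression weights. Covariance values below are hypothetical stand-ins.
import numpy as np

sigma = np.array([[1.00, 0.30, 0.10],   # PARSMOKE* (variance fixed at 1)
                  [0.30, 2.00, 0.60],   # INCOME
                  [0.10, 0.60, 1.50]])  # AGE

def round_robin_weights(cov, j):
    """Regression of variable j on the remaining variables."""
    others = [k for k in range(cov.shape[0]) if k != j]
    gammas = np.linalg.solve(cov[np.ix_(others, others)], cov[others, j])
    resid_var = cov[j, j] - cov[j, others] @ gammas
    return gammas, resid_var

# e.g., INCOME regressed on PARSMOKE* and AGE, as in the equations above
g, rv = round_robin_weights(sigma, 1)
print(g, rv)
```

The residual variance is necessarily smaller than the marginal variance, and cycling `j` over all variables reproduces the full round-robin system.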
f(X1 | Y, X2, X3) ∝ f(Y | X1, X2, X3) × f(X1* | X2, X3) =   (6.25)
N1(E(Yi | X1i, X2i, X3i), σ²ε) × N1(E(X1*i | X2i, X3i), 1)
Dropping unnecessary scaling terms and substituting the normal curve’s kernels into
the right side of the expression gives the following:
f(Yi | X1i, X2i, X3i) × f(X1*i | X2i, X3i) ∝
exp(−(1/2)(Yi − (β0 + β1X1i + β2X2i + β3X3i))² / σ²ε) ×   (6.26)
exp(−(1/2)(X1*i − (γ01 + γ11X2i + γ21X3i))²)
Notice that missing values are like quantum objects that simultaneously exist in two
different states—X1 is a dummy code in the focal analysis model and a latent response
variable in its own model. Each latent score is unconstrained and can fall anywhere in
the normal curve, but its location relative to the threshold parameter fully determines
a corresponding discrete impute. For example, a latent imputation above the threshold
induces a discrete response of X1(mis) = 1 (e.g., parent was a smoker), and a latent impu-
tation below the threshold creates X1(mis) = 0 (e.g., parents were nonsmokers). The dual
nature of a categorical predictor makes deriving an analytic expression for the missing
values even more arduous and intractable than it was with continuous predictors, but
this isn’t a problem for the Metropolis–Hastings algorithm.
Applying ideas from Section 5.6, the algorithm draws imputations by manipulating
the simpler component distributions (e.g., two normal curves). For each missing obser-
vation, the algorithm performs four steps. First, it samples a candidate latent imputation
from a normal proposal distribution centered at a person’s current latent response score.
As before, the proposal distribution’s standard deviation is fixed at or adaptively tuned
to a value that accepts candidate imputations at an optimal rate of 25–50% (Gelman et
al., 2014; Johnson & Albert, 1999; Lynch, 2007). Second, the algorithm uses the current
threshold estimates to convert the candidate latent impute to a discrete value (this exam-
ple requires a single threshold fixed at 0). For example, a latent imputation of X1*(new) =
1.27 converts to X1(new) = 1 (i.e., parent was a smoker), because it is above the threshold,
whereas a candidate value of X1*(new) = –.43 converts to X1(new) = 0 (i.e., parents were non-
smokers). Next, the importance ratio is a fraction that features the target function from
Equation 6.26 in both the numerator and denominator. The algorithm substitutes the
matched pair of candidate imputations (along with the necessary parameter and data
values) into the numerator, and it substitutes the current imputations into the denomi-
nator. Visually, the resulting ratio reflects the relative height of the target distribution
when evaluated at the candidate and current imputes. Finally, the algorithm generates a
random number from a binomial distribution with success rate equal to the importance
ratio. If the random draw is a “success,” the candidate imputations become new data for
the next iteration; otherwise, a participant’s current data carry forward for another cycle.
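The four steps can be sketched in code. All parameter values below are hypothetical placeholders (not the book's estimates), and the target function is the two-kernel product from Equation 6.26 with the threshold fixed at 0; a production implementation would also adaptively tune the proposal standard deviation.

```python
# Metropolis-Hastings imputation for one missing binary predictor, a minimal
# sketch with hypothetical parameter values (beta, gamma, sigma2).
import math, random
random.seed(1)

beta = (8.8, 2.6, 0.02, -0.05)   # focal model: b0, b1 (dummy code), b2, b3
gamma = (-0.4, 0.01, 0.03)       # probit model for the latent predictor
sigma2 = 25.0                    # residual variance of the outcome

def log_target(y, x1_latent, x2, x3):
    """Log kernel of the two-part target: focal model times probit model."""
    x1 = 1.0 if x1_latent > 0 else 0.0          # threshold fixed at 0
    mu_y = beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3
    mu_x1 = gamma[0] + gamma[1]*x2 + gamma[2]*x3
    return -0.5*(y - mu_y)**2/sigma2 - 0.5*(x1_latent - mu_x1)**2

def mh_impute(y, x2, x3, current_latent, tune_sd=1.0):
    # Step 1: candidate latent impute from a normal proposal distribution
    candidate = random.gauss(current_latent, tune_sd)
    # Step 2: the threshold converts the latent score to a discrete value
    # (inside log_target). Step 3: importance ratio of the target heights.
    ratio = math.exp(log_target(y, candidate, x2, x3)
                     - log_target(y, current_latent, x2, x3))
    # Step 4: accept the candidate with probability min(ratio, 1)
    return candidate if random.random() < min(ratio, 1.0) else current_latent

latent = 0.0
for _ in range(200):
    latent = mh_impute(y=12.0, x2=30.0, x3=45.0, current_latent=latent)
print(latent, int(latent > 0))   # latent impute and its discrete version
```

The discrete impute is simply the latent draw's position relative to the threshold, exactly as described in the second step above.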
Analysis Example
Continuing with the smoking data example, I used Bayesian estimation to fit the linear
regression model from Equation 6.19. As explained in Section 5.3, a partially factored
specification that assigns a multivariate distribution to the predictors (or equivalently,
round-robin regressions) is ideally suited for models with centered predictors, because
the grand means are estimated parameters (see Equations 6.23 and 6.24). Following
earlier examples, I did not estimate a supporting model for respondent age, because this
variable is complete and does not require a distribution. The potential scale reduction
factors (Gelman & Rubin, 1992) from a preliminary diagnostic run indicated that the
MCMC algorithm converged in fewer than 300 iterations, so I used 11,000 total itera-
tions with a conservative burn-in period of 1,000 iterations. Analysis scripts are avail-
able on the companion website.
Table 6.3 gives the posterior summaries for the regression model parameters. In
the interest of space, I omit the supporting regressor model parameters, because they
are not the substantive focus. For a comparison, Table 3.6 shows corresponding maxi-
mum likelihood estimates, which were numerically equivalent. Centering the income
and age variables at their grand means defined the intercept as the expected smoking
intensity score for a respondent whose parents did not smoke (Mdnβ0 = 8.78, SDβ0 = 0.11),
and the interpretation of the slopes is unaffected by the categorical regressor, which
functions as a dummy code in the focal model. For example, the β1 coefficient indicates
that respondents with parents who smoked have smoking intensity scores that are 2.65
points (cigarettes per day) higher on average, controlling for income and age (Mdnβ1 =
2.65, SDβ1 = 0.17).
The multinomial probit model extends the latent response variable framework to mul-
ticategorical nominal variables. The seminal work on this topic traces to the maximum
indicant model of Aitchison and Bennett (1970), and a mature body of research has
investigated data augmentation procedures that parallel binary and ordinal models
(Albert & Chib, 1993; Dyklevych, 2014; Imai & van Dyk, 2005; McCulloch & Rossi,
1994; McCulloch, Polson, & Rossi, 2000; Zhang, Boscardin, & Belin, 2008). The mul-
tinomial model is a popular method for handling incomplete nominal variables in the
multiple imputation framework (Carpenter et al., 2011; Carpenter & Kenward, 2013;
Enders et al., 2020; Enders, Keller, et al., 2018; Goldstein et al., 2009; Quartagno & Car-
penter, 2019), and the Bayesian estimation procedure described in this section and the
next is the backbone of that application.
Switching substantive gears, I use the chronic pain data on the companion website
to illustrate the latent formulation for nominal variables. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life,
perceived control) for a sample of N = 275 individuals with chronic pain. I focus on a
categorical pain severity rating with C = 3 groups, indexed c = 1, 2, and 3. The first cat-
egory comprises participants who reported none, very little, or little pain (20.8%), the sec-
ond group comprises individuals with moderate pain (47.5%), and the third bin includes
participants with severe or very severe pain (31.8%). Although the categories are ordered,
treating this variable as nominal makes sense, because the bins likely reflect quantita-
tive and qualitative differences.
The multinomial model specifies an underlying latent variable—called indicants
or utilities—for each categorical response. The set of empty latent variable regression
models for this example is
U1*i = μ1 + ζ1i
U2*i = μ2 + ζ2i   (6.27)
U3*i = μ3 + ζ3i
where Uc*i is latent indicant c for participant i (c = 1, 2, . . . , C), μc is the latent grand
mean of utility variable c, and ζci is a residual. Applied to the chronic pain data, the U *’s
represent a participant’s latent propensity to endorse each pain severity rating. The indi-
cants are often uncorrelated by assumption with variances fixed at some arbitrary value
to establish a metric (Carpenter & Kenward, 2013). I adopt the following distribution
for the indicants:
(U1*i, U2*i, U3*i)′ ~ N3((μ1, μ2, μ3)′, diag(0.5, 0.5, 0.5))   (6.28)
As you will see, setting the diagonal elements of the variance–covariance matrix to 0.5
gives a convenient scaling result that links to the binary and ordinal models.
Unlike models for ordered (or binary) categories, the multinomial model does not
incorporate threshold parameters. Rather, the maximum utility score or maximum indi-
cant determines a participant’s categorical response. For example, a participant with no
or little pain (i.e., c = 1) must have a U1* score that exceeds U2* and U3*, an individual
with moderate pain must have U2* as the highest utility score, and U3* is the maximum
indicant for participants with severe pain. Formally, this rule is
Yi = c  if  max(U1*i, …, UC*i) = Uc*i   (6.29)

D2*i = U2*i − U1*i   (6.30)
D3*i = U3*i − U1*i
Substantively, the difference scores reflect the underlying propensity to endorse the
second or third category relative to the reference. For example, a positive value of
D2* implies that a participant is more likely to report moderate pain than no or little
pain, whereas a negative difference score indicates that the reference category is more
likely.
Because subtracting two normally distributed variables gives another normal vari-
able, the latent difference scores also follow a multivariate normal distribution. The fol-
lowing pair of empty regression models summarize the latent difference scores:
D*i = (D2*i, D3*i)′ = (β02, β03)′ + (ε2i, ε3i)′   (6.31)
(D2*i, D3*i)′ ~ N2((β02, β03)′, ΣD*),  ΣD* = [1.0, .50; .50, 1.0]
Notice that the average difference scores, β02 and β03, are unknown parameters, and
the entire covariance matrix is fixed to establish a metric. The values in the variance–
covariance matrix ΣD* are a consequence of adopting the diagonal covariance matrix for
the utility variables in Equation 6.28. This result is convenient, because it mimics the
scaling of the binary and ordinal probit models.
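A quick Monte Carlo simulation illustrates this scaling result: differencing utilities with variances of 0.5 produces latent difference scores with unit variances and a covariance (and correlation) of .50.

```python
# Monte Carlo check of the scaling result: utilities with variances fixed at
# 0.5 yield difference scores with variances of 1.0 and a covariance of .50.
import random
random.seed(7)

n = 200_000
sd = 0.5 ** 0.5                      # each utility has variance 0.5
d2, d3 = [], []
for _ in range(n):
    u1, u2, u3 = (random.gauss(0.0, sd) for _ in range(3))
    d2.append(u2 - u1)               # D2* = U2* - U1*
    d3.append(u3 - u1)               # D3* = U3* - U1*

var2 = sum(x*x for x in d2)/n        # population means are 0 by construction
cov = sum(x*y for x, y in zip(d2, d3))/n
print(round(var2, 2), round(cov, 2))  # approximately 1.0 and 0.5
```

The shared U1* term is what induces the .50 covariance: Cov(U2* − U1*, U3* − U1*) = Var(U1*) = 0.5.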
The categorization rule in Equation 6.29—the maximum utility determines one’s
discrete response—readily translates to the difference score metric, although the rule
is slightly more complicated and depends on whether a respondent belongs to the ref-
erence group. To refresh, I assigned the lowest category as the reference. Returning to
Equation 6.30, you can see that all latent difference scores must be negative for this
group, because U1* is the highest utility score. In contrast, members of the other groups
must have at least one positive difference score, and the maximum of the difference
scores determines the categorical response. For individuals with moderate pain, D2*
must be positive, because U2* > U1* and it must exceed D3*, because U3* < U2* (i.e., U2*
– U1* > U3* – U1*). Similarly, for respondents with severe pain, D3* must be positive and
greater than D2*. The linkage between the latent and discrete responses is summarized
as follows:
Yi = 1  if  max(D2*i, …, DC*i) < 0
Yi = 2  if  D2*i = max(D2*i, …, DC*i) and D2*i > 0   (6.32)
⋮
Yi = C  if  DC*i = max(D2*i, …, DC*i) and DC*i > 0
As an aside, these rules highlight that the multinomial specification is equivalent to the
binary probit model when there are only two groups, in which case a single latent differ-
ence variable is negative for the reference group and positive for the comparison group
(i.e., latent scores below and above the threshold, respectively).
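The categorization rule translates directly into code. This small function (a sketch, not taken from the book's software) returns the reference category when every difference score is negative and otherwise returns the category attached to the largest (necessarily positive) difference score.

```python
# The categorization rule from Equation 6.32 as a small function: the
# reference group has all-negative difference scores; otherwise, the largest
# positive difference score picks the category.
def classify(diffs):
    """Map latent difference scores (D2*, ..., DC*) to a category 1..C."""
    m = max(diffs)
    if m < 0:
        return 1                       # reference group
    return diffs.index(m) + 2          # category with the maximum difference

print(classify([-0.8, -1.3]))   # both negative -> reference category (1)
print(classify([1.1, 0.4]))     # D2* largest and positive -> category 2
print(classify([0.24, 1.26]))   # D3* largest and positive -> category 3
```

With a single difference score the function reduces to the binary probit rule: a negative score maps to the reference group and a positive score to the comparison group.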
The binary and ordinal probit models defined category proportions as an area under
the normal curve between two z-score cutoffs. The nominal framework is more compli-
cated, because areas under a multivariate normal distribution define the group propor-
tions (i.e., multidimensional integrals). To illustrate, consider a hypothetical analysis
with three groups and mean difference scores equal to 0 (i.e., β02 = β03 = 0). Figure
6.11 shows the contour plot of the bivariate normal difference score distribution.

FIGURE 6.11. Contour plot of the bivariate normal difference score distribution. The vertical and horizontal dashed lines show the location of the grand means, and the distribution’s peak is located at their intersection. The reference group has latent scores in the dark-shaded region, the second group has latent scores in the lightly shaded area, and the third group has latent scores in the unshaded region of the distribution.

The graph conveys the perspective of a drone hovering over the peak of a three-dimensional
bell curve, with smaller contours denoting higher elevation. The vertical and horizontal
dashed lines show the location of the grand means, and the distribution’s peak is located
at their intersection. The group proportions correspond to different areas under the
three-dimensional surface. Following the rules in Equation 6.32, the reference group’s
probability corresponds to the area under the dark-shaded region of the surface where
both difference scores are negative. This probability is Pr(D2* < 0 & D3* < 0) = .25.
The second category’s probability, Pr(D2* > 0 & D2* > D3*) = .375, is the area under the
lightly shaded region of the surface, and the third group’s proportion corresponds to the
unshaded area, Pr(D3* > 0 & D3* > D2*) = .375. A number of algorithms are available for
computing areas under the multivariate normal distribution (Genz, 1993; Genz et al.,
2019; Mi, Miwa, & Hothorn, 2009), and I used an R function to determine these values.
Having established the key ideas behind the latent formulation for multicategorical
nominal variables, I extend the previous example to include exercise frequency, per-
ceived control over pain, and a gender dummy code (0 = female, 1 = male) as predictors
of the categorical pain ratings. The multivariate regression model and its generic coun-
terpart are shown below:
MODERATE*i = β02 + β12(EXERCISEi) + β22(CONTROLi) + β32(MALEi) + ε2i
SEVERE*i = β03 + β13(EXERCISEi) + β23(CONTROLi) + β33(MALEi) + ε3i   (6.33)
Dc*i = β0c + β1c(X1i) + β2c(X2i) + β3c(X3i) + εci
(ε2i, ε3i)′ ~ N2(0, ΣD*)
As before, MODERATE * and SEVERE * are latent difference scores contrasting a continu-
ous proclivity for moderate and severe pain ratings relative to a no or little pain rating.
Notice that the regressors exert a unique influence on each latent difference score, and
the residual covariance matrix is now fixed at deterministic values to scale the latent
response variables. The missing data rates for the pain severity and exercise frequency
variables are 7.3 and 1.8%, respectively, and the remaining predictors are complete.
f(D*, β | Y, X) ∝ ∏i exp(−(1/2)(D*i − E(D*i | Xi))′ ΣD*⁻¹ (D*i − E(D*i | Xi)))   (6.34)
× [I(max(D*i) < 0)·I(Yi = 1) + I(max(D*i) = Dc*i)·I(Dc*i > 0)·I(Yi > 1)]
The term to the right of the product indicator is the kernel of the multivariate normal
distribution (i.e., the likelihood sans unnecessary scaling constants), and the collection
of true–false indicator functions enforce the categorization rule from Equation 6.32 and
ensure that an observation contributes to the likelihood only if its latent scores possess
the correct magnitude and rank order. As always, I adopt a flat prior for the coefficients
(i.e., f(β) ∝ 1).
MCMC estimation follows a predictable two-step recipe that mimics the one from
Section 6.3: Estimate the regression coefficients given the current latent data, then
update the latent scores given the discrete responses and new coefficients. For complete-
ness, I give the full conditional distributions below, and readers who are not interested
in these technical details can skip to the analysis example without losing important
information.
First, the MCMC algorithm estimates regression coefficients by drawing a vector of
random numbers from the multivariate normal conditional distribution:
f(β | D*, X) ∝ N(K+1)(C−1)(vec(β̂), Σβ̂)   (6.35)
β̂ = (X′X)⁻¹X′D*
Σβ̂ = ΣD* ⊗ (X′X)⁻¹
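The quantities in the conditional distribution above are straightforward to form with matrix software. The sketch below uses a fabricated design matrix and stand-in latent difference scores to show the shapes involved: a multivariate least squares solution for the coefficients and a Kronecker-product covariance matrix for the stacked coefficient vector.

```python
# Forming the conditional distribution's mean and covariance: beta-hat is a
# multivariate least squares solution, and the Kronecker product builds the
# covariance of vec(beta-hat). Design matrix and scores are fabricated.
import numpy as np
rng = np.random.default_rng(0)

n, k = 200, 3                       # cases; intercept + two predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
S_D = np.array([[1.0, 0.5],         # fixed difference-score covariance
                [0.5, 1.0]])
D = rng.normal(size=(n, 2))         # stand-in latent difference scores

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ D        # (K+1) x (C-1) coefficient matrix
S_beta = np.kron(S_D, XtX_inv)      # covariance of vec(beta-hat)

# One posterior draw of the stacked (column-vectorized) coefficient vector:
draw = rng.multivariate_normal(beta_hat.T.ravel(), S_beta)
print(beta_hat.shape, S_beta.shape, draw.shape)
```

Each diagonal block of the Kronecker product is a scaled copy of (X′X)⁻¹, one block per latent difference equation.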
The algorithm can’t specify the magnitude and rank ordering of the latent difference scores
if the discrete response is missing, so it instead draws pairs of latent difference scores
with no restrictions. The relative magnitude and configuration of the latent imputations
again induces a corresponding discrete impute. For example, a participant with imputed
values of D2*(mis) = 0.24 and D3*(mis) = 1.26 would be classified in the severe pain group
(i.e., Y(mis) = 3), because D3*(mis) is positive and larger than D2*(mis).
Analysis Example
Continuing with the chronic pain data example, I used Bayesian estimation to fit the
multinomial regression model in Equation 6.33. Applying established ideas, I used a
factored regression specification to assign a distribution to the incomplete predictor. A
fully sequential specification uses the following factorization:
f(PAIN* | EXERCISE, CONTROL, MALE) ×   (6.36)
f(EXERCISE | CONTROL, MALE) × f(CONTROL | MALE) × f(MALE*)
The exercise distribution translates into the following linear regression model:
The perceived control over pain and gender dummy codes (the rightmost pair of terms)
are complete and do not require a distribution. The potential scale reduction factor
(Gelman & Rubin, 1992) diagnostic indicated that the MCMC algorithm converged in
fewer than 500 iterations, so I used 11,000 total iterations with a conservative burn-in
period of 1,000 iterations. Analysis scripts are available on the companion website.
Table 6.4 gives the posterior summaries for the focal analysis model. In the interest
of space, I omit the supporting regressor model parameters, because they are not the
substantive focus. To facilitate graphing, I centered the predictors such that the latent
difference score means are marginal or overall effects. Figure 6.12 shows the contour
plot of the bivariate normal distribution with the estimated latent differences scores
from one iteration overlaid on the surface. The graph conveys the perspective of a drone
hovering over the peak of a three-dimensional bell curve, with smaller contours denot-
ing higher elevation. The vertical and horizontal dashed lines show the location of the
grand means, and the distribution’s peak is located at their intersection. Latent scores
for the reference group are located in the dark-shaded region of the surface where both
difference scores are negative. The group with moderate pain severity has latent differ-
ence scores in the lightly shaded region of the graph (i.e., the area where D2* > 0 & D2* >
D3*), and the high-severity group’s latent scores are in the unshaded area of the surface
(i.e., the region where D3* > 0 & D3* > D2*).
The posterior medians of the D2* and D3* grand means were Mdnβ02 = 0.50 (SD =
0.09) and Mdnβ03 = 0.17 (SDβ03 = 0.11), respectively. Because the latent response variables
are scaled as z-scores, the positive mean values indicate that moderate and severe pain
ratings are more likely than mild pain ratings (e.g., the latent propensity for indicating a
moderate pain rating is approximately 0.50 z-score units higher than that of the reference
group). As mentioned previously, the model-predicted group proportions correspond to
Bayesian Estimation for Categorical Variables 251
2
0
–2
–4
–4 –2 0 2 4
Latent Difference Score D 2* (Moderate vs. Little)
FIGURE 6.12. Contour plot of the bivariate normal distribution of difference scores with the
estimated latent variable scores from one iteration overlaid on the surface. The vertical and hori-
zontal dashed lines show the location of the grand means, and the distribution’s peak is located
at their intersection. The shaded regions partition the distribution into segments that contain the
latent scores for the three pain severity groups.
252 Applied Missing Data Analysis
different areas under a bivariate normal distribution with these means. Following the
rules in Equation 6.32, the reference group’s probability corresponds to the area under
the dark-shaded region of the surface where both difference scores are negative, Pr(D2* <
0 & D3* < 0) = .21. The second category’s probability, Pr(D2* > 0 & D2* > D3*) = .51, is the
area under the lightly shaded region of the surface, and the third group’s proportion cor-
responds to the unshaded area, Pr(D3* > 0 & D3* > D2*) = .29. A number of algorithms are
available for computing areas under the multivariate normal distribution (Genz, 1993;
Genz et al., 2019; Mi et al., 2009), and I used an R function to determine these values.
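A Monte Carlo simulation reproduces these proportions without numerical integration. Drawing difference scores with means of 0.50 and 0.17, unit variances, and the fixed .50 covariance, then applying the categorization rule from Equation 6.32, recovers proportions close to .21, .51, and .29.

```python
# Monte Carlo check of the model-implied group proportions, using the latent
# means of 0.50 and 0.17 and the fixed covariance matrix (variances 1,
# covariance .50).
import random
random.seed(3)

n = 400_000
counts = [0, 0, 0]
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    d2 = 0.50 + z1                             # D2* ~ N(0.50, 1)
    d3 = 0.17 + 0.5*z1 + (0.75 ** 0.5)*z2      # D3* ~ N(0.17, 1), cov = .50
    if d2 < 0 and d3 < 0:
        counts[0] += 1                         # reference: no/little pain
    elif d2 > d3:
        counts[1] += 1                         # moderate: D2* is the max, > 0
    else:
        counts[2] += 1                         # severe: D3* is the max, > 0
props = [c / n for c in counts]
print([round(p, 2) for p in props])            # approximately .21, .51, .29
```

The Cholesky-style construction of d3 (0.5·z1 plus an independent component with variance 0.75) is what induces the fixed .50 covariance between the two difference scores.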
Each slope coefficient reflects the expected z-score change in the latent difference
variable for a one-unit increase in the predictor, controlling for other regressors. The
largest effects were associated with the comparison of severe pain versus little or no
pain. For example, male respondents had an average latent difference score that was
0.48 z-score units higher than that of females (Mdnβ33 = 0.40, SD = 0.21), meaning that
men were more likely to report severe pain than women. The negative slope coefficients
for exercise frequency and perceived control over pain mean that an increase in either
variable is associated with a lower probability of a severe pain rating.
The procedure for imputing multicategorical predictor variables parallels that for binary
and ordinal covariates (and continuous explanatory variables, for that matter). Return-
ing to the smoking intensity analysis from Section 6.5, I modify the linear regression
model by replacing the respondent’s household income with a three-category education
variable (1 = less than high school, 2 = high school or some college, and 3 = bachelor’s degree
or higher). The model and its generic counterpart are shown below:
INTENSITYi = β0 + β1(PARSMOKEi) + β2(HSi) + β3(BACHi) + β4(AGEi) + εi
Yi = β0 + β1X1i + β2D2i + β3D3i + β4X4i + εi   (6.38)
εi ~ N1(0, σ²ε)
where HS and BACH (or D2 and D3) are dummy codes contrasting the high school and
bachelor’s degree groups with the less-than-high-school comparison group. The smok-
ing intensity variable has 21.2% missing data, 3.6% of the parental smoking indicator
scores are missing, and 5.4% of the education values are unknown.
These generic expressions translate into a binary probit model for the parental smok-
ing indicator, a multinomial regression for the educational attainment categories, and a
linear model for age.
Paralleling the dummy variable coding scheme, HS * and BACH * are latent difference
scores contrasting the two higher categories with the less-than-high-school comparison
group.
The partially factored model specification instead uses a multivariate normal dis-
tribution for the predictors.
Notice that the three latent response variables have their variances fixed at 1 to establish
a scale, and the latent difference score correlation is fixed to 0.5 like before.
As mentioned previously, modeling the distribution’s covariance matrix is not
straightforward, because it contains a mixture of fixed parameters, variances and
covariances, and correlations between pairs of latent response variables. One option
is to use a Metropolis–Hastings algorithm to update covariance matrix elements one
at a time or in blocks (Asparouhov & Muthén, 2010a; Browne, 2006; Carpenter &
Kenward, 2013), and another is to parameterize the multivariate normal distribution
as a set of round-robin regression equations (Bartlett et al., 2015; Enders et al., 2020;
Goldstein et al., 2014).
X1*i = μ1 + γ11(D2*i − μ2) + γ21(D3*i − μ3) + γ31(X4i − μ4) + r1i   (6.43)
(D2*i, D3*i)′ = (μ2, μ3)′ + (γ12, γ13)′(X1*i − μ1) + (γ22, γ23)′(X4i − μ4) + (r2i, r3i)′
X4i = μ4 + γ14(X1*i − μ1) + γ24(D2*i − μ2) + γ34(D3*i − μ3) + r4i
Following Sections 6.3 and 6.7, residual variances and covariances for the latent response
variables are fixed quantities, and the binary probit model additionally requires a single,
fixed threshold.
f(D2, D3 | Y, X1, X4) ∝ f(Y | X1, D2, D3, X4) × f(D2*, D3* | X1*, X4) =   (6.44)
N1(E(Yi | X1i, D2i, D3i, X4i), σ²ε) × N2(E(D2*i, D3*i | X1*i, X4i), ΣD*)
Dropping unnecessary scaling terms and substituting the appropriate kernels into the
right side of the expression gives the following:
f(Yi | X1i, D2i, D3i, X4i) × f(D2*i, D3*i | X1*i, X4i) ∝
exp(−(1/2)(Yi − (β0 + β1X1i + β2D2i + β3D3i + β4X4i))² / σ²ε) ×   (6.45)
exp(−(1/2)(D*i − E(D*i | X1*i, X4i))′ ΣD*⁻¹ (D*i − E(D*i | X1*i, X4i)))
where E(Di* | X1i, X4i) is the vector of predicted latent variable difference scores from the
probit model in Equation 6.43:
E(D*i | X1*i, X4i) = (μ2, μ3)′ + (γ12, γ13)′(X1*i − μ1) + (γ22, γ23)′(X4i − μ4)   (6.46)
The missing values appear as dummy codes in the focal regression and latent dif-
ference scores in the probit model. The process of sampling imputations is like that for
binary and ordinal predictors. For each missing observation, the Metropolis algorithm
performs four steps. First, it samples a candidate pair of latent imputations from a mul-
tivariate normal proposal distribution centered at a person’s current latent response
scores. As before, the proposal distribution’s covariance matrix is fixed at or adaptively
tuned to accept candidate imputations at an optimal rate of 25–50% (Gelman et al., 2014;
Johnson & Albert, 1999; Lynch, 2007). Second, the algorithm uses the classification
rule from Equation 6.32 to convert the candidate imputes to a discrete value (or equiva-
lently, a pair of dummy codes). For example, a participant with trial values of D2*(new) =
–0.15 and D3*(new) = –0.86 would be classified as a member of the comparison group (i.e.,
less than high school; D2(new) = 0 and D3(new) = 0), because both latent difference scores
are negative. Next, the importance ratio is a fraction that features the target function
from Equation 6.45 in both the numerator and the denominator. The algorithm substi-
tutes the matched pair of candidate imputations (along with the necessary parameter
and data values) into the numerator, and it substitutes the current imputations into the
denominator. Visually, the resulting ratio reflects the relative height of the target dis-
tribution when evaluated at the candidate and current imputes. Finally, the algorithm
generates a random number from a binomial distribution with success rate equal to the
importance ratio. If the random draw is a “success,” the candidate imputations become
new data for the next iteration; otherwise, a participant’s current data carry forward for
another cycle.
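For completeness, the target function's log kernel from Equation 6.45 can be sketched in code. The parameter values below are hypothetical placeholders (not the Table 6.5 estimates); the function converts a candidate pair of latent difference scores to dummy codes, evaluates the two kernels, and the importance ratio compares two such evaluations.

```python
# Log kernel of the two-part target function: the focal-model normal kernel
# times the bivariate normal kernel for the latent difference scores.
# Parameter values are hypothetical placeholders.
import numpy as np

beta = np.array([9.5, 2.6, -0.8, -1.3, 0.05])  # b0, b1, b2 (HS), b3 (BACH), b4
sigma2 = 25.0
S_D_inv = np.linalg.inv(np.array([[1.0, 0.5], [0.5, 1.0]]))

def log_target(y, x1, d_latent, x4, mu_d):
    """d_latent: candidate (D2*, D3*); mu_d: the predicted latent means."""
    d2 = 1.0 if (d_latent[0] > 0 and d_latent[0] > d_latent[1]) else 0.0
    d3 = 1.0 if (d_latent[1] > 0 and d_latent[1] > d_latent[0]) else 0.0
    mu_y = beta @ np.array([1.0, x1, d2, d3, x4])
    resid = np.asarray(d_latent) - np.asarray(mu_d)
    return -0.5*(y - mu_y)**2/sigma2 - 0.5*(resid @ S_D_inv @ resid)

# Importance ratio comparing a candidate pair to the current pair:
num = log_target(12.0, 1.0, (-0.15, -0.86), 0.3, (0.1, -0.2))
den = log_target(12.0, 1.0, (0.40, 0.10), 0.3, (0.1, -0.2))
ratio = np.exp(num - den)
print(ratio)
```

The candidate pair (−0.15, −0.86) maps to the comparison group's dummy codes (0, 0) because both latent difference scores are negative, mirroring the example in the text.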
Analysis Example
Continuing with the smoking data example, I used Bayesian estimation to fit the regres-
sion model in Equation 6.38. To facilitate interpretation of the intercept, I centered
respondent age at the grand mean, and I did not estimate a supporting model for this
variable, because it is complete and does not require a distribution. There is no reason
to prefer a sequential or partially factored specification in this analysis, so I used the lat-
ter. After inspecting the potential scale reduction factors (Gelman & Rubin, 1992) from
a preliminary diagnostic run, I specified an MCMC process with 11,000 total iterations
and a burn-in period of 1,000 iterations. Analysis scripts are available on the companion
website.
Table 6.5 gives the posterior summaries of the focal model parameters. In the inter-
est of space, I omit the supporting regressor models, because they are not the substan-
tive focus. Because of centering, the intercept reflects the expected smoking intensity
score (cigarettes smoked per day) for a respondent whose parents did not smoke and
attained less than a high school education (Mdnβ0 = 9.49, SDβ0 = 0.33). The interpretation
of the dummy code slope coefficients is the same as any linear regression model. For
example, the β3 coefficient indicates that respondents who received a bachelor’s (or more
advanced) degree smoked 1.30 fewer cigarettes per day, on average, than the comparison
group (Mdnβ3 = –1.30, SD = 0.36), controlling for other predictors.
My nearly exclusive emphasis on probit regression in this chapter largely reflects the
state of the literature, where much of the methodological work on Bayesian estimation
for categorical variables has focused on normally distributed latent response variables.
The appeal of this modeling approach is that MCMC can apply standard estimation steps
for linear regression models. Recent work has extended Albert and Chib’s (1993) data
augmentation strategy to logistic regression (Asparouhov & Muthén, 2021b; Frühwirth-
Schnatter & Früwirth, 2010; Holmes & Held, 2006; O’Brien & Dunson, 2004; Polson
et al., 2013), and these routines are beginning to appear in statistical software packages
(Keller & Enders, 2021; Muthén & Muthén, 1998–2017). This section summarizes this
approach and provides a data analysis example.
Like the probit model, logistic regression can be viewed through a latent variable
lens where binary scores originate from an underlying continuous dimension (Agresti,
2012; Johnson & Albert, 1999). For example, the simple logistic regression model is

ln[Pr(Yi = 1) / (1 − Pr(Yi = 1))] = β0 + β1(Xi)  (6.47)

εi ~ Logistic(0, 1)
The key difference between logistic and probit regression is the distribution of the
residual term—the probit model defines this as a standard normal variable, whereas
logistic regression defines the residual as a standard logistic variable (the function’s
inputs, 0 and 1, are the location and scale parameters).
The logistic error distribution is challenging, because it does not lead to a simple
expression for the conditional distribution of the regression coefficients. Polson et al.
(2013) described an exact approach that weights each person’s data to rescale the logis-
tic regression as a probit-like model with normally distributed errors. Asparouhov and
Muthén (2021b) extended this procedure to a broad range of structural equation models
with normally distributed predictors. Integrating Polson’s method with factored regres-
sion models is somewhat more flexible, because it accommodates categorical regressors
and interactive or nonlinear effects with missing data (Keller & Enders, 2021).
The MCMC algorithm for Polson’s procedure cycles between two steps: Estimate
person-specific weights that determine the latent response variable scores, then esti-
mate the regression coefficients given the current latent scores and weighted data. To
begin, the algorithm samples person-specific weights from a so-called Pólya-gamma
distribution (a variable with a value determined by an infinite sum of gamma random
variables).
Wi ~ PG (1, X iβ ) (6.49)
The function’s first argument represents the number of binomial trials (e.g., the 1 indi-
cates that each person has a single score), and the second argument is a predicted value
from the logistic regression equation. Visually, the Pólya-gamma function looks like
the right-skewed inverse gamma distribution from Chapter 4 (see Figure 4.6), with the
predicted value determining spread and peakedness (as Xiβ gets larger, the spread and
peakedness decrease and increase, respectively, and the distribution looks more like a
point mass). After sampling the weights, MCMC deterministically computes the latent
response variable as follows:

Yi* = (Yi − 0.5) / Wi  (6.50)
The latent response scores do not have the clear interpretation that they do in the probit
framework but are simply a mathematical device that simplifies estimation of the β’s.
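The Pólya-gamma draw can be approximated directly from its sum-of-gammas definition. The sketch below truncates the infinite sum at a finite number of terms (a simplification; exact samplers follow Polson et al., 2013) and illustrates the point-mass behavior described above:

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_polya_gamma(c, rng, terms=200):
    """Approximate a PG(1, c) draw from its infinite-sum representation
    (a weighted sum of Gamma(1, 1) variables; Polson et al., 2013),
    truncated here at a finite number of terms."""
    k = np.arange(1, terms + 1)
    g = rng.gamma(shape=1.0, scale=1.0, size=terms)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

# Larger predicted values Xb shrink the weights toward a point mass.
w_small_xb = np.mean([draw_polya_gamma(0.0, rng) for _ in range(2000)])
w_large_xb = np.mean([draw_polya_gamma(6.0, rng) for _ in range(2000)])
# A person's latent response score is then (Y_i - 0.5) / W_i
# (Polson et al., 2013).
```

With a predicted value of 0 the draws average near 0.25, and the average shrinks sharply as the predicted value grows, consistent with the concentration described in the text.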
The key feature of Polson et al.’s (2013) approach is that the weights rescale the
logistic regression as a probit-like model with normally distributed but heteroscedastic
errors. This reparameterization leads to a multivariate normal posterior distribution for
the regression coefficients
f(β | W, data) ∝ f(β) × exp(−(1/2)(Y* − Xβ)′W(Y* − Xβ))  (6.51)

f(β | W, Y, X) ∝ NK+1(β̂, Sβ̂)  (6.52)

β̂ = (X′WX)−1 X′WY*
Sβ̂ = (X′WX)−1
where K is the number of predictors, and NK+1 denotes a normal distribution with K +
1 dimensions or variables. This expression is just a weighted version of Equation 6.6.
Importantly, the resulting parameter estimates are logistic regression coefficients, and
the latent scores and weights are essentially a rescaling trick that simplifies estimation.
Pr(Yi = 1 | β, data) = exp(Xiβ) / (1 + exp(Xiβ)) = πi  (6.53)

Yi(mis) ~ Binomial(1, πi)
The 1 in the function’s first argument indicates that everyone has a single score, and πi is
the predicted probability that Y = 1. Conceptually, drawing binary random numbers is
akin to tossing a biased coin where the probability of a head (e.g., Y(mis) = 1) equals πi and
the probability of a tail (e.g., Y(mis) = 0) equals 1 – πi. Following established procedures,
the Metropolis algorithm imputes missing predictor variables by pairing the binomial
distribution with one or more supporting models.
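Putting the pieces together, one coefficient-sampling and imputation step might look like the following sketch. This is not the book's software: the weights here are stand-in gamma draws rather than true Pólya-gamma values, and a flat prior on the coefficients is assumed:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: an intercept and one predictor, with a binary outcome.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.25, 1.0])
Y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

# Stand-in weights (a real sampler would draw these from PG(1, Xb)).
w = rng.gamma(shape=2.0, scale=0.125, size=n)
W = np.diag(w)
Ystar = (Y - 0.5) / w            # latent response scores

# Coefficient draw: a (K+1)-variate normal with weighted least squares
# mean and covariance, assuming a flat prior on the coefficients.
S = np.linalg.inv(X.T @ W @ X)
beta_hat = S @ (X.T @ W @ Ystar)
beta_draw = rng.multivariate_normal(beta_hat, S)

# Imputing a missing outcome: a biased-coin flip with success rate pi.
x_mis = np.array([1.0, 0.5])
pi = np.exp(x_mis @ beta_draw) / (1 + np.exp(x_mis @ beta_draw))
y_imp = rng.binomial(1, pi)
```

The weighted least squares expressions mirror the posterior mean and covariance given earlier, and the final two lines are the biased-coin imputation step for an incomplete outcome.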
Analysis Example
Returning to the employee turnover analysis from Section 6.3, I use Bayesian estima-
tion to fit a binary logistic regression model that features leader–member exchange,
employee empowerment, and a gender dummy code (0 = female, 1 = male) as predictors
of turnover intention (TURNOVER = 0 if an employee has no plan to leave her or his
position, and TURNOVER = 1 if the employee has intentions of quitting). The logistic
regression model is as follows:
ln[Pr(TURNOVERi = 1) / (1 − Pr(TURNOVERi = 1))] = β0 + β1(LMXi) + β2(EMPOWERi) + β3(MALEi)  (6.54)
The missing data rates were approximately 5.1% for the turnover intention indicator,
4.1% for the employee–supervisor relationship quality scale, and 16.2% for the empowerment scale.
A fully sequential specification uses the following factorization (dropping the distribution of gender, which is complete):

f(TURNOVER | LMX, EMPOWER, MALE) × f(EMPOWER | LMX, MALE) × f(LMX | MALE)

whereas the partially factored model instead assigns a conditional bivariate distribution to the incomplete predictors, as follows:

f(TURNOVER | LMX, EMPOWER, MALE) × f(LMX, EMPOWER | MALE)
There is no compelling reason to prefer one specification to the other, and both param-
eterizations produced the same results. The potential scale reduction factors (Gelman
& Rubin, 1992) from a preliminary diagnostic run indicated that the MCMC algorithm
converged in fewer than 200 iterations, so I used 11,000 total iterations with a conserva-
tive burn-in period of 1,000 iterations. Analysis scripts are available on the companion
website.
Table 6.6 gives posterior summaries of the regression model parameters. In the
interest of space, I omit the covariate model parameters, because they are not the sub-
stantive focus. As a comparison, the bottom panel of Table 3.11 gives the maximum like-
lihood estimates for the same model. The slope coefficients reflect the expected change
in the log odds of quitting for a one-unit increase in the predictor, holding other covari-
ates constant. For example, the leader–member exchange slope indicates that a one-
unit increase in relationship quality decreases the log odds of quitting by 0.12 (Mdnβ1 =
–0.12, SD = 0.04), controlling for employee empowerment and gender. Consistent with
complete-data analyses, exponentiating each slope gives an odds ratio that reflects the
multiplicative change in the odds for a one-unit increase in a predictor. For example, a
one-point increase on the leader–member exchange scale multiplies the odds of quitting
by 0.89.
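In code, the conversion from a log-odds slope to an odds ratio is a single exponentiation, shown here with the posterior median reported above:

```python
import math

lmx_slope = -0.12                 # posterior median log-odds slope
odds_ratio = math.exp(lmx_slope)  # ~0.89: odds multiply by this factor
```
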
Note. LCL, lower credible limit; UCL, upper credible limit; OR, odds ratio.
This chapter has described Bayesian estimation for binary, ordinal, and multicategori-
cal nominal variables. The chapter focused primarily on a probit regression framework
that envisions discrete scores originating from one or more normally distributed latent
response variables. The data augmentation approach pioneered by Albert and Chib
(1993) supplements discrete scores with estimates of these latent variables. The appeal
of their approach is that, given a full sample of latent scores, MCMC can apply stan-
dard estimation steps for linear regression models. As such, the main new wrinkle for
this chapter was learning how to impute the latent response variables, which are 100%
missing. Discrete imputes are simple functions of the filled-in latent data. Recent exten-
sions of this data augmentation strategy accommodate logistic regression and negative
binomial regression for count variables (Asparouhov & Muthén, 2021b; Polson et al.,
2013). The last section of the chapter has described logistic regression, and Chapter 10
illustrates imputation for an incomplete count variable.
Looking back on the last two chapters, Bayesian analyses are like maximum likeli-
hood in the sense that model parameters are the focus; missing data handling happens
behind the scenes, and the goal is to construct temporary imputations that service a
particular analysis model. The focus shifts in Chapter 7, where the Bayesian machinery
is a mathematical device that creates suitable imputations for reanalysis in the frequen-
tist framework. As you will see, multiple imputation co-opts the MCMC algorithms from
the last two chapters, so much of the new content in Chapter 7 focuses on saving and
analyzing multiply imputed data sets and summarizing the results. Finally, I recom-
mend the following articles for readers who want additional details on topics from this
chapter:
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data.
Journal of the American Statistical Association, 88, 669–679.
Asparouhov, T., & Muthén, B. (2021). Expanding the Bayesian structural equation, multilevel
and mixture models to logit, negative-binomial and nominal variables. Structural Equation
Modeling: A Multidisciplinary Journal, 28, 622–637.
Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-link
generalized linear models. Statistics and Computing, 6, 101–111.
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer.
7
Multiple Imputation
Reflecting on the procedures we have covered so far, the primary goal of a maximum
likelihood or Bayesian analysis is to fit a model to the observed data and use the result-
ing estimates to inform one’s substantive research questions. When confronted with
missing values, maximum likelihood uses the normal curve to deduce the missing parts
of the data as it iterates to a solution, and Bayesian estimation imputes the missing
values en route to getting the parameters. In both cases, missing data handling hap-
pens behind the scenes, and imputation—implicit or explicit—is just a means to a more
important end, which is to learn something from the estimates. In contrast, multiple
imputation puts the filled-in data front and center, and the goal is to create suitable
imputations for later analysis.
A typical application of multiple imputation comprises three major steps. The first
step is to specify an imputation model and deploy an MCMC algorithm that creates sev-
eral copies of the data, each containing different estimates of the missing values. As you
will see, this step co-opts the MCMC algorithms from Chapters 5 and 6, usually algo-
rithms for regression models and covariance matrices. The next step is to perform one or
more analyses on the M complete data sets and get point estimates and standard errors
from each set of imputations. The multiply imputed data sets are compatible with the
frequentist statistical paradigm (Rubin, 1987), so this stage leverages a familiar inferen-
tial framework. The final step uses “Rubin’s rules” (Little & Rubin, 2020; Rubin, 1987)
to combine estimates and standard errors into a single package of results.
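The pooling step is simple enough to sketch directly. This is a generic implementation of Rubin's point estimate and standard error formulas (degrees-of-freedom corrections omitted):

```python
import numpy as np

def rubins_rules(estimates, std_errors):
    """Pool M point estimates and standard errors (Rubin, 1987).

    Qbar = average estimate; W = within-imputation variance (average
    squared SE); B = between-imputation variance; total variance
    T = W + (1 + 1/M) * B.
    """
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    M = len(q)
    qbar = q.mean()
    W = np.mean(se ** 2)
    B = q.var(ddof=1)
    T = W + (1 + 1 / M) * B
    return qbar, np.sqrt(T)

# Three hypothetical imputed-data estimates of the same coefficient.
est, pooled_se = rubins_rules([1.2, 1.0, 1.1], [0.30, 0.28, 0.32])
```

The pooled standard error exceeds the average per-imputation standard error whenever the estimates vary across data sets, reflecting the extra uncertainty due to missing data.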
Multiple imputation has a long history that began in 1977, when Donald Rubin pro-
posed the procedure to the Social Security Administration and Census Bureau as a solu-
tion for missing survey data. Scheuren (2005) and van Buuren (2012; pp. 25–28) give
interesting historical accounts of the procedure’s development and subsequent growth,
and Rubin’s 1977 report is available in the American Statistician (Rubin, 2004). Rubin
published his seminal multiple imputation text (Rubin, 1987) a decade after his original
report, and a number of excellent imputation books have followed in the years since
(Carpenter & Kenward, 2013; Schafer, 1997; van Buuren, 2012). Not surprisingly, the
number of published applications of multiple imputation has exploded in recent years,
and software options abound.
In the 40 years or so since its inception, multiple imputation has developed into a
big tent that includes a diverse collection of procedures. I restrict the focus to strategies
that piggyback on the Bayesian estimation routines for regression models and multivari-
ate normal data (including latent data for categorical variables). This focus includes the
two predominant imputation frameworks, joint model imputation (Schafer, 1997, 1999;
Schafer & Olsen, 1998) and fully conditional specification (Raghunathan, Lepkowski,
Van Hoewyk, & Solenberger, 2001; van Buuren, 2012; van Buuren, Brand, Groothuis-
Oudshoorn, & Rubin, 2006), and it also includes newer model-based imputation strate-
gies for analyses with interactive or curvilinear effects (Bartlett et al., 2015; Enders et al.,
2020; Goldstein et al., 2014; Zhang & Wang, 2017). Because the initial imputation stage
recycles familiar MCMC algorithms, much of the new content in this chapter focuses on
analyzing multiply imputed data sets and summarizing the results.
Multiple imputation involves two rounds of estimation and modeling. The first step
fits an imputation model (often a regression model or a multivariate regression model)
to the observed data and uses the resulting estimates to create imputed data sets. As
mentioned, this step co-opts MCMC algorithms from earlier chapters. The second step
fits the focal analysis model (or models) to each complete data set, after which the esti-
mates and standard errors are aggregated into a single package of results. Importantly,
the imputation and analysis steps need not apply the same statistical models; in some
applications, the imputation and analysis models could be identical, and in others they
might be unrecognizably different. As an organizational tool, I classify the procedures
in this chapter into two buckets according to the degree of similarity between the
imputation and analysis models: An agnostic imputation strategy deploys a model that
differs from the substantive analysis, and a model-based imputation procedure invokes
the same focal model as the secondary analysis (perhaps with additional auxiliary
variables).
Note that my definition of agnostic does not imply that the imputation process is
somehow blind to or ignores how the data will be analyzed. Quite the contrary, the
imputation model should be flexible enough to preserve important features of the sec-
ondary analyses, and it should not impose restrictions that conflict with the focal anal-
yses. Rather, my definition conveys that agnostic imputation applies a nonrestrictive
model that is not dedicated to one specific analysis, and the resulting data sets could be
suitable for several purposes. In contrast, a model-based approach tailors imputations
around one and only one analysis model.
Van Buuren (2012, p. 40) offers a similar taxonomy that classifies imputation
schemes according to their scope—broad, intermediate, or narrow. A broad scope pro-
cedure creates imputes that support all analyses that could be performed on the data.
Public-use data sets are a prototypical example, where the same imputations could serve
many different researchers with diverse substantive goals. Multiple imputation applica-
tions in the social and behavioral sciences are usually intermediate or narrow in scope.
An intermediate scope is one in which a researcher creates a set of imputations for a
family of analyses contained within a single project. A research manuscript is a good
example, where it is often possible to create one set of imputations for all descriptive
summaries and inferential procedures contained within a paper. Finally, a narrow scope
is one in which each analysis for a project requires different imputations.
My categories overlap with van Buuren’s to some degree. For example, model-based
imputation is necessarily narrow in scope, because imputations are tailored to one anal-
ysis, and agnostic imputation schemes are often (but not necessarily) intermediate in
scope. By classifying imputation problems according to the match or mismatch between
the imputation and analysis models, my goal is to emphasize that an analysis model’s
composition—in particular, whether it includes nonlinear effects such as interactions,
polynomial terms, or random effects—determines the type of imputation strategy that
works best. Model-based imputation is usually ideal for these types of nonlinearities,
whereas agnostic imputation is well suited for analyses that do not include these special
features. It is perfectly acceptable to use both procedures within the same project or
paper.
Pretest scores are complete, but 16.8% of the posttest scores (and thus the difference
scores) are missing. Casting the analysis as a regression model is useful when using
general-use statistical software to analyze multiply imputed data sets, because most pro-
grams offer this functionality. The analysis can be expressed as an empty regression
model for the difference scores:
CHANGEi = β0 + εi  (7.2)

CHANGEi ~ N1(β0, σ2ε)
where β0 is the average change score, N1 denotes the univariate normal distribution,
and σε2 captures variation among the difference scores. A familiar paired-samples t-test
evaluates the null hypothesis that β0 equals 0.
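The equivalence is easy to verify numerically. A small simulated example (hypothetical data, not the math achievement sample) shows that the intercept-only regression reproduces the paired-t ingredients:

```python
import numpy as np

rng = np.random.default_rng(11)
pre = rng.normal(50, 10, size=200)
post = pre + rng.normal(2, 5, size=200)
change = post - pre

# The OLS intercept of the empty model CHANGE_i = beta0 + eps_i is the
# mean change, and beta0 / SE(beta0) is the paired-samples t statistic.
beta0 = change.mean()
se_beta0 = change.std(ddof=1) / np.sqrt(len(change))
t_stat = beta0 / se_beta0
```
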
One compelling reason to use multiple imputation is that it readily accommodates
an inclusive analysis strategy that leverages information from a set of auxiliary variables
(Collins et al., 2001). To this end, the imputation model also includes standardized
reading scores and a binary indicator that measures whether a student is eligible for free
or reduced-priced lunch (0 = no assistance, 1 = eligible for free or reduced-price lunch).
Although neither variable predicts whether math posttest scores are missing, including
these variables could improve power, because they increment the explained variance in
the posttest scores by about 16% above and beyond the pretest measure (Collins et al.
refer to these as “Type B” auxiliary variables). Note that the auxiliary variables are also
incomplete; 10.4% of the standardized reading scores are missing (e.g., because a student
is new to the district), and 5.2% of the lunch assistance indicator codes are missing.
Incomplete auxiliary variables are still useful, but their utility diminishes if they are
concurrently missing with the analysis variables (Enders, 2008).
The joint imputation model includes pretest and posttest math scores, standardized
reading test scores, and the lunch assistance indicator. To simplify the ensuing notation,
I refer to these variables as Y1 to Y4. Applying ideas from Chapter 6, the dummy code
appears as a latent response variable, which I denote as Y4*. The joint imputation model
invokes an empty multivariate regression model for the continuous variables and latent
scores, and a mean vector and variance–covariance matrix are the imputation model
parameters.
[Y1i, Y2i, Y3i, Y4i*]′ ~ N4(μ, Σ)  (7.3)
FIGURE 7.1. Path diagram of a joint imputation model with four variables. The oval and rect-
angle differentiate the latent variable and its binary indicator, respectively, and the broken arrow
connecting the two is the link function that maps the unobserved continuum to the discrete
responses.
As a reminder, N4 denotes a four-dimensional normal distribution, and the first and sec-
ond terms inside the normal distribution function are the mean vector and covariance
matrix, μ and Σ. The fixed value on the diagonal of the covariance matrix establishes a
metric for the latent normal variable, and the model also incorporates fixed threshold
parameters (some variants of the joint model instead estimate the threshold and fix the
latent mean to 0).
Figure 7.1 shows a path diagram of the imputation model. Following diagramming
conventions from Edwards et al. (2012), I use an oval and rectangle to differentiate the
latent variable and its binary indicator, respectively, and the broken arrow connecting the
two is the link function that maps the unobserved continuum to the discrete responses
(e.g., the broken arrow reflects the idea that scores above and below the threshold equal
1 and 0, respectively). The residual terms pointing to the rectangles indicate that all
variables have a distribution, and the curved arrows illustrate that the variables link
via covariances. To reduce visual clutter, I omit triangle symbols that researchers some-
times use to denote grand means or intercepts. I characterize this imputation model as
agnostic, because it looks nothing like the analysis model in Equation 7.2. This differ-
ence is not a problem, because the multivariate normal data structure does not conflict
with the analytic model.
Y2i(mis) ~ N1(E(Y2 | Y1, Y3, Y4*), σ22|134)  (7.4)

[Y2i(mis), Y4i*(mis)]′ ~ N2([E(Y2 | Y1, Y3), E(Y4* | Y1, Y3)]′, Σ24|13)  (7.5)
It’s still correct to view imputations as predicted values plus noise, but the noise terms
correlate via the off-diagonal elements in the residual covariance matrix. Notice that
the binary variable’s imputations are on the latent metric. You may recall from Chapter
6 that the location of the latent scores relative to the threshold parameter induces a corresponding set of discrete imputes (e.g., drawing a latent imputation above the threshold is
consistent with Y4(mis) = 1, and sampling a latent score below the threshold implies that
Y4(mis) = 0). The dichotomous imputations play no role in estimation and are only needed
when saving a data set for later analysis. Importantly, the categorical values result from
a functional link between the latent and discrete scores and not an ad hoc rounding
scheme from the earlier days of multiple imputation (Ake, 2005; Allison, 2002, 2005;
Demirtas & Hedeker, 2008a, 2008b; Horton et al., 2003).
cycles, but serial dependencies among a small number of imputed data sets can attenu-
ate multiple imputation standard errors. For this reason, you can’t simply save the M
data sets from successive iterations following the burn-in period. There are two ways
to set up the algorithm to avoid this problem: Save imputed data sets at prespecified
intervals within a single MCMC process, or save the filled-in data from the last iteration
of separate MCMC processes. The latter strategy naturally avoids autocorrelation. Like
other Bayesian estimation problems, it is important to evaluate whether MCMC has
converged and is mixing well prior to saving the first data set, and trace plots (Schafer &
Olsen, 1998) and potential scale reduction factor diagnostics (Gelman & Rubin, 1992)
are familiar tools for this purpose.
A sequential imputation chain invokes a single MCMC process consisting of M ×
T iterations, and it saves data sets in intervals of T iterations; that is, the first data set is
saved after an initial burn-in period of T iterations, and the remaining M – 1 data sets
are saved every T iterations thereafter. The literature refers to the interval separating the
save operations as the thinning interval or between-imputation interval. The recipe for
this approach is as follows:
Assign starting values to all parameters, latent scores, and missing data.
Do for m = 1 to M imputations.
Do for t = 1 to T iterations.
> Estimate model parameters conditional on all imputations.
> Estimate missing values conditional on the model parameters.
Repeat.
> Save the filled-in data for later analysis.
Repeat.
Notice that the recipe features two repetitive loops. The inner block consists of estima-
tion and imputation steps that repeat for T iterations, and the outer loop repeats the
estimation block M times, saving a complete data set with categorized imputes after
each block of T iterations. The recipe reflects a simple strategy that sets the thinning
interval equal to the burn-in period, but the two intervals need not be the same. Schafer
(1997; Schafer & Olsen, 1998) describes graphical tools for assessing the magnitude of
the autocorrelations across iterations, and these plots provide an alternative method for
choosing a thinning interval.
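The sequential recipe translates directly into two nested loops. In this sketch, mcmc_step is a placeholder for the estimation and imputation steps rather than a real sampler:

```python
import numpy as np

rng = np.random.default_rng(5)

def mcmc_step(state, rng):
    """Placeholder for one MCMC cycle: estimate parameters given the
    current imputations, then re-estimate the missing values."""
    return state + rng.normal(scale=0.1, size=state.shape)

M, T = 5, 100                # number of imputations, thinning interval
state = np.zeros(3)          # starting values for all unknowns
imputed_sets = []
for m in range(M):           # outer loop: one saved data set per block
    for t in range(T):       # inner loop: burn-in / thinning iterations
        state = mcmc_step(state, rng)
    imputed_sets.append(state.copy())   # save the filled-in data
```

Moving the starting-value assignment inside the outer loop (and discarding all but the final state of each pass) turns this into the parallel-chains strategy described next.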
Specifying parallel imputation chains initiates a unique MCMC process (usually
with random starting values) for each of the M data sets, with each chain producing a
single filled-in data set at the conclusion of a burn-in period consisting of T iterations.
The recipe for this approach is as follows:
Do for m = 1 to M imputations.
Assign starting values to all parameters, latent scores, and missing data.
Do for t = 1 to T iterations.
> Estimate model parameters conditional on all imputations.
> Estimate missing values conditional on the model parameters.
Repeat.
> Save the filled-in data for later analysis.
Repeat.
Notice that the initialization step that assigns preliminary values to the unknowns
moves inside the outer loop, forcing MCMC to start from the beginning with new start-
ing values after completing T iterations. This hard reset negates any autocorrelation,
because the M sets of imputations arise from independent chains with unique starting
values.
Analysis Example
Continuing with the math achievement example, I used the joint imputation model from
Equation 7.3 to create M = 100 data sets. The potential scale reduction factors (Gelman
& Rubin, 1992) from a preliminary diagnostic run indicated that MCMC converged in
fewer than 200 iterations, so I created imputations by saving the filled-in data from the
final iteration of 100 parallel MCMC chains, each with 1,000 iterations. As explained
previously, autocorrelated imputations are not a concern with this approach.
Table 7.1 gives the Bayesian point estimates (posterior medians) of the imputa-
tion model parameters. To emphasize, these are not the parameters of substantive inter-
est; they are average estimates from the MCMC process that created multiple imputa-
tions for the next round of analysis. As described in Section 5.5, it’s good practice to
inspect imputations to make sure they are plausible and reasonably like the observed
data. Graphing imputations next to the observed scores can provide a window into an
estimator’s inner machinery, as misspecifications such as applying a normal imputation
model to a highly skewed variable can produce out-of-range or implausible values (e.g.,
negative imputes for a strictly positive variable). I use simple bivariate scatterplots and
histograms to examine the filled-in data, and Su, Gelman, Hill, and Yajima (2011) and
van Buuren (2012) illustrate other useful graphical displays for this purpose.
Figure 7.2 shows scatterplots for six joint model data sets, with light gray circles
representing observed data and black crosshair symbols indicating imputed data. A
practical consequence of the MAR assumption is that observed and imputed scores
share common model parameters. Although their distributions need not be the same,
the observed and imputed data should combine to form a reasonable looking distri-
bution. They do in this case, as the imputations blend in around the regression line
with relatively few outliers. Figure 7.3 shows overlayed histograms of the math posttest
scores, with the observed data as gray bars and the missing values as white bars with a
kernel density function (the graph reflects a stacked data set with all imputations in the
same file). As you can see, the observed data are relatively normal, and the distribution
of imputations is a close match. Figure 7.4 shows the corresponding plot of the stan-
dardized reading test scores. As you can see, the observed data are negatively skewed,
whereas the imputations follow a symmetric distribution; the completed data are thus a weighted mixture of the two.
A mismatch between the observed data distribution and the imputations isn’t nec-
essarily a problem, especially when the goal is to estimate means or regression coef-
ficients (Demirtas et al., 2008; Lee & Carlin, 2017; von Hippel, 2013; Yuan et al., 2012).
It is even less of a problem here, because the skewed variable is auxiliary to the main
FIGURE 7.2. Scatterplots of pretest and posttest math achievement scores from six joint model
data sets. Gray circles are observed scores, and black crosshair symbols denote imputations.
FIGURE 7.3. Histogram of observed and imputed math posttest scores. The observed data are
the gray bars, and the missing values are the white bars with a kernel density function.
FIGURE 7.4. Histogram of observed and imputed standardized reading scores. The observed
data are the gray bars, and the missing values are the white bars with a kernel density function.
Yvi(t) = E(Yvi | Y1i(t), …, Y(v−1)i(t), Y(v+1)i(t−1), …, YVi(t−1), Xi) + rvi  (7.6)

Yvi(t) ~ N1(E(Yvi | Y1i(t), …, Y(v−1)i(t), Y(v+1)i(t−1), …, YVi(t−1), Xi), σ2rv)
where Yvi is participant i’s score on Yv, the expectation E(Yvi| . . .) is a predicted score from
the regression of Yv on all other variables, and r vi is a normally distributed residual.
Notice that the predictors in the regression equation condition on previously imputed
variables, some taken from the current iteration, others taken from the previous itera-
tion. The bottom row conveys the familiar idea that imputations are sampled from a
normal distribution with a predicted value and residual variance defining its center and
spread. Like joint model imputation, MCMC provides the mathematical machinery for estimating the regression model parameters.
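A stripped-down version of one fully conditional specification cycle for continuous variables might look like the following. For brevity, the sketch uses least squares point estimates rather than the posterior draws of the regression parameters that a proper MCMC implementation would use:

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy data: three continuous variables with scattered missing values.
n = 300
data = rng.multivariate_normal([0, 0, 0], np.eye(3) * 2 + 1, size=n)
mask = rng.uniform(size=(n, 3)) < 0.1   # True where a value is missing
data[mask] = np.nan

# Starting values: fill each hole with its column mean.
col_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), col_means, data)

def fcs_cycle(filled, mask, rng):
    """One round-robin pass: regress each incomplete variable on all
    others, then replace its missing entries with predicted values
    plus normal noise (parameter uncertainty omitted in this sketch)."""
    n, V = filled.shape
    for v in range(V):
        others = np.delete(filled, v, axis=1)
        X = np.column_stack([np.ones(n), others])
        b, *_ = np.linalg.lstsq(X, filled[:, v], rcond=None)
        resid = filled[:, v] - X @ b
        sigma = resid.std(ddof=X.shape[1])
        pred = X @ b
        miss = mask[:, v]
        filled[miss, v] = pred[miss] + rng.normal(scale=sigma, size=miss.sum())
    return filled

filled = fcs_cycle(filled, mask, rng)
```

Each pass conditions on the most recently imputed values of the other variables, matching the round-robin scheme in Equation 7.6; mixed response types would swap in a logistic or probit regression where appropriate.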
Fully conditional specification accommodates mixed response types by tailoring
each regression model in the sequence to the incomplete variable’s metric. For example,
the first step could employ a linear regression, the second, a logistic regression, the third,
a probit regression, and so on. To connect with material from Chapter 6, I use a latent
response formulation and probit models to impute categorical variables. To illustrate,
reconsider the math achievement analysis. This problem requires a probit regression
model for the binary lunch assistance indicator and linear regressions for the posttest
math and standardized reading test scores. Using generic notation, the fully conditional
specification imputation models at MCMC iteration t are as follows, and each model
requires supporting steps that estimate the parameters:
MATHPOSTi(t) = Y2i(t) = γ02 + γ12Y1i + γ22Y3i(t−1) + γ32Y4i(t−1) + r2i
Y2i(mis)(t) ~ N1(E(Y2i | Y1i, Y3i, Y4i), σ2r2)

STANREADi(t) = Y3i(t) = γ03 + γ13Y1i + γ23Y4i(t−1) + γ33Y2i(t) + r3i
Y3i(mis)(t) ~ N1(E(Y3i | Y1i, Y2i, Y4i), σ2r3)

FRLUNCHi*(t) = Y4i*(t) = γ04 + γ14Y1i + γ24Y2i(t) + γ34Y3i(t) + r4i
Y4i*(mis)(t) ~ N1(E(Y4i* | Y1i, Y2i, Y3i), 1)
The probit model also incorporates a fixed threshold parameter, and latent imputa-
tions above and below this cutoff convert to 1 and 0 values, respectively. Notice that
the binary imputes appear on the right side of each equation. Figure 7.5 shows a path
diagram of the imputation models. As before, I use an oval and rectangle to differen-
tiate the latent variable and its binary indicator, respectively, and the broken arrow
connecting the two is the link function that maps the unobserved continuum to the
discrete responses.
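To make the latent-to-binary conversion concrete, here is a small Python sketch. It assumes the common probit identification choices of a threshold fixed at zero and a unit residual variance; the predicted values are hypothetical placeholders, not output from the book's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predicted values E(Y4* | other variables) for four cases
# with missing lunch-assistance scores.
predicted = np.array([-0.8, 0.3, 1.1, -0.2])

# Latent imputations: predicted value plus a unit-variance normal residual.
latent = rng.normal(loc=predicted, scale=1.0)

# Fixed threshold at zero: latent scores above convert to 1, below to 0.
tau = 0.0
binary_imputes = (latent > tau).astype(int)
```

The analyst never sees the latent scores; only the discretized 0/1 imputations carry forward to the filled-in data sets.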
Replacing the manifest binary predictor with its latent response variable gives the Equation 7.8 imputation models:

$$
\begin{aligned}
\text{MATHPOST}_i &= Y_{2i}^{(t)} = \gamma_{02} + \gamma_{12} Y_{1i} + \gamma_{22} Y_{3i}^{(t-1)} + \gamma_{32} Y_{4i}^{*(t-1)} + r_{2i} \\
Y_{2i(mis)}^{(t)} &\sim N_1\left(E\left(Y_{2i} \mid Y_{1i}, Y_{3i}, Y_{4i}^{*}\right), \sigma^2_{r_2}\right) \\
\text{STANREAD}_i &= Y_{3i}^{(t)} = \gamma_{03} + \gamma_{13} Y_{1i} + \gamma_{23} Y_{4i}^{*(t-1)} + \gamma_{33} Y_{2i}^{(t)} + r_{3i} \\
Y_{3i(mis)}^{(t)} &\sim N_1\left(E\left(Y_{3i} \mid Y_{1i}, Y_{2i}, Y_{4i}^{*}\right), \sigma^2_{r_3}\right) \\
\text{FRLUNCH}_i^{*(t)} &= Y_{4i}^{*(t)} = \gamma_{04} + \gamma_{14} Y_{1i} + \gamma_{24} Y_{2i}^{(t)} + \gamma_{34} Y_{3i}^{(t)} + r_{4i} \\
Y_{4i(mis)}^{*(t)} &\sim N_1\left(E\left(Y_{4i}^{*} \mid Y_{1i}, Y_{2i}, Y_{3i}\right), 1\right)
\end{aligned} \tag{7.8}
$$
As before, the location of the latent variable imputations relative to the threshold parameter determines the discrete imputed values.
FIGURE 7.5. Path diagram of fully conditional specification imputation models for four vari-
ables. Incomplete variables link via regressions in a round-robin scheme where each variable is
regressed on all others.
FIGURE 7.6. Path diagram of fully conditional specification imputation with latent response
variables. Incomplete variables link via regressions in a round-robin scheme where each variable
is regressed on all others, and latent variables replace manifest categorical variables.
Compatibility
An important issue with fully conditional specification is whether the imputation regres-
sion models are mutually compatible (Raghunathan et al., 2001; van Buuren, 2012; van
Buuren et al., 2006). Compatibility has a complex and precise mathematical definition
that can be found in work by Arnold and colleagues (Arnold, Castillo, & Sarabia, 1999;
Arnold et al., 2001; Arnold & Press, 1989) and more recently in Liu et al. (2014) and
Bartlett et al. (2015). I paint the broad brushstrokes here. Returning to the imputation
models in Equation 7.7, each regression induces a distribution for each incomplete vari-
able that conditions on all other variables. The essence of compatibility is whether these
conditional distributions are mutually valid in the sense that their parameters relate to
one another in a coherent way.
Conditional distributions such as those in the previous equations are compatible
if they are spawned by the same joint distribution. To illustrate, suppose Y1 and Y2 are
bivariate normal, as follows:
$$
\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim N_2\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}\right) \tag{7.9}
$$

Applying fully conditional specification to this pair of variables would use the following regression models and conditional distributions:

$$
\begin{aligned}
Y_{1i} &= \beta_0 + \beta_1 Y_{2i} + \varepsilon_i = E\left(Y_{1i} \mid Y_{2i}\right) + \varepsilon_i \\
Y_{1i} &\sim N_1\left(E\left(Y_{1i} \mid Y_{2i}\right), \sigma^2_\varepsilon\right) \\
Y_{2i} &= \gamma_0 + \gamma_1 Y_{1i} + r_i = E\left(Y_{2i} \mid Y_{1i}\right) + r_i \\
Y_{2i} &\sim N_1\left(E\left(Y_{2i} \mid Y_{1i}\right), \sigma^2_r\right)
\end{aligned} \tag{7.10}
$$
These models and their conditional distributions are compatible, because the param-
eters are functionally related to those of the bivariate distribution, as shown below:
$$
\begin{aligned}
\beta_1 &= \sigma_{12} / \sigma_2^2 & \beta_0 &= \mu_1 - \beta_1 \mu_2 & \sigma^2_\varepsilon &= \sigma_1^2 - \sigma_{12}^2 / \sigma_2^2 \\
\gamma_1 &= \sigma_{12} / \sigma_1^2 & \gamma_0 &= \mu_2 - \gamma_1 \mu_1 & \sigma^2_r &= \sigma_2^2 - \sigma_{12}^2 / \sigma_1^2
\end{aligned} \tag{7.11}
$$
Because both regressions link to a common joint distribution, it follows that their param-
eters also link to one another (e.g., Y1’s regression model parameters are a function of Y2’s
regression parameters and vice versa).
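These functional relations are easy to verify numerically. The short Python sketch below plugs hypothetical bivariate normal parameters (means, variances, and a covariance chosen purely for illustration) into the Equation 7.11 expressions and confirms that both regressions recover the same covariance.

```python
# Hypothetical bivariate normal parameters for Y1 and Y2.
mu1, mu2 = 10.0, 20.0
var1, var2, cov12 = 4.0, 9.0, 3.0

# Regression of Y1 on Y2 (Equation 7.11, top row).
beta1 = cov12 / var2
beta0 = mu1 - beta1 * mu2
var_eps = var1 - cov12 ** 2 / var2

# Regression of Y2 on Y1 (Equation 7.11, bottom row).
gamma1 = cov12 / var1
gamma0 = mu2 - gamma1 * mu1
var_r = var2 - cov12 ** 2 / var1

# Compatibility in miniature: both slopes imply the same covariance,
# so each model's parameters are a function of the other's
# (beta1 * var2 = gamma1 * var1 = cov12).
```

Because the slopes, intercepts, and residual variances all trace back to the same five joint-distribution parameters, no combination of values can make the two conditional models contradict one another.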
From a practical perspective, deploying a set of compatible regression models is
optimal, because the resulting imputations are logically consistent with one another
(assuming the models are correctly specified). However, van Buuren et al. (2006) point
out that “compatibility is not an all-or-nothing phenomenon” (p. 1053), as certain types
of incompatibilities have little or no impact on multiple imputation parameter estimates
(Gelman & Raghunathan, 2001; Raghunathan et al., 2001). As an example, using a logis-
tic rather than probit imputation model in Equation 7.7 would not satisfy compatibility,
because no common multivariate distribution could simultaneously spawn binomial and
normal conditional distributions. Yet, in practice, the difference between that approach
and using the compatible latent variable imputation models from Equation 7.8 is usu-
ally nil. On the other hand, incompatibilities that result from applying fully conditional
specification to models with interactive or nonlinear terms can lead to substantial biases
(Bartlett et al., 2015; Enders, Keller, et al., 2018; Grund, Lüdtke, & Robitzsch, 2016a; Kim
et al., 2018; Seaman et al., 2012; Zhang & Wang, 2017). Model-based multiple imputa-
tion is usually a much better option.
Each MCMC iteration cycles through three familiar steps for every imputation model: (1) estimate the
regression coefficients given the current residual variance and filled-in data, (2) estimate the residual variance conditional on the new coefficients and the current data, and
(3) update the missing values (including the latent scores) conditional on the regression
model parameters. Again, ordinal variables require an additional step that estimates
threshold parameters.
Consistent with joint model imputation, you can either save imputed data sets at
prespecified intervals within a single MCMC process (sequential imputation chains) or
save the imputed data set from the last iteration of separate MCMC processes (parallel
imputation chains). The fully conditional specification recipe with parallel imputation
chains is as follows:
Do for m = 1 to M imputations.
  Assign starting values to all parameters, latent scores, and missing data.
  Do for t = 1 to T iterations.
    Do for incomplete variable v = 1 to V.
      > Estimate regression coefficients conditional on the residual variance and all imputations.
      > Estimate the residual variance conditional on the coefficients and all imputations.
    Repeat.
    Do for incomplete variable v = 1 to V.
      > Estimate missing scores (including latent response variables) conditional on the regression model parameters and all other data.
    Repeat.
  Repeat.
  > Save the filled-in data for later analysis.
Repeat.
Each MCMC cycle now consists of V estimation blocks (one for each incomplete
variable), and there are similarly V imputation steps. The recipe has a Gibbs sampler-
esque construction, as each estimation sequence conditions on imputations. As a small
technical point, van Buuren’s (2007, 2012; van Buuren et al., 2006) MICE algorithm is
not a true Gibbs sampler, because each estimation step uses just the cases with data
on the target (dependent) variable. However, fully conditional specification with latent
variables does use a Gibbs sampler that conditions on the entire data set.
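To make the recipe concrete, here is a simplified Python sketch of one imputation chain for continuous variables only. The function name and simulation details are my own, and the parameter draws are rough approximations of the posterior steps; production tools such as mice or Blimp add categorical models, priors, and convergence diagnostics.

```python
import numpy as np

def fcs_impute(data, n_iters=50, seed=0):
    """Bare-bones fully conditional specification for continuous variables.

    Each incomplete variable is regressed on all others, new parameters are
    drawn from approximate posteriors, and missing entries are refilled from
    the normal predictive distribution (Equation 7.6).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    miss = np.isnan(data)
    n, v = data.shape

    # Starting values: fill missing cells with column means.
    filled = np.where(miss, np.nanmean(data, axis=0), data)

    for _ in range(n_iters):
        for j in range(v):                      # round-robin over variables
            if not miss[:, j].any():
                continue
            X = np.column_stack([np.ones(n), np.delete(filled, j, axis=1)])
            y, obs = filled[:, j], ~miss[:, j]

            # Fit on rows observed for the target variable (as in MICE),
            # then draw the residual variance and coefficients.
            bhat, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
            resid = y[obs] - X[obs] @ bhat
            df = obs.sum() - X.shape[1]
            sigma2 = resid @ resid / rng.chisquare(df)
            beta = rng.multivariate_normal(
                bhat, sigma2 * np.linalg.inv(X[obs].T @ X[obs]))

            # Impute: predicted value plus a normal residual.
            mu = X[miss[:, j]] @ beta
            filled[miss[:, j], j] = mu + rng.normal(0.0, np.sqrt(sigma2), mu.size)
    return filled
```

Running M such chains from different starting seeds and saving each chain's final iteration yields the M parallel-chain data sets described above.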
Predictive mean matching offers another way to create imputations, borrowing real scores from
the observed data (Little, 1988a; van Buuren, 2012; Vink, Frank, Pannekoek, & van
Buuren, 2014). After estimating the variable's regression model, the procedure identifies
a donor pool of individuals with similar predicted values of the missing variable (i.e.,
similar regressor score profiles). Instead of simulating a synthetic imputation from a
normal curve, the procedure instead imputes each missing value with an observed score
drawn at random from the donor pool. Van Buuren (2012) provides a detailed discus-
sion of predictive mean matching, and the procedure is available in his R package MICE
(van Buuren et al., 2021; van Buuren & Groothuis-Oudshoorn, 2011). Predictive mean
matching is broadly applicable to missing data problems, and it is often recommended
for use with non-normal data, because it doesn’t impose a normal distribution on the
imputations (Lee & Carlin, 2017).
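The donor-matching idea can be sketched in a few lines of Python. This hypothetical helper skips the Bayesian parameter draws that a full predictive mean matching routine (e.g., in mice) would include, keeping only the core fit-predict-borrow logic.

```python
import numpy as np

def pmm_impute(y, X, k=5, seed=0):
    """Simplified predictive mean matching for one incomplete variable.

    Fits a linear regression to the observed cases, then replaces each
    missing y with an observed y borrowed at random from the k donors
    whose predicted values are closest to the missing case's prediction.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])

    bhat, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    pred = Xd @ bhat                      # predicted values for all cases

    imputed = y.copy()
    for i in np.flatnonzero(~obs):
        # Donor pool: observed cases with the closest predicted values.
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        imputed[i] = y[obs][rng.choice(donors)]
    return imputed
```

Because every filled-in value is a real observed score, the imputations automatically respect the observed variable's range and shape.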
Analysis Example
Continuing with the math achievement example, I use fully conditional specification
models from Equation 7.8 to create M = 100 data sets. Again, 100 data sets are likely
enough to serve a range of purposes such as maximizing power, minimizing the impact
of Monte Carlo error on standard errors, and obtaining precise inferences (Bodner, 2008;
Graham et al., 2007; Harel, 2007; von Hippel, 2020). The potential scale reduction fac-
tors (Gelman & Rubin, 1992) from a preliminary diagnostic run indicated that MCMC
converged in fewer than 200 iterations, so I created imputations by saving the filled-in
data from the final iteration of 100 parallel MCMC chains, each with 1,000 iterations. As
explained previously, autocorrelated imputations are not a concern with this approach.
Table 7.2 shows the posterior summaries of the imputation regression models. The
latent imputation approach centers all predictors, such that the intercept coefficients
equal the target variable’s grand mean (Keller & Enders, 2021). To emphasize, these are
not the parameters of substantive interest; they are average estimates from the MCMC
process that created multiple imputations for the next round of analysis. As mentioned
previously, it’s a good practice to inspect imputations to make sure they are plausible
and reasonably similar to the observed data. Figure 7.7 shows bivariate scatterplots
for six imputed data sets, with light gray circles representing observed data and black
crosshair symbols indicating imputed data. Consistent with the corresponding joint
model imputation graphs, the filled-in values blend in around the regression line with
relatively few outliers. The histograms of the observed and imputed data were also like
those in Figures 7.3 and 7.4, so I omit these graphs in the interest of space.
The product of the initial imputation phase is a set of M complete data sets (e.g., M =
100 for the achievement example). Although it might seem reasonable to do so, averag-
ing the imputations into a single data set is inappropriate, as is stacking the individual
files and analyzing a single aggregated data set. Although the latter strategy could give
unbiased point estimates in some cases, the standard errors and confidence intervals would be wrong
(van Buuren, 2012). Rather, the correct way forward is to perform one or more secondary
analyses on each data set and combine multiple sets of estimates and standard errors
into one package of results. Repeating an analysis 100 times sounds incredibly tedious,
but most major software packages have built-in routines that automate this process.
Returning to the math achievement example, the primary analysis involves a test
of the within-subject mean difference. A table of descriptive statistics would be standard fare with such an analysis, and obtaining those summaries requires the means
and standard deviations from each of the 100 data sets. Table 7.3 shows the descriptive
statistics from a few of the 100 fully conditional specification data sets. Similarly, a
test of the within-subjects mean difference requires the mean change score from each
data set. Table 7.4 gives the estimates and their standard errors for both imputation
approaches. The next few sections describe how to use Rubin’s rules (Little & Rubin,
2020; Rubin, 1987) to combine the various quantities from the tables into a single
package of results. Note that the same data can be used for any number of analyses
involving the set of imputation model variables. However, variables omitted from the
imputation model should not be analyzed, as those scores are uncorrelated with the
filled-in data values.
FIGURE 7.7. Scatterplots of pretest and posttest math achievement scores from six fully con-
ditional specification data sets. Gray circles are observed scores, and black crosshair symbols
denote imputations.
Analyzing multiply imputed data sets gives M estimates of each model parameter. Rubin
(1987) defined the multiple imputation point estimate as the arithmetic average of the
M estimates:
$$
\hat{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m \tag{7.12}
$$
where θ̂m is the estimate from data set m, and θ̂ is the pooled point estimate. To illustrate,
the bottom row of Table 7.3 shows the pooled means and standard deviations of the
pretest and posttest scores, and the bottom row of Table 7.4 gives the average within-
subjects mean difference for each imputation method. Because the imputation proce-
dures use the same variables and make identical assumptions, we would expect them
to produce equivalent estimates, which they do (β̂0 = 6.53 vs. 6.57 for the joint model
and fully conditional specification, respectively). From a substantive perspective, the
estimate’s interpretation is no different than that of a complete-data analysis—on aver-
age, math scores improved by about six and one-half points between pretest and post-
test. Despite their Bayesian origins, the imputations are compatible with the frequentist
paradigm; β̂0 is an estimate of the true population parameter.
The arithmetic average is an intuitive way to combine estimates, but the formal
statistical rationale for the procedure requires the estimand of interest to have a
normal sampling distribution (Rubin, 1987, Ch. 3; 1996). The normality assumption is
approximately true for common estimates such as means, regression coefficients, and
proportions (van Buuren, 2012), but not all estimates possess this property. Examples
include correlation coefficients, standard deviations and variances, variance explained
statistics, and odds ratios, to name a few. For these estimands, a common recom-
mendation is to (1) transform estimates to a metric that better approximates a nor-
mal curve, (2) apply the pooling rule from Equation 7.12 to the transformed estimates,
then (3) back-transform the pooled estimate to its original metric. For example, Schafer
(1997) recommends pooling correlation coefficients following a Fisher’s z-transforma-
tion, then back-transforming the average z-statistic to the correlation metric. Computer
simulations suggest that the impact of such transformations is usually only noticeable
in very small samples (e.g., less than 50; Hubbard & Enders, 2022). Finally, note that
test statistics and p-values should not be averaged, because they do not estimate a fixed
population parameter.
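The pooling rule and the transformation workaround both fit in a few lines of Python. The first three mean differences below are the Table 7.4 values quoted in the text; the remaining mean differences and all of the correlations are hypothetical fillers for this sketch.

```python
import numpy as np

# Pooled point estimate (Equation 7.12): the average of the M estimates.
mean_diffs = np.array([6.45, 6.50, 6.87, 6.62, 6.41])
pooled = mean_diffs.mean()

# Correlations first pass through Fisher's z, are averaged, and the
# average is back-transformed to the correlation metric (Schafer, 1997).
rs = np.array([0.52, 0.55, 0.49, 0.58, 0.51])
pooled_r = np.tanh(np.arctanh(rs).mean())
```

Because arctanh is nonlinear, the back-transformed average differs slightly from the raw average of the correlations, though the gap is usually negligible outside very small samples.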
Taking the arithmetic average of the standard errors is inappropriate, because each value
is based on complete data and therefore ignores the additional uncertainty that accrues
from imputation. Rather, multiple imputation standard errors combine two sources of
variation: the sampling error that would have resulted had there been no missing val-
ues (within-imputation variance), and additional error or noise due to the missing data
(between-imputation variance). As you will see, estimating the second source of varia-
tion is the sole reason why we need more than one data set.
Returning to the math achievement analysis, Table 7.4 gives the within-subject
mean difference and standard error estimates from a handful of imputed data sets, with
pooled values shown in bold typeface in the bottom row. Because they are derived from
complete data sets, each of the M standard errors estimates the sampling error that
would have resulted had there been no missing values. Averaging the squared standard
errors (i.e., sampling variances) gives a more stable estimate of sampling error known as
the within-imputation variance.
$$
V_W = \frac{1}{M} \sum_{m=1}^{M} SE_m^2 \tag{7.13}
$$
Averaging the squared standard errors from the achievement analysis gives a within-
imputation variance estimate of VW = 0.32. The square root of this value is the expected
standard error from a complete-data analysis. Intuitively, this estimate is too small,
because each standard error treats the filled-in values as real data. A second source of
variation corrects this problem.
A key feature of Table 7.4 is that the estimates vary across data sets; β̂0 equals 6.45 in
the first data set, 6.50 in the second, 6.87 in the third, and so on. This variation is due to
only one source—differences among the imputations, or missing data uncertainty. The
variance of the M estimates around their pooled value captures the influence of miss-
ing data on precision. This between-imputation variance is computed by applying the
familiar formula for the sample variance to the estimates.
$$
V_B = \frac{1}{M - 1} \sum_{m=1}^{M} \left( \hat{\theta}_m - \hat{\theta} \right)^2 \tag{7.14}
$$
The variance of the fully conditional specification estimates from Table 7.4 is VB = 0.04.
This term effectively functions as a correction factor, inflating the standard error to
compensate for missing information. Note that estimating between-imputation varia-
tion is only possible when analyzing more than one data set. Single imputation pro-
cedures (e.g., mean imputation, regression imputation, hot deck imputation) produce
flawed inferential tests, because they lack this important source of variation.
Complete-data sampling variation and the noise due to missing data combine to
form the total variance of an estimate, VT, the square root of which is the multiple impu-
tation standard error.
$$
SE = \sqrt{V_W + V_B + \frac{V_B}{M}} = \sqrt{V_T} \tag{7.15}
$$
The first two terms under the radical should come as no surprise, but you may be won-
dering about the rightmost term. Numerically, the fraction of VB over M is the squared
standard error of the pooled estimate from Equation 7.12. Because it captures the
expected difference between an estimate computed from M data sets and a hypothetical
estimate based on an infinite number of imputations, this term shrinks toward zero as
M increases. Substituting the fully conditional specification variance estimates into
Equation 7.15 gives

$$
SE = \sqrt{0.316 + 0.037 + \frac{0.037}{100}} = 0.595 \tag{7.16}
$$
Notice that the pooled standard error is about 10% larger than the imputation-specific
standard errors, the average of which is approximately 0.56. This difference owes to the
between-imputation variance terms that inflate the standard error to compensate for the
missing data. A number of papers point out that multiple imputation standard errors
may be too large (Kim et al., 2006; Nielsen, 2003; Reiter & Raghunathan, 2007; Robins
& Wang, 2000; Wang & Robins, 1998), but Rubin (2003) argues that this conservative
tendency is unimportant, because confidence intervals and inferences (the things we
really care about) are often close to optimal.
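The arithmetic behind the pooled standard error is simple enough to verify directly. This sketch reuses the rounded variance estimates quoted in the text for the within-subjects mean difference.

```python
import math

# Example values from the text: average squared standard error (V_W) and
# variance of the M estimates around their pooled value (V_B).
M = 100
VW = 0.316   # within-imputation variance
VB = 0.037   # between-imputation variance

VT = VW + VB + VB / M   # total variance (Equation 7.15)
SE = math.sqrt(VT)      # multiple imputation standard error
```

The pooled standard error necessarily exceeds the square root of the within-imputation variance alone, reflecting the extra noise introduced by the missing data.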
The between-imputation variance also quantifies the missing data's influence on an estimate's precision. The relative increase in variance expresses the missing data noise relative to the complete-data sampling variance

$$
RIV = \frac{V_B + V_B / M}{V_W} \tag{7.17}
$$

and the fraction of missing information expresses that noise as a proportion of the total variation

$$
FMI = \frac{V_B + V_B / M}{V_T} \times \frac{df_R + 1}{df_R + 3} + \frac{2}{df_R + 3} \tag{7.18}
$$

$$
df_R = (M - 1)\left(1 + \frac{1}{RIV}\right)^2 \tag{7.19}
$$

where dfR is the classic degrees of freedom expression from Rubin (1987, Equation 3.1.6).
Equation 7.18 comprises two parts. The first term to the right of the equals sign—the
proportion of between-imputation variance relative to the total variation—is an approx-
imation that assumes an infinite number of imputations, and the terms involving the
degrees of freedom adjust the proportion to compensate for using a finite number of
imputations (Pan & Wei, 2016). Returning to the within-subjects mean difference, the
fraction of missing information based on dfR = 8896.61 is approximately .11, meaning
that 11% of the squared standard error is due to missing data (i.e., the missing data
“explain” about 11% of the estimate’s variation). Although it is not a natural by-product
of estimation, Savalei and Rhemtulla (2012) show how to obtain the fraction of missing
information from a maximum likelihood analysis.
Rubin (1987, p. 114) states that the fraction of missing information will equal the
missing data proportion in some limited situations, and he suggests it will often be
lower than the missing data rate, because some of the missing information is recouped
via the correlations among the variables in the imputation model. Such is the case in this
example, where 16.8% of the change scores are missing but 11% of the total variation in
the mean difference is attributable to missing data. Pan and Wei (2016) caution against
linking the fractions of missing information to the missing data rates, as they observed
that fraction of missing information values can be much smaller and sometimes even
larger than the missingness proportions. As an aside, getting good estimates of the frac-
tion of missing information usually requires a very large number of imputations, typi-
cally more than what might be required to maximize statistical power (Bodner, 2008;
Harel, 2007).
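The missing-information quantities are straightforward to compute from the pooled variance components. The sketch below uses the rounded example values from the text, so the resulting degrees of freedom land near, but not exactly on, the quoted dfR = 8896.61.

```python
# Rounded variance components from the within-subjects mean difference.
M, VW, VB = 100, 0.316, 0.037
VT = VW + VB + VB / M

RIV = (VB + VB / M) / VW                 # relative increase in variance (Eq. 7.17)
df_R = (M - 1) * (1 + 1 / RIV) ** 2      # Rubin's degrees of freedom (Eq. 7.19)

# Fraction of missing information (Equation 7.18).
FMI = ((VB + VB / M) / VT) * (df_R + 1) / (df_R + 3) + 2 / (df_R + 3)
```

With a large dfR the degrees-of-freedom adjustment is negligible, and the FMI is essentially the raw proportion of total variance attributable to missing data (about .11 here).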
The familiar t-statistic provides a straightforward way to conduct significance tests with
multiply imputed data. The test statistic for a pooled point estimate is
$$
t = \frac{\hat{\theta} - \theta_0}{SE} \tag{7.20}
$$
where θ̂ and SE are a pooled estimate and standard error, respectively, and θ0 is the
hypothesized parameter value, typically 0. Returning to the math achievement analysis,
the test statistic for the within-subjects mean difference is t = 11.02.
Rubin (1987) suggested a t reference distribution with the degrees of freedom
expression from Equation 7.19. An undesirable feature of this classic expression is
that dfR can exceed the sample size, as it does in the math achievement analysis (dfR
= 8896.61). With such a large degrees of freedom value, the t-statistic is effectively a
z-score with a standard normal probability value. Barnard and Rubin (1999) proposed
an alternate expression that addresses this issue. Their degrees of freedom equation is
$$
df_{BR} = \frac{df_R \, df_{obs}}{df_R + df_{obs}} \tag{7.21}
$$

$$
df_{obs} = df_{com} \times \frac{df_{com} + 1}{df_{com} + 3} \times \left(1 - \frac{V_B + V_B / M}{V_T}\right) \tag{7.22}
$$
where dfcom is the degrees of freedom value from an analysis with no missing data (e.g.,
a test of the within-subjects mean difference has dfcom = N – 1), and dfobs is an observed-
data degrees of freedom value that downweights the complete-data degrees of freedom
commensurate with the missing information (Reiter & Raghunathan, 2007). The value
of df BR decreases as missing information increases (as it should, because the data contain
less information about the estimate), and it cannot exceed the complete-data degrees of
freedom. Returning to the math achievement analysis, the degrees of freedom for the within-subjects mean difference is dfBR = 215.61. At face value, this result seems far more reasonable given that dfcom = 249 and the fraction of missing information is about 11%.
Barnard and Rubin’s (1999) computer simulations suggest that df BR improves the accu-
racy of confidence intervals in small samples, and this correction is widely available in
software.
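Chaining Equations 7.21 and 7.22 together is a short computation. The sketch again uses the rounded example values, so the result sits near, rather than exactly on, the quoted dfBR = 215.61.

```python
# Rounded variance components and complete-data degrees of freedom
# (df_com = N - 1 = 249 for the within-subjects mean difference).
M, VW, VB = 100, 0.316, 0.037
VT = VW + VB + VB / M
RIV = (VB + VB / M) / VW
df_R = (M - 1) * (1 + 1 / RIV) ** 2

lam = (VB + VB / M) / VT        # proportion of total variance due to missing data
df_com = 249
df_obs = df_com * (df_com + 1) / (df_com + 3) * (1 - lam)   # Equation 7.22
df_BR = df_R * df_obs / (df_R + df_obs)                     # Equation 7.21
```

Because Equation 7.21 is a harmonic-mean-style combination, dfBR can never exceed either of its inputs, which is what keeps it below the complete-data degrees of freedom.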
Some software programs that analyze multiply imputed data also define the ratio in
Equation 7.20 as a z-statistic and use a standard normal distribution to get a probability
(e.g., structural equation modeling programs; Asparouhov & Muthén, 2010b). In prac-
tice, the choice of reference distribution and degrees of freedom isn’t critical unless the
sample size is very small, in which case there may be an advantage to using a t-test with
Barnard and Rubin’s (1999) adjustment. The reference distribution made no difference
in this example, as z- and t-statistics were both significant at p < .001.
You might have already intuited that averaging confidence interval limits is inap-
propriate, because the standard errors that give rise to those limits are too small. Rather,
confidence interval limits are computed by multiplying the pooled standard error by the
appropriate t critical value, then adding and subtracting that product (i.e., the margin of
error or half-width) to the pooled estimate, as follows:
$$
CI = \hat{\theta} \pm t_{CV} \times SE \tag{7.23}
$$
The appropriate t critical value for the desired alpha level, tCV, requires one of the
previous degrees of freedom expressions. Using an alpha level of .05, Rubin’s (1987)
original expression gives tCV = 1.96 (essentially the same as the critical value from a
normal distribution), whereas Barnard and Rubin’s (1999) degrees of freedom formula
gives tCV = 1.97. The 95% confidence interval for the latter approach spans from a mean
difference of 5.38 to 7.72. Consistent with the test statistic, we can conclude that the
positive change is significant, because the null value of 0 falls well outside the confi-
dence interval.
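The test statistic and interval follow directly from the pooled quantities. This sketch takes the pooled estimate to be roughly 6.55 (the midpoint of the reported interval; the exact pooled value differs slightly by imputation method) and uses the Barnard-Rubin critical value quoted in the text.

```python
# Pooled estimate (approximate) and standard error for the mean difference.
est, SE = 6.55, 0.595

t = (est - 0) / SE                  # Equation 7.20 with a null value of 0
t_cv = 1.97                         # t critical value at df_BR = 215.61
margin = t_cv * SE                  # margin of error (half-width)
ci = (est - margin, est + margin)   # Equation 7.23
```

The interval of roughly 5.38 to 7.72 excludes zero, matching the significant t-statistic of about 11.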
Having completed a multiple imputation analysis from start to finish, let’s take a step
back and contrast the approach with maximum likelihood and Bayesian estimation.
When will the methods give similar results, and when might they differ? Maximum
likelihood and Bayesian estimation are single-stage procedures that extract parameter
estimates directly from the observed data. Although they attack the missing data prob-
lem differently, numerous examples from earlier chapters show that the two procedures
are generally equivalent (at least numerically). In contrast, multiple imputation is a two-
stage approach that separates missing data handling from the analyses; the first stage
deploys an imputation model, the sole purpose of which is to fill in the data, and the sec-
ond stage involves analyzing the completed data sets. I previously emphasized that the
imputation and analysis stages need not employ the same statistical model. Returning
to the math achievement example, the analysis was a within-subjects mean comparison,
but joint model imputation and fully conditional specification deployed different regres-
sion models with two extra auxiliary variables. It ends up that the type and degree of
mismatch between the imputation and analysis models dictate whether multiple impu-
tation differs from direct estimators like maximum likelihood and Bayesian estimation.
Collins et al. (2001) and Schafer (2003) outline three propositions or rules of thumb
that help us anticipate when multiple imputation will agree or disagree with direct esti-
mators such as maximum likelihood and Bayesian estimation. The first proposition
states that direct estimation and multiple imputation will be equivalent if they use the
same set of variables and invoke models that apply equivalent assumptions and struc-
ture to the data. Such models are said to be congenial (Meng, 1994). Returning to the
math achievement example, the analysis model in Equation 7.2 assumes that difference
scores are normally distributed. Had I omitted the auxiliary variables, the bivariate nor-
mal distribution from Equation 7.9 would have served as the joint imputation model.
The imputation and analysis models would be congenial, because they use the same
variables and apply the same assumptions. The fact that the difference score parameters
are functions of the bivariate normal distribution's parameters demonstrates this point
(i.e., β0 = μ2 − μ1 and σε² = σ1² + σ2² − 2σ12). The same conclusion holds for fully conditional specification.
The second proposition applies to situations where direct estimation and multiple
imputation use the same set of variables, but the imputation model uses more param-
eters. An example of this proposition occurs in confirmatory factor analysis applications
where a researcher could use direct estimation to fit a factor analysis model that imposes
a particular pattern on the associations in the data, or first employ multiple imputation
with an unrestricted mean vector and covariance matrix (i.e., a saturated imputation
model). In this situation, Collins et al. (2001) predict that the two approaches will give
equivalent parameter estimates, but multiple imputation standard errors could be some-
what larger, because the first-stage imputation model uses more parameters than neces-
sary to represent the data.
The third proposition applies to uncongenial scenarios (Meng, 1994) where the
imputation stage uses different variables than direct estimation. Collins et al. (2001)
suggest that multiple imputation can give different estimates and standard errors, even
when the second-stage analysis matches that of the direct estimator. It is widely known
that excluding an analysis variable from imputation is detrimental, because the residual
correlation between the imputations and omitted variable is fixed to 0 (Rubin, 1996).
The resulting parameter estimates will be biased toward zero unless the association
is truly nil in the population. In contrast, uncongeniality that results from including
extra auxiliary variables is usually considered beneficial, because doing so can reduce
nonresponse bias and improve power (Collins et al., 2001; Enders, 2008; Howard et al.,
2015; Raykov & West, 2015; Rubin, 1996; Schafer, 2003), with a few exceptions (e.g., an
excessively large number of auxiliary variables, a peculiar pattern of associations; Hardt
et al., 2012; Thoemmes & Rose, 2014).
Returning to the math achievement example, a comparable maximum likelihood
analysis that ignored the auxiliary variables produced a mean difference of β̂0 = 6.45 (SE
= 0.62). Although the maximum likelihood estimate isn't identical to the imputation-based estimate, it is close, differing by about 15% of a standard error unit. The multiple imputation standard errors are
somewhat smaller, which is what we might expect when using auxiliary variables that
contain a substantial proportion of unique variance, as they do here (the two additional
variables increment the explained variance in the math posttest scores by about 16%
above and beyond the pretest assessment). In general, differences that result from using
auxiliary variables should be most apparent when the missing data rates are very high
(e.g., greater than 25%; Collins et al., 2001).
The emergence of missing data-handling methods for interactive and nonlinear effects
is an important recent development (Bartlett et al., 2015; Enders et al., 2020; Erler et
al., 2016; Kim et al., 2015, 2018; Lüdtke et al., 2020a, 2020b; Zhang & Wang, 2017),
and we’ve so far encountered these methods in the maximum likelihood and Bayesian
frameworks. A prototypical model features a focal predictor X, a moderator variable M,
and the product of the two:
$$
Y_i = \beta_0 + \beta_1 X_i + \beta_2 M_i + \beta_3 X_i M_i + \varepsilon_i \qquad \varepsilon_i \sim N_1\left(0, \sigma^2_\varepsilon\right) \tag{7.24}
$$
In this model, β1 is a conditional effect that reflects the influence of X when M equals
zero, and β2 is the corresponding conditional effect of M when X equals zero. The β3
coefficient is usually of particular interest, since it captures the change in the β1 slope
for a one-unit increase in M (i.e., the amount by which X’s influence on Y is moderated
by M).
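The conditional-effect interpretation is worth pinning down with a tiny sketch. The coefficient values below are hypothetical, chosen only to illustrate how the slopes combine.

```python
# Hypothetical coefficients for the moderated regression in Equation 7.24.
b0, b1, b2, b3 = 2.0, 0.5, 0.3, 0.2

def effect_of_x(m):
    """Conditional slope of X on Y at moderator value m: beta1 + beta3 * m."""
    return b1 + b3 * m

def effect_of_m(x):
    """Conditional slope of M on Y at focal predictor value x: beta2 + beta3 * x."""
    return b2 + b3 * x
```

At m = 0 the effect of X equals b1, and each one-unit increase in the moderator shifts that slope by exactly b3.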
Two well-known imputation approaches for interactive and curvilinear effects—
passive imputation and just-another-variable imputation—warrant a brief discussion,
because they are easy to implement but known to introduce bias. A passive imputation
model includes only the lower-order terms (e.g., Y, X, and M) in the first stage, and any
product or squared term is computed from the filled-in predictors prior to analysis. Von
Hippel (2009) refers to this strategy as impute-then-transform, because transformations
such as product and squared terms are computed postimputation. Iterative variants of
passive imputation compute the transformed variable at the conclusion of each MCMC
iteration and include it as a predictor in the next round of imputation (Royston, 2005;
van Buuren, 2012). Several computer simulation studies report that passive imputation
estimates are biased (Kim et al., 2015, 2018; Seaman et al., 2012; von Hippel, 2009),
which makes intuitive sense given that the imputation model effectively ignores the
nonlinear effect.
In contrast, just-another-variable imputation treats a nonlinear term like any other
variable to be imputed. Von Hippel (2009) refers to this strategy as transform-then-
impute, because transformations such as products or squares are computed prior to
imputation. To apply this method to the moderated regression analysis in Equation
7.24, you would first compute an explicit product variable Z = X × M that is missing
when one of its components is missing. The product would then function like any other
variable in the imputation model. For example, the joint imputation framework would
employ a multivariate normal model with an unrestricted mean vector and covariance
matrix.
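As a concrete illustration of the transform-then-impute setup, the sketch below computes the product variable before imputation, with missingness propagating from the components. The variable names and values are hypothetical.

```python
import numpy as np

# Transform-then-impute setup: compute Z = X * M before imputation so
# that Z is missing whenever either component is missing. The values
# below are hypothetical.
x = np.array([1.0, 2.0, np.nan, 4.0])
m = np.array([0.5, np.nan, 1.5, 2.0])

z = x * m                # NaN propagates through the product
missing_z = np.isnan(z)  # Z is missing wherever x or m is missing
```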
The normality assumption for Z is problematic, because the product of two random
variables isn’t normal (Craig, 1936; Lomnicki, 1967; Springer & Thompson, 1966), and
the mean and variance are deterministic functions of the component variables (Aiken
& West, 1991; Bohrnstedt & Goldberger, 1969). Another undesirable consequence of
just-another-variable imputation is that the filled-in Z values do not equal the product of
the imputed X and M scores. Von Hippel (2009) investigated a variant of the procedure
that resolves this inconsistency by deleting the imputed product variable at the end of
each MCMC iteration and recomputing it by multiplying the filled-in X and M scores.
However, this transform, impute, then transform again approach appears to exacerbate
rather than mitigate bias. Seaman et al. (2012) give analytic arguments showing that
just-another-variable imputation is approximately unbiased when missingness is com-
pletely at random (i.e., does not depend on the data) but is biased if scores are condition-
ally MAR. Several simulation studies support this conclusion (Enders et al., 2014; Kim
et al., 2015, 2018; Lüdtke et al., 2020b; Seaman et al., 2012; von Hippel, 2009; Zhang &
Wang, 2017).
There are only two situations where we can safely apply joint model or fully condi-
tional specification imputation to interactive (or nonlinear) effects. First, when missing
values are restricted to the outcome variable, we can simply include the complete prod-
uct term in the imputation model, as we would any other complete variable. Second, if
either X or M is complete and categorical, applying joint model imputation or fully con-
ditional specification within each subgroup preserves all possible two-way associations
with the grouping variable (Enders & Gottschall, 2011; Graham, 2009; van Buuren,
2012). Beyond these scenarios, model-based imputation is usually a much better option
for interactive and nonlinear effects (Bartlett et al., 2015; Enders et al., 2020; Erler et al.,
2016; Kim et al., 2015, 2018; Zhang & Wang, 2017).
A defining feature of agnostic imputation is its reliance on a generic model that differs
from the primary analysis but preserves its main features. The appeal of this strategy is
that the resulting imputations are potentially appropriate for a variety of different analy-
ses (e.g., descriptive summaries, analyses involving different combinations or subsets
of variables from the imputation model). In contrast, model-based imputation tailors
imputations to one particular analysis. The literature also refers to this strategy as a
sequential specification, fully Bayesian imputation, and substantive model-compatible
imputation (Bartlett et al., 2015; Erler et al., 2016, 2019; Lüdtke et al., 2020b; Zhang
& Wang, 2017). In fact, we already know how to implement model-based imputation,
because it piggybacks on a Bayesian analysis. The only new wrinkle is that we save
imputations and refit the analysis model to the filled-in data. To reiterate, a major reason
for choosing model-based imputation is that it readily handles interactive and nonlinear
effects.
Switching gears to a different substantive context, I use the chronic pain data
to illustrate a moderated regression analysis with an interaction effect. The data set
includes psychological correlates of pain severity (e.g., depression, pain interference
with daily life, perceived control) for a sample of N = 275 individuals with chronic pain.
This example piggybacks on the maximum likelihood analysis from Section 3.8 and the
Bayesian analysis from Section 5.4. The motivating question is whether gender moder-
ates the influence of depression on psychosocial disability, a construct capturing pain’s
impact on emotional behaviors such as psychological autonomy and communication,
emotional stability, and so forth. The moderated regression model is

DISABILITYi = β0 + β1(DEPRESSi) + β2(MALEi) + β3(DEPRESSi × MALEi) + β4(PAINi) + εi   (7.26)

where DISABILITY and DEPRESS are scale scores measuring psychosocial disability and
depression, MALE is a gender dummy code (0 = female, 1 = male), and PAIN is a binary
severe pain indicator (0 = no, little, or moderate pain, 1 = severe pain). I centered
depression scores at their grand mean to facilitate interpretation. The disability and depression
scores are missing for 9.1% and 13.5% of cases, respectively, and approximately 7.3%
of the binary pain ratings are missing. By extension, 13.5% of the sample is also missing
the product term.
f(DISABILITY | DEPRESS, MALE, DEPRESS × MALE, PAIN) × f(DEPRESS | PAIN, MALE) × f(PAIN* | MALE) × f(MALE*)   (7.27)
The first term to the right of the equals sign corresponds to the analysis model from
Equation 7.26. Importantly, the product is not a variable with its own distribution, but
rather a deterministic function of depression and gender, either of which could be miss-
ing. The regressor models in the next three terms translate into a linear regression for
depression, a probit (or logistic) model for the severe pain indicator, and an empty probit
(or logistic) model for the marginal distribution of gender (which I ultimately ignore,
because the variable is complete).
The asterisk superscripts reflect a latent response variable formulation for the binary
variables (see Section 6.3).
Following ideas established in Chapters 5 and 6, the distribution of an incomplete
predictor depends on every model in which it appears (e.g., see Equation 5.22 for the
conditional distribution of depression in this example). Deconstructing a variable’s dis-
tribution into multiple components ensures that the conditional models in Equations
7.26 and 7.28 are formally compatible. Importantly, the product term is not an imputed
variable. Rather, computing the product of the filled-in covariate scores prior to analysis
appropriately preserves the interaction effect, because the imputation model effectively
anticipates that depression scores will be multiplied by gender. Procedurally, the algo-
rithmic recipe for model-based imputation is identical to the Bayesian analysis from Sec-
tion 5.4. The only modification is that we save the filled-in data from the final iteration
of parallel MCMC chains.
Analysis Example
Continuing with the chronic pain example, I used model-based multiple imputation to
create filled-in data sets for the moderated regression from Equation 7.26. To reiterate,
the imputation and analysis models are identical. To illustrate an inclusive imputation
strategy, I augmented model-based imputation with four auxiliary variables: anxiety,
stress, perceived control over pain, and pain interference with daily life. I introduced
the auxiliary variables using a sequential specification like the one from Section 5.8, but
I could have simply added the variables as extra covariates in the first stage regression
model (because the initial estimates are not the focus, modifying the meaning of the
imputation model’s slope coefficients is not a concern). Following earlier examples, I
again created M = 100 imputed data sets. The potential scale reduction factors (Gelman
& Rubin, 1992) from a preliminary diagnostic run indicated that MCMC converged in
fewer than 200 iterations, so I created imputations by saving the filled-in data from the
final iteration of 100 parallel MCMC chains, each with 1,000 iterations. As explained
previously, autocorrelated imputations are not a concern with this approach.
After creating the multiple imputations, I fit the moderated regression model from
Equation 7.26 to each data set and used Rubin’s (1987) rules to pool the parameter esti-
mates and standard errors. Note that the auxiliary variables played no role in the sec-
ondary analysis phase. To facilitate interpretation of the lower-order terms, I centered
depression scores at the pooled estimate of the grand mean prior to computing the prod-
uct term and fitting the model. Table 7.5 gives the pooled estimates and standard errors
from multiple imputation with the Barnard and Rubin (1999) degrees of freedom. Table
3.7 gives the corresponding maximum likelihood estimates, and Table 5.2 shows the
Bayesian analysis results. Figure 3.6 is also an accurate depiction of the multiple impu-
tation simple slopes, which were effectively identical to those of maximum likelihood.
Recall that lower-order terms are conditional effects that depend on scaling; β̂1 = 0.38 (SE
= 0.06) is the effect of depression on psychosocial disability for female participants, and
β̂2 = –0.77 (SE = 0.56) is the gender difference at the depression mean. The interaction
effect captures the slope difference for males. The negative coefficient (β̂3 = –0.23, SE =
0.09) indicates that the male depression slope was approximately 0.23 points lower than
the female slope (i.e., the male slope is β̂1 + β̂3 = 0.38 − 0.23 = 0.15). As you might expect
based on work from Collins et al. (2001) and others, the multiple imputation estimates
were numerically equivalent to those of maximum likelihood and Bayesian estimation.
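The pooling arithmetic behind results like these follows Rubin's rules for a single parameter. The sketch below combines M point estimates and their squared standard errors; the numbers are made-up values rather than results from the chronic pain analysis.

```python
import numpy as np

# Rubin's (1987) pooling rules for one parameter: average the point
# estimates and combine within- and between-imputation variance.
# Inputs are hypothetical.
def pool(estimates, squared_ses):
    est = np.asarray(estimates, dtype=float)
    M = len(est)
    qbar = est.mean()             # pooled point estimate
    vw = np.mean(squared_ses)     # within-imputation variance
    vb = est.var(ddof=1)          # between-imputation variance
    vt = vw + vb + vb / M         # total sampling variance
    return qbar, np.sqrt(vt)      # pooled estimate and standard error

est, se = pool([0.38, 0.41, 0.36], [0.0036, 0.0035, 0.0038])
```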
The univariate t-statistic in Section 7.8 provides a familiar way to evaluate individual
parameters. Extending ideas from Chapter 2, you can use the Wald test and likelihood
ratio statistics to evaluate multiple parameters or compare two nested models. The Wald
test and likelihood ratio statistic can evaluate the same hypotheses, but they do so in dif-
ferent ways. The Wald test compares the discrepancy between the estimates and hypoth-
esized parameter values (usually zeros) to sampling variation. The simplest incarnation
of the test statistic is just a z-score or chi-square. In contrast, the likelihood ratio sta-
tistic compares log-likelihood values from two competing models, the simpler of which
aligns with the null hypothesis. The two tests are equivalent in large samples but can
give markedly different answers in small to moderate samples (Liu & Enders, 2017).
As you will see, the multiple imputation variants of these tests share a common logic,
as both include a component that depends on the pooled estimates and an additional
adjustment term that accounts for variation attributable to missing data.
Wald Test
A quick review of the maximum likelihood Wald test (Buse, 1982; Wald, 1943) sets the
stage for its multiple imputation counterpart. The test statistic is
TW = (θ̂Q − θ0)′ Ŝθ̂Q⁻¹ (θ̂Q − θ0)   (7.29)
The variance of the estimates around their pooled value captures the influence of the
missing data on precision, and these between-imputation deviations can also covary.
The Wald test captures this variability with a between-imputation covariance matrix
computed by applying the familiar formula for the sample covariance matrix to the
estimates.
SB = (1/(M − 1)) Σm=1…M (θ̂m − θ̂Q)(θ̂m − θ̂Q)′   (7.31)
The total covariance matrix combines complete-data sampling variation (the within-
imputation covariance matrix) and additional error or noise due to the missing data (the
between-imputation covariance matrix).
ST = SW + SB + SB/M   (7.32)
This expression simplifies to VT when θ̂Q is a single estimate, in which case SW and SB
also reduce to VW and VB (see Equations 7.13 and 7.14).
Numerous authors express concern about the previous covariance matrix, because
SB is very noisy (and potentially rank deficient) when the number of imputations is
small to moderate (Meng & Rubin, 1992; Reiter & Raghunathan, 2007; Schafer, 1997;
van Buuren, 2012). The usual solution is to impose a structure on the covariance matrix
(i.e., reduce the number of free moving parts) by assuming that the elements in SB are
proportional to those in SW. In practical terms, this simplification implies that missing
values impact the precision of all parameters by the same amount (e.g., missing values
uniformly increase the squared standard errors of the regression slopes by 10%).
Li, Raghunathan, and Rubin (1991) proposed a simplified estimate of ST that uses
the average relative increase in variance to adjust the within-imputation covariance
matrix for missing data uncertainty. Recall from earlier in the chapter that the relative
increase in variance is between-imputation variation expressed as a proportion of the
within-imputation variation (e.g., RIV = 0.10 means that additional noise due to missing
data is about 10% as large as complete-data sampling error). If the within- and between-
imputation covariance matrices are proportional, then the average relative increase in
variance for the parameters in θ̂Q is
ARIVW = (1 + 1/M) × tr(SB SW⁻¹) / Q   (7.33)
and the variance–covariance matrix of the estimates becomes
ŜT = (1 + ARIVW) SW   (7.34)

This expression inflates the within-imputation covariance matrix by a proportion (the
ARIVW term) that reflects the overall impact of missing data on precision. Computer
simulation studies suggest that this simplification is often innocuous
and improves the test statistic's behavior in small samples (Grund, Lüdtke, & Robitzsch,
2016c; Liu & Enders, 2017).
Finally, the multiple imputation Wald statistic is computed by substituting
imputation-based components into Equation 7.29 as follows:
TW = (θ̂Q − θ0)′ ŜT⁻¹ (θ̂Q − θ0) = [(θ̂Q − θ0)′ SW⁻¹ (θ̂Q − θ0)] / (1 + ARIVW)   (7.35)
The right side of the equation decomposes the test statistic into a complete-data compo-
nent based on pooled quantities (the numerator) and a correction term that deflates the
test statistic to compensate for missing data (the denominator). This composition paral-
lels that of the t-statistic in Equation 7.20, and TW simplifies to t2 when testing a single
parameter (i.e., Q = 1).
A probability value for the Wald test is obtained by referencing TW to a chi-square
distribution with Q degrees of freedom (Asparouhov & Muthén, 2010b) or by referenc-
ing TW ÷ Q to the approximate F distribution with Q numerator degrees of freedom and
df W denominator degrees of freedom (Li, Raghunathan, et al., 1991).
dfW = 4 + [Q × (M − 1) − 4] × {1 + [1 − 2/(Q × (M − 1))] / ARIVW}²   (7.36)
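The computations in Equations 7.29 through 7.36 can be sketched as a short function. This is a minimal illustration with hypothetical inputs, and it assumes Q × (M − 1) > 4 so the degrees of freedom expression applies.

```python
import numpy as np
from scipy import stats

# Multiple imputation Wald test, in the spirit of Equations 7.29-7.36:
# pool the estimates, compute the average relative increase in variance,
# and reference TW / Q to an F distribution. All inputs are hypothetical.
def mi_wald(estimates, within_covs, theta0=None):
    est = np.asarray(estimates, dtype=float)   # M x Q matrix of estimates
    M, Q = est.shape
    theta0 = np.zeros(Q) if theta0 is None else np.asarray(theta0, dtype=float)
    qbar = est.mean(axis=0)                    # pooled estimates
    sw = np.mean(within_covs, axis=0)          # within-imputation covariance
    dev = est - qbar
    sb = dev.T @ dev / (M - 1)                 # between-imputation covariance
    ariv = (1 + 1 / M) * np.trace(sb @ np.linalg.inv(sw)) / Q
    diff = qbar - theta0
    tw = diff @ np.linalg.inv(sw) @ diff / (1 + ariv)
    t = Q * (M - 1)                            # assumes t > 4
    df = 4 + (t - 4) * (1 + (1 - 2 / t) / ariv) ** 2
    p = stats.f.sf(tw / Q, Q, df)
    return tw, ariv, df, p

# Hypothetical example: M = 5 sets of estimates for Q = 2 parameters
rng = np.random.default_rng(1)
ests = rng.normal([0.5, -0.3], 0.05, size=(5, 2))
covs = [np.diag([0.010, 0.020])] * 5
tw, ariv, df, p = mi_wald(ests, covs)
```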
A different application of the likelihood ratio test occurs in structural equation modeling
analyses where a researcher compares the fit of a saturated model (i.e., a model that
places no restrictions on the mean vector and covariance matrix) to that of a more parsi-
monious analysis model that imposes a structure on the data (e.g., a confirmatory factor
analysis model). In either scenario, the simpler model with Q fewer parameters aligns
with the null hypothesis, so I denote the restricted model’s parameters as θ0 and the full
model’s parameters as θ.
As a quick review, the likelihood ratio statistic from Chapter 2 is shown below:
TLR = −2[LL(θ̂0 | data) − LL(θ̂ | data)]   (7.37)
The log-likelihood values for the two models, LL(θ̂0|data) and LL(θ̂|data), are com-
puted by substituting the sample data and the maximum likelihood estimates into
a distribution function such as the normal curve equation. The numerical value of
each log-likelihood is the sum of N individual fit terms, each of which is scaled as a
negative number, with higher (i.e., less negative) values reflecting better fit. The more
complex model with additional parameters will always achieve better fit and a higher
log-likelihood, but that improvement should be very small when the null hypothesis
is true.
The rest of this section describes the classic likelihood ratio statistic from Meng
and Rubin (1992). Chung and Cai (2019) and Lee and Cai (2012) propose comparable
test statistics in the structural equation modeling framework. Recall that the Wald test
incorporates a component based on pooled quantities and an adjustment term that cor-
rects the test statistic for missing data. The likelihood ratio statistic shares this struc-
ture, because Meng and Rubin (1992) used the asymptotic equivalence of the two tests
to derive expressions that essentially map the Wald statistic onto the log-likelihood
metric. I continue using the moderated regression analysis to illustrate a test that evalu-
ates the null hypothesis that R2 = 0.
As a starting point, reconsider the part of the Wald test that depends on pooled
quantities, shown in the numerator of Equation 7.35. Assuming a large sample size, an
analogous quantity is computed by averaging likelihood ratio tests that evaluate the
relative fit of the imputed data sets to the pooled estimates in θ̂0 and θ̂.
TPooled = (1/M) Σm=1…M −2[LL(θ̂0 | datam) − LL(θ̂ | datam)]   (7.38)
In this example, θ̂ references the pooled estimates from the moderated regression, and
θ̂0 contains the pooled estimates from the empty regression model consisting of an
intercept (grand mean) and variance. The likelihood ratio statistic for data set m is
computed by substituting the pooled estimates and the filled-in data into complete-data log
likelihood functions from Chapter 2 (see Equation 2.33).
As you will see, the final test statistic attenuates TPooled by an amount that depends
on the average relative increase in variance. Computing this correction term requires a
second set of test statistics that evaluate the relative fit of data set m to its own estimates.
This average is
TLR = (1/M) Σm=1…M −2[LL(θ̂0m | datam) − LL(θ̂m | datam)]   (7.39)
where θ̂0m and θ̂m are imputation-specific estimates rather than pooled values.
Meng and Rubin (1992) show that the likelihood equivalent of the ARIV W is com-
puted as follows:
ARIVLR = [(M + 1) / (Q × (M − 1))] × (TPooled − TLR)   (7.40)
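A small sketch of Equation 7.40, with hypothetical values for the two averaged statistics:

```python
# Average relative increase in variance for the likelihood ratio
# approach (Equation 7.40). TPooled and TLR are the two averaged
# statistics described in the text; the numbers are hypothetical.
def ariv_lr(t_pooled, t_lr, M, Q):
    return (M + 1) / (Q * (M - 1)) * (t_pooled - t_lr)

ariv = ariv_lr(t_pooled=72.0, t_lr=71.5, M=100, Q=3)
```

In Meng and Rubin's (1992) formulation, the resulting quantity then enters the final test statistic, which divides TPooled by Q(1 + ARIVLR) and is referenced to an F distribution.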
TD2 = [χ̄² − ((M + 1)/(M − 1)) × Q × ARIVD2] / (1 + ARIVD2)   (7.42)

where χ̄² is the arithmetic average of the M chi-square statistics, and the average relative
increase in variance is
ARIVD2 = (1 + 1/M) × (1/(M − 1)) Σm=1…M (√χ²m − avg(√χ²))²   (7.43)
A probability value for the pooled chi-square test is obtained by referencing TD2 to a chi-
square distribution with Q degrees of freedom or by referencing TD2 ÷ Q to an approxi-
mate F distribution with Q numerator degrees of freedom and dfD2 denominator degrees
of freedom (Li, Meng, et al., 1991).
dfD2 = Q^(−3/M) × (M − 1) × (1 + 1/ARIVD2)²   (7.44)
Like other classic degrees of freedom expressions, dfD2 can exceed the sample size, but
this feature probably doesn’t matter if the sample size is reasonably large (e.g., greater
than 200).
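The D2 computations in Equations 7.42 through 7.44 can be sketched as follows; the chi-square values below are hypothetical.

```python
import numpy as np
from scipy import stats

# Pooling M complete-data chi-square statistics in the spirit of
# Equations 7.42-7.44 (Li, Meng, et al., 1991). Inputs are hypothetical.
def pool_chi_square(chi2s, Q):
    chi2s = np.asarray(chi2s, dtype=float)
    M = len(chi2s)
    roots = np.sqrt(chi2s)
    ariv = (1 + 1 / M) * roots.var(ddof=1)                            # Eq. 7.43
    td2 = (chi2s.mean() - Q * (M + 1) / (M - 1) * ariv) / (1 + ariv)  # Eq. 7.42
    df = Q ** (-3 / M) * (M - 1) * (1 + 1 / ariv) ** 2                # Eq. 7.44
    p = stats.f.sf(td2 / Q, Q, df)
    return td2, ariv, df, p

td2, ariv, df, p = pool_chi_square([62.1, 58.4, 65.0, 60.2, 59.8], Q=3)
```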
Analysis Example
Returning to the moderated regression analysis, I used the three test statistics to evaluate
the null hypothesis that R2 equals 0. The F versions of the test statistics (i.e., T ÷ Q)
were FW(3, 240.79) = 23.74, FLR(3, 15652.44) = 20.62, and FD2(3, 1895.75) =
20.79, and all probability values were less than .001. The corresponding average relative
increase in variance values were ARIVW = 0.12, ARIVLR = 0.15, and ARIVD2 = 0.26.
A few broad brushstroke observations are evident. First, the three test statistics are
quite similar to one another, although the Wald statistic is slightly larger. Simulation
studies suggest that the Wald test may be more powerful than the likelihood ratio statistic
in some cases (Grund et al., 2016c; Liu & Enders, 2017), but power differences probably
do not account for this gap, because the effect size is relatively large. Second, the
degrees of freedom are dramatically different, because I applied Reiter’s (2007) small-
sample adjustment for the Wald test, and I used expressions from Li, Raghunathan, et al.
(1991) and Li, Meng, et al. (1991) for the likelihood ratio and D2 statistics, respectively.
The large degrees of freedom values for the latter approaches are troubling, because
they greatly exceed the sample size (i.e., missing values appear to increase rather than
decrease the amount of information in the data). That said, the choice of reference dis-
tribution doesn’t appear to matter with a sample size this large (Liu & Enders, 2017).
Third, the often-derided TD2 statistic is well calibrated to the others, particularly the
likelihood ratio test. This is not entirely unexpected, as results from Grund et al. (2016c)
suggest that the test performs well when the rates of missing information are not too
high and auxiliary information is available (as is the case here). The average relative
increase in variance for TD2 is dramatically higher than the others, but this difference
doesn’t appear to impact the test statistic itself.
Maximum likelihood and Bayesian estimation are direct estimators that extract the
model parameters of interest directly from the observed data. In contrast, multiple
imputation puts the filled-in data front and center, and the goal is to create suitable
imputations for later analysis. A typical application comprises three steps: Create several
filled-in data sets, analyze the completed data, and pool and test the estimates. The first
step co-opts the MCMC algorithms from Chapters 5 and 6, and the analysis and pool-
ing stages use “Rubin’s rules” (Little & Rubin, 2020; Rubin, 1987) to combine estimates
and standard errors into a single package of results. Given the same input variables and
assumptions, multiple imputation estimates are usually indistinguishable from those of
maximum likelihood or Bayesian estimation.
As an organizational tool, I classified multiple imputation procedures into two
buckets according to the degree of similarity between the imputation and analysis mod-
els: An agnostic imputation strategy deploys a model that differs from the substantive
analysis, and a model-based imputation procedure invokes the same focal model as the
secondary analysis (perhaps with additional auxiliary variables). These classifications
emphasize that an analysis model’s composition—in particular, whether it includes non-
linear effects such as interactions, polynomial terms, or random effects—determines
the type of imputation strategy that works best. Model-based imputation is usually ideal
for these types of nonlinearities, whereas agnostic imputation is well suited for analyses
that do not include these special features. This distinction continues to be important in
Chapter 8, which covers multilevel missing data. As you will see, random coefficients
are yet another type of nonlinearity that requires a model-based missing data-handling
strategy. Finally, I recommend the following articles for readers who want additional
details on topics from this chapter:
Bartlett, J. W., Seaman, S. R., White, I. R., & Carpenter, J. R. (2015). Multiple imputation of
covariates by fully conditional specification: Accommodating the substantive model. Statisti-
cal Methods in Medical Research, 24, 462–487.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical
Association, 91, 473–489.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems:
A data analyst’s perspective. Multivariate Behavioral Research, 33, 545–571.
Scheuren, F. (2005). Multiple imputation: How it began and continues. American Statistician,
59, 315–319.
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional
specification. Statistical Methods in Medical Research, 16, 219–242.
van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully
conditional specification in multivariate imputation. Journal of Statistical Computation and
Simulation, 76, 1049–1064.
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psycho-
logical Methods, 22, 649–666.
8

Multilevel Missing Data
multilevel model. The chapter also describes multilevel extensions of joint model impu-
tation, fully conditional specification, and maximum likelihood.
To begin, I use the math problem-solving data set from the companion website to illus-
trate Bayesian missing data handling for a two-level regression model. The data come
from an educational experiment where J = 29 schools were randomly assigned to an
experimental or comparison condition. There was an average of nj = 33.86 students per
school. Following the parlance of the multilevel modeling literature, students are level-1
units and schools are level-2 units (or clusters). The comparison condition (i.e., control
schools) implemented the district’s standard mathematics curriculum, and the interven-
tion schools implemented a new curriculum designed to enhance math problem-solving
skills. The dependent variable is an end-of-year math problem-solving assessment with
item response theory (IRT)-scaled scores ranging between 37 and 65. The data set and
the variable definitions are described in the Appendix.
A key feature of hierarchical data is that variation and covariation can exist at both
levels of the data hierarchy. Applied to the educational data, student-level variables such
as problem-solving test scores naturally vary across individuals within a given school,
and schools also differ in their average levels of these variables. The dependent variable’s
intraclass correlation is approximately .26, meaning that school-level mean differences
in average problem-solving account for roughly 26% of the total variation; this value is
typical for cluster-randomized designs (Hedges & Hedberg, 2007; Spybrook et al., 2011).
It is important to keep in mind that level-1 regressors can possess the same sources of
variation. As you will see, Bayesian estimation creates missing values that preserve this
important feature of the data.
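As a quick arithmetic check on the intraclass correlation, the sketch below uses hypothetical variance components chosen to reproduce an ICC of .26.

```python
# Intraclass correlation: between-cluster variance as a share of the
# total variance. The variance components are hypothetical values
# chosen only to reproduce an ICC of .26.
def icc(between_var, within_var):
    return between_var / (between_var + within_var)

value = icc(between_var=13.0, within_var=37.0)  # 13 / 50 = .26
```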
As the name implies, a random intercept model is a regression with group-specific
intercept coefficients. These models are noteworthy, because they are amenable to a
variety of missing data-handling options, including agnostic imputation schemes and
maximum likelihood estimation. As a starting point, consider a model that features
standardized math scores as a student-level predictor and teacher experience (in years)
as a school-level covariate. The within-cluster regression model describes score varia-
tion among students in the same school. The model and its generic counterpart are
PROBSOLVEij = β0j + β1(STANMATHij) + εij   (8.1)
Yij = β0j + β1X1ij + εij
εij ~ N1(0, σ²ε)
where Yij represents the outcome score for student i in school j, X1ij is that student’s
covariate value (e.g., standardized math test score), β0j is the random intercept coef-
ficient for school j, β1 is a common slope coefficient, and εij is a within-cluster residual
that captures unexplained variation in the dependent variable. Residuals are normally
distributed by assumption with constant variation across all schools. To illustrate the
equation, the dashed lines in Figure 8.1 depict the group-specific regression lines for
the 29 schools, and the solid line is the average trajectory. The association between
standardized math test scores and problem solving is constant across schools, but the
vertical separation of the regression lines reflects between-cluster variation in the aver-
age levels of problem solving (i.e., random intercept variation).
Because multilevel models view level-2 groups (e.g., schools) as a random sample
from a larger population of higher-level clusters, each school-specific mean or inter-
cept β0j functions as a latent variable or random effect with its own distribution. The
between-cluster model expresses these school-level differences (the vertical elevation of
the regression lines) as a function of teacher experience, as follows:
β0j = β0 + β2(TEACHEXPj) + b0j   (8.2)
β0j = β0 + β2X2j + b0j
b0j ~ N1(0, σ²b0)
where β0 is the mean intercept across all schools, β2 gives the influence of the level-2
covariate (average years of teaching experience for school j) on the average level of the
dependent variable, and b0j is a residual capturing unexplained between-cluster varia-
[Figure 8.1: end-of-year problem solving (y-axis, 30 to 80) plotted against standardized math achievement (x-axis, 0 to 100).]
FIGURE 8.1. Within-cluster regressions for 29 schools. The vertical separation of the regres-
sion lines reflects random intercept variation, and all schools share a common slope.
tion. The level-2 residuals are also normal by assumption, and a between-cluster
variance σ²b0 quantifies this variation.
Finally, replacing the β0j term from the within-cluster model with the right side of
its level-2 equation combines the two models into a single equation.
PROBSOLVEij = β0 + β1(STANMATHij) + β2(TEACHEXPj) + b0j + εij   (8.3)
Yij = (β0 + b0j) + β1X1ij + β2X2j + εij = E(Yij | X1ij, X2j) + εij
Yij ~ N1(E(Yij | X1ij, X2j), σ²ε)
The bottom expression says that dependent variable scores are normally distributed
around predicted values (the E(Yij | X1ij, X2j) term) that encode the school-specific
intercepts. Visually, these predictions correspond to points on the cluster-specific regression
lines in Figure 8.1.
f(Y, X1, X2) = f(Y | X1, X2) × f(X1 | X2) × f(X2)

The first term is the normal distribution induced by the focal analysis (see Equation
8.3), the second term is the level-1 predictor's model, and the third term is the marginal
(overall) distribution of the level-2 predictor. It is important to order the variables so
level-1 predictors condition on level-2 predictors, because higher-level variables can pre-
dict lower-level variables but not vice versa.
The generic expressions from the previous factorization translate into a random
intercept model for the level-1 predictor and a single-level regression model for the
level-2 covariate. The level-1 predictor model is
X1ij = γ01 + γ11(X2j) + g01j + r1ij
r1ij ~ N1(0, σ²r1)    g01j ~ N1(0, σ²g01)
where the γ’s are regression coefficients, g01j is a random intercept residual that captures
unexplained between-school variation in the average math scores (this variable’s intra-
class correlation is approximately .38), and r1ij is a within-cluster deviation that reflects
test score variation among students in the same school. The level-2 covariate model is an
empty single-level regression that features a grand mean and a between-cluster devia-
tion score.
X2j = μ2 + r2j
r2j ~ N1(0, σ²r2)
A partially factored model instead assigns a multivariate distribution to the explan-
atory variables (Enders et al., 2020; Goldstein et al., 2014). The generic factorization for
the problem-solving analysis is as follows:
f(Y, X1, X2) = f(Y | X1, X2) × f(X1, X2)

The multivariate distribution on the right is a two-part normal distribution with level-1
and level-2 components. This specification decomposes level-1 predictors into a within-
cluster deviation involving a score and a group mean, and a between-cluster deviation
between a group mean and the grand mean.
X1ij = μ1 + (μ1j − μ1) + (X1ij − μ1j)   (8.8)
It is important to highlight that μ1j is the latent group mean for cluster j (e.g., a latent
estimate of a school’s average achievement) rather than a deterministic arithmetic aver-
age. Accordingly, the within-cluster regression model expresses the level-1 regressor
(e.g., standardized math test scores) as deviations around the level-2 latent group means.
X1ij = μ1j + r1ij(W)   (8.9)
X1ij ~ N1(μ1j, σ²r1(W))
The alphanumeric subscript on the residual highlights that r1ij(W) measures within-
cluster variation.
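The decomposition in Equation 8.8 can be illustrated numerically. The sketch below uses arithmetic cluster means as stand-ins for the latent group means, which is only an approximation to the latent-mean formulation described above; the scores are hypothetical.

```python
import numpy as np

# Equation 8.8 numerically: each level-1 score splits into the grand
# mean, a between-cluster deviation, and a within-cluster deviation.
# Arithmetic cluster means approximate the latent group means here.
x = np.array([4.0, 6.0, 5.0, 9.0, 11.0, 10.0])  # hypothetical level-1 scores
cluster = np.array([0, 0, 0, 1, 1, 1])          # cluster membership

grand = x.mean()                                       # mu_1
means = np.array([x[cluster == j].mean() for j in (0, 1)])
between = means[cluster] - grand                       # mu_1j - mu_1
within = x - means[cluster]                            # X_1ij - mu_1j
reconstructed = grand + between + within               # recovers x exactly
```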
The level-1 predictor correlates with the level-2 covariate (e.g., teacher experience)
via its latent group mean in the between-cluster model, which comprises two empty
regression equations with correlated residuals.
[ μ1j ]   [ μ1 ]   [ r1j(B) ]
[ X2j ] = [ μ2 ] + [ r2j(B) ]   (8.10)

Xj(B) ~ N2( [μ1, μ2]′, Σ(B) ),   Σ(B) = [ σ²r1(B)   σr1r2(B) ; σr2r1(B)   σ²r2(B) ]
The bottom expression says that the latent means and level-2 scores are normally distrib-
uted around their grand means, and the variables covary according to the off-diagonal
element in the between-cluster covariance matrix, Σ(B). The pair of round-robin linear
regressions below expresses the same associations in conditional form:
μ1j = μ1 + γ11(X2j − μ2) + r1j(B)   (8.11)
X2j = μ2 + γ12(μ1j − μ1) + r2j(B)
As mentioned elsewhere, the distinction between Equations 8.10 and 8.11 is an algo-
rithmic nuance that has no practical impact on analysis results. Finally, notice that the
partially factored specification is ideally suited for models with centered predictors,
because the grand and group means are natural by-products of estimation. Centering is
substantially more complicated with a sequential specification.
The posterior predictive distribution of a missing outcome score is a normal curve cen-
tered at the predicted value from the focal model:

Y_{ij(mis)} \sim N_1\left( E(Y_{ij} \mid X_{1ij}, X_{2j}), \sigma^2_\varepsilon \right) \qquad (8.12)
To illustrate imputation more concretely, Figure 8.2 shows the distribution of missing
outcome scores for three students who belong to different schools. The solid black circles
are predicted values (the expected value in the first argument of the normal distribution
function), and the spread of the normal curves reflects within-school residual variance.
The candidate imputations fall directly on the vertical lines, but I added horizontal jitter
to emphasize that more scores are located near the center of each distribution. Conceptually,
the MCMC algorithm generates an imputation by randomly selecting a value from the
cluster of candidate scores (technically, the imputations can fall anywhere in the normal
distribution).
Turning to the regressors, the MCMC algorithm draws imputations from the condi-
tional distribution of an incomplete predictor given all other analysis variables. To illus-
trate, consider X1 (e.g., standardized math scores). Applying rules of probability reveals
that this distribution is proportional to the product of two univariate distributions, each
of which aligns with one of the previous regression models.
f ( X1 | Y , X 2 ) ∝ f ( Y | X1 , X 2 ) × f ( X1 | X 2 ) (8.13)
FIGURE 8.2. Distribution of missing outcome scores for three students who belong to dif-
ferent schools. The solid black circles are predicted values, and the spread of the normal curves
reflects within-school residual variance. The candidate imputations fall directly on the vertical
lines, but horizontal jitter is added to emphasize that more scores are located near the center of
each distribution.
Dropping unnecessary scaling terms and substituting the kernels of the distribu-
tion functions from Equation 8.3 and 8.9 into the right side of the factorization gives the
following expression:
f(Y_{ij} \mid X_{1ij}, X_{2j}) \times f(X_{1ij} \mid X_{2j}) \propto
\exp\left( -\frac{1}{2} \frac{\left( Y_{ij} - (\beta_{0j} + \beta_1 X_{1ij} + \beta_2 X_{2j}) \right)^2}{\sigma^2_\varepsilon} \right) \times \exp\left( -\frac{1}{2} \frac{(X_{1ij} - \mu_{1j})^2}{\sigma^2_{r_1(W)}} \right) \qquad (8.14)
2
Deriving the conditional distribution of X1 involves multiplying the two normal
curve equations and performing algebra that combines the component functions into a
single distribution for X1. The result is a normal curve with two-part mean and variance
expressions that depend on the focal and regressor model parameters.
f(X_{1ij(mis)} \mid Y_{ij}, X_{2j}) = N_1\left( E(X_{1ij} \mid Y_{ij}, X_{2j}), \operatorname{var}(X_{1ij} \mid Y_{ij}, X_{2j}) \right) \qquad (8.15)

E(X_{1ij} \mid Y_{ij}, X_{2j}) = \operatorname{var}(X_{1ij} \mid Y_{ij}, X_{2j}) \times \left( \frac{\mu_{1j}}{\sigma^2_{r_1(W)}} + \frac{\beta_1 \left( Y_{ij} - (\beta_{0j} + \beta_2 X_{2j}) \right)}{\sigma^2_\varepsilon} \right)

\operatorname{var}(X_{1ij} \mid Y_{ij}, X_{2j}) = \left( \frac{1}{\sigma^2_{r_1(W)}} + \frac{\beta_1^2}{\sigma^2_\varepsilon} \right)^{-1}
In fact, the distribution’s structure is identical to the one for linear regression models
back in Section 5.3 (e.g., see Equation 5.12). The main difference is that the distribution
includes group-specific latent variables or random effects that capture between-group
differences (i.e., μ1j and β0j).
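To make the two-part mean and variance concrete, the sketch below draws one imputation from the conditional distribution in Equation 8.15. All parameter values are hypothetical placeholders, not estimates from the problem-solving data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical current parameter values for one student with missing X1
beta0_j, beta1, beta2 = 45.0, 0.4, 0.3  # focal model coefficients (beta0_j absorbs b0j)
sigma2_eps = 25.0                       # focal model residual variance
mu_1j = 52.0                            # latent group mean for cluster j
sigma2_r1w = 100.0                      # within-cluster variance of X1
y_ij, x2_j = 60.0, 10.0                 # observed outcome and level-2 covariate

# Equation 8.15: precision-weighted combination of the two normal kernels
var_x1 = 1.0 / (1.0 / sigma2_r1w + beta1**2 / sigma2_eps)
mean_x1 = var_x1 * (mu_1j / sigma2_r1w
                    + beta1 * (y_ij - (beta0_j + beta2 * x2_j)) / sigma2_eps)

# MCMC draws the imputation from this conditional normal distribution
imputation = rng.normal(mean_x1, np.sqrt(var_x1))
```

Notice that conditioning on the outcome shrinks the imputation variance below the marginal within-cluster variance, just as in the single-level case from Chapter 5.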
Next, consider the distribution of the level-2 predictor, X2 (e.g., school-average
teacher experience). Working from the partially factored specification, the conditional
distribution of missing values is the product of two univariate distributions (the sequen-
tial specification yields a triple product).
f ( X 2 | Y , X1 ) ∝ f ( Y | X1 , X 2 ) × f ( X 2 | X1 ) (8.16)
The first term corresponds to the focal regression, and the second term corresponds to
the between-cluster regression from Equation 8.11. A level-2 predictor like X2 is com-
mon to all members of a given level-2 cluster (e.g., all students within a given school
share the same teacher experience value). To accommodate this feature, the analysis
model’s contribution to the distribution of missing values repeats nj times, once for each
observation in group j.
f(X_{2j(mis)} \mid Y_{ij}, X_{1ij}) \propto \prod_{i=1}^{n_j} N_1\left( E(Y_{ij} \mid X_{1ij}, X_{2j}), \sigma^2_\varepsilon \right) \times N_1\left( E(X_{2j} \mid \mu_{1j}), \sigma^2_{r_2(B)} \right)

\propto \prod_{i=1}^{n_j} \exp\left( -\frac{1}{2} \frac{\left( Y_{ij} - (\beta_{0j} + \beta_1 X_{1ij} + \beta_2 X_{2j}) \right)^2}{\sigma^2_\varepsilon} \right) \times \exp\left( -\frac{1}{2} \frac{\left( X_{2j} - (\mu_2 + \gamma_{12}(\mu_{1j} - \mu_1)) \right)^2}{\sigma^2_{r_2(B)}} \right) \qquad (8.17)
Returning to form, the Metropolis–Hastings algorithm is a convenient tool for drawing
imputations from complex functions like the previous one, because it works from the
simpler component distributions, in this case a pair of normal curves.
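A minimal sketch of one such Metropolis–Hastings update, built from the two log-kernels in Equation 8.17. All parameter values, the data, and the random-walk proposal scale are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

def log_density(x2, y, x1, beta0_j, beta1, beta2, sigma2_eps,
                mu2, gamma12, mu_1j, mu1, sigma2_r2b):
    """Log of Equation 8.17: n_j focal-model kernels times one level-2 kernel."""
    resid_y = y - (beta0_j + beta1 * x1 + beta2 * x2)
    focal = -0.5 * np.sum(resid_y**2) / sigma2_eps
    resid_x2 = x2 - (mu2 + gamma12 * (mu_1j - mu1))
    level2 = -0.5 * resid_x2**2 / sigma2_r2b
    return focal + level2

# Hypothetical values for one school with a missing X2 (teacher experience)
y = np.array([55.0, 61.0, 48.0])   # outcomes for the n_j = 3 students
x1 = np.array([50.0, 58.0, 44.0])  # level-1 predictor scores
params = dict(beta0_j=30.0, beta1=0.4, beta2=0.5, sigma2_eps=25.0,
              mu2=10.0, gamma12=0.2, mu_1j=52.0, mu1=50.0, sigma2_r2b=4.0)

# One Metropolis-Hastings step: propose from a random walk, then accept
# with probability equal to the density ratio
current = 10.0
proposal = current + rng.normal(0, 1.5)
log_ratio = (log_density(proposal, y, x1, **params)
             - log_density(current, y, x1, **params))
if np.log(rng.uniform()) < log_ratio:
    current = proposal
```

Because only the ratio of densities matters, the normalizing constants of the two normal kernels drop out, which is precisely why working from the component distributions is convenient.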
MCMC Algorithm
The posterior distribution for a multilevel analysis is a complicated multivariate func-
tion describing the relative probability of different combinations of model parameters,
random effects, latent group means, and missing values given the observed data. The
core logic of Bayesian estimation and MCMC algorithms readily extends to multilevel
data: Estimate one unknown at a time (e.g., parameter, latent variable, missing score),
holding all other quantities at their current values. The generic MCMC recipe for a mul-
tilevel regression model is as follows:
Assign starting values to all parameters, random effects, and missing values.
Do for t = 1 to T iterations.
> Estimate the focal model’s parameters, given everything else.
> Estimate the focal model's random effects, given everything else.
> Estimate each predictor model's parameters, given everything else.
> Estimate each predictor model’s random effects, given everything else.
> Impute the dependent variable given the focal model parameters.
> Impute each predictor, given the focal and supporting models.
Repeat.
The full conditional distributions for MCMC estimation are widely available in the lit-
erature (Browne, 1998; Browne & Draper, 2000; Enders et al., 2020; Goldstein et al.,
2009; Kasim & Raudenbush, 1998; Lynch, 2007; Schafer & Yucel, 2002; Yucel, 2008). In
the interest of space, I point readers to these sources for additional details.
Analysis Example
Expanding on the math problem-solving analysis, I used Bayesian estimation to fit a
random intercept regression model with three level-1 predictors (a pretest measure of
math problem solving, standardized math test scores, and a binary indicator of whether
a student is eligible for free or reduced-priced lunch) and a pair of level-2 predictors
(average years of teacher experience, and the treatment assignment indicator).
PROBSOLVE_{ij} = \beta_0 + b_{0j} + \beta_1(PRETEST_{ij}) + \beta_2(STANMATH_{ij}) + \beta_3(FRLUNCH_{ij}) + \beta_4(TEACHEXP_j) + \beta_5(CONDITION_j) + \varepsilon_{ij} \qquad (8.18)
The β5 coefficient is of particular interest, as this parameter represents the mean dif-
ference for intervention schools (0 = comparison school, 1 = intervention school), con-
trolling for student- and school-level covariates. The pretest and treatment assignment
indicators are complete, but the remaining variables have missing values; the missing
data rates are 20.5% for the dependent variable, 7.3% for the standardized math test,
4.7% for the lunch assistance indicator, and 10.3% for the school-level teacher experi-
ence variable.
Either the sequential or the partially factored specification is appropriate for the
explanatory variables. The sequential specification invokes a product of univariate dis-
tributions, one for each incomplete predictor. The partially factored specification
instead assigns a multivariate normal distribution to the within-cluster parts of the
level-1 variables:

X_{ij(W)} \sim N_3\left( \boldsymbol{\mu}_j, \Sigma_{(W)} \right)

A diagonal element of the within-cluster covariance matrix is fixed at 1 to establish a
metric for the latent lunch assistance indicator, and the model also incorporates a fixed
threshold parameter for this variable. The model can be simplified by moving the com-
plete pretest scores to the right side of the equation as a predictor of the two incomplete
variables (i.e., treat pretest scores as known constants).
As explained elsewhere, modeling the multivariate normal distribution directly
is not straightforward when a covariance matrix contains a fixed constant. The set of
round-robin linear regression equations below is an alternative way to parameterize the
within-cluster covariance matrix that avoids estimation difficulties associated with a
matrix that contains a fixed constant:
PRETEST_{ij} = \mu_{1j} + \gamma_{11}(STANMATH_{ij} - \mu_{2j}) + \gamma_{21}(FRLUNCH^*_{ij} - \mu_{3j}) + r_{1ij} \qquad (8.22)
The level-1 predictors condition on level-2 covariates via their latent group means
in the between-cluster model, which now comprises five empty regression equations
with correlated residuals.
X_{j(B)} = \begin{pmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ TEACHEXP_j \\ CONDITION^*_j \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{pmatrix} + \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \\ r_{3j(B)} \\ r_{4j(B)} \\ r_{5j(B)} \end{pmatrix} \qquad (8.23)

X_{j(B)} \sim N_5\left( \boldsymbol{\mu}, \Sigma_{(B)} \right)
A diagonal element of the between-cluster covariance matrix is also fixed at 1 to estab-
lish a metric for the latent treatment condition indicator, and the model also incorpo-
rates a fixed threshold parameter for this variable. This model can also be parameterized
as a set of round-robin regressions like Equation 8.11.
The potential scale reduction factors (Gelman & Rubin, 1992) from a preliminary
diagnostic run indicated that the MCMC algorithm converged in fewer than 1,000 itera-
tions, so I used 12,000 total iterations with a conservative burn-in period of 2,000 itera-
tions. The same analysis that generates Bayesian summaries of the model parameters
can also generate model-based multiple imputations for a frequentist analysis. To illus-
trate, I created M = 100 imputations by saving the filled-in data from the final iteration
of 100 parallel MCMC chains, each with 2,000 iterations. As explained previously, auto-
correlated imputations are not a concern with this approach.
After creating the multiple imputations, I used restricted maximum likelihood to fit
the random intercept regression model to each data set and applied Rubin’s (1987) rules
to pool the parameter estimates and standard errors. The Barnard and Rubin (1999)
small-sample degrees of freedom adjustment for the t-statistic requires the complete-
data degrees of freedom value as an input (see Equations 7.21 and 7.22). Following the
hierarchical linear modeling (HLM) software package (Raudenbush, Bryk, Cheong, &
Congdon, 2019), I used the number of schools minus the number of predictors minus 1
as the degrees of freedom for all coefficients (i.e., dfcom = 29 – 5 – 1). Analysis scripts are
on the companion website.
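As a sketch, Rubin's rules and the Barnard–Rubin degrees of freedom adjustment take only a few lines. The per-imputation estimates below are simulated stand-ins, not the actual analysis results:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical estimates and squared standard errors for one coefficient
# across M = 100 imputed data sets
M, df_com = 100, 23              # df_com = 29 schools - 5 predictors - 1
q = rng.normal(2.15, 0.30, M)    # per-imputation point estimates
u = np.full(M, 0.80**2)          # per-imputation sampling variances

qbar = q.mean()                  # pooled point estimate
w = u.mean()                     # within-imputation variance
b = q.var(ddof=1)                # between-imputation variance
t_var = w + (1 + 1 / M) * b      # total variance (Rubin, 1987)
se = np.sqrt(t_var)

# Barnard & Rubin (1999) small-sample degrees of freedom
lam = (1 + 1 / M) * b / t_var    # fraction of missing information
nu_old = (M - 1) / lam**2
nu_obs = (df_com + 1) / (df_com + 3) * df_com * (1 - lam)
df_adj = 1 / (1 / nu_old + 1 / nu_obs)
```

The adjusted value can never exceed the complete-data degrees of freedom, which is the point of the correction in small level-2 samples.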
Table 8.1 gives the posterior summaries from the Bayesian analysis, and Table 8.2
summarizes the multiple imputation point estimates and standard errors. The primary
focus is the β5 coefficient, which indicates that intervention schools scored 2.15 points
higher than control group schools on average, controlling for student- and school-level
covariates. It probably comes as no surprise that the Bayesian and frequentist results are
numerically similar, as we’ve seen numerous examples of this throughout the book. The
one difference was the intercept variance estimate, which was somewhat larger in the
Bayes analysis. This parameter was sensitive to the choice of prior distribution, which isn’t
necessarily surprising given the relatively small number of level-2 units (Gelman, 2006).
To explore the influence of different prior distributions, I implemented three inverse
gamma prior distributions for the random intercept variance: an improper prior that sub-
tracts two degrees of freedom and adds 0 to the sum of squares (Asparouhov & Muthén,
2010a), a more informative prior that adds two degrees of freedom and a value of 1 to the
sum of squares, and the Jeffreys prior described in Section 4.5. A practical way to gauge the
impact of the prior distribution is to express the random intercept variance as a variance-
explained effect size (Rights & Sterba, 2019). Across the three priors, the intercept vari-
ance captured between 10.5 and 12.6% of the total variation in the outcome. I suspect
that most researchers would not find this variability practically meaningful, so Table 8.1
reports the results with an improper prior, which is the default in some software packages
(the Jeffreys prior brought the posterior median closer to the frequentist point estimate).
Multilevel modeling textbooks recommend inspecting level-2 residuals to identify
possible model misspecifications (Raudenbush & Bryk, 2002, Ch. 9; Snijders & Bosker,
2012, Ch. 10). Bayesian analyses and model-based imputation are ideally suited for this
purpose, because the MCMC algorithm estimates the between-cluster residuals at every
iteration. As such, you can treat the random effects as multiply imputed latent data
and save them for further inspection. To illustrate, Figure 8.3 shows the estimated dis-
tributions of the random intercept residuals (i.e., the b0j terms in Equation 8.18) from
the 100 imputed data sets. The residuals are normal by assumption, and the empirical
distributions are a reasonably good match to that ideal (skewness and excess kurtosis
were –0.28 and –0.25, respectively). Quantile–quantile (Q-Q) plots are another option
for evaluating normality, and graphing the residuals against other variables can reveal
certain types of model misspecification.
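The residual check described above amounts to stacking the saved random effects and computing moment-based summaries. A sketch with simulated stand-ins for the saved b0j draws:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for the saved random intercept residuals:
# 100 imputations x 29 schools (hypothetical values)
b0 = rng.normal(0, 2.0, size=(100, 29))

# Standardize, then compute moment-based skewness and excess kurtosis
resid = b0.ravel()
z = (resid - resid.mean()) / resid.std()
skewness = (z**3).mean()
excess_kurtosis = (z**4).mean() - 3
```

Values near zero for both summaries are consistent with the normality assumption; large departures (or patterns in a Q-Q plot) flag possible misspecification.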
Changing substantive contexts, I use the two-level daily diary data set from the com-
panion website to illustrate Bayesian estimation for a random coefficient regression
model. The data come from a health psychology study in which J = 132 participants with
chronic pain provided up to nj = 21 daily assessments of mood, sleep quality, and pain
severity. The data set also includes several person-level demographic and background
variables (e.g., educational attainment, gender, number of diagnosed physical ailments,
activity level) and psychological correlates of pain severity (e.g., pain acceptance, cata-
strophizing). The structure of the data set is now quite different, as daily measurements
are level-1 units and persons are level-2 units (or clusters). The data set and the variable
definitions are described in the Appendix.
As explained previously, variation and covariation can exist at both levels of a mul-
tilevel hierarchy. Applied to the diary data, daily assessments (e.g., mood, pain, and sleep
quality) naturally vary within a given person, and individuals also differ in their average
levels of these variables. The dependent variable, positive affect, has an intraclass corre-
lation equal to .63, meaning that person-level mean differences in average mood scores
account for roughly 63% of the total variation. This value is typical of repeated measures
data and is substantially larger than that of the student-level problem-solving scores
from the previous example. The daily predictors also have substantial between-person
variation. As explained previously, the posterior predictive distributions of the missing
values preserve this important feature of the data.
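The intraclass correlation is simply the between-cluster share of the total variance. With hypothetical variance components chosen to match the .63 reported above:

```python
# Intraclass correlation: between-person variance over total variance.
# The variance components below are hypothetical values chosen to
# reproduce the ICC of .63 reported in the text.
tau2 = 1.9    # between-person (level-2) variance
sigma2 = 1.1  # within-person (level-1) variance

icc = tau2 / (tau2 + sigma2)  # ~0.63
```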
A random coefficient model (also called a random slope model) is a multilevel
regression in which the influence of one or more level-1 explanatory variables varies
across level-2 units. As a starting point, consider a model that features daily pain ratings
as a level-1 predictor of daily positive affect and individual-level average pain and pain
acceptance as level-2 predictors of mean positive affect. The within-cluster regression
model describes daily score variation among affect scores from the same individual. The
model and its generic counterpart are as follows:
POSAFFECT_{ij} = \beta_{0j} + \beta_{1j}(PAIN_{ij} - \mu_{1j}) + \varepsilon_{ij} \qquad (8.24)
Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \mu_{1j}) + \varepsilon_{ij}
\varepsilon_{ij} \sim N_1(0, \sigma^2_\varepsilon)
The j subscript on the intercept and slope coefficients signifies that these quantities
vary across persons (level-2 units). To illustrate, the dashed lines in Figure 8.4 depict
the within-cluster regression lines for 25 individuals, and the solid line is the average
trajectory. Unlike Figure 8.3, there is considerable heterogeneity in both the intercept
and slope coefficients; the slopes are mostly negative, but some persons exhibit stronger
associations than others, and some individuals even have positive slopes.
Centering the daily pain ratings at the cluster (person) means isolates daily fluctua-
tions around a participant’s chronic pain level (i.e., group mean centering or centering
within context; Enders & Tofighi, 2007; Kreft, de Leeuw, & Aiken, 1995), such that
β1j is a “pure” within-person effect. This centering scheme also defines β0j as an indi-
vidual’s average positive affect. Importantly, μ1j is a normally distributed latent mean
rather than a deterministic arithmetic average. Recent methodological research favors
this approach, as modeling cluster-level quantities as latent variables can reduce bias in
some situations (Hamaker & Muthén, 2020; Lüdtke et al., 2008).
The between-cluster part of the model features the latent group means and pain
acceptance scores as predictors of average affect, as follows:
\beta_{0j} = \beta_0 + \beta_2(\mu_{1j} - \mu_1) + \beta_3(PAINACCEPT_j - \mu_2) + b_{0j} \qquad (8.25)
\beta_{0j} = \beta_0 + \beta_2(\mu_{1j} - \mu_1) + \beta_3(X_{2j} - \mu_2) + b_{0j}
\beta_{1j} = \beta_1 + b_{1j}
Centering the regressors at their grand means defines β0 as the grand mean (the mean
of the individual means), and b0j is a between-person residual that captures unexplained
variation in average positive affect. The random slope equation says that an individual's
coefficient is a function of the grand mean slope plus a person-specific deviation. The
regressors do not predict individual slopes, because that would change their status to
moderator variables (see Section 8.4). By assumption, b0j and b1j are bivariate normal
with a between-cluster covariance matrix Σb.
Finally, substituting the right sides of the β0j and β1j expressions into the within-
cluster model gives the following reduced-form equation:
POSAFFECT_{ij} = \beta_{0j} + \beta_{1j}(PAIN_{ij} - \mu_{1j}) + \beta_2(\mu_{1j} - \mu_1) + \beta_3(PAINACCEPT_j - \mu_2) + \varepsilon_{ij} \qquad (8.26)
Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \mu_{1j}) + \beta_2(\mu_{1j} - \mu_1) + \beta_3(X_{2j} - \mu_2) + \varepsilon_{ij} = E(Y_{ij} \mid X_{1ij}, X_{2j}) + \varepsilon_{ij}
Y_{ij} \sim N_1\left( E(Y_{ij} \mid X_{1ij}, X_{2j}), \sigma^2_\varepsilon \right)
The expected value is a predicted score from a cluster-specific regression line, and the
bottom expression says that dependent variable scores are normally distributed around
these points. This normal curve is also the posterior predictive distribution of the miss-
ing positive affect scores.
FIGURE 8.4. The dashed lines depict the within-cluster regression lines for 25 individuals,
and the solid line is the average trajectory.
A good deal of recent research has focused on missing data handling for random
coefficient models (Enders et al., 2020; Enders, Hayes, & Du, 2018; Enders, Keller, et
al., 2018; Erler et al., 2016, 2019; Grund et al., 2016a; Grund, Lüdtke, & Robitzsch, 2018;
Kunkle & Kaizer, 2017; Lüdtke, Robitzsch, & Grund, 2017). These models are challeng-
ing, because they feature the product of a level-1 predictor variable and a level-2 latent
variable (e.g., the product of β1j and X1ij). Not all missing data-handling procedures
appropriately accommodate this nonlinearity when the regressor is incomplete (e.g.,
current maximum likelihood estimators are prone to substantial biases), but pairing
factored regression models with Bayesian estimation or model-based multiple imputa-
tion provides a straightforward and familiar solution.
The factored regression specification for the random coefficient analysis again pairs the
focal model with a supporting distribution for the explanatory variables.

f(POSAFFECT \mid PAIN, PAINACCEPT) \times f(PAIN, PAINACCEPT) \qquad (8.27)

The composition of the focal model has no impact on the supporting regressor models,
so the specification for the second term follows the random intercept analysis (see Equa-
tions 8.5 and 8.6).
FIGURE 8.5. Distribution of missing outcome scores for three persons with different associa-
tions between daily pain and positive affect. The solid black circles are predicted values, and the
spread of the normal curves reflects within-person residual variance. The candidate imputations
fall directly on the vertical lines, but horizontal jitter is added to emphasize that more scores are
located near the center of each distribution.
The conditional distribution of an incomplete level-1 predictor now mirrors that of a
regression model with interactive effects (as mentioned previously, a random coefficient
model features the product of a level-1 predictor and a level-2 latent variable). Dropping
unnecessary scaling terms and substituting the appropriate kernels into the right side of
the factorization gives the following expression:
f(Y_{ij} \mid X_{1ij}, X_{2j}) \times f(X_{1ij} \mid X_{2j}) \propto
\exp\left( -\frac{1}{2} \frac{\left( Y_{ij} - (\beta_{0j} + \beta_{1j} X_{1ij} + \beta_2 X_{2j}) \right)^2}{\sigma^2_\varepsilon} \right) \times \exp\left( -\frac{1}{2} \frac{(X_{1ij} - \mu_{1j})^2}{\sigma^2_{r_1(W)}} \right) \qquad (8.28)
Multiplying the two normal curve functions and performing algebra that combines the
component functions into a single distribution for X1 gives a normal distribution with
two-part mean and variance expressions that depend on the focal and regressor model
parameters.

f(X_{1ij(mis)} \mid Y_{ij}, X_{2j}) = N_1\left( E(X_{1ij} \mid Y_{ij}, X_{2j}), \operatorname{var}(X_{1ij} \mid Y_{ij}, X_{2j}) \right) \qquad (8.29)

E(X_{1ij} \mid Y_{ij}, X_{2j}) = \operatorname{var}(X_{1ij} \mid Y_{ij}, X_{2j}) \times \left( \frac{\mu_{1j}}{\sigma^2_{r_1(W)}} + \frac{\beta_{1j}\left( Y_{ij} - (\beta_{0j} + \beta_2 X_{2j}) \right)}{\sigma^2_\varepsilon} \right)

\operatorname{var}(X_{1ij} \mid Y_{ij}, X_{2j}) = \left( \frac{1}{\sigma^2_{r_1(W)}} + \frac{\beta_{1j}^2}{\sigma^2_\varepsilon} \right)^{-1}
The distribution’s structure is virtually identical to the one for moderated regression
models back in Section 5.4 (e.g., see Equation 5.22), but a random coefficient replaces
a simple slope in the expression. Looking at the variance of the imputations, the ran-
dom slope introduces heteroscedasticity, such that the distribution’s spread depends on
cluster j’s coefficient. This result highlights that incomplete random slope predictors
induce differences in spread that are incompatible with a multivariate normal distribu-
tion (i.e., Equation 8.29 is a mixture of normal distributions that differ with respect to
their spread). Maximum likelihood and multiple imputation approaches that assume
multivariate normality (e.g., fully conditional specification) do a poor job of approxi-
mating this heteroscedasticity and are prone to substantial biases (Enders et al., 2020;
Enders, Hayes, et al., 2018; Enders, Keller, et al., 2018; Grund et al., 2016a).
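The heteroscedasticity is easy to see numerically: plugging hypothetical cluster slopes into the variance expression from Equation 8.29 shows that clusters with steeper slopes get less imputation variance, because the outcome is more informative about the missing predictor in those clusters:

```python
import numpy as np

# Hypothetical within-cluster variance and focal model residual variance
sigma2_r1w, sigma2_eps = 1.5, 0.8

# Random slopes for three persons (cluster j's coefficient, beta_1j)
beta_1j = np.array([-0.6, -0.1, 0.4])

# Equation 8.29: the imputation variance depends on the cluster's slope
var_x1 = 1.0 / (1.0 / sigma2_r1w + beta_1j**2 / sigma2_eps)
```

A single-variance normal model cannot reproduce this pattern, which is why normality-based imputation schemes struggle with incomplete random slope predictors.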
Analysis Example
Expanding on the health psychology analysis, I used Bayesian estimation to fit a ran-
dom coefficient regression model that features daily pain and sleep quality ratings as
within-person predictors of daily positive affect and individual-level average pain, pain
acceptance, and gender as predictors of average mood scores.
POSAFFECT_{ij} = \beta_{0j} + \beta_{1j}(PAIN_{ij} - \mu_{1j}) + \beta_2(SLEEP_{ij} - \mu_2) + \beta_3(\mu_{1j} - \mu_1) + \beta_4(PAINACCEPT_j - \mu_3) + \beta_5(FEMALE_j) + \varepsilon_{ij} \qquad (8.30)
I used (latent) group mean centering to partition pain ratings into orthogonal within-
and between-person components, and I centered the level-1 sleep scores at the grand
mean. The factored regression specification follows Equation 8.27 but features two addi-
tional variables in each term.
f(POSAFFECT \mid PAIN, SLEEP, PAINACCEPT, FEMALE) \times f(PAIN, SLEEP, PAINACCEPT, FEMALE) \qquad (8.31)

The potential scale reduction factors from a preliminary diagnostic run again indicated
that the MCMC algorithm converged quickly, so I used 15,000 total iterations with a
burn-in period of 5,000 iterations.
same analysis that generates Bayesian summaries of the model parameters can also gen-
erate model-based multiple imputations for a frequentist analysis. To illustrate, I created
M = 100 imputations by saving the filled-in data and the latent group means from the
final iteration of 100 parallel MCMC chains, each with 5,000 iterations. As explained
previously, autocorrelated imputations are not a concern with this approach.
After creating the multiple imputations, I centered the variables at the estimated
latent group means and used restricted maximum likelihood to fit the random coef-
ficient model to each data set. The Barnard and Rubin (1999) small-sample degrees of
freedom adjustment for the t-statistic requires the complete-data degrees of freedom
value as an input (see Equations 7.21 and 7.22). Following the HLM software package
(Raudenbush et al., 2019), I again used the number of level-2 units minus the number of
predictors minus 1 as the degrees of freedom for all coefficients (i.e., dfcom = 132 – 5 – 1).
Analysis scripts are on the companion website.
Table 8.3 gives the posterior summaries from the Bayesian analysis, and Table 8.4
summarizes the multiple imputation point estimates and standard errors. Not surpris-
ingly, the two sets of results are numerically equivalent, albeit with different philosophi-
cal baggage. Unlike the previous example, specifying different prior distributions for the
between-cluster covariance matrix had virtually no impact on the Bayesian summaries,
because the number of observations at each level was sufficiently large to nullify the
prior’s influence. Latent group mean centering yields pain coefficients that access dif-
ference sources of variation in the mood scores; β1 quantifies the “pure” within-person
slope, where β3 represents the between-cluster regression of average positive affect on
average pain. Both coefficients are negative, suggesting that increases in daily or chronic
pain were associated with less positive affect. The difference between the two coef-
ficients is interesting in this context, because it clarifies whether daily or chronic pain
has a greater influence (Longford, 1989; Lüdtke et al., 2008; Raudenbush & Bryk, 2002).
The near-zero coefficient difference suggests that daily fluctuations in pain and chronic
pain exert the same influence on positive affect.
Changing the substantive scenery once again, I use the employee data on the companion
website to illustrate Bayesian missing data handling for a multilevel regression model
with random coefficients and a cross-level interaction. The data include several work-
related variables (e.g., work satisfaction, turnover intention, employee–supervisor rela-
tionship quality) for a sample of N = 630 employees. I’ve used this data set for earlier
examples, but I’ve thus far ignored the fact that nj = 6 employees are nested within J =
105 workgroups or teams. The data’s structure is now like the random intercept analysis,
where persons are level-1 units and organizations (workgroups) are level-2 units. The
dependent variable, employee empowerment, has an intraclass correlation equal to .11,
meaning that team-level mean differences in average empowerment scores account for
roughly 11% of the total variation. The Appendix gives a description of the data set and
the variable definitions.
A cross-level interaction is one where the influence of a level-1 explanatory variable
is moderated by a level-2 regressor. This effect is usually (but not necessarily) accompa-
nied by random coefficients, as the interaction can be viewed as explaining slope het-
erogeneity. The analysis for this example features leader–member exchange (employee–
supervisor relationship quality) as a within-team predictor of employee empowerment,
the effect of which is moderated by team-level leadership climate. The model also
includes a gender dummy code (0 = female, 1 = male) and group-level cohesion ratings
as level-1 and level-2 covariates, respectively.
EMPOWER_{ij} = \beta_{0j} + \beta_{1j}(LMX_{ij} - \mu_{1j}) + \beta_2(MALE_{ij} - \mu_2) + \beta_3(COHESION_j - \mu_3) + \beta_4(CLIMATE_j - \mu_4) + \beta_5(LMX_{ij} - \mu_{1j})(CLIMATE_j - \mu_4) + \varepsilon_{ij} \qquad (8.32)

EMPOWER_{ij} \sim N_1\left( E(EMPOWER_{ij} \mid LMX_{ij}, MALE_{ij}, COHESION_j, CLIMATE_j), \sigma^2_\varepsilon \right)
The j subscript on the intercept and leader–member exchange slope conveys that
each team has a unique mean and bivariate association. Note that I centered leader–
member exchange scores at the workgroup means to isolate within-team variation in the
regressor (Enders & Tofighi, 2007; Kreft et al., 1995). As explained previously, these
group-level quantities are normally distributed latent variables rather than determinis-
tic arithmetic averages (Hamaker & Muthén, 2020; Lüdtke et al., 2008). The partially
factored specification is ideally suited for this analysis, because the grand means and
latent group means are iteratively estimated model parameters (Enders & Keller, 2019).
A sequential specification does not easily accommodate centering.
f(EMPOWER \mid LMX, MALE, COHESION, CLIMATE) \times f(LMX, MALE^*, COHESION, CLIMATE) \qquad (8.33)
Expanding on earlier ideas, level-1 predictors are decomposed into within- and
between- cluster components. The following within- cluster model expresses level-1
scores as correlated deviations around the latent group means:
X_{ij(W)} = \begin{pmatrix} LMX_{ij} \\ MALE^*_{ij} \end{pmatrix} = \begin{pmatrix} \mu_{1j} \\ \mu_{2j} \end{pmatrix} + \begin{pmatrix} r_{1ij(W)} \\ r_{2ij(W)} \end{pmatrix} \qquad (8.34)

X_{ij(W)} \sim N_2\left( \begin{pmatrix} \mu_{1j} \\ \mu_{2j} \end{pmatrix}, \begin{pmatrix} \sigma^2_{1(W)} & \sigma_{12(W)} \\ \sigma_{21(W)} & 1 \end{pmatrix} \right)
Note that the male dummy code appears as a latent response variable, and the corre-
sponding diagonal element of the within-cluster covariance matrix Σ(W) is fixed at 1 to
establish a metric (the model also includes a single, fixed threshold).
The level-1 predictors condition on level-2 covariates via their latent group means
in the level-2 model, which now consists of four empty regression equations with cor-
related residuals in the between-cluster covariance matrix Σ(B).
X_{j(B)} = \begin{pmatrix} \mu_{1j} \\ \mu_{2j} \\ COHESION_j \\ CLIMATE_j \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix} + \begin{pmatrix} r_{1j(B)} \\ r_{2j(B)} \\ r_{3j(B)} \\ r_{4j(B)} \end{pmatrix} \qquad (8.35)

X_{j(B)} \sim N_4\left( \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix}, \begin{pmatrix} \sigma^2_{1(B)} & \sigma_{12(B)} & \sigma_{13(B)} & \sigma_{14(B)} \\ \sigma_{21(B)} & \sigma^2_{2(B)} & \sigma_{23(B)} & \sigma_{24(B)} \\ \sigma_{31(B)} & \sigma_{32(B)} & \sigma^2_{3(B)} & \sigma_{34(B)} \\ \sigma_{41(B)} & \sigma_{42(B)} & \sigma_{43(B)} & \sigma^2_{4(B)} \end{pmatrix} \right)
As explained previously, the within- and between-cluster covariance matrices can also
be expressed as a set of round-robin regression equations, as this avoids difficulties esti-
mating covariance matrices with fixed elements.
Analysis Example
Continuing with the organizational example, I used Bayesian estimation to fit a mul-
tilevel moderated regression model from Equation 8.32. After inspecting the potential
scale reduction factors (Gelman & Rubin, 1992) from a preliminary diagnostic run, I
specified an MCMC process with 10,000 iterations following the burn-in period. Fol-
lowing earlier examples, I also created M = 100 filled-in data sets by saving the imputa-
tions and latent group means from the final iteration of parallel MCMC chains. After
creating the multiple imputations, I centered the variables at the imputed latent group
means and used restricted maximum likelihood to fit the random coefficient model to
each data set. Finally, I used Rubin’s (1987) rules to pool the parameter estimates and
standard errors and applied the Barnard and Rubin (1999) degrees of freedom expres-
sion to the significance tests. Analysis scripts are on the companion website.
Table 8.5 gives the posterior summaries for the Bayesian analysis, and Table 8.6
summarizes the multiple imputation point estimates and standard errors. The two
analyses produced similar coefficients, but the between-cluster covariance matrix esti-
mates were somewhat different. The Bayesian analysis was sensitive to the choice of
prior distribution for the between-cluster covariance matrix, so I considered four dif-
ferent options: a Jeffreys prior, an improper Wishart prior that subtracts three degrees
of freedom and adds 0 to the sum of squares and cross-products matrix (Asparouhov &
Muthén, 2010a), a more informative prior that adds three degrees of freedom and an
identity matrix to the sum of squares, and a separation strategy prior that decomposes
the covariance matrix into a pair of variances and a correlation. Recent literature recom-
mends the latter option when the number of observations per cluster is small, as it is
here (Keller & Enders, 2022), so the table reflects this strategy.
Turning to the slope coefficients, lower-order terms are conditional effects that
depend on centering. For example, β1 reflects the pure within-cluster influence of
leader–member exchange (the focal predictor) for a workgroup with average leadership
climate (the moderator). The interaction coefficient is of particular interest, because it
captures the influence of the group-level moderator on the level-1 slope. The positive-
valued coefficient implies that the association between employee–supervisor relationship
quality and employee empowerment strengthens as workgroup climate improves (i.e.,
a one-unit increase in leadership climate increases β1 by about .04). The 95% credible
interval suggests that 0 is an unlikely value for the parameter, and the frequentist sig-
nificance test in Table 8.6 similarly refutes the null hypothesis.
The bottom two rows of Tables 8.5 and 8.6 give the simple slopes for hypothetical
workgroups at plus and minus one between-cluster standard deviation from the cli-
mate grand mean (and with b0j and b1j values equal to 0). The Bayesian analysis treats
the conditional effects as auxiliary quantities computed from the focal model param-
eters at each MCMC iteration (Keller & Enders, 2021), whereas the multiple imputation
estimates used standard linear contrasts with delta method standard errors (Grund,
Robitzsch, & Lüdtke, 2021). Figure 8.6 shows these conditional effects as dashed and
dotted lines, respectively, and the solid line is the regression line for a team with aver-
age leadership climate. As you can see in Figure 8.6, the influence of leader–member
exchange is stronger in workgroups with above-average leadership climate, and it is
weaker in workgroups with below-average climate.
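The simple-slope computation reduces to evaluating slope(w) = b1 + b3·w at chosen moderator values, with a delta-method standard error built from the coefficients' sampling covariance matrix. The sketch below uses hypothetical coefficient values and a hypothetical covariance matrix, not the Table 8.5 or 8.6 estimates.

```python
import numpy as np

# hypothetical results from a cross-level interaction model: the slope of the
# focal predictor at moderator value w is b1 + b3 * w, where w is the
# group-level moderator centered at its grand mean
b1, b3 = 0.35, 0.04            # conditional slope and interaction coefficient
cov = np.array([[0.0016, 0.0001],   # sampling covariance matrix of (b1, b3)
                [0.0001, 0.0004]])
sd_between = 2.5               # between-cluster SD of the moderator

def simple_slope(w):
    """Conditional slope and delta-method SE at moderator value w."""
    grad = np.array([1.0, w])            # gradient of b1 + b3*w w.r.t. (b1, b3)
    est = b1 + b3 * w
    se = np.sqrt(grad @ cov @ grad)
    return est, se

hi, se_hi = simple_slope(+sd_between)    # +1 between-cluster SD
lo, se_lo = simple_slope(-sd_between)    # -1 between-cluster SD
```

The same gradient-and-covariance recipe produces the delta-method standard errors that Grund, Robitzsch, and Lüdtke (2021) describe for multiply imputed data.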
Thus far I’ve considered hierarchical data structures with two levels, but Bayesian esti-
mation and model-based multiple imputation readily extend to three (or even more)
levels. Returning to the cluster-randomized educational experiment from Section 8.2,
researchers collected seven problem-solving assessments throughout the school year
at roughly monthly intervals. The earlier random intercept analysis used data from the
first and last occasion, and I now treat repeated measurements as an additional hier-
archy in the design. The three-level data structure features repeated measurements as
[Figure 8.6 image: employee empowerment (y-axis, 10 to 40) plotted against group-mean-centered leader–member exchange (x-axis, –10 to 10).]
FIGURE 8.6. Conditional slopes at three levels of the moderator. The dashed line is the slope
for a work group with a leadership climate score at one between-cluster standard deviation above
the grand mean, and the dotted line is the slope for a team at one standard deviation below the
mean. The solid line is the slope for a team with average climate.
level-1 units, students as level-2 units, and schools as level-3 units (i.e., repeated mea-
surements nested in students, students nested in schools). As you might expect, the
missing data rates increased over time (e.g., baseline problem-solving scores were com-
plete, and nearly 20% of the scores were missing by the final assessment), and compari-
son schools additionally had planned missing data at certain occasions. The data set and
the variable definitions are described in the Appendix.
A longitudinal growth curve model is a type of multilevel regression in which
repeated measurements are a function of a temporal predictor that codes the passage of
time, in this case the monthly testing occasions. To facilitate interpretation, researchers
usually code one of the measurement occasions as 0 and set the others relative to that
fixed point. One common option expresses time relative to the baseline assessment (e.g.,
MONTH = 0, 1, 2, 3, 4, 5, 6), and another reflects these “time scores” relative to the final
measurement (e.g., MONTH = –6, –5, –4, –3, –2, –1, 0). I use the latter definition for the
ensuing example, because this scaling yields an estimate of the intervention group’s
mean difference at the end of the school year. The temporal codes define a level-1 predic-
tor that targets within-student changes in the dependent variable. However, unlike other
lower-level covariates, the time scores have no between-person (level-2) or between-
school (level-3) fluctuation, because they are constant across students (i.e., all students
were assessed in approximately monthly increments). This feature impacts the composi-
tion of the supporting regressor models, as there is no need to estimate this variable’s
latent group means or higher-level variation.
The growth curve model features an average linear trajectory for each experimental
condition, along with individual variation around the mean intercept and slope. Starting
with the repeated measurements, the within-person linear model for student i in school
j is
PROBSOLVE_{tij} = \beta_{0ij} + \beta_{1ij}(MONTH_{tij}) + \varepsilon_{tij} \quad (8.36)
\varepsilon_{tij} \sim N_1(0, \sigma^2_{\varepsilon})
where PROBSOLVEtij is the student’s problem-solving test score at measurement occasion
t, MONTH is the temporal predictor or “time variable,” β0ij is the participant’s expected
end-of-year problem-solving score (i.e., the predicted value when MONTH = 0), and β1ij is
his or her latent monthly change rate. Finally, εtij is a time-specific residual that captures
the distances between the repeated measurements and the individual linear trajectories.
By assumption, these residuals are normally distributed with constant variance σ²ε.
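To make the within-person model concrete, the following sketch simulates trajectories from Equation 8.36 using the end-of-year time coding (MONTH = –6, …, 0) described above. All parameter values are hypothetical, chosen only to illustrate the model's structure.

```python
import numpy as np

rng = np.random.default_rng(1)
months = np.arange(-6, 1)          # time coded relative to the final occasion

# hypothetical person-specific coefficients: beta0 is the expected
# end-of-year score (MONTH = 0), and beta1 is the monthly change rate
n_students = 200
beta0 = rng.normal(50.0, 5.0, n_students)
beta1 = rng.normal(1.0, 0.3, n_students)
sigma_e = 4.0                      # time-specific residual SD

# repeated measurements follow Equation 8.36: score = beta0 + beta1*MONTH + e
scores = beta0[:, None] + beta1[:, None] * months[None, :] \
         + rng.normal(0.0, sigma_e, (n_students, len(months)))

# with this coding, the average score at the final occasion estimates E(beta0)
end_of_year_mean = scores[:, -1].mean()
```

Recoding the time scores (e.g., MONTH = 0, …, 6) would change only the meaning of the intercepts, not the fitted trajectories.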
Building on the earlier analysis example, I use standardized math achievement
scores and free or reduced-price lunch eligibility as student-level covariates. The regres-
sors enter the student-level between-cluster (level-2) model as predictors of the indi-
vidual random intercepts. The model is as follows:
\beta_{0ij} = \beta_{0j} + \beta_2(STANMATH_{ij} - \mu_2) + \beta_3(FRLUNCH_{ij} - \mu_3) + b_{0ij} \quad (8.37)
\beta_{0j} = \beta_0 + \beta_4(TEACHEXP_j - \mu_4) + \beta_5(CONDITION_j) + b_{0j} \quad (8.38)
\beta_{1j} = \beta_1 + \beta_6(CONDITION_j) + b_{1j}
\begin{bmatrix} b_{0j} \\ b_{1j} \end{bmatrix} \sim N_2(\mathbf{0}, \Sigma_{b(L3)})
Centering average years of teacher experience at the grand means defines β0 as the aver-
age end-of-year problem-solving test score for control schools (i.e., CONDITION = 0),
and b0j is a school-level residual that captures unexplained variation in the intercepts.
In the slope equation, β1 is the average monthly growth rate for control schools, β6 is the
growth rate difference for intervention schools (i.e., the group-by-time interaction), and
b1j is a school-level random slope residual. As before, the random intercepts and slopes
are bivariate normal by assumption with a variance–covariance matrix Σb(L3).
Finally, substituting the right sides of the higher-level equations into the lower-
level equations gives a reduced form expression that features a cross-level interaction
between a level-3 regressor (treatment assignment) and the level-1 temporal predictor.
X_{1tij} = \mu_1 + (\mu_{1j} - \mu_1) + (\mu_{1ij} - \mu_{1j}) + (X_{1tij} - \mu_{1ij}) \quad (8.41)
The temporal predictor is unique, because it contains only within-student variation (i.e.,
the time scores are constant across students, so there is no fluctuation in the average
time scores). The within-cluster regression model thus expresses time scores as devia-
tions around the grand mean, as follows:
MONTH_{tij} \sim N_1(\mu_1, \sigma^2_{r1(W)})
In fact, there is no need to estimate this model at all, because the time scores are com-
plete and function as known constants.
The student-level between-cluster (level-2) model expresses level-2 scores as cor-
related deviations around their level-3 latent group means, as follows:
X_{ij(L2)} = \begin{bmatrix} STANMATH_{ij} \\ FRLUNCH^{*}_{ij} \end{bmatrix} = \begin{bmatrix} \mu_{2j} \\ \mu_{3j} \end{bmatrix} + \begin{bmatrix} r_{2ij(L2)} \\ r_{3ij(L2)} \end{bmatrix} \quad (8.43)
X_{ij(L2)} \sim N_2(\mu_j, \Sigma_{(L2)})
Notice that the lunch assistance indicator appears as a latent response variable that
represents a continuous proclivity to receive free or reduced-price lunch. As always,
the corresponding diagonal element of the level-2 covariance matrix is fixed at 1, and
the model requires a fixed threshold parameter. More generally, the factorization would
include the latent group means of any level-1 regressors with higher-level variation.
The level-2 predictors condition on level-3 covariates via their latent group means
in the school-level between-cluster (level-3) model, which now consists of four empty
regression equations with correlated residuals.
X_{j(L3)} = \begin{bmatrix} \mu_{2j} \\ \mu_{3j} \\ TEACHEXP_{j} \\ CONDITION^{*}_{j} \end{bmatrix} = \begin{bmatrix} \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{bmatrix} + \begin{bmatrix} r_{2j(L3)} \\ r_{3j(L3)} \\ r_{4j(L3)} \\ r_{5j(L3)} \end{bmatrix} \quad (8.44)
X_{j(L3)} \sim N_4(\mu, \Sigma_{(L3)})
Notice that the treatment assignment indicator is also modeled as a latent response vari-
able, and the corresponding diagonal element of the level-3 covariance matrix is fixed
at 1 to establish a metric. Alternatively, this variable can be treated as a fixed constant,
because it is complete and does not require a distribution.
Analysis Example
Continuing with the education example, I used Bayesian estimation to fit the three-level
growth model from Equation 8.39. After inspecting the potential scale reduction factor
diagnostic (Gelman & Rubin, 1992), I specified an MCMC process with 10,000 burn-in
iterations and 20,000 total iterations. I also created M = 100 filled-in data sets by saving
the imputations from the final iteration of parallel MCMC chains with 10,000 iterations
each. After creating multiple imputations, I centered the variables and used restricted
maximum likelihood to fit the random coefficient model to each data set. Finally, I used
Rubin’s (1987) rules to pool the parameter estimates and standard errors and applied
the Barnard and Rubin (1999) degrees of freedom expression for the significance tests.
Analysis scripts are on the companion website.
Table 8.7 gives the posterior summaries for the Bayesian analysis, and Table 8.8
summarizes the multiple imputation point estimates and standard errors. Consistent
with earlier examples, the two sets of results were numerically similar. I examined the
consistency of the Bayesian results across three prior distributions for the between-
cluster covariance matrices. Not surprisingly, the intercept and slope variances were
sensitive to this specification, but their variance explained effect sizes (Rights & Sterba,
2019) were relatively stable. The table reports results from an improper prior that sub-
tracts degrees of freedom, as this is the default in some popular software packages.
Turning to the slope coefficients, the lower-order terms are conditional effects that
depend on centering. For example, β0 is the end-of-year problem-solving average for
control schools (marginalizing over the student- and school-level covariates), and β1
reflects the average monthly change rate for these schools. The positive-valued β5 coeffi-
cient indicates that intervention schools finished the year 1.79 points higher, on average,
and the group-by-time interaction slope shows that these schools improved by 0.31 more
per month than the comparison group. The 95% credible intervals suggest that 0 is an
unlikely value for the group difference parameters, and the frequentist significance tests
in Table 8.8 similarly reject the null hypothesis. To further illustrate the group-by-time
interaction, Figure 8.7 shows the average linear growth trajectory for control schools as
a dashed line, and the solid line is the average growth curve for intervention schools.
[Figure 8.7 image: average growth trajectories plotted against months until the final measurement occasion (x-axis, –6 to 0).]
FIGURE 8.7. The dashed line is the average linear growth trajectory for comparison or control
schools, and the solid line is the average growth curve for intervention schools.
Chapter 7 classified multiple imputation procedures into two buckets according to the
degree of similarity between the imputation and analysis models: Agnostic imputation
procedures deploy a model that differs from the focal analysis, whereas model-based
imputation invokes the same model as the secondary analysis. The previous examples
highlight that model-based multiple imputation goes hand in hand with a Bayesian anal-
ysis that tailors the filled-in data to a particular multilevel model. Adopting a tailored
approach is vital for models with random coefficients and interaction effects, whereas a
variety of methods are appropriate for random intercept models.
Considering the agnostic imputation procedures from Chapter 7, single-level joint
modeling and fully conditional specification approaches are known to introduce sub-
stantial biases when applied to multilevel data sets, because they produce filled-in data
with no between-cluster variation (Andridge, 2011; Black, Harel, & McCoach, 2011;
Lüdtke et al., 2017; Mistler & Enders, 2017; Reiter, Raghunathan, & Kinney, 2006;
Taljaard, Donner, & Klar, 2008; van Buuren, 2011). However, both procedures readily
extend to multilevel analyses and are widely available in software packages (Asparouhov
& Muthén, 2010c; Carpenter et al., 2011; Carpenter & Kenward, 2013; Enders, Keller, et
al., 2018; Goldstein et al., 2009, 2014; van Buuren, 2011; Yucel, 2008, 2011). Consistent
with their single-level counterparts, the joint modeling framework uses a multivariate
regression model as an imputation model, and fully conditional specification uses a
sequence of univariate multilevel models.
Y_{ij} = \gamma_1 D_1 + \gamma_2 D_2 + \cdots + \gamma_J D_J + \beta_1(X_{ij}) + \varepsilon_{ij}
Y_{ij} \sim N_1(E(Y_{ij} \mid X_{ij}), \sigma^2_{\varepsilon})
where Dj is one of J code variables that equals 1 if participant i belongs to group j and
0 otherwise. Note that I use an absolute coding scheme that omits the usual regression
intercept and includes the entire set of J code variables. This specification defines each γj
as a group-specific mean or intercept. Importantly, I exclude level-2 predictors, because
the code variables explain all between-cluster variation in the outcome (McNeish &
Kelley, 2019).
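The absolute coding scheme is easy to sketch: build one 0/1 indicator per cluster, omit the intercept, and each regression coefficient becomes a cluster-specific mean or intercept. The cluster identifiers and outcome values below are hypothetical.

```python
import numpy as np

def dummy_design(cluster_ids):
    """Absolute coding: one 0/1 indicator per cluster and no intercept
    column, so each coefficient is a cluster-specific mean/intercept."""
    clusters, index = np.unique(cluster_ids, return_inverse=True)
    D = np.zeros((len(cluster_ids), len(clusters)))
    D[np.arange(len(cluster_ids)), index] = 1.0
    return D

cluster_ids = np.array([1, 1, 2, 2, 2, 3, 3])
D = dummy_design(cluster_ids)

# with no other predictors, least squares on D alone recovers cluster means
y = np.array([4.0, 6.0, 9.0, 11.0, 10.0, 1.0, 3.0])
gamma, *_ = np.linalg.lstsq(D, y, rcond=None)
```

Every row of the design matrix contains exactly one 1, which is why adding the usual intercept column would make the matrix rank deficient.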
Fixed effect imputation is computationally simple and capable of producing accu-
rate parameter estimates in certain situations (Lüdtke et al., 2017; Reiter et al., 2006). It
also has noteworthy limitations. Methodologists have pointed out that dummy coding
appears to overcompensate for group mean differences (Graham, 2012, p. 136), and
analytic work confirms the procedure can exaggerate between-group variation (Lüdtke
et al., 2017). This positive bias gets worse as either the intraclass correlation or within-
cluster sample size decreases. On the inferential side, other studies have shown that
fixed effect imputation can produce positively biased standard errors and inaccurate
confidence intervals (Andridge, 2011; van Buuren, 2011). Bias issues aside, the proce-
dure is practically limited to random intercept analyses (Enders, Mistler, & Keller, 2016),
as preserving random slope variation requires a large set of product terms between the
dummy codes and a level-1 variable.
Joe Schafer extended the popular joint model imputation framework to multilevel data
structures (Schafer, 2001; Schafer & Yucel, 2002), and a number of flexible variations
of his approach have since appeared in the literature (Asparouhov & Muthén, 2010c;
Carpenter et al., 2011; Carpenter & Kenward, 2013; Goldstein et al., 2009; Goldstein
et al., 2014; Yucel, 2008, 2011). I describe a version that uses an empty multivariate
regression model where all variables are outcomes regardless of their role in the analy-
sis (Asparouhov & Muthén, 2010c; Carpenter & Kenward, 2013). The model allows
missing data at either level of the data hierarchy, and it readily accommodates a latent
variable formulation for incomplete categorical variables. Importantly, the joint model is
limited to random intercept analyses and has no capacity for preserving random associa-
tions between pairs of incomplete variables. Later in the section, I describe an extension
that uses cluster-specific covariance matrices to preserve these relations (Quartagno &
Carpenter, 2016; Yucel, 2011).
To illustrate the joint model imputation scheme more concretely, reconsider the
two-level random intercept analysis model from Equation 8.18. Consistent with its
single-level counterpart, the multilevel joint model invokes a multivariate normal dis-
tribution for continuous and latent response variables. The normal distribution is now
more complex and involves within- and between-cluster variation and covariation (e.g.,
variation among employees who belong to the same workgroup, and variation among
the workgroups). Following ideas presented earlier in the chapter, each level-1 variable
decomposes into the sum of a grand mean and within- and between-cluster residu-
als (see Equation 8.8). The within-cluster model expresses level-1 scores as correlated
deviations around their latent group means, as follows:
Y_{ij(W)} = \begin{bmatrix} PROBSOLVE_{ij} \\ PRETEST_{ij} \\ STANMATH_{ij} \\ FRLUNCH^{*}_{ij} \end{bmatrix} = \begin{bmatrix} Y_{1ij} \\ Y_{2ij} \\ Y_{3ij} \\ Y^{*}_{4ij} \end{bmatrix} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \mu_{4j} \end{bmatrix} + \begin{bmatrix} r_{1ij(W)} \\ r_{2ij(W)} \\ r_{3ij(W)} \\ r_{4ij(W)} \end{bmatrix} \quad (8.46)
Y_{ij(W)} \sim N_4(\mu_j, \Sigma_{(W)})
Following procedures from Chapter 6, the lunch assistance indicator appears as a latent
response variable (e.g., a student’s underlying proclivity for receiving free or reduced-
price lunch), and the corresponding diagonal element of the within-cluster covariance
matrix is fixed at 1 to establish a metric. In words, the bottom row of the equation
says that level-1 scores are normally distributed around their latent group means. The
within-cluster normal distribution is also the posterior predictive distribution of the
level-1 missing values.
Importantly, the within-cluster model presumes that associations among level-1
variables are the same across all level-2 units (e.g., all schools share the same variance–
covariance matrix). This effectively limits joint model imputation to random intercept
analyses, because it has no capacity for preserving random associations between pairs
of incomplete variables. Simulation studies show that applying the joint model prior to
estimating a random coefficient analysis produces substantial bias, because the filled-
in values eradicate cluster-specific associations from the data (e.g., slope variance esti-
mates are dramatically attenuated; Enders et al., 2016). A variant of the joint model with
random within-cluster covariance matrices addresses this shortcoming (Yucel, 2011).
Level-1 variables correlate with level-2 variables via their latent group means in the
between-cluster model, which now consists of six empty regression equations with cor-
related residuals.
Y_{j(B)} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \mu_{4j} \\ TEACHEXP_{j} \\ CONDITION^{*}_{j} \end{bmatrix} = \begin{bmatrix} \mu_{1j} \\ \mu_{2j} \\ \mu_{3j} \\ \mu_{4j} \\ Y_{5j} \\ Y^{*}_{6j} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \end{bmatrix} + \begin{bmatrix} r_{1j(B)} \\ r_{2j(B)} \\ r_{3j(B)} \\ r_{4j(B)} \\ r_{5j(B)} \\ r_{6j(B)} \end{bmatrix} \quad (8.47)
Y_{j(B)} \sim N_6(\mu, \Sigma_{(B)})
The treatment assignment indicator appears as a latent response variable, and the cor-
responding diagonal element of the between-cluster covariance matrix is fixed at 1 to
establish a metric. The between-cluster mean structure also includes fixed threshold
parameters for the two binary variables. The between-cluster normal distribution serves
double duty as the posterior predictive distribution of the level-2 missing values (includ-
ing the latent group means). In fact, Equations 8.46 and 8.47 are the same models that
I applied to incomplete regressors earlier in Section 8.2, but the equations now include
the dependent variable and its latent group mean. I characterize this imputation model
as agnostic, because it looks nothing like the analysis model in Equation 8.18. This
difference is not a problem, because the multivariate normal data structure does not
conflict with the analytic model.
MCMC Algorithm
The posterior distribution for joint model imputation is a complicated multivariate func-
tion that describes the relative probability of different combinations of model param-
eters, latent group means, and missing values given the observed data. The MCMC algo-
rithm applies a now-familiar strategy: Estimate one unknown at a time (e.g., param-
eter, latent variable, missing score), holding all other quantities at their current values.
The generic MCMC recipe for parallel imputation chains is shown below, and I refer
interested readers to the literature for the exact form of each distribution (Carpenter &
Kenward, 2013; Goldstein et al., 2009; Schafer & Yucel, 2002; Yucel, 2008):
Do for m = 1 to M imputations.
Assign starting values to all parameters, random effects, and missing values.
Do for t = 1 to T iterations.
> Estimate the grand means, given everything else.
> Estimate latent group means, given everything else.
> Estimate the between-cluster covariance matrix, given everything else.
> Estimate the within-cluster covariance matrix, given everything else.
> Impute missing values, given the model parameters.
Repeat.
Save the filled-in data for later analysis.
Repeat.
The final imputation step uses the mean vector and covariance matrices to construct a
regression model and distribution of missing values for each unique missing data pat-
tern. I illustrated this process for a single-level analysis in Section 5.9, and the same idea
applies here.
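The recipe above can be sketched as a loop. In the sketch below, the update functions are empty placeholders for the full conditional draws cited in the text, and the imputation step is a toy standard-normal draw; the code only illustrates the parallel-chain structure that yields M filled-in data sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder update steps; the real algorithm draws each quantity from its
# full conditional distribution (see the references cited in the text)
def update_grand_means(state): return state
def update_latent_group_means(state): return state
def update_between_cov(state): return state
def update_within_cov(state): return state

def impute_missing(state):
    # toy draw: redraw every originally missing cell from a standard normal
    data = state["data"].copy()
    data[state["mask"]] = rng.normal(0.0, 1.0, state["mask"].sum())
    return {**state, "data": data}

def joint_model_imputation(data, M=5, T=100):
    """Parallel-chain recipe: one independent chain per imputed data set."""
    mask = np.isnan(data)
    imputations = []
    for m in range(M):
        state = {"data": np.where(mask, 0.0, data), "mask": mask}  # starting values
        for t in range(T):                       # burn-in iterations
            state = update_grand_means(state)
            state = update_latent_group_means(state)
            state = update_between_cov(state)
            state = update_within_cov(state)
            state = impute_missing(state)
        imputations.append(state["data"])        # save the final iteration's data
    return imputations

data = np.array([[1.0, np.nan], [np.nan, 2.0], [0.5, 0.3]])
imps = joint_model_imputation(data, M=5, T=10)
```

Each chain runs to its burn-in point independently, which is what makes the saved data sets statistically independent draws from the imputation model.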
Analysis Example
Revisiting the cluster-randomized educational intervention, I applied joint model impu-
tation to the random intercept analysis from Equation 8.18. After inspecting the poten-
tial scale reduction factors (Gelman & Rubin, 1992) from a preliminary diagnostic run,
I created M = 100 filled-in data sets by saving the imputations from the final iteration
of 100 parallel MCMC chains with 2,000 iterations each. I used restricted maximum
likelihood to fit the random intercept regression model to each data set, and I applied
Rubin’s (1987) rules to pool the parameter estimates and standard errors. To refresh,
the Barnard and Rubin (1999) small-sample degrees of freedom adjustment requires
the complete-data degrees of freedom as an input (see Equations 7.21 and 7.22). Follow-
ing the HLM software package (Raudenbush et al., 2019), I used the number of schools
minus the number of predictors minus 1 as the degrees of freedom for all coefficients
(i.e., dfcom = 29 – 5 – 1). The top panel of Table 8.9 gives the multiple imputation point
estimates and standard errors. Perhaps not surprisingly, joint model imputation pro-
duced results that are effectively equivalent to the Bayesian analysis, and model-based
multiple imputation results from Section 8.2 (see Tables 8.1 and 8.2). As such, no further
discussion of the results is warranted.
A variant of the joint model with cluster-specific within-cluster covariance matrices
preserves random associations between pairs of level-1 variables (Enders, Hayes, et al.,
2018; Quartagno & Carpenter, 2016). To illustrate Yucel's approach more concretely,
reconsider the daily diary study and the random coefficient analysis model from
Equation 8.30. The within-cluster model again expresses level-1 scores as correlated
deviations around their latent group means, but the covariance matrix of those
deviations is now unique to each cluster.
MCMC Algorithm
The posterior distribution for the random covariance matrix model is again a compli-
cated multivariate function describing the relative probability of different combina-
tions of model parameters, latent group means, and missing values given the observed
data. The algorithmic steps resemble those from the classic joint model, but the recipe
includes two new steps that estimate the pooled degrees of freedom and scale matrix
(the two components that define the average level-1 covariance matrix). The estimation
step for the cluster-specific covariance matrices leverages this pooled information, such
that some variables can be completely missing within a given cluster (Quartagno &
Carpenter, 2016). The generic MCMC recipe is shown below, and I refer interested read-
ers to the literature for the exact form of each distribution (Carpenter & Kenward, 2013;
Goldstein et al., 2009; Schafer & Yucel, 2002; Yucel, 2008):
Do for m = 1 to M imputations.
Assign starting values to all parameters, random effects, and missing values.
Do for t = 1 to T iterations.
> Estimate the grand means, given everything else.
> Estimate latent group means, given everything else.
> Estimate the between-cluster covariance matrix, given everything else.
> Estimate the pooled scale matrix, given everything else.
> Estimate the pooled degrees of freedom, given everything else.
> Estimate cluster-specific covariance matrices, given everything else.
> Impute missing values, given the model parameters.
Repeat.
Save the filled-in data for later analysis.
Repeat.
Analysis Example
Revisiting the earlier health psychology analysis, I used the joint model with random
within-cluster covariance matrices to create daily diary imputations for the random
coefficient regression from Equation 8.30. Although the imputation model is not per-
fectly compatible with a random slope analysis model, simulation studies suggest it nev-
ertheless performs well in this context (Enders, Hayes, et al., 2018). I generated M = 100
imputations from a sequential MCMC chain with 3,000 burn-in iterations and 3,000
thinning or between-imputation iterations (i.e., I saved the first data set after 3,000
iterations and saved the remaining data sets every 3,000 computational cycles thereaf-
ter). After creating the multiple imputations, I centered the predictor variables at their
arithmetic averages (the software for this variant of the joint model does not save latent
group means) and used restricted maximum likelihood to fit the random coefficient
model to each data set. Finally, I used Rubin’s (1987) rules to pool the parameter esti-
mates and standard errors and applied the Barnard and Rubin (1999) degrees of freedom
expression for significance tests. Table 8.10 summarizes the multiple imputation point
estimates and standard errors, which were quite similar to the model-based imputation results
in Table 8.4. The random covariance model appears to provide a close approximation to
an optimal model-based imputation routine, especially when the proportion of between-
cluster variation is large, as it is here (Enders, Hayes, et al., 2018). The substantive inter-
pretations match the earlier example, so no further discussion is warranted.
PROBSOLVE^{(t)}_{ij} = \gamma_{01j} + \gamma_{11}(STANMATH^{(t-1)}_{ij}) + \gamma_{21}(FRLUNCH^{(t-1)}_{ij})
    + \gamma_{31}(TEACHEXP^{(t-1)}_{j}) + \gamma_{41}(PRETEST_{ij}) + \gamma_{51}(CONDITION_{j}) + r_{1ij} \quad (8.50)
STANMATH^{(t)}_{ij} = \gamma_{02j} + \gamma_{12}(FRLUNCH^{(t-1)}_{ij}) + \gamma_{22}(PROBSOLVE^{(t)}_{ij})
    + \gamma_{32}(TEACHEXP^{(t-1)}_{j}) + \gamma_{42}(PRETEST_{ij}) + \gamma_{52}(CONDITION_{j}) + r_{2ij}
To track changes to the imputed data across models, I attach a t superscript to the
incomplete variables to index iterations (e.g., the imputed standardized math scores
on the right side of the problem-solving imputation model originate from the previous
iteration).
The binary lunch assistance indicator requires a multilevel logistic regression
model with random intercepts.
\ln\!\left[\frac{\Pr(FRLUNCH^{(t)}_{ij} = 1)}{1 - \Pr(FRLUNCH^{(t)}_{ij} = 1)}\right] = \gamma_{03j} + \gamma_{13}(PROBSOLVE^{(t)}_{ij}) + \gamma_{23}(STANMATH^{(t)}_{ij})
    + \gamma_{33}(TEACHEXP^{(t-1)}_{j}) + \gamma_{43}(PRETEST_{ij}) + \gamma_{53}(CONDITION_{j}) \quad (8.51)
The logistic model has no within-cluster variance, because this parameter is a fixed con-
stant. Following Section 8.2, the random intercepts (i.e., γ01j, γ02j, and γ03j) and within-
cluster residuals (i.e., r1ij and r2ij) are normally distributed. Following established logic,
MCMC creates continuous imputations by drawing random numbers from a normal
distribution centered at a predicted score that incorporates the random effects (e.g.,
Equation 8.12), and the algorithm samples binary outcome scores from a binomial dis-
tribution.
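One fully conditional specification cycle can be sketched as follows. This single-level toy version omits the random intercepts and substitutes a crude linear-probability fit for the logistic model, so it only illustrates the regress-then-draw logic; all data are simulated, and the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

def fcs_cycle(y_cont, y_bin, x, miss_cont, miss_bin, sigma=1.0):
    """One simplified fully conditional specification cycle. The arrays hold
    the current filled-in values; the masks flag originally missing cells.
    This single-level sketch omits random intercepts and swaps the logistic
    model for a crude linear-probability stand-in."""
    # impute the continuous variable from a normal linear regression
    obs = ~miss_cont
    X = np.column_stack([np.ones_like(x), x, y_bin])
    beta, *_ = np.linalg.lstsq(X[obs], y_cont[obs], rcond=None)
    draw = X @ beta + rng.normal(0.0, sigma, len(x))
    y_cont = np.where(miss_cont, draw, y_cont)
    # impute the binary variable: predicted probability, then a binomial draw
    obs = ~miss_bin
    Z = np.column_stack([np.ones_like(x), x, y_cont])
    gamma, *_ = np.linalg.lstsq(Z[obs], y_bin[obs], rcond=None)
    p = np.clip(Z @ gamma, 0.01, 0.99)
    y_bin = np.where(miss_bin, rng.binomial(1, p), y_bin)
    return y_cont, y_bin

# simulated data with roughly 20% missing values on each variable
n = 60
x = rng.normal(size=n)
y_cont = 1.0 + 0.5 * x + rng.normal(size=n)
y_bin = (x + rng.normal(size=n) > 0).astype(float)
miss_cont = rng.random(n) < 0.2
miss_bin = rng.random(n) < 0.2
y_cont[miss_cont] = y_cont[~miss_cont].mean()   # starting values
y_bin[miss_bin] = 0.0
y_cont_new, y_bin_new = fcs_cycle(y_cont, y_bin, x, miss_cont, miss_bin)
```

A full MCMC run would repeat this cycle many times, with each variable's imputations from the current iteration feeding into the next variable's regression, exactly as the t and t − 1 superscripts in Equations 8.50 and 8.51 convey.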
One strategy uses the arithmetic cluster means as level-2 regressors. This strategy mimics the logic of joint model imputation, and the two approaches often produce equivalent results (Carpenter & Kenward,
2013; Enders et al., 2016; Grund et al., 2017; Mistler & Enders, 2017). I illustrate a model
that uses latent group means in the next section.
Level-1 variables relate to level-2 variables via their group-level averages. As such,
the between-cluster imputation models are single-level regressions with cluster means
and level-2 variables as predictors. The imputation model for the teacher experience
variable is as follows:
TEACHEXP^{(t)}_{j} = \gamma_{04} + \gamma_{14}(\overline{PROBSOLVE}^{(t)}_{j}) + \gamma_{24}(\overline{STANMATH}^{(t)}_{j})
    + \gamma_{34}(\overline{FRLUNCH}^{(t)}_{j}) + \gamma_{44}(\overline{PRETEST}_{j}) + \gamma_{54}(CONDITION_{j}) + r_{04j} \quad (8.52)
The bars over the level-1 regressors convey that the group means are arithmetic aver-
ages of the imputed level-1 scores from iteration t, and the between-cluster residual is
normally distributed, as before. Importantly, this specification assumes equal cluster
sizes and is incompatible with a joint model when cluster sizes are unbalanced (Car-
penter & Kenward, 2013; Enders et al., 2016; Grund et al., 2017; Mistler & Enders,
2017). However, empirical research suggests that biases resulting from unequal group
sizes tend to be relatively small and are most evident when the intraclass correlation or
within-cluster sample size is very small (Grund et al., 2017). The next section remedies
this shortcoming by marrying fully conditional specification with latent group means.
Y^{(t)}_{2ij} = \mu_{2j} + \gamma_{12(W)}(Y^{(t-1)}_{3ij} - \mu_{3j}) + \gamma_{22(W)}(Y^{*(t-1)}_{4ij} - \mu_{4j}) + \gamma_{32(W)}(Y^{(t)}_{1ij} - \mu_{1j}) + r_{2ij(W)} \quad (8.53)
Y^{(t)}_{3ij} = \mu_{3j} + \gamma_{13(W)}(Y^{*(t-1)}_{4ij} - \mu_{4j}) + \gamma_{23(W)}(Y^{(t)}_{1ij} - \mu_{1j}) + \gamma_{33(W)}(Y^{(t)}_{2ij} - \mu_{2j}) + r_{3ij(W)}
Y^{*(t)}_{4ij} = \mu_{4j} + \gamma_{14(W)}(Y^{(t)}_{1ij} - \mu_{1j}) + \gamma_{24(W)}(Y^{(t)}_{2ij} - \mu_{2j}) + \gamma_{34(W)}(Y^{(t)}_{3ij} - \mu_{3j}) + r_{4ij(W)}
Centering the regressors in each equation at their latent group means removes all
between-cluster variation from the level-1 variables (i.e., the γ’s reflect pure within-
cluster associations) and defines the intercept as the target variable’s latent group mean.
Although it isn’t obvious, level-1 variables correlate with level-2 variables (e.g., teacher
experience and the latent treatment assignment indicator) via these random intercepts.
The bottom equation is a probit model for the latent response variable, which now
replaces the binary variable as a regressor on the right side of the other equations. As
always, setting the variance of the r4ij(W) residuals to 1 establishes a metric.
Next, consider the between-cluster joint model in Equation 8.47. The six-dimensional
multivariate normal distribution similarly spawns an equal number of between-cluster
regressions.
\mu^{(t)}_{1j} = \mu_1 + \gamma_{11(B)}(\mu^{(t-1)}_{2j} - \mu_2) + \gamma_{21(B)}(\mu^{(t-1)}_{3j} - \mu_3) + \gamma_{31(B)}(\mu^{(t-1)}_{4j} - \mu_4)
    + \gamma_{41(B)}(Y^{(t-1)}_{5j} - \mu_5) + \gamma_{51(B)}(Y^{*(t-1)}_{6j} - \mu_6) + r_{1j(B)} \quad (8.54)
...
Y^{(t)}_{5j} = \mu_5 + \gamma_{15(B)}(Y^{*(t-1)}_{6j} - \mu_6) + \gamma_{25(B)}(\mu^{(t)}_{1j} - \mu_1) + \gamma_{35(B)}(\mu^{(t)}_{2j} - \mu_2)
    + \gamma_{45(B)}(\mu^{(t)}_{3j} - \mu_3) + \gamma_{55(B)}(\mu^{(t)}_{4j} - \mu_4) + r_{5j(B)}
Y^{*(t)}_{6j} = \mu_6 + \gamma_{16(B)}(\mu^{(t)}_{1j} - \mu_1) + \gamma_{26(B)}(\mu^{(t)}_{2j} - \mu_2) + \gamma_{36(B)}(\mu^{(t)}_{3j} - \mu_3)
    + \gamma_{46(B)}(\mu^{(t)}_{4j} - \mu_4) + \gamma_{56(B)}(Y^{(t)}_{5j} - \mu_5) + r_{6j(B)}
Consistent with the lunch assistance variable, the treatment assignment indicator
appears as a latent response variable, and the variance of r6j(B) is fixed at 1.
The latent group means are effectively missing data, and each MCMC iteration esti-
mates these quantities by drawing their values from a normal distribution. However,
the conditional distribution of the latent means given all other quantities is complex,
because the group means appear in two models (e.g., μ1j functions as a random intercept
in Y1’s within-cluster regression in Equation 8.53, and it appears as the outcome in a
between-cluster model). Because a given latent mean is common to all members of the
level-2 cluster (e.g., all students within a given school share the same group mean), the
within-cluster model’s contribution to the conditional distribution repeats nj times, once
for each observation in group j. The distribution that generates the latent group means
is the product of the two normal curves below:
\left[\,\prod_{i=1}^{n_j} N_1\big(E(Y_{1ij} \mid \mu_{1j}, \ldots),\ \sigma^2_{r1(W)}\big)\right] \times N_1\big(E(\mu_{1j} \mid \mu_{2j}, \mu_{3j}, \mu_{4j}, Y_{5j}, Y^{*}_{6j}),\ \sigma^2_{r1(B)}\big)
In fact, the distribution’s two-part composition is identical to that of an incomplete
level-2 regressor in a Bayesian analysis. The product over the level-1 scores highlights
that latent group averages accommodate unequal group sizes by explicitly conditioning
on the number of within-cluster observations (e.g., the reliability of each group’s latent
variable increases as group size increases and vice versa).
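Because both curves are normal, their product is itself a normal distribution whose precision adds the within-cluster contribution (repeated nj times) to the between-cluster contribution. A minimal sketch of this precision-weighted draw, using hypothetical scores and variances:

```python
import numpy as np

def draw_latent_group_mean(y_within, prior_mean, var_within, var_between, rng):
    """Draw a latent group mean from the product of n_j within-cluster
    normal likelihood terms and a between-cluster normal prior."""
    n_j = len(y_within)
    prec = n_j / var_within + 1.0 / var_between     # posterior precision
    post_var = 1.0 / prec
    post_mean = post_var * (y_within.sum() / var_within
                            + prior_mean / var_between)
    return rng.normal(post_mean, np.sqrt(post_var)), post_mean, post_var

rng = np.random.default_rng(7)
y_small = np.array([10.0, 12.0])                    # n_j = 2
y_large = np.full(50, 11.0)                         # n_j = 50
_, m_small, v_small = draw_latent_group_mean(y_small, 0.0, 4.0, 9.0, rng)
_, m_large, v_large = draw_latent_group_mean(y_large, 0.0, 4.0, 9.0, rng)
```

The large cluster's posterior variance is far smaller than the small cluster's, and the small cluster's posterior mean shrinks noticeably toward the between-cluster prior, which is exactly the reliability behavior described in the text.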
Analysis Example
Revisiting the cluster-randomized education study, I applied fully conditional specifica-
tion with latent variables to the random intercept analysis from Equation 8.18. Follow-
ing earlier examples, I created M = 100 filled-in data sets by saving imputations from
parallel MCMC chains, and I used Rubin’s (1987) rules to pool the restricted maximum
likelihood estimates and standard errors. The bottom panel of Table 8.9 summarizes the
analysis results. Perhaps not surprisingly, fully conditional specification produced results that are effectively equivalent to those of joint model imputation, as well as the Bayesian analysis and model-based multiple imputation results from Section 8.2 (see Tables 8.1 and 8.2). As such, no further discussion of the results is warranted.
Consider a random coefficient (random slope) analysis model and its reverse regression counterpart, which imputes the incomplete predictor from the outcome:

$$
Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + \varepsilon_{ij} = E\bigl(Y_{ij} \mid X_{ij}\bigr) + \varepsilon_{ij} \tag{8.56}
$$

$$
X_{ij} = \gamma_{0j} + \gamma_{1j} Y_{ij} + r_{ij} = E\bigl(X_{ij} \mid Y_{ij}\bigr) + r_{ij}, \qquad X_{ij} \sim N_1\bigl(E(X_{ij} \mid Y_{ij}), \sigma^2_r\bigr) \tag{8.57}
$$
Although the two equations share the same structure and appear to target the same
group-specific association, they are logically inconsistent. Revisiting a concept from
Chapter 7, the regression models are incompatible, because the two univariate normal
distributions cannot originate from the same multivariate distribution (unless the ran-
dom slope variance equals zero). In practical terms, incompatibility means that the nor-
mal distribution of X in Equation 8.57 is mathematically impossible given the composi-
tion of the analysis and the model-implied distribution of Y in Equation 8.56 (and vice
versa).
Revisiting the Bayesian specification for a random coefficient model provides addi-
tional insight into why reverse regression gives flawed imputations. Returning to the
distribution of missing values in Equation 8.29, the random coefficients induce het-
eroscedasticity, such that the variation of the imputations in that equation depends on
the magnitude of a group's random slope (or more accurately, its square, $\beta_{1j}^2$). In contrast, the normal distribution in Equation 8.57 is misspecified, because it says that all imputations
have the same variance, regardless of group membership. The practical consequence of
this misspecification is that slope variance estimates are too small, and other parameters
may also exhibit bias (Enders et al., 2020; Enders, Hayes, et al., 2018; Enders, Keller,
et al., 2018; Erler et al., 2016; Grund et al., 2016a, 2018). As such, you should avoid
the reverse random coefficient specification and use Bayesian estimation or model-
based multiple imputation. You may recall that we previously encountered the same
misspecification problem when applying reverse regression to moderated regression
analyses (i.e., just-another-variable imputation). In fact, random slope models are just a
special type of moderated regression that features the product of a level-1 regressor and a level-2 latent variable.
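A short numeric sketch (hypothetical values, assuming the incomplete regressor X is normal with variance `sigma2_x`) shows the heteroscedasticity directly: the model-implied variance of X given Y shrinks as a group's slope grows, so a reverse regression that assigns every group the same imputation variance cannot be correct.

```python
def reverse_variance(beta1_j, sigma2_eps, sigma2_x):
    """Model-implied Var(X | Y, group j) when Y = b0j + b1j * X + e and
    X ~ N(mu_x, sigma2_x): the precision contributed by Y scales with the
    squared random slope, so the variance differs across groups."""
    return 1.0 / (beta1_j ** 2 / sigma2_eps + 1.0 / sigma2_x)

# Two hypothetical groups that differ only in their random slope
flat_group = reverse_variance(beta1_j=0.2, sigma2_eps=4.0, sigma2_x=9.0)
steep_group = reverse_variance(beta1_j=1.5, sigma2_eps=4.0, sigma2_x=9.0)
```

The steep-slope group's imputations should be far less variable than the flat-slope group's, which is exactly the pattern a constant-variance reverse regression flattens away.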
At least for now, maximum likelihood estimation is arguably less capable than Bayesian
estimation and multiple imputation, because it handles a more limited set of multilevel
missing data problems. Virtually any software package that estimates mixed models can
accommodate incomplete outcomes. Consistent with classic regression models, analyz-
ing the observed data yields accurate estimates when missing values are restricted to
the dependent variable and missingness is due to explanatory variables (Little, 1992;
von Hippel, 2007). This scenario could arise, for example, in a longitudinal study where
baseline covariates are complete but repeated measurements are incomplete due to inter-
mittent missingness or attrition. In fact, no imputation is needed, because this missing
data pattern is simply a complete-data estimation problem with unbalanced group sizes
(i.e., each level-2 individual could have a different number of level-1 repeated measure-
ments).
The situation becomes more complicated when explanatory variables have missing
values. Currently, dedicated multilevel modeling software packages have limited capac-
ity for handling incomplete predictors, and most programs simply delete observations
with missing covariates. Not only does deletion assume a stringent MCAR process (i.e.,
missingness is haphazard and unrelated to the data), but it can also dramatically reduce
the sample size, particularly when an entire cluster is removed because its level-2 scores are incomplete. The notable exception is the HLM program (Raudenbush et al.,
2019), which addresses incomplete predictors using an approach developed by Shin and
Raudenbush (2007, 2013). This estimator assumes that incomplete variables are multi-
variate normal, and it currently has no capacity for handling categorical predictors or
random slopes between incomplete level-1 variables (these limitations do not apply to
complete covariates). The procedure is conceptually like joint model imputation and
leverages comparable normal distributions.
To describe the HLM approach, reconsider the random intercept analysis model
from Equation 8.18. Multilevel modeling software packages typically assume explana-
tory variables are fixed by design, meaning that no distributional assumptions are
applied to these variables. As you know, this specification is antithetical to any type
of missing data handling. Shin and Raudenbush (2007, 2013) address this problem
by deriving transformations that reparameterize the analysis model into within- and
between-cluster normal distributions like those in Equations 8.46 and 8.47. After repa-
rameterizing the model, Shin and Raudenbush use an EM algorithm to estimate level-
specific mean vectors and covariance matrices. The expectation and maximization steps
at each level are fundamentally similar to a conventional EM algorithm (Dempster et al.,
1977; Little & Rubin, 2002), because each covariance matrix reflects a single source of
variation. Finally, a reverse transformation converts maximum likelihood estimates of
the mean vector and covariance matrices to the desired regression model parameters.
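The final conversion step rests on textbook identities linking moments to regression parameters (slopes equal the inverse predictor covariance matrix times the predictor–outcome covariances). The sketch below illustrates the idea for a single covariance matrix with hypothetical numbers; the actual Shin–Raudenbush procedure works with level-specific mean vectors and covariance matrices.

```python
import numpy as np

def moments_to_regression(mu, sigma, y_index=0):
    """Convert a mean vector and covariance matrix into the intercept,
    slopes, and residual variance for the variable at y_index regressed
    on the remaining variables (beta = Sigma_xx^-1 * sigma_xy)."""
    idx = [i for i in range(len(mu)) if i != y_index]
    s_xx = sigma[np.ix_(idx, idx)]           # predictor covariance block
    s_xy = sigma[np.ix_(idx, [y_index])]     # predictor-outcome covariances
    slopes = np.linalg.solve(s_xx, s_xy).ravel()
    intercept = mu[y_index] - slopes @ mu[idx]
    resid_var = sigma[y_index, y_index] - slopes @ s_xy.ravel()
    return intercept, slopes, resid_var

# Hypothetical moments for (Y, X1, X2)
mu = np.array([50.0, 10.0, 4.0])
sigma = np.array([[64.0, 12.0, 6.0],
                  [12.0, 16.0, 2.0],
                  [6.0, 2.0, 4.0]])
b0, b, s2e = moments_to_regression(mu, sigma)
```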
Multilevel structural equation modeling is a second option for implementing maxi-
mum likelihood missing data handling. Like the HLM program, most multilevel struc-
tural equation modeling estimators currently assume that incomplete predictors are
multivariate normal. Although some software packages do allow for incomplete random
slope predictors, simulation studies suggest that maximum likelihood is prone to substantial biases (Enders et al., 2020; Enders, Hayes, et al., 2018), presumably because the
incomplete predictors condition on the outcome in a way that doesn’t account for the
heteroscedasticity of their missing values (see Equation 8.29). For this reason, I restrict
the focus to regression models with random intercepts. A number of accessible descrip-
tions of multilevel structural models are available in the literature (Mehta & Neale,
2005; Rabe-Hesketh, Skrondal, & Zheng, 2012; Stapleton, 2013), as are technically oriented papers that provide a deeper dive into the mechanics of estimation (Asparouhov
& Muthén, 2007; Bentler & Liang, 2011; Liang & Bentler, 2004; Muthén & Asparouhov,
2008; Rabe-Hesketh et al., 2004, 2012).
Returning to ideas from Chapter 3, a structural equation model views an individu-
al’s responses as a multivariate normal data vector (see Equation 3.26). The framework
is a flexible tool for implementing maximum likelihood, because it allows researchers
to structure the normal distribution’s parameters as regression models, potentially with
latent variables. Importantly, individuals need not contribute the same amount of infor-
mation to estimation, and each person’s likelihood expression can have a different pat-
tern or number of observed responses in Yi. When confronted with missing values, the
estimator uses analytic expressions such as expectations to replace the missing parts of
the data instead of imputing the missing values themselves.
A multilevel structural equation model leverages the same multivariate normal
distribution function, but Yi functions as a correlated vector of exchangeable level-1
observations nested within a level-2 unit i (i.e., clusters become the unit of analysis). To
illustrate, consider an empty random intercept model for the dependent variable, math
problem solving. The multilevel structural equation model defines each level-1 observa-
tion (e.g., student test score) as an indicator of a cluster-level latent factor. The following
equation shows the factor model for a particular school i with 20 students:

$$
\mathbf{Y}_i = \begin{pmatrix} Y_{1i} \\ Y_{2i} \\ \vdots \\ Y_{20,i} \end{pmatrix} = \mathbf{1}\beta_{0i} + \boldsymbol{\varepsilon}_i
$$

Each of the 20 scores contributes information to the factor, such that β0i represents a level-2 latent group mean. Setting the factor loading matrix equal to a vector of 1's treats student scores as exchangeable, equivalent indicators of the latent factor (by extension, all residuals share the same variance).
The structural equation model parameters combine to produce predictions about
the population means and covariance matrix. Following earlier notation from Chap-
ter 3, the model-predicted or model-implied moments for the empty random intercept
model are μ(θ) and Σ(θ):

$$
\boldsymbol{\mu}(\theta) = \mathbf{1}\beta_0 = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_0 \end{pmatrix}, \qquad
\boldsymbol{\Sigma}(\theta) = \sigma^2_{b0}\,\mathbf{1}\mathbf{1}' + \sigma^2_{\varepsilon}\,\mathbf{I} =
\begin{pmatrix}
\sigma^2_{b0} + \sigma^2_{\varepsilon} & \cdots & \sigma^2_{b0} \\
\vdots & \ddots & \vdots \\
\sigma^2_{b0} & \cdots & \sigma^2_{b0} + \sigma^2_{\varepsilon}
\end{pmatrix}
$$
A conventional multilevel model induces the same model-implied mean and covariance
structure. Mehta and Neale (2005) provide an accessible tutorial on multilevel structural models that highlights linkages with the conventional mixed modeling framework.
Maximum likelihood estimation for multilevel structural equation models bor-
rows heavily from concepts in Chapter 3. For example, the observed-data log-likelihood
replaces the population mean vector and covariance matrix with their model-implied
counterparts (see Equation 3.26), and the goal of estimation is to find the random inter-
cept model parameters that minimize the discrepancies between the observed data and
the model-implied moments in μ(θ) and Σ(θ). Methodologists have developed EM algo-
rithms for this purpose that are similar to those in Chapter 3 (Bentler & Liang, 2011;
Liang & Bentler, 2004; Poon & Lee, 1998; Raudenbush, 1995).
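As a rough numeric sketch of these computations, the snippet below builds the model-implied moments of the empty random intercept model and evaluates one cluster's multivariate normal log-likelihood; clusters of different sizes simply get different-sized moment matrices. All parameter values are hypothetical.

```python
import numpy as np

def implied_moments(n_j, beta0, sigma2_b0, sigma2_eps):
    """Model-implied mean vector and covariance matrix of an empty random
    intercept model for a cluster with n_j members: a common mean, the
    between-cluster variance off the diagonal, and total variance on it."""
    mu = np.full(n_j, beta0)
    sigma = sigma2_b0 * np.ones((n_j, n_j)) + sigma2_eps * np.eye(n_j)
    return mu, sigma

def cluster_loglik(y, beta0, sigma2_b0, sigma2_eps):
    """Multivariate normal log-likelihood for one cluster's data vector;
    the moments come from the model, not from the raw data."""
    mu, sigma = implied_moments(len(y), beta0, sigma2_b0, sigma2_eps)
    dev = y - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = dev @ np.linalg.solve(sigma, dev)
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet + quad)

# Hypothetical parameter values and one cluster's three scores
ll = cluster_loglik(np.array([51.0, 49.0, 50.5]), 50.0, 4.0, 9.0)
```

Summing such terms over clusters gives the observed-data log-likelihood that the EM algorithms referenced above maximize.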
Analysis Example
Revisiting the earlier math problem-solving analysis, I used a multilevel structural equa-
tion modeling framework to estimate the random intercept analysis from Equation 8.18.
The model incorrectly treats the incomplete binary lunch assistance indicator as a nor-
mally distributed variable, but limited computer simulation evidence suggests that this
specification may be fine for binary variables (Muthén et al., 2016). Analysis scripts are
available on the companion website.
Table 8.11 gives the focal model parameters and their standard errors. The table
omits predictor model parameters, because these are not the substantive focus. The
maximum likelihood estimates were mostly indistinguishable from those of multiple
imputation and Bayesian estimation; as described previously, the Bayesian analysis had
a somewhat different between-cluster variance, but other parameters were similar. This
analysis, like many others in the book, underscores the point that different analytic
methods that apply similar assumptions generally give the same answer. It is important
to reiterate that maximum likelihood missing data handling is best suited for random
intercept analyses, as the few estimators that currently allow for incomplete random
slope predictors are prone to bias.
This chapter has described missing data handling for multilevel data structures. Ideas
established in earlier chapters readily extend to multilevel regression. For example,
imputations still equal predicted scores plus noise, but cluster-specific regressions gen-
erate the predictions, and within-cluster variation defines the spread of the random
residuals. Similarly, missing dependent variable scores depend only on the focal analysis
model’s parameters and random effects, whereas incomplete predictors require one or
more additional supporting models.
Chapter 7 classified multiple imputation procedures into two buckets according to
the degree of similarity between the imputation and analysis models: An agnostic impu-
tation strategy deploys a model that differs from the substantive analysis, and a model-
based imputation procedure invokes the same focal model as the secondary analysis
(perhaps with additional auxiliary variables). These classifications emphasize that an
analysis model’s composition—in particular, whether it includes nonlinear effects—
determines the type of imputation strategy that works best. This distinction also applies
to multilevel data sets, where model-based missing data-handling strategies are demonstrably superior for analyses that feature random coefficient, interaction, or curvilinear
effects. In contrast, multilevel extensions of the joint model and fully conditional speci-
fication (agnostic imputation procedures) are best suited for random intercept models. I
recommend the following articles for readers who want additional details on topics from
this chapter:
Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel
regression models with random coefficients, interaction effects, and other nonlinear terms.
Psychological Methods, 25, 88–112.
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and
evaluation of joint modeling and chained equations imputation. Psychological Methods,
21, 222–240.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of missing covariate values in
multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48,
640–649.
Quartagno, M., & Carpenter, J. R. (2016). Multiple imputation for IPD meta-analysis: Allowing for
heterogeneity and studies with missing covariates. Statistics in Medicine, 35, 2938–2954.
Schafer, J. L., & Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-
effects models with missing values. Journal of Computational and Graphical Statistics, 11,
437–457.
van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.),
Handbook of advanced multilevel analysis (pp. 173–196). New York: Routledge.
Analyses that assume a conditionally MAR process have been our go-tos throughout the
book. This mechanism stipulates that unseen score values carry no unique information
about missingness beyond that contained in the observed data. This assumption is con-
venient, because there is no need to specify and estimate a model for the missing data
process. Although the MAR assumption is quite reasonable for a broad range of practi-
cal applications, in some cases it may be plausible that the unseen score values do carry
unique information about missingness, in which case the process is called missing not at
random (MNAR). The two major modeling frameworks for MNAR processes—selection
models and pattern mixture models—mitigate nonresponse bias by introducing a model
that describes the occurrence of missing data, but they do that in very different ways: A
typical selection model features a regression equation with a missing data indicator as a
dependent variable, whereas a pattern mixture model uses the indicator as a predictor.
As you will see, models for MNAR processes require strict and unverifiable assump-
tions; selection models leverage untestable normal distributional assumptions, whereas
pattern mixture models generally require you to specify values for one or more inesti-
mable population parameters. Ultimately, there is no way to confirm these requirements
are satisfied, and model misspecifications could produce estimates that contain more
bias than those from a MAR analysis. A common view is that MNAR models are best
suited for sensitivity analyses that examine the stability of one’s substantive conclu-
sions across different assumptions. Beunckens, Molenberghs, Thijs, and Verbeke (2007,
p. 477) define a sensitivity analysis as “one in which several statistical models are con-
sidered simultaneously and/or where a statistical model is further scrutinized using
specialized tools (such as diagnostic measures).” I adopt this definition throughout the
chapter, and the data analysis examples illustrate how to implement this strategy and
interpret potentially conflicting results.
Missing Not at Random Processes
To refresh ideas from Section 1.3, the MNAR mechanism states that the probability of
missingness is related to the unseen score values. Gomer and Yuan (2021) further sepa-
rate this mechanism into focused and diffuse subtypes that describe different roles for
the observed data. I adopt this distinction throughout the chapter, because it provides a
useful framework for structuring sensitivity analyses that consider competing explana-
tions about the missing data process.
Gomer and Yuan (2021) define a focused MNAR process as one where missingness
depends only on the unseen score values in Y(mis). The conditional distribution of the
missing data indicators is
$$
\Pr\bigl(M_i = 1 \mid \mathbf{Y}_i^{(\text{mis})}, \boldsymbol{\phi}\bigr) \tag{9.1}
$$
where Mi is a vector of binary missingness codes for individual i (each indicator equals
1 if missing and 0 otherwise), Yi(mis) represents the person’s unseen score values, and φ
contains model parameters that describe the occurrence of missing data. To illustrate,
consider the depression scores from the chronic pain data set on the companion website.
A focused process would occur if the unseen depression scores were the sole determi-
nant of missingness (e.g., participants with the worst symptoms leave the study to seek
treatment elsewhere).
A diffuse MNAR mechanism is one where nonresponse depends on both the unseen
score values in Y(mis) and the observed data in Y(obs). The conditional distribution of the
missing data indicators for this process is as follows:
$$
\Pr\bigl(M_i = 1 \mid \mathbf{Y}_i^{(\text{obs})}, \mathbf{Y}_i^{(\text{mis})}, \boldsymbol{\phi}\bigr) \tag{9.2}
$$
Applied to the chronic pain study, a diffuse mechanism would occur if participants
with the worst symptoms (i.e., high values of Yi(mis)) leave the study to seek treatment
elsewhere and participants with low perceived control over their pain (i.e., low values of
Yi(obs)) miss assessments that coincide with acute disruptions in their day-to-day activi-
ties. The literature suggests that diffuse processes are somewhat harder to model (e.g.,
require much larger sample sizes) and are capable of inducing greater nonresponse bias
than focused ones (Du, Enders, Keller, Bradbury, & Karney, 2021; Gomer & Yuan, 2021;
Zhang & Wang, 2012).
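A small simulation can make the two definitions concrete. The sketch below generates hypothetical bivariate data and creates missingness indicators from probit-style selection rules: the focused process uses only the to-be-deleted Y scores, and the diffuse process also uses the observed X scores. All coefficients are illustrative, not estimates from the chronic pain data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical standardized data: x (always observed) predicts y (to be deleted)
x = rng.normal(0.0, 1.0, n)
y = 0.5 * x + rng.normal(0.0, 1.0, n)

def probit_missing(latent):
    """Threshold a latent missingness propensity at tau = 0: scores above
    the threshold are flagged as missing (M = 1)."""
    return (latent > 0.0).astype(int)

# Focused MNAR: the propensity depends only on the to-be-missing y values
m_focused = probit_missing(-0.7 + 1.0 * y + rng.normal(0.0, 1.0, n))

# Diffuse MNAR: the propensity depends on unseen y AND observed x
m_diffuse = probit_missing(-0.7 + 0.7 * y - 0.7 * x + rng.normal(0.0, 1.0, n))

# Complete-case means of y are pulled downward because high scorers drop out
cc_mean_focused = y[m_focused == 0].mean()
cc_mean_diffuse = y[m_diffuse == 0].mean()
```

Under both rules the complete-case mean of y is biased (its population value is zero here), which is why these processes cannot be ignored when analyzing the observed data.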
The previous definitions imply that nonresponse is entangled with the outcome vari-
able in a way that cannot be ignored when analyzing the data. The two major model-
ing frameworks for MNAR processes—selection models and pattern mixture models—
mitigate nonresponse bias by introducing a model that describes the occurrence of
missing data, but they do that in different ways: A selection model features a regression
equation with a missing data indicator as a dependent variable, whereas a pattern mix-
ture model uses the missing data indicator as a predictor. Both approaches start with a
multivariate distribution for the data and the missingness indicators, and factorize this
function into the product of two or more separate distributions. The basic idea mimics
factored regression models from previous chapters.
Using generic notation, the selection model factorizes the joint distribution of the
analysis variables and the missing data indicators (Y and M, respectively) into the fol-
lowing sequence of univariate functions:
$$
f\bigl(\mathbf{Y}_i, \mathbf{M}_i\bigr) = f\bigl(\mathbf{M}_i \mid \mathbf{Y}_i\bigr) \times f\bigl(\mathbf{Y}_i\bigr) \tag{9.3}
$$
FIGURE 9.1. Panel (a) shows a path diagram of a focused selection model where only the
incomplete dependent variable predicts missingness, and panel (b) depicts a diffuse selection
model where both analysis variables predict missingness.
$$
f\bigl(\mathbf{Y}_i, \mathbf{M}_i\bigr) = f\bigl(\mathbf{Y}_i \mid \mathbf{M}_i\bigr) \times f\bigl(\mathbf{M}_i\bigr) \tag{9.4}
$$
The f(Yi|Mi) term conveys that model parameters differ by missing data pattern, and f(Mi)
is a model that describes the pattern proportions. These proportions serve as weights for
combining pattern-specific estimates into population-level quantities that average over
the distribution of missing data. Figure 9.2 shows path diagrams for focused and diffuse
pattern mixture models. The missing data indicator in the top panel creates mean dif-
ferences on the dependent variable, and the dashed line in the bottom figure represents
a moderation effect (i.e., product term) where the influence of X on Y differs for people
with missing data.
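The weighting step can be sketched in a few lines. The numbers below are hypothetical: pattern-specific estimates combine via the pattern proportions from the f(M) part of the model.

```python
# Hypothetical pattern-specific estimates of a mean depression score
mu_observed = 14.0   # estimate from the observed-data pattern
mu_missing = 19.0    # estimate (or assumed value) for the dropout pattern
pi_missing = 0.25    # pattern proportion from the f(M) part of the model

# The pattern proportions weight the pattern-specific estimates into a
# marginal estimate that averages over the distribution of missing data
mu_marginal = (1.0 - pi_missing) * mu_observed + pi_missing * mu_missing
```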
Selection and pattern mixture models are equivalent in the sense that they attempt
to describe the same multivariate distribution, but the similarities stop there.

FIGURE 9.2. Panel (a) shows a path diagram of a focused pattern mixture model where the missing data indicator predicts the dependent variable, and panel (b) depicts a diffuse pattern mixture model where the indicator moderates the influence of X on Y.

For one, translating the generic factorizations from Equations 9.3 and 9.4 into formal statistical models requires different inputs and assumptions, and applying the two modeling
frameworks to the same data can yield very different estimates. From a practical per-
spective, the two modeling frameworks tell a different story about the missing data;
selection models treat missingness as an outcome that depends on the analysis vari-
ables, whereas pattern mixture models treat missingness as a qualitative dimension that
moderates the focal model parameters. Either or both descriptions could be reasonable
for a given application.
Selection models for missing data trace to Heckman’s (1976, 1979) seminal work on sam-
ple selection, truncation, and limited dependent variables. Heckman’s work spawned a
great deal of interest in the econometrics literature, and there is now a considerable body
of methodological research devoted to his approach, as well as countless applications. A
selection model for missing data pairs the focal regression with an additional probit or
logistic model for the binary missingness indicator (see Figure 9.1). Heckman’s papers
described a two-step limited information method that first estimates the missingness
model, after which it uses a function of the resulting parameter estimates to formulate
a corrective variable that appears in the focal model. Winship and Mare (1992) and
Puhani (2000) provide good summaries of this estimator and the early literature. Cut
to today, and we can use factored or sequentially specified regressions to estimate these
models in either the likelihood or Bayesian frameworks (Du et al., 2021; Ibrahim et al.,
2002; Ibrahim et al., 2005; Ibrahim, Lipsitz, & Chen, 1999; Lipsitz & Ibrahim, 1996;
Lüdtke et al., 2020a, 2020b).
Continuing with the bivariate example involving depression and perceived control
over pain, the selection model pairs the focal analysis with an additional logistic or probit
regression for the binary missing data indicator. I focus on the latter for consistency with
earlier material. As a quick recap, probit regression envisions binary scores originat-
ing from a latent response variable that now represents a normally distributed propen-
sity for missing data. The model also includes a threshold parameter τ that divides the
latent response distribution into two segments, such that participants with missing and
observed data have latent scores above and below the threshold, respectively. The link
between the latent response variable and its manifest missing data indicator is as follows:
$$
M_i = \begin{cases} 0 & \text{if } M_i^* \leq \tau \\ 1 & \text{if } M_i^* > \tau \end{cases} \tag{9.5}
$$
Following the specification from Section 6.2, the threshold parameter is fixed at 0 to
identify the latent response variable’s mean structure.
The simplest selection model is one for a focused MNAR process where only the
dependent variable appears in the missingness model.
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i = E\bigl(Y_i \mid X_i\bigr) + \varepsilon_i, & Y_i &\sim N_1\bigl(E(Y_i \mid X_i), \sigma^2_{\varepsilon}\bigr) \\
M_i^* &= \gamma_0 + \gamma_1 Y_i + r_i = E\bigl(M_i^* \mid Y_i\bigr) + r_i, & M_i^* &\sim N_1\bigl(E(M_i^* \mid Y_i), 1\bigr)
\end{aligned}
\tag{9.6}
$$
The second model is a probit regression with the latent response variable as the out-
come. As always, fixing the model’s residual variance to 1 establishes a metric. Fig-
ure 9.1a shows a path diagram of the two regressions that effectively comprise a single
mediator model in which the influence of the explanatory variable on the missing data
indicator is transmitted indirectly via the dependent variable.
To make the previous model more concrete, I used computer simulation to create an
artificial data set based on the perceived control over pain and depression variables from
the chronic pain data, and I deleted 25% of the artificial depression scores to mimic a
strong selection process where participants with high depression scores are more likely
to have missing data (e.g., those with the worst symptoms leave the study to seek treat-
ment elsewhere). Figure 9.3 shows the resulting scatterplot of the data, with gray circles
representing complete cases and black crosshairs denoting partial data records with
missing depression scores. The contour rings convey the perspective of a drone hover-
ing over the peak of the bivariate normal population distribution, with smaller contours
denoting higher elevation (and vice versa). The graph depicts a systematic process where the missing values are primarily located above the regression line in the upper left quadrant of the plot.

FIGURE 9.3. Scatterplot of a focused MNAR process where only the dependent variable determines missingness. Gray circles represent complete cases, and black crosshairs denote partial data records with missing depression scores.
Section 1.8 presented computer simulation results illustrating the biases that result
from applying a model for conditionally MAR data to an MNAR process like the one in
Figure 9.3. To illustrate the problem, Figure 9.4 shows a single data set from a model-
based imputation routine that treats the missing values as a function of the predic-
tor variable. The MCMC algorithm incorrectly intuits that the unobserved depression
scores should be evenly dispersed around a regression line, as you would expect from
a conditionally MAR process. To accommodate that expectation, the estimator favors
a biased regression line (the dashed line) that fills in the wrong part of the population
distribution.
FIGURE 9.4. A single data set from a model-based imputation routine that treats the missing values as a function of the explanatory variable. The MCMC algorithm evenly disperses imputes (the black crosshairs) around a biased regression line (the dashed line) that fills in the wrong part of the population distribution.

A more complex selection model for a diffuse MNAR process includes one or more additional predictor variables in the missingness equation. Using generic notation, a model that includes both depression and perceived control over pain in the selection equation is as follows:
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i = E\bigl(Y_i \mid X_i\bigr) + \varepsilon_i, & Y_i &\sim N_1\bigl(E(Y_i \mid X_i), \sigma^2_{\varepsilon}\bigr) \\
M_i^* &= \gamma_0 + \gamma_1 Y_i + \gamma_2 X_i + r_i = E\bigl(M_i^* \mid Y_i, X_i\bigr) + r_i, & M_i^* &\sim N_1\bigl(E(M_i^* \mid Y_i, X_i), 1\bigr)
\end{aligned}
\tag{9.7}
$$
Figure 9.1b shows a path diagram of the model, which corresponds to a partially medi-
ated process in which the explanatory variable uniquely predicts missingness after con-
trolling for the unseen values of the outcome variable.
I again used computer simulation to create an artificial data set with missing data
patterns based on the previous model, and I deleted 25% of the artificial depression
scores to mimic a strong selection process where participants with high levels of depres-
sion or lower levels of perceived control over pain are more likely to have missing val-
ues (e.g., those with the worst symptoms leave the study to seek other treatment, and
participants with low control over their pain miss assessments that coincide with acute
disruptions in their day-to-day activities). I kept the overall strength of the selection
process the same as it was before, but both variables now contribute equally to missing-
ness. Figure 9.5 shows the resulting scatterplot of the data, with gray circles again repre-
senting complete cases and black crosshairs denoting partial data records with missing
depression scores. The missing depression scores are still primarily located above the
population regression line, but a portion of the black crosshairs have relocated to the
lower left quadrant of the plot.
Important Assumptions
Estimating the selection equation is a formidable task, because the latent response
variable is completely missing and the outcome is partially missing from an unknown
region of the distribution. This seemingly impossible charge requires a strict bivariate
normality assumption for the model’s two residuals. In fact, the normality assumption
is the glue that holds the two models together and makes estimation possible when the
regressions share the same variables (Little & Rubin, 1987, p. 230; Puhani, 2000; Sar-
tori, 2003), as is the case in Equation 9.7. As a practical matter, estimation works best
when the selection model includes variables that do not appear in the analysis model (or
vice versa). Auxiliary variables that correlate with the missing data indicator but not the
analysis variables are good candidates for achieving this separation.
Eliminating nonresponse bias also requires that the missingness model is approxi-
mately correct. Generally, omitting an important determinant of missingness from the
selection equation can introduce substantial bias, while overfitting the model with one
or two unnecessary predictors can cause extraordinarily noisy estimates with large sam-
pling variation. While it is most important to specify the right set of variables, specify-
ing the wrong functional forms can also introduce bias. For example, if you thought that
participants with high or low depression scores are more likely to have missing data
(e.g., those with mild symptoms leave the study, because they no longer feel treatment
is necessary, whereas those with acute symptoms leave the study to seek treatment elsewhere), then the selection equation should include a curvilinear effect, as shown below:

$$
M_i^* = \gamma_0 + \gamma_1 Y_i + \gamma_2 Y_i^2 + r_i = E\bigl(M_i^* \mid Y_i\bigr) + r_i \tag{9.8}
$$

Here, again, the model is not robust to misspecification, and including the wrong configuration of effects can impact parameter estimates.

FIGURE 9.5. Scatterplot of a diffuse MNAR process where the dependent variable and predictor explain missingness. Gray circles represent complete cases, and black crosshairs denote partial data records with missing depression scores.
Applying the generic factorization from Equation 9.3 to the diffuse selection model
from Equation 9.7 gives the factored regression (sequential) specification below:
$$
f\bigl(Y, M, X\bigr) = f\bigl(M \mid Y, X\bigr) \times f\bigl(Y \mid X\bigr) \times f\bigl(X\bigr) \tag{9.9}
$$
The first term to the right of the equals sign corresponds to the missingness model, the
second term is the focal model (e.g., the regression of depression on perceived control
over pain), and the rightmost term is the marginal (overall) distribution of the predictor.
By now this should look familiar, because we’ve applied this specification throughout
the book.
Because the dependent variable acts as a predictor in the missingness model, the
distribution of its missing values mirrors the two-part function for an incomplete regres-
sor (see Section 5.3). It is instructive to look at the analytic distribution of the missing
values to draw connections to earlier material. Dropping unnecessary scaling terms and
substituting the kernels of the distribution functions from Equation 9.7 into the right
side of the factorization gives the following expression:
$$f\left(M_i^* \mid X_i, Y_i\right) \times f\left(Y_i \mid X_i\right) \propto \exp\left(-\frac{1}{2}\frac{\left(M_i^* - \left(\gamma_0 + \gamma_1 Y_i + \gamma_2 X_i\right)\right)^2}{\sigma_r^2}\right) \times \exp\left(-\frac{1}{2}\frac{\left(Y_i - \left(\beta_0 + \beta_1 X_i\right)\right)^2}{\sigma_\varepsilon^2}\right) \quad (9.10)$$
The selection model’s residual variance is fixed to 1, but I include σr2 in the equation to
maintain comparability to analogous expressions from Chapter 5.
Multiplying the two normal curve functions and performing algebra that combines
the component functions into a single distribution for Y gives a normal distribution with
two-part mean and variance expressions that depend on the focal and selection model
parameters. This should be a familiar refrain, as we’ve seen this composition several
times (e.g., Equation 5.12).
$$f\left(Y_i^{(mis)} \mid M_i^*, X_i\right) = N_1\left(E\left(Y_i \mid M_i^*, X_i\right), \operatorname{var}\left(Y_i \mid M_i^*, X_i\right)\right) \quad (9.11)$$

$$E\left(Y_i \mid M_i^*, X_i\right) = \operatorname{var}\left(Y_i \mid M_i^*, X_i\right) \times \left(\frac{\gamma_1\left(M_i^* - \gamma_0 - \gamma_2 X_i\right)}{\sigma_r^2} + \frac{\beta_0 + \beta_1 X_i}{\sigma_\varepsilon^2}\right)$$

$$\operatorname{var}\left(Y_i \mid M_i^*, X_i\right) = \left(\frac{\gamma_1^2}{\sigma_r^2} + \frac{1}{\sigma_\varepsilon^2}\right)^{-1}$$
Although the equation is not intuitive, you can see that both the mean and variance
contain a term that depends on the strength of the MNAR process, as encoded by γ1.
These terms vanish when γ1 = 0 (i.e., the mechanism is MCAR or MAR), in which case
the distribution of missing values simplifies to that of a conditionally MAR analysis.
Conversely, nonzero values of this coefficient induce a correction to both the center and
spread of the distribution.
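To make the two-part expressions in Equation 9.11 concrete, the short sketch below (Python, with purely hypothetical parameter values rather than estimates from any real analysis) computes the conditional mean and variance of a missing score; setting γ1 = 0 reproduces the conditionally MAR result.

```python
def conditional_moments(m_star, x, beta0, beta1, gamma0, gamma1, gamma2,
                        sigma2_e, sigma2_r=1.0):
    """Mean and variance of a missing Y given M* and X (Equation 9.11)."""
    # Precision-weighted combination of the selection equation's information
    # about Y (via gamma1) and the focal regression's prediction.
    var = 1.0 / (gamma1 ** 2 / sigma2_r + 1.0 / sigma2_e)
    mean = var * (gamma1 * (m_star - gamma0 - gamma2 * x) / sigma2_r
                  + (beta0 + beta1 * x) / sigma2_e)
    return mean, var

# Hypothetical values: with gamma1 = 0 the MNAR correction terms vanish, and
# the moments reduce to the focal model's conditional mean and residual variance.
mu, v = conditional_moments(m_star=0.5, x=20.0, beta0=25.0, beta1=-0.4,
                            gamma0=-0.7, gamma1=0.0, gamma2=0.02, sigma2_e=9.0)
```

With γ1 = 0 the call returns the focal model's prediction (25 − 0.4 × 20 = 17) and the residual variance (9); a nonzero γ1 shifts the mean and shrinks the variance, mirroring the corrections described above.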
Practical Recommendations
The literature suggests diffuse processes are somewhat harder to model (e.g., require
larger samples) and can induce greater nonresponse bias than focused ones (Du et al.,
2021; Gomer & Yuan, 2021; Zhang & Wang, 2012). As a general observation, including
too many predictors in the missingness model is generally less detrimental than ignor-
ing an important determinant of missing data (Du et al., 2021; Gomer & Yuan, 2021),
but overfitting can produce very noisy estimates and a nontrivial reduction in precision
and power. Unless you have a very large data set, adopting a judicious rather than inclu-
sive approach to selecting predictors for the missingness model is probably a good idea.
It is also important to reiterate that the model works best when the selection equation
includes variables that do not appear in the focal analysis (or vice versa).
Modeling a focused process is a good starting point for building a diffuse missing-
ness model that includes carefully chosen predictors. Ibrahim et al. (2005) suggested
adding regressors in a stepwise fashion, perhaps in conjunction with model selection
indices and individual influence diagnostics. I have found this approach to be very use-
ful, and the next section provides additional details on model comparisons. Carefully
inspecting parameter estimates and their standard errors (or posterior standard devia-
tions) is also important, because selection models often leave breadcrumbs that signal
a misspecification. For example, misspecified models are often accompanied by large
increases in some standard errors (e.g., values that are double or triple the size of those
from an MAR analysis) or an implausibly large R2 statistic from the missingness model.
The subsequent analysis examples illustrate a model-building procedure that uses a
variety of inputs and criteria to identify a plausible selection model.
As mentioned previously, the γ1 coefficient from Equations 9.6 and 9.7 encodes the
strength of the MNAR process. This feature appears to offer a way to test the missing
data mechanism, as this slope should be significantly different from zero if an MNAR
process caused the missing data patterns. Unfortunately, significance tests of the probit
model parameter estimates are not trustworthy, nor are likelihood ratio tests comparing
nested models with and without the dependent variable in the selection equation (Jansen
et al., 2006; Molenberghs & Kenward, 2007, Section 19.6; Verbeke & Molenberghs,
2000). Molenberghs, Beunckens, Sotto, and Kenward (2008) further cast doubt on such
comparisons, showing that any MNAR model has a MAR counterpart that fits the data
equally well. Ibrahim et al. (2005) and others instead suggest the Akaike information cri-
terion (AIC; Akaike, 1974) and Bayesian information criterion (BIC; Schwarz, 1978) for
relative fit assessments, and published applications of such comparisons are common in
Missing Not at Random Processes 359
the literature (e.g., Gottfredson, Bauer, Baldwin, & Okiishi, 2014; Muthén, Asparouhov,
Hunter, & Leuchter, 2011). These indices can sometimes distinguish among competing
models if you are willing to accept the validity of the selection model and its predictions about
the missing values with no input from the observed data.
I briefly describe the AIC and BIC, and point interested readers to the literature for
additional details (Dziak, Coffman, Lanza, Li, & Jermiin, 2020; Vrieze, 2012). The AIC
and BIC are
$$\text{AIC} = -2LL + 2P$$
$$\text{BIC} = -2LL + \ln(N)P \quad (9.12)$$
where LL is the log-likelihood value of the fitted model and P is the number of esti-
mated parameters. The idea is to downgrade models that achieve good fit (i.e., attain a
lower –2LL value) by including too many parameters, and the rightmost term in each
expression is a penalty factor that inflates the deviance value to compensate for model
complexity. As such, lower values of both indices are better when considering compet-
ing models. The AIC and BIC need not agree, because they make different assumptions
and select on different features. Roughly speaking, the BIC attempts to identify the true
data-generating model from a set of candidate models, whereas the AIC tries to select
a candidate model that isn’t necessarily correct but adequately describes the unknown
data-generating function.
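In code, Equation 9.12 amounts to two one-liners (a minimal Python sketch; the log-likelihood values would come from your fitted models, and the numbers here are hypothetical):

```python
import math

def aic(loglik, n_params):
    """AIC = -2LL + 2P (Equation 9.12)."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n):
    """BIC = -2LL + ln(N) * P; the penalty grows with the sample size."""
    return -2.0 * loglik + math.log(n) * n_params

# Hypothetical inputs: once ln(N) > 2 (i.e., N > 7), the BIC penalizes each
# extra parameter more heavily than the AIC does.
aic_value = aic(-2490.5, 25)
bic_value = bic(-2490.5, 25, 437)
```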
The difference between AIC and BIC values provides information about the relative
fit of two models (the models need not be nested). For example, using the indices to
compare models that assume a conditionally MAR versus MNAR process gives the fol-
lowing ΔAIC and ΔBIC values:
$$\Delta\text{AIC} = \text{AIC}_{(MAR)} - \text{AIC}_{(MNAR)} = -2\left(LL_{(MAR)} - LL_{(MNAR)}\right) + 2\left(P_{(MAR)} - P_{(MNAR)}\right) \quad (9.13)$$
$$\Delta\text{BIC} = \text{BIC}_{(MAR)} - \text{BIC}_{(MNAR)} = -2\left(LL_{(MAR)} - LL_{(MNAR)}\right) + \ln(N)\left(P_{(MAR)} - P_{(MNAR)}\right)$$
Because lower values are better, positive ΔAIC or ΔBIC values favor the MNAR analysis,
and negative values support the conditionally MAR model. Using Bayes factors (akin
to the ratio of two likelihood values) as a guide, Raftery (1995) developed the following
effect-size-like rules of thumb for the BIC difference: |ΔBIC| values from 0 to 2 constitute
“weak” evidence, differences between 2 and 6 reflect “positive” evidence, values from 6
to 10 are “strong,” and a difference greater than 10 is “very strong.” Anderson and Burn-
ham (2004) propose analogous rules for the AIC.
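A small sketch (Python, hypothetical inputs) computes the ΔBIC from Equation 9.13 and attaches Raftery's (1995) verbal label:

```python
import math

def delta_bic(ll_mar, ll_mnar, p_mar, p_mnar, n):
    """BIC(MAR) - BIC(MNAR); positive values favor the MNAR analysis."""
    return -2.0 * (ll_mar - ll_mnar) + math.log(n) * (p_mar - p_mnar)

def raftery_label(delta):
    """Raftery's rules of thumb for the magnitude of a BIC difference."""
    a = abs(delta)
    if a <= 2:
        return "weak"
    if a <= 6:
        return "positive"
    if a <= 10:
        return "strong"
    return "very strong"
```

For example, raftery_label(4.49) returns "positive", matching the verbal evidence categories described in the text.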
It is important to emphasize that model comparisons with ΔAIC or ΔBIC require
strong and untestable assumptions about the missing values, and you must be will-
ing to accept the MNAR model’s propositions with blind faith. Moreover, because the
observed data may contain relatively little information about a given model compari-
son, ΔAIC and ΔBIC are especially susceptible to extreme or unusual data patterns.
This can produce a problematic situation where the presence or absence of just one
observation can cause the indices to favor one model over another. Interestingly, it
isn’t necessarily participants with missing data that exert the greatest leverage on a
model comparison, as an unusual pattern of observed data can do the same. For exam-
ple, Kenward (1998) reported an example from the biostatistics literature involving a
longitudinal study of dairy cow milk yields. He found that including two sick cows
with complete data and qualitatively different trajectories produced a comparison that
favored a selection model, whereas excluding these two individuals supported a condi-
tionally MAR analysis.
A number of biostatistics papers describe sensitivity procedures designed to iden-
tify data records that unduly influence the results of a model comparison (Beunckens
et al., 2007; Kenward, 1998; Molenberghs & Verbeke, 2001; Molenberghs, Verbeke,
Thijs, Lesaffre, & Kenward, 2001; Thijs, Molenberghs, & Verbeke, 2000; Verbeke,
Molenberghs, Thijs, Lesaffre, & Kenward, 2001). One such approach is to iteratively
fit two models of interest after removing one person at a time from the data. This jack-
knife strategy produces individual influence diagnostics that measure the change in
the ΔAIC or ΔBIC that results from excluding a participant from the analysis (Sterba &
Gottfredson, 2014). These individual-level difference values are

$$\Delta\text{AIC}_i = \Delta\text{AIC} - \Delta\text{AIC}_{(-i)} \qquad \Delta\text{BIC}_i = \Delta\text{BIC} - \Delta\text{BIC}_{(-i)} \quad (9.14)$$

where the (–i) subscript indicates that individual i is excluded from the ΔAIC or ΔBIC
computation. Selection models are computationally intensive, so refitting multiple mod-
els N times is a substantial barrier to implementing this approach in practice.
Sterba and Gottfredson (2014) propose noniterative approximations to ΔAICi and
ΔBICi that are simple by-products of maximum likelihood estimation. These approxi-
mate influence diagnostics are
$$\hat{\Delta}\text{AIC}_i = -2\left(LL_{i(MAR)} - LL_{i(MNAR)}\right) \quad (9.15)$$
$$\hat{\Delta}\text{BIC}_i = -2\left(LL_{i(MAR)} - LL_{i(MNAR)}\right) + \ln\left(\frac{N}{N-1}\right)\left(P_{(MAR)} - P_{(MNAR)}\right)$$
where LLi(MAR) and LLi(MNAR) are individual i’s contributions to the sample log-likelihood
values (computed by substituting a participant’s data and the maximum likelihood esti-
mates into the observed-data log-likelihood expression from Equation 3.5). Software
packages that implement maximum likelihood estimation routinely save the necessary
quantities upon request, making these diagnostics simple to compute. If the ΔAIC (or
ΔBIC) is positive (i.e., the observed data favor the MNAR model), a Δ̂AICi (or Δ̂BICi) value
that exceeds ΔAIC (or ΔBIC) is influential in the sense that excluding that participant’s
data could switch the sign of the model comparison. Conversely, if the ΔAIC (or ΔBIC) is
negative (i.e., the observed data favor the conditionally MAR analysis), a Δ̂AICi (or Δ̂BICi)
value more negative than ΔAIC (or ΔBIC) is similarly influential. The analysis examples
in the next section illustrate these diagnostics.
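A sketch of these computations in Python (the per-person log-likelihood contributions would be exported from your estimation software; the function and variable names here are illustrative):

```python
import math

def influence_diagnostics(ll_i_mar, ll_i_mnar, p_mar, p_mnar):
    """Approximate per-person changes in the AIC and BIC differences (Eq. 9.15)."""
    n = len(ll_i_mar)
    # ln(N / (N - 1)) * (P_MAR - P_MNAR) is the small BIC penalty adjustment
    penalty = math.log(n / (n - 1)) * (p_mar - p_mnar)
    d_aic = [-2.0 * (a - b) for a, b in zip(ll_i_mar, ll_i_mnar)]
    d_bic = [d + penalty for d in d_aic]
    return d_aic, d_bic

def influential_cases(values, delta):
    """Indices whose exclusion could flip the sign of the model comparison."""
    if delta > 0:
        return [i for i, v in enumerate(values) if v > delta]
    return [i for i, v in enumerate(values) if v < delta]
```

With a positive ΔBIC, any per-person value exceeding it flags a participant whose removal could reverse the comparison; with a negative ΔBIC, values more negative than it are flagged instead.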
I use the psychiatric trial data on the companion website to illustrate a sensitivity analy-
sis for a multiple regression model with an outcome that could be MNAR. The data,
which were collected as part of the National Institute of Mental Health Schizophre-
nia Collaborative Study, comprise repeated measurements of illness severity ratings
from 437 individuals. In the original study, participants were assigned to one of four
experimental conditions (a placebo condition and three drug regimens), but the data
collapse these categories into a dichotomous treatment indicator (DRUG = 0 for the
placebo group, and DRUG = 1 for the combined medication group). The outcome was
measured in half-point increments ranging from 1 (normal, not at all ill) to 7 (among the
most extremely ill). The researchers collected a baseline measure of illness severity prior
to randomizing participants to conditions, and they obtained follow-up measurements
1 week, 3 weeks, and 6 weeks later. The overall missing data rates for the repeated mea-
surements were 1, 3, 14, and 23%, and these percentages differ by treatment condition;
19 and 35% of the placebo group scores were missing at the 3-week and 6-week assess-
ments, respectively, versus 13 and 19% for the medication group. The data set and the
variable definitions are given in the Appendix, and other illustrations with these data
are found in the literature (Demirtas & Schafer, 2003; Hedeker & Gibbons, 1997, 2006).
The focal analysis is a linear regression model with baseline severity ratings, a gen-
der dummy code (0 = female, 1 = male), and the treatment assignment indicator predict-
ing 6-week follow-up scores.
Centering baseline scores and the male dummy code at their grand means facilitates
interpretation by defining β0 and β1 as the placebo group average and group mean differ-
ence, respectively (marginalizing over the covariates). This model is an ideal candidate
for a sensitivity analysis, because focused and diffuse MNAR mechanisms are plausible
explanations for dropout. To this end, I fit a series of selection models that invoked
different assumptions about the missingness process, and I used simple model checks
to assess the validity of each analysis. These analyses rely heavily on the normal distri-
bution assumption, so it is worth noting that the observed data distribution is slightly
non-normal, with skewness and excess kurtosis equal to 0.21 and –0.94, respectively.
This small data set offers limited choices for auxiliary variables, but the illness
severity ratings at the 1-week and 3-week follow-up assessments are excellent candi-
dates, because both have salient semipartial correlations with the dependent variable (r
= .40 and .61, respectively). Following the procedure from Section 5.8, I used a pair of
linear regression models to link the auxiliary variables to the focal variables.
I omitted the auxiliary variables from the missingness model to reduce problematic
dependencies across regression equations (as noted previously, conditioning on differ-
ent variables in the focal and selection models helps estimation). In some cases, it may
be necessary to include one or more auxiliary variables as regressors in the selection
equation. One way to mitigate collinearity problems is to use principal components
analysis to reduce a large superset of auxiliary variables into one or two composites that
link to the focal model variables (Howard et al., 2015), as such a composite would likely
provide unique information that enhances estimability even if one or two of its constitu-
ents appear in the missingness model.
I used maximum likelihood estimation for the examples, and the companion web-
site also provides scripts for Bayesian estimation and model-based multiple imputation.
Selection models can be difficult to estimate, because the observed data may contain
very little information about the missingness model. Monitoring convergence is espe-
cially important in this context, as invalid solutions are common. For the maximum
likelihood analyses, I refit the model with several sets of random starting values and
examined the final log-likelihood values to verify that different runs produced the same
solution. For a Bayesian analysis, specifying multiple MCMC chains with different start-
ing values provides analogous information, along with convergence diagnostics such as
trace plots and the potential scale reduction factors (Gelman & Rubin, 1992). For both
modeling frameworks, slow convergence is often a signal that the model is too complex
for the data. Finally, it is also important to check whether the MNAR results are unduly
influenced by unusual data records, and I use individual influence diagnostics for this
purpose (Sterba & Gottfredson, 2014). Analysis scripts are available on the companion
website.
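To illustrate the multiple-starting-values check with a toy example, the sketch below fits an intercept-only probit model to a missing data indicator by damped Newton iterations from several start values and verifies that every run reaches the same log-likelihood (pure Python; the real analyses involve the full selection model, and the data here are artificial):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik(g, m):
    """Log-likelihood of the intercept-only probit model M* = g + r."""
    p = Phi(g)
    k = sum(m)
    return k * math.log(p) + (len(m) - k) * math.log(1.0 - p)

def score(g, m):
    """First derivative of the log-likelihood with respect to g."""
    p = Phi(g)
    dens = math.exp(-0.5 * g * g) / math.sqrt(2.0 * math.pi)
    return dens * (sum(m) / p - (len(m) - sum(m)) / (1.0 - p))

def fit_intercept(m, start):
    """Damped Newton's method; this log-likelihood is globally concave in g."""
    g = start
    for _ in range(200):
        s = score(g, m)
        h = (score(g + 1e-6, m) - score(g - 1e-6, m)) / 2e-6
        step = max(min(s / h, 0.5), -0.5)   # damp large steps
        g -= step
        if abs(step) < 1e-12:
            break
    return g

# 437 artificial cases with roughly the example's 23% marginal missing data rate
m = [1] * 100 + [0] * 337
estimates = [fit_intercept(m, s) for s in (-1.5, -0.5, 0.0, 0.5, 1.5)]
logliks = [loglik(g, m) for g in estimates]
spread = max(logliks) - min(logliks)  # should be numerically zero
```

All five runs should land on the same solution (the closed-form answer is the probit of the observed missing data rate, about –0.74); runs that disagree in their final log-likelihoods would signal the kind of convergence problem described above.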
TABLE 9.1. Conditionally MAR Analysis Results (excerpt; the left panel excludes and the right panel includes the auxiliary variables; standard errors in parentheses)

Missingness model
Intercept (γ0)    –0.73 (0.07)    –0.73 (0.07)
R²                0               0
Model fit
AIC (P)           5484.44 (14)    4994.00 (23)
BIC (P)           5541.56 (14)    5087.85 (23)

Given their apparent importance, I include the auxiliary variable regressions in the subsequent selection models.
As always, the probit model’s threshold and residual variance are fixed at 0 and 1, respec-
tively. The leftmost columns in Table 9.2 show the estimates and standard errors, which
are effectively equivalent to the conditionally MAR analysis with auxiliary variables.
Model comparisons using ΔAIC or ΔBIC are not very useful here given that the esti-
mates are effectively identical to those in the right panel of Table 9.1. Returning to the
path diagram in Figure 9.1a, the explanatory variable influences missingness indirectly
via the dependent variable, which essentially acts as mediator. If a focused missingness
model is reasonable, the indirect pathway alone should reproduce the bivariate associa-
tion between each predictor and the missing data indicator, whereas inaccurate predic-
tions about this relation signal the need for a diffuse process with a direct pathway (e.g.,
the model in Figure 9.1b). For this analysis, the cell proportions in a 2 × 2 contingency
table encode the bivariate association between the treatment assignment (or gender
dummy code) and the missing data indicator.
TABLE 9.2. Selection Model Results (standard errors in parentheses; left = focused model, middle = diffuse model adding DRUG, right = diffuse model adding DRUG and MALE)

Missingness model
Intercept (γ0)    –0.75 (0.21)    0.44 (0.35)     0.37 (0.32)
SEVERITY6 (γ1)    0.01 (0.06)     –0.19 (0.08)    –0.18 (0.08)
DRUG (γ2)         —               –0.78 (0.20)    –0.76 (0.20)
MALE (γ3)         —               —               –0.13 (0.14)
R²                < .01 (.002)    .11 (.06)       .11 (.06)
Model fit
AIC (P)           4995.64 (24)    4981.00 (25)    4982.16 (26)
BIC (P)           5093.56 (24)    5083.00 (25)    5088.24 (26)
To illustrate, consider the treatment assignment variable, which had 35 and 19%
missing data rates for the placebo and medication conditions, respectively. To check
the selection model’s predictions about these proportions, I applied the procedure for
computing the indirect effect from a mediation model with a binary outcome (Muthén
et al., 2016, p. 310). Using generic notation, the expected value and variance of the latent
response variable at a particular value of the explanatory variable (e.g., X = 0 or 1) is as
follows:
$$E\left(M_i^* \mid X_i\right) = \gamma_0 + \gamma_1\left(\beta_0 + \beta_1 X_i\right) = \gamma_0 + \gamma_1 E\left(Y_i \mid X_i\right) \quad (9.19)$$
$$\operatorname{var}\left(M_i^* \mid X_i\right) = \gamma_1^2 \sigma_\varepsilon^2 + 1$$
The predicted probability of missing data for a particular value of the explanatory vari-
able via the indirect pathway through the dependent variable is an area under the following
normal curve (Muthén et al., 2016, Eq. 8.6):
$$\Pr\left(M_i = 1 \mid X_i\right) = 1 - \Phi\left(\frac{\tau - E\left(M_i^* \mid X_i\right)}{\sqrt{\operatorname{var}\left(M_i^* \mid X_i\right)}}\right) = \Phi\left(\frac{E\left(M_i^* \mid X_i\right)}{\sqrt{\operatorname{var}\left(M_i^* \mid X_i\right)}}\right) \quad (9.20)$$
where Φ(·) is the cumulative distribution function of the standard normal curve (i.e., a
function that gives the area below the z-score inside the parentheses). Substituting the
parameter estimates and dummy codes into these expressions gives model-predicted
missingness rates of 24 and 23% for the placebo and medication conditions, respectively.
These rather dramatic mispredictions signal a misspecification that could be remedied
by modeling a diffuse process with a direct effect for the treatment indicator. The miss-
ing data rates for males and females were not nearly as different, and the model’s predic-
tions about these proportions were far more accurate.
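Equations 9.19 and 9.20 translate directly into code. The sketch below (Python, with illustrative parameter values rather than the fitted estimates from Table 9.2) returns the model-implied missing data rate at a given value of the explanatory variable, which can then be compared against the observed cell proportions from the 2 × 2 contingency table:

```python
import math

def predicted_missing_rate(x, beta0, beta1, gamma0, gamma1, sigma2_e):
    """Pr(M = 1 | X) via the indirect pathway (Equations 9.19 and 9.20)."""
    mean = gamma0 + gamma1 * (beta0 + beta1 * x)   # E(M* | X)
    var = gamma1 ** 2 * sigma2_e + 1.0             # var(M* | X)
    z = mean / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z), threshold = 0

# Illustrative values only: compare implied rates for X = 0 and X = 1
rate_placebo = predicted_missing_rate(0, beta0=4.0, beta1=-0.5,
                                      gamma0=-0.9, gamma1=0.05, sigma2_e=1.5)
rate_drug = predicted_missing_rate(1, beta0=4.0, beta1=-0.5,
                                   gamma0=-0.9, gamma1=0.05, sigma2_e=1.5)
```

When γ1 = 0, the predicted rate no longer depends on X at all, which is why a focused model can badly misreproduce group-specific missing data rates.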
The middle panel of Table 9.2 shows the parameter estimates and standard errors from
the analysis. The diffuse model produced some important and noticeable differences.
Relative to the conditionally MAR analysis, the intercept (placebo group average) was
lower by nearly nine-tenths of a standard error unit, and the treatment group difference
was smaller (less negative) by an amount equal to one-third of a standard error. There
are no obvious indications that the model is unreasonable, and the observed and pre-
dicted missing data rates for the treatment conditions were a close match.
Looking at the information criteria at the bottom of the table, both the AIC and BIC
favored the selection model over an analysis that assumes MAR data; the differences
between pairs of information criteria values are ΔAIC = AICMAR – AICMNAR = 12.65 and
ΔBIC = BICMAR – BICMNAR = 4.49 (see Equation 9.13). The ΔBIC represents “positive” evi-
dence favoring the selection model according to Raftery’s (1995) effect-size-like rules of
thumb. It is important to reiterate that model comparisons with ΔAIC or ΔBIC require
untestable assumptions about the missing values, and you must be willing to accept the
MNAR model’s propositions a priori. Following verbiage from Verbeke, Lesaffre, and
Spiessens (2001, p. 426), we could say there is positive evidence for nonrandom dropout,
conditional on the validity of the selection model.
A concern with using information criteria for model comparisons is that a small
number of outliers (or perhaps even a single individual) may unduly influence the con-
clusions. To explore this possibility, I used the expressions from Equation 9.15 to com-
pute individual influence diagnostics. As a reminder, these quantities approximate the
change in the model comparison (ΔAIC or ΔBIC) that would result from excluding a
single participant’s data from both analyses (Sterba & Gottfredson, 2014). In this case,
a positive value of Δ̂AICi (or Δ̂BICi) that exceeds ΔAIC (or ΔBIC) indicates that exclud-
ing a participant’s data could switch the sign of the model comparison to favor the MAR
analysis. Figure 9.6 is an index plot that graphs the Δ̂BICi values for each participant.
That graph reveals no outliers, thus lending credence to the conclusion that the diffuse
selection model is plausible for these data.
Next, I added the gender dummy code to the missingness model as follows:

$$M_i^* = \gamma_0 + \gamma_1 SEVERITY6_i + \gamma_2 DRUG_i + \gamma_3 MALE_i + r_i$$
The rightmost set of columns in Table 9.2 show the parameter estimates and standard
errors, which were effectively equivalent to those of the previous analysis. The AIC and
BIC were both slightly higher, indicating that the additional complexity is not war-
ranted. Finally, I did not consider baseline severity scores as a predictor of missingness,
because this variable’s bivariate association with the missing data indicator was nearly
zero.
FIGURE 9.6. Index plot displaying the approximate change in the BIC difference (the influence diagnostics) for each participant. The graph reveals no outliers that exceed ΔBIC.
When specifying a selection model, you not only need to include the right set of
regressors in the missingness model, but you also must get their functional forms cor-
rect. Following Ibrahim et al. (2005), I investigated a final missingness model with a
higher-order interaction term between the treatment assignment indicator and depen-
dent variable. Such a process could occur, for example, if placebo group participants
with the most acute symptoms quit the study to seek treatment elsewhere, whereas treat-
ment group participants with the least acute symptoms quit, because they’ve achieved
adequate relief.
The model exhibited two telltale symptoms of misspecification: very large increases
in the standard errors (e.g., the treatment slope standard error increased by 40%) and
a missingness model with an implausibly large R2 statistic near .70. Additionally, the
estimators struggled to achieve convergence and required long iterative sequences (e.g.,
MCMC required more than 100,000 burn-in iterations). All evidence suggests that the
interactive model is either not plausible or the sample size isn’t large enough to support
estimation. In either case, the results are not credible and should not be interpreted.
Summary
The sensitivity analysis identified a defensible selection model that included the depen-
dent variable and treatment assignment indicator as predictors of missingness. Con-
ducting simple model checks and looking for signs of misspecification were instrumen-
tal in ruling out candidate missingness models with dubious support from the data.
The selection model produced noticeable differences in some key parameters; relative
to the MAR analysis, the intercept (placebo group average) was lower by nearly nine-
tenths of a standard error unit, and the treatment group difference was smaller by about
one-third of a standard error. Nontrivial differences like this might seem troubling, but
the results simply reflect two different, plausible assumptions about the missing data.
To emphasize, there is no way of knowing whether the MNAR analysis is better than a
simpler analysis that assumes a conditionally MAR mechanism. Both sets of results are
defensible and could (and should) be presented in a research report.
Whereas selection models are consistent with a mediated mechanism in which quanti-
tative differences on the analysis variables predict missingness via direct and indirect
effects, pattern mixture models are aligned with moderated processes in which missing
data patterns form qualitatively different subgroups with distinct parameter values (see
the factorization in Equation 9.4 and the path diagrams in Figure 9.2). I return to the
bivariate example with perceived control over pain predicting depression as a substan-
tive backdrop for describing the model (I generically refer to these variables as X and Y,
respectively).
Examining a pattern mixture model for bivariate normal data sets up its application
to multiple regression. The data feature two missing data patterns: One group has com-
plete data and indicator scores of M = 0, and the second group has missing depression
scores and M = 1. The model specifies a unique mean vector and variance–covariance
matrix for each pattern, as follows:
$$\begin{pmatrix} X_i^{(0)} \\ Y_i^{(0)} \end{pmatrix} \sim N_2\left(\begin{pmatrix} \mu_X^{(0)} \\ \mu_Y^{(0)} \end{pmatrix}, \begin{pmatrix} \sigma_X^{2(0)} & \sigma_{XY}^{(0)} \\ \sigma_{YX}^{(0)} & \sigma_Y^{2(0)} \end{pmatrix}\right) \qquad \begin{pmatrix} X_i^{(1)} \\ Y_i^{(1)} \end{pmatrix} \sim N_2\left(\begin{pmatrix} \mu_X^{(1)} \\ \mu_Y^{(1)} \end{pmatrix}, \begin{pmatrix} \sigma_X^{2(1)} & \sigma_{XY}^{(1)} \\ \sigma_{YX}^{(1)} & \sigma_Y^{2(1)} \end{pmatrix}\right) \quad (9.24)$$

Averaging the pattern-specific means over the distribution of missingness gives the marginal mean of the outcome:

$$\mu_Y = \pi^{(0)} \mu_Y^{(0)} + \pi^{(1)} \mu_Y^{(1)} \quad (9.25)$$

where π(0) and π(1) are weights equal to the proportion of cases in each pattern.
The idea behind a pattern mixture model is relatively straightforward: Estimate the
parameters of interest within each missing data pattern, then average over the distribu-
tion of missingness by computing a weighted composite of the group-specific estimates.
However, the example highlights that people with missing depression scores have no
data with which to estimate the mean and variance of Y and its covariance with X.
The regression analysis similarly features three inestimable parameters. To use the pat-
tern mixture model, you need to either provide values for the inestimable quantities or
impose so-called identifying restrictions that borrow information by setting the ines-
timable parameters equal to functions of the estimable ones (Kenward & Molenberghs,
2014; Little, 1993, 1994). I describe the former strategy here and illustrate identifying
restrictions later in Section 9.13.
A straightforward way to specify a pattern mixture model for multiple regression is
to cast missing data indicators as dummy codes that moderate the influence of one or
more explanatory variables on the outcome (Hedeker & Gibbons, 1997; Hogan & Laird,
1997a, 1997b). Applied to the bivariate data, this specification corresponds to a familiar
moderated regression model with a focal predictor X (e.g., perceived control over pain),
a binary missing data indicator M, and the product of the two (see Figure 9.2b).

$$Y_i = \beta_0^{(0)} + \beta_0^{(diff)} M_i + \beta_1^{(0)} X_i + \beta_1^{(diff)} M_i X_i + \varepsilon_i \quad (9.26)$$
I use superscripts to emphasize that these are not the coefficients of substantive inter-
est. The lower-order terms are conditional effects that depend on scaling: β0(0) is the
predicted outcome score for a participant with complete data and X = 0, β1(0) is a simple
slope that reflects the influence of X in the M = 0 pattern, and β0(diff) is the pattern mean
difference at X = 0. Finally, the β1(diff) coefficient is the slope difference for the group with
missing data. Importantly, β0(diff) and β1(diff) are inestimable, because the dependent vari-
able is missing for the M = 1 group. The next section describes a simple effect-size-based
strategy for specifying these coefficients.
To make the previous model more concrete, I used computer simulation to create
an artificial data set based on the perceived control over pain and depression variables
from the chronic pain data. I deleted 25% of the artificial depression scores to mimic a
process where participants with missing data form a subpopulation with higher depres-
sion scores at the center of the perceived control distribution and a stronger bivariate
association between the two variables. To maintain comparability with earlier figures,
I used the same population parameters that created the bivariate scatterplots for the
selection model examples, and I chose pattern-specific coefficients that maintained the
same overall strength of the association between the analysis variables and missingness.
Figure 9.7 shows the resulting scatterplot of the data, with gray circles representing
complete cases and black crosshairs denoting partial data records with missing depres-
sion scores. The contour rings convey the perspective of a drone hovering over the peak
of the bivariate normal population distribution, with smaller contours denoting higher
elevation (and vice versa). The dashed and dotted lines are the true pattern-specific
FIGURE 9.7. Scatterplot of a diffuse MNAR process from a pattern mixture model where par-
ticipants with missing data form a subpopulation with higher depression scores at the center of
the perceived control distribution and a stronger bivariate association between the two variables.
Gray circles represent complete cases, and black crosshairs denote partial data records with
missing depression scores. The dashed and dotted lines are the true pattern-specific regressions,
and the solid line is the overall (marginal) population function that averages over missing data
patterns.
regressions, and the solid line is the overall (marginal) population function that aver-
ages over missing data patterns, as follows:
$$\beta_0 = \pi^{(0)}\beta_0^{(0)} + \pi^{(1)}\left(\beta_0^{(0)} + \beta_0^{(diff)}\right) = \pi^{(0)}\beta_0^{(0)} + \pi^{(1)}\beta_0^{(1)} \quad (9.27)$$
$$\beta_1 = \pi^{(0)}\beta_1^{(0)} + \pi^{(1)}\left(\beta_1^{(0)} + \beta_1^{(diff)}\right) = \pi^{(0)}\beta_1^{(0)} + \pi^{(1)}\beta_1^{(1)}$$
Figure 9.7 depicts a systematic process where missing values are primarily located in
the upper left and lower right quadrants of the contour plot. An analysis based on the
conditionally MAR assumption is incapable of targeting these regions, as its imputations
would disperse around a single, biased regression line.
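The averaging in Equation 9.27 is simple arithmetic. A minimal Python sketch with hypothetical pattern-specific coefficients and a 25% missing data rate:

```python
def marginal_coefficients(pi0, b0_0, b1_0, b0_diff, b1_diff):
    """Weight pattern-specific intercepts and slopes by the pattern
    proportions to recover the marginal regression line (Equation 9.27)."""
    pi1 = 1.0 - pi0
    beta0 = pi0 * b0_0 + pi1 * (b0_0 + b0_diff)
    beta1 = pi0 * b1_0 + pi1 * (b1_0 + b1_diff)
    return beta0, beta1

# Hypothetical values: complete cases have intercept 20 and slope -0.30;
# the missing data pattern sits 4 points higher with a steeper slope.
b0, b1 = marginal_coefficients(pi0=0.75, b0_0=20.0, b1_0=-0.30,
                               b0_diff=4.0, b1_diff=-0.20)
```

Here the marginal line has intercept 21 and slope –0.35, a 75/25 blend of the two pattern-specific functions.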
The moderated regression model in Equation 9.26 is consistent with a diffuse
MNAR process where both analysis variables uniquely correlate with missingness (see
Figure 9.2b). In contrast, the model for a focused process (shown in Figure 9.2a) features
an inestimable pattern mean difference and a common regression slope.
$$Y_i = \beta_0^{(0)} + \beta_0^{(diff)} M_i + \beta_1 X_i + \varepsilon_i \quad (9.28)$$
FIGURE 9.8. Scatterplot of a focused MNAR process from a pattern mixture model where
participants with missing data form a subpopulation with higher depression scores. Gray circles
represent complete cases, and black crosshairs denote partial data records with missing depres-
sion scores. The dashed and dotted lines are the true pattern-specific regressions, and the solid
line is the overall (marginal) population function that averages over missing data patterns.
To illustrate, Figure 9.8 shows the scatterplot of an artificial data set where the 25% of
participants with missing data form a subpopulation with a much higher depression
average. As before, gray circles represent the complete cases, black crosshairs denote the
partial data records with missing depression scores, the dashed and dotted lines are the
true group-specific regressions, and the solid line is the overall (marginal) population
function. This model is equivalent in spirit to the old idea of filling in the data under an
MAR assumption and adding a constant to each impute (Rubin, 1987, p. 22; van Buuren,
2012, p. 92).
One way to choose a value for the inestimable mean difference is to link it to a standardized effect size:

$$\beta_0^{(diff)} = d \times \sigma_Y \quad \text{or} \quad d \times \sigma_\varepsilon \quad (9.29)$$
Continuing with the chronic pain example, if you believed that the missing depression
scores have a higher mean than the observed data, you could set the standardized dif-
ference to a value like +0.20 or +0.50 and solve for β0(diff). A preliminary analysis based
on the MAR assumption can provide an estimate of the standard deviation, or you could
use the residual standard deviation to constrain β0(diff) during estimation.
A similar strategy can provide a value for the inestimable slope difference in Equa-
tion 9.26. To begin, consider the situation where X is binary (e.g., 0 = placebo, 1 = medi-
cation), in which case, β1(0) is the group mean difference for participants with complete
data. The inestimable β1(diff) term is the additional mean difference for the M = 1 pattern.
Standardizing the net mean difference by dividing by the standard deviation or residual
standard deviation is equivalent to subtracting pattern-specific Cohen’s d values:
β1(diff) = dΔ × σY or dΔ × √σε2 (9.31)
For example, setting d Δ equal to –0.10 implies that the group mean difference for the
incomplete cases is smaller (more negative) by an amount equal to one-tenth of a stan-
dard deviation unit.
When X is a quantitative variable, β1(0) is the focal regression slope for participants
with complete data, and β1(diff) is the slope difference for the M = 1 pattern. Standardizing
the slope difference by dividing by the standard deviation or residual standard deviation
is equivalent to subtracting pattern-specific standardized coefficients or beta weights:
dΔ = βz(1) − βz(0) = (β1(0) + β1(diff))(σX/σY) − β1(0)(σX/σY) = β1(diff)(σX/σY) (9.32)
and solving for β1(diff) gives the following expression for the slope difference:
β1(diff) = dΔ(σY/σX) or dΔ(√σε2/σX) (9.33)
Applied to the chronic pain example, setting dΔ equal to +0.25 means that, among par-
ticipants with missing data, a one standard deviation increase in perceived control over
pain increases the predicted depression score by one-fourth of a standard deviation
more than that of the complete data.
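The algebra in Equations 9.32 and 9.33 is easy to verify numerically. The values below are hypothetical stand-ins for MAR-based estimates, not results from the chronic pain analysis:

```python
# Hypothetical MAR-based estimates
sigma_y = 6.0    # outcome (depression) standard deviation
sigma_x = 9.0    # predictor (perceived control) standard deviation
beta1_0 = -0.30  # complete-case slope

d_delta = 0.25   # presumed standardized slope difference for the M = 1 pattern

# Equation 9.33: convert the standardized difference to a raw slope difference
beta1_diff = d_delta * sigma_y / sigma_x

# Round trip through Equation 9.32: the pattern-specific beta weights
# should differ by exactly d_delta
bz_complete = beta1_0 * sigma_x / sigma_y
bz_missing = (beta1_0 + beta1_diff) * sigma_x / sigma_y
```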
Linking inestimable parameters to the standardized mean difference provides a
practical heuristic for specifying coefficients, but it is still incumbent on the researcher
to choose values that are reasonable for a given application. Moreover, it is incorrect
to view “small” values of d and dΔ as unimportant, as standardized differences of this
magnitude could be quite salient in many situations. For example, consider a random-
ized intervention where the true effect size is d = 0.25. Setting dΔ to 0.50 in Equation
9.31 (Cohen’s medium effect size benchmark) is equivalent to saying that the moderat-
ing impact of missing data is twice as large as the intervention effect itself. The medium
effect size threshold is probably an upper bound for many practical applications, and
much smaller values of dΔ could be realistic.
Important Assumptions
In contrast to the selection model, which requires strict and untestable distributional
assumptions, correctly specifying the missingness process is the primary barrier to get-
ting good results from a pattern mixture model. In practice, this means selecting the
correct set of interaction terms (the missing data indicator could moderate the influence
of any regressor) and providing accurate values for all inestimable parameters. Even
with simple heuristics for deriving these quantities, there is no way to verify whether
our choices increase or decrease nonresponse bias. While this may seem like a serious
disadvantage, methodologists have argued that the transparency of the model’s assump-
tions is actually its strength (Little, 2009). Rather than having to iteratively fit models
and scour computer output for subtle clues that may signal a misspecification or identi-
fication problem, you simply lay your statistical cards on the table and estimate a model
that reflects the presumed missingness process.
The pattern proportions come from an empty probit regression model for the missing data indicator,

Mi* = γ0 + ri (9.34)
ri ~ N1(0, 1)

and the predicted probability of missing data is the area above the threshold in a standard normal distribution.

Pr(Mi = 1) = 1 − Φ(−γ0) = Φ(γ0) (9.35)
This proportion and its complement determine the weights for computing marginal esti-
mates that average over the missing data patterns (see Equations 9.25 and 9.27).
Fitting a separate model to estimate the observed missing data rate might seem like
overkill, but doing so facilitates standard error computations. Returning to Equation
9.27, the coefficients of interest are functions of the pattern proportions and group-
specific intercepts and slopes. Similarly, the standard errors (or posterior standard devi-
ations) of β̂0 and β̂1 are composite functions that depend on the sampling variances
(or posterior variances) of the γ̂’s and the π̂’s. Some software packages that implement
maximum likelihood estimation offer facilities for computing auxiliary parameters that
are functions of the estimated model's parameters, and these programs use the multivariate
delta method (Raykov & Marcoulides, 2004) to combine the squared standard errors of
the proportions and pattern-specific coefficients into measures of uncertainty for β̂0 and
β̂1. Bayesian estimation software packages often offer similar functionality for defining
auxiliary parameters, the posterior distributions of which naturally reflect uncertainty
in their constituent parts.
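The pooling computation can be made concrete with a small sketch that applies Equation 9.27 and the multivariate delta method to hypothetical estimates of π, β0(0), and β0(diff). For simplicity it treats the three estimates as independent; real software would use their full covariance matrix:

```python
import numpy as np

# Hypothetical estimates and squared standard errors: pi, beta0(0), beta0(diff)
est = np.array([0.25, 4.50, 1.20])
var = np.array([0.0004, 0.0289, 0.0900])

def marginal_beta0(theta):
    """Equation 9.27: weight the pattern-specific intercepts by the proportions."""
    pi, b0, b0_diff = theta
    return (1 - pi) * b0 + pi * (b0 + b0_diff)

beta0_hat = marginal_beta0(est)

# Multivariate delta method: gradient of the composite function, then
# sqrt(g' V g) with a diagonal covariance matrix of the constituents
grad = np.array([est[2], 1.0, est[0]])  # d/dpi, d/dbeta0(0), d/dbeta0(diff)
se = np.sqrt(grad @ np.diag(var) @ grad)
```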
Absent software to do the heavy lifting, you would need to compute standard errors
by hand. Enders (2010, pp. 309–312) and Hedeker and Gibbons (1997, p. 74) show
worked examples of this process. Demirtas and Schafer (2003) suggest multiple imputa-
tion as an alternative to explicitly pooling over the missing data patterns. To implement
this procedure, you would use the pattern mixture model from Equation 9.26 to impute
the missing values (i.e., model-based multiple imputation), after which you would fit
the focal regression analysis to the filled-in data. The second-stage analysis does not
require the missing data indicators, because the MNAR process is already embedded in
the imputations, and applying Rubin’s (1987) rules to the imputation-specific estimates
of β0 and β1 gives pooled values that average over the missingness process.
374 Applied Missing Data Analysis
Sensitivity Analyses
Fitting selection models requires a researcher to proactively search for a model that has sup-
port from the data, while looking for subtle clues that signal a misspecification or identifica-
tion problem. Pattern mixture models are very different, because they are easy to estimate
once you supply values for the inestimable parameters (or impose comparable constraints).
The main challenge is supplying accurate information about the unknown coefficients, as
the validity of the results hinges on a correct (or approximately so) specification. Although
it might be possible to formulate directional hypotheses about group mean differences in
some situations (e.g., participants with missing data have a higher depression average),
selecting specific values for the inestimable parameters is very challenging.
Rather than trying to select the optimal quantities for the inestimable parameters,
you can instead conduct a sensitivity analysis that considers a range of plausible coef-
ficient values. Little (2009, p. 428) recommends inducing distributional differences of
± 0.20 or 0.50 residual standard deviation units, and earlier formulas facilitate this strat-
egy. You could also vary the strength of the MNAR process across an entire response
surface by refitting a model with d and dΔ values between –0.50 and +0.50. Tracking
changes to the focal model parameters can answer two important practical questions:
“How much does the missing data distribution need to differ to meaningfully change
MAR-based estimates?” and “How big a difference is needed to change significance test
results?” The analysis examples in the next section illustrate such a procedure.
I use the psychiatric trial data on the companion website to illustrate pattern mixture
models for multiple regression. Equation 9.16 shows the focal analysis model. Reducing
a potentially large number of missing data patterns into a manageable number of groups
is often the starting point for these models, as specifying many pattern-specific differ-
ences is both unwieldy and unnecessary. Table 9.3 shows the nine missing data pat-
terns for the repeated outcome variable, with O and M indicating observed and missing
values, respectively. Following Hedeker and Gibbons (1997), I classify participants as
“completers” or “dropouts” based on the presence or absence of an illness severity rating
at the 6-week follow-up. The completer pattern combines a large group of participants
with fully observed data (Pattern 1) and several smaller groups with intermittent miss-
ing values (Patterns 5, 6, 7, and 9), and the dropout pattern primarily reflects monotone
missingness where participants permanently leave the study (Patterns 2, 3, 4, and 8
in bold typeface). Pattern reduction decisions are important and potentially impact-
ful, because they define the qualitatively different subpopulations that presumably have
unique parameter values. The earlier analysis examples showed that conditioning on the
1-week and 3-week follow-up ratings was important, so I again leveraged these interme-
diate assessments as auxiliary variables. Following Equation 9.17, I used a pair of linear
regression models to link the auxiliary variables to the focal variables, and these equa-
tions also included the dropout indicator.
TABLE 9.3. Missing Data Patterns from the Schizophrenia Trial Data
Pattern % sample Baseline Week 1 Week 3 Week 6
1 71.40 O O O O
2 0.69 O M M M
3 10.30 O O M M
4 12.13 O O O M
5 0.69 M O O O
6 1.14 O M O O
7 2.97 O O M O
8 0.23 O M O M
9 0.46 O M M O
Note. Rows in bold typeface (Patterns 2, 3, 4, and 8) are "dropouts"; all other patterns are "completers."
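Deriving the binary completer/dropout classification from a response-indicator matrix is a one-liner; the records below are hypothetical and mirror only a few of the patterns in Table 9.3:

```python
import numpy as np

# Rows are participants; columns are baseline, week 1, week 3, and week 6
# (1 = observed, 0 = missing). Hypothetical records spanning several patterns.
R = np.array([
    [1, 1, 1, 1],  # Pattern 1: fully observed
    [1, 1, 0, 0],  # Pattern 3: monotone dropout
    [1, 1, 1, 0],  # Pattern 4: monotone dropout
    [1, 0, 1, 1],  # Pattern 6: intermittent missingness
])

# Following Hedeker and Gibbons (1997), classification depends only on the
# presence or absence of the week 6 assessment
dropout = (R[:, -1] == 0).astype(int)
```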
Considering a range of plausible effects allows you to determine how large the
pattern-specific differences need to be to alter significance tests or meaningfully change
estimates from a conditionally MAR analysis. Table 9.4 shows that the dropout group’s
distribution would need to differ by at least ±0.30 standard deviation units to alter the
MAR-based estimate of β0 by half a standard error unit, and a difference of at least ±0.50
standard deviations is necessary to change this parameter by an entire standard error.
A change of one standard error is quite large, because it implies that the missing data
process affects estimates by an amount equal to the expected sampling error. You may
recall that the diffuse selection model for these data induced similarly large changes to
some parameters.
Consistent with the previous analysis, centering baseline severity scores and the gender
dummy code defines β0(0) as the placebo group mean for the completers (DRUG = 0 and
M = 0) and β0(diff) as the inestimable mean difference for the dropout pattern. The β1(0)
coefficient represents the medication group mean difference for completers, and β1(diff) is
the additional medication effect among participants who dropped out.
I used Equation 9.29 to obtain β0(diff) coefficients for standardized effect size values
between d = –0.50 and +0.50 in increments of 0.10, and I used Equation 9.31 to obtain
β1(diff) coefficients for the same range of effects. This process created 121 unique param-
eter combinations that varied the strength and direction of the MNAR process across a
broad range of values, not all of which may be plausible. I estimated a pattern mixture
model for each combination of coefficients and computed focal model parameters that
average over the missing data patterns (see Equation 9.27). The empty probit model
from Equation 9.34 and the corresponding probability function in Equation 9.35 again
provided the pattern proportions and standard errors for pooling. Analysis scripts are
available on the companion website.
Figure 9.9 is a heat map that summarizes changes to the focal model’s intercept coef-
ficient (the placebo group mean, β0) across the 121 effect size combinations. The circle
in the middle of the plot represents a conditionally MAR mechanism where both effect
sizes equal zero, and the colors get progressively darker as the change to the pooled esti-
mate increases. The white squares denote estimates that differ by less than one-fourth of
a standard error from the MAR analysis, whereas black squares are estimates that differ
by more than one standard error. Broadly speaking, the greatest distortions occur when
the dropout group’s distribution differs by at least ±0.40 standard deviation units along
[Figure 9.9: heat map with the standardized pattern mean difference on the horizontal axis and the standardized pattern slope difference (–0.6 to 0.6) on the vertical axis.]
FIGURE 9.9. Heat map summarizing changes to the focal model’s intercept coefficient across
the 121 effect size combinations. The circle in the middle of the plot represents a conditionally
MAR mechanism, and the colors get progressively darker as the difference between the pattern
mixture model and MAR-based estimates gets larger. The white squares denote estimates that
differ by less than one-fourth of a standard error, whereas black squares are estimates that dif-
fer by more than one standard error. Broadly speaking, the greatest distortions occur when the
dropout group’s distribution differs by at least ±0.40 standard deviation units along the horizon-
tal axis.
the horizontal axis, in which case β0 differs from the MAR-based estimate by at least
half a standard error unit (i.e., most of the boxes are dark gray to black). A similar heat
map for the β1 coefficient revealed that except for the dΔ = ±0.50 conditions, the MNAR
slope coefficient always differed from its MAR-based counterpart by less than half a
standard error unit. Moreover, the medication group difference was always statistically
significant, indicating that even a very strong MNAR process was incapable of altering
the substantive conclusion that the medication was beneficial.
Figure 9.9 also provides information about specific processes that might be plausi-
ble for this application. For example, consider the possibility that placebo group partici-
pants with the most acute symptoms (and highest severity scores) leave the study to seek
treatment elsewhere, whereas medication group participants with the mildest symptoms
(and lowest severity scores) leave the study, because they no longer feel treatment is necessary. This scenario corresponds to positive values along the horizontal axis (placebo
group participants with missing data have a higher mean) and negative values along the
vertical axis (medication group participants with missing data have a lower mean than
those with complete data). To illustrate this scenario in more detail, Table 9.5 shows
the pattern-specific and marginal coefficients from the vertical slice of squares located
at +0.30 on the horizontal axis (i.e., the placebo group mean for the dropout pattern is
higher by three-tenths of a standard deviation). For comparison, the MAR analysis pro-
duced intercept and slope estimates of β̂0 = 4.49 (SE = 0.17) and β̂1 = –1.50 (SE = 0.18),
and MNAR estimates that differ by more than half a standard error are highlighted with
bold typeface.
Summary
Compared to selection models, pattern mixtures provide a very different vehicle for con-
ducting sensitivity analyses. The examples illustrated a process that varied the type,
direction, and strength of the missingness process across a broad range of "what if" scenarios. I used these results to answer two useful practical questions: "How much does the missing data distribution need to differ to meaningfully change MAR-based estimates?" and "How big a difference is needed to change significance test results?"
Yti = β0 + β1MONTHti + b0i + b1iMONTHti + εti (9.38)

where Yti is the outcome (e.g., depression) score at occasion t for individual i, β0 is the
predicted baseline average when MONTH = 0, and β1 is the average monthly change
rate. Everyone has a unique linear trajectory, and the b0i and b1i terms are latent vari-
ables or random effects that reflect deviations between the individual intercepts and
slopes and the corresponding average coefficients. Finally, εti is a time-specific residual
that captures the difference between an individual’s own linear trajectory and his or
her repeated measurements. By assumption, the individual intercepts and slopes are
bivariate normal with a covariance matrix Σb, and the within-person residuals are nor-
mally distributed around the individual trajectories with constant variance σε2. Adding a
between-person (time-invariant) predictor of the individual intercepts and slopes (e.g., a
treatment assignment indicator) gives the following model with a so-called “cross-level
interaction effect”:
Yti = β0 + β1MONTHti + β2Xi + β3XiMONTHti + b0i + b1iMONTHti + εti (9.39)

In this model, β0 and β1 give the predicted baseline score and average monthly change rate
for a person with X = 0 (e.g., the control group if X is binary), and β2 and β3 are intercept
and slope differences.
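To make the model concrete, the sketch below simulates data from a linear growth model with a cross-level interaction. All parameter values are hypothetical choices, not estimates from any data set discussed here:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
months = np.array([0.0, 1.0, 2.0])

beta0, beta1 = 16.0, -1.2  # intercept and change rate when X = 0
beta2, beta3 = 0.0, -0.8   # intercept and slope differences for X = 1
Sigma_b = np.array([[4.0, -0.5],
                    [-0.5, 0.25]])  # covariance of the random effects
sigma_eps = 2.0

X = rng.integers(0, 2, size=n)                       # binary person-level predictor
b = rng.multivariate_normal([0.0, 0.0], Sigma_b, n)  # individual deviations

# Each person's linear trajectory plus a time-specific residual
Y = (beta0 + beta2 * X[:, None]
     + (beta1 + beta3 * X[:, None]) * months[None, :]
     + b[:, [0]] + b[:, [1]] * months[None, :]
     + rng.normal(0.0, sigma_eps, size=(n, 3)))
```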
The same longitudinal analysis can be specified as a latent curve model in the struc-
tural equation modeling framework. This specification views repeated measurements
as manifest indicators of an intercept and slope latent factor with fixed loadings that
encode the passage of time. To illustrate, Figure 9.10 shows a path diagram of the linear
growth model with a person-level predictor of growth. Consistent with standard path
diagram conventions, ellipses represent latent variables, rectangles denote manifest (i.e.,
measured) variables, single-headed straight arrows symbolize regression coefficients,
and double-headed curved arrows are correlations. The latent variables represent the
individual intercepts and slopes, and the factor means (not shown in the diagram) cor-
respond to the β0 and β1 coefficients from the multilevel model. The unit factor loadings
connecting the intercept latent factor to the repeated measurements reflect the constant
influence of this term at each time point, and the fixed slope factor loadings encode
the monthly time scores (i.e., MONTH = 0, 1, 2). While the latent curve version of the
analysis requires a different data structure and model specification, it is equivalent to
the multilevel model and gives identical estimates. Interested readers can consult work
by Mehta and colleagues (Mehta & Neale, 2005; Mehta & West, 2000) for a synopsis of
these linkages.
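The equivalence is easy to verify numerically: multiplying the fixed loading matrix by hypothetical factor means reproduces the average linear trajectory from the multilevel specification.

```python
import numpy as np

# Fixed loadings: a unit column for the intercept factor and the monthly
# time scores (0, 1, 2) for the slope factor
Lambda = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])

factor_means = np.array([16.0, -1.2])  # hypothetical beta0 and beta1

# Model-implied means of Y1-Y3 fall on the average linear trajectory
implied_means = Lambda @ factor_means
```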
[Figure 9.10: intercept and slope latent factors with fixed loadings of 1, 1, 1 and 0, 1, 2 connecting them to Y1, Y2, Y3.]

FIGURE 9.10. Path diagram depicting a linear growth model with a person-level predictor of growth. Ellipses represent latent variables, rectangles denote manifest (i.e., measured) variables, single-headed straight arrows symbolize regression coefficients, and double-headed curved arrows are correlations.
Types of Missingness
Little (1995) described two distinct types of MNAR data in longitudinal settings.
Outcome-dependent missingness occurs when the unseen value of the dependent vari-
able at a particular measurement occasion predicts concurrent nonresponse. Applied
to the depression example, this type of missing data could occur if a sudden spike in
depressive symptoms at the 1-month (or 2-month) follow-up prompts some partici-
pants to quit the study and seek treatment elsewhere. In contrast, random coefficient-
dependent missingness occurs when one’s underlying growth trajectory is responsible
for missing data rather than time-specific realizations of the dependent variable. For
example, participants experiencing the most rapid declines in depressive symptoms
might quit the study, because they judge that treatment is no longer necessary, whereas
individuals with elevated and flatter trajectories might drop out to seek treatment else-
where. This type of missingness leverages one’s entire score history, including unseen
future measurements that could potentially relate to missingness if the outcome is mea-
sured with error (Demirtas & Schafer, 2003). These processes call for different models,
and both could be plausible for the same analysis.
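The two processes can be contrasted in a small simulation. All coefficients below are arbitrary illustrative choices; the probit link simply mirrors the selection equations used elsewhere in the chapter:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 1000

b1 = rng.normal(-1.2, 0.5, size=n)            # individual change rates
y1 = 16.0 + b1 * 1.0 + rng.normal(0, 2.0, n)  # month-1 score: trajectory + error

# Outcome-dependent missingness: dropout probability rises with the
# time-specific realization of the outcome
p_outcome = norm.cdf(-1.0 + 0.15 * (y1 - y1.mean()))

# Random coefficient-dependent missingness: the steepest decliners
# (most negative slopes) are the most likely to quit
p_coef = norm.cdf(-1.0 - 1.5 * (b1 - b1.mean()))

m_outcome = rng.binomial(1, p_outcome)
m_coef = rng.binomial(1, p_coef)
```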
Indicator variables that encode different types of missingness are also a possibility (Albert, Follmann, Wang, & Suh, 2002). Little (1995) discusses some of these alternatives.
The Diggle–Kenward selection model (Diggle & Kenward, 1994) for outcome-dependent
missing data links missingness at an occasion t to the concurrent unseen score values
and observed data from prior occasions. Like its siblings from earlier in the chapter, the
model augments the focal analysis (a growth curve model) with additional regression
equations that explain missingness. To illustrate, Figure 9.11 shows a path diagram of
a three-wave structural equation model that includes a between-person predictor of the
individual intercepts and slopes. The rectangles labeled M2 and M3 are binary discrete-
time survival indicators that code dropout at the 1- and 2-month follow-up assessments,
respectively (M1 is not needed, because baseline scores are complete). The dashed lines
are logistic or probit regressions that link the probability of dropout at wave t to the out-
come scores at that occasion, and the dotted paths add a MAR component that connects
dropout to scores at the prior occasion.
The diagram’s missingness model corresponds to the following regression equa-
tions:
Diggle and Kenward (1994) used logistic selection models, but I use probit models for
consistency with earlier material. The equations feature occasion-specific intercepts that
allow dropout probabilities to vary over time, but they impose equality constraints on the
concurrent and lagged effects (i.e., the γ1 and γ2 coefficients take on the same values in both equations).
Estimating time-specific effects would introduce complex outcome-by-occasion interac-
tions that may be difficult to estimate. This model reflects a diffuse MNAR process where
the observed data from prior occasions uniquely predict missingness beyond the unseen
score values. The path diagram shows that X’s influence on missingness is transmitted
indirectly via the latent variables and repeated measurements, which effectively function
as mediators. An even more diffuse model would include direct pathways between the
predictor and dropout indicators, and a model for a focused process would omit the lagged
effects from the prior measurement occasion. Consistent with previous selection models,
significance tests of the γ coefficients are not trustworthy and do not provide a way to eval-
uate the missingness mechanism (Jansen et al., 2006; Molenberghs & Kenward, 2007).
Specifying the Diggle–Kenward model in the multilevel framework requires soft-
ware that can estimate multilevel path models with categorical and continuous out-
comes (Keller & Enders, 2021; Muthén & Muthén, 1998–2017). The data structure for a
multilevel growth curve model features the repeated measurements stacked in a single
column. Table 9.6 shows the data setup for two hypothetical participants from a three-
[Figure 9.11: intercept and slope latent factors with fixed loadings of 1, 1, 1 and 0, 1, 2 connecting them to Y1, Y2, Y3, plus dropout indicators M2 and M3.]
FIGURE 9.11. Path diagram depicting the Diggle–Kenward growth model. The rectangles
labeled M2 and M3 are binary missing data indicators, dashed lines are probit regressions that
link the probability of dropout at wave t to the outcome scores at that occasion, and the dotted
paths add an MAR component that connects dropout to scores at the prior occasion.
wave study. The corresponding selection equation features a column of dropout indi-
cators regressed on time-specific dummy codes, the dependent variable, and a lagged
version of the outcome, as follows:
Mti* = γ01T1ti + γ02T2ti + γ03T3ti + γ1Yti + γ2YLti + rti
rti ~ N1(0, 1)
The Tti terms are binary variables that code the three measurement occasions, such that
Tt equals 1 in rows that correspond to measurement occasion t, and 0 otherwise. These
variables essentially function like on–off switches that initiate occasion-specific inter-
cept coefficients capturing time-related changes to the missing data rates. If the baseline
scores are complete or nearly so, fixing γ01 at a large negative z-value during estimation
induces a zero missingness probability. Finally, Yli is a lagged version of the dependent
variable that pairs each Yti with the participant’s score from the prior measurement occa-
sion. This variable is always missing at the baseline measurement.
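The stacking and lagging steps can be sketched with pandas. The two participants below are hypothetical and mirror only the structure of Table 9.6, not its values:

```python
import numpy as np
import pandas as pd

# Wide-format scores for two hypothetical participants; np.nan marks a
# missing value (participant 2 drops out at the final wave)
wide = pd.DataFrame({"id": [1, 2],
                     "y1": [5.0, 6.0],
                     "y2": [4.0, 5.5],
                     "y3": [3.0, np.nan]})

long = wide.melt(id_vars="id", var_name="wave", value_name="y")
long["wave"] = long["wave"].str[1].astype(int)
long = long.sort_values(["id", "wave"]).reset_index(drop=True)

# On-off switches that turn on the occasion-specific intercepts
for t in (1, 2, 3):
    long[f"T{t}"] = (long["wave"] == t).astype(int)

# Lagged outcome: each row is paired with the prior occasion's score;
# always missing at baseline
long["y_lag"] = long.groupby("id")["y"].shift(1)
long["m"] = long["y"].isna().astype(int)  # dropout indicator
```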
where γ02 and γ03 are occasion-specific intercepts that allow the missingness rates to vary
across time, and γ1 and γ2 are shared slope coefficients (i.e., parameters with equality
constraints). The equivalent multilevel selection equation is as follows:
Mti* = γ02T2ti + γ03T3ti + γ1b0i + γ2b1i + rti
rti ~ N1(0, 1)
[Figure 9.12: intercept and slope latent factors with fixed loadings of 1, 1, 1 and 0, 1, 2 connecting them to Y1, Y2, Y3, and probit paths from the factors to the dropout indicators M2 and M3.]
FIGURE 9.12. Path diagram depicting the shared parameter growth model. The rectangles
labeled M2 and M3 are binary missing data indicators, dashed lines are probit regressions that
link the probability of dropout to the random intercepts, and the dotted paths connect dropout
to the individual slopes.
f(Yi, Mi, bi) = f(Yi | bi, Mi) × f(Mi | bi) × f(bi) (9.44)
where Y represents the repeated measurements, M contains the missing data indica-
tors, and b denotes the intercepts and slopes (the b0i and b1i terms in Equation 9.38). By
assumption, the repeated measurements and missing data indicators are conditionally
independent after controlling for the random intercepts and slopes, which simplifies the
factorization as follows:
f(Yi | bi) × f(Mi | bi) × f(bi) (9.45)
The shared parameter b plays the key role by absorbing the MNAR linkage between the
outcomes and indicators. This feature is evident in Figure 9.12, where the repeated mea-
surements and indicators are connected (correlated), because they share the intercept
and slope latent factors as common predictors. Methodologists have described several
variants of this approach that instead use latent class membership as a shared parameter
(Beunckens, Molenberghs, Verbeke, & Mallinckrodt, 2008; Gottfredson, Bauer, & Bald-
win, 2014; Muthén et al., 2011; Roy, 2003).
Hedeker and Gibbons (1997) describe a random coefficient pattern mixture model
where missing data patterns form qualitatively different subgroups with distinct growth
trajectories but common random effect parameters. Their model casts missing data indi-
cators as dummy codes that moderate the influence of one or more explanatory vari-
ables on the outcome.
Returning to the depression scenario, there are three dropout patterns: partici-
pants who (1) completed the study, (2) quit following the baseline assessment, and (3)
dropped out prior to the final assessment. The simplest incarnation of the pattern mix-
ture model uses a single binary indicator that classifies participants simply as “com-
pleters” or “dropouts” based on the presence or absence of data at the final measurement
occasion. Using generic notation, the fitted growth curve model is

Yti = β0(0) + β0(diff)Mi + β1(0)MONTHti + β1(diff)MiMONTHti + b0i + b1iMONTHti + εti
where β0(0) and β1(0) are the intercept and slope (e.g., baseline average and monthly change
rate) for the completers in pattern M = 0, and β0(diff) and β1(diff) capture the difference in
the dropout group’s intercept and slope coefficients, respectively. The weighted averages
from Equation 9.27 give the overall (marginal) population estimates of β0 and β1.
Adding a between-person (time-invariant) predictor of the individual intercepts
and slopes (e.g., a treatment assignment indicator) gives the following model:
[Figure 9.13: X, M, and their interaction predicting the intercept and slope latent factors, with fixed loadings of 1, 1, 1 and 0, 1, 2 connecting the factors to Y1, Y2, Y3.]
FIGURE 9.13. Path diagram depicting the random coefficient pattern mixture model, where
the rectangle labeled M is the binary dropout indicator. The structural equation model features
the missing data indicator, covariate, and the interaction of the two predicting the intercept and
slope growth factors.
The terms not involving M are the completer group’s parameters (these quantities have
the same interpretation as the βs from Equation 9.39), and terms that include M reflect
coefficient differences for the dropout group. Figure 9.13 shows a path diagram of the
corresponding structural equation model. Following Figure 9.2, the dashed line indi-
cates an interaction effect where the missing data indicator moderates the influence of
the predictor.
β0 = π(1)β0(0) + π(2)(β0(0) + β0(diff1)) + π(3)(β0(0) + β0(diff2)) (9.49)
= π(1)β0(1) + π(2)β0(2) + π(3)β0(3)

β1 = π(1)β1(0) + π(2)(β1(0) + β1(diff1)) + π(3)(β1(0) + β1(diff2))
= π(1)β1(1) + π(2)β1(2) + π(3)β1(3)
quit prior to the final follow-up (the β1(diff2) coefficient). However, the growth rate differ-
ence for the early dropouts (M2 = 1) is not estimable from a single observation. Given a
suitable estimate of the standard deviation (e.g., an MAR-based estimate of the baseline
standard deviation or the within-cluster residual standard deviation), Equation 9.31 (or
Equation 9.33) can provide a value for the inestimable β1(diff1) coefficient. In this context,
dΔ can be viewed as the standardized mean difference that results from a one-unit incre-
ment in the temporal predictor (e.g., one additional month in the study). Returning to
the hypothetical depression scenario, suppose that the baseline standard deviation from
a preliminary analysis was σ̂Y = 6. Furthermore, suppose that the completer group’s
depression scores improve (decrease) by one-fifth of a standard deviation per month,
on average (i.e., β1(0) = –1.20). Specifying dΔ = +0.10 means that every additional month
in the study changes the early dropout group’s mean by a value that is one-tenth of a
standard deviation higher (more positive) than that of the completers. This specification
induces a pattern-specific growth rate that halves the complete-case change rate (i.e.,
β1(2) = β1(0) + β1(diff1) = –1.20 + (0.10 × 6) = –0.60).
So-called identifying restrictions that replace inestimable parameters with esti-
mates from another pattern are an alternative to an effect-size-based specification. Three
such restrictions—the complete-case, neighboring-case, and available-case identifying
restrictions—have received considerable attention in the literature (Demirtas & Schafer,
2003; Molenberghs, Michiels, Kenward, & Diggle, 1998; Thijs, Molenberghs, Michiels,
Verbeke, & Curran, 2002; Verbeke & Molenberghs, 2000). As its name implies, the
complete-case restriction sets any inestimable parameters equal to those of the com-
pleter group. Applied to the depression example, this restriction sets β1(diff1) equal to
0, such that participants who quit after the baseline assessment follow the same linear
trajectory as the people who complete the study (albeit with a different baseline aver-
age). The neighboring-case restriction instead borrows a coefficient from the nearest
group for which an effect is estimable. Participants who quit prior to the final assess-
ment are the early dropout group’s nearest neighbors, so this strategy sets β1(diff1) equal
to β1(diff2). Finally, the available-case restriction sets the inestimable coefficient equal
to the weighted average across all patterns for which an effect is estimable. I illustrate
some of these restrictions in the next section, and Demirtas and Schafer (2003) describe
a detailed sensitivity analysis for the same psychiatric trial data.
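With hypothetical pattern proportions and coefficients, the three restrictions reduce to simple arithmetic. The numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical pattern proportions and estimable slope differences
# (relative to the completers, whose difference is zero by definition)
pi = {"complete": 0.70, "late": 0.25, "early": 0.05}
beta1_diff = {"complete": 0.0, "late": 0.45}

# Complete-case restriction: borrow the completers' value
ccr = beta1_diff["complete"]

# Neighboring-case restriction: borrow from the nearest pattern with an
# estimable effect (the late dropouts)
ncr = beta1_diff["late"]

# Available-case restriction: proportion-weighted average over the
# patterns where the effect is estimable
w = np.array([pi["complete"], pi["late"]])
acr = float(w @ np.array([beta1_diff["complete"], beta1_diff["late"]]) / w.sum())
```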
I use the psychiatric trial data on the companion website to illustrate a sensitivity
analysis for a longitudinal growth curve model with an outcome that could be MNAR.
The data, which were collected as part of the National Institute of Mental Health
Schizophrenia Collaborative Study, comprise four illness severity ratings, measured
in half-point increments ranging from 1 (normal, not at all ill) to 7 (among the most
extremely ill). In the original study, the 437 participants were assigned to one of four
experimental conditions (a placebo condition and three drug regimens), but the data
collapse these categories into a dichotomous treatment indicator (DRUG = 0 for the
placebo group, and DRUG = 1 for the combined medication group). The researchers
The β0 and β1 coefficients are the placebo group’s baseline average and linear change
rate, respectively, β2 is the medication group’s baseline difference, and β3 is the slope dif-
ference. By assumption, the individual intercepts and slopes are bivariate normal with a
covariance matrix Σb, and the within-person residuals are normally distributed around
the individual trajectories with constant variance σε2.
I used maximum likelihood estimation for the examples, and the companion web-
site also provides scripts for Bayesian estimation and model-based multiple imputa-
tion. Longitudinal selection and shared parameter models can be difficult to estimate,
because the observed data often contain very little information about the missingness
model. Monitoring convergence is especially important in this context, as invalid solu-
tions are common. For the maximum likelihood analyses, I refit the model with several
sets of random starting values and examined the final log-likelihood values to verify
that different runs achieved the same solution. For a Bayesian analysis, specifying mul-
tiple MCMC chains with different starting values provides analogous information along
with convergence diagnostics such as trace plots and the potential scale reduction factor
diagnostic (Gelman & Rubin, 1992). It is also important to check whether the MNAR
results are unduly influenced by unusual data records, and I use individual influence
diagnostics for this purpose (Sterba & Gottfredson, 2014).
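The convergence checks described above can be illustrated with a small computation. The sketch below implements the potential scale reduction factor for a single parameter and applies it to simulated chains; the data are artificial, and the function is a simplified version of the diagnostic that Bayesian software reports.

```python
import random
from statistics import mean, variance

def psrf(chains):
    """Potential scale reduction factor (Gelman & Rubin, 1992) for one
    scalar parameter, given several chains of equal length."""
    n = len(chains[0])
    within = mean(variance(chain) for chain in chains)         # W
    between = n * variance([mean(chain) for chain in chains])  # B
    var_hat = (n - 1) / n * within + between / n               # pooled variance
    return (var_hat / within) ** 0.5

random.seed(1)
# Four chains sampling the same target vs. four chains stuck in different regions
converged = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
diverged = [[draw + 3.0 * i for draw in chain] for i, chain in enumerate(converged)]

print(round(psrf(converged), 2))  # near 1.0 signals convergence
print(round(psrf(diverged), 2))   # well above 1.0 signals nonconvergence
```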
390 APPLIED MISSING DATA ANALYSIS
[Figure 9.14 appears here: illness severity (1–7) plotted against time for the placebo and medication groups, with tick marks at 0–6 weeks since baseline.]
FIGURE 9.14. Illness severity means for the placebo and medication condition. The black
squares and circles show the means on the transformed time scale, and the white squares and
circles reflect time in weeks. The total change for both groups is the same, but the transformation
compresses elapsed time after the 1-week follow-up.
9.11, these regressions link the latent response variable at occasion t to the unseen score
values at that measurement and the observed data from prior occasions.
The regressions feature occasion-specific intercepts that allow the conditional probabil-
ity of missing data to vary over time, but they place equality constraints on the concur-
rent and lagged effects (i.e., the γ1 and γ2 pairs share the same value). As noted earlier,
estimating time-specific effects introduces complex outcome-by-occasion interactions
that may be difficult to estimate.
Equation 9.51 is consistent with a diffuse MNAR process where observed scores
from the prior occasion uniquely predict missingness above and beyond the unseen
score values (i.e., M depends on both Y(mis) and Y(obs)). The AIC and BIC indicated sub-
stantial support for this model relative to a focused one that omitted the lagged effect
(e.g., BIC = 5223.09 vs. 5242.88). I also considered a more diffuse process that included
treatment assignment as a predictor of missingness, but adding that regressor had a neg-
ligible impact on the estimates (the AIC and BIC offered modest support for the simpler
model).
Table 9.7 shows parameter estimates and standard errors from the Diggle–Kenward
model along with those of an analysis that assumes a conditionally MAR process. The
MAR-based analysis featured an empty selection model as a device for equating the
metric of its log-likelihood values (and thus AIC and BIC) to that of the MNAR analysis
(Muthén et al., 2011; Sterba & Gottfredson, 2014). The Diggle–Kenward model pre-
dicted a substantially slower (less negative) growth rate for the placebo group (β̂1 = –0.15
vs. –0.35), and it also showed a larger (more negative) slope difference for the medica-
tion condition (β̂3 = –0.70 vs. –0.63). I judge these changes to be quite large, as the β1
coefficients differ by nearly three standard error units, and the β3 coefficients differ by
approximately one standard error. To further illustrate the analysis results, Figure 9.15
shows the average growth curves from the MAR analysis as solid lines, and it depicts the
Diggle–Kenward trajectories as dashed lines. The figure highlights that the two analyses
made different predictions for both groups, although the direction of the effects was
consistent (as were significance tests of β̂1 and β̂3).
Looking at the fit information near the bottom of Table 9.7, both the AIC and BIC
favored the Diggle–Kenward model over the MAR analysis (ΔAIC = AICMAR – AICMNAR
= 21.13 and ΔBIC = BICMAR – BICMNAR = 12.97). Conditional on the validity of the Diggle–
Kenward model, this ΔBIC represents “very strong” evidence (Raftery, 1995) of MNAR
dropout. I further used the influence diagnostics from Equation 9.15 to identify indi-
vidual data records that unduly impact this conclusion (Sterba & Gottfredson, 2014). An
index plot like the one in Figure 9.6 revealed no such outliers, thus lending credence to
the conclusion that the Diggle–Kenward model is plausible for these data.
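The ΔBIC comparison can be wrapped in a small helper. The sketch below reproduces the arithmetic from this example and attaches Raftery's (1995) verbal labels; the function name is illustrative.

```python
def delta_bic_evidence(bic_reference, bic_alternative):
    """Compute delta-BIC = BIC(reference) - BIC(alternative); positive values
    favor the alternative model. Verbal labels follow Raftery's (1995)
    guidelines: 0-2 weak, 2-6 positive, 6-10 strong, > 10 very strong."""
    delta = bic_reference - bic_alternative
    magnitude = abs(delta)
    if magnitude <= 2:
        label = "weak"
    elif magnitude <= 6:
        label = "positive"
    elif magnitude <= 10:
        label = "strong"
    else:
        label = "very strong"
    return delta, label

# BIC values from the MAR vs. Diggle-Kenward comparison in the text
delta, label = delta_bic_evidence(5238.04, 5225.07)
print(round(delta, 2), label)  # 12.97 very strong
```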
[Table 9.7 excerpt; columns give the estimate and standard error for the MAR analysis, the Diggle–Kenward selection model, and the shared parameter model.]

Missingness model              MAR            Diggle–Kenward    Shared parameter
Intercept Week 3           –1.23   0.08       –2.58   0.56       –1.72   1.01
Intercept Week 6           –1.09   0.08       –2.16   0.49       –1.56   0.99
SEVERITYt                    —      —          1.18   0.23         —      —
SEVERITYt–1                  —      —         –0.98   0.18         —      —
Random intercepts (b0)       —      —           —      —           0.11   0.17
Random slopes (b1)           —      —           —      —          –0.44   0.24
DRUG                         —      —           —      —          –0.70   0.21
Model fit
AIC (no. of parameters)    5197.24 (10)       5176.11 (12)       5189.31 (13)
BIC (no. of parameters)    5238.04 (10)       5225.07 (12)       5242.35 (13)
The AIC and BIC supported the more diffuse model with treatment condition predict-
ing missingness (e.g., BIC = 5242.35 vs. 5248.74 for the simpler model with only random
effects predicting dropout).
[Figure 9.15 appears here: illness severity (1–7) trajectories for the placebo and medication groups.]
FIGURE 9.15. Solid lines depict the average growth curves from the MAR analysis, and the
dashed lines are the Diggle–Kenward trajectories. Both analyses show that the medication condi-
tion achieved greater gains than the placebo group, but the models make different predictions
about the means.
The rightmost columns in Table 9.7 show the parameter estimates and standard
errors. The estimates are effectively equivalent to those of the conditionally MAR model,
and the average growth trajectories are indistinguishable from the solid lines in Fig-
ure 9.15. The negative γ2 and γ3 coefficients suggest that individuals who experience
the steepest declines (i.e., lowest, or most negative random slopes) and control group
participants are more likely to quit the study. Turning to the fit information near the
bottom of the table, the AIC and BIC preferred the Diggle–Kenward model, but they
disagreed about the shared parameter model and MAR analysis; the AIC favored the former
(ΔAIC = AICMAR – AICMNAR = 7.93), whereas the BIC selected the latter (ΔBIC = BICMAR –
BICMNAR = –4.31). Discrepant information criteria are not a problem in this case, because
the analyses in question produced equivalent estimates. The individual influence diag-
nostics revealed no outliers that unduly affected these model comparisons.
A simple strategy defines the missing data indicator by the presence or absence of data at the final measurement occasion. Returning to Table 9.3,
this setup collapses the four patterns with missing 6-week follow-up scores (Patterns 2,
3, 4, and 8) into a single group coded M = 1, and it combines participants with complete
data and intermittent missing values (Patterns 1, 5, 6, 7, and 9) into a group coded M = 0.
This coding scheme is a good starting point, because there are no inestimable parameters.
The fitted growth curve model casts the missing data indicator as a dummy code
that moderates the influence of one or more explanatory variables on the outcome.
The β0(0) to β3(0) coefficients are the growth model parameters for the completer pattern
with M = 0, and β0(diff) to β3(diff) give the amount by which these coefficients differ in the
dropout group. The model resembles the path diagram in Figure 9.13 but has different
time scores (slope factor loadings). The overall population-level parameters are again
weighted averages over the missing data patterns as follows:
\[
\begin{aligned}
\beta_0 &= \pi^{(0)}\beta_0^{(0)} + \pi^{(1)}\left(\beta_0^{(0)} + \beta_0^{(\text{diff})}\right) = \pi^{(0)}\beta_0^{(0)} + \pi^{(1)}\beta_0^{(1)} \\
\beta_1 &= \pi^{(0)}\beta_1^{(0)} + \pi^{(1)}\left(\beta_1^{(0)} + \beta_1^{(\text{diff})}\right) = \pi^{(0)}\beta_1^{(0)} + \pi^{(1)}\beta_1^{(1)} \\
\beta_2 &= \pi^{(0)}\beta_2^{(0)} + \pi^{(1)}\left(\beta_2^{(0)} + \beta_2^{(\text{diff})}\right) = \pi^{(0)}\beta_2^{(0)} + \pi^{(1)}\beta_2^{(1)} \\
\beta_3 &= \pi^{(0)}\beta_3^{(0)} + \pi^{(1)}\left(\beta_3^{(0)} + \beta_3^{(\text{diff})}\right) = \pi^{(0)}\beta_3^{(0)} + \pi^{(1)}\beta_3^{(1)}
\end{aligned}
\tag{9.54}
\]
Simultaneously estimating an empty probit or logit model for the missing data indicator
provides pattern proportions and standard errors for pooling, and creating and analyz-
ing model-based multiple imputations is an alternative to explicitly pooling over the
missing data patterns (Demirtas & Schafer, 2003).
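The pooling arithmetic in Equation 9.54 is just a proportion-weighted average, as the following sketch with hypothetical pattern proportions and slopes shows.

```python
def marginal_coefficient(proportions, pattern_estimates):
    """Population-level coefficient computed as the proportion-weighted
    average of the pattern-specific estimates, as in Equation 9.54."""
    return sum(p * b for p, b in zip(proportions, pattern_estimates))

# Hypothetical values: 70% completers (M = 0), 30% dropouts (M = 1)
pi = [0.70, 0.30]
beta1 = [-0.40, -0.15]   # pattern-specific placebo growth rates
print(marginal_coefficient(pi, beta1))  # weighted average of the two slopes
```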
Table 9.8 shows pattern-specific and population-level estimates for the Hedeker–
Gibbons model, and it also shows the estimates from a corresponding MAR analysis that
fixed all pattern difference coefficients (β0(diff) to β3(diff)) equal to 0. Specifying an MAR
analysis in this framework gives the same population-level estimates as a conventional
analysis that ignores missingness, but the AIC and BIC values are comparable to those
of the MNAR model. Looking at the pattern-specific estimates, dropouts and completers
had very different growth trajectories; among the participants who quit, the placebo
group had a much higher baseline mean and a slower (less negative) growth rate, and
the treatment group had a much steeper decline in symptoms. The overall (marginal)
estimates differed somewhat from those of the MAR analysis, but the discrepancies were
not as large as those for the Diggle–Kenward model (e.g., the treatment group growth
rate estimates differed by about three-fourths of a standard error unit, and other differ-
ences were roughly equal to half a standard error). Figure 9.16 shows average growth
Model fit                      MAR            Hedeker–Gibbons
AIC (no. of parameters)    5054.26 (9)        5036.43 (13)        —
BIC (no. of parameters)    5090.98 (9)        5089.47 (13)        —
[Figure 9.16 appears here: illness severity (1–7) trajectories for the placebo and medication groups.]
FIGURE 9.16. Solid lines depict the average growth curves from the MAR analysis, and the
dashed lines are the Hedeker–Gibbons pattern mixture model trajectories. Both analyses show
that the medication condition achieved greater gains than the placebo group, and the models
made similar predictions about the means.
curves from the MAR analysis as solid lines, and it depicts the Hedeker–Gibbons model
trajectories as dashed lines.
Looking at the information criteria near the bottom of Table 9.8, both the AIC
and BIC selected the MNAR analysis (ΔAIC = AICMAR – AICMNAR = 17.83 and ΔBIC =
BICMAR – BICMNAR = 1.52), although ΔBIC’s evidence is “weak” according to Raftery’s
(1995) effect size guidelines. Individual influence diagnostics revealed 15 participants
with positive indices larger than ΔBIC. Removing any one of these data records from
the analysis could switch the sign of ΔBIC from positive to negative, thereby favoring
the MAR model. Interestingly, most of these cases had response profiles with very large
score reductions (e.g., a change from 7 to 2), and all were medication recipients who
dropped out. Presumably, these individuals are mostly responsible for the very large
negative slope for the M = 1 pattern in Table 9.8. These influence diagnostics should be
reported as part of a broader sensitivity analysis, but finding influential participants is
not a prescription for removing data records from the analysis (Sterba & Gottfredson,
2014).
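Case-deletion influence checks like this one can be automated with a simple loop. In the sketch below, the model-fitting functions are hypothetical stand-ins; a real analysis would refit the MAR and MNAR models for every deleted record.

```python
def delta_bic_influence(records, fit_mar_bic, fit_mnar_bic):
    """Case-deletion influence sketch: recompute delta-BIC = BIC(MAR) -
    BIC(MNAR) with each record removed. `fit_mar_bic` and `fit_mnar_bic`
    are hypothetical callables that return a BIC for a data set. Large
    entries flag records that drive the model comparison."""
    full_delta = fit_mar_bic(records) - fit_mnar_bic(records)
    influence = []
    for i in range(len(records)):
        reduced = records[:i] + records[i + 1:]
        influence.append(full_delta - (fit_mar_bic(reduced) - fit_mnar_bic(reduced)))
    return full_delta, influence

# Toy stand-in "models" whose BIC is a simple function of sample size
full_delta, influence = delta_bic_influence(
    list(range(10)), lambda d: 2.0 * len(d), lambda d: 1.5 * len(d)
)
print(full_delta, influence[0])  # 5.0 0.5
```

Plotting the influence values against case numbers reproduces the kind of index plot referenced in the text.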
Next, consider a more refined analysis that collapsed the missing data patterns in
Table 9.3 into three groups: early dropouts who quit prior to the 3-week follow-up (Pat-
terns 2 and 3), later dropouts who leave prior to the 6-week follow-up (Patterns 4 and
8), and a completer group that includes participants with intermittent missing values
(Patterns 1, 5, 6, 7, and 9). Figure 9.17 shows the observed means for each pattern broken
down by treatment group. The fitted model includes two dummy codes that indicate
early and late dropout (EDROP and LDROP, respectively):
\[
\begin{aligned}
\text{SEVERITY}_{ti} = {}& \beta_0^{(0)} + \beta_1^{(0)}(\text{SQRTWEEK}_{ti}) + \beta_2^{(0)}(\text{DRUG}_i) + \beta_3^{(0)}(\text{SQRTWEEK}_{ti})(\text{DRUG}_i) \\
& + \beta_0^{(\text{diff1})}(\text{EDROP}_i) + \beta_1^{(\text{diff1})}(\text{EDROP}_i)(\text{SQRTWEEK}_{ti}) + \beta_2^{(\text{diff1})}(\text{EDROP}_i)(\text{DRUG}_i) \\
& + \beta_3^{(\text{diff1})}(\text{EDROP}_i)(\text{SQRTWEEK}_{ti})(\text{DRUG}_i) + \beta_0^{(\text{diff2})}(\text{LDROP}_i) \\
& + \beta_1^{(\text{diff2})}(\text{LDROP}_i)(\text{SQRTWEEK}_{ti}) + \beta_2^{(\text{diff2})}(\text{LDROP}_i)(\text{DRUG}_i) \\
& + \beta_3^{(\text{diff2})}(\text{LDROP}_i)(\text{SQRTWEEK}_{ti})(\text{DRUG}_i) + b_{0i} + b_{1i}(\text{SQRTWEEK}_{ti}) + \varepsilon_{ti}
\end{aligned}
\tag{9.55}
\]
[Figure 9.17 appears here: illness severity (1–7) plotted against the square root of weeks since baseline (0–3) for the placebo and medication groups within each missing data pattern.]
FIGURE 9.17. Observed means for three missing data patterns broken down by treatment
group. The early dropout group has two observations, the later dropout group has three observa-
tions, and the completers have four observations. Dashed lines are the placebo condition, and
solid lines are the medication group.
The complete-case restriction sets the early dropout pattern's placebo group
growth rate and the medication group's slope difference equal to the corresponding
parameters in the completer group (i.e., β1(diff1) = β3(diff1) = 0). Figure 9.17 suggests that this
strategy isn’t as unreasonable as it might sound, as the two groups show similar changes
between the baseline and 1-week follow-up (albeit with different baseline averages). The
neighboring-case restriction instead sets the early dropout group’s growth parameters
equal to those of the later dropouts (i.e., β1(diff1) = β1(diff2) and β3(diff1) = β3(diff2)). Although
I don’t illustrate the procedure here, the available case restriction would fix the early
dropout pattern’s growth rate parameters equal to a weighted average across the other
patterns.
Table 9.9 shows the parameter estimates and standard errors from the two mod-
els. The corresponding MAR analysis produced the same estimates as the two-pattern
model in Table 9.8, albeit with different fit values (because the empty model for the indi-
cators differs). The complete-case restriction produced estimates similar to those of
the MAR analysis; the placebo group average and baseline mean difference coefficients
changed by about half a standard error unit, but the growth rate parameters were effec-
tively equivalent. The neighboring-case restriction estimates were effectively identical
to those of the two-pattern model in Table 9.8. This isn’t necessarily surprising given
that the equality constraints are a more elaborate way of combining the two dropout
Model fit                   Complete case    Neighboring case    Effect size
AIC (no. of parameters)     5182.55 (16)     5181.63 (16)        5181.65 (16)
BIC (no. of parameters)     5247.83 (16)     5246.91 (16)        5246.93 (16)
patterns. The AIC and BIC disagreed, as the former favored the MNAR analyses (ΔAIC
values were positive), and the latter supported the MAR analysis (ΔBIC values were negative).
These discrepancies don’t pose much of a dilemma given the relative stability of the
estimates.
As a final example, I used the effect-size-based strategy from Section 9.7 to specify
the early dropout group’s growth rate parameters. This procedure allows you to exam-
ine the stability of the results across a wide range of plausible parameter values, but I
used the method to induce a more extreme MNAR process than the one implied by the
neighboring-case restriction. Returning to Figure 9.17, the two dropout groups exhib-
ited relatively similar change rates during the first week of the study, but they could have
diverged after the 1-week follow-up. To examine this possibility, I selected growth rate
parameters for the early dropout pattern that induced a flatter (more positive) trajectory
among placebo group participants and an even steeper decline for people who received
medication. Using the residual standard deviation and a standardized effect size of dΔ
= +0.10, I specified the placebo group’s relative growth rate as dΔ × σ̂Y = 0.10 × 0.77 =
0.077. In this context, setting β1(diff1) equal to β1(diff2) + 0.077 implies that, relative to the
later dropouts, placebo group participants who quit the study early follow a trajectory
that increases the mean by one-tenth of a standard deviation more during the first week
of the study (or about one-fourth of a standard deviation over the duration of the study).
Setting β3(diff1) equal to β3(diff2) – 0.077 induces a comparable relative decline for early
dropouts in the treatment group.
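The effect-size-based offset is a one-line computation. The sketch below uses the dΔ and residual standard deviation from this example; the β coefficients are hypothetical placeholders.

```python
# Converting a standardized sensitivity parameter into a raw-metric offset,
# using the values from the text; the beta values below are hypothetical.
d_delta = 0.10   # standardized effect size
sigma_y = 0.77   # residual standard deviation
offset = d_delta * sigma_y   # raw-metric growth rate offset

beta1_diff2 = -0.20                   # hypothetical later-dropout coefficient
beta1_diff1 = beta1_diff2 + offset    # flatter placebo trajectory
beta3_diff2 = -0.50                   # hypothetical later-dropout coefficient
beta3_diff1 = beta3_diff2 - offset    # steeper medication decline
print(round(offset, 3))  # 0.077
```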
The rightmost panel in Table 9.9 shows the parameter estimates from the analysis.
As expected, the analysis produced a flatter (less negative) placebo group growth rate
and a steeper (more negative) trajectory for the medication condition. The effect-size-
based identifying restriction can be viewed as a worst-case scenario among the pattern
mixture models, because it extrapolates the group means in a way that exaggerates the
MNAR process.
Summary
The real data examples comprise a sensitivity analysis that examined the stability of
the growth model parameters across four different missingness processes. The Diggle–
Kenward selection model and the Hedeker–Gibbons pattern mixture model with an
effect-size-based identifying restriction produced nontrivial differences in some key
parameters. Both analyses suggested a flatter (less negative) trajectory for the placebo
group and a steeper decline for the medication condition, with changes to the param-
eters as large as one standard error unit in some cases. Moreover, the ΔBIC offered
“strong” evidence favoring these models (with all the usual caveats about such model
comparisons), and individual influence diagnostics identified a subgroup of medica-
tion recipients who left the study after experiencing very large score reductions (e.g., a
change from 7 to 2).
Considered as a whole, the sensitivity analysis results suggest that a MNAR process
is quite plausible for these data. As is often the case, fitting different models induced
relatively large changes to some key parameters, albeit with no changes to their statisti-
cal significance tests. Although such differences might seem troubling, the results sim-
ply reflect different, plausible assumptions about the missing data. To reiterate, there is
no way of knowing whether the MNAR analyses are better than a simpler analysis that
assumes a conditionally MAR mechanism. Both sets of results are defensible and could
(and should) be presented in a research report. Chapter 11 offers recommendations for
reporting sensitivity analysis results.
Analyses that assume a conditionally MAR process have been our go-tos throughout
the book. This mechanism stipulates that unseen score values carry no unique informa-
tion about missingness beyond that contained in the observed data. This assumption
is convenient, because there is no need to specify and estimate a model for the missing
data process. Although the MAR assumption is quite reasonable for a broad range of
applications, an MNAR process where the unseen score values carry unique information
about missingness may also be plausible in some settings. This chapter has outlined two
major modeling frameworks for such processes: selection models and pattern mixture
models. Both approaches introduce an additional model that describes the occurrence
of missing data, but they do so in very different ways: A typical selection model features
a regression equation with a missing data indicator as a dependent variable, whereas a
pattern mixture model uses the indicator as a moderator variable. I have described both
approaches in the context of regression models and longitudinal growth models.
Fitting selection models requires a researcher to proactively search for a model that
has support from the data while looking for subtle clues that signal a misspecification
or identification problem. In contrast, pattern mixture models require the researcher
to specify values for one or more inestimable parameters (or impose comparable con-
straints). Although their implementation details are very different, both modeling
frameworks require strict, untestable assumptions, and model misspecifications could
produce estimates that contain more bias than those from an MAR analysis. Accord-
ingly, the literature often recommends sensitivity analyses that examine the stability of
one’s substantive conclusions across different assumptions, and the analysis examples
demonstrated that process. The examples highlighted that invoking different assump-
tions about missingness can have relatively large impacts on key model parameters.
Although such differences might seem troubling, they simply reflect different, plausible
assumptions about the missing data. Viewed through that lens, discrepant results are
still defensible and should be presented in a research report. Finally, I recommend the
following articles for readers who want additional details on topics from this chapter.
Diggle, P., & Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis. Journal
of the Royal Statistical Society C: Applied Statistics, 43, 49–93.
Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychologi-
cal Methods, 16, 1–16.
Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for
missing data in longitudinal studies. Psychological Methods, 2, 64–78.
Kenward, M. G. (1998). Selection models for repeated measurements with non-random dropout:
An illustration of sensitivity. Statistics in Medicine, 17, 2723–2732.
Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with non-
ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological
Methods, 16, 17–33.
Sterba, S. K., & Gottfredson, N. C. (2014). Diagnosing global case influence on MAR versus
MNAR model comparisons. Structural Equation Modeling: A Multidisciplinary Journal, 22,
294–307.
10
Special Topics and Applications
This chapter uses a series of data analysis examples to illustrate a collection of odds and
ends that include specialized topics, advanced applications, and practical issues. Earlier
data analysis examples demonstrated that given the same data and similar assumptions,
maximum likelihood, Bayesian estimation, and multiple imputation generally produce
the same numerical results. Several examples in this section highlight use cases that
differentiate the three methods or favor one approach over another. Analysis scripts for
all examples are available on the companion website.
Nearly every data analysis project begins with descriptive summaries of sample demo-
graphics and key study variables. Maximum likelihood and Bayesian analyses are not
well suited for bread-and-butter descriptive summaries (e.g., cross-tabulation tables for
categorical variables, means and standard deviations of continuous variables), because
they are designed around one specific analysis model. In contrast, agnostic multiple
imputation procedures such as the joint model and fully conditional specification are
ideal for this task, because they apply a flexible model that can preserve associations
among a diverse collection of variables with different metrics.
The literature offers surprisingly little guidance on applying multiple imputation to
basic descriptive quantities, cross-tabulation tables, and the like. A quick scan of online
question-and-answer communities suggests that there is disagreement about the use of
imputation for generating descriptive summaries, with numerous authors stating they
are meaningless or invalid. These objections often stem from the legitimate concern that
estimands such as standard deviations and percentages may not have normal distribu-
tions, and some people also argue that Rubin’s (1987) pooling rules should be reserved
for inferential analyses. Yet others suggest that descriptive summaries of background
and demographic variables should be based on the observed data. I take the view that
you can and should use multiple imputation for descriptive analyses, because doing so
provides a logical consistency across analyses within a given project. For example, a
cross-tabulation table of imputed demographic variables describes respondent charac-
teristics most likely associated with the main analysis results.
I use the chronic pain data on the companion website to illustrate multiple impu-
tation for descriptive summaries and correlations. The data include psychological cor-
relates of pain severity from a sample of N = 275 individuals with chronic pain (e.g.,
depression, pain interference with daily life, perceived control). The illustration pig-
gybacks on earlier moderated regression examples where the influence of depression
on psychosocial disability (a construct capturing pain’s impact on emotional behaviors
such as psychological autonomy and communication, emotional stability, etc.) differed
by gender. A research paper with this focal analysis would likely report descriptive
statistics and correlations by gender, and this example shows how to obtain such sum-
maries from multiple imputation.
Imputation Models
While it is ideally suited for the moderated regression, model-based imputation is nar-
row in scope and tailors imputations around that one analysis. Reporting descriptive
statistics and correlations for males and females requires an imputation scheme capable
of preserving several interaction effects at once (e.g., a correlation that differs by gender
implies a two-way interaction). Model-based imputation accommodates more than one
interaction effect, but using product terms to preserve gender-specific correlations is
cumbersome. A simple alternative is to impute the male and female data separately. This
multiple-group imputation strategy (Enders & Gottschall, 2011; Graham, 2009) gener-
ates imputations that preserve all possible mean differences and two-way interactions
with gender. The imputation phase can employ either the joint model or fully condi-
tional specification, and there is generally no reason to prefer one to the other. The main
requirements are that the grouping variable must be complete and group sizes must be
large enough to support imputation (in the limit, the number of variables can’t exceed
the number of observations).
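To make the logic concrete, here is a deliberately simplified sketch of group-wise imputation: a single incomplete variable is filled in by stochastic regression imputation run separately within each group. Real applications would use joint-model or fully conditional specification software; the function and variable names here are hypothetical.

```python
import random
from statistics import mean, stdev

def impute_within_groups(rows, group_key, x_key, y_key, seed=0):
    """Stochastic regression imputation of y from x, run separately within
    each group so that group-specific means, slopes, and residual variances
    (i.e., two-way interactions with the grouping variable) are preserved.
    `rows` is a list of dicts; missing y values are None."""
    rng = random.Random(seed)
    for g in sorted({r[group_key] for r in rows}):
        observed = [r for r in rows if r[group_key] == g and r[y_key] is not None]
        xs = [r[x_key] for r in observed]
        ys = [r[y_key] for r in observed]
        mx, my = mean(xs), mean(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
        intercept = my - slope * mx
        resid_sd = stdev([y - (intercept + slope * x) for x, y in zip(xs, ys)])
        for r in rows:
            if r[group_key] == g and r[y_key] is None:
                r[y_key] = intercept + slope * r[x_key] + rng.gauss(0.0, resid_sd)
    return rows

data = [
    {"gender": "F", "x": 1.0, "y": 2.0}, {"gender": "F", "x": 2.0, "y": 2.9},
    {"gender": "F", "x": 3.0, "y": 4.1}, {"gender": "F", "x": 4.0, "y": None},
    {"gender": "M", "x": 1.0, "y": 1.0}, {"gender": "M", "x": 2.0, "y": 1.4},
    {"gender": "M", "x": 3.0, "y": 2.2}, {"gender": "M", "x": 4.0, "y": None},
]
completed = impute_within_groups(data, "gender", "x", "y")
print(sum(r["y"] is None for r in completed))  # 0 missing values remain
```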
To illustrate the procedure, I applied fully conditional specification with latent vari-
ables to data sets that comprised n = 166 women and n = 109 men. Each group’s imputa-
tion model includes four complete variables with mixed response types (age, stress, per-
ceived control over pain, educational attainment categories), six incomplete numerical
variables (work hours per week, exercise frequency, anxiety, pain interference with daily
life, depression, psychosocial disability), and one incomplete nominal variable (a three-
category pain severity rating). The stress and exercise frequency variables are 7- and
8-point ordinal scales, respectively. After stratifying by gender, exercise frequency had
relatively few responses in some of the higher bins. The sparse data would not support
group-specific threshold estimates, so I simplified imputation by treating this variable
as continuous. Other options include combining categories into a smaller number of
ordered bins or treating the categories as nominal groups (the multinomial probit model
is often easier to estimate, because it doesn’t require threshold parameters). I also treated
the complete stress rating scale as continuous, because this variable has a sufficient
number of scale points and is relatively symmetric (Rhemtulla et al., 2012).
Fully conditional specification imputes variables one at a time by stringing together
a series of regression models, one per incomplete variable. This example requires seven
regressions per group, six of which feature a continuous target variable. To illustrate, the
anxiety scale’s imputation model is
\[
\begin{aligned}
\text{ANXIETY}_i^{(g)} = {}& \gamma_0^{(g)} + \gamma_1^{(g)}\left(\text{INTERFERE}_i^{(g)}\right) + \gamma_2^{(g)}\left(\text{DEPRESS}_i^{(g)}\right) + \gamma_3^{(g)}\left(\text{DISABILITY}_i^{(g)}\right) \\
& + \gamma_4^{(g)}\left(\text{WORKHRS}_i^{(g)}\right) + \gamma_5^{(g)}\left(\text{EXERCISE}_i^{(g)}\right) + \gamma_6^{(g)}\left(\text{MODERATE}_i^{*(g)}\right) \\
& + \gamma_7^{(g)}\left(\text{SEVERE}_i^{*(g)}\right) + \gamma_8^{(g)}\left(\text{AGE}_i^{(g)}\right) + \cdots + \gamma_{12}^{(g)}\left(\text{POSTBA}_i^{(g)}\right) + r_i^{(g)}
\end{aligned}
\tag{10.1}
\]
where the g superscript indicates that each group has unique parameter values. Notice
that the incomplete nominal pain ratings appear as a pair of latent response difference
scores on the right side of the equation (MODERATE* and SEVERE*), whereas a pair of
dummy codes represent the educational attainment groups (COLLEGE and POSTBA);
the complete variables function like known constants, because I did not assign them a
distribution.
As a second example, the imputation model for the nominal pain rating requires the
multivariate regression of the latent difference scores on the remaining variables. The
probit regression model is
\[
\begin{aligned}
\begin{bmatrix} \text{MODERATE}_i^{*(g)} \\ \text{SEVERE}_i^{*(g)} \end{bmatrix} = {}& \boldsymbol{\gamma}_0^{(g)} + \boldsymbol{\gamma}_1^{(g)}\left(\text{INTERFERE}_i^{(g)}\right) + \boldsymbol{\gamma}_2^{(g)}\left(\text{DEPRESS}_i^{(g)}\right) \\
& + \boldsymbol{\gamma}_3^{(g)}\left(\text{DISABILITY}_i^{(g)}\right) + \boldsymbol{\gamma}_4^{(g)}\left(\text{WORKHRS}_i^{(g)}\right) + \boldsymbol{\gamma}_5^{(g)}\left(\text{EXERCISE}_i^{(g)}\right) \\
& + \boldsymbol{\gamma}_6^{(g)}\left(\text{ANXIETY}_i^{(g)}\right) + \boldsymbol{\gamma}_7^{(g)}\left(\text{AGE}_i^{(g)}\right) + \cdots + \boldsymbol{\gamma}_{11}^{(g)}\left(\text{POSTBA}_i^{(g)}\right) + \mathbf{r}_i^{(g)}
\end{aligned}
\tag{10.2}
\]
where each vector γ includes one coefficient per difference score, and r contains a pair of
correlated residuals. Aside from stratifying the sample by gender and imputing within
each group, all other aspects of imputation are the same as Chapter 7.
The number of imputations should be large enough to minimize Monte Carlo error (Bodner, 2008; Graham et al., 2007; Harel, 2007; von Hippel, 2020). Prior
to creating imputations, I performed an exploratory analysis and used trace plots and
potential scale reduction factor diagnostics (Gelman & Rubin, 1992) to evaluate con-
vergence. Based on this diagnostic run, I specified 100 parallel imputation chains with
2,000 iterations each, and I saved a data set at the final iteration of each chain. After
applying fully conditional specification to each group, I generated basic summaries that
you might see in a published manuscript, including descriptive statistics and corre-
lations by gender and cross-tabulation tables. As mentioned previously, these simple
estimands seem to pose the most ambiguity for applying Rubin’s (1987) pooling rules.
To begin, consider the categorical pain rating variable, which had about a 7% miss-
ing data rate in both groups. Summarizing the chronic pain ratings for men and women
is an important preliminary step, because this variable is a defining feature of the tar-
get population. However, Rubin’s pooling rule assumes that estimands follow a normal
distribution. Row or column percentages from a cross-tabulation table probably do not
satisfy this requirement, but averaging the percentages is nevertheless a viable strategy
(U.S. Census Bureau, 2019). To illustrate, Table 10.1 shows cross-tabulation tables of
educational attainment and chronic pain ratings by gender. Educational attainment is
complete, but I include it here to illustrate what a table with multiple categorical vari-
ables might look like. I computed the cell sizes by multiplying the pooled column per-
centages by the male and female sample sizes (n = 109 and 166, respectively). Although
reporting fractional cell sizes might seem odd, they are routine in analyses with latent
categorical variables (e.g., a latent class analysis where class sizes are computed by mul-
tiplying group probabilities by the sample size). I would argue that this application isn’t
so different and that fractional cell sizes emphasize the uncertainty in the descriptive
summaries.
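The fractional cell sizes are simply pooled percentages rescaled by the group sample size, as this sketch with hypothetical percentages illustrates.

```python
def fractional_cell_sizes(pooled_percentages, n):
    """Convert pooled column percentages into fractional cell sizes by
    multiplying by the group sample size."""
    return [round(p / 100.0 * n, 1) for p in pooled_percentages]

# Hypothetical pooled pain-rating percentages for the n = 109 males
print(fractional_cell_sizes([22.5, 48.3, 29.2], 109))  # sums to about 109
```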
Taking the cross-tabulation table one step further, suppose it is of interest to deter-
mine whether the pain distributions differ by gender. Software packages routinely aug-
ment contingency tables with Pearson chi-square and likelihood ratio chi-square sta-
tistics, among others. Although the inputs for computing Wald or likelihood ratio tests
probably aren’t available, the TD2 (or D2) statistic from Section 7.12 (Li, Meng, et al.,
1991) provides a straightforward tool for pooling virtually any chi-square statistic. Con-
sistent with the usual test of independence, the significant test statistic indicates that
males and females have different pain rating distributions, TD2 = χ2(2) = 16.48, p < .001.
Li, Raghunathan, et al. (1991) also describe an F distribution for the test statistic, but
the denominator degrees of freedom for this example was so large (dfD2 = 30,678.57) that
the chi-square and F versions are effectively identical tests. I use a chi-square reference
distribution, because this is the norm for complete-data contingency tables.
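For readers who want to verify the computation, the sketch below implements the commonly presented form of the D2 pooling formula with hypothetical chi-square statistics; consult Li, Meng, et al. (1991) for the authoritative expressions.

```python
from statistics import mean, variance

def pool_chi_square_d2(stats, df):
    """D2 pooling for M completed-data chi-square statistics with `df`
    degrees of freedom, following the commonly presented form of the
    Li, Meng, et al. (1991) procedure. Returns the F-scaled statistic and
    its denominator degrees of freedom; multiplying D2 by `df` gives the
    chi-square-scaled version used in the text."""
    m = len(stats)
    roots = [s ** 0.5 for s in stats]
    r = (1 + 1 / m) * variance(roots)  # relative increase in variance
    d2 = (mean(stats) / df - ((m + 1) / (m - 1)) * r) / (1 + r)
    nu = df ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2  # denominator df
    return d2, nu

# Hypothetical chi-square statistics (df = 2) from M = 5 imputed data sets
d2, nu = pool_chi_square_d2([16.2, 17.1, 15.8, 16.9, 16.5], 2)
print(round(2 * d2, 2))  # chi-square-scaled pooled statistic
```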
Table 10.2 gives means, standard deviations, and correlations by gender. Rubin’s
rule is absolutely appropriate for means, but there is not universal agreement about stan-
dard deviations; some authors recommend transforming standard deviations prior to
MALES (n = 109)
                  1      2      3      4      5      6      7      8      9
1. AGE          1.00
2. WORKHRS      –.24   1.00
3. EXERCISE      .05   –.13   1.00
4. ANXIETY      –.19    .20   –.18   1.00
5. STRESS       –.16    .23   –.10    .69   1.00
6. CONTROL       .25   –.12    .17   –.28   –.11   1.00
7. INTERFERE     .06    .12   –.23    .25    .12   –.44   1.00
8. DEPRESS      –.26   –.01   –.33    .54    .51   –.35    .25   1.00
9. DISABILITY    .00   –.02    .05    .31    .24   –.28    .37    .29   1.00
Means          48.64  34.59   2.50  12.16   4.10  20.39  27.96  15.32  21.83
SD             11.67  19.10   1.43   4.90   1.76   5.46   8.64   6.74   4.52
pooling (White, Royston, & Wood, 2011, p. 389), and others suggest a normal approxi-
mation is appropriate (Marshall, Altman, Holder, & Royston, 2009; van Buuren, 2012,
p. 155). Limited computer simulation evidence suggests that pooling standard devia-
tions without transformation works just fine unless the sample size is very small and the
missing data rate is very large (e.g., less than N = 50 and 30% missing data), in which
case pooling after an inverse transformation is preferable (Hubbard & Enders, 2022).
Although the transformation makes no difference here, the pooling equations for an
inverse transformation are
ϑ̂ = (1/M) Σ_{m=1}^{M} 1/θ̂m    (10.3)

θ̂ = 1/ϑ̂
where θ̂m is the untransformed estimate from data set m, ϑ̂ is the average inverse estimate
(e.g., the average reciprocal of the standard deviation), and θ̂ is the back-transformed
point estimate.
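Equation 10.3 amounts to averaging reciprocals and then taking the reciprocal of that average, which is just the harmonic mean of the M estimates. A minimal sketch:

```python
def pool_inverse(estimates):
    # Equation 10.3: average the reciprocals of the M estimates, then
    # back-transform the pooled inverse to the original metric.
    M = len(estimates)
    inv_bar = sum(1.0 / est for est in estimates) / M
    return 1.0 / inv_bar
```

For example, `pool_inverse` applied to two standard deviations of 1.0 and 3.0 returns 1.5 rather than the arithmetic mean of 2.0.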
There is widespread agreement that Rubin’s rule is not appropriate for a correlation,
because the sampling distribution of r becomes increasingly skewed as the population
correlation moves away from zero. Here, again, simulation results suggest that transfor-
mation only helps at small sample sizes (Hubbard & Enders, 2022), but applying Fisher’s
(1915) r-to-z transformation facilitates significance testing. Schafer (1997) and others
recommend the following procedure: (1) Apply the r-to-z transformation to each of the
M correlation estimates, (2) average the transformed estimates, then (3) back-transform
the pooled z-statistic to the correlation metric. The expression for the pooled z-statistic
is
z = (1/M) Σ_{m=1}^{M} (1/2) ln[(1 + rm) / (1 − rm)]    (10.4)
where the collection of terms to the right of the summation is the r-to-z transformation.
The back-transformation to the correlation metric is as follows:
r = [exp(2z) − 1] / [exp(2z) + 1]    (10.5)
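The three-step procedure, transform, average, back-transform, takes only a few lines of code:

```python
import math

def pool_correlations(rs):
    # Equations 10.4 and 10.5: average the Fisher r-to-z transforms of
    # the M correlation estimates, then back-transform the pooled z.
    M = len(rs)
    z = sum(0.5 * math.log((1 + r) / (1 - r)) for r in rs) / M
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)
```

The back-transformation in Equation 10.5 is the hyperbolic tangent of the pooled z, so the result always lands back inside the (-1, 1) correlation metric.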
Between-group mean comparisons would likely be standard fare in a project con-
cerned with gender differences. Casting independent-samples t-tests as simple regres-
sion models is useful when using general-use statistical software to analyze multiply
imputed data sets, because most programs offer this functionality. The analysis model
features background or outcome variables regressed on a gender dummy code. For
example, the following simple regression gives the depression group comparison:

DEPRESSi = β0 + β1(MALEi) + εi

where β0 is the female average, and β1 is the mean difference for males. Table 10.3 gives
Special Topics and Applications 407
the group means, t-statistics with the Barnard and Rubin (1999) degrees of freedom
adjustment, and fractions of missing information (i.e., the proportion of the squared
standard errors due to missing data). Unlike the classic expression from Rubin (1987,
Eq. 3.1.6), the adjusted degrees of freedom values never exceed the sample size and
decrease as the fractions of missing data increase.
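The Barnard and Rubin (1999) formulas are not reproduced in this section, so the sketch below follows the commonly published form of the adjustment; treat the expressions as assumptions to verify against the original article.

```python
def barnard_rubin_df(ubar, b, M, df_com):
    """Adjusted degrees of freedom for multiple imputation
    (Barnard & Rubin, 1999), as commonly stated.

    ubar: average within-imputation variance of the estimate
    b: between-imputation variance of the estimate
    M: number of imputations
    df_com: complete-data degrees of freedom (e.g., N minus the
            number of estimated parameters)
    """
    T = ubar + (1 + 1 / M) * b           # total variance (Rubin's rules)
    gamma = (1 + 1 / M) * b / T          # fraction of missing information
    df_old = (M - 1) / gamma ** 2        # Rubin's (1987) classic df
    df_obs = (df_com + 1) / (df_com + 3) * df_com * (1 - gamma)
    return 1 / (1 / df_old + 1 / df_obs)  # never exceeds df_com
```

Because the result is a harmonic-style combination of the classic value and an observed-data quantity bounded by df_com, the adjusted degrees of freedom cannot exceed the complete-data value and shrink as the fraction of missing information grows.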
Finally, I previously noted that multiple-group imputation generates imputations
that preserve all possible two-way interactions with gender, because it allows every pair
of means and correlations to differ for men and women. At least in this example, where
the moderator variable is complete, multiple-group imputation is a flexible alternative
to model-based multiple imputation. Fitting the moderated regression model (see Equa-
tion 7.26) to the imputed data sets gave similar (but not identical) estimates to the model-based imputation procedure described in Section 7.11. These minor differences likely
owe to the fact that the multiple-group procedure is more complex and estimates far
more parameters.
Most methods in this book leverage the normal distribution in important ways; Bayesian
estimation and multiple imputation make this dependence explicit by sampling impu-
tations from normal curves, and maximum likelihood estimation similarly intuits the
location of missing values by assuming they are normal. Of course, the normal distribu-
tion is often a rough approximation for real data where variables are asymmetrical and/
or kurtotic. Using the normal curve for missing data handling is fine in many situations,
but misspecifications can introduce bias if the data diverge too much from this ideal
(some estimands are more robust than others). Not surprisingly, the impact of applying
a normal curve to non-normal data depends on the missingness rate, as misspecifica-
tions are unlikely to cause problems if the non-normal variable has relatively few miss-
ing data points.
408 Applied Missing Data Analysis
Bayesian estimation and multiple imputation are particularly useful for evaluating
the impact of non-normality, because they produce estimates of the missing values.
Graphing imputations next to the observed data can provide a window into an estima-
tor’s inner machinery, as severe misspecifications can produce large numbers of out-
of-range or implausible values (e.g., negative imputes for a strictly positive variable).
Maximum likelihood estimation is a bit more of a black box in this regard, because it does the same thing—intuits that missing values extend to a range of implausible score values—without producing explicit evidence of its assumptions.
When applying maximum likelihood estimation, researchers routinely use correc-
tive procedures such as robust (sandwich estimator) standard errors and the bootstrap
to counteract the influence of non-normal data (see Chapters 2 and 3). This section
focuses on data transformations as an alternative strategy for treating non-normal miss-
ing values. In particular, I focus on the Yeo–Johnson power transformation (Yeo &
Johnson, 2000), because it subsumes a broad range of transformations used in applied
practice (e.g., logarithmic, inverse, Box–Cox). The procedure effectively estimates the
shape of the data as MCMC iterates, such that the distribution of missing values matches
the observed-data distribution. This approach has shown promise when paired with a
factored regression specification (Lüdtke et al., 2020b). The interpretation of the regres-
sion model parameters depends on whether the non-normal variable is a regressor or
outcome, so I illustrate each situation separately.
ln[Pr(DRINKERi = 1) / (1 − Pr(DRINKERi = 1))] = β0 + β1(AGETRYALCi) + β2(COLLEGEi) + β3(AGEi) + β4(MALEi)    (10.7)
where the logit function on the left side of the equation is the log odds of steady drinking
(DRINKER = 0 if the respondent drinks less than once a week on average, and DRINKER
= 1 if the respondent consumes quantities of alcohol at least once per week), AGETRYALC
is the age at which the respondent first tried alcohol, COLLEGE is a dummy code indicat-
ing some college or a college degree, and MALE is a gender dummy code with females as
the reference group. Approximately 9.7% of the dependent variable scores are missing,
age at first use has a 32.9% missing data rate, and 27.9% of the educational attainment
values are unknown. The age at first use distribution is markedly peaked and positively
skewed, with scores ranging from 10 to 47.
The age at first alcohol use variable could be missing because the respondent refused, didn't know the answer, or because the question wasn't applicable (e.g., the
question was skipped, because a respondent reported no lifetime alcohol use). Some
researchers may choose to restrict the population of interest to persons who have tried
alcohol on the grounds that age at first use is not a relevant concept otherwise. This
approach would treat missing values that arise from a survey skip pattern as out of
scope, and it would exclude respondents with no lifetime alcohol use. I take the alterna-
tive tack of imputing missing responses regardless of origin.
Authoritative imputers argue that it is permissible to fill in “not applicable” or “don’t
know” responses like the ones here. For example, Rubin, Stern, and Vehovar (1995) sug-
gest that a “don’t know” response could be viewed as concealing some intention or future
behavior. Furthermore, imputing age at first use scores for respondents who report no
lifetime alcohol use may be justified on grounds that the answer to the lifetime use ques-
tion could be incorrect due to measurement or response error (Schafer & Graham, 2002,
p. 148). In these data, a logical skip may reflect a respondent’s uncertainty about a past
behavior, obscuring a score that truly exists. Schafer and Graham also argue that treat-
ing “not applicable” responses as missing values can provide a convenience feature that
facilitates missing data handling. For respondents who report no lifetime alcohol use, I
am effectively imputing the hypothetical age at first use, had the participant ever tried or
if he or she eventually will try alcohol.
f(DRINKER | AGETRYALC, COLLEGE, AGE, MALE) × f(COLLEGE* | AGETRYALC, AGE, MALE) × f(AGETRYALC | AGE, MALE)
× f(AGE | MALE) × f(MALE*)
The first term is the binomial distribution associated with the logistic model, the next
two terms are supporting models for the incomplete regressors, and the two terms in
the bottom row are unnecessary distributions for the complete predictors (I ignore these
terms going forward).
The f(AGETRYALC|AGE, MALE) term is the focus of this example, because the age
at first use variable is substantially skewed and kurtotic (skewness = 1.82 and excess
kurtosis = 8.12). I start by applying a linear regression model with a normal residual distribution to this variable.

AGETRYALCi ~ N1(E(AGETRYALCi | AGEi, MALEi), σ²1)

Following established notation, this expression says that age scores are normally distributed around predicted values on the regression line and have constant variation.
Either a probit or logistic regression could be used to model the college indicator’s dis-
tribution, and the specification of the dependent variable’s model has no bearing on this
choice. I use probit regression and a latent response variable formulation for consistency
with earlier material.
FIGURE 10.1. Overlaid histograms with the observed data as gray bars and the missing values
as white bars with a kernel density function. The observed data are markedly peaked and skewed
with scores ranging from 10 to 47, whereas the imputations follow a symmetrical distribution
that extends from 0.30 to 34.93.
The two sets of estimates were virtually indistinguishable, because they apply the same assumptions to the same data.
Substantively, the results show that the probability of steady drinking increased for indi-
viduals who tried alcohol at an earlier age, attended at least some college, are older, and
are males. The implausible score values in Figure 10.1 clearly offend our aesthetic sensi-
bilities, but out-of-range imputations don't necessarily invalidate the results or translate into biased parameter estimates; computer simulation studies show that a normal
imputation model can work surprisingly well when estimating means and regression
coefficients (Demirtas et al., 2008; Lee & Carlin, 2017; von Hippel, 2013; Yuan et al.,
2012), although other studies suggest that applying a normal curve to a heavily skewed
predictor like the one from this example can introduce bias and distort significance tests
(Lüdtke et al., 2020a, 2020b).
Yeo–Johnson imputation
Parameter           Est.    SE       t        p      OR    FMI
β0                 –2.73   0.19   –14.75   < .001     —    .12
β1 (AGETRYALC)     –0.08   0.02    –3.54   < .001   0.93   .23
β2 (COLLEGE)        0.42   0.16     2.73     .01    1.53   .23
β3 (AGE)            0.03   0.004    5.92   < .001   1.03   .11
β4 (MALE)           0.83   0.15     5.66   < .001   2.31   .10
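The odds ratio column is simply exp(β). A quick check using the rounded coefficients shown above (small discrepancies from the tabled odds ratios reflect that rounding):

```python
import math

# Logit coefficients from the Yeo-Johnson panel of Table 10.4,
# exponentiated to the odds ratio metric.
betas = {"AGETRYALC": -0.08, "COLLEGE": 0.42, "AGE": 0.03, "MALE": 0.83}
odds_ratios = {name: math.exp(b) for name, b in betas.items()}
```

Coefficients below zero map to odds ratios below one (e.g., earlier age at first use raises the odds of steady drinking), and coefficients above zero map to odds ratios above one.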
Procedures for imputing non-normal variables are not available in all statistical software packages, and some are limited to very simple applications. A variant of
fully conditional specification known as predictive mean matching (Kleinke, 2017; Lee
& Carlin, 2017; van Buuren, 2012; Vink et al., 2014) achieves a similar end by sampling
imputations from a donor pool of observed scores taken from participants whose pre-
dicted values are similar to that of the person with missing data. Van Buuren (2012)
provides a detailed discussion of predictive mean matching, and the procedure is avail-
able in his popular R package MICE (van Buuren et al., 2021; van Buuren & Groothuis-
Oudshoorn, 2011).
A second option is to apply a normalizing transformation to the skewed variable
prior to imputation and then back-transform the filled-in data to the original metric
prior to analysis. Given the right transformation, this procedure can produce imputa-
tions that have approximately the same shape as the observed scores. Common recom-
mendations for positively skewed variables include Box–Cox transformations (Box &
Cox, 1964; Goldstein et al., 2009) and simple logarithmic, square root, or inverse trans-
formations (Schafer & Olsen, 1998; Su et al., 2011; van Buuren, 2012; von Hippel, 2013).
These transformations can be applied to a negatively skewed variable after first reflecting its distribution by subtracting scores from the maximum value plus one. As noted earlier, the Yeo–Johnson transformation subsumes many of these options, and it assumes that the transformed scores follow a normal distribution.
X†i ~ N1(μ, σ²)    (10.11)

where X†i is the transformed score. The function that returns the transformed scores is

X†i = ((Xi + 1)^λ − 1) / λ    if Xi ≥ 0 and λ ≠ 0
X†i = ln(Xi + 1)    if Xi ≥ 0 and λ = 0    (10.12)
X†i = −((−Xi + 1)^(2−λ) − 1) / (2 − λ)    if Xi < 0 and λ ≠ 2
X†i = −ln(−Xi + 1)    if Xi < 0 and λ = 2

where λ is the shape parameter.
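The piecewise function in Equation 10.12 translates directly into code:

```python
import math

def yeo_johnson(x, lam):
    # Yeo-Johnson transformation (Equation 10.12): a power-type
    # transformation for positive scores and a mirrored version,
    # with the exponent 2 - lambda, for negative scores.
    if x >= 0:
        if lam != 0:
            return ((x + 1) ** lam - 1) / lam
        return math.log(x + 1)
    if lam != 2:
        return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)
    return -math.log(-x + 1)
```

Setting λ = 1 leaves scores unchanged, λ = 0 reduces to a logarithmic transformation for nonnegative scores, and the two branches meet smoothly at zero, which is why the transformation handles variables with both positive and negative values.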
The skewed variable, in turn, follows a Yeo–Johnson normal distribution with the
mean, variance, and shape coefficient as parameters.
Xi ~ YJN(μ, σ², λ)    (10.13)
This distribution is essentially a two-part function based on a normal curve and a com-
ponent for the shape coefficient. To illustrate, the distribution’s log-likelihood function
is
LL(μ, σ², λ | data) = {−(N/2) ln(2π) − (N/2) ln(σ²) − (1/2)(σ²)⁻¹ Σ_{i=1}^{N} (X†i − μ)²}
    + (λ − 1) Σ_{i=1}^{N} sign(Xi) ln(|Xi| + 1)    (10.14)
where the terms in curly braces correspond to a normal distribution for the transformed
variable, and the second collection of terms owes to the shape parameter and its linkage
to the raw scores (Yeo & Johnson, 2000, Eq. 3.1). To illustrate, Figure 10.2 shows the
distributions that result from applying an inverse transformation with shape parameters
of λ = 0.50, 1.00 (no transformation), and 1.50 to a standard normal variable (i.e., the
raw score distributions that are normalized by applying the transformation in Equation 10.12). The shape parameter doesn't have a clear interpretation, because it
works differently depending on whether the raw score is positive or negative (see Equa-
tion 10.12). Nevertheless, the figure highlights that the transformation can map highly
skewed distributions to the normal curve.
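In the illustrations that follow, the shape parameter is estimated as MCMC iterates. One way to see how λ adapts to the data is to profile Equation 10.14 over a grid of λ values, substituting maximum likelihood estimates of μ and σ² at each point. The data below are made up for illustration.

```python
import math

def yj(x, lam):
    # Yeo-Johnson transformation (Equation 10.12)
    if x >= 0:
        return ((x + 1) ** lam - 1) / lam if lam != 0 else math.log(x + 1)
    return (-(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)
            if lam != 2 else -math.log(-x + 1))

def yj_loglik(data, lam):
    # Equation 10.14 with mu and sigma^2 replaced by their maximum
    # likelihood estimates, so the function depends on lambda alone.
    n = len(data)
    t = [yj(x, lam) for x in data]
    mu = sum(t) / n
    s2 = sum((v - mu) ** 2 for v in t) / n  # ML variance estimate
    normal_part = -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(s2) - n / 2
    jacobian = (lam - 1) * sum(math.copysign(1.0, x) * math.log(abs(x) + 1)
                               for x in data)
    return normal_part + jacobian

# Grid search for the shape parameter on a made-up positively skewed sample.
data = [0.1, 0.3, 0.5, 0.9, 1.4, 2.2, 3.5, 5.1, 8.0]
grid = [l / 100 for l in range(-100, 301)]
lam_hat = max(grid, key=lambda l: yj_loglik(data, l))
```

The grid maximizer is only a stand-in for the MCMC step that samples λ, but it conveys the same idea: the shape parameter is chosen so that the transformed scores look as normal as possible.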
To illustrate the application of this distribution to the factored regression specifica-
tion, consider a linear regression model with a skewed regressor:
Yi = β0 + β1(Xi) + εi    (10.15)
Yi ~ N1(E(Yi | Xi), σ²ε)
The factored regression specification for this analysis expresses the joint distribution of
the two variables as the product of two univariate distributions.
f(Y, X) = f(Y | X) × f(X)    (10.16)
FIGURE 10.2. The distributions that result from applying an inverse Yeo–Johnson transfor-
mation with shape parameters of λ = 0.50, 1.00 (no transformation), and 1.50 to a standard
normal variable (i.e., the raw score distributions that are normalized by applying the transformation).
The first term after the equal sign corresponds to the focal analysis model, and the
second term corresponds to the Yeo–Johnson normal distribution in Equation 10.13.
As with any incomplete predictor, the conditional distribution of the missing values
depends on all distributions or models in which the variable appears. In this case, the
distribution of missing values is a two-part function that depends on a normal curve for
Y and a Yeo–Johnson normal curve for X. Importantly, the parameters of f(X) are on the
transformed metric, but X and its imputations are on the raw score metric. From a prac-
tical perspective, this means that the interpretation of the focal parameters is unaffected
by the transformation; conceptually, the procedure samples skewed imputations that,
when transformed using λ and Equation 10.12, approximate a normal curve. As you will
see, the imputations closely mimic the distribution of the observed data.
Implementing the Yeo–Johnson transformation requires a value for the shape
parameter λ. It is convenient to embed this parameter into the iterative estimation pro-
cess. For example, the MCMC recipe for the factored regression specification from Equa-
tion 10.16 has the following major steps: (1) Estimate the focal model parameters, condi-
tional on the current values of Y and X; (2) estimate the Yeo–Johnson model parameters,
conditional on the transformed X scores; (3) estimate the shape parameter λ, conditional
on the current data; (4) impute Y conditional on the focal model parameters and the
current values of X; and (5) impute X conditional on two sets of model parameters, the
shape parameter, and the new values of Y. As always, the Metropolis–Hastings algorithm
can draw imputations from complex, nonstandard distributions.
AGETRYALCi ~ YJN(E(AGETRYALC†i | COLLEGEi, AGEi, MALEi), σ²r1, λ)
The model is a linear regression linking the transformed age scores to other predictors,
and the predicted score and residual variance from this regression define the center and
spread of the Yeo–Johnson normal distribution for the skewed age variable. All other
aspects of the factorization are the same as the previous example.
The Yeo–Johnson transformation can be finicky to implement, and MCMC can be
very slow to converge if the skewed variable’s mean is far from zero. To facilitate conver-
gence, I centered the age scores at their median value of 16 prior to fitting the model. After
inspecting trace plots and potential scale reduction factor diagnostics (Gelman & Rubin,
1992), I specified an MCMC process with 10,000 iterations following a 2,000-iteration
burn-in period, and I created 100 filled-in data sets by saving the imputations from the
final iteration of 100 parallel chains. Figure 10.3 shows overlaid histograms with the
observed data as gray bars and the missing values as white bars with a kernel density
function. As you can see, the imputations follow a positively skewed distribution that
mimics the shape of the observed data, ranging from 5.89 to 42.83. Any continuous dis-
tribution is likely to produce out-of-range values, and the Yeo–Johnson procedure is no
different. However, only 0.60% of the missing values now fall below the lowest reported
age (there is no need to round or truncate these values or any other fractional imputes).
As mentioned previously, the multiple imputations are on the original metric, so
you can simply fit the analysis model to the filled-in data sets without regard to the
transformation (unless you want to normalize the variable first, in which case you use
the estimated shape parameter and Equation 10.12). The bottom panel of Table 10.4
shows the multiple imputation estimates with robust standard errors. Following the
earlier example, the results suggest that the probability of steady drinking increased
for individuals who tried alcohol at an earlier age, attended at least some college, are
older, and are males. Applying the Yeo–Johnson transformation changed the age at first
alcohol use slope coefficient by nearly half of a standard error unit. I judge this to be a
nontrivial difference, because it could potentially alter the inference about this variable
if the sample size or effect size was smaller. Viewed through the lens of a sensitivity
analysis, the tabled results simply reflect two different assumptions about the missing
data distribution, and there is no way to know for sure which analysis is more correct.
Reporting results for both sets of assumptions is an appropriate option.
FIGURE 10.3. Overlaid histograms with the observed data as gray bars and the missing val-
ues as white bars with a kernel density function. The imputations follow a skewed distribution
that mimics the shape of the observed data and ranges from 5.89 to 42.83.
The smoking intensity variable has 21.2% missing data, 3.6% of the parental smoking
indicator scores are missing, and 11.4% of the income values are unknown. The smoking
intensity distribution is markedly peaked and positively skewed, with scores ranging
from 2 to 29 (skewness = 1.46 and excess kurtosis = 2.88). Although a count or negative
binomial imputation model might be more appropriate for these data (see Section 10.10),
I use the Yeo–Johnson transformation as a continuous approximation to the discrete
distribution.
f(INTENSITY | PARSMOKE, INCOME, AGE) × f(PARSMOKE* | INCOME, AGE) × f(INCOME | AGE) × f(AGE)    (10.19)

The first term is the focal linear regression model, the next two terms are supporting
models for the incomplete regressors, and the final term is an unnecessary distribution
for the complete predictor. I use a probit model for the parental smoking indicator and a
linear regression model for the income variable. The composition of the regressor mod-
els is familiar by now, so I omit these equations in the interest of space.
FIGURE 10.4. Overlaid histograms with the observed data as gray bars and the missing val-
ues as white bars with a kernel density function. The observed data are markedly peaked and
skewed with scores ranging from 2 to 29, whereas the imputations follow a symmetrical distribu-
tion that extends from –5.52 to 27.12.
are markedly peaked and skewed with scores ranging from 2 to 29, whereas the imputa-
tions follow a symmetric distribution that extends from –5.52 to 27.12. Although they
are very small in number, negative imputations are clearly illogical.
The middle panel of Table 10.4 shows the multiple imputation estimates with robust
standard errors (the Bayesian posterior medians and standard deviations were numeri-
cally equivalent to the frequentist point estimates and standard errors), and the top
panel shows FIML estimates as a comparison. The similarity of the two sets of point
estimates highlights that maximum likelihood assumes the same distribution for the
missing values as Figure 10.4 without producing explicit evidence of that assumption.
The one noticeable difference was the standard error of the residual variance, which
was smaller in the multiple imputation analysis. Not surprisingly, the sandwich estima-
tor produced different results when applied to just the observed data versus imputed
data sets that are a mixture of normal and skewed distributions. Substantively, the
results show that smoking intensity increased for respondents whose parents smoked,
decreased for people with higher incomes, and increased as age increased.
factored regression specification in Equation 10.19, the focal model now corresponds to
a Yeo–Johnson normal distribution for the smoking intensity variable.
INTENSITYi ~ YJN(E(INTENSITY†i | PARSMOKEi, INCOMEi, AGEi), σ²r, λ)
Importantly, the linear regression model links the transformed outcome to the regres-
sors, and the predicted score and residual variance from this regression define the center
and spread of the Yeo–Johnson normal distribution for the skewed smoking intensity
variable. All other aspects of the factorization are the same as the previous example. I
use γ’s to emphasize that the parameters have a different interpretation than those in
Equation 10.18. To reiterate, the imputed scores are on the original skewed metric.
As noted earlier, the MCMC algorithm can be very slow to converge if the skewed
variable’s mean is far from zero. To facilitate convergence, I centered the smoking inten-
sity scores at the median value of 9 prior to fitting the model. After inspecting trace plots
and potential scale reduction factor diagnostics (Gelman & Rubin, 1992), I specified an
MCMC process with 10,000 iterations following a 2,000-iteration burn-in period, and I
created 100 filled-in data sets by saving the imputations from the final iteration of 100
parallel chains. Figure 10.5 shows overlaid histograms with the observed data as gray
bars and the missing values as white bars with a kernel density function. As you can
FIGURE 10.5. Overlaid histograms with the observed data as gray bars and the missing val-
ues as white bars and a kernel density function. The imputations follow a skewed distribution
that mimics the shape of the observed data and ranges from 2.33 to 39.03.
see, the imputations follow a positively skewed distribution that mimics the shape of the
observed data, ranging from 2.33 to 39.03. The Yeo–Johnson procedure produced a very
small number of out-of-range values in the stacked data set with 200,000 observations
(100 data sets × 2,000 observations). There is no need to round or truncate these values
prior to analysis. As mentioned previously, the multiple imputations are on the original
metric, so you can simply fit the analysis model to the filled-in data sets without regard
to the transformation. The bottom panel of Table 10.5 shows the multiple imputation
estimates with robust standard errors. Overall, the point estimates were similar to those of
the normal-theory analyses, and the sandwich estimator standard errors were a better
match to maximum likelihood.
A final option is to analyze the normalized or transformed dependent variable, as
is common practice in many disciplines. To illustrate, I saved the transformed outcome
scores for each of the 100 imputed data sets alongside their skewed counterparts. Figure
10.6 shows overlaid histograms with the observed data as gray bars and the missing
Yeo–Johnson imputation
Parameter         Est.    SE       t        p     FMI
β0               –2.94   0.86   –3.44   < .001    .28
β1 (PARSMOKE)     2.73   0.18   15.06   < .001    .27
β2 (INCOME)      –0.13   0.03   –4.92   < .001    .29
β3 (AGE)          0.60   0.04   15.52   < .001    .25
σ²ε              11.37   0.72   15.70   < .001    .20
R²                 .26    .02   12.73   < .001    .14
FIGURE 10.6. Overlaid histograms with the transformed observed data as gray bars and the
transformed missing values as white bars and a kernel density function.
values as white bars with a kernel density function. The Yeo–Johnson transformation
maintains the sign of the original scores, and the spread of the imputes around zero is a
result of centering the outcome prior to the analysis (doing so facilitated convergence).
Table 10.6 shows the multiple imputation results (the Bayesian posterior medians and
standard deviations were numerically equivalent to the frequentist point estimates and
standard errors). Substantively, the results show that smoking intensity increased for
respondents whose parents smoked, decreased for people with higher incomes, and
increased as age increased. Although the results are on a different metric, the signs and
interpretations of the coefficients are the same as the untransformed results.
A mediation analysis attempts to clarify the mechanism through which two variables
are related. A typical model features an explanatory variable affecting an intervening
variable (the mediator) that, in turn, transmits the predictor’s influence to the outcome.
Seminal mediation articles include Baron and Kenny (1986) and Judd and Kenny (1981),
and a number of excellent books are devoted to the topic (Hayes, 2013; Jose, 2013;
MacKinnon, 2008; Muthén et al., 2016). I use the chronic pain data on the companion
website to illustrate a mediation analysis with missing data. The data set includes psy-
chological correlates of pain severity (e.g., depression, pain interference with daily life,
perceived control) for a sample of N = 275 individuals with chronic pain. The single-
mediator model for the illustration features a binary severe pain indicator (0 = no, little,
or moderate pain, 1 = severe pain) that influences depression indirectly via an intervening
or mediating variable, pain interference with daily life activities. Approximately 7.3% of
the binary pain ratings are missing, and the missing data rates for the depression and
pain interference scales are 13.5 and 10.6%, respectively.
Figure 10.7 depicts the mediation model as a path diagram, with straight arrows
denoting regression coefficients and double-headed curved arrows representing vari-
ances or residual variances. To reduce visual clutter, I omit triangle symbols that
researchers sometimes use to denote grand means or intercepts. The model decomposes
the bivariate association between severe pain and depression into a direct pathway and
an indirect pathway via the mediator variable, pain interference with daily life. The path
diagram can alternatively be written as a pair of regression equations. Modifying my
established notation to align with the mediation literature, the regression models are

INTERFEREi = I1 + α(PAINi) + ε1i
DEPRESSi = I2 + τ′(PAINi) + β(INTERFEREi) + ε2i

where I1 and I2 are regression intercepts, α and β are slope coefficients that define the indirect effect, τ′ is the direct effect of severe pain on depression, and ε1 and ε2 are nor-
FIGURE 10.7. Path diagram of a single-mediator model. A binary pain severity indicator
exerts a direct influence on depression, and it also exerts an indirect effect via an intervening or
mediating variable, pain interference with daily life.
mally distributed residuals. The two regressions align perfectly with the factored regres-
sion (sequential) specification we’ve used throughout the book, and they also integrate
with a multivariate (structural equation model) specification. I use the former for the
Bayesian analysis and the latter for maximum likelihood and multiple imputation.
Multiplying the α and β slopes (i.e., the indirect pathways) defines the so-called
“product of coefficients” estimator of the mediated effect, αβ = α × β. Mediation infer-
ence is challenging, because the sampling distribution of the product of two coefficients
can be markedly asymmetric and kurtotic, even when estimates of α and β follow a nor-
mal distribution (MacKinnon, 2008; MacKinnon, Lockwood, Hoffman, West, & Sheets,
2002; MacKinnon, Lockwood, & Williams, 2004; Shrout & Bolger, 2002). I introduced
bootstrap resampling in Section 2.8 as a method for generating standard errors that are
robust to normality violations (Efron, 1987; Efron & Gong, 1983; Efron & Tibshirani,
1993), and this approach is also the preferred method for testing indirect effects in the
frequentist framework (MacKinnon, 2008; MacKinnon et al., 2004; Shrout & Bolger,
2002). In a Bayesian analysis, the MCMC algorithm iteratively estimates α and β, and
multiplying each pair of estimates creates a posterior distribution and credible intervals
that reflect the estimand’s natural shape.
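This last step is easy to mimic with saved MCMC output: multiply the draws pairwise, then read the median and percentile interval off the sorted products. The draws below are toy normal deviates whose means merely echo the scale of the reported estimate, not the actual posterior.

```python
import random
import statistics

random.seed(1)
M_DRAWS = 10_000
# Toy posterior draws standing in for MCMC output for alpha and beta.
alpha_draws = [random.gauss(1.9, 0.4) for _ in range(M_DRAWS)]
beta_draws = [random.gauss(0.8, 0.1) for _ in range(M_DRAWS)]

# Product of coefficients: multiply each pair of draws, then summarize
# the resulting posterior with a median and a 95% percentile interval.
ab = sorted(a * b for a, b in zip(alpha_draws, beta_draws))
median_ab = statistics.median(ab)
ci_lower = ab[int(0.025 * M_DRAWS)]
ci_upper = ab[int(0.975 * M_DRAWS) - 1]
```

Because the interval comes from the empirical quantiles of the products, it automatically reflects the asymmetry of the indirect effect's distribution rather than imposing a normal approximation.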
f(DEPRESS | INTERFERE, PAIN) × f(INTERFERE | PAIN) × f(PAIN*)    (10.22)
All three components are needed for this example, because the variables have miss-
ing data. The analysis also included perceived control over pain, stress, and anxiety as
auxiliary variables, as all have salient bivariate associations with the analysis variables.
Sequencing the models such that the analysis variables predict the auxiliary variables
and not vice versa maintains the desired interpretation of the focal model parameters.
The factored regression specification therefore adds three auxiliary variable models, each conditioned on the analysis variables, to the product of distributions in Equation 10.22.
All models correspond to linear regressions with normal distributions except the final
term, which is an empty probit (or logit) model for the binary explanatory variable.
Bayesian Estimation
Yuan and MacKinnon (2009) describe complete-data Bayesian estimation and infer-
ence for the model specification in Equation 10.22, and their approach readily accommodates missing data. The MCMC algorithm follows a familiar two-step recipe that
involves estimating multiple sets of regression model parameters conditional on the
filled-in data, then sampling new imputations from distributions based on the updated
model parameters. The missing values follow complex, multipart functions that depend
on every model in which a variable appears. For example, the distribution of depression
scores depends on the focal model parameters (e.g., the β and τ′ paths) and three auxil-
iary variable models. Similarly, the conditional distribution of the severe pain indicator
involves the product of six distributions. In practice, the Metropolis–Hastings algo-
rithm does the heavy lifting of sampling imputations from these complex, multipart
functions.
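The two-step recipe can be sketched for a single incomplete outcome. This toy loop is a deliberately simplified stochastic-regression version (not the Metropolis–Hastings sampler described above); it shows only the estimate-then-impute alternation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y depends on x, and some y values are missing.
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=n)
miss = rng.random(n) < 0.3
y[miss] = np.nan

# Crude starting imputations: fill missing y with the observed-data mean.
y_filled = np.where(miss, np.nanmean(y), y)

for _ in range(50):
    # Step 1: estimate regression parameters from the filled-in data.
    X = np.column_stack([np.ones(n), x])
    coefs, *_ = np.linalg.lstsq(X, y_filled, rcond=None)
    resid_sd = np.std(y_filled - X @ coefs)

    # Step 2: sample new imputations from a predictive distribution
    # (imputation = predicted value + random noise).
    pred = X @ coefs
    y_filled[miss] = pred[miss] + rng.normal(scale=resid_sd, size=miss.sum())
```

A full Bayesian sampler would also draw the regression parameters from their posterior rather than fixing them at least squares estimates; the sketch keeps only the alternation that the paragraph describes.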
FIGURE 10.8. Posterior distribution of 10,000 indirect effect (i.e., product of coefficients estimator) estimates from MCMC estimation. [Density plot: relative probability on the vertical axis; the product of coefficients estimator (indirect effect) on the horizontal axis, spanning 0 to 4.]
The potential scale reduction factor (Gelman & Rubin, 1992) diagnostic indicated
that the MCMC algorithm converged in fewer than 200 iterations, so I used 11,000 total
iterations with a conservative 1,000-iteration burn-in period. Table 10.7 summarizes the
posterior distributions of the mediation model parameters. In the interest of space, I omit
the auxiliary variable and covariate model parameters from the table, because they are
not the substantive focus. The product of coefficients estimator is a deterministic aux-
iliary parameter obtained by multiplying the α and β coefficients from each iteration.
Figure 10.8 shows the posterior distribution of the 10,000 indirect effect estimates, with
a solid line denoting the median estimate and dashed lines indicating the 95% credible
interval boundaries. The posterior median was Mdnαβ = 1.52, meaning that the change
from mild or moderate pain to severe pain increased depression by 1.52 points via pain
interference with daily life. The 95% credible intervals were asymmetrical around the
distribution’s center and spanned from 0.68 to 2.52. Applying null hypothesis-like logic,
we can conclude that the parameter value is unlikely to equal zero, because the null
value falls outside the credible interval.
Maximum Likelihood
Next, I used maximum likelihood and structural equation modeling software to esti-
mate the mediation model in Figure 10.7. This approach incorrectly assumes that the
binary pain severity indicator is normally distributed, but computer simulations suggest
that this misspecification is often benign (Muthén et al., 2016); earlier analysis examples
support this conclusion. Finally, I used Graham’s (2003) saturated correlates approach
to incorporate the three additional auxiliary variables into the model. Recall that this
specification uses correlated residuals to connect the auxiliary variables to the analysis
variables and to each other (see Section 3.10).
The bootstrap is the preferred method for testing indirect effects in the frequentist
framework. As a quick review, the basic idea is to treat the sample data as a surrogate
for the population and draw B samples of size N with replacement. The sampling with
replacement scheme ensures that some data records—and thus missing data patterns—
appear more than once in each sample, whereas others do not appear at all. Drawing
many bootstrap samples (e.g., B > 1,000) and fitting the mediation model to each data set
produces an empirical sampling distribution of the product of coefficients estimator. The
percentile bootstrap defines the 95% confidence interval as the 2.5 and 97.5% quantiles
of the empirical sampling distribution, and the bias-corrected bootstrap adjusts these
quantiles to compensate for any difference between the pooled point estimate and the
center of the bootstrap sampling distribution (Efron, 1987; MacKinnon, 2008, p. 334).
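A minimal sketch of the two interval types, assuming the bootstrap estimates are already in hand (the function names are mine, and the bias correction is the standard z0 quantile adjustment):

```python
import numpy as np
from statistics import NormalDist

def percentile_ci(boot_estimates, level=0.95):
    """95% interval: the 2.5% and 97.5% quantiles of the bootstrap distribution."""
    alpha = 1 - level
    return np.percentile(boot_estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def bias_corrected_ci(boot_estimates, point_estimate, level=0.95):
    """Shift the quantiles by z0, the normal quantile of the proportion of
    bootstrap estimates falling below the original point estimate.
    (Undefined if that proportion is exactly 0 or 1.)"""
    alpha = 1 - level
    nd = NormalDist()
    z0 = nd.inv_cdf(float(np.mean(np.asarray(boot_estimates) < point_estimate)))
    lo = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    return np.percentile(boot_estimates, [100 * lo, 100 * hi])
```

When exactly half of the bootstrap estimates fall below the point estimate, z0 = 0 and the bias-corrected interval collapses to the percentile interval.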
The top panel of Table 10.8 gives the path coefficients and 95% confidence interval limits from the percentile and bias-corrected bootstraps. Based on its Type I error
rates, the literature seems to favor the percentile bootstrap (Chen, 2018; Fritz, Taylor,
& MacKinnon, 2012), but I report both for completeness. The bias correction induced a slight adjustment to the indirect effect's confidence limits, but it had virtually no impact on other intervals.

TABLE 10.8 (excerpt: multiple imputation panel; each triplet of columns gives the estimate and 95% confidence limits, first for the percentile bootstrap and then for the bias-corrected bootstrap)

Parameter                      Est.   2.5%   97.5%    Est.   2.5%   97.5%
PAIN → INTERFERE (α)           8.49   6.70   10.27    8.49   6.70   10.28
INTERFERE → DEPRESS (β)        0.18   0.09    0.28    0.18   0.09    0.28
PAIN → DEPRESS (τ′)            1.92  –0.03    3.87    1.92  –0.03    3.88
Indirect effect (αβ)           1.56   0.73    2.45    1.56   0.76    2.48
Focusing on the mediated effect, the product of coefficients estimate was αβ = 1.57,
meaning that the change from mild or moderate pain to severe pain increased depres-
sion by 1.57 points via pain interference with daily life. The 95% confidence interval
limits, which spanned from 0.73 to 2.45, were asymmetrical around the point estimate,
because the bootstrap sampling distribution was asymmetrical (its shape was similar
to the posterior distribution in Figure 10.8). The interval indicates that the indirect
effect is significantly different from zero, because it does not include the null value. The
component path coefficients were also statistically significant, but the direct effect of
severe pain on depression was not significant after accounting for the indirect pathway.
Despite attacking the problem very differently, maximum likelihood estimation and the
bootstrap produced results that were numerically equivalent to the Bayesian analysis,
albeit with different interpretations.
Multiple Imputation
I used fully conditional specification with latent variables to create multiple imputations
for the mediation analysis. The imputation model included the three focal variables
and the three auxiliary variables (control, stress, and anxiety). I used a latent response
formulation for the incomplete binary variable, and I treated the 7-point stress rating (a
complete variable) as continuous (Rhemtulla et al., 2012).
Following the procedure from Section 7.4, fully conditional specification imputes
variables one at a time by stringing together a series of regression models, one per incom-
plete variable. This example requires four regressions, three of which are linear and one of which is a probit model for the binary pain severity indicator. To illustrate, the linear imputation model for the depression scores is as follows:

DEPRESSi = γ01 + γ11(INTERFEREi) + γ21(PAIN*i) + γ31(ANXIETYi) + γ41(CONTROLi) + γ51(STRESSi) + r1i    (10.24)
Notice that the latent response variable appears as a regressor on the right side of the
equation (the classic formulation of fully conditional specification in MICE would
instead use the binary indicator). As a second example, the probit imputation model for
the latent pain severity scores is shown below:

PAIN*i = γ02 + γ12(DEPRESSi) + γ22(INTERFEREi) + γ32(ANXIETYi) + γ42(CONTROLi) + γ52(STRESSi) + r2i    (10.25)

The residual variance is fixed at one to establish a metric, and the model also includes a
threshold parameter that divides the latent distribution into two regions.
After estimating various sets of regression model parameters, MCMC samples new
imputations from posterior predictive distributions based on the updated model param-
eters. For example, the missing depression scores are sampled from a normal distribu-
tion with center and spread equal to a predicted score and residual variance, respectively
(i.e., imputation = predicted value + noise). MCMC generates latent variable imputations
for the binary pain indicator, and the location of the continuous imputes relative to the
threshold parameter induces corresponding discrete values (e.g., a latent score below the
threshold implies little or moderate pain, and a continuous score above the threshold
implies a severe pain rating).
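The thresholding logic can be sketched as follows; the predicted latent scores and the threshold value are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

threshold = 0.35           # hypothetical threshold parameter
pred = rng.normal(size=8)  # hypothetical predicted latent scores

# Latent imputations: predicted value plus unit-variance probit noise.
latent_imputes = pred + rng.normal(size=pred.size)

# A latent score above the threshold implies severe pain (1); a score
# below the threshold implies little or moderate pain (0).
discrete_imputes = (latent_imputes > threshold).astype(int)
```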
There are at least four ways to apply the bootstrap to multiply imputed data (Scho-
maker & Heumann, 2018). I focus on the two procedures that have the best support from
the literature. A multiple imputation nested within bootstrapping approach performs
resampling first and imputation second (Zhang & Wang, 2013; Zhang, Wang, & Tong,
2015). This procedure first creates B incomplete data sets by drawing bootstrap samples
with replacement from the original data, and it then applies multiple imputation to
create M complete data sets from each bootstrap sample. Reversing the process gives a
bootstrapping nested within multiple imputation procedure that performs imputation
first and resampling second (Wu & Jia, 2013). This approach first applies multiple impu-
tation to the data, and it then draws B bootstrap samples with replacement from each of
the M complete data sets. The analysis phase fits a model to each of the B × M data sets,
and the resulting estimates mix to form empirical sampling distributions.
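The bootstrapping-nested-within-multiple-imputation ordering can be sketched with toy data and a stand-in estimator (a sample mean replaces the mediation model fit; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for M completed data sets produced by multiple imputation.
M, n, B = 5, 40, 100
imputed_datasets = [rng.normal(size=n) for _ in range(M)]

def estimator(sample):
    return sample.mean()  # placeholder for fitting the mediation model

# Resample each of the M completed data sets B times with replacement,
# fit the model to every bootstrap sample, and pool all B * M estimates.
estimates = []
for data in imputed_datasets:
    for _ in range(B):
        boot_sample = rng.choice(data, size=n, replace=True)
        estimates.append(estimator(boot_sample))

empirical_distribution = np.array(estimates)  # B * M estimates in total
```

Reversing the two loops (resample first, then impute each bootstrap sample M times) would give the multiple-imputation-nested-within-bootstrapping variant.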
I illustrate bootstrapping within multiple imputation, because it is more convenient
to implement. After creating M = 100 imputations, I fit the path model from Figure 10.7
to each imputed data set and used Rubin’s (1987) pooling rules to combine parameter
estimates. The pooled indirect effect is the average of the product of coefficient estimates
from the M filled-in data sets (not the product of the average coefficients, α̂ and β̂).
ᾱβ̄ = (1/M) Σ_{m=1}^{M} α̂m β̂m    (10.26)
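The distinction in Equation 10.26 (average the per-imputation products rather than multiplying the pooled coefficients) can be checked numerically with hypothetical per-imputation estimates:

```python
import numpy as np

# Hypothetical alpha-hat and beta-hat estimates from M = 4 imputed data sets.
alpha_hat = np.array([8.2, 8.6, 8.4, 8.8])
beta_hat = np.array([0.16, 0.20, 0.17, 0.19])

# Pooled indirect effect: the mean of the per-imputation products.
pooled_indirect = np.mean(alpha_hat * beta_hat)

# Product of the pooled coefficients -- close, but not the same quantity.
product_of_means = np.mean(alpha_hat) * np.mean(beta_hat)
```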
Next, I used sampling with replacement to create B = 500 nested bootstrap samples for
each of the M = 100 imputed data sets, and I fit the mediation model to each bootstrap
sample. The B × M = 50,000 estimates mix to form an empirical sampling distribution
that reflects within- and between-imputation variation.
The bottom panel of Table 10.8 gives the path coefficients and 95% confidence
interval limits from the percentile bootstrap and bias-corrected bootstrap. To reiterate,
the 2.5 and 97.5% quantiles of the empirical bootstrap distribution define the percentile
bootstrap confidence interval, and the bias-corrected method shifts these quantiles
to account for the discrepancy between the pooled point estimate from Equation 10.26
and the center of the empirical sampling distribution (see MacKinnon, 2008, p. 334; Wu
& Jia, 2013). Focusing on the mediated effect, the product of coefficients estimate was ᾱβ̄ = 1.56, meaning that the change from mild or moderate pain to severe pain increased
depression by 1.56 points via pain interference with daily life. The 95% confidence inter-
val limits, which spanned from 0.73 to 2.45, were asymmetrical around the point esti-
mate, because the bootstrap sampling distribution was skewed (its shape was similar to
the posterior distribution in Figure 10.8). The interval indicates that the indirect effect
is significantly different from zero, because it does not include the null value. The com-
ponent path coefficients were also statistically significant, but the direct effect of severe
pain on depression was not significant after accounting for the indirect pathway. The
multiple imputation and maximum likelihood results were effectively equivalent, and
the numeric estimates closely matched the Bayes analysis (albeit with different perspec-
tives on inference). The close correspondence of these methods has been a recurring
theme throughout the book.
Structural equation modeling analyses introduce unique challenges for missing data
handling, because they often involve large numbers of categorical variables (e.g., ques-
tionnaire or test items as indicators of a latent factor) and specialized analytic tasks
related to global and local fit assessments. I use the eating disorder risk data from the
companion website to illustrate a confirmatory factor analysis with item-level missing-
ness. The data comprise body mass index scores and 12 Eating Attitudes Test question-
naire items (Garner, Olmsted, Bohr, & Garfinkel, 1982) from a sample of N = 200 female
college athletes. Seven items were intended to measure a drive for thinness construct
that reflects excessive concern or preoccupation with weight gain, and five items mea-
sured dieting behaviors. Figure 10.9 shows a path diagram of the two-factor model. All
items used 6-point rating scales, and the stems are found in the Appendix and in Table
10.9.
Researchers have multiple options for fitting factor models with ordinal indicators
to complete data sets (e.g., see Jöreskog & Moustaki, 2001; Wirth & Edwards, 2007).
Perhaps the most common approach is to simply treat questionnaire items as continu-
ous and normally distributed. A second option is FIML estimation with a probit link
function for the discrete indicators. Following ideas from Chapter 6, the probit model
views each questionnaire item as arising from a normally distributed latent response variable, the distribution of which is separated into discrete segments by a set of threshold parameters. The resulting factor analysis model describes the correlation structure of these latent response variables (i.e., polychoric correlations) rather than the variances and covariances of the discrete indicators. Weighted (or diagonally weighted) least squares (Finney & DiStefano, 2013; Muthén, 1984; Muthén, du Toit, & Spisic, 1997) is a third option that also targets latent variable associations.

FIGURE 10.9. Two-factor structure for 12 questionnaire items. Seven items measure a drive for thinness construct that reflects excessive concern or preoccupation with weight gain, and five items measure dieting behaviors. [Path diagram: two correlated factors, DRIVE THINNESS and DIETING, each with arrows to their respective items.]
All things being equal, you might expect that estimators for categorical data are
preferable, because they are theoretically more correct, but that isn’t necessarily the
case. With complete data, full information estimation for item-level factor analysis is
restricted to simple models with few factors and indicators, and missing data analy-
ses are no different. Although the two-factor model in Figure 10.9 is relatively simple,
the corresponding saturated or unrestricted model—a multivariate contingency table
rather than the usual sample means and variance–covariance matrix—is too complex
to estimate. For example, a saturated model for just two of the 6-point items is a 6 × 6
contingency table with 36 cells, a model for three of the items has 6 × 6 × 6 = 216 cells,
and so on. As you can imagine, the multivariate contingency table for 12 ordinal items is intractably large. Unfortunately, the absence of a saturated model rules out global fit assessments.

TABLE 10.9 (excerpt: dieting behavior items; each pair of columns gives a loading estimate and its standard error)

Item                                                 Est.   SE     Est.   SE
Aware of the calorie content of foods that I eat.    0.75   0.04   0.76   0.04
Particularly avoid food with a high carbohydrate.    0.58   0.06   0.65   0.06
Avoid foods with sugar in them.                      0.59   0.06   0.67   0.06
Eat diet foods.                                      0.72   0.05   0.76   0.05
Engage in dieting behavior.                          0.83   0.04   0.88   0.03
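The cell-count growth is simple exponentiation; a one-line sketch (the function name is mine):

```python
# Cell count for the saturated contingency table of k items,
# each measured on a 6-point rating scale.
def saturated_cells(n_items, n_categories=6):
    return n_categories ** n_items
```

Two items give 36 cells, three give 216, and all 12 items give more than two billion cells, which is why the full multivariate table cannot serve as a saturated comparison model.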
Although conceptually similar, weighted least squares is a limited information
estimator that works from bivariate contingency tables (and associations) rather than
high-dimensional multivariate data. Estimation happens in two steps. The first stage
estimates threshold parameters and the polychoric correlation for each pair of latent
variables, and the second stage engages an iterative optimization routine that minimizes
the sum of squared standardized differences between the first stage (saturated model)
estimates and the thresholds and correlations predicted by the factor analysis model.
Simulation studies show that weighted least squares estimators tend to require relatively
large sample sizes to achieve their optimal properties (Rhemtulla et al., 2012; Satorra &
Bentler, 1988, 1994; Savalei, 2014), perhaps much larger than this data set. Importantly,
this estimator assumes an MCAR mechanism, because the first stage uses pairwise dele-
tion to estimate the polychoric correlations.
I focus on maximum likelihood and multiple imputation for this example, because
they readily connect with familiar complete-data structural equation modeling proce-
dures. Maximum likelihood estimation provides two routes: Treat the ordered-categorical
indicators as normally distributed variables or use full information estimation with a
probit link function to model the latent response variables. Multiple imputation creates
complete sets of item responses that are amenable to normal-theory maximum likeli-
hood estimation or weighted least squares. A third option is to save the latent response
variables with the imputed data sets and use the continuous variables as indicators.
μ(θ) = Λα + υ
Σ(θ) = ΛΨΛ′ + Θ    (10.27)

Rescaling corrects the test statistic to have the same expected value or mean as an optimal statistic computed from
multivariate normal data (see Section 2.11). Simulation studies suggest that robust test
statistics may counteract the biasing effects of non-normal missing data in some situ-
ations (Enders, 2001; Rhemtulla et al., 2012; Savalei & Bentler, 2005, 2009; Savalei &
Falk, 2014; Yuan & Bentler, 2000; Yuan & Zhang, 2012). The rescaled chi-square from
the analysis was statistically significant, TSB(53) = 106.48, p < .001, indicating that the
two-factor model did not adequately explain the sample variances and covariances.
Researchers routinely supplement the model fit statistic with other absolute or rel-
ative fit indices (McDonald & Ho, 2002). Popular options include the Tucker–Lewis
Index or non-normed fit index (TLI or NNFI; Bentler & Bonett, 1980; Tucker & Lewis,
1973), the comparative fit index (CFI; Bentler, 1990), and the root mean square error of
approximation (RMSEA; Browne & Cudeck, 1992; Steiger, 1989, 1990; Steiger & Lind,
1980). Incremental fit indices such as the TLI and CFI compare the relative fit of two
nested models, the first of which is the hypothesized model (e.g., the confirmatory fac-
tor analysis model), and the second of which is a more restrictive null or baseline model.
With certain exceptions (e.g., longitudinal growth curves; Widaman & Thompson,
2003), the usual baseline model includes means and variances but fixes all correlations
to zero.
The TLI and CFI give the proportional improvement of the hypothesized model rel-
ative to that of the baseline model (e.g., TLI = .95 means that the hypothesized model’s
fit is a 95% improvement over that of the baseline model). These indices are

TLI = (T0/df0 − TLR/df) / (T0/df0 − 1)
CFI = 1 − max(TLR − df, 0) / max(TLR − df, T0 − df0, 0)    (10.28)

where T0 and TLR are the chi-square statistics for the null (baseline) and hypothesized models, respectively, and df0 and df are their corresponding degrees of freedom. In contrast, the RMSEA is an absolute index that estimates population misfit of the hypothesized model per degree of freedom.

RMSEA = √( max(TLR − df, 0) / (df(N − 1)) )    (10.29)
Robust versions of these indices replace normal-theory test statistics with their rescaled
counterparts.
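Assuming the standard complete-data formulas for these indices, the computations can be sketched as follows; plugging in the rescaled chi-square reported above, TSB(53) = 106.48 with N = 200, reproduces a robust RMSEA of .071:

```python
import math

def tli(t0, df0, t, df):
    """Tucker-Lewis index from baseline (t0, df0) and hypothesized (t, df) chi-squares."""
    return (t0 / df0 - t / df) / (t0 / df0 - 1)

def cfi(t0, df0, t, df):
    """Comparative fit index (assumes the baseline model misfits, so the denominator is nonzero)."""
    return 1 - max(t - df, 0) / max(t - df, t0 - df0, 0)

def rmsea(t, df, n):
    """Root mean square error of approximation with the N - 1 convention."""
    return math.sqrt(max(t - df, 0) / (df * (n - 1)))
```

The TLI and CFI here use hypothetical baseline values in the test; the chapter does not report the baseline chi-square, so only the RMSEA can be checked against the text.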
Consistent with the global chi-square test, robust indices indicated that fit is inad-
equate by conventional standards: TLI = .921, CFI = .936, and RMSEA = .071 (Hu &
Bentler, 1999). Researchers routinely use modification indices (also known as score
tests) to identify specific sources of misfit in models such as this. The modification
index is a chi-square statistic that reflects the predicted change in model fit that would
result from a single additional path (MacCallum, 1986; Sörbom, 1989). These tests have
a long history in the structural equation modeling literature but require caution, because
they capitalize on chance (Bollen, 1989; Byrne, Shavelson, & Muthén, 1989; Kaplan,
1990; MacCallum, 1986; MacCallum, Roznowski, & Necowitz, 1992; Whittaker, 2012).
The analysis produced three large modification indices. The first pointed to an omitted
cross-loading from the dieting behavior factor to the drive for thinness item “Think
about burning up calories when I exercise.” The modification index was χ2(1) = 10.02
(p < .01), and the predicted value of the omitted standardized loading was .46. The other
two large indices involved residual covariances between pairs of dieting behavior items.
The first indicated that adding a covariance between “Particularly avoid food with a high
carbohydrate . . . ” and “Avoid foods with sugar in them” would significantly improve
model fit, χ2(1) = 18.92 (p < .001), and the second predicted a similar improvement for
the covariance between “Eat diet foods” and “Engage in dieting behavior,” χ2(1) = 19.68
(p < .001). The large projected values of the residual correlations (.39 and .55, respec-
tively) further point to these omitted paths as important sources of misfit.
Imputation Models
There are several compelling reasons to use multiple imputation with structural equa-
tion models, not the least of which is its flexibility with mixtures of categorical and con-
tinuous variables. A number of recent papers have extended procedures for fit assess-
ments to multiply imputed data (Chung & Cai, 2019; Enders & Mansolf, 2018; Lee
& Cai, 2012; Mansolf, Jorgensen, & Enders, 2020), and researchers now have the full
complement of tools necessary to carry out structural equation modeling analyses.
The joint model imputation and fully conditional specification procedures that I
apply to the example are agnostic in the sense that they do not impose a particular struc-
ture or pattern on the means and associations. In the parlance of the structural equa-
tion modeling literature, the imputation phase uses a saturated or just-identified model
(Bollen, 1989; Kline, 2015) that spends all available degrees of freedom. For the joint
model, this is an unrestricted mean vector and covariance matrix, and fully conditional
specification uses an equivalent set of regressions. It is also possible to employ a model-
based approach that uses a factor analysis model for imputation (e.g., H0 imputation;
Asparouhov & Muthén, 2010c). Model-based imputation is attractive from a precision
perspective, because it requires far fewer parameters. For example, the factor model in
Figure 10.9 has 53 degrees of freedom, one for each omitted path or restriction placed
on the covariance matrix. Employing a restrictive imputation procedure to this analysis
would limit the range of models that could be estimated and compared, because the
resulting imputations presuppose perfect fit (e.g., global tests of model fit would be
suspect). Nevertheless, this approach warrants consideration when the number of indi-
cators is very large relative to the sample size, because it should converge more reliably
than an unrestricted imputation model.
The joint modeling framework invokes a multivariate normal distribution for the 12
latent response variables and numerical body mass index scores, and the corresponding
imputation model parameters are a mean vector and variance–covariance matrix. The
13-dimensional normal distribution for the focal variables and auxiliary variables is as
follows:
(DRIVE*1i, …, DRIVE*7i, DIETING*1i, …, DIETING*5i, BMIi)′ ~ N13(μ, Σ)    (10.30)

with mean vector μ = (μ1, …, μ13)′ and covariance matrix

    ⎡ 1                                             ⎤
    ⎢ ⋮      ⋱                                      ⎥
    ⎢ σ7⋅1   ⋯   1                                  ⎥
Σ = ⎢ σ8⋅1   ⋯   σ8⋅7    1                          ⎥
    ⎢ ⋮                  ⋱                          ⎥
    ⎢ σ12⋅1  ⋯   σ12⋅7   σ12⋅8  ⋯   1               ⎥
    ⎣ σ13⋅1  ⋯   σ13⋅7   σ13⋅8  ⋯   σ13⋅12   σ13²   ⎦
Following established notation, the asterisk superscripts denote latent response vari-
ables, the variances of which are fixed at one to establish a scale. The model also incor-
porates five threshold parameters per item that divide the latent continuum into discrete
segments (see Section 6.4).
Fully conditional specification uses a sequence of regression models to impute vari-
ables in a round-robin fashion. I use a fully latent version of the procedure that models
associations among continuous and latent response variables. This approach invokes a
linear regression for the body mass index and a probit regression for each categorical
variable. To illustrate, the probit imputation model for the first drive for thinness item
is as follows:
DRIVE*1i = γ0 + γ1(DRIVE*2i) + ⋯ + γ6(DRIVE*7i) + γ7(DIETING*1i) + ⋯ + γ11(DIETING*5i) + γ12(BMIi) + ri    (10.31)
The latent response variable’s residual variance is fixed at one to establish its scale, and
the model also requires five threshold parameters (one of which is fixed) that divide
the underlying latent distribution into six discrete segments. As a second example, the
linear regression imputation model for body mass index is shown below:
BMIi = γ0 + γ1(DRIVE*1i) + ⋯ + γ7(DRIVE*7i) + γ8(DIETING*1i) + ⋯ + γ12(DIETING*5i) + ri    (10.32)
Unlike the conventional MICE specification, the fully latent version of fully conditional
specification features latent response variables as regressors on the right side of all
imputation models.
After sequentially estimating various sets of model parameters, MCMC samples
new imputations from posterior predictive distributions based on the updated param-
eter values. For example, the missing body mass index scores are sampled from a normal
distribution with center and spread equal to a predicted score and residual variance,
respectively (i.e., imputation = predicted value + noise). The probit regression models
produce latent response variable imputations for the entire sample (recall that latent
scores are restricted to a particular region of the distribution if the discrete response is
observed, and they are unrestricted otherwise). The location of the continuous imputes
relative to the estimated threshold parameters induces discrete imputes for the filled-in
data sets. The estimated latent response scores that spawn the categorical responses can
also be saved alongside the discrete imputes.
TABLE (excerpt: dieting behavior items; each pair of columns gives a loading estimate and its standard error)

Item                                                 Est.   SE     Est.   SE     Est.   SE
Aware of the calorie content of foods that I eat.    0.74   0.04   0.79   0.04   0.76   0.04
Particularly avoid food with a high carbohydrate.    0.58   0.06   0.68   0.04   0.65   0.06
Avoid foods with sugar in them.                      0.59   0.06   0.73   0.05   0.67   0.06
Eat diet foods.                                      0.72   0.05   0.75   0.04   0.77   0.04
Engage in dieting behavior.                          0.82   0.04   0.88   0.03   0.88   0.03

You might expect the imputation-based standard errors to be somewhat larger than those of the direct estimator in Table 10.9, because the initial
imputation stage employs an unrestricted model that spends all 53 of the factor model’s
degrees of freedom (Collins et al., 2001). However, this doesn’t appear to be the case, as
the two sets of standard errors were quite similar. For all intents and purposes, imputing
the data gave the same results as direct maximum likelihood estimation based on the
observed data. This has been a recurring theme throughout the book.
Evaluating global model fit is a standard step in any structural equation modeling
analysis. Fit assessments nearly always include a test statistic comparing the research-
er’s model (e.g., the two-factor model) to an optimal saturated model that places no
restrictions on the data. Meng and Rubin’s (1992) pooled likelihood ratio statistic serves
this purpose for multiply imputed data (Enders & Mansolf, 2018; Lee & Cai, 2012), and
the so-called D2 statistic (Li, Meng, et al., 1991; Rubin, 1987) is another option for pool-
ing goodness-of-fit tests.
The naive Meng and Rubin (1992) statistic that assumes normality was statisti-
cally significant, χ2(53) = 130.10, p < .001, indicating that the two-factor model did not
adequately explain the sample variances and covariances. The test can be robustified by
computing the Satorra and Bentler (1994) scaling factor from each data set and using
the arithmetic average to rescale the chi-square statistic (Jorgensen, Pornprasertmanit,
Schoemann, & Rosseel, 2021). The rescaled chi-square was also statistically significant,
TSB(53) = 107.36, p < .001, but it was a much closer match to that of the direct estimator.
A second option is to compute the rescaled test statistic for each data set and use the
D2 procedure to pool the chi-square values. The pooled rescaled statistic was well cali-
brated to the other robust options, TD2(53) = 110.46, p < .001.
At present, relatively little is known about the behavior of these fit statistics with
multiply imputed data. Limited simulation results suggest that the Meng and Rubin
chi-square may lack power relative to direct maximum likelihood estimation (Enders
& Mansolf, 2018). Additionally, the test statistic is not invariant to changes in model
parameterization, and using different identification constraints (e.g., setting a loading to
one rather than fixing the factor variance) will lead to different test statistics; the test is
similar to the Wald statistic in this respect (Gonzalez & Griffin, 2001). In practice, the
changes to the test statistic across different parameterizations tend to be very small and
should not materially impact decisions about model fit (Enders & Mansolf, 2018). This
is not an issue for the D2 procedure.
Returning to the fit indices in Equations 10.28 and 10.29, a natural way to compute
imputation-based versions of these measures is to substitute pooled chi-square statistics
into the expressions (Enders & Mansolf, 2018; Lee & Cai, 2012; Muthén & Muthén,
1998–2017). Robust indices based on pooled Satorra–Bentler chi-square statistics gave
TLI = .920, CFI = .935, and RMSEA = .086. Again, the literature offers little guidance on
the behavior of these fit measures, but limited simulation results suggest that Meng and
Rubin’s (1992) likelihood ratio statistic works well when data are multivariate normal
(Enders & Mansolf, 2018). Considered as a whole, the fit statistics suggest the two-factor
model doesn’t adequately describe the correlations in the data (Hu & Bentler, 1999).
I previously used modification indices (MacCallum, 1986; Sörbom, 1989) to iden-
tify potential sources of model misfit, and these diagnostics were recently developed
for multiply imputed data (Mansolf et al., 2020). Consistent with the direct estimation
results, the analysis produced three large modification indices. The first pointed to an
omitted cross-loading from the dieting behavior factor to the drive for thinness item
“Think about burning up calories when I exercise.” The modification index was χ2(1) =
10.82 (p < .01), and the predicted value of the omitted standardized loading was .43. The
other two large indices involved residual covariances between pairs of dieting behavior
items. The first indicated that adding a covariance between “Particularly avoid food
with a high carbohydrate . . . ” and “Avoid foods with sugar in them” would significantly
improve model fit, χ2(1) = 26.65 (p < .001), and the second predicted a similar improve-
ment for the covariance between “Eat diet foods” and “Engage in dieting behavior,” χ2(1)
= 23.04 (p < .001). The large projected values of the residual correlations (.43 and .55,
respectively) further point to these omitted paths as important sources of misfit.
I generated the previous analysis results by feeding imputed data sets into a capable
structural equation modeling program. Lee and Cai (2012) outlined an alternative two-
stage estimation strategy that uses the pooled mean vector and covariance matrix as
input data. In fact, their procedure is the multiple imputation analogue of the two-stage
maximum likelihood estimator described in Section 3.10 (Savalei & Bentler, 2009). The
first stage of the procedure uses multiple imputation to treat the missing data, and the
second stage uses the pooled mean vector and covariance matrix as input data to the
classic maximum likelihood discrepancy function shown below:
f(θ | μ̂, Ŝ) = −ln|ŜΣ⁻¹(θ)| + (μ̂ − μ(θ))′Σ⁻¹(θ)(μ̂ − μ(θ)) + tr{ŜΣ⁻¹(θ)} − V    (10.33)
In this context, μ̂ and Ŝ are pooled estimates of the sample means and variance–
covariance matrix, and V is the number of variables; the discrepancy function is other-
wise the same as that found in the structural equation modeling literature (Bollen, 1989;
Jöreskog, 1969; Kaplan, 2009).
The maximum likelihood estimator identifies the factor model parameters that
minimize the difference between the first stage estimates in μ̂ and Ŝ and the model-
implied moments in μ(θ) and Σ(θ). Because the discrepancy function makes no refer-
ence to observed data values, it incorrectly assumes there are no missing values. This
has no bearing on the estimates, but standard errors and model-fit statistics will be
too small, because they fail to account for imputation noise (between-imputation varia-
tion). To counteract this problem, Lee and Cai (2012) use the between-imputation varia-
tion of the sample moments and a key result from Browne’s (1984) famous paper on
distribution-free estimation to derive an adjustment to the standard errors and model
fit statistic. Their correction is analogous to the one described by Savalei and Bentler
(2009) for maximum likelihood estimation, and a SAS macro for implementing the two-
stage approach is available on the Internet.
Weighted least squares estimation for categorical indicators fits the factor model to
polychoric correlations among the underlying latent variables. The procedure happens
in two stages. The first
stage estimates threshold parameters (i.e., the z-score cutoff points that divide the latent
distribution into discrete segments) and the polychoric correlation for each pair of latent
variables, and the second stage engages an iterative optimization routine that minimizes
the sum of squared standardized differences between the first stage estimates and the
thresholds and correlations predicted by the factor analysis model. The fit function that
gives these weighted discrepancies is
f(θ | σ̂) = (σ̂ − σ(θ))′W⁻¹(σ̂ − σ(θ))   (10.34)
where σ̂ is a vector containing the thresholds and latent variable correlations from the
first stage, σ(θ) is the corresponding vector of model-predicted thresholds and correla-
tions, and W is a weight matrix that standardizes the squared deviation scores (e.g., the
variance–covariance matrix of the estimates or the diagonal of that matrix in the case of
diagonally weighted least squares).
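Equation 10.34 is a short computation in code. A minimal Python sketch (the function name and the `diagonal` flag are mine; the diagonal option corresponds to diagonally weighted least squares):

```python
import numpy as np

def wls_discrepancy(sigma_hat, sigma_model, W, diagonal=False):
    """Weighted least squares fit function (Equation 10.34).
    sigma_hat: first-stage thresholds and polychoric correlations;
    sigma_model: model-implied values; W: weight matrix."""
    if diagonal:                   # diagonally weighted least squares
        W = np.diag(np.diag(W))
    d = sigma_hat - sigma_model
    return d @ np.linalg.inv(W) @ d
```

With W equal to the variance–covariance matrix of the first-stage estimates, each squared deviation is standardized by its sampling variability, so imprecisely estimated thresholds and correlations contribute less to the fit function.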
Weighted least squares is referred to as a limited-information estimator, because
the initial stage derives polychoric correlations on a pairwise basis from two-way
cross-tabulation tables that ignore the multivariate distribution of the categorical data
(Maydeu-Olivares & Joe, 2005; Olsson, 1979; Rhemtulla et al., 2012). Estimating these
latent variable correlations requires complete data, and pairwise deletion is the default
option in software packages. Filling in the data prior to estimating the polychoric cor-
relations is a better solution, because it maximizes the sample size and requires a less
stringent conditionally MAR process where missingness depends on the observed data.
To illustrate, I applied weighted least squares estimation to the filled-in data sets from
the previous example. Chung and Cai (2019) extended the aforementioned two-stage
estimator (Lee & Cai, 2012) to categorical variables, so their approach is an alternative
to what I describe here.
The middle panel of Table 10.10 gives the standardized factor loadings and their
standard errors. A desirable feature of complete-data weighted least squares estimation
is that it provides a model fit statistic and the usual selection of fit indices. Liu and Sri-
utaisuk (2019) proposed using the D2 procedure to pool weighted least squares test statis-
tics, and their computer simulations support this strategy, particularly when the analysis
model includes variables with little or no missing data that correlate with the incomplete
variables. Small sample size aside, this example is an optimal application, because each
latent factor has indicators with little or no missing data. The pooled chi-square from the
factor analysis was significant, χ2(53) = 100.92, p < .001, and substituting the test statistic
into the earlier fit expressions gave TLI = .582, CFI = .665, and RMSEA = .067.
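The D2 procedure pools the M chi-square statistics rather than the parameter estimates. A sketch of the combining rules, written from the published formulas in Li, Meng, et al. (1991) rather than from any particular software package:

```python
import numpy as np

def pool_d2(chi_squares, df):
    """D2 pooling of M chi-square statistics with df degrees of freedom
    (Li, Meng, et al., 1991). Returns the pooled statistic D2 and the
    denominator degrees of freedom nu; D2 is referred to an F(df, nu)
    distribution."""
    d = np.asarray(chi_squares, dtype=float)
    M = len(d)
    # relative increase in variance due to nonresponse
    r = (1 + 1 / M) * np.var(np.sqrt(d), ddof=1)
    D2 = (d.mean() / df - r * (M + 1) / (M - 1)) / (1 + r)
    nu = df ** (-3 / M) * (M - 1) * (1 + 1 / r) ** 2
    return D2, nu
```

The input chi-squares and degrees of freedom below are arbitrary illustrations, not values from the eating disorder analysis.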
A third option analyzes the latent response imputations themselves, which provide
continuous estimates of the various concerns, preoccupations, and behaviors captured
by the questionnaire items. The procedure is conceptually equivalent to full information estimation
with a probit link but provides a mechanism for estimating a saturated model and evalu-
ating model fit. As mentioned elsewhere, generating latent replacements for categori-
cal indicators has a rich history in the psychometrics literature and is routinely used
in large-scale assessment settings where latent imputations are referred to as plausible
values (Asparouhov & Muthén, 2010d; Lee & Cai, 2012; Mislevy, 1991; Mislevy, Beaton,
Kaplan, & Sheehan, 1992; von Davier, Gonzalez, & Mislevy, 2009).
Because the latent item responses have 100% missing data, more data sets are needed
to maximize precision and minimize Monte Carlo simulation error. In my experience,
increasing the number of imputations from 100 to 500 can have a meaningful impact on
test statistics and probability values, with additional increases providing diminishing
returns. To this end, I created M = 500 filled-in data sets by saving the imputations and
latent variable scores from the final iteration of 100 parallel MCMC chains with 10,000
iterations each. The latent data are normal by construction, so a conventional maximum
likelihood estimator is optimal, and no robust corrections are necessary. After fitting the
two-factor model to each latent data set, I used Rubin’s (1987) pooling rules to combine
the results. The rightmost panel in Table 10.10 shows the pooled standardized factor
loadings and standard errors. The latent response variable analysis produced noticeable
discrepancies with normal-theory and weighted least squares estimation, with several
loadings that differed by up to one standard error unit. However, the estimates were
effectively equivalent to those of the direct full information estimator with a probit link
(see the rightmost panel in Table 10.9). In fact, fully conditional specification with latent
variables can be viewed as a multiple imputation analogue of full information estimation.
Even with this relatively simple model, direct maximum likelihood estimation was
incapable of generating a model fit statistic, and numerical integration precludes the use
of modification indices. Both are available when analyzing the latent imputations. Meng
and Rubin’s (1992) test statistic was significant, χ2(53) = 111.28, p < .001, and the cor-
responding fit indices were as follows: TLI = .895, CFI = .915, and RMSEA = .074. The
TD2 (or D2) statistic (Li, Meng, et al., 1991; Rubin, 1987) was previously well calibrated
to the likelihood ratio test, but it is now noticeably larger in value, χ2(53) = 136.68, p <
.001. The literature suggests that TD2 loses its good statistical properties when the frac-
tions of missing information are very high (Grund et al., 2016c; Li, Meng, et al., 1991; Liu
& Sriutaisuk, 2019), as they are here, because the latent response variables have 100%
missing data. Unless future methodological research indicates otherwise, the current
literature suggests that TD2 is inappropriate for fit assessments with latent imputations.
The modification indices revealed the same sources of misfit described earlier, so no
further discussion is warranted.
Researchers collecting self-report data routinely use questionnaires with multiple items
that tap into different features of the construct being measured. When analyzing such
data, the focus is usually a scale score that sums or averages items that measure a common
theme. Returning to the chronic pain data, consider a linear regression analysis
where depression, gender, and a binary severe pain indicator (0 = no, little, or moder-
ate pain, 1 = severe pain) influence psychosocial disability (a construct capturing pain’s
impact on emotional behaviors such as psychological autonomy and communication,
emotional stability, etc.). The focal analysis model is as follows:
DISABILITYi = β0 + β1(DEPRESSi) + β2(MALEi) + β3(PAINi) + εi   (10.35)
I used the disability and depression scales in earlier examples without mentioning that
the former is the sum of six 6-point questionnaire items and the latter is a composite
of seven 4-point rating scales (see Appendix). The item-level missingness rates range
between 1.5 and 4.7%, and the disability and depression scale scores have 9.1 and 13.45%
of their values missing, respectively (a scale score is missing if at least one of its compo-
nents is missing).
Item-level missing data can occur for a variety of reasons. Among other things,
a participant may inadvertently skip items or refuse to answer certain questions, an
examinee may fail to complete a full set of cognitive items in the allotted time, or a
researcher may employ a planned missing data design that intentionally omits a sub-
set of items from each respondent’s questionnaire form (Graham et al., 2006). Perhaps
the most common way to deal with item-level missing data is to compute a prorated
scale score that averages the observed responses. For example, if a respondent answered
four out of seven depression items, the scale score would be the average of those four
responses. The missing data literature also describes this procedure as averaging the
available items (Schafer & Graham, 2002) and person mean imputation, because it is
equivalent to imputing missing item responses with the average of each participant’s
observed scores (Huisman, 2000; Peyre et al., 2011; Roth et al., 1999; Sijtsma & van der
Ark, 2003).
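For reference, person mean imputation is trivial to express in code. A Python sketch (the function name is mine):

```python
import numpy as np

def prorated_scale_score(items):
    """Prorated scale score: the person mean of the observed items,
    rescaled to the metric of the full item sum. Equivalent to imputing
    each missing response with the participant's own observed mean.
    items: 2-D array (persons x items) with np.nan marking missing."""
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    return n_items * np.nanmean(items, axis=1)
```

A respondent who answered four of seven depression items with a mean of 2.5 would receive a prorated score of 7 × 2.5 = 17.5, exactly as if the three missing responses had been filled in with 2.5.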
Despite their widespread use, prorated scale scores have important limitations that
should deter their use. For one, the method assumes an MCAR process where missing-
ness is unrelated to the data; this puts it on par with deletion procedures that discard
incomplete data records. A second, and perhaps more problematic, feature is that pro-
ration requires all item means to be the same and all pairs of variables to have equal
correlations (Graham, 2009; Mazza et al., 2015; Schafer & Graham, 2002). This popular
procedure is prone to bias when these strict assumptions are not satisfied (Mazza et al.,
2015).
This section describes two broad approaches for analyzing composites with item-
level missing data: factored regression models and agnostic multiple imputation. The
former uses a now-familiar factoring strategy to augment the focal analysis model with
supporting regressions for the incomplete questionnaire items, whereas the latter fills in
the missing items’ responses with no regard to the scale score structure. All things being
equal, the two approaches generally give the same answer, although factored regression
is advantageous when the number of items is very large relative to the sample size (a
situation where item-level multiple imputation routines often fail to converge). The key
feature of both methods is that they leverage strong sources of item-level correlation in
the data, thereby maximizing power and precision (Gottschall, West, & Enders, 2012).
To develop the approach, consider a generic focal model in which two scale scores, X
and Z, predict an outcome Y:

Yi = β0 + β1Xi + β2Zi + εi   (10.36)
Furthermore, suppose that each scale is the sum of four questionnaire items: X1 to X4
and Z1 to Z4. For now, assume that Y is a numerical variable rather than a composite. A
factored regression specification expresses the multivariate distribution of the depen-
dent variable and the regressor items as a sequence of univariate distributions, as fol-
lows:
f(Y, X1, X2, X3, X4, Z1, Z2, Z3, Z4) = f(Y | X1, X2, X3, X4, Z1, Z2, Z3, Z4) ×
f(X1 | X2, X3, X4, Z1, Z2, Z3, Z4) × f(X2 | X3, X4, Z1, Z2, Z3, Z4) ×   (10.37)
f(X3 | X4, Z1, Z2, Z3, Z4) × f(X4 | Z1, Z2, Z3, Z4) × f(Z1 | Z2, Z3, Z4) ×
f(Z2 | Z3, Z4) × f(Z3 | Z4) × f(Z4)
Notice that the first term following the equals sign has the same structure as the focal
model (i.e., the outcome to the left of the pipe and predictors to its right), but it features
items rather than scales.
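The chain-rule structure of Equation 10.37 is mechanical: each variable is conditioned on every variable that follows it in the sequence. A small Python sketch that generates the factorization as (outcome, predictors) pairs (the function name is mine):

```python
def factored_spec(focal_outcome, predictors):
    """Build the sequence of univariate models implied by Equation 10.37:
    the focal outcome regressed on all items, then each item regressed
    on the items that follow it in the ordering."""
    spec = [(focal_outcome, list(predictors))]
    for i, var in enumerate(predictors):
        spec.append((var, list(predictors[i + 1:])))
    return spec

items = ["X1", "X2", "X3", "X4", "Z1", "Z2", "Z3", "Z4"]
spec = factored_spec("Y", items)
```

For the eight items this yields nine terms, ending with the marginal distribution f(Z4), which has an empty predictor list.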
A regression model with scale scores can be viewed as placing restrictions or con-
straints on the associations in f(Y|X1, X2, X3, X4, Z1, Z2, Z3, Z4). To illustrate, the equation
below rewrites the regression as a function of the item responses:
Yi = β0 + β1Xi + β2Zi + εi = β0 + β1(X1i + X2i + X3i + X4i)
+ β2(Z1i + Z2i + Z3i + Z4i) + εi   (10.38)

Expressed as a conditional distribution, the scale score regression is therefore an item-
level model with equality constraints, where matching alphanumeric superscripts
denote coefficients that are constrained to be equal:

f(Y | X1, X2, X3, X4, Z1, Z2, Z3, Z4) = f(Y | X1^(a), X2^(a), X3^(a), X4^(a), Z1^(b), Z2^(b), Z3^(b), Z4^(b))   (10.39)
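The equivalence between a scale score regression and a constrained item-level regression can be checked numerically: when data are generated under the equality constraints, an unconstrained item-level regression recovers approximately equal slopes within each item set. A simulated Python sketch; the sample size, seed, and coefficients are arbitrary illustrations:

```python
import numpy as np

# simulate items whose sums drive Y, mirroring Equation 10.38
rng = np.random.default_rng(12345)
n = 500
X_items = rng.normal(size=(n, 4))            # four X questionnaire items
Z_items = rng.normal(size=(n, 4))            # four Z questionnaire items
X, Z = X_items.sum(axis=1), Z_items.sum(axis=1)
Y = 2.0 + 0.5 * X - 0.3 * Z + rng.normal(size=n)

# scale-score regression: intercept, one X slope, one Z slope
A_scale = np.column_stack([np.ones(n), X, Z])
b_scale = np.linalg.lstsq(A_scale, Y, rcond=None)[0]

# unconstrained item-level regression: nine free coefficients; because
# the data obey the equality constraints, the four X slopes all hover
# near 0.5 and the four Z slopes near -0.3
A_items = np.column_stack([np.ones(n), X_items, Z_items])
b_items = np.linalg.lstsq(A_items, Y, rcond=None)[0]
```

In this idealized setting the constrained model loses nothing; in real data the constraints are an approximation whose adequacy depends on how similar the item-level associations actually are.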
Looking beyond the focal model, the terms in the second and third rows of Equation
10.40 correspond to a sequence of probit regression models where each questionnaire
item is regressed on some subset of the others. Item-level missing data handling can be
computationally challenging with smaller samples, because the number of parameters
that accumulates across these supporting item-level models can be very large, especially
when scales have many items. The categorical nature of the questionnaire items adds
to this challenge. Additional constraints can simplify estimation while still exploiting
strong item-level associations. For example, the factorization below illustrates between-
scale constraints that assume each X item has the same association with all Z items:
f(Y | X1^(a), X2^(a), X3^(a), X4^(a), Z1^(b), Z2^(b), Z3^(b), Z4^(b)) ×
f(X1 | X2, X3, X4, Z1^(c), Z2^(c), Z3^(c), Z4^(c)) × f(X2 | X3, X4, Z1^(d), Z2^(d), Z3^(d), Z4^(d)) ×   (10.40)
f(X3 | X4, Z1^(e), Z2^(e), Z3^(e), Z4^(e)) × f(X4 | Z1^(f), Z2^(f), Z3^(f), Z4^(f)) ×
f(Z1 | Z2, Z3, Z4) × f(Z2 | Z3, Z4) × f(Z3 | Z4) × f(Z4)
The alphanumeric superscripts show that the constraints reduce the 16 coefficients link-
ing the Z items to the X items (four Z slopes in each of the four X-item regressions) to
just four common slopes, one per regression.
Further constraints can be imposed if the sample size is still too small to support
estimation (e.g., the MCMC algorithm fails to converge). The factorization below illus-
trates within-scale constraints that assume each X item has the same association with a
subset of other X items (e.g., X1 is assumed to have a common association with X2, X3,
and X4):
f(Y | X1^(a), X2^(a), X3^(a), X4^(a), Z1^(b), Z2^(b), Z3^(b), Z4^(b)) ×
f(X1 | X2^(c), X3^(c), X4^(c), Z1^(d), Z2^(d), Z3^(d), Z4^(d)) × f(X2 | X3^(e), X4^(e), Z1^(f), Z2^(f), Z3^(f), Z4^(f)) ×   (10.41)
f(X3 | X4, Z1^(g), Z2^(g), Z3^(g), Z4^(g)) × f(X4 | Z1^(h), Z2^(h), Z3^(h), Z4^(h)) ×
f(Z1 | Z2, Z3, Z4) × f(Z2 | Z3, Z4) × f(Z3 | Z4) × f(Z4)
When the dependent variable is itself a composite, the factorization gains supporting
terms that link the scale score to all but one of its component items (here, Y1 and Y2):

f(Y1, Y2, Y, X, Z) = f(Y1 | Y2, Y) × f(Y2 | Y) × f(Y | X, Z) × f(X | Z) × f(Z)   (10.42)

The first two terms following the equals sign are supporting models that link the com-
posite to its items (e.g., a pair of probit regressions), the third term corresponds to the
focal analysis model (e.g., Equation 10.36), and the last two terms are supporting regres-
sor models (which could also look like Equation 10.37 with composite predictors).
Three points are worth highlighting. First, neither f(Y1|Y2, Y) nor f(Y2|Y) include X
or Z, because I assume that items are conditionally independent of the predictors after
controlling for the scale score. This is tantamount to assuming that Y items do not cross-
load on a factor with X or Z items. Second, the supporting item-level regressions are
designed to leverage collinearity between the dependent variable and its components,
and the factorization essentially applies the idea of treating questionnaire items as aux-
iliary variables (Eekhout et al., 2015b; Mazza et al., 2015). Third, the factorization neces-
sarily leaves out one item to avoid perfect linear dependencies. In practice, virtually any
combination of items will convey roughly the same amount of information to the scale
score, so excluding the item with the highest missing data rate is a good strategy.
The factored regression specification readily accommodates auxiliary variables
with additional terms on the left side of the factorization prior to the analysis variables.
To illustrate, consider a scenario where Y is measured with three items (Y1 to Y3), and X
and Z are each measured by a pair of items (X1 and X2, and Z1 and Z2). The factorization
for a model with a single auxiliary variable and between-scale constraints is as follows:
f(A | Y, X1^(a), X2^(a), Z1^(b), Z2^(b)) × f(Y1 | Y2, Y) × f(Y2 | Y) ×
f(Y | X1^(c), X2^(c), Z1^(d), Z2^(d)) × f(X1 | X2, Z1^(e), Z2^(e)) ×   (10.43)
f(X2 | Z1^(f), Z2^(f)) × f(Z1 | Z2) × f(Z2)
Figure 10.10 shows the factorization as a path diagram, with the dashed rectangles
enclosing the X and Z items denoting composite scores. The factorization reduces model
complexity by imposing two innocuous constraints on the auxiliary variable’s regres-
sion model. First, the model features the Y scale score as a predictor but not its items
(i.e., the partial regression slopes for Y1 and Y2 are fixed at 0). Second, equality con-
straints are placed on coefficients linking the X and Z items to the auxiliary variable.
Collectively, these constraints transmit the auxiliary variable’s information to the scale
scores rather than the individual items.
The factored regression specification places the emphasis on the scale scores, and
the item-level regressions are simply a device for accessing sources of strong correla-
tion in the data. The specification can be implemented with maximum likelihood or
Bayesian estimation, and the latter could also generate model-based multiple imputa-
tions. The imputed data would include the dependent variable scale score, all but one
of the dependent variable’s items, and all items from the regressor scales (but not the
scale scores themselves, which are obtained by summing the filled-in item responses).
The absence of one Y item is not a problem, as you would simply analyze the imputed
scale scores without regard to the items. Agnostic multiple imputation approaches like
the joint model and fully conditional specification are ideally suited for filling in item
responses without imposing a scale score structure, and I illustrate that procedure later
in this section.
FIGURE 10.10. Path diagram of a factored regression model with a single auxiliary variable
and between-scale constraints. The dashed rectangles enclosing the X and Z items represent the
scale scores.
Bayesian Estimation
Returning to the chronic pain data and the analysis model in Equation 10.35, the depres-
sion measure is the sum of seven 4-point rating scales, and psychosocial disability is a
composite of six 6-point questionnaire items. I use Bayesian estimation to apply the
factored regression approach to the linear regression model, and I also include the per-
ceived control over pain and pain interference with daily life scales as auxiliary vari-
ables, as both are correlates of the analysis variables.
Although the number of questionnaire items is not very large relative to the sam-
ple size, I reduced model complexity by excluding the psychosocial disability items
from the auxiliary variable models and imposing equality constraints on the associa-
tions between the auxiliary variables and the depression items (e.g., a single coefficient
described the regression of pain interference on the seven depression items). The speci-
fications mimic those in Equation 10.43 and Figure 10.10. The factored regression model
for the analysis is shown below:
f(INTERFERE | CONTROL, DISABILITY, DEP1^(a), …, DEP7^(a), PAIN, MALE) ×
f(CONTROL | DISABILITY, DEP1^(b), …, DEP7^(b), PAIN, MALE) ×
f(DIS1* | DIS2, …, DIS5, DISABILITY) × … × f(DIS5* | DISABILITY) ×   (10.44)
f(DISABILITY | DEP1^(c), …, DEP7^(c), PAIN, MALE) ×
f(DEP1* | DEP2, …, DEP7, PAIN, MALE) × … × f(DEP7* | PAIN, MALE) ×
f(PAIN* | MALE) × f(MALE)
The first two terms are the auxiliary variable models. For example, the term linking
perceived control over pain to the analysis variables translates into the linear regression
model below:
CONTROLi = γ0 + γ1(DISABILITYi) + γ2(DEP1i + … + DEP7i) + γ3(PAINi) + γ4(MALEi) + ri   (10.45)
Notice that the equation features an equality constraint where the seven depression
items share a common slope coefficient (i.e., the auxiliary variable links to the depres-
sion scale score rather than individual items). The next group of terms link all but the
last psychosocial disability item to the corresponding scale score. For example, the
f(DIS1|DIS2, . . ., DIS5, DISABILITY) term translates into a probit regression model for the
first disability item’s underlying latent response variable.
As always, the residual variance is fixed at one to establish the metric, and the model
also requires a set of threshold parameters, one of which is fixed at zero.
The focal model, which appears in the third row from the bottom, is an item-level
linear regression with equality constraints on the regression slopes as follows:
DISABILITYi = β0 + β1(DEP1i + … + DEP7i) + β2(MALEi) + β3(PAINi) + εi
The practical impact of these constraints is that β1 reflects the association between the
depression scale score and the dependent variable. The supporting models for the pre-
dictors appear in the bottom two rows of Equation 10.44, all of which are probit regres-
sions with latent response scores as dependent variable (including the binary severe
pain indicator). Finally, the last term is the marginal distribution of the gender dummy
code. I ignore this term, because this variable is complete and does not require a distri-
bution. Keller (2022) describes a Metropolis–Hastings step that streamlines the intro-
duction of these equality constraints, such that a researcher only needs to specify the
scale score structure in a format that mimics Equation 10.47.
After updating several sets of regression model parameters, MCMC samples new
imputations from posterior predictive distributions based on the model parameters. Fol-
lowing ideas established in Chapters 5 and 6, the missing values follow complex, mul-
tipart functions that depend on every model in which a variable appears. For example,
the conditional distribution of the psychosocial disability scale scores given everything
else is the product of eight univariate distributions: two induced by the auxiliary vari-
able models, five from the item-level regressions, and one from the focal analysis. Simi-
larly, the distributions of the missing depression items depend on a latent response vari-
able distribution and other models where the discrete response appears as a predictor.
Importantly, the depression scale score itself is never the target of imputation; rather,
item responses are chosen that make sense when summed together with other items.
I began with an exploratory MCMC chain and used trace plots and potential scale
reduction factor diagnostics (Gelman & Rubin, 1992) to evaluate convergence. This step
is especially important with categorical questionnaire items, because the probit model’s
threshold parameters tend to converge slowly and require long burn-in periods (Cowles,
1996; Nandram & Chen, 1996). Based on this diagnostic run, I specified an MCMC pro-
cess with 10,000 iterations following an initial burn-in period of 20,000 iterations. Table
10.11 summarizes the posterior distributions of the regression model parameters. In the
interest of space, I omit the auxiliary variable and item-level regressions from the table,
because they are not the substantive focus. Importantly, the substantive interpretations
make no reference to the supporting item-level regressions and match those of a scale
score analysis. For example, the β1 coefficient indicates that a one-unit change in the
depression scale score predicts a 0.27 increase in the psychosocial disability scale score,
controlling for gender and pain (Mdnβ1 = 0.27, SD = 0.04).
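The potential scale reduction factor compares between- and within-chain variability. A minimal Python sketch of the diagnostic for a single parameter (assuming equal-length chains; the split-chain refinement used by modern software is omitted):

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor.
    chains: 2-D array of posterior draws, shape (n_chains, n_iterations)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Values near 1.0 indicate the chains have mixed; a common rule of thumb flags parameters with values above roughly 1.05, which is why slowly converging threshold parameters warrant long burn-in periods.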
Maximum Likelihood
Maximum likelihood offers a frequentist alternative for deploying the factored regres-
sion specification. As explained in Chapter 3, a mixture of categorical and continuous
variables requires an iterative optimization procedure known as numerical integration
that fills in the missing parts of the data in an imputation-esque fashion. Applying the
factorization in Equation 10.44 produced the estimates in the top panel of Table 10.12.
Perhaps not surprisingly, the point estimates and standard errors were numerically
equivalent to the posterior medians and standard deviations. This has been a repetitive
theme throughout the book.
Multiple Imputation
Agnostic multiple imputation is well suited for item-level missing data, because it fills in
the missing values without imposing a structure or pattern on the means and associa-
tions (although it could, if desired). Several studies have investigated the use of multiple
imputation with item-level missing data, and they are unequivocally supportive (Eek-
hout et al., 2014; Finch, 2008; Gottschall et al., 2012; Peyre et al., 2011; van Buuren,
2010).
An important practical issue is whether to impute the scale scores themselves or
the individual questionnaire items (the imputation model cannot include both, because
a scale score is a perfect linear combination of its items).
Multiple imputation

Parameter       Est.    SE     t      df      p       FMI
β0              17.59   0.67   26.33  260.16  < .001  .04
β1 (DEPRESS)     0.27   0.04    6.35  256.82  < .001  .05
β2 (MALE)       –0.80   0.53   –1.51  259.66  .13     .04
β3 (PAIN)        1.75   0.59    2.97  248.28  < .001  .07
σε2             16.86   —      —      —       —       —
R2                .20   —      —      —       —       —
The joint model approach applies a multivariate normal distribution to the item-level
data, with latent response variables replacing the discrete variables:

(DEP1*i, …, DEP7*i, DIS1*i, …, DIS6*i, MALE*i, INTERFEREi, CONTROLi)′ ~ N16(μ, Σ)   (10.48)
By now you are familiar with the fact that latent response variables replace discrete vari-
ables, and the corresponding diagonal elements in the variance–covariance matrix are
set to one to establish a metric. The model also incorporates three threshold parameters
per depression item, five thresholds for each disability item, and a single fixed threshold
for the male dummy code.
A practical limitation of item-level imputation is that the number of parameters
can be quite large relative to the sample size. A model-based imputation strategy that
imposes a factor structure on the questionnaire items can dramatically reduce the num-
ber of parameters while still leveraging item-level covariation. Importantly, the model-
based imputations assume the two-factor model is correct, so any global fit assessments
will be overly favorable. Nevertheless, this approach warrants consideration when the
number of indicators is very large, because it should converge more reliably than an
unrestricted imputation model.
Fully conditional specification uses a sequence of regression models to impute vari-
ables in a round-robin fashion. I use a fully latent version of the procedure that models
associations among continuous and latent response variables. This approach invokes a
linear regression for the auxiliary variables and probit regressions for each categorical
variable. To illustrate, the probit imputation model for the first depression item is as
follows:
DEP1*i = γ0 + γ1(DEP2*i) + … + γ6(DEP7*i) + γ7(DIS1*i) + … + γ12(DIS6*i)
+ γ13(INTERFEREi) + γ14(CONTROLi) + γ15(MALEi) + ri   (10.49)
As always, the latent response variable’s residual variance is fixed at one to establish its
scale. The model also requires three threshold parameters (one of which is fixed) that
divide the underlying latent distribution into four discrete segments.
After updating all model parameters, MCMC samples new imputations from pos-
terior predictive distributions based on the model parameters. For example, numerical
variables are sampled from a normal distribution with center and spread equal to a
predicted score and residual variance, respectively (i.e., imputation = predicted value +
noise). The procedure creates latent response imputations for the entire sample (recall
that latent scores are restricted to a particular region of the distribution if the discrete
response is observed, and they are unrestricted otherwise), and the location of these
continuous imputes relative to the estimated threshold parameters induces discrete val-
ues for each filled-in data set.
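The mapping from latent imputes to discrete responses is a simple lookup against the estimated thresholds. A Python sketch with hypothetical threshold values:

```python
import numpy as np

def discretize(latent, thresholds):
    """Convert latent response imputations to ordinal categories: the
    category is the number of thresholds that fall below the latent
    score, so K thresholds induce K + 1 response options."""
    return np.digitize(latent, thresholds)

# three thresholds -> four ordinal categories (e.g., a 4-point item)
tau = np.array([0.0, 0.8, 1.5])
imputes = np.array([-1.2, 0.3, 0.9, 2.4])
categories = discretize(imputes, tau)
```

Because the thresholds are themselves updated at each MCMC iteration, the same latent impute can map to different categories across data sets, which is one route by which imputation uncertainty propagates into the analysis.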
Following earlier examples, I created M = 100 filled-in data sets. Prior to creating
imputations, I performed an exploratory analysis and used trace plots and potential
scale reduction factor diagnostics (Gelman & Rubin, 1992) to evaluate convergence. As
mentioned previously, this step is especially important when imputing questionnaire
items, because the many threshold parameters tend to converge slowly and require long
burn-in periods. Based on this diagnostic run, I specified 100 parallel imputation chains
with 20,000 iterations each, and I saved the data at the final iteration of each chain. After
creating the multiple imputations, I computed the scale scores from the filled-in item
responses and fit the regression model from Equation 10.35 to the data. Finally, I used
Rubin’s (1987) pooling rules to combine the estimates and standard errors and applied
Barnard and Rubin’s (1999) degrees of freedom expression to the significance tests. The
bottom panel of Table 10.12 summarizes the multiple imputation results, which were
identical to those of maximum likelihood (and Bayesian estimation).
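Rubin's (1987) rules combine the M sets of estimates and standard errors. A compact Python sketch for a single parameter (the function name is mine):

```python
import numpy as np

def rubin_pool(estimates, std_errors):
    """Rubin's (1987) rules: pooled point estimate and standard error
    from M imputed-data analyses of one parameter."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    M = len(q)
    q_bar = q.mean()
    u_bar = u.mean()                  # within-imputation variance
    b = q.var(ddof=1)                 # between-imputation variance
    total = u_bar + (1 + 1 / M) * b
    return q_bar, np.sqrt(total)
```

The between-imputation component b is what the naive two-stage analysis in Equation 10.33 ignores; the (1 + 1/M) factor corrects for using a finite number of imputations.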
The factored regression specification for scale scores readily extends to models with
interaction effects. Continuing with the chronic pain data, consider a moderated regres-
sion where the influence of depression on psychosocial disability differs by gender.
I used this model in earlier chapters to illustrate missing data handling for interaction
effects, and this section shows how to accommodate the scale structure (depression is
measured with seven 4-point rating scales, and the dependent variable is measured with
six 6-point questionnaire items).
The focal analysis model is

DISABILITYi = β0 + β1(DEPRESSi) + β2(MALEi) + β3(DEPRESSi)(MALEi) + εi   (10.50)

and I use generic notation to develop the specification.
Yi = β0 + β1Xi + β2Mi + β3XiMi + εi   (10.51)
where X is a composite (e.g., the depression scale score), and M is a numeric variable
(e.g., the male dummy code). Deviating from the analysis example, suppose that X is
the sum of three questionnaire items: X1 to X3. The dependent variable could also be a
composite, but for now I treat it as a numerical variable.
Extending ideas from the previous section, a moderated regression with a scale score
can be viewed as an item-level analysis that imposes equality constraints on regression
slopes. The equation below rewrites Equation 10.51 as a function of the item responses:
Yi = β0 + β1Xi + β2Mi + β3XiMi + εi
   = β0 + β1(X1i + X2i + X3i) + β2Mi + β3(X1i + X2i + X3i)Mi + εi   (10.52)
   = β0 + (β1X1i + β1X2i + β1X3i) + β2Mi + (β3X1iMi + β3X2iMi + β3X3iMi) + εi
As you can see, the focal model is equivalent to an item-level regression where questions
from the same scale share a common slope coefficient, as do the collection of product
terms involving the item responses and the moderator.
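The algebra in Equation 10.52 is easy to confirm numerically: distributing β1 and β3 over the item sums leaves the predicted scores unchanged. A quick Python check with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
X_items = rng.normal(size=(10, 3))    # three X questionnaire items
M = rng.normal(size=10)               # numeric moderator
b0, b1, b2, b3 = 1.0, 0.4, -0.8, -0.25

# scale-score form versus the expanded item-level form of Equation 10.52
X = X_items.sum(axis=1)
scale_form = b0 + b1 * X + b2 * M + b3 * X * M
item_form = (b0 + (b1 * X_items).sum(axis=1) + b2 * M
             + (b3 * X_items * M[:, None]).sum(axis=1))
```

The two vectors of predicted scores agree up to floating-point error, which is the sense in which the scale score model "is" the constrained item-level model.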
The supporting item-level regression models follow the procedure from the previ-
ous section and do not change with the introduction of an interaction. Using generic
notation, the factored regression specification for this example involves the product of
five univariate distributions, each of which corresponds to a regression model.
f(Y | X1^(a), X2^(a), X3^(a), M, X1M^(b), X2M^(b), X3M^(b)) ×
f(X1 | X2, X3, M) × f(X2 | X3, M) × f(X3 | M) × f(M)   (10.53)
As before, the alphanumerical superscripts denote constrained associations.
The procedure readily extends to applications where both X and M are scales. Suppose
that M is the sum of two questionnaire items, M1 and M2. To illustrate this model’s
constraints, the equation below rewrites the focal analysis as a function of the item
responses:
Yi = β0 + β1Xi + β2Mi + β3XiMi + εi
   = β0 + β1(X1i + X2i + X3i) + β2(M1i + M2i) + β3(X1i + X2i + X3i)(M1i + M2i) + εi   (10.54)
   = β0 + (β1X1i + β1X2i + β1X3i) + (β2M1i + β2M2i)
      + (β3X1iM1i + β3X2iM1i + β3X3iM1i + β3X1iM2i + β3X2iM2i + β3X3iM2i) + εi
As you can see, the focal model is equivalent to an item-level regression where questions
from the same scale share common slope coefficients, as does the collection of all
possible product terms involving the two sets of scale items. Although this model seems
extraordinarily cumbersome to specify, Keller (2022) uses a Metropolis–Hastings step
to streamline estimation, such that a researcher only needs to specify the scale score
structure in the following format:
additional terms on the left that link the scale score to all but one of its component
items, as follows:
f(Y1 | Y2, Y) × f(Y2 | Y) × f(Y | X1^(a), X2^(a), X3^(a), M, X1M^(b), X2M^(b), X3M^(b)) ×
f(X1 | X2, X3, M) × f(X2 | X3, M) × f(X3 | M) × f(M)     (10.56)
Auxiliary variables enter the factorization in the same way as Equation 10.43.
At this time, it isn’t possible to estimate this model with maximum likelihood, and
Bayesian estimation and model-based multiple imputation require specialized software
that works with a series of univariate likelihoods instead of a single multivariate distri-
bution (Keller & Enders, 2021).
After examining trace plots and potential scale reduction factors (Gelman &
Rubin, 1992), I specified an MCMC process with 10,000 iterations following an initial
20,000-iteration burn-in period. The top panel of Table 10.13 summarizes the posterior
distributions of the parameters. Recall that lower-order terms are conditional effects
that depend on scaling; Mdnβ1 = 0.38 (SD = 0.06) is the effect of depression on psycho-
social disability for female participants, and Mdnβ2 = –0.80 (SD = 0.54) is the gender
difference at the depression mean. The interaction effect captures the slope difference
for males. The negative coefficient (Mdnβ3 = –0.25, SD = 0.08) indicates that the male
depression slope was approximately one-fourth of a point lower than the female slope
(i.e., the male slope is Mdnβ1 + Mdnβ3 = 0.38 – 0.25 = 0.13). The simple slopes for males
and females resemble those in Figure 5.5.
The same analysis that generates Bayesian summaries of the model parameters
can also generate model-based multiple imputations for a frequentist analysis. To
illustrate the process, I also created M = 100 filled-in data sets by saving the imputa-
tions from the final iteration of 100 parallel MCMC chains. After creating the multiple
imputations, I computed and centered the depression scale score from the filled-in
item responses (imputation fills in the disability scale score) and fit the regression
model from Equation 10.50 to the data. Finally, I used Rubin’s (1987) pooling rules
to combine the estimates and standard errors and applied Barnard and Rubin’s (1999)
degrees of freedom expression to the significance tests. The top panel of Table 10.14
summarizes the multiple imputation estimates, which were numerically equivalent to
the Bayesian results.
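For readers who want to see the arithmetic behind the pooling step, the following Python sketch implements Rubin's (1987) rules with Barnard and Rubin's (1999) degrees of freedom adjustment. The function name and the toy inputs are mine, not the book's software:

```python
import numpy as np

def pool_rubin(estimates, std_errors, df_com):
    """Combine M point estimates and standard errors with Rubin's (1987)
    rules, using Barnard and Rubin's (1999) degrees of freedom."""
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    m = len(q)
    qbar = q.mean()                      # pooled point estimate
    w = np.mean(se ** 2)                 # within-imputation variance
    b = q.var(ddof=1)                    # between-imputation variance
    t = w + (1 + 1 / m) * b              # total sampling variance
    lam = (1 + 1 / m) * b / t            # ~ fraction of missing information
    nu_old = (m - 1) / lam ** 2
    nu_obs = (df_com + 1) / (df_com + 3) * df_com * (1 - lam)
    nu = 1 / (1 / nu_old + 1 / nu_obs)   # Barnard-Rubin adjusted df
    return float(qbar), float(np.sqrt(t)), float(nu)

# hypothetical estimates from M = 5 imputed data sets
est, se, df = pool_rubin([0.36, 0.40, 0.38, 0.41, 0.35],
                         [0.06, 0.06, 0.07, 0.06, 0.06], df_com=270)
print(round(est, 3), round(se, 3))  # → 0.38 0.068
```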
Yi = β0 + β1(ηXi) + β2(Mi) + β3(ηXi)(Mi) + εi     (10.58)
where ηXi is the latent factor score for person i. The interaction term now reflects the
product of a latent and manifest variable. A measurement model linking the factor to the
items replaces the item-level regressions from the scale score model. The factor model
for this example consists of three probit regressions with latent response variables as
outcomes.
[X*1i]   [γ01]   [γ11]         [r1i]
[X*2i] = [γ02] + [γ12] ηXi  +  [r2i]     (10.59)
[X*3i]   [γ03]   [γ13]         [r3i]
Applying ideas from Chapter 6, the residual variances are fixed at one to establish the
latent response variables’ metrics, and each regression additionally requires a set of
threshold parameters, one of which is fixed to zero. Like other latent variables we’ve
encountered, ηX doesn’t have a metric and requires similar identification constraints.
I scale the latent variable by fixing the first factor loading (the γ11 coefficient) equal to
1 (i.e., ηX’s variance is equated to “true score” variation in X1*), and I also set the factor
mean to 0. Finally, note that I use γ’s and r’s for consistency with earlier material, but
it is more common to see measurement intercepts and loadings written as υ and λ in
structural equation modeling applications.
The factored regression specification for the moderated regression analysis now
involves the dependent variable, questionnaire items, and latent factor. Applying the
probability chain rule gives the following factorization, which isn’t in its final form:
f(Y, ηX, M, X1, X2, X3) = f(Y | ηX, M, X1, X2, X3) × f(X1 | X2, X3, ηX, M) ×
f(X2 | X3, ηX, M) × f(X3 | ηX, M) × f(ηX | M) × f(M)     (10.60)
Returning to the measurement model, the latent factor is the sole determinant of the
item responses and, by extension, their means and associations. The factor model stipu-
lates that items correlate with each other, because they share a common predictor (i.e.,
any two items link indirectly via their pathways to the latent factor), and their correla-
tions with other variables like Y and M are also indirect via the latent variable. Said
differently, each item is conditionally independent of all other variables after controlling
for the factor.
The conditional independence assumption simplifies the factorization, as any item
on the right side of a vertical pipe with ηX vanishes. The final model specification is
f(Y | ηX, M, ηX × M) × f(X1 | ηX) × f(X2 | ηX) × f(X3 | ηX) × f(ηX | M) × f(M)     (10.61)
where the first term corresponds to the focal analysis, the next three terms are the
measurement model, the penultimate term links the latent factor to the moderator, and
the final term is the marginal (overall) distribution of the moderator. The final two terms
translate into a pair of linear regression models, one of which is empty.
ηXi = γ04 + γ14(Mi − μM) + r4i     (10.62)
Mi = μM + r5i
Centering the moderator in the top equation defines the intercept coefficient γ04 as the
latent factor mean, which I fix to 0 to identify the model and center ηX.
The MCMC algorithm for this analysis treats the factor scores as missing data to
be estimated, much like the item-level latent response variables. Applying ideas from
earlier chapters, the distribution of these missing values is a multipart function that
depends on every model in which ηX appears. In this case, the conditional distribution
of the factor scores is proportional to the product of five univariate distributions, each of
which corresponds to a normal curve induced by a linear regression model.
f(ηX | Y, M, X1, X2, X3) ∝ f(Y | ηX, M, ηX × M) ×
f(X1 | ηX) × f(X2 | ηX) × f(X3 | ηX) × f(ηX | M)     (10.63)
Deriving the conditional distribution involves multiplying five normal curve equations
and performing algebra that combines the component functions into a single distribu-
tion for ηX. In practice, tthe Metropolis–Hastings algorithm does the heavy lifting of
sampling latent imputations from this complicated distribution.
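To make that sampling step concrete, here is a minimal random-walk Metropolis–Hastings sketch in Python for a single person's factor score. It uses a simplified two-item measurement model and hypothetical parameter values; production samplers are considerably more elaborate than this:

```python
import numpy as np

rng = np.random.default_rng(7)

def log_target(eta, y, m, xstar, pars):
    """Sum of log-normal densities from every model in which eta appears
    (an illustrative two-item analogue of Equation 10.63)."""
    p = pars
    # focal model f(Y | eta, M, eta x M)
    ll = -0.5 * (y - (p["b0"] + p["b1"]*eta + p["b2"]*m + p["b3"]*eta*m))**2 / p["s2e"]
    # measurement model terms f(X* | eta), residual variance fixed at 1
    for j, x in enumerate(xstar):
        ll += -0.5 * (x - (p["g0"][j] + p["g1"][j]*eta))**2
    # f(eta | M)
    ll += -0.5 * (eta - p["a1"]*m)**2 / p["s2eta"]
    return ll

def mh_update(eta, y, m, xstar, pars, step=0.5):
    """One random-walk Metropolis-Hastings draw of a factor score."""
    prop = eta + rng.normal(0, step)
    log_ratio = log_target(prop, y, m, xstar, pars) - log_target(eta, y, m, xstar, pars)
    return prop if np.log(rng.uniform()) < log_ratio else eta

pars = dict(b0=0, b1=0.4, b2=-0.8, b3=-0.25, s2e=1.0,
            g0=[0.0, 0.1], g1=[1.0, 0.9], a1=0.2, s2eta=1.0)
draws, eta = [], 0.0
for _ in range(2000):
    eta = mh_update(eta, y=0.5, m=1.0, xstar=[0.8, 0.6], pars=pars)
    draws.append(eta)
print(round(float(np.mean(draws[500:])), 2))  # posterior mean of the factor score
```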
Because they function as dependent variables, the item-level missing values are
determined solely by the factor model; that is, the probit regressions from Equation
10.59 are the imputation models, and MCMC samples latent response scores from a
univariate normal distribution. To illustrate, the distribution of missing values for X1 is
X*1i ~ N1(E(X*1i | ηXi), 1)     (10.64)
where the predicted value in the function’s first argument is computed by substitut-
ing a person’s current factor score into a probit regression equation. The normal curve
generates latent scores for cases with observed data as well, but the threshold param-
eters restrict those values to a particular region of the distribution. Finally, MCMC sam-
ples missing Y scores from a normal distribution that depends only on the focal model
parameters (imputation = predicted value + noise).
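The threshold-restricted draw can be sketched with inverse-CDF sampling. The function below is my own illustration, not the book's algorithm; it assumes a binary item with residual variance 1 and threshold τ = 0:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def draw_latent_response(mu, observed, tau=0.0):
    """Sample X* ~ N(mu, 1) restricted by the threshold: below tau when the
    observed response is 0, above tau when it is 1 (inverse-CDF method)."""
    lo, hi = (-np.inf, tau) if observed == 0 else (tau, np.inf)
    p_lo, p_hi = norm.cdf(lo - mu), norm.cdf(hi - mu)
    u = rng.uniform(p_lo, p_hi)          # uniform draw within the allowed region
    return mu + norm.ppf(u)

draws = [draw_latent_response(mu=0.3, observed=1) for _ in range(1000)]
print(min(draws) > 0)  # every draw lands in the region implied by the response
```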
The factored regression specification readily accommodates a latent factor as a
dependent variable as well. To illustrate, suppose the outcome is measured with three
questionnaire items, Y1 to Y3. A latent factor ηY replaces the Y scale score in the analysis
model, as follows:
ηYi = β0 + β1(ηXi) + β2(Mi) + β3(ηXi)(Mi) + εi     (10.65)
Reusing previous notation, the measurement model linking the factor to the items again
consists of three probit regressions with latent response variables as outcomes.
[Y*1i]   [γ01]   [γ11]         [r1i]
[Y*2i] = [γ02] + [γ12] ηYi  +  [r2i]     (10.66)
[Y*3i]   [γ03]   [γ13]         [r3i]
As before, the residual variances are fixed at one to establish the latent response vari-
ables’ metrics, and each item’s regression additionally requires a set of threshold param-
eters, one of which is fixed at zero. I scale the latent variable by fixing the first factor
loading (the γ11 coefficient) equal to 1, and I set the structural intercept (the β0 coeffi-
cient) equal to 0 to identify the mean structure.
Assuming that items are conditionally independent of other variables after control-
ling for their respective latent variables gives the following factored regression specifica-
tion:
f(Y1, Y2, Y3, ηY, ηX, M, X1, X2, X3) = f(Y1 | ηY) × f(Y2 | ηY) × f(Y3 | ηY) ×
f(ηY | ηX, M, ηX × M) × f(X1 | ηX) × f(X2 | ηX) × f(X3 | ηX) × f(ηX | M) × f(M)     (10.67)
The first three terms after the equals sign correspond to the measurement model in
Equation 10.66, the fourth term is the focal analysis model from Equation 10.65, and
the remaining terms are the same as before. Like ηX, the MCMC algorithm treats ηY as
a variable to be imputed, and the conditional distribution of the factor scores is propor-
tional to the product of the four univariate distributions in which ηY appears.
f(ηY | M, ηX, Y1, Y2, Y3, X1, X2, X3) ∝ f(Y1 | ηY) ×
f(Y2 | ηY) × f(Y3 | ηY) × f(ηY | ηX, M, ηX × M)     (10.68)
After updating all model parameters, MCMC uses a Metropolis–Hastings step to sample
latent imputations from this complex distribution. Like the X items, the measurement
model solely determines the distributions of the missing Y items.
DISABILITY*i = β0 + β1(DEPRESS*i) + β2(MALEi) + β3(DEPRESS*i)(MALEi)
               + β4(PAINi) + εi     (10.69)
to combine the estimates and standard errors and applied Barnard and Rubin’s (1999)
degrees of freedom expression to the significance tests.
The bottom panel of Table 10.14 summarizes the multiple imputation results.
Because the latent variables are standardized, β̂1 = 0.65 (SE = 0.15) gives the standard
deviation unit change for an increase of one standard deviation in latent depression
scores among female participants, and β̂2 = –0.07 (SE = 0.13) is the standardized gender
difference at the latent depression factor’s mean. The negative interaction coefficient
(β̂3 = –0.35, SE = 0.15) indicates that the male depression slope was approximately half
that of females. It is important to highlight that the fractions of missing information
(the proportion of a squared standard error due to the missing data) were very high (e.g.,
values ranging between .36 and .78), owing to the fact that the latent factor scores have
100% missing data. Although scaling differences preclude a direct comparison with the
composite score analysis in the top panel of Table 10.14, the relative magnitude of the
test statistics suggests that the latent variable analysis had less power. However, there
appears to be a trade-off, as the latent variable analysis produced a substantially larger
R2 statistic, presumably due to the reduction in measurement error.
This section describes longitudinal missing data in the context of a latent growth curve
model. In many applications, there is more than one way to treat longitudinal missing-
ness, and two approaches that seemingly invoke the same conditionally MAR assump-
tion could give different answers (Gottfredson, Sterba, & Jackson, 2017). I use the psy-
chiatric trial data on the companion website to illustrate a growth curve analysis with
three different missing data treatments: maximum likelihood estimation based on a
latent curve analysis, agnostic multiple imputation (fully conditional specification) fol-
lowed by a latent curve analysis, and multilevel multiple imputation. I used these data
extensively in Chapter 9 to illustrate analyses for MNAR processes.
The psychiatric trial data consist of repeated measurements of illness severity rat-
ings measured in half-point increments ranging from 1 (normal, not at all ill) to 7 (among
the most extremely ill). In the original study, the 437 participants with schizophrenia
were assigned to one of four experimental conditions (a placebo condition and three
drug regimens), but the data collapse these categories into a dichotomous treatment
indicator (DRUG = 0 for the placebo group, and DRUG = 1 for the combined medication
group). The researchers collected a baseline measure of illness severity prior to random-
izing participants to conditions, and they obtained follow-up measurements 1 week, 3
weeks, and 6 weeks later. The overall missing data rates for the repeated measurements
were 1, 3, 14, and 23%, and these percentages differed by treatment condition; 19 and
35% of the placebo group scores were missing at the 3-week and 6-week assessments,
respectively, versus 13 and 19% for the medication group. Table 9.3 shows the missing
data patterns.
As a review, a longitudinal growth curve model (also called a linear mixed model
and a multilevel model) is a type of regression where repeated measurements are a
function of a temporal predictor that codes the passage of time, in this case weeks. To
where β0 and β1 are the placebo group’s average intercept and slope, respectively, β2 is
the baseline mean difference for the medication condition, and β3 is the difference in
the mean change rate for this group. The b0i and b1i terms are deviations between the
group-average trajectories and the individual intercepts and slopes. By assumption, the
latent residuals are bivariate normal with a covariance matrix Σb. Replacing β0i and β1i
in Equation 10.70 with the right sides of their expressions from Equation 10.71 reveals
that β3 is a group-by-time interaction effect.
An important feature of this data set is that all participants share the same assess-
ment schedule (i.e., the time scores are constant across participants instead of variable).
Designs like this provide the greatest flexibility for missing data handling, because the
repeated measurements can be treated as separate variables in a multivariate analysis
(i.e., single-level data in wide format) or as a single variable with multiple observations
nested within individuals (i.e., multilevel data in long or stacked format). Because miss-
ingness is relegated to the dependent variable, any multilevel software package that sim-
ply removes measurement occasions with missing data gives optimal maximum likeli-
hood estimates (von Hippel, 2007), and a single-level latent curve structural equation
model gives identical results. The same is true for Bayesian estimation, where multilevel
and structural equation growth models are conceptually equivalent (albeit with differ-
ent parameterizations and different prior distributions).
Single-level and multilevel multiple imputation are two routes that don’t neces-
sarily produce identical results. Single-level agnostic approaches such as joint model
imputation and fully conditional specification do not impose a pattern on the means
or variance–covariance matrix, making it unnecessary to specify a functional form for
growth. For example, the joint modeling approach draws imputations from a multi-
variate normal distribution, where each illness severity rating has a unique mean and
group mean difference, and fully conditional specification deploys a set of round-robin
regression models, where each illness severity variable is imputed conditional on the
treatment assignment indicator and all other repeated measurements. Both approaches
automatically preserve group-specific changes, as well as any nonlinearities that may be
present in the data.
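A bare-bones version of the round-robin logic can be sketched as follows. This illustration fixes the regression coefficients at least-squares values, whereas proper fully conditional specification draws them from their posterior distributions, so treat it as a conceptual sketch with made-up data rather than a usable imputation routine:

```python
import numpy as np

rng = np.random.default_rng(11)

def fcs_step(data, miss_mask):
    """One round-robin pass: regress each incomplete column on all other
    columns and replace its missing cells with prediction + noise."""
    n, k = data.shape
    for j in range(k):
        mis = miss_mask[:, j]
        if not mis.any():
            continue
        others = np.delete(data, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(X[~mis], data[~mis, j], rcond=None)
        resid = data[~mis, j] - X[~mis] @ beta
        sigma = resid.std(ddof=X.shape[1])
        data[mis, j] = X[mis] @ beta + rng.normal(0, sigma, mis.sum())
    return data

# toy example: three correlated repeated measures with 15% missing cells
n = 300
base = rng.normal(size=n)
data = np.column_stack([base + rng.normal(0, 0.5, n) for _ in range(3)])
miss_mask = rng.uniform(size=data.shape) < 0.15
data[miss_mask] = 0.0                      # crude starting values for the holes
for _ in range(10):                        # burn-in passes over the variables
    data = fcs_step(data, miss_mask)
print(bool(np.isfinite(data).all()))       # filled-in data set, no holes remain
```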
In contrast, model-based multilevel imputation adheres exactly to the focal model’s
time trend. For example, using Equation 10.72 as an imputation model presupposes
that the individual and average change trajectories are linear within each condition.
Tailoring the filled-in data to a particular analysis is precisely the goal of model-based
imputation, but it is important to emphasize that the resulting imputations are inappro-
priate for exploring nonlinear change. If you are unsure about the functional form, it is
appropriate to adopt a more complex imputation model and fit simpler nested models to
the data. For example, the following quadratic growth model would likely do a good job
of capturing the nonlinearities in these data, and the resulting imputations would also
support a linear growth curve model
SEVERITYti = (β0 + b0i) + (β1 + b1i)(WEEKti) + β2(WEEKti)² + β3(DRUGi)
             + β4(WEEKti)(DRUGi) + β5(WEEKti)²(DRUGi) + εti     (10.73)
sequential or cross-sequential planned missing designs (see Section 1.9) where single-
level unrestricted imputation models are inestimable.
TABLE 10.15. Growth Curve Estimates for Three Missing Data Methods

                                 ML             MLM            FCS
Effect                       Est.    SE     Est.    SE     Est.    SE
Intercept (β0)               5.35   0.08    5.35   0.09    5.37   0.09
SQRTWEEK (β1)               –0.35   0.06   –0.35   0.07   –0.38   0.07
DRUG (β2)                    0.05   0.10    0.05   0.10    0.03   0.10
SQRTWEEK × DRUG (β3)        –0.63   0.07   –0.63   0.08   –0.60   0.08
Intercept variance (σ²b0)    0.36   0.06    0.36    —      0.38   0.06
Slope variance (σ²b1)        0.23   0.03    0.23    —      0.23   0.04
Covariance (σb0b1)           0.02   0.04    0.03    —      0.01   0.04
Residual variance (σ²ε)      0.59   0.04    0.59    —      0.61   0.04

Note. ML, maximum likelihood; FCS, single-level fully conditional specification; MLM, multilevel
model-based multiple imputation.
I created 100 imputations by saving a single data set from 100 parallel MCMC chains, each
with 2,000 iterations. I then fit the multilevel growth model to
each data set and used Rubin’s (1987) pooling rules to combine the estimates and stan-
dard errors. The middle panel of Table 10.15 shows the resulting estimates, which were
effectively equivalent to a latent curve analysis with maximum likelihood estimation.
As mentioned previously, creating imputations that condition on the individual trajecto-
ries in this way can offer substantial protection against a random coefficient-dependent
missingness process (Gottfredson et al., 2017).
and the corresponding imputation model for the 6-week follow-up scores is as follows:
Again, the equations highlight that the means (regression intercepts) are free to vary at
each time point and do not follow a particular functional form.
I again created 100 imputations by saving a single data set from 100 parallel MCMC
chains, each with 2,000 iterations. I then fit a single-level latent growth model to each
data set and used Rubin’s (1987) pooling rules to combine the estimates and standard
errors. The rightmost panel of Table 10.15 shows the estimates, which have the same
interpretation as the other analyses. Single-level multiple imputation produced modest
but noticeable differences when compared to direct maximum likelihood and multi-
level model-based imputation; the placebo group’s growth rate was lower (steeper, more
negative) by nearly half a standard error unit; and the medication group’s slope differ-
ence was higher (flatter, less negative) by roughly the same amount. One explanation
for these differences is that the estimates are simply noisier, because the wide-format
fully conditional specification requires more parameters, but the standard errors don’t
support this explanation. Although there is no way to know the exact cause, differences
of this magnitude could arise, because the methods that condition on random effects
offer some protection against an MNAR mechanism where the individual intercepts and
slopes determine missingness (Gottfredson et al., 2017). This explanation seems likely
given that the analysis examples in Section 9.13 found support for an MNAR process.
Missing data-handling methods for discrete data have evolved considerably since the
first edition of this book, and earlier chapters and analysis examples illustrated esti-
mation and imputation approaches for binary, ordinal, and multicategorical variables.
Bayesian methods (including model-based multiple imputation) for incomplete count
outcomes are a more recent innovation (Asparouhov & Muthén, 2021b; Neelon, 2019;
Polson et al., 2013), and these routines are beginning to appear in statistical software
packages (Keller & Enders, 2021; Muthén & Muthén, 1998–2017). This section summa-
rizes this approach and provides a data analysis example.
I use substance use data on the companion website to illustrate missing data han-
dling for a regression model with a count outcome. The data set includes a subset of N =
1,500 respondents from a national survey on substance use patterns and health behav-
iors. I previously used these data in Section 10.3 to illustrate a logistic regression with a
dichotomous measure of drinking frequency as the outcome, and this example repeats
that analysis with the number of drinking days per month as the dependent variable.
Drinking frequency features excessive zeros from a large proportion of respondents who
reported no lifetime alcohol use (i.e., so-called “structural zeros”). I excluded these 483
individuals from consideration, thereby defining the population of interest as people who
would potentially consume alcohol. Figure 10.11 shows the observed-data distribution.
Either Poisson or negative binomial regression is appropriate for count outcomes
such as number of drinking days. Both are linear models with the natural logarithm of
the counts as the outcome. The regression model for this example is
ln(ALCDAYSi) = β0 + β1(AGETRYALCi) + β2(COLLEGEi) + β3(AGEi) + β4(MALEi)     (10.77)
[FIGURE 10.11. Histogram of the number of drinking days per month (y-axis: percent of respondents; x-axis: number of drinking days, 0 to 30).]
ALCDAYSi is the predicted number of drinking days per month for individual i,
AGETRYALC is the age at which the respondent first tried alcohol, COLLEGE is a dummy
code indicating some college or a college degree, and MALE is a gender dummy code
with females as the reference group. Approximately 8.4% of the dependent variable
scores are missing, and 15.9% of the educational attainment values are unknown. I
use negative binomial rather than Poisson regression, because the former incorporates
a dispersion parameter that accommodates heterogeneity among individuals with the
same predicted count (the model simplifies to a Poisson regression when the dispersion
parameter equals 0). Interested readers can consult Coxe, West, and Aiken (2009) for an
excellent tutorial on regression models for count data.
The familiar factored regression specification readily accommodates count regres-
sion. The factorization for this analysis features a different dependent variable, but
it otherwise has the same structure as the earlier one from Equation 10.8. Moreover,
Bayesian estimation for count regression models is very similar to the data augmenta-
tion procedure for logistic models described in Section 6.9. To refresh, the procedure
introduces latent response scores and person-specific weights as a rescaling trick that
allows regression coefficients to be estimated using the same machinery as linear regres-
sion models (see Equation 6.52). The MCMC algorithm cycles between four major steps:
Estimate person-specific weights that determine the latent response variable scores,
estimate the regression coefficients given the current latent data and weights, update
the dispersion parameter, and sample discrete imputations by drawing values from a negative
binomial distribution function. This process requires an additional step for the dis-
persion parameter, but it otherwise mimics Bayesian estimation for logistic regression
models. I point interested readers to Polson et al. (2013) and Asparouhov and Muthén
(2021b) for additional details.
I used Bayesian estimation with model-based multiple imputation to estimate the
regression model in Equation 10.77. To reiterate, I used a negative binomial regression
that uses a dispersion parameter to account for unobserved heterogeneity in the counts,
as this approach imposes more flexible assumptions than Poisson regression (e.g., the
model allows for variation among individuals with the same predicted count). This
choice has no bearing on the interpretation of the coefficients, as negative binomial and
Poisson coefficients have the same meaning. After inspecting trace plots and potential
scale reduction factor diagnostics (Gelman & Rubin, 1992), I created 100 filled-in data
sets by saving the imputations from the final iteration of 100 parallel chains, each with
1,000 iterations.
Table 10.16 shows the model-based multiple imputation estimates (not surprisingly,
the Bayesian results that generated the imputations were numerically equivalent). The
slope coefficients reflect the change in the logarithm of the counts for a one-unit change
in a predictor. Although the coefficients don’t reflect the natural metric of the dependent
variable, we can nevertheless conclude from their signs that the number of drinking
days increased for individuals who tried alcohol at an earlier age, attended at least some
college, are older, and are males. The exponentiated coefficients in the rightmost col-
umn reflect the results on the count metric. For example, the intercept is the predicted
number of drinking days for a person with zeros on all predictors (a nonsensical score
profile). Paralleling the logic of odds ratios in logistic regression, the exponentiated
slope coefficients give the multiplicative effect of a one-unit change in the predictors on
the counts. For example, the model predicts that the number of drinking days for males
is 1.79 times that for females, controlling for other variables. Similarly, considering
the effect of trying alcohol at age 19 versus age 18, we would expect the 19-year-old to
drink 93% as many days as the 18-year-old. Finally, the large and significant dispersion
parameter suggests there is residual heterogeneity among individuals who share the
same predicted count.
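The multiplicative interpretation follows directly from the log link, as this small sketch shows. The slope value here is hypothetical, chosen so that exp(β) ≈ 0.93, matching the age-at-first-use effect described above:

```python
import math

# log-link prediction: ln(mu) = b0 + b1*x  =>  mu = exp(b0 + b1*x)
b0, b1 = 1.2, -0.07   # hypothetical values, not the chapter's estimates

def mu(x):
    """Predicted count at predictor value x."""
    return math.exp(b0 + b1 * x)

# effect of trying alcohol at age 19 versus age 18: the ratio of predicted
# counts equals the exponentiated slope, regardless of the starting value
ratio = mu(19) / mu(18)
print(round(ratio, 2) == round(math.exp(b1), 2))  # → True
```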
The final example illustrates power analyses for growth curve models with missing
data. I demonstrate the process for the wave planned missingness designs from Sec-
tion 1.9, as well as for unplanned conditionally MAR data. I use Monte Carlo computer
simulations for this purpose, because they are relatively easy to implement and ideally
suited for a wide range of analysis models beyond longitudinal growth curves (Muthén
& Muthén, 2002). The goal of a computer simulation is to generate many artificial data
sets with known population parameters and examine the distributions of the estimates
and test statistics across those many samples. Each of the artificial data sets has miss-
ing values that follow a desired process, and FIML estimation provides the parameter
estimates. The proportion of artificial data sets that produce a significant test statistic is
a simulation-based power estimate.
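The simulation recipe can be sketched in a few lines of Python. The example below is deliberately stripped down: a single regression slope, MCAR missingness on the outcome, and complete-case least squares (which coincides with maximum likelihood when missingness is confined to the outcome). All sample sizes and effect sizes are arbitrary placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_power(n=100, beta=0.3, miss_rate=0.2, reps=1000, alpha=0.05):
    """Monte Carlo power estimate for a single regression slope with
    MCAR missingness on the outcome."""
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)                    # generate a population model
        y = beta * x + rng.normal(size=n)
        keep = rng.uniform(size=n) > miss_rate    # impose MCAR missingness on y
        res = stats.linregress(x[keep], y[keep])  # complete-case estimate
        if res.pvalue < alpha:
            hits += 1
    return hits / reps                            # proportion of significant tests

print(simulate_power())
```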
Generating realistic population parameters is by far the hardest part of a computer
simulation. For the purposes of illustration, I consider a longitudinal study with five
weekly assessments, and I use the same growth model as the previous example. Using
generic notation, the analysis model and its population values are
where WEEK is the temporal predictor that codes occasions relative to the baseline mea-
surement (i.e., WEEK = 0, 1, 2, 3, and 4), X is a binary between-person or time-invariant
covariate (e.g., intervention status, demographic characteristic), and all other terms are
the same as their counterparts from Equation 10.72.
I used the following process to generate parameter values: To begin, I arbitrarily
fixed the baseline mean and standard deviation equal to 50 and 10, respectively, and
I used .50 as an intraclass correlation (i.e., between-person variation comprised 50%
of the variation, a typical value for longitudinal data). These constraints allowed me
to use effect size expressions from Rights and Sterba (2019) to solve for the growth
model parameters. To induce a small amount of normative growth in the X = 0 group,
the fixed effect of WEEK explained 2% of the within-person variation, and the group-
by-time interaction explained an additional 6% of the variability. To mimic a scenario
where two groups are identical at baseline, I set the R2 value for the between-cluster
effect of X equal to zero. Finally, the random slope variance accounted for 10% of the
total variation (in my experience with longitudinal data, variance explained by the
random slopes is often between 5 and 10%), and the residual correlation between the
random intercepts and slopes was .30. These inputs produced the population param-
eters in Equation 10.78. An R tool that performs these calculations is available on the
companion website.
Enders, 2011; Wu et al., 2016). Next, consider a scenario where the research budget fixes
the total number of assessments that can be collected, but those measurements can be
distributed across a complete-data or wave missing data design. A wave missing data
design with 270 participants requires 810 measurements that could instead be distrib-
uted across 162 participants with complete data. The computer simulations revealed
that the complete-data design was far less efficient and achieved power of only .63—a
20% reduction. This result is also consistent with the literature. Finally, it is instructive
to examine power for a wave missing design with different missingness patterns. Recall
that Brandmaier et al.’s (2020) approach identified (1, 2, 3), (2, 3, 4), and (3, 4, 5) as the
least efficient patterns among the 10 candidates with which I started. Deploying a design
with this configuration required a massive increase to 735 participants to achieve .80
power to detect the group-by-time interaction.
Collectively, the computer simulations suggest that, when done well, wave missing
data designs can achieve nearly optimal power while dramatically reducing data collec-
tion resources and respondent burden. However, the simulations also show that, when
done poorly, planned missingness can result in a catastrophic reduction in power. For-
tunately, it is relatively straightforward to create good designs, as analytic solutions and
Monte Carlo computer simulation make it easy to identify patterns that maximize preci-
sion and vet their power. The best patterns maximize the variability of the time scores.
Mti = 0 if M*ti ≤ τ
      1 if M*ti > τ     (10.79)
where Mti is the missing data indicator for occasion t (0 = observed, 1 = missing),
M*ti is the underlying latent variable, and the threshold parameter τ is fixed to zero
to identify the latent response variable’s mean structure. The predicted probability of
missing data is computed as the area under the normal curve above the threshold (see
Equation 2.67).
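Generating missingness from this latent-probit selection model is straightforward, as the following sketch illustrates. The 0.6 slope and the single predictor are hypothetical choices of mine, not the values used in the chapter:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

n = 10_000
x = rng.normal(size=n)                  # predictor of missingness
mstar = 0.6 * x + rng.normal(size=n)    # latent response, residual variance 1
tau = 0.0                               # threshold fixed at zero
m = (mstar > tau).astype(int)           # indicator from Equation 10.79

# predicted probability of missing data = area above the threshold
p_model = 1 - norm.cdf((tau - 0.6 * x) / 1.0)
print(round(float(m.mean()), 2), round(float(p_model.mean()), 2))
```

The empirical missing data rate matches the average model-implied probability, which is the sense in which the probit model "generates" the missingness.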
The regression model that linked the latent response variable at occasion t to the
observed data is as follows:
To facilitate model specification, I created an R function that uses the desired R2 effect
size (the strength of the MAR mechanism) and the relative contribution of the predic-
tors to solve for the population regression coefficients. This tool is available on the
companion website. Specifying a strong selection mechanism where the two predictors
combined equally to explain 25% of the latent response variable’s variance gave the fol-
lowing regressions (as always, residual variances are fixed at 1).
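The arithmetic behind such a tool is simple under the assumption of uncorrelated, unit-variance predictors: with the residual variance fixed at 1, the explained portion of the latent response variance is R²/(1 − R²), and each slope is the square root of that predictor's share. The function below is my own illustration; the companion-website tool may parameterize things differently:

```python
import math

def probit_selection_slopes(r2, shares, resid_var=1.0):
    """Solve probit slopes so uncorrelated unit-variance predictors jointly
    explain r2 of the latent response variance (residual variance fixed);
    shares gives each predictor's portion of the explained variance."""
    explained = r2 / (1 - r2) * resid_var   # var(M*) = explained + resid_var
    return [math.sqrt(s * explained) for s in shares]

# two predictors contributing equally to R-squared = .25
b1, b2 = probit_selection_slopes(r2=0.25, shares=[0.5, 0.5])
total_var = b1**2 + b2**2 + 1.0
print(round((b1**2 + b2**2) / total_var, 2))  # → 0.25
```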
This chapter has used a series of data analysis examples to illustrate a collection of odds
and ends that include specialized topics, advanced applications, and practical issues.
These topics include missing data handling for descriptive summaries, transformation
methods for non-normal variables, estimation and inferential procedures for path and
structural equation models, missing data handling for incomplete questionnaire items,
and longitudinal analyses. I recommend the following articles for readers who want
additional details on some of the topics from this chapter.
Alacam, E., Du, H., Enders, C. K., & Keller, B. T. (2022). A model-based approach to treating
composite scores with missing items. Manuscript submitted for publication.
Asparouhov, T., & Muthén, B. (2021b). Expanding the Bayesian structural equation, multilevel
and mixture models to logit, negative-binomial and nominal variables. Structural Equation
Modeling: A Multidisciplinary Journal, 28, 622–637.
Brandmaier, A. M., Ghisletta, P., & von Oertzen, T. (2020). Optimal planned missing data
design for linear latent growth curve models. Behavior Research Methods, 52, 1445–1458.
Enders, C. K. (in press). Fitting structural equation models with missing data. In R. Hoyle (Ed.),
Handbook of structural equation modeling (2nd ed.). New York: Guilford Press.
Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in analysis of
change. In L. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 335–
353). Washington, DC: American Psychological Association.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020b). Regression models involving nonlinear effects
with missing data: A sequential modeling approach using Bayesian estimation. Psychologi-
cal Methods, 25, 157–181.
Wu, W., Jia, F., Rhemtulla, M., & Little, T. D. (2016). Search for efficient complete and planned
missing data designs for analysis of change. Behavior Research Methods, 48, 1047–1061.
11
Wrap‑Up
This final chapter addresses two very practical issues: choosing a missing data-handling
method and reporting the results from a missing data analysis. All things being equal,
the three analytic pillars of this book—maximum likelihood, Bayesian estimation, and
multiple imputation—are likely to produce very similar numerical results, so the choice
of technique is often one of personal preference. However, special analysis features and
software availability may make one method preferable. The first section walks read-
ers through the main considerations and provides a recipe for selecting a method. Of
course, software availability and data analytic preferences play a major role in this deci-
sion, so the second section of the chapter reviews the current software landscape and
paints with broad brushstrokes the different software tools. The chapter concludes with
recommendations for reporting the results from a missing data analysis, and it provides
templates that illustrate the suggestions.
With few exceptions, analyses that assume a conditionally MAR mechanism should
be the norm, as there is rarely a good justification for using atheoretical methods (e.g.,
mean imputation) or methods that assume purely unsystematic missingness (e.g., dele-
tion methods). Maximum likelihood, Bayesian estimation, and multiple imputation are
all natural choices that often produce very similar numerical results—all things being
equal. A quick recap of these methods sets the stage for choosing a method.
The goal of maximum likelihood estimation is to identify the model parameter
values most likely responsible for producing the data. The missing data-handling aspect
of maximum likelihood happens behind the scenes, and a researcher simply needs to
dial up a capable software package and specify a model. The estimator does not discard
incomplete data records, nor does it impute them. Rather, when confronted with miss-
ing values, maximum likelihood uses the normal curve to deduce the missing parts
of the data as it iterates to a solution (technically, the estimator marginalizes over the
missing values). The resulting parameter values are those with maximum support from
whatever data are available. Chapters 2 and 3 describe this approach.
Like maximum likelihood, the primary goal of a Bayesian analysis is to fit a model
to the data and use the resulting estimates to inform one’s research questions. How-
ever, Bayesian estimation has more of a multiple imputation flavor, because it fills in
the missing values en route to getting the parameter values. Like maximum likelihood,
missing data handling happens behind the scenes, with temporary imputations play-
ing a supporting role that simplifies estimation. While the numerical estimates tend to
match those of maximum likelihood, a Bayesian analysis requires a different inferential
framework that makes no reference to repeated sampling. Chapters 4 through 6 describe
this approach.
Unlike maximum likelihood and Bayesian estimation, multiple imputation puts
the filled-in data front and center, and the goal is to create suitable imputations for
later analysis. A typical application consists of three major steps: (1) specify an imputation
model and deploy a Bayesian estimation algorithm that creates several copies of the
data, each containing different estimates of the missing values; (2) perform one or more
analyses on the completed data sets and get point estimates and standard errors from
each; and (3) use “Rubin’s rules” (Little & Rubin, 2020; Rubin, 1987) to combine estimates
and standard errors into a single package of results.
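Rubin's rules are simple enough to sketch directly; the following Python function is a generic illustration (the estimates and standard errors are made-up numbers):

```python
from statistics import mean

def pool_rubin(estimates, std_errors):
    """Combine estimates and standard errors from m imputed data sets
    with Rubin's (1987) rules: average the point estimates, then form
    the total variance as the within-imputation variance plus the
    inflated between-imputation variance."""
    m = len(estimates)
    qbar = mean(estimates)                                   # pooled estimate
    w = mean(se ** 2 for se in std_errors)                   # within variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between variance
    t = w + (1 + 1 / m) * b                                  # total variance
    return qbar, t ** 0.5

# Toy numbers: three imputed data sets, one slope coefficient.
est, se = pool_rubin([0.52, 0.48, 0.50], [0.10, 0.11, 0.10])
print(round(est, 3), round(se, 3))
```

Note that the pooled standard error exceeds the average within-imputation standard error, reflecting the extra uncertainty from the missing values.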
Notwithstanding philosophical differences between the frequentist and Bayesian
paradigms, statistical theory and numerous data analysis examples from earlier chap-
ters tell us that maximum likelihood, Bayesian estimation, and multiple imputation are
usually numerically equivalent if they leverage the same assumptions and use the same
variables. As such, personal preference is often the only reason for selecting one method
over another. In truth, the most important consideration isn’t which method to use, but
rather how to accurately represent the distribution of the data. Two approaches have fea-
tured prominently throughout this book: multivariate normal distributions (possibly
with latent response variables) and factored regression specifications.
In practice, a factored regression specification is always appropriate whenever a
multivariate normal distribution is appropriate, but the converse is not true. What, then,
determines whether a normal distribution is a good approximation to the
data? In general, any model that features an interaction term, curvilinear effect, random
slope, or other type of nonlinearity is at odds with a multivariate normal distribution.
In contrast, additive models that lack these special terms and assume constant variation
are compatible with a normal curve. The flow chart in Figure 11.1 provides a recipe for
identifying the appropriate specification for a given analysis and variable set. It high-
lights that any analysis with an incomplete nonlinear term requires a factored regression
specification. A multivariate normal distribution accommodates the obvious scenario
where all variables are approximately continuous and normal, and adopting a latent
response variable framework further accommodates binary, ordinal, and multicategorical
nominal variables. With few exceptions, other types of response scales (e.g., count
outcomes) require a factored regression specification.
FIGURE 11.1. Flowchart for identifying an appropriate specification for a given analysis and
variable set. The first decision node asks whether the analysis features a nonlinearity
(interaction, polynomial, random slope); if yes, a factored regression specification is required.
All three contemporary missing data-handling methods accommodate multivariate
and factored regression specifications. For maximum likelihood and Bayesian analyses,
the structural equation modeling framework is a convenient way to implement multi-
variate missing data handling, and agnostic multiple imputation schemes such as the
joint model and fully conditional specification are another option. At this point in his-
tory, Bayesian estimation and model-based multiple imputation are arguably superior
for factored regression specifications, because they support a broader range of analysis
models and offer greater flexibility with mixtures of discrete and numeric variables.
Multilevel models with random coefficients and/or mixed response types are an impor-
tant example, as unbiased maximum likelihood estimators are not widely available.
Given an appropriate distributional specification, software availability and data analytic
preferences ultimately determine which method to use. The next section reviews the
current software landscape and paints with broad brushstrokes the different software
tools.
Given the pace at which computational options evolve, I’ve purposefully avoided
software-centric presentations throughout the book. Nevertheless, software availability
and data analytic preferences play a major role in selecting a missing data-handling
technique, so a cursory review of the current software landscape is important. Fortu-
nately, numerous mature platforms exist that will surely be around a decade from now,
and these tools will only improve over time. What follows is an incomplete snapshot of
the software landscape in 2022.
Considering general-use commercial software, most researchers reading this book
probably use SAS, SPSS, or Stata. All three applications have structural equation mod-
eling modules that provide a general way to implement maximum likelihood missing
data handling, and all offer agnostic multiple imputation schemes. Commercial struc-
tural equation modeling software programs such as EQS (Bentler, 2000–2008), LISREL
(Jöreskog & Sörbom, 2021), and Mplus (Muthén & Muthén, 1998–2017) are also very
capable platforms for estimating a broad range of models with maximum likelihood,
and some also offer multiple imputation. Except for Mplus, the commercial programs
mentioned here generally do not support factored regression specifications and thus are
ill-suited for models with incomplete nonlinear effects.
Mplus has no peer in the commercial software space when it comes to missing
data-handling options, as it offers sophisticated and flexible options for maximum like-
lihood, Bayesian estimation, and multiple imputation. Given its structural equation
modeling roots, Mplus’s computational machinery is primarily multivariate in nature,
although it does use or allow for factored regression specifications in some situations.
Caution is warranted with incomplete nonlinear effects, because some model specifica-
tions can misrepresent the distributions of incomplete predictors in a way that mimics
bias-inducing just-another-variable and reverse random coefficient approaches; multi-
level models with incomplete random slope predictors are an example (Enders, Hayes,
et al., 2018). The most recent version 8.6 of the software features new factored regression
specifications for latent variable interactions that mimic those described in Section 10.8
(Asparouhov & Muthén, 2021a), but no information is available on the performance of
these methods with missing data.
Turning to free software options, Blimp (Keller & Enders, 2021) is an all-purpose
data analysis and latent variable modeling program that harnesses the flexibility of
Bayesian estimation in a user-friendly application that requires minimal scripting and
no deep-level knowledge about Bayes. The application, which is available for macOS,
Windows, and Linux, was developed with funding from Institute of Education Sciences
awards R305D150056 (myself and Roy Levy) and R305D190002 (myself, Brian
Keller, and Han Du). The software began as a platform for implementing multilevel mul-
tiple imputation via fully conditional specification (Enders, Keller, et al., 2018), and its
Before presenting results, report complications, protocol violations, and other unanticipated
events in data collection. These include missing data, attrition, and nonresponse. Discuss
analytic techniques devised to ameliorate these problems. Describe nonrepresentativeness
statistically by reporting patterns and distributions of missing data and contaminations.
Document how the actual analysis differs from the analysis planned before complications
arose. The use of techniques to ensure that the reported results are not produced by anoma-
lies in the data (e.g., outliers, points of high influence, nonrandom missing data, selection
bias, attrition problems) should be a standard component of all analyses. (p. 597)
• Report the number of missing values for each variable of interest, or the number of cases
with complete data for each important component of the analysis. Give reasons for miss-
ing values if possible, and indicate how many individuals were excluded because of
missing data when reporting the flow of participants through the study. If possible, describe
reasons for missing data in terms of other variables (rather than just reporting a universal
reason such as treatment failure).
• Clarify whether there are important differences between individuals with complete and
incomplete data—for example, by providing a table comparing the distributions of key
exposure and outcome variables in these different groups.
• Describe the type of analysis used to account for missing data (e.g., multiple imputation),
and the assumptions that were made (e.g., missing at random).
Reporting Guidelines
To facilitate an uptick in better reporting practices, the checklist below compiles rec-
ommendations from a variety of sources in the literature (Burton & Altman, 2004;
Manly & Wells, 2015; Sterne et al., 2009; Sterner, 2011; van Buuren, 2012, pp. 254–255;
Vandenbroucke et al., 2007). Of course, length requirements limit what can be reported
in the body of a journal article, but online supplemental documents have no such restric-
tions and are an ideal repository for specific information. The remainder of this section
discusses each recommendation and provides illustrative templates.
TABLE 11.1. Observed Data Proportions for Each Variable or Variable Pair
(Covariance Coverage)
Variable 1 2 3 4 5 6 7 8 9
1. AGE 1.00
2. WORKHRS 1.00 1.00
3. EXERCISE .88 .88 .88
4. ANXIETY .98 .98 .87 .98
5. STRESS .93 .93 .82 .91 .93
6. CONTROL .95 .95 .83 .93 .88 .95
7. INTERFERE 1.00 1.00 .88 .98 .93 .95 1.00
8. DEPRESS 1.00 1.00 .88 .98 .93 .95 1.00 1.00
9. DISABILITY .90 .90 .78 .88 .82 .84 .90 .90 .90
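Coverage values like those in Table 11.1 are straightforward to compute; here is a short Python sketch with a made-up miniature data set:

```python
# Covariance coverage: the proportion of cases with observed data for
# each variable (diagonal) and each variable pair (off-diagonal), as in
# Table 11.1. The tiny data set below is illustrative; None marks a
# missing value.
data = [
    [1.0, 2.0, None],
    [2.0, None, 3.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, None],
]

n_cases = len(data)
n_vars = len(data[0])
coverage = [
    [
        sum(row[i] is not None and row[j] is not None for row in data) / n_cases
        for j in range(i + 1)
    ]
    for i in range(n_vars)
]

# Print the lower triangle, one row per variable.
for i, row in enumerate(coverage, start=1):
    print(i, ["%.2f" % p for p in row])
```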
The age at first alcohol use exhibited substantial positive skewness and excess kurtosis;
the complete-case estimates were 1.82 and 8.12, respectively. Inspecting histograms of the
observed and imputed data revealed that sampling imputations from a normal distribu-
tion produced values well below 10, the lowest reported age; the lowest imputation was
approximately 0.30, and about 2% of all imputes were less than 10. To investigate the prac-
tical impact of the normality assumption, we performed a sensitivity analysis that instead
sampled skewed imputations from a Yeo–Johnson distribution (Lüdtke, Robitzsch, & West,
2020b; Yeo & Johnson, 2000). The resulting imputations followed a positively skewed dis-
tribution that better resembled the shape of the observed data, ranging from 5.89 to 42.83.
Altering the distribution of the missing values increased the variable’s slope coefficient by
nearly half a standard error unit, but its sign and significance test were unaffected. From
this, we can conclude that our main conclusions were stable across different assumptions
about the missing data distribution.
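As a rough illustration of the idea (not Blimp's implementation), skewed imputations can be generated by drawing on the normal scale and back-transforming through the inverse Yeo–Johnson function; the lambda value here is arbitrary, whereas in practice it would be estimated from the observed data:

```python
import random
from math import exp

def inv_yeo_johnson(y, lmbda):
    """Inverse of the Yeo-Johnson transformation (Yeo & Johnson, 2000).

    Drawing on the transformed (normal) scale and mapping back through
    the inverse produces skewed values, which is one way to sample
    skewed imputations. The lambda value used below is illustrative.
    """
    if y >= 0:
        return exp(y) - 1 if lmbda == 0 else (lmbda * y + 1) ** (1 / lmbda) - 1
    return 1 - exp(-y) if lmbda == 2 else 1 - (-(2 - lmbda) * y + 1) ** (1 / (2 - lmbda))

random.seed(1)
normal_draws = [random.gauss(0.0, 1.0) for _ in range(1000)]
# lambda < 1 stretches the right tail and compresses the left tail,
# producing positively skewed draws.
skewed_draws = [inv_yeo_johnson(y, lmbda=0.5) for y in normal_draws]
```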
Maximum likelihood estimation is a bit more of a black box when it comes to dis-
tributional assumptions, as it too will intuit that missing values extend into an implau-
sible range without ever producing explicit evidence of its assumptions. Moreover,
appropriate transformations are more difficult to implement in this context, because
software packages that estimate the necessary shape parameters (e.g., for a Box–Cox
or Yeo–Johnson transformation) generally require complete data. Finally, discrepancies
between normal-theory and robust (sandwich estimator) standard errors can signal a
model misspecification, but what constitutes a discrepancy is somewhat subjective.
In many cases, the research team was able to ascertain that test scores were missing because
a student moved to another district. As student mobility often correlates with sociodemo-
For each incomplete variable, we performed comparisons of individuals with and with-
out missing data, and we flagged any variables that produced a standardized mean differ-
ence larger than Cohen’s (1988) small effect size benchmark of ± 0.20. These comparisons
revealed that people with missing disability scores were younger (d = –0.30); participants
without missing depression scores were more anxious (d = 0.33); and people with miss-
ing pain ratings exercised more frequently (d = 0.30), exhibited higher anxiety (d = 0.43),
and reported more stress (d = 0.24). Collectively, these comparisons rule out an MCAR
process, and they potentially signal the need to condition on one or more of these additional
variables if their semipartial (i.e., residual) correlations with the analysis variables exceed
approximately ± .30 (Collins et al., 2001). Based on their strong semipartial correlations
with one or more analysis variables, we designated anxiety, stress, and pain interference
with daily life as auxiliary variables; pain interference did not predict missingness but con-
ditioning on this variable could improve power.
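The group comparisons described in this template reduce to a standardized mean difference; a minimal Python sketch (with made-up numbers) is:

```python
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Standardized mean difference with a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    pooled = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group1) - mean(group2)) / pooled

# Toy ages for cases with versus without missing disability scores
# (made-up numbers, not the pain data from the example).
age_missing = [34, 38, 41, 29, 36]
age_observed = [45, 39, 50, 42, 47]
d = cohens_d(age_missing, age_observed)

# Flag the comparison when |d| exceeds the small-effect benchmark.
print(round(d, 2), abs(d) > 0.20)
```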
On the basis of their salient semipartial (residual) correlations with the incomplete analysis
variables, we designated anxiety, stress, and pain interference with daily life as auxiliary
variables. The online supplemental document describes the variable selection process,
including comparisons of individuals with and without missing data.
We used fully conditional specification multiple imputation in Blimp 3 (Keller & Enders,
2021) to treat missing values. Potential scale reduction factor convergence diagnostics
(Gelman & Rubin, 1992) from a preliminary run indicated that a burn-in period of 2,000
iterations was sufficiently conservative. Based on this information, we created 100 imputed
data sets by saving the filled-in data from the final iteration of 100 MCMC chains, each
with random starting values. The imputation model included the analysis variables as well
as three additional auxiliary variables. We then used the R packages lme4 (version
1.1-27.1; Bates et al., 2021) and mitml (version 0.4-1; Grund, Robitzsch, et al., 2021) to fit
the analysis models and pool the resulting parameter estimates and standard errors (Rubin,
1987).
The following passage similarly describes the algorithmic details for a Bayesian analysis:
We used Bayesian estimation in Blimp 3 (Keller & Enders, 2021) to treat missing values,
and we used a factored regression (sequential) specification to incorporate three auxiliary
variables. Potential scale reduction factor convergence diagnostics (Gelman & Rubin, 1992)
from a preliminary run indicated that a burn-in period of 2,000 iterations was sufficiently
conservative. Based on this information, we used eight MCMC chains with random starting
values to generate posterior summaries consisting of 10,000 estimates following the initial
burn-in period. We verified this setting was sufficient by examining the effective number of
independent MCMC samples for each parameter, all of which were greater than the recom-
mended value of 100 (Gelman et al., 2014, p. 287).
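The potential scale reduction factor mentioned in these templates can be sketched as follows (the classic Gelman–Rubin formula; Blimp's exact computation may use refinements):

```python
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Gelman-Rubin potential scale reduction factor (R-hat) for one
    parameter, given post-burn-in draws from several MCMC chains.

    Values near 1.00 suggest the chains converged to the same
    distribution.
    """
    m = len(chains)                 # number of chains
    n = len(chains[0])              # draws per chain
    chain_means = [mean(c) for c in chains]
    b = n * variance(chain_means)                 # between-chain variance
    w = mean(variance(c) for c in chains)         # within-chain variance
    var_plus = (n - 1) / n * w + b / n            # pooled variance estimate
    return (var_plus / w) ** 0.5

# Two toy chains sampling (nearly) the same distribution.
chains = [[0.9, 1.1, 1.0, 1.2, 0.8], [1.1, 1.3, 1.0, 1.2, 0.9]]
print(round(potential_scale_reduction(chains), 2))
```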
Finally, maximum likelihood estimation routines are typically more of a black box,
offering fewer tweakable settings. The following passage illustrates how to describe this
approach:
We used the FIML estimator in Mplus 8.6 (Muthén & Muthén, 1998–2017) with robust
standard errors (i.e., the MLR estimator), and we used Graham’s (2003) saturated correlates
approach to incorporate three additional auxiliary variables. For each of the primary analy-
ses, we fit the model using 10 sets of random starting values, all of which achieved the same
final solution.
trial). The passage below provides a template for describing the results in a published
paper:
As a second illustration, reconsider the analysis examples from Section 9.8, which
applied pattern mixture models to a multiple regression from the same psychiatric trial
data. The passage below provides a template for summarizing those results in a paper:
As a final illustration, reconsider the analysis examples from Section 9.13, which
applied selection and pattern mixture models to a longitudinal growth curve. The pas-
sage below provides a template for summarizing those results in a paper:
(2) a shared parameter model for random coefficient-dependent missingness where one’s
underlying growth trajectory is responsible for missing data (Albert & Follmann, 2009; Wu
& Carroll, 1988), and (3) a random coefficient pattern mixture model where completers and
dropouts form qualitatively different subgroups with distinct growth trajectories (Hedeker
& Gibbons, 1997).
The Diggle–Kenward selection model and the Hedeker–Gibbons pattern mixture
model produced nontrivial differences in some key parameters. Both analyses suggested
a flatter (less negative) trajectory for the placebo group and a steeper decline for the medi-
cation condition, with changes to the parameters as large as one standard error unit in
some cases (an amount we judge to be practically significant). Considered as a whole, the
sensitivity analysis results suggest that an MNAR process is quite plausible for these data.
Importantly, there is no way to determine which model is more correct, as the results reflect
different, plausible assumptions about the missing data process. The online supplemental
document describes the sensitivity analysis results in more detail.
Missing data methodology has evolved considerably since the first edition of this book
was published in 2010. Rewinding more than a decade, researchers primarily had to rely
on techniques that assume a multivariate normal distribution. Major innovations since
that time include missing data-handling methods for mixtures of discrete and numeri-
cal variables, non-normal data, multilevel data, models with interactive or nonlinear
effects, and factored regression specifications, to name a few. At this point in history,
elegant missing data solutions exist for most analyses that researchers use in their day-
to-day practice, and there is no shortage of capable software tools. Methodologies from
the first edition of this book now enjoy widespread use in published research articles,
and I hope this second edition contributes to the uptake of new and improved meth-
odologies. Finally, I recommend the following articles for readers who want additional
details on the reporting recommendations offered in this chapter:
Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in
higher education research. Research in Higher Education, 56, 397–409.
Nicholson, J. S., Deboeck, P. R., & Howard, W. (2017). Attrition in developmental psychology: A
review of modern missing data reporting and practices. International Journal of Behavioral
Development, 41, 143–153.
Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., . . . Carpenter,
J. R. (2009). Multiple imputation for missing data in epidemiological and clinical research:
Potential and pitfalls. British Medical Journal, 338, Article b2393.
Appendix
alcoholuse.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
AGE Age in years 0 Numerical (12 to 85)
ETHNIC Ethnicity 0 1 = Non-Hispanic/Black, 2 =
Hispanic, 3 = Black
COLLEGE College education dummy code 27.9 0 = High school or less, 1 =
Some college or more
AGETRYCIG Age first tried cigarettes 68.7 Numerical (10 to 48)
AGETRYALC Age first tried alcohol 32.9 Numerical (10 to 47)
ALCDAYS Drinking days per month 9.7 Count (0 to 30)
CIGDAYS Smoking days per month 13.6 Count (0 to 30)
DRINKER Alcohol use frequency 9.7 0 = Less than weekly, 1 = At
classification least once per week
diary.dat
Name Definition Missing % Scale
PERSON Individual identifier 0 Integer index
DAY Day identifier 0 Integer index (0 to 20)
PAIN Pain rating composite 3.9 Numerical (1 to 10)
SLEEP Sleep rating composite 8.9 Numerical (0 to 10)
POSAFF Positive affect composite 13.4 Numerical (1 to 7)
NEGAFF Negative affect composite 13.3 Numerical (1 to 7)
LIFEGOAL Life goal pursuit composite 14.5 Numerical (1 to 7)
FEMALE Gender dummy code 0 0 = Male, 1 = Female
EDUC Education level 4.7 Ordinal (1 to 7)
DIAGNOSES Number of diagnosed ailments 7.1 Numerical (1 to 8)
ACTIVITY Activity level composite 12.4 Numerical (0 to 5)
PAINACCEPT Pain acceptance composite 2.4 Numerical (0 to 5)
CATASTROPIZE Catastrophizing composite 0 Numerical (0 to 5)
STRESS Stress composite 0 Numerical (0 to 5)
drugtrial.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
DRUG Medication condition dummy 0 0 = Placebo, 1 = Medication
code
SEVERITY0 Illness severity at baseline 0.7 Numerical (1 to 7)
SEVERITY1 Illness severity at 1 week 2.5 Numerical (1 to 7)
SEVERITY3 Illness severity at 3 weeks 14.4 Numerical (1 to 7)
SEVERITY6 Illness severity at 6 weeks 23.3 Numerical (1 to 7)
DROPGRP Dropout group 0 1 = Completer, 2 = 3-week
dropout, 3 = 6-week
dropout
EARLYDROP 3-week dropout dummy code 0 1 = 3-week dropout, 0 =
Completer or 6-week
dropout
LATEDROP 6-week dropout dummy code 0 1 = 6-week dropout, 0 =
Completer or 3-week
dropout
DROPOUT Dropout indicator 0 0 = Completer, 1 = Dropout
SDROPOUT3 3-week survival dropout 0 0 = Completer, 1 = Dropout
indicator
SDROPOUT6 6-week survival dropout 11 0 = Completer, 1 = Dropout
indicator
drugtrial2level.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
DRUG Medication condition dummy 0 0 = Placebo, 1 = Medication
code
SEVERITY Illness severity 10.2 Numerical (1 to 7)
WEEK Time scores 0 Numerical (0, 1, 3, 6)
DROPGRP Dropout group 0 0 = Completer, 1 = 3-week
dropout, 2 = 6-week
dropout
EARLYDROP 3-week dropout dummy code 0 1 = 3-week dropout, 0 =
Completer or 6-week
dropout
LATEDROP 6-week dropout dummy code 0 1 = 6-week dropout, 0 =
Completer or 3-week
dropout
DROPOUT Dropout indicator 0 0 = Completer, 1 = Dropout
SDROPOUT Survival dropout indicator 2.7 0 = Completer, 1 = Dropout
eatingrisk.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
BMI Body mass index 11.5 Numerical (17.39 to 28.27)
Dieting Behavior Questionnaire items
employee.dat
Name Definition Missing % Scale
EMPLOYEE Employee identifier 0 Integer index
TEAM Team identifier 0 Integer index
TURNOVER Intend to quit job in the next 6 5.1 0 = No, 1 = Yes
months
MALE Gender dummy code 0 0 = Female, 1 = Male
EMPOWER Employee empowerment 16.2 Numerical (14 to 42)
composite
LMX Leader–member exchange 4.1 Numerical (0 to 17)
(relationship quality with
supervisor) composite
WORKSAT Work satisfaction rating 4.8 Ordinal (1 to 7)
CLIMATE Leadership climate composite 9.5 Numerical (12 to 33)
(team-level)
COHESION Team cohesion composite 5.7 Numerical (2 to 10)
(team-level)
math.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
MALE Gender dummy code 0 0 = Female, 1 = Male
FRLUNCH Lunch assistance dummy code 5.2 0 = None, 1 = Free or reduced-
price lunch
ACHIEVEGRP Achievement classification 2.4 1 = Typically achieving, 2 =
Low achieving, 3 = Learning
disabled
STANREAD Standardized reading 10.4 Numerical (27.2 to 69.2)
EFFICACY Math self-efficacy rating scale 10.0 Ordinal (1 to 6)
ANXIETY Math anxiety composite 8.8 Numerical (0 to 56)
MATHPRE Math achievement pretest 0 Numerical (26 to 76)
MATHPOST Math achievement posttest 16.8 Numerical (35 to 85)
pain.dat
Name Definition Missing % Scale
ID Individual identifier 0 Integer index
TXGRP Treatment group dummy code 0 0 = Waitlist control, 1 =
Treatment
MALE Gender dummy code 0 0 = Female, 1 = Male
AGE Age in years 0 Numerical (19 to 78)
EDUC Highest education 0 1 = Some college or less, 2 =
College, 3 = Post-BA
WORKHRS Work hours per week 12.0 Numerical (0 to 94)
EXERCISE Exercise frequency 1.8 Ordinal (1 to 8)
PAINGRPS Chronic pain intensity rating 7.3 1 = No or little, 2 = Moderate, 3
= Severe
PAIN Severe pain dummy code 7.3 0 = No, little, moderate pain, 1
= Severe pain
ANXIETY Anxiety composite 5.5 Numerical (7 to 26)
STRESS Stress rating 0 Ordinal (1 to 7)
CONTROL Perceived control over pain 0 Numerical (6 to 30)
composite
DEPRESS Depression composite 13.5 Numerical (7 to 28)
INTERFERE Pain interference with life 10.5 Numerical (6 to 41)
composite
DISABILITY Psychosocial disability 9.1 Numerical (10 to 34)
composite
Depression Questionnaire items
problemsolving2level.dat
Name Definition Missing % Scale
SCHOOL School identifier 0 Integer index
STUDENT Student identifier 0 Integer index
CONDITION Experimental condition 0 0 = Control school, 1 =
Experimental school
TEACHEXP Teacher years of experience 10.8 Numerical (4.3 to 24.6)
ESLPCT % English as second language 0 Numerical (10 to 100)
ETHNIC Ethnicity/race 9.0 1 = White, 2 = Black, 3 =
Hispanic
MALE Gender dummy code 0 0 = Female, 1 = Male
FRLUNCH Lunch assistance dummy code 4.7 0 = None, 1 = Free or reduced-
price lunch
ACHIEVEGRP Achievement classification 2.1 1 = Typically achieving, 2 =
Low achieving, 3 = Learning
disabled
STANMATH Standardized math 7.4 Numerical (5.3 to 87.8)
EFFICACY1 Math self-efficacy pretest 0 Numerical (0 to 12)
EFFICACY2 Math self-efficacy posttest 20.5 Numerical (0 to 12)
PROBSOLVE1 Math problem-solving pretest 0 Numerical (37 to 66)
PROBSOLVE2 Math problem-solving posttest 20.5 Numerical (37 to 65)
problemsolving3level.dat
Name Definition Missing % Scale
SCHOOL School identifier 0 Integer index
STUDENT Student identifier 0 Integer index
WAVE Monthly wave identifier 0 Integer index (1 to 7)
CONDITION Experimental condition 0 0 = Control school, 1 =
Experimental school
TEACHEXP Teacher years of experience 10.8 Numerical (4.3 to 24.6)
ESLPCT % English as second language 0 Numerical (10 to 100)
ETHNIC Ethnicity/race 9.0 1 = White, 2 = Black, 3 =
Hispanic
MALE Gender dummy code 0 0 = Female, 1 = Male
FRLUNCH Lunch assistance dummy code 4.7 0 = None, 1 = Free or reduced-
price lunch
ACHIEVEGRP Achievement classification 2.1 1 = Typically achieving, 2 =
Low achieving, 3 = Learning
disabled
STANMATH Standardized math 7.4 Numerical (5.3 to 87.8)
MONTH0 Time scores (baseline = 0) 0 Numerical (0 to 6)
MONTH7 Time scores (endpoint = 0) 0 Numerical (–6 to 0)
PROBSOLVE Math problem solving 11.4 Numerical (37 to 68)
EFFICACY Math self-efficacy 11.4 Numerical (0 to 14)
smoking.dat
Name Definition Missing % Scale
ID Participant identifier 0 Integer index
INTENSITY Smoking intensity (cigarettes per day) 21.2 Numerical (2 to 29)
HVYSMOKER Heavy smoking indicator 21.2 0 = 10 or fewer cigarettes per day, 1 = 11 or more per day
AGE Age at assessment 0 Numerical (18 to 25)
PARSMOKE Parental smoking dummy code 3.6 0 = Nonsmoker, 1 = Smoker
FEMALE Female dummy code 0 0 = Male, 1 = Female
RACE Race categories 6.0 1 = White, 2 = Black, 3 = Hispanic, 4 = Other
INCOME Household income 11.4 Ordinal (1 to 20)
EDUC Highest education 5.4 1 = Less than HS, 2 = HS or some college, 3 = BA or higher
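Codebook tables like the one above lend themselves to simple range checks before analysis. The sketch below encodes the documented smoking.dat ranges as a Python dictionary and flags out-of-range values; the `out_of_range` helper, the sample record, and the use of `None` to stand in for a missing value are illustrative assumptions, not part of the data file's format:

```python
# Range checks built from the smoking.dat codebook above.
# The (min, max) pairs come from the table; everything else is illustrative.
codebook = {
    "INTENSITY": (2, 29),    # cigarettes per day
    "HVYSMOKER": (0, 1),     # 0 = 10 or fewer per day, 1 = 11 or more
    "AGE":       (18, 25),
    "PARSMOKE":  (0, 1),     # 0 = nonsmoker, 1 = smoker
    "FEMALE":    (0, 1),     # 0 = male, 1 = female
    "RACE":      (1, 4),     # 1 = White, 2 = Black, 3 = Hispanic, 4 = Other
    "INCOME":    (1, 20),    # ordinal
    "EDUC":      (1, 3),     # 1 = < HS, 2 = HS/some college, 3 = BA or higher
}

def out_of_range(record):
    """Return names of observed values outside their documented range."""
    return [name for name, value in record.items()
            if value is not None  # None stands in for a missing value
            and not (codebook[name][0] <= value <= codebook[name][1])]

# Hypothetical record: AGE is legal, INTENSITY is not, INCOME is missing.
record = {"INTENSITY": 45, "AGE": 22, "INCOME": None}
assert out_of_range(record) == ["INTENSITY"]
```

A value flagged here is a data-entry problem, not a missing value; the missingness percentages in the table describe legitimately absent scores.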
References
Abrams, K., Ashby, D., & Errington, D. (1994). Simple Bayesian analysis in clinical trials: A tuto-
rial. Controlled Clinical Trials, 15, 349–359.
Agresti, A. (2012). Categorical data analysis (3rd ed.). Hoboken, NJ: Wiley.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. New-
bury Park, CA: Sage.
Aitchison, J., & Bennett, J. A. (1970). Polychotomous quantal response by maximum indicant.
Biometrika, 57, 253–262.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control, 19, 716–723.
Ake, C. F. (2005, April). Rounding after multiple imputation with non-binary categorical covariates.
Paper presented at the SAS Users Group International, Philadelphia, PA.
Alacam, E., Du, H., Enders, C. K., & Keller, B. T. (2022). A model-based approach to treating com-
posite scores with missing items. Manuscript submitted for publication.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data.
Journal of the American Statistical Association, 88, 669–679.
Albert, P. S., & Follmann, D. A. (2009). Shared-parameter models. In G. Fitzmaurice, M. Davidian,
G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 433–452). Boca
Raton, FL: Chapman & Hall.
Albert, P. S., Follmann, D. A., Wang, S. A., & Suh, E. B. (2002). A latent autoregressive model for
longitudinal binary data subject to informative missingness. Biometrics, 58, 631–642.
Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.
Allison, P. D. (2005, April). Imputation of categorical variables with PROC MI. Paper presented at
the SAS Users Group International, Philadelphia, PA.
Anderson, D., & Burnham, K. (2004). Model selection and multi-model inference (2nd ed.). New
York: Springer-Verlag.
Anderson, T. W. (1957). Maximum-likelihood estimates for a multivariate normal-distribution
when some observations are missing. Journal of the American Statistical Association, 52,
200–203.
Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychol-
ogy. British Journal of Mathematical and Statistical Psychology, 66, 1–7.
Andridge, R. R. (2011). Quantifying the impact of fixed effects modeling of clusters in multiple
imputation for cluster randomized trials. Biometrical Journal, 53, 57–74.
multimodal distributions of the exponential family. Journal of the American Statistical Asso-
ciation, 78, 124–130.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2002). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychological Methods, 6, 330–351.
Cook, R. J., Zeng, L., & Yi, G. Y. (2004). Marginal analysis of incomplete longitudinal binary
data: A cautionary note on LOCF imputation. Biometrics, 60, 820–828.
Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-link
generalized linear models. Statistics and Computing, 6, 101–111.
Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A
comparative review. Journal of the American Statistical Association, 91, 883–904.
Coxe, S., West, S. G., & Aiken, L. S. (2009). The analysis of count data: A gentle introduction to
Poisson regression and its alternatives. Journal of Personality Assessment, 91, 121–136.
Craig, C. C. (1936). On the frequency function of xy. Annals of Mathematical Statistics, 7, 1–15.
Darnieder, W. F. (2011). Bayesian methods for data-dependent priors. (PhD thesis), The Ohio State
University, Columbus, OH.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assump-
tion when multiple imputing non-Gaussian continuous outcomes: A simulation assessment.
Journal of Statistical Computation and Simulation, 78, 69–84.
Demirtas, H., & Hedeker, D. (2008a). Imputing continuous data under some non-Gaussian dis-
tributions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008b). Multiple imputation under power polynomials. Communi-
cations in Statistics—Simulation and Computation, 37, 1682–1695.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture
models for non-ignorable drop-out. Statistics in Medicine, 22, 2553–2575.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society B: Statistical Methodology, 39,
1–38.
Diggle, P., & Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis. Journal
of the Royal Statistical Society C: Applied Statistics, 43, 49–93.
Dixon, W. J. (1988). BMDP statistical software. Los Angeles: University of California Press.
Du, H., Enders, C. K., Keller, B. T., Bradbury, T., & Karney, B. (2021, February 2). A Bayes-
ian latent variable selection model for nonignorable missingness. Multivariate Behavioral
Research. [Epub ahead of print]
Duncan, S. C., Duncan, T. E., & Hops, H. (1996). Analysis of longitudinal data within accelerated
longitudinal designs. Psychological Methods, 1, 236–248.
Dyklevych, O. (2014). Bayesian inference in the multinomial probit model: A case study. Master’s
thesis, Örebro University, Örebro, Sweden.
Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., & Jermiin, L. S. (2020). Sensitivity and specificity
of information criteria. Briefings in Bioinformatics, 21, 553–565.
Edgett, G. L. (1956). Multiple regression with missing observations among the independent vari-
ables. Journal of the American Statistical Association, 51, 122–131.
Edwards, M. C., Wirth, R. J., Houts, C. R., & Xi, N. (2012). Categorical data in the structural
equation modeling framework. In R. H. Hoyle (Ed.), Handbook of structural equation model-
ing (pp. 195–208). New York: Guilford Press.
Eekhout, I., de Vet, H. C., Twisk, J. W., Brand, J. P., de Boer, M. R., & Heymans, M. W. (2014).
Missing data in a multi-item instrument were best handled by multiple imputation at the
item score level. Journal of Clinical Epidemiology, 67, 335–342.
Eekhout, I., Enders, C. K., Twisk, J. W. R., de Boer, M. R., de Vet, H. C. W., & Heymans, M. W.
(2015a). Analyzing incomplete item scores in longitudinal data by including item score
information as auxiliary variables. Structural Equation Modeling: A Multidisciplinary Journal,
22, 588–602.
Eekhout, I., Enders, C. K., Twisk, J. W. R., de Boer, M. R., de Vet, H. C. W., & Heymans, M. W.
(2015b). Including auxiliary item information in longitudinal data analyses improved han-
dling missing questionnaire outcome data. Journal of Clinical Epidemiology, 68, 637–645.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Associa-
tion, 82, 171–185.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-valida-
tion. American Statistician, 37, 36–48.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Boca Raton, FL: Chapman
& Hall.
Enders, C. K. (2001). The impact of nonnormality on full information maximum-likelihood esti-
mation for structural equation models with missing data. Psychological Methods, 6, 352–370.
Enders, C. K. (2002). Applying the Bollen–Stine bootstrap for goodness-of-fit measures to struc-
tural equation models with missing data. Multivariate Behavioral Research, 37, 359–377.
Enders, C. K. (2008). A note on the use of missing auxiliary variables in FIML-based structural
equation models. Structural Equation Modeling: A Multidisciplinary Journal, 15, 434–448.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychologi-
cal Methods, 16, 1–16.
Enders, C. K., Baraldi, A. N., & Cham, H. (2014). Estimating interaction effects with incomplete
predictor variables. Psychological Methods, 19, 39–55.
Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel
regression models with random coefficients, interaction effects, and other nonlinear terms.
Psychological Methods, 25, 88–112.
Enders, C. K., & Gottschall, A. C. (2011). Multiple imputation strategies for multiple group struc-
tural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 18, 35–54.
Enders, C. K., Hayes, T., & Du, H. (2018). A comparison of multilevel imputation schemes for
random coefficient models: Fully conditional specification and joint model imputation with
random covariance matrices. Multivariate Behavioral Research, 53, 695–713.
Enders, C. K., & Keller, B. T. (2019). Blimp technical appendix: Centering covariates in a Bayesian
multilevel analysis. Available at www.appliedmissingdata.com/blimp-papers.
Enders, C. K., Keller, B. T., & Levy, R. (2018). A fully conditional specification approach to
multilevel imputation of categorical and continuous variables. Psychological Methods, 23,
298–317.
Enders, C. K., & Mansolf, M. (2018). Assessing the fit of structural equation models with multi-
ply imputed data. Psychological Methods, 23, 76–93.
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and
evaluation of joint modeling and chained equations imputation. Psychological Methods, 21,
222–240.
Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel
models: A new look at an old issue. Psychological Methods, 12, 121–138.
Erler, N. S., Rizopoulos, D., Jaddoe, V. W., Franco, O. H., & Lesaffre, E. M. (2019). Bayesian
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., . . . Hothorn, T. (2019). Package
‘mvtnorm.’ Retrieved from https://cran.r-project.org/web/packages/mvtnorm/mvtnorm.pdf.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior
moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian
Statistics 4 (pp. 169–193). Oxford, UK: Clarendon Press.
Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 7, 473–483.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in
practice. London: Chapman & Hall.
Glynn, R. J., & Laird, N. M. (1986). Regression estimates and missing data: Complete-case analysis
(Technical Report). Cambridge, MA: Harvard School of Public Health, Department of Bio-
statistics.
Gold, M. S., & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of
RBHDI, iterative stochastic regression imputation, and expectation-maximization. Struc-
tural Equation Modeling: A Multidisciplinary Journal, 7, 319–355.
Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models
with missing data in responses and covariates that may include interactions and non-linear
terms. Journal of the Royal Statistical Society A: Statistics in Society, 177, 553–564.
Goldstein, H., Carpenter, J., Kenward, M. G., & Levin, K. A. (2009). Multilevel models with mul-
tivariate mixed response types. Statistical Modelling, 9, 173–197.
Gomer, B., & Yuan, K.-H. (2021, June 28). Subtypes of the missing not at random missing data
mechanism. Psychological Methods. [Epub ahead of print]
Gonzalez, R., & Griffin, D. (2001). Testing parameters in structural equation modeling: Every
“one” matters. Psychological Methods, 6, 258–269.
Gottfredson, N. C., Bauer, D. J., & Baldwin, S. A. (2014). Modeling change in the presence of non-
randomly missing data: Evaluating a shared parameter mixture model. Structural Equation
Modeling: A Multidisciplinary Journal, 21, 196–209.
Gottfredson, N. C., Bauer, D. J., Baldwin, S. A., & Okiishi, J. C. (2014). Using a shared parameter
mixture model to estimate change during treatment when termination is related to recovery
speed. Journal of Consulting and Clinical Psychology, 82, 813–827.
Gottfredson, N. C., Sterba, S. K., & Jackson, K. M. (2017). Explicating the conditions under
which multilevel multiple imputation mitigates bias resulting from random coefficient-
dependent missing longitudinal data. Prevention Science, 18, 12–19.
Gottschall, A. C., West, S. G., & Enders, C. K. (2012). A comparison of item-level and scale-level
multiple imputation for questionnaire batteries. Multivariate Behavioral Research, 47, 1–25.
Gourieroux, C., Monfort, A., & Trognon, A. (1984). Pseudo maximum likelihood methods: The-
ory. Econometrica, 52, 681–700.
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation
models. Structural Equation Modeling: A Multidisciplinary Journal, 10, 80–100.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of
Psychology, 60, 549–576.
Graham, J. W. (2012). Missing data: Analysis and design. New York: Springer.
Graham, J. W., Cumsille, P. E., & Shevock, A. E. (2013). Methods for handling missing data. In J.
A. Schinka & W. F. Velicer (Eds.), Research methods in psychology (Vol. 3). New York: Wiley.
Graham, J. W., Hofer, S. M., & MacKinnon, D. P. (1996). Maximizing the usefulness of data
obtained with planned missing value patterns: An application of maximum likelihood pro-
cedures. Multivariate Behavioral Research, 31, 197–218.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really
needed?: Some practical clarifications of multiple imputation theory. Prevention Science, 8,
206–213.
Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in analysis of
change. In L. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 335–353).
Washington, DC: American Psychological Association.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data
designs in psychological research. Psychological Methods, 11, 323–343.
Greene, W. H. (2017). Econometric analysis (8th ed.). Boston: Prentice Hall.
Grimm, K. J., Ram, N., & Estabrook, R. (2016). Growth modeling: Structural equation and multi-
level modeling approaches. New York: Guilford Press.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016a). Multiple imputation of missing covariate values
in multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48,
640–649.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016b). Multiple imputation of multilevel missing data:
An introduction to the R package pan. Sage Open, 6, 1–17.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016c). Pooling ANOVA results from multiply imputed
datasets: A simulation study. Methodology, 12, 75–88.
Grund, S., Lüdtke, O., & Robitzsch, A. (2017). Multiple imputation of missing data at level 2: A
comparison of fully conditional and joint modeling in multilevel designs. Journal of Educa-
tional and Behavioral Statistics, 43, 316–353.
Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data for multi-
level models: Simulations and recommendations. Organizational Research Methods, 21, 111–
149.
Grund, S., Lüdtke, O., & Robitzsch, A. (2021, May 23). Multiple imputation of missing data in
multilevel models with the R package mdmb: A flexible sequential modeling approach.
Behavior Research Methods. [Epub ahead of print]
Grund, S., Robitzsch, A., & Lüdtke, O. (2021). Package ‘mitml.’ Retrieved from
https://cran.r-project.org/web/packages/mitml/mitml.pdf.
Guo, J., Gabry, J., Goodrich, B., & Weber, S. (2020). Package ‘rstan.’ Retrieved from
https://cran.r-project.org/web/packages/rstan/rstan.pdf.
Hamaker, E. L., & Muthén, B. (2020). The fixed versus random effects debate and how it relates
to centering in multilevel modeling. Psychological Methods, 25, 365–379.
Hancock, G. R., & Liu, M. (2012). Bootstrapping standard errors and data-model fit statistics in
structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling
(pp. 296–306). New York: Guilford Press.
Hardt, J., Herke, M., & Leonhart, R. (2012). Auxiliary variables in multiple imputation in regres-
sion with missing X: A warning against including too many in small sample research. BMC
Medical Research Methodology, 12, Article 184.
Harel, O. (2007). Inferences on missing information under multiple imputation and two-stage
multiple imputation. Statistical Methodology, 4, 75–89.
Hartley, H. O. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14,
174–194.
Hartley, H. O., & Hocking, R. R. (1971). The analysis of incomplete data. Biometrics, 27, 783–823.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applica-
tions. Biometrika, 57, 97–109.
Hayes, A. F. (2013). Introduction to mediation, moderation, and conditional process analysis: A
regression-based approach. New York: Guilford Press.
Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in
OLS regression: An introduction and software implementation. Behavior Research Methods,
39, 709–722.
He, Y. L., & Raghunathan, T. E. (2009). On the performance of sequential regression multiple
missing data using complete data routines. Journal of Educational and Behavioral Statistics,
24, 21–41.
Jamshidian, M., & Jalal, S. (2010). Tests of homoscedasticity, normality, and missing completely
at random for incomplete multivariate data. Psychometrika, 75, 649–674.
Jansen, I., Hens, N., Molenberghs, G., Aerts, M., Verbeke, G., & Kenward, M. G. (2006). The
nature of sensitivity in monotone missing not at random models. Computational Statistics
and Data Analysis, 50, 830–858.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceed-
ings of the Royal Society of London A: Mathematical and Physical Sciences, 186, 453–461.
Jeffreys, H. (1961). Theory of probability (3rd ed.). London: Oxford University Press.
Jeličić, H., Phelps, E., & Lerner, R. M. (2009). Use of missing data methods in longitudinal stud-
ies: The persistence of bad practices in developmental psychology. Developmental Psychol-
ogy, 45, 1195–1199.
Jinadasa, K., & Tracy, D. (1992). Maximum likelihood estimation for multivariate normal dis-
tribution with monotone sample. Communications in Statistics—Theory and Methods, 21,
41–50.
Johnson, E. G. (1992). The design of the National Assessment of Educational Progress. Journal of
Educational Measurement, 29, 95–110.
Johnson, V. E. (1996). Studying convergence of Markov chain Monte Carlo algorithms using
coupled sample paths. Journal of the American Statistical Association, 91, 154–166.
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis.
Psychometrika, 34, 183–202.
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three
approaches. Multivariate Behavioral Research, 36, 347–387.
Jöreskog, K. G., & Sörbom, D. (2021). LISREL 11 for Windows. Skokie, IL: Scientific Software
International.
Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2021). Package
‘semTools.’ Retrieved from https://cran.r-project.org/web/packages/semtools/semtools.pdf.
Jose, P. E. (2013). Doing statistical mediation and moderation. New York: Guilford Press.
Judd, C. M., & Kenny, D. A. (1981). Process analysis: Estimating mediation in treatment evalua-
tions. Evaluation Review, 5, 602–619.
Kaplan, D. (1990). Evaluating and modifying covariance structure models: A review and recom-
mendation. Multivariate Behavioral Research, 25, 137–155.
Kaplan, D. (2009). Structural equation modeling: Foundations and extensions (2nd ed.). Thousand
Oaks, CA: Sage.
Kaplan, D. (2014). Bayesian statistics for the social sciences. New York: Guilford Press.
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In R. Hoyle (Ed.),
Handbook of structural equation modeling (pp. 650–673). New York: Guilford Press.
Karahalios, A., Baglietto, L., Carlin, J. B., English, D. R., & Simpson, J. A. (2012). A review of the
reporting and handling of missing data in cohort studies with repeated assessment of expo-
sure measures. BMC Medical Research Methodology, 12, 1–10.
Kasim, R. M., & Raudenbush, S. W. (1998). Application of Gibbs sampling to nested variance
components models with heterogeneous within-group variance. Journal of Educational and
Behavioral Statistics, 23, 93–116.
Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal
of the American Statistical Association, 91, 1343–1370.
Keller, B. T. (2022). Model-based missing data handling for scale score and latent variable interac-
tions. Manuscript in progress.
Keller, B. T., & Enders, C. K. (2021). Blimp user’s guide (Version 3). Retrieved from
www.appliedmissingdata.com/blimp.
Keller, B. T., & Enders, C. K. (2022). An investigation of factored regression missing data methods for
multilevel models with cross-level interactions. Manuscript submitted for publication.
Kenward, M. G. (1998). Selection models for repeated measurements with non-random dropout:
An illustration of sensitivity. Statistics in Medicine, 17, 2723–2732.
Kenward, M. G., & Molenberghs, G. (1998). Likelihood based frequentist inference when data
are missing at random. Statistical Science, 13, 236–247.
Kenward, M. G., & Molenberghs, G. (2014). A perspective and historical overview on selection,
pattern-mixture and shared parameter models. In G. Molenberghs, G. Fitzmaurice, M. G.
Kenward, A. Tsiatis, & G. Verbeke (Eds.), Handbook of missing data methodology (pp. 53–90).
Boca Raton, FL: CRC Press.
Kim, J. K., Brick, J. M., Fuller, W. A., & Kalton, G. (2006). On the bias of the multiple-imputation
variance estimator in survey sampling. Journal of the Royal Statistical Society B: Statistical
Methodology, 68, 509–521.
Kim, K. H., & Bentler, P. M. (2002). Tests of homogeneity of means and covariance matrices for
multivariate incomplete data. Psychometrika, 67, 609–623.
Kim, S., Belin, T. R., & Sugar, C. A. (2018). Multiple imputation with non-additively related
variables: Joint-modeling and approximations. Statistical Methods in Medical Research, 27,
1683–1694.
Kim, S., Sugar, C. A., & Belin, T. R. (2015). Evaluating model-based imputation methods for miss-
ing covariates in regression models with interactions. Statistics in Medicine, 34, 1876–1888.
King, G., & Roberts, M. E. (2015). How robust standard errors expose methodological problems
they do not fix, and what to do about it. Political Analysis, 23, 159–179.
Klebanoff, M. A., & Cole, S. R. (2008). Use of multiple imputation in the epidemiologic literature.
American Journal of Epidemiology, 168, 355–357.
Kleinke, K. (2017). Multiple imputation under violated distributional assumptions: A systematic
evaluation of the assumed robustness of predictive mean matching. Journal of Educational
and Behavioral Statistics, 42, 371–404.
Kline, R. B. (2015). Principles and practice of structural equation modeling (4th ed.). New York:
Guilford Press.
Kopylov, I. (2008). Subjective probability. In T. Rudas (Ed.), Handbook of probability: Theory and
applications (pp. 35–48). Thousand Oaks, CA: Sage.
Kreft, I. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in hier-
archical linear models. Multivariate Behavioral Research, 30, 1–21.
Kruschke, J. K., & Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psychonomic Bul-
letin and Review, 25, 155–177.
Kunkel, D., & Kaizar, E. E. (2017). A comparison of existing methods for multiple imputation in
individual participant data meta-analysis. Statistics in Medicine, 36, 3507–3532.
Lee, K. J., & Carlin, J. B. (2017). Multiple imputation in the presence of non-normal data. Statis-
tics in Medicine, 36, 606–617.
Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment
on Trafimow (2003). Psychological Review, 112, 662–668.
Lee, T., & Cai, L. (2012). Alternative multiple imputation inference for mean and covariance
structure modeling. Journal of Educational and Behavioral Statistics, 37, 675–702.
Levy, R., & Enders, C. (2021, May 6). Full conditional distributions for Bayesian multilevel mod-
els with additive or interactive effects and missing data on covariates. Communications in
Statistics—Simulation and Computation. [Epub ahead of print]
Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Boca Raton, FL: CRC Press.
Li, K.-H., Meng, X.-L., Raghunathan, T. E., & Rubin, D. B. (1991). Significance levels from
repeated p-values with multiply-imputed data. Statistica Sinica, 1, 65–92.
Li, K. H., Raghunathan, T. E., & Rubin, D. B. (1991). Large-sample significance levels from mul-
tiply imputed data using moment-based statistics and an F reference distribution. Journal of
the American Statistical Association, 86, 1065–1073.
Liang, J., & Bentler, P. M. (2004). An EM algorithm for fitting two-level structural equation mod-
els. Psychometrika, 69, 101–122.
Lipsitz, S. R., & Ibrahim, J. G. (1996). A conditional model for incomplete covariates in paramet-
ric regression models. Biometrika, 83, 916–922.
Little, R. (2009). Selection and pattern-mixture models. In G. Fitzmaurice, M. Davidian, G.
Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 409–431). Boca Raton, FL:
Chapman & Hall.
Little, R. J. (1988a). Missing-data adjustments in large surveys. Journal of Business and Economic
Statistics, 6, 287–296.
Little, R. J. A. (1988b). A test of missing completely at random for multivariate data with missing
values. Journal of the American Statistical Association, 83, 1198–1202.
Little, R. J. A. (1992). Regression with missing X’s: A review. Journal of the American Statistical
Association, 87, 1227–1237.
Little, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the
American Statistical Association, 88, 125–134.
Little, R. J. A. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika,
81, 471–483.
Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of
the American Statistical Association, 90, 1112–1121.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. Hoboken, NJ: Wiley.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken,
NJ: Wiley.
Little, R. J. A., & Rubin, D. B. (2020). Statistical analysis with missing data (3rd ed.). Hoboken, NJ:
Wiley.
Little, T. D. (2013). Longitudinal structural equation modeling. New York: Guilford Press.
Little, T. D., & Rhemtulla, M. (2013). Planned missing data designs for developmental research-
ers. Child Development Perspectives, 7, 199–204.
Liu, C. (1995). Missing data imputation using the multivariate t distribution. Journal of Multivari-
ate Analysis, 53, 139–158.
Liu, G., & Gould, A. L. (2002). Comparison of alternative strategies for analysis of longitudinal
trials with dropouts. Journal of Biopharmaceutical Statistics, 12, 207–226.
Liu, H. Y., Zhang, Z. Y., & Grimm, K. J. (2016). Comparison of inverse Wishart and separation-
strategy priors for Bayesian estimation of covariance parameter matrix in growth curve
analysis. Structural Equation Modeling: A Multidisciplinary Journal, 23, 354–367.
Liu, J. C., Gelman, A., Hill, J., Su, Y. S., & Kropko, J. (2014). On the stationary distribution of
iterative imputations. Biometrika, 101, 155–173.
Liu, Y., & Enders, C. K. (2017). Evaluation of multi-parameter test statistics for multiple imputa-
tion. Multivariate Behavioral Research, 52, 371–390.
Liu, Y., & Sriutaisuk, S. (2019). Evaluation of model fit in structural equation models with ordi-
nal missing data: An examination of the D2 method. Structural Equation Modeling: A Multi-
disciplinary Journal, 27, 561–583.
Lomnicki, Z. A. (1967). On the distribution of products of random variables. Journal of the Royal
Statistical Society B: Statistical Methodology, 29, 513–524.
Longford, N. (1989). Contextual effects and group means. Multilevel Modelling Newsletter, 1, 5–11.
Lord, F. M. (1955). Estimation of parameters from incomplete data. Journal of the American Sta-
tistical Association, 50, 870–876.
Lord, F. M. (1962). Estimating norms by item-sampling. Educational and Psychological Measure-
ment, 22, 259–267.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm.
Journal of the Royal Statistical Society B: Statistical Methodology, 44, 226–233.
Lüdtke, O., Marsh, H. W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008).
The multilevel latent covariate model: A new, more reliable approach to group-level effects
in contextual studies. Psychological Methods, 13, 201–229.
Lüdtke, O., Robitzsch, A., & Grund, S. (2017). Multiple imputation of missing data in multilevel
designs: A comparison of different strategies. Psychological Methods, 22, 141–165.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020a). Analysis of interactions and nonlinear effects
with missing data: A factored regression modeling approach using maximum likelihood
estimation. Multivariate Behavioral Research, 55, 361–381.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020b). Regression models involving nonlinear effects
with missing data: A sequential modeling approach using Bayesian estimation. Psychologi-
cal Methods, 25, 157–181.
Lunn, D., Jackson, C., Thomas, A., & Spiegelhalter, D. (2013). The BUGS book. Boca Raton, FL:
CRC Press.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists.
Berlin: Springer.
MacCallum, R. (1986). Specification searches in covariance structure modeling. Psychological
Bulletin, 100, 107–120.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covari-
ance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111,
490–504.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Erlbaum.
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A com-
parison of methods to test mediation and other intervening variable effects. Psychological
Methods, 7, 83–104.
MacKinnon, D. P., Lockwood, C. M., & Williams, J. (2004). Confidence limits for the indi-
rect effect: Distribution of the product and resampling methods. Multivariate Behavioral
Research, 39, 99–128.
Madley-Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data
should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiol-
ogy, 110, 63–73.
Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and
econometrics (3rd ed.). West Sussex, UK: Wiley.
Mallinckrodt, C. H., Clark, W. S., & David, S. R. (2001). Accounting for dropout bias using
mixed-effects models. Journal of Biopharmaceutical Statistics, 11, 9–21.
Manly, C. A., & Wells, R. S. (2015). Reporting the use of multiple imputation for missing data in
higher education research. Research in Higher Education, 56, 397–409.
Mansolf, M., Jorgensen, T. D., & Enders, C. K. (2020). A multiple imputation score test for model
modification in structural equation models. Psychological Methods, 25, 393–411.
Marshall, A., Altman, D. G., Holder, R. L., & Royston, P. (2009). Combining estimates of interest
in prognostic modelling studies after multiple imputation: Current practice and guidelines.
BMC Medical Research Methodology, 9, 1–8.
Matz, A. W. (1978). Maximum likelihood parameter estimation for the quartic exponential dis-
tribution. Technometrics, 20, 475–484.
References 507
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-
of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statisti-
cal Association, 100, 1009–1020.
Mazza, G. L., Enders, C. K., & Ruehlman, L. S. (2015). Addressing item-level missing data: A
comparison of proration and full information maximum likelihood estimation. Multivariate
Behavioral Research, 50, 504–519.
McCulloch, R., & Rossi, P. E. (1994). An exact likelihood analysis of the multinomial probit
model. Journal of Econometrics, 64, 207–240.
McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial
probit model with fully identified parameters. Journal of Econometrics, 99, 173–193.
McDonald, R. P., & Ho, M. H. (2002). Principles and practice in reporting structural equation
analyses. Psychological Methods, 7, 64–82.
McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level depen-
dent variables. Journal of Mathematical Sociology, 4, 103–120.
McLachlan, G. J., & Krishnan, T. (2007). The EM algorithm and extensions. Hoboken, NJ: Wiley.
McNeish, D. (2016a). On using Bayesian methods to address small sample problems. Structural
Equation Modeling: A Multidisciplinary Journal, 23, 750–773.
McNeish, D. M. (2016b). Using data-dependent priors to mitigate small sample bias in latent
growth models: A discussion and illustration using Mplus. Journal of Educational and Behav-
ioral Statistics, 41, 27–56.
McNeish, D., & Kelley, K. (2019). Fixed effects models versus mixed effects models for clustered
data: Reviewing the approaches, disentangling the differences, and making recommenda-
tions. Psychological Methods, 24, 20–35.
McNeish, D., Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierar-
chical linear modeling. Psychological Methods, 22, 114–140.
Mealli, F., & Rubin, D. B. (2016). Clarifying missing at random and related definitions, and
implications when coupled with exchangeability. Biometrika, 103, 491.
Mehta, P. D., & Neale, M. C. (2005). People are variables too: Multilevel structural equations
modeling. Psychological Methods, 10, 259–284.
Mehta, P. D., & West, S. G. (2000). Putting the individual back into individual growth curves.
Psychological Methods, 5, 23–43.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical
Science, 9, 538–558.
Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance–covariance matri-
ces: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.
Meng, X.-L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data
sets. Biometrika, 79, 103–111.
Merkle, E. C., Fitzsimmons, E., Uanhoro, J., & Goodrich, B. (2020). Efficient Bayesian structural
equation modeling in Stan. Retrieved from https://arxiv.org/pdf/2008.07733.pdf.
Merkle, E. C., & Rosseel, Y. (2018). blavaan: Bayesian structural equation models via parameter
expansion. Journal of Statistical Software, 85, 1–30.
Merkle, E. C., Rosseel, Y., Goodrich, B., & Garnier-Villarreal, M. (2021). Package ‘blavaan.’
Retrieved from https://cran.r-project.org/web/packages/blavaan/blavaan.pdf.
Mi, X., Miwa, T., & Hothorn, T. (2009). mvtnorm: New numerical algorithm for multivariate
normal probabilities. R Journal, 1, 37–39.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological
Bulletin, 105, 156–166.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex sam-
ples. Psychometrika, 56, 177–196.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population charac-
teristics from sparse matrix samples of item responses. Journal of Educational Measurement,
29, 133–161.
Mistler, S. A., & Enders, C. K. (2011). An introduction to planned missing data designs for devel-
opmental research. In B. Laursen, T. Little, & N. Card (Eds.), Handbook of developmental
research methods (pp. 742–754). New York: Guilford Press.
Mistler, S. A., & Enders, C. K. (2017). A comparison of joint model and fully conditional speci-
fication imputation for multilevel missing data. Journal of Educational and Behavioral Statis-
tics, 42, 432–466.
Mohan, K., Pearl, J., & Tian, J. (Eds.). (2013). Graphical models for inference with missing data. Red
Hook, NY: Curran Associates.
Molenberghs, G., Beunckens, C., Sotto, C., & Kenward, M. G. (2008). Every missingness not at
random model has a missingness at random counterpart with equal fit. Journal of the Royal
Statistical Society B: Statistical Methodology, 70, 371–388.
Molenberghs, G., & Kenward, M. (2007). Missing data in clinical studies. West Sussex, UK: Wiley.
Molenberghs, G., Michiels, B., Kenward, M. G., & Diggle, P. J. (1998). Monotone missing data and
pattern-mixture models. Statistica Neerlandica, 52, 153–161.
Molenberghs, G., Thijs, H., Jansen, I., Beunckens, C., Kenward, M. G., Mallinckrodt, C., & Car-
roll, R. J. (2004). Analyzing incomplete longitudinal clinical trial data. Biostatistics, 5, 445–
464.
Molenberghs, G., & Verbeke, G. (2001). A review on linear mixed models for longitudinal data,
possibly subject to dropout. Statistical Modelling, 1, 235–269.
Molenberghs, G., Verbeke, G., Thijs, H., Lesaffre, E., & Kenward, M. G. (2001). Influence analy-
sis to assess sensitivity of the dropout process. Computational Statistics and Data Analysis,
37, 93–113.
Montgomery, D. C. (2020). Design and analysis of experiments (10th ed.). Hoboken, NJ: Wiley.
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statisti-
cal methods. Statistics in Medicine, 38, 2074–2102.
Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical,
and continuous latent variable indicators. Psychometrika, 49, 115–132.
Muthén, B., & Asparouhov, T. (2008). Growth mixture modeling: Analysis with non-Gaussian
random effects. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Lon-
gitudinal data analysis (pp. 143–165). Boca Raton, FL: Chapman & Hall.
Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible
representation of substantive theory. Psychological Methods, 17, 313–335.
Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with non-
ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological
Methods, 16, 17–33.
Muthén, B., du Toit, S. H. C., & Spisic, D. (1997). Robust inference using weighted least squares
and quadratic estimating equations in latent variable modeling with categorical and con-
tinuous outcomes. Unpublished technical report. Retrieved from www.statmodel.com/download/article_075.pdf.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are
not missing completely at random. Psychometrika, 52, 431–462.
Muthén, B., & Masyn, K. (2005). Discrete-time survival mixture analysis. Journal of Educational
and Behavioral Statistics, 30, 27–58.
Muthén, B., Muthén, L., & Asparouhov, T. (2016). Regression and mediation analysis using Mplus.
Los Angeles: Muthén & Muthén.
Muthén, B., & Shedden, K. (1999). Finite mixture modeling with mixture outcomes using the
EM algorithm. Biometrics, 55, 463–469.
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (8th ed.). Los Angeles: Muthén
& Muthén.
Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size
and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9, 599–620.
Mykland, P., Tierney, L., & Yu, B. (1995). Regeneration in Markov chain samplers. Journal of the
American Statistical Association, 90, 233–241.
Nandram, B., & Chen, M.-H. (1996). Reparameterizing the generalized linear model to accelerate
Gibbs sampler convergence. Journal of Statistical Computation and Simulation, 54, 129–144.
Neelon, B. (2019). Bayesian zero-inflated negative binomial regression based on Pólya-gamma
mixtures. Bayesian Analysis, 14, 829–855.
Nesselroade, J. R., & Baltes, P. B. (1979). Longitudinal research in the study of behavior and develop-
ment. New York: Academic Press.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical Review,
71, 593–607.
O’Brien, S. M., & Dunson, D. B. (2004). Bayesian multivariate logistic regression. Biometrics, 60,
739–746.
O’Hagan, A. (2008). The Bayesian approach to statistics. In T. Rudas (Ed.), Handbook of probabil-
ity: Theory and applications (pp. 85–100). Thousand Oaks, CA: Sage.
Olkin, I., & Tate, R. F. (1961). Multivariate correlation models with mixed discrete and continu-
ous variables. Annals of Mathematical Statistics, 32, 448–465.
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psy-
chometrika, 44, 443–460.
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applica-
tions. In Proceedings from the Sixth Berkeley Symposium on Mathematical Statistics and Prob-
ability: Vol. 1. Theory of statistics (pp. 697–715). Berkeley: University of California Press.
Palomo, J., Dunson, D. B., & Bollen, K. (2007). Bayesian structural equation modeling. In S.-Y.
Lee (Ed.), Handbook of latent variable and related models (pp. 163–188). Amsterdam: Elsevier.
Pan, Q., & Wei, R. (2016). Fraction of missing information (γ) at different missing data fractions
in the 2012 NAMCS Physician Workflow Mail Survey. Applied Mathematics, 7, 1057–1067.
Park, T., & Lee, S. Y. (1997). A test of missing completely at random for longitudinal data with
missing observations. Statistics in Medicine, 16, 1859–1871.
Pawitan, Y. (2000). A reminder of the fallibility of the Wald statistic: Likelihood explanation.
American Statistician, 54, 54–56.
Paxton, P., Curran, P. J., Bollen, K. A., Kirby, J., & Chen, F. N. (2001). Monte Carlo experiments:
Design and implementation. Structural Equation Modeling: A Multidisciplinary Journal, 8,
287–312.
Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting
practices and suggestions for improvement. Review of Educational Research, 74, 525–556.
Peyre, H., Leplege, A., & Coste, J. (2011). Missing data methods for dealing with missing items
in quality of life questionnaires: A comparison by simulation of personal mean score, full
information maximum likelihood, multiple imputation, and hot deck techniques applied to
the SF-36 in the French 2003 decennial health survey. Quality of Life Research, 20, 287–300.
Plummer, M. (2019). Package ‘rjags.’ Retrieved from https://cran.r-project.org/web/packages/rjags/
rjags.pdf.
Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–
gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
Poon, W.-Y., & Lee, S.-Y. (1998). Analysis of two-level structural equation models via EM-type
algorithm. Statistica Sinica, 8, 749–766.
Potthoff, R. F., Tudor, G. E., Pieper, K. S., & Hasselblad, V. (2006). Can one assess whether miss-
ing data are missing at random in medical studies? Statistical Methods in Medical Research,
15, 213–234.
Pritikin, J. N., Brick, T. R., & Neale, M. C. (2018). Multivariate normal maximum likelihood
with both ordinal and continuous variables, and data missing at random. Behavior Research
Methods, 50, 490–500.
Puhani, P. A. (2000). The Heckman correction for sample selection and its critique. Journal of
Economic Surveys, 14, 53–68.
Quartagno, M., & Carpenter, J. R. (2016). Multiple imputation for IPD meta-analysis: Allowing
for heterogeneity and studies with missing covariates. Statistics in Medicine, 35, 2938–2954.
Quartagno, M., & Carpenter, J. R. (2019). Multiple imputation for discrete data: Evaluation of the
joint latent normal model. Biometrical Journal, 61, 1003–1019.
Quartagno, M., & Carpenter, J. (2020). Package ‘jomo.’ Retrieved from https://cran.r-project.org/
web/packages/jomo/jomo.pdf.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). Generalized multilevel structural equation
modeling. Psychometrika, 69, 167–190.
Rabe-Hesketh, S., Skrondal, A., & Zheng, X. (2012). Multilevel structural equation modeling.
In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 512–531). New York:
Guilford Press.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25,
111–163.
Raftery, A. E., & Lewis, S. M. (1992). [Practical Markov chain Monte Carlo]: Comment: One long
run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical
Science, 7, 493–497.
Raghunathan, T. E., & Grizzle, J. E. (1995). A split questionnaire survey design. Journal of the
American Statistical Association, 90, 54–63.
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate
technique for multiply imputing missing values using a sequence of regression models.
Survey Methodology, 27, 85–95.
Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters
with applications to problems of estimation. Mathematical Proceedings of the Cambridge Phil-
osophical Society, 44, 50–57.
Raudenbush, S. W. (1995). Maximum likelihood estimation for unbalanced multilevel covari-
ance structure models via the EM algorithm. British Journal of Mathematical and Statistical
Psychology, 48, 359–370.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis
methods (2nd ed.). Thousand Oaks, CA: Sage.
Raudenbush, S. W., Bryk, A. S., Cheong, Y., & Congdon, R. (2019). HLM for Windows [Computer
software]. Skokie, IL: Scientific Software International.
Raykov, T. (2011). On testability of missing data mechanisms in incomplete data sets. Structural
Equation Modeling: A Multidisciplinary Journal, 18, 419–429.
Raykov, T., Lichtenberg, P. A., & Paulson, D. (2012). Examining the missing completely at ran-
dom mechanism in incomplete data sets: A multiple testing approach. Structural Equation
Modeling: A Multidisciplinary Journal, 19, 399–408.
Raykov, T., & Marcoulides, G. A. (2004). Using the delta method for approximate interval esti-
Rubin, D. B. (2003). Discussion on multiple imputation. International Statistical Review, 71, 619–
625.
Rubin, D. B. (2004). The design of a general and flexible system for handling nonresponse in
sample surveys. American Statistician, 58, 298–302.
Rubin, D. B., Stern, H. S., & Vehovar, V. (1995). Handling “don’t know” survey responses: The
case of the Slovenian plebiscite. Journal of the American Statistical Association, 90, 822–828.
Saris, W. E., Satorra, A., & Sörbom, D. (1987). The detection and correction of specification
errors in structural equation models. Sociological Methodology, 17, 105–129.
Sartori, A. E. (2003). An estimator for some binary-outcome selection models without exclusion
restrictions. Political Analysis, 11, 111–138.
Satorra, A., & Bentler, P. M. (1988). Scaling corrections for chi-square statistics in covariance
structure analysis. In ASA 1988 Proceedings of the Business and Economic Statistics Section
(pp. 308–313). Alexandria, VA: American Statistical Association.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covari-
ance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applica-
tions for developmental research (pp. 399–419). Thousand Oaks, CA: Sage.
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment struc-
ture analysis. Psychometrika, 66, 507–514.
Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and
nonnormal data. Psychological Methods, 15, 352–367.
Savalei, V. (2014). Understanding robust corrections in structural equation modeling. Structural
Equation Modeling: A Multidisciplinary Journal, 21, 149–160.
Savalei, V., & Bentler, P. M. (2005). A statistically justified pairwise ML method for incomplete
nonnormal data: A comparison with direct ML and pairwise ADF. Structural Equation Mod-
eling: A Multidisciplinary Journal, 12, 183–214.
Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: Theory and application
to auxiliary variables. Structural Equation Modeling: A Multidisciplinary Journal, 16, 477–497.
Savalei, V., & Falk, C. F. (2014). Robust two-stage approach outperforms robust full information
maximum likelihood with incomplete nonnormal data. Structural Equation Modeling: A Mul-
tidisciplinary Journal, 21, 280–302.
Savalei, V., & Rhemtulla, M. (2012). On obtaining estimates of the fraction of missing informa-
tion from full information maximum likelihood. Structural Equation Modeling: A Multidisci-
plinary Journal, 19, 477–494.
Savalei, V., & Rhemtulla, M. (2017). Normal theory two-stage estimator for models with compos-
ites when data are missing at the item level. Journal of Educational and Behavioral Statistics,
42, 405–431.
Savalei, V., & Rosseel, Y. (2021, April 14). Computational options for standard errors and test
statistics with incomplete normal and nonnormal data. Structural Equation Modeling: A Mul-
tidisciplinary Journal. [Epub ahead of print]
Savalei, V., & Yuan, K. H. (2009). On the model-based bootstrap with missing data: Obtaining a
p-value for a test of exact fit. Multivariate Behavioral Research, 44, 741–763.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. New York: Chapman & Hall.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8,
3–15.
Schafer, J. L. (2001). Multiple imputation with PAN. In A. G. Sayer & L. M. Collins (Eds.), New
methods for the analysis of change (pp. 355–377). Washington, DC: American Psychological
Association.
Schafer, J. L. (2003). Multiple imputation in multivariate problems when the imputation and
analysis models differ. Statistica Neerlandica, 57, 19–35.
Springer, M. D., & Thompson, W. E. (1966). The distribution of products of independent random variables.
SIAM Journal on Applied Mathematics, 14, 511–526.
Spybrook, J., Bloom, H., Congdon, R., Hill, C., Martinez, A., & Raudenbush, S. W. (2011). Opti-
mal design plus empirical evidence: Documentation for the “Optimal Design” software ver-
sion 3.0. Retrieved from http://hlmsoft.net/od/od-manual-20111016-v300.pdf.
Stapleton, L. M. (2013). Multilevel structural equation modeling with complex sample data. In G.
R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 521–
562). Charlotte, NC: Information Age.
Steiger, J. H. (1989). EZPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL:
SYSTAT.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation
approach. Multivariate Behavioral Research, 25, 173–180.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically-based tests for the number of common factors.
Paper presented at the Annual Meeting of the Psychometric Society, Iowa City, IA.
Sterba, S. K., & Gottfredson, N. C. (2014). Diagnosing global case influence on MAR versus
MNAR model comparisons. Structural Equation Modeling: A Multidisciplinary Journal, 22,
294–307.
Stern, H. (1998). A primer on the Bayesian approach to statistical inference. Stats, 23, 3–9.
Sterne, J. A., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., . . . Carpenter, J.
R. (2009). Multiple imputation for missing data in epidemiological and clinical research:
Potential and pitfalls. British Medical Journal, 338, Article b2393.
Sterner, W. R. (2011). What is missing in counseling research?: Reporting missing data. Journal
of Counseling and Development, 89, 56–62.
Su, Y.-S., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi)
in R: Opening windows into the black box. Journal of Statistical Software, 45, 1–31.
Taljaard, M., Donner, A., & Klar, N. (2008). Imputation strategies for missing continuous out-
comes in cluster randomized trials. Biometrical Journal, 50, 329–345.
Thijs, H., Molenberghs, G., Michiels, B., Verbeke, G., & Curran, D. (2002). Strategies to fit
pattern-mixture models. Biostatistics, 3, 245–265.
Thijs, H., Molenberghs, G., & Verbeke, G. (2000). The milk protein trial: Influence analysis of
the dropout process. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 42,
617–646.
Thoemmes, F., & Mohan, K. (2015). Graphical representation of missing data problems. Struc-
tural Equation Modeling: A Multidisciplinary Journal, 22, 631–642.
Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in
missing data problems. Multivariate Behavioral Research, 49, 443–459.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analy-
sis. Psychometrika, 38, 1–10.
U.S. Census Bureau. (2019). 2018 National Survey of Children’s Health: Analysis with multiply
imputed data. Retrieved from www2.census.gov/programs-surveys/nsch/technical-documentation/methodology/nsch-analysis-with-imputed-data-guide.pdf.
Vach, W. (1994). Logistic regression with missing values in the covariates. Berlin: Springer-Verlag.
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional
specification. Statistical Methods in Medical Research, 16, 219–242.
van Buuren, S. (2010). Item imputation without specifying scale structure. Methodology, 6, 31–36.
van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.),
Handbook of advanced multilevel analysis (pp. 173–196). New York: Routledge.
van Buuren, S. (2012). Flexible imputation of missing data. New York: Chapman & Hall.
van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number
of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wang, N., & Robins, J. M. (1998). Large-sample theory for parametric multiple imputation pro-
cedures. Biometrika, 85, 935–948.
West, S. G., & Thoemmes, F. (2010). Campbell’s and Rubin’s perspectives on causal inference.
Psychological Methods, 15, 18–37.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test
for heteroskedasticity. Econometrica, 48, 817–838.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50,
1–26.
White, H. (1996). Estimation, inference and specification analysis. New York: Cambridge Univer-
sity Press.
White, I. R., & Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with
complete-case analysis for missing covariate values. Statistics in Medicine, 29, 2920–2931.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations:
Issues and guidance for practice. Statistics in Medicine, 30, 377–399.
Whittaker, T. A. (2012). Using the modification index and standardized expected parameter
change for model modification. Journal of Experimental Education, 80, 26–44.
Widaman, K. F. (2006). Missing data: What to do with or without them. Monographs of the Society
for Research in Child Development, 71, 42–64.
Widaman, K. F., & Thompson, J. S. (2003). On specifying the null model for incremental fit indi-
ces in structural equation modeling. Psychological Methods, 8, 16–37.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology
journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite
hypotheses. Annals of Mathematical Statistics, 9, 60–62.
Winship, C., & Mare, R. D. (1992). Models for sample selection bias. Annual Review of Sociology,
18, 327–350.
Wirth, R., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future direc-
tions. Psychological Methods, 12, 58–79.
Wood, A. M., White, I. R., & Thompson, S. G. (2004). Are missing outcome data adequately
handled? A review of published randomized controlled trials in major medical journals.
Clinical Trials, 1, 368–376.
Wothke, W. (2000). Longitudinal and multi-group modeling with missing data. In T. D. Little,
K. U. Schnabel, & J. Baumert (Eds.), Modeling longitudinal and multilevel data: Practical
issues, applied approaches, and specific examples (pp. 1–24). Mahwah, NJ: Erlbaum.
Wu, M. C., & Carroll, R. J. (1988). Estimation and comparison of changes in the presence of
informative right censoring by modeling the censoring process. Biometrics, 44, 175–188.
Wu, W., & Jia, F. (2013). A new procedure to test mediation with missing data through non-
parametric bootstrapping and multiple imputation. Multivariate Behavioral Research, 48,
663–691.
Wu, W., Jia, F., & Enders, C. (2015). A comparison of imputation strategies for ordinal missing
data on Likert scale variables. Multivariate Behavioral Research, 50, 484–503.
Wu, W., Jia, F., Rhemtulla, M., & Little, T. D. (2016). Search for efficient complete and planned
missing data designs for analysis of change. Behavior Research Methods, 48, 1047–1061.
Xu, S., & Blozis, S. A. (2011). Sensitivity analysis of mixed models for incomplete longitudinal
data. Journal of Educational and Behavioral Statistics, 36, 237–256.
Yeo, I. K., & Johnson, R. A. (2000). A new family of power transformations to improve normality
or symmetry. Biometrika, 87, 954–959.
Yuan, K.-H. (2009a). Identifying variables responsible for data not missing at random. Psy-
chometrika, 74, 233–256.
Yuan, K.-H. (2009b). Normal distribution based pseudo ML for missing data: With applications
to mean and covariance structure analysis. Journal of Multivariate Analysis, 100, 1900–1918.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance
structure analysis with nonnormal missing data. Sociological Methodology, 30, 165–200.
Yuan, K.-H., & Bentler, P. M. (2010). Consistency of normal distribution based pseudo maximum
likelihood estimates when data are missing at random. American Statistician, 64, 263–267.
Yuan, K.-H., Bentler, P. M., & Zhang, W. (2005). The effect of skewness and kurtosis on mean and
covariance structure analysis: The univariate case and its multivariate implication. Socio-
logical Methods and Research, 34, 240–258.
Yuan, K.-H., & Hayashi, K. (2006). Standard errors in covariance structure models: Asymptot-
ics versus bootstrap. British Journal of Mathematical and Statistical Psychology, 59, 397–417.
Yuan, K.-H., Lambert, P. L., & Fouladi, R. T. (2004). Mardia’s multivariate kurtosis with missing
data. Multivariate Behavioral Research, 39, 413–437.
Yuan, K.-H., Tong, X., & Zhang, Z. (2014). Bias and efficiency for SEM with missing data and aux-
iliary variables: Two-stage robust method versus two-stage ML. Structural Equation Model-
ing: A Multidisciplinary Journal, 22, 178–192.
Yuan, K.-H., Yang-Wallentin, F., & Bentler, P. M. (2012). ML versus MI for missing data with vio-
lation of distribution conditions. Sociological Methods and Research, 41, 598–629.
Yuan, K.-H., & Zhang, Z. Y. (2012). Robust structural equation modeling with missing data and
auxiliary variables. Psychometrika, 77, 803–826.
Yuan, Y., & MacKinnon, D. P. (2009). Bayesian mediation analysis. Psychological Methods, 14,
301–322.
Yucel, R. M. (2008). Multiple imputation inference for multivariate multilevel continuous data
with ignorable non-response. Philosophical Transactions of the Royal Society A: Mathematical,
Physical and Engineering Sciences, 366, 2389–2403.
Yucel, R. M. (2011). Random-covariances and mixed-effects models for imputing multivariate
multilevel continuous data. Statistical Modelling, 11, 351–370.
Yucel, R. M., He, Y., & Zaslavsky, A. M. (2008). Using calibration to improve rounding in imputa-
tion. American Statistician, 62, 1–5.
Yucel, R. M., He, Y., & Zaslavsky, A. M. (2011). Gaussian-based routines to impute categorical
variables in health surveys. Statistics in Medicine, 30, 3447–3460.
Zellner, A., & Min, C.-K. (1995). Gibbs sampler convergence criteria. Journal of the American
Statistical Association, 90, 921–927.
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psycho-
logical Methods, 22, 649–666.
Zhang, X., Boscardin, W. J., & Belin, T. R. (2008). Bayesian analysis of multivariate nominal
measures using multivariate multinomial probit models. Computational Statistics and Data
Analysis, 52, 3697–3708.
Zhang, Z., & Wang, L. (2012). A note on the robustness of a full Bayesian method for nonignor-
able missing data analysis. Brazilian Journal of Probability and Statistics, 26, 244–264.
Zhang, Z., & Wang, L. (2013). Methods for mediation analysis with missing data. Psychometrika,
78, 154–184.
Zhang, Z., Wang, L., & Tong, X. (2015). Mediation analysis with missing data through multiple
imputation and bootstrap. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, &
S.-M. Chow (Eds.), Quantitative psychology research: The 79th meeting of the Psychometric
Society, Madison, Wisconsin, 2014 (pp. 341–355). New York: Springer.
Author Index
Du, H., 188, 316, 346, 349, 352, 358, 441, 469, 473
du Toit, S. H. C., 429
Duncan, S. C., 41
Duncan, T. E., 41
Dunson, D. B., 192, 222, 256
Dyklevych, O., 245
Dziak, J. J., 359

Freels, S. A., 205, 269
Fritz, M. S., 425
Frühwirth-Schnatter, S., 222, 256
Früwirth, R., 222, 256

G
Lunn, D., 183
Ly, A., 187
Lynch, S. M., 147, 157, 159, 161, 168, 172, 177, 187, 209, 211, 228, 236, 243, 255, 309

M

MacCallum, R., 432, 433, 436
MacKinnon, D. P., 2, 67, 350, 422, 423, 425, 428
Madley-Dowd, P., 35, 46
Magnus, J. R., 88, 89
Mallinckrodt, C. H., 31, 385
Manly, C. A., 475, 483, 484
Mansolf, M., 297, 433, 436
Marcoulides, G. A., 14, 15, 103, 373
Mare, R. D., 352
Marshall, A., 406
Marsman, M., 148, 187
Masyn, K., 381
Matz, A. W., 213
Maydeu-Olivares, A., 438
Mazza, G. L., 27, 440, 443
McCoach, D. B., 331
McCulloch, R., 183, 245
McDonald, R. P., 432
McKelvey, R. D., 232, 239
McLachlan, G. J., 112, 115
McNeish, D., 183, 221, 301, 332
Mealli, F., 1, 4, 13, 45
Mehta, P. D., 344, 345, 380
Meng, X. L., 115, 183, 287, 294, 296, 297, 298, 405, 436, 439
Merkle, E. C., 183, 185, 192, 220, 474
Mi, X., 248, 252
Micceri, T., 67
Michiels, B., 388
Min, C. K., 172
Mislevy, R. J., 147, 149, 159, 177, 439
Mistler, S. A., 40, 331, 332, 340, 346, 466
Miwa, T., 248
Moerbeek, M., 301
Mohan, K., 6
Molenberghs, G., 13, 31, 65, 66, 111, 112, 301, 348, 358, 360, 368, 379, 382, 385, 388
Monfort, A., 67
Montgomery, D. C., 37
Morris, T. P., 160
Moustaki, I., 428
Muthén, B., 14, 98, 99, 112, 115, 116, 117, 120, 122, 123, 147, 160, 182, 183, 217, 220, 222, 226, 242, 253, 256, 257, 260, 263, 286, 295, 297, 301, 312, 314, 321, 323, 331, 332, 344, 345, 359, 364, 379, 381, 382, 385, 391, 400, 422, 425, 429, 433, 436, 437, 439, 462, 464, 465, 469, 473, 481
Muthén, L. K., 99, 147, 160, 256, 382, 436, 462, 465, 473, 481
Myin-Germeys, I., 37
Mykland, P., 172

N

Nandram, B., 236, 435, 446
Neale, M. C., 99, 344, 345, 380
Necowitz, L. B., 433
Neelon, B., 462
Nesselroade, J. R., 41
Neudecker, J., 88, 89
Nicholson, J. S., 484
Nielsen, S. F., 284

O

O’Brien, S. M., 222, 256
O’Hagan, A., 149
Okiishi, J. C., 359
Olchowski, A. E., 2, 46, 268
Olkin, I., 263
Olmsted, M. P., 428
Olsen, M. K., 262, 263, 267, 268, 299, 412
Olsson, U., 438
Orchard, T., 98, 112

P

Palomo, J., 192
Pan, Q., 285
Pannekoek, J., 278
Park, T., 14
Parzen, M., 222
Paulson, D., 15
Pawitan, Y., 79
Paxton, P., 160
Pearl, J., 6
Author Index 525
Petrie, T., 112
Peugh, J. L., 24, 474
Peyre, H., 27, 440, 446
Phelps, E., 24
Pickles, A., 122
Pieper, K. S., 14
Pillai, N. S., 13
Plummer, M., 474
Polson, N. G., 222, 245, 256, 257, 260, 263, 462, 464
Poon, W. Y., 345
Pornprasertmanit, S., 436
Potthoff, R. F., 14
Press, S. J., 275
Pritikin, J. N., 99, 117, 120, 122
Puhani, P. A., 352, 355

Q

Quartagno, M., 245, 263, 274, 301, 332, 336, 337, 347, 474

R

R Core Team, 474
Rabe-Hesketh, S., 122, 128, 132, 156, 344, 379, 381
Raftery, A. E., 172, 359, 365, 391, 396
Raghunathan, T. E., 37, 38, 39, 262, 272, 275, 276, 284, 286, 294, 295, 297, 298, 331, 405, 411
Ram, N., 183
Rao, C. R., 79, 124
Raudenbush, S. W., 112, 301, 309, 311, 312, 319, 335, 343, 344, 345
Raykov, T., 4, 14, 15, 19, 20, 22, 103, 288, 373
Reiter, J. P., 284, 286, 294, 295, 297, 298, 331, 332
Reshetnyak, E., 99
Rhemtulla, M., 19, 37, 38, 39, 40, 41, 43, 67, 85, 99, 136, 180, 217, 285, 403, 426, 430, 431, 432, 435, 438, 447, 469
Richardson, S., 207
Rights, J. D., 312, 329, 465
Rippe, R. C., 483
Ritter, C., 172
Rizopoulos, D., 116
Robert, C. P., 147, 159, 172, 177, 229
Roberts, G. O., 159
Roberts, M. E., 69
Robins, J. M., 284
Robitzsch, A., 99, 116, 125, 127, 146, 202, 221, 276, 295, 316, 324, 338, 347, 413, 469, 474, 478, 480
Rockwood, N. J., 474
Rose, N., 20, 23, 142, 217, 288, 362
Rosenfeld, B., 99
Rosseel, Y., 119, 146, 183, 185, 192, 220, 436, 474
Rossi, P. E., 245
Roth, P. L., 27, 440
Roy, J., 385
Royston, P., 289, 406, 484
Roznowski, M., 433
Rubin, D. B., 1, 2, 3, 4, 5, 7, 10, 12, 13, 19, 20, 37, 45, 46, 70, 100, 112, 114, 115, 146, 172, 177, 198, 203, 213, 216, 232, 238, 244, 250, 255, 259, 261, 262, 267, 268, 279, 282, 284, 285, 286, 288, 292, 293, 294, 295, 296, 297, 299, 300, 311, 318, 319, 322, 323, 328, 329, 334, 335, 337, 342, 344, 355, 362, 371, 373, 389, 401, 404, 407, 409, 410, 415, 417, 419, 425, 427, 435, 436, 439, 445, 449, 451, 452, 456, 457, 461, 462, 464, 471, 480, 481
Ruehlman, L. S., 27
Ryan, O., 147, 187

S

Sarabia, J. M., 116, 275
Saris, W. E., 79, 124
Sartori, A. E., 355
Satorra, A., 79, 81, 82, 125, 430, 431, 435, 436
Savalei, V., 13, 38, 65, 66, 67, 69, 81, 89, 97, 109, 110, 111, 112, 119, 124, 133, 135, 136, 137, 139, 146, 285, 430, 432, 435, 437, 447
Sayer, A., 469
Schafer, J. L., 1, 2, 14, 16, 19, 20, 23, 27, 31, 41, 46, 114, 116, 133, 214, 217, 222, 262, 263, 267, 268, 282, 287, 288, 294, 297, 299, 309, 332, 334, 337, 347, 361, 373, 381, 384, 387, 388, 389, 394, 406, 409, 412, 440, 458, 474
Scheuren, F., 261, 300
Schluchter, M. D., 381
Verbeke, G., 301, 348, 358, 360, 365, 379, 385, 388
Verhagen, J., 187
Vink, G., 278, 412
von Davier, M., 439
von Hippel, P. T., 25, 27, 28, 35, 112, 115, 126, 127, 189, 202, 205, 268, 269, 279, 289, 343, 404, 411, 412, 413, 459
von Oertzen, T., 40, 469
Vrieze, S. I., 359

W

Wirth, R. J., 121, 122, 428
Wood, A. M., 24, 30, 406, 474
Woodbury, M. A., 98, 112
Wothke, W., 98, 115
Wu, M. C., 384, 483
Wu, W., 40, 274, 427, 428, 466, 469

X

Xi, N., 121
Xu, S., 379
Subject Index
Simulation. See also Monte Carlo computer simulations
    comparing missing data methods via, 31–36, 32t, 33f, 34f, 35t, 36t
    power analyses for planned missingness designs, 43–45, 44f, 45t
    selection models for multiple regression, 354
Single imputation. See also Imputation
    arithmetic mean imputation, 25–27, 26f
    listwise and pairwise deletion, 24–25, 25f
    overview, 24
Single-level multiple imputation, 461–462
Six-form design, 38, 38t. See also Multiform designs
Slopes, 58, 59f, 61–64, 61f
Software, 473–474, 476, 480–481
Split-chain method, 238
SPSS, 473
Square root transformation, 412–413
Standard errors
    alternative approaches to estimating, 67–70, 69t
    based on expected information, 65–66
    with incomplete data, 107–112, 111t
    maximum likelihood estimates and, 60–64, 60f, 61f, 76–78, 77f
    missing at random (MAR) mechanisms and, 13
    model-based imputation and, 292
    multiple imputation and, 282–285
    multivariate normal data and, 88–89
    pooling, 282–285, 292
    second derivatives and, 61–64, 61f
    structural equation modeling framework and, 118
    two-stage estimation and, 137, 139
Stata, 473
Statistical significance tests. See Significance tests
Stochastic regression imputation, 28–30, 30f, 31–36, 32t, 33f, 34f, 35t, 36t
Structural equation modeling
    auxiliary variables and, 133–134
    multilevel missing data and, 344–345
    overview, 116–120, 119t, 120f, 428–439, 429f, 430t, 435t
Subgroups, 401–407, 404t, 405t, 407t
Substantive model-compatible imputation. See Model-based imputation procedure
Synthetic parameter values, 160

T

t distribution, 411
Target distribution or target function, 207, 208f, 209–210, 210f
Tests, significance. See Significance tests
Thinning interval, 267
Three-form design, 37–38, 38t, 43–45, 44f, 45t. See also Multiform designs
Three-level models, 324–329, 329t, 330f
Trace plots
    assessing convergence of the Gibbs sampler, 172–174, 173f, 174f, 175f
    item-level missing data and, 445
    overview, 172
    regression with an ordinal outcome and, 237–238, 238f
Transformations, 411–415, 414f
Truncated normal distribution, 229
t-statistic, 285–286, 293
Tucker–Lewis Index (TLI), 432
Two-method measurement designs, 41–43, 42f
Two-stage estimation, 133–134, 136–139, 138f

U

Uncongenial scenarios, 287–288
Underidentified pattern of missing data, 2–3, 3f
Univariate analysis, 121, 148
Univariate normal distribution, 50–54, 51f, 53f, 54f, 55f, 155–159, 158f
Univariate pattern mean differences, 15–16. See also Pattern mean difference approach
Univariate pattern of missing data, 2–3, 3f
Unknown parameters, 55–58, 56t, 57f
Unplanned missing data
    multiform designs and, 39
    power analyses for planned missingness designs, 43–45, 44f, 45t
    power analysis and, 467–468
U-shaped function, 63
Utilities, 245
Craig K. Enders, PhD, is Professor and Area Chair in Quantitative Psychology in the
Department of Psychology at the University of California, Los Angeles. His primary
research focus is on analytic issues related to missing data analyses, and he leads the
research team responsible for developing the Blimp software application for missing
data analyses. Dr. Enders also conducts research in the areas of multilevel modeling and
structural equation modeling, and is an active member of the Society of Multivariate
Experimental Psychology, the American Psychological Association, and the American
Educational Research Association.
NOTATION GUIDE
i = observation index
C = number of MCMC chains
J = number of clusters (multilevel model); j is an index
E = expectation or average
G = number of groups; g is an index
H = rows per person in augmented data for factored regression (Chapter 3)
N = sample size
B = number of bootstrap samples; b is an index
T = number of iterations; t is an index
K = number of predictor variables; k is an index
V = number of variables; v is an index
P = number of unique parameters; p is an index
Q = number of hypothesized parameters tested or degrees of freedom; q is an index