Associate Editors
Denny Borsboom, University of Amsterdam, Netherlands
Hawjeng Chiou, National Taiwan Normal University, Taiwan
Ick Hoon Jin, Yonsei University, Korea
Hongyun Liu, Beijing Normal University, China
Christof Schuster, Giessen University, Germany
Jiashan Tang, Nanjing University of Posts and
Telecommunications, China
https://isdsa.org
Guest Editors
Tessa Blanken, University of Amsterdam, Netherlands
Alexander Christensen, University of Pennsylvania, USA
Han Du, University of California, Los Angeles, USA
Hudson Golino, University of Virginia, USA
Timothy Hayes, Florida International University, USA
Suzanne Jak, University of Amsterdam, Netherlands
Ge Jiang, University of Illinois at Urbana-Champaign, USA
Zijun Ke, Sun Yat-Sen University, China
Mark Lai, University of Southern California, USA
Haiyan Liu, University of California, Merced, USA
Laura Lu, University of Georgia, USA
Yujiao Mai, ISDSA, USA
Ocheredko Oleksandr, Vinnytsya National Pirogov Memorial Medical
University, Ukraine
Robert Perera, Virginia Commonwealth University, USA
Sarfaraz Serang, Utah State University, USA
Xin (Cynthia) Tong, University of Virginia, USA
Riet van Bork, University of Pittsburgh, USA
Qian Zhang, Florida State University, USA
Editorial Assistant
Wen Qu, University of Notre Dame, USA
Jin Liu, Le Kang, Roy T. Sabo, Robert M. Kirkpatrick and Robert A. Perera*
Two-step growth mixture model to examine heterogeneity in nonlinear trajectories 54–88

Shuai Zhou*, Yanling Li, Guangqing Chi, Junjun Yin, Zita Oravecz, Yosef Bodovski, Naomi P. Friedman, Scott I. Vrieze and Sy-Miin Chow
GPS2space: An Open-source Python Library for Spatial Measure Extraction from GPS Data 127–155
1 Introduction
In social and behavioral sciences, there has been great interest in the analysis of change (e.g., Collins, 1991; Lu, Zhang, & Lubke, 2010; Singer & Willett, 2003). Growth modeling is designed to provide direct information about growth by measuring the variables of interest on the same participants repeatedly through time (e.g., Demidenko, 2004; Fitzmaurice, Davidian, Verbeke, & Molenberghs, 2008; Fitzmaurice, Laird, & Ware, 2004; Hedeker & Gibbons, 2006; Singer & Willett, 2003). Among the most popular growth models, latent growth
curve models (LGCMs) are becoming increasingly important because they can
effectively capture individuals’ latent growth trajectories and also explain the
latent factors that influence such growth by analyzing the repeatedly measured
manifest variables (e.g., Baltes & Nesselroade, 1979). Manifest variables are
evident in the data, such as observed scores; latent variables cannot be measured
directly and are essentially hidden in the data, such as the latent initial levels and
latent growth rates (Singer & Willett, 2003). We use the term “latent” because these variables are not directly observable but rather must be inferred, although they may be closely related to observed scores. For example, the latent
intercept (i.e., the latent initial level) may be related to the test score at the first
occasion, the prior knowledge of a course (such as mathematical knowledge), or
other similar variables. The latent slope (i.e., the latent growth rate) may be
related to the participant’s learning ability, the attitude toward the course, the
instructor’s teaching methods, or other similar types of variables.
However, with an increase in the complexity of LGCMs comes an increase in the difficulty of estimating such models. First, missing data are almost inevitable with
longitudinal data (e.g., Jelicic, Phelps, & Lerner, 2009; Little & Rubin, 2002).
Second, conventional likelihood estimation procedures might fail for complex
models with complicated data structures.
and not yet easy to use (e.g., Baraldi & Enders, 2010), and (2) missingness
mechanisms are not testable (e.g., Little & Rubin, 2002). At the same time,
however, non-ignorable missingness analysis is a crucial and serious concern in
applied research areas, in which participants may be dropping out for reasons
directly related to the response being measured (e.g., Baraldi & Enders, 2010;
Enders, 2011b; Hedeker & Gibbons, 1997). Not attending to the non-ignorable
missingness may result in severely biased statistical estimates, standard errors,
and associated confidence intervals, and thus poses substantial risk of leading
researchers to incorrect conclusions (e.g., Little & Rubin, 2002; Schafer, 1997;
Zhang & Wang, 2012).
In a study of latent growth models, Lu, Zhang, and Lubke (2011) investigated
non-ignorable missingness in mixture models. However, the missingness in that
study was only allowed to depend on latent class membership. In practice,
even within one population, the missingness may depend on many other latent
variables, such as latent initial levels and latent growth rates. When observed
data are not completely informative about these latent variables, the missingness
is non-ignorable. Furthermore, Lu et al. (2011) did not examine how to identify
the missingness mechanisms. Accordingly, this study extends previous research
to more general non-ignorable missingness and also investigates the influences
of different types of non-ignorable missingness on model estimation.
In this article, a full Bayesian estimation approach (e.g., Lee, 2007; Muthén & Asparouhov, 2012) is proposed, which has several advantages. First, this approach involves Gibbs sampling methods (Geman & Geman, 1984). Gibbs
sampling is especially useful when the joint distribution is complex or unknown
but the conditional distribution of each variable is available. The sequence of
samples constructs a Markov chain that can be shown to be ergodic (Geman
& Geman, 1984). That is, once convergence is obtained, the samples can be treated as draws from the stationary distribution. Thus, after convergence the generated values are actually from the joint posterior distribution
of all parameters. Each variable from the Markov chain has also been shown
to converge to the marginal distribution of that variable (Robert & Casella,
2004). Additional advantages of Bayesian methods include their intuitive
interpretations of statistical results, their flexibility in incorporating prior
information about how data behave in similar contexts and findings from
experimental research, their capacity for dealing with small sample sizes (such as
occur with special populations), and their flexibility in the analysis of complex
statistical models with complicated data structure (e.g., Dunson, 2000; Scheines,
Hoijtink, & Boomsma, 1999).
yi = Ληi + ei, (1)
ηi = β + ξi, (2)
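For readers who want a concrete instance of Equations (1) and (2), the sketch below specifies a linear LGCM in R with the lavaan package (the package used elsewhere in this issue's simulations). This is a minimal sketch under assumed names: the data frame mydata and variables y1–y4 are hypothetical, not part of the original study.

```r
library(lavaan)

# Linear latent growth curve model for four repeated measures y1-y4.
# The latent intercept I and slope S play the role of eta_i in Equation (2);
# the fixed loadings form the matrix Lambda in Equation (1).
model <- '
  I =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  S =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
'
fit <- growth(model, data = mydata)  # mydata: hypothetical data frame
summary(fit)
```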
Let γt be a vector of parameters for the missingness, and let mi = (mi1, mi2, ..., miT)′ be a vector that indicates the missingness status for yi. Specifically, if yi is missing at time point t, then mit = 1; otherwise, mit = 0. Here, we assume the missingness is conditionally independent (e.g., Dawid, 1979), which means that across different occasions the conditional distributions of missingness are independent of each other. Let τit = Pr(mit = 1) be the probability that yit is missing; then mit follows a Bernoulli distribution with parameter τit, and the density function of mit is f(mit) = τit^mit (1 − τit)^(1−mit). For different non-ignorable missingness patterns, the expressions of τit are different. Lu et al. (2011) investigated non-ignorable missingness in mixture models. The τit in that article is a function of latent class membership, and thus the missingness is Latent Class Dependent (LCD).
However, LCD was proposed in the framework of mixture models. Within
each latent population, there is no class membership indicator. Consequently, the
missingness is ignorable. In this article, we consider more complex non-ignorable
missingness mechanisms within a population. In general, we assume Li is a vector of latent variables on which the missingness depends. A general class of selection models for dealing with non-ignorable missing data in latent growth modeling can be formulated as
f(yi, mi | β, ξi, Li, γt, xi) = f(ηi | β, ξi) f(yi | ηi) Φ(ω′i γt)^mit [1 − Φ(ω′i γt)]^(1−mit)
  = f(ηi | β, ξi) f(yi | ηi) Φ(γ0t + L′i γLt + x′i γxt)^mit × [1 − Φ(γ0t + L′i γLt + x′i γxt)]^(1−mit), (3)

where xi is an r-dimensional vector, ωi = (1, L′i, x′i)′, and γt = (γ0t, γ′Lt, γ′xt)′.
The missingness is non-ignorable because it depends on the latent variables Li
in the model and the observed data are not completely informative about these
latent variables. Note that the vector γLt here should be non-zero. Otherwise,
the missingness becomes ignorable.
Specific sub-models under different situations can be derived from this general
model. For example, missingness may be related to latent intercepts, latent
growth rates, or potential outcomes. To show different types of non-ignorable
missingness, we draw the path diagrams, as shown in Figures 1, 2, and 3, to
illustrate the sub-models. These sub-models are based on three types of latent
variables on which the missingness might depend. In these path diagrams, a
square/rectangle indicates an observed variable, a circle/oval means a latent
variable, a triangle represents a constant, and arrows show the relationship
among them. yt is the outcome at time t, which is influenced by latent effects
such as I, S, and ηq. As the value of yt might be missing, we use both a circle and a square in the path diagram. If yt is missing, then the potential outcome cannot be observed and the corresponding missingness indicator mt becomes 1. The dashed lines between yt and mt show the one-to-one relationship. In these sub-models,
the value of mt depends on the observed covariate xr and some latent variables.
The details of these three sub-models are described as follows.
The first sub-model assumes Latent Intercept Dependent (LID) missingness, in which the missingness depends on the latent intercept, Ii. For example, a student's latent initial ability level of the knowledge of a course influences the likelihood of that participant dropping out of or staying in that course. If the latent initial ability in a course is not high, a student may choose to drop that course or even drop out of school. In the case of LID, the Li in Equation (3) is simplified to a univariate Ii. Suppose that the missingness is also related to some observed covariates xi, such as parents' education or family income; then τIit is expressed as a probit link function of Ii and xi,

τIit = Φ(γ0t + Ii γIt + x′i γxt) = Φ(ω′Ii γIt), (4)

where ωIi = (1, Ii, x′i)′ and γIt = (γ0t, γIt, γ′xt)′.
Figure 1. Path diagram of a latent growth model with latent intercept dependent (LID) missingness, where f(mt) depends on the covariates x1, ..., xr and the latent intercept I.
Figure 2. Path diagram of a latent growth model with latent slope dependent (LSD) missingness, where f(mt) depends on the covariates x1, ..., xr and the latent slope S.
3 Bayesian Estimation
In this article, a full Bayesian estimation approach is used to estimate growth
models. The algorithm is described as follows. First, model related latent
variables are added via the data augmentation method (Tanner & Wong, 1987).
By including auxiliary variables, the likelihood function for each model is
obtained. Second, proper priors are adopted. Third, with the likelihood function and the priors, the posterior distribution of the unknown parameters is obtained based on Bayes' Theorem.

Figure 3. Path diagram of a latent growth model with latent outcome dependent (LOD) missingness, where f(mt) depends on the covariates x1, ..., xr and the potential outcome y.

In the likelihood function, τit is defined by Equation (4) for the LID missingness, by Equation (5) for the LSD missingness, and by Equation (6) for the LOD missingness.
The commonly used proper priors (e.g., Lee, 2007) are adopted in the study. Specifically, (1) an inverse Gamma distribution prior is used for φ, that is, φ ∼ IG(v0/2, s0/2), where v0 and s0 are given hyper-parameters. The density function of an inverse Gamma distribution is f(φ) ∝ φ^(−(v0/2)−1) exp[−s0/(2φ)]. (2) An inverse Wishart distribution prior is used for Ψ. With hyper-parameters m0 and V0, Ψ ∼ IW(m0, V0), where m0 is a scalar and V0 is a q × q matrix. Its density function is f(Ψ) ∝ |Ψ|^(−(m0+q+1)/2) exp[−tr(V0 Ψ⁻¹)/2]. (3) For β a multivariate normal prior is used, β ∼ MNq(β0, Σ0), where the hyper-parameter β0 is a q-dimensional vector and Σ0 is a q × q matrix. (4) The prior for γt (t = 1, 2, ..., T) is chosen to be a multivariate normal distribution, γt ∼ MN(2+r)(γt0, Dt0), where γt0 is a (2 + r)-dimensional vector, Dt0 is a (2 + r) × (2 + r) matrix, and both are pre-determined hyper-parameters.
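To make these priors concrete, the sketch below draws one value from each, using the diffuse hyper-parameter settings reported later in the simulation section. The use of MCMCpack::riwish and MASS::mvrnorm is an implementation choice for illustration, not the authors' code.

```r
library(MASS)      # mvrnorm
library(MCMCpack)  # riwish

# phi ~ IG(v0/2, s0/2): drawn as the reciprocal of a Gamma variate
v0 <- 0.002; s0 <- 0.002
phi <- 1 / rgamma(1, shape = v0 / 2, rate = s0 / 2)

# Psi ~ IW(m0, V0) with m0 = 2 and V0 = I2
Psi <- riwish(v = 2, S = diag(2))

# beta ~ MN_q(beta0, Sigma0) with beta0 = 0_2 and Sigma0 = 10^3 I_2
beta <- mvrnorm(1, mu = rep(0, 2), Sigma = 1e3 * diag(2))

# gamma_t ~ MN_(2+r)(gamma_t0, D_t0) with gamma_t0 = 0_3 and D_t0 = 10^3 I_3
gamma_t <- mvrnorm(1, mu = rep(0, 3), Sigma = 1e3 * diag(3))
```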
After constructing the likelihood function and assigning the priors, the joint
posterior distribution for unknown parameters is readily available. Considering
the high-dimensional integration for marginal distributions of parameters, the
conditional distribution for each parameter is obtained instead. The derived
conditional posteriors are provided in Equations (8)–(11) in the appendix. In
addition, the conditional posteriors for the latent variable η i and the augmented
missing data yimis (i = 1, 2, ..., N ) are also provided by Equations (12) and (13),
respectively, in the appendix.
After obtaining the conditional posteriors, the Markov chain for each model
parameter is generated by implementing a Gibbs sampling algorithm (Casella &
George, 1992; Geman & Geman, 1984). Specifically, the following algorithm is
used in the research.
1. Start with a set of initial values for model parameters φ(0), Ψ(0), β(0), γ(0), latent variable η(0), and missing values ymis(0).
2. At the sth iteration, the parameters φ(s), Ψ(s), β(s), γ(s), η(s), and ymis(s) are available. To generate φ(s+1), Ψ(s+1), β(s+1), γ(s+1), η(s+1), and ymis(s+1), the following procedure is implemented:
(a) Generate φ(s+1) from the distribution in Equation (8) in the appendix.
(b) Generate Ψ(s+1) from the inverse Wishart distribution in Equation (9) in the appendix.
(c) Generate β(s+1) from the multivariate normal distribution in Equation (10) in the appendix.
(d) Generate γ(s+1) from the distribution in Equation (11) in the appendix.
(e) Generate η(s+1) from the multivariate normal distribution in Equation (12) in the appendix.
(f) Generate ymis(s+1) from the normal distribution in Equation (13) in the appendix.
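The overall structure of this sampler can be summarized in R as below. This is only a skeleton: the draw_*() functions are hypothetical stand-ins for draws from the model-specific conditional posteriors in Equations (8)–(13), which would have to be implemented for a given model.

```r
# Skeleton of the Gibbs sampler described above. Each draw_*() helper is a
# hypothetical function sampling from the corresponding conditional
# posterior (Equations (8)-(13) in the appendix).
run_gibbs <- function(data, n_iter = 40000, burn_in = 20000) {
  # Step 1: initial values
  state <- list(phi = 1, Psi = diag(2), beta = c(0, 0),
                gamma = matrix(0, 4, 3),
                eta = matrix(0, data$N, 2),
                y_mis = numeric(data$n_mis))
  draws <- vector("list", n_iter - burn_in)
  for (s in seq_len(n_iter)) {
    # Step 2: update each block from its full conditional, in order
    state$phi   <- draw_phi(state, data)     # Equation (8)
    state$Psi   <- draw_Psi(state, data)     # Equation (9)
    state$beta  <- draw_beta(state, data)    # Equation (10)
    state$gamma <- draw_gamma(state, data)   # Equation (11)
    state$eta   <- draw_eta(state, data)     # Equation (12)
    state$y_mis <- draw_ymis(state, data)    # Equation (13)
    if (s > burn_in) draws[[s - burn_in]] <- state
  }
  draws  # post-burn-in chain, kept for convergence checks and summaries
}
```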
4 Simulation Studies
In this section, simulation studies are conducted to evaluate the performance of
the proposed models estimated by the Bayesian method.
The simulation studies are implemented by the following algorithm. (1) Set
the counter R = 0. (2) Generate complete longitudinal growth data according
to predefined model parameters. (3) Create missing data according to missing
data mechanisms and missing data rates. (4) Generate Markov chains for model
parameters through the Gibbs sampling procedure. (5) Test the convergence of
generated Markov chains. (6) If the Markov chains pass the convergence test,
set R = R + 1 and calculate and save the parameter estimates. Otherwise, discard the current replication. (7) Repeat the above process until R = 100 to obtain 100 valid simulation replications.
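As an illustration of steps (2) and (3), the sketch below generates LSD missingness indicators through the probit link of Equation (3), using the true values γ0t = −1, γxt = −1.5, and γSt = 0.5 from the simulation design; all object names are assumptions for illustration.

```r
set.seed(1)
N <- 500; T <- 4

# Latent slopes and observed covariate (design values: slope mean 3,
# var(S) = 4)
S <- rnorm(N, mean = 3, sd = 2)
x <- rnorm(N)

# LSD missingness: tau_it = Phi(gamma0 + gammaS * S_i + gammax * x_i)
gamma0 <- -1; gammaS <- 0.5; gammax <- -1.5
m <- matrix(0, N, T)
for (t in 1:T) {
  tau <- pnorm(gamma0 + gammaS * S + gammax * x)  # P(m_it = 1)
  m[, t] <- rbinom(N, size = 1, prob = tau)       # 1 = y_it set to missing
}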
In step 4, priors carrying little prior information are adopted (Congdon, 2003; Gill, 2002; Zhang, Hamagami, Wang, Grimm, & Nesselroade, 2007). (The summary table for the model with the latent intercept dependent (LID) missingness (XI) for N=100 is not included due to its low convergence rate.)
Table 1. Simulation model design. N = 1000, 500, 300, 200, and 100

Model           X⁵    I⁶    S⁷    Y⁸
Ignorable (X)   ✓
LID² (XI)       ✓     ✓
LSD³ (XS)¹      ✓           ✓
LOD⁴ (XY)       ✓                 ✓

Note. ¹ The true model is LSD (XS). ² LID: Latent Intercept Dependent. ³ LSD: Latent Slope Dependent. ⁴ LOD: Latent Outcome Dependent. ⁵ X: Observed covariates. If X is the only item checked, the missingness is ignorable. ⁶ I: Individual latent intercept. If checked, the missingness is non-ignorable. ⁷ S: Individual latent slope. If checked, the missingness is non-ignorable. ⁸ Y: Individual potential outcome y. If checked, the missingness is non-ignorable.
Specifically, for ϕ1, we set μϕ1 = 0₂ and Σϕ1 = 10³I₂. For φ, we set v0k = s0k = 0.002. For β, it is assumed that βk0 = 0₂ and Σk0 = 10³I₂. For Ψ, we define mk0 = 2 and Vk0 = I₂. Finally, for γt, we let γt0 = 0₃ and Dt0 = 10³I₃, where 0d and Id denote a d-dimensional zero vector and a d-dimensional identity matrix,
respectively. In step 5, the number of burn-in iterations is set. The Geweke convergence criterion indicated that fewer than 10,000 iterations were adequate for all conditions in the study. Therefore, a conservative burn-in of 20,000 iterations was used for all conditions, and Markov chains with a length of 20,000 iterations were then saved for convergence testing and data analysis. After step 7,
12 summary statistics are reported based on 100 sets of converged simulation replications. For the purpose of presentation, let θj represent the jth parameter, and also its true value in the simulation. Twelve statistics are defined below. (1) The average estimate of each parameter across the 100 converged simulation replications, est.j = (1/100) Σᵢ θ̂ij, where θ̂ij denotes the estimate of θj in the ith simulation replication. (2) The simple bias of each parameter, BIAS.smpj = est.j − θj. (3) The relative bias of each parameter, BIAS.relj = (est.j − θj)/θj when θj ≠ 0 and BIAS.relj = est.j − θj when θj = 0. (4) The empirical standard error of each parameter, SE.empj = sqrt[Σᵢ (θ̂ij − est.j)²/99]. (5) The average standard error of the same parameter, SE.avgj = (1/100) Σᵢ ŝij, where ŝij denotes the estimated standard error of θ̂ij. (6) The average mean square error (MSE) of each parameter, MSEj = (1/100) Σᵢ MSEij, where MSEij = (Biasij)² + (ŝij)² is the mean square error for the jth parameter in the ith simulation replication. The average lower (7) and upper (8) limits of the 95% percentile confidence interval, CI.lowj = (1/100) Σᵢ θ̂ˡij and CI.upperj = (1/100) Σᵢ θ̂ᵘij, where θ̂ˡij and θ̂ᵘij denote the 95% lower and upper percentile limits in the ith replication.
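In R, the first six of these statistics for a single parameter can be computed as in the sketch below, where est and se are assumed to be length-100 vectors of estimates and Bayesian standard errors across the converged replications (hypothetical names).

```r
# First six summary statistics for one parameter theta across 100
# converged replications.
summarize_param <- function(est, se, theta) {
  est_bar  <- mean(est)                                        # (1) average estimate
  bias_smp <- est_bar - theta                                  # (2) simple bias
  bias_rel <- if (theta != 0) bias_smp / theta else bias_smp   # (3) relative bias
  se_emp   <- sqrt(sum((est - est_bar)^2) / (length(est) - 1)) # (4) empirical SE
  se_avg   <- mean(se)                                         # (5) average SE
  mse      <- mean((est - theta)^2 + se^2)                     # (6) average MSE
  c(est = est_bar, BIAS.smp = bias_smp, BIAS.rel = bias_rel,
    SE.emp = se_emp, SE.avg = se_avg, MSE = mse)
}
```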
4.3.1 Estimates from the True Model

First, we investigate the estimates obtained from the true model. Tables 3, 4, and 5 in the appendix show the summarized estimates from the true model for N=1000, N=500, N=300, N=200, and N=100. From Table 3, with the sample size 1000, first, one can see that all the
relative estimate biases are very small, with the largest one being 0.067 for γ03 .
Second, the difference between the empirical SEs and the average SEs is very
small, which indicates the SEs are estimated accurately. Third, both CI and
HPD interval coverage probabilities are very close to the theoretical percentage
95%, which means the type I error for each parameter is close to the specified
5% so that we can use the estimated confidence intervals to conduct statistical
inference. Fourth, this true model has a 100% convergence rate. When the sample sizes are smaller, the performance becomes worse, as expected.
In order to compare estimates with different sample sizes, we further calculate five summary statistics across all parameters, which are shown in Table 2. The first statistic is the average absolute relative bias across all parameters, defined as |Bias.rel| = Σⱼ |Bias.relj|/p, where p is the total number of parameters in a model. Second, we obtain the average absolute difference between the empirical SEs and the average Bayesian SEs across all parameters, |SE.diff| = Σⱼ |SE.empj − SE.avgj|/p. Third, we calculate the average percentile coverage probability across all parameters, CI.cover = Σⱼ CI.coverj/p. Fourth, we calculate the average HPD coverage probability across all parameters, HPD.cover = Σⱼ HPD.coverj/p. Fifth, the convergence rate is calculated.
Table 2 shows that, except for the case of N=100, the true model can recover model parameters very well, with small average absolute relative biases of estimates (|Bias.rel|), small average absolute differences between the empirical SEs and the average SEs (|SE.diff|), and average percentile coverage probabilities (CI.cover) and average HPD coverage probabilities (HPD.cover) both close to 95%. With the increase of the sample size, both the point estimates and standard errors get more accurate.
The true model is the LGCM with LSD (XS) missingness, and there are three mis-specified models: the LGCM with LID (XI) missingness, the LGCM with LOD (XY) missingness, and the LGCM with ignorable missingness (see Table 1 for the simulation design). Table 6 in the appendix shows the
summarized estimates from the mis-specified model with LID (XI) missingness
for N=1000, N=500, N=300, and N=200 (the summarized estimates for N=100
are unavailable due to a low convergence rate). Table 8 in the appendix provides
the results for the mis-specified model with LOD (XY) missingness for N=1000,
N=500, N=300, N=200, and N=100. Table 10 in the appendix is the summary
table for the mis-specified model with ignorable (X) missingness for different
sample sizes.
To compare estimates from different models, we further summarize and
visualize some statistics. Figure 4 (a) compares the point estimates of intercept
and slope for all models when N=1000. The true value of slope is 3 but the
estimate is 2.711 when the missingness is ignored. Actually, for the model with
ignorable missingness, the slope estimates are at or below 2.711 for all sample sizes in our study. Figure 4 (b) focuses on the coverage of the slope. When the
missingness is ignored, it is as low as 4% for N=1000, and 21% for N=500 (the
coverage for N=1000 is lower because the SE for N=1000 is smaller than the
SE for N=500). As a result, conclusions based on the model with ignorable
missingness can be very misleading. Figure 4 (b) also shows that the slope
estimate from the model with the mis-specified missingness, LID (XI), has low
coverage, with 76% for N=1000 and 87% for N=500. So the conclusions based on
this model may still be incorrect. Figure 4 (c) compares the true model and the
model with another type of mis-specified missingness, LOD (XY) for N=1000.
For the wrong model, the coverage is 51% for intercept, and 72% for Cov(I,S).
Finally, Figure 4 (d) compares the convergence rates for all models. One can see
that the convergence rates of LOD (XY) and LID (XI) models are much lower
than those of the true model LSD (XS) and the model with ignorable missingness.
When the missingness is ignored, the number of parameters is smaller than that of the non-ignorable models, and the convergence rate is then higher.
Figure 4. Comparison of the models (true model: LSD (XS)): (a) point estimates of the intercept and slope for all models when N=1000; (b) coverage of the slope for N=1000 and N=500; (c) convergence rate (CVG.rate) and HPD coverage of parameter estimates when N=1000; (d) convergence rates of the different models.
Based on the simulation studies, we draw the following conclusions: (1) the
proposed Bayesian method can accurately recover model parameters (both
point estimates and standard errors), (2) the small difference between the
empirical SE and the average SE indicates that the Bayesian method used in
the study can estimate the standard errors accurately, (3) with the increase
of the sample size, estimates get closer to their true values and standard
errors become more accurate, (4) ignoring the non-ignorable missingness can
lead to incorrect conclusions, (5) mis-specified missingness may also result in
misleading conclusions, and (6) the non-convergence of models might be a sign
of a misspecified model.
5 Discussion
The models proposed in this article have several implications for future research.
First, the missingness in the simulation study is assumed to be independent
across different times. If this assumption is violated, likelihood functions might
be much more complicated. For example, if the missingness depends on the
previous missingness, then the autocorrelation among missingness might be
involved. A similar model is that of Diggle and Kenward (1994), in which the probability of missing data at the current wave depends directly on the current outcomes as well as on the preceding assessment. Another example is survival
analysis (e.g., Klein & Moeschberger, 2003), in which censoring is the common
form of missing data problem. In practice, the missingness can come from
different sources and can be modeled as a combination of different types of
missingness. Second, various model selection criteria could be considered (e.g.,
Cain & Zhang, 2019). It is an interesting topic for future work to propose
new criteria. For example, observed-data and complete-data likelihood functions
for random effects models can be used for f(y|θ); information criteria can be proposed using other weighted combinations of the growth model and the missing data model. Third, the data considered in the study are assumed to be
normally distributed. However, in reality, data are seldom normally distributed,
particularly in behavioral and educational sciences (e.g., Cain, Zhang, & Yuan,
2017; Micceri, 1989). When data have heavy tails, or are contaminated with outliers,
robust models (e.g., Hoaglin, Mosteller, & Tukey, 1983; Huber, 1996; Zhang,
2013; Zhang, Lai, Lu, & Tong, 2013) should be adopted to make models
insensitive to small deviations from the assumption of normal distribution.
Fourth, latent population heterogeneity (e.g., McLachlan & Peel, 2000) may
exist in the collected longitudinal data. Growth mixture models (GMMs) can
be considered to provide a flexible set of models for analyzing longitudinal data
with latent or mixture distributions (e.g., Bartholomew & Knott, 1999).
References
Appendix
Appendix A. The Derived Posteriors for LGCMs with Non-ignorable
Missingness
Let η = (η1, η2, ..., ηN). The conditional posterior distribution for φ can be easily derived as an inverse Gamma distribution,

φ | η, y ∼ IG(a1/2, b1/2), (8)

where a1 = v0 + NT and b1 = s0 + Σᵢ (yi − Ληi)′(yi − Ληi).
Noticing that tr(AB) = tr(BA), the conditional posterior distribution for Ψ is derived as an inverse Wishart distribution,

Ψ | η, β ∼ IW(m0 + N, V0 + Σᵢ (ηi − β)(ηi − β)′), (9)

and the conditional posterior distribution for β is a multivariate normal distribution,

β | Ψ, η ∼ MN(β1, Σ1), (10)

where β1 = (NΨ⁻¹ + Σ0⁻¹)⁻¹ (Ψ⁻¹ Σᵢ ηi + Σ0⁻¹ β0) and Σ1 = (NΨ⁻¹ + Σ0⁻¹)⁻¹.
The conditional posterior for γt (t = 1, 2, ..., T) is a distribution with density

f(γt | ω, x, m) ∝ exp{ −(1/2)(γt − γt0)′ Dt0⁻¹ (γt − γt0) + Σᵢ [ mit log Φ(ω′i γt) + (1 − mit) log(1 − Φ(ω′i γt)) ] }, (11)

where Φ(ω′i γt) is defined by Equation (4), (5), or (6).
By expanding the terms inside the exponential part and combining similar
terms, the conditional posterior distribution for η i , i = 1, 2, . . . , N , is derived as
a Multivariate Normal distribution,
Table 3. Summarized Estimates from True Model: LGCM with LSD Missingness (XS).

N=1000 (convergence rate: 100/100 = 100%)

                          BIAS            SE                     CI⁹                     HPD¹³
para.¹   true²  est.³    smp.⁴   rel.⁵   emp.⁶  avg.⁷  MSE⁸   lower¹⁰ upper¹¹ cover¹²  lower   upper   cover

Growth Curve
S        3      3.003    0.003   0.001   0.079  0.077  0.012   2.853   3.155   0.97    2.853   3.154   0.96
var(I)   1      1.011    0.011   0.011   0.105  0.102  0.022   0.82    1.22    0.94    0.814   1.213   0.94
var(S)   4      3.99    -0.01   -0.003   0.232  0.232  0.107   3.56    4.468   0.94    3.545   4.449   0.93
cov(IS)  0      0.001    0.001   0.001   0.119  0.112  0.026  -0.221   0.217   0.94   -0.218   0.218   0.94
var(e)   1      1        0       0       0.043  0.042  0.004   0.92    1.086   0.92    0.918   1.084   0.93

Missingness Parameters
Wave 1
γ01     -1     -1.025   -0.025   0.025   0.184  0.174  0.065  -1.375  -0.694   0.93   -1.365  -0.69    0.94
γx1     -1.5   -1.541   -0.041   0.027   0.138  0.123  0.036  -1.795  -1.314   0.92   -1.783  -1.307   0.93
γS1      0.5    0.515    0.015   0.03    0.066  0.062  0.008   0.4     0.641   0.9     0.397   0.636   0.92
Wave 2
γ02     -1     -1.038   -0.038   0.038   0.191  0.171  0.067  -1.385  -0.714   0.96   -1.376  -0.711   0.97
γx2     -1.5   -1.551   -0.051   0.034   0.129  0.119  0.034  -1.798  -1.33    0.95   -1.786  -1.323   0.94
γS2      0.5    0.521    0.021   0.042   0.066  0.06   0.008   0.41    0.643   0.95    0.408   0.639   0.94
Wave 3
γ03     -1     -1.067   -0.067   0.067   0.186  0.172  0.069  -1.417  -0.741   0.94   -1.407  -0.737   0.94
γx3     -1.5   -1.557   -0.057   0.038   0.117  0.116  0.03   -1.796  -1.341   0.97   -1.785  -1.334   0.97
γS3      0.5    0.529    0.029   0.058   0.063  0.058  0.008   0.42    0.648   0.89    0.418   0.643   0.91
Wave 4
γ04     -1     -1.034   -0.034   0.034   0.18   0.173  0.063  -1.384  -0.709   0.94   -1.374  -0.704   0.93
γx4     -1.5   -1.539   -0.039   0.026   0.122  0.114  0.029  -1.773  -1.325   0.95   -1.763  -1.319   0.94
γS4      0.5    0.514    0.014   0.027   0.058  0.057  0.007   0.407   0.63    0.95    0.405   0.625   0.95

Note. The results are summarized based on 100 converged replications with a convergence rate of 100/100 = 100%. ¹ The estimated parameter. ² The true value of the corresponding parameter. ³ The parameter estimate, est.j = (1/100) Σᵢ θ̂ij. ⁴ The simple bias, BIAS.smpj = est.j − θj. ⁵ The relative bias, BIAS.relj = (est.j − θj)/θj when θj ≠ 0 and BIAS.relj = est.j − θj when θj = 0. ⁶ The empirical standard error, SE.empj = sqrt[Σᵢ (θ̂ij − est.j)²/99]. ⁷ The average standard error, SE.avgj = (1/100) Σᵢ ŝij. ⁸ The mean square error, MSEj = (1/100) Σᵢ MSEij, where MSEij = (Biasij)² + (ŝij)². ⁹ For the percentile confidence interval. ¹⁰ The average lower 2.5% percentile. ¹¹ The average upper 97.5% percentile. ¹² The average 95% coverage of the percentile confidence interval. ¹³ The lower and upper bounds, and coverage, of the HPD interval.
Table 4. Summarized Estimates from True Model: LGCM with LSD Missingness (XS) (con't)

N=500

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

Growth Curve
S        3      3.001    0.001   0       0.097  0.109  0.021   2.789   3.216   0.97    2.788   3.213   0.97
var(I)   1      0.976   -0.024  -0.024   0.146  0.144  0.042   0.712   1.274   0.97    0.7     1.26    0.97
var(S)   4      4.001    0.001   0       0.388  0.329  0.258   3.403   4.691   0.9     3.373   4.652   0.9
cov(IS)  0     -0.009   -0.009  -0.009   0.155  0.157  0.049  -0.324   0.294   0.96   -0.319   0.297   0.96
var(e)   1      1.014    0.014   0.014   0.06   0.061  0.007   0.901   1.141   0.96    0.897   1.136   0.96

Missingness Parameters
Wave 1
γ01     -1     -1.082   -0.082   0.082   0.254  0.255  0.137  -1.609  -0.608   0.95   -1.587  -0.596   0.97
γx1     -1.5   -1.606   -0.106   0.071   0.181  0.186  0.079  -2.002  -1.275   0.95   -1.975  -1.258   0.97
γS1      0.5    0.54     0.04    0.081   0.083  0.092  0.017   0.375   0.735   0.95    0.368   0.722   0.94
Wave 2
γ02     -1     -1.096   -0.096   0.096   0.281  0.252  0.152  -1.61   -0.624   0.89   -1.591  -0.615   0.89
γx2     -1.5   -1.615   -0.115   0.077   0.204  0.18   0.088  -1.996  -1.291   0.91   -1.971  -1.275   0.94
γS2      0.5    0.546    0.046   0.092   0.104  0.088  0.021   0.385   0.73    0.87    0.379   0.719   0.88
Wave 3
γ03     -1     -1.068   -0.068   0.068   0.32   0.248  0.169  -1.572  -0.602   0.93   -1.555  -0.594   0.93
γx3     -1.5   -1.613   -0.113   0.075   0.279  0.174  0.123  -1.978  -1.295   0.9    -1.958  -1.283   0.93
γS3      0.5    0.536    0.036   0.072   0.116  0.084  0.022   0.381   0.71    0.92    0.378   0.702   0.91
Wave 4
γ04     -1     -1.123   -0.123   0.123   0.261  0.257  0.15   -1.652  -0.647   0.94   -1.628  -0.633   0.95
γx4     -1.5   -1.579   -0.079   0.053   0.174  0.168  0.066  -1.933  -1.274   0.95   -1.913  -1.261   0.96
γS4      0.5    0.543    0.043   0.086   0.089  0.085  0.017   0.388   0.719   0.92    0.382   0.71    0.92

N=300 (convergence rate: 100/100 = 100%)

Growth Curve
I        1      1.001    0.001   0.001   0.104  0.097  0.02    0.81    1.192   0.89    0.811   1.192   0.89
S        3      2.984   -0.016  -0.005   0.149  0.14   0.042   2.712   3.262   0.93    2.71    3.259   0.93
var(I)   1      1.014    0.014   0.014   0.183  0.19   0.07    0.673   1.418   0.96    0.654   1.392   0.96
var(S)   4      3.975   -0.025  -0.006   0.416  0.425  0.354   3.22    4.886   0.96    3.174   4.82    0.96
cov(IS)  0      0.054    0.054   0.054   0.212  0.205  0.09   -0.359   0.449   0.94   -0.351   0.454   0.93
var(e)   1      1.011    0.011   0.011   0.073  0.08   0.012   0.867   1.179   0.96    0.86    1.17    0.96

Missingness Parameters
Wave 1
γ01     -1     -1.094   -0.094   0.094   0.341  0.345  0.249  -1.822  -0.468   0.97   -1.778  -0.441   0.97
γx1     -1.5   -1.65    -0.15    0.1     0.265  0.253  0.162  -2.209  -1.217   0.92   -2.155  -1.185   0.94
γS1      0.5    0.548    0.048   0.097   0.121  0.124  0.033   0.331   0.82    0.97    0.318   0.794   0.97
Wave 2
γ02     -1     -1.106   -0.106   0.106   0.452  0.34   0.341  -1.819  -0.486   0.93   -1.782  -0.467   0.93
γx2     -1.5   -1.692   -0.192   0.128   0.345  0.253  0.23   -2.243  -1.254   0.89   -2.196  -1.227   0.9
γS2      0.5    0.566    0.066   0.132   0.158  0.121  0.046   0.354   0.827   0.93    0.343   0.807   0.92
Wave 3
γ03     -1     -1.139   -0.139   0.139   0.397  0.335  0.293  -1.845  -0.527   0.91   -1.801  -0.503   0.92
γx3     -1.5   -1.648   -0.148   0.099   0.305  0.236  0.175  -2.152  -1.233   0.86   -2.115  -1.21    0.92
γS3      0.5    0.566    0.066   0.132   0.141  0.115  0.038   0.361   0.811   0.9     0.352   0.794   0.91
Wave 4
γ04     -1     -1.217   -0.217   0.217   0.411  0.356  0.347  -1.976  -0.576   0.9    -1.932  -0.552   0.9
γx4     -1.5   -1.681   -0.181   0.121   0.263  0.241  0.163  -2.203  -1.257   0.9    -2.161  -1.231   0.92
γS4      0.5    0.583    0.083   0.165   0.138  0.118  0.041   0.372   0.839   0.88    0.363   0.82    0.91

Note. Abbreviations are as given in Table 3.
Table 5. Summarized Estimates from True Model: LGCM with LSD Missingness (XS) (con't)

N=200 (convergence rate: 100/106 ≈ 94.34%)

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

Growth Curve
I        1      1.011    0.011   0.011   0.099  0.119  0.024   0.779   1.244   0.98    0.779   1.243   0.98
S        3      2.975   -0.025  -0.008   0.177  0.171  0.061   2.643   3.314   0.93    2.642   3.312   0.94
var(I)   1      1.011    0.011   0.011   0.228  0.233  0.107   0.601   1.516   0.94    0.572   1.476   0.92
var(S)   4      4        0       0       0.474  0.522  0.498   3.095   5.135   0.97    3.029   5.041   0.96
cov(IS)  0      0.065    0.065   0.065   0.257  0.252  0.134  -0.447   0.549   0.92   -0.436   0.557   0.92
var(e)   1      1.027    0.027   0.027   0.098  0.099  0.02    0.851   1.238   0.95    0.84    1.224   0.95

Missingness Parameters
Wave 1
γ01     -1     -1.3     -0.3     0.3     0.671  0.5    0.901  -2.399  -0.449   0.93   -2.306  -0.402   0.94
γx1     -1.5   -1.874   -0.374   0.249   0.745  0.424  1.113  -2.868  -1.227   0.88   -2.735  -1.169   0.91
γS1      0.5    0.647    0.147   0.293   0.323  0.197  0.202   0.334   1.1     0.91    0.311   1.045   0.92
Wave 2
γ02     -1     -1.278   -0.278   0.278   0.69   0.468  0.838  -2.303  -0.463   0.87   -2.227  -0.426   0.89
γx2     -1.5   -1.779   -0.279   0.186   0.456  0.349  0.451  -2.578  -1.209   0.91   -2.487  -1.163   0.9
γS2      0.5    0.627    0.127   0.254   0.244  0.171  0.117   0.343   1.014   0.9     0.324   0.976   0.91
Wave 3
γ03     -1     -1.191   -0.191   0.191   0.505  0.436  0.5    -2.133  -0.419   0.91   -2.05   -0.377   0.93
γx3     -1.5   -1.721   -0.221   0.147   0.502  0.314  0.426  -2.428  -1.193   0.9    -2.348  -1.15    0.94
γS3      0.5    0.586    0.086   0.172   0.183  0.152  0.068   0.326   0.926   0.91    0.309   0.889   0.95
Wave 4
γ04     -1     -1.27    -0.27    0.27    0.594  0.467  0.67   -2.304  -0.457   0.86   -2.209  -0.404   0.9
γx4     -1.5   -1.808   -0.308   0.205   0.397  0.336  0.382  -2.56   -1.24    0.82   -2.48   -1.195   0.89
γS4      0.5    0.618    0.118   0.236   0.204  0.16   0.085   0.345   0.98    0.88    0.325   0.942   0.89

N=100 (convergence rate: 100/142 ≈ 70.42%)

Growth Curve
I        1      1.031    0.031   0.031   0.167  0.168  0.057   0.701   1.359   0.96    0.701   1.359   0.97
S        3      2.983   -0.017  -0.006   0.236  0.242  0.115   2.514   3.467   0.95    2.51    3.46    0.94
var(I)   1      0.933   -0.067  -0.067   0.305  0.323  0.206   0.408   1.665   0.93    0.355   1.574   0.91
var(S)   4      3.965   -0.035  -0.009   0.829  0.747  1.261   2.743   5.656   0.91    2.623   5.458   0.91
cov(IS)  0      0.069    0.069   0.069   0.333  0.357  0.246  -0.666   0.748   0.93   -0.646   0.762   0.95
var(e)   1      1.078    0.078   0.078   0.157  0.151  0.054   0.82    1.409   0.93    0.801   1.38    0.94

Missingness Parameters
Wave 1
γ01     -1     -3.257   -2.257   2.257   5.794  1.333  42.792 -6.264  -1.131   0.84   -5.922  -1.018   0.86
γx1     -1.5   -4.314   -2.814   1.876   7.492  1.277  69.337 -7.171  -2.396   0.8    -6.739  -2.251   0.85
γS1      0.5    1.626    1.126   2.252   2.881  0.55   10.353  0.788   2.857   0.8     0.746   2.698   0.84
Wave 2
γ02     -1     -3.011   -2.011   2.011   5.719  1.322  41.711 -6.062  -1.027   0.85   -5.696  -0.893   0.88
γx2     -1.5   -3.772   -2.272   1.515   6.947  1.283  61.237 -6.811  -1.927   0.82   -6.385  -1.774   0.85
γS2      0.5    1.436    0.936   1.871   2.57   0.549  8.564   0.653   2.71    0.81    0.586   2.527   0.86
Wave 3
γ03     -1     -2.877   -1.877   1.877   5.93   1.2    42.401 -5.493  -0.898   0.89   -5.233  -0.806   0.91
γx3     -1.5   -3.86    -2.36    1.573   6.955  1.153  58.835 -6.508  -2.086   0.83   -6.125  -1.932   0.85
γS3      0.5    1.388    0.888   1.776   2.567  0.467  7.977   0.641   2.428   0.85    0.596   2.289   0.89
Wave 4
γ04     -1     -2.831   -1.831   1.831   5.646  1.297  39.835 -5.902  -0.891   0.89   -5.522  -0.753   0.90
γx4     -1.5   -3.386   -1.886   1.257   5.379  1.127  37.532 -6.048  -1.745   0.81   -5.622  -1.586   0.88
γS4      0.5    1.222    0.722   1.444   1.944  0.457  4.854   0.552   2.312   0.84    0.491   2.152   0.88

Note. Abbreviations are as given in Table 3.
Table 6. Summarized Estimates from LGCM with LID Missingness (XI)

N=1000 (convergence rate: 100/112 ≈ 89.29%)

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

Growth Curve
I        1      1.064    0.064   0.064   0.052  0.044  0.009   0.977   1.151   0.66    0.977   1.150   0.66
S        3      2.921   -0.079  -0.026   0.082  0.074  0.018   2.776   3.067   0.77    2.777   3.066   0.76
var(I)   1      0.169   -0.831  -0.831   0.036  0.031  0.693   0.117   0.237   0       0.113   0.230   0
var(S)   4      3.494   -0.506  -0.126   0.218  0.203  0.344   3.116   3.913   0.40    3.103   3.897   0.37
cov(IS)  0      0.629    0.629   0.629   0.064  0.064  0.404   0.511   0.762   0       0.507   0.756   0
var(e)   1      1.439    0.439   0.439   0.049  0.050  0.197   1.343   1.540   0       1.341   1.538   0

N=500

Growth Curve
S        3      2.914   -0.086  -0.029   0.099  0.105  0.028   2.710   3.120   0.88    2.709   3.118   0.87
var(I)   1      0.197   -0.803  -0.803   0.046  0.048  0.650   0.121   0.309   0       0.114   0.294   0
var(S)   4      3.448   -0.552  -0.138   0.315  0.284  0.484   2.934   4.043   0.56    2.909   4.012   0.54
cov(IS)  0      0.633    0.633   0.633   0.074  0.088  0.414   0.474   0.819   0       0.466   0.808   0
var(e)   1      1.425    0.425   0.425   0.079  0.072  0.192   1.289   1.571   0       1.286   1.567   0
Table 7. Summarized Estimates from LGCM with LID Missingness (XI) (con't)

N=300 (convergence rate: 100/148 ≈ 67.57%)

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

Growth Curve
I        1      1.077    0.077   0.077   0.11   0.083  0.025   0.916   1.242   0.78    0.915   1.24    0.81
S        3      2.864   -0.136  -0.045   0.139  0.135  0.056   2.601   3.131   0.87    2.6     3.129   0.87
var(I)   1      0.251   -0.749  -0.749   0.084  0.076  0.574   0.136   0.429   0.01    0.123   0.402   0.01
var(S)   4      3.424   -0.576  -0.144   0.369  0.366  0.601   2.775   4.209   0.71    2.734   4.153   0.67
cov(IS)  0      0.656    0.656   0.656   0.118  0.119  0.458   0.445   0.909   0       0.433   0.892   0
var(e)   1      1.413    0.413   0.413   0.101  0.095  0.19    1.237   1.608   0       1.232   1.601   0

N=200

Growth Curve
S        3      2.796   -0.114  -0.038   0.525  0.161  0.071   2.484   3.114   0.85    2.483   3.112   0.85
var(I)   1      0.322   -0.648  -0.648   0.15   0.115  0.469   0.15    0.593   0.1     0.13    0.549   0.07
var(S)   4      3.353   -0.527  -0.132   0.739  0.435  0.677   2.6     4.302   0.74    2.546   4.225   0.71
cov(IS)  0      0.617    0.617   0.617   0.276  0.147  0.479   0.352   0.93    0.01    0.338   0.91    0.01
var(e)   1      1.346    0.376   0.376   0.267  0.115  0.174   1.135   1.586   0.07    1.126   1.574   0.08
Table 8. Summarized Estimates from LGCM with LOD Missingness (XY)

N=1000 (convergence rate: 100/126 ≈ 79.37%)

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

Growth Curve
I        1      1.12     0.12    0.12    0.062  0.06   0.022   1.002   1.238   0.52    1.002   1.237   0.51
S        3      3.003    0.003   0.001   0.084  0.078  0.013   2.85    3.158   0.94    2.849   3.156   0.94
var(I)   1      1.03     0.03    0.03    0.105  0.108  0.024   0.828   1.252   0.93    0.823   1.245   0.93
var(S)   4      3.994   -0.006  -0.002   0.253  0.235  0.119   3.556   4.479   0.91    3.542   4.46    0.90
cov(IS)  0      0.112    0.112   0.112   0.146  0.116  0.047  -0.118   0.337   0.74   -0.115   0.338   0.72
var(e)   1      1.015    0.015   0.015   0.048  0.044  0.004   0.933   1.105   0.91    0.93    1.102   0.92

N=500

Growth Curve
S        3      3.008    0.008   0.003   0.107  0.11   0.024   2.793   3.226   0.95    2.793   3.224   0.94
var(I)   1      1.004    0.004   0.004   0.146  0.152  0.044   0.725   1.322   0.96    0.714   1.308   0.95
var(S)   4      3.996   -0.004  -0.001   0.399  0.334  0.27    3.391   4.698   0.86    3.365   4.662   0.86
cov(IS)  0      0.102    0.102   0.102   0.178  0.163  0.069  -0.223   0.417   0.88   -0.219   0.42    0.89
var(e)   1      1.026    0.026   0.026   0.062  0.063  0.008   0.91    1.156   0.95    0.906   1.151   0.94
Table 9. Summarized Estimates from LGCM with LOD Missingness (XY) (con't)

N=300 (convergence rate: 100/107 ≈ 93.46%)

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

Growth Curve
I        1      1.139    0.139   0.139   0.127  0.111  0.047   0.922   1.357   0.70    0.922   1.356   0.69
S        3      2.988   -0.012  -0.004   0.157  0.144  0.046   2.708   3.274   0.90    2.707   3.272   0.90
var(I)   1      1.045    0.045   0.045   0.196  0.204  0.082   0.682   1.479   0.94    0.661   1.451   0.95
var(S)   4      4.04     0.04    0.01    0.463  0.44   0.41    3.261   4.982   0.92    3.212   4.915   0.95
cov(IS)  0      0.153    0.153   0.153   0.256  0.215  0.135  -0.277   0.569   0.85   -0.27    0.574   0.84
var(e)   1      1.021    0.021   0.021   0.079  0.082  0.013   0.873   1.195   0.94    0.865   1.184   0.94

N=200

Growth Curve
S        3      2.986   -0.014  -0.005   0.187  0.176  0.066   2.648   3.336   0.92    2.644   3.331   0.93
var(I)   1      1.019    0.019   0.019   0.233  0.25   0.117   0.583   1.562   0.96    0.552   1.517   0.97
var(S)   4      4.034    0.034   0.008   0.516  0.536  0.557   3.107   5.202   0.96    3.043   5.11    0.96
cov(IS)  0      0.182    0.182   0.182   0.311  0.263  0.199  -0.347   0.691   0.85   -0.338   0.697   0.85
var(e)   1      1.047    0.047   0.047   0.103  0.104  0.024   0.863   1.27    0.92    0.852   1.255   0.94

N=100

Growth Curve
S        3      3.028    0.028   0.009   0.254  0.259  0.133   2.535   3.548   0.97    2.528   3.539   0.97
var(I)   1      0.937   -0.063  -0.063   0.332  0.354  0.246   0.375   1.751   0.92    0.315   1.637   0.90
var(S)   4      4.136    0.136   0.034   0.845  0.809  1.414   2.817   5.971   0.93    2.686   5.757   0.94
cov(IS)  0      0.15     0.15    0.15    0.453  0.394  0.39   -0.657   0.902   0.88   -0.633   0.918   0.88
var(e)   1      1.153    0.153   0.153   0.34   0.176  0.184   0.847   1.529   0.86    0.825   1.494   0.89
Table 10. Summarized Estimates from LGCM with Ignorable Missingness (X)

                          BIAS            SE                     CI                      HPD
para.    true   est.     smp.    rel.    emp.   avg.   MSE     lower   upper   cover    lower   upper   cover

N=1000 (convergence rate: 100/100 = 100%)
Growth Curve
I        1      1.009    0.009   0.009   0.051  0.052  0.005   0.906   1.111   0.94    0.906   1.111   0.93
S        3      2.711   -0.289  -0.096   0.078  0.077  0.095   2.56    2.863   0.04    2.561   2.863   0.04
var(I)   1      1.008    0.008   0.008   0.108  0.104  0.022   0.813   1.221   0.95    0.807   1.214   0.95
var(S)   4      3.837   -0.163  -0.041   0.232  0.223  0.13    3.422   4.297   0.87    3.409   4.279   0.86
cov(IS)  0      0.004    0.004   0.004   0.115  0.109  0.025  -0.214   0.214   0.96   -0.21    0.216   0.96
var(e)   1      0.999   -0.001  -0.001   0.044  0.043  0.004   0.919   1.086   0.92    0.917   1.084   0.92

N=500 (convergence rate: 100/100 = 100%)
Growth Curve
I        1      0.999   -0.001  -0.001   0.073  0.074  0.011   0.854   1.143   0.98    0.855   1.143   0.98
S        3      2.711   -0.289  -0.096   0.099  0.109  0.105   2.497   2.925   0.21    2.497   2.925   0.21
var(I)   1      0.973   -0.027  -0.027   0.146  0.146  0.043   0.705   1.277   0.98    0.693   1.263   0.98
var(S)   4      3.852   -0.148  -0.037   0.371  0.317  0.259   3.276   4.518   0.86    3.248   4.48    0.88
cov(IS)  0     -0.008   -0.008  -0.008   0.154  0.154  0.047  -0.317   0.287   0.96   -0.311   0.292   0.96
var(e)   1      1.014    0.014   0.014   0.06   0.062  0.008   0.9     1.141   0.96    0.895   1.136   0.95

N=300 (convergence rate: 100/100 = 100%)
Growth Curve
I        1      1.009    0.009   0.009   0.103  0.096  0.02    0.821   1.197   0.89    0.821   1.197   0.89
S        3      2.687   -0.313  -0.104   0.139  0.141  0.137   2.411   2.964   0.34    2.411   2.963   0.35
var(I)   1      1.006    0.006   0.006   0.189  0.194  0.073   0.657   1.416   0.94    0.639   1.391   0.94
var(S)   4      3.816   -0.184  -0.046   0.412  0.41   0.372   3.091   4.694   0.93    3.045   4.631   0.92
cov(IS)  0      0.045    0.045   0.045   0.214  0.2    0.088  -0.359   0.429   0.94   -0.351   0.435   0.94
var(e)   1      1.01     0.01    0.01    0.075  0.08   0.012   0.864   1.179   0.96    0.857   1.17    0.94

N=200 (convergence rate: 100/100 = 100%)
Growth Curve
I        1      1.019    0.019   0.019   0.098  0.116  0.023   0.792   1.247   0.97    0.791   1.246   0.97
S        3      2.69    -0.31   -0.103   0.178  0.173  0.157   2.352   3.029   0.52    2.351   3.027   0.52
var(I)   1      0.99    -0.01   -0.01    0.232  0.236  0.11    0.576   1.5     0.94    0.548   1.461   0.95
var(S)   4      3.884   -0.116  -0.029   0.47   0.509  0.495   3.004   4.992   0.96    2.938   4.898   0.96
cov(IS)  0      0.066    0.066   0.066   0.25   0.246  0.127  -0.434   0.538   0.91   -0.422   0.546   0.92
var(e)   1      1.02     0.02    0.02    0.094  0.1    0.019   0.843   1.233   0.95    0.833   1.219   0.97

N=100 (convergence rate: 100/100 = 100%)
Growth Curve
I        1      1.031    0.031   0.031   0.174  0.161  0.057   0.714   1.348   0.94    0.715   1.348   0.95
S        3      2.699   -0.301  -0.1     0.239  0.248  0.209   2.21    3.187   0.78    2.212   3.187   0.78
var(I)   1      0.863   -0.137  -0.137   0.275  0.315  0.197   0.354   1.579   0.94    0.302   1.487   0.86
var(S)   4      3.951   -0.049  -0.012   0.815  0.753  1.247   2.726   5.66    0.92    2.601   5.456   0.92
cov(IS)  0      0.063    0.063   0.063   0.35   0.35   0.25   -0.658   0.728   0.91   -0.637   0.744   0.94
var(e)   1      1.063    0.063   0.063   0.137  0.149  0.045   0.808   1.39    0.94    0.788   1.361   0.95

Note. Abbreviations are as given in Table 3.
Journal of Behavioral Data Science, 2021, 1 (2), 31–53.
DOI: https://doi.org/10.35566/jbds/v1n2/p3
1 Introduction
Understanding the causal effect of a treatment has historically been of great
scientific interest and remains one of the most frequently pursued objectives in
scientific research today. The gold standard for evaluating treatment effects is
the randomized controlled trial, where the researcher randomly assigns treatment
status to each individual. The benefit of this approach is that the causal effect
of the treatment can be estimated by simply comparing outcomes between those
who were treated and those who were not (Greenland, Pearl, & Robins, 1999).
Random assignment of treatment guarantees that, on average, the treated and
untreated individuals will be equal on all potential confounding variables, both
measured and unmeasured. Eliminating the possibility of confounding clears the
way for a direct comparison to be made.
However, random assignment is not always possible. This can be for ethical
reasons, since researchers cannot, for example, force participants to smoke to investigate the effects of smoking. It can also be for practical reasons, where the researcher cannot control the assignment of a treatment. For example, researchers cannot randomly assign depression to some participants, enact a law
or policy in a randomly assigned jurisdiction, or choose where their participants
live. An observational study, where treatment is not randomly assigned, may be
the only available option in these cases. Unlike randomized controlled trials, direct comparisons between treated and untreated individuals in an observational study cannot be made as easily. This is because treated and untreated participants may not be equal in all other characteristics, creating the potential for
confounding effects. In fact, it may be differences in these very characteristics
that lead some participants to select treatment, making the estimation of the
treatment’s effect less straightforward. To estimate a treatment’s effect, it must
first be defined, which we do in the context of the potential outcomes framework.
The foundations for the potential outcomes framework were laid out by Neyman,
Iwaszkiewicz, and Kolodziejczyk (1935) and further developed by Rubin (1974),
resulting in it also being called the Rubin Causal Model, Neyman-Rubin Causal
Model, and Neyman-Rubin counterfactual framework of causality. The model
can be conceptualized as follows. Let Y1i be the potential outcome of individual
i if they received the treatment and Y0i be the potential outcome of individual
i if they did not receive the treatment. The observed score Yi can be written as Yi = Wi Y1i + (1 − Wi) Y0i, where Wi = 1 if individual i received the treatment and Wi = 0 otherwise.

One way to justify comparisons in observational studies is by demonstrating that one has measured the relevant covariates and showing that these are balanced across treated
and untreated participants. The most common way of demonstrating balance
in an observed covariate across groups is via a standardized mean difference.
This takes the form of the mean difference in the covariate between groups (in
absolute value) divided by either a pooled standard deviation or an unpooled
standard deviation of one of the groups.
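As a quick illustration, the pooled-standard-deviation version of this statistic can be computed in R as follows; x1 and x0 are hypothetical vectors of a covariate's values for treated and untreated participants.

```r
# Standardized mean difference with a pooled standard deviation.
smd <- function(x1, x0) {
  n1 <- length(x1); n0 <- length(x0)
  sp <- sqrt(((n1 - 1) * var(x1) + (n0 - 1) * var(x0)) / (n1 + n0 - 2))
  abs(mean(x1) - mean(x0)) / sp
}
```

Values below the thresholds discussed next (0.25, 0.1, or 0.05, depending on the guideline) would then be taken as evidence of balance on that covariate.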
A standardized mean difference of 0 would indicate the covariate has the
same mean across groups. However, there is no universally agreed upon metric
for judging how small a nonzero standardized mean difference must be to be
considered negligible enough for the groups to be considered balanced on the
covariate for practical purposes. Many recommendations exist in the methodological literature. Harder, Stuart, and Anthony (2010) use a value less than
0.25, based on a suggestion by Ho, Imai, King, and Stuart (2007). Austin (2011)
suggests a stricter value of less than 0.1, based on work by Normand et al. (2001).
Leite, Stapleton, and Bettini (2018) point out that for educational research, the
What Works Clearinghouse Procedures and Standards Handbook (version 4.0)
requires a value less than 0.05 without additional covariate adjustment, or between 0.05 and 0.25 with additional regression adjustment (U.S. Department
of Education, Institute of Education Sciences, & What Works Clearinghouse,
2017).
Analyzing standardized mean differences is reasonable when attempting to
balance across demographic covariates such as sex, age, race, etc. Yet some characteristics do not lend themselves well to being assessed in this way. Consider
an example where we are interested in evaluating the effects of a breakup from
a romantic relationship (the treatment) on life satisfaction (the outcome). For
simplicity, let us assume that we only collect data from one partner per couple.
Putting demographics aside, affect might be an important covariate to balance
on. However, ensuring that couples who do and do not break up have the same
average affect might not be especially useful. Stability of affect has been shown
to be predictive of whether couples remain together or break up (Ferrer, 2016;
Ferrer, Steele, & Hsieh, 2012). That is, fluctuations in affect are what need to
be balanced, not simply average affect. Consider the plot given in Figure 1 of
two hypothetical individuals, J and K, and their affect over time. J has highly
variable affect, whereas K has relatively stable affect. Based on the aforementioned research, J is more likely to experience a breakup, given their instability.
However, both J and K have the same average affect. Imagine a treatment group
filled with individuals like J and an untreated group filled with individuals like
K. According to the standardized mean difference, these two groups would be
balanced across affect, because they have the same mean affect. The fact that
they have different patterns with regard to the variability would be entirely
missed.
The literature does recommend that covariates should be balanced across
groups on not just the mean, but the distribution of the variables (Austin, 2011;
Ho et al., 2007). Researchers are encouraged to examine higher-order moments,
as well as interactions between covariates. Graphical methods are often used for this purpose as well.
1.2 Purpose
Thus far we have discussed ways to evaluate whether treated and untreated
participants are balanced on covariates. If they are found to be unbalanced, we
can turn to statistical approaches to balance them. A natural initial thought
scores are estimated via factor score estimation (Raykov, 2012), or by using
structural equation modeling to estimate propensity scores directly (Leite et
al., 2018). Machine learning techniques, including bagging, boosting, trees, and random forests, have also been used for the estimation of propensity scores (Lee, Lessler, & Stuart, 2010).
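The passage above lists several estimation strategies; as a baseline illustration only, a plain logistic-regression propensity model, the classical starting point, might look like this in R (W, x1, x2, and d are hypothetical names).

```r
# Propensity scores from a logistic regression of treatment status on
# observed covariates.
ps_fit <- glm(W ~ x1 + x2, family = binomial, data = d)
d$pscore <- fitted(ps_fit)  # estimated probability of receiving treatment
```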
stability. The causal tree approach would then allow us to estimate the causal
effect of a breakup separately in each of these subgroups, as well as compare
them to see if the causal effect differs by subgroup.
One limitation of causal trees as described is that they assume we wish to match
on observed covariates. However, stability in our example is not an observed variable in the data: it is a characterization based on a pattern. One way to characterize stability for the data in our example would be to fit a simple intercept-only growth curve model and examine the residual variance. A model fit to individuals such as J would produce a large residual variance, whereas a model fit to individuals like K would yield a relatively small residual variance. Thus, stability of a group can be characterized by model-based parameter estimates, in lieu of observed variables.
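One way to operationalize this characterization, sketched below under assumed variable names t1–t10 and data frame affect_data, is to fit an intercept-only growth model in lavaan with a common residual variance and read that variance off as the stability summary.

```r
library(lavaan)

ts <- paste0("t", 1:10)
model <- paste(c(
  paste0("I =~ ", paste(paste0("1*", ts), collapse = " + ")),  # intercept only
  paste0(ts, " ~~ r*", ts)  # residual variances constrained equal (label r)
), collapse = "\n")

fit <- growth(model, data = affect_data)  # affect_data: hypothetical data frame
subset(parameterEstimates(fit), label == "r")[1, "est"]  # large value = unstable
```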
To do this within the causal tree framework, we would need a mechanism
to fit a model within each node. For longitudinal models, we can use an approach like the nonlinear longitudinal recursive partitioning algorithm proposed by Stegmann, Jacobucci, Serang, and Grimm (2018), which allows the user to fit linear and nonlinear longitudinal models within each node. A more general approach is the structural equation model tree (SEM Tree) proposed by Brandmaier, Oertzen, McArdle, and Lindenberger (2013), which allows for structural
equation models (SEMs) to be fit within each node. A benefit of the latter is
the flexibility of the SEM framework, which can accommodate a wide range of
models, including many longitudinal models, via latent growth curve modeling
(Meredith & Tisak, 1990).
The logic of SEM Trees is similar to that of standard decision trees, with
some minor variations. A prespecified SEM is first fit to the full sample, and
the minus two log-likelihood (−2LogL) is calculated. Then, the −2LogL for
the candidate split is calculated. Since the split can be conceptualized as a
multiple group model (Jöreskog, 1971), the −2LogL for the split is simply the
sum of the −2LogL values for each daughter node. A likelihood ratio test is then
conducted with these two −2LogL values. If the test rejects, the split is made. As in
other decision trees, this process is recursively repeated until all daughter nodes
are terminal nodes. Unlike conventional decision trees, terminal nodes in SEM
Trees do not provide a predicted proportion or mean. Rather, each terminal
node is characterized by a set of parameter estimates for the SEM fit to the
sample in that node. In this way, SEM Trees can be used to identify subgroups
of people who are similar in that they can be represented by a set of parameter
estimates that is distinct from the parameter estimates that characterize those in
other nodes. SEM Trees can therefore identify subgroups with distinct patterns
of stability, growth, or other patterns reflected in the parameter estimates.
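The split decision can be mimicked directly, as in the hedged sketch below: fit the SEM to the parent node and to the two daughter nodes implied by a candidate split, then compare −2LogL values with a likelihood ratio test. Here model, d, cov, and split_value are hypothetical names, and growth() stands in for whatever lavaan model is being grown.

```r
library(lavaan)

neg2LL <- function(fit) -2 * as.numeric(logLik(fit))

parent    <- growth(model, data = d)
daughter1 <- growth(model, data = d[d$cov <= split_value, ])
daughter2 <- growth(model, data = d[d$cov >  split_value, ])

# The split acts as a two-group model, freeing one extra copy of each
# parameter, so the LRT degrees of freedom equal the number of free
# parameters in the parent model.
lr <- neg2LL(parent) - (neg2LL(daughter1) + neg2LL(daughter2))
pchisq(lr, df = length(coef(parent)), lower.tail = FALSE)  # split if small
```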
Suppose we divide the covariates into XM, the modeled covariates we want to match on, and XS, the splitting covariates we want to split on in the recursive partitioning process, which define the subgroups of the tree's terminal nodes.
Guidance for whether a covariate should be a modeled covariate or a splitting
covariate is provided in the discussion. Let M be an SEM with parameters θ
that produces XM , so that M (θ) = XM . In our running example, M would be
the intercept-only growth model and θ would be its parameters. For properly
specified M , XM can be used to estimate θ, resulting in parameter estimates θ̂.
Using Mplus Trees, we can build a tree that matches on θ̂, with groups (terminal
nodes) defined by their covariate patterns on XS . The treatment assignment
information, W , is not provided to the recursive partitioning algorithm and so the
tree is built blind to W . In the estimation subsample, we can divide participants
into groups according to the splits found by the tree. Within each group, we
can estimate the CATE as defined before by taking the difference between the
means of the outcomes of the treated and untreated participants in each group.
Since we are using a fresh sample, we can draw inference using hypothesis tests
such as an independent-samples t test or another suitable alternative. We can
also test whether the CATE differs by group by testing the interaction effect in
a two-way independent ANOVA.
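In the estimation subsample this amounts to a few lines of R, sketched below with hypothetical names (y = outcome, W = treatment indicator, node = terminal-node membership implied by the tree's splits).

```r
# CATE within one terminal node: mean difference between treated and
# untreated participants, tested with an independent-samples t test.
t.test(y[node == 1 & W == 1], y[node == 1 & W == 0])

# Do CATEs differ across terminal nodes? Test the interaction in a
# two-way independent ANOVA.
summary(aov(y ~ factor(W) * factor(node)))
```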
3 Simulation Studies
As a proof of concept for Causal Mplus Trees, we performed two small simulation
studies. The simulation studies were conducted in R using the lavaan package
to simulate data and the MplusTrees package for analysis. Readers are referred
to the package documentation for details regarding the implementation of the
algorithm in the software. Each simulation consisted of 1,000 replications.
The first simulation mapped onto our running example regarding stability of
affect. Each sample consisted of N = 2,000 individuals, 1,000 in each of two
groups. The data were generated from an intercept-only (no growth) model with
10 time points. The intercept had a mean of 10 with a variance of 1. The only
difference between the groups was in the residual variance, σ2 . One group had a
residual variance of 1 (the group with stable affect), and the other had a residual
variance of 10 (the group with unstable affect). The group memberships were
identified by a dichotomous covariate, used as a splitting variable. Thus, the
tree matched on the growth curve, using the group membership to split. Within
each group, treated and untreated participants were evenly split (500 each).
A diagram of this population tree is given in Figure 2. For the stable affect
group, outcomes were generated using a standard normal distribution, N (0, 1),
for the untreated group and a N (0.5, 1) distribution for the treated group, to
represent a medium-sized CATE. However, for the unstable affect group, the
outcome distributions were flipped, with the untreated group’s outcome being
generated from a N (0.5, 1) distribution, whereas the treated group’s outcome
was generated from a N (0, 1) distribution. In this way, although the ATE for
the full sample was 0, the CATE for each group was 0.5 in absolute value.
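As a concrete illustration of this data-generating process, a minimal R sketch using the lavaan package might look as follows; all object and variable names are ours, and the sketch mirrors the description above rather than reproducing the study's actual code.

```r
library(lavaan)

# Sketch: one group's repeated measures from an intercept-only (no growth)
# model with 10 time points, intercept mean 10 and variance 1, and a
# group-specific residual variance (1 = stable affect, 10 = unstable affect).
gen_group <- function(n, resvar) {
  model <- paste0(
    "i =~ ", paste0("1*y", 1:10, collapse = " + "), "\n",
    "i ~ 10*1\n",
    "i ~~ 1*i\n",
    paste0("y", 1:10, " ~~ ", resvar, "*y", 1:10, collapse = "\n")
  )
  simulateData(model, sample.nobs = n)
}

# Treatment and outcome: CATE of +0.5 in one group, flipped in the other
add_outcome <- function(d, flip = FALSE) {
  d$w   <- rep(0:1, each = nrow(d) / 2)           # 500 treated, 500 untreated
  mu    <- if (flip) 0.5 * (1 - d$w) else 0.5 * d$w
  d$out <- rnorm(nrow(d), mean = mu, sd = 1)
  d
}

set.seed(1)
dat <- rbind(cbind(add_outcome(gen_group(1000, resvar = 1)),               z = 0),
             cbind(add_outcome(gen_group(1000, resvar = 10), flip = TRUE), z = 1))
```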
It should be noted that these groups are, from the start, balanced on the
modeled covariates. Since the growth curve variables were all generated to have
a mean of 10, they would be considered balanced according to the standardized
mean difference. Thus, if one were to follow conventional procedure, propensity
scores would not be needed here, and the estimation of the ATE would consist
of simply the mean difference between treated and untreated participants, which
would be 0 on average.
The Causal Mplus Trees algorithm was implemented as described in the prior
section, with 80% of the sample (1,600 individuals) used for matching and 20%
(400 individuals) used to estimate CATEs. A cp value of .01 was used to split,
with a minimum of 100 individuals required to consider splitting on a node. Each
terminal node was also required to have at least 100 individuals within it. For
each replication, the CATE was estimated in each group using an independent
samples t test. A two-way independent ANOVA was also conducted to determine
if CATEs differed by group.
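Continuing the sketch above, the tree itself might be grown as follows, assuming the MplusTrees() interface documented in the MplusTrees package (Serang et al., 2021) together with MplusAutomation's mplusObject(); the Mplus MODEL syntax shown is a schematic intercept-only specification, not the study's exact script.

```r
library(MplusTrees)
library(MplusAutomation)
library(rpart)

# 80/20 split: grow the tree on the matching subsample only
idx          <- sample(nrow(dat), size = 0.8 * nrow(dat))
match_dat    <- dat[idx, ]
est_dat      <- dat[-idx, ]
match_dat$id <- seq_len(nrow(match_dat))

# Schematic intercept-only growth model; the treatment indicator 'w' and the
# outcome 'out' are withheld from the tree, and 'z' is the splitting covariate.
script <- mplusObject(
  TITLE = "Intercept-only growth;",
  MODEL = "i BY y1-y10@1; [y1-y10@0]; [i]; i;",
  usevariables = paste0("y", 1:10),
  rdata = match_dat
)

fit <- MplusTrees(script, data = match_dat, group = ~id,
                  rPartFormula = ~ z,
                  control = rpart.control(cp = .01, minsplit = 100,
                                          minbucket = 100))
```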
Overall, the results demonstrated the effectiveness of the algorithm. Across
all replications, 94.5% of CATEs were detected. Additionally, 99.8% of the in-
teractions from the two-way ANOVA were detected, showing that the algorithm
can detect differences in CATEs by group. As a comparison, we also analyzed
these data as they would have been analyzed using the conventional approach.
Since the covariates were on average balanced according to the standardized mean
difference, the ATE would have been estimated by using the full sample to esti-
mate the mean difference between treated and untreated participants. Despite a
sample size of 2,000 to do this (relative to the only 400 available to Causal Mplus
Trees after performing the matching), only 3.4% of datasets yielded statistically
significant ATEs, consistent with a nominal false positive rate of 5%.
The results of the second simulation study were similar to those found in the first. Taken together, the two studies show that
the Causal Mplus Trees algorithm is able to estimate CATEs and support hy-
pothesis testing to determine their statistical significance. It can also determine
whether the CATEs differ by group. Notably, CATEs were found in the absence
of ATEs, with modeled covariates already balanced across treated and untreated
participants according to the standardized mean difference.
4 Empirical Example
The outcome of interest, CADT, is a county's day-level percentage point change in average travel distance relative to that day-of-week's average in early 2020 (the average for February 10 to March 8, prior to
the presence of COVID-19 in the US). Accordingly, a value of –3 indicates a
3 percentage point decline in average travel distance relative to baseline levels.
A positive value of CADT signals that residents of that county increased their
travel distances relative to their pre-COVID-19 patterns, whereas a negative
value indicates reduced travel distances (which can occur through reductions both in the distance traveled per trip and in the overall number of trips taken). Each county's average CADT for July was estimated by taking the mean of the daily CADT for each day from July 1 through July 31, 2020. The estimate
of the CATE in each group, along with corresponding information, is given in
Table 2.
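As an illustration of the aggregation described above (the data frame and column names here, daily_cadt, date, county_fips, and cadt, are hypothetical; the actual field names in the Unacast data may differ):

```r
library(dplyr)

# Sketch: each county's July 2020 average CADT from the daily series
july_cadt <- daily_cadt %>%
  filter(date >= as.Date("2020-07-01"), date <= as.Date("2020-07-31")) %>%
  group_by(county_fips) %>%
  summarize(cadt_july = mean(cadt, na.rm = TRUE))
```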
Of the four groups, the only one with a statistically significant CATE was Group 2, where counties in states with Democratic governors had an average CADT that was 6.46 percentage points lower than counties in states with Republican governors, t(108.76) = -2.84, p = .006. Group 2 was on average the most populous and least rural group of the four, as well as the most educated and the one with the highest median incomes. As such, Group 2 contained the country's more metropolitan areas. We interpret this result to mean that in metropolitan counties matched on COVID-19 rates, people in counties in states with Democratic governors traveled 6.5 percentage points less in July than people in comparable counties in states with Republican governors. Of note, in the estimation subsample the two-way independent ANOVA did not find a significant main effect of party, F(1, 598) = 3.76, p = .053, whereas it did find a main effect of group, F(3, 598) = 13.45, p < .001, and an interaction, F(3, 598) = 3.41, p = .017. This
suggests that the party effect is more prominent for more metropolitan counties,
but would be obscured if examining the country as a whole. The mean difference
between parties in CADT for all 3,030 counties was only 0.60 percentage points,
with a t test on the full dataset yielding t(2422.6) = -1.50, p = .133, though this
result should be read with the caveat that nearly all counties were represented in
the sample. The value of Causal Mplus Trees in analyzing these data is evident
in its ability to find a group of counties exhibiting stronger party effects, while
simultaneously matching on COVID-19 trajectories.
Our findings corroborate those of previous COVID-19 partisanship studies.
Allcott et al. (2020) found evidence of 3.6 percent fewer point-of-interest visits associated with a 10 percentage point decrease in the Republican vote share (roughly equivalent to shifting from the median county to the 25th percentile county in Republican vote share for the 2016 presidential election). Brzezinski et al. (2020)
estimated a 3 percentage point difference in the share of devices staying fully
at home for the 90th vs 10th percentile Democrat vote share counties 15 days
after a county’s first case. Areas with relatively greater viewership of conserva-
tive news shows that initially downplayed the threat of coronavirus (versus those
that accurately portrayed the pandemic) have also been linked to delayed behav-
ior changes and higher initial occurrences of cases and deaths (Bursztyn, Rao,
Roth, & Yanagizawa-Drott, 2020). Further, our Group 2 CATE is comparable
in magnitude to the decline in travel distance attributable to statewide stay-
at-home mandates (Sears et al., 2020). While prior studies employ traditional approaches for assessing treatment effect heterogeneity (e.g., running difference-in-differences or event-study regressions on subgroups of interest), the Causal Mplus Trees method provides a data-driven approach to identifying groups that are comparable on model fit and to analyzing treatment effect heterogeneity.
5 Discussion
In this paper, we proposed the Causal Mplus Trees algorithm, which matches
on parameter estimates of an SEM using a tree-based approach and uses these
groupings to estimate CATEs in a holdout sample. We used two small simulation
studies to demonstrate a proof of concept for the approach. We also showed
how it could be used to estimate party effects on mobility using COVID-19
data. We reiterate that we do not see Causal Mplus Trees as a substitute for
traditional matching methods. Propensity score matching and related methods
have their place and can be effective in matching on covariates, both observed
and latent. We believe that our approach offers an alternative option to those
whose research questions would be better addressed by the ability to match on
parameter estimates from an SEM.
Although the procedure ultimately matches on both types of covariates, the way it does so differs by covariate type. Matching is performed on modeled covariates indirectly, through the parameter estimates produced by the model, whereas splitting covariates are matched more directly, on their observed values. The choice of
whether a covariate should be used as a modeled or splitting covariate depends
upon what specifically the user wants to match, which can vary based on the
research question, study design, and characteristics of the sample collected.
Another consideration for researchers using Causal Mplus Trees is the depth
to which the tree should be grown. Cross-validation is the most commonly used
approach for this in the context of conventional decision trees. However, we
believe that cross-validation may not be as well suited for our purposes primarily
because it is designed to optimize predictive accuracy. In our algorithm, the goal
of the tree is not to optimize predictive accuracy, but rather to partition the
sample into groups that are matched well enough on θ̂ to justify causal inference
in the holdout sample. As in propensity score matching, there is no objective
criterion for this, so the researcher must make a subjective judgment and make
a case to justify it.
We urge researchers to take into account the following considerations. First,
the sample size in each parent node must be large enough to estimate M in not
only the parent node, but also each of the daughter nodes. SEMs can require large samples to estimate, so limits should be placed on the splitting procedure so that splits are not considered for nodes too small to support estimation. Related to this is the need for a sufficient number of
treated and untreated participants in each terminal node to be able to estimate
the CATEs in the holdout sample. If a group has no treated (or no untreated)
participants, the CATE cannot be estimated. Of course, it is possible that the
mix in the tree differs from the mix in the holdout sample, but to the extent
that the matching subsample is a reflection of the estimation subsample, the
matching subsample can give a sense of the mix one would expect in the estima-
tion subsample. If performing hypothesis tests, certain minimum sample sizes
are required to meet the assumptions of the test as well as to detect the effects,
so these must also be kept in mind when deciding how deep to grow the tree.
Parsimony is also important to consider, especially with respect to building a
coherent narrative with policy implications. We are typically searching for groups
with qualitative meaning given the relevant theoretical framework. If the tree
were to produce a dozen groups, it may be challenging to map this onto available
theory in order to interpret the results. The relative importance of parameters
in characterizing a pattern should be taken into account as well. Theory may
dictate that some parameters may be more important to match on than others
for a given context (e.g., the residual variance in our stability example). As such,
it could be justifiable to trim the tree earlier if splits begin resulting in differences
in less relevant parameters. The size of parameter estimates may also play a role.
For example, the algorithm could decide on a split that results in two daughter
nodes with only small differences in their parameter estimates. Treating these
as two separate groups for the purpose of estimating the CATE may not be
worthwhile. Similar to the logic used in propensity score analysis, the treated and untreated participants in each node should be compared on their parameter estimates to verify, even if only subjectively, that they are similar and therefore matched to some degree.
The choice for the depth of the tree depends on a trade-off between inter-
pretability of a result and the validity of the causal inference. If one were to view
the ability to draw causal inference as how well treated and untreated partici-
pants are matched, then the ability to draw causal inference can be conceptu-
alized not as a dichotomy but as a continuum with perfectly matched partici-
pants on one end and perfectly unmatched participants on the other. The better
matched participants are, the greater the ability to draw causal inference. How-
ever, better matching requires a deeper tree, which becomes less interpretable
and generalizable as the depth grows. This trade-off exists in propensity score
matching as well, but it is more apparent in the context of decision trees, where a language with which to conceptualize and discuss such trade-offs already exists.
References
Abrevaya, J., Hsu, Y.-C., & Lieli, R. (2015). Estimating conditional aver-
age treatment effects. Journal of Business and Economic Statistics, 33 ,
485–505. doi: https://doi.org/10.1080/07350015.2014.975555
Adolph, C., Amano, K., Bang-Jensen, B., Fullman, N., & Wilkerson, J.
(2020). Pandemic politics: Timing state-level social distancing responses to
COVID-19. medRxiv . doi: https://doi.org/10.1101/2020.03.30.20046326
Allcott, H., Boxell, L., Conway, J., Gentzkow, M., Thaler, M., & Yang, D. (2020).
Polarization and public health: Partisan differences in social distancing
during the coronavirus pandemic. Journal of Public Economics, 191 . doi:
https://doi.org/10.1016/j.jpubeco.2020.104254
Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal
effects. Proceedings of the National Academy of Sciences, 113 , 7353–7360.
doi: https://doi.org/10.1073/pnas.1510489113
Austin, P. (2009). The relative ability of different propensity-score methods
to balance measured covariates between treated and untreated subjects
in observational studies. Medical Decision Making, 29 , 661–677. doi:
https://doi.org/10.1177/0272989X09341755
Austin, P. (2011). An introduction to propensity score methods for reducing the
effects of confounding in observational studies. Multivariate Behavioral Re-
search, 46 , 399–424. doi: https://doi.org/10.1080/00273171.2011.568786
Berk, R. (2004). Regression analysis: A constructive critique. Sage. doi:
https://doi.org/10.4135/9781483348834
Brandmaier, A., Oertzen, T., McArdle, J., & Lindenberger, U. (2013). Struc-
tural equation model trees. Psychological Methods, 18 , 71–86. doi:
https://doi.org/10.1037/a0030001.
Brandmaier, A., Prindle, J., & Arnold, M. (2021). semtree: Recursive partition-
ing for structural equation models [R package version 0.9.17.]. Retrieved
from https://CRAN.R-project.org/package=semtree
Brandmaier, A., Prindle, J., McArdle, J., & Lindenberger, U. (2016). Theory-
guided exploration with structural equation model forests. Psychological
Methods, 21 , 566–582. doi: https://doi.org/10.1037/met0000090
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Clas-
sification and regression trees. Chapman & Hall. doi:
https://doi.org/10.1201/9781315139470
Browne, M., & du Toit, S. (1991). Models for learning data. In L. Collins & J. Horn
(Eds.), Best methods for the analysis of change (p. 47–68). American
Psychological Association. doi: https://doi.org/10.1037/10099-004
Brzezinski, A., Deiana, G., Kecht, V., & Van Dijcke, D. (2020). The covid-19 pan-
demic: Government vs. community action across the united states. INET
Oxford Working Paper . Retrieved from https://www.inet.ox.ac.uk/
publications/no-2020-06-the-covid-19-pandemic-government-vs
-community-action-across-the-united-states/ (No. 2020-06.)
Bursztyn, L., Rao, A., Roth, C., & Yanagizawa-Drott, D. (2020). Mis-
information during a pandemic. NBER Working Paper (27417). doi:
https://doi.org/10.3386/w27417
Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to
track covid-19 in real time. Lancet Infectious Disease, 20 , 533–534. doi:
https://doi.org/10.1016/S1473-3099(20)30120-1
Ferrer, E. (2016). Exploratory approaches for studying social in-
teractions, dynamics, and multivariate processes in psychological
science. Multivariate Behavioral Research, 51 , 240–256. doi:
https://doi.org/10.1080/00273171.2016.1140629
Ferrer, E., Steele, J., & Hsieh, F. (2012). Analyzing dynamics of af-
fective dyadic interactions using patterns of intra- and inter-individual
variability. Multivariate Behavioral Research, 47 , 136–171. doi:
https://doi.org/10.1080/00273171.2012.640605
Gadarian, S., Goodman, S., & Pepinsky, T. (2020). Partisanship, health behav-
ior, and policy attitudes in the early stages of the COVID-19 pandemic.
SSRN . doi: https://doi.org/10.2139/ssrn.3562796
Greenland, S., Pearl, J., & Robins, J. (1999). Causal diagrams for epidemiologic
research. Epidemiology, 10 , 37–48. doi: https://doi.org/10.1097/00001648-
199901000-00008
Grimm, K., & Ram, N. (2009). Nonlinear growth models in
mplus and sas. Structural Equation Modeling, 16 , 676–701. doi:
https://doi.org/10.1080/10705510903206055
Gu, X., & Rosenbaum, P. (1993). Comparison of multivariate
matching methods: Structures, distances, and algorithms. Jour-
nal of Computational and Graphical Statistics, 2 , 405–420. doi:
https://doi.org/10.1080/10618600.1993.10474623
Guo, S., & Fraser, M. (2010). Propensity score analysis: Statistical methods and
applications. Sage.
Hallquist, M., & Wiley, J. (2018). MplusAutomation: An R package for facilitat-
ing large-scale latent variable analyses in Mplus. Structural Equation Mod-
eling, 25 , 621–638. doi: https://doi.org/10.1080/10705511.2017.1402334
Harder, V., Stuart, E., & Anthony, J. (2010). Propensity score techniques
and the assessment of measured covariate balance to test causal associa-
tions in psychological research. Psychological Methods, 15 , 234–249. doi:
https://doi.org/10.1037/a0019623
Hirano, K., & Imbens, G. (2001). Estimation of causal effects using propensity
score weighting: An application to data on right heart catheterization.
Health Services and Outcomes Research Methodology, 2 , 259–278. doi:
https://doi.org/10.1023/A:1020371312283
Ho, D., Imai, K., King, G., & Stuart, E. (2007). Matching as nonparametric pre-
processing for reducing model dependence in parametric causal inference.
Political Analysis, 15 , 199–236. doi: https://doi.org/10.1093/pan/mpl013
Holland, P. (1986). Statistics and causal inference. Journal
of the American Statistical Association, 81 , 945–60. doi:
https://doi.org/10.1080/01621459.1986.10478354
Jöreskog, K. (1971). Simultaneous factor analysis in several populations. Psy-
chometrika, 36 , 409-426. doi: https://doi.org/10.1007/BF02291366
Lee, B., Lessler, J., & Stuart, E. (2010). Improving propensity score weight-
ing using machine learning. Statistics in Medicine, 29 , 337–346. doi:
https://doi.org/10.1002/sim.3782
Leite, W., Stapleton, L., & Bettini, E. (2018). Propensity score analy-
sis of complex survey data with structural equation modeling: A tu-
torial with mplus. Structural Equation Modeling, 3 , 448–469. doi:
https://doi.org/10.1080/10705511.2018.1522591
Meredith, W., & Tisak, J. (1990). Latent curve analysis. Psychometrika, 55 ,
107–122. doi: https://doi.org/10.1007/bf02294746
Muthén, L., & Muthén, B. (1998-2017). Mplus user’s guide (8th ed.) [Computer
software manual]. Muthén & Muthén.
National Governors Association. (2020). Governors roster. Retrieved
from https://www.nga.org/wp-content/uploads/2019/07/Governors
-Roster.pdf
Neale, M., Hunter, M., Pritikin, J., Zahery, M., Brick, T., Kirkpatrick,
R., & Boker, S. (2016). Openmx 2.0: Extended structural equa-
tion and statistical modeling. Psychometrika, 81 , 535–549. doi:
https://doi.org/10.1007/s11336-014-9435-8
Neyman, J., Iwaszkiewicz, K., & Kolodziejczyk, S. (1935). Statistical problems
in agricultural experimentation. Supplement to the Journal of the Royal
Statistical Society, 2 , 107–180. doi: https://doi.org/10.2307/2983637
Normand, S., Landrum, M., Guadagnoli, E., Ayanian, J., Ryan, T., Cleary, P., &
McNeil, B. (2001). Validating recommendations for coronary angiography
following an acute myocardial infarction in the elderly: A matched analysis
using propensity scores. Journal of Clinical Epidemiology, 54 , 387–398.
doi: https://doi.org/10.1016/S0895-4356(00)00321-8
R Core Team. (2020). R: A language and environment for statistical computing
[Computer software manual]. Vienna, Austria. Retrieved from https://
www.R-project.org/
Raykov, T. (2012). Propensity score analysis with fallible covariates: A note on
a latent variable modeling approach. Educational and Psychological Mea-
surement, 72 , 715–733. doi: https://doi.org/10.1177/0013164412440999
Rosenbaum, P., & Rubin, D. (1983). The central role of the propensity score
in observational studies for causal effects. Biometrika, 70 , 41–55. doi:
https://doi.org/10.1093/biomet/70.1.41
Rosenbaum, P., & Rubin, D. (1984). Reducing bias in observational studies
using subclassification on the propensity score. Journal of the American
Statistical Association, 79 , 516–24. doi: https://doi.org/10.2307/2288398
Rosenbaum, P., & Rubin, D. (1985). The bias due to incomplete matching.
Biometrics, 41 , 103–16. doi: https://doi.org/10.2307/2530647
Rosseel, Y. (2012). lavaan: An R package for structural equa-
tion modeling. Journal of Statistical Software, 48(2), 1–36. doi:
https://doi.org/10.18637/jss.v048.i02
Rubin, D. (1974). Estimating causal effects of treatments in randomized and
nonrandomized studies. Journal of Educational Psychology, 66 , 688–701.
doi: https://doi.org/10.1037/h0037350
Rubin, D. (1980). Randomization analysis of experimental data: The fisher ran-
domization test comment. Journal of the American Statistical Association,
75 , 591–593. doi: https://doi.org/10.2307/2287653
Rubin, D. (1986). What if’s have causal answers. Journal
of the American Statistical Association, 81 , 961–962. doi:
https://doi.org/10.1080/01621459.1986.10478355
Sears, J., Villas-Boas, S., Villas-Boas, M., & Villas-Boas, V. (2020).
Are we #stayinghome to flatten the curve? SSRN . doi:
https://doi.org/10.2139/ssrn.3569791
Serang, S., Jacobucci, R., Stegmann, G., Brandmaier, A., Culianos, D.,
& Grimm, K. (2021). Mplus Trees: Structural equation model
trees using Mplus. Structural Equation Modeling, 28 , 127–137. doi:
https://doi.org/10.1080/10705511.2020.1726179
Stegmann, G., Jacobucci, R., Serang, S., & Grimm, K. (2018). Recursive parti-
tioning with nonlinear models of change. Multivariate Behavioral Research,
53 , 559–570. doi: https://doi.org/10.1080/00273171.2018.1461602
Suk, Y., Kang, H., & Kim, J.-S. (in press). Random forests approach for causal
inference with clustered observational data. Multivariate Behavioral Re-
search. doi: https://doi.org/10.1080/00273171.2020.1808437
Therneau, T., & Atkinson, B. (2018). Rpart: Recursive partitioning and
regression trees [R package version 4.1-13.]. Retrieved from https://
CRAN.R-project.org/package=rpart
Thoemmes, F., & Kim, E. (2011). A systematic review of propensity score meth-
ods in the social sciences. Multivariate Behavioral Research, 46 , 90–118.
doi: https://doi.org/10.1080/00273171.2011.540475
Unacast. (2020). Unacast social distancing scoreboard dataset. Retrieved from
https://www.unacast.com/data-for-good.
U.S. Census Bureau. (2010). Decennial census, 2010. Retrieved from https://
data.census.gov/
U.S. Department of Education, Institute of Education Sciences, & What Works
Clearinghouse. (2017). What works clearinghouse: Standards handbook
(version 4.0). Retrieved from https://ies.ed.gov/ncee/wwc/Docs/
referenceresources/wwc_standards_handbook_v4.pdf
Wager, S., & Athey, S. (2018). Estimation and inference of
heterogeneous treatment effects using random forests. Journal
of the American Statistical Association, 113 , 1228–1242. doi:
https://doi.org/10.1080/01621459.2017.1319839
West, S., Cham, H., Thoemmes, F., Renneberg, B., Schulze, J., & Weiler,
M. (2014). Propensity scores as a basis for equating groups: Ba-
sic principles and application in clinical treatment outcome research.
Journal of Consulting and Clinical Psychology, 82 , 906–919. doi:
https://doi.org/10.1037/a0036387
Journal of Behavioral Data Science, 2021, 1 (2), 54–88.
DOI: https://doi.org/10.35566/jbds/v1n2/p4
1 Introduction
1.1 Motivating Example
Earlier studies have examined the impacts of time-invariant covariates (TICs) on
nonlinear mathematics achievement trajectories. For example, Liu, Perera, Kang,
2 Method
In this section, we specify the GMM with a bilinear spline growth curve as the within-class model. Harring et al. (2006) showed that there are five parameters in the bilinear spline functional form: an intercept and slope for each linear piece and a change-point, yet the bilinear spline has only four degrees of freedom since the two linear pieces join at the knot. In this study, we view the initial status, the two slopes, and the knot as the four parameters. We construct the model allowing for variability in the initial status and the two slopes, but assume that the class-specific knot is the same across all individuals in a latent class, although Liu et al. (2019) and Preacher and Hancock (2015) have shown that this assumption can be relaxed so that the knot also has a random effect. Suppose the pre-specified number of latent classes is K. For i = 1, . . . , n individuals and k = 1, . . . , K latent classes, we express the model as
\[
p(\mathbf{y}_i \mid \mathbf{x}_i) \;=\; \sum_{k=1}^{K} \pi(z_i = k \mid \mathbf{x}_i)\, p(\mathbf{y}_i \mid z_i = k), \tag{1}
\]

\[
\pi(z_i = k \mid \mathbf{x}_i) \;=\;
\begin{cases}
\dfrac{1}{1 + \sum_{k'=2}^{K} \exp\!\big(\beta_0^{(k')} + \boldsymbol{\beta}^{(k')T}\mathbf{x}_i\big)}, & \text{Reference Group } (k = 1), \\[2.5ex]
\dfrac{\exp\!\big(\beta_0^{(k)} + \boldsymbol{\beta}^{(k)T}\mathbf{x}_i\big)}{1 + \sum_{k'=2}^{K} \exp\!\big(\beta_0^{(k')} + \boldsymbol{\beta}^{(k')T}\mathbf{x}_i\big)}, & \text{Other Groups } (k = 2, \ldots, K). 
\end{cases} \tag{2}
\]
Equation (1) defines an FMM that combines mixing proportions, π(z_i = k | x_i), and within-class models, p(y_i | z_i = k), where x_i, y_i, and z_i are the covariates, the J × 1 vector of repeated measures of the outcome (where J is the number of measurements), and the class membership of the ith individual, respectively. For Equation (1), we have two constraints: 0 ≤ π(z_i = k | x_i) ≤ 1 and Σ_{k=1}^{K} π(z_i = k | x_i) = 1. Equation (2) defines the mixing components as logistic functions of the covariates x_i, where β_0^(k) and β^(k) are the class-specific logistic coefficients. These functions decide the membership of the ith individual, depending on the values of the covariates x_i.
Equations (3) and (4) together define a within-class model. Similar to all factor models, Equation (3) expresses the outcome y_i as a linear combination of growth factors. When the underlying functional form is a bilinear spline growth curve with an unknown fixed knot, η_i is a 3 × 1 vector of growth factors (η_i = (η_{0i}, η_{1i}, η_{2i})^T, an initial status and a slope for each stage of the ith individual). Accordingly, Λ_i(γ^(k)), which is a function of the class-specific knot γ^(k), is a J × 3 matrix of factor loadings. Note that the subscript i in Λ_i(γ^(k)) indicates that it is a function of the individual measurement occasions of the ith individual. The pre- and post-knot y_i can be expressed as

\[
y_{ij} =
\begin{cases}
\eta_{0i} + \eta_{1i}\, t_{ij} + \epsilon_{ij}, & t_{ij} \le \gamma^{(k)}, \\
\eta_{0i} + \eta_{1i}\, \gamma^{(k)} + \eta_{2i}\,\big(t_{ij} - \gamma^{(k)}\big) + \epsilon_{ij}, & t_{ij} > \gamma^{(k)},
\end{cases}
\]
where yij and tij are the measurement and measurement occasion of the ith
individual at time j. Additionally, ε_i is a J × 1 vector of residuals of the ith individual. Equation (4) further expresses the growth factors as deviations from
their class-specific means. In the equation, µ_η^(k) is a 3 × 1 vector of class-specific growth factor means and ζ_i is a 3 × 1 vector of the ith individual's residual deviations from that mean vector.
To unify pre- and post-knot expressions, we need to reparameterize growth
factors. Earlier studies, for example, Grimm et al. (2016); Harring et al. (2006);
Liu et al. (2019), presented multiple ways to realize this aim. Note that no matter
which approach we follow to reparameterize growth factors, the reparameterized
coefficients are not directly related to the underlying change patterns and need
to be transformed back to be interpretable. In this article, we follow the reparameterization method in Liu et al. (2019) and define the class-specific reparameterized growth factors as the measurement at the knot, the mean of the two slopes, and the half difference of the two slopes. Note that the expressions of the repeated outcome y_i
using the growth factors in the original and reparameterized frames are equiva-
lent. We also extend the (inverse-)transformation functions and matrices for the
reduced model in Liu et al. (2019), with which we can obtain the original pa-
rameters efficiently for interpretation purposes. Detailed class-specific reparam-
eterizing process and the class-specific (inverse-) transformation are provided in
Appendix 6.2 and Appendix 6.2, respectively.
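Written out, a sketch of this reparameterization, in our notation and consistent with the verbal definition above (the exact class-specific transformation matrices are given in the appendices), is

\[
\eta_{0i}' = \eta_{0i} + \eta_{1i}\,\gamma^{(k)}, \qquad
\eta_{1i}' = \frac{\eta_{1i} + \eta_{2i}}{2}, \qquad
\eta_{2i}' = \frac{\eta_{2i} - \eta_{1i}}{2},
\]

so that the three reparameterized growth factors are, respectively, the measurement at the knot, the mean of the two slopes, and the half difference of the two slopes.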
To simplify the model, we assume that the class-specific growth factors follow a multivariate Gaussian distribution, that is, ζ_i | k ∼ MVN(0, Ψ_η^(k)), where Ψ_η^(k) is a 3 × 3 variance-covariance matrix of the class-specific growth factors. We also assume that the individual residuals follow identical and independent normal distributions over time within each latent class, that is, ε_i | k ∼ N(0, θ^(k) I), where I is a J × J identity matrix. Accordingly, for the ith individual in the kth unobserved group, the within-class model-implied mean vector (µ_i^(k)) and variance-covariance matrix (Σ_i^(k)) of the repeated measurements are

\[
\mu_i^{(k)} = \Lambda_i\, \mu_{\eta}^{(k)}, \tag{5}
\]
\[
\Sigma_i^{(k)} = \Lambda_i\, \Psi_{\eta}^{(k)} \Lambda_i^{T} + \theta^{(k)} I. \tag{6}
\]
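As a computational illustration of Equations (5) and (6), the following R sketch uses the loading structure implied by the piecewise expression above (columns for the intercept, the pre-knot time min(t_ij, γ), and the post-knot time max(t_ij − γ, 0)); all function and object names are ours.

```r
# Sketch: within-class model-implied moments, Equations (5) and (6)
lambda_bls <- function(t, gamma) {
  cbind(1, pmin(t, gamma), pmax(t - gamma, 0))    # J x 3 loading matrix
}

implied_moments <- function(t, gamma, mu_eta, Psi_eta, theta) {
  L <- lambda_bls(t, gamma)
  list(mu    = L %*% mu_eta,                                      # Equation (5)
       Sigma = L %*% Psi_eta %*% t(L) + theta * diag(length(t)))  # Equation (6)
}

# Example using values from the simulation design in Table 2
# (off-diagonal elements follow from rho = 0.3 with the stated variances)
Psi <- matrix(c(25.0, 1.5, 1.5,
                 1.5, 1.0, 0.3,
                 1.5, 0.3, 1.0), nrow = 3, byrow = TRUE)
mom <- implied_moments(t = 0:9, gamma = 4.5,
                       mu_eta = c(100, -5, -2.6), Psi_eta = Psi, theta = 1)
```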
Step 1. In the first step, we estimate the class-specific parameters and mixing proportions for the model specified in Equations (1), (2), (3), and (4), without considering the impact that the covariates x_i have on the class formation. The parameters to be estimated in this step are

\[
\Theta_{s1} = \big\{\mu_{\eta_0}^{(k)}, \mu_{\eta_1}^{(k)}, \mu_{\eta_2}^{(k)}, \gamma^{(k)}, \psi_{00}^{(k)}, \psi_{01}^{(k)}, \psi_{02}^{(k)}, \psi_{11}^{(k)}, \psi_{12}^{(k)}, \psi_{22}^{(k)}, \theta^{(k)}, \pi^{(2)}, \ldots, \pi^{(K)}\big\}.
\]
Step 2. In the second step, we examine the associations between the 'soft clusters', where each trajectory is assigned different posterior probabilities, and the baseline characteristics by fixing the class-specific parameters at their estimates from the first step; that is, the parameters to be estimated in this step are the logistic coefficients, Θ_s2 = {β_0^(k), β^(k)} (k = 2, . . . , K), in Equation (2). The log-likelihood function in Equation (7) also needs to be modified as

\[
\log lik(\Theta_{s2}) = \sum_{i=1}^{n} \log\!\Big(\sum_{k=1}^{K} \pi(z_i = k \mid \mathbf{x}_i)\, p(\mathbf{y}_i \mid z_i = k)\Big)
= \sum_{i=1}^{n} \log\!\Big(\sum_{k=1}^{K} \pi(z_i = k \mid \mathbf{x}_i)\, p\big(\mathbf{y}_i \mid \hat{\mu}_i^{(k)}, \hat{\Sigma}_i^{(k)}\big)\Big). \tag{8}
\]
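To make the second step concrete, the following R sketch maximizes Equation (8) for K = 2; the object names are ours, and d stands in for the n × 2 matrix of fixed class-specific densities p(y_i | µ̂_i^(k), Σ̂_i^(k)) computed from the step-1 estimates.

```r
# Sketch: step-2 maximization of Equation (8) with K = 2 latent classes.
# X: n x p covariate matrix; d: n x 2 matrix of fixed densities from step 1.
step2_loglik <- function(beta, X, d) {
  eta <- cbind(1, X) %*% beta          # beta0 + beta' x_i, as in Equation (2)
  pi2 <- exp(eta) / (1 + exp(eta))     # mixing proportion of class 2
  sum(log((1 - pi2) * d[, 1] + pi2 * d[, 2]))
}

set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)  # toy covariates for illustration
d <- matrix(runif(200 * 2), ncol = 2)  # toy stand-in densities

fit <- optim(par = rep(0, ncol(X) + 1), fn = step2_loglik, X = X, d = d,
             method = "BFGS", control = list(fnscale = -1))  # maximize
```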
3 Model Evaluation
We evaluate the proposed model using a Monte Carlo simulation study with
two goals. The first goal is to evaluate the model performance by examining the
relative bias, empirical SE, relative RMSE, and empirical coverage for a nominal
95% confidence interval (CI) of each parameter. Table 1 lists the definitions and
estimates of these performance metrics.
The second goal is to evaluate how well the clustering algorithm performs in separating the heterogeneous trajectories. To evaluate the clustering effects, we need to calculate the posterior probability that each individual belongs to the kth unobserved group. The calculation is based on the class-specific estimates and mixing proportions obtained from the first step and is realized by Bayes' theorem:

\[
p(z_i = k \mid \mathbf{y}_i) = \frac{\pi(z_i = k)\, p(\mathbf{y}_i \mid z_i = k)}{\sum_{k'=1}^{K} \pi(z_i = k')\, p(\mathbf{y}_i \mid z_i = k')}.
\]
We then assign each individual to the latent class with the highest posterior probability, that is, the class to which that observation most likely belongs. If multiple posterior probabilities equal the maximum value, we break the tie among competing components randomly (McLachlan & Peel, 2000). We evaluate the clustering effects by accuracy and entropy. Since the true membership is available in simulation studies, we are able to calculate accuracy, which is defined as the fraction of all correctly labeled instances (Bishop, 2006). Entropy is given by

\[
\text{Entropy} = 1 + \frac{1}{n \log(K)} \sum_{i=1}^{n} \sum_{k=1}^{K} p(z_i = k \mid \mathbf{y}_i)\, \log p(z_i = k \mid \mathbf{y}_i). \tag{9}
\]
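A short R sketch of these clustering diagnostics follows; d again denotes an n × K matrix of class-specific densities, pi_hat the step-1 mixing proportions, and all names are illustrative.

```r
# Sketch: posterior probabilities (Bayes' theorem), hard assignment with
# random tie-breaking, accuracy, and entropy as in Equation (9).
set.seed(1)
d      <- matrix(rexp(200), ncol = 2)          # toy stand-in densities
true_z <- sample(1:2, 100, replace = TRUE)     # known in a simulation study
pi_hat <- c(0.5, 0.5)

num <- sweep(d, 2, pi_hat, `*`)                # pi_k * p(y_i | z_i = k)
P   <- num / rowSums(num)                      # n x K posterior probabilities

z_hat    <- max.col(P, ties.method = "random") # hard assignment
accuracy <- mean(z_hat == true_z)
entropy  <- 1 + sum(P * log(pmax(P, 1e-300))) / (nrow(P) * log(ncol(P)))
```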
We set the index of dispersion (σ²/µ) of each growth factor at the one-tenth scale, guided by Bauer and Curran (2003), Kohli (2011), and Kohli et al. (2015). Further, the growth factors were set to
be positively correlated to a moderate degree (ρ = 0.3).
For both parts, the primary aim was to investigate how the separation be-
tween latent classes, the unbalanced class mixing proportion, and the trajectory
shape affected the model performance. Utilizing a model-based clustering al-
gorithm, we are usually interested in examining how well the model can detect
heterogeneity in samples and estimate parameters of interest in each latent class.
Intuitively, the model should perform better under those conditions with a larger
separation between latent classes. We wanted to test this hypothesis. In the sim-
ulation design, we had two metrics to gauge the separation between clusters: the
difference between the knot locations and the Mahalanobis distance (MD) of the
three growth factors of latent classes. We set 1, 1.5 and 2 as a small, medium,
and large difference between the knot locations. We chose 1 as the level of small
difference to follow the rationale in Kohli et al. (2015) and considered the other
two levels to investigate whether the more widely spaced knots improve the
model performance. We considered two levels of MD, 0.86 (i.e., small distance)
and 1.72 (i.e., large distance), for class separation. Note that both the small
and large distances in the current simulation design were smaller than the corresponding levels in Kohli et al. (2015) because we wanted to examine the proposed model under more challenging conditions in terms of cluster separation.
We chose two levels of mixing proportion, 1:1 and 1:2, for the conditions with
two latent classes and three levels of mixing proportion, 1:1:1, 1:1:2 and 1:2:2, for
the scenarios with three clusters. We selected these levels because we wanted to
evaluate how the challenging conditions (i.e., the unbalanced allocation) affect
performance measures and clustering effects. We also examined several common
change patterns shown in Table 2 (Scenario 1, 2 and 3). We changed the knot
locations and one growth factor under each scenario but fixed the other two
growth factors to satisfy the specified MD. We considered θ = 1 or θ = 2 as two
levels of homogeneous residual variances across latent classes to see the effect of
the measurement precision, and we considered two levels of sample size.
All mixture models suffer from the label switching issue: inconsistent assign-
ments of membership for multiple replications in simulation studies. The label
switching does not hurt the model estimation in the frequentist framework since
the likelihood is invariant to permutation of the cluster labels; however, the estimates from the first latent class may be mislabeled as those from other latent classes (Class 2 or Class 3 in our case) (Tueller, Drotar, & Lubke, 2011). In this study, we utilized the column maxima switched label detection algorithm developed by Tueller et al. (2011) to check whether the labels were switched; if switching occurred, the final estimates were relabeled in the correct order before model evaluation.
Table 2. Simulation Design for the Proposed Two-step Growth Mixture Model

Fixed Conditions
  Variance of Intercept:   ψ00^(k) = 25
  Variance of Slopes:      ψ11^(k) = ψ22^(k) = 1
  Correlations of GFs:     ρ = 0.3
  Time (t):                10 scaled and equally spaced t_j (j = 0, ..., J − 1; J = 10)
  Individual t:            t_ij ~ U(t_j − ∆, t_j + ∆) (j = 0, ..., J − 1; ∆ = 0.25)

Manipulated Conditions
  Variables              2 latent classes                     3 latent classes
  Sample Size            n = 500 or 1000                      n = 500 or 1000
  Variance of Knots      ψγγ^(k) = 0.00 or 0.09 (k = 1, 2)    ψγγ^(k) = 0.00 or 0.09 (k = 1, 2, 3)
  Ratio of Proportions   π^(1) : π^(2) = 1:1 or 1:2           π^(1) : π^(2) : π^(3) = 1:1:1, 1:1:2, or 1:2:2
  Residual Variance      θ^(k) = 1 or 2                       θ^(k) = 1 or 2
  Locations of Knots     µγ = (4.00, 5.00), (3.75, 5.25),     µγ = (3.50, 4.50, 5.50) or
                         or (3.50, 5.50)                      (3.00, 4.50, 6.00)
  Mahalanobis Distance   d = 0.86 or 1.72                     d = 0.86

Scenario 1: Different means of initial status and (means of) knot locations
  Means of Slope 1's     µη1^(k) = −5 (k = 1, 2)              µη1^(k) = −5 (k = 1, 2, 3)
  Means of Slope 2's     µη2^(k) = −2.6 (k = 1, 2)            µη2^(k) = −2.6 (k = 1, 2, 3)
  Means of Intercepts    µη0 = (98, 102) (d = 0.86);          µη0 = (96, 100, 104)
                         µη0 = (96, 104) (d = 1.72)

Scenario 2: Different means of slope 1 and (means of) knot locations
  Means of Intercepts    µη0^(k) = 100 (k = 1, 2)             µη0^(k) = 100 (k = 1, 2, 3)
  Means of Slope 2's     µη2^(k) = −2 (k = 1, 2)              µη2^(k) = −2 (k = 1, 2, 3)
  Means of Slope 1's     µη1 = (−4.4, −3.6) (d = 0.86);       µη1 = (−5.2, −4.4, −3.6)
                         µη1 = (−5.2, −3.6) (d = 1.72)

Scenario 3: Different means of slope 2 and (means of) knot locations
  Means of Intercepts    µη0^(k) = 100 (k = 1, 2)             µη0^(k) = 100 (k = 1, 2, 3)
  Means of Slope 1's     µη1^(k) = −5 (k = 1, 2)              µη1^(k) = −5 (k = 1, 2, 3)
  Means of Slope 2's     µη2 = (−2.6, −3.4) (d = 0.86);       µη2 = (−1.8, −2.6, −3.4)
                         µη2 = (−1.8, −3.4) (d = 1.72)
4 Results
4.1 Model Convergence
In this section, we first examine the convergence rate of the two steps for each condition. In our project, convergence is defined as reaching OpenMx status code 0, which indicates a successful optimization, within up to 10 attempts with different collections of starting values (Neale et al., 2016). Based on our simulation studies, the convergence rate of the proposed two-step model reached around 90% for all conditions, and the majority of non-convergence cases occurred in the first step. To elaborate, for the conditions with two latent classes, 96 out of 288 total conditions reported a 100% convergence rate, while for the conditions with three latent classes, 12 out of 144 total conditions reported a 100% convergence rate. Among all conditions with two latent classes, the worst scenario regarding the convergence rate was 121/1121, indicating that we needed to replicate the procedure described in Section 3.3 1,121 times to obtain 1,000 replications with a convergent solution. Across all scenarios with three latent classes, the worst condition was 134/1134.
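This retry rule can be reproduced with OpenMx's mxTryHard(), which re-attempts optimization from perturbed starting values; the toy model below is only a stand-in for the step-1 mixture model.

```r
library(OpenMx)

data(demoOneFactor)   # small example dataset shipped with OpenMx
m <- mxModel("toy", type = "RAM", manifestVars = "x1",
             mxPath(from = "x1", arrows = 2, values = 1, free = TRUE),
             mxPath(from = "one", to = "x1", free = TRUE),
             mxData(demoOneFactor, type = "raw"))

fit <- mxTryHard(m, extraTries = 10)   # up to 10 extra attempts
fit$output$status$code == 0            # TRUE indicates a successful optimization
```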
Table 3. Median (Range) of the Relative Bias over 1,000 Replications of Parameters of Interest under the Conditions with Fixed Knots and 2 Latent Classes

Table 5. Median (Range) of the Relative RMSE over 1,000 Replications of Parameters of Interest under the Conditions with Fixed Knots and 2 Latent Classes
The model performed better under conditions with a larger separation between latent classes and higher measurement precision. Specifically, the coverage probability of all parameters except the knots and the intercept coefficient β0 reached at least 90% across all conditions with a medium or large separation between the knot locations (i.e., 1.5 or 2) and a small residual variance (i.e., θ = 1).
Additionally, when specified correctly, the model with three latent classes, similar to that with two, performed well in terms of the performance measures, though we noticed that the empirical SEs of the parameters in the middle cluster were slightly larger than those in the other two groups.
The separation between the latent classes and the precision of the measurements were the primary determinants of entropy and accuracy.
[Figure 1. Mean accuracy plotted against mean entropy. Legend: residual variance (1 or 2); ratio & separation (1:1 or 1:2 crossed with knot-location differences of 1.0, 1.5, and 2.0).]
Figure 1 depicts the mean accuracy against the mean entropy for each condi-
tion with two latent classes, the small Mahalanobis distance, and change patterns
of Scenario 1 listed in Table 2. In the plot, we colored the conditions with the
smaller and the larger residual variances black and grey, respectively. Squares,
triangles, and circles are for the small, medium, and large differences between
the locations of the knots. Additionally, we set solid and hollow shapes for the
proportions 1:1 and 1:2, respectively. From the figure, we observed that both entropy and accuracy increased as the separation between the two latent classes increased and when the residual variances were smaller. Additionally, unbalanced allocation tended to yield relatively larger accuracy and entropy. We also noticed
that the scenario of change patterns only affected entropy and accuracy slightly,
while other factors such as the knot standard deviation and the sample size did
not have meaningful impacts on entropy and accuracy. We observed the same
patterns between the mean accuracy and the mean entropy of conditions with
three latent classes.
5 Application
In this section, we demonstrate how to fit the proposed model to separate non-
linear trajectories and associate the ‘soft clusters’ to the baseline characteristics
using the motivating data. We extracted a random subsample (n = 500) from the Early Childhood Longitudinal Study, Kindergarten Cohort: 2010-11 (ECLS-K: 2011) with complete records of repeated mathematics IRT scaled scores, demographic information (sex, race, and age in months at each wave), baseline school information (school location and school type), baseline socioeconomic status (family income and the highest education level between parents), baseline teacher-reported social skills (interpersonal skills, self-control ability, internalizing problems, and externalizing problems), baseline teacher-reported approach to learning, and baseline teacher-reported children's behavior (inhibitory control and attentional focus). (The total sample size of ECLS-K: 2011 is n = 18,174; after removing records with missing values, i.e., rows containing any of NaN/-9/-8/-7/-1, n = 1,853 records remain.)
ECLS-K: 2011 is a nationally representative longitudinal sample of US children enrolled in about 900 kindergarten programs beginning with the 2010-2011 school year, in which children's mathematics ability was evaluated in nine waves: fall and spring of kindergarten (2010-2011), of first grade (2011-2012), and of second grade (2012-2013), as well as spring of third (2014), fourth (2015), and fifth (2016) grade. Only about 30% of students were assessed in the fall of 2011 and 2012 (Lê, Norman, Tourangeau, Brick, & Mulligan, 2011). In the analysis, we used children's age (in months) rather than their grade-in-school to obtain a time structure with individual measurement occasions. In the subset data, 52% of students were boys and 48% were girls. Additionally, 50% of students were White, 4.8% were Black, 30.4% were Hispanic, 0.2% were Asian, and 14.6% were of other races. We dichotomized the variable race as White (50%) versus others (50%) for this analysis. At the beginning of the study, 87% and 13% of students were from public and private schools, respectively. The covariates school location (range 1-4), family income (range 1-18), and parents' highest education (range 0-8) were treated as continuous variables, with corresponding means (SDs) of 2.11 (1.12), 11.99 (5.34), and 5.32 (1.97), respectively.
Step 1
In the first step, we fit a latent growth curve model with a linear-linear piecewise functional form and three GMMs with two, three, and four latent classes, and provide the obtained estimated likelihood, information criteria (AIC and BIC), and residual variance of each latent class in Table 7. All four models converged. As introduced earlier, the BIC is a compelling information criterion for the enumeration process as it penalizes model complexity and adjusts for sample size (Nylund et al., 2007). The four fits led to BIC values of 31728.23, 31531.60, 31448.99, and 31478.35, respectively, which led to the selection of the GMM with three latent classes.

Table 7. Summary of Model Fit Information for the Bilinear Spline Growth Models with Different Numbers of Latent Classes
Table 8 presents the estimates of growth factors from which we obtained
the model implied trajectory of each latent group, as shown in Figure 2. The
estimated proportions in Class 1, 2 and 3 were 29.6%, 47.8% and 22.6%, re-
spectively. On average, students in Class 1 had the lowest levels of mathematics
achievement throughout the entire duration (the fixed effects of the baseline and
two slopes were 24.133, 1.718 per month, and 0.841 per month, respectively).
On average, students in Class 2 had a similar initial score and slope for the
first stage but relatively lower slope in the second stage (the fixed effects of the
baseline and two slopes were 24.498, 1.730 per month, and 0.588 per month, re-
spectively) compared to the students in Class 1. Students in Class 3 had the
best mathematics performance on average (the fixed effects of the baseline and
two slopes were 36.053, 2.123 per month, and 0.605 per month, respectively).
For all three classes, post-knot development in mathematics skills slowed substantially, yet the change to the slower growth rate occurred earlier for Classes 1 and 3 (around age 8: 91 and 97 months, respectively) than for Class 2 (around age 9: 110 months). Additionally, for each latent class, the estimates of the
intercept variance and first slope variance were statistically significant, indicat-
ing that each student had a ‘personal’ intercept and pre-knot slope, and then a
‘personal’ trajectory of the development in mathematics achievement.
Step 2
Table 9 summarizes the estimates of the second step of the GMM to associate
'soft clusters' of mathematics achievement trajectories with individual-level covariates. From the table, we noticed that the impacts of some covariates, such as baseline socioeconomic status and teacher-reported skills, may differ depending on whether other covariates are included. For example, higher family income, higher parents' education, higher-rated attentional focus, and higher-rated inhibitory control increased the likelihood of being in Class 2 or Class 3 in univariable analyses, while these four baseline characteristics were only associated with Class 3 in multivariable analyses. It is reasonable that the effect sizes for Class 3 were larger than those for Class 2, given its more evident difference from the reference group, as shown in Table 8 and Figure 2. However, it would still be too hasty to dismiss the finding that students from families with higher socioeconomic status and/or with higher-rated behavior questionnaires were more likely to be in Class 2 at the 0.05 significance level in an exploratory study. Another possible explanation for this phenomenon is multicollinearity.
Table 8. Estimates of the Proposed Mixture Model with 3 Latent Classes (Step 1)
Table 9. Odds Ratio (OR) & 95% Confidence Interval (CI) of Individual-level Predictors of Latent Class in Mathematics Achievement (Reference group: Class 1)
Class 2
Predictor Uni-variable Multi-variable
OR 95% CI OR 95% CI
Sex(0−Boy; 1−Girl) 0.435 (0.254, 0.745)∗ 0.332 (0.174, 0.633)∗
Race(0−White; 1−Other) 0.764 (0.455, 1.281) 1.249 (0.624, 2.498)
School Location 1.407 (1.093, 1.811)∗ 1.357 (0.981, 1.877)
Parents’ Highest Education 1.208 (1.051, 1.388)∗ 1.155 (0.933, 1.431)
Income 1.074 (1.023, 1.128)∗ 1.067 (0.987, 1.154)
School Type (0−Public; 1−Private) 0.573 (0.250, 1.317) 0.442 (0.149, 1.313)
Approach to Learning 1.305 (0.883, 1.929) 0.957 (0.384, 2.389)
Self-control 1.146 (0.764, 1.718) 0.663 (0.272, 1.616)
Interpersonal Skills 1.479 (0.959, 2.282) 1.276 (0.513, 3.175)
External Prob Behavior 0.858 (0.559, 1.319) 1.391 (0.571, 3.386)
Internal Prob Behavior 1.139 (0.658, 1.972) 1.190 (0.589, 2.406)
Attentional Focus 1.251 (1.035, 1.511)∗ 1.139 (0.764, 1.698)
Inhibitory Control 1.238 (1.007, 1.520)∗ 1.557 (0.915, 2.649)
Class 3
Predictor Uni-variable Multi-variable
OR 95% CI OR 95% CI
Sex(0−Boy; 1−Girl) 0.379 (0.205, 0.700)∗ 0.212 (0.098, 0.459)∗
Race(0−White; 1−Other) 0.397 (0.219, 0.721)∗ 0.943 (0.429, 2.073)
School Location 1.266 (0.957, 1.676) 1.211 (0.835, 1.755)
Parents’ Highest Education 1.713 (1.418, 2.068)∗ 1.345 (1.043, 1.734)∗
Income 1.241 (1.155, 1.334)∗ 1.195 (1.083, 1.318)∗
School Type (0−Public; 1−Private) 1.437 (0.661, 3.124) 0.665 (0.234, 1.892)
Approach to Learning 2.624 (1.590, 4.332)∗ 5.363 (1.731, 16.612)∗
Self-control 1.436 (0.903, 2.284) 0.414 (0.136, 1.265)
Interpersonal Skills 1.740 (1.057, 2.862)∗ 0.771 (0.269, 2.209)
External Prob Behavior 0.761 (0.451, 1.283) 1.565 (0.561, 4.367)
Internal Prob Behavior 0.787 (0.405, 1.532) 1.170 (0.488, 2.808)
Attentional Focus 1.601 (1.253, 2.045)∗ 1.095 (0.671, 1.787)
Inhibitory Control 1.439 (1.116, 1.855)∗ 1.324 (0.720, 2.434)
Note. ∗ indicates that the 95% confidence interval excludes 1.
Figure 2. Three Latent Classes: Model-Implied Trajectories and Smooth Lines of Observed Mathematics IRT Scores
[Correlation matrix of the baseline characteristics: sex, race, school location, family income, school type, approach to learning, self-control, interpersonal skills, externalizing problems, internalizing problems, attentional focus, inhibitory control, and parents' highest education. The strongest correlation is between family income and parents' highest education (r = 0.66).]
Factor Loadings
Baseline Characteristics Factor 1 Factor 2
Parents’ Highest Education 0.10 0.76
Family Income 0.03 0.86
Approach to Learning 0.90 0.04
Self-control 0.77 0.08
Interpersonal Skills 0.76 0.05
External Prob Behavior −0.72 0.00
Internal Prob Behavior −0.24 −0.07
Attentional Focus 0.83 0.07
Inhibitory Control 0.89 0.01
Explained Variance
Factor 1 Factor 2
SS Loadings 4.04 1.34
Proportion Variance 0.45 0.15
Cumulative Variance 0.45 0.60
Students with higher values of the first and/or the second factor scores were more likely to be in Class 2 or Class 3 (see the note following Table 11 for the corresponding odds ratios). This suggests that both the socioeconomic variables and the teacher-rated abilities were positively associated with mathematics performance, while externalizing/internalizing problems were negatively associated with mathematics achievement.
Table 11. Odds Ratio (OR) & 95% Confidence Interval (CI) of Factor Scores, Demographic Information, and School Information of Latent Class in Mathematics Achievement (Reference group: Class 1)

Note. For Class 2, the ORs (95% CIs) for sex, factor score 1, and factor score 2 were 0.345 (0.183, 0.651), 1.454 (1.090, 1.939), and 1.656 (1.226, 2.235), respectively. For Class 3, the corresponding ORs (95% CIs) were 0.234 (0.111, 0.494), 2.006 (1.408, 2.858), and 3.410 (2.258, 5.148).
6 Discussion
This article extends the Bakk and Kuha (2017) study to conduct a stepwise analysis to investigate heterogeneity in nonlinear trajectories. We fit a growth mixture model with a bilinear spline functional form to describe the underlying change patterns of nonlinear trajectories in the first step. In the second step, we investigated the associations between the 'soft' clusters and baseline characteristics. Although this stepwise method follows the recommended approach to fitting an FMM (i.e., separating the estimation of the class-specific parameters from that of the logistic coefficients), it is not our aim to show that this stepwise approach is universally preferred. Based on our understanding, this approach is most suitable for exploratory studies in which empirical researchers have only vague assumptions about sample heterogeneity and its possible causes.
On the one hand, the two-step model can save computational budget, as we only need to refit the second-step model, rather than the whole model, when adding or removing covariates. On the other hand, our simulation study showed that the proposed model works well in terms of performance measures and accuracy, especially under favorable conditions such as well-separated latent classes and precise measurements. This stepwise approach can also be utilized with other types of FMMs in the SEM framework to explore sample heterogeneity.
This article also proposes employing the EFA to reduce the dimension of the covariate space and address the multicollinearity issue. Our use of the EFA resembles the process termed 'feature engineering' in the machine learning (ML) literature, where researchers conventionally employ PCA to reduce the covariate space and address multicollinearity, since the interpretation of covariate coefficients is not of primary interest in that literature. In this article, we decided to use the EFA rather than PCA for two reasons. First, empirical researchers using the SEM framework are more familiar with the EFA, as the idea behind it is very similar to another model in the SEM framework, confirmatory factor analysis (CFA). More importantly, the factors (i.e., latent variables) obtained from the EFA are interpretable, so the estimated coefficients from the second step are interpretable, and we can then gain valuable insights from an exploratory study. For example, in the application, we concluded that a student with a higher value of the difference between teacher-rated abilities and teacher-reported problems, and/or from a family with higher socioeconomic status, was more likely to achieve higher mathematics scores (i.e., to be in Class 2 or Class 3).
Although it is not our aim to comprehensively investigate the EFA, we want to add two notes for empirical researchers, on factor retention criteria and factor rotation. Following Fabrigar, Wegener, MacCallum, and Strahan (1999), we used multiple criteria in the application, including the eigenvalue-greater-than-one (EVG1) rule, the scree test, and parallel analysis, to decide the number of factors; fortunately, all of these criteria led to the same decision. Patil, Singh, Mishra, and Todd Donavan (2008) also suggested conducting a subsequent CFA to evaluate the measurement properties of the factors identified by the EFA (for instance, when the retention criteria disagree about the number of factors).
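As a sketch of this retention-and-rotation workflow in R (the data frame covs stands in for the nine baseline covariates; all names and the simulated data are ours):

```r
library(psych)

set.seed(1)
covs <- as.data.frame(matrix(rnorm(500 * 9), ncol = 9))  # stand-in data

fa.parallel(covs, fa = "fa")          # parallel analysis with a scree plot

efa <- factanal(covs, factors = 2, rotation = "varimax",
                scores = "regression")
print(efa$loadings, cutoff = 0.3)     # loadings, as in the table above
scores <- efa$scores                  # factor scores for the step-2 model
```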
Additionally, several analytic rotation techniques have been developed for the EFA, with the most fundamental distinction lying between orthogonal and oblique rotation. Orthogonal rotations constrain the factors to be uncorrelated, and the varimax procedure, which we used in the application, is generally regarded as the best and most widely used orthogonal rotation in psychological research. One reason for this choice was its simplicity and conceptual clarity. More importantly, we assumed that the constructs identified from the covariate set (i.e., the factor of the socioeconomic variables and that of the teacher-rated scores) are independent. However, many theoretical and empirical researchers have provided a basis for expecting psychological constructs, such as personality traits, abilities, and attitudes, to be associated with each other. In such cases, oblique rotations provide a more realistic and accurate picture of these factors.
One limitation of the proposed two-step model is that it only allows (generalized) linear models in the second step. If the linearity assumption is invalid, we need to resort to other methods, such as structural equation model trees (SEM trees; Brandmaier et al., 2013) or structural equation model forests (Brandmaier, Prindle, McArdle, & Lindenberger, 2016), to identify the most important covariates, by investigating the variables on which the tree splits first (Brandmaier et al., 2013; Jacobucci et al., 2017) or the output named 'variable importance' (Brandmaier et al., 2016), respectively. Note that Jacobucci et al.
(2017) pointed out that the interpretations of the FMM and SEM trees are dif-
ferent, and the classes obtained from the SEM tree can be viewed as the clusters
of associations between the covariates and trajectories.
One possible future direction for the current study is to build its confirmatory counterpart. Conceptually, the confirmatory model consists of two measurement models, with a unidirectional relationship between the factors of the EFA and the latent categorical variable. Driven by domain knowledge, the EFA can also be replaced with a CFA in the confirmatory model. Additionally, the two-step model is proposed under the assumption that the covariates only indirectly impact the sample heterogeneity. It is also possible to develop a model that allows these baseline covariates to explain between-group and within-group differences simultaneously by relaxing that assumption.
References
Bolck, A., Croon, M., & Hagenaars, J. (2004). Estimating latent structure
models with categorical variables: One-step versus three-step estimators.
Political Analysis, 12(1), 3-27. Retrieved from https://www.jstor.org/stable/25791751
Bouveyron, C., Celeux, G., Murphy, T., & Raftery, A. (2019). Model-based
clustering and classification for data science: With applications in R (Cambridge
Series in Statistical and Probabilistic Mathematics). Cambridge:
Cambridge University Press. Retrieved from https://doi.org/10.1017/9781108644181
Brandmaier, A. M., Prindle, J. J., McArdle, J. J., & Lindenberger, U. (2016).
Theory-guided exploration with structural equation model forests. Psy-
chological Methods, 21(4), 566-582. Retrieved from https://doi.org/10.1037/met0000090
Brandmaier, A. M., von Oertzen, T., McArdle, J. J., & Lindenberger, U. (2013).
Structural equation model trees. Psychological Methods, 18(1), 71-86. Re-
trieved from https://doi.org/10.1037/a0030001
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate
Behavioral Research, 1 (2), 245-276.
Cattell, R. B., & Jaspers, J. (1967). A general plasmode (no. 30-10-5-2) for
factor analytic exercises and research. Multivariate Behavioral Research
Monographs, 67 (3), 211.
Clogg, C. C. (1981). New developments in latent structure analysis. In D. J. Jack-
son & E. F. Borgotta (Eds.), Factor analysis and measurement in sociolog-
ical research: A multi-dimensional perspective (p. 215-246). Beverly Hills,
CA: SAGE Publications.
Cook, N. R., & Ware, J. H. (1983). Design and analysis methods for longitudinal
research. Annual Review of Public Health, 4 (1), 1-23.
Coulombe, P., Selig, J. P., & Delaney, H. D. (2015). Ignoring individual
differences in times of assessment in growth curve modeling. Interna-
tional Journal of Behavioral Development, 40 (1), 76-86. Retrieved from
https://doi.org/10.1177/0165025415577684
Dayton, C. M., & Macready, G. B. (1988). Concomitant-variable latent-class
models. Journal of the American Statistical Association, 83 (401), 173-178.
Retrieved from https://doi.org/10.2307/2288938
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999).
Evaluating the use of exploratory factor analysis in psychological research.
Psychological Methods, 4 (3), 272–299. Retrieved from https://doi.org/
10.1037/1082-989X.4.3.272
Finkel, D., Reynolds, C. A., McArdle, J. J., Gatz, M., & Pedersen, N. L. (2003).
Latent growth curve analyses of accelerating decline in cognitive abil-
ities in late adulthood. Developmental Psychology, 39(3), 535-550. Retrieved
from https://doi.org/10.1037/0012-1649.39.3.535
Goodman, L. A. (1974). The analysis of systems of qualitative variables when
some of the variables are unobservable. Part I-A modified latent structure
approach. American Journal of Sociology, 79(5), 1179-1259. Retrieved
from http://www.jstor.org/stable/2776792
Grimm, K. J., Ram, N., & Estabrook, R. (2016). Growth modeling. Guilford
Press.
Haberman, S. (1979). Analysis of qualitative data. Vol. 2: New developments.
New York: Academic Press.
Hagenaars, J. A. (1993). Loglinear models with latent variables. Newbury Park,
CA: Sage.
Harring, J. R., Cudeck, R., & du Toit, S. H. C. (2006). Fitting partially
nonlinear random coefficient models as SEMs. Multivariate Behavioral
Research, 41(4), 579-596. Retrieved from https://doi.org/10.1207/s15327906mbr4104_7
Horn, J. L. (1965). A rationale and technique for estimating the number of
factors in factor analysis. Psychometrika, 30 , 179-185.
Humphreys, L. G., & Ilgen, D. R. (1969). Note on a criterion for the number
of common factors. Educational and Psychological Measurement, 29 (3),
571–578.
Humphreys, L. G., & Montanelli, R. G. (1975). An investigation of the parallel
analysis criterion for determining the number of common factors. Multi-
variate Behavioral Research, 10 (2), 193–205.
Hunter, M. D. (2018). State space modeling in an open source, modular,
structural equation modeling environment. Structural Equation Modeling:
A Multidisciplinary Journal, 25(2), 307-324. Retrieved from https://doi.org/10.1080/10705511.2017.1369354
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural
equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal , 23 (4), 555-566. Retrieved from https://doi.org/10.1080/
10705511.2016.1154793
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2017). A comparison of
methods for uncovering sample heterogeneity: Structural equation model
trees and finite mixture models. Structural Equation Modeling: A Multi-
disciplinary Journal , 24 (2), 270-282. Retrieved from https://doi.org/
10.1080/10705511.2016.1250637
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor
analysis. Psychometrika, 23 , 187–200. Retrieved from https://doi.org/
10.1007/BF02289233
Kamakura, W. A., Wedel, M., & Agrawal, J. (1994). Concomitant variable latent
class models for conjoint analysis. International Journal of Research in
Marketing, 11 (5), 451-464. Retrieved from https://doi.org/10.1016/
0167-8116(94)00004-2
Kohli, N. (2011). Estimating unknown knots in piecewise linear-linear latent
growth mixture models (Doctoral dissertation, University of Maryland).
Retrieved from http://hdl.handle.net/1903/11973
Kohli, N., & Harring, J. R. (2013). Modeling growth in latent variables using
a piecewise function. Multivariate Behavioral Research, 48 (3), 370-397.
Retrieved from https://doi.org/10.1080/00273171.2013.778191
Kohli, N., Harring, J. R., & Hancock, G. R. (2013). Piecewise linear-linear latent
growth mixture models with unknown knots. Educational and Psycho-
logical Measurement, 73 (6), 935-955. Retrieved from https://doi.org/
10.1177/0013164413496812
Kohli, N., Hughes, J., Wang, C., Zopluoglu, C., & Davison, M. L. (2015). Fit-
ting a linear-linear piecewise growth mixture model with unknown knots:
A comparison of two common approaches to inference. Psychological
Methods, 20 (2), 259-275. Retrieved from https://doi.org/10.1037/
met0000034
Lê, T., Norman, G., Tourangeau, K., Brick, J. M., & Mulligan, G. (2011). Early
childhood longitudinal study: Kindergarten class of 2010-2011 - sample
design issues. JSM Proceedings, 1629-1639. Retrieved from http://www.asasrms.org/Proceedings/y2011/Files/301090_66141.pdf
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation, 2nd edition.
Springer-Verlag New York, Inc.
Liu, J. (2019). Estimating knots in bilinear spline growth models with time-
invariant covariates in the framework of individual measurement occasions
(Doctoral dissertation, Virginia Commonwealth University). Retrieved
from https://doi.org/10.25772/9WDR-9R85
Liu, J., Perera, R. A., Kang, L., Kirkpatrick, R. M., & Sabo, R. T. (2019). Ob-
taining interpretable parameters from reparameterizing longitudinal mod-
els: transformation matrices between growth factors in two parameter-
spaces.
Lubke, G. H., & Muthén, B. O. (2005). Investigating population heterogeneity
with factor mixture models. Psychological Methods, 10 (1), 21–39. Re-
trieved from https://doi.org/10.1037/1082-989X.10.1.21
Lubke, G. H., & Muthén, B. O. (2007). Performance of factor mixture models
as a function of model size, covariate effects, and class-specific parameters.
Structural Equation Modeling: A Multidisciplinary Journal , 14 (1), 26-47.
Retrieved from https://doi.org/10.1080/10705510709336735
Marcoulides, G. A., & Drezner, Z. (2003). Model specification searches using
ant colony optimization algorithms. Structural Equation Modeling: A Mul-
tidisciplinary Journal, 10(1), 154-164. Retrieved from https://doi.org/10.1207/S15328007SEM1001_8
Marcoulides, G. A., Drezner, Z., & Schumacker, R. E. (1998). Model specification
searches in structural equation modeling using tabu search. Structural
Equation Modeling: A Multidisciplinary Journal , 5 (4), 365-376. Retrieved
from https://doi.org/10.1080/10705519809540112
McLachlan, G., & Peel, D. (2000). Finite mixture models. John Wiley & Sons,
Inc.
Mehta, P. D., & Neale, M. C. (2005). People are variables too: Multilevel
structural equations modeling. Psychological Methods, 10(3), 259-284. Re-
trieved from https://doi.org/10.1037/1082-989X.10.3.259
Mehta, P. D., & West, S. G. (2000). Putting the individual back into individual
growth curves. Psychological Methods, 5 (1), 23-43.
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies
to evaluate statistical methods. Statistics in Medicine, 38 (11), 2074-2102.
Retrieved from https://doi.org/10.1002/sim.8086
Muthén, B. O., & Shedden, K. (1999). Finite mixture modeling with mixture
outcomes using the EM algorithm. Biometrics, 55 (2), 463-469. Retrieved
from https://doi.org/10.1111/j.0006-341x.1999.00463.x
Neale, M. C., Hunter, M. D., Pritikin, J. N., Zahery, M., Brick, T. R., Kirk-
patrick, R. M., . . . Boker, S. M. (2016). OpenMx 2.0: Extended structural
equation and statistical modeling. Psychometrika, 81 (2), 535-549. Re-
trieved from https://doi.org/10.1007/s11336-014-9435-8
Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the
number of classes in latent class analysis and growth mixture modeling:
A Monte Carlo simulation study. Structural Equation Modeling: A Multi-
disciplinary Journal , 14 (4), 535-569. Retrieved from https://doi.org/
10.1080/10705510701575396
Patil, V. H., Singh, S. N., Mishra, S., & Todd Donavan, D. (2008). Efficient
theory development and factor retention criteria: Abandon the ‘eigenvalue
greater than one’ criterion. Journal of Business Research, 61 (2), 162-170.
Retrieved from https://doi.org/10.1016/j.jbusres.2007.05.008
Preacher, K. J., & Hancock, G. R. (2015). Meaningful aspects of change as novel
random coefficients: A general method for reparameterizing longitudinal
models. Psychological Methods, 20 (1), 84-101. Retrieved from https://
doi.org/10.1037/met0000028
Pritikin, J. N., Hunter, M. D., & Boker, S. M. (2015). Modular open-source
software for Item Factor Analysis. Educational and Psychological Mea-
surement, 75 (3), 458-474. Retrieved from https://doi.org/10.1177/
0013164414554615
R Core Team. (2020). R: A language and environment for statistical computing
[Computer software manual]. Vienna, Austria.
Scharf, F., & Nestler, S. (2019). Should regularization replace simple structure
rotation in exploratory factor analysis? Structural Equation Modeling: A
Multidisciplinary Journal , 26 (4), 576-590. Retrieved from https://doi
.org/10.1080/10705511.2018.1558060
Seber, G. A. F., & Wild, C. J. (2003). Nonlinear regression. John Wiley & Sons,
Inc.
Stegmann, G., & Grimm, K. J. (2018). A new perspective on the effects of
covariates in mixture models. Structural Equation Modeling: A Multi-
disciplinary Journal , 25 (2), 167-178. Retrieved from https://doi.org/
10.1080/10705511.2017.1318070
Sterba, S. K. (2014). Fitting nonlinear latent growth curve models with in-
dividually varying time points. Structural Equation Modeling: A Multi-
disciplinary Journal , 21 (4), 630-647. Retrieved from https://doi.org/
10.1080/10705511.2014.919828
Sun, J., Chen, Y., Liu, J., Ying, Z., & Xin, T. (2016). Latent variable selection
for multidimensional item response theory models via L1 regularization.
In the original setting of the bilinear spline model, we have three growth factors:
an intercept at t0 (η0) and one slope for each stage (η1 and η2, respectively). To
estimate knots, we may reparameterize the growth factors. For the ith individual,
following Seber and Wild (2003), we may re-express them
as the measurement at the knot (i.e., η0i + η1i γ[k]), the mean of the two slopes (i.e.,
(η1i + η2i)/2), and the half difference between the two slopes (i.e., (η2i − η1i)/2).
Tishler and Zang (1981) and Seber and Wild (2003) showed that a re-
gression model with two linear stages can be written as either the minimum or
the maximum response value of two trajectories. Liu et al. (2019) extended such
expressions to the latent growth curve modeling framework and showed two
forms of the bilinear spline for the ith individual in Figure A.1. In the left panel
(η1i > η2i), the measurement yij is always the minimum value of the two lines; that
is, yij = min(η0i + η1i tij, η02i + η2i tij), where η02i is the intercept of the second
linear piece. Unifying the formula of measurements in both panels yields

$$y_{ij} = \eta'_{0i} + \eta'_{1i}\left(t_{ij} - \gamma^{[k]}\right) + \eta'_{2i}\left|t_{ij} - \gamma^{[k]}\right|, \qquad (A.1)$$

where η′0i, η′1i and η′2i are the measurement at the knot, the mean of the two slopes,
and the half difference between the two slopes, respectively. Similarly, the measurement yij of the
bilinear spline in the right panel, in which the measurement yij is always the
maximum value of the two lines, has the identical final form in Equation A.1.
By the multivariate delta method,

$$\eta'^{[k]}_i = f(\eta_i) \sim N\left(f(\mu^{[k]}_{\eta}),\ \nabla_f(\mu^{[k]}_{\eta})\,\Psi^{[k]}_{\eta}\,\nabla^T_f(\mu^{[k]}_{\eta})\right), \qquad (A.2)$$

where µη[k] and Ψη[k] are the mean vector and variance-covariance matrix of the
original class-specific growth factors, respectively, and f is defined as

$$f(\eta_i) = \left(\eta_{0i} + \gamma^{[k]}\eta_{1i},\ \ \frac{\eta_{1i}+\eta_{2i}}{2},\ \ \frac{\eta_{2i}-\eta_{1i}}{2}\right)^T.$$
Similarly, suppose h : R³ → R³ is a function which takes a point η′i ∈ R³
as input and produces the vector h(η′i) ∈ R³ (i.e., ηi ∈ R³) as output. By the
multivariate delta method,

$$\eta^{[k]}_i = h(\eta'_i) \sim N\left(h(\mu^{[k]}_{\eta'}),\ \nabla_h(\mu^{[k]}_{\eta'})\,\Psi^{[k]}_{\eta'}\,\nabla^T_h(\mu^{[k]}_{\eta'})\right), \qquad (A.3)$$

where µη′[k] and Ψη′[k] are the mean vector and variance-covariance matrix of the
class-specific reparameterized growth factors, respectively, and h is defined as

$$h(\eta'_i) = \left(\eta'_{0i} - \gamma^{[k]}\eta'_{1i} + \gamma^{[k]}\eta'_{2i},\ \ \eta'_{1i} - \eta'_{2i},\ \ \eta'_{1i} + \eta'_{2i}\right)^T.$$
Based on Equations (A.2) and (A.3), we can transform the growth factor
means between the two parameter-spaces by µ′η[k] = f(µη[k]) and µη[k] = h(µ′η[k]),
respectively. We can also define the transformation matrices ∇f(µη[k]) and
∇h(µη′[k]) between the variance-covariance matrices of the two parameter-spaces as

$$\Psi'^{[k]}_{\eta} = \nabla_f(\mu^{[k]}_{\eta})\,\Psi^{[k]}_{\eta}\,\nabla^T_f(\mu^{[k]}_{\eta})
= \begin{pmatrix} 1 & \gamma^{[k]} & 0 \\ 0 & 0.5 & 0.5 \\ 0 & -0.5 & 0.5 \end{pmatrix}
\Psi^{[k]}_{\eta}
\begin{pmatrix} 1 & \gamma^{[k]} & 0 \\ 0 & 0.5 & 0.5 \\ 0 & -0.5 & 0.5 \end{pmatrix}^T$$

and

$$\Psi^{[k]}_{\eta} = \nabla_h(\mu^{[k]}_{\eta'})\,\Psi'^{[k]}_{\eta}\,\nabla^T_h(\mu^{[k]}_{\eta'})
= \begin{pmatrix} 1 & -\gamma^{[k]} & \gamma^{[k]} \\ 0 & 1 & -1 \\ 0 & 1 & 1 \end{pmatrix}
\Psi'^{[k]}_{\eta}
\begin{pmatrix} 1 & -\gamma^{[k]} & \gamma^{[k]} \\ 0 & 1 & -1 \\ 0 & 1 & 1 \end{pmatrix}^T,$$

respectively.
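As a quick numerical check of these transformations, the following Python sketch (with illustrative values for the knot γ and the class-specific covariance matrix) verifies that ∇h is the inverse of ∇f and that applying the two transformations in turn recovers the original variance-covariance matrix.

import numpy as np

gamma = 1.5                                    # illustrative knot location
grad_f = np.array([[1.0,  gamma, 0.0],
                   [0.0,  0.5,   0.5],
                   [0.0, -0.5,   0.5]])
grad_h = np.array([[1.0, -gamma, gamma],
                   [0.0,  1.0,  -1.0],
                   [0.0,  1.0,   1.0]])

Psi = np.array([[1.0, 0.3, 0.2],               # illustrative covariance of the
                [0.3, 0.5, 0.1],               # original growth factors
                [0.2, 0.1, 0.5]])

Psi_prime = grad_f @ Psi @ grad_f.T            # covariance in the reparameterized space

assert np.allclose(grad_h @ grad_f, np.eye(3))           # h undoes f
assert np.allclose(grad_h @ Psi_prime @ grad_h.T, Psi)   # round trip recovers Psi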
B. More Results
Table B.1. Median (Range) of the Relative Bias over 1,000 Replications of Parameters
of Interest under the Conditions with Random Knots of the Standard Deviation of 0.3
and 2 Latent Classes

Table B.2. Median (Range) of the Empirical SE over 1,000 Replications of Parameters
of Interest under the Conditions with Random Knots of the Standard Deviation of 0.3
and 2 Latent Classes
Shuai Zhou¹, Yanling Li¹, Guangqing Chi¹, Junjun Yin¹, Zita Oravecz¹, Yosef
Bodovski¹, Naomi P. Friedman², Scott I. Vrieze³, and Sy-Miin Chow¹

¹ The Pennsylvania State University, University Park, PA 16801, USA
[email protected]
² University of Colorado Boulder, Boulder, CO
³ University of Minnesota, Minneapolis, MN
1 Introduction
Spatial analysis is used to explain locations, attributes, and relationships of fea-
tures in spatial data and has increasingly become a subject of interest in many
such as activity space and shared space, and boost the nearest distance query
for big data. GPS2space builds upon existing functions and includes all the nec-
essary, tunable parameters as arguments for generating spatial measures in a
straightforward and well-documented package that can be readily implemented
by newer users. We used the terms library, package, and toolbox interchange-
ably throughout the article, as these terms all refer to reusable chunks of code
but are used differently in different conventions. Likewise, we used the terms
methods and functions interchangeably, in that they both refer to snippets of a
library/package/toolbox that are used for specific purposes.
The remainder of the article proceeds as follows. First, we briefly introduce
commonly used Python libraries for managing and analyzing GPS data and
highlight the contributions of GPS2space. Then, we illustrate the utility of the
GPS2space library using the CoTwins data to extract the twin siblings' activity
space and shared space. These measures are used to address questions related to
seasonal, age-based, gender, and zygosity effects in shaping individuals’ activity
space and shared space. Finally, we conclude with discussions on other potential
usages, caveats, and future developments of GPS2space.
Like many data analysis procedures, geospatial analyses involve data reading
and writing, data managing and processing, and visualization. Beyond that,
geospatial analyses also deal with spatial projection and operation, Exploratory
Spatial Data Analysis (ESDA), and spatial modeling. There are existing Python
libraries that focus on certain specific functions useful for geospatial analysis –
a brief overview is provided next.
Geospatial Data Abstraction Library (GDAL/OGR contributors, 2020) spe-
cializes in reading and writing raster and vector data, which are the two com-
monly used data types in GIS. It supports 168 raster data formats and 99 vector
data formats at the time of writing (October 2020). Fiona (Gillies et al., 2011)
and Rasterio (Gillies et al., 2013), two other popular libraries in Python, focus on
reading, writing, and manipulating vector and raster data, respectively. Pyproj
exclusively focuses on cartographic projections and coordinate transformations
(Crickard, Toms, & Rees, 2018). Shapely specializes in spatial operations such
as distance query and intersecting and overlapping analyses (Gillies et al., 2007).
Python Spatial Analysis Library (PySAL) is the most commonly used library
in conducting ESDA and spatial modeling (Rey, 2019; Rey & Anselin, 2007).
GeoPandas, on the other hand, combines Pandas, a widely used Python data
analysis library, and GIS science, providing a wide array of geospatial functions
such as spatial operation, spatial projection transformation, and visualization
(Jordahl, 2014). These packages are often used together to conduct a series of
data managing, manipulation, visualization, and modeling tasks. For example,
GeoPandas relies on Fiona to read and write spatial data and PyProj to perform
spatial projection transformations. Rasterio also uses PyProj for its projection
functionalities.
The packages reviewed thus far do have limitations, especially for novices
who do not have strong background in programming and GIS. For example,
Shapely does not provide options for coordinate system transformations, so the
original units of distance and area measures are usually degrees, which may not
be intuitive for non-specialist audiences. GeoPandas incorporates many useful
geoprocessing methods and spatial analysis techniques and provides foundational
functions for such spatial operations; however, it assumes users have GIS and
programming background to perform the analyses. For example, to calculate the
area of a polygon from GPS data with latitude and longitude coordinate pairs
using GeoPandas, a researcher has to first build a spatial data set, project it to
an appropriate coordinate reference system (CRS), and then calculate the area.
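For concreteness, the following sketch shows the multi-step process just described using GeoPandas alone; the coordinates, buffer distance, and CRS choice are illustrative.

import pandas as pd
import geopandas as gpd

df = pd.DataFrame({'lat': [39.74, 39.75, 39.76],
                   'lon': [-104.99, -104.98, -105.00]})  # illustrative points

# Step 1: build a spatial data set from raw latitude/longitude pairs (WGS 84).
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat),
                       crs='EPSG:4326')

# Step 2: project to an appropriate CRS so measures are in meters, not degrees.
gdf = gdf.to_crs(epsg=2163)

# Step 3: only now can an area be computed, e.g., of a 1000-m buffer polygon.
area_sq_m = gdf.geometry.buffer(1000).unary_union.area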
Even though we did not provide an exhaustive list of all the Python packages
that can perform geospatial manipulation and analysis, we highlighted that al-
most all of these packages are tailored for experts with considerable spatial data
handling and GIS experience, and require function customizations in multiple
steps. For novices, such multi-step data pre-processing and function customiza-
tion processes can be challenging and error-prone. In addition, none of the above
packages provides immediately available functions for constructing activity space
and shared space.
In this article, we introduced GPS2space with the aim to facilitate and au-
tomate, whenever possible, the processes of spatial data building, activity and
shared space measure extraction, and distance query. Specifically, GPS2space
has three functionalities: (1) building unprojected spatial data from geoloca-
tions with latitude and longitude coordinate pairs using the geodf function; (2)
constructing buffer- and convex hull-based activity space and shared space at
different timescales using the space function; and (3) performing nearest distance
query using the dist function, which incorporates cKDTree¹ and spatial indexing
with R-tree² algorithms to decrease execution time. GPS2space provides
an easily replicable and open-source solution to building spatial data directly
from latitude and longitude coordinate pairs. It also provides default parameter-
izations suited for many longitudinal spatial data streams that can be used to
simplify and reduce the specification steps needed for extraction of activity- and
shared-space-related and distance measures included in the package. GPS2space
enables transparent and easily replicable ways to change these default options for
experienced GIS scientists and programmers to perform custom specifications.
¹ cKDTree is a function from SciPy, a commonly used library for scientific computing
in Python. cKDTree is used to rapidly look up the nearest neighbors of any point
and can dramatically reduce the time needed for such processes.
² GeoPandas incorporates spatial indexing using the R-tree algorithm to boost the
performance of spatial queries. R-tree is a tree-like data structure that groups nearby
objects together along with their minimum bounding box. In this tree-like data
structure, spatial queries such as finding the nearest neighbor do not have to traverse
all geometries, dramatically increasing performance, especially for two data
sets with different bounding boxes.
We used data from the CoTwins study to illustrate the utility of GPS2space
and demonstrate how spatial activity measures can shed light on individual and
dyadic activity patterns between twin siblings. Twin studies have the advantage
of disentangling genetic and environmental factors for the trait of interest (New-
man, Freeman, & Holzinger, 1937). Despite the increasing application of spatial
thinking and spatial data in social and behavioral research, few twin studies have
been designed to collect twins’ location data, which often convey valuable infor-
mation concerning social contexts. For instance, shared activity space and time
spent with each other reflect opportunities for relationship bonding, and may
thus convey the extent of emotional closeness between two individuals (Ben-Ari
& Lavee, 2007). Furthermore, with twins’ location data, it would be interest-
ing to investigate how monozygotic (MZ; identical) twins and dizygotic (DZ;
fraternal) twins differ in their shared activity space.
The CoTwins study comprises data on substance use among 670 twins. Twins
were initially recruited at ages 14 to 17 and followed from 2015 to 2018. Through-
out 2016 to 2018, the twins’ geolocations were recorded and reported via their
GPS-enabled smartphones. iOS devices used the built-in significant-change lo-
cation service to record and report geolocations whenever they detected a sig-
nificant position change of 500 meters or more. Android devices recorded and
reported geolocations every five minutes as long as the device was in use. Over
the course of the study, the twins’ spatial footprints covered locations within
and outside of the United States. In this article, we only used locations in the
contiguous United States, which includes the District of Columbia but excludes
Alaska and Hawaii.
Figure 1 shows the spatial distribution of the twins’ footprints in 2016, 2017,
and 2018 across Colorado and the contiguous United States. The CoTwins study
began collecting locations in June 2016 so the figure shows fewer data points
in 2016. Throughout 2017 and 2018, the twins set foot in almost every state
of the contiguous United States and showed a consistent pattern of footprints
concentrated in Colorado and scattered across the rest of the contiguous US, with North
Dakota, Arkansas, and Alabama as the least visited states. In Colorado in 2017
and 2018 they showed consistent mobility patterns with geolocations clustered
around metropolitan areas such as Denver and Colorado Springs and along major
roads within the state. The border counties in Colorado such as Moffat, Rio
Blanco, Yuma, Cheyenne, Kiowa, and Baca were rarely visited. The code for
Figure 1 can be found in Supplementary Material.
Many related works have demonstrated the spatial aspects of activity space
and shared space and their impact on human behaviors such as substance use
(Mason et al., 2010) and social support in a specific setting such as working space
(Gerdenitsch et al., 2016); however, the temporal variations of such spatial mea-
sures and interindividual differences therein have not been thoroughly explored.
Hence, we employed passive sensor (GPS) data to investigate whether meaning-
ful seasonal, time- (e.g., weekend), and age-based variations, as well as between-
individual differences in these intra-individual changes, could be meaningfully
inferred from individuals’ spatial measures as extracted using GPS2space. In
particular, we examined (1) whether there were seasonal effects in twins’ activ-
ity space/shared space; (2) whether there were weekend effects in twins’ activity
space/shared space; (3) inter-individual differences in initial levels of activity
space/shared space, and possible associations with gender, baseline age, and twin
type (MZ vs. DZ twins); and (4) age-related changes in activity space/shared
space, and possible roles of gender as correlates of interindividual differences in
these age-based changes.
As previously defined, activity space refers to the area of individuals’ routine lo-
cations over a specific time period. Practically, ellipses, convex hulls, and density
kernels are often used to construct the activity space (Huang & Wong, 2016).
The GPS2space library currently includes two commonly used methods for con-
structing activity space: the buffer method and the convex hull method. The
buffer method uses a user-specified buffer distance as the radius in determining
activity space, while the convex hull method lines up the outermost points to
a minimum bounding geometry (J. H. Lee, Davis, Yoon, & Goulias, 2016) to
represent activity space. Both buffer- and convex hull-based activity space ap-
proaches are associated with their own pros and cons. For buffer-based activity
space, users have to specify a buffer distance to group and dissolve points into
polygons to enable extraction of activity space. The choice of buffer distance can
be arbitrary and application-specific, and it affects the sizes of activity space and
shared space. However, this approach provides interpretable mobility estimates
even with only one data point. In this case, activity space for that one data
point is simply the area of the circle whose radius is the buffer distance. Impor-
tantly, it is less sensitive to extreme geolocations that are beyond the clusters
of geolocation. Convex hull-based activity space does not require any arbitrary
parameter. However, convex hull-based activity space computations require at
least three non-collinear points to form an enclosed convex hull. In addition,
convex hull-based activity space is sensitive to extreme geolocations, giving ex-
treme activity space values in the presence of outliers. For example, instances
where individuals travel via cars or flights from one main location to another
would be outliers. The convex hull method would yield extreme activity space
values in trying to construct a convex hull containing all the data points prior
to, during, and after such travels, whereas the buffer-based method would use
the user-specified buffer value to “group” the data points into clusters of points
and compute activity and other spatial activity measures accordingly. We rec-
ommend that users consider their respective applications and contexts in detail
when choosing between these two methods.
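The contrast between the two methods is easy to demonstrate with Shapely directly. In the sketch below (with illustrative projected coordinates in meters), a single far-away "travel" point inflates the convex hull area dramatically, while the buffer-based area grows only by one extra circle.

from shapely.geometry import MultiPoint, Point
from shapely.ops import unary_union

pts = [(0, 0), (800, 300), (400, 900)]   # a cluster of routine locations
outlier = (50000, 50000)                 # a single trip far from the cluster

hull_area = MultiPoint(pts + [outlier]).convex_hull.area         # inflated by the outlier
buff_area = unary_union([Point(p).buffer(1000)
                         for p in pts + [outlier]]).area         # one extra circle only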
To illustrate how buffer- and convex hull-based activity space and shared
space are obtained from raw GPS data with latitude and longitude coordinate
pairs, we used one randomly selected twin pair, denoted herein as TwinX, and
their geolocations on May 12, 2017. For buffer-based activity space, we used a
buffer distance of 1000 meters based on common choices of buffer distance in
other published studies (Perchoux, Chaix, Brondeel, & Kestens, 2016; Stewart et
al., 2015). The process of computing activity and shared spaces can be grouped
largely into 3 steps. We described each step and provided the associated code as
organized by these steps.
Step 1: Conversion of raw GPS data into spatial data.
To perform spatial operations, we need to first convert raw GPS data with
latitude and longitude coordinate pairs to spatial data using the df_to_gdf func-
tion in the GPS2space library. The df_to_gdf function takes three parameters:
the first one is the Pandas dataframe³ that contains GPS data with geolocation
information as represented by latitude and longitude coordinate pairs; the sec-
ond one is the column name of the longitude information; the third one is the
column name of the latitude information. The df_to_gdf function returns an unprojected spatial data set.
³ Pandas is a commonly used library for data manipulation and analysis in Python. A
Pandas dataframe is a 2-dimensional data structure with rows representing obser-
vations and columns representing variables. Columns can have different data types
in a Pandas dataframe.
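A minimal sketch of this step, assuming the parameter order described above; the keyword names x and y and the file path are assumptions for illustration.

import pandas as pd
from gps2space import geodf

df = pd.read_csv('./data/gps_records.csv')   # hypothetical file with lat/lon columns

# Convert the raw coordinate pairs into an unprojected spatial data set.
gdf = geodf.df_to_gdf(df, x='lon', y='lat')  # keyword names assumed for illustration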
Figure 2 shows the buffer- and convex hull-based activity space and shared
space for TwinX on May 12, 2017. The buffer-based approach using 1000 meters
as buffer distance gives an activity space of 10.32 and 12.54 square miles⁷ for
TwinXa and TwinXb, and a shared space of 8.08 square miles between them.
The convex hull-based approach produces an activity space of 8.99 and 11.08
square miles for each individual of TwinX and a shared space of 8.48 square
miles between them. The code for Figure 2 can be found in Supplementary
Material.
Figure 2. (a) Buffer-based activity space and shared space for TwinX on May 12, 2017
in Colorado. (b) Convex hull-based activity space and shared space for TwinX on May
12, 2017 in Colorado.
⁷ For illustration purposes, we converted area measurements from square meters to
square miles.
the third one is the EPSG identifier, with a default value of 2163. When the dist_to_point
function is called, the nearest neighbor search is performed by traversing
the cKDTree created on the spatial points in the target data set, so that only
a subset of the points is considered in the distance calculation. As shown in the
following code example, we first constructed the spatial data set for the super-
market data; then we provided three parameters to the dist_to_point function
for the nearest distance query from TwinXa to supermarkets. The output "dist"
is a GeoPandas dataframe with a "dist2point" column showing the
distance from each source point to its nearest supermarket. All the columns from
both the source and target dataframes are preserved in the outcome dataframe.
# Read market data into a Pandas dataframe.
df_market = pd.read_csv('./data/market.csv')
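A sketch of how the rest of this example might proceed, based on the function descriptions above; the module layout, the keyword name proj, the column names, and gdf_twinxa (TwinXa's spatial data set built in Step 1) are assumptions for illustration.

from gps2space import geodf, dist

# Build an unprojected spatial data set for the supermarkets.
gdf_market = geodf.df_to_gdf(df_market, x='lon', y='lat')

# Nearest distance from each of TwinXa's points to a supermarket;
# the EPSG identifier defaults to 2163, as described above.
dist_df = dist.dist_to_point(gdf_twinxa, gdf_market, proj=2163)
print(dist_df['dist2point'].head())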
The two functions, dist_to_point and dist_to_poly, serve to provide distance
measures geared respectively toward places of interest that are adequately rep-
resented as points (typically places covering small geographical regions, such
that the centroids of their enclosing polygons provide a reasonable representation;
for example, supermarkets, transportation terminals, and health facilities) vs. poly-
gons (typically geographically dispersed places of interest or places that require
precise definitions of boundaries, such as parks, water bodies, and administra-
tive boundaries). Results from dist_to_poly and dist_to_point do not always agree,
mainly because the two functions treat points within polygons dif-
ferently. To illustrate the differences, we calculated the nearest distance from
TwinX to the nearest park, playground, and supermarket (represented as poly-
gons, search radius not specified) and their centroids (represented as points).
Table 1 shows the results. Overall, the two functions produce similar results
except for differences in minimum distance, where dist_to_poly may produce 0
values while dist_to_point rarely does. The main reason for this
difference is that once dist_to_poly detects that a point
is within the polygon, it assigns 0 to the nearest distance, while dist_to_point
calculates the Euclidean distance between the two points and only returns 0 if
the geolocations of the two points are identical. In sum, the difference
between the dist_to_point and dist_to_poly measures depends on the source point's
position relative to the target polygon and the shape of the target polygon. The code for
Table 1 can be found in Supplementary Material.
Table 1. Comparison between the nearest distance from TwinX to polygon boundary
and polygon centroid for parks, playgrounds, and supermarkets in Colorado
Before extracting the daily activity space and shared space for all participants
using the functions presented above, we pre-processed the GPS data following
procedures implemented in the previous study (Li et al., in press). First, we ex-
cluded records with fewer than 20 valid data points within a week because these
unusually low numbers of GPS points lacked sufficient variability. Then we ex-
cluded data points showing atypical travel trajectories as detected by dbscan
(Density-Based Spatial Clustering of Applications with Noise), an R package
that is commonly used to identify clusters and outlying points (Hahsler, Pieken-
brock, & Doran, 2019). Then the daily activity space was calculated using a
buffer distance of 1000 meters and transformed from square meters to square
miles for illustrative purposes. The activity space was then log transformed to
reduce skewness in the data. The log transformed activity space was referred to
hereafter as LAS. For each participant, we focused on the proportion of shared
space, referred to as PSS hereafter and defined as the proportion of one’s daily
activity space that overlapped with his/her twin sibling’s daily activity space.
The distributions of LAS and PSS are shown in Figure 3. The final data set
consisted of 558 participants with baseline ages between 14 and 20 (mean = 17),
followed for 1 to 3 years (mean = 2). 43% of the participants were males.
In terms of twin types, 33% were MZ twins, 41% were DZ twins of the same sex,
and 26% were DZ twins of opposite sex.
Figure 3. Distributions of (a) log activity spaces (LAS) and (b) proportions of shared
space (PSS) across participants.
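The two outcome measures can be computed from the constructed polygons along the following lines. This is a sketch with hypothetical variable names, assuming as_a and as_b are a twin pair's (projected) daily activity-space polygons.

import numpy as np

SQ_M_PER_SQ_MILE = 2.59e6   # approximate conversion, used here for illustration

def las(activity_space_polygon):
    # Log-transformed activity space (LAS), after converting to square miles.
    return np.log(activity_space_polygon.area / SQ_M_PER_SQ_MILE)

def pss(as_a, as_b):
    # Proportion of twin a's daily activity space overlapping with twin b's (PSS).
    return as_a.intersection(as_b).area / as_a.area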
$$\text{LAS}_{itk} = \beta_{0ik} + \beta_{1ik}\text{Age}_{itk} + \beta_2\text{Weekend}_t + \beta_3\text{Summer}_t + \beta_4\text{Fall}_t + \beta_5\text{Winter}_t + e_{itk} \qquad (1)$$

Level-2 model:

$$\beta_{0ik} = \gamma_{00k} + \gamma_{01k}\text{Gender}_{ik} + \gamma_{02k}\text{Age}_{i0k} + u_{0ik} \qquad (2)$$

$$\beta_{1ik} = \gamma_{10k} + \gamma_{11k}\text{Gender}_{ik} + u_{1ik} \qquad (3)$$

Level-3 model:

$$\gamma_{00k} = \delta_{000} + v_{0k} \qquad (4)$$

$$\gamma_{01k} = \delta_{010} + v_{1k} \qquad (5)$$

$$\gamma_{02k} = \delta_{020} + v_{2k} \qquad (6)$$

$$\gamma_{10k} = \delta_{100} + v_{3k} \qquad (7)$$

$$\gamma_{11k} = \delta_{110} + v_{4k} \qquad (8)$$

with

$$e_{itk} \sim N(0, \sigma^2),$$

$$\begin{pmatrix} u_{0ik} \\ u_{1ik} \end{pmatrix} \sim MN\left(0,\ T = \begin{pmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{pmatrix}\right),$$

$$\begin{pmatrix} v_{0k} \\ v_{1k} \\ v_{2k} \\ v_{3k} \\ v_{4k} \end{pmatrix} \sim MN\left(0,\ \Phi = \begin{pmatrix} \phi_0^2 & & & & \\ \phi_{01} & \phi_1^2 & & & \\ \phi_{02} & \phi_{12} & \phi_2^2 & & \\ \phi_{03} & \phi_{13} & \phi_{23} & \phi_3^2 & \\ \phi_{04} & \phi_{14} & \phi_{24} & \phi_{34} & \phi_4^2 \end{pmatrix}\right)$$
The seasonal effect, weekend effect, and age-based changes in LAS were modeled
in the level-1 model, where LASitk was the LAS of person i in family k on day t,
and Ageitk was the age of person i in family k on day t, centered by subtracting
the baseline age from each age instance so that 0 corresponded to the baseline
age. The Weekend, Summer, Fall, and Winter variables were dummy-coded, with
1 each representing weekend, summer (June 1 to August 30), fall (September 1
to November 30), and winter (December 1 to February 28 or 29). Based on the
definitions of these variables, β0ik represented person i’s initial LAS at baseline
age on Spring weekdays; β1ik was the effect of age on the LAS for person i; and
βj (j = 2, . . . , 5) represented weekend or seasonal effects, which were not set as
person-specific since we focused on the overall seasonal and weekend effects in
this study. Finally, the level-1 error eitk followed a normal distribution with a
zero mean and a variance of σ 2 .
In the level-2 model, the level-1 parameters, β0ik and β1ik , were regressed on
a person-specific variable, Genderik (1 = male; -1 = female), to explore gender
differences in the initial levels and age-based changes of LAS. In addition, β0ik
was regressed on the baseline age, Agei0k , centered by subtracting the mean
of baseline ages so that 0 corresponded to the average baseline age. Thus, the
corresponding coefficient γ02k represented the effect of baseline ages on the initial
LAS, and γ00k and γ10k represented the overall initial level and growth rate
of LAS across individuals, respectively, while 2γ01k and 2γ11k represented the
corresponding gender differences, respectively. The level-2 random effects were
denoted as u0ik and u1ik , which described person i’s deviations in the values
of β0ik and β1ik not accounted for by the predictors. Finally, the variance and
covariance structure of level-2 random effects was defined in T. For instance,
the variance of β0ik , denoted as τ02 , described the extent of between-individual
difference in the initial LAS; the covariance between β0ik and β1ik , denoted as
τ01 , described the relationship between initial levels and growth rates of LAS.
The level-3 model was built to capture between-family differences. Specifi-
cally, we would like to investigate whether twins from different families would
have different initial levels and growth rates of LAS and whether the effects of
gender and baseline age on the initial levels and/or growth rates of LAS would
differ across families as well. Note that twin type was not included as a predictor
in the level-3 model because the magnitudes of activity space were not expected
to be significantly different between MZ and DZ twins (although they might be
expected to differ in the degree to which they share space with their siblings,
which was addressed below in the model for PSS). Among parameters in the
level-3 model, δ010 and δ110 were of particular interest because they reflected the
differences between males and females in terms of their average initial levels and
growth rates of LAS, respectively. The level-3 random effects, v0k - v4k , followed
a multivariate normal distribution with zero means and a covariance matrix, Φ,
where the variances, denoted as ϕ0² to ϕ4², captured the extent of between-family
differences in the overall initial LAS, the effects of gender and baseline age on
the initial LAS, the overall growth rate of LAS and gender differences therein,
respectively.
In terms of the model for PSS, some slight modeling adaptations were needed
to capture characteristics of the PSS data. As noted, PSS was defined as the
proportion of one’s activity space that overlapped with his/her twin sibling’s
activity space, thus yielding a value ranging from 0 to 1. The model presented
above, which assumed that the error term followed a normal distribution with
a constant variance, might not be appropriate for the data in this scenario.
However, the beta distribution is known for its flexibility in modeling proportions
because its density can take different shapes depending on the values of α
and β. The beta density can be expressed as

$$f(y; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha - 1}(1 - y)^{\beta - 1}, \quad 0 < y < 1,\ \alpha > 0,\ \beta > 0. \qquad (9)$$
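Defining µ = α/(α + β) and φ = α + β (the mean-precision reparameterization used below), a quick simulation sketch with illustrative values of α and β confirms that E(y) = µ and Var(y) = µ(1 − µ)/(1 + φ).

import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 6.0                       # illustrative shape parameters
mu, phi = alpha / (alpha + beta), alpha + beta

y = rng.beta(alpha, beta, size=1_000_000)
print(y.mean(), mu)                          # both ~0.25
print(y.var(), mu * (1 - mu) / (1 + phi))    # both ~0.0208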
Thus, in the generalized growth curve model with PSS as the dependent variable,
PSS was specified to conform to a beta distribution. Consistent with the beta
regression specification proposed by Ferrari and Cribari-Neto (2004), which is
similar to that of the well-known class of generalized linear models (McCullagh
& Nelder, 1989), we defined µ = α/(α + β) and φ = α + β, then E(y) = µ
and Var(y) = µ(1 − µ)/(1 + φ), where µ was the mean and φ was called the
precision parameter. In our case, we assumed that the PSS, P SSitk , followed
a beta distribution with person-specific means (i.e., E(P SSitk ) = µitk ). Then
we implemented a logit transformation of µitk and built a three-level growth
curve model on the transformed value (i.e., ηitk ). The level-1 model for PSS was
specified as:
$$\eta_{itk} = \log\left(\frac{\mu_{itk}}{1-\mu_{itk}}\right) = \beta_{0ik} + \beta_{1ik}\text{Age}_{itk} + \beta_2\text{Weekend}_t + \beta_3\text{Summer}_t + \beta_4\text{Fall}_t + \beta_5\text{Winter}_t \qquad (10)$$

where µitk/(1 − µitk), denoted below as the odds of PSS, represented the average level
of PSS for individual i in family k at time t relative to not sharing space with the
twin sibling, and ηitk = log(µitk/(1 − µitk)) represented the corresponding log odds.
The independent variables were as summarized in Equation 1. Note that the re-
gression coefficients had different interpretations due to the logit transformation.
For instance, β0ik represented the log-odds of PSS for person i in family k at the
baseline age on Spring weekdays; β1ik was the age-related log-odds ratio, which
means that the odds of PSS would multiply by exp(β1ik) for every 1-unit increase in
Ageitk . Other parameters (e.g., seasonal and weekend effects) can be interpreted
in a similar way.
The level-2 model for PSS was identical to the level-2 model for LAS (see
Equations 2 - 3), but the regression coefficients had different interpretations for
the reason stated above. For instance, the level-2 intercept, γ00k , represented the
overall log-odds of PSS.
In terms of the level-3 model, we hypothesized that MZ and DZ twins might
have different levels of space sharing to the extent that these spatial measures
reflect genetically influenced behavior/preferences. To evaluate this hypothesis,
we added a predictor, twin type, to Equations 4 - 6 (i.e., the level-3 model for
γ00k , γ01k , and γ02k , which were the coefficients in the level-2 model for β0ik , the
log-odds of initial levels of PSS), to investigate zygosity differences in PSS and
how these differences might affect the effects of gender and baseline age on PSS,
as shown below.
6.3 Results
With the brms package, the models were fitted in a Bayesian framework using
Markov chain Monte Carlo (MCMC) methods. Specifically, we ran two chains,
each with 5000 iterations in total and a burn-in of 2000 (discarded) iterations. On
an Intel i5-8350U, 16GB RAM, Windows 10 computer, it took about 40 hours to
run each model. Two diagnostic statistics were used to check the sampling quality
(Gelman et al., 2013): (1) the effective sample size (ESS), which describes how
many posterior draws in the MCMC procedure can be regarded as independent,
and (2) R̂, which describes the ratio of the overall variance of posterior samples
across chains to the within-chain variance. The diagnostic criteria for adequate
sampling and convergence were set as ESS greater than 800 and R̂ below 1.1,
respectively. Results showed that ESS was greater than 800 for most parameters,
except for some random effect standard deviation parameters (e.g., ϕ1 − ϕ4 ), for
which the average ESS was about 400, which can be deemed satisfactory. R̂ was
below 1.1 for all parameters in both models.
Table 2. Parameter estimates of the model for LAS from the CoTwins study, 2016-2018
Table 2 shows the parameter estimates for LAS. In terms of the fixed ef-
fects, weekend and seasonal effects were found in the trajectory of LAS. Specifi-
cally, the participants showed greater LAS values on weekends than on weekdays
(β2 = 0.06, 95% CI = [0.05, 0.07]), which was reasonable since most of the par-
ticipants were supposed to be spending most of their time in school on weekdays,
thus yielding limited activity space. Seasonally, the participants tended to display
greater LAS in summer (β3 = 0.07, 95% CI = [0.06, 0.07]), which was likely due
to summer break as well as the warmer weather. Gender differences were found
in the initial levels of LAS (δ010 = −0.07, 95% CI = [−0.11, −0.01]), although
the upper bound of the 95% credible interval was close to 0. No gender differ-
ences were found in the growth rates of LAS. Finally, older participants tended
to have higher levels of LAS at baseline (δ020 = 0.13, 95% CI = [0.09, 0.16]),
but when it comes to within-individual changes over time, participants’ ages
were not found to be credibly linked to their levels of LAS, as indicated by the
95% credible interval including 0.
Table 3. Parameter estimates of the model for PSS from the CoTwins study, 2016-2018
Table 3 shows the parameter estimates for PSS. In terms of the fixed effects,
weekend and seasonal effects were found in the trajectory of PSS. Specifically,
participants shared more activity space on weekdays than on weekends (β2 =
−0.12, 95% CI = [−0.13, −0.10]). This pattern might be due to the restricted
daily routines on weekdays during which twin siblings in this age range tended
to spend most of their time in school and thus, showed greater PSS. Participants
tended to have the largest PSS in spring, followed by winter, summer, and fall.
In addition, older twins tended to share less activity space at baseline (δ020 =
−0.30, 95% CI = [−0.42, −0.18]), and when it comes to within-individual
changes over time, in contrast to the lack of age-related changes in LAS, PSS was
found to decrease as twins grew older (δ100 = −0.38, 95% CI = [−0.44, −0.31]).
Note that a small portion of twins were in the transition from high school to
college, so the reduction in PSS might also reflect some of the inevitable life
transitions that occur with age, such as attending colleges or working at different
geographical locations. In terms of zygosity differences, both DZ twins of the
same sex and opposite sex were found to share less activity space than MZ
twins (δ001 = −0.29, 95% CI = [−0.49, −0.09]; δ002 = −0.49, 95% CI =
[−0.71, −0.27]), indicating that there might be genetically influenced differences
in PSS. Finally, no gender differences were found in the initial levels and growth
rates of PSS.
Results for random effects were similar to those in the LAS model. We found
between-individual and between-family differences in both initial levels and age-
based changes of PSS. We also found negative associations between the initial
levels and growth rates at the individual level, indicating that twins who had
higher initial levels of PSS tended to show more declines in PSS with age. In
other words, the participants’ GPS data suggested that higher physical closeness
at younger ages might not persist as the twins grew older.
Finally, we conducted sensitivity analysis by re-running the analysis with the
full data set (i.e., keeping the records with fewer than 20 valid data points within
a week in the final data set). Results were detailed in Table S1 and Table S2 in
Supplementary Material, which showed only slight differences in the magnitude
of point estimates and standard errors. Both data sets yielded consistent con-
clusions across all parameters in terms of whether they were credibly different
from zero based on their 95% credible intervals.
7 Discussion
The proliferation of real-time and longitudinal GPS data provides excellent op-
portunities to study human behavior (Osorio-Arjona & Garcı́a-Palomares, 2019).
At the same time, the GPS data also pose challenges for consolidating, automat-
ing, and analyzing data that are not only massive in their quantities but also con-
tain spatial features that require expertise in GIS. Commercial software packages
make these studies easier but may have license and reproducibility issues, and
analyses with commercial software cannot be readily deployed to HPC platforms
to facilitate research procedures. In this article, we reviewed and compared ex-
isting commonly used Python libraries for spatial analysis with GPS2space, our
newly developed open-source Python library. GPS2space can build spatial data
from GPS data with latitude and longitude coordinate pairs, construct buffer-
and convex hull-based activity space and shared space, and perform the nearest
distance query from user-specified locations. We demonstrated how to process
spatial data and calculate buffer- and convex hull-based activity space and shared
space, as well as the nearest distance, with code examples. We also discussed the
pros and cons of buffer- and convex hull-based approaches and illustrated differ-
ent scenarios when the two approaches could be appropriately applied. Lastly,
using data from the CoTwins study, we explored intra-individual changes and
between-individual differences in daily activity space and shared space with twin
siblings; and gender, zygosity and baseline age-related differences in their initial
levels and/or changes, using growth curve modeling techniques. We found differ-
ent patterns of seasonal effects in the trajectories of LAS and PSS, less activity
space shared between DZ twins compared with MZ twins, and a decrease of PSS
with increasing age.
There are several limitations to the current data analysis. First, we did not
allow for individual differences in the seasonal effects, so our results only pro-
vided a general description of seasonal patterns of LAS and PSS. In practice,
the seasonal effects might vary across individuals and need to be considered in
model specifications. Second, some other factors might affect individuals’ activ-
ity space, such as time of the year (e.g., school days versus holidays) and weather
(e.g., snow). Similarly, the magnitude of shared space between twin siblings de-
pends on whether they live together or not. These factors need to be included in
the models to better explain the temporal pattern of LAS and PSS as well as in-
dividual differences in these patterns. Finally, in our example, some participants
were assessed for fewer than three years, while typically at least three repeated
measures per individual are required in the growth curve analysis. Therefore,
participants need to be followed for several more years to better investigate age-
related changes at the year level. We may also assess changes of finer granularity
(e.g., at the month level) based on the current data.
can provide information for researchers to validate and compare mobility or tra-
jectory measures from different data sources.
Many other extensions are possible within GPS2space to circumvent some
of its current limitations. For example, constructing activity space and shared
space involves topological structuring, which can take other forms besides con-
vex hull and buffer, the two methods currently available in GPS2space. Some
researchers use hexagon methods to measure territorial control based on road
data (Tao, Strandow, Findley, Thill, & Walsh, 2016); others also use the concave
hull method to estimate crown volumes of trees from remote sensing data (Yan et
al., 2019). Those approaches are useful and beneficial for certain research ques-
tions but are currently unavailable in GPS2space. To extend the GPS2space, one
could include concave hull, hexagon, and network-based methods in constructing
activity space and parameterize the column name variables for the spatial mea-
sures in GPS2space so that users have control of naming their desired outcomes.
With rapid developments of spatial economics, readily available spatial data
sets, and the computational power of personal computers and cloud computing,
spatial analyses have gained popularity in areas such as social, behavioral, and
environmental studies. We provided a timely open-source solution to work with
GPS data and extract spatial measures with code snippets and empirical exam-
ples using GPS2space. Overall, we have demonstrated that GPS2space can be a
versatile, handy, and extendable tool for researchers to harness the spatialities
of GPS data to investigate a wide array of research questions regarding spatial-
temporal variations of human behavioral changes and environment-population
linkages.
References
Barron, C., Neis, P., & Zipf, A. (2014). A Comprehensive Framework for Intrinsic
OpenStreetMap Quality Analysis. Transactions in GIS , 18 (6), 877–895.
doi: https://doi.org/10.1111/tgis.12073
Ben-Ari, A., & Lavee, Y. (2007). Dyadic closeness in marriage: From the inside
story to a conceptual model. Journal of Social and Personal Relationships,
24 (5), 627–644. doi: https://doi.org/10.1177/0265407507081451
Bivand, R. (2006). Implementing spatial data analysis software tools in R.
Geographical Analysis, 38 (1), 23–40. doi: https://doi.org/10.1111/j.0016-
7363.2005.00672.x
Browning, M., & Lee, K. (2017). Within what distance does "greenness"
best predict physical health? A systematic review of articles
with GIS buffer analyses across the lifespan. International Jour-
nal of Environmental Research and Public Health, 14(7), 675. doi:
https://doi.org/10.3390/ijerph14070675
Buchowski, M. S., Townsend, K. M., Chen, K. Y., Acra, S. A., & Sun, M.
(1999). Energy expenditure determined by self-reported physical activ-
ity is related to body fatness. Obesity Research, 7 (1), 23–33. doi:
https://doi.org/10.1002/j.1550-8528.1999.tb00387.x
Li, Y., Oravecz, Z., Zhou, S., Bodovski, Y., Barnett, I. J., Chi, G., . . . Chow, S.-
M. (in press). Bayesian forecasting with a regime-switching zero-inflated
multilevel poisson regression model: An application to adolescent alcohol
use with spatial covariates. Psychometrika.
Mason, M. J., Valente, T. W., Coatsworth, J. D., Mennis, J., Lawrence, F., &
Zelenak, P. (2010). Place-based social network quality and correlates of
substance use among urban adolescents. Journal of Adolescence, 33 (3),
419–427. doi: https://doi.org/10.1016/j.adolescence.2009.07.006
McCormick, T. H., Lee, H., Cesare, N., Shojaie, A., & Spiro, E. S. (2017).
Using Twitter for Demographic and Social Science Research: Tools for
Data Collection and Processing. Sociological Methods and Research, 46 (3),
390–421. doi: https://doi.org/10.1177/0049124115605339
McCullagh, P., & Nelder, J. (1989). Generalized Linear Models (2nd ed.).
Chapman and Hall.
McGuire, W., O’Brien, B. G., Baird, K., Corbett, B., & Collingwood, L.
(2020). Does Distance Matter? Evaluating the Impact of Drop Boxes
on Voter Turnout. Social Science Quarterly, 101 (5), 1789–1809. doi:
https://doi.org/10.1111/ssqu.12853
Murray, A. T., Xu, J., Wang, Z., & Church, R. L. (2019). Commer-
cial GIS location analytics: capabilities and performance. International
Journal of Geographical Information Science, 33 (5), 1106–1130. doi:
https://doi.org/10.1080/13658816.2019.1572898
Newman, H. H., Freeman, F. N., & Holzinger, K. J. (1937). Twins: a study of
heredity and environment. Chicago: University of Chicago Press.
Osorio-Arjona, J., & Garcı́a-Palomares, J. C. (2019). Social media and urban
mobility: Using twitter to calculate home-work travel matrices. Cities, 89 ,
268–280. doi: https://doi.org/10.1016/j.cities.2019.03.006
Patil, S. (2016). Big Data Analytics Using R. International Research Journal
of Engineering and Technology, 3 (7), 78–81.
Perchoux, C., Chaix, B., Brondeel, R., & Kestens, Y. (2016).
Residential buffer, perceived neighborhood, and individual activity
space: New refinements in the definition of exposure areas - The
RECORD Cohort Study. Health and Place, 40 , 116–122. doi:
https://doi.org/10.1016/j.healthplace.2016.05.004
Prins, R. G., Pierik, F., Etman, A., Sterkenburg, R. P., Kamphuis, C. B.,
& van Lenthe, F. J. (2014). How many walking and cycling
trips made by elderly are beyond commonly used buffer sizes: Re-
sults from a GPS study. Health and Place, 27 , 127–133. doi:
https://doi.org/10.1016/j.healthplace.2014.01.012
Rey, S. J. (2019). PySAL: the first 10 years. Spatial Economic Analysis, 14 (3),
273–282. doi: https://doi.org/10.1080/17421772.2019.1593495
Rey, S. J., & Anselin, L. (2007). PySAL: A Python Library of Spatial Analytical
Methods. The Review of Regional Studies, 37 (1), 7–27.
Russell, M. A., Almeida, D. M., & Maggs, J. L. (2017). Stressor-
related drinking and future alcohol problems among university stu-
154 S. Zhou et al.
Supplementary Material
Zhiyong Zhang
To get the posterior distribution of Σ for Bayesian inference, one needs to specify a
prior distribution p(Σ) for it. With the prior, the posterior distribution can be
obtained through Bayes' Theorem:

p(Σ|D) = p(D|Σ)p(Σ) / p(D).
In particular, for an inverse Wishart distribution IW(V, m), the variance of the
i-th diagonal element σii is

Var(σii) = 2vii² / [(m − p − 1)²(m − p − 3)],   (2)

where vii is the i-th diagonal element of the scale matrix V.
p(Σ|D) ∝ p(D|Σ)p(Σ)
       = |Σ|^{−n/2} exp[−(n/2)tr(SΣ^{−1})] × |Σ|^{−(m0+p+1)/2} exp[−tr(V0Σ^{−1})/2]
       = |Σ|^{−(n+m0+p+1)/2} exp{−(1/2)tr[(nS + V0)Σ^{−1}]}.

From this, we can see that the posterior distribution of Σ is also an inverse Wishart
distribution,

Σ|D ~ IW(nS + V0, n + m0),   (3)

with posterior mean

E(Σ|D) = (nS + V0) / (n + m0 − p − 1)
       = [n / (n + m0 − p − 1)] S + [1 − n / (n + m0 − p − 1)] × V0 / (m0 − p − 1).   (4)
In practice, the BUGS program is probably the most widely used software for
Bayesian analysis (e.g., Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2012;
Ntzoufras, 2009). BUGS uses the precision matrix, defined as the inverse of the
covariance matrix, to specify the multivariate normal distribution. Let P = Σ−1 ,
then the normal density function can be written as
p(x|P) = (2π)^{−p/2} |P|^{1/2} exp(−xᵀPx / 2).
Using the precision matrix can be computationally advantageous in certain
situations because it avoids inverting the covariance matrix when evaluating the density.
For the precision matrix P, a Wishart prior W (U0 , w0 ) with the scale matrix
U0 and degrees of freedom w0 is used (e.g., Lunn et al., 2012). The density
function of the prior is

p(P) ∝ |P|^{(w0−p−1)/2} exp[−tr(U0^{−1}P)/2].

Combining this prior with the normal likelihood, the posterior distribution of P is
also a Wishart distribution,

P|D ~ W(U1, w1)  with  U1 = (nS + U0^{−1})^{−1}  and  w1 = n + w0.   (5)

Equivalently, Σ = P^{−1} follows an inverse Wishart distribution with scale matrix
U1^{−1} and degrees of freedom w1, so that

E(Σ|D) = U1^{−1} / (w1 − p − 1) = (nS + U0^{−1}) / (n + w0 − p − 1).   (6)
Comparing the posterior distributions in Equations (3) and (5), giving an
inverse Wishart distribution IW(V0, m0) prior to the covariance matrix Σ is
the same as giving a Wishart distribution W(V0^{−1}, m0) prior to the precision
matrix P = Σ^{−1}. However, note that

[E(P|D)]^{−1} = (nS + U0^{−1}) / (n + w0) ≠ E(Σ|D) = (nS + U0^{−1}) / (n + w0 − p − 1).
Therefore, one cannot simply invert the posterior mean of the precision matrix
to get the posterior mean of the covariance matrix.
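This caveat is easy to verify numerically. The following is a minimal base-R sketch
built from the posterior means derived above; the values of S, n, U0, and w0 are
illustrative choices, not taken from the paper.

S  <- matrix(c(5, 2, 2, 10), 2, 2)   # sample covariance matrix
n  <- 100                            # sample size
p  <- nrow(S)
U0 <- diag(p)                        # scale matrix of the Wishart prior
w0 <- 10                             # prior degrees of freedom

E_P <- (n + w0) * solve(n * S + solve(U0))  # posterior mean of the precision, E(P|D)
solve(E_P)                                  # [E(P|D)]^{-1} = (nS + U0^{-1}) / (n + w0)
(n * S + solve(U0)) / (n + w0 - p - 1)      # E(Sigma|D): note the different denominator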
3 Numerical Examples
For illustration, we look at a concrete example. Suppose we have a sample of
size n = 100 with the sample covariance matrix (p = 2)

S = [ 5   2 ]
    [ 2  10 ].
The aim is to estimate Σ using Bayesian methods. We now consider the use
of different priors and evaluate their influence. Given the connection between
the Wishart and inverse Wishart distributions, we focus our discussion on the
specification of an inverse Wishart prior for the covariance matrix Σ.
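The posterior summaries reported in the tables below follow directly from
Equations (2)–(4). The helper function post_iw is our own minimal R sketch (it is
not part of the wishartprior package); for instance, it reproduces the m0 = 10
column in the first block of Table 1.

# Posterior of Sigma given an IW(V0, m0) prior and a sample covariance S from
# n observations: Sigma | D ~ IW(n*S + V0, n + m0); see Equations (3) and (4).
post_iw <- function(S, n, V0, m0) {
  p  <- nrow(S)
  V1 <- n * S + V0   # posterior scale matrix
  m1 <- n + m0       # posterior degrees of freedom
  list(mean     = V1 / (m1 - p - 1),             # Equation (4)
       var_diag = 2 * diag(V1)^2 /               # Equation (2), applied to the
         ((m1 - p - 1)^2 * (m1 - p - 3)))        # posterior (diagonal elements only)
}

S <- matrix(c(5, 2, 2, 10), 2, 2)
post_iw(S, n = 100, V0 = diag(2), m0 = 10)  # e.g., mean of Sigma11 = 4.68, var = 0.418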
Table 1. Posterior inference of the covariance matrix parameter based on the inverse
Wishart prior with the scale matrix specified based on an identity matrix.

                          Mean                               Variance
       S     m0=2    5      10     50     100      m0=2    5      10     50     100
IW(I, m0)
Σ11    5     5.06   4.91   4.68   3.41   2.54      0.528  0.483  0.418  0.160  0.066
Σ12    2     1.96   1.96   1.87   1.36   1.02      0.516  0.516  0.447  0.172  0.071
Σ22   10    10.11   9.81   9.36   6.81   5.08      2.108  1.926  1.667  0.640  0.265
IW[(m0 − p − 1)I, m0]
Σ11    5     5.04   4.92   4.74   3.72   3.03      0.524  0.484  0.428  0.191  0.094
Σ12    2     1.96   1.96   1.87   1.36   1.02      0.518  0.518  0.454  0.194  0.091
Σ22   10    10.09   9.82   9.41   7.12   5.57      2.100  1.930  1.687  0.700  0.318
In the above specification, since V0 ≡ I, the prior mean also changes as m0
changes. In practice, e.g., in a sensitivity analysis, it can be helpful to fix the
prior mean. To achieve this, one can set V0 = (m0 − p − 1)I. Then, when m0 = 5,
the scale matrix is 2I, and when m0 = 100, the scale matrix is 97I. With this
specification, the prior mean is always I.
Another way to specify the prior is to construct the scale matrix of the inverse
Wishart distribution from the sample data. Intuitively, we can set V0 = S and
vary m0. As the top part of Table 2 shows, the posterior mean deviates from the
sample covariance matrix as m0 increases. This is again because the prior mean,
S/(m0 − p − 1), becomes smaller as m0 increases. To maintain the same prior mean
while changing the amount of information in the prior, we can set V0 = (m0 − p − 1)S.
With this specification, the prior mean is always S, and the posterior mean is also S,
as can be seen from the bottom part of Table 2. As the degrees of freedom increase,
more information is supplied through the prior, and the posterior variance decreases.
Table 2. Posterior inference of the covariance matrix parameter based on the priors
with the scale matrix constructed from data.

                          Mean                               Variance
       S     m0=2    5      10     50     100      m0=2    5      10     50     100
IW(S, m0)
Σ11    5     5.10   4.95   4.72   3.44   2.56      0.537  0.490  0.424  0.163  0.067
Σ12    2     1.98   1.98   1.89   1.37   1.03      0.525  0.525  0.455  0.175  0.072
Σ22   10    10.20   9.90   9.44   6.87   5.13      2.146  1.961  1.697  0.651  0.270
IW[(m0 − p − 1)S, m0]
Σ11    5     5.00   5.00   5.00   5.00   5.00      0.515  0.500  0.476  0.345  0.256
Σ12    2     2.00   2.00   2.00   2.00   2.00      0.536  0.536  0.510  0.370  0.276
Σ22   10    10.00  10.00  10.00  10.00  10.00      2.062  2.000  1.905  1.379  1.026
Table 3. Posterior inference of the covariance matrix parameter with additional
specifications of inverse Wishart priors IW[(m0 − p − 1)V0, m0].

                          Mean                               Variance
       S     m0=2    5      10     50     100      m0=2    5      10     50     100
P1: V0 = [10, 0; 0, 1]
Σ11    5     4.95   5.10   5.33   6.60   7.46      0.505  0.520  0.541  0.601  0.571
Σ12    2     1.96   1.96   1.87   1.36   1.02      0.535  0.535  0.507  0.335  0.217
Σ22   10    10.09   9.82   9.41   7.12   5.57      2.100  1.930  1.687  0.700  0.318
P2: V0 = [5, −2; −2, 10]
Σ11    5     5.00   5.00   5.00   5.00   5.00      0.515  0.500  0.476  0.345  0.256
Σ12    2     1.92   1.92   1.74   0.72   0.03      0.532  0.532  0.501  0.346  0.255
Σ22   10    10.00  10.00  10.00  10.00  10.00      2.062  2.000  1.905  1.379  1.026
P3: V0 = [5, 0; 0, 10]
Σ11    5     5.00   5.00   5.00   5.00   5.00      0.515  0.500  0.476  0.345  0.256
Σ12    2     1.96   1.96   1.87   1.36   1.02      0.534  0.534  0.505  0.355  0.260
Σ22   10    10.00  10.00  10.00  10.00  10.00      2.062  2.000  1.905  1.379  1.026
P4: V0 = [5, −5; −5, 10]
Σ11    5     5.00   5.00   5.00   5.00   5.00      0.515  0.500  0.476  0.345  0.256
Σ12    2     1.86   1.86   1.54  −0.24  −1.45      0.530  0.530  0.495  0.343  0.266
Σ22   10    10.00  10.00  10.00  10.00  10.00      2.062  2.000  1.905  1.379  1.026
4 Discussion
Although not without issues, Wishart and inverse Wishart distributions are still
commonly used prior distributions for Bayesian analysis involving a covariance
matrix (Alvarez, Niemi, & Simpson, 2014; Liu, Zhang, & Grimm, 2016). As we
have shown, the use of the inverse Wishart prior has the advantage of conjugacy,
which simplifies the posterior distribution. By using an inverse Wishart prior,
the posterior distribution is also an inverse Wishart distribution given normally
distributed data. The posterior mean can be conveniently expressed as a weighted
average of the prior mean and the sample covariance matrix. The influence of
the prior can also be clearly quantified.
When reliable information is available, an informative inverse Wishart prior
can be constructed. For example, previous estimates on the covariance matrix
could be available. In this situation, such covariance matrix estimates can be
used to construct the scale matrix. If variance estimates of the covariance matrix
elements are also available, one can determine the degrees of freedom for the inverse
Wishart prior based on the variance expression in Equation (2), which can be
done using the R package discussed in the Appendix. The degrees of freedom
based on each individual element may vary. The overall degrees of freedom for the
inverse Wishart distribution can be determined based on the practical research
question.
Appendix
The R package wishartprior has been developed and made available on GitHub to
help with understanding the Wishart and inverse Wishart priors. The URL of the
package is https://github.com/johnnyzhz/wishartprior. The package can be used
to generate random numbers from an inverse Wishart distribution and to calculate
the mean and variance of Wishart and inverse Wishart distributions. Using the
package, one can investigate the influence of priors.
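The behavior of these priors can also be checked in base R without the package.
The sketch below is our own illustration, relying only on stats::rWishart; it draws
from an inverse Wishart by inverting Wishart draws and compares the Monte Carlo
mean and variance with the closed forms used above.

# If P ~ W(U0, w0) with U0 = V0^{-1}, then Sigma = P^{-1} ~ IW(V0, w0).
set.seed(1)
V0 <- matrix(c(5, 2, 2, 10), 2, 2)   # inverse Wishart scale matrix
m0 <- 10                             # degrees of freedom
p  <- nrow(V0)
draws <- apply(rWishart(5000, df = m0, Sigma = solve(V0)), 3, solve)
matrix(rowMeans(draws), p, p)        # Monte Carlo mean of Sigma
V0 / (m0 - p - 1)                    # closed-form mean, for comparison
var(draws[1, ])                      # Monte Carlo variance of sigma_11
2 * V0[1, 1]^2 / ((m0 - p - 1)^2 * (m0 - p - 3))   # Equation (2)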
References
Alvarez, I., Niemi, J., & Simpson, M. (2014). Bayesian inference for a covariance
matrix. In Annual Conference on Applied Statistics in Agriculture (pp. 71–82).
Retrieved from arXiv:1408.4050
Barnard, J., McCulloch, R., & Meng, X.-L. (2000). Modeling covariance matri-
ces in terms of standard deviations and correlations, with application to
shrinkage. Statistica Sinica, 10 , 1281–1311.
Congdon, P. (2014). Applied Bayesian modelling (2nd ed.). John Wiley & Sons.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin,
D. B. (2014). Bayesian data analysis (3rd ed.). CRC Press.
Leonard, T., & Hsu, J. S. (1992). Bayesian inference for a covariance
matrix. The Annals of Statistics, 20 (4), 1669–1696. doi:
https://doi.org/10.1214/aos/1176348885
Liu, H., Zhang, Z., & Grimm, K. J. (2016). Comparison of inverse
Wishart and separation-strategy priors for Bayesian estimation of co-
variance parameter matrix in growth curve analysis. Structural Equa-
tion Modeling: A Multidisciplinary Journal, 23 (3), 354–367. doi:
https://doi.org/10.1080/10705511.2015.1057285
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The
BUGS book: A practical introduction to Bayesian analysis. CRC Press.
Mardia, K., Bibby, J., & Kent, J. (1982). Multivariate analysis. Academic Press.
Ntzoufras, I. (2009). Bayesian modeling using WinBUGS. John Wiley & Sons.
Journal of Behavioral Data Science, 2021, 1 (2), 89–118.
DOI: https://doi.org/10.35566/jbds/v1n2/p6

Weighted Residual Bootstrap Method for Multilevel Models
W. Luo and H. C. Lai
1 Introduction
Multi-stage sampling designs are often used in survey data collection. For exam-
ple, in order to obtain a nationally representative sample of kindergartners, a
two-stage sample design may be used in which a representative set of schools
are sampled in the first stage and students within schools are sampled in the
second stage. Besides the advantage of cost-effectiveness and convenience, data
obtained by multi-stage sampling allow researchers to answer multilevel research
questions. For example, researchers could examine how students' achievement is
related to both student-level and school-level characteristics.
In standard multilevel models, the level-1 errors are assumed to follow a
univariate normal distribution, and the
level-2 random effects are assumed to follow a multivariate normal distribution.
It has been documented that although ML estimators for fixed effects and vari-
ance components are consistent even when the random-effects distribution is not
normal, the standard error estimated by the inverse Fisher information matrix
may be biased, especially for variance components (Verbeke & Lesaffre, 1997).
The more sophisticated Huber-White robust standard errors are more accurate
for the variance component estimates, but require at least 100 clusters (Maas &
Hox, 2004). To our knowledge, the performance of MPML with robust standard
errors under distributional misspecification has not been studied yet.
Bootstrap resampling methods for multilevel data have been developed as
an alternative to ML estimation in the case where the general assumptions
mentioned above are violated. In general, there are three main approaches to
bootstrap: (1) the parametric bootstrap, (2) the nonparametric residual boot-
strap, and (3) the case bootstrap. The parametric bootstrap has the strongest
assumptions, which require that the specifications of the functional form and
the distributions of the residuals are both correct. The residual bootstrap only
requires the correct specification of the functional form. Finally, the case boot-
strap has minimum assumptions and only requires the hierarchical structure to
be correctly specified. Van der Leeden, Meijer, and Busing (2008) provided a de-
tailed discussion of the systematic development of bootstrap resampling methods
for multilevel models. It has been shown that bootstrap methods could provide
accurate confidence intervals for fixed effect estimates when the distribution of
the residuals are highly skewed at all levels (Carpenter, Goldstein, & Rasbash,
2003). In addition, applications to small area estimation showed that the boot-
strap method could produce sensible estimates for standard errors for shrinkage
estimates of small area means based on generalized linear mixed models (e.g.,
Booth, 1995; Hall & Maiti, 2006; Lahiri, 2003).
Given the advantages of multilevel bootstrap resampling under conditions
with distributional assumption violation and small sample sizes, it is useful
to extend the method to accommodate multilevel data with sampling weights.
Research in this area is limited and existing methods only use the case boot-
strap approach (Grilli & Pratesi, 2004; Kovacevic, Huang, & You, 2006; Wang
& Thompson, 2012) . Although the case bootstrap is more robust to assumption
violations than residual bootstrap, it is typically less efficient. Some studies have
shown that case bootstrap performed worse than residual bootstrap even when
the assumptions were violated (Efron & Tibshirani, 1993; Van der Leeden et al.,
2008). Hence the purpose of this paper is to propose a weighted nonparamet-
ric residual bootstrap procedure for multilevel modeling with sampling weights.
The proposed procedure is an extension of the nonparametric residual bootstrap
procedure developed by Carpenter et al. (2003). With a Monte Carlo simula-
tion, we examined the performance of the proposed bootstrap method in terms
of parameter estimates and statistical inferences under a variety of conditions.
The outline of the paper is as follows. First, we briefly discuss sampling
weights for multilevel models, followed by a review of existing bootstrap methods
for multilevel data. Next, we provide details of the proposed procedure followed
by a demonstration of the method using real data. Then we present the sim-
ulation study to examine the performance of the proposed bootstrap method.
Finally, the findings are summarized and discussed.
The sampling weight for school j is wj = 1/pj, where pj is the probability that
school j is selected. The conditional sampling weight for student i within school j
is wi|j = 1/pi|j, where pi|j is the probability that the student is selected given
that school j is sampled. The unconditional sampling weight for an
individual student is wij = wj × wi|j.
dependent variable after conditioning on the covariates in the model, they are
called informative weights (Pfeffermann, 1993). For example, if students with
lower achievement have a higher probability of being sampled controlling for the
predictors Xij and Xj , then the sampling weights are informative. Informative
sampling weights should be incorporated in statistical inferences to avoid bias
in estimates or poor performance of test statistics and confidence intervals. For
multilevel models, the sampling weights at each level need to be taken into ac-
count when they are informative, to ensure that the average association between
the predictors and the outcome in the population of students as well as the
variance and covariance components of school random effects can be accurately
estimated. One approach to incorporating the sampling weights is to use multilevel
pseudo maximum likelihood (MPML) estimation, which defines the pseudo-likelihood
function as

l(θ) = ∏_{j=1}^{J} [ ∫ ∏_{i=1}^{nj} f(Yij | Xij, µj, β1)^{wi|j} q(µj | Xj, β2) dµj ]^{wj}.
Extant literature has shown that the level-1 weights should be scaled in
order to reduce the bias of variance component estimates and standard error
estimates of fixed effects when cluster sizes are not large (e.g., Pfeffermann et
al., 1998; Potthoff, Woodbury, & Manton, 1992; Stapleton, 2002). There are two
commonly used scaling methods: relative versus effective sample size scaling. In
relative sample size scaling, the level-1 weights wi|j are multiplied by the scaling
factor

s1j = nj / Σ_{i=1}^{nj} wi|j,

so that the sum of the rescaled level-1 weights within a cluster equals the actual
cluster size nj. In effective sample size scaling, the scaling factor

s1j = Σ_{i=1}^{nj} wi|j / Σ_{i=1}^{nj} wi|j²

is used, so that the sum of the rescaled level-1 weights within a cluster equals
the effective cluster size, defined as

(Σ_{i=1}^{nj} wi|j)² / Σ_{i=1}^{nj} wi|j².

Some simulation studies showed that relative sample size rescaling
works better for informative weights, whereas effective sample size rescaling is
more appropriate for non-informative weights (Pfeffermann et al., 1998). Some
researchers argue that non-informative weights should not be used in multilevel
analyses because they tend to result in a loss of efficiency and even bias in
parameter estimates under some conditions. For example, Asparouhov (2006)
found bias in the estimation of multilevel models when cluster sample size is
small and non-informative within-cluster weights are used.
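To make the two scaling methods concrete, the following R sketch implements them
for the level-1 weights of a single cluster; the helper scale_weights and the
example weights are our illustration, not code from the paper.

# Rescale the conditional level-1 weights w_{i|j} within one cluster.
scale_weights <- function(w, method = c("relative", "effective")) {
  method <- match.arg(method)
  s1j <- switch(method,
    relative  = length(w) / sum(w),   # rescaled weights sum to the cluster size n_j
    effective = sum(w) / sum(w^2))    # rescaled weights sum to the effective size
  w * s1j
}

w <- c(1.2, 0.8, 1.5, 2.0, 0.5)
sum(scale_weights(w, "relative"))     # = 5, the actual cluster size
sum(scale_weights(w, "effective"))    # = sum(w)^2 / sum(w^2), the effective size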
However, in practical applications, choosing the right scaling method may
be challenging. Pfeffermann (1993) described a general method for testing the
informativeness of the weights. Asparouhov (2006) proposed a simpler method
based on the informative index, and recommended to consider both the value of
the informative index and Pfeffermann’s test, the invariance of selection mech-
anism across clusters, and the average cluster size when determining weighting
in multilevel modeling.
Depending on whether and what parametric assumptions are involved, there are
multiple approaches to bootstrapping (Davison & Hinkley, 1997), and addi-
tional care is needed to address the dependencies in the data when resampling
with multilevel data (Van der Leeden et al., 2008). Below we first provide a
brief summary of the common bootstrap procedures for multilevel data in gen-
eral (i.e., the parametric bootstrap, the residual bootstrap, and the case boot-
strap) and then focus on the bootstrap method for multilevel data with sampling
weights. Readers should consult Davison and Hinkley (1997), Goldstein (2011),
and Van der Leeden et al. (2008) for more detailed reviews of the statistical
theory of multilevel bootstrapping methods.
There are two variants of the case bootstrap (Davison & Hinkley, 1997): (a) to
resample with replacement intact clusters but no resampling within a cluster,
and (b) to first resample the clusters, and within each cluster resample with
replacement the units. Both Davison and Hinkley (1997) and Goldstein (2011)
recommended (a) over (b).
A few previous studies have examined these three bootstrap methods for
multilevel analyses. Seco, García, García, and Rojas (2013) showed that the
residual bootstrap produced more precise estimates, in terms of smaller root
mean squared errors, for fixed effects than restricted maximum likelihood. On
the other hand, because the case bootstrap makes fewer assumptions than the
parametric and the residual bootstraps, it requires more information from the
data. As such, previous literature found that its performance was poor compared
to the other two methods, even when the assumptions for the latter two meth-
ods were violated (Efron & Tibshirani, 1993; Van der Leeden et al., 2008). On
the other hand, Thai, Mentré, Holford, Veyrat-Follet, and Comets (2014) found
that in longitudinal linear-mixed models where cluster size is constant, residual
bootstrap and case bootstrap performed similarly when there were at least 100
individuals (i.e., J = 100).
Both Grilli and Pratesi (2004) and Kovacevic et al. (2006) noted that the
steps concerning the level-1 units in their procedures can be omitted when the
sampling fraction is low at the cluster level. Kovacevic et al. (2006) also showed
that the accuracy and stability of variance estimation improved when using the
relative within-cluster weights (i.e., the sum of the rescaled level-1 weights within
a cluster equals the actual cluster size) as compared to the original unscaled
within-cluster weights. However, to the best of our knowledge, these methods
have not been developed into statistical packages that can be easily accessed by
applied researchers.
Step 6: Repeat steps 2–5 to obtain B sets of bootstrap parameter estimates.
4.2 Illustration
We illustrate the proposed method with a random intercept model of the form

Mathij = β0 + β1 Genderij + β2 SESj + u0j + eij,

where i indexes students and j indexes schools, and u0j represents the random effect
associated with the intercept. The main parameters of interest are the average
effects of gender (β1 ) and school SES (β2 ) on students’ math achievement in
the population of 15-year-old students in the United States. Although we used a
random intercept model in this demonstration, researchers could further examine
whether the association between student gender and achievement varies across
schools by adding a random effect associated with the slope of gender that varies
across schools (i.e., a random slope model).
The US sample consists of 2135 students from 145 schools. 74% of the students had
complete data on both ISEI and math, while 26% had at least one missing value
on the two variables. After removing cases with missing data, the final analysis
sample consists of 1578 students from 145 schools. The cluster size ranged
from 1 to 20, with a first quartile of 8, a median of 12, and a third quartile of
14. To determine the degree to which the weights were informative, we followed
the recommendation of Asparouhov (2006) and computed the informative index as

|µw − µ0| / √υ0,

where µw is the weighted mean of the dependent variable, µ0 is the unweighted
mean, and υ0 is the unweighted variance. The informative index for math was 0.03,
indicating that the sampling weights were only very slightly informative.
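For readers who wish to compute this index themselves, a minimal R sketch follows;
the function informative_index and the simulated data are our illustration, not
code from the paper.

# Informativeness index of Asparouhov (2006), as defined above:
# |weighted mean - unweighted mean| / sqrt(unweighted variance).
informative_index <- function(y, w) {
  abs(weighted.mean(y, w) - mean(y)) / sd(y)
}

set.seed(2)
y <- rnorm(500, mean = 50, sd = 10)  # outcome (illustrative)
w <- runif(500, 0.5, 2)              # sampling weights (illustrative)
informative_index(y, w)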
The bootstrap estimates were obtained using the researcher-developed R package
bootmlm (see the Appendix for the R code). As a comparison, the model was also
estimated using unweighted ML and using MPML with relative and effective weights,
respectively. The MPML estimates were obtained using Mplus 8.2 (Muthén &
Muthén, 1998; see Appendix B for the Mplus code). The ML estimates were
obtained using the lme4 package in R (Bates, Maechler, Bolker, & Walker, 2015).
Percentile confidence intervals were computed for the bootstrap method, with the
α/2 and 1 − α/2 percentiles of the bootstrap distribution serving as the interval limits.
Table 1. ML, MPML, and Bootstrap Results Based on the PISA Data

                            Estimate     SE       95% CI
Unweighted ML
  Intercept                  74.33      2.49     [69.45, 79.20]
  Gender                     -1.60      0.66     [-2.88, -0.31]
  ISEI_m                      0.16      0.05     [0.06, 0.26]
  Variance: School            9.43      3.02     [4.35, 16.48]
  Variance: Residual        162.40      6.06     [151.07, 174.87]
  Conditional ICC             0.06
MPML (Effective Weights)
  Intercept                  80.42      5.52     [69.59, 91.24]
  Gender                     -2.43      1.16     [-4.70, -0.16]
  ISEI_m                      0.06      0.12     [-0.17, 0.28]
  Variance: School           10.86      9.03     [2.12, 55.41]
  Variance: Residual        152.47     24.30     [111.56, 208.38]
  Conditional ICC             0.07
Bootstrap
  Intercept                  74.94      2.51     [70.17, 80.18]
  Gender                     -1.56      0.67     [-2.85, -0.17]
  ISEI_m                      0.16      0.05     [0.05, 0.26]
  Variance: School            7.42      2.68     [2.23, 13.02]
  Variance: Residual        162.51      9.95     [144.5, 184.0]
  Conditional ICC             0.04
The slope of school SES (ISEI_m) was statistically significant based on the ML
and the bootstrap results, but non-significant based on MPML.
From this particular sample and model, we obtained inconsistent results from
the bootstrap and the MPML methods. We suspected that the MPML results
might not be trustworthy because the specific condition of this sample (i.e.,
small cluster size, low ICC, and very slight informativeness) has been shown
to be unfavorable to MPML (e.g., Asparouhov, 2006). However, it is unknown
whether the performance of the bootstrap method is acceptable, thus a Monte
Carlo simulation is needed to assess the performance of these methods under
various conditions.
5 Simulation
5.1 Data Generation
To evaluate the performance of the weighted bootstrap procedure in accounting
for nonrandom sampling, we used R 3.5.0 (R Core Team, 2018) to simulate
two-level data mimicking the data structure of students nested in schools. The
population models were either (a) a random intercept model or (b) a random
slopes model. The models include one level-1 predictor such as student SES
(denoted as X1ij ) and one level-2 predictor such as school SES (denoted as X2j ).
Because multilevel modeling is a model-based technique usually justified by a
superpopulation model (Cochran, 1977; Lohr, 2010), the data generating model
is treated as the superpopulation, and in each replication, we first generated a
finite population with Jpop = 500 clusters and npop = 100 observations for each
cluster.
When generating a finite population based on the random intercept model
(see Equation 2), we simulated X2j from N (0, 1) distributions and the cluster-
level random intercept effect u0j from either normal distributions or scaled χ2 (df
= 2) distributions with mean 0 and variance τ , depending on the simulation
condition described in the next section. We then simulated npop × Jpop values
of X1ij from N (0, 1) distributions and eij from either normal distributions or
scaled χ2 (df = 2) distributions with mean 0 and variance σ, depending on the
simulation condition. For all simulation conditions, we set β0 = 0.5, β1 = β2 = 1,
and the total variance τ +σ = 2.5. The outcome was computed based on Equation
(2).
When generating a finite population based on the random slopes model, the
following equation was used:

Yij = β0 + β1X1ij + β2X2j + u0j + u1jX1ij + eij,

where u0j and u1j represent the random effects associated with the intercept and
the slope of X1ij, respectively. We simulated u0j and u1j from a bivariate normal
distribution with mean 0 and variance-covariance matrix

[ τ00  τ01 ]
[ τ01  τ11 ],

in which τ00 represents the variance of the random intercepts, τ11 the variance of
the random slopes of X1ij, and τ01 the covariance between the random intercepts
and the random slopes. The magnitude of τ00 depends on the simulation condition,
and the magnitude of τ11 is half of τ00 because the variance of random slopes is
typically smaller than the variance of random intercepts. The covariance is
computed as τ01 = ρ√(τ00 τ11), where ρ denotes the correlation between the random
intercepts and the random slopes and was set at 0.5 to represent a moderate
correlation.
After simulating the finite populations, we first sampled J clusters with a
sampling fraction f according to a certain selection mechanism depending on the
simulation condition. Then in each cluster we randomly sampled n observations
with the same sampling fraction f according to a certain selection mechanism
depending on the simulation condition.
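As an illustration of this data-generating scheme, the R sketch below generates one
finite population under the random intercept model. It is our reconstruction of the
description above, assuming ICC = τ/(τ + σ) with total variance 2.5; the function
name gen_population is ours.

# Generate a finite population: J_pop clusters of n_pop observations each,
# with beta0 = 0.5, beta1 = beta2 = 1, and tau + sigma = 2.5.
gen_population <- function(J_pop = 500, n_pop = 100, icc = 0.2, skewed = FALSE) {
  tau   <- 2.5 * icc    # random intercept variance
  sigma <- 2.5 - tau    # level-1 residual variance
  x2 <- rnorm(J_pop)                                  # cluster-level predictor
  u0 <- if (skewed) sqrt(tau / 4) * (rchisq(J_pop, df = 2) - 2)
        else rnorm(J_pop, sd = sqrt(tau))             # scaled chi-square(2) or normal
  dat <- data.frame(j  = rep(seq_len(J_pop), each = n_pop),
                    x1 = rnorm(J_pop * n_pop))        # level-1 predictor
  e <- if (skewed) sqrt(sigma / 4) * (rchisq(nrow(dat), df = 2) - 2)
       else rnorm(nrow(dat), sd = sqrt(sigma))
  dat$x2 <- x2[dat$j]
  dat$y  <- 0.5 + dat$x1 + dat$x2 + u0[dat$j] + e
  dat
}

pop <- gen_population(icc = 0.2, skewed = TRUE)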
When selection was informative, units were stratified by the sign of the corresponding
residual, e.g., eij ≥ 0 (stratum 1) versus eij < 0 (stratum 2), and then sampled
without replacement according to the 7:3 ratio of sampling probability. The informative index was
about 0.17 when informative selection occurred at level-1 only, 0.09 when at level-
2 only, and 0.27 when at both levels based on the random intercept models. These
values represent slight to moderate informativeness according to Asparouhov
(2006).
Combining the five design factors, there are a total of 48 data conditions
(3 ICCs × 2 sampling fractions × 2 distributions × 2 between-cluster selection
mechanisms × 2 within-cluster selection mechanisms) for the random intercept
models and 24 conditions (3 ICCs × 2 sampling fractions × 2 between-cluster
selection mechanisms × 2 within-cluster selection mechanisms) for the random
slopes models. We conducted 500 replications for each simulation condition. For
each generated data set, three estimators were applied: the proposed bootstrap
method (using the R package bootmlm), MPML with effective weights (using
Mplus 8.2 for the random intercept models and Stata 16 for the random slopes
models), and unweighted maximum likelihood (using the R package lme4 ).
5.3 Analysis
For each parameter in the models (including both fixed effects and variance com-
ponents), we examined the relative bias of the point estimate and the coverage
rate of the 95% confidence intervals. For the bootstrap method, we used the 2.5th
and 97.5th percentiles of the empirical sampling distribution as the lower and upper
boundaries of the 95% confidence interval. Following Hoogland and Boomsma
(1998), relative biases of point estimates are considered acceptable if their mag-
nitudes are less than 0.05. The coverage rate of a 95% confidence interval should
be approximately equal to 95%, with a margin of error of 1.9% based on 500
replications. Hence coverage rates between 93% and 97% are acceptable.
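These two criteria are straightforward to compute; the following R sketch is our
illustration (the argument names est and boot_mat are hypothetical). It computes
the relative bias of a vector of replication estimates and the coverage of percentile
intervals built from a matrix of bootstrap draws.

# Relative bias of point estimates across replications.
relative_bias <- function(est, truth) (mean(est) - truth) / truth

# Coverage of nominal 95% percentile intervals; boot_mat is replications x B.
coverage <- function(boot_mat, truth, level = 0.95) {
  a  <- (1 - level) / 2
  ci <- t(apply(boot_mat, 1, quantile, probs = c(a, 1 - a)))
  mean(ci[, 1] <= truth & truth <= ci[, 2])
}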
5.4 Results
5.4.1 Random intercept models Tables 2 to 5 show the relative bias and
coverage rate for parameter estimates under all conditions based on the random
intercept models. The relative biases for the slope of the level-1 predictor X1
and the slope of the level-2 predictor X2 are not shown in the tables because
they were close to zero for all conditions. In addition, the coverage rate for the
slope of X1 was close to 95% under all conditions, therefore it was not included
in the tables.
Intercept. As shown by the relative biases of the ML estimates, ignoring
sampling weights when the selection mechanism was informative caused moderate
to large relative biases, ranging from 0.14 to 1.38 (see Tables 2 and 3). As a
result of the biased point estimates, the coverage rates of the confidence intervals
for the ML estimates were also poor under those conditions, ranging from 0.00 to
0.85 (see Tables 4 and 5).
MPML successfully reduced the relative biases to an acceptable level under
the majority of conditions; however, there were still small to moderate relative
biases under 11 conditions where the sample size was small and the selection
mechanism was informative at level 1 or at both levels (relative bias ranging from
0.07 to 0.13). As a result, there was slight under-coverage (ranging from 0.88
to 0.92) in about half of those conditions (6 out of 11), mainly when there was
informative selection at both levels.
The bootstrap method performed the best in terms of relative biases, which were
below 0.05 under all conditions. However, the advantage of the bootstrap method
over MPML was less obvious in terms of coverage, because the bootstrap method
also had slightly low coverage rates (ranging from 0.88 to 0.92) under similar
conditions.
Slope of X2. The relative bias of the estimated slope of X2 was acceptable
for all methods under all conditions. However, the MPML confidence intervals
suffered from slight under-coverage (89%-92%) in 18 conditions, mainly when
sample size was small and selection was informative at level 2 or both levels.
Variance component of the random intercepts (τ). ML estimates had
small relative biases under 18 conditions when there was informative sampling
at level-2 or at both levels. The biases were negative ranging from -0.07 to -0.11
when the distribution was normal, and were positive ranging from 0.10 to 0.12
when the distribution was skewed. MPML suffered from small to moderate biases
(-0.10 to 0.27) under 10 conditions when small sample size was combined with
small to moderate ICCs. It was noted that the two moderately large relative
biases (i.e., 0.25 and 0.27) both occurred when there was informative selection
at level-1 or at both levels. The bootstrap method performed better with only
small positive biases (0.08 to 0.11) under 5 conditions where both ICC and
sample size were small. It was noted that out of the 5 conditions where relative
biases were obvious, one was under the normal distribution and four under the
skewed distribution, indicating that the performance of the bootstrap method
might be sensitive to skewed distributions.
In general, all three methods tended to have under-coverage, with ML being
the worst and the bootstrap being the best. When the distribution was normal, 15
conditions had under-coverage ranging from 0.87 to 0.92 for ML, 14 conditions
ranging from 0.86 to 0.92 for MPML, and 11 conditions ranging from 0.89 to
0.92 for bootstrap. When data were skewed, 23 conditions had under-coverage
ranging from 0.67 to 0.92 for ML, 22 conditions ranging from 0.76 to 0.92 for
MPML, and 15 conditions ranging from 0.81 to 0.92 for bootstrap. For both
MPML and bootstrap, the coverage rate tended to worsen as the sample size
decreased. In addition, when data were skewed, larger ICCs led to lower coverage
rate for MPML.
Level-1 residual variance (σ). Only ML estimates had small negative
relative biases when there was informative selection at level-1 or at both levels.
As a result, ML estimates had severe under-coverage under those conditions,
especially when sample size was large. The performance of ML deteriorated
when the distribution was skewed as there were severe under-coverage across all
conditions.
Although MPML and bootstrap estimates had minimum relative biases, both
had slight under-coverage under certain conditions. Specifically, when the distri-
bution was normal, under-coverage mainly occurred when sample size was small
combined with informative selection at both levels. When the distribution was
skewed, under-coverage mainly occurred when sample size was small and when
the selection was non-informative or only informative at level-2.
Table 2. Relative Bias for the Random Intercept Model Under Normal Distribution
[Table body not recovered.]
5.4.2 Random slopes models Tables 6 to 9 show the relative biases and
coverage rates for parameter estimates under all conditions based on the ran-
dom slopes models. Notably, while convergence was not an issue for ML and
the bootstrap method, MPML estimation suffered from a low convergence rate
(ranging between 0.59 and 0.76) when both ICC and sample size were small.
Intercept. Similar to the pattern under the random intercept models, ML
estimates of the intercept suffered from moderate to large relative biases (ranging
Table 3. Relative Bias for the Random Intercept Model Under χ²(2) Distribution
[Table body not recovered. Table 4, which reported coverage rates for the random
intercept model under the normal distribution, was also not recovered. In Tables 4
and 5, values in bold represent under-coverage or over-coverage (i.e., coverage rate
< 0.93 or > 0.97).]
Table 5. Coverage Rate for the Random Intercept Model Under χ²(2) Distribution

                                     Intercept          X2 Slope           TAU                SIGMA
ICC   Selection Mechanism     f      ML   BOOT MPML     ML   BOOT MPML     ML   BOOT MPML     ML   BOOT MPML
0.05  Non-informative         0.1    0.92 0.92 0.92     0.92 0.92 0.91     0.90 0.93 0.90     0.78 0.91 0.91
      Non-informative         0.5    0.97 0.97 0.97     0.94 0.94 0.94     0.79 0.89 0.92     0.67 0.94 0.94
      Informative at level-1  0.1    0.00 0.94 0.94     0.94 0.94 0.92     0.92 0.92 0.92     0.82 0.96 0.94
      Informative at level-1  0.5    0.00 0.96 0.97     0.94 0.94 0.93     0.81 0.90 0.91     0.00 0.98 0.95
      Informative at level-2  0.1    0.85 0.92 0.93     0.95 0.95 0.90     0.92 0.96 0.95     0.77 0.89 0.90
      Informative at level-2  0.5    0.25 0.95 0.96     0.95 0.95 0.96     0.77 0.94 0.92     0.69 0.93 0.94
      Informative at both     0.1    0.00 0.93 0.91     0.95 0.94 0.93     0.93 0.93 0.95     0.82 0.95 0.95
      Informative at both     0.5    0.00 0.95 0.97     0.95 0.95 0.94     0.78 0.91 0.89     0.00 0.96 0.96
0.2   Non-informative         0.1    0.93 0.93 0.92     0.94 0.94 0.92     0.77 0.83 0.78     0.66 0.91 0.91
      Non-informative         0.5    0.97 0.97 0.97     0.93 0.93 0.92     0.72 0.91 0.91     0.67 0.94 0.94
      Informative at level-1  0.1    0.10 0.95 0.94     0.92 0.92 0.92     0.77 0.83 0.79     0.58 0.96 0.94
      Informative at level-1  0.5    0.00 0.97 0.97     0.93 0.93 0.94     0.73 0.92 0.91     0.00 0.98 0.95
      Informative at level-2  0.1    0.71 0.92 0.92     0.94 0.94 0.93     0.83 0.90 0.83     0.69 0.89 0.90
      Informative at level-2  0.5    0.17 0.97 0.96     0.95 0.94 0.96     0.70 0.94 0.91     0.69 0.93 0.94
      Informative at both     0.1    0.00 0.94 0.92     0.95 0.95 0.93     0.85 0.90 0.86     0.62 0.95 0.94
      Informative at both     0.5    0.00 0.96 0.96     0.95 0.94 0.95     0.71 0.95 0.90     0.00 0.97 0.96
0.5   Non-informative         0.1    0.93 0.93 0.92     0.94 0.94 0.92     0.71 0.81 0.76     0.66 0.91 0.91
      Non-informative         0.5    0.97 0.97 0.96     0.92 0.93 0.92     0.70 0.92 0.91     0.67 0.93 0.94
      Informative at level-1  0.1    0.63 0.93 0.93     0.93 0.92 0.92     0.70 0.81 0.77     0.58 0.95 0.94
      Informative at level-1  0.5    0.11 0.97 0.97     0.93 0.93 0.92     0.70 0.93 0.91     0.00 0.98 0.95
      Informative at level-2  0.1    0.64 0.92 0.92     0.95 0.95 0.94     0.76 0.88 0.82     0.69 0.88 0.90
      Informative at level-2  0.5    0.14 0.97 0.96     0.95 0.94 0.95     0.67 0.94 0.90     0.69 0.93 0.94
      Informative at both     0.1    0.04 0.93 0.94     0.95 0.94 0.93     0.77 0.89 0.81     0.62 0.95 0.94
      Informative at both     0.5    0.00 0.96 0.96     0.94 0.93 0.95     0.67 0.94 0.90     0.00 0.97 0.96
Note. f = sampling fraction. Values in bold represent under-coverage or over-coverage (i.e., coverage rate < 0.93 or > 0.97).
from 0.23 to 1.64) when the selection mechanism was informative (see Table
6). The relative biases based on MPML estimates were acceptable under the
majority of conditions, except for 6 conditions where the sample size was small
and the selection mechanism was informative at level 1 or both levels (relative
bias ranging from 0.12 to 0.14). The bootstrap method performed the best in
terms of relative biases because there were only 3 conditions where small biases
were found (ranging from -0.06 to -0.10).
Table 6. Relative Bias for Fixed Effects Estimates from the Random Slopes Model
Under Normal Distribution
As a result of the biased point estimates based on ML, the coverage rates of
the confidence intervals for the ML estimates were also poor (ranging from 0.00
to 0.61) under informative selection mechanisms (see Table 7). On the other
hand, both MPML and the bootstrap method had the issue of over-coverage
(coverage rate above 0.98) in the majority of the conditions, indicating that the
estimated confidence intervals were wider than expected.
Table 7. Coverage Rate for Fixed Effects Estimates from the Random Slopes Model
Under Normal Distribution
[Table body not recovered. Table 8, which reported relative bias for variance
component estimates from the random slopes model under the normal distribution,
was also not recovered; in Table 8, values in bold represent unacceptably large
relative bias (i.e., absolute value > 0.05).]
Table 9. Coverage Rate for Variance Components Estimates from the Random Slopes Model Under Normal Distribution

                                     TAU00              TAU11              TAU01              SIGMA
ICC   Selection Mechanism     f      ML   BOOT MPML     ML   BOOT MPML     ML   BOOT MPML     ML   BOOT MPML
0.05  Non-informative         0.1    0.95 0.93 0.71     0.98 1.00 0.71     0.97 0.95 0.87     0.94 0.93 0.94
      Non-informative         0.5    0.96 0.85 0.09     0.94 0.92 0.41     0.95 0.76 0.92     0.95 0.70 0.84
      Informative at level-1  0.1    0.95 0.95 0.86     0.99 1.00 0.81     0.96 0.94 0.66     0.68 0.88 0.95
      Informative at level-1  0.5    0.96 0.85 0.11     0.95 0.92 0.39     0.96 0.75 0.95     0.02 0.71 0.91
      Informative at level-2  0.1    0.96 0.95 0.68     0.98 0.99 0.70     0.94 0.92 0.86     0.96 0.83 0.94
      Informative at level-2  0.5    0.85 0.80 0.16     0.90 0.87 0.44     0.87 0.68 0.93     0.94 0.68 0.86
      Informative at both     0.1    0.94 0.93 0.84     0.98 0.99 0.77     0.96 0.91 0.66     0.68 0.76 0.91
      Informative at both     0.5    0.85 0.83 0.20     0.90 0.86 0.46     0.85 0.65 0.96     0.03 0.67 0.89
0.2   Non-informative         0.1    0.94 0.93 0.78     0.96 0.94 0.85     0.93 0.94 0.97     0.95 0.79 0.91
      Non-informative         0.5    0.96 0.77 0.56     0.94 0.83 0.16     0.97 0.31 0.82     0.95 0.54 0.84
      Informative at level-1  0.1    0.95 0.93 0.84     0.96 0.96 0.85     0.93 0.93 0.92     0.73 0.78 0.94
      Informative at level-1  0.5    0.96 0.77 0.57     0.94 0.82 0.15     0.96 0.28 0.86     0.02 0.60 0.91
      Informative at level-2  0.1    0.86 0.87 0.77     0.90 0.88 0.83     0.92 0.87 0.95     0.96 0.70 0.91
      Informative at level-2  0.5    0.80 0.75 0.64     0.84 0.81 0.24     0.80 0.29 0.87     0.94 0.55 0.86
      Informative at both     0.1    0.84 0.86 0.85     0.91 0.91 0.82     0.90 0.87 0.88     0.72 0.70 0.92
      Informative at both     0.5    0.80 0.75 0.65     0.84 0.79 0.25     0.80 0.30 0.89     0.03 0.57 0.89
0.5   Non-informative         0.1    0.95 0.90 0.85     0.96 0.93 0.75     0.94 0.86 0.94     0.95 0.66 0.91
      Non-informative         0.5    0.96 0.75 0.72     0.93 0.77 0.11     0.97 0.17 0.75     0.95 0.50 0.84
      Informative at level-1  0.1    0.96 0.89 0.86     0.95 0.92 0.76     0.93 0.86 0.94     0.73 0.71 0.94
      Informative at level-1  0.5    0.85 0.75 0.70     0.93 0.77 0.10     0.95 0.17 0.78     0.02 0.57 0.91
      Informative at level-2  0.1    0.83 0.82 0.78     0.87 0.85 0.75     0.88 0.78 0.90     0.97 0.62 0.91
      Informative at level-2  0.5    0.79 0.75 0.75     0.82 0.77 0.19     0.77 0.19 0.83     0.94 0.52 0.85
      Informative at both     0.1    0.80 0.83 0.82     0.88 0.85 0.75     0.84 0.76 0.91     0.72 0.64 0.92
      Informative at both     0.5    0.79 0.75 0.76     0.81 0.75 0.20     0.77 0.20 0.82     0.02 0.54 0.89
Note. f = sampling fraction. Values in bold represent under-coverage or over-coverage (i.e., coverage rate < 0.93 or > 0.97).
ICC was moderate and large. Comparing the three methods, ML showed the
least amount of bias across all conditions.
In terms of the confidence intervals, MPML had the worst performance because
of severe under-coverage (0.10–0.46) when sample size was large. The bootstrap
confidence intervals showed some under-coverage (0.77–0.92) across the conditions.
The ML confidence intervals had the best performance, showing slight
under-coverage (0.81 to 0.91) when there was informative sampling at level-2 or
at both levels.
Covariance of the random intercepts and the random slopes (τ01).
The ML estimate of τ01 showed small to moderate negative biases (-0.09 to -0.36)
when there was informative sampling at level-2 or at both levels. The MPML
estimates showed moderate negative biases across all conditions, ranging from -
0.37 to -0.61. The bootstrap estimates showed small to moderate negative biases,
with the magnitude decreasing from -0.34 to -0.09 as ICC increased from 0.05
to 0.5.
The ML confidence intervals had slight under-coverage (0.77 to 0.92) when
there was informative sampling at level-2 or at both levels. Despite the moderate
negative biases in the point estimates, MPML confidence intervals only showed
slight under-coverage in most of the conditions (0.66 to 0.92). In general, the
bootstrap confidence intervals suffered from under-coverage (0.17 to 0.92), and
the degree of under-coverage was severe (0.17 to 0.31) when sample sizes were
large and ICCs were moderate to large.
Level-1 residual variance (σ²). ML estimates had small negative relative
biases (-0.09) when there was informative selection at level-1 or at both levels.
The bootstrap estimates showed small positive relative biases (0.07 to 0.12) when
sample size was small and ICC was moderate to large. MPML estimates had the
best performance with little bias across all conditions.
The ML-based confidence intervals showed under-coverage when there was
informative selection at level-1 or at both levels. The degree of under-coverage
was severe (0.02 to 0.03) when sample size was large. The bootstrap confidence
interval had moderate under-coverage across all conditions, ranging from 0.50
to 0.88. The MPML confidence intervals had slight under-coverage across all
conditions, ranging from 0.84 to 0.91.
Conditions in the simulation study that were similar to the specific condition of
this sample (i.e., small cluster size, low ICC, very slight informativeness, and slight
distributional violation) showed more favorable results for the bootstrap method
than for MPML.
6.1 Implications
The weighted residual bootstrap method provides a robust alternative to MPML.
Applied researchers can use the bootstrap approach when the traditional MPML
estimation fails to converge or when there is severe violation of the normal-
ity assumption. In analyses of random intercept models, the weighted residual
bootstrap method is preferred to MPML when the effect of level-2 predictors
(e.g., school SES), or the variance of the random intercept (e.g., variance of
school mean achievement) are of interest and when both sample sizes and ICCs
are small. In random slopes models, the bootstrap method has advantages over
MPML in the point estimates and the confidence interval estimates of the slopes
of level-2 predictors, as well as the variance component estimates associated with
the random intercept and the random slopes (e.g., variance of the association
of student SES and student achievement across schools). However, the statisti-
cal inferences for the covariance component (e.g., the covariance between school
mean achievement and the slope of student SES and student achievement) based
on the bootstrap method might not be trustworthy.
It is recommended that researchers conduct sensitivity analyses using differ-
ent methods. Discrepancies among the results may indicate that the conditions
for MPML to work properly are not satisfied. The weighted residual bootstrap
method is implemented in the development version of the R package bootmlm,
which has the capacity to analyze two-level linear random intercept and random
coefficients models with sampling weights.
References
Asparouhov, T. (2005). Sampling weights in latent variable mod-
eling. Structural Equation Modeling, 12 (3), 411–434. doi:
https://doi.org/10.1207/s15328007sem1203 4
Asparouhov, T. (2006). General multi-level modeling with sampling weights.
Communications in Statistics—Theory and Methods, 35 (3), 439–460. doi:
https://doi.org/10.1080/03610920500476598
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-
Effects Models Using lme4. Journal of Statistical Software, 67 (1), 1–48.
doi: https://doi.org/10.18637/jss.v067.i01
Booth, J. (1995). Bootstrap methods for generalized linear mixed models
with applications to small area estimation. In G. Seeber, J. Francis,
R. Hatzinger, & G. Steckel-Berger (Eds.), Statistical modelling (pp. 43–
51). New York, NY: Springer. doi: https://doi.org/10.1007/978-1-4612-
0789-4 6
Carpenter, J. R., Goldstein, H., & Rasbash, J. (2003). A novel bootstrap
procedure for assessing the relationship between class size and achieve-
ment. Journal of the Royal Statistical Society: Series C (Applied Statis-
tics), 52 (4), 431–443. doi: https://doi.org/10.1111/1467-9876.00415
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York, NY: Wiley.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and
their application. Cambridge, UK: Cambridge University. doi:
https://doi.org/10.1017/cbo9780511802843
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York,
NY: Chapman and Hall. doi: https://doi.org/10.1201/9780429246593
Goldstein, H. (1986). Multilevel mixed linear model analysis using it-
erative generalized least squares. Biometrika, 73 (1), 43–56. doi:
https://doi.org/10.1093/biomet/73.1.43
Goldstein, H. (2011). Bootstrapping in multilevel models. In J. J. Hox
& J. K. Roberts (Eds.), Handbook of advanced multilevel analysis
(p. 163–171). New York, NY: Routledge.
Goldstein, H., Carpenter, J., & Kenward, M. G. (2018). Bayesian models for
weighted data with missing values: a bootstrap approach. Journal of the
Royal Statistical Society: Series C (Applied Statistics), 67 (4), 1071–1081.
doi: https://doi.org/10.1111/rssc.12259
Grilli, L., & Pratesi, M. (2004). Weighted estimation in multilevel ordinal and
binary models in the presence of informative sampling designs. Survey
Methodology, 30 (1), 93–103.
Hall, P., & Maiti, T. (2006). On parametric bootstrap methods for small
area prediction. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology), 68 (2), 221–238. doi: https://doi.org/10.1111/j.1467-
9868.2006.00541.x
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance
structure modeling. Sociological Methods and Research, 26 (3), 329–367. doi:
https://doi.org/10.1177/0049124198026003003
Kovacevic, M. S., Huang, R., & You, Y. (2006). Bootstrapping for variance
estimation in multi-level models fitted to survey data. ASA Proceedings of
the Survey Research Methods Section.
Kovacevic, M. S., & Rai, S. N. (2003). A pseudo maximum likelihood approach to
multi-level modeling of survey data. Communications in Statistics—Theory
and Methods, 32, 103–121. doi: https://doi.org/10.1081/sta-120017802
Lahiri, P. (2003). On the impact of bootstrap in survey sampling
and small-area estimation. Statistical Science, 18 (2), 199–210. doi:
https://doi.org/10.1214/ss/1063994975
Lohr, S. L. (2010). Sampling: Design and analysis (2nd ed.). Boston, MA:
Cengage.
Maas, C. J., & Hox, J. J. (2004). The influence of violations of as-
sumptions on multilevel parameter estimates and their standard er-
rors. Computational Statistics & Data Analysis, 46 , 427–440. doi:
https://doi.org/10.1016/j.csda.2003.08.006
Muthén, L. K., & Muthén, B. O. (1998). Mplus user’s guide (8th ed.). Los
Angeles, CA: Muthén & Muthén.
Organization for Economic Co-operation and Development. (2000). Manual
for the PISA 2000 database [Computer software manual]. Retrieved from
http://www.pisa.oecd.org/dataoecd/53/18/33688135.pdf
Pfeffermann, D. (1993). The role of sampling weights when modeling survey
data. International Statistical Review, 61 (2), 317–337. doi:
https://doi.org/10.2307/1403631
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J.
(1998). Weighting for unequal selection probabilities in multilevel models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology),
60 (1), 23–40. doi: https://doi.org/10.1111/1467-9868.00106
Potthoff, R. F., Woodbury, M. A., & Manton, K. G. (1992). "Equivalent sample
size" and "equivalent degrees of freedom" refinements for inference using
survey weights under superpopulation models. Journal of the American
Statistical Association, 87 (418), 383–396. doi: https://doi.org/10.2307/2290269
R Core Team. (2018). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex
survey data. Journal of the Royal Statistical Society: Series A (Statis-
tics in Society), 169 (4), 805–827. doi: https://doi.org/10.1111/j.1467-
985x.2006.00426.x
Seco, G. V., García, M. A., García, M. P. F., & Rojas, P. E. L. (2013). Multilevel
bootstrap analysis with assumptions violated. Psicothema, 25 (4), 520–528.
Stapleton, L. (2002). The incorporation of sample weights into multilevel
structural equation models. Structural Equation Modeling, 9 (4), 475–502. doi:
https://doi.org/10.1207/s15328007sem0904_2
Thai, H. T., Mentré, F., Holford, N. H. G., Veyrat-Follet, C., & Comets, E.
(2014). Evaluation of bootstrap methods for estimating uncertainty of
Appendix
# Unweighted ML, fitted with the lme4 package
library(lme4)
m1 <- lmer(SC17Q01 ~ ISEI_m + male + (1 | Sch_ID),
           data = PISA, REML = FALSE)
SEM using Stata
Meghan K. Cain
Abstract. In this tutorial, you will learn how to fit structural equation
models (SEM) using Stata software. SEMs can be fit in Stata using the
sem command for standard linear SEMs, the gsem command for general-
ized linear SEMs, or by drawing their path diagrams in the SEM Builder.
After a brief introduction to Stata, the sem command will be demon-
strated through a confirmatory factor analysis model, mediation model,
group analysis, and a growth curve model, and the gsem command will
be demonstrated through a random-slope model and a logistic ordinal
regression. Materials and datasets are provided online, allowing anyone
with Stata to follow along.
1 Introduction
To learn more about any command, you can type help followed by the command name in the Command window, and
the Viewer window will open with the help file and provide links to further doc-
umentation. Stata’s documentation consists of over 17,000 pages detailing each
feature in Stata including the methods and formulas and fully worked examples.
There are three ways to fit SEMs in Stata: the sem command, the gsem com-
mand, and through the SEM Builder. The sem command is for fitting standard
linear SEMs. It is quicker and has more features for testing and interpreting
results than gsem. The gsem command is for fitting models with generalized
responses, such as binary, count, or categorical responses, models with random
effects, and mixture models. Both sem and gsem models can be fit via path dia-
grams using the SEM Builder. You can open the SEM Builder window by typing
sembuilder into the Command window. See the interface in Figure 1; click the
tools you need on the left, or type their shortcuts shown in the parentheses. To fit
gsem models, the GSEM button must first be selected. Estimation and diagram
settings can be changed using the menus at the top. The Estimate button fits the
model. Path diagrams can be saved as .stsem files to be modified later, or can be
exported to a variety of image formats (for example see Figure 2). Although this
tutorial will focus on the sem and gsem commands, the Builder shares the same
functionality. You can watch a demonstration with the SEM Builder on the Stat-
aCorp YouTube Channel: https://www.youtube.com/watch?v=Xj0gBlqwYHI
To download the datasets, do-file, and path diagrams, you can type the fol-
lowing into Stata’s Command window:
. net from http://www.stata.com/users/mcain/JBDS_SEM
Clicking on the SEMtutorial link will download the materials to your current
working directory. To open the do-file with the commands we’ll be using, you
can type
. doedit SEMtutorial
Commands can either be executed from the do-file or typed into the Com-
mand window. We’ll start by loading and exploring our first dataset. These
data contain observations on four indicators for socioeconomic status of high
school students as well as their math scores, school types (private or public),
and the student-teacher ratio of their school. Alternatively, we could have used
a summary statistics dataset containing means, variances, and correlations of
the variables rather than observations.
. use math
. codebook, compact
Variable    Obs    Unique    Mean    Min    Max    Label
[codebook output rows not recovered]
Let’s start our analysis by fitting the one-factor confirmatory factor analysis
(CFA) model shown in Figure 2. Using the sem command, paths are specified in
parentheses and the direction of the relationships are specified using arrows, i.e.
(x->y). Arrows can point in either direction, (x->y) or (y<-x). Paths can be
specified individually, or multiple paths can be specified within a single set of
parentheses, (x1 x2 x3 -> y). By default, Stata assumes that all lower-case
variables are observed and uppercase variables are latent. You can change these
settings using the nocapslatent and the latent() options. In Stata, options
are always added after a comma. We’ll see plenty of examples of this later.
[Figure 2. Path diagram of the one-factor CFA model for ses1–ses4, with parameter estimates.]
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Measurement
ses1
SES 1 (constrained)
_cons 1.982659 .0620424 31.96 0.000 1.861058 2.10426
ses2
SES .8481035 .1962358 4.32 0.000 .4634884 1.232719
_cons 2.003854 .0620169 32.31 0.000 1.882303 2.125404
ses3
SES .416385 .1331306 3.13 0.002 .1554539 .6773161
_cons 2.003854 .062017 32.31 0.000 1.882302 2.125405
ses4
[remaining coefficient rows not recovered]
LR test of model vs. saturated: chi2(2) = 11.03 Prob > chi2 = 0.0040
Viewing the results, we see that by default Stata constrained the first factor
loading to be 1 and estimated the variance of the latent variable. If, instead,
we would like to constrain the variance and estimate all four factor loadings, we
could use the var() option. Constraints in any part of the model can be specified
using the @ symbol. To save room, syntax and results for this and the remaining
models will be shown on their path diagrams; see Figure 3.
[Figure 3. Path diagram with syntax and estimates for the CFA model with the latent variance constrained to 1.]
[Figure 4. Path diagram with syntax and estimates for the model adding math as an outcome of SES.]
To get fit indices for our model, we can use the postestimation command
estat gof after any sem model. Add the stats(all) option to see all fit indices.
. estat gof, stats(all)
Likelihood ratio
chi2_ms(5) 17.689 model vs. saturated
p > chi2 0.003
chi2_bs(10) 150.126 baseline vs. saturated
p > chi2 0.000
Population error
RMSEA 0.070 Root mean squared error of approximation
90% CI, lower bound 0.037
upper bound 0.107
pclose 0.147 Probability RMSEA <= 0.05
Information criteria
AIC 11157.441 Akaike's information criterion
BIC 11221.219 Bayesian information criterion
Baseline comparison
CFI 0.909 Comparative fit index
TLI 0.819 Tucker-Lewis index
Size of residuals
SRMR 0.040 Standardized root mean squared residual
CD 0.532 Coefficient of determination
When data are not normally distributed, the vce(sbentler) option can be used to
obtain Satorra-Bentler adjusted standard errors and fit indices. This option still
uses maximum likelihood estimation, the default, but adjusts the standard errors
and the fit indices. Alternatively, estimation can be changed to asymptotic
distribution-free or to full-information maximum likelihood for missing values using
the method(adf) or method(mlmv) options, respectively. For this example, we'll
use the Satorra-Bentler adjustment (Satorra & Bentler, 1994). First, we'll store
the current model to use again later.
. estimates store m1
. sem (SES -> ses1-ses4 math), vce(sbentler)
Endogenous variables
Measurement: ses1 ses2 ses3 ses4 math
Exogenous variables
Latent: SES
Fitting target model:
Iteration 0: log pseudolikelihood = -5564.2324
Iteration 1: log pseudolikelihood = -5563.7459
Iteration 2: log pseudolikelihood = -5563.7204
Iteration 3: log pseudolikelihood = -5563.7204
Structural equation model Number of obs = 519
Estimation method: ml
Log pseudolikelihood = -5563.7204
( 1) [ses1]SES = 1
Satorra-Bentler
Coefficient std. err. z P>|z| [95% conf. interval]
Measurement
ses1
SES 1 (constrained)
_cons 1.982659 .0621024 31.93 0.000 1.860941 2.104377
ses2
SES .9278593 .169484 5.47 0.000 .5956767 1.260042
_cons 2.003854 .0620769 32.28 0.000 1.882185 2.125522
ses3
SES .620192 .1438296 4.31 0.000 .3382912 .9020928
_cons 2.003854 .0620769 32.28 0.000 1.882185 2.125522
ses4
SES .7954927 .1580751 5.03 0.000 .4856712 1.105314
_cons 2.003854 .0620769 32.28 0.000 1.882185 2.125522
math
SES 6.858402 1.335695 5.13 0.000 4.240488 9.476315
_cons 51.72254 .4700825 110.03 0.000 50.8012 52.64389
LR test of model vs. saturated: chi2(5) = 17.69 Prob > chi2 = 0.0034
Satorra-Bentler scaled test: chi2(5) = 17.80 Prob > chi2 = 0.0032
Likelihood ratio
chi2_ms(5) 17.689 model vs. saturated
p > chi2 0.003
chi2_bs(10) 150.126 baseline vs. saturated
p > chi2 0.000
Satorra-Bentler
chi2sb_ms(5) 17.804
p > chi2 0.003
chi2sb_bs(10) 153.258
p > chi2 0.000
Population error
RMSEA 0.070 Root mean squared error of approximation
90% CI, lower bound 0.037
upper bound 0.107
pclose 0.147 Probability RMSEA <= 0.05
Satorra-Bentler
RMSEA_SB 0.070 Root mean squared error of approximation
Information criteria
AIC 11157.441 Akaike's information criterion
BIC 11221.219 Bayesian information criterion
Baseline comparison
CFI 0.909 Comparative fit index
TLI 0.819 Tucker-Lewis index
Satorra-Bentler
CFI_SB 0.911 Comparative fit index
TLI_SB 0.821 Tucker-Lewis index
Size of residuals
SRMR 0.040 Standardized root mean squared residual
CD 0.532 Coefficient of determination
The SB-adjusted CFI, 0.91, is still below the conventional 0.95 cutoff,
suggesting that model fit could be improved. We can use
estat mindices to compute modification indices that can be used to check for
paths and covariances that could be added to the model to improve fit. First,
we’ll need to restore our original model.
. estimates restore m1
. estat mindices
Modification indices

                                            Standard
                 MI    df    P>MI     EPC     EPC
(output omitted)
The MI, df, and P>MI are the estimated chi-squared test statistic, degrees
of freedom, and p-value of the score test of the statistical significance of
the constrained parameter. By default, only parameters that would significantly
(p < 0.05) improve the model are reported. The EPC is the amount that the
parameter is expected to change if the constraint is relaxed. According to these
results, there is a stronger relationship between the first and second
indicators for SES than our model implies, MI = 16.57, p < 0.001.
We could consider adding a residual covariance between these two indicators to
our model using the cov() option. We use the e. prefix to refer to a residual
variance of an endogenous variable; see Figure 5.
Figure 5. The model with a residual covariance between ses1 and ses2 added via
the cov() option; syntax and estimates are shown on the path diagram.
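In command form, this model could be fit with something like:
. sem (SES -> ses1-ses4 math), cov(e.ses1*e.ses2)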
One potential explanation of the effect of SES on math scores is that
students of higher SES attend schools with smaller student-teacher ratios.
We can test this hypothesis using the mediation model shown in Figure 6. Here,
we get estimates of the direct effects between each of our variables, but what
we would really like to test is the indirect effect between SES and math through
ratio. We can get direct effects, indirect effects, and total effects of mediation
models with the postestimation command estat teffects.
Figure 6. Mediation model in which SES affects math directly and indirectly
through ratio; estimates are shown on the path diagram.
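A sketch of the corresponding command for this mediation model:
. sem (SES -> ses1-ses4) (SES -> ratio) (SES ratio -> math)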
. estat teffects
Direct effects
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
ratio
SES -1.367306 .5562429 -2.46 0.014 -2.457522 -.2770903
math
ratio -.2256084 .1026128 -2.20 0.028 -.4267257 -.024491
SES 6.908564 1.583778 4.36 0.000 3.804417 10.01271
Measurement
ses1
SES 1 (constrained)
ses2
SES .9450302 .1643867 5.75 0.000 .6228382 1.267222
ses3
SES .6632608 .1725434 3.84 0.000 .3250819 1.00144
ses4
SES .8574695 .2012317 4.26 0.000 .4630625 1.251876
Indirect effects
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
ratio
SES 0 (no path)
math
ratio 0 (no path)
SES .3084758 .1451257 2.13 0.034 .0240346 .5929169
Measurement
ses1
SES 0 (no path)
ses2
SES 0 (no path)
ses3
SES 0 (no path)
ses4
SES 0 (no path)
Total effects
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
ratio
SES -1.367306 .5562429 -2.46 0.014 -2.457522 -.2770903
math
ratio -.2256084 .1026128 -2.20 0.028 -.4267257 -.024491
SES 7.217039 1.599953 4.51 0.000 4.081189 10.35289
Measurement
ses1
SES 1 (constrained)
ses2
SES .9450302 .1643867 5.75 0.000 .6228382 1.267222
ses3
SES .6632608 .1725434 3.84 0.000 .3250819 1.00144
ses4
SES .8574695 .2012317 4.26 0.000 .4630625 1.251876
In the second group of the output, we see that the mediation effect is
statistically significant, z = 2.13, p = 0.034. Because the sampling
distribution of an indirect effect is generally nonnormal, however, we may
prefer to bootstrap this effect rather than rely on the normal-theory test
alone. We can do this with the bootstrap
command. First, we need to get labels for the effects we would like to test. We
can get these by replaying our model results with the coeflegend option. We
can use these labels to construct an expression for the mediation effect that
we’re calling indirect. We put this expression in parentheses after bootstrap
and put any bootstrapping options after a comma; then, we put the model and
its options after a colon. Multiple expressions can be included using multiple
parentheses sets.
. sem, coeflegend
Structural equation model Number of obs = 519
Estimation method: ml
Log likelihood = -7117.1959
( 1) [ses1]SES = 1
Coefficient Legend
Structural
ratio
SES -1.367306 _b[ratio:SES]
_cons 16.75723 _b[ratio:_cons]
math
ratio -.2256084 _b[math:ratio]
SES 6.908564 _b[math:SES]
_cons 55.50311 _b[math:_cons]
Measurement
ses1
SES 1 _b[ses1:SES]
_cons 1.982659 _b[ses1:_cons]
ses2
SES .9450302 _b[ses2:SES]
_cons 2.003854 _b[ses2:_cons]
ses3
SES .6632608 _b[ses3:SES]
_cons 2.003854 _b[ses3:_cons]
ses4
SES .8574695 _b[ses4:SES]
_cons 2.003854 _b[ses4:_cons]
LR test of model vs. saturated: chi2(8) = 21.72 Prob > chi2 = 0.0055
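Assuming 1,000 replications (our choice here), the bootstrap command would look
something like this sketch:
. bootstrap (indirect: _b[ratio:SES]*_b[math:ratio]), reps(1000): ///
        sem (SES -> ses1-ses4) (SES -> ratio) (SES ratio -> math)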
(bootstrap output omitted; the table reports the observed coefficient, bias,
bootstrap standard error, and percentile (P) 95% confidence interval for
indirect)
To examine whether these relationships differ across school types, we can fit
the mediation model as a multiple-group model by adding the group() option to
the sem command; see Figure 7.
Figure 7. The mediation model fit separately for private and public schools;
estimates are shown on the two path diagrams.
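A sketch of the multiple-group command:
. sem (SES -> ses1-ses4) (SES -> ratio) (SES ratio -> math), group(schtype)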
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
math
SES
Private .7043843 .4184641 1.68 0.092 -.1157902 1.524559
Public .2035724 .1710134 1.19 0.234 -.1316076 .5387525
(remaining output omitted)
When fitting a group model, the ginvariant() option controls which classes of
parameters are constrained to be equal across groups (by default, measurement
coefficients and intercepts). The available classes are:
Option       Description
mcoef measurement coefficients
mcons measurement intercepts
merrvar covariances of measurement errors
scoef structural coefficients
scons structural intercepts
serrvar covariances of structural errors
smerrcov covariances between structural and measurement errors
meanex means of exogenous variables
covex covariances of exogenous variables
all all the above
none none of the above
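For example, to constrain only the measurement part of the model to be equal
across groups, a sketch:
. sem (SES -> ses1-ses4) (SES -> ratio) (SES ratio -> math), ///
        group(schtype) ginvariant(mcoef mcons)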
To test group differences in each direct path, we can use the postestimation
command estat ginvariant.
. estat ginvariant
Tests for group invariance of parameters

                          Wald test                Score test
                     chi2    df   p>chi2      chi2    df   p>chi2
Structural
math
ratio 0.001 1 0.9709 . . .
SES 0.005 1 0.9441 . . .
_cons 1.314 1 0.2516 . . .
ratio
SES 1.825 1 0.1768 . . .
_cons 0.011 1 0.9147 . . .
Measurement
ses1
SES . . . 1.832 1 0.1759
_cons . . . 5.997 1 0.0143
ses2
SES . . . 0.072 1 0.7882
_cons . . . 0.341 1 0.5592
ses3
SES . . . 0.049 1 0.8253
_cons . . . 0.634 1 0.4259
ses4
SES . . . 1.945 1 0.1632
_cons . . . 1.149 1 0.2838
These results show Wald tests that evaluate constraining parameters that were
allowed to vary across groups and score tests that evaluate relaxing existing
constraints. Both test whether individual paths differ significantly across
groups. Here, only the intercept of ses1 shows evidence of differing across
school types (p = 0.014).
The last model we will fit using sem is a growth curve model. This will require
a new dataset.
. use crime
. describe
Contains data from crime.dta
Observations: 359
Variables: 4 4 Oct 2012 16:22
(_dta has notes)
(variable list omitted)
Sorted by:
These data are from Bollen and Curran (2006); they contain crime rates
collected in two-month intervals for the first eight months of 1995 for 359 com-
munities in New York state. We would like to fit a linear growth curve to these
data to model how crime rate changed over time. In our model, we can set con-
straints using the @ symbol as we did before. To constrain all intercepts to 0, we
can add the nocons option. We will also need the means() option. By default,
Stata constrains the means of latent variables to 0. For this model, we would like
to estimate them so we need to specify the latent variable names inside means().
We may also consider constraining all the residual variances to equality by con-
straining each of them to the same arbitrary letter or word, in this case eps. See
the model in Figure 8.
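A sketch of the syntax this paragraph describes, assuming the wide-format
outcome variables are named lncrime0-lncrime3 (the names implied by the reshape
command below):
. sem (Intercept@1 Slope@0 -> lncrime0) (Intercept@1 Slope@1 -> lncrime1) ///
      (Intercept@1 Slope@2 -> lncrime2) (Intercept@1 Slope@3 -> lncrime3), ///
      nocons means(Intercept Slope) ///
      var(e.lncrime0@eps e.lncrime1@eps e.lncrime2@eps e.lncrime3@eps)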
The estimated mean log crime rate at the beginning of the study was 5.33
and it increased by an average of 0.14 every two months. We could have fit this
same model using gsem. One way we can do this is to simply replace sem with
gsem in the command in Figure 8. Alternatively, we can can think of this as a
multilevel model, and fit it using gsem’s notation for random effects. Let’s do
that next.
Figure 8. Linear growth curve model for log crime rate, with Intercept loadings
fixed at 1 and Slope loadings fixed at 0, 1, 2, and 3; estimates are shown on
the path diagram.
. gen id = _n
. reshape long lncrime, i(id) j(time)
(j = 0 1 2 3)
Data                               Wide   ->   Long
(remaining output omitted)
. summarize
(output omitted)
Figure 9. The growth curve model refit with gsem as a two-level model with a
random intercept and random slope over id; estimates are shown on the path
diagram.
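One possible random-effects formulation, sketched using gsem's multilevel
latent-variable notation (M1[id] is a random intercept and c.time#M2[id] a
random slope; gsem fixes their coefficients to 1, and cov() requests their
covariance):
. gsem (lncrime <- time M1[id] c.time#M2[id]), cov(M1[id]*M2[id])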
The gsem command can also be used to fit generalized linear SEMs; that is, SEMs
in which an endogenous variable is distributed according to some distribution
family and is related to the linear prediction of the model through a link function.
See Table 2 for a list of available distribution families and links. The family
and link can be specified explicitly, e.g., family(bernoulli) link(logit), or,
for some combinations, a shortcut such as logit can be used instead. For this
example, we will return to the first dataset.
. use math
. codebook, compact
Variable     Obs   Unique     Mean      Min      Max   Label
(output omitted)
Suppose we would like to include dummy variables for school type in our
analysis; see Figure 10. By adding schtype
as a factor variable, a dummy variable for each level of schtype is included in the
model. The path coefficient for the base level, by default the lowest, is constrained
to zero. To get exponentiated coefficients, we can follow with the postestimation
command estat eform.
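A sketch of a gsem command consistent with Figure 10, assuming math is
regressed on the latent SES and on the factor variable i.schtype:
. gsem (SES -> ses1-ses4 math) (i.schtype -> math)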
Figure 10. Model with SES and dummy variables for school type (i.schtype)
predicting math; estimates are shown on the path diagram.
. estat eform

                exp(b)   Std. err.      z    P>|z|   [95% conf. interval]
ses1
SES 2.718282 (constrained)
ses2
SES 2.311549 .483485 4.01 0.000 1.534141 3.482899
ses3
SES 1.449492 .180061 2.99 0.003 1.136257 1.849077
ses4
SES 1.628133 .2474222 3.21 0.001 1.208748 2.193029
(remaining output omitted)
4 Conclusion
In this tutorial, we’ve shown the basics of fitting SEMs in Stata using the sem
and gsem commands, and have provided example datasets and syntax online to
follow along. We demonstrated confirmatory factor analysis, mediation, group
analysis, growth curve modeling, and models with random effects and general-
ized responses. However, there are many possibilities and options not included in
this tutorial, such as latent class analysis models, nonrecursive models, reliabil-
ity models, mediation models with generalized responses, multivariate random-
effects models, and much more. Visit Stata’s documentation to see all the avail-
able options for these commands, their methods and formulas, and many more
examples online at https://www.stata.com/manuals/sem.pdf.
References
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation
perspective (Vol. 467). John Wiley & Sons.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and stan-
dard errors in covariance structure analysis. In A. von Eye & C. C. Clogg
(Eds.), Latent variables analysis: Applications for developmental research
(pp. 399–419). Sage Publications.
StataCorp. (2021). Stata statistical software: Release 17. StataCorp LLC.