Meghan K. Cain [ORCID: 0000-0003-4790-4843]
StataCorp
Abstract. In this tutorial, you will learn how to fit structural equation
models (SEM) using Stata software. SEMs can be fit in Stata using the
sem command for standard linear SEMs, the gsem command for general-
ized linear SEMs, or by drawing their path diagrams in the SEM Builder.
After a brief introduction to Stata, the sem command will be demon-
strated through a confirmatory factor analysis model, mediation model,
group analysis, and a growth curve model, and the gsem command will
be demonstrated through a random-slope model and an ordinal logistic
regression. Materials and datasets are provided online, allowing anyone
with Stata to follow along.
1 Introduction
you can type help followed by the command name in the Command window and
the Viewer window will open with the help file and provide links to further doc-
umentation. Stata’s documentation consists of over 17,000 pages detailing each
feature in Stata including the methods and formulas and fully worked examples.
There are three ways to fit SEMs in Stata: the sem command, the gsem com-
mand, and through the SEM Builder. The sem command is for fitting standard
linear SEMs. It is quicker and has more features for testing and interpreting
results than gsem. The gsem command is for fitting models with generalized
responses, such as binary, count, or categorical responses, models with random
effects, and mixture models. Both sem and gsem models can be fit via path dia-
grams using the SEM Builder. You can open the SEM Builder window by typing
sembuilder into the Command window. See the interface in Figure 1; click the
tools you need on the left, or type their shortcuts shown in the parentheses. To fit
gsem models, the GSEM button must first be selected. Estimation and diagram
settings can be changed using the menus at the top. The Estimate button fits the
model. Path diagrams can be saved as .stsem files to be modified later, or can be
exported to a variety of image formats (for example see Figure 2). Although this
tutorial will focus on the sem and gsem commands, the Builder shares the same
158 M. Cain
functionality. You can watch a demonstration with the SEM Builder on the
StataCorp YouTube Channel: https://www.youtube.com/watch?v=HeQcha3C8Fk
To download the datasets, do-file, and path diagrams, you can type the fol-
lowing into Stata’s Command window:
. net from http://www.stata.com/users/mcain/JBDS_SEM
Clicking on the SEMtutorial link will download the materials to your current
working directory. To open the do-file with the commands we’ll be using, you
can type
. doedit SEMtutorial
Commands can either be executed from the do-file or typed into the Com-
mand window. We’ll start by loading and exploring our first dataset. These
data contain observations on four indicators for socioeconomic status of high
school students as well as their math scores, school types (private or public),
and the student-teacher ratio of their school. Alternatively, we could have used
a summary statistics dataset containing means, variances, and correlations of
the variables rather than observations.
. use math
. codebook, compact
Variable Obs Unique Mean Min Max Label
Let’s start our analysis by fitting the one-factor confirmatory factor analysis
(CFA) model shown in Figure 2. Using the sem command, paths are specified in
parentheses and the direction of each relationship is specified using an arrow, e.g.,
(x->y). Arrows can point in either direction, (x->y) or (y<-x). Paths can be
specified individually, or multiple paths can be specified within a single set of
parentheses, (x1 x2 x3 -> y). By default, Stata assumes that variables whose names
begin with a lowercase letter are observed and that variables whose names begin
with a capital letter are latent. You can change these
settings using the nocapslatent and the latent() options. In Stata, options
are always added after a comma. We’ll see plenty of examples of this later.
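As a minimal sketch of the path notation just described, the one-factor CFA could be specified in either of the following ways. The indicator names ses1-ses4 are taken from the output shown in this tutorial, and the sketch assumes the math dataset is already loaded.

```stata
* One-factor CFA: the capitalized name SES is treated as latent,
* the lowercase names ses1-ses4 as observed indicators
sem (SES -> ses1 ses2 ses3 ses4)

* The same model with the arrow reversed and a variable range
sem (ses1-ses4 <- SES)
```

Both commands describe the same set of paths, so they should fit the same model.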
SEM using Stata 159
[Figure 2. Path diagram of the one-factor CFA model for SES.]
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Measurement
ses1
SES 1 (constrained)
_cons 1.982659 .0620424 31.96 0.000 1.861058 2.10426
ses2
SES .8481035 .1962358 4.32 0.000 .4634884 1.232719
_cons 2.003854 .0620169 32.31 0.000 1.882303 2.125404
ses3
SES .416385 .1331306 3.13 0.002 .1554539 .6773161
_cons 2.003854 .062017 32.31 0.000 1.882302 2.125405
ses4
LR test of model vs. saturated: chi2(2) = 11.03 Prob > chi2 = 0.0040
Viewing the results, we see that by default Stata constrained the first factor
loading to be 1 and estimated the variance of the latent variable. If, instead,
we would like to constrain the variance and estimate all four factor loadings, we
could use the var() option. Constraints in any part of the model can be specified
using the @ symbol. To save room, syntax and results for this and the remaining
models will be shown on their path diagrams; see Figure 3.
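A hedged sketch of what the constrained model might look like as a command, assuming the same indicators as before. The first line follows the var() and @ description above; the second is only an illustration of @ applied to a path.

```stata
* Constrain the variance of the latent variable to 1 with var() and @,
* so that all four factor loadings are estimated
sem (SES -> ses1-ses4), var(SES@1)

* @ can also constrain a path coefficient; this writes out the default
* anchoring of the first loading at 1 explicitly
sem (SES -> ses1@1 ses2 ses3 ses4)
```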
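[Figure 3. Path diagram of the CFA model with the variance of SES constrained to 1.]
[Figure. Path diagram of the model with SES measured by ses1-ses4 and predicting math.]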
To get fit indices for our model, we can use the postestimation command
estat gof after any sem model. Add the stats(all) option to see all fit indices.
. estat gof, stats(all)
Likelihood ratio
chi2_ms(5) 17.689 model vs. saturated
p > chi2 0.003
chi2_bs(10) 150.126 baseline vs. saturated
p > chi2 0.000
Population error
RMSEA 0.070 Root mean squared error of approximation
90% CI, lower bound 0.037
upper bound 0.107
pclose 0.147 Probability RMSEA <= 0.05
Information criteria
AIC 11157.441 Akaike's information criterion
BIC 11221.219 Bayesian information criterion
Baseline comparison
CFI 0.909 Comparative fit index
TLI 0.819 Tucker-Lewis index
Size of residuals
SRMR 0.040 Standardized root mean squared residual
CD 0.532 Coefficient of determination
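As a worked check on the baseline-comparison indices, the values above can be reproduced from the two chi-squared statistics. The formulas below are the standard textbook definitions of CFI and TLI, stated here as an assumption rather than quoted from the Stata manual.

```latex
\[
\mathrm{CFI} = 1 - \frac{\chi^2_{ms} - df_{ms}}{\chi^2_{bs} - df_{bs}}
             = 1 - \frac{17.689 - 5}{150.126 - 10} \approx 0.909
\]
\[
\mathrm{TLI} = \frac{\chi^2_{bs}/df_{bs} - \chi^2_{ms}/df_{ms}}{\chi^2_{bs}/df_{bs} - 1}
             = \frac{15.013 - 3.538}{15.013 - 1} \approx 0.819
\]
```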
indices. This option still uses maximum likelihood estimation, the default, but
adjusts the standard errors and the fit indices. Alternatively, estimation can be
changed to asymptotic distribution-free or full-information maximum likelihood
for missing values using the method(adf) or method(mlmv) options, respectively.
For this example, we’ll use the Satorra-Bentler adjustment (Satorra & Bentler,
1994). First, we’ll store the current model to use again later.
. estimates store m1
. sem (SES -> ses1-ses4 math), vce(sbentler)
Endogenous variables
Measurement: ses1 ses2 ses3 ses4 math
Exogenous variables
Latent: SES
Fitting target model:
Iteration 0: log pseudolikelihood = -5564.2324
Iteration 1: log pseudolikelihood = -5563.7459
Iteration 2: log pseudolikelihood = -5563.7204
Iteration 3: log pseudolikelihood = -5563.7204
Structural equation model Number of obs = 519
Estimation method: ml
Log pseudolikelihood = -5563.7204
( 1) [ses1]SES = 1
Satorra-Bentler
Coefficient std. err. z P>|z| [95% conf. interval]
Measurement
ses1
SES 1 (constrained)
_cons 1.982659 .0621024 31.93 0.000 1.860941 2.104377
ses2
SES .9278593 .169484 5.47 0.000 .5956767 1.260042
_cons 2.003854 .0620769 32.28 0.000 1.882185 2.125522
ses3
SES .620192 .1438296 4.31 0.000 .3382912 .9020928
_cons 2.003854 .0620769 32.28 0.000 1.882185 2.125522
ses4
SES .7954927 .1580751 5.03 0.000 .4856712 1.105314
_cons 2.003854 .0620769 32.28 0.000 1.882185 2.125522
math
SES 6.858402 1.335695 5.13 0.000 4.240488 9.476315
_cons 51.72254 .4700825 110.03 0.000 50.8012 52.64389
LR test of model vs. saturated: chi2(5) = 17.69 Prob > chi2 = 0.0034
Satorra-Bentler scaled test: chi2(5) = 17.80 Prob > chi2 = 0.0032
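The alternative estimation methods mentioned earlier are requested the same way, through an option after the comma. A minimal sketch using the same model; no output is shown because these fits are not part of the original tutorial.

```stata
* Asymptotic distribution-free estimation
sem (SES -> ses1-ses4 math), method(adf)

* Full-information maximum likelihood for data with missing values
sem (SES -> ses1-ses4 math), method(mlmv)
```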
Likelihood ratio
chi2_ms(5) 17.689 model vs. saturated
p > chi2 0.003
chi2_bs(10) 150.126 baseline vs. saturated
p > chi2 0.000
Satorra-Bentler
chi2sb_ms(5) 17.804
p > chi2 0.003
chi2sb_bs(10) 153.258
p > chi2 0.000
Population error
RMSEA 0.070 Root mean squared error of approximation
90% CI, lower bound 0.037
upper bound 0.107
pclose 0.147 Probability RMSEA <= 0.05
Satorra-Bentler
RMSEA_SB 0.070 Root mean squared error of approximation
Information criteria
AIC 11157.441 Akaike's information criterion
BIC 11221.219 Bayesian information criterion
Baseline comparison
CFI 0.909 Comparative fit index
TLI 0.819 Tucker-Lewis index
Satorra-Bentler
CFI_SB 0.911 Comparative fit index
TLI_SB 0.821 Tucker-Lewis index
Size of residuals
SRMR 0.040 Standardized root mean squared residual
CD 0.532 Coefficient of determination
The SB-adjusted CFI is still rather low, 0.91, indicating poor fit. We can use
estat mindices to compute modification indices that can be used to check for
paths and covariances that could be added to the model to improve fit. First,
we’ll need to restore our original model.
. estimates restore m1
. estat mindices
Modification indices
MI df P>MI EPC Standard EPC
The MI, df, and P>MI are the estimated chi-squared test statistic, degrees
of freedom, and p value of the score test testing the statistical significance of
the constrained parameter. By default, only parameters that would significantly
(p < 0.05) improve the model are reported. The EPC is the amount that the
parameter is expected to change if the constraint is relaxed. According to these
results, we see that there is a stronger relationship between the first and second
indicator for SES than would be expected given our model, MI = 16.57, p < 0.001.
We could consider adding a residual covariance between these two indicators to
our model using the cov() option. We use the e. prefix to refer to the error
(residual) of an endogenous variable; see Figure 5.
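A sketch of how this residual covariance might be added, assuming the first and second indicators are ses1 and ses2 as in the output above:

```stata
* Allow the errors of the first two SES indicators to covary
sem (SES -> ses1-ses4 math), cov(e.ses1*e.ses2)
```

Running estat gof, stats(all) afterwards would show whether the added covariance improves fit.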
[Figure 5. Path diagram of the model with a residual covariance between the first and second SES indicators.]
One potential explanation of the effect that SES has on math score is that
students of higher SES attend schools with smaller student-teacher ratios.
We can test this hypothesis using the mediation model shown in Figure 6. Here,
we get estimates of the direct effects between each pair of variables, but what
we would really like to test is the indirect effect of SES on math through
ratio. We can get direct effects, indirect effects, and total effects of mediation
models with the postestimation command estat teffects.
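A minimal sketch of the mediation model as a command. The structure (SES measured by ses1-ses4, ratio regressed on SES, math regressed on SES and ratio) is read off the estat teffects output in this section; the exact command in Figure 6 may be written differently.

```stata
* Mediation model: SES -> ratio -> math, plus the direct path SES -> math
* (/// continues a command across lines in a do-file)
sem (SES -> ses1-ses4)   ///
    (math <- SES ratio)  ///
    (ratio <- SES)
```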
[Figure 6. Path diagram of the mediation model: SES to math, directly and through ratio.]
. estat teffects
Direct effects
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
ratio
SES -1.367306 .5562429 -2.46 0.014 -2.457522 -.2770903
math
ratio -.2256084 .1026128 -2.20 0.028 -.4267257 -.024491
SES 6.908564 1.583778 4.36 0.000 3.804417 10.01271
Measurement
ses1
SES 1 (constrained)
ses2
SES .9450302 .1643867 5.75 0.000 .6228382 1.267222
ses3
SES .6632608 .1725434 3.84 0.000 .3250819 1.00144
ses4
SES .8574695 .2012317 4.26 0.000 .4630625 1.251876
Indirect effects
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
ratio
SES 0 (no path)
math
ratio 0 (no path)
SES .3084758 .1451257 2.13 0.034 .0240346 .5929169
Measurement
ses1
SES 0 (no path)
ses2
SES 0 (no path)
ses3
SES 0 (no path)
ses4
SES 0 (no path)
Total effects
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
ratio
SES -1.367306 .5562429 -2.46 0.014 -2.457522 -.2770903
math
ratio -.2256084 .1026128 -2.20 0.028 -.4267257 -.024491
SES 7.217039 1.599953 4.51 0.000 4.081189 10.35289
Measurement
ses1
SES 1 (constrained)
ses2
SES .9450302 .1643867 5.75 0.000 .6228382 1.267222
ses3
SES .6632608 .1725434 3.84 0.000 .3250819 1.00144
ses4
SES .8574695 .2012317 4.26 0.000 .4630625 1.251876
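The three blocks of output are linked by simple arithmetic, shown here as a check using the coefficients reported above: the indirect effect is the product of the two paths through ratio, and the total effect is the direct effect plus the indirect effect.

```latex
\[
\text{indirect} = (-1.367306) \times (-0.2256084) \approx 0.3085,
\qquad
\text{total} = 6.908564 + 0.3085 \approx 7.217
\]
```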
In the second group of the output, we see that the mediation effect is
statistically significant at the 0.05 level, z = 2.13, p = 0.034. Because this
test treats the product of two coefficients as normally distributed, we may
consider bootstrapping this effect to get a more powerful test. We can do this with the bootstrap
command. First, we need to get labels for the effects we would like to test. We
can get these by replaying our model results with the coeflegend option. We
can use these labels to construct an expression for the mediation effect that
we’re calling indirect. We put this expression in parentheses after bootstrap
and put any bootstrapping options after a comma; then, we put the model and
its options after a colon. Multiple expressions can be included using multiple
sets of parentheses.
. sem, coeflegend
Structural equation model Number of obs = 519
Estimation method: ml
Log likelihood = -7117.1959
( 1) [ses1]SES = 1
Coefficient Legend
Structural
ratio
SES -1.367306 _b[ratio:SES]
_cons 16.75723 _b[ratio:_cons]
math
ratio -.2256084 _b[math:ratio]
SES 6.908564 _b[math:SES]
_cons 55.50311 _b[math:_cons]
Measurement
ses1
SES 1 _b[ses1:SES]
_cons 1.982659 _b[ses1:_cons]
ses2
SES .9450302 _b[ses2:SES]
_cons 2.003854 _b[ses2:_cons]
ses3
SES .6632608 _b[ses3:SES]
_cons 2.003854 _b[ses3:_cons]
ses4
SES .8574695 _b[ses4:SES]
_cons 2.003854 _b[ses4:_cons]
LR test of model vs. saturated: chi2(8) = 21.72 Prob > chi2 = 0.0055
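Using the labels in the legend, the bootstrap call might be sketched as follows. The number of replications and the seed are illustrative assumptions, not values taken from the tutorial, and the sem command is the mediation model as inferred from the output above.

```stata
* indirect = product of the two paths that make up the mediation effect;
* reps() and seed() are assumed values chosen for this sketch
bootstrap indirect=(_b[math:ratio]*_b[ratio:SES]), reps(1000) seed(12345): ///
    sem (SES -> ses1-ses4) (math <- SES ratio) (ratio <- SES)

* Percentile-based confidence interval for the bootstrapped effect
estat bootstrap, percentile
```

Setting a seed makes the resampling reproducible; more replications give a more stable interval at the cost of run time.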
Observed Bootstrap
coefficient Bias std. err. [95% conf. interval]
Key: P: Percentile
[Figure. Path diagrams of the mediation model fit by school type; panels: Private, Public.]
OIM
Coefficient std. err. z P>|z| [95% conf. interval]
Structural
math
SES
Private .7043843 .4184641 1.68 0.092 -.1157902 1.524559
Public .2035724 .1710134 1.19 0.234 -.1316076 .5387525
ratio
ses1
ses2
Measurement
ses3
ses4
Option      Description
mcoef       measurement coefficients
mcons       measurement intercepts
merrvar     covariances of measurement errors
scoef       structural coefficients
scons       structural intercepts
serrvar     covariances of structural errors
smerrcov    covariances between structural and measurement errors
meanex      means of exogenous variables
covex       covariances of exogenous variables
all         all the above
none        none of the above
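A hedged sketch of how these options might be used in a group analysis. It assumes schtype is the grouping variable (private or public) and reuses the mediation model inferred earlier; the exact command behind the group results is not reproduced in this text.

```stata
* Fit the mediation model by school type; ginvariant() lists the classes
* of parameters held equal across groups (mcoef and mcons are the default)
sem (SES -> ses1-ses4) (math <- SES ratio) (ratio <- SES), ///
    group(schtype) ginvariant(mcoef mcons)
```

Adding or removing names inside ginvariant() changes which parameter classes are constrained to equality across groups.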
. estat ginvariant
Tests for group invariance of parameters
              Wald Test                Score Test
          chi2   df   p>chi2       chi2   df   p>chi2
Structural
math
ratio 0.001 1 0.9709 . . .
SES 0.005 1 0.9441 . . .
_cons 1.314 1 0.2516 . . .
ratio
SES 1.825 1 0.1768 . . .
_cons 0.011 1 0.9147 . . .
Measurement
ses1
SES . . . 1.832 1 0.1759
_cons . . . 5.997 1 0.0143
ses2
SES . . . 0.072 1 0.7882
_cons . . . 0.341 1 0.5592
ses3
SES . . . 0.049 1 0.8253
_cons . . . 0.634 1 0.4259
ses4
SES . . . 1.945 1 0.1632
_cons . . . 1.149 1 0.2838
To test group differences in each direct path, we can use the postestimation
command estat ginvariant. For parameters that were allowed to vary across
groups, the results show Wald tests of whether they could be constrained to
equality; for parameters that were constrained to be equal, they show score
tests of whether the constraints should be relaxed. Both test whether
individual parameters differ significantly across groups.
The last model we will fit using sem is a growth curve model. This will require
a new dataset.
. use crime
. describe
Contains data from crime.dta
Observations: 359
Variables: 4 4 Oct 2012 16:22
(_dta has notes)
Sorted by:
These data are from Bollen and Curran (2006); they contain crime rates
collected in two-month intervals for the first eight months of 1995 for 359 com-
munities in New York state. We would like to fit a linear growth curve to these
data to model how crime rate changed over time. In our model, we can set con-
straints using the @ symbol as we did before. To constrain all intercepts to 0, we
can add the nocons option. We will also need the means() option. By default,
Stata constrains the means of latent variables to 0. For this model, we would like
to estimate them so we need to specify the latent variable names inside means().
We may also consider constraining all the residual variances to equality by con-
straining each of them to the same arbitrary letter or word, in this case eps. See
the model in Figure 8.
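Putting these pieces together, the growth curve model might be sketched as below. The variable names lncrime0-lncrime3 are inferred from the reshape command later in this section, and the time scores 0 to 3 are an assumption based on the four measurement occasions; the command in Figure 8 may differ in detail.

```stata
* Linear growth curve: Intercept loads 1 on every occasion, Slope loads
* 0, 1, 2, 3 (/// continues a command across lines in a do-file)
sem (lncrime0 <- Intercept@1 Slope@0)  ///
    (lncrime1 <- Intercept@1 Slope@1)  ///
    (lncrime2 <- Intercept@1 Slope@2)  ///
    (lncrime3 <- Intercept@1 Slope@3), ///
    nocons means(Intercept Slope)      ///
    var(e.lncrime0@eps e.lncrime1@eps e.lncrime2@eps e.lncrime3@eps)
```

Dropping the var() option would let each occasion have its own residual variance.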
The estimated mean log crime rate at the beginning of the study was 5.33
and it increased by an average of 0.14 every two months. We could have fit this
same model using gsem. One way we can do this is to simply replace sem with
gsem in the command in Figure 8. Alternatively, we can think of this as a
multilevel model, and fit it using gsem’s notation for random effects. Let’s do
that next.
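As a worked illustration of the estimates quoted above, the model-implied mean trajectory follows from the intercept and slope means. This assumes time scores of 0, 1, 2, and 3 and observed-variable intercepts constrained to 0, and uses the rounded values 5.33 and 0.14.

```latex
\[
E[\mathit{lncrime}_t] = \mu_{\text{Intercept}} + t\,\mu_{\text{Slope}}
\approx 5.33 + 0.14\,t, \qquad t = 0, 1, 2, 3
\]
\[
E[\mathit{lncrime}_0] \approx 5.33,\quad
E[\mathit{lncrime}_1] \approx 5.47,\quad
E[\mathit{lncrime}_2] \approx 5.61,\quad
E[\mathit{lncrime}_3] \approx 5.75
\]
```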
[Figure 8. Path diagram of the linear growth curve model with latent Intercept and Slope.]
. gen id = _n
. reshape long lncrime, i(id) j(time)
(j = 0 1 2 3)
Data Wide -> Long
. summarize
Variable Obs Mean Std. dev. Min Max
[Figure. Path diagram of the random-slope model with id-level random effects.]
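With the data in long form, the random-effects version of the growth model might be sketched as follows. The latent variable names M1 and M2 are arbitrary labels chosen for this sketch, not names taken from the tutorial.

```stata
* One row per community (id) and occasion (time)
* M1[id] is a random intercept and c.time#M2[id] is a random slope on time
gsem (lncrime <- time M1[id] c.time#M2[id])
```

The bracket notation tells gsem at which level each latent variable varies, here the community identifier id.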
The gsem command can also be used to fit generalized linear SEMs; that is, SEMs
in which an endogenous variable is distributed according to some distribution
family and is related to the linear prediction of the model through a link function.
See Table 2 for a list of available distribution families and links. You can either
specify the family and link explicitly, e.g., family(bernoulli) link(logit), or use
the shortcut that exists for some combinations, e.g., logit. For this
example, we will return to the first dataset.
. use math
. codebook, compact
Variable Obs Unique Mean Min Max Label
variables for school type in our analysis. See Figure 10. By adding schtype
as a factor variable, a dummy variable for each level of schtype is included in the
model. The path coefficient for the base level, by default the lowest, is constrained
to zero. To get exponentiated coefficients, we can follow with the postestimation
command estat eform.
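A sketch of how such a model might be written. The ordinal logit measurement part matches the exponentiated output for ses1-ses4; the structural part (math regressed on SES and school type) is an assumption, because the syntax in Figure 10 is not reproduced in this text.

```stata
* Ordinal logit measurement model using the ologit shortcut, with schtype
* entered as a factor variable (i. prefix)
gsem (SES -> ses1-ses4, ologit) (math <- SES i.schtype)

* The same model with the family and link written out
gsem (SES -> ses1-ses4, family(ordinal) link(logit)) (math <- SES i.schtype)

* Exponentiated coefficients (odds ratios) for the four ordinal equations
estat eform ses1 ses2 ses3 ses4
```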
[Figure 10. Path diagram of the gsem model with schtype as a factor variable, SES, and math.]
ses1
SES 2.718282 (constrained)
ses2
SES 2.311549 .483485 4.01 0.000 1.534141 3.482899
ses3
SES 1.449492 .180061 2.99 0.003 1.136257 1.849077
ses4
SES 1.628133 .2474222 3.21 0.001 1.208748 2.193029
4 Conclusion
In this tutorial, we’ve shown the basics of fitting SEMs in Stata using the sem
and gsem commands, and have provided example datasets and syntax online to
follow along. We demonstrated confirmatory factor analysis, mediation, group
analysis, growth curve modeling, and models with random effects and general-
ized responses. However, there are many possibilities and options not included in
this tutorial, such as latent class analysis models, nonrecursive models, reliabil-
ity models, mediation models with generalized responses, multivariate random-
effects models, and much more. Visit Stata’s documentation to see all the avail-
able options for these commands, their methods and formulas, and many more
examples online at https://www.stata.com/manuals/sem.pdf.
References
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation
perspective (Vol. 467). John Wiley & Sons.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard
errors in covariance structure analysis. In Latent variables analysis:
Applications for developmental research (pp. 399–419). Sage Publications,
Inc.
StataCorp. (2021). Stata statistical software: Release 17. StataCorp LLC.