Sem
Sem
c 19852013 StataCorp LP
Copyright
All rights reserved
Version 13
Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in TEX
ISBN-10: 1-59718-124-2
ISBN-13: 978-1-59718-124-2
This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored
in a retrieval system, or transcribed, in any form or by any meanselectronic, mechanical, photocopy, recording, or
otherwisewithout the prior written permission of StataCorp LP unless permitted subject to the terms and conditions
of a license granted to you by StataCorp LP to use the software and documentation. No license, express or implied,
by estoppel or otherwise, to any intellectual property rights is granted by this document.
StataCorp provides this manual as is without warranty of any kind, either expressed or implied, including, but
not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make
improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without
notice.
The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software
may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto
DVD, CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.
c 1979 by Consumers Union of U.S.,
The automobile dataset appearing on the accompanying media is Copyright
Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979.
Stata,
Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.
NetCourseNow is a trademark of StataCorp LP.
Other brand and product names are registered trademarks or trademarks of their respective companies.
For copyright information about the software, type help copyright within Stata.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
intro
intro
intro
intro
intro
intro
intro
intro
intro
intro
intro
intro
1
2
3
4
5
6
7
8
9
10
11
12
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction
2
. . . . . . . . . . . . . . . . . . Learning the language: Path diagrams and command language
7
. . . . . . . . . . . . . . . . . . . Learning the language: Factor-variable notation (gsem only) 35
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Substantive concepts 42
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tour of models 61
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparing groups (sem only) 82
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Postestimation tests and predictions 89
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robust and clustered standard errors 96
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Standard errors, the full story 98
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fitting models with survey data (sem only) 102
. . . . . . . . . . . . . . . . . . . . . . . Fitting models with summary statistics data (sem only) 104
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Convergence problems and how to solve them 112
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
128
130
132
134
136
138
140
143
145
148
150
152
154
155
158
164
169
177
180
183
187
195
199
208
215
218
222
223
225
ii
Contents
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
example
gsem
gsem
gsem
gsem
gsem
gsem
gsem
16
17
18
19
20
21
22
23
24
25
26
27g
28g
29g
30g
31g
32g
33g
34g
35g
36g
37g
38g
39g
40g
41g
42g
43g
44g
45g
46g
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Correlation
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Correlated uniqueness model
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Latent growth model
. . . . . . . . . . . . . . . . . . . . . . . . . . Creating multiple-group summary statistics data
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Two-factor measurement model by group
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Group-level goodness of fit
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Testing parameter equality across groups
. . . . . . . . . . . . . . . . . . . . . . . . . . Specifying parameter constraints across groups
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reliability
. . . . . . . . . . . . . . . . . . . . . . . . . . Creating summary statistics data from raw data
. . . . . . . . . . . . . . . . . . . . . . . . . . . . Fitting a model with data missing at random
. . . . . . . . . . . . . . . . . . Single-factor measurement model (generalized response)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . One-parameter logistic IRT (Rasch) model
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Two-parameter logistic IRT model
. . . . . . . . . . . Two-level measurement model (multilevel, generalized response)
. . . . . . . . . . . . . . . . . . . Two-factor measurement model (generalized response)
. . . . . . . . . . . . . . . . . . . . Full structural equation model (generalized response)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Logistic regression
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Combined models (generalized responses)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ordered probit and ordered logit
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MIMIC model (generalized response)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multinomial logistic regression
. . . . . . . . . . . . . . . . . . Random-intercept and random-slope models (multilevel)
. . . . . . . . . . . . . . . . . . . . . Three-level model (multilevel, generalized response)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Crossed models (multilevel)
. . . . . . . . . . . . . . . . . . . . Two-level multinomial logistic regression (multilevel)
. . . . . . . . . . . . . . . . . . . . . . . One- and two-level mediation models (multilevel)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tobit regression
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Interval regression
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Heckman selection model
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Endogenous treatment-effects model
232
237
244
251
256
265
266
269
275
279
287
291
297
306
314
323
330
336
341
347
354
359
368
384
392
397
407
416
419
423
432
439
443
447
452
455
459
460
Contents
sem
sem
sem
sem
sem
sem
sem
sem
sem
sem
sem
sem
sem
sem
ssd
iii
508
511
514
520
521
523
525
527
529
532
534
538
540
542
544
552
565
The first example is a reference to chapter 26, Overview of Stata estimation commands, in the Users
Guide; the second is a reference to the xtabond entry in the Longitudinal-Data/Panel-Data Reference
Manual; and the third is a reference to the reshape entry in the Data Management Reference Manual.
All the manuals in the Stata Documentation have a shorthand notation:
[GSM]
[GSU]
[GSW]
[U ]
[R]
[D ]
[G ]
[XT]
[ME]
[MI]
[MV]
[PSS]
[P ]
[SEM]
[SVY]
[ST]
[TS]
[TE]
[I]
[M ]
Acknowledgments
sem and gsem were developed by StataCorp.
Neither command would exist without the help of two people outside of StataCorp. We must thank
these two people profusely. They are
Jeroen Weesie, Department of Sociology at Utrecht University, The Netherlands
Sophia Rabe-Hesketh, University of California, Berkeley
Jeroen Weesie is responsible for the existence of the SEM project at StataCorp. While spending
his sabbatical with us, Jeroen expressedrepeatedlythe importance of SEM, and that enthusiasm
for SEM was disregardedrepeatedly. Not until after his sabbatical did StataCorp see the light. At
that point, we had him back, and back, and back, so that he could inspire us, guide us, tell us what
we had right, and, often, tell us what we had wrong.
Jeroen helped us with the math, the syntax, and system design, and, when we were too thick-headed,
he even wrote code. By the date of first shipment, all code had been rewritten by us, but design and
syntax for SEM still now and forever will show Jeroens influence.
Thank you, Jeroen Weesie, for teaching us SEM.
Sophia Rabe-Hesketh contributed a bit later, after the second project, GSEM, was well underway.
GSEM stands for generalized SEM. Sophia is the coauthor of gllamm and knows as much about
multilevel and structural equation modeling as anybody, and probably more. She helped us a lot
through her prolific published works; we did have her visit a few times, though, mainly because we
knew that features in GSEM would overlap with features in GLLAMM, and we wanted to straighten
out any difficulties that competing features might cause.
About the competing features, Sophia cared nothing. About the GSEM project, she was excited.
About syntax and computational methodswell, she straightened us out the first day, even on things
we thought we had settled. Today, enough of the underlying workings of GSEM are based on Sophias
and her coauthors publications that anyone who uses gsem should cite Rabe-Hesketh, Skrondal, and
Pickles (2004).
We are indebted to the works of Sophia Rabe-Hesketh, Anders Skrondal of the University of Oslo
and the Norwegian Institute of Public Health, and Andrew Pickles of the University of Manchester.
Reference
Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2004. Generalized multilevel structural equation modeling. Psychometrika
69: 167190.
Also see
[R] gllamm Generalized linear and latent mixed models
Title
intro 1 Introduction
Description
Also see
Description
SEM stands for structural equation model. Structural equation modeling is
intro 1 Introduction
Beginning with [SEM] intro 2, entitled Learning the language: Path diagrams and command
language, you will learn that
1. A particular SEM is usually described using a path diagram.
2. The sem and gsem commands allow you to use path diagrams to input models. In fact, the
sem and gsem share the same GUI, called the SEM Builder.
3. sem and gsem alternatively allow you to use a command language to input models. The
command language is similar to the path diagrams.
[SEM] intro 3, entitled Learning the language: Factor-variable notation (gsem only), amounts to a
continuation of [SEM] intro 2.
4. We teach you Statas factor-variable notation, a wonderfully convenient shorthand for including
categorical variables in models.
In [SEM] intro 4, entitled Substantive concepts, you will learn that
5. sem provides four different estimation methods; you need to specify the method appropriate
for the assumptions you are willing to make. For gsem, there are two estimation methods.
6. There are four types of variables in SEMs: A variable is observed or latent, and simultaneously
it is endogenous or exogenous. To this, sem and gsem add another type of variable, the error
variable. Error variables are latent exogenous variables with a fixed-unit path coefficient,
and they are associated with a single endogenous variable. Error variables are denoted with
an e. prefix, so if y1 is an endogenous variable, then e.y1 is the associated error variable.
7. It is easy to specify path constraints in SEMsyou just draw them, or omit drawing them,
on the diagram. It is similarly easy with the SEM Builder as well as with sems and gsems
command language.
8. Determining whether an SEM is identified can be difficult. We show you how to let the
software check for you.
9. Identification also includes normalization constraints. sem and gsem apply normalization
constraints automatically, but you can control that if you wish. Sometimes you might even
need to control it.
In [SEM] intro 5, entitled Tour of models,
10. We take you on a whirlwind tour of some of the models that sem and gsem can fit. This is
a fun and useful section because we give you an overview without getting lost in the details.
Then in [SEM] intro 6, entitled Comparing groups (sem only),
11. We show you a highlight of sem: its ability to take an SEM and data consisting of groups
sexes, age categories, and the likeand fit the model in an interacted way that makes it
easy for you to test whether and how the groups differ.
In [SEM] intro 7, entitled Postestimation tests and predictions,
12. We show you how to redisplay results (sem and gsem), how to obtain exponentiated
coefficients (gsem only), and how to obtain standardized results (sem only).
13. We show you how to obtain goodness-of-fit statistics (sem only).
14. We show you how to perform hypothesis tests, including tests for omitted paths, tests for
relaxing constraints, and tests for model simplification.
15. We show you how to display other results, statistics, and tests.
intro 1 Introduction
16. We show you how to obtain predictions of observed response variables and predictions of
latent variables. With gsem, you can obtain predicted means, probabilities, or counts that
take into account the predictions of the latent variables, or you can set the latent variables
to 0.
17. We show you how to access stored results.
In [SEM] intro 8, entitled Robust and clustered standard errors,
18. We mention that sem and gsem optionally provide robust standard errors and provide clustered
standard errors, which relaxes the assumption of independence of observations (or subjects)
to independence within clusters of observations (subjects).
In [SEM] intro 9, entitled Standard errors, the full story,
19. We provide lots of technical detail expanding on item 18.
In [SEM] intro 10, entitled Fitting models with survey data (sem only),
20. We explain how sem can be used with Statas svy: prefix to obtain results adjusted for
complex survey designs, including clustered sampling and stratification.
In [SEM] intro 11, entitled Fitting models with summary statistics data (sem only),
21. We show you how to use sem with summary statistics data such as the correlation or
covariance matrix rather than the raw data. Many sources, especially textbooks, publish data
in summary statistics form.
Finally, in [SEM] intro 12, entitled Convergence problems and how to solve them,
22. We regretfully inform you that some SEMs have difficulty converging. We figure 5% to 15%
of complicated models will cause difficulty. We show you what to do and it is not difficult.
In the meantime,
23. There are many examples that we have collected for you in [SEM] example 1, [SEM] example 2,
and so on. It is entertaining and informative simply to read the examples in order.
24. There is an alphabetical glossary in [SEM] Glossary, located at the end of the manual.
If you prefer, you can skip all this introductory material and go for the details. For the full
experience, go directly to [SEM] sem and [SEM] gsem. You will have no idea what we are talking
aboutwe promise.
intro 1 Introduction
sem
gsem
sem and gsem path notation
sem path notation extensions
gsem path notation extensions
Builder
Builder, generalized
sem model description options
gsem model description options
sem group options
sem ssd options
sem estimation options
gsem estimation options
sem reporting options
gsem reporting options
sem and gsem syntax options
sem option noxconditional
sem option select( )
sem and gsem option covstructure( )
sem option method( )
sem and gsem option reliability( )
sem and gsem option from( )
sem and gsem option constraints( )
gsem family-and-link options
ssd (sem only)
Postestimation, summary of
[SEM] sem postestimation
[SEM] gsem postestimation
Reporting results
[R] estat
[SEM] estat eform (gsem only)
[SEM] estat teffects (sem only)
[SEM] estat residuals (sem only)
[SEM] estat framework (sem only)
Goodness-of-fit tests
[SEM] estat gof (sem only)
[SEM] estat eqgof (sem only)
[SEM] estat ggof (sem only)
[R] estat
intro 1 Introduction
Hypotheses tests
[SEM] estat mindices (sem only)
[SEM] estat eqtest (sem only)
[SEM] estat scoretests (sem only)
[SEM] estat ginvariant (sem only)
[SEM] estat stable (sem only)
[SEM] test
[SEM] lrtest
[SEM] testnl
[SEM] estat stdize (sem only)
Linear and nonlinear combinations of results
[SEM] lincom
[SEM] nlcom
Predicted values
[SEM] predict after sem
[SEM] predict after gsem
Methods and formulas
[SEM] methods and formulas for sem
[SEM] methods and formulas for gsem
Many of these sections are technical, but mostly in the computer sense of the word. We suggest that
when you read the technical sections, you skip to Remarks and examples. If you read the introductory
sections, you will already know how to use the commands, so there is little reason to confuse yourself
with syntax diagrams that are more precise than they are enlightening. However, the syntax diagrams
do serve as useful reminders.
Also see
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] example 1 Single-factor measurement model
[SEM] Acknowledgments
Title
intro 2 Learning the language: Path diagrams and command language
Description
Reference
Also see
Description
Individual structural equation models are usually described using path diagrams. Path diagrams
are described here.
Path diagrams can be used in sems (gsems) GUI, known as the SEM Builder or simply the Builder,
as the input to describe the model to be fit. Path diagrams differ a little from author to author, and
sems and gsems path diagrams differ a little, too. For instance, we omit drawing the variances and
covariances between observed exogenous variables by default.
sem and gsem also provide a command-language interface. This interface is similar to path diagrams
but is typable.
x1
x2
x3
x4
e.x1
e.x2
e.x3
7
e.x4
x1 = 1 + 1 X + e.x1
x2 = 2 + 2 X + e.x2
x3 = 3 + 3 X + e.x3
x4 = 4 + 4 X + e.x4
Theres a third way of writing this model, namely,
(x1<-X) (x2<-X) (x3<-X) (x4<-X)
This is the way we would write the model if we wanted to use sems or gsems command syntax
rather than drawing the model in the Builder. The full command we would type is
. sem (x1<-X) (x2<-X) (x3<-X) (x4<-X)
We will get to that later.
By the way, the above model is a linear single-level model. Linear single-level models can be fit
by sem or by gsem. sem is the preferred way to fit linear single-level models because it has added
features for these models that you might find useful later. Nonetheless, if you want to fit the model
with gsem, you would type
. gsem (x1<-X) (x2<-X) (x3<-X) (x4<-X)
Whether we use sem or gsem, we obtain the same results.
However we write this model, what is it? It is a measurement model, a term loaded with meaning
for some researchers. X might be mathematical ability. x1 , x2 , x3 , and x4 might be scores from
tests designed to measure mathematical ability. x1 might be the score based on your answers to a
series of questions after reading this section.
The model we have just drawn, or written in mathematical notation, or written in Stata command
notation, can be interpreted in other ways, too. Look at this diagram:
y
1
x1
x2
x3
e.x1
e.x2
e.y
e.x3
Despite appearances, this diagram is identical to the previous diagram except that we have renamed
x4 to be y . The fact that we changed a name obviously does not matter substantively. The fact that
we have rearranged the boxes in the diagram is irrelevant, too; paths connect the same variables in
the same directions. The equations for the above diagrams are the same as the previous equations
with the substitution of y for x4 :
x1 = 1 + X1 + e.x1
x2 = 2 + X2 + e.x2
x3 = 3 + X3 + e.x3
y = 4 + X4 + e.y
The Stata command notation changes similarly:
(x1<-X) (x2<-X) (x3<-X) (y<-X)
Many people looking at the model written this way might decide that it is not a measurement
model but a measurement-error model. y depends on X but we do not observe X . We do observe
x1 , x2 , and x3 , each a measurement of X , but with error. Our interest is in knowing 4 , the effect
of true X on y .
A few others might disagree and instead see a model for interrater agreement. Obviously, we have
three or four raters who each make a judgment, and we want to know how well the judgment process
works and how well each of these raters performs.
10
Specifying correlation
One of the key features of SEM is the ease with which you can allow correlation between latent
variables to adjust for the reality of the situation. In the measurement model in the previous section,
we drew the following path diagram:
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
x1 = 1 + X1 + e.x1
x2 = 2 + X2 + e.x2
x3 = 3 + X3 + e.x3
x4 = 4 + X4 + e.x4
(X, x1 , x2 , x3 , x4 , e.x1 , e.x2 , e.x3 , e.x4 ) i.i.d. with mean vector and covariance matrix
where i.i.d. means that observations are independent and identically distributed.
We must appreciate that and are estimated, just as are 1 , 1 , . . . , 4 , 4 . Some of the
elements of , however, are constrained to be 0; which elements are constrained is determined by
how we specify the model. In the above diagram, we drew the model in such a way that we assumed
that error variables were uncorrelated with each other. We could have drawn the diagram differently.
11
If we wish to allow for a correlation between e.x2 and e.x3 , we add a curved path between the
variables:
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
The curved path states that there is a correlation to be estimated between the variables it connects. The
absence of a curved pathsay, between e.x1 and e.x4 means that the variables are constrained to be
uncorrelated. That is not to say that x1 and x4 are uncorrelated; obviously, they are correlated because
both are functions of the same X . Their corresponding error variables, however, are uncorrelated.
The equations for this model, in their full detail, are
x1 = 1 + X1 + e.x1
x2 = 2 + X2 + e.x2
x3 = 3 + X3 + e.x3
x4 = 4 + X4 + e.x4
(X, x1 , x2 , x3 , x4 , e.x1 , e.x2 , e.x3 , e.x4 ) i.i.d. with mean and variance
12
X = 0
e.x1 = 0
e.x2 = 0
e.x3 = 0
e.x4 = 0
The parameters to be estimated are
1 , 1 , . . . , 4 , 4 ,
Look carefully at the above list. You will not find a line reading
13
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
A curved path from a variable to itself indicates a variance. All covariances could be shown, including
those between latent variables and between observed exogenous variables.
When we draw diagrams, however, we will assume variance paths and omit drawing them, and we
will similarly assume but omit drawing covariances between observed exogenous variables (there are
no observed exogenous variables in this model). The Builder in sem mode has an option concerning
the latter. Like everyone else, we will not assume correlations between latent variables unless they
are shown.
In sems (gsems) command-language notation, curved paths between variables are indicated via
an option:
(x1<-X) (x2<-X) (x3<-X) (x4<-X), cov(e.x2*e.x3)
14
///
///
///
///
3. In path diagrams, you draw arrows connecting variables to indicate paths. In the command
language, you type variable names and arrows. So long as your arrow points the correct
way, it does not matter which variable comes first. The following mean the same thing:
(x1 <- X)
(X -> x1)
4. In the command language, you may type multiple variables on either side of the arrow:
(X -> x1 x2 x3 x4)
The above means the same as
(X -> x1) (X -> x2) (X -> x3) (X -> x4)
which means the same as
(x1 <- X) (x2 <- X) (x3 <- X) (x4 <- X)
which means the same as
(x1 x2 x3 x4 <- X)
In a more complicated measurement model, we might have
(X Y -> x1 x2 x3) (X -> x4 x5) (Y -> x6 x7)
The above means the same as
(X -> x1 x2 x3 x4 x5)
(Y -> x1 x2 x3 x6 x7)
///
15
which means
(X -> x1) (X -> x2) (X -> x3) (X -> x4) (X -> x5)
(Y -> x1) (Y -> x2) (Y -> x3) (Y -> x6) (Y -> x7)
///
5. In path diagrams, you are required to show the error variables. In the command language,
you may omit the error variables. sem knows that each endogenous variable needs an error
variable. You can type
(x1 <- X) (x2 <- X) (x3 <- X) (x4 <- X)
and that means the same thing as
(x1
(x2
(x3
(x4
<<<<-
X
X
X
X
e.x1)
e.x2)
e.x3)
e.x4)
///
///
///
except that we have lost the small numeric 1s we had next to the arrows from e.x1 to x1,
e.x2 to x2, and so on. To constrain the path coefficient, you type
(x1
(x2
(x3
(x4
<<<<-
X
X
X
X
e.x1@1)
e.x2@1)
e.x3@1)
e.x4@1)
///
///
///
16
7. Nearly all the above applies equally to gsem. We have to say nearly because sometimes,
in some models, some concepts simply vanish. For instance, in a logistic model, there are no
error terms. For generalized responses with family Gaussian, link log, there are error terms,
but they cannot be correlated. Also, for responses with family Gaussian, link identity, and
censoring, there are error terms, but they cannot be correlated. gsem also takes observed
exogenous variables as given and so cannot estimate the covariances between them.
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
In this model, we observe continuous measurements x1, . . . , x4. What if x1, . . . , x4 were instead
binary outcomes such as success and failure or passed and failed? That is, what if x1, . . . , x4 took
on values of only 1 and 0?
In that case, we would want to fit a model appropriate to binary outcomes. Perhaps we want to
fit a logistic regression model or a probit model. To do either one, we will have to use gsem rather
than sem. We will use a probit model.
The path diagram for the measurement model with binary outcomes is
Bernoulli
Bernoulli
Bernoulli
Bernoulli
x1
x2
x3
x4
probit
probit
probit
probit
What were plain boxes for x1, . . . , x4 now say Bernoulli and probit at the top and bottom,
meaning that the variable is from the Bernoulli family and is using the probit link. In addition, e.x1,
. . . , e.x4 (the error terms for x1, . . . , x4) have disappeared. In the generalized linear framework,
there are error terms only for the Gaussian family.
17
Perhaps some math will clear up the issue. The generalized linear model is
g{E(y | X)} = x
and in the case of probit, g{E(y | X)} = 1 {E(y | X)}, where () is the cumulative normal
distribution. Thus the equations are
1 {E(x1 | X)} = 1 + X1
1 {E(x2 | X)} = 2 + X2
1 {E(x3 | X)} = 3 + X3
1 {E(x4 | X)} = 4 + X4
Note that there are no error terms. Equivalently, the above can be written as
Pr(x1 = 1 | X) = (1 + X1 )
Pr(x2 = 1 | X) = (2 + X2 )
Pr(x3 = 1 | X) = (3 + X3 )
Pr(x4 = 1 | X) = (4 + X4 )
In gsems command language, we write this model as
(x1 x2 x3 x4<-X, family(bernoulli) link(probit))
or as
(x1<-X,
(x2<-X,
(x3<-X,
(x4<-X,
family(bernoulli)
family(bernoulli)
family(bernoulli)
family(bernoulli)
link(probit))
link(probit))
link(probit))
link(probit))
In the command language, you can simply type probit to mean family(bernoulli) link(probit),
so the model could also be typed as
(x1 x2 x3 x4<-X, probit)
or even as
(x1<-X, probit) (x2<-X, probit) (x3<-X, probit) (x4<-X, probit)
Whether you type family(bernoulli) link(probit) or type probit, when all the response
variables are probit, you can type
(x1 x2 x3 x4<-X), probit
or
(x1<-X) (x2<-X) (x3<-X) (x4<-X), probit
18
The response variables do not have to be all from the same family and link. Perhaps x1, x2, and
x3 are pass/fail variables but x4 is a continuous variable. Then the model would be diagrammed as
Bernoulli
Bernoulli
Bernoulli
Gaussian
x1
x2
x3
x4
probit
probit
probit
identity
e.x4
The words Gaussian and identity now appear for variable x4 and e.x4 is back! Just as previously,
the generalized linear model is
g{E(y | X)} = x
and in the case of linear regression, g() = , so our fourth equation becomes
E(x4 | X) = X4
or
x4 = + X4 + e.x4
and the entire set of equations to be estimated is
Pr(x1 = 1 | X) = (1 + X1 )
Pr(x2 = 1 | X) = (2 + X2 )
Pr(x3 = 1 | X) = (3 + X3 )
x4 = 4 + X4 + e.x4
This can be written in the command language as
(x1 x2 x3<-X, family(bernoulli) link(probit))
(x4
<-X, family(gaussian) link(identity))
or as
(x1 x2 x3<-X, probit) (x4<-X, regress)
regress is a synonym for family(gaussian) link(identity). Because family(gaussian)
link(identity) is the default, we can omit the option altogether:
(x1 x2 x3<-X, probit) (x4<-X)
19
We demonstrated generalized linear models above by using probit (family Bernoulli, link probit),
but we could just as well have used logit (family Bernoulli, link logit). Nothing changes except that in
the path diagrams, where you see probit, logit would now appear. Likewise, in the command, where
you see probit, logit would appear.
The same is true with almost all the other possible generalized linear models. What they all have
in common is
1. There are no e. error terms, except for family Gaussian.
2. Response variables appear the ordinary way except that the family is listed at the top of the
box and the link is listed at the bottom of the box, except for family multinomial, link logit
(also known as multinomial logistic regression or mlogit).
Concerning item 1, we just showed you a combined probit and linear regression with its e.x4
term. Linear regression is family Gaussian.
Concerning item 2, multinomial logistic regression is different enough that we need to show it to
you.
2.y
logit
x1
multinomial
3.y
logit
x2
multinomial
4.y
logit
Note that there are three boxes for y , boxes containing 2.y , 3.y , and 4.y . When specifying a multinomial
logistic regression model in which the dependent variables can take one of k different values, you
draw k 1 boxes. Names like 1.y , 2.y , . . . , mean y = 1, y = 2, and so on. Logically speaking, you
have to omit one of them. Which you omit is irrelevant and is known as the base outcome. Estimated
coefficients in the model relevant to the other outcomes will be measured as a difference from the
base outcome. We said which you choose is irrelevant, but it might be easier for you to interpret your
model if you chose a meaningful base outcome, such as the most frequent outcome. In fact, that is
just what Statas mlogit command does by default when you do not specify otherwise.
20
In path diagrams, you may implicitly or explicitly specify the base outcome. We implicitly specified
the base by omitting 1.y from the diagram. We could have included it by drawing a box for 1.y
and labeling it 1b.y . Stata understands the b to mean base category. See [SEM] example 37g for an
example.
Once you have handled specification of the base category, you draw path arrows from the predictors
to the remaining outcome boxes. We drew paths from x1 and x2 to all of the outcome boxes, but if
we wanted to omit x1 to 3.y and 4.y , we could have omitted those paths.
Our example is simple in that y is the final outcome. If we had a more complex model where y s
outcome affected another response variable, arrows would connect all or some of 2.y , 3.y , and 4.y
to the other response variable.
The command syntax for our simple example is
(2.y 3.y 4.y<-x1 x2), mlogit
2.y, 3.y, and 4.y are examples of Statas factor-variable syntax. The factor-variable syntax has some
other features that can save typing in command syntax. i.y, for instance, means 1b.y, 2.y, 3.y,
and 4.y. It is especially useful because if we had more levels, say, 10, it would mean 1b.y, 2.y,
. . . , 10.y. To fit the model we diagrammed, we could type
(i.y<-x1 x2), mlogit
If we wanted to include some paths and not others, we need to use the more verbose syntax. For
instance, to omit paths from x1 to 3.y and 4.y , we would type
(2.y<-x1 x2) (3.y 4.y<-x2), mlogit
or
(i.y<-x2) (2.y<-x1), mlogit
For more information on specifying mlogit paths, see [SEM] intro 3, [SEM] example 37g, and
[SEM] example 41g.
Specifying generalized SEMs: Family and link, paths from response variables
When we draw a path from one response variable to another, we are stating that the first endogenous
variable is a predictor of the other endogenous variable. The diagram might look like this:
x2
x1
y1
y2
e.y1
e.y2
21
The response variables in the model are linear, and note that there is a path from y1 to y2 . Could
we change y1 to be a member of the generalized linear family, such as probit, logit, and so on? It
turns out that we can:
x2
Bernoulli
x1
y1
y2
probit
e.y2
In the command syntax, we could write this model as (y1<-x1, probit) (y2<-y1 x2). (In this
case, the observed values and not the expectations are used to fit the y1->y2 coefficient. In general,
this is true for all generalized responses that are not family Gaussian, link identity.)
We can make the substitution from linear to generalized linear if the path from y1 appears in a
recursive part of the model. We will define recursive shortly, but trust us that the above model is
recursive all the way through. The substitution would not have been okay if the path had been in a
nonrecursive portion of the model. The following is a through-and-through nonrecursive model:
x1
y1
e.y1
y2
e.y2
x2
x3
It could be written in command syntax as (y1<-y1 x1 x2) (y2<-y2 x2 x3).
In this model, we could not change y1 to be family Bernoulli and link probit or any other
generalized linear response variable. If we tried to fit the model with such a change, we would get
an error message:
invalid path specification;
a loop among the paths between y1 and y2 is not allowed
r(198);
The software will spot the problem, but you can spot it for yourself.
22
Nonrecursive models have loops. Do you see the loop in the above model? You will if you work
out the total effect of a change in y2 from a change in y1 . Assume a change in y1 . Then that change
directly affects y2 , the new value of y2 affects y1 , which in turn indirectly affects y2 again, which
affects y1 , and on and on.
Now follow the same logic with either the probit or the continuous recursive models above. The
change in y1 affects y2 and it stops right there.
We sympathize if you think that we have interchanged the terms recursive and nonrecursive.
Remember it this way: total effects in recursive models can be calculated nonrecursively because
the model itself is recursive, and total effects in nonrecursive models must be calculated recursively
because the model itself is nonrecursive.
Anyway, you may draw paths from generalized linear response variables to other response variables,
whether linear or generalized linear, as long as no loops are formed.
We gave special mention to multinomial logistic regression in the previous section because those
models look different from the other generalized linear models. Multinomial logistic regression has a
plethora of response variables. In the case of multinomial logistic regression, a recursive model with
a path from the multinomial logistic outcome y1 to (linear) y2 would look like this:
multinomial
x2
2.y1
logit
multinomial
x1
3.y1
y2
logit
multinomial
e.y2
4.y1
logit
23
x2
2.y1
ordinal
x1
y1
3.y1
y2
logit
e.y2
4.y1
In the command syntax, the model could be written as
(y1<-x1, ologit) (y2<-2.y1 3.y1 4.y1 x2)
Unlike multinomial logistic regression, in which the k outcomes result in the estimation of k 1
equations, in ordered probit and logistic models, only one equation is estimated, and thus the response
variable is specified simply as y1 rather than 2.y1 , 3.y1 , and 4.y1 . Even so, you probably will want
the different outcomes to have separate coefficients in the y2 equation so that the effects of being in
groups 1, 2, 3, and 4 are 0 , 0 + 1 , 0 + 2 , and 0 + 3 , respectively. If you drew the diagram
with a single path from y1 to y2 , the effects of being in the groups would be 0 + 11 , 0 + 21 ,
0 + 31 , and 0 + 41 , respectively.
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
24
The data for this model would look something like this:
. list in 1/5
1.
2.
3.
4.
5.
x1
x2
x3
x4
76
106
99
101
104
76
110
106
85
82
63
114
104
77
108
490
778
757
637
654
Lets pretend that the observations are students and that x1, . . . , x4 are four test scores.
Now lets pretend that we have new data with students from different schools. A part of our data
might look like this:
. list in 1/10
school
x1
x2
x3
x4
1.
2.
3.
4.
5.
1
1
1
1
1
99
70
75
96
80
91
62
67
88
72
94
65
70
91
75
92
63
68
89
73
6.
7.
8.
9.
10.
2
2
2
2
2
113
112
97
84
89
119
118
103
90
95
118
117
102
89
94
119
118
103
90
95
We have four test scores from various students in various schools. Our corresponding model might
be
x1
e.x1
x2
e.x2
x3
e.x3
x4
e.x4
school1
The school1 inside double circles in the figure tells us, I am a latent variable at the school
levelmeaning that I am constant within school and vary across schoolsand I correspond to the
latent variable named M1. That is, double circles denote a latent variable named M#, where # is the
subscript of the variable name inside the double circles. Meanwhile, the variable name inside the
double circles specifies the level of the M# variable.
25
x1 = 1 + 1 X + 1 M1,S + e.x1
x2 = 2 + 2 X + 2 M1,S + e.x2
x3 = 3 + 3 X + 3 M1,S + e.x3
x4 = 4 + 4 X + 4 M1,S + e.x4
where S = school number.
Thus we have three different ways of referring to the same thing: school1 inside double circles in
the path diagram corresponds to M1[school] in the command language, which in turn corresponds
to M1,S in the mathematical notation.
Rabe-Hesketh, Skrondal, and Pickles (2004) use boxes to identify different levels of the model.
Our path diagrams are similar to theirs, but they do not use double circles for multilevel latent
variables. We can put a box around the individual-level part of the model, producing something that
looks like this:
x1
e.x1
x2
e.x2
x3
e.x3
x4
e.x4
school1
You can do that in the Builder, but the box has no special meaning to gsem; however, adding the
box does make the diagram easier to understand in presentations.
However you diagram the model, this model is known as a two-level model. The first level is the
student or observational level, and the second level is the school level.
26
school
x1
x2
x3
x4
1.
2.
3.
4.
5.
1
1
1
1
1
1
1
1
1
1
105
115
90
124
114
115
125
100
134
124
113
123
98
132
122
114
124
99
133
123
6.
7.
8.
9.
10.
1
1
1
1
1
2
2
2
2
2
115
112
116
102
109
128
125
129
115
122
125
122
126
112
119
127
124
128
114
121
school
x1
x2
x3
x4
991.
992.
993.
994.
995.
2
2
2
2
2
1
1
1
1
1
106
72
60
107
125
111
77
65
112
130
110
76
64
111
129
111
77
65
112
130
996.
997.
998.
999.
1000.
2
2
2
2
2
2
2
2
2
2
79
84
90
96
104
72
77
83
89
97
74
79
85
91
99
73
78
84
90
98
27
county1
school2
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
28
or
(x1 x2 x3 x4 <- X M2[school<county] M1[county])
The mathematical way of writing this model is
county1
school2
x1
x2
x3
x4
e.x1
e.x2
e.x3
e.x4
The darker box highlights the first level (student, in our case), and the lighter box highlights the
second level (school, in our case). As we previously mentioned, you can add these boxes using the
Builder, but they are purely aesthetic and have no meaning to gsem.
29
x1
e.y
x2
In the command language, this model is specified as
(x1 x2->y)
or
(y<-x1 x2)
and the mathematical representation is
y = + x1 + x2 + e.y
Now assume that we have data not just on y , x1 , and x2 , but also on county of residence. If we
wanted to add a random intercepta random effectfor county, the diagram becomes
county1
x1
x2
The command-language equivalent is
(x1 x2 M1[county]->y)
or
(y<-x1 x2 M1[county])
e.y
30
y = + x1 + x2 + M1,C + e.y
where C = county number. Actually, the model is
y = + x1 + x2 + M1,C + e.y
but is automatically constrained to be 1 by gsem. The software is not reading our mind; consider
a solution for M1,C with = 0.5. Then another equivalent solution is M1,C /2 with = 1, and
another is M1,C /4 with = 2, and on and on, because in all cases, M1,C will equal the same
value. The fact is that is unidentified. Whenever a latent variable is unidentified because of such
scaling considerations, gsem (and sem) automatically set the coefficient to 1 and thus set the scale.
This is a two-level model: the first level is the observational level and the second level is county.
Just as we demonstrated previously with the measurement model, we could have a three-level
nested model. We could imagine the observational level nested within the county level nested within
the state level. The path diagram would be
county1
x1
e.y
x2
state2
The command-language equivalent is
(y<-x1 x2 M1[county<state] M2[state])
and the mathematical representation is
31
You can specify higher-level models, but just as we mentioned when discussing the higher-level
measurement models, you will need lots of data to fit them successfully and you still may run into
other estimation problems.
You can also fit crossed models, such as county and occupation. Unlike county and state, where
a particular county appears only in one state, the same occupation will appear in more than one
county, which is to say, occupation and county are crossed, not nested. Except for a change in the
names of the variables, the path diagram for this model looks identical to the diagram for the state
and county-within-state model:
county1
x1
e.y
x2
occup2
When we enter the model into the Builder, however, we will specify that the two effects are crossed
rather than nested.
The command-language way of expressing this crossed model is
(y<-x1 x2 M1[county] M2[occupation])
and the mathematical representation is
32
county1
x1
e.y
x2
The command-language equivalent for this model is
(y<-x1 c.x1#M1[county]) (y<-x2)
or
(y<-x1 c.x1#M1[county] x2)
To include a random slope (coefficient) on a variable in the command language, include the variable
on which you want the random slope just as you ordinarily would:
(y<-x1
Then, before closing the parenthesis, type
c.variable#latent variable[grouping variable]
In our case, the variable on which we want the random coefficient is x1, the latent variable we want
to create is named M1, and the grouping variable within which the latent variable will be constant
is county, so we type
(y<-x1 c.x1#M1[county]
Finally, finish off the command by typing the rest of the model:
(y<-x1 c.x1#M1[county] x2)
This is another example of Statas factor-variable notation.
The mathematical representation of the random-slope model is
y = + x1 + x2 + M1,C x1 + e.y
where C = county number and = 1.
33
You can simultaneously specify both random slope and random intercept by putting together what
we have already done. The first time you see the combined path diagram, you may think there is a
mistake:
county2
county1
x1
e.y
x2
County appears twice in two different double circles, once with a subscript of 1 and the other
with a subscript of 2! We would gamble at favorable odds that you would have included county in
double circles once and then would have drawn two paths from it, one to the path between x1 and
y, and the other to y itself. That model, however, would make an extreme assumption.
Consider what you would be saying. There is one latent variable M1. The random intercepts would
be equal to M1. The random slopes would be equal to M1, a scalar replica of the random intercepts!
You would be constraining the random slopes and intercepts to be correlated 1.
What you usually want, however, is one latent variable for the random slopes and another for the
random intercepts. They might be correlated, but they are not related by a multiplicative constant.
Thus we also included a covariance between the two random effects.
The command-language equivalent of the correct path diagram (the one shown above) is
(y<-x1 x2 c.x1#M2[county] M1[county])
You will learn later that the command language assumes covariances exist between latent exogenous
variables unless an option is specified. Meanwhile, the Builder assumes those same covariances are
0 unless a curved covariance path is drawn.
The mathematical representation of the model is
Reference
Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2004. Generalized multilevel structural equation modeling. Psychometrika
69: 167190.
34
Also see
[SEM] intro 1 Introduction
[SEM] intro 3 Learning the language: Factor-variable notation (gsem only)
[SEM] Builder SEM Builder
[SEM] Builder, generalized SEM Builder for generalized models
[SEM] sem and gsem path notation Command syntax for path diagrams
Title
intro 3 Learning the language: Factor-variable notation (gsem only)
Description
Also see
Description
This entry concerns gsem only; sem does not allow the use of factor variables.
gsem allows you to use Statas factor-variable notation in path diagrams (in the SEM Builder) and
in the command language. Use of the notation, though always optional, is sometimes useful:
1. Use of factor variables sometimes saves effort. For instance, rather than typing
. generate femXage = female * age
and including femXage in your model, you can directly include 1.female#c.age. You can
type 1.female#c.age with the gsem command or type it into an exogenous-variable box
in a path diagram.
2. Use of factor variables can save even more effort in command syntax. You can type things
like i.female i.skill i.female#i.skill to include main effects and interaction effects
of female interacted with indicators for all the different levels of skill.
3. Use of factor variables causes the postestimation commands margins, contrast, and
pwcompare to produce more useful output. For instance, they will show the effect of the
discrete change between female = 0 and female = 1 rather than the infinitesimal change
(slope) of female.
In the examples in the rest of this manual, we sometimes use factor-variable notation and, at other
times, ignore it. We might have variable smokes recording whether a person smokes. In one example,
you might see smokes directly included in the model and yet, in another example, we might have
the odd-looking 1.smokes, which is factor-variable notation for emphasizing that smokes is a 0/1
variable. We probably used 1.smokes for reason 3, although we might have done it just to emphasize
that variable smokes is indeed 0/1.
In other examples, we will use more complicated factor-variable notation, such as 1.female#c.age,
because it is more convenient.
You should follow the same rules. Use factor-variable notation when convenient or when you plan
subsequently to use postestimation commands margins, contrast, or pwcompare. If neither reason
applies, use factor-variable notation or not. There is no benefit from consistency in this case.
35
36
This model corresponds to the equation yi = xi + i . It can be fit using the Builder with the path
diagram above or using the command syntax by typing
. gsem (y<-x)
Say we now tell you that x is a 0/1 (binary) variable. Perhaps x indicates male/female. That
changes nothing mathematically. The way we drew the path diagram and specified the command
syntax are as valid when x is an indicator variable as they were when x was continuous.
Specifying 1.x is a way you can emphasize that x is 0/1:
1.x
That model will produce the same results as the first model. In command syntax, we can type
. gsem (y<-1.x)
The only real advantage of 1.x over x arises when we plan to use the postestimation command
margins, contrast, or pwcompare, as we mentioned above. If we fit the model using 1.x, those
commands will know x is 0/1 and will exploit that knowledge to produce more useful output.
In other cases, the #.varname notation can sometimes save us effort. #.varname means a variable
equal to 1 if varname = #. It is not required that varname itself be an indicator variable.
Lets say variable skill takes on the values 1, 2, and 3 meaning unskilled, skilled, and highly
skilled. Then 3.skill is an indicator variable for skill = 3. We could use 3.skill in path
diagrams or in the command language just as we previously used 1.x.
37
1.female
age
1.female#c.age
1.female
2.skill
3.skill
38
Omission is one way to specify the base level. The other way is to include all the skill levels and
indicate which one we want to be used as the base:
1.female
1b.skill
y
2.skill
3.skill
b specifies the base level. Specifying 1b.skill makes skill level 1 the base. Had we wanted skill
level 2 to be the base, we would have specified 2b.skill.
Specification by omission is usually easier for simple models. For more complicated models, which
might have paths from different skill levels going to different variables, it is usually better to specify
all the skill levels and all the relevant paths, mark one of the skill levels as the base, and let gsem
figure out the correct reduced-form model.
By the way, indicator variables are just a special case of categorical variables. Indicator variables
are categorical variables with two levels. Everything just said about categorical variable skill applies
equally to indicator variable female. We can specify 1.female by itself, and thus implicitly specify
that female 6= 1 is the base, or we can explicitly specify both 0b.female and 1.female.
The command language has a neat syntax for specifying all the indicator variables that can be
manufactured from a categorical variable. When you type i.skill, it is the same as typing 1b.skill
2.skill 3.skill. You can use the i. shorthand in the command language by typing
. gsem (y<-1.female i.skill)
or even by typing
. gsem (y<-i.female i.skill)
Most experienced Stata command-line users would type the model the second way, putting i. in
front of both skill and female. They would not bother to think whether variables are indicator or
categorical, nor would they concern themselves with remembering how the indicator and categorical
variables are coded.
You cannot use the i. notation in path diagrams, however. Path diagrams allow only one variable
per box. The i. notation produces at least two variables, and it usually produces a lot more.
39
In the Builder, we could diagram this model by specifying all the individual interactions but omitting
the base levels,
1.female
2.skill
3.skill
1.female#2.skill
1.female#3.skill
40
or by including them,
1.female
1b.skill
2.skill
3.skill
1.female#1b.skill
1.female#2.skill
1.female#3.skill
In the second diagram, we did not bother to include the base levels for the indicator variable female,
but we could have included them.
We can type the model in command syntax just as we have drawn it, either as
. gsem (y<-1.female
2.skill 3.skill
1.female#2.skill
1.female#3.skill)
. gsem (y<-1.female
1b.skill 2.skill
1.female#1b.skill
3.skill
1.female#2.skill
or as
1.female#3.skill)
41
Also see
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] intro 4 Substantive concepts
[SEM] Builder SEM Builder
[SEM] Builder, generalized SEM Builder for generalized models
[SEM] sem and gsem path notation Command syntax for path diagrams
Title
intro 4 Substantive concepts
Description
References
Also see
Description
The structural equation modeling way of describing models is deceptively simple. It is deceptive
because the machinery underlying structural equation modeling is sophisticated, complex, and sometimes temperamental, and it can be temperamental both in substantive statistical ways and in practical
computer ways.
Professional researchers need to understand these issues.
42
43
44
the full joint-normality assumption can be relaxed, and the substitute conditional-on-theobserved-exogenous-variables is sufficient to justify all reported estimates and statistics except
the log-likelihood value and the model-versus-saturated 2 test.
Relaxing the constraint that latent variables outside of the error variables are not normally
distributed is more questionable. In the measurement model (X->x1 x2 x3 x4), simulations
with the violently nonnormal X 2 (2) produced good results except for the standard
error of the estimated variance of X . Note that it was not the coefficient on X that was
estimated poorly, it was not the coefficients standard error, and it was not even the variance
of X that was estimated poorly. It was the standard error of the variance of X . Even so,
there are no guarantees.
sem uses method ML when you specify method(ml) or when you omit the method() option
altogether.
2. QML uses ML to fit the model parameters but relaxes the normality assumptions when
estimating the standard errors. QML handles nonnormality by adjusting standard errors.
Concerning the parameter estimates, everything just said about ML applies to QML because
those estimates are produced by ML.
Concerning standard errors, we theoretically expect consistent standard errors, and we
practically observe that in our simulations. In the measurement model with X 2 (2), we
even obtained good standard errors of the estimated variance of X . QML does not really fix
the problem of nonnormality of latent variables, but it does tend to do a better job.
sem uses method QML when you specify method(ml) vce(robust) or, because method(ml)
is the default, when you specify just vce(robust).
3. ADF makes no assumption of joint normality or even symmetry, whether for observed or latent
variables. Whereas QML handles nonnormality by adjusting standard errors and not point
estimates, ADF produces justifiable point estimates and standard errors under nonnormality.
For many researchers, this is most important for relaxing the assumption of normality of the
errors, and because of that, ADF is sometimes described that way. ADF in fact relaxes the
normality assumption for all latent variables.
Along the same lines, it is sometimes difficult to be certain exactly which normality
assumptions are being relaxed when reading other sources. It sometimes seems that ADF
uniquely relaxes the assumption of the normality of the observed variables, but that is not
true. Other methods, even ML, can handle that problem.
ADF is a form of weighted least squares (WLS). ADF is also a generalized method of moments
(GMM) estimator. In simulations of the measurement model with X 2 (2), ADF produces
excellent results, even for the standard error of the variance of X . Be aware, however, that
ADF is less efficient than ML when latent variables can be assumed to be normally distributed.
If latent variables (including errors) are not normally distributed, on the other hand, ADF
will produce more efficient estimates than ML or QML.
45
Method MLMV, on the other hand, is not a deleter at all. Observation 10 will be used in
making all calculations.
For method MLMV to perform what might seem like magic, joint normality of all variables
is assumed and missing values are assumed to be missing at random (MAR). MAR means
either that the missing values are scattered completely at random throughout the data or that
values more likely to be missing than others can be predicted by the variables in the model.
Method MLMV formally requires the assumption of joint normality of all variables, both
observed and latent. If your observed variables do not follow a joint normal distribution, you
may be better off using ML, QML, or ADF and simply omitting observations with missing
values.
sem uses method MLMV when you specify method(mlmv). See [SEM] example 26.
be normally distributed. What we said in the sem case about relaxing the assumption of
normality of the latent variables applies equally in the gsem case.
gsem uses method ML when you specify method(ml) or when you omit the method()
option altogether.
2. QML uses ML to fit the model but relaxes the conditional normality assumptions when
estimating the standard errors. QML handles nonnormality by adjusting standard errors.
Everything said about sems QML applies equally to gsems QML.
gsem uses method QML when you specify vce(robust).
Because the choice of method often affects convergence with sem, in the gsem case there is a
tendency to confuse choice of integration method with maximization method. However, there are
no issues related to assumptions about integration method; choice of integration method is purely a
mechanical issue. This is discussed in [SEM] intro 12.
46
2. sem with method MLMV is not a deleter at all; it uses all observations.
If variable x1 appears in the model and if x1 contains missing in observation 10, then
observation 10 will still be used. Doing this formally requires assuming the joint normality
of all observed variables and was discussed in item 4 of sem: Choice of estimation method.
3. gsem by default is an equationwise deleter.
The abridged meaning is that gsem will often be able to use more observations from the
data than sem will, assuming you do not use sem with method MLMV.
The full meaning requires some setup. Consider a model of at least five equations. Assume
that observed exogenous variable x1 appears in the first equation but not in equations 2–4;
that equation 1 predicts y1; that y1 appears as a predictor in equations 2–4; and that x1
contains missing in observation 10.
If endogenous variable y1 is latent or observed and of family Gaussian, link identity, but
without censoring, then
3.1 Observation 10 will be ignored in making calculations related to equation 1.
3.2 Observation 10 will also be ignored in making calculations related to equations
2–4 because y1, a function of x1, appears in them.
3.3 The calculations for the other equation(s) will include observation 10.
Alternatively, if y1 is observed and not family Gaussian, link identity, or has censoring, then
item 3.2 changes:
3.2 Observation 10 will be used in making calculations related to equations 2–4 even
though y1, a function of x1, appears in them.
As we said at the outset, the result of all of this is that gsem often uses more observations
than does sem (excluding method MLMV).
4. gsem has an option, listwise, that duplicates the sem rules. This is used in testing of
Stata. There is no reason you would want to specify the option.
Let us explain:
Observed.
A variable is observed if it is a variable in your dataset. In this documentation, we often refer
to observed variables with x1, x2, . . . , y1, y2, and so on, but in reality observed variables have
names such as mpg, weight, testscore, and so on.
Latent.
A variable is latent if it is not observed. A variable is latent if it is not in your dataset but you
wish it were. You wish you had a variable recording the propensity to commit violent crime, or
socioeconomic status, or happiness, or true ability, or even income. Sometimes, latent variables
are imagined variants of real variables, variables that are somehow better, such as being measured
without error. At the other end of the spectrum are latent variables that are not even conceptually
measurable.
In this documentation, latent variables usually have names such as L1, L2, F1, . . . , but in real
life the names are more descriptive such as VerbalAbility, SES, and so on. The sem and gsem
commands assume that variables are latent if the first letter of the name is capitalized, so we will
always capitalize our latent variable names.
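For example, in this sketch with hypothetical observed variables test1, test2, and test3, the variable VerbalAbility is treated as latent simply because its name begins with a capital letter:

. sem (VerbalAbility -> test1 test2 test3)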
Endogenous.
A variable is endogenous (determined within the system) if any path points to it.
Exogenous.
A variable is exogenous (determined outside the system) if paths only originate from it or,
equivalently, no path points to it.
Now that we have the above definitions, we can better understand the five types of variables:
1. Observed exogenous.
A variable in your dataset that is treated as exogenous in your model.
2. Latent exogenous.
An unobserved variable that is treated as exogenous in your model.
3. Observed endogenous.
A variable in your dataset that is treated as endogenous in your model.
4. Latent endogenous.
An unobserved variable that is treated as endogenous in your model.
5. Error.
Mathematically, error variables are just latent exogenous variables. In sem and gsem, however,
errors are different in that they have defaults different from the other latent exogenous variables.
Errors are named e. So, for example, the error variable associated with observed endogenous
variable y1 has the full name e.y1; the error variable associated with latent endogenous
variable L1 has the full name e.L1.
In sem, each endogenous variable has a corresponding e. variable.
In gsem, observed endogenous variables associated with family Gaussian have corresponding
e. variables, but other observed endogenous variables do not. All latent endogenous variables
have an associated e. variable, but that is not a special case because all latent endogenous
variables are assumed to be family Gaussian.
In the Builder, when you create an endogenous variable, the variable's corresponding error
variable instantly springs into existence. The same happens in the command language; you
just do not see it. In addition, error variables automatically and inalterably have their path
coefficient constrained to be 1.
The covariances between the different variable types are given in the table below. In the table,
1. 0 means 0 and there are no options or secret tricks to change the value from 0.
2. i means as implied by your model and beyond your control.
3. (#) means to see the numbered note below the table.
If an entry in the matrix below is 0 or i, then you may not draw curved paths between variables of
the specified types.
                       Observed    Latent      Observed     Latent
                       exogenous   exogenous   endogenous   endogenous   Error

  Observed exogenous    (1) (2)
  Latent exogenous      (3)         (4) (5)
  Observed endogenous    i           i            i
  Latent endogenous      i           i            i            i
  Error                  0           0            i            i         (6) (7)
4.3 Builder, gsem mode: Almost the same as item 4.1 except variances cannot be
constrained to 0.
4.4 gsem command: Almost the same as item 4.2 except variances cannot be constrained
to 0.
5. Covariances between latent exogenous variables:
5.1 Builder, sem mode: Assumed to be 0 unless a curved path is drawn between
variables. Path may include constraints.
5.2 sem command: Assumed to be nonzero and estimated, the same as if a curved
path without a constraint were drawn in the Builder. Can be constrained (even to
0) using cov() option.
5.3 Builder, gsem mode: Same as item 5.1.
5.4 gsem command: Same as item 5.2.
6. Variances of errors:
6.1 Builder, sem mode: Estimated. Can be constrained.
6.2 sem command: Estimated. Can be constrained using var() option.
6.3 Builder, gsem mode: Almost the same as item 6.1 except variances cannot be
constrained to 0.
6.4 gsem command: Almost the same as item 6.2 except variances cannot be constrained
to 0.
7. Covariances between errors:
7.1 Builder, sem mode: Assumed to be 0. Can be estimated by drawing curved paths
between variables. Can be constrained.
7.2 sem command: Assumed to be 0. Can be estimated or constrained using cov()
option.
7.3 Builder, gsem mode: Almost the same as item 7.1 except covariances between errors
cannot be estimated or constrained if one or both of the error terms correspond to a
generalized response with family Gaussian, link log, or link identity with censoring.
7.4 gsem command: Almost the same as item 7.2 except covariances between errors
cannot be estimated or constrained if one or both of the error terms correspond to a
generalized response with family Gaussian, link log, or link identity with censoring.
Finally, there is a sixth variable type that we sometimes find convenient to talk about:
Measure or measurement.
A measure variable is an observed endogenous variable with a path from a latent variable. We
introduce the word measure not as a computer term or even a formal modeling term but as a
convenience for communication. It is a lot easier to say that x1 is a measure of X than to say that
x1 is an observed endogenous variable with a path from latent variable X and so, in a real sense,
x1 is a measurement of X.
Constraining parameters
Constraining path coefficients to specific values
If you wish to constrain a path coefficient to a specific value, you just write the value next to the
path. In our measurement model without correlation of the residuals,
[path diagram: X -> x1 x2 x3 x4, with errors e.x1-e.x4 and a small 1 along each error path]
we indicate that the coefficients e.x1, . . . , e.x4 are constrained to be 1 by placing a small 1 along
the path. We can similarly constrain any path in the model.
x2 = α2 + β2·X + e.x2

we would write a 1 along the path between X and x2. If we were instead using sem's or gsem's
command language, we would write
(x1<-X) (x2<-X@1) (x3<-X) (x4<-X)
That is, you type an @ symbol immediately after the variable whose coefficient is being constrained,
and then you type the value.
Constraining path coefficients is common. Constraining intercepts is less so, and usually when
the situation arises, you wish to constrain the intercept to 0, which is often called suppressing the
intercept.
Although it is unusual to draw the paths corresponding to intercepts in path diagrams, they are
assumed, and you could draw them if you wish. A more explicit version of our path diagram for the
measurement model is
[path diagram: the measurement model with _cons shown explicitly, with paths from _cons to x1, x2, x3, and x4]

The path coefficients from _cons are the intercepts:

x1 = α1 + β1·X + e.x1
x2 = α2 + β2·X + e.x2

and so on.
Obviously, if you wanted to constrain a particular intercept to a particular value, you would write
the value along the path. To constrain α2 = 0, you could draw

[path diagram: the same model with a 0 along the path from _cons to x2]
Because intercepts are assumed, you could omit drawing the paths from _cons to x1, _cons to
x3, and _cons to x4:

[path diagram: only the path from _cons to x2, labeled 0, is drawn]

Just as with the Builder, the command language assumes paths from _cons to the endogenous
variables, but you could type them if you wished:

(x1<-X _cons) (x2<-X _cons@0) (x3<-X _cons) (x4<-X _cons)
If you wish to constrain two or more path coefficients to be equal, place a symbolic name along
the relevant paths:
[path diagram: X -> x1 x2 x3 x4, with the paths to x2 and x3 each labeled myb]
In the diagram above, we constrain β2 = β3 because we stated that β2 = myb and β3 = myb.
You follow the same approach in the command language:
(x1<-X) (x2<-X@myb) (x3<-X@myb) (x4<-X)
This works the same way with intercepts. Intercepts are just paths from _cons, so to constrain
intercepts to be equal, you add symbolic names to their paths. In the command language, you constrain
α1 = α2 by typing
(x1<-X _cons@c) (x2<-X _cons@c) (x3<-X) (x4<-X)
See [SEM] example 8.
If you wish to constrain covariances, usually you will want to constrain them to be equal instead of
to a specific value. If we wanted to fit our measurement model and allow correlation between e.x2
and e.x3 and between e.x3 and e.x4, and we wanted to constrain the covariances to be equal, we
could draw
[path diagram: X -> x1 x2 x3 x4, with curved paths between e.x2 and e.x3 and between e.x3 and e.x4, each labeled myc]
If you instead wanted to constrain the covariances to specific values, you would place the value
along the paths in place of the symbolic names.
In the command language, covariances (curved paths) are specified using the cov() option. To
allow covariances between e.x2 and e.x3 and between e.x3 and e.x4, you would type
(x1<-X) (x2<-X) (x3<-X) (x4<-X), cov(e.x2*e.x3) cov(e.x3*e.x4)
To constrain the covariances to be equal, you would type
(x1<-X) (x2<-X) (x3<-X) (x4<-X), cov(e.x2*e.x3@myc) cov(e.x3*e.x4@myc)
Variances are like covariances except that in path diagrams drawn by some authors, variances curve
back on themselves. In the Builder, variances appear inside or beside the box or circle. Regardless of
how they appear, variances may be constrained to normalize latent variables, although normalization
is handled by sem and gsem automatically (something we will explain in How sem (gsem) solves
the problem for you under Identification 2: Normalization constraints (anchoring) below).
In the Builder, you constrain variances by clicking on the variable and using the lock box to specify
the value, which can be a number or a symbol. In the command language, variances are specified
using the var() option as we will explain below.
Let's assume that you want to normalize the latent variable X by constraining its variance to be 1.
You do that by drawing
[path diagram: X -> x1 x2 x3 x4, with the variance of X constrained to 1]
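In the command language, a sketch of the equivalent constraint uses the var() option, which we discuss more below:

. sem (X->x1) (X->x2) (X->x3) (X->x4), var(X@1)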
Identification 1: Substantive issues

Just because you can draw the path diagram for a model, write its equations, or write it in
Stata's command syntax does not mean the model is identified. Identification refers to the conceptual
constraints on parameters of a model that are required for the model's remaining parameters to
have a unique solution. A model is said to be unidentified if these constraints are not supplied.
These constraints are of two types: substantive constraints and normalization constraints. We will
begin by discussing substantive constraints because that is your responsibility; the software provides
normalization constraints automatically.
How to count parameters
If your model has K observed variables, then your data contain K(K + 1)/2 second-order moments,
and thus p, the number of parameters based on second-order moments that can be estimated, cannot
exceed K(K + 1)/2.

Every path in your model contributes 1 to p unless the parameter is constrained to a specific
value, in which case it does not contribute at all. If two parameters are constrained to be equal, the two
parameters count as one. In counting p, you must remember to count the curved paths from latent
variables back to themselves, which is to say, the variances. Just counting the number of parameters
can be challenging. And even if p ≤ K(K + 1)/2, your model may not be identified. Identification
depends not only on the number of paths but also on their locations.
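As a quick check using our four-indicator measurement model: with K = 4 observed variables, the data contain 4 × 5/2 = 10 second-order moments. Under the default normalization constraints described below, the model has p = 8 parameters based on them: three free path coefficients (one loading is constrained to 1), four error variances, and the variance of X. Because 8 ≤ 10, the count does not rule out identification, and this model is in fact identified.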
57
Counting parameters can be even more difficult in the case of certain generalized linear (gsem)
models. For a discussion of this, see Skrondal and Rabe-Hesketh (2004, chap. 5).
Even in the non-gsem case, books have been written on this subject, and we will refer you to
them. A few are Bollen (1989), Brown (2006), Kline (2011), and Kenny (1979). We will refer you
to them, but do not be surprised if they refer you back to us. Brown (2006, 202) writes, "Because
latent variable software programs are capable of evaluating whether a given model is identified, it is
often most practical to simply try to estimate the solution and let the computer determine the model's
identification status." That is not bad advice.
What happens when models are unidentified
So what happens when you attempt to fit an unidentified model? In some cases, sem (gsem) will
tell you that your model is unidentified. If your model is unidentified for subtle substantive reasons,
however, you will see
initial values not feasible
r(1400);
or

Iteration 50:   ...  (not concave)
Iteration 51:   ...  (not concave)
Iteration 52:   ...  (not concave)
  .
  .
  .
Iteration 101:  ...  (not concave)
  .
  .
  .
In the latter case, sem (gsem) will iterate forever, reporting the same criterion value (such as the log
likelihood) and saying "not concave" over and over again.

Observing periods of the "not concave" message is not concerning, so do not overreact at the first
occurrence. Become concerned when you see "not concave" and the criterion value is not changing,
and even then, stay calm for a short time because the value might be changing in digits you are not
seeing. If the iteration log continues to report the same value several times, however, press Break.
Your model is probably not identified.
Identification 2: Normalization constraints (anchoring)
Imagine a latent variable for propensity to be violent. Your imagination might supply a scale that
ranges from 0 to 1 or 1 to 100 or over other values, but regardless, the scale you imagine is arbitrary
in that one scale works as well as another.
Scales have two components: mean and variance. If you imagine a latent variable with mean 0
and your colleague imagines the same variable with mean 100, the difference can be accommodated
in the parameter estimates by an intercept changing by 100. If you imagine a standard deviation of
1 (variance 1² = 1) and your colleague imagines a standard deviation of 10 (variance 10² = 100),
the difference can be accommodated by a path coefficient differing by a multiplicative factor of 10.
You might measure an effect as being 1.1, and then your colleague would measure the same effect as
being 0.11, but either way you both will come to the same substantive conclusions.
How the problem would manifest itself
The problem is that different scales all work equally well, and the software will iterate forever,
jumping from one scale to another.
Another way of saying that the means and variances of latent variables are arbitrary is to say that
they are unidentified. That's important because if you do not specify the scale you have in mind,
results of estimation will look just like substantive lack of identification.
sem (gsem) will iterate forever and never arrive at a solution.
How sem (gsem) solves the problem for you
You usually do not need to worry about this problem because sem (gsem) solves it for you. sem
(gsem) solves the unidentified scale problem by
1. Assuming that all latent exogenous variables have mean 0.
2. Assuming that all latent endogenous variables have intercept 0.
3. Setting the coefficients on paths from latent variables to the first observed endogenous
variable to be 1.
4. Setting the coefficients on paths from latent variables to the first latent endogenous variable
to be 1 if rule 3 does not applyif the latent variable is measured by other latent variables
only.
Rules 3 and 4 are also known as the unit-loading rules. The variable to which the path coefficient
is set to 1 is said to be the anchor for the latent variable.
Applying those rules to our measurement model, when we type
(X->x1) (X->x2) (X->x3) (X->x4)
sem (gsem) acts as if we typed
(X@1->x1) (X->x2) (X->x3) (X->x4), means(X@0)
The above four rules are sufficient to provide a scale for latent variables for all models.
sem (gsem) automatically applies rules 1 through 4 to produce normalization constraints. There
are, however, other normalization constraints that would work as well. In what follows, we will
assume that you are well versed in deriving normalization constraints and just want to know how to
bend sem (gsem) to your will.
Before you do this, however, let us warn you that substituting your own normalization rules for the
defaults can result in more iterations being required to fit your model. Yes, one set of normalization
constraints is as good as the next, but sem's (gsem's) starting values are based on its default
normalization rules, which means that when you substitute your rules for the defaults, the required
number of iterations sometimes increases.
Let's return to the measurement model:
(X->x1) (X->x2) (X->x3) (X->x4)
As we said previously, type the above and sem (gsem) acts as if you typed
(X@1->x1) (X->x2) (X->x3) (X->x4), means(X@0)
If you wanted to assume instead that the mean of X is 100, you could type
(X->x1) (X->x2) (X->x3) (X->x4), means(X@100)
The means() option allows you to specify mean constraints, and you may do so for latent or observed
variables.
Let's leave the mean at 0 and specify that we instead want to constrain the second path coefficient
to be 1:
(X->x1) (X@1->x2) (X->x3) (X->x4)
We did not have to tell sem (gsem) not to constrain X->x1 to have coefficient 1. We just specified
that we wanted to constrain X->x2 to have coefficient 1. sem (gsem) takes all the constraints that
you specify and then adds whatever normalization constraints are needed to identify the model. If
what you have specified is sufficient, sem (gsem) does not add its constraints to yours.
Obviously, if we wanted to constrain the mean to be 100 and the second rather than the first path
coefficient to be 1, we would type
(X->x1) (X@1->x2) (X->x3) (X->x4), means(X@100)
References
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Brown, T. A. 2006. Confirmatory Factor Analysis for Applied Research. New York: Guilford Press.
Kenny, D. A. 1979. Correlation and Causality. New York: Wiley.
Kline, R. B. 2011. Principles and Practice of Structural Equation Modeling. 3rd ed. New York: Guilford Press.
Li, C. 2013. Little's test of missing completely at random. Stata Journal 13: 795–809.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
Also see
[SEM] intro 3 Learning the language: Factor-variable notation (gsem only)
[SEM] intro 5 Tour of models
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] sem and gsem option covstructure( ) Specifying covariance restrictions
Title
intro 5 Tour of models
Description
References
Also see
Description
Below is a sampling of SEMs that can be fit by sem or gsem.
[path diagrams: a single-factor measurement model (X -> x1 x2 x3 x4 with errors e.x1-e.x4) and a structural model whose latent variables are measured by y1, y2, y3, ... and x1, x2, x3, ...]
Because the measurement model is so often joined with other models, it is common to refer to the
coefficients on the paths from latent variables to observable endogenous variables as the measurement
coefficients and to refer to their intercepts as the measurement intercepts. The intercepts are usually not
shown in path diagrams. The other coefficients and intercepts are those not related to the measurement
issue.
The measurement coefficients are often referred to as loadings.
This model can be fit by sem or gsem. Use sem for standard linear models (standard means
single level); use gsem when you are fitting a multilevel model or when the response variables are
generalized linear such as probit, logit, multinomial logit, Poisson, and so on.
See the following examples:
1. [SEM] example 1. Single-factor measurement model.
2. [SEM] example 27g. Single-factor measurement model (generalized response).
3. [SEM] example 30g. Two-level measurement model (multilevel, generalized response).
4. [SEM] example 35g. Ordered probit and ordered logit.
[path diagram: IRT model; a latent variable with paths to item1-item4, each item a Bernoulli response with logit link]
The items are the observed variables, and each has a 0/1 outcome measuring a latent variable. Often,
the latent variable represents ability. These days, it is traditional to fit IRT models using logistic
regression, but in the past, probit was used and they were called normal ogive models.
In one-parameter logistic models, also known as 1-PL models and Rasch models, constraints
are placed on the paths and perhaps the variance of the latent variable. Either path coefficients are
constrained to 1 or path coefficients are constrained to be equal and the variance of the latent variable
is constrained to be 1. Either way, this results in the negative of the intercepts of the fitted model
being a measure of difficulty.
1-PL and Rasch models can be fit treating the latent variable (ability) as either fixed or random.
Abilities are treated as random with gsem.
In two-parameter logistic models (2-PL), no constraints are imposed beyond the one required to
identify the latent variable, which is usually done by constraining the variance to 1. This results
in path coefficients measuring discriminating ability of the items, and difficulty is measured by the
negative of the intercepts divided by the corresponding (slope) coefficient.
IRT has been extended beyond 1-PL and 2-PL models, including extension to other types of
generalized responses.
IRT models, including the extensions, can be fit by gsem.
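As a sketch with four hypothetical binary items q1-q4, a 1-PL-type model constrains the loadings to 1,

. gsem (Ability -> q1@1 q2@1 q3@1 q4@1), logit

while a 2-PL-type model frees the loadings and instead constrains the variance of the latent ability:

. gsem (Ability -> q1 q2 q3 q4), logit var(Ability@1)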
[path diagram: two-factor measurement model; F1 -> x1 x2 x3 and F2 -> x4 x5 x6]
[path diagram: linear regression; y <- x1 x2 x3]
This model can be written in Stata command language as
(y<-x1 x2 x3)
When you estimate a linear regression by using sem, you obtain the same point estimates as you
would with regress and the same standard errors up to a degree-of-freedom adjustment applied by
regress.
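For example, with hypothetical variables y and x1-x3, these two commands produce the same point estimates:

. sem (y <- x1 x2 x3)
. regress y x1 x2 x3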
Linear regression models can be fit by sem and gsem. gsem also has options for censoring.
See the following examples:
1. [SEM] example 6. Linear regression.
2. [SEM] example 38g. Random-intercept and random-slope models (multilevel).
3. [SEM] example 40g. Crossed models (multilevel).
4. [SEM] example 43g. Tobit regression.
5. [SEM] example 44g. Interval regression.
[path diagram: gamma regression; y <- x1 x2 x3, family gamma, link log]
Gamma regression is fit by gsem; specify shorthand gamma or specify family(gamma) link(log).
You can fit exponential regressions using Gamma regression if you constrain the log of the scale
parameter to be 0; see [SEM] gsem family-and-link options.
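A sketch with hypothetical variables:

. gsem (y <- x1 x2 x3, gamma)

which is shorthand for specifying family(gamma) link(log) for y.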
[path diagram: logistic regression; y <- x1 x2 x3, family Bernoulli, link logit]
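A sketch of the corresponding command, with hypothetical variables:

. gsem (y <- x1 x2 x3, logit)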
[path diagram: Poisson regression; y <- x1 x2 x3, family Poisson, link log]
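A sketch of the corresponding command, with hypothetical variables:

. gsem (y <- x1 x2 x3, poisson)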
[path diagram: ordered probit; y <- x1 x2 x3, family ordinal, link probit]
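A sketch of the corresponding command, with hypothetical variables:

. gsem (y <- x1 x2 x3, oprobit)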
[path diagram: multinomial logistic regression; outcomes 1b.y, 2.y, and 3.y <- x1 x2, family multinomial, link logit; 1b.y denotes the base outcome]
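A sketch of the corresponding command, with hypothetical variables:

. gsem (y <- x1 x2, mlogit)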
[path diagram: structural model; observed y1 and y2 each depend on x1, x2, x3, and x4]
In this example, all inputs and outputs are observed and the errors are assumed to be uncorrelated.
In these kinds of models, it is common to allow correlation between errors:
[path diagram: the same model with a curved path between e.y1 and e.y2]
Suppose that in the model

[path diagram: y1 and y2 determined by inputs x1, x2, and x3]

the inputs x1, x2, and x3 are concepts and thus are not observed. Assume that we have measurements
for them. We can join this structural model example with a three-factor measurement model:
[path diagram: full structural equation model; latent X1, X2, and X3 measured by z1-z8 (X1 -> z1 z2 z3, X2 -> z4 z5, X3 -> z6 z7 z8), determining y1 and y2, with curved paths between each pair of X1, X2, and X3]
Note the curved arrows denoting correlation between the pairs of X1, X2, and X3. In the previous
path diagram, we had no such arrows between the variables, yet we were still assuming that they
were there. In sem's path diagrams, correlation between exogenous observed variables is assumed
and need not be explicitly shown. When we changed observed variables x1, x2, and x3 to be the
latent variables X1, X2, and X3, we needed to show explicitly the correlations we were allowing.
Correlation between latent variables is not assumed and must be shown.
We did not include the cov(X1*X2 X1*X3 X2*X3) option, although we could have. In the
command language, exogenous latent variables are assumed to be correlated with each other. If we
did not want X2 and X3 to be correlated, we would need to include the cov(X2*X3@0) option.
We changed x1, x2, and x3 to be X1, X2, and X3. In command syntax, variables beginning with a
capital letter are assumed to be latent. Alternatively, we could have left the names in lowercase and
specified the identities of the latent variables:
(y1<-x1 x2 x3) (y2<-...)    ///
(x1->z1 z2 z3)              ///
(x2->z4 z5)                 ///
(x3->z6 z7 z8),             ///
latent(x1 x2 x3)
Just as we have joined an observed structural model to a measurement model to handle unobserved
inputs, we could join the above model to a measurement model to handle unobserved y1 and y2.
Models with unobserved inputs, outputs, or both can be fit by sem and gsem.
See the following examples:
1. [SEM] example 9. Structural model with measurement component.
2. [SEM] example 32g. Full structural equation model (generalized response).
3. [SEM] example 45g. Heckman selection model.
4. [SEM] example 46g. Endogenous treatment-effects model.
[path diagram: MIMIC model; observed causes c1, c2, and c3 -> latent L -> indicators i1, i2, and i3]
In this model, the observed causes c1, c2, and c3 determine latent variable L, and L in turn determines
the observed indicators i1, i2, and i3.
In a simple mediation model, x has a direct effect on y and an indirect (mediated through m) effect. The
direct effect may be reasonable given the situation, or it may be included just so one can test whether
the direct effect is present. If both the direct and indirect effects are significant, the effect of x is
said to be partially mediated through m.
There are one-level mediation models and various two-level models, and lots of other variations,
too.
sem and gsem can both fit one-level linear models, but you will be better off using sem. gsem
can fit one-level generalized linear models and fit two-level (and higher) models, generalized linear
or not.
See the following example:
1. [SEM] example 42g. One- and two-level mediation models (multilevel).
Correlations
We are all familiar with correlation matrices of observed variables, such as
             x1        x2        x3
    x1   1.0000
    x2   0.7700    1.0000
    x3   0.0177    0.2229    1.0000

and covariance matrices, such as

             x1         x2         x3
    x1   662.172
    x2   62.5157    9.95558
    x3   0.769312   1.19118    2.86775
These results can be obtained from sem. The path diagram for the model is
[path diagram: x1, x2, and x3 with curved paths between each pair]
We could just as well leave off the curved paths because sem assumes them among observed exogenous
variables:
[path diagram: x1, x2, and x3 with no paths drawn]
If we fit the model, we will obtain the covariance matrix by default. correlate with the
covariance option produces covariances that are divided by N − 1 rather than by N. To match this
covariance exactly, you need to specify the nm1 option, which we can do in the command language
by typing
If we want correlations rather than covariances, we ask for them by specifying the standardized
option:
(<- x1 x2 x3), nm1 standardized
An advantage of obtaining correlation matrices from sem rather than from correlate is that you
can perform statistical tests on the results, such as that the correlation of x1 and x3 is equal to the
correlation of x2 and x3.
If you are willing to assume joint normality of the variables, you can obtain more efficient estimates
of the correlations in the presence of missing-at-random data by specifying the method(mlmv) option.
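A sketch combining the two, with hypothetical variables:

. sem (<- x1 x2 x3), method(mlmv) standardized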
Correlations are fit using sem.
See the following example:
1. [SEM] example 16. Correlation.
[path diagram: correlated uniqueness model; traits T1, T2, and T3, with T1 -> x1 x4 x7, T2 -> x2 x5 x8, and T3 -> x3 x6 x9, and curved paths among the errors within each method block]
Data that exhibit this kind of pattern are known as multitrait–multimethod (MTMM) data. Researchers
historically looked at the correlations, but structural equation modeling allows us to fit a model that
incorporates the correlations.
The above model can be written in Stata command syntax as
(T1->x1 x4 x7)                          ///
(T2->x2 x5 x8)                          ///
(T3->x3 x6 x9),                         ///
cov(e.x1*e.x2 e.x1*e.x3 e.x2*e.x3)      ///
cov(e.x4*e.x5 e.x4*e.x6 e.x5*e.x6)      ///
cov(e.x7*e.x8 e.x7*e.x9 e.x8*e.x9)
An alternative way to type the above is to use the covstructure() option, which we can abbreviate
as covstruct():
(T1->x1 x4 x7)                             ///
(T2->x2 x5 x8)                             ///
(T3->x3 x6 x9),                            ///
covstruct(e.x1 e.x2 e.x3, unstructured)    ///
covstruct(e.x4 e.x5 e.x6, unstructured)    ///
covstruct(e.x7 e.x8 e.x9, unstructured)
Unstructured means that the listed variables have covariances. Specifying blocks of errors as unstructured would save typing if there were more variables in each block.
The correlated uniqueness model can be fit by sem or gsem, although we recommend use of sem in
this case. Gaussian responses with the identity link are allowed to have correlated uniqueness (error)
but only in the absence of censoring. gsem still provides the theoretical ability to fit these models in
multilevel contexts, but convergence may be difficult to achieve.
See the following example:
1. [SEM] example 17. Correlated uniqueness model.
In command syntax, a linear latent growth model for four time points can be written as

(B@1 L@0->x1)     ///
(B@1 L@1->x2)     ///
(B@1 L@2->x3)     ///
(B@1 L@3->x4),    ///
noconstant

The corresponding equations are

x1 = B + 0·L + e.x1
x2 = B + 1·L + e.x2
x3 = B + 2·L + e.x3
x4 = B + 3·L + e.x4
and the path diagram is
[path diagram: latent intercept B with paths labeled 1 to each of x1-x4, and latent slope L with paths labeled 0, 1, 2, and 3 to x1-x4]
In evaluating this model, it is useful to review the means of the latent exogenous variables. In most
models, latent exogenous variables have mean 0, and the means are thus uninteresting. sem usually
constrains latent exogenous variables to have mean 0 and does not report that fact.
In this case, however, we ourselves have placed constraints, and thus the means are identified and
in fact are an important point of the exercise. We must tell sem not to constrain the means of the
two latent exogenous variables B and L, which we do with the means() option:
(B@1 L@0->x1)     ///
(B@1 L@1->x2)     ///
(B@1 L@2->x3)     ///
(B@1 L@3->x4),    ///
noconstant means(B L)
We must similarly specify the means() option when using the Builder.
Latent growth models can be fit with sem or gsem.
See the following example:
1. [SEM] example 18. Latent growth model.
References
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Bauldry, S. 2014. miivfind: A command for identifying model-implied instrumental variables for structural equation
models in Stata. Stata Journal 14: 60–75.
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Also see
[SEM] intro 4 Substantive concepts
[SEM] intro 6 Comparing groups (sem only)
[SEM] example 1 Single-factor measurement model
Title
intro 6 Comparing groups (sem only)
Description
Reference
Also see
Description
sem has a unique feature not shared by gsem: you can easily compare groups (males with
females, age group 1 with age group 2 with age group 3, and so on) with respect to any SEM. Said
more technically, any model fit by sem can be simultaneously estimated for different groups, with
some parameters constrained to be equal across groups and others allowed to vary, and those
estimates can be used to perform statistical tests for comparing the groups.
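For instance, a sketch fitting our four-indicator measurement model jointly across the categories of a hypothetical grouping variable agegrp:

. sem (X -> x1 x2 x3 x4), group(agegrp)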
[path diagram: a full structural equation model with measurement parts for y1, y2, y3, ... and x1, x2, x3, ...]
In command syntax, such a model can be written generically as

(Y1->...) (Y2->...)    ///
(...)                  /// theoretical model stated in terms of
(...)                  /// underlying concepts (latent variables)
(X1->...) (X2->...)
where the middle part is the theoretical model stated in terms of underlying concepts Y1, Y2, X1, and
X2.
However we write the model, we are assuming the following:
1. The unobserved X1 and X2 are measured by the observed x1, x2, . . . .
2. The middle part is stated in terms of the underlying concepts X1, X2, Y1, and Y2.
3. The unobserved Y1 and Y2 are measured by the observed y1, y2, . . . .
To compare groups, we add the group() option:

(Y1->...) (Y2->...)      ///  part 3
(...) (...) (...)        ///  part 2
(X1->...) (X2->...),     ///  part 1
group(agegrp)
where agegrp is a variable in our dataset, perhaps taking on values 1, 2, 3, . . . . We can specify the
model by using the command language or by drawing the model in the Builder and then choosing
and filling in the group() option.
After estimation, you can use estat ginvariant (see [SEM] estat ginvariant) to obtain Wald
tests of whether constraints should be added and score tests of whether constraints should be relaxed.
You can specify which classes of parameters are to be constrained to be equal across groups by
specifying the ginvariant() option:

(Y1->...) (Y2->...)      ///  part 3
(...)                    ///
(...)                    ///  part 2
(...)                    ///
(X1->...) (X2->...),     ///  part 1
group(agegrp) ginvariant(classes)
The classes are as follows:

Class name    Class description
scoef         1. structural coefficients
scons         2. structural intercepts
mcoef         3. measurement coefficients
mcons         4. measurement intercepts
serrvar       5. covariances of structural errors
merrvar       6. covariances of measurement errors
smerrcov      7. covariances between structural and measurement errors
meanex        8. means of exogenous variables (*)
covex         9. covariances of exogenous variables (*)

(*) See [SEM] sem group options for the details on these classes.
Consider again fitting the generic model across groups:

(Y1->...) (Y2->...)      ///  part 3, measurement
(...)                    ///
(...)                    ///  part 2, structural
(...)                    ///
(X1->...) (X2->...),     ///  part 1, measurement
group(agegrp)
In this model, the Y1<-Y2 path coefficient is allowed to vary across groups by default. We could
constrain the coefficient to be equal across groups by typing
(Y1->...) (Y2->...)      ///  part 3, measurement
(...)                    ///
(Y1<-Y2@b)               ///  part 2, structural
(...)                    ///
(X1->...) (X2->...),     ///  part 1, measurement
group(agegrp)
Similarly, we could type

(Y1->...) (Y2->...)      ///  part 3, measurement
(...)                    ///
(...)                    ///  part 2, structural
(...)                    ///
(X1->...) (X2->...),     ///  part 1, measurement
group(agegrp) means(X1@m)
to constrain the mean of X1 to be the same across groups. The means would have been different
across groups by default.
Variances and covariances can be constrained similarly. To constrain the variance of e.Y1 to be
the same across groups, we could type

(Y1->...) (Y2->...)      ///  part 3, measurement
(...)                    ///
(...)                    ///  part 2, structural
(...)                    ///
(X1->...) (X2->...),     ///  part 1, measurement
group(agegrp) var(e.Y1@V)
If we then wanted to constrain the covariance to be the same across groups, we would type
(Y1->...) (Y2->...)      ///  part 3, measurement
(...)                    ///
(...)                    ///  part 2, structural
(...)                    ///
(X1->...) (X2->...),     ///  part 1, measurement
group(agegrp) var(e.Y1@V) cov(e.Y1*e.Y2@C)
You can also constrain a parameter across some groups but not others. For instance, you can specify

... (1: Y1<-Y2@b) (2: Y1<-Y2@b) ..., group(agegrp)
The result is that we constrain age groups 1 and 2 to have the same value of the path, and we do
not constrain the path for the other age groups.
You can constrain variance and covariance estimates to be the same across some groups but not
others in the same way. You can specify, for instance,
..., group(agegrp) var(1: e.Y1@V) var(2: e.Y1@V)
or
..., group(agegrp) cov(e.Y1*e.Y2) cov(1: e.Y1*e.Y2@C) ///
cov(2: e.Y1*e.Y2@C)
Similarly, you can constrain means for some groups but not others, although this is rarely done:
..., group(agegrp) means(1: X1@b) means(2: X1@b)
Relaxing constraints
Just as you can specify
..., group(agegrp) ginvariant(classes)
and then add constraints, you can also specify
..., group(agegrp) ginvariant(classes)
and then relax constraints that the classes impose.
For instance, if we specified ginvariant(scoef), then we would be constraining (Y1<-Y2) to
be invariant across groups. We could then relax that constraint by typing
... (Y1<-Y2) (1: Y1<-Y2@b1) (2: Y1<-Y2@b2) ..., ///
group(agegrp) ginvariant(scoef)
The path coefficients would be free in groups 1 and 2 and constrained in the remaining groups, if
there are any. The path coefficient is free in group 1 because we specified symbolic name b1, and b1
appears nowhere else in the model. The path coefficient is free in group 2 because symbolic name b2
appears nowhere else in the model. If there are remaining groups and we want to relax the constraint
on them, too, we would need to add (3: Y1<-Y2@b3), and so on.
The same technique can be used to relax constraints on means, variances, and covariances:
..., group(agegrp) ginvariant(... meanex ...) ///
means(1: X1@b1) means(2: X1@b2)
..., group(agegrp) ginvariant(... serrvar ...) ///
var(1: e.Y1@V1) var(2: e.Y1@V2)
..., group(agegrp) ginvariant(... serrvar ...) ///
cov(1: e.Y1*e.Y2@C1) cov(2: e.Y1*e.Y2@C2)
Reference
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Also see
[SEM] intro 5 Tour of models
[SEM] intro 7 Postestimation tests and predictions
[SEM] sem group options Fitting models on different groups
[SEM] sem and gsem option covstructure( ) Specifying covariance restrictions
Title
intro 7 Postestimation tests and predictions
Description
Also see
Description
After fitting a model with sem or gsem, you can perform statistical tests, obtain predicted values,
and more. Everything you can do is listed below.
sem and gsem vary in the tests and features available after estimation, and we mark whether each
test and feature is available after sem, gsem, or both.
Some tests and features after sem depend on the assumption of joint normality of the observed
variables and others do not. Whenever the joint-normality assumption is required, we mention it
explicitly. If you do not believe your data meet the joint-normality assumption, you will want to
avoid those tests and features.
Some tests and features that are available after sem are not available after gsem because of the
joint-normality assumption. Others are not available for other reasons. If we do not mention the
joint-normality assumption, then that is not the cause. We usually leave unexplained the other reasons
that a test or feature might be inappropriate after gsem, because those reasons are not our focus here.
If you wish to see results in the Bentler–Weeks formulation, after sem estimation type
. estat framework
(output omitted )
the symbolic name of the coefficient corresponding to the path Y1<-x1 is _b[Y1:x1], and
the symbolic name of the coefficient corresponding to the covariance of e.Y1 and e.Y2 is
_b[cov(e.Y1,e.Y2):_cons].
Figuring out what the names are can be difficult, so instead, type
. sem, coeflegend
or
. gsem, coeflegend
With this command, sem (gsem) will produce a table looking very much like the estimation output
that lists the _b[] notation for the estimated parameters in the model; see [SEM] example 8.
           Family         Link     Meaning of exp(coef)
logit      Bernoulli      logit    odds ratio
ologit     ordinal        logit    odds ratio
mlogit     multinomial    logit    relative-risk ratio
Poisson    Poisson        log      incidence-rate ratio
nbreg      nbreg          log      incidence-rate ratio
gsem reports coefficients, not exponentiated coefficients. You can obtain exponentiated coefficients
and their standard errors by using estat eform after estimation. Using estat eform is no different
from redisplaying results. The syntax is
estat eform equationname
After gsem, equations are named after the dependent variable, so if you want to see the equation for
cases in exponentiated form, you can type estat eform cases.
See [SEM] example 33g, [SEM] example 34g, and [SEM] estat eform.
Be warned, this test is based on the assumption of joint normality of the observed variables, so
you may want to ignore it. The test is a goodness-of-fit test in badness-of-fit units; a significant
result implies that there may be missing paths in the model's specification. More mathematically,
the null hypothesis of this test is that the fitted covariance matrix and mean vector of the observed
variables are equal to the matrix and vector observed in the population as measured by the sample.
Remember, however, the goal is not to maximize the goodness of fit. One must not add paths that
are not theoretically meaningful.
Performing tests for including omitted paths and relaxing constraints (sem only)
1. (sem only.) Command estat mindices reports χ² modification indices and significance
values for each omitted path in the model, along with the expected parameter change; see
[SEM] example 5 and [SEM] example 9.
2. (sem only.) Command estat scoretests performs score tests on each of the linear
constraints placed on the paths and covariances; see [SEM] example 8.
3. (sem only.) Command estat ginvariant is for use when you have estimated using sems
group() option; see [SEM] intro 6. This command tests whether you can relax constraints
that parameters are equal across groups; see [SEM] example 22.
3. (sem only.) Command estat stable assesses the stability of nonrecursive structural equation
systems; see [SEM] example 7 and [SEM] estat stable.
4. (sem and gsem.) Command estat summarize reports summary statistics for the observed
variables used in the model; see [SEM] estat summarize.
5. (sem and gsem.) Command lincom reports the value, standard error, significance, and
confidence interval for linear combinations of estimated parameters; see [SEM] lincom.
6. (sem and gsem.) Command nlcom reports the value, standard error, significance, and
confidence interval for nonlinear (and linear) combinations of estimated parameters; see
[SEM] nlcom and [SEM] example 42g.
7. (sem and gsem.) Command estat vce reports the variance–covariance matrix of the estimated
parameters; see [R] estat vce.
You can save estimation results in files or temporarily in memory and do other useful things with
them; see [R] estimates.
Not stored by sem in e() are the Bentler–Weeks matrices, but they can be obtained from the
r() stored results of estat framework. (The Bentler–Weeks matrices are not relevant in the case
of gsem.)
See [SEM] sem and [SEM] estat framework.
Also see
[SEM] intro 6 Comparing groups (sem only)
[SEM] intro 8 Robust and clustered standard errors
Title
intro 8 Robust and clustered standard errors
Description
Options
Also see
Description
sem and gsem provide two options to modify how standard-error calculations are made:
vce(robust) and vce(cluster clustvar). These standard errors are less efficient than the default standard errors, but they are valid under less restrictive assumptions.
These options are allowed only when default estimation method method(ml) is used or when
option method(mlmv) is used. ml stands for maximum likelihood, and mlmv stands for maximum
likelihood with missing values; see [SEM] intro 4, [SEM] sem, and [SEM] gsem.
Also see [SEM] intro 9, entitled Standard errors, the full story.
Options
vce(vcetype) specifies how the VCE, and thus the standard errors, is calculated. VCE stands for
variance–covariance matrix of the estimators. The standard errors that sem and gsem report are
the square roots of the diagonal elements of the VCE.
vce(oim) is the default. oim stands for observed information matrix (OIM). The information matrix
is the matrix of second derivatives, usually of the log-likelihood function. The OIM estimator of
the VCE is based on asymptotic maximum-likelihood theory. The VCE obtained in this way is valid
if the errors are independent and identically distributed normal, although the estimated VCE is
known to be reasonably robust to violations of the normality assumption, at least as long as the
distribution is symmetric and normal-like.
vce(robust) specifies an alternative calculation for the VCE, called robust because the VCE
calculated in this way is valid under relaxed assumptions. The method is formally known as
the Huber/White/sandwich estimator. The VCE obtained in this way is valid if the errors are
independently distributed. It is not required that the errors follow a normal distribution, nor is it
required that they be identically distributed from one observation to the next. Thus the vce(robust)
VCE is robust to heteroskedasticity of the errors.
vce(cluster clustvar) is a generalization of the vce(robust) calculation that relaxes the
assumption of independence of the errors and replaces it with the assumption of independence
between clusters. Thus the errors are allowed to be correlated within clusters.
The vce(robust) and vce(cluster clustvar) options relax assumptions that are sometimes
unreasonable for a given dataset and thus produce more accurate standard errors in those cases.
Those assumptions are homoskedasticity of the variances of the errors, relaxed by vce(robust),
and independence of the observations, relaxed by vce(cluster clustvar). vce(cluster
clustvar) relaxes both assumptions.
Homoskedasticity means that the variances of the errors are the same from observation to observation.
Homoskedasticity can be unreasonable if, for instance, the error corresponds to a dependent variable
of income or socioeconomic status. It would not be unreasonable to instead assume that, in the data,
the variance of income or socioeconomic status increases as the mean increases. In such cases, rather
than typing
. sem (y<-...) (...) (...<-x1) (...<-x2)
you would type
. sem (y<-...) (...) (...<-x1) (...<-x2), vce(robust)
Independence implies that the observations are uncorrelated. If you have observations on people,
some of whom live in the same neighborhoods, it would not be unreasonable to assume instead that
the error of one person is correlated with those of others who live in the same neighborhood because
neighborhoods tend to be homogeneous. In such cases, if you knew the neighborhood, rather than
typing
. sem (y<-...) (...) (...<-x1) (...<-x2)
you would type
. sem (y<-...) (...) (...<-x1) (...<-x2), vce(cluster neighborhood)
Understand that if the assumptions of independent and identically distributed normal errors are
met, the vce(robust) and vce(cluster clustvar) standard errors are less efficient than the standard
vce(oim) standard errors. Less efficient means that for a given sample size, the standard errors jump
around more from sample to sample than would the vce(oim) standard errors. vce(oim) standard
errors are unambiguously best when the standard assumptions of homoskedasticity and independence
are met.
Also see
[SEM] intro 7 Postestimation tests and predictions
[SEM] intro 9 Standard errors, the full story
[SEM] sem option method( ) Specifying method and calculation of VCE
[SEM] gsem estimation options Options affecting estimation
Title
intro 9 Standard errors, the full story
Description
Options
Also see
Description
In [SEM] intro 8, we told you part of the story of the calculation of the VCE, the part we wanted
to emphasize. In this section, we tell you the full story.
We at Stata try to draw a clear distinction between method and technique. The method is the
process used to obtain the parameter estimates. The technique is the process used to obtain the
variance–covariance matrix of the parameter estimates, which is to say, the standard errors.
The literature does not always draw such clear distinctions.
sem and gsem provide the following methods and techniques:
Methods
  ML      maximum likelihood
  QML     quasimaximum likelihood
  MLMV    maximum likelihood with missing values
  ADF     asymptotic distribution free

Techniques
  OIM
  EIM
  OPG
  robust
  clustered
  bootstrap
  jackknife
Not every technique is available with every method. The allowed combinations are

sem
  Method   Allowed techniques       Comment
  ML       OIM                      default
           EIM
           OPG
           robust                   a.k.a. QML
           clustered
           bootstrap
           jackknife
  MLMV     OIM                      default
           EIM
           OPG
           robust                   a.k.a. QML
           clustered
           bootstrap
           jackknife
  ADF      OIM                      default; robust-like
           EIM
           bootstrap
           jackknife

gsem
  Method   Allowed techniques       Comment
  ML       OIM                      default
           OPG
           robust                   a.k.a. QML
           clustered
           bootstrap
           jackknife
Options
The corresponding options for sem and gsem to obtain each allowed method-and-technique
combination are
sem
  method()       vce()                    Comment
  method(ml)     vce(oim)                 default
                 vce(eim)
                 vce(opg)
                 vce(robust)              a.k.a. QML
                 vce(cluster clustvar)
                 vce(bootstrap)
                 vce(jackknife)
  method(mlmv)   vce(oim)                 default
                 vce(eim)
                 vce(opg)
                 vce(robust)              a.k.a. QML
                 vce(cluster clustvar)
                 vce(bootstrap)
                 vce(jackknife)
  method(adf)    vce(oim)                 default; vce(robust)-like
                 vce(eim)
                 vce(bootstrap)
                 vce(jackknife)

gsem
  method()       vce()                    Comment
  method(ml)     vce(oim)                 default
                 vce(opg)
                 vce(robust)              a.k.a. QML
                 vce(cluster clustvar)
                 (bootstrap)              no option; use the bootstrap: prefix
                 (jackknife)              no option; use the jackknife: prefix
method(emethod) specifies the estimation method sem (gsem) is to use. If method() is not specified,
then method(ml) is assumed.
vce(vcetype) specifies the technique to be used to obtain the VCE. When vce() is not specified,
then vce(oim) is assumed.
In the case of gsem, vce(bootstrap) and vce(jackknife) are not allowed, although you can
obtain the bootstrap or jackknife results by prefixing the gsem command with the bootstrap: or
jackknife: prefix.
. bootstrap: gsem . . .
. jackknife: gsem . . .
See [R] bootstrap and [R] jackknife. If you are fitting a multilevel model, be sure to use bootstrap's
and jackknife's cluster() and idcluster() options to obtain a correct resampling. If you
have a crossed model, you cannot resample in both dimensions; there is no solution to that problem,
and therefore you cannot use the bootstrap: or jackknife: prefix.
Also see
[SEM] intro 8 Robust and clustered standard errors
[SEM] intro 10 Fitting models with survey data (sem only)
[SEM] sem option method( ) Specifying method and calculation of VCE
[SEM] gsem estimation options Options affecting estimation
Title
intro 10 Fitting models with survey data (sem only)
Description
Also see
Description
Sometimes the data are not a simple random sample from the underlying population but instead
are based on a complex survey design that can include stages of clustered sampling and stratification.
Estimates produced by sem can be adjusted for these issues.
Adjustments for survey data are provided by sem but not gsem.
Using the svyset command, we tell Stata that our data are from a three-stage sampling design. The
first stage samples without replacement counties within state; the second, schools within each sampled
county; and the third, students within schools.
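A sketch of such a declaration, with hypothetical variables county, school, and student identifying the sampling units and state identifying the strata (finite-population-correction variables omitted for brevity):

. svyset county, strata(state) || school || student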
Once we have done that, we can tell Stata to make the survey adjustment by prefixing statistical
commands with the svy: prefix:
. svy: regress test_result teachers_per_student sex ...
See the Stata Survey Data Reference Manual for more information on this. From a survey perspective,
sem is not different from any other statistical command of Stata.
Once results are estimated, you do not include the svy: prefix in front of the postestimation
commands. You type, for instance,
. estat eqtest ...
Also see
[SEM] intro 9 Standard errors, the full story
[SEM] intro 11 Fitting models with summary statistics data (sem only)
[SVY] Stata Survey Data Reference Manual
Title
intro 11 Fitting models with summary statistics data (sem only)
Description
Reference
Also see
Description
In textbooks and research papers, the data used are often printed in summary statistic form. These
summary statistics include means, standard deviations or variances, and correlations or covariances.
These summary statistics can be used in place of the underlying raw data to fit models with sem.
Summary statistics data (SSD) are convenient for publication because of their terseness. By not
revealing individual responses, they do not violate participant confidentiality, which is sometimes
important.
Support for SSD is provided by sem but not by gsem.
Background
The structural equation modeling estimator is a function of the first and second moments (the
means and the covariances) of the data. Thus it is possible to obtain estimates of the parameters of an
SEM by using means and covariances. One does not need the original dataset.
In terms of sem, one can create a dataset containing these summary statistics and then use that
dataset to obtain fitted models. The sem command is used just as one would use it with the original,
raw data.
a. Do not use sem's if exp or in range modifiers. You do not have the raw data in memory,
and so you cannot select subsets of the data.
b. If you have entered summary statistics for groups of observations (for example, males and,
separately, females), use sem's select() option if you want to fit the model with a subset
of the groups. That is, where you would ordinarily type
. sem ... if sex==1, ...
you instead type
. sem ..., ... select(1)
Where you would ordinarily type
. sem ... if region==1 | region==3, ...
you instead type
. sem ..., ... select(1 3)
See [SEM] example 3.
Entering SSD
Entering SSD is easy. You need to see an example of how easy it is before continuing: see
[SEM] example 2.
What follows is an outline of the procedure. Let us begin with the data you need to have. You
have
1. The names of the variables. We will just call them x1, x2, and x3.
2. The number of observations, say, 74.
3. The correlations (or covariances), say, the correlations

     1.0000
    -0.5928    1.0000
     0.6043   -0.2120    1.0000

4. Set the correlations (or the covariances if you have them instead):

. ssd set cor 1 \ -.5928 1 \ .6043 -.2120 1
5. If you set covariances in step 4, skip to step 6. Otherwise, if you have them, set the variances:
. ssd set var 33.4722 .6043 .2118
Or set the standard deviations:

. ssd set sd 5.7855 .7774 .4602
Say that we have SSD for three age groups:

The young:
    observations:            74
    correlations:             1
                         -0.8072         1
                          0.3934   -0.5928         1
    standard deviations:  5.6855    0.7774    0.4602
    means:               21.2973    3.0195    0.2973

The middle-aged:
    observations:           141
    correlations:             1
                         -0.5721         1
                          0.3843   -0.4848         1
    standard deviations:  4.9112    0.7010    0.5420
    means:               38.1512    5.2210    0.2282

The old:
    observations:            36
    correlations:             1
                         -0.8222         1
                          0.3712   -0.3113         1
    standard deviations:  6.7827    0.7221    0.4305
    means:               58.7171    2.1511    0.1623
We would enter the first group by typing

. ssd init x1 x2 x3
. ssd set obs 74
. ssd set cor 1 \ -.8072 1 \ .3934 -.5928 1
. ssd set sd 5.6855 .7774 .4602
. ssd set means 21.2973 3.0195 .2973
and then we would add the other two groups:

. ssd addgroup agegrp
. ssd set obs 141
. ssd set cor 1 \ -.5721 1 \ .3843 -.4848 1
. ssd set sd 4.9112 .7010 .5420
. ssd set means 38.1512 5.2210 .2282

. ssd addgroup
. ssd set obs 36
. ssd set cor 1 \ -.8222 1 \ .3712 -.3113 1
. ssd set sd 6.7827 .7221 .4305
. ssd set means 58.7171 2.1511 .1623
. save mygroupdata
What happens when you do not set all the summary statistics
You are required to set the number of observations and to set the covariances or the correlations.
Setting the variances (standard deviations) and setting the means are optional.
1. If you set correlations only, then
a. Means are assumed to be 0.
b. Standard deviations are assumed to be 1.
c. You will not be able to pool across groups if you have group data.
As a result of (a) and (b), the parameters sem estimates will be standardized even when you do
not specify sem's standardized reporting option. Estimated means and intercepts will be 0.
Concerning (c), we need to explain. This concerns group data. If you type
. sem ...
then sem fits a model with all the data. sem does that whether you have raw data or SSD in
memory. If you have SSD with groups (say, males and females, or age groups 1, 2, and 3), sem
combines the summary statistics to obtain the summary statistics for the overall data. It is only
possible to do this when covariances and means are known for each group. If you set correlations
without variances or standard deviations and without means, the necessary statistics are not known
and the groups cannot be combined. Thus if you type
. sem ...
you will get an error message. You can still estimate using sem; you just have to specify on which
group you wish to run sem, and you do that with the select() option:
. sem ..., select(#)
2. If you set correlations and means,
a. Standard deviations are assumed to be 1.
b. You will not be able to pool across groups if you have group data.
This situation is nearly identical to situation 1. The only difference is that estimated means and
intercepts will be nonzero.
3. If you set correlations and standard deviations or variances, or if you set covariances only,
a. Means are assumed to be 0.
b. You will not be able to pool across groups if you have group data.
This situation is a little better than situation 1. Estimated intercepts will be 0, but the remaining
estimated coefficients will not be standardized unless you specify sem's standardized reporting
option.
Labeling SSD
You may use the following commands on SSD, and you use them in the same way you would with
an ordinary dataset:
1. rename oldvarname newvarname
You may rename the variables; see [D] rename.
2. label data "dataset label"
You may label the dataset; see [D] label.
Reference
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Also see
[SEM] intro 10 Fitting models with survey data (sem only)
[SEM] intro 12 Convergence problems and how to solve them
[SEM] ssd Making summary statistics data (sem only)
[SEM] sem option select( ) Using sem with summary statistics data
[SEM] example 2 Creating a dataset from published covariances
[SEM] example 3 Two-factor measurement model
[SEM] example 19 Creating multiple-group summary statistics data
[SEM] example 25 Creating summary statistics data from raw data
Title
intro 12 Convergence problems and how to solve them
Description
Also see
Description
It can be devilishly difficult for software to obtain results for SEMs. Here is what can happen:
. sem ...
Variables in structural equation model
(output omitted )
Fitting target model:
initial values not feasible
r(1400);
or,

. gsem ...
Fitting fixed-effects model:
Iteration 0:  log likelihood = -914.65237
Iteration 1:  log likelihood = -661.32533
Iteration 2:  log likelihood = -657.18568
  (output omitted)
Refining starting values:
Grid node 0:  log likelihood = .
Grid node 1:  log likelihood = .
Grid node 2:  log likelihood = .
Grid node 3:  log likelihood = .
Fitting full model:
initial values not feasible
r(1400);
or,

. sem ...
Endogenous variables
  (output omitted)
Fitting target model:
Iteration 1:   log likelihood = ...
  .
  .
  .
Iteration 50:  log likelihood = -337504.44  (not concave)
Iteration 51:  log likelihood = -337503.52  (not concave)
Iteration 52:  log likelihood = -337502.13  (not concave)
  .
  .
  .
Iteration 101: log likelihood = -337400.69  (not concave)
Break
r(1);
In the first two cases, sem and gsem gave up. The error message is perhaps informative if not
helpful. In the last case, sem (it could just as well have been gsem) iterated and iterated while
producing little improvement in the log-likelihood value. We eventually tired of watching a process
that was going nowhere slowly and pressed Break.
Now what?
112
113
1:
50:
51:
52:
(not concave)
(not concave)
(not concave)
(not concave)
If the problem is lack of identification, the criterion function being optimized (the log likelihood
in this case) will eventually stop improving at all and yet sem or gsem will continue iterating.
If the problem is poor starting values, the criterion function will continue to increase slowly.
So if your model might not be identified, do not press Break too soon.
There is another way to distinguish between identification problems and poor starting values. If
starting values are to blame, it is likely that a variance estimate will be heading toward 0. If the
problem is lack of identification, you are more likely to see an absurd path coefficient.
To distinguish between those two alternatives, you will need to rerun the model and specify the
iterate() option:
. sem ..., ... iterate(100)
(output omitted )
We omitted the output, but specifying iterate(100) allowed us to see the current parameter values
at the point. We chose to specify iterate(100) because we knew that the likelihood function was
changing slowly by that point.
If you are worried about model identification, you have a choice: Sit it out and do not press Break
too soon, or press Break and rerun.
114
If you discover that your model is not identified, see Identification 1: Substantive issues in
[SEM] intro 4.
If the first sem or gsem command fails because you pressed Break or if the command issued
an error message, you must reissue the command adding option noestimate or iterate(#).
Specify noestimate if the failure came early before the iteration log started, and otherwise specify
iterate(#), making # the iteration number close to but before the failure occurred:
. sem ..., ... noestimate
or
. sem ..., ... iterate(50)
115
Once you have obtained the parameter values in b, you can list b,
. matrix b = e(b)
. matrix list b
b[1,10]
x1:
x1:
L
_cons
y1
1
96.284553
var(e.x1): var(e.x2):
_cons
_cons
y1
78.54289
113.62044
x2:
L
1.0971753
var(e.x3):
_cons
85.34721
x2:
_cons
97.284553
var(L):
_cons
120.45744
x3:
L
1.0814186
x3:
_cons
97.097561
x2:
L
1.0971753
var(e.x3):
_cons
85.34721
x2:
_cons
97.284553
var(L):
_cons
500
x3:
L
1.0814186
x3:
_cons
97.097561
And, whether you modify it or not, you can use b as the starting values for another attempt of the
same model or for an attempt of a different model:
. sem ..., ... from(b)
If you have multiple constraints that you want to reimpose, you may need to do them in sets.
116
In this example, variable y1 affects y2 affects y1. Models with such feedback loops are said to be
nonrecursive. Assume you had a solution to the above model. The results might be unstable in a
substantive sense; see nonrecursive (structural) model (system) in [SEM] Glossary. The problem is
that finding such truly unstable solutions is often difficult and the stability problem manifests itself
as a convergence problem.
If you have convergence problems and you have feedback loops, that is not proof that the underlying
values are unstable.
Regardless, temporarily remove the feedback loop,
. sem ... (y1<-y2 x2) (y2<-
x3) ...
and see whether the model converges. If it does, save the parameter estimates and refit the original
model with the feedback, but using the saved parameter estimates as starting values.
. matrix b = e(b)
. sem ... (y1<-y2 x2) (y2<-y1 x3) ..., ... from(b)
If the model converges, the feedback loop is probably stable. If you are using sem, you can check
for stability with estat stable. If the model does not converge, you now must find which other
variables need to have starting values modified.
117
There is another option we should mention, namely, intpoints(#). Methods 1, 2, and 3 default
to using seven integration points. You can change that. A larger number of integration points produces
more accurate results but does not improve model convergence. You might reduce the number of
integration points to, say, 3. A lower number of integration points slightly improves convergence
of the model, and it certainly makes model fitting much quicker. Obviously, model results are less
accurate. We have been known to use fewer integration points, but mainly because of the speed issue.
We can experiment more quickly. At the end, we always return to the default number of integration
points.
and you are using sem, we direct you back to Temporarily simplify the model. The problem that we
discuss here seldom reveals itself as initial values not feasible with sem. If you are using gsem,
we direct you to the next section, Get better starting values (gsem). It is not impossible that what we
discuss here is the solution to your problem, but it is unlikely.
118
We discuss here problems that usually reveal themselves by producing an infinite iteration log:
. sem ..., ...
Endogenous variables
(output omitted )
Fitting target model:
Iteration 1: log likelihood
.
.
.
Iteration 50: log likelihood
Iteration 51: log likelihood
Iteration 52: log likelihood
.
.
.
Iteration 101: log likelihood
Break
r(1);
= ...
= -337504.44
= -337503.52
= -337502.13
(not concave)
(not concave)
(not concave)
= -337400.69
(not concave)
We specified 100; you should specify an iteration value based on your log.
In most cases, you will discover that you have a variance of a latent exogenous variable going to
0, or you have a variance of an error (e.) variable going to 0.
Based on what you see, say that you suspect the problem is with the variance of the error of a
latent endogenous variable F going to 0, namely, e.F. You need to give that variance a larger starting
value, which you can do by typing
. sem ..., ... var(e.F, init(1))
or
. sem ..., ... var(e.F, init(2))
or
. sem ..., ... var(e.F, init(10))
We recommend choosing a value for the variance that is larger than you believe is necessary.
To obtain that value,
1. If the variable is observed, use summarize to obtain the summary statistics for the variable,
square the reported standard deviation, and then increase that by, say, 20%.
2. If the variable is latent, use summarize to obtain a summary of the latent variables
anchor variable and then follow the same rule: use summarize, square the reported standard
deviation, and then increase that by 20%. (The anchor variable is the variable whose path
is constrained to have coefficient 1.)
3. If the variable is latent and has paths only to other latent variables so that its anchor variable
is itself latent, follow the anchors paths to an observed variable and follow the same rule:
use summarize, square the reported standard deviation, and then increase that by 20%.
4. If you are using gsem to fit a multilevel model and the latent variable is at the observational
level, follow advice 2 above. If the latent variable is at a higher levelsay schooland
its anchor is x, a Gaussian response with the identity link, type
119
sort school
by school: egen avg = mean(x)
by school: gen touse = _n==1 if school<.
summarize avg if touse==1
If you wanted to set the initial value of the path from x1 to 3, modify the command to read
. sem ... (y<-(x1, init(3)) x2) ...
If that does not solve the problem, proceed through the others in the following order: fixedonly,
constantonly, and zero.
By the way, if you have starting values for some parameters but not othersperhaps you fit a
simplified model to get themyou can combine the options startvalues() and from():
. gsem ..., ...
. matrix b = e(b)
. gsem ..., ... from(b) startvalues(iv)
// simplified model
// full model
You can combine startvalues() with the init() option, too. We described init() in the previous
section.
The other special option gsem provides is startgrid(). startgrid() can be used with or without
startvalues(). startgrid() is a brute-force approach that tries various values for variances and
covariances and chooses the ones that work best.
120
1. You may already be using a default form of startgrid() without knowing it. If you see
gsem displaying Grid node 1, Grid node 2, . . . following Grid node 0 in the iteration log,
that is gsem doing a default search because the original starting values were not feasible.
The default form tries 0.1, 1, and 10 for all variances of all latent variables, by which
we mean the variances of latent exogenous variables and the variances of errors of latent
endogenous variables.
2. startgrid(numlist) specifies values to try for variances of latent variables.
3. startgrid(covspec) specifies the particular variances and covariances in which grid searches
are to be performed. Variances and covariances are specified in the usual way. startgrid(e.F e.F*e.L M1[school] G*H e.y e.y1*e.y2) specifies that 0.1, 1, and 10 be
tried for each member of the list.
4. startgrid(numlist covspec) allows you to combine the two syntaxes, and you can specify
multiple startgrid() options so that you can search the different ranges for different
variances and covariances.
Our advice to you is this:
1. If you got an iteration log and it did not contain Grid node 1, Grid node 2, . . . , then specify
startgrid(.1 1 10). Do that whether the iteration log was infinite or ended with some
other error. In this case, we know that gsem did not run startgrid() on its own because
it did not report Grid node 1, Grid node 2, etc. Your problem is poor starting values, not
infeasible ones.
A synonym for startgrid(.1 1 10) is just startgrid without parentheses.
Be careful, however, if you have a large number of latent variables. startgrid could run
a long time because it runs all possible combinations. If you have 10 latent variables, that
means 103 = 1,000 likelihood evaluations.
If you have a large number of latent variables, rerun your difficult gsem command including
option iterate(#) and look at the results. Identify the problematic variances and search
across them only. Do not just look for variances going to 0. Variances getting really big can
be a problem, too, and even reasonable values can be a problem. Use your knowledge and
intuition about the model.
Perhaps you will try to fit your model by specifying startgrid(.1 1 10 e.F L e.X).
Because values 0.1, 1, and 10 are the default, you could equivalently specify startgrid(e.F
L e.X).
Look at covariances as well as variances. If you expect a covariance to be negative and it is
positive, try negative starting values for the covariance by specifying startgrid(-.1 -1
-10 G*H).
Remember that you can have multiple startgrid() options, and thus you could specify
startgrid(e.F L e.X) startgrid(-.1 -1 -10 G*H).
2. If you got initial values not feasible, you know that gsem already tried the default
startgrid.
The default startgrid only tried the values 0.1, 1, and 10, and only tried them on the
variances of latent variables. You may need to try different values or try the same values on
covariances or variances of errors of observed endogenous variables.
We suggest you first rerun the model causing difficulty including the noestimate option.
We also direct you back to the idea of first simplifying the model; see Temporarily simplify
the model.
121
If, looking at the results, you have an idea of which variance or covariance is a problem,
or if you have few variances and covariances, we would recommend running startgrid()
first. On the other hand, if you have no idea as to which variance or covariance is the
problem and you have a large number of them, you will be better off if you first simplify
the model. If, after doing that, your simplified model does not include all the variances and
covariances, you can specify a combination of from() and startgrid().
Also see
[SEM] intro 11 Fitting models with summary statistics data (sem only)
Title
Builder SEM Builder
Description
Reference
Description
The SEM Builder lets you create path diagrams for SEMs, fit those models, and show results on the
path diagram. Here we discuss standard linear SEMs; see [SEM] Builder, generalized for information
on using the Builder to create models with generalized responses and multilevel structure.
Launch the SEM Builder by selecting the menu item Statistics > SEM (structural equation
modeling) > Model building and estimation. You can also type sembuilder in the Command
window.
Select the Add Observed Variable tool,
, and click within the diagram to add an observed
variable. Type a name in the Name control of the Contextual Toolbar to name the variable. If the
variable is not placed exactly where you want it, simply select the Select tool and drag the variable
to your preferred location.
122
123
After adding the variable, use the Variable control in the Contextual Toolbar to select a variable
from your dataset or type a variable name into the edit field.
Add latent variables to the model by using the
tool.
124
tool.
We have ignored the locks, , in the Contextual Toolbars. These locks apply constraints to the
parameters of the SEM. You can constrain variances,
; means,
; intercepts,
; and path
coefficients and covariances,
. For example, select an exogenous variable or an error variable and
type a number in
to constrain that variance to a fixed value. Or, select three path variables and
type a name (a placeholder) in
to constrain all the path coefficients to be equal. You can type
numbers, names, or linear expressions in the
controls. The linear expressions can involve only
controls.
numbers and names that are used in other
Do not be afraid to try things. If you do not know what a tool, control, or dialog item does, try
it. If you do not like the result, click on the Undo button, , in the Standard Toolbar.
Click on
in the Standard Toolbar to fit the model. A dialog is launched that allows you to set
all the estimation options that are not defined by the path diagram. After estimation, some of the
estimation results are displayed on the path diagram. Use the Results tab of the Settings > Variables
and Settings > Connections dialogs to change what results are shown and how they appear (font
sizes, locations, etc.). Also, as you click on connections or variables, the Details pane displays all
the estimation results for the selected object.
If you wish to create another model derived from the current model, click
Toolbar.
in the Standard
Video example
SEM Builder in Stata
Visit the Stata YouTube channel for more videos.
Reference
Huber, C. 2012. Using Statas SEM features to model the Beck Depression Inventory. The Stata Blog: Not Elsewhere
Classified. http://blog.stata.com/2012/10/17/using-statas-sem-features-to-model-the-beck-depression-inventory/.
Title
Builder, generalized SEM Builder for generalized models
Description
Reference
Description
The SEM Builder allows you to create path diagrams of generalized SEMs, fit those models, and
show the results on the path diagram. Here we extend the discussion from [SEM] Builder. Read that
manual entry first.
126
As with standard observed and latent variables, you can also click and drag as you add generalized
responses and multilevel latent variables to control the size of the resulting variables. You can also
select the Properties button from the Contextual Toolbar or double-click on a variable to change
even more properties of the variable. As with standard variables, you can control their appearance
from the Appearance tab of the resulting dialog box or, more likely, control the appearance of all
variables of a given type from the Settings > Variables menu.
the
As with standard models, you create the structure of your model by connecting the variables with
tool. You may also specify covariances among variables with the
tool.
There is one other construct available in generalized mode that is not available in standard mode
paths to other paths. In addition to creating paths from a latent variable to an observed variable or
to another latent variable at the same nesting level, you can connect a standard or multilevel latent
variable to a path from one observed variable to another observed variable. This makes that path a
random path, or a random slope. As with other path connections, think of the path as adding the
source variable to the target. So we are adding a random variable to a path (or slope), making it a
random path. See [SEM] example 38g for an example of creating a random path in the Builder.
In generalized mode, the Add Measurement Component tool,
, allows the measurements to
be created as generalized responses. Simply click on the Make measurements generalized box on
the dialog box and select a Family/Link. It also allows the latent variable to be either standard
observation-level or multilevel with the radio control at the top of the dialog box.
The Add Observed Variables Set tool, , also allows you to create multiple generalized responses.
Click on the Make variables generalized responses box on the dialog box and select a Family/Link.
Creating multinomial logit responses can be tricky. You must create a generalized response for
each level of your outcome and use factor-variable notation to designate the levels of the outcome.
tool is the safest way to create a multinomial logit response. With your dataset in memory,
The
select the
tool and then click on the diagram. Click on the Make variables generalized responses
box on the dialog box, and then select Multinomial, Logit in the Family/Link control. Select your
multinomial response variable from the Variable control, and the levels for your multinomial response
will be automatically populated. The level designated with a b will be the base level. You can type
the b on another level to make it the base. When you click OK, multinomial generalized responses
will be created for each level of your variable. See [SEM] example 37g for an example of creating
multinomial logit responses in the Builder.
The Add Regression Component tool,
generalized response.
Video example
SEM Builder in Stata
Visit the Stata YouTube channel for more videos.
127
Reference
Huber, C. 2012. Using Statas SEM features to model the Beck Depression Inventory. The Stata Blog: Not Elsewhere
Classified. http://blog.stata.com/2012/10/17/using-statas-sem-features-to-model-the-beck-depression-inventory/.
Title
estat eform Display exponentiated coefficients
Syntax
Remarks and examples
Menu
Also see
Description
Options
Syntax
estat eform
eqnamelist
where eqnamelist is a list of equation names. In gsem, equation names correspond to the names of
the response variables. If no eqnamelist is specified, exponentiated results for the first equation
are shown.
Menu
Statistics
>
>
Other
>
Description
estat eform is for use after gsem but not sem.
gsem reports coefficients. You can obtain exponentiated coefficients and their standard errors by
using estat eform after estimation to redisplay results.
Options
level(#); see [R] estimation options; default is level(95).
display options control the display of factor variables and more. Allowed display options are
noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#),
fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and nolstretch. See
[R] estimation options.
Family
Link
Meaning of exp(coef)
logit
ologit
mlogit
Poisson
nbreg
Bernoulli
ordinal
multinomial
Poisson
nbreg
logit
logit
logit
log
log
odds ratio
odds ratio
relative-risk ratio
incidence-rate ratio
incidence-rate ratio
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] gsem postestimation Postestimation tools for gsem
[SEM] intro 7 Postestimation tests and predictions
[SEM] example 33g Logistic regression
[SEM] example 34g Combined models (generalized responses)
129
Title
estat eqgof Equation-level goodness-of-fit statistics
Syntax
Remarks and examples
Menu
Stored results
Description
Reference
Option
Also see
Syntax
estat eqgof
, format(% fmt)
Menu
Statistics
>
>
Goodness of fit
>
Description
estat eqgof is for use after sem but not gsem.
estat eqgof displays equation-by-equation goodness-of-fit statistics. Displayed are R2 and the
BentlerRaykov squared multiple-correlation coefficient (Bentler and Raykov 2000).
These two concepts of fit are equivalent for recursive SEMs and univariate linear regression. For
nonrecursive SEMs, these measures are distinct.
Equation-level variance decomposition is also reported, along with the overall model coefficient
of determination.
Option
format(% fmt) specifies the display format. The default is format(%9.0f).
Stored results
estat eqgof stores the following in r():
Scalars
r(N groups)
r(CD # )
Matrices
r(nobs)
r(eqfit # )
number of groups
overall coefficient of determination (for group #)
sample size for each group
fit statistics (for group #)
130
131
Reference
Bentler, P. M., and T. Raykov. 2000. On measures of explained variance in nonrecursive structural equation models.
Journal of Applied Psychology 85: 125131.
Also see
[SEM] example 3 Two-factor measurement model
Title
estat eqtest Equation-level test that all coefficients are zero
Syntax
Remarks and examples
Menu
Stored results
Description
Also see
Option
Syntax
estat eqtest
, total nosvyadjust
Menu
Statistics
>
>
>
Description
estat eqtest is for use after sem but not gsem.
estat eqtest displays Wald tests that all coefficients excluding the intercept are 0 for each
equation in the model.
Option
total is for use when estimation was with sem, group(). It specifies that the tests be aggregated
across the groups.
nosvyadjust is for use with svy estimation commands. It specifies that the Wald test be carried out
without the default adjustment for the design degrees of freedom. That is to say the test is carried
out as W/k F (k, d) rather than as (d k + 1)W/(kd) F (k, d k + 1), where k is the
dimension of the test and d is the total number of sampled PSUs minus the total number of strata.
Stored results
estat eqtest stores the following in r():
Scalars
r(N groups)
Matrices
r(nobs)
r(test # )
r(test total)
number of groups
sample size for each group
test statistics (for group #)
aggregated test statistics (total only)
132
Also see
[SEM] example 13 Equation-level Wald test
133
Title
estat framework Display estimation results in modeling framework
Syntax
Remarks and examples
Menu
Stored results
Description
Reference
Options
Also see
Syntax
estat framework
, options
options
Description
standardized
compact
fitted
format(% fmt)
Menu
Statistics
>
>
Other
>
Description
estat framework is a postestimation command for use after sem but not gsem.
estat framework displays the estimation results as a series of matrices derived from the Bentler
Weeks form; see Bentler and Weeks (1980).
Options
standardized reports results in standardized form.
compact displays matrices in compact form. Zero matrices are displayed as a description. Diagonal
matrices are shown as a row vector.
fitted displays the fitted mean and covariance values.
format(% fmt) specifies the display format to be used. The default is format(%9.0g).
134
135
Technical note
If sems nm1 option was specified when the model was fit, all covariance matrices are calculated
using N 1 in the denominator instead of N .
Stored results
estat framework stores the following in r():
Scalars
r(N groups)
r(standardized)
number of groups
indicator of standardized results (+)
Matrices
r(nobs)
# )
r(Beta
# )
r(Gamma
r(alpha
# )
# )
r(Phi
# )
r(kappa
r(Sigma
r(mu
# )
r(Psi
# )
# )
Reference
Bentler, P. M., and D. G. Weeks. 1980. Linear structural equations with latent variables. Psychometrika 45: 289308.
Also see
[SEM] example 11 estat framework
[SEM] intro 7 Postestimation tests and predictions (Replaying the model (sem and gsem))
[SEM] intro 7 Postestimation tests and predictions (Accessing stored results)
[SEM] methods and formulas for sem Methods and formulas for sem
[SEM] sem postestimation Postestimation tools for sem
Title
estat ggof Group-level goodness-of-fit statistics
Syntax
Remarks and examples
Menu
Stored results
Description
Also see
Option
Syntax
estat ggof
, format(% fmt)
Menu
Statistics
>
>
Group statistics
>
Description
estat ggof is for use after estimation with sem, group().
estat ggof displays, by group, the standardized root mean squared residual (SRMR), the coefficient
of determination (CD), and the model versus saturated 2 along with its associated degrees of freedom
and p-value.
Option
format(% fmt) specifies the display format. The default is format(%9.3f).
Stored results
estat ggof stores the following in r():
Scalars
r(N groups)
Matrices
r(gfit)
number of groups
fit statistics
136
Also see
[SEM] example 21 Group-level goodness of fit
137
Title
estat ginvariant Tests for invariance of parameters across groups
Syntax
Remarks and examples
Menu
Stored results
Description
References
Options
Also see
Syntax
estat ginvariant
, options
options
Description
showpclass(classname)
class
legend
classname
Description
scoef
scons
structural coefficients
structural intercepts
mcoef
mcons
measurement coefficients
measurement intercepts
serrvar
merrvar
smerrcov
meanex
covex
all
none
Menu
Statistics
>
>
Group statistics
>
Description
estat ginvariant is for use after estimation with sem, group(); see [SEM] sem group options.
estat ginvariant performs score tests (Lagrange multiplier tests) and Wald tests of whether
parameters constrained to be equal across groups should be relaxed and whether parameters allowed
to vary across groups could be constrained.
See Sorbom (1989) and Wooldridge (2010, 421428).
138
139
Options
showpclass(classname) displays tests for the classes specified. showpclass(all) is the default.
class displays a table with joint tests for group invariance for each of the nine parameter classes.
legend displays a legend describing the parameter classes. This option may only be used with the
class option.
Stored results
estat ginvariant stores the following in r():
Scalars
r(N groups)
Matrices
r(nobs)
r(test)
r(test pclass)
r(test class)
number of groups
sample size for each group
Wald and score tests
parameter classes corresponding to r(test)
joint Wald and score tests for each class
References
Sorbom, D. 1989. Model modification. Psychometrika 54: 371384.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
Also see
[SEM] example 22 Testing parameter equality across groups
Title
estat gof Goodness-of-fit statistics
Syntax
Remarks and examples
Menu
Stored results
Description
References
Options
Also see
Syntax
estat gof
, options
options
Description
stats(statlist)
nodescribe
statistics to be displayed
suppress descriptions of statistics
statlist
Description
chi2
rmsea
ic
indices
residuals
all
Note: The statistics reported by chi2, rmsea, and indices are dependent on the assumption of joint
normality of the observed variables.
Menu
Statistics
>
>
Goodness of fit
>
Description
estat gof is for use after sem but not gsem.
estat gof displays a variety of overall goodness-of-fit statistics.
Options
stats(statlist) specifies the statistics to be displayed. The default is stats(chi2).
stats(chi2) reports the model versus saturated test and the baseline versus saturated test. The
saturated model is the model that fits the covariances perfectly.
The model versus saturated test is a repeat of the test reported at the bottom of the sem output.
In the baseline versus saturated test, the baseline model includes the means and variances of all
observed variables plus the covariances of all observed exogenous variables. For a covariance
model (a model with no endogenous variables), the baseline includes only the means and
variances of observed variables. Be aware that different authors define the baseline model
differently.
140
141
stats(rmsea) reports the root mean squared error of approximation (RMSEA) and its 90%
confidence interval, and pclose, the p-value for a test of close fit, namely, RMSEA < 0.05. Most
interpreters of this test label the fit close if the lower bound of the 90% CI is below 0.05 and
label the fit poor if the upper bound is above 0.10. See Browne and Cudeck (1993).
stats(ic) reports the Akaike information criterion (AIC) and Bayesian (or Schwarz) information
criterion (BIC). These statistics are available only after estimation with sem method(ml) or
method(mlmv). These statistics are used not to judge fit in absolute terms but instead to
compare the fit of different models. Smaller values indicate a better fit. Be aware that there
are many variations (minor adjustments) to statistics labeled AIC and BIC. Reported here are
statistics that match estat ic; see [R] estat ic.
To compare models that use statistics based on likelihoods, such as AIC and BIC, models
should include the same variables; see [SEM] lrtest. See Akaike (1987), Schwarz (1978), and
Raftery (1993).
stats(indices) reports CFI and TLI, two indices such that a value close to 1 indicates a good
fit. CFI stands for comparative fit index. TLI stands for TuckerLewis index and is also known
as the nonnormed fit index. See Bentler (1990).
stats(residuals) reports the standardized root mean squared residual (SRMR) and the coefficient
of determination (CD).
A perfect fit corresponds to an SRMR of 0. A good fit is a small value, considered by some to
be limited to 0.08. SRMR is calculated using the first and second moments unless sem option
nomeans was specified or implied, in which case SRMR is calculated based on second moments
only. Some software packages ignore the first moments even when available. See Hancock and
Mueller (2006, 157).
Concerning CD, a perfect fit corresponds to a CD of 1. CD is like R2 for the whole model.
stats(all) reports all the statistics. You can also specify just the statistics you wish reported,
such as
. estat gof, stats(indices residuals)
142
Stored results
estat gof stores the following in r():
Scalars
r(chi2 ms)
r(df ms)
r(p ms)
r(chi2 bs)
r(df bs)
r(p bs)
r(rmsea)
r(lb90 rmsea)
r(ub90 rmsea)
r(pclose)
r(aic)
r(bic)
r(cfi)
r(tli)
r(cd)
r(srmr)
r(N groups)
Matrices
r(nobs)
References
Akaike, H. 1987. Factor analysis and AIC. Psychometrika 52: 317332.
Bentler, P. M. 1990. Comparative fit indexes in structural models. Psychological Bulletin 107: 238246.
Browne, M. W., and R. Cudeck. 1993. Alternative ways of assessing model fit. Reprinted in Testing Structural
Equation Models, ed. K. A. Bollen and J. S. Long, pp. 136162. Newbury Park, CA: Sage.
Hancock, G. R., and R. O. Mueller, ed. 2006. Structural Equation Modeling: A Second Course. Charlotte, NC:
Information Age Publishing.
Raftery, A. E. 1993. Bayesian model selection in structural equation models. Reprinted in Testing Structural Equation
Models, ed. K. A. Bollen and J. S. Long, pp. 163180. Newbury Park, CA: Sage.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461464.
Also see
[SEM] example 4 Goodness-of-fit statistics
Title
estat mindices Modification indices
Syntax
Remarks and examples
Menu
Stored results
Description
References
Options
Also see
Syntax
estat mindices
, options
options
Description
showpclass(classname)
minchi2(#)
classname
Description
scoef
scons
structural coefficients
structural intercepts
mcoef
mcons
measurement coefficients
measurement intercepts
serrvar
merrvar
smerrcov
meanex
covex
all
none
Menu
Statistics
>
>
>
Modification indices
Description
estat mindices is for use after sem but not gsem.
estat mindices reports modification indices for omitted paths in the fitted model. Modification
indices are score tests (Lagrange multiplier tests) for the statistical significance of the omitted paths.
See Sorbom (1989) and Wooldridge (2010, 421428).
143
144
Options
showpclass(classname) specifies that results be limited to parameters that belong to the specified
parameter classes. The default is showpclass(all).
minchi2(#) suppresses listing paths with modification indices (MIs) less than #. By default,
estat mindices lists values significant at the 0.05 level, corresponding to 2 (1) value
minchi2(3.8414588). Specify minchi2(0) if you wish to see all tests.
Stored results
estat mindices stores the following in r():
Scalars
r(N groups)
Matrices
r(nobs)
r(mindices pclass)
r(mindices)
number of groups
sample size for each group
parameter class of modification indices
matrix containing the displayed table values
References
Sorbom, D. 1989. Model modification. Psychometrika 54: 371384.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
Also see
[SEM] example 5 Modification indices
Title
estat residuals Display mean and covariance residuals
Syntax
Remarks and examples
Menu
Stored results
Description
References
Options
Also see
Syntax
estat residuals
, options
options
Description
normalized
standardized
sample
nm1
zerotolerance(tol)
format(% fmt)
Menu
Statistics
>
>
Goodness of fit
>
Matrices of residuals
Description
estat residuals is for use after sem but not gsem.
estat residuals displays the mean and covariance residuals. Normalized and standardized
residuals are available.
Both mean and covariance residuals are reported unless sems option nomeans was specified or
implied at the time the model was fit, in which case mean residuals are not reported.
estat residuals usually does not work following sem models fit with method(mlmv). It also
does not work if there are any missing values, which after all is the whole point of using method(mlmv).
Options
normalized and standardized are alternatives. If neither is specified, raw residuals are reported.
Normalized residuals and standardized residuals attempt to adjust the residuals in the same way,
but they go about it differently. The normalized residuals are always valid, but they do not follow
a standard normal distribution. The standardized residuals do follow a standard normal distribution
but only if they can be calculated; otherwise, they will equal missing values. When both can be
calculated (equivalent to both being appropriate), the normalized residuals will be a little smaller
than the standardized residuals. See Joreskog and Sorbom (1986).
145
146
sample specifies that the sample variance and covariances be used in variance formulas to compute
normalized and standardized residuals. The default uses fitted variance and covariance values as
described by Bollen (1989).
nm1 specifies that the variances be computed using N 1 in the denominator rather than using sample
size N .
zerotolerance(tol) treats residuals within tol of 0 as if they were 0. tol must be a numeric
value less than 1. The default is zerotolerance(0), meaning that no tolerance is applied.
When standardized residuals cannot be calculated, it is because a variance calculated by the
Hausman (1978) theorem turns negative. Applying a tolerance to the residuals turns some residuals
into 0 and then division by the negative variance becomes irrelevant, and that may be enough to
solve the calculation problem.
format(% fmt) specifies the display format. The default is format(%9.3f).
Stored results
estat residuals stores the following in r():
Scalars
r(N groups)
number of groups
Macros
r(sample)
r(nm1)
Matrices
r(nobs)
r(res mean
# )
# )
r(res cov
r(nres mean
r(nres cov
r(sres mean
r(sres cov
# )
# )
# )
# )
(*) If there are no estimated means or intercepts in the sem model, these matrices are not returned.
References
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 12511271.
Joreskog, K. G., and D. Sorbom. 1986. Lisrel VI: Analysis of linear structural relationships by the method of
maximum likelihood. Mooresville, IN: Scientific Software.
Also see
[SEM] example 10 MIMIC model
147
Title
estat scoretests Score tests
Syntax
Remarks and examples
Menu
Stored results
Description
References
Option
Also see
Syntax
estat scoretests
, minchi2(#)
Menu
Statistics
>
>
>
Description
estat scoretests is for use after sem but not gsem.
estat scoretests displays score tests (Lagrange multiplier tests) for each of the user-specified
linear constraints imposed on the model when it was fit. See Sorbom (1989) and Wooldridge (2010,
421428).
Option
minchi2(#) suppresses output of tests with 2 (1) < #. By default, estat mindices lists values significant at the 0.05 level, corresponding to 2 (1) value minchi2(3.8414588). Specify
minchi2(0) if you wish to see all tests.
Stored results
estat scoretests stores the following in r():
Scalars
r(N groups)
number of groups
Matrices
r(nobs)
r(Cns sctest)
148
149
References
Sorbom, D. 1989. Model modification. Psychometrika 54: 371384.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
Also see
[SEM] example 8 Testing that coefficients are equal, and constraining them
Title
estat stable Check stability of nonrecursive system
Syntax
Remarks and examples
Menu
Stored results
Description
Reference
Option
Also see
Syntax
estat stable
, detail
Menu
Statistics
>
>
Other
>
Description
estat stable is for use after sem but not gsem.
estat stable reports the eigenvalue stability index for nonrecursive models. The stability index
is computed as the maximum modulus of the eigenvalues for the matrix of coefficients on endogenous
variables predicting other endogenous variables. If the model was fit by sem with the group() option,
estat stable reports the index for each group separately.
There are two formulas commonly used to calculate the index. estat stable uses the formulation
of Bentler and Freeman (1983).
Option
detail displays the matrix of coefficients on endogenous variables predicting other endogenous
variables, also known as the matrix.
150
151
Stored results
estat stable stores the following in r():
Scalars
r(N groups)
r(stindex
number of groups
# )
Matrices
r(nobs)
r(Beta
# )
r(Im
# )
r(Re
# )
r(Modulus
# )
Reference
Bentler, P. M., and E. H. Freeman. 1983. Tests for stability in linear structural equation systems. Psychometrika 48:
143145.
Also see
[SEM] estat teffects Decomposition of effects into total, direct, and indirect
[SEM] methods and formulas for sem Methods and formulas for sem
[SEM] sem postestimation Postestimation tools for sem
Title
estat stdize Test standardized parameters
Syntax
Menu
Description
Stored results
Also see
Syntax
estat stdize: test ...
estat stdize: lincom ...
estat stdize: testnl ...
estat stdize: nlcom ...
Menu
Statistics
>
>
>
Description
estat stdize: is for use after sem but not gsem.
estat stdize: can be used to prefix test, lincom, testnl, and nlcom; see [SEM] test,
[SEM] lincom, [SEM] testnl, and [SEM] nlcom.
These commands without a prefix work in the underlying metric of SEM, which is to say path
coefficients, variances, and covariances. If the commands are prefixed with estat stdize:, they will
work in the metric of standardized coefficients and correlation coefficients. There is no counterpart
to variances in the standardized metric because variances are standardized to be 1.
Stored results
Stored results are the results stored by the command being used with the estat stdize: prefix.
152
Also see
[SEM] example 16 Correlation
153
Title
estat summarize Report summary statistics for estimation sample
Syntax
Description
Options
Stored results
Also see
Syntax
estat summarize
eqlist
>
Postestimation
>
Description
estat summarize is a standard postestimation command of Stata. This entry concerns use of
estat summarize after sem or gsem.
estat summarize reports the summary statistics in the estimation sample for the observed variables
in the model. estat summarize is mentioned here because
1. estat summarize cannot be used if sem was run on summary statistics data; see [SEM] intro 11.
2. estat summarize allows the additional option group after estimation by sem.
If you fit your model with gsem instead of sem, see [R] estat summarize.
Options
group may be specified if group(varname) was specified with sem at the time the model was fit.
It requests that summary statistics be reported by group.
estat summ options are the standard options allowed by estat summarize and are outlined in
Options of [R] estat summarize.
Stored results
See Stored results of [R] estat summarize.
Also see
[R] estat summarize Summarize estimation sample
[SEM] sem postestimation Postestimation tools for sem
[SEM] gsem postestimation Postestimation tools for gsem
154
Title
estat teffects Decomposition of effects into total, direct, and indirect
Syntax
Remarks and examples
Menu
Stored results
Description
References
Options
Also see
Syntax
estat teffects
, options
options
Description
compact
standardized
nolabel
nodirect
noindirect
nototal
display options
Menu
Statistics
>
>
>
Description
estat teffects is for use after sem but not gsem.
estat teffects reports direct, indirect, and total effects for each path (Sobel 1987), along with
standard errors obtained by the delta method.
Options
compact is a popular option. Consider the following model:
. sem (y1<-y2 x1) (y2<-x2)
x2 has no direct effect on y1 but does have an indirect effect. estat teffects formats all its
effects tables the same way by default, so there will be a row for the direct effect of x2 on y1
just because there is a row for the indirect effect of x2 on y1. The value reported for the direct
effect, of course, will be 0. compact says to omit these unnecessary rows.
standardized reports effects in standardized form, but standard errors of the standardized effects
are not reported.
nolabel is relevant only if estimation was with sems group() option and the group variable has a
value label. Groups are identified by group value rather than label.
nodirect, noindirect, and nototal suppress the display of the indicated effect. The default is to
display all effects.
155
156
display options: noomitted, vsquish, cformat(% fmt), pformat(% fmt), sformat(% fmt), and
nolstretch; see [R] estimation options. Although estat teffects is not an estimation command, it allows these options.
care must be taken when interpreting indirect effects. The feedback loop is when a variable indirectly
affects itself, as y1 does in the example; y1 affects y2 and y2 affects y1. Thus in calculating the
indirect effect, the sum has an infinite number of terms although the term values get smaller and
smaller and thus usually converge to a finite result. It is important that you check nonrecursive models
for stability; see Bollen (1989, 397) and see [SEM] estat stable. Caution: if the model is unstable,
the calculation of the indirect effect can sometimes still converge to a finite result.
Stored results
estat teffects stores the following in r():
Scalars
r(N groups)
Matrices
r(nobs)
r(direct)
r(indirect)
r(total)
r(V direct)
r(V indirect)
r(V total)
number of groups
sample size for each
direct effects
indirect effects
total effects
covariance matrix of
covariance matrix of
covariance matrix of
group
estat teffects with the standardized option additionally stores the following in r():
Matrices
r(direct std)
r(indirect std)
r(total std)
157
References
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Sobel, M. E. 1987. Direct and indirect effects in linear structural equation models. Sociological Methods and Research
16: 155176.
Also see
[SEM] estat stable Check stability of nonrecursive system
[SEM] methods and formulas for sem Methods and formulas for sem
[SEM] sem postestimation Postestimation tools for sem
Title
example 1 Single-factor measurement model
Description
Reference
Also see
Description
The single-factor measurement model is demonstrated using the following data:
. use http://www.stata-press.com/data/r13/sem_1fmm
(single-factor measurement model)
. summarize
Obs
Mean
Std. Dev.
Variable
x1
x2
x3
x4
123
123
123
123
96.28455
97.28455
97.09756
690.9837
14.16444
16.14764
15.10207
77.50737
Min
Max
54
64
62
481
131
135
138
885
. notes
_dta:
1. fictional data
2. Variables x1, x2, and x3 each contain a test score designed to measure X.
The test is scored to have mean 100.
3. Variable x4 is also designed to measure X, but designed to have mean 700.
x1
x2
x3
x4
e.x1
e.x2
e.x3
158
e.x4
159
x1 x2 x3 x4
Exogenous variables
Latent:
Number of obs
123
[x1]X = 1
OIM
Std. Err.
Coef.
Measurement
x1 <X
_cons
1
96.28455
(constrained)
1.271963
75.70
P>|z|
0.000
93.79155
98.77755
x2 <X
_cons
1.172364
97.28455
.1231777
1.450053
9.52
67.09
0.000
0.000
.9309398
94.4425
1.413788
100.1266
X
_cons
1.034523
97.09756
.1160558
1.356161
8.91
71.60
0.000
0.000
.8070579
94.43953
1.261988
99.75559
X
_cons
6.886044
690.9837
.6030898
6.960137
11.42
99.28
0.000
0.000
5.704009
677.3421
8.068078
704.6254
80.79361
96.15861
99.70874
353.4711
118.2068
11.66414
13.93945
14.33299
236.6847
23.82631
60.88206
72.37612
75.22708
95.14548
79.62878
107.2172
127.7559
132.1576
1313.166
175.4747
x3 <-
x4 <-
var(e.x1)
var(e.x2)
var(e.x3)
var(e.x4)
var(X)
x1 = 1 + X1 + e.x1
x2 = 2 + X2 + e.x2
x3 = 3 + X3 + e.x3
x4 = 4 + X4 + e.x4
Notes:
1. Variable X is latent exogenous and thus needs a normalizing constraint. The variable is anchored
to the first observed variable, x1, and thus the path coefficient is constrained to be 1. See
Identification 2: Normalization constraints (anchoring) in [SEM] intro 4.
160
2. The path coefficients for X->x1, X->x2, and X->x3 are 1 (constrained), 1.17, and 1.03. Meanwhile,
the path coefficient for X->x4 is 6.89. This is not unexpected; we at StataCorp generated this
data, and the true coefficients are 1, 1, 1, and 7.
3. A test for model versus saturated is reported at the bottom of the output; the 2 (2) statistic
is 1.78 and its significance level is 0.4111. We cannot reject the null hypothesis of this test.
This test is a goodness-of-fit test in badness-of-fit units; a significant result implies that there
may be missing paths in the models specification.
More mathematically, the null hypothesis of the test is that the fitted covariance matrix and mean
vector of the observed variables are equal to the matrix and vector observed in the population.
-2081.0303
-2081.0303
-2080.9861
-2080.9859
model
Std. Err.
Number of obs
123
P>|z|
0.000
93.79155
98.77755
x1 <X
_cons
1
96.28455
(constrained)
1.271962
75.70
X
_cons
1.172365
97.28455
.1231778
1.450052
9.52
67.09
0.000
0.000
.9309411
94.4425
1.413789
100.1266
X
_cons
1.034524
97.09756
.1160559
1.35616
8.91
71.60
0.000
0.000
.8070585
94.43954
1.261989
99.75559
X
_cons
6.886053
690.9837
.6030902
6.96013
11.42
99.28
0.000
0.000
5.704018
677.3421
8.068088
704.6253
var(X)
118.2064
23.8262
79.62858
175.474
var(e.x1)
var(e.x2)
var(e.x3)
var(e.x4)
80.79381
96.15857
99.70883
353.4614
11.66416
13.93942
14.33298
236.6835
60.88222
72.37613
75.22718
95.14011
107.2175
127.7558
132.1577
1313.168
x2 <-
x3 <-
x4 <-
161
Notes:
1. Results are virtually the same. Coefficients differ in the last digit; for instance, x2<-X was 1.172364
and now it is 1.172365. The same is true of standard errors, etc. Meanwhile, variance estimates
are usually differing in the next-to-last digit; for instance, var(e.x2) was 96.15861 and is now
96.15857.
These are the kind of differences we would expect to see. gsem follows a different approach
for obtaining results that involves far more numeric machinery, which correspondingly results in
slightly less accuracy.
2. The log-likelihood values reported are the same. This model is one of the few models we could
have chosen where sem and gsem would produce the same log-likelihood values. In general, gsem
log likelihoods are on different metrics from those of sem. In the case where the model does not
include observed exogenous variables, however, they share the same metric.
3. There is no reason to use gsem over sem when both can fit the same model. sem is slightly more
accurate, is quicker, and has more postestimation features.
162
seed 12347
obs 123
X = round(rnormal(0,10))
x1 = round(100 + X + rnormal(0, 10))
x2 = round(100 + X + rnormal(0, 10))
x3 = round(100 + X + rnormal(0, 10))
x4 = round(700 + 7*X + rnormal(0, 10))
The data recorded in sem 1fmm.dta were obviously generated using normality, the same assumption
that is most often used to justify the SEM maximum likelihood estimator. In [SEM] intro 4, we explained
that the normality assumption can be relaxed and conditional normality can usually be substituted in
its place.
So lets consider nonnormal data. Lets make X be 2 (2), a violently nonnormal distribution,
resulting in the data-manufacturing code
set
set
gen
gen
gen
gen
gen
seed 12347
obs 123
X = (rchi2(2)-2)*(10/2)
x1 = round(100 + X + rnormal(0, 10))
x2 = round(100 + X + rnormal(0, 10))
x3 = round(100 + X + rnormal(0, 10))
x4 = round(700 + 7*X + rnormal(0, 10))
163
All the rnormal() functions remaining in our code have to do with the assumed normality of
the errors. The multiplicative and additive constants in the generation of X simply rescale the 2 (2)
variable to have mean 100 and standard deviation 10, which would not be important except for the
subsequent round() functions, which themselves were unnecessary except that we wanted to produce
a pretty dataset when we created the original sem 1fmm.dta.
In any case, if we rerun the commands with these data, we obtain
Reference
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] intro 5 Tour of models
[SEM] example 3 Two-factor measurement model
[SEM] example 24 Reliability
[SEM] example 27g Single-factor measurement model (generalized response)
Title
example 2 Creating a dataset from published covariances
Description
Reference
Also see
Description
Williams, Eaves, and Cox (2002) publish covariances from their data. We will use those published
covariances in [SEM] example 3 to fit an SEM.
In this example, we show how to create the summary statistics dataset (SSD) that we will analyze
in that example.
Background
In Williams, Eaves, and Cox (2002), the authors report a covariance matrix in a table that looks
something like this:
Affective
1
Affective
1
2
.
.
5
2038.035
Miniscale
... 5
1631.766
1932.163
...
...
Cognitive
1
.
.
164
Cognitive
1
2
Miniscale
...
...
...
2061.875
775.118
871.211
...
630.518
500.128
...
165
\
1932.163
1336.871
1647.164
1688.292
702.969
790.448
879.179
\
1313.809 \
1273.261 2034.216 \
1498.401 1677.767 2061.875 \
585.019 656.527 775.118 630.518 \
653.734 764.755 871.211 500.128 741.767 \
750.037
\
739.157 659.867
855.272 \
785.419
622.830
751.860
669.951 802.825
728.674 ;
observations:
means:
variances or sd:
covariances or correlations:
. #delimit cr
delimiter now cr
set
unset
set
set
Notes:
1. We used #delimit to temporarily set the end-of-line character to semicolon. That was not
necessary, but it made it easier to enter the data in a way that would be subsequently more
readable. You can use #delimit only in do-files; see [P] #delimit.
2. We recommend entering SSD by using do-files. That way, you can edit the file and get it right.
3. We did not have to reset the delimiter. We could have entered the numbers on one (long) line.
That works well when there are only a few summary statistics.
166
Obviously, we can save the dataset anytime we wish. We know we could stop because ssd status
tells us whether there is anything more that we need to define:
. ssd status
Status:
observations:
means:
variances or sd:
covariances or correlations:
set
unset
set
set
Notes:
1. The means have not been set. The authors did not provide the means.
2. ssd status would mention if anything that was not set was required to be set.
variable label
a1
a2
a3
a4
a5
c1
c2
c3
c4
c5
167
variable label
a1
a2
a3
a4
a5
c1
c2
c3
c4
c5
affective
affective
affective
affective
affective
cognitive
cognitive
cognitive
cognitive
cognitive
arousal
arousal
arousal
arousal
arousal
arousal
arousal
arousal
arousal
arousal
1
2
3
4
5
1
2
3
4
5
Notes:
1. You can label the variables and the data, and you can add notes just as you would to any dataset.
2. You save and use SSD just as you save and use any dataset.
168
c4
c5
855.272
622.83
728.674
Reference
Williams, T. O., Jr., R. C. Eaves, and C. Cox. 2002. Confirmatory factor analysis of an instrument designed to
measure affective and cognitive arousal. Educational and Psychological Measurement 62: 264283.
Also see
[SEM] example 3 Two-factor measurement model
Title
example 3 Two-factor measurement model
Description
References
Also see
Description
The multiple-factor measurement model is demonstrated using summary statistics dataset (SSD)
sem 2fmm.dta:
. use http://www.stata-press.com/data/r13/sem_2fmm
(Affective and cognitive arousal)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_2fmm.dta
obs:
216
Affective and cognitive arousal
vars:
10
25 May 2013 10:11
(_dta has notes)
variable name
variable label
a1
a2
a3
a4
a5
c1
c2
c3
c4
c5
affective
affective
affective
affective
affective
cognitive
cognitive
cognitive
cognitive
cognitive
arousal
arousal
arousal
arousal
arousal
arousal
arousal
arousal
arousal
arousal
1
2
3
4
5
1
2
3
4
5
. notes
_dta:
1. Summary statistics data containing published covariances from Thomas O.
Williams, Ronald C. Eaves, and Cynthia Cox, 2 Apr 2002, "Confirmatory
factor analysis of an instrument designed to measure affective and
cognitive arousal", _Educational and Psychological Measurement_, vol. 62
no. 2, 264-283.
2. a1-a5 report scores from 5 miniscales designed to measure affective
arousal.
3. c1-c5 report scores from 5 miniscales designed to measure cognitive
arousal.
4. The series of tests, known as the VST II (Visual Similes Test II) were
administered to 216 children ages 10 to 12. The miniscales are sums of
scores of 5 to 6 items in VST II.
169
170
Cognitive
Affective
a1
a2
a3
a4
a5
c1
c2
c3
c4
c5
10
0:
1:
2:
3:
log
log
log
log
likelihood
likelihood
likelihood
likelihood
=
=
=
=
-9542.8803
-9539.5505
-9539.3856
-9539.3851
Number of obs
216
[a1]Affective = 1
[c1]Cognitive = 1
Coef.
OIM
Std. Err.
P>|z|
Measurement
a1 <Affective
a2 <Affective
.9758098
.0460752
21.18
0.000
.885504
1.066116
a3 <Affective
.8372599
.0355086
23.58
0.000
.7676643
.9068556
a4 <Affective
.9640461
.0499203
19.31
0.000
.866204
1.061888
a5 <Affective
1.063701
.0435751
24.41
0.000
.9782951
1.149107
c1 <Cognitive
c2 <Cognitive
1.114702
.0655687
17.00
0.000
.9861901
1.243215
c3 <Cognitive
1.329882
.0791968
16.79
0.000
1.174659
1.485105
c4 <Cognitive
1.172792
.0711692
16.48
0.000
1.033303
1.312281
c5 <Cognitive
1.126356
.0644475
17.48
0.000
1.000041
1.252671
var(e.a1)
var(e.a2)
var(e.a3)
var(e.a4)
var(e.a5)
var(e.c1)
var(e.c2)
var(e.c3)
var(e.c4)
var(e.c5)
var(Affect~e)
var(Cognit~e)
384.1359
357.3524
154.9507
496.4594
191.6857
171.6638
171.8055
276.0144
224.1994
146.8655
1644.463
455.9349
43.79119
41.00499
20.09026
54.16323
28.07212
19.82327
20.53479
32.33535
25.93412
18.5756
193.1032
59.11245
307.2194
285.3805
120.1795
400.8838
143.8574
136.894
135.9247
219.3879
178.7197
114.6198
1306.383
353.6255
480.3095
447.4755
199.7822
614.8214
255.4154
215.2649
217.1579
347.2569
281.2527
188.1829
2070.034
587.8439
cov(Affec~e,
Cognitive)
702.0736
85.72272
534.0601
870.087
(constrained)
(constrained)
8.19
=
0.000
171
172
Notes:
1. In [SEM] example 1, we ran sem on raw data. In this example, we run sem on SSD. There are
no special sem options that we need to specify because of this.
2. The estimated coefficients reported above are unstandardized coefficients or, if you prefer, factor
loadings.
3. The coefficients listed at the bottom of the coefficient table that start with e. are the estimated
error variances. They represent the variance of the indicated measurement that is not measured
by the respective latent variables.
4. The above results do not match exactly (Kline 2005, 184). If we specified sem option nm1, results
are more likely to match to 3 or 4 digits. The nm1 option says to divide by N 1 rather than
by N in producing variances and covariances.
To see the results in standardized form, we replay the results with the standardized option:

. sem, standardized

Structural equation model                       Number of obs      =        216

------------------------------------------------------------------------------
             |                 OIM
Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  a1 <-      |
   Affective |   .9003553   .0143988    62.53   0.000     .8721342    .9285765
  a2 <-      |
   Affective |   .9023249   .0141867    63.60   0.000     .8745195    .9301304
  a3 <-      |
   Affective |   .9388883   .0097501    96.29   0.000     .9197784    .9579983
  a4 <-      |
   Affective |   .8687982   .0181922    47.76   0.000     .8331421    .9044543
  a5 <-      |
   Affective |   .9521559   .0083489   114.05   0.000     .9357923    .9685195
  c1 <-      |
   Cognitive |   .8523351   .0212439    40.12   0.000     .8106978    .8939725
  c2 <-      |
   Cognitive |   .8759601   .0184216    47.55   0.000     .8398544    .9120658
  c3 <-      |
   Cognitive |    .863129   .0199624    43.24   0.000     .8240033    .9022547
  c4 <-      |
   Cognitive |   .8582786   .0204477    41.97   0.000     .8182018    .8983554
  c5 <-      |
   Cognitive |   .8930346   .0166261    53.71   0.000     .8604479    .9256212
-------------+----------------------------------------------------------------
   var(e.a1) |   .1893602   .0259281                      .1447899    .2476506
   var(e.a2) |   .1858097   .0256021                      .1418353    .2434179
   var(e.a3) |   .1184887   .0183086                      .0875289    .1603993
   var(e.a4) |   .2451896   .0316107                      .1904417    .3156764
   var(e.a5) |   .0933991    .015899                      .0669031    .1303885
   var(e.c1) |   .2735248   .0362139                      .2110086     .354563
   var(e.c2) |   .2326939   .0322732                      .1773081    .3053806
   var(e.c3) |   .2550083   .0344603                      .1956717    .3323385
   var(e.c4) |   .2633578   .0350997                      .2028151    .3419733
   var(e.c5) |   .2024893   .0296954                      .1519049    .2699183
var(Affect~e)|          1          .                             .           .
var(Cognit~e)|          1          .                             .           .
-------------+----------------------------------------------------------------
cov(Affec~e, |
  Cognitive) |   .8108102   .0268853    30.16   0.000      .758116    .8635045
------------------------------------------------------------------------------
Notes:
1. In addition to obtaining standardized coefficients, the standardized option reports estimated
error variances as the fraction of the variance that is unexplained. Error variances were previously
unintelligible numbers such as 384.136 and 357.352. Now they are 0.189 and 0.186.
2. Also listed in the sem output are variances of latent variables. In the previous output, latent
variable Affective had variance 1,644.46 with standard error 193. In the standardized output, it
has variance 1 with standard error missing. The variances of the latent variables are standardized
to 1, and obviously, being a normalization, there is no corresponding standard error.
3. We can now see at the bottom of the coefficient table that affective and cognitive arousal are
correlated 0.81 because standardized covariances are correlation coefficients.
4. The standardized coefficients for this model can be interpreted as the correlation coefficients
between the indicator and the latent variable because each indicator measures only one factor.
For instance, the standardized path coefficient a1<-Affective is 0.90, meaning the correlation
between a1 and Affective is 0.90.
b. Click in the upper-right quadrant of the Affective oval (it will highlight when you hover
over it), and drag a covariance to the upper-left quadrant of the Cognitive oval (it will
highlight when you can release to connect the covariance).
7. Clean up.
If you do not like where a covariance has been connected to its variable, use the Select tool,
, to click on the covariance, and then simply click on where it connects to an oval and drag
the endpoint. You can also change the bow of the covariance by dragging the control point that
extends from one end of the selected covariance.
8. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
SEM estimation options dialog box.
9. Show standardized estimates.
From the SEM Builder menu, select View > Standardized Estimates.
You can open a completed diagram in the Builder by typing
. webgetsem sem_2fmm
We can also examine equation-level goodness of fit:

. estat eqgof

Equation-level goodness of fit

------------------------------------------------------------------------------
             |           Variance            |
     depvars |    fitted  predicted  residual|  R-squared        mc       mc2
-------------+-------------------------------+--------------------------------
observed     |
          a1 |  2028.598   1644.463  384.1359|   .8106398  .9003553  .8106398
          a2 |  1923.217   1565.865  357.3524|   .8141903  .9023249  .8141903
          a3 |  1307.726   1152.775  154.9507|   .8815113  .9388883  .8815113
          a4 |  2024.798   1528.339  496.4594|   .7548104  .8687982  .7548104
          a5 |  2052.328   1860.643  191.6857|   .9066009  .9521559  .9066009
          c1 |  627.5987   455.9349  171.6638|   .7264752  .8523351  .7264752
          c2 |  738.3325    566.527  171.8055|   .7673061  .8759601  .7673061
          c3 |  1082.374   806.3598  276.0144|   .7449917   .863129  .7449917
          c4 |   851.311   627.1116  224.1994|   .7366422  .8582786  .7366422
          c5 |  725.3002   578.4346  146.8655|   .7975107  .8930346  .7975107
-------------+-------------------------------+--------------------------------
     overall |                               |   .9949997
------------------------------------------------------------------------------
Notes:
1. fitted reports the fitted variance of each of the endogenous variables, whether observed or
latent. In this case, we have observed endogenous variables.
2. predicted reports the variance of the predicted value of each endogenous variable.
3. residual reports the leftover residual variance.
4. R-squared reports R2, the fraction of variance of the indicated variable explained by the model.
The fraction of the variance of a1 explained by Affective is 0.81, just as we calculated by hand
at the beginning of this section. The overall R2 is also called the coefficient of determination.
5. mc stands for multiple correlation, and mc2 stands for multiple correlation squared. R-squared,
mc, and mc2 all report the relatedness of the indicated dependent variable with the model's linear
prediction. In recursive models, all three statistics are really the same number: mc is equal to the
square root of R-squared, and mc2 is equal to R-squared.
In nonrecursive models, these three statistics are different, and each can have problems. R-squared
and mc can actually become negative! That does not mean the model has negative predictive
power; it might even have reasonable predictive power. mc2 = mc squared is recommended
by Bentler and Raykov (2000) for use instead of R-squared in nonrecursive systems.
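A quick check of those relationships for a1, using Stata's display calculator:

. display sqrt(.8106398)     // mc  = sqrt(R-squared), approximately .9004
. display .9003553^2         // mc2 = mc squared, approximately .8106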
In [SEM] example 4, we examine the goodness-of-fit statistics for this model.
In [SEM] example 5, we examine modification indices for this model.
References
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Bentler, P. M., and T. Raykov. 2000. On measures of explained variance in nonrecursive structural equation models.
Journal of Applied Psychology 85: 125-131.
Kline, R. B. 2005. Principles and Practice of Structural Equation Modeling. 2nd ed. New York: Guilford Press.
Also see
[SEM] example 1 Single-factor measurement model
[SEM] example 2 Creating a dataset from published covariances
[SEM] example 20 Two-factor measurement model by group
[SEM] example 26 Fitting a model with data missing at random
[SEM] example 31g Two-factor measurement model (generalized response)
Title
example 4 Goodness-of-fit statistics
Description
Reference
Also see
Description
Here we demonstrate estat gof. See [SEM] intro 7 and [SEM] estat gof.
This example picks up where [SEM] example 3 left off:
. use http://www.stata-press.com/data/r13/sem_2fmm
. sem (Affective -> a1 a2 a3 a4 a5) (Cognitive -> c1 c2 c3 c4 c5)
Most texts refer to this test against the saturated model as the model χ2 test.
These results indicate poor goodness of fit; see [SEM] example 1. The default goodness-of-fit
statistic reported by sem, however, can be overly influenced by sample size, correlations, variance
unrelated to the model, and multivariate nonnormality (Kline 2011, 201).
Goodness of fit in the case of sem is a measure of how well you fit the observed moments, which in
this case are the covariances between all pairs of a1, . . . , a5, c1, . . . , c5. In a measurement model,
the assumed underlying causes are unobserved, and in this example, those unobserved causes are the
latent variables Affective and Cognitive. It may be reasonable to assume that the observed a1,
. . . , a5, c1, . . . , c5 can be filtered through imagined variables Affective and Cognitive, but that
can be reasonable only if not too much information contained in the original variables is lost. Thus
goodness-of-fit statistics are of great interest to those fitting measurement models. Goodness-of-fit
statistics are of far less interest when all variables in the model are observed.
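The output below requests all the goodness-of-fit statistics at once; a minimal sketch of the call
(stats(all) is the estat gof option that requests every group of statistics):

. estat gof, stats(all)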
----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
        chi2_ms(34)  |     88.879   model vs. saturated
           p > chi2  |      0.000
        chi2_bs(45)  |   2467.161   baseline vs. saturated
           p > chi2  |      0.000
Population error     |
              RMSEA  |      0.086   Root mean squared error of approximation
 90% CI, lower bound |      0.065
         upper bound |      0.109
             pclose  |      0.004   Probability RMSEA <= 0.05
Information criteria |
                AIC  |  19120.770   Akaike's information criterion
                BIC  |  19191.651   Bayesian information criterion
Baseline comparison  |
                CFI  |      0.977   Comparative fit index
                TLI  |      0.970   Tucker-Lewis index
Size of residuals    |
               SRMR  |      0.022   Standardized root mean squared residual
                 CD  |      0.995   Coefficient of determination
----------------------------------------------------------------------------
Notes:
1. Desirable values vary from test to test.
2. We asked for all the goodness-of-fit tests. We could have obtained specific tests from the above
output by specifying the appropriate option; see [SEM] estat gof.
3. Under likelihood ratio, estat gof reports two tests. The first is a repeat of the model χ2
test reported at the bottom of the sem output. The saturated model is the model that fits the
covariances perfectly. We can reject at the 5% level (or any other level) that the model fits as
well as the saturated model.
The second test is a baseline versus saturated comparison. The baseline model includes the mean
and variances of all observed variables plus the covariances of all observed exogenous variables.
Different authors define the baseline differently. We can reject at the 5% level (or any other level)
that the baseline model fits as well as the saturated model.
4. Under population error, the RMSEA value is reported along with the lower and upper bounds
of its 90% confidence interval. Most interpreters of this test check whether the lower bound is
below 0.05 or the upper bound is above 0.10. If the lower bound is below 0.05, then they would
not reject the hypothesis that the fit is close. If the upper bound is above 0.10, they would not
reject the hypothesis that the fit is poor. The logic is to perform one test on each end of the 90%
confidence interval and thus have 95% confidence in the result. This model's fit is not close, and
its upper limit is just over the bound for being considered poor.
Pclose, a commonly used word in reference to this test, is the probability that the RMSEA value is
less than 0.05, interpreted as the probability that the predicted moments are close to the moments
in the population. This model's fit is not close.
5. Under information criteria are reported AIC and BIC, which contain little information by themselves
but are often used to compare models. Smaller values are considered better.
6. Under baseline comparison are reported CFI and TLI, two indices such that a value close to 1
indicates a good fit. TLI is also known as the nonnormed fit index.
7. Under size of residuals are reported the standardized root mean squared residual (SRMR) and the
coefficient of determination (CD).
A perfect fit corresponds to an SRMR of 0, and a good fit corresponds to a small value,
considered by some to be a value less than 0.08. The model fits well by this standard.
The CD is like an R2 for the whole model. A value close to 1 indicates a good fit.
estat gof provides multiple goodness-of-fit statistics because, across fields, different researchers
use different statistics. You should not print them all and look for the one reporting the result you
seek.
Reference
Kline, R. B. 2011. Principles and Practice of Structural Equation Modeling. 3rd ed. New York: Guilford Press.
Also see
[SEM] example 3 Two-factor measurement model
[SEM] example 21 Group-level goodness of fit
Title
example 5 Modification indices
Description
Reference
Also see
Description
Here we demonstrate the use of estat mindices; see [SEM] intro 7 and [SEM] estat mindices.
This example picks up where [SEM] example 3 left off:
. use http://www.stata-press.com/data/r13/sem_2fmm
. sem (Affective -> a1 a2 a3 a4 a5) (Cognitive -> c1 c2 c3 c4 c5)
By default in the command language, latent exogenous variables are assumed to be correlated
unless we specify otherwise. Had we used the Builder, the latent exogenous variables would have
been assumed to be uncorrelated unless we had drawn the curved path between them.
The original authors who collected these data analyzed them assuming no covariance, which we
could obtain by typing
. sem (Affective -> a1 a2 a3 a4 a5) (Cognitive -> c1 c2 c3 c4 c5), ///
cov(Affective*Cognitive@0)
It was Kline (2005, 70-74, 184) who allowed the covariance. Possibly he did that after looking at
the modification indices.
The modification indices report statistics on all omitted paths. Let's begin with the model without
the covariance:
. sem (Affective -> a1 a2 a3 a4 a5) (Cognitive -> c1 c2 c3 c4 c5),  ///
      cov(Affective*Cognitive@0)
(output omitted)
. estat mindices

Modification indices

-----------------------------------------------------------------------------
                             |      MI   df   P>MI         EPC   Standard EPC
-----------------------------+------------------------------------------------
Measurement                  |
  a5 <-                      |
    Cognitive                |   8.059    1   0.00    .1604476       .075774
  c5 <-                      |
    Affective                |   5.885    1   0.02    .0580897       .087733
-----------------------------+------------------------------------------------
  cov(e.a1,e.a4)             |   5.767    1   0.02    84.81133      .1972802
  cov(e.a1,e.a5)             |   7.597    1   0.01   -81.82092     -.2938627
  cov(e.a2,e.a4)             |  14.300    1   0.00     129.761      .3110565
  cov(e.a2,e.c4)             |   4.071    1   0.04   -45.44807     -.1641344
  cov(e.a3,e.a4)             |  21.183    1   0.00   -116.8181     -.4267012
  cov(e.a3,e.a5)             |  25.232    1   0.00    118.4674      .6681337
  cov(e.a5,e.c4)             |   4.209    1   0.04    39.07999       .184049
  cov(e.c1,e.c3)             |  11.326    1   0.00     66.3965      .3098331
  cov(e.c1,e.c5)             |   8.984    1   0.00   -47.31483     -.2931597
  cov(e.c3,e.c4)             |  12.668    1   0.00   -80.98353      -.333871
  cov(e.c4,e.c5)             |   4.483    1   0.03     38.6556      .2116015
  cov(Affective,Cognitive)   | 128.482    1   0.00    704.4469      .8094959
-----------------------------------------------------------------------------
EPC = expected parameter change
Notes:
1. Four columns of results are reported.
a. MI stands for modification index and is an approximation to the change in the model's
goodness-of-fit χ2 if the path were added.
b. df stands for degrees of freedom and is the number that would be added to the d of the χ2(d) test.
c. P>MI is the significance level of the χ2(df) test.
d. EPC stands for expected parameter change and is an approximation to the value of the
parameter if it were not constrained to 0. It is reported in unstandardized (EPC) and
standardized (Standard EPC) units.
2. There are lots of significant omitted paths in the above output.
3. Paths are listed only if the modification index is significant at the 0.05 level, corresponding to
a χ2(1) value of 3.8414588. You may specify the minchi2() option to use a different χ2(1) value;
specify minchi2(0) if you wish to see all tests (see the sketch at the end of these notes).
4. The omitted path between Affective and Cognitive has the largest change in χ2 observed.
Perhaps this is why Kline (2005, 70-74, 184) allowed a covariance between the two latent
variables. The standardized EPC reports the relaxed-constraint correlation value, which is the
value reported for the unconstrained correlation path in [SEM] example 3.
Another way of dealing with this significant result would be to add a direct path between the
variables, but that perhaps would have invalidated the theory being proposed. The original authors
instead proposed a second-order model postulating that Affective and Cognitive are themselves
measurements of another latent variable that might be called Arousal.
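As mentioned in note 3, a minimal sketch of listing all modification indices, including the
insignificant ones:

. estat mindices, minchi2(0)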
Reference
Kline, R. B. 2005. Principles and Practice of Structural Equation Modeling. 2nd ed. New York: Guilford Press.
Also see
[SEM] example 3 Two-factor measurement model
Title
example 6 Linear regression
Description
Also see
Description
Linear regression is demonstrated using auto.dta:
. sysuse auto
(1978 Automobile Data)
(Figure: path diagram regressing mpg on weight, weight2, and foreign)

. generate weight2 = weight^2
. sem (mpg <- weight weight2 foreign)
Structural equation model                       Number of obs      =         74

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  mpg <-     |
      weight |  -.0165729   .0038604    -4.29   0.000    -.0241392   -.0090067
     weight2 |   1.59e-06   6.08e-07     2.62   0.009     4.00e-07    2.78e-06
     foreign |    -2.2035    1.03022    -2.14   0.032    -4.222695   -.1843056
       _cons |   56.53884   6.027559     9.38   0.000     44.72504    68.35264
-------------+----------------------------------------------------------------
  var(e.mpg) |   10.19332   1.675772                      7.385485    14.06865
------------------------------------------------------------------------------
Notes:
1. We wished to include variable weight2 in our model. Because sem does not allow Stata's
factor-variable notation, we first had to generate the new variable weight2.
2. Reported coefficients match those reported by regress.
3. Reported standard errors (SEs) differ slightly from those reported by regress. For instance, the
SE for foreign is reported here as 1.03, whereas regress reported 1.06. SEM is an asymptotic
estimator, and sem divides variances and covariances by N = 74, the number of observations.
regress provides unbiased finite-sample estimates and divides by N - k - 1 = 74 - 3 - 1 = 70.
Note that 1.03 x sqrt(74/70) = 1.06; see the checks after these notes.
4. sem reports z statistics whereas regress reports t statistics.
5. Reported confidence intervals differ slightly between sem and regress because of the
(N - k - 1)/N issue.
6. sem reports the point estimate of var(e.mpg) as 10.19332. regress reports the root MSE as 3.2827.
And sqrt(10.19332 x 74/70) = 3.2827; see the checks below.
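Quick checks of both rescalings, using Stata's display calculator:

. display 1.03022*sqrt(74/70)      // approximately 1.06, the SE reported by regress
. display sqrt(10.19332*74/70)     // approximately 3.2827, the root MSE reported by regress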
To see the results in standardized form, we replay with the standardized option:

. sem, standardized

Structural equation model                       Number of obs      =         74

------------------------------------------------------------------------------
             |                 OIM
Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  mpg <-     |
      weight |  -2.226321   .4950378    -4.50   0.000    -3.196577   -1.256064
     weight2 |    1.32654    .498261     2.66   0.008     .3499662    2.303113
     foreign |    -.17527   .0810378    -2.16   0.031    -.3341011   -.0164389
       _cons |   9.839209   .9686872    10.16   0.000     7.940617     11.7378
-------------+----------------------------------------------------------------
  var(e.mpg) |    .308704   .0482719                      .2272168    .4194152
------------------------------------------------------------------------------
By comparison, when regress reports standardized results, it simply reports standardized
coefficients in an extra column; all other results are left in unstandardized form. sem updates the
entire output with the standardized values.
Also see
[SEM] example 12 Seemingly unrelated regression
[SEM] example 38g Random-intercept and random-slope models (multilevel)
[SEM] example 43g Tobit regression
[SEM] example 44g Interval regression
Title
example 7 Nonrecursive structural model
Description
References
Also see
Description
To demonstrate a nonrecursive structural model with all variables observed, we use data from
Duncan, Haller, and Portes (1968):
. use http://www.stata-press.com/data/r13/sem_sm1
(Structural model with all observed values)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_sm1.dta
      obs:           329                Structural model with all obse..
     vars:            10                25 May 2013 10:13
                                        (_dta has notes)
variable name   variable label
---------------------------------------------------------------
r_intel         respondent's intelligence
r_parasp        respondent's parental aspiration
r_ses           respondent's family socioeconomic status
r_occasp        respondent's occupational aspiration
r_educasp       respondent's educational aspiration
f_intel         friend's intelligence
f_parasp        friend's parental aspiration
f_ses           friend's family socioeconomic status
f_occasp        friend's occupational aspiration
f_educasp       friend's educational aspiration
. notes
_dta:
1. Summary statistics data from Duncan, O. D., Haller, A. O., and Portes, A.,
1968, "Peer Influences on Aspirations: A Reinterpretation", _American
Journal of Sociology_ 74, 119-137.
2. The data contain 329 boys with information on five variables and the same
information for each boy's best friend.
If you typed ssd status, you would learn that this dataset contains the correlation matrix only.
Variances (standard deviations) and means are undefined. Thus we need to use this dataset cautiously.
It is always better if you enter the variances and means if you have them.
That these data are the correlations only will not matter for how we will use them.
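A minimal sketch of that check:

. ssd status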
(Figure: nonrecursive path diagram; r_occasp and f_occasp each depend on the other, r_occasp also depends on r_intel, r_ses, and f_ses, f_occasp also depends on f_intel, f_ses, and r_ses, and the two errors are correlated)

We fit the following model:

. sem (r_occasp <- f_occasp r_intel r_ses f_ses)    ///
      (f_occasp <- r_occasp f_intel f_ses r_ses),   ///
      cov(e.r_occasp*e.f_occasp) standardized
Structural equation model                       Number of obs      =        329

------------------------------------------------------------------------------
             |                 OIM
Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  r_occ~p <- |
    f_occasp |   .2773441   .1281904     2.16   0.031     .0260956    .5285926
     r_intel |   .2854766        .05     5.71   0.000     .1874783    .3834748
       r_ses |   .1570082   .0520841     3.01   0.003     .0549252    .2590912
       f_ses |   .0973327    .060153     1.62   0.106     -.020565    .2152304
  f_occ~p <- |
    r_occasp |   .2118102    .156297     1.36   0.175    -.0945264    .5181467
       r_ses |   .0794194   .0587732     1.35   0.177    -.0357739    .1946127
       f_ses |   .1681772   .0537199     3.13   0.002      .062888    .2734663
     f_intel |   .3693682   .0525924     7.02   0.000     .2662891    .4724474
-------------+----------------------------------------------------------------
var(e.r_oc~p)|   .6889244   .0399973                      .6148268    .7719519
var(e.f_oc~p)|   .6378539    .039965                      .5641425    .7211964
-------------+----------------------------------------------------------------
cov(e.r_oc~p,|
  e.f_occasp)|  -.2325666   .2180087    -1.07   0.286    -.6598558    .1947227
------------------------------------------------------------------------------
Notes:
1. We specified the standardized option, but in this case that did not matter much because
these data are based on the correlation coefficients only, so standardized values are equal to
unstandardized values. The exception is the correlation between the latent endogenous variables,
as reflected in the correlation of their errors, and we wanted to show that results match those in
the original paper.
2. Nearly all results match those in the original paper. The authors normalized the errors to have a
variance of 1; sem normalizes the paths from the errors to have coefficient 1. While you can apply
most normalizing constraints any way you wish, sem restricts errors to have path coefficients of
1 and this cannot be modified. You could, however, prove to yourself that sem would produce
the same variances as the authors produced by typing
. sem, coeflegend
. display sqrt(_b[var(e.r_occasp):_cons])
. display sqrt(_b[var(e.f_occasp):_cons])
because the coefficients would be the standard deviations of the errors estimated without the
variance-1 constraint. Thus all results match. We replayed results by using the coeflegend
option so that we would know what to type to refer to the two error variances, namely,
_b[var(e.r_occasp):_cons] and _b[var(e.f_occasp):_cons].
b. Click in the right side of the r_intel rectangle (it will highlight when you hover over it),
and drag a path to the left side of the r_occasp rectangle (it will highlight when you can
release to connect the path).
b. Click in the bottom of the r_occasp rectangle, slightly to the left of the center, and drag
a path to the f_occasp rectangle.
c. Click in the top of the f_occasp rectangle, slightly to the right of the center, and drag a
path to the r_occasp rectangle.
7. Correlate the error terms.
a. Select the Add Covariance tool,
b. Click in the error term for r_occasp (the circle labeled ε1), and drag a covariance to the
error term for f_occasp (the circle labeled ε2).
8. Clean up.
If you do not like where a path has been connected to its variable, use the Select tool,
,
to click on the path, and then simply click on where it connects to a rectangle and drag the
endpoint. Similarly, you can change where the covariance connects to the error terms by clicking
on the covariance and dragging the endpoint. You can also change the bow of the covariance
by clicking on the covariance and dragging the control point that extends from one end of the
selected covariance.
9. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
SEM estimation options dialog box.
10. Show standardized estimates.
From the SEM Builder menu, select View > Standardized Estimates.
Tips: When you draw paths that should be exactly horizontal or vertical, such as the two between
r_occasp and f_occasp, holding the Shift key as you drag the path will guarantee that the line
is perfectly vertical. Also, when drawing paths from the independent variables to the dependent
variables, you may find it more convenient to change the automation settings as described in the
tips of [SEM] example 9. However, this does not work for the feedback loop between the dependent
variables.
You can open a completed diagram in the Builder by typing
. webgetsem sem_sm1
. estat stable

(output omitted: both eigenvalues have modulus .242372, and the stability index is .242372)
Notes:
1. estat stable is for use on nonrecursive models. Recursive models are, by design, stable.
2. Stability concerns whether the parameters of the model are such that the model would blow up
if it were operated over and over again. If the results are found not to be stable, that casts
doubt on the validity of the model.
3. The stability is the maximum of the moduli, and the moduli are the absolute values of the
eigenvalues. Usually, the two eigenvalues are not identical, but it is a property of this model that
they are.
4. If the stability index is less than 1, then the reported estimates yield a stable model.
In the next section, we use estat teffects to estimate total effects. That is appropriate only if
the model is stable, as we find that it is.
We now obtain the direct, indirect, and total effects:

. estat teffects

Direct effects
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  r_occ~p <- |
    r_occasp |          0  (no path)
    f_occasp |   .2773441   .1287622     2.15   0.031     .0249748    .5297134
     r_intel |   .2854766   .0522001     5.47   0.000     .1831662    .3877869
       r_ses |   .1570082    .052733     2.98   0.003     .0536534     .260363
       f_ses |   .0973327   .0603699     1.61   0.107    -.0209901    .2156555
     f_intel |          0  (no path)
  f_occ~p <- |
    r_occasp |   .2118102   .1563958     1.35   0.176      -.09472    .5183404
    f_occasp |          0  (no path)
     r_intel |          0  (no path)
       r_ses |   .0794194   .0589095     1.35   0.178    -.0360411    .1948799
       f_ses |   .1681772   .0543854     3.09   0.002     .0615838    .2747705
     f_intel |   .3693682   .0557939     6.62   0.000     .2600142    .4787223
------------------------------------------------------------------------------

Indirect effects
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  r_occ~p <- |
    r_occasp |   .0624106   .0460825     1.35   0.176    -.0279096    .1527307
    f_occasp |   .0173092   .0080361     2.15   0.031     .0015587    .0330597
     r_intel |   .0178168   .0159383     1.12   0.264    -.0134217    .0490552
       r_ses |   .0332001   .0204531     1.62   0.105    -.0068872    .0732875
       f_ses |   .0556285   .0292043     1.90   0.057    -.0016109     .112868
     f_intel |   .1088356    .052243     2.08   0.037     .0064411      .21123
  f_occ~p <- |
    r_occasp |   .0132192   .0097608     1.35   0.176    -.0059115    .0323499
    f_occasp |   .0624106   .0289753     2.15   0.031     .0056201    .1192011
     r_intel |   .0642406   .0490164     1.31   0.190    -.0318298     .160311
       r_ses |   .0402881   .0315496     1.28   0.202     -.021548    .1021242
       f_ses |   .0323987   .0262124     1.24   0.216    -.0189765     .083774
     f_intel |   .0230525   .0202112     1.14   0.254    -.0165607    .0626657
------------------------------------------------------------------------------

Total effects
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  r_occ~p <- |
    r_occasp |   .0624106   .0460825     1.35   0.176    -.0279096    .1527307
    f_occasp |   .2946533   .1367983     2.15   0.031     .0265335    .5627731
     r_intel |   .3032933   .0509684     5.95   0.000     .2033971    .4031896
       r_ses |   .1902083    .050319     3.78   0.000      .091585    .2888317
       f_ses |   .1529612    .050844     3.01   0.003     .0533089    .2526136
     f_intel |   .1088356    .052243     2.08   0.037     .0064411      .21123
  f_occ~p <- |
    r_occasp |   .2250294   .1661566     1.35   0.176    -.1006315    .5506903
    f_occasp |   .0624106   .0289753     2.15   0.031     .0056201    .1192011
     r_intel |   .0642406   .0490164     1.31   0.190    -.0318298     .160311
       r_ses |   .1197074   .0483919     2.47   0.013     .0248611    .2145537
       f_ses |   .2005759   .0488967     4.10   0.000       .10474    .2964118
     f_intel |   .3924207   .0502422     7.81   0.000     .2939478    .4908936
------------------------------------------------------------------------------
Note:
1. In the path diagram we drew for this model, you can see that the intelligence of the respondent,
r_intel, has both direct and indirect effects on the occupational aspiration of the respondent,
r_occasp. The tables above reveal that the direct effect is .2855, the indirect effect is .0178,
and the total effect is .3033.
References
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Duncan, O. D., A. O. Haller, and A. Portes. 1968. Peer influences on aspirations: A reinterpretation. American Journal
of Sociology 74: 119-137.
Also see
[SEM] example 8 Testing that coefficients are equal, and constraining them
Title
example 8 Testing that coefficients are equal, and constraining them
Description
Also see
Description
This example continues where [SEM] example 7 left off, where we typed
. use http://www.stata-press.com/data/r13/sem_sm1
. ssd describe
. notes
. sem (r_occasp <- f_occasp r_intel r_ses f_ses)    ///
      (f_occasp <- r_occasp f_intel f_ses r_ses),   ///
      cov(e.r_occasp*e.f_occasp) standardized
. estat stable
. estat teffects
We want to show you how to evaluate potential constraints after estimation, how to fit a model
with constraints, and how to evaluate enforced constraints after estimation.
Obviously, in a real analysis, if you evaluated potential constraints after estimation, there would
be no reason to evaluate enforced constraints after estimation, and vice versa.
The fitted model contains four pairs of mirror-image paths:

    r_intel  -> r_occasp     f_intel  -> f_occasp
    r_ses    -> r_occasp     f_ses    -> f_occasp
    f_ses    -> r_occasp     r_ses    -> f_occasp
    f_occasp -> r_occasp     r_occasp -> f_occasp
You are about to learn that to test whether those paths have equal coefficients, you type

. test (_b[r_occasp:r_intel ]==_b[f_occasp:f_intel ])   ///
       (_b[r_occasp:r_ses   ]==_b[f_occasp:f_ses   ])   ///
       (_b[r_occasp:f_ses   ]==_b[f_occasp:r_ses   ])   ///
       (_b[r_occasp:f_occasp]==_b[f_occasp:r_occasp])
In Stata, _b[ ] is how one accesses the estimated parameters. It is difficult to remember what the
names are. To determine the names of the parameters, replay the sem results with the coeflegend
option:
. sem, coeflegend

Structural equation model                       Number of obs      =        329
Estimation method  = ml
Log likelihood     = -2617.0489

------------------------------------------------------------------------------
             |      Coef.   Legend
-------------+----------------------------------------------------------------
Structural   |
  r_occ~p <- |
    f_occasp |   .2773441   _b[r_occasp:f_occasp]
     r_intel |   .2854766   _b[r_occasp:r_intel]
       r_ses |   .1570082   _b[r_occasp:r_ses]
       f_ses |   .0973327   _b[r_occasp:f_ses]
  f_occ~p <- |
    r_occasp |   .2118102   _b[f_occasp:r_occasp]
       r_ses |   .0794194   _b[f_occasp:r_ses]
       f_ses |   .1681772   _b[f_occasp:f_ses]
     f_intel |   .3693682   _b[f_occasp:f_intel]
-------------+----------------------------------------------------------------
var(e.r_oc~p)|   .6868304   _b[var(e.r_occasp):_cons]
var(e.f_oc~p)|   .6359151   _b[var(e.f_occasp):_cons]
-------------+----------------------------------------------------------------
cov(e.r_oc~p,|
  e.f_occasp)|  -.1536992   _b[cov(e.r_occasp,e.f_occasp):_cons]
------------------------------------------------------------------------------

With the parameter names at hand, to perform the test, we can type

. test (_b[r_occasp:r_intel ]==_b[f_occasp:f_intel ])   ///
       (_b[r_occasp:r_ses   ]==_b[f_occasp:f_ses   ])   ///
       (_b[r_occasp:f_ses   ]==_b[f_occasp:r_ses   ])   ///
       (_b[r_occasp:f_occasp]==_b[f_occasp:r_occasp])

 ( 1)  [r_occasp]r_intel - [f_occasp]f_intel = 0
 ( 2)  [r_occasp]r_ses - [f_occasp]f_ses = 0
 ( 3)  [r_occasp]f_ses - [f_occasp]r_ses = 0
 ( 4)  [r_occasp]f_occasp - [f_occasp]r_occasp = 0

           chi2(  4) =     1.61
         Prob > chi2 =   0.8062
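Because we could not reject the constraints, we refit the model with the constraints imposed. A
sketch of one way to do that, assuming we attach symbolic constraint names b1-b4 of our own
choosing to the mirror-image paths:

. sem (r_occasp <- f_occasp@b1 r_intel@b2 r_ses@b3 f_ses@b4)    ///
      (f_occasp <- r_occasp@b1 f_intel@b2 f_ses@b3 r_ses@b4),   ///
      cov(e.r_occasp*e.f_occasp)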
Structural equation model                       Number of obs      =        329

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  r_occ~p <- |
    f_occasp |   .2471578   .1024504     2.41   0.016     .0463588    .4479568
     r_intel |   .3271847   .0407973     8.02   0.000     .2472234    .4071459
       r_ses |   .1635056   .0380582     4.30   0.000     .0889129    .2380984
       f_ses |    .088364   .0427106     2.07   0.039     .0046529    .1720752
  f_occ~p <- |
    r_occasp |   .2471578   .1024504     2.41   0.016     .0463588    .4479568
       r_ses |    .088364   .0427106     2.07   0.039     .0046529    .1720752
       f_ses |   .1635056   .0380582     4.30   0.000     .0889129    .2380984
     f_intel |   .3271847   .0407973     8.02   0.000     .2472234    .4071459
-------------+----------------------------------------------------------------
var(e.r_oc~p)|   .6884513   .0538641                      .5905757    .8025477
var(e.f_oc~p)|   .6364713   .0496867                      .5461715    .7417005
-------------+----------------------------------------------------------------
cov(e.r_oc~p,|
  e.f_occasp)|  -.1582175   .1410111    -1.12   0.262    -.4345942    .1181592
------------------------------------------------------------------------------

We can evaluate the constraints imposed on the fitted model by typing

. estat scoretests
No tests were reported because no tests were individually significant at the 5% level. We can
obtain all the individual tests by adding the minchi2(0) option, which we can abbreviate to min(0):
. estat scoretests, min(0)

Score tests for linear constraints

 ( 1)  [r_occasp]f_occasp - [f_occasp]r_occasp = 0
 ( 2)  [r_occasp]r_intel - [f_occasp]f_intel = 0
 ( 3)  [r_occasp]r_ses - [f_occasp]f_ses = 0
 ( 4)  [r_occasp]f_ses - [f_occasp]r_ses = 0

--------------------------------------
       |     chi2     df       P>chi2
-------+------------------------------
 ( 1)  |    0.014      1         0.91
 ( 2)  |    1.225      1         0.27
 ( 3)  |    0.055      1         0.81
 ( 4)  |    0.136      1         0.71
--------------------------------------
Notes:
1. When we began this example, we used test to evaluate potential constraints that we were
considering. We obtained an overall χ2(4) statistic of 1.61 and thus could not reject the constraints
at any reasonable level.
2. We then refit the model with those constraints.
3. For pedantic reasons, we now use estat scoretests to evaluate relaxing constraints included
in the model. estat scoretests does not report a joint test, and you cannot sum the χ2 values
to obtain a joint test statistic. Thus we learn only that the individual constraints should not be
relaxed at reasonable confidence levels.
4. Thus when evaluating multiple constraints, it is better to fit the model without the constraints
and use test to evaluate them jointly.
Also see
[SEM] example 7 Nonrecursive structural model
Title
example 9 Structural model with measurement component
Description
References
Also see
Description
To demonstrate a structural model with a measurement component, we use data from Wheaton
et al. (1977):
. use http://www.stata-press.com/data/r13/sem_sm2
(Structural model with measurement component)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_sm2.dta
      obs:           932                Structural model with measurem..
     vars:            13                25 May 2013 11:45
                                        (_dta has notes)
variable name   variable label
---------------------------------------------------------------
educ66          Education, 1966
occstat66       Occupational status, 1966
anomia66        Anomia, 1966
pwless66        Powerlessness, 1966
socdist66       Latin American social distance, 1966
occstat67       Occupational status, 1967
anomia67        Anomia, 1967
pwless67        Powerlessness, 1967
socdist67       Latin American social distance, 1967
occstat71       Occupational status, 1971
anomia71        Anomia, 1971
pwless71        Powerlessness, 1971
socdist71       Latin American social distance, 1971
. notes
_dta:
1. Summary statistics data from Wheaton, B., Muthén, B., Alwin, D., &
Summers, G., 1977, "Assessing reliability and stability in panel models",
in D. R. Heise (Ed.), _Sociological Methodology 1977_ (pp. 84-136), San
Francisco: Jossey-Bass, Inc.
2. Four indicators each measured in 1966, 1967, and 1971, plus another
indicator (educ66) measured only in 1966.
3. Intended use: Create structural model relating Alienation in 1971,
Alienation in 1967, and SES in 1966.
See Structural models 8: Unobserved inputs, outputs, or both in [SEM] intro 5 for background.
(Figure: path diagram; Alien67 is measured by anomia67 and pwless67, Alien71 by anomia71 and pwless71, and SES by educ66 and occstat66; SES points to Alien67, and both Alien67 and SES point to Alien71)
. sem (anomia67 pwless67 <- Alien67)   /// measurement piece
      (anomia71 pwless71 <- Alien71)   /// measurement piece
      (Alien67 <- SES)                 /// structural piece
      (Alien71 <- Alien67 SES)         /// structural piece
      (SES -> educ66 occstat66)        //  measurement piece

Endogenous variables
Measurement:  anomia67 pwless67 anomia71 pwless71 educ66 occstat66
Latent:       Alien67 Alien71
Exogenous variables
Latent:       SES

Fitting target model:
Iteration 0:   log likelihood = -15249.988
Iteration 1:   log likelihood = -15246.584
Iteration 2:   log likelihood = -15246.469
Iteration 3:   log likelihood = -15246.469
Structural equation model                       Number of obs      =        932

 ( 1)  [anomia67]Alien67 = 1
 ( 2)  [anomia71]Alien71 = 1
 ( 3)  [educ66]SES = 1

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  Alien67 <- |
         SES |  -.6140404   .0562407   -10.92   0.000    -.7242701   -.5038107
  Alien71 <- |
     Alien67 |   .7046342   .0533512    13.21   0.000     .6000678    .8092007
         SES |  -.1744153   .0542489    -3.22   0.001    -.2807413   -.0680894
Measurement  |
  anom~67 <- |
     Alien67 |          1  (constrained)
       _cons |      13.61   .1126205   120.85   0.000     13.38927    13.83073
  pwle~67 <- |
     Alien67 |   .8884887   .0431565    20.59   0.000     .8039034    .9730739
       _cons |      14.67   .1001798   146.44   0.000     14.47365    14.86635
  anom~71 <- |
     Alien71 |          1  (constrained)
       _cons |      14.13   .1158943   121.92   0.000     13.90285    14.35715
  pwle~71 <- |
     Alien71 |   .8486022   .0415205    20.44   0.000     .7672235    .9299808
       _cons |       14.9   .1034537   144.03   0.000     14.69723    15.10277
  educ66 <-  |
         SES |          1  (constrained)
       _cons |       10.9   .1014894   107.40   0.000     10.70108    11.09892
  occs~66 <- |
         SES |   5.331259   .4307503    12.38   0.000     4.487004    6.175514
       _cons |      37.49   .6947112    53.96   0.000     36.12839    38.85161
-------------+----------------------------------------------------------------
var(e.ano~67)|   4.009921   .3582978                      3.365724    4.777416
var(e.pwl~67)|   3.187468    .283374                      2.677762    3.794197
var(e.ano~71)|   3.695593   .3911512                      3.003245     4.54755
var(e.pwl~71)|   3.621531   .3037908                      3.072483    4.268693
var(e.educ66)|   2.943819   .5002527                      2.109908    4.107319
var(e.occ~66)|     260.63   18.24572                      227.2139    298.9605
var(e.Ali~67)|   5.301416    .483144                      4.434225    6.338201
var(e.Ali~71)|   3.737286   .3881546                      3.048951    4.581019
    var(SES) |    6.65587   .6409484                      5.511067    8.038482
------------------------------------------------------------------------------
Notes:
1. Measurement component: In both 1967 and 1971, anomia and powerlessness are used to measure endogenous latent variables representing alienation for the same two years. Education and
occupational status are used to measure the exogenous latent variable SES.
b. Click in the right half of the Alien67 oval (it will highlight when you hover over it), and
drag a path to the left half of the Alien71 oval (it will highlight when you can release to
connect the path).
. estat mindices

Modification indices

-----------------------------------------------------------------------------
                             |      MI   df   P>MI         EPC   Standard EPC
-----------------------------+------------------------------------------------
Measurement                  |
  anomia67 <-                |
    anomia71                 |  51.977    1   0.00    .3906425      .4019984
    pwless71                 |  32.517    1   0.00   -.2969297     -.2727609
    educ66                   |   5.627    1   0.02    .0935048      .0842631
  pwless67 <-                |
    anomia71                 |  41.618    1   0.00   -.3106995     -.3594367
    pwless71                 |  23.622    1   0.00    .2249714      .2323233
    educ66                   |   6.441    1   0.01   -.0889042     -.0900664
  anomia71 <-                |
    anomia67                 |  58.768    1   0.00     .429437      .4173061
    pwless67                 |  38.142    1   0.00   -.3873066     -.3347904
  pwless71 <-                |
    anomia67                 |  46.188    1   0.00   -.3308484     -.3601641
    pwless67                 |  27.760    1   0.00    .2871709      .2780833
  educ66 <-                  |
    anomia67                 |   4.415    1   0.04    .1055965      .1171781
    pwless67                 |   6.816    1   0.01   -.1469371     -.1450411
-----------------------------+------------------------------------------------
  cov(e.anomia67,e.anomia71) |  63.786    1   0.00    1.951578      .5069627
  cov(e.anomia67,e.pwless71) |  49.892    1   0.00   -1.506704     -.3953794
  cov(e.anomia67,e.educ66)   |   6.063    1   0.01    .5527612      .1608845
  cov(e.pwless67,e.anomia71) |  49.876    1   0.00   -1.534199     -.4470094
  cov(e.pwless67,e.pwless71) |  37.357    1   0.00    1.159123       .341162
  cov(e.pwless67,e.educ66)   |   7.752    1   0.01   -.5557802     -.1814365
-----------------------------------------------------------------------------
EPC = expected parameter change
Notes:
1. There are lots of statistically significant paths we could add to the model.
2. Some of those statistically significant paths also make theoretical sense.
3. Two in particular that make theoretical sense are the covariances between e.anomia67 and
e.anomia71 and between e.pwless67 and e.pwless71.
We refit the model, allowing those two covariances:

. sem (anomia67 pwless67 <- Alien67)   /// measurement piece
      (anomia71 pwless71 <- Alien71)   /// measurement piece
      (Alien67 <- SES)                 /// structural piece
      (Alien71 <- Alien67 SES)         /// structural piece
      (SES -> educ66 occstat66),       /// measurement piece
      cov(e.anomia67*e.anomia71)       ///
      cov(e.pwless67*e.pwless71)

Endogenous variables
Measurement:  anomia67 pwless67 anomia71 pwless71 educ66 occstat66
Latent:       Alien67 Alien71
Exogenous variables
Latent:       SES

Fitting target model:
Iteration 0:   log likelihood = -15249.988
Iteration 1:   log likelihood =  -15217.95
Iteration 2:   log likelihood = -15213.126
Iteration 3:   log likelihood = -15213.046
Iteration 4:   log likelihood = -15213.046
Structural equation model                       Number of obs      =        932

 ( 1)  [anomia67]Alien67 = 1
 ( 2)  [anomia71]Alien71 = 1
 ( 3)  [educ66]SES = 1

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  Alien67 <- |
         SES |  -.5752228    .057961    -9.92   0.000    -.6888244   -.4616213
  Alien71 <- |
     Alien67 |    .606954   .0512305    11.85   0.000     .5065439     .707364
         SES |  -.2270301   .0530773    -4.28   0.000    -.3310596   -.1230006
Measurement  |
  anom~67 <- |
     Alien67 |          1  (constrained)
       _cons |      13.61   .1126143   120.85   0.000     13.38928    13.83072
  pwle~67 <- |
     Alien67 |   .9785952   .0619825    15.79   0.000     .8571117    1.100079
       _cons |      14.67   .1001814   146.43   0.000     14.47365    14.86635
  anom~71 <- |
     Alien71 |          1  (constrained)
       _cons |      14.13   .1159036   121.91   0.000     13.90283    14.35717
  pwle~71 <- |
     Alien71 |   .9217508   .0597225    15.43   0.000     .8046968    1.038805
       _cons |       14.9   .1034517   144.03   0.000     14.69724    15.10276
  educ66 <-  |
         SES |          1  (constrained)
       _cons |       10.9   .1014894   107.40   0.000     10.70108    11.09892
  occs~66 <- |
         SES |    5.22132    .425595    12.27   0.000     4.387169    6.055471
       _cons |      37.49   .6947112    53.96   0.000     36.12839    38.85161
-------------+----------------------------------------------------------------
var(e.ano~67)|   4.728874    .456299                      3.914024    5.713365
var(e.pwl~67)|   2.563413   .4060733                      1.879225      3.4967
var(e.ano~71)|   4.396081   .5171156                      3.490904    5.535966
var(e.pwl~71)|   3.072085   .4360333                      2.326049    4.057398
var(e.educ66)|   2.803674   .5115854                      1.960691    4.009091
var(e.occ~66)|   264.5311   18.22483                      231.1177    302.7751
var(e.Ali~67)|   4.842059   .4622537                      4.015771    5.838364
var(e.Ali~71)|   4.084249   .4038995                      3.364613    4.957802
    var(SES) |   6.796014   .6524866                      5.630283    8.203105
-------------+----------------------------------------------------------------
cov(e.ano~67,|
  e.anomia71)|   1.622024   .3154267     5.14   0.000     1.003799    2.240249
cov(e.pwl~67,|
  e.pwless71)|   .3399961   .2627541     1.29   0.196    -.1749925    .8549847
------------------------------------------------------------------------------
Notes:
1. We find the covariance between e.anomia67 and e.anomia71 to be significant (Z = 5.14).
2. We find the covariance between e.pwless67 and e.pwless71 to be insignificant at the 5%
level (Z = 1.29).
3. The model versus saturated χ2 test indicates that the model is a good fit.
References
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Wheaton, B., B. Muthén, D. F. Alwin, and G. F. Summers. 1977. Assessing reliability and stability in panel models.
In Sociological Methodology 1977, ed. D. R. Heise, 84-136. San Francisco: Jossey-Bass.
Also see
[SEM] example 32g Full structural equation model (generalized response)
Title
example 10 MIMIC model
Description
Reference
Also see
Description
To demonstrate a MIMIC model, we use the following summary statistics data:
. use http://www.stata-press.com/data/r13/sem_mimic1
(Multiple indicators and multiple causes)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_mimic1.dta
      obs:           432                Multiple indicators and multip..
     vars:             5                25 May 2013 10:13
                                        (_dta has notes)
variable name
variable label
occpres
income
s_occpres
s_income
s_socstat
. notes
_dta:
1. Summary statistics data from Kluegel, J. R., R. Singleton, Jr., and C. E.
Starnes, 1977, "Subjective class identification: A multiple indicator
approach", _American Sociological Review_, 42: 599-611.
2. Data is also analyzed in Bollen, K. A. 1989, _Structural Equations with
Latent Variables_, New York: John Wiley & Sons, Inc.
3. The summary statistics represent 432 white adults included in the sample
for the 1969 Gary Area Project for the Institute of Social Research at
Indiana University.
4. The three subjective variables are measures of socioeconomic status based
on an individual's perception of their own income, occupational prestige,
and social status.
5. The income and occpres variables are objective measures of income and
occupational prestige, respectively.
(Figure: MIMIC path diagram; observed predictors income and occpres point to the latent variable SubjSES, which is measured by s_income, s_occpres, and s_socstat)
Bollen includes paths that he constrains and we do not show. Our model is nonetheless equivalent to
the one he shows. In his textbook, Bollen illustrates various ways the same model can be written.
We fit the model by typing

. sem (SubjSES -> s_income s_occpres s_socstat)   ///
      (SubjSES <- income occpres)

Structural equation model                       Number of obs      =        432

 ( 1)  [s_income]SubjSES = 1

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  SubjSES <- |
      income |    .082732   .0138498     5.97   0.000     .0555869    .1098772
     occpres |   .0046275   .0012464     3.71   0.000     .0021847    .0070704
Measurement  |
  s_inc~e <- |
     SubjSES |          1  (constrained)
       _cons |   .9612091   .0794151    12.10   0.000     .8055583     1.11686
  s_occ~s <- |
     SubjSES |   .7301352   .0832915     8.77   0.000     .5668869    .8933835
       _cons |   1.114563   .0656195    16.99   0.000     .9859512    1.243175
  s_soc~t <- |
     SubjSES |   .9405161   .0934855    10.06   0.000     .7572878    1.123744
       _cons |   1.002113   .0706576    14.18   0.000      .863627      1.1406
-------------+----------------------------------------------------------------
var(e.s_in~e)|   .2087546   .0254098                      .1644474    .2649996
var(e.s_oc~s)|   .2811852   .0228914                      .2397153    .3298291
var(e.s_so~t)|   .1807129   .0218405                      .1425987    .2290146
var(e.Subj~S)|   .1860097   .0270476                      .1398822    .2473481
------------------------------------------------------------------------------
Notes:
1. In this model, there are three observed variables that record the respondent's ideas of his or her
perceived socioeconomic status (SES). One is the respondent's general idea of his or her SES
(s_socstat); another is based on the respondent's income (s_income); and the last is based
on the respondent's occupational prestige (s_occpres). Those three variables form the latent
variable SubjSES.
2. The other two observed variables are the respondent's income (income) and occupation, the latter
measured by the two-digit Duncan SEI scores for occupations (occpres). These two variables
are treated as predictors of SubjSES.
tool, click in the diagram to add another new variable below the
b. Click in the lower-right quadrant of the occpres rectangle (it will highlight when you
hover over it), and drag a path to the upper-left quadrant of the SubjSES oval (it will
highlight when you can release to connect the path).
c. Continuing with the same tool, click in the upper-right quadrant of the income rectangle, and
drag a path to the lower-left quadrant of the SubjSES oval.
. estat residuals, normalized

Mean residuals

             |  s_income  s_occpres  s_socstat     income    occpres
-------------+-------------------------------------------------------
         raw |     0.000      0.000      0.000      0.000      0.000
  normalized |     0.000      0.000      0.000      0.000      0.000

Covariance residuals (raw)

             |  s_income  s_occpres  s_socstat     income    occpres
-------------+-------------------------------------------------------
    s_income |     0.000
   s_occpres |    -0.009      0.000
   s_socstat |     0.000      0.008      0.000
      income |     0.101     -0.079     -0.053      0.000
     occpres |    -0.856      1.482      0.049      0.000      0.000

Covariance residuals (normalized)

             |  s_income  s_occpres  s_socstat     income    occpres
-------------+-------------------------------------------------------
    s_income |     0.000
   s_occpres |    -0.425      0.000
   s_socstat |     0.008      0.401      0.000
      income |     1.362     -1.137     -0.771      0.000
     occpres |    -1.221      2.234      0.074      0.000      0.000
Notes:
1. The residuals can be partitioned into two subsets: mean residuals and covariance residuals.
2. The normalized option caused the normalized residuals to be displayed.
3. Concerning mean residuals, the raw residuals and the normalized residuals are shown on a
separate line of the first table.
4. Concerning covariance residuals, the raw residuals and the normalized residuals are shown in
separate tables.
5. Distinguish between normalized residuals and standardized residuals. Both are available from
estat residuals; if we wanted standardized residuals, we would specify the standardized
option instead of, or along with, normalized (see the sketch after these notes).
6. Both normalized and standardized residuals attempt to adjust the residuals in the same way. The
normalized residuals are always valid, but they do not follow a standard normal distribution.
The standardized residuals do follow a standard normal distribution if they can be calculated;
otherwise, they will equal missing values. When both can be calculated (equivalent to both being
appropriate), the normalized residuals will be a little smaller than the standardized residuals.
7. The normalized covariance residuals between income and s_income and between occpres and
s_occpres are large.
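A minimal sketch of requesting both adjusted forms at once:

. estat residuals, normalized standardized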
Those large residuals suggest adding the direct paths s_income <- income and s_occpres <-
occpres to the model, which we do below.
For no other reason than we want to demonstrate the likelihood-ratio test, we will then use lrtest
rather than test to test the joint significance of the new paths. lrtest compares the likelihood
values of two fitted models. Thus we will use lrtest to compare this new model with the one above.
To do that, we must plan ahead and store in memory the currently fit model:
. estimates store mimic1
Alternatively, we could skip that and calculate the joint significance of the two new paths by using
a Wald test and the test command.
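A sketch of that Wald-test alternative, run after fitting the new model below (the
_b[equation:variable] naming convention is the one shown in [SEM] example 8):

. test (_b[s_income:income]==0) (_b[s_occpres:occpres]==0)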
In any case, having stored the current estimates under the name mimic1, we can now fit our new
model:
. sem (SubjSES -> s_income s_occpres s_socstat)
>
(SubjSES <- income occpres)
>
(s_income <- income)
>
(s_occpres <- occpres)
Endogenous variables
Observed:
s_income s_occpres
Measurement: s_socstat
Latent:
SubjSES
Exogenous variables
Observed:
income occpres
Fitting target model:
Iteration 0:
log likelihood = -4267.0974 (not concave)
Iteration 1:
log likelihood = -4022.6745 (not concave)
Iteration 2:
log likelihood = -3977.0648
Iteration 3:
log likelihood = -3962.9229
Iteration 4:
log likelihood = -3962.1604
Iteration 5:
log likelihood = -3960.8404
Iteration 6:
log likelihood = -3960.7133
Iteration 7:
log likelihood = -3960.7111
Iteration 8:
log likelihood = -3960.7111
Structural equation model                       Number of obs      =        432

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  s_inc~e <- |
     SubjSES |          1  (constrained)
      income |   .0532425   .0142862     3.73   0.000      .025242     .081243
       _cons |   .8825314   .0781685    11.29   0.000     .7293239    1.035739
  s_occ~s <- |
     SubjSES |    .783781   .1011457     7.75   0.000      .585539     .982023
     occpres |   .0045201   .0013552     3.34   0.001     .0018641    .0071762
       _cons |    1.06586   .0696058    15.31   0.000     .9294353    1.202285
  SubjSES <- |
      income |   .0538025   .0129158     4.17   0.000      .028488    .0791171
     occpres |   .0034324   .0011217     3.06   0.002     .0012339    .0056309
Measurement  |
  s_soc~t <- |
     SubjSES |   1.195539   .1582735     7.55   0.000     .8853282    1.505749
       _cons |    1.07922    .078323    13.78   0.000     .9257097     1.23273
-------------+----------------------------------------------------------------
var(e.s_in~e)|   .2292697   .0248905                      .1853261    .2836329
var(e.s_oc~s)|   .2773786   .0223972                      .2367783    .3249407
var(e.s_so~t)|   .1459009    .028228                      .0998556    .2131785
var(e.Subj~S)|   .1480275   .0278381                      .1023918    .2140029
------------------------------------------------------------------------------

. lrtest mimic1 .

Likelihood-ratio test                                 LR chi2(2)  =     22.42
(Assumption: mimic1 nested in .)                      Prob > chi2 =    0.0000
Notes:
1. The syntax of lrtest is lrtest modelname1 modelname2. We specified the first model name
as mimic1, the model we previously stored. We specified the second model name as period (.),
meaning the model most recently fit. The order in which we specify the names is irrelevant.
2. We find the two added paths to be extremely significant.
Reference
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Also see
[SEM] example 36g MIMIC model (generalized response)
Title
example 11 estat framework
Description
Also see
Description
To demonstrate estat framework, which displays results in BentlerWeeks form, we continue
where [SEM] example 10 left off:
. use http://www.stata-press.com/data/r13/sem_mimic1
. ssd describe
. notes
. sem (SubjSES -> s_income s_occpres s_socstat)   ///
      (SubjSES <- income occpres)
. estat residuals, normalized
. estimates store mimic1
. sem (SubjSES -> s_income s_occpres s_socstat)   ///
      (SubjSES <- income occpres)                 ///
      (s_income <- income)                        ///
      (s_occpres <- occpres)
. lrtest mimic1 .

We then obtain the results in Bentler-Weeks form by typing

. estat framework, fitted
Coefficients of endogenous variables on endogenous variables (Beta)

               |   SubjSES (latent)
---------------+-------------------
      s_income |          1
     s_occpres |    .783781
     s_socstat |   1.195539
       SubjSES |          0
(all other entries of Beta are 0)

Coefficients of endogenous variables on exogenous variables (Gamma)

               |     income    occpres
---------------+-----------------------
      s_income |   .0532425          0
     s_occpres |          0   .0045201
     s_socstat |          0          0
       SubjSES |   .0538025   .0034324

Intercepts of endogenous variables (alpha)

      s_income |   .8825314
     s_occpres |    1.06586
     s_socstat |    1.07922
       SubjSES |          0

Covariances of error variables (Psi; diagonal)

    e.s_income |   .2292697
   e.s_occpres |   .2773786
   e.s_socstat |   .1459009
     e.SubjSES |   .1480275

Covariances of exogenous variables (Phi)

               |     income    occpres
---------------+-----------------------
        income |   4.820021
       occpres |   13.62431   451.6628

Means of exogenous variables (kappa)

        income |       5.04
       occpres |     36.698

Fitted covariances of observed and latent variables (Sigma)

           |  s_income  s_occpres  s_socstat    SubjSES     income    occpres
-----------+-------------------------------------------------------------------
  s_income |  .4478609
 s_occpres |  .1614446   .4086519
 s_socstat |   .225515   .1738222    .392219
   SubjSES |  .1886304   .1453924   .2060311   .1723333
    income |  .5627232   .3014937   .3659463   .3060932   4.820021
   occpres |  3.008694   3.831184   2.729776   2.283302   13.62431   451.6628

Fitted means of observed and latent variables (mu)

      s_income |      1.548
     s_occpres |      1.543
     s_socstat |      1.554
       SubjSES |   .3971264
        income |       5.04
       occpres |     36.698
Notes:
1. Bentler-Weeks form is a vector and matrix notation for the estimated parameters of the model.
The matrices are known as Beta, Gamma, alpha, Psi, Phi, and kappa. Those Greek names are spelled
out in the labels, along with a header stating what each contains.
2. We specified estat framework option fitted. That caused estat framework to list one more
matrix and one more vector at the end: Sigma and mu. These two results are especially interesting to
those wishing to see the ingredients of the residuals reported by estat residuals.
3. One of the more useful results reported by estat framework, fitted is the Sigma matrix, which
reports all estimated covariances in a readable format and includes the model-implied covariances
that do not appear in sem's ordinary output.
4. estat framework also allows the standardized option if you want standardized output; see
the sketch below.
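A minimal sketch of combining the two options:

. estat framework, fitted standardized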
Also see
[SEM] example 10 MIMIC model
Title
example 12 Seemingly unrelated regression
Description
Also see
Description
sem can be used to estimate seemingly unrelated regression. We will use auto.dta, which surely
needs no introduction:
. sysuse auto
(1978 Automobile Data)
See Structural models 10: Seemingly unrelated regression (SUR) in [SEM] intro 5.
(Figure: path diagram for the two equations; price <- foreign, mpg, and displacement; weight <- foreign and length; e.price and e.weight are allowed to correlate)
. sem (price <- foreign mpg displacement)
>
(weight <- foreign length),
>
cov(e.price*e.weight)
Endogenous variables
Observed: price weight
Exogenous variables
Observed: foreign mpg displacement length
Fitting target model:
Iteration 0:
log likelihood = -2150.9983
Iteration 1:
log likelihood = -2138.5739
Iteration 2:
log likelihood = -2133.3461
Iteration 3:
log likelihood = -2133.1979
Iteration 4:
log likelihood = -2133.1956
Iteration 5:
log likelihood = -2133.1956
Structural equation model                       Number of obs      =         74

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  price <-   |
     foreign |   2940.929   724.7311     4.06   0.000     1520.482    4361.376
         mpg |  -105.0163   57.93461    -1.81   0.070     -218.566     8.53347
  displace~t |   17.22083     4.5941     3.75   0.000     8.216558     26.2251
       _cons |   4129.866   1984.253     2.08   0.037     240.8022    8018.931
  weight <-  |
     foreign |  -153.2515   76.21732    -2.01   0.044    -302.6347   -3.868275
      length |   30.73507   1.584743    19.39   0.000     27.62903    33.84111
       _cons |  -2711.096   312.6813    -8.67   0.000     -3323.94   -2098.252
-------------+----------------------------------------------------------------
var(e.price) |    4732491   801783.1                      3395302     6596312
var(e.weight)|   60253.09   9933.316                      43616.45    83235.44
-------------+----------------------------------------------------------------
cov(e.price, |
   e.weight) |     209268   73909.54     2.83   0.005     64407.92      354128
------------------------------------------------------------------------------
Notes:
1. Point estimates are the same as reported by
. sureg (price foreign mpg displ) (weight foreign length), isure
sureg's isure option is required to make sureg iterate to the maximum likelihood estimate.
2. If you wish to compare the estimated variances and covariances after estimation by sureg, type
. matrix list e(Sigma)
b. Click in the right side of the mpg rectangle (it will highlight when you hover over it), and
drag a path to the left side of the price rectangle (it will highlight when you can release
to connect the path).
b. Click in the 1 circle (it will highlight when you hover over it), and drag a covariance to
the 2 circle (it will highlight when you can release to connect the covariance).
8. Clean up.
,
If you do not like where a path has been connected to its variables, use the Select tool,
to click on the path, and then simply click on where it connects to a rectangle and drag the
endpoint. Similarly, you can change where the covariance connects to the error terms by clicking
on the covariance and dragging the endpoint. You can also change the bow of the covariance
by clicking on the covariance and dragging the control point that extends from one end of the
selected covariance.
9. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
SEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem sem_sureg
Also see
[SEM] example 13 Equation-level Wald test
Title
example 13 Equation-level Wald test
Description
Also see
Description
We demonstrate estat eqtest. See [SEM] intro 7 and [SEM] estat eqtest.
This example picks up where [SEM] example 12 left off:
. use http://www.stata-press.com/data/r13/auto
. sem (price <- foreign mpg displacement)   ///
      (weight <- foreign length),           ///
      cov(e.price*e.weight)
. estat eqtest

Equation-level tests

--------------------------------------
           |      chi2    df        p
-----------+--------------------------
observed   |
     price |     36.43     3   0.0000
    weight |    633.34     2   0.0000
--------------------------------------
Note:
1. The null hypothesis for this test is that the coefficients other than the intercepts are 0. We can
reject that null hypothesis for each equation.
Also see
[SEM] example 12 Seemingly unrelated regression
Title
example 14 Predicted values
Description
Also see
Description
We demonstrate the use of predict. See [SEM] intro 7 and [SEM] predict after sem.
This example picks up where the first part of [SEM] example 1 left off:
. use http://www.stata-press.com/data/r13/sem_1fmm
. sem (x1 x2 x3 x4 <- X)
You specify options on predict to specify what you want predicted and how. Because of the differing
options, the two commands could not have been combined into one command.
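A sketch of the two commands (xb() names the observed endogenous variables whose linear
predictions we want; latent() names the latent variable to predict):

. predict x1hat x2hat, xb(x1 x2)
. predict Xhat, latent(X)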
Our dataset now contains three new variables. Below we compare the three variables with the
original x1 and x2 by using first summarize and then correlate:
. summarize x1 x1hat x2 x2hat Xhat
    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+-----------------------------------------------------------
          x1 |       123    96.28455    14.16444          54        131
       x1hat |       123    96.28455    10.65716    68.42469   122.9454
          x2 |       123    97.28455    16.14764          64        135
       x2hat |       123    97.28455    12.49406    64.62267   128.5408
        Xhat |       123   -1.66e-08    10.65716   -27.85986   26.66084
Notes:
1. Means of x1hat and x1 are identical; means of x2hat and x2 are identical.
2. The standard deviation of x1hat is less than that of x1; the standard deviation of x2hat is less
than that of x2. Some of the variation in x1 and x2 is not explained by the model.
3. Standard deviations of x1hat and Xhat are equal. This is because in

        x1 = b0 + b1*X + e.x1

coefficient b1 was constrained to be equal to 1 because of the anchoring normalization constraint;
see Identification 2: Normalization constraints (anchoring) in [SEM] intro 4.
The mean of Xhat in the model above is -1.66e-08 rather than 0. Had we typed

. predict double Xhat, latent(X)

the mean would have been even closer to 0 because the prediction would have been stored in
double precision.

. correlate x1 x1hat x2 x2hat Xhat
(obs=123)

             |       x1    x1hat       x2    x2hat     Xhat
-------------+----------------------------------------------
          x1 |   1.0000
       x1hat |   0.7895   1.0000
          x2 |   0.5826   0.8119   1.0000
       x2hat |   0.7895   1.0000   0.8119   1.0000
        Xhat |   0.7895   1.0000   0.8119   1.0000   1.0000
Notes:
1. Both x1hat and x2hat correlate 1 with Xhat. That is because both are linear functions of Xhat
alone.
2. That x1hat and x2hat correlate 1 is implied by item 1, directly above.
3. That Xhat, x1hat, and x2hat all have the same correlation with x1 and with x2 is also implied
by item 1, directly above.
Also see
[SEM] example 1 Single-factor measurement model
Title
example 15 Higher-order CFA
Description
Reference
Also see
Description
sem can be used to estimate higher-order confirmatory factor analysis models.
. use http://www.stata-press.com/data/r13/sem_hcfa1
(Higher-order CFA)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_hcfa1.dta
      obs:           251                Higher-order CFA
     vars:            16                25 May 2013 11:26
                                        (_dta has notes)
variable name   variable label
---------------------------------------------
phyab1          Physical ability 1
phyab2          Physical ability 2
phyab3          Physical ability 3
phyab4          Physical ability 4
appear1         Appearance 1
appear2         Appearance 2
appear3         Appearance 3
appear4         Appearance 4
peerrel1        Relationship w/ peers 1
peerrel2        Relationship w/ peers 2
peerrel3        Relationship w/ peers 3
peerrel4        Relationship w/ peers 4
parrel1         Relationship w/ parent 1
parrel2         Relationship w/ parent 2
parrel3         Relationship w/ parent 3
parrel4         Relationship w/ parent 4
. notes
_dta:
1. Summary statistics data from Marsh, H. W. and Hocevar, D., 1985,
"Application of confirmatory factor analysis to the study of
self-concept: First- and higher order factor models and their invariance
across groups", _Psychological Bulletin_, 97: 562-582.
2. Summary statistics based on 251 students from Sydney, Australia in Grade
5.
3. Data collected using the Self-Description Questionnaire and includes
sixteen subscales designed to measure nonacademic traits: four intended
to measure physical ability, four intended to measure physical
appearance, four intended to measure relations with peers, and four
intended to measure relations with parents.
(Figure: higher-order CFA path diagram; the second-order factor Nonacad points to the first-order factors Phys, Appear, Peer, and Par, and each first-order factor is measured by its four indicators)
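A command along the following lines fits the model; this sketch is reconstructed from the output
below, and the ordering of the equations is our own choice:

. sem (Phys -> phyab1 phyab2 phyab3 phyab4)             ///
      (Appear -> appear1 appear2 appear3 appear4)       ///
      (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)     ///
      (Par -> parrel1 parrel2 parrel3 parrel4)          ///
      (Nonacad -> Phys Appear Peer Par)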
Structural equation model                       Number of obs      =        251

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  Phys <-    |
     Nonacad |          1  (constrained)
  Appear <-  |
     Nonacad |   2.202491   .3975476     5.54   0.000     1.423312     2.98167
  Peer <-    |
     Nonacad |   1.448035   .2921383     4.96   0.000     .8754549    2.020616
  Par <-     |
     Nonacad |    .569956   .1382741     4.12   0.000     .2989437    .8409683
Measurement  |
  phyab1 <-  |
        Phys |          1  (constrained)
       _cons |        8.2   .1159065    70.75   0.000     7.972827    8.427173
  phyab2 <-  |
        Phys |   .9332477   .1285726     7.26   0.000       .68125    1.185245
       _cons |       8.23    .122207    67.34   0.000     7.990479    8.469521
  phyab3 <-  |
        Phys |   1.529936   .1573845     9.72   0.000     1.221468    1.838404
       _cons |       8.17   .1303953    62.66   0.000      7.91443     8.42557
  phyab4 <-  |
        Phys |   1.325641   .1338053     9.91   0.000     1.063387    1.587894
       _cons |       8.56   .1146471    74.66   0.000     8.335296    8.784704
  appear1 <- |
      Appear |          1  (constrained)
       _cons |       7.41   .1474041    50.27   0.000     7.121093    7.698907
  appear2 <- |
      Appear |     1.0719   .0821893    13.04   0.000     .9108121    1.232988
       _cons |          7   .1644123    42.58   0.000     6.677758    7.322242
  appear3 <- |
      Appear |   1.035198   .0893075    11.59   0.000     .8601581    1.210237
       _cons |       7.17   .1562231    45.90   0.000     6.863808    7.476192
  appear4 <- |
      Appear |   .9424492   .0860848    10.95   0.000     .7737262    1.111172
       _cons |        7.4   .1474041    50.20   0.000     7.111093    7.688907
  peerr~1 <- |
        Peer |          1  (constrained)
       _cons |       8.81   .1077186    81.79   0.000     8.598875    9.021125
  peerr~2 <- |
        Peer |   1.214379   .1556051     7.80   0.000     .9093989     1.51936
       _cons |       7.94   .1215769    65.31   0.000     7.701714    8.178286
  peerr~3 <- |
        Peer |   1.667829    .190761     8.74   0.000     1.293944    2.041714
       _cons |       7.52   .1373248    54.76   0.000     7.250848    7.789152
  peerr~4 <- |
        Peer |   1.363627    .159982     8.52   0.000     1.050068    1.677186
       _cons |       8.29   .1222066    67.84   0.000     8.050479    8.529521
  parrel1 <- |
         Par |          1  (constrained)
       _cons |       9.35   .0825215   113.30   0.000     9.188261    9.511739
  parrel2 <- |
         Par |   1.159754    .184581     6.28   0.000     .7979822    1.521527
       _cons |       9.13   .0988998    92.32   0.000      8.93616     9.32384
  parrel3 <- |
         Par |   2.035143   .2623826     7.76   0.000     1.520882    2.549403
       _cons |       8.67   .1114983    77.76   0.000     8.451467    8.888533
  parrel4 <- |
         Par |   1.651802   .2116151     7.81   0.000     1.237044     2.06656
       _cons |          9   .0926003    97.19   0.000     8.818507    9.181493
-------------+----------------------------------------------------------------
var(e.phyab1)|    2.07466   .2075636                      1.705244    2.524103
var(e.phyab2)|   2.618638    .252693                      2.167386    3.163841
var(e.phyab3)|   1.231013   .2062531                      .8864333     1.70954
var(e.phyab4)|   1.019261   .1600644                      .7492262    1.386621
var(e.appe~1)|   1.986955   .2711164                      1.520699    2.596169
var(e.appe~2)|   2.801673   .3526427                      2.189162    3.585561
var(e.appe~3)|    2.41072    .300262                      1.888545    3.077276
var(e.appe~4)|   2.374508   .2872554                      1.873267    3.009868
var(e.pee~1) |   1.866632     .18965                      1.529595    2.277933
var(e.pee~2) |   2.167766   .2288099                      1.762654    2.665984
var(e.pee~3) |   1.824346   .2516762                      1.392131    2.390749
var(e.pee~4) |   1.803918    .212599                      1.431856    2.272659
var(e.par~1) |   1.214141   .1195921                      1.000982    1.472692
var(e.par~2) |   1.789125   .1748043                      1.477322    2.166738
var(e.par~3) |   1.069717   .1767086                      .7738511    1.478702
var(e.par~4) |   .8013735    .121231                      .5957527    1.077963
var(e.Phys)  |    .911538   .1933432                      .6014913    1.381403
var(e.Appear)|    1.59518   .3704939                      1.011838    2.514828
var(e.Peer)  |   .2368108   .1193956                      .0881539    .6361528
var(e.Par)   |   .3697854   .0915049                      .2276755     .600597
var(Nonacad) |   .3858166   .1237638                      .2057449    .7234903
------------------------------------------------------------------------------
Notes:
1. The idea behind this model is that physical ability, appearance, and relationships with peers and
parents may be determined by a latent variable containing nonacademic traits. This model was
suggested by Bollen (1989, 315).
2. sem automatically provided normalization constraints for the first-order factors Phys, Appear,
Peer, and Par. Their path coefficients were set to 1.
3. sem automatically provided a normalization constraint for the second-order factor Nonacad. Its
path coefficient was set to 1.
b. Click in the upper-left quadrant of the Nonacad oval (it will highlight when you hover over
   it), and drag a path to the lower-left quadrant of the Phys oval (it will highlight when you
   can release to connect the path).
c. Continuing with the Add Path tool, create the following paths by clicking first on the left side
   of the Nonacad oval and dragging to the right side of the first-order latent variable:
       Nonacad -> Appear
       Nonacad -> Peer
       Nonacad -> Par
9. Clean up the direction of the errors.
   We want the errors for each of the latent variables to be below the latent variable. The errors
   for Phys, Appear, and Peer are likely to have been created in other directions.
   a. Choose the Select tool.
   b. Click on the oval of any latent variable whose associated error is not below it, and click on
      one of the Error Rotation buttons until the error is below the latent variable.
   Repeat this for all errors on latent variables that are not below the latent variable.
Reference
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Also see
[SEM] sem Structural equation model estimation command
Title
example 16 Correlation
Description
Also see
Description
sem can be used to produce correlations or covariances between exogenous variables. The advantages
of using sem over Stata's correlate command are that you can perform statistical tests on the results
and that you can handle missing values in a more elegant way.
To demonstrate these features, we use
. use http://www.stata-press.com/data/r13/census13
(1980 Census data by state)
. describe
Contains data from http://www.stata-press.com/data/r13/census13.dta
  obs:            50                          1980 Census data by state
 vars:             9                          9 Apr 2013 10:09
 size:         1,600
-------------------------------------------------------------------------------
              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------------------------
state           long   %13.0g     state1     State
brate           long   %10.0g                Birth rate
pop             long   %12.0gc               Population
medage          float  %9.2f                 Median age
division        int    %8.0g      division   Census Division
region          int    %-8.0g     cenreg     Census region
mrgrate         float  %9.0g
dvcrate         float  %9.0g
medagesq        float  %9.0g
-------------------------------------------------------------------------------
Sorted by:

(Path diagram: the observed exogenous variables mrgrate, dvcrate, and medage, connected
pairwise by covariances.)
This model does nothing more than estimate the covariances (correlations), something we could
obtain from the correlate command by typing
. correlate mrgrate dvcrate medage
(obs=50)
             |  mrgrate  dvcrate   medage
-------------+---------------------------
     mrgrate |   1.0000
     dvcrate |   0.7700   1.0000
      medage |  -0.0177  -0.2229   1.0000
To see the covariances rather than the correlations, we would add correlate's covariance option:
. correlate mrgrate dvcrate medage, covariance
(obs=50)
             |  mrgrate  dvcrate   medage
-------------+---------------------------
     mrgrate |  .000662
     dvcrate |  .000063  1.0e-05
      medage | -.000769 -.001191  2.86775
As explained in Correlations in [SEM] intro 5, to see results presented as correlations rather than
as covariances, we specify sem's standardized option:
. sem ( <- mrgrate dvcrate medage), standardized
Exogenous variables
Observed:  mrgrate dvcrate medage
Fitting target model:
Iteration 0:   log likelihood = 258.58985
Iteration 1:   log likelihood = 258.58985
Structural equation model                       Number of obs      =        50
Estimation method = ml
Log likelihood    = 258.58985
------------------------------------------------------------------------------
             |                 OIM
Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mean(mrgrate)|   .7332509   .1593002     4.60   0.000     .4210282    1.045474
mean(dvcrate)|   2.553791    .291922     8.75   0.000     1.981634    3.125947
 mean(medage)|   17.62083   1.767749     9.97   0.000     14.15611    21.08556
-------------+----------------------------------------------------------------
 var(mrgrate)|          1          .                             .           .
 var(dvcrate)|          1          .                             .           .
  var(medage)|          1          .                             .           .
-------------+----------------------------------------------------------------
cov(mrgrate, |
     dvcrate)|   .7699637   .0575805    13.37   0.000     .6571079    .8828195
cov(mrgrate, |
      medage)|  -.0176541   .1413773    -0.12   0.901    -.2947485    .2594403
cov(dvcrate, |
      medage)|   -.222932   .1343929    -1.66   0.097    -.4863373    .0404732
------------------------------------------------------------------------------
Note:
1. The correlations reported are
                                sem    correlate
      mrgrate,dvcrate     0.7699637       0.7700
      mrgrate,medage     -0.0176541      -0.0177
      dvcrate,medage     -0.2229320      -0.2229
b. Click in the top of the mrgrate rectangle, slightly to the right of the center (it will highlight
when you hover over it), and drag a path to the top of the dvcrate rectangle, slightly to
the left of the center (it will highlight when you can release to connect the covariance).
c. Click in the top of the dvcrate rectangle, slightly to the right of the center, and drag a
path to the top of the medage rectangle, slightly to the left of the center.
d. Click in the top of the mrgrate rectangle, slightly to the left of the center, and drag a path
to the top of the medage rectangle, slightly to the right of the center.
5. Clean up.
If you do not like where a covariance has been connected to its variable, use the Select tool,
, to click on the covariance, and then simply click on where it connects to an oval and drag
the endpoint. You can also change the bow of the covariance by dragging the control point that
extends from one end of the selected covariance.
6. Estimate.
Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
SEM estimation options dialog box.
7. Show standardized estimates.
From the SEM Builder menu, select View > Standardized Estimates.
You can open a completed diagram in the Builder by typing
. webgetsem sem_corr
We must prefix test with estat stdize because otherwise we would be testing equality of
covariances; see Displaying other results, statistics, and tests (sem and gsem) in [SEM] intro 7 and
see [SEM] estat stdize.
That we refer to the two correlations (covariances) by typing _b[cov(medage,mrgrate):_cons]
and _b[cov(medage,dvcrate):_cons] is something nobody remembers and that we remind
ourselves of by redisplaying sem results with the coeflegend option:
. sem, coeflegend
Structural equation model                       Number of obs      =        50
Estimation method = ml
Log likelihood    = 258.58985
------------------------------------------------------------------------------
             |      Coef.  Legend
-------------+----------------------------------------------------------------
mean(mrgrate)|   .0186789  _b[mean(mrgrate):_cons]
mean(dvcrate)|   .0079769  _b[mean(dvcrate):_cons]
 mean(medage)|      29.54  _b[mean(medage):_cons]
-------------+----------------------------------------------------------------
 var(mrgrate)|   .0006489  _b[var(mrgrate):_cons]
 var(dvcrate)|   9.76e-06  _b[var(dvcrate):_cons]
  var(medage)|     2.8104  _b[var(medage):_cons]
-------------+----------------------------------------------------------------
cov(mrgrate, |
     dvcrate)|   .0000613  _b[cov(mrgrate,dvcrate):_cons]
cov(mrgrate, |
      medage)|  -.0007539  _b[cov(mrgrate,medage):_cons]
cov(dvcrate, |
      medage)|  -.0011674  _b[cov(dvcrate,medage):_cons]
------------------------------------------------------------------------------
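The test referred to in the note below can then be run along the following lines; this is a sketch
using the coefficient names just discussed, and its chi-squared output is not reproduced here:
. estat stdize: test _b[cov(medage,mrgrate):_cons] = _b[cov(medage,dvcrate):_cons]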
Note:
1. We can reject, at the 5% level, the hypothesis that the two correlations are equal.
Also see
[SEM] test Wald test of linear hypotheses
[SEM] estat stdize Test standardized parameters
[R] correlate Correlations (covariances) of variables or coefficients
Title
example 17 Correlated uniqueness model
Description
Reference
Also see
Description
To demonstrate a correlated uniqueness model, we use the following summary statistics data:
. use http://www.stata-press.com/data/r13/sem_cu1
(Correlated uniqueness)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_cu1.dta
  obs:           500                          Correlated uniqueness
 vars:             9                          25 May 2013 10:12
                                              (_dta has notes)
variable name   variable label
-------------------------------
par_i
szt_i
szd_i
par_c
szt_c
szd_c
par_o
szt_o
szd_o
. notes
_dta:
1. Summary statistics data for Multitrait-Multimethod matrix (a specific
kind of correlation matrix) and standard deviations from Brown, Timothy
A., 2006, _Confirmatory Factor Analysis for Applied Research_, New York,
NY: The Guilford Press.
2. Summary statistics represent a sample of 500 patients who were evaluated
for three personality disorders using three different methods.
3. The personality disorders include paranoid, schizotypal, and schizoid.
4. The methods of evaluation include a self-report inventory, ratings from a
clinical interview, and observational ratings.
(Path diagram: the latent traits Par, Szt, and Szd point to their method-specific indicators:
Par to par_i, par_c, and par_o; Szt to szt_i, szt_c, and szt_o; and Szd to szd_i, szd_c,
and szd_o.)
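The standardized results below could be produced by a command along these lines; this is a sketch
based on note 2 after the output, which says the covstructure() option (abbreviated covstr())
was used to correlate the errors within each method:
. sem (Par -> par_i par_c par_o)                     ///
      (Szt -> szt_i szt_c szt_o)                     ///
      (Szd -> szd_i szd_c szd_o),                    ///
      covstr(e.par_i e.szt_i e.szd_i, unstructured)  ///
      covstr(e.par_c e.szt_c e.szd_c, unstructured)  ///
      covstr(e.par_o e.szt_o e.szd_o, unstructured)  ///
      standardized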
Fitting target model:
Iteration 0:   log likelihood =  -10210.31  (not concave)
Iteration 1:   log likelihood = -10040.188  (not concave)
Iteration 2:   log likelihood = -9971.4015
Iteration 3:   log likelihood = -9918.0037
Iteration 4:   log likelihood = -9883.6368
Iteration 5:   log likelihood = -9880.0242
Iteration 6:   log likelihood = -9879.9961
Iteration 7:   log likelihood = -9879.9961
Structural equation model                       Number of obs      =       500
 ( 1)  [par_i]Par = 1
 ( 2)  [szt_i]Szt = 1
 ( 3)  [szd_i]Szd = 1
------------------------------------------------------------------------------
             |                 OIM
Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  par_i <-   |
         Par |   .7119709   .0261858    27.19   0.000     .6606476    .7632941
  par_c <-   |
         Par |   .8410183   .0242205    34.72   0.000     .7935469    .8884897
  par_o <-   |
         Par |   .7876062   .0237685    33.14   0.000     .7410209    .8341916
  szt_i <-   |
         Szt |   .7880887   .0202704    38.88   0.000     .7483594    .8278179
  szt_c <-   |
         Szt |   .7675732   .0244004    31.46   0.000     .7197493    .8153972
  szt_o <-   |
         Szt |   .8431662   .0181632    46.42   0.000      .807567    .8787653
  szd_i <-   |
         Szd |   .7692321   .0196626    39.12   0.000     .7306942      .80777
  szd_c <-   |
         Szd |   .8604596   .0179455    47.95   0.000     .8252871    .8956321
  szd_o <-   |
         Szd |   .8715597   .0155875    55.91   0.000     .8410086    .9021107
-------------+----------------------------------------------------------------
var(e.par_i) |   .4930975   .0372871                      .4251739    .5718722
var(e.par_c) |   .2926882   .0407398                      .2228049    .3844905
var(e.par_o) |   .3796764   .0374404                      .3129503    .4606295
var(e.szt_i) |   .3789163   .0319498                      .3211966    .4470082
var(e.szt_c) |   .4108313   .0374582                      .3436006    .4912169
var(e.szt_o) |   .2890708   .0306291                      .2348623    .3557912
var(e.szd_i) |    .408282   .0302501                      .3530966    .4720922
var(e.szd_c) |   .2596093   .0308827                      .2056187    .3277766
var(e.szd_o) |   .2403837    .027171                       .192616    .2999976
    var(Par) |          1          .                             .           .
    var(Szt) |          1          .                             .           .
    var(Szd) |          1          .                             .           .
-------------+----------------------------------------------------------------
cov(e.par_i, |
    e.szt_i) |   .2166732   .0535966     4.04   0.000     .1116258    .3217207
cov(e.par_i, |
    e.szd_i) |   .4411039   .0451782     9.76   0.000     .3525563    .5296515
cov(e.par_c, |
    e.szt_c) |  -.1074802   .0691107    -1.56   0.120    -.2429348    .0279743
cov(e.par_c, |
    e.szd_c) |  -.2646125   .0836965    -3.16   0.002    -.4286546   -.1005705
cov(e.par_o, |
    e.szt_o) |   .4132457   .0571588     7.23   0.000     .3012165    .5252749
cov(e.par_o, |
    e.szd_o) |   .3684402   .0587572     6.27   0.000     .2532781    .4836022
cov(e.szt_i, |
    e.szd_i) |   .7456394   .0351079    21.24   0.000     .6768292    .8144496
cov(e.szt_c, |
    e.szd_c) |  -.3296552   .0720069    -4.58   0.000    -.4707861   -.1885244
cov(e.szt_o, |
    e.szd_o) |   .4781276   .0588923     8.12   0.000     .3627009    .5935544
cov(Par,Szt) |   .3806759    .045698     8.33   0.000     .2911095    .4702422
cov(Par,Szd) |   .3590146   .0456235     7.87   0.000     .2695941    .4484351
cov(Szt,Szd) |   .3103837   .0466126     6.66   0.000     .2190246    .4017428
------------------------------------------------------------------------------
Notes:
1. We use the correlated uniqueness model fit above to analyze a multitrait-multimethod (MTMM)
   matrix. The MTMM matrix was developed by Campbell and Fiske (1959) to evaluate construct
   validity of measures. Each trait is measured using different methods, and the correlation matrix
   produced is used to evaluate whether measures that are related in theory are related in fact
   (convergent validity) and whether measures that are not intended to be related are not related in
   fact (discriminant validity).
   In this example, the traits are the latent variables Par, Szt, and Szd. The observed variables
   are the method-trait combinations.
   The observed traits are the personality disorders paranoid (par), schizotypal (szt), and schizoid
   (szd). The methods used to measure them are self-report (_i), clinical interview (_c), and
   observer rating (_o). Thus variable par_i is paranoid (par) measured by self-report (_i).
2. Note our use of the covstructure() option, which we abbreviated to covstr(). We used this
   option instead of cov() to save typing; see Correlated uniqueness model in [SEM] intro 5.
3. Large values of the factor loadings (path coefficients) indicate convergent validity.
4. Small correlations between latent variables indicate discriminant validity.
b. Click in the bottom of the Par oval (it will highlight when you hover over it), and drag a
   path to the top of the par_i rectangle (it will highlight when you can release to connect
   the path).
c. Continuing with the Add Path tool, create the following paths by clicking first in the bottom
   of the latent trait and dragging to the top of the observed variable:
       Szt -> szt_i
       Szd -> szd_i
       Par -> par_c
       Szt -> szt_c
       Szd -> szd_c
       Par -> par_o
       Szt -> szt_o
       Szd -> szd_o
7. Clean up the direction of the errors.
   a. Choose the Select tool.
   b. Click on the rectangle of any measurement variable whose associated error is not below it.
   c. Click on one of the Error Rotation buttons until the error is below the measurement variable.
   Repeat this for all errors that are not below the measurement variables.
8. Correlate the errors within the self-report, clinical interview, and observer rating groups.
   a. Select the Add Covariance tool.
   b. Click in the ε2 circle (it will highlight when you hover over it), and drag a covariance to
      the ε1 circle (it will highlight when you can release to connect the covariance).
   c. Continue with the Add Covariance tool to create eight more covariances by clicking
      the first-listed error and dragging it to the second-listed error:
          ε3 -> ε2
          ε3 -> ε1
          ε5 -> ε4
          ε6 -> ε5
          ε6 -> ε4
          ε8 -> ε7
          ε9 -> ε8
          ε9 -> ε7
   The order in which we create the covariances is unimportant. We dragged each covariance from
   right to left because the bow of the covariance is outward when we drag in a clockwise direction
   and inward when we drag in a counterclockwise direction. Had we connected the opposite way,
   we would have needed to use the Contextual Toolbar to mirror the bow of the covariances.
9. Correlate the latent factors.
a. Select the Add Covariance tool.
b. Click in the Par oval and drag a covariance to the Szt oval.
c. Click in the Szt oval and drag a covariance to the Szd oval.
d. Click in the Par oval and drag a covariance to the Szd oval.
Reference
Campbell, D. T., and D. W. Fiske. 1959. Convergent and discriminant validation by the multitrait-multimethod matrix.
Psychological Bulletin 56: 81-105.
Also see
[SEM] sem Structural equation model estimation command
[SEM] sem and gsem option covstructure( ) Specifying covariance restrictions
Title
example 18 Latent growth model
Description
Reference
Also see
Description
To demonstrate a latent growth model, we use the following data:
. use http://www.stata-press.com/data/r13/sem_lcm
. describe
Contains data from http://www.stata-press.com/data/r13/sem_lcm.dta
  obs:           359
 vars:             4                          25 May 2013 11:08
 size:         5,744                          (_dta has notes)
-------------------------------------------------------------------------------
              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------------------------
lncrime0        float  %9.0g                 ln(crime rate) in Jan & Feb
lncrime1        float  %9.0g                 ln(crime rate) in Mar & Apr
lncrime2        float  %9.0g                 ln(crime rate) in May & Jun
lncrime3        float  %9.0g                 ln(crime rate) in Jul & Aug
-------------------------------------------------------------------------------
Sorted by:
. notes
_dta:
1. Data used in Bollen, Kenneth A. and Patrick J. Curran, 2006, _Latent
Curve Models: A Structural Equation Perspective_. Hoboken, New Jersey:
John Wiley & Sons
2. Data from 1995 Uniform Crime Reports for 359 communities in New York
state.
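The estimation output below could be produced by a command along these lines; this is a sketch
assembled from the constraints described in the notes and the Builder steps later in this example
(intercepts 0, Intercept loadings 1, Slope loadings 0 through 3, equal error variances, and the
means() option of note 3), not necessarily the exact command used:
. sem (lncrime0 <- Intercept@1 Slope@0 _cons@0)    ///
      (lncrime1 <- Intercept@1 Slope@1 _cons@0)    ///
      (lncrime2 <- Intercept@1 Slope@2 _cons@0)    ///
      (lncrime3 <- Intercept@1 Slope@3 _cons@0),   ///
      latent(Intercept Slope)                      ///
      var(e.lncrime0@var e.lncrime1@var e.lncrime2@var e.lncrime3@var) ///
      means(Intercept Slope)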
Fitting target model:
Iteration 0:   log likelihood = -1034.1038
Iteration 1:   log likelihood = -1033.9044
Iteration 2:   log likelihood = -1033.9037
Iteration 3:   log likelihood = -1033.9037
Structural equation model                       Number of obs      =       359
 ( 1)  [var(e.lncrime0)]_cons - [var(e.lncrime3)]_cons = 0
 ( 2)  [var(e.lncrime1)]_cons - [var(e.lncrime3)]_cons = 0
 ( 3)  [var(e.lncrime2)]_cons - [var(e.lncrime3)]_cons = 0
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  lncri~0 <- |
   Intercept |          1  (constrained)
       _cons |          0  (constrained)
  lncri~1 <- |
   Intercept |          1  (constrained)
       Slope |          1  (constrained)
       _cons |          0  (constrained)
  lncri~2 <- |
   Intercept |          1  (constrained)
       Slope |          2  (constrained)
       _cons |          0  (constrained)
  lncri~3 <- |
   Intercept |          1  (constrained)
       Slope |          3  (constrained)
       _cons |          0  (constrained)
-------------+----------------------------------------------------------------
mean(Inter~t)|   5.337915   .0407501   130.99   0.000     5.258047    5.417784
 mean(Slope) |   .1426952   .0104574    13.65   0.000     .1221992    .1631912
-------------+----------------------------------------------------------------
var(e.lncr~0)|   .0981956   .0051826                      .0885457    .1088972
var(e.lncr~1)|   .0981956   .0051826                      .0885457    .1088972
var(e.lncr~2)|   .0981956   .0051826                      .0885457    .1088972
var(e.lncr~3)|   .0981956   .0051826                      .0885457    .1088972
var(Interc~t)|    .527409   .0446436                      .4467822    .6225858
  var(Slope) |   .0196198   .0031082                      .0143829    .0267635
-------------+----------------------------------------------------------------
cov(Inter~t, |
      Slope) |   -.034316   .0088848    -3.86   0.000    -.0517298   -.0169022
------------------------------------------------------------------------------
Notes:
1. In this example, we have repeated measures of the crime rate in 1995. We will assume that the
underlying rate grows linearly.
2. As explained in Latent growth models in [SEM] intro 5, we assume
       lncrime_i = Intercept + i × Slope,    i = 0, 1, 2, 3
3. sem does not usually report the means of latent exogenous variables because sem automatically
includes the identifying constraint that the means are 0; see How sem (gsem) solves the problem
for you in [SEM] intro 4 and see Default normalization constraints in [SEM] sem.
In this case, sem did not constrain the means to be 0 because we specified sems means() option.
In particular, we specified means(Intercept Slope), which said not to constrain the means of
those two exogenous latent variables and to report the estimated result.
Our model was identified even without the usual 0 constraints on Intercept and Slope because
we specified enough other constraints.
4. We estimate the Intercept to have mean 5.34 and the mean Slope to be 0.14 per two months.
Remember, we have measured crime rates as log base e crime rates.
The mean Intercept and Slope are what mixed would refer to as the coefficients in the
fixed-effects part of the model.
Acock (2013, chap. 4) discusses the use of sem to fit latent growth-curve models in more detail.
Acock demonstrates extensions to the basic model we fit here, such as including time-varying and
time-invariant covariates in the model.
b. Click in the bottom-left quadrant of the Intercept oval (it will highlight when you hover
over it), and drag a path to the top of the lncrime0 rectangle (it will highlight when you
can release to connect the path).
c. Continuing with the Add Path tool, create the following paths by clicking first in the bottom of
   the latent variable and dragging it to the top of the observed (measurement) variable:
Intercept -> lncrime1
Intercept -> lncrime2
Intercept -> lncrime3
Slope -> lncrime0
Slope -> lncrime1
Slope -> lncrime2
Slope -> lncrime3
6. Clean up the direction of the errors.
We want all the errors to be below the measurement variables.
a. Choose the Select tool.
b. Click on the rectangle of any measurement variable whose associated error is not below it.
c. Click on one of the Error Rotation buttons until the error is below the measurement variable.
Repeat this for all errors that are not below the measurement variables.
7. Create the covariance between Intercept and Slope.
a. Select the Add Covariance tool.
b. Click in the top-right quadrant of the Intercept oval, and drag a covariance to the top left
of the Slope oval.
8. Clean up paths and covariance.
If you do not like where a path has been connected to its variables, use the Select tool to
click on the path, and then simply click on where it connects to a rectangle and drag the endpoint.
Similarly, you can change where the covariance connects to the latent variables by clicking on
the covariance and dragging the endpoint. You can also change the bow of the covariance by
clicking on the covariance and dragging the control point that extends from one end of the
selected covariance.
9. Constrain the intercepts of the measurements to 0.
a. Choose the Select tool.
b. Click on the rectangle for lncrime0. In the Contextual Toolbar, type 0 in the intercept
   box and press Enter.
c. Repeat this process to add the 0 constraint on the intercept for lncrime1, lncrime2, and
   lncrime3.
10. Set constraints on the paths from Intercept to the measurements.
a. Continue with the Select tool.
b. Click on the path from Intercept to lncrime0. In the Contextual Toolbar, type 1 in the
   coefficient box and press Enter.
c. Repeat this process to add the 1 constraint on the following paths:
       Intercept -> lncrime1
       Intercept -> lncrime2
       Intercept -> lncrime3
11. Set constraints on the paths from Slope to the measurements.
a. Continue with the Select tool.
b. Click on the path from Slope to lncrime0. In the Contextual Toolbar, type 0 in the
   coefficient box and press Enter.
c. Click on the path from Slope to lncrime1. In the Contextual Toolbar, type 1 in the
   coefficient box and press Enter.
d. Click on the path from Slope to lncrime2. In the Contextual Toolbar, type 2 in the
   coefficient box and press Enter.
e. Click on the path from Slope to lncrime3. In the Contextual Toolbar, type 3 in the
   coefficient box and press Enter.
12. Set equality constraints on the error variances.
a. Continue with the Select tool.
b. Click in the ε1 circle, which is the error term for lncrime0. In the Contextual Toolbar,
   type var in the variance box and press Enter.
c. Repeat this process to add the var constraint on the three remaining error variances: ε2,
   ε3, and ε4.
13. Clean up placement of the constraints.
From the SEM Builder menu, select Settings > Connections > Paths....
In the resulting dialog box, do the following:
a. Click on the Results tab.
b. Click on the Result 1... button at the bottom left.
c. In the Appearance of result 1 - paths dialog box that opens, choose 20 (%) in the Distance
between nodes control.
d. Click on OK on the Appearance of result 1 - paths dialog box.
e. Click on OK on the Connection settings - paths dialog box.
14. Specify that the means of Intercept and Slope are to be estimated.
a. Choose the Select tool.
Reference
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Also see
[SEM] sem Structural equation model estimation command
Title
example 19 Creating multiple-group summary statistics data
Description
Reference
Also see
Description
The data analyzed in [SEM] example 20 are summary statistics data (SSD) and contain summary
statistics on two groups of subjects, those from grade 4 and those from grade 5. Below we show how
we created this summary statistics dataset.
See [SEM] intro 11 for background on SSD.
    ssd init      variable names
    ssd set obs   values
    ssd set means values
    ssd set sd    values
    ssd set corr  values
We will first set the end-of-line delimiter to a semicolon because we are going to have some long
lines. We will be entering SSD for 16 variables!
. #delimit ;
delimiter now ;
. ssd init phyab1   phyab2   phyab3   phyab4
>          appear1  appear2  appear3  appear4
>          peerrel1 peerrel2 peerrel3 peerrel4
>          parrel1  parrel2  parrel3  parrel4 ;
Summary statistics data initialized.  Next use, in any order,
    ssd set obs
    ssd set means
    ssd set sd
    ssd set corr
Status:
    observations:                  unset  (required to be set)
    means:                         unset
    variances or sd:               unset  (required to be set)
    covariances or correlations:   unset  (required to be set)
We then enter the number of observations, means, standard deviations, and correlations for the
first group. Entry of those values is not shown here; the final rows of the correlation matrix
we typed ended
>    ...  1.0 \
>    ...  .25  1.0 \
>    ...  .53  .50  1.0 \
>    ...  .46  .43  .59  1.0 ;
We have now entered the data for the first group, and ssd reports that we have a fully set dataset.
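A second group is then declared with the ssd addgroup command (see [SEM] ssd). A sketch of
that step, naming the group variable grade as the surrounding text assumes:
. ssd addgroup grade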
The ssd set command now modifies the new group grade==2. If we needed to modify data for
grade==1, we would place a 1 right after the set. For example,
. ssd set 1 means ...
We are not modifying data; however, we are now adding data for the second group. The procedure
for entering the second group is the same as the procedure for entering the first group:
. ssd set obs   values
. ssd set means values
. ssd set sd    values
. ssd set corr  values
We do that below.
. #delimit ;
delimiter now ;
. ssd set obs 251 ;
(value set for group grade==2)
Status for group grade==2:
    observations:                  set
    means:                         unset
    variances or sd:               unset
    covariances or correlations:   unset  (required to be set)
. ssd set corr
>
1.0 \
>
.31 1.0 \
>
.52 .45 1.0 \
>
.54 .46 .70 1.0 \
>
.15 .33 .22 .21 1.0 \
>
.14 .28 .21 .13 .72 1.0 \
>
.16 .32 .35 .31 .59 .56 1.0 \
>
.23 .29 .43 .36 .55 .51 .65 1.0 \
>
.24 .13 .24 .23 .25 .24 .24 .30 1.0 \
>
.19 .26 .22 .18 .34 .37 .36 .32 .38 1.0 \
>
.16 .24 .36 .30 .33 .29 .44 .51 .47 .50 1.0 \
>
.16 .21 .35 .24 .31 .33 .41 .39 .47 .47 .55 1.0 \
>
.08 .18 .09 .12 .19 .24 .08 .21 .21 .19 .19 .20 1.0 \
>
.01 -.01 .03 .02 .10 .13 .03 .05 .26 .17 .23 .26 .33 1.0 \
>
.06 .19 .22 .22 .23 .24 .20 .26 .16 .23 .38 .24 .42 .40 1.0 \
>
.04 .17 .10 .07 .26 .24 .12 .26 .16 .22 .32 .17 .42 .42 .65 1.0 ;
(values set for group grade==2)
Status for group grade==2:
    observations:                  set
    means:                         unset
    variances or sd:               unset
    covariances or correlations:   set
We could stop here and save the data in a Stata dataset. We might type
. save sem_2fmmby
However, we intend to use these data as an example in this manual and online. Here is what you
would see if you typed ssd describe:
. ssd describe
Summary statistics data
  obs:           385
 vars:            16
variable name   variable label
-------------------------------
phyab1
phyab2
phyab3
phyab4
appear1
appear2
appear3
appear4
peerrel1
peerrel2
peerrel3
peerrel4
parrel1
parrel2
parrel3
parrel4
Group variable:  grade (2 groups)
Obs. by group:   134, 251
We are going to label these data so that ssd describe can provide more information:
. label data "two-factor CFA"
. label var phyab1   "Physical ability 1"
. label var phyab2   "Physical ability 2"
. label var phyab3   "Physical ability 3"
. label var phyab4   "Physical ability 4"
. label var appear1  "Appearance 1"
. label var appear2  "Appearance 2"
. label var appear3  "Appearance 3"
. label var appear4  "Appearance 4"
. label var peerrel1 "Relationship w/ peers 1"
. label var peerrel2 "Relationship w/ peers 2"
. label var peerrel3 "Relationship w/ peers 3"
. label var peerrel4 "Relationship w/ peers 4"
. label var parrel1  "Relationship w/ parent 1"
. label var parrel2  "Relationship w/ parent 2"
. label var parrel3  "Relationship w/ parent 3"
. label var parrel4  "Relationship w/ parent 4"
. #delimit ;
delimiter now ;
. notes: Summary statistics data from
>
Marsh, H. W. and Hocevar, D., 1985,
>
"Application of confirmatory factor analysis to the study of
>
self-concept: First- and higher order factor models and their
>
invariance across groups", _Psychological Bulletin_, 97: 562-582. ;
. notes: Summary statistics based on
>
134 students in grade 4 and
>
251 students in grade 5
>
from Sydney, Australia. ;
. notes: Group 1 is grade 4, group 2 is grade 5. ;
. notes: Data collected using the Self-Description Questionnaire
>
and includes sixteen subscales designed to measure
>
nonacademic traits: four intended to measure physical
>
ability, four intended to measure physical appearance,
>
four intended to measure relations with peers, and four
>
intended to measure relations with parents. ;
. #delimit cr
delimiter now cr
Reference
Marsh, H. W., and D. Hocevar. 1985. Application of confirmatory factor analysis to the study of self-concept: First-
and higher order factor models and their invariance across groups. Psychological Bulletin 97: 562-582.
Also see
[SEM] ssd Making summary statistics data (sem only)
[SEM] example 20 Two-factor measurement model by group
Title
example 20 Two-factor measurement model by group
Description
Reference
Also see
Description
Below we demonstrate sem's group() option, which allows fitting models in which path coefficients
and covariances differ across groups of the data, such as for males and females. We use the following
data:
. use http://www.stata-press.com/data/r13/sem_2fmmby
(two-factor CFA)
. ssd describe
Summary statistics data from
http://www.stata-press.com/data/r13/sem_2fmmby.dta
  obs:           385                          two-factor CFA
 vars:            16                          25 May 2013 11:11
                                              (_dta has notes)
variable name   variable label
------------------------------------------
phyab1          Physical ability 1
phyab2          Physical ability 2
phyab3          Physical ability 3
phyab4          Physical ability 4
appear1         Appearance 1
appear2         Appearance 2
appear3         Appearance 3
appear4         Appearance 4
peerrel1        Relationship w/ peers 1
peerrel2        Relationship w/ peers 2
peerrel3        Relationship w/ peers 3
peerrel4        Relationship w/ peers 4
parrel1         Relationship w/ parent 1
parrel2         Relationship w/ parent 2
parrel3         Relationship w/ parent 3
parrel4         Relationship w/ parent 4
Group variable:  grade (2 groups)
Obs. by group:   134, 251
. notes
_dta:
1. Summary statistics data from Marsh, H. W. and Hocevar, D., 1985,
"Application of confirmatory factor analysis to the study of
self-concept: First- and higher order factor models and their invariance
across groups", _Psychological Bulletin_, 97: 562-582.
2. Summary statistics based on 134 students in grade 4 and 251 students in
grade 5 from Sydney, Australia.
3. Group 1 is grade 4, group 2 is grade 5.
4. Data collected using the Self-Description Questionnaire and includes
   sixteen subscales designed to measure nonacademic traits: four intended
   to measure physical ability, four intended to measure physical
   appearance, four intended to measure relations with peers, and four
   intended to measure relations with parents.
Background
See [SEM] intro 6 for background on sems group() option.
We will fit the model
(Path diagram: Peer points to peerrel1-peerrel4 and Par points to parrel1-parrel4, with a
covariance between Peer and Par.)
We are using the same data used in [SEM] example 15, but we are using more of the data and
fitting a different model. To remind you, those data were collected from students in grade 5. The
dataset we are using, however, has data for students from grade 4 and from grade 5, which was
created in [SEM] example 19. We have the following observed variables:
1. Four measures of physical ability.
2. Four measures of appearance.
3. Four measures of quality of relationship with peers.
4. Four measures of quality of relationship with parents.
In this example, we will consider solely the measurement problem, and include only the measurement
variables for the two kinds of relationship quality. We are going to treat quality of relationship with
peers as measures of underlying factor Peer and quality of relationship with parents as measures of
underlying factor Par.
Below we will
1. Fit the model with all the data. This amounts to assuming that the students in grades 4 and 5
are identical in terms of this measurement problem.
2. Fit the model with sem's group() option, which will constrain some parameters to be the same
for students in grades 4 and 5 and leave the others unconstrained.
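For the first fit, a command along these lines (a sketch; it is the same model that [SEM] example 21
fits with the group() option, here without it) produces the output below:
. sem (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)  ///
      (Par -> parrel1 parrel2 parrel3 parrel4)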
Structural equation model                       Number of obs      =       385
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  peerr~1 <- |
        Peer |          1  (constrained)
       _cons |   8.681221   .0937197    92.63   0.000     8.497534    8.864908
  peerr~2 <- |
        Peer |   1.113865     .09796    11.37   0.000     .9218666    1.305863
       _cons |   7.828623   .1037547    75.45   0.000     7.625268    8.031979
  peerr~3 <- |
        Peer |    1.42191    .114341    12.44   0.000     1.197806    1.646014
       _cons |   7.359896   .1149905    64.00   0.000     7.134519    7.585273
  peerr~4 <- |
        Peer |   1.204146   .0983865    12.24   0.000     1.011312     1.39698
       _cons |   8.150779   .1023467    79.64   0.000     7.950183    8.351375
  parrel1 <- |
         Par |          1  (constrained)
       _cons |   9.339558   .0648742   143.96   0.000     9.212407     9.46671
  parrel2 <- |
         Par |   1.112383   .1378687     8.07   0.000     .8421655    1.382601
       _cons |   9.220494   .0742356   124.21   0.000     9.074994    9.365993
  parrel3 <- |
         Par |   2.037924    .204617     9.96   0.000     1.636882    2.438966
       _cons |   8.676961    .088927    97.57   0.000     8.502667    8.851255
  parrel4 <- |
         Par |    1.52253   .1536868     9.91   0.000     1.221309     1.82375
       _cons |   9.045247   .0722358   125.22   0.000     8.903667    9.186826
-------------+----------------------------------------------------------------
var(e.peer~1)|   1.809309   .1596546                      1.521956    2.150916
var(e.peer~2)|   2.193804    .194494                      1.843884    2.610129
var(e.peer~3)|   1.911874    .214104                      1.535099    2.381126
var(e.peer~4)|   1.753037   .1749613                      1.441575    2.131792
var(e.parr~1)|   1.120333   .0899209                      .9572541    1.311193
var(e.parr~2)|   1.503003   .1200739                      1.285162    1.757769
var(e.parr~3)|   .9680081   .1419777                      .7261617    1.290401
var(e.parr~4)|   .8498834   .0933687                       .685245    1.054078
   var(Peer) |   1.572294   .2255704                      1.186904    2.082822
    var(Par) |   .5000022    .093189                      .3469983    .7204709
-------------+----------------------------------------------------------------
cov(Peer,Par)|   .4226706   .0725253     5.83   0.000     .2805236    .5648176
------------------------------------------------------------------------------
Note:
1. We are using SSD with data for two separate groups. There is no hint of that in the output above
because sem combined the summary statistics and produced overall results just as if we had the
real data.
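For the second fit, we add the group() option; this matches the command shown in
[SEM] example 21:
. sem (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)  ///
      (Par -> parrel1 parrel2 parrel3 parrel4), group(grade)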
Fitting target model:
Iteration  0:  log likelihood =  -13049.77  (not concave)
Iteration  1:  log likelihood = -10819.682  (not concave)
Iteration  2:  log likelihood = -8873.4568  (not concave)
Iteration  3:  log likelihood = -6119.7114  (not concave)
Iteration  4:  log likelihood =  -5949.354  (not concave)
Iteration  5:  log likelihood = -5775.6085  (not concave)
Iteration  6:  log likelihood = -5713.9178  (not concave)
Iteration  7:  log likelihood = -5638.1208  (not concave)
Iteration  8:  log likelihood = -5616.6335  (not concave)
Iteration  9:  log likelihood = -5595.7507  (not concave)
Iteration 10:  log likelihood = -5589.9802  (not concave)
Iteration 11:  log likelihood = -5578.8701  (not concave)
Iteration 12:  log likelihood = -5574.0162  (not concave)
Iteration 13:  log likelihood = -5568.0786
Iteration 14:  log likelihood = -5551.7349
Iteration 15:  log likelihood = -5544.0052
Iteration 16:  log likelihood = -5542.7113
Iteration 17:  log likelihood = -5542.6775
Iteration 18:  log likelihood = -5542.6774
Structural equation model                       Number of obs      =       385
                                                Number of groups   =         2
[peerrel1]1bn.grade#c.Peer = 1
[peerrel2]1bn.grade#c.Peer - [peerrel2]2.grade#c.Peer = 0
[peerrel3]1bn.grade#c.Peer - [peerrel3]2.grade#c.Peer = 0
[peerrel4]1bn.grade#c.Peer - [peerrel4]2.grade#c.Peer = 0
[parrel1]1bn.grade#c.Par = 1
[parrel2]1bn.grade#c.Par - [parrel2]2.grade#c.Par = 0
[parrel3]1bn.grade#c.Par - [parrel3]2.grade#c.Par = 0
[parrel4]1bn.grade#c.Par - [parrel4]2.grade#c.Par = 0
[peerrel1]1bn.grade - [peerrel1]2.grade = 0
[peerrel2]1bn.grade - [peerrel2]2.grade = 0
[peerrel3]1bn.grade - [peerrel3]2.grade = 0
[peerrel4]1bn.grade - [peerrel4]2.grade = 0
[parrel1]1bn.grade - [parrel1]2.grade = 0
[parrel2]1bn.grade - [parrel2]2.grade = 0
[parrel3]1bn.grade - [parrel3]2.grade = 0
[parrel4]1bn.grade - [parrel4]2.grade = 0
[peerrel1]2.grade#c.Peer = 1
[parrel1]2.grade#c.Par = 1
[mean(Peer)]1bn.grade = 0
[mean(Par)]1bn.grade = 0
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  peerr~1 <- |
        Peer |
         [*] |          1  (constrained)
       _cons |
         [*] |   8.466539   .1473448    57.46   0.000     8.177748    8.755329
  peerr~2 <- |
        Peer |
         [*] |   1.109234   .0975279    11.37   0.000     .9180833    1.300385
       _cons |
         [*] |   7.589872   .1632145    46.50   0.000     7.269977    7.909766
  peerr~3 <- |
        Peer |
         [*] |   1.409361   .1138314    12.38   0.000     1.186256    1.632467
       _cons |
         [*] |   7.056996   .1964299    35.93   0.000     6.672001    7.441992
  peerr~4 <- |
        Peer |
         [*] |   1.195982   .0980272    12.20   0.000     1.003852    1.388112
       _cons |
         [*] |    7.89358    .169158    46.66   0.000     7.562036    8.225123
  parrel1 <- |
         Par |
         [*] |          1  (constrained)
       _cons |
         [*] |   9.368654   .0819489   114.32   0.000     9.208037    9.529271
  parrel2 <- |
         Par |
         [*] |   1.104355   .1369365     8.06   0.000     .8359649    1.372746
       _cons |
         [*] |   9.287629   .0903296   102.82   0.000     9.110587    9.464672
  parrel3 <- |
         Par |
         [*] |    2.05859   .2060583     9.99   0.000     1.654723    2.462457
       _cons |
         [*] |   8.741898    .136612    63.99   0.000     8.474144    9.009653
  parrel4 <- |
         Par |
         [*] |   1.526706   .1552486     9.83   0.000     1.222424    1.830987
       _cons |
         [*] |   9.096609   .1061607    85.69   0.000     8.888538     9.30468
-------------+----------------------------------------------------------------
  mean(Peer) |
           1 |          0  (constrained)
           2 |   .3296841   .1570203     2.10   0.036       .02193    .6374382
   mean(Par) |
           1 |          0  (constrained)
           2 |  -.0512439   .0818255    -0.63   0.531     -.211619    .1091313
-------------+----------------------------------------------------------------
var(e.peer~1)|
           1 |   1.824193   .2739446                      1.359074    2.448489
           2 |   1.773813   .1889104                      1.439644    2.185549
var(e.peer~2)|
           1 |   2.236974   .3310875                      1.673699    2.989817
           2 |   2.165228   .2321565                       1.75484    2.671589
var(e.peer~3)|
           1 |   1.907009   .3383293                      1.346908    2.700023
           2 |   1.950679   .2586196                      1.504298    2.529516
var(e.peer~4)|
           1 |   1.639881    .272764                       1.18367    2.271925
           2 |   1.822448   .2151827                      1.445942    2.296992
var(e.parr~1)|
           1 |   .9669121   .1302489                      .7425488    1.259067
           2 |   1.213159   .1192634                      1.000547    1.470949
var(e.parr~2)|
           1 |   .9683878    .133192                      .7395628    1.268012
           2 |    1.79031   .1747374                      1.478596    2.167739
var(e.parr~3)|
           1 |   .8377567   .1986089                       .526407    1.333258
           2 |   1.015707   .1713759                      .7297073      1.4138
var(e.parr~4)|
           1 |   .8343032   .1384649                      .6026352     1.15503
           2 |   .8599648   .1165865                      .6592987    1.121706
   var(Peer) |
           1 |   2.039297   .3784544                       1.41747    2.933912
           2 |   1.307976   .2061581                      .9603661    1.781406
    var(Par) |
           1 |   .4492996   .1011565                      .2889976    .6985183
           2 |   .5201696   .1029353                      .3529413    .7666329
-------------+----------------------------------------------------------------
cov(Peer,Par)|
           1 |   .5012091   .1193333     4.20   0.000     .2673201    .7350982
           2 |   .3867156    .079455     4.87   0.000     .2309867    .5424445
------------------------------------------------------------------------------
Notes:
1. In Which parameters vary by default, and which do not in [SEM] intro 6, we wrote that, generally
speaking, when we specify group(groupvar), the measurement part of the model is constrained
by default to be the same across the groups, whereas the remaining parts will have separate
parameters for each group.
More precisely, we revealed that sem classifies each parameter into one of nine classes, which
are the following:
    Class description                                     Class name
    1. structural coefficients                            scoef
    2. structural intercepts                              scons
    3. measurement coefficients                           mcoef
    4. measurement intercepts                             mcons
    5. covariances of structural errors                   serrvar
    6. covariances of measurement errors                  merrvar
    7. covariances between structural and
       measurement errors                                 smerrcov
    8. means of exogenous variables                       meanex    (*)
    9. covariances of exogenous variables                 covex     (*)
    (*) See [SEM] intro 6 for the default treatment of these two classes.
In [SEM] example 23, we show how to constrain the parameters we choose to be equal across
groups.
b. Click in the upper-right quadrant of the Peer oval (it will highlight when you hover over
it), and drag a covariance to the upper-left quadrant of the Par oval (it will highlight when
you can release to connect the covariance).
6. Clean up.
If you do not like where a covariance has been connected to its variable, use the Select tool,
, to click on the covariance, and then simply click on where it connects to an oval and drag
the endpoint. You can also change the bow of the covariance by dragging the control point that
extends from one end of the selected covariance.
7. Estimate.
Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
SEM estimation options dialog box.
Reference
Acock, A. C. 2013. Discovering Structural Equation Modeling Using Stata. Rev. ed. College Station, TX: Stata Press.
Also see
[SEM] example 3 Two-factor measurement model
[SEM] example 19 Creating multiple-group summary statistics data
[SEM] example 21 Group-level goodness of fit
[SEM] example 22 Testing parameter equality across groups
[SEM] example 23 Specifying parameter constraints across groups
Title
example 21 Group-level goodness of fit
Description
Also see
Description
Below we demonstrate the estat ggof command, which may be used after sem with the group()
option. estat ggof displays group-by-group goodness-of-fit statistics.
We pick up where [SEM] example 20 left off:
. use http://www.stata-press.com/data/r13/sem_2fmmby
. sem (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)
///
(Par -> parrel1 parrel2 parrel3 parrel4), group(grade)
. estat ggof
       grade |        N      SRMR        CD
-------------+------------------------------
           1 |      134     0.088     0.969
           2 |      251     0.056     0.955
Notes:
1. Reported are the goodness-of-fit tests that estat gof, stats(residuals) would report. The
difference is that they are reported for each group rather than overall.
2. If the fit is good, then SRMR (standardized root mean squared residual) will be close to 0 and
CD (the coefficient of determination) will be near 1.
It is also appropriate to run estat gof to obtain overall results:
. estat gof, stats(residuals)
---------------------------------------------------------------------------
Fit statistic        |     Value   Description
---------------------+-----------------------------------------------------
Size of residuals    |
                SRMR |     0.074   Standardized root mean squared residual
                  CD |     0.958   Coefficient of determination
---------------------------------------------------------------------------
Also see
[SEM] example 20 Two-factor measurement model by group
[SEM] example 4 Goodness-of-fit statistics
Title
example 22 Testing parameter equality across groups
Description
Also see
Description
Below we demonstrate estat ginvariant to test parameters across groups.
We pick up where [SEM] example 20 left off:
. use http://www.stata-press.com/data/r13/sem_2fmmby
. sem (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)
///
(Par -> parrel1 parrel2 parrel3 parrel4), group(grade)
. estat ginvariant
Tests for group invariance of parameters
-----------------------------------------------------------------------------
             |          Wald Test          |          Score Test
             |    chi2    df     p>chi2    |    chi2    df     p>chi2
-------------+-----------------------------+---------------------------------
Measurement  |                             |
  peerr~1 <- |                             |
        Peer |       .     .         .     |   2.480     1     0.1153
       _cons |       .     .         .     |   0.098     1     0.7537
  peerr~2 <- |                             |
        Peer |       .     .         .     |   0.371     1     0.5424
       _cons |       .     .         .     |   0.104     1     0.7473
  peerr~3 <- |                             |
        Peer |       .     .         .     |   2.004     1     0.1568
       _cons |       .     .         .     |   0.002     1     0.9687
  peerr~4 <- |                             |
        Peer |       .     .         .     |   0.239     1     0.6246
       _cons |       .     .         .     |   0.002     1     0.9611
  parrel1 <- |                             |
         Par |       .     .         .     |   0.272     1     0.6019
       _cons |       .     .         .     |   0.615     1     0.4329
  parrel2 <- |                             |
         Par |       .     .         .     |   0.476     1     0.4903
       _cons |       .     .         .     |   3.277     1     0.0703
  parrel3 <- |                             |
         Par |       .     .         .     |   3.199     1     0.0737
       _cons |       .     .         .     |   1.446     1     0.2291
  parrel4 <- |                             |
         Par |       .     .         .     |   2.969     1     0.0849
       _cons |       .     .         .     |   0.397     1     0.5288
-------------+-----------------------------+---------------------------------
var(e.peer~1)|   0.024     1     0.8772    |       .     .         .
var(e.peer~2)|   0.033     1     0.8565    |       .     .         .
var(e.peer~3)|   0.011     1     0.9152    |       .     .         .
var(e.peer~4)|   0.294     1     0.5879    |       .     .         .
var(e.parr~1)|   1.981     1     0.1593    |       .     .         .
var(e.parr~2)|  14.190     1     0.0002    |       .     .         .
var(e.parr~3)|   0.574     1     0.4486    |       .     .         .
var(e.parr~4)|   0.022     1     0.8813    |       .     .         .
   var(Peer) |   4.583     1     0.0323    |       .     .         .
    var(Par) |   0.609     1     0.4350    |       .     .         .
-------------+-----------------------------+---------------------------------
cov(Peer,Par)|   0.780     1     0.3772    |       .     .         .
-----------------------------------------------------------------------------
Notes:
1. In the output above, score tests are reported for parameters that were constrained. The null
hypothesis is that the constraint is valid. None of the tests reject a valid constraint.
2. Wald tests are reported for parameters that were not constrained. The null hypothesis is that a
constraint would be valid. Only in two cases does it appear that grade 4 differs from grade 5,
namely, the variance of e.parrel2 and the variance of Peer.
3. We remind you that these tests are marginal tests. That is, each test is intended to be interpreted
separately. These are not joint tests of simultaneous imposition or relaxation of constraints. If you
want simultaneous tests, you must do them yourself by using, for instance, the test command.
If joint tests of parameter classes are desired, the class option can be used.
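A sketch of that request (see [SEM] estat ginvariant for the full syntax):
. estat ginvariant, class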
These results imply that none of the constraints we impose should be relaxed, and that perhaps we
could constrain all the variances and covariances to be equal across groups except for the variances
of e.parrel2 and Peer. We do that in [SEM] example 23.
Also see
[SEM] example 20 Two-factor measurement model by group
[SEM] example 23 Specifying parameter constraints across groups
Title
example 23 Specifying parameter constraints across groups
Description
Also see
Description
Below we demonstrate how to constrain the parameters we want constrained to be equal across
groups when using sem with the group() option.
We pick up where [SEM] example 22 left off:
. use http://www.stata-press.com/data/r13/sem_2fmmby
. sem (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)
///
(Par -> parrel1 parrel2 parrel3 parrel4), group(grade)
. estat ginvariant
The estat ginvariant command implied that perhaps we could constrain all the variances and
covariances to be equal across groups except for the variances of e.parrel2 and Peer.
Background
We can specify which parameters we wish to allow to vary. Remember that sem's group() option
classifies the parameters of the model as follows:
    Class description                                     Class name
    1. structural coefficients                            scoef
    2. structural intercepts                              scons
    3. measurement coefficients                           mcoef
    4. measurement intercepts                             mcons
    5. covariances of structural errors                   serrvar
    6. covariances of measurement errors                  merrvar
    7. covariances between structural and
       measurement errors                                 smerrcov
    8. means of exogenous variables                       meanex    (*)
    9. covariances of exogenous variables                 covex     (*)
    (*) See [SEM] intro 6 for the default treatment of these two classes.
You may specify any of the class names as being ginvariant(). You may specify as many class
names as you wish. When you specify ginvariant(), sem cancels its default actions on which
parameters vary and which do not, and uses the information you specify. All classes that you do not
mention as being ginvariant() are allowed to vary across groups.
By using ginvariant(), you can constrain, or free by your silence, whole classes of parameters.
For instance, you could type
. sem ..., group(mygroup) ginvariant(mcoef mcons serrvar)
and you are constraining those parameters to be equal across groups and leaving unconstrained scoef,
scons, merrvar, smerrcov, meanex, and covex.
In addition, if a class is constrained, you can still unconstrain individual coefficients. Consider the
model
. sem ... (x1<-L) ...
If you typed
. sem ... (1: x1<-L@a1) (2: x1<-L@a2) ..., group(mygroup) ginvariant(all)
then all estimated parameters would be the same across groups except for the path x1<-L, and it
would be free to vary in groups 1 and 2.
By the same token, if a class is unconstrained, you can still constrain individual coefficients. If
you typed
. sem ... (1: x1<-L@a) (2: x1<-L@a) ..., group(mygroup) ginvariant(none)
then you would leave unconstrained all parameters except the path x1<-L, and it would be constrained
to be equal in groups 1 and 2.
This is all discussed in [SEM] intro 6, including how to constrain and free variance and covariance
parameters.
We impose constraints on all parameters except the variances of e.parrel2 and Peer. We can do
that by typing
. sem (Peer -> peerrel1 peerrel2 peerrel3 peerrel4)
>     (Par -> parrel1 parrel2 parrel3 parrel4),
>     group(grade)
>     ginvariant(all)
>     var(1: e.parrel2@v1)
>     var(2: e.parrel2@v2)
>     var(1: Peer@v3)
>     var(2: Peer@v4)
Endogenous variables
Measurement:  peerrel1 peerrel2 peerrel3 peerrel4 parrel1 parrel2 parrel3 parrel4
Exogenous variables
Latent:       Peer Par
Fitting target model:
Iteration 0:   log likelihood = -5560.9934
Iteration 1:   log likelihood = -5552.3122
Iteration 2:   log likelihood = -5549.5391
Iteration 3:   log likelihood = -5549.3528
Iteration 4:   log likelihood = -5549.3501
Iteration 5:   log likelihood = -5549.3501
Structural equation model                       Number of obs      =       385
                                                Number of groups   =         2
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  peerr~1 <- |
        Peer |
         [*] |          1  (constrained)
       _cons |
         [*] |   8.708274   .0935844    93.05   0.000     8.524852    8.891696
  peerr~2 <- |
        Peer |
         [*] |   1.112225   .0973506    11.42   0.000     .9214217    1.303029
       _cons |
         [*] |   7.858713   .1035989    75.86   0.000     7.655663    8.061763
  peerr~3 <- |
        Peer |
         [*] |   1.416486    .113489    12.48   0.000     1.194052    1.638921
       _cons |
         [*] |   7.398217   .1147474    64.47   0.000     7.173316    7.623118
  peerr~4 <- |
        Peer |
         [*] |   1.196494   .0976052    12.26   0.000     1.005191    1.387796
       _cons |
         [*] |   8.183148   .1021513    80.11   0.000     7.982936    8.383361
  parrel1 <- |
         Par |
         [*] |          1  (constrained)
       _cons |
         [*] |   9.339558   .0648742   143.96   0.000     9.212407     9.46671
  parrel2 <- |
         Par |
         [*] |   1.100315   .1362999     8.07   0.000     .8331722    1.367458
       _cons |
         [*] |   9.255299   .0725417   127.59   0.000      9.11312    9.397478
  parrel3 <- |
         Par |
         [*] |   2.051278   .2066714     9.93   0.000      1.64621    2.456347
       _cons |
         [*] |   8.676961    .088927    97.57   0.000     8.502667    8.851255
  parrel4 <- |
         Par |
         [*] |   1.529938    .154971     9.87   0.000       1.2262    1.833675
       _cons |
         [*] |   9.045247   .0722358   125.22   0.000     8.903667    9.186826
-------------+----------------------------------------------------------------
var(e.peer~1)|
         [*] |   1.799133    .159059                      1.512898    2.139523
var(e.peer~2)|
         [*] |   2.186953    .193911                      1.838086    2.602035
var(e.peer~3)|
         [*] |   1.915661   .2129913                       1.54056    2.382094
var(e.peer~4)|
         [*] |   1.767354   .1746104                       1.45622    2.144965
var(e.parr~1)|
         [*] |   1.125082   .0901338                      .9615942    1.316366
var(e.parr~2)|
           1 |   .9603043     .13383                       .730775    1.261927
           2 |   1.799668   .1747351                      1.487807    2.176898
var(e.parr~3)|
         [*] |   .9606889   .1420406                      .7190021    1.283617
var(e.parr~4)|
         [*] |   .8496935   .0933448                      .6850966    1.053835
   var(Peer) |
           1 |   1.951555   .3387796                      1.388727    2.742489
           2 |   1.361431   .2122853                      1.002927    1.848084
    var(Par) |
         [*] |   .4952527   .0927994                      .3430288    .7150281
-------------+----------------------------------------------------------------
cov(Peer,Par)|
         [*] |   .4096197   .0708726     5.78   0.000     .2707118    .5485275
------------------------------------------------------------------------------
Notes:
1. In [SEM] example 20, we previously fit this model by typing
   . sem (...) (...), group(grade)
2. When we specified ginvariant(all), sem secretly issued equality constraints of the form
       var(1: e.parrel2@secretname1)
       var(2: e.parrel2@secretname1)
       var(1: Peer@secretname2)
       var(2: Peer@secretname2)
   because that is how you impose equality constraints with the path notation. When we specified
       var(1: e.parrel2@v1)
       var(2: e.parrel2@v2)
       var(1: Peer@v3)
       var(2: Peer@v4)
   our new constraints overrode the secretly issued constraints. It would not have worked to leave
   off the symbolic names; see [SEM] sem path notation extensions. We specified the symbolic
   names v1, v2, v3, and v4. v1 and v2 overrode secretname1, and thus the constraint that
   var(e.parrel2) be equal across the two groups was relaxed. v3 and v4 overrode secretname2,
   and thus the constraint that var(Peer) be equal across groups was relaxed.
Also see
[SEM] example 20 Two-factor measurement model by group
[SEM] example 22 Testing parameter equality across groups
Title
example 24 Reliability
Description
Also see
Description
Below we demonstrate sem's reliability() option with the following data:
. use http://www.stata-press.com/data/r13/sem_rel
(measurement error with known reliabilities)
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           y |      1234     701.081    71.79378        487        943
          x1 |      1234     100.278     14.1552         51        149
          x2 |      1234    100.2066    14.50912         55        150
. notes
_dta:
  1. Fictional data.
  2. Variables x1 and x2 each contain a test score designed to measure X. The
     test is scored to have mean 100.
  3. Variables x1 and x2 are both known to have reliability 0.5.
  4. Variable y is the outcome, believed to be related to X.
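Ignoring the measurement error in x1 amounts to a linear regression of y on x1; a sketch of the
command producing the output below:
. sem (y <- x1)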
Structural equation model                       Number of obs      =      1234
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
          x1 |    3.54976   .1031254    34.42   0.000     3.347637    3.751882
       _cons |   345.1184   10.44365    33.05   0.000     324.6492    365.5876
-------------+----------------------------------------------------------------
    var(e.y) |   2627.401   105.7752                      2428.053    2843.115
------------------------------------------------------------------------------
Notes:
1. In these data, variable x1 is measured with error.
2. If we ignore that, we obtain a path coefficient for y<-x1 of 3.55.
3. We also ran this model for y<-x2. We obtained a path coefficient of 3.48.
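To account for the known 0.50 reliability of x1, we fit the model described in note 1 below; a
sketch using sem's reliability() option:
. sem (x1 <- X) (y <- X), reliability(x1 .5)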
Structural equation model                       Number of obs      =      1234
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  x1 <-      |
           X |          1  (constrained)
       _cons |    100.278   .4027933   248.96   0.000      99.4885    101.0674
  y <-       |
           X |    7.09952    .352463    20.14   0.000     6.408705    7.790335
       _cons |    701.081   2.042929   343.17   0.000      697.077    705.0851
-------------+----------------------------------------------------------------
   var(e.x1) |   100.1036  (constrained)
    var(e.y) |    104.631   207.3381                      2.152334    5086.411
      var(X) |   100.1036   8.060038                      85.48963    117.2157
------------------------------------------------------------------------------
Notes:
1. We wish to estimate the effect of y<-x1 when x1 is measured with error (0.50 reliability). To
do that, we introduce latent variable X and write our model as (x1<-X) (y<-X).
2. When we ignored the measurement error of x1, we obtained a path coefficient for y<-x1 of
3.55. Taking into account the measurement error, we obtain a coefficient of 7.1.
We can do better by using both measures of X:
. sem (x1 x2 <- X) (y <- X), reliability(x1 .5 x2 .5)
Endogenous variables
Measurement:  x1 x2 y
Exogenous variables
Latent:       X
Fitting target model:
Iteration 0:   log likelihood = -16258.636
Iteration 1:   log likelihood = -16258.401
Iteration 2:   log likelihood =  -16258.4
Structural equation model                       Number of obs      =      1234
Estimation method = ml
Log likelihood    =  -16258.4
 ( 1)  [x1]X = 1
 ( 2)  [var(e.x1)]_cons = 100.1036
 ( 3)  [var(e.x2)]_cons = 105.1719
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  x1 <-      |
           X |          1  (constrained)
       _cons |    100.278   .4037851   248.34   0.000     99.48655    101.0694
  x2 <-      |
           X |   1.030101   .0417346    24.68   0.000     .9483029      1.1119
       _cons |   100.2066   .4149165   241.51   0.000     99.39342    101.0199
  y <-       |
           X |   7.031299   .2484176    28.30   0.000     6.544409    7.518188
       _cons |    701.081   2.042928   343.17   0.000      697.077    705.0851
-------------+----------------------------------------------------------------
   var(e.x1) |   100.1036  (constrained)
   var(e.x2) |   105.1719  (constrained)
    var(e.y) |    152.329     105.26                      39.31868    590.1553
      var(X) |   101.0907   7.343656                      87.67509    116.5591
------------------------------------------------------------------------------
Notes:
1. We wish to estimate the effect of y<-X. We have two measures of X, namely x1 and x2, both
   measured with error (0.50 reliability).
2. In the previous section, we used just x1. We obtained path coefficient 7.1 with standard error
   0.4. Using both x1 and x2, we obtain path coefficient 7.0 and standard error 0.2.
3. We at StataCorp created these fictional data. The true coefficient is 7.
Also see
[SEM] sem and gsem option reliability( ) Fraction of variance not due to measurement error
[SEM] example 1 Single-factor measurement model
Title
example 25 Creating summary statistics data from raw data
Description
Also see
Description
Below we show how to create summary statistics data (SSD) from raw data. We use auto2.dta:
. use http://www.stata-press.com/data/r13/auto2
(1978 Automobile Data)
. describe
(output omitted )
. summarize
(output omitted )
We are going to create SSD containing the variables price, mpg, weight, displacement, and
foreign.
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
      weight |        74    3019.459    777.1936       1760       4840
displacement |        74    197.2973    91.83722         79        425
     foreign |        74    .2972973    .4601885          0          1

. * We will rescale weight and price:
. replace weight = weight/1000
weight was int now float
(74 real changes made)
. replace price = price/1000
price was int now float
(74 real changes made)
. label var weight "Weight (1000s lbs.)"
. label var price "Price ($1,000s)"
. * and now we check our work:
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        74    6.165257    2.949496      3.291     15.906
         mpg |        74     21.2973    5.785503         12         41
      weight |        74    3.019459    .7771936       1.76       4.84
displacement |        74    197.2973    91.83722         79        425
     foreign |        74    .2972973    .4601885          0          1
. * -------------------------------------------------------------------------
. * Suggestion 5: Create useful transformations:
. *
. gen gpm = 1/mpg
. label var gpm "Gallons per mile"
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        74    6.165257    2.949496      3.291     15.906
         mpg |        74     21.2973    5.785503         12         41
         gpm |        74    .0501928    .0127986   .0243902   .0833333
      weight |        74    3.019459    .7771936       1.76       4.84
displacement |        74    197.2973    91.83722         79        425
     foreign |        74    .2972973    .4601885          0          1
. * -------------------------------------------------------------------------
. * Suggestion 7: Save prepared data:
. *
. save auto_raw
file auto_raw.dta saved
. * -------------------------------------------------------------------------
We follow our advice below. After that, we will show you the advantages of digitally signing the
data.
. * -------------------------------------------------------------------------
. * Convert data:
. *
. ssd build _all
(data in memory now summary statistics data; you can use ssd describe and
 ssd list to describe and list results.)
. ssd describe
Summary statistics data
  obs:            74
 vars:             6
variable name   variable label
------------------------------------------
price           Price ($1,000s)
mpg             Mileage (mpg)
gpm             Gallons per mile
weight          Weight (1000s lbs.)
displacement    Displacement (cu. in.)
foreign         Car type
. notes
_dta:
  1. summary statistics data built from auto_raw.dta on 30 Jun 2012 15:32:33
     using -ssd build _all-.
. ssd list
Observations = 74
Means:
        price        mpg        gpm     weight  displacement    foreign
    6.1652567  21.297297   .0501928  3.0194595      197.2973   .2972973
Variances implicitly defined; they are the diagonal of the covariance matrix.
Covariances:
                    price        mpg        gpm     weight  displacement    foreign
       price    8.6995258
         mpg   -7.9962828  33.472047
         gpm    .02178417 -.06991586   .0001638
      weight    1.2346748 -3.6294262  .00849897  .60402985
displacement    134.06705 -374.92521  .90648519   63.87345     8434.0748
     foreign    .06612809  1.0473899 -.00212897 -.21202888    -25.938912  .21177342
. * -------------------------------------------------------------------------
. * Save:
. *
. save auto_ss
file auto_ss.dta saved
. * -------------------------------------------------------------------------
We recommend digitally signing the data. This way, anyone can verify later that the data are
unchanged:
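The signature is stored with datasignature set; a sketch of that step (its output, which reports
the stored signature, is not reproduced here):
. datasignature set
Afterward, anyone can confirm: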
. datasignature confirm
(data unchanged since 30jun2012 15:32)
Let us show you what would happen if the data had changed:
. replace mpg = mpg+.0001 in 5
(1 real change made)
. datasignature confirm
data have changed since 30jun2012 15:34
r(9);
There is no reason for you or anyone else to change the SSD after it has been created, so we
recommend that you digitally sign the data. With regular datasets, users do make changes, if only by
adding variables.
Be aware that the data signature is a function of the variable names, so if you rename a variable
(something you are allowed to do), the signature will change and datasignature will report, for
example, data have changed since 30jun2012 15:34. Solutions to that problem are discussed in
[SEM] ssd.
Publishing SSD
The summary statistics dataset you have just created can obviously be sent to and used by any
Stata user. If you wish to publish your data in printed form, use ssd describe and ssd list to
describe and list the data.
If the data contain groups identified by a grouping variable, you type
. ssd build _all, group(varname)
Below we build the automobile SSD again, but this time, we specify group(rep78):
. ssd build _all, group(rep78)
If you think carefully about this, you may be worried that _all includes rep78 and thus that we
will be including the grouping variable among the summary statistics. ssd build knows to omit
the group variable:
. * -------------------------------------------------------------------------
. * Suggestion 1: Keep relevant variables:
. *
. webuse auto2
(1978 Automobile Data)
. keep price mpg weight displacement foreign rep78
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5
      weight |        74    3019.459    777.1936       1760       4840
displacement |        74    197.2973    91.83722         79        425
     foreign |        74    .2972973    .4601885          0          1

. drop if rep78 >= .
(5 observations deleted)
. * We will rescale weight and price:
. replace weight = weight/1000
weight was int now float
(69 real changes made)
. replace price = price/1000
price was int now float
(69 real changes made)
. label var weight "Weight (1000s lbs.)"
. label var price "Price ($1,000s)"
. * and now we check our work:
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        69    6.146043     2.91244      3.291     15.906
         mpg |        69    21.28986    5.866408         12         41
       rep78 |        69    3.405797    .9899323          1          5
      weight |        69    3.032029    .7928515       1.76       4.84
displacement |        69         198    93.14789         79        425
     foreign |        69    .3043478    .4635016          0          1

. * Create useful transformations:
. gen gpm = 1/mpg
. label var gpm "Gallons per mile"
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |        69    6.146043     2.91244      3.291     15.906
         mpg |        69    21.28986    5.866408         12         41
         gpm |        69    .0502584    .0128353   .0243902   .0833333
       rep78 |        69    3.405797    .9899323          1          5
      weight |        69    3.032029    .7928515       1.76       4.84
displacement |        69         198    93.14789         79        425
     foreign |        69    .3043478    .4635016          0          1
variable name   variable label
------------------------------------------
price           Price ($1,000s)
mpg             Mileage (mpg)
gpm             Gallons per mile
weight          Weight (1000s lbs.)
displacement    Displacement (cu. in.)
foreign         Car type
(remaining ssd describe and ssd list output omitted)
. * -------------------------------------------------------------------------
. * Save:
. *
. save auto_group_ss
file auto_group_ss.dta saved
. * -------------------------------------------------------------------------
Also see
[SEM] ssd Making summary statistics data (sem only)
Title
example 26 Fitting a model with data missing at random
Description
Also see
Description
sem's method(mlmv) option is demonstrated using
. use http://www.stata-press.com/data/r13/cfa_missing
(CFA MAR data)
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          id |       500       250.5    144.4818          1        500
       test1 |       406    97.37475    13.91442    56.0406   136.5672
       test2 |       413    98.04501    13.84145   62.25496   129.3881
       test3 |       443    100.9699     13.4862   65.51753   137.3046
       test4 |       417    99.56815    14.25438    53.8719   153.9779
       taken |       500       3.358    .6593219          2          4
. notes
_dta:
1. Fictional data on 500 subjects taking four tests.
2. Tests results M.A.R. (missing at random).
3. 230 took all 4 tests
4. 219 took 3 of the 4 tests
5. 51 took 2 of the 4 tests
6. All tests have expected mean 100, s.d. 14.
(Path diagram: the latent variable X points to test1, test2, test3, and test4.)
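First, we fit the single-factor model by the default maximum likelihood method, under which
observations with any missing test score are casewise omitted; a sketch of the command:
. sem (test1 test2 test3 test4 <- X)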
Structural equation model                       Number of obs      =       230
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  test1 <-   |
           X |          1  (constrained)
       _cons |   96.76907   .8134878   118.96   0.000     95.17467    98.36348
  test2 <-   |
           X |   1.021885   .1183745     8.63   0.000      .789875    1.253895
       _cons |   92.41248   .8405189   109.95   0.000      90.7651    94.05987
  test3 <-   |
           X |   .5084673   .0814191     6.25   0.000     .3488889    .6680457
       _cons |   94.12958   .7039862   133.71   0.000      92.7498    95.50937
  test4 <-   |
           X |   .5585651   .0857772     6.51   0.000     .3904449    .7266853
       _cons |    92.2556   .7322511   125.99   0.000     90.82042    93.69079
-------------+----------------------------------------------------------------
var(e.test1) |   55.86083   10.85681                      38.16563    81.76028
var(e.test2) |   61.88092   11.50377                        42.985    89.08338
var(e.test3) |   89.07839   8.962574                      73.13566    108.4965
var(e.test4) |   93.26508   9.504276                      76.37945    113.8837
      var(X) |   96.34453   16.28034                      69.18161    134.1725
------------------------------------------------------------------------------
Notes:
1. This model was fit using 230 of the 500 observations in the dataset. Unless you use sem's
   method(mlmv), observations are casewise omitted, meaning that if there is a single missing
   value among the variables being used, the observation is ignored.
2. The coefficients for test3 and test4 are 0.51 and 0.56. Because we at StataCorp manufactured
   these data, we can tell you that the true coefficients are 1.
3. The error variances for e.test1 and e.test2 are understated. These data were manufactured
   with an error variance of 100.
4. These data are missing at random (MAR), not missing completely at random (MCAR). In MAR
   data, which values are missing can be a function of the observed values in the data. MAR data
   can produce biased estimates if the missingness is ignored, as we just did. MCAR data do not
   bias estimates.
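To make the MAR idea concrete, here is a hypothetical sketch of how such missingness could be
created; it is not the mechanism used to manufacture these data. Whether test2 is observed
depends only on the observed test1:

. set seed 1
. replace test2 = . if test1 < 90 & runiform() < .5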
. sem (test1 test2 test3 test4 <- X), method(mlmv)
Endogenous variables
Measurement:  test1 test2 test3 test4
Exogenous variables
Latent:       X
(output omitted)
Structural equation model                       Number of obs      =       500
Estimation method  = mlmv
Log likelihood     = -6592.9961
 ( 1)  [test1]X = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  test1 <-   |
           X |          1  (constrained)
       _cons |   98.94386   .6814418   145.20   0.000     97.60826    100.2795
  test2 <-   |
           X |   1.069952   .1079173     9.91   0.000     .8584378    1.281466
       _cons |   99.84218   .6911295   144.46   0.000     98.48759    101.1968
  test3 <-   |
           X |   .9489025   .0896098    10.59   0.000     .7732706    1.124534
       _cons |   101.0655   .6256275   161.54   0.000     99.83928    102.2917
  test4 <-   |
           X |   1.021626   .0958982    10.65   0.000     .8336687    1.209583
       _cons |   99.64509   .6730054   148.06   0.000     98.32603    100.9642
-------------+----------------------------------------------------------------
 var(e.test1)|   101.1135    10.1898                      82.99057    123.1941
 var(e.test2)|   95.45572   10.79485                      76.47892    119.1413
 var(e.test3)|   95.14847   9.053014                       78.9611    114.6543
 var(e.test4)|   101.0943    10.0969                      83.12124    122.9536
      var(X) |   94.04629   13.96734                      70.29508    125.8225
------------------------------------------------------------------------------
Notes:
1. The model is now fit using all 500 observations in the dataset.
2. The coefficients for test3 and test4, previously 0.51 and 0.56, are now 0.95 and 1.02.
3. Error-variance estimates are now consistent with the true value of 100.
4. Standard errors of path coefficients are mostly smaller than those reported for the previous model.
5. method(mlmv) requires that the data be MCAR or MAR.
6. method(mlmv) requires that the data be multivariate normal.
Also see
[SEM] intro 4 Substantive concepts
[SEM] sem option method( ) Specifying method and calculation of VCE
Title
example 27g Single-factor measurement model (generalized response)
Description
Also see
Description
This is the first example in the g series. The g means that the example focuses exclusively on the
gsem command. If you are interested primarily in standard linear SEMs, you may want to skip the
remaining examples. If you are especially interested in generalized SEMs, we suggest you read the
remaining examples in order.
gsem provides two features not provided by sem: the ability to fit SEMs containing generalized
linear response variables and the ability to fit multilevel mixed SEMs. These two features can be used
separately or together.
Generalized response variables mean that the response variables can be specifications from the
generalized linear model (GLM). These include probit, logistic regression, ordered probit and logistic
regression, multinomial logistic regression, and more.
Multilevel mixed models refer to the simultaneous handling of group-level effects, which can be
nested or crossed. Thus you can include unobserved and observed effects for subjects; subjects within
group; group within subgroup; . . . ; or for subjects, group, subgroup, . . . .
Below we demonstrate a single-factor measurement model with pass/fail (binary outcome) responses
rather than continuous responses. This is an example of a generalized linear response variable. We
use the following data:
. use http://www.stata-press.com/data/r13/gsem_1fmm
(single-factor pass/fail measurement model)
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          x1 |       123    .4065041    .4931897          0          1
          x2 |       123    .4065041    .4931897          0          1
          x3 |       123    .4227642    .4960191          0          1
          x4 |       123    .3495935    .4787919          0          1
          s4 |       123    690.9837    77.50737        481        885
. notes
_dta:
  1.  Fictional data.
  2.  The variables x1, x2, x3, and x4 record 1=pass, 0=fail.
  3.  Pass/fail for x1, x2, x3: score > 100.
  4.  Pass/fail for x4: score > 725.
  5.  Variable s4 contains the actual score for test 4.
[Path diagram: latent variable X with paths to x1, x2, x3, and x4; each response is family
Bernoulli, link probit]
The measurement variables we have (x1, . . . , x4) are not continuous. They are pass/fail, coded as 1
(pass) and 0 (fail). To account for that, we use probit (also known as family Bernoulli, link probit).
The equations for this model are

    Pr(x1 = 1) = Φ(α1 + Xβ1)
    Pr(x2 = 1) = Φ(α2 + Xβ2)
    Pr(x3 = 1) = Φ(α3 + Xβ3)
    Pr(x4 = 1) = Φ(α4 + Xβ4)

where Φ(·) is the N(0,1) cumulative distribution.
One way to think about this is to imagine a test that is scored on a continuous scale. Let's imagine
the scores were s1, s2, s3, and s4 and distributed N(μi, σi²), as test scores often are. Let's further
imagine that for each test, a cutoff ci is chosen and the student passes the test if si > ci.
If we had the test scores, we would fit this as a linear model. We would posit

    si = γi + Xδi + εi

where εi ~ N(0, σi²). However, we do not have test scores in our data; we have only the pass/fail
results: xi = 1 if si > ci.
So let's consider the pass/fail problem. The probability that a student passes test i is determined
by the probability that the student scores above the cutoff:

    Pr(si > ci) = Pr(γi + Xδi + εi > ci)
                = Pr{εi > ci − (γi + Xδi)}
                = Pr{−εi ≤ −ci + (γi + Xδi)}
                = Pr{εi ≤ (γi − ci) + Xδi}          (by symmetry of the normal)
                = Pr{εi/σi ≤ (γi − ci)/σi + Xδi/σi}
                = Φ{(γi − ci)/σi + Xδi/σi}

The last equation is the probit model. In fact, we just derived the probit model, and now we know
the relationship between the parameters we will be able to estimate with our pass/fail data, αi and
βi, and the parameters we could have estimated if we had the continuous test scores, γi and δi.
The relationship is

    αi = (γi − ci)/σi
    βi = δi/σi
Notice that the right-hand sides of both equations are divided by σi, the standard deviation of the
error term from the linear model for the ith test score. In pass/fail data, we lose the original scale of
the score: the slope coefficient we can estimate is the slope coefficient from the linear
model divided by the standard deviation of the error term. Meanwhile, the intercept we can
estimate is the difference between the continuous model's intercept and the cutoff for passing the
test, measured on that same error-standard-deviation scale.
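A quick simulation can make the relationship concrete. This is a hypothetical sketch (all names
and parameter values are made up for illustration): we generate a continuous score with γ = 50,
δ = 10, and σ = 20, threshold it at c = 55, and fit a probit; the estimated slope should be near
δ/σ = 0.5 and the intercept near (γ − c)/σ = −0.25.

. clear
. set seed 12345
. set obs 10000
. generate X = rnormal()
. generate s = 50 + 10*X + rnormal(0, 20)   // s = gamma + X*delta + epsilon
. generate x = s > 55                       // pass/fail at cutoff c = 55
. probit x X                                // slope ~ 0.5, intercept ~ -0.25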
The command to fit the model and the results are
. gsem (x1 x2 x3 x4 <- X), probit
Fitting fixed-effects model:
Iteration 0:   log likelihood = -273.75437
Iteration 1:   log likelihood = -264.3035
Iteration 2:   log likelihood = -263.37815
Fitting full model:
Iteration 0:   log likelihood = -262.305
Iteration 1:   log likelihood = -261.69025
Iteration 2:   log likelihood = -261.42132
Iteration 3:   log likelihood = -261.35508
Iteration 4:   log likelihood = -261.3224
Iteration 5:   log likelihood = -261.3133
Iteration 6:   log likelihood = -261.30783
Iteration 7:   log likelihood = -261.30535
Iteration 8:   log likelihood = -261.30405
Iteration 9:   log likelihood = -261.30337
Iteration 10:  log likelihood = -261.30302
Iteration 11:  log likelihood = -261.30283
Iteration 12:  log likelihood = -261.30272
Iteration 13:  log likelihood = -261.30267
Iteration 14:  log likelihood = -261.30264
Iteration 15:  log likelihood = -261.30263
Generalized structural equation model           Number of obs      =       123
Log likelihood = -261.30263
 ( 1)  [x1]X = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 <-        |
           X |          1  (constrained)
       _cons |  -.3666763   .1896773    -1.93   0.053     -.738437    .0050844
x2 <-        |
           X |    1.33293   .4686743     2.84   0.004     .4143455    2.251515
       _cons |  -.4470271   .2372344    -1.88   0.060     -.911998    .0179438
x3 <-        |
           X |   .6040478   .1908343     3.17   0.002     .2300195    .9780761
       _cons |  -.2276709   .1439342    -1.58   0.114    -.5097767    .0544349
x4 <-        |
           X |   9.453342   5.151819     1.83   0.067    -.6440375    19.55072
       _cons |  -4.801027   2.518038    -1.91   0.057    -9.736291    .1342372
-------------+----------------------------------------------------------------
      var(X) |   2.173451   1.044885                       .847101    5.576536
------------------------------------------------------------------------------
Notes:
1. In the path diagrams, x1, . . . , x4 are shown as being family Bernoulli, link probit. On the
   command line, we just typed probit, although we could have typed family(bernoulli)
   link(probit). In the command language, probit is a synonym for family(bernoulli)
   link(probit).
2. Variable X is latent exogenous and thus needs a normalizing constraint. The variable is anchored
   to the first observed variable, x1, and thus the path coefficient is constrained to be 1. See
   Identification 2: Normalization constraints (anchoring) in [SEM] intro 4.
3. The path coefficients for X->x1, X->x2, and X->x3 are 1, 1.33, and 0.60. Meanwhile, the path
   coefficient for X->x4 is 9.45. This is not unexpected; we at StataCorp generated these fictional
   data, and we made the x4 effect large and less precisely estimable.
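As note 2 says, X needs one normalizing constraint, and anchoring to x1 is merely gsem's default
choice. An equivalent alternative is to free the x1 loading and constrain the latent variance
instead, using the var() option (the same device applied to the logit models in [SEM] example 29g);
a sketch:

. gsem (x1 x2 x3 x4 <- X), probit var(X@1)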
We refit the model, substituting the actual test score s4 for the pass/fail variable x4:
. gsem (x1 x2 x3 <- X, probit) (s4 <- X)
Fitting fixed-effects model:
Iteration 0:   log likelihood = -959.23492
Iteration 1:   log likelihood = -959.09499
Iteration 2:   log likelihood = -959.09499
Refining starting values:
Grid node 0:   log likelihood = -905.14944
Fitting full model:
Iteration 0:   log likelihood = -905.14944  (not concave)
Iteration 1:   log likelihood = -872.33773
Iteration 2:   log likelihood = -869.83144
Iteration 3:   log likelihood = -869.69578
Iteration 4:   log likelihood = -869.68928
Iteration 5:   log likelihood = -869.6892
Generalized structural equation model           Number of obs      =       123
Log likelihood = -869.6892
 ( 1)  [x1]X = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 <-        |
           X |          1  (constrained)
       _cons |  -.4171085   .1964736    -2.12   0.034    -.8021896   -.0320274
x2 <-        |
           X |   1.298311   .3280144     3.96   0.000     .6554142    1.941207
       _cons |  -.4926357   .2387179    -2.06   0.039    -.9605142   -.0247573
x3 <-        |
           X |    .682969   .1747328     3.91   0.000     .3404989    1.025439
       _cons |  -.2942021   .1575014    -1.87   0.062    -.6028992    .0144949
s4 <-        |
           X |   55.24829   12.19904     4.53   0.000      31.3386    79.15798
       _cons |   690.9837   6.960106    99.28   0.000     677.3422    704.6253
-------------+----------------------------------------------------------------
      var(X) |   1.854506   .7804393                       .812856    4.230998
   var(e.s4) |   297.8565     408.64                      20.24012    4383.299
------------------------------------------------------------------------------
Notes:
1. We obtain similar coefficients for x1, . . . , x3.
2. We removed x4 (a pass/fail variable) and substituted s4 (the actual test score). s4 turns out to
   be more significant than x4 was. This suggests that a poor cutoff was set for passing test 4.
3. The log-likelihood values for the two models we have fit are strikingly different: -261 in the
   previous model and -870 in the current model. The difference has no meaning. Log-likelihood
   values depend on the model specified. We changed the fourth equation from a probit
   specification to a continuous (linear-regression) specification, and just doing that changes the
   metric of the log-likelihood function. Comparisons of log-likelihood values are meaningful only
   when they share the same metric.
Also see
[SEM] example 1 Single-factor measurement model
[SEM] example 28g One-parameter logistic IRT (Rasch) model
[SEM] example 29g Two-parameter logistic IRT model
[SEM] example 30g Two-level measurement model (multilevel, generalized response)
[SEM] example 31g Two-factor measurement model (generalized response)
Title
example 28g One-parameter logistic IRT (Rasch) model
Description
References
Also see
Description
To demonstrate a one-parameter logistic IRT (Rasch) model, we use the following data:
. use http://www.stata-press.com/data/r13/gsem_cfa
(Fictional math abilities data)
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      school |       500        10.5    5.772056          1         20
          id |       500    50681.71    29081.41         71     100000
          q1 |       500        .506    .5004647          0          1
          q2 |       500        .394    .4891242          0          1
          q3 |       500        .534    .4993423          0          1
-------------+--------------------------------------------------------
          q4 |       500        .424    .4946852          0          1
          q5 |       500         .49    .5004006          0          1
          q6 |       500        .434    .4961212          0          1
          q7 |       500         .52    .5001002          0          1
          q8 |       500        .494    .5004647          0          1
-------------+--------------------------------------------------------
        att1 |       500       2.946    1.607561          1          5
        att2 |       500       2.948    1.561465          1          5
        att3 |       500        2.84    1.640666          1          5
        att4 |       500        2.91    1.566783          1          5
        att5 |       500       3.086    1.581013          1          5
-------------+--------------------------------------------------------
       test1 |       500      75.548    5.948653         55         93
       test2 |       500      80.556    4.976786         65         94
       test3 |       500      75.572    6.677874         50         94
       test4 |       500      74.078    8.845587         43         96

. notes
_dta:
  1.  Fictional data on math ability and attitudes of 500 students from 20
      schools.
  2.  Variables q1-q8 are incorrect/correct (0/1) on individual math questions.
  3.  Variables att1-att5 are items from a Likert scale measuring each
      student's attitude toward math.
  4.  Variables test1-test4 are test scores from tests of four different
      aspects of mathematical abilities. Range of scores: 0-100.
These data record results from a fictional instrument measuring mathematical ability. Variables q1
through q8 are the items from the instrument.
For discussions of Rasch models, IRT models, and their extensions, see Embretson and Reise (2000),
van der Linden and Hambleton (1997), Skrondal and Rabe-Hesketh (2004), Andrich (1988), Bond
and Fox (2007), and Fischer and Molenaar (1995). Although not demonstrated in this example, many
of the extensions discussed in these books can be fit with gsem as well.
See Item-response theory (IRT) models in [SEM] intro 5 for background.
[Path diagram: latent variable MathAb with paths to q1, . . . , q8, each path constrained to 1;
each response is family Bernoulli, link logit]
In the 1-PL model, we constrain all the coefficients, that is, the factor loadings, to be equal to 1.
The negative of the intercept for each question will then represent the difficulty of the question:
. gsem (MathAb -> (q1-q8)@1), logit
(output omitted)
Generalized structural equation model           Number of obs      =       500
Log likelihood = -2650.9116
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
      MathAb |          1  (constrained)
       _cons |   .0293252   .1047674     0.28   0.780    -.1760152    .2346656
q2 <-        |
      MathAb |          1  (constrained)
       _cons |  -.5025012   .1068768    -4.70   0.000    -.7119759   -.2930264
q3 <-        |
      MathAb |          1  (constrained)
       _cons |   .1607425    .104967     1.53   0.126     -.044989    .3664739
q4 <-        |
      MathAb |          1  (constrained)
       _cons |  -.3574951    .105835    -3.38   0.001     -.564928   -.1500623
q5 <-        |
      MathAb |          1  (constrained)
       _cons |  -.0456599   .1047812    -0.44   0.663    -.2510274    .1597075
q6 <-        |
      MathAb |          1  (constrained)
       _cons |  -.3097521   .1055691    -2.93   0.003    -.5166637   -.1028404
q7 <-        |
      MathAb |          1  (constrained)
       _cons |     .09497   .1048315     0.91   0.365    -.1104959     .300436
q8 <-        |
      MathAb |          1  (constrained)
       _cons |  -.0269104   .1047691    -0.26   0.797     -.232254    .1784332
-------------+----------------------------------------------------------------
 var(MathAb) |   .7929701   .1025406                      .6154407     1.02171
------------------------------------------------------------------------------
Notes:
1. We had to use gsem and not sem to fit this model because the response variables were 0/1 and
   not continuous and because we wanted to use logit and not a continuous model.
2. To place the constraints that all coefficients are equal to 1, in the diagram we placed 1s along
   the paths from the underlying latent factor MathAb to each of the questions. In the command
   language, we added @1 to our command:
        gsem (MathAb -> (q1-q8)@1), logit
   Had we omitted the @1, we would have obtained coefficients describing how well each question
   measures math ability.
   There are several ways we could have asked that the model above be fit. They include the
   following:
        gsem (MathAb -> q1@1 q2@1 q3@1 q4@1 q5@1 q6@1 q7@1 q8@1), logit
        gsem (MathAb -> (q1 q2 q3 q4 q5 q6 q7 q8)@1), logit
        gsem (MathAb -> (q1-q8)@1), logit
   Similarly, for the shorthand logit, we could have typed family(bernoulli) link(logit).
3. The negative of the reported intercept represents the difficulty of the item. The most difficult is
   q2, and the least difficult is q3.
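Given note 3, an item's difficulty can be pulled directly from the stored coefficients. A small
sketch for q2 (the _b[] names follow the pattern that gsem, coeflegend displays; see
[SEM] example 29g):

. display -_b[q2:_cons]            // difficulty of q2
. nlcom (diff_q2: -_b[q2:_cons])   // same, with a standard error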
The second parameterization constrains the loadings to be equal (but not necessarily 1) and
instead constrains the variance of the latent variable to 1:
. gsem (MathAb -> (q1-q8)@b), logit var(MathAb@1)
(output omitted)
Generalized structural equation model           Number of obs      =       500
Log likelihood = -2650.9116
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |   .0293253   .1047674     0.28   0.780    -.1760151    .2346657
q2 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |  -.5025011   .1068768    -4.70   0.000    -.7119758   -.2930264
q3 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |   .1607425    .104967     1.53   0.126     -.044989     .366474
q4 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |  -.3574951    .105835    -3.38   0.001    -.5649279   -.1500622
q5 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |  -.0456599   .1047812    -0.44   0.663    -.2510273    .1597076
q6 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |   -.309752   .1055691    -2.93   0.003    -.5166637   -.1028403
q7 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |   .0949701   .1048315     0.91   0.365    -.1104959     .300436
q8 <-        |
      MathAb |   .8904887   .0575755    15.47   0.000     .7776429    1.003335
       _cons |  -.0269103   .1047691    -0.26   0.797     -.232254    .1784333
-------------+----------------------------------------------------------------
 var(MathAb) |          1  (constrained)
------------------------------------------------------------------------------
Notes:
1. The log-likelihood values of both models are -2650.9116. The models are equivalent.
2. Intercepts are unchanged.
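The graphs below are drawn from the predicted probabilities of a correct answer and the empirical
Bayes means of MathAb. A sketch of the predict step after gsem (mu and latent are the relevant
prediction statistics; the names pr1, . . . , pr8 and ability are chosen to match the graph that
follows):

. predict pr1-pr8, mu
. predict ability, latent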
With those predictions in hand, we can obtain the item-characteristic curves for all eight questions by typing
. twoway line pr1 pr2 pr3 pr4 pr5 pr6 pr7 pr8 ability, sort xlabel(-1.5(.5)1.5)

[Graph: item-characteristic curves for q1, . . . , q8. The y axis shows the predicted probability of
a correct answer (legend: Predicted mean (q1 correct), . . . , Predicted mean (q8 correct)); the
x axis shows the empirical Bayes means for MathAb, from -1.5 to 1.5.]
A less busy graph might show merely the most difficult and least difficult questions:
[Graph: item-characteristic curves for the most difficult (q2) and least difficult (q3) questions;
y axis: predicted probability of a correct answer; x axis: empirical Bayes means for MathAb.]
The slopes of the curves are identical because we constrained them to be identical. Thus we see
only the shift between difficulties, with the lower curves corresponding to the more difficult items.
b. Click on the path from MathAb to q1. In the Contextual Toolbar, type 1 in the box and
   press Enter.
c. Repeat this process to add the 1 constraint on the paths from MathAb to each of the other
   measurement variables.
6. Estimate.
   Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
   GSEM estimation options dialog box.
7. To fit the 1-PL IRT model with the variance constrained to 1, change the constraints in the
   diagram created above.
   a. From the SEM Builder menu, select Estimation > Clear Estimates to clear results from the
      previous model.
   b. Choose the Select tool.
   c. Click on the path from MathAb to q1. In the Contextual Toolbar, type b in the box and
      press Enter.
   d. Repeat this process to add the b constraint on the paths from MathAb to each of the other
      measurement variables.
   e. With the Select tool, click on the oval for MathAb. In the Contextual Toolbar, type 1 in
      the box and press Enter.
8. Estimate again.
   Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
   GSEM estimation options dialog box.
You can open a completed diagram in the Builder for the first model by typing
. webgetsem gsem_irt1
You can open a completed diagram in the Builder for the second model by typing
. webgetsem gsem_irt2
References
Andrich, D. 1988. Rasch Models for Measurement. Newbury Park, CA: Sage.
Bond, T. G., and C. M. Fox. 2007. Applying the Rasch Model: Fundamental Measurement in the Human Sciences.
2nd ed. Mahwah, NJ: Lawrence Erlbaum.
Embretson, S. E., and S. P. Reise. 2000. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum.
Fischer, G. H., and I. W. Molenaar, ed. 1995. Rasch Models: Foundations, Recent Developments, and Applications.
New York: Springer.
Rasch, G. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute of
Educational Research.
Rasch, G. 1980. Probabilistic Models for Some Intelligence and Attainment Tests (Expanded ed.). Chicago: University of
Chicago Press.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
305
van der Linden, W. J., and R. K. Hambleton, ed. 1997. Handbook of Modern Item Response Theory. New York:
Springer.
Also see
[SEM] example 27g Single-factor measurement model (generalized response)
[SEM] example 29g Two-parameter logistic IRT model
Title
example 29g Two-parameter logistic IRT model
Description
References
Also see
Description
We demonstrate a two-parameter logistic (2-PL) IRT model with the same data used in [SEM] example 28g:
. use http://www.stata-press.com/data/r13/gsem_cfa
(Fictional math abilities data)
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      school |       500        10.5    5.772056          1         20
          id |       500    50681.71    29081.41         71     100000
          q1 |       500        .506    .5004647          0          1
          q2 |       500        .394    .4891242          0          1
          q3 |       500        .534    .4993423          0          1
-------------+--------------------------------------------------------
          q4 |       500        .424    .4946852          0          1
          q5 |       500         .49    .5004006          0          1
          q6 |       500        .434    .4961212          0          1
          q7 |       500         .52    .5001002          0          1
          q8 |       500        .494    .5004647          0          1
-------------+--------------------------------------------------------
        att1 |       500       2.946    1.607561          1          5
        att2 |       500       2.948    1.561465          1          5
        att3 |       500        2.84    1.640666          1          5
        att4 |       500        2.91    1.566783          1          5
        att5 |       500       3.086    1.581013          1          5
-------------+--------------------------------------------------------
       test1 |       500      75.548    5.948653         55         93
       test2 |       500      80.556    4.976786         65         94
       test3 |       500      75.572    6.677874         50         94
       test4 |       500      74.078    8.845587         43         96
. notes
_dta:
  1.  Fictional data on math ability and attitudes of 500 students from 20
      schools.
  2.  Variables q1-q8 are incorrect/correct (0/1) on individual math questions.
  3.  Variables att1-att5 are items from a Likert scale measuring each
      student's attitude toward math.
  4.  Variables test1-test4 are test scores from tests of four different
      aspects of mathematical abilities. Range of scores: 0-100.
These data record results from a fictional instrument measuring mathematical ability. Variables q1
through q8 are the items from the instrument.
For discussions of IRT models and their extensions, see Embretson and Reise (2000), van der
Linden and Hambleton (1997), Skrondal and Rabe-Hesketh (2004), and Rabe-Hesketh, Skrondal, and
Pickles (2004).
See Item-response theory (IRT) models in [SEM] intro 5 for background.
[Path diagram: latent variable MathAb, variance labeled 1, with paths to q1, . . . , q8; each
response is family Bernoulli, link logit]
. gsem (MathAb -> q1-q8), logit var(MathAb@1)
Fitting fixed-effects model:
Iteration 0:   log likelihood = -2750.3114
Iteration 1:   log likelihood = -2749.3709
Iteration 2:   log likelihood = -2749.3708
Refining starting values:
Grid node 0:   log likelihood = -2645.8536
Fitting full model:
Iteration 0:   log likelihood = -2645.8536
Iteration 1:   log likelihood = -2637.4315
Iteration 2:   log likelihood = -2637.3761
Iteration 3:   log likelihood = -2637.3759
Generalized structural equation model           Number of obs      =       500
Log likelihood = -2637.3759
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
      MathAb |   1.466636   .2488104     5.89   0.000     .9789765    1.954296
       _cons |   .0373363   .1252274     0.30   0.766     -.208105    .2827776
q2 <-        |
      MathAb |   .5597118   .1377584     4.06   0.000     .2897102    .8297134
       _cons |  -.4613391   .0989722    -4.66   0.000    -.6553211   -.2673571
q3 <-        |
      MathAb |     .73241   .1486818     4.93   0.000      .440999    1.023821
       _cons |   .1533363   .1006072     1.52   0.127    -.0438503    .3505228
q4 <-        |
      MathAb |   .4839501   .1310028     3.69   0.000     .2271893    .7407109
       _cons |  -.3230667   .0957984    -3.37   0.001    -.5108281   -.1353054
q5 <-        |
      MathAb |   1.232244   .2075044     5.94   0.000     .8255426    1.638945
       _cons |  -.0494684   .1163093    -0.43   0.671    -.2774304    .1784937
q6 <-        |
      MathAb |    .946535   .1707729     5.54   0.000     .6118262    1.281244
       _cons |  -.3147231   .1083049    -2.91   0.004    -.5269969   -.1024493
q7 <-        |
      MathAb |   1.197317   .2029485     5.90   0.000     .7995449    1.595088
       _cons |   .1053405   .1152979     0.91   0.361    -.1206393    .3313203
q8 <-        |
      MathAb |   .8461858   .1588325     5.33   0.000     .5348799    1.157492
       _cons |   -.026705   .1034396    -0.26   0.796    -.2294429    .1760329
-------------+----------------------------------------------------------------
 var(MathAb) |          1  (constrained)
------------------------------------------------------------------------------
Notes:
1. In the above model, we constrain the variance of MathAb to be 1 by typing var(MathAb@1).
2. Had we not constrained var(MathAb@1), the path coefficient from MathAb to q1 would
   have been automatically constrained to be 1 to set the latent variable's scale. When we applied
   var(MathAb@1), the automatic constraint was automatically released. Setting the variance of a
   latent variable is another way of setting its scale.
3. We set var(MathAb@1) to ease interpretation. Our latent variable, MathAb, is now N(0,1).
4. Factor loadings, which are the slopes, are estimated above for each question.
5. The slopes reveal how discriminating each question is in regard to mathematical ability. Question 1
   is the most discriminating, and question 4 is the least discriminating.
6. In the 1-PL model, the negative of the intercept is a measure of difficulty because we constrain
   the slopes to be equal to each other. To measure difficulty in the 2-PL model, we divide the negative
   of the intercept by the unconstrained slope. If you do the math, you will discover that question 2
   is the most difficult and question 3 is the least difficult. It will be easier, however, merely to
   continue reading; in the next section, we show an easy way to calculate the discrimination and
   difficulty for all the questions.
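The listing that follows was produced by Stata code along the lines the notes below describe.
A minimal sketch consistent with those notes (details such as the exact format chosen may differ):

. preserve
. drop _all
. set obs 8
. generate question = "q" + string(_n)
. generate diff = .
. generate disc = .
. forvalues i = 1/8 {
      replace diff = -_b[q`i':_cons]/_b[q`i':MathAb] in `i'
      replace disc = _b[q`i':MathAb] in `i'
  }
. format diff disc %9.4f
. egen rank_diff = rank(diff)
. egen rank_disc = rank(disc)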
. list

     +---------------------------------------------------+
     | question      diff     disc   rank_d~f   rank_d~c |
     |---------------------------------------------------|
  1. |       q1   -0.0255   1.4666          3          8 |
  2. |       q2    0.8242   0.5597          8          2 |
  3. |       q3   -0.2094   0.7324          1          3 |
  4. |       q4    0.6676   0.4840          7          1 |
  5. |       q5    0.0401   1.2322          5          7 |
     |---------------------------------------------------|
  6. |       q6    0.3325   0.9465          6          5 |
  7. |       q7   -0.0880   1.1973          2          6 |
  8. |       q8    0.0316   0.8462          4          4 |
     +---------------------------------------------------+
. restore
Notes:
1. Our goal in the Stata code above is to create a dataset containing one observation for each
   question. The dataset will contain the following variables: question, containing q1, q2, . . . ;
   diff and disc, containing each question's difficulty and discrimination values; and rank_diff
   and rank_disc, containing the ranks of those difficulty and discrimination values.
2. We first preserved the current data before tossing out the data in memory. Later, after making
   and displaying our table, we restored the original contents.
3. We then made an 8-observation, 0-variable dataset (set obs 8) and added variables to it. We
   created string variable question containing q1, q2, . . . .
4. We were then ready to create variables diff and disc. They are defined in terms of estimated
   coefficients, and we had no idea what the names of those coefficients were. To find out, we
   typed gsem, coeflegend (output shown below). We quickly learned that the slope coefficients
   had names like _b[q1:MathAb], _b[q2:MathAb], . . . , and the intercepts had names like
   _b[q1:_cons], _b[q2:_cons], . . . .
5. We created new variables diff and disc containing missing values and then created a forvalues
   loop to fill in the new variables. Notice the odd-looking `i' inside the loop. `i' is the way
   that you say "substitute the value of (local macro) i here".
6. We put a display format on new variables diff and disc so that when we listed them, they
   would be easier to read.
7. We created the rank of each variable by using the egen command.
8. We listed the results. So now you do not have to do the math to see that question 2 is the
   most difficult (it has rank_diff = 8) and question 3 is the least difficult (it has rank_diff = 1).
9. We typed restore, bringing our original data back into memory and leaving ourselves in a
   position to continue with this example.
. gsem, coeflegend
Generalized structural equation model           Number of obs      =       500

             |      Coef.   Legend
-------------+--------------------------------------------------
q1 <-        |
      MathAb |   1.466636   _b[q1:MathAb]
       _cons |   .0373363   _b[q1:_cons]
q2 <-        |
      MathAb |   .5597118   _b[q2:MathAb]
       _cons |  -.4613391   _b[q2:_cons]
q3 <-        |
      MathAb |     .73241   _b[q3:MathAb]
       _cons |   .1533363   _b[q3:_cons]
q4 <-        |
      MathAb |   .4839501   _b[q4:MathAb]
       _cons |  -.3230667   _b[q4:_cons]
q5 <-        |
      MathAb |   1.232244   _b[q5:MathAb]
       _cons |  -.0494684   _b[q5:_cons]
q6 <-        |
      MathAb |    .946535   _b[q6:MathAb]
       _cons |  -.3147231   _b[q6:_cons]
q7 <-        |
      MathAb |   1.197317   _b[q7:MathAb]
       _cons |   .1053405   _b[q7:_cons]
q8 <-        |
      MathAb |   .8461858   _b[q8:MathAb]
       _cons |   -.026705   _b[q8:_cons]
-------------+--------------------------------------------------
 var(MathAb) |          1   _b[var(MathAb):_cons]
-----------------------------------------------------------------
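As in [SEM] example 28g, we first use predict to create the predicted probabilities and the
empirical Bayes means of MathAb; a sketch (mu and latent are the relevant prediction statistics
after gsem; variable names are chosen to match the graphs):

. predict pr1-pr8, mu
. predict ability, latent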
We then graph the curves just as we did previously, too. Here are all eight curves on one graph:
[Graph: item-characteristic curves for q1, . . . , q8; y axis: predicted probability of a correct
answer; x axis: empirical Bayes means for MathAb, from -1.5 to 1.5.]
In [SEM] example 28g, we showed a graph for the most and least difficult questions. This time
we show a graph for the most and least discriminating questions:
[Graph: item-characteristic curves for the most discriminating (q1) and least discriminating (q4)
questions; x axis: empirical Bayes means for MathAb.]
Here the curves are not parallel because the discrimination has not been constrained to be equal across
the questions. Question 1 has a steeper slope, so it is more discriminating.
b. Click on the oval for MathAb. In the Contextual Toolbar, type 1 in the box and press
   Enter.
6. Estimate.
   Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
   GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_irt3
References
Embretson, S. E., and S. P. Reise. 2000. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum.
Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2004. Generalized multilevel structural equation modeling. Psychometrika
69: 167-190.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
van der Linden, W. J., and R. K. Hambleton, ed. 1997. Handbook of Modern Item Response Theory. New York:
Springer.
Also see
[SEM] example 27g Single-factor measurement model (generalized response)
[SEM] example 28g One-parameter logistic IRT (Rasch) model
Title
example 30g Two-level measurement model (multilevel, generalized response)
Description
References
Also see
Description
We demonstrate a multilevel measurement model with the same data used in [SEM] example 29g:
. use http://www.stata-press.com/data/r13/gsem_cfa
(Fictional math abilities data)
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      school |       500        10.5    5.772056          1         20
          id |       500    50681.71    29081.41         71     100000
          q1 |       500        .506    .5004647          0          1
          q2 |       500        .394    .4891242          0          1
          q3 |       500        .534    .4993423          0          1
-------------+--------------------------------------------------------
          q4 |       500        .424    .4946852          0          1
          q5 |       500         .49    .5004006          0          1
          q6 |       500        .434    .4961212          0          1
          q7 |       500         .52    .5001002          0          1
          q8 |       500        .494    .5004647          0          1
-------------+--------------------------------------------------------
        att1 |       500       2.946    1.607561          1          5
        att2 |       500       2.948    1.561465          1          5
        att3 |       500        2.84    1.640666          1          5
        att4 |       500        2.91    1.566783          1          5
        att5 |       500       3.086    1.581013          1          5
-------------+--------------------------------------------------------
       test1 |       500      75.548    5.948653         55         93
       test2 |       500      80.556    4.976786         65         94
       test3 |       500      75.572    6.677874         50         94
       test4 |       500      74.078    8.845587         43         96

. notes
_dta:
  1.  Fictional data on math ability and attitudes of 500 students from 20
      schools.
  2.  Variables q1-q8 are incorrect/correct (0/1) on individual math questions.
  3.  Variables att1-att5 are items from a Likert scale measuring each
      student's attitude toward math.
  4.  Variables test1-test4 are test scores from tests of four different
      aspects of mathematical abilities. Range of scores: 0-100.
These data record results from a fictional instrument measuring mathematical ability. Variables q1
through q8 are the items from the instrument.
For discussions of multilevel measurement models, including extensions beyond the example we
present here, see Mehta and Neale (2005) and Skrondal and Rabe-Hesketh (2004).
See Single-factor measurement models and Multilevel mixed-effects models in [SEM] intro 5 for
background.
[Path diagram: latent MathAb and double-ringed school-level latent school1 (M1[school]), each
with paths to q1, . . . , q8; each response is family Bernoulli, link logit]
The double-ringed school1 is new. That new component of the path diagram is saying, "I am a latent
variable at the school level (meaning I am constant within school and vary across schools), and
I correspond to a latent variable named M1"; see Specifying generalized SEMs: Multilevel mixed
effects (2 levels) in [SEM] intro 2. This new variable will account for the effect, if any, of the identity
of the school.
To fit this model without this new, school-level component in it, we would type
. gsem (MathAb -> q1-q8), logit
To include the new school-level component, we add M1[school] to the exogenous variables:
. gsem (MathAb M1[school] -> q1-q8), logit
Fitting fixed-effects model:
Iteration 0:   log likelihood = -2750.3114
Iteration 1:   log likelihood = -2749.3709
Iteration 2:   log likelihood = -2749.3708
Refining starting values:
Grid node 0:   log likelihood = -2649.0033
Fitting full model:
Iteration 0:   log likelihood = -2649.0033  (not concave)
Iteration 1:   log likelihood = -2645.0613  (not concave)
Iteration 2:   log likelihood = -2641.9755  (not concave)
Iteration 3:   log likelihood = -2634.3857
Iteration 4:   log likelihood = -2631.1111
Iteration 5:   log likelihood = -2630.7898
Iteration 6:   log likelihood = -2630.2477
Iteration 7:   log likelihood = -2630.2402
Iteration 8:   log likelihood = -2630.2074
Iteration 9:   log likelihood = -2630.2063
Iteration 10:  log likelihood = -2630.2063
Generalized structural equation model           Number of obs      =       500
Log likelihood = -2630.2063
 ( 1)  [q1]M1[school] = 1
 ( 2)  [q2]MathAb = 1
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
  M1[school] |          1  (constrained)
      MathAb |   2.807515   .9468682     2.97   0.003     .9516878    4.663343
       _cons |   .0388021   .1608489     0.24   0.809     -.276456    .3540602
q2 <-        |
  M1[school] |   .6673925   .3058328     2.18   0.029     .0679712    1.266814
      MathAb |          1  (constrained)
       _cons |  -.4631159   .1201227    -3.86   0.000     -.698552   -.2276798
q3 <-        |
  M1[school] |   .3555867   .3043548     1.17   0.243    -.2409377    .9521111
      MathAb |   1.455529   .5187786     2.81   0.005     .4387416    2.472316
       _cons |   .1537831   .1070288     1.44   0.151    -.0559894    .3635556
q4 <-        |
  M1[school] |   .7073241   .3419273     2.07   0.039      .037159    1.377489
      MathAb |   .8420897   .3528195     2.39   0.017     .1505762    1.533603
       _cons |  -.3252735   .1202088    -2.71   0.007    -.5608784   -.0896686
q5 <-        |
  M1[school] |   .7295553   .3330652     2.19   0.028     .0767595    1.382351
      MathAb |   2.399529   .8110973     2.96   0.003     .8098079    3.989251
       _cons |  -.0488674   .1378015    -0.35   0.723    -.3189533    .2212185
q6 <-        |
  M1[school] |    .484903   .2844447     1.70   0.088    -.0725983    1.042404
      MathAb |   1.840627   .5934017     3.10   0.002     .6775813    3.003673
       _cons |  -.3139302   .1186624    -2.65   0.008    -.5465042   -.0813563
q7 <-        |
  M1[school] |   .3677241   .2735779     1.34   0.179    -.1684787     .903927
      MathAb |   2.444023   .8016872     3.05   0.002     .8727449    4.015301
       _cons |   .1062164   .1220796     0.87   0.384    -.1330552    .3454881
q8 <-        |
  M1[school] |   .5851299   .3449508     1.70   0.090    -.0909612    1.261221
      MathAb |   1.606287   .5367614     2.99   0.003     .5542541     2.65832
       _cons |  -.0261962   .1189835    -0.22   0.826    -.2593995    .2070071
-------------+----------------------------------------------------------------
         var(|
 M1[school]) |   .2121216   .1510032                       .052558    .8561121
 var(MathAb) |   .2461246   .1372513                      .0825055    .7342217
------------------------------------------------------------------------------
Notes:
1. The variance of M1[school] is estimated to be 0.21.
2. So how important is M1[school]? The variance of MathAb is estimated to be 0.25, so math
   ability and school have roughly the same variance, and both of course have mean 0. The math
   ability coefficients, meanwhile, are larger, often much larger, than the school coefficients in
   every case, so math ability is certainly more important than school in explaining whether questions
   were answered correctly. At this point, we are merely exploring the magnitude of the effects.
3. You could also include a separate school-level latent variable for each question. For instance,
   you could type
        . gsem (MathAb M1[school] N1[school] -> q1) ///
               (MathAb M1[school] N2[school] -> q2) ///
               (MathAb M1[school] N3[school] -> q3) ///
               (MathAb M1[school] N4[school] -> q4) ///
               (MathAb M1[school] N5[school] -> q5) ///
               (MathAb M1[school] N6[school] -> q6) ///
               (MathAb M1[school] N7[school] -> q7) ///
               (MathAb M1[school] N8[school] -> q8), logit
   You will sometimes see such effects included in multilevel measurement models in theoretical
   discussions of models. Be aware that estimation of models with many latent variables is
   problematic, requiring both time and luck.
[Path diagram: school1 (M1[school]) -> MathAb, with error e1 on MathAb; MathAb -> q1, . . . , q8,
each response family Bernoulli, link logit]

The above is a great way to draw the model. Sadly, gsem cannot understand it. The problem from
gsem's perspective is that one latent variable is affecting another and the two latent variables are at
different levels.
So we have to draw the model differently:
[Path diagram: MathAb -> q1, . . . , q8, with path constraints 1, c2, c3, . . . , c8; school1
(M1[school]) -> q1, . . . , q8, with the same constraints 1, c2, c3, . . . , c8; each response is
family Bernoulli, link logit]
The models may look different, but constraining the coefficients along the paths from math ability
and from school to each question to be equal is identical in effect to the model above.
Generalized structural equation model           Number of obs      =       500
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
  M1[school] |          1  (constrained)
      MathAb |          1  (constrained)
       _cons |   .0385522   .1556214     0.25   0.804    -.2664601    .3435646
q2 <-        |
  M1[school] |   .3876281   .1156823     3.35   0.001     .1608951    .6143612
      MathAb |   .3876281   .1156823     3.35   0.001     .1608951    .6143612
       _cons |  -.4633143   .1055062    -4.39   0.000    -.6701028   -.2565259
q3 <-        |
  M1[school] |   .4871164   .1295515     3.76   0.000     .2332001    .7410328
      MathAb |   .4871164   .1295515     3.76   0.000     .2332001    .7410328
       _cons |   .1533212   .1098068     1.40   0.163    -.0618962    .3685386
q4 <-        |
  M1[school] |   .3407151   .1058542     3.22   0.001     .1332446    .5481856
      MathAb |   .3407151   .1058542     3.22   0.001     .1332446    .5481856
       _cons |  -.3246936   .1011841    -3.21   0.001    -.5230108   -.1263763
q5 <-        |
  M1[school] |   .8327426   .1950955     4.27   0.000     .4503624    1.215123
      MathAb |   .8327426   .1950955     4.27   0.000     .4503624    1.215123
       _cons |  -.0490579   .1391324    -0.35   0.724    -.3217524    .2236365
q6 <-        |
  M1[school] |   .6267415   .1572247     3.99   0.000     .3185868    .9348962
      MathAb |   .6267415   .1572247     3.99   0.000     .3185868    .9348962
       _cons |  -.3135398   .1220389    -2.57   0.010    -.5527317     -.074348
q7 <-        |
  M1[school] |   .7660343    .187918     4.08   0.000     .3977219    1.134347
      MathAb |   .7660343    .187918     4.08   0.000     .3977219    1.134347
       _cons |   .1039102   .1330652     0.78   0.435    -.1568927    .3647131
q8 <-        |
  M1[school] |   .5600833   .1416542     3.95   0.000     .2824462    .8377203
      MathAb |   .5600833   .1416542     3.95   0.000     .2824462    .8377203
       _cons |  -.0264193   .1150408    -0.23   0.818    -.2518951    .1990565
-------------+----------------------------------------------------------------
         var(|
 M1[school]) |   .1719347   .1150138                      .0463406    .6379187
 var(MathAb) |   2.062489   .6900045                      1.070589    3.973385
------------------------------------------------------------------------------
Notes:
1. Note that for each question, the coefficient on MathAb is identical to the coefficient on
   M1[school].
2. We estimate separate variances for M1[school] and MathAb. They are 0.17 and 2.06. Now that
   the coefficients are the same on school and ability, we can directly compare these variances. We
   see that math ability has a much larger effect than does school.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
   and selecting school > Observations in the next control.
d. Specify M1 as the Base name.
e. Click on OK.
6. Create the factor-loading paths for the multilevel latent variable.
   a. Select the Add Path tool.
   b. Click in the top-left quadrant of the double oval for school1 (it will highlight when you
      hover over it), and drag a path to the bottom of the q1 rectangle (it will highlight when you
      can release to connect the path).
   c. Continuing with the Add Path tool, draw paths from school1 to each of the remaining
      rectangles, q2 through q8.
7. Clean up paths.
   If you do not like where a path has been connected to its variables, use the Select tool to
   click on the path, and then simply click on where it connects to a rectangle or oval and drag the
   endpoint.
8. Estimate.
   Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
   GSEM estimation options dialog box.
9. To fit the model in Fitting the variance-components model, add constraints to the diagram created
   above.
   a. From the SEM Builder menu, select Estimation > Clear Estimates to clear results from the
      previous model.
   b. Choose the Select tool.
   c. Click on the path from MathAb to q1. In the Contextual Toolbar, type 1 in the box and
      press Enter.
   d. Click on the path from school1 to q1. In the Contextual Toolbar, type 1 in the box and
      press Enter.
   e. Click on the path from MathAb to q2. In the Contextual Toolbar, type c2 in the box and
      press Enter.
   f. Click on the path from school1 to q2. In the Contextual Toolbar, type c2 in the box and
      press Enter.
   g. Repeat this process to add the c3 constraint on both paths to q3, the c4 constraint on both
      paths to q4, . . . , and the c8 constraint on both paths to q8.
You can open a completed diagram in the Builder for the second model by typing
. webgetsem gsem_mlcfa2
References
Mehta, P. D., and M. C. Neale. 2005. People are variables too: Multilevel structural equations modeling. Psychological
Methods 10: 259284.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
Also see
[SEM] example 27g Single-factor measurement model (generalized response)
[SEM] example 29g Two-parameter logistic IRT model
Title
example 31g Two-factor measurement model (generalized response)
Description
Also see
Description
We demonstrate a two-factor generalized linear measurement model with the same data used in
[SEM] example 29g:
. use http://www.stata-press.com/data/r13/gsem_cfa
(Fictional math abilities data)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_cfa.dta
  obs:           500                          Fictional math abilities data
 vars:            19                          21 Mar 2013 10:38
 size:        18,500                          (_dta has notes)
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
school          byte    %9.0g                 School id
id              long    %9.0g                 Student id
q1              byte    %9.0g      result     q1 correct
q2              byte    %9.0g      result     q2 correct
q3              byte    %9.0g      result     q3 correct
q4              byte    %9.0g      result     q4 correct
q5              byte    %9.0g      result     q5 correct
q6              byte    %9.0g      result     q6 correct
q7              byte    %9.0g      result     q7 correct
q8              byte    %9.0g      result     q8 correct
att1            float   %26.0g     agree      Skills taught in math class will
                                                help me get a better job.
att2            float   %26.0g     agree      Math is important in everyday
                                                life
att3            float   %26.0g     agree      Working math problems makes me
                                                anxious.
att4            float   %26.0g     agree      Math has always been my worst
                                                subject.
att5            float   %26.0g     agree      I am able to learn new math
                                                concepts easily.
test1           byte    %9.0g                 Score, math test 1
test2           byte    %9.0g                 Score, math test 2
test3           byte    %9.0g                 Score, math test 3
test4           byte    %9.0g                 Score, math test 4
-------------------------------------------------------------------------------
Sorted by:
. notes
_dta:
  1.  Fictional data on math ability and attitudes of 500 students from 20
      schools.
  2.  Variables q1-q8 are incorrect/correct (0/1) on individual math questions.
  3.  Variables att1-att5 are items from a Likert scale measuring each
      student's attitude toward math.
  4.  Variables test1-test4 are test scores from tests of four different
      aspects of mathematical abilities. Range of scores: 0-100.
These data record results from a fictional instrument measuring mathematical ability. Variables q1
through q8 are the items from the instrument.
In this example, we will also be using variables att1 through att5. These are five Likert-scale
questions measuring each student's attitude toward math.
See Multiple-factor measurement models in [SEM] intro 5 for background.
[Path diagram: latent MathAtt with paths to att1, . . . , att5 (family ordinal, link logit) and
latent MathAb with paths to q1, . . . , q8 (family Bernoulli, link logit); MathAtt and MathAb are
connected by a curved covariance path]
In this model, mathematical ability affects the correctness of the answers to the items just as
previously. The new component, attitude toward mathematics, is correlated with math ability. We
expect this correlation to be positive, but that is yet to be determined.
What is important about the attitudinal questions is that the responses are ordinal; that is, the
ordering of the possible answers is significant. In other cases, we might have a categorical variable
taking on, say, five values; even if those values are 1, 2, 3, 4, and 5, there may be no sense in which
answer 5 is greater than answer 4, answer 4 is greater than answer 3, and so on.
For our attitude measures, however, response 5 signifies strong agreement with a statement and
1 signifies strong disagreement. We handle the ordinal property by specifying that the attitudinal
responses are family ordinal, link logit, also known as ordered logit or ordinal logistic regression,
and also known in Stata circles as ologit.
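ologit, like logit and probit earlier, is merely shorthand for a family-link pair, so the attitudinal
equations could equivalently be written out in full; a sketch:

    gsem (MathAtt -> att1-att5, family(ordinal) link(logit))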
In the command language, to fit a one-factor measurement model with math ability, we would type
    gsem (MathAb -> q1-q8), logit
To include the second factor, attitude correlated with math ability, we would type
    gsem (MathAb  -> q1-q8,     logit)   ///
         (MathAtt -> att1-att5, ologit)
The covariance between MathAtt and MathAb does not even appear in the command! That is because
latent exogenous variables are assumed to be correlated in the command language unless you specify
otherwise; in path diagrams, such variables are correlated only if a curved path is drawn between
them.
There is another, minor difference in syntax between the one-factor and two-factor models that
is worth your attention. Notice that the logit was outside the parentheses in the command to fit
the one-factor model, but it is inside the parentheses in the command to fit the two-factor model.
Actually, logit could have appeared inside the parentheses to fit the one-factor model. When options
appear inside parentheses, they affect only what is specified inside the parentheses. When they appear
outside parentheses, they affect all parenthetical specifications.
To obtain the estimates of the two-factor model, we type
. gsem (MathAb -> q1-q8, logit)         ///
>      (MathAtt -> att1-att5, ologit)
Fitting fixed-effects model:
Iteration 0:   log likelihood = -6629.7253
Iteration 1:   log likelihood = -6628.7848
Iteration 2:   log likelihood = -6628.7848
Refining starting values:
Grid node 0:   log likelihood = -6457.4584
Fitting full model:
Iteration 0:   log likelihood = -6457.4584
Iteration 1:   log likelihood = -6437.9594
Iteration 2:   log likelihood = -6400.2731
Iteration 3:   log likelihood = -6396.3795
Iteration 4:   log likelihood = -6394.5787
Iteration 5:   log likelihood = -6394.4019
Iteration 6:   log likelihood = -6394.3923
Iteration 7:   log likelihood = -6394.3923
Generalized structural equation model           Number of obs      =       500
Log likelihood = -6394.3923
 ( 1)  [q1]MathAb = 1
 ( 2)  [att1]MathAtt = 1
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
      MathAb |          1  (constrained)
       _cons |   .0446118   .1272964     0.35   0.726    -.2048845    .2941082
q2 <-        |
      MathAb |   .3446081   .1050264     3.28   0.001     .1387601    .5504562
       _cons |  -.4572215   .0979965    -4.67   0.000    -.6492911    -.265152
q3 <-        |
      MathAb |   .5445245   .1386993     3.93   0.000      .272679    .8163701
       _cons |   .1591406   .1033116     1.54   0.123    -.0433464    .3616276
q4 <-        |
      MathAb |   .2858874   .0948553     3.01   0.003     .0999743    .4718004
       _cons |  -.3196648   .0947684    -3.37   0.001    -.5054075   -.1339222
q5 <-        |
      MathAb |   .8174803   .1867024     4.38   0.000     .4515504     1.18341
       _cons |    -.04543    .116575    -0.39   0.697    -.2739127    .1830527
q6 <-        |
      MathAb |   .6030448   .1471951     4.10   0.000     .3145478    .8915419
       _cons |   -.309992   .1070853    -2.89   0.004    -.5198753   -.1001086
q7 <-        |
      MathAb |     .72084   .1713095     4.21   0.000     .3850796    1.056601
       _cons |   .1047265   .1116494     0.94   0.348    -.1141023    .3235552
q8 <-        |
      MathAb |   .5814761   .1426727     4.08   0.000     .3018428    .8611094
       _cons |  -.0250442   .1045134    -0.24   0.811    -.2298868    .1797983
att1 <-      |
     MathAtt |          1  (constrained)
att2 <-      |
     MathAtt |   .3788714   .0971223     3.90   0.000     .1885152    .5692276
att3 <-      |
     MathAtt |  -1.592717   .3614859    -4.41   0.000    -2.301216   -.8842173
att4 <-      |
     MathAtt |  -.8100107    .153064    -5.29   0.000     -1.11001   -.5100108
att5 <-      |
     MathAtt |   .5225423   .1170141     4.47   0.000     .2931988    .7518858
-------------+----------------------------------------------------------------
att1         |
       /cut1 |   -1.10254   .1312272    -8.40   0.000    -1.359741   -.8453396
       /cut2 |  -.2495339   .1160385    -2.15   0.032    -.4769651   -.0221027
       /cut3 |   .2983261   .1164414     2.56   0.010     .0701052    .5265471
       /cut4 |   1.333053   .1391907     9.58   0.000     1.060244    1.605861
att2         |
       /cut1 |  -1.055791   .1062977    -9.93   0.000    -1.264131   -.8474513
       /cut2 |  -.1941211   .0941435    -2.06   0.039     -.378639   -.0096032
       /cut3 |   .3598488   .0952038     3.78   0.000     .1732528    .5464448
       /cut4 |   1.132624   .1082204    10.47   0.000     .9205156    1.344732
att3         |
       /cut1 |  -1.053519   .1733999    -6.08   0.000    -1.393377   -.7136614
       /cut2 |  -.0491073   .1442846    -0.34   0.734    -.3318999    .2336853
       /cut3 |   .5570671   .1538702     3.62   0.000     .2554871    .8586471
       /cut4 |   1.666859   .2135554     7.81   0.000     1.248298     2.08542
att4         |
       /cut1 |   -1.07378   .1214071    -8.84   0.000    -1.311734   -.8358264
       /cut2 |  -.2112462   .1076501    -1.96   0.050    -.4222366   -.0002559
       /cut3 |    .406347   .1094847     3.71   0.000      .191761     .620933
       /cut4 |   1.398185   .1313327    10.65   0.000     1.140778    1.655593
att5         |
       /cut1 |  -1.244051   .1148443   -10.83   0.000    -1.469142   -1.018961
       /cut2 |   -.336135   .0986678    -3.41   0.001    -.5295203   -.1427498
       /cut3 |   .2137776   .0978943     2.18   0.029     .0219084    .4056468
       /cut4 |   .9286849    .107172     8.67   0.000     .7186316    1.138738
-------------+----------------------------------------------------------------
 var(MathAb) |   2.300652   .7479513                      1.216527    4.350909
var(MathAtt) |   1.520854   .4077674                      .8992196    2.572228
-------------+----------------------------------------------------------------
cov(MathAtt, |
     MathAb) |   .8837681   .2204606     4.01   0.000     .4516733    1.315863
------------------------------------------------------------------------------
Notes:
1. The estimated covariance between math attitude and ability is 0.88.
2. We can calculate the correlation from the estimated covariance; the formula is

       ρxy = σxy/(σx σy)

   The estimated values are σxy = 0.8838, σx² = 2.301, and σy² = 1.521. Thus the estimated
   correlation between attitude and ability is 0.8838/sqrt(2.301 × 1.521) = 0.4724.
3. There is something new in the output, namely, the things labeled /cut1, . . . , /cut4. These appear
   for each of the five attitudinal measures. These are the ordered logit's cutpoints, the values on
   the logistic distribution that separate attitude 1 from attitude 2, attitude 2 from attitude 3, and so
   on. The four cutpoints map the continuous distribution into five ordered, categorical groups.
4. There is something interesting hiding in the MathAtt coefficients: the coefficients for two of
   the paths, att3 att4 <- MathAtt, are negative! If you look back at the description of the
   data, you will find that the sense of these two questions was reversed relative to the other
   questions. Strong agreement on these two questions was agreement with a negative feeling about
   mathematics.
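The correlation in note 2 can also be obtained with a delta-method standard error by using nlcom
after gsem. A sketch, assuming the covariance and variance coefficient names follow the
coeflegend pattern shown in [SEM] example 29g (confirm the exact names with gsem, coeflegend):

. nlcom _b[cov(MathAtt,MathAb):_cons] / sqrt(_b[var(MathAtt):_cons]*_b[var(MathAb):_cons])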
b. Click in the top-left quadrant of the MathAb oval, and drag a covariance to the bottom left
   of the MathAtt oval.
7. Clean up.
   If you do not like where a covariance has been connected to its variable, use the Select tool
   to click on the covariance, and then click on where it connects to an oval and drag
   the endpoint. You can also change the bow of the covariance by dragging the control point that
   extends from one end of the selected covariance.
8. Estimate.
   Click on the Estimate button in the Standard Toolbar, and then click on OK in the resulting
   GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_2fmm
Also see
[SEM] example 27g Single-factor measurement model (generalized response)
[SEM] example 29g Two-parameter logistic IRT model
[SEM] example 32g Full structural equation model (generalized response)
Title
example 32g Full structural equation model (generalized response)
Description
Also see
Description
To demonstrate a structural model with a measurement component, we use the same data used in
[SEM] example 31g:
. use http://www.stata-press.com/data/r13/gsem_cfa
(Fictional math abilities data)
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      school |       500        10.5    5.772056          1         20
          id |       500    50681.71    29081.41         71     100000
          q1 |       500        .506    .5004647          0          1
          q2 |       500        .394    .4891242          0          1
          q3 |       500        .534    .4993423          0          1
-------------+--------------------------------------------------------
          q4 |       500        .424    .4946852          0          1
          q5 |       500         .49    .5004006          0          1
          q6 |       500        .434    .4961212          0          1
          q7 |       500         .52    .5001002          0          1
          q8 |       500        .494    .5004647          0          1
-------------+--------------------------------------------------------
        att1 |       500       2.946    1.607561          1          5
        att2 |       500       2.948    1.561465          1          5
        att3 |       500        2.84    1.640666          1          5
        att4 |       500        2.91    1.566783          1          5
        att5 |       500       3.086    1.581013          1          5
-------------+--------------------------------------------------------
       test1 |       500      75.548    5.948653         55         93
       test2 |       500      80.556    4.976786         65         94
       test3 |       500      75.572    6.677874         50         94
       test4 |       500      74.078    8.845587         43         96
. notes
_dta:
  1.  Fictional data on math ability and attitudes of 500 students from 20
      schools.
  2.  Variables q1-q8 are incorrect/correct (0/1) on individual math questions.
  3.  Variables att1-att5 are items from a Likert scale measuring each
      student's attitude toward math.
  4.  Variables test1-test4 are test scores from tests of four different
      aspects of mathematical abilities. Range of scores: 0-100.
These data record results from a fictional instrument measuring mathematical ability. Variables q1
through q8 are the items from the instrument.
In this example, we will also be using variables att1 through att5. These are five Likert-scale
questions measuring each student's attitude toward math.
See Structural models 8: Unobserved inputs, outputs, or both in [SEM] intro 5 for background.
[Path diagram: MathAtt -> att1, . . . , att5 (family ordinal, link logit); MathAtt -> MathAb (a
straight path); MathAb, now endogenous with error e1, -> q1, . . . , q8 (family Bernoulli, link
logit)]
This is the same model we fit in [SEM] example 31g, except that rather than a correlation (curved
path) between MathAtt and MathAb, this time we assume a direct effect and so allow a straight path.
If you compare the two path diagrams, in addition to the substitution of the direct path for
the curved path signifying correlation, there is now an error variable on MathAb. In the previous
diagram, MathAb was exogenous. In this diagram, it is endogenous and thus requires an error term.
In the Builder, the error term is added automatically.
To fit this model in the command language, we type
. gsem (MathAb -> q1-q8, logit)         ///
>      (MathAtt -> att1-att5, ologit)   ///
>      (MathAtt -> MathAb)
Fitting fixed-effects model:
Iteration 0:   log likelihood = -6429.1636
Iteration 1:   log likelihood = -6396.7471
Iteration 2:   log likelihood = -6394.6197
Fitting full model:
Iteration 0:   log likelihood = -6394.3949
Iteration 1:   log likelihood = -6394.3923
Iteration 2:   log likelihood = -6394.3923
Generalized structural equation model           Number of obs      =       500
Log likelihood = -6394.3923
 ( 1)  [q1]MathAb = 1
 ( 2)  [att1]MathAtt = 1
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
q1 <-        |
      MathAb |          1  (constrained)
       _cons |    .044612   .1272967     0.35   0.726     -.204885     .294109
q2 <-        |
      MathAb |   .3446066   .1050261     3.28   0.001     .1387593     .550454
       _cons |  -.4572215   .0979965    -4.67   0.000    -.6492911   -.2651519
q3 <-        |
      MathAb |   .5445222   .1386992     3.93   0.000     .2726767    .8163677
       _cons |   .1591406   .1033116     1.54   0.123    -.0433465    .3616276
q4 <-        |
      MathAb |   .2858862   .0948549     3.01   0.003      .099974    .4717984
       _cons |  -.3196648   .0947684    -3.37   0.001    -.5054075   -.1339222
q5 <-        |
      MathAb |   .8174769   .1867022     4.38   0.000     .4515473    1.183406
       _cons |    -.04543    .116575    -0.39   0.697    -.2739129    .1830528
q6 <-        |
      MathAb |   .6030423   .1471949     4.10   0.000     .3145457    .8915389
       _cons |  -.3099919   .1070853    -2.89   0.004    -.5198754   -.1001085
q7 <-        |
      MathAb |   .7208369    .171309     4.21   0.000     .3850774    1.056597
       _cons |   .1047264   .1116494     0.94   0.348    -.1141024    .3235553
q8 <-        |
      MathAb |   .5814736   .1426725     4.08   0.000     .3018406    .8611067
       _cons |  -.0250443   .1045135    -0.24   0.811    -.2298869    .1797984
att1 <-      |
     MathAtt |          1  (constrained)
att2 <-      |
     MathAtt |   .3788715   .0971234     3.90   0.000     .1885131    .5692299
att3 <-      |
     MathAtt |  -1.592717   .3614956    -4.41   0.000    -2.301236   -.8841989
att4 <-      |
     MathAtt |  -.8100108   .1530675    -5.29   0.000    -1.110017    -.510004
att5 <-      |
     MathAtt |   .5225425   .1170166     4.47   0.000     .2931942    .7518907
MathAb <-    |
     MathAtt |    .581103     .14776     3.93   0.000     .2914987    .8707072
-------------+----------------------------------------------------------------
att1         |
       /cut1 |   -1.10254    .131228    -8.40   0.000    -1.359742   -.8453377
       /cut2 |  -.2495339   .1160385    -2.15   0.032    -.4769652   -.0221025
       /cut3 |   .2983261   .1164415     2.56   0.010      .070105    .5265472
       /cut4 |   1.333052   .1391919     9.58   0.000     1.060241    1.605864
att2         |
       /cut1 |  -1.055791   .1062977    -9.93   0.000    -1.264131   -.8474513
       /cut2 |  -.1941211   .0941435    -2.06   0.039     -.378639   -.0096032
       /cut3 |   .3598488   .0952038     3.78   0.000     .1732528    .5464448
       /cut4 |   1.132624   .1082204    10.47   0.000     .9205156    1.344732
att3         |
       /cut1 |  -1.053519   .1734001    -6.08   0.000    -1.393377   -.7136612
       /cut2 |  -.0491074   .1442846    -0.34   0.734       -.3319    .2336853
       /cut3 |   .5570672   .1538702     3.62   0.000     .2554871    .8586472
       /cut4 |   1.666859   .2135557     7.81   0.000     1.248297     2.08542
att4         |
       /cut1 |   -1.07378   .1214071    -8.84   0.000    -1.311734   -.8358264
       /cut2 |  -.2112462   .1076501    -1.96   0.050    -.4222365   -.0002559
       /cut3 |    .406347   .1094847     3.71   0.000      .191761     .620933
       /cut4 |   1.398185   .1313327    10.65   0.000     1.140778    1.655593
att5         |
       /cut1 |  -1.244051   .1148443   -10.83   0.000    -1.469142   -1.018961
       /cut2 |   -.336135   .0986678    -3.41   0.001    -.5295203   -.1427498
       /cut3 |   .2137776   .0978943     2.18   0.029     .0219084    .4056468
       /cut4 |   .9286849    .107172     8.67   0.000     .7186316    1.138738
-------------+----------------------------------------------------------------
var(e.MathAb)|   1.787117   .5974753                      .9280606    3.441357
var(MathAtt) |   1.520854   .4077885                      .8991947    2.572298
------------------------------------------------------------------------------
Notes:
1. In the model fit in [SEM] example 31g, we estimated a correlation between MathAtt and MathAb
   of 0.4724.
2. Theoretically speaking, the model fit above and the model in [SEM] example 31g are equivalent.
   Both posit a linear relationship between the latent variables and merely choose to parameterize
   the relationship differently. In [SEM] example 31g, it was parameterized as a covariance. In this
   example, it is parameterized as causal. Indeed, the covariance implied by this model, the path
   coefficient times the variance of MathAtt, is 0.581103 × 1.520854 = 0.8838, exactly the covariance
   estimated in [SEM] example 31g. People often use structural equation modeling to confirm
   a proposed hypothesis. It is important that the causal model you specify be based on theory or
   that you have some other justification. You need something other than empirical results to rule
   out competing but equivalent models such as the covariance model. Distinguishing causality from
   correlation is always problematic.
3. Practically speaking, note that the log-likelihood values for this model and the model in
   [SEM] example 31g are equal at -6394.3923. Also note that the estimated variances of math
   attitude, var(MathAtt), are also equal at 1.520854.
b. Click in the bottom of the MathAtt oval (it will highlight when you hover over it), and drag
   a path to the top of the MathAb oval (it will highlight when you can release to connect the
   path).
Also see
[SEM] example 9 Structural model with measurement component
[SEM] example 31g Two-factor measurement model (generalized response)
Title
example 33g Logistic regression
Description
Reference
Also see
Description
In this example, we demonstrate with gsem how to fit a standard logistic regression, which is
often referred to as the logit model in the generalized linear model (GLM) framework.
. use http://www.stata-press.com/data/r13/gsem_lbw
(Hosmer & Lemeshow data)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_lbw.dta
  obs:           189                          Hosmer & Lemeshow data
 vars:            11                          21 Mar 2013 12:28
 size:         2,646                          (_dta has notes)
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
id              int     %8.0g                 subject id
low             byte    %8.0g                 birth weight < 2500g
age             byte    %8.0g                 age of mother
lwt             int     %8.0g                 weight, last menstrual period
race            byte    %8.0g      race       race
smoke           byte    %9.0g      smoke      smoked during pregnancy
ptl             byte    %8.0g                 premature labor history (count)
ht              byte    %8.0g                 has history of hypertension
ui              byte    %8.0g                 presence, uterine irritability
ftv             byte    %8.0g                 # physician visits, 1st trimester
bwt             int     %8.0g                 birth weight (g)
-------------------------------------------------------------------------------
Sorted by:
. notes
_dta:
1. Data from Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013.
"Applied Logistic Regression". 3rd ed. Hoboken, NJ: Wiley.
2. Data from a study of risk factors associated with low birth weights.
[Path diagram: age, lwt, 1b.race, 2.race, 3.race, smoke, ptl, ht, and ui -> low; low is family
Bernoulli, link logit]
That is, we wish to fit a model in which low birthweight is determined by a history of hypertension
(ht), mothers age (age), mothers weight at last menstrual period (lwt), mothers race (white, black,
or other; race), whether the mother smoked during pregnancy (smoke), the number of premature
babies previously born to the mother (ptl), and whether the mother has suffered from the presence
of uterine irritability (ui).
The path diagram matches the variable names listed in parentheses above except for race, where
the path diagram contains not one but three boxes filled in with 1b.race, 2.race, and 3.race.
This is because in our dataset, race is coded 1, 2, or 3, meaning white, black, or other. We want to
include indicator variables for race so that we have a separate coefficient for each race. Thus we
need three boxes.
In Stata, 1.race means an indicator for race equaling 1. Thus it should not surprise you if you
filled in the boxes with 1.race, 2.race, and 3.race, and that is almost what we did. The difference
is that we filled in the first box with 1b.race rather than 1.race. We use the b to specify the base
category, which we specified as white. If we wanted the base category to be black, we would have
specified 2b.race and left 1.race alone.
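By the way, in command syntax the base category can also be chosen with the ib. operator of factor-variable notation; a minimal sketch making black (race = 2) the base:

. gsem (low <- age lwt ib2.race smoke ptl ht ui), logit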
The above is called factor-variable notation. See [SEM] intro 3 for details on using factor-variable
notation with gsem.
In the command language, we could type
. gsem (low <- age lwt 1b.race 2.race 3.race smoke ptl ht ui), logit
to fit the model. Written that way, there is a one-to-one correspondence to what we would type and
what we would draw in the Builder. The command language, however, has a feature that will allow
us to type i.race instead of 1b.race 2.race 3.race. To fit the model, we could type
. gsem (low <- age lwt i.race smoke ptl ht ui), logit
i.varname is a command-language shorthand for specifying indicators for all the levels of a
variable and using the first level as the base category. You can use i.varname in the command
language but not in path diagrams because boxes can contain only one variable. In the Builder,
however, you will discover a neat feature: you can type i.race, and the Builder will create however many boxes are needed for you, filled in, with the first category marked as the base. We will explain below how you do that.
The result of typing our estimation command is
. gsem (low <- age lwt i.race smoke ptl ht ui), logit
Iteration 0:   log likelihood =  -101.0213
Iteration 1:   log likelihood = -100.72519
Iteration 2:   log likelihood =   -100.724
Iteration 3:   log likelihood =   -100.724
Generalized structural equation model           Number of obs      =       189
Log likelihood = -100.724

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
low <-       |
         age |  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt |  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             |
        race |
      black  |   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      other  |   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             |
       smoke |   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl |   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht |   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui |   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons |   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
------------------------------------------------------------------------------
If you want to see the odds ratios, type estat eform after fitting the model:
. estat eform
------------------------------------------------------------------------------
         low |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9732636   .0354759    -0.74   0.457     .9061578    1.045339
         lwt |   .9849634   .0068217    -2.19   0.029     .9716834    .9984249
             |
        race |
      white  |          1    (empty)
      black  |   3.534767   1.860737     2.40   0.016     1.259736    9.918406
      other  |   2.368079   1.039949     1.96   0.050     1.001356    5.600207
             |
       smoke |   2.517698    1.00916     2.30   0.021     1.147676    5.523162
         ptl |   1.719161   .5952579     1.56   0.118     .8721455    3.388787
          ht |   6.249602   4.322408     2.65   0.008     1.611152    24.24199
          ui |     2.1351   .9808153     1.65   0.099     .8677528      5.2534
       _cons |   1.586014   1.910496     0.38   0.702     .1496092      16.8134
------------------------------------------------------------------------------
Whichever way you look at the results above, they are identical to the results that would be produced
by typing
. logit low age lwt i.race smoke ptl ht ui
or
. logistic low age lwt i.race smoke ptl ht ui
which are two other ways that Stata can fit logit models. logit, like gsem, reports coefficients by
default. logistic reports odds ratios by default.
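If you are working with logit directly, you can also request odds ratios through its or option; for example:

. logit low age lwt i.race smoke ptl ht ui, or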
4. Enlarge the size of the canvas to accommodate the height of the diagram.
Click on the Adjust Canvas Size button, change the second size to 5 (inches), and then click on OK.
Reference
Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken,
NJ: Wiley.
Also see
[SEM] example 34g Combined models (generalized responses)
[SEM] example 35g Ordered probit and ordered logit
[SEM] example 37g Multinomial logistic regression
Title
example 34g Combined models (generalized responses)
Description
Reference
Also see
Description
We demonstrate how to fit a combined model with one Poisson regression and one logit regression
by using the following data:
. use http://www.stata-press.com/data/r13/gsem_lbw
(Hosmer & Lemeshow data)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_lbw.dta
  obs:           189                          Hosmer & Lemeshow data
 vars:            11                          21 Mar 2013 12:28
 size:         2,646                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
id              int     %8.0g                 subject id
low             byte    %8.0g                 birth weight < 2500g
age             byte    %8.0g                 age of mother
lwt             int     %8.0g                 weight, last menstrual period
race            byte    %8.0g      race       race
smoke           byte    %9.0g      smoke      smoked during pregnancy
ptl             byte    %8.0g                 premature labor history (count)
ht              byte    %8.0g                 has history of hypertension
ui              byte    %8.0g                 presence, uterine irritability
ftv             byte    %8.0g                 # physician visits, 1st trimester
bwt             int     %8.0g                 birth weight (g)
------------------------------------------------------------------------
Sorted by:
. notes
_dta:
1. Data from Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013.
"Applied Logistic Regression". 3rd ed. Hoboken, NJ: Wiley.
2. Data from a study of risk factors associated with low birth weights.
See Structural models 7: Dependencies between response variables in [SEM] intro 5 for background.
[Path diagram: age, smoke, and ht each have a path to the generalized response ptl, with family Poisson and link log; age, smoke, ht, lwt, 1b.race, 2.race, 3.race, ui, and ptl each have a path to the generalized response low, with family Bernoulli and link logit.]
This model has one logit equation and one Poisson regression equation, with the Poisson response
variable also being an explanatory variable in the logit equation.
Because the two equations are recursive, it is not necessary to fit these models together. We could draw separate diagrams for each equation and fit each separately. Even so, researchers often do fit recursive models together, and sometimes doing so is just the first step before placing constraints across models or introducing a common latent variable. The latter might be likely in this case: neither generalized linear response has an error term that could be correlated, so the only way to correlate the two responses in gsem is to add a shared latent variable that affects both.
Our purpose here is to show that you can mix models with generalized response variables of
different types.
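In command syntax, the combined model corresponding to the diagram can be fit by typing something along these lines (our reconstruction of the command; the output below is what it produces):

. gsem (low <- ptl age smoke ht lwt i.race ui, logit) (ptl <- age smoke ht, poisson)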
Generalized structural equation model           Number of obs      =       189

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
low <-       |
         ptl |   .5418366    .346249     1.56   0.118     -.136799    1.220472
         age |  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
       smoke |   .9233448   .4008266     2.30   0.021      .137739    1.708951
          ht |   1.832518   .6916292     2.65   0.008     .4769494    3.188086
         lwt |  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             |
        race |
      black  |   1.262647   .5264101     2.40   0.016     .2309023    2.294392
      other  |   .8620791   .4391532     1.96   0.050     .0013548    1.722803
             |
          ui |   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons |   .4612238    1.20459     0.38   0.702    -1.899729    2.822176
-------------+----------------------------------------------------------------
ptl <-       |
         age |   .0370598   .0298752     1.24   0.215    -.0214946    .0956142
       smoke |   .9602534   .3396867     2.83   0.005     .2944796    1.626027
          ht |  -.1853501   .7271851    -0.25   0.799    -1.610607    1.239906
       _cons |  -2.985512   .7842174    -3.81   0.000     -4.52255   -1.448474
------------------------------------------------------------------------------
. estat eform low ptl

------------------------------------------------------------------------------
             |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
low          |
         ptl |   1.719161   .5952579     1.56   0.118     .8721455    3.388787
         age |   .9732636   .0354759    -0.74   0.457     .9061578    1.045339
       smoke |   2.517698    1.00916     2.30   0.021     1.147676    5.523162
          ht |   6.249602   4.322407     2.65   0.008     1.611152    24.24199
         lwt |   .9849634   .0068217    -2.19   0.029     .9716834    .9984249
             |
        race |
      white  |          1    (empty)
      black  |   3.534767   1.860737     2.40   0.016     1.259736    9.918406
      other  |   2.368079   1.039949     1.96   0.050     1.001356    5.600207
             |
          ui |     2.1351   .9808153     1.65   0.099     .8677528      5.2534
       _cons |   1.586014   1.910496     0.38   0.702     .1496092      16.8134
-------------+----------------------------------------------------------------
ptl          |
         age |   1.037755   .0310032     1.24   0.215     .9787348    1.100334
       smoke |   2.612358   .8873835     2.83   0.005     1.342428    5.083638
          ht |   .8308134   .6041551    -0.25   0.799     .1997664     3.45529
       _cons |   .0505137   .0396137    -3.81   0.000     .0108613    .2349286
------------------------------------------------------------------------------
Had we merely typed estat eform without the two equation names, we would have obtained
exponentiated coefficients for the first equation only.
Equation names are easily found on the output or the path diagrams. Equations are named after
the dependent variable.
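For instance, to exponentiate only the Poisson equation and obtain incidence-rate ratios for ptl, we could type

. estat eform ptl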
c. include the levels of the factor variable race by clicking on the button next to the Variables control. In the resulting dialog box, select the Factor variable radio button, select Main effect in the Specification control, and select race in the Variables control for Variable 1. Click on Add to varlist, and then click on OK;
d. continue with the Variables control and select the variable ui;
e. select Vertical in the Orientation control;
f. click on OK.
If you wish, move the set of variables by clicking on any variable and dragging it.
5. Create the generalized response for premature labor history.
a. Select the Add Generalized Response Variable tool,
b. Click about one-third of the way in from the right side of the diagram, to the right of ht.
c. In the Contextual Toolbar, select Poisson, Log in the Family/Link control.
d. In the Contextual Toolbar, select ptl in the Variable control.
6. Create the generalized response for low birthweight.
a. Select the Add Generalized Response Variable tool,
b. Click about one-third of the way in from the right side of the diagram, to the right of
2.race.
c. In the Contextual Toolbar, select Bernoulli, Logit in the Family/Link control.
d. In the Contextual Toolbar, select low in the Variable control.
7. Create paths from the independent variables to the dependent variables.
a. Select the Add Path tool,
b. Click in the right side of the age rectangle (it will highlight when you hover over it), and
drag a path to the left side of the ptl rectangle (it will highlight when you can release to
connect the path).
c. Continuing with the tool, create the following paths by clicking first in the right side of the rectangle for the independent variable and dragging it to the left side of the rectangle for the dependent variable:
smoke -> ptl
ht -> ptl
age -> low
smoke -> low
ht -> low
lwt -> low
1b.race -> low
2.race -> low
3.race -> low
ui -> low
d. Continuing with the
tool, create the path from ptl to low by clicking in the bottom of
the ptl rectangle and dragging the path to the top of the low rectangle.
8. Clean up.
If you do not like where a path has been connected to its variables, use the Select tool to click on the path, and then simply click on where it connects to a rectangle and drag the endpoint.
9. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_comb
Reference
Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken,
NJ: Wiley.
Also see
[SEM] example 33g Logistic regression
[SEM] example 45g Heckman selection model
[SEM] example 46g Endogenous treatment-effects model
Title
example 35g Ordered probit and ordered logit
Description
Reference
Also see
Description
Below we demonstrate ordered probit and ordered logit in a measurement-model context. We are
not going to illustrate every family/link combination. Ordered probit and logit, however, are unique
in that a single equation is able to predict a set of ordered outcomes. The unordered alternative,
mlogit, requires k - 1 equations to fit k (unordered) outcomes.
To demonstrate ordered probit and ordered logit, we use the following data:
. use http://www.stata-press.com/data/r13/gsem_issp93
(Selection from ISSP 1993)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_issp93.dta
  obs:           871                          Selection for ISSP 1993
 vars:             8                          21 Mar 2013 16:03
 size:         7,839                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
id              int     %9.0g                 respondent identifier
y1              byte    %26.0g     agree5     too much science, not enough
                                                feelings & faith
y2              byte    %26.0g     agree5     science does more harm than good
y3              byte    %26.0g     agree5     any change makes nature worse
y4              byte    %26.0g     agree5     science will solve environmental
                                                problems
sex             byte    %9.0g      sex        sex
age             byte    %9.0g      age        age (6 categories)
edu             byte    %20.0g     edu        education (6 categories)
------------------------------------------------------------------------
Sorted by:
. notes
_dta:
1. Data from Greenacre, M. and J Blasius, 2006, _Multiple Correspondence
Analysis and Related Methods_, pp. 42-43, Boca Raton: Chapman & Hall.
Data is a subset of the International Social Survey Program (ISSP) 1993.
2. Full text of y1: We believe too often in science, and not enough in
feelings and faith.
3. Full text of y2: Overall, modern science does more harm than good.
4. Full text of y3: Any change humans cause in nature, no matter how
scientific, is likely to make things worse.
5. Full text of y4: Modern science will solve our environmental problems
with little change to our way of life.
Ordered probit
For the measurement model, we focus on variables y1 through y4. Each variable contains values 1-5, with 1 meaning strong disagreement and 5 meaning strong agreement with a statement about science.
Ordered probit produces predictions about the probabilities that a respondent gives response 1, response 2, ..., response k. It does this by dividing up the domain of an N(0,1) distribution into k categories defined by k-1 cutpoints, c_1, c_2, ..., c_{k-1}. Individual respondents are assumed to have a score s = Xβ + ε, where ε ~ N(0,1), and then that score is used along with the cutpoints to produce probabilities for each respondent producing response 1, 2, ..., k:

    Pr(response is i | X) = Pr(c_{i-1} < Xβ + ε ≤ c_i)

where c_0 = −∞; c_k = +∞; and c_1, c_2, ..., c_{k-1} and β are parameters of the model to be fit. This ordered probit model has long been known in Stata circles as oprobit.
We have a set of four questions designed to determine the respondent's attitude toward science, each question with k = 5 possible answers ranging on a Likert scale from 1 to 5. With ordered probit in hand, we have a way to take a continuous variable, say, a latent variable we will call SciAtt, and produce predicted categorical responses.
The measurement model we want to fit is
[Path diagram: latent variable SciAtt has paths to y1, y2, y3, and y4, each a generalized response with family ordinal and link probit.]
. gsem (y1 y2 y3 y4 <- SciAtt), oprobit
Fitting fixed-effects model:
Iteration 0:   log likelihood = -5227.8743
Iteration 1:   log likelihood = -5227.8743
Refining starting values:
Grid node 0:   log likelihood = -5230.8106
Fitting full model:
Iteration 0:   log likelihood = -5230.8106  (not concave)
Iteration 1:   log likelihood = -5132.1849  (not concave)
Iteration 2:   log likelihood = -5069.5037
Iteration 3:   log likelihood = -5040.4779
Iteration 4:   log likelihood = -5040.2397
Iteration 5:   log likelihood = -5039.8242
Iteration 6:   log likelihood =  -5039.823
Iteration 7:   log likelihood =  -5039.823
Generalized structural equation model           Number of obs      =       871
Log likelihood = -5039.823
 ( 1)  [y1]SciAtt = 1

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 <-        |
      SciAtt |          1  (constrained)
-------------+----------------------------------------------------------------
y2 <-        |
      SciAtt |   1.424366   .2126574     6.70   0.000     1.007565    1.841167
-------------+----------------------------------------------------------------
y3 <-        |
      SciAtt |   1.283359   .1797557     7.14   0.000      .931044    1.635674
-------------+----------------------------------------------------------------
y4 <-        |
      SciAtt |  -.0322354   .0612282    -0.53   0.599    -.1522405    .0877697
-------------+----------------------------------------------------------------
y1           |
       /cut1 |  -1.343148   .0726927   -18.48   0.000    -1.485623   -1.200673
       /cut2 |   .0084719   .0521512     0.16   0.871    -.0937426    .1106863
       /cut3 |   .7876538   .0595266    13.23   0.000     .6709837    .9043238
       /cut4 |   1.989985   .0999181    19.92   0.000     1.794149     2.18582
-------------+----------------------------------------------------------------
y2           |
       /cut1 |  -1.997245   .1311972   -15.22   0.000    -2.254387   -1.740104
       /cut2 |  -.8240241   .0753839   -10.93   0.000    -.9717738   -.6762743
       /cut3 |   .0547025   .0606036     0.90   0.367    -.0640784    .1734834
       /cut4 |   1.419923   .1001258    14.18   0.000      1.22368    1.616166
-------------+----------------------------------------------------------------
y3           |
       /cut1 |  -1.271915   .0847483   -15.01   0.000    -1.438019   -1.105812
       /cut2 |   .1249493   .0579103     2.16   0.031     .0114472    .2384515
       /cut3 |   .9752553   .0745052    13.09   0.000     .8292277    1.121283
       /cut4 |   2.130661   .1257447    16.94   0.000     1.884206    2.377116
-------------+----------------------------------------------------------------
y4           |
       /cut1 |  -1.484063   .0646856   -22.94   0.000    -1.610844   -1.357281
       /cut2 |  -.4259356   .0439145    -9.70   0.000    -.5120065   -.3398647
       /cut3 |   .1688777   .0427052     3.95   0.000     .0851771    .2525782
       /cut4 |   .9413113   .0500906    18.79   0.000     .8431356    1.039487
-------------+----------------------------------------------------------------
 var(SciAtt) |   .5265523   .0979611                      .3656637    .7582305
------------------------------------------------------------------------------
Notes:
1. The cutpoints c_1, ..., c_4 are labeled /cut1, ..., /cut4 in the output. We have a separate set of cutpoints for each of the four questions y1, ..., y4. Look at the estimated cutpoints for y1, which are −1.343, 0.008, 0.788, and 1.99. The probabilities that a person with SciAtt = 0 (its mean) would give the various responses are

    Pr(response 1) = normal(−1.343) = 0.090
    Pr(response 2) = normal(0.008) − normal(−1.343) = 0.414
    Pr(response 3) = normal(0.788) − normal(0.008) = 0.281
    Pr(response 4) = normal(1.99) − normal(0.788) = 0.192
    Pr(response 5) = 1 − normal(1.99) = 0.023
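You can reproduce these probabilities yourself with Stata's normal() function; a minimal sketch of the first two:

. display normal(-1.343)
. display normal(0.008) - normal(-1.343)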
2. The path coefficients (y1 y2 y3 y4 <- SciAtt) measure the effect of the latent variable we called science attitude on each of the responses.
3. The estimated path coefficients are 1, 1.42, 1.28, and −0.03 for the four questions.
4. If you read the questions (they are listed above), you will find that in all but the fourth question, agreement signifies a negative attitude toward science. Thus SciAtt measures a negative attitude toward science because the loadings on negative questions are positive and the loading on the single positive question is negative.
5. The direction of the meanings of latent variables is always a priori indeterminate and is set by the identifying restrictions we apply. We applied (or more correctly, gsem applied for us) the constraint that y1 <- SciAtt has path coefficient 1. Because statement 1 was a negative statement about science, that was sufficient to set the direction of SciAtt to be the opposite of what we hoped for.
The direction does not matter. You simply must remember to interpret the latent variable correctly
when reading results based on it. In the models we fit, including more complicated models, the
signs of the coefficients will work themselves out to adjust for the direction of the variable.
Ordered logit
The description of the ordered logit model is identical to that of the ordered probit model except
that where we assumed a normal distribution in our explanation above, we now assume a logit
distribution. The distributions are similar.
To fit an ordered logit (ologit) model, the link function shown in the boxes merely changes from
probit to logit:
[Path diagram: latent variable SciAtt has paths to y1, y2, y3, and y4, each a generalized response with family ordinal and link logit.]
. gsem (y1 y2 y3 y4 <- SciAtt), ologit
Fitting fixed-effects model:
Iteration 0:   log likelihood = -5127.9026
Iteration 1:   log likelihood = -5127.9026
Refining starting values:
Grid node 0:   log likelihood = -5065.4679
Fitting full model:
Iteration 0:   log likelihood = -5035.9766  (not concave)
Iteration 1:   log likelihood = -5035.0943
Iteration 2:   log likelihood = -5035.0353
Iteration 3:   log likelihood = -5035.0352
Generalized structural equation model           Number of obs      =       871
Log likelihood = -5035.0352
 ( 1)  [y1]SciAtt = 1

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 <-        |
      SciAtt |          1  (constrained)
-------------+----------------------------------------------------------------
y2 <-        |
      SciAtt |   1.394767   .2065479     6.75   0.000     .9899406    1.799593
-------------+----------------------------------------------------------------
y3 <-        |
      SciAtt |    1.29383   .1845113     7.01   0.000     .9321939    1.655465
-------------+----------------------------------------------------------------
y4 <-        |
      SciAtt |  -.0412446   .0619936    -0.67   0.506    -.1627498    .0802606
-------------+----------------------------------------------------------------
y1           |
       /cut1 |   -2.38274   .1394292   -17.09   0.000    -2.656016   -2.109464
       /cut2 |  -.0088393   .0889718    -0.10   0.921    -.1832207    .1655422
       /cut3 |   1.326292    .106275    12.48   0.000     1.117997    1.534587
       /cut4 |   3.522017   .1955535    18.01   0.000     3.138739    3.905295
-------------+----------------------------------------------------------------
y2           |
       /cut1 |   -3.51417   .2426595   -14.48   0.000    -3.989774   -3.038566
       /cut2 |  -1.421711    .135695   -10.48   0.000    -1.687669   -1.155754
       /cut3 |   .0963154   .1046839     0.92   0.358    -.1088612    .3014921
       /cut4 |   2.491459   .1840433    13.54   0.000     2.130741    2.852178
-------------+----------------------------------------------------------------
y3           |
       /cut1 |  -2.263557   .1618806   -13.98   0.000    -2.580838   -1.946277
       /cut2 |   .2024798   .1012122     2.00   0.045     .0041075     .400852
       /cut3 |   1.695997   .1393606    12.17   0.000     1.422855    1.969138
       /cut4 |   3.828154   .2464566    15.53   0.000     3.345108      4.3112
-------------+----------------------------------------------------------------
y4           |
       /cut1 |  -2.606013   .1338801   -19.47   0.000    -2.868413   -2.343613
       /cut2 |  -.6866159   .0718998    -9.55   0.000    -.8275369   -.5456949
       /cut3 |    .268862   .0684577     3.93   0.000     .1346874    .4030366
       /cut4 |   1.561921   .0895438    17.44   0.000     1.386419    1.737424
-------------+----------------------------------------------------------------
 var(SciAtt) |   1.715641   .3207998                      1.189226    2.475077
------------------------------------------------------------------------------
Note:
1. Results are nearly identical to those reported for ordered probit.
b. Click on the y1 rectangle. In the Contextual Toolbar, select Ordinal, Logit in the
Family/Link control.
c. Repeat this process to change the family and link to Ordinal, Logit for y2, y3, and y4.
7. Estimate again.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram for the ordered probit model in the Builder by typing
. webgetsem gsem_oprobit
You can open a completed diagram for the ordered logit model in the Builder by typing
. webgetsem gsem_ologit
Reference
Greenacre, M. J. 2006. From simple to multiple correspondence analysis. In Multiple Correspondence Analysis and
Related Methods, ed. M. J. Greenacre and J. Blasius. Boca Raton, FL: Chapman & Hall.
Also see
[SEM] example 1 Single-factor measurement model
[SEM] example 27g Single-factor measurement model (generalized response)
[SEM] example 33g Logistic regression
[SEM] example 36g MIMIC model (generalized response)
[SEM] example 37g Multinomial logistic regression
Title
example 36g MIMIC model (generalized response)
Description
Reference
Also see
Description
To demonstrate a multiple-indicators multiple-causes (MIMIC) model with generalized indicators,
we use the same data used in [SEM] example 35g:
. use http://www.stata-press.com/data/r13/gsem_issp93
(Selection from ISSP 1993)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_issp93.dta
  obs:           871                          Selection for ISSP 1993
 vars:             8                          21 Mar 2013 16:03
 size:         7,839                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
id              int     %9.0g                 respondent identifier
y1              byte    %26.0g     agree5     too much science, not enough
                                                feelings & faith
y2              byte    %26.0g     agree5     science does more harm than good
y3              byte    %26.0g     agree5     any change makes nature worse
y4              byte    %26.0g     agree5     science will solve environmental
                                                problems
sex             byte    %9.0g      sex        sex
age             byte    %9.0g      age        age (6 categories)
edu             byte    %20.0g     edu        education (6 categories)
------------------------------------------------------------------------
Sorted by:
. notes
_dta:
1. Data from Greenacre, M. and J Blasius, 2006, _Multiple Correspondence
Analysis and Related Methods_, pp. 42-43, Boca Raton: Chapman & Hall.
Data is a subset of the International Social Survey Program (ISSP) 1993.
2. Full text of y1: We believe too often in science, and not enough in
feelings and faith.
3. Full text of y2: Overall, modern science does more harm than good.
4. Full text of y3: Any change humans cause in nature, no matter how
scientific, is likely to make things worse.
5. Full text of y4: Modern science will solve our environmental problems
with little change to our way of life.
[Path diagram: sex has a path to the latent variable SciAtt, which has paths to y1, y2, y3, and y4, each a generalized response with family ordinal and link probit.]
. gsem (y1 y2 y3 y4 <- SciAtt) (SciAtt <- sex), oprobit
Generalized structural equation model           Number of obs      =       871
 ( 1)  [y1]SciAtt = 1

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 <-        |
      SciAtt |          1  (constrained)
-------------+----------------------------------------------------------------
y2 <-        |
      SciAtt |   1.405732   .2089672     6.73   0.000     .9961641      1.8153
-------------+----------------------------------------------------------------
y3 <-        |
      SciAtt |   1.246449   .1710771     7.29   0.000      .911144    1.581754
-------------+----------------------------------------------------------------
y4 <-        |
      SciAtt |  -.0345517   .0602017    -0.57   0.566    -.1525449    .0834415
-------------+----------------------------------------------------------------
SciAtt <-    |
         sex |  -.2337427   .0644245    -3.63   0.000    -.3600124   -.1074729
-------------+----------------------------------------------------------------
y1           |
       /cut1 |  -1.469615   .0855651   -17.18   0.000     -1.63732   -1.301911
       /cut2 |    -.10992   .0615897    -1.78   0.074    -.2306336    .0107937
       /cut3 |   .6729334   .0644695    10.44   0.000     .5465755    .7992914
       /cut4 |   1.879901   .0996675    18.86   0.000     1.684557    2.075246
-------------+----------------------------------------------------------------
y2           |
       /cut1 |   -2.16739   .1480596   -14.64   0.000    -2.457582   -1.877199
       /cut2 |  -.9912152   .0943091   -10.51   0.000    -1.176058   -.8063727
       /cut3 |  -.1118914    .075311    -1.49   0.137    -.2594982    .0357154
       /cut4 |   1.252164   .0983918    12.73   0.000      1.05932    1.445008
-------------+----------------------------------------------------------------
y3           |
       /cut1 |  -1.412372   .0977772   -14.44   0.000    -1.604012   -1.220733
       /cut2 |  -.0230879   .0687432    -0.34   0.737    -.1578221    .1116464
       /cut3 |   .8209522   .0771653    10.64   0.000     .6697109    .9721935
       /cut4 |   1.966042   .1196586    16.43   0.000     1.731515    2.200568
-------------+----------------------------------------------------------------
y4           |
       /cut1 |   -1.47999   .0650596   -22.75   0.000    -1.607505   -1.352476
       /cut2 |  -.4218768   .0443504    -9.51   0.000     -.508802   -.3349516
       /cut3 |    .172995   .0432394     4.00   0.000     .0882473    .2577427
       /cut4 |   .9454906   .0507422    18.63   0.000     .8460376    1.044944
-------------+----------------------------------------------------------------
var(e.SciAtt)|   .5283629   .0978703                      .3675036    .7596315
------------------------------------------------------------------------------
Notes:
1. Our latent variable measures a negative attitude toward science, just as it did in [SEM] example 35g.
2. In this MIMIC model, we allow males and females to have different underlying attitudes toward
science.
3. The coefficient for SciAtt <- sex is −0.234. Variable sex is equal to 1 for females; thus females have a lower mean value for SciAtt by 0.234. Because our SciAtt measure is reversed, this means that females have a more positive attitude toward science. The effect is significant at better than the 1% level.
4. The difference between males and females in SciAtt is 0.234. Is that big or small, practically
speaking?
b. Click in the bottom of the sex rectangle (it will highlight when you hover over it), and drag
a path to the top of the SciAtt oval (it will highlight when you can release to connect the
path).
7. Clean up the direction of the error.
The error on SciAtt is likely to have been created below the oval for SciAtt. Choose the
Select tool, , and then click in the SciAtt oval. Click on one of the Error Rotation buttons,
, in the Contextual Toolbar until the error is where you want it.
8. Clean up the location of the path.
If you do not like where the path between sex and SciAtt has been connected to its variables, use the Select tool to click on the path, and then simply click on where it connects to a rectangle or oval and drag the endpoint.
9. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_mimic
Reference
Greenacre, M. J. 2006. From simple to multiple correspondence analysis. In Multiple Correspondence Analysis and
Related Methods, ed. M. J. Greenacre and J. Blasius. Boca Raton, FL: Chapman & Hall.
Also see
[SEM] example 10 MIMIC model
[SEM] example 35g Ordered probit and ordered logit
Title
example 37g Multinomial logistic regression
Description
Reference
Also see
Description
With the data below, we demonstrate multinomial logistic regression, also known as multinomial
logit, mlogit, and family multinomial, link logit:
. use http://www.stata-press.com/data/r13/gsem_sysdsn1
(Health insurance data)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_sysdsn1.dta
  obs:           644                          Health insurance data
 vars:            12                          28 Mar 2013 13:46
 size:        11,592                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
site            byte    %9.0g                 study site (1-3)
patid           float   %9.0g                 patient id
insure          byte    %9.0g      insure     insurance type
age             float   %10.0g                NEMC (ISCNRD-IBIRTHD)/365.25
male            byte    %8.0g                 NEMC PATIENT MALE
nonwhite        byte    %9.0g                 race
noinsur0        byte    %8.0g                 no insurance at baseline
noinsur1        byte    %8.0g                 no insurance at year 1
noinsur2        byte    %8.0g                 no insurance at year 2
ppd0            byte    %8.0g                 prepaid at baseline
ppd1            byte    %8.0g                 prepaid at year 1
ppd2            byte    %8.0g                 prepaid at year 2
------------------------------------------------------------------------
Sorted by:  patid
. notes
_dta:
1. Data on health insurance available to 644 psychologically depressed
subjects.
2. Data from Tarlov, A.R., et al., 1989, "The Medical Outcomes Study. An
application of methods for monitoring the results of medical care." _J.
of the American Medical Association_ 262, pp. 925-930.
3. insure: 1=indemnity, 2=prepaid, 3=uninsured.
See Structural models 6: Multinomial logistic regression in [SEM] intro 5 for background.
[Path diagram: 1.nonwhite has paths to 2.insure and 3.insure; 1b.insure, 2.insure, and 3.insure are generalized responses with family multinomial and link logit; no paths point to 1b.insure.]
The response variables are 1.insure, 2.insure, and 3.insure, meaning insure = 1 (code
for indemnity), insure = 2 (code for prepaid), and insure = 3 (code for uninsured). We specified
that insure = 1 be treated as the mlogit base category by placing a b on 1.insure to produce
1b.insure in the variable box.
Notice that there are no paths into 1b.insure. We could just as well have diagrammed the model
with a path arrow from the explanatory variable into 1b.insure. It would have made no difference.
In one sense, omitting the path is more mathematically appropriate, because multinomial logistic
base levels are defined by having all coefficients constrained to be 0.
In another sense, drawing the path would be more appropriate because, even with insure = 1
as the base level, we are not assuming that outcome insure = 1 is unaffected by the explanatory
variables. The probabilities of the three possible outcomes must sum to 1, and so any predictor that
increases one probability of necessity causes the sum of the remaining probabilities to decrease. If a
predictor x has positive effects (coefficients) for both 2.insure and 3.insure, then increases in x
must cause the probability of 1.insure to fall.
The choice of base outcome specifies that the coefficients associated with the other outcomes are
to be measured relative to that base. In multinomial logistic regression, the coefficients are logs of
the probability of the category divided by the probability of the base category, a mouthful also known
as the log of the relative-risk ratio.
We drew the diagram one way, but we could just as well have drawn it like this:
[Path diagram: the same model, but with paths from 1.nonwhite to all of 1b.insure, 2.insure, and 3.insure; all three are generalized responses with family multinomial and link logit.]
In fact, we could just as well have chosen to indicate the base category by omitting it entirely from
our diagram, like this:
[Path diagram: the same model with the base category omitted entirely: 1.nonwhite has paths to 2.insure and 3.insure only, each a generalized response with family multinomial and link logit.]
Going along with that, we could type three different commands, each exactly corresponding to
one of the three diagrams:
. gsem (1b.insure) (2.insure 3.insure <- i.nonwhite), mlogit
. gsem (1b.insure 2.insure 3.insure <- i.nonwhite), mlogit
. gsem (2.insure 3.insure <- i.nonwhite), mlogit
See [SEM] intro 3 for a complete description of factor-variable notation. It makes no difference
which diagram we draw or which command we type.
Generalized structural equation model           Number of obs      =       616

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.insure     |  (base outcome)
-------------+----------------------------------------------------------------
2.insure <-  |
  1.nonwhite |   .6608212   .2157321     3.06   0.002     .2379942    1.083648
       _cons |  -.1879149   .0937644    -2.00   0.045    -.3716896   -.0041401
-------------+----------------------------------------------------------------
3.insure <-  |
  1.nonwhite |   .3779586    .407589     0.93   0.354    -.4209011    1.176818
       _cons |  -1.941934   .1782185   -10.90   0.000    -2.291236   -1.592632
------------------------------------------------------------------------------
Notes:
1. The above results say that nonwhites are more likely to have insure = 2 relative to 1 than
whites, and that nonwhites are more likely to have insure = 3 relative to 1 than whites, which
obviously implies that whites are more likely to have insure = 1.
2. For a three-outcome multinomial logistic regression model with the first outcome set to be the base level, the probability of each outcome is

    Pr(y = 1) = 1/D
    Pr(y = 2) = exp(Xβ2)/D
    Pr(y = 3) = exp(Xβ3)/D

where D = 1 + exp(Xβ2) + exp(Xβ3).
3. For whites (that is, for 1.nonwhite = 0), we have Xβ2 = −0.1879 and Xβ3 = −1.9419. Thus D = 1.9721, and the probabilities for each outcome are 0.5071, 0.4202, and 0.0727. Those probabilities sum to 1. You can make the same calculations for nonwhites (that is, for 1.nonwhite = 1) for yourself.
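For example, for nonwhites the linear predictions add the 1.nonwhite coefficients to the constants; a quick check of the first probability (our arithmetic, using the coefficients reported above):

. display 1/(1 + exp(-.1879149 + .6608212) + exp(-1.941934 + .3779586))

which displays about 0.3554; the corresponding probabilities for insure = 2 and insure = 3 are about 0.5702 and 0.0744, and the three again sum to 1.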
Multinomial logistic regression model with constraints

[Path diagram: 2.insure has paths from age, 1.male, and 1.nonwhite; 3.insure has paths from 1.nonwhite, 1b.site, 2.site, and 3.site; 1b.insure, 2.insure, and 3.insure are generalized responses with family multinomial and link logit.]
In the above, insure = 2 and insure = 3 have paths pointing to them from different sets of
predictors. They share predictor 1.nonwhite, but insure = 2 also has paths from age and 1.male,
whereas insure = 3 also has paths from the site variables. When we fit this model, we will not
obtain estimates of the coefficients on age and 1.male in the equation for insure = 3. This is
equivalent to constraining the coefficients for age and 1.male to 0 in this equation. In other words,
we are placing a constraint that the relative risk of choosing insure = 3 rather than insure = 1 is
the same for males and females and is the same for all ages.
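In command syntax, a model matching this diagram could be fit along these lines (a sketch based on the paths just described):

. gsem (2.insure <- i.nonwhite age i.male) (3.insure <- i.nonwhite i.site), mlogit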
Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -541.20487
Iteration 2:   log likelihood = -540.85219
Iteration 3:   log likelihood = -540.85164
Iteration 4:   log likelihood = -540.85164
Generalized structural equation model           Number of obs      =       615

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.insure     |  (base outcome)
-------------+----------------------------------------------------------------
2.insure <-  |
  1.nonwhite |   .7219663   .2184994     3.30   0.001     .2937153    1.150217
         age |  -.0101291   .0058972    -1.72   0.086    -.0216874    .0014292
      1.male |   .5037961   .1912717     2.63   0.008     .1289104    .8786818
       _cons |   .1249932   .2743262     0.46   0.649    -.4126763    .6626627
-------------+----------------------------------------------------------------
3.insure <-  |
  1.nonwhite |   .0569646   .4200407     0.14   0.892    -.7663001    .8802293
             |
        site |
          2  |  -1.273576   .4562854    -2.79   0.005    -2.167879   -.3792728
          3  |   .0434253   .3470773     0.13   0.900    -.6368337    .7236843
             |
       _cons |  -1.558258   .2540157    -6.13   0.000    -2.056119   -1.060396
------------------------------------------------------------------------------
We could have gotten identical results from Stata's mlogit command for both this example and the previous one. To fit the first example, we would have typed
. mlogit insure i.nonwhite
To obtain the results for this second example, we would have been required to type a bit more:
. constraint 1 [Uninsure]age = 0
. constraint 2 [Uninsure]1.male = 0
. constraint 3 [Prepaid]2.site = 0
. constraint 4 [Prepaid]3.site = 0
. mlogit insure i.nonwhite age i.male i.site, constraints(1/4)
Having mlogit embedded in gsem, of course, also provides the advantage that we can combine
the mlogit model with measurement models, multilevel models, and more. See [SEM] example 41g
for a two-level multinomial logistic regression with random effects.
4. Create the rectangles for each possible outcome of the multinomial endogenous variable.
Select the Add Observed Variables Set tool,
, and then click in the diagram about one-third
of the way in from the right and one-fourth of the way up from the bottom.
In the resulting dialog box,
a. select the Select variables radio button (it may already be selected);
b. check Make variables generalized responses;
c. select Multinomial, Logit in the Family/Link control;
d. select insure in the Variable control;
e. select Vertical in the Orientation control;
f. click on OK.
If you wish, move the set of variables by clicking on any variable and dragging it.
5. Create the independent variable.
a. Select the Add Observed Variable tool, and then click in the diagram to the left of 2.insure.
b. In the Contextual Toolbar, type 1.nonwhite in the Variable control and press Enter.
6. Create the paths from the independent variable to the rectangles for outcomes insure = 2 and
insure = 3.
a. Select the Add Path tool,
b. Click in the right side of the 1.nonwhite rectangle (it will highlight when you hover over
it), and drag a path to the left side of the 2.insure rectangle (it will highlight when you
can release to connect the path).
c. Continuing with the
tool, click in the right side of the 1.nonwhite rectangle and drag
a path to the left side of the 3.insure rectangle.
7. Clean up the location of the paths.
If you do not like where the paths have been connected to the rectangles, use the Select tool,
, to click on the path, and then simply click on where it connects to a rectangle and drag the
endpoint.
8. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_mlogit1
Fitting the multinomial logistic model with constraints with the Builder
Use the diagram in Multinomial logistic regression model with constraints above for reference.
1. Open the dataset.
In the Command window, type
. use http://www.stata-press.com/data/r13/gsem_sysdsn1
4. Create the rectangles for each possible outcome of the multinomial endogenous variable.
Select the Add Observed Variables Set tool,
, and then click in the diagram about one-third
of the way in from the right and one-fourth of the way up from the bottom.
In the resulting dialog box,
a. select the Select variables radio button (it may already be selected);
b. check Make variables generalized responses;
c. select Multinomial, Logit in the Family/Link control;
d. select insure in the Variable control;
e. select Vertical in the Orientation control;
f. click on OK.
If you wish, move the set of variables by clicking on any variable and dragging it.
5. Create the independent variables.
Select the Add Observed Variables Set tool,
, and then click in the diagram about one-third
from the left and one-fourth from the bottom.
In the resulting dialog box,
a. select the Select variables radio button (it may already be selected);
b. uncheck Make variables generalized responses;
c. use the Variables control and select age;
d. type 1.male 1.nonwhite in the Variables control after age (typing 1.varname rather
than using the
button to create them as i.varname factor variables prevents rectangles
corresponding to the base categories for these binary variables from being created);
e. include the levels of the factor variable site by clicking on the
button next to the
Variables control. In the resulting dialog box, select the Factor variable radio button, select
Main effect in the Specification control, and select site in the Variables control for
Variable 1. Click on Add to varlist, and then click on OK;
f. select Vertical in the Orientation control;
g. click on OK.
6. Create the paths from the independent variables to the rectangles for outcomes insure = 2 and
insure = 3.
a. Select the Add Path tool,
b. Click in the right side of the age rectangle (it will highlight when you hover over it), and
drag a path to the left side of the 2.insure rectangle (it will highlight when you can release
to connect the path).
c. Continuing with the
tool, create the following paths by clicking first in the right side of
the rectangle for the independent variable and dragging it to the left side of the rectangle
for the given outcome of the dependent variable:
1.male -> 2.insure
1.nonwhite -> 2.insure
1.nonwhite -> 3.insure
1b.site -> 3.insure
2.site -> 3.insure
3.site -> 3.insure
7. Clean up the location of the paths.
If you do not like where the paths have been connected to the rectangles, use the Select tool,
, to click on the path, and then simply click on where it connects to a rectangle and drag the
endpoint.
8. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_mlogit2
Reference
Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. An application of methods for monitoring the results of medical care. Journal of the American Medical Association 262: 925-930.
Also see
[SEM] example 35g Ordered probit and ordered logit
[SEM] example 41g Two-level multinomial logistic regression (multilevel)
Title
example 38g Random-intercept and random-slope models (multilevel)
Description
Reference
Also see
Description
Below we discuss random-intercept and random-slope models in the context of multilevel models, and specifically, 2-level models, although we could just as well use higher-level models (see
[SEM] example 39g). Some people refer to these models as random-effects models and as mixed-effects
models.
To demonstrate random-intercept and random-slope models, we will use the following data:
. use http://www.stata-press.com/data/r13/gsem_nlsy
(NLSY 1968)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_nlsy.dta
  obs:         2,763                          NLSY 1968
 vars:            21                          29 Mar 2013 11:30
 size:        93,942                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
year            int     %8.0g                 interview year
birth_yr        byte    %8.0g                 birth year
age             byte    %8.0g                 age in current year
race            byte    %8.0g      racelbl    race
msp             byte    %8.0g                 1 if married, spouse present
nev_mar         byte    %8.0g                 1 if never married
grade           byte    %8.0g                 current grade completed
collgrad        byte    %8.0g                 1 if college graduate
not_smsa        byte    %8.0g                 1 if not SMSA
c_city          byte    %8.0g                 1 if central city
south           byte    %8.0g                 1 if south
ind_code        byte    %8.0g                 industry of employment
occ_code        byte    %8.0g                 occupation
union           byte    %8.0g                 1 if union
wks_ue          byte    %8.0g                 weeks unemployed last year
ttl_exp         float   %9.0g                 total work experience
tenure          float   %9.0g                 job tenure, in years
hours           int     %8.0g                 usual hours worked
wks_work        int     %8.0g                 weeks worked last year
ln_wage         float   %9.0g                 ln(wage/GNP deflator)
------------------------------------------------------------------------
Sorted by:  idcode  year
. notes
_dta:
1. Data from National Longitudinal Survey of Young Women 14-27 years of age
in 1968 (NLSY), Center for Human Resource Research, Ohio State
University, first released in 1989.
2. This data was subsetted for purposes of demonstration.
These 2-level data are recorded in long form, that is, each observation corresponds to a year within
a subject and the full set of data is spread across repeated observations.
. list id year ln_wage union grade in 1/20, sepby(idcode)
     +---------------------------------------------+
     | idcode   year    ln_wage   union      grade |
     |---------------------------------------------|
  1. |      1   1970   1.451214       .         12 |
  2. |      1   1971    1.02862       .         12 |
  3. |      1   1972   1.589977       1         12 |
  4. |      1   1973   1.780273       .         12 |
  5. |      1   1975   1.777012       .         12 |
  6. |      1   1977   1.778681       0         12 |
  7. |      1   1978   2.493976       .         12 |
  8. |      1   1980   2.551715       1         12 |
  9. |      1   1983   2.420261       1         12 |
 10. |      1   1985   2.614172       1         12 |
 11. |      1   1987   2.536374       1         12 |
 12. |      1   1988   2.462927       1         12 |
     |---------------------------------------------|
 13. |      2   1971   1.360348       0         12 |
 14. |      2   1972   1.206198       .         12 |
 15. |      2   1973   1.549883       .         12 |
 16. |      2   1975   1.832581       .         12 |
 17. |      2   1977   1.726721       1         12 |
 18. |      2   1978    1.68991       1         12 |
 19. |      2   1980   1.726964       1         12 |
 20. |      2   1982   1.808289       1         12 |
     +---------------------------------------------+
In the repeated observations for a subject, some variables vary (they are at the observation level, such as ln_wage) and other variables do not vary (they are at the subject level, such as grade).
When using gsem, multilevel data must be recorded in the long form except in one case. The exception is latent-growth curve models, which can be fit in the long or wide form. In the wide form, there is one physical observation for each subject and multiple variables within subject, such as ln_wage_1970, ln_wage_1971, and so on. Researchers from a structural equation modeling background think about latent-growth models in the wide form; see [SEM] example 18.
In all other cases, if your data are in the wide form, use Stata's reshape command (see [D] reshape) to convert the data to long form.
See Structural models 1: Linear regression and Multilevel mixed-effects models in [SEM] intro 5
for background.
[Figure 1. Path diagram: grade, 1.union, and the double-ringed latent variable idcode1 (the random intercept M1[idcode]) each have a path to ln_wage.]
We use factor-variable notation in the diagram above; see [SEM] example 37g.
We are using multilevel data. ln_wage and union (union membership) vary at the observation level, while grade (school completion) varies at the subject level. We have used shading to emphasize that.
In this model, we are including a random intercept (a random effect) at the subject level. Double-ringed idcode is saying, "I am a latent variable at the idcode level, meaning I am constant within identification codes and vary across identification codes, and I correspond to a latent variable named M1." The M1 part of the statement came from the subscript 1; the M part is fixed.
Double-ringed idcode indicates a latent variable constant within idcode, that is, a random effect. And the fact that the path from the latent variable is pointing to a box and not to another path means that the latent variable is used as a random intercept rather than a random slope. By the way, variable idcode in the data contains each subject's identification number.
Using command syntax, we can fit this model by typing
. gsem (ln_wage <- i.union grade M1[idcode])
Fitting fixed-effects model:
Iteration 0:   log likelihood = -925.06629
Iteration 1:   log likelihood = -925.06629
Refining starting values:
Grid node 0:   log likelihood = -763.3769
Fitting full model:
Iteration 0:   log likelihood = -763.3769
Iteration 1:   log likelihood = -622.04625
Iteration 2:   log likelihood = -613.54948
Iteration 3:   log likelihood = -607.56242
Iteration 4:   log likelihood = -607.49246
Iteration 5:   log likelihood = -607.49233
Iteration 6:   log likelihood = -607.49233  (backed up)
Generalized structural equation model           Number of obs      =      1904

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_wage <-   |
     1.union |   .1637408   .0227254     7.21   0.000     .1191998    .2082818
       grade |   .0767919   .0067923    11.31   0.000     .0634791    .0901046
             |
  M1[idcode] |          1  (constrained)
             |
       _cons |   .7774129   .0906282     8.58   0.000     .5997848     .955041
-------------+----------------------------------------------------------------
   var(M1[idcode]) |   .080247   .0073188                 .0671113    .0959537
-------------+----------------------------------------------------------------
     var(e.ln_w~e) |   .078449   .0028627                 .0730342    .0842653
------------------------------------------------------------------------------
Notes:
1. The ln_wage <- M1[idcode] coefficient is constrained to be 1. Such constraints are automatically supplied by gsem to identify the latent variable. Our model is

    ln_wage = ... + β3 M1[idcode] + ... + _cons
            = ... + 1 × M1[idcode] + ... + 0.7774
Thus M1[idcode] is being used as a random intercept.
Remember that the square bracketed [idcode] means that M1 is constant within idcode and
varies only across idcode.
2. The variance of our random intercept is estimated to be 0.0802, which is greater than the estimated
error variance of 0.0784.
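Taken together, the two variances imply that roughly half the residual variation in ln_wage is at the subject level; under the usual variance decomposition (our calculation, not part of the original output), the intraclass correlation can be computed by hand:

. display .080247/(.080247 + .078449)

which displays about 0.506.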
3. Although it is obvious in this case that the latent variable (random intercept) has sufficient
variance that it cannot be ignored, we can test whether the variance is large enough that we could
not ignore it. The test will be up against a boundary (variances cannot be less than 0), and so
the test will be conservative. To perform the test, we would fit the model with and without M1[idcode], store both sets of estimates, and compare them with lrtest.
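A sketch of that command sequence follows; the stored-estimate names random and linear are our own labels:

. estimates store random
. quietly gsem (ln_wage <- i.union grade)
. estimates store linear
. lrtest random linear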
[Figure 2. Path diagram: grade has a path to the double-ringed latent variable idcode1 (M1[idcode]); 1.union and idcode1 have paths to ln_wage.]
Do not read grade pointing to double-ringed idcode as grade being a predictor of idcode.
That would make no sense. Double rings indicate a latent variable, and grade is a predictor of a
latent variable. In particular, the subscript 1 on idcode indicates that the latent variable is named
M1. Thus grade is a predictor of M1. The idcode inside the double rings says that M1 is constant
within idcode. Thus grade, which itself does not vary within idcode, is a predictor of M1, which
does not vary within idcode; said more elegantly, grade is a predictor of M1 at the subject level.
It is logically required that grade vary at the same or higher level as M1[idcode], and gsem will
check that requirement for you.
In this model, M1[idcode] contains both the random intercept and the grade effect. There is now
an equation for M1[idcode], with an error associated with it, and it will be the variance of the error
term that will reflect the variance of the random intercept.
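In command syntax, the within-and-between formulation could be fit along these lines:

. gsem (ln_wage <- i.union M1[idcode]) (M1[idcode] <- grade)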
Generalized structural equation model           Number of obs      =      1904

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_wage <-   |
     1.union |   .1637408   .0227254     7.21   0.000     .1191998    .2082818
             |
  M1[idcode] |          1  (constrained)
             |
       _cons |   .7774129   .0906282     8.58   0.000     .5997848     .955041
-------------+----------------------------------------------------------------
M1[idcode] <-|
       grade |   .0767919   .0067923    11.31   0.000     .0634791    .0901046
-------------+----------------------------------------------------------------
 var(e.M1[idcode]) |   .080247   .0073188                 .0671113    .0959537
     var(e.ln_w~e) |   .078449   .0028627                 .0730342    .0842653
------------------------------------------------------------------------------
Notes:
1. Results are identical to what we previously obtained.
2. The within-and-between formulation is equivalent to the single-equation formulation if there are
no missing values in the data.
3. In this simple model, the two formulations are also equivalent even in the presence of missing
values.
4. If M1[idcode] were also being used to predict another endogenous variable, then missing values
in grade would only cause the equation for the other endogenous variable to have to omit those
observations in the within-and-between formulation.
[Path diagram: 1.union, grade, and 1.union#c.grade each have a path to ln_wage; double-ringed idcode1 (M1[idcode]) has a path to ln_wage (the random intercept), and double-ringed idcode2 (M2[idcode]) has a path pointing to the path from 1.union to ln_wage (the random slope).]
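In command syntax, a model matching this diagram could be fit along these lines; the covariance() option corresponds to the curved path between the two double-ringed latent variables:

. gsem (ln_wage <- i.union grade 1.union#c.grade M1[idcode] 1.union#M2[idcode]), covariance(M1[idcode]*M2[idcode])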
Fitting fixed-effects model:
Iteration 0:   log likelihood = -869.92256
Iteration 1:   log likelihood = -869.92256
Refining starting values:
Grid node 0:   log likelihood = -727.15806
Fitting full model:
Iteration 0:   log likelihood = -711.74718  (not concave)
Iteration 1:   log likelihood = -684.33867  (not concave)
Iteration 2:   log likelihood = -665.94123  (not concave)
Iteration 3:   log likelihood = -610.14526  (not concave)
Iteration 4:   log likelihood = -589.89989  (not concave)
Iteration 5:   log likelihood = -582.24119
Iteration 6:   log likelihood =   -581.298
Iteration 7:   log likelihood = -581.29004
Iteration 8:   log likelihood = -581.29003
Generalized structural equation model           Number of obs      =      1904
 ( 1)  [ln_wage]M1[idcode] = 1
 ( 2)  [ln_wage]1.union#M2[idcode] = 1

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_wage <-   |
     1.union |   .1199049   .1508189     0.80   0.427    -.1756946    .4155045
       grade |   .0757883   .0081803     9.26   0.000     .0597552    .0918215
             |
      union# |
     c.grade |
          1  |   .0019983   .0113534     0.18   0.860     -.020254    .0242506
             |
  M1[idcode] |          1  (constrained)
             |
      union# |
  M2[idcode] |
          0  |          0  (omitted)
          1  |          1  (constrained)
             |
       _cons |   .7873884   .1086476     7.25   0.000      .574443    1.000334
-------------+----------------------------------------------------------------
   var(M1[idcode]) |  .0927931   .0088245                 .0770136    .1118056
   var(M2[idcode]) |  .0823065    .018622                  .052826    .1282392
-------------+----------------------------------------------------------------
   cov(M2[idcode], |
      M1[idcode])  | -.0549821   .0116103   -4.74  0.000   -.077738  -.0322263
-------------+----------------------------------------------------------------
     var(e.ln_w~e) |  .0720873   .0027135                 .0669603    .0776068
------------------------------------------------------------------------------
Notes:
1. M1[idcode] is the random intercept. The coefficient on it is constrained to be 1, just as previously
and just as we would expect.
2. The coefficient on 1.union is the fixed part of the slope of union.
3. M2[idcode] is the random part of the slope of union.
The coefficient on 1.union#M2[idcode] is constrained to be 1, just as we would expect. 1.union#M2[idcode] is the way Stata writes 1.union × M2[idcode].
4. There is an unexpected term in the output, 0.union#M2[idcode], shown with coefficient 0. The first thing to remember about unexpected terms is that they are irrelevant if their coefficients are 0. gsem reports the coefficient as being 0 (omitted), which is gsem's way of saying, "Here is a line that I did not even include in the model." There are a lot of terms gsem could tell us about that were not included in the model, so why did gsem feel obligated to tell us about this term? The term has to do with how Stata tracks base levels of factor variables.
There is a setting, set showbaselevels off, that will prevent lines like that from being displayed. There is also a setting, set showbaselevels all, that will show even more of them! The default is set showbaselevels on.
5. We specified the interaction as 1.union#M2[idcode] rather than i.union#M2[idcode]. Even so, using 1.union#M2[idcode] or i.union#M2[idcode] makes no difference because union takes on two values. If union took on three values, however, think about how we would diagram the model. We would have two latent variables, and we would want 1.union#M2[idcode] and 2.union#M3[idcode]. If union took on three or more values, typing i.union#M2[idcode] simply would not produce the desired result.
[Path diagram: 1.union and grade have paths to ln_wage; idcode1 (M1[idcode]) has a path to ln_wage, and idcode2 (M2[idcode]) points to the path from 1.union to ln_wage.]
Let's call this model Simple formulation 1, as compared with the model that we fit in the previous section, which we will call Demonstrated formulation 1. We could have fit Simple formulation 1 by typing
. gsem (ln_wage <- i.union grade M1[idcode] 1.union#M2[idcode])
The corresponding within-and-between model, which we will call Simple formulation 2, would look
something like this:
[Path diagram: grade has a path to idcode1 (M1[idcode]); 1.union and idcode1 have paths to ln_wage, and idcode2 (M2[idcode]) points to the path from 1.union to ln_wage.]
In more complicated models, you can correlate the random intercepts and slopes even in a second
formulation. Consider Demonstrated formulation 2:
[Path diagram: grade has paths to both idcode1 (M1[idcode]) and idcode2 (M2[idcode]), whose errors are allowed to covary; 1.union and idcode1 have paths to ln_wage, and idcode2 points to the path from 1.union to ln_wage.]
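A command sketch matching this diagram, reconstructed from the output below (grade predicts both latent variables, and their errors are allowed to covary):

. gsem (ln_wage <- i.union 1.union#M2[idcode] M1[idcode]) (M1[idcode] M2[idcode] <- grade), covariance(e.M1[idcode]*e.M2[idcode])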
Generalized structural equation model           Number of obs      =      1904
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_wage <-   |
     1.union |    .119905   .1508188     0.80   0.427    -.1756945    .4155045
             |
  M1[idcode] |          1  (constrained)
             |
      union# |
  M2[idcode] |
          0  |          0  (omitted)
          1  |          1  (constrained)
             |
       _cons |   .7873884   .1086476     7.25   0.000     .5744431    1.000334
-------------+----------------------------------------------------------------
M1[idcode] <-|
       grade |   .0757883   .0081803     9.26   0.000     .0597552    .0918215
-------------+----------------------------------------------------------------
M2[idcode] <-|
       grade |   .0019983   .0113534     0.18   0.860     -.020254    .0242506
-------------+----------------------------------------------------------------
 var(e.M1[idcode]) |   .092793   .0088245                 .0770136    .1118055
 var(e.M2[idcode]) |  .0823065    .018622                  .052826    .1282392
-------------+----------------------------------------------------------------
 cov(e.M2[idcode], |
    e.M1[idcode])  | -.0549822   .0116103   -4.74  0.000   -.077738  -.0322263
-------------+----------------------------------------------------------------
     var(e.ln_w~e) |  .0720873   .0027135                 .0669603    .0776068
------------------------------------------------------------------------------
Note:
1. Results of the above are identical to the results of Demonstrated formulation 1.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting idcode > Observations in the next control.
d. Specify M1 as the Base name.
e. Click on OK.
7. Create the paths from the exogenous variables to ln_wage.
a. Select the Add Path tool,
b. Click in the right side of the 1.union rectangle (it will highlight when you hover over it), and drag a path to the left side of the ln_wage rectangle (it will highlight when you can release to connect the path).
c. Continuing with the tool, draw paths from the right side of the grade rectangle to the left side of the ln_wage rectangle and from the right side of the idcode1 double oval to the left side of the ln_wage rectangle.
8. Clean up the location of the paths.
If you do not like where the paths have been connected to the rectangles or oval, use the Select
tool, , to click on the path, and then simply click on where it connects to a rectangle or oval
and drag the endpoint.
9. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_rint
4. Increase the width of the observed variable rectangles to accommodate the length of the name
of the interaction term.
From the SEM Builder menu, select Settings > Variables > All Observed....
In the resulting dialog box, change the first size to 1 and click on OK.
5. Create the endogenous variable.
a. Select the Add Observed Variable tool, and then click in the diagram about one-third of the way in from the right and one-third of the way up from the bottom. After adding it, you can click inside the rectangle and move the variable if you wish.
b. In the Contextual Toolbar, select ln wage with the Variable control.
6. Create the observed exogenous variables.
Select the Add Observed Variables Set tool, and then click in the diagram about one-third of the way in from the left and one-third of the way up from the bottom.
In the resulting dialog box,
a. select the Select variables radio button (it may already be selected);
b. type 1.union in the Variables control (typing 1.union rather than using the button to create i.union prevents the rectangle corresponding to the base category for this binary variable from being created);
c. use the Variables control and select grade;
d. type 1.union#c.grade in the Variables control after grade;
e. select Vertical in the Orientation control;
f. click on OK.
If you wish, move the set of variables by clicking on any variable and dragging it.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting idcode > Observations in the next control.
d. Specify M1 as the Base name.
e. Click on OK.
7. Create the paths from the exogenous variables to ln wage.
a. Select the Add Path tool,
b. Click in the right side of the 1.union rectangle (it will highlight when you hover over it),
and drag a path to the left side of the ln wage rectangle (it will highlight when you can
release to connect the path).
c. Continuing with the tool, draw paths from the right sides of the grade and 1.union#c.grade rectangles to the left side of the ln_wage rectangle and from the bottom of the idcode1 double oval to the top of the ln_wage rectangle.
8. Create the random slope.
a. Select the Add Multilevel Latent Variable tool, and then click in the diagram next to ln_wage.
b. In the Contextual Toolbar, click on the button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting idcode > Observations in the next control.
d. Specify M2 as the Base name.
e. Click on OK.
f. Select the Add Path tool,
g. Click in the bottom of the idcode2 double oval, and drag a path to the path between 1.union and ln_wage.
9. Create the covariance between the random slope and random intercept.
a. Select the Add Covariance tool,
b. Click in the top-right quadrant of the idcode2 double oval, and drag a covariance to the
top left of the idcode1 double oval.
10. Clean up paths and covariance.
If you do not like where a path has been connected to its variables, use the Select tool,
, to
click on the path, and then simply click on where it connects to a rectangle and drag the endpoint.
Similarly, you can change where the covariance connects to the latent variables by clicking on
the covariance and dragging the endpoint. You can also change the bow of the covariance by
clicking on the covariance and dragging the control point that extends from one end of the
selected covariance.
11. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_rslope
Reference
Center for Human Resource Research. 1989. National Longitudinal Survey of Labor Market Experience, Young Women 14-24 years of age in 1968. Columbus, OH: Ohio State University Press.
Also see
[SEM] example 39g Three-level model (multilevel, generalized response)
[SEM] example 42g One- and two-level mediation models (multilevel)
Title
example 39g Three-level model (multilevel, generalized response)
Description
References
Also see
Description
To demonstrate three-level models, we use the following data:
. use http://www.stata-press.com/data/r13/gsem_melanoma
(Skin cancer (melanoma) data)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_melanoma.dta
  obs:           354                          Skin cancer (melanoma) data
 vars:             6                          25 Mar 2013 15:28
 size:         4,956                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------
nation          byte    %12.0g     nation     Nation ID
region          byte    %9.0g                 Region ID: EEC level-I areas
county          int     %9.0g                 County ID: EEC level-II/level-III
                                                areas
deaths          int     %9.0g                 No. deaths during 1971-1980
expected        float   %9.0g                 No. expected deaths
uv              float   %9.0g                 UV dose, mean-centered
------------------------------------------------------------------------
Rabe-Hesketh and Skrondal (2012, exercise 13.7) describe data from the Atlas of Cancer Mortality in the European Economic Community (EEC) (Smans, Mair, and Boyle 1993). The data were analyzed in Langford, Bentham, and McDonald (1998) and record the number of deaths among males due to malignant melanoma during 1971-1980.
Data are stored in the long form. Observations are counties within regions within nations. These data and some of the models fit below are also demonstrated in [ME] menbreg.
See Structural models 4: Count models and Multilevel mixed-effects models in [SEM] intro 5 for background.
[Path diagram: uv and the double-ringed latent variables nation1 (M1[nation]) and region2 (M2[nation>region]) each have a path to the generalized response deaths, with family nbinomial mean and link log.]
Deaths due to malignant melanoma at the county level are modeled as being affected by ultraviolet
exposure with random region and nation effects.
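In command syntax, and given notes 2 and 5 below (which tell us the command used nbreg and exposure(expected)), the command takes the form

. gsem (deaths <- uv M1[nation] M2[nation>region], nbreg exposure(expected))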
Iteration 0:   log likelihood = -1209.6951  (not concave)
Iteration 1:   log likelihood = -1195.0761  (not concave)
Iteration 2:   log likelihood = -1189.7235  (not concave)
Iteration 3:   log likelihood =   -1167.58  (not concave)
Iteration 4:   log likelihood = -1145.4325  (not concave)
Iteration 5:   log likelihood = -1138.4471
Iteration 6:   log likelihood = -1088.3882
Iteration 7:   log likelihood = -1086.7992
Iteration 8:   log likelihood = -1086.4085
Iteration 9:   log likelihood = -1086.3903
Iteration 10:  log likelihood = -1086.3902
Iteration 11:  log likelihood = -1086.3902
Generalized structural equation model           Number of obs      =       354
Log likelihood = -1086.3902
 ( 1)  [deaths]M1[nation] = 1
 ( 2)  [deaths]M2[nation>region] = 1

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
deaths <-    |
          uv |  -.0335933   .0113725    -2.95   0.003     -.055883   -.0113035
             |
  M1[nation] |          1  (constrained)
  M2[nation> |
     region] |          1  (constrained)
             |
       _cons |  -.0790606   .1295931    -0.61   0.542    -.3330583    .1749372
ln(expected) |          1  (exposure)
-------------+----------------------------------------------------------------
deaths       |
    /lnalpha |  -4.182603   .3415036   -12.25   0.000    -4.851937   -3.513268
-------------+----------------------------------------------------------------
  var(M1[nation])  |  .1283614   .0678971                 .0455187    .3619758
  var(M2[nation>   |
      region])     |  .0401818   .0104855                 .0240938     .067012
------------------------------------------------------------------------------
Notes:
1. This is a three-level model of counties nested within region nested within nation, so we specified
the latent variables as M1[nation] M2[nation>region]. Actually, we did the same thing in the
diagram when we used the SEM Builder to define the latent variables, but the nesting information
does not show in the double rings.
2. We fit this model by using negative binomial regression, also known as a mean-dispersion
model. In the command, we typed nbreg, which is shorthand for family(nbinomial mean)
link(log).
3. A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables, where said gamma distribution has mean 1 and variance α. The estimated ln(α) is −4.183, which is small; α is estimated as 0.0153. The reported test statistic of −12.25 is significant at better than the 1% level, but it is a test of ln(α) = 0, that is, a test of α = 1, not of α = 0.
4. A small α does not mean lack of overdispersion, because we are including random effects that also allow for extra dispersion. For a discussion of these issues, see [ME] menbreg.
5. Notice that we specified exposure(expected), where variable expected contains the expected number of deaths based on crude rates.
The exposure() option is allowed with Poisson and negative binomial models. If we specify exposure(varname), we are usually saying that each observation's time at risk is recorded in variable varname. When we omit the option, we are saying that each observation has the same time at risk. Obviously, if one observation had twice the time at risk of another observation but was otherwise identical, we would expect twice the number of events in the first observation.
In this case, however, we are using exposure() differently. We have a variable called expected containing the expected number of deaths from crude rates, and we are claiming exposure(expected). This says that in two otherwise identical observations, if the number of expected deaths differed, we would expect the number of deaths due to melanoma to differ too, and by the same proportion. See [SEM] gsem family-and-link options.
[Path diagram: the same model with family Poisson, link log: deaths regressed on uv, with double-ringed latent variables nation1 and region2.]
. gsem (deaths <- uv M1[nation] M2[nation>region]), poisson exposure(expected)
  (iteration log omitted)

Generalized structural equation model           Number of obs   =        354

                        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

deaths <-
            uv      -.0282041   .0113998    -2.47   0.013    -.0505473   -.0058608
    M1[nation]              1  (constrained)
    M2[nation>
       region]              1  (constrained)
         _cons      -.0639672   .1335515    -0.48   0.632    -.3257234     .197789
  ln(expected)              1  (exposure)

var(M1[nation])      .1371732   .0723303                       .048802    .3855676
var(M2[nation>
       region])      .0483483   .0109079                      .0310699    .0752353
389
We can reject at any reasonable level that the Poisson model adequately accounts for the dispersion
in these data. Be aware that this test is conservative, because we are testing whether a variance goes
to 0. lrtest usually issues a warning in such cases, but lrtest does not know that the relationship
between negative binomial regression and Poisson regression involves a variance going to 0.
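A minimal sketch of how such a comparison can be run, assuming both models are fit and stored first (the names nbreg and poisson below are arbitrary labels for the stored estimates):

. gsem (deaths <- uv M1[nation] M2[nation>region]), nbreg exposure(expected)
. estimates store nbreg
. gsem (deaths <- uv M1[nation] M2[nation>region]), poisson exposure(expected)
. estimates store poisson
. lrtest nbreg poisson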
b. Click in the diagram about one-third of the way in from the right and one-fourth of the way
up from the bottom.
c. In the Contextual Toolbar, select Nbinomial mean, Log in the Family/Link control.
d. In the Contextual Toolbar, select deaths in the Variable control.
5. Create the observed exogenous variable.
a. Select the Add Observed Variable tool, and then click in the diagram about one-third
of the way in from the left and one-fourth of the way up from the bottom.
b. In the Contextual Toolbar, select uv with the Variable control.
6. Create the level-three latent variable.
a. Select the Add Multilevel Latent Variable tool, and click in the diagram
about one-fourth of the way down from the top.
b. In the Contextual Toolbar, click on the
button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting nation > Observations in the next line.
b. In the Contextual Toolbar, click on the button.
c. Select the nesting level and nesting variable by selecting 3 from the Nesting depth control
and selecting nation > region > Observations in the next control.
d. Specify M2 as the Base name.
e. Click on OK.
8. Create the paths from the exogenous variables to deaths.
a. Select the Add Path tool,
b. Click in the right side of the uv rectangle (it will highlight when you hover over it), and
drag a path to the left side of the deaths rectangle (it will highlight when you can release
to connect the path).
c. Continuing with the
tool, draw paths from the right side of the double ovals for nation1
and region2 to the left side of the deaths rectangle.
9. Specify the level of exposure.
Use the Select tool,
, and double-click in the deaths rectangle. In the resulting dialog box,
select expected in the Exposure control, and click on OK.
10. Clean up the location of the paths.
If you do not like where the paths have been connected to the rectangles or oval, use the Select
tool, , to click on the path, and then simply click on where it connects to a rectangle or oval
and drag the endpoint.
11. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_3lev
References
Langford, I. H., G. Bentham, and A. McDonald. 1998. Multi-level modelling of geographically aggregated health data: A case study on malignant melanoma mortality and UV exposure in the European community. Statistics in Medicine 17: 41–57.
Rabe-Hesketh, S., and A. Skrondal. 2012. Multilevel and Longitudinal Modeling Using Stata. 3rd ed. College Station, TX: Stata Press.
Smans, M., C. S. Mair, and P. Boyle. 1993. Atlas of Cancer Mortality in the European Economic Community. Lyon, France: IARC Scientific Publications.
Also see
[SEM] example 38g Random-intercept and random-slope models (multilevel)
[SEM] example 34g Combined models (generalized responses)
Title
example 40g Crossed models (multilevel)
Description      Reference      Also see
Description
To illustrate crossed models, we use
. use http://www.stata-press.com/data/r13/gsem_fifeschool
(School data from Fife, Scotland)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_fifeschool.dta
  obs:         3,435                          School data from Fife, Scotland
 vars:             5                          25 Mar 2013 16:17
 size:        24,045                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label

pid             int     %9.0g                 Primary school ID
sid             byte    %9.0g                 Secondary school ID
attain          byte    %9.0g                 Attainment score at age 16
vrq             int     %9.0g                 Verbal-reasoning score from final
                                                year of primary school
sex             byte    %9.0g                 1: female; 0: male

Sorted by:  pid  sid

. notes

_dta:
  1.  Paterson, L. 1991. "Socio-economic status and education attainment: A
      multidimensional and multilevel study" in _Evaluation and Research in
      Education_ 5: 97-121.
  2.  Each observation is a different student. Each student attended a primary
      school (pid) and a secondary school (sid).
  3.  pid and sid are crossed, not nested. All combinations are possible.
Rabe-Hesketh and Skrondal (2012, 443–460) give an introduction to crossed-effects models and provide other examples of crossed-effects models by using the school data from Fife, Scotland.
See Structural models 1: Linear regression and Multilevel mixed-effects models in [SEM] intro 5
for background.
[Path diagram: attain regressed on 1.sex, with double-ringed latent variables pid1 and sid2.]
We include latent (random) effects for primary and secondary school because we think that school
identities may have an effect.
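In command syntax, this model can be fit by typing something like the following (note 2 below shows the path specification):

. gsem (attain <- i.sex M1[pid] M2[sid])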
Generalized structural equation model           Number of obs   =       3435

                     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

attain <-
        1.sex     .4986152   .0982634     5.07   0.000     .3060224    .6912079
      M1[pid]            1  (constrained)
      M2[sid]            1  (constrained)
        _cons     5.257372   .1813784    28.99   0.000     4.901876    5.612867

 var(M1[pid])     1.104316   .2022595                      .7712451    1.581226
 var(M2[sid])     .3457086   .1608679                      .1388744    .8605937

var(e.attain)     8.053437   .1990023                      7.672694    8.453074

Notes:
Notes:
1. These data are not nested, but the diagram above would look the same even if they were. The
fact that primary and secondary schools are crossed and not nested is, however, specified when
we enter the model into the SEM Builder and is implicit in the command syntax.
2. We typed attain <- i.sex M1[pid] M2[sid]. We would have typed attain <- i.sex
M1[pid] M2[sid<pid] had secondary school been nested within primary school.
3. gsem produced the following note when it began estimation: crossed random effects detected; option intmethod(laplace) assumed. gsem provides four integration methods. The default is mvaghermite, which stands for mean-variance adaptive Gauss–Hermite quadrature. The others are mcaghermite (mode-curvature adaptive Gauss–Hermite quadrature); ghermite (nonadaptive Gauss–Hermite quadrature); and laplace (Laplacian approximation).
In general, the adaptive methods mvaghermite and mcaghermite are considered superior in accuracy to the nonadaptive method ghermite, which is considered superior to the approximation method laplace. The more accurate methods also take longer.
Fitting crossed models can be difficult. You may specify intmethod() with one of the superior methods, as sketched below, but be aware that convergence may not be achieved in a reasonable amount of time.
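For example, one of the adaptive methods can be requested by adding the intmethod() option to the command (a sketch; expect a long run time):

. gsem (attain <- i.sex M1[pid] M2[sid]), intmethod(mvaghermite)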
6. Create the pid-level latent variable.
a. Select the Add Multilevel Latent Variable tool,
b. In the Contextual Toolbar, click on the button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting pid > Observations in the next control.
d. Specify M1 as the Base name.
e. Click on OK.
7. Create the sid-level latent variable.
a. Select the Add Multilevel Latent Variable tool,
b. In the Contextual Toolbar, click on the
button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting sid > Observations in the next control.
d. Specify M2 as the Base name.
e. Click on OK.
b. Click in the right side of the 1.sex rectangle (it will highlight when you hover over it), and
drag a path to the left side of the attain rectangle (it will highlight when you can release
to connect the path).
c. Continuing with the tool, draw paths from the bottom of the sid2 double oval to the top
of the attain rectangle and from the top of the pid1 double oval to the bottom of the
attain rectangle.
9. Clean up the location of the paths.
If you do not like where the paths have been connected to the rectangles or oval, use the Select
tool, , to click on the path, and then simply click on where it connects to a rectangle or oval
and drag the endpoint.
10. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_cross
Reference
Rabe-Hesketh, S., and A. Skrondal. 2012. Multilevel and Longitudinal Modeling Using Stata. 3rd ed. College Station,
TX: Stata Press.
Also see
[SEM] example 38g Random-intercept and random-slope models (multilevel)
Title
example 41g Two-level multinomial logistic regression (multilevel)
Description      References      Also see
Description
We demonstrate two-level multinomial logistic regression with random effects by using the following
data:
. use http://www.stata-press.com/data/r13/gsem_lineup
(Fictional suspect identification data)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_lineup.dta
  obs:         6,535                          Fictional suspect
                                                identification data
 vars:             6                          29 Mar 2013 10:35
 size:       156,840                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label

suspect         float   %9.0g                 suspect id
suswhite        float   %9.0g                 suspect is white
violent         float   %9.0g                 violent crime
location        float   %14.0g     loc        lineup location
witmale         float   %9.0g                 witness is male
chosen          float   %9.0g      choice     individual identified in lineup
                                                by witness
. tabulate location

         lineup |
       location |      Freq.     Percent        Cum.
----------------+-----------------------------------
 police_station |      2,228       34.09       34.09
        suite_1 |      1,845       28.23       62.33
        suite_2 |      2,462       37.67      100.00
----------------+-----------------------------------
          Total |      6,535      100.00
. tabulate chosen

         chosen |      Freq.     Percent        Cum.
----------------+-----------------------------------
           none |      2,811       43.01       43.01
           foil |      1,369       20.95       63.96
        suspect |      2,355       36.04      100.00
----------------+-----------------------------------
          Total |      6,535      100.00
In what follows, we re-create results similar to those of Wright and Sparks (1994), but we use
fictional data. These data resemble the real data used by the authors in proportion of observations
having each level of the outcome variable chosen, and the data produce results similar to those
presented by the authors.
See Structural models 6: Multinomial logistic regression and Multilevel mixed-effects models in
[SEM] intro 5 for background.
For additional discussion of fitting multilevel multinomial logistic regression models, see Skrondal
and Rabe-Hesketh (2003).
[Path diagram: 2.chosen and 3.chosen (family multinomial, link logit) each regressed on 1b.location, 2.location, 3.location, 1.suswhite, 1.witmale, and 1.violent; the double-ringed latent variable suspect1 points to both responses, with each path constrained to 1.]
This model concerns who is chosen in a police lineup. The response variables are 1.chosen,
2.chosen, and 3.chosen, meaning chosen = 1 (code for not chosen), chosen = 2 (code for foil
chosen), and chosen = 3 (code for suspect chosen). A foil is a stand-in who could not possibly be
guilty of the crime.
We say the response variables are 1.chosen, 2.chosen, and 3.chosen, but 1.chosen does not
even appear in the diagram. By its omission, we are specifying that chosen = 1 be treated as the
base mlogit category. There are other ways we could have drawn this; see [SEM] example 37g.
In these data, each suspect was viewed by multiple witnesses. In the model, we include a random
effect at the suspect level, and we constrain the effect to be equal for chosen values 2 and 3 (selecting
the foil or the suspect).
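In command syntax, this first model can be specified along the following lines; the @1 constraints force the common random effect, matching the constrained coefficients in the output below (treat this as a sketch):

. gsem (2.chosen <- i.location i.suswhite i.witmale i.violent M1[suspect]@1)
>      (3.chosen <- i.location i.suswhite i.witmale i.violent M1[suspect]@1), mlogit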
Generalized structural equation model           Number of obs   =       6535

                      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

1.chosen              (base outcome)

2.chosen <-
     location
      suite_1      .3867066   .1027161     3.76   0.000     .1853868    .5880264
      suite_2      .4915675   .0980312     5.01   0.000     .2994299    .6837051

   1.suswhite     -.0275501   .0751664    -0.37   0.714    -.1748736    .1197734
    1.witmale     -.0001844   .0680803    -0.00   0.998    -.1336193    .1332505
    1.violent      .0356477   .0773658     0.46   0.645    -.1159864    .1872819
  M1[suspect]             1  (constrained)
        _cons     -1.002334    .099323   -10.09   0.000    -1.197003   -.8076643

3.chosen <-
     location
      suite_1     -.2832042   .0936358    -3.02   0.002    -.4667271   -.0996814
      suite_2      .1391796   .0863473     1.61   0.107    -.0300581    .3084172

   1.suswhite     -.2397561   .0643075    -3.73   0.000    -.3657965   -.1137158
    1.witmale      .1419285    .059316     2.39   0.017     .0256712    .2581857
    1.violent     -1.376579   .0885126   -15.55   0.000     -1.55006   -1.203097
  M1[suspect]             1  (constrained)
        _cons      .1781047   .0833393     2.14   0.033     .0147627    .3414468

var(M1[suspect])   .2538014   .0427302                      .1824673    .3530228
Notes:
1. We show the interpretation of mlogit coefficients in [SEM] example 37g.
2. The estimated variance of the random effect is 0.2538, implying a standard deviation of 0.5038. Thus a 1-standard-deviation change in the random effect amounts to an exp(0.5038) = 1.655 multiplicative change in the relative-risk ratio. The effect is both practically significant and, from the output, statistically significant.
3. This is not the model fit by Wright and Sparks (1994). Those authors did not constrain the
random effect to be the same for chosen equal to 2 and 3. They included separate but correlated
random effects, and then took that even a step further.
Two-level multinomial logistic model with separate but correlated random effects
The model we wish to fit is
[Path diagram: as above, but with separate double-ringed latent variables suspect1 (pointing to 2.chosen) and suspect2 (pointing to 3.chosen), connected by a covariance.]
This is one of the models fit by Wright and Sparks (1994), although remember that we are using
fictional data.
We can fit this model with command syntax by typing
. gsem (2.chosen <- i.location i.suswhite i.witmale i.violent M1[suspect]) ///
>      (3.chosen <- i.location i.suswhite i.witmale i.violent M2[suspect]), ///
>      mlogit
We did not even mention the assumed covariance between the random effects because latent exogenous
variables are assumed to be correlated in the command language. Even so, we can specify the cov()
option if we wish, and we might do so for emphasis or because we are unsure whether the parameter
would be included.
. gsem (2.chosen <- i.location i.suswhite i.witmale i.violent M1[suspect])
>      (3.chosen <- i.location i.suswhite i.witmale i.violent M2[suspect]),
>      cov(M1[suspect]*M2[suspect]) mlogit
Fitting fixed-effects model:
  (output omitted)
Generalized structural equation model           Number of obs   =       6535

                      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

1.chosen              (base outcome)

2.chosen <-
     location
      suite_1      .3881676   .1004754     3.86   0.000     .1912394    .5850958
      suite_2        .48938   .0960311     5.10   0.000     .3011625    .6775974

   1.suswhite     -.0260152   .0749378    -0.35   0.728    -.1728906    .1208602
    1.witmale     -.0007652   .0679187    -0.01   0.991    -.1338833     .132353
    1.violent      .0369381   .0771594     0.48   0.632    -.1142915    .1881677
  M1[suspect]             1  (constrained)
        _cons     -1.000382   .0992546   -10.08   0.000    -1.194918   -.8058469

3.chosen <-
     location
      suite_1     -.2904225   .0968578    -3.00   0.003    -.4802604   -.1005847
      suite_2      .1364246    .089282     1.53   0.127    -.0385649    .3114142

   1.suswhite     -.2437654   .0647275    -3.77   0.000     -.370629   -.1169018
    1.witmale       .139826   .0596884     2.34   0.019     .0228389     .256813
    1.violent     -1.388013   .0891863   -15.56   0.000    -1.562815   -1.213212
  M2[suspect]             1  (constrained)
        _cons      .1750622   .0851614     2.06   0.040      .008149    .3419754

var(M1[suspect])   .2168248   .0549321                       .131965    .3562533
var(M2[suspect])   .2978104   .0527634                      .2104416     .421452

cov(M2[suspect],
    M1[suspect])   .2329749   .0438721     5.31   0.000     .1469872    .3189627
Notes:
1. The estimated variances of the two random effects are 0.2168 and 0.2978, which, as explained in the second note of the previous example, are both practically and statistically significant.
2. The covariance is estimated to be 0.2330. Therefore, 0.2330/√(0.2168 × 0.2978) = 0.9170 is the estimated correlation.
3. Wright and Sparks (1994) were interested in whether the location of the lineup mattered. They found that it did, and that foils were more likely to be chosen at lineups outside of the police station (at the two specialist suites). They speculated the cause might be that the police at the station strongly warn witnesses against misidentification, or possibly that the specialist suites had better foils.
b. Click in the right side of the 1b.location rectangle (it will highlight when you hover over
it), and drag a path to the left side of the 2.chosen rectangle (it will highlight when you
can release to connect the path).
c. Continuing with the tool, click in the right side of each independent variable and drag a
path to both the 2.chosen and 3.chosen rectangles.
7. Create the suspect-level latent variable.
a. Select the Add Multilevel Latent Variable tool,
, and click near the right side of the
diagram, vertically centered between 2.chosen and 3.chosen.
b. In the Contextual Toolbar, click on the
button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting suspect > Observations in the next control.
d. Specify M1 as the Base name.
e. Click on OK.
8. Create the paths from the multilevel latent variable to the rectangles for outcomes chosen = 2
and chosen = 3.
a. Select the Add Path tool,
b. Click in the upper-left quadrant of the suspect1 double oval, and drag a path to the right
side of the 2.chosen rectangle.
c. Continuing with the
tool, click in the lower-left quadrant of the suspect1 double oval,
and drag a path to the right side of the 3.chosen rectangle.
9. Place constraints on path coefficients from the multilevel latent variable.
Use the Select tool,
, to select the path from the suspect1 double oval to the 2.chosen
rectangle. Type 1 in the
box in the Contextual Toolbar and press Enter. Repeat this process
to constrain the coefficient on the path from the suspect1 double oval to the 3.chosen rectangle
to 1.
13. Create the second suspect-level latent variable.
a. Select the Add Multilevel Latent Variable tool,
b. In the Contextual Toolbar, click on the button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting suspect > Observations in the next control.
d. Specify M2 as the Base name.
e. Click on OK.
14. Draw a path from the newly added suspect-level latent variable to 3.chosen.
Select the Add Path tool, click in the left of the suspect2 double oval, and drag a path to the
right side of the 3.chosen rectangle.
15. Create the covariance between the random effects.
a. Select the Add Covariance tool,
b. Click in the bottom-right quadrant of the suspect1 double oval, and drag a covariance to
the top right of the suspect2 double oval.
16. Clean up paths and covariance.
If you do not like where a path has been connected to its variables, use the Select tool,
, to
click on the path, and then simply click on where it connects to a rectangle and drag the endpoint.
Similarly, you can change where the covariance connects to the latent variables by clicking on
the covariance and dragging the endpoint. You can also change the bow of the covariance by
clicking on the covariance and dragging the control point that extends from one end of the
selected covariance.
17. Estimate again.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram for the first model in the Builder by typing
. webgetsem gsem_mlmlogit1
You can open a completed diagram for the second model in the Builder by typing
. webgetsem gsem_mlmlogit2
References
Skrondal, A., and S. Rabe-Hesketh. 2003. Multilevel logistic regression for polytomous data and rankings. Psychometrika 68: 267–287.
Wright, D. B., and A. T. Sparks. 1994. Using multilevel multinomial regression to analyse line-up data. Multilevel Modelling Newsletter 6: 8–10.
Also see
[SEM] example 37g Multinomial logistic regression
[SEM] example 38g Random-intercept and random-slope models (multilevel)
Title
example 42g One- and two-level mediation models (multilevel)
Description      References      Also see
Description
To demonstrate linear mediation models, we use the following data:
. use http://www.stata-press.com/data/r13/gsem_multmed
(Fictional job-performance data)
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      branch |      1500          38    21.65593          1         75
     support |      1500    .0084667    .5058316       -1.6        1.8
       satis |      1500       .0212    .6087235       -1.6          2
     perform |      1500    5.005317    .8949845    2.35022   8.084294
. notes
_dta:
1. Fictional data on job performance, job satisfaction, and perceived
support from managers for 1,500 sales employees of a large department
store in 75 locations.
2. Variable support is average of Likert-scale questions, each question
scored from -2 to 2.
3. Variable satis is average of Likert-scale questions, each question scored
from -2 to 2.
4. Variable perform is job performance measured on continuous scale.
See Structural models 1: Linear regression and Multilevel mixed-effects models in [SEM] intro 5
for background.
The model we wish to fit is the simplest form of a mediation model, namely,
[Path diagram: support -> satis -> perform, with a direct path from support to perform.]
We are interested in the effect of managerial support on job performance, but we suspect a portion
of the effect might be mediated through job satisfaction. In traditional mediation analysis, the model
would be fit by a series of linear regression models as described in Baron and Kenny (1986). That
approach is sufficient because the errors are not correlated. The advantage of using structural equation
modeling is that you can fit a single model and estimate the indirect and total effects, and you can
embed the simple mediation model in a larger model and even use latent variables to measure any
piece of the mediation model.
To fit this model with the command syntax, we type
. sem (perform <- satis support) (satis <- support)
Endogenous variables
Observed: perform satis
Exogenous variables
Observed: support
Fitting target model:
Iteration 0:
log likelihood = -3779.9224
Iteration 1:
log likelihood = -3779.9224
Structural equation model                       Number of obs   =       1500
Estimation method  = ml
Log likelihood     = -3779.9224

                              OIM
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Structural
  perform <-
      satis    .8984401   .0251903    35.67   0.000      .849068    .9478123
    support    .6161077   .0303143    20.32   0.000     .5566927    .6755227
      _cons    4.981054   .0150589   330.77   0.000     4.951539    5.010569

  satis <-
    support    .2288945   .0305047     7.50   0.000     .1691064    .2886826
      _cons     .019262   .0154273     1.25   0.212    -.0109749    .0494989

var(e.perf~m)  .3397087   .0124044                      .3162461     .364912
 var(e.satis)  .3569007   .0130322                      .3322507    .3833795
Notes:
1. The direct effect of managerial support on job performance is measured by perform <- support
and is estimated to be 0.6161. The effect is small albeit highly statistically significant. The
standard deviations of performance and support are 0.89 and 0.51. A one standard deviation
increase in support improves performance by a third of a standard deviation.
2. The direct effect of job satisfaction on job performance is measured by perform <- satis and
is estimated to be 0.8984. That also is a moderate effect, practically speaking, and is highly
statistically significant.
3. The effect of managerial support on job satisfaction is measured by satis <- support and is
practically small but statistically significant.
4. What is the total effect of managerial support on performance? It is the direct effect (0.6161) plus the indirect effect of support on satisfaction on performance (0.2289 × 0.8984 = 0.2056), meaning the total effect is 0.8217. It would be desirable to put a standard error on that, but that's more work.
We can use estat teffects after estimation to obtain the total effect and its standard error:
. estat teffects
Direct effects

                              OIM
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Structural
  perform <-
      satis    .8984401   .0251903    35.67   0.000      .849068    .9478123
    support    .6161077   .0303143    20.32   0.000     .5566927    .6755227

  satis <-
    support    .2288945   .0305047     7.50   0.000     .1691064    .2886826

Indirect effects

                              OIM
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Structural
  perform <-
      satis           0  (no path)
    support     .205648   .0280066     7.34   0.000      .150756      .26054

  satis <-
    support           0  (no path)

Total effects

                              OIM
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Structural
  perform <-
      satis    .8984401   .0251903    35.67   0.000      .849068    .9478123
    support    .8217557   .0404579    20.31   0.000     .7424597    .9010516

  satis <-
    support    .2288945   .0305047     7.50   0.000     .1691064    .2886826
The same model can be fit with gsem:

. gsem (perform <- satis support) (satis <- support)
  (iteration log omitted)

Generalized structural equation model           Number of obs   =       1500

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

perform <-
      satis    .8984401   .0251903    35.67   0.000      .849068    .9478123
    support    .6161077   .0303143    20.32   0.000     .5566927    .6755227
      _cons    4.981054   .0150589   330.77   0.000     4.951539    5.010569

satis <-
    support    .2288945   .0305047     7.50   0.000     .1691064    .2886826
      _cons     .019262   .0154273     1.25   0.212    -.0109749    .0494989

var(e.perf~m)  .3397087   .0124044                      .3162461     .364912
 var(e.satis)  .3569007   .0130322                      .3322507    .3833795
We can, however, calculate the indirect and total effects for ourselves and obtain the standard errors by using nlcom. Referring back to note 4 of the previous section, the formulas for the indirect and total effects are

       indirect effect = β1 × β4
       total effect    = β2 + β1 × β4

where

       β1 = _b[perform:satis]
       β2 = _b[perform:support]
       β4 = _b[satis:support]

which is most easily revealed by typing

. gsem, coeflegend
  (output omitted)
. nlcom (indirect: _b[perform:satis]*_b[satis:support])

    indirect:  _b[perform:satis]*_b[satis:support]

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

   indirect     .205648   .0280066     7.34   0.000      .150756      .26054

. nlcom (total: _b[perform:support] + _b[perform:satis]*_b[satis:support])

       total:  _b[perform:support] + _b[perform:satis]*_b[satis:support]

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

      total    .8217557   .0404579    20.31   0.000     .7424597    .9010516
[Path diagram: the same mediation model with double-ringed latent variables branch1 (pointing to satis) and branch2 (pointing to perform).]
In this model, we include a random intercept in each equation at the branch (individual store)
level. The model above is one of many variations on two-level mediation models; see Krull and
MacKinnon (2001) for an introduction to multilevel mediation models, and see Preacher, Zyphur, and
Zhang (2010) for a discussion of fitting these models with structural equation modeling.
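In command syntax, this two-level model can be fit by typing something like the following (the random intercepts M1[branch] and M2[branch] correspond to the double ovals in the diagram):

. gsem (perform <- satis support M1[branch]) (satis <- support M2[branch])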
Generalized structural equation model           Number of obs   =       1500

                   Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

perform <-
       satis     .604264   .0336398    17.96   0.000     .5383313    .6701968
     support    .6981525   .0250432    27.88   0.000     .6490687    .7472364
  M1[branch]           1  (constrained)
       _cons    4.986596   .0489465   101.88   0.000     4.890663    5.082529

satis <-
     support    .2692633   .0179649    14.99   0.000     .2340528    .3044739
  M2[branch]           1  (constrained)
       _cons    .0189202   .0570868     0.33   0.740    -.0929678    .1308083

var(M1[branch])  .1695962  .0302866                       .119511    .2406713
var(M2[branch])  .2384738  .0399154                      .1717781    .3310652

var(e.perf~m)    .201053   .0075451                      .1867957    .2163985
 var(e.satis)   .1188436   .0044523                      .1104299    .1278983

Notes:
Notes:
1. In One-level model with sem above, we measured the direct effects on job performance of job
satisfaction and managerial support as 0.8984 and 0.6161. Now the direct effects are 0.6043 and
0.6982.
2. We can calculate the indirect and total effects just as we did in the previous section, which
we will do below. We mentioned earlier that there are other variations of two-level mediation
models, and how you calculate total effects depends on the model chosen.
. nlcom (indirect: _b[perform:satis]*_b[satis:support])

    indirect:  _b[perform:satis]*_b[satis:support]

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

   indirect    .1627062   .0141382    11.51   0.000     .1349958    .1904165

. nlcom (total: _b[perform:support] + _b[perform:satis]*_b[satis:support])

       total:  _b[perform:support] + _b[perform:satis]*_b[satis:support]

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

      total    .8608587   .0257501    33.43   0.000     .8103894     .911328
b. Click in the upper right of the support rectangle (it will highlight when you hover over
it), and drag a path to the lower left of the satis rectangle (it will highlight when you can
release to connect the path).
c. Continuing with the
tool, draw a path from the lower right of the satis rectangle to
the upper left of the perform rectangle.
6. Clean up the direction of the error term.
We want the error for each of the endogenous variables to be to the right of the rectangle. The
error for satis may have been created in another direction. If so,
a. choose the Select tool,
10. Create the multilevel latent variable corresponding to the random intercept for satis.
a. Select the Add Multilevel Latent Variable tool,
b. In the Contextual Toolbar, click on the button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting branch > Observations in the next control.
d. Specify M1 as the Base name.
e. Click on OK.
11. Create the multilevel latent variable corresponding to the random intercept for perform.
a. Select the Add Multilevel Latent Variable tool, and click to the right of the branch1 double oval.
b. In the Contextual Toolbar, click on the
button.
c. Select the nesting level and nesting variable by selecting 2 from the Nesting depth control
and selecting branch > Observations in the next control.
d. Specify M2 as the Base name.
e. Click on OK.
12. Draw paths from the multilevel latent variables to their corresponding endogenous variables.
a. Select the Add Path tool,
b. Click in the bottom of the branch1 double oval, and drag a path to the top of the satis
rectangle.
c. Continuing with the
tool, click in the bottom of the branch2 double oval, and drag a
path to the top of the perform rectangle.
13. Estimate again.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram for the first model in the Builder by typing
. webgetsem sem_med
You can open a completed diagram for the second model in the Builder by typing
. webgetsem gsem_mlmed
References
Baron, R. M., and D. A. Kenny. 1986. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51: 1173–1182.
Krull, J. L., and D. P. MacKinnon. 2001. Multilevel modeling of individual and group level mediated effects. Multivariate Behavioral Research 36: 249–277.
Preacher, K. J., M. J. Zyphur, and Z. Zhang. 2010. A general multilevel SEM framework for assessing multilevel mediation. Psychological Methods 15: 209–233.
Also see
[SEM] example 38g Random-intercept and random-slope models (multilevel)
Title
example 43g Tobit regression
Description      Also see
Description
Tobit regression is demonstrated using auto.dta:
. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. generate wgt = weight/1000
[Path diagram: mpg (family Gaussian, link identity) regressed on wgt.]
Censoring information does not appear in the path diagram by default. It can be added to the path
diagram by customizing the appearance of mpg in the Builder. The Builder reports the censoring
information for mpg in the Details pane.
Generalized structural equation model           Number of obs   =         74

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

mpg <-
        wgt    -6.87305    .700257    -9.82   0.000    -8.245529   -5.500572
      _cons    41.49856   2.058384    20.16   0.000      37.4642    45.53291

 var(e.mpg)    14.78942   2.817609                      10.18085    21.48414
Notes:
1. Reported coefficients match those reported by tobit.
2. Reported standard errors (SEs) differ very slightly from those reported by tobit.
3. gsem reports the point estimate of e.mpg as 14.78942. This is an estimate of σ², the error variance. tobit reports an estimate of σ as 3.845701. And √14.78942 = 3.8457.
b. Click in the diagram about one-fourth of the way in from the left and half of the way up
from the bottom.
c. In the Contextual Toolbar, use the Variable control to select the variable wgt.
5. Create the tobit response.
a. Select the Add Generalized Response Variable tool,
b. Click about one-third of the way in from the right side of the diagram, to the right of the
wgt rectangle.
c. In the Contextual Toolbar, select Gaussian, Identity in the Family/Link control (it may
already be selected).
d. In the Contextual Toolbar, use the Variable control to select the variable mpg.
e. In the Contextual Toolbar, click on the Properties button.
f. In the resulting Variable properties dialog box, click on the Censoring button in the Variable
tab.
g. In the resulting Censoring dialog box, select the Left censored radio button. In the resulting
Left censoring box below, select the Constant radio button (it may already be selected), and
type 17 in the Constant control.
h. Click on OK in the Censoring dialog box, and then click on OK in the Variable properties
dialog box. The Details pane will now show the Censoring information for mpg.
6. Create a path from the independent variable to the dependent variable.
a. Select the Add Path tool,
b. Click in the right side of the wgt rectangle (it will highlight when you hover over it), and
drag a path to the left side of the mpg rectangle (it will highlight when you can release to
connect the path).
7. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_tobit
Also see
[SEM] example 38g Random-intercept and random-slope models (multilevel)
[SEM] example 44g Interval regression
[SEM] example 45g Heckman selection model
[SEM] example 46g Endogenous treatment-effects model
Title
example 44g Interval regression
Description      Also see
Description
Interval regression is demonstrated using intregxmpl.dta:
. use http://www.stata-press.com/data/r13/intregxmpl
(Wages of women)
[Path diagram: wage1 (family Gaussian, link identity) regressed on age, c.age#c.age, nev_mar, rural, school, and tenure.]
Interval measure information does not appear in the path diagram by default. It can be added to the
path diagram by customizing the appearance of wage1 in the Builder. The Builder reports the interval
measure information for wage1 in the Details pane.
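In command syntax, the model can be fit by typing something like the following (note 1 below discusses the udepvar() suboption that supplies the upper interval bound, wage2):

. gsem (wage1 <- age c.age#c.age nev_mar rural school tenure), family(gaussian, udepvar(wage2))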
Iteration 0:   log likelihood = -856.59446
Iteration 1:   log likelihood = -856.33321
Iteration 2:   log likelihood = -856.33293
Iteration 3:   log likelihood = -856.33293
Generalized structural equation model           Number of obs   =        488

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

wage1 <-
        age    .7914438   .4433604     1.79   0.074    -.0775265    1.660414

c.age#c.age   -.0132624   .0073028    -1.82   0.069    -.0275757    .0010509

    nev_mar   -.2075022   .8119581    -0.26   0.798    -1.798911    1.383906
      rural   -3.043044   .7757324    -3.92   0.000    -4.563452   -1.522637
     school    1.334721   .1357873     9.83   0.000     1.068583    1.600859
     tenure    .8000664   .1045077     7.66   0.000     .5952351    1.004898
      _cons   -12.70238   6.367117    -1.99   0.046     -25.1817   -.2230584

var(e.wage1)   53.28454   3.693076                      46.51635     61.0375
Notes:
1. Just like intreg, gsem requires two dependent variables for fitting interval regression models.
The udepvar() suboption in family(gaussian) allows you to specify the dependent variable
containing the upper-limit values for the interval regression. Consequently, the dependent variable
participating in the path specification necessarily contains the lower-limit values.
2. Reported coefficients match those reported by intreg.
3. Reported standard errors (SEs) match those reported by intreg.
4. gsem reports the point estimate of e.wage1 as 53.28454. This is an estimate of σ², the error variance. intreg reports an estimate of σ as 7.299626. And √53.28454 = 7.299626.
b. Click about one-third of the way in from the right side of the diagram, to the right of the
nev mar rectangle.
c. In the Contextual Toolbar, select Gaussian, Identity in the Family/Link control (it may
already be selected).
d. In the Contextual Toolbar, use the Variable control to select the variable wage1.
e. In the Contextual Toolbar, click on the Properties button.
f. In the resulting Variable properties dialog box, click on the Censoring button in the Variable
tab.
g. In the resulting Censoring dialog box, select the Interval measured, depvar is lower boundary
radio button. In the resulting Interval measured box below, use the Upper bound control to
select the variable wage2.
h. Click on OK in the Censoring dialog box, and then click on OK in the Variable properties
dialog box. The Details pane will now show that wage1 is the lower bound and wage2 is
the upper bound of our interval measure.
6. Create paths from the independent variables to the dependent variable.
a. Select the Add Path tool,
b. Click in the right side of the age rectangle (it will highlight when you hover over it), and
drag a path to the left side of the wage1 rectangle (it will highlight when you can release
to connect the path).
c. Continuing with the tool, create the following paths by clicking first in the right side of
the rectangle for the independent variable and dragging it to the left side of the rectangle
for the dependent variable:
c.age#c.age -> wage1
nev_mar -> wage1
rural -> wage1
school -> wage1
tenure -> wage1
7. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_intreg
Also see
[SEM] example 38g Random-intercept and random-slope models (multilevel)
[SEM] example 43g Tobit regression
[SEM] example 45g Heckman selection model
[SEM] example 46g Endogenous treatment-effects model
Title
example 45g Heckman selection model
Description      References      Also see
Description
To demonstrate selection models, we will use the following data:
. use http://www.stata-press.com/data/r13/gsem_womenwk
(Fictional data on women and work)
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |      2000      36.208     8.28656         20         59
        educ |      2000      13.084    3.045912         10         20
     married |      2000       .6705    .4701492          0          1
    children |      2000      1.6445    1.398963          0          5
        wage |      1343    23.69217    6.305374    5.88497   45.80979
. notes
  (output omitted)
See Structural models 7: Dependencies between response variables and Structural models 8:
Unobserved inputs, outputs, or both in [SEM] intro 5 for background.
For those unfamiliar with this model, it deals with a continuous outcome that is observed only
when another equation determines that the observation is selected, and the errors of the two equations
are allowed to be correlated. Subjects often choose to participate in an event or medical trial or even
the labor market, and thus the outcome of interest might be correlated with the decision to participate.
Heckman won a Nobel Prize for this work.
The model is sometimes cast in terms of female labor supply, but it obviously has broader
application. Nevertheless, we will consider a female labor-supply example.
Women are offered employment at a wage of w,

       wi = Xiβ + εi

Not all women choose to work, and w is observed only for those women who do work. Women choose to work if

       Ziγ + νi > 0

where

       εi ~ N(0, σ²)
       νi ~ N(0, 1)
       corr(ε, ν) = ρ

More generally, we can think of this model as applying to any continuously measured outcome wi, which is observed only if Ziγ + νi > 0. The important feature of the model is that the errors νi of the selection equation and the errors εi of the observed-data equation are allowed to be correlated.
The Heckman selection model can be recast as a two-equation SEM: one linear regression (for the continuous outcome) and one censored regression (for selection), with a latent variable Li added to both equations. The latent variable is constrained to have variance 1 and to have coefficient 1 in the selection equation, leaving only the coefficient in the continuous-outcome equation to be estimated. For identification, the variance from the censored regression will be constrained to be equal to that of the linear regression. The results of doing this are the following:
1. Latent variable Li becomes the vehicle for carrying the correlation between the two equations.
2. All the parameters given above, namely, β, γ, σ², and ρ, can be recovered from the SEM estimates.
3. If we call the estimated parameters in the SEM formulation β′, γ′, and σ′², and let κ denote the coefficient on Li in the continuous-outcome equation, then

       β = β′
       γ = γ′/√(σ′² + 1)
       σ² = σ′² + κ²
       ρ = κ/√((σ′² + κ²)(σ′² + 1))

This parameterization places no restriction on the range or sign of ρ. See Skrondal and Rabe-Hesketh (2004, 107–108).
[Path diagram: selected (family Gaussian, link identity) regressed on married, children, educ, age, and latent variable L (path constrained to 1); wage regressed on educ, age, and L. The variance of L is constrained to 1.]
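The creation of selected and notselected is not shown above; a minimal sketch consistent with the tabulation that follows (selected is 0 when wage is observed, and notselected is 0 when it is not) is:

. generate selected = 0 if !missing(wage)
. generate notselected = 0 if missing(wage)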
. tabulate selected notselected, missing

           |      notselected
  selected |         0          . |     Total
-----------+----------------------+----------
         0 |         0      1,343 |     1,343
         . |       657          0 |       657
-----------+----------------------+----------
     Total |       657      1,343 |     2,000
Old-time Stata users may be worried that because wage is missing in so many observations, namely,
all those corresponding to nonworking women, there must be something special we need to do so
that gsem uses all the data. There is nothing special we need to do. gsem counts missing values on
an equation-by-equation basis, so it will use all the data for the censored regression part of the model
while simultaneously using only the working-woman subsample for the continuous-outcome (wage)
part of the model. We use all the data for the censored regression because gsem understands the
meaning of missing values in the censored dependent variables so long as one of them is nonmissing.
To fit this model in command syntax, we type
. gsem (wage <- educ age L)
>      (selected <- married children educ age L@1,
>          family(gaussian, udepvar(notselected))),
>      var(L@1 e.wage@a e.selected@a)
Fitting fixed-effects model:
Iteration 0:   log likelihood = -5568.1366
Iteration 1:   log likelihood = -5211.0882  (not concave)
Iteration 2:   log likelihood = -5209.4228  (not concave)
Iteration 3:   log likelihood = -5209.2214
Iteration 4:   log likelihood = -5209.1638
Iteration 5:   log likelihood = -5208.9052  (not concave)
Iteration 6:   log likelihood = -5208.9044  (not concave)
Iteration 7:   log likelihood = -5208.9042  (not concave)
Iteration 8:   log likelihood = -5208.904
Iteration 9:   log likelihood = -5208.9038

Refining starting values:
Grid node 0:   log likelihood = -5259.1366

Fitting full model:
Iteration 0:   log likelihood = -5557.2489  (not concave)
Iteration 1:   log likelihood = -5439.0882  (not concave)
Iteration 2:   log likelihood = -5285.2854
Iteration 3:   log likelihood = -5229.0964
Iteration 4:   log likelihood = -5179.3914
Iteration 5:   log likelihood = -5178.3235
Iteration 6:   log likelihood = -5178.3046
Iteration 7:   log likelihood = -5178.3046
Generalized structural equation model           Number of obs   =       2000
Log likelihood = -5178.3046
 ( 1)  [selected]L = 1
 ( 2)  [var(e.selected)]_cons - [var(e.wage)]_cons = 0
 ( 3)  [var(L)]_cons = 1

                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

wage <-
        educ     .9899509   .0532552    18.59   0.000     .8855727    1.094329
         age      .213128    .020602    10.34   0.000     .1727488    .2535073
           L     5.923733   .1846827    32.08   0.000     5.561761    6.285704
       _cons     .4859256   1.076867     0.45   0.652    -1.624696    2.596547

selected <-
     married      .624276   .1054324     5.92   0.000     .4176322    .8309197
    children      .615211   .0652008     9.44   0.000     .4874197    .7430023
        educ     .0781544   .0162868     4.80   0.000     .0462328     .110076
         age     .0511984    .006637     7.71   0.000     .0381901    .0642067
           L            1  (constrained)
       _cons    -3.493224   .3730411    -9.36   0.000    -4.224371   -2.762077

      var(L)            1  (constrained)
var(e.sele~d)    .9664716   .2689702                      .5601427    1.667552
 var(e.wage)     .9664716   .2689702                      .5601427    1.667552
Notes:
1. Some of the estimated coefficients and parameters above will match those reported by the heckman command, and others will not. The above parameters are in the transformed structural equation modeling metric. That metric can be transformed back to the Heckman metric, and then results will match. The relationship to the Heckman metric is

       β = β′
       γ = γ′/√(σ′² + 1)
       σ² = σ′² + κ²
       ρ = κ/√((σ′² + κ²)(σ′² + 1))

2. β refers to the coefficients on the continuous-outcome (wage) equation. We can read those coefficients directly, without transformation, except that we ignore the wage <- L path:

       wage = 0.9900 educ + 0.2131 age + 0.4859

3. γ refers to the selection equation, and because γ = γ′/√(σ′² + 1), we must divide the reported coefficients by the square root of σ′² + 1. What has happened here is that the scaled probit has variance σ′² + 1, and we are merely transforming back to the standard probit model, which has variance 1. The transformed selection results are obtained below with nlcom.
6. There is an easier way to obtain the transformed results than by hand, and the easier way provides standard errors. That is the subject of the next section.
Recall that

       σ² = σ′² + κ²
       ρ = κ/√(σ²(σ′² + 1))

We must describe these two formulas in a way that nlcom can understand. The Stata notation is

       σ′²:   _b[var(e.wage):_cons]
       κ:     _b[wage:L]
We cannot remember that notation; however, we can type gsem, coeflegend to be reminded. We now have all that we need to obtain the estimates of σ² and ρ. Because heckman reports σ rather than σ², we will tell nlcom to report sqrt(σ²):

. nlcom (sigma: sqrt(_b[var(e.wage):_cons] + _b[wage:L]^2))
>       (rho: _b[wage:L]/(sqrt((_b[var(e.wage):_cons]+1)*(_b[var(e.wage):_cons]
>       + _b[wage:L]^2))))
       sigma:  sqrt(_b[var(e.wage):_cons] + _b[wage:L]^2)
         rho:  _b[wage:L]/(sqrt((_b[var(e.wage):_cons]+1)*(_b[var(e.wage):_cons] + _b[wage:L]^2)))
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

      sigma    6.004755   .1656476    36.25   0.000     5.680091    6.329418
        rho    .7034874   .0511867    13.74   0.000     .6031633    .8038116
The output above nearly matches what heckman reports. heckman does not report the test statistics and p-values for these two parameters. In addition, the confidence interval that heckman reports for ρ will differ slightly from the above and is better. heckman uses a method that will not allow ρ to be outside of −1 and 1, whereas nlcom is simply producing a confidence interval for the calculation we requested, in the absence of the knowledge that the calculation corresponds to a correlation coefficient. The same applies to the confidence interval for σ, where the bounds are 0 and infinity.
To obtain the coefficients and standard errors for the selection equation, we type
. nlcom (married: _b[selected:married]/sqrt(_b[var(e.wage):_cons]+1))
>
(children: _b[selected:children]/sqrt(_b[var(e.wage):_cons]+1))
>
(educ: _b[selected:educ]/sqrt(_b[var(e.wage):_cons]+1))
>
(age: _b[selected:age]/sqrt(_b[var(e.wage):_cons]+1))
married: _b[selected:married]/sqrt(_b[var(e.wage):_cons]+1)
children: _b[selected:children]/sqrt(_b[var(e.wage):_cons]+1)
educ: _b[selected:educ]/sqrt(_b[var(e.wage):_cons]+1)
age: _b[selected:age]/sqrt(_b[var(e.wage):_cons]+1)
                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

    married    .4451771   .0673953     6.61   0.000     .3130847    .5772694
   children    .4387128   .0277788    15.79   0.000     .3842673    .4931583
       educ    .0557326   .0107348     5.19   0.000     .0346927    .0767725
        age    .0365101   .0041534     8.79   0.000     .0283696    .0446505
b. Click about one-third of the way in from the right side of the diagram, to the right of the
married rectangle.
c. In the Contextual Toolbar, select Gaussian, Identity in the Family/Link control (it may
already be selected).
d. In the Contextual Toolbar, select selected in the Variable control.
e. In the Contextual Toolbar, click on the Properties button.
f. In the resulting Variable properties dialog box, click on the Censoring button in the Variable
tab.
g. In the resulting Censoring dialog box, select the Interval measured, depvar is lower boundary
radio button. In the resulting Interval measured box below, use the Upper bound control to
select the variable notselected.
h. Click on OK in the Censoring dialog box, and then click on OK in the Variable properties
dialog box. The Details pane will now show selected as the lower bound and notselected
as the upper bound of our interval measure.
6. Create the endogenous wage variable.
a. Select the Add Observed Variable tool,
, and then click about one-third of the way in
from the right side of the diagram, to the right of the age rectangle.
b. In the Contextual Toolbar, select wage with the Variable control.
7. Create paths from the independent variables to the dependent variables.
a. Select the Add Path tool,
b. Click in the right side of the married rectangle (it will highlight when you hover over it),
and drag a path to the left side of the selected rectangle (it will highlight when you can
release to connect the path).
c. Continuing with the
tool, create the following paths by clicking first in the right side of
the rectangle for the independent variable and dragging it to the left side of the rectangle
for the dependent variable:
b. Click in the upper left quadrant of the L oval, and drag a path to the right side of the
selected rectangle.
c. Continuing with the
tool, create another path by clicking first in the lower-left quadrant
of the L oval and dragging a path to the right side of the wage rectangle.
11. Place constraints on the variances and on the path from L to selected.
a. Choose the Select tool,
c. Click on the error oval attached to the wage rectangle. In the Contextual Toolbar, type a in
the
box and press Enter.
d. Click on the error oval attached to the selected rectangle. In the Contextual Toolbar, type
a in the
box and press Enter.
e. Click on the path from L to selected. In the Contextual Toolbar, type 1 in the
and press Enter.
box
13. Estimate.
Click on the Estimate button, , in the Standard Toolbar, and then click on OK in the resulting
GSEM estimation options dialog box.
You can open a completed diagram in the Builder by typing
. webgetsem gsem_select
References
Gronau, R. 1974. Wage comparisons: A selectivity bias. Journal of Political Economy 82: 1119–1143.
Heckman, J. 1976. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5: 475–492.
Lewis, H. G. 1974. Comments on selectivity biases in wage comparisons. Journal of Political Economy 82: 1145–1155.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
Also see
[SEM] example 34g Combined models (generalized responses)
[SEM] example 46g Endogenous treatment-effects model
Title
example 46g Endogenous treatment-effects model
Description      References      Also see
Description
To illustrate the treatment-effects model, we use the following data:
. use http://www.stata-press.com/data/r13/gsem_union3
(NLSY 1972)
. describe
Contains data from http://www.stata-press.com/data/r13/gsem_union3.dta
  obs:         1,693                          NLSY 1972
 vars:            24                          29 Mar 2013 11:39
 size:        79,571                          (_dta has notes)

              storage   display    value
variable name   type    format     label      variable label

idcode          int     %8.0g                 NLS ID
year            int     %8.0g                 interview year
birth_yr        byte    %8.0g                 birth year
age             byte    %8.0g                 age in current year
race            byte    %8.0g      racelbl    race
msp             byte    %8.0g                 1 if married, spouse present
nev_mar         byte    %8.0g                 1 if never married
grade           byte    %8.0g                 current grade completed
collgrad        byte    %8.0g                 1 if college graduate
not_smsa        byte    %8.0g                 1 if not SMSA
c_city          byte    %8.0g                 1 if central city
south           byte    %8.0g                 1 if south
ind_code        byte    %8.0g                 industry of employment
occ_code        byte    %8.0g                 occupation
union           byte    %8.0g                 1 if union
wks_ue          byte    %8.0g                 weeks unemployed last year
ttl_exp         float   %9.0g                 total work experience
tenure          float   %9.0g                 job tenure, in years
hours           int     %8.0g                 usual hours worked
wks_work        int     %8.0g                 weeks worked last year
ln_wage         float   %9.0g                 ln(wage/GNP deflator)
wage            double  %10.0g                real wage
black           float   %9.0g                 race black
smsa            byte    %8.0g                 1 if SMSA
See Structural models 7: Dependencies between response variables and Structural models 8:
Unobserved inputs, outputs, or both in [SEM] intro 5 for background.
[Path diagram: llunion (family Gaussian, link identity) regressed on 1.south, 1.black, tenure, and latent variable L (path constrained to 1); wage regressed on age, grade, 1.smsa, 1.black, tenure, 1.union, and L.]
We wish to estimate the treatment effect of being a union member. That is, we speculate that
union membership has an effect on wages, and we want to measure that effect. The problem would
be easy if we had data on the same workers from two different but nearly identical universes, one in
which the workers were not union members and another in which they were.
The model above is similar to the Heckman selection model we fit in [SEM] example 45g. The
differences are that the continuous variable (wage) is observed in all cases and that we have a path
from the treatment indicator (previously selection, now treatment) to the continuous variable. Just as
with the Heckman selection model, we allow for correlation by introducing a latent variable with
model identification constraints.
Before we can fit this model, we need to create new variables llunion and ulunion. llunion
will equal 0 if union is 1 and missing otherwise. ulunion is the complement of llunion: it equals
0 if union is 0 and missing otherwise. llunion and ulunion will be used as the dependent variables
in the treatment equation, providing the equivalent of a scaled probit regression.
. gen llunion = 0 if union == 1
(1433 missing values generated)
. gen ulunion = 0 if union == 0
(709 missing values generated)
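With llunion and ulunion in hand, the model can be fit in command syntax along the lines of [SEM] example 45g; a sketch (the @a constraint equates the two error variances, and var(L@1) normalizes the latent variable, matching the constrained results below):

. gsem (wage <- age grade i.smsa i.black tenure i.union L)
>      (llunion <- i.black tenure i.south L@1,
>          family(gaussian, udepvar(ulunion))),
>      var(L@1 e.wage@a e.llunion@a)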
Generalized structural equation model           Number of obs   =       1210

                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

wage <-
          age    .1487409   .0193291     7.70   0.000     .1108566    .1866252
        grade    .4205658   .0293577    14.33   0.000     .3630258    .4781057
       1.smsa    .9117044   .1249041     7.30   0.000     .6668969    1.156512
      1.black   -.7882472   .1367077    -5.77   0.000    -1.056189     -.520305
       tenure    .1524015   .0369595     4.12   0.000     .0799621    .2248408
      1.union    2.945816   .2749549    10.71   0.000     2.406915    3.484718
            L   -1.706795   .1288024   -13.25   0.000    -1.959243   -1.454347
        _cons   -4.351572   .5283952    -8.24   0.000    -5.387207   -3.315936

llunion <-
      1.black    .6704049    .148057     4.53   0.000     .3802185    .9605913
       tenure    .1282024   .0357986     3.58   0.000     .0580384    .1983664
      1.south   -.8542673    .136439    -6.26   0.000    -1.121683   -.5868518
            L           1  (constrained)
        _cons   -1.302676   .1407538    -9.25   0.000    -1.578548   -1.026804

       var(L)           1  (constrained)
var(e.llun~n)    1.163821   .2433321                      .7725324    1.753298
  var(e.wage)    1.163821   .2433321                      .7725324    1.753298
Notes:
1. The treatment effect is measured by the coefficient on the path treatment variable -> continuous variable or, in our case, 1.union -> wage. It is estimated to be 2.9458, which is practically large and statistically significant.
2. The interpretation formulas are the same as for the Heckman selection model in [SEM] example 45g, namely,

       β = β′
       γ = γ′/√(σ′² + 1)
       σ² = σ′² + κ²
       ρ = κ/√((σ′² + κ²)(σ′² + 1))

To remind you, β are the coefficients in the continuous-outcome equation, γ are the coefficients in the treatment equation, σ² is the variance of the error in the continuous-outcome equation, and ρ is the correlation between the errors in the treatment and continuous-outcome equations.
3. In the output above, σ′² (var(e.wage)) is 1.1638 and κ (the path coefficient on wage <- L) is −1.7068. In [SEM] example 45g, we calculated ρ by hand and then showed how to use the software to obtain the value and its standard error. This time, we will go right to the software. After obtaining symbolic names by typing gsem, coeflegend, we type the following to obtain ρ:
. nlcom (rho: _b[wage:L]/(sqrt(_b[var(e.wage):_cons] + 1)*sqrt(_b[var(e.wage):_
> cons] + _b[wage:L]^2)))
         rho:  _b[wage:L]/(sqrt(_b[var(e.wage):_cons] + 1)*sqrt(_b[var(e.wage):_cons] + _b[wage:L]^2))

                  Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

        rho    -.574648    .060969    -9.43   0.000     -.694145    -.455151
4. We can obtain the untransformed treatment coefficients just as we did in [SEM] example 45g.
5. Just as with the Heckman selection model, with gsem, the treatment-effects model can be applied
to generalized outcomes and include multilevel effects. See Skrondal and Rabe-Hesketh (2004,
chap. 14.5) for an example with a Poisson response function.
b. Click about one-third of the way in from the right side of the diagram, to the right of the
1.black rectangle.
c. In the Contextual Toolbar, select Gaussian, Identity in the Family/Link control (it may
already be selected).
d. In the Contextual Toolbar, select llunion in the Variable control.
e. In the Contextual Toolbar, click on the Properties button.
f. In the resulting Variable properties dialog box, click on the Censoring button in the Variable
tab.
g. In the resulting Censoring dialog box, select the Interval measured, depvar is lower boundary
radio button. In the resulting Interval measured box below, use the Upper bound control to
select the variable ulunion.
h. Click on OK in the Censoring dialog box, and then click on OK in the Variable properties
dialog box. The Details pane will now show llunion as the lower bound and ulunion as
the upper bound for our interval measure.
6. Create the endogenous wage variable.
a. Select the Add Observed Variable tool, and then click about one-third of the way in
from the right side of the diagram, to the right of the grade rectangle.
b. In the Contextual Toolbar, select wage with the Variable control.
7. Create paths from the independent variables to the dependent variables.
a. Select the Add Path tool,
b. Click in the right side of the 1.south rectangle (it will highlight when you hover over it),
and drag a path to the left side of the llunion rectangle (it will highlight when you can
release to connect the path).
c. Continuing with the Add Path tool, create the following paths by clicking first in the right side of the rectangle for the independent variable and dragging it to the left side of the rectangle for the dependent variable:
b. Click in the upper-left quadrant of the L oval, and drag a path to the right side of the
llunion rectangle.
c. Continuing with the Add Path tool, create another path by clicking first in the lower-left quadrant of the L oval and dragging a path to the right side of the wage rectangle.
11. Place constraints on the variances and on the path from L to llunion.
a. Choose the Select tool.
b. Click on the L oval. In the Contextual Toolbar, type 1 in the variance box and press Enter.
c. Click on the error oval attached to the wage rectangle. In the Contextual Toolbar, type a in the variance box and press Enter.
d. Click on the error oval attached to the llunion rectangle. In the Contextual Toolbar, type a in the variance box and press Enter.
e. Click on the path from L to llunion. In the Contextual Toolbar, type 1 in the coefficient box and press Enter.
References
Center for Human Resource Research. 1989. National Longitudinal Survey of Labor Market Experience, Young Women 14–24 years of age in 1968. Columbus, OH: Ohio State University Press.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
Also see
[SEM] example 34g Combined models (generalized responses)
[SEM] example 45g Heckman selection model
Title
gsem Generalized structural equation model estimation command
Syntax
Remarks and examples
Menu
Stored results
Description
Also see
Options
Syntax
gsem paths [if] [in] [, options]
where paths are the paths of the model in command-language path notation; see [SEM] sem and gsem
path notation.
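For instance, a one-factor logistic measurement model might be specified in command-language path notation as follows (a schematic sketch with illustrative variable names):

. gsem (y1 y2 y3 <- X, logit)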
options                        Description
model description options      describe the model to be fit
estimation options             control how the estimation results are obtained
reporting options              control how the results are displayed
syntax options                 control how the syntax is interpreted
Menu
Statistics > SEM (structural equation modeling) > Model building and estimation
Description
gsem fits generalized SEMs. When you use the Builder in gsem mode, you are using the gsem
command.
Options
model description options describe the model to be fit. The model to be fit is fully specified by paths, which appear immediately after gsem, and the options covariance(), variance(), and means(). See [SEM] gsem model description options and [SEM] sem and gsem path notation.
estimation options control how the estimation results are obtained. These options control how the
standard errors (VCE) are obtained and control technical issues such as choice of estimation method.
See [SEM] gsem estimation options.
reporting options control how the results of estimation are displayed. See [SEM] gsem reporting
options.
syntax options control how the syntax that you type is interpreted. See [SEM] sem and gsem syntax
options.
Stored results
gsem stores the following in e():
Scalars
  e(N)                   number of observations
  e(N_clust)             number of clusters
  e(k)                   number of parameters
  e(k_cat#)              number of categories for the #th depvar, ordinal
  e(k_dv)                number of dependent variables
  e(k_eq)                number of equations in e(b)
  e(k_out#)              number of outcomes for the #th depvar, mlogit
  e(k_rc)                number of covariances
  e(k_rs)                number of variances
  e(ll)                  log likelihood
  e(n_quad)              number of integration points
  e(rank)                rank of e(V)
  e(ic)                  number of iterations
  e(rc)                  return code
  e(converged)           1 if target model converged, 0 otherwise

Macros
  e(cmd)                 gsem
  e(cmdline)             command as typed
  e(depvar)              names of dependent variables
  e(title)               title in estimation output
  e(clustvar)            name of cluster variable
  e(family#)             family for the #th depvar
  e(link#)               link for the #th depvar
  e(offset#)             offset for the #th depvar
  e(intmethod)           integration method
  e(vce)                 vcetype specified in vce()
  e(vcetype)             title used to label Std. Err.
  e(opt)                 type of optimization
  e(which)               max or min; whether optimizer is to perform maximization or minimization
  e(method)              estimation method: ml
  e(ml_method)           type of ml method
  e(user)                name of likelihood-evaluator program
  e(technique)           maximization technique
  e(datasignature)       the checksum
  e(datasignaturevars)   variables used in calculation of checksum
  e(properties)          b V
  e(estat_cmd)           program used to implement estat
  e(predict)             program used to implement predict
  e(covariates)          list of covariates
  e(footnote)            program used to implement the footnote display
  e(marginsnotok)        predictions not allowed by margins

Matrices
  e(b)                   parameter vector
  e(b_pclass)            parameter class
  e(cat#)                categories for the #th depvar, ordinal
  e(out#)                outcomes for the #th depvar, mlogit
  e(Cns)                 constraints matrix
  e(ilog)                iteration log (up to 20 iterations)
  e(gradient)            gradient vector
  e(V)                   covariance matrix of the estimators
  e(V_modelbased)        model-based variance

Functions
  e(sample)              marks estimation sample
Also see
[SEM] intro 1 Introduction
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] gsem path notation extensions Command syntax for path diagrams
[SEM] gsem model description options Model description options
[SEM] gsem estimation options Options affecting estimation
[SEM] gsem reporting options Options affecting reporting of results
[SEM] sem and gsem syntax options Options affecting interpretation of syntax
[SEM] gsem postestimation Postestimation tools for gsem
[SEM] methods and formulas for gsem Methods and formulas
Title
gsem estimation options Options affecting estimation
Syntax
Description
Options
Also see
Syntax
gsem paths . . . , . . . estimation options
estimation options        Description
  method(ml)              fit the model by maximum likelihood; the default and only method
  vce(vcetype)            vcetype may be oim, opg, robust, or cluster clustvar
  from(matname)           specify starting values
  startvalues(svmethod)   specify how starting values are computed
  startgrid[(gridspec)]   perform a grid search to improve starting values
  noestimate              do not fit the model; show starting values instead
  intmethod(intmethod)    integration (quadrature) method
  intpoints(#)            number of integration points
  adaptopts(adaptopts)    options controlling adaptive quadrature
  listwise                apply listwise deletion of observations with missing values
  dnumerical              use numerical derivatives
  maximize options        control the maximization process; seldom used

intmethod                 Description
  mvaghermite             mean-variance adaptive Gauss–Hermite quadrature; the default
  mcaghermite             mode-curvature adaptive Gauss–Hermite quadrature
  ghermite                nonadaptive Gauss–Hermite quadrature
  laplace                 Laplacian approximation

adaptopts                 Description
  [no]log                 whether to display the adaptation iteration log
  iterate(#)              maximum number of adaptation iterations
  tolerance(#)            convergence tolerance for the adaptation
Description
These options control how results are obtained, from starting values, to numerical integration (also
known as quadrature), to how variance estimates are obtained.
Options
method(ml) is the default and is the only method available with gsem. This option is included for
compatibility with sem, which provides several methods; see [SEM] sem option method( ).
vce(vcetype) specifies the technique used to obtain the variancecovariance matrix of the estimates.
See [SEM] sem option method( ).
from(matname), startvalues(svmethod), and startgrid[(gridspec)] specify overriding starting
values, specify how other starting values are to be calculated, and provide the ability to improve
the starting values. All of this is discussed in [SEM] intro 12. Below we provide a technical
description.
from(matname) allows you to specify starting values. See [SEM] intro 12 and see [SEM] sem and
gsem option from( ). We show the syntax as from(matname), but from() has another, less useful
syntax, too. An alternative to from() is init() used in the path specifications; see [SEM] sem
and gsem path notation.
startvalues() specifies how starting values are to be computed. Starting values specified in
from() override the computed starting values, and starting values specified via init() override
both.
startvalues(zero) specifies that starting values are to be set to 0.
startvalues(constantonly) builds on startvalues(zero) by fitting a constant-only model
for each response to obtain estimates of intercept and scale parameters, and it substitutes 1 for the
variances of latent variables.
startvalues(fixedonly) builds on startvalues(constantonly) by fitting a full fixed-effects model for each response variable to obtain estimates of coefficients along with intercept and scale parameters, and it continues to use 1 for the variances of latent variables.
startvalues(ivloadings) builds on startvalues(fixedonly) by using instrumental-variable
methods with the generalized residuals from the fixed-effects models to compute starting values
for latent variable loadings, and still uses 1 for the variances of latent variables.
startvalues(iv) builds on startvalues(ivloadings) by using instrumental-variable methods
with generalized residuals to obtain variances of latent variables.
startgrid() performs a grid search on variance components of latent variables to improve starting
values. This is well discussed in [SEM] intro 12. No grid search is performed by default unless the
starting values are found to be not feasible, in which case gsem runs startgrid() to perform a
minimal search involving L³ likelihood evaluations, where L is the number of latent variables.
Sometimes this resolves the problem. Usually, however, there is no problem and startgrid() is
not run by default. There can be benefits from running startgrid() to get better starting values
even when starting values are feasible.
noestimate specifies that the model is not to be fit. Instead, starting values are to be shown (as
modified by the above options if modifications were made), and they are to be shown using the
coeflegend style of output. An important use of this option is before you have modified starting
values at all; you can type the following:
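One such sequence, sketched with ellipses standing in for your model specification, is

. gsem ..., ... noestimate         // display the starting values and their names
. matrix b = e(b)                  // copy the starting values into a matrix
                                   // (edit the elements of b as desired)
. gsem ..., ... from(b)            // refit, beginning from the modified values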
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] intro 8 Robust and clustered standard errors
[SEM] intro 9 Standard errors, the full story
[SEM] intro 12 Convergence problems and how to solve them
Title
gsem family-and-link options Family-and-link options
Syntax
Description
Options
Also see
Syntax
gsem paths . . . , . . . family and link options
family and link options      Description
  family(family)             distribution family; default is family(gaussian)
  link(link)                 link function; default depends on the family

  cloglog                    synonym for family(bernoulli) link(cloglog)
  gamma                      synonym for family(gamma) link(log)
  logit                      synonym for family(bernoulli) link(logit)
  nbreg                      synonym for family(nbinomial mean) link(log)
  mlogit                     synonym for family(multinomial) link(logit)
  ocloglog                   synonym for family(ordinal) link(cloglog)
  ologit                     synonym for family(ordinal) link(logit)
  oprobit                    synonym for family(ordinal) link(probit)
  poisson                    synonym for family(poisson) link(log)
  probit                     synonym for family(bernoulli) link(probit)
  regress                    synonym for family(gaussian) link(identity)

  exposure(varname_e)        include ln(varname_e) in model with coefficient constrained to 1
  offset(varname_o)          include varname_o in model with coefficient constrained to 1

family                         Description
  gaussian[, options]          Gaussian (normal); the default
  bernoulli                    Bernoulli
  binomial[ # | varname ]      binomial; default number of binomial trials is 1
  gamma                        gamma
  multinomial                  multinomial
  nbinomial[ mean | constant ] negative binomial; default dispersion is mean
  ordinal                      ordinal
  poisson                      Poisson

link                         Description
  identity                   identity
  log                        log
  logit                      logit
  probit                     probit
  cloglog                    complementary log-log

options                      Description
  ldepvar(varname)           varname is the lower boundary; depvar is the upper
  udepvar(varname)           varname is the upper boundary; depvar is the lower
  lcensored(varname|#)       lower limit for left-censoring
  rcensored(varname|#)       upper limit for right-censoring

Allowed family-and-link combinations are (D denotes the default):

                      identity   log   logit   probit   cloglog
  Gaussian                D       x
  Bernoulli                               D       x         x
  binomial                                D       x         x
  gamma                           D
  multinomial                             D
  negative binomial               D
  ordinal                                 D       x         x
  Poisson                         D
Description
gsem not only allows models of the form y_i = x_iβ + u_i, it also allows

   g{E(y_i)} = x_iβ,    y_i ~ F

where you can choose F and g(·) from a menu. F is called the family, and g(·) is called the link. One set of choices is the Gaussian distribution for F and the identity function for g(). In that case, gsem reproduces linear regression. Other combinations of g() and F produce other popular models, including logit (also known as logistic regression), probit, multinomial logit, Poisson regression, and more.
Options
family(family) and link(linkname) specify F and g(). If neither is specified, linear regression is
assumed.
Two of the families allow optional arguments:
family(binomial [# | varname]) specifies that data are in binomial form, that is, that the response variable records the number of successes from a series of Bernoulli trials. The number of trials is given either as a constant number or as a varname that allows the number of trials to vary over observations, or it is not given at all. In the last case, the number of trials is 1, which is thus equivalent to specifying family(bernoulli).
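For instance, with hypothetical variables deaths recording the number of successes and trials recording the number of Bernoulli trials per observation, one might type

. gsem (deaths <- age smokes, family(binomial trials))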
family(nbinomial [mean | constant]) specifies a negative binomial model, a Poisson model with overdispersion. Be aware, however, that even Poisson models can have overdispersion if latent variables are included in the model. Let's use the term conditional overdispersion to refer to dispersion above and beyond that implied by latent variables, if any.
That conditional overdispersion can take one of two forms. In mean overdispersion, the conditional
overdispersion is a linear function of the conditional (predicted) mean. Constant overdispersion
refers to the conditional overdispersion being, of course, constant.
If you do not specify mean or constant, then mean is assumed.
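For instance, with a hypothetical count response accidents, the two parameterizations might be requested as

. gsem (accidents <- x1 x2, family(nbinomial mean))
. gsem (accidents <- x1 x2, family(nbinomial constant))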
cloglog, gamma, logit, nbreg, mlogit, ocloglog, ologit, oprobit, poisson, probit, and
regress are shorthands for specifying popular models.
exposure(varname_e) and offset(varname_o) are used only with families poisson and nbinomial; that is, they concern count models.
exposure() specifies a variable that reflects the amount of exposure, usually measured in time units, for each observation over which the responses were counted. If one observation was exposed for twice the time of another, and the observations were otherwise identical, one would expect twice as many events to be counted. To assume that, ln(varname_e) is entered into x_iβ with coefficient constrained to be 1.
offset() enters varname_o into x_iβ with coefficient constrained to be 1. offset() is just another way of specifying exposure(), where the offset variable is the log of the amount of exposure.
If neither exposure() nor offset() is specified, observations are assumed to have equal amounts
of exposure.
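A sketch with hypothetical variables deaths and pyears (person-years of exposure) is

. gsem (deaths <- age smokes, family(poisson) exposure(pyears))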
ldepvar(varname) and udepvar(varname) specify that each observation can be point data, interval data, left-censored data, or right-censored data. The type of data for a given observation is determined by the values in y_i and varname. The following specifications are equivalent:

   depvar1 <- ... , family(gauss, udepvar(depvar2))
   depvar2 <- ... , family(gauss, ldepvar(depvar1))

Thus only one of ldepvar() or udepvar() is allowed. In either case, depvar1 and depvar2 should have the following form:
Type of data                              depvar1   depvar2
  point data             a = [a, a]          a         a
  interval data          [a, b]              a         b
  left-censored data     (-∞, b]             .         b
  right-censored data    [a, +∞)             a         .
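Following the llunion/ulunion interval-measurement setup used elsewhere in this manual, such a response might be specified as

. gsem (llunion <- 1.black tenure, family(gaussian, udepvar(ulunion)))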
lcensored(varname|#) and rcensored(varname|#) indicate the lower and upper limits for censoring, respectively. You may specify only one.

lcensored(arg) specifies that observations with y_i ≤ arg are left-censored and the remaining observations are not.

rcensored(arg) specifies that observations with y_i ≥ arg are right-censored and the remaining observations are not.

Neither lcensored() nor rcensored() may be combined with ldepvar() or udepvar().
Specified that way, the options apply to all the response variables. Alternatively, they may be specified
inside paths to affect single equations:
. gsem (y1 <- x1 x2, logit)
. gsem (y1 <- x1 x2, logit) (y2 <- x2 L, poisson)
. gsem (y1 <- x1 x2, family(bernoulli) link(logit))
. gsem (y1 <- x1 x2, family(bernoulli) link(logit)) (y2 <- x2 L, family(poisson) link(log))
On a different topic, it is worth noting that you can fit exponential-regression models with family(gamma) link(log) if you constrain the log of the scale parameter to be 0 with gsem's constraints() option. For instance, you might type
. constraint 1 _b[y_logs:_cons] = 0
. gsem (y <- x1 x2, gamma), constraints(1)
The name _b[y_logs:_cons] changes according to the name of the dependent variable. Had y instead been named waitingtime, the parameter would have been named _b[waitingtime_logs:_cons].
Rather than remembering that, remember instead that the best way to discover the names of parameters
is to type
. gsem (waitingtime <- x1 x2, gamma), noestimate
and then look at the output to discover the names. See [SEM] sem and gsem option constraints( ).
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] gsem path notation extensions Command syntax for path diagrams
Title
gsem model description options Model description options
Syntax
Description
Options
Also see
Syntax
gsem paths . . . , . . . model description options
model description options    Description
  family(), link(), ...      distribution family and link function for generalized responses
  covariance()               specify covariances between exogenous variables or errors
  variance()                 specify variances of exogenous variables or errors
  means()                    specify means of exogenous variables
  covstructure()             impose a covariance structure
  collinear                  keep collinear variables
  noconstant                 constrain all intercepts to 0
  noasis                     omit perfect-predictor variables from Bernoulli models
  fvstandard                 interpret factor-variable notation in the standard Stata way
  noanchor                   do not set anchors automatically; issue an error instead
  forcenoanchor              do not set anchors; proceed to estimation
  reliability()              fraction of variance not due to measurement error
  constraints()              specify parameter constraints
  from()                     specify starting values
Description
paths and the options above describe the model to be fit by gsem.
Options
family() and link() specify the distribution and link function, such as family(poisson)
link(log), for generalized linear responses. There are lots of synonyms, so you can specify, for example, just poisson. In addition, there are exposure() and offset() options. See
[SEM] gsem family-and-link options.
covariance(), variance(), and means() fully describe the model to be fit. See [SEM] sem and
gsem path notation.
covstructure() provides a convenient way to constrain covariances in your model. Alternatively
or in combination, you can place constraints by using the standard path notation. See [SEM] sem
and gsem option covstructure( ).
collinear; see [R] estimation options.
noconstant specifies that all intercepts be constrained to 0. See [SEM] sem and gsem path notation.
This option is seldom specified.
noasis specifies that perfect-predictor variables be omitted from all family Bernoulli models. By
default, gsem does not omit the variable, so one can specify tricky models where an equation
contains perfect predictors that are still identified through other portions of the model.
fvstandard specifies that factor-variable notation be interpreted according to the Stata standard.
gsem interprets factor variables slightly differently than do other Stata commands and, given
how factor-variable notation is used in command mode, this usually makes no difference. This is
explained in [SEM] intro 3.
To be technical, when fvstandard is specified, all factor variables automatically are assigned a
base level among the specified or implied levels, and implied yet unspecified elements of interaction
terms will be left in the model specification.
noanchor specifies that gsem not check for lack of identification or fill in anchors where needed.
gsem is instead to issue an error message if anchors would be needed. Specify this option when
you believe you have specified the necessary normalization constraints and want to hear about it
if you are wrong. See Identification 2: Normalization constraints (anchoring) in [SEM] intro 4.
forcenoanchor is similar to noanchor except that rather than issue an error message, gsem proceeds
to estimation. There is no reason you should specify this option. forcenoanchor is used in testing
of gsem at StataCorp.
reliability() specifies the fraction of variance not due to measurement error for a variable. See
[SEM] sem and gsem option reliability( ).
constraints() specifies parameter constraints you wish to impose on your model; see [SEM] sem
and gsem option constraints( ). Constraints can also be specified as described in [SEM] sem and
gsem path notation, and they are usually more conveniently specified using the path notation.
from() specifies the starting values to be used in the optimization process; see [SEM] sem and gsem
option from( ). Starting values can also be specified using the init() suboption as described in
[SEM] sem and gsem path notation.
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] sem and gsem option constraints( ) Specifying constraints
[SEM] sem and gsem option covstructure( ) Specifying covariance restrictions
[SEM] sem and gsem option from( ) Specifying starting values
[SEM] sem and gsem option reliability( ) Fraction of variance not due to measurement error
[SEM] sem and gsem path notation Command syntax for path diagrams
Title
gsem path notation extensions Command syntax for path diagrams
Syntax
Description
Options
Also see
Syntax
gsem paths . . .
paths specifies the direct paths between the variables of your model.
The model to be fit is fully described by paths, covariance(), variance(), and means().
Description
This entry concerns gsem only.
The command syntax for describing generalized SEMs is fully specified by paths, covariance(),
variance(), and means(); see [SEM] sem and gsem path notation.
With gsem, the notation is extended to allow for generalized linear response variables and to allow
for multilevel latent variables. That is the subject of this entry.
Options
covariance(), variance(), and means() are described in [SEM] sem and gsem path notation.
Changing the subject, the names by which effects are referred to are a function of the top level.
We just discussed a three-level model. The three levels of the model were
(3) school
(2) school>teacher
(1) school>teacher>student
If we had a two-level model, the levels would be
(2) teacher
(1) teacher>student
Thus, if we had started with a two-level model and then wanted to add a third, higher level onto
it, latent variables that were previously referred to as, say, TeachQual[teacher] would now be
referred to as TeachQual[school>teacher].
Latent-variable names thus take forms such as Lname[occupation], Lname[industry], or simply Lname.
That is convenient, but only if all the equations in the model are using the same specific response
function. Many models include multiple equations with each using a different response function.
You can specify any of the family-and-link options within paths. For instance, typing
. gsem (y <- x1 x2), logit

applies the logit family and link to every equation in the model. To mix response functions, specify the options within the individual paths, for instance,

. gsem (y1 <- x1 L, logit) (y2 <- x2 L, poisson) ..., ...

The y1 equation would be logit, and the y2 equation would be Poisson. If you wanted y2 to be linear
regression, you could type
. gsem (y1 <- x1 L, logit) (y2 <- x2 L, regress) ..., ...
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] intro 2 Learning the language: Path diagrams and command language
Title
gsem postestimation Postestimation tools for gsem
Description
Also see
Description
The following are the postestimation commands that you can use after estimation by gsem:
Command              Description
  gsem, coeflegend   display _b[] notation
  estat eform        display exponentiated coefficients
  estat ic           Akaike's and Schwarz's Bayesian information criteria (AIC and BIC)
  lrtest             likelihood-ratio tests
  test               Wald tests
  lincom             linear combination of parameters
  nlcom              nonlinear combination of parameters
  testnl             Wald tests of nonlinear hypotheses
  estat summarize    estimation sample summary
  estat vce          variance-covariance matrix of the estimators (VCE)
  predict            generalized linear predictions, etc.
  margins            marginal means, predictive margins, and marginal effects
  contrast           contrasts and ANOVA-style joint tests of estimates
  pwcompare          pairwise comparisons of estimates
  estimates          cataloging estimation results
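For instance, after fitting a model, you might redisplay the results with their _b[] names and then obtain information criteria (a schematic sketch):

. gsem, coeflegend
. estat ic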
Also see
[SEM] gsem reporting options Options affecting reporting of results
Title
gsem reporting options Options affecting reporting of results
Syntax
Description
Options
Also see
Syntax
gsem paths . . . , . . . reporting options
gsem, reporting options
reporting options    Description
  level(#)           set confidence level; default is level(95)
  coeflegend         display legend instead of statistics
  nocnsreport        do not display constraints
  noheader           do not display the header above the parameter table
  notable            do not display the parameter table
  display options    control column formats and display of factor variables
Description
These options control how gsem displays estimation results.
Options
level(#); see [R] estimation options.
coeflegend displays the legend that reveals how to specify estimated coefficients in _b[] notation, which you are sometimes required to type in specifying postestimation commands.
nocnsreport suppresses the display of the constraints. Fixed-to-zero constraints that are automatically
set by gsem are not shown in the report to keep the output manageable.
noheader suppresses the header above the parameter table, the display that reports the final log-likelihood value, number of observations, etc.
notable suppresses the parameter table.
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] example 29g Two-parameter logistic IRT model
Title
lincom Linear combinations of parameters
Syntax
Remarks and examples
Menu
Stored results
Description
Also see
Options
Syntax
lincom exp [, options]
Menu
Statistics > SEM (structural equation modeling) > Testing and CIs > Linear combinations of parameters
Description
lincom is a postestimation command for use after sem, gsem, and nearly all Stata estimation
commands.
lincom computes point estimates, standard errors, z statistics, p-values, and confidence intervals
for linear combinations of the estimated parameters.
After sem and gsem, you must use the _b[] coefficient notation; you cannot refer to variables by using shortcuts to obtain coefficients on variables.
See [R] lincom.
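For instance, after fitting a hypothetical model with outcome y and covariates x1 and x2, you might type

. gsem (y <- x1 x2)
. lincom _b[y:x1] - _b[y:x2]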
Options
See Options in [R] lincom.
Stored results
See Stored results in [R] lincom.
Also see
[R] lincom Linear combinations of estimators
[SEM] estat stdize Test standardized parameters
[SEM] nlcom Nonlinear combinations of parameters
[SEM] test Wald test of linear hypotheses
Title
lrtest Likelihood-ratio test of linear hypothesis
Syntax
Stored results
Menu
Also see
Description
Syntax
{sem | gsem} . . . , . . .              (fit model 1)
estimates store modelname1
{sem | gsem} . . . , . . .              (fit model 2)
estimates store modelname2
lrtest modelname1 modelname2
where one of the models is constrained and the other is unconstrained. lrtest counts
parameters to determine which model is constrained and which is unconstrained, so it does
not matter which model is which.
Warning concerning use with sem: Do not omit variables, observed or latent, from the model.
Constrain their coefficients to be 0 instead. The models being compared must contain the same
set of variables. This is because the standard SEM likelihood value is a function of the variables
appearing in the model. Despite this fact, use of lrtest is appropriate under the relaxed conditional
normality assumption.
Note concerning gsem: The above warning does not apply to gsem just as it does not apply to other
Stata estimation commands. Whether you omit variables or constrain coefficients to 0, results will
be the same. The generalized SEM likelihood is conditional on the exogenous variables.
Menu
Statistics > SEM (structural equation modeling) > Testing and CIs > Likelihood-ratio test
Description
lrtest is a postestimation command for use after sem, gsem, and other Stata estimation commands.
lrtest performs a likelihood-ratio test comparing models. See [R] lrtest.
. sem (L1 -> x1 x2 x3) (L1 <- x4 x5) (x1 <- x4) (x2 <- x5)
. estimates store m1
. sem (L1 -> x1 x2 x3) (L1 <- x4 x5)
. estimates store m2
. lrtest m1 m2
This is allowed because both models contain the same variables, namely, L1, x1, . . . , x5, even though
the second model omitted some paths.
The following would produce invalid results:
. sem (L1 -> x1 x2 x3) (L1 <- x4 x5) (x1 <- x4) (x2 <- x5)
. estimates store m1
. sem (L1 -> x1 x2 x3) (L1 <- x4)
. estimates store m2
. lrtest m1 m2
It produces invalid results because the second model does not include variable x5 and the first model
does. To run this test correctly, you type
. sem (L1 -> x1 x2 x3) (L1 <- x4 x5) (x1 <- x4) (x2 <- x5)
. estimates store m1
. sem (L1 -> x1 x2 x3) (L1 <- x4 x5@0)
. estimates store m2
. lrtest m1 m2
If we were using gsem rather than sem, all the above would still be valid.
Stored results
See Stored results in [R] lrtest.
Also see
[SEM] example 10 MIMIC model
[SEM] example 39g Three-level model (multilevel, generalized response)
Title
methods and formulas for gsem Methods and formulas
Description
References
Also see
Description
The methods and formulas for the gsem command are presented below.
Introduction
gsem fits generalized linear models with latent variables via maximum likelihood. Here is a table
identifying the family/link combinations that gsem allows.
                      logit   probit   cloglog   log   identity
  Bernoulli             x       x         x
  binomial              x       x         x
  ordinal               x       x         x
  multinomial           x
  Poisson                                          x
  negative binomial                                x
  gamma                                            x
  Gaussian                                         x       x
Log-likelihood calculations for fitting any model with latent variables require integrating out the latent variables. One widely used modern method is to directly estimate the integral required to calculate the log likelihood by Gauss–Hermite quadrature or some variation thereof. gsem implements four different methods for numerically evaluating the integrals.

1. Gauss–Hermite quadrature (GHQ)
2. Mean-variance adaptive quadrature (MVAGH)
3. Mode-curvature adaptive quadrature (MCAGH)
4. Laplacian approximation
The default method is MVAGH. The numerical integration method for MVAGH is based on Rabe-Hesketh, Skrondal, and Pickles (2005), and the other numerical integration methods described in this manual entry are based on Skrondal and Rabe-Hesketh (2004, chap. 6.3).
Families of distributions
gsem implements the most commonly used distribution families associated with generalized linear
models. gsem also implements distributions for ordinal and multinomial outcomes.
In this manual entry, observed endogenous variables are also known as generalized responses or generalized outcomes, but we will simply refer to them as responses or outcomes. The random variable corresponding to a given response will be denoted by Y. An observed value of Y will be denoted by y, and the expected value of Y by μ. For the ordinal and multinomial families, we will refer to a linear prediction, denoted by z, instead of the expected value.

The Bernoulli family is a binary response model. The response Y is assumed to take on the values 0 or 1; however, gsem allows any nonzero and nonmissing value to mean 1.
The log of the conditional probability mass function is

   log f(y|μ) = y log μ + (1 − y) log(1 − μ)
The binomial family is a count response model and generalizes the Bernoulli family by taking
the sum of k independent Bernoulli outcomes. The response Y is assumed to take on the values
0, 1, . . . , k .
The ordinal family is a discrete response model. The response Y is assumed to take on one of k unique values. The actual values are irrelevant except that higher values are assumed to correspond to higher outcomes. Without loss of generality, we will assume that Y takes on the values 1, . . . , k.

The ordinal family with k outcomes has cutpoints κ_0, κ_1, . . . , κ_k, where κ_0 = −∞, κ_y < κ_{y+1}, and κ_k = +∞.

Given a linear prediction z, the probability that a random response Y takes the value y is

   Pr(Y = y|z) = Pr(Y* < κ_y − z) − Pr(Y* < κ_{y−1} − z)

where Y* is the underlying stochastic component for Y. The distribution for Y* is determined by the link function. gsem allows logit, probit, and cloglog for the ordinal family. The logit link assigns Y* the extreme-value distribution that is synonymous with the logit link for Bernoulli outcomes. The probit link assigns Y* the standard normal distribution that is synonymous with the probit link for Bernoulli outcomes. The cloglog link assigns Y* the distribution that is synonymous with the complementary log-log link for Bernoulli outcomes. The default link for the ordinal family is the logit link.
The multinomial family is a discrete response model. The response Y is assumed to take on one of
k unique values. The actual values are irrelevant and order does not matter; however, gsem requires
that the values are nonnegative integers. Without loss of generality, we will assume that Y takes
on the values 1, . . . , k . Each of the k outcomes has its own linear prediction. For the model to be
identified, one of the outcomes is chosen to be the base or reference. The linear prediction for the
base outcome is constrained to be 0 for all observations. Without loss of generality, we will assume
the base outcome is the first outcome. Let zi be the prediction for outcome i, where z1 = 0 for the
base outcome.
Given the k linear predictions z′ = (z_1, z_2, . . . , z_k), the log of the conditional probability mass function is

   log f(y|z) = z_y − log{ ∑_{i=1}^{k} exp(z_i) }
The only link allowed for the multinomial family is the logit link.
The Poisson family is a count-data response model. The response Y is assumed to take on
nonnegative integer values.
The log of the conditional probability mass function is

   log f(y|μ) = −μ + y log μ − log Γ(y + 1)
The negative binomial family is another count-data response model. It is commonly thought of as
a Poisson family with overdispersion. gsem allows two parameterizations for the dispersion in this
family: mean dispersion and constant dispersion.
The log of the conditional probability mass function is

   log f(y|μ) = log Γ(m + y) − log Γ(y + 1) − log Γ(m) + m log p + y log(1 − p)

For mean dispersion, we have

   m = 1/α
   p = 1/(1 + αμ)

where μ is the expected value of Y and α is the scale parameter. gsem fits α in the log scale.

For constant dispersion, we have

   m = exp(log μ − log δ)
   p = 1/(1 + δ)

where μ is the expected value of Y and δ is the scale parameter. gsem fits δ in the log scale.
The gamma family
The gamma family is a continuous response model. The response Y is assumed to be a nonnegative
real value.
The log of the conditional probability density function is

   log f(y|μ, s) = −{ y/μ + log μ + 2 log s }/s² − log Γ(1/s²) + (1/s² − 1) log y

where μ is the expected value of Y and s is the scale parameter. gsem fits s in the log scale.

The only link allowed for the gamma family is the log link.
The Gaussian family is a continuous response model and is synonymous with the normal distribution.
When the Gaussian family is specified with the identity link but no censoring, gsem fits this
family by using a single multivariate density function and allows the following two special features:
1. gsem can fit covariances between the Gaussian error variables.
2. gsem can fit paths between Gaussian responses, including nonrecursive systems.
The log of the conditional probability density function is

   log f(y|μ, Σ) = −½{ d log 2π + log |Σ| + (y − μ)′Σ⁻¹(y − μ) }

where d is the dimension of the observed response vector y, μ is the mean of the responses, and Σ is the variance matrix of their unexplained errors.
When the Gaussian family is specified with the log link or censoring, the two special features
described above no longer apply. In addition, the multivariate density function is no longer used.
Instead, for each response using the log link, the log of the conditional probability density function
corresponds to the formula above with d = 1. For censored responses, the log likelihood corresponds
to the one in the Methods and formulas for [R] intreg.
Reliability
For a given Gaussian response variable Y with the identity link, the reliability of Y may be specified as p or 100p%. The variance of Y's associated error variable is then constrained to (1 − p) times the observed variance of Y.
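A sketch, assuming a measurement y1 declared to be 80% reliable (variable names are illustrative):

. gsem (y1 y2 y3 <- X), reliability(y1 0.8)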
Link functions
Except for the ordinal and multinomial families, the link function defines the transformation between
the mean and the linear prediction for a given response. If Y is the random variable corresponding
to an observed response variable y , then the link function performs the transformation
g(μ) = z

where μ = E(Y) and z is the linear prediction. In practice, the likelihood-evaluator function uses the inverse of the link function to map the linear prediction to the mean.

The logit link is

   g(μ) = log{ μ/(1 − μ) }

and its inverse is

   μ = g⁻¹(z) = 1/(1 + e^{−z})

The probit link is

   g(μ) = Φ⁻¹(μ)

and its inverse is

   μ = g⁻¹(z) = Φ(z)

where Φ(·) is the cumulative distribution function for the standard normal distribution and Φ⁻¹(·) is its inverse.

The complementary log-log link is

   g(μ) = log{ −log(1 − μ) }

and its inverse is

   μ = g⁻¹(z) = 1 − exp(−e^{z})
The likelihood
gsem fits generalized linear models with latent variables via maximum likelihood. The likelihood
for the specified model is derived under the assumption that each response variable is independent and
identically distributed across the estimation sample. The response variables are also assumed to be
independent of each other. These assumptions are conditional on the latent variables and the observed
exogenous variables.
The likelihood is computed by integrating out the latent variables. Let θ be the vector of model parameters, y be the vector of observed response variables, x be the vector of observed exogenous variables, and u be the r × 1 vector of latent variables. Then the marginal likelihood looks something like

   L(θ) = ∫_{R^r} f(y|x, u, θ) φ(u|μ_u, Σ_u) du

where R denotes the set of values on the real line, R^r is the analog in r-dimensional space, θ is a vector of the unique model parameters, f(·) is the conditional probability density function for the observed response variables, φ(·) is the multivariate normal density for u, μ_u is the expected value of u, and Σ_u is the covariance matrix for u. All auxiliary parameters are fit directly without any further parameterization, so we simply acknowledge that the auxiliary parameters are among the elements of θ.
The y variables are assumed to be independent, conditionally on x and u, so f () is the product of
the individual conditional densities. One exception to this is when y contains two or more Gaussian
response variables with the identity link, in which case the Gaussian responses are actually modeled
using a multivariate normal density to allow for correlated errors and nonrecursive systems among
Gaussian responses. This one exception does not change how the integral is numerically evaluated,
so we make no effort to represent this distinction in the formulas.
For a single-level model with n response variables, the conditional joint density function for a
given observation is
   f(y|x, u, θ) = ∏_{i=1}^{n} f_i(y_i|x, u, θ)
For a two-level model, the likelihood is computed at the cluster level, so the conditional density is
also a product of the observation-level density contributions within a given cluster
   f(y|x, u, θ) = ∏_{i=1}^{n} ∏_{j=1}^{t} f_i(y_ij|x_j, u, θ)
where t is the number of individuals within the cluster. This extends to more levels by expanding
the products down to the observations nested within the hierarchical groups. Because the single-level
model is a special case of a two-level model where all the groups have a single observation, we will
now use the two-level notation and subscripts.
Except for the ordinal and multinomial families, we use the link function to map the conditional mean

   μ_ij = E(y_ij | x_j, u)

to the linear prediction

   z_ij = x_j′β_i + x_j′Λ_i u

The likelihood is then

   L(θ) = ∫_{R^r} f(y, z, θ) φ(u|μ_u, Σ_u) du
        = (2π)^{−r/2} |Σ_u|^{−1/2} ∫_{R^r} exp{ log f(y, z, θ) − ½(u − μ_u)′Σ_u⁻¹(u − μ_u) } du    (1)
gsem allows nonrecursive systems between Gaussian response variables with the identity link, but
non-Gaussian responses and Gaussian responses with the log link are not allowed to participate in
any nonrecursive systems. This means that if a given response y is specified with a family other
than Gaussian or a link other than identity, then y cannot have a path that ultimately leads back to
itself. Any response may participate in a recursive system because the participating responses may
be treated as exogenous variables when predicting other responses in a recursive system.
The latent vector u consists of stacked collections of the latent variables from each level. Within each level, the latent endogenous variables η are stacked over the latent exogenous variables ξ. Within a given level, the latent exogenous variables and latent endogenous errors are assumed independent and multivariate normal,

   ξ ~ N(μ_ξ, Σ_ξ)
   ε ~ N(0, Ψ)

so according to the linear relationship

   η = Bη + Γξ + Ax + ε

we have that the latent variables are jointly multivariate normal. This linear relationship implies that gsem allows latent variables to predict each other, but only within level. It also means that gsem allows paths from observed variables to latent variables; however, the observed variable must be constant within group if the path is to a group-level latent variable.

For our two-level model, we have

   u ~ N(μ_u, Σ_u)

where μ_u and Σ_u are assembled by stacking η over ξ, with

   μ_η = (I − B)⁻¹(Γμ_ξ + Ax)
   Σ_η = (I − B)⁻¹(ΓΣ_ξΓ′ + Ψ){(I − B)⁻¹}′
   Σ_ηξ = (I − B)⁻¹ΓΣ_ξ

The vector θ is therefore the set of unique model parameters taken from the following:

   β_i is the vector of fixed-effect coefficients for y_ij.
   Λ_i is the matrix of latent loadings for y_ij.
The integral in (1) is generally not tractable, so we must use numerical methods. In the univariate case, the integral of a function multiplied by the kernel of the standard normal distribution can be approximated using Gauss–Hermite quadrature (GHQ). For q-point GHQ, let the abscissa and weight pairs be denoted by (a_k, w_k), k = 1, . . . , q. The GHQ approximation is then

   ∫ f(x) exp(−x²) dx ≈ ∑_{k=1}^{q} w_k f(a_k)

and, for the standard normal kernel,

   ∫ f(x) φ(x) dx ≈ ∑_{k=1}^{q} w*_k f(a*_k)

where a*_k = √2 a_k and w*_k = w_k/√π.

We can use a change-of-variables technique to transform the multivariate integral (1) into a set of nested univariate integrals. Each univariate integral can then be evaluated using GHQ. Let v be a random vector whose elements are independently standard normal, and let L be the Cholesky decomposition of Σ_u, that is, Σ_u = LL′. In the distribution, we have that u = μ_u + Lv, and the linear predictions vector as a function of v is

   z_ij = x_j′β_i + x_j′Λ_i(μ_u + Lv)

so the likelihood is

   L(θ) = (2π)^{−r/2} ∫ · · · ∫ exp{ log f(y, z, θ) − ½ ∑_{k=1}^{r} v_k² } dv_1 · · · dv_r    (2)
The GHQ approximation to the likelihood is

   L_GHQ(θ) = ∑_{k_1=1}^{q} · · · ∑_{k_r=1}^{q} [ exp{ ∑_{i=1}^{n} log f_i(y_ij, ẑ_ijk, θ) } ∏_{s=1}^{r} w*_{k_s} ]

where ẑ_ijk is the linear prediction evaluated at v = (a*_{k_1}, . . . , a*_{k_r}).
Adaptive quadrature
This section sets the stage for mean-variance adaptive Gauss–Hermite quadrature (MVAGH) and mode-curvature adaptive Gauss–Hermite quadrature (MCAGH).
Let's reconsider the likelihood in (2). If we fix the observed variables and the model parameters, we see that the posterior density for v is proportional to

   φ(v) f(y, z, θ)
It is reasonable to assume that this posterior density can be approximated by a multivariate normal density with mean vector μ_v and variance matrix Σ_v. Instead of using the prior density of v as the weighting distribution in the integral, we can use our approximation for the posterior density,

   L(θ) = ∫_{R^r} { f(y, z, θ) φ(v) / φ(v, μ_v, Σ_v) } φ(v, μ_v, Σ_v) dv

The adaptive quadrature approximation to the likelihood is then

   L*(θ) = ∑_{k_1=1}^{q} · · · ∑_{k_r=1}^{q} [ exp{ ∑_{i=1}^{n} log f_i(y_ij, ẑ_ijk, θ) } ∏_{s=1}^{r} ω_{k_s} ]

where

   ẑ_ijk = x_j′β_i + x_j′Λ_i(μ_u + Lα_k)

and α_k and the ω_{k_s} are the adaptive versions of the abscissas and weights after an orthogonalizing transformation, which eliminates posterior covariances between the latent variables. α_k and the ω_{k_s} are functions of a_k and w_k and the adaptive parameters μ_v and Σ_v.
For MVAGH, μ_v is the posterior mean and Σ_v is the posterior variance of v. They are computed iteratively by updating the posterior moments by using the MVAGH approximation, starting with a 0 mean vector and identity variance matrix.

For MCAGH, μ_v is the posterior mode for v and Σ_v is the curvature at the mode. They are computed by optimizing the integrand in (2) with respect to v.
Laplacian approximation
Let's reconsider the likelihood in (1) and denote the argument in the exponential function by

   h(u) = log f(y, z, θ) − ½(u − μ_u)′Σ_u⁻¹(u − μ_u)
        = ∑_{i=1}^{n} ∑_{j=1}^{t} log f_i(y_ij, z_ij, θ) − ½(u − μ_u)′Σ_u⁻¹(u − μ_u)

and denote its Hessian matrix by

   H(u) = ∂²h(u) / ∂u ∂u′

The maximizer of h(u) is û such that h′(û) = 0. The integral in (1) is proportional to the posterior density of u given the data, so û is also the posterior mode.

The second-order Taylor approximation of h(u) then takes the form

   h(u) ≈ h(û) + ½(u − û)′H(û)(u − û)    (3)

so that

   ∫_{R^r} exp{h(u)} du ≈ exp{h(û)} (2π)^{r/2} |−H(û)|^{−1/2}

because the second term in (3) is the kernel of a multivariate normal density once it is exponentiated.

The Laplacian approximation for the log likelihood is

   log L_Lap(θ) = −½ log |Σ_u| − ½ log |−H(û)| + h(û)
Postestimation
We begin by considering the prediction of the latent variables u for a given cluster in a two-level
model. Prediction of latent variables in multilevel generalized linear models involves assigning values
to the latent variables, and there are many methods for doing so; see Skrondal and Rabe-Hesketh (2009)
and Skrondal and Rabe-Hesketh (2004, chap. 7) for a comprehensive review. Stata offers two methods
of predicting latent variables: empirical Bayes means (also known as posterior means) and empirical
Bayes modes (also known as posterior modes). Below we provide more details about the two methods.
Empirical Bayes
Let θ̂ denote the estimated model parameters. Empirical Bayes (EB) predictors of the latent variables are the means or modes of the empirical posterior distribution with the parameters replaced with their estimates θ̂. The method is called empirical because θ̂ is treated as known. EB combines the prior information about the latent variables with the likelihood to obtain the conditional posterior distribution of latent variables. Using Bayes's theorem, the empirical conditional posterior distribution of the latent variables for a given cluster is

   ω(u|y, x; θ̂) = f(y|u, x; θ̂) φ(u; μ̂_u, Σ̂_u) / ∫ f(y|u, x; θ̂) φ(u; μ̂_u, Σ̂_u) du
                = f(y|u, x; θ̂) φ(u; μ̂_u, Σ̂_u) / L(θ̂)

The posterior means are computed as

   ũ = ∫_{R^r} u ω(u|y, x; θ̂) du

where we use the notation ũ rather than û to distinguish predicted values from estimates. This multivariate integral is approximated by MVAGH. If you have multiple latent variables within a level or latent variables across levels, the calculation involves orthogonalizing transformations with the Cholesky transformation because the latent variables are no longer independent under the posterior distribution.
When all the response variables are normal, the posterior density is multivariate normal, and EB
means are also best linear unbiased predictors (BLUPs); see Skrondal and Rabe-Hesketh (2004, 227).
In generalized mixed-effects models, the posterior density tends to multivariate normal as cluster size
increases.
EB modal predictions can be approximated by solving for the ũ such that

   ∂ log ω(u|y, x; θ̂) / ∂u |_{u=ũ} = 0
Because the denominator in the posterior density above does not depend on u, we can omit it from the calculation to obtain
the EB mode. The calculation of EB modes does not require numerical integration, and for that reason
they are often used in place of EB means. As the posterior density gets closer to being multivariate
normal, EB modes get closer and closer to EB means.
Just like there are many methods of assigning values to the random effects, there exist many methods
of calculating standard errors of the predicted random effects; see Skrondal and Rabe-Hesketh (2009)
for a comprehensive review.
Stata uses the posterior standard deviation as the standard error of the posterior-means predictor of random effects. For a given level, the EB posterior covariance matrix of the random effects is given by

   Cov(ũ|y, x; θ̂) = ∫_{R^r} (u − ũ)(u − ũ)′ ω(u|y, x; θ̂) du

The posterior covariance matrix and the integrals are approximated by MVAGH.
Conditional standard errors for the estimated posterior modes are derived from standard theory of maximum likelihood, which dictates that the asymptotic variance matrix of ũ is the negative inverse of the Hessian matrix.
Other predictions
In what follows, we show formulas with the posterior-means estimates of the random effects ũ, which are used by default or if the means option is specified. If the modes option is specified, ũ in these formulas is simply replaced with the posterior modes.

For the ith response in the jth observation within a given cluster in a two-level model, the linear predictor is computed as

   ẑ_ij = x_j′β̂_i + x_j′Λ̂_i ũ

The linear predictor includes the offset or exposure variable if one was specified during estimation, unless the nooffset option is specified. If the fixedonly option is specified, the linear predictor is computed as

   ẑ_ij = x_j′β̂_i

The predicted mean, conditional on the predicted latent variables, is

   μ̂_ij = g⁻¹(ẑ_ij)

where g⁻¹(·) is the inverse link function defined in Link functions above. For the ordinal and multinomial families, the predicted mean is actually a probability, and gsem can produce a probability for each outcome value as described in The ordinal family and The multinomial family above.
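For instance, empirical Bayes predictions of the latent variables might be requested as follows; this sketch assumes predict's latent option together with the means/modes terminology of this entry:

. predict uhat*, latent          // posterior (EB) means, the default
. predict umode*, latent modes   // posterior modes instead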
References
Rabe-Hesketh, S., A. Skrondal, and A. Pickles. 2005. Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics 128: 301–323.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Boca Raton, FL: Chapman & Hall/CRC.
. 2009. Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society, Series A 172: 659–687.
Also see
[SEM] gsem Generalized structural equation model estimation command
Title
methods and formulas for sem Methods and formulas for sem
Description
References
Also see
Description
The methods and formulas for the sem commands are presented below.
Variable notation
We will use the following convention to keep track of the five variable types recognized by the
sem estimation command:
1. Observed endogenous variables are denoted y.
2. Observed exogenous variables are denoted x.
3. Latent endogenous variables are denoted η.
4. Latent exogenous variables are denoted ξ.
5. Error variables are denoted with prefix e. on the associated endogenous variable.
   a. Error variables for observed endogenous variables are denoted e.y.
   b. Error variables for latent endogenous variables are denoted e.η.
In any given analysis, there are typically several variables of each type. Vectors of the four main variable types are denoted y, x, η, and ξ. The vector of all endogenous variables is

   Y = [ y ]
       [ η ]

the vector of all exogenous variables is

   X = [ x ]
       [ ξ ]

and the vector of all error variables is

   ε = [ e.y ]
       [ e.η ]

The model is

   Y = BY + ΓX + α + ε

where B = [β_ij] is the matrix of coefficients on endogenous variables predicting other endogenous variables, Γ = [γ_ij] is the matrix of coefficients on exogenous variables, α = [α_i] is the vector of intercepts for the endogenous variables, and ε is assumed to have mean 0 and

   Cov(X, ε) = 0

Let

   κ = [κ_j] = E(X)
   Φ = [φ_ij] = Var(X)
   Ψ = [ψ_ij] = Var(ε)

Then the mean vector of the endogenous variables is

   μ_Y = E(Y) = (I − B)⁻¹(Γκ + α)

the variance matrix of the endogenous variables is

   Σ_YY = Var(Y) = (I − B)⁻¹(ΓΦΓ′ + Ψ){(I − B)⁻¹}′

and the covariance matrix between the endogenous variables and the exogenous variables is

   Σ_YX = Cov(Y, X) = (I − B)⁻¹ΓΦ

Let Z be the vector of all variables:

   Z = [ Y ]
       [ X ]

Then

   μ = E(Z) = [ μ_Y ]        Σ = Var(Z) = [ Σ_YY   Σ_YX ]
              [ κ   ]                     [ Σ_YX′  Φ    ]
Summary data
Let z_t be the vector of all observed variables for the tth observation,

   z_t = [ y_t ]
         [ x_t ]

and let w_t be the corresponding weight value, where t = 1, . . . , N. If no weights were specified, then w_t = 1. Let w. be the sum of the weights; then the sample mean vector is

   z̄ = (1/w.) ∑_{t=1}^{N} w_t z_t

and the sample covariance matrix is

   S = {1/(w. − 1)} ∑_{t=1}^{N} w_t (z_t − z̄)(z_t − z̄)′
Maximum likelihood
Let θ be the vector of unique model parameters, such as

   θ = [ vec(B) ; vec(Γ) ; α ; κ ; vech(Φ) ; vech(Ψ) ]

Then under the assumption of the multivariate normal distribution, the overall log likelihood for θ is

   log L(θ) = −(w./2) [ k log(2π) + log{det(Σ_o)} + tr(D Σ_o⁻¹) ]

where k is the number of observed variables, Σ_o is the submatrix of Σ corresponding to the observed variables, and

   D = f S + (z̄ − μ_o)(z̄ − μ_o)′

where

   f = 1 if nm1 is specified, and f = (w. − 1)/w. otherwise

and μ_o is the subvector of μ corresponding to the observed variables.

For the BHHH optimization technique and when computing observation-level scores, the log likelihood for θ is computed as

   log L(θ) = −∑_{t=1}^{N} (w_t/2) [ k log(2π) + log{det(Σ_o)} + (z_t − μ_o)′Σ_o⁻¹(z_t − μ_o) ]
Let

   v = [ z̄         ]
       [ vech(f S) ]

and

   m(θ) = [ μ_o       ]
          [ vech(Σ_o) ]

The weighted least-squares (WLS) criterion function to minimize is the quadratic form

   F_WLS(θ) = {v − m(θ)}′ W⁻¹ {v − m(θ)}

where W is the least-squares weight matrix. For unweighted least squares (ULS), the weight matrix is the identity matrix, W = I. Other weight matrices are mentioned in Bollen (1989).

The weight matrix implemented in sem is an estimate of the asymptotic covariance matrix of v. This weight matrix is derived without any distributional assumptions and is often referred to as derived from an arbitrary distribution function or as asymptotic distribution free (ADF), thus the option method(adf).
Groups
When the group() option is specified, each group has its own summary data and model parameters.
The entire collection of model parameters is
1
2
=
...
G
where G is the number of groups. The group-level criterion values are combined to produce an overall
criterion value.
For method(ml) and method(mlmv), the overall log likelihood is
log L() =
G
X
log L(g )
g=1
FWLS () =
G
X
g=1
FWLS (g )
Fitted parameters
sem fits the specified model by maximizing the log likelihood or minimizing the WLS criterion. If θ is the vector of model parameters, then the fitted parameter vector is denoted by θ̂, and similarly for B̂, Γ̂, α̂, κ̂, Φ̂, Ψ̂, μ̂, Σ̂, and their individual elements.
Standardized parameters
Let σ̂_ii be the ith diagonal element of Σ̂. Then the standardized parameter estimates are

   β̃_ij = β̂_ij √(σ̂_jj / σ̂_ii)

   γ̃_ij = γ̂_ij √(σ̂_jj / σ̂_ii)

   σ̃_ij = σ̂_ii if i = j, and σ̂_ij / √(σ̂_ii σ̂_jj) otherwise

   α̃_i = α̂_i / √(σ̂_ii)

   κ̃_j = κ̂_j / √(σ̂_jj)

The variance matrix of the standardized parameters is estimated using the delta method.
Reliability
For an observed endogenous variable y, the reliability may be specified as p or 100p%. The variance of e.y is then constrained to (1 − p) times the observed variance of y.
Postestimation
Model framework
estat framework reports the fitted parameters in their individual matrix forms as introduced in
Fitted parameters.
Goodness of fit
estat gof reports the following goodness-of-fit statistics.
Let the degrees of freedom for the specified model be denoted by df_m. In addition to the specified model, sem fits saturated and baseline models corresponding to the observed variables in the specified model. The saturated model fits a full covariance matrix for the observed variables and has degrees of freedom

   df_s = (p + q)(p + q + 1)/2 + p + q

where p is the number of observed endogenous variables and q is the number of observed exogenous variables in the model. The baseline model fits a reduced covariance matrix for the observed variables depending on the presence of endogenous variables. If there are no endogenous variables, all variables are uncorrelated in the baseline model; otherwise, only exogenous variables are correlated in the baseline model. The degrees of freedom for the baseline model is

   df_b = 2q if p = 0
   df_b = 2p + q + q(q + 1)/2 if p > 0
For method(ml) and method(mlmv), let the saturated log likelihood be denoted by log L_s and the baseline log likelihood be denoted by log L_b. The likelihood-ratio test of the baseline versus saturated models is computed as

   χ²_bs = N F_b

with degrees of freedom df_bs = df_s − df_b. The chi-squared test of the specified model versus the saturated model is computed as

   χ²_ms = N F_WLS(θ̂)

with degrees of freedom df_ms = df_s − df_m.
The Akaike information criterion (Akaike 1974) is defined as

   AIC = −2 log L(θ̂) + 2 df_m

and the Schwarz Bayesian information criterion is defined as

   BIC = −2 log L(θ̂) + (ln N) df_m
See [R] BIC note for additional information on calculating and interpreting the BIC.
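These statistics are reported by estat gof; for instance (a schematic sketch):

. sem (y <- x1 x2)
. estat gof, stats(all)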
The coefficient of determination is computed as

   CD = 1 − det(Ψ̂) / det(Σ̂_YY)

This value is also referred to as the overall R² in estat eqgof (see [SEM] estat eqgof).
The root mean squared error of approximation (Browne and Cudeck 1993) is computed as

   RMSEA = [ (χ²_ms − df_ms) G / (N df_ms) ]^{1/2}

and its 90% confidence interval is computed as

   ( [ λ_L G / (N df_ms) ]^{1/2} , [ λ_U G / (N df_ms) ]^{1/2} )

where λ_L and λ_U are the noncentrality parameters corresponding to a noncentral chi-squared distribution with df_ms degrees of freedom in which the noncentral chi-squared random variable has a cumulative distribution function equal to 0.95 and 0.05, respectively.

The Browne and Cudeck (1993) p-value for the test of close fit with null hypothesis

   H0: RMSEA ≤ 0.05

is computed as

   p = 1 − Pr{ χ²(df_ms, λ_0) ≤ χ²_ms },  where λ_0 = (0.05)² N df_ms / G
Let k be the number of observed variables in the model. If means are not in the fitted model, the standardized root mean squared residual is computed according to

   SRMR = [ 2 ∑_{i=1}^{k} ∑_{j≤i} r²_ij / {k(k + 1)G} ]^{1/2}

(Hancock and Mueller 2006), where r_ij is the standardized covariance residual

   r_ij = s_ij / √(s_ii s_jj) − σ̂_ij / √(σ̂_ii σ̂_jj)
If means are in the fitted model, SRMR is computed according to

   SRMR = [ 2{ ∑_{i=1}^{k} m²_i + ∑_{i=1}^{k} ∑_{j≤i} r²_ij } / {k(k + 3)G} ]^{1/2}

where m_i is the standardized mean residual

   m_i = z̄_i / √(s_ii) − μ̂_i / √(σ̂_ii)

These standardized residuals are not the same as those reported by estat residuals; see Residuals below.
Group goodness of fit
estat ggof reports CD, SRMR, and model versus saturated 2 values for each group separately.
The group-level formulas are the same as those computed for a single group analysis; see Goodness
of fit above.
Equation-level goodness of fit
estat eqgof reports goodness-of-fit statistics for each endogenous variable in the specified model.
The coefficient of determination for the ith endogenous variable is computed as

   R²_i = 1 − ψ̂_ii / σ̂_ii

The Bentler–Raykov (Bentler and Raykov 2000) squared multiple correlation for the ith endogenous variable is computed as

   mc²_i = Ĉov(y_i, ŷ_i)² / { σ̂_ii V̂ar(ŷ_i) }

where ψ̂_ii is a diagonal element of Ψ̂, σ̂_ii is a diagonal element of Σ̂, V̂ar(ŷ_i) is a diagonal element of

   V̂ar(Ŷ) = {(I − B̂)⁻¹Γ̂} Φ̂ {(I − B̂)⁻¹Γ̂}′ + {(I − B̂)⁻¹ − I} Ψ̂ {(I − B̂)⁻¹ − I}′

and Ĉov(y_i, ŷ_i) is a diagonal element of

   Ĉov(Y, Ŷ) = {(I − B̂)⁻¹Γ̂} Φ̂ {(I − B̂)⁻¹Γ̂}′ + (I − B̂)⁻¹ Ψ̂ {(I − B̂)⁻¹ − I}′
Wald tests
estat eqtest performs Wald tests on the coefficients for each endogenous equation in the model.
estat ginvariant computes a Wald test of group invariance for each model parameter that is free
to vary across all groups. See [R] test.
Score tests
estat mindices computes modification indices for each constrained parameter in the model,
including paths and covariances that were not even part of the model specification. Modification indices
are score tests, which are also known as Lagrange multiplier tests. estat scoretests performs a
score test for each user-specified linear constraint. estat ginvariant performs a score test of group
invariance for each model parameter that is constrained to be equal across all groups.
A score test compares a constrained model fit to the same model without one or more constraints.
The score test is computed as

   χ² = g(θ̂)′ V(θ̂) g(θ̂)

where θ̂ is the fitted parameter vector from the constrained model, g(·) is the gradient vector function for the unconstrained model, and V(·) is the variance matrix function computed from the expected information matrix function for the unconstrained model. For method(ml) and method(mlmv),

   g(θ) = ∂ log L(θ) / ∂θ

   V(θ) = [ E{ −∂² log L(θ) / ∂θ ∂θ′ } ]⁻¹

where log L(θ) is the log-likelihood function for the unconstrained model. For method(adf),

   g(θ) = ∂ F_WLS(θ) / ∂θ

   V(θ) = [ E{ ∂² F_WLS(θ) / ∂θ ∂θ′ } ]⁻¹

where F_WLS(θ) is the WLS criterion function for the unconstrained model.
The score test is computed as described in Wooldridge (2010) when vce(robust) or vce(cluster
clustvar) is specified.
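As a usage sketch (model and constraint names hypothetical), the score tests above are what estat scoretests and estat mindices report after a constrained fit:

. sem (y1 <- x@c1) (y2 <- x@c1)
. estat scoretests
. estat mindices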
Residuals
estat residuals reports raw, normalized, and standardized residuals for means and covariances
of the observed variables.
The raw residual for the mean of the ith observed variable is
\[ \bar{z}_i - \hat{\mu}_i \]
The raw residual for the covariance between the ith and jth observed variables is
\[ S_{ij} - \hat{\sigma}_{ij} \]
The normalized residual for the mean of the ith observed variable is
\[ \frac{\bar{z}_i - \hat{\mu}_i}{\sqrt{\widehat{\text{Var}}(\bar{z}_i)}} \]
where
\[ \widehat{\text{Var}}(\bar{z}_i) = \begin{cases} S_{ii}/N & \text{if the sample option is specified} \\ \hat{\sigma}_{ii}/N & \text{otherwise} \end{cases} \]
The normalized residual for the covariance between the ith and jth observed variables is
\[ \frac{S_{ij} - \hat{\sigma}_{ij}}{\sqrt{\widehat{\text{Var}}(S_{ij})}} \]
where
\[ \widehat{\text{Var}}(S_{ij}) = \begin{cases} (S_{ii} S_{jj} + S_{ij}^2)/N & \text{if the sample option is specified} \\ (\hat{\sigma}_{ii} \hat{\sigma}_{jj} + \hat{\sigma}_{ij}^2)/N & \text{otherwise} \end{cases} \]
If the nm1 option is specified, the denominator in the variance estimates is N - 1 instead of N.
The standardized residual for the mean of the ith observed variable is
\[ \frac{\bar{z}_i - \hat{\mu}_i}{\sqrt{\widehat{\text{Var}}(\bar{z}_i - \hat{\mu}_i)}} \]
where
\[ \widehat{\text{Var}}(\bar{z}_i - \hat{\mu}_i) = \widehat{\text{Var}}(\bar{z}_i) - \widehat{\text{Var}}(\hat{\mu}_i) \]
and \( \widehat{\text{Var}}(\hat{\mu}_i) \) is computed using the delta method. Missing values are reported when the computed value of \( \widehat{\text{Var}}(\bar{z}_i) \) is less than \( \widehat{\text{Var}}(\hat{\mu}_i) \). The standardized residual for the covariance between the ith and jth observed variables is
\[ \frac{S_{ij} - \hat{\sigma}_{ij}}{\sqrt{\widehat{\text{Var}}(S_{ij} - \hat{\sigma}_{ij})}} \]
where
\[ \widehat{\text{Var}}(S_{ij} - \hat{\sigma}_{ij}) = \widehat{\text{Var}}(S_{ij}) - \widehat{\text{Var}}(\hat{\sigma}_{ij}) \]
and \( \widehat{\text{Var}}(\hat{\sigma}_{ij}) \) is computed using the delta method. Missing values are reported when the computed value of \( \widehat{\text{Var}}(S_{ij}) \) is less than \( \widehat{\text{Var}}(\hat{\sigma}_{ij}) \). The variances of the raw residuals used in the standardized residual calculations are derived in Hausman (1978).
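As a usage sketch (model hypothetical), the residuals above are requested after estimation with the normalized and standardized options:

. sem (x1 x2 x3 x4 <- X)
. estat residuals
. estat residuals, normalized
. estat residuals, standardized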
Testing standardized parameters
estat stdize provides access to tests on the standardized parameter estimates. estat stdize
can be used as a prefix to lincom (see [R] lincom), nlcom (see [R] nlcom), test (see [R] test), and
testnl (see [R] testnl).
Stability of nonrecursive systems
estat stable reports a stability index for nonrecursive systems. The stability index is calculated
as the maximum of the modulus of the eigenvalues of B. The nonrecursive system is considered
stable if the stability index is less than 1.
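A minimal usage sketch (a hypothetical nonrecursive model with correlated errors):

. sem (y1 <- y2 x1) (y2 <- y1 x2), cov(e.y1*e.y2)
. estat stable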
Direct, indirect, and total effects
estat teffects reports direct, indirect, and total effects for the fitted model. The direct effects are
\[ E_d = \left( \hat{B},\ \hat{\Gamma} \right) \]
the total effects are
\[ E_t = \left( (I-\hat{B})^{-1} - I,\ \ (I-\hat{B})^{-1} \hat{\Gamma} \right) \]
and the indirect effects are \( E_i = E_t - E_d \). The standard errors of the effects are computed using the delta method.

Let \( D \) be the diagonal matrix whose elements are the square roots of the diagonal elements of \( \hat{\Sigma} \), and let \( D_Y \) be the submatrix of \( D \) associated with the endogenous variables. Then the standardized effects are
\[ \tilde{E}_d = D_Y^{-1} E_d D, \qquad \tilde{E}_i = D_Y^{-1} E_i D, \qquad \tilde{E}_t = D_Y^{-1} E_t D \]
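A minimal usage sketch (model hypothetical):

. sem (y1 <- y2 x1) (y2 <- x2)
. estat teffects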
Predictions
predict computes factor scores and linear predictions.
Factor scores are computed with a linear regression by using the mean vector and variance matrix from the fitted model. For notational convenience, let
\[ Z = \begin{pmatrix} z \\ l \end{pmatrix}, \qquad z = \begin{pmatrix} y \\ x \end{pmatrix} \]
where \( l \) denotes the vector of latent variables. The fitted mean vector and covariance matrix of \( Z \) are
\[ \hat{\mu}_Z = \begin{pmatrix} \hat{\mu}_z \\ \hat{\mu}_l \end{pmatrix}, \qquad \hat{\Sigma}_Z = \begin{pmatrix} \hat{\Sigma}_{zz} & \hat{\Sigma}_{zl} \\ \hat{\Sigma}_{zl}' & \hat{\Sigma}_{ll} \end{pmatrix} \]
The factor scores for the latent variables in the tth observation are computed as
\[ \tilde{l}_t = \hat{\mu}_l + \hat{\Sigma}_{zl}'\, \hat{\Sigma}_{zz}^{-1}\, (z_t - \hat{\mu}_z) \]
The linear prediction for the endogenous variables in the tth observation is computed as
\[ \hat{Y}_t = \hat{B}\, \tilde{Y}_t + \hat{\Gamma}\, \tilde{X}_t + \hat{\alpha} \]
where \( \tilde{Y}_t \) combines the observed endogenous values \( y_t \) with the factor scores of the latent endogenous variables, and \( \tilde{X}_t \) combines the observed exogenous values \( x_t \) with the factor scores of the latent exogenous variables.
References
Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716–723.
Bentler, P. M. 1990. Comparative fit indexes in structural models. Psychological Bulletin 107: 238–246.
Bentler, P. M., and T. Raykov. 2000. On measures of explained variance in nonrecursive structural equation models. Journal of Applied Psychology 85: 125–131.
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Browne, M. W., and R. Cudeck. 1993. Alternative ways of assessing model fit. Reprinted in Testing Structural Equation Models, ed. K. A. Bollen and J. S. Long, pp. 136–162. Newbury Park, CA: Sage.
Hancock, G. R., and R. O. Mueller, ed. 2006. Structural Equation Modeling: A Second Course. Charlotte, NC: Information Age Publishing.
Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251–1271.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
Also see
[SEM] sem Structural equation model estimation command
Title
nlcom Nonlinear combinations of parameters
Syntax     Menu     Description     Options     Remarks and examples     Stored results     Also see
Syntax
nlcom exp [, options]
Menu
Statistics > SEM (structural equation modeling) > Testing and CIs > Nonlinear combinations of parameters
Description
nlcom is a postestimation command for use after sem, gsem, and other Stata estimation commands.
nlcom computes point estimates, standard errors, z statistics, p-values, and confidence intervals
for (possibly) nonlinear combinations of the estimated parameters. See [R] nlcom.
Options
See Options in [R] nlcom.
Technical note
estat stdize: is, strictly speaking, unnecessary because everywhere you wanted a standardized
coefficient or correlation, you could just type the formula. If you did that, you would get the same
results except for numerical precision. The answer produced with the estat stdize: prefix will
be a little more accurate because estat stdize: is able to substitute an analytic derivative in one
part of the calculation where nlcom, doing the whole thing itself, would be forced to use a numeric
derivative.
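A minimal sketch of the point being made (model and coefficient names hypothetical); both commands below compute the same nonlinear combination, but the second computes it from the standardized estimates:

. sem (y <- x1 x2)
. nlcom _b[y:x1]/_b[y:x2]
. estat stdize: nlcom _b[y:x1]/_b[y:x2]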
Stored results
See Stored results in [R] nlcom.
Also see
[R] nlcom Nonlinear combinations of estimators
[SEM] estat stdize Test standardized parameters
[SEM] lincom Linear combinations of parameters
[SEM] test Wald test of linear hypotheses
[SEM] testnl Wald test of nonlinear hypotheses
[SEM] example 42g One- and two-level mediation models (multilevel)
Title
predict after gsem Generalized linear predictions, etc.
Syntax     Menu     Description     Options     Remarks and examples     Reference     Also see
Syntax
predict [type] {stub* | newvarlist} [if] [in] [, options_obs_endog]

predict [type] {stub* | newvarlist} [if] [in], latent[(varlist)] [options_latent]

The default is to predict observed endogenous variables with empirical Bayes means predictions of the latent variables.

options_obs_endog       Description
outcome(depvar [#])1    specify observed response variable to predict (default is all)
mu                      inverse link of the linear prediction; the default
pr                      probability; synonym for mu for multinomial, ordinal, or Bernoulli responses
eta                     expected value of the linear prediction
nooffset                ignore offset() or exposure() specified at estimation
fixedonly               treat latent variables as 0
means                   use empirical Bayes means of latent variables; the default
modes                   use empirical Bayes modes of latent variables
intpoints(#)            number of numerical integration points
tolerance(#)            convergence tolerance for empirical Bayes calculations
iterate(#)              maximum number of iterations for empirical Bayes calculations

1 outcome(depvar #) is allowed only after mlogit, ologit, and oprobit. Predicting other generalized responses requires specifying only outcome(depvar). outcome(depvar #) may also be specified as outcome(#.depvar) or outcome(depvar ##). outcome(depvar #3) means the third outcome value; outcome(depvar #3) would mean the same as outcome(depvar 4) if outcomes were 1, 3, and 4.
options_latent            Description
latent                    calculate predictions of all latent variables
latent(varlist)           calculate predictions of specified latent variables
se(stub* | newvarlist)    calculate standard errors of the predictions
means                     use empirical Bayes means; the default
modes                     use empirical Bayes modes
intpoints(#)              number of numerical integration points
tolerance(#)              convergence tolerance for empirical Bayes calculations
iterate(#)                maximum number of iterations for empirical Bayes calculations
Menu
Statistics > Postestimation > Predictions
Description
predict is a standard postestimation command of Stata. This entry concerns use of predict
after gsem. See [SEM] predict after sem if you fit your model with sem.
predict after gsem creates new variables containing observation-by-observation values of estimated
observed response variables, linear predictions of observed response variables, or endogenous or
exogenous latent variables.
Out-of-sample prediction is allowed in three cases:
1. if the prediction does not involve latent variables;
2. if the prediction involves latent variables, directly or indirectly, and option fixedonly is specified; or
3. if the prediction involves latent variables, directly or indirectly, and the model is multilevel with no observation-level latent variables involved.
predict has two ways of specifying the name(s) of the variable(s) to be created:
. predict stub*, ...
or
. predict firstname secondname ..., ...
The first creates variables named stub1, stub2, . . . . The second creates variables named as you specify.
We strongly recommend using the stub* syntax when creating multiple variables because you have
no way of knowing the order in which to specify the individual variable names to correspond to the
order in which predict will make the calculations. If you use stub*, the variables will be labeled
and you can rename them.
The second syntax is useful when creating one variable and you specify either outcome() or
latent().
Options
outcome(depvar [#]) and latent[(varlist)] determine what is to be calculated:

neither specified                predict all observed endogenous variables
outcome(depvar [#]) specified    predict specified observed endogenous variable
latent specified                 predict all latent variables
latent(varlist) specified        predict specified latent variables
If you are predicting latent variables, both empirical Bayes means and modes are available; see
options means, modes, intpoints(#), tolerance(#), and iterate(#) below.
If you are predicting observed response variables, you can obtain \( g^{-1}(x\hat{\beta}) \) or \( x\hat{\beta} \); see options mu and eta below. Predictions can include latent variables or treat them as 0; see option fixedonly. If predictions include latent variables, then just as when predicting latent variables, both means and modes are available; see options means, modes, intpoints(#), tolerance(#), and iterate(#).

mu and pr specify that \( g^{-1}(x\hat{\beta}) \) be calculated, the inverse link of the expected value of the linear prediction. \( x\hat{\beta} \) by default contains predictions of the latent variables. pr is a synonym for mu if response variables are multinomial, ordinal, or Bernoulli. Otherwise, pr is not allowed.

eta specifies that \( x\hat{\beta} \) be calculated, the expected value of the linear prediction. \( x\hat{\beta} \) by default contains predictions of the latent variables.
fixedonly and nooffset are relevant only if observed response variables are being predicted.
fixedonly concerns predictions of latent variables used in the prediction of observed response
variables. fixedonly specifies latent variables be treated as 0, and thus only the fixed-effects part
of the model is used to produce the predictions.
nooffset is relevant only if option offset() or exposure() were specified at estimation time.
nooffset specifies that offset() or exposure() be ignored, thus producing predictions as if
all subjects had equal exposure.
means, modes, intpoints(#), tolerance(#), and iterate(#) specify what predictions of the
latent variables are to be calculated.
means and modes specify that empirical Bayes means or modes be used. Means are the default.
intpoints(#) specifies the number of numerical integration points and is relevant only in the
calculation of empirical Bayes means. intpoints() defaults to the number of integration points
specified at estimation time or to intpoints(7).
tolerance(#) is relevant for the calculation of empirical Bayes means and modes. It specifies the convergence tolerance. It defaults to the value specified at estimation time with gsem's adaptopts() or to tolerance(1e-8).

iterate(#) is relevant for the calculation of empirical Bayes means and modes. It specifies the maximum number of iterations to be performed in the calculation of each integral. It defaults to the value specified at estimation time with gsem's adaptopts() or to iterate(1001).
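A usage sketch (model and variable names hypothetical): predict probabilities for the observed responses and empirical Bayes means for the latent variable after a logistic measurement model:

. gsem (y1 y2 y3 <- F), logit
. predict pr*, pr
. predict Fhat, latent(F)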
Reference
Skrondal, A., and S. Rabe-Hesketh. 2009. Prediction in multilevel generalized linear models. Journal of the Royal Statistical Society, Series A 172: 659–687.
Also see
[SEM] gsem Generalized structural equation model estimation command
[SEM] gsem postestimation Postestimation tools for gsem
[SEM] intro 7 Postestimation tests and predictions
[SEM] example 28g One-parameter logistic IRT (Rasch) model
[SEM] example 29g Two-parameter logistic IRT model
Title
predict after sem Factor scores, linear predictions, etc.
Syntax     Menu     Description     Options     Remarks and examples     Reference     Also see
Syntax
predict [type] {stub* | newvarlist} [if] [in] [, options]

options              Description
xb                   linear prediction for all observed endogenous variables; the default
xb(varlist)          linear prediction for specified observed endogenous variables
xblatent             linear prediction for all latent endogenous variables
xblatent(varlist)    linear prediction for specified latent endogenous variables
latent               factor scores for all latent variables
latent(varlist)      factor scores for specified latent variables
scores               first derivatives of the observation-level log likelihood; for programmers
Menu
Statistics > Postestimation > Predictions
Description
predict is a standard postestimation command of Stata. This entry concerns use of predict
after sem. See [SEM] predict after gsem if you fit your model with gsem.
predict after sem creates new variables containing observation-by-observation values of estimated
factor scores (meaning predicted values of latent variables) and predicted values for latent and observed
endogenous variables. Out-of-sample prediction is allowed.
When predict is used on a model fit by sem with the group() option, results are produced
with the appropriate group-specific estimates. Out-of-sample prediction is allowed; missing values are
filled in for groups not included at the time the model was fit.
predict allows two syntaxes. You can type
. predict stub*, ...
or
. predict firstname secondname ..., ...
Options
xb calculates the linear prediction for all observed endogenous variables in the model. xb is the
default if no option is specified.
xb(varlist) calculates the linear prediction for the variables specified, all of which must be observed
endogenous variables.
xblatent and xblatent(varlist) calculate the linear prediction for all or the specified latent
endogenous variables, respectively.
latent and latent(varlist) calculate the factor scores for all or the specified latent variables,
respectively. The calculation method is an analog of regression scoring; namely, it produces the
means of the latent variables conditional on the observed variables used in the model. If missing
values are found among the observed variables, conditioning is on the variables with observed
values only.
scores is for use by programmers. It provides the first derivative of the observation-level log likelihood
with respect to the parameters.
Programmers: In single-group sem, each parameter that is not constrained to be 0 has an associated
equation. As a consequence, the number of equations, and hence the number of score variables
created by predict, may be large.
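A usage sketch (model and variable names hypothetical): obtain the linear prediction for one observed endogenous variable and factor scores for the latent variable:

. sem (x1 x2 x3 <- X) (y <- X)
. predict yhat, xb(y)
. predict Xhat, latent(X)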
Reference
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Also see
[SEM] example 14 Predicted values
[SEM] methods and formulas for sem Methods and formulas for sem
[SEM] sem postestimation Postestimation tools for sem
Title
sem Structural equation model estimation command
Syntax     Menu     Description     Options     Remarks and examples     Stored results     Reference     Also see
Syntax
sem paths [if] [in] [weight] [, options]

where paths are the paths of the model in command-language path notation; see [SEM] sem and gsem path notation.

options                      Description
model description options    fully define, along with paths, the model to be fit
group options                fit model for different groups of the data
ssd options                  for use with summary statistics data
estimation options           control the estimation process
reporting options            control display of estimation results
syntax options               control interpretation of the syntax you type
Menu
Statistics > SEM (structural equation modeling) > Model building and estimation
Description
sem fits structural equation models. Even when you use the SEM Builder, you are using the sem
command.
Options
model description options describe the model to be fit. The model to be fit is fully specified by paths (which appear immediately after sem) and the options covariance(), variance(), and means(). See [SEM] sem model description options and [SEM] sem and gsem path notation.
group options allow the specified model to be fit for different subgroups of the data, with some
parameters free to vary across groups and other parameters constrained to be equal across groups.
See [SEM] sem group options.
ssd options allow models to be fit using summary statistics data (SSD), meaning data on means,
variances (standard deviations), and covariances (correlations). See [SEM] sem ssd options.
estimation options control how the estimation results are obtained. These options control how the
standard errors (VCE) are obtained and control technical issues such as choice of estimation method.
See [SEM] sem estimation options.
reporting options control how the results of estimation are displayed. See [SEM] sem reporting
options.
syntax options control how the syntax that you type is interpreted. See [SEM] sem and gsem syntax
options.
2. To override means() constraints, you must use the means() option to free the parameter. To override the constraint that the mean of latent exogenous variable MyLatent is 0, specify the means(MyLatent) option. See [SEM] sem and gsem path notation.
3. To override constrained path coefficients from _cons, such as (LatentEndogenous <- _cons@0), you must explicitly specify the path without a constraint (LatentEndogenous <- _cons). See [SEM] sem and gsem path notation.
Stored results

sem stores the following in e():

Scalars
    e(N)              number of observations
    e(N_clust)        number of clusters
    e(N_groups)       number of groups
    e(N_missing)      number of missing values in the sample for method(mlmv)
    e(ll)             log likelihood of model
    e(df_m)           model degrees of freedom
    e(df_b)           baseline model degrees of freedom
    e(df_s)           saturated model degrees of freedom
    e(chi2_ms)        test of target model against saturated model
    e(df_ms)          degrees of freedom for e(chi2_ms)
    e(p_ms)           p-value for e(chi2_ms)
    e(chi2_bs)        test of baseline model against saturated model
    e(df_bs)          degrees of freedom for e(chi2_bs)
    e(p_bs)           p-value for e(chi2_bs)
    e(rank)           rank of e(V)
    e(ic)             number of iterations
    e(rc)             return code
    e(converged)      1 if target model converged, 0 otherwise
    e(critvalue)      log likelihood or discrepancy of fitted model
    e(critvalue_b)    log likelihood or discrepancy of baseline model
    e(critvalue_s)    log likelihood or discrepancy of saturated model
    e(modelmeans)     1 if fitting means and intercepts, 0 otherwise
Macros
    e(cmd)             sem
    e(cmdline)         command as typed
    e(data)            raw or ssd if SSD were used
    e(wtype)           weight type
    e(wexp)            weight expression
    e(title)           title in estimation output
    e(clustvar)        name of cluster variable
    e(vce)             vcetype specified in vce()
    e(vcetype)         title used to label Std. Err.
    e(method)          estimation method: ml, mlmv, or adf
    e(technique)       maximization technique
    e(properties)      b V
    e(estat_cmd)       program used to implement estat
    e(predict)         program used to implement predict
    e(lyvars)          names of latent y variables
    e(oyvars)          names of observed y variables
    e(lxvars)          names of latent x variables
    e(oxvars)          names of observed x variables
    e(groupvar)        name of group variable
    e(xconditional)    empty if noxconditional specified, xconditional otherwise

Matrices
    e(b)               parameter vector
    e(b_std)           standardized parameter vector
    e(b_pclass)        parameter class
    e(admissible)      admissibility of fitted covariance matrices
    e(ilog)            iteration log (up to 20 iterations)
    e(gradient)        gradient vector
    e(V)               covariance matrix of the estimators
    e(V_std)           standardized covariance matrix of the estimators
    e(V_modelbased)    model-based variance
    e(nobs)            vector with number of observations per group
    e(groupvalue)      vector of group values of e(groupvar)
    e(S#)              sample covariance matrix of observed variables (for group #)
    e(means#)          sample means of observed variables (for group #)
    e(W)               weight matrix for method(adf)

Functions
    e(sample)          marks estimation sample (not with SSD)
Reference
Wiggins, V. L. 2011. Multilevel random effects in xtmixed and sem: the long and wide of it. The Stata Blog: Not Elsewhere Classified. http://blog.stata.com/2011/09/28/multilevel-random-effects-in-xtmixed-and-sem-the-long-and-wide-of-it/.
Also see
[SEM] intro 1 Introduction
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] sem path notation extensions Command syntax for path diagrams
[SEM] sem model description options Model description options
[SEM] sem group options Fitting models on different groups
[SEM] sem ssd options Options for use with summary statistics data
[SEM] sem estimation options Options affecting estimation
[SEM] sem reporting options Options affecting reporting of results
[SEM] sem and gsem syntax options Options affecting interpretation of syntax
[SEM] sem postestimation Postestimation tools for sem
[SEM] methods and formulas for sem Methods and formulas for sem
Title
sem and gsem option constraints( ) Specifying constraints
Syntax     Description     Also see
Syntax
sem ... , ... constraints(# [# ...]) ...

gsem ... , ... constraints(# [# ...]) ...
where # are constraint numbers. Constraints are defined by the constraint command; see [R] constraint.
Description
Constraints refer to constraints to be imposed on the estimated parameters of a model. These
constraints usually come in one of three forms:
1. Constraints that a parameter such as a path coefficient or variance is equal to a fixed value such
as 1.
2. Constraints that two or more parameters are equal.
3. Constraints that two or more parameters are related by a linear equation.
It is usually easier to specify constraints with sem's and gsem's path notation; see [SEM] sem and gsem path notation.

sem's and gsem's constraints() option provides an alternative way of specifying constraints.

Using the path notation, you can specify more general relationships, too, such as

. sem ... (y1 <- x@c1) (y2 <- x@(2*c1)) (y3 <- x@(3*c1+1)) ...
Say you now decide you want to fix c1 at value 1. Using the path notation, you modify what you
previously typed:
. sem ... (y1 <- x@1) (y2 <- x@2) ...
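A minimal sketch of the constraints() alternative (constraint number and names hypothetical): define the constraint first, then reference it by number:

. constraint 1 _b[y1:x] = 1
. sem (y1 <- x) (y2 <- x), constraints(1)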
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] gsem model description options Model description options
[SEM] sem model description options Model description options
[R] constraint Define and list constraints
Title
sem and gsem option covstructure( ) Specifying covariance restrictions
Syntax     Description     Option     Also see
Syntax
sem ... , ... covstructure([groupid:] variables, structure) ...

gsem ... , ... covstructure([groupid:] variables, structure) ...

where variables is a list of variable names or

_OEx, meaning all observed exogenous variables in your model (sem only)
_LEx, meaning all latent exogenous variables in your model (including any multilevel latent exogenous variables in the case of gsem)
_Ex, meaning all exogenous variables in your model (sem only)
structure           Description
diagonal            all variances unrestricted, all covariances 0
unstructured        all variances and covariances unrestricted
identity            all variances equal, all covariances 0
exchangeable        all variances equal, all covariances equal
zero                all variances and covariances 0
pattern(matname)    covariances (variances) constrained to be equal whenever the corresponding elements of matname are equal; see note 1
fixed(matname)      covariances (variances) constrained to the values of the corresponding elements of matname; see note 2

Notes:
(1) Only elements in the lower triangle of matname are used. All values in matname are interpreted as the floor() of the value if noninteger values appear. Row and column stripes of matname are ignored.
(2) Only elements on the lower triangle of matname are used. Row and column stripes of matname are ignored.
groupid may be specified only when the group() option is also specified, and even then it is optional;
see [SEM] sem group options.
Description
Option covstructure() provides a sometimes convenient way to constrain the covariances of
your model.
Alternatively or in combination, you can place constraints on the covariances by using the standard
path notation, such as
. sem ..., ... cov(name1*name2@c1 name3*name4@c1) ...
. gsem ..., ... cov(name1*name2@c1 name3*name4@c1) ...
See [SEM] sem and gsem path notation.
Option

covstructure([groupid:] variables, structure) is used either to modify the covariance structure among the exogenous variables of your model or to modify the covariance structure among the error variables of your model. Optional groupid is available only with sem with option group() specified; see [SEM] sem group options.

You may specify the covstructure() option multiple times.

The default covariance structure for exogenous variables is covstructure(_Ex, unstructured) for sem. There is no simple way in this notation to write the default for gsem.

The default covariance structure for error variables is covstructure(e._En, diagonal) for sem and gsem.
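A minimal usage sketch (model hypothetical): constrain the error variances of the measurement variables to be equal with zero covariances:

. sem (x1 x2 x3 x4 <- X), covstructure(e._En, identity)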
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] example 17 Correlated uniqueness model
Title
sem and gsem option from( ) Specifying starting values
Syntax     Description     Option     Also see
Syntax
{sem | gsem} ... , ... from(matname [, skip]) ...

{sem | gsem} ... , ... from(svlist) ...
Description
See [SEM] intro 12 for a description of starting values.
Starting values are usually not specified. When there are convergence problems, it is often necessary
to specify starting values. You can specify starting values by using
1. suboption init() as described in [SEM] sem and gsem path notation, or by using
2. option from() as described here.
Option from() is especially convenient for using the solution of one model as starting values for
another.
Option
skip is an option of from(matname). It specifies to ignore any parameters in matname that do not
appear in the model being fit. If this option is not specified, the existence of such parameters
causes sem (gsem) to issue an error message.
Option from() can be used with sem or gsem. We illustrate below using sem.
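The illustration amounts to the following pattern (model and variable names hypothetical): fit a simpler model, save its solution, and supply it as starting values for a larger model; suboption skip ignores any parameters in b that do not appear in the new model:

. sem (x1 x2 x3 <- X)
. matrix b = e(b)
. sem (x1 x2 x3 x4 <- X) (y <- X), from(b, skip)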
You may combine the two notations. If starting values are specified for a parameter both ways,
those specified by init() take precedence.
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] gsem model description options Model description options
[SEM] sem model description options Model description options
[SEM] gsem estimation options Options affecting estimation
[R] maximize Details of iterative maximization
Title
sem and gsem option reliability( ) Fraction of variance not due to measurement error
Syntax     Description     Option     Also see
Syntax
{sem | gsem} ... , ... reliability(varname # [varname # [...]]) ...

where varname is the name of an observed endogenous variable and # is the fraction or percentage of variance not due to measurement error.
Description
Option reliability() allows you to specify the fraction of variance not due to measurement
error for measurement variables.
Option
reliability(varname # ...) specifies the reliability for variable varname. Reliability is bounded by 0 and 1 and is equal to
\[ 1 - \frac{\text{noise variance}}{\text{total variance}} \]
The reliability is assumed to be 1 when not specified.
Background
Option reliability() may be used with sem and may be used with gsem for Gaussian response
variables with the identity link but only in the absence of censoring. We illustrate using sem, but we
could just as well have used gsem.
Variables measured with error have attenuated path coefficients. If we had the model
. sem (y<-x)
and x were measured with error, then the estimated path coefficient would be biased toward 0. The
usual solution to such measurement problems is to find multiple measurements and develop a latent
variable from them:
. sem (x1 x2 x3<-X) (y<-X)
Another solution is available if we know the reliability of x. In that case, we can fit the model

. sem (x<-X) (y<-X), reliability(x .9)

Multiple reliabilities may be specified, as in reliability(x1 .9 x2 .8 x3 .9).
Even if you do not know the reliability, you can experiment using different but reasonable values
for the reliability and thus determine the sensitivity of your estimation results to the measurement
problem.
The model being fit is
\[ x = \alpha_0 + \beta_0 X + e.x \]
\[ y = \alpha_1 + \beta_1 X + e.y \]
To fit this model, you type

. sem (x<-X) (y<-X), reliability(x .9)

sem will introduce a normalization constraint, namely, that the path coefficient \( \beta_0 \) for x<-X is 1, but that is of no importance. What is important is that the estimate of the path coefficient \( \beta_1 \) of y<-X is the coefficient that would be obtained from y<-x were x measured without error.
In the above, we specified the measurement part of the model first. Be sure to do that. You might think you could equally well reverse the two terms so that, rather than writing

(x<-X) (y<-X)    (correct)

you write

(y<-X) (x<-X)    (incorrect)
All of that is because sem places its normalization constraint from the latent variable to the first
observed endogenous variable. There is no real error if the terms are interchanged except that you
will be surprised by the coefficient of 1 for y<-X and (the reciprocal of) the coefficient of interest
will be on x<-X.
See How sem (gsem) solves the problem for you in [SEM] intro 4 and see Default normalization
constraints in [SEM] sem.
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] gsem model description options Model description options
[SEM] sem model description options Model description options
[SEM] example 24 Reliability
Title
sem and gsem path notation Command syntax for path diagrams
Syntax     Description     Options     Also see
Syntax
sem paths . . .
gsem paths . . .
paths specifies the direct paths between the variables of your model.
The model to be fit is fully described by paths, covariance(), variance(), and means().
Description
The command syntax for describing your SEM is fully specified by paths, covariance(),
variance(), and means(). How this works is described below.
If you are using sem, also see [SEM] sem path notation extensions for documentation of the
group() option for comparing different groups in the data. The syntax of the elements described
below is modified when group() is specified.
If you are using gsem, also see [SEM] gsem path notation extensions for documentation on
specification of family-and-link for generalized (nonlinear) response variables and for specification of
multilevel latent variables.
Either way, read this section first.
Options
covariance() is used to
1. specify that a particular covariance path of your model that usually is assumed to be 0 be
estimated,
2. specify that a particular covariance path that usually is assumed to be nonzero is not to be
estimated (to be constrained to be 0),
3. constrain a covariance path to a fixed value, such as 0, 0.5, 1, etc., and
4. constrain two or more covariance paths to be equal.
variance() does the same as covariance() except it does it with variances.
means() does the same as covariance() except it does it with means.
6. Variances and covariances (curved paths) between variables are indicated by options. Variances
are indicated by
..., ... var(name1)
Covariances are indicated by
..., ... cov(name1*name2)
..., ... cov(name2*name1)
There is no significance to the order of the names.
The actual names of the options are variance() and covariance(), but they are invariably
abbreviated as var() and cov(), respectively.
The var() and cov() options are the same option, so a variance can be typed as
..., ... cov(name1)
and a covariance can be typed as
..., ... var(name1*name2)
7. Variances may be combined, covariances may be combined, and variances and covariances may
be combined.
If you have
..., ... var(name1) var(name2)
you may code this as
..., ... var(name1 name2)
If you have
..., ... cov(name1*name2) cov(name2*name3)
you may code this as
..., ... cov(name1*name2 name2*name3)
All the above combined can be coded as
..., ... var(name1 name2 name1*name2 name2*name3)
or as
..., ... cov(name1 name2 name1*name2 name2*name3)
8. All variables except endogenous variables are assumed to have a variance; it is only necessary
to code the var() option if you wish to place a constraint on the variance or specify an initial
value. See items 11, 12, 13, and 16 below. (In gsem, the variance and covariances of observed
exogenous variables are not estimated and thus var() cannot be used with them.)
Endogenous variables have a variance, of course, but that is the variance implied by the model. If
name is an endogenous variable, then var(name) is invalid. The error variance of the endogenous
variable is var(e.name).
9. Variables mostly default to being correlated:
a. All exogenous variables are assumed to be correlated with each other, whether observed or
latent.
b. Endogenous variables are never directly correlated, although their associated error variables
can be.
c. All error variables are assumed to be uncorrelated with each other.
You can override these defaults on a variable-by-variable basis with the cov() option.
To assert that two variables are uncorrelated that otherwise would be assumed to be correlated,
constrain the covariance to be 0:
..., ... cov(name1*name2@0)
To allow two variables to be correlated that otherwise would be assumed to be uncorrelated,
simply specify the existence of the covariance:
..., ... cov(name1*name2)
This latter is especially commonly done with errors:
..., ... cov(e.name1*e.name2)
(In gsem, you may not use the cov() option with observed exogenous variables. You also may
not use cov() with error terms associated with family Gaussian, link log.)
10. Means of variables are indicated by the following option:
..., ... means(name)
Variables mostly default to having nonzero means:
a. All observed exogenous variables are assumed to have nonzero means. In sem, the means can
be constrained using the means() option, but only if you are performing noxconditional estimation; see [SEM] sem option noxconditional.
b. Latent exogenous variables are assumed to have mean 0. Means of latent variables are not estimated by default. If you specify enough normalization constraints to identify the mean of a latent exogenous variable, you can specify means(name) to indicate that the mean should be estimated.
c. Endogenous variables have no separate mean. Their means are those implied by the model.
The means() option may not be used with endogenous variables.
d. Error variables have mean 0 and this cannot be modified. The means() option may not be
used with error variables.
To constrain the mean to a fixed value, such as 57, code
..., ... means(name@57)
Separate means() options may be combined:
..., ... means(name1@57 name2@100)
11. Fixed-value constraints may be specified for a path, variance, covariance, or mean by using @
(the at symbol). For example,
(name1 <- name2@1)
(name1 <- name2@1 name3@1)
..., ... var(name@100)
..., ... cov(name1*name2@223)
..., ... cov(name1@1 name2@1 name1*name2@0.5)
..., ... means(name@57)
12. Symbolic constraints may be specified for a path, variance, covariance, or mean by using @ (the
at symbol). For example,
(name1 <- name2@c1) (name3 <- name4@c1)
..., ... var(name1@c1 name2@c1)
..., ... cov(name1@1 name2@1 name3@1 name1*name2@c1 name1*name3@c1)
..., ... means(name1@c1 name2@c1)
(name1 <- name2@c1) ..., var(name3@c1) means(name4@c1)
Symbolic names are just names from 1 to 32 characters in length. Symbolic constraints constrain
equality. For simplicity, all constraints below will have names c1, c2, . . . .
13. Linear combinations of symbolic constraints may be specified for a path, variance, covariance,
or mean by using @ (the at symbol). For example,
(name1 <- name2@c1) (name3 <- name4@(2*c1))
..., ... var(name1@c1 name2@(c1/2))
..., ... cov(name1@1 name2@1 name3@1 name1*name2@c1 name1*name3@(c1/2))
..., ... means(name1@c1 name2@(3*c1+10))
(name1 <- name2@(c1/2)) ..., var(name3@c1) means(name4@(2*c1))
14. All equations in the model are assumed to have an intercept (to include observed exogenous variable _cons) unless the noconstant option (abbreviation nocons) is specified, and then all equations are assumed not to have an intercept (not to include _cons). (There are some exceptions to this in gsem because some generalized linear models have no intercept or even the concept of an intercept.)

Regardless of whether noconstant is specified, you may explicitly refer to observed exogenous variable _cons.

The following path specifications are ways of writing the same model:

(name1 <- name2) (name1 <- name3)
(name1 <- name2) (name1 <- name3) (name1 <- _cons)
(name1 <- name2 name3)
(name1 <- name2 name3 _cons)

There is no reason to explicitly specify _cons unless you have also specified the noconstant option and want to include _cons in some equations but not others, or, regardless of whether you specified the noconstant option, you want to place a constraint on its path coefficient. For example,

(name1 <- name2 name3 _cons@c1) (name4 <- name5 _cons@c1)
15. The noconstant option may be specified globally or within a path specification. That is,
(name1 <- name2 name3) (name4 <- name5), nocons
suppresses the intercepts in both equations. Alternatively,
(name1 <- name2 name3, nocons) (name4 <- name5)
suppresses the intercept in the first equation but not the second, whereas
(name1 <- name2 name3) (name4 <- name5, nocons)
suppresses the intercept in the second equation but not the first.
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] sem path notation extensions Command syntax for path diagrams
[SEM] gsem path notation extensions Command syntax for path diagrams
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] intro 6 Comparing groups (sem only)
Title
sem and gsem syntax options Options affecting interpretation of syntax
Syntax     Description     Options     Also see
Syntax
sem paths . . . , . . . syntax options
gsem paths . . . , . . . syntax options
syntax options    Description
latent(names)     explicitly specify the names of the latent variables
nocapslatent      do not treat names with a capitalized first letter as latent variables
Description
These options affect some minor issues of how sem and gsem interpret what you type.
Options
latent(names) specifies that names is the full set of names of the latent variables. sem and gsem
ordinarily assume that latent variables have the first letter capitalized and observed variables have the
first letter lowercased; see [SEM] sem and gsem path notation. When you specify latent(names),
sem and gsem treat the listed variables as the latent variables and all other variables, regardless
of capitalization, as observed. latent() implies nocapslatent.
nocapslatent specifies that having the first letter capitalized does not designate a latent variable.
This option can be used when fitting models with observed variables only where some observed
variables in the dataset have the first letter capitalized.
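A minimal usage sketch (variable names hypothetical): treat lowercase f as latent even though its first letter is not capitalized:

. sem (x1 x2 x3 <- f), latent(f)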
Also see
[SEM] sem Structural equation model estimation command
[SEM] gsem Generalized structural equation model estimation command
[SEM] sem and gsem path notation Command syntax for path diagrams
Title
sem estimation options Options affecting estimation
Syntax     Description     Options     Also see
Syntax
sem paths ... , ... estimation_options

estimation options           Description
method(method)               method used to obtain the estimated parameters
vce(vcetype)                 technique used to obtain the VCE
nm1                          compute sample variances (N - 1) rather than asymptotic variances (N)
noxconditional               include means, variances, and covariances of observed exogenous variables among the estimated parameters
noanchor                     (none)
allmissing                   treat .a, .b, ..., .z like other missing values under method(mlmv)
noivstart                    skip calculation of instrumental-variable starting values
noestimate                   do not fit the model; show starting values instead
maximize options             control maximization process for specified model; seldom used
satopts(maximize options)    control maximization process for saturated model; seldom used
baseopts(maximize options)   control maximization process for baseline model; seldom used
Description
These options control how results are obtained.
Options
method() and vce() specify the method used to obtain parameter estimates and the technique used
to obtain the variancecovariance matrix of the estimates. See [SEM] sem option method( ).
nm1 specifies that the variances and covariances used in the SEM equations be the sample variances (divided by N - 1) and not the asymptotic variances (divided by N). This is a minor technical issue of little importance unless you are trying to match results from other software that assumes sample variances. sem assumes asymptotic variances.
noxconditional states that you wish to include the means, variances, and covariances of the
observed exogenous variables among the parameters to be estimated by sem. See [SEM] sem
option noxconditional.
allmissing specifies how missing values should be treated when method(mlmv) is also specified.
Usually, sem omits from the estimation sample observations that contain missing values of any of
the observed variables used in the model. method(mlmv), however, can deal with these missing
values, and in that case, observations containing missing are not omitted.
Even so, sem, method(mlmv) does omit observations containing .a, .b, . . . , .z from the
estimation sample. sem assumes you do not want these observations used because the missing
value is not missing at random. If you want sem to include these observations in the estimation
sample, specify the allmissing option.
noivstart is an arcane option that is of most use to programmers. It specifies that sem is to
skip efforts to produce good starting values with instrumental-variable techniques, techniques that
require computer time. If you specify this option, you should specify all the starting values. Any
starting values not specified will be assumed to be 0 (means, path coefficients, and covariances)
or some simple function of the data (variances).
noestimate specifies that the model is not to be fit. Instead, starting values are to be shown and they
are to be shown using the coeflegend style of output. An important use of this is to improve
starting values when your model is having difficulty converging. You can do the following:
. sem ..., ... noestimate
. matrix b = e(b)
. ... (modify elements of b) ...
. sem ..., ... from(b)
maximize options specify the standard and rarely specified options for controlling the maximization process for sem; see [R] maximize. The relevant options for sem are difficult, technique(algorithm_spec), iterate(#), [no]log, trace, gradient, showstep, hessian, tolerance(#), ltolerance(#), and nrtolerance(#).
satopts(maximize options) is a rarely specified option and is only relevant if you specify the
method(mlmv) option. sem reports a test for model versus saturated at the bottom of the output.
Thus sem needs to obtain the saturated fit. In the case of method(ml) or method(adf), sem can
make a direct calculation. In the other case of method(mlmv), sem must actually fit the saturated
model. The maximization options specified inside satopts() control that maximization process.
It is rare that you need to specify the satopts() option, even if you find it necessary to specify
the overall maximize options.
baseopts(maximize options) is a rarely specified option and an irrelevant one unless you also
specify method(mlmv) or method(adf). When fitting the model, sem records information about
the baseline model for later use by estat gof, should you use that command. Thus sem needs
to obtain the baseline fit. In the case of method(ml), sem can make a direct calculation. In
the cases of method(mlmv) and method(adf), sem must actually fit the baseline model. The
maximization options specified inside baseopts() control that maximization process. It is rare
that you need to specify the baseopts() option even if you find it necessary to specify the overall
maximize options.
Also see
[SEM] sem Structural equation model estimation command
[SEM] sem option method( ) Specifying method and calculation of VCE
[SEM] sem option noxconditional Computing means, etc., of observed exogenous variables
[SEM] intro 8 Robust and clustered standard errors
[SEM] intro 9 Standard errors, the full story
Title
sem group options Fitting models on different groups
Syntax     Description     Options     Also see
Syntax
sem paths ... , ... group_options

group options            Description
group(varname)           fit model for different groups
ginvariant(classname)    specify parameters that are to be constrained to be equal across groups

classname    Description
scoef        structural coefficients
scons        structural intercepts
mcoef        measurement coefficients
mcons        measurement intercepts
serrvar      covariances of structural errors
merrvar      covariances of measurement errors
smerrcov     covariances between structural and measurement errors
meanex       means of exogenous variables
covex        covariances of exogenous variables
all          all the above
none         none of the above
meanex, covex, and all exclude the observed exogenous variables (that is, they include only the latent
exogenous variables) unless you specify the noxconditional option or the noxconditional
option is otherwise implied; see [SEM] sem option noxconditional. This is what you would desire
in most cases.
Description
sem can fit combined models across subgroups of the data while allowing some parameters to vary
and constraining others to be equal across subgroups. These subgroups could be males and females,
age category, and the like.
sem performs such estimation when the group(varname) option is specified. The ginvariant(classname) option specifies which parameters are to be constrained to be equal across the
groups.
Options
group(varname) specifies that the model be fit as described above. varname specifies the name of
a numeric variable that records the group to which the observation belongs.
If you are using summary statistics data in place of raw data, varname is the name of the group
variable as reported by ssd describe; see [SEM] ssd.
ginvariant(classname) specifies which classes of parameters of the model are to be constrained to
be equal across groups. The classes are defined above. The default is ginvariant(mcoef mcons)
if the option is not specified.
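A minimal usage sketch (variable names hypothetical): fit the model separately by sex while constraining the measurement parameters (the default) and the structural coefficients to be equal across groups:

. sem (x1 x2 x3 <- X) (y <- X), group(sex) ginvariant(mcoef mcons scoef)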
Also see
[SEM] sem Structural equation model estimation command
[SEM] intro 6 Comparing groups (sem only)
[SEM] example 20 Two-factor measurement model by group
[SEM] example 23 Specifying parameter constraints across groups
Title
sem model description options Model description options
Syntax     Description     Options     Also see
Syntax
sem paths ... , ... model_description_options

model description options    Description
covariance()                 specify or constrain covariances
variance()                   specify or constrain variances
means()                      specify or constrain means
covstructure()               specify covariance restrictions
noconstant                   constrain all intercepts to 0
nomeans                      do not fit means or intercepts
noanchor                     do not apply default anchoring; issue an error instead
forcenoanchor                do not apply default anchoring; programmers' option
reliability()                specify fraction of variance not due to measurement error
constraints()                specify constraints
from()                       specify starting values
Description
paths and the options above describe the model to be fit by sem.
Options
covariance(), variance(), and means() fully describe the model to be fit. See [SEM] sem and
gsem path notation.
covstructure() provides a convenient way to constrain covariances in your model. Alternatively
or in combination, you can place constraints by using the standard path notation. See [SEM] sem
and gsem option covstructure( ).
noconstant specifies that all intercepts be constrained to 0. See [SEM] sem and gsem path notation.
nomeans specifies that means and intercepts not be fit. The means and intercepts are concentrated out
of the function being optimized, which is typically the likelihood function. Results for all other
parameters are the same whether or not this option is specified.
This option is seldom specified. sem issues this option to itself when you use summary statistics
data that do not include summary statistics for the means.
noanchor specifies that sem is not to check for lack of identification and fill in anchors where needed.
sem is instead to issue an error message if anchors would be needed. You specify this option when
you believe you have specified the necessary normalization constraints and want to hear about it
if you are wrong. See Identification 2: Normalization constraints (anchoring) in [SEM] intro 4.
forcenoanchor is similar to noanchor except that rather than issue an error message, sem proceeds
to estimation. There is no reason you should specify this option. forcenoanchor is used in testing
of sem at StataCorp.
reliability() specifies the fraction of variance not due to measurement error for a variable. See
[SEM] sem and gsem option reliability( ).
constraints() specifies parameter constraints you wish to impose on your model; see [SEM] sem
and gsem option constraints( ). Constraints can also be specified as described in [SEM] sem and
gsem path notation, and they are usually more conveniently specified using the path notation.
from() specifies the starting values to be used in the optimization process; see [SEM] sem and gsem
option from( ). Starting values can also be specified by using the init() suboption as described
in [SEM] sem and gsem path notation.
Also see
[SEM] sem Structural equation model estimation command
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] sem and gsem option covstructure( ) Specifying covariance restrictions
[SEM] sem and gsem option reliability( ) Fraction of variance not due to measurement error
[SEM] sem and gsem option constraints( ) Specifying constraints
[SEM] sem and gsem option from( ) Specifying starting values
Title
sem option method( ) Specifying method and calculation of VCE
Syntax     Description     Options     Also see
Syntax
sem ... , ... method(method) vce(vcetype) ...

method    Description
ml        maximum likelihood; the default
mlmv      maximum likelihood with missing values
adf       asymptotic distribution free

vcetype                            Description
oim                                observed information matrix; the default
eim                                expected information matrix
opg                                outer product of the gradient vectors
robust                             Huber/White/sandwich estimator
cluster clustvar                   clustered sandwich estimator
bootstrap [, bootstrap options]    bootstrap estimation
jackknife [, jackknife options]    jackknife estimation

The allowed combinations of method() and vce() are

         oim    eim    opg    robust    cluster    bootstrap    jackknife
ml        x      x      x       x          x           x            x
mlmv      x      x      x       x          x           x            x
adf       x      x                                     x            x
Description
sem option method() specifies the method used to obtain the estimated parameters.
sem option vce() specifies the technique used to obtain the variancecovariance matrix of the
estimates (VCE), which includes the reported standard errors.
Options
method(method) specifies the method used to obtain parameter estimates. method(ml) is the default.
vce(vcetype) specifies the technique used to obtain the VCE. vce(oim) is the default.
Also see
[SEM] sem Structural equation model estimation command
[SEM] intro 4 Substantive concepts
[SEM] intro 8 Robust and clustered standard errors
[SEM] intro 9 Standard errors, the full story
[SEM] example 26 Fitting a model with data missing at random
Title
sem option noxconditional Computing means, etc., of observed exogenous variables
Syntax     Description     Option     Also see
Syntax
sem ... , ... noxconditional ...
Description
sem has a noxconditional option that you may rarely wish to specify. The option is described
below.
Option
noxconditional states that you wish to include the means, variances, and covariances of the observed
exogenous variables among the parameters to be estimated by sem.
What is x conditional?
In many cases, sem does not include the means, variances, and covariances of observed exogenous
variables among the parameters to be estimated. When sem omits them, the estimator of the model is
said to be x conditional. Rather than estimating the values of the means, variances, and covariances,
sem uses the separately calculated observed values of those statistics. sem does this to save time and
memory.
sem does not use the x-conditional calculation when it would be inappropriate.
The noxconditional option prevents sem from using the x-conditional calculation. You specify
noxconditional on the sem command:
. sem ..., ... noxconditional
Do not confuse the x-conditional calculation with the assumption of conditional normality discussed
in [SEM] intro 4. The x-conditional calculation is appropriate even when the assumption of conditional
normality is inappropriate.
sem automatically uses the noxconditional calculation in two cases:

1. sem defaults to noxconditional whenever constraints are specified on the means, variances, or covariances of observed exogenous variables, such as

. sem ..., ... means(x1@m x2@m)
. sem ..., ... var(x1@v x2@v)
. sem ..., ... cov(x1*x2@c x1*x3@c)
. sem ..., ... covstruct(_OEx, diagonal)

See [SEM] sem and gsem path notation and [SEM] sem and gsem option covstructure( ).
2. sem defaults to noxconditional whenever you use method(mlmv) and there are missing values
among the observed exogenous variables.
There are only three reasons for you to specify the noxconditional option:
1. Specify noxconditional if you subsequently wish to test means, variances, or covariances of
observed exogenous variables with postestimation commands. For example,
. sem ..., ... noxconditional
. sem, coeflegend
. test _b[means(x1):_cons] == _b[means(x2):_cons]
2. Specify noxconditional if you are fitting a model with the group() option.
3. Specify noxconditional if you also specify the ginvariant() option, and you want the
ginvariant() classes meanex, covex, or all to include the observed exogenous variables.
For example,
. sem ..., ... group(agegrp) ginvariant(all) noxconditional
You may also wish to specify noxconditional when comparing results with those from other
packages. Many packages use the noxconditional approach when using an estimation method other
than maximum likelihood (ML). Correspondingly, most packages use the x-conditional calculation
when using ML.
Also see
[SEM] sem Structural equation model estimation command
Title
sem option select( ) Using sem with summary statistics data
Syntax     Description     Option     Also see
Syntax
sem ... , ... select(# [# ...]) ...
Description
sem may be used with summary statistics data (SSD), data containing only summary statistics such
as the means, standard deviations or variances, and correlations and covariances of the underlying,
raw data.
You enter SSD with the ssd command; see [SEM] ssd.
To fit a model with sem, there is nothing special you have to do except specify the select()
option where you would usually specify if exp.
Option
select(# # . . . ) is allowed only when you have SSD in memory. It specifies which groups should
be used.
You may select only groups for which you have separate summary statistics recorded in your
summary statistics dataset; the ssd describe command will list the group variable, if any. See
[SEM] ssd.
By the way, select() may be combined with sem option group(). Where you might usually type

. sem ... if agegrp==1 | agegrp==3, ... group(agegrp)

you type the following when using SSD:

. sem ..., ... group(agegrp) select(1 3)

The above restricts sem to age groups 1 and 3, so the result will be an estimation of a combined model of age groups 1 and 3 with some coefficients allowed to vary between the groups and other coefficients constrained to be equal across the groups. See [SEM] sem group options.
Also see
[SEM] sem Structural equation model estimation command
[SEM] intro 11 Fitting models with summary statistics data (sem only)
Title
sem path notation extensions Command syntax for path diagrams
Syntax     Description     Options     Also see
Syntax
sem paths ... , ... group(varname) ...
paths specifies the direct paths between the variables of your model.
The model to be fit is fully described by paths, covariance(), variance(), and means().
The syntax of these elements is modified (generalized) when the group() option is specified.
Description
This entry concerns sem only.
The command syntax for describing your SEMs is fully specified by paths, covariance(),
variance(), and means(). How that works is described in [SEM] sem and gsem path notation.
See that section before reading this section.
This entry concerns the path features unique to sem, and that has to do with the group() option
for comparing different groups.
Options
covariance(), variance(), and means() are described in [SEM] sem and gsem path notation.
group(varname) allows models specified with paths, covariance(), variance(), and means() to be automatically generalized (interacted) with the groups defined by varname; see [SEM] intro 6. The syntax of paths and the arguments of covariance(), variance(), and means() gain an extra syntactical piece when group() is specified.

group(varname) specifies that the model be fit separately for the different values of varname. varname might be sex, and then the model would be fit separately for males and females, or varname might be something else and perhaps take on more than two values.

Whatever varname is, group(varname) defaults to letting some of the path coefficients, covariances, variances, and means of your model vary across the groups and constraining others to be equal. Which parameters vary and which are constrained is described in [SEM] sem group options, but that is a minor detail right now.
In what follows, we will assume that varname is mygrp and takes on three values. Those values
are 1, 2, and 3, but they could just as well be 2, 9, and 12.
Consider typing

. sem ..., ...

and typing

. sem ..., ... group(mygrp)
Whatever the paths, covariance(), variance(), and means() are that describe the model,
there are now three times as many parameters because each group has its own unique set. In fact,
when you give the second command, you are not merely asking for three times the parameters, you
are specifying three models, one for each group! In this case, you specified the same model three
times without knowing it.
You can vary the model specified across groups.
1. Lets write the model you wish to fit as
. sem (a) (b) (c), cov(d) cov(e) var(f)
where a, b, . . . , f stand for what you type. In this generic example, we have two cov() options
just because multiple cov() options often occur in real models. When you type
. sem (a) (b) (c), cov(d) cov(e) var(f) group(mygrp)

results are as if you typed the path and option specifications separately for each group, prefixed with 1:, 2:, and 3:. The 1:, 2:, and 3: identify the groups for which paths, covariances, or variances are being added, modified, or constrained.

If mygrp contained the unique values 5, 8, and 10 instead of 1, 2, and 3, then 5: would appear in place of 1:; 8: would appear in place of 2:; and 10: would appear in place of 3:.
2. Consider the model
. sem (y <- x) (b) (c), cov(d) cov(e) var(f) group(mygrp)
If you wanted to constrain the path coefficient (y <- x) to be the same across all three groups,
you could type
. sem (y <- x@c1) (b) (c), cov(d) cov(e) var(f) group(mygrp)
See item 12 in [SEM] sem and gsem path notation for more examples of specifying constraints.
This works because the expansion of (y <- x@c1) is
(1: y <- x@c1) (2: y <- x@c1) (3: y <- x@c1)
If you wanted to constrain the path coefficient (y <- x) to be the same in groups 2 and 3, you
could type
. sem (1: y <- x) (2: y <- x@c1) (3: y <- x@c1) (b) (c),    ///
      cov(d) cov(e) var(f) group(mygrp)
4. Alternatively to item 3, you could type
. sem (y <- x) (2: y <- x@c1) (3: y <- x@c1) (b) (c), ///
       cov(d) cov(e) var(f) group(mygrp)
The part (y <- x) (2: y <- x@c1) (3: y <- x@c1) expands to
(1: y <- x) (2: y <- x) (3: y <- x) (2: y <- x@c1) (3: y <- x@c1)
and thus the path is defined twice for group 2 and twice for group 3. When a path is defined
more than once, the definitions are combined. In this case, the second definition adds more
information, so the result is as if you typed
(1: y <- x) (2: y <- x@c1) (3: y <- x@c1)
5. Alternatively, you could type
. sem (y <- x@c1) (1: y <- x@c2) (b) (c), ///
       cov(d) cov(e) var(f) group(mygrp)
When results are combined from repeated definitions, definitions that appear later take
precedence. In this case, results are as if the expansion read
(1: y <- x@c2) (2: y <- x@c1) (3: y <- x@c1)
Thus the coefficients for groups 2 and 3 are constrained to be equal. The group-1 coefficient is
constrained to c2. If c2 appears nowhere else in the model specification, then results are as if
the path for group 1 were unconstrained.
6. Instead of following item 3, item 4, or item 5, you could not type
. sem (y <- x@c1) (1: y <- x) (b) (c), ///
       cov(d) cov(e) var(f) group(mygrp)
and expect (1: y <- x) to undo the constraint for group 1; you might think that 1: y <- x would
replace 1: y <- x@c1. Information, however, is combined, and even though precedence is
given to information appearing later, silence does not count as information. Thus the expanded
and reduced specification reads the same as if (1: y <- x) was never specified:
(1: y <- x@c1) (2: y <- x@c1) (3: y <- x@c1)
7. Items 1–6, stated in terms of paths, apply equally to what is typed inside the means(),
variance(), and covariance() options. For instance, if you typed
. sem (a) (b) (c), var(e.y@c1) group(mygrp)
then you are constraining the variance to be equal across all three groups.
If you wanted to constrain the variance to be equal in groups 2 and 3, you could type
. sem (a) (b) (c), var(e.y) var(2: e.y@c1) var(3: e.y@c1) group(mygrp)
You could omit typing var(e.y) because it is implied. Alternatively, you could type
. sem (a) (b) (c), var(e.y@c1) var(1: e.y@c2) group(mygrp)
because silence does not count as information when specifications are combined.
Similarly, if you typed
. sem (a) (b) (c), cov(e.y1*e.y2@c1) group(mygrp)
then you are constraining the covariance to be equal across all groups. If you wanted to constrain
the covariance to be equal in groups 2 and 3, you could type
. sem (a) (b) (c), cov(e.y1*e.y2)                     ///
       cov(2: e.y1*e.y2@c1) cov(3: e.y1*e.y2@c1) ///
       group(mygrp)
You could not omit cov(e.y1*e.y2) because it is not assumed. By default, error variables
are assumed to be uncorrelated. Omitting the option would constrain the covariance to be 0 in
group 1 and to be equal in groups 2 and 3.
Alternatively, you could type
. sem (a) (b) (c), cov(e.y1*e.y2@c1)  ///
       cov(1: e.y1*e.y2@c2)           ///
       group(mygrp)
8. In the examples above, we have referred to the groups with their numeric values, 1, 2, and 3.
Had the values been 5, 8, and 10, then we would have used those values.
If the group variable mygrp has a value label, you can use the label to refer to the group. For
instance, imagine mygrp is labeled as follows:
. label define grpvals 1 Male 2 Female 3 "Unknown sex"
. label values mygrp grpvals
We could type
. sem (y <- x) (Female: y <- x@c1) (Unknown sex: y <- x@c1) ..., ...
or we could type
. sem (y <- x) (2: y <- x@c1) (3: y <- x@c1) ..., ...
Also see
[SEM] sem Structural equation model estimation command
[SEM] sem and gsem path notation Command syntax for path diagrams
[SEM] intro 2 Learning the language: Path diagrams and command language
[SEM] intro 6 Comparing groups (sem only)
Title
sem postestimation Postestimation tools for sem
Description
Also see
Description
The following are the postestimation commands that you can use after estimation by sem:
Command                  Description
---------------------------------------------------------------------------
sem, coeflegend          display _b[] notation
estat framework          display results in modeling framework (matrix form)
estat gof                overall goodness of fit
estat ggof               group-level goodness of fit
estat eqgof              equation-level goodness of fit
estat residuals          display mean and covariance residuals
estat ic                 Akaike's and Schwarz's Bayesian information criteria
                         (AIC and BIC)
estat mindices           modification indices (score tests for adding paths)
estat scoretests         score tests
estat ginvariant         tests of invariance of parameters across groups
estat eqtest             equation-level tests that all coefficients are zero
lrtest                   likelihood-ratio tests
test                     Wald tests of linear hypotheses
lincom                   linear combinations of parameters
nlcom                    nonlinear combinations of parameters
testnl                   Wald tests of nonlinear hypotheses
estat stdize:            test standardized parameters
estat teffects           decomposition of effects
estat stable             assess stability of nonrecursive systems
estat summarize          summary statistics for the estimation sample
estat vce                variance–covariance matrix of the estimators (VCE)
predict                  factor scores and other predictions
margins                  marginal means, predictive margins, and marginal effects
estimates                cataloging estimation results
---------------------------------------------------------------------------
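For instance, after fitting a model, you might check its overall fit and then look for omitted paths;
a minimal sketch (variable names hypothetical):
. sem (y <- x1 x2)
. estat gof
. estat mindices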
Also see
[SEM] sem reporting options Options affecting reporting of results
Title
sem reporting options Options affecting reporting of results
Syntax
Reference
Description
Also see
Options
Syntax
sem paths . . . , . . . reporting options
sem, reporting options
reporting options        Description
---------------------------------------------------------------------------
level(#)                 set confidence level; default is level(95)
standardized             display standardized values
coeflegend               display legend of _b[] coefficient names instead of
                         statistics
nocnsreport              suppress display of constraints
nodescribe               suppress display of variable classification table
noheader                 suppress header above parameter table
nofootnote               suppress footnotes below parameter table
notable                  suppress parameter table
nofvlabel                display group values rather than value labels
fvwrap(#)                allow # lines when wrapping long value labels
fvwrapon(style)          apply style for wrapping long value labels; style may
                         be word or width
showginvariant           report each estimated parameter, including
                         group-invariant ones
---------------------------------------------------------------------------
Description
These options control how sem displays estimation results.
Options
level(#); see [R] estimation options.
standardized displays standardized values, that is, beta values for coefficients, correlations for
covariances, and 1s for variances. Standardized values are obtained using model-fitted variances
(Bollen 1989, 124–125). We recommend caution in the interpretation of standardized values,
especially with multiple groups.
coeflegend displays the legend that reveals how to specify estimated coefficients in _b[] notation,
which you are sometimes required to use when specifying postestimation commands.
nocnsreport suppresses the display of the constraints. Fixed-to-zero constraints that are automatically
set by sem are not shown in the report to keep the output manageable.
nodescribe suppresses display of the variable classification table.
noheader suppresses the header above the parameter table, the display that reports the final
log-likelihood value, number of observations, etc.
nofootnote suppresses the footnotes displayed below the parameter table.
notable suppresses the parameter table.
nofvlabel displays group values rather than value labels.
fvwrap(#) specifies how many lines to allow when long value labels must be wrapped. Labels
requiring more than # lines are truncated. This option overrides the fvwrap setting; see [R] set
showbaselevels.
fvwrapon(style) specifies whether value labels that wrap will break at word boundaries or break
based on available space.
fvwrapon(word), the default, specifies that value labels break at word boundaries.
fvwrapon(width) specifies that value labels break based on available space.
This option overrides the fvwrapon setting; see [R] set showbaselevels.
showginvariant specifies that each estimated parameter be reported in the parameter table. The
default is to report each invariant parameter only once.
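For instance, after fitting a model, you could redisplay the results with standardized values and
90% confidence intervals by using the replay syntax above (a sketch):
. sem, standardized level(90)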
Reference
Bollen, K. A. 1989. Structural Equations with Latent Variables. New York: Wiley.
Also see
[SEM] sem Structural equation model estimation command
[SEM] example 8 Testing that coefficients are equal, and constraining them
[SEM] example 16 Correlation
Title
sem ssd options Options for use with summary statistics data
Syntax
Description
Options
Also see
Syntax
sem paths . . . , . . . ssd options
ssd options              Description
---------------------------------------------------------------------------
select(# [# ...])        use only the specified groups of the summary
                         statistics data
forcecorrelations        allow calculations usually considered suspicious with
                         partial summary statistics
---------------------------------------------------------------------------
Description
Data are sometimes available in summary statistics form only. These summary statistics include
means, standard deviations or variances, and correlations or covariances. These summary statistics
can be used by sem in place of the underlying raw data.
Options
select() is an alternative to if exp when you are using summary statistics data (SSD). Where you
might usually type
. sem ... if agegrp==1 | agegrp==3 | agegrp==5, ...
with SSD in memory, you type
. sem ..., ... select(1 3 5)
See [SEM] sem option select( ) and [SEM] intro 11.
forcecorrelations tells sem that it may make calculations that would usually be considered
suspicious with SSD that contain only a subset of means, variances (standard deviations), and
covariances (correlations). Do not specify this option unless you appreciate the statistical issues
that we are about to discuss. There are two cases where forcecorrelations is relevant.
In the first case, sem is unwilling to produce group() estimates if one or more (usually all) of the
groups have only correlations defined. You can override that by specifying forcecorrelations,
and sem will assume unit variances for the group or groups that have correlations only. Doing
this is suspect unless you specify in ginvariant() all parameters that depend on covariances
or unless you truly know that the variances are indeed 1.
In the second case, sem is unwilling to pool across groups unless you have provided means and
covariances (or means and correlations and standard deviations or variances). Without that information, should the need for pooling arise, sem issues an error message. The forcecorrelations
option specifies that sem ignore its rule and pool correlation matrices, treating correlations as
if they were covariances when variances are not defined and treating means as if they were 0
when means are not defined. The only justification for making the calculation in this way is that
variances truly are 1 and means truly are 0.
Understand that there is nothing wrong with using pure correlation data, or covariance data without
the means, so long as you fit models for individual groups. Doing anything across groups basically
requires that sem have the covariance and mean information.
Also see
[SEM] sem Structural equation model estimation command
[SEM] intro 11 Fitting models with summary statistics data (sem only)
[SEM] ssd Making summary statistics data (sem only)
Title
ssd Making summary statistics data (sem only)
Syntax
Description
Options
Stored results
Also see
Syntax
To enter summary statistics data (SSD), the commands are
ssd init varlist
ssd set [#] observations #
ssd set [#] means vector
ssd set [#] { variances | sd } vector
ssd set [#] { covariances | correlations } matrix
ssd addgroup        (to add a second group)
ssd unaddgroup #
ssd status [#]
To create SSD from raw data in memory, the command is
ssd build varlist [, group(varname) clear]
In the above, vector may be specified as
1. # # . . .
which is to say, a space-separated list of numbers, and also as
2. (stata) matname
where matname is the name of a Stata 1 × k or k × 1 matrix, for example,
. ssd set variances (stata) mymeans
3. (mata) matname
where matname is the name of a Mata 1 × k or k × 1 matrix, for example,
. ssd set sd (mata) mymeans
matrix may be specified as
1. # # . . . \ # # . . . \ . . .
which is to say, the diagonal and upper triangle of the matrix, with rows separated by
backslashes, for example,
. ssd set correlations 1 .2 .3 \ 1 .5 \ 1
2. (ltd) # # . . .
which is to say, a space-separated list of numbers corresponding to the lower triangle and
diagonal of the matrix, without backslashes between rows, for example,
. ssd set correlations (ltd) 1 .2 1 .3 .5 1
3. (dut) # # . . .
which is to say, a space-separated list of numbers corresponding to the diagonal and upper
triangle of the matrix, without backslashes between rows, for example,
. ssd set correlations (dut) 1 .2 .3 1 .5 1
4. (stata) matname
where matname is the name of a Stata k × k symmetric matrix, for example,
. ssd set correlations (stata) mymat
5. (mata) matname
where matname is the name of a Mata k × k symmetric matrix, for example,
. ssd set correlations (mata) mymat
Description
ssd allows you to enter SSD to fit SEMs and allows you to create SSD from original, raw data to
publish or to send to others (and thus preserve participant confidentiality). Data created by ssd may
be used with sem but not gsem.
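For instance, a minimal sketch of entering SSD and then fitting a model follows; the variable
names and all numbers are made up for illustration:
. ssd init y x1 x2
. ssd set observations 74
. ssd set means 20 3 5
. ssd set sd 4 1 2
. ssd set correlations 1 .3 .2 \ 1 .5 \ 1
. sem (y <- x1 x2)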
Options
group(varname) is for use with ssd build. It specifies that separate groups of summary statistics
be produced for each value of varname.
clear is for use with ssd build. It specifies that it is okay to replace the data in memory with SSD
even if the original dataset has not been saved since it was last changed.
A summary statistics dataset is different from a regular, raw Stata dataset. Be careful not to use
standard Stata data-manipulation commands with SSD in memory. These commands include
generate, replace, merge, append, drop, and set obs, to mention a few. You may, however,
use rename to change the names of the variables.
The other data-manipulation commands can damage your summary statistics dataset. If you make
a mistake and do use one of these commands, do not attempt to repair the data yourself. Let ssd
repair your data by typing
. ssd repair
ssd is usually successful as long as variables or observations have not been dropped.
Every time you use ssd, even for something as trivial as describing or listing the data, ssd verifies
that the data are not corrupted. If ssd finds that they are, it suggests you type ssd repair:
. ssd describe
SSD corrupt
The summary statistics data should ... [ssd describes the problem].
The data may be fixable; type ssd repair.
. ssd repair
(data repaired)
. ssd describe
(usual output appears)
In critical applications, we also recommend you digitally sign your summary statistics dataset:
. datasignature set
5:5(65336):3718404259:2275399871
Then at any future time, you can verify the data are still just as they should be:
. datasignature confirm
(data unchanged since 30jun2012 15:32)
The data signature is a function of the variable names. If you rename a variable (something that is
allowed), then the data signature will change:
. rename varname newname
. datasignature confirm
data have changed since 30jun2012 15:32
r(9);
Before re-signing, however, if you want to convince yourself that the data are unchanged except
for the variable name, type datasignature report. It is the part of the signature in parentheses
that has to do with the variable names. datasignature report will tell you what the new signature
would be, and you can verify that the other components of the signature match.
See [D] datasignature.
Stored results
ssd describe stores the following in r():
Scalars
    r(N)                      number of observations
    r(k)                      number of variables
    r(G)                      number of groups
    r(complete)               1 if complete, 0 otherwise
    r(complete_means)         1 if means are set, 0 otherwise
    r(complete_covariances)   1 if covariances are set, 0 otherwise
Macros
    r(v#)                     name of variable #
    r(groupvar)               name of group variable
Also see
[SEM] intro 11 Fitting models with summary statistics data (sem only)
[D] datasignature Determine whether data have changed
[SEM] example 2 Creating a dataset from published covariances
[SEM] example 19 Creating multiple-group summary statistics data
[SEM] example 25 Creating summary statistics data from raw data
Title
test Wald test of linear hypotheses
Syntax
Remarks and examples
Menu
Stored results
Description
Also see
Options
Syntax
test coeflist
test exp = exp [= . . .]
test [eqno] [: coeflist]
test [eqno = eqno [= . . .]] [: coeflist]
test (spec) [(spec) . . .] [, test options]
Menu
Statistics > SEM (structural equation modeling) > Testing and CIs > Wald tests of linear hypotheses
Description
test is a postestimation command for use after sem, gsem, and other Stata estimation commands.
test performs the Wald test of the hypothesis or hypotheses that you specify. In the case of sem
and gsem, you must use the _b[] coefficient notation.
See [R] test. Also documented there is testparm. testparm cannot be used after sem or gsem
because its syntax hinges on use of shortcuts for referring to coefficients.
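For instance, to test whether two path coefficients are equal after sem, you could type (a sketch;
the variable names are hypothetical, and sem, coeflegend reveals the _b[] names):
. sem (y <- x1 x2)
. sem, coeflegend
. test _b[y:x1] = _b[y:x2]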
Options
See Options for test in [R] test.
Stored results
See Stored results in [R] test.
Also see
[SEM] example 8 Testing that coefficients are equal, and constraining them
[SEM] example 16 Correlation
Title
testnl Wald test of nonlinear hypotheses
Syntax
Remarks and examples
Menu
Stored results
Description
Also see
Options
Syntax
testnl exp = exp [= . . .] [, options]
testnl (exp = exp [= . . .]) [(exp = exp [= . . .]) . . .] [, options]
Menu
Statistics > SEM (structural equation modeling) > Testing and CIs > Wald tests of nonlinear hypotheses
Description
testnl is a postestimation command for use after sem, gsem, and other Stata estimation commands.
testnl performs the Wald test of the nonlinear hypothesis or hypotheses. In the case of sem and
gsem, you must use the _b[] coefficient notation.
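For instance, a sketch of testing a nonlinear hypothesis about a product of path coefficients after
sem (variable names hypothetical):
. sem (y1 <- x) (y2 <- y1)
. testnl _b[y2:y1]*_b[y1:x] = 0.5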
Options
See Options in [R] testnl.
Technical note
estat stdize: is unnecessary because, with testnl, everywhere you wanted a standardized
coefficient or correlation, you could just type the formula. If you did that, you would get the
same answer except for numerical precision. In this case, the answer produced with the estat
stdize: prefix will be a little more accurate because estat stdize: is able to substitute an analytic
derivative in one part of the calculation where testnl, doing the whole thing itself, would be forced
to use a numeric derivative.
Stored results
See Stored results in [R] testnl.
Also see
[R] testnl Test nonlinear hypotheses after estimation
[SEM] test Wald test of linear hypotheses
[SEM] lrtest Likelihood-ratio test of linear hypothesis
[SEM] estat stdize Test standardized parameters
[SEM] estat eqtest Equation-level test that all coefficients are zero
[SEM] nlcom Nonlinear combinations of parameters
Glossary
ADF, method(adf). ADF stands for asymptotic distribution free and is a method used to obtain fitted
parameters for standard linear SEMs. ADF is used by sem when option method(adf) is specified.
Other available methods are ML, QML, and MLMV.
anchoring, anchor variable. A variable is said to be the anchor of a latent variable if the path
coefficient between the latent variable and the anchor variable is constrained to be 1. sem and
gsem use anchoring as a way of normalizing latent variables and thus identifying the model.
baseline model. A baseline model is a covariance modela model of fitted means and covariances
of observed variables without any other pathswith most of the covariances constrained to be
0. That is, a baseline model is a model of fitted means and variances but typically not all the
covariances. Also see saturated model. Baseline models apply only to standard linear SEMs.
Bentler–Weeks formulation. The Bentler and Weeks (1980) formulation of standard linear SEMs places
the results in a series of matrices organized around how results are calculated. See [SEM] estat
framework.
bootstrap, vce(bootstrap). The bootstrap is a replication method for obtaining variance estimates.
Consider an estimation method E for estimating θ. Let θ̂ be the result of applying E to dataset
D containing N observations. The bootstrap is a way of obtaining variance estimates for θ̂ from
repeated estimates θ̂1, θ̂2, . . . , where each θ̂i is the result of applying E to a dataset of size N
drawn with replacement from D. See [SEM] sem option method( ) and [R] bootstrap.
vce(bootstrap) is allowed with sem but not gsem. You can obtain bootstrap results by prefixing
the gsem command with bootstrap:, but remember to specify bootstrap's cluster() and
idcluster() options if you are fitting a multilevel model. See [SEM] intro 9.
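A minimal sketch of the prefix approach for a hypothetical two-level model (the names school, y,
x, and M1[school] are made up):
. bootstrap, reps(50) cluster(school) idcluster(newschool): ///
        gsem (y <- x M1[school])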
Builder. The Builder is Stata's graphical interface for building sem and gsem models. The Builder is
also known as the SEM Builder. See [SEM] intro 2, [SEM] Builder, and [SEM] Builder, generalized.
CFA, CFA models. CFA stands for confirmatory factor analysis. It is a way of analyzing measurement
models. CFA models is a synonym for measurement models.
CI. CI is an abbreviation for confidence interval.
clustered, vce(cluster clustvar). Clustered is the name we use for the generalized Huber/White/
sandwich estimator of the VCE; it is the robust technique generalized to relax the assumption
that errors are independent across observations to the assumption that they are independent across
clusters of observations. Within clusters, errors may be correlated.
Clustered standard errors are reported when sem or gsem option vce(cluster clustvar) is
specified. The other available techniques are OIM, OPG, robust, bootstrap, and jackknife. Also
available for sem only is EIM.
coefficient of determination. The coefficient of determination is the fraction (or percentage) of
variation (variance) explained by an equation of a model. The coefficient of determination is thus
like R² in linear regression.
command language. Stata's sem and gsem commands provide a way to specify SEMs. The alternative
is to use the Builder to draw path diagrams; see [SEM] intro 2, [SEM] Builder, and [SEM] Builder,
generalized.
complementary log-log regression. Complementary log-log regression is a term for generalized linear
response functions that are family Bernoulli, link cloglog. It is used for binary outcome data.
Complementary log-log regression is also known in Stata circles as cloglog regression or just
cloglog. See generalized linear response functions.
conditional normality assumption. See normality assumption, joint and conditional.
constraints. See parameter constraints.
correlated uniqueness model. A correlated uniqueness model is a kind of measurement model in
which the errors of the measurements have a structured correlation. See [SEM] intro 5.
crossed-effects models. See multilevel models.
curved path. See path.
degree-of-freedom adjustment. In estimates of variances and covariances, a finite-population
degree-of-freedom adjustment is sometimes applied to make the estimates unbiased.
If σ̂ii = Sii/N is the variance of observed variable xi, it can readily be proven that Sii/N
is a biased estimate of the variance in samples of size N and that Sii/(N − 1) is an unbiased
estimate. It is usual to calculate variances with Sii/(N − 1), which is to say the standard formula
has a multiplicative degree-of-freedom adjustment of N/(N − 1) applied to it.
If σ̂ii is the variance of estimated parameter θi, a similar finite-population degree-of-freedom
adjustment can sometimes be derived that will make the estimate unbiased. For instance, if θi is a
coefficient from a linear regression, an unbiased estimate of the variance of regression coefficient
θi is Sii/(N − p − 1), where p is the total number of regression coefficients estimated, excluding
the intercept. In other cases, no such adjustment can be derived. Such estimators have no derivable
finite-sample properties, and one is left only with the assurances provided by their provable asymptotic
properties. In such cases, the variance of coefficient θi is calculated as Sii/N, which can be
derived on theoretical grounds. SEM is an example of such an estimator.
SEM is a remarkably flexible estimator and can reproduce results that can sometimes be obtained by
other estimators. SEM might produce asymptotically equivalent results, or it might produce identical
results depending on the estimator. Linear regression is an example in which sem and gsem produce
the same results as regress. The reported standard errors, however, will not look identical because
the linear-regression estimates have the finite-population degree-of-freedom adjustment applied to
them and the SEM estimates do not. To see the equivalence, you must undo the adjustment on the
reported linear regression standard errors by multiplying them by √{(N − p − 1)/N}.
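A sketch of verifying the equivalence with the auto dataset, where N = 74, p = 1, and e(df_r)
after regress equals N − p − 1:
. sysuse auto, clear
. regress mpg weight
. display _se[weight]*sqrt(e(df_r)/e(N))
. sem (mpg <- weight)
The displayed value should match the standard error that sem reports for the weight coefficient.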
direct, indirect, and total effects. Consider the following system of equations:
y1 = b11 y2 + b12 x1 + b13 x2 + e1
y2 = b21 y3 + b22 x1 + b23 x4 + e2
y3 = b32 x1 + b33 x5 + e3
The total effect of x1 on y1 is b12 + b11 b22 + b11 b21 b32. It measures the full change in y1 based
on allowing x1 to vary throughout the system.
The direct effect of x1 on y1 is b12. It measures the change in y1 caused by a change in x1,
holding the other endogenous variables, namely y2 and y3, constant.
The indirect effect of x1 on y1 is obtained by subtracting the direct effect from the total effect
and is thus b11 b22 + b11 b21 b32.
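For instance, with hypothetical values b11 = 0.5, b12 = 1, b21 = 0.2, b22 = 0.3, and b32 = 0.4,
the total effect can be computed as
. display 1 + 0.5*0.3 + 0.5*0.2*0.4
1.19
so the indirect effect is 1.19 − 1 = 0.19.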
EIM, vce(eim). EIM stands for expected information matrix, defined as the inverse of the negative
of the expected value of the matrix of second derivatives, usually of the log-likelihood function.
The EIM is an estimate of the VCE. EIM standard errors are reported when sem option vce(eim)
is specified. EIM is available only with sem. The other available techniques for sem are OIM, OPG,
robust, clustered, bootstrap, and jackknife.
endogenous variable. A variable, observed or latent, is endogenous (determined by the system) if
any path points to it. Also see exogenous variable.
error, error variable. The error is the random disturbance e in a linear equation:
y = b0 + b1x1 + b2x2 + · · · + e
An error variable is an unobserved exogenous variable in path diagrams corresponding to e.
Mathematically, error variables are just another example of latent exogenous variables, but in sem
and gsem, error variables are considered to be in a class by themselves. All (Gaussian) endogenous
variables, observed and latent, have a corresponding error variable. Error variables automatically
and inalterably have their path coefficients fixed to be 1. Error variables have a fixed naming
convention in the software. If a variable is the error for (observed or latent) endogenous variable
y, then the error variable's name is e.y.
In sem and gsem, error variables are uncorrelated with each other unless explicitly indicated
otherwise. That indication is made in path diagrams by drawing a curved path between the error
variables and is indicated in command notation by including cov(e.name1*e.name2) among the
options specified on the sem command. In gsem, errors for family Gaussian, link log responses
are not allowed to be correlated.
estimation method. There are a variety of ways that one can solve for the parameters of an SEM.
Different methods make different assumptions about the data-generation process, so it is important
that you choose a method appropriate for your model and data; see [SEM] intro 4.
exogenous variable. A variable, observed or latent, is exogenous (determined outside the system)
if paths only originate from it or, equivalently, no path points to it. In this manual, we do not
distinguish whether exogenous variables are strictly exogenous, that is, uncorrelated with the
errors. Also see endogenous variable.
family distribution. See generalized linear response functions.
fictional data. Fictional data are data that have no basis in reality even though they might look real;
they are data that are made up for use in examples.
first-, second-, and higher-level (latent) variables. Consider a multilevel model of patients within
doctors within hospitals. First-level variables are variables that vary at the observational (patient)
level. Second-level variables vary across doctors but are constant within doctors. Third-level
variables vary across hospitals but are constant within hospitals. This jargon is used whether
variables are latent or not.
first- and second-order latent variables. If a latent variable is measured by other latent variables
only, the latent variable that does the measuring is called a first-order latent variable, and the latent
variable being measured is called the second-order latent variable.
full joint and conditional normality assumption. See normality assumption, joint and conditional.
gamma regression. Gamma regression is a term for generalized linear response functions that are
family gamma, link log. It is used for continuous, nonnegative, positively skewed data. Gamma
regression is also known as log-gamma regression. See generalized linear response functions.
Gaussian regression. Gaussian regression is another term for linear regression. It is most often used
when referring to generalized linear response functions. In that framework, Gaussian regression is
family Gaussian, link identity. See generalized linear response functions.
generalized linear response functions. Generalized linear response functions include linear functions
and include functions such as probit, logit, multinomial logit, ordered probit, ordered logit, Poisson,
and more.
These generalized linear functions are described by a link function g() and statistical distribution
F . The link function g() specifies how the response variable yi is related to a linear equation of
the explanatory variables, xi , and the family F specifies the distribution of yi :
g{E(yi)} = xi β,    yi ~ F
If we specify that g() is the identity function and F is the Gaussian (normal) distribution, then we
have linear regression. If we specify that g() is the logit function and F the Bernoulli distribution,
then we have logit (logistic) regression.
In this generalized linear structure, the family may be Gaussian, gamma, Bernoulli, binomial,
Poisson, negative binomial, ordinal, or multinomial. The link function may be the identity, log,
logit, probit, or complementary log-log.
gsem fits models with generalized linear response functions.
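For instance, a logit model for a binary outcome could be sketched in either of these equivalent
ways (variable names hypothetical; the second form uses gsem's logit shorthand):
. gsem (y <- x1 x2, family(bernoulli) link(logit))
. gsem (y <- x1 x2, logit)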
generalized method of moments. Generalized method of moments (GMM) is a method used to obtain
fitted parameters. In this documentation, GMM is referred to as ADF, which stands for asymptotic
distribution free and is available for use with sem. Other available methods for use with sem are
ML, QML, and MLMV.
The SEM moment conditions are cast in terms of second moments, not the first moments used in
many other applications associated with GMM.
generalized SEM. Generalized SEM is a term we have coined to mean SEM optionally allowing
generalized linear response functions or multilevel models. gsem fits generalized SEMs.
GMM. See generalized method of moments.
goodness-of-fit statistic. A goodness-of-fit statistic is a value designed to measure how well the
model reproduces some aspect of the data the model is intended to fit. SEM reproduces the first- and second-order moments of the data, with an emphasis on the second-order moments, and thus
goodness-of-fit statistics appropriate for use after sem compare the predicted covariance matrix
(and mean vector) with the matrix (and vector) observed in the data.
gsem. gsem is the Stata command that fits generalized SEMs. Also see sem.
GUI. See Builder.
identification. Identification refers to the conceptual constraints on parameters of a model that are
required for the model's remaining parameters to have a unique solution. A model is said to be
unidentified if these constraints are not supplied. These constraints are of two types: substantive
constraints and normalization constraints.
Normalization constraints deal with the problem that one scale works as well as another for each
latent variable in the model. One can think, for instance, of propensity to write software as being
measured on a scale of 0 to 1, 1 to 100, or any other scale. The normalization constraints are the
constraints necessary to choose one particular scale. The normalization constraints are provided
automatically by sem and gsem by anchoring with unit loadings.
Substantive constraints are the constraints you specify about your model so that it has substantive
content. Usually, these constraints are 0 constraints implied by the paths omitted, but they can
include explicit parameter constraints as well. It is easy to write a model that is not identified for
substantive reasons; see [SEM] intro 4.
indicator variables, indicators. The term indicator variable has two meanings. An indicator variable
is a 0/1 variable that records whether something is true. The other usage is as a synonym for
measurement variables.
indirect effects. See direct, indirect, and total effects.
initial values. See starting values.
intercept. An intercept for the equation of endogenous variable y, observed or latent, is the path
coefficient from _cons to y. _cons is Stata-speak for the built-in variable containing 1 in all
observations. In SEM-speak, _cons is an observed exogenous variable.
jackknife, vce(jackknife). The jackknife is a replication method for obtaining variance estimates.
Consider an estimation method E for estimating θ. Let θ̂ be the result of applying E to dataset
D containing N observations. The jackknife is a way of obtaining variance estimates for θ̂ from
repeated estimates θ̂1, θ̂2, . . . , θ̂N, where each θ̂i is the result of applying E to D with observation
i removed. See [SEM] sem option method( ) and [R] jackknife.
vce(jackknife) is allowed with sem but not gsem. You can obtain jackknife results by prefixing
the gsem command with jackknife:, but remember to specify jackknife's cluster() and
idcluster() options if you are fitting a multilevel model. See [SEM] intro 9.
joint normality assumption. See normality assumption, joint and conditional.
Lagrange multiplier test. Synonym for score test.
latent growth model. A latent growth model is a kind of measurement model in which the observed
values are collected over time and are allowed to follow a trend. See [SEM] intro 5.
latent variable. A variable is latent if it is not observed. A variable is latent if it is not in your dataset
but you wish it were. You wish you had a variable recording the propensity to commit violent
crime, or socioeconomic status, or happiness, or true ability, or even income accurately recorded.
Latent variables are sometimes described as imagined variables.
In the software, latent variables are usually indicated by having at least their first letter capitalized.
Also see first- and second-order latent variables, first-, second-, and higher-level (latent) variables,
and observed variables.
linear regression. Linear regression is a kind of SEM in which there is a single equation. See
[SEM] intro 5.
link function. See generalized linear response functions.
logit regression. Logit regression is a term for generalized linear response functions that are family
Bernoulli, link logit. It is used for binary outcome data. Logit regression is also known as logistic
regression and also simply as logit. See generalized linear response functions.
manifest variables. Synonym for observed variables.
measure, measurement, x a measurement of X, x measures X. See measurement variables.
measurement models, measurement component. A measurement model is a particular kind of
model that deals with the problem of translating observed values to values suitable for modeling.
Measurement models are often combined with structural models and then the measurement model
part is referred to as the measurement component. See [SEM] intro 5.
measurement variables, measure, measurement, x a measurement of X, x measures X. Observed
variable x is a measurement of latent variable X if there is a path connecting x and X. Measurement
variables are modeled by measurement models. Measurement variables are also called indicator
variables.
method. Method is just an English word and should be read in context. Nonetheless, method is used
here usually to refer to the method used to solve for the fitted parameters of an SEM. Those
methods are ML, QML, MLMV, and ADF. Also see technique.
MIMIC. See multiple indicators and multiple causes.
mixed-effects models. See multilevel models.
ML, method(ml). ML stands for maximum likelihood. It is a method to obtain fitted parameters. ML
is the default method used by sem and gsem. Other available methods for sem are QML, MLMV,
and ADF. Also available for gsem is QML.
MLMV, method(mlmv). MLMV stands for maximum likelihood with missing values. It is an ML
method used to obtain fitted parameters in the presence of missing values. MLMV is the method
used by sem when the method(mlmv) option is specified; method(mlmv) is not available with
gsem. Other available methods for use with sem are ML, QML, and ADF. These other methods
omit from the calculation observations that contain missing values.
modification indices. Modification indices are score tests for adding paths where none appear. The
paths can be for either coefficients or covariances.
moments (of a distribution). The moments of a distribution are the expected values of powers of a
random variable or centralized (demeaned) powers of a random variable. The first moments are
the expected or observed means, and the second moments are the expected or observed variances
and covariances.
multilevel models. Multilevel models are models that include unobserved effects (latent variables)
for different groups in the data. For instance, in a dataset of students, groups of students might
share the same teacher. If the teacher's identity is recorded in the data, then one can introduce
a latent variable that is constant within teacher and that varies across teachers. This is called a
two-level model.
If teachers could in turn be grouped into schools, and school identities were recorded in the data,
then one can introduce another latent variable that is constant within school and varies across
schools. This is called a three-level (nested-effects) model.
In the above example, observations (students) are said to be nested within teacher nested within
school. Sometimes there is no such subsequent nesting structure. Consider workers nested within
occupation and industry. The same occupations appear in various industries and the same industries
appear within various occupations. We can still introduce latent variables at the occupation and
industry level. In such cases, the model is called a crossed-effects model.
The latent variables that we have discussed are also known as random effects. Any coefficients on
observed variables in the model are known as the fixed portion of the model. Models that contain
fixed and random portions are known as mixed-effects models.
multinomial logit regression. Multinomial logit regression is a term for generalized linear response
functions that are family multinomial, link logit. It is used for categorical-outcome data when the
outcomes cannot be ordered. Multinomial logit regression is also known as multinomial logistic
regression and as mlogit in Stata circles. See generalized linear response functions.
multiple correlation. The multiple correlation is the correlation between endogenous variable y and
its linear prediction.
multiple indicators and multiple causes. Multiple indicators and multiple causes is a kind of structural
model in which observed causes determine a latent variable, which in turn determines multiple
indicators. See [SEM] intro 5.
multivariate regression. Multivariate regression is a kind of structural model in which each member
of a set of observed endogenous variables is a function of the same set of observed exogenous
variables and a unique random disturbance term. The disturbances are correlated. Multivariate
regression is a special case of seemingly unrelated regression.
negative binomial regression. Negative binomial regression is a term for generalized linear response
functions that are family negative binomial, link log. It is used for count data that are overdispersed
relative to Poisson. Negative binomial regression is also known as nbreg in Stata circles. See
generalized linear response functions.
nested-effects models. See multilevel models.
nonrecursive (structural) model (system), recursive (structural) model (system). A structural model
(system) is said to be nonrecursive if there are paths in both directions between one or more pairs
of endogenous variables. A system is recursive if it is a system (it has endogenous variables that
appear with paths from them) and it is not nonrecursive.
Consider the nonrecursive system
y1 = 2y2 + 1x1 + e1
y2 = 3y1 − 2x2 + e2
This model is unstable. To see this, without loss of generality, treat 1x1 + e1 and −2x2 + e2 as if
they were both 0. Consider y1 = 1 and y2 = 1. Those values result in new values y1 = 2 and
y2 = 3, and those result in new values y1 = 6 and y2 = 6, and those result in new values . . . .
Continue in this manner, and you reach infinity for both endogenous variables. In the jargon of the
mathematics used to check for this property, the eigenvalues of the coefficient matrix lie outside
the unit circle.
On the other hand, consider these values:
y1 = 0.5y2 + 1x1 + e1
y2 = 1.0y1 − 2x2 + e2
These results are stable in that the resulting values converge to y1 = 0 and y2 = 0. In the jargon
of the mathematics used to check for this property, the eigenvalues of the coefficient matrix lie
inside the unit circle.
Finally, consider the values
y1 = 0.5y2 + 1x1 + e1
y2 = 2.0y1 − 2x2 + e2
Start with y1 = 1 and y2 = 1. That yields new values y1 = 0.5 and y2 = 2, and that yields new
values y1 = 1 and y2 = 1, and that yields new values y1 = 0.5 and y2 = 2, and it will oscillate
forever. In the jargon of the mathematics used to check for this property, the eigenvalues of the
coefficient matrix lie on the unit circle. These coefficients are also considered to be unstable.
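The oscillating iterations above can be reproduced in a short do-file sketch (exogenous parts set
to 0):
local y1 = 1
local y2 = 1
forvalues i = 1/4 {
    local t = 0.5*`y2'     // new y1 computed from old y2
    local y2 = 2.0*`y1'    // new y2 computed from old y1
    local y1 = `t'
    display `y1' "   " `y2'
}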
normalization constraints. See identification.
normality assumption, joint and conditional. The derivation of the standard, linear SEM estimator
usually assumes the full joint normality of the observed and latent variables. However, full joint
normality can be replaced with the assumption of normality conditional on the values of the exogenous
variables, and all that is lost is one goodness-of-fit test (the test reported by sem on the output) and
the justification for use of optional method MLMV for dealing with missing values. This substitution
of assumptions is important for researchers who cannot reasonably assume normality of the observed
variables. This includes any researcher including, say, variables age and age-squared in his or her
model.
Meanwhile, the generalized SEM makes only the conditional normality assumption.
Be aware that even though the full joint normality assumption is not required for the standard
linear SEM, sem calculates the log-likelihood value under that assumption. This is irrelevant except
that log-likelihood values reported by sem cannot be compared with log-likelihood values reported
by gsem, which makes the lesser assumption.
See [SEM] intro 4.
normalized residuals. See standardized residuals.
observed variables. A variable is observed if it is a variable in your dataset. In this documentation,
we often refer to observed variables by using x1, x2, . . . , y1, y2, and so on; in reality, observed
variables have names such as mpg, weight, testscore, and so on.
In the software, observed variables are usually indicated by having names that are all lowercase.
Also see latent variable.
OIM, vce(oim). OIM stands for observed information matrix, defined as the inverse of the negative
of the matrix of second derivatives, usually of the log likelihood function. The OIM is an estimate
of the VCE. OIM is the default VCE that sem and gsem report. The other available techniques are
EIM, OPG, robust, clustered, bootstrap, and jackknife.
OPG, vce(opg). OPG stands for outer product of the gradients, defined as the cross product of the
observation-level first derivatives, usually of the log likelihood function. The OPG is an estimate of
the VCE. The other available techniques are OIM, EIM, robust, clustered, bootstrap, and jackknife.
ordered complementary log-log regression. Ordered complementary log-log regression is a term
for generalized linear response functions that are family ordinal, link cloglog. It is used for
ordinal-outcome data. Ordered complementary log-log regression is also known as ocloglog in
Stata circles. See generalized linear response functions.
ordered logit regression. Ordered logit regression is a term for generalized linear response functions
that are family ordinal, link logit. It is used for ordinal outcome data. Ordered logit regression is
also known as ordered logistic regression, as just ordered logit, and as ologit in Stata circles. See
generalized linear response functions.
ordered probit regression. Ordered probit regression is a term for generalized linear response functions
that are family ordinal, link probit. It is used for ordinal-outcome data. Ordered probit regression
is also known as just ordered probit and known as oprobit in Stata circles. See generalized linear
response functions.
parameter constraints. Parameter constraints are restrictions placed on the parameters of the model.
These constraints are typically in the form of 0 constraints and equality constraints. A 0 constraint
is implied, for instance, when no path is drawn connecting x with y . An equality constraint is
specified when one path coefficient is forced to be equal to another or one covariance is forced
to be equal to another.
parameters. The parameters of an SEM are its path coefficients, means, variances, and covariances:
β is the vector of path coefficients, μ is the vector of means, and Σ is the matrix of variances
and covariances. The resulting parameter estimates are written as β̂, μ̂, and Σ̂.
Ancillary parameters are extra parameters beyond the ones just described that concern the distribution.
These include the scale parameter of gamma regression, the dispersion parameter for negative
binomial regression, the cutpoints for ordered probit, logit, and cloglog regression, and the
like. These parameters are also included among the model parameters.
path. A path, typically shown as an arrow drawn from one variable to another, states that the first
variable determines the second variable, at least partially. If x → y, or equivalently y ← x, then
yj = α + · · · + βxj + · · · + e.yj, where β is said to be the x → y path coefficient. The ellipses
are included to account for paths to y from other variables. α is said to be the intercept and is
automatically added when the first path to y is specified.
A curved path is a curved line connecting two variables, and it specifies that the two variables are
allowed to be correlated. If there is no curved path between variables, the variables are usually
assumed to be uncorrelated. We say "usually" because correlation is assumed among observed
exogenous variables and, in the command language, assumed among latent exogenous variables,
and if some of the correlations are not desired, they must be suppressed. Many authors refer to
covariances rather than correlations. Strictly speaking, the curved path denotes a nonzero covariance.
A correlation is often called a standardized covariance.
A curved path can connect a variable to itself, and in that case, it indicates a variance. In path
diagrams in this manual, we typically do not show such variance paths even though variances are
assumed.
path coefficient. The path coefficient is associated with a path; see path. Also see intercept.
path diagram. A path diagram is a graphical representation that shows the relationships among a set
of variables using paths. See [SEM] intro 2 for a description of path diagrams.
path notation. Path notation is a syntax defined by the authors of Stata's sem and gsem commands
for entering path diagrams in a command language. Models to be fit may be specified in path
notation or they may be drawn using path diagrams into the Builder.
p-value. The p-value is another term for the reported significance level associated with a test. Small
p-values indicate rejection of the null hypothesis.
Poisson regression. Poisson regression is a term for generalized linear response functions that are
family Poisson, link log. It is used for count data. See generalized linear response functions.
probit regression. Probit regression is a term for generalized linear response functions that are family
Bernoulli, link probit. It is used for binary outcome data. Probit regression is also known simply
as probit. See generalized linear response functions.
QML, method(ml) vce(robust). QML stands for quasimaximum likelihood. It is a method used to
obtain fitted parameters and a technique used to obtain the corresponding VCE. QML is used by sem
and gsem when options method(ml) and vce(robust) are specified. Other available methods
are ML, MLMV, and ADF. Other available techniques are OIM, EIM, OPG, clustered, bootstrap, and
jackknife.
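For instance, QML fitting could be requested as follows (a sketch; variable names hypothetical):
. sem (y <- x1 x2), method(ml) vce(robust)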
quadrature. Quadrature is a generic method for performing numerical integration. gsem uses quadrature
in any model including latent variables (excluding error variables). sem, being limited to linear
models, does not need to perform quadrature.
random-effects models. See multilevel models.
regression. A regression is a model in which an endogenous variable is written as a function of other
variables, parameters to be estimated, and a random disturbance.
reliability. Reliability is the proportion of the variance of a variable not due to measurement error.
A variable without measurement error has reliability 1.
residual. In this manual, we reserve the word residual for the difference between the observed
and fitted moments of an SEM. We use the word error for the disturbance associated with a
(Gaussian) linear equation; see error. Also see standardized residuals.
robust, vce(robust). Robust is the name we use here for the Huber/White/sandwich estimator of
the VCE. This technique requires fewer assumptions than most other techniques. In particular, it
merely assumes that the errors are independently distributed across observations and thus allows
the errors to be heteroskedastic. Robust standard errors are reported when the sem (gsem) option
vce(robust) is specified. The other available techniques are OIM, EIM, OPG, clustered, bootstrap,
and jackknife.
saturated model. A saturated model is a full covariance model, a model of fitted means and
covariances of observed variables without any restrictions on the values. Also see baseline model.
Saturated models apply only to standard linear SEMs.
score test, Lagrange multiplier test. A score test is a test based on first derivatives of a likelihood
function. Score tests are especially convenient for testing whether constraints on parameters should
be relaxed or parameters should be added to a model. Also see Wald test.
scores. Scores has two unrelated meanings. First, scores are the observation-by-observation first derivatives of the (quasi) log-likelihood function. When we use the word scores, this is what
we mean. Second, in the factor-analysis literature, scores (usually in the context of factor scores)
refers to the expected value of a latent variable conditional on all the observed variables. We refer
to this simply as the predicted value of the latent variable.
second-level latent variable. See first-, second-, and higher-level (latent) variables.
second-order latent variable. See first- and second-order latent variables.
seemingly unrelated regression. Seemingly unrelated regression is a kind of structural model in
which each member of a set of observed endogenous variables is a function of a set of observed
exogenous variables and a unique random disturbance term. The disturbances are correlated and
the sets of exogenous variables may overlap. If the sets of exogenous variables are identical, this
is referred to as multivariate regression.
SEM. SEM stands for structural equation modeling and for structural equation model. We use SEM
in capital letters when writing about theoretical or conceptual issues as opposed to issues of the
particular implementation of SEM in Stata with the sem or gsem commands.
sem. sem is the Stata command that fits standard linear SEMs. Also see gsem.
SSD, ssd. See summary statistics data.
standard linear SEM. An SEM without multilevel effects in which all response variables are given
by a linear equation. Standard linear SEM is what most people mean when they refer to just SEM.
Standard linear SEMs are fit by sem, although they can also be fit by gsem; see generalized SEM.
standardized coefficient. In a linear equation y = · · · + bx + · · ·, the standardized coefficient is
(σ̂x/σ̂y)b. Standardized coefficients are scaled to units of standard deviation change in y for a
standard deviation change in x.
standardized covariance. A standardized covariance between y and x is equal to the correlation of y
and x; that is, it is equal to σxy/(σxσy). The covariance is equal to the correlation when variables
are standardized to have variance 1.
standardized residuals, normalized residuals. Standardized residuals are residuals adjusted so that
they follow a standard normal distribution. The difficulty is that the adjustment is not always
possible. Normalized residuals are residuals adjusted according to a different formula that roughly
follow a standard normal distribution. Normalized residuals can always be calculated.
starting values. The estimation methods provided by sem and gsem are iterative. The starting values
are values for each of the parameters to be estimated that are used to initialize the estimation
process. sem and gsem provide starting values automatically, but in some cases, these are not
good enough and you must both diagnose the problem and provide better starting values. See
[SEM] intro 12.
structural equation model. Different authors use the term structural equation model in different
ways, but all would agree that an SEM sometimes carries the connotation of being a structural
model with a measurement component, that is, combined with a measurement model.
structural model. A structural model is a model in which the parameters are not merely a description
but are believed to be of a causal nature. Obviously, SEM can fit structural models, and thus so can
sem and gsem. Neither SEM, sem, nor gsem is limited to fitting structural models, however.
Structural models often have multiple equations and dependencies between endogenous variables,
although that is not a requirement.
See [SEM] intro 5. Also see structural equation model.
structured (correlation or covariance). See unstructured and structured (correlation or covariance).
substantive constraints. See identification.
summary statistics data. Data are sometimes available only in summary statistics form, as means
and covariances; means, standard deviations or variances, and correlations; covariances; standard
deviations or variances and correlations; or correlations. SEM can be used to fit models with such
data in place of the underlying raw data. The ssd command creates datasets containing summary
statistics.
technique. Technique is just an English word and should be read in context. Nonetheless, technique
is usually used here to refer to the technique used to calculate the estimated VCE. Those techniques
are OIM, EIM, OPG, robust, clustered, bootstrap, and jackknife.
Technique is also used to refer to the available techniques used with ml, Stata's optimizer and
likelihood maximizer, to find the solution.
total effects. See direct, indirect, and total effects.
unstandardized coefficient. A coefficient that is not standardized. If mpg = −0.006 weight +
39.44028, then −0.006 is an unstandardized coefficient and, as a matter of fact, is measured in
mpg-per-pound units.
unstructured and structured (correlation or covariance). A set of variables, typically error variables,
is said to have an unstructured correlation or covariance if the covariance matrix has no particular
pattern imposed by theory. If a pattern is imposed, the correlation or covariance is said to be
structured.
variance–covariance matrix of the estimator. The estimator is the formula used to solve for the fitted
parameters, sometimes called the fitted coefficients. The VCE is the estimated variance–covariance
matrix of the parameters. The diagonal elements of the VCE are the variances of the parameters or
equivalent; the square roots of those elements are the reported standard errors of the parameters.
VCE. See variance–covariance matrix of the estimator.
Wald test. A Wald test is a statistical test based on the estimated variance–covariance matrix of the
parameters. Wald tests are especially convenient for testing possible constraints to be placed on
the estimated parameters of a model. Also see score test.
weighted least squares. Weighted least squares (WLS) is a method used to obtain fitted parameters.
In this documentation, WLS is referred to as ADF, which stands for asymptotic distribution free.
Other available methods are ML, QML, and MLMV. ADF is, in fact, a specific kind of the more
generic WLS.
WLS. See weighted least squares.
A
Acock, A. C., [SEM] intro 4, [SEM] intro 5,
[SEM] intro 6, [SEM] intro 11,
[SEM] example 1, [SEM] example 3,
[SEM] example 7, [SEM] example 9,
[SEM] example 18, [SEM] example 20
adaptopts() option, see gsem option adaptopts()
addgroup, ssd subcommand, [SEM] ssd
ADF, see asymptotic distribution free
adf, see sem option method()
AIC, see Akaike information criterion
Akaike information criterion, [SEM] estat gof,
[SEM] example 4, [SEM] methods and
formulas for sem
Akaike, H., [SEM] estat gof, [SEM] methods and
formulas for sem
allmissing option, see sem option allmissing
Alwin, D. F., [SEM] example 9
anchoring, see constraints, normalization
Andrich, D., [SEM] example 28g
asymptotic distribution free, [SEM] intro 4,
[SEM] methods and formulas for sem,
[SEM] Glossary
B
Baron, R. M., [SEM] example 42g
baseline comparisons, [SEM] estat gof,
[SEM] example 4
baseline model, [SEM] estat gof, [SEM] example 4,
[SEM] methods and formulas for sem,
[SEM] Glossary
baseopts option, see sem option baseopts()
Bauldry, S., [SEM] intro 5
Bayesian information criterion, [SEM] estat gof,
[SEM] example 4, [SEM] methods and
formulas for sem
Bentham, G., [SEM] example 39g
Bentler, P. M., [SEM] estat eqgof, [SEM] estat
framework, [SEM] estat gof, [SEM] estat
stable, [SEM] example 3, [SEM] methods and
formulas for sem
Bentler–Raykov squared multiple-correlation coefficient,
[SEM] estat eqgof
Bentler–Weeks matrices, [SEM] intro 7, [SEM] estat
framework, [SEM] example 11, [SEM] Glossary
BIC, see Bayesian information criterion
C
Campbell, D. T., [SEM] example 17
CD, see coefficient of determination
Center for Human Resource Research,
[SEM] example 38g, [SEM] example 46g
CFA, see confirmatory factor analysis
CFI, see comparative fit index
chi-squared test, [SEM] methods and formulas for sem
CI, see confidence interval
cloglog option, see gsem option cloglog
cluster, see gsem option vce(), see sem option
vce()
cluster estimator of variance, structural equation
modeling, [SEM] intro 8, [SEM] sem option
method( )
clustered, [SEM] Glossary
coefficient of determination, [SEM] estat eqgof,
[SEM] estat ggof, [SEM] estat gof,
[SEM] example 4, [SEM] example 21,
[SEM] methods and formulas for sem,
[SEM] Glossary
coeflegend option, see gsem option coeflegend, see
sem option coeflegend
collinear option, see gsem option collinear
command language, [SEM] Glossary
comparative fit index, [SEM] estat gof, [SEM] methods
and formulas for sem
complementary log-log regression, [SEM] Glossary
conditional normality, see normality, conditional
confidence interval, [SEM] Glossary
confirmatory factor analysis, [SEM] intro 5,
[SEM] example 15, [SEM] example 30g,
[SEM] Glossary
constraints, [SEM] sem and gsem option constraints( ),
[SEM] Glossary
across groups, [SEM] intro 6
normalization, [SEM] intro 4, [SEM] gsem,
[SEM] sem, [SEM] Glossary
D
datasignature command, [SEM] example 25,
[SEM] ssd
degree-of-freedom adjustment, [SEM] Glossary
delta method, [SEM] estat residuals, [SEM] estat
teffects
describe, ssd subcommand, [SEM] ssd
digitally signing data, see datasignature command
Duncan, O. D., [SEM] example 7
E
Eaves, R. C., [SEM] example 2
effects,
direct, [SEM] estat teffects, [SEM] example 7,
[SEM] example 42g, [SEM] methods and
formulas for sem, [SEM] Glossary
indirect, [SEM] estat teffects, [SEM] example 7,
[SEM] example 42g, [SEM] methods and
formulas for sem, [SEM] Glossary
total, [SEM] estat teffects, [SEM] example 7,
[SEM] example 42g, [SEM] methods and
formulas for sem, [SEM] Glossary
eform, estat subcommand, [SEM] estat eform
eigenvalue stability index, [SEM] estat stable
EIM, see expected information matrix
eim, see sem option vce()
Embretson, S. E., [SEM] example 28g,
[SEM] example 29g
empirical Bayes predictions, [SEM] intro 7,
[SEM] methods and formulas for gsem,
[SEM] predict after gsem
endogenous treatment-effects model,
[SEM] example 46g
endogenous variable, [SEM] intro 4, [SEM] Glossary
eqgof, estat subcommand, [SEM] estat eqgof
eqtest, estat subcommand, [SEM] estat eqtest
error, [SEM] Glossary
variable, [SEM] intro 4, [SEM] Glossary
estat
eform command, [SEM] intro 7, [SEM] estat
eform, [SEM] example 33g, [SEM] example 34g
eqgof command, [SEM] intro 7, [SEM] estat
eqgof, [SEM] example 3
eqtest command, [SEM] intro 7, [SEM] estat
eqtest, [SEM] example 13
framework command, [SEM] intro 7, [SEM] estat
framework, [SEM] example 11
ggof command, [SEM] intro 7, [SEM] estat ggof,
[SEM] example 21
ginvariant command, [SEM] intro 7, [SEM] estat
ginvariant, [SEM] example 22
gof command, [SEM] estat gof, [SEM] example 4
mindices command, [SEM] intro 7, [SEM] estat
mindices, [SEM] example 5, [SEM] example 9
residuals command, [SEM] intro 7, [SEM] estat
residuals, [SEM] example 10
scoretests command, [SEM] intro 7, [SEM] estat
scoretests, [SEM] example 8
stable command, [SEM] intro 7, [SEM] estat
stable, [SEM] example 7
stdize: prefix command, [SEM] estat stdize,
[SEM] example 16
summarize command, [SEM] estat summarize
teffects command, [SEM] estat teffects,
[SEM] example 7, [SEM] example 42g
estimation method, [SEM] Glossary
estimation options, [SEM] gsem estimation options,
[SEM] sem estimation options
_Ex, [SEM] sem and gsem option covstructure()
exogenous variable, [SEM] intro 4, [SEM] Glossary
expected information matrix, [SEM] Glossary
F
factor analysis, see confirmatory factor analysis
factor scores, [SEM] intro 7, [SEM] example 14,
[SEM] methods and formulas for sem,
[SEM] predict after sem
factor-variable notation, [SEM] intro 3
family
Bernoulli, [SEM] methods and formulas for gsem
binomial, [SEM] methods and formulas for gsem
distribution, [SEM] Glossary
gamma, [SEM] methods and formulas for gsem
Gaussian, [SEM] methods and formulas for gsem
multinomial, [SEM] methods and formulas for
gsem
negative binomial, [SEM] methods and formulas
for gsem
ordinal, [SEM] methods and formulas for gsem
Poisson, [SEM] methods and formulas for gsem
family() option, see gsem option family()
feasible generalized least squares, [SEM] intro 4
feedback loops, [SEM] estat stable, [SEM] estat
teffects
fictional data, [SEM] Glossary
first-order latent variables, [SEM] Glossary
Fischer, G. H., [SEM] example 28g
Fiske, D. W., [SEM] example 17
forcecorrelations option, see sem option
forcecorrelations
forcenoanchor option, see gsem option
forcenoanchor, see sem option
forcenoanchor
forcexconditional option, see sem option
forcexconditional
Fox, C. M., [SEM] example 28g
framework, estat subcommand, [SEM] estat
framework
Freeman, E. H., [SEM] estat stable
from() option, see gsem option from(), see sem
option from()
fvstandard option, see gsem option fvstandard
fvwrap() option, see sem option fvwrap()
fvwrapon() option, see sem option fvwrapon()
G
gamma option, see gsem option gamma
gamma regression, [SEM] intro 5, [SEM] Glossary
Gaussian regression, [SEM] Glossary
generalized
least squares, feasible, see feasible generalized least
squares
linear response functions, [SEM] Glossary
method of moments, [SEM] Glossary
response variables, [SEM] intro 2, [SEM] intro 5,
[SEM] gsem family-and-link options
responses, combined, [SEM] example 34g
SEM, [SEM] Glossary
ggof, estat subcommand, [SEM] estat ggof
ginvariant, estat subcommand, [SEM] estat
ginvariant
ginvariant() option, see sem option ginvariant()
GMM, see generalized method of moments
gof, estat subcommand, [SEM] estat gof
goodness of fit, [SEM] intro 7, [SEM] estat
eqgof, [SEM] estat ggof, [SEM] estat
gof, [SEM] example 3, [SEM] example 4,
[SEM] Glossary
graphical user interface, [SEM] Builder,
[SEM] Builder, generalized, [SEM] Glossary
Greenacre, M. J., [SEM] example 35g,
[SEM] example 36g
Greenfield, S., [SEM] example 37g
Gronau, R., [SEM] example 45g
group invariance test, [SEM] methods and formulas
for sem
group() option, see sem option group()
gsem command, [SEM] Builder, generalized,
[SEM] example 1, [SEM] example 27g,
[SEM] example 28g, [SEM] example 29g,
[SEM] example 30g, [SEM] example 31g,
[SEM] example 32g, [SEM] example 33g,
[SEM] example 34g, [SEM] example 35g,
[SEM] example 36g, [SEM] example 37g,
[SEM] example 38g, [SEM] example 39g,
[SEM] example 40g, [SEM] example 41g,
[SEM] example 42g, [SEM] example 43g,
[SEM] example 44g, [SEM] example 45g,
[SEM] example 46g, [SEM] gsem, [SEM] gsem
family-and-link options, [SEM] gsem model
description options, [SEM] gsem path notation
extensions, [SEM] gsem postestimation,
[SEM] methods and formulas for gsem,
[SEM] sem and gsem path notation
gsem option
adaptopts(), [SEM] gsem estimation options
cloglog, [SEM] gsem family-and-link options
coeflegend, [SEM] example 29g, [SEM] gsem
reporting options
collinear, [SEM] gsem model description
options
constraints(), [SEM] gsem model description
options, [SEM] sem and gsem option
constraints()
covariance(), [SEM] gsem model description
options
covstructure(), [SEM] gsem model description
options, [SEM] sem and gsem option
covstructure()
exposure(), [SEM] gsem family-and-link options
family(), [SEM] gsem family-and-link options,
[SEM] gsem model description options,
[SEM] gsem path notation extensions
H
Haller, A. O., [SEM] example 7
Hambleton, R. K., [SEM] example 28g,
[SEM] example 29g
Hancock, G. R., [SEM] estat gof, [SEM] methods and
formulas for sem
Hausman, J. A., [SEM] estat residuals, [SEM] methods
and formulas for sem
Heckman selection model, [SEM] example 43g
Heckman, J., [SEM] example 45g
higher-order models, see confirmatory factor analysis
Hocevar, D., [SEM] example 19
Hosmer, D. W., Jr., [SEM] example 33g,
[SEM] example 34g
Huber, C., [SEM] Builder, [SEM] Builder, generalized
Huber/White/sandwich estimator of variance, see robust,
Huber/White/sandwich estimator of variance
hypothesis test, [SEM] test, [SEM] testnl
I
identification, see model identification
indicator variables, [SEM] Glossary
information criteria, see Akaike information criterion,
see Bayesian information criterion
init, ssd subcommand, [SEM] ssd
initial values, [SEM] Glossary, see starting values
intercept, [SEM] intro 4, [SEM] Glossary, also see
constraints, specifying
interval regression model, [SEM] example 44g
intmethod() option, see gsem option intmethod()
intpoints() option, see gsem option intpoints()
IRT, see item response theory
item response theory, [SEM] intro 5,
[SEM] example 28g, [SEM] example 29g
iterate() option, see gsem option maximize options,
see sem option maximize options
J
jackknife, [SEM] Glossary
joint normality, see normality, joint
Jöreskog, K. G., [SEM] estat residuals
K
Kenny, D. A., [SEM] intro 4, [SEM] example 42g
Kline, R. B., [SEM] intro 4, [SEM] example 3,
[SEM] example 4, [SEM] example 5
Krull, J. L., [SEM] example 42g
L
Lagrange multiplier test, [SEM] estat ginvariant,
[SEM] estat mindices, [SEM] estat scoretests,
[SEM] Glossary
Langford, I. H., [SEM] example 39g
M
MacKinnon, D. P., [SEM] example 42g
Mair, C. S., [SEM] example 39g
manifest variables, [SEM] Glossary
MAR, see missing values
margins command, [SEM] intro 7
Marsh, H. W., [SEM] example 19
maximize options, see gsem option maximize options,
see sem option maximize options
maximum likelihood, [SEM] intro 4, [SEM] methods
and formulas for gsem, [SEM] methods and
formulas for sem, [SEM] Glossary
with missing values, [SEM] example 26,
[SEM] Glossary
McDonald, A., [SEM] example 39g
means() option, see gsem option means(), see sem
option means()
measurement
component, [SEM] Glossary
error, [SEM] intro 5, [SEM] example 1,
[SEM] example 27g
model, [SEM] intro 5, [SEM] example 1,
[SEM] example 3, [SEM] example 20,
[SEM] example 27g, [SEM] example 30g,
[SEM] example 31g, [SEM] Glossary
variables, [SEM] Glossary
mediation model, [SEM] intro 5, [SEM] example 42g
Mehta, P. D., [SEM] example 30g
method, [SEM] Glossary
method() option, see gsem option method(), see sem
option method()
MIMIC models, see multiple indicators and multiple
causes model
mindices, estat subcommand, [SEM] estat mindices
missing values, [SEM] example 26
mixed-effects model, see multilevel model
ML, see maximum likelihood
ml, see gsem option method(), see sem option
method()
MLMV, see maximum likelihood with missing values
mlmv, see sem option method()
mlogit option, see gsem option mlogit
model identification, [SEM] intro 4, [SEM] intro 12,
[SEM] Glossary
model simplification test, [SEM] example 8,
[SEM] example 10
model-implied covariances and correlations,
[SEM] example 11
modification indices, [SEM] estat mindices,
[SEM] example 5, [SEM] methods and
formulas for sem, [SEM] Glossary
Molenaar, I. W., [SEM] example 28g
moments (of a distribution), [SEM] Glossary
MTMM, see multitrait–multimethod data and matrices
Mueller, R. O., [SEM] estat gof, [SEM] methods and
formulas for sem
multilevel latent variable, [SEM] intro 2, [SEM] gsem
path notation extensions
multilevel mixed-effects model, see multilevel model
multilevel model, [SEM] intro 5, [SEM] example 30g,
[SEM] example 38g, [SEM] example 39g,
[SEM] example 40g, [SEM] example 41g,
[SEM] example 42g, [SEM] Glossary
multinomial logistic regression, [SEM] intro 2,
[SEM] intro 5, [SEM] example 37g,
[SEM] example 41g, [SEM] Glossary
multiple correlation, [SEM] Glossary
multiple indicators and multiple causes model,
[SEM] intro 5, [SEM] example 10,
[SEM] example 36g, [SEM] Glossary
multitrait–multimethod data and matrices,
[SEM] intro 5, [SEM] example 17
N
nbreg option, see gsem option nbreg
Neale, M. C., [SEM] example 30g
negative binomial, [SEM] example 39g
negative binomial regression, [SEM] Glossary
Nelson, E. C., [SEM] example 37g
nested-effects model, [SEM] Glossary
nlcom command, [SEM] intro 7, [SEM] estat stdize,
[SEM] example 42g, [SEM] nlcom
nm1 option, see sem option nm1
noanchor option, see gsem option noanchor, see sem
option noanchor
noasis option, see gsem option noasis
nocapslatent option, see gsem option
nocapslatent, see sem option nocapslatent
nocnsreport option, see gsem option nocnsreport,
see sem option nocnsreport
noconstant option, see gsem option noconstant, see
sem option noconstant
nodescribe option, see sem option nodescribe
noestimate option, see gsem option noestimate, see
sem option noestimate
nofootnote option, see sem option nofootnote
nofvlabel option, see sem option nofvlabel
noheader option, see gsem option noheader, see sem
option noheader
noivstart option, see sem option noivstart
nomeans option, see sem option nomeans
noncursive model, see nonrecursive model
nonnormed fit index, see Tucker–Lewis index
nonrecursive model, [SEM] Glossary
stability of, [SEM] estat stable, [SEM] example 7
normality,
conditional, [SEM] intro 4, [SEM] Glossary
joint, [SEM] intro 4, [SEM] Glossary
normalization constraints, see constraints, normalization
normalized residuals, [SEM] estat residuals,
[SEM] methods and formulas for sem,
[SEM] Glossary
notable option, see gsem option notable, see sem
option notable
noxconditional option, see sem option
noxconditional
O
observed information matrix, [SEM] Glossary
observed variables, [SEM] intro 4, [SEM] Glossary
ocloglog option, see gsem option ocloglog
_OEx, [SEM] sem and gsem option covstructure()
offset() option, see gsem option offset()
P
p-value, [SEM] Glossary
Q
QML, see quasi–maximum likelihood
quadrature, [SEM] Glossary
Gauss–Hermite, [SEM] methods and formulas for
gsem
mean variance adaptive, [SEM] methods and
formulas for gsem
mode curvature adaptive, [SEM] methods and
formulas for gsem
quasi–maximum likelihood, [SEM] Glossary
R
Rabe-Hesketh, S., [SEM] Acknowledgments,
[SEM] intro 2, [SEM] intro 4,
[SEM] example 28g, [SEM] example 29g,
[SEM] example 30g, [SEM] example 39g,
[SEM] example 40g, [SEM] example 41g,
[SEM] example 45g, [SEM] example 46g,
[SEM] methods and formulas for gsem,
[SEM] predict after gsem
Raftery, A. E., [SEM] estat gof
random intercept, [SEM] example 38g
random slope, [SEM] example 38g
random-effects model, [SEM] example 38g,
[SEM] Glossary
Rasch models, see item response theory
Rasch, G., [SEM] example 28g
raw residuals, [SEM] methods and formulas for sem
Raykov, T., [SEM] estat eqgof, [SEM] example 3,
[SEM] methods and formulas for sem
recursive model, [SEM] Glossary
regress option, see gsem option regress
regression, [SEM] Glossary
Reise, S. P., [SEM] example 28g, [SEM] example 29g
reliability, [SEM] intro 5, [SEM] intro 12,
[SEM] example 24, [SEM] gsem model
description options, [SEM] sem and gsem
option reliability(), [SEM] sem model
description options, [SEM] Glossary
reliability() option, see gsem option
reliability(), see sem option
reliability()
r. En, [SEM] sem and gsem option covstructure()
repair, ssd subcommand, [SEM] ssd
replaying models, [SEM] intro 7
reporting options, [SEM] gsem reporting options,
[SEM] sem reporting options
residuals, [SEM] estat gof, [SEM] estat residuals,
[SEM] example 4, [SEM] Glossary
residuals, estat subcommand, [SEM] estat
residuals
RMSEA, see root mean squared error of approximation
robust, [SEM] Glossary
robust, see gsem option vce(), see sem option vce()
S
sandwich/Huber/White estimator of variance, see robust,
Huber/White/sandwich estimator of variance
satopts() option, see sem option satopts()
saturated model, [SEM] estat gof, [SEM] example 4,
[SEM] methods and formulas for sem,
[SEM] Glossary
Schwarz, G., [SEM] estat gof, [SEM] methods and
formulas for sem
Schwarz information criterion, see Bayesian information
criterion
score test, [SEM] intro 7, [SEM] estat ginvariant,
[SEM] estat mindices, [SEM] estat scoretests,
[SEM] methods and formulas for sem,
[SEM] Glossary
scores, [SEM] Glossary
scoretests, estat subcommand, [SEM] estat
scoretests
second-order latent variables, [SEM] Glossary
seemingly unrelated regression, [SEM] intro 5,
[SEM] example 12, [SEM] Glossary
select() option, see sem option select()
SEM, see structural equation modeling
sem command, [SEM] Builder, [SEM] example 1,
[SEM] example 3, [SEM] example 6,
[SEM] example 7, [SEM] example 8,
[SEM] example 9, [SEM] example 10,
[SEM] example 12, [SEM] example 15,
[SEM] example 16, [SEM] example 17,
[SEM] example 18, [SEM] example 20,
[SEM] example 23, [SEM] example 24,
[SEM] example 26, [SEM] example 42g,
[SEM] methods and formulas for sem,
[SEM] sem, [SEM] sem and gsem path
notation, [SEM] sem model description
options, [SEM] sem path notation extensions,
[SEM] sem postestimation, [SEM] Glossary
missing values, [SEM] example 26
with constraints, [SEM] example 8
sem option
allmissing, [SEM] sem estimation options
baseopts(), [SEM] sem estimation options
coeflegend, [SEM] example 8,
[SEM] example 16, [SEM] sem reporting
options
constraints(), [SEM] sem and gsem option
constraints( ), [SEM] sem model description
options
covariance(), [SEM] sem and gsem path
notation, [SEM] sem model description options,
[SEM] sem path notation extensions
T
Tarlov, A. R., [SEM] example 37g
technique, [SEM] Glossary
teffects, estat subcommand, [SEM] estat teffects
test,
chi-squared, see chi-squared test
goodness-of-fit, see goodness of fit
group invariance, see group invariance test
hypothesis, see hypothesis test
Lagrange multiplier, see Lagrange multiplier test
likelihood-ratio, see likelihood-ratio test
model simplification, see model simplification test
modification indices, see modification indices
score, see score test
Wald, see Wald test
U
unaddgroup, ssd subcommand, [SEM] ssd
unit loading, [SEM] intro 4
unstandardized coefficient, [SEM] Glossary
unstructured, [SEM] Glossary
V
van der Linden, W. J., [SEM] example 28g,
[SEM] example 29g
variable types, [SEM] intro 4
variance,
analysis of, [SEM] intro 4
Huber/White/sandwich estimator, see robust,
Huber/White/sandwich estimator of variance
variance–covariance matrix of estimators,
[SEM] Glossary, also see gsem option vce(),
also see sem option vce()
variance() option, see gsem option variance(), see
sem option variance()
VCE, see variance–covariance matrix of estimators
vce() option, see gsem option vce(), see sem option
vce()
W
Wald test, [SEM] intro 7, [SEM] estat eqtest,
[SEM] estat ginvariant, [SEM] example 13,
[SEM] example 22, [SEM] methods and
formulas for sem, [SEM] test, [SEM] testnl,
[SEM] Glossary
Ware, J. E., Jr., [SEM] example 37g
Weeks, D. G., [SEM] estat framework
Weesie, J., [SEM] Acknowledgments
weighted least squares, [SEM] methods and formulas
for sem, [SEM] Glossary
Wheaton, B., [SEM] example 9
White/Huber/sandwich estimator of variance, see robust,
Huber/White/sandwich estimator of variance
Wiggins, V. L., [SEM] sem
Williams, T. O., Jr., [SEM] example 2
WLS, see weighted least squares
Wooldridge, J. M., [SEM] estat ginvariant,
[SEM] estat mindices, [SEM] estat scoretests,
[SEM] methods and formulas for sem
Wright, D. B., [SEM] example 41g
Z
Zhang, Z., [SEM] example 42g
Zubkoff, M., [SEM] example 37g
Zyphur, M. J., [SEM] example 42g