Health Econometrics Using Stata
Partha Deb, Hunter College, CUNY and NBER
Edward C. Norton, University of Michigan and NBER
Willard G. Manning, University of Chicago
Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in LaTeX 2ε
10 9 8 7 6 5 4 3 2 1
No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or
by any means—electronic, mechanical, photocopy, recording, or otherwise—without the prior
written permission of StataCorp LLC.
Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.
Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of
the United Nations.
NetCourseNow is a trademark of StataCorp LLC.
Dedication to Willard G. Manning, Jr. (1946–2014)
Will Manning joined the RAND Corporation in 1975, a few years after
completing his PhD at Stanford. He quickly became involved in the RAND
Health Insurance Experiment. Will was the lead author of the article that
reported the main insurance results in the 1987 American Economic
Review, one of the most cited and influential articles in health economics.
He also published many seminal articles about demand for alcohol and
cigarettes, the taxes of sin, and mental healthcare. In 2010, the American
Society of Health Economics awarded the Victor R. Fuchs Award to Will
for his lifetime contributions to the field of health economics.
truth that much more enjoyable.
Contents
Tables
Figures
Preface
1 Introduction
1.1 Outline
1.2 Themes
1.3 Health econometric myths
1.4 Stata friendly
1.5 A useful way forward
2 Framework
2.1 Introduction
2.2 Potential outcomes and treatment effects
2.3 Estimating ATEs
2.3.1 A laboratory experiment
2.3.2 Randomization
2.3.3 Covariate adjustment
2.4 Regression estimates of treatment effects
2.4.1 Linear regression
2.4.2 Nonlinear regression
2.5 Incremental and marginal effects
2.6 Model selection
2.6.1 In-sample model selection
2.6.2 Cross-validation
2.7 Other issues
3 MEPS data
3.1 Introduction
3.2 Overview of all variables
3.3 Expenditure and use variables
3.4 Explanatory variables
3.5 Sample dataset
3.6 Stata resources
4 The linear regression model: Specification and checks
4.1 Introduction
4.2 The linear regression model
4.3 Marginal, incremental, and treatment effects
4.3.1 Marginal and incremental effects
4.3.2 Graphical representation of marginal and incremental effects
4.3.3 Treatment effects
4.4 Consequences of misspecification
4.4.1 Example: A quadratic specification
4.4.2 Example: An exponential specification
4.5 Visual checks
4.5.1 Artificial-data example of visual checks
4.5.2 MEPS example of visual checks
4.6 Statistical tests
4.6.1 Pregibon’s link test
4.6.2 Ramsey’s RESET test
4.6.3 Modified Hosmer–Lemeshow test
4.6.4 Examples
4.6.5 Model selection using AIC and BIC
4.7 Stata resources
5 Generalized linear models
5.1 Introduction
5.2 GLM framework
5.2.1 GLM assumptions
5.2.2 Parameter estimation
5.3 GLM examples
5.4 GLM predictions
5.5 GLM example with interaction term
5.6 Marginal and incremental effects
5.7 Example of marginal and incremental effects
5.8 Choice of link function and distribution family
5.8.1 AIC and BIC
5.8.2 Test for the link function
5.8.3 Modified Park test for the distribution family
5.8.4 Extended GLM
5.9 Conclusions
5.10 Stata resources
6 Log and Box–Cox models
6.1 Introduction
6.2 Log models
6.2.1 Log model estimation and interpretation
6.3 Retransformation from ln(y) to raw scale
6.3.1 Error retransformation and model predictions
6.3.2 Marginal and incremental effects
6.4 Comparison of log models to GLM
6.5 Box–Cox models
6.5.1 Box–Cox example
6.6 Stata resources
7 Models for continuous outcomes with mass at zero
7.1 Introduction
7.2 Two-part models
7.2.1 Expected values and marginal and incremental effects
7.3 Generalized tobit
7.3.1 Full-information maximum likelihood and limited-information
maximum likelihood
7.4 Comparison of two-part and generalized tobit models
7.4.1 Examples that show similarity of marginal effects
7.5 Interpretation and marginal effects
7.5.1 Two-part model example
7.5.2 Two-part model marginal effects
7.5.3 Two-part model marginal effects example
7.5.4 Generalized tobit interpretation
7.5.5 Generalized tobit example
7.6 Single-index models that accommodate zeros
7.6.1 The tobit model
7.6.2 Why tobit is used sparingly
7.6.3 One-part models
7.7 Statistical tests
7.8 Stata resources
8 Count models
8.1 Introduction
8.2 Poisson regression
8.2.1 Poisson MLE
8.2.2 Robustness of the Poisson regression
8.2.3 Interpretation
8.2.4 Is Poisson too restrictive?
8.3 Negative binomial models
8.3.1 Examples of negative binomial models
8.4 Hurdle and zero-inflated count models
8.4.1 Hurdle count models
8.4.2 Zero-inflated models
8.5 Truncation and censoring
8.5.1 Truncation
8.5.2 Censoring
8.6 Model comparisons
8.6.1 Model selection
8.6.2 Cross-validation
8.7 Conclusion
8.8 Stata resources
9 Models for heterogeneous effects
9.1 Introduction
9.2 Quantile regression
9.2.1 MEPS examples
9.2.2 Extensions
9.3 Finite mixture models
9.3.1 MEPS example of healthcare expenditures
9.3.2 MEPS example of healthcare use
9.4 Nonparametric regression
9.4.1 MEPS examples
9.5 Conditional density estimator
9.6 Stata resources
10 Endogeneity
10.1 Introduction
10.2 Endogeneity in linear models
10.2.1 OLS is inconsistent
10.2.2 2SLS
10.2.3 Specification tests
10.2.4 2SRI
10.2.5 Modeling endogeneity with ERM
10.3 Endogeneity with a binary endogenous variable
10.3.1 Additional considerations
10.4 GMM
10.5 Stata resources
11 Design effects
11.1 Introduction
11.2 Features of sampling designs
11.2.1 Weights
11.2.2 Clusters and stratification
11.2.3 Weights and clustering in natural experiments
11.3 Methods for point estimation and inference
11.3.1 Point estimation
11.3.2 Standard errors
11.4 Empirical examples
11.4.1 Survey design setup
11.4.2 Weighted sample means
11.4.3 Weighted least-squares regression
11.4.4 Weighted Poisson count model
11.5 Conclusion
11.6 Stata resources
References
Author index
Subject index
Tables
5.1 GLMs for continuous outcomes
6.1 Box–Cox formulas for common values of λ
7.1 Choices of two-part models
Figures
3.1 Empirical distribution of ln(total expenditures)
3.2 Empirical distributions of healthcare use
4.1 The relationship between total expenditures and age, for men and for
women, with and without any limitations
4.2 Distributions of AME of z: Quadratic specification
4.3 Distributions of AME of x: Quadratic specification
4.4 Distributions of AME of x when x = 6: Quadratic specification
4.5 Distributions of AME of x: Exponential specification
4.6 Distributions of AME of d: Exponential specification
4.7 Distributions of AME of d, given d = 1: Exponential specification
4.8 Residual plots for y1
4.9 Residual plots for y2
4.10 Residual plots for y3
4.11 Residual plots: Regression using MEPS data, evidence of
misspecification
4.12 Residual plots: Regression using MEPS data, well behaved
4.13 Graphical representation of the modified Hosmer–Lemeshow test
4.14 Graphical representation of the modified Hosmer–Lemeshow test
after adding interaction terms
5.1 Densities of total expenditures and their residuals
5.2 Predicted total expenditures by age and gender
5.3 Predicted marginal effects of age by age and gender
7.1 Predicted expenditures, conditional on age and gender
8.1 Poisson density with
8.2 Poisson density with
8.3 Empirical frequencies
8.4 Empirical and Poisson-predicted frequencies
8.5 Negative binomial density
8.6 Empirical and NB2 predicted frequencies
8.7 Cross-validation log likelihood for office-based visits
8.8 Cross-validation log likelihood for ER visits
9.1 Coefficients and 95% confidence intervals by quantile of expenditure
errors
9.2 Coefficients and 95% confidence intervals by quantile of
ln(expenditure) errors
9.3 Empirical and predicted componentwise densities of expenditure
9.4 Componentwise coefficients and 95% confidence intervals of
expenditures
9.5 Empirical and predicted componentwise densities of office-based visits
9.6 Componentwise coefficients and 95% confidence intervals of office-based use
9.7 Predicted total expenditures by physical health score and activity
limitation
Preface
This book grew out of our experience giving presentations about applied
health econometrics at the International Health Economics Association and
the American Society of Health Economists biennial conferences. In those
preconference seminars, we tried to expose graduate students and early-career academics to topics not generally covered in traditional econometrics courses but that are nonetheless salient to most applied research
on healthcare expenditures and use. Participants began to encourage us to
turn our slides into a book.
We thank the many seminar participants who served as guinea pigs for our efforts at clarity and instruction and especially those
who gave us the initial motivation to undertake this book. Our wives, Erika
Bach and Carolyn Norton, provided support and encouragement,
especially during periods of low marginal productivity. Erika Manning
cared for Will during his illness and tolerated lengthy phone calls at odd
hours, and Will’s bad puns at all hours.
Notation and typography
In this book, we assume that you are somewhat familiar with Stata: you
know how to input data, use previously created datasets, create new
variables, run regressions, and the like.
The data we use in this book are freely available for you to download,
using a net-aware Stata, from the Stata Press website, http://www.stata-press.com. In fact, when we introduce new datasets, we load them into
Stata the same way that you would. For example,
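A minimal sketch of such a load, using the example dataset whose address is given in section 3.5:

. use http://www.stata-press.com/data/eheu/dmn_mepssample, clear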
Try it. To download the datasets and do-files to your computer, type the
following commands:
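One plausible form of these commands, assuming the book's materials are distributed as a net package at the data address (the package name eheu is inferred from the URL and may differ):

. net from http://www.stata-press.com/data/eheu/
. net get eheu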
Chapter 1
Introduction
Health and healthcare are central to society and economic activity. This
observation extends beyond the large fraction of gross national product
devoted to formal healthcare to the fact that health and healthcare affect
each other and numerous other decisions. Health affects people’s ability to
engage in work and leisure, their probability of marriage, probability of
living to a ripe old age, and how much they spend on healthcare.
Healthcare affects health mostly for the better, although side effects and
medical errors can have drastic consequences. The desire for better health
motivates decisions about smoking, drinking, diet, and exercise over a
lifetime. Therefore, it is important to understand the underlying causes of
health and how health affects people’s lives, including examining the
determinants of healthcare expenditures and use.
approaches to fitting many complex models. Advances in computing
power mean that researchers can estimate technically complex statistical
models faster than ever. Stata (and other statistical software) allows
researchers to use these models quickly and easily.
Like the people behind the statistics, data come in all shapes, sizes, and
ages. Researchers collect population health and census data, episode-level
claims data, survey data on households and on providers, and, more
recently, individual biometric data—including genetic information.
Datasets are often merged to generate richer information over time. The
variety of data is dizzying.
1.1 Outline
This book is divided into three groups of chapters. The early chapters
provide the background necessary to understand the rest of the book. Many
empirical research questions aim to estimate treatment effects.
Consequently, chapter 2 introduces the potential outcomes framework,
which is useful for estimating and interpreting treatment effects. It also
relates treatment effects to marginal and incremental effects in both linear
and nonlinear models. Chapter 3 introduces the Medical Expenditure Panel
Survey dataset, which is used throughout this book for illustrative
examples. Chapter 4 illustrates how to estimate the average treatment
effect, the treatment effect on the treated, and marginal and incremental
effects for linear regression models. Chapter 4 also shows that
misspecifications in OLS models can lead to inconsistent average effects. It
also includes graphical and statistical tests for model specification to help
decide between competing statistical models.
The core chapters describe the most prominent set of models used for
healthcare expenditures and use, including those that explicitly deal with skewness, heteroskedasticity, log transformations, zeros, and count data. Chapter 5 presents GLMs as an alternative to OLS for modeling positive
continuous outcomes. Generalized linear models are especially useful for
skewed dependent variables and for heteroskedastic error terms. Although
we argue that GLM provides a powerful set of models for health
expenditure, we also lay out the popular log transformation model in
chapter 6. Transforming a dependent variable by taking its natural
logarithm is a widely used way to model skewed outcomes. Chapter 6
describes several versions that differ in their assumptions about
heteroskedasticity and the distribution of the error term (normal or
nonnormal). We show that interpretation can be complex, even though
estimation is simple. Chapter 7 adds observations with outcomes equal to
zero. Most health expenditure data have a substantial mass at zero, which makes models that explicitly account for zeros appealing. Here we describe and compare two-part and selection models. We explain the underlying assumptions behind the often misunderstood two-part model, and show how two-part models are superficially similar to selection models yet differ from them in fundamental ways. Chapter 8 moves away from continuous dependent variables to count models. These models
are essential for outcomes that are nonnegative integer valued, including
counts of office visits, number of cigarettes smoked, and prescription drug
use.
The book then shifts to more advanced topics. Chapter 9 presents four
flexible approaches to modeling treatment-effect heterogeneity. Quantile
regression allows response heterogeneity by level of the dependent
variable. We describe basic quantile regressions and how to use those
models to obtain quantile treatment effects. Next, we describe finite mixture models. These models treat the sample as if it were drawn from a finite number of subpopulations, with different relationships between outcomes and predictors in each subpopulation. Thus, finite mixture models can
uncover patterns in the data caused by heterogeneous types. Third, we
describe local-linear regression, a nonparametric regression method.
Nonparametric regression techniques make few assumptions about the
functional form of the relationship between the outcome and the covariates
and allow for very general relationships. Finally, conditional density
estimation is another flexible alternative to linear models for dependent
variables with unusual distributions. The last two chapters discuss issues
that cut across all models. Chapter 10 introduces methods for controlling for endogeneity, or selection on unobservables, of covariates of policy interest to the researcher. Chapter 11 discusses design effects. Many datasets have
information collected with complex survey designs. Analyses of such data
should account for stratified sampling, primary sampling units, and
clustered data.
1.2 Themes
1.3 Health econometric myths
3. OLS is fine. OLS regression has many virtues. It is easy to estimate and
interpret. Under a set of well-known assumptions—including that the
model as specified is correct—OLS is the best linear unbiased
estimator, except when the assumptions fail, which is often. We
demonstrate the limitations of OLS in chapter 4.
4. All GLM models should have a log link with a gamma distribution.
Several early influential articles using GLM models in health
economics happened to analyze data for which the log link with a
gamma distribution was the appropriate choice. Different link and
distributional families may be better (see chapter 5) for other data.
6. Use selection models for data with a large mass at zero. When the
data have substantial mass at zero, some researchers reach for the
two-part model, while others reach for selection models. Their
choices often lead to considerable argument over which is better. We
advocate the two-part model for researchers interested in actual
outcomes (including the zeros), and we advocate selection models for
researchers interested in latent outcomes (assuming that the zeros are
missing values). We set the record straight in chapter 7.
7. All count models are Poisson. Ever wonder why some researchers
reflexively use Poisson, and others use the negative binomial? We
explain the tradeoff between inference about the conditional mean
function and conditional frequencies while providing intuition and
pretty pictures (see chapter 8).
10. Complex survey design is just for summary statistics. Most large
surveys use stratification and cluster sampling to better represent
important subsamples and to use resources efficiently. Model
estimation, not just summary statistics, should control for sample
weights, clustering, and stratification (see chapter 11).
1.4 Stata friendly
1.5 A useful way forward
Finally, we agree with the observation by Box and Draper (1987) that “all
models are wrong, but some are useful”. Our intent is to provide methods
to choose models that are useful for the research question of interest.
Chapter 2
Framework
2.1 Introduction
have otherwise similar characteristics or understand the trajectory of
healthcare use across the lifespan of individuals. The insights of the
potential-outcomes framework are useful in such circumstances as well,
although gender and age are clearly not modifiable in the way that an
experimental treatment is. In other analyses, the researcher may simply be
interested in the best predictions of individual-level outcomes rather than
the effect of a particular covariate on the outcome. In such analyses, the
researcher would focus on prediction criteria to formulate an appropriate
model. Such criteria may or may not be consistent with a model that is
preferable in a causal analysis.
outcome (linear, log, or power), and of what statistical distribution to
choose to complete the model.
2.2 Potential outcomes and treatment effects
condition (at the same point in time). In fact, we can relate the observed outcome ($y$) to the potential outcomes ($y_0$ and $y_1$) using the following relationship,

$$y = d\,y_1 + (1 - d)\,y_0 \tag{2.1}$$

which does not allow us to identify both of the two potential outcomes. Thus we can never measure a causal effect directly.
How do we estimate these effects, given data on treatment assignment,
observed outcomes, and covariates? The answer to this question depends
on the design of the study, and—by implication—properties of the data-
generating process that generates the potential outcomes. We describe
estimating ATEs in three leading situations below: a laboratory experiment,
a nonlaboratory experiment when randomization is possible, and an
observational study without randomization.
2.3 Estimating ATEs
2.3.1 A laboratory experiment
2.3.2 Randomization
where the fact that $d$ is independent of $(y_0, y_1)$ is necessary to establish the first equality, and we use (2.1) to establish the relationship between potential and observed outcomes. Similarly,

$$E(y \mid d = 0) = E(y_0 \mid d = 0) = E(y_0)$$

and the difference between these two conditional means therefore identifies the ATE.
2.4 Regression estimates of treatment effects
We now show how you can use regression models to estimate treatment
effects. We remind our readers that if you are interested in estimating ATE
or ATET, or those effects conditional on covariates, modeling efforts should focus on obtaining the best estimates of the conditional mean functions, $E(y \mid \mathbf{x}, d = 1)$ and $E(y \mid \mathbf{x}, d = 0)$. Consistency is clearly a desired property of
the estimators, but precision is important as well. As is typical, there is
often a tradeoff between consistency and efficiency of estimators, so we
urge our readers to think through their modeling choices carefully before
proceeding.
With the above general principles in mind, it is useful to begin with the
randomization case even though no regression is necessary. In that case,
we only need to estimate sample means. Nevertheless, we can also obtain
an estimate of the ATE (which is equal to ATET) via a simple linear regression. Without any loss of generality, we can write the relationship between the observed outcome, $y$, and the treatment assignment, $d$, as

$$y = \beta_0 + \beta_1 d + \varepsilon$$
determinants of potential imbalances between treatment and control
samples. For instance, by estimating a regression, we may be able to
estimate what would have happened to the treated observations had they
received the control, and vice versa, all else being held constant. Such an
approach requires that the chosen regression specification be the correct
data-generating process, or at least approximately correct.
The simplest, and perhaps most commonly used, linear (in parameters) regression specification is also specified as being linear in variables (the treatment indicator $d$ and a vector of covariates $\mathbf{x}$):

$$y = \beta_0 + \beta_1 d + \mathbf{x}'\boldsymbol{\beta}_2 + \varepsilon \tag{2.2}$$
In this specification, the expected outcome in the control condition ($d = 0$) is

$$E(y \mid \mathbf{x}, d = 0) = \beta_0 + \mathbf{x}'\boldsymbol{\beta}_2$$
Unlike the prior case shown in (2.2), the expected outcomes in treated
and control cases and—consequently—the individual-level differences in
expected outcomes are functions of the values of the individual’s
covariates, $\mathbf{x}$, leading to differences between ATE and ATET. Sample averages of estimates of individual-level differences in expected outcomes, over the entire sample for ATE and over the treated sample for ATET, are
valuable. However, they may hide considerable amounts of useful
information about how treatment effects vary across substantively
interesting subgroups of the population. For example, the ATE of a checkup
visit may be substantially different for men as opposed to women.
Estimating two ATEs, one for the sample of men and the other for the
sample of women, would provide a much richer understanding of the
effect of this intervention than just one estimate.
that they render the point estimate relatively uninformative. Again,
specification checks and tests described in chapter 4 could help answer this
question.
In this model, the covariates and the treatment indicator enter in a linear, additive way first, but then their effect on the outcome is transformed by a nonlinear function, $g(\cdot)$. In this setting, the individual-level expected treatment effect is no longer a linear function of covariates. Instead, it is

$$E(y \mid \mathbf{x}, d = 1) - E(y \mid \mathbf{x}, d = 0) = g(\beta_0 + \beta_1 + \mathbf{x}'\boldsymbol{\beta}_2) - g(\beta_0 + \mathbf{x}'\boldsymbol{\beta}_2)$$
Once again, the individual-level expected treatment effect is a function of covariates, $\mathbf{x}$, so it will vary from individual to individual across the sample. The estimate of the sample ATE is

$$\widehat{\text{ATE}} = \frac{1}{N} \sum_{i=1}^{N} \left\{ g(\hat{\beta}_0 + \hat{\beta}_1 + \mathbf{x}_i'\hat{\boldsymbol{\beta}}_2) - g(\hat{\beta}_0 + \mathbf{x}_i'\hat{\boldsymbol{\beta}}_2) \right\}$$
To estimate the ATET, we take the above formula but average only
over the sample of treated observations. Here—as in nonlinear models
generally—the individual-level expected treatment effect is a function of
the covariates, $\mathbf{x}$, so expected treatment effects averaged over different
samples will yield different estimates. Specifically, ATE will not be equal
to ATET .
2.5 Incremental and marginal effects
The marginal effect of a continuous covariate $x_k$ is the derivative

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_k} = \beta_k$$
Both the average incremental effect and the average marginal effect are
simply the coefficients on the respective variables in the regression. They
are constant across the sample by definition.
Both the incremental effect and the marginal effect will vary from individual to individual across the sample, because the function $g(\cdot)$ is nonlinear. We can calculate sample averages of these effects in a variety of ways, just as we can for treatment effects.
changing just the interaction term. More surprisingly, the sign may be different for different observations (Ai and Norton 2003). We cannot determine the statistical significance of the interaction effect from the test statistic reported in the regression output. For more on the interpretation of interaction terms in nonlinear models and how to estimate the magnitude and statistical significance, see Ai and Norton (2003) and Norton, Wang, and Ai (2004).
2.6 Model selection
However, although graphical tests are suggestive, they are not formal
statistical tests. In chapter 4, we present three statistical tests for assessing model fit. The first two, Pregibon’s (1981) link test and Ramsey’s (1969)
regression equation specification error test , directly test whether the
specified linear regression shows evidence of needing higher-order powers
of covariates or interactions of covariates for appropriate specification.
The third—a modified version of the Hosmer–Lemeshow (1980) test—
can be used generally, because it is based on a comparison between
predicted outcomes from the model and model-free empirical analogs. If
the model specification is not correct, then an alternative specification may
predict better, indicating that the specification of the explanatory variables
should change. When the modeling choices involve decisions such as
adding covariates or powers and interactions of existing covariates,
standard tests of individual or joint hypotheses (for example, Wald and $F$ tests) can also be useful.
When the data have additional statistical issues, such as clustering and
weighting, a strict likelihood interpretation of the optimand is often
invalid. In most such situations, however, the model optimand has a
quasilikelihood interpretation that is sufficient for these two popular
model-selection criteria to be valid (Sin and White 1996; Kadane and
Lazar 2004).
The AIC is

$$\text{AIC} = -2\ln L + 2k$$

where $\ln L$ is the maximized log likelihood (or quasilikelihood) and $k$ is the number of parameters in the model. Smaller values of AIC are preferable. The BIC (Schwarz 1978) is

$$\text{BIC} = -2\ln L + k \ln N$$

where $N$ is the sample size. Smaller values of BIC are also preferable. For moderate to large sample sizes, the BIC places a premium on parsimony. Therefore, it will tend to select models with fewer parameters relative to the model preferred by the AIC criterion.
Note that there are many other formulas for AIC and BIC throughout the
literature. Closer examination of alternative formulas shows that they are
substantively only minor variations of the equations shown above. For
example, switching signs of each term in the formula suggests that one
should search for the model with the largest values of the criteria.
Sometimes, AIC and BIC are formulated with an overall division by $N$, the
sample size. This formulation is substantively no different from the ones
we have described.
As mentioned above, the AIC and BIC are robust to many of the
misspecification issues that plague traditional test statistics, most notably
in the context of complex survey data issues. Because the derivation of the
criteria does not involve moment conditions, or convergence to statistical
distributions, they are invariant to the typical corrections to standard errors
required to make test statistics the correct size when observations are not
independently and identically distributed (Schwarz 1978; Sin and
White 1996). In general, as long as the likelihood or quasilikelihood (or
weighted likelihood if sampling weights are used) is appropriate as
objective functions to obtain consistent parameter estimates, the AIC and
BIC have desirable optimality properties.
2.6.2 Cross-validation
2.7 Other issues
In chapter 10, we will describe methods that apply to the case when there
is selection on unobservables, one form of endogeneity. For example,
when a researcher is interested in the causal effect of insurance on
healthcare expenditures, and when the dataset is an observational sample
of individuals who have chosen to purchase health insurance (or not), it is
difficult to rule out endogeneity.
Chapter 3
MEPS data
3.1 Introduction
3.2 Overview of all variables
There are three health-status variables and four insurance variables
(see section 3.4 for further description and summary statistics).
examples (see section 3.3 for further description and summary statistics).
3.3 Expenditure and use variables
All the expenditure and use variables are highly skewed, with a large mass
at zero. Expenditures include out-of-pocket payments and third-party
payments from all sources. They do not include insurance premiums. We
measure all expenditures in 2004 U.S. dollars; adjusting expenditures for
inflation to 2016 would increase nominal amounts by about 27%. There
are several ways to provide summary statistics for each type of
expenditure. First, we provide summary statistics on all observations,
including those with zeros. We report skewness and kurtosis to show how
skewed these variables are. None of the summary statistics are corrected
for differential sampling or clustering. Total annual expenditures on all
healthcare averaged $3,685 (in 2004 dollars), with a range from $0 to
$440,524. Inpatient expenditures averaged $1,123. Inpatient expenditures
are divided into inpatient facility and inpatient physician expenses. The
total amount paid by a family or individual was less than $700 on average
but was as high as $50,000.
Finally, we show summary statistics (including the coefficient of
skewness ) for the subset with positive values (different for each
variable), both for the raw variable and the logged variable. We give them
names ending in gt0. The raw positive expenditure variables have
extremely high skewness, with values ranging from 4.6 to almost 13.
expenditures (see figure 3.1). Although it is tempting to conclude that the
distribution of the logarithm of expenditures is normal, or truly symmetric,
both of those conclusions are typically wrong; modeling expenditures as if they were can lead to incorrect conclusions. One of this book’s main themes is how to model such variables appropriately.
Figure 3.1: Empirical distribution of ln(total expenditures)
The example dataset has six variables that measure healthcare use
(discharges, length of stay, three kinds of office-based visits, and
prescriptions). Each use variable has a large mass at zero and a long right
tail. On average, people had nearly 6 office-based provider visits, almost
13 prescriptions (or refills), and about 1 dental visit. About 29% have no
office-based provider visits during the year, and 5% have at least 25. One-
third have no prescriptions or refills during the year, while 17% have at
least 25. Well over half the sample report having no dental visits during
the past year. About 60% of adults have no dental visits, while closer to
30% have no office-based provider visits, prescriptions, or refills.
Figure 3.2: Empirical distributions of healthcare use. (Histograms of office-based provider visits, dental visits, and prescriptions and refills; 5,673 persons have 0 office-based visits and 822 are top-coded at 25 visits; 3,139 prescription counts are top-coded at 25.)
The density for all six use variables falls gradually for nearly the entire
range. Histograms for three of the use variables (office-based visits, dental
visits, and prescriptions and refills) show a large mass at zero and a
declining density for positive values (see figure 3.2). We top-coded some values at 25 for the purpose of the histograms.
3.4 Explanatory variables
daily living and instrumental activities of daily living. About 28% of the
sample has at least 1 limitation. The other two health measures are based
on the physical and mental health components of the Short Form 12. They
are used to construct continuous measures on a scale from 0 to 100, with a
mean of about 50. A higher number indicates better health. Both
distributions are skewed left, with a median three to four points above the
mean.
3.5 Sample dataset
Interested readers can use the example dataset based on the 2004 MEPS data
to reproduce results found in this book. The sample from the 2004 full-
year consolidated data file includes all adults ages 18 and older and who
have no missing data on the main variables of interest. There are 19,386
observations on 44 variables. This dataset is publicly available at
http://www.stata-press.com/data/eheu/dmn_mepssample.dta.
3.6 Stata resources
Chapter 4
The linear regression model: Specification and
checks
4.1 Introduction
dependent variable. Correct specification of the relationship is a key
assumption of the theorem. In practice, while researchers cannot claim to
know the true model, they should strive to specify good models. A good
model includes all the necessary variables—including higher-order
polynomials and interaction terms—but no more. A good model includes
variables with the correct functional relationship between the covariates
and outcome. Choosing the correct model specification requires making
choices. There is tension between simplicity and attention to detail, and
there is tension between misspecification and overfitting. We address these
issues in this chapter.
4.2 The linear regression model
4.3 Marginal, incremental, and treatment effects
We begin with a fairly simple OLS regression model to predict total annual
expenditures at the individual level for those who spend at least some
money, using the 2004 Medical Expenditure Panel Survey (MEPS) data (see
chapter 3). Our goal is to interpret the results using the framework of
potential outcomes and marginal effects (see chapter 2). To be clear, the
model we fit may not be appropriate for a serious research exercise: it
drops all observations with zero expenditures, and its specification of
covariates is rudimentary. Additionally, we do not consider any possible
models besides OLS, especially ones that may be better suited to deal with
the severe skewness in the distribution of this outcome, and we do not
control for design effects or possible endogeneity. In short, we ignore all
interesting features of this typical health outcome variable, knowing that
we will return to each of these issues throughout the book. The focus of
this section is to provide a framework for interpreting regression results.
expenditures. Unsurprisingly, expenditures are far higher for those with at least one limitation. All coefficients are statistically significant at conventional levels.
We first interpret the results for age and gender in more detail. Following
chapter 2, we interpret regression results for the continuous variable age as
a marginal effect (derivative) and for the dichotomous variable female as
an incremental effect (difference). One way to interpret the effects (not
necessarily the most informative way for this example, as we will see) is to
compute the average marginal and incremental effects using the Stata
command margins, dydx(). Because of the interaction term between age
and gender, the average marginal and incremental effects will not equal
any of the estimated coefficients in the model.
one year older, it would be better to do that computation directly. We show
how this can be done using margins in section 5.7.
gender is different at each age, being more than $1,140 at age 20 and close to 0 at advanced ages.
Predicted total expenditures increase for all four types of people but at
different rates (see figure 4.1). The top two lines are for people with
limitations, roughly $4,400 above the lines for those people without any
limitations. Women have higher predicted total expenditures than men at
young ages, but men’s expenditures increase more rapidly with age. At advanced ages, the predictions cross; elderly men are predicted to spend
slightly more than elderly women (controlling for limitations). The figure
clarifies the relationship between all the variables and shows the
importance of including the interaction term between age and gender.
Figure 4.1: The relationship between total expenditures and age, for
men and for women, with and without any limitations
First, we use the results from the OLS regression model to estimate
predicted values, comparing predictions that everyone had a limitation
with predictions that no one had limitations. By this approach, we see that
the average predicted spending as if no one had any limitations is only
$3,030, while predicted spending as if everyone had a limitation is $7,487.
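A sketch of the commands that produce these counterfactual predictions, using the example dataset's variable names (exp_tot, age, female, anylim); the exact specification shown in the original output is assumed:

. regress exp_tot c.age##i.female i.anylim, vce(robust)
. margins, at(anylim=(0 1))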
which accounts for the fact that the covariates also have sampling variation
in the population.
Turning first to the ATE, we see that the treatment effect estimated by
teffects is different from the treatment effect estimated by margins with contrast(), despite seemingly using the same model specification. The
difference is several hundred dollars.
The reason for the difference between the ATE estimated by teffects
and the treatment effect estimated by margins is that the model
specifications are different. The teffects command fits a model (not
shown) in which the treatment variable is fully interacted with all
covariates. It is equivalent (for the point estimates of the parameters) to
running separate models for those with and without any limitations. These
two methods of calculating the ATE (using margins or using teffects)
will be the same if the original regression model interacts all covariates (in
our example: age, female, and their interaction) with the treatment
variable (anylim). That regression is below.
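A sketch of that fully interacted regression, under the same variable-name assumptions:

. regress exp_tot i.anylim##(c.age##i.female), vce(robust)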
The estimated margins using margins are now the same as the potential-outcome means estimated using teffects, because the underlying
regression model specification is now the same.
The ATE, found with the margins command and the contrast() option, is $4,239, which is now exactly the same as the ATE found with the teffects command. If we use the vce(unconditional) option for the standard errors, then we will also get population-averaged standard errors; these account for sample-to-sample variation in covariates. Whether one wants sample-averaged or population-averaged standard errors depends on the research question and whether it makes sense to take the
covariates as fixed or not. However, there are also statistical implications
associated with this choice. The confidence intervals for the population
effects will be larger than those for the sample effects. Given this
difference in confidence intervals, it is possible for the population effect to
be statistically insignificant but for the sample effect to be statistically
significant. This distinction may be especially relevant when the sample is
relatively small and where it is unclear how representative of the
population of interest the sample is. Stata allows the user to decide and
estimate either confidence interval.
while regress uses the small-sample adjustment of $N/(N - k - 1)$, where $N$ is the sample size and $k$ is the number of covariates. The difference between these shrinks asymptotically to zero as $N$ approaches infinity.
The teffects command also easily estimates the ATET and the
potential-outcome means. In this example, the ATET is several hundred
dollars more than the ATE. This result is consistent with nonrandom
assignment to the treatment group.
In this section, we interpreted the results from a linear regression
model. The model specification was useful for illustrative purposes, but we
did not choose it through a rigorous process. Later, we will explore
alternatives to linear regression for skewed positive outcomes (chapters 5
and 6), how to incorporate zeros (chapter 7), and how to control for design effects and possible endogeneity (chapters 11 and 10, respectively). However, first we
show by example that misspecifying the model even in a straightforward
manner can lead to inconsistency—even in the case of OLS estimates of the
parameters of a linear regression. Afterward, we will show visual and
statistical tests to help choose a model specification to reduce the chance
of misspecification.
4.4 Consequences of misspecification
The second regression is correctly specified:
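The regressions themselves are shown here only as a sketch, under the naming assumption used in the figure captions below (a symmetric covariate z and a skewed covariate x, whose square enters the true model):

. * misspecified: linear in both covariates
. regress y x z
. * correctly specified: includes the quadratic term in x
. regress y c.x##c.x z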
Figure 4.2: Distributions of AME of z: Quadratic specification
Figure 4.3 shows the analogous figures for the distributions of the average marginal effects of x. The AME of x estimated from the misspecified linear-in-covariates model appears to be inconsistent. Recall that the distribution of z is symmetric, while the distribution of x is skewed. This example shows that—unless the distribution of the covariate is symmetric—even misspecification as innocuous as leaving out a quadratic term in covariates can lead to inconsistent AMEs.
Figure 4.3: Distributions of AME of x: Quadratic specification
Figure 4.4: Distributions of AME of x when x = 6: Quadratic specification
For our second example, we specify a model with one continuous variable, x, and one binary indicator, d, that explain y in a relationship specified with an exponential mean and multiplicative errors. This is a log-linear model:

$$y = \exp(\beta_0 + \beta_1 x + \beta_2 d)\,\varepsilon$$
We estimate two regressions using data drawn from the data-generating process above. The first regression is misspecified because it does not have the correct exponential relationship; it is linear in x and d:

$$y = \alpha_0 + \alpha_1 x + \alpha_2 d + u$$
In this scenario, figure 4.5 shows that, even though x has a symmetric distribution, its estimated AME is inconsistent when the model is misspecified with a linear conditional mean. The AME of the binary indicator, d, is also inconsistent. Moreover, there is a considerable loss in efficiency: the distribution of the AME of d is substantially more dispersed in the misspecified case compared with the distribution in the correctly specified case.
Next, we examine the distribution of the effect of d on y when d = 1 (analogous to estimating the treatment effect on the treated). The results, shown in figure 4.7, demonstrate that the estimated effect of d on y when d = 1 is inconsistent when the model is misspecified as linear while the true data-generating process is exponential.
4.5 Visual checks
In this section, we illustrate the use of visual residual checks for the least-
squares model by examining three simple artificial-data examples with a
correctly specified and a misspecified model. Then, we use the visual
residual checks to explore possible misspecification in the MEPS data for
two simple models.
For the artificial-data examples, we draw 1,000 observations using
data-generating processes, specified in Stata as
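The code block here is a sketch rather than the original; the functional forms are inferred from the residual patterns described below (y1 well behaved; y2 missing a quadratic term in x; y3 heteroskedastic), and all coefficient values are assumptions:

. clear
. set seed 123456
. set obs 1000
. generate x = runiform()
. generate z = runiform()
. generate y1 = 1 + x + z + rnormal()
. generate y2 = 1 + x + 5*x^2 + z + rnormal()
. generate y3 = 1 + x + z + (1 + 2*x + 2*z)*rnormal()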
We use two regress postestimation commands, rvfplot and rvpplot, to detect misspecification visually. rvfplot plots residuals versus fitted values (the predicted dependent variable, or the linear index), and rvpplot plots residuals versus a specific predictor variable. Because this is the first time we do this, we show the Stata code for illustrative purposes.
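A sketch of these commands for y1 (the graph names are illustrative; analogous commands produce the plots for y2 and y3):

. regress y1 x z
. rvfplot, name(rvf1, replace)
. rvpplot x, name(rvpx1, replace)
. rvpplot z, name(rvpz1, replace)
. graph combine rvf1 rvpx1 rvpz1, rows(1)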
Figure 4.8: Residual plots for y1 (residuals versus fitted values, x, and z)
The associated residual plots in figure 4.9 show a distinct U-shaped
pattern when residuals are plotted against predicted values and when they
are plotted against x but show no pattern when residuals are plotted against
z. Taken together, they indicate a misspecified model, likely in terms of x
—but not in terms of z.
Figure 4.9: Residual plots for y2 (residuals versus fitted values, x, and z)
The associated residual plots in figure 4.10 show evidence of
misspecification. Regardless of whether the residuals are plotted against
predicted values, x or z, the figures show that the residuals fan out along
the range of the axis. The variation in the residuals increases with higher
values of the predicted values, x and z.
Figure 4.10: Residual plots for y3 (residuals versus fitted values, x, and z)
lninc. In the first example, we estimate a linear regression of total expenditures (exp_tot) on age and lninc. We construct residual plots for a 10% random sample of the MEPS observations to make the plots clearer (by reducing the density of points) and to reduce the file size of the resulting figures.
Figure 4.11: Residual plots: Regression using MEPS data, evidence of misspecification
Figure 4.12: Residual plots: Regression using MEPS data, well behaved
4.6 Statistical tests
Although graphical tests are suggestive for determining model fit, they are
not formal statistical tests. We present three diagnostic statistical tests that
are commonly used for assessing model fit. The first two—Pregibon’s
(1981) link test and Ramsey’s (1969) regression equation specification
error test (RESET) —check for linearity of the link function. The third, a
modified version of the Hosmer–Lemeshow test (1980) , is a more
omnibus test for model misspecification. Although each of these tests were
originally developed for other applications, we focus on their interpretation
as general-purpose diagnostic tests of the specification of the explanatory
variables. After presenting each of these statistical tests, we then show
several examples using the MEPS data.
Pregibon’s link test addresses the more interesting case where there are multiple covariates. For example, if there are two underlying covariates, $x_1$ and $x_2$, then the quadratic expansion would include $x_1^2$, $x_2^2$, and $x_1 x_2$. The corresponding specification test would be an $F$ test of whether the set of estimated coefficients on the higher-order terms is statistically significantly different from zero. If there are many covariates in a model, adding all the possible higher-order terms can become unwieldy. A model with $k$ covariates requires adding $k(k+1)/2$ additional terms.
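As a sketch, the quadratic expansion and its joint F test for two covariates could be run as:

. regress y c.x1 c.x2 c.x1#c.x1 c.x2#c.x2 c.x1#c.x2
. testparm c.x1#c.x1 c.x2#c.x2 c.x1#c.x2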
The same logic we used to motivate Pregibon’s link test can be applied
to the more general RESET test. If there is a single covariate, $x$, we could add quadratic, cubic, and possibly quartic terms to the augmented regression. If these additional terms as a set are statistically significantly different from zero by the $F$ test—while retaining the linear term—then we can reject the simple linear-in-$x$ model in favor of a more nonlinear formulation. By extension, we can alter the link test to have a quartic alternative:

$$y = \delta_0 + \delta_1 \hat{y} + \delta_2 \hat{y}^2 + \delta_3 \hat{y}^3 + \delta_4 \hat{y}^4 + u$$
The link and RESET tests are parametric, because they assume that the
misspecification can be captured by a polynomial of the predictions from
the main equation. This alternative model specification may not be
appropriate for a situation with a different pattern. For example, consider a
model that was good through much of the predicted range but had a sharp
downturn in the residuals when plotted against a specific covariate. Pregibon’s link test, with its implicit assumption of a symmetric parabola,
would not provide a good test for that alternative.
specification closely approximates the data-generating process, the mean
prediction for each of these bins should be near zero and not significantly
different from zero.
One way to test this is to sort the data into deciles by the predicted conditional mean, $\hat{y}$. Regress the residuals from the original model on indicator variables for these 10 bins, suppressing the intercept. Test whether these 10 coefficients are collectively significantly different from 0 using an $F$ test.
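A sketch of this procedure after fitting the original model (the names yhat, ehat, and decile are illustrative):

. predict yhat, xb
. predict ehat, residuals
. xtile decile = yhat, nq(10)
. regress ehat ibn.decile, noconstant vce(robust)
. testparm ibn.decile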
4.6.4 Examples
To illustrate the link and RESET tests, we present an example using the
MEPS data with a simple model specification. Consider a model that
predicts total healthcare expenditures as a function only of age and gender.
For this example, we drop all observations with zero expenditures:
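A sketch of the setup, using the example dataset and variable names from chapter 3:

. use http://www.stata-press.com/data/eheu/dmn_mepssample, clear
. drop if exp_tot == 0
. regress exp_tot age female, vce(robust)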
Because we suspect that this simple model does not accurately capture
the relationship between total healthcare expenditures and demographics,
we run the link and RESET tests. First, generate the predicted value of the
dependent variable (linear index) and its powers up to four. In practice, it
helps to normalize the linear index so the variables and parameters are
neither too large nor too small. Normalization does not affect the statistical
test results.
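A sketch of these steps and of the two tests that follow (the names xbeta through xbeta4 are illustrative):

. predict xbeta, xb
. summarize xbeta
. replace xbeta = (xbeta - r(mean))/r(sd)
. generate xbeta2 = xbeta^2
. generate xbeta3 = xbeta^3
. generate xbeta4 = xbeta^4
. * Pregibon's link test
. regress exp_tot xbeta xbeta2, vce(robust)
. test xbeta2
. * Ramsey's RESET test
. regress exp_tot xbeta xbeta2 xbeta3 xbeta4, vce(robust)
. test xbeta2 xbeta3 xbeta4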
Ramsey’s RESET test is a statistical test of whether the coefficients on
the three higher-order polynomials of the predicted value of the dependent
variable are jointly statistically different from zero in a regression of the
dependent variable on the fourth-order polynomial. The $F$ statistic is 16.35 (with 3 and 15,941 degrees of freedom). Therefore, we conclude that the model is misspecified.
There are Stata commands for both the link and RESET tests. The command for the link test is linktest, and the command for the RESET test is the postestimation command estat ovtest. However, although linktest will adjust for heteroskedasticity, estat ovtest will not.
Therefore, we do not recommend using estat ovtest. That is why we
first showed how to calculate the test statistics without using the Stata
commands. The RESET test should always be done in the manner we
describe to control for possible heteroskedasticity.
use the vce(robust) option (and cluster if necessary) to correct for any
heteroskedasticity.
Figure 4.13: Graphical representation of the modified Hosmer–Lemeshow test
This modest elaboration of the specification shows improvement in the specification tests. The link test is no longer statistically significant. The RESET test is still statistically significant, but its $F$ statistic is now much lower at 3.93. Finally, the graphical modified Hosmer–Lemeshow test no longer exhibits a strong U-shape. Instead, only 1 of the 10 deciles (the 5th) is significantly different from 0.
Figure 4.14: Graphical representation of the modified Hosmer–Lemeshow test after adding interaction terms
4.6.5 Model selection using AIC and BIC

We label the linear specification linear, the quadratic ones quadratic, and the cubic ones cubic. The suffix indicates which variables are in the model.
We report the values of the AIC and BIC from each of the models in the
output shown below for the sample size of 1,000. The results show that
both AIC and BIC are the smallest for the specification that is quadratic in
both x and z, which is the correct specification. The linear specification
has the highest values of AIC and BIC. The cubic specifications and the two
misspecified quadratic specifications also have higher AIC and BIC than the
correct model. Thus AIC and BIC are able to discriminate against both
underspecified and overspecified models.
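A sketch of how such a comparison is assembled, assuming each candidate model has been fit and stored (shown here for three of the candidates; the stored names are illustrative):

. quietly regress y c.x c.z
. estimates store linear
. quietly regress y c.x##c.x c.z##c.z
. estimates store quadratic_xz
. quietly regress y c.x##c.x##c.x c.z##c.z##c.z
. estimates store cubic_xz
. estimates stats linear quadratic_xz cubic_xz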
sample is drawn—AIC and BIC provide even sharper contrasts between the
correct specification and each of the incorrect ones.
Next, we illustrate the use of AIC and BIC for the more realistic case of
analyzing real data where the true model specification is unknown. The
second set of examples compares three different model specifications for
the MEPS data in section 4.4.2. Specifically, the dependent variable is total expenditures for the subsample with positive expenditures (exp_tot > 0).
In the first MEPS data example, the covariates used in all three
specifications are the continuous age and the binary female. They are
entered additively in the first specification, an additional interaction
between age and female is included in the second specification, and an
additional squared age and interaction of squared age with female are
included in the third specification. We label the first specification as linear, the second as interact, and the richest specification with squared age and interactions as quad_int in the output below that compares the AIC and
BIC from the three models. The results show that both AIC and BIC are
smallest for the specification with quadratic age and interaction terms.
together, because the conclusions have always been the same. In this
example, they are not. We add the binary anylim to the list of covariates,
in addition to the continuous age and binary female. This variable enters
additively in each of the specifications considered for the first example.
AIC and BIC are calculated for each of the models and are shown in the
output below. AIC is smallest for the richest specification with quadratic
terms and interactions. However, BIC is smallest for the simplest additive
specification.
Three additional points are worth raising. First, for nested models, we could have done other statistical tests—such as $F$ tests. But AIC and BIC can be compared even for nonnested models, when tests such as $F$ tests and Wald tests are not possible.
the issue of multiple testing always looms large when a researcher is
searching for the best model specification. AIC and BIC do not suffer from
that issue. Any number of candidate models can be compared. Finally, in
general, we cannot compare models with different dependent variables
using AIC and BIC. In chapter 6, we demonstrate how to use AIC and BIC to
compare a linear specification with one that is linear in the logged
outcome.
4.7 Stata resources
The Stata command for the link test is linktest. The Stata command for the RESET test is estat ovtest, but this is not recommended because it does not control for heteroskedasticity. For testing hypotheses of specific variables or coefficients, use test and testparm.
To compute the AIC and BIC, use estimates stats after running the
model. Cattaneo, Drukker, and Holland (2013) created the bfit command
to find the model that minimizes either the AIC or the BIC from a broad set
of possible models for regress, logit, and poisson.
Chapter 5
Generalized linear models
5.1 Introduction
As we move into the main four chapters of this book (chapters 5–8), our
focus shifts toward choosing between alternative models for continuous
positive outcomes, such as healthcare expenditures. One of the main
themes of this book is how best to model outcomes that are not only
continuous and positive but also highly skewed. Skewness creates two
problems for ordinary least-squares (OLS) models: negative predictions and
large sample-to-sample variation . After briefly demonstrating this for the
2004 Medical Expenditure Panel Survey (MEPS) data, we then spend the
rest of this chapter exploring generalized linear models (GLM), which are
an alternative to OLS that handle skewed data more easily.
At a minimum, this finding is awkward, indicating that the linear-in-
parameters specification can predict outside the boundaries imposed by the
data-generating process. If that is true, it is likely that OLS estimates of a
linear regression model will yield inconsistent estimates of effects, as we
demonstrated in section 4.4.2.
Figure 5.1: Densities of total expenditures and their residuals
Health expenditure data , for those with any healthcare use, are
generally extremely skewed . In the United States, a small fraction of the
population accounts for a substantial fraction of total expenditures. Berk
and Monheit (2001) report that five percent of the population accounts for
the majority of health expenditures and that the severely right-skewed
concentration of healthcare expenditures has remained stable over decades.
how to compute marginal effects of continuous covariates and incremental
effects of discrete covariates. We show how to determine the most
appropriate GLM for a given dataset by choosing the appropriate link
function and distribution family .
5.2 GLM framework
popular link functions for continuous outcomes include the identity link $g(\mu) = \mu$, powers $g(\mu) = \mu^{\lambda}$, and the natural logarithm $g(\mu) = \ln(\mu)$. Common distribution families for continuous dependent variables imply variances that are integer powers of the mean function. The four most common distribution families are Gaussian, in which the variance is a constant (zero power); Poisson, in which the variance is proportional to the mean ($v(\mu) \propto \mu$); gamma, in which the variance is proportional to the square of the mean ($v(\mu) \propto \mu^2$); and inverse Gaussian, in which the variance is proportional to the mean cubed ($v(\mu) \propto \mu^3$). Table 5.1 lists commonly used links and distribution families, along with their implications for the expected value and variance of the outcome.
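In symbols, and in our notation, the two GLM choices can be summarized as

$$g\{E(y \mid \mathbf{x})\} = \mathbf{x}'\boldsymbol{\beta} \qquad \text{and} \qquad \operatorname{Var}(y \mid \mathbf{x}) = \phi\, v\{E(y \mid \mathbf{x})\}$$

where $g(\cdot)$ is the link function, $v(\cdot)$ is the variance function implied by the distribution family, and $\phi$ is a scale parameter.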
The GLM approach also allows distribution families that are noninteger
powers of the mean function, but such models are less common in the
literature. For more details, see Blough, Madden, and Hornbrook (1999)
and Basu and Rathouz (2005).
5.2.2 Parameter estimation
As alluded to above, GLM estimation requires two sets of choices. The first set of choices determines the link function and the distribution family. In section 5.8, we discuss how to make these choices based on rigorous statistical tests. After fitting a GLM, one can easily derive marginal and incremental effects of specific covariates on the expected value of $y$ (or other treatment effects).
5.3 GLM examples
Our first example is a model with a log link (option link(log)) and a
Gaussian family (option family(gaussian)). This is equivalent to a
nonlinear regression model with an exponential mean. The results show
that healthcare expenditures increase with age and are higher for women,
but the coefficient on female is not statistically significant at the 5% level.
Because the conditional mean has an exponential form, coefficients can be
interpreted directly as percent changes. Expenditures increase by about
2.6% with each additional year of age after adjusting for gender. Women
spend about 8% more than men after
controlling for age.
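A sketch of the glm command behind this first example (variable names follow the MEPS example dataset; output omitted):

. glm exp_tot age female, link(log) family(gaussian)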
The second example also specifies a log link but assumes that the
distribution family is gamma (option family(gamma)), implying that the
variance of expenditures is proportional to the square of the mean. This is
a leading choice in published models of healthcare expenditures, but we
will return to the choices of link and family more comprehensively in
section 5.8.
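The corresponding sketch, with the same assumed variable names:

* GLM with log link and gamma family: variance proportional to the squared mean
glm exp_tot age female, family(gamma) link(log)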
The results show that healthcare expenditures increase with age and are
higher for women. Both coefficients are statistically significant.
Expenditures increase by about 2.8% with each additional year of age,
which is quite close to the effect estimated by the model with the
Gaussian family. However, now we find that women spend about 23% more
than men, after controlling for age. This is almost three times as large
as the effect estimated in the model with the Gaussian family. A small
change in the model leads to a large change in interpretation.
Our primary intent was to use these examples to demonstrate the use of
the glm command and explain how to interpret coefficients. However,
these examples also show that the estimated effects in a sample can be
quite different across distribution family choices when the link function is
the same, even though the choice of family has no theoretical implications
for consistency of parameter estimates.
We could run many other GLM models, changing the link function or
the distributional family. For example, we could fit a GLM with a square
root link (option link(power 0.5)) and a Poisson family (option
family(poisson)). Or we could fit a GLM with a cube root link (option
link(power 0.333)) and an inverse Gaussian family (option
family(igaussian)).
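Sketches of these alternatives, under the same assumed variable names:

* square root link with Poisson family; scale(x2) for correct SEs with continuous y
glm exp_tot age female, family(poisson) link(power 0.5) scale(x2)
* cube root link with inverse Gaussian family
glm exp_tot age female, family(igaussian) link(power 0.333)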
5.4 GLM predictions
For all GLMs with a log link, the expected value of the dependent
variable, $y$, is the exponentiated linear index function:
$$E(y \mid x) = \exp(x\beta) \quad (5.1)$$
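After fitting such a model, predictions on the raw scale follow directly; a sketch:

glm exp_tot age female, family(gamma) link(log)
predict double muhat, mu   // fitted E(y|x) = exp(x*b)
margins                    // average predicted expenditure with a delta-method SE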
5.5 GLM example with interaction term
When including interaction terms, one must use special Stata notation,
so that margins knows the relationship between variables when it takes
derivatives. Therefore, we use c. as a prefix to indicate that age is a
continuous variable, i. to indicate that female is an indicator variable, and
## between them to include not only the main effects but also their
interaction.
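A sketch of this factor-variable specification:

glm exp_tot c.age##i.female, family(gamma) link(log)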
The results are harder to interpret directly, because the interaction term
allows the effect of age to depend on gender and the effect of gender to
depend on age. The coefficients on the main effects of age and gender are
similar in magnitude to those in the simpler model. The interaction term
is negative and statistically significant, implying that the increase in
expenditures with age is smaller for women than for men. However, to
predict by how much, use margins.
The overall predicted total expenditure is about $4,498 for this model,
which includes age, sex, and their interaction. This is even closer to the
sample mean of $4,480 than the model without the interaction term.
We can use the marginsplot command following the margins
command to visualize the results. In this example, because the model
specification is so simple (only two variables and their interaction), we can
easily plot out predicted values for all possible combinations of ages and
genders. The code for this is shown below:
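A sketch, assuming the interacted model above and ages spanning the sample range:

margins female, at(age=(20(10)80))
marginsplot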
Predicted total expenditures rise for both men and women with age
(see figure 5.2). Predicted total expenditures are higher for women among
young adults but rise faster for men. Convergence occurs around age 68.
Figure 5.2: Adjusted predictions of mean total expenditures by age and gender, with 95% confidence intervals
Next, we derive marginal and incremental effects from GLMs with log
links and show how to calculate and interpret these effects using Stata.
5.6 Marginal and incremental effects
Incremental effects are most commonly computed for a binary covariate,
such as gender, insurance status, or urban versus rural residence. They
can also be computed for a large discrete change in a continuous variable,
like age or income, which may have more policy relevance than a tiny
marginal effect. We show how to compute this kind of incremental effect
in section 5.7.
For links other than the log link (see table 5.1), the expectation and the
marginal and incremental effects are based on the inverse of the link
function, $g^{-1}(x\beta)$, and its partial derivative with respect to a
specific covariate.
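For a continuous covariate $x_j$, the chain rule gives the marginal effect in this general case:
$$\frac{\partial E(y \mid x)}{\partial x_j} = \frac{\partial g^{-1}(x\beta)}{\partial (x\beta)} \, \beta_j$$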
5.7 Example of marginal and incremental effects
However, the average marginal effects mask that the marginal effects
vary with age. We next graph the estimated marginal effects by age and
gender (see figure 5.3); this shows the slope of lines in figure 5.2. The
marginal effects of age are similar for men and women until about age 40,
then are higher for men at older ages.
Figure 5.3: Conditional marginal effects of age on mean total expenditures, by gender, with 95% confidence intervals
We can use the margins command with the dydx() option to compute
the marginal effects by age and gender, corresponding to figure 5.3. The
results confirm what is shown in the graph. Marginal effects for men and
women are similar at young ages but are much larger for men above age
60. The incremental effect of gender shows that women spend more on
average at younger ages, but that difference is
reversed in old age, with men spending considerably more per year on
average.
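A sketch of the corresponding margins calls after the interacted GLM:

* marginal effect of age by gender at selected ages
margins female, dydx(age) at(age=(20(10)80))
* incremental effect of female across ages
margins, dydx(female) at(age=(20(10)80))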
We next compute the incremental effect of a 10-year increase in age.
One way to do this is to use the contrast() option and to specify an
increase in age of 10 years with the at() option. The contrast is $1,463,
slightly more than 10 times the marginal effect of an increase in age of
1 year ($126). In this example, we also estimate unconditional standard
errors, that is, not conditioned on the sample values of the covariates. The
resulting standard errors are slightly larger than the sample-average
standard errors.
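One way to code this contrast, sketched under the same assumptions (estimating with vce(robust) first may be required for vce(unconditional)):

margins, at(age=generate(age)) at(age=generate(age+10)) ///
    contrast(atcontrast(r._at)) vce(unconditional)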
5.8 Choice of link function and distribution family
The main modeling choices for GLMs are the link function and the
distribution family. Although a number of published studies have used
GLMs with the log link and the gamma family for healthcare expenditures
and costs, we strongly recommend choosing based on performance of
alternative models in the specific context using information criteria or
statistical tests. In this section, we first show a simple way to use
information criteria to simultaneously choose the link function and the
distribution family (see chapter 2). Then, we show separate tests for the
link function and for the distribution family.
There are several ways to choose the link function and the distribution
family for the analysis of a GLM model with a continuous dependent
variable. We propose choosing them jointly using the Akaike information
criterion (AIC) (Akaike 1970) and the Bayesian information criterion (BIC)
(Schwarz 1978) as the statistical criteria for these nonnested models.
Information criteria-based choices have two advantages. First, they can be
applied regardless of whether complex adjustments for design effects in
the data have been applied or not (design effects are described in
chapter 11). Second, choices based on information criteria do not suffer
from issues of multiple hypothesis testing inherent in standard statistical
tests repeated for many possible choices of link and distribution family.
where $n$ is the sample size. Smaller values of BIC are also preferable.
To illustrate the use of AIC and BIC, we fit models with log and square
root (power 0.5) links using Gaussian, Poisson, and gamma distribution
families. We fit six different models with these links and families using
our MEPS dataset, store the results, and compare the AIC and BIC for each.
Note that we also used the scale(x2) option for the Poisson model.
This option is necessary for GLM models with a continuous dependent
variable to compute correct standard errors. It is the default for Gaussian
and gamma families but must be added for Poisson.
The AIC and BIC in the stored results are easily compared in a table. In
our MEPS data with the full covariate specification, the model with the
lowest AIC and BIC was the log link with the gamma family. Although we
expected this, and this choice of link function and family often wins for
expenditure data, it is always worth checking.
In this example, we did not fit models with the identity link. For
nonnegative expenditures, the identity link is conceptually flawed and
causes computational problems. The dependent variable of expenditures
can never be negative, yet a model with an identity link would allow this
possibility. In contrast, the log link (which exponentiates the linear index)
and the square root link (which squares the linear index) never estimate the
conditional mean of the dependent variable to be negative. When using the
identity link with many datasets (including the MEPS example), a rich set of
covariates will predict the conditional mean to be negative for some
observations. For these observations, and hence for the sample as a whole,
the log-likelihood function is undefined. In such cases, the maximum
likelihood estimation will have trouble finding a solution. For other types
of dependent variables, the identity link function may well be appropriate.
As a precaution, in our empirical example, we use the iter(40) option to
limit the number of iterations to 40 so that the estimation will not
iterate indefinitely. Typically, GLMs converge in fewer than 10
iterations. Consequently, if the model reaches 40 iterations, check
whether there is a problem with the model as specified.
Instead of choosing the link function and the distribution family
simultaneously, one can choose them sequentially using a series of
statistical tests. One approach uses a Box–Cox model (see section 6.5) to
find an appropriate functional form and then uses that form as the link
function. In brief, the Box–Cox approach tests which scalar power,
$\lambda$, of the dependent variable, $y$, results in the most symmetric
distribution. A power of $\lambda = 1$ corresponds to a linear model,
$\lambda = 0.5$ corresponds to the square root transformation, and
$\lambda = 0$ corresponds to the natural log transformation. This
approach is discussed at length in section 6.5, with examples that show
that the log link is preferred to the square root for the MEPS dataset and the
basic model.
Note that the boxcox command does not admit the factor-variable
syntax of modern Stata. Therefore, we use the xi: prefix to preprocess the
data to generate appropriate indicators. The estimated coefficient (/theta
in the output) is only slightly greater than zero. We take this to mean that
the log link function is preferable to the square root or other common link
functions.
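A sketch (lhsonly transforms only the dependent variable; xi: expands the indicator):

xi: boxcox exp_tot age i.female if exp_tot > 0, model(lhsonly)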
5.8.3 Modified Park test for the distribution family
To implement the modified Park test, we first run a GLM—which
means choosing an initial link function and distribution family prior to
running the empirical test. Our working assumption—based on results in
section 5.8 and in the literature—is that we should have a log link and
gamma family. Note that this test requires the link function to be correctly specified.
Postestimation, we generate the log of the squared residuals and the linear
index.
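A sketch of these steps under our assumed variable names:

quietly glm exp_tot age female, family(gamma) link(log)
predict double xb_park, xb                           // linear index = ln of fitted mean
predict double mu_park, mu
generate double lnres2 = ln((exp_tot - mu_park)^2)
regress lnres2 xb_park, vce(robust)                  // slope estimates the power of the mean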
If the estimated slope coefficient is close to one, use the Poisson,
because of its property that the variance is proportional to the mean.
5.8.4 Extended GLM
What if the appropriate link function is not one of the widely used choices
[identity, square root, or log]? What if the distribution family is not an
integer power of the mean function? Basu and Rathouz (2005) address
these questions with an approach known as the extended estimating
equations model. They simultaneously estimate the mean and distribution
family, rather than separately, and allow for general noninteger choices of
the power values.
5.9 Conclusions
5.10 Stata resources
To estimate GLMs in Stata, use the glm command, which works with
margins, svy, and bootstrap. Basu and Rathouz (2005) have Stata code
for their extended estimating equations model.
To compare the AIC and BIC statistics for GLMs with different
choices of link function and distribution family, use estimates stats *.
Alternatively, conduct a link test with boxcox and a modified Park test
with code found in this chapter.
Chapter 6
Log and Box–Cox models
6.1 Introduction
Despite the ease of fitting and interpreting generalized linear models (GLM)
(see chapter 5) and the ability of GLMs to deal with heteroskedasticity
while avoiding retransformation problems, a sizable fraction of the health
economics literature still fits regression models with a logged dependent
variable. In this chapter, we cover log models in detail to show their
weaknesses and to explain how a careful analysis would properly interpret
results.
6.2 Log models
$$\ln(y) = x\beta + \varepsilon \quad (6.1)$$
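A sketch of the estimation on the positive-expenditure subsample, with our assumed variable names:

generate double lnexp = ln(exp_tot) if exp_tot > 0
regress lnexp age female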
The OLS regression results show that the log of healthcare expenditures
increases with age and is higher for women. For a semilog model, it is easy
to interpret the coefficient on a continuous variable—like age—as a
percent change in the dependent variable. Expenditures increase by about
3.6% with each additional year of age among those who spent anything. In
this case, a coefficient of 0.0358 corresponds to about a 3.6% increase,
because the parameter is close to 0. A more precise value is found by
exponentiating the coefficient; this more precise mathematical formula
matters more for coefficients further from zero. The coefficient is
statistically significantly different from 0.
$$100 \times \left[\exp\left\{\hat\beta_d - \tfrac{1}{2}\widehat{V}(\hat\beta_d)\right\} - 1\right] \quad (6.2)$$
where $\widehat{V}(\hat\beta_d)$ is the OLS estimate of the variance of
the coefficient on the dummy variable of interest. This formula applies
only to positive coefficients. For a negative coefficient, redefine the
variable by taking one minus the variable.
6.3 Retransformation from ln(y) to raw scale
$$\exp(x\beta) \quad (6.3)$$
$$E(y \mid x) = \exp(x\beta) \times E\{\exp(\varepsilon)\} \quad (6.4)$$
The expected value of the exponentiated error term [the second term in
(6.4)] is greater than one by Jensen's inequality, implying that the
exponentiated linear index (6.3) is an underestimate of the expected value
of $y$.
normal distribution. In this case, the error retransformation factor is
$\exp(\hat\sigma^2/2)$, where $\hat\sigma^2$ is the estimated variance of
the error term on the log scale. The expected value of $y$, conditional on
$x$, is the exponentiated linear index multiplied by the error
retransformation factor:
$$E(y \mid x) = \exp(x\beta) \times \exp(\sigma^2/2) \quad (6.5)$$
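When normality of the log-scale error is doubtful, Duan's (1983) smearing estimator replaces $\exp(\hat\sigma^2/2)$ with the sample mean of the exponentiated OLS residuals; a sketch, continuing with the lnexp variable created above:

quietly regress lnexp age female
predict double xb_log, xb
predict double res_log, residuals
generate double expres = exp(res_log)
summarize expres, meanonly
scalar smear = r(mean)                         // Duan's smearing factor
generate double yhat_raw = exp(xb_log) * smear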
We use the bootstrap command to calculate standard errors for the
smearing factors and for the predicted means of exp_tot. We have used
200 bootstrap replications without experimentation in this example. We
urge readers to ensure that the estimates of interest are stable given the
choice of number of replications. We could also have used GMM to obtain
correct standard errors analytically. We provide an example of how to use
GMM in section 10.4.
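One hedged sketch of that bootstrap, wrapping the smearing calculation in an r-class program:

capture program drop smearfac
program define smearfac, rclass
    quietly regress lnexp age female
    tempvar res expres
    quietly predict double `res', residuals
    quietly generate double `expres' = exp(`res')
    quietly summarize `expres', meanonly
    return scalar smear = r(mean)
end
bootstrap smear=r(smear), reps(200) seed(1234): smearfac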
overall mean (unlike the GLM model with a log link, as in chapter 5).
(6.6)
6.4 Comparison of log models to GLM
There is often confusion between GLM with a log link function (see
chapter 5) and OLS regression with a log-transformed dependent variable
(as described in this chapter).
A GLM with a log link models the log of the expected value of $y$,
conditional on $x$, as a linear index of covariates $x$ and parameters
$\beta$:
$$\ln\{E(y \mid x)\} = x\beta \quad (6.7)$$
or equivalently
$$E(y \mid x) = \exp(x\beta) \quad (6.8)$$
In contrast, the log-transformation model specifies a linear model for the
logarithm of $y$, with parameters $\gamma$ and a mean-zero error
$\varepsilon$:
$$\ln(y) = x\gamma + \varepsilon \quad (6.9)$$
Taking the expected value of both sides of (6.9) eliminates the mean-zero
error term, but the resulting equation is in terms of the expected value
of the logarithm of $y$, not the expected value of $y$:
$$E\{\ln(y) \mid x\} = x\gamma$$
Equation (6.9) differs from (6.7) on the left-hand side, because the order of
operations is different—and it differs on the right-hand side, because the
parameter values are different.
In general, the parameters from these two models will not be equal
(that is, $\beta \neq \gamma$). In particular, the constant terms will be quite
different—with $\beta_0 \neq \gamma_0$—because in the log transformation model,
$E\{\exp(\varepsilon)\} \neq 1$.
6.5 Box–Cox models
Box–Cox model under homoskedasticity. Duan’s (1983) smearing for the
lognormal model is a special case of Abrevaya’s method.
6.6 Stata resources
Chapter 7
Models for continuous outcomes with mass at zero
7.1 Introduction
There are several ways to model such data, a number of which are
discussed in Cameron and Trivedi (2005) and in Wooldridge (2010). In
this chapter, we discuss two approaches in detail. Both approaches model
158
the outcome using two indices; in each model, one index focuses on the
process by which the zeros are generated. At the end of the chapter, we
provide brief descriptions of single-index models that have been used in
the literature but that we would not recommend.
7.2 Two-part models
$$E(y \mid x) = \Pr(y > 0 \mid x) \times E(y \mid y > 0, x) \quad (7.1)$$
The two-part model has a long history in empirical analysis. Since the
1970s, meteorologists have used versions of a two-part model for rainfall
(Cole and Sherriff 1972; Todorovic and Woolhiser 1975; Katz 1977).
Economists also used two-part models in the 1970s. Cragg (1971)
developed the hurdle (two-part) model as an extension of the tobit model.
Newhouse and Phelps (1976) published an article that is the first known
example of the two-part model in health economics. Their empirical model
fits price and income elasticities of medical care. The two-part model
became widely used in health economics and health services research after
a team at the RAND Corporation used it to model healthcare expenditures in
the context of the Health Insurance Experiment (Duan et al. 1984). See
Mihaylova et al. (2011) for more on the widespread use of the two-part
model for healthcare expenditure data. Two-part models are also
appropriate for other mixed discrete-continuous outcomes, such as
household-level consumption.
There are many specific modeling choices for the first- and second-part
models. The choices depend on the data studied, the distribution of the
outcome, and other statistical issues. The most common choices are
displayed in table 7.1. In the first-part model, $\Pr(y > 0 \mid x)$ is
typically specified as a logit or probit equation. In the second-part
model, there are many suitable models for $E(y \mid y > 0, x)$. Common
choices are a linear model, a log-linear model (see chapter 6), or a
generalized linear model (GLM) (see chapter 5).
7.2.1 Expected values and marginal and incremental effects
where $\phi(\cdot)$ and $\Phi(\cdot)$ denote, respectively, the
probability density function (p.d.f.) and the cumulative distribution
function (CDF) of the standard normal distribution, $\gamma$ is the vector
of parameters for the first-part probit model, $\beta$ is the vector of
parameters for the second-part model, and $\sigma$ is the scale (standard
deviation) of the normal distribution in the second part. This model
specification is even more restrictive than the usual linear second-part
model, which—if estimated by least squares—would not require normality.
We use this restricted specification to aid comparison with the
generalized tobit described in section 7.3.
If the first part is a probit and the second part is a GLM with a log
link, then the formula requires exponentiating the linear index function:
$$E(y \mid x) = \Phi(x\gamma) \times \exp(x\beta)$$
If instead the first part is a logit, then the first term on the right-hand
side, $\Phi(x\gamma)$, is replaced by the logit CDF,
$\Lambda(x\gamma) = \exp(x\gamma)/\{1 + \exp(x\gamma)\}$. For example, the
two-part model with a logit and a GLM with a log link has an expected value of
$$E(y \mid x) = \Lambda(x\gamma) \times \exp(x\beta)$$
More work is necessary when the second part is ordinary least squares
(OLS), with $\ln(y)$ as the dependent variable (see chapter 6). For example, if
the first part is a probit and the second part is a log transformation
with homoskedastic normal errors, then
$$E(y \mid x) = \Phi(x\gamma) \times \exp(x\beta) \times \exp(\sigma^2/2)$$
where $\Phi(\cdot)$ is the normal CDF and $\sigma^2$ is the variance of the
normal error. If the error is not assumed normal, then the term
$\exp(\sigma^2/2)$ can be replaced by Duan's (1983) smearing factor, which
we denote by $\hat\varphi$:
$$E(y \mid x) = \Phi(x\gamma) \times \exp(x\beta) \times \hat\varphi$$
Other models require other formulas, but the expected value can
always be calculated using the conditioning in (7.1).
7.3 Generalized tobit
The other approach is the generalized tobit (or Heckman) selection model,
which begins with structural or behavioral equations that jointly model two
latent outcomes. Each latent variable has an observed counterpart.
Although we have formulated the model so that the outcome variable
includes zeros and positives, following Maddala (1985), we note that the
model was initially formulated as a combination of missing values and
positives (Heckman 1979).
$$y_1^* = x\beta + \varepsilon_1 \quad (7.2)$$
$$y_2^* = w\gamma + \varepsilon_2 \quad (7.3)$$
$$y = \begin{cases} y_1^* & \text{if } y_2^* > 0 \\ 0 & \text{otherwise} \end{cases} \quad (7.4)$$
where $w$ includes $x$ (and possibly variables not included in $x$) and
$\beta$ and $\gamma$ are vectors of parameters to estimate. If the joint
distribution of $\varepsilon_1$ and $\varepsilon_2$ is bivariate normal
with a correlation parameter, $\rho$,
$$\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \sim
N\left\{\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}\right\} \quad (7.5)$$
then the expected value of the observed outcome, conditional on it being
positive, is
$$E(y \mid y > 0, x, w) = x\beta + \rho\sigma\,\lambda(w\gamma) \quad (7.6)$$
where $\lambda(\cdot)$ is the inverse Mills ratio defined below.
There are two standard ways to fit the selection model. The full selection
model can be fit by full-information maximum likelihood (FIML). The
likelihood function has one term for the probability that the main
dependent variable is not observed, one term for the probability that it is
observed (this term accounts for the error correlation), and one term for the
positive conditional outcome assuming a normal error. If $s$ is an indicator
variable for whether $y$ is observed, then the likelihood function is
$$L = \prod_{i=1}^{n} \left\{1 - \Phi(w_i\gamma)\right\}^{1 - s_i}
\left[\Phi\left\{\frac{w_i\gamma + (\rho/\sigma)(y_i - x_i\beta)}{\sqrt{1 - \rho^2}}\right\}
\frac{1}{\sigma}\,\phi\left(\frac{y_i - x_i\beta}{\sigma}\right)\right]^{s_i}$$
Heckman (1979) proposed a computationally simpler limited-information
maximum likelihood (LIML) estimator. Using LIML, you can fit the model in
two steps—not to be confused with having two parts. The two steps of the
LIML model can be fit sequentially. First, fit a probit model on the full
sample of whether the outcome $y$ is observed. Second, calculate the
inverse Mills ratio, $\lambda(w\hat\gamma) = \phi(w\hat\gamma)/\Phi(w\hat\gamma)$,
which is the ratio of the normal p.d.f. to the normal CDF. Finally, add the
estimated inverse Mills ratio, $\hat\lambda$, as a covariate to the main
equation, and run OLS. The main equation is now
$$y = x\beta + \beta_\lambda \hat\lambda + e$$
If $\beta_\lambda = 0$, then the inverse Mills ratio drops out of the main
equation, and the formula simplifies to a model without selection. There
are several different definitions of the inverse Mills ratio, leading to
different formulas that are close enough to be confusing. See the Stata FAQ
for more discussion of why seemingly different formulas are actually
equivalent.
Given that both the LIML and FIML estimators are consistent (under the
usual assumptions), the choice between them falls to other considerations.
Although both versions estimate $\rho$, FIML does it directly, while LIML
estimates the combined parameter $\beta_\lambda = \rho\sigma$—and $\rho$
can be deduced given an estimate of $\sigma$. FIML sometimes fails to
converge (especially if identification is only through nonlinear functional
form), whereas LIML will always estimate its parameters. In Stata, LIML has
a more limited set of postestimation commands, making it harder to compare
with other models.
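Both estimators are available through the heckman command; a sketch with hypothetical variables y, x1, x2, selection indicator s, and excluded instrument z:

heckman y x1 x2, select(s = x1 x2 z)           // FIML (the default)
heckman y x1 x2, select(s = x1 x2 z) twostep   // LIML two-step estimator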
7.4 Comparison of two-part and generalized tobit models
The two-part and generalized tobit models look similar in many ways, but
they have important differences, strengths, and weaknesses (Leung and
Yu 1996; Manning, Duan, and Rogers 1987). It is therefore important to
explain the fundamental differences between these models. There is a
long-running debate in the health economics literature about the merits of
the two-part model compared with the selection model (see Jones [2000]
in the Handbook of Health Economics for a summary of the “cake
debates”). The name “cake debates” comes from the title of one of the
original articles comparing these models (Hay and Olsen 1984). Without
delving into culinary metaphors or arbitrating the past debate directly, we
make several points that focus on the salient statistical features that
distinguish these two models.
First, the generalized tobit and two-part models are generally not
nested models when each is specified parametrically. The many distinct
versions of the two-part model make different assumptions about the first
and second parts of the model. Most versions of the two-part model are not
nested within the generalized tobit model.
Second, the generalized tobit is more general than one specific version
of the two-part model. The generalized tobit, (7.2)–(7.5), with $\rho = 0$
and $w = x$, is formally equivalent to a two-part model with a probit first
part and a normally distributed second part. The generalized tobit with
$\rho \neq 0$ is formally a generalization of this specific and restrictive
version of the two-part model but is not a generalization of any other
version of the two-part model.
Third, even for this case where the generalized tobit model is more
general than the two-part model (a probit first part and a normally
distributed second part), simulation evidence shows that the two-part
model delivers virtually identical average marginal effects, the goal of our
econometric investigation. More generally, Drukker (2017) formally
demonstrates the equivalence of the implied expected outcomes,
$E(y \mid x)$, even if there is dependence in the data-generating process.
Nevertheless, this point is important enough that we will illustrate it
with two examples—one with identification solely through functional form
and one with an identifying excluded variable—in section 7.4.1.
Fourth, the two-part model can be motivated as a mixture density, which
is at least as natural a candidate data-generating process as that implied
by the generalized tobit. Thus there is no compelling reason to view the
two-part model as a special case of the generalized tobit; it can be
motivated with a perfectly natural data-generating process that will not be
nested within any generalized tobit model. For more on mixture densities,
see chapter 9.
Fifth, the two-part model has an important practical advantage over the
generalized tobit model. In the two-part model, it is trivially easy to
change the specifications of both the first and second parts to allow for
various error distributions and nonlinear functional forms (for example,
logit or complementary log-log first parts and, more importantly, GLM or
Box–Cox second parts). The different second-part models, discussed at
length in chapters 5 and 6, are often important for dealing with statistical
issues like skewness and heteroskedasticity on the positive values. Such
changes require complex modifications in the generalized tobit, often
leading to models that are not straightforward to estimate. Thus they are
rarely implemented in practice.
that mean, the two-part model has greater practical appeal.
The fact that the two-part model returns predictions and marginal effects
that are virtually identical to those of a generalized tobit model—even
when the data-generating process is for a generalized tobit—is so
important and misunderstood that we present two illustrative examples.
Drukker (2017) formally demonstrates this. In the first example, the data
are generated using a generalized tobit data-generating process with jointly
normal errors. There is no exclusion restriction (that is, $w = x$), as is
typical in health economics applications. Without loss of generality, the
variance of the error term for the latent outcome is set equal to one.
Although parameter estimates of the second part of the two-part model
do not correspond to those of the generalized tobit data-generating process,
the marginal effect of x1 on y from the two-part model is virtually
identical to that obtained from the generalized tobit model, 28.9:
The second example has an identifying instrumental variable that can
be excluded from the main equation. The data-generating process allows
for a substantial effect of an additional variable, z, in the selection
equation that does not enter the latent outcome equation. When the
selection equation in the generalized tobit (7.2)–(7.5) data-generating
process includes an excluded instrument—even if $\rho = 0$—the typical
implementation of the two-part model would be overspecified, because it
would include the same set of variables in both the first and second parts.
Nevertheless, the simulation evidence shown in this example again
highlights the flexibility of the two-part model specification.
Again, the estimated coefficients in the two models are similar in the
first equation but different in the second equation (0.821 versus 0.959).
Despite the differences in estimated coefficients, the marginal effect of
x2 on y from the two-part model is again virtually identical to that
obtained from the generalized tobit model, 24.12. We care primarily about
the estimates of the marginal effects on the expected outcomes, not the
parameter estimates themselves.
To summarize the third point, these simulations demonstrate that, despite
the apparent differences in model assumptions, the two-part model and the
generalized tobit model usually produce similar results when comparing
marginal effects on actual outcomes, which are usually the goal of
econometric modeling in health economics. We now return to the two-part
model for interpretation and marginal effects.
7.5 Interpretation and marginal effects
In both parts, the estimated coefficients for age and female are positive
and statistically significant at the one-percent level, while the interaction
term is negative and statistically significant. Both the probability of
spending and the amount of spending conditional on any spending increase
with age but at a slower rate for women. Women are more likely to spend
at least $1 more than men, and, conditional on spending any amount, they
spend more, at least at younger ages. The results for the second part of the
model are the same as in the first simple GLM example in section 5.3.
After we fit both parts of the two-part model with twopm, the
postestimation margins command calculates predictions based on both
parts. The predicted total spending is about $3,696 per person per year,
which is within a few dollars of the actual average ($3,685).
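A sketch of the full sequence with twopm, under our usual assumed variable names:

twopm exp_tot c.age##i.female, firstpart(probit) secondpart(glm, family(gamma) link(log))
margins        // overall predicted spending, combining both parts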
[Figure: predicted total expenditures by age and gender from the two-part model]
details would depend on which specific version of the two-part model is
fit. The main formula is the product rule applied to (7.1):
$$\frac{\partial E(y \mid x)}{\partial x_j} =
\frac{\partial \Pr(y > 0 \mid x)}{\partial x_j}\, E(y \mid y > 0, x)
+ \Pr(y > 0 \mid x)\, \frac{\partial E(y \mid y > 0, x)}{\partial x_j} \quad (7.7)$$
For the case of a probit first-part model and a GLM second-part model
(and no interactions or higher-order terms in $x$), this is fairly
straightforward to compute:
$$\frac{\partial E(y \mid x)}{\partial x_j} =
\gamma_j\, \phi(x\gamma) \exp(x\beta) + \Phi(x\gamma)\, \beta_j \exp(x\beta)$$
For a probit first part with a log-OLS second part, each term is
additionally multiplied by $\hat\varphi$,
where $\hat\varphi$ is Duan's smearing factor.
Continuing the example from section 7.5.1, we now show the marginal (or
incremental) effects of age and gender for the full two-part model,
accounting for the effects of these variables on both parts. After we use the
twopm command, the margins command automatically computes the
unconditional marginal effects, accounting for both parts of the model. The
marginal effect of age averages $123 per year of age, and women spend
more than men by about $798.
Because the graphs showed that the marginal effects vary over the life
course, we computed marginal effects, conditional on four ages (20, 40,
60, and 80). The marginal effect of age for men grows from $40 at age 20
to $383 by age 80; the marginal effect of age for women grows from $56
at age 20 to $231 by age 80. The incremental effect of gender declines,
with women spending on average more than $1,000 more than men at
age 20, but by age 80, the roles have reversed, and men outspend women
by more than $1,000.
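A sketch of these margins calls after twopm:

margins, dydx(age female)                        // sample-average effects
margins female, dydx(age) at(age=(20 40 60 80))  // age effect by gender at selected ages
margins, dydx(female) at(age=(20 40 60 80))      // gender gap at selected ages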
7.5.4 Generalized tobit interpretation
There are three standard ways to interpret the results from the generalized
tobit model. The first way focuses on what happens to the expected latent
outcome (denoted here $y^*$). Latent outcomes assume that the dependent
variable is missing (not zero) for part of the sample and that the selection
equation adequately adjusts for the nonrandom selection. The expected
value of the latent outcome is $E(y^* \mid x) = x\beta$, using the same
notation as in section 7.3. The first interpretation is easy to read from
the regression output table but not relevant for answering research
questions in health economics, where we typically care about predictions
of actual expenditures.
The other two interpretations for the generalized tobit are more
challenging to calculate. The second interpretation focuses on the
characteristics of the actual outcome and is therefore comparable with the
results from a two-part model (Duan et al. 1984; Poirier and
Ruud 1981; Dow and Norton 2003). In this case,
If instead the main outcome is estimated as $\ln(y)$ and the error term is
normal and homoskedastic, then
$$E(y \mid y > 0, x, w) = \exp(x\beta + \sigma^2/2) \times
\frac{\Phi(w\gamma + \rho\sigma)}{\Phi(w\gamma)}$$
The results for the three models of dental expenditures have nearly
identical coefficients in the first equation (probit), which is not surprising.
The coefficients in the second equation are different, especially those from
the two-part model, because there we have modeled the conditional mean
to be an exponential function of the linear index. The coefficients obtained
by FIML and LIML are also quite different from each other, partly because
the Mills ratio is both large and imprecisely estimated. However, as the
results below show, the marginal effects are similar across all three
models.
We restore the two-part model results to use margins and compute overall
average dental expenditures according to the two-part model results.
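A sketch, assuming the two-part results were stored under the hypothetical name twopart:

estimates restore twopart
margins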
Women spend more than men on average over all ages by almost $32.
Dental expenditures increase on average by about $2.87 per year.
The marginal effect of age is higher for men than for women at all
ages.
For comparison with the two-part model, we must use the formulas for
actual expenditures with the FIML results. It is important to use the
predict(yexpected) option to calculate predictions for actual
expenditures, not latent expenditures—otherwise the results are not
directly comparable. Again, predicted actual expenditures for the two-part
model and generalized tobit are quite close, certainly well within
confidence intervals, even with vastly different estimated coefficients.
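A sketch, assuming the FIML results were stored under the hypothetical name heckfiml:

estimates restore heckfiml
margins, predict(yexpected)                    // actual, not latent, expenditures
margins, dydx(age female) predict(yexpected)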
The FIML-estimated marginal effects are also quite similar to those for
the two-part model.
In sharp contrast to the actual outcomes, the results from FIML can also
be used to compute latent outcomes, which is the default Stata option.
Because about 63% of the sample has zero dental expenditures, if instead
everyone spent an average amount, then the total would of course more than
double. That is exactly what is shown.
In summary, if you want to estimate actual outcomes and marginal
effects on actual outcomes (as opposed to latent outcomes ), the FIML
selection model will typically yield similar results to the two-part model.
However, in practice, researchers fit two-part models because the results
are easier to manipulate, both for the total effect and for the extensive and
intensive margins.
7.6 Single-index models that accommodate zeros
In this section, we briefly describe some single-index models that allow for
a mass of zeros in the distribution of the outcome but not in particularly
flexible ways. We describe these models because they have been used in
the literature, but we cannot recommend their use in research.
The tobit model, named after economist James Tobin, is like a mermaid or
centaur; it is half one thing and half another. Tobin (1958) was the first to
model dependent variables with a large fraction of zeros. Specifically, the
tobit model combines the probit model with OLS, both in the way the
model is fit and in how it is interpreted. For a recent summary of the tobit
model, see Enami and Mullahy (2009).
The classic tobit model is appropriate when the dependent variable has
two properties: it has a mass of observations at zero, and the zeros
represent censored values of an underlying continuous (normal) variable.
Specifically, the classic tobit model assumes that the latent variable,
$y^*$, can be negative—but that when $y^*$ is negative, the observed
value, $y$, is zero.
The values equal to zero are censored, because they are recoded from
a true negative value to zero. (If instead those observations were left out of
the sample, they would be truncated, which is a selection problem.)
The tobit likelihood function has two pieces: the probability that
observed $y$ equals zero and the probability that $y$ equals some positive
value. If the error $\varepsilon$ has a normal distribution with variance
$\sigma^2$, and if $d$ is an indicator variable for whether $y$ is positive,
then the likelihood function is written as part normal CDF and part normal
p.d.f.:
$$L = \prod_{i=1}^{n} \left\{1 - \Phi\left(\frac{x_i\beta}{\sigma}\right)\right\}^{1 - d_i}
\left\{\frac{1}{\sigma}\,\phi\left(\frac{y_i - x_i\beta}{\sigma}\right)\right\}^{d_i}$$
The number of parameters is the same as for OLS and one more than for
probit. There is one set of $\beta$'s (including the constant) and one
$\sigma$. There is no estimated parameter for the censoring point (zero in
this case), because this threshold is not estimated; it is determined by
the data.
How does the tobit model differ from the probit model? The tobit
model fits one more parameter than probit. The tobit model has a
continuous part and a discrete part. The interpretation of the constant term
is quite different—for tobit, it has the interpretation of an OLS intercept,
and for probit, it has the interpretation of the probability of outcome A for
the base-case observation.
The tobit model should be used with great caution, if at all. The
assumptions underlying the model are numerous and rarely true. The tobit
model should never be fit unless the data are truly normal and censored.
Here are the top four reasons to avoid the tobit model:
1. The tobit model assumes that the data are censored at zero, instead of
actually being zero. Too often, researchers with health expenditure
data claim that a large mass at zero are censored observations when
they are not censored.
2. The tobit model assumes that the error term has a normal distribution
but is inconsistent even if there are minor departures from normality
and homoskedasticity (Hurd 1979; Goldberger 1981).
3. The tobit model assumes that the error term when $y$ is positive is
truncated normal, with the truncation point at zero. This is rarely true.
4. The tobit model assumes that the same parameters govern both parts
of the likelihood function. There are specification tests that test the
tobit model against the more general Cragg (1971) model that allows
different parameters in the two parts of the model. This test almost
always rejects the null hypothesis that the parameters in both parts are
equal.
In summary, the classic tobit model only applies in the rare cases
where zero values are truly censored. Right-censoring is more common in
real data, and tobit models may work well in those cases.
The tobit model has been used only a few times in the health
economics literature. Holmes and Deb (1998) use a tobit model for data on
health expenditures that are right-censored. The dependent variable they
study is health expenditures for an episode of care. Because they have
claims data for a calendar year, some episodes of care are artificially
censored at the end of December. Cook and Moore (1993) use a tobit to
estimate drinks per week. However, there is no evidence that abstainers are
appropriately modeled as censored.
Although two-part models are popular, they are not the only estimation
approach for addressing a large mass at zero. Mullahy (1998) suggested
that researchers not use two-part models—especially those that use the log
transformation in the conditional part—if they are interested in the
expected value of $y$ given observed covariates, $E(y \mid x)$. Using
nonlinear least squares, or some variant of the GLM family, researchers can
apply a single model to all the data to fit the expected value of $y$. Any
of the links and families described in chapter 5 could be used for a
one-part model as an alternative to a two-part model, as long as
researchers are interested in the mean function for $y$, conditional on
the covariates $x$—or something that can be derived from the mean
function, such as the marginal or incremental effect.
Some analysts have worried that some of the distributions used in the
GLM approach do not have zeros in their support. This is a problem if the
models are fit by maximum likelihood estimation (MLE). However, the GLM
approach only uses mean and variance functions. For example, for the
inverse Gaussian (Wald), you cannot use MLE with the zeros, but you can
use GLM with zeros.
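For example, a Poisson-family GLM runs fine on the full sample, zeros included; a sketch with our assumed variable names:

glm exp_tot age female, family(poisson) link(log) scale(x2)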
Buntin and Zaslavsky (2004) suggest that the choice and the specifics
for each approach depend on the application. They provide an approach to
finding a better-fitting model using a set of diagnostics from both the
literature on risk adjustment and on model selection from the healthcare
expenditure literature. The choice of approach appears to depend on the
fraction of zeros in the data.
7.7 Statistical tests
All the usual statistical tests for single-equation models apply to the two-
part model. In addition, the modified Hosmer–Lemeshow test applies to
the entire two-part model. This may help identify problems with the model
specification in the combined model. We can apply Pregibon’s link test
and Ramsey’s regression equation specification error test equation by
equation in these models. For the two-part model, there are no
encompassing link or regression equation specification error tests, because
those tests are for single-equation models. They can be extended to
selection models and generalized tobit models, because they are a system
of equations that can be estimated in a single MLE formulation. Pearson
tests and Copas’ tests can apply to all of these models.
7.8 Stata resources
The recently developed twopm command will not only estimate many
different versions of the two-part model—allowing several options for
choice of specific model—but also compute predictions and full marginal
effects, accounting for retransformations, nonnormality, and
heteroskedasticity (Belotti et al. 2015). Install this package by typing ssc
install twopm. Alternatively, you can fit two-part models in two
separate commands. For example, estimate the first part with either logit
or probit. Commonly used commands for the second part include
regress, boxcox, and glm—always estimated on the subsample of the
data with positive values.
Chapter 8
Count models
8.1 Introduction
Figure 8.1: Poisson density with mean 0.5
Leaving aside the objective of estimating event probabilities and
distributions for a moment, it is important to recognize that not all count
data densities are skewed, nor is the mass concentrated on a few values in
all cases—although such cases will be rare in the healthcare context. In
such cases, it may well be appropriate to use methods designed for
continuous outcomes. Consider the density of a random variable drawn
from a Poisson distribution with a mean of five. The distribution of
observations shown in figure 8.2 is relatively symmetric, so simpler
models may be acceptable. Indeed, King (1988) notes that when $\mu$ is large
for all or nearly all observations, “it would be possible to analyze this sort
of data by linear least-squares techniques”.
Figure 8.2: Poisson density with mean five
[Figure: frequencies of office-based provider visits and ER visits]
We begin our discussion of regression models for count data with the
Poisson regression model (section 8.2). It is the canonical regression
model for count data and should be the starting point of any analysis. We
discuss estimation, interpretation of coefficients, and partial effects in
some detail. The Poisson distribution is a member of the linear exponential
family (LEF). Therefore, like the linear regression and GLMs, the Poisson
regression has a powerful robustness property: its parameters are
consistently estimated as long as the conditional mean is specified
correctly, even if the true data-generating process is not Poisson. However,
this robustness comes at an efficiency cost (Cameron and Trivedi 2013).
Count outcomes in health and healthcare, although overdispersed, do
not necessarily conform to the properties of the negative binomial model.
They often have even more zeros than predicted by negative binomial
models. Therefore, in subsequent sections, we discuss hurdle and zero-
inflated models that allow for excess zeros. We also briefly describe
models for truncated and censored counts in section 8.5. We end this
chapter with section 8.6, which describes approaches for model selection
and demonstrates them via extensive examples.
8.2 Poisson regression
The Poisson density is the starting point for count-data analysis. The basic
principles of estimation, interpretation, and prediction flow through
naturally to more complex models.
$$\Pr(y_i = k \mid x_i) = \frac{\exp(-\mu_i)\,\mu_i^{k}}{k!}, \quad k = 0, 1, 2, \ldots \quad (8.1)$$
$$\mu_i \equiv E(y_i \mid x_i) = \exp(x_i\beta) \quad (8.2)$$
mean specification has the major mathematical convenience of naturally
bounding $E(y \mid x)$ to be positive. Because the variance of a Poisson
random variable equals its mean, $\text{Var}(y_i \mid x_i) = \mu_i$, the
Poisson regression is intrinsically heteroskedastic.
The Poisson regression model is typically fit using MLE. Given (8.1)
and (8.2) and the assumption that the observations are independent
over $i$, the log-likelihood function for a sample of $n$ observations can
be written as
$$\ln L(\beta) = \sum_{i=1}^{n} \left\{y_i x_i\beta - \exp(x_i\beta) - \ln(y_i!)\right\}$$
and the first-order conditions for a maximum are
$$\sum_{i=1}^{n} \left\{y_i - \exp(x_i\beta)\right\} x_i = 0 \quad (8.3)$$
The log-likelihood function is globally concave; hence, solving these
equations by Gauss–Newton or Newton–Raphson iterative algorithms
yields unique parameter estimates. By maximum likelihood theory, the
estimated parameters are consistent and asymptotically normal with
covariance matrix
$$V(\hat\beta) = \left\{\sum_{i=1}^{n} \mu_i x_i' x_i\right\}^{-1} \quad (8.4)$$
The robust (sandwich) covariance matrix
$$V_R(\hat\beta) = \left\{\sum_{i=1}^{n} \mu_i x_i' x_i\right\}^{-1}
\left\{\sum_{i=1}^{n} (y_i - \mu_i)^2 x_i' x_i\right\}
\left\{\sum_{i=1}^{n} \mu_i x_i' x_i\right\}^{-1} \quad (8.5)$$
is consistent under
more general conditions than the maximum likelihood-based formula
(8.4).
8.2.3 Interpretation
Although the coefficients themselves have an intuitive interpretation, it
is often desirable to calculate partial effects of covariates on the expected
outcome, as opposed to the effects on the logarithm of the expected
outcome. Marginal effects in exponential mean models are also relatively
easy to compute and have a simple mathematical form. Mathematically,
for any model with an exponential conditional mean, differentiation yields
$$\frac{\partial E(y \mid x)}{\partial x_j} = \beta_j \exp(x\beta)$$
which also depends on the values of each of the covariates in the model.
covariates at which the derivative or difference is evaluated. In other
words, the partial effects vary by observed characteristics, rather than
being constants.
When partial effects are evaluated at the means of the covariates using
the at((mean) _all) option in margins , the incremental effect of female
drops, in magnitude, to 2.07, while the marginal effect of age decreases
slightly to 0.13.
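A sketch of that call, assuming a count of office-based visits use_off:

quietly poisson use_off age female, vce(robust)
margins, dydx(age female) at((mean) _all)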
seemingly draconian restriction, we have shown that parameter estimates
from the Poisson regression are consistent even when the data-generating
process is not Poisson—that is, even when this equality of the conditional
mean and variance does not hold.
marginal and incremental effects and event probabilities can be
inconsistent. For example, the Poisson density often underpredicts event
probabilities in both tails of the distribution. We first reestimate a Poisson
regression for office-based visits and calculate the observed and predicted
probabilities using the Stata code shown below. The predicted density is
calculated for each value of the count variable (up to a maximum value
based on the empirical frequency for each outcome) and for each
observation (that is, for different values of covariates). Then, the predicted
frequencies are averaged to obtain a single measure of the average
predicted density for each count value.
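A sketch of that calculation (predict's pr() option after poisson returns Pr(y = k) for each observation):

quietly poisson use_off age female
forvalues k = 0/10 {
    quietly predict double prk`k', pr(`k')
    quietly summarize prk`k', meanonly
    local pred = r(mean)
    quietly count if use_off == `k'
    display "k = `k'  observed = " %6.4f r(N)/_N "  predicted = " %6.4f `pred'
}
drop prk*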
[Figure: observed and average predicted probabilities of office-based provider visits and ER visits from Poisson regressions]
8.3 Negative binomial models
that the negative binomial density is overdispersed relative to the Poisson
and has considerably larger fractions of zeros and “large” (greater than 10)
values.
binomial regression estimates are quite reliable in practice.
The examples below show the results of NB1 and NB2 models fit for the
count of the number of office-based visits. The NB2 regression is the
default specification of the nbreg command in Stata (the
dispersion(mean) option), so we fit that model first. Parameter estimates
from negative binomial regressions have a semielasticity interpretation.
We see that an additional year of age is associated with a 2.8% increase
in office visits. Women have 52% more visits than men.
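Sketches of the two specifications:

nbreg use_off age female                         // NB2 (default; dispersion(mean))
nbreg use_off age female, dispersion(constant)   // NB1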
As always, it is useful to compute effect sizes on the natural scale, so
we use margins to calculate sample average marginal and incremental
effects. The results show that individuals who are a year older have 0.16
more visits on average. Women have 2.9 more visits than men.
Estimates of the sample average partial effects reveal that both the
marginal effect of age (0.13) and the incremental effect of female (2.32)
are smaller when fit using the NB1 model.
The NB1 and NB2 models are not nested models—so in principle, a
researcher should use tests to discriminate among nonnested models, such
as Vuong's (1989) test, or model-selection criteria, such as the Akaike
information criterion (AIC) or the Bayesian information criterion (BIC) (see
chapter 2). But because the NB1 and NB2 models have the same number of
parameters, most nonnested tests and criteria simplify to a comparison of
maximized log likelihoods. The values of the log likelihoods suggest that
NB1 fits better than NB2 for this particular dataset and model specification.
[Figure: observed and average predicted probabilities of office-based provider visits and ER visits from negative binomial regressions]
One way to improve the fit of the negative binomial model even
further involves a parameterization of the overdispersion parameter,
$\alpha$, in terms of a linear combination of a set of covariates. However,
although this extended model is more flexible in principle, the parameters
of such models can be difficult to identify in finite samples. Instead, in
the following sections, we describe two types of models that add
flexibility to the basic Poisson or negative binomial specifications, have
considerable intuitive appeal, and often fit the counts of healthcare use
quite well.
8.4 Hurdle and zero-inflated count models
As mentioned above, many counts of health and healthcare use have more
zeros than predicted by Poisson or negative binomial models. Thus the
first extensions we consider to the Poisson and negative binomial
regressions are their hurdle and zero-inflated extensions. Each of these
models adds flexibility by relaxing the assumption that the zeros and the
positives come from the same data-generating process. Each naturally
generates excess zeros and a thicker right tail relative to the parent
distributions, but they are also capable of generating fewer zeros and
thinner tails.
The hurdle count model can have the same conceptual justification as is
often used to justify the two-part model—that it reflects a two-part
decision-making process (see also chapter 7). One motivation is based on a
principal-agent mechanism. First, the principal decides whether to use the
medical care or not. Then—conditional on making the decision to use care
—the agent, on behalf of the principal, makes a second decision about how
much care to consume. More specifically, the patient initiates the first visit
to a doctor, but the doctor and patient jointly determine the second and
subsequent visits (Pohlmeier and Ulrich 1995) . Alternatively, the two-step
process could be thought of as driven by transaction costs of entry into the
market, which do not exist once the individual is engaged in the receipt of
healthcare services. A richer formulation of the principal-agent mechanism
models the fact that bouts of illness arise during the course of the year.
Some factors may have a differential effect on whether these episodes of
illness become episodes of treatment—for example, the opportunity to
visit one’s family physician rather than having to go to an ER (Keeler and
Rolph 1988).
However, such justifications are not required for the hurdle count
model to be an appealing extension to the standard Poisson and negative
binomial model. Instead, it is enough to acknowledge that there may be
substantial heterogeneity at the threshold of the count variable between use
and nonuse.
density, $f_1(\cdot)$, so that $\Pr(y = 0) = f_1(0)$—while the positive
counts are from another density, $f_2(\cdot)$. To be more precise, the
positive counts are drawn from the truncated density,
$f_2(y)/\{1 - f_2(0)\}$. Section 8.5 provides more details on truncated
counts. The overall data-generating mechanism is
$$\Pr(y = k) = \begin{cases} f_1(0) & k = 0 \\[4pt]
\{1 - f_1(0)\}\, \dfrac{f_2(k)}{1 - f_2(0)} & k = 1, 2, \ldots \end{cases}$$
The estimates of marginal effects show that women are almost 17
percentage points more likely to have at least one office-based visit than
men. An extra year in age leads to a 0.8 percentage point increase in the
probability of at least one office-based visit.
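A sketch of this first step, assuming a hypothetical indicator any_off for any office-based visit:

generate byte any_off = use_off > 0
probit any_off age female
margins, dydx(age female)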
In the second step, we fit a truncated Poisson model using the Stata
command tpoisson . For estimation, it is important to condition on the
sample with positive values for the outcome; that is, drop observations
with zero counts. Notice the use of the if use_off>0 qualifier in the
tpoisson command shown below:
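A sketch (zero is tpoisson's default truncation point):

tpoisson use_off age female if use_off > 0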
greater than zero) mean for the entire sample, not just for observations
with the outcome greater than zero. Conditional on having at least one
office-based visit—women average 1.59 visits more than men. On
average, a person who is one year older is expected to have 0.11 more
visits.
Next, we code the formula for the conditional mean of the outcome for
the hurdle Poisson model and pass that to the expression() option of the
margins command. The code and results are shown below. They show that
women have 2.36 more office-based visits than men. From the previous
results, we can conclude that this is because they are more likely to have
an office-based visit and because, among those who visit, they have more
visits. An extra year in age increases the number of visits by 0.14, again
because of increases along extensive and intensive margins.
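Because margins cannot mix coefficients from two separately stored models in one expression, a hedged sketch computes the hurdle mean $\Pr(y > 0) \times E(y \mid y > 0)$ by hand (point estimates only; the expression() route or GMM is needed for standard errors):

quietly probit any_off age female
predict double p1, pr
quietly tpoisson use_off age female if use_off > 0
predict double xb_h, xb
generate double mu_h = exp(xb_h)
generate double trmean = mu_h / (1 - exp(-mu_h))   // E(y | y > 0) for a zero-truncated Poisson
generate double ey_hurdle = p1 * trmean
summarize ey_hurdle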
We also fit a truncated negative binomial regression model, combine
estimates from the two parts, and estimate marginal effects using the same
steps as for the hurdle Poisson model. The sample average of the
incremental effect of being female is 2.54: women average 2.54 more
office-based visits than men. This estimate is somewhat larger than the one
obtained from the hurdle Poisson. An extra year of age is estimated to
increase the number of visits by 0.15, which is quite similar to that
obtained from the hurdle Poisson.
Note that the sample average incremental and marginal effects
obtained from the hurdle specification are quite close to those obtained
using the standard negative binomial regression. However, this does not
mean that the partial effects would be similar throughout the distribution
of the covariates.
would be a powerful way to model the additional heterogeneity relative to
a standard model. Note that as with the hurdle count model, the use of the
zero-inflated model need not be justified using the intuition of two types of
individuals in the population. It may simply be used to provide additional
modeling flexibility.
Although quite flexible, the zero-inflated model is used less often than
the hurdle count model. Because the likelihood function of the zero-
inflated model cannot be decomposed into its constituent parts, all
parameters must be estimated simultaneously, which can raise
computational issues—especially when both parts of the model have the
same set of covariates (Mullahy 1997).
We estimate the sample average incremental effect of being female and
the marginal effect of age and report these below. On average, women
have 2.49 more office-based visits than men. Individuals who are a year
older have 0.15 more visits. At least on average, estimates and marginal
effects from a zero-inflated negative binomial regression model for office-
based visits deliver similar results to the hurdle count models.
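A sketch of the zero-inflated model and its average effects:

zinb use_off age female, inflate(age female)
margins, dydx(age female)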
Typically, the larger the discrepancy between the number of zeros in
the data and the number predicted by the standard count model (Poisson or
negative binomial), the greater the gains will be from the additional
modeling complexities of either the hurdle or the zero-inflated model.
Gains will be most obvious along the dimension of predicted counts, but a
researcher will typically also obtain better estimates of other event
probabilities and partial effects. Further, unlike the choice between
Poisson and negative binomial, where the choice of distribution has no
impact on expected outcome in terms of the conditional mean, the move to
a zero-inflated Poisson or negative binomial regression model—or a
hurdle count model—can change the conditional mean response to the
covariates.
8.5 Truncation and censoring
8.5.1 Truncation
$$\Pr(y = k \mid y > 0, x) = \frac{f(k \mid x)}{1 - f(0 \mid x)}, \quad k = 1, 2, \ldots$$
where $f(\cdot)$ is the untruncated count density. Maximum likelihood
estimates of zero-truncated models are implemented in Stata for the Poisson
and negative binomial densities with the tpoisson and tnbreg commands,
respectively.
8.5.2 Censoring
8.6 Model comparisons
parameters in the model. Smaller values of AIC are preferable. The BIC
(Schwarz 1978) is
$$\text{BIC} = -2 \ln L + k \ln(n)$$
where $n$ is the sample size and $k$ is the number of parameters. Smaller
values of BIC are preferable. For moderate to large sample sizes, the BIC
places a premium on parsimony and will tend to select models with fewer
parameters relative to the preferred model based on the AIC criterion.
Both the AIC and BIC demonstrate that—among the models fit—the
hurdle NB1 fits the data for office-based visits best, albeit with the caveat
that the zero-inflated negative binomial model implemented in Stata as
zinb only allows for the NB2 density.
8.6.2 Cross-validation
Figure 8.7: Change in cross-validated log likelihood relative to NB2, office-based visits
The evidence for ER visits, shown in figure 8.8, is quite different. There
is virtually no discrimination between NB1 and NB2 models or among their
extensions.
[Figure 8.8. # ER visits: change in log likelihood over NB2.]
8.7 Conclusion
We have described a number of models for count data in this chapter. They
are useful as models for many measures of healthcare use. In one of the
two empirical examples in this chapter, there is considerable gain in fit
from going beyond the standard Poisson and negative binomial models to
hurdle and zero-inflated extensions of those models. Nevertheless, even
the hurdle and zero-inflated extensions may sometimes be insufficient to
provide an adequate fit for some outcomes. We will return to the
development of other, possibly more flexible extensions in chapter 9.
8.8 Stata resources
To compare the AIC and BIC statistics for the count models
described above, use estimates stats * or estat ic.
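For example, assuming each candidate model is fit quietly and stored
first (model names and covariates are illustrative):

    quietly poisson use_off age female
    estimates store pois
    quietly nbreg use_off age female
    estimates store nb2
    estimates stats pois nb2    // or estimates stats * for all stored models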
Chapter 9
Models for heterogeneous effects
9.1 Introduction
There are many conceptual reasons to expect that the marginal (or
incremental) effects of covariates on healthcare expenditure and use, when
evaluated at different values of covariates or at different points on the
distribution of the outcome, are not constant across a number of
dimensions. In observational studies, individual characteristics can
plausibly have very different effects on the outcome at different values of
the characteristic itself. For example, a unit change in health status may
have only a small effect on healthcare expenditures for individuals who are
in good health, while it may have a large effect for individuals in poor
health. The effect of a unit change in health status may also differ along
the distribution of expenditures. The effect of a unit change in health status
on expenditures may be small for people with low expenditures and high
for people with large expenditures. These health characteristics may also
interact with socioeconomic characteristics. For example, individuals who
are generally less healthy or who have greater spending may be less
sensitive to changes in price. Furthermore, in quasiexperimental studies
and large, pragmatic experimental designs, the intensity of treatment and
compliance to treatment often differ across individual characteristics,
household characteristics, provider characteristics, and geographic areas.
Thus treatment effects evaluated at different values of those characteristics
will themselves differ.
characterized entirely by the functional form of the link function between
the index and outcome unless interactions and polynomials of covariates
are also included.
There are often good reasons to believe that effects are heterogeneous
along dimensions that cannot easily be characterized by parametric
functional forms or by interactions of covariates as they are typically
specified. For example, effects may be heterogeneous along the values of
the outcome itself, by complex configurations of observed characteristics,
or on unobserved characteristics. These types of effect heterogeneity are
not easy to account for using the models that we have described so far.
Ignoring heterogeneity may be a lost opportunity for greater understanding
in some cases, while it may lead to misleading conclusions in others.
9.2 Quantile regression
So far, the models we have described relate the conditional mean of the
outcome, $E(y \mid x)$, to a set of covariates through one (or two) linear indexes
of parameters and covariates; that is, $E(y \mid x) = g(x'\beta)$. In quantile
regression, the conditional expectation of the outcome is not modeled.
Instead, the conditional $q$th quantile is modeled using a linear index of
covariates; that is, $Q_q(y \mid x) = x'\beta_q$. When $q = 0.5$, the quantile regression
is also known as a median regression. As we will see below, quantile
regressions allow effects of covariates to vary across conditional quantiles;
that is, the effect of a covariate on the $q$th conditional quantile may be
different from the effect of the same covariate on the $q'$th quantile. Thus
quantile regressions provide a way for researchers to understand how the
effect of a covariate might differ across the distribution of the outcome.
In ordinary least squares (OLS), the parameters of this linear model are
computed as the solution to minimizing the sum of squared residuals. In an
analogous way, the median quantile regression computes parameters of the
same linear regression specification by minimizing the sum of absolute
residuals. The sum of squared residuals function is quadratic and
symmetric around zero; the sum of absolute residuals is piecewise linear
and symmetric around zero. Therefore, minimizing the sum of absolute
residuals equates the number of positive and negative residuals and defines
a plane (a line in the case of a simple regression specification) that “goes
through” the median.
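As a minimal sketch, a median regression of total expenditures on the
covariates used elsewhere in this book would be:

    qreg exp_tot age female, quantile(.5)   // .5 is the default quantile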
Koenker and Bassett (1978) and Bassett and Koenker (1982) showed that if
$\hat{\beta}_q$ minimizes

$\sum_{i:\, y_i \ge x_i'\beta} q \,\lvert y_i - x_i'\beta \rvert \;+\; \sum_{i:\, y_i < x_i'\beta} (1 - q)\,\lvert y_i - x_i'\beta \rvert \qquad (9.1)$
9.2.1 MEPS examples
How do these results compare with standard least-squares estimates?
The results below show that the median regression and the least-squares
regression deliver the same qualitative inference, but the partial effects of covariates
are different. The marginal effect of age and the incremental effect of
female are about twice as big in the least-squares case as in the median
regression.
As we have suggested above, an important value of quantile
regressions is the ability to estimate effects at various quantiles, not just at
the median. Consequently, we estimate the regression of total expenditures
at the 10th through 90th percentiles in 10 percentage point increments to
determine whether and how the effects of the covariates change from the
10th through 90th conditional quantiles of the outcome.
We plot these estimates of the effects of age and female along with the
associated 95% confidence intervals in figure 9.1. For comparison, we
overlay on each panel the least-squares coefficient estimate and its
confidence interval. (To accumulate the parameter estimates and
confidence intervals conveniently, we use the user-written package
parmest [Newson 2003], which can be installed by typing ssc install
parmest in Stata.)
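A sketch of one way to accumulate the estimates, assuming parmest has
been installed; the file names are ours:

    forvalues q = 10(10)90 {
        quietly qreg exp_tot age female, quantile(`=`q'/100')
        parmest, saving(qr`q'.dta, replace) idstr("q`q'")
    }
    use qr10.dta, clear            // stack the saved estimates
    forvalues q = 20(10)90 {
        append using qr`q'.dta
    }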
The left panel of figure 9.1 shows the marginal effect of age on total
expenditures, while the right panel shows the incremental effect of female
on expenditures across quantiles of errors. Note that the quantiles on the
horizontal axis refer to quantiles of errors—not to quantiles of the
outcome, exp_tot. Although it is tempting to interpret the effects as if they
were applicable to observed quantiles of the outcome, that interpretation is
incorrect. These are conditional quantiles, so they cannot be easily
translated into unconditional quantiles.
[Figure 9.1. Quantile regression coefficients on age (left panel) and female
(right panel) for total expenditures, at the 10th through 90th conditional
quantiles, with 95% confidence intervals and the least-squares estimate
overlaid.]
What might we learn if the distribution of the outcome were more
symmetric? To explore this, we estimate quantile regressions for the
logarithm of total expenditures (conditional on expenditures being
positive). We first estimate the quantile regression at the median using
qreg. The results show that the coefficient on age is 0.037, implying that
an individual who is one year older would spend 3.7% more.
Women spend 43% more than men.
increased as the conditional quantile of expenditures increased. Now,
when the outcome is the log of expenditure, the effect of age decreases
across those conditional quantiles. There is no evidence of heterogeneity in
the effect of female for expenditures measured on the log scale. The
quantile estimates are all within the confidence interval of the OLS
estimates.
[Figure 9.2. Quantile regression coefficients on age (left panel) and female
(right panel) for log expenditures, at the 10th through 90th conditional
quantiles.]
9.2.2 Extensions
9.3 Finite mixture models
$f(y_i \mid x_i) = \sum_{j=1}^{C} \pi_j f_j(y_i \mid x_i; \beta_j) \qquad (9.2)$

$\pi_j = \frac{\exp(\gamma_j)}{\sum_{k=1}^{C} \exp(\gamma_k)}, \qquad \gamma_1 = 0 \qquad (9.3)$

$E(y_i \mid x_i) = \sum_{j=1}^{C} \pi_j \mu_j(x_i) \qquad (9.4)$

$\mathrm{Var}(y_i \mid x_i) = \sum_{j=1}^{C} \pi_j \{\sigma_j^2 + \mu_j(x_i)^2\} - \{E(y_i \mid x_i)\}^2 \qquad (9.5)$

$\Pr(j \mid y_i, x_i) = \frac{\pi_j f_j(y_i \mid x_i; \beta_j)}{\sum_{k=1}^{C} \pi_k f_k(y_i \mid x_i; \beta_k)} \qquad (9.6)$
assignments to characterize the classes or components. To do so, we would
estimate the posterior probabilities of class membership after the model
parameters have been estimated. Then, we could use the probabilities
themselves—or classification based on the probabilities—to further
describe characteristics of observations in each class. Note that although it
is technically possible to parameterize the prior probabilities to allow them
to vary by characteristics, the tradition in the literature is to assume they
are constant (McLachlan and Peel 2000).
We use the new fmm prefix to fit finite mixtures of GLMs and negative
binomial regressions. fmm is more than a command; it is a prefix because it
can be used to fit a variety of finite mixture models using standard
estimation commands prefixed by fmm #:, where # is the number of
components in the mixture model. We use the fmm prefix to fit finite
mixture models of GLM regressions with gamma densities and log links for
positive values of expenditures. We begin by fitting a two-component
model because that is where a researcher would begin the quest for a
model with an adequate number of components. We will estimate the
parameters of this model, then calculate model-selection criteria so that the
model can be compared with a model with three or more components.
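A sketch of the two-component fit, with the covariate list abbreviated for
illustration:

    fmm 2: glm exp_tot age female if exp_tot > 0, family(gamma) link(log)
    estat ic    // information criteria for comparison with richer models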
If we thought this model was adequate, we would proceed to
interpreting estimates and characterizing marginal effects. However, as we
will see below, it is not. Nevertheless, to fix ideas using this simple case,
we briefly interpret the parameter estimates. The output of parameter
estimates is shown in three blocks. The first block displays estimates of the
$\gamma$ parameters, which can be used to calculate the class probabilities using
(9.3). We do this below, but after interpreting the output from the next two
blocks of results that show the componentwise parameter estimates. For
observations in component 1, the effects of age and female are both
statistically significant—each increases spending. For observations in
component 2, age is statistically significant, but female is not.
We use the postestimation command estat lcprob to obtain estimates
of $\pi_j$. The three component densities are associated with mixture
probabilities of 0.49, 0.43, and 0.08.
The model we have fit has constant class probabilities, which allows us
to code up the transformation from $\gamma$ to $\pi$ [using (9.3)] and use nlcom to
obtain the estimates and associated confidence intervals for the class
probabilities more quickly than using estat lcprob. (The results below
are identical to those produced by estat lcprob.)
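A hedged sketch of that computation for the three-component model; the
equation names for the class-membership intercepts are assumptions about
fmm's output labeling and should be verified with the coeflegend option:

    estat lcprob
    * class 1 is the base, so pi_j = exp(gamma_j)/{1 + exp(gamma_2) + exp(gamma_3)}
    nlcom (pi1: 1/(1 + exp(_b[2.Class:_cons]) + exp(_b[3.Class:_cons])))                      ///
          (pi2: exp(_b[2.Class:_cons])/(1 + exp(_b[2.Class:_cons]) + exp(_b[3.Class:_cons]))) ///
          (pi3: exp(_b[3.Class:_cons])/(1 + exp(_b[2.Class:_cons]) + exp(_b[3.Class:_cons])))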
We use estat ic to calculate the AIC and BIC for the two- and three-
component models. The information criteria for the three-component
model are shown below. Both the AIC and BIC suggest that the three-
component model provides a better fit than the two-component model.
Although—for a serious research exercise—we should fit a four-
component model and calculate information criteria before judging the
three-component model to be the best fit, we stop here and proceed to the
interpretation of the parameters, effects, and distributional characteristics
from the three-component model. This gives us an example that is
sufficiently parameter rich to make the nuances of the finite mixture
model apparent, yet not so complex that discussion of those nuances
overwhelms our attempt to describe basic interpretation.
low- and medium-spending classes, women and men in the high-spending
class, that is, in component 3, do not differ significantly.
It is not just mean spending that differs across components. The shapes
of the gamma densities are also different. To show this, we plot the
predicted densities at the median age and gender in figure 9.3. To make
the figure easier to read, we show only the densities through $20,000 in
expenditure, which exceeds the 95th percentile of the empirical distribution
of expenditure. The figure shows that each of the predicted densities is
skewed, just as the empirical density is. But, while the densities of the first
two components have positive modes (classically gamma shaped), the
density of component 3 is exponential in shape—it slopes downward right
from the beginning.
[Figure 9.3. Predicted component densities of expenditure, through
$20,000, at the median age and gender.]
[Figure 9.4. Effects of age (left panel) and female (right panel) on
expenditures, by component. Finite mixture estimates and 95% CI denoted
by bars and capped lines, respectively.]
9.3.2 MEPS example of healthcare use
In a second example, we fit finite mixture models for the number of office-
based healthcare visits. We showed in chapter 8 that the negative
binomial-1 fit this outcome well. Therefore, for this example, we estimate
finite mixtures of negative binomial-1 regressions. We begin by fitting a
two-component model—but for brevity, we do not show the results. Once
again, information criteria suggest that the three-component model is
better than the two-component one.
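A sketch of the three-component fit, assuming fmm: nbreg accepts the
dispersion(constant) option that gives the NB1 parameterization used in
chapter 8:

    fmm 3: nbreg use_off age female, dispersion(constant)
    estat ic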
We use nlcom to obtain estimates of the class probabilities, along with
their standard errors. The three components have mixture weights of 0.64,
0.09, and 0.27.
From the estimates of the predicted means, one might conclude that
two of the components are too similar to distinguish. That conclusion
would be wrong; the densities of the components are substantially different
from each other. To demonstrate this, we plot the predicted densities at the
median age and gender in figure 9.5. To make the figure easier to read, we
show only the densities through 30 visits, which exceeds the 97th
percentile of the empirical distribution of office-based visits. We also
represent the densities using (continuous) line charts, although bar charts
would be technically preferred. We use line charts because it is easier to
visualize differences in the component densities. The figure shows that
each of the predicted densities is skewed, just like the empirical density.
The density of the relatively rare component 2 is quite different from that
of the much more frequent component 1, although they have similar
means. Observations in component 3 are most likely to generate large
values of visits and much less likely to generate zero and other small-visit
values.
[Figure 9.5. Predicted component densities of # office-based visits,
through 30 visits, at the median age and gender.]
[Figure 9.6. Effects of age (left panel) and female (right panel) on office-
based visits, by component. Finite mixture estimates and 95% CI denoted
by bars and capped lines, respectively.]
9.4 Nonparametric regression
$\hat{g}(x_0) = \frac{\sum_{i=1}^{N} K\{(x_i - x_0)/h\}\, y_i}{\sum_{i=1}^{N} K\{(x_i - x_0)/h\}} \qquad (9.7)$

where $K(\cdot)$ denotes a kernel weighting function that assigns greater
weight to observations whose $x_i$ is closer to $x_0$, with inclusion or
exclusion based on the bandwidth $h$. Fan and Gijbels (1996)
describe local-linear regression in detail. Racine and Li (2004) describe
how good bandwidths may be chosen.
npregress does not produce those by default. See Cattaneo and
Jansson (2017) for formal justification of the bootstrap for the
nonparametric regression. We have used 100 replications to economize on
computational time without confirming that the number of replications is
sufficient for reliable estimates. For serious research work, we encourage
users to fit models with different numbers of bootstrap replicates before
settling on a number beyond which estimates of standard errors do not
change much.
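A sketch of the fit, with an abbreviated covariate list and 100 bootstrap
replications:

    npregress kernel exp_tot pcs12 i.anylim, reps(100) seed(12345)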
To better understand the effect of a unit change in pcs12, we use
margins to compute the conditional mean function at values across the
empirical distributions of pcs12 and anylim. Once again, we use 100
bootstrap replications to obtain standard errors for the predictions.
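For example, over an assumed grid of evaluation points:

    margins, at(pcs12=(30(5)60) anylim=(0 1)) reps(100)
    marginsplot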
It is easier to understand the nonlinearities in the effects visually, so we
graph the results of margins using marginsplot. Figure 9.7 shows two
clear sources of nonlinearities. First, mean expenditures for individuals
with an activity limitation decline sharply as physical health (pcs12)
increases until about a score of 45, which is approximately the mean
physical component score in this sample. Beyond a score of 45, mean
expenditures appear to decline slowly, but the confidence intervals show
that constant expenditure cannot be ruled out. For individuals without
activity limitations, mean expenditures are roughly constant across values
of pcs12. Second, the figure suggests interactive effects of pcs12 with
anylim. Mean expenditures are substantially bigger for individuals with a
limitation, compared with those without limitations, until the physical
score reaches about 45, that is, for individuals in below-average health. For
individuals with above-average health, mean expenditures are the same for
individuals with and without an activity limitation.
[Figure 9.7. Predicted values by pcs and anylim: the mean function
plotted against the physical health component of SF12.]
9.5 Conditional density estimator
The two key assumptions in the conditional density estimator (CDE) (see
Gilleskie and Mroz [2004], equations 12 and 13) are that the probability of
being in a bin depends on covariates and that the mean value of $y$,
conditional on the bin, is independent of the covariates. That is, there is
heterogeneity across bins but homogeneity within bins. Within a bin,
covariates have no predictive power. In this case, the CDE approach focuses
on modeling the probability of being in a bin in the best possible way.
straightforward. Decide on the number of bins and the threshold values
separating the bins, which do not need to be equally spaced. Fit a series of
logit (or probit) models. For example, if there are 11 bins, fit 10 logit
models, with each successive model fit on a smaller sample. For
observations and bins the expected value of
the dependent variable, , is the mean of for each bin times the
probability of being in that bin, summed over all bins:
Gilleskie and Mroz (2004) demonstrate how to fit CDE in one large
logit model , as opposed to a series of individual logit models with
progressively smaller sample sizes.
Two issues with this method have not yet been worked out to make it
accessible to the typical applied researcher. First, it needs a theory
and a practical method for choosing the number of bins in an optimal way.
Second, although methods for the computation of standard errors are
available, coding those is beyond the scope of this book.
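A minimal sketch of the sequential-logit construction with four bins; the
cutpoints and variable names are hypothetical:

    * hypothetical bins for exp_tot: [0,500), [500,2000), [2000,10000), [10000,+)
    egen bin = cut(exp_tot), at(0 500 2000 10000 100000000) icodes
    gen byte atrisk = 1                    // still eligible for higher bins
    forvalues k = 0/2 {
        gen byte in`k' = (bin == `k')
        quietly logit in`k' age female if atrisk == 1
        predict p`k' if atrisk == 1, pr    // Pr(bin k | bin >= k)
        replace atrisk = 0 if bin == `k'
    }
    * E(y|x) is then assembled from bin means and the implied bin probabilities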
9.6 Stata resources
Chapter 10
Endogeneity
10.1 Introduction
for endogenous and exogenous covariates and will estimate treatment
effects under the assumption of jointly normal errors. Finally, we also
briefly describe the generalized method of moments (GMM) for IV
estimation, because GMM can have substantial benefits compared with 2SLS.
10.2 Endogeneity in linear models
$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 x + u + e_1 \qquad (10.1)$

$y_2 = \alpha_0 + \alpha_1 x + \alpha_2 w + u + e_2 \qquad (10.2)$
Note that we have not yet described the purpose of including the
covariate $w$ in (10.2). We will do so in the next section, where we describe
solutions to the problem described here. Note also that the logic described
above applies even if $e_1$ and $e_2$ are uncorrelated with each other; the
composite error terms, $u + e_1$ and $u + e_2$, are correlated, thus
rendering OLS estimates of parameters of (10.1) inconsistent. In fact, one
need not construct this example with an unobserved covariate as distinct
from the error terms $e_1$ and $e_2$. As long as the errors of (10.1) and (10.2)
are correlated, OLS estimates of the model for the outcome, $y_1$, will be
inconsistent.
Next, we omit u from the regression model. We know that the OLS
estimate of the coefficient on y2 is inconsistent. In the example, the OLS
estimate of the coefficient on y2 is 1.48 when u is omitted, and the
confidence interval is far away from the true value of 1.0. The estimated
coefficient when u is omitted is greater than 1.0, because the composite
error terms of the two equations, (10.1) and (10.2), are positively
correlated.
10.2.2 2SLS
The existence of an exogenous variable, $w$, that enters the equation for the
endogenous regressor, $y_2$, in (10.2) but that does not directly determine the
outcome of interest, $y_1$, in (10.1) (except through its effect on $y_2$) is key to
a large class of solutions to obtain consistent estimates of parameters of the
outcome equation. Such variables are often called instruments or IVs. Valid
instruments have two essential properties:

$\mathrm{cov}(w, y_2) \neq 0 \quad\text{and}\quad \mathrm{cov}(w, u + e_1) = 0 \qquad (10.3)$
where $\hat{y}_2$ has replaced $y_2$.
We now fit the model using 2SLS with the goal of obtaining a
consistent estimate of the causal effect of y2 on y1. To fit 2SLS in
Stata, we use ivregress with the first option to see regression results
from both stages. The first-stage regression predicts the endogenous
variable, y2, as a function of the instrument, w, and the exogenous variable,
x. Both of these variables are strong predictors of y2—which is not
surprising, because that is how we generated the data.
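The command for the generated data is:

    ivregress 2sls y1 x (y2 = w), first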
a function of the predicted endogenous variable, y2, and the exogenous
variable, x. The 2SLS estimate is 0.932 with a standard error of 0.075. The
estimate is close to 1 in magnitude and would not be statistically
significantly different from 1.0 in a simple test. However, it has a much
wider confidence interval than the one found with OLS when there is no
endogeneity. This example demonstrates two important points: First, with
a valid instrument, the 2SLS estimate is much closer to the true value than
the OLS estimate. Second, the standard errors are typically much larger
than those from OLS. There is a tradeoff between consistency and precision.
The second test is whether the potentially endogenous variable is
actually exogenous. This test is conditional on all the instruments being
valid. The estat endogenous command shows that the p-value is below
0.05, meaning the test rejects the null hypothesis of exogeneity
(conditional on the instrument being valid). We will treat y2 as
endogenous.
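Both diagnostics are available as ivregress postestimation commands:

    estat firststage    // strength of the instrument
    estat endogenous    // exogeneity of y2, conditional on a valid instrument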
10.2.4 2SRI
Further intuition can be found if we tease apart the error from the main
equation into two pieces—the part correlated with the endogenous variable
and the part that is independent. A control function is a variable (or
variables) that approximates the correlated part. Newey, Powell, and Vella
(1999) proved that there exists an optimal control function. If a researcher
could observe such a variable, then including it in the main equation would
be like including the omitted variable that caused the endogeneity. The
remaining error would be uncorrelated with the endogenous variable.
We bootstrapped the standard errors because estimating 2SRI in two
steps requires either bootstrapping the standard errors or using GMM (see
section 10.4). To understand why this is necessary, recall that 2SRI inserts
the predicted first-stage error into the main equation. The standard errors
computed by regress do not reflect the fact that this is an estimate of the
true error. Therefore, the regress standard errors are too small. In
contrast, bootstrapping is not necessary for ivregress.
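A sketch of the bootstrap for 2SRI on the generated data; the program
name and residual name are ours:

    capture program drop tsri
    program define tsri
        capture drop nu2_hat
        regress y2 x w                     // first stage
        predict double nu2_hat, residuals
        regress y1 y2 x nu2_hat            // second stage with the residual included
    end
    bootstrap _b, reps(500) seed(12345): tsri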
However, the functional form of the control function matters, and in some
cases, alternative functional forms are necessary. Control function methods
can be used for binary outcomes (logit and probit), other categorical
outcomes (ordered and multinomial models), count models, duration or
hazard models, and generalized linear models.
Another way to model endogeneity uses the new ERM commands in Stata.
The ERM commands provide a unified way to model linear and some
nonlinear models with both exogenous and endogenous covariates. In
particular, the ERM commands can model linear, binary, ordered
multinomial, or interval outcomes along with continuous, binary, ordered
multinomial, or interval endogenous variables. The ERM commands also
allow modeling of selection and the computation of treatment effects in
these contexts.
Because ERM assumes joint normality of the error terms, the two
equations are estimated using maximum likelihood or, to be more precise,
full-information maximum likelihood (FIML). This has the advantage of
greater efficiency compared with 2SLS if the joint normality assumption is
correct. However, if the joint normality assumption is incorrect, then the
FIML model in this case still produces consistent estimates, but it no longer
has any efficiency gains.
We next demonstrate the use of the linear ERM command eregress for
the generated data from section 10.2.2 to predict y1 as a function of
endogenous y2 with instrument w. We will then compare the results from
eregress with the 2SLS results.
The syntax for eregress is slightly different from that for ivregress.
The expression for the endogenous variable is put after the comma, and the
first option is not needed because eregress automatically reports the
first-stage results. As expected, the results are very similar but not
numerically identical to the 2SLS results. The estimated coefficient is close
to 1.0, the true value. The standard error is also similar to the standard
error found by 2SLS.
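For the generated data:

    eregress y1 x, endogenous(y2 = x w)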
10.3 Endogeneity with a binary endogenous variable
except that the covariate $y_2$ is now binary and takes only two values, 0 and
1. Let

$y_2 = \mathbf{1}\{\alpha_0 + \alpha_1 x + \alpha_2 w + u + e_2 > 0\} \qquad (10.4)$
unobserved error e2, and a covariate w. Finally, we generate the
dependent variable, y1, using (10.1) as a function of an exogenous
covariate x, the endogenous binary covariate y2, the unobserved covariate
u, and the unobserved error e1.
We first estimate an OLS regression that does not account for the
endogeneity of y2. We know that the estimates from such a regression will
be inconsistent. We see that the OLS estimate of the coefficient on y2 is
1.76 and the confidence interval is far away from the true value of 1.0.
Estimators that do not account for endogeneity can be misleading.
We now fit the model using 2SLS. Although 2SLS ignores the
discreteness of y2, it produces consistent parameter estimates
(Wooldridge 2010). The 2SLS estimate of the coefficient on y2 is 0.632. It
does not appear to be close to the true value of 1.0, but because its
standard error is large (0.405), it is not statistically different from 1.0. The
estimate is also not statistically different from zero. Although 2SLS is
consistent, the efficiency loss appears to be quite large. We return to this
issue below.
We now fit the model using 2SRI and estimate standard errors using a
nonparametric bootstrap. We use the probit model (probit) for the first
stage, which produces maximum likelihood estimates. We then compute
residuals nu2_hat and include them in an OLS regression of y1 as a
function of endogenous y2, exogenous x, and the estimated residual
nu2_hat. The estimate of the coefficient on y2 is 0.98, which is quite close
to the true value of 1.0. The estimated standard error is 0.1, which is
considerably smaller than the 2SLS estimate of the analogous standard
error.
Next, we fit the model using eregress with the probit option. This is
a FIML estimator of the model; it accounts for joint normality of the error
terms of the two equations. The estimated coefficient is 1.01 and its
standard error is 0.01, suggesting further efficiency gains.
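The probit specification goes inside the endogenous() option:

    eregress y1 x, endogenous(y2 = x w, probit)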
We saw earlier that, while the 2SLS estimator is consistent, it does not
produce precise estimates of the parameter of interest in this example.
Both 2SRI and FIML estimators perform much better. However, the
performance of the 2SLS estimator can be improved. Recall that y2 takes
the value 1 about 10% of the time. It is at such rates of binary
outcomes that nonlinear estimators like the probit have the largest gains
relative to a linear probability model. But the predictive
performance of the linear probability model can be improved by
introducing nonlinear functions of the covariates. We consider a quadratic
polynomial of x and w to illustrate. The first-stage results table shows the
coefficient estimates from such a model. Two out of three of the higher-
order terms are statistically significant. The results of the second-stage
regression show that the point estimate of the coefficient on y2 is now
0.995—very close to 1.0. The standard error of the estimate is 0.28, which
is a substantial improvement over the standard error in the first set of 2SLS estimates.
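Using factor-variable notation, the specification is along these lines:

    ivregress 2sls y1 x c.x#c.x (y2 = w c.w#c.w c.x#c.w), first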
Readers should note a feature of the specification of the model using
ivregress. The polynomial terms that involve w are included in the set of
instruments that affect only y2 directly. The polynomial term that involves
only x, c.x#c.x, appears in both the first and second stages of the
regression because it is a common exogenous regressor. The estimated
coefficient on c.x#c.x is very close to zero and not statistically significant
in the regression of y1, as it should be. Including it in the fitted model, and
not the true model, has no deleterious effect.
10.4 GMM
Another way to fit models that control for endogeneity is with GMM. The
essence of GMM is to write down moment conditions of the model and to
replace population moments with their sample analogues. For example, the simple
mean, $\mu$, of a random variable $y$ has the property that the expected value of
the difference between $y$ and the mean, $\mu$, is zero; that is, $E(y - \mu) = 0$.
Inserting the sample analogue and solving yields the familiar result that the estimated
mean, $\hat{\mu}$, is the simple mean of $y$. GMM can be used to fit all the estimators
in this book except nonparametric kernel estimators.
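As a minimal illustration with the generated data, the mean of y1 can be
estimated from this single moment condition:

    gmm (y1 - {mu})    // E(y1 - mu) = 0; the constant is the default instrument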
First, we show how to reproduce 2SLS results with GMM for the artificial-
data example above. However, there is a slight difference in the standard
errors, a difference explained below. The ivregress command can
estimate the model by GMM directly by specifying gmm.
The results are nearly the same as before, with only a slight difference
in the standard errors (and corresponding test statistics and p-values). The
difference is that the GMM standard errors are smaller by a factor of
$\sqrt{(N - k)/N}$.
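For reference, the command is:

    ivregress gmm y1 x (y2 = w)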
GMM can be estimated with unbalanced instrument sets. For example,
suppose that there are three instruments, but some observations are
missing for the third instrument. In 2SLS, one would have to either drop an
instrument or use a subsample of the data with no missing data. Either
way, information would be lost. In contrast, GMM can use whatever
information is available, using two instruments for some observations and
three for others.
Another advantage is that GMM produces correct standard errors for the 2SRI
procedure in one step.
estimate multiple equations simultaneously. This can be useful in health
economics for fitting two-part models all in one command. The Poisson
count model is especially easy to fit with GMM; therefore, Poisson with IVs
is also straightforward. However, other count models are not as easy to
implement in GMM.
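A hedged sketch of the exponential-mean moment condition for a Poisson-
type model, with illustrative variable names and exogenous instruments:

    * moment condition: E[{y - exp(x'b)} z] = 0
    gmm (use_off - exp({xb: age female} + {b0})), instruments(age female)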
10.5 Stata resources
One Stata command to fit linear models with IVs is ivregress, which
can estimate the model by 2SLS, limited-information maximum likelihood, or
GMM. The estat commands, part of the ivregress postestimation
commands, make it easy to run statistical tests of the main assumptions of
strength and validity of the instruments. In addition, Stata has a unified set
of commands, called ERM, that allow for estimation of linear and some
nonlinear models, where the covariates can be exogenous or endogenous.
The basic command for linear models is eregress, and the command that
estimates treatment effects is etregress.
For nonlinear models, Stata will estimate 2SRI for probit models with
the ivprobit command and the twostep option, for tobit models with the
ivtobit command and the twostep option, and for Poisson models with
the ivpoisson command and the cfunction option. The ERM commands
can be used for probit and ordered probit models with endogenous
covariates. The basic command for probit models is eprobit. The results
from ivprobit without the twostep option are identical to those from eprobit,
although the syntax is slightly different.
Stata has extensive capabilities for fitting GMM models, not just the GMM
version of linear IVs. The gmm command (as opposed to specifying gmm with
ivregress) can estimate multiple equations simultaneously.
Chapter 11
Design effects
11.1 Introduction
The literature on survey design issues and statistics for data from
complex surveys is large and detailed (for example, Skinner, Holt, and
Smith [1989]). On the other hand, the discussion of these issues in
standard econometrics textbooks is sparse; exceptions include Cameron
and Trivedi (2005), Deaton (1997), and Wooldridge (2010). Our
objective here is not to survey that entire field of literature but rather to
provide an introduction to the issues, intuition about the consequences of
ignoring design effects, and some basic approaches to control for design
effects through examples.
Trivedi (2005) note, however, that the effect of weighting tends to be
much smaller in the regression context where the focus is on the
relationship between a covariate and an outcome.
analyses.
11.2 Features of sampling designs
Sampling designs for large surveys can be quite complex, but most of
them share two features. First, observations are not sampled with equal
probability; thus each observation is associated with a weight that indicates
its relative importance in the sample relative to the population. Second,
observations are not all sampled independently but instead are sampled in
clusters. Stratification by subgroup is one important kind of clustering.
Then, each observation in the sample is associated with a cluster identifier.
Ignoring both weights and clusters, that is, treating the data as a simple
random sample, can lead to misleading statistics and inference.
11.2.1 Weights
There are many types of weights that can be associated with a survey.
Perhaps the most common is the sampling weight, which is the
inverse of the probability of being included in the sample under
the sampling design. Therefore, observations that are oversampled will
have lower weights than observations that are undersampled. In addition,
postsampling adjustments to the weights are often made to adjust for
deviations of the data-collection scheme from the original design. In Stata,
pweights are sampling weights. Commands that allow pweights typically
provide a vce(cluster clustvar) option, described below. Under many
sampling designs, the sum of the sampling weights will equal the relevant
population total instead of the sample size.
Many Stata commands also allow one or more of three additional types
of weights: fweights, aweights, and iweights. We briefly describe the
application of each below, but note that they are not generally considered
as arising from complex survey methodology. Frequency weights
(fweights) are integers representing the number of observations each
sampled observation really represents. Analytic weights (aweights) are
typically appropriate when each observation in the data is a summary
statistic, such as the count or average, over a group of observations or to
address issues of heteroskedasticity. The prototypical example involves
rates. For example, consider a county-level dataset in which
each observation consists of rates that measure socioeconomic
characteristics of people in the county in a particular year. Then, the
weighting variable contains the number of individuals over which the
average was calculated. Finally, most Stata commands allow the user to
specify an importance weight (iweight). The iweight has no formal
statistical definition but is assumed to reflect the importance of each
observation in a sample.
final survey sample. Once these groups have been defined, researchers
sample from each group, as if it were independent of all the other groups.
For example, if a sample is stratified on gender, then men and women are
sampled independent of one another. Often, sampling weights are subject
to poststratification, which is a method for adjusting the sampling weights
to account for underrepresented groups in the population, often due to
systematic refusal or nonresponse of some sort (Skinner, Holt, and
Smith 1989).
Most Stata commands that produce inferential statistics allow for the
vce(cluster clustvar) option, where clustvar is the variable that defines
the clusters. When specified, this option changes the formula for the
standard errors. The “sandwich” formula allows for correlation among
observations within clusters but assumes independence of observations
across clusters. Typically, cluster-corrected standard errors are larger than
the corresponding naïve ones (Moulton 1986, 1990). If the variable in
question varies independently within clusters, there will be almost no
correction. If observations are negatively correlated within a cluster, the
adjustment can make standard errors smaller; however, this circumstance
is rare.
The data used to analyze natural experiments are often obtained from
administrative databases, which are not collected using complex survey
procedures. Nevertheless, issues of clustering and attrition may also be
extremely important in statistical analyses of such data. For example,
suppose we are interested in evaluating the effects of a new surgical
technique for a specific health condition that has been implemented in
some, but not all, hospitals. The intervention is applied at the hospital level
—that is, all patients in the treated hospitals are subject to the new
technique, while all patients in the control hospitals are subject to the old
technique. The data consist of retrospective administrative records of all
patients with that diagnosis from the population of hospitals obtained
before and after the new technique was implemented. One possible way to
estimate the treatment effect would be to use a difference-in-differences
method, comparing the patients in treatment hospitals with those in control
hospitals, while also controlling for trends over time.
effects on estimates and inference.
11.3 Methods for point estimation and inference
The Stata User’s Guide describes the estimators for additional, progressively
more complex sampling designs.
Let $w_i$ denote the normalized or unnormalized weights, let $w$ denote the vector of
weights, let $W = \mathrm{diag}(w)$, and let $X$ be the $N \times k$ matrix of covariates. The
goal is to estimate the vector of parameters, $\beta$, in the linear model
$y = X\beta + \varepsilon$. Then, the estimated weighted least-squares parameters are
found by the following formula:

$\hat{\beta}_{\mathrm{WLS}} = (X'WX)^{-1} X'Wy$
In all three of these examples, weights (but not clustering) affect the
point estimates. As we will demonstrate in section 11.4, Stata makes it
easy to incorporate weights into all of these estimators.
Adjusting for weights and for clustering changes the standard errors of the
estimates. The most commonly applied method to obtain the covariance
matrix of $\hat{\beta}$ involves a Taylor series-based linearization (popularly known
as the delta method), in which weights and clustering are easily
incorporated. This is implemented as a standard option in Stata. After
declaring the survey design with the svyset command, use the
vce(cluster clustvar) option.
When the parameters of interest are complex functions of the model
parameters, the linearized variance estimation may not be convenient. In
such situations, bootstrap or jackknife variance estimation can be
important nonparametric ways to obtain standard errors. With few
assumptions, bootstrap and jackknife resampling techniques provide a way
of estimating standard errors and other measures of statistical precision
and are especially convenient when no standard formula is available. We
begin with the bootstrap, which is described in detail in Cameron
and Trivedi (2005).
Both the bootstrap and jackknife methods are easily adjusted to deal
with clustering. In the context of survey designs with clustering, the unit
of observation for resampling in the bootstrap and jackknife is a cluster or
a PSU. If the survey design also involves sampling weights, both
bootstrap and jackknife methods become considerably more complex to
implement. For each replication, the sampling weights need to be adjusted,
because some clusters may be repeated, while others may not be in the
sample in the case of the bootstrap, or because one cluster is dropped from
the replicate sample in the case of the jackknife. Some complex surveys
provide bootstrap or jackknife replicate weights, in which case those
methods can be implemented in the complex survey context using svy
bootstrap or svy jackknife. If the resampled weights are not provided,
the researcher must calculate those weights. This requires in-depth
knowledge of the survey design and the way in which the weights were
originally constructed.
11.4 Empirical examples
Before any commands can be invoked with the survey prefix, the
survey features must be associated with the dataset using the svyset
command. In other words, svyset is required to “set up” the survey design
in the dataset for use in the svy: commands. The syntax identifies the
sample weight, wtdper; a PSU variable, varpsu; and a stratum variable,
varstr.
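That is:

    svyset varpsu [pweight=wtdper], strata(varstr)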
As we have described above, PSU and strata are incorporated in the
survey design in MEPS to ensure geographic and racial representation.
These selection criteria also give rise to natural clustering units. Given our
choice of PSU and strata, we can create a variable to identify unique
clusters of observations in the dataset by grouping observations by unique
values of the PSU and strata identifiers. Output from codebook below
shows that there are 448 clusters.
The first example shows how the estimate of a sample mean might change
when incorporating different design features. Normally, researchers might
consider using summarize to obtain sample means and other sample
statistics. However, summarize is not survey aware and thus cannot be
used with the svy prefix. Instead, use the mean command, because it is
survey aware and can be implemented with sampling weights and cluster-
adjusted standard errors, even without using the svy prefix.
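A sketch of the progression, with a cluster identifier (here called clustid)
built from the strata and PSU variables:

    egen clustid = group(varstr varpsu)                  // unique strata-PSU clusters
    mean exp_tot                                         // no design adjustments
    mean exp_tot, vce(cluster clustid)                   // cluster-adjusted SEs only
    mean exp_tot [pweight=wtdper]                        // weights only
    mean exp_tot [pweight=wtdper], vce(cluster clustid)  // weights and clustering
    svy: mean exp_tot                                    // fully survey aware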
(race_bl_pct). To facilitate comparison of the estimates from the different
methods, we do not show the estimates one by one but instead accumulate
them into a table. In the table, the first estimates (noadjust) do not take
any survey features into account. The second set of estimates (cluster)
are identical to the first, because the adjustment to standard errors due to
clustering has no effect on the point estimates. Sampling weights are
incorporated into the estimates shown in the third, fourth, and fifth
columns. The third set of estimates (weights) incorporate only weights,
while the fourth (clust_wgt) incorporates weights and cluster adjustments
to standard errors. The fifth set of estimates (survey) is based on fully
survey-aware estimation that controls for weights, clustering, and
stratification.
We find that mean spending for nonblacks is $3,731 and mean spending
for blacks is $3,402. The difference in spending is not statistically
significant at the traditional 5% level, but it is significant at the 10% level.
11.4.3 Weighted least-squares regression
methods, we accumulate regression results into a table. The first regression
(robust) does not account for any design features but does estimate robust
standard errors. The second set of estimates (cluster) are obtained using
ordinary least squares, but the standard errors take clustering into account.
The third set of estimates (weights) are calculated using weighted least
squares; the weights are probability or sampling weights. These estimates
do not account for clustering. The fourth set of estimates (clust_wgt)
incorporates both cluster-robust standard errors and sampling weights as
options of regress but without the svy: prefix. The final specification
(survey) uses the svy: prefix to produce fully survey-aware estimates that
control for weights, clustering, and stratification.
The results show that, as expected, adjusting for weights changes the
point estimates, while adjusting for clustering and stratification does not.
The point estimates are the same in columns 1 and 2 (no weights) but
different from the point estimates in columns 3–5 (with weights).
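The five specifications can be reproduced along these lines, with the
covariate list abbreviated:

    regress exp_tot age female, vce(robust)
    regress exp_tot age female, vce(cluster clustid)
    regress exp_tot age female [pweight=wtdper]
    regress exp_tot age female [pweight=wtdper], vce(cluster clustid)
    svy: regress exp_tot age female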
11.4.4 Weighted Poisson count model
regression is specified this way. The results, tabulated below, show that the
marginal effects of age are not noticeably different across procedures;
neither are the associated standard errors. The incremental effects of being
female and their associated standard errors increase modestly once
sampling weights are taken into account.
11.5 Conclusion
11.6 Stata resources
See the Stata Survey Data Reference Manual for all commands related to
survey data. Prior to using any other survey-related commands, make Stata
aware of survey design features of the dataset using svyset. Once that is
done, the svy prefix incorporates those features into the estimation and
inference for most Stata commands and many user-written packages.
References
Abrevaya, J. 2002. Computing marginal effects in the Box–Cox model.
Econometric Reviews 21: 383–393.
Ai, C., and E. C. Norton. 2000. Standard errors for the retransformation
problem with heteroscedasticity. Journal of Health Economics 19:
697–718.
_________. 2003. Interaction terms in logit and probit models.
Economics Letters 80: 123–129.
_________. 2008. A semiparametric derivative estimator in log
transformation models. Econometrics Journal 11: 538–553.
Akaike, H. 1970. Statistical predictor identification. Annals of the
Institute of Statistical Mathematics 22: 203–217.
Angrist, J. D., and A. B. Krueger. 2001. Instrumental variables and the
search for identification: From supply and demand to natural
experiments. Journal of Economic Perspectives 15(4): 69–85.
Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics:
An Empiricist’s Companion. Princeton: Princeton University Press.
Arlot, S., and A. Celisse. 2010. A survey of cross-validation procedures
for model selection. Statistics Surveys 4: 40–79.
Barnett, S. B. L., and T. A. Nurmagambetov. 2011. Costs of asthma in
the United States: 2002–2007. Journal of Allergy and Clinical
Immunology 127: 145–152.
Bassett, G., Jr., and R. Koenker. 1982. An empirical quantile function
for linear models with iid errors. Journal of the American Statistical
Association 77: 407–415.
Basu, A., and P. J. Rathouz. 2005. Estimating marginal and incremental
effects on health outcomes using flexible link and variance function
models. Biostatistics 6: 93–109.
Baum, C. F. 2006. An Introduction to Modern Econometrics Using
Stata. College Station, TX: Stata Press.
Belotti, F., P. Deb, W. G. Manning, and E. C. Norton. 2015. twopm:
Two-part models. Stata Journal 15: 3–20.
Berk, M. L., and A. C. Monheit. 2001. The concentration of health care
expenditures, revisited. Health Affairs 20: 9–18.
Bertrand, M., E. Duflo, and S. Mullainathan. 2004. How much should
we trust differences-in-differences estimates? Quarterly Journal of
Economics 119: 249–275.
Bitler, M. P., J. B. Gelbach, and H. W. Hoynes. 2006. Welfare reform
and children’s living arrangements. Journal of Human Resources 41:
1–27.
Blough, D. K., C. W. Madden, and M. C. Hornbrook. 1999. Modeling
risk using generalized linear models. Journal of Health Economics
18: 153–171.
Blundell, R. W., and R. J. Smith. 1989. Estimation in a class of
simultaneous equation limited dependent variable models. Review of
Economic Studies 56: 37–57.
_________. 1994. Coherency and estimation in simultaneous models
with censored or qualitative dependent variables. Journal of
Econometrics 64: 355–373.
Bound, J., D. A. Jaeger, and R. M. Baker. 1995. Problems with
instrumental variables estimation when the correlation between the
instruments and the endogenous explanatory variable is weak.
Journal of the American Statistical Association 90: 443–450.
Box, G. E. P., and D. R. Cox. 1964. An analysis of transformations.
Journal of the Royal Statistical Society, Series B 26: 211–252.
Box, G. E. P., and N. R. Draper. 1987. Empirical Model-building and
Response Surfaces. New York: Wiley.
Buntin, M. B., and A. M. Zaslavsky. 2004. Too much ado about two-
part models and transformation? Comparing methods of modeling
Medicare expenditures. Journal of Health Economics 23: 525–542.
Cameron, A. C., J. B. Gelbach, and D. L. Miller. 2008. Bootstrap-based
improvements for inference with clustered errors. Review of
Economics and Statistics 90: 414–427.
_________. 2011. Robust inference with multiway clustering. Journal
of Business and Economic Statistics 29: 238–249.
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods
and Applications. New York: Cambridge University Press.
_________. 2010. Microeconometrics Using Stata. Rev. ed. College
Station, TX: Stata Press.
_________. 2013. Regression Analysis of Count Data. 2nd ed.
Cambridge: Cambridge University Press.
Cattaneo, M. D., D. M. Drukker, and A. D. Holland. 2013. Estimation
of multivalued treatment effects under conditional independence.
Stata Journal 13: 407–450.
Cattaneo, M. D., and M. Jansson. 2017. Kernel-based semiparametric
estimators: Small bandwidth asymptotics and bootstrap consistency.
Working Paper.
http://eml.berkeley.edu/~jansson/Papers/CattaneoJansson_BootstrappingSemiparametrics.pdf.
Cawley, J., and C. Meyerhoefer. 2012. The medical care costs of
obesity: An instrumental variables approach. Journal of Health
Economics 31: 219–230.
Claeskens, G., and N. L. Hjort. 2008. Model Selection and Model
Averaging. Cambridge: Cambridge University Press.
Cole, J. A., and J. D. F. Sherriff. 1972. Some single- and multi-site
models of rainfall within discrete time increments. Journal of
Hydrology 17: 97–113.
Cook, P. J., and M. J. Moore. 1993. Drinking and schooling. Journal of
Health Economics 12: 411–429.
Cox, N. J. 2004. Speaking Stata: Graphing model diagnostics. Stata
Journal 4: 449–475.
Cragg, J. G. 1971. Some statistical models for limited dependent
variables with application to the demand for durable goods.
Econometrica 39: 829–844.
Dall, T. M., Y. Zhang, Y. J. Chen, W. W. Quick, W. G. Yang, and
J. Fogli. 2010. The economic burden of diabetes. Health Affairs 29:
297–303.
Deaton, A. 1997. The Analysis of Household Surveys: A
Microeconometric Approach to Development Policy. Washington, DC:
The World Bank.
Deb, P. 2007. fmm: Stata module to estimate finite mixture models.
Statistical Software Components S456895, Department of
Economics, Boston College.
https://ideas.repec.org/c/boc/bocode/s456895.html.
Deb, P., and P. K. Trivedi. 1997. Demand for medical care by the
elderly: A finite mixture approach. Journal of Applied Econometrics
12: 313–336.
_________. 2002. The structure of demand for health care: latent class
versus two-part models. Journal of Health Economics 21: 601–625.
Donald, S. G., D. A. Green, and H. J. Paarsch. 2000. Differences in
wage distributions between Canada and the United States: An
application of a flexible estimator of distribution functions in the
presence of covariates. Review of Economic Studies 67: 609–633.
Dow, W. H., and E. C. Norton. 2003. Choosing between and
interpreting the Heckit and two-part models for corner solutions.
Health Services and Outcomes Research Methodology 4: 5–18.
Dowd, B. E., W. H. Greene, and E. C. Norton. 2014. Computation of
standard errors. Health Services Research 49: 731–750.
Drukker, D. M. 2014. mqgamma: Stata module to estimate quantiles of
potential-outcome distributions. Statistical Software Components
S457854, Department of Economics, Boston College.
https://ideas.repec.org/c/boc/bocode/s457854.html.
Efron, B. 1988. Logistic regression, survival analysis, and the Kaplan–
Meier curve. Journal of the American Statistical Association 83: 414–
425.
Enami, K., and J. Mullahy. 2009. Tobit at fifty: A brief history of
Tobin’s remarkable estimator, of related empirical methods, and of
limited dependent variable econometrics in health economics. Health
Economics 18: 619–628.
Ettner, S. L., G. Denmead, J. Dilonardo, H. Cao, and A. J. Belanger.
2003. The impact of managed care on the substance abuse treatment
patterns and outcomes of Medicaid beneficiaries: Maryland’s health
choice program. Journal of Behavioral Health Services and Research
30: 41–62.
Ettner, S. L., R. G. Frank, T. G. McGuire, J. P. Newhouse, and E. H.
Notman. 1998. Risk adjustment of mental health and substance abuse
payments. Inquiry 35: 223–239.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its
Applications. New York: Chapman & Hall/CRC.
Fenton, J. J., A. F. Jerant, K. D. Bertakis, and P. Franks. 2012. The cost
of satisfaction: A national study of patient satisfaction, health care
utilization, expenditures, and mortality. Archives of Internal Medicine
172: 405–411.
van Garderen, K. J., and C. Shah. 2002. Exact interpretation of dummy
variables in semilogarithmic equations. Econometrics Journal 5: 149–
159.
Garrido, M. M., P. Deb, J. F. Burgess, Jr., and J. D. Penrod. 2012.
Choosing models for health care cost analyses: Issues of nonlinearity
and endogeneity. Health Services Research 47: 2377–2397.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge, UK: Cambridge
University Press.
Gilleskie, D. B., and T. A. Mroz. 2004. A flexible approach for
estimating the effects of covariates on health expenditures. Journal of
Health Economics 23: 391–418.
Goldberger, A. S. 1981. Linear regression after selection. Journal of
Econometrics 15: 357–366.
Gourieroux, C., A. Monfort, and A. Trognon. 1984a. Pseudo maximum
likelihood methods: Applications to Poisson models. Econometrica
52: 701–720.
_________. 1984b. Pseudo maximum likelihood methods: Theory.
Econometrica 52: 681–700.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River,
NJ: Prentice Hall.
Holmes, A. M., and P. Deb. 1998. Provider choice and use of mental
health care: implications for gatekeeper models. Health Services
Research 33: 1263–1284.
Hosmer, D. W., and S. Lemesbow. 1980. Goodness of fit tests for the
multiple logistic regression model. Communications in Statistics—
Theory and Methods 9: 1043–1069.
Hurd, M. 1979. Estimation in truncated samples when there is
heteroscedasticity. Journal of Econometrics 11: 247–258.
Imbens, G. W. 2004. Nonparametric estimation of average treatment
effects under exogeneity: A review. Review of Economics and
Statistics 86: 4–29.
Imbens, G. W., and D. B. Rubin. 2015. Causal Inference for Statistics,
Social, and Biomedical Sciences: An Introduction. New York:
Cambridge University Press.
Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in
the econometrics of program evaluation. Journal of Economic
Literature 47: 5–86.
Jones, A. M. 2000. Health econometrics. In Handbook of Health
Economics, vol. 1B, ed. A. J. Culyer and J. P. Newhouse, 265–344.
Amsterdam: Elsevier.
_________. 2010. Models for health care. Working Papers 10/01,
Health, Econometrics and Data Group.
Kadane, J. B., and N. A. Lazar. 2004. Methods and criteria for model
selection. Journal of the American Statistical Association 99: 279–
290.
Katz, R. W. 1977. Precipitation as a chain-dependent process. Journal
of Applied Meteorology 16: 671–676.
Keeler, E. B., and J. E. Rolph. 1988. The demand for episodes of
treatment in the health insurance experiment. Journal of Health
Economics 7: 337–367.
Kennedy, P. E. 1981. Estimation with correctly interpreted dummy
variables in semilogarithmic equations. American Economic Review
71: 801.
King, G. 1988. Statistical models for political science event counts: Bias
in conventional procedures and evidence for the exponential Poisson
regression model. American Journal of Political Science 32: 838–863.
Koenker, R., and G. Bassett, Jr. 1978. Regression quantiles.
Econometrica 46: 33–50.
Koenker, R., and K. F. Hallock. 2001. Quantile regression. Journal of
Economic Perspectives 15: 143–156.
Koenker, R., and J. A. F. Machado. 1999. Goodness of fit and related
inference processes for quantile regression. Journal of the American
Statistical Association 94: 1296–1310.
Konetzka, R. T. 2015. In memoriam: Willard G. Manning, 1946–2014.
American Journal of Health Economics 1: iv–vi.
Lambert, D. 1992. Zero-inflated Poisson regression, with an application
to defects in manufacturing. Technometrics 34: 1–14.
Leroux, B. G. 1992. Consistent estimation of a mixing distribution.
Annals of Statistics 20: 1350–1360.
Leung, S. F., and S. Yu. 1996. On the choice between sample selection
and two-part models. Journal of Econometrics 72: 197–229.
Lindrooth, R. C., E. C. Norton, and B. Dickey. 2002. Provider selection,
bargaining, and utilization management in managed care. Economic
Inquiry 40: 348–365.
Lindsay, B. G. 1995. Mixture Models: Theory, Geometry and
Applications. NSF-CBMS regional conference series in probability and
statistics, Institute of Mathematical Statistics.
Long, J. S., and J. Freese. 2014. Regression Models for Categorical
Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata
Press.
Machado, J. A. F., and J. M. C. Santos Silva. 2005. Quantiles for
counts. Journal of the American Statistical Association 100: 1226–
1237.
Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in
Econometrics. Cambridge: Cambridge University Press.
_________. 1985. A survey of the literature on selectivity bias as it pertains to health care markets. Advances in Health Economics and Health Services Research 6: 3–26.
Manning, W. G. 1998. The logged dependent variable,
heteroscedasticity, and the retransformation problem. Journal of
Health Economics 17: 283–295.
Manning, W. G., N. Duan, and W. H. Rogers. 1987. Monte Carlo
evidence on the choice between sample selection and two-part
models. Journal of Econometrics 35: 59–82.
Manning, W. G., and J. Mullahy. 2001. Estimating log models: To
transform or not to transform? Journal of Health Economics 20: 461–
494.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd
ed. London: Chapman & Hall/CRC.
McLachlan, G., and D. Peel. 2000. Finite Mixture Models. New York:
Wiley.
Mihaylova, B., A. H. Briggs, A. O’Hagan, and S. G. Thompson. 2011.
Review of statistical methods for analysing healthcare resources and
costs. Health Economics 20: 897–916.
Miranda, A. 2006. qcount: Stata program to fit quantile regression models for count data. Statistical Software Components S456714, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s456714.html.
_________. 2015. In memoriam: Willard G. Manning, 1946–2014.
Health Economics 24: 253–257.
Murray, M. P. 2006. Avoiding invalid instruments and coping with
weak instruments. Journal of Economic Perspectives 20: 111–132.
Nelson, C. R., and R. Startz. 1990. Some further results on the exact
small sample properties of the instrumental variable estimator.
Econometrica 58: 967–976.
Newey, W. K., J. L. Powell, and F. Vella. 1999. Nonparametric
estimation of triangular simultaneous equations models.
Econometrica 67: 565–603.
Newhouse, J. P., and M. McClellan. 1998. Econometrics in outcomes
research: The use of instrumental variables. Annual Review of Public
Health 19: 17–34.
Newhouse, J. P., and C. E. Phelps. 1976. New estimates of price and
income elasticities of medical care services. In The Role of Health
Insurance in the Health Services Sector, ed. R. N. Rosett, 261–320.
Cambridge, MA: National Bureau of Economic Research.
Newson, R. 2003. Confidence intervals and p-values for delivery to the end user. Stata Journal 3: 245–269.
Norton, E. C., H. Wang, and C. Ai. 2004. Computing interaction effects
and standard errors in logit and probit models. Stata Journal 4: 154–
167.
Park, R. E. 1966. Estimation with heteroscedastic error terms.
Econometrica 34: 888.
Picard, R. R., and R. D. Cook. 1984. Cross-validation of regression
models. Journal of the American Statistical Association 79: 575–583.
Pohlmeier, W., and V. Ulrich. 1995. An econometric model of the two-
part decisionmaking process in the demand for health care. Journal of
Human Resources 30: 339–361.
Poirier, D. J., and P. A. Ruud. 1981. On the appropriateness of
endogenous switching. Journal of Econometrics 16: 249–256.
Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics
9: 705–724.
Racine, J., and Q. Li. 2004. Nonparametric estimation of regression
functions with both categorical and continuous data. Journal of
Econometrics 119: 99–130.
Ramsey, J. B. 1969. Tests for specification errors in classical linear
least-squares regression analysis. Journal of the Royal Statistical
Society, Series B 31: 350–371.
Rao, C. R., and Y. Wu. 2001. On model selection. Lecture Notes-
Monograph Series 38: 1–64.
Roy, A., P. Sheffield, K. Wong, and L. Trasande. 2011. The effects of outdoor air pollutants on the costs of pediatric asthma hospitalizations in the United States, 1999–2007. Medical Care 49: 810–817.
Rubin, D. B. 1974. Estimating causal effects of treatments in
randomized and nonrandomized studies. Journal of Educational
Psychology 66: 688–701.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of
Statistics 6: 461–464.
Sin, C.-Y., and H. White. 1996. Information criteria for selecting
possibly misspecified parametric models. Journal of Econometrics
71: 207–225.
Skinner, C. J., D. Holt, and T. M. F. Smith. 1989. Analysis of Complex
Surveys. New York: Wiley.
Solon, G., S. J. Haider, and J. M. Wooldridge. 2015. What are we
weighting for? Journal of Human Resources 50: 301–316.
Staiger, D. O., and J. H. Stock. 1997. Instrumental variables regression
with weak instruments. Econometrica 65: 557–586.
Stock, J. H., J. H. Wright, and M. Yogo. 2002. A survey of weak
instruments and weak identification in generalized method of
moments. Journal of Business and Economic Statistics 20: 518–529.
Teicher, H. 1963. Identifiability of finite mixtures. Annals of
Mathematical Statistics 34: 1265–1269.
Terza, J. V., A. Basu, and P. J. Rathouz. 2008. Two-stage residual
inclusion estimation: Addressing endogeneity in health econometric
modeling. Journal of Health Economics 27: 531–543.
Tobin, J. 1958. Estimation of relationships for limited dependent
variables. Econometrica 26: 24–36.
Todorovic, P., and D. A. Woolhiser. 1975. A stochastic model of n-day
precipitation. Journal of Applied Meteorology 14: 17–24.
Vanness, D. J., and J. Mullahy. 2012. Moving beyond mean-based
evaluation of health care. In The Elgar Companion to Health
Economics, ed. A. M. Jones, 2nd ed., 563–575. Cheltenham, UK:
Edward Elgar Publishing Limited.
Veazie, P. J., W. G. Manning, and R. L. Kane. 2003. Improving risk
adjustment for Medicare capitated reimbursement using nonlinear
models. Medical Care 41: 741–752.
Vella, F., and M. Verbeek. 1999. Estimating and interpreting models
with endogenous treatment effects. Journal of Business and
Economic Statistics 17: 473–478.
van de Ven, W. P. M. M., and R. P. Ellis. 2000. Risk adjustment in
competitive health plan markets. In Handbook of Health Economics,
vol. 1A, ed. A. J. Culyer and J. P. Newhouse, 755–845. Amsterdam:
Elsevier.
Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-
nested hypotheses. Econometrica 57: 307–333.
Winkelmann, R. 2008. Econometric Analysis of Count Data. 5th ed.
Berlin: Springer.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and
Panel Data. 2nd ed. Cambridge, MA: MIT Press.
_________. 2014. Quasi-maximum likelihood estimation and testing for
nonlinear models with endogenous explanatory variables. Journal of
Econometrics 182: 226–234.
_________. 2016. Introductory Econometrics: A Modern Approach. 6th
ed. Boston, MA: Cengage Learning.
Author index
A
Ai, C., 2.5
B
Baker, R.M., 10.2.2
Barnett, S.B.L., 2.1
Bassett, G., 9.2
Basu, A., 5.2.1 , 5.8.4 , 5.10 , 10.2.4 , 10.3.1
Baum, C.F., 4.7
Belanger, A.J., 6.5
Belotti, F., 7.5.1 , 7.8
Berk, M.L., 5.1
Bertakis, K., 2.1
Bertrand, M., 11.1 , 11.2.3
Bitler, M.P., 2.7
Blough, D.K., 5.1 , 5.2.1 , 5.8.3
Blundell, R.W., 10.3.1
Bound, J.D., 10.2.2
Box, G.E., 1.5 , 6.1
Briggs, A.H., 2.1 , 7.2
Buntin, M.B., 7.6.3
Burgess, J.F., 10.3.1
C
Cameron, A.C., 1.1 , 1.4 , 2.1 , 4.1 , 7.1 , 8.1 , 8.3 , 10.1 , 10.4 , 11.1 ,
11.2.2 , 11.3.2 , 11.4.3
Cao, H., 6.5
Cattaneo, M.D., 4.7 , 9.4.1
Cawley, J., 2.1
Celisse, A., 2.6.2 , 8.6.2
Chen, Y.J., 2.1
Claeskens, G., 2.1 , 2.6
Cole, J., 7.2
Cook, P.J., 7.6.2
Cook, R.D., 2.6.2 , 8.6.2
Cox, D.R., 6.1
Cox, N.J., 4.7
Cragg, J.G., 7.2 , 7.6.2
D
Dall, T.M., 2.1
Deaton, A., 11.1
Deb, P., 7.5.1 , 7.6.2 , 7.8 , 9.3 , 9.6 , 10.3.1
Denmead, G., 6.5
Dickey, B., 6.5
Dilonardo, J., 6.5
Donald, S.G., 9.5
Dow, W.H., 7.5.4
Dowd, B.E., 4.3.3
Draper, N.R., 1.5
Drukker, D.M., 4.7 , 9.2.2
Duan, N., 6.3.1 , 6.5 , 7.2 , 7.2.1 , 7.4 , 7.5.2 , 7.5.4
Duflo, E., 11.1 , 11.2.3
E
Efron, B., 9.5
Ellis, R.P., 2.1
Enami, K., 7.6.1
Ettner, S.L., 6.5
F
Fan, J., 9.4
Fenton, J., 2.1
Fogli, J., 2.1
Frank, R.G., 6.5
Franks, P., 2.1
Freese, J., 1.4
G
Garrido, M.M., 10.3.1
Gelbach, J.B., 2.7 , 11.2.2
Gelman, A., 2.1 , 2.2
Gijbels, I., 9.4
Gilleskie, D.B., 7.2 , 9.5
Goldberger, A.S., 7.6.1 , 7.6.2
Gourieroux, C., 8.2.2
Green, D.A., 9.5
Greene, W.H., 1.1 , 2.1 , 4.3.3
Gurmu, S., 8.5.1 , 8.5.2
H
Halvorsen, R., 6.2.1
Hansen, L.P., 10.4
Hardin, J.W., 5.1 , 8.1
Hay, J.W., 7.4
Heckman, J.J., 2.1 , 7.3 , 7.3.1
Hilbe, J., 5.1 , 8.1
Hill, J., 2.1 , 2.2
Hjort, N.L., 2.1 , 2.6
Hoch, J.S., 2.1
Holland, A.D., 4.7
Holland, P.W., 2.1 , 2.2
Holmes, A., 7.6.2
Holt, D., 11.1 , 11.2.2
Hornbrook, M.C., 5.1 , 5.2.1 , 5.8.3
Hosmer, D.W., 2.6 , 4.6 , 4.6.3
Hoynes, H.W., 2.7
Hurd, M., 7.6.1 , 7.6.2
I
Imbens, G.W., 2.1
J
Jaeger, D.A., 10.2.2
Jansson, M., 9.4.1
Jerant, A., 2.1
Jones, A.M., 1.1 , 7.4
K
Kadane, J.B., 2.1 , 2.6 , 2.6.1
Kane, R.L., 6.5
Katz, R.W., 7.2
Keeler, E.B., 8.4.1
Kennedy, P.E., 6.2.1
King, G., 8.1
Koenker, R., 9.2 , 9.2.1
Konishi, S., 2.1 , 2.6
Krueger, A.B., 10.1
L
Lambert, D., 8.4.2
Lazar, N.A., 2.1 , 2.6 , 2.6.1
Lemeshow, S., 2.6 , 4.6 , 4.6.3
Leroux, B.G., 2.6.1 , 8.6.1 , 9.3
Leung, S.F., 7.4
Li, Q., 9.4
Lindrooth, R.C., 6.5
Lindsay, B.G., 9.3
Long, J.S., 1.4
M
Machado, J.A.F., 9.2.1 , 9.2.2
Maddala, G.S., 7.3 , 7.4
Madden, C.W., 5.1 , 5.2.1 , 5.8.3
Manning, W.G., 5.8.3 , 6.1 , 6.3.2 , 6.5 , 7.2 , 7.4 , 7.5.1 , 7.5.2 , 7.5.4 , 7.8
McClellan, M., 10.1
McCullagh, P., 5.1 , 5.2.1 , 8.2.2 , 9.3
McGuire, T.G., 6.5
McLachlan, G., 9.3
Meyerhoefer, C., 2.1
Mihaylova, B., 7.2
Miller, D.L., 11.2.2
Monfort, A., 8.2.2
Monheit, A.C., 5.1
Moore, M.J., 7.6.2
Morris, C.N., 7.2 , 7.5.4
Moulton, B.R., 11.1 , 11.2.2 , 11.2.3
Mroz, T.A., 7.2 , 9.5
Mukerjee, R., 2.1 , 2.6
Mullahy, J., 2.7 , 5.1 , 5.8.3 , 7.6.1 , 7.6.3 , 8.4.2
Mullainathan, S., 11.1 , 11.2.3
Murray, M.P., 10.1
N
Nelder, J.A., 5.1 , 5.2.1 , 8.2.2 , 9.3
Nelson, C.R., 10.2.2 , 10.2.3
Newey, W.K., 10.2.4
Newhouse, J.P., 6.5 , 7.2 , 7.5.4 , 10.1
Norton, E.C., 2.5 , 4.3.3 , 6.5 , 7.5.1 , 7.5.2 , 7.5.4 , 7.8
Notman, E.H., 6.5
Nurmagambetov, T.A., 2.1
O
O’Hagan, A., 7.2
Olsen, R.J., 7.4
P
Paarsch, H.J., 9.5
Palmquist, R., 6.2.1
Park, R.E., 5.8.3
Peel, D., 9.3
Penrod, J.D., 10.3.1
Phelps, C.E., 7.2
Picard, R.R., 2.6.2 , 8.6.2
Pischke, J.-S., 10.1
Pohlmeier, W., 8.4.1
Poirier, D.J., 7.5.4
Powell, J.L., 10.2.4
Pregibon, D., 2.6 , 4.6
Q
Quick, W.W., 2.1
R
Racine, J., 9.4
Ramsey, J.B., 2.6 , 4.6 , 4.6.2
Rao, C.R., 2.1 , 2.6
Rathouz, P.J., 5.2.1 , 5.8.4 , 5.10 , 10.2.4 , 10.3.1
Robb, R., 2.1
Rogers, W.H., 7.4
Rolph, J.E., 8.4.1
Rubin, D.B., 2.1 , 2.2
Ruud, P.A., 7.5.4
S
Santos Silva, J.M.C., 9.2.2
Schwarz, G., 2.6.1 , 5.8.1 , 8.6.1
Shah, C., 6.2.1
Sherriff, J., 7.2
Sin, C.-Y., 2.6.1
Skinner, C.J., 11.1 , 11.2.2
Smith, R.J., 10.3.1
Smith, T.M.F., 11.1 , 11.2.2
Staiger, D.O., 10.2.2 , 10.2.3
Startz, R., 10.2.2 , 10.2.3
Stock, J.H., 10.2.2 , 10.2.3
T
Teicher, H., 9.3.2
Terza, J.V., 10.2.4 , 10.3.1
Thompson, S.G., 7.2
Tobin, J., 7.6.1
Todorovic, P., 7.2
Trivedi, P.K., 1.1 , 1.4 , 2.1 , 4.1 , 7.1 , 8.1 , 8.3 , 9.3 , 10.1 , 10.4 , 11.1 ,
11.3.2 , 11.4.3
Trognon, A., 8.2.2
U
Ulrich, V., 8.4.1
V
van de Ven, W.P.M.M., 2.1
van Garderen, K.J., 6.2.1
Vanness, D.J., 2.7
Veazie, P.J., 6.5
Vella, F., 10.2.4 , 10.3.1
Verbeek, M., 10.3.1
Vuong, Q.H., 8.3.1 , 8.6
Vytlacil, E.J., 2.1
W
Wang, H., 2.5
White, H., 2.6.1
Willan, A.R., 2.1
Winkelmann, R., 8.1
Wooldridge, J.M., 1.1 , 2.1 , 2.2 , 2.3.3 , 4.1 , 7.1 , 7.3 , 10.1 , 10.2.4 ,
10.3.1 , 11.1
Woolhiser, D.A., 7.2
Wright, J.H., 10.2.3
Wu, Y., 2.1 , 2.6
Y
Yang, W.G., 2.1
Yogo, M., 10.2.3
Yu, S., 7.4
Z
Zaslavsky, A.M., 7.6.3
Zhang, Y., 2.1
Subject index
2SLS, 10.2.2 , 10.2.2
assumptions for instruments, 10.2.2
balance test, 10.2.3
exogeneity test, 10.2.3
F test, 10.2.3
GMM estimation, 10.4
overidentifying restriction test, 10.2.3
specification test of instrument strength, 10.2.3
specification tests, 10.2.3 , 10.2.3
standard errors compared with OLS, 10.2.2
2SRI
control functions, 10.2.4
functional form, 10.3.1
linear models, 10.2.4 , 10.2.4
linear models versus nonlinear models, 10.2.4 , 10.3.1
nonlinear models, 10.3 , 10.3.1
A
actual outcomes
comparison between two-part and generalized tobit models, 7.5.5
generalized tobit models, 7.5.4
two-part models, 7.1 , 7.4.1 , 7.5.4 , 7.5.5
Agency for Healthcare Research and Quality, see AHRQ
AHRQ, 3.1 , 3.4 , 3.5
AIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5 , 5.8.1 , 5.8.1 , 5.10 , 9.3.1
comparison of NB1 and NB2, 8.3.1
count models, 8.6 , 8.6.1 , 8.6.1
does not suffer from multiple testing, 4.6.5
example where differs from BIC, 4.6.5
FMM, 9.3
MLE formula, 2.6.1 , 8.6.1
OLS formula, 2.6.1
robustness, 2.6.1
Akaike information criterion, see AIC
at((mean) _all) option, 8.2.3
ATE, 2.2 , 2.3.2 , 2.3.3 , 2.4 , 2.4.1 , 2.4.2 , 2.5 , 2.7
estimation, 2.3 , 2.3.3
laboratory experiment, 2.3.1
margins and teffects commands, 4.3.3
OLS, 4.3.3
ATET, 2.2 , 2.3.2 , 2.3.3 , 2.4 , 2.4.1 , 2.4.2 , 4.3.3
count models, 8.2.3
OLS, 4.3.3
average treatment effect on the treated, see ATET
average treatment effects, see ATE
aweights, 11.2.1 , 11.3.1
B
balance tests, 2SLS, 10.2.3
Bayesian information criterion, see BIC
bcskew0 command, 6.6
bfit command, 4.7
BIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5 , 5.8.1 , 5.8.1 , 5.10 , 9.3.1
comparison of NB1 and NB2, 8.3.1
count models, 8.6 , 8.6.1 , 8.6.1
does not suffer from multiple testing, 4.6.5
example where differs from AIC, 4.6.5
FMM, 9.3
MLE formula, 2.6.1 , 8.6.1
OLS formula, 2.6.1
robustness, 2.6.1
bootstrap command, 5.10 , 11.3.2
adjustment for clustering, 11.3.2
intuition, 11.3.2
nonparametric regression, 9.4.1
standard errors, 11.3.2
boxcox command, 5.8.2 , 5.10 , 6.6 , 7.8
Box-Cox models, 5.8.2 , 6.1 , 6.5 , 6.5.1
example, 6.5.1 , 6.5.1
formula, 6.5
skewness, 6.5
square root model, 6.5
two-part model, second part in, 7.4
C
CDE, 1.1 , 9.1 , 9.5 , 9.5
estimate with one logit model, 9.5
fit with series of logit models, 9.5
flexibility for different distributions, 9.5
for count models, 9.5
heterogeneity across bins, 9.5
homogeneity within bins, 9.5
intuition, 9.5
relation to two-part models, 9.5
two key assumptions, 9.5
censoring
causes, 8.5.2
comparison with truncation, 7.6.1
count models, 8.5 , 8.5.2 , 8.5.2
definition, 7.6.1
formulas, 8.5.2
right-censoring, 7.6.2
centaur, 7.6.1
centile command, 5.1
cfunction option, 10.5
chain rule, 7.5.2
clusters, 11.2.2 , 11.2.2
affect standard errors, 11.3
bootstrap, 11.3.2
in natural experiments, 11.2.3 , 11.2.3
jackknife, 11.3.2
PSU, 11.2.2
standard-error formula, 11.2.2
cnreg command, 7.8
codebook command, 11.4.1
complex survey design, see design effects
conditional density estimators, see CDE
conditional independence assumption, see ignorability
confounders, 10.1
consistency
GLM, 5.2.2
negative binomial models, 8.3
Poisson models, 8.1 , 8.2.1 , 8.2.2
contrast() option, 4.3.3 , 4.7 , 5.7 , 5.10
control functions, 10.1
2SRI, 10.2.4
functional form, 10.2.4
functions of residuals, 10.3.1
nonlinear functions, 10.2.4
count data, 1.1
count models, 1.1 , 1.3 , 8 , 8.8
AIC, 8.6.1 , 8.6.1
AIC and BIC, 8.3.1 , 8.6
BIC, 8.6.1 , 8.6.1
censoring, 8.5.2 , 8.5.2
cross-validation, 8.6.2 , 8.6.2
discreteness, 8.1
event counts, 8.1
event counts for NB1 and NB2, 8.3.1
FMM, 9.3.2
hurdle count models, 8.4 , 8.4.1
model comparisons, 8.6
model selection, 8.3.1 , 8.4.2 , 8.6.1 , 8.6.2
model selection examples, 8.6.1
negative binomial models, 8.3 , 8.3.1
Poisson models, 8.2 , 8.2.1
skewness, 8.1
truncation, 8.5 , 8.5.1
zero-inflated models, 8.4.2 , 8.4.2
counterfactual outcome, 2.2 , 2.3 , 2.3.1
cross-validation, 2.1 , 2.6.2 , 2.6.2
comparison with repeated random subsampling, 8.6.2
count models, 8.6.2 , 8.6.2
Current Population Survey, 11.4
D
data-generating process
generalized tobit, 7.4
hurdle or two-part model, 8.4.1
negative binomial, 8.3
Poisson distribution, 8.1 , 8.2.4
potential outcomes, 2.2
two-part model, 7.4
delta-method standard errors, 4.3.3 , 11.3.2
dental expenditures
generalized tobit model examples, 7.5.5
two-part model example, 7.5.5
describe command, 3.6
design effects, 1.1 , 1.3 , 2.1 , 3.2 , 11 , 11.6
clusters, 11.2.2 , 11.2.2
examples, 11.4 , 11.4.4
inference, 11.3 , 11.3.2
point estimation, 11.3 , 11.3.1
standard errors, 11.1 , 11.3.2 , 11.3.2
stratification, 11.2.2 , 11.2.2
survey design setup, 11.4.1 , 11.4.1
weighted Poisson count model, 11.4.4 , 11.4.4
weighted sample means, 11.4.2 , 11.4.2
weights, 11.2.1 , 11.2.1
WLS, 11.4.3 , 11.4.3
detail option, 3.6
difference-in-differences
standard errors, 11.1 , 11.2.3
with clusters, 11.2.3
difficult option, 9.3.2
discreteness, count models, 8.1
dispersion(constant) option, 8.3.1
dispersion(mean) option, 8.3.1
Duan’s smearing factor, 6.3.1 , 6.5
Abrevaya’s method, relation to, 6.5
formula, 6.3.1
dummy variable interpretation
Kennedy transformation, 6.2.1
log models, 6.2.1
E
economic theory, 1
EEE, 5.8.4 , 5.10
eintreg command, 10.3.1
endogeneity, 1.1 , 1.3 , 2.7 , 4.1 , 10 , 10.5
2SLS, 10.2.2 , 10.2.2
example with artificial data, 10.2.1
examples, 10.1
linear models, 10.2 , 10.2.5
nonlinear models, 10.3 , 10.3.1
OLS is inconsistent, 10.2.1 , 10.2.1
omitted variables, 10.1
eoprobit command, 10.3.1
eprobit command, 10.3.1 , 10.5
eregress command, 10.2.5 , 10.3 , 10.3.1 , 10.5
probit option, 10.3
ERM, 10.2.5 , 10.2.5
normality assumption, 10.2.5
error retransformation factor, 6.3 , 6.3.1
estat
commands, 10.5
endogenous command, 10.2.3
firststage command, 10.2.3
gof postestimation command, 4.6.3
ic command, 8.8 , 9.3.1
lcprob command, 9.3.1
ovtest postestimation command, 4.6.4 , 4.7
estimates estat command, 5.10
estimates stats command, 4.7 , 8.8
etregress command, 10.5
exogeneity test, 2SLS specification test, 10.2.3
expression() option, 8.4.1 , 9.2.1
extended estimating equations, see EEE
extended regression models, see ERM
extensive margin, two-part models, 7.1 , 7.5.2 , 7.5.5
F
F test, 2SLS specification test, 10.2.3
FIML, 10.2.5
comparison with LIML, 7.3.1
convergence failure, 7.5.5
formula for generalized tobit model, 7.3.1
generalized tobit models, 7.3.1 , 7.3.1
Heckman selection models, 7.3.1 , 7.3.1
finite mixture models, see FMM
first option, 10.2.2
FMM, 1.1 , 9.1 , 9.3 , 9.3.2
AIC and BIC, 9.3
density formula, 9.3
distribution choice, 9.3
example of count models, 9.3.2
example of healthcare expenditures, 9.3.1 , 9.3.1
example of healthcare use, 9.3.2 , 9.3.2
formula for posterior probability, 9.3
identification, 9.3
incremental effects, 9.3.1 , 9.3.2
interpretation of parameters, 9.3
marginal effects, 9.3.1 , 9.3.2
motivation, 9.3
predictions of means, 9.3.1
predictions of posterior probabilities, 9.3.1
theory, 9.3
two-component example, 9.3.1
fmm command, 9.3.1 , 9.3.2 , 9.6
full-information maximum likelihood, see FIML
fweights, 11.2.1
G
Gauss-Markov theorem, 4.1
generalized linear models, see GLM
generalized method of moments, see GMM
generalized tobit models, 7.3 , 7.3.1
actual outcomes, 7.5.4
censoring assumption, 7.3
censoring formulas, 7.3
comparison of FIML and LIML, 7.3.1
comparison with two-part models, 7.4 , 7.4.1
correlation of errors, 7.3
example comparing with two-part model, 7.5.5 , 7.5.5
examples showing similar marginal effects to two-part models, 7.4.1 ,
7.4.1
exclusion restrictions, 7.3 , 7.4.1
FIML and LIML, 7.3.1 , 7.3.1
formula for FIML, 7.3.1
formula for LIML, 7.3.1
formulas, 7.3 , 7.3
identification, 7.3 , 7.3.1 , 7.4 , 7.4.1 , 7.8
latent outcomes, 7.1 , 7.3 , 7.4 , 7.4.1 , 7.5.4 , 7.5.5
marginal effects examples, 7.5.5 , 7.5.5
marginal effects formulas, 7.5.4 , 7.5.4
motivation is for missing values, 7.4
normality assumption, 7.3
not generalization of two-part models, 7.4
three interpretations, 7.5.4 , 7.5.4
generate command, 3.6 , 6.6
GLM, 1 , 1.1 , 1.3 , 5 , 5.10
assumptions, 5.2.1 , 5.2.1 , 5.2.1
compared with log models, 5.4 , 5.9 , 6.1 , 6.3.1 , 6.4 , 6.4
consistency, 5.2.2
count data, 5.1
dichotomous outcomes, 5.1
distribution family, 5.1 , 5.2.1 , 5.2.2 , 5.8 , 5.8.3 , 5.8.4
framework, 5.2 , 5.2.2
generalization of OLS, 5.1
healthcare expenditure example, 5.3 , 5.3
heteroskedasticity, 5.1 , 5.9
identity link problems, 5.8.1
incremental effects, 5.2.2 , 5.6 , 5.7
index function, 5.2.1
interaction term example, 5.5 , 5.5
inverse of link function, 5.2.1
iteratively reweighted least squares, 5.2.2
link function, 5.1 , 5.2.1 , 5.2.2 , 5.8 , 5.8.4
link function test, 5.8.2 , 5.8.2
log link, 6.4
marginal effects, 5.2.2 , 5.6 , 5.7
parameter estimation, 5.2.2 , 5.2.2
Park test, modified, 5.8.3 , 5.8.3
Poisson models, 8.2.2
prediction example, 5.4 , 5.4
quasi–maximum likelihood, 5.2.2
square root link, 5.3 , 5.8.1 , 5.8.2
tests for link function and distribution family, 5.8 , 5.8.3
two-part model, second part in, 7.2 , 7.2.1 , 7.4 , 7.5.1 , 7.5.2
glm command, 5.3 , 5.7 , 5.10 , 7.8
GMM, 1 , 10.4 , 10.4
endogeneity, 10.4 , 10.4
example of 2SLS, 10.4
Hansen’s test, 10.4
multiple equations, 10.4
multiple instruments, 10.4
overidentification test, 10.4
standard errors different for 2SLS, 10.4
gmm option, 10.4 , 10.5
graphical checks, see visual checks
graphical tests, 1.1
grc1leg command, 4.7
H
Hansen’s test, 10.4
health econometric myths, 1.3 , 1.3
healthcare expenditure example, GLM, 5.3 , 5.3
healthcare expenditures, 1 , 1.1 , 1.2 , 2.1 , 2.5 , 3.1 , 3.3 , 5.1
skewed, 5.1 , 9.2.2
zero, mass at, 7.1
heckman command, 7.4.1 , 7.8
Heckman selection models, 7.3 , 7.3.1
censoring assumption, 7.3
censoring formulas, 7.3
comparison of FIML and LIML, 7.3.1
comparison with two-part models, 7.4 , 7.4.1
example comparing with two-part model, 7.5.5 , 7.5.5
examples showing similar marginal effects to two-part models, 7.4.1 ,
7.4.1
FIML and LIML, 7.3.1 , 7.3.1
formula for FIML, 7.3.1
formula for LIML, 7.3.1
formulas, 7.3 , 7.3
marginal effects examples, 7.5.5 , 7.5.5
marginal effects formulas, 7.5.4 , 7.5.4
motivation is for missing values, 7.4
not generalization of two-part models, 7.4
three interpretations, 7.5.4 , 7.5.4
heckprob command, 7.8
heterogeneous treatment effects, see treatment effects, heterogeneous
heteroskedasticity, 1.1
GLM, 5.1 , 5.9 , 6.1
log models and retransformation, 6.1 , 6.3.1 , 6.3.2 , 6.4
histogram command, 3.6
Hosmer-Lemeshow test, see modified Hosmer-Lemeshow test
hurdle count models, 8.4 , 8.4.1
compared with two-part models, 8.4.1
compared with zero-inflated models, 8.4.2
example of office visits, 8.4.1
first and second parts, 8.4.1
marginal effects, 8.4.1 , 8.4.1
motivation, 8.4.1
I
ignorability, 2.3 , 2.3.3 , 10.1
incremental effects, 1.1 , 2.5 , 2.5
design effects, 11.1
example for Poisson model, 8.2.3
FMM, 9.3.1 , 9.3.2
GLM, 5.6 , 5.7
graphical representation, 4.3.2 , 4.3.2
inconsistency in Poisson models, 8.2.4
linear regression model, 2.5
log models, 6.3.2 , 6.3.2
nonlinear regression model, 2.5
OLS, 4.1 , 4.3.1 , 4.3.1
Poisson models, 8.2.3
two-part model formulas, 7.5.2
two-part models, 7.2.1 , 7.2.1
zero-inflated models, 8.4.2
zeros, if mass at, 7.1
inpatient expenditures, 3.3
instrumental variables, 10.1
assumptions, 10.2.2
specification tests, 10.2.3 , 10.2.3
weak instruments, 10.2.2
intensive margin, two-part models, 7.1 , 7.5.2 , 7.5.5
interaction term, GLM example, 5.5 , 5.5
intreg command, 7.8
inverse Mills ratio, 7.3 , 7.3.1 , 7.5.5
ivpoisson command, 10.3.1 , 10.5
ivprobit command, 10.5
ivregress command, 10.2.2 , 10.2.4 , 10.2.5 , 10.4 , 10.5
ivregress postestimation commands, 10.2.3 , 10.5
ivtobit command, 10.3.1 , 10.5
iweights, 11.2.1
J
J test, see Hansen’s test
jackknife command, 11.3.2
adjustment for clustering, 11.3.2
intuition, 11.3.2
standard errors, 11.3.2
Jensen’s inequality, 6.3
K
k-fold cross-validation, see cross-validation
Kennedy transformation, 6.2.1
kurtosis, MEPS, 3.3
L
latent outcomes
generalized tobit models, 7.1 , 7.3 , 7.4 , 7.4.1 , 7.5.4 , 7.5.5
tobit models, 7.6.1
LEF, 8.1
negative binomial models are not LEF, 8.3
negative binomial models with fixed α, 8.3
Poisson models, 8.1 , 8.2.2
limited-information maximum likelihood, see LIML
LIML
comparison with FIML, 7.3.1
formula for generalized tobit model, 7.3.1
generalized tobit models, 7.3.1 , 7.3.1
Heckman selection models, 7.3.1 , 7.3.1
linear
exponential family, see LEF
regression models, see OLS
regression models, mathematical formulation, 4.2 , 4.2
link function test for GLM, 5.8.2 , 5.8.2
linktest command, 4.6.4 , 4.7
lnskew0 command, 6.6
local linear regression, see nonparametric regression
log models, 1.1 , 1.3 , 5.1 , 5.9 , 6 , 6.4 , 6.6
compared with Box-Cox models, 6.5 , 6.5
compared with GLM, 5.4 , 5.9 , 6.4 , 6.4
dummy variable interpretation, 6.2.1
error retransformation factors, 6.3 , 6.3.1 , 6.3.1
estimation and interpretation, 6.2.1 , 6.2.1
formula, 6.2.1
healthcare expenditure example, 6.2.1 , 6.2.1
marginal and incremental effects, 6.1 , 6.3.2 , 6.3.2
marginal effects affected by heteroskedasticity, 6.3.2
retransformation to raw scale, 6.3 , 6.3.2
two-part model, second part in, 7.2
logit command, 7.8 , 8.8
logit models
first part of
hurdle count model, 8.4.1
two-part model, 7.2 , 7.2.1 , 7.4 , 7.5.2
zero-inflated model, 8.4.2
weighted formula, 11.3.1
weights, 11.3.1
M
marginal effects, 1.1 , 2.5 , 2.5
design effects, 11.1
example for Poisson model, 8.2.3
FMM, 9.3.1 , 9.3.2
generalized tobit models, 7.5.4 , 7.5.4
GLM, 5.6 , 5.7
graphical representation, 4.3.2 , 4.3.2
Heckman selection models, 7.5.4 , 7.5.4
hurdle count models, 8.4.1 , 8.4.1
inconsistency in Poisson models, 8.2.4
log models, 6.1 , 6.3.2 , 6.3.2 , 6.3.2
log models affected by heteroskedasticity, 6.3.2
nonlinear regression model, 2.5
OLS, 4.1 , 4.3.1 , 4.3.1
Poisson models, 8.2.3
two-part models, 7.2.1 , 7.2.1
formulas for, 7.5.2 , 7.5.2
similar to generalized tobit models, 7.4.1 , 7.4.1
zero-inflated models, 8.4.2
zeros, if mass at, 7.1
margins command, 4.3.1 , 4.3.2 , 4.3.3 , 4.7 , 5.4 , 5.5 , 5.7 , 5.10 , 6.6 ,
7.5.1 , 7.5.3 , 7.5.5 , 8.2.3 , 8.3.1 , 8.4.1 , 9.2.1 , 9.3.1 , 9.3.2 , 9.4.1 , 11.4.4
marginsplot command, 4.3.2 , 4.7 , 5.5 , 5.10 , 9.4.1
maximum likelihood estimation, see MLE
mean command, 11.4.2
median regression, see quantile regression
Medical Expenditure Panel Survey, see MEPS
MEPS, 1.1 , 1.3 , 3 , 3.1 , 3.6
demographics, 3.4
expenditure and use variables, 3.3 , 3.3
explanatory variables, 3.4 , 3.4
health insurance, 3.4
health measures, 3.4
Household Component, 3.1
Household Survey, 11.2.2
oversampling and undersampling, 11.2.1
overview, 3.2 , 3.2
PSU, 11.4
PSUs based on counties, 11.2.2
sample dataset, 3.5 , 3.5
sample size, 3.2
sampling weights, 11.4
stratified multistage sample design, 11.2.2
study design, 11.1 , 11.2.1 , 11.4
survey design setup, 11.4.1
website, 3.1
mermaid, 7.6.1
misspecification
OLS, 4.1 , 4.4 , 4.4.2
exponential example, 4.4.2 , 4.4.2
quadratic example, 4.4.1 , 4.4.1
MLE, 1
Box-Cox models, 6.6
count models, 8.1
FMM, 9.3
Poisson models, 8.2.1 , 8.2.1
quasi–maximum likelihood for GLM, 5.2.2
weighted logit model, 11.3.1
model selection, 2.1 , 2.6 , 2.6.2
AIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5
BIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5
count models, 8.3.1 , 8.6.1 , 8.6.2
cross-validation, 2.6.2 , 2.6.2
graphical tests, 2.6
statistical tests, 2.6
model specification, 4.1
modified Hosmer-Lemeshow test, 2.6 , 4.6 , 4.6.3 , 4.6.3 , 7.7
mqgamma package, 9.2.2
myths, health econometric, see health econometric myths
N
National Health Interview Survey, 3.1 , 11.2.2
natural experiments
cluster, 11.2.3 , 11.2.3
weights, 11.2.3 , 11.2.3
NB1
definition, 8.3
example, 8.3.1
linear mean, 8.3
NB2
definition, 8.3
example, 8.3.1
quadratic mean, 8.3
nbreg command, 8.3.1 , 8.8
negative binomial models, 1.3 , 8.3 , 8.3.1
compared with Poisson models, 8.3
conditional mean same as Poisson, 8.3
consistency, 8.3
examples, 8.3.1 , 8.3.1
formula for density, 8.3
formulas for first two moments, 8.3
motivated by unobserved heterogeneity causing overdispersion, 8.3
NB1 and NB2, 8.3
robustness, 8.3
second part of hurdle count model, 8.4.1
truncated example, 8.4.1
variance exceeds mean, 8.3
zeros, 8.3
negative binomial-1, see NB1
negative binomial-2, see NB2
nlcom command, 9.3.1 , 9.3.2
nonparametric regression, 9.4 , 9.4.1
bootstrap, 9.4.1
compared with parametric models, 9.4
computation time, 9.4.1
examples, 9.4.1 , 9.4.1
local linear regression, 9.4
normality
FIML sensitivity to assumption, 7.5.5
generalized tobit assumption, 7.3
OLS, not assumed by, 7.2.1
tobit model assumption, 7.6.1
npregress command, 9.4 , 9.4.1 , 9.6
O
OLS, 1 , 1.1 , 1.3 , 4 , 4.7 , 5.1
AIC and BIC, 4.6.5 , 4.6.5
assumptions, 4.1
ATE and ATET, 4.3.3
best linear unbiased estimator, 4.1
can be inconsistent for count outcomes, 8.1
compared with median quantile regression, 9.2
endogeneity causes inconsistency, 10.2.1 , 10.2.1
examples of statistical tests, 4.6.4 , 4.6.4
graphical representation of marginal and incremental effects, 4.3.2 ,
4.3.2
incremental effects, 4.3.1 , 4.3.1
log models, see log models
marginal effects, 4.3.1 , 4.3.1
marginal effects compared with quantile regression, 9.2.1
mathematical formulation, 4.2 , 4.2
misspecification
consequences, 4.1 , 4.4 , 4.4.2
exponential example, 4.4.2 , 4.4.2
quadratic example, 4.4.1 , 4.4.1
negative predictions, 5.1
regression with log-transformed dependent variable, see log models
sample-to-sample variation, 5.1
statistical tests, 4.6 , 4.6.5
treatment effects, 4.3.3 , 4.3.3
unbiased property, 4.1
visual checks, 4.5 , 4.5.2
omitted variables, 10.2.1 , 10.2.4
endogeneity, source of, 10.1
ordinary least squares, see OLS
outliers, 1.3
overdispersion
problem for Poisson models, 8.2.4
unobserved heterogeneity in negative binomial models, 8.3
overidentification test, GMM, 10.4
overidentifying restriction test, 2SLS specification test, 10.2.3
oversampling, 3.4
P
Park test, modified for GLM, 5.8.3 , 5.8.3 , 5.10
parmest package, 9.2.1
poisson command, 8.8
Poisson models, 1.3 , 8.2 , 8.2.4 , 11.4.3
canonical count model, 8.1
compared with negative binomial models, 8.3
consistent if conditional mean correct, 8.1 , 8.2.2
events underpredicted in tails, 8.2.4
example of office-based visits, 8.2.3
exponential mean specification, 8.2.3
formula, 8.2.1
for covariance matrix, 8.2.1
for semielasticity, 8.2.3
GLM objective function, 8.2.2
GMM, 10.4
incremental effects, 8.2.3
can be inconsistent, 8.2.4
example, 8.2.3
instrumental variables, by GMM, 10.4
interpretation, 8.2.3 , 8.2.3
interpretation as semielasticity, 8.2.3
LEF, 8.2.2
log-likelihood function, 8.2.1
marginal effects, 8.2.3
can be inconsistent, 8.2.4
example, 8.2.3
mean equals variance, 8.2.1
MLE, 8.2.1 , 8.2.1
motivation from exponential time between events, 8.2.1
overdispersion, 8.2.4
restrictiveness, 8.2.4 , 8.2.4
robust standard errors, 8.2.4
robustness, 8.2.2 , 8.2.2 , 8.6
second part of hurdle count model, 8.4.1
variance equals mean, 8.2.1
weighted example, 11.4.4 , 11.4.4
zeros underpredicted, 8.2.4
population mean, 11.3.1
potential outcomes, 1.1 , 2.1 , 2.2 , 2.2 , 2.3 , 2.3.2 , 2.3.3 , 2.4.1 , 2.5 , 4.3
causal inference, 2.2
predict command, 6.6 , 9.3.1
prediction, GLM, 5.4 , 5.4
Pregibon’s link test, 2.6 , 4.6 , 4.6.1 , 4.6.1 , 4.6.3 , 7.7
example, 4.6.4
relation to RESET test, 4.6.2
Stata command, 4.6.4
primary sampling unit, see PSU
probit command, 7.8 , 8.8
probit models
comparison with tobit, 7.6.1
first part of
hurdle count model, 8.4.1
zero-inflated model, 8.4.2
tobit model, 7.6.1
two-part model, first part in, 7.2 , 7.2.1 , 7.4 , 7.5.1 , 7.5.2
PSU, 11.2.2
bootstrap or jackknife, 11.3.2
MEPS, 11.2.2
pweights, 11.2.1 , 11.3.1
Q
qcount package, 9.2.2
qreg command, 9.2.1 , 9.6
quantile regression, 1.1 , 9.1 , 9.2 , 9.2.2
appealing properties, 9.2
applied to count data, 9.2.2
applied to nonlinear data-generating process, 9.2.2
compared with OLS, 9.2
effects of covariates vary across conditional quantiles of the outcome, 9.2
equivariant to monotone transformations, 9.2
examples, 9.2.1 , 9.2.2
extensions, 9.2.2 , 9.2.2
formula, 9.2
marginal effects compared with OLS, 9.2.1
median regression, 9.2
robust to outliers, 9.2
R
Ramsey’s regression equation specification error test, see RESET test
RAND Health Insurance Experiment, 7.2
randomized trial, 2.3.2 , 2.3.3
regress command, 4.7 , 6.6 , 7.8 , 10.2.4 , 11.4.3
regress postestimation commands, 4.5.1 , 4.7
replace command, 3.6
RESET test, 2.6 , 4.6 , 4.6.2 , 4.6.2 , 7.7
example, 4.6.4
generalization of Pregibon’s link test, 4.6.2
Stata command, 4.6.4
retransformation, 5.1 , 6.1 , 6.3.1 , 6.3.2
general theory of Box-Cox under homoskedasticity, 6.5
robustness, negative binomial models, 8.3
rvfplot command, 4.5.1 , 4.7
rvpplot command, 4.5.1 , 4.7
S
sampling weights, see weights
scatter command, 3.6
selection
attrition, 11.1 , 11.2.3
models, 1.1 , 1.3
on unobservables, 2.4.1 , 2.7
self-selection, 2.3.3
single-index models that accommodate zeros, 7.1 , 7.6 , 7.6.3
GLM, 7.6.3
nonlinear least squares, 7.6.3
one-part models, 7.6.3 , 7.6.3
tobit models, 7.6.1 , 7.6.2
skewed data, 5.1
skewness, 1.1 , 1.2 , 5.1
Box-Cox targets skewness, 6.5
count models, 8.1
GLM, 5.9
MEPS, 3.3
OLS errors, 5.1
skewness transformation, zero, 6.6
specification tests, 2.6
2SLS, 10.2.3 , 10.2.3
balance test, 10.2.3
exogeneity test, 10.2.3
F test, 10.2.3
instrument strength test, 10.2.3
overidentifying restriction test, 10.2.3
sqreg command, 9.2.1 , 9.6
square root model, 6.5
square root transformation
GLM, 5.3 , 5.8.1 , 5.8.2
ssc install
fmm9 command, 9.6
parmest command, 9.2.1
twopm command, 7.8
standard errors
2SLS versus OLS, 10.2.2 , 10.4
bootstrap, 11.3.2
delta method, 4.3.3 , 11.3.2
design effects, 11.1 , 11.3.2 , 11.3.2
difference-in-differences, 11.1
formula with clustering, 11.2.2
GMM correct standard errors for 2SRI, 10.4
jackknife, 11.3.2
margins and teffects command, 4.3.3
Poisson models, 8.2.4
Stata
Base Reference Manual, 11.3.2
command for weights, 11.2.1
Data-Management Reference Manual, 3.6
Getting Started With Stata manual, 3.6
Graphics Reference Manual, 3.6
introduction to the language, 1.4
option for clustering, 11.2.2
Survey Data Reference Manual, 11.6
Treatment-Effects Reference Manual, 4.3.3
User’s Guide, 3.6 , 11.3.1
Stata resources
2SLS, 10.5
2SRI, 10.5
AIC and BIC, 5.10 , 8.8
Box-Cox, 6.6
CDE, 9.6
clusters, 11.6
count models, 8.8
data cleaning, 3.6
design effects, 11.6
endogeneity, 10.5
FMM, 9.6
generalized tobit models, 7.8
getting started, 3.6
GLM, 5.10
GMM, 10.5
graphing, 3.6 , 4.7
Heckman selection models, 7.8
hurdle count models, 8.8
incremental effects, 5.10
instrumental variables, 10.5
linear regression, 4.7 , 6.6
marginal and incremental effects, 4.7
marginal effects, 5.10
negative binomial models, 8.8
Poisson models, 8.8
quantile regression, 9.6
skewness transformation, zero, 6.6
statistical tests, 4.7
stratification, 11.6
study design, 11.6
summary statistics, 3.6
survey design, 11.6
treatment effects, 4.7
two-part models, 7.8
weights, 11.6
zero-inflated models, 8.8
statistical tests, 1.1
OLS, 4.6 , 4.6.4 , 4.6.4 , 4.6.5
stratification, 11.2.2 , 11.2.2
affects standard errors, 11.3
purpose, 11.2.2
study design, see design effects
suest command, 8.4.1
summarize command, 3.6 , 11.4.2
survey design, see design effects
svy prefix command, 5.10 , 11.4.1 , 11.4.2 , 11.4.3 , 11.6
svy bootstrap command, 11.3.2
svy jackknife command, 11.3.2
svyset command, 11.3.2 , 11.4.1 , 11.6
T
tabstat command, 3.6
Taylor series, linearization for standard errors, 11.3.2 , 11.4
teffects command, 4.3.3 , 4.7
test command, 4.7
testparm command, 4.7
tnbreg command, 8.8
tobit command, 7.8
tobit models, 7.6.1 , 7.6.2
censoring, 7.6.1
censoring assumption, 7.6.1 , 7.6.2
comparison with probit, 7.6.1
formulas, 7.6.1
homoskedasticity assumption, 7.6.1
latent outcomes, 7.6.1
MLE formula, 7.6.1
normality assumption, 7.6.1 , 7.6.2
restrictive assumptions, 7.6.2
right-censored example, 7.6.2
truncation of positive values at zero, 7.6.2
why used sparingly, 7.6.2 , 7.6.2
tobit, generalized, see generalized tobit models
tpoisson command, 8.4.1 , 8.8
treatment effects, 1.1 , 2.1 , 2.2 , 2.2 , 2.3.3 , 2.5
clustering, 11.2.3
covariate adjustment, 2.3.3 , 2.3.3
difference-in-differences, 11.2.3
endogeneity, 10.1
heterogeneous, 1.1 , 1.2 , 1.3 , 9 , 9.6
linear regression, 2.4.1 , 2.4.1
nonlinear regression, 2.4.2 , 2.4.2
OLS, 4.1 , 4.3.3 , 4.3.3
on not-treated, 8.2.3
randomization, 2.3.2 , 2.3.2
regression estimates, 2.4 , 2.4.2
zeros, if mass at, 7.1
treatreg command, 7.8
truncation
comparison with censoring, 7.6.1
count models, 8.5 , 8.5.1
definition, 7.6.1
formulas, 8.5.1
left-truncation, 8.5.1
right-truncation, 8.5.1
zero-truncation, 8.5.1
two-part models, 1.1 , 1.3 , 7.2 , 7.2.1
actual outcomes, 7.1 , 7.4.1 , 7.5.4 , 7.5.5
choices for first and second parts, 7.2
compared with hurdle count models, 8.4.1
comparison with generalized tobit models, 7.4 , 7.4.1
comparison with Heckman selection models, 7.4 , 7.4.1
example
comparing with generalized tobit model, 7.5.5 , 7.5.5
comparing with Heckman selection model, 7.5.5 , 7.5.5
with MEPS, 7.5.1 , 7.5.1
examples showing similar marginal effects to generalized tobit models,
7.4.1 , 7.4.1
expected value of y, 7.2.1 , 7.2.1
formula,
general, 7.2
logit and GLM with log link, 7.2.1
probit and GLM with log link, 7.2.1
probit and homoskedastic nonnormal log model, 7.2.1
probit and homoskedastic normal log model, 7.2.1
probit and linear, 7.2.1
probit and normal, 7.2.1
formulas for incremental effects, 7.5.2
history, 7.2
marginal and incremental effects, 7.2.1 , 7.2.1
marginal effect example, 7.5.3 , 7.5.3
marginal effect formulas, 7.5.2 , 7.5.2
mixture density motivation, 7.4
motivation is for actual zeros, 7.4
not nested in generalized tobit models, 7.4
relation to CDE, 9.5
statistical decomposition, 7.2
statistical tests, 7.7
twopm command, 7.4.1 , 7.5.1 , 7.5.3 , 7.8 , 8.4.1
two-stage least squares, see 2SLS
two-stage residual inclusion, see 2SRI
twostep option, 10.5
V
vce(cluster) option, 11.2.1 , 11.2.2 , 11.3.2 , 11.6
vce(robust) option, 8.2.4
visual checks
artificial-data example, 4.5.1 , 4.5.1
MEPS example, 4.5.2 , 4.5.2
OLS, 4.5 , 4.5.2
Vuong’s test, 8.3.1 , 8.6
W
weak instruments, see instrumental variables
weighted least squares, see WLS
weighted sample means, example, 11.4.2 , 11.4.2
weights, 11.2.1 , 11.2.1
affect point estimates and standard errors, 11.3
analytic weights, aweights, 11.2.1
definitions, 11.2.1
effect on point estimates, 11.3.1
effect on regression coefficients, 11.1
frequency weights, fweights, 11.2.1
importance weights, iweights, 11.2.1
in natural experiments, 11.2.3 , 11.2.3
logistic regression, 11.3.1
population mean, 11.3.1
population weights, 11.2.1
postsampling weights, 11.2.1 , 11.2.3
sampling weights, 11.2.1
pweights, 11.2.1
with bootstrap or jackknife, 11.3.2
WLS, 11.3.1
WLS, 11.3.1
example, 11.4.3 , 11.4.3
formula, 11.3.1
Z
zero-inflated models, 8.4 , 8.4.2 , 8.4.2
compared with hurdle count models, 8.4.2
example for office-based visits, 8.4.2
formula for density, 8.4.2
heterogeneity, 8.4.2
incremental effects, 8.4.2
marginal effects, 8.4.2
motivation, 8.4.2
zeros, 1.1 , 1.2
hurdle count model, 8.4.1
MEPS, 3.3
models for continuous outcomes with mass at zero, 7 , 7.8
negative binomial models, 8.3
single-index models, 7.6 , 7.6.3
underprediction in Poisson models, 8.2.4
zinb command, 8.4.2 , 8.8
zip command, 8.4.2 , 8.8