
Health Econometrics

Using Stata
Partha Deb Hunter College, CUNY and NBER

Edward C. Norton University of Michigan and NBER

Willard G. Manning University of Chicago



A Stata Press Publication StataCorp LLC College Station, Texas



Copyright © 2017 StataCorp LLC


All rights reserved. First edition 2017

Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

Typeset in LaTeX 2ε

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Print ISBN-10: 1-59718-228-1

Print ISBN-13: 978-1-59718-228-7

ePub ISBN-10: 1-59718-229-X

ePub ISBN-13: 978-1-59718-229-4

Mobi ISBN-10: 1-59718-230-3

Mobi ISBN-13: 978-1-59718-230-0

Library of Congress Control Number: 2016960172

No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or
by any means—electronic, mechanical, photocopy, recording, or otherwise—without the prior
written permission of StataCorp LLC.

Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.

Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of
the United Nations.

NetCourseNow is a trademark of StataCorp LLC.

LaTeX 2ε is a trademark of the American Mathematical Society.

Dedication to Willard G. Manning, Jr. (1946–2014)
Will Manning joined the RAND Corporation in 1975, a few years after completing his PhD at Stanford. He quickly became involved in the RAND Health Insurance Experiment. Will was the lead author of the article that reported the main insurance results in the 1987 American Economic Review, one of the most cited and influential articles in health economics. He also published many seminal articles about the demand for alcohol and cigarettes, sin taxes, and mental healthcare. In 2010, the American Society of Health Economists awarded the Victor R. Fuchs Award to Will for his lifetime contributions to the field of health economics.

But perhaps his strongest influence was on empirical methods central to applied health economics research. With others at RAND, he advocated moving away from tobit and sample-selection models to deal with distributions of dependent variables that had a large mass at zero. The two-part model, in all of its forms, is now the dominant model for healthcare expenditures and use. He also understood the power and limitations of taking logarithms of skewed distributions. And woe to the author who did not deal adequately with heteroskedasticity upon retransformation. Will continued to push the field of health econometrics through the end of his career. He helped develop new methods and advocated the work of others who found better ways of modeling healthcare expenditures and use. His influence on applied health economics is deep and lasting (Konetzka 2015; Mullahy 2015).

Will had three other characteristics that we grew to appreciate, if not emulate, over the years. He was absolutely meticulous about research—data, methods, and attribution. Precision is not merely an abstract statistical concept but an essential part of all steps in a research project.

Will was extraordinarily generous with his time. We know of many junior economists who were amazed and profoundly grateful that Will took the time to give detailed feedback on their article or presentation—and to explain why their standard errors were all wrong.

Finally, Will was hilariously funny. We know this is a rare trait in an economist, but a little levity makes the daily give and take in search of truth that much more enjoyable.

We dedicate this book to our friend and colleague, Will Manning.

Partha Deb and Edward C. Norton

Contents

Tables

Figures
Preface

Notation and typography

1 Introduction
1.1 Outline
1.2 Themes
1.3 Health econometric myths
1.4 Stata friendly
1.5 A useful way forward
2 Framework
2.1 Introduction
2.2 Potential outcomes and treatment effects
2.3 Estimating ATEs
2.3.1 A laboratory experiment
2.3.2 Randomization
2.3.3 Covariate adjustment
2.4 Regression estimates of treatment effects
2.4.1 Linear regression
2.4.2 Nonlinear regression
2.5 Incremental and marginal effects
2.6 Model selection
2.6.1 In-sample model selection
2.6.2 Cross-validation
2.7 Other issues
3 MEPS data
3.1 Introduction
3.2 Overview of all variables
3.3 Expenditure and use variables
3.4 Explanatory variables
3.5 Sample dataset

3.6 Stata resources
4 The linear regression model: Specification and checks
4.1 Introduction
4.2 The linear regression model
4.3 Marginal, incremental, and treatment effects
4.3.1 Marginal and incremental effects
4.3.2 Graphical representation of marginal and incremental effects
4.3.3 Treatment effects
4.4 Consequences of misspecification
4.4.1 Example: A quadratic specification
4.4.2 Example: An exponential specification
4.5 Visual checks
4.5.1 Artificial-data example of visual checks
4.5.2 MEPS example of visual checks
4.6 Statistical tests
4.6.1 Pregibon’s link test
4.6.2 Ramsey’s RESET test
4.6.3 Modified Hosmer–Lemeshow test
4.6.4 Examples
4.6.5 Model selection using AIC and BIC
4.7 Stata resources
5 Generalized linear models
5.1 Introduction
5.2 GLM framework
5.2.1 GLM assumptions
5.2.2 Parameter estimation
5.3 GLM examples
5.4 GLM predictions
5.5 GLM example with interaction term
5.6 Marginal and incremental effects
5.7 Example of marginal and incremental effects
5.8 Choice of link function and distribution family
5.8.1 AIC and BIC
5.8.2 Test for the link function
5.8.3 Modified Park test for the distribution family
5.8.4 Extended GLM
5.9 Conclusions
5.10 Stata resources

6 Log and Box–Cox models
6.1 Introduction
6.2 Log models
6.2.1 Log model estimation and interpretation
6.3 Retransformation from ln(y) to raw scale
6.3.1 Error retransformation and model predictions
6.3.2 Marginal and incremental effects
6.4 Comparison of log models to GLM
6.5 Box–Cox models
6.5.1 Box–Cox example
6.6 Stata resources
7 Models for continuous outcomes with mass at zero
7.1 Introduction
7.2 Two-part models
7.2.1 Expected values and marginal and incremental effects
7.3 Generalized tobit
7.3.1 Full-information maximum likelihood and limited-information
maximum likelihood
7.4 Comparison of two-part and generalized tobit models
7.4.1 Examples that show similarity of marginal effects
7.5 Interpretation and marginal effects
7.5.1 Two-part model example
7.5.2 Two-part model marginal effects
7.5.3 Two-part model marginal effects example
7.5.4 Generalized tobit interpretation
7.5.5 Generalized tobit example
7.6 Single-index models that accommodate zeros
7.6.1 The tobit model
7.6.2 Why tobit is used sparingly
7.6.3 One-part models
7.7 Statistical tests
7.8 Stata resources
8 Count models
8.1 Introduction
8.2 Poisson regression
8.2.1 Poisson MLE
8.2.2 Robustness of the Poisson regression
8.2.3 Interpretation

8.2.4 Is Poisson too restrictive?
8.3 Negative binomial models
8.3.1 Examples of negative binomial models
8.4 Hurdle and zero-inflated count models
8.4.1 Hurdle count models
8.4.2 Zero-inflated models
8.5 Truncation and censoring
8.5.1 Truncation
8.5.2 Censoring
8.6 Model comparisons
8.6.1 Model selection
8.6.2 Cross-validation
8.7 Conclusion
8.8 Stata resources
9 Models for heterogeneous effects
9.1 Introduction
9.2 Quantile regression
9.2.1 MEPS examples
9.2.2 Extensions
9.3 Finite mixture models
9.3.1 MEPS example of healthcare expenditures
9.3.2 MEPS example of healthcare use
9.4 Nonparametric regression
9.4.1 MEPS examples
9.5 Conditional density estimator
9.6 Stata resources
10 Endogeneity
10.1 Introduction
10.2 Endogeneity in linear models
10.2.1 OLS is inconsistent
10.2.2 2SLS
10.2.3 Specification tests
10.2.4 2SRI
10.2.5 Modeling endogeneity with ERM
10.3 Endogeneity with a binary endogenous variable
10.3.1 Additional considerations
10.4 GMM
10.5 Stata resources

11 Design effects
11.1 Introduction
11.2 Features of sampling designs
11.2.1 Weights
11.2.2 Clusters and stratification
11.2.3 Weights and clustering in natural experiments
11.3 Methods for point estimation and inference
11.3.1 Point estimation
11.3.2 Standard errors
11.4 Empirical examples
11.4.1 Survey design setup
11.4.2 Weighted sample means
11.4.3 Weighted least-squares regression
11.4.4 Weighted Poisson count model
11.5 Conclusion
11.6 Stata resources
References

Author index

Subject index

Tables
5.1 GLMs for continuous outcomes
6.1 Box–Cox formulas for common values of λ
7.1 Choices of two-part models

Figures
3.1 Empirical distribution of ln(total expenditures)
3.2 Empirical distributions of healthcare use
4.1 The relationship between total expenditures and age, for men and for
women, with and without any limitations
4.2 Distributions of AME of : Quadratic specification
4.3 Distributions of AME of : Quadratic specification
4.4 Distributions of AME of when : Quadratic specification
4.5 Distributions of AME of : Exponential specification
4.6 Distributions of AME of : Exponential specification
4.7 Distributions of AME of , given : Exponential specification
4.8 Residual plots for y1
4.9 Residual plots for y2
4.10 Residual plots for y3
4.11 Residual plots: Regression using MEPS data, evidence of
misspecification
4.12 Residual plots: Regression using MEPS data, well behaved
4.13 Graphical representation of the modified Hosmer–Lemeshow test
4.14 Graphical representation of the modified Hosmer–Lemeshow test
after adding interaction terms
5.1 Densities of total expenditures and their residuals
5.2 Predicted total expenditures by age and gender
5.3 Predicted marginal effects of age by age and gender
7.1 Predicted expenditures, conditional on age and gender
8.1 Poisson density with
8.2 Poisson density with
8.3 Empirical frequencies
8.4 Empirical and Poisson-predicted frequencies
8.5 Negative binomial density
8.6 Empirical and NB2 predicted frequencies
8.7 Cross-validation log likelihood for office-based visits
8.8 Cross-validation log likelihood for ER visits
9.1 Coefficients and 95% confidence intervals by quantile of expenditure
errors
9.2 Coefficients and 95% confidence intervals by quantile of
ln(expenditure) errors
9.3 Empirical and predicted componentwise densities of expenditure

9.4 Componentwise coefficients and 95% confidence intervals of
expenditures
9.5 Empirical and predicted componentwise densities of office-based visits
9.6 Componentwise coefficients and 95% confidence intervals of office-
based use
9.7 Predicted total expenditures by physical health score and activity
limitation

Preface
This book grew out of our experience giving presentations about applied health econometrics at the International Health Economics Association and the American Society of Health Economists biennial conferences. In those preconference seminars, we tried to expose graduate students and early career academics to topics that are not generally covered in traditional econometrics courses but are nonetheless salient to most applied research on healthcare expenditures and use. Participants began to encourage us to turn our slides into a book.

In this book, we aim to provide a clear understanding of the most commonly used (and abused) econometric models for healthcare expenditure and use and of approaches to choose the most appropriate model. If you want intuition, meaningful examples, inspiration to improve your best practice, and enough math for rigor but not enough to cause rigor mortis, then keep reading. If you want a general econometrics textbook, then put down this book and go buy a general econometrics textbook. Get ready to try new methods and statistical tests in Stata as you read. Be prepared to think.

Despite years of training and practice in applied econometrics, we still learned a tremendous amount while working on this book from reading recent literature, comparing and testing models in Stata, and debating with each other. We particularly learned from our coauthor Will Manning, who unfortunately died in 2014 before seeing our collective effort come to fruition. Will was a fountain of knowledge. We think that his overarching approach to econometrics of repeated testing to find the best model for the particular research question and dataset is the best guide. The journey matters, not just the final parameter estimate.

In closing, we want to thank some of the many people who have helped us complete this book. David Drukker, editor and econometrician, had numerous suggestions, large and small, that dramatically improved the book. We are grateful to Stephanie White, Adam Crawley, and David Culwell at StataCorp for help with LaTeX, editorial assistance, and production of the book. We thank Betsy Querna Cliff, Morris Hamilton, Jun Li, and Eden Volkov for reading early drafts and providing critical feedback. We thank the many conference participants who were the early guinea pigs for our efforts at clarity and instruction and especially those who gave us the initial motivation to undertake this book. Our wives, Erika Bach and Carolyn Norton, provided support and encouragement, especially during periods of low marginal productivity. Erika Manning cared for Will during his illness and tolerated lengthy phone calls at odd hours, and Will’s bad puns at all hours.

Partha Deb and Edward C. Norton

Notation and typography
In this book, we assume that you are somewhat familiar with Stata: you
know how to input data, use previously created datasets, create new
variables, run regressions, and the like.

We designed this book for you to learn by doing, so we expect you to read it while at a computer trying to use the sequences of commands contained in the book to replicate our results. In this way, you will be able to generalize these sequences to suit your own needs.

We use the typewriter font to refer to Stata commands, syntax, and variables. A “dot” prompt followed by a command indicates that you can type verbatim what is displayed after the dot (in context) to replicate the results in the book.

The data we use in this book are freely available for you to download, using a net-aware Stata, from the Stata Press website, http://www.stata-press.com. In fact, when we introduce new datasets, we load them into Stata the same way that you would.

Try it. To download the datasets and do-files to your computer, type the following commands:
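The commands below are a sketch only: the net commands themselves are standard Stata, but the exact Stata Press path and package name for this book's materials are assumptions, so check the book's page on the Stata Press website for the correct ones.

. net from http://www.stata-press.com/data/heus
. net get heus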

Chapter 1
Introduction
Health and healthcare are central to society and economic activity. This
observation extends beyond the large fraction of gross national product
devoted to formal healthcare to the fact that health and healthcare affect
each other and numerous other decisions. Health affects people’s ability to
engage in work and leisure, their probability of marriage, probability of
living to a ripe old age, and how much they spend on healthcare.
Healthcare affects health mostly for the better, although side effects and
medical errors can have drastic consequences. The desire for better health
motivates decisions about smoking, drinking, diet, and exercise over a
lifetime. Therefore, it is important to understand the underlying causes of
health and how health affects people’s lives, including examining the
determinants of healthcare expenditures and use.

Economic theory and policy motivate research questions on healthcare expenditures and use. The healthcare sector is one of the best areas to test economic theories of insurance, the behavior of nonprofit firms, principal-agent relationships, and the influence of peers. Moreover, governments are intimately involved with healthcare. For example, in the United States, the federal and state governments administer Medicare and Medicaid health insurance, the Veterans Administration provides healthcare to veterans, and the government regulates tobacco, alcohol, and prescription drugs. Understanding what drives healthcare expenditures and use is essential for policy. National domestic political debates often center around policies that aim to enhance public health, improve quality of healthcare, and ensure affordable access to health insurance. Health economists have much to offer by studying these issues.

The last few decades have also seen a proliferation of sophisticated statistical methods. Researchers now have many alternatives to ordinary least squares (OLS) to analyze data with dependent variables that are binary, count, or skewed. Researchers can adjust estimates to control for complex survey design and heteroskedasticity. There are classes of models [for example, generalized linear models (GLM)] and statistical methods (for example, maximum likelihood estimation and generalized method of moments) beyond least squares that provide powerful and unified approaches to fitting many complex models. Advances in computing power mean that researchers can estimate technically complex statistical models faster than ever. Stata (and other statistical software) allows researchers to use these models quickly and easily.

Like the people behind the statistics, data come in all shapes, sizes, and
ages. Researchers collect population health and census data, episode-level
claims data, survey data on households and on providers, and, more
recently, individual biometric data—including genetic information.
Datasets are often merged to generate richer information over time. The
variety of data is dizzying.

The importance of the research and policy questions requires that we use the econometric models with care and that we think deeply about the correct interpretation. Faster computers do not obviate the need for thought.

In this book, we lay out the main statistical approaches and econometric models used to analyze healthcare expenditure and use data. We explain how to estimate and interpret the models using easy-to-follow examples. We include numerous references to the main theoretical and applied literature. We also discuss the strengths and weaknesses of the models we present. Knowing the limitations of models is as important as knowing when to appropriately use them. Most importantly, we demonstrate rigorous model testing methods. By following our approach, researchers can rigorously address research questions in health economics in a way that is tailored to their data.

1.1 Outline

This book is divided into three groups of chapters. The early chapters
provide the background necessary to understand the rest of the book. Many
empirical research questions aim to estimate treatment effects.
Consequently, chapter 2 introduces the potential outcomes framework,
which is useful for estimating and interpreting treatment effects. It also
relates treatment effects to marginal and incremental effects in both linear
and nonlinear models. Chapter 3 introduces the Medical Expenditure Panel
Survey dataset, which is used throughout this book for illustrative
examples. Chapter 4 illustrates how to estimate the average treatment
effect, the treatment effect on the treated, and marginal and incremental
effects for linear regression models. Chapter 4 also shows that
misspecifications in OLS models can lead to inconsistent average effects. It
also includes graphical and statistical tests for model specification to help
decide between competing statistical models.

The core chapters describe the most prominent set of models used for healthcare expenditures and use, including those that explicitly deal with skewness, heteroskedasticity, log transformations, zeros, and count data. Chapter 5 presents GLMs as an alternative to OLS for modeling positive continuous outcomes. Generalized linear models are especially useful for skewed dependent variables and for heteroskedastic error terms. Although we argue that GLM provides a powerful set of models for health expenditure, we also lay out the popular log transformation model in chapter 6. Transforming a dependent variable by taking its natural logarithm is a widely used way to model skewed outcomes. Chapter 6 describes several versions that differ in their assumptions about heteroskedasticity and the distribution of the error term (normal or nonnormal). We show that interpretation can be complex, even though estimation is simple. Chapter 7 adds observations with outcomes equal to zero. Most health expenditure data have a substantial mass at zero, which makes models that explicitly account for zeros appealing. Here we describe and compare two-part and selection models. We explain the underlying assumptions behind the often misunderstood two-part model, and show how two-part models are superficially similar to, yet fundamentally different from, selection models. Chapter 8 moves away from continuous dependent variables to count models. These models are essential for outcomes that are nonnegative integer valued, including counts of office visits, number of cigarettes smoked, and prescription drug use.

The book then shifts to more advanced topics. Chapter 9 presents four flexible approaches to modeling treatment-effect heterogeneity. Quantile regression allows response heterogeneity by level of the dependent variable. We describe basic quantile regressions and how to use those models to obtain quantile treatment effects. Next, we describe finite mixture models. These models treat the sample as being drawn from a finite number of subpopulations, with different relationships between outcomes and predictors in each subpopulation. Thus, finite mixture models can uncover patterns in the data caused by heterogeneous types. Third, we describe local-linear regression, a nonparametric regression method. Nonparametric regression techniques make few assumptions about the functional form of the relationship between the outcome and the covariates and allow for very general relationships. Finally, conditional density estimation is another flexible alternative to linear models for dependent variables with unusual distributions. The last two chapters discuss issues that cut across all models. Chapter 10 introduces controlling for endogeneity, or selection on unobservables, of covariates of policy interest to the researcher. Chapter 11 discusses design effects. Many datasets have information collected with complex survey designs. Analyses of such data should account for stratified sampling, primary sampling units, and clustered data.

This book does not attempt to provide a comprehensive treatment of econometrics. For that, we refer readers to other sources (for example, Cameron and Trivedi [2005; 2010], Greene [2012], Wooldridge [2010; 2016]). Instead, we focus on healthcare econometric models that emphasize three core statistical issues of skewness, zeros, and heterogeneous response. We focus on providing intuition, a basic mathematical framework, and user-friendly Stata applications. We provide citations to the literature for original proofs and important applications. Much promising theoretical and applied work continues to appear in the literature each year. Jones (2010) describes some of the recent research as well as a wide range of econometric approaches.

1.2 Themes

Although we present numerous alternative models and ways to check and choose between those models, it should be no surprise that we do not determine a single best model for all situations or a good second-best model for all cases. Instead, researchers must find the model that is most appropriate for their research question and data. We recommend comprehensive model checking, but model checking is not a simple checklist. It requires thought.

We aim to provide the tools to find the best model to consistently estimate the answer to the research question. This answer will often be a function of the conditional mean E(y | x), such as the average treatment effect, or the marginal effect of a covariate on the outcome. We are also concerned about the precision of those estimates, measured by the variance of the estimators.

Of all possible statistical models, we focus on those that address three key issues that often appear in health expenditure and use data: skewness, a large mass at zero, and heterogeneous response. Health expenditure data are often wildly right skewed. Transforming the dependent variable to generate a dependent variable with a more symmetric distribution may improve the statistical properties of the model fit but may make it harder to interpret. The distributions of many interesting health outcomes—such as total annual healthcare expenditures, hospital visits in the calendar year, and smoking in the last 30 days—typically have a substantial fraction of zeros, which can pose difficulties for standard statistical models. Consequently, health economists have developed models to deal with such outcomes, allowing for a rich understanding of how variables affect whether the outcome is positive (extensive margin) and the magnitude of the outcome (intensive margin). While a single, summary marginal effect is sometimes of interest, we often expect heterogeneous treatment effects across different subpopulations. Modeling the heterogeneity explicitly can reveal new insights.

In summary, our aim is to improve best practices among health economists and health services researchers.

1.3 Health econometric myths

Despite the tremendous recent advances in econometrics, we have noticed a number of misconceptions in the published literature. We hope the following myths will disappear in future generations:

1. Model selection by citation is safe. The lemming approach to econometrics is to follow blindly what others have done. But each research question, each dataset, and each model requires individual attention. We advocate artisanal handcrafted research, not mass-produced cookie-cutter research (see chapter 2).

2. Trim outliers. Outliers are so annoying. They are highly influential, do not fit nicely onto graphs, and are just, well, different. Why not trim them from the data? The reason is that each outlier represents a real person or episode. As much as a hospital administrator would like to assume away an ultraexpensive patient, the patient exists and is an important feature of many datasets. Embrace the diversity; start by exploring outliers in the Medical Expenditure Panel Survey data described in chapter 3.

3. OLS is fine. OLS regression has many virtues. It is easy to estimate and
interpret. Under a set of well-known assumptions—including that the
model as specified is correct—OLS is the best linear unbiased
estimator, except when the assumptions fail, which is often. We
demonstrate the limitations of OLS in chapter 4.

4. All GLM models should have a log link with a gamma distribution.
Several early influential articles using GLM models in health
economics happened to analyze data for which the log link with a
gamma distribution was the appropriate choice. Different link and
distributional families may be better (see chapter 5) for other data.

5. Log all skewed dependent variables. Health economists have developed a compulsive, almost Pavlovian, instinct to log any and all skewed dependent variables. While the log transformation makes estimation on the log scale simple, it makes interpretation and prediction on the raw scale surprisingly difficult (see chapter 6).

6. Use selection models for data with a large mass at zero. When the data have substantial mass at zero, some researchers reach for the two-part model, while others reach for selection models. Their choices often lead to considerable argument over which is better. We advocate the two-part model for researchers interested in actual outcomes (including the zeros), and we advocate selection models for researchers interested in latent outcomes (assuming that the zeros are missing values). We set the record straight in chapter 7.

7. All count models are Poisson. Ever wonder why some researchers reflexively use Poisson, and others use the negative binomial? We explain the tradeoff between inference about the conditional mean function and conditional frequencies while providing intuition and pretty pictures (see chapter 8).

8. Modeling heterogeneity is not worth the effort. Tired of assuming monolithic treatment effects? Want to spice up your research life? We introduce four ways to model treatment-effect heterogeneity that can enrich any analysis (see chapter 9).

9. Correlation is causation. Actually, virtually all researchers know that statement is false. However, knowing it and correctly adjusting for endogeneity are two different things. We discuss ways to better establish causality by controlling for endogeneity, which is pervasive in applied social science research (see chapter 10).

10. Complex survey design is just for summary statistics. Most large
surveys use stratification and cluster sampling to better represent
important subsamples and to use resources efficiently. Model
estimation, not just summary statistics, should control for sample
weights, clustering, and stratification (see chapter 11).

1.4 Stata friendly

We assume the reader has a basic understanding of Stata. To learn more, read the Stata manuals and online help, or consult introductions to Stata by Long and Freese (2014) (see especially chapters 2–4) and Cameron and Trivedi (2010) (see chapters 1 and 2 for an overview). Stata is easy to learn, easy to use, and has a powerful set of tools. Many of them are built in, but others are provided by dedicated users who share their code via Stata packages. Once the reader has a grasp of the basics, our book will be fully accessible.

Merely reading about econometrics is not the best way to learn. Readers must actively analyze data themselves. Therefore, we provide user-friendly Stata code, so interested readers can not only reproduce all the examples in the book but also modify the code to analyze their own data. The data and Stata code in this book are publicly available. We have designed this not only to be user-friendly but also to be interactive. Dig in!

1.5 A useful way forward

Finally, we agree with the observation by Box and Draper (1987) that “all
models are wrong, but some are useful”. Our intent is to provide methods
to choose models that are useful for the research question of interest.

Chapter 2
Framework

2.1 Introduction

Researchers estimate statistical models of healthcare expenditures and use in a vast variety of situations, for example, as a basis for risk-adjusted payments in public and private health insurance systems, to set disease- or episode-based prices, or to determine risk-adjusted cutoff points for multitiered pricing or use-limit schemes (van de Ven and Ellis 2000). Researchers often use estimated parameters of healthcare expenditures and use distributions as inputs in decision-theoretic models calibrated for cost-effectiveness analyses (Hoch, Briggs, and Willan 2002). There is also a vast literature assessing the effects of health status and other modifiable characteristics such as asthma, diabetes, heart disease, obesity, patient satisfaction, and pollution on healthcare costs, expenditures, and use (for example, Barnett and Nurmagambetov [2011]; Cawley and Meyerhoefer [2012]; Dall et al. [2010]; Fenton et al. [2012]; Roy et al. [2011]). Statistical modeling choices that every researcher must make are dependent on the type of analysis being performed.

We have suggested that merely reading about econometrics is not the best way to learn. Readers must actively analyze data themselves. It is tempting, therefore, to jump straight away into data analysis and model estimation, perhaps picking particular topics of special interest to focus on. There can be nothing more satisfying to a researcher than seeing a set of regression coefficients and interpreting results. We do not recommend skipping ahead, tempting as it may be.

It is important to begin any statistical analysis with a clear conceptual understanding of the study design and what statistical assumptions are required to obtain a convincing answer to the research question. A leading type of policy-related empirical analysis requires statistical estimates of conditional means and how those conditional means vary across covariates. For example, suppose we wish to understand what would happen to healthcare expenditures as a result of a treatment or intervention. The potential-outcomes framework formalizes a conceptual framework for understanding how one might obtain a reliable answer to such a question (Rubin 1974; Holland 1986). Other analyses are more descriptive in nature; that is, the goal is to describe differences across characteristics that we cannot manipulate easily. For example, we might want to understand the difference in healthcare expenditures between men and women who have otherwise similar characteristics or understand the trajectory of healthcare use across the lifespan of individuals. The insights of the potential-outcomes framework are useful in such circumstances as well, although gender and age are clearly not modifiable in the way that an experimental treatment is. In other analyses, the researcher may simply be interested in the best predictions of individual-level outcomes rather than the effect of a particular covariate on the outcome. In such analyses, the researcher would focus on prediction criteria to formulate an appropriate model. Such criteria may or may not be consistent with a model that is preferable in a causal analysis.

As we have suggested, researchers are often interested in the effects of a policy-modifiable treatment or intervention. Therefore, in this chapter, we provide a brief description of the potential-outcomes framework, beginning with the constructs in a completely general setting. We describe a number of ways to define different kinds of treatment effects. We also show how the framework can be cast in a regression setting, both linear and nonlinear in parameters. Finally, we describe in-sample specification testing and model-selection strategies, as well as out-of-sample cross-validation strategies, to help guide choices of regression specification.

Estimating treatment effects in the potential-outcomes framework is an active area of research. This technical literature can be daunting, but there are also a number of excellent textbook descriptions that provide basic overviews, technical details, and examples (for example, in Wooldridge [2010] and Gelman and Hill [2007]). Imbens and Rubin (2015) devoted an entire book to this topic. There are also a number of surveys of this literature. The classic reference is Heckman and Robb (1985). Other, more recent surveys that take the reader closer to the frontiers of research in this area include Imbens (2004), Heckman and Vytlacil (2007), and Imbens and Wooldridge (2009).

In the following chapters, we describe a variety of linear and nonlinear regression models that work in many disparate situations for the estimation of treatment effects, for estimation of marginal effects, and for predictions of outcomes. Casting the problem at hand in a regression framework has many advantages, but it also opens numerous questions of exactly which regression model to choose in the final analysis. There are choices of what covariates to include and how to include them (polynomials, interactions), of how to specify the functional form of the conditional mean of the outcome (linear, log, or power), and of what statistical distribution to choose to complete the model.

In this chapter, we broadly describe the types of in-sample and out-of-sample strategies a researcher might consider to answer such questions. We think it is important for researchers to have these strategies clearly laid out a priori so that they can make the best modeling choices in a systematic way. We describe details of model specification tests and model-selection methods in subsequent chapters. Cameron and Trivedi (2005) and Greene (2012) provide textbook descriptions of model selection and testing in nested and nonnested contexts. We also refer our readers to Claeskens and Hjort (2008) for detailed descriptions and comparisons of the vast literature on model selection both in- and out-of-sample and to Rao and Wu (2001) and Kadane and Lazar (2004) for more technical descriptions and syntheses of the literature.

2.2 Potential outcomes and treatment effects

The potential-outcomes model of Rubin (1974) and Holland (1986) provides a framework to formally evaluate the conditions under which one can obtain a causal estimate of the effect of a binary treatment or intervention on an outcome. We describe this below, synthesizing the expositions of Wooldridge (2010) and Gelman and Hill (2007). Let the binary indicator d_i denote whether observation i received the treatment or not:

d_i = 1 if observation i received the treatment, and d_i = 0 otherwise

Following Rubin (1974), define potential outcomes y_0i and y_1i as the outcomes observed under control and treatment conditions, respectively. For an observation assigned to the treatment (that is, d_i = 1), y_1i is observed and y_0i is the unobserved counterfactual outcome, representing what would have happened to the unit if it had been assigned to the control condition. Conversely, for control observations, y_0i is observed, and y_1i is the counterfactual outcome. The potential outcomes may be continuous or discrete, nonnegative, positive, or real, etc. Indeed, we have specified nothing about the numeric and statistical properties of the potential outcomes. However, we do need to assume that the treatment of observation i affects only the outcomes for observation i. This rules out spillover effects or externalities in the data-generating process of potential outcomes. In such a setting, the effect of treatment for observation i, denoted by τ_i, is defined as the difference between y_1i and y_0i.

The fundamental problem of causal inference is that we can generally observe only one of these two potential outcomes, y_0i and y_1i, for each observation i. We cannot observe both what happens to an individual after being assigned to treatment (at a particular point in time) and what happens to that same individual after being assigned to the control condition (at the same point in time). In fact, we can relate the observed outcome (y_i) to the potential outcomes using the following relationship,

y_i = d_i y_1i + (1 − d_i) y_0i        (2.1)

which does not allow us to identify both of the two potential outcomes. Thus we can never measure a causal effect directly.

However, we can think of causal inference as a prediction of key features of the distribution of τ_i. The most commonly estimated feature is the average treatment effect (ATE), calculated as

ATE = E(y_1 − y_0)

When the potential outcomes are also determined by other characteristics of the individuals (a vector of covariates x), the ATE conditional on x_i, which is the vector of covariates for observation i, is simply

ATE(x_i) = E(y_1 − y_0 | x_i)

Another commonly estimated effect is the average treatment effect on the treated (ATET), that is, the mean effect for those who were actually treated. This is equal to the ATE calculated only on the subsample of observations that received the treatment,

ATET = E(y_1 − y_0 | d = 1)

The ATET can be extended to incorporate conditioning on x_i:

ATET(x_i) = E(y_1 − y_0 | x_i, d = 1)
How do we estimate these effects, given data on treatment assignment,
observed outcomes, and covariates? The answer to this question depends
on the design of the study, and—by implication—properties of the data-
generating process that generates the potential outcomes. We describe
estimating ATEs in three leading situations below: a laboratory experiment,
a nonlaboratory experiment when randomization is possible, and an
observational study without randomization.

2.3 Estimating ATEs

How might we observe or estimate the potential outcomes for an observation? In some situations, there might be close substitutes for the counterfactual outcomes. In other situations, it might be possible to randomly assign individuals to the treatment and control conditions so that the collection of control units could be viewed as a substitute for the counterfactual outcomes of the collection of treated units. In other situations, especially when close substitutes are unavailable and randomization is infeasible, we may achieve similarity between treated and control units via statistical adjustment, for example, by linear or nonlinear regression.

Each of these approaches requires that an assumption of ignorability holds. In the language of basic statistics, ignorability means that the process by which observations are assigned to the treatment group is independent of the process by which potential outcomes are generated, conditional on observed covariates that partially determine treatment assignment and potential outcomes.

For more intuition, consider this example. Suppose that assignment to a healthcare checkup visit—the treatment group—is determined by the random outcome of a toss of a fair coin and also by an individual’s age, gender, and whether he or she had a checkup last year. Suppose that the potential outcomes under the treatment and control conditions are also determined by the individual’s age, gender, whether he or she had a checkup last year, whether he or she is in the treatment group, and a random error term. Assignment to treatment would be considered ignorable conditional on age, gender, and an indicator for checkup last year because—conditional on those covariates—assignment to treatment is statistically independent of the potential outcomes. Now, imagine the same scenario in which we know age and gender but not whether the individual had a checkup last year. In this case, assignment to the treatment condition no longer satisfies ignorability, because one of the conditioning variables is not observed.

We maintain the ignorability assumption throughout most of the book. However, in chapter 10, we describe methods to obtain causal estimates when this assumption does not hold.

2.3.1 A laboratory experiment

In some situations, it may be possible to measure a close substitute for the counterfactual outcome. Such situations are quite common in the experimental sciences such as biology, chemistry, and physics, where bench scientists might subject a material to both treatment and control environments simultaneously, making the assumption that the samples of material subject to treatment and control are virtually the same in their response to the treatment and control conditions. They would observe y_0 from the sample unit subject to the control condition and y_1 from the virtually identical sample unit subject to the treatment condition. In this case, simple sample averaging of y_0 and y_1 would provide the estimates needed to construct ATEs.

Needless to say, such situations are more difficult to conceive in the context of human subjects. It might be possible to design an experiment in which one of a pair of identical twins is subjected to treatment and the other to the control condition as a way of obtaining a close substitute for the counterfactual outcome. However, such opportunities are rare.

2.3.2 Randomization

In a randomized trial, researchers begin with a pool of reasonably similar individuals, but not similar enough to allow the researcher to identify clone pairs as in the laboratory experiment described above. With such a reasonably homogeneous pool of individuals, a randomized design assigns those observations to the treatment and control conditions randomly. The treatment and control samples are not clone pairs, but they will have similar characteristics on average. In other words, when treatment is assigned completely at random, we can think of the different treatment groups (or the treatment and control groups) as a set of random samples from a common population. Then, because d is independent of y_0 and y_1, the ATE and ATET are identical. To be precise, E(y_1 | d = 1) = E(y_1) and E(y_0 | d = 1) = E(y_0).

Randomization also provides a simple way to calculate estimates of expected potential outcomes, E(y_1) and E(y_0), using sample averages of expected observed outcomes, E(y | d). To see this, consider that

E(y_1) = E(y_1 | d = 1) = E(y | d = 1)

where the fact that d is independent of y_1 is necessary to establish the first equality, and we use (2.1) to establish the relationship between potential and observed outcomes. Similarly,

E(y_0) = E(y_0 | d = 0) = E(y | d = 0)

If the treatment is randomized, we can obtain a consistent estimate of E(y | d = 1) by calculating the mean of the observed outcome in the treated sample, and we can obtain a consistent estimate of E(y | d = 0) by calculating the mean of the observed outcome in the control sample. We calculate the estimates of ATEs by applying these estimates of expected potential outcomes to the respective formula for the treatment effects of interest.
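In Stata, the ATE estimate under randomization is therefore just the difference in sample means of the observed outcome across treatment groups. The lines below are a minimal sketch; y (observed outcome) and d (treatment indicator) are hypothetical placeholder names rather than variables from the book's dataset.

* Difference in means as an estimate of the ATE under randomization
summarize y if d == 1, meanonly
scalar mu1 = r(mean)
summarize y if d == 0, meanonly
scalar mu0 = r(mean)
display "ATE estimate = " mu1 - mu0

* Equivalently, regress y on the treatment indicator; the coefficient on d
* is the same difference in means (see section 2.4.1)
regress y i.d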

2.3.3 Covariate adjustment

When randomization is not possible, there will typically be self-selection into treatment. Individuals choose whether to receive treatment or not. In part, their choice will be determined by the values of their observed covariates and, in part, by the values of their unobserved characteristics. For example, the decision to try a prescription drug (treatment) may depend on age (observable) and aversion to pain (unobservable).

As we described earlier in this section, ignorability is a key assumption required to obtain treatment effects in this context. Ignorability implies that the unobserved characteristics that determine selection into treatment are conditionally independent of the unobserved characteristics that determine the potential outcomes. Wooldridge (2010) describes these, and related conditions, in greater detail.

Denote μ_0(x) = E(y_0 | x) and μ_1(x) = E(y_1 | x). When conditional independence holds, it is still true that

E(y_1 | x, d) = E(y_1 | x) = μ_1(x)

and

E(y_0 | x, d) = E(y_0 | x) = μ_0(x)

In addition, using (2.1) and the conditional independence assumption, we get E(y | x, d = 1) = μ_1(x) and E(y | x, d = 0) = μ_0(x). Therefore, as in the randomized trial case, we can estimate ATE and ATET using observed outcomes, observed treatment assignment, and observed covariates. It is possible to calculate estimates of μ_0(x) and μ_1(x) quite generally and even nonparametrically, but we will describe parametric regression-based methods below.

Before we do that, it is useful to see the general formula for estimates of ATE and ATET, assuming one has consistent estimates of μ_0(x) and μ_1(x). In other words, the estimate of ATE is the sample average of the differences in estimated predicted outcomes in the treated and control states. The formula for the estimate of ATE is

(1/N) Σ_{i=1..N} {μ̂_1(x_i) − μ̂_0(x_i)}

where N is the sample size. The formula for the estimate of ATET in this case has a similar form, but this time, the sample over which we average is restricted to observations that received treatment. Formally, it is

(1/N_1) Σ_{i: d_i = 1} {μ̂_1(x_i) − μ̂_0(x_i)}

where N_1 is the number of treated observations.
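These formulas can be computed directly in Stata by fitting the outcome model separately in the treated and control samples, predicting both potential outcomes for every observation, and averaging the differences; the teffects ra command automates the same regression-adjustment calculation. The sketch below is illustrative only; y, d, age, and female are hypothetical placeholder names.

* Regression adjustment "by hand"
regress y age female if d == 1
predict double mu1hat, xb
regress y age female if d == 0
predict double mu0hat, xb
generate double te = mu1hat - mu0hat
summarize te              // sample average = ATE estimate
summarize te if d == 1    // average over treated observations = ATET estimate

* The same calculation with a single command
teffects ra (y age female) (d)          // ATE
teffects ra (y age female) (d), atet    // ATET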

2.4 Regression estimates of treatment effects

We now show how you can use regression models to estimate treatment effects. We remind our readers that if you are interested in estimating ATE or ATET, or those effects conditional on covariates, modeling efforts should focus on obtaining the best estimates of the conditional mean functions, μ_0(x) and μ_1(x). Consistency is clearly a desired property of the estimators, but precision is important as well. As is typical, there is often a tradeoff between consistency and efficiency of estimators, so we urge our readers to think through their modeling choices carefully before proceeding.

2.4.1 Linear regression

With the above general principles in mind, it is useful to begin with the randomization case even though no regression is necessary. In that case, we only need to estimate sample means. Nevertheless, we can also obtain an estimate of the ATE (which is equal to the ATET) via a simple linear regression. Without any loss of generality, we can write the relationship between the observed outcome, y_i, and the treatment assignment, d_i, as

y_i = β_0 + β_1 d_i + ε_i

where β_0 and β_1 are unknown parameters to be estimated and E(ε_i) = 0 for the random error term ε_i. In this case, E(y | d = 0) = β_0 and E(y | d = 1) = β_0 + β_1. So the ATE is β_1, and we can obtain its best linear unbiased estimate by an ordinary least-squares regression of y on d.

When observed covariates and unobserved characteristics determine selection into treatment and potential outcomes as we have shown above, the conditional independence assumption allows us to estimate ATEs using estimates of conditional means of observed outcomes. When we work with observational data, and even in large experiments carried out in natural environments, it is not always possible to achieve close similarity at the individual level or to be confident of randomization, especially if such an approach was pursued at group levels. In such situations, statistical modeling can help control for the effects of characteristics that are determinants of potential imbalances between treatment and control samples. For instance, by estimating a regression, we may be able to estimate what would have happened to the treated observations had they received the control, and vice versa, all else being held constant. Such an approach requires that the chosen regression specification be the correct data-generating process, or at least approximately correct.

The simplest, and perhaps most commonly used, linear (in parameters) regression specification is also specified as being linear in variables (the treatment indicator d and a vector of covariates x):

y_i = β_0 + β_1 d_i + x_i'γ + ε_i        (2.2)

The ATE is β_1, and its estimate is β̂_1. Note that the inclusion of regressors in x that are either higher-order polynomial terms of covariates or interactions between covariates does not change the estimate of the ATE in any essential way. Also, the estimate of the ATET is β̂_1, which is the same as the ATE estimate. However, unlike in the randomization case, this result is not a natural consequence of the study design. Instead, it is a restriction imposed by the functional form we chose for the regression. This functional form assumes that the treatment effect is identical for all; an alternative, shown below, interacts the treatment effect with covariates, allowing different treatment effects for different subpopulations.

If this regression specification adequately describes the data-generating process, we might comfortably conclude that ATE is equal to ATET. If not, we should enrich the specification. In chapter 4, we describe a number of specification checks and tests that help answer this question.

For now, consider a more general regression specification that relaxes the constraint of equality between ATE and ATET. Including terms in the regression specification that are interactions between d and x achieves this end. To see this, consider a model that includes a full set of interactions between covariates and the indicator for treatment, d:

y_i = β_0 + β_1 d_i + x_i'γ + d_i (x_i'δ) + ε_i

In this specification, the expected outcome in the control condition (d = 0) is

E(y | x, d = 0) = β_0 + x'γ

and the expected outcome in the treated condition (d = 1) is

E(y | x, d = 1) = β_0 + β_1 + x'γ + x'δ

The difference between expected outcomes in the treated and control conditions is

E(y | x, d = 1) − E(y | x, d = 0) = β_1 + x'δ

Unlike the prior case shown in (2.2), the expected outcomes in treated and control cases and—consequently—the individual-level differences in expected outcomes are functions of the values of the individual’s covariates, x_i, leading to differences between ATE and ATET. Sample averages of estimates of individual-level differences in expected outcomes, over the entire sample for ATE and over the treated sample for ATET, are valuable. However, they may hide considerable amounts of useful information about how treatment effects vary across substantively interesting subgroups of the population. For example, the ATE of a checkup visit may be substantially different for men as opposed to women. Estimating two ATEs, one for the sample of men and the other for the sample of women, would provide a much richer understanding of the effect of this intervention than just one estimate.
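Stata's factor-variable notation makes the fully interacted specification easy to fit, and margins averages the implied individual-level differences. The sketch below is illustrative; y, d, age, and female are hypothetical placeholder names.

* Fully interacted linear model: treatment indicator interacted with all covariates
regress y i.d##(c.age i.female)

* Average difference in predicted outcomes between d = 1 and d = 0
margins r.d                   // ATE estimate
margins r.d, subpop(if d==1)  // ATET estimate

* Separate effects for substantively interesting subgroups, for example by gender
margins r.d, over(female)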

This specification, a fully interacted regression model, raises concerns that the model is overspecified. A researcher may wonder whether there are too many extraneous variables in the specification. Perhaps only a few interactions are necessary. In the population, the coefficients on such extraneous variables would be zero. However, in finite samples, adding extraneous variables would decrease the precision of model estimates. In fact, estimates of ATEs from an overspecified model may be so imprecise that they render the point estimate relatively uninformative. Again, specification checks and tests described in chapter 4 could help answer this question.

The fully interacted regression specification is as general as you can make a linear-in-parameters regression specification for the conditional mean of the outcome. You can include as many covariates as you see fit and as many interactions between those covariates and polynomial functions of those covariates as you choose. Because the specification interacts every one of those terms with the binary indicator for treatment, it is akin to estimating two separate regressions with that specification of covariates—one for the sample of treated observations and the other for the sample of control observations.

However, these two model specifications are not quite equivalent. Estimating one fully interacted regression model on the entire sample assumes that the regression errors are homoskedastic. Estimating them separately allows the variance of the error terms to differ in the treated and control samples. This difference in the specifications of the variances of the errors does not change the point estimate of ATE or ATET. It will change the standard errors of those estimates, however; we will describe this in more detail and show examples in chapter 4.

2.4.2 Nonlinear regression

We can extend the regression approach to include statistical models that are nonlinear in parameters, such as most generalized linear models, logit, probit, Poisson, negative binomial, and other models for count data. We will describe a number of such models in detail in later chapters. For now, consider a nonlinear regression model in which

E(y | x, d) = f(β_0 + β_1 d + x'γ)

In this model, the covariates and the treatment indicator enter in a linear, additive way first, but then their effect on the outcome is transformed by a nonlinear function, f(·). In this setting, the individual-level expected treatment effect is no longer a linear function of covariates. Instead, it is

f(β_0 + β_1 + x_i'γ) − f(β_0 + x_i'γ)

Once again, the individual-level expected treatment effect is a function of the covariates, x_i, so it will vary from individual to individual across the sample. The estimation of the sample ATE is

(1/N) Σ_{i=1..N} {f(β̂_0 + β̂_1 + x_i'γ̂) − f(β̂_0 + x_i'γ̂)}

where N denotes the sample size.

To estimate the ATET, we take the above formula but average only over the sample of treated observations. Here—as in nonlinear models generally—the individual-level expected treatment effect is a function of the covariates, x_i, so expected treatment effects averaged over different samples will yield different estimates. Specifically, ATE will not be equal to ATET.
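The same margins-based recipe carries over to nonlinear estimation commands, because margins averages predicted outcomes (and their differences) over whichever sample we specify. The sketch below uses a GLM with a log link as one possible nonlinear choice; y, d, age, and female are hypothetical placeholder names, and the outcome is assumed strictly positive for the gamma family.

* A nonlinear conditional mean: E(y | x, d) = exp(b0 + b1*d + x'g)
glm y i.d c.age i.female, family(gamma) link(log)

* Average difference in predictions between d = 1 and d = 0
margins r.d                   // sample ATE
margins r.d, subpop(if d==1)  // sample ATET

* Effect for a hypothetical individual with particular covariate values
margins r.d, at(age=40 female=1)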

In each of the models described above, we have first described an ATE that averages the effects over the individuals in the sample and therefore over the distribution of covariates in the sample. We have also described the ATET, which requires averaging over the subsample of treated observations. In other cases, it may be insightful to calculate the ATEs for a hypothetical individual with a particular set of characteristics (covariates). For example, we may be less interested in comparing those with insurance with those without insurance over all individuals, and more interested in comparing those with insurance with those without insurance for individuals of lower socioeconomic status.

2.5 Incremental and marginal effects

So far, we have framed the researcher’s problem as estimating the effect of a planned treatment or the effect of a treatment that naturally arises from a policy change. In both these cases, the treatment is a consequence of a planned intervention, and the policy question is whether that intervention had a desired effect. Nevertheless, many important empirical research questions are more descriptive in nature. For example, researchers may wish to know the difference in healthcare expenditures between men and women, or researchers may wish to know how many more doctor visits people take if they have some additional income, all else equal. Although these descriptive questions can also be framed as treatment effects, they are typically not described as such.

We will maintain an arguably artificial distinction between treatment effects of modifiable interventions and effects of other covariates that are not interventions and may not be modifiable. We use the phrase “incremental effect” to describe the effect of a change in an indicator variable—such as an individual’s gender—and the phrase “marginal effect” to describe the effect of a small change in a continuous variable, such as an individual’s income. Thus the average incremental effect of a binary indicator would be akin to the ATE; it would be calculated in exactly the same ways as described in section 2.4 in the contexts of linear and nonlinear regression models. The average marginal effect would also be akin to the ATE, but this time—instead of computing differences in outcomes—we would compute the derivative of the expected outcome with respect to the continuous covariate of interest.

To be more precise, let’s first consider a linear regression model in


which

$E(y \mid x, d) = \beta_0 + \beta_1 x + \beta_2 d$

where $x$ is a continuous covariate and $d$ is a binary indicator. The incremental effect of $d$ is the discrete difference

$E(y \mid x, d = 1) - E(y \mid x, d = 0) = \beta_2$

The marginal effect of $x$ is the derivative

$\partial E(y \mid x, d)/\partial x = \beta_1$

Both the average incremental effect and the average marginal effect are
simply the coefficients on the respective variables in the regression. They
are constant across the sample by definition.

Now consider a nonlinear regression model in which

$E(y \mid x, d) = F(\beta_0 + \beta_1 x + \beta_2 d)$

where $x$ is a continuous covariate and $d$ is a binary indicator covariate. The incremental effect of $d$ is the discrete difference

$F(\beta_0 + \beta_1 x + \beta_2) - F(\beta_0 + \beta_1 x)$

The marginal effect of $x$ is the derivative

$\partial E(y \mid x, d)/\partial x = \beta_1 F'(\beta_0 + \beta_1 x + \beta_2 d)$

Both the incremental effect and marginal effect will vary from individual to individual across the sample, because the function $F(\cdot)$ is
nonlinear. We can calculate sample averages of these effects in a variety of
ways, just as we can treatment effects.

Interaction terms see extensive use in nonlinear models, such as logit


and probit models. Unfortunately, the intuition from linear regression
models does not extend to nonlinear models. The marginal effect of a
change in both interacted variables is not equal to the marginal effect of

changing just the interaction term. More surprisingly, the sign may be
different for different observations (Ai and Norton 2003). We cannot determine the statistical significance from the test statistic reported in the regression output. For more on the interpretation of interaction terms in
nonlinear models and how to estimate the magnitude and statistical
significance, see Ai and Norton (2003) and Norton, Wang, and Ai (2004) .

In many of the examples we will use throughout this book, for


simplicity, we will frame the underlying research question as being of a
descriptive nature. Consequently, we will typically use incremental effects
and marginal effects to describe the effects of interest. However, in each of
those cases, especially if the researcher has a treatment in mind—but also
in the purely descriptive situations—the formal potential-outcomes
framework described here will provide invaluable insight into calculation
and interpretation of effects.

2.6 Model selection

Once researchers have a good understanding of the parameters of interest,


we recommend they examine the basic characteristics of the data they will
use to estimate the parameters of a suitable econometric model. Some of
the key questions involving the basic characteristics of the outcome of
interest are as follows: Is the outcome always positive? Is it nonnegative
with a substantial mass at zero? Is it integer valued? A second set of
characteristics involves the nature of the statistical distribution of the
outcome. Is the distribution of the outcome variable highly skewed? Is
there good a priori reason to believe the parameter of interest varies across
segments of the distribution of the outcome or on some other unobserved
dimension? The answers to these questions will narrow down the class of
models for consideration.

Next, the researcher should use a battery of specification checks and


tests and model-selection criteria to narrow down the specification of the
model along dimensions of specification of covariates, functional
relationship between the outcome and covariates, and statistical
distributions for the outcome (error). Needless to say, the researcher
should revisit data characteristics and model choices if necessary. We view
this as a critical component of a good empirical analysis, so we describe
graphical checks and statistical tests throughout the book to help choose
between alternative models. In this section, we provide an introduction to
the model-selection approach we take throughout the book. We remind
readers that there is a vast literature on this topic and refer them to
Claeskens and Hjort (2008), Rao and Wu (2001) , and Kadane and
Lazar (2004) for further reading.

When the regression model is linear in parameters—or in the class of


generalized linear models—the regression residuals form a basis for
graphical checks of fit, which we demonstrate in chapter 4. Such checks
can be extremely useful in detecting whether powers of covariates or
interactions between covariates are necessary to specify the model
correctly or whether a transformation of the outcome variable might
improve the specification considerably.

However, although graphical tests are suggestive, they are not formal
statistical tests . In chapter 4, we present three statistical tests for assessing

model fit. The first two, Pregibon’s (1981) link test and Ramsey’s (1969) regression equation specification error test, directly test whether the specified linear regression shows evidence of needing higher-order powers of covariates or interactions of covariates for appropriate specification. The third—a modified version of the Hosmer–Lemeshow (1980) test—
can be used generally, because it is based on a comparison between
predicted outcomes from the model and model-free empirical analogs. If
the model specification is not correct, then an alternative specification may
predict better, indicating that the specification of the explanatory variables
should change. When the modeling choices involve decisions such as
adding covariates or powers and interactions of existing covariates,
standard tests of individual or joint hypotheses (for example, Wald and $F$ tests) can also be useful.

2.6.1 In-sample model selection

The set of candidate models under consideration for an empirical


application is often nonnested. This typically rules out standard statistical
testing of model choice. In such situations, likelihood-based model-
selection approaches are the most straightforward way to evaluate the
performance of alternative models. Two model-selection criteria, which
penalize the maximized log likelihood for the number of model
parameters, are common: the Akaike information criterion (AIC)
(Akaike 1970) and the Bayesian information criterion (BIC), also known as
the Schwarz Bayesian criterion (Schwarz 1978). Both of these criteria
have been shown to have many advantages in many circumstances,
including robustness to model misspecification (Leroux 1992).

When the data have additional statistical issues, such as clustering and
weighting, a strict likelihood interpretation of the optimand is often
invalid. In most such situations, however, the model optimand has a
quasilikelihood interpretation that is sufficient for these two popular
model-selection criteria to be valid (Sin and White 1996; Kadane and
Lazar 2004).

The AIC (Akaike 1970) is

$\text{AIC} = -2\ln L + 2k$

where $\ln L$ is the maximized log likelihood (or quasilikelihood) and $k$ is the number of parameters in the model. Smaller values of AIC are preferable. The BIC (Schwarz 1978) is

$\text{BIC} = -2\ln L + k\ln N$

where $N$ is the sample size. Smaller values of BIC are also preferable. For
moderate to large sample sizes, the BIC places a premium on parsimony.
Therefore, it will tend to select models with fewer parameters relative to
the preferred model, based on the AIC criterion.

Although we can apply these formulas directly in the linear regression


context if we calculate the normal-likelihood value corresponding to the
least-squares estimator, it is typical to think of model fit in the linear
regression context to be a function of the sum of squared residuals (SSR),
which every statistical package directly reports. Denote the sum of squared residuals after ordinary least-squares estimation of a model by SSR. Then, up to constants that do not affect model comparisons,

$\text{AIC} = N\ln(\text{SSR}/N) + 2k$

and

$\text{BIC} = N\ln(\text{SSR}/N) + k\ln N$

Note that there are many other formulas for AIC and BIC throughout the
literature. Closer examination of alternative formulas shows that they are
substantively only minor variations of the equations shown above. For
example, switching signs of each term in the formula suggests that one
should search for the model with the largest values of the criteria.
Sometimes, AIC and BIC are formulated with an overall division by , the
sample size. This formulation is substantively no different from the ones
we have described.
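In Stata, both criteria can be obtained after estimation with the estat ic command. A minimal sketch, assuming the MEPS example dataset and an expenditure variable named exp_tot (an assumed name used only for illustration):

* Compare two candidate specifications by AIC and BIC
* (variable names are illustrative assumptions)
use http://www.stata-press.com/data/eheu/dmn_mepssample.dta, clear
quietly regress exp_tot age lninc
estat ic                      // AIC and BIC for the linear-in-age model
quietly regress exp_tot c.age##c.age lninc
estat ic                      // smaller values indicate the preferred model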

As mentioned above, the AIC and BIC are robust to many of the
misspecification issues that plague traditional test statistics, most notably
in the context of complex survey data issues. Because the derivation of the

criteria does not involve moment conditions, or convergence to statistical
distributions, they are invariant to the typical corrections to standard errors
required to make test statistics the correct size when observations are not
independently and identically distributed (Schwarz 1978; Sin and
White 1996). In general, as long as the likelihood or quasilikelihood (or
weighted likelihood if sampling weights are used) is appropriate as
objective functions to obtain consistent parameter estimates, the AIC and
BIC have desirable optimality properties.

2.6.2 Cross-validation

We also recognize that in-sample model checks may not always be


reliable. Overfitting is a real concern, so we strongly recommend cross-
validation checks (Picard and Cook 1984; Arlot and Celisse 2010). Cross-
validation is a technique in which we estimate the model on a subsample of the full sample, known as the training sample, and then assess model fit on the remaining observations, known as the validation sample.

$K$-fold cross-validation randomly partitions the original sample into $K$ subsamples. Of the $K$ subsamples, a single subsample is retained as the validation sample for testing the model, and the remaining $K-1$ subsamples are used as training data. The cross-validation process then repeats $K$ times (the folds), with each of the $K$ subsamples used exactly once as the validation data. We work through an example of $K$-fold cross-
validation in chapter 8 to choose between a number of alternative models
for estimating integer-valued outcomes.
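A minimal sketch of 5-fold cross-validation of out-of-sample mean squared error for a linear regression, assuming the MEPS example dataset and an expenditure variable named exp_tot (an assumed name); the worked example in chapter 8 may differ in its details:

* 5-fold cross-validation of out-of-sample mean squared error
use http://www.stata-press.com/data/eheu/dmn_mepssample.dta, clear
set seed 12345
generate fold = 1 + floor(5*runiform())       // assign each observation to one of 5 folds
generate double sqerr = .
forvalues k = 1/5 {
    quietly regress exp_tot age lninc if fold != `k'    // fit on the training folds
    quietly predict double yhat, xb
    quietly replace sqerr = (exp_tot - yhat)^2 if fold == `k'
    drop yhat
}
summarize sqerr                                // mean is the cross-validated MSE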

2.7 Other issues

In chapter 10, we will describe methods that apply to the case when there
is selection on unobservables , one form of endogeneity . For example,
when a researcher is interested in the causal effect of insurance on
healthcare expenditures, and when the dataset is an observational sample
of individuals who have chosen to purchase health insurance (or not), it is
difficult to rule out endogeneity .

Although much of the focus in the literature—and a substantial focus


in this book—is on estimates of conditional means and their derivatives
(that is, ATEs) , policymakers may be interested in other features of the
distributions of expenditures and use (Bitler, Gelbach, and Hoynes 2006;
Vanness and Mullahy 2012). Distributions of outcomes also matter.
Therefore, throughout this book, for example in chapter 8, we will
illustrate the calculation of other parameters of interest, such as
probabilities of discrete outcomes.

Chapter 3
MEPS data

3.1 Introduction

In this chapter, we provide a widely used dataset on healthcare


expenditures and use in the United States to illustrate many of our points
and to allow readers to reproduce examples in this book. The empirical
examples in this book use data from the 2004 Medical Expenditure Panel
Survey (MEPS), a national survey on the financing and use of medical care
in the United States. The Agency for Healthcare Research and Quality
(AHRQ) , a federal organization in the United States, has collected MEPS
data annually since 1996. We draw the data used in these examples
primarily from the Household Component, one of four components. The
Household Component contains data on a sample of families and
individuals, drawn from a nationally representative subsample of
households that participated in the prior year’s National Health Interview
Survey . AHRQ uses MEPS data to produce annual estimates for a variety of
measures of healthcare expenditure and use, health status, health insurance
coverage, and sources of payment for health services in the United States.
The overlapping panel design of the survey features several rounds of
interviews covering two full calendar years. The annual file combines data
from several interviews across two different panels, so our example dataset
has one observation per person. More information is publicly available at
http://www.meps.ahrq.gov/mepsweb/.

We use a subset of the MEPS 2004 annual file to illustrate alternative


modeling approaches, because it has expenditure and use variables for
several types of healthcare, as well as demographic and insurance
variables. In this chapter, we describe the summary statistics and
distributions of the most important variables, including showing
histograms for a number of outcomes with severely right-skewed
distributions.

We made a number of decisions when constructing the dataset used for


the examples in this book. Among those decisions was our treatment of
full- versus partial-year respondents, missing values, and variables to use
as outcomes and covariates. Therefore, we recommend that those
interested in using MEPS for their own research begin with the raw data and
documentation provided at http://www.meps.ahrq.gov/mepsweb/, rather
than our particular sample. Our data choices may not match the needs of
others conducting their own research using MEPS.

3.2 Overview of all variables

In this book’s examples, we analyze various measures of annual


expenditures and health service use. The example MEPS dataset has 44
variables and 19,386 observations on individuals who are age 18 years and
older and have complete information on healthcare expenditures, counts of
healthcare use, and key covariates. In addition, there are five variables that
identify dwelling units, households and individuals in those households,
and three variables that indicate features of the complex survey design that
are useful when accounting for design effects (see chapter 11).

The demographic variables represent age, gender, race, ethnicity,


family size, education, income, and region (see section 3.4 for further
description and summary statistics).

There are three health-status variables and four insurance variables
(see section 3.4 for further description and summary statistics).

There are nine measures of expenditures and six measures of


healthcare use. We often use these variables as dependent variables in the

examples (see section 3.3 for further description and summary statistics).

3.3 Expenditure and use variables

All the expenditure and use variables are highly skewed, with a large mass
at zero. Expenditures include out-of-pocket payments and third-party
payments from all sources. They do not include insurance premiums. We
measure all expenditures in 2004 U.S. dollars; adjusting expenditures for
inflation to 2016 would increase nominal amounts by about 27%. There
are several ways to provide summary statistics for each type of
expenditure. First, we provide summary statistics on all observations,
including those with zeros . We report skewness and kurtosis to show how
skewed these variables are. None of the summary statistics are corrected
for differential sampling or clustering. Total annual expenditures on all
healthcare averaged $3,685 (in 2004 dollars), with a range from $0 to
$440,524. Inpatient expenditures averaged $1,123. Inpatient expenditures
are divided into inpatient facility and inpatient physician expenses. The
total amount paid by a family or individual was less than $700 on average
but was as high as $50,000.

Next, we construct dummy variables that equal one for observations


with zero expenditures to help show the distributions of these variables.
We give them names ending in 0. Summarizing them shows the fraction of
the sample without any expenditure. Although the majority have some
healthcare expenditures, more than 17% do not. The majority of
individuals in the sample have no inpatient stays, emergency room visits,
or dental expenditures.

Finally, we show summary statistics (including the coefficient of
skewness ) for the subset with positive values (different for each
variable), both for the raw variable and the logged variable. We give them
names ending in gt0. The raw positive expenditure variables have
extremely high skewness, with values ranging from 4.6 to almost 13.

The logged expenditures have skewness much closer to zero, implying


that the distribution is closer to being lognormal. However, this is not a
formal test; we cover the Box–Cox test of normality in section 6.5.

The distribution of the logarithm of positive values of total healthcare


expenditures looks much more symmetric than the distribution of positive

expenditures (see figure 3.1). Although it is tempting to conclude that the
distribution of the logarithm of expenditures is normal, or truly symmetric,
both of those conclusions are typically wrong; modeling expenditures as such can lead to incorrect conclusions. One of this book’s main themes is how to model such variables appropriately.

[Histogram of ln(total expenditures): density plotted against ln(total expenditures), which ranges from 0 to about 14]

Figure 3.1: Empirical distribution of ln(total expenditures)

The example dataset has six variables that measure healthcare use
(discharges, length of stay, three kinds of office-based visits, and
prescriptions). Each use variable has a large mass at zero and a long right
tail. On average, people had nearly 6 office-based provider visits, almost
13 prescriptions (or refills), and about 1 dental visit. About 29% have no
office-based provider visits during the year, and 5% have at least 25. One-
third have no prescriptions or refills during the year, while 17% have at
least 25. Well over half the sample report having no dental visits during
the past year. About 60% of adults have no dental visits, while closer to
30% have no office-based provider visits, prescriptions, or refills.

[Figure: histograms of three healthcare use measures, each with a large mass at zero.
Office-based provider visits: 5,673 observations have 0 visits; 822 are top-coded at 25.
Dental visits: 11,968 observations have 0 visits.
Prescriptions and refills: 6,470 observations have 0; 3,139 are top-coded at 25.]

Figure 3.2: Empirical distributions of healthcare use

The density for all six use variables falls gradually for nearly the entire
range. Histograms for three of the use variables (office-based visits, dental

visits, and prescriptions and refills) show a large mass at zero and a declining density for positive values (see figure 3.2). We top-coded some values at 25 for the purpose of the histograms.

3.4 Explanatory variables

The explanatory variables in the dataset include demographics , education,


income, and geographic location. The average age is 45 and ranges from
18 to 85. However, AHRQ top-coded age at 85 because of confidentiality
concerns. Just over half the sample is female. The vast majority is white
(80%). About 14% are African American, and the remaining 6.5%
comprise an other-race category. The race and ethnicity variables are not
mutually exclusive. The distribution of race and ethnicity variables reflects
oversampling of minority groups. Family size averages 3.0 but is as high
as 13.

Years of education is coded as a sequence of categorical variables.


About 30% have no more than a high school degree, and another 30%
have at least a college degree. The natural logarithm of household income
is fairly symmetric, with a mean of about 10.6. Household income
averages nearly $60,000 and is quite skewed. At the high end are the 398
households with annual income greater than $200,000, and at the bottom
end are the 498 households with annual income less than $6,000.

There are three health measure variables . One is a dichotomous


measure of whether the person has any limitations, based on activities of

daily living and instrumental activities of daily living. About 28% of the
sample has at least 1 limitation. The other two health measures are based
on the physical and mental health components of the Short Form 12. They
are used to construct continuous measures on a scale from 0 to 100, with a
mean of about 50. A higher number indicates better health. Both
distributions are skewed left, with a median three to four points above the
mean.

General health insurance is divided into four categories (with private


insurance being the omitted group). About 19% are covered by Medicare,
14% by Medicaid, and 49% by private insurance. The remaining 18% are
uninsured. There are 760 observations dually eligible for both Medicare
and Medicaid. In addition to regular health insurance, 40% of the sample
have prorated dental insurance, giving some observations fractional values.

3.5 Sample dataset

Interested readers can use the example dataset based on the 2004 MEPS data
to reproduce results found in this book. The sample from the 2004 full-
year consolidated data file includes all adults ages 18 and older and who
have no missing data on the main variables of interest. There are 19,386
observations on 44 variables. This dataset is publicly available at
http://www.stata-press.com/data/eheu/dmn_mepssample.dta.

As stated in the introduction, we created the example dataset in 2008


for illustrative purposes only. It is not intended for research and does not
include any updates that AHRQ made to the file since that time. The dataset
does not include poststratification weights to reflect sample loss due to
partial years of participation or item nonresponse. Interested readers
should use this sample dataset for learning purposes only and should
obtain the most recent version of MEPS to conduct any research.

3.6 Stata resources

Stata has excellent documentation for users to learn commands. To get


started , see the Getting Started With Stata manual —which introduces the
basic commands and interface—and the Stata User’s Guide , which has a
brief overview of the essential elements of Stata and practical advice. At
the end of each remaining chapter, we highlight the commands used in the
chapter and where to find more information about them in the Stata
manuals.

Use the describe command to describe each variable’s basic


characteristics, including the often informative label. The command for
basic summary statistics is summarize ; use tabstat or summarize with the
detail option to generate more extensive statistics. Two of the most
commonly used commands for data cleaning are generate and replace .
See the Data-Management Reference Manual for commands to describe,
generate, and manipulate variables.

To visually inspect variable distributions, use the histogram


command. To create scatterplots of variables that show their relationship
visually, use the scatter command. See Stata’s Graphics Reference
Manual for all graphing commands.

Chapter 4
The linear regression model: Specification and
checks

4.1 Introduction

The linear regression model is undoubtedly the workhorse of empirical


research. Researchers use it ubiquitously for continuous outcomes and
often for count and binary outcomes. With relatively few assumptions —
namely, the relationship between the outcome and the regressors is
correctly specified, and the error term has an expected value of zero
conditional on the values of the regressors—ordinary least-squares (OLS)
estimates of the parameters of the model are unbiased and consistent. In
other words, given those two assumptions, OLS delivers estimates that are
correct on average. In addition, if the errors have constant variance across the sample observations and are uncorrelated across observations, then OLS produces estimates that have the
smallest variance among all linear unbiased estimators.

The formal statement of these properties is the Gauss–Markov theorem


. Many textbooks formally discuss the assumptions and proof, including
Wooldridge (2010) and Cameron and Trivedi (2005) . The Gauss–Markov
theorem has two main implications. First, OLS estimates have the desirable
property of being unbiased under relatively weak conditions. Second, there
is no linear estimator with better properties than OLS . These desirable
features mean, in many cases, we can use the linear regression model to
estimate causal treatment effects and marginal and incremental effects of
other covariates, as we outlined in chapter 2. In this chapter, we show how
these can be implemented in Stata and discuss the interpretation of various
effects.

The Gauss–Markov theorem applies to OLS models only when the


assumptions are met. If a regressor is endogenous, for example, the conditional expectation of the error term is not zero. This violates one of the Gauss–Markov theorem’s assumptions, and the OLS estimates would be inconsistent. With observational data, researchers
should always be aware of the possibility of endogenous regressors. We
address these issues in chapter 10.

The other main assumption is that the model specification is correct.


Estimation of a linear model without serious consideration of the model
specification can lead to substantially misleading answers. One of the most
important features of any model is the relationship of the covariates to the

dependent variable. Correct specification of the relationship is a key
assumption of the theorem. In practice, while researchers cannot claim to
know the true model, they should strive to specify good models. A good
model includes all the necessary variables—including higher-order polynomials and interaction terms—but no more. A good model includes
variables with the correct functional relationship between the covariates
and outcome. Choosing the correct model specification requires making
choices. There is tension between simplicity and attention to detail, and
there is tension between misspecification and overfitting. We address these
issues in this chapter.

In this chapter, we show with two examples how easy it is to estimate


inconsistent marginal effects when the fit model is misspecified. Marginal
effects are surprisingly sensitive to model misspecification. If we include a
variable in the model, but the relationship between it and the dependent
variable is not correct, the estimated marginal effects of that variable are
sensitive to the distribution of that covariate and to whether the marginal
effects are conditioned on a specific value of that covariate.

Some readers may wonder why we obsess about model specification .


A commonly held belief is that the estimate of the average marginal effect (AME) of a covariate is consistent even if it is estimated using a misspecified model. We show that this can easily be
false. In addition, we believe that the focus on average effects is too
narrow a view, because policy interest is often about the response to a
covariate for a specific value of that covariate. For example, we may care
only about the effect of a weight-loss drug on those with an unusually high
body mass index, rather than the entire population. In the case of health
insurance, we might be worried about the effect of raising the coinsurance
rates or deductibles in the least generous health insurance plans rather than
for all health insurance plans. In such situations, the marginal effect for the
subsample of interest may be inconsistent, even if the average of marginal effects for the full sample is not.

Because we never know the correct model specification (theory rarely


provides guidance for model specification), it is important to know how to
make informed choices. To this end, the final sections in this chapter
describe visual and statistical methods to test model specification.

4.2 The linear regression model

It is useful to begin with a precise, mathematical formulation of the linear


regression model, in which

$y_i = \mathbf{x}_i\boldsymbol{\beta} + \varepsilon_i$

where $y_i$ is the outcome for the $i$th observation ($i = 1, \ldots, N$), $\mathbf{x}_i$ is a row vector of covariates including a constant, $\boldsymbol{\beta}$ is a column vector of coefficients to be estimated including the intercept, and $\varepsilon_i$ is the error term. A linear specification can include nonlinear terms in $\mathbf{x}_i$ but is always linear in $\boldsymbol{\beta}$. Specifications that are nonlinear in $\boldsymbol{\beta}$ generally cannot be transformed into a linear specification.

As we showed in chapter 2, if the model is linear in variables, then the


estimates of treatment, marginal, and incremental effects are all simply
regression coefficients. Nonlinear terms, which are interactions between
covariates or polynomial terms of covariates, are functions of parameters
and covariate values. We must estimate and interpret them more carefully.

4.3 Marginal, incremental, and treatment effects

We begin with a fairly simple OLS regression model to predict total annual
expenditures at the individual level for those who spend at least some
money, using the 2004 Medical Expenditure Panel Survey (MEPS) data (see
chapter 3). Our goal is to interpret the results using the framework of
potential outcomes and marginal effects (see chapter 2). To be clear, the
model we fit may not be appropriate for a serious research exercise: it
drops all observations with zero expenditures, and its specification of
covariates is rudimentary. Additionally, we do not consider any possible
models besides OLS, especially ones that may be better suited to deal with
the severe skewness in the distribution of this outcome, and we do not
control for design effects or possible endogeneity. In short, we ignore all
interesting features of this typical health outcome variable, knowing that
we will return to each of these issues throughout the book. The focus of
this section is to provide a framework for interpreting regression results.

In this regression model, we estimate the effect of age (age), gender


(female is a binary indicator for being female), and any health limitations
(anylim is a binary indicator of whether the person has health limitations)
on total healthcare expenditures for persons with any expenditures, using the MEPS data (see chapter 3).

We include an interaction term between age and gender, allowing the


effect of age to differ between men and women. It is essential to use the
notation with ## between c.age and i.female so Stata understands that
those variables are interacted. The prefix c. indicates that the variable age is continuous; the prefix i. indicates that the variable female is a binary factor variable. We estimate robust standard errors.
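A minimal sketch of this regression, assuming the total-expenditure variable is named exp_tot (an assumed name; the exact name in the example dataset may differ):

* OLS of positive total expenditures on age, gender, their interaction,
* and any activity limitation, with robust standard errors
use http://www.stata-press.com/data/eheu/dmn_mepssample.dta, clear
regress exp_tot c.age##i.female i.anylim if exp_tot > 0, vce(robust)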

The results appear to show that healthcare expenditures increase with


age and are higher for women. However, the interaction term is negative
and statistically significant, indicating that we must put more effort into
fully understanding the relationship between these demographics and total

expenditures. Unsurprisingly, expenditures are far higher for those with at
least one limitation. All coefficients are statistically significant at conventional levels.

4.3.1 Marginal and incremental effects

We first interpret the results for age and gender in more detail. Following
chapter 2, we interpret regression results for the continuous variable age as
a marginal effect (derivative) and for the dichotomous variable female as
an incremental effect (difference). One way to interpret the effects (not
necessarily the most informative way for this example, as we will see) is to
compute the average marginal and incremental effects using the Stata
command margins, dydx() . Because of the interaction term between age
and gender, the average marginal and incremental effects will not equal
any of the estimated coefficients in the model.
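For example, continuing from the regression sketched above:

* Average marginal effect of age and average incremental effect of female
margins, dydx(age female)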

Women spend more than men by an average of , averaged across


all ages in the sample. The AME of age is , meaning that on average
(allowing for the interaction with female) for this sample, an increase in
age of 1 year corresponds with an increase in total expenditures of .
We note that this interpretation, in a model with covariates entered
nonlinearly, is only true because the model is affine in age. If one is
interested in knowing what happens if each person in the sample becomes

one year older, it would be better to do that computation directly. We show
how this can be done using margins in section 5.7.

The average marginal and incremental effects calculated with


margins, dydx() do not fully illustrate the complex relationship between
age, gender, and total expenditures because of the interaction between
them. One way to improve the interpretation is to calculate the marginal
effect of age separately for men and for women (using the at(female=(0
1)) option). The marginal effect of age for men is about higher than
for women ( compared with ). This means that as men age, their
spending increases faster than that of women.

Similarly, we calculate the incremental effect of gender at different


ages (using the at(age=(20 45 70)) option). The incremental effect of

gender is different at each age, being more than $1,140 at age 20, and
close to 0 around age .
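A sketch of these two calculations using the at() options described above:

* Marginal effect of age, separately for men and women
margins, dydx(age) at(female=(0 1))
* Incremental effect of female at selected ages
margins, dydx(female) at(age=(20 45 70))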

4.3.2 Graphical representation of marginal and incremental effects

Visualizing such relationships can be very insightful. Because the model


specification is so sparse, there are only four types of people (men and
women, some or no limitations) spread across different ages. Therefore,
we can graph the predicted values of total expenditures against age in just
four lines (in later chapters, we will model the relationships in a way that
allows them to be nonlinear). We do this by first predicting total
expenditures at several ages for the four types of people—using margins
—and then immediately graphing the results with marginsplot , as shown
in figure 4.1.
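A sketch of the commands that produce such a figure (the age grid is our assumption):

* Predicted total expenditures by age for each gender-by-limitation group
margins female#anylim, at(age=(20(10)80))
marginsplot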

Predicted total expenditures increase for all four types of people but at
different rates (see figure 4.1). The top two lines are for people with
limitations, roughly $4,400 above the lines for those people without any
limitations. Women have higher predicted total expenditures than men at young ages, but men’s expenditures increase more rapidly with age. Around age , the predictions cross; elderly men are predicted to spend
slightly more than elderly women (controlling for limitations). The figure
clarifies the relationship between all the variables and shows the
importance of including the interaction term between age and gender.

[Figure: adjusted predictions of total expenditures with 95% CIs, plotted against age (20 to 80) for four groups: males and females, with and without activity limitations]
Figure 4.1: The relationship between total expenditures and age, for
men and for women, with and without any limitations

4.3.3 Treatment effects

Next, we interpret the dichotomous variable indicating if the person has


any limitations. For purposes of illustration, we will consider anylim to be
a treatment variable. For a treatment variable, the typical goal is to
estimate the average treatment effect (ATE) and the average treatment
effect on the treated (ATET) (see chapter 2). One way to do this in Stata is
to estimate an incremental effect using the margins command with the
contrast() option. Another way is to use the Stata treatment-effects
command, teffects . We will demonstrate both ways to clarify how these
similar commands can estimate the same magnitude of treatment effects,
and we will explain why the estimated standard errors are slightly
different.

First, we use the results from the OLS regression model to estimate
predicted values, comparing predictions that everyone had a limitation
with predictions that no one had limitations. By this approach, we see that
the average predicted spending as if no one had any limitations is only
$3,030, while predicted spending as if everyone had a limitation is $7,487.

We use the contrast() option to take the difference between those


two predicted margins. When everyone, rather than no one, has limitations, average expenditures increase by ; this is
exactly equal to the OLS estimated coefficient. The delta-method standard
errors take the covariates as fixed, which corresponds to a sample ATE
(Dowd, Greene, and Norton 2014) . Later in this section, we will show
how to compute standard errors for a population-averaged treatment effect,

which accounts for the fact that the covariates also have sampling variation
in the population.
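A sketch of these two steps (predicted margins and their contrast), continuing from the same regression:

* Average predictions as if no one, then everyone, had a limitation
margins, at(anylim=(0 1))
* Difference between the two predictions (a sample ATE)
margins, at(anylim=(0 1)) contrast(atcontrast(r._at))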

Second, in Stata, an alternative to using margins is estimating the ATE


and the ATET using the treatment-effects command teffects . This
command estimates ATE , ATET , and potential-outcome means based on a
regression (or any common nonlinear model). Because of the importance
of estimating treatment effects to our framework, we will show how to use
the teffects command and its relationship to the results from margins .

Without delving too deeply into the many options available in


teffects , we will show its basic use for a linear regression with
regression adjustment. We encourage you to read Stata’s Treatment-
Effects Reference Manual entry for teffects to learn about other useful
options, for example, inverse probability weights. The syntax for teffects
puts the basic regression in parentheses followed by the treatment variable
in parentheses. We show this in the example below.
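A sketch of the regression-adjustment estimator of the ATE, with the same assumed variable names:

* Regression adjustment: outcome model in the first set of parentheses,
* treatment variable in the second
teffects ra (exp_tot c.age##i.female) (anylim) if exp_tot > 0, ate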

Turning first to the ATE, we see that the treatment effect estimated by
teffects is different from the treatment effect estimated by margins with
contrast() , despite seemingly using the same model specification. The
difference is several hundred dollars.

The reason for the difference between the ATE estimated by teffects
and the treatment effect estimated by margins is that the model
specifications are different. The teffects command fits a model (not
shown) in which the treatment variable is fully interacted with all
covariates. It is equivalent (for the point estimates of the parameters) to
running separate models for those with and without any limitations. These
two methods of calculating the ATE (using margins or using teffects )
will be the same if the original regression model interacts all covariates (in
our example: age, female, and their interaction) with the treatment
variable (anylim). That regression is below.
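A sketch of that fully interacted regression:

* Interact the treatment indicator with all covariates so the specification
* matches the one implied by teffects ra
regress exp_tot i.anylim##c.age##i.female if exp_tot > 0, vce(robust)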

The estimated margins using margins are now the same as potential-
outcome means estimated using teffects , because the underlying
regression model specification is now the same.

The ATE , found with the margins command and the contrast()
option, is $4,239, which is now exactly the same as the ATE found with the
teffects command. If we use the vce(unconditional) option for the
standard errors, then we will also get the population-averaged standard
errors; this accounts for sample-to-sample variation in covariates. Whether
one wants to sample average or population-averaged standard errors
depends on the research question and whether it makes sense to take the
covariates as fixed or not. However, there are also statistical implications
associated with this choice. The confidence intervals for the population
effects will be larger than those for the sample effects. Given this
difference in confidence intervals, it is possible for the population effect to
be statistically insignificant but for the sample effect to be statistically
significant. This distinction may be especially relevant when the sample is
relatively small and where it is unclear how representative of the
population of interest the sample is. Stata allows the user to decide and
estimate either confidence interval.

The remaining difference between the standard errors is because teffects uses a sample-size correction of $1/N$, while regress uses the small-sample adjustment of $1/(N-k)$, where $N$ is the sample size and $k$ is the number of covariates. The difference between these corrections shrinks asymptotically to zero as $N$ approaches infinity.

The teffects command also easily estimates the ATET and the
potential-outcome means. In this example, the ATET is several hundred
dollars more than the ATE. This result is consistent with nonrandom
assignment to the treatment group.
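For example, continuing the sketch above:

* ATET and potential-outcome means from the same regression-adjustment model
teffects ra (exp_tot c.age##i.female) (anylim) if exp_tot > 0, atet
teffects ra (exp_tot c.age##i.female) (anylim) if exp_tot > 0, pomeans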

The difference between the two potential-outcome means equals the ATE.

In this section, we interpreted the results from a linear regression
model. The model specification was useful for illustrative purposes, but we
did not choose it through a rigorous process. Later, we will explore
alternatives to linear regression for skewed positive outcomes (chapters 5
and 6), how to incorporate zeros (chapter 7), and how to control for design
effects and possible endogeneity (chapters 11 and 10). However, first we
show by example that misspecifying the model even in a straightforward
manner can lead to inconsistency—even in the case of OLS estimates of the
parameters of a linear regression. Afterward, we will show visual and
statistical tests to help choose a model specification to reduce the chance
of misspecification.

4.4 Consequences of misspecification

We describe two simple examples demonstrating what happens to


estimates of average partial effects if a model is misspecified. In the
context of those examples, we illustrate situations where the average of
marginal effects is consistent, but marginal effects at specific values of
covariates are inconsistent.

4.4.1 Example: A quadratic specification

Consider a specification in which there are two continuous variables—$x$ and $z$—that explain $y$ in an additive, quadratic relationship specified by

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 z + \beta_4 z^2 + \varepsilon$

We artificially generate two variables—$x$ and $z$—in Stata, both ranging from 2 to 8 to mimic the distribution of age (divided by 10) in our MEPS data. One variable, $x$, is uniformly distributed over the range, while the other variable, $z$, follows a distribution that is skewed to the right. In Stata, we specify the data-generating process as
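(A plausible sketch; the coefficient values and the specific right-skewed distribution for z are our assumptions—the authors' exact code is in the downloadable do-files mentioned below.)

* Illustrative data-generating process for the quadratic example
clear
set seed 12345
set obs 10000
generate x = 2 + 6*runiform()          // uniform on (2, 8)
generate z = 2 + 6*rbeta(2, 6)         // right-skewed on (2, 8)
generate y = 1 + x + 0.5*x^2 + z + 0.5*z^2 + rnormal(0, 1)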

The code for this example is available in the downloadable do-files


that accompany this book.

We estimate two regressions using data drawn from this data-


generating process. The first regression is misspecified, because it omits the two squared terms:

$y = \alpha_0 + \alpha_1 x + \alpha_2 z + u$

The second regression is correctly specified:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 z + \beta_4 z^2 + \varepsilon$

We conducted Monte Carlo experiments with a fixed sample size. We estimated the AMEs of $x$ and $z$ on $y$ using 500 sample draws, and we summarize the deviations of the estimated effects from the true values in figures 4.2 and 4.3. In each case, the solid curve represents
the distribution of estimates from the correctly specified model, while the
dashed curve represents the distribution of AMEs from the misspecified
model. The results for the estimate of the coefficient on $x$ suggest that the distribution of the AME of $x$ on $y$ is consistent even when the model is misspecified—because the distribution of $x$ is symmetric, and the peak of the deviations is near 0.

[Figure: distributions of the deviation of the estimated AME of x from its true value; solid curve: correctly specified (OLS quadratic in x and z); dashed curve: misspecified (OLS linear in x and z)]

Figure 4.2: Distributions of AME of x: Quadratic specification

Figure 4.3 shows the analogous figures for the distributions of the average of marginal effects of $z$. The AME of $z$ estimated from the misspecified linear-in-covariates model appears to be inconsistent. Recall that the distribution of $x$ is symmetric, while the distribution of $z$ is skewed. This example shows that—unless the distribution of the covariate
is symmetric—even misspecification as innocuous as leaving out a
quadratic term in covariates can lead to inconsistent AMEs.

[Figure: distributions of the deviation of the estimated AME of z from its true value; solid curve: correctly specified (OLS quadratic in x and z); dashed curve: misspecified (OLS linear in x and z)]

Figure 4.3: Distributions of AME of z: Quadratic specification

We now return to the distribution of the effect of $x$ in the misspecified case. Although the distribution of the average of effects is consistent, it is important to understand its statistical properties evaluated at specific values of $x$. As an example, we evaluate the marginal effect of $x$ when $x = 6$. The results, shown in figure 4.4, demonstrate that even in the case of a covariate for which the average of effects over the distribution of the covariate is consistent in the misspecified case, evidence of inconsistency appears when we evaluate the average of effects at a specified value of the covariate.

[Figure: distributions of the deviation of the estimated marginal effect of x evaluated at x = 6; solid curve: correctly specified (OLS quadratic in x and z); dashed curve: misspecified (OLS linear in x and z)]

Figure 4.4: Distributions of AME of x when x = 6: Quadratic specification

4.4.2 Example: An exponential specification

For our second example, we specify a model with one continuous variable, $x$, and one binary indicator, $d$, that explain $y$ in a relationship specified with an exponential mean and multiplicative errors. This is a log-linear model:

$\ln(y) = \beta_0 + \beta_1 x + \beta_2 d + \varepsilon$

This data-generating process is also inspired by our MEPS data (a regression of the log of expenditures on age [divided by 10] and a binary indicator for gender). We draw $x$ from a uniform distribution on (2,6) and set $d = 1$ for half of the observations. Once again, the code for this example is
available in the downloadable do-files that accompany this book.

We estimate two regressions using data drawn from the data-
generating process above. The first regression is misspecified because it
does not have the correct exponential relationship:

$y = \alpha_0 + \alpha_1 x + \alpha_2 d + u$

The second regression is correctly specified:

$\ln(y) = \beta_0 + \beta_1 x + \beta_2 d + \varepsilon$

From this second regression, we calculate the AME for covariate $x$ as

$\widehat{\text{AME}}_x = \frac{1}{N}\sum_{i=1}^{N}\widehat{\beta}_1\exp(\widehat{\beta}_0 + \widehat{\beta}_1 x_i + \widehat{\beta}_2 d_i + \widehat{\sigma}^2/2)$

We derive this formula from the properties of a lognormal distribution, which we cover in chapter 6. As in the first example, we conduct a Monte Carlo experiment with 500 replications using a fixed sample size. We show the distributions of the AMEs below.
The solid curve represents the distribution of estimates from the correctly
specified model, while the dashed curve represents the distribution of AMEs
from the misspecified model.

In this scenario, figure 4.5 shows that, even though $x$ has a symmetric distribution, its estimated AME is inconsistent when the model is misspecified with a linear conditional mean. The AME of the binary indicator, $d$, is also inconsistent. However, there is also a considerable loss in efficiency: the distribution of the AME of $d$ is substantially more dispersed in the misspecified case compared with the distribution in the correctly specified case.

[Figure: distributions of the deviation of the estimated AME of x from its true value; solid curve: correctly specified (OLS exponential in x and d); dashed curve: misspecified (OLS linear in x and d)]

Figure 4.5: Distributions of AME of x: Exponential specification

[Figure: distributions of the deviation of the estimated AME of d from its true value; solid curve: correctly specified (OLS exponential in x and d); dashed curve: misspecified (OLS linear in x and d)]

Figure 4.6: Distributions of AME of d: Exponential specification

Next, we examine the distribution of the effect of $d$ on $y$ when $d = 1$ (analogous to estimating the treatment effect on the treated). The results, shown in figure 4.7, demonstrate that the estimated effect of $d$ on $y$ when $d = 1$ is inconsistent when the model is misspecified as linear while the true data-generating process is exponential.

[Figure: distributions of the deviation of the estimated marginal effect of d evaluated at d = 1; solid curve: correctly specified (OLS exponential in x and d); dashed curve: misspecified (OLS linear in x and d)]

Figure 4.7: Distributions of AME of d, given d = 1: Exponential specification

Using those simple Monte Carlo experimental examples, we showed


that the commonly held belief that effect estimates in linear models are
consistent even when the model is misspecified is incorrect. In fact, the
examples show that, in some cases, this belief can be grossly misleading.
Thus, as we emphasize throughout the book, specification checking and
testing should be an integral part of model development. We next turn to
visual and statistical model checks of the linear regression specification
and describe a number of ways in which specifications can be checked and
improved.

4.5 Visual checks

In this section, we illustrate the use of visual residual checks for the least-
squares model by examining three simple artificial-data examples with a
correctly specified and a misspecified model. Then, we use the visual
residual checks to explore possible misspecification in the MEPS data for
two simple models.

4.5.1 Artificial-data example of visual checks

For the first two artificial-data examples, we draw 1,000 observations using data-generating processes, specified in Stata as
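(A plausible sketch; the coefficients and error scale are our assumptions—the exact code is in the downloadable do-files.)

* Artificial data: y1 is linear in x and z; y2 is quadratic in x
clear
set seed 12345
set obs 1000
generate x = runiform()
generate z = runiform()
generate y1 = 1 + x + z + rnormal(0, 1)
generate y2 = 1 + x + 5*x^2 + z + rnormal(0, 1)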

The following regression for y1 is correctly specified. Therefore,


residuals from this least-squares model should have the property that there
is no systematic pattern in the residuals as a function of either the
predictions of y1 or the covariates.

We use two regress postestimation commands, rvfplot and rvpplot, to detect misspecification visually. rvfplot plots residuals versus fitted values (the predicted dependent variable, or the linear index), and rvpplot plots residuals versus a specific predictor variable. Because this is the first
time we do this, we show the Stata code for illustrative purposes.

Figure 4.8 confirms our expectation that there is no pattern in the


residuals. Regardless of whether we plot the residuals against predicted
values, x, or z, the figures show no pattern. We conclude that the
regression model is correctly specified.
[Figure: three residual scatterplots for the y1 regression—residuals versus fitted values, residuals versus x, and residuals versus z—showing no systematic pattern]
Figure 4.8: Residual plots for y1

The following regression for y2 is misspecified, because the fit model


includes only a linear term. The true data-generating process is quadratic
in x.

The associated residual plots in figure 4.9 show a distinct U-shaped
pattern when residuals are plotted against predicted values and when they
are plotted against x but show no pattern when residuals are plotted against
z. Taken together, they indicate a misspecified model, likely in terms of x
—but not in terms of z.
[Figure: three residual scatterplots for the y2 regression—residuals versus fitted values, residuals versus x, and residuals versus z—showing a U-shaped pattern against fitted values and x, but no pattern against z]

Figure 4.9: Residual plots for y2

In the third example with simulated data, we generate an outcome, y3,


that is log linear in the covariates and error. Therefore, a linear regression
of the logarithm of y3 would be correctly specified. However, we fit a
linear regression model of y3 that is also linear in covariates.
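(A plausible sketch of this data-generating process and the misspecified fit; the coefficients are our assumptions.)

* y3 is log linear in x and z, so a model linear in y3 itself is misspecified
generate y3 = exp(0.5 + x + z + rnormal(0, 0.5))
regress y3 x z
rvfplot
rvpplot x
rvpplot z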

The associated residual plots in figure 4.10 show evidence of
misspecification. Regardless of whether the residuals are plotted against
predicted values, x, or z, the figures show that the residuals fan out along the range of the horizontal axis. The variation in the residuals increases with higher values of the predicted values, x, and z.
[Figure: three residual scatterplots for the y3 regression—residuals versus fitted values, residuals versus x, and residuals versus z—with the spread of the residuals increasing from left to right in each panel]

Figure 4.10: Residual plots for y3

4.5.2 MEPS example of visual checks

Are such visual representations of residuals useful in real data situations


when the misspecification may not be so obvious? To demonstrate the
value of such plots with real data, we construct two examples using the
MEPS data for positive total expenditures and its logarithm.

Each of the regression specifications has two covariates, age and

lninc. In the first example, we estimate a linear regression of
exponentiated total expenditures on age and lninc. We construct residual
plots for a 10% random sample of the MEPS observations to make the plots
clearer (by reducing the density of points) and to reduce the file size of the
resulting figures.
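A minimal sketch of these steps (exp_tot is an assumed name for total expenditures):

* Levels regression of positive total expenditures on age and lninc;
* residual plots use a 10% random subsample to keep the figures readable
set seed 2004
generate byte insample = runiform() < 0.10
regress exp_tot age lninc if exp_tot > 0
predict double fitted if e(sample), xb
predict double res if e(sample), residuals
scatter res fitted if insample, yline(0)
scatter res age if insample, yline(0)
scatter res lninc if insample, yline(0)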

The residual plots in figure 4.11 show evidence of misspecification.


The figures show that there are many small residuals below zero along the range of the horizontal axis and a number of large, positive residuals whose frequency appears to increase from left to right along the ranges of age and
lninc. Although these residual plots are intended only to be diagnostic,
they do suggest that a linear-in-logs specification may be more
appropriate.
[Figure: residual scatterplots from the levels regression—residuals versus fitted values, age, and ln(family income)—showing many small negative residuals and a scattering of very large positive residuals]
Figure 4.11: Residual plots: Regression using MEPS data, evidence of
misspecification

Guided by this evidence, we generate a new dependent variable that is


the natural logarithm of expenditures. Then, we estimate a linear
regression of log expenditures.
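For example (using the same assumed variable names):

* Log-transform positive expenditures and refit the model
generate lnexp = ln(exp_tot) if exp_tot > 0
regress lnexp age lninc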

The residual plots in figure 4.12 show well-behaved residual scatters


when plotted against predicted log expenditures and against age. However, the plot of residuals against the log of income may warrant a closer look at the specification of income. While the residuals are symmetrically distributed above and below zero along the distribution of lninc, it may be worthwhile revisiting the choice to log-transform income or to distinguish
between low-income observations and the rest in the regression
specification.
[Figure: residual scatterplots from the log-expenditure regression—residuals versus fitted values, age, and ln(family income)—showing roughly symmetric scatter around zero]

Figure 4.12: Residual plots: Regression using MEPS data, well behaved

4.6 Statistical tests

Although graphical tests are suggestive for determining model fit, they are
not formal statistical tests. We present three diagnostic statistical tests that
are commonly used for assessing model fit. The first two—Pregibon’s
(1981) link test and Ramsey’s (1969) regression equation specification
error test (RESET)—check for linearity of the link function. The third, a modified version of the Hosmer–Lemeshow (1980) test, is a more omnibus test for model misspecification. Although each of these tests was originally developed for other applications, we focus on their interpretation
as general-purpose diagnostic tests of the specification of the explanatory
variables. After presenting each of these statistical tests, we then show
several examples using the MEPS data.

4.6.1 Pregibon’s link test

Pregibon’s link test is often presented as a test of whether the independent


variables are correctly specified, conditional on the specification of the
dependent variable. A simple regression model shows intuition for
Pregibon’s link test. In a model where $y$ is a linear function of a single regressor, $x$, and an intercept, we may be concerned that the model is misspecified. Alternative model specifications could include higher-order polynomials in $x$. The simplest alternative model is quadratic in $x$. Therefore, compare the model with the simple linear specification

$y = \beta_0 + \beta_1 x + \varepsilon$

with a model with a quadratic term

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

If the least-squares estimate of $\beta_2$ is significantly different from zero, we would reject the simpler model that is linear in $x$. The simpler specification provides an inconsistent estimate of the response to changes in $x$ over the range of $x$ observed in the data.

Pregibon’s link test addresses the more interesting case where there are
multiple covariates. For example, if there are two underlying covariates, $x_1$ and $x_2$, then the quadratic expansion would include $x_1^2$, $x_2^2$, and $x_1 x_2$. The corresponding specification test would be an $F$ test of whether the set of estimated coefficients on the higher-order terms is statistically significantly different from zero. If there are many covariates in a model, adding all the possible higher-order terms can become unwieldy. A model with $k$ covariates requires adding $k(k+1)/2$ additional squared and interaction terms.

However, we can follow similar logic by collapsing the initial model into a linear index function, $\widehat{y} = \mathbf{x}\widehat{\boldsymbol{\beta}}$—which includes the constant term—and then including that index and its square in an auxiliary regression. This way, we reduce the dimensionality of the problem from $k(k+1)/2$ to two. Thus, replace the alternative second model with

$y = \alpha_0 + \alpha_1\widehat{y} + \alpha_2\widehat{y}^2 + u$

Again, if the estimate, $\widehat{\alpha}_2$, is statistically significantly different from zero, we infer that the simpler model provides an inconsistent estimate of response to changes in those covariates.

Pregibon’s link test is a diagnostic test, not a constructive test. In the


regression context, we cannot tell the source of the problem—missing
interactions or squared terms in the covariates, a misspecification of the
dependent variable, or all the above—but we can reject the simpler model
specification.
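In Stata, this test is available as a postestimation command; a minimal sketch (again assuming exp_tot as the expenditure variable):

* Pregibon's link test: refits the model on the prediction and its square
quietly regress exp_tot age lninc if exp_tot > 0
linktest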

4.6.2 Ramsey’s RESET test

Ramsey’s RESET test is a generalized version of Pregibon’s link test .


Although the link test works well for misspecifications that generate
residual-versus-predicted plots with a U-shaped or J-shape (or the inverse),
it may not work well if the plot has an S-shape or more complex shapes.
These could occur if the response to the underlying covariates exhibits a
threshold or diminishing returns. A quadratic formulation may not
adequately reflect such a pattern.

The same logic we used to motivate Pregibon’s link test can be applied
to the more general RESET test. If there is a single covariate, $x$, we could
add quadratic, cubic, and possibly quartic terms to the augmented
regression. If these additional terms as a set are statistically significantly
different from zero by the $F$ test—while retaining the linear term—then
we can reject the simple linear-in-$x$ model in favor of a more nonlinear
formulation. By extension, we can alter the link test to have a quartic
alternative:
$$y = \alpha_0 + \alpha_1 (x\hat{\beta}) + \alpha_2 (x\hat{\beta})^2 + \alpha_3 (x\hat{\beta})^3 + \alpha_4 (x\hat{\beta})^4 + u$$
The RESET test is a joint test of the hypothesis that $\alpha_2 = \alpha_3 = \alpha_4 = 0$
(Ramsey 1969).

The original formulation was based on a Taylor-series expansion for a


model in the spirit of what we described. The test is based on the Lagrange
Multiplier principle that relaxing a constraint (under the null hypothesis
that the constraint is not binding) will not improve the model fit. In some
econometric textbooks, this is referred to as an omitted variable test—even
though it can detect only omitted variables that are correlated with higher-
order terms in the covariates. It cannot detect omitted variables that are
orthogonal to included variables, such as ones that occur in randomized
trials. It also cannot detect omitted variables that are linearly related to the
included covariates.

4.6.3 Modified Hosmer–Lemeshow test

The link and RESET tests are parametric, because they assume that the
misspecification can be captured by a polynomial of the predictions from
the main equation. This alternative model specification may not be
appropriate for a situation with a different pattern. For example, consider a
model that was good through much of the predicted range but had a sharp
downturn in the residuals when plotted against a specific covariate.
Pregibon’s link test, with its implicit assumption of a symmetric parabola,
would not provide a good test for that alternative.

The modified Hosmer–Lemeshow (1980) test is a nonparametric test


that looks at the quality of the predictions throughout their range. It divides
the range of predictions into, for example, 10 bins. If the model
specification closely approximates the data-generating process, the mean
prediction error within each of these bins should be near zero and not
significantly different from zero.

One way to test this is to sort the data into deciles by the predicted
conditional mean, $\hat{y}$. Regress the residuals from the original model on
indicator variables for these 10 bins, suppressing the intercept. Test
whether these 10 coefficients are collectively significantly different from 0
using an $F$ test.

The advantage of the modified Hosmer–Lemeshow test is that it is


more flexible than parametric tests; the disadvantage is that it is less
efficient. The modified Hosmer–Lemeshow test can be adapted to
nonlinear models. Note that this test is different from the version of the
Hosmer–Lemeshow test used in Stata (the estat gof postestimation
command for logistic), which tests how well predicted probabilities from a
model for a dichotomous outcome fit the observed data.

4.6.4 Examples

To illustrate the link and RESET tests , we present an example using the
MEPS data with a simple model specification. Consider a model that
predicts total healthcare expenditures as a function only of age and gender.
For this example, we drop all observations with zero expenditures:
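A minimal sketch of this step is below; it assumes the MEPS variable names used throughout the chapter (exp_tot, age, female) and a heteroskedasticity-robust variance estimator.

* Sketch: keep observations with positive expenditures and fit the simple model
drop if exp_tot == 0
regress exp_tot age female, vce(robust)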

The model with a simple specification shows that total healthcare


expenditures increase with age and are higher on average for women.

Because we suspect that this simple model does not accurately capture
the relationship between total healthcare expenditures and demographics,
we run the link and RESET tests . First, generate the predicted value of the
dependent variable (linear index) and its powers up to four. In practice, it
helps to normalize the linear index so the variables and parameters are
neither too large nor too small. Normalization does not affect the statistical
test results.

Pregibon’s link test is a statistical test of whether the coefficient on the


squared predicted value of the dependent variable is statistically different
from zero in a regression of the dependent variable on the linear index and
its square. The $t$ statistic is 4.15 (corrected for heteroskedasticity) and the
corresponding $F$ statistic is 17.23 (1 and 15,943 degrees of freedom).
Therefore, we conclude that the model is misspecified.
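A sketch of this by-hand calculation follows; it assumes the simple regression above is the most recent estimation, and the normalization constant (1,000) is illustrative rather than taken from the original code.

* Sketch: Pregibon's link test computed by hand
predict double xbhat, xb            // linear index from the original model
replace xbhat = xbhat/1000          // normalize so powers are neither too large nor too small
generate double xbhat2 = xbhat^2
regress exp_tot xbhat xbhat2, vce(robust)
test xbhat2                         // robust F test; the t statistic on xbhat2 is the link test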

Ramsey’s RESET test is a statistical test of whether the coefficients on
the three higher-order polynomials of the predicted value of the dependent
variable are jointly statistically different from zero in a regression of the
dependent variable on the fourth-order polynomial. The statistic is 16.35
(3 and 15,941 degrees of freedom). Therefore, we conclude that the model
is misspecified.
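Continuing the previous sketch, the quartic version can be coded along these lines (an illustration, not the authors' exact program):

* Sketch: RESET test using powers of the normalized linear index
generate double xbhat3 = xbhat^3
generate double xbhat4 = xbhat^4
regress exp_tot xbhat xbhat2 xbhat3 xbhat4, vce(robust)
test xbhat2 xbhat3 xbhat4           // joint F test of the higher-order terms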

There are Stata commands for both the link and RESET tests . The
command for the link test is linktest , and the command for the RESET
test is the postestimation command estat ovtest . However, although the
linktest will adjust for heteroskedasticity, estat ovtest will not.
Therefore, we do not recommend using estat ovtest. That is why we
first showed how to calculate the test statistics without using the Stata
commands. The RESET test should always be done in the manner we
describe to control for possible heteroskedasticity.

For completeness, we show the results of the two Stata commands


here. The linktest shows, numerically, the same $t$ statistic of 4.15 as
before. In contrast, the command estat ovtest shows a different $F$
statistic, because it does not control for heteroskedasticity.
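For reference, the two built-in commands can be run as sketched here; passing vce(robust) to linktest is our assumption about how the heteroskedasticity adjustment was obtained.

regress exp_tot age female, vce(robust)
linktest, vce(robust)    // link test refit with robust standard errors
estat ovtest             // RESET without a heteroskedasticity correction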

The $F$ statistic of 11.83 is smaller than the $F$ statistic that correctly
controls for heteroskedasticity, meaning the Stata command is less likely
to reject the null hypothesis of correct model specification.

Next, we run the modified Hosmer–Lemeshow test on the original


simple model. For the choice of 10 bins, the test strongly rejects the null
hypothesis. In Stata, we run the regression in the nocons mode, because
OLS forces the average residual for the estimation sample to be identically
zero if the original model has an intercept. We encourage researchers to
use the vce(robust) option (and cluster if necessary) to correct for any
heteroskedasticity.
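A sketch of this procedure, using illustrative variable names (yhat, ehat, decile), is:

* Sketch: modified Hosmer-Lemeshow test for the original simple model
regress exp_tot age female, vce(robust)
predict double yhat, xb
predict double ehat, residuals
xtile decile = yhat, nquantiles(10)          // 10 bins of the predicted mean
regress ehat ibn.decile, noconstant vce(robust)
testparm ibn.decile                          // joint F test of the 10 bin means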

Although the test clearly indicates that the specification is flawed—


because the coefficients are collectively significantly different from zero—
it does not provide any indications about how or why it might be flawed.
We find it useful to calculate and plot the residuals at each decile of the
predicted expenditure, exp_tot. These are shown in figure 4.13 along with
95% confidence intervals for those residuals shown by the dashed and
dotted lines. Ideally, the graph would be flat, with all average residuals
close to zero. Instead, the average residuals by decile exhibit a strong U-
shape, with the lowest and middle decile residuals displaying confidence
intervals that exclude zero.

Figure 4.13: Graphical representation of the modified Hosmer–Lemeshow test (mHL coefficients and confidence intervals by decile of predicted expenditures)

In response to the diagnostic test results, we create a richer model by


adding a squared term for age and interactions between age (and age
squared) with gender.
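A sketch of this richer specification using factor-variable notation is:

* Sketch: age, age squared, gender, and their interactions
regress exp_tot c.age##c.age##i.female, vce(robust)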

This modest elaboration of the specification shows improvement in the
specification tests. The link test is no longer statistically significant. The
RESET test is still statistically significant, but its $F$ statistic is
now much lower at 3.93. Finally, the graphical modified Hosmer–
Lemeshow test no longer exhibits a strong U-shape. Instead, only 1 of the
10 deciles (the 5th) is significantly different from 0.

Figure 4.14: Graphical representation of the modified Hosmer–Lemeshow test after adding interaction terms (mHL coefficients and confidence intervals by decile of predicted expenditures)

4.6.5 Model selection using Akaike information criterion and Bayesian information criterion

We now demonstrate the use of Akaike information criterion (AIC) and


Bayesian information criterion (BIC) to compare model specifications (see
section 2.6.1), using two of the examples from earlier in this chapter. The
first examples show not only that AIC and BIC penalize models that omit
important variables from the model but also that they penalize models that
overfit the data by including unnecessary variables.

For the first set of examples, we use simulated data generated by a


quadratic specification described in section 4.4.1. We use the same data-
generating process, the same set seed 123456, and the same sample sizes
of 1,000 and 10,000. We compare the simple specification that includes
only linear terms in x and z with ones that are quadratic in x only, in z
only, and one that is quadratic in both x and z. We also fit models that are
cubic in x only, in z only, and one that is cubic in both x and z. There are
seven models; three have omitted variables, three have extra variables, and
one has the correct specification. Recall that the true data-generating
process is quadratic in both x and z, so only the specification with
quadratic terms in both is correct.

We label the linear specification linear, the quadratic ones
quadratic, and the cubic ones cubic. The suffix indicates which variables
are in the model.
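A sketch of how these comparisons can be set up with estimates store and estimates stats follows; it assumes the simulated outcome is named y and shows only three of the seven specifications.

* Sketch: fit competing specifications and compare information criteria
regress y c.x c.z
estimates store linear
regress y c.x##c.x c.z##c.z
estimates store quadratic_xz
regress y c.x##c.x c.z##c.z c.x#c.x#c.x c.z#c.z#c.z
estimates store cubic_xz
estimates stats linear quadratic_xz cubic_xz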

We report the values of the AIC and BIC from each of the models in the
output shown below for the sample size of 1,000. The results show that
both AIC and BIC are the smallest for the specification that is quadratic in
both x and z, which is the correct specification. The linear specification
has the highest values of AIC and BIC. The cubic specifications and the two
misspecified quadratic specifications also have higher AIC and BIC than the
correct model. Thus AIC and BIC are able to discriminate against both
underspecified and overspecified models.

When the sample size is increased to 10,000—and a new simulated
sample is drawn—AIC and BIC provide even sharper contrasts between the
correct specification and each of the incorrect ones.

Next, we illustrate the use of AIC and BIC for the more realistic case of
analyzing real data where the true model specification is unknown. The
second set of examples compares three different model specifications for
the MEPS data in section 4.4.2. Specifically, the dependent variable is total
expenditures for the subsample with positive expenditures (exp_tot > 0).
In the first MEPS data example, the covariates used in all three
specifications are the continuous age and the binary female. They are
entered additively in the first specification, an additional interaction
between age and female is included in the second specification, and an
additional squared age and interaction of squared age with female are
included in the third specification. We label the first specification as
linear, the second as interact, and the richest specification (with squared
age and interactions) as quad_int in the output below that compares the AIC and
BIC from the three models. The results show that both AIC and BIC are
smallest for the specification with quadratic age and interaction terms.
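The comparison described above can be produced along these lines (a sketch using the variable names from the text):

regress exp_tot c.age i.female if exp_tot > 0
estimates store linear
regress exp_tot c.age##i.female if exp_tot > 0
estimates store interact
regress exp_tot c.age##i.female c.age#c.age c.age#c.age#i.female if exp_tot > 0
estimates store quad_int
estimates stats linear interact quad_int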

The second MEPS data example demonstrates a real-life problem that a


researcher might face. So far, we have reported results of the AIC and BIC
together, because the conclusions have always been the same. In this
example, they are not. We add the binary anylim to the list of covariates,
in addition to the continuous age and binary female. This variable enters
additively in each of the specifications considered for the first example.
AIC and BIC are calculated for each of the models and are shown in the
output below. AIC is smallest for the richest specification with quadratic
terms and interactions. However, BIC is smallest for the simplest additive
specification.

The mathematical explanation is simple; the AIC penalizes models only


along the number-of-parameters dimension. The BIC penalizes models
using the product of the number of parameters and the logarithm of the
sample size. So, except for very small sample sizes, the BIC will have
bigger penalties for additional parameters relative to the AIC. It is a more
conservative criterion. The more difficult issue is the question of what the
analyst should conclude. The answer depends on whether the substantive
question calls for a more parsimonious or less parsimonious regression
specification.

Three additional points are worth raising. First, for nested models, we
could have done other statistical tests—such as $F$ tests. But AIC and BIC
can be compared even for nonnested models, when tests such as $F$ tests
and Wald tests are not possible.
the issue of multiple testing always looms large when a researcher is
searching for the best model specification. AIC and BIC do not suffer from
that issue. Any number of candidate models can be compared. Finally, in
general, we cannot compare models with different dependent variables
using AIC and BIC. In chapter 6, we demonstrate how to use AIC and BIC to
compare a linear specification with one that is linear in the logged
outcome.

4.7 Stata resources

The Stata command for linear regression is regress . The regress


postestimation commands contain many graphic and statistical diagnostic
techniques for linear regression. Diagnostic plots contain some of the
commands used here as well as commands for assessing the normality and
other distributional fit issues. We find rvfplot and rvpplot especially
useful. The grc1leg command will combine graphs using one legend.

Stata has several commands to estimate and interpret the effects of


covariates and treatment effects . These commands include margins ,
margins with the contrast() option, and teffects . Depending on
whether the research question of interest is about the effect of a change in
a covariate holding all other variables constant, or a population-averaged
effect, these commands can get the right magnitude and standard errors.
marginsplot is especially useful after margins for visualizing marginal
effects and treatment effects with their standard errors.

The Stata command for the link test is linktest . The Stata command
for the RESET test is estat ovtest , but this is not recommended because it
does not control for heteroskedasticity. For testing hypotheses of specific
variables or coefficients, use test and testparm .

To compute the AIC and BIC, use estimates stats after running the
model. Cattaneo, Drukker, and Holland (2013) created the bfit command
to find the model that minimizes either the AIC or the BIC from a broad set
of possible models for regress, logit, and poisson.

For a general discussion of issues for least squares, including weighted


and generalized least squares, see Christopher Baum’s (2006) book, An
Introduction to Modern Econometrics Using Stata (especially chapters 4–
9). Nicholas Cox’s (2004) article “Speaking Stata: Graphing model
diagnostics” provides a review of the literature for diagnostic plots . He
also includes a set of worked examples.

Chapter 5
Generalized linear models

5.1 Introduction

As we move into the main four chapters of this book (chapters 5–8), our
focus shifts toward choosing between alternative models for continuous
positive outcomes, such as healthcare expenditures. One of the main
themes of this book is how best to model outcomes that are not only
continuous and positive but also highly skewed. Skewness creates two
problems for ordinary least-squares (OLS) models: negative predictions and
large sample-to-sample variation . After briefly demonstrating this for the
2004 Medical Expenditure Panel Survey (MEPS) data, we then spend the
rest of this chapter exploring generalized linear models (GLM), which are
an alternative to OLS that handle skewed data more easily.

As we showed in chapter 3, health expenditures are extremely skewed


(see the left-panel of figure 5.1 for a density plot for total expenditures in
the MEPS data for those with positive expenditures). One of the
consequences of such extreme skewness is that predicted total
expenditures from a linear regression model can be negative. Using the
MEPS dataset, we fit a linear regression model of total healthcare
expenditures (for the subsample with some use of care) on a number of
covariates and then calculate predicted expenditures postestimation. We
use the centile command to determine the fraction of negative
expenditures. The results, shown below, tell us that between 6% and 7% of
predictions are negative, with many substantially so.
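The check described above can be coded along these lines; the covariate list here is illustrative, not the full specification used for the results quoted.

* Sketch: check for negative linear-regression predictions
regress exp_tot age female anylim if exp_tot > 0
predict double yhat_ols, xb
count if yhat_ols < 0
centile yhat_ols, centile(1 5 6 7 10)    // locate where predictions cross zero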

At a minimum, this finding is awkward, indicating that the linear-in-
parameters specification can predict outside the boundaries imposed by the
data-generating process. If that is true, it is likely that OLS estimates of a
linear regression model will yield inconsistent estimates of effects, as we
demonstrated in section 4.4.2.

Even if there were no negative predictions, one might consider a


nonlinear mean specification, because extreme skewness can lead to
unacceptably large sample-to-sample variation in OLS estimates . To be
more precise, it is not the skewness of expenditures themselves but rather
the skewness of the errors that matters. The right panel of figure 5.1 shows
that the OLS residuals are also extremely skewed, albeit less so than raw
expenditures. Sample-to-sample variation can be large, because each
exceptionally large residual has undue influence on OLS estimates.
Figure 5.1: Densities of total expenditures and their residuals, among those who had at least some expenditure (left panel: total expenditure; right panel: OLS residuals; expenditures above $100,000, around the 99.9th percentile, dropped)

Health expenditure data , for those with any healthcare use, are
generally extremely skewed . In the United States, a small fraction of the
population accounts for a substantial fraction of total expenditures. Berk
and Monheit (2001) report that five percent of the population accounts for
the majority of health expenditures and that the severely right-skewed
concentration of healthcare expenditures has remained stable over decades.

In this chapter, we discuss GLMs, a class of models with many desirable


properties, including their ability to accommodate skewness. They are also
more likely to have a specification that approximates the true data-
generating process of healthcare expenditures. They not only deal well
with skewed data but also model heteroskedasticity directly and are easy to
interpret.

GLMs are more general than ordinary linear regression models


(McCullagh and Nelder 1989). The GLM generalizes the ordinary linear
regression model by allowing the expectation of the outcome variable, $y$,
to be a function (known as the link function) of the linear index of
covariates, $x\beta$—not simply a linear function of the index. In addition,
GLMs allow the variance of the outcome to be a function of its predicted
value by the choice of an appropriate distribution family, which naturally
incorporates heteroskedasticity. Several link functions and distribution
families are appropriate for right-skewed data. Modeling the expectation of
the dependent variable, $y$, instead of the expectation of $\ln(y)$, avoids the
retransformation problem inherent in regression models for the logarithm
of the outcome (see chapter 6).

There has been a growing interest among health economists in


applying GLM to healthcare expenditures and costs. The work of
Mullahy (1998) and Blough, Madden, and Hornbrook (1999) was among
the first applications in health economics. Much of the subsequent work in
health economics and health services research builds from the formulation
in Blough, Madden, and Hornbrook (1999). In addition to continuous
outcomes, GLM can be used for dichotomous, polytomous, and count
outcomes.

In this chapter, we provide an overview of GLM for continuous


outcomes. We apply GLM methods to total healthcare expenditures for
those adults with any healthcare use in the 2004 MEPS dataset. We show
how to compute marginal effects of continuous covariates and incremental
effects of discrete covariates. We show how to determine the most
appropriate GLM for a given dataset by choosing the appropriate link
function and distribution family .

McCullagh and Nelder (1989) present the theory of GLM and an


overview of many classes of GLMs, including GLM classes for outcomes
that we do not cover in this chapter (for example, dichotomous outcomes
and count data ). Hardin and Hilbe (2012) provide a detailed description of
the theory and the statistical issues for GLMs, as well as examples in Stata
to fit and validate these models.

5.2 GLM framework

5.2.1 GLM assumptions

The specification of a GLM is characterized by four working assumptions:

1. There is an index function, $x\beta$, that specifies the basis of the
relationship between the covariates and the outcome. This index is
linear in the coefficients, $\beta$, but may be nonlinear in the underlying
covariates, $x$ (for example, polynomials in age, and interactions
between age and gender).

2. There is a link function, $g(\cdot)$, that relates the mean of the outcome, $\mu$, to
the linear index, $x\beta$.

The inverse of $g(\cdot)$ maps the index, $x\beta$, into the expected value, $\mu$,
conditional on the observed characteristics of the outcome, $x$:
$$\mu = E(y|x) = g^{-1}(x\beta)$$
For example, if the mean of $y$ is an exponential function of the linear
index [that is, $E(y|x) = \exp(x\beta)$], then the link
function is the natural log.

3. The variance, $v$, of the raw-scale outcome, $y$, is itself a function of the
mean, $\mu$, but not of the covariates, except through the mean function,
$\mu(x)$.

4. The continuous outcome, $y$, is generated by a distribution from the
exponential family, which includes the normal (Gaussian), Poisson,
gamma, and inverse Gaussian distributions (see McCullagh and
Nelder 1989).

Only certain combinations of link functions and distribution families
are permitted; see McCullagh and Nelder (1989) for more details. Some
popular link functions for continuous outcomes include the identity link
$\{E(y|x) = x\beta\}$, powers $\{$for example, the square root, $\sqrt{E(y|x)} = x\beta\}$, and the natural
logarithm $\{\ln E(y|x) = x\beta\}$. Common distribution families for
continuous dependent variables imply variances that are integer powers of
the mean function. The four most common distribution families are
Gaussian, in which the variance is a constant (zero power); Poisson, in
which the variance is proportional to the mean ($v \propto \mu$); gamma, in
which the variance is proportional to the square of the mean ($v \propto \mu^2$);
and inverse Gaussian, in which the variance is proportional to the mean
cubed ($v \propto \mu^3$). Table 5.1 lists commonly used links and distribution
families, along with their implications for the expected value and variance
of the outcome.

Table 5.1: GLMs for continuous outcomes

The GLM approach also allows distribution families that are noninteger
powers of the mean function, but such models are less common in the
literature. For more details, see Blough, Madden, and Hornbrook (1999)
and Basu and Rathouz (2005).

5.2.2 Parameter estimation

As alluded to above, GLM estimation requires two sets of choices. The first
set of choices determines the link function and the distribution family . In
section 5.8, we discuss how we chose based on rigorous statistical tests.

For the parameter estimates in the model to be consistent , it is only


necessary to correctly specify the link function, $g(\cdot)$, and how the covariates
enter the index function. The choice of the distribution family affects the
efficiency of the estimates, but an incorrect choice does not lead to
inconsistency of the parameter estimates. An inappropriate assumption
about the distribution family can lead to an inconsistent estimate of the
inference statistics, but this inconsistency can be remedied using robust
standard errors.

The other choice is whether to estimate GLMs by quasi–maximum


likelihood or iteratively reweighted least squares . In Stata, the default is
quasi–maximum likelihood, which does not require specification of the
full log likelihood. The choice between these two methods does not seem
to matter much in practice for typical models and datasets.

After fitting a GLM, one can easily derive marginal and incremental
effects of specific covariates on the expected value of $y$ (or other treatment
effects).

5.3 GLM examples

We now show how to estimate GLMs for healthcare expenditures with a


few choices of link functions and distribution families, using the MEPS data
introduced in chapter 3. Specifically, we estimate the effect of age (age)
and gender (female is a binary indicator for being female) on total
healthcare expenditures for persons with any expenditures (exp_tot > 0).

Our first example is a model with a log link (option link(log)) and a
Gaussian family (option family(gaussian)). This is equivalent to a
nonlinear regression model with an exponential mean. The results show
that healthcare expenditures increase with age and are higher for women,
but the coefficient on female is not statistically significant at the 5% level.
Because the conditional mean has an exponential form, coefficients can be
interpreted directly as percent changes. Expenditures increase by about
2.6% with each additional year of age after adjusting for gender. Women
spend about 8% more than men after controlling for age.
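A sketch of the estimation command for this example is below; the same call with family(gamma) produces the second example that follows.

* Sketch: GLM with a log link and Gaussian family
glm exp_tot age female if exp_tot > 0, link(log) family(gaussian)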

The second example also specifies a log link but assumes that the
distribution family is gamma (option family(gamma)), implying that the
variance of expenditures is proportional to the square of the mean. This is
a leading choice in published models of healthcare expenditures, but we
will return to the choices of link and family more comprehensively in
section 5.8.

The results show that healthcare expenditures increase with age and are
higher for women. Both coefficients are statistically
significant. Expenditures increase by about 2.8% with each additional year
of age, which is quite close to the effect fit by the model with the Gaussian
family. However, now we find that women spend about 23% more than
men, after controlling for age. This is almost
three times as large as the effect estimated in the model with the Gaussian
family. A small change in the model leads to a large change in
interpretation.

Our primary intent was to use these examples to demonstrate the use of
the glm command and explain how to interpret coefficients. However,
these examples also show that the estimated effects in a sample can be
quite different across distribution family choices when the link function is
the same, even though the choice of family has no theoretical implications
for consistency of parameter estimates.

We could run many other GLM models, changing the link function or
the distributional family. For example, we could fit a GLM with a square
root link (option link(power 0.5)) and a Poisson family (option
family(poisson)). Or we could fit a GLM with a cube root link (option
link(power 0.333)) and an inverse Gaussian family (option
family(igaussian)).

5.4 GLM predictions

For all GLM models with a log link, the expected value of the dependent
variable, $y$, is the exponentiated linear index function:
$$E(y|x) = \exp(x\beta) \tag{5.1}$$

The sample average of the expected value of total expenditures is the


average of $\widehat{E}(y|x)$ over the sample. We calculate its estimate using the margins
command. The predicted mean of total expenditures is $4,509, less than
1% from the sample mean of $4,480.
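Following the glm fit, the calculation is simply:

margins     // average predicted exp_tot over the estimation sample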

When we compare predictions from log transformation models in


chapter 6 with the sample mean, we will find that those predictions are
much further off. They will be anywhere from 10% to 20% too high. GLM
is generally better than log models at reproducing the sample mean of the
outcome.

5.5 GLM example with interaction term

Before computing marginal effects, we extend our simple specification to


include the interaction of age and gender as a covariate. That is, we allow
for the effect of gender to vary by age (or equivalently, the effect of age to
vary by gender). The results with an interaction term are harder to
interpret, but more realistic, and will help show the power of several Stata
postestimation commands.

When including interaction terms, one must use special Stata notation,
so that margins knows the relationship between variables when it takes
derivatives. Therefore, we use c. as a prefix to indicate that age is a
continuous variable, i. to indicate that female is an indicator variable, and
## between them to include not only the main effects but also their
interaction.
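A sketch of the command with this notation is below; the gamma family is our assumption, since the distribution family is not restated in this section.

glm exp_tot c.age##i.female if exp_tot > 0, link(log) family(gamma)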

The results are harder to interpret directly, because the interaction term
allows for the effect of age to depend on gender and the effect of gender to
depend on age. The coefficients on the main effects of age and gender are
similar in magnitude to the simpler model. The interaction term is negative
and statistically significant, implying that the increase in expenditures with
age are lower for women than for men. However, to predict by how much,
use margins.

The overall predicted total expenditure is about $4,498 for this model,
which includes age, sex, and their interaction. This is even closer to the
sample mean of $4,480 than the model without the interaction term.

However, the overall mean is not as interesting as predictions that


show the variation by age and sex. Next, we calculate predictions
separately for men and women at ages 20, 50, and 80. The predicted values
for men are lower than those of women at young ages but higher for
elderly persons because of the interaction term. For example, a 20-year-old
woman is predicted to spend $2,228 compared with only $1,346 for a 20-
year-old man, but by age 80, the numbers are higher and reversed; an 80-
year-old woman is predicted to spend $9,392 compared with $10,723 for
an 80-year-old man.

We can use the marginsplot command following the margins
command to visualize the results. In this example, because the model
specification is so simple (only two variables and their interaction), we can
easily plot out predicted values for all possible combinations of ages and
genders. The code for this is shown below:
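A sketch of margins and marginsplot calls that produce predictions of this kind (the age grids are assumptions) is:

* Sketch: predictions by gender at selected ages, then over the full range
margins female, at(age=(20 50 80))
margins female, at(age=(20(5)80))
marginsplot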

Predicted total expenditures rise for both men and women with age
(see figure 5.2). Predicted total expenditures are higher for women among
young adults but rise faster for men. Convergence occurs around age 68.

Figure 5.2: Predicted total expenditures by age and gender (adjusted predictions with 95% CIs; predicted mean exp_tot by age, for males and females)

Next, we derive marginal and incremental effects from GLM with log
links and show how to calculate and interpret these effects using Stata.

5.6 Marginal and incremental effects

Marginal and incremental effects in GLM models are easy to compute. In a


GLM with a log link, the marginal effect of a continuous variable, $x_k$, on the
expected value of $y$ is the derivative of (5.1),
$$\frac{\partial E(y|x)}{\partial x_k} = \beta_k \exp(x\beta)$$
if $x_k$ has no higher-order terms or interactions with other variables. This
corresponds to the simple model in section 5.3.

If instead, for example, $x_k$ also had a squared term, $x_k^2$, so the
specification was $E(y|x) = \exp(\beta_0 + \beta_k x_k + \beta_{kk} x_k^2 + \cdots)$, the marginal effect
of a change in $x_k$ would be
$$\frac{\partial E(y|x)}{\partial x_k} = (\beta_k + 2\beta_{kk} x_k)\exp(x\beta)$$
If there is an interaction between two variables, $x_k$ and $x_j$, then the
marginal effect with respect to $x_k$ will also have an extra term involving
the interaction coefficient, $\beta_{kj}$. This corresponds to the example with an
interaction term in section 5.5.

An incremental effect is the difference between the expected value of
$y$, evaluated at two different values of a covariate of interest, holding all
other covariates fixed. The formula below compares the expected value of $y$
when $d = 1$ with $d = 0$ [and is based on (2.3)], although the specific
values of $d$ can vary depending on the research question.
$$E(y|x, d = 1) - E(y|x, d = 0) = \exp(x\beta + \beta_d) - \exp(x\beta)$$
Incremental effects are most commonly computed for a binary covariate,
like gender, whether someone has insurance or not, or if a person lives in
an urban or rural area. They can also be computed for a large discrete
change in a continuous variable, like age or income, which may have more
policy meaning than a tiny marginal effect. We will show how to compute
this kind of incremental effect in section 5.7.

When there are links other than the log link (see table 5.1), then the
expectation, the marginal, and the incremental effects are based on the
inverse of the link, $g^{-1}(x\beta)$, and its partial derivative with respect to a
specific covariate.

5.7 Example of marginal and incremental effects

Next, we compute the average marginal effects of the covariates while


accounting for their interaction. Stata’s margins command will do this
correctly, as long as the relationship between the variables is indicated in
the glm command line using the Stata symbols for variable types and its
powers and interactions—namely, c., i., and ##. The average marginal
effect of age (averaged over men and women) is about $126. The average
incremental difference between men and women (averaged over all ages)
is about $508.
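These averages can be obtained with something like:

margins, dydx(age female)   // average marginal effect of age; incremental effect of female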

However, the average marginal effects mask that the marginal effects
vary with age. We next graph the estimated marginal effects by age and
gender (see figure 5.3); this shows the slope of lines in figure 5.2. The
marginal effects of age are similar for men and women until about age 40,
then are higher for men at older ages.

Figure 5.3: Predicted marginal effects of age by age and gender (conditional marginal effects of age with 95% CIs; effects on predicted mean exp_tot, for males and females)

We can use the margins command with the dydx() option to compute
the marginal effects by age and gender, corresponding to figure 5.3. The
results confirm what is shown in the graph. Marginal effects for men and
women are similar at young ages but are much larger for men above age
60. The incremental effect of gender shows that women spend more on
average at younger ages (by almost $900 at age 20), but that difference is
reversed in old age, with men spending considerably more per year on
average.
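A sketch of the corresponding margins calls (the age grid is illustrative) is:

margins female, dydx(age) at(age=(20(10)80))   // marginal effect of age, by gender and age
margins, dydx(female) at(age=(20(10)80))       // incremental effect of gender, by age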

We next compute the incremental effect of a 10-year increase in age.
One way to do this is to use the contrast() option and to specify an
increase in age of 10 years with the at() option. The contrast is $1,463,
slightly more than 10 times the marginal effect of an increase in age of
1 year ($126). In this example, we also estimate unconditional standard
errors, that is, not conditioned on the sample values of the covariates. The
resulting standard errors are slightly larger than the sample-average
standard errors.
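One way this contrast could be set up is sketched here; the at(generate()) and contrast() syntax is our assumption about how the calculation might be coded, not the authors' exact command.

* Sketch: incremental effect of a 10-year increase in age, with unconditional SEs
margins, at(age=generate(age)) at(age=generate(age+10)) ///
    contrast(atcontrast(r._at)) vce(unconditional)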

5.8 Choice of link function and distribution family

The main modeling choice for GLMs is between the link function and the
distribution family. Although a number of published studies have used
GLMs with the log link and the gamma family for healthcare expenditures
and costs, we strongly recommend choosing based on performance of
alternative models in the specific context using information criteria or
statistical tests. In this section, we first show a simple way to use
information criteria to simultaneously choose the link function and the
distribution family (see chapter 2). Then, we show separate tests for the
link function and for the distribution family.

5.8.1 AIC and BIC

There are several ways to choose the link function and the distribution
family for the analysis of a GLM model with a continuous dependent
variable. We propose choosing them jointly using the Akaike information
criterion (AIC) (Akaike 1970) and the Bayesian information criterion (BIC)
(Schwarz 1978) as the statistical criteria for these nonnested models.
Information criteria-based choices have two advantages. First, they can be
applied regardless of whether complex adjustments for design effects in
the data have been applied or not (design effects are described in
chapter 11). Second, choices based on information criteria do not suffer
from issues of multiple hypothesis testing inherent in standard statistical
tests repeated for many possible choices of link and distribution family.

As shown in chapter 2, the AIC (Akaike 1970) is
$$\text{AIC} = -2\ln L + 2k$$
where $\ln L$ is the maximized GLM quasilog likelihood and $k$ is the number
of parameters in the model. Smaller values of AIC are preferable. The BIC
(Schwarz 1978) is
$$\text{BIC} = -2\ln L + k\ln N$$
where $N$ is the sample size. Smaller values of BIC are also preferable.

To illustrate the use of AIC and BIC, we fit models with log and square
root links using Gaussian, Poisson, and gamma distribution
families. We fit six different models with these links and families using
our MEPS dataset, store the results, and compare the AIC and BIC for each.

Note that we also used the scale(x2) option for the Poisson model.
This option is necessary for GLM models with a continuous dependent
variable to compute correct standard errors. It is the default for Gaussian
and gamma families but must be added for Poisson.
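A sketch of this comparison follows; the covariate list is illustrative rather than the full specification used for the reported results, and only three of the six models are written out.

* Sketch: link/family combinations, then compare information criteria
glm exp_tot age female if exp_tot > 0, link(log) family(gaussian)
estimates store log_gaussian
glm exp_tot age female if exp_tot > 0, link(log) family(poisson) scale(x2)
estimates store log_poisson
glm exp_tot age female if exp_tot > 0, link(log) family(gamma)
estimates store log_gamma
* ...repeat the three families with link(power 0.5) for the square root link...
estimates stats log_gaussian log_poisson log_gamma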

The AIC and BIC in the stored results are easily compared in a table. In
our MEPS data with the full covariate specification, the model with the
lowest AIC and BIC was the log link with the gamma family. Although we
expected this, and this choice of link function and family often wins for
expenditure data, it is always worth checking.

In this example, we did not fit models with the identity link . The
identity link for nonnegative expenditures is both conceptually flawed and
causes computational problems. The dependent variable of expenditures
can never be negative, yet a model with an identity link would allow this
possibility. In contrast, the log link (which exponentiates the linear index)
and the square root link (which squares the linear index) never estimate the
conditional mean of the dependent variable to be negative. When using the
identity link with many datasets (including the MEPS example), a rich set of
covariates will predict the conditional mean to be negative for some
observations. For these observations, and hence for the sample as a whole,
the log-likelihood function is undefined. In such cases, the maximum
likelihood estimation will have trouble finding a solution. For other types
of dependent variables, the identity link function may well be appropriate.
As a precaution, in our empirical example, we use the iter(40) option to
limit the number of iterations to be 40, so that it will not iterate forever.
Typically, GLMs converge in less than 10 iterations. Consequently, if the
model gets to 40 iterations, check to see if there is a problem with the
model as specified.

5.8.2 Test for the link function

Instead of choosing both the link function and the distribution family
simultaneously, choose them sequentially using a series of statistical tests.
Use a Box–Cox approach (see section 6.5) to find an appropriate
functional form and use that form as the link function. In brief, the Box–
Cox approach tests which scalar power, $\lambda$, of the dependent variable, $y$,
results in the most symmetric distribution. A power of $\lambda = 1$ corresponds
to a linear model, $\lambda = 0.5$ corresponds to the square root transformation,
and $\lambda = 0$ corresponds to the natural log transformation model. This
approach is discussed at length in section 6.5, with examples that show
that the log link is preferred to the square root for the MEPS dataset and the
basic model.

Note that the boxcox command does not admit the factor-variable
syntax of modern Stata. Therefore, we use the xi: prefix to preprocess the
data to generate appropriate indicators. The estimated coefficient (/theta
in the output) is only slightly greater than zero. We take this to mean that
the log link function is preferable to the square root or other common link
functions.
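A sketch of such a call, assuming the dependent-variable-only (lhsonly) model and the xi: workaround described above, is:

* Sketch: Box-Cox test of the dependent-variable transformation
xi: boxcox exp_tot age i.female if exp_tot > 0, model(lhsonly)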

5.8.3 Modified Park test for the distribution family

There is a regression-based statistical test based on Park (1966) —called


the modified Park test—that provides a simple way to test for the
relationship between the predicted mean and variance in a GLM. The
selection of the distribution family is important, because it affects the
precision of the estimated response, both in terms of estimated coefficients
and marginal effects (Manning and Mullahy 2001) . In the absence of any
guidance from theory, the analyst must determine empirically how the
raw-scale variance depends on the mean function.

To implement the modified Park test, we first run a GLM—which
means choosing an initial link function and distribution family prior to
running the empirical test. Our working assumption—based on results in
section 5.8 and in the literature—is that we should have a log link and
gamma family. Note that this test requires link to be correctly specified.
Postestimation, we generate the log of the squared residuals and the linear
index.
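A sketch of these steps follows; the name xbetahat matches the discussion below, while the other names and the covariate list are illustrative.

* Sketch: setup for the modified Park test
glm exp_tot age female if exp_tot > 0, link(log) family(gamma)
predict double muhat, mu                          // predicted mean on the raw scale
predict double xbetahat, xb                       // linear index (log of the predicted mean)
generate double lnres2 = ln((exp_tot - muhat)^2)  // log of the squared raw-scale residual
regress lnres2 xbetahat, vce(robust)              // Park test regression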

The Park test is based on the estimated relationship between the


variance of the error term and the mean. The dependent variable is the natural
logarithm of the raw-scale variance. The sole covariate is the natural log of
the conditional expected value of the dependent variable,
$\ln\{\widehat{E}(y|x)\} = x\hat{\beta}$. The coefficient on the predicted value indicates
which distribution family is preferred. Typically, analysts have made
choices that reflect integer powers:

If it is close to zero, use the Gaussian family, because the variance is


unrelated to the mean, as in Mullahy (1998).

If it is close to one, use the Poisson, because of its property that the
variance is proportional to the mean.

If it is close to two, use the gamma, as in Blough, Madden, and


Hornbrook (1999).

If it is close to three, use the inverse Gaussian.

Although the test rejects each of the integer-valued powers considered,


the estimated coefficient on xbetahat (1.82) is closest to the gamma
family’s integer value of 2. Therefore, for these data and this model
specification, we choose the gamma family (given the choice of the log
link function).

5.8.4 Extended GLM

What if the appropriate link function is not one of the widely used choices
[identity, square root, or log]? What if the distribution family is not an
integer power of the mean function? Basu and Rathouz (2005) address
these questions with an approach known as the extended estimating
equations model. They simultaneously estimate the mean and distribution
family, rather than separately, and allow for general noninteger choices of
the power values.

5.9 Conclusions

In summary, GLM is appealing in health economics, because it deals with


skewness and heteroskedasticity while avoiding retransformation issues of
OLS models with a logged dependent variable (see chapter 6). As we will
demonstrate, it is also much harder to calculate marginal effects for log
transformation models than for GLM models.

5.10 Stata resources

To estimate GLM models in Stata, use the glm command, which works with
margins , svy , and bootstrap . Basu and Rathouz (2005) have Stata code
for their extended estimating equations model.

To estimate predicted values and calculate marginal and incremental


effects of covariates conditional on the other covariates, use the
postestimation commands margins or contrast after fitting a GLM model.
The marginsplot command generates graphs immediately after using
margins .

To compare the AIC and BIC test statistics for GLMs with different
choices of link function and distribution family, use estimates stats * .
Alternatively, conduct a link test with boxcox and a modified Park test
with code found in this chapter.

Chapter 6
Log and Box–Cox models

6.1 Introduction

Despite the ease of fitting and interpreting generalized linear models (GLM)
(see chapter 5) and the ability of GLMs to deal with heteroskedasticity
while avoiding retransformation problems, a sizable fraction of the health
economics literature still fits regression models with a logged dependent
variable. In this chapter, we cover log models in detail to show their
weaknesses and to explain how a careful analysis would properly interpret
results.

Interpreting the effects of covariates on the raw scale is much more


difficult than fitting log models. We are typically not interested in log
dollars per se (Manning 1998). Instead, the interest ultimately is about $E(y|x)$
or how $E(y|x)$ changes with changes in covariates.
Transforming the dependent variable for estimation complicates
interpretation on the raw (unlogged) scale. Retransforming the results back
to the raw scale requires dealing with the error term—which may be
nonnormal, heteroskedastic , or both. Predicted values of the dependent
variable and marginal effects therefore depend not only on the coefficients
but also on the distribution of the error term.

In this chapter, we focus on fitting and interpreting models of


transformed positive dependent variables. We start with the popular log
model and later discuss the more general Box–Cox model . After
introducing the model with a logged dependent variable, we explain why
retransformation is difficult and dependent on the error term. We then
explain how to compute , marginal effects , and incremental
effects under four different assumptions about the error term
(homoskedastic or heteroskedastic and normal or nonnormal). Because of
the importance of comparing different kinds of models, we show how to
compare ordinary least-squares (OLS) regression models with either or
as the dependent variable and then discuss in detail the differences
between log models and GLM models. In the remainder of this chapter, we
describe a more general transformation-based model, the Box–Cox model
(1964).

6.2 Log models

6.2.1 Log model estimation and interpretation

We start with models that take the natural logarithm of a continuous


dependent variable, $y$, with no zeros or negative values (for models that
include zero values, see chapter 7). For notational simplicity throughout
this chapter, we assume $y > 0$ unless otherwise specified.

When the dependent variable, $y_i$, for observation $i$ is transformed by
taking the natural logarithm, the model is
$$\ln(y_i) = x_i\beta + \varepsilon_i \tag{6.1}$$
where $x_i$ is a vector of covariates including the constant term, $\beta$ is the
vector of parameters to be estimated, and $\varepsilon_i$ is a random error term.

The expected value of the natural logarithm of $y$ (conditional on $x$
and on $\beta$) is the linear index
$$E\{\ln(y)|x\} = x\beta$$
when the error term satisfies the orthogonality constraint that $E(\varepsilon|x) = 0$.

In this example, we use the 2004 Medical Expenditure Panel Survey


(MEPS) data introduced in chapter 3 to estimate the effect of age (age) and
gender (female is a binary indicator for being female) on total healthcare
expenditures for persons with any expenditures (exp_tot > 0). We fit the
log transformation model once, then use those results throughout this
chapter to make predictions about $y$ (on the raw scale), given different
assumptions about the heteroskedasticity and normality of the errors.
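A sketch of this estimation step is below; the name of the logged variable is an assumption.

* Sketch: OLS on the log of total expenditures
generate double ln_exp_tot = ln(exp_tot) if exp_tot > 0
regress ln_exp_tot age female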

The OLS regression results show that the log of healthcare expenditures
increases with age and is higher for women. For a semilog model, it is easy
to interpret the coefficient on a continuous variable—like age—as a
percent change in the dependent variable. Expenditures increase by about
3.6% with each additional year of age among those who spent anything. In
this case, a coefficient of 0.0358 corresponds to about a 3.6% increase,
because the parameter is close to 0. A more precise value is found by
exponentiating the coefficient; this more precise mathematical formula
matters more for coefficients further from zero. The coefficient is
statistically significantly different from 0.

Dummy variables also have a percentage effect on the dependent


variable in a log model (Halvorsen and Palmquist 1980). The magnitude of
the percentage change in $y$ for a unit change in the dummy variable, $d$, is
the exponentiated coefficient, $\hat{\beta}_d$, less 1, multiplied by 100:
$$\%\Delta y = 100\,\{\exp(\hat{\beta}_d) - 1\} \tag{6.2}$$
The coefficient on female in the simple health expenditure model
implies, by this formula, that women spend about 42% more than men,
averaged over all ages.

However, (6.2) has finite sample bias, because $\hat{\beta}_d$ is estimated with
error—and because the expectation of an exponentiated estimate is not the
exponential of the expectation [see (6.3)]. Kennedy (1981)
proposed subtracting a term in the exponent to correct the bias in (6.2).
The Kennedy transformation is the following formula,
$$\%\Delta y = 100\,[\exp\{\hat{\beta}_d - 0.5\,\widehat{V}(\hat{\beta}_d)\} - 1]$$
where $\widehat{V}(\hat{\beta}_d)$ is the OLS estimate of the variance of the coefficient on the
dummy variable of interest. This formula applies only to positive
coefficients. For a negative coefficient, redefine the variable by taking one
minus the variable.

The variance of the percentage change in $y$ is also easy to calculate
(van Garderen and Shah 2002).

However, in the MEPS example the Kennedy transformation is not


substantively different from the standard interpretation. See the output
below, which calculates both—along with the standard error—based on
the formula by van Garderen and Shah (2002) .
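A sketch of the point-estimate calculations (after the log regression above, and omitting the standard-error formula) is:

* Sketch: naive and Kennedy-corrected percentage effects of female
local b = _b[female]
local v = _se[female]^2
display "naive effect   (%) = " 100*(exp(`b') - 1)
display "Kennedy effect (%) = " 100*(exp(`b' - 0.5*`v') - 1)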

In our experience, the Kennedy transformation , while popular, rarely


makes a substantive difference for statistically significant parameters. In
the MEPS example, because the variance of is small, the Kennedy
transformation does not make a practical difference.

6.3 Retransformation from ln(y) to raw scale

When the dependent variable is transformed by taking the natural


logarithm [as in (6.1)], then the expected value of $y$, conditional on $x$, is
not simply the exponentiated linear index:
$$E(y|x) \neq \exp(x\beta) \tag{6.3}$$
Instead, it also depends on the expected value of the exponentiated
error term:
$$E(y|x) = \exp(x\beta)\,E\{\exp(\varepsilon)|x\} \tag{6.4}$$
The expected value of the exponentiated error term [the second term in
(6.4)] is greater than one by Jensen’s inequality, implying that the
exponentiated linear index (6.3) is an underestimate of the expected value
of $y$.

The following subsection describes two ways to estimate the
multiplicative retransformation factor, $E\{\exp(\varepsilon)\}$, depending on whether
the log-scale error, $\varepsilon$, has a normal or nonnormal distribution. We then
show how to calculate predicted values and their standard errors and
compare the predicted values of $y$ using each method.

6.3.1 Error retransformation and model predictions

We wrote a small program to estimate the retransformation factors using


normal theory and Duan’s smearing retransformation. This allows us to
use the bootstrap to obtain appropriate standard errors for predicted means.
As the code below shows, we first fit a linear model for and predict
the linear index (xbhat), the residuals (ehat), and the exponentiated index
(expxbhat).

The first—and simplest—case assumes that the error term has a
normal distribution. In this case, the error retransformation factor is
$\exp(\sigma^2/2)$, where $\sigma^2$ is the variance of the error term on the
log scale. The expected value of $y$, conditional on $x$, is the exponentiated
linear index multiplied by the error retransformation factor.
$$E(y|x) = \exp(x\beta)\,\exp(\sigma^2/2) \tag{6.5}$$
In the program below, the normal factor is denoted normalfactor.

In the second case, we relax the normality assumption. Duan (1983)
developed a consistent way to estimate $E\{\exp(\varepsilon)\}$ when the errors are not
normal but with the covariates assumed fixed. Duan’s smearing factor—
denoted $\hat{\phi}$—is the scalar average of the exponentiated estimated error
terms, $\hat{\phi} = \frac{1}{N}\sum_{i=1}^{N}\exp(\hat{\varepsilon}_i)$, where the log-scale residual
$\hat{\varepsilon}_i = \ln(y_i) - x_i\hat{\beta}$ provides a consistent estimator for the error.

The expected value of $y$, conditional on $x$, is the exponentiated linear
index multiplied by Duan’s smearing factor:
$$\widehat{E}(y|x) = \exp(x\hat{\beta})\,\hat{\phi}$$
In the program below, Duan’s smearing factor is denoted duanfactor. In


each case, the predicted conditional mean is calculated by multiplying
expxbhat with the appropriate multiplicative factor.
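A sketch of a program along the lines described is below; it reuses the names given in the text (xbhat, ehat, expxbhat, normalfactor, duanfactor), assumes the logged variable ln_exp_tot from earlier, and omits the predicted means for brevity.

capture program drop retrans
program define retrans, rclass
    regress ln_exp_tot age female
    predict double xbhat, xb                 // linear index
    predict double ehat, residuals           // log-scale residuals
    generate double expxbhat = exp(xbhat)    // exponentiated index
    quietly summarize ehat
    return scalar normalfactor = exp(r(Var)/2)      // normal-theory factor
    generate double expehat = exp(ehat)
    quietly summarize expehat
    return scalar duanfactor = r(mean)              // Duan's smearing factor
    drop xbhat ehat expxbhat expehat
end
bootstrap normalfactor=r(normalfactor) duanfactor=r(duanfactor), reps(200): retrans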

We use the bootstrap command to calculate standard errors for the
smearing factors and for the predicted means of exp_tot. We have used
200 bootstrap replications without experimentation in this example. We
urge readers to ensure that the estimates of interest are stable given the
choice of number of replications. We could also have used GMM to obtain
correct standard errors analytically. We provide an example of how to use
GMM in section 10.4.

The results show that the estimated value of the normal


retransformation factor is 3.11. Clearly, 3.11 is far greater than 1.0. The
estimate of Duan’s smearing factor is 2.89 in the MEPS sample. Therefore,
all predictions based on the normal theory retransformation factor will be
7.6% higher than the corresponding predictions
based on Duan’s smearing factor. Note that the confidence intervals for
these estimates do not overlap.

We now compare the sample averages of the predicted values of


exp_tot with each other and with the sample average of exp_tot. The
mean of total healthcare expenditures for this sample with positive
expenditures is $4,480. Ideally, predictions of total healthcare expenditures
on the raw scale would be close to the actual sample mean. However, the
method that assumes normality does not come close—the prediction of the
average is nearly 20% higher than the sample mean, and their confidence
intervals do not overlap. Relaxing the normality assumption yields a
prediction ($4,986) that is still too high by about 10% relative to the
sample mean. Once again, the confidence intervals do not overlap.
Therefore, none of these methods came particularly close to estimating the
overall mean (unlike the GLM model with a log link, as in chapter 5).

We do not expect these alternatives to have a mean exactly equal to the


sample mean. However, when the predictions are far from the sample
mean, there could be problems with the retransformation , the model
specification, the estimate of the retransformation factor, the modeling of
heteroskedasticity, or all the above. In particular, adding more covariates
would improve the estimate greatly. We have also not considered
heteroskedasticity. Introducing heteroskedasticity into the retransformation
factor matters greatly. Simple groupwise heteroskedasticity can be easily
introduced into the normal and Duan retransformation factors.
Unfortunately, it is rare for a researcher to encounter situations where
heteroskedasticity occurs only by group or for the researcher to be able to
identify such groups in the data. Therefore, in such cases, GLMs have a
natural advantage .

6.3.2 Marginal and incremental effects

Retransformation also affects estimates of the marginal effects of


continuous covariates. The general formula for the marginal effect
applies the chain rule to the derivative of $E(y|x)$:
$$\frac{\partial E(y|x)}{\partial x_k} = \beta_k \exp(x\beta)\,E\{\exp(\varepsilon)|x\} + \exp(x\beta)\,\frac{\partial E\{\exp(\varepsilon)|x\}}{\partial x_k} \tag{6.6}$$
The second term in (6.6) depends on whether the error term is


homoskedastic or heteroskedastic . If the error term is homoskedastic, the
second term is identically zero. However, ignoring heteroskedasticity—if
it exists—will lead to inconsistent estimates not only of the conditional
mean but also of the marginal effects (Manning 1998). This point is worth
emphasizing: unlike OLS models without retransformation,
heteroskedasticity must be accounted for when fitting marginal effects in
log transformation models.

As with marginal effects, the calculation of incremental effects


depends on correctly modeling possible heteroskedasticity, because the
heteroskedasticity correction will appear in the estimate of the conditional
mean of .

6.4 Comparison of log models to GLM

There is often confusion between GLM with a log link function (see
chapter 5) and OLS regression with a log-transformed dependent variable
(as described in this chapter).

GLM with a log link function models the logarithm of the expected
value of $y$, conditional on $x$—that is, $\ln\{E(y|x)\}$.

OLS regression with a log-transformed dependent variable models the
expected value of the logarithm of $y$, conditional on $x$—that is,
$E\{\ln(y)|x\}$.

The similarity is deceptive, but the order of operations matters greatly.


We compare the equations to show why these models differ.

A GLM with a log link models the log of the expected value of $y$,
conditional on $x$, as a linear index of covariates $x$ and parameters $\beta$:
$$\ln\{E(y|x)\} = x\beta \tag{6.7}$$
Exponentiating (6.7) yields an expression for the expected value of $y$,
conditional on $x$:
$$E(y|x) = \exp(x\beta) \tag{6.8}$$
In contrast, an OLS regression with a log-transformed dependent
variable models the log of $y$ as a linear index of covariates $x$ and
parameters $\gamma$, plus an error term. Notice the inclusion of an error term
$\varepsilon$:
$$\ln(y) = x\gamma + \varepsilon \tag{6.9}$$
Taking the expected value of both sides of (6.9) eliminates the mean-
zero error term, but the resulting equation is in terms of the expected value
of the logarithm of $y$, not the expected value of $y$:
$$E\{\ln(y)|x\} = x\gamma$$
Equation (6.9) differs from (6.7) on the left-hand side, because the order of
operations is different—and it differs on the right-hand side, because the
parameter values are different.

To get an expression in terms of the expected value of , return to (6.9)


and first exponentiate both sides, then take the expectation. The expected
value of , conditional on , is therefore a complicated function of the
exponentiated error term:

The expected value of the dependent variable, , in the log


transformation model depends on two terms [see (6.10)]. The first term
looks like the expected value of in the GLM with a log link [right-hand
side of (6.8)]. However, the second term is different. It depends on the
error term. If the error term is heteroskedastic in , the second term will
also include terms in .

In general, the parameters from these two models will not be equal (that is, $\boldsymbol{\beta} \neq \boldsymbol{\delta}$). In particular, the constant terms will be quite different, because in the log transformation model part of the conditional mean is absorbed by the retransformation factor $E\{\exp(\epsilon)\}$.

In summary, while OLS regressions with a log-transformed dependent


variable appear similar to GLM models with a log link, the GLM models are
easier to interpret on the raw scale and naturally adjust for
heteroskedasticity . Properly interpreting results from a log-transformation
model requires substantially more effort.

6.5 Box–Cox models

The log transformation is a specific case of the popular Box–Cox


transformation. The Box–Cox (1964) transformation is a nonlinear transformation of a variable using a power function. Specifically, the Box–Cox transformation for $y > 0$ is a specific variant of

$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}$$

where $y^{(\lambda)} = \ln(y)$ in the limit as $\lambda \rightarrow 0$. One reason for the popularity of the Box–Cox model is that it incorporates many commonly used models—including linear, square root, and natural logarithm. Below are common values of $\lambda$ and the corresponding power functions.

Table 6.1: Box–Cox formulas for common values of $\lambda$

    $\lambda = 1$:     $y - 1$              (linear)
    $\lambda = 1/2$:   $2(\sqrt{y} - 1)$    (square root)
    $\lambda = 0$:     $\ln(y)$             (natural logarithm)

By choosing the correct value of $\lambda$, we find that the resulting transformed continuous dependent variable will be closer to being symmetric, because the method targets skewness in the error term. Mathematically, if $\lambda < 1$, then the Box–Cox transformation pulls the right tail in more than it does the left tail, thus making right-skewed data more symmetric. If $\lambda > 1$, then the transformation pushes out the right tail more than the left tail. Therefore, when $\lambda < 1$, the Box–Cox transformation makes right-skewed data more symmetric; when $\lambda > 1$, it makes left-skewed data more symmetric. However, this transformation does not necessarily eliminate heavy tails.

Abrevaya (2002) provides the general theory of retransformation of the

Box–Cox model under homoskedasticity. Duan’s (1983) smearing for the
lognormal model is a special case of Abrevaya’s method .

In health economics, the Box–Cox transformation most commonly


transforms a skewed dependent variable, such as positive expenditures.
The log transformation is most common. The square root transformation
has also been used in a few applications (Ettner et al. 1998, 2003;
Lindrooth, Norton, and Dickey 2002; Veazie, Manning, and Kane 2003).

6.5.1 Box–Cox example

We next show one example of the basic Box–Cox transformation


(dependent variable only) on the total expenditures variable from the MEPS
data. The result is an estimate of the transformation parameter (Stata calls this /theta). Although
statistically significantly different from zero, this estimated transformation
parameter is fairly close to zero—and justifies the use of the log model as
the best simple approximation. If researchers were to encounter a case that
was neither log nor linear, they could use the Abrevaya (2002) approach.
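A minimal sketch of such an estimation is below; the covariate list is illustrative and may not match the specification behind the estimate reported above:

    * Box-Cox transformation of the dependent variable only (lhsonly),
    * fit on the positive expenditures; covariates are illustrative
    boxcox exp_tot age female if exp_tot > 0, model(lhsonly) nolog
    * the transformation parameter is reported as /theta in the output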

6.6 Stata resources

Models with log-transformed dependent variables are estimated with OLS


regression, typically with regress . Generate a new logged dependent
variable with generate prior to fitting the model.
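For example, a minimal sketch on the MEPS data (the covariate list is illustrative):

    * create the logged outcome on the positive values, then fit by OLS
    generate double lnexp_tot = ln(exp_tot) if exp_tot > 0
    regress lnexp_tot age i.female, vce(robust)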

The boxcox command estimates maximum likelihood estimates of


Box–Cox models. There are four versions of the boxcox command: it will
fit models with just the left-hand side transformed, models with just the
right-hand side transformed, and models with both sides transformed—
where the left- and right-hand sides have the same or different
transformation factor. Stata’s predict command implements the Abrevaya
method after running boxcox. Therefore, margins gives consistent
estimates of the means of the partial effects and standard errors.

Two commands related to boxcox will generate a new variable


transformed from an old variable, such that the new variable has zero
skewness . bcskew0 uses the standard Box–Cox formula to find a
transformation with zero skewness. However, it differs slightly in its
estimated parameter from boxcox, because it uses a different algorithm.
lnskew0 takes the natural logarithm of an expression minus a constant, $\ln(y - k)$. Unlike bcskew0, lnskew0 assumes a log transformation (a transformation parameter of zero) and chooses the value of the additive constant, $k$, to minimize the skewness of the transformed logged variable.

Chapter 7
Models for continuous outcomes with mass at zero

7.1 Introduction

We have now explained a variety of ways to model positive outcomes with


skewed positive values. While some important research questions in health
economics and health policy involve only expenditures for those who
spend at least some money, many more research questions involve health
expenditures that include a substantial fraction of zeros—with the
remaining values being positive, continuous, and severely skewed. For
example, annual hospital expenditures are zero for most people but
positive and often large for the subset who require hospital care. The
majority of adults are nonsmokers, with many moderate smokers and a few
heavy smokers. For any measure of healthcare use—inpatient, outpatient,
emergency room, dental visit, preventive care—there is always a sizable
fraction of the general population who do not use any healthcare during a
defined period. The domains of all of these healthcare outcomes are either
zero or positive. Statistical models that reflect the point mass at zero may
better describe the relationships between the explanatory variables and the
outcomes.

While it is tempting to eliminate zeros from the distributions of


observed expenditures for statistical modeling reasons, incorporating them
into the analysis is important for computing the correct treatment effects
and for marginal and incremental effects of covariates. We often care
about how treatments, policies, or other covariates affect the outcome for
the entire population, including those who have zero expenditure or use.
Some research questions are about whether a policy affects if a person has
any expenditures (the extensive margin ) or whether the policy affects the
amount spent for those who have at least some expenditures (the intensive
margin ). An antismoking policy may affect the extensive margin (the
fraction who smoke), the intensive margin (the number of cigarettes
smoked by smokers), or both. Obtaining health insurance may also affect
the intensive and extensive margins of healthcare spending differently. The
most commonly used models that account for a substantial fraction of
zeros allow for a differential response of the covariates over these two
margins.

There are several ways to model such data, a number of which are
discussed in Cameron and Trivedi (2005) and in Wooldridge (2010) . In
this chapter, we discuss two approaches in detail. Both approaches model

the outcome using two indices; in each model, one index focuses on the
process by which the zeros are generated. At the end of the chapter, we
provide brief descriptions of single-index models that have been used in
the literature but that we would not recommend.

We assume that the goal of the econometric strategy is to estimate


conditional means [that is, $E(y_i \mid \mathbf{x}_i)$] and marginal and incremental effects of actual outcomes [that is, $\partial E(y_i \mid \mathbf{x}_i)/\partial x_k$ and $E(y_i \mid \mathbf{x}_i, x_k = 1) - E(y_i \mid \mathbf{x}_i, x_k = 0)$], where $y_i$ is the outcome for observation $i$, $\mathbf{x}_i$ is the vector of conditioning covariates, and $x_k$ is a specific covariate. In most applications to health expenditures, researchers are not interested in the conditional expectation, $E(y_i^* \mid \mathbf{x}_i)$, of some underlying latent variable, $y_i^*$, in a model in which $y_i = 0$ denotes censoring—but instead in a model in which $y_i = 0$ truly represents zero expenditure or use. We compare these different statistical approaches to
modeling continuous dependent variables with a large mass at zero,
specifically on how they achieve the goals of predicting conditional
means, marginal effects, and incremental effects of actual outcomes.

7.2 Two-part models

One approach to achieve these goals is based purely on a statistical


decomposition of the data density (Cragg 1971) . In this approach, it is
assumed that the density of the outcome is a mixture of a process that
generates zeros and a process that generates only positive values (this
second process may not admit zeros). Consider an observed outcome, ,
and a vector of covariates, .

Let $f_0(y \mid \mathbf{x})$ be the density of $y$ when $y = 0$, and let $f_{+}(y \mid \mathbf{x})$ be the conditional density of $y$ when $y > 0$. Without any loss of generality, we can write the density of $y$ as

$$f(y \mid \mathbf{x}) = \left\{\Pr(y = 0 \mid \mathbf{x})\, f_0(y \mid \mathbf{x})\right\}^{1(y=0)} \left\{\Pr(y > 0 \mid \mathbf{x})\, f_{+}(y \mid \mathbf{x})\right\}^{1(y>0)} \qquad (7.1)$$

where $f_0(y \mid \mathbf{x}) = 1$, because it defines a degenerate density at $y = 0$. By the fundamental definitions of joint and conditional events, the joint density $f(y \mid \mathbf{x})$ always decomposes into the product of the probability that $y$ is in a particular subdomain multiplied by its density, conditional on $y$ being in that subdomain. This definition is completely general. It does not require or imply any particular relationship between $f_0$ and $f_{+}$ (and the associated probabilities, to be precise). Specifically, we note that there is no
independence requirement between the distributions or the stochastic
elements that underlie the distributions. Gilleskie and Mroz (2004) and Mroz (2012) invoke the same argument for a multiple index decomposition of a multivalued or count outcome. Drukker (2017) formally demonstrates that you can identify $E(y \mid \mathbf{x})$ when there is dependence between the part that determines whether $y = 0$ or $y > 0$ and the part that models $y \mid y > 0$.

The estimator of the parameters of this model can be decomposed into


two parts; the parameters of the model for $\Pr(y = 0 \mid \mathbf{x})$ are estimated separately from the parameters of the model for $f_{+}(y \mid \mathbf{x}, y > 0)$.
Because of this decomposition, this approach is widely known as the two-
part model.

The two-part model has a long history in empirical analysis. Since the
1970s, meteorologists have used versions of a two-part model for rainfall
(Cole and Sherriff 1972; Todorovic and Woolhiser 1975; Katz 1977) .
Economists also used two-part models in the 1970s. Cragg (1971)
developed the hurdle (two-part) model as an extension of the tobit model.
Newhouse and Phelps (1976) published an article that is the first known
example of the two-part model in health economics. Their empirical model
fits price and income elasticities of medical care. The two-part model
became widely used in health economics and health services research after
a team at the RAND Corporation used it to model healthcare expenditures in
the context of the Health Insurance Experiment (Duan et al. 1984) . See
Mihaylova et al. (2011) for more on the widespread use of the two-part
model for healthcare expenditure data. Two-part models are also
appropriate for other mixed discrete-continuous outcomes, such as
household-level consumption.

There are many specific modeling choices for the first- and second-part
models. The choices depend on the data studied, the distribution of the
outcome, and other statistical issues. The most common choices are
displayed in table 7.1. In the first-part model, $\Pr(y > 0 \mid \mathbf{x})$ is typically specified as a logit or probit equation. In the second-part model, there are many suitable models for $E(y \mid y > 0, \mathbf{x})$. Common choices are a linear
model, a log-linear model (see chapter 6) , or a generalized linear model
(GLM) (see chapter 5).

Table 7.1: Choices of two-part models

7.2.1 Expected values and marginal and incremental effects

In this section, we describe how to compute the expected value of ,


conditional on the vector of covariates , for different specific choices of
the two parts of a two-part model. We also explain how to compute
marginal and incremental effects. We focus on a few of the most popular
two-part models, because there is not space to show every possible
combination. By explaining the approach to the modeling and showing
representative models, we leave it to readers to apply the models most
appropriate to their own data.

Consider first a model with a probit first part and a normally distributed second part for a positive outcome, $y$, and vector of covariates, $\mathbf{x}$. The density, $f(y \mid \mathbf{x})$, is composed of two parts—depending on the value of $y$,

$$f(y \mid \mathbf{x}) = \begin{cases} 1 - \Phi(\mathbf{x}\boldsymbol{\gamma}) & \text{if } y = 0 \\[4pt] \Phi(\mathbf{x}\boldsymbol{\gamma}) \times \dfrac{1}{\sigma}\,\phi\!\left(\dfrac{y - \mathbf{x}\boldsymbol{\beta}}{\sigma}\right) & \text{if } y > 0 \end{cases}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote, respectively, the probability density function (p.d.f.) and the cumulative distribution function (CDF) of the unit normal density, $\boldsymbol{\gamma}$ is the vector of parameters for the first-part probit model, $\boldsymbol{\beta}$ is the vector of parameters for the second-part model, and $\sigma$ is the scale (standard deviation) of the normal distribution in the second part. This model specification is even more restrictive than the usual linear second-part model, which—if estimated by least squares—would not require normality. We use this restricted specification to aid comparison with the generalized tobit described in section 7.3.

We conclude this section by showing example formulas for the unconditional expected value of $y$, $E(y \mid \mathbf{x})$. Because there are many different possible specifications for the two-part model, the formula for the unconditional expected value depends on the choice of models. For example, if the first part is probit and the second part is linear, then

$$E(y \mid \mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\gamma}) \times \mathbf{x}\boldsymbol{\beta}$$

If the first part is a probit and the second part is a GLM model with a log link, then the formula requires exponentiating the linear index function, where the vector of parameters is now denoted $\boldsymbol{\delta}$:

$$E(y \mid \mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\gamma}) \times \exp(\mathbf{x}\boldsymbol{\delta})$$

If instead the first part is a logit, then the first term on the right-hand side, $\Phi(\mathbf{x}\boldsymbol{\gamma})$, is replaced by the logit CDF, $\Lambda(\mathbf{x}\boldsymbol{\alpha}) = \exp(\mathbf{x}\boldsymbol{\alpha})/\{1 + \exp(\mathbf{x}\boldsymbol{\alpha})\}$, with a vector of parameters denoted $\boldsymbol{\alpha}$. For example, the two-part model with a logit and a GLM with a log link has an expected value of

$$E(y \mid \mathbf{x}) = \Lambda(\mathbf{x}\boldsymbol{\alpha}) \times \exp(\mathbf{x}\boldsymbol{\delta})$$

More work is necessary when the second part is ordinary least squares (OLS), with $\ln(y)$ as the dependent variable (see chapter 6). For example, if the first part is a probit, and the second part is a log transformation with homoskedastic normal errors, then

$$E(y \mid \mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\gamma}) \times \exp(\mathbf{x}\boldsymbol{\beta}_{\ln} + \sigma^2/2)$$

where $\Phi$ is the normal CDF and $\sigma^2$ is the variance of the normal error. If the error is not assumed normal, then the term $\exp(\sigma^2/2)$ can be replaced by Duan's (1983) smearing factor, which we denote by $\hat{\varphi}$:

$$E(y \mid \mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\gamma}) \times \exp(\mathbf{x}\boldsymbol{\beta}_{\ln}) \times \hat{\varphi}$$

Other models require other formulas, but the expected value can always be calculated using the conditioning in (7.1).
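A minimal sketch of this conditioning computed by hand—with a probit first part and a GLM (log link, gamma family) second part, using the MEPS variable names from chapter 3 for concreteness—is:

    * first part: probability of any expenditure
    generate byte any_exp = (exp_tot > 0)
    probit any_exp age i.female
    predict double p_any, pr                  // Pr(y > 0 | x)
    * second part: level of expenditure among spenders
    glm exp_tot age i.female if exp_tot > 0, family(gamma) link(log)
    predict double mu_pos, mu                 // E(y | y > 0, x)
    * combine the two parts as in (7.1)
    generate double ey = p_any*mu_pos         // E(y | x)
    summarize ey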

7.3 Generalized tobit

The other approach is the generalized tobit (or Heckman) selection model,
which begins with structural or behavioral equations that jointly model two
latent outcomes . Each latent variable has an observed counterpart.
Although we have formulated the model so that the outcome variable
includes zeros and positives, following Maddala (1985) , we note that the
model was initially formulated as a combination of missing values and
positives (Heckman 1979) .

The generalized tobit explicitly models the correlation of the error


terms of two structural equations, one for the censoring process and the
other for the latent outcome. Using the notation of Wooldridge (2010), consider two latent random variables, $y_1^*$ and $y_2^*$, with observed counterparts, $y_1$ and $y_2$, respectively. The variable $y_1$ defines a censoring process as

$$y_1 = 1(y_1^* > 0) \qquad (7.2)$$

and an outcome equation as

$$y_2 = y_1 \times y_2^* \qquad (7.3)$$

Note that $y_2$ is the observed outcome (for example, healthcare expenditures), with a mass of observations at zero. The model is completed by specifying the joint distribution of the latent variables, $y_1^*$ and $y_2^*$. In this case,

$$y_1^* = \mathbf{z}\boldsymbol{\gamma} + u_1, \qquad y_2^* = \mathbf{x}\boldsymbol{\beta} + u_2 \qquad (7.4)$$

where the vector, $\mathbf{z}$, is a superset of $\mathbf{x}$ (that is, $\mathbf{z}$ may include some variables not included in $\mathbf{x}$) and $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are vectors of parameters to estimate. If the joint distribution of $u_1$ and $u_2$ is bivariate normal with a correlation parameter, $\rho$,

$$\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim N\!\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix} \right\} \qquad (7.5)$$

Here $E(y_2 \mid \mathbf{x}, \mathbf{z}, y_1 = 1) = \mathbf{x}\boldsymbol{\beta} + E(u_2 \mid u_1 > -\mathbf{z}\boldsymbol{\gamma})$ and

$$E(y_2 \mid \mathbf{x}, \mathbf{z}, y_1 = 1) = \mathbf{x}\boldsymbol{\beta} + \rho\sigma\,\lambda(\mathbf{z}\boldsymbol{\gamma}) \qquad (7.6)$$

where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mills ratio.

The correlation, $\rho$, is identified from two sources. The preferred approach is to use exclusion restrictions. However, the model is also identified through nonlinearities in the functional form. In health economics applications, there is rarely a good justification for exclusion restrictions. Therefore, in practice, $\mathbf{z} = \mathbf{x}$, and identification is based entirely on functional form.

7.3.1 Full-information maximum likelihood and limited-information


maximum likelihood

There are two standard ways to fit the selection model. The full selection
model can be fit by full-information maximum likelihood (FIML). The
likelihood function has one term for the probability that the main
dependent variable is not observed, one term for the probability that it is
observed (this term accounts for the error correlation), and one term for the
positive conditional outcome assuming a normal error. If is an indicator
variable for whether is observed, then the likelihood function is

Heckman (1979) proposed a computationally simpler limited-
information maximum likelihood (LIML) estimator. Using LIML, you can fit
the model in two steps—not to be confused with having two parts. The two
steps of the LIML model can be fit sequentially. First, fit a probit model on
the full sample of whether the outcome $y$ is observed. Second, calculate the inverse Mills ratio, $\hat{\lambda}_i$, which is the ratio of the normal p.d.f. to the normal CDF. Finally, add the estimated inverse Mills ratio, $\hat{\lambda}_i$, as a covariate to the main equation, and run OLS. The main equation is now

$$y_i = \mathbf{x}_i\boldsymbol{\beta} + \rho\sigma\hat{\lambda}_i + e_i$$

If $\rho = 0$, then the inverse Mills ratio drops out of the main equation,
and the formula simplifies to a model without selection. There are several
different definitions of the inverse Mills ratio, leading to different formulas
that are close enough to be confusing. See the Stata FAQ for more
discussion of why seemingly different formulas are actually equivalent.
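A minimal sketch of the two-step calculation, done by hand on the MEPS total expenditure variable purely for illustration (heckman with the twostep option produces the same estimator in one command):

    generate byte observed = (exp_tot > 0)
    probit observed age i.female                         // step 1: selection probit
    predict double zg, xb
    generate double invmills = normalden(zg)/normal(zg)
    regress exp_tot age i.female invmills if observed    // step 2: OLS with the inverse Mills ratio
    * equivalent one-command version
    heckman exp_tot age i.female, select(observed = age i.female) twostep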

Given that both the LIML and FIML estimators are consistent (under the
usual assumptions), the choice between them falls to other considerations.
Although both versions estimate $\rho$, FIML does it directly, while LIML estimates the combined parameter $\rho\sigma$—and $\rho$ can be deduced given an estimate of $\sigma$. FIML sometimes fails to converge (especially if identification
is only through nonlinear functional form), whereas LIML will always
estimate its parameters. In Stata, LIML has a more limited set of
postestimation commands, making it harder to compare with other models.

7.4 Comparison of two-part and generalized tobit models

The two-part and generalized tobit models look similar in many ways, but
they have important differences, strengths, and weaknesses (Leung and
Yu 1996; Manning, Duan, and Rogers 1987) . It is therefore important to
explain the fundamental differences between these models. There is a
long-running debate in the health economics literature about the merits of
the two-part model compared with the selection model (see Jones [2000]
in the Handbook of Health Economics for a summary of the “cake
debates”). The name “cake debates” comes from the title of one of the
original articles comparing these models (Hay and Olsen 1984) . Without
delving into culinary metaphors or arbitrating the past debate directly, we
make several points that focus on the salient statistical features that
distinguish these two models.

First, the generalized tobit and two-part models are generally not
nested models when each is specified parametrically. The many distinct
versions of the two-part model make different assumptions about the first
and second parts of the model. Most versions of the two-part model are not
nested within the generalized tobit model.

Second, the generalized tobit is more general than one specific version
of the two-part model. The generalized tobit, (7.2)–(7.5), with $\rho = 0$ and $\mathbf{z} = \mathbf{x}$, is formally equivalent to a two-part model with a probit first part and normally distributed second part. The generalized tobit with $\rho \neq 0$ is formally a generalization of this specific and restrictive version of the two-part model but is not a generalization of any other version of the two-part model.

Third, even for this case where the generalized tobit model is more
general than the two-part model (a probit first part and a normally
distributed second part), simulation evidence shows that the two-part
model delivers virtually identical average marginal effects, the goal of our
econometric investigation. More generally, Drukker (2017) formally
demonstrates the equivalence of the conditional means even if there is dependence in the data-generating process. Nevertheless, this point is important enough
that we will illustrate it with two examples—one with identification solely
through functional form and one with an identifying excluded variable—in
section 7.4.1.

Fourth, the two-part model can be motivated as a mixture density, which is at least as natural a candidate data-generating process as that implied by the generalized tobit. Thus there is no compelling reason to
view the two-part model as a special case of the generalized tobit; it can be
motivated with a perfectly natural data-generating process that will not be
nested within any generalized tobit model. For more on mixture densities,
see chapter 9.

Fifth, the two-part model has an important practical advantage over the
generalized tobit model. In the two-part model, it is trivially easy to
change the specifications of both the first and second parts to allow for
various error distributions and nonlinear functional forms (for example,
logit or complementary log-log first parts and, more importantly, GLM or
Box–Cox second parts). The different second-part models, discussed at
length in chapters 5 and 6, are often important for dealing with statistical
issues like skewness and heteroskedasticity on the positive values. Such
changes require complex modifications in the generalized tobit, often
leading to models that are not straightforward to estimate. Thus they are
rarely implemented in practice.

Sixth, the standard interpretation of the models is different because of


the original motivation for how to treat the zeros. The generalized tobit
was originally intended to deal with missing values of the dependent
variable, so it treats observed zeros as missing. For health economics, the
standard interpretation of the generalized tobit model would therefore be to
estimate what patients would pay if they had spent money. We are not
aware of any articles that have been motivated by such a research question.
The two-part model assumes that zeros are zeros (not missing values) .

In conclusion, it is best to think of the two-part model (in all of its


possible forms) and the generalized tobit as nonnested models. The point
to note is that, in general, $E(y \mid \mathbf{x})$ in the two-part model depends on how the functions and distributions in (7.1) are specified for the two-part model—whereas the form of $E(y \mid \mathbf{x})$ in the context of the generalized tobit depends on how (7.6) and the associated joint distribution of the errors in those equations are specified. However, if interest is in the latent (uncensored) process, a generalized tobit-type structure is essential. In that context, the parameter $\rho$ plays a substantive role in interpretation of the parameters (Maddala 1983). Otherwise, especially if one is interested
in understanding the conditional mean or marginal effects of covariates on

that mean, the two-part model has greater practical appeal.

7.4.1 Examples that show similarity of marginal effects

The fact that the two-part model returns predictions and marginal effects
that are virtually identical to those of a generalized tobit model—even
when the data-generating process is for a generalized tobit—is so
important and misunderstood that we present two illustrative examples.
Drukker (2017) formally demonstrates this. In the first example, the data
are generated using a generalized tobit data-generating process with jointly
normal errors. There is no exclusion restriction ( ), as is typical in
health economics applications. Without loss of generality, the variance of
the error term for the latent outcome is set equal to one.

Comparing the estimated two-part (twopm in Stata) models and


generalized tobit (heckman ) models shows that the estimated coefficients
are similar in the first equation (first-part and selection equations) but quite
different in the second equation (second-part and main equations). In the
second equations, the parameters on x1 are 0.692 and 0.988. Although
researchers might be tempted to conclude that these results imply that the
models will lead to vastly different predictions of marginal effects, the
marginal effects are in fact nearly identical—as seen in the Stata output:

Although parameter estimates of the second part of the two-part model
do not correspond to those of the generalized tobit data-generating process,
the marginal effect of x1 on y from the two-part model is virtually
identical to those obtained from the generalized tobit model, 28.9:

The second example has an identifying instrumental variable that can
be excluded from the main equation. The data-generating process allows
for a substantial effect of an additional variable, z, in the selection
equation that does not enter the latent outcome equation. When the
selection equation in the generalized tobit (7.2)–(7.5) data-generating
process includes an excluded instrument—even if —the typical
implementation of the two-part model would be overspecified, because it
would include the same set of variables in both the first and second parts.
Nevertheless, the simulation evidence shown in this example again
highlights the flexibility of the two-part model specification.

Again, the estimated coefficients in the two models are similar in the
first equation but different in the second equation (0.821 versus 0.959).

Despite the differences in estimated coefficients, the marginal effect of
x2 on y from the two-part model is again virtually identical to that
obtained from the generalized tobit model, 24.12. We care primarily about
the estimates of the marginal effects on the expected outcomes, not the
parameter estimates themselves.

To summarize the third point, we see this simulation demonstrates that
despite the apparent differences in model assumptions, the two-part model
and the generalized tobit model usually produce similar results when
comparing marginal effects of actual outcomes , which are usually the goal
of econometric modeling in health economics. Now we return to the two-
part model for interpretation and marginal effects.

7.5 Interpretation and marginal effects

7.5.1 Two-part model example

In this example, we use the 2004 Medical Expenditure Panel Survey


(MEPS) data introduced in chapter 3 to estimate the effect of age (age) and
gender (female is a binary indicator for being female) on total healthcare
expenditures (exp_tot). In this two-part model, we use a probit model to
predict the probability of any expenditures and a GLM model with a log
link and gamma family to predict the level of expenditures for those who
have more than zero. The goals are to estimate total expenditures
conditional on the covariates and then to calculate the marginal effect of
age and the incremental effect of gender. To focus on technique, we limit
the covariates to just age and gender and their interaction.

The results below could be computed separately, first by fitting two


models (probit and then either GLM or OLS), but instead we use the twopm
(two-part model) Stata command written by Belotti et al. (2015) . This
allows for easier computation of predicted values and marginal effects
using the postestimation commands.
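A minimal sketch of the estimation command is below; the exact options behind the results reported here may differ slightly:

    * probit first part, GLM (log link, gamma family) second part
    twopm exp_tot c.age##i.female, firstpart(probit) ///
        secondpart(glm, family(gamma) link(log))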

In both parts, the estimated coefficients for age and female are positive
and statistically significant at the one-percent level, while the interaction
term is negative and statistically significant. Both the probability of
spending and the amount of spending conditional on any spending increase
with age but at a slower rate for women. Women are more likely to spend
at least $1 more than men, and, conditional on spending any amount, they
spend more, at least at younger ages. The results for the second part of the
model are the same as in the first simple GLM example in section 5.3.

After we fit both parts of the two-part model with twopm , the
postestimation margins command calculates predictions based on both

parts. The predicted total spending is about $3,696 per person per year,
which is within a few dollars of the actual average ($3,685).
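A minimal sketch of these postestimation calculations (run after the twopm command above):

    margins                              // average predicted total spending
    margins female, at(age=(20(10)80))   // predictions by gender across ages
    marginsplot                          // plot corresponding to figure 7.1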

Figure 7.1: Predicted expenditures, conditional on age and gender (adjusted predictions of the combined two-part model expected values, with 95% CIs, plotted against age for males and females)

The predicted total expenditures, combining both parts of the two-part


model, confirm what the coefficients implied. Expenditures are higher for
women than for men at younger ages but rise faster for men, with the
crossover point being around age 70 (see figure 7.1).

7.5.2 Two-part model marginal effects

This section outlines how to compute marginal and incremental effects in


two-part models, accounting for the full model. We use simple notation
throughout this section to show the structure of the formulas. Formula

details would depend on which specific version of the two-part model is
fit. The main formula is

$$E(y \mid \mathbf{x}) = \Pr(y > 0 \mid \mathbf{x}) \times E(y \mid y > 0, \mathbf{x}) \qquad (7.7)$$

The marginal effect of a continuous variable, $x_k$, on the expected value of $y$ is the full derivative of (7.7). Therefore, the full marginal and incremental effects include both the extensive margin (effect on the probability that $y > 0$) and the intensive margin (effect on the mean of $y$ conditional on $y > 0$). The marginal effect is computed by the chain rule:

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_k} = \frac{\partial \Pr(y > 0 \mid \mathbf{x})}{\partial x_k} \times E(y \mid y > 0, \mathbf{x}) + \Pr(y > 0 \mid \mathbf{x}) \times \frac{\partial E(y \mid y > 0, \mathbf{x})}{\partial x_k}$$

For the case of a probit first-part model and a GLM second-part model (and no interactions or higher-order terms in $x_k$), this is fairly straightforward to compute,

$$\frac{\partial E(y \mid \mathbf{x})}{\partial x_k} = \gamma_k\, \phi(\mathbf{x}\boldsymbol{\gamma}) \exp(\mathbf{x}\boldsymbol{\delta}) + \Phi(\mathbf{x}\boldsymbol{\gamma})\, \delta_k \exp(\mathbf{x}\boldsymbol{\delta})$$

where $\boldsymbol{\gamma}$ is the vector of parameters in the first-stage probit, $\boldsymbol{\delta}$ is the vector of parameters in the second-stage GLM, and $\gamma_k$ and $\delta_k$ refer to the coefficients corresponding to the specific covariate, $x_k$.

If there were interactions with $x_k$ or higher-order terms in $x_k$, then the


derivatives would need to account for the expressions in brackets.

For OLS models with a log-transformed (and corresponding vector of


parameters ), the marginal effects depend on how you deal with
heteroskedasticity. First, we show formulas assuming homoskedasticity.
These calculations can be manipulated to be expressions of either
or . For the probit model, this is

where is Duan’s smearing factor .

For the logit model, this is

In contrast to a marginal effect, where it makes sense to think of a tiny


increase in the value of a continuous variable, for a dummy variable we compute an incremental effect. Consider the dichotomous variable
female. Conceptually, we compute the predicted value of the outcome two
ways, first as if everyone in the sample is female, then as if everyone in the
sample is male (always holding all other covariates at their original levels),
and then take the difference. More generally, for a dichotomous indicator, $d$:

$$E(y \mid \mathbf{x}, d = 1) - E(y \mid \mathbf{x}, d = 0)$$

If the second part of the model is heteroskedastic, the marginal and


incremental effects are more complicated, because the smearing factor
is no longer a scalar. See Manning (1998) and Ai and Norton (2000, 2008)
for methods to deal with heteroskedasticity when retransforming back to
the raw scale.

7.5.3 Two-part model marginal effects example

Continuing the example from section 7.5.1, we now show the marginal (or
incremental) effects of age and gender for the full two-part model,
accounting for the effects of these variables on both parts. After we use the
twopm command, the margins command automatically computes the
unconditional marginal effects, accounting for both parts of the model. The
marginal effect of age averages $123 per year of age, and women spend
more than men by about $798.

Because the graphs showed that the marginal effects vary over the life
course, we computed marginal effects, conditional on four ages (20, 40,
60, and 80). The marginal effect of age for men grows from $40 at age 20
to $383 by age 80; the marginal effect of age for women grows from $56
at age 20 to $231 by age 80. The incremental effect of gender declines,
with women spending on average more than $1,000 more than men at
age 20, but by age 80, the roles have reversed, and men outspend women
by more than $1,000.
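A minimal sketch of these calculations (run after the twopm command from section 7.5.1):

    margins, dydx(age female)                          // average marginal and incremental effects
    margins female, dydx(age) at(age=(20 40 60 80))    // marginal effect of age, by gender and age
    margins, dydx(female) at(age=(20 40 60 80))        // incremental effect of gender, by age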

7.5.4 Generalized tobit interpretation

There are three standard ways to interpret the results from the generalized
tobit model. The first way focuses on what happens to the expected latent
outcome (denoted here $y^*$). Latent outcomes assume that the dependent variable is missing (not zero) for part of the sample and that the selection equation adequately adjusts for the nonrandom selection. The expected value of the latent outcome is $E(y^* \mid \mathbf{x}) = \mathbf{x}\boldsymbol{\beta}$, using the same notation as in
section 7.3. The first interpretation is easy to read from the regression
output table but not relevant for answering research questions in health
economics, where we typically care about predictions of actual
expenditures .

The other two interpretations for the generalized tobit are more
challenging to calculate. The second interpretation focuses on the
characteristics of the actual outcome and is therefore comparable with the
results from a two-part model (Duan et al. 1984; Poirier and
Ruud 1981; Dow and Norton 2003) . In this case,

The third interpretation is the expected value of the observed outcome


conditional on observing the dependent variable and is

Clearly, these three different expected values will differ in magnitude,


as will the associated marginal effects. The marginal effect of a covariate
on the latent outcome, $\partial E(y^* \mid \mathbf{x})/\partial x_k$, is simply $\beta_k$. However, the marginal effect of a covariate on the actual outcome, $\partial E(y \mid \mathbf{x})/\partial x_k$, is complicated. If the outcome equation is linear, with no logs or retransformation issues, then

If instead the main outcome is estimated as
and the error term is normal and
homoskedastic, then

Any other model specification would require different specific formulas.

7.5.5 Generalized tobit example

Although we wanted to directly compare the results from the two-part


model on total healthcare expenditures with the results from the
generalized tobit (or Heckman) selection model, it is not possible. For
health expenditures in the MEPS data, FIML fails to converge . The FIML
estimator is highly sensitive to departures from joint normality , and the
positive values of total healthcare expenditures are not close to normal.
Having a highly skewed distribution of the positive values is a common
issue in health economics. Although we can fit the model with a logged
dependent variable, it makes comparisons with the two-part model much
harder, because expressing marginal effects on the raw scale cannot be
done automatically in Stata. LIML estimates are easy to obtain, but the
postestimation commands that compute marginal effects do not work in
Stata for that case.

Therefore, to make comparisons easier across the two-part, FIML, and


LIML models, we changed the example to analyze dental expenditures . We
use the MEPS data introduced in chapter 3 to estimate the effect of age (age)
and gender (female is a binary indicator of being female) on dental health
expenditures, including the many people with zero expenditures.
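A minimal sketch of the three estimators compared below is the following; exp_dent is our assumed name for the dental expenditure variable, and the covariate specification is illustrative:

    generate byte any_dent = (exp_dent > 0)
    * two-part model: probit + GLM with log link and gamma family
    twopm exp_dent c.age##i.female, firstpart(probit) ///
        secondpart(glm, family(gamma) link(log))
    estimates store twopart
    * generalized tobit by FIML
    heckman exp_dent c.age##i.female, select(any_dent = c.age##i.female)
    estimates store fiml
    * generalized tobit by LIML (two-step)
    heckman exp_dent c.age##i.female, select(any_dent = c.age##i.female) twostep
    estimates store liml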

The results for the three models of dental expenditures have nearly
identical coefficients in the first equation (probit), which is not surprising.
The coefficients in the second equation are different, especially those from

the two-part model, because there we have modeled the conditional mean
to be an exponential function of the linear index. The coefficients obtained
by FIML and LIML are also quite different from each other, partly because
the Mills ratio is both large and imprecisely estimated. However, as the
results below show, the marginal effects are similar across all three
models.

We restore the two-part model results to use margins . Overall,
average dental expenditures are , according to the two-part model
results.

Women spend more than men on average over all ages by almost $32.
Dental expenditures increase on average by about $2.87 per year.

The marginal effect of age is higher for men than for women at all
ages.

For comparison with the two-part model, we must use the formulas for
actual expenditures with the FIML results. It is important to use the
predict(yexpected) option to calculate predictions for actual
expenditures, not latent expenditures—otherwise the results are not
directly comparable. Again, predicted actual expenditures for the two-part
model and generalized tobit are quite close, certainly well within
confidence intervals, even with vastly different estimated coefficients.

The FIML-estimated marginal effects are also quite similar to those for
the two-part model.

In sharp contrast to the actual outcomes, the results from FIML can also
be used to compute latent outcomes , which is the default Stata option.
Because about 63% of the sample has zero dental expenditures, if instead
they all spent an average amount, then the total would of course more than
double. That is exactly what is shown.
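A minimal sketch of the predictions being contrasted here (run immediately after the FIML heckman fit; the statistic names are Stata's heckman postestimation options):

    margins, predict(yexpected)   // E(y) with zeros included; comparable to twopm
    margins, predict(ycond)       // E(y | y > 0)
    margins, predict(xb)          // linear index for the latent outcome (Stata's default)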

In summary, if you want to estimate actual outcomes and marginal
effects on actual outcomes (as opposed to latent outcomes ), the FIML
selection model will typically yield similar results to the two-part model.
However, in practice, researchers fit two-part models because the results
are easier to manipulate, both for the total effect and for the extensive and
intensive margins.

7.6 Single-index models that accommodate zeros

In this section, we briefly describe some single-index models that allow for
a mass of zeros in the distribution of the outcome but not in particularly
flexible ways. We describe these models because they have been used in
the literature, but we cannot recommend their use in research.

7.6.1 The tobit model

The tobit model, named after economist James Tobin, is like a mermaid or
centaur ; it is half one thing and half another. Tobin (1958) was the first to
model dependent variables with a large fraction of zeros. Specifically, the
tobit model combines the probit model with OLS, both in the way the
model is fit and in how it is interpreted. For a recent summary of the tobit
model, see Enami and Mullahy (2009) .

The classic tobit model is appropriate when the dependent variable has
two properties:

it has a normal distribution (this is a strong assumption); and

negative values are censored at zero .

Censored means that the actual value is known to be beyond a


threshold value, or less than zero in this case. Censoring is different from
truncation , in which there is no information about the actual value—so it
is missing . The classic tobit model is rarely, if ever, appropriate for
modeling healthcare expenditures, because zero expenditures are not
censored negative values—but instead are actual values.

The tobit model assumes that $y^*$ is a continuous, semiobserved (censored), normally distributed, underlying latent dependent variable. Semiobserved means that some values of $y^*$ are observed, and other values are known only to be in a range. The tobit model fits the relationship between covariates and the latent variable, $y^*$.

Specifically, the classic tobit model assumes that the latent variable, $y^*$, can be negative—but that when $y^*$ is negative, the observed value, $y$, is zero.

The values equal to zero are censored , because they are recoded from
a true negative value to zero. (If instead those observations were left out of
the sample, they would be truncated , which is a selection problem.)

The tobit likelihood function has two pieces. There is the probability that observed $y$ equals zero, and the probability that $y$ equals some positive value. If $y^*$ has a normal distribution with variance $\sigma^2$, and if $d$ is an indicator variable for whether $y$ is positive, then the likelihood function is written as part normal CDF and part normal p.d.f.:

$$L = \prod_{i=1}^{N} \left\{1 - \Phi\!\left(\frac{\mathbf{x}_i\boldsymbol{\beta}}{\sigma}\right)\right\}^{1-d_i} \left\{\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \mathbf{x}_i\boldsymbol{\beta}}{\sigma}\right)\right\}^{d_i}$$

The likelihood function is written in terms of the error term, specifically, a standard normal error term (mean 0 and variance 1). The $1/\sigma$ in the p.d.f. is the Jacobian term, the result of a normalization needed when doing a linear transformation of $y$ to a unit normal variable. The derivative of the normal CDF with respect to the error term is the normal p.d.f. multiplied by the inverse of the error's standard deviation:

$$\frac{\partial}{\partial y}\,\Phi\!\left(\frac{y - \mathbf{x}\boldsymbol{\beta}}{\sigma}\right) = \frac{1}{\sigma}\,\phi\!\left(\frac{y - \mathbf{x}\boldsymbol{\beta}}{\sigma}\right)$$

The number of parameters is the same as OLS and one more than for
probit. There is one set of $\beta$'s (including the constant) and one $\sigma$. There is
no estimated parameter for the censoring point (zero in this case), because
this threshold is not estimated; it is determined by the data.

How does the tobit model differ from the probit model ? The tobit
model fits one more parameter than probit. The tobit model has a
continuous part and a discrete part. The interpretation of the constant term
is quite different—for tobit, it has the interpretation of an OLS intercept,

and for probit, it has the interpretation of the probability of outcome A for
the base case observation.

The tobit model is extremely sensitive to its underlying assumptions of


normality and homoskedasticity (Hurd 1979; Goldberger 1981) .

7.6.2 Why tobit is used sparingly

The tobit model should be used with great caution, if at all. The
assumptions underlying the model are numerous and rarely true. The tobit
model should never be fit unless the data are truly normal and censored.
Here are the top four reasons to avoid the tobit model :

1. The tobit model assumes that the data are censored at zero , instead of
actually being zero. Too often, researchers with health expenditure
data claim that a large mass at zero are censored observations when
they are not censored.

2. The tobit model assumes that the error term has a normal distribution
but is inconsistent even if there are minor departures from normality
and homoskedasticity (Hurd 1979; Goldberger 1981) .

3. The tobit model assumes that the error term when is positive is
truncated normal, with the truncation point at zero. This is rarely true.

4. The tobit model assumes that the same parameters govern both parts
of the likelihood function. There are specification tests that test the
tobit model against the more general Cragg (1971) model that allows
different parameters in the two parts of the model. This test almost
always rejects the null hypothesis that the parameters in both parts are
equal.

In summary, the classic tobit model only applies in the rare cases
where zero values are truly censored. Right-censoring is more common in
real data, and tobit models may work well in those cases.

The tobit model has been used only a few times in the health
economics literature. Holmes and Deb (1998) use a tobit model for data on
health expenditures that are right-censored . The dependent variable they
study is health expenditures for an episode of care. Because they have
claims data for a calendar year, some episodes of care are artificially

censored at the end of December. Cook and Moore (1993) use a tobit to
estimate drinks per week. However, there is no evidence that abstainers are
appropriately modeled as censored.

7.6.3 One-part models

Although two-part models are popular, they are not the only estimation
approach for addressing a large mass at zero. Mullahy (1998) suggested
that researchers not use two-part models—especially those that use the log
transformation in the conditional part—if they are interested in the
expected value of $y$ given observed covariates, $E(y \mid \mathbf{x})$. Using nonlinear least
squares , or some variant of the GLM family , researchers can apply a single
model to all the data to fit the expected value of . Any of the links and
families described in chapter 5 could be used for a one-part model as an
alternative to a two-part model, as long as researchers are interested in the
mean function for , conditional on the covariates —or something that
can be derived from the mean function, such as the marginal or
incremental effect.

Some analysts have worried that some of the distributions used in the
GLM approach do not have zeros in their support. This is a problem if the
models are fit by maximum likelihood estimation (MLE). However, the GLM
approach only uses mean and variance functions. For example, for the
inverse Gaussian (Wald), you cannot use MLE with the zeros, but you can
use GLM with zeros.

Buntin and Zaslavsky (2004) suggest that the choice and the specifics
for each approach depend on the application. They provide an approach to
finding a better-fitting model using a set of diagnostics from both the
literature on risk adjustment and on model selection from the healthcare
expenditure literature. The choice of approach appears to depend on the
fraction of zeros in the data.

Finally, a one-equation alternative to two-part and selection models


that can be fit with OLS yet respects the nonnegativity of the outcome
variable adds a positive constant to the outcome before taking the natural
logarithm. We do not recommend this approach given all the, far superior,
alternatives we have described.

7.7 Statistical tests

All the usual statistical tests for single-equation models apply to the two-
part model. In addition, the modified Hosmer–Lemeshow test applies to
the entire two-part model. This may help identify problems with the model
specification in the combined model. We can apply Pregibon’s link test
and Ramsey’s regression equation specification error test equation by
equation in these models. For the two-part model, there are no
encompassing link or regression equation specification error tests, because
those tests are for single-equation models. They can be extended to
selection models and generalized tobit models, because they are a system
of equations that can be estimated in a single MLE formulation. Pearson
tests and Copas’ tests can apply to all of these models.

7.8 Stata resources

The recently developed twopm command will not only estimate many
different versions of the two-part model—allowing several options for
choice of specific model—but also compute predictions and full marginal
effects, accounting for retransformations, nonnormality, and
heteroskedasticity (Belotti et al. 2015) . Install this package by typing ssc
install twopm . Alternatively, you can fit two-part models in two
separate commands. For example, estimate the first part with either logit
or probit . Commonly used commands for the second part include
regress , boxcox , and glm —always estimated on the subsample of the
data with positive values.

The Stata command for the Heckman selection model is heckman . A


related model, with a binary equation in the second step, can be estimated
with heckprob . Use the tobit command for basic tobit models with upper
and lower censoring, when the censoring points are the same for all
observations. Stata has two commands for generalized versions of the tobit
model. Use cnreg when the upper- or lower-censoring points differ across
observations. Use intreg for data that also may have interval data, in
addition to point data and left- and right-censoring points.
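A minimal sketch of a tobit with left-censoring at zero and the implied marginal effects on the censored outcome (the variable names are hypothetical):

    tobit y x1 x2, ll(0)
    margins, dydx(*) predict(ystar(0, .))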

The treatreg command in Stata fits a model similar to the classic


selection model, but the main outcome is observed for all observations.
Therefore, the selection equation can be thought of as selection into a
treatment or control group. In other words, the treatment variable is a
dummy endogenous variable. The model is similar to two-stage least
squares, except that the errors are modeled as jointly normally distributed.
Identification comes from both instruments and nonlinearities in the
selection equation.

Chapter 8
Count models

8.1 Introduction

In many statistical contexts, including many measures of healthcare use,


the outcome or response variable of interest, , is a nonnegative integer or
count variable. Examples of count outcomes include the number of visits
to the doctor, the number of nights spent in a hospital, the number of
prescriptions filled, and the number of cigarettes smoked per day. Such
variables have distributions that place probability mass at nonnegative
integer values only. In addition, these variables are typically severely
skewed , intrinsically heteroskedastic, and have variances that increase
with the mean. For most count outcomes, the observations are
concentrated on a few small discrete values—typically zero—and a few
small positive integers. Therefore, discrete data density is an important
distinguishing feature of such outcomes.

If a researcher is interested only in the prediction of the conditional


mean or in the response of the conditional mean to a covariate, it may be
tempting to ignore the discreteness and skewness and simply estimate the
responses of interest using linear regression methods (see chapter 4). If
skewness is a concern, but not discreteness, a researcher might consider
the use of generalized linear models (GLMs) (see chapter 5). If discreteness
and skewness are both important features of the distribution of the count
outcome, then models that ignore discreteness can be quite inefficient,
leading to substantial losses in statistical power. In addition, models that
ignore discreteness may display considerably greater sample-to-sample
variability of estimates than count models that account for the discreteness
of the outcomes.

Consider a data-generating process in a regression context with a linear


index, an exponential conditional mean, and a Poisson distribution. If this
process leads to a distribution that is skewed and concentrated on
relatively few integer values, then King (1988) demonstrated that ordinary
least squares (OLS) can be grossly misspecified and produce inconsistent
estimates. In addition, the OLS model on a log-transformed dependent
variable (with an adjustment to account for the logarithm of zero) also
produces inconsistent estimates. Figure 8.1 below illustrates such an
outcome. It represents observations drawn from a Poisson distribution with
a mean of 0.5. Over 90% of the values are concentrated on 0 and 1 and the
distribution is distinctly skewed.

Figure 8.1: Poisson density with mean 0.5 (histogram of the percentage of observations at each count, 0–4)

King (1988) argues that in such cases, the conditional expectation


function in a count-data process cannot be linear, or even necessarily
approximately linear, because predictions must be nonnegative. A
regression of $\ln(y + c)$ on a vector of covariates $\mathbf{x}$—where $c$ is a small positive constant—resolves the issue of negative predictions, but King (1988) shows that results can be quite sensitive to the choice of $c$. In
addition, Monte Carlo experiments show that the OLS estimates of such a
specification are biased in both small and large samples. Furthermore, the
efficiency losses are large; the Poisson MLE is 3.03 to 14.19 times more
efficient than the OLS estimator of the logged, adjusted outcome.

Equally important is the consideration that in the case of discrete data,


substantive interest may lie in the estimation of event probabilities. For
example, researchers may wish to estimate the probability that the count
equals 0, that the count is greater than 10, or that the count is greater than 2
but less than 6. There may be interest in the response of specific parts of
the distribution to changes in covariates. More generally, the researcher
may be interested in fitting the distribution of the event counts and
examining responses of the distribution to changes in covariates, not
simply in features of the conditional mean. In these situations, it is
essential to formally estimate the count-data process.

Leaving aside the objective of estimating event probabilities and
distributions for a moment, it is important to recognize that not all count
data densities are skewed, nor is the mass concentrated on a few values in
all cases—although such cases will be rare in the healthcare context. In
such cases, it may well be appropriate to use methods designed for
continuous outcomes. Consider the density of a random variable drawn
from a Poisson distribution with a mean of five. The distribution of
observations shown in figure 8.2 is relatively symmetric, so simpler
models may be acceptable. Indeed, King (1988) notes that when the mean count is large
for all or nearly all observations, “it would be possible to analyze this sort
of data by linear least-squares techniques”.

Figure 8.2: Poisson density with mean 5 (histogram of the percentage of observations at each count, 0 to 10+)

In terms of empirical regularities, it is useful to note that most of the


measures of healthcare use in the 2004 Medical Expenditure Panel Survey
(MEPS) dataset have probability mass concentrated on a few values and are
severely skewed. We display the distributions of office visits and
emergency room (ER) use in figure 8.3.

Figure 8.3: Empirical frequencies (histograms of the percentage of observations by number of office-based provider visits and by number of ER visits)

Regression models for count data are comprehensively described in


Cameron and Trivedi (2013) , Hardin and Hilbe (2012) , and
Winkelmann (2008) , among others. This chapter complements the
material in those books and is not intended to be exhaustive. In what
follows, we describe models, methods, and empirical strategies based on
maximum likelihood estimation (MLE) of a few classes of count-data
regression models that tend to fit measures of healthcare use well.

We begin our discussion of regression models for count data with the
Poisson regression model (section 8.2). It is the canonical regression
model for count data and should be the starting point of any analysis. We
discuss estimation, interpretation of coefficients, and partial effects in
some detail. The Poisson distribution is a member of the linear exponential
family (LEF) . Therefore, like the linear regression and GLMs, the Poisson
regression has a powerful robustness property: its parameters are
consistently estimated as long as the conditional mean is specified
correctly, even if the true data-generating process is not Poisson. However,
this robustness comes at an efficiency cost (Cameron and Trivedi 2013) .

In section 8.3, we discuss the negative binomial regression model,


which is the canonical model for overdispersed count data. We contrast
results obtained from negative binomial regressions to those obtained from
Poisson regressions. The negative binomial regression model relaxes a
restrictive property of the Poisson regression and thus can be substantially
more efficient. Unfortunately, it does not generally inherit the robustness
property of the Poisson, so there is a tension between consistency under
general conditions and efficiency of the estimates.

Count outcomes in health and healthcare, although overdispersed, do
not necessarily conform to the properties of the negative binomial model.
They often have even more zeros than predicted by negative binomial
models. Therefore, in subsequent sections, we discuss hurdle and zero-
inflated models that allow for excess zeros. We also briefly describe
models for truncated and censored counts in section 8.5. We end this
chapter with section 8.6, which describes approaches for model selection
and demonstrates them via extensive examples.

8.2 Poisson regression

The Poisson density is the starting point for count-data analysis. The basic
principles of estimation, interpretation, and prediction flow through
naturally to more complex models.

8.2.1 Poisson MLE

Consider a random variable $Y$ that takes on values $y = 0, 1, 2, \ldots$ when measured over a fixed amount of time, $t$. In this case, the Poisson density (more specifically, the probability mass function) is

$$\Pr(Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \qquad (8.1)$$

where $\mu$ is a parameter often referred to as the intensity or rate parameter. The rate parameter is also the mean of the Poisson distributed random variable; that is, $E(Y) = \mu$. In fact, the Poisson distribution has a special property that its mean is equal to its variance; therefore, $E(Y) = \text{Var}(Y) = \mu$.

The Poisson distribution can be generated by a series of point (no


duration of their own) events where the time between events (interarrival
time) follows an exponential distribution . The exponential distribution has
the property that the arrival times of events are independent of the time
since the last event. The Poisson distribution inherits this property, so each
event is independent of the prior event counts.

The Poisson regression model is derived from the Poisson distribution.


Now, the rate parameter is not a constant, so it is denoted by $\mu_i$. It is used to parameterize the relation between $y_i$ and a vector of covariates (regressors), $\mathbf{x}_i$. The standard assumption is to use the exponential mean parameterization,

$$\mu_i = E(y_i \mid \mathbf{x}_i) = \exp(\mathbf{x}_i\boldsymbol{\beta}), \qquad i = 1, \ldots, N \qquad (8.2)$$

where $\boldsymbol{\beta}$ is a vector of unknown coefficients. The exponential mean specification has the major mathematical convenience of naturally bounding $\mu_i$ to be positive. Because the variance of a Poisson random variable equals its mean, $\text{Var}(y_i \mid \mathbf{x}_i) = E(y_i \mid \mathbf{x}_i) = \mu_i$, the Poisson regression is intrinsically heteroskedastic.

The Poisson regression model is typically fit using MLE. Given (8.1) and (8.2) and the assumption that the observations $(y_i, \mathbf{x}_i)$ are independent over $i$, the log-likelihood function for a sample of $N$ observations can be written as

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left\{ y_i \mathbf{x}_i\boldsymbol{\beta} - \exp(\mathbf{x}_i\boldsymbol{\beta}) - \ln(y_i!) \right\}$$

The first and second derivatives of the log-likelihood function, with respect to the parameters $\boldsymbol{\beta}$, can be derived as

$$\frac{\partial \ln L}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{N} \left\{ y_i - \exp(\mathbf{x}_i\boldsymbol{\beta}) \right\} \mathbf{x}_i'$$

and

$$\frac{\partial^2 \ln L}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}'} = -\sum_{i=1}^{N} \exp(\mathbf{x}_i\boldsymbol{\beta})\, \mathbf{x}_i'\mathbf{x}_i$$

Therefore, the Poisson MLE is the solution to $k$ (the number of parameters to be estimated) nonlinear equations corresponding to the first-order conditions for the MLE,

$$\sum_{i=1}^{N} \left\{ y_i - \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}}) \right\} \mathbf{x}_i' = \mathbf{0} \qquad (8.3)$$

If $\mathbf{x}_i$ includes a constant term, then the residuals $y_i - \exp(\mathbf{x}_i\hat{\boldsymbol{\beta}})$ sum to zero by (8.3).

The log-likelihood function is globally concave; hence, solving these
equations by Gauss–Newton or Newton–Raphson iterative algorithms
yields unique parameter estimates. By maximum likelihood theory, the
estimated parameters are consistent and asymptotically normal with
covariance matrix

$V_{\text{ML}}(\hat{\beta}) = \left( \sum_{i=1}^{N} \mu_i x_i x_i' \right)^{-1}$   (8.4)

8.2.2 Robustness of the Poisson regression

Recall that the Poisson distribution is a member of the LEF of distributions.


Therefore, the first-order conditions for the Poisson regression model MLE
can be obtained from an objective function that specifies only the first
moment, without specification of the distribution of the data. More
precisely, it can be obtained via a GLM objective function (McCullagh and
Nelder 1989) or via a pseudolikelihood (Gourieroux, Monfort, and
Trognon 1984a,b) . This is clear from inspection of the first-order
conditions in (8.3), because the left-hand side of (8.3) will have an
expected value of zero as long as $E(y_i \mid x_i) = \exp(x_i'\beta)$. Therefore, parameter
estimates from the Poisson regression are consistent under the relatively
weak assumption that the conditional mean is correctly specified. The
data-generating process need not be Poisson at all. Consequently, Poisson
regression is a powerful tool for analyzing count data.

However, the standard errors of the estimates obtained by MLE are
incorrect if the data-generating process is not actually Poisson. The correct
formula under the weaker assumption is

$V_{\text{R}}(\hat{\beta}) = \left( \sum_{i=1}^{N} \mu_i x_i x_i' \right)^{-1} \left( \sum_{i=1}^{N} \omega_i x_i x_i' \right) \left( \sum_{i=1}^{N} \mu_i x_i x_i' \right)^{-1}$   (8.5)

where $\omega_i = \text{Var}(y_i \mid x_i)$. This formula generally produces more conservative
inference than the formula based on the MLE. Therefore, it is common
practice, when estimating Poisson regressions, to implement the robust
sandwich estimator of the variance (8.5) because it is appropriate under
more general conditions than the maximum likelihood-based formula
(8.4).

More substantively—from the point of view of an applied researcher—


although we can obtain consistent estimates under weak assumptions,
estimates from the Poisson regression may be grossly inefficient if the
data-generating process is not Poisson. In addition, predicted probabilities
and predictions of effects can be quite misleading, as we demonstrate
below in the context of our data from MEPS.

8.2.3 Interpretation

The exponential mean specification of the Poisson regression model has


implications for interpreting the parameters of the model. As with all
exponential mean models, the coefficients themselves have a natural
interpretation as semielasticities with respect to the variables (or elasticity
if the variable itself is measured in logarithms). More precisely, because
$E(y \mid x) = \exp(x'\beta)$,

$\dfrac{\partial \ln E(y \mid x)}{\partial x_j} = \beta_j$

where the scalar $x_j$ denotes the $j$th regressor. To demonstrate the
interpretation of results from the Stata output, we display the Stata log
from a Poisson regression of office-based visits (use_off) on a simple
specification of covariates that includes one continuous variable (age) and
one binary indicator (female). Note that an increase in age by 1 year leads
to 2.5% more visits (100 times the coefficient on age). In addition, women
have about 50% more visits than men (derived from the point estimate on
1.female using $100\{\exp(\hat{\beta}) - 1\}$).
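The regression behind these numbers can be sketched as follows; the nolog option is an illustrative choice, and use_off, age, and female are the variable names given above.

* Poisson regression of office-based visits on age and an indicator for female
poisson use_off c.age i.female, nolog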

Although the coefficients themselves have an intuitive interpretation, it
is often desirable to calculate partial effects of covariates on the expected
outcome, as opposed to the effects on the logarithm of the expected
outcome. Marginal effects in exponential mean models are also relatively
easy to compute and have a simple mathematical form. Mathematically,
for any model with an exponential conditional mean, differentiation yields

$\dfrac{\partial E(y \mid x)}{\partial x_j} = \beta_j \exp(x'\beta)$

In such models—and thus in the specific case of the Poisson regression—
these are the marginal effects for continuous variables, and they depend on the
values of each of the covariates in the model via $\exp(x'\beta)$. For binary
variables, it is preferable to use the discrete difference or the incremental
effect

$E(y \mid x, d = 1) - E(y \mid x, d = 0) = \exp(x'\beta + \gamma) - \exp(x'\beta)$

where $d$ denotes the binary variable and $\gamma$ its coefficient; this also depends
on the values of each of the covariates in the model.

Thus the marginal or incremental effects depend on the values of the

covariates at which the derivative or difference is evaluated. In other
words, the partial effects vary by observed characteristics, rather than
being constants.

There is no preferred approach for which values of covariates to use;


instead, it depends on the substantive question at hand. If looking for
population effects or welfare comparisons, researchers should calculate the
sample average marginal or average incremental effect. These are sample
averages of individual-level treatment effects, as described in chapter 2.
The researcher may also consider calculating averages of marginal and
incremental effects over relevant subsamples of the data, for example, the
sample of the treated to obtain effects on the treated or the sample of
untreated observations to obtain the effect on the untreated . In other cases,
if researchers are interested in reporting a typical effect, they often
calculate marginal effects at the sample means of each of the covariates.
Incremental and marginal effects may also be calculated using other
sample moments, for each of the covariates—for example, medians or
subsample means. In addition, it may also be insightful to calculate the
marginal effects for a hypothetical individual with a particular set of
characteristics (covariates) of particular policy interest that the researcher
chooses a priori.

As an example, we estimate the partial effects of age and of being


female on the number of office-based visits using the Poisson regression
results shown above. The results, obtained using the margins command,
show that the sample average incremental effect of being female is 2.27
visits, while the sample average marginal effect of an increase in a year of
age is 0.14 visits.
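A sketch of the margins call that produces these sample average effects, assuming the Poisson fit above is the active set of estimates:

* sample average marginal effect of age and incremental effect of female
margins, dydx(age female)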

When partial effects are evaluated at the means of the covariates using
the at((mean) _all) option in margins , the incremental effect of female
drops, in magnitude, to 2.07, while the marginal effect of age decreases
slightly to 0.13.
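A sketch of the corresponding call evaluated at the covariate means:

* partial effects evaluated at the sample means of all covariates
margins, dydx(age female) at((mean) _all)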

The choice of which partial effect to report is important. Sometimes,


the differences between effect sizes can be quite dramatic. The difference
between the two estimates depends on the range and distribution of the
covariate and on the estimate of the associated parameter.

8.2.4 Is Poisson too restrictive?

Recall that the Poisson distribution is parameterized in terms of a single
scalar parameter ($\mu$), so all moments of $y$ are functions of $\mu$. In fact, both
the mean and variance of a Poisson random variable are $\mu$. In spite of this
seemingly draconian restriction, we have shown that parameter estimates
from the Poisson regression are consistent, even when the data-generating
process is not Poisson—that is, even when this equality property does not hold.

As we have seen in the MEPS dataset, empirical distributions of


healthcare use are overdispersed relative to the Poisson—that is, the
variance exceeds the mean. Overdispersion leads to deflated standard
errors and inflated test statistics in the usual maximum likelihood output.
Ignoring the overdispersion will lead to a false sense of precision, and the
greater the discrepancy between the variance and the mean, the greater the
risk. However, this issue can be remedied with robust standard
errors estimated using “sandwich” estimators of the variance–covariance
matrix of parameters. Note that obtaining correct standard errors does not
render the estimates efficient. Poisson MLE is still inefficient and can be
grossly so.

The specification below estimates a sandwich variance–covariance


matrix of parameter estimates and reports robust standard errors in the case
of the MEPS data for the count of office visits. Comparing the output below
to the estimates obtained without vce(robust) shows that the standard
errors of the coefficients are approximately four times larger using the
vce(robust) option. This example demonstrates the inefficiency of the
Poisson estimator for such counts and highlights the importance of using
robust standard errors for inference if Poisson is the desired estimator.
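A sketch of the robust-variance fit, using the same simple covariate list as before:

* Poisson regression with sandwich (robust) standard errors
poisson use_off c.age i.female, vce(robust) nolog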

Finally, even though parameter estimates are consistent, estimates of

marginal and incremental effects and event probabilities can be
inconsistent. For example, the Poisson density often underpredicts event
probabilities in both tails of the distribution. We first reestimate a Poisson
regression for office-based visits and calculate the observed and predicted
probabilities using the Stata code shown below. The predicted density is
calculated for each value of the count variable (up to a maximum value
based on the empirical frequency for each outcome) and for each
observation (that is, for different values of covariates). Then, the predicted
frequencies are averaged to obtain a single measure of the average
predicted density for each count value.
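The calculation can be sketched as follows. The loop evaluates the Poisson probability mass function at each count value for every observation; averaging each pr`j' over the sample gives the predicted frequency for count j. The cap of 20 visits and the short covariate list are illustrative.

* empirical frequencies of the observed counts
tabulate use_off if use_off <= 20
* average predicted Poisson probabilities for counts 0-20
quietly poisson use_off c.age i.female
predict double muhat, n
forvalues j = 0/20 {
    gen double pr`j' = exp(-muhat)*muhat^`j'/exp(lnfactorial(`j'))
}
summarize pr0-pr20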

We also calculate observed and predicted probabilities for a Poisson


model of the count of ER visits in an analogous fashion. The distributions
are displayed in figure 8.4. The light (open) bars depict the empirical
density, that is, the frequency of observations in each count cell. The dark
bars depict the predicted frequencies. The figure highlights the extent to
which event probabilities of tail events are underpredicted, especially the
zeros ; consequently, events in the center of the distribution are
overpredicted for the model of office-based visits. The Poisson assumption
appears a better fit for ER use.

[Figure: two-panel histograms of empirical (open bars) and Poisson-predicted (dark bars) frequencies by count value, for the number of office-based provider visits (left) and the number of ER visits (right).]

Figure 8.4: Empirical and Poisson-predicted frequencies

8.3 Negative binomial models

The negative binomial regression model is arguably the most popular


model for count data that accommodates overdispersion. It is often
justified as a logical extension of the Poisson regression in which
overdispersion (relative to the Poisson) is caused by unobserved
heterogeneity. Consider an unobserved random term $\nu_i$ in the conditional
mean specification in the Poisson regression—that is, $E(y_i \mid x_i, \nu_i) = \exp(x_i'\beta)\nu_i$.
In the context of models of health and healthcare use, it is not hard to
justify $\nu_i$ via the existence of unobserved differences in health status or
differences in tastes. The former is especially appealing in the absence of
rich specifications of health status—or the observation that many chronic
conditions that affect utilization are relatively rare, and their severity is
rarely measured. Integrating $\nu_i$ out of the distribution leads to the
negative binomial distribution. This, and other derivations of the negative
binomial distribution, are given in Cameron and Trivedi (2013).

The negative binomial density for a count outcome $y_i$ is

$f(y_i \mid x_i) = \dfrac{\Gamma(y_i + \alpha^{-1})}{\Gamma(\alpha^{-1})\,\Gamma(y_i + 1)} \left( \dfrac{\alpha^{-1}}{\alpha^{-1} + \mu_i} \right)^{\alpha^{-1}} \left( \dfrac{\mu_i}{\alpha^{-1} + \mu_i} \right)^{y_i}$   (8.6)

where $\Gamma(\cdot)$ denotes the gamma function that simplifies to a factorial for an
integer argument, $\alpha$ is an additional parameter, and $\mu_i$ has the same
interpretation as in the Poisson model.

The first two moments of the negative binomial distribution are

$E(y_i \mid x_i) = \mu_i, \qquad \text{Var}(y_i \mid x_i) = \mu_i + \alpha\mu_i^2$   (8.7)

An appealing property of this parameterization is that the conditional mean
of the negative binomial regression is exactly the same as that in the
Poisson regression. However, the variance exceeds the mean. Thus the
negative binomial distribution introduces a greater proportion of zeros and
a thicker right tail. Figure 8.5 displays histograms of Poisson and negative
binomial densities with means of two. A researcher can visually observe
that the negative binomial density is overdispersed relative to the Poisson
and has considerably larger fractions of zeros and “large” (greater than 10)
values.

[Figure: histogram comparing a Poisson density and a negative binomial density, each with mean 2; the negative binomial places noticeably more mass at zero and at counts of 10 or more.]

Figure 8.5: Negative binomial density

Two standard variants of the negative binomial are used in regression


applications. Both variants specify conditional means using
$\mu_i = \exp(x_i'\beta)$. The most common variant specifies the conditional
variance of $y_i$ as $\mu_i + \alpha\mu_i^2$ [from (8.7)], which is quadratic in the mean.
The other variant of the negative binomial model specifies the variance
as $(1 + \delta)\mu_i$, which is linear in the mean. This specification is derived by
replacing $\alpha$ with $\delta/\mu_i$ throughout (8.6). In Cameron and Trivedi (2013)
and elsewhere, this formulation is called the negative binomial-1 (NB1)
model, while the formulation with the quadratic variance function is called
the negative binomial-2 (NB2) model.

The negative binomial distribution is not a member of the LEF, so it is
sensitive to misspecification. Unlike with the Poisson distribution, one
must be sure that the data-generating process is a negative binomial to
ensure that the parameter estimates are consistent. However, Cameron and
Trivedi (2013) show that a negative binomial regression model with a
fixed value of $\alpha$ (or $\delta$) has a distribution in the LEF and hence is robust to
misspecification of higher moments. Because of this property, negative
binomial regression estimates are quite reliable in practice.

8.3.1 Examples of negative binomial models

The examples below show the results of NB1 and NB2 models fit for the
count of the number of office-based visits. The NB2 regression is the
default specification of the nbreg command in Stata (or the
dispersion(mean) option) , so we fit that model first. Parameter estimates
from negative binomial regressions have a semielasticity interpretation.
We see that the effect of an additional year in age is associated with a
2.8% increase in office visits. Women have 52% more visits than men.
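A sketch of the NB2 fit and the follow-up margins call discussed next; dispersion(mean) is the default and is shown only for emphasis.

* NB2 negative binomial regression of office-based visits
nbreg use_off c.age i.female, dispersion(mean) nolog
* sample average partial effects
margins, dydx(age female)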

As always, it is useful to compute effect sizes on the natural scale, so
we use margins to calculate sample average marginal and incremental
effects. The results show that individuals who are a year older have 0.16
more visits on average. Women have 2.9 more visits than men.

Next, we estimate the NB1 regression, which requires the


dispersion(constant) option. Parameter estimates are below. Note that
Stata reports a parameter, alpha, which is the value of exponentiated
/lnalpha in the case of the NB2 regression. In the case of the NB1
regression, Stata reports a parameter, delta, which is the value of the
exponentiated /lndelta. The coefficient estimates on age are roughly
similar across NB2 and NB1 specifications, while the coefficient on female
is smaller when fit using the NB1 model.
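A sketch of the NB1 fit:

* NB1 negative binomial regression (variance linear in the mean)
nbreg use_off c.age i.female, dispersion(constant) nolog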

Estimates of the sample average partial effects reveal that both the
marginal effect of age (0.13) and the incremental effect of female (2.32)
are smaller when fit using the NB1 model.

The NB1 and NB2 models are not nested models—so in principle, a
researcher should use tests to discriminate among nonnested models such
as Vuong’s (1989 ) test or model-selection criteria such as the Akaike
information criterion (AIC) or the Bayesian information criterion (BIC) (see
chapter 2). But because the NB1 and NB2 models have the same number of
parameters, most nonnested tests and criteria simplify to a comparison of
maximized log likelihoods. The value of the log likelihoods suggests that
NB1 fits better than NB2 for this particular dataset and model specification.

As we did with the Poisson regressions, we calculate empirical


frequencies of each count value and the associated predicted frequencies
from NB2 regressions for office-based visits and ER use. The histograms of
actual and predicted count frequencies are shown in figure 8.6. The left
panel demonstrates the dramatic improvement in fit of the NB2 regression
relative to Poisson for office-based provider visits (compare with the left
panel of figure 8.4). The improvement in fit for ER visits is not as dramatic
but still noticeable.

[Figure: two-panel histograms of empirical and NB2-predicted frequencies by count value, for the number of office-based provider visits (left) and the number of ER visits (right).]

Figure 8.6: Empirical and NB2 predicted frequencies

One way to improve the fit of the negative binomial model even
further involves a parameterization of $\alpha$ in terms of a linear combination
of a set of covariates. However, although this extended model is more
flexible in principle, the parameters of such models can be difficult to
identify in finite samples. Instead, in the following sections, we describe
two types of models that add flexibility to the basic Poisson or negative
binomial specifications, have considerable intuitive appeal, and often fit
the counts of healthcare use quite well.

8.4 Hurdle and zero-inflated count models

As mentioned above, many counts of health and healthcare use have more
zeros than predicted by Poisson or negative binomial models. Thus the
first extensions we consider to the Poisson and negative binomial
regressions are their hurdle and zero-inflated extensions. Each of these
models adds flexibility by relaxing the assumption that the zeros and the
positives come from the same data-generating process. Each naturally
generates excess zeros and a thicker right tail relative to the parent
distributions, but they are also capable of generating fewer zeros and
thinner tails.

8.4.1 Hurdle count models

The hurdle count model can have the same conceptual justification as is
often used to justify the two-part model —that it reflects a two-part
decision-making process (see also chapter 7). One motivation is based on a
principal-agent mechanism. First, the principal decides whether to use the
medical care or not. Then—conditional on making the decision to use care
—the agent, on behalf of the principal, makes a second decision about how
much care to consume. More specifically, the patient initiates the first visit
to a doctor, but the doctor and patient jointly determine the second and
subsequent visits (Pohlmeier and Ulrich 1995) . Alternatively, the two-step
process could be thought of as driven by transaction costs of entry into the
market, which do not exist once the individual is engaged in the receipt of
healthcare services. A richer formulation of the principal-agent mechanism
models the fact that bouts of illness arise during the course of the year.
Some factors may have a differential effect on whether these episodes of
illness become episodes of treatment—for example, the opportunity to
visit one’s family physician rather than having to go to an ER (Keeler and
Rolph 1988) .

However, such justifications are not required for the hurdle count
model to be an appealing extension to the standard Poisson and negative
binomial model. Instead, it is enough to acknowledge that there may be
substantial heterogeneity at the threshold of the count variable between use
and nonuse.

In the hurdle, or two-part, model, the zeros are determined by one
density, $f_1(\cdot)$, so that $\Pr(y = 0) = f_1(0)$, while the positive counts come
from another density, $f_2(\cdot)$. To be more precise, the positive counts are
drawn from the truncated density, $f_2(y \mid y > 0) = f_2(y)/\{1 - f_2(0)\}$.
Section 8.5 provides more details on truncated counts. The overall data-
generating mechanism is

$\Pr(y = 0) = f_1(0), \qquad \Pr(y = j) = \{1 - f_1(0)\}\,\dfrac{f_2(j)}{1 - f_2(0)}, \quad j = 1, 2, \ldots$

In practice, $f_1$ is usually specified as a logit or probit, although any
binary choice model will do. The distribution $f_2$ is usually a Poisson or
negative binomial. When $f_2$ is a negative binomial, the hurdle count
model takes overdispersion into account in two ways: by allowing a
separate process for the zeros, and by allowing the positive counts to be
overdispersed via the negative binomial parameter, $\alpha$.

To demonstrate, we return to the example of any office visits with the


MEPS data. There are two parts to hurdle Poisson model estimation. In the
first step, we fit a logit model for the probability that the number of office-
based visits is greater than zero. Age and female both increase the
probability of having at least one office-based visit significantly.
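A sketch of this first step; the indicator name any_off is illustrative.

* first part of the hurdle model: any office-based visit
gen byte any_off = use_off > 0 if !missing(use_off)
logit any_off c.age i.female, nolog
* marginal effects on the probability of any visit
margins, dydx(age female)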

The estimates of marginal effects show that women are almost 17
percentage points more likely to have at least one office-based visit than
men. An extra year in age leads to a 0.8 percentage point increase in the
probability of at least one office-based visit.

In the second step, we fit a truncated Poisson model using the Stata
command tpoisson . For estimation, it is important to condition on the
sample with positive values for the outcome; that is, drop observations
with zero counts. Notice the use of the if use_off>0 qualifier in the
tpoisson command shown below:
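A sketch of this second step; tpoisson's default truncation point is zero, so only the if qualifier is needed here.

* second part: zero-truncated Poisson fit on the positive counts
tpoisson use_off c.age i.female if use_off > 0, nolog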

A variety of marginal effects can be calculated using margins . We can


get estimates of marginal effects on the conditional (on the outcome being

greater than zero) mean for the entire sample, not just for observations
with the outcome greater than zero. Conditional on having at least one
office-based visit—women average 1.59 visits more than men. On
average, a person who is one year older is expected to have 0.11 more
visits.
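One way to obtain such effects is to pass the zero-truncated Poisson mean, $\mu/\{1 - \exp(-\mu)\}$, to margins through the expression() option; this is a sketch, and the noesample option, which extends the calculation beyond the estimation subsample, is the part most worth checking against your Stata version.

* effects on E(y | y > 0) evaluated over the full sample
margins, expression(exp(xb())/(1 - exp(-exp(xb())))) dydx(age female) noesample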

Predictions and marginal effects from the hurdle Poisson model, as a


whole, require putting the two parts together. Without the convenience of a
command such as twopm , the estimates can be combined using suest and
the expression() option in margins to obtain overall marginal effects. We
first refit the logit and truncated Poisson regression models without
adjustments to the maximum-likelihood standard errors, because those
adjustments are done within suest. suest produces a typical Stata
regression table with coefficients and standard errors of both equations.

Next, we code the formula for the conditional mean of the outcome for
the hurdle Poisson model and pass that to the expression() option of the
margins command. The code and results are shown below. They show that
women have 2.36 more office-based visits than men. From the previous
results, we can conclude that this is because they are more likely to have
an office-based visit and because, among those who visit, they have more
visits. An extra year in age increases the number of visits by 0.14, again
because of increases along extensive and intensive margins.
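A sketch of that margins call after suest, writing the overall mean as Pr(y > 0) times the zero-truncated Poisson mean; the equation names inside predict() follow the stored-estimate names used above and should be verified against the suest output.

* overall mean: invlogit(xb_logit) * mu/(1 - exp(-mu)), with mu = exp(xb_tpoisson)
margins, expression(invlogit(predict(equation(hurdle1) xb)) *          ///
        exp(predict(equation(hurdle2) xb)) /                           ///
        (1 - exp(-exp(predict(equation(hurdle2) xb)))))                ///
    dydx(age female)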

We also fit a truncated negative binomial regression model, combine
estimates from the two parts, and estimate marginal effects using the same
steps as for the hurdle Poisson model. The sample average of the
incremental effect of being female is 2.54: women average 2.54 more
office-based visits than men. This estimate is somewhat larger than the one
obtained from the hurdle Poisson. An extra year of age is estimated to
increase the number of visits by 0.15, which is quite similar to that
obtained from the hurdle Poisson.

Note that the sample average incremental and marginal effects
obtained from the hurdle specification are quite close to those obtained
using the standard negative binomial regression. However, this does not
mean that the partial effects would be similar throughout the distribution
of the covariates.

8.4.2 Zero-inflated models

The zero-inflated model developed by Lambert (1992) and Mullahy (1997)


also has considerable intuitive appeal. The intuition is that there are two
types of agents in the population: potential users and nonusers. While
positive counts arise only from the decisions of users, zeros can arise
because users choose not to consume in a particular period or because of
the behavior of nonusers.

In the context of healthcare use, consider an example in which the


outcome of interest is the number of visits to an acupuncturist. It might be
reasonable to believe that the population consists of two types of
individuals: those who would never seek such care and those who would.
However, there would be individuals among the second group who did not
visit an acupuncturist in the survey recall period. A person observed to
have zero visits during the observation period might either be someone
who would never visit an acupuncturist or be someone who would—but
happened not to during the observation period. Thus a zero-inflated model

would be a powerful way to model the additional heterogeneity relative to
a standard model. Note that as with the hurdle count model, the use of the
zero-inflated model need not be justified using the intuition of two types of
individuals in the population. It may simply be used to provide additional
modeling flexibility.

In the zero-inflated class of models, a count density, $f_2(\cdot)$, produces
realizations from the entire range of nonnegative integers—that is,
$y = 0, 1, 2, \ldots$. In addition, another process generates zeros with
probability specified by density $f_1(\cdot)$. Thus, while positive counts arise
only from the density $f_2(\cdot)$, zeros arise from $f_1(\cdot)$ as well as $f_2(\cdot)$. Thus
the density for the zero-inflated count model is

$\Pr(y = 0) = f_1(0) + \{1 - f_1(0)\}\, f_2(0), \qquad \Pr(y = j) = \{1 - f_1(0)\}\, f_2(j), \quad j = 1, 2, \ldots$

Typically, $f_1$ is specified as a logit or probit model, and $f_2$ is a
Poisson or negative binomial density. Note that unlike the hurdle count
model, truncated densities are not used in this specification.

Although quite flexible, the zero-inflated model is used less often than
the hurdle count model. Because the likelihood function of the zero-
inflated model cannot be decomposed into its constituent parts, all
parameters must be estimated simultaneously, which can raise
computational issues—especially when both parts of the model have the
same set of covariates (Mullahy 1997) .

Below we fit a zero-inflated negative binomial regression model for


office-based visits using the zinb command. Although we do not show
results from the zero-inflated Poisson regression, we could estimate this
using the zip command. The estimates reported below show that age and
gender are positively related to the number of visits and negatively related
to the probability of obtaining a zero via the inflation factor.
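A sketch of the call, with the same two covariates in both the count and the inflation equations:

* zero-inflated negative binomial; inflate() models the probability of a structural zero
zinb use_off c.age i.female, inflate(c.age i.female) nolog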

We estimate the sample average incremental effect of being female and
the marginal effect of age and report these below. On average, women
have 2.49 more office-based visits than men. Individuals who are a year
older have 0.15 more visits. At least on average, estimates and marginal
effects from a zero-inflated negative binomial regression model for office-
based visits deliver similar results to the hurdle count models.

Typically, the larger the discrepancy between the number of zeros in
the data and the number predicted by the standard count model (Poisson or
negative binomial), the greater the gains will be from the additional
modeling complexities of either the hurdle or the zero-inflated model .
Gains will be most obvious along the dimension of predicted counts, but a
researcher will typically also obtain better estimates of other event
probabilities and partial effects. Further, unlike the choice between
Poisson and negative binomial, where the choice of distribution has no
impact on expected outcome in terms of the conditional mean, the move to
a zero-inflated Poisson or negative binomial regression model—or a
hurdle count model—can change the conditional mean response to the
covariates.

8.5 Truncation and censoring

In this section, we briefly discuss situations in which the count outcome is


either truncated (missing) or censored (recoded). Truncation occurs
naturally in the second part of the hurdle count model but also in situations
where nonusers of care are unobserved. Censoring occurs when counts are
top coded.

8.5.1 Truncation

In some studies, sampled individuals must have been engaged in the


activity of interest to be included in the samples. When this is the case, the
count data are truncated, because they are observed only over part of the
range of the response variable. Examples of truncated counts include
outpatient visits and the number of prenatal visits among a population of
women who all have at least one visit. In all of these cases, we do not
observe zero counts, so the data are said to be zero-truncated —or, more
generally, left-truncated . Right-truncation is less common but can arise in
situations where the analyst has a dataset in which observations with large
values of the count variable have been removed from the dataset during
prior analysis. Typically, researchers know the rule by which these
observations were removed, but it is not feasible to recover those
observations from the original data source. Truncation leads to inconsistent
parameter estimates, unless the likelihood function is suitably modified.
This is the case even when the true data-generating mechanism is as
simple as a Poisson density (Gurmu 1997) .

Consider the case of zero-truncation. Let $f(y \mid \theta)$ denote the density
function, and let $F(y \mid \theta)$ denote the cumulative distribution function of
the discrete random variable $y$, where $\theta$ is a parameter vector. If realizations
of $y = 0$ are omitted, then the ensuing zero-truncated density is

$f(y \mid \theta, y > 0) = \dfrac{f(y \mid \theta)}{1 - F(0 \mid \theta)}, \qquad y = 1, 2, \ldots$

For the zero-truncated Poisson, this simplifies to

$f(y \mid \mu, y > 0) = \dfrac{e^{-\mu}\mu^{y}}{y!\,(1 - e^{-\mu})}$

where $\mu = \exp(x'\beta)$. Maximum likelihood estimates of zero-truncated
models are implemented in Stata for the Poisson and negative binomial
densities with the tpoisson and tnbreg commands, respectively.

8.5.2 Censoring

Censored counts most commonly arise from the aggregation of counts


greater than some value. This is often done in survey designs involving
healthcare use for confidentiality reasons, where a measure of use is top-
coded at a value smaller than the true maximum in the data to reduce the
risk of compromising confidentiality of the survey respondent. Censoring,
like truncation, leads to inconsistent parameter estimates if the uncensored
likelihood is mistakenly used (Gurmu 1997). Consider the case in which
the number of events greater than some known value, $c$, might be
aggregated into a single category. In this case, some values of $y$ are
incompletely observed; the precise value is unknown, but it is known to
equal or exceed $c$. The observed data have density

$f^{*}(y) = \begin{cases} f(y \mid \theta) & \text{if } y < c \\ 1 - F(c - 1 \mid \theta) & \text{if } y \geq c \end{cases}$
Simplification to the Poisson and negative binomial densities can be


derived using their respective densities.

8.6 Model comparisons

As is apparent from the number of models and their variants described


above, a number of modeling choices are necessary when modeling a
count outcome. Although the Poisson regression has the advantage of
being robust to some types of misspecification, the estimates from a
Poisson regression may not be desirable, and predictions of events and
partial effects of policy interest may well be inconsistent. The richer classes
of models tend to fit the data much better but are also prone to
misspecification. One way to minimize the effects of misspecification is to
choose the best-fitting model before conducting inference, calculating
partial effects, and making predictions.

One complication that generally rules out standard statistical testing of


model choice is the fact that most of the models are nonnested. However,
the Vuong (1989) test is an exception, because it is designed to test
nonnested hypotheses. It is implemented in Stata for a number of models,
including zip, but not generally applicable without substantial effort. A
likelihood-based model-selection approach (AIC or BIC) is generally the
most straightforward way to evaluate the performance of alternative
models.

8.6.1 Model selection

As we described in chapter 2, there are two commonly used model


selection criteria that penalize the maximized log likelihood for the
number of model parameters. They have many desirable properties,
including robustness to model misspecification along a variety of
dimensions (Schwarz 1978; Leroux 1992) . They also do not suffer from
issues of multiple testing, which would arise in the traditional hypothesis
testing framework when many alternative models are considered.

The AIC (Akaike 1970) is

$\text{AIC} = -2\ln L + 2k$

where $\ln L$ is the maximized log likelihood and $k$ is the number of
parameters in the model. Smaller values of AIC are preferable. The BIC
(Schwarz 1978) is

$\text{BIC} = -2\ln L + k\ln N$

where $N$ is the sample size. Smaller values of BIC are preferable. For
moderate to large sample sizes, the BIC places a premium on parsimony
and will tend to select models with fewer parameters relative to the
preferred model based on the AIC criterion.

To illustrate these criteria, we fit a variety of count-data models for the


number of office-based visits and the number of ER visits. In each case we
estimated Poisson, NB2 and NB1 models, and hurdle and zero-inflated
models derived from NB2 and NB1 densities. To make the comparisons
more realistic, we used a full set of covariates in each of the models (age,
gender, race and ethnicity, household size, education, income, region, and
insurance).
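The comparison can be assembled by storing each fit and tabulating the criteria; this sketch uses an abbreviated covariate list and omits the hurdle models, whose two parts must be combined by hand before their AIC and BIC can be computed.

* fit competing one-command models and compare information criteria
quietly poisson use_off c.age i.female
estimates store pois
quietly nbreg use_off c.age i.female
estimates store nb2
quietly nbreg use_off c.age i.female, dispersion(constant)
estimates store nb1
quietly zinb use_off c.age i.female, inflate(c.age i.female)
estimates store zinb2
estimates stats pois nb2 nb1 zinb2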

Both the AIC and BIC demonstrate that—among the models fit—the
hurdle NB1 fits the data for office-based visits best, albeit with the caveat
that the zero-inflated negative binomial model implemented in Stata as
zinb only allows for the NB2 density.

For the number of ER visits, while there is considerably less evidence


of overdispersion, the Poisson does not fit as well as the alternative
models. In fact, the NB2 density fits better than the NB1 density and their
zero-inflated and hurdle counterparts when the BIC is used as the model
selection criterion. However, the zero-inflated negative binomial appears
to fit best when the AIC is used.

8.6.2 Cross-validation

In-sample model checks may not always be reliable. Cross-validation


checks (Picard and Cook 1984; Arlot and Celisse 2010) provide a
powerful alternative way for model comparison. Cross-validation is a
technique in which estimation is done on a subsample of the full sample
(known as the training sample) and model fit is assessed in the remaining
observations (known as the validation sample). In $K$-fold cross-validation,
the original sample is randomly partitioned into $K$ subsamples. Of the $K$
subsamples, a single subsample is retained as the validation data for testing
the model, and the remaining $K - 1$ subsamples are training data. The
cross-validation process is then repeated $K$ times (the folds), with each of
the $K$ subsamples used exactly once as the validation data. The results
from the $K$ folds can then be averaged (or otherwise combined) to produce a
single estimate of model fit. The advantage of this method over repeated random
subsampling is that all observations are used for both training and
validation, and each observation is used for validation exactly once.
Ten-fold cross-validation is common.
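A sketch of a 10-fold cross-validation loop for a Poisson model of office-based visits; other estimators can be substituted inside the loop, and the seed, fold construction, and short covariate list are illustrative.

* assign each observation to one of 10 folds
set seed 12345
gen int fold = ceil(10*runiform())
gen double cvll = .
forvalues k = 1/10 {
    quietly poisson use_off c.age i.female if fold != `k'
    quietly predict double mu if fold == `k', n
    * out-of-sample Poisson log likelihood for the validation fold
    quietly replace cvll = use_off*ln(mu) - mu - lnfactorial(use_off) if fold == `k'
    drop mu
}
quietly summarize cvll
display "cross-validation log likelihood = " r(sum)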

We conduct 10-fold cross-validation for the number of office-based


visits and the number of ER visits. In each of the outcomes, the Poisson
regression—as well as its hurdle and zero-inflated extensions—fit
substantially worse than the models based on the negative binomial
densities. For ease of interpretation, we do not report them in the figures
below. Figure 8.7 below shows a comparison of the negative binomial
models for the number of office-based visits. Although the NB2 model is
the worst performer for each replicate, its hurdle counterpart performs
quite well—either the best or close to the best-performing model.

[Figure: change in cross-validation log likelihood relative to NB2, by cross-validation replicate (1 through 10), for the NB1, hurdle-NB2, hurdle-NB1, and zero-inflated NB2 models of office-based visits.]

Figure 8.7: Cross-validation log likelihood for office-based visits

The evidence for ER visits, shown in figure 8.8, is quite different. There
is virtually no discrimination between NB1 and NB2 models or among their
extensions.

[Figure: change in cross-validation log likelihood relative to NB2, by cross-validation replicate (1 through 10), for the NB1, hurdle-NB2, hurdle-NB1, and zero-inflated NB2 models of ER visits.]

Figure 8.8: Cross-validation log likelihood for ER visits

8.7 Conclusion

We have described a number of models for count data in this chapter. They
are useful as models for many measures of healthcare use. In one of the
two empirical examples in this chapter, there is considerable gain in fit
from going beyond the standard Poisson and negative binomial models to
hurdle and zero-inflated extensions of those models. Nevertheless, even
the hurdle and zero-inflated extensions may sometimes be insufficient to
provide an adequate fit for some outcomes. We will return to the
development of other, possibly more flexible extensions in chapter 9.

8.8 Stata resources

Stata has several commands to estimate count models as well as zero-


inflated count models. To estimate the Poisson regression in Stata, use the
poisson command. Use nbreg for negative binomial regressions, zip for
the zero-inflated Poisson regression, and zinb for its negative binomial
counterpart. Hurdle count models do not have single commands in Stata.
They can be estimated by using logit or probit for the hurdle part of the
model and tpoisson or tnbreg for the conditional (on a positive integer)
part of the model.

To compare the AIC and BIC for the count models
described above, use estimates stats * or estat ic.

Chapter 9
Models for heterogeneous effects

9.1 Introduction

There are many conceptual reasons to expect that the marginal (or
incremental) effects of covariates on healthcare expenditure and use, when
evaluated at different values of covariates or at different points on the
distribution of the outcome, are not constant across a number of
dimensions. In observational studies, individual characteristics can
plausibly have very different effects on the outcome at different values of
the characteristic itself. For example, a unit change in health status may
have only a small effect on healthcare expenditures for individuals who are
in good health, while it may have a large effect for individuals in poor
health. The effect of a unit change in health status may also differ along
the distribution of expenditures. The effect of a unit change in health status
on expenditures may be small for people with low expenditures and high
for people with large expenditures. These health characteristics may also
interact with socioeconomic characteristics. For example, individuals who
are generally less healthy or who have greater spending may be less
sensitive to changes in price. Furthermore, in quasiexperimental studies
and large, pragmatic experimental designs, the intensity of treatment and
compliance to treatment often differ across individual characteristics,
household characteristics, provider characteristics, and geographic areas.
Thus treatment effects evaluated at different values of those characteristics
would yield different values of effects.

From a statistical perspective, when the linear regression model is


specified as being linear in covariates, as is common, the effects of
covariates are constant across their values in the population. In the context
of the linear regression model, researchers can explore heterogeneity of
effects of covariates via the use of polynomials of covariates, by
interaction terms, or by stratifying the sample by indicators of the source
of heterogeneity. For example, regression specifications specified using
quadratic functions of age or stratified by gender are commonplace.
However, there are data and statistical limits to the amount of stratification
that can be done given a sample, and such analyses increase the risk of
false findings. When the nonlinear models we have described in previous
chapters are used, they have conditional expectation functions for the
outcome that are nonlinear functions of linear indexes. Therefore, even if
the index is linear in covariates, marginal effects will vary when evaluated
at different values of covariates and of the outcome. They are, however,

characterized entirely by the functional form of the link function between
the index and outcome unless interactions and polynomials of covariates
are also included.

There are often good reasons to believe that effects are heterogeneous
along dimensions that cannot easily be characterized by parametric
functional forms or by interactions of covariates as they are typically
specified. For example, effects may be heterogeneous along the values of
the outcome itself, by complex configurations of observed characteristics,
or on unobserved characteristics. These types of effect heterogeneity are
not easy to account for using the models that we have described so far.
Ignoring heterogeneity may be a lost opportunity for greater understanding
in some cases, while it may lead to misleading conclusions in others.

As we saw in chapter 2, accounting for such heterogeneity is


exceedingly important if the researcher is interested in estimating effects at
specific values of covariates or of the outcome. We also saw that allowing
for appropriate nonlinearity might be important even if the object of
interest is an average of effects across the sample, for example, the average
treatment effect. Therefore, in this chapter, we describe four methods that
allow the researcher to explore heterogeneity of effects in more general
ways: First, we describe quantile regression, which is an appealing
technique to explore heterogeneity along values of the outcome. Next, we
describe finite mixture models , which allow for heterogeneity along the
outcome distribution by complex configurations of either observed or
unobserved characteristics. These models identify a finite (typically small)
number of classes of observations with associated covariate effects that
vary across classes. Third, we describe some uses of nonparametric
regression now available in Stata. Nonparametric regression techniques
make few assumptions about the functional form of the relationship
between the outcome and the covariates. Finally, we briefly describe a
conditional density estimator that explicitly allows the relationship
between covariates and outcome to differ across the distribution of the
outcome.

9.2 Quantile regression

So far, the models we have described relate the conditional mean of the
outcome, $E(y \mid x)$, to a set of covariates through one (or two) linear indexes
of parameters and covariates; that is, $E(y \mid x) = g(x'\beta)$. In quantile
regression, the conditional expectation of the outcome is not modeled.
Instead, the conditional $q$th quantile is modeled using a linear index of
covariates; that is, $Q_q(y \mid x) = x'\beta_q$. When $q = 0.5$, the quantile regression
is also known as a median regression. As we will see below, quantile
regressions allow effects of covariates to vary across conditional quantiles;
that is, the effect of a covariate on the $q$th conditional quantile may be
different from the effect of the same covariate on the $q'$th quantile. Thus
quantile regressions provide a way for researchers to understand how the
effect of a covariate might differ across the distribution of the outcome.

The median regression , or the 0.5th quantile regression, is the simplest


point of departure from the linear model fit by OLS. Specifically, consider
the linear regression specification for the continuous dependent variable,
$y_i$, for individuals $i = 1, \ldots, N$, regressed on a vector of covariates, $x_i$, with a vector of
parameters to be estimated, $\beta$, and an independent and identically
distributed error term, $\varepsilon_i$:

$y_i = x_i'\beta + \varepsilon_i$

In ordinary least squares (OLS), the parameters of this linear model are
computed as the solution to minimizing the sum of squared residuals. In an
analogous way, the median quantile regression computes parameters of the
same linear regression specification by minimizing the sum of absolute
residuals. The sum of squared residuals function is quadratic and
symmetric around zero; the sum of absolute residuals is piecewise linear
and symmetric around zero. Therefore, minimizing the sum of absolute
residuals equates the number of positive and negative residuals and defines
a plane (a line in the case of a simple regression specification) that “goes
through” the median.

What if we wished to estimate the parameters of the regression line


that correspond to quantiles other than the median? Koenker and
Bassett (1978) and Bassett and Koenker (1982) showed that if $\hat{\beta}_q$
minimizes

$Q(\beta_q) = \sum_{i:\, y_i \geq x_i'\beta_q} q\,\lvert y_i - x_i'\beta_q \rvert + \sum_{i:\, y_i < x_i'\beta_q} (1 - q)\,\lvert y_i - x_i'\beta_q \rvert$   (9.1)

where $0 < q < 1$, then $\hat{\beta}_q$ is the solution to the $q$th
quantile regression. In other words, the solution produces the best-fitting
plane that goes through the $q$th quantile of $y$ conditional on $x$. To fix ideas, suppose that
$q = 0.5$. Then

$Q(\beta_{0.5}) = 0.5 \sum_{i=1}^{N} \lvert y_i - x_i'\beta_{0.5} \rvert$

and minimizing (9.1) produces estimates corresponding to the median
regression.

Quantile regressions for a given conditional quantile have two


appealing properties relative to standard least-squares regressions. First,
quantile regression estimates are extremely robust to outliers. The reason
is quite simple. Consider the case of the median regression. If a large
(positive or negative) value of $y$ changes by a bit, the median value of $y$
is unaffected, so the quantile regression estimates remain unchanged.
Second, quantile regressions are equivariant to monotone transformations
of the outcome variable. This means that not only does the regression of $y$
go through the median of $y$, but the regression of any $h(y)$ will also go
through the median of $h(y)$ for any monotone function, $h(\cdot)$. This is because
the median is an order statistic that is invariant to such transformations.

Our reason for describing quantile regressions focuses on yet another


appealing property. We estimate quantile regressions at various values of $q$
to understand how the effects of covariates vary across the conditional
quantiles of the outcome. Koenker and Hallock (2001) and
Drukker (2016) give informative introductions to quantile regressions and
how they allow researchers to explore heterogeneity of effects.

9.2.1 MEPS examples

To demonstrate the value of quantile regressions, we use the 2004 Medical


Expenditure Panel Survey (MEPS) data introduced in chapter 3 to estimate
the effect of a change in age by a year (age) and gender (female is a binary
indicator for being female) on total healthcare expenditures for persons
with any expenditures (exp_tot > 0). The median regression (or the
quantile regression estimated for the 50th percentile) yields the following
results. It shows that a unit increase in age increases median expenditures
by $55, while women spend $385 more at the median than men. Both
effects are statistically significant.
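A sketch of the median regression call on the positive-expenditure sample; quantile(.5) is the default and is shown only for clarity.

* median (0.5 quantile) regression of total expenditures
qreg exp_tot c.age i.female if exp_tot > 0, quantile(.5)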

How do these results compare with standard least-squares estimates ?
The results below show that the median regression and the least-squares
regression deliver the same inference, but the partial effects of covariates
are different. The marginal effect of age and the incremental effect of
female are about twice as big in the least-squares case as in the median
regression.

As we have suggested above, an important value of quantile
regressions is the ability to estimate effects at various quantiles, not just at
the median. Consequently, we estimate the regression of total expenditures
at the 10th through 90th percentiles in 10 percentage point increments to
determine whether and how the effects of the covariates change from the
10th through 90th conditional quantiles of the outcome.

There is a subtle difference between the command we used to calculate


the quantile regression at the median (qreg) and the command we use to
estimate the regressions across the sequence of quantiles (sqreg) . We
could have used qreg repeatedly and would have obtained the same point
estimates, but we did not. We used sqreg because it produces standard
errors for each of the point estimates that account for the fact that the
estimates of the coefficients across quantiles are not independent of each
other, because they are all estimated from the same sample of data
(Koenker and Machado 1999) . Therefore, the standard errors obtained via
bootstrap by sqreg formally allow for testing of equality of coefficients
across quantiles, while the standard errors from repeatedly using qreg
would not have.
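A sketch of the simultaneous-quantile fit; the seed and the number of bootstrap replications in reps() are illustrative choices.

* quantile regressions at the 10th through 90th percentiles with a joint bootstrapped VCE
set seed 12345
sqreg exp_tot c.age i.female if exp_tot > 0, ///
    quantiles(.1 .2 .3 .4 .5 .6 .7 .8 .9) reps(100)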

We plot these estimates of the effects of age and female along with the
associated 95% confidence intervals in figure 9.1. For comparison, we
overlay on each panel the least-squares coefficient estimate and its
confidence interval. (To accumulate the parameter estimates and
confidence intervals conveniently, we use the user-written package
parmest [Newson 2003], which can be installed by typing ssc install
parmest in Stata.)

The left panel of figure 9.1 shows the marginal effect of age on total
expenditures, while the right panel shows the incremental effect of female
on expenditures across quantiles of errors. Note that the quantiles on the
horizontal axis refer to quantiles of errors—not to quantiles of the
outcome, exp_tot. Although it is tempting to interpret the effects as if they
were applicable to observed quantiles of the outcome, that interpretation is
incorrect. These are conditional quantiles, so they cannot be easily
translated into unconditional quantiles.

Nevertheless, it is revealing that—in both panels—the quantile


regression estimates lie outside the confidence intervals of the least-
squares estimates for most quantiles, suggesting that the effects of these
covariates are not constant across the error distribution or equivalently
across the conditional distribution of the dependent variable. The OLS
coefficient on age statistically coincides with quantile estimates from the
70th through 80th percentiles, while the OLS coefficient on female
statistically coincides with quantile estimates from about the 35th through
80th percentiles. A researcher might conclude that there is more
heterogeneity in the effect of a change in age than there is in the effect of
gender on total expenditures.

[Figure: two panels showing the quantile regression coefficients on age (left) and female (right) at the 10th through 90th percentiles, with 95% confidence intervals shown as solid lines and shaded areas; the OLS estimate and its 95% confidence interval are overlaid as dashed and dotted lines.]

Figure 9.1: Coefficients and 95% confidence intervals by quantile of


expenditure errors

As we described in chapter 3, the distribution of expenditures—


conditional on being positive—is severely skewed to the right. In this case,
the median quantile regression produces substantially different estimates
than the least-squares estimates (mean regression).

What might we learn if the distribution of the outcome were more
symmetric? To explore this, we estimate quantile regressions for the
logarithm of total expenditures (conditional on expenditures being
positive). We first estimate the quantile regression at the median using
qreg . The results show that the coefficient on age is 0.037, implying that
if an individual is one year older, that individual would spend 3.7% more.
Women spend 43% more than men, computed as $100\{\exp(\hat{\beta}_{\text{female}}) - 1\}$.

We use sqreg to display the effects of covariates across the quantiles


of the conditional distribution of the logarithm of expenditures and to
compare them with OLS estimates in figure 9.2. We find that there is still
substantial evidence of effect heterogeneity for age. The pattern of
coefficients is reversed, relative to the effects of age on expenditures as
shown in figure 9.1. In the previous case, the effect of a change in age

increased as the conditional quantile of expenditures increased. Now,
when the outcome is the log of expenditure, the effect of age decreases
across those conditional quantiles. There is no evidence of heterogeneity in
the effect of female for expenditures measured on the log scale. The
quantile estimates are all within the confidence interval of the OLS
estimates.

[Figure: two panels showing the quantile regression coefficients on age (left) and female (right) for the log of expenditures at the 10th through 90th percentiles, with 95% confidence intervals shown as solid lines and shaded areas; the OLS estimate and its 95% confidence interval are overlaid as dashed and dotted lines.]

Figure 9.2: Coefficients and 95% confidence intervals by quantile of


ln(expenditure) errors

As we described above, quantile regressions are equivariant to


monotone transformations of the outcome variable. A specific
consequence of that result in the context of the quantile regression of $\ln(y)$
is that $Q_q(y \mid x) = \exp\{Q_q(\ln y \mid x)\}$. So, unlike in the least-
squares case described in chapter 6, no retransformation constants are
necessary. In fact, partial effects can easily be computed using the
expression() option of margins , as shown below. The results show that
an increase of one year in age increases median expenditure by $68.
Women at the median spend $628 more than men at the median.
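A sketch of that calculation, assuming the median regression of the logged outcome is the active set of estimates; exponentiating the linear prediction inside expression() exploits the equivariance result directly.

* effects on median expenditures in dollars after a median regression of ln(expenditures)
margins, expression(exp(xb())) dydx(age female)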

9.2.2 Extensions

We briefly discuss two extensions of the quantile regression approach that


extend the framework in quantile linear regression analysis.

First, quantile regressions are usually applied to continuous outcomes


such as healthcare expenditures. However, what if we wanted to use
similar methods for count measures of healthcare use? Machado and
Santos Silva (2005) describe a method for estimating conditional quantiles
of count data. They show that it is possible to estimate conditional
quantiles of a count variable after it has been smoothed by adding noise
with a uniform distribution to the count variable. This model is available
for Stata users as a user-written package called qcount (Miranda 2006).

Second, there is no general approach for estimating quantile treatment


effects in the context of nonlinear data-generating processes . However, as
we have seen in previous chapters, outcomes such as healthcare
expenditures are better analyzed using nonlinear models—in part because
they are often positive and highly skewed. Drukker (Forthcoming)
describes a method that estimates the quantiles of the potential-outcome
distributions for treatment and control in the case of a gamma distributed
random variable (that may also exhibit censoring). Differences in the
estimated quantiles across treatment and control are the quantile treatment
effects. This model is available for Stata users as a user-written package
called mqgamma (Drukker 2014).

9.3 Finite mixture models

Finite mixture models have a long and established history in statistics as a


way to nonparametrically or semiparametrically estimate distributions for
random variables when parametric distributions are inadequate
(Lindsay 1995) . Finite mixture models also segment observations in a
dataset into components or classes defined by differences in effects of
covariates on outcomes or, more generally, differences in conditional
component distributions of the outcome (McLachlan and Peel 2000) . The
finite mixture model also adds flexibility vis-a-vis linear or nonlinear
parametric models that we have described in previous chapters by
introducing heterogeneity in responses across the distribution of outcomes.

Finite mixture models are an appealing way to model healthcare


expenditures and use for a number of reasons. First, the finite-mixture
regression model provides a flexible method of modeling the data, and—
under appropriate conditions—is a semiparametric estimator of the
unknown density. Second, the finite-mixture regression model can be
thought of as arising from intuitively plausible data-generating
mechanisms in which latent heterogeneity splits the population into classes
that display different responses to changes in covariates or different data
densities. For example, in the case of healthcare expenditure or use, the
latent classes may correspond to healthy and sick users of healthcare
services. Healthy individuals may be expected to be sensitive to changes in
price and income, while sick individuals may be relatively insensitive to
price and income.

Suppose that an outcome, $y$, is a random variable drawn from one of $C$
distributions. Assume the probability that $y$ is drawn from distribution $c$
(more commonly referred to as component or class) is $\pi_c$, with $0 \leq \pi_c \leq 1$
and $\sum_{c=1}^{C}\pi_c = 1$. Let $f_c(y \mid \theta_c)$ denote the density (mass
function if $y$ is a discrete random variable) for class or component $c$, where
$\theta_c$ denotes the parameters of the distribution $f_c$.
Then, the density function for a $C$-component finite mixture
(Lindsay 1995; Deb and Trivedi 1997; McLachlan and Peel 2000) is the
weighted sum of component densities,

$f(y \mid \theta_1, \ldots, \theta_C; \pi_1, \ldots, \pi_C) = \sum_{c=1}^{C} \pi_c\, f_c(y \mid \theta_c)$   (9.2)

This model’s parameters can be estimated by maximum likelihood , either


by an expectation-maximization (EM) algorithm (McLachlan and
Peel 2000) or more standard Newton–Raphson methods (Deb and
Trivedi 1997) , or a combination of techniques as implemented in Stata. It
is typical, in most estimation algorithms, not to estimate the $\pi_c$ parameters
directly. Instead, they are reparameterized as

$\pi_c = \dfrac{\exp(\gamma_c)}{\sum_{k=1}^{C}\exp(\gamma_k)}, \qquad c = 1, \ldots, C$   (9.3)

with $\gamma_1 = 0$ by convention as a normalization restriction. A major
computational advantage of estimating the $\gamma_c$ parameters is that there are
no restrictions on their values, unlike the values of $\pi_c$, each of which is
constrained to be between zero and one and which must sum to one.

Arguably, the most common component density is the normal


(Gaussian) distribution. Its functional forms and properties have been
described in numerous works, including McLachlan and Peel (2000) .
Other popular choices of distributions in the literature for continuous
outcomes are the lognormal and generalized linear models (GLMs). Popular
choices for distributions of count outcomes are the Poisson and negative
binomial. Generally, the finite mixture model can be specified with any
density the researcher deems appropriate for the outcome.

To illustrate the model in the case of an important distribution choice


in the context of healthcare expenditures, we describe specific details of
the finite mixture of a generalized linear regression with a gamma density
and log link. Recall that in chapter 5, we showed that the GLM with a
gamma density and log link described expenditures well. For individuals
in class j, the GLM with a gamma density and log-link function for an outcome y can be expressed as a density function (McCullagh and
Nelder 1989) using

f_j(y_i | x_i) = (1/Γ(ν_j)) (ν_j/μ_ij)^{ν_j} y_i^{ν_j − 1} exp(−ν_j y_i/μ_ij),   where μ_ij = exp(x_i β_j)    (9.4)

The mixture density in this model is obtained by replacing the component density f_j(y_i | x_i) from (9.4) into (9.2).

In this parameterization, the expected value of the outcome, y_i, given covariates x_i and class j, is

E(y_i | x_i, class = j) = exp(x_i β_j) = μ_ij    (9.5)

Consequently, using properties of finite mixture distributions, we see that the expected value of the outcome, y_i, given covariates x_i, is

E(y_i | x_i) = Σ_{j=1}^{C} π_j exp(x_i β_j)

The expected value formula in (9.5) can be used to estimate heterogeneous


effects of treatment, as well as marginal and incremental effects.

In the finite mixture models described above, the parameters π_j can be interpreted as prior probabilities of class membership. As specified, they are assumed to be constants. Therefore, they do not provide any information on how likely a particular observation i might be to belong to a particular class j. However, following the finite mixture literature, we can calculate the posterior probability that observation i belongs to component j:

Pr(class = j | y_i, x_i) = π_j f_j(y_i | x_i; θ_j) / Σ_{k=1}^{C} π_k f_k(y_i | x_i; θ_k)    (9.6)

These posterior probabilities vary across individuals and provide a


mechanism for assigning individuals to latent classes and for using those

assignments to characterize the classes or components. To do so, we would
estimate the posterior probabilities of class membership after the model
parameters have been estimated. Then, we could use the probabilities
themselves—or classification based on the probabilities—to further
describe characteristics of observations in each class. Note that although it
is technically possible to parameterize the prior probabilities to allow them
to vary by characteristics, the tradition in the literature is to assume they
are constant (McLachlan and Peel 2000) .

The discussion above has taken the number of classes, , as given. In


practice, the number of classes is not likely to be known a priori, so the
researcher will need to determine empirically. There is a subtle
identification issue that complicates testing the number of components in a
finite mixture model using standard test statistics (see Lindsay [1995] and
McLachlan and Peel [2000] for discussions). Instead, the literature points
to the Akaike information criterion (AIC) and Bayesian information
criterion (BIC) as consistent tools for model selection (Leroux 1992) . In
practice, a two-component model is fit first and the model-selection
criteria computed. Then models with additional components are fit until
model-selection criteria suggest no improvement.

Below we describe two empirical examples. The first estimates finite


mixtures of GLMs with gamma densities and log links for healthcare
expenditures. The second estimates finite mixtures of negative binomial
regressions for counts of office-based visits. Finite mixture models for
count data, and details of finite mixture models more generally, are
described in Deb and Trivedi (1997, 2002) .

9.3.1 MEPS example of healthcare expenditures

We use the new fmm prefix to fit finite mixtures of GLMs and negative
binomial regressions. fmm is more than a command; it is a prefix because it
can be used to fit a variety of finite mixture models using standard
estimation commands prefixed by fmm #:, where # is the number of
components in the mixture model. We use the fmm prefix to fit finite
mixture models of GLM regressions with gamma densities and log links for
positive values of expenditures. We begin by fitting a two-component
model because it is the place a researcher would begin the quest for a
model with an adequate number of components. We will estimate the
parameters of this model, then calculate model-selection criteria so that the

model can be compared with a model with three or more components.
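In outline, the two-component model and its information criteria can be obtained with commands along the following lines. The covariate list (age and female) follows the MEPS examples used earlier and is an assumption about the full specification; for simplicity, the sketch restricts the sample to positive expenditures beforehand.

    keep if exp_tot > 0
    fmm 2: glm exp_tot age female, family(gamma) link(log)
    estat ic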

The output shows four blocks of iterations. Because the log-likelihood


function for finite mixture models is complex and not globally concave, it
is important to establish good starting values. The first set of iterations
produces good starting values for the parameters underlying the latent
class probabilities [as described in (9.3)]. The second set of iterations
establishes starting values for the componentwise parameters. The third set
of iterations begins the maximum likelihood process using the EM
algorithm. This algorithm simplifies the optimization problem by cycling
between a maximization step to obtain class-specific regression parameters
given estimates of class probabilities and an expectation step to obtain
estimates for the class probabilities. The EM algorithm can be slow to
converge, so after 20 iterations (the default setting), the command switches
to a standard Newton–Raphson algorithm.

If we thought this model was adequate, we would proceed to
interpreting estimates and characterizing marginal effects. However, as we
will see below, it is not. Nevertheless, to fix ideas using this simple case,
we briefly interpret the parameter estimates. The output of parameter
estimates is shown in three blocks. The first block displays estimates of the
α_j parameters, which can be used to calculate the class probabilities using
(9.3). We do this below, but after interpreting the output from the next two

blocks of results that show the componentwise parameter estimates. For
observations in component 1, the effects of age and female are both
statistically significant—each increases spending. For observations in
component 2, age is statistically significant, but female is not.

Now we return to calculating the latent class probabilities. We use the


postestimation command estat lcprob to obtain estimates of π_j. The two components have probabilities, π_1 and π_2, equal to 0.68 and 0.32,
respectively.

Having described results from a two-component finite mixture of


gamma GLMs using fmm , we fit a three-component model that may be a
better description of the data-generating process. The output from
estimation of a three-component finite mixture model using fmm produces
four blocks of iteration logs and, this time, four blocks of parameter
estimates. Despite all the effort to obtain good starting values, it is not
surprising to see a number of iterations labeled (not concave) prior to
convergence—as in the output below. The first block of parameter
estimates presents estimates for the latent class probabilities. The next
three blocks show results for the three componentwise regressions. Except
for the coefficient on female in component 3, all other covariate
coefficients are statistically significant. Older individuals spend more in
each latent class. Women spend more than men in the first and second
classes. We will return to a more detailed interpretation of the estimates
using margins below.
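A sketch of the corresponding commands (with the same caveats about the covariate list as before):

    fmm 3: glm exp_tot age female, family(gamma) link(log)
    estat lcprob
    estat ic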

We use the postestimation command estat lcprob to obtain estimates
of π_j. The three-component densities are associated with mixture
probabilities of 0.49, 0.43, and 0.08.

The model we have fit has constant class probabilities, which allows us
to code up the transformation from α_j to π_j [using (9.3)] and use nlcom to
obtain the estimates and associated confidence intervals for the class
probabilities more quickly than using estat lcprob . (The results below
are identical to those produced by estat lcprob.)
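For a three-component model, the calculation might look like the sketch below. The stored parameter names (_b[2.Class:_cons] and _b[3.Class:_cons], with class 1 as the base category) are an assumption; refit the model with the coeflegend option to confirm the names that fmm actually stores.

    nlcom (pi1: 1/(1 + exp(_b[2.Class:_cons]) + exp(_b[3.Class:_cons])))                      ///
          (pi2: exp(_b[2.Class:_cons])/(1 + exp(_b[2.Class:_cons]) + exp(_b[3.Class:_cons]))) ///
          (pi3: exp(_b[3.Class:_cons])/(1 + exp(_b[2.Class:_cons]) + exp(_b[3.Class:_cons])))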

We use estat ic to calculate the AIC and BIC for the two- and three-
component models. The information criteria for the three-component
model are shown below. Both the AIC and BIC suggest that the three-
component model provides a better fit than the two-component model.
Although—for a serious research exercise—we should fit a four-
component model and calculate information criteria before judging the
three-component model to be the best fit, we stop here and proceed to the

interpretation of the parameters, effects, and distributional characteristics
from the three-component model. This gives us an example that is
sufficiently parameter rich, so as to make the nuances of the finite mixture
model apparent—yet not overly complex—so that discussion of the
nuances overwhelms our attempt to describe basic interpretation.

As a sanity check, we estimate the overall predicted mean using


margins and its empirical counterpart for the sample of observations with
positive values of exp_tot. It is an indicator of model validity that the
predicted mean is quite close to the empirical estimate.
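In code, this sanity check is simply:

    margins
    summarize exp_tot if exp_tot > 0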

We begin the characterization of results from the three-component


model by using margins to compute means of predicted outcomes for each
of the three components. Using the results below, we can say that the
distribution of expenditures is characterized by three components, one with
mean spending of $1,322 and associated probability of 0.49, a second
component with mean spending of $5,245 and probability of 0.43, and a
third, relatively rare class (with probability 0.08) with mean spending of
$19,959. We can now say that, while women spend more than men in the

low- and medium-spending classes, women and men in the high-spending
class, that is, in component 3, do not differ significantly.
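A sketch of the component-specific calculation, assuming the class() suboption of fmm's mu prediction:

    margins, predict(mu class(1)) predict(mu class(2)) predict(mu class(3))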

It is not just mean spending that differs across components. The shapes
of the gamma densities are also different. To show this, we plot the
predicted densities at the median age and gender in figure 9.3. To make
the figure easier to read, we show only the densities through $20,000 in
expenditure, which exceeds the 95th percentile of its empirical distribution
of expenditure. The figure shows that each of the predicted densities is
skewed, just as the empirical density is. But, while the densities of the first
two components have positive modes (classically gamma shaped), the
density of component 3 is exponential in shape—it slopes downward right
from the beginning.

Figure 9.3: Empirical and predicted componentwise densities of expenditure (x axis: expenditure, 0–20,000; legend: f(y|x,c=1), f(y|x,c=2), f(y|x,c=3); μ1 = 1322.0, π1 = 0.49; μ2 = 5244.8, π2 = 0.43; μ3 = 19958.9, π3 = 0.08)

What do the parameter estimates of this three-component model tell us


about the effects of age and female on expenditures? We present the
estimates of the componentwise partial effects , along with their 95%
confidence intervals graphically. In the figure, we label the components in
order of increasing component mean expenditure. In the panel on the left
in figure 9.4, we can see that the effect of age increases from about $60 in
the lowest expenditure class, to about $150 in the medium expenditure
class, to more than $400 in the highest expenditure class. These effects are
statistically different from each other. In the panel on the right, we display
the incremental effects of being female. In component 1, the effect is about
$500; in component 2, the effect is about $2,000. In component 3, the
effect of female is statistically insignificant.

Figure 9.4: Componentwise coefficients and 95% confidence intervals of expenditures (left panel: Age; right panel: Female; x axis: Component 1–3; y axis: Coefficient; finite mixture estimates and 95% CIs denoted by bars and capped lines, respectively)

Finally, we estimate the posterior probabilities of class membership


with (9.6), using the predict command in Stata. Note that we maintain the
relabeling of components relative to the raw output. As a second sanity
check, we calculate sample means of posterior probabilities. A corollary of
Bayes’s theorem is that the sample means of posterior probabilities equal
the prior probabilities. In fact, that is what the summary statistics show.
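A sketch of this step, assuming the classposteriorpr prediction available after fmm:

    predict double post1 post2 post3, classposteriorpr
    summarize post1 post2 post3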

We then classify each observation uniquely into a class on the basis of


the posterior probabilities and examine characteristics of individuals by
those predicted classes. The average age of individuals in the low-
expenditure component is greater than the average age of individuals in the
two other classes. Women are much more likely to be in the low- and
middle-expenditure classes compared with the fraction of women in the
high-expenditure class.
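For example, each observation can be assigned to its highest-probability class and the classes then summarized (a sketch; note that the class numbering here follows the raw output rather than the relabeling used in the text):

    generate byte class = 1
    replace  class = 2 if post2 > post1 & post2 > post3
    replace  class = 3 if post3 > post1 & post3 > post2
    tabstat age female, by(class) statistics(mean)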

9.3.2 MEPS example of healthcare use

In a second example, we fit finite mixture models for the number of office-
based healthcare visits. We showed in chapter 8 that the negative
binomial-1 fit this outcome well. Therefore, for this example, we estimate
finite mixtures of negative binomial-1 regressions. We begin by fitting a
two-component model—but for brevity, we do not show the results. Once
again, information criteria suggest that the three-component model is
better than the two-component one.

When we first fit the three-component model, we noticed that it took


55 iterations to converge (after 20 iterations of the EM algorithm),
including 49 iterations showing the dreaded not concave message. While
there is nothing wrong with such an iterative process, especially in the
context of finite mixture models, we were impatient, so we used the
difficult option for optimization. With this option, the final maximum-
likelihood estimation algorithm converges in seven iterations. From our
experience, we note that difficult may be a useful option in many
circumstances.
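A sketch of the command; use_off, the count of office-based visits, is an assumed variable name, and dispersion(constant) requests the negative binomial-1 parameterization:

    fmm 3, difficult: nbreg use_off age female, dispersion(constant)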

Estimates from the three-component model are shown below. The


coefficients on age and female are statistically significant in each of the
three components.

We use nlcom to obtain estimates of the class probabilities, along with
their standard errors. The three components have mixture weights of 0.64,
0.09, and 0.27.

We use margins to compute means of predicted outcomes for each of


the three components. A majority of observations in our sample, 64%,
average 3.5 visits per year. A small fraction, 9%, averages 3 office-based
visits per year, and the remaining 27% averages about 12 visits per year.

From the estimates of the predicted means, one might conclude that
two of the components are too similar to distinguish. That conclusion
would be wrong; the densities of the components are substantially different
from each other. To demonstrate this, we plot the predicted densities at the
median age and gender in figure 9.5. To make the figure easier to read, we
show only the densities through 30 visits, which exceeds the 97th
percentile of its empirical distribution of office-based visits. We also
represent the densities using (continuous) line charts, although bar charts
would be technically preferred. We use line charts because it is easier to
visualize differences in the component densities. The figure shows that
each of the predicted densities is skewed, just like the empirical density.
The density of the relatively rare component 2 is quite different from that
of the much more frequent component 1, although they have similar
means. Observations in component 3 are most likely to generate large
values of visits and much less likely to generate zero and other small-visit
values.

Figure 9.5: Empirical and predicted componentwise densities of office-based visits (x axis: # office-based visits, 0–30; legend: f(y|x,c=1), f(y|x,c=2), f(y|x,c=3); μ1 = 3.5, π1 = 0.64; μ2 = 3.0, π2 = 0.09; μ3 = 12.1, π3 = 0.27)

We take the estimated marginal effects of age, and the incremental


effects of female, and display them along with 95% confidence intervals
in figure 9.6 below. The left panel displays marginal effects of age. The
estimated marginal effect of age is negative and statistically significant for
the lowest use (on average) group. The effects of age are positive for the
middle-use group—and positive and substantially bigger for the high-use
group. Both effects are statistically significant. The effects of female are
distinctly nonmonotonic across the components, ordered in ascending
values of mean predicted use. Among the 9% of individuals in the
component with the lowest office-based use, the effect of female is large
and positive, albeit with a wide confidence interval. Among the 64% of
individuals in the component with moderate office-based use, the effect of
gender is positive and statistically significant but quite small. Among the
27% of individuals in the high-use component, the effect of female is
moderately large and statistically significant.

Figure 9.6: Componentwise coefficients and 95% confidence intervals of office-based use (left panel: Age; right panel: Female; x axis: Component 1–3; y axis: Coefficient; finite mixture estimates and 95% CIs denoted by bars and capped lines, respectively)

In this section, we have demonstrated the use of finite mixture models


to fit heterogeneous effects of covariates for measures of healthcare
expenditures and use. As the theory suggests, finite mixture models
provide a valuable way to elicit and estimate heterogeneous effects of
treatment specified in chapter 2. The fmm command in Stata is extremely
powerful and has many alternatives. One should always fit finite mixture
models after considerable thought to identification considerations. Such
models are not always identified. For example, while it is possible to code
fmm 2: logit y x, finite mixtures of binary outcomes are not generally
identified (Teicher 1963).

9.4 Nonparametric regression

Nonparametric regression techniques make few assumptions about the


functional form of the relationship between the outcome and the
covariates. Moreover, the assumptions that are made are local in the sense
that they apply to a neighborhood of each data point separately. By
contrast, parametric models make assumptions that apply uniformly
throughout the sample (and population). Therefore, nonparametric
regression is appealing precisely when the researcher is unsure of
functional forms of covariates and relationships between them—which is
almost always. Below we describe some uses of nonparametric regression
now available in Stata via the npregress command. Its desirable
properties come at a computational cost that we describe in the context of
our examples below.

More precisely, the nonparametric model of y_i given a vector of covariates x_i is given by

y_i = g(x_i) + ε_i

where E(ε_i | x_i) = 0 and g(·) is an unknown function of covariates and parameters. The conditional expectation function is given by

E(y_i | x_i) = g(x_i)

Local-linear regression , which is the default technique in npregress ,


estimates a regression for each observation in the dataset using a subset of
“nearby” observations. More precisely, for each data point x_0 in the sample, a local-linear regression is estimated by solving

min_{β_0, β_1}  Σ_{i=1}^{N} K{(x_i − x_0)/h} {y_i − β_0 − (x_i − x_0)′β_1}²    (9.7)

where K(·) denotes a kernel weighting function that assigns greater weights to observations that are closer to x_0, in that x_i is closer to x_0, and inclusion or exclusion is based on the bandwidth h. Fan and Gijbels (1996)
describe local-linear regression in detail. Racine and Li (2004) describe
how good bandwidths may be chosen.

Two features distinguish local-linear regression from OLS. First, the


coefficients are reestimated at every observation, so the intercept and
slopes vary throughout the sample in a priori unspecified ways. Second,
each regression in (9.7) is estimated using a set of kernel weights given by
K{(x_i − x_0)/h}; thus it is technically weighted least squares.

9.4.1 MEPS examples

We use nonparametric regression to estimate the effects of age, the


physical-health component score (pcs12), and an indicator for the presence
of an activity limitation (anylim) on total healthcare expenditures for
Hispanic men with any expenditures (exp_tot > 0). As a benchmark, we
first fit the model using OLS.
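A sketch of the benchmark regression; the indicator variables used for the sample restriction (hispanic and male) are hypothetical names:

    regress exp_tot age pcs12 i.anylim if hispanic == 1 & male == 1 & exp_tot > 0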

Next, we use npregress to estimate the effects nonparametrically. We


use a nonparametric bootstrap (with 100 bootstrap replications) to obtain
standard errors , test statistics, and confidence intervals, because

npregress does not produce those by default. See Cattaneo and
Jansson (2017) for formal justification of the bootstrap for the
nonparametric regression. We have used 100 replications to economize on
computational time without confirming that the number of replications is
sufficient for reliable estimates. For serious research work, we encourage
users to fit models with different numbers of bootstrap replicates before
settling on a number beyond which estimates of standard errors do not
change much.
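In outline, assuming npregress kernel's reps() option for bootstrap standard errors and the same hypothetical sample restriction as above:

    npregress kernel exp_tot age pcs12 i.anylim ///
        if hispanic == 1 & male == 1 & exp_tot > 0, reps(100) seed(12345)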

The output of npregress shows the average of the predicted means


and the averages of the predicted derivatives (changes for discrete
regressors) of the mean function with bootstrap standard errors. The
average of the observation-level effects of an activity limitation is $2,247.
Individuals with disabling activity limitations have higher healthcare
expenditures than those who do not. Note that this estimate is almost 45%
larger than the OLS estimated effect of $1,554. The average of the effects of
pcs12 is negative, as is its OLS counterpart. In both models, the
interpretation of the effect is that better health (indicated by higher pcs12
scores) is associated with lower healthcare expenditures. While the
nonparametric average of the effects is not so different from the effect
estimated by OLS, one should not conclude that OLS estimates are “good
enough”. There may well be important nonlinearities and interactions in
the effects of the physical health score and activity limitations on
expenditures that the nonparametric regression takes into account (but OLS
does not). These nonlinearities may be of substantive interest in many
applications.

To understand the effect of a change in a unit of pcs12 better, we use
margins to compute the conditional mean function at values across the
empirical distribution of pcs12 and anylim. Once again, we use 100
bootstrap replications to obtain standard errors for the predictions.
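A sketch of the calculation; the evaluation grid (seven values of pcs12 for each value of anylim, giving 14 points) is an assumption, and reps() requests bootstrap standard errors for the predictions:

    margins, at(pcs12 = (30(5)60) anylim = (0 1)) reps(100)
    marginsplot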

It is easier to understand the nonlinearities in the effects visually, so we
graph the results of margins using marginsplot . Figure 9.7 shows two
clear sources of nonlinearities. First, mean expenditures for individuals
with an activity limitation decline sharply as physical health (pcs12)
increases until about a score of 45, which is approximately the mean
physical component score in this sample. Beyond a score of 45, mean
expenditures appear to decline slowly, but the confidence intervals show
that constant expenditure cannot be ruled out. For individuals without
activity limitations, mean expenditures are roughly constant across values
of pcs12. Second, the figure suggests interactive effects of pcs12 with
anylim. Mean expenditures are substantially bigger for individuals with a
limitation, compared with those without limitations, until the physical
score reaches about 45, that is, for individuals in below-average health. For
individuals with above-average health, mean expenditures are the same for
individuals with and without an activity limitation.

Figure 9.7: Predicted total expenditures by physical health score and activity limitation (predicted values by pcs and anylim; x axis: Physical health component of SF12, 30–60; y axis: Mean Function; lines: No activity limitation, Activity limitation)

We conclude our description of the nonparametric regression estimator


with a report on computational times because they can be considerable . It
took about 2 minutes to obtain estimates of the average of effects and their
standard errors based on 100 bootstrap replications in our example with 3
regressors and just over 1,000 observations. It took about 15 minutes to
obtain the output of margins with standard errors obtained by bootstrap.
Each call to margins takes about the same amount of time as the
regression estimates; we evaluated margins at 14 values. These are quite
modest times. When we estimated the same specification using a different
sample of about 8,000 observations, the regression estimates (with
standard errors calculated using 100 bootstrap replications) took about 1.5
hours. It took about 9 hours to obtain the results of margins with 100
bootstrap replications. These calculations were done using a 4-core version
of Stata-MP on a reasonably fast Linux machine. This is the cost of an
estimator that requires almost no judgment on the part of the researcher in
terms of functional forms of covariates and relationships between them. In
some situations, the extra effort may well be worth it. In other situations,
researchers may find that getting graphs of effects without confidence
intervals (which is the source of most of the computational time) may be
sufficiently insightful.

9.5 Conditional density estimator

We end this chapter with a brief description of another flexible estimator


proposed by Gilleskie and Mroz (2004) . Their conditional density
estimator (CDE) is easy to estimate. The intuition is to break up the domain
of the dependent variable into a set of mutually exclusive and exhaustive
subdomains—which we will call bins—and focus the modeling effort on
predicting the probability of being in each bin. The mean value of the
dependent variable is assumed constant within each bin. This assumption
is reasonable when bin sizes are small. After estimation, the results can be used to compute predicted means conditional on the covariates, as well as marginal
and incremental effects.

The primary advantage of CDE is that it can be used with continuous


and count outcomes. It works with single- and multipeaked distributions .
For example, it can model the number of hours worked per week, where
there are typically peaks in the density at 20, 35, and 40 hours. CDE can
also model medical care expenditure data, incorporating zeros easily. CDE
can be extended to count models , for example, to model the number of
prescription drugs purchased in a year (Mroz 2012) . Gilleskie and Mroz
(2004) built on earlier work by Efron (1988) and Donald, Green, and
Paarsch (2000) .

The two key assumptions in CDE (see Gilleskie and Mroz [2004] ,
equations 12 and 13) are that the probability of being in a bin depends on
covariates and that the mean value of y, conditional on the bin, is
independent of the covariates. That is, there is heterogeneity across bins
but homogeneity within bins. Within a bin, covariates have no predictive
power. In this case, the CDE approach focuses on modeling the probability
of being in a bin in the best possible way.

The second assumption greatly simplifies estimation and inference,


because predicted expenditures are the same for all within the same bin.
The assumption of homogeneity within a bin is typically least likely to
hold for the highest expenditure group. In principle, the second assumption
could be relaxed by treating the last bin like the second part of a two-part
model.

Given the two main assumptions, fitting the CDE model is

straightforward. Decide on the number of bins and the threshold values
separating the bins, which do not need to be equally spaced. Fit a series of
logit (or probit) models. For example, if there are 11 bins, fit 10 logit
models, with each successive model fit on a smaller sample. For
observations i = 1, …, N and bins k = 1, …, K, the expected value of the dependent variable, E(y_i | x_i), is the mean of y for each bin, ȳ_k, times the probability of being in that bin, summed over all bins:

E(y_i | x_i) = Σ_{k=1}^{K} ȳ_k Pr(y_i ∈ bin k | x_i)

Gilleskie and Mroz (2004) demonstrate how to fit CDE in one large
logit model , as opposed to a series of individual logit models with
progressively smaller sample sizes.
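To fix ideas, here is a minimal sketch of the series-of-logits version under the two assumptions above. The variable names (y, x1, x2, and a bin indicator bin taking values 1, ..., K) are hypothetical, and no standard errors are computed.

    quietly summarize bin
    local K = r(max)
    generate double surv = 1          // running estimate of Pr(bin >= k | x)
    generate double ey   = 0          // accumulates E(y | x)
    forvalues k = 1/`K' {
        quietly summarize y if bin == `k'
        local ybar = r(mean)          // within-bin mean of y
        if `k' < `K' {
            generate byte d`k' = (bin == `k')
            quietly logit d`k' x1 x2 if bin >= `k'
            quietly predict double h`k', pr                // discrete hazard for bin k
            quietly replace ey   = ey + `ybar'*surv*h`k'
            quietly replace surv = surv*(1 - h`k')
        }
        else {
            quietly replace ey = ey + `ybar'*surv          // last bin takes the remaining probability
        }
    }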

Two issues of this model have not yet been worked out to make this
method accessible to the typical applied researcher. First, it needs a theory
and practical method for choosing the number of bins in an optimal way.
Second, although methods for the computation of standard errors are
available, coding those is beyond the scope of this book.

9.6 Stata resources

The main quantile command in Stata is qreg . To estimate a sequence of


quantile regressions, as demonstrated in section 9.2.1, use sqreg .

A new Stata command, fmm , fits a variety of finite mixture models. In


prior versions of Stata, finite mixture models could be fit with a user-
written package (Deb 2007) with the same name. With the introduction of
the new fmm command, fmm points to the official Stata command, even if
the user has the package already installed. In older versions of Stata, to run
finite mixture models, use the package fmm9 . Install this package by
typing ssc install fmm9 (Deb 2007) .

Nonparametric regressions can be estimated in Stata using npregress .


More specifically, npregress estimates local linear and local constant
regressions to produce observation-level effects of covariates on an
outcome.

Chapter 10
Endogeneity

10.1 Introduction

The models in this book specify how a dependent variable is generated


from exogenous, observed covariates and unobserved errors. A covariate
that is uncorrelated with the unobserved errors is exogenous. In contrast, a
covariate that is correlated with the unobserved errors is endogenous.
Endogeneity is a problem because correlation between a covariate and the
error term will cause inconsistency in the estimated coefficients in the
models we have described so far. In the parlance of chapter 2, the
ignorability assumption does not hold, and estimates of treatment effects
will not be consistent if we use models that ignore the endogeneity.

One common cause of endogeneity in health economics is unobserved


health status . Consider a model that predicts healthcare expenditures as a
function of out-of-pocket price. Even after controlling for observed health
status, unobserved health status remains in the error term because health
status is hard to measure accurately and fully. Unobserved health status is
correlated with both out-of-pocket price and healthcare expenditures.
Therefore, an ordinary least-squares (OLS) regression (or any of the other
methods for estimating model parameters described in previous chapters)
of healthcare expenditures on out-of-pocket price will produce inconsistent
estimates of the coefficient on out-of-pocket price. In some literature, these
omitted variables are called confounders .

In this chapter, we provide a brief introduction to the issues raised by


endogeneity and a few available solutions. Many econometric methods to
control for endogeneity use variables, called instrumental variables (IVs) ,
that predict the endogenous variable but do not directly predict the main
dependent variable. We use simulated data to show how to use IVs to
control for endogeneity. We begin our description with a discussion of
two-stage least squares (2SLS) to solve the problem of inconsistency
because of endogeneity. In this context, we discuss standard statistical tests
for endogeneity and for instrument validity. The control function approach
is an important 2SLS extension to deal with endogeneity in models in which
either the outcome or the endogenous regressor is not continuous, thus
suggesting the use of nonlinear models. A common version of control
functions is two-stage residual inclusion (2SRI). We also describe Stata’s
extended regression models (ERM), a unified framework for dealing with
endogeneity in linear and some nonlinear models. This framework allows

for endogenous and exogenous covariates and will estimate treatment
effects under the assumption of jointly normal errors. Finally, we also
briefly describe the generalized method of moments (GMM) for IVs
estimation, because GMM can have substantial benefits compared with 2SLS.

This chapter is not a comprehensive overview of econometric issues in


endogeneity, measurement error, and use of IVs. For further information on
models with endogeneity, we direct readers to the general econometric
literature. There are excellent summaries on this matter in recent
econometric textbooks by Cameron and Trivedi (2005) ,
Wooldridge (2010) , and Angrist and Pischke (2009) and in review articles
by Newhouse and McClellan (1998) , Angrist and Krueger (2001) , and
Murray (2006) .

10.2 Endogeneity in linear models

10.2.1 OLS is inconsistent

We use a simple, canonical example to help clarify how endogeneity can


create inconsistent coefficient estimates. The outcome of interest, y_1, is continuous and determined by a linear equation

y_1 = β_0 + β_1 x + β_2 y_2 + u + ε_1    (10.1)

where x and y_2 are observed characteristics (covariates). Think of u as being an unobserved covariate and ε_1 as a typical linear regression error term. The β are parameters to be estimated. The covariate x is exogenous; it is uncorrelated with both u and ε_1. The covariate y_2 is determined by a linear equation

y_2 = α_0 + α_1 x + α_2 w + u + ε_2    (10.2)

where w is an exogenous covariate and ε_2 is the regression error term. Note that the unobserved term, u, enters both regression equations. The reason that y_2 is endogenous in the model defined in (10.1) is that it is a function of unobserved u and therefore correlated with the composite error u + ε_1. The model defined in (10.1) violates the assumption that the errors are uncorrelated with observed characteristics, so OLS estimates of its parameters will not be consistent. For example, let y_1 be healthcare spending and y_2 be out-of-pocket price. If both spending and out-of-pocket price depend on observed characteristics such as age and education (x), and unobserved health status (u), then the OLS estimates of the effect of out-of-pocket price on spending will be inconsistent.

Note that we have not yet described the purpose of including the covariate w in (10.2). We will do so in the next section, where we describe solutions to the problem described here. Note also that the logic described above applies even if ε_1 and ε_2 are uncorrelated with each other; the composite error terms, u + ε_1 and u + ε_2, are correlated, thus rendering OLS estimates of parameters of (10.1) inconsistent. In fact, one need not construct this example with an unobserved covariate u as distinct from the error terms ε_1 and ε_2. As long as the errors of (10.1) and (10.2) are correlated, OLS estimates of the model for the outcome, y_1, will be inconsistent.

We now use artificial data to demonstrate empirically how the omitted


variable leads to inconsistency in OLS estimates. We generated a dataset
with 10,000 observations. As shown below, we begin by drawing data for
the observed covariates x and w from standard normal distributions. We
also draw the unobserved variables, u, e1, and e2, from standard normal
distributions. Each of these random variables is drawn independently. The
endogenous covariate y2, following (10.2), is a function of x, u, the
unobserved error e2, and a covariate w. Finally, we generate the
dependent variable, y1, using (10.1) as a function of an exogenous
covariate x, an endogenous covariate y2, the unobserved covariate u, and
the unobserved error e1. Note that w enters the function for y2 but does not
explicitly enter the function for y1. The variable w affects only y1 through
its effect on y2. The unobserved component u affects both y1 and y2,
which causes y2 to be endogenous. If we could observe u, then we could
include u in the model for y1, and y2 would be exogenous instead of
endogenous. For purposes of the example, we draw all the variables using
a standard normal distribution random-number generator using the
following commands:

When we fit the OLS model with u included as an observed covariate,


the estimated coefficient on y2 is a consistent estimate of the true value of
1.0. We find that the estimated coefficient is close to 1.0.

Next, we omit u from the regression model. We know that the OLS
estimate of the coefficient on y2 is inconsistent. In the example, the OLS
estimate of the coefficient on y2 is 1.48 when u is omitted, and the
confidence interval is far away from the true value of 1.0. The estimated
coefficient when u is omitted is greater than 1.0, because the composite
error terms of the two equations, (10.1) and (10.2), are positively
correlated.
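The two benchmark regressions are:

    regress y1 y2 x u     // u observed: consistent estimate of the coefficient on y2
    regress y1 y2 x       // u omitted: inconsistent estimate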

Typically, in real empirical work, a variable like u is not observed (or


else we would just include it in the model and avoid all the problems of
controlling for endogeneity), so we need methods other than OLS to recover
a consistent estimate of the coefficient on y2. The rest of this section
shows ways to recover consistent estimates of the coefficient on y2 using
IVs.

10.2.2 2SLS

The existence of an exogenous variable, w, that enters the equation for the endogenous regressor, y_2, in (10.2) but that does not directly determine the outcome of interest, y_1, in (10.1) (except through its effect on y_2) is key to
a large class of solutions to obtain consistent estimates of parameters of the
outcome equation. Such variables are often called instruments or IVs. Valid
instruments have two essential properties :

1. The instrument is uncorrelated with the main dependent variable,


except through its influence on the endogenous variable.

2. The instrument is strongly correlated with the endogenous variable.

A substantial literature has shown that merely being statistically


significantly correlated with the endogenous variable is not enough; the
instrument must be strongly correlated (Nelson and Startz 1990; Bound,
Jaeger, and Baker 1995; Staiger and Stock 1997) . Instruments that are not
strongly correlated with the endogenous variable are called weak
instruments . Weak instruments produce estimates that are inconsistent.

2SLS is a common method to control for endogeneity. As its name


implies, 2SLS conceptually involves estimating two equations. The first
stage models the endogenous variable (in this example, there is only one
endogenous variable) as a function of all the exogenous variables and at
least one IV. The second stage involves estimating the parameters of the
equation for the outcome in which the predicted value of the endogenous
regressor from the first stage replaces the endogenous variable itself. The
goal is to purge the endogenous variable of its error term—which is
correlated with the error term in the main equation—and instead make the
predicted endogenous variable a function of the instrument, which is
exogenous. In the context of our example, the first stage consists of estimating the parameters of

y_2 = γ_0 + γ_1 x + γ_2 w + ν_2

by OLS to obtain predictions denoted ŷ_2. The second stage consists of estimating the parameters of

y_1 = β_0 + β_1 x + β_2 ŷ_2 + η    (10.3)

where ŷ_2 has replaced y_2 and η is the implied composite error term.

Readers should note two points: First, 2SLS is considerably more


general than we describe in our example. Second, Stata (and other
statistical software) does not actually fit the model in two stages. It can be
solved using a “one-stage” formula.

We now fit the model using 2SLS with the goal of estimating a
consistent estimate of the causal effect of y2 on y1. To estimate 2SLS in
Stata, we use ivregress with the first option to see regression results
from both stages. The first-stage regression predicts the endogenous
variable, y2, as a function of the instrument, w, and the exogenous variable,
x. Both of these variables are strong predictors of y2—which is not
surprising, because that is how we generated the data.
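Concretely, the command is:

    ivregress 2sls y1 x (y2 = w), first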

The second-stage regression (for the outcome of interest) predicts y1 as

a function of the predicted endogenous variable, y2, and the exogenous
variable, x. The 2SLS estimate is 0.932 with a standard error of 0.075. The
estimate is close to 1 in magnitude, not statistically significantly different
from 1.0 if one conducted a simple test. However, it has a much wider
confidence interval than found with OLS with no endogeneity. This
example demonstrates two important points: First, with a valid instrument,
the 2SLS estimate is much closer to the true value than the OLS estimate.
Second, the standard errors are typically much larger than OLS. There is a
tradeoff between consistency and precision.

10.2.3 Specification tests

As always, it is important to test both the validity of the instruments and


whether endogeneity is in fact a problem. Stata’s ivregress
postestimation commands include the standard tests. The first set of tests
are whether the instruments are strongly correlated with the potentially
endogenous variable. In addition to inspecting t statistics on the instruments individually in the first-stage regression, you must inspect the F statistic on the joint test of significance of all the instruments. A joint test is important with multiple instruments because correlation among instruments—if there are more than one—can reduce individual t statistics and mask joint statistical significance. The rule of thumb is that when there is one IV, the F statistic should be at least 10 (Nelson and Startz 1990; Staiger and Stock 1997; Stock, Wright, and Yogo 2002).

The estat firststage command reveals that the F statistic on the one instrument in the example is well above the minimum recommended threshold. We conclude that the instrument, w, strongly predicts the endogenous variable.

The second test is whether the potentially endogenous variable is
actually exogenous. This test is conditional on all the instruments being
valid. The estat endogenous command shows that the p-value is below
0.05, meaning the test rejects the null hypothesis of exogeneity
(conditional on the instrument being valid). We will treat y2 as
endogenous.
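The two postestimation tests described above are run as:

    estat firststage
    estat endogenous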

In situations when there are more instruments than endogenous


regressors, it is possible to test the overidentifying restrictions . That is, we
could test the assumption that the additional instruments are unrelated to
the main dependent variable, conditional on one of the instruments being
valid. For a single instrument, one must rely on theory and an
understanding of the institutions studied. In addition to these statistical
tests, researchers also often perform balance tests to argue that values of
covariates are balanced across different values of the instrument. If the
instrument is purely random, it should be uncorrelated with the other
exogenous variables (of course, it still needs to be correlated with the
endogenous variable).

10.2.4 2SRI

There is another equivalent way to construct 2SLS estimates. Let ν̂_2 = y_2 − ŷ_2. Then, by construction, y_2 = ŷ_2 + ν̂_2. By replacing ŷ_2 = y_2 − ν̂_2 in (10.3), we get

y_1 = β_0 + β_1 x + β_2 (y_2 − ν̂_2) + η

Estimating this equation by OLS (treating ν̂_2 as an additional control regressor) yields an estimate of β_2 that is identical to the 2SLS estimate.
This estimator, the 2SRI, is a specific kind of control function estimator.
We introduce 2SRI not merely to reproduce the 2SLS results but as a bridge
to the discussion of endogeneity in models in which the endogenous
regressor is better modeled nonlinearly, or more generally nonlinear
models with endogenous regressors.

Further intuition can be found if we tease apart the error from the main
equation into two pieces—the part correlated with the endogenous variable
and the part that is independent. A control function is a variable (or
variables) that approximates the correlated part. Newey, Powell, and Vella
(1999) proved that there exists an optimal control function. If a researcher
could observe such a variable, then including it in the main equation would
be like including the omitted variable that caused the endogeneity. The
remaining error would be uncorrelated with the endogenous variable.

We demonstrate the use of 2SRI in linear models using the example


data. First, we estimate the first-stage regression of y2 as a function of
exogenous x and instrument w. We then compute residuals nu2_hat. We
then estimate the outcome variable y1 as a function of endogenous y2,
exogenous x, and the estimated residual nu2_hat. As expected, the
estimated coefficient on y2 in the second equation is identical to the 2SLS
result, 0.932.

We bootstrapped the standard errors because estimating 2SRI in two
steps requires either bootstrapping the standard errors or using GMM (see
section 10.4). To understand why this is necessary, recall that 2SRI inserts
the predicted first-stage error into the main equation. The standard errors
computed by regress do not reflect the fact that this is an estimate of the
true error. Therefore, the regress standard errors are too small. In
contrast, bootstrapping is not necessary for ivregress .
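A sketch of the two steps, wrapped in a program so that the standard errors can be bootstrapped; the program name and the number of replications are illustrative:

    capture program drop tsri
    program define tsri, rclass
        capture drop nu2_hat
        quietly regress y2 x w
        quietly predict double nu2_hat, residuals
        quietly regress y1 y2 x nu2_hat
        return scalar b_y2 = _b[y2]
    end

    bootstrap b_y2 = r(b_y2), reps(500): tsri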

In linear models, the estimated coefficient on the endogenous variable


in 2SRI is identical to the estimated coefficient on the predicted endogenous
variable in 2SLS. However, this identity does not hold in nonlinear models.
Control functions can be a consistent way to control for endogeneity in
nonlinear models. 2SRI is one specific type of control function. As
explained by Terza, Basu, and Rathouz (2008) and Wooldridge (2014) ,
2SRI remains consistent in nonlinear models, whereas 2SLS is not consistent.

However, the functional form of the control function matters, and in some
cases, alternative functional forms are necessary. Control function methods
can be used for binary outcomes (logit and probit), other categorical
outcomes (ordered and multinomial models), count models, duration or
hazard models, and generalized linear models.

10.2.5 Modeling endogeneity with ERM

Another way to model endogeneity uses the new ERM commands in Stata.
The ERM commands provide a unified way to model linear and some
nonlinear models with both exogenous and endogenous covariates. In
particular, the ERM commands can model linear, binary, ordered
multinomial, or interval outcomes along with continuous, binary, ordered
multinomial, or interval endogenous variables. The ERM commands also
allow modeling of selection and the computation of treatment effects in
these contexts.

In this section, we limit our discussion to using ERM in the context of


our example. To remind readers, the model is given by (10.1) and (10.2). In addition, assume that the joint distribution of the composite errors of the two equations, u + ε_1 and u + ε_2, is bivariate normal with zero means and unrestricted variances and covariance.
This model is implemented in Stata via the eregress command.

Because ERM assumes joint normality of the error terms, the two
equations are estimated using maximum likelihood, or to be more precise,
full-information maximum likelihood (FIML) . This has the advantage of
better efficiency compared with 2SLS if the joint normality assumption is
correct. However, if the joint normality assumption is incorrect, then the
FIML model in this case still produces consistent estimates, but it no longer
has any efficiency gains.

We next demonstrate the use of the linear ERM command eregress for
the generated data from section 10.2.2 to predict y1 as a function of
endogenous y2 with instrument w. We will then compare the results from
eregress with the 2SLS results.

The syntax for eregress is slightly different from that for ivregress .
The expression for the endogenous variable is put after the comma, and the
first option is not needed because eregress automatically reports the
first-stage results. As expected, the results are very similar but not
numerically identical to the 2SLS results. The estimated coefficient is close
to 1.0, the true value. The standard error is also similar to the standard
error found by 2SLS.
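Concretely, the command is:

    eregress y1 x, endogenous(y2 = x w)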

10.3 Endogeneity with a binary endogenous variable

We next show how to fit a model with a continuous main dependent


variable and a binary (nonlinear) endogenous variable. This example
builds on several ideas presented in section 10.2. The outcome of interest, y_1, is continuous and determined by a linear equation as before,

y_1 = β_0 + β_1 x + β_2 y_2 + u + ε_1

except that the covariate y_2 is now binary and takes only two values, 0 and 1. Let

y_2 = 1(α_0 + α_1 x + α_2 w + u + ε_2 > 0)    (10.4)

where 1(·) denotes the indicator function returning values of 1 if the argument is greater than 0, and 0 otherwise. Let the distribution of u + ε_2 be standard normal. In other words, y_2 is determined by a probit model. Even if ε_1 and ε_2 are uncorrelated with each other, the composite error terms, u + ε_1 and u + ε_2, are correlated, thus rendering OLS estimates of the outcome equation inconsistent.

We estimate the parameters of this model using three methods that


account for the endogeneity of y_2. 2SLS in this context ignores the discrete nature of y_2. 2SRI and eregress with the probit option both explicitly model y_2 as a binary outcome. All three estimators are consistent, so our
example provides only a sense of finite sample differences.

As in section 10.2.1, we use a simulated dataset with 10,000


observations. As the list of commands shown below demonstrates, we
begin by drawing data for the observed covariates, x and w, from standard
normal distributions. The unobserved variables, u, e1, and e2, are drawn
from independent normal distributions with zero mean and variances equal
to 0.5. Thus the variance of the sum of u and e2 equals 1, which is
convenient for the interpretation of probit coefficients. We have also
scaled the distribution of e1, but that has no substantive significance. The
binary endogenous covariate y2, following (10.4), is a function of x, u, the

unobserved error e2, and a covariate w. Finally, we generate the
dependent variable, y1, using (10.1) as a function of an exogenous
covariate x, the endogenous binary covariate y2, the unobserved covariate
u, and the unobserved error e1.
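A sketch of the commands consistent with this description. The intercept of -2.2 in the y2 equation is an assumption chosen only so that y2 equals 1 roughly 10% of the time, and the remaining unit coefficients are also assumptions.

    clear
    set seed 12345
    set obs 10000
    generate x  = rnormal()
    generate w  = rnormal()
    generate u  = rnormal(0, sqrt(0.5))
    generate e1 = rnormal(0, sqrt(0.5))
    generate e2 = rnormal(0, sqrt(0.5))
    generate y2 = (-2.2 + x + w + u + e2 > 0)   // equation (10.4): binary endogenous regressor
    generate y1 = x + 1.0*y2 + u + e1           // equation (10.1); true coefficient on y2 is 1.0
    tabulate y2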

We calculate the frequency distribution of y2 to show that it is equal to


1 about 10% of the time. Similar rates are quite common in empirical
research. We will describe its relevance below.

We first estimate an OLS regression that does not account for the
endogeneity of y2. We know that the estimates from such a regression will
be inconsistent. We see that the OLS estimate of the coefficient on y2 is
1.76 and the confidence interval is far away from the true value of 1.0.
Estimators that do not account for endogeneity can be misleading.

We now fit the model using 2SLS. Although 2SLS ignores the
discreteness of y2, it produces consistent parameter estimates
(Wooldridge 2010). The 2SLS estimate of the coefficient on y2 is 0.632. It
does not appear to be close to the true value of 1.0, but because its
standard error is large (0.405), it is not statistically different from 1.0. The
estimate is also not statistically different from zero. Although 2SLS is
consistent, the efficiency loss appears to be quite large. We return to this
issue below.

We now fit the model using 2SRI and estimate standard errors using a
nonparametric bootstrap. We use the probit model (probit) for the first
stage, which produces maximum likelihood estimates. We then compute
residuals nu2_hat and include them in an OLS regression of y1 as a
function of endogenous y2, exogenous x, and the estimated residual
nu2_hat. The estimate of the coefficient on y2 is 0.98, which is quite close
to the true value of 1.0. The estimated standard error is 0.1, which is
considerably smaller than the 2SLS estimate of the analogous standard
error.
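A sketch of the 2SRI steps with a probit first stage, using the response residual (the bootstrap wrapper parallels the one shown earlier; option values are illustrative):

    capture program drop tsri_probit
    program define tsri_probit, rclass
        capture drop p2hat
        capture drop nu2_hat
        quietly probit y2 x w
        quietly predict double p2hat, pr
        quietly generate double nu2_hat = y2 - p2hat   // response residual
        quietly regress y1 y2 x nu2_hat
        return scalar b_y2 = _b[y2]
    end

    bootstrap b_y2 = r(b_y2), reps(500): tsri_probit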

Next, we fit the model using eregress with the probit option . This is
a FIML estimator of the model; it accounts for joint normality of the error
terms of the two equations. The estimated coefficient is 1.01 and its
standard error is 0.01, suggesting further efficiency gains.
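The command adds the probit suboption to endogenous():

    eregress y1 x, endogenous(y2 = x w, probit)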

We saw earlier that, while the 2SLS estimator is consistent, it does not
produce precise estimates of the parameter of interest in this example.
Both 2SRI and FIML estimators perform much better. However, the
performance of the 2SLS estimator can be improved. Recall that y2 takes
the value 1 about 10% of the time. It is in these ranges of rates of binary
outcomes when nonlinear estimators like the probit have the most gains
relative to an estimate from a linear probability model. But the predictive
performance of the linear probability model can be improved by
introducing nonlinear functions of the covariates. We consider a quadratic
polynomial of x and w to illustrate. The first-stage results table shows the
coefficient estimates from such a model. Two out of three of the higher-
order terms are statistically significant. The results of the second-stage
regression show that the point estimate of the coefficient on y2 is now
0.995—very close to 1.0. The standard error of the estimate is 0.28, which
is a substantial improvement over the standard error in the first set of 2SLS.

Readers should note a feature of the specification of the model using
ivregress. The polynomial terms that involve w are included in the set of
instruments that affect only y2 directly. The polynomial term that involves
only x, c.x#c.x, appears in both the first and second stages of the
regression because it is a common exogenous regressor. The estimated
coefficient on c.x#c.x is very close to zero and not statistically significant
in the regression of y1, as it should be. Including it in the fitted model, and

not the true model, has no deleterious effect.
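The specification just described corresponds to:

    ivregress 2sls y1 x c.x#c.x (y2 = w c.w#c.w c.x#c.w), first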

10.3.1 Additional considerations

In Stata, the suite of ERM commands (eregress , eprobit , eoprobit , and


eintreg ) allows users to fit models with linear (continuous), binary,
ordered multinomial and interval outcomes, and endogenous regressors
with the same class of characteristics. These cover many empirical
situations. While they make multivariate normality assumptions, they may
work well even when the errors are not jointly normal (Wooldridge 2014) .
Stata also implements control-function estimators for ivtobit and
ivpoisson .

Generally, in nonlinear models , control function estimators are


consistent when 2SLS is inconsistent or infeasible (for example, if the
endogenous regressor is unordered multinomial). The control function
approach to endogeneity is discussed more in Blundell and
Smith (1989, 1994); Vella and Verbeek (1999); Terza, Basu, and
Rathouz (2008); and Wooldridge (2010, 2014) .

When the endogenous regressor is modeled using a nonlinear


estimator, the best way to create residuals is unclear. For example, in the
binary and multinomial cases, in addition to response residuals (the
difference between the dependent variable and the predicted probability),
there are Pearson residuals, Anscombe residuals, and deviance residuals.
More generally, control functions can be more general functions of the
residuals, however they are created. One could, in principle, include
polynomials of the residuals or other transformations of the residuals
(Garrido et al. 2012; Wooldridge 2014) . Choices may vary from model to
model.

10.4 GMM

Another way to fit models that control for endogeneity is with GMM. The
essence of GMM is to write down moment conditions of the model and to
replace parameters with their sample analogue. For example, the simple mean, μ, of a random variable y has the property that the expected value of the difference between y and the mean, μ, is zero. That is, E(y − μ) = 0. Inserting the sample analogue of the expectation and solving yields the familiar result that the estimated mean, μ̂, is the simple mean of y. GMM can be used to fit all the estimators
in this book except nonparametric kernel estimators.

We want to discuss GMM at least briefly, because for some models,


GMM has advantages over maximum likelihood estimation. (See Cameron
and Trivedi [2005, 2010] for longer discussions of GMM.)

First, we show how to reproduce 2SLS results with GMM for the artificial
data example above. However, there is a slight difference in the standard
error , a difference explained below. The ivregress command can
estimate GMM methods directly by specifying the gmm option.
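Concretely, the command is:

    ivregress gmm y1 x (y2 = w)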

The results are nearly the same as before, with only a slight difference
in the standard errors (and corresponding test statistics and p-values). The
difference is that the GMM standard errors are smaller by a factor of
.

One advantage of GMM is that when there are multiple instruments ,

GMM can be estimated with unbalanced instrument sets. For example,
suppose that there are three instruments, but some observations are
missing for the third instrument. In 2SLS, one would have to either drop an
instrument or use a subsample of the data with no missing data. Either
way, information would be lost. In contrast, GMM can use whatever
information is available, using two instruments for some observations and
three for others.

Another advantage is that GMM gets correct standard errors in the 2SRI
process in one step.

In addition to fitting 2SRI for models with endogeneity, GMM can

estimate multiple equations simultaneously. This can be useful in health
economics for fitting two-part models all in one command. The Poisson
count model is especially easy to fit with GMM; therefore, Poisson with IVs
is also straightforward. However, other count models are not as easy to
implement in GMM.

With GMM, there is an additional test statistic to test the


overidentification assumption. Hansen’s J statistic is a weighted average
of the score function of the instruments times the residuals. Under
homoskedasticity, the J statistic has the interpretation of the explained
sum of squares from a regression of the 2SLS residuals on a vector of the
instruments.

10.5 Stata resources

One Stata command to fit linear models with IVs is ivregress —which
can be estimated with 2SLS , limited information maximum likelihood, or
GMM. The estat commands, part of the ivregress postestimation
commands, make it easy to run statistical tests of the main assumptions of
strength and validity of the instruments. In addition, Stata has a unified set
of commands, called ERM, that allow for estimation of linear and some
nonlinear models, where the covariates can be exogenous or endogenous.
The basic command for linear models is eregress , and the command that
estimates treatment effects is etregress .

For nonlinear models, Stata will estimate 2SRI for probit models with
the ivprobit command and the twostep option, for tobit models with the
ivtobit command and the twostep option, and for Poisson models with
the ivpoisson command and the cfunction option. The ERM commands
can be used for probit and ordered probit models with endogenous
covariates. The basic command for probit models is eprobit . The results
from ivprobit without the twostep option are identical to those from eprobit,
although the syntax is slightly different.
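
The following lines sketch the syntax of these commands with placeholder variable names (y a continuous or count outcome, d a binary outcome, w the endogenous regressor, x1 and x2 exogenous covariates, and z1 and z2 instruments); the option choices shown, such as the censoring limit in ivtobit, are illustrative only.

    ivprobit  d x1 x2 (w = z1 z2), twostep
    ivtobit   y x1 x2 (w = z1 z2), ll(0) twostep
    ivpoisson cfunction y x1 x2 (w = z1 z2)
    eprobit   d x1 x2, endogenous(w = z1 z2)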

Stata has extensive capabilities for fitting GMM models, not just the GMM
version of linear IVs. The gmm command (as opposed to the gmm option with
ivregress ) can estimate multiple equations simultaneously.

Chapter 11
Design effects

11.1 Introduction

So far, we have described econometric methods and analyzed data from


the 2004 Medical Expenditure Panel Survey (MEPS) dataset as if the data
were collected using simple random sampling. However, in many research
studies, either the design of the data or the study objectives may require
the analyst to pay attention to features related to complex sampling.
Specifically, observations in surveys are often not drawn with equal
probability. For example, observations within households, hospitals,
counties, and healthcare markets are correlated in unobserved ways—or
the dataset is incomplete because of refusal, attrition , or item nonresponse.
Such issues are often intrinsic in the design of survey data.

The literature on survey design issues and statistics for data from
complex surveys is large and detailed (for example, Skinner, Holt, and
Smith [1989]) . On the other hand, the discussion of these issues in
standard econometrics textbooks is sparse; exceptions include Cameron
and Trivedi (2005) , Deaton (1997) , and Wooldridge (2010) . Our
objective here is not to survey that entire field of literature but rather to
provide an introduction to the issues, intuition about the consequences of
ignoring design effects, and some basic approaches to control for design
effects through examples.

When large, multipurpose surveys such as the MEPS are conducted, a


simple random sample is rarely collected for a variety of well-known
financial and statistical efficiency reasons. When more complex sampling
methods are used, estimates of model parameters, marginal and
incremental effects, and inference may be more reliable if the specifics of
the sampling design are accounted for during estimation. Indeed, if you
ignore the sampling design entirely, the standard errors of parameter
estimates will likely be underestimated, possibly leading to results that
seem to be statistically significant, when in fact, they are not. The
difference in point estimates and standard errors obtained using methods
with and without survey-design features will vary from dataset to dataset
and between variables within the same dataset. There is no practical way
to know beforehand how different the results might be. Arguably, this lack
of specificity has led to economists generally underemphasizing complex
survey-design issues in econometrics. Solon, Haider, and
Wooldridge (2015) provide a recent discussion of these issues. Cameron and

Trivedi (2005) note, however, that the effect of weighting tends to be
much smaller in the regression context where the focus is on the
relationship between a covariate and an outcome.

For many research questions and study designs, the correlations


between groups of observations on unobserved attributes can play an
important role in correct inference. Consider, for example, a typical survey
of individuals in which households are randomly sampled, but all
individuals within a household are surveyed. Suppose further that income
is measured only at the household level but that health insurance status and
healthcare use are both measured at the individual level (so there is
variation within households). Then, income is, by definition, perfectly
correlated within a household for every household. Thus, if a researcher is
interested in the difference in income between insured and uninsured
individuals or the income elasticity of healthcare use—where healthcare
use is measured at the individual level—then not accounting for the
correlation between individuals in a household will cause the standard
errors of the estimates to be understated (Moulton 1986, 1990) .

Furthermore, in many studies in which difference-in-differences


designs (comparing treatment and control groups across preperiods and
postperiods) are used, both the treatment and control groups may consist of
sets of units that are considerably more aggregated than the unit of
observation. For example, states or counties may be the level at which the
treatment of interest is applied, but observations may be at the individual
level. Thus the treatment assignment will be common to all units within
the states or counties, causing observations within those units to be
correlated. In such situations, too, summary statistics and estimates from
regressions based on individual observations will yield standard errors that
are too small, thus distorting the size of inference test statistics (Bertrand,
Duflo, and Mullainathan 2004) .

Below, we begin by describing some key features of many survey


designs and the consequences of ignoring such design features. Then, we
provide a number of examples using our MEPS sample—first in the context
of calculating summary statistics and then in the regression context. Also
note that some of these issues also occur in randomized controlled trial
data, administratively collected data such as medical claims, and in data
collected in other ways. Thus, although the descriptions below are framed
in the survey data context, the issues apply to many empirical data

analyses.

11.2 Features of sampling designs

Sampling designs for large surveys can be quite complex, but most of
them share two features. First, observations are not sampled with equal
probability; thus each observation is associated with a weight that indicates
its relative importance in the sample relative to the population. Second,
observations are not all sampled independently but instead are sampled in
clusters. Stratification by subgroup is one important kind of clustering.
Then, each observation in the sample is associated with a cluster identifier.
Ignoring both weights and clusters can lead to misleading statistics and
inference based on simple random sampling.

11.2.1 Weights

There are many types of weights that can be associated with a survey.
Perhaps the most common is the sampling weight , which is used to denote
the inverse of the probability of being included in the sample because of
the sampling design. Therefore, observations that are oversampled will
have lower weights than observations that are undersampled. In addition,
postsampling adjustments to the weights are often made to adjust for
deviations of the data-collection scheme from the original design. In Stata ,
pweights are sampling weights. Commands that allow pweights typically
provide a vce(cluster clustvar) option, described below. Under many
sampling designs, the sum of the sampling weights will equal the relevant
population total instead of the sample size.

In MEPS , minorities were oversampled by design, so there were


sufficient observations for each minority group to allow analysts to obtain
reliable estimates of interest for each of those groups. Households
containing Hispanics and blacks were oversampled at rates of
approximately 2 and 1.5 times, respectively, the rate of remaining
households. As a consequence, however, averages or counts taken over the
entire sample without adjustments for the oversampling will not be
generally representative of the population. Once sampling weights are
accounted for, the relatively large number of observations for minorities
will be appropriately downweighted when aggregates and averages are
estimated.

Many Stata commands also allow one or more of three additional types

of weights: fweights , aweights , and iweights . We briefly describe the
application of each below, but note that they are not generally considered
as arising from complex survey methodology. Frequency weights
(fweights) are integers representing the number of observations each
sampled observation really represents. Analytic weights (aweights) are
typically appropriate when each observation in the data is a summary
statistic, such as the count or average, over a group of observations or to
address issues of heteroskedasticity. The prototypical example is the
case of rates. For example, consider a county-level dataset in which
each observation consists of rates that measure socioeconomic
characteristics of people in the county in a particular year. Then, the
weighting variable contains the number of individuals over which the
average was calculated. Finally, most Stata commands allow the user to
specify an importance weight (iweight). The iweight has no formal
statistical definition but is assumed to reflect the importance of each
observation in a sample.

11.2.2 Clusters and stratification

As mentioned above, individuals are not sampled independently in most


survey designs. Collections of observational units (for example, states,
counties, or households) are typically sampled as a group known as a
cluster. The purpose of clustering is usually to save time and money in the
sample collection, because it is often easier to collect information from
people who live in close proximity to each other than to have a true
random sample. There may also be further subsampling within the clusters.
The clusters at the first level of sampling are called primary sampling units
(PSUs) . For example, in a sampling design in the context of the United
States, states might be sampled first, and then counties within each state
might be sampled, and then individuals in each selected county might be
sampled. States would then be the PSUs. Many survey designs do not use
the same sampling method at all levels of sampling. For example,
proportional-to-size sampling may be used to select states and counties
within states, while simple random sampling may be used to select
individuals within counties.

Stratification is another specific type of clustering. Stratification


partitions the population into distinct groups, often by demographic
variables such as gender, race, or socioeconomic status. The purpose of
stratification is to make sure that certain groups are fully represented in the

final survey sample. Once these groups have been defined, researchers
sample from each group, as if it were independent of all the other groups.
For example, if a sample is stratified on gender, then men and women are
sampled independently of one another. Often, sampling weights are subject
to poststratification, which is a method for adjusting the sampling weights
to account for underrepresented groups in the population, often due to
systematic refusal or nonresponse of some sort (Skinner, Holt, and
Smith 1989) .

Most Stata commands that produce inferential statistics allow for the
vce(cluster clustvar) option, where clustvar is the variable that defines
the clusters. When specified, this option changes the formula for the
standard errors . The “sandwich” formula allows for correlation among
observations within clusters but assumes independence of observations
across clusters. Typically, cluster-corrected standard errors are larger than
the corresponding naïve ones (Moulton 1986, 1990) . If the variable in
question varies independently within clusters, there will be almost no
correction. If observations are negatively correlated within a cluster, the
adjustment can make standard errors smaller; however, this circumstance
is rare.

The MEPS Household Survey is based on a stratified multistage sample


design (to be precise, it is based on a frame of another large survey—the
National Health Interview Survey—in the previous year). The first stage
of sample selection was an area sample of PSUs , where PSUs generally
consisted of one or more counties. Many PSUs were selected with certainty
to ensure representation. Within each PSU, density strata were formed using
1990 Census population distributions of Hispanic persons and black
persons for single or groups of blocks. Within each density stratum,
“supersegments” were formed—consisting of clusters of housing units.
Households within supersegments were randomly selected for each
calendar year. Thus the design of the MEPS incorporates features of
sampling units and stratification.

Note that the asymptotic theory of cluster-adjustments requires that the


number of clusters be large. This is what we implicitly assume in the
descriptions below, but we point the reader to Cameron, Gelbach, and
Miller (2008, 2011) for a description of the issues and solutions for
situations in which the number of clusters is small.

11.2.3 Weights and clustering in natural experiments

The data used to analyze natural experiments are often obtained from
administrative databases, which are not collected using complex survey
procedures. Nevertheless, issues of clustering and attrition may also be
extremely important in statistical analyses of such data. For example,
suppose we are interested in evaluating the effects of a new surgical
technique for a specific health condition that has been implemented in
some, but not all, hospitals. The intervention is applied at the hospital level
—that is, all patients in the treated hospitals are subject to the new
technique, while all patients in the control hospitals are subject to the old
technique. The data consist of retrospective administrative records of all
patients with that diagnosis from the population of hospitals obtained
before and after the new technique was implemented. One possible way to
estimate the treatment effect would be to use a difference-in-differences
method, comparing the patients in treatment hospitals with those in control
hospitals, while also controlling for trends over time.

The use of difference-in-differences designs has become quite popular


in the health economics and health services literature. Although the
regression models are fit at the patient level, and observed hospital-level
characteristics may be included as regressors, it is typically not possible to
rule out the possibility of unobserved hospital-level heterogeneity or
clustering within hospitals. Such clustering will generally imply that the
usual standard errors of the parameter estimates will be incorrect and too
small, because the errors within a cluster are likely to be positively
correlated, so we will be more likely to find a statistically significant
treatment effect when one does not exist (Moulton 1986; Bertrand, Duflo,
and Mullainathan 2004) .

Sampling weights may also be relevant in such circumstances, even if


the sample was, ex ante, drawn from an administrative database. Consider
a study in which patients are followed up for a substantial length of time to
determine longer-term outcomes of the surgery (in treated and control
hospitals). Although the original sample might have consisted of the
population of patients with the diagnosis, it is likely that some patients will
be lost during follow-up in nonrandom ways. Some might die; others may
have moved out of the reach of the administrative data, for example, into
nursing homes. Such loss of observations will likely be different by
socioeconomic and health status; thus the final dataset will not have
equiprobable sampling. Postsampling weights may have substantive

effects on estimates and inference.

11.3 Methods for point estimation and inference

As expected, standard formulas for computing statistics are not applicable


when design features are incorporated into the analysis. In general, weights
affect both point estimates and standard errors, while clustering and
stratification affect only the standard errors. We discuss the theory of how
to adjust point estimates for weights in section 11.3.1 and of how to adjust
standard errors for design effects in section 11.3.2. There are specific
examples using MEPS data and Stata code in section 11.4.

11.3.1 Point estimation

To provide intuition, we present three examples of how weights affect


point estimation. We start with the simplest case of using sampling
weights to estimate a population average. Suppose that there are sampling
weights, , for observations . Note that the sample size is
, but let the total population be , equal to the sum of all the individual
weights:

The population mean , , of a random variable, , is the weighted average.

The Stata User’s Guide has the estimators for additional, progressively
more complex sampling designs.

The second example is for weighted least squares . The least-squares


estimator for linear regression can also be readily modified to incorporate
sampling weights. In Stata, the observation weights are
normalized to sum to $n$ if pweights or aweights are specified. Let $w_i$
denote the normalized or unnormalized weights, let $w$ denote the vector of
weights, let $W = \mathrm{diag}(w)$, and let $X$ be the $n \times k$ matrix of covariates, $x_i$. The
goal is to estimate the vector of parameters, $\beta$, in the linear model,
$y = X\beta + \epsilon$. Then, the estimated weighted least-squares parameters are
found by the following formula:

$$\hat{\beta}_{\mathrm{WLS}} = (X'WX)^{-1}X'Wy$$

Finally, consider the logistic regression as an example of a maximum


likelihood estimator. The log (pseudo-)likelihood function for the
logistic distribution, $\Lambda(\cdot)$, with sampling weights is

$$\ln L = \sum_{i=1}^{n} w_i \left\{ y_i \ln \Lambda(x_i\beta) + (1 - y_i) \ln\left[ 1 - \Lambda(x_i\beta) \right] \right\}$$

where $y_i$ is a 0/1 indicator for the dependent variable, and

$$\Lambda(x_i\beta) = \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)}$$

In all three of these examples, weights (but not clustering) affect the
point estimates. As we will demonstrate in section 11.4, Stata makes it
easy to incorporate weights into all of these estimators.
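
For instance, with a sampling-weight variable w and placeholder variables y, d, x1, and x2 (all names are assumptions for illustration), the usual weight syntax applies to each of the three estimators just discussed:

    mean    y       [pweight=w]
    regress y x1 x2 [pweight=w]
    logit   d x1 x2 [pweight=w]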

11.3.2 Standard errors

Adjusting for weights and for clustering changes the standard errors of the
estimates. The most commonly applied method to obtain the covariance
matrix of $\hat{\beta}$ involves a Taylor-series based linearization (popularly known
as the delta method ), in which weights and clustering are easily
incorporated. This is implemented as a standard option in Stata. After
declaring the survey design with the svyset command, use the
vce(cluster clustvar) option.

When the parameters of interest are complex functions of the model
parameters, the linearized variance estimation may not be convenient. In
such situations, bootstrap or jackknife variance estimation can be
important nonparametric ways to obtain standard errors. With few
assumptions, bootstrap and jackknife resampling techniques provide a way
of estimating standard errors and other measures of statistical precision
and are especially convenient when no standard formula is available. We
begin by describing the bootstrap, which is described in detail in Cameron
and Trivedi (2005) .

The principle of bootstrap resampling is quite simple. Consider a


dataset with $n$ observations, on which a statistical model is fit and
parameters (or functions of parameters such as marginal effects) are
calculated. The bootstrap procedure involves drawing, with replacement—
so that the same observation may be drawn again—$n$ observations from
the $n$-observation dataset. The model and parameters of interest are
estimated from this resampled dataset. This process is repeated many
times, and the empirical distribution of the parameters of interest is used to
calculate standard deviations or other features of the distributions of the
parameters.

The jackknife method is, like the bootstrap, a technique that is


independent of the estimation procedure. In it, the model is fit multiple
times, with one observation being dropped from the estimation sample
each time. The standard errors of the estimates of interest are then
calculated as the empirical standard deviations of the estimates over the set
of replicates.

Both the bootstrap and jackknife methods are easily adjusted to deal
with clustering . In the context of survey designs with clustering, the unit
of observation for resampling in the bootstrap and jackknife is a cluster or
a PSU . If the survey design also involves sampling weights , both the
bootstrap and jackknife methods become considerably more complex to
implement. For each replication, the sampling weights need to be adjusted,
because some clusters may be repeated, while others may not be in the
sample in the case of the bootstrap, or because one cluster is dropped from
the replicate sample in the case of the jackknife. Some complex surveys
provide bootstrap or jackknife replicate weights, in which case those
methods can be implemented in the complex survey context using svy
bootstrap or svy jackknife . If the resampled weights are not provided,

the researcher must calculate those weights. This requires in-depth
knowledge of the survey design and the way in which the weights were
originally constructed.

Although we do not show an empirical example of either bootstrap or


jackknife methods in this chapter, the Stata manual has good examples in
the Stata Base Reference Manual under bootstrap and jackknife .

11.4 Empirical examples

The sample design of the MEPS includes stratification, clustering, multiple


stages of selection, and disproportionate sampling. Furthermore, the MEPS
sampling weights reflect adjustments for survey nonresponse and
adjustments to population control totals from the Current Population
Survey . The MEPS public-use files include variables to obtain weighted
estimates and to implement a Taylor-series approach to estimate standard
errors for weighted survey estimates. These variables, which jointly reflect
the MEPS survey design, include the estimation weight, sampling strata, and
PSU. Our MEPS dataset includes the sampling weights , wtdper; the stratum,
varstr; and PSU variables, varpsu.

11.4.1 Survey design setup

Stata’s survey prefix command (svy:) invokes adjustments to most


estimation commands to account for survey design characteristics and
conduct appropriate point estimation, model fitting, and variance
estimation. We will demonstrate the effects of weighting and clustering on
parameter estimates and standard errors separately. Not all of Stata’s
commands and user-written packages work with svy. Therefore, in
addition to demonstrating the use of svy, we show examples that
incorporate survey characteristics one at a time—as well as together—
without the use of svy.

Before any commands can be invoked with the survey prefix, the
survey features must be associated with the dataset using the svyset
command. In other words, svyset is required to “set up” the survey design
in the dataset for use in the svy: commands. The syntax identifies the
sample weight, wtdper; a PSU variable, varpsu; and a stratum variable,
varstr.
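
A declaration along these lines (a sketch of the svyset syntax with the variables just named) is:

    svyset varpsu [pweight=wtdper], strata(varstr)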

As we have described above, PSU and strata are incorporated in the
survey design in MEPS to ensure geographic and racial representation.
These selection criteria also give rise to natural clustering units. Given our
choice of PSU and strata, we can create a variable to identify unique
clusters of observations in the dataset by grouping observations by unique
values of the PSU and strata identifiers. Output from codebook below
shows that there are 448 clusters.
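
A sketch of that construction, using cluster_id as our name for the grouping variable (the name itself is arbitrary):

    egen cluster_id = group(varstr varpsu)
    codebook cluster_id, compact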

11.4.2 Weighted sample means

The first example shows how the estimate of a sample mean might change
when incorporating different design features. Normally, researchers might
consider using summarize to obtain sample means and other sample
statistics. However, summarize is not survey aware and thus cannot be
used with the svy prefix. Instead, use the mean command, because it is
survey aware and can be implemented with sampling weights and cluster-
adjusted standard errors, even without using the svy prefix.

We use mean to calculate the means and standard errors of total


healthcare expenditures (exp_tot) and an indicator for black race
(multiplied by 100 so we can interpret means as percentages)

(race_bl_pct). To facilitate comparison of the estimates from the different
methods, we do not show the estimates one by one but instead accumulate
them into a table. In the table, the first estimates (noadjust) do not take
any survey features into account. The second set of estimates (cluster)
are identical to the first, because the adjustment to standard errors due to
clustering has no effect on the point estimates. Sampling weights are
incorporated into the estimates shown in the third, fourth, and fifth
columns. The third set of estimates (weights) incorporate only weights,
while the fourth (clust_wgt) incorporates weights and cluster adjustments
to standard errors. The fifth set of estimates (survey) is based on fully
survey-aware estimation that control for weights, clustering, and
stratification.

Sampling weights matter for the estimates of population means. The


estimates in columns 3–5 (with weights) are substantially different from
the estimates in columns 1 and 2 (no weights). The estimates of total
expenditures differ by about 4%, and the estimates of the percentage of the
population that is black differ by about 3%.

In our next example, we estimate mean total expenditures by race to


conduct a test for whether the difference in means is different from zero.
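
A sketch of this comparison without survey adjustments is below; the equation names used in the test depend on the Stata version and on how race_bl is labeled.

    mean exp_tot, over(race_bl)
    test [exp_tot]0 = [exp_tot]1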

We find that mean spending for nonblacks is $3,731 and mean spending
for blacks is $3,402. The difference in spending is not statistically
significant at the traditional 5% level, but it is significant at the 10% level.

We repeat the analysis taking sampling weights and clustering into


account using the svy prefix. The results are quite remarkable. The
weighted sample mean for the subsample of blacks (race_bl=1) is smaller
than the unweighted sample mean (3,115 compared with 3,402), while the
weighted sample mean for nonblacks is larger than its unweighted
counterpart (3,927 compared with 3,731). Thus, even though the cluster-
corrected standard errors of the means estimated using the survey-aware
methods are uniformly larger than the naïve estimates, the $t$ statistic for the
difference in expenditures between blacks and nonblacks is not statistically
significant at 5% when the naïve estimates are used, but is significant at
the 0.1% level when survey-aware estimates are used.

11.4.3 Weighted least-squares regression

Common wisdom is that design effects matter less in regression contexts


than when summary statistics are desired (Solon, Haider, and
Wooldridge 2015; Cameron and Trivedi 2005) , but—aside from limited
special cases—there is no formal derivation of this understanding, nor can
the differences between naïve and more sophisticated estimates be signed a
priori. Therefore, it may be better to take design effects seriously,
regardless of the nature of the statistical model under consideration.
Below, we demonstrate the effects of taking survey design features into
consideration. We show this first in the context of a linear regression in
which the coefficient estimates themselves are of primary interest and
show this second in the context of a Poisson regression—in which
incremental and marginal effects are of interest.

We estimate regressions of total expenditures by self or family


(exp_self) on continuous age (age) and indicators for gender (female),
black race (race_bl), and South region (reg_south) using ordinary and
weighted least squares and using alternative formulas to estimate standard
errors. To facilitate comparison of the estimates from the different

methods, we accumulate regression results into a table. The first regression
(robust) does not account for any design features but does estimate robust
standard errors. The second set of estimates (cluster) are obtained using
ordinary least squares, but the standard errors take clustering into account.
The third set of estimates (weights) are calculated using weighted least
squares; the weights are probability or sampling weights. These estimates
do not account for clustering. The fourth set of estimates (clust_wgt)
incorporate both cluster–robust standard errors and sampling weights as
options of regress but without the svy: prefix. The final specification
(survey) uses the svy: prefix to produce fully survey-aware estimates that
control for weights, clustering, and stratification.
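
The five specifications can be produced along the following lines; cluster_id is again the constructed cluster identifier, and the stored names correspond to the columns described above.

    regress exp_self age female race_bl reg_south, vce(robust)
    estimates store robust
    regress exp_self age female race_bl reg_south, vce(cluster cluster_id)
    estimates store cluster
    regress exp_self age female race_bl reg_south [pweight=wtdper]
    estimates store weights
    regress exp_self age female race_bl reg_south [pweight=wtdper], ///
        vce(cluster cluster_id)
    estimates store clust_wgt
    svy: regress exp_self age female race_bl reg_south
    estimates store survey
    estimates table robust cluster weights clust_wgt survey, b se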

The results show that, as expected, adjusting for weights changes the
point estimates, while adjusting for clustering and stratification does not.
The point estimates are the same in columns 1 and 2 (no weights) but
different from the point estimates in columns 3–5 (with weights).

In contrast, clustering changes only the standard errors. In fact,


estimates of standard errors are also not substantially different for most
variables. However, for reg_south, the standard errors increase in a
substantial way. Recall that our clustering variable is designed to have
geographic and race implications but should have no implications for age
and gender for this MEPS sample. Therefore, it is not surprising that the
standard errors on the coefficients on age and gender do not change much
and that the standard error on the coefficient of reg_south increases. It is a
bit surprising that the standard error of race_bl does not change much.

The landscape changes quite a bit once sampling weights are


introduced. All point estimates and standard errors change substantially,
except for those on age. The implications for interpretation of the effect of
race_bl are substantial. While one would comfortably conclude that black
race was not associated with healthcare expenditures from columns 1 and
2 ($p$-value of 0.58), one would have to think about the association further
given the $p$-values less than or equal to 0.1 in the remaining three cases.

11.4.4 Weighted Poisson count model

We estimated Poisson regressions of the counts of office-based provider


visits (use_off) on age, female, race_bl, and reg_south with and without
survey design features. In each case, we estimated sample average partial
effects of age and female using margins , which naturally accounts for
survey design features used to obtain the model parameter estimates if the

regression is specified this way. The results, tabulated below, show that the
marginal effects of age are not noticeably different across procedures;
neither are the associated standard errors. The incremental effects of being
female and their associated standard errors increase modestly once
sampling weights are taken into account.
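
A sketch of the survey-aware version of this estimation is below; the unweighted and partially adjusted versions follow the same pattern as in the linear regression example, and the factor-variable notation is what lets margins report incremental effects for the indicators.

    svy: poisson use_off c.age i.female i.race_bl i.reg_south
    margins, dydx(*)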

As with the previous linear regression example, the most dramatic


differences across specifications are seen in the effects of black race and
South region. The incremental effect of race_bl increases substantially in
magnitude once sampling weights are introduced. Standard errors also
increase but not by much. The incremental effects of reg_south do not
change much at all across specifications, but their standard errors increase
once clustering is accounted for.

11.5 Conclusion

In this chapter, we have presented some of the key features of complex


sampling designs and their consequences for point estimates and inference.
We have shown how estimates and inference can be misleading if design
features are not accounted for. But also note that the incorporation of
design features can improve the quality of point estimates and inference
only if the measures of design features are correct. In practice, while
identifiers for clusters are seldom problematic, the use of sampling weights
may present challenges. In some datasets, sampling weights can vary
across the sample by orders of magnitude. In such cases, purely
computational issues of numerical precision and the undue influence of a
few observations with unduly large weights can make estimates that
incorporate sampling weights worse than unweighted estimates. Uneven
sampling weights with a few extraordinarily large weights are not
uncommon in samples in which original sampling weights have been
recalculated because of poststratification adjustments. Therefore, we
strongly recommend that researchers read the documentation to understand
the design features of the data before using corrections.

11.6 Stata resources

See the Stata Survey Data Reference Manual for all commands related to
survey data. Prior to using any other survey-related commands, make Stata
aware of survey design features of the dataset using svyset . Once that is
done, the svy prefix incorporates those features into the estimation and
inference for most Stata commands and many user-written packages.

Some statistical controls can be done without first using svyset. To


control for probability weighting, add [pw=weightvar] after the main
command, where weightvar is the variable indicating the probability
weights. To control for clustering, use the option vce(cluster clustvar)
, where clustvar is the variable indicating the clusters.
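
For example, using the MEPS variable names from this chapter (cluster_id is the constructed cluster identifier from section 11.4):

    regress exp_tot age female [pw=wtdper], vce(cluster cluster_id)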

References
Abrevaya, J. 2002. Computing marginal effects in the Box–Cox model.
Econometric Reviews 21: 383–393.
Ai, C., and E. C. Norton. 2000. Standard errors for the retransformation
problem with heteroscedasticity. Journal of Health Economics 19:
697–718.
_________. 2003. Interaction terms in logit and probit models.
Economics Letters 80: 123–129.
_________. 2008. A semiparametric derivative estimator in log
transformation models. Econometrics Journal 11: 538–553.
Akaike, H. 1970. Statistical predictor identification. Annals of the
Institute of Statistical Mathematics 22: 203–217.
Angrist, J. D., and A. B. Krueger. 2001. Instrumental variables and the
search for identification: From supply and demand to natural
experiments. Journal of Economic Perspectives 15(4): 69–85.
Angrist, J. D., and J.-S. Pischke. 2009. Mostly Harmless Econometrics:
An Empiricist’s Companion. Princeton: Princeton University Press.
Arlot, S., and A. Celisse. 2010. A survey of cross-validation procedures
for model selection. Statistics Surveys 4: 40–79.
Barnett, S. B. L., and T. A. Nurmagambetov. 2011. Costs of asthma in
the United States: 2002–2007. Journal of Allergy and Clinical
Immunology 127: 145–152.
Bassett, G., Jr., and R. Koenker. 1982. An empirical quantile function
for linear models with iid errors. Journal of the American Statistical
Association 77: 407–415.
Basu, A., and P. J. Rathouz. 2005. Estimating marginal and incremental
effects on health outcomes using flexible link and variance function
models. Biostatistics 6: 93–109.
Baum, C. F. 2006. An Introduction to Modern Econometrics Using
Stata. College Station, TX: Stata Press.
Belotti, F., P. Deb, W. G. Manning, and E. C. Norton. 2015. twopm:

Two-part models. Stata Journal 15: 3–20.
Berk, M. L., and A. C. Monheit. 2001. The concentration of health care
expenditures, revisited. Health Affairs 20: 9–18.
Bertrand, M., E. Duflo, and S. Mullainathan. 2004. How much should
we trust differences-in-differences estimates? Quarterly Journal of
Economics 119: 249–275.
Bitler, M. P., J. B. Gelbach, and H. W. Hoynes. 2006. Welfare reform
and children’s living arrangements. Journal of Human Resources 41:
1–27.
Blough, D. K., C. W. Madden, and M. C. Hornbrook. 1999. Modeling
risk using generalized linear models. Journal of Health Economics
18: 153–171.
Blundell, R. W., and R. J. Smith. 1989. Estimation in a class of
simultaneous equation limited dependent variable models. Review of
Economic Studies 56: 37–57.
_________. 1994. Coherency and estimation in simultaneous models
with censored or qualitative dependent variables. Journal of
Econometrics 64: 355–373.
Bound, J., D. A. Jaeger, and R. M. Baker. 1995. Problems with
instrumental variables estimation when the correlation between the
instruments and the endogenous explanatory variable is weak.
Journal of the American Statistical Association 90: 443–450.
Box, G. E. P., and D. R. Cox. 1964. An analysis of transformations.
Journal of the Royal Statistical Society, Series B 26: 211–252.
Box, G. E. P., and N. R. Draper. 1987. Empirical Model-building and
Response Surfaces. New York: Wiley.
Buntin, M. B., and A. M. Zaslavsky. 2004. Too much ado about two-
part models and transformation? Comparing methods of modeling
Medicare expenditures. Journal of Health Economics 23: 525–542.
Cameron, A. C., J. B. Gelbach, and D. L. Miller. 2008. Bootstrap-based
improvements for inference with clustered errors. Review of
Economics and Statistics 90: 414–427.
_________. 2011. Robust inference with multiway clustering. Journal
of Business and Economic Statistics 29: 238–249.

Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods
and Applications. New York: Cambridge University Press.
_________. 2010. Microeconometrics Using Stata. Rev. ed. College Station,
TX: Stata Press.
_________. 2013. Regression Analysis of Count Data. 2nd ed.
Cambridge: Cambridge University Press.
Cattaneo, M. D., D. M. Drukker, and A. D. Holland. 2013. Estimation
of multivalued treatment effects under conditional independence.
Stata Journal 13: 407–450.
Cattaneo, M. D., and M. Jansson. 2017. Kernel-based semiparametric
estimators: Small bandwidth asymptotics and bootstrap consistency.
Working paper. http://eml.berkeley.edu/~jansson/Papers/CattaneoJansson_BootstrappingSemiparametrics.pdf.
Cawley, J., and C. Meyerhoefer. 2012. The medical care costs of
obesity: An instrumental variables approach. Journal of Health
Economics 31: 219–230.
Claeskens, G., and N. L. Hjort. 2008. Model Selection and Model
Averaging. Cambridge: Cambridge University Press.
Cole, J. A., and J. D. F. Sherriff. 1972. Some single- and multi-site
models of rainfall within discrete time increments. Journal of
Hydrology 17: 97–113.
Cook, P. J., and M. J. Moore. 1993. Drinking and schooling. Journal of
Health Economics 12: 411–429.
Cox, N. J. 2004. Speaking Stata: Graphing model diagnostics. Stata
Journal 4: 449–475.
Cragg, J. G. 1971. Some statistical models for limited dependent
variables with application to the demand for durable goods.
Econometrica 39: 829–844.
Dall, T. M., Y. Zhang, Y. J. Chen, W. W. Quick, W. G. Yang, and
J. Fogli. 2010. The economic burden of diabetes. Health Affairs 29:
297–303.
Deaton, A. 1997. The Analysis of Household Surveys: A
Microeconometric Approach to Development Policy. Washington, DC:
The World Bank.

Deb, P. 2007. fmm: Stata module to estimate finite mixture models.
Statistical Software Components S456895, Department of
Economics, Boston College.
https://ideas.repec.org/c/boc/bocode/s456895.html.

Deb, P., and P. K. Trivedi. 1997. Demand for medical care by the
elderly: A finite mixture approach. Journal of Applied Econometrics
12: 313–336.
_________. 2002. The structure of demand for health care: latent class
versus two-part models. Journal of Health Economics 21: 601–625.
Donald, S. G., D. A. Green, and H. J. Paarsch. 2000. Differences in
wage distributions between Canada and the United States: An
application of a flexible estimator of distribution functions in the
presence of covariates. Review of Economic Studies 67: 609–633.
Dow, W. H., and E. C. Norton. 2003. Choosing between and
interpreting the Heckit and two-part models for corner solutions.
Health Services and Outcomes Research Methodology 4: 5–18.
Dowd, B. E., W. H. Greene, and E. C. Norton. 2014. Computation of
standard errors. Health Services Research 49: 731–750.
Drukker, D. M. 2014. mqgamma: Stata module to estimate quantiles of
potential-outcome distributions. Statistical Software Components
S457854, Department of Economics, Boston College.
https://ideas.repec.org/c/boc/bocode/s457854.html.

_________. 2016. Quantile regression allows covariate effects to differ


by quantile. The Stata Blog: Not Elsewhere Classified.
http://blog.stata.com/2016/09/27/quantile-regression-
allows-covariate-effects-to-differ-by-quantile/.

_________. 2017. Two-part models are robust to endogenous selection.


Economics Letters 152: 71–72.
_________. Forthcoming. Quantile treatment effect estimation from
censored data by regression adjustment. Stata Journal.
Duan, N. 1983. Smearing estimate: A nonparametric retransformation
method. Journal of the American Statistical Association 78: 605–610.
Duan, N., W. G. Manning, C. N. Morris, and J. P. Newhouse. 1984.
Choosing between the sample-selection model and the multi-part
model. Journal of Business and Economic Statistics 2: 283–289.

Efron, B. 1988. Logistic regression, survival analysis, and the Kaplan–
Meier curve. Journal of the American Statistical Association 83: 414–
425.
Enami, K., and J. Mullahy. 2009. Tobit at fifty: A brief history of
Tobin’s remarkable estimator, of related empirical methods, and of
limited dependent variable econometrics in health economics. Health
Economics 18: 619–628.
Ettner, S. L., G. Denmead, J. Dilonardo, H. Cao, and A. J. Belanger.
2003. The impact of managed care on the substance abuse treatment
patterns and outcomes of Medicaid beneficiaries: Maryland’s health
choice program. Journal of Behavioral Health Services and Research
30: 41–62.
Ettner, S. L., R. G. Frank, T. G. McGuire, J. P. Newhouse, and E. H.
Notman. 1998. Risk adjustment of mental health and substance abuse
payments. Inquiry 35: 223–239.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its
Applications. New York: Chapman & Hall/CRC.
Fenton, J. J., A. F. Jerant, K. D. Bertakis, and P. Franks. 2012. The cost
of satisfaction: A national study of patient satisfaction, health care
utilization, expenditures, and mortality. Archives of Internal Medicine
172: 405–411.
van Garderen, K. J., and C. Shah. 2002. Exact interpretation of dummy
variables in semilogarithmic equations. Econometrics Journal 5: 149–
159.
Garrido, M. M., P. Deb, J. F. Burgess, Jr., and J. D. Penrod. 2012.
Choosing models for health care cost analyses: Issues of nonlinearity
and endogeneity. Health Services Research 47: 2377–2397.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge, UK: Cambridge
University Press.
Gilleskie, D. B., and T. A. Mroz. 2004. A flexible approach for
estimating the effects of covariates on health expenditures. Journal of
Health Economics 23: 391–418.
Goldberger, A. S. 1981. Linear regression after selection. Journal of
Econometrics 15: 357–366.

336
Gourieroux, C., A. Monfort, and A. Trognon. 1984a. Pseudo maximum
likelihood methods: Applications to Poisson models. Econometrica
52: 701–720.
_________. 1984b. Pseudo maximum likelihood methods: Theory.
Econometrica 52: 681–700.
Greene, W. H. 2012. Econometric Analysis. 7th ed. Upper Saddle River,
NJ: Prentice Hall.

Gurmu, S. 1997. Semi-parametric estimation of hurdle regression


models with an application to Medicaid utilization. Journal of
Applied Econometrics 12: 225–242.
Halvorsen, R., and R. Palmquist. 1980. The interpretation of dummy
variables in semilogarithmic equations. American Economic Review
70: 474–475.
Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and
Extensions. 3rd ed. College Station, TX: Stata Press.
Hay, J. W., and R. J. Olsen. 1984. Let them eat cake: A note on
comparing alternative models of the demand for medical care.
Journal of Business and Economic Statistics 2: 279–282.
Heckman, J. J. 1979. Sample selection bias as a specification error.
Econometrica 47: 153–161.
Heckman, J. J., and R. Robb, Jr. 1985. Alternative methods for
evaluating the impact of interventions: An overview. Journal of
Econometrics 30: 239–267.
Heckman, J. J., and E. J. Vytlacil. 2007. Econometric evaluation of
social programs, part I: Causal models, structural models and
econometric policy evaluation. In Handbook of Econometrics,
vol. 6B, ed. J. J. Heckman and E. Leamer, 4779–4874. Amsterdam:
Elsevier.
Hoch, J. S., A. H. Briggs, and A. R. Willan. 2002. Something old,
something new, something borrowed, something blue: A framework
for the marriage of health econometrics and cost-effectiveness
analysis. Health Economics 11: 415–430.
Holland, P. W. 1986. Statistics and causal inference. Journal of the
American Statistical Association 81: 945–960.

Holmes, A. M., and P. Deb. 1998. Provider choice and use of mental
health care: implications for gatekeeper models. Health Services
Research 33: 1263–1284.
Hosmer, D. W., and S. Lemesbow. 1980. Goodness of fit tests for the
multiple logistic regression model. Communications in Statistics—
Theory and Methods 9: 1043–1069.
Hurd, M. 1979. Estimation in truncated samples when there is
heteroscedasticity. Journal of Econometrics 11: 247–258.
Imbens, G. W. 2004. Nonparametric estimation of average treatment
effects under exogeneity: A review. Review of Economics and
Statistics 86: 4–29.
Imbens, G. W., and D. B. Rubin. 2015. Causal Inference for Statistics,
Social, and Biomedical Sciences: An Introduction. New York:
Cambridge University Press.
Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in
the econometrics of program evaluation. Journal of Economic
Literature 47: 5–86.
Jones, A. M. 2000. Health econometrics. In Handbook of Health
Economics, vol. 1B, ed. A. J. Culyer and J. P. Newhouse, 265–344.
Amsterdam: Elsevier.
_________. 2010. Models for health care. Working Papers 10/01,
Health, Econometrics and Data Group.
Kadane, J. B., and N. A. Lazar. 2004. Methods and criteria for model
selection. Journal of the American Statistical Association 99: 279–
290.
Katz, R. W. 1977. Precipitation as a chain-dependent process. Journal
of Applied Meteorology 16: 671–676.
Keeler, E. B., and J. E. Rolph. 1988. The demand for episodes of
treatment in the health insurance experiment. Journal of Health
Economics 7: 337–367.
Kennedy, P. E. 1981. Estimation with correctly interpreted dummy
variables in semilogarithmic equations. American Economic Review
71: 801.
King, G. 1988. Statistical models for political science event counts: Bias

in conventional procedures and evidence for the exponential poisson
regression model. American Journal of Political Science 32: 838–
863.
Koenker, R., and G. Bassett, Jr. 1978. Regression quantiles.
Econometrica 46: 33–50.
Koenker, R., and K. F. Hallock. 2001. Quantile regression. Journal of
Economic Perspectives 15: 143–156.
Koenker, R., and J. A. F. Machado. 1999. Goodness of fit and related
inference processes for quantile regression. Journal of the American
Statistical Association 94: 1296–1310.
Konetzka, R. T. 2015. In memoriam: Willard G. Manning, 1946–2014.
American Journal of Health Economics 1: iv–vi.
Lambert, D. 1992. Zero-inflated Poisson regression, with an application
to defects in manufacturing. Technometrics 34: 1–14.
Leroux, B. G. 1992. Consistent estimation of a mixing distribution.
Annals of Statistics 20: 1350–1360.
Leung, S. F., and S. Yu. 1996. On the choice between sample selection
and two-part models. Journal of Econometrics 72: 197–229.
Lindrooth, R. C., E. C. Norton, and B. Dickey. 2002. Provider selection,
bargaining, and utilization management in managed care. Economic
Inquiry 40: 348–365.
Lindsay, B. G. 1995. Mixture Models: Theory, Geometry and
Applications. NSF-CBMS regional conference series in probability and
statistics, Institute of Mathematical Statistics.
Long, J. S., and J. Freese. 2014. Regression Models for Categorical
Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata
Press.
Machado, J. A. F., and J. M. C. Santos Silva. 2005. Quantiles for
counts. Journal of the American Statistical Association 100: 1226–
1237.
Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in
Econometrics. Cambridge: Cambridge University Press.
_________. 1985. A survey of the literature on selectivity bias as it

pertains to health care markets. Advances in Health Economics and
Health Services Research 6: 3–26.
Manning, W. G. 1998. The logged dependent variable,
heteroscedasticity, and the retransformation problem. Journal of
Health Economics 17: 283–295.
Manning, W. G., N. Duan, and W. H. Rogers. 1987. Monte Carlo
evidence on the choice between sample selection and two-part
models. Journal of Econometrics 35: 59–82.
Manning, W. G., and J. Mullahy. 2001. Estimating log models: To
transform or not to transform? Journal of Health Economics 20: 461–
494.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd
ed. London: Chapman & Hall/CRC.
McLachlan, G., and D. Peel. 2000. Finite Mixture Models. New York:
Wiley.
Mihaylova, B., A. H. Briggs, A. O’Hagan, and S. G. Thompson. 2011.
Review of statistical methods for analysing healthcare resources and
costs. Health Economics 20: 897–916.
Miranda, A. 2006. qcount: Stata program to fit quantile regression
models for count data. Statistical Software Components S456714,
Department of Economics, Boston College.
https://ideas.repec.org/c/boc/bocode/s456714.html.

Moulton, B. R. 1986. Random group effects and the precision of


regression estimates. Journal of Econometrics 32: 385–397.
_________. 1990. An illustration of a pitfall in estimating the effects of
aggregate variables on micro units. Review of Economics and
Statistics 72: 334–338.
Mroz, T. A. 2012. A simple, flexible estimator for count and other
ordered discrete data. Journal of Applied Econometrics 27: 646–665.
Mullahy, J. 1997. Heterogeneity, excess zeros, and the structure of
count data models. Journal of Applied Econometrics 12: 337–350.
_________. 1998. Much ado about two: Reconsidering retransformation
and the two-part model in health econometrics. Journal of Health
Economics 17: 247–281.

_________. 2015. In memoriam: Willard G. Manning, 1946–2014.
Health Economics 24: 253–257.
Murray, M. P. 2006. Avoiding invalid instruments and coping with
weak instruments. Journal of Economic Perspectives 20: 111–132.
Nelson, C. R., and R. Startz. 1990. Some further results on the exact
small sample properties of the instrumental variable estimator.
Econometrica 58: 967–976.
Newey, W. K., J. L. Powell, and F. Vella. 1999. Nonparametric
estimation of triangular simultaneous equations models.
Econometrica 67: 565–603.
Newhouse, J. P., and M. McClellan. 1998. Econometrics in outcomes
research: The use of instrumental variables. Annual Review of Public
Health 19: 17–34.
Newhouse, J. P., and C. E. Phelps. 1976. New estimates of price and
income elasticities of medical care services. In The Role of Health
Insurance in the Health Services Sector, ed. R. N. Rosett, 261–320.
Cambridge, MA: National Bureau of Economic Research.
Newson, R. 2003. Confidence intervals and p-values for delivery to the
end user. Stata Journal 3: 245–269.
Norton, E. C., H. Wang, and C. Ai. 2004. Computing interaction effects
and standard errors in logit and probit models. Stata Journal 4: 154–
167.
Park, R. E. 1966. Estimation with heteroscedastic error terms.
Econometrica 34: 888.
Picard, R. R., and R. D. Cook. 1984. Cross-validation of regression
models. Journal of the American Statistical Association 79: 575–583.
Pohlmeier, W., and V. Ulrich. 1995. An econometric model of the two-
part decisionmaking process in the demand for health care. Journal of
Human Resources 30: 339–361.
Poirier, D. J., and P. A. Ruud. 1981. On the appropriateness of
endogenous switching. Journal of Econometrics 16: 249–256.
Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics
9: 705–724.

Racine, J., and Q. Li. 2004. Nonparametric estimation of regression
functions with both categorical and continuous data. Journal of
Econometrics 119: 99–130.
Ramsey, J. B. 1969. Tests for specification errors in classical linear
least-squares regression analysis. Journal of the Royal Statistical
Society, Series B 31: 350–371.
Rao, C. R., and Y. Wu. 2001. On model selection. Lecture Notes-
Monograph Series 38: 1–64.
Roy, A., P. Sheffield, K. Wong, and L. Trasande. 2011. The effects of
outdoor air pollutants on the costs of pediatric asthma hospitalizations
in the United States, 1999–2007. Medical Care 49: 810–817.
Rubin, D. B. 1974. Estimating causal effects of treatments in
randomized and nonrandomized studies. Journal of Educational
Psychology 66: 688–701.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of
Statistics 6: 461–464.
Sin, C.-Y., and H. White. 1996. Information criteria for selecting
possibly misspecified parametric models. Journal of Econometrics
71: 207–225.
Skinner, C. J., D. Holt, and T. M. F. Smith. 1989. Analysis of Complex
Surveys. New York: Wiley.
Solon, G., S. J. Haider, and J. M. Wooldridge. 2015. What are we
weighting for? Journal of Human Resources 50: 301–316.
Staiger, D. O., and J. H. Stock. 1997. Instrumental variables regression
with weak instruments. Econometrica 65: 557–586.
Stock, J. H., J. H. Wright, and M. Yogo. 2002. A survey of weak
instruments and weak identification in generalized method of
moments. Journal of Business and Economic Statistics 20: 518–529.
Teicher, H. 1963. Identifiability of finite mixtures. Annals of
Mathematical Statistics 34: 1265–1269.
Terza, J. V., A. Basu, and P. J. Rathouz. 2008. Two-stage residual
inclusion estimation: Addressing endogeneity in health econometric
modeling. Journal of Health Economics 27: 531–543.

Tobin, J. 1958. Estimation of relationships for limited dependent
variables. Econometrica 26: 24–36.
Todorovic, P., and D. A. Woolhiser. 1975. A stochastic model of n-day
precipitation. Journal of Applied Meteorology 14: 17–24.
Vanness, D. J., and J. Mullahy. 2012. Moving beyond mean-based
evaluation of health care. In The Elgar Companion to Health
Economics, ed. A. M. Jones, 2nd ed., 563–575. Cheltenham, UK:
Edward Elgar Publishing Limited.
Veazie, P. J., W. G. Manning, and R. L. Kane. 2003. Improving risk
adjustment for Medicare capitated reimbursement using nonlinear
models. Medical Care 41: 741–752.
Vella, F., and M. Verbeek. 1999. Estimating and interpreting models
with endogenous treatment effects. Journal of Business and
Economic Statistics 17: 473–478.
van de Ven, W. P. M. M., and R. P. Ellis. 2000. Risk adjustment in
competitive health plan markets. In Handbook of Health Economics,
vol. 1A, ed. A. J. Culyer and J. P. Newhouse, 755–845. Amsterdam:
Elsevier.
Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-
nested hypotheses. Econometrica 57: 307–333.
Winkelmann, R. 2008. Econometric Analysis of Count Data. 5th ed.
Berlin: Springer.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and
Panel Data. 2nd ed. Cambridge, MA: MIT Press.
_________. 2014. Quasi-maximum likelihood estimation and testing for
nonlinear models with endogenous explanatory variables. Journal of
Econometrics 182: 226–234.
_________. 2016. Introductory Econometrics: A Modern Approach. 6th
ed. Boston, MA: Cengage Learning.

Author index
A

Abrevaya, J., 6.5 , 6.5.1


Ai, C., 2.5 , 7.5.2
Akaike, H., 2.6.1 , 5.8.1 , 8.6.1
Angrist, J.D., 10.1
Arlot, S., 2.6.2 , 8.6.2

B
Baker, R.M., 10.2.2
Barnett, S.B.L., 2.1
Bassett, G., 9.2
Basu, A., 5.2.1 , 5.8.4 , 5.10 , 10.2.4 , 10.3.1
Baum, C.F., 4.7
Belanger, A.J., 6.5
Belotti, F., 7.5.1 , 7.8
Berk, M.L., 5.1
Bertakis, K., 2.1
Bertrand, M., 11.1 , 11.2.3
Bitler, M.P., 2.7
Blough, D.K., 5.1 , 5.2.1 , 5.8.3
Blundell, R.W., 10.3.1
Bound, J.D., 10.2.2
Box, G.E., 1.5 , 6.1
Briggs, A.H., 2.1 , 7.2
Buntin, M.B., 7.6.3
Burgess, J.F., 10.3.1

C
Cameron, A.C., 1.1 , 1.4 , 2.1 , 4.1 , 7.1 , 8.1 , 8.3 , 10.1 , 10.4 , 11.1 ,
11.2.2 , 11.3.2 , 11.4.3
Cao, H., 6.5
Cattaneo, M.D., 4.7 , 9.4.1
Cawley, J., 2.1
Celisse, A., 2.6.2 , 8.6.2
Chen, Y.J., 2.1
Claeskens, G., 2.1 , 2.6

Cole, J., 7.2
Cook, P.J., 7.6.2
Cook, R.D., 2.6.2 , 8.6.2
Cox, D.R., 6.1
Cox, N.J., 4.7
Cragg, J.G., 7.2 , 7.6.2

D
Dall, T.M., 2.1
Deaton, A., 11.1
Deb, P., 7.5.1 , 7.6.2 , 7.8 , 9.3 , 9.6 , 10.3.1
Denmead, G., 6.5
Dickey, B., 6.5
Dilonardo, J., 6.5
Donald, S.G., 9.5
Dow, W.H., 7.5.4
Dowd, B.E., 4.3.3
Draper, N.R., 1.5
Drukker, D.M., 4.7 , 9.2.2
Duan, N., 6.3.1 , 6.5 , 7.2 , 7.2.1 , 7.4 , 7.5.2 , 7.5.4
Duflo, E., 11.1 , 11.2.3

E
Efron, B., 9.5
Ellis, R.P., 2.1
Enami, K., 7.6.1
Ettner, S.L., 6.5

F
Fan, J., 9.4
Fenton, J., 2.1
Fogli, J., 2.1
Frank, R.G., 6.5
Franks, P., 2.1
Freese, J., 1.4

G
Garrido, M.M., 10.3.1
Gelbach, J.B., 2.7 , 11.2.2
Gelman, A., 2.1 , 2.2
Gijbels, I., 9.4

Gilleskie, D.B., 7.2 , 9.5
Goldberger, A.S., 7.6.1 , 7.6.2
Gourieroux, C., 8.2.2
Green, D.A., 9.5
Greene, W.H., 1.1 , 2.1 , 4.3.3
Gurmu, S., 8.5.1 , 8.5.2

H
Halvorsen, R., 6.2.1
Hansen, L.P., 10.4
Hardin, J.W., 5.1 , 8.1
Hay, J.W., 7.4
Heckman, J.J., 2.1 , 7.3 , 7.3.1
Hilbe, J., 5.1 , 8.1
Hill, J., 2.1 , 2.2
Hjort, N.L., 2.1 , 2.6
Hoch, J.S., 2.1
Holland, A.D., 4.7
Holland, P.W., 2.1 , 2.2
Holmes, A., 7.6.2
Holt, D., 11.1 , 11.2.2
Hornbrook, M.C., 5.1 , 5.2.1 , 5.8.3
Hosmer, D.W., 2.6 , 4.6 , 4.6.3
Hoynes, H.W., 2.7
Hurd, M., 7.6.1 , 7.6.2

I
Imbens, G.W., 2.1

J
Jaeger, D.A., 10.2.2
Jansson, M., 9.4.1
Jerant, A., 2.1
Jones, A.M., 1.1 , 7.4

K
Kadane, J.B., 2.1 , 2.6 , 2.6.1
Kane, R.L., 6.5
Katz, R.W., 7.2
Keeler, E.B., 8.4.1
Kennedy, P.E., 6.2.1

King, G., 8.1
Koenker, R., 9.2 , 9.2.1
Konishi, S., 2.1 , 2.6
Krueger, A.B., 10.1

L
Lambert, D., 8.4.2
Lazar, N.A., 2.1 , 2.6 , 2.6.1
Lemeshow, S., 2.6 , 4.6 , 4.6.3
Leroux, B.G., 2.6.1 , 8.6.1 , 9.3
Leung, S.F., 7.4
Li, Q., 9.4
Lindrooth, R.C., 6.5
Lindsay, B.G., 9.3
Long, J.S., 1.4

M
Machado, J.A.F., 9.2.1 , 9.2.2
Maddala, G.S., 7.3 , 7.4
Madden, C.W., 5.1 , 5.2.1 , 5.8.3
Manning, W.G., 5.8.3 , 6.1 , 6.3.2 , 6.5 , 7.2 , 7.4 , 7.5.1 , 7.5.2 , 7.5.4 , 7.8
McClellan, M., 10.1
McCullagh, P., 5.1 , 5.2.1 , 8.2.2 , 9.3
McGuire, T.G., 6.5
McLachlan, G., 9.3
Meyerhoefer, C., 2.1
Mihaylova, B., 7.2
Miller, D.L., 11.2.2
Monfort, A., 8.2.2
Monheit, A.C., 5.1
Moore, M.J., 7.6.2
Morris, C.N., 7.2 , 7.5.4
Moulton, B.R., 11.1 , 11.2.2 , 11.2.3
Mroz, T.A., 7.2 , 9.5
Mukerjee, R., 2.1 , 2.6
Mullahy, J., 2.7 , 5.1 , 5.8.3 , 7.6.1 , 7.6.3 , 8.4.2
Mullainathan, S., 11.1 , 11.2.3
Murray, M.P., 10.1

N
Nelder, J.A., 5.1 , 5.2.1 , 8.2.2 , 9.3
Nelson, C.R., 10.2.2 , 10.2.3
Newey, W.K., 10.2.4
Newhouse, J.P., 6.5 , 7.2 , 7.5.4 , 10.1
Norton, E.C., 2.5 , 4.3.3 , 6.5 , 7.5.1 , 7.5.2 , 7.5.4 , 7.8
Notman, E.H., 6.5
Nurmagambetov, T.A., 2.1

O
O’Hagan, A., 7.2
Olsen, R.J., 7.4

P
Paarsch, H.J., 9.5
Palmquist, R., 6.2.1
Park, R.E., 5.8.3
Peel, D., 9.3
Penrod, J.D., 10.3.1
Phelps, C.E., 7.2
Picard, R.R., 2.6.2 , 8.6.2
Pischke, J.-S., 10.1
Pohlmeier, W., 8.4.1
Poirier, D.J., 7.5.4
Powell, J.L., 10.2.4
Pregibon, D., 2.6 , 4.6

Q
Quick, W.W., 2.1

R
Racine, J., 9.4
Ramsey, J.B., 2.6 , 4.6 , 4.6.2
Rao, C.R., 2.1 , 2.6
Rathouz, P.J., 5.2.1 , 5.8.4 , 5.10 , 10.2.4 , 10.3.1
Robb, R., 2.1
Rogers, W.H., 7.4
Rolph, J.E., 8.4.1
Rubin, D.B., 2.1 , 2.2
Ruud, P.A., 7.5.4

S
Santos Silva, J.M.C., 9.2.2
Schwarz, G., 2.6.1 , 5.8.1 , 8.6.1
Shah, C., 6.2.1
Sherriff, J., 7.2
Sin, C.-Y., 2.6.1
Skinner, C.J., 11.1 , 11.2.2
Smith, R.J., 10.3.1
Smith, T.F., 11.1 , 11.2.2
Staiger, D.O., 10.2.2 , 10.2.3
Startz, R., 10.2.2 , 10.2.3
Stock, J.H., 10.2.2 , 10.2.3

T
Teicher, H., 9.3.2
Terza, J.V., 10.2.4 , 10.3.1
Thompson, S.G., 7.2
Tobin, J., 7.6.1
Todorovic, P., 7.2
Trivedi, P.K., 1.1 , 1.4 , 2.1 , 4.1 , 7.1 , 8.1 , 8.3 , 9.3 , 10.1 , 10.4 , 11.1 ,
11.3.2 , 11.4.3
Trognon, A., 8.2.2

U
Ulrich, V., 8.4.1

V
van de Ven, W.P.M.M., 2.1
van Garderen, K.J., 6.2.1
Vanness, D.J., 2.7
Veazie, P.J., 6.5
Vella, F., 10.2.4 , 10.3.1
Verbeek, M., 10.3.1
Vuong, Q.H., 8.3.1 , 8.6
Vytlacil, E.J., 2.1

W
Wang, H., 2.5
White, H., 2.6.1
Willan, A.R., 2.1
Winkelmann, R., 8.1

Wooldridge, J.M., 1.1 , 2.1 , 2.2 , 2.3.3 , 4.1 , 7.1 , 7.3 , 10.1 , 10.2.4 ,
10.3.1 , 11.1
Woolhiser, D.A., 7.2
Wright, J.H., 10.2.3
Wu, Y., 2.1 , 2.6

Y
Yang, W.G., 2.1
Yogo, M., 10.2.3
Yu, S., 7.4

Z
Zaslavsky, A.M., 7.6.3
Zhang, Y., 2.1

Subject index
2SLS, 10.2.2 , 10.2.2
assumptions for instruments, 10.2.2
balance test, 10.2.3
exogeneity test, 10.2.3
F test, 10.2.3
GMM estimation, 10.4
overidentifying restriction test, 10.2.3
specification test of instrument strength, 10.2.3
specification tests, 10.2.3 , 10.2.3
standard errors compared with OLS, 10.2.2
2SRI
control functions, 10.2.4
functional form, 10.3.1
linear models, 10.2.4 , 10.2.4
linear models versus nonlinear models, 10.2.4 , 10.3.1
nonlinear models, 10.3 , 10.3.1

A
actual outcomes
comparison between two-part and generalized tobit models, 7.5.5
generalized tobit models, 7.5.4
two-part models, 7.1 , 7.4.1 , 7.5.4 , 7.5.5
Agency for Healthcare Research and Quality, see AHRQ
AHRQ, 3.1 , 3.4 , 3.5
AIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5 , 5.8.1 , 5.8.1 , 5.10 , 9.3.1
comparison of NB1 and NB2, 8.3.1
count models, 8.6 , 8.6.1 , 8.6.1
does not suffer from multiple testing, 4.6.5
example where differs from BIC, 4.6.5
FMM, 9.3
MLE formula, 2.6.1 , 8.6.1
OLS formula, 2.6.1
robustness, 2.6.1
Akaike information criterion, see AIC
at((mean) _all) option, 8.2.3
ATE, 2.2 , 2.3.2 , 2.3.3 , 2.4 , 2.4.1 , 2.4.2 , 2.5 , 2.7
estimation, 2.3 , 2.3.3

laboratory experiment, 2.3.1
margins and teffects commands, 4.3.3
OLS, 4.3.3
ATET, 2.2 , 2.3.2 , 2.3.3 , 2.4 , 2.4.1 , 2.4.2 , 4.3.3
count models, 8.2.3
OLS, 4.3.3
average treatment effect on the treated, see ATET
average treatment effects, see ATE
aweights, 11.2.1 , 11.3.1

B
balance tests, 2SLS, 10.2.3
Bayesian information criterion, see BIC
bcskew0 command, 6.6
bfit command, 4.7
BIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5 , 5.8.1 , 5.8.1 , 5.10 , 9.3.1
comparison of NB1 and NB2, 8.3.1
count models, 8.6 , 8.6.1 , 8.6.1
does not suffer from multiple testing, 4.6.5
example where differs from AIC, 4.6.5
FMM, 9.3
MLE formula, 2.6.1 , 8.6.1
OLS formula, 2.6.1
robustness, 2.6.1
bootstrap command, 5.10 , 11.3.2
adjustment for clustering, 11.3.2
intuition, 11.3.2
nonparametric regression, 9.4.1
standard errors, 11.3.2
boxcox command, 5.8.2 , 5.10 , 6.6 , 7.8
Box-Cox models, 5.8.2 , 6.1 , 6.5 , 6.5.1
example, 6.5.1 , 6.5.1
formula, 6.5
skewness, 6.5
square root model, 6.5
two-part model, second part in, 7.4

C
CDE, 1.1 , 9.1 , 9.5 , 9.5
estimate with one logit model, 9.5

fit with series of logit models, 9.5
flexibility for different distributions, 9.5
for count models, 9.5
heterogeneity across bins, 9.5
homogeneity within bins, 9.5
intuition, 9.5
relation to two-part models, 9.5
two key assumptions, 9.5
censoring
causes, 8.5.2
comparison with truncation, 7.6.1
count models, 8.5 , 8.5.2 , 8.5.2
definition, 7.6.1
formulas, 8.5.2
right-censoring, 7.6.2
centaur, 7.6.1
centile command, 5.1
cfunction option, 10.5
chain rule, 7.5.2
clusters, 11.2.2 , 11.2.2
affect standard errors, 11.3
bootstrap, 11.3.2
in natural experiments, 11.2.3 , 11.2.3
jackknife, 11.3.2
PSU, 11.2.2
standard-error formula, 11.2.2
cnreg command, 7.8
codebook command, 11.4.1
complex survey design, see design effects
conditional density estimators, see CDE
conditional independence assumption, see ignorability
confounders, 10.1
consistency
GLM, 5.2.2
negative binomial models, 8.3
Poisson models, 8.1 , 8.2.1 , 8.2.2
contrast() option, 4.3.3 , 4.7 , 5.7 , 5.10
control functions, 10.1
2SRI, 10.2.4
functional form, 10.2.4

functions of residuals, 10.3.1
nonlinear functions, 10.2.4
count data, 1.1
count models, 1.1 , 1.3 , 8 , 8.8
AIC, 8.6.1 , 8.6.1
AIC and BIC, 8.3.1 , 8.6
BIC, 8.6.1 , 8.6.1
censoring, 8.5.2 , 8.5.2
cross-validation, 8.6.2 , 8.6.2
discreteness, 8.1
event counts, 8.1
event counts for NB1 and NB2, 8.3.1
FMM, 9.3.2
hurdle count models, 8.4 , 8.4.1
model comparisons, 8.6
model selection, 8.3.1 , 8.4.2 , 8.6.1 , 8.6.2
model selection examples, 8.6.1
negative binomial models, 8.3 , 8.3.1
Poisson models, 8.2 , 8.2.1
skewness, 8.1
truncation, 8.5 , 8.5.1
zero-inflated models, 8.4.2 , 8.4.2
counterfactual outcome, 2.2 , 2.3 , 2.3.1
cross-validation, 2.1 , 2.6.2 , 2.6.2
comparison with repeated random subsampling, 8.6.2
count models, 8.6.2 , 8.6.2
Current Population Survey, 11.4

D
data-generating process
generalized tobit, 7.4
hurdle or two-part model, 8.4.1
negative binomial, 8.3
Poisson distribution, 8.1 , 8.2.4
potential outcomes, 2.2
two-part model, 7.4
delta-method standard errors, 4.3.3 , 11.3.2
dental expenditures
generalized tobit model examples, 7.5.5
two-part model example, 7.5.5

describe command, 3.6
design effects, 1.1 , 1.3 , 2.1 , 3.2 , 11 , 11.6
clusters, 11.2.2 , 11.2.2
examples, 11.4 , 11.4.4
inference, 11.3 , 11.3.2
point estimation, 11.3 , 11.3.1
standard errors, 11.1 , 11.3.2 , 11.3.2
stratification, 11.2.2 , 11.2.2
survey design setup, 11.4.1 , 11.4.1
weighted Poisson count model, 11.4.4 , 11.4.4
weighted sample means, 11.4.2 , 11.4.2
weights, 11.2.1 , 11.2.1
WLS, 11.4.3 , 11.4.3
detail option, 3.6
difference-in-differences
standard errors, 11.1 , 11.2.3
with clusters, 11.2.3
difficult option, 9.3.2
discreteness, count models, 8.1
dispersion(constant) option, 8.3.1
dispersion(mean) option, 8.3.1
Duan’s smearing factor, 6.3.1 , 6.5
Abrevaya’s method, relation to, 6.5
formula, 6.3.1
dummy variable interpretation
Kennedy transformation, 6.2.1
log models, 6.2.1

E
economic theory, 1
EEE, 5.8.4 , 5.10
eintreg command, 10.3.1
endogeneity, 1.1 , 1.3 , 2.7 , 4.1 , 10 , 10.5
2SLS, 10.2.2 , 10.2.2
example with artificial data, 10.2.1
examples, 10.1
linear models, 10.2 , 10.2.5
nonlinear models, 10.3 , 10.3.1
OLS is inconsistent, 10.2.1 , 10.2.1
omitted variables, 10.1

eoprobit command, 10.3.1
eprobit command, 10.3.1 , 10.5
eregress command, 10.2.5 , 10.3 , 10.3.1 , 10.5
probit option, 10.3
ERM, 10.2.5 , 10.2.5
normality assumption, 10.2.5
error retransformation factor, 6.3 , 6.3.1
estat
commands, 10.5
endogenous command, 10.2.3
firststage command, 10.2.3
gof postestimation command, 4.6.3
ic command, 8.8 , 9.3.1
lcprob command, 9.3.1
ovtest postestimation command, 4.6.4 , 4.7
estimates estat command, 5.10
estimates stats command, 4.7 , 8.8
etregress command, 10.5
exogeneity test, 2SLS specification test, 10.2.3
expression() option, 8.4.1 , 9.2.1
extended estimating equations, see EEE
extended regression models, see ERM
extensive margin, two-part models, 7.1 , 7.5.2 , 7.5.5

F
F test, 2SLS specification test, 10.2.3
FIML, 10.2.5
comparison with LIML, 7.3.1
convergence failure, 7.5.5
formula for generalized tobit model, 7.3.1
generalized tobit models, 7.3.1 , 7.3.1
Heckman selection models, 7.3.1 , 7.3.1
finite mixture models, see FMM
first option, 10.2.2
FMM, 1.1 , 9.1 , 9.3 , 9.3.2
AIC and BIC, 9.3
density formula, 9.3
distribution choice, 9.3
example of count models, 9.3.2
example of healthcare expenditures, 9.3.1 , 9.3.1

example of healthcare use, 9.3.2 , 9.3.2
formula for posterior probability, 9.3
identification, 9.3
incremental effects, 9.3.1 , 9.3.2
interpretation of parameters, 9.3
marginal effects, 9.3.1 , 9.3.2
motivation, 9.3
predictions of means, 9.3.1
predictions of posterior probabilities, 9.3.1
theory, 9.3
two-component example, 9.3.1
fmm command, 9.3.1 , 9.3.2 , 9.6
full-information maximum likelihood, see FIML
fweights, 11.2.1

G
Gauss-Markov theorem, 4.1
generalized linear models, see GLM
generalized method of moments, see GMM
generalized tobit models, 7.3 , 7.3.1
actual outcomes, 7.5.4
censoring assumption, 7.3
censoring formulas, 7.3
comparison of FIML and LIML, 7.3.1
comparison with two-part models, 7.4 , 7.4.1
correlation of errors, 7.3
example comparing with two-part model, 7.5.5 , 7.5.5
examples showing similar marginal effects to two-part models, 7.4.1 ,
7.4.1
exclusion restrictions, 7.3 , 7.4.1
FIML and LIML, 7.3.1 , 7.3.1
formula for FIML, 7.3.1
formula for LIML, 7.3.1
formulas, 7.3 , 7.3
identification, 7.3 , 7.3.1 , 7.4 , 7.4.1 , 7.8
latent outcomes, 7.1 , 7.3 , 7.4 , 7.4.1 , 7.5.4 , 7.5.5
marginal effects examples, 7.5.5 , 7.5.5
marginal effects formulas, 7.5.4 , 7.5.4
motivation is for missing values, 7.4
normality assumption, 7.3

not generalization of two-part models, 7.4
three interpretations, 7.5.4 , 7.5.4
generate command, 3.6 , 6.6
GLM, 1 , 1.1 , 1.3 , 5 , 5.10
assumptions, 5.2.1 , 5.2.1 , 5.2.1
compared with log models, 5.4 , 5.9 , 6.1 , 6.3.1 , 6.4 , 6.4
consistency, 5.2.2
count data, 5.1
dichotomous outcomes, 5.1
distribution family, 5.1 , 5.2.1 , 5.2.2 , 5.8 , 5.8.3 , 5.8.4
framework, 5.2 , 5.2.2
generalization of OLS, 5.1
healthcare expenditure example, 5.3 , 5.3
heteroskedasticity, 5.1 , 5.9
identity link problems, 5.8.1
incremental effects, 5.2.2 , 5.6 , 5.7
index function, 5.2.1
interaction term example, 5.5 , 5.5
inverse of link function, 5.2.1
iteratively reweighted least squares, 5.2.2
link function, 5.1 , 5.2.1 , 5.2.2 , 5.8 , 5.8.4
link function test, 5.8.2 , 5.8.2
log link, 6.4
marginal effects, 5.2.2 , 5.6 , 5.7
parameter estimation, 5.2.2 , 5.2.2
Park test, modified, 5.8.3 , 5.8.3
Poisson models, 8.2.2
prediction example, 5.4 , 5.4
quasi–maximum likelihood, 5.2.2
square root link, 5.3 , 5.8.1 , 5.8.2
tests for link function and distribution family, 5.8 , 5.8.3
two-part model, second part in, 7.2 , 7.2.1 , 7.4 , 7.5.1 , 7.5.2
glm command, 5.3 , 5.7 , 5.10 , 7.8
GMM, 1 , 10.4 , 10.4
endogeneity, 10.4 , 10.4
example of 2SLS, 10.4
Hansen’s test, 10.4
multiple equations, 10.4
multiple instruments, 10.4
overidentification test, 10.4

standard errors different for 2SLS, 10.4
gmm option, 10.4 , 10.5
graphical checks, see visual checks
graphical tests, 1.1
grc1leg command, 4.7

H
Hansen’s test, 10.4
health econometric myths, 1.3 , 1.3
healthcare expenditure example, GLM, 5.3 , 5.3
healthcare expenditures, 1 , 1.1 , 1.2 , 2.1 , 2.5 , 3.1 , 3.3 , 5.1
skewed, 5.1 , 9.2.2
zero, mass at, 7.1
heckman command, 7.4.1 , 7.8
Heckman selection models, 7.3 , 7.3.1
censoring assumption, 7.3
censoring formulas, 7.3
comparison of FIML and LIML, 7.3.1
comparison with two-part models, 7.4 , 7.4.1
example comparing with two-part model, 7.5.5 , 7.5.5
examples showing similar marginal effects to two-part models, 7.4.1 ,
7.4.1
FIML and LIML, 7.3.1 , 7.3.1
formula for FIML, 7.3.1
formula for LIML, 7.3.1
formulas, 7.3 , 7.3
marginal effects examples, 7.5.5 , 7.5.5
marginal effects formulas, 7.5.4 , 7.5.4
motivation is for missing values, 7.4
not generalization of two-part models, 7.4
three interpretations, 7.5.4 , 7.5.4
heckprob command, 7.8
heterogeneous treatment effects, see treatment effects, heterogeneous
heteroskedasticity, 1.1
GLM, 5.1 , 5.9 , 6.1
log models and retransformation, 6.1 , 6.3.1 , 6.3.2 , 6.4
histogram command, 3.6
Hosmer-Lemeshow test, see modified Hosmer-Lemeshow test
hurdle count models, 8.4 , 8.4.1
compared with two-part models, 8.4.1

compared with zero-inflated models, 8.4.2
example of office visits, 8.4.1
first and second parts, 8.4.1
marginal effects, 8.4.1 , 8.4.1
motivation, 8.4.1

I
ignorability, 2.3 , 2.3.3 , 10.1
incremental effects, 1.1 , 2.5 , 2.5
design effects, 11.1
example for Poisson model, 8.2.3
FMM, 9.3.1 , 9.3.2
GLM, 5.6 , 5.7
graphical representation, 4.3.2 , 4.3.2
inconsistency in Poisson models, 8.2.4
linear regression model, 2.5
log models, 6.3.2 , 6.3.2
nonlinear regression model, 2.5
OLS, 4.1 , 4.3.1 , 4.3.1
Poisson models, 8.2.3
two-part model formulas, 7.5.2
two-part models, 7.2.1 , 7.2.1
zero-inflated models, 8.4.2
zeros, if mass at, 7.1
inpatient expenditures, 3.3
instrumental variables, 10.1
assumptions, 10.2.2
specification tests, 10.2.3 , 10.2.3
weak instruments, 10.2.2
intensive margin, two-part models, 7.1 , 7.5.2 , 7.5.5
interaction term, GLM example, 5.5 , 5.5
intreg command, 7.8
inverse Mills ratio, 7.3 , 7.3.1 , 7.5.5
ivpoisson command, 10.3.1 , 10.5
ivprobit command, 10.5
ivregress command, 10.2.2 , 10.2.4 , 10.2.5 , 10.4 , 10.5
ivregress postestimation commands, 10.2.3 , 10.5
ivtobit command, 10.3.1 , 10.5
iweights, 11.2.1

J
J test, see Hansen's test
jackknife command, 11.3.2
adjustment for clustering, 11.3.2
intuition, 11.3.2
standard errors, 11.3.2
Jensen’s inequality, 6.3

K
Kennedy transformation, 6.2.1
k-fold cross-validation, see cross-validation
kurtosis, MEPS, 3.3

L
latent outcomes
generalized tobit models, 7.1 , 7.3 , 7.4 , 7.4.1 , 7.5.4 , 7.5.5
tobit models, 7.6.1
LEF, 8.1
negative binomial models are not LEF, 8.3
negative binomial models with fixed α, 8.3
Poisson models, 8.1 , 8.2.2
limited-information maximum likelihood, see LIML
LIML
comparison with FIML, 7.3.1
formula for generalized tobit model, 7.3.1
generalized tobit models, 7.3.1 , 7.3.1
Heckman selection models, 7.3.1 , 7.3.1
linear
exponential family, see LEF
regression models, see OLS
regression models mathematical formulation, 4.2 , 4.2
link function test for GLM, 5.8.2 , 5.8.2
linktest command, 4.6.4 , 4.7
lnskew0 command, 6.6
local linear regression, see nonparametric regression
log models, 1.1 , 1.3 , 5.1 , 5.9 , 6 , 6.4 , 6.6
compared with Box-Cox models, 6.5 , 6.5
compared with GLM, 5.4 , 5.9 , 6.4 , 6.4
dummy variable interpretation, 6.2.1
error retransformation factors, 6.3 , 6.3.1 , 6.3.1

estimation and interpretation, 6.2.1 , 6.2.1
formula, 6.2.1
healthcare expenditure example, 6.2.1 , 6.2.1
marginal and incremental effects, 6.1 , 6.3.2 , 6.3.2
marginal effects affected by heteroskedasticity, 6.3.2
retransformation to raw scale, 6.3 , 6.3.2
two-part model, second part in, 7.2
logit command, 7.8 , 8.8
logit models
first part of
hurdle count model, 8.4.1
two-part model, 7.2 , 7.2.1 , 7.4 , 7.5.2
zero-inflated model, 8.4.2
weighted formula, 11.3.1
weights, 11.3.1

M
marginal effects, 1.1 , 2.5 , 2.5
design effects, 11.1
example for Poisson model, 8.2.3
FMM, 9.3.1 , 9.3.2
generalized tobit models, 7.5.4 , 7.5.4
GLM, 5.6 , 5.7
graphical representation, 4.3.2 , 4.3.2
Heckman selection models, 7.5.4 , 7.5.4
hurdle count models, 8.4.1 , 8.4.1
inconsistency in Poisson models, 8.2.4
log models, 6.1 , 6.3.2 , 6.3.2 , 6.3.2
log models affected by heteroskedasticity, 6.3.2
nonlinear regression model, 2.5
OLS, 4.1 , 4.3.1 , 4.3.1
Poisson models, 8.2.3
two-part models, 7.2.1 , 7.2.1
formulas for, 7.5.2 , 7.5.2
similar to generalized tobit models, 7.4.1 , 7.4.1
zero-inflated models, 8.4.2
zeros, if mass at, 7.1
margins command, 4.3.1 , 4.3.2 , 4.3.3 , 4.7 , 5.4 , 5.5 , 5.7 , 5.10 , 6.6 ,
7.5.1 , 7.5.3 , 7.5.5 , 8.2.3 , 8.3.1 , 8.4.1 , 9.2.1 , 9.3.1 , 9.3.2 , 9.4.1 , 11.4.4
marginsplot command, 4.3.2 , 4.7 , 5.5 , 5.10 , 9.4.1

maximum likelihood estimation, see MLE
mean command, 11.4.2
median regression, see quantile regression
Medical Expenditure Panel Survey, see MEPS
MEPS, 1.1 , 1.3 , 3 , 3.1 , 3.6
demographics, 3.4
expenditure and use variables, 3.3 , 3.3
explanatory variables, 3.4 , 3.4
health insurance, 3.4
health measures, 3.4
Household Component, 3.1
Household Survey, 11.2.2
oversampling and undersampling, 11.2.1
overview, 3.2 , 3.2
PSU, 11.4
PSUs based on counties, 11.2.2
sample dataset, 3.5 , 3.5
sample size, 3.2
sampling weights, 11.4
stratified multistage sample design, 11.2.2
study design, 11.1 , 11.2.1 , 11.4
survey design setup, 11.4.1
website, 3.1
mermaid, 7.6.1
misspecification
OLS, 4.1 , 4.4 , 4.4.2
exponential example, 4.4.2 , 4.4.2
quadratic example, 4.4.1 , 4.4.1
MLE, 1
Box-Cox models, 6.6
count models, 8.1
FMM, 9.3
Poisson models, 8.2.1 , 8.2.1
quasi–maximum likelihood for GLM, 5.2.2
weighted logit model, 11.3.1
model selection, 2.1 , 2.6 , 2.6.2
AIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5
BIC, 2.6.1 , 2.6.1 , 4.6.5 , 4.6.5
count models, 8.3.1 , 8.6.1 , 8.6.2
cross-validation, 2.6.2 , 2.6.2

graphical tests, 2.6
statistical tests, 2.6
model specification, 4.1
modified Hosmer-Lemeshow test, 2.6 , 4.6 , 4.6.3 , 4.6.3 , 7.7
mqgamma package, 9.2.2
myths, health econometric, see health econometric myths

N
National Health Interview Survey, 3.1 , 11.2.2
natural experiments
cluster, 11.2.3 , 11.2.3
weights, 11.2.3 , 11.2.3
NB1
definition, 8.3
example, 8.3.1
linear mean, 8.3
NB2
definition, 8.3
example, 8.3.1
quadratic mean, 8.3
nbreg command, 8.3.1 , 8.8
negative binomial models, 1.3 , 8.3 , 8.3.1
compared with Poisson models, 8.3
conditional mean same as Poisson, 8.3
consistency, 8.3
examples, 8.3.1 , 8.3.1
formula for density, 8.3
formulas for first two moments, 8.3
motivated by unobserved heterogeneity causing overdispersion, 8.3
NB1 and NB2, 8.3
robustness, 8.3
second part of hurdle count model, 8.4.1
truncated example, 8.4.1
variance exceeds mean, 8.3
zeros, 8.3
negative binomial-1, see NB1
negative binomial-2, see NB2
nlcom command, 9.3.1 , 9.3.2
nonparametric regression, 9.4 , 9.4.1
bootstrap, 9.4.1

compared with parametric models, 9.4
computation time, 9.4.1
examples, 9.4.1 , 9.4.1
local linear regression, 9.4
normality
FIML sensitivity to assumption, 7.5.5
generalized tobit assumption, 7.3
OLS, not assumed by, 7.2.1
tobit model assumption, 7.6.1
npregress command, 9.4 , 9.4.1 , 9.6

O
OLS, 1 , 1.1 , 1.3 , 4 , 4.7 , 5.1
AIC and BIC, 4.6.5 , 4.6.5
assumptions, 4.1
ATE and ATET, 4.3.3
best linear unbiased estimator, 4.1
can be inconsistent for count outcomes, 8.1
compared with median quantile regression, 9.2
endogeneity causes inconsistency, 10.2.1 , 10.2.1
examples of statistical tests, 4.6.4 , 4.6.4
graphical representation of marginal and incremental effects, 4.3.2 ,
4.3.2
incremental effects, 4.3.1 , 4.3.1
log models, see log models
marginal effects, 4.3.1 , 4.3.1
marginal effects compared with quantile regression, 9.2.1
mathematical formulation, 4.2 , 4.2
misspecification
consequences, 4.1 , 4.4 , 4.4.2
exponential example, 4.4.2 , 4.4.2
quadratic example, 4.4.1 , 4.4.1
negative predictions, 5.1
regression with log-transformed dependent variable, see log models
sample-to-sample variation, 5.1
statistical tests, 4.6 , 4.6.5
treatment effects, 4.3.3 , 4.3.3
unbiased property, 4.1
visual checks, 4.5 , 4.5.2
omitted variables, 10.2.1 , 10.2.4

endogeneity, source of, 10.1
ordinary least squares, see OLS
outliers, 1.3
overdispersion
problem for Poisson models, 8.2.4
unobserved heterogeneity in negative binomial models, 8.3
overidentification test, GMM, 10.4
overidentifying restriction test, 2SLS specification test, 10.2.3
oversampling, 3.4

P
Park test, modified for GLM, 5.8.3 , 5.8.3 , 5.10
parmest package, 9.2.1
poisson command, 8.8
Poisson models, 1.3 , 8.2 , 8.2.4 , 11.4.3
canonical count model, 8.1
compared with negative binomial models, 8.3
consistent if conditional mean correct, 8.1 , 8.2.2
events underpredicted in tails, 8.2.4
example of office-based visits, 8.2.3
exponential mean specification, 8.2.3
formula, 8.2.1
for covariance matrix, 8.2.1
for semielasticity, 8.2.3
GLM objective function, 8.2.2
GMM, 10.4
incremental effects, 8.2.3
can be inconsistent, 8.2.4
example, 8.2.3
instrumental variables, by GMM, 10.4
interpretation, 8.2.3 , 8.2.3
interpretation as semielasticity, 8.2.3
LEF, 8.2.2
log-likelihood function, 8.2.1
marginal effects, 8.2.3
can be inconsistent, 8.2.4
example, 8.2.3
mean equals variance, 8.2.1
MLE, 8.2.1 , 8.2.1
motivation from exponential time between events, 8.2.1

overdispersion, 8.2.4
restrictiveness, 8.2.4 , 8.2.4
robust standard errors, 8.2.4
robustness, 8.2.2 , 8.2.2 , 8.6
second part of hurdle count model, 8.4.1
variance equals mean, 8.2.1
weighted example, 11.4.4 , 11.4.4
zeros underpredicted, 8.2.4
population mean, 11.3.1
potential outcomes, 1.1 , 2.1 , 2.2 , 2.2 , 2.3 , 2.3.2 , 2.3.3 , 2.4.1 , 2.5 , 4.3
causal inference, 2.2
predict command, 6.6 , 9.3.1
prediction, GLM, 5.4 , 5.4
Pregibon’s link test, 2.6 , 4.6 , 4.6.1 , 4.6.1 , 4.6.3 , 7.7
example, 4.6.4
relation to RESET test, 4.6.2
Stata command, 4.6.4
primary sampling unit, see PSU
probit command, 7.8 , 8.8
probit models
comparison with tobit, 7.6.1
first part of
hurdle count model, 8.4.1
zero-inflated model, 8.4.2
tobit model, 7.6.1
two-part model, first part in, 7.2 , 7.2.1 , 7.4 , 7.5.1 , 7.5.2
PSU, 11.2.2
bootstrap or jackknife, 11.3.2
MEPS, 11.2.2
pweights, 11.2.1 , 11.3.1

Q
qcount package, 9.2.2
qreg command, 9.2.1 , 9.6
quantile regression, 1.1 , 9.1 , 9.2 , 9.2.2
appealing properties, 9.2
applied to count data, 9.2.2
applied to nonlinear data-generating process, 9.2.2
compared with OLS, 9.2
effects of covariates vary across conditional quantiles of the outcome, 9.2
equivariant to monotone transformations, 9.2
examples, 9.2.1 , 9.2.2
extensions, 9.2.2 , 9.2.2
formula, 9.2
marginal effects compared with OLS, 9.2.1
median regression, 9.2
robust to outliers, 9.2

R
Ramsey’s regression equation specification error test, see RESET test
RAND Health Insurance Experiment, 7.2
randomized trial, 2.3.2 , 2.3.3
regress command, 4.7 , 6.6 , 7.8 , 10.2.4 , 11.4.3
regress postestimation commands, 4.5.1 , 4.7
replace command, 3.6
RESET test, 2.6 , 4.6 , 4.6.2 , 4.6.2 , 7.7
example, 4.6.4
generalization of Pregibon’s link test, 4.6.2
Stata command, 4.6.4
retransformation, 5.1 , 6.1 , 6.3.1 , 6.3.2
general theory of Box-Cox under homoskedasticity, 6.5
robustness, negative binomial models, 8.3
rvfplot command, 4.5.1 , 4.7
rvpplot command, 4.5.1 , 4.7

S
sampling weights, see weights
scatter command, 3.6
selection
attrition, 11.1 , 11.2.3
models, 1.1 , 1.3
on unobservables, 2.4.1 , 2.7
self-selection, 2.3.3
single-index models that accommodate zeros, 7.1 , 7.6 , 7.6.3
GLM, 7.6.3
nonlinear least squares, 7.6.3
one-part models, 7.6.3 , 7.6.3
tobit models, 7.6.1 , 7.6.2
skewed data, 5.1

skewness, 1.1 , 1.2 , 5.1
Box-Cox targets skewness, 6.5
count models, 8.1
GLM, 5.9
MEPS, 3.3
OLS errors, 5.1
skewness transformation, zero, 6.6
specification tests, 2.6
2SLS, 10.2.3 , 10.2.3
balance test, 10.2.3
exogeneity test, 10.2.3
F test, 10.2.3
instrument strength test, 10.2.3
overidentifying restriction test, 10.2.3
sqreg command, 9.2.1 , 9.6
square root model, 6.5
square root transformation
GLM, 5.3 , 5.8.1 , 5.8.2
ssc install
fmm9 command, 9.6
parmest command, 9.2.1
twopm command, 7.8
standard errors
2SLS versus OLS, 10.2.2 , 10.4
bootstrap, 11.3.2
delta method, 4.3.3 , 11.3.2
design effects, 11.1 , 11.3.2 , 11.3.2
difference-in-differences, 11.1
formula with clustering, 11.2.2
GMM correct standard errors for 2SRI, 10.4
jackknife, 11.3.2
margins and teffects command, 4.3.3
Poisson models, 8.2.4
Stata
Base Reference Manual, 11.3.2
command for weights, 11.2.1
Data-Management Reference Manual, 3.6
Getting Started With Stata manual, 3.6
Graphics Reference Manual, 3.6

introduction to the language, 1.4
option for clustering, 11.2.2
Survey Data Reference Manual, 11.6
Treatment-Effects Reference Manual, 4.3.3
User’s Guide, 3.6 , 11.3.1
Stata resources
2SLS, 10.5
2SRI, 10.5
AIC and BIC, 5.10 , 8.8
Box-Cox, 6.6
CDE, 9.6
clusters, 11.6
count models, 8.8
data cleaning, 3.6
design effects, 11.6
endogeneity, 10.5
FMM, 9.6
generalized tobit models, 7.8
getting started, 3.6
GLM, 5.10
GMM, 10.5
graphing, 3.6 , 4.7
Heckman selection models, 7.8
hurdle count models, 8.8
incremental effects, 5.10
instrumental variables, 10.5
linear regression, 4.7 , 6.6
marginal and incremental effects, 4.7
marginal effects, 5.10
negative binomial models, 8.8
Poisson models, 8.8
quantile regression, 9.6
skewness transformation, zero, 6.6
statistical tests, 4.7
stratification, 11.6
study design, 11.6
summary statistics, 3.6
survey design, 11.6
treatment effects, 4.7
two-part models, 7.8

weights, 11.6
zero-inflated models, 8.8
statistical tests, 1.1
OLS, 4.6 , 4.6.4 , 4.6.4 , 4.6.5
stratification, 11.2.2 , 11.2.2
affects standard errors, 11.3
purpose, 11.2.2
study design, see design effects
suest command, 8.4.1
summarize command, 3.6 , 11.4.2
survey design, see design effects
svy prefix command, 5.10 , 11.4.1 , 11.4.2 , 11.4.3 , 11.6
svy bootstrap command, 11.3.2
svy jackknife command, 11.3.2
svyset command, 11.3.2 , 11.4.1 , 11.6

T
tabstat command, 3.6
Taylor series, linearization for standard errors, 11.3.2 , 11.4
teffects command, 4.3.3 , 4.7
test command, 4.7
testparm command, 4.7
tnbreg command, 8.8
tobit command, 7.8
tobit models, 7.6.1 , 7.6.2
censoring, 7.6.1
censoring assumption, 7.6.1 , 7.6.2
comparison with probit, 7.6.1
formulas, 7.6.1
homoskedasticity assumption, 7.6.1
latent outcomes, 7.6.1
MLE formula, 7.6.1
normality assumption, 7.6.1 , 7.6.2
restrictive assumptions, 7.6.2
right-censored example, 7.6.2
truncation of positive values at zero, 7.6.2
why used sparingly, 7.6.2 , 7.6.2
tobit, generalized, see generalized tobit models
tpoisson command, 8.4.1 , 8.8
treatment effects, 1.1 , 2.1 , 2.2 , 2.2 , 2.3.3 , 2.5

clustering, 11.2.3
covariate adjustment, 2.3.3 , 2.3.3
difference-in-differences, 11.2.3
endogeneity, 10.1
heterogeneous, 1.1 , 1.2 , 1.3 , 9 , 9.6
linear regression, 2.4.1 , 2.4.1
nonlinear regression, 2.4.2 , 2.4.2
OLS, 4.1 , 4.3.3 , 4.3.3
on not-treated, 8.2.3
randomization, 2.3.2 , 2.3.2
regression estimates, 2.4 , 2.4.2
zeros, if mass at, 7.1
treatreg command, 7.8
truncation
comparison with censoring, 7.6.1
count models, 8.5 , 8.5.1
definition, 7.6.1
formulas, 8.5.1
left-truncation, 8.5.1
right-truncation, 8.5.1
zero-truncation, 8.5.1
two-part models, 1.1 , 1.3 , 7.2 , 7.2.1
actual outcomes, 7.1 , 7.4.1 , 7.5.4 , 7.5.5
choices for first and second parts, 7.2
compared with hurdle count models, 8.4.1
comparison with generalized tobit models, 7.4 , 7.4.1
comparison with Heckman selection models, 7.4 , 7.4.1
example
comparing with generalized tobit model, 7.5.5 , 7.5.5
comparing with Heckman selection model, 7.5.5 , 7.5.5
with MEPS, 7.5.1 , 7.5.1
examples showing similar marginal effects to generalized tobit models,
7.4.1 , 7.4.1
expected value of y, 7.2.1 , 7.2.1
formula,
general, 7.2
logit and GLM with log link, 7.2.1
probit and GLM with log link, 7.2.1
probit and homoskedastic nonnormal log model, 7.2.1
probit and homoskedastic normal log model, 7.2.1

probit and linear, 7.2.1
probit and normal, 7.2.1
formulas for incremental effects, 7.5.2
history, 7.2
marginal and incremental effects, 7.2.1 , 7.2.1
marginal effect example, 7.5.3 , 7.5.3
marginal effect formulas, 7.5.2 , 7.5.2
mixture density motivation, 7.4
motivation is for actual zeros, 7.4
not nested in generalized tobit models, 7.4
relation to CDE, 9.5
statistical decomposition, 7.2
statistical tests, 7.7
twopm command, 7.4.1 , 7.5.1 , 7.5.3 , 7.8 , 8.4.1
two-stage least squares, see 2SLS
two-stage residual inclusion, see 2SRI
twostep option, 10.5

V
vce(cluster) option, 11.2.1 , 11.2.2 , 11.3.2 , 11.6
vce(robust) option, 8.2.4
visual checks
artificial-data example, 4.5.1 , 4.5.1
MEPS example, 4.5.2 , 4.5.2
OLS, 4.5 , 4.5.2
Vuong’s test, 8.3.1 , 8.6

W
weak instruments, see instrumental variables
weighted least squares, see WLS
weighted sample means, example, 11.4.2 , 11.4.2
weights, 11.2.1 , 11.2.1
affect point estimates and standard errors, 11.3
analytic weights, aweights, 11.2.1
definitions, 11.2.1
effect on point estimates, 11.3.1
effect on regression coefficients, 11.1
frequency weights, fweights, 11.2.1
importance weights, iweights, 11.2.1
in natural experiments, 11.2.3 , 11.2.3

logistic regression, 11.3.1
population mean, 11.3.1
population weights, 11.2.1
postsampling weights, 11.2.1 , 11.2.3
sampling weights, 11.2.1
pweights, 11.2.1
with bootstrap or jackknife, 11.3.2
WLS, 11.3.1
WLS, 11.3.1
example, 11.4.3 , 11.4.3
formula, 11.3.1

Z
zero-inflated models, 8.4 , 8.4.2 , 8.4.2
compared with hurdle count models, 8.4.2
example for office-based visits, 8.4.2
formula for density, 8.4.2
heterogeneity, 8.4.2
incremental effects, 8.4.2
marginal effects, 8.4.2
motivation, 8.4.2
zeros, 1.1 , 1.2
hurdle count model, 8.4.1
MEPS, 3.3
models for continuous outcomes with mass at zero, 7 , 7.8
negative binomial models, 8.3
single-index models, 7.6 , 7.6.3
underprediction in Poisson models, 8.2.4
zinb command, 8.4.2 , 8.8
zip command, 8.4.2 , 8.8
