Econometrics
CHRISTOPHER F. BAUM
Department of Economics
Boston College
ISBN-10: 1-59718-013-0
ISBN-13: 978-1-59718-013-9
This book is protected by copyright. All rights are reserved. No part of this book may be repro-
duced, stored in a retrieval system, or transcribed, in any form or by any means—electronic,
mechanical, photocopying, recording, or otherwise—without the prior written permission of
StataCorp LP.
Stata is a registered trademark of StataCorp LP. LaTeX 2ε is a trademark of the American
Mathematical Society.
Contents
Illustrations xv
Preface xvii
Notation and typography xix
1 Introduction 1
1.1 An overview of Stata’s distinctive features . . . . . . . . . . . . . . . . 1
1.2 Installing the necessary software . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Installing the support materials . . . . . . . . . . . . . . . . . . . . . . 5
2 Working with economic and financial data in Stata 7
2.1 The basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 The use command . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Variable types . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 _n and _N . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 generate and replace . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.5 sort and gsort . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.6 if exp and in range . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.7 Using if exp with indicator variables . . . . . . . . . . . . . . . 13
2.1.8 Using if exp versus by varlist: with statistical commands . . . 15
2.1.9 Labels and notes . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.10 The varlist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.11 drop and keep . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.12 rename and renvars . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.13 The save command . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.14 insheet and infile . . . . . . . . . . . . . . . . . . . . . . . . . . 21
References 321
Author index 329
Subject index 333
Preface
This book is a concise guide for applied researchers in economics and finance to learn
basic econometrics and use Stata with examples using typical datasets analyzed in
economics. Readers should be familiar with applied statistics at the level of a simple
linear regression (ordinary least squares, or OLS) model and its algebraic representation,
equivalent to the level of an undergraduate statistics/econometrics course sequence.1
The book also uses some multivariate calculus (partial derivatives) and linear algebra.
I presume that the reader is familiar with Stata’s windowed interface and with the
basics of data input, data transformation, and descriptive statistics. Readers should
consult the appropriate Getting Started with Stata manual if review is needed. Mean-
while, readers already comfortable interacting with Stata should feel free to skip to
chapter 4, where the discussion of econometrics begins in earnest.
In any research project, a great deal of the effort is involved with the preparation
of the data specified as part of an econometric model. While the primary focus of the
book is placed upon applied econometric practice, we must consider the considerable
challenges that many researchers face in moving from their original data sources to
the form needed in an econometric model—or even that needed to provide appropriate
tabulations and graphs for the project. Accordingly, Chapter 2 focuses on the details
of data management and several tools available in Stata to ensure that the appropriate
transformations are accomplished accurately and efficiently. If you are familiar with
these aspects of Stata usage, you should feel free to skim this material, perhaps returning
to it to refresh your understanding of Stata usage. Likewise, Chapter 3 is devoted to a
discussion of the organization of economic and financial data, and the Stata commands
needed to reorganize data among the several forms of organization (cross section, time
series, pooled, panel/longitudinal, etc.). If you are eager to begin with the econometrics
of linear regression, skim this chapter, noting its content for future reference.
Chapter 4 begins the econometric content of the book and presents the most widely
used tool for econometric analysis: the multiple linear regression model applied to
continuous variables. The chapter also discusses how to interpret and present regression
estimates and discusses the logic of hypothesis tests and linear and nonlinear restrictions.
The last section of the chapter considers residuals, predicted values, and marginal effects.
Applying the regression model depends on some assumptions that real datasets
often violate. Chapter 5 discusses how the crucial zero-conditional-mean assumption
of the errors may be violated in the presence of specification error. The chapter also
1. Two excellent texts at this level are Wooldridge (2006) and Stock and Watson (2006).
discusses statistical and graphical techniques for detecting specification error. Chapter 6
discusses other assumptions that may be violated, such as the assumption of independent
and identically distributed (i.i.d.) errors, and presents the generalized linear regression
model. It also explains how to diagnose and correct the two most important departures
from i.i.d., heteroskedasticity and serial correlation.
Chapter 7 discusses using indicator variables or dummy variables in the linear re-
gression models containing both quantitative and qualitative factors, models with in-
teraction effects, and models of structural change.
Many regression models in applied economics violate the zero-conditional-mean as-
sumption of the errors because they simultaneously determine the response variable and
one or more regressors or because of measurement error in the regressors. No matter
the cause, OLS techniques will no longer generate unbiased and consistent estimates, so
you must use instrumental-variables (IV) techniques instead. Chapter 8 presents the
IV estimator and its generalized method-of-moments counterpart along with tests for
determining the need for IV techniques.
Chapter 9 applies models to panel or longitudinal data that have both cross-sectional
and time-series dimensions. Extensions of the regression model allow you to take ad-
vantage of the rich information in panel data, accounting for the heterogeneity in both
panel unit and time dimensions.
Many econometric applications model categorical and limited dependent variables:
a binary outcome, such as a purchase decision, or a constrained response such as the
amount spent, which combines the decision whether to purchase with the decision of
how much to spend, conditional on purchasing. Because linear regression techniques
are generally not appropriate for modeling these outcomes, chapter 10 presents several
limited-dependent-variable estimators available in Stata.
The appendices discuss techniques for importing external data into Stata and explain
basic Stata programming. Although you can use Stata without doing any programming,
learning how to program in Stata can help you save a lot of time and effort. You should
also learn to generate reproducible results by using do-files that you can document,
archive, and rerun. Following Stata’s guidelines will make your do-files shorter and
easier to maintain and modify.
4 Linear regression
This chapter presents the most widely used tool in applied economics: the linear regres-
sion model, which relates a set of continuous variables to a continuous outcome. The
explanatory variables in a regression model often include one or more binary or indica-
tor variables; see chapter 7. Likewise, many models seek to explain a binary response
variable as a function of a set of factors, which linear regression does not handle well.
Chapter 10 discusses several forms of that model, including those in which the response
variable is limited but not binary.
4.1 Introduction
This chapter discusses multiple regression in the context of a prototype economic re-
search project. To carry out such a research project, we must
1. lay out a research framework—or economic model—that lets us specify the ques-
tions of interest and defines how we will interpret the empirical results;
2. find a dataset containing empirical counterparts to the quantities specified in the
economic model;
3. use exploratory data analysis to familiarize ourselves with the data and identify
outliers, extreme values, and the like;
4. fit the model and use specification analysis to determine the adequacy of the
explanatory factors and their functional form;
5. conduct statistical inference (given satisfactory findings from specification analy-
sis) on the research questions posed by the model; and
6. analyze the findings from hypothesis testing and the success of the model in terms
of predictions and marginal effects. On the basis of these findings, we may have
to return to one of the earlier stages to reevaluate the dataset and its specification
and functional form.
Section 2 reviews the basic regression analysis theory on which regression point and
interval estimates are based. Section 3 introduces a prototype economic research project
studying the determinants of communities’ single-family housing prices and discusses the
various components of Stata’s results from fitting a regression model of housing prices.
Section 4 discusses how to transform Stata’s estimation results into publication-quality
tables. Section 5 discusses hypothesis testing and estimation subject to constraints on
the parameters. Section 6 deals with computing residuals and predicted values. The
last section discusses computing marginal effects. In the following chapters, we take up
violations of the assumptions on which regression estimates are based.
[Figure 4.1: Average single-family housing prices versus the student–teacher ratio, with the linear fit superimposed]
Figure 4.1 shows average single-family housing prices for 100 Boston-area communi-
ties, along with the linear fit of housing prices to student–teacher ratios. The conditional
mean of price for each value of stratio is shown by the appropriate point on the line.
As theory predicts, the mean house price conditional on the student–teacher ratio is
inversely related to that ratio. Communities with more crowded schools are considered
less desirable. Of course, this relationship between house price and the student–teacher
ratio must be considered ceteris paribus: all other factors that might affect the price
of the house are held constant when we evaluate the effect of a measure of community
schools’ quality on the house price.
In working with economic data, we do not know the population values of β1 , β2 , . . . ,
βk . We work with a sample of N observations of data from that population. Using the
information in this sample, we must
To obtain estimates of the coefficients, some assumptions must be made about the
process that generated the data. I discuss those assumptions below and describe what I
mean by good estimates. Before performing steps 2–4, I check whether the data support
these assumptions by using a process known as specification analysis.
If we have a cross-sectional sample from the population, the linear regression model
for each observation in the sample has the form
y = Xβ + u (4.1)
u = y − xβ
We assume that
E [u | x] = 0 (4.2)
i.e., that the u process has a zero-conditional mean. This assumption is that the un-
observed factors involved in the regression function are not related systematically to
the observed factors. This approach to the regression model lets us consider both non-
stochastic and stochastic regressors in X without distinction, as long as they satisfy the
assumption of (4.2).3
E[x′u] = 0
E[x′(y − xβ)] = 0                                        (4.3)
Substituting calculated moments from our sample into the expression and replacing the
unknown coefficients β in (4.3) with estimated values β̂ yields the ordinary least squares
(OLS) estimator
X′y − X′Xβ̂ = 0
β̂ = (X′X)⁻¹X′y                                          (4.4)
û = y − Xβ̂
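As a cross-check on (4.4), the normal equations can be solved directly in matrix form. The following sketch uses Python/NumPy on simulated data rather than Stata; the seed, sample size, and coefficient values are arbitrary choices for illustration:

```python
import numpy as np

# OLS via the normal equations of (4.4), on simulated data.
rng = np.random.default_rng(42)
N, k = 100, 3                                  # N observations, k coefficients
X = np.column_stack([np.ones(N),               # constant term
                     rng.normal(size=(N, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)         # u satisfies E[u|x] = 0 by construction

# beta_hat = (X'X)^{-1} X'y; solve() is preferred to forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat                       # residuals: u_hat = y - X beta_hat
```

By construction the residuals are orthogonal to the columns of X, so X′û is zero up to floating-point error, which is exactly the sample moment condition (4.3).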
2. x is a vector of random variables and u is a scalar random variable. In (4.1), X is a matrix of
realizations of the random vector x; u and y are vectors of realizations of the scalar random variables
u and y.
3. Chapter 8 discusses how to use the instrumental-variables estimator when the zero-conditional-
mean assumption is violated.
4. The assumption of zero-conditional mean is stronger than that of a zero covariance, because co-
variance considers only linear relationships between the random variables.
Given the solution for the vector β̂, the additional parameter of the regression prob-
lem σᵤ²—the population variance of the stochastic disturbance—may be estimated as a
function of the regression residuals ûᵢ:

s² = ∑ᵢ₌₁ᴺ ûᵢ² / (N − k) = û′û / (N − k)                  (4.5)
where (N −k) are the residual degrees of freedom of the regression problem. The positive
square root of s2 is often termed the standard error of regression, or root mean squared
error. Stata uses the latter term and displays s as Root MSE.
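A sketch of (4.5) in Python/NumPy terms, again on simulated data, shows how s² and the Root MSE that Stata reports are related:

```python
import numpy as np

# Estimating sigma_u^2 as in (4.5) and taking its square root (Stata's "Root MSE").
# Simulated data: the disturbance has standard deviation 2, so Root MSE should be near 2.
rng = np.random.default_rng(0)
N, k = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=2.0, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
s2 = (u_hat @ u_hat) / (N - k)     # divide by residual degrees of freedom, not N
root_mse = np.sqrt(s2)             # the standard error of regression
```

Note the division by (N − k), the residual degrees of freedom, rather than N: using N would bias s² downward in small samples.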
The method of moments is not the only approach for deriving (4.4), the well-known
formula for the OLS estimator.5
5. The treatment here is similar to that of Wooldridge (2006). See Stock and Watson (2006) and
appendix 4.A for a derivation based on minimizing the squared-prediction errors.
6. Both frameworks also assume that the (constant) variance of the u process is finite. Formally, i.i.d.
stands for independently and identically distributed.
7. More precisely, the distribution of the OLS estimator converges to a normal distribution. Although
appendix B provides some details, in the text I will simply refer to the “approximate” or “large-sample”
normal distribution. See Wooldridge (2006) for an introduction to large-sample theory.
8. At first glance, you might think that the expression for the VCE should be multiplied by 1/N, but
this assumption is incorrect. As discussed in appendix B, because the OLS estimator is consistent, it
is converging to the constant vector of population parameters at the rate 1/√N, implying that the
variance of the OLS estimator is going to zero as the sample size gets larger. Large-sample theory
compensates for this effect in how it standardizes the estimator. The loss of the 1/N term in the
estimator of the VCE is a product of this standardization.
9. For a formal presentation of the Gauss–Markov theorem, see any econometrics text, e.g., Wooldridge
(2006, 108–109). The OLS estimator is said to be “unbiased” because E[β̂] = β.
versus all other linear, unbiased estimators of the parameterization model. However,
this statement rests upon the hypotheses of an appropriately specified model and an
i.i.d. disturbance process with a zero-conditional mean, as specified in (4.2).
10. When computing in finite precision, we must be concerned with numerical singularity and a com-
puter program’s ability to reliably invert a matrix regardless of whether it is analytically invertible. As
we discuss in section 4.3.7, computationally near-linear dependencies among the columns of X should
be avoided.
The regress command, like other Stata estimation commands, requires us to specify
the response variable followed by a varlist of the explanatory variables.
The header of the regression output describes the overall model estimates, whereas
the table presents the point estimates, their precision, and their interval estimates.
regression imply that a model with a great many regressors can explain y arbitrarily
well. Given the least-squares residuals, the most common measure of goodness of fit,
regression R², may be calculated (given a constant term in the regression function) as

R² = 1 − û′û / ỹ′ỹ                                       (4.6)

where ỹ = y − ȳ: the regressand with its sample mean removed. This calculation
emphasizes that the object of regression is not to explain y′y, the raw sum of squares of
the response variable y, which would merely explain why E[y] ≠ 0—not an interesting
question. Rather, the object is to explain the variations in the response variable.
With a constant term in the model, the least-squares approach seeks to explain
the largest possible fraction of the sample variation of y about its mean (and not the
associated variance). The null model with which (4.1) is contrasted is yᵢ = µ + uᵢ,
where µ is the population mean of y. In estimating a regression, we want to determine
whether the information in the regressors x is useful. Is the conditional expectation
E[y|x] more informative than the unconditional expectation E[y] = µ? The null model
above has an R² = 0, whereas virtually any set of regressors will explain some fraction
of the variation of y around ȳ, the sample estimate of µ. R² is that fraction in the unit
interval, the proportion of the variation in y about ȳ explained by x.
The adjusted R², R̄², takes into account the number of regressors in
the model and scales û′û by (N − k) rather than N.16 R̄² can be expressed as a corrected
version of R² in which the degrees-of-freedom adjustments are made, penalizing a model
with more regressors for its loss of parsimony:

R̄² = 1 − [û′û/(N − k)] / [ỹ′ỹ/(N − 1)] = 1 − (1 − R²)(N − 1)/(N − k)
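The equivalence of the two expressions for the adjusted R² can be verified numerically. The following Python/NumPy sketch uses simulated data; the variable names are illustrative only:

```python
import numpy as np

# R^2 as in (4.6) and the adjusted R^2 computed two ways, confirming the
# identity R_adj^2 = 1 - (1 - R^2)(N - 1)/(N - k). Simulated data.
rng = np.random.default_rng(1)
N, k = 150, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.8, -0.6]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
y_dm = y - y.mean()                              # regressand with sample mean removed

r2 = 1 - (u_hat @ u_hat) / (y_dm @ y_dm)
r2_adj_long = 1 - ((u_hat @ u_hat) / (N - k)) / ((y_dm @ y_dm) / (N - 1))
r2_adj_short = 1 - (1 - r2) * (N - 1) / (N - k)  # algebraically identical
```

Because (N − 1)/(N − k) > 1 whenever k > 1, the adjusted measure is strictly below R² for any model with regressors beyond the constant.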
. regress, beta

      Source |       SS       df       MS              Number of obs =     506
-------------+------------------------------           F(  4,   501) =  175.86
       Model |  49.3987735     4  12.3496934           Prob > F      =  0.0000
    Residual |  35.1834974   501  .070226542           R-squared     =  0.5840
-------------+------------------------------           Adj R-squared =  0.5807
       Total |  84.5822709   505  .167489645           Root MSE      =    .265
The output indicates that lnox has the largest beta coefficient, in absolute terms, fol-
lowed by rooms. In economic and financial applications, where most regressors have
a natural scale, it is more common to compute marginal effects such as elasticities or
semielasticities (see section 4.7).
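Stata's beta option rescales each slope estimate by sd(xⱼ)/sd(y), making coefficients comparable across regressors measured in different units. A Python/NumPy sketch on simulated data (not the book's housing dataset) illustrates the computation:

```python
import numpy as np

# Standardized ("beta") coefficients: slope_j * sd(x_j) / sd(y).
# x2 is deliberately on a much larger scale than x1 to show why
# raw coefficients are not comparable but beta coefficients are.
rng = np.random.default_rng(7)
N = 300
x1 = rng.normal(size=N)                    # sd approximately 1
x2 = rng.normal(scale=10.0, size=N)        # sd approximately 10
y = 2.0 * x1 + 0.1 * x2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)      # [constant, slope on x1, slope on x2]

# beta coefficients for the two slopes (constant has no beta coefficient)
beta_std = b[1:] * np.array([x1.std(ddof=1), x2.std(ddof=1)]) / y.std(ddof=1)
```

Here x1's raw slope (about 2.0) dwarfs x2's (about 0.1), yet in standardized terms x1 contributes roughly twice as much per standard deviation, which is what the beta coefficients reveal.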