
DAVID DRANOVE 7-112-004

Practical Regression: Introduction to Endogeneity: Omitted Variable Bias
This is one in a series of notes entitled Practical Regression. These notes supplement the
theoretical content of most statistics texts with practical advice on solving real world empirical
problems through regression analysis.

Let's start our lecture on endogeneity issues with a simple definition: a right-hand-side (RHS) variable is said to be endogenous if it is correlated with the error in the original model.1

There are three main sources of endogeneity: (1) omitted variables, (2) reverse causality, or
simultaneity, and (3) measurement error. In this note, we discuss (1) at length. Problem (2)
occurs when the RHS variable is a function of Y (as opposed to being a cause of Y). For example,
recall the sales and advertising model from Regression Basics2:

(1) S = β0 + β1P + β2A + ε

where S = Sales, P = Price, and A = Advertising. We normally assume that advertising affects
sales, but suppose that firms increase their advertising in anticipation of changes in demand. In
this case, sales affects advertising; advertising is therefore said to be endogenous. In general, if
you are unsure whether X causes Y or Y causes X, your regression suffers from simultaneity bias
(i.e., because X and Y are determined together). It is impossible to determine the direction of
causality from ordinary least squares (OLS) regression and therefore impossible to use OLS to
determine whether and by how much a change in X will affect Y. Problem (3), measurement
error, occurs when an RHS variable is imprecisely measured. Problem (2) is discussed in
Causality and Instrumental Variables and Problem (3) is discussed in Noise,
Heteroskedasticity, and Grouped Data.3

All endogeneity sources (omitted variables, simultaneity, and measurement error) will bias the coefficient on the affected RHS variable, and potentially the coefficients on any other variables that are correlated with the endogenous variable. This is why it is crucial to determine whether your model may suffer from one of these endogeneity issues.

1. RHS variables are uncorrelated with the residual of the regression, by construction.
2. David Dranove, Practical Regression: Regression Basics, Case #7-112-002 (Kellogg School of Management, 2012).
3. David Dranove, Practical Regression: Causality and Instrumental Variables, Case #7-112-010, and Practical Regression: Noise, Heteroskedasticity, and Grouped Data, Case #7-112-006 (Kellogg School of Management, 2012).

© 2012 by the Kellogg School of Management, Northwestern University. This technical note was prepared by Professor David Dranove. Technical notes are developed solely as the basis for class discussion. Technical notes are not intended to serve as endorsements, sources of primary data, or illustrations of effective or ineffective management. To order copies or request permission to reproduce materials, call 847-491-5400 or e-mail [email protected]. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the permission of the Kellogg School of Management.

This note describes omitted variables and some solutions for dealing with the problems they
present.

Omitted Variable Bias


There are so many reasons to be parsimonious when choosing RHS variables that you may be tempted to run regressions with just one predictor variable. It's time to put things in perspective and remember why we add control variables. Adding theoretically sound control variables to the RHS has two virtues:

- It improves the predictive power of your model and, in the process, improves the precision of your estimates.
- It guards against bias: excluding relevant variables can bias the coefficients on the included variables. In other words, the computer reports values that are systematically higher or lower than the actual values due to omitted variable bias (OVB).

It is useful to examine the mathematics that underlie OVB. (This should look familiar, as it is
quite similar to the math in Building Your Model4 that shows what happens when one predictor
is a function of another.)

Suppose that the true economic relationship that determines the dependent variable Y is:

(2) Y = β0 + βXX + βZZ + εy

(For example, Y may be average income over an individual's lifetime; X may be schooling; and Z may be health status.)

We may or may not realize that this is the true relationship. In any event, suppose we have data only on Y and X and are interested in estimating βX. We regress Y on X, and the computer reports a coefficient on X. Is this an unbiased estimate of the actual βX? We can answer this question after a bit of math.

Let us suppose that the statistical relationship between X and Z is:

(3) Z = C0 + CXX + εz

(This is a very general statement and allows for any degree of correlation between X and Z.)
Substitute equation (3) into equation (2) to obtain:

(4) Y = β0 + βXX + βZ(C0 + CXX + εz) + εy

Gathering terms together gives us:

(5) Y = [β0 + βZC0] + [βX + βZCX]X + [εy + βZεz]

4. David Dranove, Practical Regression: Building Your Model, Case #7-112-003 (Kellogg School of Management, 2012).


Equation (5) looks exactly like a regression equation of Y on X:

- The first term in brackets, [β0 + βZC0], is the intercept.
- The second term, [βX + βZCX], is the slope.
- The third term, [εy + βZεz], is the error.

In fact, this is the regression equation that the computer estimates when we run regress Y X.
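As a sanity check on the algebra, the substitution of equation (3) into equation (2) can be verified numerically. This is a minimal sketch with arbitrary illustrative parameter values (all numbers below are hypothetical, not from the note); it confirms that equations (2) and (5) produce the same Y:

```python
# Verify that equation (5) is just equations (2) and (3) combined.
# All parameter values below are arbitrary illustrative choices.
b0, bX, bZ = 1.0, 2.0, 3.0    # the betas in equation (2)
c0, cX = 0.5, 0.8             # C0 and CX in equation (3)
x, ez, ey = 1.7, 0.2, -0.1    # one draw of X and the two error terms

z = c0 + cX * x + ez                                          # equation (3)
y_eq2 = b0 + bX * x + bZ * z + ey                             # equation (2)
y_eq5 = (b0 + bZ * c0) + (bX + bZ * cX) * x + (ey + bZ * ez)  # equation (5)

print(round(abs(y_eq2 - y_eq5), 12))  # 0.0 (identical up to rounding)
```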

We clearly have a problem. The intercept and slope coefficients that the computer reports are not estimates of β0 and βX. Instead, they are estimates of β0 + βZC0 and βX + βZCX. This is summarized in the following table.5

Parameter of interest    You want to estimate    The computer reports
Intercept                β0                      β0 + βZC0
Coefficient on X         βX                      βX + βZCX

We want an estimate of the direct effect of X on Y, βX. Unfortunately, the coefficient on X is actually βX + βZCX. The term βZCX represents the bias.

Key point to remember: OVB appears when a variable on the RHS must do double duty. The coefficient on the included variable captures both its direct effect on Y and the indirect effect of the omitted variable that happens to be correlated with it.

It is often helpful if you can determine the direction of the bias. For example, suppose:

- You believe that X and Z are positively correlated (CX > 0), and
- You believe that Z is positively related to Y (βZ > 0).

You should then conclude that the estimate that the computer reports for βX will be more positive than the correct value (because βZCX > 0). Continuing our example, if schooling and health status are positively correlated and health status has an independent effect on earnings, then a regression that omits health status will overstate the effect of schooling on earnings.6
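The direction-of-bias logic can also be seen in a short simulation. The numpy sketch below (variable names and parameter values are our own illustrative choices, not from the note) draws data from equations (2) and (3) with βZ > 0 and CX > 0, then runs the short regression of Y on X alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True model (equation 2): Y = b0 + bX*X + bZ*Z + ey
b0, bX, bZ = 1.0, 2.0, 3.0
# Relationship between regressors (equation 3): Z = c0 + cX*X + ez
c0, cX = 0.5, 0.8

X = rng.normal(size=n)
Z = c0 + cX * X + rng.normal(size=n)
Y = b0 + bX * X + bZ * Z + rng.normal(size=n)

# "Short" regression of Y on X alone, omitting Z.
A = np.column_stack([np.ones(n), X])
(intercept_short, slope_short), *_ = np.linalg.lstsq(A, Y, rcond=None)

# OLS recovers roughly bX + bZ*cX = 2.0 + 3.0*0.8 = 4.4, not bX = 2.0:
# the coefficient is biased upward because bZ*cX > 0.
print(round(slope_short, 2))
```

With a large sample, the short-regression slope sits near 4.4, matching the bias formula βX + βZCX rather than the direct effect βX.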

Omitted Variable Bias in Action


We will illustrate OVB using data on the penetration of managed care organizations (MCOs,
a form of health insurance) in U.S. metropolitan areas in the 1990s. We initially regress
mco_penetration (from 0 to 100 percent of the market) on the metropolitan area population and
per capita income (both measured in thousands). Here is the result:

5. Here is a more precise statement of the bias. Suppose you want to estimate βX but omit data on Z. The coefficient on X will equal βX + βZCov(X,Z)/Var(X), where Cov(X,Z) is the covariance between X and Z and Var(X) is the variance of X. This follows from the formula for the parameter CX.
6. Determining the bias when there are more than two explanatory variables is complex. The coefficient on any included variable that is correlated with another included variable suffering from OVB may itself be biased, however.


Interpreting the results, it seems that every 100,000 increase in population is associated with a
0.27 percent increase in mco_penetration, while every $1,000 increase in income is associated
with a 0.90 percent increase in mco_penetration. The latter result may be surprising because
many analysts believe that MCOs appeal to individuals trying to save money on their health
insurance and therefore penetration should be higher in low-income markets.

We now consider additional predictors, including:

- hospital_concentration is a measure of the concentration of the local hospital market (where 0 corresponds to infinitely many hospitals and 1 is a local hospital monopoly). High concentration can affect the ability of MCOs to reduce costs.
- MD_solo_practice is the percentage of physicians who are in solo practices. MCOs may be less able to reduce costs in markets with many solo practitioners.
- urban is the percentage of the local population that lives in an urbanized area (defined according to population density).

First, note the correlation matrix:


There are quite a few moderate intercorrelations. If the variables we are adding to the
regression turn out to be good predictors of MCO penetration, then the coefficients in our initial
model may be biased. The new regression is:

Note that the coefficients on the new variables are all significant and the coefficients on
hospital_concentration and MD_solo_practice are negative as expected, whereas the coefficient
on urban is positive and significant. Also note that the coefficients on population and income are
no longer significant. The first model we ran must have suffered from OVB.

Examining the correlation matrix, we can see that the OVB in the first model came from the
two omitted variables: hospital_concentration and urban. Hospital_concentration has a negative
correlation with population and income but also has a negative direct effect. The product of the
two negatives imparted a positive bias to the coefficients on population and income. Urban has a
positive correlation and a positive direct effect, again imparting a positive bias.

Coping with Omitted Variable Bias


It is impossible to get data on all the factors that might affect the dependent variable. This exposes all regressions to potential OVB, which is why we should always think about possible biases in our regressions. Fortunately, OVB can be managed:

- Omitting variables biases the coefficients only if the omitted variables are correlated with included variables.
- If the omitted variable is not important in its own right (i.e., βZ is small, meaning that the omitted variable Z is not an important determinant of Y), the bias will be small.
- Even if OVB exists, it may be possible to determine the direction of the bias. This will allow us to state that the reported coefficients are either upper or lower bounds on the actual effects.


- Thinking about OVB forces us to carefully identify the correct economic model and do a better job of variable selection in the first place.
- Fixed effects and/or instrumental variables can mitigate or eliminate the bias.
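As an illustration of the last point, here is a simplified fixed-effects sketch (a simulation of our own construction, not the note's data). The omitted variable is an unobserved group-level effect correlated with X; pooled OLS is biased, but demeaning within each group removes the group effect and, with it, the bias:

```python
import numpy as np

rng = np.random.default_rng(1)
G, T = 200, 50    # 200 groups, 50 observations per group
bX = 2.0          # the true direct effect of X on Y

# The omitted variable is a fixed group effect, correlated with X.
alpha = rng.normal(size=G)
X = alpha[:, None] + rng.normal(size=(G, T))
Y = bX * X + 3.0 * alpha[:, None] + rng.normal(size=(G, T))

# Pooled OLS omits the group effect and is biased upward.
slope_pooled = np.cov(X.ravel(), Y.ravel())[0, 1] / np.var(X.ravel(), ddof=1)

# Fixed effects: demean within each group so the group effect drops out.
Xd = X - X.mean(axis=1, keepdims=True)
Yd = Y - Y.mean(axis=1, keepdims=True)
slope_fe = np.cov(Xd.ravel(), Yd.ravel())[0, 1] / np.var(Xd.ravel(), ddof=1)

print(round(slope_pooled, 2), round(slope_fe, 2))  # pooled ≈ 3.5, FE ≈ 2.0
```

The pooled estimate absorbs the group effect's contribution (the same βZCX logic as before), while the within-group estimate recovers the true βX.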

Striking the Balance, or Seven Steps to Statistical Heaven


Parsimony is important. So is avoiding OVB. Here are some steps to follow that will allow
you to strike the right balance.

1. Always begin with a core set of predictors that have theoretical relevance, in addition
to any predictors whose effects you are specifically interested in. You may estimate a
quick and dirty OLS model at this time.

2. Finalize model specification issues.

3. Add additional predictors that you think might be relevant. You can add them one at a
time or one category at a time (e.g., a set of three dummy variables for seasons). Check
for the robustness of your initial findings.

4. When adding predictors, you should keep all the original predictors in the model, even if
they were not significant. Remember, OVB can cause significant predictors to appear
insignificant. By adding more variables, your key predictors may become significant. If
you have already dropped them, you may never realize that they belonged in the model.

5. At this point, you should know your robust findings; that is the main goal of your
research.

6. If you want to produce a final model, then you may want to remove those additional
predictors that were not significant.

7. You can also remove core predictors if they remain insignificant and you need degrees of
freedom. If you are not taxed for degrees of freedom, you may want to keep your core
variables, if only to paint the entire picture for your audience.
