KTN Omitted Variables
Let's start our lecture on endogeneity issues with a simple definition: a right-hand-side (RHS)
variable is said to be endogenous if it is correlated with the error in the original model.1
There are three main sources of endogeneity: (1) omitted variables, (2) reverse causality, or
simultaneity, and (3) measurement error. In this note, we discuss (1) at length. Problem (2)
occurs when the RHS variable is a function of Y (as opposed to being a cause of Y). For example,
recall the sales and advertising model from Regression Basics2:
(1) S = β0 + β1P + β2A + ε
where S = Sales, P = Price, and A = Advertising. We normally assume that advertising affects
sales, but suppose that firms increase their advertising in anticipation of changes in demand. In
this case, sales affects advertising; advertising is therefore said to be endogenous. In general, if
you are unsure whether X causes Y or Y causes X, your regression suffers from simultaneity bias
(i.e., because X and Y are determined together). It is impossible to determine the direction of
causality from ordinary least squares (OLS) regression and therefore impossible to use OLS to
determine whether and by how much a change in X will affect Y. Problem (3), measurement
error, occurs when an RHS variable is imprecisely measured. Problem (2) is discussed in
Causality and Instrumental Variables and Problem (3) is discussed in Noise,
Heteroskedasticity, and Grouped Data.3
1 RHS variables are uncorrelated with the residual of the regression, by construction.
2 David Dranove, Practical Regression: Regression Basics, Case #7-112-002 (Kellogg School of Management, 2012).
3 David Dranove, Practical Regression: Causality and Instrumental Variables, Case #7-112-010, and Practical Regression: Noise, Heteroskedasticity, and Grouped Data, Case #7-112-006 (Kellogg School of Management, 2012).
©2012 by the Kellogg School of Management, Northwestern University. This technical note was prepared by Professor David
Dranove. Technical notes are developed solely as the basis for class discussion. Technical notes are not intended to serve as
endorsements, sources of primary data, or illustrations of effective or ineffective management. To order copies or request permission
to reproduce materials, call 847-491-5400 or e-mail [email protected]. No part of this publication may be reproduced,
stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means (electronic, mechanical, photocopying,
recording, or otherwise) without the permission of the Kellogg School of Management.
TECHNICAL NOTE: OMITTED VARIABLE BIAS 7-112-004
This note describes omitted variables and some solutions for dealing with the problems they
present.
Including relevant variables improves the predictive power of your model and, in the process,
improves the precision of your estimates.
Excluding relevant variables can bias the coefficients on the included variables. In other
words, the computer reports values that are systematically higher or lower than the actual
values due to an omitted variable bias (OVB).
It is useful to examine the mathematics that underlie OVB. (This should look familiar, as it is
quite similar to the math in Building Your Model4 that shows what happens when one predictor
is a function of another.)
Suppose that the true economic relationship that determines the dependent variable Y is:
(2) Y = β0 + βX X + βZ Z + εY
(For example, Y may be average income over an individual's lifetime; X may be schooling; and
Z may be health status.)
We may or may not realize that this is the true relationship. In any event, suppose we have
data only on Y and X and are interested in determining βX. We regress Y on X, and the computer
reports a coefficient on X. Is this an unbiased estimate of the actual βX? We can answer this
question after a bit of math. Begin by writing Z as a linear function of X:
(3) Z = C0 + CX X + εZ
(This is a very general statement and allows for any degree of correlation between X and Z.)
Substitute equation (3) into equation (2) to obtain:
(4) Y = (β0 + βZC0) + (βX + βZCX)X + (βZεZ + εY)
4 David Dranove, Practical Regression: Building Your Model, Case #7-112-003 (Kellogg School of Management, 2012).
In fact, this is the regression equation that the computer estimates when we run regress Y X.
We clearly have a problem. The intercept and slope coefficients that the computer reports are
not estimates of β0 and βX. Instead, they are estimates of β0 + βZC0 and βX + βZCX. This is
summarized in the following table.5

Coefficient     True value     What the computer reports
Intercept       β0             β0 + βZC0
Slope on X      βX             βX + βZCX
Key point to remember: OVB appears when a variable on the RHS must do double duty. The
coefficient on the included variable includes its direct effect on Y as well as the indirect effect of
the omitted variable that happens to be correlated with it.
It is often helpful if you can determine the direction of the bias. For example, suppose:
- You believe that X and Z are positively correlated (CX > 0), and
- You believe that Z is positively related to Y (βZ > 0).
You should then conclude that the estimate that the computer reports for βX will be more
positive than the correct value (because βZCX > 0). Continuing our example, if schooling and
health status are positively correlated and health status has an independent effect on earnings,
then a regression that omits health status will overstate the effect of schooling on earnings.6
5 Here is a more precise statement of the bias. Suppose you want to estimate βX but omit data on Z. The coefficient on X will equal βX + βZCov(X,Z)/Var(X), where Cov(X,Z) is the covariance between X and Z and Var(X) is the variance of X. This follows from the formula for the parameter CX.
6 Determining the bias when there are more than two explanatory variables is complex. The coefficient on any included variable that is correlated with another included variable suffering from OVB may itself be biased, however.
Interpreting the results, it seems that every 100,000-person increase in population is associated
with a 0.27 percent increase in mco_penetration, while every $1,000 increase in income is
associated with a 0.90 percent increase in mco_penetration. The latter result may be surprising
because many analysts believe that MCOs appeal to individuals trying to save money on their
health insurance and therefore penetration should be higher in low-income markets.
There are quite a few moderate intercorrelations. If the variables we are adding to the
regression turn out to be good predictors of MCO penetration, then the coefficients in our initial
model may be biased. The new regression is:
Note that the coefficients on the new variables are all significant and the coefficients on
hospital_concentration and MD_solo_practice are negative as expected, whereas the coefficient
on urban is positive and significant. Also note that the coefficients on population and income are
no longer significant. The first model we ran must have suffered from OVB.
Examining the correlation matrix, we can see that the OVB in the first model came from the
two omitted variables: hospital_concentration and urban. Hospital_concentration has a negative
correlation with population and income but also has a negative direct effect. The product of the
two negatives imparted a positive bias to the coefficients on population and income. Urban has a
positive correlation and a positive direct effect, again imparting a positive bias.
- Omitting variables results in biased coefficients only if the omitted variables are
correlated with included variables.
- If the omitted variable is not important in its own right (i.e., βZ is small, meaning that the
omitted variable Z is not an important determinant of Y), the bias will be small.
- Even if OVB exists, it may be possible to determine the direction of the bias. This will
allow us to state that the reported coefficients are either upper or lower bounds on the
actual effects.
- Thinking about OVB forces us to carefully identify the correct economic model and do a
better job of variable selection in the first place.
- Fixed effects and/or instrumental variables can mitigate or eliminate the bias.
1. Always begin with a core set of predictors that have theoretical relevance, in addition
to any predictors whose effects you are specifically interested in. You may estimate a
quick and dirty OLS model at this time.
3. Add additional predictors that you think might be relevant. You can add them one at a
time or one category at a time (e.g., a set of three dummy variables for seasons). Check
for the robustness of your initial findings.
4. When adding predictors, you should keep all the original predictors in the model, even if
they were not significant. Remember, OVB can cause significant predictors to appear
insignificant. By adding more variables, your key predictors may become significant. If
you have already dropped them, you may never realize that they belonged in the model.
5. At this point, you should know your robust findings; that is the main goal of your
research.
6. If you want to produce a final model, then you may want to remove those additional
predictors that were not significant.
7. You can also remove core predictors if they remain insignificant and you need degrees of
freedom. If you are not taxed for degrees of freedom, you may want to keep your core
variables, if only to paint the entire picture for your audience.
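The workflow above can be sketched in code. This hypothetical example starts with a core predictor, adds candidate predictors one at a time, and checks whether the coefficient of interest is robust (variable names and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x_core = rng.normal(size=n)                   # predictor of interest
x_cand = 0.6 * x_core + rng.normal(size=n)    # candidate correlated with x_core
x_noise = rng.normal(size=n)                  # irrelevant candidate
y = 1.0 * x_core + 2.0 * x_cand + rng.normal(size=n)

def coef_on_core(extra_cols):
    """OLS coefficient on x_core with the given extra regressors included."""
    X = np.column_stack([np.ones(n), x_core] + extra_cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_short = coef_on_core([])                 # biased: roughly 1 + 2*0.6 = 2.2
b_long = coef_on_core([x_cand])            # roughly 1.0: bias removed
b_full = coef_on_core([x_cand, x_noise])   # still roughly 1.0: finding is robust
print(b_short, b_long, b_full)
```

The coefficient on x_core moves sharply when the correlated candidate enters (a sign of OVB in the short model) and is unmoved by the irrelevant candidate, which is the robustness pattern the steps above are designed to reveal.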