Lecture 2_regression_multiple_regressors

The document discusses the application of Ordinary Least Squares (OLS) regression with multiple regressors to address omitted variable bias (OVB) and improve causal inference in empirical economics. It highlights the importance of including relevant variables to obtain unbiased estimates and explains the conditions under which OLS estimates can be interpreted as causal. Additionally, it covers the implications of multicollinearity, model selection, and the significance of adjusted R² in evaluating model fit.

Empirical Economics

Jacopo Bonan – [email protected]


Part 1.2 - Linear Regression with Multiple Regressors
University of Brescia

Preview

• OLS with one regressor can give a biased estimate of β1 if E(u|X ) = 0 does not hold
• OLS with multiple regressors can address omitted variable bias and recover causal
effects
• Ceteris paribus condition
• OLS with multiple regressors can improve predictions

Example - California schools

Characteristics that are important drivers of the final score are also likely to be correlated
with STR.
For example, because of the large immigration into California, the % of students who are still
learning English is important for test results and may also be related to class size.
Example - The role of non-native English speakers
• Students who are still learning English might perform worse on standardized tests
than native speakers. Thus, districts with a higher % of non-native speakers might
have, on average, lower scores.
• Districts with many migrants could have larger classes (why?)
• Then, OLS could erroneously produce an estimate of β1 that is too large in magnitude. It
mixes the impact of class size with that of migration; it compares small classes with few
non-native speakers (high performers) vs large classes with many non-native speakers
(low performers)
• The effect of STR is biased!
• What if we know the % of non-English speakers in each district (elpcti = English
learners percent)?
Example - California schools

Correlation of elpct with str and score

corr(str, elpct) = 0.19        corr(testscr, elpct) = −0.64

Omitted Variable Bias

OVB in OLS with one regressor


The Omitted Variable Bias is a systematic bias of the OLS for the causal effect of X on
Y , due to the fact that X is correlated with a variable that has been omitted from the
regression model (is in the error term).
Two conditions must hold for OVB to arise:

1. the regressor X must be correlated with the omitted variable;

2. the omitted variable must be a determinant of Y .

When both conditions hold, assumption A1 is violated and the OLS
estimator is biased. Neither changing the sample, nor increasing the number of
observations, would solve the problem.

What is the size of the bias?
Formula for the OVB
Let us suppose that all the assumptions A2-A3 are verified and let us define
ρXu = corr(Xi , ui ). Then the OLS estimator satisfies:

    β̂1  →p  β1 + ρXu · (σu / σX )        (1)

where the second term, ρXu · (σu / σX ), is the bias.

Size and sign of the bias:

1. the size of the bias depends upon:

   • the correlation between the omitted variable (in u) and the regressor X ;
   • the standard deviation of the error (σu ): how relevant is the OV in explaining Y ?

2. the sign of the bias depends only on whether the correlation between X
   and u is positive or negative:
   • ρXu > 0 ⇒ bias > 0 ⇒ we are overestimating the true β1
   • ρXu < 0 ⇒ bias < 0 ⇒ we are underestimating the true β1

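A minimal R simulation sketch of the OVB formula (all names and values are illustrative, not from the lecture): the regressor x is positively correlated with an omitted variable w that lowers y, so the short regression underestimates the true β1 = 1.

# Minimal OVB simulation (illustrative values, not from the lecture)
set.seed(1)
n <- 10000
w <- rnorm(n)                     # omitted variable
x <- 0.5 * w + rnorm(n)           # regressor, positively correlated with w
y <- 2 + 1 * x - 3 * w + rnorm(n) # true beta1 = 1, w lowers y

coef(lm(y ~ x))      # short regression: estimate of beta1 is biased downward
coef(lm(y ~ x + w))  # long regression: estimate of beta1 is close to 1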
Can I cancel/reduce the bias?

To eliminate the bias we should include all the omitted variables in the model.
A first method to reduce it is to split the sample into groups such that within each
group the omitted variable is kept constant (e.g. districts for which the % of non-English-
speaking students is similar), but also such that the variable of interest
has sufficient variability (e.g. the student-to-teacher ratio).
Using this grouping strategy (here a quartile split) we can compute the difference in
average score between large and small classes within groups of schools with similar elpct
and use a simple t-test to check whether the difference is significant.
Sample splitting

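A sketch of this grouping strategy in R, assuming the CASchools data shipped with the AER package (the 20-student cutoff defining "small" classes is illustrative):

# Quartile split on elpct, then small-vs-large class t-tests within each group
library(AER)
data("CASchools")
CASchools$str   <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2
CASchools$elq   <- cut(CASchools$english,
                       breaks = quantile(CASchools$english, probs = 0:4 / 4),
                       include.lowest = TRUE, labels = 1:4)
CASchools$small <- factor(CASchools$str < 20, labels = c("large", "small"))

# t-test of the score difference between small and large classes, within each quartile
by(CASchools, CASchools$elq,
   function(d) t.test(score ~ small, data = d))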
Linear model with multiple regressors

The grouping approach partially solves OVB, but has limitations:

• it does not provide a precise causal effect of class size, holding constant the fraction
of English learners
• it becomes complicated if one wants to hold constant more than one omitted variable
• it becomes impractical as the number of comparison cells increases and the samples
within each cell shrink

One solution is to extend the single-variable OLS model to a multiple regression model.
This allows us to estimate the causal effect on Yi of changing X1i , while holding constant
the other regressors (X2i , X3i , etc.), which are confounding factors causing OVB in the
univariate OLS.
For prediction, the multiple regression model can improve accuracy.

The Multiple Regression Model

The Linear Multiple Regression Model with k regressors is:

Yi = β0 + β1 X1,i + β2 X2,i + · · · + βk Xk,i + ui (2)

• Same components as in the univariate model

• βk = ∆Y /∆Xk , holding constant X1 , ..., Xk−1 → partial effect on Y of Xk , holding
  constant all other factors; expected difference in Yi associated with a unit difference
  in Xk , ceteris paribus
• β0 is the expected value of Y when all the X s are equal to 0

The Multiple Regression Model

The Linear Multiple Regression Model with k regressors is:

Yi = β0 + β1 X1,i + β2 X2,i + · · · + βk Xk,i + ui (3)

and can be written in compact notation as:

    Y = X β + u        (4)

with dimensions Y : n×1, X : n×(k+1), β : (k+1)×1, u : n×1, where:

    Y = (Y1 , Y2 , . . . , Yn )′ ,   β = (β0 , β1 , . . . , βk )′ ,   u = (u1 , u2 , . . . , un )′ ,

and X is the n×(k+1) matrix whose i-th row is (1, X1,i , . . . , Xk,i ).

Alternatively you can write it as:

    Yi = Xi′ β + ui    with    Xi′ = [1, X1,i , . . . , Xk,i ]

The OLS estimator

The minimization problem now is that of choosing a vector β that contains k + 1


parameters (k regressors + the constant). But it is the usual problem:

• the objective function is the sum of the squared deviations;

• the choice variables are the parameter values.

    argmin over b of  Σi [Yi − b0 − b1 X1,i − · · · − bk Xk,i ]²        (5)

and the FOCs lead us to a (linear) system of k + 1 equations in k + 1 unknowns:

    X ′ (Y − Xb) = 0        (6)

OLS general formulation

Solving for b we obtain the OLS estimator:

    β̂ = (X ′ X )−1 X ′ Y        (7)

Note: to compute (X ′ X )−1 we need this product of matrices to be invertible. This
condition is satisfied if X has full rank, i.e. there is no perfect multicollinearity.
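As a sanity check of equation (7), the estimator can be computed directly by matrix algebra and compared with lm(); a minimal sketch on simulated data (illustrative names):

# OLS by matrix algebra vs. lm(), on simulated data (illustrative)
set.seed(42)
n  <- 500
X1 <- rnorm(n); X2 <- rnorm(n)
y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)

X <- cbind(1, X1, X2)                       # n x (k+1) design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ X1 + X2))                       # same numbers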
California Schools

Example - California schools

score = 686.0 − 1.10 × str − 0.65 × elpct
        (8.7)    (0.43)      (0.03)

• After including elpct, the coefficient on str changes (roughly halved in magnitude).
• Why such a drastic change in the estimate?
• In the univariate model, β1 was underestimated (negative OVB)
• Now, the OVB is attenuated. Completely removed?

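This regression (and the correlations quoted earlier) can be reproduced with lm(); a sketch assuming the CASchools data from the AER package:

# Sketch assuming the CASchools data from the AER package
library(AER)
data("CASchools")
CASchools$str   <- CASchools$students / CASchools$teachers   # student-teacher ratio
CASchools$score <- (CASchools$read + CASchools$math) / 2     # average test score

cor(CASchools$str, CASchools$english)     # ~ 0.19
cor(CASchools$score, CASchools$english)   # ~ -0.64

fit <- lm(score ~ str + english, data = CASchools)
summary(fit)   # coefficients close to 686.0, -1.10, -0.65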
Assumptions of the Multiple Regression Model

Conditions for ALL OLS estimates to be interpreted as causal become:

A1: the conditional distribution of the errors, given the regressors, has zero mean –
    E(u|X ) = E(ui |X1,i , X2,i , . . . , Xk,i ) = 0.
A2: observations are i.i.d. – (X , Y ) = (X1,i , X2,i , . . . , Xk,i , Yi ) ∼ i.i.d.
A3: large outliers are unlikely – 0 < E[Xj,i⁴], E[Yi⁴] < ∞ ∀ j = 1, . . . , k.
A4: no perfect multicollinearity between regressors – rank(X ) = k + 1

Perfect multicollinearity

The regressors are said to exhibit perfect multicollinearity (or to be perfectly
multicollinear) if one of the regressors is a linear function of some other regressors.

Perfect multicollinearity
Formally, we have perfect multicollinearity if a regressor j can be expressed as:

    Xj,i = Σ_{h ≠ j} αh Xh,i    ∀ i = 1, . . . , n

Assumption A4 above requires that this is not the case.

Note: most modern software automatically checks for this and drops one of the
redundant regressors.

Example - California schools
The dummy variable trap
Suppose we partition the school districts into three categories: rural, suburban, urban,
and we create three dummy variables (i.e. Xrural , Xsuburban , Xurban ) with value 1 if
district i is of that specific category, and value 0 if not.
Imagine we want to estimate:

    score = β0 + β1 rural + β2 suburban + β3 urban

However, because every district belongs to exactly one of the three categories, we will have
that:

    rurali + suburbani + urbani = 1    ∀ i

but the vector of ones is a regressor already included in the model (associated with the
constant). Thus, to estimate this model, we'll need to drop either one of the three
dummy variables (which becomes the reference category) or the constant. For example:

    score = β0 + β1 rural + β2 suburban

However, this changes the interpretation of the coefficients β0 , β1 , β2 .
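A short R illustration of the trap on simulated data (illustrative names): lm() reports an NA coefficient when the constant and all three dummies are included, while factor coding picks a reference category automatically.

# Dummy variable trap, on simulated data (illustrative)
set.seed(7)
type  <- sample(c("rural", "suburban", "urban"), 200, replace = TRUE)
score <- 650 + 10 * (type == "rural") + 5 * (type == "suburban") + rnorm(200, sd = 5)

rural    <- as.numeric(type == "rural")
suburban <- as.numeric(type == "suburban")
urban    <- as.numeric(type == "urban")

coef(lm(score ~ rural + suburban + urban))  # one coefficient is NA: redundant column dropped
coef(lm(score ~ factor(type)))              # factor coding: one category is the reference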


Imperfect Multicollinearity

When two (or more) of the regressors are highly correlated, imperfect
multicollinearity arises.
Imperfect multicollinearity does not pose any problem for the theory of the OLS
estimators. However, if the regressors are imperfectly multicollinear, the coefficient
on at least one individual regressor will be imprecisely estimated – in particular, it will
have a large sampling variance.

Example - California schools

Consider the regression of score on str and elpct. Suppose we were to add also the
percentage of the district's residents who are first-generation immigrants (pctimm).
First-generation immigrants often speak English only as a second language, so the
variables elpct and pctimm will be highly correlated and it will be difficult to precisely
estimate their individual effects. In particular, there will be little information on test
scores in schools with low elpct and high pctimm, and vice versa. This leads to a larger
variance (less precision) of the estimator.

Control variables and causality

In the multiple regression we are not interested in the causal effects of all the variables.
Some of them might be there only to avoid OVB in the causal interpretation of the
variables of interest. Thus we have:

• variables of interest (X ): for which we aim to estimate the causal effect;

• control variables (W ): included only to reduce the omitted variable bias; no interest in
their causal effects

This allows us to relax assumption A1, which becomes:

Conditional Mean Independence
A1-bis: the error u has a conditional mean that does not depend on X , given W .
Formally: E(u|W , X ) = E(u|W ), or in extended form
E(ui |X1,i , X2,i , . . . , Xk,i , W1,i , . . . , Wr,i ) = E(ui |W1,i , . . . , Wr,i ).

The conditional mean of u given W does not change even after taking into account the
knowledge about X . Thus, when controlling for W , X becomes uncorrelated with u (as
if it were randomly assigned).
If A1-bis holds, the coefficients for the variables of interest (X ) have a causal
interpretation, while those for the controls (W ) can be biased.
Goodness of Fit in the Multiple Regression

Similarly to the single-regressor case, we can measure the quality of the model by means
of the SER and the R².
The standard error of the regression is:

    SER = sû = √( Σi ûi² / (n − k − 1) ) = √( SSR / (n − k − 1) )        (8)

The denominator adjusts for the degrees of freedom lost due to the estimation of the k + 1
parameters. In large samples, this adjustment is negligible.
The R² is as in the univariate case:

    R² = ESS / TSS = 1 − SSR / TSS        (9)

where ESS = Σi (Ŷi − Ȳ)² and TSS = Σi (Yi − Ȳ)².

However, the R² increases (by construction) every time we add a new variable to
our model, because any added variable contributes to decreasing the SSR.

Adjusted R 2

To correct for this issue, it is better to use the adjusted R² (often denoted R̄²):

    R̄² = 1 − [(n − 1)/(n − k − 1)] · (SSR/TSS)        (10)

When adding a new regressor (k increases) the formula for the R̄² entails a trade-off:

• it reduces the ratio SSR/TSS
• it increases the ratio (n − 1)/(n − k − 1)

so the decision (whether or not to add the regressor) depends on which effect dominates.
Notes: R̄² is always less than R² and can take negative values.

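As a quick check of formula (10), a sketch comparing summary()'s R² and R̄² with the manual computation, on simulated data (illustrative names):

# R-squared and adjusted R-squared: summary() vs. formula (10), simulated data
set.seed(3)
n  <- 200; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n)          # x2 is irrelevant on purpose

fit <- lm(y ~ x1 + x2)
SSR <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)

c(manual_R2    = 1 - SSR / TSS,
  manual_adjR2 = 1 - (n - 1) / (n - k - 1) * SSR / TSS)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # should match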
Goodness of Fit and Model Selection

A Note of Caution
When choosing the most appropriate model (among a set of models), the R² or the R̄²
should not be the only criterion.
A high value of the R² only means that your regression model explains a large share of
the variability in Y.
It does not imply that:

• you have an unbiased estimator for the causal effect (i.e. that you have eliminated all
possible OVB);
• the variables in the model are statistically significant.

The sampling distribution of β̂
Properties of the OLS estimator
As for the single-regressor model, under A1-A4 the OLS estimator is unbiased and consistent.

Formally, if A1-A4 hold true we have:

    E(β̂) = β                                           (11)
    Var(β̂) = σ²β̂ = (X ′X )⁻¹ (X ′ Σu X ) (X ′X )⁻¹      (12)
    β̂ →p β                                             (13)

where:

    Σu = (1/(n − k)) E(uu′)                             (14)

which (being unobservable) can be estimated as Σ̂u = (1/(n − k)) û û′. In most applications
we do not impose homoskedasticity as an additional assumption, and we compute
heteroskedasticity-robust SEs (the software does it!).
In large samples, thanks to the CLT, the OLS estimator is approximately distributed as a
multivariate Normal, and

    β̂j →d N(βj , σ²β̂j )    ∀ j = 0, . . . , k          (15)
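In practice, heteroskedasticity-robust standard errors can be computed with the sandwich/lmtest packages (loaded by AER); a sketch, again assuming the CASchools data from the AER package:

# Heteroskedasticity-robust (HC1) standard errors, assuming the CASchools setup above
library(AER)            # also loads lmtest and sandwich
data("CASchools")
CASchools$str   <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

fit <- lm(score ~ str + english, data = CASchools)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # robust t-tests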
Hypothesis testing in the multiple regression model

We can rewrite:

    (β̂j − E[β̂j ]) / SE(β̂j )  ∼  N(0, 1)    ∀ j = 0, . . . , k

Which implies that:

• hypothesis testing on a single element βj of the vector β can be carried out using
the usual t-test;
• a 95% confidence interval for a single element βj of the vector β can be computed as
β̂j ± 1.96 SE(β̂j ).

Note: because Var(β̂) also contains the covariances between the different estimates,
the t-tests on single elements of the vector β are not independent. Therefore
including/omitting a regressor will change the outcome of every single t-test.

Hypothesis testing in the multiple regression model

Example - California schools


All modern software reports the relevant information, and we'll get something similar to:

score = 686.0 − 1.10 × str − 0.65 × elpct
        (8.7)    (0.43)      (0.03)

Single regressor vs. multiple regressors:

Hypothesis testing in the multiple regression model
Example - California schools
What if we add expenditure per pupil as a further control?

The coefficient of STR becomes −0.29 (0.48) → it becomes non-significant, which flips the
policy implication with respect to the beginning. However, STR and PPexpenditure
are correlated (imperfect multicollinearity) – hence one may want to test jointly that
β1 = 0 and β2 = 0.
Testing joint hypothesis

In a multiple regression model, we can also test for joint hypotheses.


H0 : β1 = β1,0 , β2 = β2,0 , . . . , βq = βq,0
H1 : at least one of the q restrictions in H0 is not true.
Why using only single t-tests is never a good idea
Let's set q = 2 and let's use separate t-tests at the 5% level on each restriction. If
the t-tests are independent, then we do not reject H0 if and only if:

    |t1 | ≤ 1.96 and |t2 | ≤ 1.96

Therefore:

    PrH0 (|t1 | ≤ 1.96 and |t2 | ≤ 1.96) = 0.95² = 90.25%

and the test size (the probability of rejecting H0 when it is true) is 9.75%, not 5%.
Conclusion: you make many more type-I errors than you would expect (see the simulation
sketch below).

The problem worsens:

• if q increases;
• if regressors are correlated
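A minimal simulation sketch of the 9.75% figure under independent t-statistics (illustrative, not from the lecture):

# Size of the "two separate t-tests" procedure under H0 (independent statistics)
set.seed(123)
R  <- 100000
t1 <- rnorm(R)                        # under H0, t-stats are approx. N(0,1)
t2 <- rnorm(R)
reject <- abs(t1) > 1.96 | abs(t2) > 1.96
mean(reject)                          # close to 0.0975, not 0.05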
The F-statistic on two restrictions

Definition for q = 2
If q = 2, we can define the F-statistic as:

    F = (1/2) · (t1² + t2² − 2 ρ̂t1,t2 t1 t2 ) / (1 − ρ̂²t1,t2 )        (16)

where ρ̂t1,t2 is the estimated correlation between t1 and t2. The F-stat therefore takes into
account the correlation between the different t-stats.
If the single t-stats are uncorrelated (ρ̂t1,t2 = 0), the F-stat is simply the average
of the two squared t-statistics:

    F = (1/2)(t1² + t2²)

The F-stat is distributed as an F2,∞. If its value is "sufficiently large", we reject H0.

F-statistic

• Reject H0 if F > Fα , where Fα is the critical value for a given significance level α

Testing multiple coefficients

Sometimes a single restriction might involve more than one parameter. For example,
economic theory might suggest a restriction stating that two parameters have the same
value (e.g. β1 = β2 , or equivalently β1 − β2 = 0).
In this case a single restriction (q = 1) involves several estimated parameters.
To test this restriction we can transform the regression model into a form in which the
t-test refers to a single parameter.
Example - equality restriction
Let's suppose our model is

    Yi = β0 + β1 X1,i + β2 X2,i + ui        (17)

and we want to test H0 : β1 = β2 vs. H1 : β1 ̸= β2 .

By adding and subtracting β2 X1,i we get:

    Yi = β0 + (β1 − β2 ) X1,i + β2 (X1,i + X2,i ) + ui
       = β0 + γ1 X1,i + β2 V1,i + ui

with γ1 = β1 − β2 and V1,i = X1,i + X2,i , and it is then sufficient to run a t-test of H0 : γ1 = 0.
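A sketch of both routes in R on simulated data (illustrative names): the reparametrised t-test and the equivalent test via linearHypothesis() from the car package:

# Testing beta1 = beta2: reparametrised t-test vs. linearHypothesis(), simulated data
library(car)
set.seed(11)
n  <- 300
X1 <- rnorm(n); X2 <- rnorm(n)
y  <- 1 + 0.5 * X1 + 0.5 * X2 + rnorm(n)   # true beta1 = beta2

V1 <- X1 + X2
summary(lm(y ~ X1 + V1))                    # t-test on X1 tests gamma1 = beta1 - beta2 = 0

fit <- lm(y ~ X1 + X2)
linearHypothesis(fit, "X1 = X2")            # same restriction, tested as an F-test with q = 1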


F-test in R

In R, using the linearHypothesis() function from the car package:

# heteroskedasticity-robust F-test of H0: the coefficients on STR and expenditureK are both zero
library(car)
linearHypothesis(model, c("STR = 0", "expenditureK = 0"), white.adjust = "hc1")

Model specification

How to decide what variables to include in a regression

1. Identify the variable of interest (X )


2. Think about OVB: what variables are we omitting that could bias the β attached to the
variable of interest?
3. Include those omitted variables (or proxies) as control variables, after checking their
correlation with Y and X. This will be the base specification.
4. Specify a range of plausible alternative models, which include additional candidate
control variables: alternative specifications.
5. Estimate your base model and the plausible alternative specifications ("sensitivity
checks"):
• do the candidate variables affect the coefficient of interest (β)?
• are the candidate variables significant?
• don't just try to maximize R²: the objective is an unbiased estimator of the
coefficient of interest (the causal effect!), not the best fit
