1. Linear regression Model - Applied_Part 1&2
3. Inference:
a. Hypothesis Testing
b. Size, power and p-values
c. Analysis
4. More analysis:
a. Data Problems: Multicollinearity, Outliers
b. Testing Functional Form
c. Selecting Regressors
1. LINEAR REGRESSION MODEL
APPLIED - PART 1 & 2
References:
Wooldridge: Chapter 3 (sections 1 and 2), Chapter 7
Verbeek: Chapter 2 (section 4), Chapter 3 (section 1)
1. The linear regression model
Assume a relationship between y (dependent or explained variable) and a set of
variables:
x1 ≡ 1 (a constant), x2, x3, …, xK,
valid for each individual in the population, such that the relationship is linear in
parameters (not necessarily in variables).
We write
𝑦 = 𝛽1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + ⋯ + 𝛽𝐾 𝑥𝐾 + 𝜀
Reminder: Estimators vs. Estimates
Suppose we have n random variables Zi (i = 1, 2, …, n), distributed according to some pdf.
You have a sample and are interested in estimating some parameter 𝜃 of the distribution
Example: data on income for a sample of Irish workers
Estimator: a rule, a function of the random variables Zi, that gives you a sample value for the parameter of interest:
W = h(Z1, Z2, …, Zn)
• Ex: sample mean estimator: $\bar{Z} = \frac{Z_1 + Z_2 + \cdots + Z_n}{n}$
• The estimator is a random variable (new sample = new sample value for W) with its own probability distribution
• Sampling distribution: likelihood of the various outcomes of W across different samples
Estimate: the sample value of θ obtained for a specific sample drawn {z1, z2, …, zn}
• Ex: sample mean estimate: $\bar{z} = \frac{z_1 + z_2 + \cdots + z_n}{n}$
• Each estimate has a certain probability of occurring, given the pdf of the estimator
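The distinction can be illustrated with a short simulation (a sketch with made-up data and a hypothetical income-like distribution, not from the lecture): the same estimator, applied to different samples, yields different estimates.

```r
# Sketch: the sample mean as estimator (rule) vs. estimate (realized value)
set.seed(123)

draw_sample <- function(n) rnorm(n, mean = 100, sd = 15)  # hypothetical pdf

# Same estimator, two samples -> two different estimates
z1 <- draw_sample(500)
z2 <- draw_sample(500)
mean(z1)
mean(z2)

# Sampling distribution: the estimator's values across many repeated samples
means <- replicate(1000, mean(draw_sample(500)))
mean(means)  # centered near the population mean of 100
sd(means)    # close to 15 / sqrt(500)
```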
$$\sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} \left( y_i - \beta_1 - \beta_2 x_{2i} - \beta_3 x_{3i} - \cdots - \beta_K x_{Ki} \right)^2$$
Why?
$$\min_{\beta_1, \dots, \beta_K} \sum_{i=1}^{N} \left( y_i - \beta_1 - \beta_2 x_{2i} - \beta_3 x_{3i} - \cdots - \beta_K x_{Ki} \right)^2$$
First-order conditions (note the hats!):
$$-2 \sum_{i=1}^{N} \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_{2i} - \hat{\beta}_3 x_{3i} - \cdots - \hat{\beta}_K x_{Ki} \right) = 0$$
$$-2 \sum_{i=1}^{N} \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_{2i} - \hat{\beta}_3 x_{3i} - \cdots - \hat{\beta}_K x_{Ki} \right) x_{2i} = 0$$
…
➔ System of K equations in K unknowns $(\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \dots, \hat{\beta}_K)$ — our parameter estimates
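In practice the K-equation system is solved numerically. A minimal sketch in R, on simulated data (variable names are illustrative): in matrix form the first-order conditions are $X'X\hat{\beta} = X'y$.

```r
# Sketch: solving the K normal equations (X'X b = X'y) on simulated data
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x2 - 0.5 * x3 + rnorm(n)   # true parameters: 1, 2, -0.5

X <- cbind(1, x2, x3)                    # first column = the constant
b_manual <- solve(t(X) %*% X, t(X) %*% y)

# lm() solves the same system (internally via a QR decomposition)
b_lm <- coef(lm(y ~ x2 + x3))
cbind(b_manual, b_lm)                    # identical up to numerical precision
```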
Unanswered questions:
And one answer to the question in your mind: can we have an example?
I cannot give you a numerical example or practice exercise where you are asked to find the estimates with paper, pencil, and a calculator…
a. Predicted values and residuals (and a nice picture)
Predicted values:
$$\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i} + \cdots + \hat{\beta}_K x_{Ki}$$
Residuals:
$$\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_1 - \hat{\beta}_2 x_{2i} - \hat{\beta}_3 x_{3i} - \cdots - \hat{\beta}_K x_{Ki}$$
(sometimes written $e_i$)
Example: Wage equation (Verbeek Ch 3, section 6)
𝑦𝑖 = 𝑤𝑎𝑔𝑒
𝑥𝑖 = 𝑎𝑔𝑒, 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛, 𝑒𝑥𝑝𝑒𝑟𝑖𝑒𝑛𝑐𝑒, 𝑔𝑒𝑛𝑑𝑒𝑟, 𝑒𝑡𝑐.
Data on 1472 individuals, randomly sampled from the working population in Belgium in
1994.
• lm() function to run linear models in R
• stargazer(reg1, type = "text") to display the results
Calculate predicted values and residuals:
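In R the predicted values and residuals come straight from the fitted lm object; a sketch on simulated data (the Belgian wage file itself is not reproduced here, so the variable names are illustrative):

```r
# Sketch with simulated wage-like data
set.seed(42)
n     <- 100
educ  <- sample(1:5, n, replace = TRUE)
exper <- runif(n, 0, 40)
wage  <- 5 + 1.5 * educ + 0.2 * exper + rnorm(n)

reg   <- lm(wage ~ educ + exper)
y_hat <- fitted(reg)       # predicted values
e_hat <- residuals(reg)    # residuals

head(cbind(wage, y_hat, e_hat))
# Each observation decomposes exactly as y_i = y^_i + e^_i,
# and with an intercept the residuals sum to zero (first FOC)
c(max(abs(wage - (y_hat + e_hat))), sum(e_hat))
```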
b. Interpreting coefficients and estimates
Model: $y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_K x_K + \varepsilon$
$\beta_1$: value of $y$ predicted by the model when all x are equal to zero → intercept of the population line
$\hat{\beta}_1$: value of $y_i$ when all x are equal to zero, as predicted by the regression line → intercept of the regression line
Consider the model and take the partial derivative with respect to one of the
regressors (say 𝑥2 ) :
$$\frac{\partial y}{\partial x_2} = \beta_2$$
$\beta_2$: marginal effect of $x_2$ on y, assuming everything else remains constant (all the other x and $\varepsilon$), as predicted by the model → slope of the linear model
If $\Delta x_3 = \cdots = \Delta x_K = \Delta\varepsilon = 0$, then $\Delta y = \beta_2 \Delta x_2$ → $\beta_2 = \dfrac{\Delta y}{\Delta x_2}$
i.e. 𝛽2 is the change in 𝑦 for a unit change in 𝑥2 as predicted by the model when
nothing else changes (holding everything else constant)
Similarly,
$$\hat{\beta}_2 = \frac{\partial \hat{y}}{\partial x_2} \quad \text{or} \quad \hat{\beta}_2 = \frac{\Delta \hat{y}}{\Delta x_2}$$
$\hat{\beta}_2$ is the change in y predicted by our estimated regression line for a unit change in the regressor $x_2$, everything else held constant → slope of the regression line.
Intriguing questions:
Model: $y_i = \beta_1 + \beta_2 educ_i + \beta_3 exper_i + \varepsilon_i$
[Regression output omitted; dependent variable: wage]
1. Squared variables
Suppose the model includes $\beta_2 x_{2i} + \beta_3 x_{2i}^2$. Then
$$\frac{\partial y_i}{\partial x_{2i}} = \beta_2 + 2\beta_3 x_{2i}$$
[Figure: quadratic relationship between $x_{2i}$ and $y_i$, for the case $\beta_3 > 0$]
𝑦 = 𝛽1 + 𝛽2 𝑒𝑑𝑢𝑐 + 𝛽3 𝑒𝑥𝑝𝑒𝑟 + 𝛽4 𝑒𝑥𝑝𝑒𝑟 2 + 𝜀
# alternatively
reg3 <- lm(wage ~ educ + exper + I(exper^2), data = bwages1)
$y = \beta_1 + \beta_2 educ + \beta_3 exper + \beta_4 exper^2 + \varepsilon$
Estimated marginal effect of experience:
$$\frac{\partial \hat{y}_i}{\partial exper_i} = 0.3688409 + 2(-0.0044353)\, exper_i$$
[Regression output (dependent variable: wage): educ 1.933*** (0.081); Constant −0.057 (0.423); Observations 1,472]
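The marginal effect implied by the reported estimates can be evaluated directly; a sketch (only the two experience coefficients are used):

```r
# Marginal effect of experience from the quadratic specification
b_exper  <-  0.3688409
b_exper2 <- -0.0044353

marg_eff <- function(exper) b_exper + 2 * b_exper2 * exper
marg_eff(c(0, 10, 20, 30))    # the effect of an extra year declines with experience

# Experience level at which the predicted effect of an extra year is zero
-b_exper / (2 * b_exper2)     # about 41.6 years
```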
2. Interaction terms
What is an interaction?
𝑦 = 𝛽1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + 𝛽4 𝑥2 𝑥3 + ⋯ + 𝜀
Ex: $educ \times exper$: you obtain the data for this new variable by multiplying the two variables.
Interactions allow the effect of one explanatory variable to depend upon another.
$$\frac{\partial y}{\partial x_2} = \beta_2 + \beta_4 x_3 \qquad \text{and similarly} \qquad \frac{\partial y}{\partial x_3} = \beta_3 + \beta_4 x_2$$
Example: model includes the interaction between education and experience
$$\frac{\partial y_i}{\partial exper_i} = \beta_3 + \beta_4\, educ_i$$
Hence, the marginal effect of an extra year of experience depends upon the education level attained. In particular, if $\beta_4 > 0$ the returns to experience increase with education.
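In R an interaction can be added either by constructing the product by hand or with the formula operators (a sketch on simulated data; `educ * exper` expands to `educ + exper + educ:exper`):

```r
# Sketch: two equivalent ways to include an interaction term
set.seed(7)
n     <- 300
educ  <- sample(1:5, n, replace = TRUE)
exper <- runif(n, 0, 40)
wage  <- 2 + educ + 0.1 * exper + 0.05 * educ * exper + rnorm(n)

reg_a <- lm(wage ~ educ + exper + I(educ * exper))  # product built by hand
reg_b <- lm(wage ~ educ * exper)                    # same model via formula syntax

# Marginal effect of experience depends on education: b3 + b4 * educ
b <- coef(reg_b)
b["exper"] + b["educ:exper"] * c(1, 5)  # effect at lowest vs. highest educ
```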
𝑦 = 𝛽1 + 𝛽2 𝑒𝑑𝑢𝑐 + 𝛽3 𝑒𝑥𝑝𝑒𝑟 + 𝛽4 (𝑒𝑑𝑢𝑐 × 𝑒𝑥𝑝𝑒𝑟) + 𝜀
$$\frac{\partial y_i}{\partial exper_i} = \beta_3 + \beta_4\, educ_i$$
[Regression output (dependent variable: wage): educ 0.917*** (0.163); …]
3. Dummy variables
Suppose the model is $y_i = \beta_1 + \gamma D_i + \beta_2 x_i + \varepsilon_i$, where $D_i$ is a dummy (0/1) variable.
→ $\gamma$ is the intercept shift between the two groups (a parallel shift of the linear model — regression line)
[Figure: two parallel lines with slope $\beta_2$.
Group $D_i = 1$: $y_i = (\beta_1 + \gamma) + \beta_2 x_i$, intercept $\beta_1 + \gamma$.
Group $D_i = 0$: $y_i = \beta_1 + \beta_2 x_i$, intercept $\beta_1$.]
Suppose you wish to control for gender.
Define:
$$male_i = \begin{cases} 1 & \text{if male} \\ 0 & \text{otherwise} \end{cases}$$
In the model $y_i = \beta_1 + \beta_2 male_i + \beta_3 educ_i + \varepsilon_i$, $\beta_2$ is the intercept shift between males and females, i.e. the gender wage gap.
[Figure: wage against educ, two parallel lines with slope $\beta_3$.
Males: $y_i = (\beta_1 + \beta_2) + \beta_3 educ_i$, intercept $\beta_1 + \beta_2$.
Females: $y_i = \beta_1 + \beta_3 educ_i$, intercept $\beta_1$.]
The model is $y_i = \beta_1 + \beta_2 male_i + \varepsilon_i$.
[Regression output (dependent variable: wage): male 1.301*** (0.235); Constant 10.262*** (0.183); Observations 1,472]
1.301 is the unadjusted wage gap: the raw difference in the average wage of men and women in the sample.

bwages1 %>%
  group_by(male) %>%
  summarize("wages by sex" = mean(wage))

  male `wages by sex`
1    0           10.3
2    1           11.6

There can be reasons why men are paid more than women (education, experience, etc.) → you need to control for them.
Now the model is $y_i = \beta_1 + \beta_2 male_i + \beta_3 educ_i + \beta_4 exper_i + \varepsilon_i$.
[Regression output excerpts (dependent variable: wage), two specifications:
(1): male 1.346*** (0.193); exper 0.192*** (0.010); …
(2): male 1.560*** (0.373); female 0.214 (0.387); …]
Wage gap = 1.560 − 0.214 = 1.346
$$married_i = \begin{cases} 1 & \text{if married} \\ 0 & \text{otherwise} \end{cases}$$
The model is 𝑦𝑖 = 𝛽1 + 𝛽2 𝑚𝑎𝑙𝑒𝑖 + 𝛽3 𝑚𝑎𝑟𝑟𝑖𝑒𝑑𝑖 + 𝛽4 𝑒𝑑𝑢𝑐𝑖 + 𝛽5 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝜀𝑖
- If $male_i = 0$ and $married_i = 0$ (unmarried females) → intercept equals $\beta_1$ (base category)
- If $male_i = 1$ and $married_i = 0$ (unmarried males) → intercept equals $\beta_1 + \beta_2$
- If $male_i = 0$ and $married_i = 1$ (married females) → intercept equals $\beta_1 + \beta_3$
- If $male_i = 1$ and $married_i = 1$ (married males) → intercept equals $\beta_1 + \beta_2 + \beta_3$
          Unmarried            Married
Female    $\beta_1$            $\beta_1 + \beta_3$
Male      $\beta_1 + \beta_2$  $\beta_1 + \beta_2 + \beta_3$
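The four group intercepts can be recovered from the fitted coefficients; a sketch on simulated data (the parameter values 9.5, 1.3, 0.8 are hypothetical):

```r
# Sketch: recovering the four group intercepts from two additive dummies
set.seed(3)
n       <- 400
male    <- rbinom(n, 1, 0.5)
married <- rbinom(n, 1, 0.5)
wage    <- 9.5 + 1.3 * male + 0.8 * married + rnorm(n)  # hypothetical parameters

reg <- lm(wage ~ male + married)
b   <- coef(reg)

c(unmarried_female = b[["(Intercept)"]],
  unmarried_male   = b[["(Intercept)"]] + b[["male"]],
  married_female   = b[["(Intercept)"]] + b[["married"]],
  married_male     = b[["(Intercept)"]] + b[["male"]] + b[["married"]])
```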
4. Dummy interactions
You can interact dummies with continuous variables and other dummies.
The model is:
𝑦𝑖 = 𝛽1 + 𝛽2 𝑥2𝑖 + 𝛿𝐷𝑖 𝑥2𝑖 + ⋯ + 𝜀𝑖
The interaction 𝐷𝑖 𝑥2𝑖 allows us to estimate a different slope for the two groups
[Figure: two lines from the same intercept $\beta_1$, plotted against $x_2$.
$D_i = 1$: $y_i = \beta_1 + (\beta_2 + \delta) x_{2i}$.
$D_i = 0$: $y_i = \beta_1 + \beta_2 x_{2i}$.]
Example: suppose you suspect the effect of experience on wage depends
on gender.
You can include in the model an interaction between experience and one
of the gender dummies.
The model is:
𝑦𝑖 = 𝛽1 + 𝛽2 𝑒𝑑𝑢𝑐𝑖 + 𝛽3 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝛽4 (𝑒𝑥𝑝𝑒𝑟𝑖 × 𝑚𝑎𝑙𝑒𝑖 ) + 𝜀𝑖
[Regression output excerpts (dependent variable: wage), two specifications]
An extra year of experience increases wage by 0.149 BF for females and by (0.149 + 0.070) = 0.219 BF for males.
[Figure: lines against $x_{2i}$ with both a different intercept and a different slope.
$D_i = 1$: $y_i = (\beta_1 + \gamma) + (\beta_2 + \delta) x_{2i}$, intercept $\beta_1 + \gamma$.
$D_i = 0$: $y_i = \beta_1 + \beta_2 x_{2i}$, intercept $\beta_1$.]
Example: the model is $y_i = \beta_1 + \beta_2 male_i + \beta_3 educ_i + \beta_4 exper_i + \beta_5 (exper_i \times male_i) + \varepsilon_i$
$\beta_2$: wage gap (the wage difference between male and female workers that is unexplained by the variables controlled for)
[Figure: wage against exper.
Males: $y_i = (\beta_1 + \beta_2) + \beta_3 educ_i + (\beta_4 + \beta_5) exper_i$, intercept $\beta_1 + \beta_2$.
Females: $y_i = \beta_1 + \beta_3 educ_i + \beta_4 exper_i$, intercept $\beta_1$.]
[Regression output omitted; dependent variable: wage]
Note how the estimated gender gap shrinks as we enrich the model!
Interacting two dummies
Suppose the model is
𝑦𝑖 = 𝛽1 + 𝛽2 𝑚𝑎𝑙𝑒𝑖 + 𝛽3 𝑚𝑎𝑟𝑟𝑖𝑒𝑑𝑖 + 𝛽4 𝑚𝑎𝑙𝑒𝑖 × 𝑚𝑎𝑟𝑟𝑖𝑒𝑑𝑖 + ⋯ + 𝜀𝑖
          Unmarried            Married
Female    $\beta_1$            $\beta_1 + \beta_3$
Male      $\beta_1 + \beta_2$  $\beta_1 + \beta_2 + \beta_3 + \beta_4$

$$\text{gender wage gap} = \begin{cases} \beta_2 & \text{if unmarried} \\ \beta_2 + \beta_4 & \text{if married} \end{cases}$$
The interaction allows the gender wage gap to differ between married and unmarried individuals.
Similarly for the wage difference between married and unmarried workers (equal to $\beta_3$ for females and to $\beta_3 + \beta_4$ for males).
5. Ordinal variables
For example: 𝑒𝑑𝑢𝑐𝑖 = 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑙𝑒𝑣𝑒𝑙 (1, 2, … , 5)
• If you include $\beta_1 + \beta_2 educ_i$, then $\beta_2$ is the marginal effect of gaining an extra education level (from 1 to 2, 2 to 3, etc.) and it is modeled as constant.
• Define instead a full set of dummy variables
$$ed1_i = \begin{cases} 1 & \text{if } educ_i = 1 \\ 0 & \text{otherwise} \end{cases} \quad \dots \quad ed5_i = \begin{cases} 1 & \text{if } educ_i = 5 \\ 0 & \text{otherwise} \end{cases}$$
Now the marginal effect of gaining an extra education level depends on what
education level is gained.
Primary school: 𝑦𝑖 = 𝛽1 +𝛽6 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝜀𝑖
Lower vocational: 𝑦𝑖 = 𝛽1 + 𝛽2 (𝑒𝑑2𝑖 = 1) +𝛽6 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝜀𝑖
Intermediate vocational: 𝑦𝑖 = 𝛽1 + 𝛽3 (𝑒𝑑3𝑖 = 1) +𝛽6 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝜀𝑖
Higher vocational: 𝑦𝑖 = 𝛽1 + 𝛽4 (𝑒𝑑4𝑖 = 1) +𝛽6 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝜀𝑖
University level: 𝑦𝑖 = 𝛽1 + 𝛽5 (𝑒𝑑5𝑖 = 1) +𝛽6 𝑒𝑥𝑝𝑒𝑟𝑖 + 𝜀𝑖
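In R the full set of education dummies can be generated automatically with `factor()`, which drops the first level as the base category; a sketch on simulated data (the lecture's own output uses `factor(educ)` the same way):

```r
# Sketch: factor(educ) creates the ed2..ed5 dummies; level 1 is the base category
set.seed(11)
n     <- 500
educ  <- sample(1:5, n, replace = TRUE)
exper <- runif(n, 0, 40)
effect <- c(0, 1.5, 3.5, 5, 7)[educ]     # true level effects relative to level 1
wage  <- 8 + effect + 0.15 * exper + rnorm(n)

reg <- lm(wage ~ factor(educ) + exper)
coef(reg)
# Marginal effect of moving from level 2 to level 3:
coef(reg)[["factor(educ)3"]] - coef(reg)[["factor(educ)2"]]
```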
[Regression output excerpts (dependent variable: wage), two specifications:
(1): educ 1.930*** (0.082); Constant 1.074*** (0.373)
(2): factor(educ)2 1.812*** (0.427); factor(educ)3 3.527*** (0.411); …; Constant 3.302*** (0.441)]
Marginal effect of moving from level 2 to level 3: 3.527 − 1.812 = 1.714 (up to rounding)
6. Logarithms and Elasticities
Sometimes economists use a log transformation of the variables: instead of using $y$ or $x_{2i}$, use $\ln y$ or $\ln x_{2i}$.
Why?
- to allow for non-linearities
- if the dependent variable has an asymmetric distribution
- to reduce the problem of heteroskedasticity in the data (more on this later)
- to easily obtain elasticities.
Elasticity
Suppose the model is linear in variables
𝑦𝑖 = 𝛽1 +𝛽2 𝑥2𝑖 + 𝛽3 𝑥3𝑖 + ⋯ + 𝛽𝐾 𝑥𝐾𝑖 + 𝜀𝑖
Elasticity: the percentage change in y for a 1% change in x (say $x_{2i}$):
$$\frac{\Delta y_i / y_i}{\Delta x_{2i} / x_{2i}} = \frac{\Delta y_i}{\Delta x_{2i}} \cdot \frac{x_{2i}}{y_i} = \beta_2 \frac{x_{2i}}{y_i}$$
But remember that for the function $\ln y = a \ln x$, taking the total differential gives
$$\frac{1}{y}\,dy = a\,\frac{1}{x}\,dx \;\Rightarrow\; \frac{dy/y}{dx/x} = a$$
i.e. in a log-log model the coefficient itself is the (constant) elasticity.
Semi-elasticity
❑If dependent variable is in log and regressor is in levels (sometimes called log-lin
model), then
𝑙𝑛𝑦𝑖 = 𝛽1 +𝛽2 𝑥2𝑖 + 𝛽3 𝑥3𝑖 + ⋯ + 𝛽𝐾 𝑥𝐾𝑖 + 𝜀𝑖
𝛽𝑘 : a 1-unit increase in 𝑥𝑘 is associated with a 𝛽𝑘 × 100% change in y.
❑If the dependent variable is in levels and the regressor is in logs (sometimes called lin-log model), then
$y_i = \beta_1 + \beta_2 \ln x_{2i} + \beta_3 x_{3i} + \cdots + \beta_K x_{Ki} + \varepsilon_i$
$\beta_2$: a 1% increase in $x_2$ is associated with a $\beta_2 / 100$ unit change in y.
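Both rules follow from the same total-differential argument as in the log-log case; a sketch of the derivation:

```latex
% Log-lin model: \ln y = \beta_1 + \beta_2 x_2 + \dots
d\ln y = \beta_2\,dx_2
\;\Rightarrow\; \frac{dy}{y} = \beta_2\,dx_2
\;\Rightarrow\; \%\Delta y \approx 100\,\beta_2\,\Delta x_2

% Lin-log model: y = \beta_1 + \beta_2 \ln x_2 + \dots
dy = \beta_2\,d\ln x_2 = \beta_2\,\frac{dx_2}{x_2}
\;\Rightarrow\; \Delta y \approx \frac{\beta_2}{100}\,(\%\Delta x_2)
```

For non-marginal changes the exact proportional change in the log-lin model is $e^{\beta_2 \Delta x_2} - 1$; the $100\,\beta_2\%$ rule is its first-order approximation, accurate when $\beta_2 \Delta x_2$ is small.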
Example: wage equation
c. Goodness-of-fit
How well does the regression line fit the data? How close to the line
are the observations?
Is the variation in x a good predictor of the variation in y?
The quality of the linear approximation offered by the model can be
measured by the R2.
• The R2 indicates the proportion of the variance in y that can be
explained by the linear combination of x variables
• In formula: $R^2 = \dfrac{\text{variance explained}}{\text{total variance}}$
• If the model contains an intercept (as it usually does), then we can write
$$R^2 = \frac{V(\hat{y}_i)}{V(y_i)} = 1 - \frac{V(\hat{\varepsilon}_i)}{V(y_i)}$$
In fact, $y_i = \hat{y}_i + \hat{\varepsilon}_i$ and $V(y_i) = V(\hat{y}_i) + V(\hat{\varepsilon}_i)$, since $\hat{y}_i$ and $\hat{\varepsilon}_i$ are uncorrelated.
$0 \le R^2 \le 1$.
$$\text{uncentered } R^2 = \frac{\sum_{i=1}^{N} \hat{y}_i^2}{\sum_{i=1}^{N} y_i^2} = 1 - \frac{\sum_{i=1}^{N} \hat{\varepsilon}_i^2}{\sum_{i=1}^{N} y_i^2}$$
Caveats!
1. R2 is sensitive to transformations of y → R2s cannot be compared if y differs (y, ln(y), ∆y, ∆ln(y), etc.)
2. There is no general rule to say whether an R2 is high or low; it depends upon the particular context: in a microeconometrics context 0.2 can be high, in time-series analysis 0.8 can be low.
3. It does not measure the quality of the model, only the quality of the linear approximation; hence it has little relevance when analysing results.
4. It is obtained by minimizing the variance of the error ε; hence OLS gives the highest R2 you can ever obtain in a linear model. If you use different estimators (maybe better in certain circumstances), their corresponding R2 will always be lower!
5. R2 will never decrease if a variable is added. Therefore, we define the adjusted R2 as
$$\text{adjusted } R^2 = 1 - \frac{\frac{1}{N-K} \sum_{i=1}^{N} \hat{\varepsilon}_i^2}{\frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2}$$
It has a penalty for larger K; it may decline if you add a regressor, and it can be negative.
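Both formulas can be checked against lm's own output; a sketch on simulated data:

```r
# Sketch: computing R^2 and adjusted R^2 by hand and comparing with summary(lm)
set.seed(5)
n  <- 120
K  <- 3                                   # parameters, including the intercept
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x2 + 0.3 * x3 + rnorm(n)

reg <- lm(y ~ x2 + x3)
e   <- residuals(reg)

r2     <- 1 - sum(e^2) / sum((y - mean(y))^2)
adj_r2 <- 1 - (sum(e^2) / (n - K)) / (sum((y - mean(y))^2) / (n - 1))

c(r2, summary(reg)$r.squared)             # match
c(adj_r2, summary(reg)$adj.r.squared)     # match
```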