
PE I: Univariate Regression

OLS, Regression Models & Interpretation


(Chapters 3.2 & 3.3)

Andrius Buteikis, [email protected]


http://web.vu.lt/mif/a.buteikis/
OLS: Assumptions
(UR.1) The Data Generating Process (DGP), or in other words,
the population, is described by a linear (in terms of the coefficients)
model:
Yi = β0 + β1 Xi + εi , ∀i = 1, ..., N (UR.1)

(UR.2) The error term ε has an expected value of zero, given any
value of the explanatory variable:

E(εi |Xj ) = 0, ∀i, j = 1, ..., N (UR.2)

(UR.3) The error term ε has the same variance given any value of
the explanatory variable (i.e. homoskedasticity) and the error terms
are not correlated across observations (i.e. no autocorrelation):

Var(ε|X) = σε² I (UR.3)

i.e. Cov(εi , εj ) = 0, i ≠ j, and Var(εi ) = σε² = σ².

(UR.4) (optional) The residuals are normal:

ε|X ∼ N(0, σε² I) (UR.4)

where ε = (ε1 , ..., εN )^⊤, X = (X1 , ..., XN )^⊤, and Y = (Y1 , ..., YN )^⊤.
OLS: Estimation
Problem: Assume that the data follows (UR.1) - (UR.4). However, we
do not know the true parameters β0 and β1 of our underlying
regression Y = β0 + β1 X + ε.
Question: How do we estimate β0 and β1 ?
Answer: Specify the equations in a matrix notation:

Y1 = β0 + β1 X1 + ε1
Y2 = β0 + β1 X2 + ε2
...
YN = β0 + β1 XN + εN

⟺ Y = Xβ + ε, where Y = (Y1 , ..., YN )^⊤, ε = (ε1 , ..., εN )^⊤, β = (β0 , β1 )^⊤
and X is the N × 2 design matrix whose i-th row is (1, Xi ),

and minimize the sum of squared residuals of Y = Xβ + ε:

RSS(β̂) = Σ_{i=1}^{N} ε̂i² = ε̂^⊤ε̂ = (Y − Xβ̂)^⊤(Y − Xβ̂) → min over β̂0 , β̂1

(Alternatively, either use the method of moments, or minimize the sum of squared
residuals written directly as a sum instead of in its vectorized form.)
OLS: The Estimator
Taking the partial derivative and equating it to zero:

∂RSS(β̂)/∂β̂ = −2X^⊤Y + 2X^⊤Xβ̂ = 0

yields the OLS estimator:

β̂ = (X^⊤X)^(−1) X^⊤Y (OLS)
The term Ordinary Least Squares (OLS) comes from the fact
that these estimates minimize the sum of squared residuals.
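As a quick sanity check of the (OLS) formula, the following minimal Python sketch (simulated data with hypothetical parameter values, not the example from the lecture notes) computes β̂ = (X^⊤X)^(−1) X^⊤Y directly and compares it with statsmodels:

Python:
import numpy as np
import statsmodels.api as sm

# Simulate a univariate DGP Y = beta_0 + beta_1 * X + e (hypothetical values)
np.random.seed(42)
N = 200
beta_0, beta_1 = 1.0, 0.5
x = np.random.uniform(low = 0, high = 10, size = N)
e = np.random.normal(loc = 0, scale = 1, size = N)
y = beta_0 + beta_1 * x + e

# Design matrix with a column of ones for the intercept
X_mat = sm.add_constant(x)

# OLS estimator: beta_hat = (X'X)^(-1) X'Y, computed via a linear solve
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y)
print(beta_hat)

# The same estimates from statsmodels
print(sm.OLS(y, X_mat).fit().params)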

Gauss-Markov Theorem
The main advantages of the OLS estimators are summarized by the
following theorem:
Gauss-Markov theorem
Under the assumption that the conditions (UR.1) - (UR.3) hold true,
the OLS estimators βb0 and βb1 are BLUE (Best Linear Unbiased
Estimator) and Consistent.
Univariate Regression: Modelling Framework
Assume that we are interested in explaining Y in terms of X under (UR.1)
- (UR.4):
Yi = β0 + β1 Xi + εi , ∀i = 1, ..., N
where:
I X is called the independent variable, the control variable, the
explanatory variable, the predictor variable, or the regressor;
I Y is called the dependent variable, the response variable, the
explained variable, the predicted variable, or the regressand;
I ε is called the random component, the error term, the disturbance,
or the (economic) shock.
The expected value:

E(β0 + β1 X + ε|X ) = β0 + β1 X

where:
I β0 - the intercept parameter, sometimes called the constant term.
I β1 - the slope parameter.
After obtaining the estimates β̂0 and β̂1 , we may want to examine the
following values:
I The fitted values of Y , which are defined as the following OLS
regression line (or more generally, the estimated regression line):

Ŷi = β̂0 + β̂1 Xi

where β̂0 and β̂1 are estimated via OLS. By definition, each fitted
value Ŷi is on the estimated OLS regression line.
I The residuals, which are defined as the difference between the
actual and fitted values of Y :

ε̂i = Yi − Ŷi = Yi − β̂0 − β̂1 Xi

which are hopefully close to the true unobserved errors εi .


It may be helpful to look at ε as an unexplainable part of the model,
which is due to the randomness of the data.
As such, the explainable part of the model can be expressed in
terms of the fitted values Ŷ, which themselves are estimates of the
conditional expected value of Y , given X .
[Figure: Scatter diagram of the (X, Y) sample data with the fitted regression line Ŷ = β̂0 + β̂1X - each fitted value Ŷi lies on the line, and the residual ε̂i is the vertical distance between the observed Yi and Ŷi.]
Now is also a good time to highlight the difference between the errors and
the residuals.

I The random sample, taken from a Data Generating
Process (i.e. the population), is described via

Yi = β0 + β1 Xi + εi

where εi is the error for observation i.

I After estimating the unknown parameters β0 , β1 , we can
re-write the equation as:

Yi = β̂0 + β̂1 Xi + ε̂i

where ε̂i is the residual for observation i.


The errors show up in the underlying (i.e. true) DGP equation,
while the residuals show up in the estimated equation. The errors
are never observed, while the residuals are calculated from
the data.
We can also re-write the residuals in terms of the error term and the
difference between the true and estimated parameters:
   
ε̂i = Yi − Ŷi = β0 + β1 Xi + εi − (β̂0 + β̂1 Xi ) = εi − (β̂0 − β0 ) − (β̂1 − β1 )Xi
Gauss-Markov theorem
If the conditions (UR.1) - (UR.3) hold true, the OLS estimator

β̂ = (X^⊤X)^(−1) X^⊤Y (OLS)

is BLUE (Best Linear Unbiased Estimator) and Consistent.
What is an Estimator?
An estimator is a rule that can be applied to any sample of data to
produce an estimate. In other words the estimator is the rule and the
estimate is the result.
So, eq. (OLS) is the rule and therefore an Estimator.

OLS estimators are Linear


From the specification of the relationship in (UR.1) between Y and X
(using the matrix notation for generality):

Y = Xβ + ε

We see that the OLS estimator β̂ = (X^⊤X)^(−1) X^⊤Y is a linear function of Y, i.e. the estimator is Linear.


OLS estimators are Unbiased
Using the matrix notation for the sample linear equations (Y = Xβ + ε)
and plugging it into eq. (OLS) gives us the following:

β̂ = β + (X^⊤X)^(−1) X^⊤ε

If we take the expectation of both sides, use the law of total expectation
and the fact that E (ε|X) = 0 from (UR.2):

E[β̂] = β + E[(X^⊤X)^(−1) X^⊤ε] = ... = β

We have shown that E[β̂] = β - i.e., the OLS estimator β̂ is an
Unbiased estimator of β.

Furthermore:
I Unbiasedness does not guarantee that the estimate we get with any particular
sample is equal (or even very close) to β.
I It means that if we could repeatedly draw random samples from the population
and compute the estimate each time, then the average of these estimates would
be (very close to) β.
I However, in most applications we have just one random sample to work with. As
we will see later on, there are methods for creating additional samples from the
available data by creating and analysing different subsamples.
OLS estimators are Best (Efficient)
I When there is more than one unbiased method of estimation to
choose from, that estimator which has the lowest variance is the best.
I We want to show that OLS estimators are best in the sense that β̂
are efficient estimators of β (i.e. they have the smallest variance).
From the proof of unbiasedness of the OLS we have that:

β̂ = β + (X^⊤X)^(−1) X^⊤ε  =⇒  β̂ − β = (X^⊤X)^(−1) X^⊤ε

which we can then use to calculate the variance-covariance matrix of the
OLS estimator:

Var(β̂) = E[(β̂ − E(β̂))(β̂ − E(β̂))^⊤] = E[(β̂ − β)(β̂ − β)^⊤] = ... = σ²(X^⊤X)^(−1)

Note: we are usually interested in the diagonal elements of the parameter
variance-covariance matrix:

Var(β̂) = | Var(β̂0 )       Cov(β̂0 , β̂1 ) |  = σ²(X^⊤X)^(−1)
          | Cov(β̂1 , β̂0 )  Var(β̂1 )      |

That is, the variances of the parameter estimates themselves.


Question: Is this variance really the best?
Answer: To verify this, assume that we have some other estimator of β,
which is also unbiased and can be expressed as:

β̃ = [(X^⊤X)^(−1) X^⊤ + D] Y = CY

As long as we can express it as the above eq., we can show that, since
E[ε] = 0:

E[β̃] = E[((X^⊤X)^(−1) X^⊤ + D)(Xβ + ε)] = ... = (I + DX)β = β ⟺ DX = 0

then β̃ is unbiased if and only if DX = 0.

Then, we can calculate its variance as:

Var(β̃) = Var(CY) = C Var(Xβ + ε) C^⊤ = σ² CC^⊤
        = ... = σ²[(X^⊤X)^(−1) + DD^⊤] = Var(β̂) + σ²DD^⊤ ≥ Var(β̂)

since DD^⊤ is a positive semidefinite matrix. This means that β̂ has the
smallest variance. Therefore, β̂ is the Best estimator of β.
The lower the variance of an estimator, the more precise (accurate) it is.
Problem: Looking back at Var(β̂) = σ²(X^⊤X)^(−1) - we do not know the
true σ².
Question: How do we get σ²?
Answer: We estimate it by calculating the sample residual variance:

σ̂² = s² = ε̂^⊤ε̂ / (N − 2) = (1/(N − 2)) Σ_{i=1}^{N} ε̂i²

Note that if we take N instead of N − 2 for the univariate regression


case in the denominator, then the variance estimate would be biased.
This is because the variance estimator would not account for two
restrictions that must be satisfied by the OLS residuals, namely:
Σ_{i=1}^{N} ε̂i = 0,   Σ_{i=1}^{N} ε̂i Xi = 0

So, we take N − 2 instead of N, because of the number of restrictions on


the residuals.
Note that this is an estimated variance. Nevertheless, it is a key
component in assessing the accuracy of the parameter estimates (when
calculating test statistics and confidence intervals).
Since we estimate β̂ from a random sample, the estimator β̂ is a
random variable as well. We can measure the uncertainty of β̂ via its
standard deviation. This is the standard error of our estimate of β:
The square roots of the diagonal elements of the estimated variance-
covariance matrix Var(β̂) are called the standard errors (se) of the corre-
sponding OLS estimators β̂, which we use to estimate the standard
deviation of β̂i from βi :

se(β̂i ) = √Var(β̂i )

The standard errors describe the accuracy of an estimator (the


smaller the better).
I The standard errors are measures of the sampling variability of the least squares
estimates β̂0 and β̂1 in repeated samples;
I If we collect a number of different data samples, the OLS estimates will be
different for each sample. As such, the OLS estimators are random variables and
have their own distribution.
I Potential problem: if the residuals are large (since their mean will still be zero, this
concerns the case when the estimated variance of the residuals is large), then the
standard errors of the coefficients will be large as well.
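Continuing the simulated sketch from the (OLS) slide (hypothetical data, not the lecture example), the residual variance σ̂² and the standard errors of β̂ can be computed as follows; statsmodels reports the same quantities via its .bse attribute:

Python:
import numpy as np
import statsmodels.api as sm

# Simulated data (hypothetical parameter values)
np.random.seed(42)
N = 200
x = np.random.uniform(low = 0, high = 10, size = N)
y = 1.0 + 0.5 * x + np.random.normal(loc = 0, scale = 1, size = N)

X_mat = sm.add_constant(x)
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y)

# Residuals and the residual variance estimate with N - 2 in the denominator
resid = y - X_mat @ beta_hat
sigma2_hat = (resid @ resid) / (N - 2)

# Estimated variance-covariance matrix and standard errors of beta_hat
var_beta = sigma2_hat * np.linalg.inv(X_mat.T @ X_mat)
se_beta = np.sqrt(np.diag(var_beta))
print(se_beta)

# statsmodels reports the same standard errors
print(sm.OLS(y, X_mat).fit().bse)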
While the theoretical properties of unbiased estimators with low variance
are nice to have - what is their significance in practical applications?
We have shown that the (OLS) estimator is BLUE. Finally, we
move on to examining their consistency.
OLS estimators are Consistent
A consistent estimator has the property that, as the number of data points
(which are used to estimate the parameters) increases (i.e. N → ∞), the
estimates converge in probability to the true parameter, i.e.:
Definition
Let θ̃N be an estimator of a parameter θ based on a sample Y1 , ..., YN .
Then we say that θ̃N is a consistent estimator of θ if ∀ε > 0:

P(|θ̃N − θ| > ε) → 0, as N → ∞

We can denote this as θ̃N →p θ or plim(θ̃N ) = θ.
If θ̃N is not consistent, then we say that θ̃N is inconsistent.
I Unlike unbiasedness, consistency involves the behavior of the
sampling distribution of the estimator as the sample size N gets large
and the distribution of θ̃N becomes more and more concentrated
about θ. In other words, for larger sample sizes, θ̃N is less and less
likely to be very far from θ.
I An inconsistent estimator does not help us learn about θ, regardless
of the size of the data sample.
I For this reason, consistency is a minimal requirement of an
estimator used in statistics or econometrics.
Unbiased estimators are not necessarily consistent, but those whose
variances shrink to zero as the sample size grows are consistent.
For some examples, see the lecture notes.
Going back to our OLS estimators β̂0 and β̂1 - since we can express the
estimator as β̂ = β + (X^⊤X)^(−1) X^⊤ε, then, as N → ∞ we have that:

β̂ → [β0 ; β1 ] + (1/Var(X )) · [ E(ε)·E(X²) − E(X )·E(Xε) ;  E(Xε) − E(X )·E(ε) ]
  = [β0 ; β1 ] + (1/Var(X )) · [0 ; 0] = [β0 ; β1 ]

since E(ε) = 0 and E(Xε) = E(Xε) − E(X )E(ε) = Cov(X , ε) = 0.

Which means that β̂ → β, as N → ∞.
So, the OLS parameter vector β̂ is a consistent estimator of β.
Practical illustration of the OLS properties
We will return to our example in this chapter. We have proved the
unbiasedness and consistency of OLS estimators.
To illustrate these properties empirically, we will:
I generate 5000 replications (i.e. different samples) for each of the
different sample sizes N ∈ {11, 101, 1001}.
I for each replication of each sample size we will estimate the unknown
regression parameters β;
I for each sample size, we will calculate the average of these parameter
vectors.
This method of using repeat sampling is also known as a Monte Carlo
method.
The extensive code can be found in the lecture notes.
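A minimal Python sketch of such a Monte Carlo experiment (a simplified version of what the lecture notes do; the exact numbers it produces will differ from the output shown next):

Python:
import numpy as np

np.random.seed(123)
beta_0, beta_1 = 1.0, 0.5
R = 5000  # number of replications

for N in [10, 100, 1000]:
    x = np.arange(0, N + 1)                        # fixed regressor, as in the earlier examples
    X_mat = np.column_stack([np.ones(len(x)), x])  # design matrix with an intercept column
    estimates = np.zeros((R, 2))
    for r in range(R):
        e = np.random.normal(loc = 0, scale = 1, size = len(x))
        y = beta_0 + beta_1 * x + e
        estimates[r, :] = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y)
    # Average and variance of the estimates across replications
    print(N, estimates.mean(axis = 0), estimates.var(axis = 0))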
In our experimentation the true parameter values are:
## True beta_0 = 1. True beta_1 = 0.5
while the average values of the parameters from 5000 different samples for
each sample size are:
## With N = 10:
## the AVERAGE of the estimated parameters:
## beta_0: 1.00437
## beta_1: 0.49984
## With N = 100:
## the AVERAGE of the estimated parameters:
## beta_0: 0.99789
## beta_1: 0.50006
## With N = 1000:
## the AVERAGE of the estimated parameters:
## beta_0: 0.99957
## beta_1: 0.5
We can see that:
I The means of the estimated parameters are close to the true
parameter values regardless of sample size.
The variance of these estimates can also be examined:
## With N = 10:
## the VARIANCE of the estimated parameters:
## beta_0: 0.3178
## beta_1: 0.00904
## With N = 100:
## the VARIANCE of the estimated parameters:
## beta_0: 0.03896
## beta_1: 1e-05
## With N = 1000:
## the VARIANCE of the estimated parameters:
## beta_0: 0.00394
## beta_1: 0.0
Note that unbiasedness is true for any N, while consistency is an
asymptotic property, i.e. it holds when N → ∞.
We can see that:
I The variance of the estimated parameters decreases with larger
sample size, i.e. the larger the sample size, the closer our
estimated parameters will be to the true values.
[Figure: Histograms of the estimates β̂0 and β̂1 across the 5000 replications, for N = 10, N = 100 and N = 1000 - the sampling distributions are centred on the true values and become increasingly concentrated as N grows.]
We see that the histograms of the OLS estimators have a bell-shaped
distribution.

Under assumption (UR.4) it can be shown that, since ε|X ∼ N(0, σε² I),
the linear combination of the errors in β̂ = β + (X^⊤X)^(−1) X^⊤ε will
also be normal, i.e.

β̂|X ∼ N(β, σ²(X^⊤X)^(−1))
Regression Models and Interpretation
Inclusion of the constant term in the regression
In some cases we want to impose a restriction that if X = 0, then Y = 0
as well.
I An example could be the relationship between income (X ) and income
tax revenue (Y ) - if there is no income, X = 0, then the expected
revenue from the taxes would also be zero - E(Y |X = 0) = 0.
Formally, we now choose a slope estimator, β1 , from the following
regression model:
Yi = β1 Xi + εi , i = 1, ..., N
which is called a regression through the origin, because the conditional
expected value:
E (Yi |Xi ) = β1 Xi
of the regression passes through the origin point X = 0, Y = 0. We can
obtain the estimate of the slope parameter via OLS by minimizing the sum
of squared residuals:
RSS = Σ_{i=1}^{N} ε̂i² = Σ_{i=1}^{N} (Yi − β̂1 Xi )² → min  =⇒  β̂1 = Σ_{i=1}^{N} Xi Yi / Σ_{i=1}^{N} Xi²
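A short sketch (simulated data with a hypothetical slope value) verifying that the closed-form slope Σ Xi Yi / Σ Xi² matches a regression fitted without a constant:

Python:
import numpy as np
import statsmodels.api as sm

# Simulate data passing through the origin (hypothetical slope value)
np.random.seed(42)
x = np.random.uniform(low = 1, high = 10, size = 100)
y = 0.5 * x + np.random.normal(loc = 0, scale = 1, size = 100)

# Closed-form OLS slope for the no-intercept model
slope = np.sum(x * y) / np.sum(x ** 2)
print(slope)

# Equivalent statsmodels fit: no constant column is added to the regressors
print(sm.OLS(y, x).fit().params)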
So, it is possible to specify a regression without a constant term, but
should we opt for it?
I A constant β0 can be described as the mean value of Y when all
predictor variables are set to zero. However, if the predictors can’t be
zero, then it is impossible to interpret the constant.
I The intercept parameter β0 may be regarded as a sort of garbage
collector for the regression model. The reason for this is the
underlying assumption that the expected value of the residuals is zero,
which means that any bias, that is not accounted by the model, is
collected in the intercept β0 .
In general, inclusion of a constant in the regression model ensures that the
model's residuals have a mean of zero; otherwise, the estimated coefficients
may be biased.
Consequently, and as is often the case, without knowing the true
underlying model it is generally not worth interpreting the regression
constant.
Linear Regression Models
The regression model that we examined up to now is called a simple
linear regression model. Looking at the univariate regression:

Y = Xβ + ε

where β = (β0 , β1 )^⊤.


When we say linear regression, we mean linear in the parameters β. There
are no restrictions on transformations of X and Y , as long as the
parameters enter the equation linearly.
For example, we can use log(X ) and log(Y ), or √X and √Y, etc.
in the univariate regression. While transforming X and Y does not
affect the linear regression specification itself, the interpretation
of the coefficients depends on the transformation of X and
Y.
On the other hand, there are regression models, which are not regarded as
linear, since they are not linear in their parameters:
Yi = 1/(β0 + β1 Xi ) + εi , i = 1, ..., N
Furthermore, estimation of such models is a separate issue, which covers
nonlinear regression models.
Effects of Changing the Measurement Units
Generally, it is easy to figure out what happens to the intercept, β0 , and
slope, β1 , estimates when the units of measurement are changed for the
dependent variable, Y :

Ỹ = c · Y = c · (β0 + β1 X + ε) = (c · β0 ) + (c · β1 )X + (c · ε)

In other words, if Y is multiplied by a constant c, then the OLS
estimates of Ỹ = c · Y are β̃0 = c · β0 and β̃1 = c · β1 .
R:
set.seed(234)
# Set the coefficients:
N = 50
beta_0 = 1
beta_1 = 0.5
const = 10
# Generate sample data:
x <- 0:N
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- beta_0 + beta_1 * x + e
new_y <- y * const
#
y_fit <- lm(y ~ x)
new_fit <- lm(new_y ~ x)
print(y_fit$coefficients)
## (Intercept)         x
##   0.8337401 0.5047895
print(new_fit$coefficients)
## (Intercept)        x
##    8.337401 5.047895

Python:
import numpy as np
import statsmodels.api as sm
np.random.seed(234)
# Set the coefficients:
N = 50
beta_0 = 1
beta_1 = 0.5
const = 10
# Generate sample data:
x = np.arange(start = 0, stop = N + 1, step = 1)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * x + e
new_y = y * const
#
x_mat = sm.add_constant(x)
y_fit = sm.OLS(y, x_mat).fit()
new_fit = sm.OLS(new_y, x_mat).fit()
print(y_fit.params)
## [0.83853621 0.5099222 ]
print(new_fit.params)
## [8.38536209 5.09922195]
Note that the variance of c · ε is now c²σ²:

R:
print(summary(y_fit)$sigma^2)
## [1] 0.9329815
print(summary(new_fit)$sigma^2)
## [1] 93.29815

Python:
print(y_fit.scale)
## 0.9570628350921718
print(new_fit.scale)
## 95.70628350921723
This assumes that nothing changes in the scaling of the independent
variable X used in OLS estimation.
If we change the units of measurement of an independent variable, X , then
only the slope, β1 , (i.e. the coefficient of that independent variable)
changes:

Y = β0 + β1 X + ε = β0 + (β1 /c)·(c · X ) + ε

In other words, if X is multiplied by a constant c, then the OLS
estimates of Y are β0 and β̃1 = β1 /c.
We can verify that this is the case with our empirical data sample by
creating new variables X (1) = X · c and X (2) = X /c
R:
x1 <- x * const
x2 <- x / const
#
y_fit_x_mlt <- lm(y ~ x1)
y_fit_x_div <- lm(y ~ x2)
print(y_fit$coefficients)
## (Intercept)         x
##   0.8337401 0.5047895
print(y_fit_x_mlt$coefficients)
## (Intercept)         x1
##  0.83374010 0.05047895
print(y_fit_x_div$coefficients)
## (Intercept)        x2
##   0.8337401 5.0478946

Python:
x1_mat = sm.add_constant(x * const)
x2_mat = sm.add_constant(x / const)
#
y_fit_x_mlt = sm.OLS(y, x1_mat).fit()
y_fit_x_div = sm.OLS(y, x2_mat).fit()
print(y_fit.params)
## [0.83853621 0.5099222 ]
print(y_fit_x_mlt.params)
## [0.83853621 0.05099222]
print(y_fit_x_div.params)
## [0.83853621 5.09922195]
Furthermore, if we scale both X and Y by the same constant:

   
Ỹ = c·Y = c·(β0 + (β1 /c)(c · X ) + ε) = (c·β0 ) + β1 (c · X ) + (c·ε)

In other words, if both Y and X are multiplied by the same constant
c, then the OLS estimate of the intercept changes to β̃0 = c · β0,
but the slope estimate β1 remains the same.
R:
x1 <- x * const
new_y <- y * const
#
y_fit_scaled <- lm(new_y ~ x1)
print(y_fit$coefficients)
## (Intercept)         x
##   0.8337401 0.5047895
print(y_fit_scaled$coefficients)
## (Intercept)        x1
##   8.3374010 0.5047895

Python:
x1_mat = sm.add_constant(x * const)
new_y = y * const
#
y_fit_scaled = sm.OLS(new_y, x1_mat).fit()
print(y_fit.params)
## [0.83853621 0.5099222 ]
print(y_fit_scaled.params)
## [8.38536209 0.5099222 ]
Finally, if we scale Y by one constant and X by a different constant:

   
Ỹ = a·Y = a·(β0 + (β1 /c)(c · X ) + ε) = (a·β0 ) + (a/c)·β1 ·(c · X ) + (a·ε)

In other words, if Y is multiplied by a constant a and X is multiplied
by a constant c, then the OLS estimate of the intercept changes
to β̃0 = a · β0 and the slope estimate changes to β̃1 = (a/c) · β1 .
R:
const_a <- 5
const_c <- 10
x1 <- x * const_c
new_y <- y * const_a
#
y_fit_scaled <- lm(new_y ~ x1)
print(const_a / const_c)
## [1] 0.5
print(y_fit$coefficients)
## (Intercept)         x
##   0.8337401 0.5047895
print(y_fit_scaled$coefficients)
## (Intercept)        x1
##   4.1687005 0.2523947

Python:
const_a = 5
const_c = 10
x1_mat = sm.add_constant(x * const_c)
new_y = y * const_a
#
y_fit_scaled = sm.OLS(new_y, x1_mat).fit()
print(str(const_a / const_c))
## 0.5
print(y_fit.params)
## [0.83853621 0.5099222 ]
print(y_fit_scaled.params)
## [4.19268105 0.2549611 ]
Usually, changing the unit of measurement is referred to as data scaling. In
all cases, after scaling the data, the standard errors of the scaled regression
coefficients change as well; however, as we will see later, this does not
affect any test statistics related to the coefficients, or the model accuracy.
In multiple regression analysis this means that we can include
variables with different measurements in the same regression and it
will not affect the accuracy (in terms of standard errors) of our model
- e.g. Y is measured in thousands, X1 is measured in thousands and
X2 is measured in single units (or millions, etc.).
Interpretation of the Parameters
In our univariate regression:

Yi = β0 + β1 Xi + εi

β1 shows the amount by which the expected value of Y (remember that


E(Yi |Xi ) = β0 + β1 Xi ) changes (either increases, or decreases), when X
increases by 1 unit. This can be verified by specifying two cases - with
X = x and X = x + 1:

E(Y |X = x ) = β0 + β1 x
E(Y |X = x + 1) = β0 + β1 (x + 1)

then, taking the difference yields:

E(Y |X = x + 1) − E(Y |X = x ) = β0 + β1 (x + 1) − β0 − β1 x = β1

For example, if X is in thousands of dollars, then β1 shows the amount


that the expected value of Y changes, when X increases by one thousand.
As mentioned previously, interpreting the intercept β0 is tricky.
The defining feature of a univariate linear regression is that the
change in (the expected value of) Y is equal to the change in X
multiplied by β1 . So, the marginal effect of X on Y is constant
and equal to β1 :
∆Y = β1 ∆X
or alternatively:

∆Y ∆E(Y |X ) dE(Y |X )
β1 = = = =: slope
∆X ∆X dX
where d denotes the derivative of the expected value of Y with
respect to X . As we can see, in the linear regression case, the
derivative is simply the slope of the regression line, β1 .
Note that in this case we say that the marginal effect of X on Y
is constant, because a one-unit change in X results in the same
change in Y , regardless of the initial value of X .
Below we provide a graphical interpretation with β0 = 10, β1 = 5,
ε ∼ N(0, 5²) and X from an evenly spaced set between 0 and 10, with
N = 100:
[Figure: Simulated data Y = β0 + β1X + ε with the conditional mean line E(Y|X) = β0 + β1X - the intercept β0 is the value of the line at X = 0, and the slope β1 is the change in E(Y|X) for a unit increase in X.]
In econometrics, an even more popular characteristic is the rate of change.
The proportionate (or relative) change of Y , moving from Y to Ỹ, is
defined as the change in Y divided by its initial value:

(Ỹ − Y )/Y = ∆Y/Y ,  Y ≠ 0

Usually, we measure changes in terms of percentages - a percentage
change in Y , from Y to Ỹ, is defined as the proportionate change
multiplied by 100:

%∆Y = 100 · ∆Y/Y ≈ 100 · ∆ log(Y )

Note: the approximate equality to the logarithm is useful for


interpreting the coefficients when modelling log(Y ), instead of Y .
The elasticity of a variable Y with respect to X is defined as the
percentage change in Y corresponding to a 1% increase in X :

η = η(Y |X ) = %∆Y / %∆X = (100 · ∆Y/Y ) / (100 · ∆X/X ) = (∆Y/∆X ) · (X/Y )

So, the elasticity of the expected value of Y with respect to X is:

η = (dE(Y |X )/dX ) · X/E(Y |X )

In the univariate regression case we have that:

η = β1 · X/E(Y |X )
In practice, in a linear model the elasticity is different at each
point (Xi , Yi ), i = 1, ..., N. Most commonly, the elasticity is estimated by
substituting the sample means of X and Y , i.e.:

η̂ = β̂1 · X̄/Ȳ

with the interpretation being that a 1% increase in X will yield, on
average, an η̂ percent increase/decrease in Y .
To reiterate - η shows a percentage and not a unit change in Y
corresponding to a 1% change in X .
I If elasticity is less than one, we can classify that Y is inelastic with
respect to X .
I If elasticity is greater than one, then we would say that Y is elastic
with respect to X .
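A small sketch (simulated data with hypothetical parameter values) estimating the elasticity at the sample means, η̂ = β̂1 · X̄/Ȳ:

Python:
import numpy as np
import statsmodels.api as sm

# Simulated linear model (hypothetical parameter values)
np.random.seed(42)
x = np.random.uniform(low = 1, high = 50, size = 200)
y = 2.0 + 0.5 * x + np.random.normal(loc = 0, scale = 2, size = 200)

fit = sm.OLS(y, sm.add_constant(x)).fit()
beta_1_hat = fit.params[1]

# Elasticity evaluated at the sample means of X and Y:
# the % change in Y corresponding to a 1% increase in X
eta_hat = beta_1_hat * x.mean() / y.mean()
print(eta_hat)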
Nonlinearities in a Linear Regression
Economic variables are not always related by a straight-line
relationship. In a simple linear regression the marginal effect of X on Y is
constant, though this is not realistic in many economic relationships.

I For example, estimating how the price of a house (Y ) relates to the


size of the house (X ): in a linear specification, the expected price for
an additional square foot is constant. However, it is possible that
more expensive homes have a higher value for each additional square
foot, compared to smaller, less expensive homes.

Fortunately, the linear regression model Y = Xβ + ε is quite flexible - the


variables Y and X can be transformed via:
I logarithms,
I squares,
I cubes,
I creating so called indicator variables.
All of these transformations can be used to account for a nonlinear
relationship between the variables Y and X (but still expressed as a linear
regression in terms of parameters β).
If we have a linear regression with transformed variables:

f (Yi ) = β0 + β1 · g(Xi ) + εi , i = 1, ..., N

then we can rewrite it in a matrix notation:

Y = Xβ + ε
where Y = [f (Y1 ), ..., f (YN )]^⊤, ε = [ε1 , ..., εN ]^⊤, β = [β0 , β1 ]^⊤
and X is the N × 2 matrix whose i-th row is (1, g(Xi )), where f (Y ) and
g(X ) are some kind of transformations of the initial values of Y and X .
This allows us to estimate the unknown parameters via OLS:

β̂ = (X^⊤X)^(−1) X^⊤Y
Quadratic Regression Model
The quadratic regression model:

Y = β0 + β1 X² + ε

is a parabola, where β0 is the intercept and β1 is the shape parameter of
the curve: if β1 > 0, then the curve is U-shaped; if β1 < 0, then the
curve is inverted-U-shaped.
Because of the nonlinearity in X , the effect on Y of a unit change in X
now depends on the initial value of X . If, as before, we take X = x and
X = x + 1:

E(Y |X = x ) = β0 + β1 x²
E(Y |X = x + 1) = β0 + β1 (x + 1)² = β0 + β1 x² + β1 · (2x + 1)

so, the difference:

E(Y |X = x + 1) − E(Y |X = x ) = β1 · (2x + 1)

now depends on the initial value of x - the larger the initial value, the
more pronounced the change in Y will be.
The slope of the quadratic regression is:

slope = dE(Y |X )/dX = 2β1 X
which changes as X changes. For large values of X , the slope will be
larger, for β1 > 0 (or smaller, if β1 < 0), and would have a more
pronounced change in Y for a unit increase in X , compared to smaller
values of X . Note that, unlike the simple linear regression, in this case β1
is no longer the slope.
The elasticity (i.e. the percentage change in Y , given a 1% change in X )
is:

η = 2β1 X · X/Y = 2β1 X²/Y

A common approach is to choose a point on the fitted relationship,
i.e. select a value of X and the corresponding fitted value Ŷ to estimate
η̂(Ŷ|X ) = 2β̂1 X²/Ŷ. Note that we could use (Ȳ, X̄) since it is on the
regression curve.
However, we may be interested to see how the elasticity changes at
different value of X - when X is small; when X is large etc.
Below we will assume that our data generating process satisfies (UR.1) -
(UR.4) assumptions with the following parameters:
I β0 = 1, β1 ∈ {−0.02, 0.02};
I X is a random sample with replacement from a set: Xi ∈ {1, ..., 50},
i = 1, ..., 100;
I ε ∼ N(0, 5²).

[Figure: Simulated quadratic relationships Y = β0 + β1X² + ε - an inverted-U-shaped curve for β1 < 0 and a U-shaped curve for β1 > 0.]
R:
set.seed(123)
# Set the coefficients:
N = 100
beta_0 = 0.5
beta_1 = c(-0.02, 0.02)
# Generate sample data:
x <- sample(1:50, size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 3, n = length(x))
y1 <- beta_0 + beta_1[1] * x^2 + e
y2 <- beta_0 + beta_1[2] * x^2 + e
print(coef(lm(y1 ~ I(x^2))))
## (Intercept)      I(x^2)
##  0.03470760 -0.01966276
print(coef(lm(y2 ~ I(x^2))))
## (Intercept)     I(x^2)
##  0.03470760 0.02033724

Python:
np.random.seed(123)
# Set the coefficients:
N = 100
beta_0 = 0.5
beta_1 = [-0.02, 0.02]
# Generate sample data:
x = np.random.choice(list(range(1, 51)), size = N, replace = True)
e = np.random.normal(loc = 0, scale = 3, size = len(x))
y1 = beta_0 + beta_1[0] * x**2 + e
y2 = beta_0 + beta_1[1] * x**2 + e
print(sm.OLS(y1, sm.add_constant(x**2)).fit().params)
## [ 0.27578417 -0.01973092]
print(sm.OLS(y2, sm.add_constant(x**2)).fit().params)
## [0.27578417 0.02026908]
Log-Linear Regression Model
A couple of useful properties of the logarithm function, which are
frequently applied to simplify some non-linear model specifications and
various approximations in econometric analysis:
If X > 0 is small (e.g. X = 0.01, 0.02, ..., 0.1), then:

log(1 + X ) ≈ X

The quality of the approximation deteriorates as X gets larger (e.g. X = 0.5).


For small changes in X it can be shown that:

∆ log(X ) = log(X1 ) − log(X0 ) ≈ (X1 − X0 )/X0 = ∆X/X0 , where ∆X = X1 − X0
A percentage change in X , from X0 to X1 is defined as the log
difference multiplied by 100:

%∆X ≈ 100 · ∆ log(X )
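A quick numerical check of these approximations (the values below are chosen only for illustration):

Python:
import numpy as np

# log(1 + X) is close to X for small X and deteriorates as X grows
for X in [0.01, 0.05, 0.1, 0.5]:
    print(X, np.log(1 + X))

# Percentage change vs. 100 times the log-difference for a small change in X
X0, X1 = 100.0, 103.0
print(100 * (X1 - X0) / X0, 100 * (np.log(X1) - np.log(X0)))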


Log-Linear Regression Model
Often times, the dependent and/or independent variable may appear in
logarithmic form. The log-linear model has a logarithmic term on the
left-hand side of the equation and an untransformed variable on the
right-hand side:
log(Y ) = β0 + β1 X + ε
In order to use this model we must have that Y > 0.
Furthermore, we may also sometimes want to take the logarithm in order
to linearize Y .
If Y is defined via the following exponential form:

Y = exp(β0 + β1 X + ε) (1)

Then we can take the logarithm of Y to get the log-linear expression

log(Y ) = β0 + β1 X + ε

From eq. (1) we can see that we can use the log-transformation to
regularize the data (i.e. to make highly skewed distributions less skewed).
For example, the histogram of (1) with β0 = 0.8, β1 = 4 and ε ∼ N(0, 1)
looks to be skewed to the right, because the tail to the right is longer.
Whereas the histogram of log(Y ) (i.e. the dependent variable of the
log-linear model) appears to be symmetric around the mean and similar to
the normal distribution:
R:
set.seed(123)
#
N <- 1000
beta_0 <- 0.8
beta_1 <- 4
x <- seq(from = 0, to = 1, length.out = N)
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- exp(beta_0 + beta_1 * x + e)

Python:
np.random.seed(123)
#
N = 1000
beta_0 = 0.8
beta_1 = 4
x = np.linspace(start = 0, stop = 1, num = N)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = np.exp(beta_0 + beta_1 * x + e)

The appropriate histograms of Y and log(Y ):


[Figure: Histograms of Y and log(Y) - the distribution of Y is heavily right-skewed, while the distribution of log(Y) is approximately symmetric and bell-shaped.]
The extremely skewed distribution of Y became less skewed and more
bell-shaped after taking the logarithm.
Many economic variables - price, income, wage, etc. - have skewed
distributions, and taking logarithms to regularize the data is a
common practice.
Furthermore, setting Y = [log(Y1 ), ..., log(YN )]^⊤ allows us to apply the
OLS via the same formulas as we did before:

R:
lm_fit <- lm(log(y) ~ x)
print(coef(lm_fit))
## (Intercept)         x
##   0.8156225 4.0010107

Python:
lm_model = sm.OLS(np.log(y), sm.add_constant(x))
lm_fit = lm_model.fit()
print(lm_fit.params)
## [0.81327991 3.89431192]

Furthermore, we can calculate the log-transformed fitted values log(Ŷ), as
well as transform them back to Ŷ:

[Figure: The data and fitted values on the original scale, Y = exp(β0 + β1X + ε), and on the log scale, log(Y) = β0 + β1X + ε.]
Log-Linear Regression Model
To better understand the effect that a change in X has on log(Y ), we
calculate the expected value of log(Y ) when X changes from x to x + 1:

E(log(Y )|X = x ) = β0 + β1 x
E(log(Y )|X = x + 1) = β0 + β1 (x + 1)

Then the difference:

E(log(Y )|X = x + 1) − E(log(Y )|X = x ) = β1

is similar to the simple linear regression case, but the interpretation is
different, since we are talking about log(Y ) instead of Y . In other words:

∆ log(Y ) = β1 ∆X
100 · ∆ log(Y ) = (100 · β1 )∆X

Because X and Y are related via the log-linear regression, it follows that:

%∆Y ≈ (100 · β1 )∆X

In other words, for the log-linear model, a unit increase in X


yields (approximately) a (100 · β1 ) percentage change in Y .
We can rewrite the previous equality as:

100 · β1 = %∆Y / ∆X := semi-elasticity

This quantity, known as the semi-elasticity, is the percentage change in Y
when X increases by one unit. As we have just shown, in the log-linear
regression, the semi-elasticity is constant and equal to 100 · β1 .
Furthermore, we can derive the same measurements as before:

slope := dE(Y |X )/dX = β1 Y

Unlike the simple linear regression model, in the log-linear regression
model the marginal effect increases for larger values of Y (note
that this is Y , not log(Y )).
The elasticity (the percentage change in Y , given a 1% increase in X ):

η = slope · X/Y = β1 X
If we wanted to change the units of measurement of Y in a log-linear
model, then β0 would change, but β1 would remain unchanged, since:
log(cY ) = log(c) + log(Y ) = [log(c) + β0 ] + β1 X + ε
Linear-Log Regression Model
Alternatively, we may describe the linear relationship, where X is
log-transformed and Y is untransformed:

Y = β0 + β1 log(X ) + ε

where X > 0. In this case, if X increases from X0 to X1 , it follows that:

Y0 = E(Y |X = X0 ) = β0 + β1 log(X0 )
Y1 = E(Y |X = X1 ) = β0 + β1 log(X1 )

Setting ∆Y = Y1 − Y0 and ∆ log(X ) = log(X1 ) − log(X0 ) yields:

∆Y = β1 ∆ log(X ) = (β1 /100)·[100 · ∆ log(X )] ≈ (β1 /100)·(%∆X )

In other words, in a linear-log model, a 1% increase in X yields
(approximately) a β1 /100 unit change in Y .
Furthermore, we have that:

slope := dE(Y |X )/dX = β1 · (1/X )

and:

η = slope · X/Y = β1 · (1/Y )
If we wanted to change the units of measurement of X in a linear-log
model, then β0 would change, but β1 would remain unchanged, since:
Y = β0 + β1 log((c/c)·X ) = [β0 − β1 log(c)] + β1 log(cX )

Sometimes linear-log is not much different (in terms of model accuracy)


from a linear-linear (i.e. simple) regression.
However, because of the functional form of the linear-log model, if β1 < 0,
then the function decreases at a decreasing rate as X increases.
This means that it may sometimes be useful if we do not want Y to have
a negative value, for a reasonably large value of X .
For example, let Y be expenditure on leisure activities, and X - age. Let
us say that we only have expenditure data on ages from 18 to 50 and we
would like to predict the expenditure for ages 51 to 80.
In this case it is reasonable to assume, that expenditure cannot be
negative for reasonable values of X - an age of up to 80 may be realistic
assumption, while 100 years or more - less so.
Further assume that the underlying (true) model is indeed linear-log:

Y = β0 + β1 log(X ) + ε

with β0 = 1200, β1 = −250 and ε ∼ N(0, 50²).


[Figure: Simulated expenditure data with two fitted curves, E(Y|X) = β0 + β1X (linear) and E(Y|X) = β0 + β1log(X) (linear-log), extrapolated over ages 20 to 80.]

We see that the fitted values Ŷ do not differ much in the linear-linear and
linear-log cases; however, the predicted values (since in this example Y is
the expenditure) are more believable in the linear-log case, since they are
non-negative.
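A minimal sketch of this comparison (the data here is simulated from an assumed linear-log DGP with the parameter values above; the figure itself comes from the lecture notes' own code):

Python:
import numpy as np
import statsmodels.api as sm

# Simulate expenditure (Y) for ages (X) between 18 and 50 from an assumed linear-log DGP
np.random.seed(42)
age = np.random.uniform(low = 18, high = 50, size = 300)
y = 1200 - 250 * np.log(age) + np.random.normal(loc = 0, scale = 50, size = 300)

# Fit a linear-linear and a linear-log model
fit_lin = sm.OLS(y, sm.add_constant(age)).fit()
fit_log = sm.OLS(y, sm.add_constant(np.log(age))).fit()

# Predict expenditure for older ages (outside the observed range)
new_age = np.array([60.0, 70.0, 80.0])
print(fit_lin.predict(sm.add_constant(new_age, has_constant = 'add')))
print(fit_log.predict(sm.add_constant(np.log(new_age), has_constant = 'add')))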
Log-Log Regression Model
Elasticities are often important in applied economics. As such, it is
sometimes convenient to have constant elasticity models. If we take
logarithms of both Y and X , then we arrive at the log-log model:

log(Y ) = β0 + β1 log(X ) + ε

where X , Y > 0. Then:

slope := dE(Y |X )/dX = β1 · (Y /X )

and:

η = slope · X/Y = β1
i.e. the elasticity of the log-log model is constant.
In other words, in a log-log model, a 1% increase in X yields
(approximately) a β1 percentage change in Y .
Choosing a Functional Form
Generally, we will never know the true relationship between Y and X .
The functional form that we select is only an approximation.
As such, we should select a functional form, which satisfies our objectives,
economic theory, the underlying model assumptions, like (UR.1) -
(UR.3), and one which provides an adequate fit on the data sample.
Consequently, we may incorrectly specify a functional form for our
regression model, or, some of our regression assumptions may not hold.
It is a good idea to focus on:
1. Examining the regression results:
I checking whether the signs of the coefficients follow economic logic;
I checking the significance of the coefficients;
2. Examining the residuals Y − Ŷ - if our specified functional form is
inadequate, then the residuals would not necessarily follow (UR.3),
(UR.4).
Nevertheless, before examining the signs and model statistical properties,
we should have a first look at the data and examine the dependent
variable Y and the independent variable X plots to make some initial
guesses at the functional form of the regression model.
Histogram of The Response Variable
[Figure: Histograms of two response variables - y1 appears symmetric and bell-shaped, while y2 is right-skewed.]

The histogram can be used to answer the following questions:


I What kind of population distribution is the data sample most likely
from?
I What is the sample mean of the data?
I Is the spread of the data large?
I Are the data symmetric, or skewed?
I Are there outliers in the data?
Run-Sequence Plot
An easy way to graphically summarize a univariate data set. A common
assumption of univariate data sets is that they:
I are random realizations;
I are from the same population (i.e. all random drawings have the
same distribution).
The run-sequence (or simply, run charts) plots the variable of interest
on the vertical axis, and the variable index on the horizontal axis.
They are primarily used to inspect if there are any outliers, mean or
variance changes or if there is a dependence across observations.
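A minimal sketch of a run-sequence plot in Python (a simulated series with a change in mean, chosen only for illustration):

Python:
import numpy as np
import matplotlib.pyplot as plt

# Simulate a series whose mean shifts halfway through the sample
np.random.seed(42)
y = np.concatenate([np.random.normal(loc = 20, scale = 1, size = 100),
                    np.random.normal(loc = 25, scale = 1, size = 100)])

# Run-sequence plot: the variable of interest against its observation index
plt.plot(np.arange(len(y)), y, marker = 'o', linestyle = 'none', markersize = 3)
plt.xlabel("Index")
plt.ylabel("y")
plt.title("Run-sequence plot")
plt.show()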
[Figure: Run-sequence plots of three simulated series - one drawn from a normal distribution with constant mean and variance, one with a change in variance, and one with a change in mean.]

We see that the underlying differences in the distribution of the data are
present in the run-sequence plot of the data sample.
The run-sequence plot can help answer questions, like:
I Are there any (significant) changes in the mean?
I Are there any (significant) changes in the variance?
I Are there any outliers in the data?
The run-sequence plot can be useful when examining the residuals
of a model.
In our exponential model example:

Y = exp(β0 + β1 X + ε)

what would the residuals look like if we fitted a linear-linear and a
log-linear model?
[Figure: Run-sequence plots of the residuals from the fitted linear-linear model and from the fitted log-linear model.]

We see that in the case of a linear-linear model, the residual variance is
not the same across observations, while the log-linear model has residuals
that appear to have the same mean and variance across observations.
We note that the values of X are ordered from smallest to largest - this
means that a larger index value corresponds to a value of Y that was
observed for a larger value of X .
If we were to randomize the order of Xi (and, as a result, randomize Yi
and Ŷi ), the run-sequence plot of the residuals would instead look as
follows:
[Figure: Run-sequence plots of the residuals from the linear-linear and log-linear models after randomizing the observation order.]

Note the fact that in cross-sectional data one of our assumptions is that the
observations (Xi , Yi ) are treated as independent from (Xj , Yj ) for i ≠ j. As such, we can
order the data in any way we want.
I In the first case (with ordered values of X from (Xi , Yi )), ordering the residuals by
the value of X allows us to examine whether our model performs the same for all
values of X . Clearly, for larger values of X this was not the case (Note: a scatter
plot would be more useful in such a case).
I In the second case (where the ordering of X from (Xi , Yi ) is random), where we
shuffled the order of our data (Xi , Yi ), the residual run-sequence plot allows us to
identify, whether there are residuals, which are more pronounced, but it does not
identify when this happens, unlike the first case, with ordering by values of X .
Run sequence plots are more common in time series data, but can still be
utilized for cross-sectional data.
Scatter Plot
As we have seen, a scatter plot reveals relationships between two variables
of interest in the form of a non-random structure. The vertical axis is usually
for the dependent (i.e. response) variable Y , and the horizontal axis is for
the independent variable X .
Let:
Y = β0 + β1 · X1 + ε
where X1 ∼ N(µ1 , σ1²), ε ∼ N(0, 1²), and let X2 ∼ N(µ2 , σ2²), where X2 is
independent of X1 and ε.
[Figure: Scatter plots of (X1, Y) and (X2, Y).]

We see a clear linear relationship between X1 and Y . The scatter plot of
X2 and Y appears to be random, i.e. Y does not depend on the value of X2 .
The scatter plot can help us answer the following questions:
I Are the variables X and Y related?
I What kind of relationship (linear, exponential, etc.) could describe Y
and X ?
I Are there any outliers?
I Does the variation in Y change depending on the value of X ?
In addition to the run-sequence plot, the scatter plot of the residuals
ε̂ against the independent variable X can be useful to determine
whether our model performs the same across all values of X , or if
there are specific values of X for which our model is worse (which
would result in larger residual values). If the values of X are
ordered in ascending order, then the run-sequence plot will
be very similar to the scatter plot of ε̂ and X , with the only
difference being the spacing between the points.
In our exponential model example:

Y = exp(β0 + β1 X + ε)

what would the residuals look like if we fitted a linear-linear and a
log-linear model?
[Figure: Residual run-sequence plots and scatter plots of the residuals against X, for the linear-linear model (top row) and the log-linear model (bottom row).]

Notice that when X is ordered the only difference between the


run-sequence plot and the scatter plot is in the spacing between the points.
Scatter plots are similar to run-sequence plots for univariate cross-
sectional data. However, unlike run-sequence plots, scatter plots
allow us to examine the relationship between X and Y (instead of only
examining Y ).
For multivariable cross-sectional data, run-sequence plots may be
faster to utilize to focus on the properties of the Y variable itself.
Quantile-Quantile Plot
The quantile-quantile (q-q) plot can be used to determine if two data
sets come from a common distribution. A 45-degree line is also plotted for
easier comparison - if the two sets come from the same distribution, the
points should fall approximately along this reference line. The greater the
departure from this line, the greater the evidence for the conclusion that
the two data sets have come from populations with different distributions.
More specifically, the quantiles of the first dataset are plotted against the
quantiles of the second dataset.

The quantile-quantile (q-q, or Q-Q) plot is a scatterplot created


by plotting two sets of quantiles against one another.
The q-q plot is similar to a probability plot, which can be used to
examine whether the model residuals are normally distributed. For a
probability plot, the q-q plot is used with the quantiles of one of the data
samples replaced by the quantiles of a theoretical distribution.
In a normal probability plot the data are plotted against a theo-
retical (standard) normal distribution in such a way that the points
should form an approximate straight line.
Departures from this straight line indicate departures from normality.
The normal distribution is a base distribution, and its quantiles are
on the horizontal axis as the Theoretical Quantiles, while the sample
data quantiles are on the vertical axis.
[Figure: Histograms and normal Q-Q plots of Y1 ∼ N(0, 1) and Y2 = exp(Y1).]

We see that the q-q plot of Y1 shows that Y1 is normally distributed, since
all the quantiles are on the diagonal line. On the other hand, the q-q plot
of Y2 shows that the data is skewed.
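A short sketch producing normal q-q plots in Python (simulated data chosen only for illustration); statsmodels' qqplot compares the sample quantiles against theoretical normal quantiles and can add the 45-degree reference line:

Python:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(42)
y1 = np.random.normal(size = 1000)  # approximately normal sample
y2 = np.exp(y1)                     # heavily right-skewed sample

# Normal q-q plots; points along the 45-degree line indicate normality
sm.qqplot(y1, line = '45')
sm.qqplot(y2, line = '45')
plt.show()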
The probability plot is used to answer the following questions:
I Does a specified theoretical distribution provide a good fit to my
data?
I What distribution best fits my data?
I What are good estimates for the location and scale parameters of the
chosen distribution?
In addition, the normal probability plot answers the following questions:
I Is the data normally distributed?
I What is the nature of the departure from normality (data skewed with
shorter/longer tails)?
As mentioned, probability plots can be used to inspect whether the
residuals are normally distributed. We will see an example of this in the
residual diagnostics section later on.
Notes on the terminology of quantiles, percentiles and
quartiles:

I Quantiles can go from 0 to any value. Quantiles are cut points


dividing the range of a probability distribution into continuous
intervals with equal probabilities, or dividing the observations in a
sample in the same way. The p-quantile is defined as the value,
which includes p · N observations, with 0 ≤ p ≤ 1 and N being the
number of observations.
I Percentiles go from 0 to 100. It is a measure used for indicating the
value below which a given percentage of observations in a group of
observations fall. For example, the 20th percentile is the value below
which 20% of the observations may be found.
I Quartiles go from 0 to 4. They are values that divide a list of sorted
values into quarters.
In general, percentiles and quartiles are specific types of quantiles. The
relationship is as follows:
I 0 quartile = 0 quantile = 0 percentile
I 1 quartile = 0.25 quantile = 25 percentile
I 2 quartile = 0.5 quantile = 50 percentile (median)
I 3 quartile = 0.75 quantile = 75 percentile
I 4 quartile = 1 quantile = 100 percentile
I We have reviewed the OLS properties;
I We have examined various variables transformations;
I We have examined ways to specify non-linearities while retaining the
linear-regression model form.
I We have examined coefficient interpretations, depending on the
model specification.
I We have presented various ways to plot the data in order to determine
the relations between different variables, or to examine the residuals.
Examples using empirical data
From the Lecture notes Ch. 3.10 continue with the dataset(-s) that you
have used from the previous exercise set and do the tasks from Exercise
Set 2 from Ch 3.10. See Ch. 3.11 for an example.
