(UR.3) The error term has the same variance given any value of
the explanatory variable (i.e. homoskedasticity) and the error terms
are not correlated across observations (i.e. no autocorrelation).
(UR.4) The error terms are (conditionally) normally distributed:
ε|X ∼ N(0, σ²I)
where ε = (ε1, ..., εN)⊤, X = (X1, ..., XN)⊤ and Y = (Y1, ..., YN)⊤.
OLS: Estimation
Problem: Assume that the data follows (UR.1) - (UR.4). However, we
do not know the true parameters β0 and β1 of our underlying
regression Y = β0 + β1 X + ε.
Question: How do we estimate β0 and β1?
Answer: Specify the equations in matrix notation:
Y1 = β0 + β1 X1 + ε1
Y2 = β0 + β1 X2 + ε2
...
YN = β0 + β1 XN + εN
⇐⇒ Y = Xβ + ε,
where Y = (Y1, ..., YN)⊤, ε = (ε1, ..., εN)⊤, β = (β0, β1)⊤ and X is the N × 2 design matrix whose i-th row is (1, Xi).
Minimizing the residual sum of squares RSS(β̂) = (Y − Xβ̂)⊤(Y − Xβ̂) gives the first order condition:
∂RSS(β̂)/∂β̂ = −2X⊤Y + 2X⊤Xβ̂ = 0 =⇒ β̂ = (X⊤X)⁻¹X⊤Y
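For illustration, the normal-equation formula can be applied directly in R on a simulated sample (a minimal sketch; the data-generating values below are made up for this example):

set.seed(123)
# Simulate a sample that satisfies (UR.1) - (UR.4) (made-up parameter values):
N <- 100
x <- runif(N, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(N, mean = 0, sd = 1)
# Design matrix with a column of ones for the intercept:
X <- cbind(1, x)
# beta_hat = (X'X)^(-1) X'Y:
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta_hat)
# The built-in estimator gives the same values:
print(coef(lm(y ~ x)))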
Gauss-Markov Theorem
The main advantages of the OLS estimators are summarized by the
following theorem:
Gauss-Markov theorem
Under the assumption that the conditions (UR.1) - (UR.3) hold true,
the OLS estimators β̂0 and β̂1 are BLUE (Best Linear Unbiased
Estimator) and Consistent.
Univariate Regression: Modelling Framework
Assume that we are interested in explaining Y in terms of X under (UR.1)
- (UR.4):
Yi = β0 + β1 Xi + εi , ∀i = 1, ..., N
where:
- X is called the independent variable, the control variable, the
explanatory variable, the predictor variable, or the regressor;
- Y is called the dependent variable, the response variable, the
explained variable, the predicted variable or the regressand;
- ε is called the random component, the error term, the disturbance
or the (economic) shock.
The expected value:
E(β0 + β1 X + ε|X) = β0 + β1 X
where:
- β0 - the intercept parameter, sometimes called the constant term;
- β1 - the slope parameter.
After obtaining the estimates β̂0 and β̂1, we may want to examine the
following values:
- The fitted values of Y, which are defined as the following OLS
regression line (or more generally, the estimated regression line):
Ŷi = β̂0 + β̂1 Xi
where β̂0 and β̂1 are estimated via OLS. By definition, each fitted
value Ŷi is on the estimated OLS regression line.
- The residuals, which are defined as the difference between the
actual and fitted values of Y:
ε̂i = Yi − Ŷi = Yi − β̂0 − β̂1 Xi
[Figure: the estimated OLS regression line Ŷ = β̂0 + β̂1X, with the fitted value Ŷ12 and the residuals ε̂8 and ε̂12 marked for observations 8 and 12.]
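As a minimal sketch (again on made-up simulated data), the fitted values Ŷi and the residuals ε̂i can be extracted from a model fitted with lm():

set.seed(123)
# Made-up simulated sample:
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100)
mdl <- lm(y ~ x)
y_fit <- fitted(mdl)   # fitted values: Y_hat_i = beta0_hat + beta1_hat * x_i
e_hat <- resid(mdl)    # residuals:     e_hat_i = y_i - Y_hat_i
# By construction, fitted value + residual = observed value:
all.equal(unname(y_fit + e_hat), y)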
Now is also a good time to highlight the difference between the errors and
the residuals.
Yi = β0 + β1 Xi + εi (the unobservable errors)
Yi = β̂0 + β̂1 Xi + ε̂i (the observable residuals)
In matrix form, Y = Xβ + ε, and substituting this into the OLS formula gives:
β̂ = (X⊤X)⁻¹X⊤Y = β + (X⊤X)⁻¹X⊤ε
If we take the expectation of both sides, use the law of total expectation
and the fact that E(ε|X) = 0 from (UR.2), we obtain E(β̂) = β, i.e. the OLS
estimators are unbiased.
Furthermore:
- Unbiasedness does not guarantee that the estimate we get with any particular
sample is equal (or even very close) to β.
- It means that if we could repeatedly draw random samples from the population
and compute the estimate each time, then the average of these estimates would
be (very close to) β.
- However, in most applications we have just one random sample to work with. As
we will see later on, there are methods for creating additional samples from the
available data by creating and analysing different subsamples.
OLS estimators are Best (Efficient)
- When there is more than one unbiased method of estimation to
choose from, that estimator which has the lowest variance is the best.
- We want to show that OLS estimators are best in the sense that β̂
are efficient estimators of β (i.e. they have the smallest variance).
From the proof of unbiasedness of the OLS we have that:
β̂ = β + (X⊤X)⁻¹X⊤ε =⇒ β̂ − β = (X⊤X)⁻¹X⊤ε
Now consider any other linear estimator of β, which can be written as
β̃ = [(X⊤X)⁻¹X⊤ + D] Y for some non-zero matrix D. As long as we can express it in this form, we can show that, since E[ε] = 0:
E[β̃] = E{[(X⊤X)⁻¹X⊤ + D](Xβ + ε)} = ... = (I + DX)β = β ⇐⇒ DX = 0
so β̃ is unbiased if and only if DX = 0.
It can also be shown that, under (UR.1) - (UR.4), the OLS estimator is (conditionally) normally distributed:
β̂|X ∼ N(β, σ²(X⊤X)⁻¹)
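As an illustrative sketch (made-up parameter values, not from the lecture), a small Monte Carlo experiment in R can be used to visualize both the unbiasedness and the approximately normal shape of the sampling distribution of β̂1:

set.seed(42)
# Made-up true parameters:
beta_0 <- 1; beta_1 <- 2; sigma <- 2
N <- 50; R <- 5000
x <- runif(N, min = 0, max = 10)   # keep X fixed across replications
b1 <- numeric(R)
for (r in 1:R) {
  y <- beta_0 + beta_1 * x + rnorm(N, sd = sigma)
  b1[r] <- coef(lm(y ~ x))[2]      # store the slope estimate
}
mean(b1)               # close to beta_1 = 2 (unbiasedness)
hist(b1, breaks = 50)  # roughly bell-shaped, centred at beta_1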
Regression Models and Interpretation
Inclusion of the constant term in the regression
In some cases we want to impose a restriction that if X = 0, then Y = 0
as well.
- An example could be the relationship between income (X) and income
tax revenue (Y) - if there is no income, X = 0, then the expected
revenue from the taxes would also be zero: E(Y|X = 0) = 0.
Formally, we now estimate only the slope parameter, β1, from the following
regression model:
Yi = β1 Xi + εi , i = 1, ..., N
which is called a regression through the origin, because the conditional
expected value:
E (Yi |Xi ) = β1 Xi
of the regression passes through the origin point X = 0, Y = 0. We can
obtain the estimate of the slope parameter via OLS by minimizing the sum
of squared residuals:
RSS = ∑_{i=1}^N ε̂i² = ∑_{i=1}^N (Yi − β̂1 Xi)² → min  =⇒  β̂1 = (∑_{i=1}^N Xi Yi) / (∑_{i=1}^N Xi²)
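A short sketch of the regression through the origin in R (simulated data with an assumed slope of 1.5); the closed-form estimator and the lm() fit without a constant give the same result:

set.seed(1)
# Made-up data with a true slope of 1.5 and no intercept:
x <- runif(100, min = 0, max = 10)
y <- 1.5 * x + rnorm(100)
# Closed-form estimator for the regression through the origin:
sum(x * y) / sum(x^2)
# Equivalent lm() specification without a constant term:
coef(lm(y ~ 0 + x))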
So, it is possible to specify a regression without a constant term, but
should we opt for it?
- A constant β0 can be described as the mean value of Y when all
predictor variables are set to zero. However, if the predictors can't be
zero, then it is impossible to interpret the constant.
- The intercept parameter β0 may be regarded as a sort of garbage
collector for the regression model. The reason for this is the
underlying assumption that the expected value of the residuals is zero,
which means that any bias that is not accounted for by the model is
collected in the intercept β0.
In general, inclusion of a constant in the regression model ensures that the
model's residuals have a mean of zero; otherwise the estimated coefficients
may be biased.
Consequently, and as is often the case, without knowing the true
underlying model it is generally not worth interpreting the regression
constant.
Linear Regression Models
The regression model that we examined up to now is called a simple
linear regression model. Looking at the univariate regression:
Y = Xβ + ε
consider what happens to the coefficients if we change the units of measurement of Y (by a factor c or a) and/or of X (by a factor c):
Ỹ = c·Y = c·(β0 + β1 X + ε) = (c·β0) + (c·β1) X + (c·ε)
Ỹ = c·Y = c·(β0 + (β1/c)·(c·X) + ε) = (c·β0) + β1·(c·X) + (c·ε)
Ỹ = a·Y = a·(β0 + (β1/c)·(c·X) + ε) = (a·β0) + (a/c)·β1·(c·X) + (a·ε)
For the simple linear regression:
Yi = β0 + β1 Xi + εi
a one-unit increase in X changes the conditional mean of Y by β1:
E(Y|X = x) = β0 + β1 x
E(Y|X = x + 1) = β0 + β1 (x + 1)
E(Y|X = x + 1) − E(Y|X = x) = β0 + β1 (x + 1) − β0 − β1 x = β1
β1 = ΔY/ΔX = ΔE(Y|X)/ΔX = dE(Y|X)/dX =: slope
where d denotes the derivative of the expected value of Y with
respect to X . As we can see, in the linear regression case, the
derivative is simply the slope of the regression line, β1 .
Note that in this case we say that the marginal effect of X on Y
is constant, because a one-unit change in X results in the same
change in Y , regardless of the initial value of X .
Below we provide a graphical interpretation with β0 = 10, β1 = 5,
ε ∼ N(0, 5²) and X from an evenly spaced set between 0 and 10, with
N = 100:
[Figure: simulated sample Y = β0 + β1X + ε together with the population regression line E(Y|X) = β0 + β1X; the intercept β0 and the slope β1 (the change in E(Y|X) for a unit increase in X) are indicated.]
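The figure above could be reproduced along these lines, using the stated values β0 = 10, β1 = 5, ε ∼ N(0, 5²) and N = 100 (the seed is arbitrary):

set.seed(123)
N <- 100
beta_0 <- 10
beta_1 <- 5
x <- seq(from = 0, to = 10, length.out = N)
e <- rnorm(n = N, mean = 0, sd = 5)
y <- beta_0 + beta_1 * x + e
plot(x, y)
abline(a = beta_0, b = beta_1, col = "red")  # E(Y|X) = beta_0 + beta_1 * X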
In econometrics, an even more popular characteristic is the rate of change.
The proportionate (or relative) change of Y, moving from Y to Ỹ, is
defined as the change in Y divided by its initial value:
(Ỹ − Y)/Y = ΔY/Y,  Y ≠ 0
Usually, we measure changes in terms of percentages - a percentage
change in Y, from Y to Ỹ, is defined as the proportionate change
multiplied by 100:
%ΔY = 100 · ΔY/Y ≈ 100 · Δ log(Y)
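For small changes the approximation is quite accurate; a quick numerical check (with made-up values of Y):

y_old <- 50
y_new <- 51.5
100 * (y_new - y_old) / y_old    # exact percentage change: 3
100 * (log(y_new) - log(y_old))  # log-difference approximation: ~2.96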
The (point) elasticity of Y with respect to X is defined as:
η = dE(Y|X)/dX · X/E(Y|X)
In the simple linear regression case it can be estimated as:
η̂ = β̂1 · X/Ŷ
with the interpretation being that a 1% increase in X will yield, on
average, an η̂ percent increase/decrease in Y.
To reiterate - η shows a percentage and not a unit change in Y
corresponding to a 1% change in X.
- If elasticity is less than one, we say that Y is inelastic with
respect to X.
- If elasticity is greater than one, we say that Y is elastic
with respect to X.
Nonlinearities in a Linear Regression
Oftentimes, economic variables are not related by a straight-line
relationship. In a simple linear regression the marginal effect of X on Y is
constant, though this is not realistic in many economic relationships.
In this case we can still specify the model as:
Y = Xβ + ε
where Y = [f(Y1), ..., f(YN)]⊤, ε = [ε1, ..., εN]⊤, β = [β0, β1]⊤ and X is the design matrix whose i-th row is (1, g(Xi)), where f(Y) and g(X) are some kind of
transformations of the initial values of Y and X.
This allows us to estimate the unknown parameters via OLS:
β̂ = (X⊤X)⁻¹X⊤Y
Quadratic Regression Model
The quadratic regression model:
Y = β0 + β1 X² + ε
A unit increase in X, from x to x + 1, changes the conditional mean of Y to:
E(Y|X = x + 1) = β0 + β1 (x + 1)² = β0 + β1 x² + β1 · (2x + 1)
so the change in Y now depends on the initial value of x - the larger the initial value, the
more pronounced the change in Y will be.
The slope of the quadratic regression is:
slope = dE(Y|X)/dX = 2β1 X
which changes as X changes. For larger values of X, the slope will be
larger if β1 > 0 (or smaller, if β1 < 0), and a unit increase in X will have a more
pronounced effect on Y, compared to smaller
values of X. Note that, unlike in the simple linear regression, in this case β1
is no longer the slope.
The elasticity (i.e. the percentage change in Y, given a 1% change in X)
is:
η = 2β1 X · X/Y = 2β1 X²/Y
A common approach is to choose a point on the fitted relationship,
i.e. select a value of X and the corresponding fitted value Ŷ to estimate
η̂(Ŷ|X) = 2β̂1 X²/Ŷ. Note that we could use the point of sample means (X̄, Ȳ), since it is on the
regression curve.
However, we may be interested to see how the elasticity changes at
different values of X - when X is small, when X is large, etc.
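As a sketch of how this could be checked numerically (illustrative coefficient values, similar to the simulation further below), the fitted elasticity 2β̂1X²/Ŷ can be evaluated at several values of X:

set.seed(123)
N <- 100
x <- sample(1:50, size = N, replace = TRUE)
y <- 0.5 + 0.02 * x^2 + rnorm(N, mean = 0, sd = 3)
mdl <- lm(y ~ I(x^2))                 # fit Y = beta_0 + beta_1 * X^2
b1 <- coef(mdl)[2]
x_eval <- c(5, 25, 45)                # small, medium and large values of X
y_eval <- predict(mdl, newdata = data.frame(x = x_eval))
2 * b1 * x_eval^2 / y_eval            # fitted elasticity at each point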
Below we will assume that our data generating process satisfies (UR.1) -
(UR.4) assumptions with the following parameters:
- β0 = 1, β1 ∈ {−0.02, 0.02};
- X is a random sample with replacement from the set {1, ..., 50},
i = 1, ..., 100;
- ε ∼ N(0, 5²).
[Figure: simulated samples y1 and y2 from the quadratic model Y = β0 + β1X² + ε, for β1 < 0 (left) and β1 > 0 (right).]
R:
set.seed(123)
# Set the coefficients:
N = 100
beta_0 = 0.5
beta_1 = c(-0.02, 0.02)
# Generate sample data:
x <- sample(1:50, size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 3, n = length(x))
y1 <- beta_0 + beta_1[1] * x^2 + e
y2 <- beta_0 + beta_1[2] * x^2 + e

Python:
import numpy as np
np.random.seed(123)
# Set the coefficients:
N = 100
beta_0 = 0.5
beta_1 = [-0.02, 0.02]
# Generate sample data:
x = np.random.choice(list(range(1, 51)), size = N, replace = True)
e = np.random.normal(loc = 0, scale = 3, size = len(x))
y1 = beta_0 + beta_1[0] * x**2 + e
y2 = beta_0 + beta_1[1] * x**2 + e
Recall that log(1 + X) ≈ X for small values of X.
Y = exp(β0 + β1 X + ε)     (1)
log(Y) = β0 + β1 X + ε
From eq. (1) we can see that we can use the log-transformation to
regularize the data (i.e. to make highly skewed distributions less skewed).
For example, the histogram of (1) with β0 = 0.8, β1 = 4 and ε ∼ N(0, 1)
looks to be skewed to the right, because the tail to the right is longer.
Whereas the histogram of log(Y ) (i.e. the dependent variable of the
log-linear model) appears to be symmetric around the mean and similar to
the normal distribution:
R:
set.seed(123)
N <- 1000
beta_0 <- 0.8
beta_1 <- 4
x <- seq(from = 0, to = 1, length.out = N)
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- exp(beta_0 + beta_1 * x + e)

Python:
import numpy as np
np.random.seed(123)
N = 1000
beta_0 = 0.8
beta_1 = 4
x = np.linspace(start = 0, stop = 1, num = N)
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = np.exp(beta_0 + beta_1 * x + e)
[Figure: histograms of y and of log(y).]
The extremely skewed distribution of Y became less skewed and more
bell-shaped after taking the logarithm.
Many economic variables - price, income, wage, etc. - have skewed
distributions, and taking logarithms to regularize the data is a
common practice.
Furthermore, setting Y = [log(Y1), ..., log(YN)]⊤ allows us to apply the
OLS via the same formulas as we did before:
R:
lm_fit <- lm(log(y) ~ x)
print(coef(lm_fit))
## (Intercept)           x
##   0.8156225   4.0010107

Python:
import statsmodels.api as sm
lm_model = sm.OLS(np.log(y), sm.add_constant(x))
lm_fit = lm_model.fit()
print(lm_fit.params)
## [0.81327991 3.89431192]
[Figure: y and log(y) plotted against x.]
Log-Linear Regression Model
To better understand the effect that a change in X has on log(Y), we
calculate the expected value of log(Y) when X changes from x to x + 1:
E(log(Y)|X = x) = β0 + β1 x
E(log(Y)|X = x + 1) = β0 + β1 (x + 1)
so that:
Δ log(Y) = β1 ΔX
100 · Δ log(Y) = (100 · β1) ΔX
Because X and Y are related via the log-linear regression, it follows that
%ΔY ≈ (100 · β1) ΔX, i.e. a unit increase in X yields (approximately) a 100 · β1 percent change in Y.
Linear-Log Regression Model
Consider now the linear-log model:
Y = β0 + β1 log(X) + ε
Y0 = E(Y|X = X0) = β0 + β1 log(X0)
Y1 = E(Y|X = X1) = β0 + β1 log(X1)
ΔY = β1 Δ log(X) = (β1/100) · [100 · Δ log(X)] ≈ (β1/100) (%ΔX)
In other words, in a linear-log model, a 1% increase in X yields
(approximately) a β1/100 unit change in Y.
Furthermore, we have that:
slope := dE(Y|X)/dX = β1 · (1/X)
and:
η = slope · X/Y = β1 · (1/Y)
If we wanted to change the units of measurement of X in a linear-log
model, then β0 would change, but β1 would remain unchanged, since:
Y = β0 + β1 log((c·X)/c) + ε = [β0 − β1 log(c)] + β1 log(c·X) + ε
[Figure: expenditure data with the fitted linear model E(Y|X) = β0 + β1X and the fitted linear-log model E(Y|X) = β0 + β1 log(X).]
We see that the fitted values Ŷ do not differ much between the linear-linear and
linear-log cases; however, the predicted values (since in this example Y is
the expenditure) are more believable in the linear-log case, since they are
non-negative.
Log-Log Regression Model
Elasticities are often important in applied economics. As such, it is
sometimes convenient to have constant elasticity models. If we take
logarithms of both Y and X , then we arrive at the log-log model:
log(Y) = β0 + β1 log(X) + ε
slope := dE(Y|X)/dX = β1 · (Y/X)
and:
η = slope · X/Y = β1
i.e. the elasticity of the log-log model is constant.
In other words, in a log-log model, a 1% increase in X yields
(approximately) a β1 percent change in Y.
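A minimal sketch of estimating a constant-elasticity model in R, assuming simulated data with a true elasticity of 1.5 (values made up for illustration):

set.seed(123)
N <- 200
x <- runif(N, min = 1, max = 10)
# Made-up log-log data: true elasticity beta_1 = 1.5
y <- exp(0.5 + 1.5 * log(x) + rnorm(N, mean = 0, sd = 0.2))
mdl <- lm(log(y) ~ log(x))
coef(mdl)   # the slope estimate is the (constant) elasticity, close to 1.5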
Choosing a Functional Form
Generally, we will never know the true relationship between Y and X .
The functional form that we select is only an approximation.
As such, we should select a functional form, which satisfies our objectives,
economic theory, the underlying model assumptions, like (UR.1) -
(UR.3), and one which provides an adequate fit on the data sample.
Consequently, we may incorrectly specify a functional form for our
regression model, or, some of our regression assumptions may not hold.
It is a good idea to focus on:
1. Examining the regression results:
 - checking whether the signs of the coefficients follow economic logic;
 - checking the significance of the coefficients;
2. Examining the residuals ε̂ = Y − Ŷ - if our specified functional form is
inadequate, then the residuals would not necessarily follow (UR.3),
(UR.4).
Nevertheless, before examining the signs and model statistical properties,
we should have a first look at the data and examine the dependent
variable Y and the independent variable X plots to make some initial
guesses at the functional form of the regression model.
Histogram of The Response Variable
[Figure: histograms of y1 and of y2.]
[Figure: run-sequence plots of the samples y1, y2 and y3.]
We see that the underlying differences in the distribution of the data are
present in the run-sequence plot of the data sample.
The run-sequence plot can help answer questions like:
- Are there any (significant) changes in the mean?
- Are there any (significant) changes in the variance?
- Are there any outliers in the data?
The run-sequence plot can be useful when examining the residuals
of a model.
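In R, a run-sequence plot of the residuals is just the residuals plotted against the observation index; a minimal sketch, assuming some model object mdl fitted with lm():

# `mdl` is assumed to be some model fitted with lm():
plot(resid(mdl), xlab = "Index", ylab = "Residuals",
     main = "Run-sequence plot of the residuals")
abline(h = 0, col = "red", lty = 2)   # reference line at zero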
In our exponential model example:
Y = exp(β0 + β1 X + ε)
[Figure: run-sequence plots of lm_lin$residuals (linear-linear model, left) and lm_log$residuals (log-linear model, right), with observations ordered by the value of X.]
We see that in the case of the linear-linear model, the residual variance is
not the same across observations, while the log-linear model has residuals
that appear to have the same mean and variance across observations.
We note that the values of X are ordered from smallest to largest - this
means that a larger index value corresponds to a value of Y which was
generated from a larger value of X.
If we were to randomize the order of Xi (and as a result, randomize Yi
and Ŷi), the run-sequence plot of the residuals would instead look as
follows:
[Figure: run-sequence plots of lm_lin$residuals[i_index] ("Residuals of a linear-linear model, index randomized", left) and lm_log$residuals[i_index] ("Residuals of a log-linear model, index randomized", right).]
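The randomized-index plots could be produced along these lines (a sketch; lm_lin and lm_log denote the fitted linear-linear and log-linear models, as in the plot labels above):

# Randomly shuffle the observation order:
i_index <- sample(seq_along(lm_lin$residuals))
par(mfrow = c(1, 2))
plot(lm_lin$residuals[i_index],
     main = "Residuals of a linear-linear model, index randomized")
plot(lm_log$residuals[i_index],
     main = "Residuals of a log-linear model, index randomized")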
Note the fact that in cross-sectional data one of our assumptions is that the
observations (Xi, Yi) are treated as independent from (Xj, Yj) for i ≠ j. As such, we can
order the data in any way we want.
- In the first case (with ordered values of X from (Xi, Yi)), ordering the residuals by
the value of X allows us to examine whether our model performs the same for all
values of X. Clearly, for larger values of X this was not the case (Note: a scatter
plot would be more useful in such a case).
- In the second case (where the ordering of X from (Xi, Yi) is random), where we
shuffled the order of our data (Xi, Yi), the residual run-sequence plot allows us to
identify whether there are residuals which are more pronounced, but it does not
identify when this happens, unlike the first case, with ordering by values of X.
Run sequence plots are more common in time series data, but can still be
utilized for cross-sectional data.
Scatter Plot
As we have seen, a scatter plot reveals relationships between two variables
of interest in the form of a non-random structure. The vertical axis is usually
for the dependent (i.e. response) variable Y, and the horizontal axis is for
the independent variable X.
Let:
Y = β0 + β1 · X1 + ε
where X1 ∼ N(µ1, σ1²), ε ∼ N(0, 1²), and let X2 ∼ N(µ2, σ2²), where X2 is
independent of X1 and ε.
[Figure: scatter plots of X1 and Y (left) and of X2 and Y (right).]
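A sketch of how such data and scatter plots could be generated (the values of µ1, σ1, µ2, σ2, β0 and β1 are made up for illustration):

set.seed(123)
N <- 200
x1 <- rnorm(N, mean = 2, sd = 2)   # X1 ~ N(mu_1, sigma_1^2)
x2 <- rnorm(N, mean = 3, sd = 3)   # X2 ~ N(mu_2, sigma_2^2), independent of X1 and e
e  <- rnorm(N, mean = 0, sd = 1)
y  <- 1 + 3 * x1 + e               # Y depends on X1 only
par(mfrow = c(1, 2))
plot(x1, y, main = "Scatter plot of X1 and Y")
plot(x2, y, main = "Scatter plot of X2 and Y")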
For our exponential model example, Y = exp(β0 + β1 X + ε), the residual plots are:
[Figure: residuals of the linear-linear model (top row) and of the log-linear model (bottom row), plotted against the observation index (left column) and against x (right column).]
[Figure: histogram and normal Q-Q plot (sample quantiles against theoretical quantiles) of y1 (top row) and of y2 (bottom row).]
We see that the q-q plot of Y1 shows that Y1 is normally distributed, since
all the quantiles are on the diagonal line. On the other hand, the q-q plot
of Y2 shows that the data is skewed.
The probability plot is used to answer the following questions:
- Does a specified theoretical distribution provide a good fit to my data?
- What distribution best fits my data?
- What are good estimates for the location and scale parameters of the chosen distribution?
In addition, the normal probability plot answers the following questions:
- Is the data normally distributed?
- What is the nature of the departure from normality (is the data skewed, are the tails shorter/longer)?
As mentioned, probability plots can be used to inspect whether the
residuals are normally distributed. We will see an example of this in the
residual diagnostics section later on.
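A minimal sketch of a normal Q-Q plot of the residuals in R, assuming some model object mdl fitted with lm():

# `mdl` is assumed to be some model fitted with lm():
res <- resid(mdl)
qqnorm(res)               # sample quantiles vs. theoretical normal quantiles
qqline(res, col = "red")  # reference line through the first and third quartiles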
Notes on the terminology of quantiles, percentiles and
quartiles: