Lecture Notes on Multicollinearity
CHAPTER CONTENTS
Introduction
Perfect multicollinearity
Consequences of perfect multicollinearity
Imperfect multicollinearity
Consequences of imperfect multicollinearity
Detecting problematic multicollinearity
Computer examples
Questions and exercises
LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 Recognize the problem of multicollinearity in the CLRM.
2 Distinguish between perfect and imperfect multicollinearity.
3 Understand and appreciate the consequences of perfect and imperfect multicollinearity for OLS estimates.
4 Detect problematic multicollinearity using econometric software.
5 Find ways of resolving problematic multicollinearity.
Introduction
Assumption 8 of the CLRM (see page 37) requires that there be no exact linear relationships
among the sample values of the explanatory variables. This requirement can also
be stated as the absence of perfect multicollinearity. This chapter explains how the
existence of perfect multicollinearity means that the OLS method cannot provide esti-
mates for population parameters. It also examines the more common and realistic
case of imperfect multicollinearity and its effects on OLS estimators. Finally, possible
ways of detecting problematic multicollinearity are discussed and ways of resolving
these problems are suggested.
Perfect multicollinearity
To understand multicollinearity, consider the following model:
Y = β1 + β2X2 + β3X3 + u    (5.1)

where the sample values of X2 and X3 are:

X2:  1   2   3   4   5   6
X3:  2   4   6   8  10  12
From this we can easily observe that X3 = 2X2 . Therefore, while Equation (5.1) seems
to contain two distinct explanatory variables (X2 and X3 ), in fact the information
provided by X3 is not distinct from that of X2 . This is because, as we have seen, X3
is a multiple of X2 . When this situation occurs, X2 and X3 are said to be linearly
dependent, which implies that X2 and X3 are collinear. More formally, two variables
X2 and X3 are linearly dependent if one variable can be expressed as a multiple of the
other. When this occurs the equation:
δ1 X2 + δ2 X3 = 0 (5.2)
can be satisfied for non-zero values of both δ1 and δ2 . In our example we have X3 =
2X2 , therefore (−2)X2 + (1)X3 = 0, so δ1 = −2 and δ2 = 1. Obviously, if the only
solution in Equation (5.2) were δ1 = δ2 = 0 (usually called the trivial solution) then
X2 and X3 would be linearly independent. The absence of perfect multicollinearity therefore requires that Equation (5.2) has no solution other than the trivial one.
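As a quick illustration, the short Python snippet below (a minimal sketch, not part of the original notes) verifies, for the sample values above, that the non-trivial pair δ1 = −2, δ2 = 1 satisfies Equation (5.2) and that the two columns span only one dimension:

# Minimal check of linear dependence for the sample values in the text:
# delta1*X2 + delta2*X3 = 0 holds for (delta1, delta2) = (-2, 1),
# and the matrix [X2 X3] has rank 1 rather than 2.
import numpy as np

X2 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
X3 = 2 * X2                                               # X3 = 2*X2, as in the example

print(np.allclose(-2 * X2 + 1 * X3, 0))                   # True
print(np.linalg.matrix_rank(np.column_stack([X2, X3])))   # 1, not 2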
In models with more than two explanatory variables (let's take five), linear dependence means that one variable can be expressed as a linear combination of one or more, or even all, of the other variables. So this time the expression:
δ1 X1 + δ2 X2 + δ3 X3 + δ4 X4 + δ5 X5 = 0 (5.3)
can be satisfied with at least two non-zero coefficients. If, on the other hand, Equation (5.3) holds only when all the coefficients are zero, then the Xs are linearly independent and there is no perfect collinearity among them.
This concept can be understood better by using the dummy variable trap. Take, for
example, X1 to be the intercept (so X1 = 1), and X2 , X3 , X4 and X5 to be seasonal
dummies for quarterly time series data (that is, X2 takes the value of 1 for the first
quarter, 0 otherwise; X3 takes the value of 1 for the second quarter, 0 otherwise and
so on). Then in this case X2 + X3 + X4 + X5 = 1; and because X1 = 1 then X1 =
X2 + X3 + X4 + X5. So a non-trivial solution is δ1 = 1, δ2 = δ3 = δ4 = δ5 = −1, and this set of variables is linearly dependent.
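The same point can be checked numerically. The sketch below (illustrative Python, with three years of made-up quarterly observations) builds an intercept and four seasonal dummies and confirms that the five columns are linearly dependent:

# The dummy variable trap: an intercept plus four quarterly dummies are
# linearly dependent, because the dummies sum to one in every observation.
import numpy as np

quarters = np.tile([1, 2, 3, 4], 3)                        # three years of quarterly data
X1 = np.ones(quarters.shape[0])                            # intercept column
D = np.column_stack([(quarters == q).astype(float) for q in (1, 2, 3, 4)])
X = np.column_stack([X1, D])                               # [X1, X2, X3, X4, X5]

print(np.allclose(X1, D.sum(axis=1)))                      # True: X1 = X2 + X3 + X4 + X5
print(np.linalg.matrix_rank(X))                            # 4, not 5: columns are dependent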
Consequences of perfect multicollinearity

To see why perfect multicollinearity is so damaging, return to the two-explanatory-variable model:

Y = β1 + β2X2 + β3X3 + u    (5.4)

and suppose that X2 and X3 are connected by the exact linear relationship X3 = δ1 + δ2X2. Substituting this into Equation (5.4) gives:

Y = β1 + β2X2 + β3(δ1 + δ2X2) + u
  = (β1 + β3δ1) + (β2 + β3δ2)X2 + u
  = ϑ1 + ϑ2X2 + ε    (5.5)

OLS can provide estimates ϑ̂1 and ϑ̂2 of the two coefficients in Equation (5.5), where ϑ1 = β1 + β3δ1 and ϑ2 = β2 + β3δ2. However, this is a system of two equations in three unknowns: β̂1, β̂2 and β̂3. Unfortunately, as in any system that has more unknowns than equations, it has an infinite
number of solutions. For example, select an arbitrary value for β̂3 , let’s say k. Then for
β̂3 = k we can find β̂1 and β̂2 as:
β̂1 = ϑ̂1 − δ1 k
β̂2 = ϑ̂2 − δ2 k
Since there are infinite values that can be used for k, there is an infinite number of
solutions for β̂1 , β̂2 and β̂3 . So, under perfect multicollinearity, no method can provide
us with unique estimates of the population parameters. In matrix notation, and for the more general case, if one of the columns of the matrix X is a linear combination of one or more of the other columns, the matrix X′X is singular, which implies that its determinant is zero (|X′X| = 0). Since the OLS estimators are given by:

β̂ = (X′X)−1X′Y

where

(X′X)−1 = adj(X′X)/|X′X|

the estimator cannot be computed: because |X′X| = 0, the matrix X′X cannot be inverted.
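A small numeric illustration of this argument, using the X3 = 2X2 data from the start of the chapter (the code and the particular ϑ and δ values are illustrative, not part of the original text): the determinant of X′X is zero, inversion fails, and every choice of k = β̂3 reproduces exactly the same fitted values.

# Perfect collinearity in practice: with X3 = 2*X2 the matrix X'X is singular,
# so it cannot be inverted, and infinitely many coefficient vectors fit equally well.
import numpy as np

X2 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
X3 = 2 * X2
X = np.column_stack([np.ones(6), X2, X3])

print(np.linalg.det(X.T @ X))              # 0 (up to floating-point error)
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:       # depending on rounding, inversion may instead
    print("inversion fails:", err)         # return meaningless, huge values

# Illustrative theta and delta values (here X3 = 0 + 2*X2, so d1 = 0, d2 = 2)
theta1, theta2, d1, d2 = 1.0, 3.0, 0.0, 2.0
for k in (0.0, 1.0, -5.0):                 # any k gives identical fitted values
    beta = np.array([theta1 - d1 * k, theta2 - d2 * k, k])
    print(X @ beta)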
Another way of showing that no solution can be obtained is to evaluate the expression for the least squares estimator directly. From Equation (4.13), writing lower-case letters for deviations from sample means (x2 = X2 − X̄2, and so on), the slope estimator is:

β̂2 = [Σx2y Σx3² − Σx3y Σx2x3] / [Σx2² Σx3² − (Σx2x3)²]

Substituting x3 = δ2x2 (the constant δ1 drops out when the data are expressed as deviations from their means) gives:

β̂2 = [δ2²Σx2y Σx2² − δ2²Σx2y Σx2²] / [δ2²(Σx2²)² − δ2²(Σx2²)²] = 0/0

which means that the regression coefficient is indeterminate. So we have seen that
the consequences of perfect multicollinearity are extremely serious. However, perfect
multicollinearity seldom arises with actual data. The occurrence of perfect multi-
collinearity often results from correctable mistakes, such as the dummy variable trap presented above, or including both ln(X) and ln(X²) in the same equation (since ln(X²) = 2 ln(X), the two variables are perfectly collinear). The
more relevant question and the real problem is how to deal with the more common
case of imperfect multicollinearity, examined in the next section.
Imperfect multicollinearity
Imperfect multicollinearity exists when the explanatory variables in an equation are
correlated, but this correlation is less than perfect. Imperfect multicollinearity can be
expressed as follows: when the relationship between the two explanatory variables in Equation (5.4) is, for example, X3 = X2 + v (where v is a random variable that can be viewed as the 'error' in an otherwise exact linear relationship between the two variables), then as long as v takes non-zero values we can still obtain OLS estimates. On a practical note, in real-
ity every multiple regression equation will contain some degree of correlation among
its explanatory variables. For example, time series data frequently contain a common
upward time trend that causes variables of this kind to be highly correlated. The prob-
lem is to identify whether the degree of multicollinearity observed in one relationship
is high enough to create problems. Before discussing that we need to examine the
effects of imperfect multicollinearity on the OLS estimators.

Consequences of imperfect multicollinearity

When multicollinearity is imperfect, OLS estimates of all the coefficients can still be obtained. However, the variances, and consequently the standard errors, of the OLS estimators will tend to
be large when there is a relatively high degree of multicollinearity. In other words,
while OLS provides linear unbiased estimators with the minimum variance prop-
erty, these variances are often substantially larger than those obtained in the absence
of multicollinearity.
To explain this in more detail, consider the expressions for the variances of the estimated partial slope coefficients in the case of two explanatory variables:

Var(β̂2) = σ²/[Σ(X2 − X̄2)²(1 − r²)]    (5.7)

Var(β̂3) = σ²/[Σ(X3 − X̄3)²(1 − r²)]    (5.8)
where r² is the square of the sample correlation coefficient between X2 and X3. Other
things being equal, a rise in r (which means a higher degree of multicollinearity) will
lead to an increase in the variances and therefore also to an increase in the standard
errors of the OLS estimators.
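A short simulation can make Equations (5.7) and (5.8) concrete. The sketch below (Python with statsmodels; the sample size, true coefficients and correlation levels are arbitrary choices, not taken from the notes) estimates the same model for increasing correlation between X2 and X3 and reports how the standard error and confidence interval of β̂2 widen:

# Illustrative simulation: the same model is estimated with increasingly
# correlated regressors, and the standard error of beta2-hat rises with r.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

def fit_with_correlation(rho):
    cov = [[1.0, rho], [rho, 1.0]]
    X2, X3 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    y = 1.0 + 2.0 * X2 + 3.0 * X3 + rng.normal(0.0, 1.0, n)   # arbitrary true betas
    X = sm.add_constant(np.column_stack([X2, X3]))
    return sm.OLS(y, X).fit()

for rho in (0.0, 0.95, 0.999):
    res = fit_with_correlation(rho)
    print(f"r = {rho:5.3f}   se(b2) = {res.bse[1]:.3f}   95% CI for b2: {res.conf_int()[1]}")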
Extending this to more than two explanatory variables, the variance of β̂j is given by:

Var(β̂j) = σ²/[Σ(Xj − X̄j)²(1 − R²j)]    (5.9)

where R²j is the R² of the auxiliary regression of Xj on all the other explanatory variables. Equation (5.9) can be rewritten as:

Var(β̂j) = [σ²/Σ(Xj − X̄j)²] × [1/(1 − R²j)]    (5.10)
The second term in this expression is called the variance inflation factor (VIF) for Xj:

VIFj = 1/(1 − R²j)

The table below shows how quickly the VIF rises with R²j:
R²j        VIFj
0          1
0.5        2
0.8        5
0.9        10
0.95       20
0.975      40
0.99       100
0.995      200
0.999      1000
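In practice the R²j and VIFj values are obtained from auxiliary regressions. The following sketch (Python with statsmodels; the simulated variables X2, X3 and X4 are purely illustrative) computes the VIF for one variable by hand and then uses the statsmodels helper for all regressors:

# VIF_j = 1 / (1 - R_j^2), computed from an auxiliary regression by hand and
# then with the statsmodels helper; the simulated data are purely illustrative.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
X2 = rng.normal(size=n)
X3 = 0.97 * X2 + 0.25 * rng.normal(size=n)     # highly correlated with X2
X4 = rng.normal(size=n)                        # roughly unrelated to the others
exog = sm.add_constant(np.column_stack([X2, X3, X4]))

# Auxiliary regression of X2 on a constant and the other regressors
aux = sm.OLS(exog[:, 1], np.delete(exog, 1, axis=1)).fit()
print("VIF(X2) by hand:", 1.0 / (1.0 - aux.rsquared))

for j, name in zip(range(1, exog.shape[1]), ["X2", "X3", "X4"]):
    print(name, variance_inflation_factor(exog, j))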
VIF values exceeding 10 are generally viewed as evidence of problematic multicollinearity, which is discussed below; from the table we can see that this occurs when R²j > 0.9. In conclusion, imperfect multicollinearity can substantially diminish the precision with which the OLS estimators are obtained, and this loss of precision has further consequences for inference on the estimated coefficients. One important consequence is that large standard errors make the confidence intervals for the βj parameters, calculated as:

β̂j ± tc sβ̂j

(where tc is the critical value from the t-distribution and sβ̂j is the estimated standard error of β̂j), very wide, thereby increasing uncertainty about the true parameter values.
Another consequence is related to the statistical inference of the OLS estimates.
Since the t-ratio is given by t = β̂j /sβ̂j , the inflated variance associated with multi-
collinearity raises the denominator of this statistic and causes its value to fall. Therefore
we might obtain t-statistics that suggest the coefficients are insignificant when the underlying variables do in fact belong in the model. To summarize, the main consequences of imperfect multicollinearity are:
1 Estimates of the OLS coefficients may be imprecise in the sense that large standard
errors lead to wide confidence intervals.
2 Affected coefficients may fail to attain statistical significance because of low
t-statistics, which may lead us mistakenly to drop an influential variable from a
regression model.
3 The signs of the estimated coefficients can be the opposite of those expected.
4 The addition or deletion of a few observations may result in substantial changes in
the estimated coefficients.
Computer examples
Example 1: induced multicollinearity
The file multicol.wf1 contains data for three different variables, namely Y, X2 and X3,
where X2 and X3 are constructed to be highly collinear. The correlation matrix of the
three variables can be obtained from EViews by opening all three variables together in
a group, by clicking on Quick/Group Statistics/Correlations. EViews requires us to
define the series list that we want to include in the group so we type:
Y X2 X3
and then click OK. The results will be as shown in Table 5.1.
Table 5.1 Correlation matrix

          Y           X2          X3
Y         1           0.8573686   0.8574376
X2        0.8573686   1           0.999995
X3        0.8574376   0.999995    1
The results are, of course, symmetrical, while the diagonal elements are equal to
1 because they are correlation coefficients of the same series. We can see that Y is
highly positively correlated with both X2 and X3, and that X2 and X3 are almost perfectly correlated with each other (the correlation coefficient is 0.999995, very close to 1). From this we strongly suspect that multicollinearity is present and likely to cause problems.
Estimate a regression with both explanatory variables by typing in the EViews
command line:
ls y c x2 x3
We get the results shown in Table 5.2. Here we see that the effect of X2 on Y is negative
and the effect of X3 is positive, while both variables appear to be insignificant. This
latter result is strange, considering that both variables are highly correlated with Y, as
we saw above. However, estimating the model by including only X2, either by typing
on the EViews command line:
ls y c x2
or by clicking on the Estimate button of the Equation Results window and respec-
ifying the equation by excluding/deleting the X3 variable, we get the results shown
in Table 5.3. This time, we see that X2 is positive and statistically significant (with a
t-statistic of 7.98).
Re-estimating the model, this time including only X3, we get the results shown in
Table 5.4. This time, we see that X3 is highly significant and positive.
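For readers not using EViews, the workflow of this example can be reproduced along the following lines (a rough Python sketch: since the workfile multicol.wf1 is not reproduced here, near-collinear data are simulated, so the exact numbers will differ from Tables 5.1–5.4):

# Rough analogue of Example 1: simulate near-collinear data, inspect the
# correlation matrix, then compare the joint and single-regressor regressions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 40
x2 = rng.normal(10.0, 2.0, n)
x3 = 2.0 * x2 + rng.normal(0.0, 0.01, n)       # nearly an exact multiple of x2
y = 5.0 + 1.5 * x2 + rng.normal(0.0, 3.0, n)
df = pd.DataFrame({"y": y, "x2": x2, "x3": x3})

print(df.corr())                                          # analogue of Table 5.1
print(smf.ols("y ~ x2 + x3", df).fit().summary())         # both t-ratios collapse
print(smf.ols("y ~ x2", df).fit().summary())              # x2 alone is significant
print(smf.ols("y ~ x3", df).fit().summary())              # x3 alone is significant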
The conclusions from this example can be summarized as follows:

1 The correlation among the explanatory variables is very high, suggesting that multi-
collinearity is present and that it might be serious. However, as mentioned above,
looking at the correlation coefficients of the explanatory variables alone is not
enough to detect multicollinearity.
2 Standard errors or t-ratios of the estimated coefficients changed from estimation to
estimation, suggesting that the problem of multicollinearity in this case was serious.
3 The stability of the estimated coefficients was also problematic, with negative and
positive coefficients being estimated for the same variable in two alternative specifi-
cations.
4 The R² values from auxiliary regressions are very high, suggesting that multicollinearity exists and has had an unavoidable effect on our estimates.
Example 2: the imports equation

The second example uses quarterly data (1990:1–1998:2) on imports (IMP), GDP and two price indices (CPI and PPI). Opening the four series as a group and computing their correlation matrix, exactly as in Example 1, gives the results shown in Table 5.6. From the correlation matrix we can see that, in general, the correlations among the variables are very high; the highest correlation, 0.98, is between CPI and PPI, as expected.
Estimating a regression with the logarithm of imports as the dependent variable and
the logarithms of GDP and CPI only as explanatory variables by typing in the EViews
command line:

ls log(imp) c log(gdp) log(cpi)

we get the results shown in Table 5.7. The R² of this regression is very high, and both estimated coefficients are positive; log(GDP) is highly significant, while log(CPI) is only marginally significant.
However, estimating the model also including the logarithm of PPI, either by typing
on the EViews command line:

ls log(imp) c log(gdp) log(cpi) log(ppi)
or by clicking on the Estimate button of the Equation Results window and respeci-
fying the equation by adding the log(PPI) variable to the list of variables, we get the
results shown in Table 5.8. Now log(CPI) is highly significant, while log(PPI) (which is
highly correlated with log(CPI) and therefore should have more or less the same effect
on log(IMP)) is negative and highly significant. This counterintuitive result arises from including both of the highly collinear price indices in the same specification; it is a direct consequence of the problem of multicollinearity.
Table 5.8 Second model regression results (including both CPI and PPI )
Dependent variable: LOG(IMP)
Method: least squares
Date: 02/17/04 Time: 02:19
Sample: 1990:1 1998:2
Included observations: 34
Estimating the equation this time without log(CPI) but with log(PPI) we get the
results in Table 5.9, which show that log(PPI) is positive and insignificant. It is clear
that the significance of log(PPI) in the specification above was a result of the linear
relationship that connects the two price variables.
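The three specifications of this example can likewise be run outside EViews. The sketch below assumes the series have been exported to a CSV file named imports_uk.csv with columns IMP, GDP, CPI and PPI; both the file name and the column layout are assumptions made for illustration:

# The three import-equation specifications, assuming a hypothetical CSV export
# of the data with columns IMP, GDP, CPI and PPI.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("imports_uk.csv")     # hypothetical file name

specs = [
    "np.log(IMP) ~ np.log(GDP) + np.log(CPI)",                  # as in Table 5.7
    "np.log(IMP) ~ np.log(GDP) + np.log(CPI) + np.log(PPI)",    # as in Table 5.8
    "np.log(IMP) ~ np.log(GDP) + np.log(PPI)",                  # as in Table 5.9
]
for formula in specs:
    print(smf.ols(formula, data=df).fit().summary())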
The conclusions from this analysis are similar to the case of the collinear data set in
Example 1 above, and can be summarized as follows:

1 The correlations among the explanatory variables, and especially between the two price indices, are very high, suggesting that multicollinearity may be serious.

2 The standard errors and t-ratios of the estimated coefficients changed substantially from one specification to another.
3 The stability of the estimated coefficients was also quite problematic, with nega-
tive and positive coefficients being estimated for the same variable in alternative
specifications.
In this case it is clear that multicollinearity is present, and that it is also serious,
because we included two price variables that are quite strongly correlated. We leave it as
an exercise for the reader to check the presence and the seriousness of multicollinearity
with only the inclusion of log(GDP) and log(CPI) as explanatory variables (see Exercise
5.1 below).
Questions and exercises

Y = β1 + β2X2 + β3X3 + β4X4 + ut
Exercise 5.1
The file imports_uk_y.wf1 contains yearly data (1972–1997) for real imports (IMP), real
gross domestic product (GDP) and the relative consumer price index (CPI) of domestic
to foreign prices for the US economy. Use these data to estimate the following models:
ln(IMP)t = β1 + β2 ln(GDP)t + ut
ln(IMP)t = β1 + β2 ln(CPI)t + ut
ln(GDP)t = β1 + β2 ln(CPI)t + ut
What can you conclude about the nature of multicollinearity from these results?
Exercise 5.2
The file imports_uk.wf1 contains quarterly observations of the variables mentioned in
Exercise 5.1. Repeat Exercise 5.1 using the quarterly (higher frequency) data. Do your
results change?
Exercise 5.3
Use the data in the file money_uk02.wf1 to estimate the parameters α, β and γ in the
equation below:
where R1t is the three-month treasury bill rate. For the rest of the variables the usual
notation applies.
(a) Use R2t, the dollar interest rate, as an additional variable in the above equation.
(b) Do you expect to find multicollinearity? Why?
(c) Calculate the correlation matrix of all the variables. Which correlation coefficient
is the largest?
(d) Calculate auxiliary regressions and conclude whether the degree of multicollinear-
ity in (a) is serious or not.
Exercise 5.4
The file cars.wf1 contains annual data for 1971–1986 (16 observations) on new car sales in the US. The variables are defined as follows:
(a) Estimate a log-linear model that examines a demand function for new cars,
including all explanatory variables as regressors. Discuss your results.
(b) In the model estimated above, do you expect to face multicollinearity? Discuss
why.
(c) Check if there is indeed multicollinearity among the regressors. How will you do
that? What are your conclusions?
(d) Based on your answer above, now develop a more suitable model that avoids the
problems of multicollinearity.
Exercise 5.5
Consider the following data:
Y −12 −8 −4 −3 1 2 4 8 18 13
X1 12 13 14 15 16 15 17 18 21 20
X2 21 23 25 27 29 27 31 33 39 37
(a) Can you obtain estimates of the coefficients of the following linear model:
Y = β0 + β1 X1 + β2 X2 + u
Exercise 5.6
The file manhours.xls contains the following variables:
The data describe the manpower needed to operate US Navy bachelor officers' quarters, and contain cross-sectional observations for 25 establishments.
(a) Estimate a regression model that explains average daily occupancy (occup) from
all available explanatory variables. Do the results suggest multicollinearity?
(b) Are some of the explanatory variables collinear? How is this detected?