
5 Multicollinearity

CHAPTER CONTENTS
Introduction
Perfect multicollinearity
Consequences of perfect multicollinearity
Imperfect multicollinearity
Consequences of imperfect multicollinearity
Detecting problematic multicollinearity
Computer examples
Questions and exercises

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 Recognize the problem of multicollinearity in the CLRM.
2 Distinguish between perfect and imperfect multicollinearity.
3 Understand and appreciate the consequences of perfect and imperfect multi-
collinearity for OLS estimates.
4 Detect problematic multicollinearity using econometric software.
5 Find ways of resolving problematic multicollinearity.


Introduction
Assumption 8 of the CLRM (see page 37) requires that there be no exact linear relationships
among the sample values of the explanatory variables. This requirement can also
be stated as the absence of perfect multicollinearity. This chapter explains how the
existence of perfect multicollinearity means that the OLS method cannot provide esti-
mates for population parameters. It also examines the more common and realistic
case of imperfect multicollinearity and its effects on OLS estimators. Finally, possible
ways of detecting problematic multicollinearity are discussed and ways of resolving
these problems are suggested.

Perfect multicollinearity
To understand multicollinearity, consider the following model:

Y = β1 + β2 X2 + β3 X3 + u (5.1)

where hypothetical sample values for X2 and X3 are given below:

X2 : 1 2 3 4 5 6
X3 : 2 4 6 8 10 12

From this we can easily observe that X3 = 2X2 . Therefore, while Equation (5.1) seems
to contain two distinct explanatory variables (X2 and X3 ), in fact the information
provided by X3 is not distinct from that of X2 . This is because, as we have seen, X3
is a multiple of X2 . When this situation occurs, X2 and X3 are said to be linearly
dependent, which implies that X2 and X3 are collinear. More formally, two variables
X2 and X3 are linearly dependent if one variable can be expressed as a multiple of the
other. When this occurs the equation:

δ1 X2 + δ2 X3 = 0 (5.2)

can be satisfied for non-zero values of both δ1 and δ2 . In our example we have X3 =
2X2 , therefore (−2)X2 + (1)X3 = 0, so δ1 = −2 and δ2 = 1. Obviously, if the only
solution in Equation (5.2) were δ1 = δ2 = 0 (usually called the trivial solution) then
X2 and X3 would be linearly independent. The absence of perfect multicollinearity
therefore requires that Equation (5.2) has only the trivial solution.
Extending Equation (5.2) to the case where there are more than two explanatory variables (say,
five), linear dependence means that one variable can be expressed as a linear
combination of one or more, or even all, of the other variables. In this case
the expression:

δ1 X1 + δ2 X2 + δ3 X3 + δ4 X4 + δ5 X5 = 0 (5.3)

can be satisfied with at least two non-zero coefficients. If, on the other hand, Equation (5.3)
holds only when all the coefficients are zero, then the Xs are linearly independent and there
is no perfect collinearity.
This concept can be understood better by using the dummy variable trap. Take, for
example, X1 to be the intercept (so X1 = 1), and X2 , X3 , X4 and X5 to be seasonal
dummies for quarterly time series data (that is, X2 takes the value of 1 for the first

quarter, 0 otherwise; X3 takes the value of 1 for the second quarter, 0 otherwise and
so on). Then in this case X2 + X3 + X4 + X5 = 1; and because X1 = 1 then X1 =
X2 + X3 + X4 + X5 . So, Equation (5.3) is satisfied by the non-trivial solution δ1 = 1, δ2 = −1,
δ3 = −1, δ4 = −1, δ5 = −1, and this set of variables is linearly dependent.
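As an aside not found in the original text, the rank deficiency created by the dummy variable trap can be checked numerically. The following minimal sketch, written in Python with NumPy (the chapter itself works in EViews), builds an intercept column and four quarterly dummies for two years of quarterly data and confirms that the resulting regressor matrix has rank 4 rather than 5:

import numpy as np

# Two years of quarterly observations: quarters 1,2,3,4,1,2,3,4
quarters = np.array([1, 2, 3, 4, 1, 2, 3, 4])

X1 = np.ones(8)  # intercept column (X1 = 1)
# Seasonal dummies X2..X5: X2 = 1 in quarter 1, X3 = 1 in quarter 2, and so on
D = np.column_stack([(quarters == q).astype(float) for q in (1, 2, 3, 4)])

X = np.column_stack([X1, D])  # columns: X1, X2, X3, X4, X5

print(np.allclose(D.sum(axis=1), X1))   # True: X2 + X3 + X4 + X5 = X1
print(np.linalg.matrix_rank(X))         # 4, not 5 -> perfect multicollinearity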

Consequences of perfect multicollinearity


It is fairly easy to show that under conditions of perfect multicollinearity, the OLS
estimators are not unique. Consider, for example, the model:

Y = β1 + β2 X2 + β3 X3 + ut (5.4)

where X3 = δ1 + δ2 X2 and δ1 and δ2 are known constants. Substituting this into
Equation (5.4) gives:

Y = β1 + β2 X2 + β3 (δ1 + δ2 X2 ) + u
= (β1 + β3 δ1 ) + (β2 + β3 δ2 )X2 + u
= ϑ1 + ϑ2 X2 + ε (5.5)

where, of course, ϑ1 = β1 + β3 δ1 and ϑ2 = β2 + β3 δ2 .


What we can therefore estimate from our sample data are the coefficients ϑ1 and ϑ2 .
However, no matter how good the estimates of ϑ1 and ϑ2 are, we shall never be able to
obtain unique estimates of β1 , β2 and β3 . To obtain these we would have to solve the
following system of equations:

ϑ̂1 = β̂1 + β̂3 δ1

ϑ̂2 = β̂2 + β̂3 δ2

However, this is a system of two equations and three unknowns: β̂1 , β̂2 and β̂3 . Unfor-
tunately, as in any system that has more unknowns than equations, this has an infinite
number of solutions. For example, select an arbitrary value for β̂3 , let’s say k. Then for
β̂3 = k we can find β̂1 and β̂2 as:

β̂1 = ϑ̂1 − δ1 k

β̂2 = ϑ̂2 − δ2 k

Since there are infinitely many values that can be used for k, there is an infinite number of
solutions for β̂1 , β̂2 and β̂3 . So, under perfect multicollinearity, no method can provide
us with unique estimates for the population parameters. In terms of matrix notation, and
for the more general case, if one of the columns of the matrix X is a linear combination of
one or more of the other columns, the matrix X′X is singular, which implies that its
determinant is zero (|X′X| = 0). Since the OLS estimators are given by:

β̂ = (X′X)⁻¹X′Y

we need the inverse matrix of X′X, which is calculated by the expression:

(X′X)⁻¹ = [1/|X′X|] adj(X′X)

but because |X′X| = 0 the matrix cannot be inverted.
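This can be verified numerically with the sample values given earlier in the chapter (X2 = 1, 2, …, 6 and X3 = 2X2). The short sketch below uses Python with NumPy purely for illustration (the original text uses EViews) and shows that |X′X| is zero, so the inverse needed by OLS does not exist:

import numpy as np

X2 = np.array([1., 2., 3., 4., 5., 6.])
X3 = 2 * X2                                # X3 is an exact multiple of X2

X = np.column_stack([np.ones(6), X2, X3])  # columns: intercept, X2, X3
XtX = X.T @ X

print(np.linalg.matrix_rank(X))            # 2, although X has 3 columns
print(np.linalg.det(XtX))                  # (numerically) zero

try:
    np.linalg.inv(XtX)                     # OLS requires this inverse
    print("inversion unexpectedly succeeded")
except np.linalg.LinAlgError:
    print("X'X is singular and cannot be inverted")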
Another way of showing that no solution can be found is to evaluate the expression for
the least squares estimator directly. From Equation (4.13):

β̂2 = [Cov(X2, Y)Var(X3) − Cov(X3, Y)Cov(X2, X3)] / {Var(X2)Var(X3) − [Cov(X2, X3)]²}

Substituting X3 = δ1 + δ2 X2 :

β̂2 = [Cov(X2, Y)Var(δ1 + δ2X2) − Cov(δ1 + δ2X2, Y)Cov(X2, δ1 + δ2X2)] / {Var(X2)Var(δ1 + δ2X2) − [Cov(X2, δ1 + δ2X2)]²}

Dropping the additive constant δ1, which does not affect variances or covariances:

β̂2 = [Cov(X2, Y)Var(δ2X2) − Cov(δ2X2, Y)Cov(X2, δ2X2)] / {Var(X2)Var(δ2X2) − [Cov(X2, δ2X2)]²}

Taking the term δ2 out of the Var and Cov:

β̂2 = [Cov(X2, Y)δ2²Var(X2) − δ2Cov(X2, Y)δ2Cov(X2, X2)] / {Var(X2)δ2²Var(X2) − [δ2Cov(X2, X2)]²}

And using the fact that Cov(X2 , X2 ) = Var(X2 ):

β̂2 = [δ2²Cov(X2, Y)Var(X2) − δ2²Cov(X2, Y)Var(X2)] / {δ2²Var(X2)² − δ2²Var(X2)²} = 0/0

which means that the regression coefficient is indeterminate. So we have seen that
the consequences of perfect multicollinearity are extremely serious. However, perfect
multicollinearity seldom arises with actual data. When it does occur, it usually results from
correctable mistakes, such as the dummy variable trap presented above, or the inclusion of
both ln(X) and ln(X²) in the same equation (since ln(X²) = 2 ln(X), the two regressors are
perfectly collinear). The more relevant question, and the real problem, is how to deal with
the more common case of imperfect multicollinearity, examined in the next section.

Imperfect multicollinearity
Imperfect multicollinearity exists when the explanatory variables in an equation are
correlated, but this correlation is less than perfect. Imperfect multicollinearity can be
expressed as follows: when the relationship between the two explanatory variables in
Equation (5.4), for example, is X3 = δ1 + δ2X2 + v (where v is a random variable that can be
viewed as the ‘error’ in the otherwise exact linear relationship between the two variables),
then as long as v has non-zero values we can obtain OLS estimates.
ity every multiple regression equation will contain some degree of correlation among
its explanatory variables. For example, time series data frequently contain a common

upward time trend that causes variables of this kind to be highly correlated. The prob-
lem is to identify whether the degree of multicollinearity observed in one relationship
is high enough to create problems. Before discussing that we need to examine the
effects of imperfect multicollinearity in the OLS estimators.

Consequences of imperfect multicollinearity


In general, when imperfect multicollinearity exists among two or more explanatory
variables, not only are we able to obtain OLS estimates but these will also be the best
linear unbiased estimators (BLUE). However, the BLUEness of these should be exam-
ined in a more detailed way. Implicit in the BLUE property is the efficiency of the OLS
coefficients. As we shall show later, while OLS estimators are those with the smallest
possible variance of all linear unbiased estimators, imperfect multicollinearity affects
the attainable values of these variances and therefore also estimation precision. Using
the matrix solution again, imperfect multicollinearity implies that one column of the
X matrix is now an approximate linear combination of one or more of the others.
Therefore, the matrix X′X will be close to singularity, which again implies that its
determinant will be close to zero. As stated earlier, when forming the inverse (X′X)⁻¹ we
multiply by the reciprocal of |X′X|, which means that the elements (and particularly
the diagonal elements) of (X′X)⁻¹ will be large. Because the variance of β̂ is given by:

Var(β̂) = σ²(X′X)⁻¹ (5.6)

the variances, and consequently the standard errors, of the OLS estimators will tend to
be large when there is a relatively high degree of multicollinearity. In other words,
while OLS provides linear unbiased estimators with the minimum variance prop-
erty, these variances are often substantially larger than those obtained in the absence
of multicollinearity.
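A small simulation, not part of the original text, illustrates the point directly from Equation (5.6): the sketch below (Python with NumPy, illustrative only) compares the diagonal of σ²(X′X)⁻¹ for two artificial designs, one with nearly uncorrelated regressors and one with highly correlated regressors.

import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 100, 1.0

def coef_variances(corr):
    # Diagonal of sigma^2 (X'X)^-1 for an intercept plus two regressors
    # whose population correlation is `corr`.
    x2 = rng.standard_normal(n)
    x3 = corr * x2 + np.sqrt(1 - corr**2) * rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x2, x3])
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

print(coef_variances(0.10))  # modest variances for the two slope coefficients
print(coef_variances(0.99))  # the slope variances are many times larger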
To explain this in more detail, consider the expression that gives the variance of the
partial slope of variable Xj , which is given by (in the case of two explanatory variables):

Var(β̂2) = σ² / [Σ(X2 − X̄2)²(1 − r²)] (5.7)

Var(β̂3) = σ² / [Σ(X3 − X̄3)²(1 − r²)] (5.8)

where r² is the square of the sample correlation coefficient between X2 and X3. Other
things being equal, a rise in r (which means a higher degree of multicollinearity) will
lead to an increase in the variances and therefore also to an increase in the standard
errors of the OLS estimators.
Extending this to more than two explanatory variables, the variance of β̂j will be
given by:

Var(β̂j) = σ² / [Σ(Xj − X̄j)²(1 − R²j)] (5.9)

where R²j is the coefficient of determination from the auxiliary regression of Xj on
all the other explanatory variables in the original equation. The expression can be
rewritten as:

Var(β̂j) = [σ² / Σ(Xj − X̄j)²] [1 / (1 − R²j)] (5.10)

The second term in this expression is called the variance inflation factor (VIF) for Xj :

VIFj = 1 / (1 − R²j)

It is called the variance inflation factor because high degrees of intercorrelation
among the Xs will result in a high value of R²j, which will inflate the variance of
β̂j. If R²j = 0 then VIFj = 1 (its lowest value). As R²j rises, VIFj rises at an
increasing rate, approaching infinity in the case of perfect multicollinearity (R²j = 1).
The following table presents various values of R²j and the corresponding VIFj.

R²j      VIFj

0 1
0.5 2
0.8 5
0.9 10
0.95 20
0.975 40
0.99 100
0.995 200
0.999 1000

VIF values exceeding 10 are generally viewed as evidence of problematic multicollinearity,
which will be discussed below. From the table we can see that this occurs when R²j > 0.9.
In conclusion, imperfect multicollinearity can substantially diminish the precision with
which the OLS estimators are obtained, and this reduced precision has further negative
consequences for inference on the estimated coefficients. One important consequence is that
large standard errors will lead to confidence intervals for the βj parameters, calculated by:

β̂j ± tα,n−k sβ̂j

being very wide, thereby increasing uncertainty about the true parameter values.
Another consequence is related to the statistical inference of the OLS estimates.
Since the t-ratio is given by t = β̂j /sβ̂j , the inflated variance associated with multi-
collinearity raises the denominator of this statistic and causes its value to fall. Therefore
we might have t-statistics that suggest the insignificance of the coefficients, but this

is only a result of multicollinearity. Note, however, that the existence of multicollinearity
does not necessarily mean small t-statistics, because the variance of β̂j is also affected by
the variation of Xj (captured by Σ(Xj − X̄j)²) and by the residual variance (σ²).
Multicollinearity affects not only the variances of the OLS estimators,
but also the covariances. Thus, the possibility of sign reversal arises. Also, when
there is severe multicollinearity, the addition or deletion of just a few sample obser-
vations can change the estimated coefficient substantially, causing ‘unstable’ OLS
estimators. The consequences of imperfect multicollinearity can be summarized as
follows:

1 Estimates of the OLS coefficients may be imprecise in the sense that large standard
errors lead to wide confidence intervals.
2 Affected coefficients may fail to attain statistical significance because of low
t-statistics, which may lead us mistakenly to drop an influential variable from a
regression model.
3 The signs of the estimated coefficients can be the opposite of those expected.
4 The addition or deletion of a few observations may result in substantial changes in
the estimated coefficients.

Detecting problematic multicollinearity


Simple correlation coefficient
Multicollinearity is caused by intercorrelations between the explanatory variables.
Therefore, the most logical way to detect multicollinearity problems would appear
to be through the correlation coefficients of the explanatory variables. When an equation
contains only two explanatory variables, the simple correlation coefficient is an ade-
quate measure for detecting multicollinearity. If the value of the correlation coefficient
is large then problems from multicollinearity might emerge. The problem here is
to define what value can be considered as large, and most researchers consider the
value of 0.9 as the threshold beyond which problems are likely to occur. This can be
understood from the VIF for a value of r = 0.9 as well.

R² from auxiliary regressions


In the case where we have more than two variables, the use of the simple correlation
coefficient to detect bivariate correlations, and therefore problematic multicollinear-
ity, is highly unreliable, because a (near-)linear dependency can occur among three
or more variables simultaneously. In these cases we use auxiliary regressions, in which
each suspect explanatory variable is regressed on the remaining explanatory variables.
Candidates for the dependent variable in an auxiliary regression are those variables
displaying the symptoms of problematic multicollinearity discussed above. If a near-linear
dependency exists, the auxiliary regression will display a small equation standard error,
a large R² and a statistically significant F-statistic for the overall significance of the
regressors.
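Both diagnostics (the pairwise correlation matrix and the auxiliary-regression R²/VIF values) can be produced with any econometric package. The sketch below is an illustrative Python version using pandas and statsmodels (the chapter itself uses EViews); it assumes a DataFrame named df whose columns are the explanatory variables of the model under study.

import pandas as pd
import statsmodels.api as sm

def collinearity_diagnostics(df: pd.DataFrame) -> pd.DataFrame:
    # For each explanatory variable, run the auxiliary regression on all the
    # other explanatory variables and report its R-squared and the implied VIF.
    rows = []
    for col in df.columns:
        y = df[col]
        X = sm.add_constant(df.drop(columns=col))
        r2 = sm.OLS(y, X).fit().rsquared
        rows.append({"variable": col, "aux_R2": r2, "VIF": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

# Usage, assuming df holds the explanatory variables only:
# print(df.corr())                        # simple correlation matrix
# print(collinearity_diagnostics(df))     # auxiliary R-squared and VIF per regressor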

Computer examples
Example 1: induced multicollinearity
The file multicol.wf1 contains data for three different variables, namely Y, X2 and X3,
where X2 and X3 are constructed to be highly collinear. The correlation matrix of the
three variables can be obtained from EViews by opening all three variables together in
a group, by clicking on Quick/Group Statistics/Correlations. EViews requires us to
define the series list that we want to include in the group so we type:

Y X2 X3

and then click OK. The results will be as shown in Table 5.1.

Table 5.1 Correlation matrix


Y X2 X3

Y 1 0.8573686 0.857437
X2 0.8573686 1 0.999995
X3 0.8574376 0.999995 1

The results are, of course, symmetrical, while the diagonal elements are equal to
1 because they are correlation coefficients of the same series. We can see that Y is
highly positively correlated with both X2 and X3, and that X2 and X3 are almost
perfectly collinear (their correlation coefficient is 0.999995, very close to 1).
From this we strongly suspect that the negative effects of multicollinearity will be present.
Estimate a regression with both explanatory variables by typing in the EViews
command line:

ls y c x2 x3

We get the results shown in Table 5.2. Here we see that the effect of X2 on Y is negative
and the effect of X3 is positive, while both variables appear to be insignificant. This
latter result is strange, considering that both variables are highly correlated with Y, as
we saw above. However, estimating the model by including only X2, either by typing
on the EViews command line:

ls y c x2

or by clicking on the Estimate button of the Equation Results window and respec-
ifying the equation by excluding/deleting the X3 variable, we get the results shown
in Table 5.3. This time, we see that X2 is positive and statistically significant (with a
t-statistic of 7.98).
Re-estimating the model, this time including only X3, we get the results shown in
Table 5.4. This time, we see that X3 is highly significant and positive.

Table 5.2 Regression results (full model)


Dependent variable: Y
Method: least squares
Date: 02/17/04 Time: 01:53
Sample: 1 25
Included observations: 25

Variable Coefficient Std. error t-statistic Prob.

C 35.86766 19.38717 1.850073 0.0778


X2 −6.326498 33.75096 −0.187446 0.8530
X3 1.789761 8.438325 0.212099 0.8340

R -squared 0.735622 Mean dependent var. 169.3680


Adjusted R -squared 0.711587 S.D. dependent var. 79.05857
S.E. of regression 42.45768 Akaike info criterion 10.44706
Sum squared resid. 39658.40 Schwarz criterion 10.59332
Log likelihood −127.5882 F -statistic 30.60702
Durbin–Watson stat. 2.875574 Prob(F -statistic) 0.000000

Table 5.3 Regression results (omitting X3)


Dependent variable: Y
Method: least squares
Date: 02/17/04 Time: 01:56
Sample: 1 25
Included observations: 25

Variable Coefficient Std. error t-statistic Prob.

C 36.71861 18.56953 1.977358 0.0601


X2 0.832012 0.104149 7.988678 0.0000

R -squared 0.735081 Mean dependent var. 169.3680


Adjusted R -squared 0.723563 S.D. dependent var. 79.05857
S.E. of regression 41.56686 Akaike info criterion 10.36910
Sum squared resid. 39739.49 Schwarz criterion 10.46661
Log likelihood −127.6138 F -statistic 63.81897
Durbin–Watson stat. 2.921548 Prob(F -statistic) 0.000000

Table 5.4 Regression results (omitting X2)


Dependent variable: Y
Method: least squares
Date: 02/17/04 Time: 01:58
Sample: 1 25
Included observations: 25

Variable Coefficient Std. error t-statistic Prob.

C 36.60968 18.57637 1.970766 0.0609


X3 0.208034 0.026033 7.991106 0.0000

R -squared 0.735199 Mean dependent var. 169.3680


Adjusted R -squared 0.723686 S.D. dependent var. 79.05857
S.E. of regression 41.55758 Akaike info criterion 10.36866
Sum squared resid. 39721.74 Schwarz criterion 10.46617
Log likelihood −127.6082 F -statistic 63.85778
Durbin–Watson stat. 2.916396 Prob(F -statistic) 0.000000

Table 5.5 Auxiliary regression results (regressing X2 on X3)


Dependent variable: X 2
Method: least squares
Date: 02/17/04 Time: 02:03
Sample: 1 25
Included observations: 25

Variable Coefficient Std. error t-statistic Prob.

C −0.117288 0.117251 −1.000310 0.3276


X3 0.250016 0.000164 1521.542 0.0000

R -squared 0.999990 Mean dependent var. 159.4320


Adjusted R -squared 0.999990 S.D. dependent var. 81.46795
S.E. of regression 0.262305 Akaike info criterion 0.237999
Sum squared resid. 1.582488 Schwarz criterion 0.335509
Log likelihood − 0.974992 F -statistic 2315090.
Durbin–Watson stat. 2.082420 Prob(F -statistic) 0.000000

Finally, running an auxiliary regression of X2 on a constant and X3 yields the results
shown in Table 5.5. Note that the value of the t-statistic on X3 is extremely high
(1521.542!) while the R² is nearly 1.
The conclusions from this analysis can be summarized as follows:

1 The correlation among the explanatory variables is very high, suggesting that multi-
collinearity is present and that it might be serious. However, as mentioned above,
looking at the correlation coefficients of the explanatory variables alone is not
enough to detect multicollinearity.
2 Standard errors or t-ratios of the estimated coefficients changed from estimation to
estimation, suggesting that the problem of multicollinearity in this case was serious.
3 The stability of the estimated coefficients was also problematic, with negative and
positive coefficients being estimated for the same variable in two alternative specifi-
cations.
4 The R² from the auxiliary regression is very high, suggesting that multicollinearity is
present and has seriously affected our estimates.
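Readers without access to the multicol.wf1 workfile can reproduce the qualitative pattern with simulated data. The sketch below (Python with NumPy and statsmodels; the data are generated artificially and are not the book's series) creates two almost perfectly collinear regressors and contrasts the full regression with the one that drops X3:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 25

x2 = rng.uniform(10, 100, n)
x3 = 4 * x2 + rng.normal(0, 0.5, n)        # x3 is almost an exact multiple of x2
y = 30 + 0.8 * x2 + rng.normal(0, 20, n)   # artificial data-generating process

full = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
only_x2 = sm.OLS(y, sm.add_constant(x2)).fit()

# In the full model both slopes tend to have huge standard errors and tiny
# t-ratios, and one slope may even switch sign; dropping x3 restores a
# precisely estimated, significant coefficient on x2.
print(full.summary())
print(only_x2.summary())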

Example 2: with the use of real economic data


Let us examine the problem of multicollinearity again, but this time using real eco-
nomic data. The file imports_uk.wf1 contains quarterly data for four different variables
for the UK economy: namely, imports (IMP), gross domestic product (GDP), the
consumer price index (CPI) and the producer price index (PPI).
The correlation matrix of the four variables can be obtained from EViews by open-
ing all the variables together by clicking on Quick/Group Statistics/Correlations.
EViews asks us to define the series list we want to include in the group so we type in:

imp gdp cpi ppi



Table 5.6 Correlation matrix


IMP GDP CPI PPI

IMP 1.000000 0.979713 0.916331 0.883530


GDP 0.979713 1.000000 0.910961 0.899851
CPI 0.916331 0.910961 1.000000 0.981983
PPI 0.883530 0.899851 0.981983 1.000000

Table 5.7 First model regression results (including only CPI)


Dependent variable: LOG(IMP)
Method: least squares
Date: 02/17/04 Time: 02:16
Sample: 1990:1 1998:2
Included observations: 34

Variable Coefficient Std. error t-statistic Prob.

C 0.631870 0.344368 1.834867 0.0761


LOG(GDP) 1.926936 0.168856 11.41172 0.0000
LOG(CPI) 0.274276 0.137400 1.996179 0.0548

R -squared 0.966057 Mean dependent var. 10.81363


Adjusted R -squared 0.963867 S.D. dependent var. 0.138427
S.E. of regression 0.026313 Akaike info criterion −4.353390
Sum squared resid. 0.021464 Schwarz criterion −4.218711
Log likelihood 77.00763 F -statistic 441.1430
Durbin–Watson stat. 0.475694 Prob(F -statistic) 0.000000

and then click OK. The results are shown in Table 5.6. From the correlation matrix we
can see that, in general, the correlations among the variables are very high, and that the
highest correlation is between CPI and PPI (0.98), as expected.
Estimating a regression with the logarithm of imports as the dependent variable and
the logarithms of GDP and CPI only as explanatory variables by typing in the EViews
command line:

ls log(imp) c log(gdp) log(cpi)

we get the results shown in Table 5.7. The R² of this regression is very high, both
estimated coefficients are positive, and log(GDP) is highly significant. The coefficient on
log(CPI) is only marginally significant.
However, estimating the model also including the logarithm of PPI, either by typing
on the EViews command line:

ls log(imp) c log(gdp) log(cpi) log(ppi)

or by clicking on the Estimate button of the Equation Results window and respeci-
fying the equation by adding the log(PPI) variable to the list of variables, we get the
results shown in Table 5.8. Now log(CPI) is highly significant, while log(PPI) (which is
highly correlated with log(CPI) and therefore should have more or less the same effect
on log(IMP)) is negative and highly significant. This, of course, is because of the inclu-
sion of both price indices in the same equation specification, as a result of the problem
of multicollinearity.

Table 5.8 Second model regression results (including both CPI and PPI )
Dependent variable: LOG(IMP)
Method: least squares
Date: 02/17/04 Time: 02:19
Sample: 1990:1 1998:2
Included observations: 34

Variable Coefficient Std. error t-statistic Prob.

C 0.213906 0.358425 0.596795 0.5551


LOG(GDP) 1.969713 0.156800 12.56198 0.0000
LOG(CPI) 1.025473 0.323427 3.170645 0.0035
LOG(PPI) −0.770644 0.305218 −2.524894 0.0171

R -squared 0.972006 Mean dependent var. 10.81363


Adjusted R -squared 0.969206 S.D. dependent var. 0.138427
S.E. of regression 0.024291 Akaike info criterion −4.487253
Sum squared resid. 0.017702 Schwarz criterion −4.307682
Log likelihood 80.28331 F -statistic 347.2135
Durbin–Watson stat. 0.608648 Prob(F -statistic) 0.000000

Table 5.9 Third model regression results (including only PPI )


Dependent variable: LOG(IMP)
Method: least squares
Date: 02/17/04 Time: 02:22
Sample: 1990:1 1998:2
Included observations: 34

Variable Coefficient Std. error t-statistic Prob.

C 0.685704 0.370644 1.850031 0.0739


LOG(GDP) 2.093849 0.172585 12.13228 0.0000
LOG(PPI) 0.119566 0.136062 0.878764 0.3863

R -squared 0.962625 Mean dependent var. 10.81363


Adjusted R -squared 0.960213 S.D. dependent var. 0.138427
S.E. of regression 0.027612 Akaike info criterion −4.257071
Sum squared resid. 0.023634 Schwarz criterion −4.122392
Log likelihood 75.37021 F -statistic 399.2113
Durbin–Watson stat. 0.448237 Prob(F -statistic) 0.000000

Estimating the equation this time without log(CPI) but with log(PPI) we get the
results in Table 5.9, which show that log(PPI) is positive and insignificant. It is clear
that the significance of log(PPI) in the specification above was a result of the linear
relationship that connects the two price variables.
The conclusions from this analysis are similar to the case of the collinear data set in
Example 1 above, and can be summarized as follows:

1 The correlation among the explanatory variables was very high.


2 Standard errors or t-ratios of the estimated coefficients changed from estimation
to estimation.

3 The stability of the estimated coefficients was also quite problematic, with nega-
tive and positive coefficients being estimated for the same variable in alternative
specifications.

In this case it is clear that multicollinearity is present, and that it is also serious,
because we included two price variables that are quite strongly correlated. We leave it as
an exercise for the reader to check the presence and the seriousness of multicollinearity
with only the inclusion of log(GDP) and log(CPI) as explanatory variables (see Exercise
5.1 below).
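For readers working outside EViews, the same sequence of regressions can be run in Python with pandas and statsmodels. The sketch below is illustrative only: it assumes the workfile has been exported to a hypothetical CSV file imports_uk.csv with columns IMP, GDP, CPI and PPI, which is not supplied with the text.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the EViews workfile; the file name and column names are assumed.
data = pd.read_csv("imports_uk.csv")
logs = np.log(data[["IMP", "GDP", "CPI", "PPI"]]).rename(columns=str.lower)

print(logs.corr())  # pairwise correlations of the logged series

# The three specifications discussed in the text
m_cpi  = smf.ols("imp ~ gdp + cpi", data=logs).fit()
m_both = smf.ols("imp ~ gdp + cpi + ppi", data=logs).fit()
m_ppi  = smf.ols("imp ~ gdp + ppi", data=logs).fit()

for m in (m_cpi, m_both, m_ppi):
    print(m.params, m.tvalues, sep="\n")  # note how the price coefficients change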

Questions and exercises


Questions
1 Define multicollinearity and explain its consequences for OLS estimates.
2 In the following model:

Y = β1 + β2 X2 + β3 X3 + β4 X4 + ut

assume that X4 is a perfect linear combination of X2 . Show that in this case it is


impossible to obtain OLS estimates.
3 From Chapter 4 we know that β̂ = (X′X)⁻¹(X′Y). What happens to β̂ when
there is perfect collinearity among the Xs? How would you know if perfect
collinearity exists?
4 Explain what the VIF is and its use.
5 Show how to detect possible multicollinearity in a regression model.

Exercise 5.1
The file imports_uk_y.wf1 contains yearly data (1972–1997) for real imports (IMP), real
gross domestic product (GDP) and the relative consumer price index (CPI) of domestic
to foreign prices for the US economy. Use these data to estimate the following model:

ln(IMP)t = β1 + β2 ln(GDP)t + β3 ln(CPI)t + ut

Check whether there is multicollinearity in the data. Calculate the correlation


matrix of the variables and comment regarding the possibility of multicollinearity.
Also, run the following additional regressions:

ln(IMP)t = β1 + β2 ln(GDP)t + ut
ln(IMP)t = β1 + β2 ln(CPI)t + ut
ln(GDP)t = β1 + β2 ln(CPI)t + ut

What can you conclude about the nature of multicollinearity from these results?

Exercise 5.2
The file imports_uk.wf1 contains quarterly observations of the variables mentioned in
Exercise 5.1. Repeat Exercise 5.1 using the quarterly (higher frequency) data. Do your
results change?

Exercise 5.3
Use the data in the file money_uk02.wf1 to estimate the parameters α, β and γ in the
equation below:

ln(M/P)t = α + β ln(Yt) + γ R1t + ut

where R1t is the three-month treasury bill rate. For the rest of the variables the usual
notation applies.

(a) Use as an additional variable in the above equation R2t , which is the dollar
interest rate.
(b) Do you expect to find multicollinearity? Why?
(c) Calculate the correlation matrix of all the variables. Which correlation coefficient
is the largest?
(d) Calculate auxiliary regressions and conclude whether the degree of multicollinear-
ity in (a) is serious or not.

Exercise 5.4
The file cars.wf1 contains annual data from 1971–1986 (16 observations) for new car
sales in the US. The variables are defined as follows:

cars_sold = new passenger cars sold (measured in thousands)


cpi_cars = new cars, consumer price index, 1967 = 100
cpi_all = consumer price index, all items, all urban consumers, 1967 = 100
disp = real personal disposable income (measured in billions of dollars)
int_rate = interest rate (measured in percentage)
lab_force = employed civilian labor force (measured in thousands)

(a) Estimate a log-linear model that examines a demand function for new cars,
including all explanatory variables as regressors. Discuss your results.
(b) In the model estimated above, do you expect to face multicollinearity? Discuss
why.
(c) Check if there is indeed multicollinearity among the regressors. How will you do
that? What are your conclusions?
(d) Based on your answer above, now develop a more suitable model that avoids the
problems of multicollinearity.

Exercise 5.5
Consider the following data:

Y −12 −8 −4 −3 1 2 4 8 18 13
X1 12 13 14 15 16 15 17 18 21 20
X2 21 23 25 27 29 27 31 33 39 37

(a) Can you obtain estimates of the coefficients of the following linear model:

Y = β0 + β1 X1 + β2 X2 + u

using the data given above?


(b) If your response to (a) was no, then how can you reparameterise the model in order
to obtain satisfactory estimates of the coefficients?

Exercise 5.6
The file manhours.xls contains the following variables:

manhours: monthly manhours needed to operate an establishment


occup: average daily occupancy
checkins: monthly average number of check-ins
service: weekly hours of service desk operation
comusearea: common use area (in square feet)
no_wings: number of building wings
berthing: operational berthing capacity
no_rooms: number of rooms

The data describes the manpower needs for operating a US Navy bachelor officers’
quarters. It contains cross-sectional observations for 25 establishments.

(a) Estimate a regression model that explains average daily occupancy (occup) from
all available explanatory variables. Do the results suggest multicollinearity?
(b) Are some of the explanatory variables collinear? How is this detected?
