
MULTICOLLINEARITY

The term multicollinearity is due to Ragnar Frisch. Originally it meant the existence of a

“perfect,” or exact, linear relationship among some or all explanatory variables of a regression

model.

For the k-variable regression involving explanatory variables X1, X2, . . . , Xk (where X1 = 1 for

all observations to allow for the intercept term), an exact linear relationship is said to exist

if the following condition is satisfied:

λ1X1 + λ2X2 + ... + λkXk = 0

where λ1, λ2, . . . , λk are constants such that not all of them are zero simultaneously.

Strictly speaking, perfect multicollinearity is the violation of Classical Assumption VI (that

no independent variable is a perfect linear function of one or more other independent

variables). Perfect multicollinearity is rare, but severe imperfect multicollinearity, although

not violating Classical Assumption VI, still causes substantial problems.

Today, however, the term multicollinearity is used in a broader sense to include the case of

perfect multicollinearity, as well as the case where the X variables are intercorrelated but

not perfectly.
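
To make the definition concrete, the short Python sketch below (the numbers are purely illustrative; the document itself gives no code) builds a design matrix in which X3 = 2X2, so the condition above holds with λ2 = 2 and λ3 = −1, and X′X becomes singular.

# A minimal sketch of perfect multicollinearity with hypothetical data:
# X3 = 2*X2, so 2*X2 - X3 = 0 exactly.
import numpy as np

X1 = np.ones(5)                     # intercept column
X2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X3 = 2.0 * X2                       # exact linear function of X2
X = np.column_stack([X1, X2, X3])

print(np.linalg.matrix_rank(X))     # 2, not 3: the columns are linearly dependent
print(np.linalg.det(X.T @ X))       # (numerically) zero, so (X'X)^-1 does not exist
# Under perfect multicollinearity the OLS coefficients are therefore indeterminate.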

Recall that the coefficient βK can be thought of as the impact on the dependent variable of

a one-unit increase in the independent variable xK , holding constant the other independent

variables in the equation. If two explanatory variables are significantly related, then the OLS

computer program will find it difficult to distinguish the effects of one variable from the

effects of the other. In essence, the more highly correlated two (or more) independent
variables are, the more difficult it becomes to accurately estimate the coefficients of the true

model. If two variables move identically, then there is no hope of distinguishing between

their impacts, but if the variables are only roughly correlated, then we still might be able to

estimate the two effects accurately enough for most purposes.

CAUSES OF MULTICOLLINEARITY

There are several sources of multicollinearity. As Montgomery and Peck note,

multicollinearity may be due to the following factors:

1. The data collection method employed, for example, sampling over a limited range of the

values taken by the regressors in the population.

2. Constraints on the model or in the population being sampled. For example, in the

regression of electricity consumption on income (X2) and house size (X3) there is a physical

constraint in the population in that families with higher incomes generally have larger homes

than families with lower incomes.

3. Model specification, for example, adding polynomial terms to a regression model,

especially when the range of the X variable is small.

4. An overdetermined model. This happens when the model has more explanatory variables

than the number of observations. This could happen in medical research where there may

be a small number of patients about whom information is collected on a large number of

variables.

An additional reason for multicollinearity, especially in time series data, may be that the

regressors included in the model share a common trend, that is, they all increase or decrease
over time. Thus, in the regression of consumption expenditure on income, wealth, and

population, the regressors income, wealth, and population may all be growing over time at

more or less the same rate, leading to collinearity among these variables.

CONSEQUENCES OF MULTICOLLINEARITY

In cases of near or high multicollinearity, one is likely to encounter the following

consequences:

1. Although BLUE, the OLS estimators have large variances and covariances, making precise

estimation difficult.

The variances and standard errors of the estimates will increase. This is the principal

consequence of multicollinearity. Since two or more of the explanatory variables are

significantly related, it becomes difficult to precisely identify the separate effects of the

multicollinear variables. When it becomes hard to distinguish the effect of one variable from

the effect of another, we’re much more likely to make large errors in estimating the βs than

we were before we encountered multicollinearity. As a result, the estimated coefficients,

although still unbiased, now come from distributions with much larger variances and,

therefore, larger standard errors (a simulation sketch after this list illustrates the point).

2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the

acceptance of the “zero null hypothesis” (i.e., the true population coefficient is zero) more

readily.

3. Also because of consequence 1, the t ratio of one or more coefficients tends to be

statistically insignificant.
4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the overall

measure of goodness of fit, can be very high.

5. The OLS estimators and their standard errors can be sensitive to small changes in the

data.
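
As an illustration of the first three consequences above, the following Python sketch (simulated data; the true coefficients and the correlation levels are chosen arbitrarily) estimates the same model twice, once with weakly and once with strongly correlated regressors, and compares the OLS standard errors.

# A small Monte Carlo sketch of consequence 1, under assumed true coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
beta = np.array([1.0, 2.0, 3.0])   # hypothetical true intercept and slopes

def ols_std_errors(rho):
    """Simulate y = 1 + 2*X2 + 3*X3 + u with corr(X2, X3) = rho and return OLS standard errors."""
    cov = [[1.0, rho], [rho, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    X = sm.add_constant(x)          # adds the intercept column
    y = X @ beta + rng.normal(size=n)
    return sm.OLS(y, X).fit().bse   # standard errors of the estimates

print("standard errors, corr = 0.10:", ols_std_errors(0.10))
print("standard errors, corr = 0.95:", ols_std_errors(0.95))
# The slope standard errors are markedly larger in the collinear case, so confidence
# intervals widen and t ratios shrink, even though the estimators remain unbiased.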

NOTE:

As Christopher Achen remarks:

Beginning students of methodology occasionally worry that their independent variables are

correlated—the so-called multicollinearity problem. But multicollinearity violates no

regression assumptions. Unbiased, consistent estimates will occur, and their standard errors

will be correctly estimated. The only effect of multicollinearity is to make it hard to get

coefficient estimates with small standard error. But having a small number of observations

also has that effect, as does having independent variables with small variances. (In fact, at a

theoretical level, multicollinearity, few observations and small variances on the independent

variables are essentially all the same problem.) Thus “What should I do about

multicollinearity?” is a question like “What should I do if I don’t have many observations?”

No statistical answer can be given.

To drive home the importance of sample size, Goldberger coined the term micronumerosity,

to counter the exotic polysyllabic name multicollinearity. According to Goldberger, exact

micronumerosity (the counterpart of exact multicollinearity) arises when n, the sample size,

is zero, in which case any kind of estimation is impossible. Near micronumerosity, like near
multicollinearity, arises when the number of observations barely exceeds the number of

parameters to be estimated.

Leamer, Achen, and Goldberger are right in regretting the lack of attention given to the

sample size problem and the undue attention to the multicollinearity problem.

Unfortunately, in applied work involving secondary data (i.e., data collected by some agency,

such as the GNP data collected by the government), an individual researcher may not be

able to do much about the size of the sample data and may have to face “estimating problems

important enough to warrant our treating it [i.e., multicollinearity] as a violation of the CLR

[classical linear regression] model.”

First, it is true that even in the case of near multicollinearity the OLS estimators are unbiased.

But unbiasedness is a multisample or repeated sampling property. What it means is that,

keeping the values of the X variables fixed, if one obtains repeated samples and computes

the OLS estimators for each of these samples, the average of the sample values will converge

to the true population values of the estimators as the number of samples increases. But this

says nothing about the properties of estimators in any given sample.

Second, it is also true that collinearity does not destroy the property of minimum variance: In

the class of all linear unbiased estimators, the OLS estimators have minimum variance; that

is, they are efficient. But this does not mean that the variance of an OLS estimator will

necessarily be small (in relation to the value of the estimator) in any given sample.

Third, multicollinearity is essentially a sample (regression) phenomenon in the sense that

even if the X variables are not linearly related in the population, they may be so related in
the particular sample at hand: When we postulate the theoretical or population regression

function (PRF), we believe that all the X variables included in the model have a separate or

independent influence on the dependent variable Y. But it may happen that in any given

sample that is used to test the PRF some or all of the X variables are so highly collinear that

we cannot isolate their individual influence on Y.

For all these reasons, the fact that the OLS estimators are BLUE despite multicollinearity

is of little consolation in practice. We must see what happens or is likely to happen in any

given sample.

DETECTION OF MULTICOLLINEARITY

NOTE

Here it is useful to bear in mind Kmenta’s warning:

1. Multicollinearity is a question of degree and not of kind. The meaningful

distinction is not between the presence and the absence of multicollinearity, but

between its various degrees.

2. Since multicollinearity refers to the condition of the explanatory variables that are

assumed to be nonstochastic, it is a feature of the sample and not of the population.

Therefore, we do not “test for multicollinearity” but can, if we wish, measure its

degree in any particular sample.

1. High 𝐑𝟐 but few significant t ratios. This is the “classic” symptom of multicollinearity. If

R2 is high, say, in excess of 0.8, the F test in most cases will reject the hypothesis that the
partial slope coefficients are simultaneously equal to zero, but the individual t tests will show

that none or very few of the partial slope coefficients are statistically different from zero.

Although this diagnostic is sensible, its disadvantage is that “it is too strong in the sense

that multicollinearity is considered as harmful only when all of the influences of the

explanatory variables on Y cannot be disentangled.”
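
A small simulated example of this symptom is sketched below (all numbers are hypothetical): two nearly identical regressors produce a high R2 and a clearly significant F statistic, while the individual t ratios are typically insignificant.

# A hedged illustration of the "high R2, few significant t ratios" symptom.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)       # x3 is almost a copy of x2
y = 1.0 + 2.0 * x2 + 2.0 * x3 + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
print("R-squared:      ", round(res.rsquared, 3))   # typically well above 0.8
print("F-test p-value: ", res.f_pvalue)             # joint significance is clear
print("t-test p-values:", res.pvalues[1:])          # individual slopes are often insignificant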

2. High pair-wise correlations among regressors. Another suggested rule of thumb is that if

the pair-wise or zero-order correlation coefficient between two regressors is high, say, in

excess of 0.8, then multicollinearity is a serious problem. The problem with this criterion is

that, although high zero-order correlations may suggest collinearity, it is not necessary that

they be high to have collinearity in any specific case. To put the matter somewhat

technically, high zero-order correlations are a sufficient but not a necessary condition for

the existence of multicollinearity because it can exist even though the zero-order or simple

correlations are comparatively low (say, less than 0.50). Therefore, in models involving more

than two explanatory variables, the simple or zero-order correlation will not provide an

infallible guide to the presence of multicollinearity. Of course, if there are only two

explanatory variables, the zero-order correlations will suffice.
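
In practice this check is just a correlation matrix of the regressors; a minimal Python sketch follows (the variable names income, wealth, and hhsize and the simulated numbers are purely illustrative).

# Compute pairwise (zero-order) correlations and flag those above the 0.8 rule of thumb.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
income = rng.normal(50, 10, size=30)
wealth = 5 * income + rng.normal(0, 5, size=30)   # wealth moves closely with income
hhsize = rng.integers(1, 7, size=30)              # an unrelated regressor
df = pd.DataFrame({"income": income, "wealth": wealth, "hhsize": hhsize})

corr = df.corr()
print(corr.round(2))

# Off-diagonal pairs whose absolute correlation exceeds 0.8.
mask = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(corr.where(mask).stack())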

3. Auxiliary regressions. Since multicollinearity arises because one or more of the regressors

are exact or approximately linear combinations of the other regressors, one way of finding

out which X variable is related to other X variables is to regress each Xi on the remaining

X variables and compute the corresponding 𝐑𝟐 , which we designate as R2i ; each one of
these regressions is called an auxiliary regression, auxiliary to the main regression of Y on

the X’s.

Instead of formally testing all auxiliary 𝐑𝟐 values, one may adopt Klein’s rule of thumb,

which suggests that multicollinearity may be a troublesome problem only if the R2 obtained

from an auxiliary regression is greater than the overall R2, that is, that obtained from the

regression of Y on all the regressors. Of course, like all other rules of thumb, this one should

be used judiciously.
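
A hedged sketch of the auxiliary-regression approach is given below (simulated data, hypothetical variable names): each regressor is regressed on the others and its auxiliary R2 is compared with the overall R2, following Klein's rule of thumb.

# Auxiliary regressions: regress each X on the remaining X's and compare R2 values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + rng.normal(scale=0.3, size=n)    # collinear with x2
x4 = rng.normal(size=n)                          # roughly independent regressor
y = 1.0 + x2 + x3 + x4 + rng.normal(size=n)

X = np.column_stack([x2, x3, x4])
overall_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared

for j, name in enumerate(["x2", "x3", "x4"]):
    others = np.delete(X, j, axis=1)                         # drop the j-th regressor
    aux_r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    verdict = "troublesome by Klein's rule" if aux_r2 > overall_r2 else "not flagged"
    print(f"{name}: auxiliary R2 = {aux_r2:.3f} vs overall R2 = {overall_r2:.3f} ({verdict})")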

4. Eigenvalues and condition index.

From the eigenvalues of the X′X matrix we can derive what is known as the condition number k, defined as

k = Maximum eigenvalue / Minimum eigenvalue

and the condition index (CI), defined as

CI = √(Maximum eigenvalue / Minimum eigenvalue) = √k

Rule of thumb: If k is between 100 and 1000 there is moderate to strong multicollinearity

and if it exceeds 1000 there is severe multicollinearity. Alternatively, if the CI (√k) is between

10 and 30 there is moderate to strong multicollinearity and if it exceeds 30 there is severe

multicollinearity.
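
The sketch below computes these quantities for a simulated design matrix (scaling each column to unit length is a common convention and is an assumption here, since the text does not fix one).

# Condition number and condition index from the eigenvalues of the scaled X'X matrix.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)        # nearly collinear with x2
X = np.column_stack([np.ones(n), x2, x3])

Xs = X / np.linalg.norm(X, axis=0)              # scale each column to unit length
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)         # eigenvalues of the scaled X'X

k = eigvals.max() / eigvals.min()               # condition number
ci = np.sqrt(k)                                 # condition index
print(f"condition number k = {k:.1f}, condition index CI = {ci:.1f}")
# By the rule of thumb above, CI between 10 and 30 signals moderate to strong
# multicollinearity, and CI above 30 signals severe multicollinearity.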

5. Tolerance (TOL) and variance inflation factor (VIF).


As R2j, the coefficient of determination in the regression of regressor Xj on the remaining regressors in the model, increases toward unity, that is, as the collinearity of Xj with the other regressors increases, the variance inflation factor, VIFj = 1/(1 − R2j), also increases and in the limit it can be infinite. Some authors therefore use the VIF as

an indicator of multicollinearity. The larger the value of VIF, the more “troublesome” or

collinear the variable Xj. As a rule of thumb, if the VIF of a variable exceeds 10, which will

happen if R2j exceeds 0.90, that variable is said to be highly collinear. Of course, one could use

TOL as a measure of multicollinearity in view of its intimate connection with VIF. The

closer is TOL to zero, the greater the degree of collinearity of that variable with the other

regressors. On the other hand, the closer TOL is to 1, the greater the evidence that X is not

collinear with the other regressors. VIF (or tolerance) as a measure of collinearity is not free

of criticism. As

var(β̂j) = (σ² / ∑X²j) · VIFj

shows, var(β̂j) depends on three factors: σ², ∑X²j, and VIFj. A high VIF can be counterbalanced by a low σ² or a high ∑X²j. To put it differently, a high VIF is neither necessary nor sufficient to get high variances and high standard errors. Therefore, high

multicollinearity, as measured by a high VIF, may not necessarily cause high standard errors.

NOTE

TOL = 1/VIF
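
A short sketch of computing VIF and TOL for each regressor follows, using the variance_inflation_factor helper from statsmodels (the data are simulated and the variable names are hypothetical).

# VIF and TOL for each regressor; a VIF above 10 is the usual warning flag.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + rng.normal(scale=0.2, size=n)      # strongly collinear with x2
x4 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x2, x3, x4]))  # column 0 is the intercept

for j, name in zip([1, 2, 3], ["x2", "x3", "x4"]):
    vif = variance_inflation_factor(X, j)           # VIFj = 1 / (1 - R2j)
    print(f"{name}: VIF = {vif:.2f}, TOL = {1.0 / vif:.3f}")
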
REMEDIES OF MULTICOLLINEARITY

1. Do Nothing
The first step to take once severe multicollinearity has been diagnosed is to decide whether
anything should be done at all. As we’ll see, it turns out that every remedy for
multicollinearity has a drawback of some sort, and so it often happens that doing nothing
is the correct course of action. One reason for doing nothing is that multicollinearity in an
equation will not always reduce the t-scores enough to make them insignificant or change
the βs enough to make them differ from expectations. In other words, the mere existence
of multicollinearity does not necessarily mean anything. A remedy for multicollinearity
should be considered only if the consequences cause insignificant t-scores or unreliable
estimated coefficients.
A second reason for doing nothing is that the deletion of a multicollinear variable that
belongs in an equation will cause specification bias. If we drop a theoretically important
variable, then we are purposely creating bias. Given all the effort typically spent avoiding
omitted variables, it seems foolhardy to consider running that risk on purpose.
The final reason for considering doing nothing to offset multicollinearity is that every time
a regression is rerun, we risk encountering a specification that fits because it accidentally
works for the particular data set involved, not because it is the truth. The larger the number
of experiments, the greater the chances of finding the accidental result. To make things
worse, when there is significant multicollinearity in the sample, the odds of strange results
increase rapidly because of the sensitivity of the coefficient estimates to slight
specification changes.
2. Drop a Redundant Variable
On occasion, the simple solution of dropping one of the multicollinear variables is a good

one. For example, some inexperienced researchers include too many variables in their

regressions, not wanting to face omitted variable bias. As a result, they often have two or

more variables in their equations that are measuring essentially the same thing. In such a

case the multicollinear variables are not irrelevant, since any one of them is quite probably

theoretically and statistically sound. Instead, the variables might be called redundant; only
one of them is needed to represent the effect on the dependent variable that all of them

currently represent.

3. Increase the Size of the Sample

Another way to deal with multicollinearity is to attempt to increase the size of the sample

to reduce the degree of multicollinearity. Although such an increase may be impossible, it’s

a useful alternative to be considered when feasible.

OTHER REMEDIES INCLUDE:

• A priori information, e.g., in the regression of consumption on income and wealth.

• Combining cross-sectional and time series data.

• Transforming variables, e.g., using a first-difference regression model (illustrated in the sketch after this list).

The first difference regression model often reduces the severity of multicollinearity

because, although the levels of X2 and X3 may be highly correlated, there is no a

priori reason to believe that their differences will also be highly correlated.

• Ridge regression.
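
A minimal sketch of the first-difference idea mentioned above (simulated trending series; the trend slopes and noise levels are arbitrary) shows how differencing removes a shared time trend and, with it, most of the collinearity in the levels.

# Levels of two trending regressors are highly correlated; their first differences are not.
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(100)
x2 = 5.0 * t + rng.normal(scale=10, size=100)    # trending regressor
x3 = 3.0 * t + rng.normal(scale=10, size=100)    # another trending regressor

print("correlation of levels:     ", round(np.corrcoef(x2, x3)[0, 1], 3))
print("correlation of differences:", round(np.corrcoef(np.diff(x2), np.diff(x3))[0, 1], 3))
# Differencing removes the shared trend, so the collinearity in levels largely disappears.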

SUMMARY

1. Perfect multicollinearity is the violation of the assumption that no explanatory variable is

a perfect linear function of other explanatory variable(s). Perfect multicollinearity results in

indeterminate estimates of the regression coefficients and infinite standard errors of those

estimates, making OLS estimation impossible.


2. Imperfect multicollinearity, which is what is typically meant when the word

“multicollinearity” is used, is a linear relationship between two or more independent

variables that is strong enough to significantly

affect the estimation of the equation. Multicollinearity is a sample phenomenon as well as

a theoretical one. Different samples can exhibit different degrees of multicollinearity.

3. The major consequence of severe multicollinearity is to increase the variances of the

estimated regression coefficients and therefore decrease the calculated t-scores of those

coefficients and expand the confidence intervals. Multicollinearity causes no bias in the

estimated coefficients, and it has little effect on the overall significance of the regression or

on the estimates of the coefficients of any nonmulticollinear explanatory variables.

4. Since multicollinearity exists, to one degree or another, in virtually every data set, the

question to be asked in detection is how severe the multicollinearity in a particular sample

is.

5. Two useful methods for the detection of severe multicollinearity are:

a. Are the simple correlation coefficients between the explanatory

variables high?

b. Are the variance inflation factors high?

If either of these answers is yes, then multicollinearity certainly exists, but multicollinearity

can also exist even if the answers are no.

6. The three most common remedies for multicollinearity are:

a. Do nothing (and thus avoid specification bias).


b. Drop a redundant variable.

c. Increase the size of the sample.

7. Quite often, doing nothing is the best remedy for multicollinearity. If the multicollinearity

has not decreased t-scores to the point of insignificance, then no remedy should even be

considered as long as the variables are theoretically strong. Even if the t-scores are

insignificant, remedies should be undertaken cautiously, because all impose costs on the

estimation that may be greater than the potential benefit of ridding the equation of

multicollinearity.

PREPARED BY: SAMIJI, ALLY

2019
