Multicollinearity Samiji
The term multicollinearity is due to Ragnar Frisch. Originally it meant the existence of a
“perfect,” or exact, linear relationship among some or all explanatory variables of a regression
model.
For the k-variable regression involving explanatory variables X1, X2, . . . , Xk (where X1 = 1 for
all observations to allow for the intercept term), an exact linear relationship is said to exist
if the following condition is satisfied:

λ1X1 + λ2X2 + · · · + λkXk = 0

where λ1, λ2, . . . , λk are constants such that not all of them are zero simultaneously.
Today, however, the term multicollinearity is used in a broader sense to include the case of
perfect multicollinearity, as well as the case where the X variables are intercorrelated but
not perfectly.
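To make the distinction concrete, here is a minimal Python sketch (my own illustration with made-up numbers, assuming NumPy is available): when X3 is an exact multiple of X2, the columns of the data matrix are linearly dependent and the collinearity is perfect; adding a small random error gives high, but not perfect, collinearity.

# Sketch: perfect vs. near-perfect collinearity with made-up numbers.
# With x3 = 5*x2 exactly (lambda2 = 5, lambda3 = -1), the columns of X are
# linearly dependent, so X'X is singular and OLS cannot separate the effects.
import numpy as np

rng = np.random.default_rng(0)
x2 = np.array([10., 15., 18., 24., 30.])
x3_perfect = 5 * x2                      # exact linear relationship
x3_near = 5 * x2 + rng.normal(0, 1, 5)   # high but not perfect collinearity

X_perfect = np.column_stack([np.ones(5), x2, x3_perfect])
X_near = np.column_stack([np.ones(5), x2, x3_near])

print(np.linalg.matrix_rank(X_perfect))   # 2 < 3: perfect multicollinearity
print(np.linalg.matrix_rank(X_near))      # 3: columns are linearly independent
print(np.corrcoef(x2, x3_near)[0, 1])     # close to 1: near collinearity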
Recall that the coefficient βk can be thought of as the impact on the dependent variable of
a one-unit increase in the independent variable Xk, holding constant the other independent
variables in the equation. If two explanatory variables are significantly related, then the OLS
computer program will find it difficult to distinguish the effects of one variable from the
effects of the other. In essence, the more highly correlated two (or more) independent
variables are, the more difficult it becomes to accurately estimate the coefficients of the true
model. If two variables move identically, then there is no hope of distinguishing between
their impacts, but if the variables are only roughly correlated, then we still might be able to
estimate their separate effects accurately enough for most purposes.
CAUSES OF MULTICOLLINEARITY
1. The data collection method employed, for example, sampling over a limited range of the
values taken by the regressors in the population.
2. Constraints on the model or in the population being sampled. For example, in the
regression of electricity consumption on income (X2) and house size (X3) there is a physical
constraint in the population in that families with higher incomes generally have larger homes
than families with lower incomes.
3. Model specification, for example, adding polynomial terms to a regression model,
especially when the range of the X variable is small.
4. An overdetermined model. This happens when the model has more explanatory variables
than the number of observations. This could happen in medical research where there may
be a small number of patients about whom information is collected on a large number of
variables.
An additional reason for multicollinearity, especially in time series data, may be that the
regressors included in the model share a common trend, that is, they all increase or decrease
over time. Thus, in the regression of consumption expenditure on income, wealth, and
population, the regressors income, wealth, and population may all be growing over time at
more or less the same rate, leading to collinearity among these variables.
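As a rough simulated illustration of this point (my own sketch, not part of the original notes; it assumes NumPy), two series that merely share a common time trend can be almost perfectly correlated:

# Sketch: two otherwise unrelated series that both grow over time end up
# highly correlated simply because they share a common trend.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(100)
income = 100 + 2.0 * t + rng.normal(0, 5, 100)   # trending series 1
wealth = 500 + 6.0 * t + rng.normal(0, 15, 100)  # trending series 2

print(np.corrcoef(income, wealth)[0, 1])   # typically around 0.99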
CONSEQUENCES OF MULTICOLLINEARITY
In cases of high (but not perfect) multicollinearity, one is likely to encounter the following
consequences:
1. Although BLUE, the OLS estimators have large variances and covariances, making precise
estimation difficult.
The variances and standard errors of the estimates will increase. This is the principal
consequence of multicollinearity. Since two or more of the explanatory variables are
significantly related, it becomes difficult to precisely identify the separate effects of the
multicollinear variables. When it becomes hard to distinguish the effect of one variable from
the effect of another, we are much more likely to make large errors in estimating the βs than
we were before we encountered multicollinearity. As a result, the estimated coefficients,
although still unbiased, now come from distributions with much larger variances and,
therefore, larger standard errors (a simulated sketch after this list illustrates the effect).
2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the
acceptance of the “zero null hypothesis” (i.e., the true population coefficient is zero) more
readily.
3. Also because of consequence 1, the t ratio of one or more coefficients tends to be
statistically insignificant.
4. Although the t ratio of one or more coefficients is statistically insignificant, R², the overall
measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the
data.
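The following simulated sketch (my own example, assuming NumPy and statsmodels) illustrates consequences 1 and 2: the same model is estimated once with weakly correlated regressors and once with highly correlated ones, and the reported standard errors on the collinear coefficients blow up even though the estimator remains unbiased.

# Sketch: the same true model with low vs. high collinearity between x2 and x3.
# OLS stays unbiased in both cases, but the standard errors become much larger
# when the regressors are highly correlated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
x2 = rng.normal(0, 1, n)

for noise_sd in (5.0, 0.05):                # large noise -> low collinearity
    x3 = x2 + rng.normal(0, noise_sd, n)    # small noise -> high collinearity
    y = 1 + 2 * x2 + 3 * x3 + rng.normal(0, 1, n)
    X = sm.add_constant(np.column_stack([x2, x3]))
    res = sm.OLS(y, X).fit()
    print(round(np.corrcoef(x2, x3)[0, 1], 3), res.bse[1:])  # correlation, std. errors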
NOTE:
As Christopher Achen remarks (note also the Leamer quote at the beginning of this chapter):
Beginning students of methodology occasionally worry that their independent variables are
correlated, the so-called multicollinearity problem. But multicollinearity violates no
regression assumptions. Unbiased, consistent estimates will occur, and their standard errors
will be correctly estimated. The only effect of multicollinearity is to make it hard to get
coefficient estimates with small standard error. But having a small number of observations
also has that effect, as does having independent variables with small variances. (In fact, at a
theoretical level, multicollinearity, few observations and small variances on the independent
variables are essentially all the same problem.) Thus “What should I do about
multicollinearity?” is a question like “What should I do if I don’t have many observations?”
No statistical answer can be given.
To drive home the importance of sample size, Goldberger coined the term micronumerosity,
to counter the exotic polysyllabic name multicollinearity. According to Goldberger, exact
micronumerosity (the counterpart of exact multicollinearity) arises when n, the sample size,
is zero, in which case any kind of estimation is impossible. Near micronumerosity, like near
multicollinearity, arises when the number of observations barely exceeds the number of
parameters to be estimated.
Leamer, Achen, and Goldberger are right in regretting the lack of attention given to the
sample size problem and the undue attention to the multicollinearity problem.
Unfortunately, in applied work involving secondary data (i.e., data collected by some agency,
such as the GNP data collected by the government), an individual researcher may not be
able to do much about the size of the sample data and may have to face “estimating problems
important enough to warrant our treating it [i.e., multicollinearity] as a violation of the CLR
(classical linear regression) model.”
First, it is true that even in the case of near multicollinearity the OLS estimators are unbiased.
But unbiasedness is a multisample or repeated sampling property. This means that,
keeping the values of the X variables fixed, if one obtains repeated samples and computes
the OLS estimators for each of these samples, the average of the sample values will converge
to the true population values of the estimators as the number of samples increases. But this
says nothing about the properties of estimators in any given sample.
Second, it is also true that collinearity does not destroy the property of minimum variance: In
the class of all linear unbiased estimators, the OLS estimators have minimum variance; that
is, they are efficient. But this does not mean that the variance of an OLS estimator will
necessarily be small (in relation to the value of the estimator) in any given sample.
Third, multicollinearity is essentially a sample (regression) phenomenon, in the sense that
even if the X variables are not linearly related in the population, they may be so related in
the particular sample at hand: When we postulate the theoretical or population regression
function (PRF), we believe that all the X variables included in the model have a separate or
independent influence on the dependent variable Y. But it may happen that in any given
sample that is used to test the PRF some or all of the X variables are so highly collinear that
we cannot isolate their individual influence on Y.
For all these reasons, the fact that the OLS estimators are BLUE despite multicollinearity
is of little consolation in practice. We must see what happens or is likely to happen in any
given sample.
DETECTION OF MULTICOLLINEARITY
NOTE
1. Multicollinearity is a question of degree and not of kind. The meaningful
distinction is not between the presence and the absence of multicollinearity, but
between its various degrees.
2. Since multicollinearity refers to the condition of the explanatory variables that are
assumed to be nonstochastic, it is a feature of the sample and not of the population.
Therefore, we do not “test for multicollinearity” but can, if we wish, measure its
degree in any particular sample.
1. High R² but few significant t ratios. This is the “classic” symptom of multicollinearity. If
R2 is high, say, in excess of 0.8, the F test in most cases will reject the hypothesis that the
partial slope coefficients are simultaneously equal to zero, but the individual t tests will show
that none or very few of the partial slope coefficients are statistically different from zero.
Although this diagnostic is sensible, its disadvantage is that “it is too strong in the sense
that multicollinearity is considered as harmful only when all of the influences of the
explanatory variables on Y cannot be disentangled.”
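A simulated sketch of this symptom (my own example, assuming NumPy and statsmodels): the two regressors jointly explain Y well, so R² is high and the F test rejects, yet the individual t statistics are typically small.

# Sketch: high R2 and significant F test, but insignificant individual t ratios,
# because x2 and x3 are nearly the same variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50
x2 = rng.normal(0, 1, n)
x3 = x2 + rng.normal(0, 0.05, n)             # nearly identical to x2
y = 1 + 1.0 * x2 + 1.0 * x3 + rng.normal(0, 1, n)

res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
print(res.rsquared, res.f_pvalue)            # high R2, F test rejects
print(res.tvalues[1:], res.pvalues[1:])      # individual t's usually insignificant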
2. High pair-wise correlations among regressors. Another suggested rule of thumb is that if
the pair-wise or zero-order correlation coefficient between two regressors is high, say, in
excess of 0.8, then multicollinearity is a serious problem. The problem with this criterion is
that, although high zero-order correlations may suggest collinearity, it is not necessary that
they be high to have collinearity in any specific case. To put the matter somewhat
technically, high zero-order correlations are a sufficient but not a necessary condition for
the existence of multicollinearity because it can exist even though the zero-order or simple
correlations are comparatively low (say, less than 0.50). Therefore, in models involving more
than two explanatory variables, the simple or zero-order correlation will not provide an
infallible guide to the presence of multicollinearity. Of course, if there are only two
explanatory variables, the zero-order correlation coefficient will suffice.
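A minimal sketch of this check (my own example with hypothetical variable names X2, X3, X4, assuming NumPy and pandas):

# Sketch: inspect the pair-wise (zero-order) correlations among the regressors
# and flag pairs with |r| above about 0.8.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({"X2": rng.normal(size=50)})
df["X3"] = 2 * df["X2"] + rng.normal(0, 0.1, 50)   # strongly related to X2
df["X4"] = rng.normal(size=50)                     # unrelated regressor

print(df.corr().round(2))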
3. Auxiliary regressions. Since multicollinearity arises because one or more of the regressors
are exact or approximately linear combinations of the other regressors, one way of finding
out which X variable is related to other X variables is to regress each Xi on the remaining
X variables and compute the corresponding R², which we designate as R²i; each one of
these regressions is called an auxiliary regression, auxiliary to the main regression of Y on
the X’s.
Instead of formally testing all auxiliary R² values, one may adopt Klein’s rule of thumb,
which suggests that multicollinearity may be a troublesome problem only if the R2 obtained
from an auxiliary regression is greater than the overall R2, that is, that obtained from the
regression of Y on all the regressors. Of course, like all other rules of thumb, this one should
be used judiciously.
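A sketch of the auxiliary-regression check and Klein’s rule of thumb (my own simulated example, assuming NumPy and statsmodels): each regressor is regressed on the others, and the auxiliary R² values are compared with the R² of the main regression.

# Sketch: auxiliary regressions. An auxiliary R2 larger than the main R2 of
# the regression of Y on all the X's signals troublesome collinearity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 80
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = x2 + 0.1 * rng.normal(size=n)               # nearly a copy of x2
X = np.column_stack([x2, x3, x4])
y = 1 + X @ np.array([1.0, 0.5, 1.0]) + rng.normal(size=n)

main_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
for j, name in enumerate(["X2", "X3", "X4"]):
    others = np.delete(X, j, axis=1)             # regress Xj on the remaining X's
    aux_r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    print(name, "auxiliary R2 =", round(aux_r2, 3), "| main R2 =", round(main_r2, 3))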
4. Eigenvalues and condition index. Several regression packages report the eigenvalues of
the X′X matrix. From these eigenvalues we can derive what is known as the condition number
k, defined as

k = Maximum eigenvalue / Minimum eigenvalue

and the condition index, CI = √k.

Rule of thumb: If k is between 100 and 1000 there is moderate to strong multicollinearity,
and if it exceeds 1000 there is severe multicollinearity. Alternatively, if the CI (= √k) is between
10 and 30 there is moderate to strong multicollinearity, and if it exceeds 30 there is severe
multicollinearity.
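A sketch of the computation (my own simulated example, assuming NumPy; note that software packages often rescale the columns of X before extracting eigenvalues, so reported condition numbers can differ from this raw X′X version):

# Sketch: condition number k and condition index CI from the eigenvalues of X'X.
import numpy as np

rng = np.random.default_rng(6)
n = 80
x2 = rng.normal(size=n)
x3 = x2 + 0.05 * rng.normal(size=n)        # highly collinear with x2
X = np.column_stack([np.ones(n), x2, x3])

eigvals = np.linalg.eigvalsh(X.T @ X)      # eigenvalues of the symmetric matrix X'X
k = eigvals.max() / eigvals.min()          # condition number
ci = np.sqrt(k)                            # condition index
print(k, ci)                               # CI above 30 suggests severe collinearity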
5. Tolerance and variance inflation factor. As R²j, the coefficient of determination in the
regression of regressor Xj on the remaining regressors in the model, increases toward unity,
that is, as the collinearity of Xj with the other regressors increases, the variance inflation
factor, VIFj = 1/(1 − R²j), also increases and in the limit can become infinite. The VIF can
therefore be used as an indicator of multicollinearity. The larger the value of VIF, the more
“troublesome” or collinear the variable Xj. As a rule of thumb, if the VIF of a variable exceeds
10, which will happen if R²j exceeds 0.90, that variable is said to be highly collinear. Of course,
one could use TOL (tolerance) as a measure of multicollinearity in view of its intimate
connection with VIF. The closer TOL is to zero, the greater the degree of collinearity of that
variable with the other regressors. On the other hand, the closer TOL is to 1, the greater the
evidence that Xj is not collinear with the other regressors. VIF (or tolerance) as a measure
of collinearity is not free of criticism. As
Var(β̂j) = (σ² / Σx²j) · VIFj

where Σx²j is the sum of squared deviations of Xj from its mean, shows, var(β̂j) depends on
three factors: σ², Σx²j, and VIFj. A high VIF can be counterbalanced by a low σ² or a high
Σx²j. To put it differently, a high VIF is neither
necessary nor sufficient to get high variances and high standard errors. Therefore, high
multicollinearity, as measured by a high VIF, may not necessarily cause high standard errors.
NOTE: TOLj = 1/VIFj = 1 − R²j
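A sketch of computing VIF and TOL (my own simulated example; variance_inflation_factor is the statsmodels helper used here, and the variable names are hypothetical):

# Sketch: VIF and TOL for each slope regressor. The design matrix includes a
# constant; the constant's own VIF is skipped.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + 0.1 * rng.normal(size=n)         # collinear with x2
x4 = rng.normal(size=n)                    # roughly independent
X = sm.add_constant(np.column_stack([x2, x3, x4]))

for j, name in enumerate(["X2", "X3", "X4"], start=1):
    vif = variance_inflation_factor(X, j)
    print(name, "VIF =", round(vif, 1), "TOL =", round(1.0 / vif, 3))  # VIF > 10 flags X2, X3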
REMEDIES OF MULTICOLLINEARITY
1. Do Nothing
The first step to take once severe multicollinearity has been diagnosed is to decide whether
anything should be done at all. As we’ll see, it turns out that every remedy for
multicollinearity has a drawback of some sort, and so it often happens that doing nothing
is the correct course of action. One reason for doing nothing is that multicollinearity in an
equation will not always reduce the t-scores enough to make them insignificant or change
the βs enough to make them differ from expectations. In other words, the mere existence
of multicollinearity does not necessarily mean anything. A remedy for multicollinearity
should be considered only if the consequences cause insignificant t-scores or unreliable
estimated coefficients.
A second reason for doing nothing is that the deletion of a multicollinear variable that
belongs in an equation will cause specification bias. If we drop a theoretically important
variable, then we are purposely creating bias. Given all the effort typically spent avoiding
omitted variables, it seems foolhardy to consider running that risk on purpose.
The final reason for considering doing nothing to offset multicollinearity is that every time
a regression is rerun, we risk encountering a specification that fits because it accidentally
works for the particular data set involved, not because it is the truth. The larger the number
of experiments, the greater the chances of finding the accidental result. To make things
worse, when there is significant multicollinearity in the sample, the odds of strange results
increase rapidly because of the sensitivity of the coefficient estimates to slight
specification changes.
2. Drop a Redundant Variable
On occasion, the simple solution of dropping one of the multicollinear variables is a good
one. For example, some inexperienced researchers include too many variables in their
regressions, not wanting to face omitted variable bias. As a result, they often have two or
more variables in their equations that are measuring essentially the same thing. In such a
case the multicollinear variables are not irrelevant, since any one of them is quite probably
theoretically and statistically sound. Instead, the variables might be called redundant; only
one of them is needed to represent the effect on the dependent variable that all of them
currently represent.
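A simulated sketch of this remedy (my own example, assuming NumPy and statsmodels): X3 is a redundant measure of X2, so dropping it barely changes the fit but sharply reduces the standard error on X2.

# Sketch: two regressors measuring essentially the same thing. Dropping one
# of them leaves the fit almost unchanged and makes the remaining coefficient
# much better determined.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + 0.05 * rng.normal(size=n)        # redundant measure of x2
y = 2 + 3 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
reduced = sm.OLS(y, sm.add_constant(x2)).fit()
print(full.rsquared, reduced.rsquared)     # nearly identical fit
print(full.bse[1], reduced.bse[1])         # standard error on x2 drops sharply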
3. Increase the Size of the Sample
Another way to deal with multicollinearity is to attempt to increase the size of the sample
to reduce the degree of multicollinearity. Although such an increase may be impossible, it’s
a useful alternative to be considered when it is feasible.
OTHER REMEDIES INCLUDE:
Transforming the variables. The first-difference regression model often reduces the severity
of multicollinearity because, even though the levels of two regressors may be highly
correlated, there is no a priori reason to believe that their first differences will also be highly
correlated.
Ridge regression.
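A sketch of why first-differencing can help (my own simulated example, assuming NumPy): two trending series are almost perfectly correlated in levels but essentially uncorrelated in first differences.

# Sketch: correlation of two trending series in levels vs. in first differences.
import numpy as np

rng = np.random.default_rng(9)
t = np.arange(120)
income = 100 + 2.0 * t + rng.normal(0, 4, 120)
wealth = 400 + 6.0 * t + rng.normal(0, 12, 120)

print(np.corrcoef(income, wealth)[0, 1])                    # near 1 in levels
print(np.corrcoef(np.diff(income), np.diff(wealth))[0, 1])  # near 0 in differences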
SUMMARY
1. Perfect multicollinearity results in indeterminate estimates of the regression coefficients
and infinite standard errors of those estimates.
3. The major consequence of severe (but imperfect) multicollinearity is to increase the
variances and standard errors of the estimated regression coefficients and therefore decrease
the calculated t-scores of those coefficients and expand the confidence intervals.
Multicollinearity causes no bias in the estimated coefficients, and it has little effect on the
overall significance of the regression or on the estimates of any nonmulticollinear
explanatory variables.
4. Since multicollinearity exists, to one degree or another, in virtually every data set, the
question to be asked in detection is how severe the multicollinearity in a particular sample
is.
5. Two useful methods for the detection of severe multicollinearity are to ask:
a. Are the simple correlation coefficients between the explanatory variables high?
b. Are the variance inflation factors of the explanatory variables high?
If either of these answers is yes, then multicollinearity certainly exists, but multicollinearity
can also exist even if the answers are no.
7. Quite often, doing nothing is the best remedy for multicollinearity. If the multicollinearity
has not decreased t-scores to the point of insignificance, then no remedy should even be
considered as long as the variables are theoretically strong. Even if the t-scores are
insignificant, remedies should be undertaken cautiously, because all impose costs on the
estimation that may be greater than the potential benefit of ridding the equation of
multicollinearity.