Multicollinearity

In statistics, multicollinearity or collinearity is a situation where the predictors in a regression model are
linearly dependent.

Perfect multicollinearity refers to a situation where the predictive variables have an exact linear
relationship. When there is perfect collinearity, the design matrix has less than full rank, and therefore
the moment matrix cannot be inverted. In this situation, the parameter estimates of the regression
are not well-defined, as the system of equations has infinitely many solutions.

Imperfect multicollinearity refers to a situation where the predictive variables have a nearly exact linear
relationship.

Contrary to popular belief, neither the Gauss–Markov theorem nor the more common maximum
likelihood justification for ordinary least squares relies on any kind of correlation structure between
dependent predictors[1][2][3] (although perfect collinearity can cause problems with some software).

There is no justification for the practice of removing collinear variables as part of regression
analysis,[1][4][5][6][7] and doing so may constitute scientific misconduct. Including collinear variables does
not reduce the predictive power or reliability of the model as a whole,[6] and does not reduce the accuracy
of coefficient estimates.[1]

High collinearity indicates that it is exceptionally important to include all collinear variables, as
excluding any will cause worse coefficient estimates, strong confounding, and downward-biased
estimates of standard errors.[2]

To diagnose the degree of collinearity in a dataset, the variance inflation factor can be used to
quantify how strongly each predictor is linearly related to the others.
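
As a brief illustration, the following sketch (using numpy, pandas, and statsmodels; the simulated
variables x1, x2, and x3 are invented for this example) computes one VIF per column of the design
matrix:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Simulated data: x3 is strongly (but not perfectly) related to x1 and x2.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    x3 = x1 + x2 + rng.normal(scale=0.1, size=200)

    # Design matrix with an intercept column, as used in the regression itself.
    X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

    # One VIF per column; large values for x1, x2, x3 flag near-collinearity.
    # (The VIF of the constant column is not meaningful.)
    for i, name in enumerate(X.columns):
        print(name, variance_inflation_factor(X.values, i))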

Perfect multicollinearity
[Figure: A depiction of multicollinearity. In a linear regression, the true parameters are reliably
estimated when the predictors $x_1$ and $x_2$ are uncorrelated (black case) but are unreliably
estimated when $x_1$ and $x_2$ are correlated (red case).]

Perfect multicollinearity refers to a situation where the predictors are linearly dependent (one can be
written as an exact linear function of the others).[8] Ordinary least squares requires inverting the matrix
$X^{\mathsf{T}}X$, where

$$X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}$$

is an $N \times (k+1)$ matrix, where $N$ is the number of observations, $k$ is the number of explanatory
variables, and $N \geq k+1$. If there is an exact linear relationship among the independent variables, then
at least one of the columns of $X$ is a linear combination of the others, and so the rank of $X$ (and therefore
of $X^{\mathsf{T}}X$) is less than $k+1$, and the matrix $X^{\mathsf{T}}X$ will not be invertible.
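
As a short numerical sketch (numpy only; the variables are artificial, with x3 constructed as an exact
linear combination of x1 and x2), the rank deficiency and the singular moment matrix can be seen directly:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 2 * x1 - x2                       # exact linear relationship

    # Design matrix: intercept column plus the three predictors (k + 1 = 4 columns).
    X = np.column_stack([np.ones(n), x1, x2, x3])

    print(np.linalg.matrix_rank(X))        # 3, less than k + 1 = 4
    print(np.linalg.matrix_rank(X.T @ X))  # also 3: X'X is singular
    # np.linalg.inv(X.T @ X) would either raise LinAlgError or return a
    # numerically meaningless result dominated by rounding error.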

Resolution
Perfect collinearity is typically caused by including redundant variables in a regression. For example, a
dataset may include variables for income, expenses, and savings. However, because income is equal to
expenses plus savings by definition, it is incorrect to include all 3 variables in a regression
simultaneously. Similarly, including a dummy variable for every category (e.g., summer, autumn, winter,
and spring) as well as an intercept term will result in perfect collinearity. This is known as the dummy
variable trap.[9]
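
A brief sketch (pandas; the season labels follow the example above) of the standard way to avoid the
dummy variable trap, dropping one category as the baseline when an intercept is included:

    import pandas as pd

    seasons = pd.Series(["summer", "autumn", "winter", "spring", "summer"])

    # All four dummies plus an intercept are perfectly collinear (the dummies
    # always sum to 1), so one category is dropped and serves as the baseline.
    dummies = pd.get_dummies(seasons, prefix="season", drop_first=True)
    print(dummies)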

The other common cause of perfect collinearity is attempting to use ordinary least squares when working
with very wide datasets (those with more variables than observations). These require more advanced data
analysis techniques like Bayesian hierarchical modeling to produce meaningful results.

Numerical issues
Sometimes, the variables are nearly collinear. In this case, the matrix $X^{\mathsf{T}}X$ has an inverse, but it is
ill-conditioned. A computer algorithm may or may not be able to compute an approximate inverse; even if
it can, the resulting inverse may have large rounding errors.

The standard measure of ill-conditioning in a matrix is the condition index. This determines if the
inversion of the matrix is numerically unstable with finite-precision numbers, indicating the potential
sensitivity of the computed inverse to small changes in the original matrix. The condition number is
computed by finding the maximum singular value divided by the minimum singular value of the design
matrix.[10] In the context of collinear variables, the variance inflation factor is the condition number for a
particular coefficient.
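
A minimal sketch (numpy, simulated data) of this computation, taking the ratio of the largest to the
smallest singular value of the design matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=500)
    x2 = x1 + rng.normal(scale=0.01, size=500)   # nearly collinear with x1
    X = np.column_stack([np.ones(500), x1, x2])

    s = np.linalg.svd(X, compute_uv=False)       # singular values of X
    print(s.max() / s.min())                     # large condition number
    print(np.linalg.cond(X))                     # the same quantity, built in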

Solutions
Numerical problems in estimating the coefficient vector $\beta$ can be solved by applying standard
techniques from linear algebra to estimate the equations more precisely:

1. Standardizing predictor variables. Working with polynomial terms (e.g. $x_1^2$, $x_1^3$), including
   interaction terms (i.e., $x_1 x_2$), can cause multicollinearity. This is especially true when the
   variable in question has a limited range. Standardizing predictor variables will eliminate this
   special kind of multicollinearity for polynomials of up to 3rd order.[11]
For higher-order polynomials, an orthogonal polynomial representation will generally fix
any collinearity problems.[12] However, polynomial regressions are generally unstable,
making them unsuitable for nonparametric regression and inferior to newer methods
based on smoothing splines, LOESS, or Gaussian process regression.[13]
2. Use an orthogonal representation of the data.[12] Poorly-written statistical software will
sometimes fail to converge to a correct representation when variables are strongly
correlated. However, it is still possible to rewrite the regression to use only uncorrelated
variables by performing a change of basis.
For polynomial terms in particular, it is possible to rewrite the regression as a function of
uncorrelated variables using orthogonal polynomials, as in the sketch after this list.
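
A small sketch (numpy, simulated data) of both points above: centering a limited-range predictor before
forming polynomial terms, and building an orthogonal basis for the polynomial columns (here via a QR
decomposition, one common way to obtain such a basis):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(10, 12, size=1000)     # limited range, far from zero

    # The raw quadratic term is almost perfectly correlated with x itself.
    print(np.corrcoef(x, x**2)[0, 1])      # close to 1

    # Centering (or full standardization) removes most of that correlation.
    xc = x - x.mean()
    print(np.corrcoef(xc, xc**2)[0, 1])    # close to 0

    # An orthogonal basis for the polynomial columns, via QR decomposition.
    P = np.column_stack([np.ones_like(x), x, x**2, x**3])
    Q, _ = np.linalg.qr(P)
    print(np.round(Q.T @ Q, 10))           # identity matrix: orthonormal columns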

Effects on coefficient estimates


In addition to causing numerical problems, imperfect collinearity makes precise estimation of variables
difficult. In other words, highly correlated variables lead to poor estimates and large standard errors.

As an example, say that we notice Alice wears her boots whenever it is raining and that there are only
puddles when it rains. Then, we cannot tell whether she wears boots to keep the rain from landing on her
feet, or to keep her feet dry if she steps in a puddle.

The problem with trying to identify how much each of the two variables matters is that they are
confounded with each other: our observations are explained equally well by either variable, so we do not
know which one of them causes the observed correlations.

There are two ways to discover this information:

1. Using prior information or theory. For example, if we notice Alice never steps in puddles, we
can reasonably argue puddles are not why she wears boots, as she does not need the
boots to avoid puddles.
2. Collecting more data. If we observe Alice enough times, we will eventually see her on days
where there are puddles but not rain (e.g. because the rain stops before she leaves home).
This confounding becomes substantially worse when researchers attempt to ignore or suppress it by
excluding these variables from the regression (see § Misuse below). Excluding multicollinear variables from
regressions will invalidate causal inference and produce worse estimates by removing important
confounders.
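
A simulation sketch (numpy and statsmodels; the variable names and coefficients are invented to mirror
the rain/puddles example) of how excluding a correlated confounder produces a biased estimate:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 5000
    rain = rng.normal(size=n)
    puddles = rain + rng.normal(scale=0.3, size=n)   # strongly correlated with rain
    # Suppose boots depend only on rain; the true puddle effect is zero.
    boots = 2.0 * rain + rng.normal(size=n)

    full = sm.OLS(boots, sm.add_constant(np.column_stack([rain, puddles]))).fit()
    print(full.params)     # close to [0, 2, 0]: unbiased, though with inflated standard errors

    reduced = sm.OLS(boots, sm.add_constant(puddles)).fit()
    print(reduced.params)  # puddle coefficient far from 0: omitted-variable bias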

Remedies
There are many ways to prevent multicollinearity from affecting results by planning ahead of time.
However, these methods all require a researcher to decide on a procedure and analysis before data has
been collected (see post hoc analysis and Multicollinearity § Misuse).

Regularized estimators
Many regression methods are naturally "robust" to multicollinearity and generally perform better than
ordinary least squares regression, even when variables are independent. Regularized regression
techniques such as ridge regression, LASSO, elastic net regression, or spike-and-slab regression are less
sensitive to including "useless" predictors, a common cause of collinearity. These techniques can detect
and remove these predictors automatically to avoid problems. Bayesian hierarchical models (provided by
software like BRMS) can perform such regularization automatically, learning informative priors from the
data.
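
A minimal sketch (scikit-learn, simulated data; the penalty strength alpha=1.0 is arbitrary) contrasting
ordinary least squares with ridge regression on two nearly collinear predictors:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(5)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=n)             # true coefficients are (1, 1)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)           # L2 penalty shrinks the estimates

    print(ols.coef_)    # individually unstable; can be far from (1, 1)
    print(ridge.coef_)  # pulled toward similar, more stable values near 1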

Often, problems caused by the use of frequentist estimation are misunderstood or misdiagnosed as being
related to multicollinearity.[3] Researchers are often frustrated not by multicollinearity, but by their
inability to incorporate relevant prior information in regressions. For example, complaints that
coefficients have "wrong signs" or confidence intervals that "include unrealistic values" indicate there is
important prior information that is not being incorporated into the model. When this information is
available, it should be incorporated into the prior using Bayesian regression techniques.[3]

Stepwise regression (the procedure of excluding "collinear" or "insignificant" variables) is especially
vulnerable to multicollinearity, and is one of the few procedures wholly invalidated by it (with any
collinearity resulting in heavily biased estimates and invalidated p-values).[2]

Improved experimental design


When conducting experiments where researchers have control over the predictive variables, researchers
can often avoid collinearity by choosing an optimal experimental design in consultation with a
statistician.

Acceptance
While the above strategies work in some situations, estimates using advanced techniques may still
produce large standard errors. In such cases, the correct response to multicollinearity is to "do nothing".[1]
The scientific process often involves null or inconclusive results; not every experiment will be
"successful" in the sense of decisively confirmation of the researcher's original hypothesis.

Edward Leamer notes that "The solution to the weak evidence problem is more and better data. Within
the confines of the given data set there is nothing that can be done about weak evidence".[3] Leamer notes
that "bad" regression results that are often misattributed to multicollinearity instead indicate the
researcher has chosen an unrealistic prior probability (generally the flat prior used in OLS).[3]

Damodar Gujarati writes that "we should rightly accept [our data] are sometimes not very informative
about parameters of interest".[1] Olivier Blanchard quips that "multicollinearity is God's will, not a
problem with OLS";[7] in other words, when working with observational data, researchers cannot "fix"
multicollinearity, only accept it.

Misuse
Variance inflation factors are often misused as criteria in stepwise regression (i.e. for variable
inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-
thumb".[2]

Excluding collinear variables leads to artificially small estimates for standard errors, but does not reduce
the true (not estimated) standard errors for regression coefficients.[1] Excluding variables with a high
variance inflation factor also invalidates the calculated standard errors and p-values, by turning the results
of the regression into a post hoc analysis.[14]

Because collinearity leads to large standard errors and p-values, which can make publishing articles more
difficult, some researchers will try to suppress inconvenient data by removing strongly-correlated
variables from their regression. This procedure falls into the broader categories of p-hacking, data
dredging, and post hoc analysis. Dropping (useful) collinear predictors will generally worsen the
accuracy of the model and coefficient estimates.
Similarly, trying many different models or estimation procedures (e.g. ordinary least squares, ridge
regression, etc.) until finding one that can "deal with" the collinearity creates a forking paths problem. P-
values and confidence intervals derived from post hoc analyses are invalidated by ignoring the
uncertainty in the model selection procedure.

It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no
effect on the outcome; for example, local cheese production should not be used to predict the height of
skyscrapers. However, this must be done when first specifying the model, prior to observing any data,
and potentially-informative variables should always be included.

See also
Ill-conditioned matrix
Linear dependence

References
1. Gujarati, Damodar (2009). "Multicollinearity: what happens if the regressors are
correlated?". Basic Econometrics (https://archive.org/details/basiceconometric05edguja)
(4th ed.). McGraw−Hill. pp. 363 (https://archive.org/details/basiceconometric05edguja/page/
363). ISBN 9780073375779.
2. Kalnins, Arturs; Praitis Hill, Kendall (13 December 2023). "The VIF Score. What is it Good
For? Absolutely Nothing" (http://journals.sagepub.com/doi/10.1177/10944281231216381).
Organizational Research Methods. doi:10.1177/10944281231216381 (https://doi.org/10.117
7%2F10944281231216381). ISSN 1094-4281 (https://search.worldcat.org/issn/1094-4281).
3. Leamer, Edward E. (1973). "Multicollinearity: A Bayesian Interpretation" (https://www.jstor.or
g/stable/1927962). The Review of Economics and Statistics. 55 (3): 371–380.
doi:10.2307/1927962 (https://doi.org/10.2307%2F1927962). ISSN 0034-6535 (https://searc
h.worldcat.org/issn/0034-6535). JSTOR 1927962 (https://www.jstor.org/stable/1927962).
4. Giles, Dave (15 September 2011). "Econometrics Beat: Dave Giles' Blog: Micronumerosity"
(https://davegiles.blogspot.com/2011/09/micronumerosity.html). Econometrics Beat.
Retrieved 3 September 2023.
5. Goldberger, A.S. (1964). Econometric Theory. New York: Wiley.
6. Goldberger, A.S. "Chapter 23.3". A Course in Econometrics. Cambridge MA: Harvard
University Press.
7. Blanchard, Olivier Jean (October 1987). "Comment" (http://www.tandfonline.com/doi/abs/10.
1080/07350015.1987.10509611). Journal of Business & Economic Statistics. 5 (4): 449–
451. doi:10.1080/07350015.1987.10509611 (https://doi.org/10.1080%2F07350015.1987.10
509611). ISSN 0735-0015 (https://search.worldcat.org/issn/0735-0015).
8. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2021). An introduction to
statistical learning: with applications in R (https://link.springer.com/book/10.1007/978-1-0716
-1418-1) (Second ed.). New York, NY: Springer. p. 115. ISBN 978-1-0716-1418-1. Retrieved
1 November 2024.
9. Karabiber, Fatih. "Dummy Variable Trap - What is the Dummy Variable Trap?" (https://www.l
earndatasci.com/glossary/dummy-variable-trap/). LearnDataSci (www.learndatasci.com).
Retrieved 18 January 2024.
10. Belsley, David (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression
(https://archive.org/details/conditioningdiag0000bels). New York: Wiley. ISBN 978-0-471-
52889-0.
11. "12.6 - Reducing Structural Multicollinearity | STAT 501" (https://newonlinecourses.science.p
su.edu/stat501/lesson/12/12.6). newonlinecourses.science.psu.edu. Retrieved 16 March
2019.
12. "Computational Tricks with Turing (Non-Centered Parametrization and QR Decomposition)"
(https://storopoli.io/Bayesian-Julia/pages/12_Turing_tricks/#qr_decomposition). storopoli.io.
Retrieved 3 September 2023.
13. Gelman, Andrew; Imbens, Guido (3 July 2019). "Why High-Order Polynomials Should Not
Be Used in Regression Discontinuity Designs" (https://www.tandfonline.com/doi/full/10.1080/
07350015.2017.1366909). Journal of Business & Economic Statistics. 37 (3): 447–456.
doi:10.1080/07350015.2017.1366909 (https://doi.org/10.1080%2F07350015.2017.136690
9). ISSN 0735-0015 (https://search.worldcat.org/issn/0735-0015).
14. Gelman, Andrew; Loken, Eric (14 November 2013). "The garden of forking paths" (http://ww
w.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf) (PDF). Unpublished –
via Columbia.

Further reading
Belsley, David A.; Kuh, Edwin; Welsch, Roy E. (1980). Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity. New York: Wiley. ISBN 978-0-471-05856-4.
Goldberger, Arthur S. (1991). "Multicollinearity" (https://books.google.com/books?id=mHmx
NGKRlQsC&pg=PA245). A Course in Econometrics. Cambridge: Harvard University Press.
pp. 245–53. ISBN 9780674175440.
Hill, R. Carter; Adkins, Lee C. (2001). "Collinearity". In Baltagi, Badi H. (ed.). A Companion
to Theoretical Econometrics. Blackwell. pp. 256–278. doi:10.1002/9780470996249.ch13 (htt
ps://doi.org/10.1002%2F9780470996249.ch13). ISBN 978-0-631-21254-6.
Johnston, John (1972). Econometric Methods (https://archive.org/details/econometricmetho
0000john_t7q9) (Second ed.). New York: McGraw-Hill. pp. 159 (https://archive.org/details/ec
onometricmetho0000john_t7q9/page/159)–168. ISBN 9780070326798.
Kalnins, Arturs (2022). "When does multicollinearity bias coefficients and cause type 1
errors? A reconciliation of Lindner, Puck, and Verbeke (2020) with Kalnins (2018)". Journal
of International Business Studies. 53 (7): 1536–1548. doi:10.1057/s41267-022-00531-9 (htt
ps://doi.org/10.1057%2Fs41267-022-00531-9). S2CID 249323519 (https://api.semanticscho
lar.org/CorpusID:249323519).
Kmenta, Jan (1986). Elements of Econometrics (https://archive.org/details/elementsofecono
m0003kmen/page/430) (Second ed.). New York: Macmillan. pp. 430–442 (https://archive.or
g/details/elementsofeconom0003kmen/page/430). ISBN 978-0-02-365070-3.
Maddala, G. S.; Lahiri, Kajal (2009). Introduction to Econometrics (Fourth ed.). Chichester:
Wiley. pp. 279–312. ISBN 978-0-470-01512-4.
Tomaschek, Fabian; Hendrix, Peter; Baayen, R. Harald (2018). "Strategies for addressing
collinearity in multivariate linguistic data" (https://doi.org/10.1016%2Fj.wocn.2018.09.004).
Journal of Phonetics. 71: 249–267. doi:10.1016/j.wocn.2018.09.004 (https://doi.org/10.101
6%2Fj.wocn.2018.09.004).

External links
Thoma, Mark (2 March 2011). "Econometrics Lecture (topic: multicollinearity)" (https://www.y
outube.com/watch?v=K8eFiMIb8qo&list=PLD15D38DC7AA3B737&index=16#t=25m09s).
University of Oregon. Archived (https://ghostarchive.org/varchive/youtube/20211212/K8eFiM
Ib8qo) from the original on 12 December 2021 – via YouTube.
Earliest Uses: The entry on Multicollinearity has some historical information. (http://jeff560.tri
pod.com/m.html)
