Multivariable Regression 3
Much like in the case of the univariate regression with one independent variable, the multiple
regression model has a number of required assumptions:

(MR.1): Linear Model. The Data Generating Process (DGP), or in other words, the
population, is described by a linear (in terms of the coefficients) model:

Y = Xβ + ε      (MR.1)

(MR.2): Strict Exogeneity.

E(ε|X) = 0      (MR.2)

(MR.3)-(MR.4): Conditional Homoskedasticity and No Autocorrelation.

Var(ε|X) = σ²I

The no-autocorrelation part of this assumption implies that all error pairs are uncorrelated. For
cross-sectional data, this implies that there is no spatial correlation between errors.

(MR.5): There exists no exact linear relationship between the explanatory variables.
This means that:

rank(X) = k + 1      (MR.5)

(MR.6): Conditional Normality.

ε|X ∼ N(0, σ²I)      (MR.6)
Restricted Least Squares
Suppose that we have the following multiple regression with k independent variables Xj,
j = 1, ..., k, in matrix notation, with M < k + 1 different linear restrictions on the parameters:

Y = Xβ + ε
Lβ = r

where:

    ⎡ c10  c11  ...  c1k ⎤         ⎡ r1 ⎤
L = ⎢ c20  c21  ...  c2k ⎥ ,   r = ⎢ r2 ⎥
    ⎢  ⋮    ⋮    ⋱    ⋮  ⎥         ⎢ ⋮  ⎥
    ⎣ cM0  cM1  ...  cMk ⎦         ⎣ rM ⎦
Lagrange multipliers are widely used to solve various constrained optimization problems
in economics.
In general, in order to find the stationary points of a function f(X), subject to an
equality constraint g(X) = 0, we form the Lagrangian function L(X, λ) = f(X) + λ⊤ g(X) and
look for its stationary points.
Note that the solution corresponding to the above constrained optimization is always a
saddle point of the Lagrangian function.
Estimator Derivation
As before, we want to minimize the sum of squared residuals, but this time subject to the
condition that Lβ = r. This leads to the Lagrangian function:

L(β, λ) = (Y − Xβ)⊤ (Y − Xβ) + 2λ⊤ (Lβ − r)
        = Y⊤Y − 2Y⊤Xβ + β⊤X⊤Xβ + 2λ⊤Lβ − 2λ⊤r

Taking the partial derivatives with respect to β and λ, setting them to zero and solving the
resulting system (using Lβ = r) gives the RLS estimator:

β̂(RLS) = β̂(OLS) − (X⊤X)⁻¹ L⊤ [ L (X⊤X)⁻¹ L⊤ ]⁻¹ ( L β̂(OLS) − r )
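In more detail, the first-order conditions are (a short sketch of the standard Lagrangian algebra):

∂L/∂β = −2X⊤Y + 2X⊤Xβ + 2L⊤λ = 0   ⟹   β = β̂(OLS) − (X⊤X)⁻¹ L⊤ λ
∂L/∂λ = 2(Lβ − r) = 0   ⟹   r = Lβ = L β̂(OLS) − L (X⊤X)⁻¹ L⊤ λ   ⟹   λ = [ L (X⊤X)⁻¹ L⊤ ]⁻¹ ( L β̂(OLS) − r )

Substituting λ back into the expression for β yields the RLS estimator above.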
This means that the RLS estimator can be defined as:

β̂(RLS) = β̂(OLS) + "Restriction Adjustment"

where the Restriction Adjustment is determined by the divergence between L β̂(OLS) and E[ L β̂(OLS) ] = r.
Unbiasedness of RLS Estimator
We begin by examining the estimation error of estimating β via RLS in a similar way as we did
for OLS:

β̂(RLS) = β̂(OLS) − (X⊤X)⁻¹ L⊤ [ L (X⊤X)⁻¹ L⊤ ]⁻¹ ( L β̂(OLS) − Lβ )
        = β̂(OLS) + (X⊤X)⁻¹ L⊤ [ L (X⊤X)⁻¹ L⊤ ]⁻¹ L ( β − β̂(OLS) )

if Lβ = r.
Then the expected value of the RLS estimator is:

E[ β̂(RLS) ] = E[ β̂(OLS) ] + (X⊤X)⁻¹ L⊤ [ L (X⊤X)⁻¹ L⊤ ]⁻¹ L · E[ β − β̂(OLS) ] = β

if E[ β̂(OLS) ] = β. In other words:

If the OLS estimator is unbiased, then the RLS estimator is also unbiased, as long as
the constraints Lβ = r are true.
Otherwise, the RLS estimator is biased, i.e. E[ β̂(RLS) ] ≠ β.
Since Var(β̂(OLS)) = σ² (X⊤X)⁻¹, we can calculate the difference in the variances of the OLS
and RLS estimators:

Var(β̂(OLS)) − Var(β̂(RLS)) = σ² (X⊤X)⁻¹ L⊤ [ L (X⊤X)⁻¹ L⊤ ]⁻¹ L (X⊤X)⁻¹ = σ² C

if Lβ = r (which we can then plug into our RLS estimator expression). The matrix C is positive
semi-definite, so, when the restrictions hold, the RLS estimator is at least as efficient as the
OLS estimator. Furthermore, the RLS estimator is a consistent estimator of β, provided the
restrictions are correctly specified.
Estimating the Error Variance
According to the RLS estimator, the residual vector can be defined as:

ε̂ = Y − X β̂(RLS)

The error variance is derived analogously to how it would be via OLS, except that we now have M
restrictions, so:

The RLS (with M restrictions) estimator of the error variance σ²:

σ̂²(RLS) = ( ε̂⊤ ε̂ ) / ( N − (k + 1 − M) )

is an unbiased estimator of σ².
Some final notes about the RLS estimator:
If the restrictions are correct, Lβ = r , then:
I β̂(RLS) is unbiased;
I β̂(RLS) is more efficient than β̂(OLS);
I β̂(RLS) is BLUE;
I β̂(RLS) is consistent;
I β̂(RLS) is asymptotically normally distributed:

  β̂(RLS) ∼ N( β, σ² D (X⊤X)⁻¹ )
The validity of the restrictions, H0 : Lβ = r, can be tested with an F-test. Then:
I If the null hypothesis cannot be rejected - re-estimate the model via RLS;
I If the null hypothesis is rejected - keep the OLS estimates.
As we have seen - RLS is biased, unless the imposed constraints are exactly true.
Consequently, being knowledgeable about economics allows one to:
I Identify which variables may affect the dependent variable;
I Specify the model in a correct functional form;
I Specify any additional restrictions on the coefficients.
When these restrictions, as well as the model form itself, primarily come from your knowledge of
economic theory, rather than from the data sample, they may be regarded as nonsample
information.
As we have seen thus far, there are many different combinations of variables and their
transformations, which would lead to a near-infinite number of possible models.
This is where economic theory is very useful.
I As we have noted, evidence of whether the imposed restrictions are true or not can be
obtained by carrying out the specified F -test.
I Then again, if the test rejects the hypothesis about our constraints, but we are absolutely
sure that they are valid from economic theory, then our RLS estimates may still be closer to
the true model (they will have a lower variance), and the hypothesis may have been rejected for
other reasons (for example, an incorrect model form, or an omitted important variable).
If we are wrong, then the RLS estimates will be biased (though their variance will still
be lower than that of OLS). In general, biased estimates are undesirable.
Example
We will simulate the following model:

Yi = β0 + β1 X1,i + β2 X2,i + β3 X3,i + β4 X4,i + β5 X5,i + εi

where:
I β1 = β3;
I (−2) · β2 = β4;
I β5 = 0 (i.e. an insignificant variable).
set.seed(123)
#
N <- 1000
beta_vec <- c(-5, 2, 3, 2, -6, 0)
#
x1 <- rnorm(mean = 5, sd = 2, n = N)
x2 <- sample(1:50, size = N, replace = TRUE)
x3 <- seq(from = 0, to = 10, length.out = N)
x4 <- rnorm(mean = 10, sd = 3, n = N)
x5 <- rnorm(mean = -2, sd = 3, n = N)
#
x_mat <- cbind(x1, x2, x3, x4, x5)
e <- rnorm(mean = 0, sd = 3, n = N)
y <- cbind(1, x_mat) %*% beta_vec + e
data_smpl <- data.frame(y, x_mat)
Next up, we will estimate the relevant model:
y_mdl_fit <- lm("y ~ x1 + x2 + x3 + x4 + x5", data = data_smpl)
#
print(round(coef(summary(y_mdl_fit)), 5))
L <- matrix(c(0, 1, 0, -1, 0, 0,
0, 0, 2, 0, 1, 0,
0, 0, 0, 0, 0, 1), nrow = 3, byrow = TRUE)
rr <- c(0, 0, 0)
#
mdl_rls = lrmest::rls(formula = formula(y_mdl_fit), R = L, r = rr, data = data_smpl, delt = rep(0, le
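For reference, the adjustment component from the RLS formula can also be computed manually; a
sketch using the simulated data (the object names X_d, beta_ols and RA are chosen to match the
chunks that follow):

# Design matrix with an intercept column, and the (unrestricted) OLS estimate:
X_d <- cbind(1, x_mat)
xtx_inv <- solve(t(X_d) %*% X_d)
beta_ols <- xtx_inv %*% t(X_d) %*% y
# Restriction Adjustment: (X'X)^(-1) L' [ L (X'X)^(-1) L' ]^(-1) (L beta_ols - r)
RA <- xtx_inv %*% t(L) %*% solve(L %*% xtx_inv %*% t(L)) %*% (L %*% beta_ols - rr)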
Having calculated the adjustment component, we can finally calculate the RLS estimator:
beta_rls <- beta_ols - RA
print(beta_rls)
## [,1]
## -5.317401
## x1 2.022635
## x2 3.009071
## x3 2.022635
## x4 -6.018142
## x5 0.000000
Since we know the asymptotic distribution, we can calculate the variance of the estimates:
y_fit <- X_d %*% beta_rls
#
resid <- y - y_fit
#
sigma2_rls <- sum(resid^2) / (nrow(data_smpl) - (length(beta_rls) - length(rr)))
print(paste0("Estimated RLS residual standard error: ", round(sqrt(sigma2_rls), 5)))
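The standard errors reported below can be obtained from the covariance matrix of the RLS
estimator, Var(β̂(RLS)) = σ̂²(RLS) · [ (X⊤X)⁻¹ − C ]; a sketch using the objects defined above
(the object name rls_vcov is reused further below):

# C = (X'X)^(-1) L' [ L (X'X)^(-1) L' ]^(-1) L (X'X)^(-1)
C_mat <- xtx_inv %*% t(L) %*% solve(L %*% xtx_inv %*% t(L)) %*% L %*% xtx_inv
rls_vcov <- sigma2_rls * (xtx_inv - C_mat)
# Standard errors; abs() guards against tiny negative diagonal entries from floating-point rounding
rls_out <- cbind(beta_rls, sqrt(abs(diag(rls_vcov))))
colnames(rls_out) <- c("est", "se")
print(rls_out)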
## est se
## -5.317401 2.881625e-01
## x1 2.022635 2.687947e-02
## x2 3.009071 6.113634e-03
## x3 2.022635 2.687947e-02
## x4 -6.018142 1.222727e-02
## x5 0.000000 1.851409e-11
Note the zero-value restriction on the last coefficient. Because of that restriction, the variance of the
x5 coefficient is very close to zero (as it should be), but, due to rounding in the numerical computations,
it can also sometimes be slightly negative:
print(tail(diag(rls_vcov), 1))
## x5
## 3.427715e-22
I For cases where we have an insignificant parameter, instead of imposing a zero-value
separate restriction in the RLS, it would be better to remove it first, and then re-estimate
the model via RLS for any other restrictions (like parameter equality, difference, sum, ratio
and other linear restrictions).
I We would like to stress that knowing how the estimates are calculated is very useful, since
it makes it possible to identify the source of any errors, or warnings, that we may get when using
parameter estimation functions from various packages, which may vary in functionality, speed
and quality.
Multicollinearity
Definition of Multicollinearity
I Sometimes explanatory variables are tightly connected and it is impossible to distinguish
their individual influences on the dependent variable.
I Many economic variables may move together in some systematic way. Such variables are
said to be collinear and cause the collinearity problem.
I In the same way, multicollinearity refers to a situation in which two or more explanatory
variables are highly linearly related.
A set of variables is perfectly multicollinear if there exists one or more exact linear
relationships among some of the variables. This means that, for a multiple regression
model:

Yi = β0 + β1 X1,i + ... + βk Xk,i + εi

the following relationship holds true for some m explanatory variables and constants
λ0, λ1, ..., λm, not all zero:

λ0 + λ1 X1,i + λ2 X2,i + ... + λm Xm,i = 0, for all i

For example, consider the model with two explanatory variables:

Yi = β0 + β1 X1,i + β2 X2,i + εi

Now, assume that X2,i = γ0 + γ1 X1,i. Then our initial expression becomes:

Yi = β0 + β1 X1,i + β2 (γ0 + γ1 X1,i) + εi = (β0 + β2 γ0) + (β1 + β2 γ1) X1,i + εi = α0 + α1 X1,i + εi

Thus, while we may be able to estimate α0 and α1, we would not be able to obtain estimates of
the original β0, β1, β2.
To verify that we cannot estimate the original coefficients, consider the following example.
Example
We will generate a simple example with two models: one where the dependent variable Y1 depends on
two unrelated regressors, X1 and X2, and one where the dependent variable Y2 depends on X1 and X3,
with X3 an exact linear function of X1 (i.e. perfect collinearity).
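The chunk that generated the data is not shown; a minimal sketch consistent with the rest of this
example (the coefficient values in beta_vec, the distribution of x2 and the error standard
deviation are assumptions, while x3 mirrors the exact-linear counterpart of the "modified"
variable used further below):

set.seed(123)
#
N <- 200
beta_vec <- c(4, 2, -3)
#
x1 <- rnorm(mean = 8, sd = 2, n = N)
x2 <- sample(1:50, size = N, replace = TRUE)
x3 <- 2 - 2 * x1 # an exact linear function of x1 -> perfect collinearity
e <- rnorm(mean = 0, sd = 1, n = N)
#
y1 <- cbind(1, x1, x2) %*% beta_vec + e
y2 <- cbind(1, x1, x3) %*% beta_vec + e
#
x_mat_1 <- cbind(1, x1, x2)
print(Matrix::rankMatrix(x_mat_1)[1])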
## [1] 3
x_mat_2 <- cbind(1, x1, x3)
print(Matrix::rankMatrix(x_mat_2)[1])
## [1] 2
Consequently, calculating the inverse of X⊤X yields an error:
solve(t(x_mat_2) %*% x_mat_2)
The equivalent check in Python:
import numpy as np
import statsmodels.api as sm
#
x_mat_2 = sm.add_constant(np.column_stack((x1, x3)))
print(np.linalg.matrix_rank(x_mat_2))
## 2
print(np.linalg.inv(np.dot(np.transpose(x_mat_2), x_mat_2)))
In theory, if we have calculated the inverse matrix, then the following relation should hold
(shown below for the full-rank design matrix of the first model):

(X⊤X)⁻¹ (X⊤X) = I
## [[ 1. -0. -0.]
## [-0. 1. -0.]
## [-0. 0. 1.]]
On the other hand, for the non-invertible matrix, this does not hold (because the inverse should not exist):
xtx_2 = np.dot(np.transpose(x_mat_2), x_mat_2)
xtx_2_inv = np.linalg.inv(xtx_2)
print(np.round(np.dot(xtx_2_inv, xtx_2), 10))
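One way to proceed is to drop one of the perfectly collinear columns and apply the OLS formula
to the remaining ones. A sketch of a chunk producing the output below (which column is dropped,
and the object name, are assumptions):

# Drop the perfectly collinear column x3 and estimate the remaining coefficients manually:
x_mat_2_reduced <- cbind(1, x1)
print(solve(t(x_mat_2_reduced) %*% x_mat_2_reduced) %*% t(x_mat_2_reduced) %*% y2)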
## [,1]
## -1.844784
## x1 7.984537
Consequently, this is exactly what is (implicitly) done in R. If we estimate the models separately,
we get:
dt1 <- data.frame(y1, x1, x2)
dt2 <- data.frame(y2, x1, x3)
#
mdl_1 <- lm(y1 ~ x1 + x2, data = dt1)
print(coef(summary(mdl_1)))
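The summary output below is for the perfectly collinear model; a sketch of the chunk that
produced it (the call itself is visible in the output, the object name mdl_2 is an assumption):

mdl_2 <- lm(y2 ~ x1 + x3, data = dt2)
print(summary(mdl_2))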
##
## Call:
## lm(formula = y2 ~ x1 + x3, data = dt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83465 -0.59224 0.06078 0.64298 2.46366
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.84478 0.29798 -6.191 3.38e-09 ***
## x1 7.98454 0.03633 219.764 < 2e-16 ***
## x3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9668 on 198 degrees of freedom
## Multiple R-squared: 0.9959, Adjusted R-squared: 0.9959
## F-statistic: 4.83e+04 on 1 and 198 DF, p-value: < 2.2e-16
We see that the variable X3 has NA values - in other words, it is not included in the lm()
estimated regression, which is why coef() would not return a row for the x3 variable.
I Always be mindful of the possible consequences of collinearity (which results in
a non-invertible matrix in the OLS calculation), as well as of other possible problems
like autocorrelation and heteroskedasticity (which are to be discussed further on).
I Econometric software is not always explicit in its methodology when some of these
problems arise - it is usually assumed that the user is knowledgeable enough to
identify these issues from the output, or by reading the software
documentation.
I We may sometimes be content with an approximate solution, while other times,
we want to be fully aware of any and all possible problems with the data,
especially if we are automating some model estimation processes, since
extracting only the coefficient (or any other) values may not tell us about any
problems detected by the software.
Consequences of Multicollinearity
As shown above, while we may be able to estimate α0 and α1, we would not be able to obtain
estimates of the original β0, β1, β2.
I On one hand, this situation of exact collinearity virtually never arises in practice and can be disregarded.
I On the other hand, the relationship in real-world data is usually approximate.
In other words, the relationship may only be approximately linear:

X2,i = γ0 + γ1 X1,i + ηi

where η is some kind of (not necessarily Gaussian) random variable. If η is small, then
det(X⊤X) is close to zero, our data is close to multicollinear, and the elements of (X⊤X)⁻¹
will be large, which then results in imprecise estimators.
Consequently, multicollinearity results in multiple difficulties:
I The standard errors of the affected coefficients tend to be large.
This results in a failure to reject the null hypothesis on coefficient significance.
I Small changes to the input data can lead to large changes in the model,
even resulting in changes of sign of parameter estimates. This is easier to
verify if we estimate the same model on different subsets of the data and examine
its coefficients.
I R² may be unusually large, which should be concerning when the model contains
insignificant variables and variables with (economically) incorrect signs.
I The usual interpretation of a unit increase in Xj,i effect on Y , holding all else
constant (ceteris paribus) does not hold if the explanatory variables are
correlated.
I In some sense, the collinear variables contain, at least partially, the same
information about the dependent variable. A consequence of such data
redundancy is overfitting.
I Depending on the software used, an approximate inverse of X> X may be
computed, although the resulting approximated inverse may be highly sensitive to
slight variations in the data and so may be very inaccurate or very
sample-dependent.
Modifying our previous example:
X3,i = 2 − 2 · (X1,i)^0.9 + ηi,   ηi ∼ N(−2, 1)
set.seed(123)
#
x3_new = 2 - 2 * x1^(0.9) + rnorm(mean = -2, sd = 1, n = N)
y2_new = cbind(1, x1, x3_new) %*% beta_vec + e
#
plot(1:N, x3, type = "l", col = "blue")
lines(1:N, x3_new, col = "red")
legend("bottomleft", legend = c("x3", "x3_new"), col = c("blue", "red"), lty = 1)
[Figure: line plot of x3 (blue) and x3_new (red) against the observation index.]
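The two series are almost perfectly correlated; a sketch of the chunk producing the correlation
matrix below:

print(cor(cbind(x3, x3_new)))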
## x3 x3_new
## x3 1.0000000 0.9996801
## x3_new 0.9996801 1.0000000
yields the following model coefficient estimates:
dt3 <- data.frame(y2_new, x1, x3_new)
#
mdl_3 <- lm(y2_new ~ x1 + x3_new, data = dt3)
#
print(round(coef(summary(mdl_3)), 4))
We see that the parameters are estimated close to the true values, even though X1 and X3 are
related (in this case - only approximately linearly). On the other hand, the coefficient of
X3 is not statistically significantly different from zero.
On the other hand, if we were to estimate the following model:
mdl_4 <- lm(y2_new ~ x1 + x3_new + I(x3_new^2))
print(round(coef(summary(mdl_4)), 4))
print(round(summary(mdl_4)$r.squared, 4))
## [1] 0.9926
print(round(summary(mdl_4)$adj.r.squared, 4))
## [1] 0.9925
This model, in which every coefficient is individually insignificant, explains 99.3% of the variation in Y!
If we were to do the same on data without multicollinearity:
print(round(coef(summary(lm(y1 ~ x1 + x2))), 4))
Besides examining the design matrix itself, there are a number of alternative ways to measure the
presence of multicollinearity:
I Large changes in the estimated regression coefficients when a predictor variable is
added/removed are indicative of multicollinearity;
I If the multiple regression contains individually insignificant coefficients, then the joint F-test for
their significance should also not reject the null hypothesis. If that null hypothesis is
rejected, then we have an indication that the low t-values are due to multicollinearity.
I A popular measure of multicollinearity is the variance inflation factor (VIF).
Variance Inflation Factor (VIF)
Consider the following multiple regression model:

Yi = β0 + β1 X1,i + ... + βk Xk,i + εi

The variance inflation factor of the j-th explanatory variable is VIFj = 1 / (1 − Rj²), where Rj² is the R²
of the auxiliary regression of Xj on the remaining explanatory variables (and a constant). Values of VIFj
noticeably larger than 5 or 10 are commonly used as a rule of thumb indicating problematic collinearity.
Returning to the collinear example, the full summary of the model with X1, X3 and the square of X3
(mdl_4 above) is:
##
## Call:
## lm(formula = y2_new ~ x1 + x3_new + I(x3_new^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.78937 -0.58760 0.05885 0.68430 2.44656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.4846 65.7724 -0.631 0.529
## x1 -8.1334 17.5430 -0.464 0.643
## x3_new -11.0723 13.5328 -0.818 0.414
## I(x3_new^2) 0.1341 0.1798 0.746 0.457
##
## Residual standard error: 0.9693 on 196 degrees of freedom
## Multiple R-squared: 0.9926, Adjusted R-squared: 0.9925
## F-statistic: 8763 on 3 and 196 DF, p-value: < 2.2e-16
We see that the p-value of the F-statistic is less than the 0.05 significance level, meaning that we reject
the null hypothesis that β1 = β2 = β3 = 0 in the model Yi = β0 + β1 X1,i + β2 X3,i + β3 X3,i² + εi.
On the other hand, the summary statistics show that we do not reject any of the individual hypotheses
H0: βj = 0, j = 1, 2, 3. So, clearly something isn't right with the specified model.
I We can calculate the VIF manually:
# VIF(x3_new)
1/(1 - summary(lm(x3_new ~ 1 + x1))$r.squared)
## [1] 1563.274
# VIF(x1)
1/(1 - summary(lm(x1 ~ 1 + x3_new))$r.squared)
## [1] 1563.274
We can also use the built-in functions to verify that the variables are clearly collinear:
print(car::vif(lm(y2_new ~ x1 + x3_new)))
## x1 x3_new
## 1563.274 1563.274
while for the model without collinearity the VIF is around 1 for each variable:
print(car::vif(mdl_1))
## x1 x2
## 1.000768 1.000768
Note that this does not say anything about causality - whether X1 influences X3 , or whether X3
influences X1 - we do not know.
Generalized Variance Inflation Factor (GVIF)
I Variance inflation factors are not fully applicable to models that include sets of related regressors
(e.g. indicator regressors for the same categorical variable), or polynomial regressors.
I This is because the correlations among these variables are induced by the model structure
and therefore are artificial. Usually, we are not concerned with artificial correlations of
these types - we want to identify the effects of different explanatory variables.
Consequently, Fox and Monette (1992), "Generalized Collinearity Diagnostics",
introduced the Generalized VIF, denoted GVIF.
Assume that our regression model is:

Y = Xβ + ε,   where Y is N×1, X is N×(k+1), β is (k+1)×1 and ε is N×1,

and partition the design matrix (aside from the constant column) into X = [1, X1, X2], where:
I X1 contains the related r indicator variables (or a specific variable and its polynomial terms);
I X2 contains the remaining regressors, excluding the constant.
Then the Generalized VIF is defined as:

GVIF1 = ( det(R11) · det(R22) ) / det(R)
where:
I R11 is the correlation matrix for X1 ;
I R22 is the correlation matrix for X2 ;
I R is the correlation matrix for all variables in the whole design matrix X, excluding the
constant;
The GVIF is calculated for sets of related regressors, such as the set of indicator regressors for some
kind of categorical variable, or a set of polynomial terms:
I For the continuous variables GVIF is the same as the VIF values before;
I For the categorical variables, we now get one GVIF value for each separate category type
(e.g. one value for all age groups, another value for all regional indicator variables and so on);
So, variables which require more than 1 coefficient and thus more than 1 degree of freedom are typically
evaluated using the GVIF.
To make GVIFs comparable across dimensions, Fox and Monette also suggested using
GVIFˆ(1/(2*DF)), where DF (degrees of freedom) is the number of coefficients in the subset.
This reduces the GVIF to a linear measure. It is analogous to taking the square root of the usual
VIF. In other words:
We can apply the usual VIF rule of thumb to the squared GVIF^(1/(2·Df)) value.
For example, requiring ( GVIF^(1/(2·DF)) )² < 5 is equivalent to requiring VIF < 5
for the continuous (i.e. non-categorical) variables.
Example
Continuing our previous example, assume that we have the following age group indicator variables, all of
which are stored in a single categorical variable age:
set.seed(123)
#
dt3$age <- sample(c("10_18", "18_26", "26_40", "other"), size = length(x1), replace = TRUE)
print(head(dt3))
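The VIF values shown below come from a model that also includes the age dummies; a sketch of how
they can be obtained (the model object mdl_5, which is reused further below, and the exact way
the table was computed are assumptions):

mdl_5 <- lm(y2_new ~ x1 + x3_new + age, data = dt3)
# Each VIF equals the corresponding diagonal element of the inverse correlation matrix
# of the model matrix (excluding the intercept column):
X_no_const <- model.matrix(mdl_5)[, -1]
print(round(cbind(diag(solve(cor(X_no_const)))), 4))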
## [,1]
## x1 1579.3000
## x3_new 1579.7000
## age18_26 1.6115
## age26_40 1.6389
## ageother 1.5362
I The problem is that the VIF values are affected by the baseline of the categorical variable.
I In order to be sure of not having a VIF value above an acceptable level, it would be
necessary to redo this analysis for every level of the categorical variable being the base
group - i.e. we would also need to use age18_26 as the base group and calculate the VIF’s,
then do the same with age26_40 and finally with other. This gets more complicated if we
have more than one categorical variable.
As such, it would be easier to calculate the GVIF’s.
We begin by separating the indicator variables of the categorical age variable (excluding the base
category) into a separate matrix X1, and the remaining explanatory variables (excluding the
constant) into X2:

age <- dt3$age
# Indicator columns for age, excluding the base category (so that ncol(X1) equals the Df of the age term):
X1 <- cbind(model.matrix(~ age + 0))[, -1]
X2 <- cbind(dt3$x1, dt3$x3_new)
and here is how the matrix X2 would look (X1 contains the corresponding 0/1 indicator columns):
print(head(X2))
## [,1] [,2]
## [1,] 6.879049 -11.90550
## [2,] 7.539645 -12.55117
## [3,] 11.117417 -15.91695
## [4,] 8.141017 -13.13152
Now, we can calculate the GVIF for age:
tmp_gvif <- det(cor(X1)) * det(cor(X2)) / det(cor(cbind(X1, X2)))
tmp_gvif <- data.frame(GVIF = tmp_gvif)
tmp_gvif$"GVIF^(1/2Df)" <- tmp_gvif$GVIF^(1/(2 * (ncol(X1))))
print(tmp_gvif)
## GVIF GVIF^(1/2Df)
## 1 1.015084 1.002498
We can also do the same for the continuous variables, say x1:
X1 <- cbind(x1)
X2 <- cbind(model.matrix(~ age + 0))[, -1]
X2 <- cbind(X2, x3_new)
#
tmp_gvif <- det(cor(X1)) * det(cor(X2)) / det(cor(cbind(X1, X2)))
tmp_gvif <- data.frame(GVIF = tmp_gvif)
tmp_gvif$"GVIF^(1/2Df)" <- tmp_gvif$GVIF^(1/(2 * (ncol(X1))))
print(tmp_gvif)
## GVIF GVIF^(1/2Df)
## 1 1579.27 39.74003
Alternatively, we can do the same using the built-in functions:
print(car::vif(mdl_5))
## GVIF Df GVIF^(1/(2*Df))
## x1 1579.270044 1 39.740031
## x3_new 1579.727037 1 39.745780
## age 1.015084 3 1.002498
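The second block of values below contains the squared GVIF^(1/(2·Df)) values, which (as discussed
above) can be compared directly with the usual VIF thresholds; a sketch of a chunk producing them:

# Square the third column, GVIF^(1/(2*Df)), to put it on the usual VIF scale:
print(car::vif(mdl_5)[, 3, drop = FALSE]^2)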
## GVIF^(1/(2*Df))
## x1 1579.270044
## x3_new 1579.727037
## age 1.005003
We can then compare the values with the usual rule of thumb values of 5 and 10, as we did for
VIF.
Note: we are still using the same car::vif() function - since the model includes the categorical
variable age, the GVIF is calculated automatically instead of the standard VIF.
Methods to Address Multicollinearity
The multicollinearity problem is that the collinear variables do not contain enough information
about the individual effects of each explanatory variable.
There are a number of ways we can try to deal with multicollinearity:
I Make sure you have not fallen into the dummy variable trap - including a dummy variable
for every category and including a constant term in the regression together guarantees
perfect multicollinearity.
I Do nothing - leave the model as is, despite multicollinearity. If our purpose is to predict
Y , the presence of multicollinearity doesn’t affect the efficiency of extrapolating the fitted
model to new data as long as the predictor variables follow the same pattern of
multicollinearity in the new data as in the data on which the regression model is
based. As mentioned before, if our sample is not representative, then we may get
coefficients with incorrect signs, or our model may be overfitted.
I Try to drop some of the collinear variables. If our purpose is to estimate the individual
influence of each explanatory variable, multicollinearity prevents us from doing so.
However, we lose additional information on Y from the dropped variables.
Furthermore, omission of relevant variables results in biased coefficient estimates of the
remaining coefficients.
I Obtain more data. This is the ideal solution. As we have seen from the OLS formulas, a
larger sample size produces more precise parameter estimates (i.e. with smaller standard
errors).
Methods to Address Multicollinearity (Cont.)
I Using polynomial models, or models with interaction terms may sometimes lead to
multicollinearity, especially if those variables have a very limited attainable value range. A
way to solve this would be to mean-center the predictor variables. On the other hand, if
the variables do not have a limited range, then this transformation does not help.
I Alternatively, we may combine all (or some) multicollinear variables into groups and use the
method of principal components (alternatively, ridge regression or partial least squares
regression can also be used).
I If the variables with high VIF’s are indicator variables, that represent a category with three
or more categories, then high VIF may be caused when the base category has a small
fraction of the overall observations. To address this, select the base category with a larger
fraction of overall cases.
I Generally, if we are analyzing time series data - the correlated variables may in fact be the
same variable, but at different lagged times. Consequently, including the relevant lags of the
explanatory variables would account for such a problem. For cross-sectional data
however, this is not applicable.
Example
We will show an example of how polynomial and interaction variables are strongly collinear, but
centering the variables reduces this problem.
The simulated data is for the following model:
Yi = β0 + β1 X1,i + β2 X1,i² + β3 X2,i + β4 (X1,i × X2,i) + β5 X3,i + εi
set.seed(123)
#
N <- 1000
beta_vec <- c(5, 2, 0.002, 3, -0.05, 4)
#
x1 <- rnorm(mean = 15, sd = 5, n = N)
x2 <- sample(1:20, size = N, replace = TRUE)
x3 <- sample(seq(from = 0, to = 12, length.out = 50), size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 3, n = N)
#
x_mat <- cbind(1, x1, x1^2, x2, x1 * x2, x3)
#
y <- x_mat %*% beta_vec + e
#
dt_sq <- data.frame(y, x1, x2, x3)
Our estimated model is:
mdl_sq <- lm(y ~ x1 + I(x1^2) + x2 + x1:x2 + x3, data = dt_sq)
print(round(coef(summary(mdl_sq)), 5))
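The VIF values of this model (shown below) can be obtained with, e.g. (a sketch):

print(car::vif(mdl_sq))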
## x1 I(x1^2) x2 x3 x1:x2
## 25.641296 20.849557 9.806743 1.003584 12.685321
We see that the VIF values are very large, despite the fact that only one variable appears to be
insignificant. Furthermore, the estimated coefficients, and their signs, are close to the true values.
One strategy for dealing with multicollinearity in polynomial and interaction terms is to first
center the underlying variables by subtracting their means, and only then create the polynomial
and interaction terms. The sample means of the variables are:
print(colMeans(dt_sq))
## y x1 x2 x3
## 83.299219 15.080639 10.493000 5.990694
One efficient way of subtracting the means is by applying a function to the data:
dt_sq_centered <- dt_sq
print(head(dt_sq_centered, 5))
## y x1 x2 x3
## 1 80.27734 12.19762 1 12.0000000
## 2 111.49284 13.84911 17 9.3061224
## 3 108.27019 22.79354 18 6.6122449
## 4 58.86745 15.35254 10 0.9795918
## 5 102.96509 15.64644 9 11.2653061
dt_sq_centered[, c("x1", "x2")] <- apply(dt_sq_centered[, c("x1", "x2")], MARGIN = 2, function(x){x - mean(x)})
print(head(dt_sq_centered, 5))
## y x1 x2 x3
## 1 80.27734 -2.8830176 -9.493 12.0000000
## 2 111.49284 -1.2315268 6.507 9.3061224
## 3 108.27019 7.7129022 7.507 6.6122449
## 4 58.86745 0.2719026 -0.493 0.9795918
## 5 102.96509 0.5657993 -1.493 11.2653061
Finally, if we re-estimate the model with the centered variables:
mdl_sq_centered <- lm(formula(mdl_sq), data = dt_sq_centered)
print(round(coef(summary(mdl_sq_centered)), 5))
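The VIF values of the re-estimated (centered) model, shown below, can be obtained in the same way
(a sketch):

print(car::vif(mdl_sq_centered))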
## x1 I(x1^2) x2 x3 x1:x2
## 1.006275 1.009893 1.009524 1.003584 1.012698
We see that the multicollinearity has disappeared - all of the VIF values are now close to one.
So, in the case of polynomial and interaction variables, their multicollinearity has no adverse
consequences.
On the other hand, if there are many other variables, then it may be difficult to distinguish
which variables are affected by multicollinearity.
Consequently, it is better to determine the collinearity between different variables first, and then
add polynomial and interaction terms later.
Furthermore, it is worth noting that when we center an independent variable, the interpretation
of its coefficient remains the same.
On the other hand, the interpretation of the intercept changes - β0 can now be interpreted as the average
value of Y when each centered X is equal to zero, i.e. when each original X is equal to its sample mean
(at the sample mean the centered variable is zero, so the associated term βj · (Xj − X̄j) also becomes zero).
Finally:
From a set of multiple linear regression models, defined as:

Yi = β0 + β1 X1,i + ... + βk Xk,i + εi

the best regression models are those where the predictor variables Xj:
I each highly correlate with the dependent variable Y;
I minimally correlate with each other (preferably - no correlation at all).
Weak collinearity is not a violation of OLS assumptions. If the estimated coefficients:
I have the expected signs and magnitudes;
I are not sensitive to adding or removing a few observations;
I are not sensitive to adding or removing insignificant variables;
then there is no reason to try to identify and mitigate collinearity.