4-Regression Diagnostics SAS
Lalmohan Bhar
I.A.S.R.I., Library Avenue, Pusa, New Delhi – 110 012
[email protected]
1. Introduction
Regression analysis is a statistical methodology that utilizes the relation between two or more
quantitative variables so that one variable can be predicted from the other, or others. This
methodology is widely used in business, the social and behavioral sciences, and the biological
sciences, including agriculture and fisheries research. Regression analysis serves three major
purposes: (1) description, (2) control and (3) prediction. We frequently use equations to summarize
or describe a set of data, and regression analysis is helpful in developing such equations.
When a regression model is considered for an application, we usually cannot be certain in
advance that the model is appropriate for that application. Any one, or several, of the features of
the model, such as linearity of the regression function or normality of the error terms, may not be
appropriate for the particular data at hand. Hence, it is important to examine the aptness of the
model for the data before inferences based on that model are undertaken. In this note we discuss
some simple methods for studying the appropriateness of a model, as well as some remedial
measures that can be helpful when the data are not in accordance with the conditions of the
regression model.
2. Diagnostics
2.1 Nonlinearity of Regression Model
Whether a linear regression function is appropriate for the data being analyzed can be studied
from a residual plot against the predictor variable or equivalently from a residual plot against the
fitted values.
Figure 1(a) shows a prototype situation of the residual plot against X when a linear regression
model is appropriate. The residuals then fall within a horizontal band centred around 0,
displaying no systematic tendencies to be positive and negative.
Figure 1(b) shows a prototype situation of a departure from the linear regression model that
indicates the need for a curvilinear regression function. Here the residuals tend to vary in a
systematic fashion between being positive and negative.
The prototype plot in Figure 1(a) exemplifies residual plots when the error term variance is constant.
Figure 1(c) shows a prototype picture of the residual plot when the error variance increases with X.
In many biological science applications, departures from constancy of the error variance tend to
be of the "megaphone" type.
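A minimal SAS sketch of such residual plots, assuming a hypothetical data set mydata with response y and predictor x (PROC SGPLOT is used for the plotting):

proc reg data=mydata;
   model y = x;
   output out=diag r=resid p=fitted;   /* save residuals and fitted values */
run;
quit;

proc sgplot data=diag;                 /* residual plot against the predictor */
   scatter x=x y=resid;
   refline 0 / axis=y;
run;

proc sgplot data=diag;                 /* residual plot against the fitted values */
   scatter x=fitted y=resid;
   refline 0 / axis=y;
run;

If the residuals fall within a horizontal band around the zero reference line, as in Figure 1(a), the linear model and constant error variance appear adequate.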
Modified Levene Test: The test is based on the variability of the residuals, with the residuals divided into two groups according to the level of X. Let $e_{i1}$ denote the ith residual for group 1 and $e_{i2}$ the ith residual for group 2, and let $n_1$ and $n_2$ denote the sample sizes of the two groups, where $n_1 + n_2 = n$. Further, we shall use $\tilde{e}_1$ and $\tilde{e}_2$ to denote the medians of the residuals in the two groups. The modified Levene test uses the absolute deviations of the residuals around their group median, denoted by $d_{i1}$ and $d_{i2}$:
$$d_{i1} = |e_{i1} - \tilde{e}_1|, \qquad d_{i2} = |e_{i2} - \tilde{e}_2|.$$
With this notation, the two-sample t test statistic becomes:
$$t_L^* = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$
where $\bar{d}_1$ and $\bar{d}_2$ are the sample means of the $d_{i1}$ and $d_{i2}$, respectively, and the pooled variance $s^2$ is:
$$s^2 = \frac{\sum_i (d_{i1} - \bar{d}_1)^2 + \sum_i (d_{i2} - \bar{d}_2)^2}{n - 2}.$$
If the error terms have constant variance and $n_1$ and $n_2$ are not too small, $t_L^*$ follows approximately the t distribution with $n - 2$ degrees of freedom. Large absolute values of $t_L^*$ indicate that the error terms do not have constant variance.
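The same test can be carried out in SAS through the Brown-Forsythe option of PROC GLM, which is the modified Levene test based on group medians; for two groups its F statistic is the square of $t_L^*$. A sketch, assuming the residuals and the predictor are in the data set diag created above and the groups are formed by splitting at the median of x:

proc means data=diag median noprint;   /* median of the predictor */
   var x;
   output out=med median=xmed;
run;

data levene;
   if _n_ = 1 then set med;
   set diag;
   group = (x > xmed);                 /* group 1: lower X values, group 2: upper X values */
run;

proc glm data=levene;                  /* Brown-Forsythe (modified Levene) test */
   class group;
   model resid = group;
   means group / hovtest=bf;
run;
quit;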
(ii) $WSSD_i$: $WSSD_i$ is an important statistic for locating points that are remote in x-space.
$WSSD_i$ measures the weighted sum of squared distances of the ith point from the centre of the
data. Generally, if the $WSSD_i$ values progress smoothly from small to large, there are probably
no extremely remote points. However, a sudden jump in the magnitude of $WSSD_i$ often
indicates that one or more extreme points are present.
(iii) Cook's $D_i$: Cook's $D_i$ is designed to measure the shift in $\hat{y}$ when the ith observation is not used in the estimation of the parameters. $D_i$ follows approximately an $F_{p,\,n-p-1}$ distribution; the lower 10% point of this distribution is taken as a reasonable cut-off (more conservative users suggest the 50% point). Alternatively, the cut-off for $D_i$ can be taken as $4/n$.
(iv) $DFFITS_i$: $DFFITS_i$ measures the difference in the ith fitted value $\hat{y}_i$ when the ith observation is deleted from the fit. It is suggested that $|DFFITS_i| > 2\left(\dfrac{p+1}{n}\right)^{1/2}$ may be used to flag influential observations.
(v) $DFBETAS_{j(i)}$: Cook's $D_i$ reveals the impact of the ith observation on the entire vector of the estimated regression coefficients. Influential observations for an individual regression coefficient are identified by $DFBETAS_{j(i)}$, $j = 1, 2, \ldots, p+1$, where each $DFBETAS_{j(i)}$ is the standardized change in $b_j$ when the ith observation is deleted.
(vi) $COVRATIO_i$: The impact of the ith observation on the variance-covariance matrix of the estimated regression coefficients is measured by the ratio of the determinants of the two variance-covariance matrices, computed with and without the ith observation. Thus $COVRATIO_i$ reflects the impact of the ith observation on the precision of the estimates of the regression coefficients. Values near 1 indicate that the ith observation has little effect on the precision of the estimates. A value of $COVRATIO_i$ greater than 1 indicates that deletion of the ith observation decreases the precision of the estimates; a ratio less than 1 indicates that deletion of the observation increases the precision of the estimates. Influential points are indicated by
$$|COVRATIO_i - 1| \ge \frac{3(p+1)}{n}.$$
(vii) $FVARATIO_i$: This statistic detects the change in the variance of $\hat{y}_i$ when an observation is deleted. A value near 1 indicates that the ith observation has negligible effect on the variance of $\hat{y}_i$. A value greater than 1 indicates that deletion of the ith observation decreases the precision of the estimate, while a value less than 1 indicates that it increases the precision.
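Most of these influence measures (Cook's $D_i$, DFFITS, DFBETAS, COVRATIO) are available from PROC REG. A sketch, again using the hypothetical data set mydata, here with three predictors x1-x3 (so that p + 1 = 4 parameters):

proc reg data=mydata;
   /* INFLUENCE prints hat values, RSTUDENT, DFFITS, DFBETAS and COVRATIO; R prints Cook's D */
   model y = x1 x2 x3 / influence r;
   output out=infl cookd=cookd dffits=dffits covratio=covratio h=leverage;
run;
quit;

data flagged;                            /* apply the cut-offs discussed above */
   set infl nobs=n;
   if cookd > 4/n
      or abs(dffits) > 2*sqrt(4/n)
      or abs(covratio - 1) >= 3*4/n then flag = 1;
run;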
Durbin-Watson test:
The Durbin-Watson test assumes the first-order autoregressive error model. The test consists of determining whether or not the autocorrelation coefficient ($\rho$, say) is zero. The usual test alternatives considered are:
$$H_0: \rho = 0; \qquad H_a: \rho > 0.$$
The Durbin-Watson test statistic D is obtained by using ordinary least squares to fit the regression function, calculating the ordinary residuals $e_t = Y_t - \hat{Y}_t$, and then calculating the statistic:
$$D = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}.$$
Exact critical values are difficult to obtain, but Durbin and Watson have obtained lower and upper bounds $d_L$ and $d_U$ such that a value of D outside these bounds leads to a definite decision. The decision rule for testing between the alternatives is:
if $D > d_U$, conclude $H_0$;
if $D < d_L$, conclude $H_a$;
if $d_L \le D \le d_U$, the test is inconclusive.
Small values of D lead to the conclusion that $\rho > 0$.
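In SAS the statistic is obtained with the DW option of PROC REG; a sketch, assuming a time-ordered data set tseries with response y and regressor x (hypothetical names):

proc reg data=tseries;
   model y = x / dw;    /* prints D and the first-order residual autocorrelation; data must be in time order */
run;
quit;

In releases that support it, the DWPROB option can be added to obtain a p-value for D, avoiding the inconclusive region between $d_L$ and $d_U$.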
Comparison of frequencies: When the number of cases is reasonably large, one possibility is to compare the actual frequencies of the residuals against the expected frequencies under normality. For example, one can determine whether about 90% of the residuals fall between $\pm 1.645\sqrt{MSE}$.
Normal probability plot: Still another possibility is to prepare a normal probability plot of the
residuals. Here each residual is plotted against its expected value under normality. A plot that is
nearly linear suggests agreement with normality, whereas a plot that departs substantially from
linearity suggests that the error distribution is not normal.
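A sketch of such a plot with PROC UNIVARIATE, assuming the residuals saved earlier in the data set diag:

proc univariate data=diag;
   var resid;
   qqplot resid / normal(mu=est sigma=est);   /* normal probability (Q-Q) plot of the residuals */
run;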
2.6 Multicollinearity
The use and interpretation of a multiple regression model depends implicitly on the assumption that the explanatory variables are not strongly interrelated. In most regression applications the explanatory variables are not orthogonal; when they are strongly interrelated, it is typically impossible to estimate the unique effects of individual variables in the regression equation. The estimated values of the coefficients are then very sensitive to slight changes in the data and to the addition or deletion of variables in the equation, and the regression coefficients have large sampling errors, which affect both inference and forecasting based on the regression model. The condition of severe non-orthogonality is also referred to as the problem of multicollinearity.
Detection of Multicollinearity
Let $R = (r_{ij})$ denote the simple correlation matrix of the explanatory variables and $R^{-1} = (r^{ij})$ its inverse, and let $\lambda_i$, $i = 1, 2, \ldots, p$ ($\lambda_p \le \lambda_{p-1} \le \cdots \le \lambda_1$), denote the eigenvalues of R. The following are common indicators of relationships among the independent variables:
1. Simple pair-wise correlations $|r_{ij}|$ close to 1.
2. Squared multiple correlations $R_i^2$, of each $x_i$ on the remaining x variables, close to 1.
3. Large diagonal elements $r^{ii}$ of the inverse matrix, the variance inflation factors $VIF_i$.
The first of these indicators, the simple correlation coefficient $r_{ij}$ between a pair of independent variables, may detect a simple relationship between $x_i$ and $x_j$: $|r_{ij}|$ close to 1 implies that the ith and jth variables are nearly proportional.
The second set of indicators, the $R_i^2$, the squared multiple correlation coefficients for the regression of each $x_i$ on the remaining x variables, indicates the degree to which $x_i$ is explained by a linear combination of all of the other input variables.
The third set of indicators are the diagonal elements of the inverse matrix, $r^{ii}$, which have been labelled the variance inflation factors, $VIF_i$. The term arises by noting that with standardized data (mean zero and unit sum of squares), the variance of the least squares estimate of the ith coefficient is proportional to $r^{ii}$. The common cut-off $VIF_i > 10$ is based on the simple relation between $R_i^2$ and $VIF_i$, namely $VIF_i = 1/(1 - R_i^2)$, so that $VIF_i > 10$ corresponds to $R_i^2 > 0.9$.
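These indicators are reported directly by SAS; a sketch with the hypothetical regressors x1-x3:

proc corr data=mydata;                       /* pairwise correlations r_ij */
   var x1 x2 x3;
run;

proc reg data=mydata;
   model y = x1 x2 x3 / vif tol collin;      /* VIF_i, tolerances, and eigenvalue/condition-index diagnostics */
run;
quit;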
When the nature of the regression function is not known, exploratory analysis that does not
require specifying a particular type of function is often useful.
Transformations are another remedial tool, both for linearizing the regression relation and for stabilizing the error variance. We first consider transformations for linearizing a nonlinear regression relation when the distribution of the error terms is reasonably close to a normal distribution and the error terms have approximately constant variance. In this situation, transformations on X should be attempted. The reason why a transformation on Y may not be desirable here is that a transformation on Y, such as $Y' = \sqrt{Y}$, may materially change the shape of the error distribution and may lead to substantially differing error term variances.
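A sketch of trying such transformations on X in SAS, keeping Y untouched (data set and variable names are the hypothetical ones used above):

data trans;
   set mydata;
   sqrt_x = sqrt(x);
   log_x  = log(x);     /* candidate linearizing transformations of the predictor */
run;

proc reg data=trans;
   model y = log_x;     /* refit and re-examine the residual plots */
run;
quit;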
3.6 Multicollinearity
i) Collection of additional data: Collecting additional data has been suggested as one of
the methods of combating multicollinearity. The additional data should be collected in a manner
designed to break up the multicollinearity in the existing data.
ii) Model respecification: Multicollinearity is often caused by the choice of model, such as
when two highly correlated regressors are used in the regression equation. In these situations
some respecification of the regression equation may lessen the impact of multicollinearity. One
approach to respecification is to redefine the regressors. For example, if x1, x2 and x3 are nearly
linearly dependent it may be possible to find some function such as x = (x1+x2)/x3 or x = x1x2x3
that preserves the information content in the original regressors but reduces the multicollinearity.
iii) Ridge Regression: When the method of least squares is used, the parameter estimates are unbiased. A number of procedures have been developed for obtaining biased estimators of the regression coefficients that tackle the problem of multicollinearity; one of these is ridge regression. The ridge estimators are found by solving a slightly modified version of the normal equations, in which a small positive quantity is added to each diagonal element of the X'X matrix.
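In SAS, ridge estimates over a grid of ridge constants can be obtained with the RIDGE= option of PROC REG; a sketch (data set and variables hypothetical):

proc reg data=mydata outest=ridge_est outvif ridge=0 to 0.10 by 0.01;
   model y = x1 x2 x3;
run;
quit;

proc print data=ridge_est;   /* inspect how the coefficients and VIFs change as the ridge constant grows */
run;

A value of the ridge constant at which the coefficients and VIFs have stabilized is usually chosen.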
How to prepare the data file and the syntax for performing the normality test of the residuals are illustrated below.
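A sketch of the intended steps, with illustrative file and variable names:

data crop;                     /* prepare the data file: one record per observation */
   infile 'yield.dat';         /* hypothetical raw data file */
   input yield fert;
run;

proc reg data=crop;
   model yield = fert;
   output out=resout r=resid;  /* save the residuals */
run;
quit;

proc univariate data=resout normal;   /* Shapiro-Wilk and related normality tests */
   var resid;
   qqplot resid / normal(mu=est sigma=est);
run;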
The following program is useful for testing heterogeneity (non-constancy) of the error variances.
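One such program, sketched here, uses the SPEC option of PROC REG, which performs White's test of the null hypothesis that the errors are homoscedastic (and the model correctly specified); the modified Levene test sketched earlier can equally be applied to the residuals:

proc reg data=crop;
   model yield = fert / spec;   /* White's test for heteroscedastic errors */
run;
quit;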
Some of the observations were identified as outliers. After deleting these observations, the fitted equations are shown in the following table, together with the multicollinearity diagnostics.