
REGRESSION DIAGNOSTICS AND REMEDIAL MEASURES

Lalmohan Bhar
I.A.S.R.I., Library Avenue, Pusa, New Delhi – 110 012
[email protected]

1. Introduction
Regression analysis is a statistical methodology that utilizes the relation between two or more
quantitative variables so that one variable can be predicted from the other, or others. This
methodology is widely used in business, the social and behavioral sciences, the biological
sciences including agriculture and fishery research. Regression analysis serves three major
purposes: (1) description, (2) control, and (3) prediction. We frequently use equations to summarize
or describe a set of data. Regression analysis is helpful in developing such equations.

A functional relation between two variables is expressed by a mathematical formula. If X denotes


the independent variable and Y the dependent variable, a functional relation is of the form
Y = f(X)
Given a particular value of X, the function f indicates the corresponding value of Y. Depending
on the nature of the relationship between X and Y, the regression approach may be classified into
two broad categories, viz., linear regression models and nonlinear regression models. The
response variable is generally related to other causal variables through some parameters. The
models that are linear in these parameters are known as linear models, whereas in nonlinear
models the parameters appear nonlinearly. Linear models are generally satisfactory
approximations for most regression applications. There are occasions, however, when an
empirically indicated or a theoretically justified nonlinear model is more appropriate.

When a regression model is considered for an application, we can usually not be certain in
advance that the model is appropriate for that application. Any one, or several, of the features of
the model, such as linearity of the regression function or normality of the error terms, may not be
appropriate for the particular data at hand. Hence, it is important to examine the aptness of the
model for the data before inferences based on that model are undertaken. In this note we discuss
some simple methods for studying the appropriateness of a model, as well as some remedial
measures that can be helpful when the data are not in accordance with the conditions of the
regression model.

2. Diagnostics
2.1 Nonlinearity of Regression Model
Whether a linear regression function is appropriate for the data being analyzed can be studied
from a residual plot against the predictor variable or equivalently from a residual plot against the
fitted values.

Figure 1(a) shows a prototype situation of the residual plot against X when a linear regression
model is appropriate. The residuals then fall within a horizontal band centred around 0,
displaying no systematic tendencies to be positive and negative.

Figure 1(b) shows a prototype situation of a departure from the linear regression model that
indicates the need for a curvilinear regression function. Here the residuals tend to vary in a
systematic fashion between being positive and negative.

Fig. 1(a)–1(d): Prototype residual plots

2.2 Nonconstancy of Error Variance


Plots of residuals against the predictor variable or against the fitted values are helpful not only for
studying whether a linear regression function is appropriate but also for examining whether the
variance of the error terms is constant.

The prototype plot in Figure 1(a) exemplifies residual plots when error term variance is constant.
Figure 1(c) shows a prototype picture of a residual plot when the error variance increases with X.
In many biological science applications, departures from constancy of the error variance tend to
be of the “megaphone” type.

Modified Levene Test: The test is based on the variability of the residuals. The data set is divided
into two groups, for example according to the level of X. Let $e_{i1}$ denote the ith residual for group 1
and $e_{i2}$ the ith residual for group 2, and let $n_1$ and $n_2$ denote the sample sizes of the two groups,
where $n_1 + n_2 = n$. Further, let $\tilde{e}_1$ and $\tilde{e}_2$ denote the medians of the residuals in the two
groups. The modified Levene test uses the absolute deviations of the residuals around their group
medians, denoted by $d_{i1}$ and $d_{i2}$:
$$d_{i1} = |e_{i1} - \tilde{e}_1|, \qquad d_{i2} = |e_{i2} - \tilde{e}_2|$$
With this notation, the two-sample t test statistic becomes:


d1  d 2
t L* =
1 1
s 
n1 n2

Where d1 and d 2 are the sample means of the di1 and di2, respectively, and the pooled variance s2
is:

s 
2  (d i1  d1 ) 2   (d i 2  d 2 ) 2
.
n2
If the error terms have constant variance and $n_1$ and $n_2$ are not too small, $t_L^*$ follows
approximately the t distribution with $n-2$ degrees of freedom. Large absolute values of $t_L^*$ indicate
that the error terms do not have constant variance.
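
In SAS, the test can be computed directly from saved residuals. The sketch below is a minimal illustration, assuming the residuals rs have been saved in a data set res together with a user-created group indicator grp (for example, grp=1 for the lower half of the X values and grp=2 for the upper half); these names are illustrative assumptions, not part of the original program.

/* Minimal sketch of the modified Levene test; res, rs and grp are assumed
   (hypothetical) names: rs holds the residuals and grp splits the data
   into two groups by the level of X */
proc sort data=res; by grp; run;

proc means data=res noprint nway;
   class grp;
   var rs;
   output out=med median=medrs;          /* group medians of the residuals */
run;

data lev;
   merge res med(keep=grp medrs);
   by grp;
   d = abs(rs - medrs);                  /* absolute deviation from group median */
run;

proc ttest data=lev;                     /* two-sample t test on the deviations */
   class grp;
   var d;
run;

A significant two-sample t statistic for d suggests that the error variance is not constant across the two groups.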

2.3 Presence of Outliers


Outliers are extreme observations. Residual outliers can be identified from residual plots against
X or $\hat{Y}$. Outliers can create great difficulty. When we encounter one, our first suspicion is that
the observation resulted from a mistake or other extraneous effect. On the other hand, outliers
may convey significant information, as when an outlier occurs because of an interaction with
another predictor omitted from the model.

Tests for Outlying Observations


(i) Elements of Hat Matrix: The hat matrix is defined as $H = X(X'X)^{-1}X'$, where X is the
matrix of explanatory variables. Large diagonal elements of H (the leverages) indicate observations that are outlying in the X-space.

(ii) WSSDi: WSSDi is an important statistic to locate points that are remote in x-space.
WSSDi measures the weighted sum of squared distance of the ith point from the center of the
data. Generally if the WSSDi values progress smoothly from small to large, there are probably
no extremely remote points. However, if there is a sudden jump in the magnitude of WSSDi, this
often indicates that one or more extreme points are present.

(iii) Cook's $D_i$: Cook's $D_i$ is designed to measure the shift in $\hat{y}$ when the ith observation is not used
in the estimation of the parameters. $D_i$ is compared with the percentiles of the $F_{p,\,n-p-1}$ distribution;
the lower 10% point of this distribution is taken as a reasonable cut-off (more conservative users suggest the 50% point).
A simpler cut-off of $4/n$ may also be used for $D_i$.
(iv) $DFFITS_i$: $DFFITS_i$ measures the difference in the ith component of $\hat{y}$, i.e., the change in
$\hat{y}_i$, when the ith observation is deleted. It is suggested that $|DFFITS_i| > 2\sqrt{(p+1)/n}$ may be
used to flag influential observations.
(v) $DFBETAS_{j(i)}$: Cook's $D_i$ reveals the impact of the ith observation on the entire vector of
the estimated regression coefficients. Influential observations for an individual regression
coefficient are identified by $DFBETAS_{j(i)}$, $j = 1, 2, \ldots, p+1$, where each $DFBETAS_{j(i)}$ is the
standardized change in $b_j$ when the ith observation is deleted.

(vi) $COVRATIO_i$: The impact of the ith observation on the variance-covariance matrix of the
estimated regression coefficients is measured by the ratio of the determinants of the two
variance-covariance matrices. Thus, COVRATIO reflects the impact of the ith observation on the
precision of the estimates of the regression coefficients. Values near 1 indicate that the ith
observation has little effect on the precision of the estimates. A value of COVRATIO greater
than 1 indicates that the deletion of the ith observation decreases the precision of the estimates; a
ratio less than 1 indicates that the deletion of the observation increases the precision of the
estimates. Influential points are indicated by $|COVRATIO_i - 1| \ge 3(p+1)/n$.

(vii) $FVARATIO_i$: This statistic detects the change in the variance of $\hat{y}_i$ when an observation is
deleted. A value near 1 indicates that the ith observation has a negligible effect on the variance of $\hat{y}_i$.
A value greater than 1 indicates that deletion of the ith observation decreases the precision of the
estimates; a value less than 1 indicates that deletion increases the precision of the estimates.
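
Most of these statistics are produced by the INFLUENCE and R options of the MODEL statement in PROC REG, as illustrated in Section 4; a minimal sketch using the data set abc of that section is:

/* Hat diagonals, DFFITS, DFBETAS and COVRATIO (INFLUENCE option) together
   with Cook's D and studentized residuals (R option) */
proc reg data=abc;
   model y = x1 x2 x3 / influence r;
run; quit;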

2.4 Nonindependence of Error Terms


Whenever data are obtained in a time sequence or some other type of sequence, such as for
adjacent geographical areas, it is a good idea to prepare a sequence plot of the residuals. The
purpose of plotting the residuals against time or some other type of sequence is to see if there is
any correlation between error terms that are near each other in the sequence. A prototype
residual plot showing a time related trend effect is presented in Figure 1(d), which portrays a
linear time related trend effect. When the error terms are independent, we expect the residuals in
a sequence plot to fluctuate in a more or less random pattern around the base line 0.

Tests for Randomness


A run test is frequently used to test for lack of randomness in the residuals arranged in time
order. Another test, specially designed for lack of randomness in least squares residuals, is the

Durbin-Watson test:
The Durbin-Watson test assumes the first-order autoregressive error model. The test consists of
determining whether or not the autocorrelation coefficient ($\rho$, say) is zero. The usual test
alternatives considered are:
$H_0: \rho = 0$;  $H_a: \rho > 0$.
The Durbin-Watson test statistic D is obtained by using ordinary least squares to fit the
regression function, calculating the ordinary residuals: et  Yt  Yˆt , and then calculating the
statistic:
$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}.$$


Exact critical values are difficult to obtain, but Durbin and Watson obtained lower and upper
bounds $d_L$ and $d_U$ such that a value of D outside these bounds leads to a definite decision. The
decision rule for testing between the alternatives is:
if $D > d_U$, conclude $H_0$;
if $D < d_L$, conclude $H_a$;
if $d_L \le D \le d_U$, the test is inconclusive.
Small values of D lead to the conclusion that $\rho > 0$.
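
In SAS, the Durbin-Watson statistic is requested with the DW option of the MODEL statement in PROC REG; a minimal sketch, assuming the observations in the data set abc of Section 4 are arranged in time order, is:

/* Durbin-Watson statistic for residuals taken in the order of the data set */
proc reg data=abc;
   model y = x1 x2 x3 / dw;
run; quit;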

2.5 Nonnormality of Error Terms


Small departures from normality do not create any serious problems. Major departures, on the
other hand, should be of concern. The normality of the error terms can be studied informally by
examining the residuals in a variety of graphic ways.

Comparison of frequencies: When the number of cases is reasonably large, one can compare the actual
frequencies of the residuals against the expected frequencies under normality. For example, one can
determine whether, say, about 90% of the residuals fall between $\pm 1.645\sqrt{MSE}$.

Normal probability plot: Still another possibility is to prepare a normal probability plot of the
residuals. Here each residual is plotted against its expected value under normality. A plot that is
nearly linear suggests agreement with normality, whereas a plot that departs substantially from
linearity suggests that the error distribution is not normal.

Correlation Test for Normality


In addition to visually assessing the approximate linearity of the points plotted in a normal
probability plot, a formal test for normality of the error terms can be conducted by calculating
the coefficient of correlation between the residuals $e_i$ and their expected values under normality. A
high value of the correlation coefficient is indicative of normality.
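
In SAS, a convenient way to carry out formal normality checks is to save the residuals and pass them to PROC UNIVARIATE with the NORMAL option, which prints normality tests (e.g., Shapiro-Wilk) and can produce a normal Q-Q plot; a minimal sketch using the data set abc of Section 4 (the output data set name resids is illustrative):

/* Save the residuals and test them for normality */
proc reg data=abc;
   model y = x1 x2 x3;
   output out=resids r=rs;
run; quit;

proc univariate data=resids normal;
   var rs;
   qqplot rs / normal(mu=est sigma=est);   /* normal Q-Q plot of the residuals */
run;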

2.6 Multicollinearity
The use and interpretation of a multiple regression model depend implicitly on the assumption
that the explanatory variables are not strongly interrelated. In most regression applications the
explanatory variables are not orthogonal. When they are strongly interrelated, it is typically
impossible to estimate the unique effects of individual variables in the regression equation. The estimated values of the coefficients
are very sensitive to slight changes in the data and to the addition or deletion of variables in the
equation. The regression coefficients have large sampling errors which affect both inference and
forecasting that is based on the regression model. The condition of severe non-orthogonality is
also referred to as the problem of multicollinearity.

Detection of Multicollinearity
Let $R = (r_{ij})$ and $R^{-1} = (r^{ij})$ denote the simple correlation matrix and its inverse, and let
$\lambda_i,\ i = 1, 2, \ldots, p$ ($\lambda_p \ge \lambda_{p-1} \ge \cdots \ge \lambda_1$) denote the eigenvalues of R. The following are common
indicators of relationships among independent variables.
1. The simple pair-wise correlations, $|r_{ij}| \approx 1$


2. The squared multiple correlation coefficients, $R_i^2 = 1 - \dfrac{1}{r^{ii}} \ge 0.9$, where $R_i^2$ denotes the
squared multiple correlation coefficient for the regression of $x_i$ on the remaining x variables.
3. The variance inflation factors, $VIF_i = r^{ii} \ge 10$, and
4. The eigenvalues, $\lambda_i \approx 0$.

The first of these indicators, the simple correlation coefficient $r_{ij}$ between a pair of independent
variables, may detect a simple relationship between $x_i$ and $x_j$. Thus $|r_{ij}| \approx 1$ implies that the
ith and jth variables are nearly proportional.

The second set of indicators, $R_i^2$, the squared multiple correlation coefficients for the regression
of $x_i$ on the remaining x variables, indicates the degree to which $x_i$ is explained by a linear
combination of all of the other input variables.

The third set of indicators, the diagonal elements $r^{ii}$ of the inverse matrix, have been labeled
the variance inflation factors, $VIF_i$. The term arises by noting that, with standardized data
(mean zero and unit sum of squares), the variance of the least squares estimate of the ith
coefficient is proportional to $r^{ii}$. The cut-off $VIF_i \ge 10$ is based on the simple relation between
$R_i^2$ and $VIF_i$: $VIF_i \ge 10$ corresponds to $R_i^2 \ge 0.9$.

3. Overview of Remedial Measures


If the simple regression model (1) is not appropriate for a data set, there are two basic choices:
1. Abandon the regression model and develop and use a more appropriate model.
2. Employ some transformation on the data so that regression model (1) is appropriate for
the transformed data.
Each approach has advantages and disadvantages. The first approach may entail a more complex
model that could yield better insights, but may also lead to more complex procedures for
estimating the parameters. Successful use of transformations, on the other hand, leads to relatively
simple methods of estimation and may involve fewer parameters than a complex model, an
advantage when the sample size is small. Yet transformations may obscure the fundamental
interconnections between the variables, though at times they may illuminate them.

3.1 Nonlinearity of Regression Function


When the regression function is not linear, a direct approach is to modify regression model (1)
by altering the nature of the regression function. For instance, a quadratic regression function
might be used:
Yi  0  1 X i  2 X i2  i
or an exponential regression function:
Xi
Yi   0 1  i .


When the nature of the regression function is not known, exploratory analysis that does not
require specifying a particular type of function is often useful.
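
As a sketch of how such functions can be fitted in SAS, a quadratic term can be created in a DATA step and fitted with PROC REG, while a nonlinear function such as the exponential one above can be fitted with PROC NLIN; the data set name one, the variables x and y, and the starting values below are illustrative assumptions.

/* Quadratic regression function: create the squared term and fit by OLS */
data quad;
   set one;
   xsq = x*x;
run;

proc reg data=quad;
   model y = x xsq;
run; quit;

/* Exponential regression function fitted by nonlinear least squares;
   the starting values in PARMS are rough guesses */
proc nlin data=one;
   parms b0=1 b1=1;
   model y = b0*(b1**x);
run;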

3.2 Nonconstancy of Error Variance


When the error variance is not constant but varies in a systematic fashion, a direct approach is to
modify the model to allow for this and to use the method of weighted least squares to obtain the
estimates of the parameters (a sketch appears after the list of transformations below).

Transformation is another way of stabilizing the variance. We first consider transformations for
linearizing a nonlinear regression relation when the distribution of the error terms is reasonably
close to a normal distribution and the error terms have approximately constant variance. In this
situation, transformations on X should be attempted. The reason why a transformation on Y may not
be desirable here is that a transformation on Y, such as $Y' = \sqrt{Y}$, may materially change the
shape of the distribution and may lead to substantially differing error term variances.

The following transformations are generally applied for stabilizing the variance:
(1) when the error variance is rapidly increasing, $Y' = \log_{10} Y$ or $Y' = \sqrt{Y}$;
(2) when the error variance is slowly increasing, $Y' = Y^2$ or $Y' = \exp(Y)$;
(3) when the error variance is decreasing, $Y' = 1/Y$ or $Y' = \exp(Y)$.
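
A minimal sketch of both remedies in SAS, assuming the data set abc of Section 4 and a weight variable w that the analyst has already constructed (for example, as the reciprocal of an estimated variance function); w is an assumption, not part of the original data:

/* Transformation of Y in a DATA step, followed by refitting */
data trans;
   set abc;
   logy = log10(y);                 /* Y' = log10(Y), for increasing variance */
run;

proc reg data=trans;
   model logy = x1 x2 x3;
run; quit;

/* Weighted least squares via the WEIGHT statement; w is an assumed
   weight variable, e.g. w = 1/(estimated variance) */
proc reg data=abc;
   weight w;
   model y = x1 x2 x3;
run; quit;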

Box-Cox Transformations: It is difficult to determine which transformation of Y is most
appropriate for correcting skewness of the distribution of the error terms, unequal error variances,
and nonlinearity of the regression function. The Box-Cox procedure automatically identifies
a transformation from the family of power transformations on Y. The family of power
transformations is of the form $Y' = Y^{\lambda}$, where $\lambda$ is a parameter to be determined from the data.
It can be determined easily using standard computer programs.

3.3 Nonindependence of Error Terms


When the error terms are correlated, a direct approach is to work with a model that allows for
correlated error terms. A simple remedial transformation that is often helpful is to work with first
differences, as sketched below.
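
A minimal sketch of the first-differences transformation in a SAS DATA step, assuming the observations in abc are already in time order (the differenced regression is often fitted without an intercept):

/* First differences of the response and the predictors; the first record
   is dropped because its lagged values are missing */
data diff;
   set abc;
   dy  = y  - lag(y);
   dx1 = x1 - lag(x1);
   dx2 = x2 - lag(x2);
   dx3 = x3 - lag(x3);
   if _n_ > 1;
run;

proc reg data=diff;
   model dy = dx1 dx2 dx3 / noint;   /* regression through the origin */
run; quit;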

3.4 Nonnormality of Error Terms


Lack of normality and non-constant error variance frequently go hand in hand. Fortunately, it is
often the case that the same transformation that helps stabilize the variance is also helpful in
approximately normalizing the error terms. It is, therefore, desirable that the transformation for
stabilizing the error variance be utilized first, and then the residuals studied to see if serious
departures from normality are still present.

3.5 Outlying Observations


When outlying observations are present, use of the least squares and maximum likelihood
estimates for regression model (1) may lead to serious distortions in the estimated regression
function. When the outlying observations do not represent recording errors and should not be
discarded, it may be desirable to use an estimation procedure that places less emphasis on such
outlying observations. Robust Regression falls under such methods.


3.6 Multicollinearity
i) Collection of additional data: Collecting additional data has been suggested as one of
the methods of combating multicollinearity. The additional data should be collected in a manner
designed to break up the multicollinearity in the existing data.
ii) Model respecification: Multicollinearity is often caused by the choice of model, such as
when two highly correlated regressors are used in the regression equation. In these situations
some respecification of the regression equation may lessen the impact of multicollinearity. One
approach to respecification is to redefine the regressors. For example, if x1, x2 and x3 are nearly
linearly dependent it may be possible to find some function such as x = (x1+x2)/x3 or x = x1x2x3
that preserves the information content in the original regressors but reduces the multicollinearity.
iii) Ridge Regression: When the method of least squares is used, the parameter estimates are
unbiased. A number of procedures have been developed for obtaining biased estimators of the
regression coefficients to tackle the problem of multicollinearity. One of these procedures is
ridge regression. The ridge estimators are found by solving a slightly modified version of the
normal equations, in which a small quantity is added to each diagonal element of the $X'X$ matrix.
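
In SAS, ridge estimates are obtained with the RIDGE= option of PROC REG; a minimal sketch for the data set abc of Section 4, trying a small grid of ridge constants, is shown below (the grid itself is an illustrative choice):

/* Ridge regression over a grid of ridge constants k; OUTVIF also writes
   the VIFs at each k to the OUTEST= data set */
proc reg data=abc outest=ridge_est ridge=0 to 0.10 by 0.01 outvif;
   model y = x1 x2 x3;
run; quit;

proc print data=ridge_est;
run;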

4. Illustration through SAS


Consider the following data set:
Case Xi1 Xi2 Xi3 Yi
1 12.980 0.317 9.998 57.702
2 14.295 2.028 6.776 59.296
3 15.531 5.305 2.947 56.166
4 15.133 4.738 4.201 55.767
5 15.342 7.038 2.053 51.722
6 17.149 5.982 -0.055 60.446
7 15.462 2.737 4.657 60.715
8 12.801 10.663 3.048 37.447
9 17.039 5.132 0.257 60.974
10 13.172 2.039 8.738 55.270
11 16.125 2.271 2.101 59.289
12 14.340 4.077 5.545 54.027
13 12.923 2.643 9.331 53.199
14 14.231 10.401 1.041 41.896
15 15.222 1.220 6.149 63.264
16 15.740 10.612 -1.691 45.798
17 14.958 4.815 4.111 58.699
18 14.125 3.153 8.453 50.086
19 16.391 9.698 -1.714 48.890
20 16.452 3.912 2.145 62.213
21 13.535 7.625 3.851 45.625
22 14.199 4.474 5.112 53.923
23 15.837 5.753 2.087 55.799
24 16.565 8.546 8.974 56.741
25 13.322 8.589 4.011 43.145
26 15.949 8.290 -0.248 50.706


How to prepare the data file and the syntax for performing the normality test are captured in the
following screen shot.

A portion of the output file is given in the following screen shot

The following program is useful for performing a heterogeneity test for the errors.


/* Fit the model and save the residuals (rs) and predicted values (pr) */
proc reg data=abc;
   model y=x1 x2 x3;
   output out=abc r=rs p=pr;
run; quit;

/* Spearman rank correlation between the absolute residuals and the
   predicted values; a significant correlation indicates heterogeneous
   error variance */
data lm;
   set abc;
   abrs=abs(rs);
run;

proc corr data=lm spearman;
   var abrs pr;
run;

The following program is used for detecting influential observations.


/* Influence diagnostics (hat diagonals, DFFITS, DFBETAS, COVRATIO) and
   Cook's D saved to the data set lmb */
proc reg data=abc;
   model y = x1-x3 / influence;
   output out=lmb cookd=d;
run; quit;

proc print data=lmb;
run;
A graphical view of the diagnostics for outliers can be obtained through the following
programme.

ods rtf file="lm5.rtf";
ods graphics on;
proc reg data=abc
   plots=(diagnostics(stats=none) RStudentByLeverage(label)
          CooksD(label) Residuals(smooth)
          DFFITS(label) DFBETAS ObservedByPredicted(label));
   model y=x1 x2 x3;
run;
ods graphics off;
ods rtf close;
A portion of the result of this programme is given below:


The following programme is used for the Box-Cox transformation.


ods graphics on;
ods rtf file="lm2.rtf";
ods rtf select boxcoxplot boxcoxloglikeplot rmseplot;
proc transreg data=lmb test cl
   plots=boxcox(rmse unpack);
   model boxcox(y) = identity(x1 x2 x3);
run;
ods rtf close;
ods graphics off;


A portion of the result is given below:

Some of the observations were identified as outliers. After deleting these observations, the fitted
equations are shown in the following table. Multicollinearity diagnostics are also shown in the
same table.

Regression Coefficients and Summary Statistics

Description                  b0      b1     b2      b3      s      R2    Max VIF   Min e.v.   Max Ri2
All Data (n=26)              8.11    3.56   -1.63    0.34   1.80   0.94    2.82     0.210      0.65
Delete (11, 17, 18)          7.17    3.66   -1.79    0.40   0.51   0.99    2.85     0.210      0.65
Delete (24)                 30.91    2.39   -2.14   -0.36   1.78   0.94   30.64     0.017      0.97
Delete (11, 17, 18, 24)     24.27    2.79   -2.11   -0.16   0.50   0.99  171.90     0.003      0.99
Ridge k=0.05 (n=22)         14.28    3.22   -1.73    0.25   0.66   0.99   10.20     0.053      0.90
Delete X3 (n=22)            19.50    3.03   -2.00     --    0.49   0.99    1.02     0.863      0.02

