Chapter 4
Additionally, there are issues that can arise during the analysis that, while strictly speaking
are not assumptions of regression, are nonetheless of great concern to data analysts.
The validity of these assumptions is needed for the results to be meaningful. If the assumptions are
violated, the results can be incorrect and may have serious consequences. If the departures are small, the
final result may not change significantly, but if the departures are large, the model obtained may
become unstable, in the sense that a different sample could lead to an entirely different model with opposite
conclusions. Such underlying assumptions therefore have to be verified before attempting regression
modeling. This information is not available from summary statistics such as the t-statistic, the F-statistic, or
the coefficient of determination.
One important point to keep in mind is that these assumptions concern the population, whereas we work only
with a sample. So the main issue is to make a decision about the population on the basis of a sample of
data.
Several diagnostic methods for checking violations of the regression assumptions are based on the study of
the model residuals with the help of various types of graphics.
Checking the linear relationship between the study variable and the explanatory variables
1. Case of one explanatory variable
If there is only one explanatory variable in the model, it is easy to check the existence of a linear
relationship between y and X by a scatter diagram of the available data.
If the scatter diagram shows a linear trend, it indicates that the relationship between y and X is linear. If the
trend is not linear, it indicates that the relationship between y and X is nonlinear. For example, the
following figure indicates a linear trend between y and X.
Whereas the following figure indicates a nonlinear trend:
Such an arrangement helps in examining a plot and the corresponding correlation coefficient together. The
pairwise correlation coefficient should always be interpreted in conjunction with the corresponding scatter
plots because:
The correlation coefficient measures only the linear relationship, and
The correlation coefficient is non-robust, i.e., its value can be substantially influenced by
one or two observations in the data.
The presence of linear patterns is reassuring, but the absence of such patterns does not imply that a linear model
is incorrect. Most statistical software provides an option for creating a scatterplot matrix. A
view of all the plots provides an indication of whether a multiple linear regression model may provide a
reasonable fit to the data. It is to be kept in mind that we get only information on pairs of variables
through the scatterplots of (y versus X1), (y versus X2), …, (y versus Xk), whereas the assumption of
linearity concerns y jointly with X1, X2, …, Xk.
If some of the explanatory variables are themselves interrelated, these scatter diagrams can be
misleading. Other methods for sorting out the relationships between several explanatory variables
and a study variable are then used.
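As an illustration, the sketch below builds a scatterplot matrix together with the pairwise correlation coefficients using pandas. The data and the column names (X1, X2, y) are simulated purely for illustration and are not taken from any example in this chapter.

```python
# Sketch: scatterplot matrix and pairwise correlations for a study variable y
# and two explanatory variables (all data simulated for illustration).
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(size=50),
    "X2": rng.normal(size=50),
})
df["y"] = 2 + 1.5 * df["X1"] - 0.8 * df["X2"] + rng.normal(scale=0.5, size=50)

scatter_matrix(df, figsize=(6, 6), diagonal="hist")  # pairwise scatter plots
print(df.corr().round(2))                            # pairwise correlation coefficients
plt.show()
```

Remember that each panel and each coefficient describes only a pairwise relationship; it says nothing directly about the joint linearity assumption.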
Residual Analysis in Regression
Because a linear regression model is not always appropriate for the data, you should assess the
appropriateness of the model by defining residuals and examining residual plots.
Residuals
The residual (e) is defined as the difference between the observed value of the dependent variable (y) and the
predicted value (ŷ), that is, e = y − ŷ. Each data point has one residual.
Residual Plots
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on
the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a
linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
The table below shows the inputs and outputs from a simple linear regression analysis, and the
accompanying chart displays the residuals (e) against the independent variable (X) as a residual plot.
X 60 70 80 85 95
Y 70 65 70 95 85
The residual plot shows a fairly random pattern - the first residual is positive, the next two are
negative, the fourth is positive, and the last residual is negative. This random pattern indicates
that a linear model provides a decent fit to the data.
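As a sketch of how such a residual plot can be produced, the code below fits a least-squares line to the X and Y values in the table above and plots the residuals against X; numpy and matplotlib are assumed, and the fitted values may differ slightly from those behind the original chart.

```python
# Sketch: compute residuals e = y - ŷ for the tabulated data and plot them against X.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([60, 70, 80, 85, 95])
y = np.array([70, 65, 70, 95, 85])

b1, b0 = np.polyfit(x, y, deg=1)   # slope and intercept of the least-squares line
y_hat = b0 + b1 * x                # predicted values
residuals = y - y_hat              # one residual per data point

plt.axhline(0, color="grey")
plt.scatter(x, residuals)
plt.xlabel("X")
plt.ylabel("Residual (e)")
plt.show()
```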
Below, the residual plots show three typical patterns. The first plot shows a random pattern,
indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped and
inverted U), suggesting a better fit for a non-linear model. If the relationship is linear, there should be
no systematic relationship between the residuals and the predictor variable.
When a residual plot reveals a data set to be nonlinear, it is often possible to "transform" the raw data to
make it more linear. This allows us to use linear regression techniques more effectively with nonlinear data.
Transforming a variable involves using a mathematical operation to change its measurement scale. Broadly
speaking, there are two kinds of transformations.
Linear transformation. A linear transformation preserves linear relationships between variables. Therefore,
the correlation between x and y would be unchanged after a linear transformation. Examples of a linear
transformation to variable x would be multiplying x by a constant, dividing x by a constant, or adding a
constant to x.
Nonlinear Transformation
A nonlinear transformation changes (increases or decreases) linear relationships between variables and, thus,
changes the correlation between variables. Examples of a nonlinear transformation of variable x would be taking
the square root of x or the reciprocal of x. In regression, a transformation to achieve linearity is a special kind of
nonlinear transformation: one that increases the linear relationship between two variables.
There are many ways to transform variables to achieve linearity for regression analysis. Some common
transformations are listed below.
Method                        Transformation(s)    Regression equation      Predicted value (ŷ)
Standard linear regression    None                 y = b0 + b1x             ŷ = b0 + b1x
Exponential model             log(y)               log(y) = b0 + b1x        ŷ = 10^(b0 + b1x)
Quadratic model               sqrt(y)              sqrt(y) = b0 + b1x       ŷ = (b0 + b1x)^2
Reciprocal model              1/y                  1/y = b0 + b1x           ŷ = 1 / (b0 + b1x)
Logarithmic model             log(x)               y = b0 + b1 log(x)       ŷ = b0 + b1 log(x)
Each row shows a different transformation method. The second column shows the specific
transformation applied to dependent and/or independent variables. The third column shows the regression
equation used in the analysis. And the last column shows the "back transformation" equation used to restore
the dependent variable to its original, non-transformed measurement scale.
In practice, these methods need to be tested on the data to which they are applied to be sure that they increase
rather than decrease the linearity of the relationship. Testing the effect of a transformation method involves
looking at residual plots and correlation coefficients, as described in the following sections.
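As an illustration of testing a transformation, the sketch below applies the exponential-model row of the table: it regresses log10(y) on x and back-transforms the predictions as ŷ = 10^(b0 + b1x). The data are simulated for illustration only.

```python
# Sketch: transformation to achieve linearity (exponential model).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 30)
y = 3.0 * np.exp(0.4 * x) * rng.lognormal(sigma=0.1, size=x.size)  # curved relationship

b1, b0 = np.polyfit(x, np.log10(y), deg=1)   # fit on the transformed scale
y_hat = 10 ** (b0 + b1 * x)                  # back-transform to the original scale

# The correlation coefficient is noticeably higher on the transformed scale.
print("corr(x, y)        =", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(x, log10(y)) =", round(np.corrcoef(x, np.log10(y))[0, 1], 3))
```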
[Figures: an outlier with an extreme X value and an outlier with an extreme Y value]
Influential Points
An influential point is an outlier that greatly affects the slope of the regression line. One way to test the
influence of an outlier is to compute the regression equation with and without the outlier.
This type of analysis is illustrated below. The scatter plots are identical, except that the plot on the right
includes an outlier. The slope is flatter when the outlier is present (-3.32 vs. -4.10), so this outlier would
be considered an influential point.
The charts below compare regression statistics for another data set with and without an outlier. Here, the
chart on the right has a single outlier, located at the high end of the X axis (where x = 24). As a result of
that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would
be considered an influential point.
[Charts: regression statistics without the outlier and with the outlier]
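The comparison described above can be reproduced numerically by fitting the regression with and without the suspected point, as in the sketch below. The data are made up for illustration and are not the data behind the charts.

```python
# Sketch: compare the fitted slope with and without a suspected influential point.
import numpy as np

x = np.array([2, 4, 6, 8, 10, 12, 24])    # the last point has an extreme X value
y = np.array([40, 35, 30, 27, 22, 18, 30])

slope_with, _ = np.polyfit(x, y, deg=1)
slope_without, _ = np.polyfit(x[:-1], y[:-1], deg=1)

print("slope with the outlier:   ", round(slope_with, 2))
print("slope without the outlier:", round(slope_without, 2))
```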
If your data set includes an influential point, here are some things to consider.
An influential point may represent bad data, possibly the result of measurement error. If possible,
check the validity of the data point.
Compare the decisions that would be made based on regression equations defined with and
without the influential point. If the equations lead to contrary decisions, use caution.
Studentized Residuals – Residuals divided by their estimated standard errors (like t-statistics).
Observations with values larger than 3 in absolute value are considered outliers.
Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an
observation whose dependent-variable value is unusual given its values on the predictor variables. An
outlier may indicate a sample peculiarity, a data entry error, or some other problem.
Leverage Values (Hat Diag) – An observation with an extreme value on a predictor variable is
called a point with high leverage. Leverage is a measure of how far an observation is from the others in
terms of the levels of the independent variables (not the dependent variable). Observations with
values larger than 2(k+1)/n are considered to be potentially highly influential, where k is the number
of predictors and n is the sample size.
Influence: An observation is said to be influential if removing the observation substantially changes
the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.
DFFITS – Measure of how much an observation has affected its fitted value from the regression
model. Values larger than 2*sqrt((k+1)/n) in absolute value are considered highly influential. Use
standardized DFFITS in SPSS.
DFBETAS – Measure of how much an observation has affected the estimate of a regression
coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values
larger than 2/sqrt (n) in absolute value are considered highly influential.
Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients,
as well as the group of fitted values. Values larger than 4/n are considered highly influential.
COVRATIO – Measure of the impact of each observation on the variances (and standard errors) of
the regression coefficients and their covariances. Values outside the interval 1 ± 3(k+1)/n are
considered highly influential.
Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is
with the other predictors in the model. Values larger than 10 for a predictor imply large inflation of
standard errors of regression coefficients due to this variable being in model.
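A minimal sketch of how these diagnostics and their rule-of-thumb cut-offs can be computed with statsmodels; the simulated data and the variable names are assumptions for illustration.

```python
# Sketch: influence diagnostics and VIF with statsmodels (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()
infl = results.get_influence()

stud_resid = infl.resid_studentized_external   # studentized residuals, |value| > 3 suspect
leverage = infl.hat_matrix_diag                # leverage, cut-off 2(k+1)/n
cooks_d = infl.cooks_distance[0]               # Cook's D, cut-off 4/n
dffits = infl.dffits[0]                        # DFFITS, cut-off 2*sqrt((k+1)/n)
dfbetas = infl.dfbetas                         # DFBETAS, cut-off 2/sqrt(n)
cov_ratio = infl.cov_ratio                     # COVRATIO, outside 1 +/- 3(k+1)/n

vif = [variance_inflation_factor(X_const, j) for j in range(1, k + 1)]  # VIF > 10 is a concern

print("possible outliers:     ", np.where(np.abs(stud_resid) > 3)[0])
print("high-leverage points:  ", np.where(leverage > 2 * (k + 1) / n)[0])
print("influential (Cook's D):", np.where(cooks_d > 4 / n)[0])
print("VIF:", np.round(vif, 2))
```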
A. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable and set of
Independent variables from your model of interest (possibly having been chosen via an automated
model selection method).
B. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics and All Cases and
CONTINUE
C. Under PLOTS, select Y:*SRESID and X:*ZPRED. Also choose HISTOGRAM. These give a plot
of studentized residuals versus standardized predicted values, and a histogram of standardized
residuals (residual/sqrt(MSE)). Select CONTINUE.
D. Under SAVE, select Studentized Residuals, Cook's, Leverage Values, Covariance Ratio,
Standardized DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to
your original data worksheet.
Remedial Measures
There are two things you can do when you find out that your linear regression model is
not appropriate:
Change your model (use another statistical model).
Change your data.
Transformations of X.
Transformations of Y.
Omitting outliers.
Problems and Solutions
Nonlinearity of the regression function:
Use a nonlinear model, or transform X (if the residuals are reasonably normal with constant variance).
Non-constant error variance:
Use the weighted least squares estimation method (a sketch follows below).
Transform Y if the mean function is reasonably linear; this amounts to working with variance-stabilizing
transformations of Y.
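A minimal sketch of weighted least squares for non-constant error variance, assuming the error spread grows with x so that weights proportional to 1/x² are reasonable; both the data and the choice of weights are illustrative.

```python
# Sketch: weighted least squares when the error variance is not constant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 60)
y = 2 + 3 * x + rng.normal(scale=0.5 * x, size=x.size)   # spread grows with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()          # weight proportional to 1/variance

print("OLS coefficients:", np.round(ols_fit.params, 3))
print("WLS coefficients:", np.round(wls_fit.params, 3))
```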
5. Remedial measures of model inadequacy
Data do not always come in a form that is immediately suitable for analysis. We often have to transform
the variables before carrying out the analysis. Transformations are applied to accomplish certain
objectives such as:
To stabilize the relationship
To stabilize the variance of the dependent variable
To normalize the dependent variable
To linearize the regression model
A number of the problems in our model can be solved by transforming X.
Why do we concentrate on x?
The distribution of the error terms depends on Y, not X.
If we were to transform Y, we would change the shape and nature of the analysis.
So, always transform X.
So, you have problems, what transformation do you use? Some common transformations are:
X′ = ln(X)    X′ = √X    X′ = exp(X)
Note: Box-Cox transformations of the response: Instead of selecting a transformation “by eye”, select
an optimal power transformation.
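A minimal sketch of a Box-Cox power transformation of the response using scipy, assuming a positive, right-skewed y simulated for illustration.

```python
# Sketch: Box-Cox transformation of the response, with lambda chosen by maximum likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 80)
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.2, size=x.size))  # positive, skewed response

y_transformed, lam = stats.boxcox(y)   # optimal power transformation parameter lambda
print("estimated lambda:", round(lam, 3))
```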
In building a regression model, two opposite approaches can be taken:
1. In order to make the model as realistic as possible, the analyst may include as many explanatory
variables as possible.
2. In order to keep the model as simple as possible, one may include only a small number of explanatory
variables.
Both approaches have their own consequences. In fact, model building and subset selection have
contradictory objectives.
When a large number of variables is included in the model, all of these factors influence the
prediction of the study variable y.
On the other hand, when a small number of variables is included, the predictive variance of ŷ
decreases.
Also, collecting observations on a larger number of variables involves more cost, time, labour, etc.
A compromise between these consequences is struck to select the "best regression equation".
In order to keep the model simple, the analyst may delete some of the explanatory variables which may be
of importance from the point of view of theoretical considerations. There can be several reasons behind
such a decision; for example, it may be hard to quantify variables like taste, intelligence, etc., and it may
sometimes be difficult to take correct observations on variables like income.
Sometimes, out of enthusiasm to make the model more realistic, the analyst may include some
explanatory variables that are not very relevant to the model. Such variables may contribute very little to
the explanatory power of the model. This tends to reduce the degrees of freedom (n - k), and
consequently the validity of the inferences drawn may be questionable. For example, the value of the coefficient
of determination will increase, indicating that the model is getting better, which may not really be true.
Step 1: Specifying the maximum Model: The maximum model is defined to be the largest model (the
one having the most predictor variables) considered at any point in the process of model selection.
Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to
evaluate subset regression models. The criterion used for model selection should be related to the
intended use of the model.
F-Test Statistic: Another reasonable criterion for selecting the best model is the F-test statistic
for comparing the full and reduced models.
This statistic may be compared to an F-distribution with k-p+1 and n-k-1 degrees of
freedom. If F-Calculated is not significant, we can use the smaller (P-1 variables) model.
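A sketch of how this partial F-statistic can be computed, assuming a full model with k regressors and a reduced model with p-1 of them; the data are simulated for illustration.

```python
# Sketch: partial F-test comparing a full model (k regressors) with a reduced model (p-1 regressors).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n, k, p = 60, 4, 3                      # full model: k regressors; reduced model: p-1 regressors
X = rng.normal(size=(n, k))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()
reduced = sm.OLS(y, sm.add_constant(X[:, : p - 1])).fit()

num_df = k - p + 1                      # numerator degrees of freedom
den_df = n - k - 1                      # denominator degrees of freedom
F = ((reduced.ssr - full.ssr) / num_df) / (full.ssr / den_df)
p_value = stats.f.sf(F, num_df, den_df)
print("partial F =", round(F, 3), " p-value =", round(p_value, 4))
```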
Coefficient of Determination (R²): A measure of the adequacy of a regression model that has
been widely used is the coefficient of determination, R².
R² increases as p increases and is maximum when p = k + 1. Therefore, the analyst uses this
criterion by adding regressors to the model up to the point where an additional variable only
provides a small increase in R².
All possible regression procedure: The all possible regression procedure requires that we fit
each possible regression equation.
Backward Elimination Procedure: We begin with a model that includes all candidate
regressors. Then the partial F-statistic is computed for each regressor as if it were the last
variable to enter the model. The smallest of these partial F-statistics is compared with a pre-
selected value FOUT; if it is smaller, that regressor is removed from the model. A regression model with k-1
regressors is then fit, the partial F-statistics for this new model are calculated, and the procedure is repeated. The
backward elimination algorithm terminates when the smallest partial F-value is not less than the
pre-selected cutoff value FOUT.
Forward Selection Procedure: The procedure begins with the assumption that there are no
regressors in the model other than the intercept. An effort is made to find an optimal subset by
inserting regressors into the model one at a time. At each step, the regressor having the highest partial correlation
with y (or, equivalently, the largest partial F-statistic given the other regressors already in the model) is
added to the model if its partial F-statistic exceeds the pre-selected entry level FIN.
Stepwise Regression Procedure: Stepwise regression is a modified version of forward
selection that permits re-examination, at every step, of the variables incorporated in the model in
previous steps. A variable that entered at an early stage may become superfluous at a later stage
because of its relationship with other variables subsequently added to the model.
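A minimal sketch of forward selection, using the p-value of each candidate's coefficient t-test as the entry criterion (for a single added regressor this is equivalent to a partial F threshold FIN). The data, the column names, and the helper function forward_selection are illustrative assumptions.

```python
# Sketch: greedy forward selection based on the p-value of each candidate regressor.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y, alpha_in=0.05):
    """Add, at each step, the regressor with the smallest p-value,
    as long as that p-value is below the entry threshold alpha_in."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for name in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [name]])).fit()
            pvals[name] = fit.pvalues[name]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_in:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(80, 5)), columns=[f"X{i+1}" for i in range(5)])
y = 2 + 1.5 * X["X1"] - 2.0 * X["X3"] + rng.normal(size=80)
print(forward_selection(X, y))   # expected to pick up X1 and X3
```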