
Chapter 4

Model Adequacy Checking


The fitting of a linear regression model, the estimation of its parameters, the testing of hypotheses, and the properties
of the estimators are based on the following major assumptions:
1. The relationship between the study variable and explanatory variables is linear, at least
approximately.
2. The error term has zero mean.
3. The error term has constant variance.
4. The errors are uncorrelated.
5. The errors are normally distributed.

Additionally, there are issues that can arise during the analysis that, while strictly speaking
are not assumptions of regression, are nonetheless of great concern to data analysts.

 Influence - individual observations that exert undue influence on the coefficients


 Collinearity - predictors that are highly collinear, i.e., linearly related, can cause problems
in estimating the regression coefficients.

The validity of these assumptions is needed for the results to be meaningful. If these assumptions are
violated, the results can be incorrect and may have serious consequences. If the departures are small, the
final result may not change significantly. But if the departures are large, the model obtained may
become unstable in the sense that a different sample could lead to an entirely different model with opposite
conclusions. Such underlying assumptions therefore have to be verified before attempting regression
modeling. This information is not available from summary statistics such as the t-statistic, F-statistic, or
coefficient of determination.

One important point to keep in mind is that these assumptions are for the population and we work only
with a sample. So the main issue is to take a decision about the population on the basis of a sample of
data.

Several diagnostic methods to check for violations of the regression assumptions are based on the study of
model residuals with the help of various types of graphics.
Checking the linear relationship between the study and explanatory variables
1. Case of one explanatory variable
If there is only one explanatory variable in the model, then it is easy to check for a linear
relationship between y and X by a scatter diagram of the available data.
If the scatter diagram shows a linear trend, it indicates that the relationship between y and X is linear. If the
trend is not linear, then it indicates that the relationship between y and X is nonlinear. For example, the
following figure indicates a linear trend between y and X.

Whereas the following figure indicates a nonlinear trend:

2. Case of more than one explanatory variable


To check the assumption of linearity between the study variable and the explanatory variables, a scatterplot
matrix of the data can be used. A scatterplot matrix is a two-dimensional array of two-dimensional plots
in which each frame contains a scatter diagram, except for the diagonal. Thus, each plot sheds some light on
the relationship between a pair of variables. It gives more information than the correlation coefficient
between each pair of variables because it gives a sense of the linearity or nonlinearity of the relationship and
some awareness of how the individual data points are arranged over the region. It is a scatter diagram of

(y versus X1), (y versus X2), …, (y versus Xk).


Another option to present the scatterplot matrix is to:
- present the scatterplots in the upper triangular part of the plot matrix, and
- mention the corresponding correlation coefficients in the lower triangular part of the matrix.
Suppose there are only two explanatory variables and the model is $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$;
then the scatterplot matrix looks as follows.

Such an arrangement helps in examining each plot and the corresponding correlation coefficient together. The
pairwise correlation coefficients should always be interpreted in conjunction with the corresponding scatter
plots, because
 the correlation coefficient measures only the linear relationship, and
 the correlation coefficient is non-robust, i.e., its value can be substantially influenced by
one or two observations in the data.
The presence of linear patterns is reassuring, but the absence of such patterns does not imply that a linear model
is incorrect. Most statistical software provides an option for creating the scatterplot matrix. A
view of all the plots provides an indication of whether a multiple linear regression model may provide a
reasonable fit to the data. It is to be kept in mind that we get only information on pairs of variables
through the scatterplots of (y versus X1), (y versus X2), …, (y versus Xk), whereas the assumption of
linearity is between y and X1, X2, …, Xk jointly.
If some of the explanatory variables are themselves interrelated, then these scatter diagrams can be
misleading. Some other methods of sorting out the relationships between several explanatory variables
and a study variable are then used.
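
A minimal sketch of producing a scatterplot matrix and the corresponding pairwise correlations in Python (pandas and matplotlib; the column names and data values here are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: study variable y and two explanatory variables x1, x2.
df = pd.DataFrame({
    "x1": [60, 70, 80, 85, 95],
    "x2": [1.2, 0.9, 1.8, 2.1, 1.5],
    "y":  [70, 65, 70, 95, 85],
})

# Scatter diagrams for every pair of variables; the diagonal shows histograms.
pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()

# Pairwise correlation coefficients, to be read alongside the plots.
print(df.corr().round(2))
```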

Residual Analysis in Regression
Because a linear regression model is not always appropriate for the data, you should assess the
appropriateness of the model by defining residuals and examining residual plots.

Residuals
A residual (e) is defined as the difference between the observed value of the dependent variable (y) and the
predicted value (ŷ). Each data point has one residual.

Residual = Observed value - Predicted value

$e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$

Both the sum and the mean of the residuals are equal to zero. That is, $\sum_i e_i = 0$ and $\bar{e} = 0$.

The approximate average variance of the residuals is estimated by $\sum_i e_i^2/(n-2) = SSE/(n-2) = MSE$.

Departures from the model to be studied by residuals:

 Regression function not linear
 Error terms do not have constant variance
 Error terms are not independent
 Model fits all but one or a few outlying observations
 Error terms are not normally distributed
 One or more predictor variables have been omitted from the model

Diagnostics for residuals

 Plot of residuals against predictor variable


 Plot of absolute or squared residuals against predictor variable
 Plot of residuals against fitted values
 Plot of residuals against time or other sequence
 Plot of residuals against omitted predictor variables
 Box plot of residuals
 Normal probability plot of residuals

Residual Plots
A residual plot: it is a graph that shows the residuals on the vertical axis and the independent variable on
the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a
linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

Below, the table shows inputs and outputs from a simple linear regression analysis, and the
accompanying chart displays the residuals (e) against the independent variable (X) as a residual plot.

X    60      70      80      85      95
Y    70      65      70      95      85
Ŷ    65.41   71.84   78.28   81.5    87.95
e     4.589  -6.85   -8.288  13.493  -2.945

 The residual plot shows a fairly random pattern - the first residual is positive, the next two are
negative, the fourth is positive, and the last residual is negative. This random pattern indicates
that a linear model provides a decent fit to the data.

Below, the residual plots show three typical patterns. The first plot shows a random pattern,
indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped and
inverted U), suggesting a better fit for a nonlinear model. So there should be no systematic
relationship between the residuals and the predictor variable if the relationship is linear.

[Figures: Random pattern | Non-random: U-shaped | Non-random: Inverted U]
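
The following minimal sketch (Python with numpy and matplotlib, assuming both are installed) reproduces this kind of residual plot from the X and Y values in the table above; the fitted values and residuals are computed by ordinary least squares.

```python
import numpy as np
import matplotlib.pyplot as plt

# Data from the table above.
X = np.array([60, 70, 80, 85, 95], dtype=float)
Y = np.array([70, 65, 70, 95, 85], dtype=float)

# Fit the simple linear regression y = b0 + b1*x by least squares.
b1, b0 = np.polyfit(X, Y, deg=1)
Y_hat = b0 + b1 * X
residuals = Y - Y_hat

# Residuals on the vertical axis, the predictor on the horizontal axis.
plt.scatter(X, residuals)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("X")
plt.ylabel("Residual (e)")
plt.title("Residual plot")
plt.show()
```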

Transformations to Achieve Linearity

When a residual plot reveals a data set to be nonlinear, it is often possible to "transform" the raw data to
make it more linear. This allows us to use linear regression techniques more effectively with nonlinear data.

What is a Transformation to Achieve Linearity?

Transforming a variable involves using a mathematical operation to change its measurement scale. Broadly
speaking, there are two kinds of transformations.

 Linear transformation. A linear transformation preserves linear relationships between variables. Therefore,
the correlation between x and y would be unchanged after a linear transformation. Examples of a linear
transformation to variable x would be multiplying x by a constant, dividing x by a constant, or adding a
constant to x.

 Nonlinear transformation. A nonlinear transformation changes (increases or decreases) the linear relationship
between variables and, thus, changes the correlation between the variables. Examples of a nonlinear transformation
of variable x would be taking the square root of x or the reciprocal of x. In regression, a transformation to achieve
linearity is a special kind of nonlinear transformation: it is a nonlinear transformation that increases the linear
relationship between two variables.

Methods of Transforming Variables to Achieve Linearity

There are many ways to transform variables to achieve linearity for regression analysis. Some common
transformations are listed below.

Method                       Transformation(s)   Regression equation     Predicted value (ŷ)
Standard linear regression   None                y = b0 + b1x            ŷ = b0 + b1x
Exponential model            y′ = log(y)         log(y) = b0 + b1x       ŷ = 10^(b0 + b1x)
Quadratic model              y′ = sqrt(y)        sqrt(y) = b0 + b1x      ŷ = (b0 + b1x)^2
Reciprocal model             y′ = 1/y            1/y = b0 + b1x          ŷ = 1 / (b0 + b1x)
Logarithmic model            x′ = log(x)         y = b0 + b1·log(x)      ŷ = b0 + b1·log(x)

Each row shows a different nonlinear transformation method. The second column shows the specific
transformation applied to dependent and/or independent variables. The third column shows the regression
equation used in the analysis. And the last column shows the "back transformation" equation used to restore
the dependent variable to its original, non-transformed measurement scale.

In practice, these methods need to be tested on the data to which they are applied to be sure that they increase
rather than decrease the linearity of the relationship. Testing the effect of a transformation method involves
looking at residual plots and correlation coefficients, as described in the following sections.

How to Perform a Transformation to Achieve Linearity

Transforming a data set to enhance linearity is a multi-step, trial-and-error process.


First, conduct a standard regression analysis on the raw data and construct a residual plot.
 If the plot pattern is random, do not transform the data.
 If the plot pattern is not random, continue.
 Compute the coefficient of determination (R2).
 Choose a transformation method (see the table above).
 Transform the independent variable, the dependent variable, or both.
 Conduct a regression analysis using the transformed variables.
 Compute the coefficient of determination (R2) based on the transformed variables.
o If the transformed R2 is greater than the raw-score R2, the transformation was successful. Congratulations!
o If not, try a different transformation method.
The best transformation method (exponential model, quadratic model, reciprocal model, etc.) will depend on the
nature of the original data. The only way to determine which method is best is to try each and compare the
results (i.e., residual plots, correlation coefficients).
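
As a minimal sketch of this trial-and-error procedure (Python with numpy; the data values are hypothetical), the following fits the raw model and an exponential (log-y) model and compares the two R2 values:

```python
import numpy as np

# Hypothetical data with a roughly exponential trend.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 8.2, 15.8, 32.5, 63.0, 130.4, 255.1])

def r_squared(observed, fitted):
    ss_res = np.sum((observed - fitted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Step 1: standard linear regression on the raw data.
b1, b0 = np.polyfit(x, y, deg=1)
r2_raw = r_squared(y, b0 + b1 * x)

# Step 2: exponential model -- regress log10(y) on x (R2 on the transformed scale).
c1, c0 = np.polyfit(x, np.log10(y), deg=1)
r2_transformed = r_squared(np.log10(y), c0 + c1 * x)

print(f"raw-score R2: {r2_raw:.3f}, transformed R2: {r2_transformed:.3f}")
# Back-transformation for predictions: y_hat = 10 ** (c0 + c1 * x)
```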

Influential Points in Regression


Sometimes in regression analysis, a few data points have disproportionate effects on the slope of the
regression equation. In this lesson, we describe how to identify those influential points.
Outliers
Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that
a data point might be considered an outlier.
 It could have an extreme X value compared to other data points.
 It could have an extreme Y value compared to other data points.
 It could have extreme X and Y values.
 It might be distant from the rest of the data, even without extreme X or Y values.

[Figures: Extreme X value | Extreme Y value | Extreme X and Y | Distant data point]

Influential Points

An influential point is an outlier that greatly affects the slope of the regression line. One way to test the
influence of an outlier is to compute the regression equation with and without the outlier.

This type of analysis is illustrated below. The scatter plots are identical, except that the plot on the right
includes an outlier. The slope is flatter when the outlier is present (-3.32 vs. -4.10), so this outlier would
be considered an influential point.

Without outlier:
 Regression equation: ŷ = 104.78 - 4.10x
 Coefficient of determination: R2 = 0.94

With outlier:
 Regression equation: ŷ = 97.51 - 3.32x
 Coefficient of determination: R2 = 0.55

The charts below compare regression statistics for another data set with and without an outlier. Here, the
chart on the right has a single outlier, located at the high end of the X axis (where x = 24). As a result of
that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would
be considered an influential point.

Without outlier:
 Regression equation: ŷ = 92.54 - 2.5x
 Slope: b1 = -2.5
 Coefficient of determination: R2 = 0.46

With outlier:
 Regression equation: ŷ = 87.59 - 1.6x
 Slope: b1 = -1.6
 Coefficient of determination: R2 = 0.52

Sometimes an influential point will cause the coefficient of determination to be bigger;
sometimes, smaller. In the first example above, the coefficient of determination is smaller when
the influential point is present (0.94 vs. 0.55). In the second example, it is bigger (0.46 vs. 0.52).

Outliers can strongly affect the fitted values of the regression line; if an observation's standardized
residual is large in absolute value (say, larger than 3), it may be flagged as an outlier.

If your data set includes an influential point, here are some things to consider.

 An influential point may represent bad data, possibly the result of measurement error. If possible,
check the validity of the data point.
 Compare the decisions that would be made based on regression equations defined with and
without the influential point. If the equations lead to contrary decisions, use caution.
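
A minimal sketch of this check in Python (numpy; the data values are hypothetical and loosely follow the second example above, with an extreme point at x = 24):

```python
import numpy as np

# Hypothetical data; the last point has an extreme x value.
x = np.array([2, 4, 5, 7, 8, 10, 24], dtype=float)
y = np.array([88, 80, 78, 72, 70, 65, 50], dtype=float)

def slope_intercept(xv, yv):
    b1, b0 = np.polyfit(xv, yv, deg=1)
    return b0, b1

b0_all, b1_all = slope_intercept(x, y)            # with the suspected point
b0_red, b1_red = slope_intercept(x[:-1], y[:-1])  # without the suspected point

print(f"slope with the point:    {b1_all:.2f}")
print(f"slope without the point: {b1_red:.2f}")
# A large change in slope suggests the point is an influential point.
```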

Influence Statistics, Outliers, and Collinearity Diagnostics

 Studentized Residuals – Residuals divided by their estimated standard errors (like t-statistics).
Observations with values larger than 3 in absolute value are considered outliers.

 Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an
observation whose dependent-variable value is unusual given its values on the predictor variables. An
outlier may indicate a sample peculiarity, or it may indicate a data entry error or other problem.
 Leverage Values (Hat Diag) – An observation with an extreme value on a predictor variable is
called a point with high leverage. Leverage is a measure of how far an observation is from the others in
terms of the levels of the independent variables (not the dependent variable). Observations with
values larger than 2(k+1)/n are considered to be potentially highly influential, where k is the number
of predictors and n is the sample size.
 Influence: An observation is said to be influential if removing the observation substantially changes
the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.
 DFFITS – Measure of how much an observation has affected its fitted value from the regression
model. Values larger than 2*sqrt((k+1)/n) in absolute value are considered highly influential. Use
standardized DFFITS in SPSS.
 DFBETAS – Measure of how much an observation has affected the estimate of a regression
coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values
larger than 2/sqrt (n) in absolute value are considered highly influential.
 Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients,
as well as the group of fitted values. Values larger than 4/n are considered highly influential.
 COVRATIO – Measure of the impact of each observation on the variances (and standard errors) of
the regression coefficients and their covariances. Values outside the interval 1 ± 3(k+1)/n are
considered highly influential.
 Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is
with the other predictors in the model. Values larger than 10 for a predictor imply large inflation of the
standard errors of the regression coefficients due to this variable being in the model.
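
Outside SPSS, the same statistics can be obtained programmatically. A minimal sketch using Python's statsmodels (the data here are simulated and the variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # n = 30 observations, k = 2 predictors
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=30)

Xc = sm.add_constant(X)
model = sm.OLS(y, Xc).fit()
infl = model.get_influence()

studentized = infl.resid_studentized_external     # outliers: |value| > 3
leverage = infl.hat_matrix_diag                   # high leverage: > 2(k+1)/n
cooks_d = infl.cooks_distance[0]                  # influential: > 4/n
dffits = infl.dffits[0]                           # influential: |value| > 2*sqrt((k+1)/n)
dfbetas = infl.dfbetas                            # influential: |value| > 2/sqrt(n)
cov_ratio = infl.cov_ratio                        # outside 1 ± 3(k+1)/n
vif = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]  # concern if > 10
```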

Obtaining Influence Statistics and Studentized Residuals in SPSS

A. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable and set of
Independent variables from your model of interest (possibly having been chosen via an automated
model selection method).
B. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics and All Cases and
CONTINUE
C. Under PLOTS, select Y:*SRESID and X:*ZPRED. Also choose HISTOGRAM. These give a plot
of studentized residuals versus standardized predicted values, and a histogram of standardized
residuals (residual/sqrt(MSE)). Select CONTINUE.

D. Under SAVE, select Studentized Residuals, Cook’s, Leverage Values, Covariance Ratio,
Standardized DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to
your original data worksheet.

Remedial Measures
 There are two things you can do when you find out that your linear regression model is
not appropriate:
 Change your model (use another statistical model).
 Change your data:
o Transformations of X.
o Transformations of Y.
o Omitting outliers.
Problems and Solutions
Nonlinearity of the regression function:
 Use a nonlinear model or transform x (if the residuals are reasonably normal with constant variance).
Non-constant error variance:
 Use the weighted least squares estimation method (see the sketch after this list).
 Transform Y if the mean function is reasonably linear; consider working with variance-stabilizing
transformations of Y.

Non-independent error terms:

 Change your model to include correlated error terms (change the error assumption).
 Use more complex models so that the errors about them might indeed be reasonably independent, or
model first differences, or use models designed to handle dependent errors.
Non-normal error terms:
 Transform x.
 Often, variance-stabilizing transformations of the response also make the residuals more consistent
with an iid Gaussian sample.
 Use a generalized model (differing error assumptions).
Omission of important predictor variables:
 Add them to the model.
 Use more complex models that include them.
Outlying observations:
 Check to see if that observation is "real".
 If so, you may want to use a more robust estimation method.
 If more than one, maybe use a mixture model.
 If outlier was from an error in data collection/coding, then delete the observation.
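
As referenced above, a minimal sketch of the weighted least squares remedy for non-constant error variance, using Python's statsmodels (the data and the 1/x² weights are illustrative assumptions for a case where the error spread grows with x):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 40)
y = 3 + 2 * x + rng.normal(scale=0.5 * x)         # error spread increases with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weight proportional to 1/Var(error)

print("OLS coefficients:", np.round(ols_fit.params, 2))
print("WLS coefficients:", np.round(wls_fit.params, 2))
```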

5. Remedial measures of model inadequacy

Data do not always come in a form that is immediately suitable for analysis. We often have to transform
the variables before carrying out the analysis. Transformations are applied to accomplish certain
objectives such as
 To stabilize the relationship
 To stabilize the variance of the dependent variable
 To normalize dependent variable
 To linearize the regression model

TRANSFORMATIONS TO STABILIZE VARIANCE


We discussed in the preceding section the use of transformations to achieve linearity of the
regression function. Transformations are also used to stabilize the error variance, that is, to make the error
variance constant for all the observations. The constancy of error variance is one of the standard
assumptions of least squares theory. It is often referred to as the assumption of homoscedasticity. When
the error variance is not constant over all the observations, the error is said to be heteroscedastic.
Heteroscedasticity is usually detected by suitable graphs of the residuals, such as the scatter plot of the
standardized residuals against the fitted values or against each of the predictor variables. A plot with the
characteristics of Figure 6.9 typifies the situation: the residuals tend to have a funnel-shaped distribution,
either fanning out or closing in with the values of X. If heteroscedasticity is present and no corrective
action is taken, application of OLS to the raw data will result in estimated coefficients which lack
precision in a theoretical sense. The estimated standard errors of the regression coefficients are often
understated, giving a false sense of accuracy. Heteroscedasticity can be removed by means of a suitable
transformation. We describe an approach for
(a) detecting heteroscedasticity and its effects on the analysis, and
(b) removing heteroscedasticity from the data being analyzed using a transformation.
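
A minimal sketch of the diagnostic just described, plotting standardized residuals against fitted values to look for a funnel shape (Python with statsmodels and matplotlib; the data are simulated so that the variance fans out with x):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 50)
y = 5 + 1.5 * x + rng.normal(scale=0.4 * x)       # variance fans out with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
std_resid = fit.get_influence().resid_studentized_internal

plt.scatter(fit.fittedvalues, std_resid)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Standardized residual")
plt.show()
```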

A number of the problems in our model can be solved by transforming X.
Why do we concentrate on x?
 The distribution of the error terms depends on Y, not X.
 If we were to transform Y, we would change the shape and nature of the analysis.
 So, always transform X.
So, you have problems; what transformation do you use? Some common transformations are:
X′ = ln(X), X′ = √X, X′ = exp(X)
Note: Box-Cox transformations of the response: Instead of selecting a transformation “by eye”, select
an optimal power transformation.
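
A minimal sketch of a Box-Cox power transformation of the response using SciPy (assuming scipy is available and y > 0; the data values are hypothetical):

```python
import numpy as np
from scipy import stats

y = np.array([2.1, 3.9, 8.2, 15.8, 32.5, 63.0, 130.4, 255.1])

y_transformed, lam = stats.boxcox(y)   # lam is the estimated optimal power lambda
print(f"estimated lambda: {lam:.2f}")
```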

Variable Selection and Model Building

The Model-Building Problem


• Ensure that the functional form of the model is correct and that the underlying assumptions
are not violated.
In most practical problems, the analyst has a rather large pool of possible candidate regressors, of which
only a few are likely to be important. Finding an appropriate subset of regressors for the model is often
called the variable selection problem. While choosing a subset of explanatory variables, there are two
possible options:

1. In order to make the model as realistic as possible, the analyst may include as many explanatory
variables as possible.

2. In order to make the model as simple as possible, one may include only a small number of explanatory
variables.

Both approaches have their own consequences. In fact, model building and subset selection have
contradictory objectives.

 When a large number of variables is included in the model, these factors can influence the
prediction of the study variable y.
 On the other hand, when a small number of variables is included, the predictive variance of ŷ
decreases.
 Also, when observations on a larger number of variables are to be collected, more cost, time,
labour, etc. are involved.

A compromise between these consequences is struck to select the “best regression equation”.

There can be two types of incorrect model specifications.


 Omission/exclusion of relevant variables.
 Inclusion of irrelevant variables.
Exclusion of relevant variables

In order to keep the model simple, the analyst may delete some of the explanatory variables which may be
of importance from the point of view of theoretical considerations. There can be several reasons behind
such a decision; e.g., it may be hard to quantify variables like taste, intelligence, etc. Sometimes it may
be difficult to obtain correct observations on variables like income.

Inclusion of irrelevant variables

Sometimes due to enthusiasm and to make the model more realistic, the analyst may include some
explanatory variables that are not very relevant to the model. Such variables may contribute very little to
the explanatory power of the model. This may tend to reduce the degrees of freedom (n - k) and
consequently the validity of inference drawn may be questionable. For example, the value of coefficient
of determination will increase indicating that th e model is getting better which may not really be true.

                                                  Exclusion type                  Inclusion type
Estimation of coefficients                        Biased                          Unbiased
Efficiency                                        Generally declines              Declines
Estimation of disturbance term                    Over-estimate                   Unbiased
Conventional test of hypothesis and
confidence region                                 Invalid and faulty inferences   Valid though erroneous
The basic steps for variable selection are as follows:

(a) Specify the maximum model to be considered.

(b) Specify a criterion for selecting a model.

(c) Specify a strategy for selecting variables.

(d) Conduct the specified analysis.

(e) Evaluate the Validity of the model chosen.

Step 1: Specifying the maximum Model: The maximum model is defined to be the largest model (the
one having the most predictor variables) considered at any point in the process of model selection.

 Error degrees of freedom must be positive. Therefore, n-p=n-(k+1)>0


 The weakest requirement is n-(k+1)>10
 Another suggested rule of thumb for regression is to have at least 5 (or 10) observations per
predictor.
 In general, we like to have large error degrees of freedom.

Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to
evaluate subset regression models. The criterion used for model selection should certainly be related to the
intended use of the model.

 F-Test Statistic: Another reasonable criterion for selecting the best model is the F-test statistic
for comparing the full and reduced models.
 This statistic may be compared to an F-distribution with k-p+1 and n-k-1 degrees of
freedom. If F-Calculated is not significant, we can use the smaller (P-1 variables) model.
 Coefficient of Determination (R2): A measure of the adequacy of a regression model that has
been widely used is the coefficient of determination.
 R2 increases as P increases and is maximum when P = K+1. Therefore, the analyst uses this
criterion by adding regressors to the model up to the point where an additional variable only
provides a small increase in R2.
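
For reference, the partial F statistic mentioned above can be written as follows (a sketch in standard form, where SSE denotes the residual sum of squares, the full model has k predictors, and the reduced model has p - 1 predictors):

$$F = \frac{\left(SSE_{\text{reduced}} - SSE_{\text{full}}\right)/(k - p + 1)}{SSE_{\text{full}}/(n - k - 1)}$$

This F is compared with the F-distribution with k - p + 1 and n - k - 1 degrees of freedom, matching the degrees of freedom quoted above.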

Step 3: Specifying a Strategy for Selecting Variables:

 All possible regression procedure: The all possible regression procedure requires that we fit
each possible regression equation.
 Backward Elimination Procedure: We begin with a model that includes all candidate
regressors. Then the partial F-statistic is computed for each regressor as if it were the last
variable to enter the model. The smallest of these partial F-statistics is compared with a pre-
selected value FOUT; if it is smaller, that regressor is removed from the model. A regression model with the
remaining k-1 regressors is then fit, the partial F-statistics for this new model are calculated, and the procedure
is repeated. The backward elimination algorithm terminates when the smallest partial F-value is not less than the
pre-selected cutoff value FOUT.
 Forward Selection Procedure: The procedure begins with the assumption that there are no
regressors in the model other than the intercept. An effort is made to find an optimal subset by
inserting regressors into the model one at a time. At each step, the regressor having the highest partial correlation
with y (or equivalently, the largest partial F-statistic given the other regressors already in the model) is
added to the model if its partial F-statistic exceeds the pre-selected entry level FIN.
 Stepwise Regression Procedure: Stepwise regression is a modified version of forward
selection that permits re-examination, at every step, of the variables incorporated in the model in
previous steps. A variable that entered at an early stage may become superfluous at a later stage
because of its relationship with other variables subsequently added to the model.
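
A minimal sketch of the forward selection procedure in Python (numpy and statsmodels; the function name `forward_selection`, the entry level `F_IN`, and the partial F computation are illustrative assumptions, not a standard library routine):

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, F_IN=4.0):
    """Add predictors one at a time while the largest partial F exceeds F_IN."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        if selected:
            sse_current = sm.OLS(y, sm.add_constant(X[:, selected])).fit().ssr
        else:
            sse_current = float(np.sum((y - y.mean()) ** 2))   # intercept-only model
        best_F, best_j = -np.inf, None
        for j in remaining:
            cols = selected + [j]
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            # Partial F for adding predictor j, given the regressors already in the model.
            F_j = (sse_current - fit.ssr) / (fit.ssr / (n - len(cols) - 1))
            if F_j > best_F:
                best_F, best_j = F_j, j
        if best_F < F_IN:
            break                      # no remaining predictor meets the entry level
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Example usage: columns of X hold the candidate regressors, y is the study variable.
# chosen = forward_selection(X, y); X[:, chosen] are the selected predictors.
```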

