Statistical Models in R
6.1 Simple Linear Regression

The simple linear regression model has the form

y_i = β_0 + β_1 x_i + ϵ_i.
Moreover, it is assumed that, for each i = 1, 2, ..., n, the error terms ϵ_i have constant variance σ², are independent, and are identically and normally distributed with ϵ_i ∼ N(0, σ²).
In matrix form, the model is written

y = Xβ + ϵ,

where it is assumed that the design matrix, X, has full column rank; y is the vector of observed responses; and the entries of the error vector ϵ satisfy the above-mentioned assumptions.
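For concreteness, the design matrix for a simple linear regression can be inspected directly with model.matrix; a small sketch, where the toy predictor x.toy is purely illustrative:

x.toy <- c(1, 2, 3)
model.matrix(~ x.toy)  # a column of ones (the intercept) plus the predictor column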
A traditional application of simple linear regression typically involves a study in which the continuous response variable is, in theory, assumed to be linearly related to a continuous explanatory variable, and for which the data provide evidence in support of this structural requirement as well as of all fundamental assumptions on the error terms. The (simulated) data for the illustrations to follow are given in Table 5.1 and entered below.
x <- c(0.00, 0.75, 1.50, 2.25, 3.00, 3.75, 4.50, 5.25, 6.00, 6.75, 7.50, 8.25,
       9.00, 9.75, 10.50, 11.25, 12.00, 12.75)  # reconstructed from the summary below
y <- c(54.3, 50.8, 58.0, 54.6, 45.3, 47.0, 51.7, 43.3, 44.7, 38.5, 42.1, 40, 32,
       ...)  # the remaining five response values are truncated in the source
SimpleRegData <- data.frame(x, y)
with(SimpleRegData, hist(y, sub = "fig 5.1: histogram of response variable y"))
[Figure 5.1: histogram of the response variable y.]
Such figures might give some feel for clear deviations from symmetry in the observed responses. A preliminary assessment of the manner in which the response might be (at least approximately) related to the explanatory variable can be made using a basic scatterplot (see Figure 5.2).
with(SimpleRegData, plot(x, y, xlab = "x", ylab = "y"))
[Figure 5.2: scatterplot of y against x.]
At this point, boxplots can also be used in a preliminary check for potential outlying data, as can basic numerical summaries:

summary(SimpleRegData)
## x y
## Min. : 0.000 Min. :26.90
## 1st Qu.: 3.188 1st Qu.:33.70
## Median : 6.375 Median :42.70
## Mean : 6.375 Mean :42.15
## 3rd Qu.: 9.562 3rd Qu.:49.85
## Max. :12.750 Max. :58.00
Combining the observation that the response does not suggest a deviation from symmetry with the fact that there appears to be an approximate linear relationship between X and Y, it is reasonable to fit a simple linear regression model to the data.
Simple.mod <- lm(y ~ x, data = SimpleRegData)
Then names(Simple.mod) shows that the resulting object, Simple.mod, contains several
sub-objects each of which contains useful information. To determine what the fitted model
is, execute
Simple.mod$coefficients
## (Intercept) x
## 56.417544 -2.238046
Thus, the fitted model is ŷ = 56.418 − 2.238x. Among the objects contained in Simple.mod that will find use in the diagnostic phase are coefficients, which contains the parameter estimates β̂_0 and β̂_1; fitted.values, which contains the fitted values ŷ_i; residuals, which contains the residuals, ϵ̂_i, for the fitted model; and df.residual, the residual degrees of freedom.
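As a brief illustration of how these sub-objects fit together (a sketch; the check should return TRUE, since the residuals are the observed minus the fitted values):

Simple.mod$df.residual                     # residual degrees of freedom: n - 2 = 16
all.equal(unname(Simple.mod$residuals),
          SimpleRegData$y - unname(Simple.mod$fitted.values))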
In the case of simple linear regression models, it is a simple matter to obtain the traditional ANOVA table:
anova(Simple.mod)
This table contains the residual sum of squares (SSE = 151.5) and the regression sum of squares (SSR = 1365.1). The total sum of squares (SSTo) is the sum of SSE and SSR. The mean square error (MSE = 9.47) is also contained in this table, along with the F- and p-value for the corresponding goodness-of-fit F-test for the model.
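Should these quantities be needed later, they can be pulled directly from the anova object; a minimal sketch using the row and column names as they appear in the printed table:

Simple.aov <- anova(Simple.mod)
SSE  <- Simple.aov["Residuals", "Sum Sq"]   # 151.5
SSR  <- Simple.aov["x", "Sum Sq"]           # 1365.1
SSTo <- SSE + SSR
MSE  <- Simple.aov["Residuals", "Mean Sq"]  # 9.47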
It is important to remember that this approach to obtaining the traditional ANOVA table will not work for the multiple linear regression models encountered later; there, anova reports sequential sums of squares, one row per term, rather than a single regression sum of squares.
The summary function can be used to obtain the model summary statistics, which include measures of fit. This summary also provides the essentials of the information given in the traditional ANOVA table. The summary statistics produced can be stored in an object for later recall using the command (output not shown)

(Simple.sum <- summary(Simple.mod))

Observe that by enclosing the whole assignment statement within parentheses, R not only performs the assignment, but also outputs the summary statistics.
The object Simple.sum contains summary statistics associated with the residuals and the coefficients, along with s, the residual standard error, stored as sigma; the coefficient of determination, r², stored as Multiple R-squared; and the relevant F-statistic and degrees of freedom for the goodness-of-fit F-test, stored as fstatistic.
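For example, these pieces can be recalled individually; a quick sketch:

Simple.sum$sigma       # s, the residual standard error
Simple.sum$r.squared   # r^2, the coefficient of determination
Simple.sum$fstatistic  # F-statistic with its numerator and denominator df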
Testing the model’s significance via the hypotheses
H_0 : y_i = β_0 + ϵ_i {reduced model}, vs.
H_1 : y_i = β_0 + β_1 x_i + ϵ_i {full model},
then boils down to interpreting the F-statistic line in the object Simple.sum generated above (equivalently, the F-test row of anova(Simple.mod)). Finally, the line Multiple R-squared: 0.9001,
Adjusted R-squared: 0.8939 in Simple.sum contains the coefficient of determination, r2
≈ 0.9001. Recall that, by definition, this indicates that close to 90% of the variation in the observed responses is explained by the fitted model and variation in the explanatory variable. Note that in simple linear regression cases, Multiple R-squared should be read simply as r-squared and Adjusted R-squared should be ignored.
6.1.3 Diagnostics
For many, graphical diagnostic methods are considered adequate, and for pretty much all, these methods are essential in providing visual support when numerical assessment methods are used. To facilitate simpler code, begin by extracting the needed information from the objects Simple.mod and Simple.sum.
e <- residuals(Simple.mod)
y.hat <- fitted.values(Simple.mod)
s <- Simple.sum$sigma
r <- e/s
d <- rstudent(Simple.mod)
The objects e, y.hat, and s are fairly self-explanatory, r contains standardized residuals,
and d contains what are referred to as studentized deleted residuals.
To assess whether the variances of the error terms might not be constant, the two traditional plots used are given in Figure 5.3; the code is sketched below.
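A plausible reconstruction, assuming the two panels plot the residuals e against the fitted values y.hat and against x, with a dotted reference line at zero:

par(mfrow = c(1, 2))
plot(y.hat, e, xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, lty = 3)
plot(SimpleRegData$x, e, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 3)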
Here, ideal plots should not have any noticeable trends and should have the bulk of the
plotted points randomly scattered and (approximately) symmetrically bounded about the
horizontal axis. These plots can also serve as a preliminary means of flagging potential
outlying values of the response variable.
[Figure 5.3: residuals plotted against the fitted values (left) and against x (right).]
The QQ normal probability plot is a popular graphical assessment of the normality assumption; the residuals or standardized residuals may be used, each with an appropriate reference line. Figure 5.4 was obtained using code along the lines sketched below.
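A plausible reconstruction, assuming a QQ plot of the standardized residuals r with the line y = x added as reference:

qqnorm(r, main = NULL)
abline(a = 0, b = 1)  # reference line y = x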
The line y = x works as a reference line for standardized residuals, but not for unstandardized residuals. An alternative reference line, which should be used for QQ plots of unstandardized residuals, passes through the first and third quartiles and can be plotted using

qqline(e, lty = 3)
The correlation coefficient test (for which there is no built-in R function) appears to be more versatile with respect to sample size than the Shapiro-Wilk test.
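Although no built-in function exists, the test statistic itself is simple to compute; a minimal sketch, where the helper name ppcc and the use of ppoints for the normal scores are our own choices, and the resulting statistic must be compared against published critical values:

ppcc <- function(res) {
  scores <- qnorm(ppoints(length(res)))  # approximate expected normal order statistics
  cor(sort(res), scores)                 # correlation of ordered residuals with scores
}
ppcc(e)  # values close to 1 are consistent with normality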
[Figure 5.4: QQ normal probability plot of the standardized residuals with the reference line y = x.]
Shapiro-Wilk test
This test is a modification of the correlation coefficient test, and the associated R command
to perform this test is
shapiro.test(e)
##
## Shapiro-Wilk normality test
##
## data: e
## W = 0.95673, p-value = 0.5399
which also indicates that there is not enough evidence to reject the assumption of normality (for α < p-value).
In the case of simple linear regression, an assessment of the influence of an outlying case, an ordered pair (x_k, y_k) for which x_k, y_k, or both influence the fit of the model, is best performed using the earlier constructed xy-scatterplot in Figure 5.2. Typically, an influential outlier is indicated by a plotted point that lies far from the bulk of the data and that does not fit in with, or appears to disrupt, the general trend of the bulk of the data.
It might also correctly be inferred that if an observed response is an outlying case, then the corresponding residual will lie farther from zero than the bulk of the residuals. Thus, if a particular observed response is an outlier, this will be evident in the plots shown in Figures 5.3 and 5.4.
One way to test for outliers is to decide beforehand how many points to test. Suppose the 5% of residuals that are most extreme fit the bill. For the current data, this amounts to a single data value. Cut-off values can be obtained using a Bonferroni adjustment and then placed in a plot of the studentized deleted residuals such as in Figure 5.5.

In general, suppose it is determined that the most extreme 5% of the data involve m observations, which may or may not be potential outliers. Then cutoff values for possible outliers are obtained using t(α/(2m), n − p − 2), where n is the sample size and, in the case of simple regression, p = 1. Thus, for the current model, the code
n <- length(e)
m <- ceiling(0.05*n)
p <- 1
cv <- qt(0.05/(2*m), n - p - 2, lower.tail = FALSE)
plot(d ~ y.hat, ylim = c(-cv - 0.5, cv + 0.5), xlab = "", ylab = "")
title(xlab = "Fitted Values", ylab = "Studentized Deleted Residuals")
abline(h = c(-cv, 0, cv), lty = c(3, 1, 3))
[Figure 5.5: studentized deleted residuals against the fitted values, with Bonferroni cutoff lines.]
produces Figure 5.5. While
none of the plotted points for the example in question lie outside the plotted cutoff lines,
there are occasions when outliers will be present.
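Should any cases fall outside the cutoffs, their indices can be listed; for example,

which(abs(d) > cv)  # observation numbers of flagged cases (none for these data)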
6.2 Multiple Linear Regression
The multiple linear regression model with p explanatory variables has the form

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + · · · + β_p x_ip + ϵ_i.

As with the simple linear regression model, it is assumed that, for each i = 1, 2, ..., n, the error terms ϵ_i have constant variance σ², are independent, and are identically and normally distributed with ϵ_i ∼ N(0, σ²).
The following data will be used for all illustrations:
y <- c(15.09, 34.37, 24.64, 19.16, 26.68, 38.04, 5.59, 29.31, 28.27, 7.04,
38.56, 3.95, 1.57, 10.38, 47.61, 12.71,28.19, 6.8, 26.25, 45.33, 29.38,
3.3, 16.17, 29.24, 48, 39.97,17.16, 13.8, 22.86, 30.05, 16.5, 40.04, 2.9,
42, 39.08, 24.74, 34.61, 45.54, 5.6, 23.7)
x1 <- c(9.83, 6.62, 5.27, 1.96, 6.47,9.02, 9.32, 9.8, 7.89, 2.45, 5.73,
8.91, 7.95, 6.09, 3.28, 4.28, 1.56, 5.24, 3.48, 6.91, 3.14, 8.09, 6.76,
7.59, 4.7, 8.18, 1, 9.24, 8.05, 3.87,7.83, 7.48, 6.06, 5.81, 9.69, 7.01,
1.53, 7.8, 6.9, 3.51)
x2 <- c(1.87, 3.94, 4.95, 4.78, 2.87, 1.11, 1.09,1.06, 4.59, 3.18, 2.29,
2.01, 2.16, 4.6, 2.8, 3.23, 2.42, 2.46,2.48, 4.73, 2.82, 1.72, 4.57, 2.43,
4.54, 4.15, 2.72, 3.1, 3.87,4.73, 1.36, 4.53, 2.73, 1.72,4.26, 1.81, 3.74,
2.14, 4.89, 4.67)
x3 <- c(8.14, 14.16, 9.17, 5.38, 11.2, 16.99, 4.01, 13.55,11.09, 1.31, 15.72,
3.73, 1.98, 3.34, 17.63, 4.87, 10.12, 1.79, 9.82, 17.68, 11.32, 2.04, 6.53,
12.55, 17.83, 16.7, 4.99, 6.59, 9.81, 11.07, 7.7, 16.08, 1.61, 16.86, 16.43,
10.18, 12.5, 19.27,1.52, 7.53)
MultipleReg <- data.frame(y, x1, x2, x3)
An initial graphical analysis might focus on the distributional properties of Y; the relationships between Y and each of X1, X2, and X3; the relationships between pairs of X1, X2, and X3; and the presence of unusual observations. One might begin with a histogram of the observed responses
with(MultipleReg, hist(y))
title(sub = "Fig. 5.5 Histogram for observed responses in MultipleReg dataset")
[Fig. 5.5: Histogram of the observed responses in the MultipleReg dataset.]
(see Figure 5.5). Histograms might provide some information on the symmetry and
spread of the observed responses in relation to the normal distribution having the same
mean and standard deviation as those of the sample data.
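One way to make that comparison concrete is to draw the histogram on the density scale and overlay the normal density with the sample mean and standard deviation; a minimal sketch:

hist(MultipleReg$y, freq = FALSE, xlab = "y", main = "Histogram of y")
curve(dnorm(x, mean = mean(MultipleReg$y), sd = sd(MultipleReg$y)),
      add = TRUE, lty = 2)  # normal density matched to the sample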
Boxplots can also be used as they provide information on symmetry and the presence of
outliers (in the univariate sense); however, a caution is appropriate here. When combining
boxplots on a single figure, pay attention to scale (and units). It is preferable to use separate
plots if there is a large difference in ranges between variables, and it is definitely preferable
to use separate plots if variable units differ. A matrix scatterplot can be used to look for
patterns or surprising behavior in both the response and the explanatory variables (see
Figure 5.6). One of the ways to obtain this figure is
with(MultipleReg, pairs(cbind(y, x1, x2, x3)))
title(sub = "fig 5.6: Matrix scatterplot of MultipleReg dataset illustrating pairwise relationships for all variables in the dataset.")
[Figure 5.6: matrix scatterplot of the MultipleReg dataset showing pairwise relationships among y, x1, x2, and x3.]
Another way is to simply enter plot(MultipleReg). An examination of the (X_j, Y) pairs in scatterplots such as Fig 5.6 provides some (often vague) information on issues associated with approximate relational fits between the response and explanatory variables. The more useful information obtainable from such plots concerns the presence of possible linear relationships between pairs of explanatory variables, (X_j, X_k), where j ≠ k.
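A numerical complement to the scatterplot matrix is the matrix of pairwise correlations among the explanatory variables; a quick sketch:

round(cor(MultipleReg[, c("x1", "x2", "x3")]), 2)  # pairwise correlation matrix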
The summary function can be used here if summary statistics for the data are desired. The model is fitted with
Multiple.mod <- lm(y ~ x1 + x2 + x3, data = MultipleReg)
and the contents of Multiple.mod include information analogous to the simple regression
case.
round(Multiple.mod$coefficients,4)
## (Intercept) x1 x2 x3
## 2.9411 -0.9098 0.8543 2.4917
Multiple.sum <- summary(Multiple.mod)
Take a look at the contents of the object Multiple.sum with the help of the function
names. Then,
Multiple.sum$coefficients
which provides quite a bit of additional information: each row gives an estimate, its standard error, the t-statistic, and the corresponding p-value. The last three lines of the full summary output (not shown) are also of interest. The third of these gives the results of the goodness-of-fit F-test, the model significance test involving the hypotheses

H_0 : β_1 = β_2 = β_3 = 0 vs. H_1 : β_j ≠ 0 for at least one j.
The second line of the Multiple.sum output shown previously contains the coefficient of multiple determination (R²), Multiple R-squared, and the adjusted coefficient of multiple determination (R²_adj), Adjusted R-squared. Finally, the first line gives the residual standard error (s) along with its degrees of freedom.
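As in the simple regression case, these quantities can be recalled individually; for example,

Multiple.sum$sigma           # residual standard error, s
Multiple.sum$r.squared       # coefficient of multiple determination, R^2
Multiple.sum$adj.r.squared   # adjusted R^2
Multiple.sum$fstatistic      # F-statistic with its degrees of freedom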
Diagnostics
There is quite a bit more involved in the diagnostics stage for multiple regression models, and this can require some preparatory work. The objects of interest contained in the fitted model object Multiple.mod include coefficients (β̂), residuals (ϵ̂), and fitted.values (ŷ). Of use in the model summary object Multiple.sum are sigma (s); the degrees of freedom of s², contained in df[2]; and cov.unscaled, which is the matrix (X′X)⁻¹. These objects will be used to obtain the various statistics needed for the diagnostics to be performed.
In addition to the usual (unstandardized) residuals, ϵ̂_i, available in Multiple.mod, two transformed versions of the residuals may find use in the preliminary stages of the following diagnostics: standardized (or semistudentized) residuals,

r̂_i = ϵ̂_i / s,

and studentized (or internally studentized) residuals,

r_i = ϵ̂_i / (s √(1 − h_ii)).
The h_ii terms in the above formula are called leverages and are the diagonal entries of the hat-matrix H = X(X′X)⁻¹X′. These find use in the analysis of explanatory variable p-tuples, (x_i1, x_i2, ..., x_ip), for flagging potential outlying cases.
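To make the connection concrete, the hat-matrix can be assembled from cov.unscaled and its diagonal compared with R's hatvalues function; a brief sketch:

X <- model.matrix(Multiple.mod)                  # the design matrix
H <- X %*% Multiple.sum$cov.unscaled %*% t(X)    # H = X(X'X)^{-1}X'
all.equal(unname(diag(H)),
          unname(hatvalues(Multiple.mod)))       # the diagonal entries are the leverages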
One further form of residuals, seen earlier, that plays a role in outlier analysis is the studentized deleted (or externally studentized) residual,

d̂_i* = ϵ̂_i √[(n − p − 2) / (s²(n − p − 1)(1 − h_ii) − ϵ̂_i²)].
The choice of which form of residuals to use in the diagnostics process is sometimes guided
by the variability present in the data and also depends on a combination of the task to be
performed and individual preference.
To ease the way for the illustrations to follow, compute and store each of the above-listed
statistics in appropriately named variables.
y.hat <- fitted.values(Multiple.mod)
e <- residuals(Multiple.mod)
s <- Multiple.sum$sigma
r <- e/s                             # compute standardized residuals
h <- hatvalues(Multiple.mod)         # extract leverages
stud <- e/(s*sqrt(1 - h))            # compute studentized residuals
stud.del <- rstudent(Multiple.mod)   # extract studentized deleted residuals
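As a quick consistency check, the extracted studentized deleted residuals should agree with the formula given earlier; a sketch, with p = 3 explanatory variables here:

n <- length(e)
p <- 3
d.check <- e*sqrt((n - p - 2)/(s^2*(n - p - 1)*(1 - h) - e^2))
all.equal(unname(d.check), unname(stud.del))  # should be TRUE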
Although the plots shown in fig 5.7 serve mainly to assess the constant variances assumption, recall that they also serve to flag other potential issues, such as the presence of outliers, the absence of important variables, or possible issues of fit. Ideal plots have all plotted points randomly scattered and approximately symmetrically spread about the horizontal axis. Only the studentized residuals are used here; however, it should be remembered that any one of the three forms of (undeleted) residuals provides equivalent information, though in potentially varying levels of clarity. The code used to produce fig 5.7 is shown below.
par(mfrow = c(2, 2))
plot(y.hat, stud, xlab = "Fitted values", ylab = "Studentized residuals", sub = "(a)")
abline(h = c(-2, 0, 2), lty = c(3, 1, 3))
plot(x1, stud, xlab = "x1", ylab = "Studentized residuals", sub = "(b)")
abline(h = c(-2, 0, 2), lty = c(3, 1, 3))
plot(x2, stud, xlab = "x2", ylab = "Studentized residuals", sub = "(c)")
abline(h = c(-2, 0, 2), lty = c(3, 1, 3))
plot(x3, stud, xlab = "x3", ylab = "Studentized residuals", sub = "(d)")
abline(h = c(-2, 0, 2), lty = c(3, 1, 3))
[Figure 5.7: studentized residuals against (a) the fitted values, (b) x1, (c) x2, and (d) x3, with reference lines at 0 and ±2.]
The included horizontal lines are simply for reference and do not represent cutoff values.
The normality assumption may be assessed by looking at a QQ normal probability plot of any one of the unstandardized, standardized, or studentized residuals. Fig 5.8 was obtained using the code
qqnorm(stud, main = NULL)
abline(a = 0, b = 1)
[Figure 5.8: QQ normal probability plot of the studentized residuals with the line y = x as reference.]
Observe that, by setting main = NULL, the default figure title does not appear; not specifying an lty results in a default solid line. For the fitted model in question, the QQ plot of the standardized residuals looks very much like fig. 5.8. The tests of normality encountered earlier work here, too.
shapiro.test(stud)
##
## Shapiro-Wilk normality test
##
## data: stud
## W = 0.98147, p-value = 0.744
also indicating that there is not enough evidence to reject the assumption of normality (for α < p-value).