R Lab 4
Contents
0.1 Objectives
0.2 Lab Setup
0.3 The Data
0.4 Fitting the Simple Linear Regression Model
0.4.1 Probing the lm function
0.4.2 Assessing Homogeneity of Variance and Normality Assumptions
0.5 Fitting the Exponential Model
0.6 Lab Assignment
0.6.1 Question 1
0.6.2 Question 2
0.6.3 Question 3
0.6.4 Question 4
0.1 Objectives
1. Use the lm function to fit a simple linear model
0.3 The Data
The dataset is from Chapter 8, Problem 10 in your textbook. We are trying to estimate the survival of liver
transplant patients using information on the patients collected before the operation. The variables are:
• CLOT: a measure of the clotting potential of the patient’s blood;
• PROG: a subjective index of the patient’s prospects of recovery;
• ENZ: a measure of a protein present in the body;
• LIV: a measure relating to white blood cell count and the response;
• TIME: a measure of the survival time of the patient.
In this lab we will use TIME as the dependent variable and ENZ as the independent variable. The data
was originally available at
http://statweb.lsu.edu/EXSTWeb/StatLab/DataSets/EXST7015/FW&M%20Data%202010/TEXT/DATATAB_8_31.TXT
#' The data link above is no longer available,
#' so download the data_lab4.txt file to your working directory
#' Create an object to hold the data set
#'
#' @sep="" because the columns are separated by white space
#'
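The import call itself is not shown above, so here is a minimal sketch of it. It assumes that data_lab4.txt has a header row (header = TRUE) and that the object is called patients, the name used in the model output later in this lab. To stay runnable without the file, the sketch first demonstrates sep = "" on a small temporary file whose values are made up for illustration only.

```r
# Toy file with made-up values (NOT the lab data). The columns are
# separated by a varying number of spaces to show what sep = "" handles.
toy_path <- tempfile(fileext = ".txt")
writeLines(c("enz   time",
             "10  100",
             "20     150",
             "30 210"), toy_path)

# sep = "" treats any run of white space (spaces or tabs) as one delimiter.
# header = TRUE is an assumption: it reads the first row as column names.
toy <- read.table(toy_path, header = TRUE, sep = "")
str(toy)

# The same call for the lab file, run only if the file is present
# in the working directory.
if (file.exists("data_lab4.txt")) {
  patients <- read.table("data_lab4.txt", header = TRUE, sep = "")
  print(head(patients))
}
```

If the real file has no header row, header = FALSE together with col.names would be needed instead.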
Visualizing the fitted model with the observations. The red line is the fitted model. The blue lines represent
the residuals, that is, the vertical distance between each observation and its fitted value.
[Figure: Fitted Model of Survival Time vs Enzyme (Blood Protein). x-axis: enz; y-axis: time.]
## (Intercept) enz
## -108.71614 3.96678
Certain methods specific to the lm class can also be applied to the model object (lm_patients)
methods(class = class(lm_patients))[1:10] # extracts the 1st 10 functions
confint(lm_patients) # produces 95% CI of parameter estimates
## 2.5 % 97.5 %
## (Intercept) -232.564499 15.132220
## enz 2.417402 5.516158
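The interval from confint can be reproduced by hand as the estimate plus or minus the t critical value times the standard error. The sketch below shows that calculation under the same assumptions as the rest of the lab code: the stand-in data is simulated and used only if patients is not loaded, and the name manual_ci is introduced here for illustration.

```r
# Stand-in data (simulated, NOT the lab data), used only if needed.
if (!exists("patients")) {
  set.seed(1)
  enz <- runif(54, min = 20, max = 120)
  patients <- data.frame(enz = enz,
                         time = exp(3.5 + 0.02 * enz + rnorm(54, sd = 0.5)))
}
lm_patients <- lm(time ~ enz, data = patients)

# Estimates and standard errors from the coefficient table.
coef_table <- coef(summary(lm_patients))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% critical value with n - 2 residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(lm_patients))

# Lower and upper limits, one row per coefficient.
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci

# Compare with confint(), ignoring the row and column labels.
all.equal(unname(manual_ci), unname(confint(lm_patients)))
```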
Generic base R functions such as plot(), summary(), and print() can also be applied to the lm object.
Example:
print(lm_patients)
##
## Call:
## lm(formula = time ~ enz, data = patients)
##
## Coefficients:
## (Intercept) enz
## -108.716 3.967
#' @plot does not have a data argument,
#' so to avoid using the $ (indexing/subsetting) symbol
#' @with is used to evaluate the @plot() call
#' inside the patients dataset
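The plotting call that these comments describe is not shown, so the sketch below assumes it is a scatter plot of time against enz. It contrasts the $ form with the with() form, using simulated stand-in data only if patients is not loaded.

```r
# Stand-in data (simulated, NOT the lab data), used only if needed.
if (!exists("patients")) {
  set.seed(1)
  enz <- runif(54, min = 20, max = 120)
  patients <- data.frame(enz = enz,
                         time = exp(3.5 + 0.02 * enz + rnorm(54, sd = 0.5)))
}

# Without with(): each column has to be reached through $.
plot(patients$enz, patients$time)

# With with(): enz and time are looked up inside patients for this call
# only, and nothing stays attached afterwards.
with(patients, plot(enz, time))

# The same idea works for any expression, not only plot().
mean_via_with   <- with(patients, mean(time))
mean_via_dollar <- mean(patients$time)
```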
[Figure: scatter plot with enz on the x-axis.]
0.4.2 Assessing Homogeneity of Variance and Normality Assumptions
The residual plot can be used to detect problems such as non-linearity, non-homogeneous variance, and
outliers. If the variance is homogeneous, most of the residuals scatter randomly around the zero line (the
mean residual) with a roughly constant spread. If a pattern such as curvature (which suggests non-linearity)
or a funnel shape (which suggests non-homogeneous variance) is detected in the residual plot, we may
consider fitting a more complicated model.
Checking Homogeneity of Variance
#' The function below is from the olsrr package
#'
#' @ols_plot_resid_fit function plots the model residuals against
#' the fitted values of the model
#' the first argument of this function is the lm object
ols_plot_resid_fit(lm_patients)
#' Alternatively, you can use the plot function from base R
#'
#' Applying plot() to the lm object produces several diagnostic plots
#' @which= can be used to select a particular plot, in this case
#' the plot of residuals against fitted values
plot(lm_patients, which = 1)
ols_test_normality(lm_patients)
#' Alternatively, you can use the shapiro.test function from base R
#' Extract the model residuals with @lm_patients$residuals
shapiro.test(lm_patients$residuals)
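The object returned by shapiro.test can be stored and inspected, which makes the decision rule explicit. The sketch below assumes a 0.05 significance level (a conventional choice, not one stated in this lab) and uses simulated stand-in data only if patients is not loaded. The name sw is introduced here for illustration.

```r
# Stand-in data (simulated, NOT the lab data), used only if needed.
if (!exists("patients")) {
  set.seed(1)
  enz <- runif(54, min = 20, max = 120)
  patients <- data.frame(enz = enz,
                         time = exp(3.5 + 0.02 * enz + rnorm(54, sd = 0.5)))
}
lm_patients <- lm(time ~ enz, data = patients)

# Shapiro-Wilk test on the model residuals.
# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(resid(lm_patients))
sw$statistic  # the W statistic
sw$p.value    # the p-value of the test

alpha <- 0.05  # assumed significance level
if (sw$p.value < alpha) {
  message("Reject normality of the residuals at the assumed level.")
} else {
  message("No evidence against normality of the residuals at the assumed level.")
}

# A normal Q-Q plot of the residuals gives a visual check of the same assumption.
plot(lm_patients, which = 2)
```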
##
## Call:
## lm(formula = log(time) ~ enz, data = patients)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19415 -0.29725 -0.02198 0.34125 1.01853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.558633 0.245526 14.494 < 2e-16 ***
## enz 0.019727 0.003072 6.423 4.12e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4753 on 52 degrees of freedom
## Multiple R-squared: 0.4424, Adjusted R-squared: 0.4316
## F-statistic: 41.25 on 1 and 52 DF, p-value: 4.118e-08
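A summary of this form comes from fitting lm with log(time) as the response, as the Call line shows. The sketch below fits that model and back-transforms it to the original scale. The object name lm_exp and the value enz = 80 are chosen here for illustration, and the stand-in data is simulated and used only if patients is not loaded.

```r
# Stand-in data (simulated, NOT the lab data), used only if needed.
if (!exists("patients")) {
  set.seed(1)
  enz <- runif(54, min = 20, max = 120)
  patients <- data.frame(enz = enz,
                         time = exp(3.5 + 0.02 * enz + rnorm(54, sd = 0.5)))
}

# Exponential model: linear regression of log(time) on enz.
lm_exp <- lm(log(time) ~ enz, data = patients)
summary(lm_exp)

# log(time) = b0 + b1 * enz  is equivalent to  time = exp(b0) * exp(b1 * enz),
# so exp(b1) is the multiplicative change in time per one-unit increase in enz.
b <- coef(lm_exp)
exp(b)

# Prediction on the original time scale for a hypothetical enz value.
new_obs <- data.frame(enz = 80)
pred_time <- exp(predict(lm_exp, newdata = new_obs))
pred_time
```

Exponentiating a prediction made on the log scale estimates the median of time at that enz value rather than its mean.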
with(patients, plot(enz, log(time)))
[Figure: scatter plot of log(time) against enz.]
0.6.1 Question 1
Make a scatter plot to show the relationship between TIME and ENZ. What is your observation? How does it
compare with the scatter plot of Log-Time (that is, the log transform of TIME) against ENZ?
0.6.2 Question 2
Fit the simple linear regression model $TIME = \beta_0 + \beta_1 ENZ + \epsilon$. Write down the estimated
regression function and examine the residual plot and normality test. Describe what you observed and make
brief comments. Hint: you need to check the ANOVA table (that is, the F-statistic and its p-value on the
last line of the summary(yourmodel) output), parameter estimates table, R-Square, residual plot and
normality test.
0.6.3 Question 3
Fit the exponential model $\log(TIME) = \beta_0 + \beta_1 ENZ + \epsilon$. Write down the estimated regression
function. Does the model fit well? Why? Hint: you need to check the ANOVA table (that is, the F-statistic
and its p-value on the last line of the summary(yourmodel) output), parameter estimates table, R-Square,
residual plot and normality test.
Remember to attach your code.
0.6.4 Question 4
Compare the simple linear model in Question 2 and the exponential model in Question 3. Do you observe any
improvement from fitting the exponential model relative to the linear model? Support your conclusion
with details (such as R-Square, homogeneity of variance and the normality test).