Department of Statistics: COURSE STATS 330/762
1. Read the data into R. Make a data frame, naming the variables with the names above.
Print out the first 10 lines. [5 marks]
The following code reads in the data, and names the variables. Note that there is no header
line in the file.
pbf.df = read.table("C:\\Users\\alee044\\Documents\\Teaching\\330\\assignments\\2010\\Assignment 3\\PBF.txt",
                    header = FALSE,
                    col.names = c("PBF", "Density", "Age", "Weight", "Height", "BMI", "Neck", "Chest",
                                  "Abdomen", "Hip", "Thigh", "Knee", "Ankle", "Biceps", "Forearm", "Wrist"))
Note that read.table ignores all characters after a #, so any comments in the data file are skipped. The first 10 lines are
> pbf.df[1:10,]
PBF Density Age Weight Height BMI Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist
1 12.6 1.0708 23 154.25 67.75 23.7 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
2 6.9 1.0853 22 173.25 72.25 23.4 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
3 24.6 1.0414 22 154.00 66.25 24.7 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
4 10.9 1.0751 26 184.75 72.25 24.9 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
5 27.8 1.0340 24 184.25 71.25 25.6 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7
6 20.6 1.0502 24 210.25 74.75 26.5 39.0 104.5 94.4 107.8 66.0 42.0 25.6 35.7 30.6 18.8
7 19.0 1.0549 26 181.00 69.75 26.2 36.4 105.1 90.7 100.3 58.4 38.3 22.9 31.9 27.8 17.7
8 12.8 1.0704 25 176.00 72.50 23.6 37.8 99.6 88.5 97.1 60.0 39.4 23.2 30.5 29.0 18.8
9 5.1 1.0900 25 191.00 74.00 24.6 38.1 100.9 82.5 99.9 62.9 38.3 23.8 35.9 31.1 18.2
10 12.0 1.0722 23 198.25 73.50 25.8 42.1 99.6 88.6 104.1 63.1 41.7 25.0 35.6 30.0 19.2
2. I have not changed any values in the original data set, but there are several strange
values. Identify these using graphical methods and either correct them or delete the
offending observations (delete a maximum of 4). In particular, some of the PBF
values seem suspect (which ones?). Calculate the volume from the variables Density and
Weight. [5 marks]
It seems as though there are a lot of outliers. To identify them, we can use the function order, which
gives the indexes of the observations for a particular variable, sorted from smallest to largest:
> order(pbf.df$Weight)
[1] 182 74 45 172 226 50 241 27 29 47 248 49 55 159 53 52 164 23 211 224 75 149 176
[24] 183 153 72 231 28 76 48 24 217 202 82 177 25 161 67 124 128 151 171 54 191 221 3
[47] 1 218 220 68 69 239 246 146 197 99 70 154 134 87 235 116 51 26 184 144 210 88 32
[70] 123 209 30 73 125 79 223 234 77 98 16 195 114 81 46 93 130 106 207 167 137 86 126
[93] 143 199 71 186 230 33 85 198 135 139 213 110 236 227 185 84 111 200 80 132 156 131 103
[116] 170 215 2 142 102 229 115 233 8 90 174 89 141 91 105 117 160 173 127 78 64 113 201
[139] 62 21 118 60 92 225 196 13 7 57 31 232 158 112 66 19 204 5 95 163 4 190 122
[162] 136 11 250 179 120 15 138 145 97 240 83 251 9 119 237 36 129 94 63 193 109 214 208
[185] 17 162 38 104 100 133 56 10 101 228 219 245 107 155 22 203 249 58 37 189 59 40 108
[208] 42 14 65 157 121 148 252 147 206 18 6 188 20 44 140 247 12 61 212 43 166 34 165
[231] 216 238 181 150 205 96 242 168 194 175 244 169 222 187 243 180 178 152 192 35 41 39
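The pairs plot referred to below is not reproduced here; it can be drawn with
pairs(pbf.df)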
The large weight visible in the pairs plot is observation 39. Similarly, the small height is observation 42 (with a
height of 29.5 inches and a weight of 205 pounds!), the large hip, thigh and knee values are all observation 39, and the
two large ankles are observations 31 and 86.
Deleting these four observations gives a better pairs plot:
pairs(pbf.df[-c(31,39,42,86),])
There are some other points liable to have large influence, but we have used up our outlier budget.
Let's work with this reduced data set, also deleting the last two observations (which have missing PBF and Density):
pbf.use.df = pbf.df[-c(31,39,42,86,251,252),]
To check out the accuracy of the PBF calculation, plot the calculated PBF against the density, using the index
number as the plotting symbol:
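A sketch of one way to draw this plot (figure not reproduced here), assuming the recorded PBF values were
derived from Density via Siri's equation, PBF = 495/Density - 450; the recorded PBF is plotted against Density
with the calculated values overlaid as a curve, so points off the curve are suspect:
plot(PBF ~ Density, data = pbf.use.df, type = "n")
text(pbf.use.df$Density, pbf.use.df$PBF, labels = rownames(pbf.use.df), cex = 0.6)  # index numbers as symbols
curve(495/x - 450, add = TRUE, col = "red")  # PBF as calculated from Density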
It seems that four PBF values are miscalculated, namely observations 45, 73, 92 and 178.
Since these will not affect the rest of the analysis, we won't correct them.
To calculate the volume in litres and add this variable to the data frame, we type
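A sketch of this calculation, assuming Weight is recorded in pounds and Density in g/cm^3 (i.e. kg per litre),
so that converting pounds to kilograms and dividing by the density gives litres:
pbf.use.df$Volume = (pbf.use.df$Weight / 2.2046) / pbf.use.df$Density  # lb -> kg, then kg / (kg per litre) = litres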
3. Develop a model that will predict the volume from the other variables, excluding Density
and PBF. You should be able to come up with a model that predicts very well. Points to note:
Which variables should be selected? Are transformations indicated? (think Cherry trees). You
should potentially consider using all the techniques you have been taught, up to the end of
lecture 15. [20 marks]
It seems as though the relationship between Volume and the other variables could well be
multiplicative, as it was with the cherry trees. Let's log all the variables, making new, logged variables.
This will be necessary for the variable selection methods to work. We will eliminate Density and PBF as
they are no longer needed.
Here is a quick way to do this (it works because all the variables are numeric)
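One such shortcut is sketched below; the exact code used originally is not shown, so the variable names are
chosen to match the later code:
log.pbf.df = log(pbf.use.df[, -(1:2)])  # drop PBF and Density, log everything else (including Volume)
names(log.pbf.df) = paste("log", names(log.pbf.df), sep = ".")  # e.g. Weight becomes log.Weight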
There are substantial correlations between the variables, so not all will be needed. To figure
out which ones are required, we do some variable selection. First all possible regressions:
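A sketch using the standard leaps package (the course's allpossregs function from the R330 package would
give a similar table, including the cross-validation criterion mentioned below):
library(leaps)
apr = regsubsets(log.Volume ~ ., data = log.pbf.df, nvmax = 14)  # best subset of each size
summary(apr)$bic  # compare the best model of each size by BIC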
Either the 4-variable model with log.Weight, log.Height, log.Abdomen and log.Wrist (on the
basis of BIC, and almost the smallest CV) or the 8-variable model with these plus log.Age, log.Neck,
log.Chest and log.Biceps seems indicated.
Next, stepwise selection:
model1.lm = lm(log.Volume ~ ., data = log.pbf.df)  # full model
null.lm = lm(log.Volume ~ 1, data = log.pbf.df)    # intercept only
step(null.lm, scope = formula(model1.lm), direction = "both")  # stepwise search using AIC
The stepwise search (output not shown) ends with the same eight-variable model chosen by all possible
regressions. Both the 4- and 8-variable models should be OK for prediction. In fact the 4-variable model has an
R² almost as good as that of the 8-variable model, so we go with the 4-variable model in the interest of simplicity.
A hint of curvature is present, and the Box-Cox plot (not shown) indicates that squaring the response might
be a good idea. This leads to the following model, with diagnostic plots:
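A sketch of the squared-response fit (the exp(sqrt(.)) back-transform used in the predictions below implies
the response is the square of log.Volume; the model name is chosen to match the later code):
lv2 = log.pbf.df$log.Volume^2  # squared response suggested by the Box-Cox analysis
model.42.lm = lm(lv2 ~ log.Weight + log.Height + log.Abdomen + log.Wrist, data = log.pbf.df)
par(mfrow = c(2, 2))
plot(model.42.lm)  # standard diagnostic plots (not reproduced here)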
These look good apart from an outlier, point 92. This, however, does not seem to be affecting the
coefficients too much, as the influence plots show (Cook's distance is OK). The covariance ratio
indicates that point 92 is affecting the standard errors. There are no other big outliers. The model seems OK, and
we could use it for prediction. We will explore the effect of point 92 on the predictions.
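One way to compute the influence diagnostics mentioned above in base R (a sketch; the course may use its
own plotting helpers):
im = influence.measures(model.42.lm)
summary(im)  # flags points with large Cook's distance, covariance ratio, DFBETAS, etc.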
4. I have replaced the values of the variables PBF and Density on the last two individuals
in the data set with NAs . Using your model, predict the body volume for these two
individuals. [10 marks]
Code for the predictions (with and without the high covariance-ratio point 92); the back-transformation
exp(sqrt(.)) undoes first the squaring and then the logging of the response:
# predictions: log the predictor values for the last two individuals
predict.df = log(pbf.df[251:252, -(1:2)])   # drop PBF and Density, log the rest
names(predict.df) = names(log.pbf.df)[-15]  # match the predictor names (drop log.Volume)
> exp(sqrt(predict(model.42.lm, predict.df, interval="p")))
fit lwr upr
251 82.87071 81.31958 84.44463
252 90.54701 88.86330 92.25543
# without point 92
lv2.no92 = lv2[-92]
model.42.no92.lm = lm(lv2.no92 ~ log.Weight + log.Height + log.Abdomen + log.Wrist,
                      data = log.pbf.df[-92, ])
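The corresponding prediction call (output omitted here) would be along the lines of
exp(sqrt(predict(model.42.no92.lm, predict.df, interval = "p")))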
The results are very similar. We will go with the last one.
NB: There will be a prize for the best predictions. In the event of a tie, a stochastic mechanism
will be used.
Suppose we logged the volume and the other variables (excluding PBF and Density), and fitted a
model to log volume, using the other logged variables. Can you explain why we would not need to
include the variable log(BMI) in the model, given the other variables are included?
Since BMI is proportional to Weight/Height^2 (with weight in pounds and height in inches, BMI = 703 x Weight/Height^2),
we have log(BMI) = log(703) + log(Weight) - 2 log(Height), so there is an exact linear relationship between the logged
variables. Thus, if log(Weight) and log(Height) are in the model, adding log(BMI) will not reduce the RSS at all
(i.e. we have perfect collinearity). If we did try to include it, the software would report an aliased (NA)
coefficient, effectively ignoring it.
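A quick check of this, assuming log.pbf.df was built as sketched earlier (so it contains a log.BMI column): if the
recorded BMI were exactly 703 x Weight/Height^2 the log.BMI coefficient would be reported as NA (aliased), while with
the rounded BMI values in the file the near-perfect collinearity shows up as an enormous standard error instead.
summary(lm(log.Volume ~ log.Weight + log.Height + log.BMI, data = log.pbf.df))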