Solutions Manual: Using R for Introductory Statistics
Univariate data
2.2 The diff function returns the distance between fill-ups, so mean(diff(gas))
is your average mileage per fill-up, and mean(gas) is the uninteresting
average of the recorded mileage.
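The gas data itself is not reproduced in this manual; a sketch with hypothetical odometer readings illustrates the point:

```r
# Hypothetical odometer readings at eight successive fill-ups (not the
# book's data); diff gives the miles driven between consecutive fill-ups
gas <- c(65311, 65624, 65908, 66219, 66499, 66821, 67145, 67447)
diff(gas)        # miles between fill-ups
mean(diff(gas))  # average miles per fill-up
```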
2.3 The data may be entered using c, then manipulated in a natural way.
x <- c(2, 5, 4, 10, 8)
x^2
## [1] 4 25 16 100 64
x - 6
## [1] -4 -1 -2 4 2
(x - 9)^2
## [1] 49 16 25 1 1
rep("a", 10)
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
seq(1, 99, by=2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
## [21] 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79
## [41] 81 83 85 87 89 91 93 95 97 99
rep(1:3, rep(3,3))
## [1] 1 1 1 2 2 2 3 3 3
rep(1:3, 3:1)
## [1] 1 1 1 2 2 3
c(1:5, 4:1)
## [1] 1 2 3 4 5 4 3 2 1
2.6 We have:
## [1] 313.5508
## [1] -1 0 1 2 3
## [1] 3 6
2.8 If we know the cities starting with a “J” then this is just an exercise in
indexing by the names attribute, as with:
precip["Juneau"]
## Juneau
## 54.7
Getting the cities with names beginning with “J” can be done by
sorting and inspecting, say with sort(names(precip)).
The inspection of the names by scanning can be tedious for large data
sets. The grepl function can be useful here, but requires the specification of a pattern:
precip[grepl("^J", names(precip))]
2.9 There are many ways to do this, the following uses paste:
paste("Trial", 1:10)
## [1] "Trial 1" "Trial 2" "Trial 3" "Trial 4" "Trial 5"
## [6] "Trial 6" "Trial 7" "Trial 8" "Trial 9" "Trial 10"
2.10 This answer will vary depending on the underlying system. One answer
is:
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/UsingR/DESCRIPTION"
2.11 The number of levels and number of cases are returned by:
require(MASS)
man <- Cars93$Manufacturer
length(man) # number of cases
## [1] 93
length(levels(man)) # number of levels
## [1] 32
2.12 Looking at the levels, we see that one is rotary, which is clearly not
numeric. As for the 5-cylinder cars, we can get them as follows:
which(Cars93$Cylinders == "5")
## [1] 89 93
2.13 The factor function allows this to be done by specifying the labels
argument:
This produces a modified, local copy of mtcars. The ordering of the la-
bels should match the following: sort(unique(as.character(mtcars$am))).
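A sketch of the intended call (the 0/1 coding of am in mtcars is documented as 0 = automatic, 1 = manual):

```r
# Work on a local copy so the built-in mtcars is left untouched
mtcars2 <- mtcars
mtcars2$am <- factor(mtcars2$am, labels = c("automatic", "manual"))
levels(mtcars2$am)
```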
2.14 We have:
require(HistData)
any(Arbuthnot$Female > Arbuthnot$Male)
## [1] FALSE
Read the help page to see how this could be construed to show the
“guiding hand of a divine being.”
2.15 We have:
!A | !B
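Assuming the exercise asks for an equivalent of !(A & B) (De Morgan’s law), a quick check over all truth combinations confirms the equivalence:

```r
# All four combinations of two logical values
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)
all(!(A & B) == (!A | !B))  # the two expressions agree everywhere
```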
2.17 After parsing the question, the following expression answers it:
m <- mean(precip)
trimmed_m <- mean(precip, trim=0.25)
any(precip > m + 1.5 * trimmed_m)
## [1] FALSE
commutes <- c(17, 16, 20, 24, 22, 15, 21, 15, 17, 22)
commutes[commutes == 24] <- 18
max(commutes)
## [1] 22
min(commutes)
## [1] 15
mean(commutes)
## [1] 18.3
## [1] 4
## [1] 0.5
2.20 We need to know that the months with 31 days are 1, 3, 5, 7, 8, 10, and
12.
cds <- c(79, 74, 161, 127, 133, 210, 99, 143, 249, 249, 368, 302)
longmos <- c(1, 3, 5, 7, 8, 10, 12)
long <- cds[longmos]
short <- cds[-longmos]
mean(long)
## [1] 166.5714
mean(short)
## [1] 205.6
2.21 We enter the data, then look at year-over-year differences:
x <- c(0.57, 0.89, 1.08, 1.12, 1.18, 1.07, 1.17, 1.38, 1.441, 1.72)
names(x) <- 1990:1999
diff(x)
which(diff(x) < 0)
## 1995
## 5
The jump between 1994 and 1995 was negative (there was a work stoppage
that year). The percentage difference is found by dividing by
x[-10] and multiplying by 100. (Recall that x[-10] is all but the tenth
value of x.) The first year’s jump was the largest.
diff(x)/x[-10] * 100
2.22 We have:
f <- function(x) {
mean(x^2) - mean(x)^2
}
f(1:10)
## [1] 8.25
## 1 2 3 4 5 6 7 8 9 10
## FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
2.25 A simple implementation checks each candidate factor directly. One
could improve it by only looking at integer factors less than or equal
to the square root of x.
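The implementation referred to did not survive extraction; a minimal sketch of a naive primality test (the name isprime is our own):

```r
# Naive test: x is prime if no integer in 2..(x-1) divides it evenly.
# The suggested improvement would stop at floor(sqrt(x)) instead.
isprime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)   # 2 and 3 are prime; also avoids 2:(x-1) counting down
  all(x %% 2:(x - 1) != 0)
}
isprime(2)   # TRUE
isprime(15)  # FALSE
isprime(17)  # TRUE
```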
2.26 The package containing the data set is no longer maintained, so this
problem becomes quite hard to do! Here we copy the data:
time <- c(169, 125, 210, 118, 117, 135, 128, 120, 122, 164, 174, 155, 120, 159,
121, 144, 129, 136, 124, 138, 195, 141, 156, 179, 109, 112, 167, 113,
133, 153, 141, 150, 126, 202, 165, 139, 164, 162, 171, 154, 147, 148,
137, 144, 139, 159, 128, 181, 181, 146, 161, 157, 130, 121, 122, 135,
150, 151, 177, 168, 180, 136, 230, 153, 275, 204, 245, 177, 187, 237,
119, 166, 205, 167, 153, 204, 156, 303, 158, 163, 155, 80, 303, 165,
240, 130, 190, 62, 185, 286, 167, 148, 121, 140, 124, 213, 232, 102,
106, 177, 160, 241, 166, 145, 195, 270, 188, 253, 162, 175, 191, 495,
194)
album <- rep(c("BBC_tapes", "Rubber_Soul", "Revolver", "Magical Mystery Tour",
"Seargent Peper", "The White album"),
c(31, 11, 14, 14,13,30))
beatles <- data.frame(time=time, album=album)
2.27 The weighted average is computed with:
nk <- ChestSizes$count
yk <- ChestSizes$chest
n <- sum(nk)
wk <- nk/n
sum(wk * yk)
## [1] 39.83182
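Base R’s weighted.mean collapses the last three steps into one call, giving the same value:

```r
require(HistData)
# weighted.mean(x, w) computes sum(w * x) / sum(w) directly
weighted.mean(ChestSizes$chest, w = ChestSizes$count)  # 39.83182, as above
```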
2.28 We have
x <- c(80,82,88,91,91,95,95,97,98,101,106,106,109,110,111)
median(x)
## [1] 97
2.29 The LearnEDA package is no longer available. The data for farms is
reproduced with:
state <- c("Al", "Als", "Ar", "Ark", "Ca", "Col", "Conn", "De", "Fl", "Ge",
"Ha", "Id", "Ill", "Ind", "Io", "Kan", "Ken", "Lou", "Ma", "Mary",
"Mass", "Mi", "Minn", "Miss", "Misso", "Mon", "Neb", "Nev", "NH",
"NJ", "NM", "NY", "NC", "ND", "Oh", "Ok", "Or", "PA", "RI", "SC",
"SD", "Te", "Tex", "Ut", "Ver", "Vir", "Wa", "WV", "Wi", "Wy")
count <- c(48, 1, 8, 49, 89, 29, 4, 3, 45, 50, 6, 25, 79, 65, 98, 65, 91, 30,
7, 12, 6, 53, 81, 43, 110, 28, 55, 3, 3, 10, 16, 39, 58, 31, 80,
84, 41, 59, 1, 25, 33, 91, 227, 16, 7, 50, 40, 21, 78, 9)
farms <- data.frame(state=state, count=count)
stem(farms$count)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 1133346677890266
## 2 | 155890139
## 4 | 013589003589
## 6 | 5589
## 8 | 0149118
## 10 | 0
## 12 |
## 14 |
## 16 |
## 18 |
## 20 |
## 22 | 7
It is hard to gauge the influence of the outlier, but otherwise the balance
point is likely in the stem labeled 4, that is, somewhere in the 40s. A
check shows the mean is 44.04.
2.3 * 10^(-4)
## [1] 0.00023
## [1] 23.97701
hist(pi2000-.1, prob=TRUE)
lines(density(pi2000))
## [1] 98.24923
require(MASS)
hist(DDT)
boxplot(DDT)
The histogram shows the data to be roughly symmetric, with one outly-
ing value, so the mean and median should be similar and the standard
deviation about three-quarters the IQR. The median and IQR can be
identified on the boxplot giving estimates of 3.2 for the mean and a
standard deviation a little less than 0.5. We can check with this com-
mand:
c(mean=mean(DDT), sd=sd(DDT))
## mean sd
## 3.3280000 0.4371531
2.35 The hist function needs the data to be in a data vector, not tabulated.
We pad it out using rep, then plot. The histogram is very symmetric.
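The dataset is not reproduced here; the rep idiom, shown with hypothetical tabulated counts, looks like:

```r
vals   <- 1:5               # the distinct values (hypothetical)
counts <- c(2, 4, 7, 4, 2)  # how many times each value occurred
x <- rep(vals, counts)      # expand the table back into a data vector
hist(x)
```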
2.36 The histogram has a rather wide range (the largest value is about 3
times the smallest). Some year had over 93 feet of snowfall!
2.37 First assign names. Then you can access the entries using the respective
state abbreviations.
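The commands themselves did not survive extraction; a sketch, assuming the built-in state.area and state.abb vectors were the data used:

```r
# Name the state areas by their two-letter abbreviations
area <- state.area        # land areas of the 50 US states (sq. miles)
names(area) <- state.abb  # "AL", "AK", ..., "WY"
area["NJ"]
```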
## NJ
## 7836
## [1] 8
## [1] 40
To see that Alaska is the big state, we could look at a histogram and
then query:
## AK
## 589757
(Figure: frequency histogram of state.area.)
2.38 For a heavily skewed-right data set, such as this, the mean is
significantly more than the median due to relatively few large values.
The median more accurately reflects the bulk of the data. If your
intention were to make the data seem outrageously large, then a mean
might be used.
2.39 The definition of the median is incorrect. Can you think of a shape
of distribution for which this is actually okay?
2.40 The median is lower for skewed-left distributions. It makes an area look
more affordable. For exclusive listings, the mean is often used to make
an area seem more expensive.
## [1] 39.5
## [1] 50.75
## [1] 0.5815603
## [1] 0.6666667
quantile(rivers, 0.75)
## 75%
## 680
2.43 First grab the data and check the units (minutes). The top 10% is actu-
ally the 0.10 quantile in this case, as shorter times are better.
## [1] 2.6
quantile(times, c(0.10, 0.25))
## 10% 25%
## 208.695 233.775
quantile(times,c(.90)) # 5:31
## 90%
## 331.75
2.44 We have:
mean(rivers)
## [1] 591.1844
median(rivers)
## [1] 425
mean(rivers, trim=.25)
## [1] 449.9155
2.45 We see
##
## The decimal point is 3 digit(s) to the right of the |
##
## 0 | 00000000000000000000000000000111111222338
## 2 | 07
## 4 | 5
## 6 | 8
## 8 | 4
## 10 | 5
## 12 |
## 14 |
## 16 | 0
c(mean=mean(islands),
median=median(islands),
trimmed=mean(islands,trim=0.25))
The data set is quite skewed due to the seven continents. We wouldn’t
expect the mean and median to agree, but note that after trimming the
mean and median are similar.
2.46 We can find the z-score for Barry Bonds using the name as follows:
## bondsba01
## 5.990192
mean(z)
## [1] -1.340544e-17
sd(z)
## [1] 1
2.48 No, as the data is skewed heavily to the right, the standard deviation is
quite different:
## mad IQR sd
## 20.7564 27.5000 207.0435
2.49 As this distribution has a long tail, we find that the mean is much more
than the median.
## [1] 74.90069
2.50 The value is relatively close to 1, which is the value for exponentially
distributed data:
sd(rivers) / mean(rivers)
## [1] 0.8353922
## [1] 0.8889017
2.52 The skew of wt is negative indicating a slight left skew. The skew of the
inter-arrival times is twice as much and to the right.
skew(babyboom$wt)
## [1] -1.078636
## [1] 1.829281
2.53 We have:
hist(hall.fame$HR)
The home run distribution is skewed right, the batting average is fairly
symmetric, and on-base percentage is also symmetric but may have
longer tails than the batting average.
2.54 After a log transform the data looks more symmetric. If you find the
median of the transformed data, you can take its exponential to get the
median of the untransformed data. Not so with the mean.
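A quick illustration with hypothetical skewed data: the median commutes with the (monotone) log transform, while the mean does not:

```r
x <- c(1, 2, 4, 8, 16, 32, 64)    # strongly right-skewed
exp(median(log(x))) == median(x)  # TRUE: the median transforms back
exp(mean(log(x)))                 # 8, the geometric mean ...
mean(x)                           # ... not the arithmetic mean, 18.14
```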
2.55 The data is tabulated, so we first create a data vector through rep, then
plot.
require(HistData)
chest <- rep(ChestSizes$chest, ChestSizes$count)
qqnorm(chest)
hist(cfb$AGE)
After making the graphics, we see that AGE is short-tailed and somewhat
symmetric; EDUC is not symmetric; NETWORTH is very skewed (some get
really rich, some get pretty poor, most are close to 0 on this scale);
and log(SAVING + 1) is symmetric except for a spike of people at 0 who
have no savings (the actual data is skewed; the logarithm changes this).
## [1] TRUE
2.60 This can be done as follows (using a different name from the built-in
mode function):
Outside of the last line, this is a simple translation of the example
given. The last line is not strictly necessary; it generalizes the call
to as.numeric in the example by coercing the output to the class of the
input variable.
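The code block itself did not survive extraction; a sketch consistent with the description above (the name most_common is our own, chosen to avoid the built-in mode):

```r
most_common <- function(x) {
  tbl <- table(x)                    # tabulate the values
  val <- names(tbl)[which.max(tbl)]  # most frequent value, as a string
  as(val, class(x))                  # coerce back to the input's class
}
most_common(c(1, 2, 2, 3))     # 2
most_common(c("a", "b", "b"))  # "b"
```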
## bumps
## (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03]
## 2 8 7 6
summary(Cars93$Cylinders)
## 3 4 5 6 8 rotary
## 3 49 2 31 7 1
## chars
## \n , ; . a A b c C d D e E f F
## 10 589 48 1 73 251 3 38 156 5 102 8 370 3 22 2
## g h i I j l L m M n N o p P q Q
## 45 17 343 8 4 220 3 142 7 211 13 174 80 6 30 2
## r s S t u U v V x
## 183 272 7 289 289 3 49 5 3
sort(table(chars))
## chars
## ; F Q A E L U x j C V P M S D I
## 1 2 2 3 3 3 3 3 4 5 5 6 7 7 8 8
## \n N h f q b g , v . p d m c o r
## 10 13 17 22 30 38 45 48 49 73 80 102 142 156 174 183
## n l a s t u i e
## 211 220 251 272 289 289 343 370 589
require(MASS)
dotchart(table(Cars93$Cylinders))
The graphic shows that 4-cylinder cars were the most popular in 1993.
Was this the case in 1974 (cf. mtcars$cyl)?
FIGURE 2.1 A dot plot of the data on children’s weights. Such a graphic shows the data in sorted
order, allowing a quick visual sense of both the center and the spread. Values are drawn on the
number line with repeated values stacked.
Courtesy of CRC Press/Taylor & Francis Group
FIGURE 2.2 The mean is the value that balances the dot plot.
FIGURE 2.3 Plot of absolute z-scores for the wts data set and a subset of the exec.pay data set. There
are no values larger than 2 in the wts data set, in agreement with the rule of thumb for bell-shaped
data. For the executive pay data, we see a z-score nearly as large as 5, virtually impossible for bell-
shaped data.
FIGURE 2.4 Dot plots of two data sets with different shapes. The left data set, a sample of the
executive pay data set, is skewed right; the right data set, on the heights of four-year-old children, is
mostly symmetric. For the symmetric data, the mean and median measure the center in a similar
manner (36.7 to 38). For the skewed data this is not so (42.5 to 24).
FIGURE 2.5 The left graphic shows stacked dot plots of z-scores of two data sets. The lower one has
long tails, the top one “normal” tails. The right graphic shows the galaxies data set. The overlapping
dots in the data show the presence of at least 3 clusters, corresponding to modes.
FIGURE 2.6 Two histograms of times between eruptions at the Old Faithful geyser in Yellowstone
National Park show two modes. The left graphic represents frequencies; the right graphic is scaled
to have total area equal to 1.
FIGURE 2.7 A histogram of a random sample of n = 10,000 data points and a corresponding density
plot of the data. The vertical lines of the histogram are de-emphasized. From either, we can see the
data is symmetric and unimodal with a mean of 0.
FIGURE 2.8 Histogram of bumpers data with a density plot layered on top.
FIGURE 2.9 Boxplots of various data sets. The left one shows the bumpers data set, a mostly
symmetric data set with no outliers. The right one, of the weight variable in the kid.weights data,
shows a right skew and some outliers.
FIGURE 2.10 Three quantile-normal plots produced by qqnorm. The leftmost graphic shows data on
finger lengths of several prisoners from the finger variable in the Macdonell (HistData) data set. It
shows data more or less on a straight line, indicating a normal distribution. The grouping is due to
the data being discretized. The second graphic uses data on the height of children in Galton’s classic
study of heights. This data has slight bends at the edges, like an “S,” due to the tails being slightly
shorter than the normal. The final graphic shows what a decidedly non-normal distribution looks
like: the executive pay data, which is skewed right and long tailed. Such data shows a clear curve.
FIGURE 2.11 A horizontal bar chart and dot chart of the smoking data.