Solutions Manual: Using R for Introductory Statistics
Univariate data
2.2 The diff function returns the distance between fill-ups, so mean(diff(gas))
is your average mileage per fill-up, and mean(gas) is the uninteresting
average of the recorded mileage.
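The gas data itself is not reproduced in this manual; a sketch with hypothetical odometer readings illustrates the point:

```r
# Hypothetical odometer readings at eight successive fill-ups (not the
# book's data); diff gives the miles driven between consecutive fill-ups
gas <- c(65311, 65624, 65908, 66219, 66499, 66821, 67145, 67447)
diff(gas)        # miles between fill-ups
mean(diff(gas))  # average miles per fill-up
```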
2.3 The data may be entered using c, then manipulated in a natural way.
x <- c(2, 5, 4, 10, 8)
x^2
## [1] 4 25 16 100 64
x - 6
## [1] -4 -1 -2 4 2
(x - 9)^2
## [1] 49 16 25 1 1
rep("a", 10)
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
seq(1, 99, by=2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
## [21] 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79
## [41] 81 83 85 87 89 91 93 95 97 99
rep(1:3, rep(3,3))
## [1] 1 1 1 2 2 2 3 3 3
rep(1:3, 3:1)
## [1] 1 1 1 2 2 3
c(1:5, 4:1)
## [1] 1 2 3 4 5 4 3 2 1
2.6 We have:
## [1] 313.5508
## [1] -1 0 1 2 3
## [1] 3 6
2.8 If we know the cities starting with a “J” then this is just an exercise in
indexing by the names attribute, as with:
precip["Juneau"]
## Juneau
## 54.7
Getting the cities with names beginning with “J” can be done by
sorting and inspecting, say with sort(names(precip)).
The inspection of the names by scanning can be tedious for large data
sets. The grepl function can be useful here, but requires the specification of a pattern:
precip[grepl("^J", names(precip))]
2.9 There are many ways to do this, the following uses paste:
paste("Trial", 1:10)
## [1] "Trial 1" "Trial 2" "Trial 3" "Trial 4" "Trial 5"
## [6] "Trial 6" "Trial 7" "Trial 8" "Trial 9" "Trial 10"
2.10 This answer will vary depending on the underlying system. One answer
is:
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/UsingR/DESCRIPTION"
2.11 The number of levels and number of cases are returned by:
require(MASS)
man <- Cars93$Manufacturer
length(man) # number of cases
## [1] 93
length(levels(man)) # number of levels
## [1] 32
2.12 Looking at the levels, we see that one is rotary, which is clearly not
numeric. As for the 5-cylinder cars, we can get them as follows:
which(Cars93$Cylinders == "5")
## [1] 89 93
2.13 The factor function allows this to be done by specifying the labels
argument:
This produces a modified, local copy of mtcars. The ordering of the la-
bels should match the following: sort(unique(as.character(mtcars$am))).
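A sketch of the intended call (the 0/1 coding of am in mtcars is documented as 0 = automatic, 1 = manual):

```r
# Work on a local copy so the built-in mtcars is left untouched
mtcars2 <- mtcars
mtcars2$am <- factor(mtcars2$am, labels = c("automatic", "manual"))
levels(mtcars2$am)
```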
2.14 We have:
require(HistData)
any(Arbuthnot$Female > Arbuthnot$Male)
## [1] FALSE
Read the help page to see how this could be construed to show the
“guiding hand of a divine being.”
2.15 We have:
!A | !B
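Assuming the exercise asks for an equivalent of !(A & B) (De Morgan’s law), a quick check over all truth combinations confirms the equivalence:

```r
# All four combinations of two logical values
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)
all(!(A & B) == (!A | !B))  # the two expressions agree everywhere
```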
2.17 After parsing the question, the following expression answers it:
m <- mean(precip)
trimmed_m <- mean(precip, trim=0.25)
any(precip > m + 1.5 * trimmed_m)
## [1] FALSE
commutes <- c(17, 16, 20, 24, 22, 15, 21, 15, 17, 22)
commutes[commutes == 24] <- 18
max(commutes)
## [1] 22
min(commutes)
## [1] 15
mean(commutes)
## [1] 18.3
## [1] 4
## [1] 0.5
2.20 We need to know that the months with 31 days are 1, 3, 5, 7, 8, 10, and
12.
cds <- c(79, 74, 161, 127, 133, 210, 99, 143, 249, 249, 368, 302)
longmos <- c(1, 3, 5, 7, 8, 10, 12)
long <- cds[longmos]
short <- cds[-longmos]
mean(long)
## [1] 166.5714
mean(short)
## [1] 205.6
2.21 We enter the data, then look at year-over-year differences:
x <- c(0.57, 0.89, 1.08, 1.12, 1.18, 1.07, 1.17, 1.38, 1.441, 1.72)
names(x) <- 1990:1999
diff(x)
which(diff(x) < 0)
## 1995
## 5
The jump between 1994 and 1995 was negative (there was a work stoppage
that year). The percentage difference is found by dividing by
x[-10] and multiplying by 100. (Recall that x[-10] is all but the tenth
value of x.) The first year’s jump was the largest.
diff(x)/x[-10] * 100
2.22 We have:
f <- function(x) {
mean(x^2) - mean(x)^2
}
f(1:10)
## [1] 8.25
## 1 2 3 4 5 6 7 8 9 10
## FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
2.25 A simple implementation checks each candidate factor directly. One
could improve it by only looking at integer factors less than or equal
to the square root of x.
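The implementation referred to did not survive extraction; a minimal sketch of a naive primality test (the name isprime is our own):

```r
# Naive test: x is prime if no integer in 2..(x-1) divides it evenly.
# The suggested improvement would stop at floor(sqrt(x)) instead.
isprime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)   # 2 and 3 are prime; also avoids 2:(x-1) counting down
  all(x %% 2:(x - 1) != 0)
}
isprime(2)   # TRUE
isprime(15)  # FALSE
isprime(17)  # TRUE
```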
2.26 The package containing the data set is no longer maintained, so this
problem becomes quite hard to do! Here we copy the data:
time <- c(169, 125, 210, 118, 117, 135, 128, 120, 122, 164, 174, 155, 120, 159,
121, 144, 129, 136, 124, 138, 195, 141, 156, 179, 109, 112, 167, 113,
133, 153, 141, 150, 126, 202, 165, 139, 164, 162, 171, 154, 147, 148,
137, 144, 139, 159, 128, 181, 181, 146, 161, 157, 130, 121, 122, 135,
150, 151, 177, 168, 180, 136, 230, 153, 275, 204, 245, 177, 187, 237,
119, 166, 205, 167, 153, 204, 156, 303, 158, 163, 155, 80, 303, 165,
240, 130, 190, 62, 185, 286, 167, 148, 121, 140, 124, 213, 232, 102,
106, 177, 160, 241, 166, 145, 195, 270, 188, 253, 162, 175, 191, 495,
194)
album <- rep(c("BBC_tapes", "Rubber_Soul", "Revolver", "Magical Mystery Tour",
"Seargent Peper", "The White album"),
c(31, 11, 14, 14,13,30))
beatles <- data.frame(time=time, album=album)
2.27 The weighted average is computed with:
nk <- ChestSizes$count
yk <- ChestSizes$chest
n <- sum(nk)
wk <- nk/n
sum(wk * yk)
## [1] 39.83182
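Base R’s weighted.mean collapses the last three steps into one call, giving the same value:

```r
require(HistData)
# weighted.mean(x, w) computes sum(w * x) / sum(w) directly
weighted.mean(ChestSizes$chest, w = ChestSizes$count)  # 39.83182, as above
```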
2.28 We have
x <- c(80,82,88,91,91,95,95,97,98,101,106,106,109,110,111)
median(x)
## [1] 97
2.29 The LearnEDA package is no longer available. The data for farms is
reproduced with:
state <- c("Al", "Als", "Ar", "Ark", "Ca", "Col", "Conn", "De", "Fl", "Ge",
"Ha", "Id", "Ill", "Ind", "Io", "Kan", "Ken", "Lou", "Ma", "Mary",
"Mass", "Mi", "Minn", "Miss", "Misso", "Mon", "Neb", "Nev", "NH",
"NJ", "NM", "NY", "NC", "ND", "Oh", "Ok", "Or", "PA", "RI", "SC",
"SD", "Te", "Tex", "Ut", "Ver", "Vir", "Wa", "WV", "Wi", "Wy")
count <- c(48, 1, 8, 49, 89, 29, 4, 3, 45, 50, 6, 25, 79, 65, 98, 65, 91, 30,
7, 12, 6, 53, 81, 43, 110, 28, 55, 3, 3, 10, 16, 39, 58, 31, 80,
84, 41, 59, 1, 25, 33, 91, 227, 16, 7, 50, 40, 21, 78, 9)
farms <- data.frame(state=state, count=count)
stem(farms$count)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 1133346677890266
## 2 | 155890139
## 4 | 013589003589
## 6 | 5589
## 8 | 0149118
## 10 | 0
## 12 |
## 14 |
## 16 |
## 18 |
## 20 |
## 22 | 7
It is hard to gauge the influence of the outlier, but otherwise the balance
point is likely in the stem labeled 4, that is, somewhere in the 40s. A
check shows the mean is 44.04.
2.3 * 10^(-4)
## [1] 0.00023
## [1] 23.97701
hist(pi2000-.1, prob=TRUE)
lines(density(pi2000))
## [1] 98.24923
require(MASS)
hist(DDT)
boxplot(DDT)
The histogram shows the data to be roughly symmetric, with one outly-
ing value, so the mean and median should be similar and the standard
deviation about three-quarters the IQR. The median and IQR can be
identified on the boxplot giving estimates of 3.2 for the mean and a
standard deviation a little less than 0.5. We can check with this com-
mand:
c(mean=mean(DDT), sd=sd(DDT))
## mean sd
## 3.3280000 0.4371531
2.35 The hist function needs the data to be in a data vector, not tabulated.
We pad it out using rep, then plot. The histogram is very symmetric.
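The dataset is not reproduced here; the rep idiom, shown with hypothetical tabulated counts, looks like:

```r
vals   <- 1:5               # the distinct values (hypothetical)
counts <- c(2, 4, 7, 4, 2)  # how many times each value occurred
x <- rep(vals, counts)      # expand the table back into a data vector
hist(x)
```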
2.36 The histogram has a rather wide range (the largest value is about 3
times the smallest). Some year had over 93 feet of snowfall!
2.37 First assign names. Then you can access the entries using the respective
state abbreviations.
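The commands themselves did not survive extraction; a sketch, assuming the built-in state.area and state.abb vectors were the data used:

```r
# Name the state areas by their two-letter abbreviations
area <- state.area        # land areas of the 50 US states (sq. miles)
names(area) <- state.abb  # "AL", "AK", ..., "WY"
area["NJ"]
```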
## NJ
## 7836
## [1] 8
## [1] 40
To see that Alaska is the big state, we could look at a histogram and
then query:
## AK
## 589757
(Figure: frequency histogram of state.area.)
2.38 For a heavily skewed-right data set, such as this, the mean is
significantly more than the median due to relatively few large values.
The median more accurately reflects the bulk of the data. If your
intention were to make the data seem outrageously large, then a mean
might be used.
2.39 The definition of the median is incorrect. Can you think of a shape
of distribution for which this is actually okay?
2.40 The median is lower for skewed-left distributions. It makes an area look
more affordable. For exclusive listings, the mean is often used to make
an area seem more expensive.
## [1] 39.5
## [1] 50.75
## [1] 0.5815603
## [1] 0.6666667
quantile(rivers, 0.75)
## 75%
## 680
2.43 First grab the data and check the units (minutes). The top 10% is actu-
ally the 0.10 quantile in this case, as shorter times are better.
## [1] 2.6
quantile(times, c(0.10, 0.25))
## 10% 25%
## 208.695 233.775
quantile(times,c(.90)) # 5:31
## 90%
## 331.75
2.44 We have:
mean(rivers)
## [1] 591.1844
median(rivers)
## [1] 425
mean(rivers, trim=.25)
## [1] 449.9155
2.45 We see
##
## The decimal point is 3 digit(s) to the right of the |
##
## 0 | 00000000000000000000000000000111111222338
## 2 | 07
## 4 | 5
## 6 | 8
## 8 | 4
## 10 | 5
## 12 |
## 14 |
## 16 | 0
c(mean=mean(islands),
median=median(islands),
trimmed=mean(islands,trim=0.25))
The data set is quite skewed due to the seven continents. We wouldn’t
expect the mean and median to agree, but note that after trimming the
mean and median are similar.
2.46 We can find the z-score for Barry Bonds using the name as follows:
## bondsba01
## 5.990192
mean(z)
## [1] -1.340544e-17
sd(z)
## [1] 1
2.48 No, as the data is skewed heavily to the right, the standard deviation is
quite different:
## mad IQR sd
## 20.7564 27.5000 207.0435
2.49 As this distribution has a long tail, we find that the mean is much more
than the median.
## [1] 74.90069
2.50 The value is relatively close to 1, which is the value for exponentially
distributed data:
sd(rivers) / mean(rivers)
## [1] 0.8353922
## [1] 0.8889017
2.52 The skew of wt is negative indicating a slight left skew. The skew of the
inter-arrival times is twice as much and to the right.
skew(babyboom$wt)
## [1] -1.078636
## [1] 1.829281
2.53 We have:
hist(hall.fame$HR)
The home run distribution is skewed right, the batting average is fairly
symmetric, and on-base percentage is also symmetric but may have
longer tails than the batting average.
2.54 After a log transform the data looks more symmetric. If you find the
median of the transformed data, you can take its exponential to get the
median of the untransformed data. Not so with the mean.
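A quick illustration with hypothetical skewed data: the median commutes with the (monotone) log transform, while the mean does not:

```r
x <- c(1, 2, 4, 8, 16, 32, 64)    # strongly right-skewed
exp(median(log(x))) == median(x)  # TRUE: the median transforms back
exp(mean(log(x)))                 # 8, the geometric mean ...
mean(x)                           # ... not the arithmetic mean, 18.14
```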
2.55 The data is tabulated, so we first create a data vector through rep, then
plot.
require(HistData)
chest <- rep(ChestSizes$chest, ChestSizes$count)
qqnorm(chest)
hist(cfb$AGE)
After making the graphics, we see that AGE is short-tailed and somewhat
symmetric; EDUC is not symmetric; NETWORTH is very skewed (some get
really rich, some get pretty poor, most are close to 0 on this scale);
and log(SAVING + 1) is symmetric except for a spike of people at 0 who
have no savings (the actual data is skewed; the logarithm changes this).
## [1] TRUE
2.60 This can be done as follows (using a different name from the built-in
mode function):
Outside of the last line, this is a simple translation of the example
given. The last line is not strictly necessary; it generalizes the call
to as.numeric in the example by coercing the output to the class of the
input variable.
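The code block itself did not survive extraction; a sketch consistent with the description above (the name most_common is our own, chosen to avoid the built-in mode):

```r
most_common <- function(x) {
  tbl <- table(x)                    # tabulate the values
  val <- names(tbl)[which.max(tbl)]  # most frequent value, as a string
  as(val, class(x))                  # coerce back to the input's class
}
most_common(c(1, 2, 2, 3))     # 2
most_common(c("a", "b", "b"))  # "b"
```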
## bumps
## (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03]
## 2 8 7 6
summary(Cars93$Cylinders)
## 3 4 5 6 8 rotary
## 3 49 2 31 7 1
## chars
## \n , ; . a A b c C d D e E f F
## 10 589 48 1 73 251 3 38 156 5 102 8 370 3 22 2
## g h i I j l L m M n N o p P q Q
## 45 17 343 8 4 220 3 142 7 211 13 174 80 6 30 2
## r s S t u U v V x
## 183 272 7 289 289 3 49 5 3
sort(table(chars))
## chars
## ; F Q A E L U x j C V P M S D I
## 1 2 2 3 3 3 3 3 4 5 5 6 7 7 8 8
## \n N h f q b g , v . p d m c o r
## 10 13 17 22 30 38 45 48 49 73 80 102 142 156 174 183
## n l a s t u i e
## 211 220 251 272 289 289 343 370 589
require(MASS)
dotchart(table(Cars93$Cylinders))
The graphic shows that 4-cylinder cars were the most popular in 1993.
Was this the case in 1974 (cf. mtcars$cyl)?
FIGURE 2.1 A dot plot of the data on children’s weights. Such a graphic shows the data in sorted
order, allowing a quick visual sense of both the center and the spread. Values are drawn on the
number line with repeated values stacked.
Courtesy of CRC Press/Taylor & Francis Group
FIGURE 2.2 The mean is the value that balances the dot plot.
FIGURE 2.3 Plot of absolute z-scores for the wts data set and a subset of the exec.pay data set. There
are no values larger than 2 in the wts data set, in agreement with the rule of thumb for bell-shaped
data. For the executive pay data, we see a z-score nearly as large as 5, virtually impossible for bell-
shaped data.
FIGURE 2.4 Dot plots of two data sets with different shapes. The left data set, a sample of the
executive pay data set, is skewed right; the right data set, on the heights of four-year-old children, is
mostly symmetric. For the symmetric data, the mean and median measure the center in a similar
manner (36.7 to 38). For the skewed data this is not so (42.5 to 24).
FIGURE 2.5 The left graphic shows stacked dot plots of z-scores of two data sets. The lower one has
long tails, the top one “normal” tails. The right graphic shows the galaxies data set. The overlapping
dots in the data show the presence of at least 3 clusters, corresponding to modes.
FIGURE 2.6 Two histograms of times between eruptions at the Old Faithful geyser in Yellowstone
National Park show two modes. The left graphic represents frequencies; the right graphic is scaled
to have total area equal to 1.
FIGURE 2.7 A histogram of a random sample of n = 10,000 data points and a corresponding density
plot of the data. The vertical lines of the histogram are de-emphasized. From either, we can see the
data is symmetric and unimodal with a mean of 0.
FIGURE 2.8 Histogram of bumpers data with a density plot layered on top.
FIGURE 2.9 Boxplots of various data sets. The left one shows the bumpers data set, a mostly
symmetric data set with no outliers. The right one, of the weight variable in the kid.weights data,
shows a right skew and some outliers.
FIGURE 2.10 Three quantile-normal plots produced by qqnorm. The leftmost graphic shows data on
finger lengths of several prisoners from the finger variable in the Macdonell (HistData) data set. It
shows data more or less on a straight line, indicating a normal distribution. The grouping is due to
the data being discretized. The second graphic uses data on the height of children in Galton’s classic
study of heights. This data has slight bends at the edges, like an “S,” due to the tails being slightly
shorter than the normal. The final graphic shows what a decidedly non-normal distribution looks
like: the executive pay data, which is skewed right and long tailed. Such data shows a clear curve.
FIGURE 2.11 A horizontal bar chart and dot chart of the smoking data.