number summary and box plots
Solution The 25th percentile is the value with a quarter of the values below or equal to it. This
is the value where 25% of the area of the histogram lies to the left. This appears to
be about 3000 million. We do not expect you to compute this exactly, but simply be
able to give an estimate. A volume of about 4200 million has roughly 10% of the data
values above it (and 90% below), so the 90th percentile is about 4200 million.
where
Q1 = First quartile = 25th percentile
Q3 = Third quartile = 75th percentile
The five number summary divides the dataset into fourths: about 25%
of the data fall between any two consecutive numbers in the five number
summary.
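A five number summary is usually found with technology. The following is a minimal sketch in Python using only the standard library. The function name `five_number_summary` and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is only one of several in common use, so other software may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])

# Hypothetical weekly exercise hours for nine students (not the StudentSurvey data).
hours = [0, 2, 4, 6, 8, 10, 12, 14, 16]
summary = five_number_summary(hours)
print(summary)  # (0, 4.0, 8.0, 12.0, 16)
```

About a quarter of these hypothetical values lie between each pair of consecutive numbers in the printed summary.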
Example 2.21
The five number summary for the number of hours spent exercising a week for the
StudentSurvey sample is (0, 5, 8, 12, 40). Explain what this tells us about the amount
of exercise for this group of students.
2.3 One Quantitative Variable: Measures of Spread 85
Solution All of the students exercise between 0 and 40 hours per week. The 25% of students
who exercise the least exercise between 0 and 5 hours a week, and the 25% of students
who exercise the most exercise between 12 and 40 hours a week. The middle
50% of students exercise between 5 and 12 hours a week, with half exercising less
than 8 hours per week and half exercising more than 8 hours per week.
Example 2.22
The five number summary for the mammal longevity data in Table 2.15 on page 64
is (1, 8, 12, 16, 40). Find the range and interquartile range for this dataset.
Solution From the five number summary (1, 8, 12, 16, 40), we see that the minimum longevity
is 1 and the maximum is 40, so the range is 40 − 1 = 39 years. The first quartile is 8
and the third quartile is 16, so the interquartile range is IQR = 16 − 8 = 8 years.
Note that the range and interquartile range calculated in Example 2.22 (39 and
8, respectively) are numerical values, not intervals. Also notice that if the elephant
(longevity = 40) were omitted from the sample, the range would be reduced to
25 − 1 = 24 years while the IQR would go down by just one year, 15 − 8 = 7. In
general, although the range is a very easy statistic to compute, the IQR is a more
resistant measure of spread.
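The comparison above can be checked with a short sketch. The helper name `spread_from_summary` is hypothetical. The second summary uses only the values the text gives for the sample without the elephant (maximum 25, Q3 of 15); its median is not stated, so the 12 there is a placeholder that the calculation does not use.

```python
def spread_from_summary(summary):
    """Return (range, IQR) from a five number summary (min, Q1, median, Q3, max)."""
    minimum, q1, _median, q3, maximum = summary
    return (maximum - minimum, q3 - q1)

# Five number summary from Example 2.22.
with_elephant = (1, 8, 12, 16, 40)
# Elephant omitted: maximum 25 and Q3 = 15 come from the text;
# the median of 12 is an unused placeholder (assumption).
without_elephant = (1, 8, 12, 15, 25)

print(spread_from_summary(with_elephant))     # (39, 8)
print(spread_from_summary(without_elephant))  # (24, 7)
```

Dropping one extreme value cuts the range by 15 years but the IQR by only 1 year, which is what it means for the IQR to be the more resistant measure.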
Example 2.23
Using the temperature data for Des Moines and San Francisco given in Table 2.20,
find the five number summary for the temperatures in each city. Find the range and
the IQR for each dataset and compare the results for the two cities.
Solution We use technology to find the five number summaries. From the output in Figure 2.15
or Figure 2.16 on page 78, we see that the five number summary for Des Moines
temperatures is (35.7, 44.4, 54.7, 59.9, 74.9). (Notice that the value for Q3 is slightly
different between the three outputs. You may get slightly different values for the
quartiles depending on which technology you use.) The five number summary for
San Francisco temperatures is (48.7, 51.4, 53.8, 56.0, 61.0). We have
Range for Des Moines = 74.9 − 35.7 = 39.2°F
Range for San Francisco = 61.0 − 48.7 = 12.3°F
IQR for Des Moines = 59.9 − 44.4 = 15.5°F
IQR for San Francisco = 56.0 − 51.4 = 4.6°F
86 CHAPTER 2 Describing Data
The range and IQR are much larger for the Des Moines data than the San Francisco
data. Temperatures are much more variable in central Iowa than they are on the
California coast!
Example 2.24
Example 2.13 on page 70 describes salaries in the US National Football League,
in which some star players are paid much more than most other players.
(a) We see in that example that players prefer to use the median ($872,000 in 2014)
as a measure of center since they don’t want the results heavily influenced by a
few huge outlier salaries. What should they use as a measure of spread?
(b) We also see that the owners of the teams prefer to use the mean ($2.14 million in
2014) as a measure of center since they want to use a measure that includes all
the salaries. What should they use as a measure of spread?
Solution (a) The interquartile range (IQR) should be used with the median as a measure of
spread. Both come from the five number summary, and both the median and the
IQR are resistant to outliers.
(b) The standard deviation should be used with the mean as a measure of spread.
Both the mean and the standard deviation use all the data values in their
computation.
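The pairing of measures in this solution can be illustrated with a small sketch. The salaries below are hypothetical values in thousands of dollars, not NFL data, and `center_and_spread` is a name introduced only for this illustration. The quartiles use Python's `method="inclusive"` convention, which is an assumption and may differ from other software.

```python
import statistics

def center_and_spread(values):
    """Return mean, standard deviation, median, and IQR for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical salaries in thousands of dollars (not actual NFL data).
typical = [500, 600, 700, 800, 900, 1000, 1100]
with_star = typical + [20000]  # add one very large salary

before = center_and_spread(typical)
after = center_and_spread(with_star)
for name in ("mean", "sd", "median", "iqr"):
    print(name, before[name], after[name])
```

In this sketch the single large salary moves the mean and the standard deviation by thousands, while the median and the IQR each change by only 50.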
SECTION LEARNING GOALS
You should now have the understanding and skills to:
• Use technology to compute summary statistics for a quantitative
variable
• Recognize the uses and meaning of the standard deviation
• Compute and interpret a z-score
• Interpret a five number summary or percentiles
• Use the range, the interquartile range, and the standard deviation as
measures of spread
• Describe the advantages and disadvantages of the different measures
of center and spread
2.4 Boxplots and Quantitative/Categorical Relationships 93
2007 and 2013. Use technology to find the mean, the standard deviation, and the
five number summary for the data in this variable.
2.128 Audience Movie Ratings The variable AudienceScore in the HollywoodMovies
dataset gives the audience rating on the Rotten Tomatoes website of movies that
came out of Hollywood between 2007 and 2013. Use technology to find the mean, the
standard deviation, and the five number summary for the data in this variable.
2.129 Using the Five Number Summary to Visualize Shape of a Distribution Draw a
histogram or a smooth curve illustrating the shape of a distribution with the
properties that:
(a) The range is 100 and the interquartile range is 10.
(b) The range is 50 and the interquartile range is 40.
2.130 Rough Rule of Thumb for the Standard Deviation According to the 95% rule,
the largest value in a sample from a distribution that is approximately symmetric
and bell-shaped should be between 2 and 3 standard deviations above the mean, while
the smallest value should be between 2 and 3 standard deviations below the mean.
Thus the range should be roughly 4 to 6 times the standard deviation. As a rough
rule of thumb, we can get a quick estimate of the standard deviation for a
bell-shaped distribution by dividing the range by 5. Check how well this quick
estimate works in the following situations.
(a) Pulse rates from the StudentSurvey dataset discussed in Example 2.17 on
page 80. The five number summary of pulse rates is (35, 62, 70, 78, 130) and the
standard deviation is s = 12.2 bpm. Find the rough estimate using all the data, and
then excluding the two outliers at 120 and 130, which leaves the maximum at 96.
(b) Number of hours a week spent exercising from the StudentSurvey dataset
discussed in Example 2.21 on page 84. The five number summary of this dataset is
(0, 5, 8, 12, 40) and the standard deviation is s = 5.741 hours.
(c) Longevity of mammals from the MammalLongevity dataset discussed in
Example 2.22 on page 85. The five number summary of the longevity values is
(1, 8, 12, 16, 40) and the standard deviation is s = 7.24 years.
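The range-divided-by-5 rule in Exercise 2.130 is simple enough to sketch in code. The function name `quick_sd_estimate` and the case labels are hypothetical; the minimums, maximums, and standard deviations are the ones quoted in parts (a) to (c). The exercise gives no standard deviation for the pulse data with the outliers removed, so that row repeats the all-data value of 12.2 only for reference.

```python
def quick_sd_estimate(minimum, maximum):
    """Rough standard deviation estimate for bell-shaped data: range / 5."""
    return (maximum - minimum) / 5

# (label, minimum, maximum, standard deviation reported for the full data)
cases = [
    ("Pulse, all data", 35, 130, 12.2),
    ("Pulse, outliers excluded", 35, 96, 12.2),  # s is the all-data value
    ("Exercise hours", 0, 40, 5.741),
    ("Mammal longevity", 1, 40, 7.24),
]
for label, low, high, s in cases:
    estimate = quick_sd_estimate(low, high)
    print(f"{label}: estimate {estimate:.1f}, reported s = {s}")
```

Comparing each estimate with the reported value shows how strongly outliers and skewness affect this shortcut.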
Boxplots
A boxplot is a graphical display of the five number summary for a single quantitative
variable. It shows the general shape of the distribution, identifies the middle 50% of
the data, and highlights any outliers.
Boxplots
A boxplot includes:
• A numerical scale appropriate for the data values.
• A box stretching from Q1 to Q3.
• A line that divides the box drawn at the median.
• A line from each quartile to the most extreme data value that is not
an outlier.
• Each outlier plotted individually with a symbol such as an asterisk
or a dot.
Example 2.25
Draw a boxplot for the data in MammalLongevity, with five number summary
(1, 8, 12, 16, 40).
Solution The boxplot for mammal longevities is shown in Figure 2.31. The box covers the
interval from the first quartile of 8 years to the third quartile of 16 years and is divided
at the median of 12 years. The line to the left of the lower quartile goes all the way to
the minimum longevity at 1 year, since there were no small outliers. The line to the
right stops at the largest data value (25, grizzly bear and hippopotamus) that is not
an outlier. The only clear outlier is the elephant longevity of 40 years, so this value
is plotted with an individual symbol at the maximum of 40 years.
[Figure 2.31: Boxplot of mammal longevities, with Q1, the median m, and Q3 marked]
Example 2.26
One of the quantitative variables in the USStates dataset is Vegetables, which gives
the percentage of the population that eats at least one serving of vegetables per day
for each of the states. Figure 2.32 shows a boxplot of the percent for all 50 states.
(a) Discuss what the boxplot tells us about the distribution of this variable.
(b) Estimate the five number summary.
Solution (a) The distribution of percentages is relatively symmetric, with one low outlier, and
centered around 77. The percent of people who eat vegetables appears to range
from about 67% to about 84%. The middle 50% of percentages is between about
75% and 80%, with a median value at about 77%. The low outlier is about 67%
and represents the state of Louisiana.
(b) The five number summary appears to be approximately (67, 75, 77, 80, 84).
Example 2.27
One of the quantitative variables in the HollywoodMovies dataset is Budget, which
gives the budget, in millions of dollars, to make each movie. Figure 2.33 shows the
boxplot for the budget of all Hollywood movies that came out in 2013.
(a) Discuss what the boxplot tells us about the distribution of this variable.
(b) What movies do the two largest outliers correspond to?
(c) What was the budget to make Frozen? Is it an outlier?
Solution (a) Because the minimum, first quartile, and median are so close together, we see
that half the data values are packed in a small interval, then the other half extend
out quite far to the right. These data are skewed to the right, with many large
outliers.
(b) The two largest outliers represent movies with budgets of about 225 million
dollars. We see from the dataset HollywoodMovies that the two movies are Man
of Steel and The Desolation of Smaug.
(c) We see from the dataset that the budget for Frozen was 150 million dollars. We
see in the boxplot that this is not an outlier.
Detection of Outliers
Consider again the data on mammal longevity in Data 2.2 on page 64. Our intuition
suggests that the longevity of 40 years for the elephant is an unusually high
value compared to the other lifespans in this sample. How do we determine objectively
when such a value is an outlier? The criteria should depend on some measure
of location for “typical” values and a measure of spread to help us judge when a data
point is “far” from those typical cases. One approach, typically used for identifying
outliers for boxplots, uses the quartiles and interquartile range. As a rule, most data
values will fall within about 1.5(IQR) of the quartiles.[49]
Example 2.28
According to the IQR method, is the elephant an outlier for the mammal longevities
in the dataset MammalLongevity? Are any other mammals outliers in that
dataset?
[49] In practice, determining outliers requires judgment and understanding of the context. We present a
specific method here, but no single method is universally used for determining outliers.
Solution The five number summary for mammal longevities is (1, 8, 12, 16, 40). We have Q1 = 8
and Q3 = 16 so the interquartile range is IQR = 16 − 8 = 8. We compute
Q1 − 1.5(IQR) = 8 − 1.5(8) = 8 − 12 = −4
and
Q3 + 1.5(IQR) = 16 + 1.5(8) = 16 + 12 = 28
Clearly, there are no mammals with negative lifetimes, so there can be no outliers
below the lower value of −4. On the upper side, the elephant, as expected, is clearly
an outlier beyond the value of 28 years. No other mammal in this sample exceeds
that value, so the elephant is the only outlier in the longevity data.
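The calculation in this solution can be sketched as follows. The function names `iqr_fences` and `split_outliers` are hypothetical, and the list of longevities is only an illustrative subset: the values 1, 25, and 40 are named in the text, while the others are placeholders assumed for the example.

```python
def iqr_fences(q1, q3):
    """Return the lower and upper cutoffs Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

def split_outliers(values, q1, q3):
    """Return the outliers and the whisker endpoints for a boxplot."""
    low, high = iqr_fences(q1, q3)
    outliers = [v for v in values if v < low or v > high]
    inside = [v for v in values if low <= v <= high]
    # Whiskers extend to the most extreme values that are not outliers.
    return outliers, (min(inside), max(inside))

# Illustrative subset: only 1, 25, and 40 are named in the text;
# the remaining values are hypothetical placeholders.
longevities = [1, 7, 8, 12, 15, 16, 20, 25, 40]

print(iqr_fences(8, 16))                   # (-4.0, 28.0)
print(split_outliers(longevities, 8, 16))  # ([40], (1, 25))
```

The whisker endpoints of 1 and 25 match the lines drawn in the boxplot of Example 2.25, and 40 is the only value flagged.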
Example 2.29
Who watches more TV, males or females? The data in StudentSurvey contains the
categorical variable Gender as well as the quantitative variable TV for the number of
hours spent watching television per week. For these students, is there an association
between gender and number of hours spent watching television? Use the side-by-side
graphs in Figure 2.34 showing the distribution of hours spent watching television for
males and females to discuss how the distributions compare.