number summary and box plots

Statistics-number summary and box plots

Uploaded by

tommichael283
Copyright
© All Rights Reserved

84 CHAPTER 2 Describing Data

Solution The 25th percentile is the value with a quarter of the values below or equal to it. This
is the value where 25% of the area of the histogram lies to the left. This appears to
be about 3000 million. We do not expect you to compute this exactly, but simply be
able to give an estimate. A volume of about 4200 million has roughly 10% of the data
values above it (and 90% below), so the 90th percentile is about 4200 million.
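
When the raw data values are available rather than a histogram, a percentile can be computed directly. The Python sketch below is only an illustration of the definition above: it uses the nearest-rank convention, which is one of several conventions in use, and the function name percentile_nearest_rank and the sample values are hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the smallest data value with at least p% of the data at or below it."""
    data = sorted(values)
    # Nearest-rank convention (an assumption; other software may interpolate).
    rank = max(1, math.ceil(p * len(data) / 100))
    return data[rank - 1]

# Hypothetical sample: the integers 1 through 20.
sample = list(range(1, 21))
p25 = percentile_nearest_rank(sample, 25)
p90 = percentile_nearest_rank(sample, 90)
print(p25, p90)
```

Software that interpolates between neighboring data values may report slightly different percentiles for the same data.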

Five Number Summary


The minimum and maximum in a dataset identify the extremes of the distribution:
the smallest and largest values, respectively. The median is the 50th percentile, since
it divides the data into two equal halves. If we divide each of those halves again,
we obtain two additional statistics known as the first (Q1 ) and third (Q3 ) quartiles,
which are the 25th and 75th percentiles. Together these five numbers provide a good
summary of important characteristics of the distribution and are known as the five
number summary.

Five Number Summary


We define

Five Number Summary = (minimum, Q1 , median, Q3 , maximum)

where
Q1 = First quartile = 25th percentile
Q3 = Third quartile = 75th percentile
The five number summary divides the dataset into fourths: about 25%
of the data fall between any two consecutive numbers in the five num-
ber summary.
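
The following Python sketch shows one way the five number summary might be computed, by taking the median and then the median of each half of the sorted data. This is only one convention for quartiles (statistical software often uses others and may give slightly different values), and the function name five_number_summary and the sample values are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # With an odd number of values, the middle value is left out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical sample of nine values.
sample = [2, 4, 5, 7, 8, 10, 12, 15, 20]
summary = five_number_summary(sample)
print(summary)
```

The function assumes at least two data values, since each half must contain at least one value.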

© kristian sekulic/iStockphoto

How many hours a week do students exercise?

Example 2.21
The five number summary for the number of hours spent exercising a week for the
StudentSurvey sample is (0, 5, 8, 12, 40). Explain what this tells us about the amount
of exercise for this group of students.
2.3 One Quantitative Variable: Measures of Spread 85

Solution All of the students exercise between 0 and 40 hours per week. The 25% of students
who exercise the least exercise between 0 and 5 hours a week, and the 25% of stu-
dents who exercise the most exercise between 12 and 40 hours a week. The middle
50% of students exercise between 5 and 12 hours a week, with half exercising less
than 8 hours per week and half exercising more than 8 hours per week.

Range and Interquartile Range


The five number summary provides two additional opportunities for summariz-
ing the amount of spread in the data, the range and the interquartile range.

Range and Interquartile Range


From the five number summary, we can compute the following two
statistics:

Range = Maximum − Minimum


Interquartile range (IQR) = Q3 − Q1

Example 2.22
The five number summary for the mammal longevity data in Table 2.15 on page 64
is (1, 8, 12, 16, 40). Find the range and interquartile range for this dataset.

Solution From the five number summary (1, 8, 12, 16, 40), we see that the minimum longevity
is 1 and the maximum is 40, so the range is 40 − 1 = 39 years. The first quartile is 8
and the third quartile is 16, so the interquartile range is IQR = 16 − 8 = 8 years.
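
The arithmetic in this solution can be checked with a short Python sketch. The function name range_and_iqr is hypothetical; the five number summary is the one given in Example 2.22.

```python
def range_and_iqr(summary):
    """Compute the range and interquartile range from a five number summary."""
    minimum, q1, _median, q3, maximum = summary
    return maximum - minimum, q3 - q1

# Five number summary for mammal longevity from Example 2.22.
mammal_summary = (1, 8, 12, 16, 40)
data_range, iqr = range_and_iqr(mammal_summary)
print(data_range, iqr)  # 39 8
```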

Note that the range and interquartile range calculated in Example 2.22 (39 and
8, respectively) are numerical values, not intervals. Also notice that if the elephant
(longevity = 40) were omitted from the sample, the range would be reduced to
25 − 1 = 24 years while the IQR would go down by just one year, 15 − 8 = 7. In
general, although the range is a very easy statistic to compute, the IQR is a more
resistant measure of spread.

Example 2.23
Using the temperature data for Des Moines and San Francisco given in Table 2.20,
find the five number summary for the temperatures in each city. Find the range and
the IQR for each dataset and compare the results for the two cities.

Solution We use technology to find the five number summaries. From the output in Figure 2.15
or Figure 2.16 on page 78, we see that the five number summary for Des Moines
temperatures is (35.7, 44.4, 54.7, 59.9, 74.9). (Notice that the value for Q3 is slightly
different between the three outputs. You may get slightly different values for the
quartiles depending on which technology you use.) The five number summary for
San Francisco temperatures is (48.7, 51.4, 53.8, 56.0, 61.0). We have
Range for Des Moines = 74.9 − 35.7 = 39.2°F
Range for San Francisco = 61.0 − 48.7 = 12.3°F
IQR for Des Moines = 59.9 − 44.4 = 15.5°F
IQR for San Francisco = 56.0 − 51.4 = 4.6°F

The range and IQR are much larger for the Des Moines data than the San Francisco
data. Temperatures are much more variable in central Iowa than they are on the
California coast!

Choosing Measures of Center and Spread


Because the standard deviation measures how much the data values deviate from
the mean, it makes sense to use the standard deviation as a measure of variability
when the mean is used as a measure of center. Both the mean and standard devi-
ation have the advantage that they use all of the data values. However, they are
not resistant to outliers. The median and IQR are resistant to outliers. Furthermore,
if there are outliers or the data are heavily skewed, the five number summary can
give more information (such as direction of skewness) than the mean and standard
deviation.
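
To make the idea of resistance concrete, the Python sketch below compares the four statistics before and after adding a single extreme value. The data values and the function name center_and_spread are hypothetical, and the quartiles come from the default method of Python's statistics.quantiles, which may differ slightly from other technology.

```python
from statistics import mean, median, quantiles, stdev

def center_and_spread(values):
    """Return the mean, standard deviation, median, and IQR of the values."""
    q1, _q2, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data, then the same data with one large outlier added.
typical = [4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10]
with_outlier = typical + [100]

before = center_and_spread(typical)
after = center_and_spread(with_outlier)
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this made-up sample the mean and standard deviation change substantially when the outlier is added, while the median and IQR barely move.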

Example 2.24
Example 2.13 on page 70 describes salaries in the US National Football League,
in which some star players are paid much more than most other players.
(a) We see in that example that players prefer to use the median ($872,000 in 2014)
as a measure of center since they don’t want the results heavily influenced by a
few huge outlier salaries. What should they use as a measure of spread?
(b) We also see that the owners of the teams prefer to use the mean ($2.14 million in
2014) as a measure of center since they want to use a measure that includes all
the salaries. What should they use as a measure of spread?

Solution (a) The interquartile range (IQR) should be used with the median as a measure of
spread. Both come from the five number summary, and both the median and the
IQR are resistant to outliers.
(b) The standard deviation should be used with the mean as a measure of spread.
Both the mean and the standard deviation use all the data values in their
computation.

SECTION LEARNING GOALS
You should now have the understanding and skills to:
• Use technology to compute summary statistics for a quantitative
variable
• Recognize the uses and meaning of the standard deviation
• Compute and interpret a z-score
• Interpret a five number summary or percentiles
• Use the range, the interquartile range, and the standard deviation as
measures of spread
• Describe the advantages and disadvantages of the different measures
of center and spread
2.4 Boxplots and Quantitative/Categorical Relationships 93

2007 and 2013. Use technology to find the mean, the standard deviation, and the
five number summary for the data in this variable.
2.128 Audience Movie Ratings The variable AudienceScore in the HollywoodMovies
dataset gives the audience rating on the Rotten Tomatoes website of movies that
came out of Hollywood between 2007 and 2013. Use technology to find the mean, the
standard deviation, and the five number summary for the data in this variable.
2.129 Using the Five Number Summary to Visualize Shape of a Distribution Draw a
histogram or a smooth curve illustrating the shape of a distribution with the
properties that:
(a) The range is 100 and the interquartile range is 10.
(b) The range is 50 and the interquartile range is 40.
2.130 Rough Rule of Thumb for the Standard Deviation According to the 95% rule,
the largest value in a sample from a distribution that is approximately symmetric
and bell-shaped should be between 2 and 3 standard deviations above the mean,
while the smallest value should be between 2 and 3 standard deviations below the
mean. Thus the range should be roughly 4 to 6 times the standard deviation. As a
rough rule of thumb, we can get a quick estimate of the standard deviation for a
bell-shaped distribution by dividing the range by 5. Check how well this quick
estimate works in the following situations.
(a) Pulse rates from the StudentSurvey dataset discussed in Example 2.17 on
page 80. The five number summary of pulse rates is (35, 62, 70, 78, 130) and the
standard deviation is s = 12.2 bpm. Find the rough estimate using all the data,
and then excluding the two outliers at 120 and 130, which leaves the maximum
at 96.
(b) Number of hours a week spent exercising from the StudentSurvey dataset
discussed in Example 2.21 on page 84. The five number summary of this dataset is
(0, 5, 8, 12, 40) and the standard deviation is s = 5.741 hours.
(c) Longevity of mammals from the MammalLongevity dataset discussed in
Example 2.22 on page 85. The five number summary of the longevity values is
(1, 8, 12, 16, 40) and the standard deviation is s = 7.24 years.

2.4 BOXPLOTS AND QUANTITATIVE/CATEGORICAL RELATIONSHIPS
In this section, we examine a relationship between a quantitative variable and a cat-
egorical variable by examining both comparative summary statistics and graphical
displays. Before we get there, however, we look at one more graphical display for a
single quantitative variable that is particularly useful for comparing groups.

Boxplots
A boxplot is a graphical display of the five number summary for a single quantitative
variable. It shows the general shape of the distribution, identifies the middle 50% of
the data, and highlights any outliers.

Boxplots
A boxplot includes:
• A numerical scale appropriate for the data values.
• A box stretching from Q1 to Q3 .
• A line that divides the box drawn at the median.
• A line from each quartile to the most extreme data value that is not
an outlier.
• Each outlier plotted individually with a symbol such as an asterisk
or a dot.
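
The pieces of a boxplot listed above can be computed before anything is drawn. The Python sketch below is one possible way to do so, not a definitive implementation: the function name boxplot_parts and the sample values are hypothetical, the quartiles use the "inclusive" method of statistics.quantiles (other technology may differ slightly), and outliers are flagged with the 1.5(IQR) criterion described later in this section.

```python
from statistics import quantiles

def boxplot_parts(values):
    """Return the box, median line, whisker ends, and outliers for a boxplot."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data values that are not outliers.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "box": (q1, q3),
        "median": med,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical sample with one unusually large value.
sample = [1, 3, 4, 5, 5, 6, 7, 8, 9, 20]
parts = boxplot_parts(sample)
print(parts)
```

For this made-up sample the box runs from 4.25 to 7.75, the whiskers stop at 1 and 9, and the value 20 is plotted individually as an outlier.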

Example 2.25
Draw a boxplot for the data in MammalLongevity, with five number summary
(1, 8, 12, 16, 40).

Solution The boxplot for mammal longevities is shown in Figure 2.31. The box covers the
interval from the first quartile of 8 years to the third quartile of 16 years and is divided
at the median of 12 years. The line to the left of the lower quartile goes all the way to
the minimum longevity at 1 year, since there were no small outliers. The line to the
right stops at the largest data value (25, grizzly bear and hippopotamus) that is not
an outlier. The only clear outlier is the elephant longevity of 40 years, so this value
is plotted with an individual symbol at the maximum of 40 years.
Figure 2.31 Boxplot of longevity of mammals (horizontal axis: Longevity, 0 to 40; labeled features: minimum non-outlier, Q1, median m, Q3, maximum non-outlier, outlier)

DATA 2.6 US States


The dataset USStates includes many variables measured for each of the 50
states in the US. Some of the variables included for each state are average
household income, percent to graduate high school, health statistics such as
consumption of fruits and vegetables, percent obese, percent of smokers, and
some results from the 2012 US presidential election.47 ■

Example 2.26
One of the quantitative variables in the USStates dataset is Vegetables, which gives
the percentage of the population that eats at least one serving of vegetables per day
for each of the states. Figure 2.32 shows a boxplot of the percent for all 50 states.
(a) Discuss what the boxplot tells us about the distribution of this variable.
(b) Estimate the five number summary.

Solution (a) The distribution of percentages is relatively symmetric, with one low outlier, and
centered around 77. The percent of people who eat vegetables appears to range
from about 67% to about 84%. The middle 50% of percentages is between about
75% and 80%, with a median value at about 77%. The low outlier is about 67%
and represents the state of Louisiana.
(b) The five number summary appears to be approximately (67, 75, 77, 80, 84).

47 Data from a variety of sources, mostly http://www.census.gov.



Figure 2.32 Percent of people who eat at least one serving of vegetables per day, by state (horizontal axis: Vegetables, 65 to 85)

© Ben Molyneux/Alamy Stock Photo

How much did it cost to make this movie?

DATA 2.7 Hollywood Movies


Over 900 movies came out of Hollywood between 2007 and 2013 and the
dataset HollywoodMovies contains lots of information on these movies, such as
studio, genre, budget, audience ratings, box office average opening weekend,
world gross, and others.48 ■

Example 2.27
One of the quantitative variables in the HollywoodMovies dataset is Budget, which
gives the budget, in millions of dollars, to make each movie. Figure 2.33 shows the
boxplot for the budget of all Hollywood movies that came out in 2013.
(a) Discuss what the boxplot tells us about the distribution of this variable.
(b) What movies do the two largest outliers correspond to?
(c) What was the budget to make Frozen? Is it an outlier?

48 McCandless, D., “Most Profitable Hollywood Movies,” from “Information is Beautiful,”
http://www.davidmccandless.com, Accessed July 2015.

Figure 2.33 Budget, in millions of dollars, of Hollywood movies (horizontal axis: Budget, 0 to 250; high outliers plotted individually as asterisks)

Solution (a) Because the minimum, first quartile, and median are so close together, we see
that half the data values are packed in a small interval, then the other half extend
out quite far to the right. These data are skewed to the right, with many large
outliers.
(b) The two largest outliers represent movies with budgets of about 225 million
dollars. We see from the dataset HollywoodMovies that the two movies are Man
of Steel and The Desolation of Smaug.
(c) We see from the dataset that the budget for Frozen was 150 million dollars. We
see in the boxplot that this is not an outlier.

Detection of Outliers
Consider again the data on mammal longevity in Data 2.2 on page 64. Our intu-
ition suggests that the longevity of 40 years for the elephant is an unusually high
value compared to the other lifespans in this sample. How do we determine objec-
tively when such a value is an outlier? The criteria should depend on some measure
of location for “typical” values and a measure of spread to help us judge when a data
point is “far” from those typical cases. One approach, typically used for identifying
outliers for boxplots, uses the quartiles and interquartile range. As a rule, most data
values will fall within about 1.5(IQR)’s of the quartiles.49

IQR Method for Detecting Outliers


For boxplots, we call a data value an outlier if it is

Smaller than Q1 − 1.5(IQR) or Larger than Q3 + 1.5(IQR)

Example 2.28
According to the IQR method, is the elephant an outlier for the mammal longevi-
ties in the dataset MammalLongevity? Are any other mammals outliers in that
dataset?

49 In practice, determining outliers requires judgment and understanding of the context. We present a
specific method here, but no single method is universally used for determining outliers.

Solution The five number summary for mammal longevities is (1, 8, 12, 16, 40). We have Q1 = 8
and Q3 = 16 so the interquartile range is IQR = 16 − 8 = 8. We compute

Q1 − 1.5(IQR) = 8 − 1.5(8) = 8 − 12 = −4

and
Q3 + 1.5(IQR) = 16 + 1.5(8) = 16 + 12 = 28
Clearly, there are no mammals with negative lifetimes, so there can be no outliers
below the lower value of −4. On the upper side, the elephant, as expected, is clearly
an outlier beyond the value of 28 years. No other mammal in this sample exceeds
that value, so the elephant is the only outlier in the longevity data.
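
The calculation in this solution can be sketched in Python as follows. The function names iqr_fences and is_outlier are hypothetical; the quartiles and the longevities of 40 and 25 years are the values quoted in the text.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the cutoffs Q1 - k(IQR) and Q3 + k(IQR) used by the IQR method."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3):
    """A value is an outlier if it lies beyond either cutoff."""
    lower, upper = iqr_fences(q1, q3)
    return value < lower or value > upper

# Quartiles for mammal longevity: Q1 = 8 and Q3 = 16.
lower, upper = iqr_fences(8, 16)
print(lower, upper)           # -4.0 28.0
print(is_outlier(40, 8, 16))  # elephant: True
print(is_outlier(25, 8, 16))  # largest non-outlier: False
```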

One Quantitative and One Categorical Variable


Do men or women tend to watch more television? Is survival time for cancer patients
related to genetic variations? How do April temperatures in Des Moines compare
to those in San Francisco? These questions each involve a relationship between a
quantitative variable (amount of TV, survival time, temperature) and a categorical
variable (gender, type of gene, city). One of the best ways to visualize such relation-
ships is to use side-by-side graphs. Showing graphs with the same axis facilitates the
comparison of the distributions.

Visualizing a Relationship between Quantitative and Categorical Variables
Side-by-side graphs are used to visualize the relationship between a
quantitative variable and a categorical variable. The side-by-side graph
includes a graph for the quantitative variable (such as a boxplot, his-
togram, or dotplot) for each group in the categorical variable, all using
a common numeric axis.
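
Before drawing side-by-side graphs, the quantitative values have to be split according to the categorical variable. The Python sketch below shows one way this grouping might be done, reporting the minimum, median, and maximum for each group so they can be compared on a common scale. The function name summarize_by_group, the group labels, and the data values are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def summarize_by_group(records):
    """Group (category, value) pairs and return (min, median, max) for each category."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return {c: (min(v), median(v), max(v)) for c, v in groups.items()}

# Hypothetical records: one categorical label and one quantitative value per case.
records = [
    ("group_a", 2), ("group_b", 1), ("group_a", 4), ("group_b", 5),
    ("group_a", 6), ("group_b", 5), ("group_a", 8), ("group_b", 9),
    ("group_a", 10),
]
summaries = summarize_by_group(records)
for category, stats in summaries.items():
    print(category, stats)
```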

Erik Von Weber/Getty Images, Inc.

Who watches more TV, males or females?

Example 2.29
Who watches more TV, males or females? The data in StudentSurvey contains the
categorical variable Gender as well as the quantitative variable TV for the number of
hours spent watching television per week. For these students, is there an association
between gender and number of hours spent watching television? Use the side-by-side
graphs in Figure 2.34 showing the distribution of hours spent watching television for
males and females to discuss how the distributions compare.
