number summary and box plots

Statistics-number summary and box plots

Uploaded by

tommichael283

84 CHAPTER 2 Describing Data

Solution The 25th percentile is the value with a quarter of the values below or equal to it. This
is the value where 25% of the area of the histogram lies to the left. This appears to
be about 3000 million. We do not expect you to compute this exactly, but simply be
able to give an estimate. A volume of about 4200 million has roughly 10% of the data
values above it (and 90% below), so the 90th percentile is about 4200 million.
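
When the raw data values are available rather than only a histogram, the percentile idea above can be checked directly by counting. The following is a minimal sketch; the function name percent_at_or_below and the list volumes are made up for illustration and are not part of the dataset in the example.

```python
def percent_at_or_below(values, x):
    """Return the percentage of data values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


# Hypothetical data values, chosen only to make the counting easy to follow.
volumes = [1, 2, 3, 4, 5, 6, 7, 8]

# Two of the eight values are at or below 2, so 2 sits at about the 25th percentile.
print(percent_at_or_below(volumes, 2))  # 25.0
```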

Five Number Summary

The minimum and maximum in a dataset identify the extremes of the distribution:
the smallest and largest values, respectively. The median is the 50th percentile, since
it divides the data into two equal halves. If we divide each of those halves again,
we obtain two additional statistics known as the first (Q1 ) and third (Q3 ) quartiles,
which are the 25th and 75th percentiles. Together these five numbers provide a good
summary of important characteristics of the distribution and are known as the five
number summary.

Five Number Summary

We define

Five Number Summary = (minimum, Q1 , median, Q3 , maximum)

where
Q1 = First quartile = 25th percentile
Q3 = Third quartile = 75th percentile
The five number summary divides the dataset into fourths: about 25%
of the data fall between any two consecutive numbers in the five
number summary.
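
As a rough sketch of the definition above, the five number summary can be computed by sorting the data, taking the median, and then taking the median of each half. The function name five_number_summary and the sample list hours are illustrative assumptions, and the half-splitting rule used here is only one of several quartile conventions, so statistical software may report slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when the number of values is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical weekly exercise hours, not taken from StudentSurvey.
hours = [2, 4, 5, 7, 8, 9, 12, 15, 20]
print(five_number_summary(hours))  # (2, 4.5, 8, 13.5, 20)
```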

How many hours a week do students exercise?

Example 2.21
The five number summary for the number of hours spent exercising a week for the
StudentSurvey sample is (0, 5, 8, 12, 40). Explain what this tells us about the amount
of exercise for this group of students.
2.3 One Quantitative Variable: Measures of Spread 85

Solution All of the students exercise between 0 and 40 hours per week. The 25% of students
who exercise the least exercise between 0 and 5 hours a week, and the 25% of
students who exercise the most exercise between 12 and 40 hours a week. The middle
50% of students exercise between 5 and 12 hours a week, with half exercising less
than 8 hours per week and half exercising more than 8 hours per week.

Range and Interquartile Range

The five number summary provides two additional opportunities for summarizing
the amount of spread in the data, the range and the interquartile range.

Range and Interquartile Range

From the five number summary, we can compute the following two
statistics:

Range = Maximum − Minimum

Interquartile range (IQR) = Q3 − Q1

Example 2.22
The five number summary for the mammal longevity data in Table 2.15 on page 64
is (1, 8, 12, 16, 40). Find the range and interquartile range for this dataset.

Solution From the five number summary (1, 8, 12, 16, 40), we see that the minimum longevity
is 1 and the maximum is 40, so the range is 40 − 1 = 39 years. The first quartile is 8
and the third quartile is 16, so the interquartile range is IQR = 16 − 8 = 8 years.
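
The arithmetic in this solution can be written as a short function. This is only a sketch; the name range_and_iqr is an assumption made for illustration, and the input is the five number summary quoted in the example.

```python
def range_and_iqr(summary):
    """Return (range, IQR) from a (minimum, Q1, median, Q3, maximum) tuple."""
    minimum, q1, _median, q3, maximum = summary
    return maximum - minimum, q3 - q1


# Five number summary for mammal longevity, as given in Example 2.22.
print(range_and_iqr((1, 8, 12, 16, 40)))  # (39, 8)
```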

Note that the range and interquartile range calculated in Example 2.22 (39 and
8, respectively) are numerical values, not intervals. Also notice that if the elephant
(longevity = 40) were omitted from the sample, the range would be reduced to
25 − 1 = 24 years while the IQR would go down by just one year, 15 − 8 = 7. In
general, although the range is a very easy statistic to compute, the IQR is a more
resistant measure of spread.

Example 2.23
Using the temperature data for Des Moines and San Francisco given in Table 2.20,
find the five number summary for the temperatures in each city. Find the range and
the IQR for each dataset and compare the results for the two cities.

Solution We use technology to find the five number summaries. From the output in Figure 2.15
or Figure 2.16 on page 78, we see that the five number summary for Des Moines
temperatures is (35.7, 44.4, 54.7, 59.9, 74.9). (Notice that the value for Q3 is slightly
different between the three outputs. You may get slightly different values for the
quartiles depending on which technology you use.) The five number summary for
San Francisco temperatures is (48.7, 51.4, 53.8, 56.0, 61.0). We have
Range for Des Moines = 74.9 − 35.7 = 39.2°F
Range for San Francisco = 61.0 − 48.7 = 12.3°F
IQR for Des Moines = 59.9 − 44.4 = 15.5°F
IQR for San Francisco = 56.0 − 51.4 = 4.6°F

The range and IQR are much larger for the Des Moines data than the San Francisco
data. Temperatures are much more variable in central Iowa than they are on the
California coast!

Choosing Measures of Center and Spread

Because the standard deviation measures how much the data values deviate from
the mean, it makes sense to use the standard deviation as a measure of variability
when the mean is used as a measure of center. Both the mean and standard deviation
have the advantage that they use all of the data values. However, they are
not resistant to outliers. The median and IQR are resistant to outliers. Furthermore,
if there are outliers or the data are heavily skewed, the five number summary can
give more information (such as direction of skewness) than the mean and standard
deviation.
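
To see the difference in resistance concretely, the sketch below computes both pairs of statistics before and after adding one very large value. The salary figures are invented for illustration (they are not NFL data), the function name center_and_spread is an assumption, and the quartiles come from Python's statistics.quantiles with its default method, which may differ slightly from other software.

```python
from statistics import mean, median, quantiles, stdev


def center_and_spread(values):
    """Return the mean, standard deviation, median, and IQR of the values."""
    q1, _q2, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Hypothetical salaries in thousands of dollars.
typical = [500, 600, 700, 800, 900, 1000, 1100]
with_star = typical + [20000]  # add one extreme salary

before = center_and_spread(typical)
after = center_and_spread(with_star)

# The mean and standard deviation jump; the median and IQR barely move.
for name in ("mean", "sd", "median", "iqr"):
    print(name, before[name], "->", after[name])
```

With these made-up values the mean moves from 800 to 3200 while the median moves only from 800 to 850, which is the behavior the paragraph describes.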

Example 2.24
Example 2.13 on page 70 describes salaries in the US National Football League,
in which some star players are paid much more than most other players.
(a) We see in that example that players prefer to use the median ($872,000 in 2014)
as a measure of center since they don’t want the results heavily influenced by a
few huge outlier salaries. What should they use as a measure of spread?
(b) We also see that the owners of the teams prefer to use the mean ($2.14 million in
2014) as a measure of center since they want to use a measure that includes all
the salaries. What should they use as a measure of spread?

Solution (a) The interquartile range (IQR) should be used with the median as a measure of
spread. Both come from the five number summary, and both the median and the
IQR are resistant to outliers.
(b) The standard deviation should be used with the mean as a measure of spread.
Both the mean and the standard deviation use all the data values in their
computation.

SECTION LEARNING GOALS
You should now have the understanding and skills to:
• Use technology to compute summary statistics for a quantitative
variable
• Recognize the uses and meaning of the standard deviation
• Compute and interpret a z-score
• Interpret a five number summary or percentiles
• Use the range, the interquartile range, and the standard deviation as
measures of spread
• Describe the advantages and disadvantages of the different measures
of center and spread
2.4 Boxplots and Quantitative/Categorical Relationships 93

2007 and 2013. Use technology to find the mean, the standard deviation, and the
five number summary for the data in this variable.

2.128 Audience Movie Ratings The variable AudienceScore in the HollywoodMovies
dataset gives the audience rating on the Rotten Tomatoes website of movies that
came out of Hollywood between 2007 and 2013. Use technology to find the mean,
the standard deviation, and the five number summary for the data in this variable.

2.129 Using the Five Number Summary to Visualize Shape of a Distribution Draw a
histogram or a smooth curve illustrating the shape of a distribution with the
properties that:
(a) The range is 100 and the interquartile range is 10.
(b) The range is 50 and the interquartile range is 40.

2.130 Rough Rule of Thumb for the Standard Deviation According to the 95% rule,
the largest value in a sample from a distribution that is approximately symmetric
and bell-shaped should be between 2 and 3 standard deviations above the mean,
while the smallest value should be between 2 and 3 standard deviations below the
mean. Thus the range should be roughly 4 to 6 times the standard deviation. As a
rough rule of thumb, we can get a quick estimate of the standard deviation for a
bell-shaped distribution by dividing the range by 5. Check how well this quick
estimate works in the following situations.
(a) Pulse rates from the StudentSurvey dataset discussed in Example 2.17 on
page 80. The five number summary of pulse rates is (35, 62, 70, 78, 130) and the
standard deviation is s = 12.2 bpm. Find the rough estimate using all the data,
and then excluding the two outliers at 120 and 130, which leaves the maximum
at 96.
(b) Number of hours a week spent exercising from the StudentSurvey dataset
discussed in Example 2.21 on page 84. The five number summary of this dataset
is (0, 5, 8, 12, 40) and the standard deviation is s = 5.741 hours.
(c) Longevity of mammals from the MammalLongevity dataset discussed in
Example 2.22 on page 85. The five number summary of the longevity values is
(1, 8, 12, 16, 40) and the standard deviation is s = 7.24 years.

2.4 BOXPLOTS AND QUANTITATIVE/CATEGORICAL RELATIONSHIPS
In this section, we examine a relationship between a quantitative variable and a
categorical variable by examining both comparative summary statistics and graphical
displays. Before we get there, however, we look at one more graphical display for a
single quantitative variable that is particularly useful for comparing groups.

Boxplots
A boxplot is a graphical display of the five number summary for a single quantitative
variable. It shows the general shape of the distribution, identifies the middle 50% of
the data, and highlights any outliers.

Boxplots
A boxplot includes:
• A numerical scale appropriate for the data values.
• A box stretching from Q1 to Q3 .
• A line that divides the box drawn at the median.
• A line from each quartile to the most extreme data value that is not
an outlier.
• Each outlier plotted individually with a symbol such as an asterisk
or a dot.

Example 2.25
Draw a boxplot for the data in MammalLongevity, with five number summary
(1, 8, 12, 16, 40).

Solution The boxplot for mammal longevities is shown in Figure 2.31. The box covers the
interval from the first quartile of 8 years to the third quartile of 16 years and is divided
at the median of 12 years. The line to the left of the lower quartile goes all the way to
the minimum longevity at 1 year, since there were no small outliers. The line to the
right stops at the largest data value (25, grizzly bear and hippopotamus) that is not
an outlier. The only clear outlier is the elephant longevity of 40 years, so this value
is plotted with an individual symbol at the maximum of 40 years.

Figure 2.31 Boxplot of longevity of mammals (horizontal axis: Longevity, 0 to 40;
labeled features: minimum non-outlier, Q1, median m, Q3, maximum non-outlier, outlier)

DATA 2.6 US States

The dataset USStates includes many variables measured for each of the 50
states in the US. Some of the variables included for each state are average
household income, percent to graduate high school, health statistics such as
consumption of fruits and vegetables, percent obese, percent of smokers, and
some results from the 2012 US presidential election.47 ■

Example 2.26
One of the quantitative variables in the USStates dataset is Vegetables, which gives
the percentage of the population that eats at least one serving of vegetables per day
for each of the states. Figure 2.32 shows a boxplot of the percent for all 50 states.
(a) Discuss what the boxplot tells us about the distribution of this variable.
(b) Estimate the five number summary.

Solution (a) The distribution of percentages is relatively symmetric, with one low outlier, and
centered around 77. The percent of people who eat vegetables appears to range
from about 67% to about 84%. The middle 50% of percentages is between about
75% and 80%, with a median value at about 77%. The low outlier is about 67%
and represents the state of Louisiana.
(b) The five number summary appears to be approximately (67, 75, 77, 80, 84).

47 Data from a variety of sources, mostly http://www.census.gov.

Figure 2.32 Percent of people who eat at least one serving of vegetables per day,
by state (horizontal axis: Vegetables, 65 to 85)

How much did it cost to make this movie?

DATA 2.7 Hollywood Movies

Over 900 movies came out of Hollywood between 2007 and 2013 and the
dataset HollywoodMovies contains lots of information on these movies, such as
studio, genre, budget, audience ratings, box office average opening weekend,
world gross, and others.48 ■

Example 2.27
One of the quantitative variables in the HollywoodMovies dataset is Budget, which
gives the budget, in millions of dollars, to make each movie. Figure 2.33 shows the
boxplot for the budget of all Hollywood movies that came out in 2013.
(a) Discuss what the boxplot tells us about the distribution of this variable.
(b) What movies do the two largest outliers correspond to?
(c) What was the budget to make Frozen? Is it an outlier?

48 McCandless, D., “Most Profitable Hollywood Movies,” from “Information is Beautiful,”
http://www.davidmccandless.com, Accessed July 2015.

Figure 2.33 Budget, in millions of dollars, of Hollywood movies
(horizontal axis: Budget, 0 to 250; outliers plotted individually with asterisks)

Solution (a) Because the minimum, first quartile, and median are so close together, we see
that half the data values are packed in a small interval, then the other half extend
out quite far to the right. These data are skewed to the right, with many large
outliers.
(b) The two largest outliers represent movies with budgets of about 225 million
dollars. We see from the dataset HollywoodMovies that the two movies are Man
of Steel and The Desolation of Smaug.
(c) We see from the dataset that the budget for Frozen was 150 million dollars. We
see in the boxplot that this is not an outlier.

Detection of Outliers
Consider again the data on mammal longevity in Data 2.2 on page 64. Our intuition
suggests that the longevity of 40 years for the elephant is an unusually high
value compared to the other lifespans in this sample. How do we determine objectively
when such a value is an outlier? The criteria should depend on some measure
of location for “typical” values and a measure of spread to help us judge when a data
point is “far” from those typical cases. One approach, typically used for identifying
outliers for boxplots, uses the quartiles and interquartile range. As a rule, most data
values will fall within about 1.5(IQR)’s of the quartiles.49

IQR Method for Detecting Outliers

For boxplots, we call a data value an outlier if it is

Smaller than Q1 − 1.5(IQR) or Larger than Q3 + 1.5(IQR)

Example 2.28
According to the IQR method, is the elephant an outlier for the mammal longevities
in the dataset MammalLongevity? Are any other mammals outliers in that dataset?

49 In practice, determining outliers requires judgment and understanding of the context. We present a
specific method here, but no single method is universally used for determining outliers.

Solution The five number summary for mammal longevities is (1, 8, 12, 16, 40). We have Q1 = 8
and Q3 = 16 so the interquartile range is IQR = 16 − 8 = 8. We compute

Q1 − 1.5(IQR) = 8 − 1.5(8) = 8 − 12 = −4

and
Q3 + 1.5(IQR) = 16 + 1.5(8) = 16 + 12 = 28
Clearly, there are no mammals with negative lifetimes, so there can be no outliers
below the lower value of −4. On the upper side, the elephant, as expected, is clearly
an outlier beyond the value of 28 years. No other mammal in this sample exceeds
that value, so the elephant is the only outlier in the longevity data.
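
The IQR method in this solution can be sketched in a few lines. The function names iqr_fences and split_outliers are assumptions made for illustration, and the list longevities holds only the values the text mentions by name (1, 25, 25, and 40), not the full MammalLongevity dataset.

```python
def iqr_fences(q1, q3):
    """Return the cutoffs (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values, q1, q3):
    """Return the outliers and the (smallest, largest) non-outlier values.

    The second item gives the two ends of the boxplot lines.
    """
    low, high = iqr_fences(q1, q3)
    outliers = sorted(v for v in values if v < low or v > high)
    inside = [v for v in values if low <= v <= high]
    return outliers, (min(inside), max(inside))


# Only the longevities named in the text; the real dataset has more values.
longevities = [1, 25, 25, 40]

print(iqr_fences(8, 16))                    # (-4.0, 28.0)
print(split_outliers(longevities, 8, 16))   # ([40], (1, 25))
```

The second result matches the boxplot description in Example 2.25: the lines end at 1 and 25, and 40 is plotted on its own.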

One Quantitative and One Categorical Variable

Do men or women tend to watch more television? Is survival time for cancer patients
related to genetic variations? How do April temperatures in Des Moines compare
to those in San Francisco? These questions each involve a relationship between a
quantitative variable (amount of TV, survival time, temperature) and a categorical
variable (gender, type of gene, city). One of the best ways to visualize such
relationships is to use side-by-side graphs. Showing graphs with the same axis
facilitates the comparison of the distributions.

Visualizing a Relationship between Quantitative and Categorical Variables
Side-by-side graphs are used to visualize the relationship between a
quantitative variable and a categorical variable. The side-by-side graph
includes a graph for the quantitative variable (such as a boxplot,
histogram, or dotplot) for each group in the categorical variable, all using
a common numeric axis.
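
Before drawing side-by-side graphs, the quantitative values have to be split by category. The sketch below shows one way to do that and to compare a few summary numbers per group; the function names group_values and compare_groups, the category labels "A" and "B", and the data values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def group_values(records):
    """Collect the quantitative values under each category label."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    return dict(groups)


def compare_groups(records):
    """Return {category: (minimum, median, maximum)} for each group."""
    return {
        category: (min(values), median(values), max(values))
        for category, values in group_values(records).items()
    }


# Hypothetical (category, value) pairs, not taken from StudentSurvey.
records = [("A", 4), ("B", 10), ("A", 6), ("B", 2), ("A", 9), ("B", 5)]
print(compare_groups(records))  # {'A': (4, 6, 9), 'B': (2, 5, 10)}
```

Each group's list from group_values could then be passed to a plotting tool to draw one boxplot per category on a shared axis.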

Erik Von Weber/Getty Images, Inc.

Who watches more TV, males or females?

Example 2.29
Who watches more TV, males or females? The data in StudentSurvey contains the
categorical variable Gender as well as the quantitative variable TV for the number of
hours spent watching television per week. For these students, is there an association
between gender and number of hours spent watching television? Use the side-by-side
graphs in Figure 2.34 showing the distribution of hours spent watching television for
males and females to discuss how the distributions compare.
