Ast5e PPT ch02
Exploring Data with Graphs and Numerical Summaries
Examples of categorical variables:
▪ Gender (Male or Female)
▪ Religious Affiliation (Catholic, Jewish, …)
▪ Type of Residence (Apartment, Condo, …)
▪ Belief in Life After Death (Yes or No)
Examples of quantitative variables:
▪ Age
▪ Number of Siblings
▪ Annual Income
Examples of discrete variables:
▪ Number of pets in a household
▪ Number of children in a family
▪ Number of foreign languages spoken by an individual
Examples of continuous variables:
▪ Height/Weight
▪ Age
▪ Amount of time to complete an assignment
Graphical Summaries of Data
Pie Charts:
▪ Used for summarizing a categorical variable.
Figure 2.1 Pie Chart of Shark Attacks Across U.S. States. The label for each
slice of the pie gives the category and the percentage of attacks in a state. The slice that
represents the percentage of attacks reported in Hawaii is 13% of the total area of the pie.
Question: Why is it beneficial to label the pie wedges with the percent?
Figure 2.2 Bar Graph of Shark Attacks Across U.S. States. Except for the Other
category, which is shown last, the bars are ordered from largest to smallest based on
the frequency of shark attacks. Question: What is the advantage of ordering the bars
this way rather than alphabetically?
Dot Plot: shows a dot for each observation placed above its
value on a number line.
Table 2.4 Frequency Table for Sodium in 20 Breakfast Cereals. The table
summarizes the sodium values using nine intervals and lists the number of
observations in each, as well as the proportions and percentages.
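A frequency table of this kind can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, the interval width of 40, and the data values are assumptions. The values were chosen to agree with the summary figures quoted elsewhere in this chapter (20 observations, quartiles of 135, 180, and 205 mg) and are not copied from the source tables.

```python
from collections import Counter

def frequency_table(values, width, start=0):
    """Return (interval, count, proportion, percentage) rows for integer data.

    Intervals are start..start+width-1, start+width..start+2*width-1, and so on.
    All values are assumed to be at least `start`.
    """
    n = len(values)
    counts = Counter((v - start) // width for v in values)
    rows = []
    for k in range(max(counts) + 1):
        low = start + k * width
        count = counts.get(k, 0)  # intervals with no observations get a count of 0
        rows.append((f"{low}-{low + width - 1}", count, count / n, 100 * count / n))
    return rows

# Assumed sodium values (mg); illustrative only.
sodium = [0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
          180, 180, 190, 200, 200, 210, 210, 220, 290, 340]

table = frequency_table(sodium, width=40)
for interval, count, proportion, percent in table:
    print(f"{interval:>8} {count:>3} {proportion:6.2f} {percent:6.1f}%")
```

With a width of 40 the assumed data fall into nine intervals, and the proportions add up to 1.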
Figure 2.6 Histogram of Breakfast Cereal Sodium Values. The rectangular bar over
an interval has height equal to the number of observations in the interval.
The value that occurs the most often is called the mode.
$\bar{x} = \dfrac{\sum x}{n}$
The median is the middle value of the observations when they are
ordered from the smallest to the largest (or from the largest to
smallest).
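The three measures of center (mode, mean, and median) can be sketched in Python as follows. The helper names are hypothetical, and the data values are assumed: they were chosen to be consistent with the quartiles quoted later in this chapter, not taken from the source tables.

```python
from collections import Counter

def mean(values):
    # Sum of the observations divided by the number of observations.
    return sum(values) / len(values)

def median(values):
    # Middle value of the ordered observations; with an even count,
    # the average of the two middle values.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    # All values that occur most often (there can be more than one).
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

# Assumed sodium values (mg); illustrative only.
sodium = [0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
          180, 180, 190, 200, 200, 210, 210, 220, 290, 340]

print(mean(sodium), median(sodium), modes(sodium))
```

Python's standard `statistics` module offers `mean`, `median`, and `multimode`, which could replace these hand-written helpers.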
The relatively high value of 16.9 falls well above the rest of the
data. It is an outlier.
The size of the outlier affects the calculation of the mean but not
the median.
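A small numerical check makes this concrete. In the sketch below only the value 16.9 comes from the text; every other number is hypothetical. Making the largest value more extreme changes the mean but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data; only 16.9 is taken from the text.
values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 16.9]

# The same data with the outlier made far more extreme.
more_extreme = values[:-1] + [90.0]

print("mean:  ", round(mean(values), 2), "->", round(mean(more_extreme), 2))
print("median:", median(values), "->", median(more_extreme))
```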
Figure 2.9 Relationship Between the Mean and Median. Question: For skewed
distributions, what causes the mean and median to differ?
Mode
Figure 2.11 Dot Plot for Cereal Sodium Data, Showing Deviations for Two
Observations. Question: When is a deviation positive and when is it negative?
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
The larger the standard deviation, s, the greater the
variability of the data.
▪ s measures the spread of the data.
▪ s = 0 only when all observations have the same value, otherwise
s > 0. As the spread of the data increases, s gets larger.
▪ s has the same units of measurement as the original
observations. The variance, $s^2$, has units that are squared.
▪ s is not resistant. Strong skewness or a few outliers can greatly
increase s.
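The properties above can be checked with a direct implementation of the formula for s. This is a minimal sketch: the function name `sample_std` and the three small data sets are hypothetical.

```python
import math

def sample_std(values):
    # Square root of the sum of squared deviations from the mean, divided by n - 1.
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

# Hypothetical data sets, all with mean 5.
constant = [5, 5, 5, 5]   # no spread at all
tight = [4, 5, 5, 6]      # observations close to the mean
spread = [1, 3, 7, 9]     # observations far from the mean

for name, data in [("constant", constant), ("tight", tight), ("spread", spread)]:
    print(name, round(sample_std(data), 3))
```

The constant data set gives s = 0, and s grows as the observations move farther from the mean.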
Figure 2.12 The Empirical Rule. For bell-shaped distributions, this tells us approximately
how much of the data fall within 1, 2, and 3 standard deviations of the mean. Question:
About what percentage would fall more than 2 standard deviations from the mean?
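The empirical rule is usually stated as about 68%, about 95%, and nearly all of the data falling within 1, 2, and 3 standard deviations of the mean. The sketch below checks those figures on simulated bell-shaped data. The sample size, seed, mean of 100, and standard deviation of 15 are arbitrary assumptions, and `share_within` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

def share_within(values, k):
    # Proportion of observations within k standard deviations of the mean.
    x_bar = mean(values)
    s = stdev(values)
    inside = sum(1 for x in values if abs(x - x_bar) <= k * s)
    return inside / len(values)

random.seed(0)  # fixed seed so the simulation is repeatable
sample = [random.gauss(100, 15) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(sample, k):.1%}")
```

For skewed or otherwise non-bell-shaped data, the observed proportions can differ noticeably from these figures.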
Figure 2.14 The Quartiles Split the Distribution Into Four Parts. 25% is below the first
quartile (Q1), 25% is between the first quartile and the second quartile (the median, Q2), 25%
is between the second quartile and the third quartile (Q3), and 25% is above the third quartile.
Question: Why is the second quartile also the median?
Consider the sodium values for the 20 breakfast cereals. What are the
quartiles for the 20 cereal sodium values? From Table 2.3, the sodium values,
in ascending order, are:
The median of the 20 values is the average of the 10th and 11th
observations, 180 and 180, which is Q2 = 180 mg.
The first quartile Q1 is the median of the 10 smallest observations (in the
top row), which is the average of 130 and 140, Q1 = 135 mg.
The third quartile Q3 is the median of the 10 largest observations (in the
bottom row), which is the average of 200 and 210, Q3 = 205 mg.
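The steps above (median of all values, then the median of each half) can be sketched as follows. The 20 values are assumed: they were chosen so that the results match the quartiles quoted above and are not copied from Table 2.3. For an odd number of observations this sketch leaves the middle observation out of both halves, which is one common convention and is also an assumption here.

```python
def median(ordered):
    # Median of a list that is already sorted.
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    # Q1 and Q3 are the medians of the lower and upper halves of the ordered data.
    # Requires at least two observations.
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return median(lower), median(ordered), median(upper)

# Assumed sodium values (mg), in ascending order; illustrative only.
sodium = [0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
          180, 180, 190, 200, 200, 210, 210, 220, 290, 340]

q1, q2, q3 = quartiles(sodium)
print(q1, q2, q3)
```

Statistical software often uses slightly different quartile definitions, so its output may not agree exactly with this hand method.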
In Words: If the interquartile range of U.S. music teacher salaries equals $16,000, this
means that the middle 50% of the distribution stretches over a distance of $16,000.
▪ Minimum value
▪ First Quartile
▪ Median
▪ Third Quartile
▪ Maximum value
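The five positions listed above, together with the interquartile range (Q3 minus Q1), can be computed in one pass over the sorted data. This is a minimal sketch: `five_number_summary` is a hypothetical helper, the quartiles are taken as medians of the lower and upper halves, and the data values are assumed rather than copied from the source tables.

```python
from statistics import median

def five_number_summary(values):
    # Returns (minimum, Q1, median, Q3, maximum); needs at least two observations.
    ordered = sorted(values)
    half = len(ordered) // 2
    return (ordered[0],
            median(ordered[:half]),
            median(ordered),
            median(ordered[-half:]),
            ordered[-1])

# Assumed sodium values (mg); illustrative only.
sodium = [0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
          180, 180, 190, 200, 200, 210, 210, 220, 290, 340]

minimum, q1, med, q3, maximum = five_number_summary(sodium)
iqr = q3 - q1  # width of the middle 50% of the data
print(minimum, q1, med, q3, maximum, "IQR =", iqr)
```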
Figure 2.15 shows a box plot for the sodium values. Labels
are also given for the five-number summary of positions.
Figure 2.15 Box Plot and Five-Number Summary for 20 Breakfast Cereal Sodium Values.
The central box contains the middle 50% of the data. The line in the box marks the median.
Whiskers extend from the box to the smallest and largest observations that are not identified
as potential outliers. Potential outliers are marked separately. Question: Why is the left whisker
drawn down only to 50 rather than to 0?
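One way to locate the whisker ends is sketched below. It assumes the widely used 1.5 × IQR criterion for flagging potential outliers, which this caption does not spell out. The data values are also assumed (chosen to be consistent with the quartiles 135 and 205 quoted in this chapter), and `whiskers_and_outliers` is a hypothetical helper.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    # Observations more than k * IQR below Q1 or above Q3 are flagged
    # as potential outliers; whiskers stop at the most extreme
    # observations that are not flagged.
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return (min(inside), max(inside)), outliers

# Assumed sodium values (mg); illustrative only.
sodium = [0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
          180, 180, 190, 200, 200, 210, 210, 220, 290, 340]

whiskers, outliers = whiskers_and_outliers(sodium, q1=135, q3=205)
print("whiskers:", whiskers, "potential outliers:", outliers)
```

With these assumed values the lower fence is 135 − 1.5 × 70 = 30, so an observation of 0 is flagged and the left whisker stops at 50.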
Figure 2.16 Box Plots of Male and Female College Student Heights. The box plots use the
same scale for height. Question: Do the quartiles given under Descriptive Statistics match
those shown in the box plot for each group?
The z-score $z = \dfrac{x - \bar{x}}{s}$ is an example of a linear transformation
of observations.
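Because the z-score subtracts the mean and divides by the standard deviation, a set of z-scores always has mean 0 and standard deviation 1. The sketch below illustrates this; `z_scores` is a hypothetical helper and the height values are made up.

```python
from statistics import mean, stdev

def z_scores(values):
    # Number of standard deviations each observation lies from the mean.
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

# Hypothetical heights in inches.
heights = [60, 62, 65, 67, 71]

z = z_scores(heights)
print([round(value, 2) for value in z])
print("mean of z:", round(mean(z), 6), "std of z:", round(stdev(z), 6))
```

An observation equal to the mean (65 here) has z = 0; observations above the mean have positive z-scores and those below have negative ones.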
Figure 2.19 An Example of a Poor Graph. Question: What’s misleading about the
way the data are presented?
Figure 2.20 A Better Graph for the Data in Figure 2.19. Question: What trends do
you see in the enrollments from 2004 to 2012?