Statistics (Chapter 1)
Statistical Analysis
with Applications in Engineering and Sciences
Lecture Notes
Adolfo Martin O. Soliman
Polytechnic University of the Philippines
Manila, 2019
Descriptive Statistics
You may be familiar with probability and statistics through radio, television,
newspapers, and magazines. For example, you may have read statements like
"68% of Filipinos are still confident that the Philippines will again rise in the
economic scene."
Statistics is used in almost all fields of human endeavor. In sports, for ex-
ample, a statistician may keep records of the number of points a point guard
scored during a basketball game, or the number of hits a baseball player gets
in a season. In other areas, such as public health, an administrator might be
concerned with the number of residents who contract a new strain of flu virus
during a certain year. In education, a researcher might want to know if new
methods of teaching are better than old ones. These are only a few examples
of how statistics can be used in various occupations.
2 CHAPTER 1. DESCRIPTIVE STATISTICS
Definition 1.1.1.
(b) Quantitative variables are variables which are numerical and can be
ordered or ranked.
Remark 1.1.3. Quantitative variables can be further classified into two groups.
[Figure: classification of data. Data are either qualitative or quantitative;
quantitative data are either discrete or continuous.]
To easily determine the level of measurement, one may use the following
flowchart, for convenience:
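[Flowchart: determining the level of measurement of data]
1. Can the data be ranked or ordered? If no, the data are at the nominal level.
2. If yes, are there precise variations between ranks? If no, the data are at
the ordinal level.
3. If yes, is zero defined as null? If no, the data are at the interval level;
if yes, the data are at the ratio level.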
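The three questions of the flowchart can be sketched as a small decision function. This is only an illustration: the function name and its parameters are assumptions introduced here, and the sample calls use variables mentioned elsewhere in these notes.

```python
def level_of_measurement(can_be_ranked, precise_differences=False, zero_means_none=False):
    """Answer the three yes/no questions of the flowchart in order."""
    if not can_be_ranked:
        return "nominal"
    if not precise_differences:
        return "ordinal"
    if not zero_means_none:
        return "interval"
    return "ratio"

print(level_of_measurement(False))             # religious preference -> nominal
print(level_of_measurement(True))              # build (small, medium, large) -> ordinal
print(level_of_measurement(True, True))        # IQ score -> interval
print(level_of_measurement(True, True, True))  # height -> ratio
```

Each question is asked only if the previous one was answered "yes", which mirrors the left-to-right reading of the flowchart.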
(c) Note that precise measurement of differences in the ordinal level of mea-
surement does not exist. For instance, when people are classified ac-
cording to their build (small, medium, or large), a large variation exists
among the individuals in each class.
Example 1.1.3. (a) The interval level differs from the ordinal level in
that precise differences do exist between units. For example, many
standardized psychological tests yield values measured on an interval
scale. IQ is an example of such a variable. There is a meaningful differ-
ence of 1 point between an IQ of 109 and an IQ of 110.
(c) One property is lacking in the interval scale, that is, there is no true zero.
For example, IQ tests do not measure people who have no intelligence.
For temperature, 0°F does not mean no heat at all.
Example 1.1.4. (a) Examples of ratio scales are those used to measure
height, weight, area, and number of phone calls received.
(b) Ratio scales have differences between units (1 inch, 1 pound, etc.) and
a true zero.
(c) In addition, the ratio scale contains a true ratio between values. For
example, if one person can lift 200 pounds and another can lift 100
pounds, then the ratio between them is 2 to 1. Put another way, the
first person can lift twice as much as the second person.
EXERCISES
(a) In the year 2010, 148 million Americans will be enrolled in an HMO.
(Source: USA TODAY )
(b) Nine out of ten on-the-job fatalities are men.
(Source: USA TODAY Weekend )
(c) Expenditures for the cable industry were $5.66 billion in 1996.
(Source: USA TODAY )
(d) The median household income for people aged 25-34 is $35,888.
(Source: USA TODAY )
(e) Allergy therapy makes bees go away. (Source: Prevention)
1.1. DATA AND LEVELS OF MEASUREMENT 7
9. Give three examples each of nominal, ordinal, interval, and ratio data.
The average American man is five feet, nine inches tall; the average
woman is five feet, 3.6 inches.
The average American is sick in bed seven days a year missing five
days of work.
On the average day, 24 million people receive animal bites.
By his or her 70th birthday, the average American will have eaten
14 steers, 1050 chickens, 3.5 lambs, and 25.2 hogs.
1.2. MEASURES OF CENTRAL TENDENCY 9
In these examples, the word average is ambiguous, since several different meth-
ods can be used to obtain an average. Loosely stated, the average means the
center of the distribution or the most typical case. Measures of average are
also called measures of central tendency and include the mean, median, and
mode.
Knowing the average of a data set is not enough to describe the data set en-
tirely. Even though a shoe store owner knows that the average size of a man’s
shoe is size 10, she would not be in business very long if she ordered only size
10 shoes.
The previous section stated that statisticians use samples taken from popula-
tions; however, when populations are small, it is not necessary to use samples
since the entire population can be used to gain information. For example,
suppose an insurance manager wanted to know the average weekly sales of
all the company’s representatives. If the company employed a large number
of salespeople, say, nationwide, he would have to use a sample and make an
inference to the entire sales force. But if the company had only a few sales-
people, say, only 87 agents, he would be able to use all representatives’ sales
for a randomly chosen week and thus, use the entire population.
Measures found by using all the data values in the population are called pa-
rameters. Measures obtained by using the data values from samples are called
statistics.
Definition 1.2.1.
Definition 1.2.2. The (arithmetic) mean is the sum of all the observations
divided by the number of observations.
1. The symbol $\bar{x}$ (read as "x-bar") represents the sample mean, given by
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}. \]
Remark 1.2.2. (Rounding Rule for the Mean) The mean should be
rounded to one more decimal place than occurs in the raw data.
Remark 1.2.3. The arithmetic mean is, in general, a very natural measure
of location. One of its main limitations, however, is that it is oversensitive to
extreme values. In this instance, it may not be representative of the location
of the great majority of sample points.
Remark 1.2.4. It is possible in extreme cases for all but one of the sample
points to be on one side of the arithmetic mean. In these types of samples,
the arithmetic mean is a poor measure of central location because it does not
reflect the center of the sample.
Example 1.2.1. Suppose the sample consists of the birthweights of all live-
born infants born at a private hospital in Pasig City, during a one-week period.
i xi i xi i xi i xi
1 3265 6 3323 11 2581 16 2759
2 3260 7 3649 12 2841 17 3248
3 3245 8 3200 13 3609 18 3314
4 3484 9 3031 14 2838 19 3101
5 4146 10 2069 15 3541 20 2834
Example 1.2.2. The following data deal with the aflatoxin levels of raw
peanut kernels as described by Quesenberry et al. (1976). Approximately
560 g of ground meal was divided among 16 centrifuge bottles and analyzed.
One sample was lost, so that only 15 readings are available (measurement units
are not given). The values were 30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35,
52, 28, 37. The mean aflatoxin level of the readings is
\[ \bar{x} = \frac{30 + 26 + \cdots + 28 + 37}{15} = 32.5. \]
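The computation in Example 1.2.2 can be checked with a few lines of Python. This is a minimal sketch; the names `aflatoxin` and `sample_mean` are illustrative choices, not part of the original notes.

```python
# Aflatoxin readings from Example 1.2.2 (15 values)
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]

def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

mean = sample_mean(aflatoxin)
print(round(mean, 1))  # 32.5, one more decimal place than the raw data
```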
The Median
An article recently reported that the median income for college professors was
$43,250. This measure of central tendency means that one-half of all the profes-
sors surveyed earned more than $43,250, and one-half earned less than $43,250.
The median is the halfway point in a data set. Before you can find this point,
the data must be arranged in order. The median either will be a specific value
in the data set or will fall between two values.
The sample median is
1. the $\left(\frac{n+1}{2}\right)$th largest observation, if $n$ is odd;
2. the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations, if $n$ is even.
Remark 1.2.5. The rationale for these definitions is to ensure an equal num-
ber of sample points on both sides of the sample median.
Remark 1.2.6. The median is defined differently when n is even and odd
because it is impossible to achieve this goal with one uniform definition.
Remark 1.2.7. Samples with an odd sample size have a unique central point,
while samples with an even sample size have no unique central point, and the
middle two values must be averaged.
Remark 1.2.8. The main weakness of the sample median is that it is deter-
mined mainly by the middle points in a sample and is less sensitive to the
actual numeric values of the remaining data points.
Remark 1.2.9. When the data set is ordered, it is called a data array.
Example 1.2.3. Consider the data set in Table 1.2, which consists of white-
blood counts taken upon admission of all patients entering a small hospital in
Quezon City, on a given day. Compute the median white-blood count.
Table 1.2: Sample of admission white-blood counts (×1000) for all patients
entering a hospital in Quezon City, on a given day
i xi i xi i xi
1 7 4 9 7 10
2 35 5 8 8 12
3 5 6 3 9 8
Solution. First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35. Because
n = 9 is odd, the sample median is given by the fifth largest point, which equals
8 or 8000 on the original scale.
Example 1.2.4. Compute the sample median for the sample in Example 1.2.1.
Solution. First, arrange the sample in ascending order: 2069, 2581, 2759, 2834,
2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541,
3609, 3649, 4146. Because n = 20 is even, the sample median is the average of
the 10th and 11th largest observations, that is,
\[ \tilde{x} = \frac{3245 + 3248}{2} = 3246.5 \text{ g}. \]
Example 1.2.5. Compute the sample median for the sample in Example 1.2.2.
Solution. First, arrange the sample in ascending order: 16, 22, 23, 26, 26, 27,
28, 30, 31, 35, 36, 37, 48, 50, 52. Because n = 15 is odd, the sample median
is the 8th largest observation, that is, $\tilde{x} = 30$.
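The odd/even rule for the sample median can be sketched in Python and applied to the three data sets of Examples 1.2.1 to 1.2.3. The function name `sample_median` and the variable names are assumptions made for this illustration.

```python
def sample_median(values):
    """Median by the odd/even rule: middle value, or average of the two middle values."""
    ordered = sorted(values)            # the data array
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]             # the ((n + 1) / 2)-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2

white_blood = [7, 35, 5, 9, 8, 3, 10, 12, 8]                      # Table 1.2 (x1000)
birthweights = [3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
                2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834]
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]

print(sample_median(white_blood))   # 8
print(sample_median(birthweights))  # 3246.5
print(sample_median(aflatoxin))     # 30
```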
The Mode
The third measure of average is called the mode. The mode is the value that
occurs most often in the data set. It is sometimes said to be the most typical
case.
Definition 1.2.4. The mode is the most frequently occurring value among
all the observations in a sample. It is denoted by x̂ (read as "x-hat").
Remark 1.2.10. A data set can have more than one mode or no mode at all.
Remark 1.2.11. When no data value occurs more than once, the data set is
said to have no mode.
Remark 1.2.12. A data set that has only one value that occurs with the
greatest frequency is said to be unimodal. If a data set has two values that
occur with the same greatest frequency, both values are considered to be the
mode and the data set is said to be bimodal. If a data set has more than two
values that occur with the same greatest frequency, each value is used as the
mode, and the data set is said to be multimodal.
Solution. There is no mode, because all the values occur exactly once.
Solution. The observation 26 is the most frequent, occurring twice in the data
set. Therefore, x̂ = 26.
Solution. The observation 8 is the most frequent, occurring twice in the data
set. Therefore, x̂ = 8000, based on the original scale.
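A short sketch of how the mode, including the "no mode" and multimodal cases of Remarks 1.2.10 to 1.2.12, might be computed. The function name `modes` and the convention of returning an empty list when no value repeats are assumptions of this example.

```python
from collections import Counter

def modes(values):
    """Return all values sharing the greatest frequency; empty list if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                 # no data value occurs more than once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

birthweights = [3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
                2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834]
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]
white_blood = [7, 35, 5, 9, 8, 3, 10, 12, 8]

print(modes(birthweights))     # []   (no mode)
print(modes(aflatoxin))        # [26]
print(modes(white_blood))      # [8]  (8000 on the original scale)
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (bimodal, made-up data)
```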
\[ HM = \frac{4}{\dfrac{1}{1} + \dfrac{1}{4} + \dfrac{1}{5} + \dfrac{1}{2}} = 2.1 \]
Example 1.2.11. Suppose a person drove 100 miles at 40 miles per hour and
returned driving 50 miles per hour. The average miles per hour is not 45 miles
per hour, which is found by adding 40 and 50 and dividing by 2. The average
is found as shown.
Since $\text{time} = \dfrac{\text{distance}}{\text{rate}}$, then
\[ \text{Time 1} = \frac{100}{40} = 2.5 \text{ hours to make the trip} \]
\[ \text{Time 2} = \frac{100}{50} = 2 \text{ hours to return} \]
Hence, the total time is 4.5 hours, and the total miles driven are 200 miles.
Now, the average speed is
\[ \text{rate} = \frac{\text{distance}}{\text{time}} = \frac{200}{4.5} = 44.44 \text{ miles per hour}. \]
This value can also be found by using the harmonic mean, that is,
\[ HM = \frac{2}{\dfrac{1}{40} + \dfrac{1}{50}} = 44.44. \]
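The two routes to the average speed in Example 1.2.11 can be compared numerically. In this sketch the helper name `harmonic` is an assumption; the standard library's `statistics.harmonic_mean` offers the same calculation.

```python
def harmonic(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Average speed from total distance over total time
total_time = 100 / 40 + 100 / 50          # 2.5 h out, 2 h back
print(round(200 / total_time, 2))         # 44.44

# Same result from the harmonic mean of the two speeds
print(round(harmonic([40, 50]), 2))       # 44.44

# The four-value harmonic mean shown earlier
print(round(harmonic([1, 4, 5, 2]), 1))   # 2.1
```

The harmonic mean matches the distance-over-time result here because both legs of the trip cover the same distance.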
Definition 1.2.7. The geometric mean (GM) is defined as the nth root of
the product of n values. That is,
\[ GM = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}. \]
Example 1.2.14. If a person receives a 20% raise after 1 year of service and
a 10% raise after the second year of service, the average percentage raise per
year is not 15% but 14.89%, as shown.
\[ GM = \sqrt{(1.2)(1.1)} = 1.1489, \]
since the salary is 120% at the end of the first
year and 110% at the end of the second year. This is equivalent to an average
of 14.89%, since 114.89% − 100% = 14.89%.
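The raise calculation of Example 1.2.14 can be reproduced as follows. This is a sketch only; the function name `geometric_mean` is chosen for illustration, and `math.prod` requires Python 3.8 or later.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

growth_factors = [1.20, 1.10]              # 20% raise, then 10% raise
gm = geometric_mean(growth_factors)
print(round(gm, 4))                        # 1.1489
print(round((gm - 1) * 100, 2))            # 14.89 (average percentage raise per year)
```

Percentage changes are first converted to growth factors (1 + rate), because the raises compound multiplicatively rather than add.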
Definition 1.2.8. The quadratic mean (QM) is defined as the square root
of the average of the squares of each value. That is,
\[ QM = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}}. \]
This is a useful mean in the physical sciences (for example, in measuring voltage).
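A brief sketch of the quadratic mean follows. The function name and the two small data sets are made up for illustration; they are not taken from the notes.

```python
import math

def quadratic_mean(values):
    """Square root of the average of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

print(quadratic_mean([1, 7]))    # sqrt((1 + 49) / 2) = 5.0
print(quadratic_mean([-5, 5]))   # 5.0, although the arithmetic mean is 0
```

The second call hints at why this mean suits quantities that swing between positive and negative values: squaring removes the sign before averaging.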
2. The mean varies less than the median or mode when samples are taken
from the same population and all three measures are computed for these
samples.
4. The mean for the data set is unique and not necessarily one of the data
values.
The Median
1. The median is used to find the center or middle value of a data set.
2. The median is used when it is necessary to find out whether the data
values fall into the upper half or lower half of the distribution.
4. The median is affected less than the mean by extremely high or extremely
low values.
The Mode
3. The mode can be used when the data are nominal, such as religious
preference, gender, or political affiliation.
4. The mode is not always unique. A data set can have more than one
mode, or the mode may not exist for a data set.
EXERCISES
(a) One-half of the factory workers make more than $5.37 per hour,
and one-half make less than $5.37 per hour.
(b) The average number of children per family in the Plaza Heights
Complex is 1.8.
(c) Most people prefer red convertibles over any other color.
3. A local fast-food company claims that the average salary of its employ-
ees is $13.23 per hour. An employee states that most employees make
minimum wage. If both are being truthful, how could both be correct?
4. If the mean of five values is 64, find the sum of the values.
5. If the mean of five values is 8.2 and four of the values are 6, 10, 7, and
12, find the fifth value.
6. (a) Find the mean of 10, 20, 30, 40, and 50.
(b) Add 10 to each value and find the mean.
(c) Subtract 10 from each value and find the mean.
(d) Multiply each value by 10 and find the mean.
(e) Divide each value by 10 and find the mean.
(f) Make a general statement about each situation.
(a) A salesperson drives 300 miles round trip at 30 miles per hour going
to Chicago and 45 miles per hour returning home. Find the average
miles per hour.
(b) A bus driver drives the 50 miles to West Chester at 40 miles per
hour and returns driving 25 miles per hour. Find the average miles
per hour.
(c) A carpenter buys $500 worth of nails at $50 per pound and $500
worth of nails at $10 per pound. Find the average cost of 1 pound
of nails.
(a) The growth rates of the Living Life Insurance Corporation for the
past 3 years were 35, 24, and 18%.
(b) A person received these percentage raises in salary over a 4-year
period: 8, 6, 4, and 5%.
(c) A stock increased each year for 5 years at these percentages: 10, 8,
12, 9, and 3%.
(d) The price increases, in percentages, for the cost of food in a specific
geographic region for the past 3 years were 1, 3, and 5.5%.
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25
Since the means are equal in Example 1.3.1, you might conclude that both
brands of paint last equally well. However, when the data sets are examined
graphically, a somewhat different conclusion might be drawn.
As Figure 1.3 shows, even though the means are the same for both brands,
the spread, or variation, is quite different. Figure 1.3 shows that Brand B
performs more consistently; it is less variable. For the spread or variability of
a data set, three measures are commonly used: range, variance, and standard
deviation. Each measure will be discussed in this section.
The Range
Several different measures can be used to describe the variability of a sample.
Perhaps the simplest measure is the range.
Definition 1.3.1. The range is the difference between the largest and
smallest observations in a sample. The symbol R is used for the range, and we
have
\[ R = \text{largest observation} - \text{smallest observation}. \]
Remark 1.3.1. One advantage of the range is that it is very easy to compute
once the sample points are ordered.
Example 1.3.2. Find the ranges for the paints in Example 1.3.1.
For Brand A, R = 60 − 10 = 50 months.
For Brand B, R = 45 − 25 = 20 months.
Here, we see that the range for Brand A shows that 50 months separate the
largest data value from the smallest data value, and for Brand B, 20 months
separate the largest data value from the smallest data value, which is less than
one-half of Brand A’s range.
The range for the Autoanalyzer method is given by 226 − 177 = 49 mg/dL.
The range for the Microenzymatic method is given by 209 − 192 = 17 mg/dL.
The Autoanalyzer method clearly seems more variable.
Example 1.3.5. The range of the aflatoxin levels of raw peanut kernels in
Example 1.2.2 is given by
R = 52 − 16 = 36.
Example 1.3.6. The range of the white-blood counts for all patients entering
a hospital in Quezon City, on a given day, based on Example 1.2.3, is given by
R = 35 − 3 = 32(×1000) or 32000.
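The ranges quoted in Examples 1.3.2, 1.3.5, and 1.3.6 can be verified with a one-line helper. The name `sample_range` and the variable names are illustrative assumptions.

```python
def sample_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

brand_a = [10, 60, 50, 30, 40, 20]       # months, Example 1.3.1
brand_b = [35, 45, 30, 35, 40, 25]
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]
white_blood = [7, 35, 5, 9, 8, 3, 10, 12, 8]   # x1000

print(sample_range(brand_a), sample_range(brand_b))  # 50 20
print(sample_range(aflatoxin))                       # 36
print(sample_range(white_blood))                     # 32 (x1000)
```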
Definition 1.3.2. The variance is the average of the squares of the distance
each value is from the mean.
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \]
where
$X_i$ = individual value
$\mu$ = population mean
$N$ = population size
where
$x_i$ = individual value
$\bar{x}$ = sample mean
$n$ = sample size
1.3. MEASURES OF VARIATION 27
Remark 1.3.4. When computing the variance of a sample, one might
expect the use of the formula
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}. \]
This formula is not usually used, however, since in most cases the purpose
of calculating the statistic is to estimate the corresponding parameter. For
example, the sample mean $\bar{x}$ is used to estimate the population mean $\mu$. The
expression
\[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]
does not give the best estimate of the population variance because when the
population is large and the sample is small (usually less than 30), the variance
computed by this formula usually underestimates the population variance.
Therefore, instead of dividing by n, find the variance of the sample by dividing
by n − 1, giving a slightly larger value and an unbiased estimate of the
population variance. Thus, we use
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}. \]
where
$X_i$ = individual value
$\mu$ = population mean
$N$ = population size
where
$x_i$ = individual value
$\bar{x}$ = sample mean
$n$ = sample size
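The claim that dividing by n underestimates the population variance, while dividing by n − 1 does not, can be illustrated by brute force. The sketch below is an assumption-laden illustration rather than part of the notes: it treats the six Brand A values as a complete population, lists every ordered sample of size 2 drawn with replacement, and averages the two candidate estimates over all samples.

```python
from itertools import product
from statistics import mean, pvariance

population = [10, 60, 50, 30, 40, 20]    # treated here as the whole population
sigma2 = pvariance(population)           # population variance (divides by N)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = mean(sample)
    return sum((x - m) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))   # all 36 ordered samples

avg_div_n = mean(sum_sq_dev(s) / n for s in samples)
avg_div_n_minus_1 = mean(sum_sq_dev(s) / (n - 1) for s in samples)

print(round(sigma2, 1))             # 291.7
print(round(avg_div_n, 1))          # 145.8  (too small on average)
print(round(avg_div_n_minus_1, 1))  # 291.7  (matches the population variance)
```

Averaged over every possible sample, the n − 1 version recovers the population variance exactly, which is what "unbiased" means.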
Remark 1.3.5. The rounding rule for the standard deviation is the same as
that for the mean. The final answer should be rounded to one more decimal
place than that of the original data.
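\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}, \]
where
$x_i$ = individual value
$x_i^2$ = square of the individual value
$\bar{x}$ = sample mean
$n$ = sample size
Taking the square root gives the corresponding formula for the standard deviation,
\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}, \]
where
$x_i$ = individual value
$x_i^2$ = square of the individual value
$\bar{x}$ = sample mean
$n$ = sample size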
Example 1.3.7. Find the variance and standard deviation for the data set
for Brand A in Example 1.3.1.
Solution. First, we compute for the mean of the data set. From Example
1.3.1, we see that µA = 35. Second, we shall subtract the mean from each
data value.
10 − 35 = −25 50 − 35 = 15 40 − 35 = 5
60 − 35 = 25 30 − 35 = −5 20 − 35 = −15
Third, we square each result.
Now, we get the sum of the squares and then divide it by N (since we are
dealing with the population variance). That is,
\[ \sigma_A^2 = \frac{\sum (X_A - \mu_A)^2}{N} = \frac{625 + 625 + 225 + 25 + 25 + 225}{6} = \frac{1750}{6} = 291.7. \]
Now, for the standard deviation, we have $\sigma_A = \sqrt{\dfrac{1750}{6}} = 17.1$. It is advisable
to organize the computation in a table to keep track of each step.
$X_A$    $X_A - \mu_A$    $(X_A - \mu_A)^2$
10       −25              625
60       25               625
50       15               225
30       −5               25
40       5                25
20       −15              225
$\sum (X_A - \mu_A)^2 = 1750$
Example 1.3.8. Find the variance and standard deviation for the data set
for Brand B in Example 1.3.1.
Solution. First, we compute for the mean of the data set. From Example
1.3.1, we see that µB = 35. Second, we shall subtract the mean from each
data value, square each result, and then get the sum of the squares and then
divide it by N (since we are dealing with the population variance). That is,
$X_B$    $X_B - \mu_B$    $(X_B - \mu_B)^2$
35       0                0
45       10               100
30       −5               25
35       0                0
40       5                25
25       −10              100
$\sum (X_B - \mu_B)^2 = 250$
Therefore,
\[ \sigma_B^2 = \frac{\sum (X_B - \mu_B)^2}{N} = \frac{250}{6} = 41.7. \]
Now, for the standard deviation, we have $\sigma_B = \sqrt{\dfrac{250}{6}} = 6.5$.
Since the standard deviation of Brand A is 17.1 and the standard deviation of
Brand B is 6.5, the data are more variable for Brand A. In summary, when
the means are equal, the larger the variance or standard deviation is, the more
variable the data are.
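The population variances and standard deviations of Examples 1.3.7 and 1.3.8 can be cross-checked with Python's standard library. In this sketch the variable names are illustrative; `statistics.pvariance` and `statistics.pstdev` divide by N, as the population formulas require.

```python
from statistics import mean, pvariance, pstdev

brand_a = [10, 60, 50, 30, 40, 20]   # months, Example 1.3.1
brand_b = [35, 45, 30, 35, 40, 25]

print(mean(brand_a), mean(brand_b))                           # 35 35
print(round(pvariance(brand_a), 1), round(pstdev(brand_a), 1))  # 291.7 17.1
print(round(pvariance(brand_b), 1), round(pstdev(brand_b), 1))  # 41.7 6.5
```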
Example 1.3.9. Find the variance and standard deviation for the rate of
death in a certain barrio in Rizal for a sample of 6 years shown. The data are
in percentages.
Solution. Without actually solving for the sample mean, $\bar{x}$, we can solve for
the sample variance and standard deviation of the given sample. To do this,
we find the sum of the values and the sum of the squares of the values, then
substitute in the shortcut formula. That is,
$x$      $x^2$
11.2     125.44
11.9     141.61
12.0     144.00
12.8     163.84
13.4     179.56
14.3     204.49
$\sum x = 75.6$     $\sum x^2 = 958.94$
Therefore,
\[ s^2 = \frac{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}{n-1} = \frac{958.94 - \dfrac{(75.6)^2}{6}}{6-1} = 1.28. \]
Moreover, we have
\[ s = \sqrt{\frac{958.94 - \dfrac{(75.6)^2}{6}}{6-1}} = 1.13. \]
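The shortcut computation of Example 1.3.9 might be sketched as follows, with a comparison against the definitional formula as implemented by `statistics.variance`. The function name `shortcut_variance` is an assumption of this example.

```python
import math
from statistics import variance

rates = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # death rates (%), Example 1.3.9

def shortcut_variance(values):
    """Sample variance from the sum of values and the sum of squared values."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total ** 2 / n) / (n - 1)

s2 = shortcut_variance(rates)
print(round(s2, 2))               # 1.28
print(round(math.sqrt(s2), 2))    # 1.13
print(abs(s2 - variance(rates)) < 1e-9)  # True: agrees with the definitional formula
```

The shortcut is convenient by hand, but in floating-point arithmetic it subtracts two nearly equal quantities, so library routines based on deviations from the mean are usually the safer choice for large or tightly clustered data.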
Example 1.3.10. Compute the variance and standard deviation for the Au-
toanalyzer and Microenzymatic-method data in Figure 1.4.
(a) For the Autoanalyzer method, we have
$x$      $x - \bar{x}$    $(x - \bar{x})^2$
177      −23              529
193      −7               49
195      −5               25
209      9                81
226      26               676
$\sum (x - \bar{x})^2 = 1360$
Therefore, $s^2 = \dfrac{1360}{5-1} = 340$ and $s = \sqrt{\dfrac{1360}{5-1}} = 18.4$.
(b) For the Microenzymatic method, we have
$x$      $x - \bar{x}$    $(x - \bar{x})^2$
192      −8               64
197      −3               9
200      0                0
202      2                4
209      9                81
$\sum (x - \bar{x})^2 = 158$
Therefore, $s^2 = \dfrac{158}{5-1} = 39.5$ and $s = \sqrt{\dfrac{158}{5-1}} = 6.3$.
3. The variance and standard deviation are used to determine the number
of data values that fall within a specified interval in a distribution.
4. Finally, the variance and standard deviation are used quite often in
inferential statistics. These uses will be shown in later chapters of these
lecture notes.
Coefficient of Variation
Whenever two samples have the same units of measure, the variance and stan-
dard deviation for each can be compared directly. For example, suppose an
automobile dealer wanted to compare the standard deviation of miles driven
for the cars she received as trade-ins on new cars. She found that for a spe-
cific year, the standard deviation for Buicks was 422 miles and the standard
deviation for Cadillacs was 350 miles. She could say that the variation in
mileage was greater in the Buicks. But what if a manager wanted to compare
the standard deviations of two different variables, such as the number of sales
per salesperson over a 3-month period and the commissions made by these
salespeople?
For many traits, standard deviation and mean change together when organ-
isms of different sizes are compared. Humans have greater mass than mice
and also more variability in mass. For many purposes, we care more about
the relative variation among individuals. A special measure, the coefficient of
variation, is often used for this purpose.
This measure can also be used to compare the variability of traits that do not
have the same units. If we wanted to ask, "What is more variable in humans,
body mass or life span? " then the standard deviation is not very informative,
because mass is measured in kilograms and life span is measured in years. The
coefficient of variation would allow us to make such a comparison.
Example 1.3.11. The mean for the number of pages of a sample of women's
fitness magazines is 132, with a variance of 23; the mean for the number of
advertisements of a sample of women's fitness magazines is 182, with a
variance of 62. Compare the variations.
Example 1.3.12. The mean of the number of sales of cars over a 3-month
period is 87, and the standard deviation is 5. The mean of the commissions is
$5225, and the standard deviation is $773. Compare the variations of the two.
Example 1.3.13. The coefficient of variation for the data consisting of
birthweights in Example 1.2.1 is given by
\[ CV = \frac{445.3 \text{ g}}{3166.9 \text{ g}} \cdot 100\% = 14.1\%. \]
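The coefficient of variation, computed as in Example 1.3.13 as the standard deviation divided by the mean and expressed as a percentage, can be sketched in Python. The function name is an assumption, and the figures printed for Examples 1.3.11 and 1.3.12 are computed here from the stated means, variances, and standard deviations rather than quoted from the notes.

```python
import math

def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

# Example 1.3.13: birthweights
print(round(coefficient_of_variation(445.3, 3166.9), 1))        # 14.1

# Example 1.3.12: sales versus commissions
print(round(coefficient_of_variation(5, 87), 1))                # 5.7
print(round(coefficient_of_variation(773, 5225), 1))            # 14.8

# Example 1.3.11: variances are given, so take square roots first
print(round(coefficient_of_variation(math.sqrt(23), 132), 1))   # 3.6
print(round(coefficient_of_variation(math.sqrt(62), 182), 1))   # 4.3
```

Because the units cancel, the two percentages in each pair can be compared directly even when the underlying variables are measured on different scales.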
n Mean s CV (%)
Height (cm) 364 142.6 0.31 0.2
Weight (kg) 365 39.5 0.77 1.9
Triceps skin fold (mm) 362 15.2 0.51 3.4
Systolic blood pressure (mm Hg) 337 104.0 4.97 4.8
Diastolic blood pressure (mm Hg) 337 64.0 4.57 7.1
Total cholesterol (mg/dL) 395 160.4 3.44 2.1
HDL cholesterol (mg/dL) 349 56.9 5.89 10.4
Source: Foster, T. A., & Berenson, G. (1987). Measurement error and reli-
ability in four pediatric cross-sectional surveys of cardiovascular disease risk
factor variables - the Bogalusa Heart Study. Journal of Chronic Diseases,
40 (1), 13-21.
EXERCISES
(a) Range
(b) Standard deviation
5. For this data set, find the mean, variance, and standard deviation of the
variable. The data represent the serum cholesterol levels of 30 individu-
als.
$x$ = individual observation
$\bar{x}$ = sample mean
$n$ = sample size
Find the mean absolute deviation for these data: 5, 9, 10, 11, 11, 12, 15,
18, 20, 22.
fall above it. The median is the value that corresponds to the 50th percentile,
since one-half of the values fall below it and one-half of the values fall above
it. This section discusses these measures of position.
Quantiles
1. The median, x̃, divides the data set into two (2) equal parts.
2. The quartiles, $Q_k$ (k = 1, 2, 3), divide the data set into four (4) equal
parts.
3. The deciles, $D_k$ (k = 1, 2, . . . , 9), divide the data set into ten (10) equal
parts.
Remark 1.4.2. Percentiles have the advantage over the range of being less
sensitive to outliers and of not being greatly affected by the sample size, n.
2. Find the median of the data values. This is the value for Q2 .
3. Find the median of the data values that fall below Q2 . This is the value
for Q1 .
4. Find the median of the data values that fall above Q2 . This is the value
for Q3 .
Example 1.4.1. Compute the tenth and ninetieth percentiles for the birth-
weight data in Example 1.2.1.
1.4. MEASURES OF POSITION 41
Solution. First, arrange the sample in ascending order: 2069, 2581, 2759, 2834,
2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541,
3609, 3649, 4146.
(a) For k = 10, we have $\frac{nk}{100} = \frac{20(10)}{100} = 2$. Therefore, $P_{10}$ is the average of the
second and third largest observations, that is, $P_{10} = \frac{2581 + 2759}{2} = 2670$ g.
(b) For k = 90, we have $\frac{nk}{100} = \frac{20(90)}{100} = 18$. Therefore, $P_{90}$ is the average of
the 18th and 19th largest observations, that is, $P_{90} = \frac{3609 + 3649}{2} = 3629$ g.
Example 1.4.2. Compute for the sixth and seventh deciles for the aflatoxin
data in Example 1.2.2.
Solution. First, arrange the sample in ascending order: 16, 22, 23, 26, 26, 27,
28, 30, 31, 35, 36, 37, 48, 50, 52.
(a) For k = 6, we have $\frac{nk}{10} = \frac{15(6)}{10} = 9$. Therefore, $D_6$ is the average of the
ninth and tenth largest observations, that is, $D_6 = \frac{31 + 35}{2} = 33$.
(b) For k = 7, we have $\frac{nk}{10} = \frac{15(7)}{10} = 10.5$. Therefore, $D_7$ is the eleventh
observation, that is, $D_7 = 36$.
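The rule applied in Examples 1.4.1 and 1.4.2 appears to be: compute nk/100; if it is a whole number, average that observation and the next one in the ordered sample; otherwise round up and take that observation. The sketch below implements this reading, which is inferred from the worked examples; the function names and the restriction to integer k are assumptions.

```python
def percentile(values, k):
    """k-th percentile (integer k, 0 < k < 100) by the rule of the worked examples."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(n * k, 100)
    if remainder == 0:
        # nk/100 is an integer: average that observation and the next one
        return (ordered[whole - 1] + ordered[whole]) / 2
    # nk/100 is not an integer: round up and take that observation
    return ordered[whole]

def decile(values, k):
    """k-th decile, expressed as the (10k)-th percentile."""
    return percentile(values, 10 * k)

birthweights = [3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
                2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834]
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]

print(percentile(birthweights, 10), percentile(birthweights, 90))  # 2670.0 3629.0
print(decile(aflatoxin, 6), decile(aflatoxin, 7))                  # 33.0 36
```

Statistical packages use several slightly different percentile conventions, so results from library functions may differ from these hand-computed values.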
Example 1.4.3. The ages, in years, of the eight respondents in a health sur-
vey are as follows: 15, 13, 6, 5, 12, 50, 22, 18. Find its quartiles.
Solution. First we arrange the data in ascending order. That is, 5, 6, 12, 13,
15, 18, 22, 50. Computing for the median, $Q_2$, we see that with n = 8, we have
$Q_2 = \tilde{x} = \frac{13 + 15}{2} = 14$. Now, we consider the data values less than 14, that is,
5, 6, 12, 13. Getting its median, with the fact that it has four observations,
we have $Q_1 = \frac{6 + 12}{2} = 9$. Lastly, we consider the data values greater than
14, that is, 15, 18, 22, 50. Getting its median, with the fact that it has four
observations, we have $Q_3 = \frac{18 + 22}{2} = 20$.
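The quartile procedure used in this solution (median of the whole set, then the medians of the values below and above it) can be sketched as follows. The function name `quartiles` is an illustrative assumption.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 via the median of the values below and above the overall median."""
    ordered = sorted(values)
    q2 = median(ordered)
    lower = [v for v in ordered if v < q2]   # data values that fall below Q2
    upper = [v for v in ordered if v > q2]   # data values that fall above Q2
    return median(lower), q2, median(upper)

ages = [15, 13, 6, 5, 12, 50, 22, 18]
print(quartiles(ages))   # (9.0, 14.0, 20.0)
```

This sketch assumes there is at least one value strictly below and one strictly above the median; otherwise `statistics.median` raises an error on the empty half.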
Outliers
In addition to dividing the data set into four groups, quartiles can be used as
a rough measurement of variability.
IQR = Q3 − Q1 .
The IQR is interpreted as the range of the middle 50% of the data.
Remark 1.4.7. An outlier can strongly affect the mean and standard devi-
ation of a variable. For example, suppose a researcher mistakenly recorded
an extremely high data value. This value would then make the mean and
standard deviation of the variable much larger than they really were.
Remark 1.4.9. To identify outliers in a given data set, we employ the follow-
ing procedures:
5. Check the data set for any value that is smaller than Q1 − 1.5(IQR) or
larger than Q3 + 1.5(IQR). These data are outliers in the given data
set.
Remark 1.4.10. There are several reasons why outliers may occur.
2. The data value may have resulted from a recording error. That is, it
may have been written or typed incorrectly.
3. The data value may have been obtained from a subject that is not in
the defined population. For example, suppose test scores were obtained
from a seventh-grade class, but a student in that class was actually in the
sixth grade and had special permission to attend the class. This student
might have scored extremely low on that particular exam on that day.
2. When they occur naturally by chance, the statistician must make a de-
cision about whether to include them in the data set.
Example 1.4.4. Check the data set in Example 1.4.3 for outliers.
Solution. At first glance, the data value 50 is extremely suspect. To check for
an outlier, we employ the steps in Remark 1.4.9.
(a) We solve for the first and third quartiles. In Example 1.4.3, we see that
Q1 = 9 and Q3 = 20.
(b) Solving for the interquartile range, we see that IQR = Q3 − Q1 = 20 − 9 =
11.
(c) Multiply the interquartile range by 1.5, that is, 1.5(11) = 16.5.
(d) Subtract the value obtained in (c) from Q1 , and add the value obtained
in (c) to Q3 . That is, 9 − 16.5 = −7.5 and 20 + 16.5 = 36.5.
(e) Check the data set for any data values that fall outside the interval from
−7.5 to 36.5. Here, we see that the value 50 is outside this interval;
hence, it can be considered an outlier.
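The outlier check of Example 1.4.4 can be sketched in a few lines. The function name `iqr_outliers` is an assumption; the quartiles are obtained by the median-of-halves procedure described earlier in this section.

```python
from statistics import median

def iqr_outliers(values):
    """Return the lower fence, upper fence, and values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    ordered = sorted(values)
    q2 = median(ordered)
    q1 = median([v for v in ordered if v < q2])
    q3 = median([v for v in ordered if v > q2])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return low, high, [v for v in ordered if v < low or v > high]

ages = [15, 13, 6, 5, 12, 50, 22, 18]
print(iqr_outliers(ages))   # (-7.5, 36.5, [50])
```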
EXERCISES
1.1, 1.7, 1.9, 2.1, 2.2, 2.5, 3.3, 6.2, 6.8, 20.3
2. The average weekly earnings in dollars for various industries are listed
below. Find the quartiles of the given data set.
\[ \text{Midquartile} = \frac{Q_1 + Q_3}{2} \]
You can be too rigid in applying this scheme (as unfortunately, some academic
journals are). Frequently, ordinal data are coded in increasing numerical or-
der and averages are taken. Or, interval and ratio measurements are ranked
(i.e., reduced to ordinal status) and averages taken at that point. Even with
nominal data, we sometimes calculate averages. For example, coding male as
0 and female as 1 in a class of 100 students, the average is the proportion of
females in the class. Most statistical procedures for ordinal data implicitly use
a numerical coding scheme, even if this is not made clear to the user.
Sources:
The purpose of exploratory data analysis is to examine data to find out what
information can be discovered about the data such as the center and the spread.
Exploratory data analysis was developed by John Tukey and presented in his
book Exploratory Data Analysis (Addison-Wesley, 1977).
1. the lowest value of the data set, i.e., the minimum value
3. the median, x̃
5. the highest value of the data set, i.e., the maximum value
Remark 1.6.1. To construct a boxplot for a given data set, we employ the
following procedures:
1. Find the five-number summary for the data values, that is, the maximum
and minimum data values, Q1 and Q3 , and the median.
2. Draw a horizontal axis with a scale such that it includes the maximum
and minimum data values.
4. Draw a line from the minimum data value to the left side of the box and
a line from the maximum data value to the right side of the box.
1.6. EXPLORATORY DATA ANALYSIS 49
Solution. First, we arrange the data in ascending order. Doing so, we have
30, 39, 47, 48, 78, 89, 138, 164, 215, 296. Solving for the median, we have
x̃ = (78 + 89)/2 = 83.5. Next, solving for Q1, we consider the data values less
than 83.5, that is, 30, 39, 47, 48, 78. Solving for its median, we have Q1 = 47.
Next, considering the data values greater than 83.5, that is, 89, 138, 164, 215,
296, we see that its median is Q3 = 164. Following the procedure for
constructing a boxplot given in Remark 1.6.1, we see that the boxplot for the
number of meteorites found in 10 states of the United States is given by
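The computation in this solution can be sketched in Python. The function names are illustrative, not part of the notes. The quartiles are taken as the medians of the lower and upper halves, as in the solution; leaving the middle value out of both halves when the number of values is odd is an assumption, since the solution only covers an even-sized data set.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the values below and above the
    position of the overall median. Assumption: for an odd number
    of values, the middle value is left out of both halves.
    """
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]
    upper_half = values[(n + 1) // 2 :]
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


meteorites = [30, 39, 47, 48, 78, 89, 138, 164, 215, 296]
print(five_number_summary(meteorites))  # (30, 47, 83.5, 164, 296)
```

Other software may use a different quartile convention and return slightly different values of Q1 and Q3 for the same data.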
Remark 1.6.2. The following information can be obtained from the boxplot
of a given data set:
1. (a) If the median is near the center of the box, then the distribution is
approximately symmetric.
(b) If the median falls to the left of the center of the box, then the
distribution is positively skewed.
(c) If the median falls to the right of the center of the box, then the
distribution is negatively skewed.
2. (a) If the lines are about the same length, then the distribution is ap-
proximately symmetric.
(b) If the right line is longer than the left line, then the distribution is
positively skewed.
(c) If the left line is longer than the right line, then the distribution is
negatively skewed.
If the boxplots for two or more data sets are graphed on the same axis, the
distributions can be compared. To compare the averages, use the location of
the medians. To compare the variability, use the interquartile range, i.e., the
length of the boxes.
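The two rules of Remark 1.6.2 can be sketched as small Python functions. The function names and the 10% tolerance used to decide what counts as "near the center" or "about the same length" are assumptions made for this illustration; the remark itself gives no numerical cutoff.

```python
def skew_from_median(q1, med, q3, tolerance=0.1):
    """Rule 1: compare the median with the center of the box."""
    center = (q1 + q3) / 2
    box_length = q3 - q1
    # "Near the center" is taken to mean within 10% of the box length
    # (an assumed cutoff).
    if abs(med - center) <= tolerance * box_length:
        return "approximately symmetric"
    return "positively skewed" if med < center else "negatively skewed"


def skew_from_lines(minimum, q1, q3, maximum, tolerance=0.1):
    """Rule 2: compare the lengths of the left and right lines."""
    left = q1 - minimum
    right = maximum - q3
    # "About the same length" is taken to mean a difference of at most
    # 10% of the longer line (an assumed cutoff).
    if abs(right - left) <= tolerance * max(left, right):
        return "approximately symmetric"
    return "positively skewed" if right > left else "negatively skewed"


# Five-number summary of the meteorite data: 30, 47, 83.5, 164, 296
print(skew_from_median(47, 83.5, 164))     # positively skewed
print(skew_from_lines(30, 47, 164, 296))   # positively skewed
```

For the meteorite data both rules agree: the median lies to the left of the box center (105.5), and the right line (132) is much longer than the left line (17).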
Solution. We solve for the median and the first and third quartiles of each
of the two data sets.
(a) For the real cheese data, we first arrange the data set as follows: 40,
45, 90, 180, 220, 240, 310, 420. One can easily determine the values
x̃ = (180 + 220)/2 = 200, Q1 = (45 + 90)/2 = 67.5, and
Q3 = (240 + 310)/2 = 275.
(b) For the cheese substitute data, we first arrange the data set as follows:
130, 180, 250, 260, 270, 290, 310, 340. One can easily determine the
values x̃ = (260 + 270)/2 = 265, Q1 = (180 + 250)/2 = 215, and
Q3 = (290 + 310)/2 = 300.
The boxplots for each distribution are drawn on the same graph, as follows:
Figure 1.6: Boxplots for the Sodium Content of Real Cheese and Cheese
Substitute (one boxplot labeled Real Cheese, one labeled Cheese Substitute)
The cheese substitute data have a higher median (265) than the real cheese
data (200). The spread of the real cheese data is larger: its interquartile
range is 275 − 67.5 = 207.5, compared with 300 − 215 = 85 for the cheese
substitute data.
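The comparison of the two data sets can be reproduced with a short Python sketch. The helper name `quartiles` is illustrative; it uses the same median-of-halves convention as the solution, with `statistics.median` from the standard library.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    return (median(values[: n // 2]),
            median(values),
            median(values[(n + 1) // 2 :]))


real_cheese = [40, 45, 90, 180, 220, 240, 310, 420]
cheese_substitute = [130, 180, 250, 260, 270, 290, 310, 340]

for name, data in [("Real cheese", real_cheese),
                   ("Cheese substitute", cheese_substitute)]:
    q1, med, q3 = quartiles(data)
    # The median locates the center; the IQR is the length of the box.
    print(f"{name}: median = {med}, Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```

The medians (200 and 265) compare the centers, and the interquartile ranges (207.5 and 85) compare the spreads, matching the box positions and lengths in the figure.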
the median and interquartile range may more accurately summarize the data
than the mean and standard deviation, since the mean and standard deviation
are more affected in this case.
EXERCISES
1. Identify the five-number summary, find the interquartile range, and draw
the boxplot of each of the following data sets.
(a) 8, 12, 32, 6, 27, 19, 54
(b) 19, 16, 48, 22, 7
(c) 362, 589, 437, 316, 192, 188
(d) 147, 243, 156, 632, 543, 303
(e) 14.6, 19.8, 16.3, 15.5, 18.2
(f) 9.7, 4.6, 2.2, 3.7, 6.2, 9.4, 3.8
2. Construct a boxplot for the following data and comment on the shape
of the distribution representing the number of games pitched by major
league baseball’s earned run average (ERA) leaders for the past few
years.
30 34 29 30 34 29 31
30 27 34 32 33 34 27
3. Construct a boxplot for the following data which represents the number
of innings pitched by the ERA leaders for the past few years. Comment
on the shape of the distribution.
4. Construct a boxplot for these numbers of state sites for Frogwatch USA.
Is the distribution symmetric?
(a) Which month had the highest mean number of tornadoes for this
3-year period?
(b) Which year has the highest mean number of tornadoes for this 4-
month period?
7. Assume you work for OSHA (Occupational Safety and Health Adminis-
tration) and have complaints about noise levels from some of the workers
at a state power plant. You charge the power plant with taking decibel
readings at six different areas of the plant at different times of the day
and week. The results of the data collection are listed. Use boxplots
to initially explore the data and make recommendations about which
plant areas workers must be provided with protective ear wear. The safe
hearing level is at approximately 120 decibels.
49 57 38 73 81 74 59 76 65 69
54 56 69 68 78 65 85 49 69 61
48 81 68 37 43 78 82 43 64 67
52 56 81 77 79 85 40 85 59 80
60 71 57 61 69 61 83 90 87 74
Since little information can be obtained from looking at raw data, the re-
searcher organizes the data into what is called a frequency distribution. A
frequency distribution consists of classes and their corresponding frequencies.
1.7. FREQUENCY DISTRIBUTIONS 55
Each raw data value is placed into a quantitative or qualitative category called
a class. The frequency of a class then is the number of data values contained
in a specific class. A frequency distribution is shown for the preceding data
set.
Class Limits    Frequency
35-41           3
42-48           3
49-55           4
56-62           10
63-69           10
70-76           5
77-83           10
84-90           5
Total           50
Now some general observations can be made from looking at the frequency
distribution. For example, it can be stated that the majority of the wealthy
people in the study are over 55 years old.
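The frequency distribution above can be reproduced by counting how many of the 50 data values fall within each pair of class limits. The following Python sketch does this; the variable names are illustrative only.

```python
data = [
    49, 57, 38, 73, 81, 74, 59, 76, 65, 69,
    54, 56, 69, 68, 78, 65, 85, 49, 69, 61,
    48, 81, 68, 37, 43, 78, 82, 43, 64, 67,
    52, 56, 81, 77, 79, 85, 40, 85, 59, 80,
    60, 71, 57, 61, 69, 61, 83, 90, 87, 74,
]

class_limits = [(35, 41), (42, 48), (49, 55), (56, 62),
                (63, 69), (70, 76), (77, 83), (84, 90)]

# Frequency of a class = number of data values within its limits.
frequencies = [sum(1 for x in data if lower <= x <= upper)
               for lower, upper in class_limits]

for (lower, upper), f in zip(class_limits, frequencies):
    print(f"{lower}-{upper}  {f}")
print("Total", sum(frequencies))

# Values in the classes from 56-62 upward, i.e. values over 55.
over_55 = sum(frequencies[3:])
print(over_55, "of", len(data), "values are over 55")
```

The last line makes the "majority" observation concrete: 40 of the 50 values lie above 55.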
As we saw in the previous sections, there is no difficulty if the data set is small,
for we can arrange those few numbers and write them, say, in increasing order;
the result would be sufficiently clear. For fairly large data sets, the use of a
frequency distribution is a big help.
Remark 1.7.1. Two types of frequency distributions that are most often used
are the categorical frequency distribution and the grouped frequency distribu-
tion.
2. Tally the data and place the results in the second column.
3. Count the tallies and place the results in the third column.
6. Removing the column for the tally (optional) finishes the desired fre-
quency distribution.
Remark 1.7.3. Percentages are not normally part of a frequency distribution,
but they can be added since they are used in certain types of graphs such as pie
graphs. Also, the decimal equivalent of a percent is called a relative frequency.
Example 1.7.1. Twenty-five army inductees were given a blood test to deter-
mine their blood type. Construct a frequency distribution for the data. The
data set is given below.
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
Solution. Since the data are categorical, discrete classes can be used. There
are four blood types: A, B, O, and AB. These types will be used as the classes
for the distribution. Employing the procedures in Remark 1.7.2, we have
Removing the tally column, we see that the final frequency distribution is
Class    Frequency    Percent
A        5            20
B        7            28
O        9            36
AB       4            16
Total    25           100
For the sample, more people have type O blood than any other type.
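A categorical frequency distribution such as this one can be tallied with `collections.Counter` from the Python standard library. The sketch below counts the 25 blood types of Example 1.7.1 and adds the percentages mentioned in Remark 1.7.3; the layout of the printed table is an illustrative choice.

```python
from collections import Counter

blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

counts = Counter(blood_types)   # tally each class
total = len(blood_types)

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for blood_type in ["A", "B", "O", "AB"]:
    frequency = counts[blood_type]
    percent = 100 * frequency / total   # relative frequency times 100
    print(f"{blood_type:<6}{frequency:>10}{percent:>9.0f}")
print(f"{'Total':<6}{total:>10}{100:>9}")
```

Type O has the largest count (9 of 25, or 36%), in line with the observation in the solution.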
Definition 1.7.3.
1. Given a class, the endpoints of the class are called the class limits.
2. The lower class limit represents the smallest data value that can be
included in the class.
3. The upper class limit represents the largest value that can be included
in the class.
4. The numbers used to separate the classes so that there are no gaps in
the frequency distribution are called the class boundaries.
Remark 1.7.4. The basic rule of thumb is that the class limits should have
the same decimal place value as the data, but the class boundaries should have
one additional place value and end in a 5.
Remark 1.7.5. The class width can also be found by subtracting the lower
boundary from the upper boundary for any given class. Do not subtract the
limits of a single class. It will result in an incorrect answer.
Remark 1.7.6. The researcher must decide how many classes to use and the
width of each class. To construct a frequency distribution, follow these rules:
1. There should be between 5 and 20 classes. Although there is no hard-and-
fast rule for the number of classes contained in a frequency distribution,
it is of the utmost importance to have enough classes to present a clear
description of the collected data.
4. The classes must be continuous. Even if there are no values in a class, the
class must be included in the frequency distribution. There should be
no gaps in a frequency distribution. The only exception occurs when the
class with a zero frequency is the first or last class. A class with a zero
frequency at either end can be omitted without affecting the distribution.
6. The classes must be equal in width. This avoids a distorted view of the
data. One exception occurs when a distribution has a class that is open-
ended. That is, the class has no specific beginning value or no specific
ending value. A frequency distribution with an open-ended class is called
an open-ended distribution.
1. Determine the classes. This can be done by finding the highest and
lowest values in the data set. Afterwards, solve for the range, R.
3. Find the class width by dividing the range by the number of classes
desired. That is, width = R / (number of classes). Round the answer up to
4. Select a starting point for the lowest class limit. This can be the smallest
data value or any convenient number less than the smallest data value.
5. Add the width to the lowest score taken as the starting point to get the
lower limit of the next class. Keep adding until the number of desired
classes is achieved.
6. Subtract one unit from the lower limit of the second class to get the
upper limit of the first class. Then add the width to each upper limit to
get all the upper limits.
7. Find the class boundaries by subtracting 0.5 from each lower class limit
and adding 0.5 to each upper class limit.
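The steps above can be sketched in Python for whole-number data. The function name `build_classes` and its parameters are illustrative. Two assumptions are made: the width is always rounded up to the next whole number (even if the quotient is already whole), so that the highest value is sure to fit, and the unit of measurement is 1, so the boundaries lie 0.5 beyond the limits. The example uses the lowest value (37), highest value (90), and eight classes of the 50-value data set at the start of this section.

```python
def build_classes(lowest, highest, n_classes, start):
    """Class width, limits, and boundaries for whole-number data."""
    data_range = highest - lowest                  # step 1: the range R
    # Step 3: divide R by the number of classes and round up
    # (assumption: always up to the next whole number).
    width = data_range // n_classes + 1
    # Steps 4 and 5: lower limits, starting at `start`, spaced by width.
    lower_limits = [start + i * width for i in range(n_classes)]
    # Step 6: each upper limit is one unit below the next lower limit.
    upper_limits = [lower + width - 1 for lower in lower_limits]
    limits = list(zip(lower_limits, upper_limits))
    # Step 7: boundaries lie 0.5 below and above the limits.
    boundaries = [(lower - 0.5, upper + 0.5) for lower, upper in limits]
    return width, limits, boundaries


width, limits, boundaries = build_classes(
    lowest=37, highest=90, n_classes=8, start=35)

print(width)
for (lower, upper), (low_b, high_b) in zip(limits, boundaries):
    print(f"{lower}-{upper}   {low_b}-{high_b}")
```

This reproduces the classes 35-41 through 84-90 with width 7. It also illustrates Remark 1.7.5: subtracting boundaries gives 41.5 − 34.5 = 7, whereas subtracting the limits of a single class gives only 41 − 35 = 6.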
4. To enable the researcher to draw charts and graphs for the presentation
of data.
68 63 42 27 30 36 28 32 79 27
22 23 24 25 44 65 43 25 74 51
36 42 28 31 28 25 45 12 57 51
12 32 49 38 42 27 31 50 38 21
16 24 69 47 23 22 43 27 49 28
23 19 46 30 43 49 12
Construct a grouped frequency distribution with seven classes for the given
data.
Solution. We shall employ the procedures in Remark 1.7.7. First, note that
R = 79 − 12 = 67. Next, with seven desired classes, we see that the (class)
width is equal to 67/7 ≈ 9.6, which we round up to 10. Since the smallest
number is 12, we may begin our first interval with 10. The considerations
discussed so far lead to the following seven classes:
10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79
Solving for the class boundaries, tallying the data, and reflecting the corre-
sponding numerical frequencies from the tallies, we have
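The tallying step can be checked with a short Python sketch. Because the classes have equal width, the class of each value follows from integer division; the variable names are illustrative only.

```python
data = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 25, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]

start, width, n_classes = 10, 10, 7

# Class index of a value x: how many full widths it lies above `start`.
frequencies = [0] * n_classes
for x in data:
    frequencies[(x - start) // width] += 1

print("Limits   Boundaries   Frequency")
for i, f in enumerate(frequencies):
    lower = start + i * width
    upper = lower + width - 1
    print(f"{lower}-{upper}    {lower - 0.5}-{upper + 0.5}    {f}")
print("Total", sum(frequencies))
```

The frequencies for the classes 10-19 through 70-79 come out as 5, 19, 10, 13, 4, 4, and 2, which sum to the 57 data values.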
Construct a grouped frequency distribution with nine classes for the given data.
Solution. We shall employ the procedures in Remark 1.7.7. First, note that
R = 16.1 − 12.4 = 3.7. Next, with nine desired classes, we see that the (class)
width is equal to 3.7/9 ≈ 0.41, which we round up to 0.5. Since the smallest
number is 12.4, we may begin our first interval with 12.0. The considerations
discussed so far lead to the following nine classes:
Tallying the data, and reflecting the corresponding numerical frequencies from
the tallies, we have
EXERCISES
1. Find the class boundaries, midpoints, and widths for each class.
3. Name the two types of frequency distributions, and explain when each
should be used.
4. How many classes should frequency distributions have? Why should the
class width be an odd number?
Class Frequency
27-32 1
33-38 0
39-44 6
45-49 4
50-55 2
Class Frequency
5-9 1
9-13 2
13-17 5
17-20 6
20-24 3
Class Frequency
123-127 3
128-132 7
138-142 2
143-147 19
Class Frequency
9-13 1
14-19 6
20-25 2
26-28 5
29-32 9
12. The following are the daily fat intake (grams) of a group of 150 adult
males. Construct a frequency distribution with ten classes for the given
data set.
22 62 77 84 42 56 78 73 37 69
82 93 30 77 81 94 46 89 88 99
63 85 81 94 51 80 88 98 52 70
76 95 107 105 117 128 144 150 68 79
82 96 109 108 117 120 147 153 67 75
76 92 105 104 117 129 148 164 62 85
77 96 103 105 116 132 146 168 53 72
72 91 102 101 128 136 143 164 65 73
83 92 103 118 127 132 140 167 68 75
89 95 107 111 128 139 148 168 68 79
82 96 109 108 117 130 147 153 91 102
117 129 137 141 96 105 117 125 135 143
93 100 114 124 135 142 97 102 119 125
138 142 95 100 116 121 131 152 93 106
114 127 133 155 97 106 119 122 134 151
1.8. HISTOGRAMS, FREQUENCY POLYGONS, AND OGIVES 67
13. The following data give the percentage saturation of bile for 29
women:
65 58 52 91 84 107
86 98 35 128 116 84
76 146 55 75 73 120
89 80 127 82 87 123
142 66 77 69 76
Construct a frequency distribution with six classes for the given data set.
14. The following frequency distribution was obtained for the preoperational
percentage hemoglobin values of a group of subjects from a village where
there has been a malaria eradication program (MEP):
43 63 63 75 95 75 80 48 62 71 76 90
51 61 74 103 93 82 74 65 63 53 64 67
80 77 60 69 73 76 91 55 65 69 50 68
72 89 75 57 66 79 85 70 87 67 72 52
35 67 99 81 97 74 84 78 59 71 61 62
convey the data to the viewers in pictorial form. It is easier for most people
to comprehend the meaning of data presented graphically than data presented
numerically in tables or frequency distributions. This is especially true if the
users have little or no statistical knowledge.
Statistical graphs can be used to describe the data set or to analyze it. Graphs
are also useful in getting the audience’s attention in a publication or a speaking
presentation. They can be used to discuss an issue, reinforce a critical point,
or summarize a data set. They can also be used to discover a trend or pattern
in a situation over a period of time.
1. the histogram
The Histogram
Definition 1.8.1. The histogram is a graph that displays the data by using
contiguous vertical bars (unless the frequency of a class is 0) of various heights
to represent the frequencies of the classes.
(a) The horizontal scale represents the value of the variable marked at in-
terval boundaries.
(b) The vertical scale represents the frequency or relative frequency in each
interval.
Example 1.8.1. These data represent the record high temperatures in degrees
Fahrenheit (°F) for each of the 50 states in the USA.
112 100 110 118 107 112 116 108 120 113
127 120 134 117 116 118 114 115 118 110
121 113 120 117 105 118 105 110 122 114
114 117 118 122 120 119 111 110 118 112
109 112 105 109 106 110 104 111 114 114
(a) Construct a grouped frequency distribution for the data using 7 classes.
Solution.
(a) We shall employ the procedures in Remark 1.7.7. First, note that R =
134 − 100 = 34. Next, with seven desired classes, we see that the (class)
width is equal to 34/7 ≈ 4.9, which we round up to 5. Since the smallest
number is 100, we may begin our first interval with this. The
considerations discussed so far lead to the following seven classes:
100-104, 105-109, 110-114, 115-119, 120-124, 125-129, 130-134
(b) To construct the histogram, we first draw and label the x and y axes. The
x-axis is always the horizontal axis, and the y-axis is always the vertical
axis. Represent the frequency on the y-axis and the class boundaries on
the x-axis. Using the frequencies as the heights, draw vertical bars for
each class. Thus, the following histogram is constructed.
Figure 1.7: Histogram for Record High Temperatures in the 50 States of USA
(horizontal axis: Temperature (°F); vertical axis: Frequency)
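The bar heights of the histogram can be computed from the class boundaries, as in the following Python sketch. The text rendering with `#` characters is only an illustrative stand-in for the drawn figure, turned on its side, with one row per class.

```python
temperatures = [
    112, 100, 110, 118, 107, 112, 116, 108, 120, 113,
    127, 120, 134, 117, 116, 118, 114, 115, 118, 110,
    121, 113, 120, 117, 105, 118, 105, 110, 122, 114,
    114, 117, 118, 122, 120, 119, 111, 110, 118, 112,
    109, 112, 105, 109, 106, 110, 104, 111, 114, 114,
]

lower_limits = [100, 105, 110, 115, 120, 125, 130]
width = 5

frequencies = []
for lower in lower_limits:
    low_b = lower - 0.5        # lower class boundary
    high_b = low_b + width     # upper class boundary
    # Height of the bar = number of values between the boundaries.
    f = sum(1 for t in temperatures if low_b < t < high_b)
    frequencies.append(f)
    print(f"{low_b:>5}-{high_b:<5} | {'#' * f} {f}")
```

The tallest bar belongs to the class with boundaries 109.5-114.5, which contains 18 of the 50 values.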
Definition 1.8.2. The frequency polygon is a graph that displays the data
by using lines that connect points plotted for the frequencies at the midpoints
of the classes. The frequencies are represented by the heights of the points.
Remark 1.8.3. The frequency polygon can also be shown without the his-
togram on the same graph.
Remark 1.8.4. The frequency polygon and the histogram are two different
ways to represent the same data set. The choice of which one to use is left to
the discretion of the researcher.
Solution. We first find the midpoints of each class. Doing so, we have
To draw the frequency polygon, we first draw and label the x and y axes. The
x-axis is always the horizontal axis, and the y-axis is always the vertical axis.
Represent the frequency on the y-axis and the class midpoints on the x-axis.
Using these, we then plot the points. Finally, connecting adjacent points with
line segments, the following frequency polygon is constructed.
Figure 1.8: Frequency Polygon for Record High Temperatures in the 50 States
of USA (horizontal axis: Temperature (°F); vertical axis: Frequency)
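The points of the frequency polygon can be listed with a short Python sketch. Each midpoint is the average of the lower and upper class limits. The frequencies used here are those obtained by tallying the data of Example 1.8.1 into the seven classes; the variable names are illustrative.

```python
class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]

# Frequencies tallied from the data of Example 1.8.1.
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Midpoint of a class = (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Points (midpoint, frequency) that are joined by line segments.
points = list(zip(midpoints, frequencies))
for midpoint, frequency in points:
    print(midpoint, frequency)
```

The midpoints run from 102 to 132 in steps of 5, the class width.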
The Ogive
The third type of graph that can be used represents the cumulative frequencies
for the classes. This type of graph is called the cumulative frequency graph,
or ogive.
Definition 1.8.3.
2. The ogive is a graph that represents the cumulative frequencies for the
classes in a frequency distribution.
Solution. We first find the cumulative frequency of each class. Doing so, we
have
Temperature (°F)     Cumulative Frequency
Less than 99.5       0
Less than 104.5      2
Less than 109.5      10
Less than 114.5      28
Less than 119.5      41
Less than 124.5      48
Less than 129.5      49
Less than 134.5      50
To draw the ogive, we first draw and label the x and y axes. The x-axis is
always the horizontal axis, and the y-axis is always the vertical axis. Represent
the cumulative frequency on the y-axis and the class boundaries on the x-axis.
Using these, we then plot the points. Finally, connecting adjacent points with
line segments, the following ogive is constructed.
Figure 1.9: Ogive for Record High Temperatures in the 50 States of USA
(horizontal axis: Temperature (°F), marked at the class boundaries 99.5, 104.5,
109.5, 114.5, 119.5, 124.5, 129.5, 134.5; vertical axis: Cumulative Frequency,
from 0 to 50)
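The cumulative frequencies plotted in the ogive are running totals of the class frequencies. The sketch below computes them with `itertools.accumulate` from the Python standard library; the class frequencies are the successive differences of the cumulative frequency table above, and the variable names are illustrative.

```python
from itertools import accumulate

upper_boundaries = [104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Running total of the frequencies up to each upper class boundary.
cumulative = list(accumulate(frequencies))

# The ogive starts at the first lower boundary with cumulative frequency 0.
ogive_points = [(99.5, 0)] + list(zip(upper_boundaries, cumulative))
for boundary, total in ogive_points:
    print(f"Less than {boundary}: {total}")
```

The last cumulative frequency equals the number of data values, 50, which serves as a quick check on the tally.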
EXERCISES
1. The number of faculty listed for a variety of private colleges which offer
only bachelor’s degrees is listed below. Use these data to construct a
frequency distribution with 7 classes, a histogram, a frequency polygon,
and an ogive. Discuss the shape of this distribution.
23 30 20 27 44 26 35 20 29 29
25 15 18 27 19 22 12 26 34 15
27 35 26 43 35 14 24 12 23 31
40 35 38 57 22 42 24 21 27 33