
Statistics (Chapter 1)

1. The document provides lecture notes on descriptive statistics, including definitions of key terms like data, variables, levels of measurement, and qualitative vs. quantitative variables. 2. It discusses the four levels of measurement - nominal, ordinal, interval, and ratio - and provides examples to illustrate each level. Precise differences in measurement and the concept of a true zero help distinguish the levels. 3. Various examples are given to illustrate qualitative and quantitative variables as well as the different levels of measurement. These include subject taught, gender, zip codes, intelligence scores, and physical attributes.

Uploaded by

Renz Moneda


Statistical Analysis
with Applications in Engineering and Sciences
Lecture Notes

Adolfo Martin O. Soliman
Polytechnic University of the Philippines
Manila, 2019

Last revision: December 1, 2020.


1

Descriptive Statistics

1.1 Data and Levels of Measurement

You may be familiar with probability and statistics through radio, television,
newspapers, and magazines. For example, you may have read statements like
"68% of Filipinos are still confident that the Philippines will again rise in the
economic scene."

Statistics is used in almost all fields of human endeavor. In sports, for ex-
ample, a statistician may keep records of the number of points a point guard
scored during a basketball game, or the number of hits a baseball player gets
in a season. In other areas, such as public health, an administrator might be
concerned with the number of residents who contract a new strain of flu virus
during a certain year. In education, a researcher might want to know if new
methods of teaching are better than old ones. These are only a few examples
of how statistics can be used in various occupations.

Furthermore, statistics is used to analyze the results of surveys and as a tool
in scientific research to make decisions based on controlled experiments. Other
uses of statistics include operations research, quality control, estimation, and
prediction.


Definition 1.1.1.

1. Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.

2. A variable is a characteristic or attribute that can assume different values.

3. Data (sing. datum) are the values (measurements or observations) that the variables can assume.

Remark 1.1.1. The body of knowledge called statistics is sometimes divided into two parts:

(a) Descriptive statistics consists of the collection, organization, summarization, and presentation of data.

(b) Inferential statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.

Remark 1.1.2. Variables can be classified as qualitative or quantitative.

(a) Qualitative (categorical) variables are variables that can be placed into distinct categories, according to some characteristic or attribute.

(b) Quantitative variables are variables which are numerical and can be ordered or ranked.

Remark 1.1.3. Quantitative variables can be further classified into two groups.

(a) Discrete variables can be assigned values such as 0, 1, 2, . . . , and are said to be countable.

(b) Continuous variables can be assigned an infinite number of values between two specific values. They are obtained by measuring; moreover, they often include fractions and decimals.

Remark 1.1.4. The classification of variables can be summarized as follows:

[Figure 1.1: Classification of Variables. Data are either qualitative or quantitative; quantitative data are either discrete or continuous.]

In addition to being classified as qualitative or quantitative, variables can
be classified by how they are categorized, counted, or measured. This type
of classification - i.e., how variables are categorized, counted, or measured -
uses measurement scales, and four common types of scales are used: nominal,
ordinal, interval, and ratio.

Definition 1.1.2. (Levels of Measurement)

1. The nominal level of measurement classifies data into mutually exclusive (nonoverlapping), exhaustive categories in which no order or ranking can be imposed on the data.

2. The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.

3. The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.

4. The ratio level of measurement possesses all characteristics of interval measurement, and there exists a true zero.

To determine the level of measurement, one may use the following flowchart:

[Figure 1.2: Flowchart in Determining the Level of Measurement of Data. Can the data be ranked or ordered? If no, the data are at the nominal level. If yes, are there precise variations between ranks? If no, the data are at the ordinal level. If yes, is zero defined as null? If no, the data are at the interval level; if yes, the data are at the ratio level.]
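The three questions in the flowchart can be sketched as a short Python function. This is only an illustration: the function name, its parameters, and the yes/no answers assigned to each variable are assumptions made here, based on one reading of Definition 1.1.2 and the examples in this section.

```python
def level_of_measurement(can_be_ranked: bool,
                         precise_differences: bool = False,
                         true_zero: bool = False) -> str:
    """Walk the three yes/no questions of the flowchart in order."""
    if not can_be_ranked:
        return "nominal"
    if not precise_differences:
        return "ordinal"
    if not true_zero:
        return "interval"
    return "ratio"


# The answers below are this sketch's reading of the examples in the notes.
examples = {
    "zip code": (False, False, False),
    "letter grade": (True, False, False),
    "temperature in degrees F": (True, True, False),
    "height": (True, True, True),
}

for variable, answers in examples.items():
    print(f"{variable}: {level_of_measurement(*answers)}")
```

Each question is asked only if the previous one was answered "yes", which mirrors the left-to-right order of the flowchart.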

Example 1.1.1. (a) A sample of college instructors classified according to subject taught (e.g., English, history, psychology, or mathematics) is an example of nominal-level measurement.

(b) Classifying survey subjects as male or female is another example of nominal-level measurement as no ranking or order can be placed on the data.

(c) Classifying residents according to zip codes is also an example of the nominal level of measurement. Even though numbers are assigned as zip codes, there is no meaningful order or ranking.

(d) Other examples of nominal-level data are political party (Democratic, Republican, Independent, etc.), religion (Christianity, Judaism, Islam, etc.), and marital status (single, married, divorced, widowed, separated).


Example 1.1.2. (a) From student evaluations, guest speakers might be ranked as superior, average, or poor.

(b) Floats in a homecoming parade might be ranked as first place, second place, etc.

(c) Note that precise measurement of differences in the ordinal level of measurement does not exist. For instance, when people are classified according to their build (small, medium, or large), a large variation exists among the individuals in each class.

(d) Other examples of ordinal data are letter grades (A, B, C, D, F).




Example 1.1.3. (a) The interval level differs from the ordinal level in that precise differences do exist between units. For example, many standardized psychological tests yield values measured on an interval scale. IQ is an example of such a variable. There is a meaningful difference of 1 point between an IQ of 109 and an IQ of 110.

(b) Temperature is another example of interval measurement, since there is a meaningful difference of 1°F between each unit, such as 72°F and 73°F.

(c) One property is lacking in the interval scale: there is no true zero. For example, IQ tests do not measure people who have no intelligence. For temperature, 0°F does not mean no heat at all.


Example 1.1.4. (a) Examples of ratio scales are those used to measure
height, weight, area, and number of phone calls received.

(b) Ratio scales have differences between units (1 inch, 1 pound, etc.) and
a true zero.

(c) In addition, the ratio scale contains a true ratio between values. For
example, if one person can lift 200 pounds and another can lift 100
pounds, then the ratio between them is 2 to 1. Put another way, the
first person can lift twice as much as the second person.


EXERCISES

1. Name and define the two areas of statistics.

2. Suggest some ways statistics can be used in everyday life.

3. Explain the differences between a sample and a population.

4. Why are samples used in statistics?

?5. In each of these statements, tell whether descriptive or inferential statistics have been used.

(a) In the year 2010, 148 million Americans will be enrolled in an HMO.
(Source: USA TODAY )
(b) Nine out of ten on-the-job fatalities are men.
(Source: USA TODAY Weekend )
(c) Expenditures for the cable industry were $5.66 billion in 1996.
(Source: USA TODAY )
(d) The median household income for people aged 25-34 is $35,888.
(Source: USA TODAY )
(e) Allergy therapy makes bees go away. (Source: Prevention)

(f) Drinking decaffeinated coffee can raise cholesterol levels by 7%.


(Source: American Heart Association).
(g) The national average annual medicine expenditure per person is
$1052. (Source: The Greensburg Tribune Review )
(h) Experts say that mortgage rates may soon hit bottom.
(Source: USA TODAY )

?6. Classify each as nominal-level, ordinal-level, interval-level, or ratio-level measurement.

(a) Pages in the novel of Lang Leav.


(b) Rankings of tennis players.
(c) Weights of air conditioners.
(d) Temperatures inside 10 refrigerators.
(e) Salaries of top five CEOs in the Makati Business District
(f) Ratings of eight local basketball plays (poor, fair, good, excellent)
(g) Times required for mechanics to do a tune-up.
(h) Ages of students in a classroom.
(i) Marital status of patients in a physician’s office.
(j) Horsepower of tractor engines.

?7. Classify each variable as qualitative or quantitative.

(a) Number of bicycles sold in 1 year by a large sporting goods store.


(b) Colors of baseball caps in a store.
(c) Times it takes to cut a lawn.
(d) Capacity in cubic feet of six truck beds.
(e) Classification of children in a day care center (infant, toddler, preschool).
(f) Weights of fish caught in Lake George.
(g) Marital status of faculty members in a large university.

?8. Classify each variable as discrete or continuous.

(a) Number of doughnuts sold each day by Dunkin Donuts.


(b) Water temperatures of six swimming pools in Bulacan on a given
day.
(c) Weights of cats in a pet shelter.
(d) Lifetime (in hours) of 12 flashlight batteries.
(e) Number of cheeseburgers sold each day by a hamburger stand on a
college campus.
(f) Number of DVDs rented each day by a video store.
(g) Capacity (in gallons) of six reservoirs in Luzon.

9. Give three examples each of nominal, ordinal, interval, and ratio data.

1.2 Measures of Central Tendency


In the book American Averages by Mike Feinsilber and William B. Mead, the
authors state:

"Average" when you stop to think of it is a funny concept. Although
it describes all of us it describes none of us... While none of us
wants to be the average American, we all want to know about
him or her.

The authors go on to give examples of averages:

The average American man is five feet, nine inches tall; the average
woman is five feet, 3.6 inches.
The average American is sick in bed seven days a year missing five
days of work.
On the average day, 24 million people receive animal bites.
By his or her 70th birthday, the average American will have eaten
14 steers, 1050 chickens, 3.5 lambs, and 25.2 hogs.

In these examples, the word average is ambiguous, since several different meth-
ods can be used to obtain an average. Loosely stated, the average means the
center of the distribution or the most typical case. Measures of average are
also called measures of central tendency and include the mean, median, and
mode.

Knowing the average of a data set is not enough to describe the data set en-
tirely. Even though a shoe store owner knows that the average size of a man’s
shoe is size 10, she would not be in business very long if she ordered only size
10 shoes.

The previous section stated that statisticians use samples taken from popula-
tions; however, when populations are small, it is not necessary to use samples
since the entire population can be used to gain information. For example,
suppose an insurance manager wanted to know the average weekly sales of
all the company’s representatives. If the company employed a large number
of salespeople, say, nationwide, he would have to use a sample and make an
inference to the entire sales force. But if the company had only a few sales-
people, say, only 87 agents, he would be able to use all representatives’ sales
for a randomly chosen week and thus, use the entire population.

Measures found by using all the data values in the population are called pa-
rameters. Measures obtained by using the data values from samples are called
statistics.

Definition 1.2.1.

1. A statistic is a characteristic or measure obtained by using the data values from a sample.

2. A parameter is a characteristic or measure obtained by using all the data values from a specific population.

General Rounding Rule


In statistics, the basic rounding rule is that rounding should not be done
until the final answer is calculated. Rounding in the intermediate steps tends
to increase the difference between the computed answer and the exact one.
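The effect of rounding too early can be seen in a small Python sketch. The three values below are made up for illustration and do not come from the notes; the variable names are likewise assumptions.

```python
# Illustrative values only: three equal thirds, whose exact sum is 1.
values = [1 / 3, 1 / 3, 1 / 3]

# Rounding each term to two decimals before adding (not recommended).
rounded_early = round(sum(round(v, 2) for v in values), 2)

# Carrying full precision and rounding only the final answer.
rounded_late = round(sum(values), 2)

print(rounded_early)  # 0.99
print(rounded_late)   # 1.0
```

Rounding each term first loses 0.01 from the total, while rounding only at the end recovers the exact answer.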

The Arithmetic Mean


How to define the middle of a sample may seem obvious, but the more you
think about it, the less obvious it becomes.

Definition 1.2.2. The (arithmetic) mean is the sum of all the observations
divided by the number of observations.

1. The symbol $\bar{x}$ (read as "x-bar") represents the sample mean, given by
\[
\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n},
\]

where n represents the sample size.

2. The symbol $\mu$ (Greek: "mu") represents the population mean, given by
\[
\mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N},
\]

where N represents the population size.

Remark 1.2.1. In statistics, Greek letters are used to denote parameters,
and Roman letters are used to denote statistics. Assume that the data are
obtained from samples unless otherwise specified.

Remark 1.2.2. (Rounding Rule for the Mean) The mean should be
rounded to one more decimal place than occurs in the raw data.

Remark 1.2.3. The arithmetic mean is, in general, a very natural measure
of location. One of its main limitations, however, is that it is oversensitive to
extreme values. In this instance, it may not be representative of the location
of the great majority of sample points.

Remark 1.2.4. It is possible in extreme cases for all but one of the sample
points to be on one side of the arithmetic mean. In these types of samples,
the arithmetic mean is a poor measure of central location because it does not
reflect the center of the sample.

Example 1.2.1. Suppose the sample consists of the birthweights of all live-
born infants born at a private hospital in Pasig City, during a one-week period.

Table 1.1: Sample of birthweights (g) of live-born infants born at a private hospital in Pasig City, during a one-week period

 i    x_i      i    x_i      i    x_i      i    x_i
 1   3265      6   3323     11   2581     16   2759
 2   3260      7   3649     12   2841     17   3248
 3   3245      8   3200     13   3609     18   3314
 4   3484      9   3031     14   2838     19   3101
 5   4146     10   2069     15   3541     20   2834

The (arithmetic) mean for the given sample of birthweights is
\[
\bar{x} = \frac{3265 + 3260 + \cdots + 2834}{20} = \frac{63338}{20} = 3166.9 \text{ g}.
\]


Example 1.2.2. The following data deal with the aflatoxin levels of raw
peanut kernels as described by Quesenberry et al. (1976). Approximately
560 g of ground meal was divided among 16 centrifuge bottles and analyzed.
One sample was lost, so that only 15 readings are available (measurement units
are not given). The values were 30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35,
52, 28, 37. The mean aflatoxin level of the readings is
\[
\bar{x} = \frac{30 + 26 + \cdots + 28 + 37}{15} = \frac{487}{15} \approx 32.5.
\]
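The two means above can be checked with a few lines of Python. This is a minimal sketch of Definition 1.2.2; the function name `sample_mean` and the variable names are chosen here for illustration, and the data are copied from Table 1.1 and Example 1.2.2.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


# Table 1.1: birthweights (g), listed in order of i = 1, ..., 20.
birthweights = [3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
                2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834]

# Example 1.2.2: aflatoxin readings.
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]

# Round only at the end, to one more decimal place than the raw data.
print(round(sample_mean(birthweights), 1))  # 3166.9
print(round(sample_mean(aflatoxin), 1))     # 32.5
```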


The Median
An article recently reported that the median income for college professors was
$43,250. This measure of central tendency means that one-half of all the profes-
sors surveyed earned more than $43,250, and one-half earned less than $43,250.

The median is the halfway point in a data set. Before you can find this point,
the data must be arranged in order. The median either will be a specific value
in the data set or will fall between two values.

Definition 1.2.3. Suppose there are n observations in a sample. If these observations are arranged in ascending order, then the sample median, denoted by $\tilde{x}$ (read as "x-tilde"), is defined as

1. the $\left(\frac{n+1}{2}\right)$th largest observation, if n is odd;

2. the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th largest observations, if n is even.

Remark 1.2.5. The rationale for these definitions is to ensure an equal num-
ber of sample points on both sides of the sample median.

Remark 1.2.6. The median is defined differently when n is even and odd
because it is impossible to achieve this goal with one uniform definition.

Remark 1.2.7. Samples with an odd sample size have a unique central point,
while samples with an even sample size have no unique central point, and the
middle two values must be averaged.

Remark 1.2.8. The main weakness of the sample median is that it is deter-
mined mainly by the middle points in a sample and is less sensitive to the
actual numeric values of the remaining data points.

Remark 1.2.9. When the data set is ordered, it is called a data array.

Example 1.2.3. Consider the data set in Table 1.2, which consists of white-
blood counts taken upon admission of all patients entering a small hospital in
Quezon City, on a given day. Compute the median white-blood count.

Table 1.2: Sample of admission white-blood counts (×1000) for all patients
entering a hospital in Quezon City, on a given day

i xi i xi i xi
1 7 4 9 7 10
2 35 5 8 8 12
3 5 6 3 9 8

Solution. First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35. Because
n = 9 is odd, the sample median is given by the fifth largest point, which equals
8, or 8000 on the original scale.

Example 1.2.4. Compute the sample median for the sample in Example 1.2.1.

Solution. First, arrange the sample in ascending order: 2069, 2581, 2759, 2834,
2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541,
3609, 3649, 4146. Because n = 20 is even, we have
\[
\tilde{x} = \text{average of the 10th and 11th largest observations} = \frac{3245 + 3248}{2} = 3246.5 \text{ g}.
\]



Example 1.2.5. Compute the sample median for the sample in Example 1.2.2.

Solution. First, arrange the sample in ascending order: 16, 22, 23, 26, 26, 27,
28, 30, 31, 35, 36, 37, 48, 50, 52. Because n = 15 is odd, then, we have

x̃ = 8th observation = 30.
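The odd and even cases of Definition 1.2.3 can be sketched in Python as follows. The function name `sample_median` is an assumption of this sketch; the data are those of Examples 1.2.1 to 1.2.3.

```python
def sample_median(values):
    """Median per Definition 1.2.3: sort, then take the middle point(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (index mid, counting from 0).
        return ordered[mid]
    # Even n: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


white_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]          # Table 1.2 (x1000)
birthweights = [3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
                2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834]
aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]

print(sample_median(white_counts))   # 8
print(sample_median(birthweights))   # 3246.5
print(sample_median(aflatoxin))      # 30
```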

The Mode
The third measure of average is called the mode. The mode is the value that
occurs most often in the data set. It is sometimes said to be the most typical
case.

Definition 1.2.4. The mode is the most frequently occurring value among
all the observations in a sample. It is denoted by x̂ (read as "x-hat").

Remark 1.2.10. A data set can have more than one mode or no mode at all.

Remark 1.2.11. When no data value occurs more than once, the data set is
said to have no mode.

Remark 1.2.12. A data set that has only one value that occurs with the
greatest frequency is said to be unimodal. If a data set has two values that
occur with the same greatest frequency, both values are considered to be the
mode and the data set is said to be bimodal. If a data set has more than two
values that occur with the same greatest frequency, each value is used as the
mode, and the data set is said to be multimodal.

Example 1.2.6. Compute the mode of the sample in Example 1.2.1.

Solution. There is no mode, because all the values occur exactly once.

Example 1.2.7. Compute the mode of the sample in Example 1.2.2.

Solution. The observation 26 is the most frequent, occurring twice in the data
set. Therefore, $\hat{x} = 26$.

Example 1.2.8. Compute the mode of the sample in Example 1.2.3.

Solution. The observation 8 is the most frequent, occurring twice in the data
set. Therefore, $\hat{x} = 8000$, based on the original scale.

Example 1.2.9. Consider the sample of time intervals between successive
menstrual periods for a group of 500 college women age 18 to 21 years, shown
in Table 1.3. The frequency column gives the number of women who reported
each of the respective durations. The mode is 28 because it is the most
frequently occurring value.

Table 1.3: Sample of time intervals between successive menstrual periods (days) in college-age women

Value  Frequency    Value  Frequency    Value  Frequency
  24        5         29       96         34        7
  25       10         30       63         35        3
  26       28         31       24         36        2
  27       64         32        9         37        1
  28      185         33        2         38        1
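The mode examples above can be reproduced with a short Python sketch. The function name `modes` and the choice to return an empty list when no value repeats (following Remark 1.2.11) are assumptions of this sketch; the data are copied from Examples 1.2.2 and 1.2.3 and from Table 1.3.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the greatest frequency.

    An empty list means "no mode": no value occurs more than once.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


aflatoxin = [30, 26, 26, 36, 48, 50, 16, 31, 22, 27, 23, 35, 52, 28, 37]
white_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]

# Table 1.3 is already a frequency table: value -> frequency.
cycle_frequency = {24: 5, 25: 10, 26: 28, 27: 64, 28: 185, 29: 96, 30: 63,
                   31: 24, 32: 9, 33: 2, 34: 7, 35: 3, 36: 2, 37: 1, 38: 1}

print(modes(aflatoxin))                                # [26]
print(modes(white_counts))                             # [8]
print(max(cycle_frequency, key=cycle_frequency.get))   # 28
```

Returning a list rather than a single number lets the same function report unimodal, bimodal, and multimodal data sets.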

Other Types of Mean


Aside from the arithmetic mean, this section discusses other types of
statistical means that are useful when averaging various quantities in
particular fields. These means are the weighted mean, harmonic mean,
geometric mean, and quadratic mean.

The Weighted Mean


Sometimes, you must find the mean of a data set in which not all values are
equally represented. The type of mean that accounts for this additional
factor, the weight of each value, is called the weighted mean.

Definition 1.2.5. The weighted mean of a variable X, denoted by $\bar{x}_W$, is obtained by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights. That is,
\[
\bar{x}_W = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i},
\]
where $w_1, w_2, \ldots, w_n$ are the weights of $x_1, x_2, \ldots, x_n$, respectively.
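The notes give no worked example for the weighted mean, so the following Python sketch uses hypothetical data: three course grades weighted by their credit units. The function name `weighted_mean`, the grades, and the weights are all invented here for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical data: grades in three courses and their credit units.
grades = [85, 90, 78]
units = [3, 4, 2]

# (85*3 + 90*4 + 78*2) / (3 + 4 + 2) = 771 / 9
print(round(weighted_mean(grades, units), 1))  # 85.7
```

When all weights are equal, the formula reduces to the arithmetic mean of Definition 1.2.2.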

The Harmonic Mean

Definition 1.2.6. The harmonic mean (HM) is defined as the number of values divided by the sum of the reciprocals of each value. That is,
\[
HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}.
\]
This mean is useful for finding the average speed.

Example 1.2.10. The harmonic mean of 1, 4, 5, and 2 is
\[
HM = \frac{4}{\frac{1}{1} + \frac{1}{4} + \frac{1}{5} + \frac{1}{2}} = \frac{4}{1.95} \approx 2.1.
\]


Example 1.2.11. Suppose a person drove 100 miles at 40 miles per hour and
returned driving 50 miles per hour. The average miles per hour is not 45 miles
per hour, which is found by adding 40 and 50 and dividing by 2. The average
is found as shown.

Since time = distance / rate, then
\[
\text{Time 1} = \frac{100}{40} = 2.5 \text{ hours to make the trip},
\]
\[
\text{Time 2} = \frac{100}{50} = 2 \text{ hours to return}.
\]
Hence, the total time is 4.5 hours, and the total distance driven is 200 miles.
Now, the average speed is
\[
\text{rate} = \frac{\text{distance}}{\text{time}} = \frac{200}{4.5} \approx 44.44 \text{ miles per hour}.
\]
This value can also be found by using the harmonic mean, that is,
\[
HM = \frac{2}{\frac{1}{40} + \frac{1}{50}} \approx 44.44.
\]
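The agreement between the two calculations above can be checked in Python. This is a sketch of Definition 1.2.6; the function name `harmonic_mean` and the variable names are assumptions made here.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)


# Example 1.2.11: 100 miles out at 40 mph, 100 miles back at 50 mph.
distance_each_way = 100
speeds = [40, 50]

total_time = sum(distance_each_way / s for s in speeds)      # 2.5 + 2 hours
average_speed = (distance_each_way * len(speeds)) / total_time

print(round(average_speed, 2))                # 44.44
print(round(harmonic_mean(speeds), 2))        # 44.44
print(round(harmonic_mean([1, 4, 5, 2]), 1))  # 2.1 (Example 1.2.10)
```

The two results coincide because both legs cover the same distance; for legs of unequal length, the total-distance over total-time calculation is the one to use.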


The Geometric Mean

Definition 1.2.7. The geometric mean (GM) is defined as the nth root of
the product of n values. That is,
\[
GM = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}.
\]
The geometric mean is useful in finding the average of percentages, average of
ratios, average of indices, or average of growth rates.

Example 1.2.12. The geometric mean of 4 and 16 is given by
\[
GM = \sqrt{(4)(16)} = \sqrt{64} = 8.
\]



Example 1.2.13. The geometric mean of 1, 3, and 9 is given by
\[
GM = \sqrt[3]{(1)(3)(9)} = \sqrt[3]{27} = 3.
\]

Example 1.2.14. If a person receives a 20% raise after 1 year of service and
a 10% raise after the second year of service, the average percentage raise per
year is not 15% but 14.89%, since
\[
GM = \sqrt{(1.2)(1.1)} \approx 1.1489,
\]
or, in percentages, $GM = \sqrt{(120)(110)} \approx 114.89\%$. The salary at the end of the
first year is 120% of the starting salary, and the salary at the end of the
second year is 110% of the first-year salary. This is equivalent to an average
raise of 14.89% per year, since 114.89% − 100% = 14.89%.
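The three geometric-mean examples can be verified with a short Python sketch. The function name `geometric_mean` is chosen here for illustration, and the sketch assumes Python 3.8 or later for `math.prod`.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


print(round(geometric_mean([4, 16]), 4))      # 8.0
print(round(geometric_mean([1, 3, 9]), 4))    # 3.0

# Example 1.2.14: growth factors for a 20% raise and a 10% raise.
average_factor = geometric_mean([1.2, 1.1])
print(round((average_factor - 1) * 100, 2))   # 14.89 (percent per year)

# Two raises at the average rate give the same final salary as 20% then 10%.
print(math.isclose(average_factor ** 2, 1.2 * 1.1))  # True
```

The last line shows why the geometric mean is the appropriate average for growth rates: compounding at the average rate reproduces the actual total growth.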

The Quadratic Mean

Definition 1.2.8. The quadratic mean (QM) is defined as the square root
of the average of the squares of each value. That is,
\[
QM = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}}.
\]
This is a useful mean in the physical sciences (such as voltage).

Example 1.2.15. The quadratic mean of 3, 5, 6, and 10 is
\[
QM = \sqrt{\frac{3^2 + 5^2 + 6^2 + 10^2}{4}} = \sqrt{42.5} \approx 6.52.
\]
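The computation in Example 1.2.15 can be sketched in Python as follows; the function name `quadratic_mean` is an assumption of this sketch.

```python
import math


def quadratic_mean(values):
    """Square root of the average of the squared values."""
    return math.sqrt(sum(x * x for x in values) / len(values))


# Example 1.2.15: (9 + 25 + 36 + 100) / 4 = 42.5
print(round(quadratic_mean([3, 5, 6, 10]), 2))  # 6.52
```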


Properties and Uses of the Measures of Central Tendency


Researchers and statisticians must know which measure of central tendency is
being used and when to use each measure of central tendency. The properties
and uses of the measures of central tendency are summarized next.

The (Arithmetic) Mean

1. The mean is found by using all the values of the data.

2. The mean varies less than the median or mode when samples are taken
from the same population and all three measures are computed for these
samples.

3. The mean is used in computing other statistics, such as the variance.

4. The mean for the data set is unique and not necessarily one of the data
values.

5. The mean cannot be computed for the data in a frequency distribution
that has an open-ended class.

6. The mean is affected by extremely high or low values, called outliers,
and may not be the appropriate average to use in these situations.

The Median

1. The median is used to find the center or middle value of a data set.

2. The median is used when it is necessary to find out whether the data
values fall into the upper half or lower half of the distribution.

3. The median is used for an open-ended distribution.

4. The median is affected less than the mean by extremely high or extremely
low values.

The Mode

1. The mode is used when the most typical case is desired.

2. The mode is the easiest average to compute.

3. The mode can be used when the data are nominal, such as religious
preference, gender, or political affiliation.

4. The mode is not always unique. A data set can have more than one
mode, or the mode may not exist for a data set.
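The different sensitivity of the mean and the median to outliers can be illustrated with the white-blood counts of Table 1.2, where the value 35 is far from the rest. The Python sketch below uses the standard-library `statistics` module; treating 35 as the outlier and dropping it is a choice made here for illustration, not a step taken in the notes.

```python
from statistics import mean, median

# Table 1.2: admission white-blood counts (x1000).
white_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]

# Same sample with the extreme value 35 removed.
without_outlier = [x for x in white_counts if x != 35]

print(round(mean(white_counts), 2), median(white_counts))        # 10.78 8
print(round(mean(without_outlier), 2), median(without_outlier))  # 7.75 8.0
```

Removing the single extreme value moves the mean from about 10.78 to 7.75, while the median stays at 8.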

EXERCISES

1. For these situations, state which measure of central tendency - mean,


median, or mode - should be used.

(a) The most typical case is desired.


(b) The distribution is open-ended.
(c) There is an extreme value in the data set.
(d) The data are categorical.
(e) Further statistical computations will be needed.
(f) The values are to be divided into two approximately equal groups,
one group containing the larger values and one containing the smaller
values.

2. Describe which measure of central tendency - mean, median, or mode -


was probably used in each situation.

(a) One-half of the factory workers make more than $5.37 per hour,
and one-half make less than $5.37 per hour.
(b) The average number of children per family in the Plaza Heights
Complex is 1.8.
(c) Most people prefer red convertibles over any other color.

(d) The average person cuts the lawn once a week.


(e) The most common fear today is fear of speaking in public.
(f) The average age of college professors is 42.3 years.

3. A local fast-food company claims that the average salary of its employ-
ees is $13.23 per hour. An employee states that most employees make
minimum wage. If both are being truthful, how could both be correct?

4. If the mean of five values is 64, find the sum of the values.

5. If the mean of five values is 8.2 and four of the values are 6, 10, 7, and
12, find the fifth value.

6. (a) Find the mean of 10, 20, 30, 40, and 50.
(b) Add 10 to each value and find the mean.
(c) Subtract 10 from each value and find the mean.
(d) Multiply each value by 10 and find the mean.
(e) Divide each value by 10 and find the mean.
(f) Make a general statement about each situation.

7. Using the harmonic mean, find each of these.

(a) A salesperson drives 300 miles round trip at 30 miles per hour going
to Chicago and 45 miles per hour returning home. Find the average
miles per hour.
(b) A bus driver drives the 50 miles to West Chester at 40 miles per
hour and returns driving 25 miles per hour. Find the average miles
per hour.
(c) A carpenter buys $500 worth of nails at $50 per pound and $500
worth of nails at $10 per pound. Find the average cost of 1 pound
of nails.

8. Find the geometric mean of each of these.

(a) The growth rates of the Living Life Insurance Corporation for the
past 3 years were 35, 24, and 18%.
(b) A person received these percentage raises in salary over a 4-year
period: 8, 6, 4, and 5%.
(c) A stock increased each year for 5 years at these percentages: 10, 8,
12, 9, and 3%.
(d) The price increases, in percentages, for the cost of food in a specific
geographic region for the past 3 years were 1, 3, and 5.5%.

1.3 Measures of Variation


In statistics, to describe the data set accurately, statisticians must know more
than the measures of central tendency. Consider the example below.

Example 1.3.1. A testing lab wishes to test two experimental brands of
outdoor paint to see how long each will last before fading. The testing lab
makes 6 gallons of each paint to test. Since different chemical agents are added
to each group and only six cans are involved, these two groups constitute two
small populations. The results (in months) are shown. Find the mean of each
group.

Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25

Solution. The mean for Brand A is
\[
\mu_A = \frac{\sum X}{N} = \frac{210}{6} = 35 \text{ months}.
\]
The mean for Brand B is
\[
\mu_B = \frac{\sum X}{N} = \frac{210}{6} = 35 \text{ months}.
\]


Since the means are equal in Example 1.3.1, you might conclude that both
brands of paint last equally well. However, when the data sets are examined
graphically, a somewhat different conclusion might be drawn.

Figure 1.3: Examining Data Sets in Example 1.3.1 Graphically

As Figure 1.3 shows, even though the means are the same for both brands,
the spread, or variation, is quite different. Figure 1.3 shows that Brand B
performs more consistently; it is less variable. For the spread or variability of
a data set, three measures are commonly used: range, variance, and standard
deviation. Each measure will be discussed in this section.

The Range
Several different measures can be used to describe the variability of a sample.
Perhaps the simplest measure is the range.

Definition 1.3.1. The range is the difference between the largest and small-
est observations in a sample. The symbol R is used for the range, and we
have

Range = highest value − lowest value

Remark 1.3.1. One advantage of the range is that it is very easy to compute
once the sample points are ordered.

Remark 1.3.2. One striking disadvantage is that it is very sensitive to
extreme observations.

Remark 1.3.3. Another disadvantage of the range is that it depends on the
sample size (n). That is, the larger n is, the larger the range tends to be. This
complication makes it difficult to compare ranges from data sets of differing
size.

Example 1.3.2. Find the ranges for the paints in Example 1.3.1.

Solution. For Brand A, the range is

R = 60 − 10 = 50 months.

Moreover, for Brand B, the range is

R = 45 − 25 = 20 months.

Here, we see that the range for Brand A shows that 50 months separate the
largest data value from the smallest data value, and for Brand B, 20 months
separate the largest data value from the smallest data value, which is less than
one-half of Brand A’s range.
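
The means and ranges in Examples 1.3.1 and 1.3.2 can be checked with a few
lines of Python. This is only an illustrative sketch: the helper name
`data_range` is our own choice, and nothing beyond the standard library is
assumed.

```python
from statistics import mean

# Paint durability data (in months) from Example 1.3.1.
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def data_range(values):
    """Range = highest value - lowest value (Definition 1.3.1)."""
    return max(values) - min(values)

for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    print(f"{name}: mean = {mean(data)}, range = {data_range(data)}")
```

Both brands share the same mean, while the ranges (50 and 20 months) already
hint at the difference in spread.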

Example 1.3.3. The cholesterol level of a certain person was measured using
two methods, namely, the Autoanalyzer and Microenzymatic measurement
methods. The samples obtained from these measurements are recorded
below:

Figure 1.4: Two samples of cholesterol measurements on a given person using
the Autoanalyzer and Microenzymatic measurement methods

The range for the Autoanalyzer method is given by 226 − 177 = 49 mg/dL.
The range for the Microenzymatic method is given by 209 − 192 = 17 mg/dL.
The Autoanalyzer method clearly seems more variable.

Example 1.3.4. The range of the sample birthweights in Example 1.2.1 is
given by

R = 4146 − 2069 = 2077 g.

Example 1.3.5. The range of the aflatoxin levels of raw peanut kernels in
Example 1.2.2 is given by

R = 52 − 16 = 36.

Example 1.3.6. The range of the white-blood counts for all patients entering
a hospital in Quezon City, on a given day, based on Example 1.2.3, is given by

R = 35 − 3 = 32(×1000) or 32000.



Variance and Standard Deviation


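The range depends only on the two most extreme values, so for some data sets
it is a large number that says little about how the remaining values are
spread. Thus, to have a more meaningful statistic to measure the variability,
statisticians use measures called the variance and standard deviation.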

Definition 1.3.2. The variance is the average of the squares of the distance
each value is from the mean.

1. The population variance, denoted by $\sigma^2$ (Greek: lowercase letter
"sigma"), is given by

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \]

where

$X_i$ = individual value
$\mu$ = population mean
$N$ = population size

2. The sample variance, denoted by $s^2$, is given by

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \]

where

$x_i$ = individual value
$\bar{x}$ = sample mean
$n$ = sample size

Remark 1.3.4. When computing the variance of a sample, one might
expect the use of the formula

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}. \]

This formula is not usually used, however, since in most cases the purpose
of calculating the statistic is to estimate the corresponding parameter. For
example, the sample mean $\bar{x}$ is used to estimate the population mean
$\mu$. The expression

\[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]

does not give the best estimate of the population variance because when the
population is large and the sample is small (usually less than 30), the variance
computed by this formula usually underestimates the population variance.
Therefore, instead of dividing by n, find the variance of the sample by
dividing by n − 1, giving a slightly larger value and an unbiased estimate of
the population variance. Thus, we use

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

for the sample variance.
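
The effect of the divisor can be seen on a small scale. The sketch below is
an illustration under our own assumptions, not part of the original notes: it
treats the six Brand A values of Example 1.3.1 as an entire population,
enumerates every ordered sample of size n = 2 drawn with replacement, and
averages the two candidate estimates over all of those samples. All variable
and function names are hypothetical.

```python
from itertools import product
from statistics import mean, pvariance

# Assumption for illustration: the six Brand A values form the whole population.
population = [10, 60, 50, 30, 40, 20]
sigma_sq = pvariance(population)  # population variance (divides by N)

def sum_of_squared_deviations(sample):
    """Sum of (x_i - x_bar)^2 over the sample."""
    x_bar = mean(sample)
    return sum((x - x_bar) ** 2 for x in sample)

n = 2
# Every ordered sample of size n drawn with replacement (6**2 = 36 samples).
samples = list(product(population, repeat=n))

avg_dividing_by_n = mean(sum_of_squared_deviations(s) / n for s in samples)
avg_dividing_by_n_minus_1 = mean(
    sum_of_squared_deviations(s) / (n - 1) for s in samples
)

print(f"population variance          : {sigma_sq:.2f}")
print(f"average estimate, divisor n  : {avg_dividing_by_n:.2f}")
print(f"average estimate, divisor n-1: {avg_dividing_by_n_minus_1:.2f}")
```

Under these assumptions, the estimates that divide by n − 1 average out to
the population variance, while those that divide by n average out to a
smaller number.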



Definition 1.3.3. The standard deviation is the (positive) square root of
the variance.

1. The population standard deviation, denoted by $\sigma$ (Greek: lowercase
letter "sigma"), is given by

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}, \]

where

$X_i$ = individual value
$\mu$ = population mean
$N$ = population size

2. The sample standard deviation, denoted by $s$, is given by

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}, \]

where

$x_i$ = individual value
$\bar{x}$ = sample mean
$n$ = sample size

Remark 1.3.5. The rounding rule for the standard deviation is the same as
that for the mean. The final answer should be rounded to one more decimal
place than that of the original data.

Shortcut Formulas for $s^2$ and $s$

Definition 1.3.4. (Shortcut Formulas for $s^2$ and $s$) In the absence of the
sample mean, $\bar{x}$, we have

1. the sample variance, denoted by $s^2$, is given by

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}}{n - 1}, \]

where

$x_i$ = individual value
$x_i^2$ = square of the individual value
$n$ = sample size

2. the sample standard deviation, denoted by $s$, is given by

\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}}{n - 1}}, \]

where

$x_i$ = individual value
$x_i^2$ = square of the individual value
$n$ = sample size

These formulas are mathematically equivalent to the preceding formulas and
do not involve using the mean. They save time when repeated subtracting and
squaring occur in the original formulas. They are also more accurate when the
mean has been rounded.
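
The equivalence can be verified directly. The short derivation below is added
as a worked step; it uses only the definition $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{2}{n} \left( \sum_{i=1}^{n} x_i \right)^2
     + \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}.
\end{align*}
```

Dividing both sides by n − 1 gives the shortcut formula for the sample
variance.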

Example 1.3.7. Find the variance and standard deviation for the data set
for Brand A in Example 1.3.1.

Solution. First, we compute the mean of the data set. From Example
1.3.1, we see that $\mu_A = 35$. Second, we subtract the mean from each
data value.

10 − 35 = −25    50 − 35 = 15    40 − 35 = 5
60 − 35 = 25     30 − 35 = −5    20 − 35 = −15

Third, we square each result.

(−25)² = 625    (15)² = 225    (5)² = 25
(25)² = 625     (−5)² = 25     (−15)² = 225

Now, we get the sum of the squares and then divide it by N (since we are
dealing with the population variance). That is,

\[ \sigma_A^2 = \frac{\sum (X_A - \mu_A)^2}{N} = \frac{625 + 625 + 225 + 25 + 25 + 225}{6} = \frac{1750}{6} = 291.7 \]

Now, for the standard deviation, we have
$\sigma_A = \sqrt{\frac{1750}{6}} = 17.1$. It is advisable to organize the
computation in a table to keep track of each step.

$X_A$    $X_A - \mu_A$    $(X_A - \mu_A)^2$
10       −25              625
60       25               625
50       15               225
30       −5               25
40       5                25
20       −15              225
         $\sum (X_A - \mu_A)^2 = 1750$

Example 1.3.8. Find the variance and standard deviation for the data set
for Brand B in Example 1.3.1.

Solution. First, we compute the mean of the data set. From Example
1.3.1, we see that $\mu_B = 35$. Second, we subtract the mean from each
data value, square each result, get the sum of the squares, and then
divide it by N (since we are dealing with the population variance). That is,

$X_B$    $X_B - \mu_B$    $(X_B - \mu_B)^2$
35       0                0
45       10               100
30       −5               25
35       0                0
40       5                25
25       −10              100
         $\sum (X_B - \mu_B)^2 = 250$

Therefore,

\[ \sigma_B^2 = \frac{\sum (X_B - \mu_B)^2}{N} = \frac{250}{6} = 41.7 \]

Now, for the standard deviation, we have
$\sigma_B = \sqrt{\frac{250}{6}} = 6.5$.

Since the standard deviation of Brand A is 17.1 and the standard deviation of
Brand B is 6.5, the data are more variable for Brand A. In summary, when
the means are equal, the larger the variance or standard deviation is, the more
variable the data are.
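
The steps of Examples 1.3.7 and 1.3.8 (subtract the mean, square, sum, divide
by N) translate directly into code. The following is a minimal sketch; the
function name `population_variance` is our own, and the standard library's
`statistics.pvariance` would give the same values.

```python
from math import sqrt

def population_variance(values):
    """Average of the squared distances from the population mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    squared_deviations = [(x - mu) ** 2 for x in values]
    return sum(squared_deviations) / n

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

var_a, var_b = population_variance(brand_a), population_variance(brand_b)
sd_a, sd_b = sqrt(var_a), sqrt(var_b)

print(f"Brand A: variance = {var_a:.1f}, standard deviation = {sd_a:.1f}")
print(f"Brand B: variance = {var_b:.1f}, standard deviation = {sd_b:.1f}")
```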

Example 1.3.9. Find the variance and standard deviation for the rate of
death in a certain barrio in Rizal for a sample of 6 years shown. The data are
in percentages.

11.2, 11.9, 12.0, 12.8, 13.4, 14.3



Solution. Without actually solving for the sample mean, $\bar{x}$, we can
solve for the sample variance and standard deviation of the given sample. To
do this, we find the sum of the values and the sum of the squares of the
values, then substitute in the shortcut formula. That is,

$x$      $x^2$
11.2     125.44
11.9     141.61
12.0     144.00
12.8     163.84
13.4     179.56
14.3     204.49
$\sum x = 75.6$    $\sum x^2 = 958.94$

Therefore,

\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{958.94 - \frac{(75.6)^2}{6}}{6 - 1} = 1.28 \]

Moreover, we have

\[ s = \sqrt{\frac{958.94 - \frac{(75.6)^2}{6}}{6 - 1}} = 1.13. \]
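
As a check on Example 1.3.9, the shortcut formula can be coded and compared
with the definition-based result. This is a sketch only: `shortcut_variance`
is a name we introduce here, and `statistics.variance` from the Python
standard library serves as the reference that divides by n − 1.

```python
from math import sqrt
from statistics import variance

# Death rates (in percent) from Example 1.3.9.
rates = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]

def shortcut_variance(values):
    """Sample variance from the sum of the values and the sum of their squares."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)

s_squared = shortcut_variance(rates)
s = sqrt(s_squared)

print(f"shortcut s^2 = {s_squared:.2f}, s = {s:.2f}")
print(f"definition-based s^2 = {variance(rates):.2f}")
```

In floating-point arithmetic the shortcut form can lose precision when the
values are very large relative to their spread, so the definition-based form
is usually the safer choice in software.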
Example 1.3.10. Compute the variance and standard deviation for the Au-
toanalyzer and Microenzymatic-method data in Figure 1.4.

Solution. In Figure 1.4, we see that $\bar{x} = 200$ for both samples. Thus,

(a) For the Autoanalyzer method, we have

$x$     $x - \bar{x}$    $(x - \bar{x})^2$
177     −23              529
193     −7               49
195     −5               25
209     9                81
226     26               676
        $\sum (x - \bar{x})^2 = 1360$

Therefore, $s^2 = \frac{1360}{5 - 1} = 340$ and
$s = \sqrt{\frac{1360}{5 - 1}} = 18.4$.

(b) For the Microenzymatic method, we have

$x$     $x - \bar{x}$    $(x - \bar{x})^2$
192     −8               64
197     −3               9
200     0                0
202     2                4
209     9                81
        $\sum (x - \bar{x})^2 = 158$

Therefore, $s^2 = \frac{158}{5 - 1} = 39.5$ and
$s = \sqrt{\frac{158}{5 - 1}} = 6.3$.
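
For sample data, the Python standard library already provides the n − 1
versions of these formulas. The sketch below recomputes Example 1.3.10 with
`statistics.variance` and `statistics.stdev`; the list names are our own.

```python
from statistics import mean, stdev, variance

# Cholesterol measurements (mg/dL) read from the tables in Example 1.3.10.
autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]

for name, data in (("Autoanalyzer", autoanalyzer),
                   ("Microenzymatic", microenzymatic)):
    # variance() and stdev() divide by n - 1 (sample formulas).
    print(f"{name}: mean = {mean(data)}, s^2 = {variance(data)}, "
          f"s = {stdev(data):.1f}")
```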


Uses of the Variance and Standard Deviation

1. As previously stated, variances and standard deviations can be used to
determine the spread of the data. If the variance or standard deviation
is large, the data are more dispersed. This information is useful in
comparing two (or more) data sets to determine which is more (most)
variable.

2. The measures of variance and standard deviation are used to determine
the consistency of a variable. For example, in the manufacture of fittings,
such as nuts and bolts, the variation in the diameters must be small, or
the parts will not fit together.

3. The variance and standard deviation are used to determine the number
of data values that fall within a specified interval in a distribution.

4. Finally, the variance and standard deviation are used quite often in
inferential statistics. These uses will be shown in later chapters of these
lecture notes.

Coefficient of Variation
Whenever two samples have the same units of measure, the variance and stan-
dard deviation for each can be compared directly. For example, suppose an
automobile dealer wanted to compare the standard deviation of miles driven
for the cars she received as trade-ins on new cars. She found that for a spe-
cific year, the standard deviation for Buicks was 422 miles and the standard
deviation for Cadillacs was 350 miles. She could say that the variation in
mileage was greater in the Buicks. But what if a manager wanted to compare
the standard deviations of two different variables, such as the number of sales
per salesperson over a 3-month period and the commissions made by these
salespeople?

For many traits, standard deviation and mean change together when organ-
isms of different sizes are compared. Humans have greater mass than mice
and also more variability in mass. For many purposes, we care more about
the relative variation among individuals. A special measure, the coefficient of
variation, is often used for this purpose.

This measure can also be used to compare the variability of traits that do not
have the same units. If we wanted to ask, "What is more variable in humans,
body mass or life span? " then the standard deviation is not very informative,
because mass is measured in kilograms and life span is measured in years. The
coefficient of variation would allow us to make such a comparison.

Definition 1.3.5. The coefficient of variation, denoted by CV , is the
standard deviation expressed as a percentage of the mean. That is, it is the
standard deviation divided by the mean, whose quotient is expressed as a
percentage. Therefore,

1. for populations, $CV = \frac{\sigma}{\mu} \cdot 100\%$; and,

2. for samples, $CV = \frac{s}{\bar{x}} \cdot 100\%$.

Remark 1.3.6. The CV is most useful in comparing the variability of several
different samples, each with different arithmetic means. This is because a
higher variability is usually expected when the mean increases, and the CV is
a measure that accounts for this variability.

Example 1.3.11. The mean for the number of pages of a sample of women's
fitness magazines is 132, with a variance of 23; the mean for the number of
advertisements of a sample of women's fitness magazines is 182, with a
variance of 62. Compare the variations.

Solution. Since the variances are given, each standard deviation is the
square root of the corresponding variance. The coefficients of variation are

\[ CV_{\text{pages}} = \frac{\sqrt{23}}{132} \cdot 100\% = 3.6\% \]

and

\[ CV_{\text{advertisements}} = \frac{\sqrt{62}}{182} \cdot 100\% = 4.3\% \]

Therefore, the number of advertisements is more variable than the number of
pages since the coefficient of variation is larger for advertisements.

Example 1.3.12. The mean of the number of sales of cars over a 3-month
period is 87, and the standard deviation is 5. The mean of the commissions is
$5225, and the standard deviation is $773. Compare the variations of the two.

Solution. The coefficients of variation are

\[ CV_{\text{sales}} = \frac{5}{87} \cdot 100\% = 5.7\% \]

and

\[ CV_{\text{commissions}} = \frac{\$773}{\$5225} \cdot 100\% = 14.8\% \]

Since the coefficient of variation is larger for commissions, the commissions are
more variable than the sales.
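
The computations in Examples 1.3.11 and 1.3.12 can be reproduced with a small
helper. This is an illustrative sketch: the function name
`coefficient_of_variation` is ours, and for Example 1.3.11 the standard
deviation is taken as the square root of the stated variance.

```python
from math import sqrt

def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

# Example 1.3.11: variances are given, so take square roots first.
cv_pages = coefficient_of_variation(sqrt(23), 132)
cv_advertisements = coefficient_of_variation(sqrt(62), 182)

# Example 1.3.12: standard deviations are given directly.
cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

for label, cv in (("pages", cv_pages), ("advertisements", cv_advertisements),
                  ("sales", cv_sales), ("commissions", cv_commissions)):
    print(f"CV for {label}: {cv:.1f}%")
```

Because the units cancel in the quotient, the resulting percentages can be
compared even when the two variables are measured in different units.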

Example 1.3.13. The coefficient of variation for the data consisting of
birthweights in Example 1.2.1 is given by
$CV = \frac{445.3 \text{ g}}{3166.9 \text{ g}} \cdot 100\% = 14.1\%$.

Example 1.3.14. The CV is also useful for comparing the reproducibility
of different variables. Consider, for example, data from the Bogalusa Heart
Study, a large study of cardiovascular risk factors in children that began in
the 1970s and continues up to the present time.

At approximately 3-year intervals, cardiovascular risk factors such as blood
pressure, weight, and cholesterol levels were measured for each of the children
in the study. In 1978, replicate measurements were obtained for a subset of the
children a short time apart from regularly scheduled risk factor measurements.
Table 1.4 presents reproducibility data on a selected subset of cardiovascular
risk factors. We note that the CV ranges from 0.2% for height to 10.4% for
HDL cholesterol.

Table 1.4: Reproducibility of cardiovascular risk factors in children, Bogalusa
Heart Study, 1978-1979

Risk factor                         n      Mean     s      CV (%)
Height (cm)                         364    142.6    0.31   0.2
Weight (kg)                         365    39.5     0.77   1.9
Triceps skin fold (mm)              362    15.2     0.51   3.4
Systolic blood pressure (mm Hg)     337    104.0    4.97   4.8
Diastolic blood pressure (mm Hg)    337    64.0     4.57   7.1
Total cholesterol (mg/dL)           395    160.4    3.44   2.1
HDL cholesterol (mg/dL)             349    56.9     5.89   10.4

Source: Foster, T. A., & Berenson, G. (1987). Measurement error and
reliability in four pediatric cross-sectional surveys of cardiovascular disease
risk factor variables - the Bogalusa Heart Study. Journal of Chronic Diseases,
40 (1), 13-21.

EXERCISES

1. Why do statisticians need measures of variability? State in your own
words the definitions of the following measures of variability:

(a) Range
(b) Standard deviation

2. How are the mean and variance of a distribution affected when:

(a) A constant is added to every value of a variable?
(b) Every value of a variable is multiplied by a constant?

3. The following cholesterol levels of 10 people were measured in mg/dL:
{260, 150, 165, 201, 212, 243, 219, 227, 210, 240}. For this sample:

(a) Calculate the mean and median.
(b) Calculate the variance and standard deviation.
(c) Calculate the coefficient of variation.

4. (a) Can a population have a zero variance?
(b) Can a population have a negative variance?
(c) Can a sample have a zero variance?
(d) Can a sample have a negative variance?

5. For this data set, find the mean, variance, and standard deviation of the
variable. The data represent the serum cholesterol levels of 30 individuals.

211 240 255 219 204
200 212 193 187 205
256 203 210 221 249
231 212 236 204 187
201 247 206 187 200
237 227 221 192 196

6. Use the data set: 10, 20, 30, 40, 50.

(a) Find the standard deviation.
(b) Add 5 to each value, and then find the standard deviation.
(c) Subtract 5 from each value and find the standard deviation.
(d) Multiply each value by 5 and find the standard deviation.
(e) Divide each value by 5 and find the standard deviation.
(f) Generalize the results of parts (b) through (e).

7. The mean (absolute) deviation (MAD) of a sample of values of a
variable is the arithmetic mean of the absolute values of the deviations
about the sample mean. It is found using the formula

\[ MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}, \]

where

$x_i$ = individual observation
$\bar{x}$ = sample mean
$n$ = sample size

Find the mean absolute deviation for these data: 5, 9, 10, 11, 11, 12, 15,
18, 20, 22.

1.4 Measures of Position


In addition to measures of central tendency and measures of variation, there are
measures of position or location. These measures include percentiles, deciles,
and quartiles. They are used to locate the relative position of a data value in
the data set. For example, if a value is located at the 80th percentile, it means
that 80% of the values fall below it in the distribution and 20% of the values
fall above it. The median is the value that corresponds to the 50th percentile,
since one-half of the values fall below it and one-half of the values fall above
it. This section discusses these measures of position.

Quantiles

Definition 1.4.1. (Quantiles)

1. The median, x̃, divides the data set into two (2) equal parts.

2. The quartiles, Qk (k = 1, 2, 3), divide the data set into four (4) equal
parts.

3. The deciles, Dk (k = 1, 2, . . . , 9), divide the data set into ten (10) equal
parts.

4. The percentiles, Pk (k = 1, 2, . . . , 99), divide the data set into one
hundred (100) equal parts.

5. Percentiles are also sometimes called quantiles.

Remark 1.4.1. By definition, we have x̃ = Q2 = D5 = P50 .

Remark 1.4.2. Percentiles have the advantage over the range of being less
sensitive to outliers and of not being greatly affected by the sample size, n.

Remark 1.4.3. To compute percentiles, the sample points must be ordered.
This can be difficult if n is even moderately large.

Remark 1.4.4. To compute the kth percentile of a given (ungrouped) data
set, we employ the following steps:

1. Arrange the observations in the given data set in ascending order.

2. Compute $c = \frac{nk}{100}$, where n is the sample size and k is the
order of the desired percentile.

3. (a) If c is not a whole number, round up to the next whole number.
Starting at the lowest value, count over the number that corresponds
to the rounded-up value.
(b) If c is a whole number, use the value halfway between the cth and
the (c + 1)th values when counting up from the lowest value.

Remark 1.4.5. To compute the kth decile of a given (ungrouped) data
set, we employ the following steps:

1. Arrange the observations in the given data set in ascending order.

2. Compute $c = \frac{nk}{10}$, where n is the sample size and k is the
order of the desired decile.

3. (a) If c is not a whole number, round up to the next whole number.
Starting at the lowest value, count over the number that corresponds
to the rounded-up value.
(b) If c is a whole number, use the value halfway between the cth and
the (c + 1)th values when counting up from the lowest value.

Remark 1.4.6. To compute the values corresponding to the quartiles, Q1,
Q2, and Q3, of a given (ungrouped) data set, we employ the following steps:

1. Arrange the observations in the given data set in ascending order.

2. Find the median of the data values. This is the value for Q2 .

3. Find the median of the data values that fall below Q2 . This is the value
for Q1 .

4. Find the median of the data values that fall above Q2 . This is the value
for Q3 .

Example 1.4.1. Compute the tenth and ninetieth percentiles for the birth-
weight data in Example 1.2.1.

Solution. First, arrange the sample in ascending order: 2069, 2581, 2759, 2834,
2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541,
3609, 3649, 4146.

(a) For k = 10, we have $\frac{nk}{100} = \frac{20(10)}{100} = 2$. Therefore,
$P_{10}$ is the average of the second and third observations in the ordered
sample, that is, $P_{10} = \frac{2581 + 2759}{2} = 2670$ g.

(b) For k = 90, we have $\frac{nk}{100} = \frac{20(90)}{100} = 18$. Therefore,
$P_{90}$ is the average of the 18th and 19th observations in the ordered
sample, that is, $P_{90} = \frac{3609 + 3649}{2} = 3629$ g.

Example 1.4.2. Compute for the sixth and seventh deciles for the aflatoxin
data in Example 1.2.2.

Solution. First, arrange the sample in ascending order: 16, 22, 23, 26, 26, 27,
28, 30, 31, 35, 36, 37, 48, 50, 52.

(a) For k = 6, we have $\frac{nk}{10} = \frac{15(6)}{10} = 9$. Therefore,
$D_6$ is the average of the ninth and tenth observations in the ordered
sample, that is, $D_6 = \frac{31 + 35}{2} = 33$.

(b) For k = 7, we have $\frac{nk}{10} = \frac{15(7)}{10} = 10.5$, which rounds
up to 11. Therefore, $D_7$ is the eleventh observation in the ordered sample,
that is, $D_7 = 36$.
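
The procedures in Remarks 1.4.4 and 1.4.5 differ only in the divisor (100 or
10), so a single routine can cover both. The following is a sketch under that
reading of the remarks; the function name `quantile` and its `parts`
parameter are our own, and k is assumed to be a whole number. Statistical
software often uses other interpolation conventions, so its results may differ
slightly from this rule.

```python
def quantile(values, k, parts=100):
    """k-th quantile when the data are split into `parts` equal parts.

    parts=100 gives percentiles (Remark 1.4.4); parts=10 gives deciles
    (Remark 1.4.5).
    """
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    data = sorted(values)                              # step 1: ascending order
    whole, remainder = divmod(len(data) * k, parts)    # step 2: c = nk / parts
    if remainder:
        # step 3(a): c is not whole, so round up to position whole + 1
        # (which is index `whole` in 0-based indexing).
        return data[whole]
    # step 3(b): c is whole, so take the value halfway between
    # the c-th and (c + 1)-th ordered values.
    return (data[whole - 1] + data[whole]) / 2

birthweights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
aflatoxin = [16, 22, 23, 26, 26, 27, 28, 30, 31, 35, 36, 37, 48, 50, 52]

print("P10 =", quantile(birthweights, 10), " P90 =", quantile(birthweights, 90))
print("D6 =", quantile(aflatoxin, 6, parts=10),
      " D7 =", quantile(aflatoxin, 7, parts=10))
```

Using `divmod` on whole numbers avoids any floating-point doubt about whether
c is a whole number.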

Example 1.4.3. The ages, in years, of the eight respondents in a health sur-
vey are as follows: 15, 13, 6, 5, 12, 50, 22, 18. Find its quartiles.

Solution. First we arrange the data in ascending order. That is, 5, 6, 12, 13,
15, 18, 22, 50. Computing for the median, Q2, we see that with n = 8, we have
$Q_2 = \tilde{x} = \frac{13 + 15}{2} = 14$. Now, we consider the data values
less than 14, that is, 5, 6, 12, 13. Getting its median, with the fact that it
has four observations, we have $Q_1 = \frac{6 + 12}{2} = 9$. Lastly, we
consider the data values greater than 14, that is, 15, 18, 22, 50. Getting its
median, with the fact that it has four observations, we have
$Q_3 = \frac{18 + 22}{2} = 20$.
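
The median-of-halves procedure of Remark 1.4.6 can be sketched as follows.
The function name `quartiles` is ours, and one assumption is made explicit:
the lower and upper halves are taken by position in the ordered data, with the
middle observation left out of both halves when n is odd.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]          # values below the median position
    upper_half = data[(n + 1) // 2 :]    # values above the median position
    return median(lower_half), median(data), median(upper_half)

ages = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(ages)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```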

Outliers
In addition to dividing the data set into four groups, quartiles can be used as
a rough measurement of variability.

Definition 1.4.2. The interquartile range (IQR) is defined as the
difference between the third and first quartiles, that is

IQR = Q3 − Q1 .

The IQR is interpreted as the range of the middle 50% of the data.

The interquartile range is used to identify outliers, and it is also used as a
measure of variability in exploratory data analysis.

Definition 1.4.3. An outlier is an extremely high or an extremely low value
when compared with the rest of the data values.

Remark 1.4.7. An outlier can strongly affect the mean and standard devi-
ation of a variable. For example, suppose a researcher mistakenly recorded
an extremely high data value. This value would then make the mean and
standard deviation of the variable much larger than they really were.

Remark 1.4.8. Outliers can have an effect on other statistics as well.

Remark 1.4.9. To identify outliers in a given data set, we employ the follow-
ing procedures:

1. Arrange the data in ascending order and solve for Q1 and Q3 .

2. Find the interquartile range, given by IQR = Q3 − Q1 .

3. Multiply the interquartile range by 1.5.

4. Subtract the value obtained in (3) from Q1, that is, Q1 − 1.5(IQR).
Moreover, add the same value in (3) to Q3, that is, Q3 + 1.5(IQR).

5. Check the data set for any value that is smaller than Q1 − 1.5(IQR) or
larger than Q3 + 1.5(IQR). These data are outliers in the given data
set.

Remark 1.4.10. There are several reasons why outliers may occur.

1. The data value may have resulted from a measurement or observational
error. Perhaps the researcher measured the variable incorrectly.

2. The data value may have resulted from a recording error. That is, it
may have been written or typed incorrectly.

3. The data value may have been obtained from a subject that is not in
the defined population. For example, suppose test scores were obtained
from a seventh-grade class, but a student in that class was actually in the
sixth grade and had special permission to attend the class. This student
might have scored extremely low on that particular exam on that day.

4. The data value might be a legitimate value that occurred by chance
(although the probability is extremely small).

Remark 1.4.11. There are no hard-and-fast rules on what to do with outliers,
nor is there complete agreement among statisticians on ways to identify them.

1. Obviously, if they occurred as a result of an error, an attempt should
be made to correct the error or else the data value should be omitted
entirely.

2. When they occur naturally by chance, the statistician must make a
decision about whether to include them in the data set.

Example 1.4.4. Check the data set in Example 1.4.3 for outliers.

Solution. At first glance, the data value 50 is extremely suspect. To check for
an outlier, we employ the steps in Remark 1.4.9.

(a) We solve for the first and third quartiles. In Example 1.4.3, we see that
Q1 = 9 and Q3 = 20.

(b) Solving for the interquartile range, we see that IQR = Q3 −Q1 = 20−9 =
11.

(c) Multiplying this by 1.5, we have 1.5(11) = 16.5.

(d) Subtract the value obtained in (c) from Q1 , and add the value obtained
in (c) to Q3 . That is, 9 − 16.5 = −7.5 and 20 + 16.5 = 36.5.

(e) Check the data set for any data values that fall outside the interval from
−7.5 to 36.5. Here, we see that the value 50 is outside this interval;
hence, it can be considered an outlier.
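
The procedure of Remark 1.4.9 can be sketched in code as follows. The names
`quartiles` and `iqr_outliers` are our own; the quartiles are computed with
the median-of-halves rule of Remark 1.4.6, taking the halves by position in
the ordered data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    return median(data[: n // 2]), median(data), median(data[(n + 1) // 2 :])

def iqr_outliers(values):
    """Return the lower fence, the upper fence, and the values outside them."""
    q1, _, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)                       # 1.5 times the IQR
    lower_fence, upper_fence = q1 - step, q3 + step
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

ages = [15, 13, 6, 5, 12, 50, 22, 18]
low, high, flagged = iqr_outliers(ages)
print(f"interval: {low} to {high}; outliers: {flagged}")
```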

EXERCISES

1. The percentile corresponding to a given value x is computed by using
the following formula

\[ \text{Percentile} = \frac{(\text{number of values below } x) + 0.5}{\text{total number of values}} \cdot 100\% \]

(a) A teacher gives a 50-point test to seven students, scores of which
are shown. Find the percentile rank for each test score obtained
by the students.

12, 28, 35, 42, 47, 49, 50

(b) In (a), what value corresponds to the 60th percentile?

(c) Find the percentile rank for each value in the data set. The data
represent the values in billions of dollars of the damage of 10
hurricanes.

1.1, 1.7, 1.9, 2.1, 2.2, 2.5, 3.3, 6.2, 6.8, 20.3

(d) In (c), what value corresponds to the 40th percentile?


2. The average weekly earnings in dollars for various industries are listed
below. Find the quartiles of the given data set.

804, 736, 659, 489, 777, 623, 597, 524, 228

3. Check each data set for outliers.

(a) 16, 18, 22, 19, 3, 21, 17, 20
(b) 24, 32, 54, 31, 16, 18, 19, 14, 17, 20
(c) 321, 343, 350, 327, 200
(d) 88, 72, 97, 84, 86, 85, 100
(e) 145, 119, 122, 118, 125, 116
(f) 14, 16, 27, 18, 13, 19, 36, 15, 20

4. Another measure of average is called the midquartile. It is the
numerical value halfway between Q1 and Q3, and the formula is

\[ \text{Midquartile} = \frac{Q_1 + Q_3}{2} \]

Using this formula and other formulas, find Q1, Q2, Q3, the midquartile,
and the interquartile range for each data set.

(a) 5, 12, 16, 25, 32, 38
(b) 53, 62, 78, 94, 96, 99, 103

5. An extreme outlier is an observation x that is smaller than
Q1 − 3(IQR) or larger than Q3 + 3(IQR). The method of identifying
extreme outliers is the same as the one presented in Remark 1.4.9.
Check each data set for extreme outliers.

(a) 16, 18, 22, 19, 3, 21, 17, 20
(b) 24, 32, 54, 31, 16, 18, 19, 14, 17, 20
(c) 321, 343, 350, 327, 200
(d) 88, 72, 97, 84, 86, 85, 100
(e) 145, 119, 122, 118, 125, 116
(f) 14, 16, 27, 18, 13, 19, 36, 15, 20

1.5 Taxonomy of Data


Social scientists have thought hard about types of data. Table 1.5 summarizes
a fairly standard taxonomy of data based on the four scales nominal, ordinal,
interval, and ratio. This table is to be used as a guide only.

You can be too rigid in applying this scheme (as unfortunately, some academic
journals are). Frequently, ordinal data are coded in increasing numerical or-
der and averages are taken. Or, interval and ratio measurements are ranked
(i.e., reduced to ordinal status) and averages taken at that point. Even with
nominal data, we sometimes calculate averages. For example, coding male as
0 and female as 1 in a class of 100 students, the average is the proportion of
females in the class. Most statistical procedures for ordinal data implicitly use
a numerical coding scheme, even if this is not made clear to the user.
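
The remark about averaging a 0/1 coding can be illustrated in a couple of
lines. The class composition below (58 females and 42 males) is made up
purely for illustration and does not come from any data set in these notes.

```python
from statistics import mean

# Hypothetical class of 100 students (made-up counts, for illustration only):
# 0 = male, 1 = female.
codes = [1] * 58 + [0] * 42

# The mean of the 0/1 codes equals the proportion of students coded 1.
proportion_female = mean(codes)
print(f"average of the codes = {proportion_female}")
```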

Table 1.5: Standard Taxonomy of Data

Scale       Characteristic Question                 Statistic to be Used
Nominal     Do A and B differ?                      Mode
Ordinal     Is A bigger (better) than B?            Median
Interval    How much do A and B differ?             Mean
Ratio       How many times is A bigger than B?      Mean

Sources:

1. Luce, R. D. and Narens, L. (1987). Measurement scales on the continuum.
Science, 236 : 1527-1532.

2. van Belle, G. (2002). Statistical Rules of Thumb. Wiley, New York.

3. Velleman, P. F. and Wilkinson, L. (1993). Nominal, ordinal, interval,
and ratio typologies are misleading. American Statistician, 46 : 193-197.

1.6 Exploratory Data Analysis


In traditional statistics, data are organized by using a frequency distribution.
From this distribution various graphs such as the histogram, frequency poly-
gon, and ogive can be constructed to determine the shape or nature of the
distribution. In addition, various statistics such as the mean and standard
deviation can be computed to summarize the data.

The purpose of traditional analysis is to confirm various conjectures about the
nature of the data. For example, from a carefully designed study, a researcher
might want to know if the proportion of Americans who are exercising today
has increased from 10 years ago. This study would contain various
assumptions about the population, various definitions such as of exercise, and
so on.

In exploratory data analysis (EDA), the measure of central tendency used is
the median. Moreover, the measure of variation used is the interquartile range,
Q3 − Q1. Also, the data are represented graphically using a boxplot (sometimes
called a box-and-whisker plot).

The purpose of exploratory data analysis is to examine data to find out what
information can be discovered about the data such as the center and the spread.
Exploratory data analysis was developed by John Tukey and presented in his
book Exploratory Data Analysis (Addison-Wesley, 1977).

The Five-Number Summary and Boxplots


A boxplot can be used to graphically represent the data set. These plots involve
five specific values, namely,

1. the lowest value of the data set, i.e., the minimum value

2. the first quartile, Q1

3. the median, x̃

4. the third quartile, Q3

5. the highest value of the data set, i.e., the maximum value

These values are called a five-number summary of the data set.

Definition 1.6.1. A boxplot is a graph of a data set obtained by drawing a
horizontal line from the minimum data value to Q1, drawing a horizontal line
from Q3 to the maximum data value, and drawing a box whose vertical sides
pass through Q1 and Q3 with a vertical line inside the box passing through
the median or Q2.

Remark 1.6.1. To construct a boxplot for a given data set, we employ the
following procedures:

1. Find the five-number summary for the data values, that is, the maximum
and minimum data values, Q1 and Q3 , and the median.

2. Draw a horizontal axis with a scale such that it includes the maximum
and minimum data values.

3. Draw a box whose vertical sides go through Q1 and Q3, and draw a
vertical line through the median.

4. Draw a line from the minimum data value to the left side of the box and
a line from the maximum data value to the right side of the box.

Example 1.6.1. The number of meteorites found in 10 states of the United
States is 89, 47, 164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the
data.

Solution. First, we arrange the data in ascending order. Doing so, we have
30, 39, 47, 48, 78, 89, 138, 164, 215, 296. Solving for the median, we have
$\tilde{x} = \frac{78 + 89}{2} = 83.5$. Next, solving for Q1, we consider the
data values less than 83.5, that is, 30, 39, 47, 48, 78. Solving for its
median, we have Q1 = 47. Next, considering the data values greater than 83.5,
that is, 89, 138, 164, 215, 296, we see that its median is Q3 = 164. Employing
the procedure given in Remark 1.6.1, of constructing the boxplot, we see that
the boxplot for the number of meteorites found in 10 states of the United
States is given by

Figure 1.5: Boxplot for the Number of Meteorites Found
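
The five-number summary behind this boxplot can be computed with a short
routine. This is a sketch only: `five_number_summary` is a name we introduce,
and the quartiles follow the median-of-halves rule of Remark 1.4.6 with the
halves taken by position. Plotting libraries may use a different quartile
convention, so their boxes can differ slightly from a hand-drawn one.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])          # median of the lower half
    q3 = median(data[(n + 1) // 2 :])    # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
minimum, q1, med, q3, maximum = five_number_summary(meteorites)
print(f"min = {minimum}, Q1 = {q1}, median = {med}, Q3 = {q3}, max = {maximum}")
```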

Remark 1.6.2. The following information can be obtained from the boxplot
of a given data set:

1. (a) If the median is near the center of the box, then the distribution is
approximately symmetric.

(b) If the median falls to the left of the center of the box, then the
distribution is positively skewed.

(c) If the median falls to the right of the center of the box, then the
distribution is negatively skewed.

2. (a) If the lines are about the same length, then the distribution is
approximately symmetric.

(b) If the right line is longer than the left line, then the distribution is
positively skewed.

(c) If the left line is longer than the right line, then the distribution is
negatively skewed.

If the boxplots for two or more data sets are graphed on the same axis, the
distributions can be compared. To compare the averages, use the location of
the medians. To compare the variability, use the interquartile range, i.e., the
length of the boxes.

Example 1.6.2. A dietitian is interested in comparing the sodium content of
real cheese with the sodium content of a cheese substitute. The data for two
random samples are shown. Compare the distributions, using boxplots.

Real Cheese Cheese Substitute


310 270
420 180
45 250
40 290
220 130
240 260
180 340
90 310

Solution. We solve for the median and the first and third quartiles of the
two data sets.

(a) For the real cheese data, we first arrange the data set as follows: 40,
45, 90, 180, 220, 240, 310, 420. One can easily determine the values

\[ \tilde{x} = \frac{180 + 220}{2} = 200, \quad Q_1 = \frac{45 + 90}{2} = 67.5, \quad Q_3 = \frac{240 + 310}{2} = 275. \]

(b) For the cheese substitute data, we first arrange the data set as follows:
130, 180, 250, 260, 270, 290, 310, 340. One can easily determine the
values

\[ \tilde{x} = \frac{260 + 270}{2} = 265, \quad Q_1 = \frac{180 + 250}{2} = 215, \quad Q_3 = \frac{290 + 310}{2} = 300. \]

The boxplots for each distribution are drawn on the same graph, as follows:

[Figure: boxplots for Real Cheese and Cheese Substitute drawn on a common horizontal axis]

Figure 1.6: Boxplots for the Sodium Content of Real Cheese and Cheese Substitute

The cheese substitute data have the higher median (265 versus 200 for the
real cheese data). The real cheese data have the larger spread: their
interquartile range is 275 − 67.5 = 207.5, compared with 300 − 215 = 85 for
the cheese substitute data.

Another important point to remember is that the summary statistics (median
and interquartile range) used in exploratory data analysis are said to be
resistant statistics. A resistant statistic is relatively less affected by outliers
than a nonresistant statistic. The mean and standard deviation are nonresistant
statistics. Sometimes when a distribution is skewed or contains outliers,
the median and interquartile range may more accurately summarize the data
than the mean and standard deviation, since the mean and standard deviation
are more affected in this case.
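
A small Python experiment illustrates resistance. It reuses the real cheese data of Example 1.6.2 and replaces the value 420 by 4200; this replacement is a hypothetical recording error introduced only for illustration, and the helper iqr is likewise an assumed name.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Interquartile range by the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    return median(data[(n + 1) // 2 :]) - median(data[: n // 2])

real_cheese = [310, 420, 45, 40, 220, 240, 180, 90]
# Hypothetical recording error: 420 entered as 4200
with_outlier = [4200 if x == 420 else x for x in real_cheese]

for label, sample in (("original", real_cheese), ("with outlier", with_outlier)):
    print(label,
          "median:", median(sample),
          "IQR:", iqr(sample),
          "mean:", round(mean(sample), 1),
          "sd:", round(stdev(sample), 1))
```

The median and interquartile range are the same for both samples, while the mean and standard deviation change substantially.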

EXERCISES

1. Identify the five-number summary, find the interquartile range, and draw
the boxplot of each of the following data sets.

(a) 8, 12, 32, 6, 27, 19, 54
(b) 19, 16, 48, 22, 7
(c) 362, 589, 437, 316, 192, 188
(d) 147, 243, 156, 632, 543, 303
(e) 14.6, 19.8, 16.3, 15.5, 18.2
(f) 9.7, 4.6, 2.2, 3.7, 6.2, 9.4, 3.8

2. Construct a boxplot for the following data and comment on the shape
of the distribution representing the number of games pitched by major
league baseball’s earned run average (ERA) leaders for the past few
years.

30 34 29 30 34 29 31
30 27 34 32 33 34 27

3. Construct a boxplot for the following data which represents the number
of innings pitched by the ERA leaders for the past few years. Comment
on the shape of the distribution.

192 228 186 199 238 217 213
234 264 187 214 115 238 246

4. Construct a boxplot for these numbers of state sites for Frogwatch USA.
Is the distribution symmetric?

421 395 314 294 289
253 242 238 235 199

5. Construct a boxplot and comment on the skewness of these data which
represent median household income (in dollars) for the top 10 educated
cities (based on the percent of the population with a college degree or
higher).

49297 48131 43731 39752 55637
57496 47221 41829 42562 42442

6. A four-month record for the number of tornadoes in 2016-2018 is given
here.

Month 2016 2017 2018
April 132 125 157
May 123 509 543
June 316 268 292
July 138 124 167

(a) Which month had the highest mean number of tornadoes for this
3-year period?

(b) Which year had the highest mean number of tornadoes for this 4-
month period?

(c) Construct three boxplots and compare the distributions.

7. Assume you work for OSHA (Occupational Safety and Health Adminis-
tration) and have complaints about noise levels from some of the workers
at a state power plant. You charge the power plant with taking decibel
readings at six different areas of the plant at different times of the day
and week. The results of the data collection are listed. Use boxplots
to initially explore the data and make recommendations about which
plant areas workers must be provided with protective ear wear. The safe
hearing level is at approximately 120 decibels.

Area 1 Area 2 Area 3 Area 4 Area 5 Area 6


30 64 100 25 59 67
12 99 59 15 63 80
35 87 78 30 81 99
65 59 97 20 110 49
24 23 84 61 65 67
59 16 64 56 112 56
68 94 53 34 132 80
57 78 59 22 145 125
100 57 89 24 163 100
61 32 88 21 120 93
32 52 94 32 84 56
45 78 66 52 99 45
92 59 57 14 105 80
56 55 62 10 68 34
44 55 64 33 75 21

1.7 Frequency Distributions


Suppose a researcher wished to do a study on the ages of the top 50 wealthiest
people in the world. The researcher first would have to get the data on the ages
of the people. In this case, these ages are listed in Forbes Magazine. When
the data are in original form, they are called raw data and are listed next.

49 57 38 73 81 74 59 76 65 69
54 56 69 68 78 65 85 49 69 61
48 81 68 37 43 78 82 43 64 67
52 56 81 77 79 85 40 85 59 80
60 71 57 61 69 61 83 90 87 74

Since little information can be obtained from looking at raw data, the re-
searcher organizes the data into what is called a frequency distribution. A
frequency distribution consists of classes and their corresponding frequencies.

Each raw data value is placed into a quantitative or qualitative category called
a class. The frequency of a class then is the number of data values contained
in a specific class. A frequency distribution is shown for the preceding data
set.
Class Limits Tally Frequency
35-41 3
42-48 3
49-55 4
56-62 10
63-69 10
70-76 5
77-83 10
84-90 5
Total 50
Now some general observations can be made from looking at the frequency
distribution. For example, it can be stated that the majority of the wealthy
people in the study are over 55 years old.
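
The frequencies in the table can be reproduced with a few lines of Python. This is only a sketch; the variable names are assumptions, and the class limits 35-41 through 84-90 are taken from the table above.

```python
ages = [
    49, 57, 38, 73, 81, 74, 59, 76, 65, 69,
    54, 56, 69, 68, 78, 65, 85, 49, 69, 61,
    48, 81, 68, 37, 43, 78, 82, 43, 64, 67,
    52, 56, 81, 77, 79, 85, 40, 85, 59, 80,
    60, 71, 57, 61, 69, 61, 83, 90, 87, 74,
]

# Class limits 35-41, 42-48, ..., 84-90 (each class spans 7 whole years)
classes = [(low, low + 6) for low in range(35, 85, 7)]
frequencies = [sum(low <= a <= high for a in ages) for low, high in classes]

for (low, high), f in zip(classes, frequencies):
    print(f"{low}-{high}: {f}")
print("Total:", sum(frequencies))
print("Older than 55:", sum(a > 55 for a in ages))   # 40 of the 50 people
```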

As we saw in the previous sections, a small data set presents no difficulty:
we can simply arrange the few numbers in increasing order, and the result is
sufficiently clear. For fairly large data sets, a frequency distribution is a
big help.

Definition 1.7.1. A frequency distribution is an ordered display of each
value in a data set together with its frequency, that is, the number of times
that value occurs in the data set. In addition, when deemed necessary, the
percentage of sample points that take on a particular value is also typically
given.

Remark 1.7.1. Two types of frequency distributions that are most often used
are the categorical frequency distribution and the grouped frequency distribu-
tion.

Categorical Frequency Distributions

Definition 1.7.2. The categorical frequency distribution is used for data
that can be placed in specific categories, such as nominal or ordinal-level data.

Remark 1.7.2. To construct a frequency distribution for categorical data, we
employ the following procedure:

1. Make a table with the (discrete) classes in the first column.

2. Tally the data and place the results in the second column.

3. Count the tallies and place the results in the third column.

4. Find the percentage of values in each class by using the formula

% = (f/n) · 100%,

where f is the frequency of the class and n is the total number of values.
Put the obtained percentages in the fourth column.

5. Find the totals for the third and fourth columns.

6. Removing the column for the tally (optional) finishes the desired frequency
distribution.
Remark 1.7.3. Percentages are not normally part of a frequency distribution,
but they can be added since they are used in certain types of graphs such as pie
graphs. Also, the decimal equivalent of a percent is called a relative frequency.
Example 1.7.1. Twenty-five army inductees were given a blood test to deter-
mine their blood type. Construct a frequency distribution for the data. The
data set is given below.
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A

Solution. Since the data are categorical, discrete classes can be used. There
are four blood types: A, B, O, and AB. These types will be used as the classes
for the distribution. Employing the procedures in Remark 1.7.2, we have

Class Tally Frequency Percent
A 5 (5/25) · 100% = 20%
B 7 (7/25) · 100% = 28%
O 9 (9/25) · 100% = 36%
AB 4 (4/25) · 100% = 16%
Total 25 100%

Removing the tally column, we see that the final frequency distribution is

Class Frequency Percent
A 5 20%
B 7 28%
O 9 36%
AB 4 16%
Total 25 100%

For the sample, more people have type O blood than any other type. 
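
The tallying in Example 1.7.1 can be sketched in Python with collections.Counter from the standard library. The variable names are assumptions; the data are the 25 blood types listed in the example.

```python
from collections import Counter

blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

counts = Counter(blood_types)   # frequency f of each class
n = len(blood_types)            # total number of values

print("Class  Frequency  Percent")
for cls in ("A", "B", "O", "AB"):
    f = counts[cls]
    print(f"{cls:<6} {f:>9}  {f / n:>7.0%}")   # f / n is the relative frequency
print(f"{'Total':<6} {n:>9}  {1:>7.0%}")
```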

Grouped Frequency Distributions


When the range of the data is large, the data must be grouped into classes
that are more than one unit in width, in what is called a grouped frequency
distribution.

Definition 1.7.3.

1. Given a class, the endpoints of the class are called the class limits.

2. The lower class limit represents the smallest data value that can be
included in the class.

3. The upper class limit represents the largest value that can be included
in the class.

4. The numbers used to separate the classes so that there are no gaps in
the frequency distribution are called the class boundaries.

5. The class width for a class in a frequency distribution is found by
subtracting the lower (or upper) class limit of one class from the lower
(or upper) class limit of the next class.

Remark 1.7.4. The basic rule of thumb is that the class limits should have
the same decimal place value as the data, but the class boundaries should have
one additional place value and end in a 5.
Remark 1.7.5. The class width can also be found by subtracting the lower
boundary from the upper boundary for any given class. Do not subtract the
limits of a single class. It will result in an incorrect answer.
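
As an illustration of Remarks 1.7.4 and 1.7.5, the following Python sketch computes the boundaries, width, and midpoint of a single class. The function name class_summary and its parameter unit (the precision to which the data are recorded) are assumptions; Decimal is used so that values such as 12.45 are represented exactly.

```python
from decimal import Decimal

def class_summary(lower, upper, unit="1"):
    """Boundaries, width, and midpoint of the class with limits lower-upper.

    unit is the precision of the data (assumed parameter):
    "1" for whole numbers, "0.1" for one decimal place, and so on.
    """
    lower, upper = Decimal(lower), Decimal(upper)
    half = Decimal(unit) / 2
    lb, ub = lower - half, upper + half       # class boundaries
    return {
        "boundaries": (lb, ub),
        "width": ub - lb,                     # upper boundary minus lower boundary
        "midpoint": (lower + upper) / 2,
    }

print(class_summary("35", "41"))                 # boundaries 34.5 and 41.5, width 7
print(class_summary("12.5", "12.9", unit="0.1"))  # boundaries 12.45 and 12.95, width 0.5
```

Note that subtracting the limits of the class 35-41 would give 6, not the correct width 7.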
Remark 1.7.6. The researcher must decide how many classes to use and the
width of each class. To construct a frequency distribution, follow these rules:
1. There should be between 5 and 20 classes. Although there is no hard-and-
fast rule for the number of classes contained in a frequency distribution,
it is of the utmost importance to have enough classes to present a clear
description of the collected data.

2. It is preferable but not absolutely necessary that the class width be an
odd number. This ensures that the midpoint of each class has the same
place value as the data. The class midpoint, x_m, is obtained by adding
the lower and upper boundaries and dividing by 2, or adding the lower
and upper limits and dividing by 2. Note that this rule is only a
suggestion, and it is not rigorously followed, especially when a
computer is used to group data.

3. The classes must be mutually exclusive. Mutually exclusive classes have
nonoverlapping class limits so that data cannot be placed into two classes.

4. The classes must be continuous. Even if there are no values in a class, the
class must be included in the frequency distribution. There should be
no gaps in a frequency distribution. The only exception occurs when the
class with a zero frequency is the first or last class. A class with a zero
frequency at either end can be omitted without affecting the distribution.

5. The classes must be exhaustive. There should be enough classes to
accommodate all the data.

6. The classes must be equal in width. This avoids a distorted view of the
data. One exception occurs when a distribution has a class that is open-
ended. That is, the class has no specific beginning value or no specific
ending value. A frequency distribution with an open-ended class is called
an open-ended distribution.

Remark 1.7.7. The procedure for constructing a grouped frequency
distribution for numerical data is as follows:

1. Determine the classes. This can be done by finding the highest and
lowest values in the data set. Afterwards, solve for the range, R.

2. Select the number of classes desired (usually between 5 and 20).

3. Find the class width by dividing the range by the number of classes
desired. That is, width = R / (number of classes). Round the answer up to
the nearest whole number if there is a remainder. Otherwise, you will
need an extra class to accommodate all the data.

4. Select a starting point for the lowest class limit. This can be the smallest
data value or any convenient number less than the smallest data value.

5. Add the width to the lowest score taken as the starting point to get the
lower limit of the next class. Keep adding until the number of desired
classes is achieved.

6. Subtract one unit from the lower limit of the second class to get the
upper limit of the first class. Then add the width to each upper limit to
get all the upper limits.

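7. Find the class boundaries by subtracting 0.5 from each lower class limit
and adding 0.5 to each upper class limit. (This applies to data recorded
in whole numbers; in general, use half of the unit in which the data are
recorded, e.g., 0.05 for data recorded to one decimal place.)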

8. Tally the data.

9. Find the numerical frequencies from the tallies.

Remark 1.7.8. The reasons for constructing a frequency distribution are as
follows:

1. To organize the data in a meaningful, intelligible way.

2. To enable the reader to determine the nature or shape of the distribution.

3. To facilitate computational procedures for measures of average and spread.

4. To enable the researcher to draw charts and graphs for the presentation
of data.

5. To enable the reader to make comparisons among different data sets.

Example 1.7.2. The following are weights, in pounds, of 57 children at a
day-care center:

68 63 42 27 30 36 28 32 79 27
22 23 24 25 44 65 43 25 74 51
36 42 28 31 28 25 45 12 57 51
12 32 49 38 42 27 31 50 38 21
16 24 69 47 23 22 43 27 49 28
23 19 46 30 43 49 12

Construct a grouped frequency distribution with seven classes for the given
data.

Solution. We shall employ the procedures in Remark 1.7.7. First, note that
R = 79 − 12 = 67. Next, with seven desired classes, we see that the (class)
width is equal to 67/7 ≈ 9.6, which we round up to 10. Since the smallest
number is 12, we may begin our first interval with 10. The considerations
discussed so far lead to the following seven classes:

10-19 50-59
20-29 60-69
30-39 70-79
40-49
Solving for the class boundaries, tallying the data, and reflecting the corre-
sponding numerical frequencies from the tallies, we have

Weight (lb) Class Boundaries Tally Frequency Percentage


10-19 9.5-19.5 5 8.77%
20-29 19.5-29.5 19 33.33%
30-39 29.5-39.5 10 17.54%
40-49 39.5-49.5 13 22.81%
50-59 49.5-59.5 4 7.02%
60-69 59.5-69.5 4 7.02%
70-79 69.5-79.5 2 3.51%
57 100.0%

Finally, polishing the table gives us



Weight (lb) Frequency Percentage


10-19 5 8.77%
20-29 19 33.33%
30-39 10 17.54%
40-49 13 22.81%
50-59 4 7.02%
60-69 4 7.02%
70-79 2 3.51%
57 100.0%
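
The steps of Remark 1.7.7, as applied in Example 1.7.2, can be sketched in Python. The function grouped_frequency and its parameters are hypothetical names, and the sketch assumes data recorded in whole numbers (one unit = 1, boundaries at ±0.5).

```python
import math

def grouped_frequency(data, num_classes, start=None):
    """Rows (lower limit, upper limit, lower boundary, upper boundary, frequency).

    Assumes whole-number data; start defaults to the smallest data value.
    """
    low, high = min(data), max(data)
    width = math.ceil((high - low) / num_classes)   # range / classes, rounded up
    if start is None:
        start = low
    rows = []
    for k in range(num_classes):
        lower = start + k * width
        upper = lower + width - 1                   # one unit below the next lower limit
        freq = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq))
    return rows

weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 25, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]

rows = grouped_frequency(weights, num_classes=7, start=10)
for lower, upper, lb, ub, f in rows:
    print(f"{lower}-{upper}  {lb}-{ub}  {f:>2}  {f / len(weights):.2%}")
```

If the range divides evenly by the number of classes, the rounded-up width leaves the largest value outside the last class when the first class starts at the smallest value; the sketch does not add the extra class automatically.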

Example 1.7.3. A study was conducted to investigate the possible effects of
exercise on the menstrual cycle. From the data collected from that study, we
obtained the menarchal age (in years) of 56 female swimmers who began their
swimming training after they had reached menarche; these served as controls
to compare with those who began their training prior to menarche.

14.0 16.1 13.4 14.6 13.7 13.2 13.7 14.3


12.9 14.1 15.1 14.8 12.8 14.2 14.1 13.6
14.2 15.8 12.7 15.6 14.1 13.0 12.9 15.1
15.0 13.6 14.2 13.8 12.7 15.3 14.1 13.5
15.3 12.6 13.8 14.4 12.9 14.6 15.0 13.8
13.0 14.1 13.8 14.2 13.6 14.1 14.5 13.1
12.8 14.3 14.2 13.5 14.1 13.6 12.4 15.1

Construct a grouped frequency distribution with nine classes for the given data.

Solution. We shall employ the procedures in Remark 1.7.7. First, note that
R = 16.1 − 12.4 = 3.7. Next, with nine desired classes, we see that the (class)
width is equal to 3.7/9 ≈ 0.41, which we round up to the convenient value 0.5.
Since the smallest number is 12.4, we may begin our first interval with 12.0.
The considerations discussed so far lead to the following nine classes:

12.0-12.4 13.5-13.9 15.0-15.4
12.5-12.9 14.0-14.4 15.5-15.9
13.0-13.4 14.5-14.9 16.0-16.4

Tallying the data, and reflecting the corresponding numerical frequencies from
the tallies, we have

Age (years) Tally Frequency Percentage


12.0-12.4 1 1.8%
12.5-12.9 8 14.3%
13.0-13.4 5 8.9%
13.5-13.9 12 21.4%
14.0-14.4 16 28.6%
14.5-14.9 4 7.1%
15.0-15.4 7 12.5%
15.5-15.9 2 3.6%
16.0-16.4 1 1.8%
56 100.00%

Finally, polishing the table gives us

Age (years) Frequency Percentage


12.0-12.4 1 1.8%
12.5-12.9 8 14.3%
13.0-13.4 5 8.9%
13.5-13.9 12 21.4%
14.0-14.4 16 28.6%
14.5-14.9 4 7.1%
15.0-15.4 7 12.5%
15.5-15.9 2 3.6%
16.0-16.4 1 1.8%
56 100.00%
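
For data recorded to one decimal place, the tally can be checked in Python by first converting every value to whole tenths, which avoids floating-point comparisons at the class limits. This conversion is a design choice of the sketch, not part of the procedure in Remark 1.7.7; the starting point 12.0 and the width 0.5 are those chosen in the solution above.

```python
menarchal_ages = [
    14.0, 16.1, 13.4, 14.6, 13.7, 13.2, 13.7, 14.3,
    12.9, 14.1, 15.1, 14.8, 12.8, 14.2, 14.1, 13.6,
    14.2, 15.8, 12.7, 15.6, 14.1, 13.0, 12.9, 15.1,
    15.0, 13.6, 14.2, 13.8, 12.7, 15.3, 14.1, 13.5,
    15.3, 12.6, 13.8, 14.4, 12.9, 14.6, 15.0, 13.8,
    13.0, 14.1, 13.8, 14.2, 13.6, 14.1, 14.5, 13.1,
    12.8, 14.3, 14.2, 13.5, 14.1, 13.6, 12.4, 15.1,
]

tenths = [round(a * 10) for a in menarchal_ages]   # 12.4 becomes 124
start, width, num_classes = 120, 5, 9              # 12.0, 0.5 and nine classes, in tenths

frequencies = []
for k in range(num_classes):
    lower = start + k * width
    upper = lower + width - 1                      # e.g. 120-124 means 12.0-12.4
    f = sum(lower <= t <= upper for t in tenths)
    frequencies.append(f)
    print(f"{lower / 10:.1f}-{upper / 10:.1f}  {f:>2}  {f / len(tenths):.1%}")
```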



EXERCISES

1. Find the class boundaries, midpoints, and widths for each class.

(a) 12-18 (c) 695-705 (e) 2.15-3.93


(b) 56-74 (d) 13.6-14.7 (f) 3.315-3.765

2. List five reasons for organizing data into a frequency distribution.

3. Name the two types of frequency distributions, and explain when each
should be used.

4. How many classes should frequency distributions have? Why should the
class width be an odd number?

5. Shown here is a frequency distribution that is incorrectly constructed.
State the reason why.

Class Frequency
27-32 1
33-38 0
39-44 6
45-49 4
50-55 2

6. Shown here is a frequency distribution that is incorrectly constructed.
State the reason why.

Class Frequency
5-9 1
9-13 2
13-17 5
17-20 6
20-24 3

7. Shown here is a frequency distribution that is incorrectly constructed.
State the reason why.

Class Frequency
123-127 3
128-132 7
138-142 2
143-147 19

8. Shown here is a frequency distribution that is incorrectly constructed.
State the reason why.

Class Frequency
9-13 1
14-19 6
20-25 2
26-28 5
29-32 9

9. What are open-ended frequency distributions? Why are they necessary?

10. A researcher conducted a survey asking people if they believed more
than one person was involved in the assassination of John F. Kennedy.
The results were as follows: 73% said yes, 19% said no, and 9% had no
opinion. Is there anything suspicious about the results?

11. A sample of birthweights (in ounces) from 100 consecutive deliveries at
a California hospital is given below. Construct a frequency distribution
with eight classes for the given data set.

58 118 92 120 86 123 134 104 132 121


68 111 121 91 122 104 115 128 106 133
115 115 94 98 107 124 85 126 88 89
125 102 122 115 104 98 108 118 67 146
122 104 138 99 138 105 125 108 127 135
132 32 95 83 124 155 132 93 140 112
105 138 96 161 128 127 124 100 112 141
94 116 113 108 115 85 137 110 101 89
119 109 103 108 109 122 124 110 135 115
64 144 87 98 133 89 121 88 104 112

12. The following are the daily fat intake (grams) of a group of 150 adult
males. Construct a frequency distribution with ten classes for the given
data set.

22 62 77 84 42 56 78 73 37 69
82 93 30 77 81 94 46 89 88 99
63 85 81 94 51 80 88 98 52 70
76 95 107 105 117 128 144 150 68 79
82 96 109 108 117 120 147 153 67 75
76 92 105 104 117 129 148 164 62 85
77 96 103 105 116 132 146 168 53 72
72 91 102 101 128 136 143 164 65 73
83 92 103 118 127 132 140 167 68 75
89 95 107 111 128 139 148 168 68 79
82 96 109 108 117 130 147 153 91 102
117 129 137 141 96 105 117 125 135 143
93 100 114 124 135 142 97 102 119 125
138 142 95 100 116 121 131 152 93 106
114 127 133 155 97 106 119 122 134 151

13. The following data provided the percentage saturation of bile for 29
women. These percentages were

65 58 52 91 84 107
86 98 35 128 116 84
76 146 55 75 73 120
89 80 127 82 87 123
142 66 77 69 76

Construct a frequency distribution with six classes for the given data set.

14. The following frequency distribution was obtained for the preoperational
percentage hemoglobin values of a group of subjects from a village where
there has been a malaria eradication program (MEP):

Hemoglobin (%) 30-39 40-49 50-59 60-69 70-79 80-89 90-99


Frequency 2 7 14 10 8 2 2

The results for another group were obtained after the MEP:

43 63 63 75 95 75 80 48 62 71 76 90
51 61 74 103 93 82 74 65 63 53 64 67
80 77 60 69 73 76 91 55 65 69 50 68
72 89 75 57 66 79 85 70 87 67 72 52
35 67 99 81 97 74 84 78 59 71 61 62

Form a frequency distribution using the same intervals as in the first
table.

1.8 Histograms, Frequency Polygons, and Ogives


After you have organized the data into a frequency distribution, you can
present them in graphical form. The purpose of graphs in statistics is to

convey the data to the viewers in pictorial form. It is easier for most people
to comprehend the meaning of data presented graphically than data presented
numerically in tables or frequency distributions. This is especially true if the
users have little or no statistical knowledge.

Statistical graphs can be used to describe the data set or to analyze it. Graphs
are also useful in getting the audience’s attention in a publication or a speaking
presentation. They can be used to discuss an issue, reinforce a critical point,
or summarize a data set. They can also be used to discover a trend or pattern
in a situation over a period of time.

The three most commonly used graphs in research are

1. the histogram

2. the frequency polygon

3. the cumulative frequency graph, or ogive (read as "o-jive")

The Histogram

Definition 1.8.1. The histogram is a graph that displays the data by using
contiguous vertical bars (unless the frequency of a class is 0) of various heights
to represent the frequencies of the classes.

Remark 1.8.1. In a histogram,

(a) The horizontal scale represents the value of the variable marked at in-
terval boundaries.

(b) The vertical scale represents the frequency or relative frequency in each
interval.

Example 1.8.1. These data represent the record high temperatures in degrees
Fahrenheit (°F) for each of the 50 states in the USA.

112 100 110 118 107 112 116 108 120 113
127 120 134 117 116 118 114 115 118 110
121 113 120 117 105 118 105 110 122 114
114 117 118 122 120 119 111 110 118 112
109 112 105 109 106 110 104 111 114 114

(a) Construct a grouped frequency distribution for the data using 7 classes.

(b) Construct a histogram to represent the data above.

Solution.

(a) We shall employ the procedures in Remark 1.7.7. First, note that R =
134 − 100 = 34. Next, with seven desired classes, we see that the (class)
width is equal to 34/7 ≈ 4.9, which we round up to 5. Since the smallest
number is 100, we may begin our first interval with this. The
considerations discussed so far lead to the following seven classes:

100-104 120-124
105-109 125-129
110-114 130-134
115-119

Tallying the data, and reflecting the corresponding numerical frequencies
from the tallies, we have

Temperature (°F) Class Boundaries Tally Frequency Percentage
100-104 99.5-104.5 2 4%
105-109 104.5-109.5 8 16%
110-114 109.5-114.5 18 36%
115-119 114.5-119.5 13 26%
120-124 119.5-124.5 7 14%
125-129 124.5-129.5 1 2%
130-134 129.5-134.5 1 2%
50 100%

Finally, polishing the table gives us

Temperature (°F) Frequency Percentage
100-104 2 4%
105-109 8 16%
110-114 18 36%
115-119 13 26%
120-124 7 14%
125-129 1 2%
130-134 1 2%
50 100%

(b) To construct the histogram, we first draw and label the x and y axes. The
x-axis is always the horizontal axis, and the y-axis is always the vertical
axis. Represent the frequency on the y-axis and the class boundaries on
the x-axis. Using the frequencies as the heights, draw vertical bars for
each class. Thus, the following histogram is constructed.

[Figure: histogram; vertical axis Frequency, horizontal axis Temperature (°F) marked at the class boundaries 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]

Figure 1.7: Histogram for Record High Temperatures in the 50 States of USA
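
A rough text rendering of this histogram can be produced in plain Python. The sketch below is an illustration only: it draws the bars horizontally rather than vertically, and it counts each value between consecutive class boundaries, which is possible because no whole-number temperature falls exactly on a boundary.

```python
temps = [
    112, 100, 110, 118, 107, 112, 116, 108, 120, 113,
    127, 120, 134, 117, 116, 118, 114, 115, 118, 110,
    121, 113, 120, 117, 105, 118, 105, 110, 122, 114,
    114, 117, 118, 122, 120, 119, 111, 110, 118, 112,
    109, 112, 105, 109, 106, 110, 104, 111, 114, 114,
]

# Class boundaries 99.5, 104.5, ..., 134.5
boundaries = [99.5 + 5 * k for k in range(8)]

frequencies = [
    sum(lo < t < hi for t in temps)
    for lo, hi in zip(boundaries, boundaries[1:])
]

# One row per class; the bar length equals the class frequency
for lo, hi, f in zip(boundaries, boundaries[1:], frequencies):
    print(f"{lo:5.1f}-{hi:5.1f} | {'#' * f} ({f})")
```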



The Frequency Polygon


Another way to represent the same data set is by using a frequency polygon.

Definition 1.8.2. The frequency polygon is a graph that displays the data
by using lines that connect points plotted for the frequencies at the midpoints
of the classes. The frequencies are represented by the heights of the points.

Remark 1.8.2. To draw a frequency polygon, we first place a dot at the
midpoint of the upper base of each rectangular bar. The points are connected
with straight lines. At the ends, the points are connected to the midpoints
of the previous and succeeding intervals (these are hypothetical intervals with
zero frequency, whose widths equal those of the first and last intervals,
respectively).

Remark 1.8.3. The frequency polygon can also be shown without the his-
togram on the same graph.

Remark 1.8.4. The frequency polygon and the histogram are two different
ways to represent the same data set. The choice of which one to use is left to
the discretion of the researcher.

Example 1.8.2. Using the frequency distribution given in Example 1.8.1,
construct a frequency polygon.

Solution. We first find the midpoints of each class. Doing so, we have

Temperature (°F) Midpoints Frequency
100-104 102 2
105-109 107 8
110-114 112 18
115-119 117 13
120-124 122 7
125-129 127 1
130-134 132 1

To draw the frequency polygon, we first draw and label the x and y axes. The
x-axis is always the horizontal axis, and the y-axis is always the vertical axis.
Represent the frequency on the y-axis and the class midpoints on the x-axis.
Using these, we then plot the points. Finally, connecting adjacent points with
line segments, the following frequency polygon is constructed.

[Figure: frequency polygon; vertical axis Frequency, horizontal axis Temperature (°F)]

Figure 1.8: Frequency Polygon for Record High Temperatures in the 50 States
of USA
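
The points joined by the frequency polygon can be listed with a short Python sketch. Following Remark 1.8.2, one zero-frequency point is added one class width before the first midpoint and one after the last; the variable names are assumptions.

```python
limits = [(100, 104), (105, 109), (110, 114), (115, 119),
          (120, 124), (125, 129), (130, 134)]
frequencies = [2, 8, 18, 13, 7, 1, 1]

width = limits[1][0] - limits[0][0]                # lower limit to next lower limit: 5
midpoints = [(lo + hi) / 2 for lo, hi in limits]   # 102.0, 107.0, ..., 132.0

# Close the polygon on the axis with zero-frequency classes at both ends
points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, frequencies))
    + [(midpoints[-1] + width, 0)]
)

for x, y in points:
    print(x, y)
```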



The Ogive
The third type of graph that can be used represents the cumulative frequencies
for the classes. This type of graph is called the cumulative frequency graph,
or ogive.

Definition 1.8.3.

1. The cumulative frequency is the sum of the frequencies accumulated
up to the upper boundary of a class in the distribution.

2. The ogive is a graph that represents the cumulative frequencies for the
classes in a frequency distribution.

Remark 1.8.5. Cumulative frequency graphs are used to visually represent
how many values are below a certain upper class boundary.

Example 1.8.3. Using the frequency distribution given in Example 1.8.1,
construct an ogive.

Solution. We first find the cumulative frequency of each class. Doing so, we
have
Temperature (°F) Cumulative Frequency
Less than 99.5 0
Less than 104.5 2
Less than 109.5 10
Less than 114.5 28
Less than 119.5 41
Less than 124.5 48
Less than 129.5 49
Less than 134.5 50
To draw the ogive, we first draw and label the x and y axes. The x-axis is
always the horizontal axis, and the y-axis is always the vertical axis. Represent
the cumulative frequency on the y-axis and the class boundaries on the x-axis.
Using these, we then plot the points. Finally, connecting adjacent points with
line segments, the following ogive is constructed.

[Figure: ogive; vertical axis Cumulative Frequency (0 to 50), horizontal axis Temperature (°F) marked at the class boundaries 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]

Figure 1.9: Ogive for Record High Temperatures in the 50 States of USA
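
The cumulative frequencies plotted in the ogive are running totals of the class frequencies, which can be sketched in Python with itertools.accumulate from the standard library. Each total is paired with the upper class boundary at which it is reached, starting from 0 at the first lower boundary.

```python
from itertools import accumulate

boundaries = [99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Running totals, with 0 values below the first boundary
cumulative = [0] + list(accumulate(frequencies))
points = list(zip(boundaries, cumulative))

for boundary, total in points:
    print(f"Less than {boundary}: {total}")
```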

EXERCISES

1. The number of faculty listed for a variety of private colleges which offer
only bachelor’s degrees is listed below. Use these data to construct a
frequency distribution with 7 classes, a histogram, a frequency polygon,
and an ogive. Discuss the shape of this distribution.

165 221 70 210 176 162 221


161 218 206 207 154 225 214
128 310 138 135 155 82 93
389 224 204 120 116 77 135

2. The number of counties, divisions, or parishes for each of the 50 states is
given below. Use the data to construct a grouped frequency distribution
with 6 classes, a histogram, a frequency polygon, and an ogive. Analyze
the distribution.

67 27 102 44 83 87 62 100 95 254


15 75 92 99 82 114 53 88 29 14
58 64 105 120 56 93 77 36 95 39
8 67 64 16 16 10 67 5 55 72
159 5 23 14 21 33 46 66 23 3

3. The number of calories per serving for selected ready-to-eat cereals is
listed here. Construct a frequency distribution using 7 classes. Draw a
histogram, a frequency polygon, and an ogive for the data, using relative
frequencies. Describe the shape of the histogram.

130 190 140 80 100 120 220 220 110


100 210 130 100 90 210 120 200 120
180 120 190 210 120 200 130 180 260
270 100 160 190 240 80 120 90 190
200 210 190 180 115 210 110 225 190

4. The amount of protein (in grams) for a variety of fast-food sandwiches
is reported here. Construct a frequency distribution using 6 classes.
Draw a histogram, a frequency polygon, and an ogive for the data, using
relative frequencies. Describe the shape of the histogram.

23 30 20 27 44 26 35 20 29 29
25 15 18 27 19 22 12 26 34 15
27 35 26 43 35 14 24 12 23 31
40 35 38 57 22 42 24 21 27 33