Chap 3
What is an average? It is a single number used to describe the central tendency of a set of
data.
Examples of an average are:
The average length of the school year for students in public schools in the United
States is 180 days.
The median salary of New York Yankees baseball players on opening day 2003
was $4,575,000.
(http://asp.usatoday.com/sports/baseball/salaries/mediansalaries.aspx?year=2003)
The median asking price for a group of houses listed for sale by a Toledo realtor
is $128,000.
The 2003 average income for Computer Support Specialists was $20.16 per
hour. (The Census Bureau's Income Statistics Branch, July 2003)
Computer Software Engineers' pay averaged $35.81 per hour, while Nurses' Aides
averaged $9.24 per hour. (Bureau of Labor Statistics web site: http://stats.bls.gov,
July 2003)
There are several types of averages. We will consider five: the arithmetic mean, the
median, the mode, the weighted mean, and the geometric mean.
Measures of Location
The purpose of a measure of location is to pinpoint the center of a set of observations.
Measure of location: A single value that summarizes a set of data. It locates
the center of the values.
The arithmetic mean, or simply the mean, is the most widely used measure of location.
Mean: The sum of observations divided by the total number of observations.
The population mean is calculated as follows:
Population Mean    $\mu = \dfrac{\Sigma X}{N}$    [3-1]
Where:
µ represents the population mean. It is the Greek letter "mu."
N is the number of items in the population.
X is any particular value.
Σ indicates the operation of adding all the values. It is the Greek letter "sigma."
ΣX is the sum of the X values.
[3-1] indicates the formula number from the text.
Any measurable characteristic of a population is called a parameter.
Parameter: A characteristic of a population.
The Sample Mean
As explained in Chapter 1, we frequently select a sample from the population to find out
something about a specific characteristic of the population.
The mean of a sample and the mean of a population are computed in the same way, but
the shorthand notation is different.
In terms of symbols, the formula for the mean of a sample is:
Sample Mean    $\bar{X} = \dfrac{\Sigma X}{n}$    [3-2]
Where:
$\bar{X}$ is the sample mean; it is read "X bar."
n is the number of values in the sample.
X is a particular value.
Σ indicates the operation of adding all the values.
ΣX is the sum of the X values.
[3-2] is the formula number from the text.
The mean of a sample, or any other measure based on sample data, is called a statistic.
Statistic: A characteristic of a sample.
"The mean weight of a sample of laptop computers is 6.5 pounds," is an example of a
statistic.
In formulas [3-1] and [3-2] the mean is calculated by summing the observations and
dividing by the total number of observations.
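Suppose the Kellogg Company's quarterly earnings per share for the last five quarters are:
$0.89, $0.77, $1.05, $0.79, and $0.95. If the earnings are a population, the mean is $0.89,
found by:
$\mu = \dfrac{\Sigma X}{N} = \dfrac{\$0.89 + \$0.77 + \$1.05 + \$0.79 + \$0.95}{5} = \dfrac{\$4.45}{5} = \$0.89$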
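As a quick check of the arithmetic, here is a minimal Python sketch of formula [3-1] applied to the Kellogg earnings. The function name `population_mean` is an illustrative choice, not something taken from the text.

```python
# Quarterly earnings per share (in dollars) from the Kellogg example.
earnings = [0.89, 0.77, 1.05, 0.79, 0.95]


def population_mean(values):
    """Formula [3-1]: the sum of the values divided by the number of values."""
    return sum(values) / len(values)


mu = population_mean(earnings)
print(f"Population mean: ${mu:.2f}")
```

The sum of the five values is $4.45, and dividing by N = 5 gives a mean of $0.89.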
Weighted Mean
The weighted mean is a special case of the arithmetic mean. It is often useful when there
are several observations of the same value.
Weighted mean: The value of each observation is multiplied by the number
of times it occurs. The sum of these products is divided by the total number of
observations to determine the weighted mean.
In general, the weighted mean of a set of values, designated X1, X2, X3, … Xn, with the
corresponding weights w1, w2, w3, …, wn is computed by:
Weighted Mean    $\bar{X}_w = \dfrac{w_1X_1 + w_2X_2 + w_3X_3 + \cdots + w_nX_n}{w_1 + w_2 + w_3 + \cdots + w_n}$
The weighted mean is particularly useful when various classes or groups contribute
differently to the total. For example, the coronary care unit of a hospital consists of
nurses' aides who are paid $12 per hour, nurses' assistants who earn $15 per hour, and
registered nurses who earn $24 per hour.
To say the average hourly wage for the coronary unit is $17 per hour, ($12 + $15 + $24) ÷
3, would not be accurate unless there were the same number of people in each group.
Suppose the coronary care unit has ten employees: two aides who earn $12 per hour, three
nurses' assistants who earn $15 per hour, and five registered nurses who earn $24 per
hour. The weighted mean is:
$\bar{X}_w = \dfrac{2(\$12) + 3(\$15) + 5(\$24)}{2 + 3 + 5} = \dfrac{\$189}{10} = \$18.90$
Thus the weighted mean is $18.90.
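The weighted mean for the coronary care unit can be sketched in Python as follows. This is only an illustration: the helper name `weighted_mean` is hypothetical, and the registered-nurse wage is taken to be $24 per hour, the figure that reproduces the $18.90 result stated above.

```python
def weighted_mean(values, weights):
    """Sum of each value times its weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hourly wages and number of employees in each group.
# Assumption: registered nurses earn $24 per hour.
wages = [12, 15, 24]
staff = [2, 3, 5]

xw = weighted_mean(wages, staff)
simple = sum(wages) / len(wages)

print(f"Weighted mean:   ${xw:.2f}")
print(f"Unweighted mean: ${simple:.2f}")
```

The weighted mean ($18.90) is higher than the unweighted mean ($17.00) because the highest-paid group is also the largest.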
The Median
It was pointed out that the arithmetic mean is often not representative of data with
extreme values. The median is a useful measure when we encounter data with an extreme
value.
Median: The midpoint of the values after all observations have been ordered
from the smallest to the largest, or from the largest to the smallest.
To determine the median, the values are ordered from low to high, or high to low, and
the middle value is selected. Hence, half the observations are above the median and half
are below it. For the following five executive incomes, the middle value, $44,000, is the
median.
$40,000   $42,000   $44,000 (median)   $48,000   $300,000
The median is a more representative value in this problem than the mean of $94,800.
Note that there was an odd number of executive incomes (5). For an odd number of
ungrouped values we just order them and select the middle value. To determine the
median of an even number of ungrouped values, the first step is to arrange them from low
to high as usual, and then determine the value halfway between the two middle values.
As an example, the number of bronze castings produced in a day at Markey Bronze is 87,
62, 91, 58, 99, and 85. Ordering these from low to high:
58 62 85 87 91 99
The median number produced is halfway between the two middle values of 85 and 87.
The median is 86. Thus we note that the median (86) may not be one of the values in a set
of data.
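The two cases (odd and even number of values) can be sketched in Python. The function below is an illustrative implementation written for this guide's examples; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def median(values):
    """Middle value of the ordered data; for an even count, the value
    halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


incomes = [40_000, 42_000, 44_000, 48_000, 300_000]  # odd number of values
castings = [87, 62, 91, 58, 99, 85]                  # even number of values

print(median(incomes))   # middle of five ordered values
print(median(castings))  # halfway between 85 and 87

# Cross-check against the standard library.
print(statistics.median(incomes) == median(incomes))
print(statistics.median(castings) == median(castings))
```

For the castings data the result, 86, is not one of the observed values, as noted above.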
Properties of the Median
The major properties of the median are:
1. The median is a unique value, that is, like the mean, there is only one median for a
set of data.
2. It is not influenced by extremely large or small values.
3. It can be computed for ratio level, interval level, and ordinal-level data.
4. Fifty percent of the observations are greater than the median and fifty percent of
the observations are less than the median.
The Mode
A third measure of location is the mode.
Mode: The value of the observation that appears most frequently.
The mode is the value that occurs most often in a set of raw data. The dividends per share
declared on five stocks were: $3, $2, $4, $5, and $4. Since $4 occurred twice, which was
the most frequent, the mode is $4.
Properties of the Mode
1. The mode can be found for all levels of data (nominal, ordinal, interval, and
ratio).
2. The mode is not affected by extremely high or low values.
3. A set of data can have more than one mode. If it has two modes, it is said to be
bimodal.
4. A disadvantage is that a set of data may not have a mode because no value
appears more than once.
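A short Python sketch can illustrate the mode and two of the properties listed above (more than one mode, and no mode at all). The helper `modes` is hypothetical, and the second and third data sets are made-up values used only for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that appears most often.
    An empty list means no value appears more than once (no mode)."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


dividends = [3, 2, 4, 5, 4]       # dividends per share from the text
two_modes = [1, 1, 2, 2, 3]       # illustrative data: bimodal
no_repeat = [1, 2, 3]             # illustrative data: no mode

print(modes(dividends))
print(modes(two_modes))
print(modes(no_repeat))
```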
Geometric Mean
Geometric Mean    $GM = \sqrt[n]{(X_1)(X_2)(X_3) \cdots (X_n)}$
Where:
$X_1$, $X_2$, $X_3$, etc. are data values.
n is the number of values.
$\sqrt[n]{\ }$ is the nth root.
The geometric mean can be used for averaging percents. Suppose the return on
investment for McDermoll International for the past 4 years is 0.4%, 2.9%, 2.1%, and
12.3%. Each return is first written as a growth factor (for example, 0.4% becomes 1.004).
The GM increase over the period is 4.3 percent, found by:
$GM = \sqrt[4]{(1.004)(1.029)(1.021)(1.123)} = \sqrt[4]{1.18455} = 1.043$
The geometric mean is the fourth root of 1.18455, which is 1.043. The average return on the
investment is found by subtracting one from the geometric mean: (1.043 – 1.000) = 0.043
= 4.3%.
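The calculation can be checked with a small Python sketch. The function name `geometric_mean_return` is an illustrative choice; it converts each percent return to a growth factor, takes the nth root of the product, and subtracts one.

```python
import math


def geometric_mean_return(percent_returns):
    """Average percent return using the geometric mean of growth factors."""
    factors = [1 + p / 100 for p in percent_returns]
    gm = math.prod(factors) ** (1 / len(factors))
    return (gm - 1) * 100


returns = [0.4, 2.9, 2.1, 12.3]  # yearly returns in percent, from the text

product = math.prod(1 + p / 100 for p in returns)
avg = geometric_mean_return(returns)

print(f"Product of growth factors: {product:.5f}")
print(f"Geometric mean return: {avg:.1f}%")
```

The product of the four growth factors is about 1.18455, and its fourth root gives an average return of about 4.3 percent.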
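Another application of the geometric mean is to find the average percent increase over a
period of time. Text formula [3-5] is used:
Average Percent Increase Over Time    $GM = \sqrt[n]{\dfrac{\text{Value at end of period}}{\text{Value at beginning of period}}} - 1$    [3-5]
Here n is the number of periods between the beginning and ending values.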
Why Study Dispersion?
A direct comparison of two sets of data based only on two measures of location such as
the mean and the median can be misleading since an average does not tell us anything
about the spread of the data.
For example, the mean salary paid to baseball players for the New York Yankees is
$5,341,148. However, the range is $15,300,000, with a low of $300,000 and a high of
$15,600,000. The Tampa Bay Devil Rays have a mean salary of $784,000. The range is
$6,200,000, with a low of $300,000 and a high of $6,500,000.
(http://espn.go.com/mlb/clubhouses/salaries)
Suppose a statistics instructor has two classes, one in the morning and one in the evening,
each with six students. In the morning class (AM) the students' ages are 18, 20, 21, 21,
23, and 23 years. In the evening class (PM) the ages are 17, 17, 18, 20, 25, and 29 years.
Note that for both classes the mean age is 21 years but there is more variation or
dispersion in the ages of the evening students.
A small value for a measure of dispersion indicates that the data are clustered closely,
say, around the arithmetic mean. Thus the mean is considered representative of the data,
that is, it is reliable. Conversely, a large measure of dispersion indicates that the mean is
not reliable and is not representative of the data.
Measures of Dispersion
We will consider several measures of dispersion: the range, the mean deviation, the
variance, and the standard deviation.
Range
The simplest measure of dispersion is the range.
Range: The difference between the largest and smallest values in a data set.
The formula for the range is:
Range = Largest value - Smallest value
The statistics instructor referred to above has two classes with the ages indicated:
A.M. Class: 18, 20, 21, 21, 23, 23
P.M. Class: 17, 17, 18, 20, 25, 29
The range for the classes is:
A.M. Class: (23 - 18) = 5
P.M. Class: (29 - 17) = 12
Thus we can say that there is more spread in the ages of the students enrolled in
the evening (P.M.) class compared with the morning (A.M.) class.
The characteristics of the range are:
Only two values are used in the calculation.
It is influenced by extreme values.
It is easy to compute and understand.
It can be distorted by an extreme value.
The range has two disadvantages. It can be distorted by a single extreme value. Suppose
the same statistics instructor has a third class of five students. The ages of these
students are:
Ages of Students: 20   20   21   22   60
The range of ages is 40 years, yet four of the five students' ages are within two years of
each other. The 60-year-old student has distorted the spread. Another disadvantage is that
only two values, the largest and the smallest, are used in its calculation.
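The range calculations for the three classes can be reproduced with a brief Python sketch. The function name `value_range` and the variable names are illustrative choices.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


am_class = [18, 20, 21, 21, 23, 23]
pm_class = [17, 17, 18, 20, 25, 29]
third_class = [20, 20, 21, 22, 60]

for name, ages in [("A.M.", am_class), ("P.M.", pm_class), ("Third", third_class)]:
    print(f"{name} class range: {value_range(ages)} years")

# Leaving out the single 60-year-old shows how one extreme value drives the range.
without_extreme = [age for age in third_class if age != 60]
print(f"Third class without the extreme value: {value_range(without_extreme)} years")
```

Removing one student changes the third class's range from 40 years to 2 years, which is the distortion described above.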
Mean Deviation
In contrast to the range, the mean deviation considers all the data.
Mean Deviation: The arithmetic mean of the absolute values of the deviations from the
arithmetic mean.
In terms of symbols, the formula for the mean deviation is:
Mean Deviation    $MD = \dfrac{\Sigma |X - \bar{X}|}{n}$
Where:
X is the value of each observation.
$\bar{X}$ is the arithmetic mean of the values.
n is the number of observations in the sample.
| | indicates the absolute value.
We take the absolute value of the deviations from the mean because if we didn't, the
positive and negative deviations from the mean would exactly offset each other, and the
mean deviation would always be zero. Such a measure (zero) would be a useless statistic.
The mean deviation is computed by first determining the difference between each
observation and the mean. These differences are then averaged without regard to their
signs. For the PM statistics class the mean deviation is 4.0 years, found as follows:
$|X - \bar{X}|$ = Absolute Deviation
|17 - 21| = |-4| = 4
|17 - 21| = |-4| = 4
|18 - 21| = |-3| = 3
|20 - 21| = |-1| = 1
|25 - 21| = |4| = 4
|29 - 21| = |8| = 8
Total = 24
Then $MD = \dfrac{\Sigma |X - \bar{X}|}{n} = \dfrac{24}{6} = 4.0$
The parallel lines | | indicate absolute value. To interpret, 4.0 years is the mean amount
by which the ages differ from the arithmetic mean age of 21.0 years for the PM students.
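The mean deviation for the PM class, and the reason absolute values are needed, can be sketched in Python. The function name `mean_deviation` is an illustrative choice.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


pm_class = [17, 17, 18, 20, 25, 29]
mean_age = sum(pm_class) / len(pm_class)

# Without absolute values the deviations cancel out.
signed_total = sum(x - mean_age for x in pm_class)
absolute_total = sum(abs(x - mean_age) for x in pm_class)
md = mean_deviation(pm_class)

print(f"Mean age: {mean_age}")
print(f"Sum of signed deviations: {signed_total}")
print(f"Sum of absolute deviations: {absolute_total}")
print(f"Mean deviation: {md}")
```

The signed deviations sum to zero, while the absolute deviations sum to 24, giving a mean deviation of 24 / 6 = 4.0 years.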
Variance and Standard Deviation
The disadvantage of the mean deviation is that the absolute values are difficult to
manipulate mathematically. Squaring the differences between each value and the mean
eliminates the problem of absolute values. These squared differences are used in the
computation of both the variance and the standard deviation.
Variance: The arithmetic mean of the squared deviations from the mean.
The variance is non-negative and is zero only if all observations are the same.
Standard Deviation: The square root of the variance.
Squaring units of measurement, such as dollars or years, makes the variance cumbersome
to use since it yields units like "dollars squared" or "years squared." However, by
calculating the standard deviation, which is the positive square root of the variance, we
can return to the original units, such as years or dollars. Because the standard deviation is
easier to interpret, it is more widely used than the mean deviation or the variance.
Population Variance
The formulas for the population variance and the sample variance are slightly different.
The formula for the population variance is:
Population Variance    $\sigma^2 = \dfrac{\Sigma (X - \mu)^2}{N}$
Where:
σ² is the symbol for the population variance (σ is the lowercase Greek letter sigma). It is
usually referred to as "sigma squared."
X is a value of an observation in the population.
µ is the arithmetic mean of the population.
N is the total number of observations in the population.
The major characteristics of the variance are:
1. All the observations are used in the calculations.
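2. Because every observation enters the calculation, it does not depend on the two
extreme values alone, as the range does. Extreme observations still increase it.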
3. The units are somewhat difficult to work with. (They are the original units
squared.)
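To make the population variance concrete, the sketch below applies the definition (the mean of the squared deviations from the mean) to the two classes described earlier, treating each class as a population. That treatment, and the function name `population_variance`, are assumptions made for illustration.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


am_class = [18, 20, 21, 21, 23, 23]
pm_class = [17, 17, 18, 20, 25, 29]

for name, ages in [("A.M.", am_class), ("P.M.", pm_class)]:
    var = population_variance(ages)
    sd = math.sqrt(var)
    print(f"{name} class: variance = {var:.2f} years squared, "
          f"standard deviation = {sd:.2f} years")
```

Both classes have a mean age of 21, but the variance is 3.00 for the morning class and about 20.33 for the evening class, which matches the greater spread seen in the evening ages.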
Sample Variance
The conversion of the population variance formula to the sample variance formula is not
as direct as the change made when we went from the population mean formula to the
sample mean formula. Recall in that instance we replaced µ with $\bar{X}$ and N with n.
The conversion from population variance to sample variance requires a change in the
denominator. Instead of substituting n, the number in the sample, for N, the number in the
population, we replace N with (n – 1). Thus the formula for the sample variance is:
Sample Variance    $s^2 = \dfrac{\Sigma (X - \bar{X})^2}{n - 1}$
Where:
s² is the symbol for the sample variance. It is pronounced as "s squared."
X is the value of each observation in the sample.
$\bar{X}$ is the mean of the sample.
n is the total number of observations in the sample.
Changing the denominator to (n – 1) may seem insignificant; however, using n tends to
underestimate the population variance. The use of (n – 1) in the denominator provides an
appropriate correction factor.
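The effect of the (n – 1) denominator can be seen in a short Python sketch. Here the evening class ages are treated as a sample, which is an assumption made for illustration, and `sample_variance` is a hypothetical helper name.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


pm_class = [17, 17, 18, 20, 25, 29]
n = len(pm_class)
xbar = sum(pm_class) / n
squared_total = sum((x - xbar) ** 2 for x in pm_class)

divided_by_n = squared_total / n       # smaller value
s2 = sample_variance(pm_class)         # uses n - 1
s = math.sqrt(s2)                      # square root of the sample variance

print(f"Sum of squared deviations: {squared_total}")
print(f"Divided by n:     {divided_by_n:.2f}")
print(f"Divided by n - 1: {s2:.2f}")
print(f"Square root of the sample variance: {s:.2f}")
```

Dividing the sum of squared deviations (122) by n = 6 gives about 20.33, while dividing by n – 1 = 5 gives 24.40, so the sample formula always yields the larger value.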
Sample Standard Deviation
The sample standard deviation is used as an estimator of the population standard
deviation. The sample standard deviation is the square root of the sample variance. The
formula is:
Standard Deviation    $s = \sqrt{\dfrac{\Sigma (X - \bar{X})^2}{n - 1}}$
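For data grouped into a frequency distribution, the sample mean is estimated by:
Arithmetic Mean of Grouped Data    $\bar{X} = \dfrac{\Sigma fM}{n}$
Where:
$\bar{X}$ is the designation for the sample mean.
M is the midpoint of each class.
f is the frequency in each class.
fM is the frequency in each class times the midpoint of the class.
ΣfM is the sum of these products.
n is the total number of frequencies.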
The formula for the sample standard deviation for grouped data is:
Standard Deviation    $s = \sqrt{\dfrac{\Sigma fM^2 - \dfrac{(\Sigma fM)^2}{n}}{n - 1}}$
Where:
s is the symbol for the sample standard deviation.
M is the midpoint of a class.
f is the class frequency.
n is the total number of observations in the sample.
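The grouped-data calculations can be sketched in Python under some stated assumptions. The frequency table below is hypothetical (it does not come from the text), and the formulas used are the common textbook forms: the mean is ΣfM / n, and the standard deviation is computed both from squared deviations of the midpoints and from the equivalent shortcut using ΣfM² and (ΣfM)² / n.

```python
import math

# Hypothetical frequency distribution: class midpoints M and frequencies f.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

n = sum(frequencies)                                              # total frequency
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))       # sum of f*M
sum_fm2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))  # sum of f*M^2

grouped_mean = sum_fm / n

# Deviation form: squared distance of each midpoint from the mean, weighted by f.
s_deviation = math.sqrt(
    sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
)

# Shortcut form: uses only f, M, and n.
s_shortcut = math.sqrt((sum_fm2 - sum_fm ** 2 / n) / (n - 1))

print(f"n = {n}, sum of fM = {sum_fm}, grouped mean = {grouped_mean}")
print(f"s (deviation form) = {s_deviation:.4f}")
print(f"s (shortcut form)  = {s_shortcut:.4f}")
```

Both forms give the same value because they are algebraically equivalent. The results are estimates, since every observation in a class is represented by the class midpoint.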