4. Statistical Estimation
On many occasions estimating the population mean is useful in business research. For
example, the manager of human resources in a company might want to estimate the aver-
age number of days of work an employee misses per year because of illness. If the firm has
thousands of employees, direct calculation of a population mean such as this may
be practically impossible. Instead, a random sample of employees can be taken,
and the sample mean number of sick days can be used to estimate the population
mean. Suppose another company developed a new process for prolonging the
shelf life of a loaf of bread. The company wants to be able to date each loaf for
freshness, but company officials do not know exactly how long the bread will stay
fresh. By taking a random sample and determining the sample mean shelf life, they can
estimate the average shelf life for the population of bread.
As the cellular telephone industry matures, a cellular telephone company is
rethinking its pricing structure. Users appear to be spending more time on the phone
and are shopping around for the best deals. To do better planning, the cellular company
wants to ascertain the average number of minutes of time used per month by each of
its residential users but does not have the resources available to examine all monthly
bills and extract the information. The company decides to take a sample of customer
bills and estimate the population mean from sample data. A researcher for the company
takes a random sample of 85 bills for a recent month and from these bills computes a
sample mean of 510 minutes. This sample mean, which is a statistic, is used to estimate
the population mean, which is a parameter. If the company uses the sample mean of
510 minutes as an estimate for the population mean, then the sample mean is used as a
point estimate.
A point estimate is a statistic taken from a sample that is used to estimate a population
parameter. A point estimate is only as good as the representativeness of its sample. If other
random samples are taken from the population, the point estimates derived from those
samples are likely to vary. Because of variation in sample statistics, estimating a population
parameter with an interval estimate is often preferable to using a point estimate. An interval
estimate (confidence interval) is a range of values within which the analyst can declare, with
some confidence, the population parameter lies. Confidence intervals can be two sided or one
sided. This text presents only two-sided confidence intervals. How are confidence intervals
constructed?
As a result of the central limit theorem, the following z formula for sample means can
be used if the population standard deviation is known when sample sizes are large, regard-
less of the shape of the population distribution, or for smaller sizes if the population is
normally distributed.
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Rearranging this formula algebraically to solve for $\mu$ gives
$$\mu = \bar{x} - z\frac{\sigma}{\sqrt{n}}$$
Because a sample mean can be greater than or less than the population mean, z can be
positive or negative. Thus the preceding expression takes the following form.
$$\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$$
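Rewriting this expression yields the confidence interval formula for estimating $\mu$ with
large sample sizes if the population standard deviation is known (formula 8.1).
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \qquad (8.1)$$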
254 Chapter 8 Statistical Inference: Estimation for Single Populations
Alpha ($\alpha$) is the area under the normal curve in the tails of the distribution outside the
area defined by the confidence interval. We will focus more on $\alpha$ in Chapter 9. Here we use
$\alpha$ to locate the z value in constructing the confidence interval as shown in Figure 8.3.
Because the standard normal table is based on areas between a z of 0 and $z_{\alpha/2}$, the table z
value is found by locating the area of $.5000 - \alpha/2$, which is the part of the normal curve
between the middle of the curve and one of the tails. Another way to locate this z value is
to change the confidence level from percentage to proportion, divide it in half, and go to
the table with this value. The results are the same.
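The table lookup described above can also be sketched in code. The following is a minimal illustration, assuming Python's standard-library `statistics.NormalDist`; the helper name `z_for_confidence` is hypothetical and not part of the text. It converts a confidence level into $\alpha$ and asks for the z value that leaves $\alpha/2$ in the upper tail.

```python
from statistics import NormalDist

def z_for_confidence(confidence: float) -> float:
    """Return z_{alpha/2} for a two-sided interval (hypothetical helper)."""
    alpha = 1.0 - confidence  # total area in the two tails
    # inv_cdf expects the cumulative area to the left of z, which is 1 - alpha/2
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

if __name__ == "__main__":
    for level in (0.90, 0.95, 0.98, 0.99):
        print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```

The computed values carry more decimal places than a printed table, so they may differ from table values in the third decimal.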
The confidence interval formula (8.1) yields a range (interval) within which we feel
with some confidence that the population mean is located. It is not certain that the popu-
lation mean is in the interval unless we have a 100% confidence interval that is infinitely
wide. If we want to construct a 95% confidence interval, the level of confidence is 95%, or
.95. If 100 such intervals are constructed by taking random samples from the population,
it is likely that 95 of the intervals would include the population mean and 5 would not.
As an example, in the cellular telephone company problem of estimating the popula-
tion mean number of minutes called per residential user per month, from the sample of
85 bills it was determined that the sample mean is 510 minutes. Using this sample mean, a
confidence interval can be calculated within which the researcher is relatively confident
that the actual population mean is located. To make this calculation using formula 8.1, the
value of the population standard deviation and the value of z (in addition to the sample
mean, 510, and the sample size, 85) must be known. Suppose past history and similar stud-
ies indicate that the population standard deviation is 46 minutes.
The value of z is driven by the level of confidence. An interval with 100% confidence
is so wide that it is meaningless. Some of the more common levels of confidence used by
business researchers are 90%, 95%, 98%, and 99%. Why would a business researcher not
just select the highest confidence and always use that level? The reason is that trade-offs
between sample size, interval width, and level of confidence must be considered. For exam-
ple, as the level of confidence is increased, the interval gets wider, provided the sample size
and standard deviation remain constant.
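The trade-off can be made concrete with a short sketch. It assumes the cellular telephone figures quoted in the text ($\sigma = 46$, $n = 85$) and uses a hypothetical helper, `margin_of_error`, to show how the half-width of the interval grows as the confidence level rises and shrinks as the sample size grows.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(confidence: float, sigma: float, n: int) -> float:
    """Half-width z * sigma / sqrt(n) of a two-sided z interval (hypothetical helper)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z * sigma / sqrt(n)

if __name__ == "__main__":
    sigma, n = 46.0, 85  # values from the cellular telephone example
    for level in (0.90, 0.95, 0.98, 0.99):
        print(f"{level:.0%}: +/- {margin_of_error(level, sigma, n):.2f} minutes")
    # Quadrupling the sample size halves the margin at a fixed confidence level.
    print(f"95%, n = {4 * n}: +/- {margin_of_error(0.95, sigma, 4 * n):.2f} minutes")
```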
For the cellular telephone problem, suppose the business researcher decided on a 95%
confidence interval for the results. Figure 8.4 shows a normal distribution of sample means
about the population mean. When using a 95% level of confidence, the researcher selects
an interval centered on $\mu$ within which 95% of all sample mean values will fall and then
uses the width of that interval to create an interval around the sample mean within which
he has some confidence the population mean will fall.
FIGURE 8.3 z Scores for Confidence Intervals in Relation to α
[Figure: standard normal curve with central area 1 – α (the confidence), each half equal to (.5000 – α/2), and shaded tails of area α/2 beyond –z α/2 and z α/2.]
8.1 Estimating the Population Mean Using the z Statistic (S Known) 255
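FIGURE 8.4 Distribution of Sample Means for 95% Confidence
[Figure: distribution of sample means centered on μ; the central 95% is split into areas of .4750 on each side, with α/2 = .025 in each tail beyond z = –1.96 and z = +1.96.]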
For 95% confidence, $\alpha = .05$ and $\alpha/2 = .025$. The value of $z_{\alpha/2}$, or $z_{.025}$, is found by
looking in the standard normal table under .5000 - .0250 = .4750. This area in the table is
associated with a z value of 1.96. Another way can be used to locate the table z value. Because the
distribution is symmetric and the intervals are equal on each side of the population mean,
½(95%), or .4750, of the area is on each side of the mean. Table A.5 yields a z value of 1.96
for this portion of the normal curve. Thus the z value for a 95% confidence interval is always
1.96. In other words, of all the possible $\bar{x}$ values along the horizontal axis of the diagram,
95% of them should be within a z score of 1.96 from the population mean.
The business researcher can now complete the cellular telephone problem. To determine
a 95% confidence interval for $\bar{x} = 510$, $\sigma = 46$, $n = 85$, and $z = 1.96$, the researcher
estimates the average number of minutes used per month by including the value of z in formula 8.1.
$$510 - 1.96\frac{46}{\sqrt{85}} \le \mu \le 510 + 1.96\frac{46}{\sqrt{85}}$$
$$510 - 9.78 \le \mu \le 510 + 9.78$$
$$500.22 \le \mu \le 519.78$$
The confidence interval is constructed from the point estimate, which in this problem is
510 minutes, and the error of this estimate, which is ±9.78 minutes. The resulting confidence
interval is $500.22 \le \mu \le 519.78$. The cellular telephone company researcher is 95% confident
that the average number of minutes used per month by residential users in the population is
between 500.22 and 519.78 minutes.
What does being 95% confident that the population mean is in an interval actually
indicate? It indicates that, if the company researcher were to randomly select 100 samples
of 85 bills and use the results of each sample to construct a 95% confidence interval,
approximately 95 of the 100 intervals would contain the population mean. It also indicates
that about 5% of the intervals would not contain the population mean. The company researcher
is likely to take only a single sample and compute the confidence interval from that sample
information. That interval either contains the population mean or it does not. Figure 8.5
depicts the meaning of a 95% confidence interval for the mean. Note that if 20 random
samples are taken from the population, 19 of the 20 intervals are likely to contain the population
mean if a 95% confidence interval is used (19/20 = 95%). If a 90% confidence interval is
constructed, only 18 of the 20 intervals are likely to contain the population mean.
FIGURE 8.5 Twenty 95% Confidence Intervals of μ
[Figure: twenty intervals, each centered on its own sample mean, plotted against the population mean μ.]
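The repeated-sampling interpretation above can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a true mean of 510 and $\sigma = 46$ (in practice the true mean is unknown; 510 is used here purely for illustration), and the function name `coverage_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, confidence, trials, seed=0):
    """Fraction of simulated z intervals that contain the assumed true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of size n from the assumed normal population
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = fmean(sample)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    rate = coverage_rate(mu=510, sigma=46, n=85, confidence=0.95, trials=4000)
    print(f"Share of intervals containing the population mean: {rate:.3f}")
```

Over many trials the share should settle near the stated confidence level, although any single run of 20 or 100 intervals can land a little above or below it.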
DEMONSTRATION PROBLEM 8.1
A survey was taken of U.S. companies that do business with firms in India. One of
the questions on the survey was: Approximately how many years has your company
been trading with firms in India? A random sample of 44 responses to this question
yielded a mean of 10.455 years. Suppose the population standard deviation for this
question is 7.7 years. Using this information, construct a 90% confidence interval for
the mean number of years that a company has been trading in India for the popula-
tion of U.S. companies trading with firms in India.
Solution
Here, $n = 44$, $\bar{x} = 10.455$, and $\sigma = 7.7$. To determine the value of $z_{\alpha/2}$, divide the
90% confidence in half, or take $.5000 - \alpha/2 = .5000 - .0500 = .4500$, where $\alpha = 10\%$. Note: The z
value for 90% confidence is 1.645.
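$$\bar{x} - z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z\frac{\sigma}{\sqrt{n}}$$
$$10.455 - 1.645\frac{7.7}{\sqrt{44}} \le \mu \le 10.455 + 1.645\frac{7.7}{\sqrt{44}}$$
$$10.455 - 1.910 \le \mu \le 10.455 + 1.910$$
$$8.545 \le \mu \le 12.365$$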
The analyst is 90% confident that if a census of all U.S. companies trading with
firms in India were taken at the time of this survey, the actual population mean num-
ber of years a company would have been trading with firms in India would be
between 8.545 and 12.365. The point estimate is 10.455 years.
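As a check on the arithmetic in this solution, the following minimal sketch recomputes the interval from the values given in the problem. The function name `z_interval` is hypothetical, and the z value 1.645 is taken from the solution rather than computed.

```python
from math import sqrt

def z_interval(x_bar: float, sigma: float, n: int, z: float):
    """Return (lower, upper) limits of x_bar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

if __name__ == "__main__":
    # Values from Demonstration Problem 8.1
    lower, upper = z_interval(x_bar=10.455, sigma=7.7, n=44, z=1.645)
    print(f"{lower:.3f} <= mu <= {upper:.3f}")
```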
For convenience, Table 8.1 contains some of the more common levels of confidence
and their associated z values.
TABLE 8.1 Values of z for Common Levels of Confidence
Confidence Level    z Value
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.575
Finite Correction Factor
Recall from Chapter 7 that if the sample is taken from a finite population, a finite correction
factor may be used to increase the accuracy of the solution. In the case of interval estimation,
the finite correction factor is used to reduce the width of the interval. As stated in
Chapter 7, if the sample size is less than 5% of the population, the finite correction factor
does not significantly alter the solution. If formula 8.1 is modified to include the finite
correction factor, the result is formula 8.2.
CONFIDENCE INTERVAL TO ESTIMATE $\mu$ USING THE FINITE CORRECTION FACTOR (8.2)
$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$
Demonstration Problem 8.2 shows how the finite correction factor can be used.
DEMONSTRATION PROBLEM 8.2
A study is conducted in a company that employs 800 engineers. A random sample
of 50 engineers reveals that the average sample age is 34.3 years. Historically, the
population standard deviation of the age of the company’s engineers is approximately
8 years. Construct a 98% confidence interval to estimate the average age of all the
engineers in this company.
Solution
This problem has a finite population. The sample size, 50, is greater than 5% of the
population, so the finite correction factor may be helpful. In this case N = 800, n = 50,
$\bar{x} = 34.30$, and $\sigma = 8$. The z value for a 98% confidence interval is 2.33 (.98 divided into
two equal parts yields .4900; the z value is obtained from Table A.5 by using .4900).
Substituting into formula 8.2 and solving for the confidence interval gives
$$34.30 - 2.33\frac{8}{\sqrt{50}}\sqrt{\frac{750}{799}} \le \mu \le 34.30 + 2.33\frac{8}{\sqrt{50}}\sqrt{\frac{750}{799}}$$
$$34.30 - 2.55 \le \mu \le 34.30 + 2.55$$
$$31.75 \le \mu \le 36.85$$
Without the finite correction factor, the result would have been
$$34.30 - 2.64 \le \mu \le 34.30 + 2.64$$
$$31.66 \le \mu \le 36.94$$
The finite correction factor takes into account the fact that the population is only
800 instead of being infinitely large. The sample, n = 50, is a greater proportion of the
800 than it would be of a larger population, and thus the width of the confidence
interval is reduced.
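The effect of the correction can be sketched numerically. The example below assumes the values from this problem ($N = 800$, $n = 50$, $\sigma = 8$, $z = 2.33$); the function name `z_interval_finite` is hypothetical. It computes the margin with and without the factor $\sqrt{(N-n)/(N-1)}$.

```python
from math import sqrt

def z_interval_finite(x_bar, sigma, n, N, z, use_fpc=True):
    """z interval for the mean, optionally applying the finite correction factor."""
    fpc = sqrt((N - n) / (N - 1)) if use_fpc else 1.0
    margin = z * sigma / sqrt(n) * fpc
    return x_bar - margin, x_bar + margin

if __name__ == "__main__":
    # Values from Demonstration Problem 8.2
    with_fpc = z_interval_finite(34.30, 8, 50, 800, 2.33)
    without_fpc = z_interval_finite(34.30, 8, 50, 800, 2.33, use_fpc=False)
    print(f"with correction:    {with_fpc[0]:.2f} to {with_fpc[1]:.2f}")
    print(f"without correction: {without_fpc[0]:.2f} to {without_fpc[1]:.2f}")
```

Because the factor is always below 1 when $n > 1$, applying it can only narrow the interval; the narrowing becomes negligible as $n/N$ gets small.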
FIGURE 8.6 Excel and Minitab Output for the Cellular Telephone Example
Excel Output
The sample mean is:            510
The error of the interval is:  9.779
The confidence interval is:    510 ± 9.779
The confidence interval is:    500.221 ≤ µ ≤ 519.779
Minitab Output
One-Sample Z
The assumed standard deviation = 46
 N    Mean    SE Mean   95% CI
85   510.00   4.99      (500.22, 519.78)
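The quantities reported in this output (the standard error of the mean, the error of the interval, and the interval limits) can be reproduced with a few lines of code. This is a sketch, not the procedure Excel or Minitab uses internally; the function name `one_sample_z` is hypothetical, and the z value comes from Python's `statistics.NormalDist` rather than a printed table.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(x_bar, sigma, n, confidence=0.95):
    """Return (standard error, margin of error, lower limit, upper limit)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    se = sigma / sqrt(n)   # standard error of the mean
    margin = z * se        # the "error of the interval"
    return se, margin, x_bar - margin, x_bar + margin

if __name__ == "__main__":
    # Cellular telephone example: x_bar = 510, sigma = 46, n = 85
    se, margin, lower, upper = one_sample_z(510, 46, 85)
    print(f"SE mean = {se:.2f}, error = {margin:.3f}")
    print(f"{lower:.3f} <= mu <= {upper:.3f}")
```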
United States. Shown here is the Minitab output for such a sample. Examine the
output. What is the point estimate? What is the value of the assumed population
standard deviation? How large is the sample? What level of confidence is being used?
What table value is associated with this level of confidence? What is the confidence
interval? Often the portion of the confidence interval that is added and subtracted
from the mean is referred to as the error of the estimate. How much is the error of
the estimate in this problem?
One-Sample Z
The assumed standard deviation = 0.14
N Mean SE Mean 95% CI
41 0.5765 0.0219 (0.5336, 0.6194)
In Section 8.1, we learned how to estimate a population mean by using the sample mean
when the population standard deviation is known. In most instances, if a business
researcher desires to estimate a population mean, the population standard deviation will
be unknown and thus techniques presented in Section 8.1 will not be applicable. When
the population standard deviation is unknown, the sample standard deviation
must be used in the estimation process. In this section, a statistical technique is
presented to estimate a population mean using the sample mean when the pop-
ulation standard deviation is unknown.
Suppose a business researcher is interested in estimating the average flying time of a 767
jet from New York to Los Angeles. Since the business researcher does not know the popula-
tion mean or average time, it is likely that she also does not know the population standard
deviation. By taking a random sample of flights, the researcher can compute a sample mean
and a sample standard deviation from which the estimate can be constructed. Another busi-
ness researcher is studying the impact of movie video advertisements on consumers using a
random sample of people. The researcher wants to estimate the mean response for the pop-
ulation but has no idea what the population standard deviation is. He will have the sample
mean and sample standard deviation available to perform this analysis.
The z formulas presented in Section 8.1 are inappropriate for use when the population
standard deviation is unknown (and is replaced by the sample standard deviation). Instead,
another mechanism to handle such cases was developed by a British statistician, William S.
Gosset.
Gosset was born in 1876 in Canterbury, England. He studied chemistry and mathe-
matics and in 1899 went to work for the Guinness Brewery in Dublin, Ireland. Gosset was
involved in quality control at the brewery, studying variables such as raw materials and
temperature. Because of the circumstances of his experiments, Gosset conducted many
studies where the population standard deviation was unavailable. He discovered that using
the standard z test with a sample standard deviation produced inexact and incorrect distri-
butions. This finding led to his development of the distribution of the sample standard
deviation and the t test.
Gosset was a student and close personal friend of Karl Pearson. When Gosset’s first
work on the t test was published, he used the pen name “Student.” As a result, the t test is
sometimes referred to as the Student’s t test. Gosset’s contribution was significant because
it led to more exact statistical tests, which some scholars say marked the beginning of the
modern era in mathematical statistics.*
*Adapted from Arthur L. Dudycha and Linda W. Dudycha,“Behavioral Statistics: An Historical Perspective,” in
Statistical Issues: A Reader for the Behavioral Sciences, Roger Kirk, ed. (Monterey, CA: Brooks/Cole, 1972).
8.2 Estimating the Population Mean Using the t Statistic (S Unknown) 261
The t Distribution
Gosset developed the t distribution, which is used instead of the z distribution for doing
inferential statistics on the population mean when the population standard deviation is
unknown and the population is normally distributed. The formula for the t statistic is
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
This formula is essentially the same as the z formula, but the distribution table values
are different. The t distribution values are contained in Table A.6 and, for convenience,
inside the front cover of the text.
The t distribution actually is a series of distributions because every sample size has a
different distribution, thereby creating the potential for many t tables. To make these t
values more manageable, only select key values are presented; each line in the table contains
values from a different t distribution. An assumption underlying the use of the t statistic is
that the population is normally distributed. If the population distribution is not normal or
is unknown, nonparametric techniques (presented in Chapter 17) should be used.
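To illustrate how the t distribution changes with sample size, the sketch below evaluates the t density for two sample sizes and compares it with the standard normal density. It assumes the standard closed-form t density, which is not given in this text, and takes the degrees of freedom to be $n - 1$; the function name `t_pdf` is hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_pdf(t: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom (standard formula)."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

if __name__ == "__main__":
    normal = NormalDist()
    for point in (0.0, 3.0):
        print(f"at {point}: normal = {normal.pdf(point):.4f}, "
              f"t (n = 25) = {t_pdf(point, 24):.4f}, "
              f"t (n = 10) = {t_pdf(point, 9):.4f}")
```

The t curves are lower at the center and higher in the tails than the normal curve, and the difference shrinks as the sample size grows.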
Robustness
Most statistical techniques have one or more underlying assumptions. If a statistical tech-
nique is relatively insensitive to minor violations in one or more of its underlying assump-
tions, the technique is said to be robust to that assumption. The t statistic for estimating
a population mean is relatively robust to the assumption that the population is normally
distributed.
Some statistical techniques are not robust, and a statistician should exercise extreme
caution to be certain that the assumptions underlying a technique are being met before
using it or interpreting statistical output resulting from its use. A business analyst should
always beware of statistical assumptions and the robustness of techniques being used in an
analysis.
FIGURE 8.7 Comparison of Two t Distributions to the Standard Normal Curve
[Figure: standard normal curve, t curve (n = 25), and t curve (n = 10).]
FIGURE 8.8 Distribution with Alpha for 90% Confidence
[Figure: t curve with a central area of 90% and α = 10% split into α/2 = 5% in each tail.]
TABLE 8.2 t Distribution
Degrees of Freedom    t.10    t.05     t.025    t.01    t.005    t.001
23
24                            1.711
25
table having different degrees of freedom and containing t values for different t distribu-
tions. The degrees of freedom for the t statistic presented in this section are computed by
n - 1. The term degrees of freedom refers to the number of independent observations for a
source of variation minus the number of independent parameters estimated in computing the
variation.* In this case, one independent parameter, the population mean, $\mu$, is being
estimated by $\bar{x}$ in computing $s$. Thus, the degrees of freedom formula is $n$ independent
observations minus one independent parameter being estimated ($n - 1$). Because the degrees of
freedom are computed differently for various t formulas, a degrees of freedom formula is
given along with each t formula in the text.
In Table A.6, the degrees of freedom are located in the left column. The t distribution
table in this text does not use the area between the statistic and the mean as does the z dis-
tribution (standard normal distribution). Instead, the t table uses the area in the tail of the
distribution. The emphasis in the t table is on $\alpha$, and each tail of the distribution contains
$\alpha/2$ of the area under the curve when confidence intervals are constructed. For confidence
intervals, the table t value is found in the column under the value of $\alpha/2$ and in the row of
the degrees of freedom (df) value.
For example, if a 90% confidence interval is being computed, the total area in the two
tails is 10%. Thus, $\alpha$ is .10 and $\alpha/2$ is .05, as indicated in Figure 8.8. The t distribution table
shown in Table 8.2 contains only six values of $\alpha/2$ (.10, .05, .025, .01, .005, .001). The t value
is located at the intersection of the df value and the selected $\alpha/2$ value. So if the degrees of
freedom for a given t statistic are 24 and the desired $\alpha/2$ value is .05, the t value is 1.711.
*Roger E. Kirk. Experimental Design: Procedures for the Behavioral Sciences. Belmont, California: Brooks/Cole, 1968.
can be manipulated algebraically to produce a formula for estimating the population mean
when $\sigma$ is unknown and the population is normally distributed. The results are the formulas
given next.
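CONFIDENCE INTERVAL TO ESTIMATE $\mu$: POPULATION STANDARD DEVIATION UNKNOWN AND THE POPULATION NORMALLY DISTRIBUTED (8.3)
$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
$$df = n - 1$$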
Formula 8.3 can be used in a manner similar to methods presented in Section 8.1 for constructing
a confidence interval to estimate $\mu$. For example, in the aerospace industry some
companies allow their employees to accumulate extra working hours beyond their 40-hour
week. These extra hours sometimes are referred to as green time, or comp time. Many managers
work longer than the eight-hour workday preparing proposals, overseeing crucial tasks, and
taking care of paperwork. Recognition of such overtime is important. Most managers are usu-
ally not paid extra for this work, but a record is kept of this time and occasionally the manager
is allowed to use some of this comp time as extra leave or vacation time. Suppose a researcher
wants to estimate the average amount of comp time accumulated per week for managers in the
aerospace industry. He randomly samples 18 managers and measures the amount of extra time
they work during a specific week and obtains the results shown (in hours).
6 21 17 20 7 0 8 16 29
3 8 12 11 9 21 25 15 16
He constructs a 90% confidence interval to estimate the average amount of extra time
per week worked by a manager in the aerospace industry. He assumes that comp time is
normally distributed in the population. The sample size is 18, so df = 17. A 90% level of
confidence results in $\alpha/2 = .05$ area in each tail. The table t value is
$$t_{.05,17} = 1.740$$
The subscripts in the t value denote to other researchers the area in the right tail of the
t distribution (for confidence intervals $\alpha/2$) and the number of degrees of freedom. The
sample mean is 13.56 hours, and the sample standard deviation is 7.8 hours. The confi-
dence interval is computed from this information as
$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
$$13.56 \pm 1.740\frac{7.8}{\sqrt{18}} = 13.56 \pm 3.20$$
$$10.36 \le \mu \le 16.76$$
The point estimate for this problem is 13.56 hours, with an error of ±3.20 hours. The
researcher is 90% confident that the average amount of comp time accumulated by a man-
ager per week in this industry is between 10.36 and 16.76 hours.
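The comp time calculation can be reproduced from the 18 observations listed above. The sketch below uses Python's standard-library `statistics` module for the sample mean and sample standard deviation and takes the table value $t_{.05,17} = 1.740$ directly from the text, because the standard library has no t quantile function. The function name `t_interval` is hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev

# Weekly comp time (hours) for the 18 sampled managers, as listed in the text
COMP_TIME = [6, 21, 17, 20, 7, 0, 8, 16, 29, 3, 8, 12, 11, 9, 21, 25, 15, 16]
T_05_17 = 1.740  # table t value quoted in the text for alpha/2 = .05, df = 17

def t_interval(data, t_value):
    """Return (mean, sample std dev, margin of error) for x_bar +/- t * s / sqrt(n)."""
    n = len(data)
    x_bar = fmean(data)
    s = stdev(data)  # sample standard deviation, divisor n - 1
    margin = t_value * s / sqrt(n)
    return x_bar, s, margin

if __name__ == "__main__":
    x_bar, s, margin = t_interval(COMP_TIME, T_05_17)
    print(f"mean = {x_bar:.2f}, s = {s:.4f}, error = {margin:.2f}")
    print(f"{x_bar - margin:.2f} <= mu <= {x_bar + margin:.2f}")
```

Working from the unrounded mean gives an upper limit near 16.75, as in the Minitab output in Figure 8.9; the 16.76 in the hand calculation comes from rounding the mean to 13.56 and the error to 3.20 before adding.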
From these figures, aerospace managers could attempt to build a reward system for
such extra work or evaluate the regular 40-hour week to determine how to use the normal
work hours more effectively and thus reduce comp time.
DEMONSTRATION PROBLEM 8.3
The owner of a large equipment rental company wants to make a
rather quick estimate of the average number of days a piece of
ditchdigging equipment is rented out per person per time. The com-
pany has records of all rentals, but the amount of time required to
conduct an audit of all accounts would be prohibitive. The owner
FIGURE 8.9 Excel and Minitab Output for the Comp Time Example
Excel Output
Comp Time
Mean                        13.56
Standard error              1.8386
Standard deviation          7.8006
Confidence level (90.0%)    3.20
Minitab Output
One-Sample T: Comp Time
Variable     N    Mean    StDev   SE Mean   90% CI
Comp Time    18   13.56   7.80    1.84      (10.36, 16.75)