Business Statistics PDF

The document discusses continuous probability distributions and the normal distribution. It defines key concepts like the mean, standard deviation, and z-scores. It also provides examples of using the standard normal distribution to calculate probabilities and find values of z for given areas under the normal curve.

Uploaded by

Maymuna

CHAPTER ONE

CONTINUOUS PROBABILITY
DISTRIBUTIONS
OBJECTIVES
After completing this chapter, you should be able to:
• Define a continuous random variable.
• Identify the properties of a normal distribution.
• State the parameters of the normal distribution.
• Find the area under the standard normal distribution, given various z values.
• Find probabilities for a normally distributed variable by transforming it into a
standard normal variable.
• Solve applications using the standard normal distribution table.
INTRODUCTION
This chapter completes our presentation of probability by introducing continuous random
variables and their distributions. In the previous chapters, we introduced discrete probability
distributions, which are used to calculate the probabilities associated with discrete random
variables. We also introduced the binomial distribution, which allows us to determine the
probability that the random variable equals a particular value (the number of successes).
In this chapter, we continue our study of probability distributions by examining continuous
probability distributions. A continuous probability distribution usually results from measuring
something, such as the distance from the dormitory to the classroom, the weight of an individual,
or the amount of bonus earned by CEOs. Suppose we select five students and find the distance,
in miles, they travel to attend class as 12.2, 8.9, 6.7, 3.6, and 14.6. When examining a continuous
distribution we are usually interested in information such as the percent of students who travel
less than 10 miles or the percent who travel more than 8 miles. In other words, for a continuous
distribution we may wish to know the percent of observations that occur within a certain range.
It is important to realize that a continuous random variable can take an infinite number of values
within a particular range. We therefore consider the probability that the variable falls within a
specified range, rather than the probability of a specific value.

Lecture notes on Social statistics Page 1


The continuous distribution discussed in this chapter is the normal probability distribution. The
normal distribution is described by its mean and standard deviation. For example, assume the life
of an Energizer C-size battery follows a normal distribution with a mean of 45 hours and a
standard deviation of 10 hours when used in a particular toy. We can determine the likelihood
the battery will last more than 50 hours, between 35 and 62 hours, or less than 39 hours. The life
of the battery is measured on a continuous scale.
NORMAL DISTRIBUTION
The normal distribution is the most important of all probability distributions because of its
crucial role in statistical inference.
The shape and position of a normal distribution curve depend on two parameters, the mean and
the standard deviation. Each normally distributed variable has its own normal distribution curve,
which depends on the values of the variable’s mean and standard deviation.
A normal distribution is a continuous, symmetric, bell-shaped distribution of a variable.
Properties of the normal distribution
1) A normal distribution curve is bell-shaped.
2) The mean, median, and mode are equal and are located at the center of the distribution.
3) The curve is symmetric about the mean, which is equivalent to saying that its shape is
the same on both sides of a vertical line passing through the center.
4) The curve is continuous; that is, there are no gaps or holes.
5) The curve never touches the x axis.
6) The total area under a normal distribution curve is equal to 1.00, or 100%.
These characteristics are shown graphically in Figure 1-3.

There is not just one normal probability distribution, but rather a "family" of them.
For example, in Figure 1-4 the probability distributions of length of employee service in three
different plants can be compared. In the Camden plant, the mean is 20 years and the standard
deviation is 3.1 years. There is another normal probability distribution for the length of service in
the Dunkirk plant, where \(\mu = 20\) years and \(\sigma = 3.9\) years. In the Elmira plant, \(\mu = 20\) years
and \(\sigma = 5\) years. Note that the means are the same but the standard deviations are different.

The Standard Normal Distribution


Since each normally distributed variable has its own mean and standard deviation, as stated
earlier, the shape and location of these curves will vary. In practical applications, then, you
would have to have a table of areas under the curve for each variable. To simplify this situation,
statisticians use what is called the standard normal distribution.

The standard normal distribution is a normal distribution with a mean of 0 and a standard
deviation of 1.
Any normal distribution can be converted into a standard normal distribution by subtracting the
mean from each observation and dividing this difference by the standard deviation. The results
are called z values. They are also referred to as z scores, the z statistics, the standard normal
deviates, the standard normal values, or just the normal deviate.
All normally distributed variables can be transformed into the standard normally distributed
variable by using the formula for the standard score:
\[ z = \frac{x - \mu}{\sigma} \]

Finding Areas under the Standard Normal Distribution Curve
For the solution of problems using the standard normal distribution, the following steps are
recommended:
1. Draw the figure.
2. Highlight the required area.
3. Find the required area from the z table.

Example:
1) Find the area under the standard normal curve
a) Between z=0 and z=1.95
Solution
\( P(0 \le z \le 1.95) = 0.4744 \)
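As a cross-check on the table lookup, the same area can be computed numerically. The following is a minimal sketch that assumes Python 3.8 or later and uses `NormalDist` from the standard `statistics` module; the helper names `area_left`, `area_right`, and `area_between` are illustrative and are not part of these notes.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)

def area_left(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return Z.cdf(z)

def area_right(z: float) -> float:
    """Area under the standard normal curve to the right of z."""
    return 1.0 - Z.cdf(z)

def area_between(z_low: float, z_high: float) -> float:
    """Area under the standard normal curve between z_low and z_high."""
    return Z.cdf(z_high) - Z.cdf(z_low)

# Area between z = 0 and z = 1.95, as in the example above.
print(round(area_between(0.0, 1.95), 4))  # 0.4744
```

The same three helpers cover the "between", "to the left of", and "to the right of" questions in the class work that follows.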

Class work
1. Find the area under the standard normal curve
a. Between z=0 and z=-2.05
b. Between z=0 and z=2.37
c. Between z=0 and z=-1.53
2) Find the area under the standard normal curve
a) To the right of z=1.36
b) To the left of z=-1.93

c) To right of z= -2.32
d) To the left of z=1.78
3) Determine the following probabilities for the standard normal distribution.
a) p(2  z  3)
b) p(0.37  z  1.86)
c) p (1.83  z  0.64)
d) p (2.6  z  2.6)
4) Determine the following probabilities for the standard normal distribution.
a) \( P(0 \le z \le 1.27) \)
b) p ( z  1.8)
c) p ( z  2.54)
d) p (1.71  z  1.71)

EXAMPLE:
Find the value of z so that the area under the standard normal curve
a) from 0 to z is 0.4772 and z is positive
Solution
The corresponding z value is \( z = 2.00 \).
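Going from an area back to a z value is the inverse of a table lookup, and it can also be checked numerically. This is a small sketch assuming Python 3.8 or later; `inv_cdf` is a method of the standard library's `NormalDist`, while the three helper names are hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_from_area_0_to_z(area: float, positive: bool = True) -> float:
    """z such that the area between 0 and z equals `area`."""
    z = Z.inv_cdf(0.5 + area)
    return z if positive else -z

def z_from_left_tail(area: float) -> float:
    """z such that the area to the left of z equals `area`."""
    return Z.inv_cdf(area)

def z_from_right_tail(area: float) -> float:
    """z such that the area to the right of z equals `area`."""
    return Z.inv_cdf(1.0 - area)

# Area from 0 to z is 0.4772 and z is positive.
print(round(z_from_area_0_to_z(0.4772), 2))  # 2.0
```

Printed tables list areas to four decimal places, so a computed z value may differ from the table value in the third decimal place.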

Class work
1) Find the value of z so that the area under the standard normal curve
a. between 0 and z is (approximately) 0.4784 and z is negative
b. in the left tail is 0.0582
c. in the right tail is 0.0268
2) Determine the value of z so that the area under the standard normal curve
a) In the right tail is .0500
b) in the left tail is .0250
c) in the left tail is .0100
d) in the right tail is .0050
3) Find the value of z so that the area under the standard normal curve
a) from 0 to z is 0.1950 and z is positive
b) between 0 and z is (approximately) 0.2733 and z is negative
c) in the right tail is 0.1056
4) Determine the value of z so that the area under the standard normal curve
a) in the right tail is .0250
b) in the left tail is .0500
c) in the left tail is .0010
d) in the right tail is .0100
Applications of the Normal Distribution
The standard normal distribution curve can be used to solve a wide variety of practical problems.
The only requirement is that the variable be normally or approximately normally distributed.
There are several mathematical tests to determine whether a variable is normally distributed. See
the Critical Thinking Challenges on page 352. For all the problems presented in this chapter, you
can assume that the variable is normally or approximately normally distributed. To solve
problems by using the standard normal distribution, transform the original variable to a standard
normal distribution variable by using the formula
\[ z = \frac{x - \mu}{\sigma} \]

EXAMPLE:
1) Find the z value for each of the following x values for a normal distribution with \(\mu = 30\)
and \(\sigma = 5\).
a) x = 39 b) x = 19 c) x = 24 d) x = 44
2) Find the following areas under a normal distribution curve with \(\mu = 20\) and \(\sigma = 4\).
a) Area between x = 20 and x = 27
b) Area from x = 23 to x = 26
c) Area between x = 9.5 and x = 17

3) The weekly incomes of shift foremen in the glass industry are normally distributed with a
mean of $1,000 and a standard deviation of $100.

a) What is the z value for the income X of a foreman who earns $1,100 per week? For
a foreman who earns $900 per week?
b) What is the likelihood of selecting a foreman whose weekly income is between
$1,000 and $1,100?
c) What is the probability of selecting a shift foreman in the glass industry whose
income is between $790 and $1,000?
d) What is the probability of selecting a shift foreman in the glass industry whose
income is less than $790?
Solution
a) \( z = \frac{x - \mu}{\sigma} = \frac{1100 - 1000}{100} = 1 \) and \( z = \frac{x - \mu}{\sigma} = \frac{900 - 1000}{100} = -1 \)
b) \( P(1000 \le x \le 1100) = P(0 \le z \le 1) = 0.3413 \)
c) \( P(790 \le x \le 1000) = P(-2.10 \le z \le 0) = 0.4821 \)
d) \( P(x < 790) = P(z < -2.10) = 0.5 - 0.4821 = 0.0179 \)
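The foreman example can be reproduced without standardizing by hand. The sketch below assumes Python 3.8 or later and uses the standard library's `NormalDist` with the stated mean of 1000 and standard deviation of 100; the variable names are illustrative only.

```python
from statistics import NormalDist

# Weekly income of shift foremen: mean $1,000, standard deviation $100.
income = NormalDist(mu=1000, sigma=100)

# a) z values for $1,100 and $900
z_1100 = (1100 - income.mean) / income.stdev
z_900 = (900 - income.mean) / income.stdev

# b) P(1000 <= x <= 1100)
p_b = income.cdf(1100) - income.cdf(1000)
# c) P(790 <= x <= 1000)
p_c = income.cdf(1000) - income.cdf(790)
# d) P(x < 790)
p_d = income.cdf(790)

print(z_1100, z_900)                                # 1.0 -1.0
print(round(p_b, 4), round(p_c, 4), round(p_d, 4))  # 0.3413 0.4821 0.0179
```

Working directly with the original units gives the same probabilities as converting to z first, because `cdf` performs the standardization internally.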

EXERCISES
1) A survey found that women spend on average $146.21 on beauty products during the
summer months. Assume the standard deviation is $29.44. Find the percentage of
women who spend less than $160.00. Assume the variable is normally distributed.
2) A recent study of the hourly wages of maintenance crew members for major
airlines showed that the mean hourly salary was $20.50, with a standard deviation of
$3.50. If we select a crew member at random, what is the probability the crew
member earns:
a) Between $20.50 and $24.00 per hour?
b) More than $24.00 per hour?
c) Less than $19.00 per hour?
3) The mean starting salary for college graduates in the spring of 2005 was $36,280.
Assume that the distribution of starting salaries follows the normal distribution with a
standard deviation of $3,300. What percent of the graduates have starting salaries:

a) Between $35,000 and $40,000?
b) More than $45,000?
c) Between $40,000 and $45,000?
4) Suppose the life span of a calculator manufactured by Texas Instruments has a
normal distribution with a mean of 54 months and a standard deviation of 8 months.
The company guarantees that any calculator that starts malfunctioning within 36
months of the purchase will be replaced by a new one. About what percentage of
calculators made by this company are expected to be replaced?
5) The lifetimes of light-bulbs that are advertised to last for 5,000 hours are normally
distributed with a mean of 5,100 hours and a standard deviation of 200 hours. What is
the probability that a bulb lasts longer than the advertised figure?
6) The amount of time devoted to studying statistics each week by students who achieve
a grade of A in the course is a normally distributed random variable with a mean of
7.5 hours and a standard deviation of 2.1 hours.
a) What proportion of A students study for more than 10 hours per week?
b) Find the probability that an A student spends between 7 and 9 hours
studying.
c) What proportion of A students spend fewer than 3 hours studying?
d) Below what amount of study time do only 5% of all A students fall?
7) Each month, an American household generates an average of 28 pounds of newspaper
for garbage or recycling. Assume the standard deviation is 2 pounds and the variable is
approximately normally distributed. If a household is selected at random, find the
probability of its generating
a) Between 27 and 31 pounds per month
b) More than 30.2 pounds per month

CHAPTER TWO
SAMPLING DISTRIBUTION
INTRODUCTION
After completing this chapter, you should be able to:
• Define population and sampling distribution.
• Construct a sampling distribution of the sample mean.
• Explain the central limit theorem.
This chapter introduces the sampling distribution, a fundamental element in statistical inference.
We remind you that statistical inference is the process of converting data into information. Here
are the parts of the process we have thus far discussed:
1. Parameters describe populations.
2. Parameters are almost always unknown.
3. We take a random sample of a population to obtain the necessary data.
4. We calculate one or more statistics from the data.
For example, to estimate a population mean, we compute the sample mean. Although there is
very little chance that the sample mean and the population mean are identical, we would expect
them to be quite close. However, for the purposes of statistical inference, we need to be able to
measure how close. The sampling distribution provides this service. It plays a crucial role in the
process because the measure of proximity it provides is the key to statistical inference.
POPULATION DISTRIBUTION
This section introduces the concepts of population distribution and sampling distribution of \(\bar{x}\).
The population distribution is the probability distribution of the population data.
To grasp the idea of a sampling distribution, suppose there are only five students in an advanced
statistics class and the midterm scores of these five students are

70 78 80 80 95
Let x denote the score of a student. Using single-valued classes (because there are only five data
values, there is no need to group them), we can write the frequency distribution of scores as in
Table 2.1 along with the relative frequencies of classes, which are obtained by dividing the
frequencies of classes by the population size. Table 2.2, which lists the probabilities of various x
values, presents the probability distribution of the population. Note that these probabilities are
the same as the relative frequencies.
Table 2.1: Population frequency and relative frequency distributions

x      f      Relative frequency
70     1      0.20
78     1      0.20
80     2      0.40
95     1      0.20

Table 2.2: Population probability distribution

x      P(x)
70     0.20
78     0.20
80     0.40
95     0.20
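The frequency table, the probability distribution, and the population parameters can all be derived from the five scores. This is a minimal sketch using only the Python standard library (Python 3.8 or later assumed); the variable names are illustrative. Note that `pstdev` divides by N, which is the appropriate choice here because the five scores are the entire population.

```python
from collections import Counter
from statistics import fmean, pstdev

scores = [70, 78, 80, 80, 95]  # the whole population, N = 5
N = len(scores)

# Frequency of each distinct score, then relative frequency = probability.
frequencies = Counter(scores)
probabilities = {x: f / N for x, f in sorted(frequencies.items())}

mu = fmean(scores)      # population mean
sigma = pstdev(scores)  # population standard deviation (divides by N)

print(probabilities)                  # {70: 0.2, 78: 0.2, 80: 0.4, 95: 0.2}
print(round(mu, 2), round(sigma, 2))  # 80.6 8.09
```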

The values of the mean and standard deviation calculated for the probability distribution of Table
2.2 give the values of the population parameters \(\mu\) and \(\sigma\). These values are \(\mu = 80.60\) and
\(\sigma = 8.09\).
SAMPLING DISTRIBUTION
The probability distribution of \(\bar{x}\) is called its sampling distribution. It lists the various values
that \(\bar{x}\) can assume and the probability of each value of \(\bar{x}\). In general, the probability distribution
of a sample statistic is called its sampling distribution.
Reconsider the population of midterm scores of five students given in Table 2.1. Consider all
possible samples of three scores each that can be selected, without replacement, from that
population. The total number of possible samples is given by the combinations formula:
\[ {}_{5}C_{3} = \frac{5!}{3!(5 - 3)!} = 10 \]

Suppose we assign the letters A, B, C, D, and E to the scores of the five students, so that
A = 70, B = 78, C = 80, D = 80, E = 95. Then, the 10 possible samples of three scores each are
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE
These 10 samples and their respective means are listed in Table 2.3. Note that the first two
samples have the same three scores. The reason for this is that two of the students (C and D)
have the same score, and, hence, the samples ABC and ABD contain the same values. Note that
the values of the means of samples in Table 2.3 are rounded to two decimal places.
By using the values of \(\bar{x}\) given in Table 2.3, we record the frequency distribution of \(\bar{x}\) in Table 2.4.
By dividing the frequencies of the various values of \(\bar{x}\) by the sum of all frequencies, we obtain the
relative frequencies of classes, which are listed in the third column of Table 2.4.
These relative frequencies are used as probabilities and listed in Table 2.5. This table gives the
sampling distribution of \(\bar{x}\).
If we select just one sample of three scores from the population of five scores, we may draw any
of the 10 possible samples. Hence, the sample mean, \(\bar{x}\), can assume any of the values listed in Table
2.5 with the corresponding probability. For instance, the probability that the mean of a randomly
selected sample of three scores is 81.67 is 0.20. This probability can be written as
\[ P(\bar{x} = 81.67) = 0.20 \]

Table 2.3: All possible samples and their means

Sample    Scores in the sample    Sample mean
ABC       70, 78, 80              76.00
ABD       70, 78, 80              76.00
ABE       70, 78, 95              81.00
ACD       70, 80, 80              76.67
ACE       70, 80, 95              81.67
ADE       70, 80, 95              81.67
BCD       78, 80, 80              79.33
BCE       78, 80, 95              84.33
BDE       78, 80, 95              84.33
CDE       80, 80, 95              85.00

Table 2.4: Frequency and relative frequency distributions of the sample mean

Sample mean    f           Relative frequency
76.00          2           0.20
76.67          1           0.10
79.33          1           0.10
81.00          1           0.10
81.67          2           0.20
84.33          2           0.20
85.00          1           0.10
               Sum = 10    Sum = 1.00

Table 2.5: Sampling distribution of the sample mean

Sample mean    Probability
76.00          0.20
76.67          0.10
79.33          0.10
81.00          0.10
81.67          0.20
84.33          0.20
85.00          0.10
               Sum = 1.00
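The construction of these tables can be sketched in a few lines: enumerate every sample of three students, compute each sample mean, then count how often each mean occurs. The code below uses only the Python standard library; the names `scores`, `sample_means`, and `sampling_distribution` are illustrative and not part of the notes.

```python
from collections import Counter
from itertools import combinations

scores = {"A": 70, "B": 78, "C": 80, "D": 80, "E": 95}
n = 3  # sample size

# All samples of size 3 drawn without replacement: ABC, ABD, ..., CDE.
samples = list(combinations(scores, n))

# Mean of each sample, rounded to two decimals as in the tables.
sample_means = {
    "".join(s): round(sum(scores[k] for k in s) / n, 2) for s in samples
}

# Frequency of each distinct mean, then relative frequency = probability.
frequency = Counter(sample_means.values())
sampling_distribution = {
    m: f / len(samples) for m, f in sorted(frequency.items())
}

for m, p in sampling_distribution.items():
    print(f"{m:6.2f}  {p:.2f}")
```

Each of the 10 samples is equally likely, so dividing a frequency by 10 gives the probability of the corresponding sample mean.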

EXERCISE
1) Tartus Industries has seven production employees (considered the population). The
hourly earnings of each employee are given in the table below

a) What is the population mean?


b) What is the sampling distribution of the sample mean for samples of size 2?
c) What is the mean of the sampling distribution?
d) What observations can be made about the population and the sampling
distribution?
2) A population consists of the following four values: 12, 12, 14, and 16.
a. List all samples of size 2, and compute the mean of each sample.
b. Compute the mean of the distribution of the sample mean and the population
mean. Compare the two values.
c. Compare the dispersion in the population with that of the sample mean.

3) A population consists of the following five values: 2, 2, 4, 4, and 8.
a. List all samples of size 2, and compute the mean of each sample.
b. Compute the mean of the distribution of sample means and the population mean.
Compare the two values.
SAMPLING AND NONSAMPLING ERRORS
Usually, different samples selected from the same population will give different results because
they contain different elements. This is obvious from Table 2.3, which shows that the mean of a
sample of three scores depends on which three of the five scores are included in the sample. The
result obtained from any one sample will generally be different from the result obtained from the
corresponding population. The difference between the value of a sample statistic obtained
from a sample and the value of the corresponding population parameter obtained from the
population is called the sampling error. Note that this difference represents the sampling error
only if the sample is random and no nonsampling error has been made. Otherwise, only a part of
this difference will be due to the sampling error.
SAMPLING ERROR:
Sampling error is the difference between the value of a sample statistic and the value of the
corresponding population parameter. In the case of the mean,
\[ \text{Sampling error} = \bar{x} - \mu \]
assuming that the sample is random and no nonsampling error has been made.
• It is important to remember that a sampling error occurs because of chance. The errors
that occur for other reasons, such as errors made during collection, recording, and
tabulation of data, are called nonsampling errors. These errors occur because of human
mistakes, and not chance.
Note that there is only one kind of sampling error: the error that occurs due to chance. However,
there is not just one nonsampling error; many nonsampling errors may
occur for different reasons.
NONSAMPLING ERROR:
The errors that occur in the collection, recording, and tabulation of data are called nonsampling
errors.
MEAN AND STANDARD DEVIATION OF \(\bar{x}\)

The mean and standard deviation calculated for the sampling distribution of \(\bar{x}\) are called the mean
and standard deviation of \(\bar{x}\). Actually, the mean and standard deviation of \(\bar{x}\) are, respectively,
the mean and standard deviation of the means of all samples of the same size selected from a
population. The standard deviation of \(\bar{x}\) is also called the standard error of \(\bar{x}\).

Definition
The mean and standard deviation of the sampling distribution of \(\bar{x}\) are called the mean and
standard deviation of \(\bar{x}\) and are denoted by \(\mu_{\bar{x}}\) and \(\sigma_{\bar{x}}\), respectively.

If we calculate the mean and standard deviation of the 10 values of \(\bar{x}\) listed in Table 2.3, we
obtain the mean, \(\mu_{\bar{x}}\), and the standard deviation, \(\sigma_{\bar{x}}\), of \(\bar{x}\). Alternatively, we can calculate the
mean and standard deviation of the sampling distribution of \(\bar{x}\) listed in Table 2.5. These will also
be the values of \(\mu_{\bar{x}}\) and \(\sigma_{\bar{x}}\).


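From these calculations, we obtain:

1. \( \mu_{\bar{x}} = \frac{\sum \bar{x}}{n} = 80.60 \)
2. \( \sigma_{\bar{x}} = \sqrt{\frac{\sum (\bar{x} - \mu_{\bar{x}})^2}{n}} = 3.30 \)

In these two formulas the sums run over the 10 sample means, so the divisor n is 10.
The mean of the sampling distribution of \(\bar{x}\) is always equal to the mean of the population, and
the standard deviation of the sampling distribution of \(\bar{x}\) is smaller than the standard deviation of the
corresponding population distribution.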
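These two values can be checked by brute force. The sketch below recomputes the 10 sample means without rounding, takes their mean and standard deviation, and compares the result with the closed-form expression that includes the finite population correction factor, \(\sqrt{(N - n)/(N - 1)}\). It uses only the Python standard library (Python 3.8 or later assumed), and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, pstdev

population = [70, 78, 80, 80, 95]
N, n = len(population), 3

# Unrounded means of all 10 samples of size 3 drawn without replacement.
sample_means = [sum(s) / n for s in combinations(population, n)]

mu_xbar = fmean(sample_means)      # mean of the sample means
sigma_xbar = pstdev(sample_means)  # standard deviation of the sample means

# Closed form with the finite population correction (here n/N = 0.6 > 0.05).
sigma = pstdev(population)
sigma_xbar_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

print(round(mu_xbar, 2), round(sigma_xbar, 2))  # 80.6 3.3
print(round(sigma_xbar_formula, 2))             # 3.3
```

The agreement between the enumerated value and the formula is exact up to floating-point error, not just approximate, because every possible sample has been included.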

• The mean, \(\mu_{\bar{x}}\), and standard deviation, \(\sigma_{\bar{x}}\), of the sampling distribution of \(\bar{x}\) are given by:
a) \( \mu_{\bar{x}} = \mu \)
b) \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \) if \( \frac{n}{N} \le 0.05 \)
c) \( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} \) if \( \frac{n}{N} > 0.05 \)
where n is the sample size and N is the population size.
• The sample mean, \(\bar{x}\), is called an estimator of the population mean, \(\mu\).
• When the expected value (or mean) of a sample statistic is equal to the value of the
corresponding population parameter, that sample statistic is said to be an unbiased
estimator.
• The standard deviation of the sampling distribution of \(\bar{x}\) decreases as the sample size
increases.
• If the standard deviation of a sample statistic decreases as the sample size is increased,
that statistic is said to be a consistent estimator.
SHAPE OF THE SAMPLING DISTRIBUTION OF \(\bar{x}\)
The shape of the sampling distribution of \(\bar{x}\) relates to the following two cases.
1. The population from which samples are drawn has a normal distribution.
2. The population from which samples are drawn does not have a normal distribution.
• If the population from which the samples are drawn is normally distributed with mean \(\mu\)
and standard deviation \(\sigma\), then the sampling distribution of the sample mean, \(\bar{x}\), will also
be normally distributed with the following mean and standard deviation, irrespective of
the sample size:
\[ \mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

• When the population from which the samples are drawn is not normally distributed, the shape
of the sampling distribution of \(\bar{x}\) is inferred from a very important theorem called the
central limit theorem, which states that regardless of the shape of the distribution of the
population, the distribution of the sample means approaches the normal probability
distribution as the sample size increases.
• The sample size is usually considered to be large if \( n \ge 30 \).
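The central limit theorem can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the original notes: it uses an exponential population with mean 1 and standard deviation 1 as a convenient right-skewed example, draws 5000 samples at each sample size, and uses skewness as a rough measure of how far the distribution of sample means is from a symmetric, normal shape. The function names and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(2024)  # fixed seed so the run is repeatable

def sample_means(n: int, repetitions: int = 5000) -> list:
    """Means of `repetitions` samples of size n from an exponential population."""
    return [
        sum(random.expovariate(1.0) for _ in range(n)) / n
        for _ in range(repetitions)
    ]

def skewness(values: list) -> float:
    """Population skewness: 0 for a perfectly symmetric distribution."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

means_n2 = sample_means(2)
means_n30 = sample_means(30)

# Mean and standard deviation of the sample means for n = 30;
# theory predicts values near 1 and 1 / sqrt(30).
print(round(fmean(means_n30), 3), round(pstdev(means_n30), 3))
print(round(1 / sqrt(30), 3))

# Skewness shrinks toward 0 as the sample size grows.
print(round(skewness(means_n2), 2), round(skewness(means_n30), 2))
```

With n = 2 the sample means are still clearly skewed, while with n = 30 the skewness is much closer to zero, which is consistent with the rule of thumb above.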


EXAMPLE
1) The mean wage per hour for all 5000 employees who work at a large company is $27.50,
and the standard deviation is $3.70. Let \(\bar{x}\) be the mean wage per hour for a random
sample of certain employees selected from this company. Find the mean and standard
deviation of \(\bar{x}\) for a sample size of
a) 30 b) 75 c) 200
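One way to work this example is sketched below. The function `sd_of_sample_mean` is a hypothetical helper written for this illustration; it applies the finite population correction only when n/N exceeds 0.05, which matches the rule stated earlier in this chapter. For all three sample sizes here, n/N is at most 200/5000 = 0.04, so the simpler formula \(\sigma / \sqrt{n}\) applies.

```python
from math import sqrt

def sd_of_sample_mean(sigma: float, n: int, N: int) -> float:
    """Standard deviation of the sample mean.

    Applies the finite population correction only when n/N exceeds 0.05.
    """
    se = sigma / sqrt(n)
    if n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se

mu, sigma, N = 27.50, 3.70, 5000

# The mean of the sample mean equals the population mean for every n.
for n in (30, 75, 200):
    print(n, mu, round(sd_of_sample_mean(sigma, n, N), 4))
# 30 27.5 0.6755
# 75 27.5 0.4272
# 200 27.5 0.2616
```

The standard deviation of the sample mean falls as n grows, while its mean stays at $27.50.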
EXERCISE

1) Consider a large population with \(\mu = 60\) and \(\sigma = 10\). Assuming \( \frac{n}{N} \le 0.05 \), find the mean
and standard deviation of the sample mean, \(\bar{x}\), for a sample size of
a) 18 b) 90
2) For a population, \(\mu = 125\) and \(\sigma = 36\).
a) For a sample selected from this population, \(\mu_{\bar{x}} = 125\) and \(\sigma_{\bar{x}} = 3.6\). Find the sample
size. Assume \( \frac{n}{N} \le 0.05 \).
b) For a sample selected from this population, \(\mu_{\bar{x}} = 125\) and \(\sigma_{\bar{x}} = 2.25\). Find the
sample size. Assume \( \frac{n}{N} \le 0.05 \).
3) For a population, \(\mu = 46\) and \(\sigma = 10\).
a) For a sample selected from this population, \(\mu_{\bar{x}} = 46\) and \(\sigma_{\bar{x}} = 2\). Find the sample size.
Assume \( \frac{n}{N} \le 0.05 \).

b) For a sample selected from this population, \(\mu_{\bar{x}} = 46\) and \(\sigma_{\bar{x}} = 1.6\). Find the sample size.
Assume \( \frac{n}{N} \le 0.05 \).
4) The living spaces of all homes in a city have a mean of 2300 square feet and a
standard deviation of 500 square feet. Let \(\bar{x}\) be the mean living space for a random
sample of 25 homes selected from this city. Find the mean and standard deviation of
the sampling distribution of \(\bar{x}\).
5) The foreman of a bottling plant has observed that the amount of soda in each 32-ounce
bottle is actually a normally distributed random variable, with a mean of 32.2
ounces and a standard deviation of 0.3 ounce.
a) If a customer buys one bottle, what is the probability that the bottle will
contain more than 32 ounces?
b) If a customer buys a carton of four bottles, what is the probability that
the mean amount of the four bottles will be greater than 32 ounces?
6) In a recent SAT, the mean score for all examinees was 1020. Assume that the
distribution of SAT scores of all examinees is normal with a mean of 1020 and a
standard deviation of 153. Let \(\bar{x}\) be the mean SAT score of a random sample of
certain examinees. Calculate the mean and standard deviation of \(\bar{x}\) and describe the
shape of its sampling distribution when the sample size is
a) 16 b) 50 c) 1000
7) The amounts of electricity bills for all households in a particular city have an
approximately normal distribution with a mean of $140 and a standard deviation of $30.
Let \(\bar{x}\) be the mean amount of electricity bills for a random sample of 25 households
selected from this city. Find the mean and standard deviation of \(\bar{x}\), and comment on the
shape of its sampling distribution.
8) The GPAs of all 5540 students enrolled at a university have an approximately normal
distribution with a mean of 3.02 and a standard deviation of 0.29. Let \(\bar{x}\) be the mean GPA
of a random sample of 48 students selected from this university. Find the mean and
standard deviation of \(\bar{x}\), and comment on the shape of its sampling distribution.
9) The weights of all people living in a particular town have a distribution that is skewed to
the right with a mean of 133 pounds and a standard deviation of 24 pounds. Let \(\bar{x}\) be the
mean weight of a random sample of 45 persons selected from this town. Find the mean
and standard deviation of \(\bar{x}\) and comment on the shape of its sampling distribution.
10) The GPAs of all students enrolled at a large university have an approximately normal
distribution with a mean of 3.02 and a standard deviation of 0.29. Find the probability
that the mean GPA of a random sample of 20 students selected from this university is
a) 3.10 or higher b) 2.90 or lower c) 2.95 to 3.11
11) The amounts of electricity bills for all households in a city have a skewed probability
distribution with a mean of $140 and a standard deviation of $30. Find the probability
that the mean amount of electric bills for a random sample of 75 households selected
from this city will be:
a) between $132 and $136
b) within $6 of the population mean
c) more than the population mean by at least $4

12) The Old Farmer’s Almanac reports that the average person uses 123 gallons of water
daily. If the standard deviation is 21 gallons, find the probability that the mean of a
randomly selected sample of 15 people will be between 120 and 126 gallons. Assume the
variable is normally distributed.
13) A. C. Nielsen reported that children between the ages of 2 and 5 watch an average of 25
hours of television per week. Assume the variable is normally distributed and the
standard deviation is 3 hours. If 20 children between the ages of 2 and 5 are randomly
selected, find the probability that the mean of the number of hours they watch television
will be greater than 26.3 hours.

POPULATION AND SAMPLE PROPORTION


The concept of proportion is the same as the concept of relative frequency discussed in Chapter
2. The relative frequency of a category or class gives the proportion of the sample or population
that belongs to that category or class.

The population and sample proportions, denoted by \(p\) and \(\hat{p}\), respectively, are calculated as:
\[ p = \frac{X}{N} \quad \text{and} \quad \hat{p} = \frac{x}{n} \]
where
N = total number of elements in the population
n = total number of elements in the sample
X = number of elements in the population that possess a specific characteristic
x = number of elements in the sample that possess a specific characteristic
Examples:
1) Suppose a total of 789,654 families live in a particular city and 563,282 of them own
homes. A sample of 240 families is selected from this city, and 158 of them own homes.
Find the proportion of families who own homes in the population and in the sample.
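A worked solution, applying the definitions \(p = X/N\) and \(\hat{p} = x/n\) to the figures given in the example (results rounded to four decimal places):

```latex
p = \frac{X}{N} = \frac{563{,}282}{789{,}654} \approx 0.7133,
\qquad
\hat{p} = \frac{x}{n} = \frac{158}{240} \approx 0.6583
```

So about 71.33% of families in the city own homes, compared with about 65.83% of families in the sample.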

Mean, Standard deviation and the shape of the sampling distribution of p̂

This section discusses the sampling distribution of the sample proportion and the mean, standard
deviation, and shape of this sampling distribution.

Sampling distribution of p̂

Definition: The probability distribution of the sample proportion, \(\hat{p}\), is called its sampling
distribution. It gives the various values that \(\hat{p}\) can assume and their probabilities.
• The mean of the sample proportion is denoted by \(\mu_{\hat{p}}\) and is equal to the population
proportion, \(p\). Thus, \( \mu_{\hat{p}} = p \).
• The standard deviation of the sample proportion, \(\hat{p}\), is denoted by \(\sigma_{\hat{p}}\) and is given by
the formula:
\[ \sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} \]
where p is the population proportion, \( q = 1 - p \), and n is the sample size. This formula is used
when \( \frac{n}{N} \le 0.05 \), where N is the population size.

• According to the central limit theorem, the sampling distribution of \(\hat{p}\) is approximately
normal for a sufficiently large sample size. In the case of proportion, the sample size is
considered to be sufficiently large if np and nq are both greater than 5, that is, if
\( np > 5 \) and \( nq > 5 \).
EXAMPLES

1) Consider a large population with \( p = 0.63 \). Assuming \( \frac{n}{N} \le 0.05 \), find the mean and
standard deviation of the sample proportion \(\hat{p}\) for a sample size of:
a) 100 b) 900

2) Consider a large population with \( p = 0.21 \). Assuming \( \frac{n}{N} \le 0.05 \), find the mean and
standard deviation of the sample proportion \(\hat{p}\) for a sample size of:
a) 400 b) 750
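The first of these examples can be worked as follows. This is a minimal sketch under the stated assumption that n/N is at most 0.05, so no finite population correction is applied; the function name `sample_proportion_distribution` is hypothetical and introduced only for this illustration.

```python
from math import sqrt

def sample_proportion_distribution(p: float, n: int):
    """Return (mean, standard deviation, normal_ok) for the sample proportion.

    Assumes n/N <= 0.05, so no finite population correction is needed.
    normal_ok is True when both np > 5 and nq > 5.
    """
    q = 1.0 - p
    mean = p
    sd = sqrt(p * q / n)
    normal_ok = n * p > 5 and n * q > 5
    return mean, sd, normal_ok

for n in (100, 900):
    mean, sd, normal_ok = sample_proportion_distribution(0.63, n)
    print(n, mean, round(sd, 4), normal_ok)
# 100 0.63 0.0483 True
# 900 0.63 0.0161 True
```

Multiplying the sample size by 9 divides the standard deviation by 3, since the sample size enters the formula under a square root.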
CHAPTER THREE
ESTIMATION, CONFIDENCE INTERVALS
AND SAMPLE SIZE
Objectives
After completing this chapter, you should be able to:
• Define estimation.
• Differentiate between point and interval estimates.
• Find the confidence interval for the mean when \(\sigma\) is known.
• Determine the minimum sample size for finding a confidence interval for the mean.
• Find the confidence interval for the mean when \(\sigma\) is unknown.
• Find the confidence interval for a proportion.
• Determine the minimum sample size for finding a confidence interval for a proportion.

Introduction
Having discussed descriptive statistics, probability distributions, and sampling distributions, we
are ready to tackle statistical inference. As we explained in basic statistics, statistical inference is
the process by which we acquire information and draw conclusions about populations from
samples. There are two general procedures for making inferences about populations: estimation
and hypothesis testing. In this chapter, we introduce the concepts and foundations of estimation
and demonstrate them with simple examples. In Chapter 4, we describe the fundamentals of
hypothesis testing. Because most of what we do in the remainder of this book applies the
concepts of estimation and hypothesis testing, understanding Chapters 3 and 4 is vital to your
development as a statistics practitioner.
In other words, inferential statistics uses the sample results to make decisions and draw
conclusions about the population from which the sample is drawn. Estimation is the first topic to
be considered in our discussion of inferential statistics. Estimation and hypothesis testing taken
together are usually referred to as inference making. This chapter explains how to estimate the
population mean and population proportion for a single population.

• Estimation is a procedure by which a numerical value or values are assigned to a
population parameter based on the information collected from a sample.
• Estimate and estimator: The value(s) assigned to a population parameter based on the
value of a sample statistic is called an estimate. The sample statistic used to estimate the
population parameter is called an estimator.
• The estimation procedure involves the following steps:
1) Select a sample.
2) Collect the required information from the members of the sample.
3) Calculate the value of the sample statistic.
4) Assign value(s) to the corresponding population parameter.
• Point and interval estimates:
An estimate may be a point estimate or an interval estimate.
a) Point estimate: The value of a sample statistic that is used to estimate a population
parameter is called a point estimate.
b) Interval estimate: In interval estimation, an interval is constructed around the point
estimate, and it is stated that this interval is likely to contain the corresponding population
parameter.
To illustrate the difference between point and interval estimators, suppose that a statistics
professor wants to estimate the mean weekly summer income of his second-year business students.
Selecting 25 students at random, he calculates the sample mean weekly income to be $400. The
point estimate is the sample mean. In other words, he estimates the mean weekly summer income
of all second-year business students to be $400. Using the technique described subsequently, he
may instead use an interval estimate; he estimates that the mean weekly summer income of
second-year business students lies between $380 and $420.
Numerous applications of estimation occur in the real world. For example, television network
executives want to know the proportion of television viewers who are tuned in to their networks;
an economist wants to know the mean income of university graduates; and a medical researcher
wishes to estimate the recovery rate of heart attack victims treated with a new drug. In each of
these cases, to accomplish the objective exactly, the statistics practitioner would have to examine
each member of the population and then calculate the parameter of interest. For instance,
network executives would have to ask each person in the country what he or she is watching to
determine the proportion of people who are watching their shows. Because there are millions of
television viewers, the task is both impractical and prohibitively expensive.
An alternative would be to take a random sample from this population, calculate the sample
proportion, and use that as an estimator of the population proportion. The use of the sample
proportion to estimate the population proportion seems logical.
The selection of the sample statistic to be used as an estimator, however, depends on the
characteristics of that statistic. Naturally, we want to use the statistic with the most desirable
qualities for our purposes.
A good estimator should satisfy the following three properties:
1. The estimator should be unbiased.
2. The estimator should be consistent.
3. The estimator should be relatively efficient.
• Confidence level and confidence interval:
Each interval is constructed with regard to a given confidence level and is called a
confidence interval. The confidence interval is given as
Point estimate ± Margin of error
The confidence level associated with a confidence interval states how much confidence
we have that this interval contains the true population parameter. The confidence level is
denoted by (1 − α)100%.

Estimation of a population mean: σ known


This section explains how to construct a confidence interval for the population mean μ when the
population standard deviation σ is known.
Case I. If the following three conditions are fulfilled:
1. The population standard deviation σ is known
2. The sample size is small (n < 30)
3. The population from which the sample is selected is normally distributed
then we use the normal distribution to make the confidence interval for μ.
Case II. If the following two conditions are fulfilled:
1. The population standard deviation σ is known

2. The sample size is large (n ≥ 30)
then, again, we use the normal distribution to make the confidence interval for μ.
• Confidence interval for μ: The (1 − α)100% confidence interval for μ under Cases I and
II is:

$\bar{x} \pm z\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

• The value of z used here is obtained from the standard normal distribution table for the
given confidence level.

Information and the Width of the Interval


Interval estimation, like all other statistical techniques, is designed to convert data into
information.
The width of the confidence interval estimate is a function of the population standard deviation,
the confidence level, and the sample size.
• The width of a confidence interval depends on the size of the margin of error, which
depends on the values of z, σ, and n.
• The width of a confidence interval can be controlled using
a) The value of z, which depends on the confidence level
b) The sample size n
• The value of z increases as the confidence level increases, and it decreases as the
confidence level decreases.
• If we want to decrease the width of a confidence interval, we have two choices:
a) Lower the confidence level
b) Increase the sample size
• Increasing the sample size decreases the width of the confidence interval while the
confidence level can remain unchanged.
• Margin of error: The margin of error for the estimate of μ, denoted by E, is the
quantity that is subtracted from and added to the value of x̄ to obtain a confidence
interval for μ. Thus, $E = z\,\sigma_{\bar{x}}$


• The following table lists the z values for some of the most commonly used confidence
levels. Note that we always use the positive value of z in the formula.

Confidence level    Area to look for in table (each tail)    z value
90%                 0.050                                     1.64 or 1.65
95%                 0.025                                     1.96
98%                 0.010                                     2.33
99%                 0.005                                     2.57 or 2.58

• We know that $E = z\,\sigma_{\bar{x}} = z\,\dfrac{\sigma}{\sqrt{n}}$. Solving for n, we finally get

$n = \dfrac{z^{2}\sigma^{2}}{E^{2}}$
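The sample size formula n = z²σ²/E² can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function name and the input values (σ = 10, E = 2, 95% confidence) are hypothetical. The result is rounded up because a sample size must be a whole number and rounding down would give a margin of error larger than E.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin_of_error, confidence_level):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin of error."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)    # positive z with alpha/2 in the right tail
    n = (z * sigma / margin_of_error) ** 2     # n = z^2 * sigma^2 / E^2
    return math.ceil(n)                        # always round up

# Hypothetical inputs chosen only for illustration
print(sample_size_for_mean(sigma=10, margin_of_error=2, confidence_level=0.95))
```

The code uses the exact z value from the standard normal distribution rather than the rounded table value, so results can occasionally differ by one from a hand calculation.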

EXAMPLE
1) A survey of 30 emergency room patients found that the average waiting time for
treatment was 174.3 minutes. Assume that the population standard deviation is 46.5
minutes.
a) Find the best point estimate of the population mean.
b) Construct the 99% confidence interval of the population mean.
Solution:
n = 30, x̄ = 174.3, σ = 46.5, $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{46.5}{\sqrt{30}} \approx 8.5$
a. Point estimate of μ = x̄ = 174.3
b. The 99% confidence interval for μ is:

$\bar{x} \pm z\,\sigma_{\bar{x}}$
174.3 ± 2.58(8.5)
174.3 ± 21.93
Hence, one can be 99% confident that the mean waiting time for emergency room treatment is
between 152.4 and 196.2 minutes.
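The calculation in this example can be reproduced with a few lines of Python. The sketch below is illustrative only; the function name `z_confidence_interval` is an assumption made here. It computes z from the standard normal distribution instead of reading it from the table, so intermediate values differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence_level):
    """Confidence interval for the mean when the population standard deviation is known."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)    # critical z value
    margin = z * sigma / math.sqrt(n)          # E = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Emergency room example: n = 30, sample mean 174.3, sigma = 46.5, 99% confidence
low, high = z_confidence_interval(174.3, 46.5, 30, 0.99)
print(f"{low:.1f} to {high:.1f} minutes")
```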

Exercise:
2) A researcher wishes to estimate the number of days it takes an automobile dealer to sell a
Chevrolet. A sample of 50 cars had a mean time on the dealer’s lot of 54 days. Assume
the population standard deviation to be 6.0 days. Find the best point estimate of the
population mean and the 95% confidence interval of the population mean.
3) For a data set obtained from a sample, n = 20 and x̄ = 24.5. It is known that σ = 3.1.
The population is normally distributed.
a) What is the point estimate of μ?
b) Make a 99% confidence interval for μ.
c) What is the margin of error of estimate for part b?
4) For a data set obtained from a sample, n = 81 and x̄ = 48.5. It is known that σ = 4.8.
The population is normally distributed.
a) What is the point estimate of μ?
b) Make a 95% confidence interval for μ.
c) What is the margin of error of estimate for part b?
5) For a population data set, σ = 14.50.
a) What should the sample size be for a 98% confidence interval for μ to have a
margin of error of estimate equal to 5.50?
b) What should the sample size be for a 95% confidence interval for μ to have a
margin of error of estimate equal to 4.25?
6) A study of 415 kindergarten students showed that they have seen on average 5000 hours
of television. If the standard deviation of the population is 900 hours, find the 95%
confidence interval of the mean for all students.
7) A sample of 1500 homes sold recently in a state gave the mean price of homes equal to
$299,720. The population standard deviation of the prices of homes in this state is
$68,650. Construct a 99% confidence interval for the mean price of all homes in this
state.

8) A survey of 30 adults found that the mean age of a person’s primary vehicle is 5.6 years.
Assuming the standard deviation of the population is 0.8 year, find the best point estimate
of the population mean and the 99% confidence interval of the population mean.
9) A pizza shop owner wishes to find the 95% confidence interval of the true mean cost of a
large plain pizza. How large should the sample be if she wishes to be accurate to within
$0.15? A previous study showed that the standard deviation of the price was $0.26.
10) Determine the sample size for the estimate of μ for the following.
a) E = 2.3, σ = 15.40, confidence level 99%
b) E = 4.1, σ = 23.45, confidence level 95%
c) E = 25.9, σ = 122.25, confidence level 95%

Estimation of a Population Mean: σ unknown


This section explains how to construct a confidence interval for the population mean when the
population standard deviation is not known.
When σ is known and the sample size is 30 or more, or the population is normally distributed if
the sample size is less than 30, the confidence interval for the mean can be found by using the z
distribution. However, most of the time, the value of σ is not known, so it must be estimated by
using s, namely, the standard deviation of the sample. When s is used, especially when the
sample size is small, critical values greater than the values for z are used in confidence intervals
in order to keep the interval at a given level, such as 95%. These values are taken from the
Student t distribution, most often called the t distribution.
Case I. If the following three conditions are fulfilled:
1. The population standard deviation is not known
2. The sample size is small (n < 30)
3. The population from which the sample is selected is normally distributed
then we use the t distribution to make the confidence interval for μ.
Case II. If the following two conditions are fulfilled:
1. The population standard deviation is not known
2. The sample size is large (n ≥ 30)

then, again, we use the t distribution to make the confidence interval for μ.

Characteristics of t-Distribution:
• The t distribution was developed by W. S. Gosset in 1908 and published under the
pseudonym Student. As a result, the t distribution is also called Student’s t distribution.
• The t distribution is similar to the standard normal distribution in these ways:
1) It is bell-shaped.
2) It is symmetric about the mean.
3) The mean, median, and mode are equal to 0 and are located at the center of the
distribution.
4) The curve never touches the x axis.
• The t distribution differs from the standard normal distribution in the following ways:
1) The variance is greater than 1.
2) The t distribution is actually a family of curves based on the concept of degrees
of freedom, which is related to sample size.
3) As the sample size increases, the t distribution approaches the standard normal
distribution.
4) The number of degrees of freedom is the only parameter of the t distribution.

Example
1) Find the value of t for 16 degrees of freedom and 0.05 area in the right tail of a t
distribution curve.
2) Find the value of t for the t distribution for each of the following:
a) Area in the right tail = 0.05 and df = 12
b) Area in the left tail = 0.05 and df = 49
c) Area in the left tail = 0.025 and n = 66
d) Area in the right tail = 0.005 and n = 24
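In practice these values are read from a t distribution table. For readers who want to check a table entry, the sketch below approximates the critical value numerically using only the Python standard library: it integrates the t density with Simpson's rule and searches for the cutoff by bisection. The function names, the number of integration steps, and the search bounds are assumptions made for this illustration; it is not how tables are produced.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def right_tail_area(t, df, steps=2000):
    """P(T > t) for t >= 0: 0.5 minus the area over [0, t] (Simpson's rule, even steps)."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_area, df):
    """Positive t value with the given area (less than 0.5) to its right, by bisection."""
    low, high = 0.0, 200.0
    for _ in range(60):
        mid = (low + high) / 2
        if right_tail_area(mid, df) > tail_area:
            low = mid      # too much area to the right: move the cutoff up
        else:
            high = mid
    return (low + high) / 2

# Item 1 above: 16 degrees of freedom and 0.05 area in the right tail
print(round(t_critical(0.05, 16), 3))
```

Because the t distribution is symmetric about 0, a left-tail value is the negative of the right-tail value for the same area. When only n is given, the degrees of freedom are n − 1.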

Confidence Interval for μ using t distribution

• The (1 − α)100% confidence interval for μ is:

$\bar{x} \pm t\,s_{\bar{x}}$, where $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$

• The value of t is obtained from the t distribution table for n − 1 degrees of freedom and
the given confidence level.
• When the population standard deviation is not known, we replace it by the sample
standard deviation, which is its estimator.

Examples
The American Sugar Producers Association wants to estimate the mean yearly sugar
consumption. A sample of 16 people reveals the mean yearly consumption to be 60 pounds with
a standard deviation of 20 pounds.
a. What is the value of the point estimate?
b. Develop the 95 percent confidence interval for the population mean.

Solution:
n = 16, x̄ = 60, s = 20, confidence level = 95%

df = n − 1 = 16 − 1 = 15, area in each tail = 0.025, t = 2.131, $s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{20}{\sqrt{16}} = 5$
a. The point estimate of μ is x̄ = 60 pounds.
b. The 95% confidence interval for μ is:
$\bar{x} \pm t\,s_{\bar{x}}$
60 ± 2.131(5)
60 ± 10.66
Thus, we can state with 95% confidence that the mean yearly sugar consumption
for all Americans is between 49.34 and 70.66 pounds.
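The arithmetic of this example can be written as a small Python sketch. The function name `t_confidence_interval` is an assumption made for illustration, and the t value is passed in as read from the t table (2.131 for 15 degrees of freedom and 0.025 in each tail) rather than computed.

```python
import math

def t_confidence_interval(x_bar, s, n, t_value):
    """Confidence interval for the mean using a t value taken from the table (df = n - 1)."""
    standard_error = s / math.sqrt(n)      # s divided by sqrt(n)
    margin = t_value * standard_error      # margin of error
    return x_bar - margin, x_bar + margin

# Sugar consumption example: n = 16, sample mean 60, s = 20, t = 2.131
low, high = t_confidence_interval(x_bar=60, s=20, n=16, t_value=2.131)
print(f"margin of error: {(high - low) / 2:.3f} pounds")
print(f"interval: {low:.3f} to {high:.3f} pounds")
```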

Exercise
1) Ten randomly selected people were asked how long they slept at night. The mean time
was 7.1 hours, and the standard deviation was 0.78 hour.
a. Find the 95% confidence interval of the mean time. Assume the variable is
normally distributed.
2) The data represent a sample of the number of home fires started by candles for the past
several years. (Data are from the National Fire Protection Association.)
a. Find the 99% confidence interval for the mean number of home fires started by
candles each year.
5460 5900 6090 6310 7160 8440 9930

3) A random sample of 16 airline passengers at the Bay City airport showed that the mean
time spent waiting in line to check in at the ticket counters was 31 minutes with a
standard deviation of 7 minutes.
a) Construct a 99% confidence interval for the mean time spent waiting in line by all
passengers at this airport. Assume that such waiting times for all passengers are
normally distributed.
4) Social networking sites: A recent survey of 8 social networking sites found a mean of 13.1
million visitors for a specific month. The standard deviation was 4.1 million.
a. Find the 95% confidence interval of the true mean.
Estimation of a population proportion: Large Samples
This section explains how to estimate the population proportion p using the sample proportion.
• We know that for large samples:
1) The sampling distribution of the sample proportion p̂ is (approximately) normal.

2) The mean $\mu_{\hat{p}}$ of the sampling distribution of p̂ is equal to the population proportion p.

3) The standard deviation $\sigma_{\hat{p}}$ of the sampling distribution of the sample proportion p̂ is:

$\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$, where q = 1 − p

• In the case of a proportion, a sample is considered to be large if np and nq are both
greater than 5. If p and q are not known, then $n\hat{p}$ and $n\hat{q}$ should each be greater than 5
for the sample to be large.
• When estimating the value of a population proportion, we do not know the values of p
and q. Consequently, we cannot compute $\sigma_{\hat{p}}$. Therefore, in the estimation of a population
proportion, we use the value of $s_{\hat{p}}$ as an estimate of $\sigma_{\hat{p}}$. The value of $s_{\hat{p}}$ is calculated using
the following formula:

$s_{\hat{p}} = \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

• The (1 − α)100% confidence interval for the population proportion p is:

$\hat{p} \pm z\,s_{\hat{p}}$
• Margin of error: The quantity that is subtracted from and added to the value of a sample
statistic to obtain a confidence interval for the corresponding population parameter.
• Determining the sample size for the estimation of a proportion:
We know that $E = z\,\sigma_{\hat{p}} = z\sqrt{\dfrac{pq}{n}}$; solving for n, we finally get

$n = \dfrac{z^{2}pq}{E^{2}}$
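The sample size formula for a proportion can be sketched in Python as below. The function name and the inputs (E = 0.03, 95% confidence) are hypothetical and chosen only for illustration. When no preliminary estimate of p is available, the most conservative choice is p = q = 0.5, because the product pq is largest there; the sketch uses that as its default.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin_of_error, confidence_level, p=0.5):
    """Smallest n giving the requested margin of error; p = 0.5 is the conservative default."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = z ** 2 * p * (1 - p) / margin_of_error ** 2    # n = z^2 * p * q / E^2
    return math.ceil(n)                                # always round up

# Hypothetical inputs: margin of error 0.03 at 95% confidence, no prior estimate of p
print(sample_size_for_proportion(0.03, 0.95))
```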
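EXAMPLE
A sample of 300 observations taken from a population produced a sample proportion of 0.63. Make a 95%
confidence interval for p.
SOLUTION

$\hat{p} = 0.63$, $\hat{q} = 0.37$, $s_{\hat{p}} = \sqrt{\dfrac{\hat{p}\hat{q}}{n}} = \sqrt{\dfrac{(0.63)(0.37)}{300}} = 0.0279$
The 95% confidence interval for p is:

$\hat{p} \pm z\,s_{\hat{p}}$
0.63 ± 1.96(0.0279)
0.63 ± 0.0547
Thus, we can state with 95% confidence that p is between 0.575 and 0.685.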
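The same interval can be computed with a short Python sketch. The function name `proportion_confidence_interval` is an assumption made here for illustration; z is computed from the standard normal distribution rather than read from the table.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, confidence_level):
    """Large-sample confidence interval for a population proportion."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s_p_hat = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard deviation of p-hat
    margin = z * s_p_hat
    return p_hat - margin, p_hat + margin

# Example above: sample proportion 0.63, n = 300, 95% confidence
low, high = proportion_confidence_interval(0.63, 300, 0.95)
print(f"{low:.3f} to {high:.3f}")
```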

EXERCISE
1) Another sample of 300 observations taken from the same population produced a sample
proportion of 0.59. Make a 95% confidence interval for p.
2) A third sample of 300 observations taken from the same population produced a sample
proportion of 0.67. Make a 95% confidence interval for p.
3) It is said that happy and healthy workers are efficient and productive. A company that
manufactures exercising machines wanted to know the percentage of large companies
that provide on-site health club facilities. A sample of 240 such companies showed that
96 of them provide such facilities on site.
a) What is the point estimate of the percentage of all such companies that provide
such facilities on site?
b) Construct a 97% confidence interval for the percentage of all such companies that
provide such facilities on site. What is the margin of error for this estimate?
4) A nutritionist found that in a sample of 80 families, 25% indicated that they ate fruit at
least 3 times a week. Find the 99% confidence interval of the true proportion of families
who said that they ate fruit at least 3 times a week.
5) Determine the most conservative sample size for the estimation of the population
proportion for the following

a) E = 2.3, confidence level 95%

b) E = 0.05, confidence level 90%

c) E = 0.015, confidence level 99%

CHAPTER FOUR
HYPOTHESIS TESTING
Objectives
After completing this chapter, you should be able to
• Understand the definitions used in hypothesis testing.
• State the null and alternative hypotheses.
• Find critical values for the z test.
• State the five steps used in hypothesis testing.
• Test means when σ is known, using the z test.
• Test means when σ is unknown, using the t test.
• Test proportions, using the z test.
• Test hypotheses, using confidence intervals.
• Explain the relationship between type I and type II errors and the power of a test.
INTRODUCTION
In Chapter 3, we introduced estimation and showed how it is used. Now we’re going to present
the second general procedure of making inferences about a population: hypothesis testing. The
purpose of this type of inference is to determine whether enough statistical evidence exists to
enable us to conclude that a belief or hypothesis about a parameter is supported by the data. You
will discover that hypothesis testing has a wide variety of applications in business and
economics, as well as many other fields.
This chapter will lay the foundation upon which the rest of the book is based. As such it
represents a critical contribution to your development as a statistics practitioner.
CONCEPTS OF HYPOTHESIS TESTING
The term hypothesis testing is likely new to most readers, but the concepts underlying hypothesis
testing are quite familiar. There are a variety of nonstatistical applications of hypothesis testing,
the best known of which is a criminal trial.
When a person is accused of a crime, he or she faces a trial. The prosecution presents its case,
and a jury must make a decision on the basis of the evidence presented. In fact, the jury conducts
a test of hypothesis. There are actually two hypotheses that are tested. The first is called the null
hypothesis and is represented by H0 (pronounced H-nought; nought is a British term for zero). It is:
H0: The defendant is innocent.
The second is called the alternative hypothesis (or research hypothesis) and is denoted H1. In a
criminal trial it is:
H1: The defendant is guilty.
Of course, the jury does not know which hypothesis is correct. The members must make a
decision on the basis of the evidence presented by both the prosecution and the defense. There
are only two possible decisions: convict or acquit the defendant. In statistical parlance,
convicting the defendant is equivalent to rejecting the null hypothesis in favor of the alternative;
that is, the jury is saying that there was enough evidence to conclude that the defendant was
guilty. Acquitting a defendant is phrased as not rejecting the null hypothesis in favor of the
alternative, which means that the jury decided that there was not enough evidence to conclude
that the defendant was guilty. Notice that we do not say that we accept the null hypothesis. In a
criminal trial, that would be interpreted as finding the defendant innocent. Our justice system
does not allow this decision.
There are two possible errors. A Type I error occurs when we reject a true null hypothesis. A
Type II error is defined as not rejecting a false null hypothesis. In the criminal trial, a Type I
error is made when an innocent person is wrongly convicted. A Type II error occurs when a
guilty defendant is acquitted. The probability of a Type I error is denoted by α (Greek letter
alpha), which is also called the significance level. The probability of a Type II error is denoted
by β (Greek letter beta). The error probabilities α and β are inversely related, meaning that, for
a fixed sample size, any attempt to reduce one will increase the other.
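The inverse relationship between α and β can be illustrated numerically. The sketch below is not from the original notes: it assumes a right-tailed z test of H0: μ = μ0 with known σ, and the values μ0 = 100, true mean 103, σ = 10, and n = 25 are hypothetical. For each α it computes β, the probability of failing to reject H0 when the true mean is 103.

```python
import math
from statistics import NormalDist

def type_ii_error(alpha, mu0, mu1, sigma, n):
    """beta for a right-tailed z test of H0: mu = mu0 when the true mean is mu1 > mu0."""
    std_normal = NormalDist()
    standard_error = sigma / math.sqrt(n)
    z_critical = std_normal.inv_cdf(1 - alpha)
    # H0 is not rejected when the sample mean falls below mu0 + z_critical * standard_error.
    return std_normal.cdf(z_critical - (mu1 - mu0) / standard_error)

# Hypothetical setting: smaller alpha leads to larger beta at a fixed sample size
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, mu0=100, mu1=103, sigma=10, n=25)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}")
```

Rerunning the loop with a larger n lowers β for every α, which is why increasing the sample size is the way to reduce both error probabilities at once.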
This chapter introduces the second topic in inferential statistics: tests of hypotheses. In a test of
hypothesis, we test a certain given theory or belief about a population parameter. We may want
to find out, using some sample information, whether or not a given claim (or statement) about a
population parameter is true. This chapter discusses how to make such tests of hypotheses about
the population mean μ and the population proportion p.

• Hypothesis: A hypothesis is a claim or statement about a property of a population. There
are two types of hypotheses:
1. Null hypothesis: A null hypothesis is a claim (or statement) about a population
parameter that is assumed to be true until it is declared false.
2. Alternative hypothesis (research hypothesis): An alternative hypothesis is a
claim about a population parameter that will be true if the null hypothesis is false.
• Critical (rejection) region: the region where the null hypothesis is rejected.
• Nonrejection (acceptance) region: the region where the null hypothesis is not rejected.
• Critical point (value): any value that separates the critical region from the acceptance
region.
• Test statistic: a value computed from the sample data that is used in making the
decision about the rejection of the null hypothesis.
• Significance level: the probability of committing a Type I error, denoted by α (alpha).
• In testing a statistical hypothesis, we may commit two types of errors:
1) Type I error: A Type I error occurs when a true null hypothesis is
rejected.
2) Type II error: A Type II error occurs when a false null hypothesis is not
rejected.
The following table summarizes the decisions the researcher could make and the possible
consequences.

                     H0 is true          H0 is false
Reject H0            Type I error        Correct decision
Do not reject H0     Correct decision    Type II error

• It would be wonderful if we could force both α and β to equal zero. Unfortunately, these
quantities have an inverse relationship. As α increases, β decreases, and vice versa.

• The only way to decrease both α and β is to increase the sample size. To make both
quantities equal zero, the sample size would have to be infinite; that is, you would have to
sample the entire population.
• Based on the form of the null hypothesis and alternative hypothesis, we have three types of
tests:
1) Two-tailed tests:
• A two-tailed test has rejection regions in both tails.
• A two-tailed test has one acceptance region.
• In a two-tailed test, H1 has a not-equal sign (≠).
2) Left-tailed tests:
• A left-tailed test has the rejection region in the left tail.
• A left-tailed test has one acceptance region.
• In a left-tailed test, H1 has a less-than sign (<).
3) Right-tailed tests:
• A right-tailed test has the rejection region in the right tail of the
distribution curve.
• A right-tailed test has one acceptance region.
• In a right-tailed test, H1 has a greater-than sign (>).

• The following table summarizes the relationship between the signs in H0 and H1 and
the tails of a test.

                                         Two-tailed test               Left-tailed test              Right-tailed test
Sign in the null hypothesis H0           =                             = or ≥                        = or ≤
Sign in the alternative hypothesis H1    ≠                             <                             >
Rejection region                         In both tails                 In the left tail              In the right tail
Acceptance region                        Between the critical values   Right of the critical value   Left of the critical value

EXAMPLE:
1) Write the null and alternative hypotheses for each of the following examples. Determine
if each is a case of a two-tailed, a left-tailed, or a right-tailed test.
a) To test if the mean number of hours spent working per week by college students who
hold jobs is different from 20 hours
b) To test whether or not a bank’s ATM is out of service for an average of more than 10
hours per month
EXERCISE:
1) Write the null and alternative hypotheses for each of the following examples. Determine
if each is a case of a two-tailed, a left-tailed, or a right-tailed test.
a. To test if the mean length of experience of airport security guards is different
from 3 years
b. To test if the mean credit card debt of college seniors is less than $1000
c. To test if the mean time a customer has to wait on the phone to speak to a
representative of a mail-order company about unsatisfactory service is more
than 12 minutes
d. An engineer hypothesizes that the mean number of defects can be decreased
in a manufacturing process of compact disks by using robots instead of
humans for certain tasks. The mean number of defective disks per 1000 is 18.

2) Consider H0: μ = 55 versus H1: μ ≠ 55
a) What type of error would you make if the null hypothesis is actually false and you fail
to reject it?
b) What type of error would you make if the null hypothesis is actually true and you
reject it?

Hypothesis Tests About the Population Mean: σ is known


This section explains how to perform a test of hypothesis for the population mean when the
population standard deviation σ is known.

• Test statistic: In tests of hypotheses about μ using the normal distribution, the random
variable

$z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, where $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

is called the test statistic. The test statistic can be defined as a rule or criterion that is used to
make the decision on whether or not to reject the null hypothesis.
Five-Step Procedure for Testing a Hypothesis
 A test of hypothesis procedure that uses the critical-value approach involves the
following five steps:
1) State the null and alternative hypotheses.
2) Select the distribution to use.
3) Determine the rejection and nonrejection regions.
4) Calculate the value of the test statistic.

5) Make a decision.
Example:
1) The TIV Telephone Company provides long-distance telephone service in an area.
According to the company’s records, the average length of all long-distance calls placed
through this company in 2009 was 12.44 minutes. The company’s management wanted to
check if the mean length of the current long-distance calls is different from 12.44
minutes. A sample of 150 such calls placed through this company produced a mean
length of 13.71 minutes. The standard deviation of all such calls is 2.65 minutes. Using
the 2% significance level, can you conclude that the mean length of all current long-
distance calls is different from 12.44 minutes?
Solution
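n = 150, x̄ = 13.71 minutes, σ = 2.65 minutes, α = 0.02
We are to test whether or not the mean length of all current long-distance calls is different
from 12.44 minutes.
H0: μ = 12.44

H1: μ ≠ 12.44

• We use the normal distribution because σ is known and n ≥ 30.
• Area in each tail = α/2 = 0.01, so the critical values of z are −2.33 and 2.33.
• The value of the test statistic is:

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{2.65}{\sqrt{150}} = 0.2164$, $z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \dfrac{13.71 - 12.44}{0.2164} = 5.87$

• Because z = 5.87 is greater than 2.33, it falls in the rejection region. We reject H0 and
conclude that, based on the sample information, it appears that the mean
length of all such calls is not equal to 12.44 minutes.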
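The five steps applied in this example can be sketched in Python for a two-tailed z test. The function name `two_tailed_z_test` is an assumption made for illustration; the critical value is computed from the standard normal distribution instead of being read from the table.

```python
import math
from statistics import NormalDist

def two_tailed_z_test(x_bar, mu0, sigma, n, alpha):
    """Return (z, critical z, reject H0?) for H0: mu = mu0 versus H1: mu != mu0."""
    standard_error = sigma / math.sqrt(n)
    z = (x_bar - mu0) / standard_error                # observed value of the test statistic
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    return z, z_critical, abs(z) > z_critical

# Telephone example: n = 150, sample mean 13.71, sigma = 2.65, alpha = 0.02
z, z_crit, reject = two_tailed_z_test(13.71, 12.44, 2.65, 150, 0.02)
print(f"z = {z:.2f}, critical values = -{z_crit:.2f} and {z_crit:.2f}, reject H0: {reject}")
```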
EXERCISE
2) Make the following hypothesis tests:
a) H0: μ = 25, H1: μ ≠ 25, n = 81, x̄ = 28.5, σ = 3, α = 0.01

b) H0: μ = 12, H1: μ < 12, n = 45, x̄ = 11.25, σ = 4.5, α = 0.05

c) H0: μ = 40, H1: μ > 40, n = 100, x̄ = 47, σ = 7, α = 0.1

3) The manufacturer of a certain brand of auto batteries claims that the mean life of these
batteries is 45 months. A consumer protection agency that wants to check this claim took
a random sample of 24 such batteries and found that the mean life for this sample is 43.05
months. The lives of all such batteries have a normal distribution with the population
standard deviation of 4.5 months.
a) Test the hypothesis that the mean life of these batteries is less than 45 months
using α = 0.025.
4) A telephone company claims that the mean duration of all long-distance phone calls
made by its residential customers is 10 minutes. A random sample of 100 long-distance
calls made by its residential customers taken from the records of this company showed
that the mean duration of calls for this sample is 9.20 minutes. The population standard
deviation is known to be 3.80 minutes.
a) Test whether the mean duration of long-distance calls is different from 10 minutes
using the critical-value approach and α = 0.05.
5) For each of the following examples of tests of hypotheses about μ, show the rejection
and nonrejection regions on the sampling distribution of the sample mean assuming it is
normal.
a) A two-tailed test with α = 0.01 and n = 100
b) A left-tailed test with α = 0.05 and n = 27
c) A right-tailed test with α = 0.025 and n = 36
6) A researcher reports that the average salary of assistant professors is more than $42,000.
A sample of 30 assistant professors has a mean salary of $43,260. At α = 0.05, test the
claim that assistant professors earn more than $42,000 per year. The standard deviation
of the population is $5230.

7) A researcher wishes to test the claim that the average cost of tuition and fees at a four
year public college is greater than $5700. She selects a random sample of 36 four-year
public colleges and finds the mean to be $5950. The population standard deviation is
$659. Is there evidence to support the claim at α = 0.05?

8) The manager of a department store is thinking about establishing a new billing system for
the store’s credit customers. After a thorough financial analysis, she determines that the
new system will be cost-effective only if the mean monthly account is more than $170. A
random sample of 400 monthly accounts is drawn, for which the sample mean is $178.
The manager knows that the accounts are approximately normally distributed with a
standard deviation of $65. Can the manager conclude from this that the new system will
be cost-effective? Use the 5% significance level.
9) The Thompson's Discount Appliance Store issues its own credit card. The credit manager
wants to find whether the mean monthly unpaid balance is more than $400. The level of
significance is set at .05. A random check of 60 unpaid balances revealed the sample
mean is $407 and the standard deviation of the sample is $22.50. Should the credit
manager conclude the population mean is greater than $400, or is it reasonable that the
difference of $7 ($407 - $400 = $7) is due to chance?
10) The MacBurger restaurant chain claims that the waiting time of customers for service is
normally distributed, with a mean of 3 minutes and a standard deviation of 1 minute. The
quality-assurance department found in a sample of 50 customers at the Warren
Road MacBurger that the mean waiting time was 2.75 minutes. At the .05 significance
level, can we conclude that the mean waiting time is less than 3 minutes?

Hypothesis Tests About a Population Mean: $\sigma$ Unknown


This section explains how to perform a test of hypothesis about the population mean when the
population standard deviation is not known.
• The value of the test statistic t for the sample mean $\bar{x}$ is computed as

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}} \quad \text{where} \quad s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

The value of t calculated for $\bar{x}$ by using this formula is also called the observed
value of t.
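
As a rough illustration of the formula above, the following Python sketch computes the observed value of t from summary statistics. The function name `t_statistic` is an assumption made for this sketch. The numbers are taken from the walking-age example that follows (n = 18, sample mean 12.9, s = 0.80, hypothesized mean 12.5). The critical value still has to be read from a t table with n - 1 degrees of freedom.

```python
import math

def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """Observed t for a one-sample test when sigma is unknown."""
    standard_error = sample_sd / math.sqrt(n)   # s_xbar = s / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Summary statistics from the walking-age example
t_obs = t_statistic(12.9, 12.5, 0.80, 18)
df = 18 - 1                                     # degrees of freedom, n - 1
print(round(t_obs, 3), df)
```

The observed t is then compared with the table value for the chosen significance level and the computed degrees of freedom.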
EXAMPLE:
1) A psychologist claims that the mean age at which children start walking is 12.5 months.
Carol wanted to check if this claim is true. She took a random sample of 18 children and



found that the mean age at which these children started walking was 12.9 months with a
standard deviation of .80 month. It is known that the ages at which all children start
walking are approximately normally distributed.
a) What will your conclusion be if the significance level is 1%?
2) Grand Auto Corporation produces auto batteries. The company claims that its top-of-the-
line Never Die batteries are good, on average, for at least 65 months. A consumer
protection agency tested 45 such batteries to check this claim. It found that the mean life
of these 45 batteries is 63.4 months, and the standard deviation is 3 months.
a) Using a 2.5% significance level, test whether the mean life of all such batteries is less than
65 months.
3) An educator claims that the average salary of substitute teachers in school districts in
Somaliland Country, Hargeisa, is less than $60 per day. A random sample of eight school
districts is selected, and the daily salaries (in dollars) are shown. Is there enough evidence
to support the educator’s claim at $\alpha = 0.10$?
60 56 60 55 70 55 60 55

4) Perform the following tests of hypothesis


d) H 0 :   285 , H1 :   285 n  55 x  267.80 s  42.90,   0.05 ,

e) H 0 :   147500 , H1 :   147500 n  41 x  141812 s  22972,   0.1 ,

5) Perform the following hypothesis tests


a) H 0 :   285, H1 :   285, n  55, x  267.8, s  42.90,   0.05

b) H 0 :   10.70, H1 :   10.70, n  47, x  12.025, s  4.90,  0.01

Hypothesis Tests About a Population Proportion: Large Samples

• This section presents the procedure to perform tests of hypotheses about the population
proportion, p, for large samples. The procedures to make such tests are similar in many
respects to the ones for the population mean. Again, the test can be two-tailed or
one-tailed. We know from Chapter 6 that when the sample size is large, the sample
proportion, $\hat{p}$, is approximately normally distributed with its mean equal to p and standard
deviation equal to $\sqrt{pq/n}$. Hence, we use the normal distribution to perform a test of
hypothesis about the population proportion, p, for a large sample. As was mentioned in
Chapters 6 and 7, in the case of a proportion, the sample size is considered to be large when
np and nq are both greater than 5.

• The value of the test statistic z for the sample proportion, $\hat{p}$, is computed as

$$z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} \quad \text{where} \quad \sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$

The value of p that is used in this formula is the one from the null hypothesis. The value of q
is equal to $1 - p$. The value of z calculated for $\hat{p}$ using the above formula is also called the
observed value of z.
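
The calculation above can be sketched in Python as follows. This is only an illustration: the function name `proportion_z` is an assumption of the sketch, and the figures are those of the Direct Mailing Company example below (129 of 150 orders mailed on time, claimed proportion 0.90). The test is treated as left-tailed because the claim is "at least 90%".

```python
import math
from statistics import NormalDist

def proportion_z(p_hat, p0, n):
    """Observed z for a large-sample test about a population proportion."""
    q0 = 1 - p0
    # Large-sample condition from the text: np and nq both greater than 5
    if n * p0 <= 5 or n * q0 <= 5:
        raise ValueError("sample too small for the normal approximation")
    sigma_p_hat = math.sqrt(p0 * q0 / n)        # uses p from the null hypothesis
    return (p_hat - p0) / sigma_p_hat

z_obs = proportion_z(129 / 150, 0.90, 150)
z_crit = NormalDist().inv_cdf(0.025)            # left-tailed test, alpha = 0.025
print(round(z_obs, 2), round(z_crit, 2), z_obs < z_crit)
```

Because the observed z does not fall below the critical value, the null hypothesis is not rejected in this case.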
EXAMPLE
1) Direct Mailing Company sells computers and computer parts by mail. The company
claims that at least 90% of all orders are mailed within 72 hours after they are received.
The quality control department at the company often takes samples to check if this claim
is valid. A recently taken sample of 150 orders showed that 129 of them were mailed
within 72 hours.
a) Do you think the company’s claim is true? Use a 2.5% significance level.
2) A telephone company representative estimates that 40% of its customers have
call-waiting service. To test this hypothesis, she selected a sample of 100 customers and
found that 37% had call waiting. At $\alpha = 0.01$, is there enough evidence to reject the
claim?
3) A researcher claims that 54% of fatal car/truck accidents are caused by driver error. A
researcher studies 30 randomly selected accidents and finds that 14 were caused by driver
error. Using $\alpha = 0.05$, can the researcher’s claim be refuted?
4) A food company is planning to market a new type of frozen yogurt. However, before
marketing this yogurt, the company wants to find what percentage of the people like it.
The company’s management has decided that it will market this yogurt only if at least



35% of the people like it. The company’s research department selected a random sample
of 400 persons and asked them to taste this yogurt. Of these 400 persons, 112 said they
liked it.
a) Testing at the 2.5% significance level, can you conclude that the company should
market this yogurt?
5) Chicken Delight claims that 90 percent of its orders are delivered within 10 minutes of
the time the order is placed. A sample of 100 orders revealed that 82 were delivered
within the promised time. At the .10 significance level, can we conclude that less than 90
percent of the orders are delivered in less than 10 minutes?
6) Make the following hypothesis tests about p
a) H 0 : p  0.45, H1 : p  0.45, n  100, pˆ  0.49,   0.1

b) H 0 : p  0.72, H1 : p  0.72, n  700, pˆ  0.64,   0.05

c) H 0 : p  0.30, H1 : p  0.30, n  200, pˆ  0.33,   0.01



CHAPTER FIVE

ESTIMATION AND HYPOTHESIS TESTING:

TWO POPULATIONS

OBJECTIVES
After completing this chapter, you should be able to
• Test the difference between population means, using the z test.
• Test the difference between two means for independent samples, using the t test.
• Test the difference between two means for dependent samples.
• Test the difference between two proportions.
INTRODUCTION
Chapters 3 and 4 discussed the estimation and hypothesis-testing procedures for $\mu$ and p
involving a single population. This chapter extends the discussion of estimation and
hypothesis-testing procedures to the difference between two population means and the
difference between two population proportions. For example, we may want to make a confidence
interval for the difference between the mean prices of houses in two cities, or we may want to
test the hypothesis that the mean price of houses in Hargeisa is different from that in Erigavo.
As another example,
we may want to make a confidence interval for the difference between the proportions of all
male and female adults who abstain from drinking, or we may want to test the hypothesis that the
proportion of all adult men who abstain from drinking is different from the proportion of all adult
women who abstain from drinking. Constructing confidence intervals and testing hypotheses
about population parameters are referred to as making inferences.



INFERENCES ABOUT THE DIFFERENCE BETWEEN TWO MEANS:
INDEPENDENT SAMPLES: $\sigma_1$ AND $\sigma_2$ KNOWN

Let $\mu_1$ be the mean of the first population and $\mu_2$ be the mean of the second population.
Suppose we want to make a confidence interval and test a hypothesis about the difference
between these two population means, that is, $\mu_1 - \mu_2$. Let $\bar{x}_1$ be the mean of a sample
taken from the first population and $\bar{x}_2$ be the mean of a sample taken from the second
population. Then, $\bar{x}_1 - \bar{x}_2$ is the sample statistic that is used to make an interval
estimate and to test a hypothesis about $\mu_1 - \mu_2$.
This section discusses how to make confidence intervals and test hypotheses about
$\mu_1 - \mu_2$ when certain conditions (to be explained later in this section) are satisfied.
First we explain the concepts of independent and dependent samples.
INDEPENDENT VS DEPENDENT SAMPLES
Two samples are independent if they are drawn from two different populations and the elements
of one sample have no relationship to the elements of the second sample. If the elements of the
two samples are somehow related, then the samples are said to be dependent. Thus, in two
independent samples, the selection of one sample has no effect on the selection of the second
sample.
EXAMPLE 1
Suppose we want to estimate the difference between the mean salaries of all male and all female
executives. To do so, we draw two samples, one from the population of male executives and
another from the population of female executives. These two samples are independent because
they are drawn from two different populations, and the samples have no effect on each other.
EXAMPLE 2
Suppose we want to estimate the difference between the mean weights of all participants before
and after a weight loss program. To accomplish this, suppose we take a sample of 40 participants
and measure their weights before and after the completion of this program. Note that these two
samples include the same 40 participants. This is an example of two dependent samples. Such
samples are also called paired or matched samples.



Interval Estimation of $\mu_1 - \mu_2$: Independent Samples

By constructing a confidence interval for $\mu_1 - \mu_2$, we estimate the difference between the
means of two populations. For example, we may want to find the difference between the mean
heights of male and female adults. The difference between the two sample means,
$\bar{x}_1 - \bar{x}_2$, is the point estimator of the difference between the two population means,
$\mu_1 - \mu_2$. When the conditions of this section hold true, we use the normal distribution to
make a confidence interval for the difference between the two population means. The following
formula gives the interval estimation for $\mu_1 - \mu_2$.

[Figure: a sample of size $n_1$ with statistics $\bar{x}_1$ and $s_1$ is drawn from population one (parameters $\mu_1$ and $\sigma_1$), and a sample of size $n_2$ with statistics $\bar{x}_2$ and $s_2$ is drawn from population two (parameters $\mu_2$ and $\sigma_2$).]

Confidence Interval for $\mu_1 - \mu_2$:

When using the normal distribution, the $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is:

$$(\bar{x}_1 - \bar{x}_2) \pm z\,\sigma_{\bar{x}_1 - \bar{x}_2} \quad \text{where} \quad \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

The value $\bar{x}_1 - \bar{x}_2$ is the point estimator of $\mu_1 - \mu_2$.
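
A minimal Python sketch of this interval is given below. The function name `z_interval_two_means` and the summary statistics passed to it are hypothetical and chosen only for illustration. The z value is obtained from the standard normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist

def z_interval_two_means(mean1, mean2, sigma1, sigma2, n1, n2, confidence):
    """Confidence interval for mu1 - mu2 when sigma1 and sigma2 are known."""
    point_estimate = mean1 - mean2
    # Standard deviation of xbar1 - xbar2
    standard_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    # z value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    return point_estimate - margin, point_estimate + margin

# Hypothetical summary statistics, chosen only for illustration
low, high = z_interval_two_means(50.0, 45.0, 4.0, 3.0, 64, 36, 0.95)
print(round(low, 2), round(high, 2))
```

The half-width of the returned interval is the margin of error of the estimate.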

Hypothesis Testing About $\mu_1 - \mu_2$

It is often necessary to compare the means of two populations. For example, we may want to
know if the mean price of houses in Hargeisa is the same as that in Buroa. Similarly, we may be
interested in knowing if, on average, Somali children spend fewer hours in school than Japanese
children do. In both these cases, we will perform a test of hypothesis about $\mu_1 - \mu_2$. The
alternative hypothesis in a test of hypothesis may be that the means of the two populations are
different, or that the mean of the first population is greater than the mean of the second
population, or that the mean of the first population is less than the mean of the second
population. These three situations are described next.
1) Testing an alternative hypothesis that the means of two populations are different is
equivalent to $\mu_1 \neq \mu_2$, which is the same as $\mu_1 - \mu_2 \neq 0$.
2) Testing an alternative hypothesis that the mean of the first population is greater than
the mean of the second population is equivalent to $\mu_1 > \mu_2$, which is the same as
$\mu_1 - \mu_2 > 0$.
3) Testing an alternative hypothesis that the mean of the first population is less than the
mean of the second population is equivalent to $\mu_1 < \mu_2$, which is the same as
$\mu_1 - \mu_2 < 0$.
The procedure followed to perform a test of hypothesis about the difference between two
population means is similar to the one used to test hypotheses about single-population
parameters in Chapter 4.
• If the following conditions are satisfied, we will use the normal distribution to make a test
of hypothesis about $\mu_1 - \mu_2$:
1) The two samples are independent.
2) The standard deviations $\sigma_1$ and $\sigma_2$ of the two populations are known.
3) At least one of the following two conditions is fulfilled:
a) Both samples are large (i.e., $n_1 \geq 30$ and $n_2 \geq 30$).
b) If either one or both sample sizes are small, then both populations from which the
samples are drawn are normally distributed.
Test Statistic z for $\bar{x}_1 - \bar{x}_2$: When using the normal distribution, the value of the test
statistic z for $\bar{x}_1 - \bar{x}_2$ is computed as

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$$

The value of $\mu_1 - \mu_2$ is substituted from $H_0$. The value of $\sigma_{\bar{x}_1 - \bar{x}_2}$ is calculated as
earlier in this section.
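
The three forms of the alternative hypothesis and the z statistic can be combined in a short Python sketch using the critical-value approach. The function name `z_test_two_means`, the labels "different", "greater" and "less", and the summary statistics are all assumptions made for this illustration, and the null hypothesis is taken to be $\mu_1 - \mu_2 = 0$.

```python
import math
from statistics import NormalDist

def z_test_two_means(mean1, mean2, sigma1, sigma2, n1, n2, alpha, alternative):
    """Critical-value z test of H0: mu1 - mu2 = 0 with known sigmas.

    alternative is "different", "greater" or "less" (labels chosen for this sketch).
    Returns the observed z and True if H0 is rejected.
    """
    standard_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = ((mean1 - mean2) - 0.0) / standard_error    # mu1 - mu2 = 0 under H0
    normal = NormalDist()
    if alternative == "different":                  # two-tailed
        reject = abs(z) > normal.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":                  # right-tailed
        reject = z > normal.inv_cdf(1 - alpha)
    elif alternative == "less":                     # left-tailed
        reject = z < normal.inv_cdf(alpha)
    else:
        raise ValueError("alternative must be 'different', 'greater' or 'less'")
    return z, reject

# Hypothetical summary statistics, chosen only for illustration
z_two, reject_two = z_test_two_means(46.3, 45.0, 4.0, 3.0, 64, 36, 0.05, "different")
z_right, reject_right = z_test_two_means(46.3, 45.0, 4.0, 3.0, 64, 36, 0.05, "greater")
print(round(z_two, 2), reject_two, reject_right)
```

With these numbers the same observed z leads to different decisions for the two-tailed and right-tailed tests, because the whole of $\alpha$ sits in one tail in the one-tailed case.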
EXAMPLE:
1) Gasoline prices reached record high levels in 16 states during 2003 (The Wall Street
Journal, March 7, 2003). Two of the affected states were California and Florida. The
American Automobile Association reported a sample mean price of $2.04 per gallon in
California and a sample mean price of $1.72 per gallon in Florida. Use a sample size of
40 for the California data and a sample size of 35 for the Florida data. Assume that prior
studies indicate a population standard deviation of .10 in California and .08 in Florida.
a) What is a point estimate of the difference between the population mean
prices per gallon in California and Florida?
b) Construct a 99% confidence interval for the difference between the two population means.
c) Test at the 1% significance level if the two population means are different.
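Solution
a. The point estimate of $\mu_1 - \mu_2$ is $\bar{x}_1 - \bar{x}_2 = 2.04 - 1.72 = 0.32$.

b. The standard deviation of $\bar{x}_1 - \bar{x}_2$ is

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{(0.10)^2}{40} + \frac{(0.08)^2}{35}} = 0.0208$$

The 99% confidence interval for $\mu_1 - \mu_2$ is:

$$(\bar{x}_1 - \bar{x}_2) \pm z\,\sigma_{\bar{x}_1 - \bar{x}_2} = 0.32 \pm 2.58(0.0208) = 0.32 \pm 0.054$$

that is, from 0.266 to 0.374.

c. $H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
Normal distribution
$\alpha = 0.01$, $\alpha/2 = 0.005$, $z_c = \pm 2.58$

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}} = \frac{0.32 - 0}{0.0208} = 15.38$$

We reject the null hypothesis because the test statistic falls in the rejection region.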
EXERCISE
1. The following information is obtained from two independent samples selected from two
normally distributed populations.
$n_1 = 18$, $\bar{x}_1 = 7.82$, $\sigma_1 = 2.35$ and $n_2 = 15$, $\bar{x}_2 = 5.99$, $\sigma_2 = 3.17$
a) What is the point estimate of $\mu_1 - \mu_2$?
b) Construct a 99% confidence interval for $\mu_1 - \mu_2$.
c) Find the margin of error for this estimate.
d) Test at the 5% significance level if the two population means are different.
e) Test at the 1% significance level if $\mu_1$ is greater than $\mu_2$.
2. The following information is obtained from two independent samples selected from two
normally distributed populations.
$n_1 = 650$, $\bar{x}_1 = 1.05$, $\sigma_1 = 5.22$ and $n_2 = 675$, $\bar{x}_2 = 1.54$, $\sigma_2 = 6.80$
a) What is the point estimate of $\mu_1 - \mu_2$?
b) Construct a 95% confidence interval for $\mu_1 - \mu_2$.
c) Find the margin of error for this estimate.
d) Test at the 1% significance level if the two population means are different.
e) Test at the 5% significance level if $\mu_1$ is less than $\mu_2$.



3. The management at New Century Bank claims that the mean waiting time for all
customers at its branches is less than that at the Public Bank, which is its main
competitor. A business consulting firm took a sample of 200 customers from the New
Century Bank and found that they waited an average of 4.5 minutes before being served.
Another sample of 300 customers taken from the Public Bank showed that these
customers waited an average of 4.75 minutes before being served. Assume that the
standard deviations for the two populations are 1.2 and 1.5 minutes, respectively.
a) Make a 97% confidence interval for the difference between the two population
means.
b) Test at the 2.5% significance level whether the claim of the management of the
New Century Bank is true.
4. Employees of a large corporation are concerned about the declining quality of medical
services provided by their group health insurance. A random sample of 100 office visits
by employees of this corporation to primary care physicians during 2004 found that the
doctors spent an average of 19 minutes with each patient. This year a random sample of
108 such visits showed that doctors spent an average of 15.5 minutes with each patient.
Assume that the standard deviations for the two populations are 2.7 and 2.1 minutes,
respectively.
a) Construct a 95% confidence interval for the difference between the two
population means for these two years.
b) Using the 2.5% level of significance, can you conclude that the mean time spent
by doctors with each patient is lower for this year than for 2004?
5. A survey found that the average hotel room rate in New Orleans is $88.42 and the
average room rate in Phoenix is $80.61. Assume that the data were obtained from two
samples of 50 hotels each and that the standard deviations of the populations are $5.62
and $4.83, respectively. At the 0.05 significance level, can it be concluded that there is a
significant difference in the rates?



INFERENCE ABOUT THE DIFFERENCE BETWEEN TWO MEANS FOR
INDEPENDENT SAMPLES: $\sigma_1$ AND $\sigma_2$ UNKNOWN BUT EQUAL
This section discusses making a confidence interval and testing a hypothesis about the difference
between the means of two populations, $\mu_1 - \mu_2$, assuming that the standard deviations,
$\sigma_1$ and $\sigma_2$, of these populations are not known but are assumed to be equal. There are some
other conditions, explained below, that must be fulfilled to use the procedures discussed in
this section.
• If the following conditions are satisfied,
1) The two samples are independent.
2) The standard deviations $\sigma_1$ and $\sigma_2$ of the two populations are unknown, but they can be
assumed to be equal, that is, $\sigma_1 = \sigma_2$.
3) At least one of the following two conditions is fulfilled:
a) Both samples are large (i.e., $n_1 \geq 30$ and $n_2 \geq 30$).
b) If either one or both sample sizes are small, then both populations
from which the samples are drawn are normally distributed.
then we use the t distribution to make a confidence interval and test a hypothesis about the
difference between the means of two populations, $\mu_1 - \mu_2$.

When the standard deviations of the two populations are equal, we can use $\sigma$ for both
$\sigma_1$ and $\sigma_2$. Because $\sigma$ is unknown, we replace it by its point estimator $s_p$, which is
called the pooled sample standard deviation (hence, the subscript p). The value of $s_p$ is
computed by using the information from the two samples as follows.


Pooled Standard Deviation for Two Samples: The pooled standard deviation for two samples
is computed as

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(df_1)s_1^2 + (df_2)s_2^2}{df_1 + df_2}}$$

where $n_1$ and $n_2$ are the sizes of the two samples, $s_1^2$ and $s_2^2$ are the variances of the two
samples, respectively, and $df_1 = n_1 - 1$ and $df_2 = n_2 - 1$. Here $s_p$ is an estimator of $\sigma$.



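When $s_p$ is used as an estimator of $\sigma$, the standard deviation $\sigma_{\bar{x}_1 - \bar{x}_2}$ of
$\bar{x}_1 - \bar{x}_2$ is estimated by $s_{\bar{x}_1 - \bar{x}_2}$. The value of $s_{\bar{x}_1 - \bar{x}_2}$ is calculated
by using the following formula.

Estimator of the Standard Deviation of $\bar{x}_1 - \bar{x}_2$: The estimator of the standard deviation of
$\bar{x}_1 - \bar{x}_2$ is:

$$s_{\bar{x}_1 - \bar{x}_2} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$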
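
The two formulas above translate directly into code. The Python sketch below is only an illustration: the function names `pooled_sd` and `standard_error_pooled` and the sample figures are hypothetical.

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Pooled sample standard deviation s_p for two independent samples."""
    numerator = (n1 - 1) * s1**2 + (n2 - 1) * s2**2
    return math.sqrt(numerator / (n1 + n2 - 2))

def standard_error_pooled(n1, s1, n2, s2):
    """Estimator of the standard deviation of xbar1 - xbar2."""
    return pooled_sd(n1, s1, n2, s2) * math.sqrt(1 / n1 + 1 / n2)

# Hypothetical samples: n1 = n2 = 10, s1 = 2, s2 = 4
sp = pooled_sd(10, 2.0, 10, 4.0)
se = standard_error_pooled(10, 2.0, 10, 4.0)
print(round(sp, 4), round(se, 4))
```

Note that $s_p$ always lies between $s_1$ and $s_2$, since its square is a weighted average of the two sample variances.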

Now we are ready to discuss the procedures that are used to make confidence intervals and
test hypotheses about $\mu_1 - \mu_2$ for small and independent samples selected from two populations
with unknown but equal standard deviations.
INTERVAL ESTIMATION OF $\mu_1 - \mu_2$

As was mentioned earlier in this chapter, the difference between the two sample means,
$\bar{x}_1 - \bar{x}_2$, is the point estimator of the difference between the two population means,
$\mu_1 - \mu_2$.

The following formula gives the confidence interval for $\mu_1 - \mu_2$ when the t distribution is used
and the conditions mentioned earlier in this section are fulfilled.
Confidence Interval for $\mu_1 - \mu_2$: The $(1 - \alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is

$$(\bar{x}_1 - \bar{x}_2) \pm t\,s_{\bar{x}_1 - \bar{x}_2} \quad \text{where} \quad s_{\bar{x}_1 - \bar{x}_2} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Here the value of t is obtained from the t distribution table for the given confidence level and
$n_1 + n_2 - 2$ degrees of freedom.
HYPOTHESIS TESTING ABOUT $\mu_1 - \mu_2$
When the conditions mentioned at the beginning of this section are satisfied, the t
distribution is applied to make a hypothesis test about the difference between two population
means. The test statistic in this case is t, which is calculated as follows.
Test Statistic t for $\bar{x}_1 - \bar{x}_2$: The value of the test statistic t for $\bar{x}_1 - \bar{x}_2$ is computed as

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}$$



EXAMPLES

1) A consumer agency wanted to estimate the difference in the mean amounts of caffeine in
two brands of coffee. The agency took a sample of 15 one-pound jars of Brand I coffee
that showed the mean amount of caffeine in these jars to be 80 milligrams per jar with a
standard deviation of 5 milligrams. Another sample of 12 one-pound jars of Brand II
coffee gave a mean amount of caffeine equal to 77 milligrams per jar with a standard
deviation of 6 milligrams.
a) Construct a 95% confidence interval for the difference between the mean
amounts of caffeine in one-pound jars of these two brands of coffee.
Assume that the two populations are normally distributed and that the
standard deviations of the two populations are equal.
b) At the 1% significance level, can you conclude that the mean amounts of
caffeine in Brand I are different from those in Brand II?
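Solution

a) The pooled standard deviation and the estimated standard deviation of $\bar{x}_1 - \bar{x}_2$ are

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(14)(5)^2 + (11)(6)^2}{25}} = 5.46260011$$

$$s_{\bar{x}_1 - \bar{x}_2} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 5.46260011\sqrt{\frac{1}{15} + \frac{1}{12}} = 2.11565593$$

Area in each tail = 0.025
$df = n_1 + n_2 - 2 = 25$
$t = 2.060$
The 95% confidence interval for $\mu_1 - \mu_2$ is:

$$(\bar{x}_1 - \bar{x}_2) \pm t\,s_{\bar{x}_1 - \bar{x}_2} = 3 \pm 2.060(2.11565593) = 3 \pm 4.36$$

that is, from -1.36 to 7.36 milligrams.

b) Step 1: State the hypotheses
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
Step 2: Select the distribution: t distribution
Step 3: Determine the rejection and nonrejection regions
$\alpha = 0.01$, $\alpha/2 = 0.005$
Area in each tail = 0.005
$df = 25$
$t_c = \pm 2.787$
Step 4: Calculate the test statistic

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}} = \frac{3 - 0}{2.11565593} = 1.418$$

We do not reject the null hypothesis because the test statistic falls in the nonrejection region.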
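
The coffee example can be rechecked with a few lines of Python. This is a sketch rather than a general routine: the variable names are assumptions, and the two t values are entered by hand from a t table (df = 25), because the Python standard library has no t distribution.

```python
import math

# Summary statistics from the coffee example
n1, mean1, s1 = 15, 80.0, 5.0    # Brand I
n2, mean2, s2 = 12, 77.0, 6.0    # Brand II

df = n1 + n2 - 2
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)   # pooled sd
se = sp * math.sqrt(1 / n1 + 1 / n2)                         # sd of xbar1 - xbar2

# 95% confidence interval; t from the table for df = 25, area 0.025 in each tail
t_table_95 = 2.060
margin = t_table_95 * se
ci = (mean1 - mean2 - margin, mean1 - mean2 + margin)

# Two-tailed test at alpha = 0.01; t from the table for df = 25, area 0.005 in each tail
t_obs = ((mean1 - mean2) - 0.0) / se
t_crit = 2.787
reject = abs(t_obs) > t_crit

print(df, round(sp, 4), round(se, 4))
print([round(v, 2) for v in ci], round(t_obs, 3), reject)
```

The interval contains zero and the observed t is smaller than the critical value in absolute terms, so both parts point to the same conclusion.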



Exercise

1. A sample of 14 cans of Brand I diet soda gave the mean number of calories of 23 per can
with a standard deviation of 3 calories. Another sample of 16 cans of Brand II diet soda
gave the mean number of calories of 25 per can with a standard deviation of 4 calories.
a) At the 1% significance level, can you conclude that the mean numbers of calories
per can are different for these two brands of diet soda? Assume that the calories per
can of diet soda are normally distributed for each of the two brands and that the
standard deviations for the two populations are equal.

2. A sample of 40 children from New York State showed that the mean time they spend
watching television is 28.50 hours per week with a standard deviation of 4 hours. Another
sample of 35 children from California showed that the mean time spent by them watching
television is 23.25 hours per week with a standard deviation of 5 hours.
a) Using a 2.5% significance level, can you conclude that the mean time spent
watching television by children in New York State is greater than that for
children in California? Assume that the standard deviations for the two
populations are equal.
3. The following information was obtained from two independent samples selected from
two normally distributed populations with unknown but equal standard deviations.
$n_1 = 21$, $\bar{x}_1 = 13.97$, $s_1 = 3.78$ and $n_2 = 20$, $\bar{x}_2 = 15.55$, $s_2 = 3.26$
a) What is the point estimate of $\mu_1 - \mu_2$?

b) Construct a 95% confidence interval for $\mu_1 - \mu_2$.


4. An insurance company wants to know if the average speed at which men drive cars is
greater than that of women drivers. The company took a random sample of 27 cars driven
by men on a highway and found the mean speed to be 72 miles per hour with a standard
deviation of 2.2 miles per hour. Another sample of 18 cars driven by women on the same
highway gave a mean speed of 68 miles per hour with a standard deviation of 2.5 miles



per hour. Assume that the speeds at which all men and all women drive cars on this
highway are both normally distributed with the same population standard deviation.
a) Construct a 98% confidence interval for the difference between the mean speeds of
cars driven by all men and all women on this highway.
b) Test at the 1% significance level whether the mean speed of cars driven by all men
drivers on this highway is greater than that of cars driven by all women drivers
5. A sample of scores on an examination given in Statistics are:

At the .01 significance level, is the mean grade of the women higher than that of the
men?
6. Ms. Lisa Monnin is the budget director for Nexus Media, Inc. She would like to compare
the daily travel expenses for the sales staff and the audit staff. She collected the following
sample information.

At the .10 significance level, can she conclude that the mean daily expenses are greater
for the sales staff than the audit staff?



INFERENCE ABOUT THE DIFFERENCE BETWEEN TWO MEANS FOR PAIRED
SAMPLES
Sections 5.1, 5.2, and 5.3 were concerned with estimation and hypothesis testing about the
difference between two population means when the two samples were drawn independently from
two different populations.
This section describes estimation and hypothesis-testing procedures for the difference between
two population means when the samples are dependent.
In a case of two dependent samples, two data values—one for each sample—are collected from
the same source (or element) and, hence, these are also called paired or matched samples.
For example, we may want to make inferences about the mean weight loss for members of a
health club after they have gone through an exercise program for a certain period of time. To do
so, suppose we select a sample of 15 members of this health club and record their weights before
and after the program. In this example, both sets of data are collected from the same 15 persons,
once before and once after the program. Thus, although there are two samples, they contain the
same 15 persons. This is an example of paired (or dependent or matched) samples. The
procedures to make confidence intervals and test hypotheses in the case of paired samples are
different from the ones for independent samples discussed in earlier sections of this chapter.
DEFINITION
Paired or Matched Samples Two samples are said to be paired or matched samples when for
each data value collected from one sample there is a corresponding data value collected from the
second sample, and both these data values are collected from the same source.
In paired samples, the difference between the two data values for each element of the two
samples is denoted by d. This value of d is called the paired difference. We then treat all the
values of d as one sample and make inferences applying procedures similar to the ones used for
one-sample cases in Chapters 3 and 4. Note that because each source (or element) gives a pair of
values (one for each of the two data sets), each sample contains the same number of values.
That is, both samples are the same size. Therefore, we denote the (common) sample size by n,
which gives the number of paired difference values denoted by d. The degrees of freedom for
the paired samples are $n - 1$. Let

$\mu_d$ = the mean of the paired differences for the population
$\sigma_d$ = the standard deviation of the paired differences for the population, which is usually not known
$\bar{d}$ = the mean of the paired differences for the sample
$s_d$ = the standard deviation of the paired differences for the sample
$n$ = the number of paired difference values

Mean and Standard Deviation of the Paired Differences for Two Samples: The values of the
mean and standard deviation, $\bar{d}$ and $s_d$, respectively, of paired differences for two samples are
calculated as:

$$\bar{d} = \frac{\sum d}{n} \qquad s_d = \sqrt{\frac{\sum d^2 - \frac{(\sum d)^2}{n}}{n - 1}}$$

In paired samples, instead of using $\bar{x}_1 - \bar{x}_2$ as the sample statistic to make inferences about
$\mu_1 - \mu_2$, we use the sample statistic $\bar{d}$ to make inferences about $\mu_d$. Actually the value of
$\bar{d}$ is always equal to $\bar{x}_1 - \bar{x}_2$, and the value of $\mu_d$ is always equal to $\mu_1 - \mu_2$.
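
The following Python sketch shows one way to compute the paired differences, their mean and their standard deviation with the shortcut formula above. The function name `paired_difference_stats` and the small data set are hypothetical. The last line checks the shortcut formula against the sample standard deviation from Python's `statistics` module.

```python
import math
import statistics

def paired_difference_stats(first, second):
    """Return the paired differences d, their mean d_bar and standard deviation s_d."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same size")
    d = [a - b for a, b in zip(first, second)]   # d = first value minus second value
    n = len(d)
    sum_d = sum(d)
    sum_d2 = sum(v * v for v in d)
    d_bar = sum_d / n
    s_d = math.sqrt((sum_d2 - sum_d**2 / n) / (n - 1))
    return d, d_bar, s_d

# Hypothetical before/after measurements on the same four elements
before = [10, 12, 9, 14]
after = [12, 15, 9, 17]
d, d_bar, s_d = paired_difference_stats(before, after)
print(d, d_bar, round(s_d, 4))
# The shortcut formula agrees with the library's sample standard deviation
print(math.isclose(s_d, statistics.stdev(d)))
```

The mean of the differences also equals the difference of the two sample means, which matches the remark above that $\bar{d} = \bar{x}_1 - \bar{x}_2$.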


Interval Estimation of $\mu_d$
The mean $\bar{d}$ of paired differences for paired samples is the point estimator of $\mu_d$. The following
formula is used to construct a confidence interval for $\mu_d$ when the t distribution is used.
Confidence Interval for $\mu_d$: The $(1 - \alpha)100\%$ confidence interval for $\mu_d$ is

$$\bar{d} \pm t\,s_{\bar{d}} \quad \text{where} \quad s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$

where the value of t is obtained from the t distribution table for the given confidence level and
$n - 1$ degrees of freedom.
Hypothesis Testing About $\mu_d$
A hypothesis about $\mu_d$ is tested by using the sample statistic $\bar{d}$. This section illustrates the case
of the t distribution only. Earlier in this section we learned what conditions should hold true to
use the t distribution to test a hypothesis about $\mu_d$. The following formula is used to calculate
the value of the test statistic t when testing a hypothesis about $\mu_d$.
Test Statistic t for $\bar{d}$: The value of the test statistic t for $\bar{d}$ is computed as follows:

$$t = \frac{\bar{d} - \mu_d}{s_{\bar{d}}}$$
EXAMPLES:
1) A company wanted to know if attending a course on “how to be a successful salesperson”
can increase the average sales of its employees. The company sent six of its salespersons
to attend this course. The following table gives the 1-week sales of these salespersons
before and after they attended this course.
Before 12 18 25 9 14 16
After 18 24 24 14 19 20

a) Compute the 95% confidence interval for $\mu_d$.


b) Using the 1% significance level, can you conclude that the mean weekly sales for all
salespersons increase as a result of attending this course? Assume that the population of
paired differences has a normal distribution.



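Solution
a. Let d = Before - After for each salesperson.

Before    12    18    25     9    14    16
After     18    24    24    14    19    20
d         -6    -6     1    -5    -5    -4     $\sum d = -25$
d^2       36    36     1    25    25    16     $\sum d^2 = 139$

The values of $\bar{d}$ and $s_d$ are calculated as follows:

$$\bar{d} = \frac{\sum d}{n} = \frac{-25}{6} = -4.17$$

$$s_d = \sqrt{\frac{\sum d^2 - \frac{(\sum d)^2}{n}}{n - 1}} = \sqrt{\frac{139 - \frac{(-25)^2}{6}}{5}} = \sqrt{\frac{139 - 104.17}{5}} = 2.64$$

$$s_{\bar{d}} = \frac{s_d}{\sqrt{n}} = \frac{2.64}{\sqrt{6}} = 1.08$$

Area in each tail = 0.025
$df = n - 1 = 5$
$t = 2.5706$
Therefore, the 95% confidence interval for $\mu_d$ is

$$\bar{d} \pm t\,s_{\bar{d}} = -4.17 \pm 2.5706(1.08) = -4.17 \pm 2.77$$

that is, from -6.94 to -1.40.

b. Step 1: State the hypotheses. Because d = Before - After, an increase in sales corresponds to a negative mean difference.
$H_0: \mu_d = 0$
$H_1: \mu_d < 0$
Step 2: Select the distribution: t distribution
Step 3: Determine the rejection and nonrejection regions
$\alpha = 0.01$
Area in the left tail = 0.01
$df = 5$
$t_c = -3.365$

Step 4: Calculate the test statistic

$$t = \frac{\bar{d} - \mu_d}{s_{\bar{d}}} = \frac{-4.17 - 0}{1.08} = -3.87$$

Step 5: Make a decision: because $t = -3.87$ is less than $-3.365$, it falls in the rejection region and we reject the null hypothesis. We conclude that the mean weekly sales increase as a result of attending the course.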
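
Hand calculations for paired samples are easy to get wrong in the sign of a difference, so it can help to recompute the example directly from the raw before and after values. The Python sketch below does this under two stated assumptions: the differences are taken as Before minus After, and the two t values are entered by hand from a t table with 5 degrees of freedom.

```python
import math
import statistics

before = [12, 18, 25, 9, 14, 16]
after = [18, 24, 24, 14, 19, 20]

d = [b - a for b, a in zip(before, after)]   # d = Before - After
n = len(d)
d_bar = statistics.mean(d)                   # mean of the paired differences
s_d = statistics.stdev(d)                    # sample standard deviation (n - 1 divisor)
se = s_d / math.sqrt(n)                      # standard deviation of d_bar

# 95% confidence interval; t from the table for df = 5, area 0.025 in each tail
t_table_95 = 2.5706
ci = (d_bar - t_table_95 * se, d_bar + t_table_95 * se)

# Left-tailed test at alpha = 0.01 (an increase in sales means mu_d < 0)
t_obs = (d_bar - 0.0) / se
t_crit = -3.365                              # table value for df = 5, area 0.01 in the left tail
reject = t_obs < t_crit

print(d, sum(d), sum(v * v for v in d))
print(round(d_bar, 2), round(s_d, 2), round(se, 2))
print([round(v, 2) for v in ci], round(t_obs, 2), reject)
```

Defining the differences as After minus Before would flip the sign of every value and turn the test into a right-tailed one, with the same decision.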

EXERCISE

1. Find the following confidence intervals for $\mu_d$, assuming that the populations of
paired differences are normally distributed.

a) $n = 12$, $\bar{d} = 17.5$, $s_d = 6.3$, confidence level = 99%

b) $n = 27$, $\bar{d} = 55.9$, $s_d = 14.7$, confidence level = 95%

c) $n = 16$, $\bar{d} = 29.3$, $s_d = 8.3$, confidence level = 90%


2) Perform the following tests of hypotheses, assuming that the populations of paired
differences are normally distributed.

a) H o : d  0, H1 : d  0, n  26, d  9.6, sd  3.9,   0.05

b) H o : d  0, H1 : d  0, n  15, d  8.8, sd  4.7,   0.01



c) H o : d  0, H1 : d  0, n  20, d  7.4, sd  2.3,   0.1
3) A company sent seven of its employees to attend a course in building self-confidence.
These employees were evaluated for their self-confidence before and after attending this
course. The following table gives the scores (on a scale of 1 to 15, 1 being the lowest and
15 being the highest score) of these employees before and after they attended the course.

Before 8 5 4 9 6 9 5
After 10 8 5 11 6 7 9

a) Construct a 95% confidence interval for the mean $\mu_d$ of the population paired differences,
where a paired difference is equal to the score of an employee before attending the
course minus the score of the same employee after attending the course.
b) Test at the 1% significance level whether attending this course increases the mean score
of employees. Assume that the population of paired differences has a normal distribution.
4) A researcher wanted to find the effect of a special diet on systolic blood pressure. She
selected a sample of seven adults and put them on this dietary plan for 3 months. The
following table gives the systolic blood pressures (in mm Hg) of these seven adults
before and after the completion of this plan.
Before 210 180 195 220 231 199 224
After 193 186 186 223 220 183 233

Let $\mu_d$ be the mean reduction in the systolic blood pressures due to this special dietary
plan for the population of all adults.
a) Construct a 95% confidence interval for $\mu_d$. Assume that the population of paired
differences is (approximately) normally distributed.
b) Using the 5% significance level, can you conclude that the mean of the paired differences
$\mu_d$ is different from zero? Assume that the population of paired differences is
(approximately) normally distributed.

5) A number of minor automobile accidents occur at various high-risk intersections in Teton
County despite traffic lights. The Traffic Department claims that a modification in the
type of light will reduce these accidents. The county commissioners have agreed to a
proposed experiment. Eight intersections were chosen at random, and the lights at those
intersections were modified. The numbers of minor accidents during a six-month period
before and after the modifications were:

At the .01 significance level is it reasonable to conclude that the modification reduced the
number of traffic accidents?

CHAPTER SIX
ANALYSIS OF VARIANCE
ANOVA
OBJECTIVES
After completing this chapter, you should be able to:
• Define ANOVA
• Use the one-way ANOVA technique to determine whether there is a significant
difference among three or more means
INTRODUCTION
Chapter five described the procedures that are used to test hypotheses about the difference
between two population means using the normal and t distributions. Also described in that
chapter were the hypothesis-testing procedures for the difference between two population
proportions using the normal distribution.
This chapter explains how to test the null hypothesis that the means of more than two
populations are equal. For example, suppose that teachers at a school have devised three different
methods to teach arithmetic. They want to find out if these three methods produce different mean
scores. Let $\mu_1$, $\mu_2$, and $\mu_3$ be the mean scores of all students who will be taught by Methods I,
II, and III, respectively. To test whether or not the three teaching methods produce the same
mean, we test the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ (all the population means are equal) against
the alternative hypothesis $H_1$: not all three population means are equal.
We use the analysis of variance procedure to perform such a test of hypothesis.
Note that the analysis of variance procedure can also be used to compare two population means.
However, the procedures learned in Chapter five are more efficient for performing tests of
hypothesis about the difference between two population means; the analysis of variance
procedure, to be discussed in this chapter, is used to compare three or more population means.
An analysis of variance test is performed using the F distribution. This chapter first describes
the F distribution and then discusses the application of the one-way analysis of variance
procedure to perform tests of hypothesis.

The F Distribution
Like the chi-square distribution, the shape of a particular F distribution curve depends on the
number of degrees of freedom. However, the F distribution has two numbers of degrees of
freedom: degrees of freedom for the numerator and degrees of freedom for the denominator.
These two numbers representing two types of degrees of freedom are the parameters of the F
distribution. Each combination of degrees of freedom for the numerator and for the denominator
gives a different F distribution curve. The units of an F distribution are denoted by F, which
assumes only nonnegative values. Like the normal, t, and chi-square distributions, the F
distribution is a continuous distribution. The shape of an F distribution curve is skewed to the
right, but the skewness decreases as the number of degrees of freedom increases.
Characteristics of the F Distribution
1) The F distribution is continuous and skewed to the right.
2) The F distribution has two numbers of degrees of freedom: df for the numerator and df
for the denominator.
3) The units of an F distribution, denoted by F, are nonnegative.
For an F distribution, degrees of freedom for the numerator and degrees of freedom for the
denominator are usually written as follows: $df = (df_{num}, df_{den})$; for example, $df = (8, 14)$
means 8 degrees of freedom for the numerator and 14 for the denominator.

Figure 6.1 shows three F distribution curves for three sets of degrees of freedom for the
numerator and for the denominator. In the figure, the first number gives the degrees of freedom
associated with the numerator, and the second number gives the degrees of freedom associated
with the denominator. We can observe from this figure that as the degrees of freedom increase,
the peak of the curve moves to the right; that is, the skewness decreases.
Table VII in Appendix C lists the values of F for the F distribution. To read Table VII, we
need to know three quantities: the degrees of freedom for the numerator, the degrees of freedom
for the denominator, and an area in the right tail of an F distribution curve. Note that the F
distribution table (Table VII) is read only for an area in the right tail of the F distribution curve.
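When a printed F table is not at hand, the same right-tail lookup can be approximated numerically. The sketch below is an illustration only and is not part of these notes' method: the function names (`f_right_tail_area`, `f_critical_value`) are made up, and the code relies on the standard relationship between the F distribution and the regularized incomplete beta function, evaluated with a continued fraction, followed by a bisection search for the critical value.

```python
import math

def _guard(value, tiny=1e-300):
    """Keep a denominator away from zero in the continued fraction."""
    return tiny if abs(value) < tiny else value

def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction used by the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 / _guard(1.0 - qab * x / qap)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 / _guard(1.0 + aa * d)
        c = _guard(1.0 + aa / c)
        h *= d * c
        # Odd step of the recurrence
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 / _guard(1.0 + aa * d)
        c = _guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_right_tail_area(f, df_num, df_den):
    """Area to the right of f under the F curve with (df_num, df_den) degrees of freedom."""
    if f <= 0.0:
        return 1.0
    x = df_num * f / (df_num * f + df_den)
    return 1.0 - _reg_inc_beta(df_num / 2.0, df_den / 2.0, x)

def f_critical_value(df_num, df_den, area):
    """Value of F that leaves the given area in the right tail (found by bisection)."""
    low, high = 0.0, 1.0
    while f_right_tail_area(high, df_num, df_den) > area:
        high *= 2.0                      # widen the bracket until it contains the answer
    for _ in range(200):
        mid = (low + high) / 2.0
        if f_right_tail_area(mid, df_num, df_den) > area:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

print(f"F for df = (8, 14), right-tail area 0.05: {f_critical_value(8, 14, 0.05):.2f}")
```

The three inputs mirror the three quantities needed to read the table: the numerator degrees of freedom, the denominator degrees of freedom, and the area in the right tail.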

EXAMPLE
1) Find the F value for 8 degrees of freedom for the numerator, 14 degrees of freedom for
the denominator, and .05 area in the right tail of the F distribution curve.
2) Find the critical value of F for the following.
a) $df = (3, 3)$ and area in the right tail $= 0.05$
b) $df = (3, 10)$ and area in the right tail $= 0.05$
c) $df = (3, 30)$ and area in the right tail $= 0.05$
EXERCISE
1) Find the critical value of F for the following.
a) $df = (2, 6)$ and area in the right tail $= 0.25$
b) $df = (6, 6)$ and area in the right tail $= 0.25$
c) $df = (15, 6)$ and area in the right tail $= 0.25$

ONE-WAY ANALYSIS OF VARIANCE: ANOVA


As mentioned in the beginning of this chapter, the analysis of variance procedure is used to test
the null hypothesis that the means of three or more populations are the same against the
alternative hypothesis that not all population means are the same. The analysis of variance
procedure can also be used to compare two population means. However, the procedures learned in
Chapter five are more efficient for performing tests of hypotheses about the difference between two
population means; the analysis of variance procedure is used to compare three or more
population means.
Reconsider the example of teachers at a school who have devised three different methods to
teach arithmetic. They want to find out if these three methods produce different mean scores. Let
$\mu_1$, $\mu_2$, and $\mu_3$ be the mean scores of all students who are taught by Methods I, II, and III,
respectively. To test if the three teaching methods produce different means, we test the null
hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ (all three population means are equal)
against the alternative hypothesis
$H_1$: Not all three population means are equal.

One method to test such a hypothesis is to test the three hypotheses $H_0: \mu_1 = \mu_2$, $H_0: \mu_1 = \mu_3$,
and $H_0: \mu_2 = \mu_3$ separately using the procedure discussed in Chapter five. Besides being time
consuming, such a procedure has other disadvantages. First, if we reject even one of these three
hypotheses, then we must reject the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$. Second, combining the
Type I error probabilities for the three tests (one for each test) gives a Type I error
probability for the test of $H_0: \mu_1 = \mu_2 = \mu_3$ that is considerably larger than the significance
level of each individual test. Hence, we should prefer a procedure that can test
the equality of three means in one test. The ANOVA, short for analysis of variance, provides
such a procedure. It is used to compare three or more population means in a single test.
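To get a feel for how quickly the combined Type I error grows, the sketch below computes the probability of at least one false rejection across several tests. It assumes the tests are independent, which is a simplification (pairwise tests that share samples are not independent), so the numbers are a rough illustration only. The function name `familywise_error` is made up for this example.

```python
def familywise_error(alpha, num_tests):
    """P(at least one Type I error) across num_tests independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

# Three pairwise tests are needed to compare three means two at a time.
for alpha in (0.01, 0.05, 0.10):
    combined = familywise_error(alpha, num_tests=3)
    print(f"alpha per test = {alpha:.2f} -> combined Type I error = {combined:.4f}")
```

With three tests at the 5% level, for example, the combined probability under this assumption is $1 - 0.95^3 \approx 0.143$, almost three times the level of a single test.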
DEFINITION
ANOVA is a procedure used to test the null hypothesis that the means of three or more
populations are all equal.
This section discusses the one-way ANOVA procedure to make tests by comparing the means
of several populations. By using a one-way ANOVA test, we analyze only one factor or variable.
For instance, in the example of testing for the equality of mean arithmetic scores of students
taught by each of the three different methods, we are considering only one factor, which is the
effect of different teaching methods on the scores of students. Sometimes we may analyze the
effects of two factors. For example, if different teachers teach arithmetic using these three
methods, we can analyze the effects of teachers and teaching methods on the scores of students.
This is done by using a two-way ANOVA. The procedure under discussion in this chapter is
called the analysis of variance because the test is based on the analysis of variation in the data
obtained from different samples. The application of one-way ANOVA requires that the
following assumptions hold true.
Assumptions of One-Way ANOVA
The following assumptions must hold true to use one-way ANOVA.
1. The populations from which the samples are drawn are (approximately) normally distributed.
2. The populations from which the samples are drawn have the same variance (or standard
deviation).
3. The samples drawn from different populations are random and independent.
For instance, in the example about three methods of teaching arithmetic, we first assume
that the scores of all students taught by each method are (approximately) normally distributed.
Second, the means of the distributions of scores for the three teaching methods may or may not
be the same, but all three distributions have the same variance, $\sigma^2$. Third, when we take samples
to make an ANOVA test, these samples are drawn independently and randomly from three
different populations.
The ANOVA test is applied by calculating two estimates of the variance, $\sigma^2$, of population
distributions: the variance between samples and the variance within samples. The variance
between samples is also called the mean square between samples or MSB. The variance within
samples is also called the mean square within samples or MSW.
The variance between samples, MSB, gives an estimate of $\sigma^2$ based on the variation among
the means of samples taken from different populations. For the example of three teaching
methods, MSB will be based on the values of the mean scores of three samples of students taught
by three different methods. If the means of all populations under consideration are equal, the
means of the respective samples will still be different, but the variation among them is expected
to be small, and, consequently, the value of MSB is expected to be small. However, if the means
of populations under consideration are not all equal, the variation among the means of respective
samples is expected to be large, and, consequently, the value of MSB is expected to be large.
The variance within samples, MSW, gives an estimate of $\sigma^2$ based on the variation within
the data of different samples. For the example of three teaching methods, MSW will be based
on the scores of individual students included in the three samples taken from three populations.
The concept of MSW is similar to the concept of the pooled standard deviation, $s_p$, for two
samples discussed in Chapter five.
The ANOVA test is applied by calculating two estimates of the variance $\sigma^2$ of the population
distribution:
• The variance between samples (MSB) is an estimate of the common population
variance $\sigma^2$ that is based on the variation among the sample means.
• The variance within samples (MSW) is an estimate of the common population
variance $\sigma^2$ based on the variation within the data of different samples.
The one-way ANOVA test is always right-tailed with the rejection region in the right tail of
the F distribution curve. The hypothesis-testing procedure using ANOVA involves the same five
steps that were used in earlier chapters. The next subsection explains how to calculate the value
of the test statistic F for an ANOVA test.
Calculating the Value of the Test Statistic
The value of the test statistic F for a test of hypothesis using ANOVA is given by the ratio of
two variances, the variance between samples (MSB) and the variance within samples (MSW).
Test Statistic F for a One-Way ANOVA Test: The value of the test statistic F for an ANOVA
test is calculated as

$$F = \frac{\text{Variance between samples}}{\text{Variance within samples}} = \frac{MSB}{MSW}$$

To calculate the value of the test statistic F, we follow these steps:
1) Calculate $T_i$, $\sum x$, $\sum x^2$, and $n$
2) Determine SSB, SSW, and SST
3) Determine $F = \frac{MSB}{MSW}$
The calculation of MSB and MSW is explained in Example 6–2.
One-Way ANOVA Test
Now suppose we want to test the null hypothesis that the mean scores are equal for all three
groups of fourth-graders taught by three different methods of Example 6–2 against the
alternative hypothesis that the mean scores of the three groups are not all equal. Note that in a
one-way ANOVA test, the null hypothesis is that the means for all populations are equal. The
alternative hypothesis is that not all population means are equal. In other words, the alternative
hypothesis states that at least one of the population means is different from the others. Example
6–2 demonstrates how we use the one-way ANOVA procedure to make such a test.
EXAMPLE
1) Fifteen fourth-grade students were randomly assigned to three groups to experiment with
three different methods of teaching arithmetic. At the end of the semester, the same test was
given to all 15 students. The table gives the scores of students in the three groups.
Method I Method II Method III
48 55 84
73 85 68
51 70 95
65 69 74
87 90 67
a) Calculate the value of the test statistic F. Assume that all the required assumptions hold
true.
b) At the 1% significance level, can we reject the null hypothesis that the mean arithmetic
score of all fourth-grade students taught by each of these three methods is the same?
Assume that all the assumptions required to apply the one-way ANOVA procedure hold
true.
Solution
a) In ANOVA terminology, the three methods used to teach arithmetic are called treatments.
The table contains data on the scores of fourth-graders included in the three samples. Each
sample of students is taught by a different method. Let
x = the score of a student
k = the number of different samples (or treatments)
$n_i$ = the size of sample i
$T_i$ = the sum of the values in sample i
n = the number of values in all samples $= n_1 + n_2 + n_3 + \cdots$
$\sum x$ = the sum of the values in all samples $= T_1 + T_2 + T_3 + \cdots$
$\sum x^2$ = the sum of the squares of the values in all samples

To calculate MSB and MSW, we first compute the between-samples sum of squares, denoted
by SSB, and the within-samples sum of squares, denoted by SSW. The sum of SSB
and SSW is called the total sum of squares and is denoted by SST; that is,
SST = SSB + SSW
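The values of SSB and SSW are calculated using the following formulas.

• The between-samples sum of squares, denoted by SSB, is calculated as:

$$SSB = \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} + \cdots\right) - \frac{(\sum x)^2}{n}$$

• The within-samples sum of squares, denoted by SSW, is calculated as:

$$SSW = \sum x^2 - \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} + \cdots\right)$$

For the data of this example:

$T_1 = 324$, $T_2 = 369$, $T_3 = 388$
$n_1 = 5$, $n_2 = 5$, $n_3 = 5$, $n = 15$
$\sum x = 1081$, $\sum x^2 = 80709$

$$SSB = \left(\frac{324^2}{5} + \frac{369^2}{5} + \frac{388^2}{5}\right) - \frac{1081^2}{15} = 78336.2 - 77904.0667 = 432.1333$$

$$SSW = 80709 - 78336.2 = 2372.8000$$

$$SST = 432.1333 + 2372.8000 = 2804.9333$$

$$MSB = \frac{SSB}{k - 1} = \frac{432.1333}{3 - 1} = 216.0667$$

where $k - 1$ is the df for the numerator, and

$$MSW = \frac{SSW}{n - k} = \frac{2372.8000}{15 - 3} = 197.7333$$

where $n - k$ is the df for the denominator.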

$$F = \frac{MSB}{MSW} = \frac{216.0667}{197.7333} = 1.09$$
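The whole calculation of part a) can be reproduced from the raw scores. The following is a minimal sketch using the sum-of-squares formulas above; the function name `one_way_anova` and the dictionary keys are choices made for this illustration.

```python
def one_way_anova(samples):
    """Compute SSB, SSW, SST, MSB, MSW and F from a list of samples (lists of numbers)."""
    k = len(samples)                                   # number of samples (treatments)
    n = sum(len(s) for s in samples)                   # number of values in all samples
    totals = [sum(s) for s in samples]                 # T_i for each sample
    sum_x = sum(totals)                                # sum of all values
    sum_x_sq = sum(x * x for s in samples for x in s)  # sum of squares of all values
    between_term = sum(t * t / len(s) for t, s in zip(totals, samples))

    ssb = between_term - sum_x ** 2 / n                # between-samples sum of squares
    ssw = sum_x_sq - between_term                      # within-samples sum of squares
    msb = ssb / (k - 1)                                # df for the numerator: k - 1
    msw = ssw / (n - k)                                # df for the denominator: n - k
    return {"SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
            "MSB": msb, "MSW": msw, "F": msb / msw}

method_1 = [48, 73, 51, 65, 87]
method_2 = [55, 85, 70, 69, 90]
method_3 = [84, 68, 95, 74, 67]

result = one_way_anova([method_1, method_2, method_3])
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```

The function divides each $T_i^2$ by its own sample size, so it also works when the samples are of unequal size.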
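b) To make a test about the equality of the means of three populations, we follow our standard
procedure with five steps.

Step 1: State the null and alternative hypotheses.
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: Not all three means are equal
Step 2: Select the distribution to use: the F distribution.
Step 3: Determine the rejection and nonrejection regions.
$\alpha = 0.01$, $df_{num} = k - 1 = 2$, $df_{den} = n - k = 12$, so the critical value is $F = 6.93$.
Step 4: Calculate the value of the test statistic: $F = 1.09$ (from part a).
Step 5: Make a decision. Because $1.09 < 6.93$, the test statistic falls in the nonrejection
region, and we do not reject the null hypothesis.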
EXERCISE:
1) From time to time, unknown to its employees, the research department at Post Bank observes
various employees for their work productivity. Recently this department wanted to check
whether the four tellers at a branch of this bank serve, on average, the same number of customers
per hour. The research manager observed each of the four tellers for a certain number of hours.
The following table gives the number of customers served by the four tellers during each of the
observed hours.
Teller A Teller B Teller C Teller D
19 14 11 24
21 16 14 19
26 14 21 21
24 13 13 26
18 17 16 20
13 18

a) At the 5% significance level, test the null hypothesis that the mean number of
customers served per hour by each of these four tellers is the same. Assume that all the
assumptions required to apply the one-way ANOVA procedure hold true.

2) A researcher wishes to try three different techniques to lower the blood pressure of
individuals diagnosed with high blood pressure. The subjects are randomly assigned to three
groups; the first group takes medication, the second group exercises, and the third group
follows a special diet. After four weeks, the reduction in each person’s blood pressure is
recorded. At a significance level of 0.05, test the claim that there is no difference among the
means. The data are shown.

3) Consider the following data obtained for two samples selected at random from two
populations that are independent and normally distributed with equal variances.
Sample I 32 26 31 20 27 34

Sample II 27 35 33 40 38 31

a) Calculate the means and standard deviations for these samples using the formulas from
Chapter 3.
b) Using the one-way ANOVA procedure, test at the 1% significance level whether the
means of the populations from which these samples are drawn are equal.
4) The following ANOVA table, based on information obtained for three samples selected
from three independent populations that are normally distributed with equal variances, has a
few missing values.
Source of Degrees of freedom Sum of squares Mean squares Value of test statistic
variation
Between 2 19.2813 F=
Within 89.3677
Total 12
a) Find the missing values and complete the ANOVA table.
b) Using $\alpha = 0.01$, what is your conclusion for the test with the null hypothesis that the
means of the three populations are all equal against the alternative hypothesis that the
means of the three populations are not all equal?
5) The following ANOVA table, based on information obtained for four samples selected from
four independent populations that are normally distributed with equal variances, has a few
missing values.
Source of Degrees of freedom Sum of squares Mean squares Value of test statistic
variation
Between F =4.07
Within 15 9.2154
Total 18
a) Find the missing values and complete the ANOVA table
b) Using $\alpha = 0.05$, what is your conclusion for the test with the null hypothesis that the
means of the four populations are all equal against the alternative hypothesis that the
means of the four populations are not all equal?

