Chapter 2
2.1. Introduction
Inference, specifically decision making and prediction, is centuries old and plays a
very important role in our lives. Each of us faces daily personal decisions and
situations that require predictions concerning the future. The inferences that
individuals make should be based on relevant facts, which we call observations, or
data.
Methods for making inferences about parameters fall into one of two categories.
Either we will estimate (predict) the value of the population parameter of interest
or we will test a hypothesis about the value of the parameter. These two methods
of statistical inference, estimation and hypothesis testing, involve different
procedures and, more importantly, they answer two different questions about the
parameter. In estimating a population parameter, we are answering the question,
“What is the value of the population parameter?” In testing a hypothesis, we are
answering the question, “Is the parameter value equal to this specific value?”
Inference is the process of making interpretations or conclusions from sample
data for the totality of the population. Inferential statistics uses the sample
results to make decisions and draw conclusions about the population from which
the sample is drawn. In statistics there are two ways through which inference can
be made.
Getu D.
The two common forms of statistical inference are:
1. Estimation
2. Null hypothesis tests of significance (NHTS)
There are two forms of estimation:
Point estimation (the single most likely value for the parameter)
Interval estimation (also called a confidence interval for the parameter)
Both estimation and NHTS are used to infer parameters. A parameter is a
constant that describes a feature of a phenomenon, a population, a probability
mass function (pmf), or a probability density function (pdf).
2.2 Statistical Estimation:
This is one way of making inference about the population parameter where the
investigator does not have any prior notion about values or characteristics of the
population parameter. There are two types of estimation:
i. Point Estimation: The goal of point estimation is to make a reasonable guess
of the unknown value of a designated population quantity, e.g., the population
mean. The quality of an individual estimate depends on the individual sample
from which it was computed and is therefore affected by chance variation.
A point estimate is a single value computed from sample information that is used
to estimate a parameter. The best point estimate of the population mean μ is
the sample mean X̄.
ii. Interval estimation: This is the procedure that produces an interval of values as
an estimate for a parameter, that is, an interval that contains the likely values of
the parameter. It deals with identifying the upper and lower limits of a
parameter.
For example, X̄ is an estimator of the population mean and X̄ = 10 is an estimate,
which is one of the possible values of X̄.
Properties of best estimator
Three Properties of a Good Estimator
It should be unbiased.
It should be consistent.
It should be relatively efficient.
The estimator should be an unbiased estimator. That is, the expected value or
the mean of the estimates obtained from samples of a given size is equal to
the parameter being estimated. It’s desirable that the sampling distribution be
centered on the true population parameter. An estimator with this property is
called unbiased.
For example, E(X̄) = μ, so the sample mean X̄ is an unbiased estimator (UE) of μ.
Now we want to compute the expected value of the sample variance,
$S^2 = \sum (X_i - \bar{X})^2 / (n-1)$.
Let's multiply both sides of the equation by n − 1, just so we don't have
to keep carrying that around, and square out the right side, just like we did
with the shortcut formula for SSX above.
Each of the resulting terms is an expected value of something squared: a second
moment. Let's use the trick about moments that we saw above. First, let Y be the
random variable defined by the sample mean. We're trying to figure out the
expected value of its square.
We can substitute this for the second term on the RHS of equation 1. Also,
note that the first term on the RHS of equation 1 is the second moment of X, so
that can also be rewritten. Doing both substitutions gives us
$E[(n-1)S^2] = (n-1)\sigma^2$, so $E(S^2) = \sigma^2$: the sample variance is an
unbiased estimator of the population variance.
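The argument above can be written out compactly as follows. This is a sketch of the standard derivation, assuming $X_1, \dots, X_n$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$; the symbol $S^2$ for the sample variance is notation introduced here for illustration.

```latex
\begin{align*}
(n-1)S^2 &= \sum_{i=1}^{n}(X_i-\bar{X})^2 = \sum_{i=1}^{n}X_i^2 - n\bar{X}^2 \\
E\left[X_i^2\right] &= \operatorname{Var}(X_i) + \left(E[X_i]\right)^2 = \sigma^2 + \mu^2 \\
E\left[\bar{X}^2\right] &= \operatorname{Var}(\bar{X}) + \left(E[\bar{X}]\right)^2 = \frac{\sigma^2}{n} + \mu^2 \\
E\left[(n-1)S^2\right] &= n\left(\sigma^2+\mu^2\right) - n\left(\frac{\sigma^2}{n}+\mu^2\right) = (n-1)\sigma^2 \\
E\left[S^2\right] &= \sigma^2
\end{align*}
```

Dividing by n − 1 rather than n is what makes the expectation come out to exactly $\sigma^2$.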
The estimator should be relatively efficient. That is, it should
have a small standard error in comparison with other estimators we might
have chosen.
2.2.1 Sampling Distribution of the sample mean
Because a statistic such as x̄ varies from sample to sample, it is a random
variable. As such, a statistic has a probability distribution associated with it. In
order to make probability statements regarding a sample statistic, we need to
know the probability distribution of the sample statistic. That is to say, we need to
know the shape, center and spread of the sample statistic’s distribution.
The sampling distribution of a statistic is the probability distribution of all
possible values of the statistic computed from samples of size n.
Example: Suppose we have a population of size N = 5, consisting of the ages of five
children: 1, 3, 5, 7 and 9.
Population mean:
$$\mu = \frac{\sum X_i}{N} = \frac{1+3+5+7+9}{5} = \frac{25}{5} = 5$$
Population variance:
$$\sigma^2 = \frac{\sum (X_i-\mu)^2}{N} = \frac{(1-5)^2+(3-5)^2+(5-5)^2+(7-5)^2+(9-5)^2}{5} = \frac{40}{5} = 8$$
There are $\binom{N}{n} = \binom{5}{2} = 10$ possible samples of size n = 2, as shown below.
Sample No   Sample   Mean (x̄)
1 1, 3 2
2 1, 5 3
3 1, 7 4
4 1, 9 5
5 3, 5 4
6 3, 7 5
7 3, 9 6
8 5, 7 6
9 5, 9 7
10 7, 9 8
For instance, $\bar{x}_1 = \frac{1+3}{2} = 2$, $\bar{x}_2 = \frac{1+5}{2} = 3$, etc.
Sampling is random, so each sample has the same probability
$1/\binom{N}{n} = 1/10$ of being selected. The sampling distribution of x̄ is therefore:
x̄       f    Probability
2       1    1/10
3       1    1/10
4       2    2/10
5       2    2/10
6       2    2/10
7       1    1/10
8       1    1/10
Total   10   1.0
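The table above can be reproduced by enumerating every sample. The following is a minimal sketch using only the Python standard library; the variable names are illustrative. It also computes the mean and variance of the sampling distribution exactly.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [1, 3, 5, 7, 9]   # ages of the five children
n = 2                          # sample size

# Every sample of size n drawn without replacement, and its mean.
samples = list(combinations(population, n))
means = [Fraction(sum(s), n) for s in samples]

# Sampling distribution: each sample has probability 1 / C(N, n).
freq = Counter(means)
dist = {m: Fraction(f, len(samples)) for m, f in sorted(freq.items())}

mean_of_means = sum(m * p for m, p in dist.items())
var_of_means = sum((m - mean_of_means) ** 2 * p for m, p in dist.items())

for m, p in dist.items():
    print(f"x_bar = {m}: f = {freq[m]}, P = {p}")
print("E(x_bar) =", mean_of_means)
print("Var(x_bar) =", var_of_means)
```

The mean of the sample means equals μ = 5. The variance comes out as 3 rather than σ²/n = 4 because these samples are drawn without replacement from a very small population; the value 3 equals (σ²/n)(N − n)/(N − 1), the finite population correction.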
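The sampling distribution of x̄ has the following properties:
1. The mean of x̄ is equal to the population mean, i.e. $\mu_{\bar{x}} = \mu$. In other
words, the sample mean is an unbiased estimator of the population mean: $E(\bar{x}) = \mu$.
2. The variance of x̄ is $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$ when the observations are
independent (sampling with replacement, or from a population that is large relative to n).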
$$\Rightarrow \bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \Rightarrow Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$$
2.2.2 Point and Interval estimation of the population mean
i. Point estimation of the population mean
A point estimate is the numeric value of a sample statistic that is used to
estimate the value of a population parameter. The best point estimator of the
population mean μ is the sample mean X̄.
The probability that an interval estimate will contain the parameter is called the
confidence level, and intervals constructed in this way are called confidence
intervals. There are different cases to be considered when constructing
confidence intervals.
Suppose that a sample of size n is selected from a population that has mean μ
and standard deviation σ. Let X1, X2, ..., Xn be the n observations, which are
independent and identically distributed (i.i.d.). Define the sample mean and
the total of these n observations as follows:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad T = \sum_{i=1}^{n} X_i$$
The central limit theorem states that the sample mean follows approximately
the normal distribution with mean μ and standard deviation σ/√n, where μ and σ are
the mean and standard deviation of the population from which the sample was
selected. The sample size n has to be large (usually n ≥ 30) if the population from
which the sample is taken is non-normal.
If the population follows the normal distribution, then the sample size n can be
either small or large.
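The central limit theorem can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original notes: it draws repeated samples of size n = 30 from an exponential population with μ = 1 and σ = 1 (a clearly non-normal choice made only for this example) and compares the spread of the sample means with σ/√n.

```python
import random
import statistics

random.seed(0)    # fixed seed so the sketch is reproducible

n = 30            # sample size
reps = 20_000     # number of repeated samples

# Exponential(1) population: non-normal, with mu = 1 and sigma = 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = 1.0 / n ** 0.5   # sigma / sqrt(n)

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):     ", round(theoretical_sd, 3))
```

The simulated mean should be close to μ and the simulated standard deviation close to σ/√n, which is the standard error used in the interval formulas of this section.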
⇒ $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ has a normal distribution with mean 0 and standard deviation 1.
$$\Rightarrow \mu = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} = \bar{X} \pm E \Rightarrow E = Z\frac{\sigma}{\sqrt{n}}$$
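For the interval estimator to be a good estimator, the error E should be small. How
can E be made small?
o If σ is small
o By increasing the sample size (n)
o By decreasing Z
In practice σ is fixed by the population and n by the resources available, so Z is
the quantity we choose; decreasing Z narrows the interval but also lowers the
confidence level. To fix Z we tie the standard normal distribution to probability:
for a chosen α, $Z_{\alpha/2}$ is the value for which $P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1-\alpha$.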
[Figure: standard normal curve centered at 0, with area α/2 in each tail beyond ±Zα/2]
The Z values corresponding to the most commonly used confidence levels are given
below.
(1−α)100%    α       α/2      Zα/2
90           0.10    0.05     1.645
95           0.05    0.025    1.96
99           0.01    0.005    2.58
For example, for a 95% confidence interval, Zα/2 = 1.96.
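The tabulated values can be recovered from the inverse of the standard normal distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `z_critical` is illustrative.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    # The area to the right of z_{alpha/2} is alpha/2, so the area to its left is 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

The 99% value computes to about 2.576, which the table rounds to 2.58.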
Statistical interpretation of a confidence interval: Suppose we repeated a
sampling experiment 100 times; that is, we collected 100 different sets of data,
each set consisting of 40 observations. Suppose that we computed a 90% confidence
interval based on each of the 100 data sets. On average, we would expect 90 of
the confidence intervals to include the true mean µ; we would expect that 10
would not. The figure 90 comes from the fact that we chose a 90% confidence
interval.
More generally, we can choose whatever confidence level we want. The
convention is to specify the confidence level as 1−α, where α is typically 0.1, 0.05
or 0.01. These three α values correspond to confidence levels 90%, 95% and 99%.
(α is the Greek letter alpha.)
Definition: For any α between 0 and 1, we define zα to be the point on the z-axis
such that the area to the right of zα under the standard normal curve is α; i.e.
P(Z > zα) = α.
Figure 1: The area to the right of zα is α. For example, z.05 is 1.645. The area
outside ±zα/2 is α/2 + α/2 = α. For example, z.025 = 1.96, so the area to the right of
1.96 is 0.025, the area to the left of −1.96 is also 0.025, and the area
outside ±1.96 is 0.05.
Why zα/2, rather than zα? We want to make sure that the total area outside the
interval is α. This means that α/2 should be to the left of the interval and α/2
should be to the right. In the special case of a 90% confidence interval,
α = 0.1, so α/2 = 0.05, and z.05 is indeed 1.645.
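The expression
$$z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
is called the half width of the confidence interval or the margin of error. The half
width is a measure of precision; the tighter the interval, the more precise our
estimate. Not surprisingly, the half width shrinks as the sample size n grows, and
grows with σ and with the confidence level.
Case 2: When n is small and the population variance σ² is not known
When σ is known and the variable is normally distributed, or when σ is unknown
and n is large (n ≥ 30), the standard normal distribution is used to find
confidence intervals for the mean. When σ is unknown and n is small, the
t-distribution is used instead, provided the variable is approximately normally
distributed.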
Characteristics of the t-distribution
o The t-distribution is bell-shaped
o The t-distribution is symmetrical about the mean
o The mean, median, and mode are equal to 0 and are located at the center
of the distribution
o The curve never touches the x-axis
The t-distribution differs from the standard normal distribution in the following
ways:
o The variance is greater than 1
o The t- distribution is actually a family of curves based on the concept of
degrees of freedom, which is related to sample size.
o As the sample size increases, the t-distribution approaches the standard
normal distribution.
Many statistical distributions use the concept of degrees of freedom, and the
formula for finding the degrees of freedom varies between statistical tests.
The degrees of freedom are the number of values that are free to vary after a
sample statistic has been computed.
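If the sample size is small and the population variance σ² is not known, then
$$t = \frac{\bar{X}-\mu}{S/\sqrt{n}}$$
has a t-distribution with n − 1 degrees of freedom.
[Figure: t-distribution curve centered at 0, with area α/2 in each tail beyond ±tα/2]
⇒ A (1−α)100% confidence interval for µ is given by
$$\left(\bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}\right)$$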
For any sample size n and any confidence level 1 − α, we have $t_{n-1,\alpha/2} > z_{\alpha/2}$.
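Examples:
1) A registrar takes a random sample of n = 50 graduating students and finds a
mean age of X̄ = 23.2 years. The population standard deviation is σ = 2 years.
Find the 95% confidence interval for the average age of graduating students.
2) A random sample of size 36 selected from a normal population has a mean of 32.
The population standard deviation (σ) is 4.2. Find
a) the 95% confidence interval for the population mean
b) the 99% confidence interval for the population mean
c) a comparison of the two intervals
3) The mean operating life time for a random sample of n = 10 light bulbs is
X̄ = 4,000 hr, with sample standard deviation S = 200 hr. The operating life of
bulbs in general is assumed to be approximately normally distributed. Find the
95% confidence interval for the true mean operating life time.
Solutions:
1. Given: σ = 2 years, X̄ = 23.2 years, n = 50 (Case 1)
⇒ A (1−α)100% confidence interval for μ is
$$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
1 − α = 0.95 ⇒ α = 0.05, α/2 = 0.025 ⇒ Zα/2 = 1.96
$$\Rightarrow 23.2 \pm 1.96\left(\frac{2}{\sqrt{50}}\right) = 23.2 \pm 0.55$$
⇒ 22.65 < μ < 23.75
Interpretation: The registrar is 95% confident that the average age of graduating
students is between 22.65 and 23.75 years.
2. Given: σ = 4.2, X̄ = 32, n = 36 (Case 1)
a) 95% ⇒ 1 − α = 0.95 ⇒ α = 0.05, α/2 = 0.025 ⇒ Zα/2 = 1.96
$$\Rightarrow \bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
$$\Rightarrow 32 - 1.96\left(\frac{4.2}{\sqrt{36}}\right) < \mu < 32 + 1.96\left(\frac{4.2}{\sqrt{36}}\right)$$
⇒ 32 ± 1.372
⇒ 30.628 < μ < 33.372
⇒ The 95% confidence interval is (30.63, 33.37)
Interpretation: We are 95% confident that the population mean is between 30.63 and 33.37.
b) 99% ⇒ 1 − α = 0.99 ⇒ α = 0.01, α/2 = 0.005 ⇒ Zα/2 = 2.58
$$\Rightarrow 32 - 2.58\left(\frac{4.2}{\sqrt{36}}\right) < \mu < 32 + 2.58\left(\frac{4.2}{\sqrt{36}}\right)$$
⇒ 32 ± 1.806
⇒ 30.194 < μ < 33.806
⇒ The 99% confidence interval is (30.19, 33.81)
Interpretation: We are 99% confident that the population mean is between 30.19 and 33.81.
c) The 99% confidence interval is wider than the 95% confidence interval
⇒ As the confidence level increases, the interval becomes wider.
3. Given: n = 10, X̄ = 4,000 hrs and S = 200 hrs
n is small and σ is unknown (Case 2) ⇒ Use the t-distribution
⇒ A (1−α)100% confidence interval for μ is
$$\left(\bar{X} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}\right)$$
95% ⇒ α/2 = 0.025 ⇒ $t_{\alpha/2,\,n-1} = t_{0.025,\,9} = 2.262$
$$\Rightarrow 4000 - 2.262\left(\frac{200}{\sqrt{10}}\right) < \mu < 4000 + 2.262\left(\frac{200}{\sqrt{10}}\right)$$
⇒ 4000 ± 143.1
⇒ 3856.9 < μ < 4143.1
⇒ (3856.9, 4143.1)
Interpretation: We are 95% confident that the true mean operating life time is
between 3856.9 and 4143.1 hours.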
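The interval computations in these solutions all follow the pattern "sample mean ± critical value × standard error". The sketch below recomputes them; the function name `mean_interval` is illustrative, and the critical values are passed in by hand (1.96 and 2.58 from the Z table, 2.262 from the t table with 9 degrees of freedom) rather than computed.

```python
from math import sqrt

def mean_interval(xbar, sd, n, critical):
    """Return (lower, upper) = xbar -/+ critical * sd / sqrt(n)."""
    half_width = critical * sd / sqrt(n)
    return xbar - half_width, xbar + half_width

# Case 1 (sigma known): registrar data, n = 50, mean 23.2, sigma = 2
ci_age = mean_interval(23.2, 2, 50, 1.96)

# Case 1 (sigma known): n = 36, mean 32, sigma = 4.2
ci_95 = mean_interval(32, 4.2, 36, 1.96)    # z_{0.025}
ci_99 = mean_interval(32, 4.2, 36, 2.58)    # z_{0.005}

# Case 2 (sigma unknown, small n): n = 10, mean 4000, S = 200
ci_t = mean_interval(4000, 200, 10, 2.262)  # t_{0.025, 9} from the t table

for name, (lower, upper) in [
    ("ages, 95%", ci_age),
    ("n = 36, 95%", ci_95),
    ("n = 36, 99%", ci_99),
    ("bulbs, 95% (t)", ci_t),
]:
    print(f"{name}: ({lower:.3f}, {upper:.3f})")
```

With σ = 4.2 and n = 36 the standard error is 0.7, so the 95% and 99% half widths are 1.372 and 1.806 respectively.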
Exercises:
1. A sociologist found that in a sample of 49 retired men, the average number of
jobs they had during their life-time was 7.2. From previous studies it was
found that the population standard deviation of the number of jobs is 2.1.
a) Find the 90% confidence interval of the mean for the number of jobs a
man had during his life time
b) Find the 95% confidence interval of the mean for the number of jobs a
man had during his life time
c) Compare the intervals in (a) and (b)
2. An electrical firm manufactures light bulbs that have a length of life that is
approximately normally distributed with a standard deviation of 40 hours. If a
random sample of 30 bulbs has an average life of 780 hours, find a 99%
confidence interval for the population mean of all bulbs produced by this firm.
3. A random sample of 400 households was drawn from a town and a survey
generated data on weekly earning. The mean in the sample was Birr 250 with
a standard deviation Birr 80. Construct a 95% confidence interval for the
population mean earning.
4. A sample of 15 private-duty nurses showed an average weekly wage of birr
480.75 with standard deviation of birr 56. Find the 99% confidence interval for
the true mean.
5. A major truck stop has kept extensive records on various transactions with its
customers. If a random sample of 16 of these records shows average sales of
290 liters of diesel fuel with a standard deviation of 12 liters, construct a 95%
confidence interval for the mean of the population sampled.
The probability distribution of the sample proportion p̂ is called its sampling
distribution. It lists the various values that p̂ can assume and their probabilities.
For samples of size 3 drawn from a population of five individuals (A, J, S, L and T),
the number of possible samples is $\binom{5}{3} = 10$.
The following table shows all possible values of p̂ (rounded to two decimal places)
for each sample.
Sample No Sample Proportion who know HIV/AIDS
1 A, J, S 1/3=0.33
2 A, J, L 2/3=0.67
3 A, J, T 2/3=0.67
4 A, S, L 2/3=0.67
5 A, S, T 2/3=0.67
6 A, L, T 3/3=1.00
7 J, S, L 1/3=0.33
8 J, S, T 1/3=0.33
9 J, L, T 2/3=0.67
10 S, L, T 2/3=0.67
The frequency and sampling distribution of p̂ can be prepared from the above
table and are summarized as follows.
^p f probability, P( ^p )
0.33 3 3/10=0.3
0.67 6 6/10=0.6
1.00 1 1/10=0.1
total 10 1.0
$$E(\hat{p}) = \sum \hat{p}\,P(\hat{p}) = 0.33 \times 0.3 + 0.67 \times 0.6 + 1 \times 0.1 = 0.601$$
⇒ E(p̂) ≈ 0.60 = P, which is the population proportion. (The small excess over 0.60
comes from rounding 1/3 and 2/3 to two decimal places.)
If P represents the population proportion, then the sample proportion
$\hat{P} = X/n$ provides a good estimate of P. Therefore, the sample proportion P̂
is the point estimator of the population proportion.
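The sampling distribution of the sample proportion can be reproduced by enumeration with exact fractions, which avoids the rounding in 0.33 and 0.67. This is a sketch under one labeled assumption: the sample table is consistent with A, L and T knowing about HIV/AIDS and J and S not knowing, so that is what the code encodes.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Assumption inferred from the sample table: 1 = knows about HIV/AIDS, 0 = does not.
knows = {"A": 1, "J": 0, "S": 0, "L": 1, "T": 1}
n = 3   # sample size

# All samples of size 3 and the sample proportion in each.
samples = list(combinations(knows, n))
p_hats = [Fraction(sum(knows[person] for person in s), n) for s in samples]

# Sampling distribution of p_hat and its expected value.
freq = Counter(p_hats)
dist = {p: Fraction(f, len(samples)) for p, f in sorted(freq.items())}
expected = sum(p * prob for p, prob in dist.items())
population_p = Fraction(sum(knows.values()), len(knows))

for p, prob in dist.items():
    print(f"p_hat = {p}: f = {freq[p]}, P = {prob}")
print("E(p_hat) =", expected, " population P =", population_p)
```

With exact arithmetic E(p̂) equals P = 3/5, which illustrates that the sample proportion is an unbiased estimator of the population proportion.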
Interval estimation of population proportions (P)
In a binomial experiment each trial results in one of two outcomes, which we
label as either a success or a failure. We designate P as the probability of a
success and 1−P as the probability of a failure. When n is large, the probability
distribution of X, the number of successes in n sample trials, can be approximated
by a normal distribution.
In a similar way, the distribution of the sample proportion $\hat{P} = X/n$ can be
approximated by a normal distribution with mean P and standard deviation
$\sqrt{P(1-P)/n}$.
A general (1−α)100% confidence interval for the proportion of successes is
given by
$$\left(\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\right), \quad \hat{q} = 1-\hat{p}$$
Examples
a. In a random sample of n = 230 voters, 54 voted for candidate A. Find the
90% confidence interval for the proportion of individuals who voted for
candidate A.
b. In a sample of 100 teenage girls, 30% used hair coloring. Find the 95%
confidence interval of the true proportion of teenage girls who use hair
coloring.
Solutions:
a) Let x be the number of individuals who voted for candidate A.
⇒ p̂ = x/n = 54/230 = 0.235 ⇒ q̂ = 1 − p̂ = 1 − 0.235 = 0.765; 90% ⇒ Zα/2 = 1.645
$$\Rightarrow \left(0.235 - 1.645\sqrt{\frac{0.235 \times 0.765}{230}},\ 0.235 + 1.645\sqrt{\frac{0.235 \times 0.765}{230}}\right)$$
⇒ (0.235 − 0.046, 0.235 + 0.046)
⇒ (0.189, 0.281) ⇒ 0.189 < p < 0.281
⇒ 18.9% < p < 28.1%
We can be 90% confident that the true population proportion is between 18.9% and 28.1%.
b) Given p̂ = 0.3 ⇒ q̂ = 0.7; 95% ⇒ Zα/2 = 1.96
$$\Rightarrow \left(0.3 - 1.96\sqrt{\frac{0.3 \times 0.7}{100}},\ 0.3 + 1.96\sqrt{\frac{0.3 \times 0.7}{100}}\right)$$
⇒ (0.3 − 0.0898, 0.3 + 0.0898)
⇒ (0.2102, 0.3898) ⇒ 0.2102 < p < 0.3898
⇒ 21.02% < p < 38.98%
We can be 95% confident that the true population proportion is between 21.02% and 38.98%.
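Both proportion intervals can be checked with a few lines of code. A minimal sketch; the function name `proportion_interval` is illustrative, and the Z values are taken from the table rather than computed.

```python
from math import sqrt

def proportion_interval(p_hat, n, z):
    """Return the large-sample interval p_hat -/+ z * sqrt(p_hat * q_hat / n)."""
    q_hat = 1 - p_hat
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

voters = proportion_interval(54 / 230, 230, 1.645)   # example (a), 90%
girls = proportion_interval(0.30, 100, 1.96)         # example (b), 95%

print(f"(a) ({voters[0]:.3f}, {voters[1]:.3f})")
print(f"(b) ({girls[0]:.4f}, {girls[1]:.4f})")
```

Using the unrounded p̂ = 54/230 in (a) gives the same interval to three decimal places as the hand calculation with 0.235.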
How do you interpret a confidence interval?
Suppose you calculate a 95% confidence interval for some unknown
parameter µ (the true average price all students spent on books).
IT IS INCORRECT TO SAY:
“There is a 95% probability that µ (the average price all UNL students spent
on books) is within this interval”
Why is it Incorrect?
The confidence interval you compute is NOT a random interval and µ is a
constant (unfortunately unknown to us), thus there is no randomness. In fact, µ
either falls in that interval or it does not.
What is the Correct Interpretation?
“We are 95% confident that if µ (the average price all UNL students spent
on books) were known, this interval would cover/contain it.”
Note: The probability refers to the interval containing µ, not to µ being in the
interval.
Why is this?
A 95% confidence interval is not so much a statement about any particular
interval, such as (79.3, 80.7), but pertains to what would happen if a very large
number of like intervals were to be constructed. That is, from a practical point of
view, the 95% gives the fraction of the time, in repeated sampling, that the
intervals constructed will contain the target parameter µ.
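The repeated-sampling interpretation can be illustrated by simulation. The sketch below is built on hypothetical values that do not come from the notes (a normal population with µ = 80 and σ = 5, samples of size 40): it constructs many 95% intervals and counts how often they contain µ.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)                   # fixed seed so the sketch is reproducible

mu, sigma, n = 80.0, 5.0, 40     # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)  # z_{0.025} for a 95% interval
reps = 10_000                    # number of repeated samples

covered = 0
for _ in range(reps):
    xbar = fmean(random.gauss(mu, sigma) for _ in range(n))
    half_width = z * sigma / sqrt(n)
    # mu is fixed; the interval endpoints are what vary from sample to sample.
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

coverage = covered / reps
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The fraction should be close to 0.95. Any single interval either contains µ or does not; the 95% describes the long-run behaviour of the procedure.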
Exercise:
1. A survey of 1000 people who watched the Democratic/Republican debate
resulted in 600 who thought that the Democrats won the debate. Construct a 95%
confidence interval for the proportion of people who thought the
Democrats won the debate.
2. A survey of 120 female freshmen shows that 18% did not wish to work after
marriage. Find the 95% confidence interval of the true proportion of females
who do not wish to work after marriage.
2.3 Hypothesis testing
There are two types of statistical hypotheses for each situation: the null
hypothesis and the alternative hypothesis.
Types and size of errors: There are two types of error in hypothesis testing.
Type I error: Rejecting the null hypothesis when it is true. The probability of a
type I error is denoted by α and is called the level of significance. That is,
P(Type I error) = α.
Type II error: Failing to reject the null hypothesis when it is false (accepting the
null hypothesis when it is false). The probability of a type II error is denoted by β.
That is, P(Type II error) = β.
For a fixed sample size, type I and type II errors have an inverse relationship and
therefore cannot be minimized at the same time. In practice we set α at some value
and design a test that minimizes β. This is because a type I error is often
considered to be more serious, and therefore more important to avoid, than a
type II error.
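The following table gives a summary of possible results of any hypothesis test:
Decision             H0 is true          H0 is false
Reject H0            Type I error        Correct decision
Do not reject H0     Correct decision    Type II error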
State the null hypothesis and the alternate hypothesis.
Null Hypothesis: a statement about the value of a population parameter.
Alternate Hypothesis: a statement that is accepted if the sample evidence shows the
null hypothesis to be false.
Decide on the significance level:
In practice, the level of significance (α) is chosen by the investigator. Three
levels are commonly used: 0.01, 0.05 and 0.10 (corresponding to confidence levels
of 99%, 95% and 90%). The smaller the level of significance, the stronger the
evidence required to reject the null hypothesis. The level of significance
determines the values of the test statistic that would cause us to reject the
hypothesis. The corresponding test statistic values for the level of significance
are called the critical values. The critical value is the value that divides the
non-rejection region from the rejection region. A level of significance has
different critical values for one-tailed and two-tailed tests. A level of
significance of 0.05 has critical values of ±1.96 if the test is two-tailed.
However, if the test is one-tailed, the critical value is 1.645 in the relevant
tail (−1.645 for a left-tailed test). Note that critical values for a given level of
significance differ depending on the test statistic intended to be used.
The critical value separates the critical region from the noncritical region. The
symbol for critical value is C.V.
The critical or rejection region is the range of values of the test value that
indicates that there is a significant difference and that the null hypothesis
should be rejected.
The non-critical or non-rejection region is the range of values of the test
value that indicates that the difference was probably due to chance and
that the null hypothesis should not be rejected
[Figure: critical and noncritical regions and the critical value for a one-tailed test]
[Figure: critical and noncritical regions and the critical value for a two-tailed test]
When we use the t-statistic, we use the formula
$$t = \frac{\bar{X}-\mu_0}{S/\sqrt{n}}$$
Compare the computed test statistic with the critical value. If the computed value is
within the rejection region(s), we reject the null hypothesis; otherwise, we do not
reject the null hypothesis.
Interpret the decision.
Based on the decision in Step 4, we state a conclusion in the context of the
original problem.
2.3.2 Hypothesis testing about the population means (µ)
Let μ0 be the assumed or hypothesized value of µ; then one can formulate
two-sided (1) and one-sided (2 and 3) hypotheses as follows:
1. H0: μ = μ0 vs H1: μ ≠ μ0
2. H0: μ = μ0 vs H1: μ > μ0
3. H0: μ = μ0 vs H1: μ < μ0
Test statistic:
$$Z_{cal} = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}$$
After specifying α we have the following regions (critical and acceptance) on the
standard normal distribution corresponding to the above three hypotheses.
Table: Summary of Decision Rules
H1         Reject H0 if        Do not reject H0 (Accept H0) if
μ ≠ μ0     |Zcal| > Zα/2       |Zcal| < Zα/2
μ > μ0     Zcal > Zα           Zcal < Zα
μ < μ0     Zcal < −Zα          Zcal > −Zα
where
$$Z_{cal} = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}$$
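H1         Reject H0 if             Do not reject H0 if
μ ≠ μ0     |tcal| > tα/2, n−1      |tcal| < tα/2, n−1
μ > μ0     tcal > tα, n−1          tcal < tα, n−1
μ < μ0     tcal < −tα, n−1         tcal > −tα, n−1
where
$$t_{cal} = \frac{\bar{X}-\mu_0}{S/\sqrt{n}}$$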
Sometimes the second assumption may not be met; however, the t test is robust to
departures from the normal distribution. That means even when assumption 2 is
not satisfied, the probabilities calculated from the t table are still approximately
correct.
Examples:
hours. Will the Dambi Dollo University purchase the new brand of fluorescent
bulbs?"
3. For healthy women aged 18-24, the systolic blood pressure reading has a
mean of 114.8. A random sample of 16 women has an average systolic blood
pressure of 117.23 with a standard deviation of 5.63. Test the claim that the
mean systolic blood pressure is different from 114.8. Use the 0.05 significance level.
4. A job placement director claims that the average monthly starting salary for
nurses is less than 1600 birr. A sample of 16 nurses has a mean monthly
starting salary of 1570 birr with a sample standard deviation of 120 birr. At
α=0.05 test the claim that nurses earn less than 1600 birr a month.
5. Researchers are interested in the mean level of an enzyme in a certain
population. They take a sample of 36 individuals, determine the level of
enzyme in each and compute a sample mean 22. It is known that the variable
of interest is approximately normally distributed with a standard deviation of
10. Let’s say that they are asking the following question: Can we conclude that
the mean enzyme level in this population is different from 25?
Solution:
1. Step 1: State the null and alternative hypotheses
H0: μ = 18.7
H1: μ ≠ 18.7
Step 2: α = 0.05
Step 3: σ known and n large ⇒ use the Z-statistic
Step 4: Critical region: Reject H0 if |Zcal| ≥ Zα/2 = 1.96
Step 5: Calculation of the test statistic:
$$Z_{cal} = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} = \frac{17.2-18.7}{4.2/\sqrt{36}} = -2.143$$
Step 6: Decision: Since |Zcal| = 2.143 > 1.96 ⇒ Reject H0
Step 7: Interpretation: At α = 0.05 the criminologist can conclude that the average
sentence is different from 18.7 years.
2. Step 1: State the null and alternative hypotheses
H0: μ = 900
H1: μ > 900
Step 2: α = 0.05
Step 3: σ unknown but n is large ⇒ use the Z-statistic
Step 4: Critical region: Reject H0 if Zcal > Zα = 1.645
Step 5: Calculation of the test statistic:
$$Z_{cal} = \frac{\bar{X}-\mu_0}{S/\sqrt{n}} = \frac{920-900}{80/\sqrt{64}} = 2$$
Step 6: Decision: Since Zcal = 2 > 1.645 ⇒ Reject H0
Step 7: Interpretation: At α = 0.05 there is enough evidence to indicate that the new
brand of light bulbs has a mean life time of more than 900 hours.
3. Step 1: State the null and alternative hypotheses
H0: μ = 114.8
H1: μ ≠ 114.8
Step 2: α = 0.05
Step 3: n small and σ unknown ⇒ use the t-test
Step 4: Critical region: Reject H0 if $|t_{cal}| > t_{\alpha/2,\,n-1} = t_{0.025,\,15} = 2.131$
Step 5: Calculation of the test statistic:
$$t_{cal} = \frac{\bar{X}-\mu_0}{S/\sqrt{n}} = \frac{117.23-114.8}{5.63/\sqrt{16}} = 1.726$$
Step 6: Decision: Since $|t_{cal}| = 1.726 < t_{\alpha/2,\,n-1} = 2.131$ ⇒ Do not reject H0
Step 7: Interpretation: At α = 0.05 there is not enough evidence to conclude that the
mean systolic blood pressure of healthy women aged 18−24 differs from 114.8.
4. Step 1: State the null and alternative hypotheses
H0: μ = 1600
H1: μ < 1600
Step 2: α = 0.05
Step 3: n small and σ unknown ⇒ use the t-statistic
Step 4: Critical region: Reject H0 if $t_{cal} < -t_{\alpha,\,n-1} = -t_{0.05,\,15} = -1.753$
Step 5: Calculation of the test statistic:
$$t_{cal} = \frac{\bar{X}-\mu_0}{S/\sqrt{n}} = \frac{1570-1600}{120/\sqrt{16}} = -1$$
Step 6: Decision: Since tcal = −1 > −1.753 ⇒ Do not reject H0
Step 7: Interpretation: At α = 0.05 there is not enough evidence to support the claim
that the mean monthly starting salary of nurses is less than 1600 birr.
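The two t-test solutions can be verified numerically. The sketch below computes only the test statistic; the critical values 2.131 and 1.753 are copied from the t table for 15 degrees of freedom, since the Python standard library does not provide t quantiles. The function name `t_statistic` is illustrative.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """Return t_cal = (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

# Systolic blood pressure: two-tailed, critical value t_{0.025, 15} = 2.131 (t table)
t_bp = t_statistic(117.23, 114.8, 5.63, 16)
reject_bp = abs(t_bp) > 2.131

# Nurses' starting salary: left-tailed, critical value -t_{0.05, 15} = -1.753 (t table)
t_salary = t_statistic(1570, 1600, 120, 16)
reject_salary = t_salary < -1.753

print(f"blood pressure: t = {t_bp:.3f}, reject H0: {reject_bp}")
print(f"salary:         t = {t_salary:.3f}, reject H0: {reject_salary}")
```

Neither statistic falls in its rejection region, so in both cases the null hypothesis is not rejected. That is a statement about insufficient evidence, not a proof that the null hypothesis is true.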
Exercises:
1. State the null and alternative hypotheses for each of the following
a) A researcher thinks that if expectant mothers use vitamin pills, the birth
weight of the babies will increase. The average of the birth weights of the
population is 4.6 Kilograms.
b) An engineer claims that she can decrease the mean number of defects in a
manufacturing process of compact discs by using robots instead of humans
for certain tasks. The mean number of defective discs is 18.
c) A psychologist feels that if he plays soft music during a test, the results of
the test will change. He is not sure whether the grades will be higher or
lower. In the past, the mean of the scores was 73.
2. The scores on an aptitude test required for entry into a certain job position are
normally distributed with mean 500 and standard deviation 120. If a
random sample of 36 applicants has a mean of 546, is there evidence that
their mean score is different from 500? Use α=0.05.
3. Ten years ago, the mean age of juveniles held in public custody was 16.0
years. The mean age of 250 randomly selected juveniles currently being held
in public custody is 15.86 years. Assuming σ=1.01 years, does it appear that
the mean age of all juveniles being held in public custody this year is less
than it was ten years ago? Use α=0.10.
4. The mean life time of light bulbs produced by a company is known to be
1600 hours. The mean life time of a sample of 16 light bulbs produced by the
factory is computed to be 1570 hours
a) If the population standard deviation is 120 hours, test whether or not the
mean life time is different from 1600 hours
b) If the population standard deviation is not known and the sample
standard deviation is 110 hours, is there any evidence to say that the
mean life time of the light bulbs is more than 1600 hours?
When the sample size is large, the sample proportion P̂ is approximately normally
distributed with mean equal to P and standard deviation equal to $\sqrt{P(1-P)/n}$.
Hence, we use the normal distribution to perform a test of hypothesis about the
population proportion P for a large sample. The sample size is considered to be
large when $n\hat{P}$ and $n(1-\hat{P})$ are both greater than 5.
Suppose the assumed or hypothesized value of P (the parameter of the binomial
distribution) is denoted by P0; then one can formulate two-sided (1) and one-sided
(2 and 3) hypotheses.
The choice of H1 depends on the prior information we have on the value of P
relative to P0.
Decision Rule:
Null hypothesis    Alternative hypothesis    Reject H0 if
P = P0             P ≠ P0                     |Zcal| > Zα/2
P = P0             P > P0                     Zcal > Zα
P = P0             P < P0                     Zcal < −Zα
where
$$Z_{cal} = \frac{\hat{P}-P_0}{\sqrt{P_0(1-P_0)/n}} \sim N(0, 1)$$
process and 5 are defective: Is this evidence sufficient to conclude that the
method has been improved? Use a 0.05 level of significance.
Solution: As usual, we follow the steps:
1. H0: P = 0.9 (actually P ≤ 0.9) vs H1: P > 0.9
2. α = 0.05
3. Critical region: Z > 1.645
4. Computation:
$$\hat{P} = \frac{X}{n} = \frac{95}{100} = 0.95$$
$$Z_{cal} = \frac{\hat{P}-P_0}{\sqrt{P_0(1-P_0)/n}} = \frac{0.95-0.90}{\sqrt{0.9 \times 0.1/100}} = 1.67$$
5. Decision: Reject H0, since Zcal = 1.67 > 1.645
6. Conclusion: At the 0.05 level of significance we have evidence that the
improvement has reduced the proportion of defectives.
5. Computation:
$$\hat{P} = \frac{X}{n} = \frac{48}{500} = 0.096$$
$$Z_{cal} = \frac{\hat{P}-P_0}{\sqrt{P_0(1-P_0)/n}} = \frac{0.096-0.1}{\sqrt{0.1 \times 0.9/500}} = -0.3$$
⇒ Ztab = −Zα = −Z0.01 = −2.33
Exercise: A large sample of 200 students from the students of a certain high
school is interviewed and 85 of them are found to use city bus. Can you conclude
that at least 40% of the students use city bus? Use a 0.05 level of significance
Examples:
1. A registrar officer believes that the dropout for seniors at Dambi Dollo
University is 15%. He performed a hypothesis test to determine if the
percentage is the same or different from 15%. Last year, 38 seniors from a
random sample of 200 seniors withdrew. At α=0.05 test the educator’s claim.
2. A telephone company representative estimates that more than 25% of its
customers want call waiting service. A sample of 200 customers showed that
63 had the call waiting service. At α=0.05 is his estimate appropriate?
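Solutions:
1) Step 1: State the null and alternative hypotheses
H0: p = 0.15   H1: p ≠ 0.15
Step 2: α = 0.05
Step 3: n is large ⇒ use the Z-statistic
Step 4: Critical region: Reject H0 if |Zcal| ≥ Zα/2 = 1.96
Step 5: Calculation of the test statistic:
$$Z_{cal} = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}$$
p̂ = 38/200 = 0.19, p0 = 0.15 ⇒ 1 − p0 = 0.85
$$\Rightarrow Z_{cal} = \frac{0.19-0.15}{\sqrt{0.15 \times 0.85/200}} = 1.58$$
Step 6: Decision: Since |Zcal| = 1.58 < 1.96 ⇒ Do not reject H0
Step 7: Interpretation: At α = 0.05 there is not enough evidence to conclude that
the dropout rate for seniors differs from 15%.
2) Step 1: State the null and alternative hypotheses
H0: p = 0.25   H1: p > 0.25
Step 2: α = 0.05
Step 3: n is large ⇒ use the Z-statistic
Step 4: Critical region: Reject H0 if Zcal > Zα = 1.645
Step 5: Calculation of the test statistic:
p̂ = 63/200 = 0.315, p0 = 0.25 ⇒ 1 − p0 = 0.75
$$\Rightarrow Z_{cal} = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.315-0.25}{\sqrt{0.25 \times 0.75/200}} = 2.12$$
Step 6: Decision: Since Zcal = 2.12 > 1.645 ⇒ Reject H0
Step 7: Interpretation: At α = 0.05 there is enough evidence to support the estimate
that more than 25% of customers want the call-waiting service.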
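The test statistics in the two proportion examples can be recomputed as follows. A minimal sketch; the function name `proportion_z` is illustrative, and the critical values 1.96 and 1.645 are taken from the Z table.

```python
from math import sqrt

def proportion_z(x, n, p0):
    """Return Z_cal = (p_hat - p0) / sqrt(p0 * (1 - p0) / n), with p_hat = x / n."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Dropout example: two-tailed test of H0: p = 0.15, compare |z| with 1.96
z_dropout = proportion_z(38, 200, 0.15)
reject_dropout = abs(z_dropout) > 1.96

# Call-waiting example: right-tailed test of H0: p = 0.25, compare z with 1.645
z_waiting = proportion_z(63, 200, 0.25)
reject_waiting = z_waiting > 1.645

print(f"dropout:      z = {z_dropout:.2f}, reject H0: {reject_dropout}")
print(f"call waiting: z = {z_waiting:.2f}, reject H0: {reject_waiting}")
```

Note that the standard error uses the hypothesized value p0, not p̂, because the test statistic is evaluated under the assumption that H0 is true.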
Exercises: 1) Candidate Chala is one of the two candidates running for the
mayor of Dambi Dollo town. A random polling of 672 registered voters finds that
323 will vote for candidate Chala. At α=0.05 is it reasonable to assume that half of
the population will vote for Chala?
2) Hana believes that 50% of the brides in Dambi Dollo are younger than their
grooms. She performs a hypothesis test to determine if the percentage is the
same as or different from 50%. Hana samples 100 brides and 53 reply that they are
younger than their grooms. At the 1% level of significance, test Hana’s claim.
2.3.4 Sample size determination
In planning a statistical investigation we should decide the number of units
(Sample size) to be studied in order to answer the study objectives. If the sample
size is too small we may fail to detect important effects, or may estimate effects
too imprecisely. If the sample size is too large then we will waste resources.
Therefore it is recommended to determine the appropriate sample size for our
study.
How many samples should be included in our study? The sample size depends on
the maximum error of the estimate, the population standard deviation, and the
degree of confidence.
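Recall that
$$E = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \Rightarrow E\sqrt{n} = Z_{\alpha/2}\,\sigma \Rightarrow \sqrt{n} = \frac{Z_{\alpha/2}\,\sigma}{E} \Rightarrow n = \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^2$$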
Example: The college president asks the registrar officer to estimate the average
age of the students at their college. From a previous study, the standard deviation
of the ages was found to be σ = 2 years. How large should the sample be if the
officer wishes to be 99% confident that the estimate is accurate within 1 year?
Solution: Given: Zα/2 = 2.58, σ = 2, E = 1
$$\Rightarrow n = \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{2.58 \times 2}{1}\right)^2 = 26.6256 \approx 27$$
Example: A scientist wishes to estimate the average depth of a river. He wants to be 99%
confident that the estimate is accurate within 2 feet. From a previous study, the
standard deviation of the depths measured was 4.38 feet. How large a sample is required?
Solution: Given: Zα/2 = 2.58, σ = 4.38, E = 2
$$\Rightarrow n = \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{2.58 \times 4.38}{2}\right)^2 = 31.92$$
Round the value 31.92 up to 32. Therefore, to be 99% confident that the estimate
is within 2 feet of the true mean depth, the scientist needs a sample of at least 32
measurements. (Always round up to the next whole number.)
For a population proportion, the corresponding formula is
$$n = \hat{p}\,\hat{q}\left(\frac{Z_{\alpha/2}}{E}\right)^2$$
What sample size should be required, if the researcher wishes to be accurate
within 5% of the true proportion?
Solution:
Given: 90% ⇒ Zα/2 = 1.645, p̂ = 54/230 = 0.235 ⇒ q̂ = 0.765 and E = 0.05
$$\Rightarrow n = \hat{p}\,\hat{q}\left(\frac{Z_{\alpha/2}}{E}\right)^2 = 0.235 \times 0.765\left(\frac{1.645}{0.05}\right)^2 = 194.59 \approx 195$$
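The sample size formulas for a mean and for a proportion can be wrapped in two small helpers that always round up. A sketch with illustrative function names; the Z values are the table values used in the worked examples.

```python
from math import ceil

def n_for_mean(z, sigma, error):
    """Smallest n with z * sigma / sqrt(n) <= error, i.e. n = (z * sigma / error)^2 rounded up."""
    return ceil((z * sigma / error) ** 2)

def n_for_proportion(z, p_hat, error):
    """Smallest n with z * sqrt(p_hat * q_hat / n) <= error, rounded up."""
    return ceil(p_hat * (1 - p_hat) * (z / error) ** 2)

n_age = n_for_mean(2.58, 2, 1)                   # students' ages, 99%, within 1 year
n_river = n_for_mean(2.58, 4.38, 2)              # river depth, 99%, within 2 feet
n_voters = n_for_proportion(1.645, 0.235, 0.05)  # voters, 90%, within 5%

print(n_age, n_river, n_voters)
```

Rounding up with `ceil` rather than rounding to the nearest integer guarantees that the margin of error does not exceed the target.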
Exercises:
1. A college dean wishes to estimate the average number of hours his part-time
instructors teach per week. The standard deviation from a previous study is 2.6
hours. How large a sample must be selected if he wants to be 99% confident that
the sample mean differs from the true mean by at most 1 hour?
2. A researcher wants to estimate, with 95% confidence, the proportion of people
who own a home computer. A previous study shows that 40% of those
interviewed had a computer at home. The researcher wishes to be accurate
within 2% of the true proportion. Find the minimum sample size necessary.