Sampling and Estimation
Objectives
At the end of this chapter, you should be able to:
• Explain why, in many situations, a sample is the only feasible way to learn something about a population.
• Explain the various methods of selecting a sample.
• Distinguish between probability sampling and non-probability sampling.
• Define and construct a sampling distribution of sample means.
• Explain the Central Limit Theorem and its importance in statistical inference.
• Calculate confidence intervals for means and proportions.
• Determine how large a sample should be for both means and proportions.
Fast Forward: Sampling is that part of statistical practice concerned with the selection of
individual observations intended to yield some knowledge about a population of concern,
especially for the purposes of statistical inference.
Introduction
A population consists of all the items with which a particular study is concerned. A sample is a
much smaller number chosen from this population. The sample must be chosen randomly. The
data collected in the sample is used to draw inferences about the corresponding population
parameter.
The three types of distributions:
1. Population distribution
The population distribution is the distribution of the individual values of a population. Its
mean is denoted by “μ”.
2. Sample distribution
It is the distribution of the individual values of a single sample. Its mean is generally written
as ‘m’. It is extremely unlikely that it will be the same as “μ”.
3. Distribution of sample means
A sample of size n is taken from the parent population and the mean of the sample is calculated.
This is repeated for a number of samples so that a population of sample means is obtained.
This population approaches a normal distribution as n increases and is the distribution of
sample means.
Definition of key terms
1. Estimate – an approximate calculation of quantity or degree or worth; an estimate of what
it would cost; a rough idea how long it would take.
2. Sample - a small part of something intended as representative of the whole. In statistics, a
sample is a subset of a population.
3. Probability – Probability is the likelihood or chance that something is the case or will happen.
Probability theory is used extensively in areas such as statistics, mathematics, science and
philosophy to draw conclusions about the likelihood of potential events and the underlying
mechanics of complex systems.
4. Proportion – The quotient obtained when the magnitude of a part is divided by the magnitude
of the whole; a quantity of something that is part of the whole amount or number.
5. Null hypothesis – describes some aspect of the statistical behaviour of a set of data.
This description is treated as valid unless the actual behaviour of the data contradicts this
assumption.
6. An alternative hypothesis is one that specifies that the null hypothesis is not true. The
alternative hypothesis is false when the null hypothesis is true, and true when the null
hypothesis is false. The symbol H1 is used for the alternative hypothesis.
Industry Context
With the realisation that in business time is money, dynamic technologies for forecasting
have become a necessary tool in a wide range of managerial decisions. In making strategic
decisions under uncertainty, we all make forecasts. We may not think that we are forecasting,
but our choices will be directed by our anticipation of the results of our actions or inactions.
Indecision and delays are the parents of failure. For instance, budgets are intended to help
managers and administrators do a better job of anticipating, and hence a better job of managing
uncertainty, by using effective forecasting and other predictive techniques.
EXAM CONTEXT
Sampling and estimation has been a popular field for examiners. The student must understand
the formulae for the previous tests to avoid confusion during an exam. Previous exam papers
where the topic has featured are:
12/02, 6/06, 12/05, 6/05, 12/04, 6/04, 6/03, 12/02, 12/00, 6/00
4.1 Methods of probability sampling
There is no one ‘best’ method of selecting a probability sample from a population of interest. A
method used to select a sample of invoices in a file drawer might not be the most appropriate
method to use when choosing a national sample of voters. However, all probability sampling
methods have a similar goal, namely, to allow chance to determine the items or persons to be
included in the sample.
These sampling methods include:
1. Simple random sampling
A sample is selected in such a manner that each item or person in the population has the same
chance of being included. For instance, suppose a population consists of 576 employees of Yana
Tires and a sample of 63 employees is to be selected from that population. One way is to first
write all their names, put the names in a box, mix them thoroughly and then make the first
selection. This process is repeated until the 63 employees are selected.
The other method, which is more convenient, is to use the identification number of each employee
and a table of random numbers. As the name implies, these numbers have been generated by a
random process (in this case by a computer). Personal bias is therefore removed from the
selection process.
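The random-number method can be sketched in a few lines of Python. This is only an illustration: the employee identification numbers 1 to 576 and the fixed seed are assumptions made for the example, not part of the Yana Tires case.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Return k distinct items; every item has the same chance of selection."""
    rng = random.Random(seed)          # seed is optional, used here for repeatability
    return rng.sample(population, k)   # sampling without replacement

# Hypothetical identification numbers for the 576 employees.
employee_ids = list(range(1, 577))
sample = simple_random_sample(employee_ids, 63, seed=42)
print(len(sample), sample[:5])
```

Sampling without replacement mirrors drawing names from the box: once an employee is chosen, the same employee cannot be drawn again.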
2. Systematic random sampling
The items or individuals of the population are arranged in some way: alphabetically, in a file
drawer by date received, or by some other method. A random starting point is selected and then
every kth member of the population is selected for the sample, where k is usually the population
size divided by the sample size. A systematic sample should not be used, however, if there is a
predetermined pattern to the population.
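A minimal sketch of systematic selection follows, assuming the population is already held in an ordered list and that k is taken as the whole-number part of population size divided by sample size. The function name and the numbers are illustrative only.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start in the first k items, then every k-th item."""
    k = len(population) // n                  # sampling interval
    start = random.Random(seed).randrange(k)  # random starting point, 0 .. k-1
    return population[start::k][:n]           # keep exactly n items

ordered_ids = list(range(1, 577))             # hypothetical ordered list of 576 items
chosen = systematic_sample(ordered_ids, 63, seed=1)
print(chosen[:4])
```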
3. Stratified random sampling
A population is first divided into subgroups called strata and a sample is selected from each
stratum. After the population has been divided into strata, either a proportional or a
non-proportional sample can be selected. A proportional sampling procedure requires that the
number of items in each stratum be in the same proportion as found in the population.
In a non-proportional stratified sample, the number of items studied in each stratum is
disproportionate to their number in the population. We then weight the sample results according
to the stratum’s proportion of the total population.
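Proportional allocation is simple arithmetic, sketched below. The stratum names and sizes are invented for illustration; with less convenient numbers the rounded allocations may not add up exactly to the required sample size and would need a small manual adjustment.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Share the sample among strata in proportion to their population sizes."""
    total = sum(stratum_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}

# Hypothetical strata of a population of 1,000 items.
strata = {"small firms": 500, "medium firms": 300, "large firms": 200}
allocation = proportional_allocation(strata, 50)
print(allocation)
```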
4. Cluster sampling
It is often employed to reduce the cost of sampling a population scattered over a large geographic
area. Suppose you want to conduct a survey to determine the views of industrialists in a state.
Selecting a random sample of industrialists in the state and personally contacting each one
would be time-consuming and very expensive. Cluster sampling would be useful by subdividing
the state into small units of regions often called primary units.
Standard error of the mean
The standard deviation of the sample means about the overall mean μ is called the standard error
(Se).
For large samples,
$$Se = \frac{\sigma}{\sqrt{n}}$$
where σ is the standard deviation of the population and n is the size of the sample.
Note: In general, the standard deviation of the population is not known. In such cases, the
standard deviation of the sample, s (provided the sample is large), is a good estimate of the
population standard deviation (σ).
The standard error of the mean then becomes
$$Se = \frac{s}{\sqrt{n}}$$
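As a quick numerical check of the formula, the sketch below computes the standard error from a sample standard deviation. The figures used (a standard deviation of 1.5 from a sample of 50) are chosen for illustration.

```python
from math import sqrt

def standard_error(sample_sd, n):
    """Standard error of the mean: sample standard deviation over root n."""
    return sample_sd / sqrt(n)

se = standard_error(1.5, 50)
print(round(se, 4))
```

Quadrupling the sample size halves the standard error, which is why larger samples give more reliable estimates.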
Proportions
There are times when information cannot be given as a mean or as a measure but only as a
fraction or percentage.
Examples:
Percentage of females in a certain population.
Proportion of defective production in total production.
In these cases, we are faced with estimating the population proportion from a single sample.
The sampling theory states that if repeated large random samples are taken from a population,
the sample proportion ‘p’ will be normally distributed with mean equal to the population
proportion and standard error equal to
$$\sqrt{\frac{p(1 - p)}{n}}$$
where n is the sample size.
The procedure for estimating a proportion is similar to that for estimating a mean.
The Sampling distribution of sample proportions
Population proportions are used in business, particularly in market research, where we might
investigate the proportions of populations displaying a particular characteristic.
Provided np and n(1 − p) are both greater than 5, we may use
$$\sqrt{\frac{p(1 - p)}{n}}$$
to represent the standard error of the sampling distribution of sample proportions.
Confidence intervals
Usually we use a single sample to produce an estimate of a population parameter. It is important
to know how reliable this estimate is, based on the sample results.
The standard error of the sampling distribution of means (or proportions) will give us an
indication of reliability: the smaller the standard error, the less variable the sample statistic.
However, it is simpler to appreciate the reliability of an estimate if we set up a range of values
within which we can be reasonably sure that the population parameter lies. This range of values
is called a confidence interval.
Confidence limits
Confidence limits are the outer limits to a confidence interval. This is a zone of values within
which we may be confident that the true population mean (or the parameter being considered)
does lie.
Example: the 95% confidence interval for the population proportion is:
$$p \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$$
Note: It is usual to write 1 − p = q
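The interval can be evaluated directly, as in the sketch below. The sample proportion of 0.54 from 400 observations is an assumed input used only to show the arithmetic.

```python
from math import sqrt

def proportion_confidence_interval(p, n, z=1.96):
    """Confidence limits p ± z * sqrt(p(1 - p)/n); z = 1.96 gives 95%."""
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

lower, upper = proportion_confidence_interval(0.54, 400)
print(round(lower, 4), round(upper, 4))
```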
Test of significance
In practice, sizes of population parameters such as mean and proportions, are generally unknown.
However a claim may be made that the size of a given parameter is equal to ‘x’ (say). Such a
claim may be true or false. There is need for such a claim to be put to test.
To test this claim we take a sample. It will be a miracle if the sample taken from the population
will have the same mean or standard deviation as the corresponding population parameters. The
difference could be:
a) because the original belief was wrong, or
b) because the difference was purely due to ordinary chance.
If the difference cannot be explained solely due to ordinary chance, then the difference is said to
be statistically significant.
4.2 The Null hypothesis
Fast Forward: A common convention is to use the symbol H0 to denote the null hypothesis.
It is an assumption that nothing has changed, i.e. what we have assumed to be true (H0) is
actually true. In other words, there is no contradiction between the believed mean and the
sampled mean, and the difference is solely due to chance.
Testing the null hypothesis
We know that 95% of the means of all samples will fall within 1.96 standard errors of the
population mean (or believed mean). If the sample mean lies more than 1.96 standard errors from
the believed mean, one can reject the null hypothesis at the 95% level of confidence (the 5%
level of significance) and say that there is a contradiction between the believed population
mean and the sample. If the null hypothesis is not rejected, then we say there is not enough
evidence to show that the true mean is not as we believe.
Sampling theory for small samples (t-Distribution)
In applying sampling theory to small samples (n < 30), we assume that the parent population is
normal or near normal. In this case the sample means do not follow a normal distribution; they
follow a t-distribution.
This distribution is similar to the standard normal distribution in that it is symmetrical and
continuous. The major difference between the two is that in the case of the normal distribution
there is only one defined distribution; there are however many t-distributions, depending upon
the sample size.
The value of t is obtained from the t-tables depending upon the degrees of freedom and the level
of confidence.
Type I and Type II Errors
If we reject a hypothesis when it should be accepted, we say that a type I error has been made.
If on the other hand, we accept a hypothesis when it should be rejected, we say that a type II
error has been made.
In either case a wrong decision or error in judgment has occurred. For a given sample size, an
attempt to decrease one type of error is accompanied in general by an increase in another type
of error.
While testing a hypothesis (H0) and deciding either to accept or reject it, there are four
possible occurrences.
a) Acceptance of a true hypothesis (correct decision): the null hypothesis is accepted
and this happens to be the correct decision. Note that statistics does not give absolute
information; its conclusion could be wrong, only the probability of it being right is high.
b) Rejection of a false hypothesis (correct decision).
c) Rejection of a true hypothesis (incorrect decision): this is called a type I error, with
probability = α.
d) Acceptance of a false hypothesis (incorrect decision): this is called a type II error, with
probability = β.
In practice, a type I error is considered more serious than a type II error. The maximum
probability with which we would be willing to risk a type I error is called the level of
significance of the test.
One-tailed and two-tailed tests
When we are interested in extreme values of the statistic S, or its corresponding Z score, on
both sides of the mean, we perform a test called a two-tailed test (or two-sided test). Similarly,
we may be interested only in extreme values to one side of the mean; such tests are called
one-tailed tests (or one-sided tests).
Levels of significance
A level of significance is a probability value which is used when conducting tests of hypothesis.
It is the probability of making an incorrect decision by rejecting a null hypothesis that is in
fact true. The probabilities used are usually very small, e.g. 1% or 5%.
NB: If the standardised value of the mean is less than –1.65 we reject the null hypothesis (H0)
and accept the alternative hypothesis (H1), but if the standardised value of the mean is more than
–1.65 we accept the null hypothesis and reject the alternative hypothesis.
This left-tail sketch and level of significance are applicable when the sample mean is less
than the population mean. A corresponding right-tail sketch is used when the sample mean is
greater than the population mean.
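[Figure: standard normal curve for a left-tail test. The critical region (5% = 0.05) lies below the critical value −1.65; the area between the critical value and 0 is 0.45.]
[Figure: standard normal curve with a 1% provision for errors. The area between 0 and the critical value is 0.4900 out of 0.5000.]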
NB: If the standardised value of the sample mean is less than 1.65, we accept the null hypothesis
and reject the alternative. If it is greater than 1.65, we reject the null hypothesis and accept
the alternative hypothesis.
The right-tail sketch is normally used when the sample mean given is greater than the population
mean.
NB: For a two-tailed test at the 1% level, if the standardised value of the sample mean is
between –2.58 and +2.58, accept the null hypothesis; otherwise reject it and accept the
alternative hypothesis.
Two-tailed tests
A two-tailed test is normally used in statistical work (tests of significance), e.g. if a
complaint lodged by the client is about a product not meeting certain specifications, i.e. the
item will generate a complaint if its measurements are below the lower tolerance limit or above
the upper tolerance limit.
[Figure: right-tail test at the 5% level. Acceptance region (accept the null hypothesis, reject the alternative) to the left of the critical value Z = 1.65; critical (rejection) region of 5% = 0.05 to the right.]
[Figure: two-tailed test at the 1% level. Accept the null hypothesis between −2.58 and +2.58 (area 0.495 on each side of 0); reject it in either tail (0.5% = 0.005 each).]
NB: The null hypothesis is rejected, and the alternative hypothesis accepted, if the sample mean
lies beyond the tolerance limits (15cm and 17 ½ cm).
One-tailed test
This is a test where the alternative hypothesis (H1) is only concerned with one of the tails of
the distribution, e.g. to test a business complaint if the complaint is about the measurements
of an item being shorter than is required.
E.g. a manufacturer of a given brand of bread may state that the average weight of the bread is
500gms, but if a consumer takes a sample, weighs each of the pieces of bread and happens
to have a mean of 450gms, he will definitely complain that the bread is underweight. The
statistical analysis to be done will concentrate on the left tail of the normal distribution, in
which one will have to establish whether 450gms being less than 500gms is statistically
significant. Such a test is referred to as a one-tailed test.
On the other hand, the test may tend towards the right-hand tail of the normal distribution. When
this happens, the major complaint is likely to do with oversize items bought. The test is known as
one-tailed as the focus is on one end of the normal distribution.
[Figure: two-tailed test showing the region of acceptance for H0 between the tolerance limits 15cm and 17 ½ cm, with a critical region in each tail.]
Number of standard errors
                              Two-tailed test    One-tailed test
5% level of significance           1.96               1.65
1% level of significance           2.58               2.33
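The values in this table can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `critical_z` is an illustrative choice. Note that the one-tailed 5% value is about 1.645, which the table rounds to 1.65.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Number of standard errors that cuts off the given significance level."""
    tail_area = alpha / 2 if two_tailed else alpha   # split alpha over both tails
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.05, 0.01):
    print(alpha, round(critical_z(alpha), 3), round(critical_z(alpha, two_tailed=False), 3))
```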
Hypothesis Testing Procedure
Whenever a business complaint comes up there is a recommended procedure for conducting a
statistical test. The purpose of such a test is to establish whether the null hypothesis or alternative
hypothesis is to be accepted.
The following are steps normally adopted:
1. Statement of the null and alternative hypothesis
2. Statement of the level of significance to be used.
3. Statement about the test statistic i.e. what is to be tested e.g. the sample mean, sample
proportion, difference between sample means or sample proportions
4. Type of test whether two tailed or one tailed.
5. Statement on critical values using the appropriate level of significance
6. Standardising the test statistic
7. Conclusion showing whether to accept or reject the null hypothesis
The F-Distribution (The Variance Ratio Test)
The F-test is based on the ratio of the variances of two samples.
The F-distribution has the following shape:
[Diagram 4.1: F-distribution curve with the critical point marked on the F-value axis.]
The null-hypothesis is that there is no significant difference between the variances of the two
samples.
How to use the F-test
Step I: Calculate the variance for each sample and use these to estimate the population
variance.
Step II: Obtain the F-value, which is
$$F = \frac{\text{Larger variance}}{\text{Smaller variance}}$$
Step III: Find the F-value from F-tables using the degrees of freedom and level of confidence.
4.3 THE CHI-SQUARE DISTRIBUTION (χ²)
This distribution can be used to test whether an observed series of values differs significantly
from what is expected.
Formula:
$$\chi^2 = \sum \frac{(\text{observed value} - \text{expected value})^2}{\text{expected value}} = \sum \frac{(O - E)^2}{E}$$
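The formula translates into a one-line sum. The observed and expected frequencies below are made-up figures used purely to show the calculation; the resulting value would still have to be compared with a table value for the appropriate degrees of freedom.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20]   # hypothetical observed frequencies
expected = [20, 20, 20]   # hypothetical expected frequencies
chi2_value = chi_square(observed, expected)
print(chi2_value)
```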
Testing the difference between two sample means (large samples)
A large sample is defined as one which contains 30 or more items (n≥30 where n is the sample
size)
In a business, those involved are constantly observant about the standards or specifications
of the items which they sell, e.g. a trader may receive a batch of items at one time and another
batch at a later time. In the end, he may conclude that the two samples are different in
certain specifications, e.g. mean weight, mean lifespan, mean length, etc. It may then become
necessary to establish whether the observed differences are statistically significant or not. If
the differences are statistically significant, then such differences must be explained,
i.e. there are known causes; if they are not statistically significant, then the
differences observed have no known causes and are mainly due to chance.
If the differences are established to be statistically significant, then it implies that the
complaints which necessitated that kind of test are justified.
Let X1 and X2 be any two samples whose sizes are n1 and n2, with means $\bar{X}_1$ and $\bar{X}_2$ and standard
deviations S1 and S2 respectively. In order to test the difference between the two sample means,
we apply the following formulas:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} \quad \text{where} \quad S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
Example 1
An agronomist was interested in a particular fertilizer yield output. He planted maize on 50 equal
pieces of land and the mean harvest obtained later was 60 bags per plot with a standard deviation
of 1.5 bags. The crops grew under natural circumstances and conditions without the soil being
treated with any fertilizer. The same agronomist carried out an alternative experiment where he
picked 60 plots in the same area and planted the same plant of maize but a fertilizer was applied
on these plots. After the harvest it was established that the mean harvest was 63 bags per plot
with a standard deviation of 1.3 bags
Required
Conduct a statistical test in order to establish whether there was a significant difference between
the mean harvest under the two types of field conditions. Use 5% level of significance.
Solution
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
Critical values of the two-tailed test at 5% level of significance are ±1.96.
The standardised value of the difference between sample means is given by Z where
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} \quad \text{where} \quad S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{1.5^2}{50} + \frac{1.3^2}{60}} \approx 0.27$$
$$Z = \frac{60 - 63}{0.27} = -11.11$$
Since −11.11 < −1.96, we reject the null hypothesis but accept the alternative hypothesis at 5%
level of significance, i.e. the difference between the sample mean harvests is statistically
significant.
This implies that the fertilizer had a positive effect on the harvest of maize.
Note: You don’t have to illustrate your solution with a diagram.
[Figure: standard normal curve with critical values −1.96 and +1.96 on either side of 0.]
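The large-sample test of two means can be checked with a short function. This is a sketch built from the formula for Z given earlier; the function name is an illustrative choice. Because the standard error is not rounded to 0.27 before dividing, the result differs slightly in the second decimal place from the hand calculation.

```python
from math import sqrt

def two_sample_z(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Z for the difference between two sample means (large samples)."""
    standard_error = sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error

# Example 1: unfertilised plots versus fertilised plots.
z_value = two_sample_z(60, 1.5, 50, 63, 1.3, 60)
print(round(z_value, 2), abs(z_value) > 1.96)
```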
Example 2
An observation was made about reading abilities of males and females. The observation led
to a conclusion that females are faster readers than males. The observation was based on the
times taken by both females and males when reading out a list of names during graduation
ceremonies.
In order to investigate the observation and the consequent conclusion a sample of 200 men
were given lists to read. On average, each man took 63 seconds with a standard deviation of 4
seconds. A sample of 250 women was also taken and asked to read the same list of names. It
was found that they took 62 seconds on average with a standard deviation of 1 second.
Required
By conducting a statistical hypothesis testing at 1% level of significance establish whether the
sample data obtained supports the earlier observation.
Solution
H0: μ1 = μ2
H1: μ1 ≠ μ2
The critical value of the two-tailed test at 1% level of significance is 2.58.
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} = \frac{63 - 62}{\sqrt{\frac{4^2}{200} + \frac{1^2}{250}}} = 3.45$$
Since 3.45 > 2.58, reject the null hypothesis but accept the alternative hypothesis at 1% level of
significance, i.e. there is a significant difference between the reading speed of males and
females; thus the sample data support the observation that females are faster readers.
[Figure: two-tailed test at the 1% level. Acceptance region between −2.58 and +2.58; the computed value +3.45 falls in the rejection region.]
Test of hypothesis on proportions
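This follows a similar method to the one for means except that the standard error used in this
case is:
$$S_p = \sqrt{\frac{pq}{n}}$$
The Z score is calculated as
$$Z = \frac{P - \pi}{S_p}$$
where P is the proportion found in the sample and π is the hypothetical proportion. Under the
null hypothesis, p in the standard error is taken as π and q = 1 − p.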
Example
A member of parliament (MP) claims that in his constituency only 50% of the total youth
population lacks university education. A local media company wanted to ascertain that claim, so
they conducted a survey taking a sample of 400 youths; of these, 54% lacked university education.
Required:
At 5% level of significance, confirm if the MP’s claim is wrong.
Solution
Note: This is a two-tailed test since we wish to test that the proportion is different (≠) and
not against a specific alternative hypothesis, e.g. less than or more than.
H0 : π = 50% of all youth in the constituency lack university education.
H1 : π ≠ 50% of all youth in the constituency lack university education.
$$S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.5 \times 0.5}{400}} = 0.025$$
$$Z = \frac{0.54 - 0.50}{0.025} = 1.6$$
At 5% level of significance for a two-tailed test the critical value is 1.96. Since the
calculated Z value is less than the tabulated value (1.6 < 1.96), we accept the null hypothesis.
Thus there is not enough evidence to conclude that the MP’s claim is wrong.
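The same calculation can be written as a small function. It is a sketch of the test on a single proportion as set out above, with the standard error based on the hypothesised proportion; the function name is illustrative.

```python
from math import sqrt

def one_proportion_z(sample_proportion, hypothesised_proportion, n):
    """Z score for a sample proportion against a hypothesised proportion."""
    p = hypothesised_proportion
    standard_error = sqrt(p * (1 - p) / n)   # S_p evaluated under H0
    return (sample_proportion - p) / standard_error

z_value = one_proportion_z(0.54, 0.50, 400)
reject_h0 = abs(z_value) > 1.96              # two-tailed test at the 5% level
print(round(z_value, 2), reject_h0)
```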
Hypothesis testing of the difference between proportions
Example
Ken industrial manufacturers have produced a perfume known as “fianchetto.” In order to test
its popularity in the market, the manufacturer carried out a random survey in Back rank city,
where 10,000 consumers were interviewed, of whom 7,200 showed preference. The manufacturer
also moved to Rook town, where he interviewed 12,000 consumers, out of whom 10,000
showed preference for the product.
Required
Design a statistical test and use it to advise the manufacturer regarding the differences in the
proportion, at 5% level of significance.
Solution
H0 : π1 = π2
H1 : π1 ≠ π2
The critical value for this two-tailed test at 5% level of significance = 1.96.
Now
$$Z = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{S_{(P_1 - P_2)}}$$
But since the null hypothesis is π1 = π2, the second part of the numerator disappears, i.e.
π1 − π2 = 0, which will always be the case at this level.
Then
$$Z = \frac{P_1 - P_2}{S_{(P_1 - P_2)}}$$
Where:
                                     Sample 1         Sample 2
Sample size                          n1 = 10,000      n2 = 12,000
Sample proportion of success         P1 = 0.72        P2 = 0.83
Population proportion of success     π1               π2
Now
$$S_{(P_1 - P_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$$
where
$$p = \frac{P_1 n_1 + P_2 n_2}{n_1 + n_2}$$
and q = 1 – p
∴ in our case
$$p = \frac{10{,}000(0.72) + 12{,}000(0.83)}{10{,}000 + 12{,}000} = \frac{17{,}160}{22{,}000} = 0.78$$
∴ q = 0.22
$$S_{(P_1 - P_2)} = \sqrt{\frac{(0.78)(0.22)}{10{,}000} + \frac{(0.78)(0.22)}{12{,}000}} = 0.00561$$
$$Z = \frac{0.72 - 0.83}{0.00561} = -19.6$$
Since the absolute value 19.6 > 1.96, we reject the null hypothesis but accept the alternative.
The differences between the proportions are statistically significant. This implies that the
perfume is much more popular in Rook town than in Back rank city.
Hypothesis testing about the difference between two proportions
This test is used to test the difference between the proportions of a given attribute found in
two random samples.
The null hypothesis is that there is no difference between the population proportions. It means
the two samples are from the same population.
Hence
H0 : π1 = π2
The best estimate of the standard error of the difference of P1 and P2 is given by pooling the
samples and finding the pooled sample proportion (p), thus
$$p = \frac{P_1 n_1 + P_2 n_2}{n_1 + n_2}$$
Standard error of difference between proportions:
$$S_{(P_1 - P_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}}$$
and
$$Z = \frac{P_1 - P_2}{S_{(P_1 - P_2)}}$$
Example
In a random sample of 100 persons taken from village A, 60 are found to be consuming tea. In
another sample of 200 persons taken from village B, 100 persons are found to be consuming
tea. Do the data reveal a significant difference between the two villages so far as the habit of
taking tea is concerned?
Solution
Let us take the hypothesis that there is no significant difference between the two villages as far
as the habit of taking tea is concerned i.e. π1 = π2
We are given
P1 = 0.6; n1 = 100
P2 = 0.5; n2 = 200
The appropriate statistic to be used here is given by
$$p = \frac{P_1 n_1 + P_2 n_2}{n_1 + n_2} = \frac{(0.6)(100) + (0.5)(200)}{100 + 200} = \frac{60 + 100}{300} = 0.53$$
q = 1 – 0.53 = 0.47
$$S_{(P_1 - P_2)} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} = \sqrt{\frac{(0.53)(0.47)}{100} + \frac{(0.53)(0.47)}{200}} = 0.0611$$
$$Z = \frac{0.6 - 0.5}{0.0611} = 1.64$$
Since the computed value of Z is less than the critical value of Z = 1.96 at 5% level of
significance, we accept the hypothesis and conclude that there is no significant difference in
the habit of taking tea in the two villages A and B.
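The pooled-proportion test can be sketched as follows, using the pooling formula and standard error given above. The function name is an illustrative choice, and the intermediate values are not rounded, so the later decimal places may differ slightly from a hand calculation.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2):
    """Z for the difference between two sample proportions, using pooling."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)        # pooled proportion
    q = 1 - p
    standard_error = sqrt(p * q / n1 + p * q / n2)
    return (p1 - p2) / standard_error

# Tea example: village A (60 of 100) versus village B (100 of 200).
z_tea = two_proportion_z(0.6, 100, 0.5, 200)
print(round(z_tea, 2), abs(z_tea) > 1.96)
```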
t-distribution (Student’s t distribution) tests of hypothesis (test for small samples, n < 30)
For small samples (n < 30), the method used in hypothesis testing is similar to the one for large
samples except that t values are used from the t distribution at a given degree of freedom v
instead of the z score; the standard error (Se) statistic used is also different.
Note that v = n – 1 for a single sample and v = n1 + n2 – 2 where two samples are involved.
a) Test of hypothesis about the population mean
When the population standard deviation is not known and is estimated by the sample standard
deviation (S), the t statistic is defined as
$$t = \frac{\bar{X} - \mu}{S_{\bar{X}}} \quad \text{where} \quad S_{\bar{X}} = \frac{S}{\sqrt{n}}$$
It follows the Student’s t distribution with (n − 1) d.f., where:
$\bar{X}$ = sample mean
μ = hypothesised population mean
n = sample size
and S is the standard deviation of the sample calculated by the formula:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \quad \text{for } n < 30$$
If the calculated value of t exceeds the table value of t at a specified level of significance, the null
hypothesis is rejected.
Example
Ten oil tins are taken at random from an automatic filling machine. The mean weight of the tins
is 15.8 kg and the standard deviation is 0.5 kg. Does the sample mean differ significantly from
the intended weight of 16 kg? Use 5% level of significance.
Solution
Given that n = 10; $\bar{X}$ = 15.8; S = 0.50; μ = 16; v = 9
H0 : μ = 16
H1 : μ ≠ 16
$$S_{\bar{X}} = \frac{0.5}{\sqrt{10}} \approx 0.16$$
$$t = \frac{15.8 - 16}{0.5 / \sqrt{10}} = \frac{-0.2}{0.16} = -1.25$$
The table value of t for 9 d.f. at 5% level of significance is 2.26. The absolute value of the
computed t (1.25) is smaller than the table value of t; therefore the difference is
insignificant and the null hypothesis is accepted.
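A short sketch of the single-sample t calculation is given below. The table value 2.26 is the one quoted in the example; the function name is illustrative. Since the standard error is carried unrounded, the computed t is about −1.26 rather than the hand-rounded figure, which does not change the conclusion.

```python
from math import sqrt

def one_sample_t(sample_mean, hypothesised_mean, sample_sd, n):
    """t statistic for a single small sample, with n - 1 degrees of freedom."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - hypothesised_mean) / standard_error

t_value = one_sample_t(15.8, 16, 0.5, 10)
reject_h0 = abs(t_value) > 2.26      # table value for 9 d.f. at the 5% level
print(round(t_value, 2), reject_h0)
```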
b) Test of hypothesis about the difference between two means
The t test can be used under two assumptions when testing hypothesis concerning the difference
between the two means; that the two are normally distributed (or near normally distributed)
populations and that the standard deviation of the two is the same or at any rate not significantly
different.
The appropriate test statistic to be used is
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} \quad \text{at } (n_1 + n_2 - 2) \text{ d.f.}$$
The standard deviation is obtained by pooling the two sample standard deviations as shown
below.
$$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$
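where S1 and S2 are the standard deviations for samples 1 and 2 respectively.
Now
$$S_{\bar{X}_1} = \frac{S_p}{\sqrt{n_1}} \quad \text{and} \quad S_{\bar{X}_2} = \frac{S_p}{\sqrt{n_2}}$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{S_{\bar{X}_1}^2 + S_{\bar{X}_2}^2}$$
Alternatively
$$S_{(\bar{X}_1 - \bar{X}_2)} = S_p \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$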
Example
Two different types of drugs, A and B, were tried on certain patients for increasing weight: 5
persons were given drug A and 7 persons were given drug B. The increase in weight (in pounds)
is given below:
Drug A    8   12   13    9    3
Drug B   10    8   12   15    6    8   11
Do the two drugs differ significantly with regard to their effect in increasing weight? (Given that
v = 10; t0.05 = 2.23)
Solution
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}}$$
Calculate $\bar{X}_1$, $\bar{X}_2$ and S
X1      X1 − X̄1    (X1 − X̄1)²     X2      X2 − X̄2    (X2 − X̄2)²
8         −1           1          10         0           0
12        +3           9           8        −2           4
13        +4          16          12        +2           4
9          0           0          15        +5          25
3         −6          36           6        −4          16
                                   8        −2           4
                                  11        +1           1
ΣX1 = 45; Σ(X1 − X̄1) = 0; Σ(X1 − X̄1)² = 62
ΣX2 = 70; Σ(X2 − X̄2) = 0; Σ(X2 − X̄2)² = 54
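$$\bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{45}{5} = 9 \qquad \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{70}{7} = 10$$
$$S_1 = \sqrt{\frac{62}{4}} = 3.94 \qquad S_2 = \sqrt{\frac{54}{6}} = 3$$
$$S_p = \sqrt{\frac{(4)(15.5) + (6)(9)}{10}} = 3.406$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{11.6}{5} + \frac{11.6}{7}} \quad \text{or} \quad 3.406\sqrt{\frac{7 + 5}{5 \times 7}} = 1.99$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{(\bar{X}_1 - \bar{X}_2)}} = \frac{9 - 10}{1.99} = -0.50$$
Now t0.05 (at v = 10) = 2.23 > 0.50, the absolute value of the computed t.
Thus we accept the null hypothesis.
Hence there is no significant difference in the efficacy of the two drugs in the matter of
increasing weight.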
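The pooled t test on raw data can be sketched with the standard library, where `statistics.stdev` uses the n − 1 divisor as in the formula for S. The data are the readings used in the working table (with 13 as the third Drug A reading); the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def pooled_t(sample_1, sample_2):
    """Return (t, degrees of freedom) for two small independent samples."""
    n1, n2 = len(sample_1), len(sample_2)
    s1, s2 = stdev(sample_1), stdev(sample_2)      # sample SDs, divisor n - 1
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    standard_error = sp * sqrt((n1 + n2) / (n1 * n2))
    return (mean(sample_1) - mean(sample_2)) / standard_error, n1 + n2 - 2

drug_a = [8, 12, 13, 9, 3]
drug_b = [10, 8, 12, 15, 6, 8, 11]
t_value, dof = pooled_t(drug_a, drug_b)
print(round(t_value, 2), dof, abs(t_value) > 2.23)
```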
Example
Two salesmen A and B are working in a certain district. From a survey conducted by the head
office, the following results were obtained. State whether there is any significant difference in
the average sales between the two salesmen at 5% level of significance.
                               A        B
No. of sales                  20       18
Average sales in shs         170      205
Standard deviation in shs     20       25
Solution
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
Where
$$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$
$$S_{(\bar{X}_1 - \bar{X}_2)} = S_p \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$
Where: X1 =170, X 2 = 205, n1 = 20, n2 = 18, S1 = 20, S2 = 25, V = 36
Sp =
(19)(202 ) (17)(252 )
20 18 2
+
+-
= 22.5
(12)
22.5 38
X X 360 S - =
= 7.31
t=
170 205
7.31
-
= 4.79
t0.05(36) = 1.96 (since d.f. > 30 we use the normal tables)
When d.f. > 30 the t distribution is approximately the same as the normal distribution, so the
table value of t at the 5% level of significance for 36 d.f. is taken as 1.96. Since the computed
value of |t| is more than the table value, we reject the null hypothesis. Thus, we conclude that
there is a significant difference in the average sales between the two salesmen.
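When only summary figures are available, as in this example, the same test can be worked from the means, standard deviations and sample sizes. The sketch below assumes a hypothetical helper named pooled_t_from_summary and uses the two-sided 5% normal value 1.96 as the critical value because d.f. > 30.

```python
import math


def pooled_t_from_summary(mean1, s1, n1, mean2, s2, n2):
    """Pooled two-sample t statistic from summary statistics (hypothetical helper)."""
    df = n1 + n2 - 2
    # Pooled standard deviation Sp
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    # Standard error of the difference between the two sample means
    se = sp * math.sqrt((n1 + n2) / (n1 * n2))
    t = (mean1 - mean2) / se
    return t, sp, se, df


t, sp, se, df = pooled_t_from_summary(170, 20, 20, 205, 25, 18)
critical = 1.96  # two-sided 5% value from the normal tables (d.f. > 30)

print(f"Sp = {sp:.2f}, standard error = {se:.2f}, t = {t:.2f}, d.f. = {df}")
print("reject H0" if abs(t) > critical else "accept H0")
```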
Testing the hypothesis of equality of two variances
The test for equality of two population variances is based on the variances in two independently
selected random samples drawn from two normal populations.
Under the null hypothesis $\sigma_1^2 = \sigma_2^2$,
$F = \dfrac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}$
Now under H0 : $\sigma_1^2 = \sigma_2^2$ it follows that
$F = \dfrac{S_1^2}{S_2^2}$
which is the test statistic.
This statistic follows the F distribution with V1 and V2 degrees of freedom. The larger sample
variance is placed in the numerator and the smaller one in the denominator.
If the computed value of F exceeds the table value of F, we reject the null hypothesis, i.e. the
alternative hypothesis is accepted.
Example
In one sample of 10 observations the sum of the squares of the deviations of the sample values from
the sample mean was 120, and in another sample of 12 observations it was 314. Test whether the
difference is significant at the 5% level of significance.
Solution
Given that n1 = 10, n2 = 12, $\sum (X_1 - \bar{X}_1)^2 = 120$ and $\sum (X_2 - \bar{X}_2)^2 = 314$
Let us take the null hypothesis that the two samples are drawn from the same normal population
of equal variance.
H0 : $\sigma_1^2 = \sigma_2^2$
H1 : $\sigma_1^2 \neq \sigma_2^2$
Applying the F test, i.e.
$F = \dfrac{S_1^2}{S_2^2} = \dfrac{\sum (X_1 - \bar{X}_1)^2 / (n_1 - 1)}{\sum (X_2 - \bar{X}_2)^2 / (n_2 - 1)} = \dfrac{120 / 9}{314 / 11} = \dfrac{13.33}{28.55}$
Since the numerator should be greater than the denominator,
$F = \dfrac{28.55}{13.33} = 2.1$
The table value of F at the 5% level of significance is read for V1 = 9 and V2 = 11 degrees of
freedom. Since the calculated value of F is less than the table value, we accept the null
hypothesis. The samples may have been drawn from two populations having the same variance.
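The variance ratio above can be reproduced in a few lines. This is a sketch only; variance_ratio is a hypothetical helper that takes each sample's sum of squared deviations and size, and places the larger sample variance in the numerator as the text requires. The table value of F is not computed here and would still be read from the F tables.

```python
def variance_ratio(ss_a, n_a, ss_b, n_b):
    """Return (F, numerator d.f., denominator d.f.) with the larger variance on top.

    ss_a and ss_b are sums of squared deviations from the sample means.
    """
    var_a = ss_a / (n_a - 1)
    var_b = ss_b / (n_b - 1)
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1


f_value, df_num, df_den = variance_ratio(120, 10, 314, 12)

# The second sample has the larger variance, so its d.f. (12 - 1) go with the numerator.
print(f"F = {f_value:.2f} (numerator d.f. = {df_num}, denominator d.f. = {df_den})")
```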
Chi square hypothesis tests (non-parametric tests) (χ2)
They include, among others:
Test for goodness of fit
Test for independence of attributes
Test of homogeneity
Test for population variance
The Chi square test (χ2) is used when comparing an actual (observed) distribution with a
hypothesized, or expected, distribution.
It is given by:
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
where O = observed frequency
E = expected frequency
The computed value of χ2 is compared with the tabulated value of χ2 for a given significance level
and degrees of freedom.
i) Test for goodness of fit
These tests are used when we want to determine whether an actual sample distribution matches
a known theoretical distribution.
The null hypothesis usually states that the sample is drawn from the theoretical population
distribution and the alternate hypothesis usually states that it is not.
Example
Mr Nguku carried out a survey of 320 families in Ateka district, each with 5 children, which
revealed the following distribution:
No. of boys        5    4    3     2    1    0
No. of girls       0    1    2     3    4    5
No. of families   14   56   110   88   40   12
Is the result consistent with the hypothesis that male and female births are equally probable, at
the 5% level of significance?
Solution
If male and female births are equally probable, then the number of boys in a family conforms to a
binomial distribution with probability p = ½.
Therefore
H0: the observed number of boys conforms to a binomial distribution with p = ½
H1: the observations do not conform to a binomial distribution with p = ½
On the assumption that male and female births are equally probable, the probability of a male
birth is p = ½. The expected number of families can be calculated by the use of the binomial
distribution. The probability of x male births in a family of 5 is given by
$P(x) = {}^5C_x \, p^x q^{5-x}$ (for x = 0, 1, 2, 3, 4, 5)
$= {}^5C_x \left(\tfrac{1}{2}\right)^5$ (since p = q = ½)
To get the expected frequencies, multiply P(x) by the total number N = 320. The calculations are
shown below in the table.
x    P(x)                    Expected frequency = N P(x)
0    5C0 (½)^5 = 1/32        320 × 1/32 = 10
1    5C1 (½)^5 = 5/32        320 × 5/32 = 50
2    5C2 (½)^5 = 10/32       320 × 10/32 = 100
3    5C3 (½)^5 = 10/32       320 × 10/32 = 100
4    5C4 (½)^5 = 5/32        320 × 5/32 = 50
5    5C5 (½)^5 = 1/32        320 × 1/32 = 10
Arranging the observed and expected frequencies in the following table and calculating χ2:
O      E      $(O-E)^2$   $(O-E)^2/E$
14     10     16          1.60
56     50     36          0.72
110    100    100         1.00
88     100    144         1.44
40     50     100         2.00
12     10     4           0.40
$\sum (O-E)^2/E = 7.16$
$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 7.16$
The table value of χ2 for V = 6 – 1 = 5 at the 5% level of significance is 11.07. The computed
value of χ2 = 7.16 is less than the table value. Therefore the hypothesis is accepted. Thus it can
be concluded that the result is consistent with male and female births being equally probable.
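The goodness of fit calculation can be sketched in code as follows. The variable names are illustrative assumptions; the observed frequencies, the binomial model with p = ½ and the table value 11.07 all come from the example. Only the standard library function math.comb is used.

```python
from math import comb

# Observed number of families, keyed by number of boys in a family of 5
observed = {5: 14, 4: 56, 3: 110, 2: 88, 1: 40, 0: 12}
n_children = 5
p_boy = 0.5
total_families = sum(observed.values())

# Expected frequency = N * P(x), with P(x) from the binomial distribution
expected = {
    x: total_families * comb(n_children, x) * p_boy ** x * (1 - p_boy) ** (n_children - x)
    for x in observed
}

chi_square = sum((observed[x] - expected[x]) ** 2 / expected[x] for x in observed)
df = len(observed) - 1
critical = 11.07  # table value of chi square for 5 d.f. at the 5% level

print(f"chi square = {chi_square:.2f} with {df} degrees of freedom")
print("reject H0" if chi_square > critical else "accept H0")
```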
ii) Test of independence of attributes
This test discloses whether there is any association or relationship between two or more
attributes.
The following steps are required to perform the test of hypothesis.
1. The null and alternative hypotheses are set as follows:
H0: No association exists between the attributes
H1: An association exists between the attributes
2. Under H0, an expected frequency E corresponding to each cell in the contingency table is
found by using the formula
$E = \dfrac{R \times C}{n}$
where R = a row total, C = a column total and n = sample size.
3. Based upon the observed values and corresponding expected frequencies, the χ2 statistic is
obtained using the formula
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
4. The characteristics of this distribution are defined by the number of degrees of freedom
(d.f.), which is given by
d.f. = (r – 1)(c – 1),
where r is the number of rows and c is the number of columns. Corresponding to a chosen level
of significance, the critical value is found from the chi squared table.
5. The calculated value of χ2 is compared with the tabulated value of χ2 for (r – 1)(c – 1) degrees
of freedom at a certain level of significance. If the computed value of χ2 is greater than the
tabulated value, the null hypothesis of independence is rejected. Otherwise we accept it.
Example
In a sample of 200 people with a particular disease, 100 were given a drug and
the others were not given any drug. The results are as follows:
             Drug   No drug   Total
Cured        65     55        120
Not cured    35     45        80
Total        100    100       200
Test whether the drug is effective or not, at the 5% level of significance.
Solution
Let us take the null hypothesis that the drug is not effective in curing the disease.
Applying the χ2 test:
The expected cell frequencies are computed as follows
$E_{11} = \dfrac{R_1 C_1}{n} = \dfrac{120 \times 100}{200} = 60$
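$E_{12} = \dfrac{R_1 C_2}{n} = \dfrac{120 \times 100}{200} = 60$
$E_{21} = \dfrac{R_2 C_1}{n} = \dfrac{80 \times 100}{200} = 40$
$E_{22} = \dfrac{R_2 C_2}{n} = \dfrac{80 \times 100}{200} = 40$
The table of expected frequencies is as follows
             Drug   No drug   Total
Cured        60     60        120
Not cured    40     40        80
Total        100    100       200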
Arranging the observed frequencies with their corresponding expected frequencies in the following
table we get
O     E     $(O-E)^2$   $(O-E)^2/E$
65    60    25          0.417
55    60    25          0.417
35    40    25          0.625
45    40    25          0.625
$\sum (O-E)^2/E = 2.084$
$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 2.084$
V = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1; tabulated $\chi^2_{0.05} = 3.841$
The calculated value of χ2 is less than the table value. The null hypothesis is accepted. Hence
the data do not show that the drug is effective in curing the disease.
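The five steps of the test of independence can be sketched as a small function. The name chi_square_independence is a hypothetical helper, not a library call; the observed table and the critical value 3.841 are taken from the example above.

```python
def chi_square_independence(table):
    """Return (chi square, degrees of freedom) for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E = (R x C) / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)            # (r - 1)(c - 1)
    return chi2, df


# Rows: cured, not cured. Columns: drug, no drug.
observed = [[65, 55], [35, 45]]
chi2, df = chi_square_independence(observed)
critical = 3.841  # table value for 1 d.f. at the 5% level

print(f"chi square = {chi2:.3f} with {df} degree(s) of freedom")
print("reject H0" if chi2 > critical else "accept H0")
```

The unrounded statistic is slightly below the 2.084 obtained in the text, because the text rounds each cell contribution to three decimal places before adding.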
iii) Test of homogeneity
This is concerned with the proposition that several populations are homogeneous with respect to
some characteristic of interest, e.g. one may be interested in knowing if raw materials available
from several retailers are homogeneous. A random sample is drawn from each of the populations
and the number in each sample falling into each category is determined. The sample data is
displayed in a contingency table.
The analytical procedure is the same as that given for the test of independence.
Example
A random sample of 400 persons in total was selected from three age groups and each person
was asked to specify which type of TV programme they preferred. The results are shown in
the following table.
                 Type of programme
Age group        A      B      C      Total
Under 30         120    30     50     200
30 – 44          10     75     15     100
45 and above     10     30     60     100
Total            140    135    125    400
Test the hypothesis that the populations are homogeneous with respect to the types of television
programmes they prefer, at the 5% level of significance.
Solution
Let us take the hypothesis that the populations are homogeneous with respect to the different types
of television programmes they prefer.
Applying the χ2 test, with each expected frequency found as (row total × column total)/n:
O      E        $(O-E)^2$   $(O-E)^2/E$
120    70.00    2500.00     35.7143
10     35.00    625.00      17.8571
10     35.00    625.00      17.8571
30     67.50    1406.25     20.8333
75     33.75    1701.56     50.4166
30     33.75    14.06       0.4166
50     62.50    156.25      2.5000
15     31.25    264.06      8.4499
60     31.25    826.56      26.4499
$\sum (O-E)^2/E = 180.4948$
$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 180.4948$
The table value of χ2 for (3 – 1)(3 – 1) = 4 d.f. at the 5% level of significance is 9.488.
The calculated value of χ2 is greater than the table value. We reject the hypothesis and conclude
that the populations are not homogeneous with respect to the type of TV programmes preferred;
thus the different age groups vary in their choice of TV programmes.
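As a check on the table of expected frequencies, the following sketch rebuilds it from the row and column totals and then sums the cell contributions. The variable names are illustrative assumptions; the observed counts and the table value 9.488 come from the example.

```python
# Rows: under 30, 30 to 44, 45 and above. Columns: programme types A, B, C.
observed = [
    [120, 30, 50],
    [10, 75, 15],
    [10, 30, 60],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell = (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Contribution of each cell to chi square
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_square = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 9.488  # table value for 4 d.f. at the 5% level

for exp_row in expected:
    print(exp_row)
print(f"chi square = {chi_square:.4f} with {df} degrees of freedom")
print("reject H0" if chi_square > critical else "accept H0")
```

Small differences in the last decimal places from the hand calculation are due to rounding of the individual cell values in the table.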
Chapter Summary
Methods of sampling
a. Random or probability sampling methods
These include
Simple random sampling
Stratified sampling
Systematic sampling
Multi stage sampling
b. Non-random or non-probability sampling methods
These consist of
Judgment sampling
Quota sampling
Cluster sampling
- A hypothesis is a claim or an opinion about an item or issue. It therefore has to be tested
statistically in order to establish whether it is correct.
- When testing a hypothesis, one must fully understand the two basic hypotheses to be tested,
namely
The null hypothesis (H0)
The alternative hypothesis (H1)
Standard hypothesis tests
Normal test
Tests a sample mean ($\bar{X}$) against a population mean (μ) (where sample size n > 30 and the
population variance σ2 is known) and a sample proportion, P (where np > 5 and nq > 5, since in this
case the normal distribution can be used to approximate the binomial distribution).
t test
Tests a sample mean ($\bar{X}$) against a population mean, especially where the population variance
is unknown and n < 30.
Variance ratio test or F test
It is used to compare population variances with samples of any size drawn from normal
populations.
Chi squared test
It can be used to test the association between attributes or the goodness of fit of an observed
frequency distribution to a standard distribution
Chapter Quiz
1. Which is the odd one out?
(a) Simple random sampling
(b) Stratified sampling
(c) Systematic sampling
(d) Continuous sampling
(e) Multi stage sampling
2. If the difference cannot be explained solely by ordinary chance, then the difference
is said to be ____________
3. __________ is an assumption that nothing has changed.
4. What is the formula for the t distribution?
5. Which one is not in this category?
a) Judgment sampling
b) Quota sampling
c) Systematic sampling
d) Cluster sampling
Answers to chapter quiz
1. (d) Continuous sampling
2. Statistically significant
3. Null hypothesis
4. $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$, where s is the sample standard deviation
5. (c) Systematic sampling
Questions from previous exams
December 2000 Question 4
a) Briefly explain each of the following distributions, indicating whether it is a discrete or a
continuous distribution.
(i) Binomial distribution (3 marks)
(ii) Poisson distribution (3 marks)
(iii) Normal distribution (3 marks)
(iv) Chi-square distribution (3 marks)
(v) Fisher (F) distribution (3 marks)
b) Give one example in the accounting profession where each of the above distributions
can be applied. (5 marks)
(Total: 20 marks)
December 2002 Question 2
a) Transparency and Certified Public Accountants (CPAs) have been appointed to
audit accounts of Health National Hospital. Due to the large number of accounts, the
consultants have decided to audit a random sample of the accounts.
Required
(i) State the advantages of auditing a sample of the accounts. (3 marks)
(ii) Describe briefly the sampling technique you would recommend. (5 marks)
(iii) What are the advantages and disadvantages of the sampling technique recommended
in (ii) above? (4 marks)
b) State and briefly explain the qualities of a good point estimator. (8 marks)
(Total: 20 marks)
June 2003 Question 4
a) Nation Standard Newspaper poll for the year 2002 presidential campaign in Kenya
sampled 491 potential voters in October 2002. A primary purpose of the poll was to
obtain an estimate of the proportion of potential voters who favour each candidate.
Close to the December 2002 elections, better precision and smaller margins of error were
desired. Assume a planning value for the population proportion of p = 0.50 and that a
95% confidence level is desired.
Required:
Determine the recommended sample size for each of the following surveys:
Survey              Margin of error
Early December      0.02
Pre-election day    0.01
(8 marks)
b) Future Computer Company has developed a new computer accounting software
package to help accountancy analysts reduce the time required to design, develop
and implement an accounting system. To evaluate the benefits of the software
package, a random sample of 24 accountancy analysts is selected. Each analyst is
given specifications for a hypothetical accounting system. Then 12 of the analysts are
instructed to produce the accounting system by using the current technology. The other
12 analysts are trained in the use of the software package and then instructed to use it
to produce the accounting system. The 24 analysts complete the study and the results
are shown below:
Completion Time and Summary Statistics for the Software Testing Study
Current technology    New technology
300                   276
280                   222
344                   310
385                   338
372                   200
360                   302
288                   317
321                   260
376                   320
290                   312
301                   334
283                   265
Sample size                  n1 = 12               n2 = 12
Sample mean                  $\bar{X}_1 = 325$     $\bar{X}_2 = 288$
Sample standard deviation    S1 = 40               S2 = 44
Required:
Determine whether the new software package should be adopted at the 95% confidence level.
(7 marks)
Note:
$S^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
and
$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
(Total: 20 marks)
CHAPTER FIVE
Regression, Time Series and Forecasting
Objectives
At the end of this chapter, you should be able to:
•• Establish the relationship between two or more variables.
•• Understand a particular situation, explain it and then analyse it.
•• Discuss the assumptions underlying analysis of the linear regression model.
•• Build a qualitative model of which factors are likely to influence the dependent
variable.
Fast Forward: Time series forecasting is the use of a model to forecast future events based on
known past events, that is, to predict future data points before they are measured.
Introduction
All businesses have to plan their future activities, both short and long term. Managers will have
to make forecasts of the future values of important variables such as sales, interest rates, costs
etc. In this chapter, we will look at ways of using past information to make these