Sampling CH-2
2.1. Basic Concepts
Sampling without replacement (wor) means that once a unit has been selected, it cannot be
selected again. In other words, no unit can appear more than once in the sample. If n sample
units are to be selected from a population of N units, then there are $\binom{N}{n}$ ways of
selecting n units out of a total of N units without replacement, disregarding the order of the
n units. Hence, simple random sampling is equivalent to the selection of one of the
$\binom{N}{n}$ possible samples with an equal probability $1/\binom{N}{n}$ assigned to each sample.
In simple random sampling without replacement, the probability of a specified unit of the
population being selected at any given draw is equal to the probability of its being selected at the
first draw, that is, $1/N$. For a sample of size n, the sum of the probabilities of these n
mutually exclusive events is $n/N$.
The process of sampling with replacement (wr) allows a unit to be selected on more than one
draw. There are $N^n$ ways of selecting n units out of a total of N units with replacement. In this
case, the order of selection is taken into account. All selections are independent, since the selected
unit is returned to the population before the next selection is made. Thus, the probability is $1/N$
for any specific element on each of the n draws.
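As a quick numeric check of these counts, the following minimal sketch computes the number of possible samples under both schemes. The function name `count_samples` and the values N = 5, n = 3 are illustrative assumptions, not part of the original text.

```python
from math import comb


def count_samples(N, n):
    """Return (unordered samples without replacement, ordered samples with replacement)."""
    return comb(N, n), N ** n


# Illustrative values (assumed for this sketch): N = 5 units, n = 3 draws.
N, n = 5, 3
wor, wr = count_samples(N, n)
print("samples without replacement:", wor)
print("samples with replacement:   ", wr)
print("probability of any one wor sample:", 1 / wor)
print("probability a given unit is in a wor sample:", n / N)
```

Note that the count without replacement ignores order, while the count with replacement treats each ordered sequence of draws as a separate outcome.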
In a sample survey, bias can enter the selection procedure when units are chosen by a non-random
method, that is, when the selection is consciously or unconsciously influenced by the subjective
judgment of a human being. Such bias can be avoided by using a random selection method. True
randomness can be ensured by using a method of selection that cannot be affected by human influence.
There are different random sample selection methods. The important aspect of random selection
in each method is that the selection of each unit is based purely on chance. This chance is known
as the probability of selection, and it eliminates selection bias. If there is a bias in the selection, it
may prevent the sample from being representative of the population. Representative means that
the sample gives accurate estimates for the total population; probability samples permit a
scientific approach to obtaining such estimates. We consider here two basic and common
procedures of random selection.
Lottery Method: This is a very common method of taking a random sample. Under this method,
we label each member of the population with an identifiable disc, ticket, or piece of paper. The discs
or tickets must be of identical size, color and shape. They are placed in a container (urn/bowl)
and well mixed before each draw. Then, without looking into the container, the designated labels
are drawn with or without replacement. The series of draws is continued until a sample of the
required size is selected. This procedure ensures that the selection of each item depends entirely
on chance.
For example, if we want to take a sample of 18 persons out of a population of 90 persons, the
procedure is to write the names of all the 90 persons on separate slips (tickets) of paper. The slips
(tickets) of paper must be of identical size, color and shape. The next step is to fold these slips,
mix them thoroughly and then make a blindfold selection of 18 slips one at a time without
replacement.
This lottery method becomes quite cumbersome and time consuming to use as the sizes of
sample and population increase. To avoid such problems and to reduce the labor of selection
process, another method known as a random number table selection process can be used.
Use of Random Numbers: A table of random numbers consists of digits from 0 to 9, which are
equally represented with no pattern or order, produced by a computer random number generator.
Identify the population units (N) and give them serial numbers from 1 to N. This total number
N determines how many random digits we need to read at a time when selecting the sample
elements. This requires preparation of an accurate sampling frame.
Decide the sample size (n) to be selected, which will indicate the total serial numbers to
be selected.
Select a starting point of the table of random numbers; you can start from any one of the
columns, which can be determined randomly.
Since each digit has an equal chance of being selected at any draw, you may read down
columns of digits in the table.
Depending on the population size N, you can use numbers in pairs, three at a time, four at
a time, and so on, to read from the table.
If a selected number is between 1 and the population size N, inclusive, it is
taken as a sample serial number.
All selected numbers greater than N should be ignored.
For sampling without replacement, reject numbers that come up for a second time.
The selection process continues until n distinct units are obtained.
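The steps above can be sketched in code. In this minimal sketch, Python's pseudo-random generator stands in for the printed table of random digits; the function name `select_serial_numbers` and the seed value are assumptions made for illustration only.

```python
import random


def select_serial_numbers(N, n, rng):
    """Draw n distinct serial numbers in 1..N by reading blocks of random digits."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N for sampling without replacement")
    width = len(str(N))                      # how many digits to read at a time
    chosen = []
    while len(chosen) < n:
        # Stand-in for reading the next `width` digits from a random number table.
        digits = "".join(str(rng.randrange(10)) for _ in range(width))
        number = int(digits)
        if number == 0 or number > N:
            continue                         # outside 1..N: ignore
        if number in chosen:
            continue                         # already selected: reject the repeat
        chosen.append(number)
    return chosen


# Assumed seed, used only so that the sketch is reproducible.
sample = select_serial_numbers(N=5000, n=25, rng=random.Random(2024))
print(sample)
```

For sampling with replacement, the check for repeated numbers would simply be dropped.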
For example, consider a population of size N = 5000. Suppose it is desired to take a sample of
25 items out of 5000 without replacement. Since N = 5000, we need four-digit numbers. All
items should be numbered from 1 to 5000. We can start anywhere in the table and read digits
four at a time. Thus, using the random number table found at the end of this chapter, if we start from
column five and read down the columns, we obtain 2913, 2108, 2993, 2425, 1365, 1760,
2104, 1266, 4033, 4147, 0334, 4225, 0150, 2940, 1836, 1322, 2362, 3942, 3172, 2893, 3933,
2514, 1578, 3649, 0784, ignoring all numbers greater than 5000.
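Terminologies
N = Population size
n = Sample size
$Y_i$ = Value of the character under study for the ith unit in the population
$Y = \sum_{i=1}^{N} Y_i$ = Population total
$\bar{Y} = \sum_{i=1}^{N} Y_i / N$ = Population mean
$\bar{y} = \sum_{i=1}^{n} y_i / n$ = Sample mean
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$, $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$ = Population variance
The relationship between these two variances can be established by expressing each variance in
terms of the other, i.e., $\sigma^2 = \frac{N-1}{N}S^2$ or $S^2 = \frac{N}{N-1}\sigma^2$.
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ = Sample variance, and its square root, denoted by s, is the
standard deviation of the sample elements.
$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$, or $\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})$, is the covariance of the random variables X and Y.
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ is the sample covariance
$f = n/N$ is the sampling fraction
CV = coefficient of variation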
Sample statistics are computed from the results of sample surveys because, in practice, almost
all population parameters are unknown, and the primary objective of a sample survey is to
provide estimates of them.
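Theorem 1: The probability of selecting a specified unit of the population at any given
draw is equal to the probability of its being selected at the first draw.
Proof: Let $y_i$ be the event that a specified unit is selected at the ith draw. Then
$P(y_i)$ = Prob.{the specified unit is not selected at any one of the previous (i-1) draws and is then selected at the ith draw}
$$P(y_i) = \prod_{k=1}^{i-1}\left[1 - \frac{1}{N-k+1}\right]\cdot\frac{1}{N-i+1}$$
$$= \frac{N-1}{N}\cdot\frac{N-2}{N-1}\cdots\frac{N-i+1}{N-i+2}\cdot\frac{1}{N-i+1} = \frac{1}{N}$$
That means $P(y_i) = P(y_1) = 1/N$.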
Theorem 2: The probability that a specified unit is selected in the sample of size n is $n/N$.
Proof: A specified unit can be selected in the sample of size n in n mutually exclusive
ways: it can be selected at the ith draw (i = 1, 2, …, n), and the probability that
it is selected at the ith draw is $1/N$.
Therefore, the probability that a specified unit is included in the sample is the sum of the
probabilities of inclusion in the sample at the 1st draw, 2nd draw, … , nth draw. Thus, by the addition
theorem of probability, we get
$$P\left(\bigcup_{i=1}^{n} y_i\right) = \sum_{i=1}^{n} P(y_i) = \sum_{i=1}^{n}\frac{1}{N} = \frac{n}{N}$$
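Both theorems can be checked by brute force on a small population. The sketch below enumerates every ordered sample drawn without replacement, all of which are equally likely, and counts how often a chosen unit appears at each draw and in the sample as a whole. The function name `draw_probabilities` and the values N = 5, n = 3 are assumptions for illustration.

```python
from fractions import Fraction
from itertools import permutations


def draw_probabilities(N, n, unit=0):
    """Return ([P(unit selected at draw i) for each draw], P(unit is in the sample))."""
    ordered = list(permutations(range(N), n))   # all equally likely ordered samples
    total = len(ordered)
    per_draw = [
        Fraction(sum(1 for s in ordered if s[i] == unit), total)
        for i in range(n)
    ]
    included = Fraction(sum(1 for s in ordered if unit in s), total)
    return per_draw, included


per_draw, included = draw_probabilities(N=5, n=3)
print("per-draw probabilities:", per_draw)   # each should equal 1/N
print("inclusion probability: ", included)   # should equal n/N
```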
Example: For demonstration purposes we will consider a very small hypothetical population of 5
farmers who use fertilizer in their farming. Suppose the amounts of fertilizer used (in kg) by the
farmers are 70, 78, 80, 80, and 95. The following population parameters and sample
values (statistics) are computed to illustrate the basic idea behind estimation.
Let $Y_i$ denote the amount of fertilizer used by the ith farmer (i = 1, 2, ..., 5). The population size
is 5, i.e. N = 5. The total amount of fertilizer used by all farmers and the average fertilizer
consumption per farmer are computed as follows.
$$Y = \sum_{i=1}^{5} Y_i = 70 + 78 + 80 + 80 + 95 = 403 \text{ kg}$$
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{5} Y_i = \frac{403}{5} = 80.6 \text{ kg}$$
Assume that a sample of three farmers is selected from the five farmers to estimate the
population parameters. The total number of possible samples is $\binom{N}{n} = \binom{5}{3} = 10$.
The following table shows the ten possible samples with their corresponding values and
sample means. Let $F_i$ represent the ith farmer, i = 1, 2, ..., 5.
For each possible sample, dividing the sum of the amounts of fertilizer used by the sample
size gives the sample mean ($\bar{y}$). For instance, the mean of the first sample is
(70 + 78 + 80)/3 = 76.00, and the remaining sample means can be calculated in a similar way.
Table 1: Possible samples with corresponding values and sample means
Sample        Values (kg)    Sample mean
F1, F2, F3    70, 78, 80     76.00
F1, F2, F4    70, 78, 80     76.00
F1, F2, F5    70, 78, 95     81.00
F1, F3, F4    70, 80, 80     76.67
F1, F3, F5    70, 80, 95     81.67
F1, F4, F5    70, 80, 95     81.67
F2, F3, F4    78, 80, 80     79.33
F2, F3, F5    78, 80, 95     84.33
F2, F4, F5    78, 80, 95     84.33
F3, F4, F5    80, 80, 95     85.00
From the values of the random variable $\bar{y}$, we can construct the frequency distribution shown
below. From this frequency distribution we obtain the probabilities of $\bar{y}$ by dividing
each frequency by the sum of the frequencies.
Values of $\bar{y}$    Frequency (f)    Probability of $\bar{y}$
76.00          2                2/10 = 0.2
76.67          1                1/10 = 0.1
79.33          1                1/10 = 0.1
81.00          1                1/10 = 0.1
81.67          2                2/10 = 0.2
84.33          2                2/10 = 0.2
85.00          1                1/10 = 0.1
Total          10               1.00
This table gives the sampling distribution of $\bar{y}$. If we draw just one sample of three farmers
from the population of five farmers, we may draw any one of the 10 possible samples.
Hence, the sample mean $\bar{y}$ can assume any one of the values listed above with the corresponding
probabilities.
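The sampling distribution of the sample mean for the five-farmer example can be reproduced by enumerating all samples of three farmers. This is a minimal sketch; the function name `sampling_distribution` is an assumption, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

fertilizer = [70, 78, 80, 80, 95]    # kg used by farmers F1..F5 (from the example)


def sampling_distribution(values, n):
    """Map each possible sample mean to its probability under SRS without replacement."""
    samples = list(combinations(values, n))   # combinations are taken by position
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}


dist = sampling_distribution(fertilizer, 3)
for sample_mean, prob in dist.items():
    print(f"{float(sample_mean):6.2f}   {prob}")

# Expected value of the sample mean over all possible samples.
expected_mean = sum(m * p for m, p in dist.items())
print("E(ybar) =", float(expected_mean))
```

Because `combinations` works on positions rather than values, the two farmers who both used 80 kg are treated as distinct units, as they should be.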
The overall mean, calculated from all possible samples, is equal to the true
population mean. That is, the expected value of $\bar{y}$, denoted by $E(\bar{y})$, taken over all possible
samples equals the true mean of the population. From the table,
$$E(\bar{y}) = \frac{\sum_{k=1}^{10}\bar{y}_k}{10} = \frac{806}{10} = 80.6,$$
which is the same as $\bar{Y}$. It can also be calculated using the probability concept, that is,
$$E(\bar{y}) = \sum \bar{y}\,P(\bar{y}) = 80.6 = \bar{Y}.$$
What is the deviation of the sample mean from the true population mean? It can be observed
that a sample mean may be equal to or different from the true population mean. This deviation
can be assessed in terms of probability. We continue with the same example to explain the
properties of this deviation, considering deviations of at most one unit, two units, and
four units from the true population mean.
$$P(|\bar{y} - \bar{Y}| \le 1) = P(79.6 \le \bar{y} \le 81.6) = 1/10$$
$$P(|\bar{y} - \bar{Y}| \le 2) = P(78.6 \le \bar{y} \le 82.6) = 4/10$$
$$P(|\bar{y} - \bar{Y}| \le 4) = P(76.6 \le \bar{y} \le 84.6) = 7/10$$
This indicates that the greater the demands we make of being close to the "true" value, the smaller
the chance we have of fulfilling them.
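We know that the population variance is $S^2 = 81.8$, and the sample results show that $E(s^2) = S^2$,
up to rounding error. But the sampling variance, $Var(\bar{y})$, is not the same as the population variance
$S^2$, that is, $Var(\bar{y}) \ne S^2$. The two are related as follows:
$$Var(\bar{y}) = (1 - f)\frac{S^2}{n},$$ where $1 - f = (N-n)/N$ is a finite population correction (fpc).
The sample mean is an unbiased estimator of the population mean. We have
$$E(\bar{y}) = E\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right] = \frac{1}{n}\sum_{i=1}^{n} E(y_i),$$ where, because each of the N population units is equally likely to be selected at the ith draw,
$$E(y_i) = \sum_{j=1}^{N} Y_j\cdot\frac{1}{N} = \bar{Y}.$$
Hence, $$E(\bar{y}) = \frac{1}{n}\sum_{i=1}^{n}\bar{Y} = \frac{1}{n}\,n\bar{Y} = \bar{Y}.$$
The variance of the sample mean under SRSWOR is
$$Var(\bar{y}) = \frac{N-n}{Nn}S^2 = (1-f)\frac{S^2}{n}.$$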
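Proof: we have
$$Var(\bar{y}) = E[\bar{y} - E(\bar{y})]^2 = E(\bar{y}^2) - \bar{Y}^2$$
Now $$E(\bar{y}^2) = E\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right]^2 = \frac{1}{n^2}E\left[\sum_{i=1}^{n} y_i^2 + \sum_{i \ne j} y_i y_j\right]$$
$$= \frac{1}{n^2}\left[\sum_{i=1}^{n} E(y_i^2) + \sum_{i \ne j} E(y_i y_j)\right]$$
But $E(y_i^2) = \frac{1}{N}\sum_{k=1}^{N} Y_k^2$, and
$$\sum_{k=1}^{N}(Y_k - \bar{Y})^2 = \sum_{k=1}^{N} Y_k^2 - N\bar{Y}^2 = (N-1)S^2$$
Therefore $$\sum_{i=1}^{n} E(y_i^2) = n\left[\left(\frac{N-1}{N}\right)S^2 + \bar{Y}^2\right]$$
Also $$\sum_{i \ne j} E(y_i y_j) = \frac{n(n-1)}{N(N-1)}\sum_{k \ne l} Y_k Y_l = \frac{n(n-1)}{N(N-1)}\left[\left(\sum_{k=1}^{N} Y_k\right)^2 - \sum_{k=1}^{N} Y_k^2\right]$$
$$= \frac{n(n-1)}{N(N-1)}\left[N^2\bar{Y}^2 - (N-1)S^2 - N\bar{Y}^2\right]$$
$$= \frac{n(n-1)}{N(N-1)}\left[N(N-1)\bar{Y}^2 - (N-1)S^2\right]$$
$$= n(n-1)\left[\bar{Y}^2 - \frac{S^2}{N}\right]$$
$$E(\bar{y}^2) = \frac{1}{n^2}\left[\sum_{i=1}^{n} E(y_i^2) + \sum_{i \ne j} E(y_i y_j)\right]$$
$$= \frac{1}{n}\left[\left(\frac{N-1}{N}\right)S^2 + \bar{Y}^2\right] + \frac{n-1}{n}\left[\bar{Y}^2 - \frac{S^2}{N}\right]$$
$$= \bar{Y}^2 + \left(\frac{N-1}{nN}\right)S^2 - \left(\frac{n-1}{nN}\right)S^2$$
$$= \bar{Y}^2 + \left(\frac{N-n}{nN}\right)S^2$$
So, $$Var(\bar{y}) = E(\bar{y}^2) - \bar{Y}^2 = \bar{Y}^2 + \left(\frac{N-n}{nN}\right)S^2 - \bar{Y}^2$$
$$= \left(\frac{N-n}{N}\right)\frac{S^2}{n}$$
$$= (1-f)\frac{S^2}{n}$$
In the case of SRSWR the draws are independent, so
$$Var(\bar{y}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(y_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}$$
But $\sigma^2 = \frac{N-1}{N}S^2$.
Then, $$Var(\bar{y})_{WR} = \frac{N-1}{Nn}S^2$$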
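The two standard variance formulas, $Var(\bar{y}) = \frac{N-n}{Nn}S^2$ without replacement and $Var(\bar{y}) = \frac{N-1}{Nn}S^2$ with replacement, can be checked numerically on the five-farmer population by enumerating every equally likely sample. This is a minimal sketch; the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def population_S2(values):
    """Population variance S^2 with divisor N - 1."""
    N = len(values)
    mean = Fraction(sum(values), N)
    return sum((y - mean) ** 2 for y in values) / (N - 1)


def variance_of_sample_mean(values, n, replace=False):
    """Exact Var(ybar), obtained by enumerating all equally likely samples."""
    N = len(values)
    if replace:
        samples = list(product(values, repeat=n))    # ordered samples, wr
    else:
        samples = list(combinations(values, n))      # unordered samples, wor
    pop_mean = Fraction(sum(values), N)
    means = [Fraction(sum(s), n) for s in samples]
    return sum((m - pop_mean) ** 2 for m in means) / len(means)


values = [70, 78, 80, 80, 95]
N, n = len(values), 3
S2 = population_S2(values)
wor_formula = Fraction(N - n, N * n) * S2
wr_formula = Fraction(N - 1, N * n) * S2

print("S^2           =", float(S2))
print("wor: enumerated", float(variance_of_sample_mean(values, n)),
      "formula", float(wor_formula))
print("wr:  enumerated", float(variance_of_sample_mean(values, n, replace=True)),
      "formula", float(wr_formula))
```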
Eg. From a population of 50 units, a random sample of size 10 is drawn without replacement.
From the sample the following results are obtained.
∑ ∑
∑ ̅
, which is estimate value of .
Therefore,
Eg. Draw all possible samples of size 2 from the population {8, 12, 16} and verify that
̅ ∑̅ ∑̅
∑ ̅
Then
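Since the expressions for the variance of the sample mean involve $S^2$, which is based on population
values, these expressions cannot be used in real-life applications. In order to estimate the
variance of $\bar{y}$ on the basis of a sample, an estimator of $S^2$ (or equivalently $\sigma^2$) is needed.
Consider $s^2$ as an estimator of $S^2$ or $\sigma^2$; we investigate its biasedness for $S^2$ in the cases of
SRSWOR and SRSWR.
Consider $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left[(y_i - \bar{Y}) - (\bar{y} - \bar{Y})\right]^2$$
$$= \frac{1}{n-1}\left[\sum_{i=1}^{n}(y_i - \bar{Y})^2 - n(\bar{y} - \bar{Y})^2\right]$$
$$E(s^2) = \frac{1}{n-1}\left[\sum_{i=1}^{n}E(y_i - \bar{Y})^2 - nE(\bar{y} - \bar{Y})^2\right]$$
$$= \frac{1}{n-1}\left[\sum_{i=1}^{n}Var(y_i) - n\,Var(\bar{y})\right] = \frac{n}{n-1}\left[\sigma^2 - Var(\bar{y})\right]$$
In case of SRSWOR
$Var(\bar{y}) = \frac{N-n}{Nn}S^2$. And so
$$E(s^2) = \frac{n}{n-1}\left[\sigma^2 - \frac{N-n}{Nn}S^2\right] = \frac{n}{n-1}\left[\frac{N-1}{N}S^2 - \frac{N-n}{Nn}S^2\right] = S^2$$
In case of SRSWR
$Var(\bar{y}) = \frac{\sigma^2}{n}$. And so
$$E(s^2) = \frac{n}{n-1}\left[\sigma^2 - \frac{\sigma^2}{n}\right] = \frac{n}{n-1}\left(\frac{(n-1)\sigma^2}{n}\right) = \sigma^2$$
Hence, $E(s^2) = S^2$ in SRSWOR, and $E(s^2) = \sigma^2$ in SRSWR.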
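The standard result that the sample variance $s^2$ (divisor n - 1) is unbiased for $S^2$ under SRSWOR and for $\sigma^2$ under SRSWR can be verified by enumeration on the five-farmer population. This is a minimal sketch with assumed function names.

```python
from fractions import Fraction
from itertools import combinations, product


def variance_n_minus_1(data):
    """Variance with divisor len(data) - 1 (gives s^2 for a sample, S^2 for a population)."""
    k = len(data)
    mean = Fraction(sum(data), k)
    return sum((y - mean) ** 2 for y in data) / (k - 1)


def expected_s2(values, n, replace=False):
    """Average of s^2 over all equally likely samples of size n."""
    if replace:
        samples = list(product(values, repeat=n))
    else:
        samples = list(combinations(values, n))
    return sum(variance_n_minus_1(s) for s in samples) / len(samples)


values = [70, 78, 80, 80, 95]
N = len(values)
S2 = variance_n_minus_1(values)      # population variance, divisor N - 1
sigma2 = S2 * (N - 1) / N            # population variance, divisor N

print("S^2 =", float(S2), " E(s^2) wor =", float(expected_s2(values, 3)))
print("sigma^2 =", float(sigma2), " E(s^2) wr =", float(expected_s2(values, 3, replace=True)))
```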
If we look at all these expressions, we can observe that $\sqrt{n}$ appears in the denominator of
the standard error: as n increases, the value of $\sqrt{n}$ also increases and hence the standard error
decreases. The standard error from a sample is used for various purposes. It is mainly used:
To compare the precision of estimate from SRS with that from other sampling methods.
To determine the sample size required in a survey, and
To estimate the actual precision of the survey.
Consistency: An estimate is consistent if its values tend to concentrate increasingly around the
true value as the sample size increases. In other words, the estimate assumes the population value
with probability approaching unity as the sample size tends to infinity. This definition of
consistency strictly applies to estimates based on samples drawn from an infinite population. We
use the following definition in the case of a finite population. An estimate is said to be a
consistent estimate of a population parameter if it takes the population value when n = N.
For example, the sample mean is a consistent estimate of the population mean, since $\bar{y} = \bar{Y}$ when n = N.
Example: An estimator is said to be consistent if it tends to the population value with increasing
sample size. As the size of the sample increases, the sample estimates concentrate around the
population value. By considering the population of 5 farmers, we can find all possible samples of
size 2, 3, and 4 without replacement and compute the sample results. The sampling distribution
has already been calculated for sample size three, and the sampling distributions for sample
sizes two and four can be calculated in a similar way. The possible sample
means can then be compared across the three sample sizes.
This example shows that as the sample size increases, the smallest and the largest possible
sample means both move toward the population mean.
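The narrowing of the possible sample means around the population mean can be seen by listing the smallest and largest sample mean for each sample size in the five-farmer population. This is a minimal sketch; the function name `mean_range` is an assumption.

```python
from fractions import Fraction
from itertools import combinations

values = [70, 78, 80, 80, 95]                      # fertilizer used (kg), from the example
pop_mean = Fraction(sum(values), len(values))      # 80.6


def mean_range(values, n):
    """Smallest and largest possible sample mean for samples of size n (wor)."""
    means = [Fraction(sum(s), n) for s in combinations(values, n)]
    return min(means), max(means)


for n in range(2, len(values) + 1):
    lo, hi = mean_range(values, n)
    print(f"n = {n}: sample means range from {float(lo):.2f} to {float(hi):.2f}")
print("population mean:", float(pop_mean))
```

When n = N the only possible sample is the whole population, so the range collapses to the population mean, which is the finite-population notion of consistency.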
Efficiency: A particular sampling scheme is said to be more "efficient" than another if, for a
fixed sample size, the sampling variance of survey estimates for the first scheme is less than that
for the second. For the same population, comparisons of efficiency are often made with simple
random sampling as the basic scheme, using the ratio of the variances.
For example, if $\hat{\theta}_1$ and $\hat{\theta}_2$ are two estimators of $\theta$, with equal sample size, and having variances
$V(\hat{\theta}_1)$ and $V(\hat{\theta}_2)$ respectively, the efficiency of $\hat{\theta}_1$ relative to $\hat{\theta}_2$ is given as follows.
Efficiency $= V(\hat{\theta}_2)/V(\hat{\theta}_1)$. Thus, if this ratio is greater than one, then $\hat{\theta}_1$ is a better estimator
than $\hat{\theta}_2$.
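Theorem 9: The variance of the sample mean is larger in SRSWR than in SRSWOR, i.e.,
$Var(\bar{y})_{WR} \ge Var(\bar{y})_{WOR}$.
Proof: We have $Var(\bar{y})_{WR} = \frac{N-1}{Nn}S^2$ and $Var(\bar{y})_{WOR} = \frac{N-n}{Nn}S^2$.
Therefore,
$$Var(\bar{y})_{WR} - Var(\bar{y})_{WOR} = \frac{S^2}{Nn}\left[(N-1) - (N-n)\right] = \frac{n-1}{Nn}S^2 \ge 0$$
That implies $Var(\bar{y})_{WR} \ge Var(\bar{y})_{WOR}$, with equality only when n = 1.
That means the variance of the sample mean is larger in SRSWR than in SRSWOR.
In other words, SRSWOR provides a more efficient estimate of the population mean
than SRSWR.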
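Example: A population has 7 units: 1, 2, 3, 4, 5, 6, 7. Write down all possible samples of size 2
(without replacement) which can be drawn from the given population and verify that the sample
mean is an unbiased estimate of the population mean. Also calculate its sampling variance and
verify that $Var(\bar{y}) = \frac{N-n}{Nn}S^2$.
Here N = 7 and n = 2, so there are $\binom{7}{2} = 21$ possible samples. The population mean and variance are
$$\bar{Y} = \frac{\sum Y_i}{N} = \frac{28}{7} = 4, \quad \text{and} \quad S^2 = \frac{\sum (Y_i - \bar{Y})^2}{N-1} = \frac{28}{6} = 4.667$$
Over the 21 possible sample means $\bar{y}_k$,
$$E(\bar{y}) = \frac{\sum \bar{y}_k}{21} = \frac{84}{21} = 4 = \bar{Y}$$
$$Var(\bar{y}) = \frac{\sum (\bar{y}_k - \bar{Y})^2}{21} = \frac{35}{21} = 1.667$$
$$\frac{N-n}{Nn}S^2 = \frac{5}{14}(4.667) = 1.667$$
Hence, $Var(\bar{y}) = \frac{N-n}{Nn}S^2$.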
In practice surveys are conducted only once for one specific objective. In other words, one does
not draw all possible samples to calculate the variance or the standard error of an estimate.
However, if probability-sampling methods are used, the sample estimates and their associated
measures of sampling error can be determined on the basis of a single sample.
A 95% confidence interval can be described as follows. If sampling is repeated indefinitely, each
sample will lead to a new confidence interval. Then in 95% of the samples the interval will cover
the true population value. For example, consider a sample mean $\bar{y}$, which is an unbiased estimate of
the population mean $\mu_y$. The confidence interval for $\mu_y$ is $\bar{y} \pm$ sampling error, where the
sampling error depends on the sampling distribution of $\bar{y}$. Assuming that $\bar{y}$ is approximately
normally distributed, an approximate $100(1-\alpha)\%$ confidence interval for $\mu_y$ is:
$$\left(\bar{y} - Z_{\alpha/2}\,S.E.(\bar{y}),\ \ \bar{y} + Z_{\alpha/2}\,S.E.(\bar{y})\right)$$
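Sometimes, it is also of interest to estimate the population total, e.g. total household income,
total expenditure, etc. Let Y denote the population total.
Obviously $Y = N\bar{Y}$, and its estimator is
$$\hat{Y} = N\bar{y}, \quad E(\hat{Y}) = N E(\bar{y}) = N\bar{Y} = Y$$
$$Var(\hat{Y}) = N^2\,Var(\bar{y}) = \begin{cases} \frac{N(N-n)}{n}S^2 & \text{for SRSWOR} \\ \frac{N(N-1)}{n}S^2 & \text{for SRSWR} \end{cases}$$
An approximate $100(1-\alpha)\%$ confidence interval for Y is
$$\hat{Y} \pm Z_{\alpha/2}\,S.E.(\hat{Y}) = N\bar{y} \pm Z_{\alpha/2}\,N\,S.E.(\bar{y})$$
Since $S.E.(\bar{y})$ is not known, we substitute for it the sample standard error, $s.e.(\bar{y})$,
computed from the sample observations.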
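The interval estimates for the mean and the total can be computed from a single sample as in the following minimal sketch. The data values are hypothetical (they are not the data of any example in this chapter), the function name `srswor_intervals` is an assumption, and the normal approximation is used even though it is rough for such a small sample.

```python
import math
from statistics import NormalDist, mean, variance


def srswor_intervals(sample, N, confidence=0.95):
    """Estimates and approximate confidence intervals for the mean and total (SRSWOR)."""
    n = len(sample)
    ybar = mean(sample)
    s2 = variance(sample)                              # sample variance, divisor n - 1
    se_mean = math.sqrt((1 - n / N) * s2 / n)          # s.e.(ybar) with the fpc
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    mean_ci = (ybar - z * se_mean, ybar + z * se_mean)
    total_hat = N * ybar
    total_ci = (N * mean_ci[0], N * mean_ci[1])        # same as total_hat +/- z * N * se_mean
    return ybar, mean_ci, total_hat, total_ci


# Hypothetical sample of n = 10 units from a population of N = 50 units.
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 8]
ybar, mean_ci, total_hat, total_ci = srswor_intervals(sample, N=50)
print("mean estimate: ", ybar, " 95% CI:", mean_ci)
print("total estimate:", total_hat, " 95% CI:", total_ci)
```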
Notation:
N = the number of elements in the population
Nj = the number of elements in the jth domain
nj = the number of sample elements in a SRS of size n that happen to fall in the jth domain.
Yjk are measurements on the kth element in jth domain, for k = 1, 2, - - -, nj for sample and k =1,
2, - - -, Nj for population
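The objective is to estimate the subpopulation parameters such as the mean, $\bar{Y}_j$, and the total, $Y_j$, for the
jth domain. These parameters and their estimators are computed as follows.
i) Subpopulation mean ($\bar{Y}_j$): The subpopulation mean is defined as $\bar{Y}_j = \sum_{k=1}^{N_j} Y_{jk}/N_j$ and its sample
estimator is given by $\bar{y}_j = \sum_{k=1}^{n_j} y_{jk}/n_j$.
a. $E(\bar{y}_j) = \bar{Y}_j$, b. $Var(\bar{y}_j) = (1 - f_j)\frac{S_j^2}{n_j}$, where $S_j^2 = \frac{1}{N_j - 1}\sum_{k=1}^{N_j}(Y_{jk} - \bar{Y}_j)^2$, and where $f_j = n_j/N_j$ is the
sampling fraction for the jth domain.
The sample variance is given by $var(\bar{y}_j) = (1 - f_j)\frac{s_j^2}{n_j}$, where $s_j^2 = \frac{1}{n_j - 1}\sum_{k=1}^{n_j}(y_{jk} - \bar{y}_j)^2$, and its
square root is the estimated standard error of $\bar{y}_j$.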
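ii) Subpopulation total ($Y_j$): It is given by $Y_j = \sum_{k=1}^{N_j} Y_{jk}$, and we consider two cases to get its
estimator $\hat{Y}_j$.
Case 1 is when $N_j$ is known: a) $\hat{Y}_j = N_j\bar{y}_j$, b) $var(\hat{Y}_j) = N_j^2(1 - f_j)\frac{s_j^2}{n_j}$.
Case 2 is when $N_j$ is unknown. Define $y'_i = y_i$ if the ith unit falls in the jth domain and $y'_i = 0$ otherwise. Then the estimator will
be: a) $\hat{Y}_j = N\bar{y}' = \frac{N}{n}\sum_{i=1}^{n} y'_i$, b) $Var(\hat{Y}_j) = N^2(1 - f)\frac{S'^2}{n}$, where
$$S'^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} Y_i'^2 - \frac{Y_j^2}{N}\right]$$
and the sample estimate is $var(\hat{Y}_j) = N^2(1 - f)\frac{s'^2}{n}$, where
$$s'^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i'^2 - \frac{\left(\sum_{i=1}^{n} y'_i\right)^2}{n}\right]$$ and $f = n/N$.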
In the planning of a sample survey, one of the first considerations is sample size determination.
Since every survey is different, there can be no hard and fast rules for determining sample size.
Generally, the factors, which decide the scale of the survey operations, have to do with cost,
time, operational constraints and the desired precision of the results. Once these points have been
appraised and individually assessed, the investigators are in a better position to decide the size of
the sample.
One of the major considerations in deciding sample size has to do with the level of error that one
deems tolerable and acceptable. We know that measures of sampling error such as standard error
or coefficient of variation are frequently used to indicate the precision of sample estimates. Since
it is desirable to have high levels of precision, it is also desirable to have large sample sizes,
since the larger the sample, the more precise estimates will be. The sample size can be
determined by specifying the precision required for each major finding to be produced from the
survey.
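Let d denote the largest difference between the estimate and the true value that is considered tolerable (the margin of error).
Allowing for a small probability $\alpha$ that the error may exceed that difference, choose a sample
size n such that $P(|\bar{y} - \bar{Y}| \ge d) = \alpha$.
With SRS we can show that, assuming the estimate $\bar{y}$ has a normal distribution,
$$d = Z_{\alpha/2}\sqrt{\frac{N-n}{N}}\,\frac{S}{\sqrt{n}}, \quad \text{i.e.} \quad n = \frac{(Z_{\alpha/2}S/d)^2}{1 + \frac{1}{N}(Z_{\alpha/2}S/d)^2},$$
where $Z_{\alpha/2}$ is the reliability coefficient, which denotes the upper $\alpha/2$ point of the standard normal distribution.
If the population size N is very much greater than the required sample size n, the relation above
can be approximated by $d = Z_{\alpha/2}S/\sqrt{n}$ or $n_0 = (Z_{\alpha/2}S/d)^2$. As a first approximation calculate $n_0$. If
$n_0/N$, the sampling fraction, is very small, say less than 5%, we may consider $n_0$ as a satisfactory
approximation to the required sample size n. Otherwise calculate n using the given formula,
$$n = \frac{n_0}{1 + n_0/N}.$$
If we use the relative error $r = d/\bar{Y}$, then we get $n_0 = \left(\frac{Z_{\alpha/2}S}{r\bar{Y}}\right)^2$.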
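The sample size calculation can be sketched as follows, using the standard formulas $n_0 = (ZS/d)^2$ and $n = n_0/(1 + n_0/N)$, where d is the tolerated margin of error and Z the normal reliability coefficient. The function name `required_sample_size` and the input values (S = 10, d = 2, N = 500) are hypothetical, and in practice S must be guessed or estimated from a pilot survey.

```python
import math
from statistics import NormalDist


def required_sample_size(S, d, N=None, confidence=0.95):
    """Sample size for estimating a mean to within +/- d under SRS.

    S is the (assumed) population standard deviation. If N is None the
    finite population correction is ignored and n0 is returned.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # reliability coefficient
    n0 = (z * S / d) ** 2                                 # first approximation
    if N is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + n0 / N))                   # adjusted for finite N


# Hypothetical inputs chosen only for illustration.
print(required_sample_size(S=10, d=2))            # ignoring the fpc
print(required_sample_size(S=10, d=2, N=500))     # with the fpc
```

Rounding up with `math.ceil` is a design choice made here so that the achieved precision is at least the one requested.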
Often we wish to consider not the absolute value of the standard error, but its value in relation to
the magnitude of the statistic (mean, total, etc.) being estimated. For this purpose, one can
express the standard error as a proportion (or a percent) of the value being estimated. This form
is called the relative standard error or coefficient of variation and is denoted by the symbol CV.
Statistical measures such as the standard deviation and the standard error appear in the units of
measurement of the variables. Such measurement units may cause difficulties in making some
comparisons. Relative measures, such as coefficients of variation, can be used to overcome these
problems. The element coefficient of variation can be expressed as $CV = S/\bar{Y}$
and estimated by $cv = s/\bar{y}$. For the mean ($\bar{y}$), the coefficient of variation is given by
$$CV(\bar{y}) = \frac{S.E.(\bar{y})}{\bar{Y}}, \quad \text{estimated by} \quad cv(\bar{y}) = \frac{s.e.(\bar{y})}{\bar{y}}.$$
For the total ($\hat{Y}$), the coefficient of variation is given by
$$CV(\hat{Y}) = \frac{S.E.(\hat{Y})}{Y} = \frac{N\,S.E.(\bar{y})}{N\bar{Y}} = \frac{S.E.(\bar{y})}{\bar{Y}},$$
which is the same as the coefficient of variation of the mean.
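The following minimal sketch computes the estimated element coefficient of variation and the estimated coefficients of variation of the mean and of the total from one sample drawn without replacement. The data and the function name `cv_summary` are hypothetical.

```python
import math
from statistics import mean, stdev


def cv_summary(sample, N):
    """Return (cv of elements, cv of the mean, cv of the total) estimated from a sample."""
    n = len(sample)
    ybar = mean(sample)
    s = stdev(sample)                                   # sample s.d., divisor n - 1
    se_mean = math.sqrt(1 - n / N) * s / math.sqrt(n)   # s.e.(ybar) with the fpc
    cv_element = s / ybar
    cv_mean = se_mean / ybar
    cv_total = (N * se_mean) / (N * ybar)               # N cancels, so this equals cv_mean
    return cv_element, cv_mean, cv_total


# Hypothetical sample of n = 10 units from a population of N = 50 units.
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 8]
cv_element, cv_mean, cv_total = cv_summary(sample, N=50)
print(f"cv (elements) = {cv_element:.4f}")
print(f"cv (mean)     = {cv_mean:.4f}")
print(f"cv (total)    = {cv_total:.4f}")
```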
,
̅
= 36,
Therefore, , which is a good approximation for the sample. But if you calculate for
n, you will get that
Simple random sampling is very important as a basis for development of the theory of sampling.
It serves as a central reference for all other sampling designs. Under simple random sampling
any particular sample of n elements from a population of N elements can be chosen and in
addition, is as likely to be chosen as any other sample. In this sense, it is conceptually the
simplest possible method, and hence it is one against which all other methods can be compared.
However, despite such importance, simple random sampling has the following limitations:
It can be expensive and is often not feasible in practice, since it requires that all elements be
identified and labeled prior to the sampling. When this prior identification is not possible,
a simple random sample of elements cannot be drawn.
Since it gives each element in the population an equal chance of being chosen in the
sample, it may result in samples that are spread out over a large geographic area. Such a
geographic distribution of the sample would be very costly to implement.
It would not be good for those surveys in which interest is focused on subgroups that
comprise a small proportion of the population. For example, it is not likely to be an
efficient design for rare events such as disability and special crops.