Sampling CH-4
Corollary: The estimated total $\hat{Y}_{st} = N\bar{y}_{st}$ is an unbiased estimator of the population total $Y$, i.e., $E(\hat{Y}_{st}) = Y$.
Theorem 4.2: If $\bar{y}_h$ is an unbiased estimate of $\bar{Y}_h$ in every stratum, and sample selection is independent in different strata, then the variance of the estimate is given by
$$\operatorname{Var}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \operatorname{Var}(\bar{y}_h),$$
where $\operatorname{Var}(\bar{y}_h)$ is the variance of $\bar{y}_h$ over repeated samples from stratum $h$.
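Proof: For a linear combination $\sum a_i y_i$,
$$\operatorname{Var}\Big(\sum_i a_i y_i\Big) = \sum_i a_i^2 \operatorname{Var}(y_i) + \sum_i \sum_{j \neq i} a_i a_j \operatorname{Cov}(y_i, y_j).$$
Similarly, since $\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h$,
$$\operatorname{Var}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \operatorname{Var}(\bar{y}_h) + \sum_{h=1}^{L} \sum_{j \neq h} W_h W_j \operatorname{Cov}(\bar{y}_h, \bar{y}_j) = \sum_{h=1}^{L} W_h^2 \operatorname{Var}(\bar{y}_h),$$
because samples are selected independently in different strata, so every covariance term is zero.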
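If in every stratum
$$\frac{n_h}{n} = \frac{N_h}{N} \quad \text{or} \quad \frac{n_h}{N_h} = \frac{n}{N} \quad \text{or} \quad f_h = f,$$
then the sampling fraction is the same in all strata. This stratification is described as stratification with proportional allocation of the $n_h$.
With proportional allocation, we substitute $n_h = n N_h / N$ in $\operatorname{Var}(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \frac{S_h^2}{n_h}(1 - f_h)$. Thus, the variance $\operatorname{Var}(\bar{y}_{st})$ reduces to
$$\operatorname{Var}(\bar{y}_{st}) = \frac{1 - f}{n} \sum_{h=1}^{L} W_h S_h^2.$$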
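Theorem 4.4: With stratified random sampling, an unbiased estimate of the variance of $\bar{y}_{st}$ is
$$v(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \frac{s_h^2}{n_h}(1 - f_h),$$
where $s_h^2$ is the sample variance in stratum $h$.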
Estimating the population total: Since $\bar{y}_{st}$ is an estimator of the population mean $\bar{Y}$ in stratified random sampling, $\hat{Y} = N\bar{y}_{st}$ is an estimator of the population total $Y$.
Corollary: The standard error of $\bar{y}_{st}$ is
$$S.E.(\bar{y}_{st}) = \sqrt{\sum_{h=1}^{L} W_h^2 \frac{S_h^2}{n_h}(1 - f_h)}.$$
4.3.2. Confidence Intervals
For Population Mean: A 100(1 - α)% confidence interval for the population mean $\bar{Y}$ is given as:
$$\bar{Y} = \bar{y}_{st} \pm Z_{\alpha/2}\, S.E.(\bar{y}_{st})$$
For Population Total: A 100(1 - α)% confidence interval for the population total $Y$ is given as:
$$Y = \hat{Y}_{st} \pm Z_{\alpha/2}\, S.E.(\hat{Y}_{st}) \quad \text{or} \quad Y = N\bar{y}_{st} \pm Z_{\alpha/2}\, N\, S.E.(\bar{y}_{st}).$$
Here $\bar{y}_{st}$ is assumed to be normally distributed, and the $Z$ value is obtained from the table of the normal distribution. For small samples (small $n_h$), we use the t-value from the tables of the t-distribution instead.
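The interval formulas above can be sketched in Python as follows. This is only an illustration: the function name `stratified_mean_ci` and the two-stratum inputs are hypothetical, and the sample variances $s_h^2$ are assumed to stand in for the unknown $S_h^2$.

```python
import math

def stratified_mean_ci(N_h, n_h, ybar_h, s2_h, z=1.96):
    """Return (mean, se, (lower, upper)) for a stratified random sample."""
    N = sum(N_h)
    W = [Nh / N for Nh in N_h]                      # stratum weights W_h
    mean = sum(w * yb for w, yb in zip(W, ybar_h))  # ybar_st
    # Var(ybar_st) = sum W_h^2 * (s_h^2 / n_h) * (1 - f_h)
    var = sum(w ** 2 * (s2 / n) * (1 - n / Nh)
              for w, s2, n, Nh in zip(W, s2_h, n_h, N_h))
    se = math.sqrt(var)
    return mean, se, (mean - z * se, mean + z * se)

# Hypothetical inputs, chosen only to show the arithmetic
N_h = [400, 600]
n_h = [40, 60]
ybar_h = [10.0, 20.0]
s2_h = [4.0, 9.0]

mean, se, ci = stratified_mean_ci(N_h, n_h, ybar_h, s2_h)
N = sum(N_h)
total, se_total = N * mean, N * se   # estimate of Y and its standard error
```

The interval for the total is obtained by multiplying both the estimate and the standard error by $N$, as in the second formula.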
Example: Consider a population of 14 military families living on three compounds. We can use the families as elementary units, the compounds as strata, and family size as the characteristic of interest. The values of the population characteristic for each stratum are given below.
            Stratum 1        Stratum 2        Stratum 3
N_h:        N_1 = 3          N_2 = 5          N_3 = 6
Y_hi:       Y_11 = 4         Y_21 = 4         Y_31 = 2
            Y_12 = 3         Y_22 = 6         Y_32 = 3
            Y_13 = 4         Y_23 = 4         Y_33 = 2
                             Y_24 = 7         Y_34 = 2
                             Y_25 = 8         Y_35 = 2
                                              Y_36 = 3
Y_h:        Y_1 = 11         Y_2 = 29         Y_3 = 14
Ȳ_h:        Ȳ_1 = 3.67       Ȳ_2 = 5.8        Ȳ_3 = 2.33
S_h²:       S_1² = 0.333     S_2² = 3.200     S_3² = 0.267
$$N = \sum_{h=1}^{3} N_h = 3 + 5 + 6 = 14, \qquad Y = \sum_{h=1}^{3} Y_h = 11 + 29 + 14 = 54.$$
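The stratum summaries in this example can be checked with a few lines of Python. The sketch below is not part of the original notes; the dictionary name `strata` is an illustrative choice, and `statistics.variance` is used because it divides by $N_h - 1$, which matches the $S_h^2$ values shown.

```python
from statistics import mean, variance

# Family sizes Y_hi for each compound (stratum), taken from the example
strata = {
    1: [4, 3, 4],
    2: [4, 6, 4, 7, 8],
    3: [2, 3, 2, 2, 2, 3],
}

N_h = {h: len(y) for h, y in strata.items()}        # stratum sizes
Y_h = {h: sum(y) for h, y in strata.items()}        # stratum totals
Ybar_h = {h: mean(y) for h, y in strata.items()}    # stratum means
S2_h = {h: variance(y) for h, y in strata.items()}  # divisor N_h - 1

N = sum(N_h.values())   # population size
Y = sum(Y_h.values())   # population total
Ybar = Y / N            # population mean
```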
This shows that $n_h$ is directly proportional to $N_h$ and $S_h$, and inversely proportional to $\sqrt{c_h}$. These three factors should be considered in sample allocation. As a rule, take a larger sample in a given stratum if the stratum is larger, the stratum is more variable internally, or sampling is cheaper in the stratum.
The sample size $n_h$ is expressed in terms of $n$, but the value of $n$ is unknown. It can be determined from either the cost function or a specified variance.
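Corollary: a) If the cost is fixed, the optimum $n$ is given by
$$n = \frac{(C - C_o) \sum N_h S_h / \sqrt{c_h}}{\sum N_h S_h \sqrt{c_h}},$$
and from this we can calculate $n_h$ as
$$n_h = \frac{(C - C_o)\, N_h S_h / \sqrt{c_h}}{\sum N_h S_h \sqrt{c_h}}.$$
b) If the variance, $V_o$, is fixed, the optimum $n$ is
$$n = \frac{\left(\sum_{h=1}^{L} W_h S_h \sqrt{c_h}\right)\left(\sum_{h=1}^{L} W_h S_h / \sqrt{c_h}\right)}{V_o + \frac{1}{N} \sum W_h S_h^2}.$$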
Assume that the cost per unit is the same in all strata, i.e., $c_h = c$, a constant. The cost function reduces to $C = C_o + c \times n$, and the optimum allocation for fixed cost reduces to the optimum allocation for a fixed total sample size.
Theorem 4.6: In stratified random sampling the variance, $\operatorname{Var}(\bar{y}_{st})$, is minimized for a fixed total sample size $n$ if
$$n_h = n \frac{W_h S_h}{\sum W_h S_h} = n \frac{N_h S_h}{\sum N_h S_h}.$$
This is known as Neyman allocation. The minimum variance is given by
$$V(\bar{y}_{st})_{min} = \frac{\left(\sum W_h S_h\right)^2}{n} - \frac{\sum W_h S_h^2}{N}.$$
Proof:
2. Suppose that a corporation has 260,000 accident reports available over a period of time. A
sample survey is required for the purpose of estimating the average number of days of work lost
per accident. Of these reports, 150,000 are coded and 110,000 are un-coded. Before processing
the forms on computer directly, un-coded forms must first be coded. Approximately $10,000 is
available for selecting the sample and coding and processing the data. From experience it is
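$$n_1 = \frac{150{,}000 / \sqrt{0.32}}{150{,}000\sqrt{0.32} + 2 \times 110{,}000\sqrt{0.98}} \times 10{,}000 = \frac{265{,}165.0429}{302{,}641.7023} \times 10{,}000 \cong 8762.$$
$$n_2 = \frac{(C - C_o)\, N_2 S_2 / \sqrt{c_2}}{\sum N_h S_h \sqrt{c_h}} = \frac{2 N_2 / \sqrt{c_2}}{N_1 \sqrt{c_1} + 2 N_2 \sqrt{c_2}} \times C$$
$$n_2 = \frac{2 \times 110{,}000 / \sqrt{0.98}}{150{,}000\sqrt{0.32} + 2 \times 110{,}000\sqrt{0.98}} \times 10{,}000 = \frac{222{,}233.5598}{302{,}641.7023} \times 10{,}000 \cong 7343.$$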
Therefore, a sample of 8762 coded reports and 7343 un-coded reports could be taken.
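The arithmetic of this allocation can be reproduced with a short Python sketch. The inputs are read off the worked computation rather than the problem statement: unit costs of 0.32 and 0.98, a standard deviation for un-coded reports twice that for coded reports (only the ratio matters, so 1 and 2 are used), and the full 10,000 treated as $C - C_o$. The function name is hypothetical.

```python
import math

def optimum_allocation_fixed_cost(N_h, S_h, c_h, budget):
    """n_h = budget * (N_h S_h / sqrt(c_h)) / sum(N_h S_h sqrt(c_h))."""
    denom = sum(N * S * math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h))
    return [budget * (N * S / math.sqrt(c)) / denom
            for N, S, c in zip(N_h, S_h, c_h)]

# Values inferred from the worked arithmetic (assumptions, see lead-in)
n1, n2 = optimum_allocation_fixed_cost(
    N_h=[150_000, 110_000],
    S_h=[1.0, 2.0],        # only the ratio S_2 / S_1 = 2 matters
    c_h=[0.32, 0.98],
    budget=10_000,
)

# The allocation should exhaust the budget: c_1 n_1 + c_2 n_2 = C - C_o
spent = 0.32 * n1 + 0.98 * n2
```

Rounding `n1` and `n2` to whole reports gives the sample sizes quoted above, and `spent` equals the budget up to floating-point error.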
d) Allocation Requiring More Than 100% Sampling:
In optimum allocation the sample size $n_h$ may be greater than the total number of elements in the stratum, $N_h$, i.e., $n_h > N_h$. When this occurs, set $n_h = N_h$ for each stratum whose optimum allocation exceeds $N_h$ and reallocate the remaining sample to the other strata.
If $n_1 > N_1$, set $n_1 = N_1$ and for the remaining strata calculate
$$\tilde{n}_h = \frac{N_h S_h}{\sum_{h=2}^{L} N_h S_h} \times (n - N_1),$$
provided that $\tilde{n}_h \leq N_h$ for $h \geq 2$.
Similarly, if $n_1 > N_1$ and $n_2 > N_2$, then set $n_1 = N_1$ and $n_2 = N_2$. For the remaining strata calculate
$$\tilde{n}_h = \frac{N_h S_h}{\sum_{h=3}^{L} N_h S_h} \times (n - N_1 - N_2),$$
provided that $\tilde{n}_h \leq N_h$ for $h \geq 3$.
Example: Consider the following data from three strata: stratum 1, $N_1 = 100$, $S_1 = 50$; stratum 2, $N_2 = 110$, $S_2 = 10$; stratum 3, $N_3 = 120$, $S_3 = 5$. Allocate a total sample of 140 elements among the strata by using optimum allocation.
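Solution: Using $n_h = \frac{N_h S_h}{\sum_{h=1}^{3} N_h S_h} \times n$ with $\sum_{h=1}^{3} N_h S_h = 100 \times 50 + 110 \times 10 + 120 \times 5 = 6700$,
$$n_1 = \frac{5000 \times 140}{6700} = 104.477 \approx 104, \quad n_2 = \frac{110 \times 10 \times 140}{6700} \approx 23, \quad n_3 = \frac{120 \times 5 \times 140}{6700} \approx 13.$$
Since $n_1 > N_1$, we allocate $n_1 = N_1 = 100$. For the remaining two strata we calculate as follows.
$$\tilde{n}_2 = \frac{N_2 S_2}{\sum_{h=2}^{L} N_h S_h} \times (n - N_1) = \frac{110 \times 10 \times (140 - 100)}{110 \times 10 + 120 \times 5} = \frac{44000}{1700} \approx 26$$
$$\tilde{n}_3 = \frac{N_3 S_3}{\sum_{h=2}^{L} N_h S_h} \times (n - N_1) = \frac{120 \times 5 \times (140 - 100)}{110 \times 10 + 120 \times 5} = \frac{24000}{1700} \approx 14$$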
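The reallocation rule used in this example can be written as a small loop. The Python sketch below is one possible way to organise it, not a prescribed algorithm: the function name `neyman_with_cap` is hypothetical, and it simply repeats the Neyman formula on the strata that are not yet fully sampled.

```python
def neyman_with_cap(N_h, S_h, n):
    """Neyman allocation; strata whose share exceeds N_h are fully sampled."""
    alloc = [0.0] * len(N_h)
    free = list(range(len(N_h)))   # strata still to be allocated
    remaining = n                  # sample size still to be distributed
    while free:
        denom = sum(N_h[h] * S_h[h] for h in free)
        trial = {h: remaining * N_h[h] * S_h[h] / denom for h in free}
        over = [h for h in free if trial[h] > N_h[h]]
        if not over:
            # No stratum exceeds 100% sampling: accept the allocation
            for h in free:
                alloc[h] = trial[h]
            break
        for h in over:
            # Take the whole stratum and redistribute what is left
            alloc[h] = N_h[h]
            remaining -= N_h[h]
            free.remove(h)
    return alloc

alloc = neyman_with_cap([100, 110, 120], [50, 10, 5], 140)
rounded = [round(a) for a in alloc]
```

For the data of the example, `rounded` is `[100, 26, 14]`, which agrees with the hand calculation.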
From the standard algebraic identity for the analysis of variance of stratified population, we have
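$$A = \sum_{h=1}^{L} A_h = \sum_{h=1}^{L} \sum_{i=1}^{N_h} Y_{hi} = \sum_{h=1}^{L} Y_h = Y, \qquad P = A/N = \sum_{h=1}^{L} W_h P_h$$
$a_h = \sum_{i=1}^{n_h} y_{hi}$ = the number of sample units in the $h$th stratum that possess the attribute.
$p_h = a_h / n_h = \bar{y}_h$, the sample proportion in the $h$th stratum.
$q_h = 1 - p_h$, and $s_h^2 = \frac{n_h p_h q_h}{n_h - 1}$ = the sample variance in stratum $h$.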
b) Properties of Estimates
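For the proportion in the whole population, the estimate appropriate to stratified random sampling is
$$p_{st} = \sum_{h=1}^{L} W_h p_h = \sum_{h=1}^{L} \frac{N_h p_h}{N}.$$
Since the population proportions $P_h$ are unknown, the population total is estimated by $\hat{A}_{st} = N p_{st} = \sum N_h p_h$, and its variance is given by $\operatorname{Var}(\hat{A}_{st}) = N^2 \operatorname{Var}(p_{st})$.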
c) Sample Allocation
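With proportional allocation, $n_h = n N_h / N$, and the variance of $p_{st}$ is
$$\operatorname{Var}(p_{st}) = \frac{1 - f}{n} \sum_{h=1}^{L} W_h P_h Q_h = \frac{1}{n} \sum_{h=1}^{L} W_h P_h Q_h (1 - f), \quad \text{if } N_h \text{ is large.}$$
For the sample estimate of the variance, substitute $\frac{p_h q_h}{n_h - 1}$ for the unknown $\frac{P_h Q_h}{N_h - 1}$ in any of the formulas above.
With optimum allocation: For fixed total sample size $n$,
$$n_h \propto N_h \sqrt{\frac{N_h}{N_h - 1} P_h Q_h} \cong N_h \sqrt{P_h Q_h}.$$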
e) Relative precision
𝑉𝑎𝑟(𝑝𝑠𝑡 )𝑜𝑝𝑡 ≤ 𝑉𝑎𝑟(𝑝𝑠𝑡 )𝑝𝑟𝑜𝑝 ≤ 𝑉𝑎𝑟(𝑝)𝑆𝑅𝑆
f) Confidence Limit:
A 100(1-𝛼)% confidence interval for population proportion P is
$$P = p_{st} \pm Z_{\alpha/2}\, s.e.(p_{st})$$
Example: A chain of department stores is interested in estimating the proportion of accounts
receivable that are delinquent. The chain consists of four stores. To reduce the cost of sampling,
stratified random sampling is used with each store as a stratum. Since no information on
population proportions is available before sampling, proportional allocation is used. From the
table given below, estimate P, the proportion of delinquent accounts for the chain, find its
sampling error and 95% CI for population proportion.
Stratum                                     I           II          III         IV
Number of accounts receivable               N1 = 65     N2 = 42     N3 = 93     N4 = 25
Sample size                                 n1 = 14     n2 = 9      n3 = 21     n4 = 6
Sample proportion of delinquent accounts    p1 = 0.3    p2 = 0.2    p3 = 0.4    p4 = 0.1
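Solution: $N = N_1 + N_2 + N_3 + N_4 = 65 + 42 + 93 + 25 = 225$
$n = n_1 + n_2 + n_3 + n_4 = 14 + 9 + 21 + 6 = 50$
$$p_{st} = \sum_{h=1}^{L} W_h p_h = \sum_{h=1}^{L} \frac{N_h p_h}{N} = \frac{65}{225} \times 0.3 + \frac{42}{225} \times 0.2 + \frac{93}{225} \times 0.4 + \frac{25}{225} \times 0.1 = 0.3$$
$$\operatorname{Var}(p_{st}) = \sum_{h=1}^{L} W_h^2 \frac{p_h q_h}{n_h - 1} \left(\frac{N_h - n_h}{N_h - 1}\right) \approx \sum_{h=1}^{L} \frac{N_h^2}{N^2} \frac{p_h q_h}{n_h - 1} (1 - f_h)$$
$$= \left(\frac{65}{225}\right)^2 \left(\frac{0.3 \times 0.7}{13}\right) (0.7846) + \left(\frac{42}{225}\right)^2 \left(\frac{0.2 \times 0.8}{8}\right) (0.7857) + \left(\frac{93}{225}\right)^2 \left(\frac{0.4 \times 0.6}{20}\right) (0.7742) + \left(\frac{25}{225}\right)^2 \left(\frac{0.1 \times 0.9}{5}\right) (0.7600)$$
$$= \left(\frac{65}{225}\right)^2 (0.012675) + \left(\frac{42}{225}\right)^2 (0.015714) + \left(\frac{93}{225}\right)^2 (0.009290) + \left(\frac{25}{225}\right)^2 (0.013680)$$
$$= 0.001058 + 0.000548 + 0.001587 + 0.000169 = 0.003362$$
$$s.e.(p_{st}) = \sqrt{\operatorname{Var}(p_{st})} = \sqrt{0.003362} = 0.058$$
The 95% confidence interval for the population proportion is:
$$P = p_{st} \pm Z_{\alpha/2}\, s.e.(p_{st}) = 0.3 \pm 1.96 \times 0.058 = (0.3 - 0.114,\; 0.3 + 0.114) = [0.186, 0.414]$$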
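The department store calculation can be repeated in Python as a check. This is a sketch under the assumption that the $(1 - f_h)$ form of the variance estimate is used; it carries no intermediate rounding, so the later digits can differ slightly from a hand calculation. The variable names are illustrative.

```python
import math

# Data from the department store example
N_h = [65, 42, 93, 25]       # accounts receivable per store
n_h = [14, 9, 21, 6]         # sample sizes
p_h = [0.3, 0.2, 0.4, 0.1]   # sample proportions of delinquent accounts

N = sum(N_h)

# p_st = sum N_h p_h / N
p_st = sum(Nh * p for Nh, p in zip(N_h, p_h)) / N

# v(p_st) = sum W_h^2 * p_h q_h / (n_h - 1) * (1 - f_h)
var = sum((Nh / N) ** 2 * (p * (1 - p) / (n - 1)) * (1 - n / Nh)
          for Nh, n, p in zip(N_h, n_h, p_h))
se = math.sqrt(var)

z = 1.96                     # normal quantile for a 95% interval
ci = (p_st - z * se, p_st + z * se)
```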
by ignoring fpc. Instead of $V$ we can use the equivalent value $d^2 / Z_{\alpha/2}^2 = (\varepsilon \bar{Y})^2 / Z_{\alpha/2}^2$.
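ii) In proportional allocation the sample size $n$ is obtained as follows.
$$n = \frac{\sum_{h=1}^{L} W_h S_h^2}{V + \frac{1}{N} \sum_{h=1}^{L} W_h S_h^2} = \frac{\sum_{h=1}^{L} W_h S_h^2 / V}{1 + \frac{1}{NV} \sum_{h=1}^{L} W_h S_h^2},$$
in which $n = \frac{n_o}{1 + n_o / N}$, where
$$n_o = \frac{\sum_{h=1}^{L} W_h S_h^2}{V} = \frac{Z_{\alpha/2}^2 \sum_{h=1}^{L} W_h S_h^2}{\varepsilon^2 \bar{Y}^2}$$
can be used as a first approximation for the sample size, and $V$ is the specified variance, i.e.,
$$V = \operatorname{Var}(\bar{y}_{st}) = \frac{d^2}{Z_{\alpha/2}^2} = \frac{(\varepsilon \bar{Y})^2}{Z_{\alpha/2}^2}.$$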
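iii) In optimum allocation the sample size calculation is as follows.
For fixed $n$, it is given as
$$n = \frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2}{V + \frac{1}{N} \sum_{h=1}^{L} W_h S_h^2} = \frac{\left(\sum_{h=1}^{L} W_h S_h\right)^2 / V}{1 + \frac{1}{NV} \sum_{h=1}^{L} W_h S_h^2} \quad \text{or} \quad n = \frac{Z_{\alpha/2}^2 \left(\sum_{h=1}^{L} W_h S_h\right)^2 / (\varepsilon^2 \bar{Y}^2)}{1 + \frac{Z_{\alpha/2}^2 \sum_{h=1}^{L} W_h S_h^2}{N \varepsilon^2 \bar{Y}^2}}.$$
For fixed cost, $C = C_o + \sum c_h n_h$, the sample size is given as:
$$n = \frac{(C - C_o) \sum N_h S_h / \sqrt{c_h}}{\sum N_h S_h \sqrt{c_h}}$$
For fixed variance, $V$, the optimum $n$ is:
$$n = \frac{\left(\sum W_h S_h \sqrt{c_h}\right)\left(\sum W_h S_h / \sqrt{c_h}\right)}{V + \frac{1}{N} \sum W_h S_h^2}$$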
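To see how the proportional and optimum (fixed $n$) formulas for the mean behave, the following Python sketch applies them to the stratum sizes and variances of the military-family example. The target variance `V = 0.1` is a hypothetical value picked only for illustration, and the variable names are not from the notes.

```python
import math

# Stratum sizes and variances from the military-family example
N_h = [3, 5, 6]
S2_h = [0.333, 3.200, 0.267]
N = sum(N_h)
W_h = [Nh / N for Nh in N_h]

V = 0.1  # hypothetical target for Var(ybar_st); not given in the text

sum_WS2 = sum(w * s2 for w, s2 in zip(W_h, S2_h))           # sum W_h S_h^2
sum_WS = sum(w * math.sqrt(s2) for w, s2 in zip(W_h, S2_h))  # sum W_h S_h

# Proportional allocation: n = n0 / (1 + n0 / N), with n0 = sum W_h S_h^2 / V
n0 = sum_WS2 / V
n_prop = n0 / (1 + n0 / N)

# Optimum allocation for fixed n: n = (sum W_h S_h)^2 / (V + sum W_h S_h^2 / N)
n_opt = sum_WS ** 2 / (V + sum_WS2 / N)
```

In practice both values are rounded up to whole units. Optimum allocation never needs a larger sample than proportional allocation for the same $V$, because $(\sum W_h S_h)^2 \leq \sum W_h S_h^2$.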
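B. Estimation of population total
If $V$ is the desired $\operatorname{Var}(\hat{Y}_{st})$, the principal formulas are as follows. They follow from the formulas for the mean on replacing the variance of the mean by $V / N^2$.
General:
$$n = \frac{\sum_{h=1}^{L} W_h^2 S_h^2 / \pi_h}{\frac{V}{N^2} + \frac{1}{N} \sum_{h=1}^{L} W_h S_h^2} = \frac{\sum_{h=1}^{L} N_h^2 S_h^2 / \pi_h}{V + \sum_{h=1}^{L} N_h S_h^2}$$
Proportional:
$$n = \frac{\sum_{h=1}^{L} W_h S_h^2}{\frac{V}{N^2} + \frac{1}{N} \sum_{h=1}^{L} W_h S_h^2} = \frac{\frac{N}{V} \sum_{h=1}^{L} N_h S_h^2}{1 + \frac{1}{V} \sum_{h=1}^{L} N_h S_h^2} = \frac{n_0}{1 + n_0 / N}, \quad \text{where } n_0 = \frac{N}{V} \sum_{h=1}^{L} N_h S_h^2.$$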
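$$n = \frac{n_o}{1 + n_o / N}, \quad \text{where } n_o = \frac{\sum W_h P_h Q_h}{V} \text{ or } n_o = \frac{Z_{\alpha/2}^2 \sum W_h P_h Q_h}{\varepsilon^2 P^2},$$
where $V = d^2 / Z_{\alpha/2}^2$ and $d = \varepsilon P$.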
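ii) For optimum allocation: for fixed $n$, the variance can be expressed as
$$V = \frac{\left(\sum W_h \sqrt{P_h Q_h}\right)^2}{n} - \frac{\sum W_h P_h Q_h}{N}, \quad \text{for large } N_h.$$
From this we obtain the sample size
$$n = \frac{\left(\sum W_h \sqrt{P_h Q_h}\right)^2}{V + \frac{1}{N} \sum W_h P_h Q_h} = \frac{\left(\sum W_h \sqrt{P_h Q_h}\right)^2 / V}{1 + \frac{1}{VN} \sum W_h P_h Q_h}.$$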
In all cases, if the stratum population variances are unknown, substitute their respective estimates. That is, substitute $s_h^2$ for $S_h^2$ and $p_h$ for $P_h$.
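As an illustration of this substitution, the sketch below plugs the sample proportions from the department store example into the two sample-size formulas for proportions. The margin of error `d = 0.05` and the 95% confidence level are hypothetical planning values, not taken from the notes, and the variable names are illustrative.

```python
import math

# Department store example: stratum sizes and sample proportions p_h,
# used here in place of the unknown P_h
N_h = [65, 42, 93, 25]
p_h = [0.3, 0.2, 0.4, 0.1]
N = sum(N_h)
W_h = [Nh / N for Nh in N_h]

d = 0.05            # hypothetical margin of error
z = 1.96            # normal quantile for 95% confidence
V = (d / z) ** 2    # V = d^2 / Z^2

sum_WPQ = sum(w * p * (1 - p) for w, p in zip(W_h, p_h))
sum_W_sqrtPQ = sum(w * math.sqrt(p * (1 - p)) for w, p in zip(W_h, p_h))

# Proportional allocation: n = n0 / (1 + n0 / N), n0 = sum W_h P_h Q_h / V
n0 = sum_WPQ / V
n_prop = math.ceil(n0 / (1 + n0 / N))

# Optimum allocation: n = (sum W_h sqrt(P_h Q_h))^2 / (V + sum W_h P_h Q_h / N)
n_opt = math.ceil(sum_W_sqrtPQ ** 2 / (V + sum_WPQ / N))
```

With these planning values the two allocations call for about 130 and 128 accounts respectively, so the gain from optimum allocation is small when the $P_h$ do not differ much.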