slidesc53_2
Stratified Sampling
References: Sampling Design and Analysis, S.L. Lohr (Chap 3)
Stratified sampling
• In stratified sampling we divide the population into distinct,
non-overlapping subpopulations called strata. We then select a simple
random sample without replacement (SRSWOR) from each stratum. The
SRSs in the strata are selected independently.
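The definition can be sketched in base R as follows. The population `pop`, the stratum labels and the sample sizes `n_h` are made up for illustration, and `draw_stratum` is a hypothetical helper, not a function from any package.

```r
# Hypothetical population: 12 units in two strata (all values are illustrative).
set.seed(1)
pop <- data.frame(id = 1:12,
                  stratum = rep(c("A", "B"), times = c(8, 4)))
n_h <- c(A = 3, B = 2)  # assumed stratum sample sizes

# Draw an SRSWOR within one stratum; each call is independent of the others.
draw_stratum <- function(h) {
  rows <- which(pop$stratum == h)
  rows[sample.int(length(rows), n_h[[h]])]
}

picked <- unlist(lapply(names(n_h), draw_stratum))
strat_sample <- pop[picked, ]
table(strat_sample$stratum)
```

Indexing through `sample.int` avoids the surprise that `sample(x, k)` treats a single number `x` as the range `1:x`.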
Stratified sampling: Introduction
• We use stratified sampling for one or more of the following reasons:
• To protect against the possibility of underrepresenting some parts of the
population.
• To obtain higher precision (smaller variance) for estimates of population means and totals.
• For administrative convenience.
• With stratified random sampling, in addition to the estimates of population
parameters, we can also estimate the parameters for each stratum.
Example
• A Statistics class had two lecture sections (LEC01 and LEC02).
The data are given in the file students.csv. Use the R sampling package
to select a stratified sample of 15 students, using the two
lecture sections as strata, selecting an SRSWOR of ten
students from LEC01 and five students from LEC02.
students <- read.csv("students.csv", header = TRUE)
head(students)
  GIVEN_NAME    LEC
1        Yan LEC_01
2    Prateek LEC_01
3      Adeel LEC_01
4     Jingya LEC_02
5   Anuradha LEC_02
6      Bryan LEC_01
students <- read.csv("students.csv", header = TRUE)
library(sampling)
# size gives the stratum sample sizes in the order the strata appear in the data
s <- strata(students, stratanames = c("LEC"), size = c(10, 5),
            method = "srswor", description = TRUE)
Stratum 1
Population total and number of selected units: 186 10
Stratum 2
Population total and number of selected units: 208 5
Number of strata 2
Total number of selected units 15
s                     # the selected units and their inclusion probabilities
getdata(students, s)  # the sampled rows of students
Notations and some basic results
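• $y_{hj}$ = measurement on the $j$th unit in stratum $h$
• Population mean in stratum $h$: $\bar{y}_{hU} = \frac{1}{N_h}\sum_{j=1}^{N_h} y_{hj}$
• Population mean: $\bar{y}_U = \frac{1}{N}\sum_{h=1}^{H}\sum_{j=1}^{N_h} y_{hj} = \frac{t}{N}$
• where $t = \sum_{h=1}^{H}\sum_{j=1}^{N_h} y_{hj}$ is the population total.
• Note that $\bar{y}_U = \sum_{h=1}^{H} W_h \bar{y}_{hU}$
• where $W_h = \frac{N_h}{N}$ are the stratum weights.
• Population variance in stratum $h$: $S_h^2 = \frac{1}{N_h-1}\sum_{j=1}^{N_h}\left(y_{hj}-\bar{y}_{hU}\right)^2$
• Population variance: $S^2 = \frac{1}{N-1}\sum_{h=1}^{H}\sum_{j=1}^{N_h}\left(y_{hj}-\bar{y}_U\right)^2$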
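As a quick check of this notation, the quantities can be computed for a small made-up population. The values in `y` are illustrative only, and the object names are assumptions chosen to mirror the symbols.

```r
# Hypothetical toy population with H = 3 strata (values are made up).
y <- list(h1 = c(2, 4, 6, 8), h2 = c(10, 12), h3 = c(1, 3, 5))

N_h <- sapply(y, length)      # stratum sizes N_h
N   <- sum(N_h)               # population size N
W_h <- N_h / N                # stratum weights W_h
ybar_hU <- sapply(y, mean)    # stratum means
S2_h    <- sapply(y, var)     # stratum variances (divisor N_h - 1)

t_pop  <- sum(unlist(y))      # population total t
ybar_U <- t_pop / N           # population mean
S2     <- var(unlist(y))      # population variance (divisor N - 1)

# The weighted stratum means reproduce the population mean.
all.equal(ybar_U, sum(W_h * ybar_hU))
```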
Sum of Squares Decomposition
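A sketch of the standard decomposition that this heading refers to, written with the population quantities $\bar{y}_U$, $\bar{y}_{hU}$, $S^2$ and $S_h^2$. The labels SSTO, SSW and SSB are assumed here, following common usage.

```latex
\underbrace{\sum_{h=1}^{H}\sum_{j=1}^{N_h}\left(y_{hj}-\bar{y}_U\right)^2}_{\text{SSTO}}
=
\underbrace{\sum_{h=1}^{H}\sum_{j=1}^{N_h}\left(y_{hj}-\bar{y}_{hU}\right)^2}_{\text{SSW}}
+
\underbrace{\sum_{h=1}^{H} N_h\left(\bar{y}_{hU}-\bar{y}_U\right)^2}_{\text{SSB}}
% Equivalently, in terms of the variances:
% (N-1) S^2 = \sum_{h=1}^{H} (N_h-1) S_h^2 + \sum_{h=1}^{H} N_h (\bar{y}_{hU}-\bar{y}_U)^2
```

The cross term vanishes because the deviations $y_{hj}-\bar{y}_{hU}$ sum to zero within each stratum.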
Sample statistics using SRS estimators within each stratum
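A sketch of the usual within-stratum estimators, assuming $\mathcal{S}_h$ denotes the set of $n_h$ sampled units in stratum $h$ (the symbol $\mathcal{S}_h$ is an assumption). These match the quantities `ybar_h`, `yvar_h` and `ybar_str` computed in the R function `str.mu.est`.

```latex
\bar{y}_h = \frac{1}{n_h}\sum_{j\in\mathcal{S}_h} y_{hj},
\qquad
s_h^2 = \frac{1}{n_h-1}\sum_{j\in\mathcal{S}_h}\left(y_{hj}-\bar{y}_h\right)^2,
\qquad
\hat{t}_h = N_h\,\bar{y}_h
\\[6pt]
\bar{y}_{str} = \sum_{h=1}^{H} W_h\,\bar{y}_h,
\qquad
\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = N\,\bar{y}_{str}
```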
Result
• Under stratified sampling (with SRSWOR within each stratum)
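A sketch of the standard result for the stratified mean, consistent with the variance computed as `vybar_str` in the R function `str.mu.est`. Here $f_h = n_h/N_h$ is the sampling fraction in stratum $h$.

```latex
E\left(\bar{y}_{str}\right) = \bar{y}_U,
\qquad
V\left(\bar{y}_{str}\right) = \sum_{h=1}^{H} W_h^2\left(1-f_h\right)\frac{S_h^2}{n_h}
\\[6pt]
\hat{V}\left(\bar{y}_{str}\right) = \sum_{h=1}^{H} W_h^2\left(1-f_h\right)\frac{s_h^2}{n_h}
```

Unbiasedness follows because each $\bar{y}_h$ is unbiased for $\bar{y}_{hU}$, and the variances add because the stratum samples are independent.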
Result (Confidence intervals)
• Under stratified sampling (with SRSWOR within each stratum)
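A sketch of the large-sample interval, assuming the normal approximation that the R function `str.mu.est` uses through `qnorm((1+conf.level)/2)`:

```latex
\bar{y}_{str} \;\pm\; z_{\alpha/2}\,\sqrt{\hat{V}\left(\bar{y}_{str}\right)},
\qquad
\hat{V}\left(\bar{y}_{str}\right) = \sum_{h=1}^{H} W_h^2\left(1-f_h\right)\frac{s_h^2}{n_h}
```

Here $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution for a $100(1-\alpha)\%$ interval.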
R code for estimating the population mean and the CI based on a stratified sample
str.mu.est <- function(N_h, y, details = "no", conf.level) {
  # N_h: vector of the stratum population sizes
  # y: list with one component per stratum, each holding that stratum's sample
  # conf.level: confidence level for the interval, e.g. 0.95
  N <- sum(N_h)
  n_h <- unlist(lapply(y, length))              # stratum sample sizes
  f_h <- n_h / N_h                              # sampling fractions
  fpc <- 1 - f_h                                # finite population correction
  ybar_h <- unlist(lapply(y, mean))             # stratum sample means
  yvar_h <- unlist(lapply(y, var))              # stratum sample variances
  W_h <- N_h / N                                # stratum weights
  ybar_str <- sum(W_h * ybar_h)                 # stratified estimate of the mean
  vybar_str <- sum(W_h^2 * fpc * yvar_h / n_h)  # estimated variance of ybar_str
  ME_CI <- qnorm((1 + conf.level) / 2) * sqrt(vybar_str)  # margin of error
  LCL <- ybar_str - ME_CI
  UCL <- ybar_str + ME_CI
  if (details == "no") {
    cbind(ybar_str, vybar_str, LCL, UCL)
  } else {
    cbind(ybar_str, vybar_str, LCL, UCL, n_h, ybar_h, yvar_h)
  }
}
Some useful notes
• NOTES:
• unlist(x): given a list x, returns a vector that contains all
the individual components of x.
• lapply(X, FUN): returns a list of the same length as X, each element of
which is the result of applying FUN to the corresponding
element of X.
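A small illustration of the two functions, using a made-up list `x`:

```r
x <- list(a = c(1, 2, 3), b = c(10, 20))

lens <- lapply(x, length)        # a list with components a and b
lens_vec <- unlist(lens)         # flattened to a named vector
means_vec <- unlist(lapply(x, mean))

lens_vec
means_vec
```

This is the same pattern the function uses to obtain `n_h`, `ybar_h` and `yvar_h` from the list of stratum samples.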
To use the function, enter the data as follows
N_h <- c(155, 62, 93)
str1 <- c(35, 28, 26, 41, 43, 29, 32, 37, 36, 25, 29,
          31, 39, 38, 40, 45, 28, 27, 35, 34)
str2 <- c(27, 4, 49, 10, 18, 41, 25, 30)
str3 <- c(8, 15, 21, 7, 14, 30, 20, 11, 12, 32, 34, 24)
data <- list(townA = str1, townB = str2, rural = str3)
str.mu.est(N_h = N_h, y = data, details = "yes", conf.level = 0.95)
str.mu.est(N_h = N_h, y = data, details = "no", conf.level = 0.95)
Stratified sampling for proportions
• Proportions are means of Bernoulli (0/1) random variables, so we can
use the same formulas with
• $\bar{y}_h = \hat{p}_h$
• $s_h^2 = \frac{n_h}{n_h-1}\,\hat{p}_h(1-\hat{p}_h)$
• $p = \sum_{h=1}^{H} W_h p_h$, where $W_h = N_h/N$
• $\hat{p}_{str} = \sum_{h=1}^{H} W_h \hat{p}_h$
• $V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1-f_h)\,\frac{p_h(1-p_h)}{n_h}$ (treating $N_h/(N_h-1)$ as 1)
• where $p_h$ is the population proportion in the $h$th stratum.
• Estimated variance: $\hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1-f_h)\,\frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}$
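These formulas can be sketched as an R function in the style of `str.mu.est`. The name `str.prop.est`, its arguments and the inputs below are assumptions made for illustration.

```r
# Sketch: stratified estimate of a proportion with its standard error and CI.
# N_h: stratum population sizes; n_h: stratum sample sizes;
# p_hat_h: observed sample proportions (all names are hypothetical).
str.prop.est <- function(N_h, n_h, p_hat_h, conf.level = 0.95) {
  W_h <- N_h / sum(N_h)                 # stratum weights
  f_h <- n_h / N_h                      # sampling fractions
  p_str <- sum(W_h * p_hat_h)           # stratified estimate of p
  v_str <- sum(W_h^2 * (1 - f_h) * p_hat_h * (1 - p_hat_h) / (n_h - 1))
  me <- qnorm((1 + conf.level) / 2) * sqrt(v_str)
  c(p_str = p_str, se = sqrt(v_str), LCL = p_str - me, UCL = p_str + me)
}

# Made-up inputs for illustration only.
est <- str.prop.est(N_h = c(400, 600), n_h = c(40, 60), p_hat_h = c(0.25, 0.50))
est
```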
Exercise
• A researcher is studying the effectiveness of a new educational program across three
different schools, which represent distinct strata based on their socioeconomic status:
low, medium, and high. The population sizes (𝑁ℎ ) for each school are as follows:
• Low SES: 200 students
• Medium SES: 300 students
• High SES: 500 students
• The researcher decides to sample 20 students from each school (𝑛ℎ = 20 for all strata).
After conducting the survey, the observed proportions of students favoring the program
are:
• Low SES: 0.60
• Medium SES: 0.70
• High SES: 0.80
1. Calculate the overall estimated proportion of students favoring the program across all
schools.
2. What is the standard error (SE) of the estimated proportion?
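One way to work the exercise in R, assuming the estimated-variance formula for proportions with divisor $n_h - 1$ and the finite population correction $1 - f_h$:

```r
N_h   <- c(low = 200, medium = 300, high = 500)  # stratum sizes
n_h   <- c(20, 20, 20)                           # stratum sample sizes
p_hat <- c(0.60, 0.70, 0.80)                     # observed proportions

W_h <- N_h / sum(N_h)                            # 0.2, 0.3, 0.5
f_h <- n_h / N_h                                 # sampling fractions
p_str <- sum(W_h * p_hat)                        # 0.12 + 0.21 + 0.40 = 0.73
v_str <- sum(W_h^2 * (1 - f_h) * p_hat * (1 - p_hat) / (n_h - 1))
se_str <- sqrt(v_str)                            # about 0.058

c(p_str = p_str, se = se_str)
```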
Sample size allocation
• We consider two commonly used methods for allocating a given total
sample size n among the strata: proportional allocation and optimal
allocation.
• Proportional Allocation
• Substituting $n_h = n\,\frac{N_h}{N}$ in the formula for $V(\bar{y}_{str})$, we get the
following result:
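A sketch of the substitution, assuming the variance formula $V(\bar{y}_{str}) = \sum_h W_h^2 (1-f_h) S_h^2/n_h$ with $W_h = N_h/N$ and $f_h = n_h/N_h$. Under proportional allocation every stratum has the same sampling fraction $f = n/N$.

```latex
n_h = n\,\frac{N_h}{N} = n\,W_h
\quad\Longrightarrow\quad
f_h = \frac{n_h}{N_h} = \frac{n}{N} = f,
\qquad
\frac{W_h^2}{n_h} = \frac{W_h}{n}
\\[6pt]
V_{prop}\left(\bar{y}_{str}\right)
= \sum_{h=1}^{H} W_h^2\left(1-f_h\right)\frac{S_h^2}{n_h}
= \frac{1-f}{n}\sum_{h=1}^{H} W_h S_h^2
```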
Example
• A Statistics class had two lecture sections (LEC01 and LEC02).
The data are given in the file students.csv. Use the R sampling
package to select a stratified sample of 20 students using
proportional allocation.
# R code for selecting a stratified sample with proportional
# allocation using the sampling package
n <- 20  # the total sample size
students <- read.csv("students.csv", header = TRUE)
head(students)
  GIVEN_NAME    LEC
1        Yan LEC_01
2    Prateek LEC_01
3      Adeel LEC_01
4     Jingya LEC_02
5   Anuradha LEC_02
6      Bryan LEC_01
Nh <- table(students$LEC)  # stratum sizes
Nh
LEC_01 LEC_02
   186    208
N <- nrow(students)        # population size
N
[1] 394
library(sampling)
set.seed(123)
s <- strata(students, stratanames = c("LEC"), size = round((Nh / N) * n),
            method = "srswor", description = TRUE)
Stratum 1
Population total and number of selected units: 186 9
Stratum 2
Population total and number of selected units: 208 11
Number of strata 2
Total number of selected units 20
s
getdata(students,s)
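The stratum sample sizes passed to `strata` can be checked separately in base R. `prop_alloc` is a hypothetical helper written for this sketch, not part of the sampling package.

```r
# Proportional allocation n_h = n * N_h / N, rounded to whole units.
prop_alloc <- function(N_h, n) {
  n_h <- round(n * N_h / sum(N_h))
  # Rounding can make the sizes sum to something other than n; flag it.
  if (sum(n_h) != n) warning("rounded stratum sizes do not sum to n")
  n_h
}

alloc <- prop_alloc(c(LEC_01 = 186, LEC_02 = 208), n = 20)
alloc  # 9 and 11, matching the number of selected units reported by strata()
```

With other stratum sizes the rounded values may not add up to n, in which case one stratum size has to be adjusted by hand.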
Optimal Allocation
• The objective in optimal allocation is to minimize the variance $V(\bar{y}_{str})$
for a fixed total cost.
• Let $C$ represent the total cost, $c_0$ the overhead cost, and $c_h$
the cost of taking an observation in stratum $h$.
• Then $C = c_0 + \sum_{h=1}^{H} c_h n_h$.
Result (Optimal Allocation)
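A sketch of the standard optimal-allocation result, assuming a linear cost function $C = c_0 + \sum_h c_h n_h$ and the variance $V(\bar{y}_{str}) = \sum_h W_h^2 (1-f_h) S_h^2/n_h$. It is consistent with the three observations in the note that follows.

```latex
n_h \;\propto\; \frac{N_h S_h}{\sqrt{c_h}},
\qquad\text{so that}\qquad
n_h = n\,\frac{N_h S_h/\sqrt{c_h}}{\displaystyle\sum_{l=1}^{H} N_l S_l/\sqrt{c_l}}
```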
Note
• This result shows that the sample size in a given stratum is
larger when
• the stratum size is larger,
• the within-stratum variability is higher,
• the cost per observation in the stratum is lower.
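These three effects can be seen in a short R sketch, assuming the usual form of the result in which $n_h$ is proportional to $N_h S_h/\sqrt{c_h}$. `opt_alloc`, the standard deviations `S_h` and the costs `c_h` are made up for illustration.

```r
# Sketch of optimal allocation: n_h proportional to N_h * S_h / sqrt(c_h).
# In practice S_h would be guessed from a pilot study or an earlier survey.
opt_alloc <- function(N_h, S_h, c_h, n) {
  share <- N_h * S_h / sqrt(c_h)
  n * share / sum(share)        # unrounded stratum sample sizes
}

N_h <- c(155, 62, 93)           # stratum sizes
S_h <- c(6, 15, 9)              # made-up stratum standard deviations
c_h <- c(1, 1, 4)               # made-up per-unit costs
alloc <- opt_alloc(N_h, S_h, c_h, n = 40)
round(alloc, 1)
```

The second stratum is much smaller than the first but more variable, so the two receive the same share; the third is penalized for its higher cost.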
Special case $c_h = c$
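A sketch of what equal costs imply, assuming the optimal allocation has $n_h$ proportional to $N_h S_h/\sqrt{c_h}$. The common cost cancels, which gives what is usually called Neyman allocation.

```latex
c_h = c \text{ for all } h
\quad\Longrightarrow\quad
n_h = n\,\frac{N_h S_h}{\displaystyle\sum_{l=1}^{H} N_l S_l}
% If, in addition, S_h is the same in every stratum,
% this reduces to proportional allocation: n_h = n N_h / N.
```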
Sample size for proportions