slidesc53_2

Stratified sampling involves dividing a population into non-overlapping subpopulations (strata) and selecting a simple random sample without replacement from each stratum. This method enhances precision, reduces variance in estimates, and allows for separate parameter estimation for each stratum. The document also includes examples and R code for implementing stratified sampling and discusses sample size allocation methods.

Uploaded by

Ki Yan Shih
Copyright
© © All Rights Reserved

STAC53

Stratified Sampling
References: Sampling Design and Analysis, S.L. Lohr (Chap 3)

1
Stratified sampling
• In stratified sampling, we divide the population into distinct,
non-overlapping subpopulations called strata. We then select a
simple random sample without replacement (SRSWOR) from each
stratum. The samples in different strata are selected independently.

2
Stratified sampling: Introduction
• We use stratified sampling for one or more of the following reasons:
• To guard against underrepresenting some parts of the population.
• To obtain higher precision (smaller variance) for estimates of population means and totals.
• For administrative convenience.
• With stratified random sampling, in addition to the estimates of population
parameters, we can also estimate the parameters for each stratum.

3
Example
• A Statistics class had two lecture sections (LEC01 and LEC02).
The data are given in the file students.csv. Use the R sampling
package to select a stratified sample of 15 students, using the
two lecture sections as strata: an SRSWOR of ten students from
LEC01 and an SRSWOR of five students from LEC02.

4
students <- read.csv("students.csv", header=TRUE)
head(students)
  GIVEN_NAME    LEC
1        Yan LEC_01
2    Prateek LEC_01
3      Adeel LEC_01
4     Jingya LEC_02
5   Anuradha LEC_02
6      Bryan LEC_01

5
students <- read.csv("students.csv", header=TRUE)
library(sampling)
# size is given in the order the strata appear in the data (LEC_01 first)
s <- strata(students, stratanames=c("LEC"), size=c(10,5),
            method="srswor", description=TRUE)
Stratum 1
Population total and number of selected units: 186 10
Stratum 2
Population total and number of selected units: 208 5
Number of strata 2
Total number of selected units 15

6
• s

7
• getdata(students,s)
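The same design can be sketched in base R, which makes the "independent SRSWOR within each stratum" step explicit. This is only an illustration: the data frame `pop` is synthetic (the real students.csv is not reproduced here), and the names `pop`, `n_h`, `rows_by_stratum`, `picked`, and `samp` are made up for this sketch.

```r
set.seed(1)

# Synthetic stand-in for students.csv: 186 students in LEC_01, 208 in LEC_02
pop <- data.frame(
  id  = 1:394,
  LEC = rep(c("LEC_01", "LEC_02"), times = c(186, 208))
)

# Stratum sample sizes, named by stratum
n_h <- c(LEC_01 = 10, LEC_02 = 5)

# Row indices of the population, grouped by stratum
rows_by_stratum <- split(seq_len(nrow(pop)), pop$LEC)

# Draw an SRSWOR of size n_h independently within each stratum
picked <- unlist(lapply(names(n_h), function(h) {
  idx <- rows_by_stratum[[h]]
  idx[sample.int(length(idx), n_h[[h]])]  # sample.int draws without replacement
}))

samp <- pop[picked, ]
table(samp$LEC)  # 10 units from LEC_01 and 5 from LEC_02
```

Unlike strata() from the sampling package, this sketch does not return inclusion probabilities; they would be n_h / N_h within each stratum.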

8
Notations and some basic results

9
Notations and some basic results
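• $y_{hj}$ = measurement on the $j$th unit in stratum $h$
• Population mean in stratum $h$: $\bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}$
• Population mean: $\bar{y}_U = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} = \frac{t}{N}$
• where $t = \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}$ is the population total.
• Note that: $\bar{y}_U = \sum_{h=1}^{H} W_h \bar{y}_{hU}$
• where $W_h = \frac{N_h}{N}$ are the stratum weights.
• Population variance in stratum $h$: $S_h^2 = \frac{1}{N_h - 1} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2$
• Population variance: $S^2 = \frac{1}{N - 1} \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_U)^2$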

10
Sum of Squares Decomposition
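The body of this slide was not captured. Assuming it shows the usual analysis-of-variance identity (total = within strata + between strata), it can be written in the notation defined above as:

```latex
\underbrace{(N-1)\,S^2}_{\text{total SS}}
  = \underbrace{\sum_{h=1}^{H} (N_h - 1)\,S_h^2}_{\text{within-stratum SS}}
  + \underbrace{\sum_{h=1}^{H} N_h \left(\bar{y}_{hU} - \bar{y}_U\right)^2}_{\text{between-stratum SS}}
```

Stratification helps most when the between-stratum term is large relative to the within-stratum term, that is, when units within a stratum are similar to one another.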

11
Sample statistics using SRS estimators within each stratum
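The formulas on this slide were not captured. A reconstruction consistent with the R function given later in these slides is shown below; the symbol $\mathcal{S}_h$ for the set of sampled units in stratum $h$ is notation assumed here, not taken from the slides.

```latex
\bar{y}_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj}, \qquad
s_h^2 = \frac{1}{n_h - 1} \sum_{j \in \mathcal{S}_h} \left(y_{hj} - \bar{y}_h\right)^2, \qquad
\bar{y}_{str} = \sum_{h=1}^{H} W_h\, \bar{y}_h, \qquad
\hat{t}_{str} = N\, \bar{y}_{str} = \sum_{h=1}^{H} N_h\, \bar{y}_h
```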

12
Result
• Under stratified sampling (with SRSWOR within each stratum)
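The statement of the result was not captured. The standard result, which matches the variance computed by the R function later in these slides, is sketched here with $f_h = n_h / N_h$ denoting the sampling fraction in stratum $h$:

```latex
E\left(\bar{y}_{str}\right) = \bar{y}_U, \qquad
V\left(\bar{y}_{str}\right) = \sum_{h=1}^{H} W_h^2 \left(1 - f_h\right) \frac{S_h^2}{n_h}, \qquad
\hat{V}\left(\bar{y}_{str}\right) = \sum_{h=1}^{H} W_h^2 \left(1 - f_h\right) \frac{s_h^2}{n_h}
```

The variance is a simple sum over strata because the stratum samples are selected independently.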

13
Result (Confidence intervals)
• Under stratified sampling (with SRSWOR within each stratum)
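The interval itself was not captured. Judging from the R function later in these slides, which uses a normal quantile, an approximate $100(1-\alpha)\%$ confidence interval for $\bar{y}_U$ would be:

```latex
\bar{y}_{str} \;\pm\; z_{\alpha/2} \sqrt{\hat{V}\left(\bar{y}_{str}\right)},
\qquad
\hat{V}\left(\bar{y}_{str}\right) = \sum_{h=1}^{H} W_h^2 \left(1 - f_h\right) \frac{s_h^2}{n_h}
```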

14
Example

15
Example

16
R code for estimating the population mean and the CI based on a stratified sample
str.mu.est <- function(N_h, y, details = "no", conf.level) {
  # N_h is a vector of the stratum sizes
  # y is a list object with each component being a stratum sample
  N <- sum(N_h)
  n_h <- unlist(lapply(y, length))   # stratum sample sizes
  f_h <- n_h / N_h                   # sampling fractions
  fpc <- 1 - f_h                     # finite population correction
  ybar_h <- unlist(lapply(y, mean))  # stratum sample means
  yvar_h <- unlist(lapply(y, var))   # stratum sample variances
  W_h <- N_h / N                     # stratum weights
  ybar_str <- sum(W_h * ybar_h)
  vybar_str <- sum(W_h^2 * fpc * yvar_h / n_h)
  ME_CI <- qnorm((1 + conf.level) / 2) * sqrt(vybar_str)  # margin of error
  LCL <- ybar_str - ME_CI
  UCL <- ybar_str + ME_CI
  if (details == "no") {
    cbind(ybar_str, vybar_str, LCL, UCL)
  } else {
    cbind(ybar_str, vybar_str, LCL, UCL, n_h, ybar_h, yvar_h)
  }
}

19
Some useful notes
• NOTES:
• unlist: given a list structure x, it produces a vector which contains all
the individual components of x.
• lapply(X,FUN) returns a list of the same length as X, each element of
which is the result of applying a function to the corresponding
element of X
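A small illustration of these two functions on a toy list (the list `y` and its component names are made up for this sketch):

```r
# A toy list with two "strata" of different sizes
y <- list(a = c(1, 2, 3), b = c(10, 20))

lens <- lapply(y, length)   # a list with one element per component of y
str(lens)

n_h <- unlist(lens)         # flattened to a named integer vector
n_h

ybar_h <- unlist(lapply(y, mean))  # named numeric vector of component means
ybar_h
```

`sapply(y, mean)` gives the same named vector in one step for simple cases like this.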

20
To use the function enter the info as follows
N_h <- c(155, 62, 93)
str1 <- c(35, 28, 26, 41, 43, 29, 32, 37, 36, 25, 29,
          31, 39, 38, 40, 45, 28, 27, 35, 34)
str2 <- c(27, 4, 49, 10, 18, 41, 25, 30)
str3 <- c(8, 15, 21, 7, 14, 30, 20, 11, 12, 32, 34, 24)
data <- list(townA=str1, townB=str2, rural=str3)
str.mu.est(N_h=N_h, y=data, details="yes", conf.level=0.95)

21
str.mu.est(N_h=N_h, y=data, details="no", conf.level=0.95)

• Ex: Do these calculations by hand.
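One way to check the hand calculation is to reproduce each intermediate quantity step by step in R. The sketch below repeats the data from the slide above so that it runs on its own; the variable names `s2_h`, `v_str`, `se_str`, and `ci` are chosen here for illustration.

```r
# Stratum sizes and samples, copied from the slide above
N_h  <- c(155, 62, 93)
data <- list(
  townA = c(35, 28, 26, 41, 43, 29, 32, 37, 36, 25, 29,
            31, 39, 38, 40, 45, 28, 27, 35, 34),
  townB = c(27, 4, 49, 10, 18, 41, 25, 30),
  rural = c(8, 15, 21, 7, 14, 30, 20, 11, 12, 32, 34, 24)
)

W_h    <- N_h / sum(N_h)      # stratum weights: 0.5, 0.2, 0.3
n_h    <- sapply(data, length) # 20, 8, 12
ybar_h <- sapply(data, mean)   # stratum sample means
s2_h   <- sapply(data, var)    # stratum sample variances

# Stratified estimate of the mean and its estimated variance
ybar_str <- sum(W_h * ybar_h)
v_str    <- sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h)
se_str   <- sqrt(v_str)

# Approximate 95% confidence interval based on the normal quantile
ci <- ybar_str + c(-1, 1) * qnorm(0.975) * se_str

round(c(estimate = ybar_str, variance = v_str, SE = se_str), 4)
ci
```

The stratum means are 33.9, 25.5, and 19, so the point estimate is 0.5(33.9) + 0.2(25.5) + 0.3(19) = 27.75.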

22
Stratified sampling for proportions
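• Proportions are means of Bernoulli (i.e. 0/1) random variables, so
we can use the same formulas with
• $\bar{y}_h = \hat{p}_h$
• $s_h^2 = \frac{n_h}{n_h - 1} \hat{p}_h (1 - \hat{p}_h)$
• $p = \sum_{h=1}^{H} W_h p_h$, where $W_h = N_h / N$
• $\hat{p}_{str} = \sum_{h=1}^{H} W_h \hat{p}_h$
• $V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h (1 - p_h)}{n_h}$
• where $p_h$ is the population proportion in the $h$th stratum. (This form drops the factor $N_h / (N_h - 1)$ in $S_h^2$, which is close to 1 for large strata.)
• $\hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}$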
23
Exercise
• A researcher is studying the effectiveness of a new educational program across three
different schools, which represent distinct strata based on their socioeconomic status:
low, medium, and high. The population sizes (𝑁ℎ ) for each school are as follows:
• Low SES: 200 students
• Medium SES: 300 students
• High SES: 500 students
• The researcher decides to sample 20 students from each school (𝑛ℎ = 20 for all strata).
After conducting the survey, the observed proportions of students favoring the program
are:
• Low SES: 0.60
• Medium SES: 0.70
• High SES: 0.80
1. Calculate the overall estimated proportion of students favoring the program across all
schools.
2. What is the standard error (SE) of the estimated proportion?
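A possible way to check an answer to this exercise in R, applying the formulas for $\hat{p}_{str}$ and $\hat{V}(\hat{p}_{str})$ from the previous slide. The variable names are chosen here for illustration.

```r
# Population sizes, sample sizes, and observed proportions from the exercise
N_h   <- c(low = 200, medium = 300, high = 500)
n_h   <- c(20, 20, 20)
p_hat <- c(0.60, 0.70, 0.80)

W_h <- N_h / sum(N_h)   # stratum weights: 0.2, 0.3, 0.5
f_h <- n_h / N_h        # sampling fractions

# 1. Overall estimated proportion
p_str <- sum(W_h * p_hat)

# 2. Standard error, using the estimated variance with n_h - 1 in the denominator
v_hat <- sum(W_h^2 * (1 - f_h) * p_hat * (1 - p_hat) / (n_h - 1))
se    <- sqrt(v_hat)

round(c(p_str = p_str, SE = se), 4)
```

The point estimate is 0.2(0.60) + 0.3(0.70) + 0.5(0.80) = 0.73.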

24
25
26
Sample size allocation
• We consider two commonly used methods for allocating the sample
sizes into each of the strata: proportional allocation, and optimal
allocation for a given n, the total sample size.
• Proportional Allocation

27
Sample size allocation
• Substituting $n_h = n \frac{N_h}{N}$ in the formula for $V(\bar{y}_{str})$, we get the
following result:

• Result: Under stratified sampling with proportional allocation,
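The displayed result was not captured. Carrying out the substitution in the standard variance formula $V(\bar{y}_{str}) = \sum_h W_h^2 (1 - f_h) S_h^2 / n_h$ (assumed here, since that slide's body is also missing) gives:

```latex
n_h = n W_h, \quad f_h = \frac{n_h}{N_h} = \frac{n}{N}
\;\;\Longrightarrow\;\;
V_{prop}\left(\bar{y}_{str}\right)
  = \sum_{h=1}^{H} W_h^2 \left(1 - \frac{n}{N}\right) \frac{S_h^2}{n W_h}
  = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} W_h S_h^2
```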

28
Sample size allocation

29
Example
• A Statistics class had two lecture sections (LEC01 and LEC02).
The data are given in file students.csv. Use R sampling
package to select a stratified sample of 20 students using
proportional allocation.

30
# R code for selecting a stratified sample with proportional
# allocation using the sampling package
n <- 20 # The sample size
students <- read.csv("students.csv", header=TRUE)
head(students)
  GIVEN_NAME    LEC
1        Yan LEC_01
2    Prateek LEC_01
3      Adeel LEC_01
4     Jingya LEC_02
5   Anuradha LEC_02
6      Bryan LEC_01
Nh <- table(students$LEC)
Nh
LEC_01 LEC_02
   186    208
N <- nrow(students)
N
[1] 394
31
library(sampling)
set.seed(123)
s <- strata(students, stratanames=c("LEC"), size=round((Nh/N)*n),
            method="srswor", description=TRUE)
Stratum 1
Population total and number of selected units: 186 9
Stratum 2
Population total and number of selected units: 208 11
Number of strata 2
Total number of selected units 20
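In this example round((Nh/N)*n) happens to add up to 20, but rounding each stratum separately does not guarantee that in general. A minimal sketch of one common workaround (largest-remainder rounding) follows; `prop_alloc` is a hypothetical helper written for this note, not a function from the sampling package.

```r
# Proportional allocation whose sizes always sum to n.
# Hypothetical helper: floors the exact allocation, then hands the
# leftover units to the strata with the largest fractional parts.
prop_alloc <- function(N_h, n) {
  exact <- n * N_h / sum(N_h)   # exact (non-integer) proportional sizes
  n_h   <- floor(exact)
  short <- n - sum(n_h)         # units still to be assigned
  if (short > 0) {
    idx <- order(exact - n_h, decreasing = TRUE)[seq_len(short)]
    n_h[idx] <- n_h[idx] + 1
  }
  n_h
}

prop_alloc(c(186, 208), 20)   # the two lecture sections above
```

For the two lecture sections the exact sizes are about 9.44 and 10.56, so this returns 9 and 11, in agreement with the output above.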

32
s

33
getdata(students,s)

34
Optimal Allocation
• The objective in optimal allocation is to minimize the variance $V(\bar{y}_{str})$
for a fixed cost
• Let $C$ represent total cost, $c_0$ represent overhead costs, and $c_h$
represent the cost of taking an observation in stratum $h$
• Then

35
Result (Optimal Allocation)
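The cost function and the result were not captured. Assuming the usual linear cost model, the standard optimal-allocation result, which agrees with the note on the next slide, can be sketched as:

```latex
C = c_0 + \sum_{h=1}^{H} c_h n_h,
\qquad
n_h = n \, \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}
```

Here $n = \sum_h n_h$ is the total sample size and $S_h$ is the population standard deviation in stratum $h$.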

36
Note
• This result shows that in a given stratum, the sample size is
larger when
• The stratum size is larger
• Within stratum variability is higher
• The cost is lower in the stratum
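The three effects listed above can be seen numerically with a short sketch. It assumes the standard allocation rule $n_h \propto N_h S_h / \sqrt{c_h}$; `opt_alloc` and the example inputs are hypothetical, and in practice $S_h$ must be guessed or estimated from a pilot study.

```r
# Optimal allocation for a total sample size n (hypothetical helper).
# With equal costs (the default) this reduces to Neyman allocation,
# n_h proportional to N_h * S_h. Returned sizes are not rounded.
opt_alloc <- function(N_h, S_h, n, c_h = rep(1, length(N_h))) {
  w <- N_h * S_h / sqrt(c_h)
  n * w / sum(w)
}

N_h <- c(100, 200)
S_h <- c(2, 1)

# Equal costs: the smaller but more variable stratum gets as much as the larger one
opt_alloc(N_h, S_h, n = 100)

# Making stratum 2 four times as expensive shifts sample toward stratum 1
opt_alloc(N_h, S_h, n = 100, c_h = c(1, 4))
```

With equal costs both strata have $N_h S_h = 200$, so each receives 50 units; raising the cost in stratum 2 halves its weight and moves about 17 units to stratum 1.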

37
Special case $c_h = c$

38
Sample size for proportions

39
40
41
