Sample Survey
Sample Survey Theory
• A population is defined as including all people or items with the characteristic one wishes to
understand.
i. Census
iii. Experiment
2. Generalizability – the appropriateness of applying findings from a study to a larger
population.
• Sampling method refers to the way that observations are selected from a population to be in
the sample for a sample survey.
• Probability Samples: each population element has a known, non-zero chance of being chosen for
the sample through random selection.
• Non-Probability Samples: we don’t know the probability that each population element will be
chosen.
• Below are some of the types of non-probability sampling methods
• Bias is the tendency of a sample statistic to systematically over- or under-estimate a population
parameter. It often occurs when the survey sample does not accurately represent the
population.
• The bias that results from an unrepresentative sample is called selection bias. Examples include
under-coverage, non-response bias and voluntary response bias.
• Random Sampling is a procedure for sampling from a population in which the selection of a
sample unit is based on chance and every element of the population has equal chance of being
selected.
• Response Bias refers to the bias that results from problems in the measurement process.
• Sampling refers to the process of choosing a sample of elements from a total population of
elements.
• Probability Sampling: every element of the population has a known probability of being
included in the sample.
• Non-Probability Sampling: we cannot specify the probability that each element will be included
in the sample.
i. It is convenient
• Disadvantages of Non-Probability Sampling
i. Accuracy
ii. Precision
• A sample design can be described by two factors: the sampling method and the estimator.
Simple Random Sampling
• Properties
iii. If all possible samples of n objects are equally likely to occur then the sampling method is
called simple random sampling.
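As a rough illustration of the definition above, the sketch below draws a simple random sample without replacement using Python's standard library. The function name `simple_random_sample` and the population of numbered units are assumptions made for the example, not part of the slides.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed is used only to make the example repeatable
    return rng.sample(list(population), n)


# Hypothetical sampling frame: units numbered 1..20000.
frame = range(1, 20001)
sample = simple_random_sample(frame, 36, seed=1)
print(len(sample), "units drawn, all distinct:", len(set(sample)) == len(sample))
```

`random.sample` selects without replacement, which matches the "without replacement" formulas used later for a finite population.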
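Mean
Population variance known (with replacement)
$$\text{S.D} = \sqrt{\frac{\sigma^2}{n}}$$
Population variance known (without replacement)
$$\text{S.D} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{N}{N-1}\cdot\frac{\sigma^2}{n}}$$
Population variance estimated (with replacement)
$$\text{S.E} = \sqrt{\frac{s^2}{n}}$$
Population variance estimated (without replacement)
$$\text{S.E} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}$$
Proportion
$$\text{S.D} = \sqrt{\frac{P(1-P)}{n}} \quad\text{(with replacement)}, \qquad \text{S.D} = \sqrt{\frac{N-n}{N-1}\cdot\frac{P(1-P)}{n}} \quad\text{(without replacement)}$$
$$\text{S.E} = \sqrt{\frac{p(1-p)}{n-1}}$$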
S.E – this is an estimate of the standard deviation of the sampling distribution.
• Estimated Population Variance
With replacement
$$\text{S.E} = \sqrt{\frac{s^2}{n}}$$
Without replacement
$$\text{S.E} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}$$
• Population Proportion (Known)
Known (with replacement)
$$\text{S.D} = \sqrt{\frac{P(1-P)}{n}}$$
Known (without replacement)
$$\text{S.D} = \sqrt{\frac{N-n}{N-1}\cdot\frac{P(1-P)}{n}}$$
• Population Proportion (Estimated)
Estimated (with replacement)
$$\text{S.E} = \sqrt{\frac{p(1-p)}{n-1}}$$
Estimated (without replacement)
$$\text{S.E} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{p(1-p)}{n-1}}$$
• Note
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
• Questions
1. At the end of every school year, the state administers a reading test to a simple random
sample drawn without replacement from a population of 20,000 third graders. This year,
the test was administered to 36 students selected via simple random sampling. The test
score from each sampled student is as shown below:
50, 55, 60, 62, 62, 65, 67, 67, 70, 70, 70, 70, 72, 72, 73, 73, 75, 75, 75, 78, 78, 78, 80, 80, 80,
82, 82, 85, 85, 88, 88, 90, 90, 90.
Using the sample data, estimate the mean reading achievement level in the population.
Find the margin of error and the confidence interval. Assume a 95% confidence level.
• Solution
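Sample mean:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{50 + 55 + 60 + 62 + 62 + \cdots + 90}{36} = \frac{2700}{36} = 75$$
Hence the estimated mean reading achievement level in the population is 75.
Sample variance:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{(50-75)^2 + (55-75)^2 + (60-75)^2 + \cdots + (90-75)^2 + (90-75)^2 + (90-75)^2}{36-1} = 94.26$$
Standard error (sampling without replacement):
$$\text{S.E} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}} = \sqrt{\left(1 - \frac{36}{20000}\right)\frac{94.26}{36}} \approx 1.6166$$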
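The remaining steps of this question (margin of error and confidence interval) can be sketched in a few lines of Python. This is only an illustration: it starts from the summary figures quoted in the solution (mean 75, sample variance 94.26, n = 36, N = 20,000) rather than from the raw scores, and it assumes the normal critical value 1.96 for a 95% confidence level. The function name `srs_standard_error` is made up for the example.

```python
import math


def srs_standard_error(s2, n, N):
    """Standard error of the mean under SRS without replacement."""
    return math.sqrt((1 - n / N) * s2 / n)


n, N = 36, 20_000          # sample and population sizes from the question
mean, s2 = 75.0, 94.26     # summary figures quoted in the solution
z = 1.96                   # assumed critical value for 95% confidence

se = srs_standard_error(s2, n, N)
margin_of_error = z * se
ci = (mean - margin_of_error, mean + margin_of_error)

print(f"S.E = {se:.4f}, margin of error = {margin_of_error:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
```

The margin of error is simply the critical value times the standard error, and the interval is the sample mean plus or minus that margin.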
Stratified Random Sampling
• Properties
1. The population consists of N elements.
2. The population is divided into H groups called strata.
3. Each element of the population can be assigned to one, and only one, stratum.
4. The number of observations within each stratum, N_h, is known.
• Advantages
1. It provides greater precision than a simple random sample of the same size.
2. It often requires a smaller sample, which saves money.
3. It guards against an unrepresentative sample.
4. It assures sufficient sample points to support a separate analysis of any
subgroup.
• Disadvantages
• Properties of Proportionate Stratification
1. It provides equal or better precision than a simple random sample of the same size.
2. Gains in precision are greatest when values within strata are homogeneous.
• Disproportionate Stratification: the sampling fraction may vary from one stratum to the next.
• Properties of Disproportionate Stratification
1. Precision may be very good or very poor, depending on how sample points are allocated
to strata.
2. If variances differ across strata, disproportionate stratification can provide better precision
than proportionate stratification, when sample points are correctly allocated to strata.
$$\text{Sample estimate of mean} = \sum_h \frac{N_h}{N}\,\bar{x}_h$$
$$\text{Sample estimate of proportion} = \sum_h \frac{N_h}{N}\,p_h$$
where $N_h / N$ is the stratum weight, i.e. the fraction of the population that falls in stratum $h$.
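• With replacement (population variance known)
$$\text{S.D} = \frac{1}{N}\sqrt{\sum_h N_h^2\,\frac{\sigma_h^2}{n_h}}$$
• Without replacement (population variance known)
$$\text{S.D} = \frac{1}{N}\sqrt{\sum_h \frac{N_h^3}{N_h - 1}\left(1 - \frac{n_h}{N_h}\right)\frac{\sigma_h^2}{n_h}}$$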
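• With replacement (population variance estimated)
$$\text{S.E} = \frac{1}{N}\sqrt{\sum_h N_h^2\,\frac{s_h^2}{n_h}}$$
• Without replacement (population variance estimated)
$$\text{S.E} = \frac{1}{N}\sqrt{\sum_h N_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}}$$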
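• With Replacement (Population Proportion Known)
$$\text{S.D} = \frac{1}{N}\sqrt{\sum_h N_h^2\,\frac{P_h(1 - P_h)}{n_h}}$$
• With Replacement (Population Proportion Estimated)
$$\text{S.E} = \frac{1}{N}\sqrt{\sum_h N_h^2\,\frac{p_h(1 - p_h)}{n_h - 1}}$$
• Without Replacement (Population Proportion Known)
$$\text{S.D} = \frac{1}{N}\sqrt{\sum_h \frac{N_h^3}{N_h - 1}\left(1 - \frac{n_h}{N_h}\right)\frac{P_h(1 - P_h)}{n_h}}$$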
• Without Replacement (Population Proportion Estimated)
$$\text{S.E} = \frac{1}{N}\sqrt{\sum_h N_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{p_h(1 - p_h)}{n_h - 1}}$$
• Questions
2. At the end of every school year, the state administers a reading test to a sample of third
graders drawn without replacement from a population of 20,000 third graders. This year,
proportionate stratified sampling was used to select 36 students for testing. Because the
population is half boys and half girls, one stratum consisted of 18 boys and the other of 18 girls.
The test scores from each sampled student are shown below:
Boys: 50, 55, 60, 62, 62, 65, 67, 67, 70, 70, 73, 73, 75, 78, 78, 80, 85, 90.
Girls: 70, 70, 72, 72, 75, 75, 78, 78, 80, 80, 82, 82, 85, 85, 88, 88, 90, 90.
Using the sample data, estimate the mean reading achievement level in the population.
Find the margin of error and the confidence interval. Assume 95% confidence level.
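$$\bar{x}_b = \frac{\sum x_i}{n_b} = \frac{50 + 55 + 60 + 62 + 62 + \cdots + 90}{18} = \frac{1260}{18} = 70$$
$$\bar{x}_g = \frac{\sum x_i}{n_g} = \frac{70 + 70 + 72 + 72 + \cdots + 90}{18} = \frac{1440}{18} = 80$$
The estimated population mean is the weighted sum of the stratum means for boys and girls:
$$\bar{x} = \sum_h \frac{N_h}{N}\,\bar{x}_h = \frac{10000}{20000}(70) + \frac{10000}{20000}(80) = 75$$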
To find the margin of error, we first need the standard error.
$$s_h^2 = \frac{\sum (x_i - \bar{x}_h)^2}{n_h - 1}$$
$$s_b^2 = \frac{(50-70)^2 + (55-70)^2 + (60-70)^2 + \cdots + (90-70)^2}{18 - 1} = 105.41$$
$$s_g^2 = \frac{(70-80)^2 + (70-80)^2 + (72-80)^2 + \cdots + (90-80)^2}{18 - 1} = 45.41$$
$$\text{S.E} = \frac{1}{N}\sqrt{\sum_h N_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}}$$
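$$\text{S.E} = \frac{1}{20000}\sqrt{10000^2\left(1 - \frac{18}{10000}\right)\frac{105.41}{18} + 10000^2\left(1 - \frac{18}{10000}\right)\frac{45.41}{18}}$$
$$\text{S.E} = \frac{1}{20000}\sqrt{584557011.1 + 251823677.8}$$
$$\text{S.E} = \frac{1}{20000}\sqrt{836380688.9} \approx 1.45$$
Hence the standard error of the sampling distribution of the mean is 1.45.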
From the z-distribution (normal distribution) the critical value is 1.96.
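To make the stratified calculation concrete, here is a small Python sketch that recomputes the stratified estimate, its standard error and the margin of error directly from the boys' and girls' scores listed in the question. It assumes stratum sizes of 10,000 each (half boys, half girls) and the critical value 1.96; the variable names are illustrative only.

```python
import math
from statistics import mean, variance

boys = [50, 55, 60, 62, 62, 65, 67, 67, 70, 70, 73, 73, 75, 78, 78, 80, 85, 90]
girls = [70, 70, 72, 72, 75, 75, 78, 78, 80, 80, 82, 82, 85, 85, 88, 88, 90, 90]

# Each stratum is (N_h, sampled scores); N_h = 10,000 is taken from the question.
strata = [(10_000, boys), (10_000, girls)]
N = sum(N_h for N_h, _ in strata)

# Weighted sum of stratum means.
est_mean = sum(N_h / N * mean(x) for N_h, x in strata)

# Sum over strata of N_h^2 * (1 - n_h/N_h) * s_h^2 / n_h (without replacement).
total = sum(
    N_h**2 * (1 - len(x) / N_h) * variance(x) / len(x) for N_h, x in strata
)
se = math.sqrt(total) / N
margin_of_error = 1.96 * se

print(f"estimate = {est_mean:.1f}, S.E = {se:.2f}, margin = {margin_of_error:.2f}")
```

`statistics.variance` uses the n - 1 divisor, which is the sample variance the slides call for.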
Cluster Sampling
• Properties
4. Each element of the population can be assigned to one and only one cluster.
1. One-stage sampling
2. Two-stage sampling
• Advantage of Cluster Sampling
2. When the increased sample size is sufficient to offset the loss in precision, cluster sampling may be
the best choice.
1. With stratified sampling, all strata are represented in the sample; with cluster sampling, only a subset of clusters is in the sample.
2. With stratified sampling, the best survey results occur when elements within strata are internally
homogeneous; with cluster sampling, the best survey results occur when elements within clusters
are internally heterogeneous.
• Formulae for Cluster Sampling
$$\text{Mean} = \frac{N}{nM}\sum_i M_i\,\bar{x}_i$$
$$\text{Proportion} = \frac{N}{nM}\sum_i M_i\,p_i$$
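Standard Error of Mean Score for Two-Stage Method
$$\text{S.E} = \frac{1}{M}\sqrt{\frac{N^2\left(1 - \frac{n}{N}\right)}{n}\cdot\frac{\sum_i\left(M_i\bar{x}_i - \frac{t_{mean}}{N}\right)^2}{n-1} + \frac{N}{n}\sum_i\left(1 - \frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i}}$$
Note
$$t_{mean} = \frac{N}{n}\sum_i t_i, \qquad t_i = \frac{M_i}{m_i}\sum_j x_{ij}$$
For proportions, replace $\bar{x}_i$ with $p_i$, $t_{mean}$ with $t_{prop}$, $s_i^2$ with $p_i(1 - p_i)$, and the divisor $m_i$ with $(m_i - 1)$.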
• Questions
3. At the end of every school year, the state administers a reading test to a sample of third
graders. The school system has 20,000 third graders grouped in 1000 separate classes.
Assume that each class has 20 students. This year, the test was administered to each
student in 36 randomly sampled classes. Thus, this is one-stage cluster sampling, with
classes serving as clusters. The average test score from each sample cluster is given as
55, 60, 65, 67, 67, 70, 70, 70, 72, 72, 72, 72, 73, 75, 75, 75, 75, 75, 77, 77, 78, 78, 78, 78, 80,
80, 80, 80, 80, 80, 83, 83, 85, 85.
Using the sample data, estimate the mean reading achievement level in the population.
Find the margin of error and the confidence interval. Assume a 95% confidence level.
• Solution
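$m_i = 20$, $N = 1000$, $n = 36$, $M = 20000$, $M_i = 20$
$$\text{Sample Mean} = \frac{N}{nM}\sum_i M_i\,\bar{x}_i = \frac{1000}{36 \times 20000}\sum_i 20\,\bar{x}_i = \frac{\sum_i \bar{x}_i}{36}$$
$$\sum_i \bar{x}_i = 55 + 60 + 65 + 67 + \cdots + 83 + 85 + 85 + 85 = 2700$$
so the sample mean is $2700 / 36 = 75$.
$$t_{mean} = \frac{N}{n}\sum_i \frac{M_i}{m_i}\sum_j x_{ij} = \frac{1000}{36}\sum_i \frac{20}{20}\sum_j x_{ij} = 27.778\sum_i 20\,\bar{x}_i$$
Because every student in each sampled class is tested ($m_i = M_i$), the within-cluster term of the standard error is zero:
$$\text{S.E} = \frac{1}{20000}\sqrt{1000^2\cdot\frac{1 - \frac{36}{1000}}{36}\cdot 18217.143} \approx 1.1$$
With a critical value of 1.96, the margin of error is $1.96 \times 1.1 = 2.156$, so the confidence interval is
$$75 \pm 2.156$$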
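The one-stage cluster calculation can be checked with a short Python sketch. It assumes the figures quoted in the solution: N = 1000 clusters, n = 36 sampled clusters, M = 20,000 students, and 18217.143 as the between-cluster term (the sum of squared deviations of the cluster totals from their mean, divided by n - 1). The function name `one_stage_cluster_se` is hypothetical.

```python
import math


def one_stage_cluster_se(N, n, M, s2_totals):
    """S.E of the mean for one-stage cluster sampling.

    s2_totals is sum((M_i * xbar_i - t_mean / N) ** 2) / (n - 1).
    The within-cluster term is omitted because every unit in a
    sampled cluster is observed.
    """
    return math.sqrt(N**2 * (1 - n / N) / n * s2_totals) / M


se = one_stage_cluster_se(N=1000, n=36, M=20_000, s2_totals=18217.143)
margin_of_error = 1.96 * se  # 1.96 is the assumed 95% critical value

print(f"S.E = {se:.4f}, margin of error = {margin_of_error:.3f}")
```

The unrounded standard error is slightly above 1.1, so the margin of error computed here is marginally larger than the 2.156 obtained by rounding the standard error to 1.1 first.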
• 6 Factors That Influence Sample Size
1. Cost considerations
2. Administrative concerns
4. Confidence level
5. Sampling method
• Questions
4. At the end of every school year, the state administers a reading test to a sample of 36
third graders. The school system has 20,000 third graders, half boys and half girls. The
results of last year’s test are shown in the table below
Stratum   Mean score   Standard deviation
Boys      70           10.27
Girls     80           6.66
This year, the researchers plan to use a stratified sample, with one stratum consisting of
boys and the other of girls. Use the above information to answer the following.
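a) To maximize precision, how many of the sampled students should be boys and how many should be girls?
Solution
$N_b = 10000$, $N_g = 10000$ (i.e. half boys and half girls)
$$n_h = n\cdot\frac{N_h\,\sigma_h}{\sum_i N_i\,\sigma_i}$$
$$n_b = \frac{36 \times 10000 \times 10.27}{10000 \times 10.27 + 10000 \times 6.67} = 21.825 \approx 22$$
$$n_g = \frac{36 \times 10000 \times 6.67}{10000 \times 10.27 + 10000 \times 6.67} = 14.17 \approx 14$$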
Hence the number of boys needed is 22 and that of girls is 14.
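The allocation rule used here (sample sizes proportional to $N_h \sigma_h$) is easy to express in code. The sketch below is a minimal illustration using the stratum sizes and the standard deviations 10.27 and 6.67 that appear in the worked solution; the function name `optimal_allocation` is an assumption for the example.

```python
def optimal_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample of n across strata in proportion to N_h * sigma_h."""
    weights = [N_h * sd for N_h, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


raw = optimal_allocation(36, [10_000, 10_000], [10.27, 6.67])
rounded = [round(v) for v in raw]

print("unrounded:", [f"{v:.3f}" for v in raw])
print("boys, girls:", rounded)
```

More sample points go to the stratum with the larger spread, which is why the boys' stratum receives 22 of the 36 students even though the strata are the same size.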
$$\bar{x} = \sum_h \frac{N_h}{N}\,\bar{x}_h = \frac{10000}{20000}(70) + \frac{10000}{20000}(80) = 75$$
We need to find the standard error and find the margin of error
Critical value is obtained by finding the critical
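$$\text{S.E} = \frac{1}{N}\sqrt{\sum_h N_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}}$$
$$\text{S.E} = \frac{1}{20000}\sqrt{10000^2\left(1 - \frac{22}{10000}\right)\frac{10.27^2}{22} + 10000^2\left(1 - \frac{14}{10000}\right)\frac{6.67^2}{14}}$$
$$\text{S.E} = \frac{1}{20000}\sqrt{478367543.7 + 317332968.1}$$
$$\text{S.E} = \frac{1}{20000}\sqrt{795700511.8} = 1.41$$
With a critical value of 1.96, the margin of error is $1.96 \times 1.41 \approx 2.76$, so the confidence interval is
$$75 \pm 2.76$$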
Best Sampling Method
• Best Sampling Method: the sampling method that most effectively meets the particular goals
of the study in question.
2. Identify the potential sampling methods that might effectively achieve the goal
Ratio Estimation
• Ratio Estimation
In Simple Random Sampling, if X and Y are positively correlated, we can use ratio estimation
to give more reliable estimates of the population.
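$$T_y = \sum_{i=1}^{N} y_i \quad \text{(total of the y-values in the population)}$$
$$T_x = \sum_{i=1}^{N} x_i \quad \text{(total of the x-values in the population)}$$
$$B = \frac{T_y}{T_x} = \frac{\bar{Y}_U}{\bar{X}_U}$$
$$R = \sum_{i=1}^{N}\frac{(x_i - \bar{X}_U)(y_i - \bar{Y}_U)}{(N - 1)\,S_x S_y} \quad \text{(the population correlation of X and Y)}$$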
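The estimators for $B$, $T_y$ and $\bar{Y}_U$ are
$$\hat{B} = \frac{\hat{T}_y}{\hat{T}_x} = \frac{\bar{y}}{\bar{x}}$$
$$\hat{T}_{yr} = \hat{B}\,T_x$$
$$\hat{\bar{Y}}_r = \hat{B}\,\bar{X}_U, \qquad \bar{X}_U = \frac{T_x}{N}$$
Variance of the Ratio Estimator
$$\text{Bias: } E[\hat{B} - B] \approx \left(1 - \frac{n}{N}\right)\frac{1}{n\bar{X}_U^2}\left(B S_x^2 - R\,S_x S_y\right)$$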
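Variance Known (B)
$$\hat{V}(\hat{B}) = \left(1 - \frac{n}{N}\right)\frac{1}{n\bar{X}_U^2}\cdot\frac{\sum_{i=1}^{n}(y_i - \hat{B}x_i)^2}{n - 1}$$
$$\hat{V}(\hat{\bar{Y}}_r) = \hat{V}(\bar{X}_U\hat{B}) = \bar{X}_U^2\,\hat{V}(\hat{B})$$
$$\hat{V}(\hat{T}_{yr}) = \hat{V}(T_x\hat{B}) = T_x^2\,\hat{V}(\hat{B})$$
If the sample size n is sufficiently large, the 95% confidence intervals are
$$\hat{B} \pm 1.96\,\text{S.E}(\hat{B}), \quad \text{i.e. } \hat{B} \pm k\,\text{S.E}(\hat{B})$$
$$\hat{\bar{Y}}_r \pm 1.96\,\text{S.E}(\hat{\bar{Y}}_r), \quad \text{i.e. } \hat{\bar{Y}}_r \pm k\,\text{S.E}(\hat{\bar{Y}}_r)$$
$$\hat{T}_{yr} \pm 1.96\,\text{S.E}(\hat{T}_{yr}), \quad \text{i.e. } \hat{T}_{yr} \pm k\,\text{S.E}(\hat{T}_{yr})$$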
• Why Use Ratio Estimation
2. We want to estimate a population total but the population size $N$ is unknown, so the
estimator $\hat{T}_y = N\bar{y}$ cannot be used. Instead, we can estimate $N$ by $T_x / \bar{x}$.
3. Ratio estimation increases the precision of estimated means and totals.
4. Ratio estimation is used to adjust estimates from the sample so that they reflect the
demographic totals.
• Questions
$X_i$ = hectares of village $i$
n = 24, TX = 21875.6
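$$n = 24, \qquad \sum_{i=1}^{n} y_i = 3135$$
$$\bar{y} = \frac{\sum y_i}{n} = \frac{3135}{24} = 130.625$$
$$\sum_{i=1}^{n} x_i = 875.1, \qquad \bar{x} = \frac{875.1}{24} = 36.4625$$
$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{130.625}{36.4625} = 3.58244772$$
$$\hat{T}_{yr} = T_x\hat{B} = 21875.6 \times 3.58244772 = 78368.2$$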
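The ratio estimate above takes only a few lines to reproduce. The following Python sketch uses the sample sums and the known population total $T_x$ quoted in this example; the variable names are illustrative.

```python
n = 24
sum_y = 3135       # sample total of y
sum_x = 875.1      # sample total of x
T_x = 21875.6      # known population total of x

y_bar = sum_y / n
x_bar = sum_x / n
B_hat = y_bar / x_bar    # estimated ratio
T_yr = B_hat * T_x       # ratio estimate of the population total of y

print(f"y_bar = {y_bar}, x_bar = {x_bar:.4f}")
print(f"B_hat = {B_hat:.6f}, T_yr = {T_yr:.1f}")
```

Because the same n divides both sample means, the estimated ratio is just the ratio of the two sample totals.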
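Finding the Variance of $\hat{B}$
$$\hat{V}(\hat{B}) = \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U^2}\cdot\frac{\sum (y_i - \hat{B}x_i)^2}{n - 1}$$
where $\bar{x}_U$ is the mean of the population. With $N = 576$, $n = 24$ and $\bar{x}_U = 21875.6 / 576$:
$$\hat{V}(\hat{B}) = \left(1 - \frac{24}{576}\right)\frac{1}{24\left(\frac{21875.6}{576}\right)^2}\cdot\frac{\sum (y_i - 3.582\,x_i)^2}{23}$$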
Village   (y_i − B̂x_i)²     Village   (y_i − B̂x_i)²
1         15.0854           13        20.830
2         0.019044          14        7.918596
3         89.567            15        172.5544
4         11.916            16        2.3716
5         12.362            17        6.130576
6         2.696164          18        9.072144
7         76.94798          19        0.524176
8         0.9826            20        8.2944
9         40.144816         21        4.376464
10        132.8486          22        2.119936
11        0.024964          23        0.191844
12        13.95769          24        47.116996
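Finding the Variance of $\hat{B}$
$$\hat{V}(\hat{B}) = \left(1 - \frac{24}{576}\right)\frac{1}{24\left(\frac{21875.6}{576}\right)^2}\cdot\frac{678.3235}{23} = 8.1644 \times 10^{-4}$$
$$\text{S.E}(\hat{B}) = \sqrt{\hat{V}(\hat{B})} = \sqrt{8.1644 \times 10^{-4}} \approx 0.0286$$
$$\hat{V}(\hat{T}_{yr}) = T_x^2\,\hat{V}(\hat{B}) = 21875.6^2 \times 8.1644 \times 10^{-4} \approx 390798.35$$
$$\hat{T}_y = N\bar{y} = 576 \times 130.625 = 75240$$
$$\hat{V}(\hat{T}_y) = N^2\,\hat{V}(\bar{y}) = 576^2 \times 44.41040943 = 14734308$$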
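As a check on the variance of the estimated ratio, the sketch below evaluates $(1 - n/N)\,\frac{1}{n\bar{x}_U^2}\,\frac{\sum (y_i - \hat{B}x_i)^2}{n-1}$ in Python. It assumes the figures from this example: N = 576, n = 24, $T_x$ = 21875.6 and a sum of squared residuals of 678.3235. Variable names are illustrative.

```python
import math

n, N = 24, 576
T_x = 21875.6      # known population total of x
sse = 678.3235     # sum of (y_i - B_hat * x_i)^2 quoted in the example

x_bar_U = T_x / N  # population mean of x
var_B = (1 - n / N) * sse / (n * x_bar_U**2 * (n - 1))
se_B = math.sqrt(var_B)

print(f"V(B_hat) = {var_B:.4e}, S.E(B_hat) = {se_B:.4f}")
```

Small differences in the last digits compared with hand calculations come from rounding intermediate values.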
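Question: $n = 100$, $N = 1000$, $\sum_{i=1}^{100} y_i = 1750$, $\sum_{i=1}^{100} x_i = 1200$, $t_x = 12500$
$$\bar{y} = \frac{\sum y_i}{n} = \frac{1750}{100} = 17.5, \qquad \bar{x} = \frac{\sum x_i}{n} = \frac{1200}{100} = 12$$
$$\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{17.5}{12} = 1.4583$$
$$\bar{X}_U = \frac{t_x}{N} = \frac{12500}{1000} = 12.5$$
$$\sum_{i=1}^{n}(y_i - \hat{B}x_i)^2 = \sum y_i^2 - 2\hat{B}\sum x_i y_i + \hat{B}^2\sum x_i^2$$
$$\sum_{i=1}^{n}(y_i - \hat{B}x_i)^2 = 31650 - 2(1.4583)(22059.35) + (1.4583)^2(15620) = 529.7992518$$
$$s_e^2 = \frac{\sum (y_i - \hat{B}x_i)^2}{n - 1} = \frac{529.7992518}{100 - 1} = 5.352$$
$$\text{S.E}(\hat{\bar{y}}_r) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}} = \sqrt{\left(1 - \frac{100}{1000}\right)\frac{5.352}{100}} = 0.219$$
The confidence interval $\hat{\bar{y}}_r \pm k\,\text{S.E}(\hat{\bar{y}}_r)$ is therefore
$$\hat{\bar{y}}_r \pm 1.96 \times 0.219$$
Sample size for estimating a ratio:
$$n = \frac{N\sigma^2}{ND + \sigma^2}, \qquad D = \frac{B^2 U_x^2}{4}$$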
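The arithmetic in this question can be retraced with a short Python sketch. It is an illustration under stated assumptions: the sums of squares are taken as $\sum y_i^2 = 31650$ and $\sum x_i^2 = 15620$, the values that reproduce the residual sum of about 529.8 quoted in the solution, and the standard error uses the form $\sqrt{(1 - n/N)\,s_e^2/n}$ that yields 0.219. Variable names are made up for the example.

```python
import math

n, N = 100, 1000
sum_y, sum_x, t_x = 1750, 1200, 12500
# Assumed sums of squares and cross-products (see the note above).
sum_y2, sum_xy, sum_x2 = 31650, 22059.35, 15620

B_hat = (sum_y / n) / (sum_x / n)   # ratio of sample means
x_bar_U = t_x / N                   # population mean of x
y_ratio = B_hat * x_bar_U           # ratio estimate of the mean of y

# Expand sum((y - B*x)^2) so only the three sums are needed.
sse = sum_y2 - 2 * B_hat * sum_xy + B_hat**2 * sum_x2
s2_e = sse / (n - 1)
se = math.sqrt((1 - n / N) * s2_e / n)

lower, upper = y_ratio - 1.96 * se, y_ratio + 1.96 * se
print(f"B_hat = {B_hat:.4f}, estimate = {y_ratio:.3f}, S.E = {se:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

Using the unrounded ratio 17.5/12 changes the residual sum only slightly, and the standard error still rounds to 0.219.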
Question: A company wishes to estimate the ratio of change from last year to this year in the number of
worker-hours lost due to sickness. A pilot study of n = 10 was carried out. The company's records show
that the total worker-hours lost last year was t_x = 16300. Determine the sample size needed to estimate R,
the rate of change for the company, with a bound on the error of estimation of B = 0.01. Assume the total
number of workers is N = 1000.
Solution
$$r = \frac{\sum y_i}{\sum x_i} = \frac{187}{178} = 1.05$$
$$U_x = \frac{t_x}{N} = \frac{16300}{1000} = 16.3$$
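$$D = \frac{B^2 U_x^2}{4} = \frac{(0.01)^2 (16.3)^2}{4} = 6.642 \times 10^{-3}$$
$$n = \frac{N\sigma^2}{ND + \sigma^2} = \frac{1000 \times 3.4596}{1000 \times 6.642 \times 10^{-3} + 3.4596} = \frac{3459.6}{10.1016} = 342.48$$
so a sample of about 343 workers is needed.
For estimating a population mean, the sample size formula is
$$n = \frac{N\sigma^2}{ND + \sigma^2}, \qquad D = \frac{B^2}{4}$$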
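The sample size formula $n = N\sigma^2 / (ND + \sigma^2)$ can be wrapped in a small helper. The sketch below applies it to the worker-hours question, assuming the variance estimate 3.4596 used in the solution and rounding up to the next whole unit; the function name `ratio_sample_size` is hypothetical.

```python
import math


def ratio_sample_size(N, sigma2, D):
    """Sample size n = N * sigma^2 / (N * D + sigma^2)."""
    return N * sigma2 / (N * D + sigma2)


N = 1000
B = 0.01            # bound on the error of estimation
U_x = 16300 / N     # population mean of x from last year's total
sigma2 = 3.4596     # variance estimate from the pilot study (as quoted)

D = B**2 * U_x**2 / 4
n_raw = ratio_sample_size(N, sigma2, D)
n_needed = math.ceil(n_raw)   # round up so the bound is still met

print(f"D = {D:.6f}, n = {n_raw:.2f}, rounded up to {n_needed}")
```

Only the definition of D changes between estimating a ratio, a mean or a total, so the same helper can be reused with a different D.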
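Question: A tract of N = 1000 acres is under study. We wish to estimate the average number of trees, $U_y$.
Determine the sample size necessary to estimate $U_y$ with a bound on the error of estimation of B = 1.0.
Solution
$$r = \frac{\sum y_i}{\sum x_i} = \frac{221}{208} = 1.06$$
$$\hat{\sigma}^2 = \frac{\sum (y_i - r x_i)^2}{9} = 4.20$$
$$D = \frac{B^2}{4} = \frac{(1.0)^2}{4} = 0.25$$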
Sample Size for Estimating Uy
$$n = \frac{N\sigma^2}{ND + \sigma^2}, \qquad D = \frac{B^2}{4N^2}$$
$$r = \frac{\bar{y}}{\bar{x}} = \frac{15.83}{16.13} = 0.9814 \approx 0.98$$
$$\hat{\sigma}^2 = \frac{\sum (y_i - r x_i)^2}{n - 1} = (2.73)^2 = 7.45$$
$$n = \frac{N\sigma^2}{ND + \sigma^2} = \frac{2100 \times 7.45}{2100 \times 0.01417 + 7.45} = \frac{15645}{37.207} = 420.485 \approx 421$$
Regression Estimation
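• The ratio estimate is most appropriate when the relationship between X and Y is linear through the origin;
otherwise one uses the regression estimator.
$$\hat{y}_i = a + b x_i \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
$$\hat{y}_i = \bar{y} + b(x_i - \bar{x})$$
• Regression Estimator of a Population Mean $U_y$
$$\hat{U}_{yL} = \bar{y} + b(U_x - \bar{x}), \quad \text{where} \quad b = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}$$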
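$$\hat{V}(\hat{U}_{yL}) = \frac{N - n}{Nn}\cdot\frac{1}{n - 2}\left[\sum_{i=1}^{n}(y_i - \bar{y})^2 - b^2\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$$
$$\hat{V}(\hat{U}_{yL}) = \frac{N - n}{Nn}\,\text{MSE}$$
$$\text{MSE} = \frac{1}{n - 2}\left[\sum_{i=1}^{n}(y_i - \bar{y})^2 - b^2\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$$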
Question: A mathematics achievement test was given to 486 students prior to entering a college. A
simple random sample of n = 10 students was selected and their final calculus grades, y,
were observed. If $U_x$ (the mean of the math test) = 52, estimate $U_y$ and place a bound on the error of
estimation if MSE = 75.8.
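Solution
$$\hat{U}_{yL} = \bar{y} + b(U_x - \bar{x}) = 76 + 0.766(52 - 46) = 80.6$$
$$\hat{V}(\hat{U}_{yL}) = \frac{N - n}{Nn}\,\text{MSE} = \frac{486 - 10}{486 \times 10}(75.8) = 7.424$$
$$\text{Margin of Error} = 2\sqrt{\hat{V}(\hat{U}_{yL})}$$
$$\text{Margin of Error} = 2\sqrt{7.4240} = 5.45$$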
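The regression estimate and its bound can be reproduced with the sketch below. It assumes the summary values used in the solution (sample means 76 and 46, slope 0.766, MSE 75.8) and the multiplier 2 for the bound; the variable names are illustrative.

```python
import math

N, n = 486, 10
y_bar, x_bar = 76, 46   # sample means of calculus grade and test score
U_x = 52                # known population mean of the test score
b = 0.766               # fitted slope quoted in the solution
mse = 75.8

estimate = y_bar + b * (U_x - x_bar)    # regression estimator of U_y
variance = (N - n) / (N * n) * mse      # estimated variance of the estimator
bound = 2 * math.sqrt(variance)         # bound on the error of estimation

print(f"estimate = {estimate:.1f}, variance = {variance:.3f}, bound = {bound:.2f}")
```

The adjustment term `b * (U_x - x_bar)` raises the estimate above the sample mean because the sampled students scored below the population average on the achievement test.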
Errors in Surveys
• Sampling errors arise solely as a result of drawing a probability sample rather than conducting a
complete enumeration.
• Non-sampling errors are mainly associated with data collection and processing procedures.
TOTAL ERROR
• Why Do Errors Occur
• Causes of Error
3. Systematic error.
- By proper and unbiased probability sampling and by using a large sample size.
• Sources of Non-Sampling Error
1. Definition to be used
3. Measurements to be made
• Types of Non-Sampling Error
1. Specification error
4. Net non-coverage
- By improving the frame by excluding erroneous units and duplicates and updating the frame
through field work to identify units missing from the frame.
• Non-Response: the failure to measure some of the sample units, i.e. the failure to obtain
observations on some units selected for the sample.
• Types of Non-Respondents
1. Not-at-homes
2. Refusals
• Causes of Non-Respondents
1. Lack of motivation
2. Shortage of time
• Ways of Reducing Non-Respondents
1. Good frames
• Measurement Errors: these errors arise when what is observed or measured
departs from the actual values of the sample units, for example through recording or coding mistakes.
• Processing Errors
• Errors of Estimation: Arise in the process of extrapolation of results from the observed sample
units to the entire target population. Errors include coverage, sample selection bias and variable
error.
• Bias refers to systematic errors that affect any sample taken under a specified survey design with
the same constant error.
• Variable error occurs as a result of the failure to consistently apply survey and census procedures.
1. Consistency check
2. Sample check
3. Survey check