Sampling in Statistics

The document discusses sampling techniques used in statistical studies. It explains that sampling involves collecting data from a subset of a population rather than a full census. This can be done through simple random sampling, stratified random sampling, or cluster sampling. Simple random sampling randomly selects elements from the population so each has an equal chance of selection. Stratified random sampling divides the population into non-overlapping groups and then randomly samples from each group. Cluster sampling divides the population into geographic or other convenient groups and then randomly samples some number of these clusters.

Uploaded by

Ariel Raye Rica
Copyright
© © All Rights Reserved

8.0 The Use of Sampling in Statistical Studies

The previous section showed you how to describe data through
statistics like the mean, median and standard deviation. You were
also able to describe data graphically through a histogram or bar
chart. We will now discuss the method of collecting data to create
the histogram. This method of collecting data is called sampling.
This method is devised such that we may be reasonably assured
that the collected data would be representative of the population
that it came from.

8.1 Sampling versus Census

If we could collect data on all the points in a population, we
would have a census. It would appear that collecting a
census is the most accurate way to describe that population.

However, sometimes it may not be possible to collect such
a large amount of data. We can instead pick out a
smaller portion of the population and still be
reasonably assured that the collected data would not be too
far different from a census. This raises the question: Is the
extra information gathered by the census worth the cost?

In contrast with a census, a sample is only a portion of the
population. When samples are collected in some rational
way, we would know enough about the
population to make some decision concerning it.

A census may be unnecessary if a large enough sample
would provide information that is not far different from
that of a census.

Think of sampling as how medicine diagnoses a person’s
state of health. Doctors do not take all your blood and have
it analyzed. (Heaven forbid!) They take only a
sample, like about half a test tube or about 30 mL. This is a
sample out of the approximately 5 liters of blood that an adult
has in the body. 30 milliliters is only a little over half of 1
percent of your blood. They also take blood pressure
readings and listen to your heartbeat with a stethoscope to see
your health indicators at the time you visit the doctor.
Based on those few readings, the doctor may assume that
those should be representative of your daily blood pressure
and heartbeat. Our blood pressure may rise or our heart
may beat irregularly, but conventional medical practice

Sampling Procedures, Sample Size Determination and Basic Experimental Design


page 114
should be able to take that into consideration and still know
whether your body is in danger or not. A blood pressure of
140/100 at rest can be worrying
enough for the doctor to check for further signs of health
problems. The heart, er, never lies.

8.2 Statistical (Probability) Sampling techniques:

8.2.1 Simple Random sampling

A simple random sampling process is such that each and
every element of the population has an equal probability of
being selected. This can be done by ensuring that no single
block of the population keeps getting picked. Raffling a set
of prizes is an example of simple random sampling. If all
raffle tickets are in a drawing box, and all the tickets are
mixed up, shaken and shuffled enough such that few
consecutively numbered tickets are in close proximity with
each other, then we can draw a raffle number with
confidence that the selection would be random.

In practice, simple random sampling makes use of random
number tables or random number generators like the RAN#
key on your calculator. You assign members of the
population to positions on some numbered list. Then use the random
number to determine which position on the list is to be
selected.

For example, say there are 37 people in a class, and you
would want to sample 11 people’s weights. We could
arrange the class by height, and assign the number
1 to the shortest, 2 to the next shortest and so on until we
assign 37 to the tallest. Or we could simply use a class list
sorted by surname. Now we would generate a
list of 11 students by using our calculator’s RAN# key. Let’s
say we got the number 0.902. (By the way, the random
number generator in most scientific calculators generates a
number between 0.001 and 0.999, all with presumably equal
probability.) We could use this number to indicate that we
should select the 90.2nd percentile of our 37-person list. So
we should pick the 37 x 0.902 = 33.374th person, or the 33rd
person in the list.

We could then repeatedly use the calculator keystrokes
RAN# x 37 and round the number we get to the nearest
integer to find the next selected person, until we get our
required 11 people. Now if it happens that the same
person comes up again (which is quite likely, by the way,
when drawing 11 numbers out of only 37), we could simply
generate the next random number to choose another.

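How can this selection process turn out a random sample?
By associating each position on the list with its own range of
random numbers, so that the ranges together cover the whole
end-to-end roster, we give each of the 37 positions a chance of
being selected. For the chances to be equal, every range must
be equally wide, about 1/37 or 0.027 (i.e. 0.001 to 0.027 for
the first in the list, and 0.973 to 0.999 for the last,
numbered 37). Rounding RAN# x 37 up to the next whole number
produces exactly these ranges. Rounding to the nearest integer
does not quite do so: it leaves the last position a range only
half as wide (0.987 to 0.999) and sends very small random
numbers to a nonexistent position 0.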
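
The calculator procedure above can be sketched in Python. This is only an illustration, not part of the original notes: the roster names, the seed and the function names are made up. The helper maps a random number to a list position using equal-width slices (rounding up), so 0.902 lands on position 34 rather than the 33 obtained by rounding to the nearest integer.

```python
import math
import random


def position_from_random(r, n):
    """Map a random number r in [0, 1) to a list position 1..n.

    Each position owns a slice of width 1/n, so all are equally likely.
    """
    return min(n, math.floor(r * n) + 1)


def simple_random_sample(roster, k, seed=None):
    """Draw k distinct members; every member has the same chance."""
    rng = random.Random(seed)
    # random.sample never repeats a member, so no redraw step is needed.
    return rng.sample(roster, k)


# Hypothetical class list of 37 students.
roster = [f"student_{i}" for i in range(1, 38)]
chosen = simple_random_sample(roster, 11, seed=2024)
print(chosen)
print(position_from_random(0.902, 37))
```

Because the library draws without replacement, the "press RAN# again on a repeat" step is handled automatically.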

8.2.2 Stratified Random sampling

When a population can be separated into non-overlapping
groups, called strata, then selecting a simple random
sample within each stratum is the sampling procedure called
stratified random sampling.

In economics, we can generally stratify a society into
income groups, like what is popularly known as Class A,
Class B, C, D and E. Based on income, a member of the
population cannot be both Class A and Class B, so we could say
that income is a neat way of stratifying people in a society.
If we would collect random samples from each class on
some measured variable like the number of call-minutes made on
a mobile phone, then the collection of call-minutes data
from Class A would constitute a random sample for that
stratum, and another set from Class B would be another
random sample, but for another stratum.

When done this way, the random sample from each stratum
may be described and analyzed alongside the data from the other
strata to find whether the measured variable shows
significant differences or not.
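
A minimal sketch of stratified random sampling in Python follows, assuming the population is a list of records and that a function tells us which stratum each record belongs to. The subscriber data, field names and sample size per stratum are invented for illustration only.

```python
import random


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Group the population into strata, then draw a simple random
    sample of up to n_per_stratum members from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    return {
        name: rng.sample(members, min(n_per_stratum, len(members)))
        for name, members in strata.items()
    }


# Hypothetical subscribers, each in exactly one income class.
data_rng = random.Random(1)
subscribers = [
    {"id": i,
     "income_class": data_rng.choice("ABCDE"),
     "call_minutes": data_rng.randint(0, 600)}
    for i in range(500)
]

samples = stratified_sample(subscribers, lambda s: s["income_class"], 10, seed=7)
for name in sorted(samples):
    minutes = [s["call_minutes"] for s in samples[name]]
    print(name, sum(minutes) / len(minutes))
```

Every stratum is sampled, which is the feature that distinguishes this procedure from cluster sampling.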

Other examples of strata would be:

• Dividing engineering students by year level.
• Categorizing companies as to what business
commercial sector they belong to, like trading, or
banking or manufacturing, and measuring their
revenues and profit margins.
• Categorizing companies by size: Large-, Small- or
Medium-scale and sampling their bank loan amounts.
• Dividing paper according to density (or what the paper
industry calls substance) to measure the moisture
absorption of various brands of paper of the same
substance categories.
• Dividing up dogs into large, medium and small breeds,
to measure daily food intake.
• Dividing up college students in the Philippines as
either being in Bachelor of Arts degrees or Bachelor of
Science Degrees.

8.2.3 Cluster sampling

In cluster sampling, the population is divided up into groups by
geographic location, or by some convenient grouping
method, and we then take a random sample of some
m out of the total M clusters. For the selected
clusters, we could either take a census or just a random
sample from each cluster.

For example, we would like to study the effects of a
chemical pesticide on plants in various parts of the
country. We could divide the country by
province, and each province could be considered a cluster.
We would not sample every province, but only some of those
which are likely to use pesticides for their crops.
Say we select Laguna, Cebu and Baguio out of all the
provinces and cities. We then apply the pesticide in
the three selected agricultural environments and
draw conclusions about the pesticide’s effectiveness
based on the resulting harvest.
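
The two-stage idea (randomly pick m of the M clusters, then take either a census or a random sample inside each chosen cluster) can be sketched as below. The province and farm names are placeholders, and the function is an illustrative assumption rather than a standard library routine.

```python
import random


def cluster_sample(clusters, m, n_within=None, seed=None):
    """Randomly choose m clusters, then study each chosen cluster.

    n_within=None means a census of every chosen cluster; otherwise a
    simple random sample of n_within members is taken from each.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)
    result = {}
    for name in chosen:
        members = clusters[name]
        if n_within is None or n_within >= len(members):
            result[name] = list(members)
        else:
            result[name] = rng.sample(members, n_within)
    return result


# Hypothetical clusters: 12 provinces with 20 farms each.
provinces = {
    f"province_{i}": [f"farm_{i}_{j}" for j in range(20)]
    for i in range(1, 13)
}

picked = cluster_sample(provinces, m=3, n_within=5, seed=11)
census = cluster_sample(provinces, m=2, seed=11)
print(picked)
```

Only the selected clusters contribute any data; the other clusters are never visited.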

So how is cluster sampling different from stratified
sampling?

Stratified sampling versus cluster sampling

1. Stratified sampling: Population is divided up into subgroups,
   each with many elements. The subgroups are selected according
   to some criterion that is related to the variables under study.
   Cluster sampling: Population is divided up into subgroups,
   each with many elements. The subgroups are selected according
   to some criterion of ease or availability in data collection.

2. Stratified sampling: We try to secure homogeneity (similarity
   of characteristics) within each stratum and heterogeneity
   (diversity of characteristics) between different strata.
   Cluster sampling: We try to secure heterogeneity within
   subgroups and homogeneity between subgroups. We hope that
   each cluster contains members with diverse characteristics,
   and if we compare across clusters, we hope that all clusters
   are the same in the percentages of variety groups.

3. Stratified sampling: Elements within each stratum are randomly
   selected. Each stratum would be sampled.
   Cluster sampling: Only clusters are selected randomly, and
   each selected cluster is in turn studied in the same manner.
   Not all clusters would be sampled from.

Homogeneity exists within classes such as Young
Single Filipino Men aged 22 to 26. Middle-aged Married
Filipino Men aged 40 to 49 would probably have some
homogeneity as well, but of a different kind
compared to the younger set. We would expect that men
who grew up in the same generation or decade would have
some similar taste, outlook or characteristic.

Heterogeneity exists among groups of young urban
professionals (yuppies) because we are likely to find men
and women, highly educated with MBAs or just college
graduates, with some who like to party on weekends, and
some who prefer a quiet evening with family and friends.
There is diversity among young urban professionals. A
cluster sample may be made of the Manila yuppie, the
Shanghai yuppie and the Singaporean yuppie, out of all the
Asian professional groups. Cluster sampling is often
used in the study of marketing demographics.

Further Examples of Clustered sampling

• Dividing engineering students by course.
• Dividing up the Philippines’ school districts into
Regions, of which NCR (National Capital Region) would
be one of the clusters.
• Dividing burned out metal samples from a fire accident
site by their locality or place of detection.
• A face recognition algorithm/device would divide a
person’s face into areas or clusters, and then a
sample of each face sector is compared with a known
database of facial features to confirm the face’s
identity.

8.2.4 Systematic sampling

This sampling procedure is obtained by randomly
selecting one element from the first K elements in a
population, and then selecting every Kth element
thereafter. Systematic sampling can be used as an
alternative to a simple random sampling if the population
is randomly ordered.

For example, in exit polls in an election, every
5th voter is asked which candidate they voted for. That
would mean the voters sampled would be the 3rd
(randomly selected out of the first 5 voters), then the 8th,
13th, 18th, 23rd, and so on.
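
The exit-poll example can be reproduced with a short Python sketch. The function name and the population size of 25 voters are assumptions chosen for illustration; the starting position is drawn at random from the first K positions unless one is supplied.

```python
import random


def systematic_positions(n_population, k, start=None, seed=None):
    """Return 1-based positions: a random start within the first k
    elements, then every k-th element thereafter."""
    if start is None:
        start = random.Random(seed).randint(1, k)
    return list(range(start, n_population + 1, k))


# Start fixed at the 3rd voter to mirror the exit-poll example.
print(systematic_positions(25, 5, start=3))

# Random start: the first position falls somewhere in 1..5.
random_run = systematic_positions(25, 5, seed=42)
print(random_run)
```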

8.3 Non-Probability Sampling techniques:

These sampling techniques should be used sparingly,
because they cannot assure randomness and hence,
representativeness in the resulting sample.

8.3.1 Convenience sampling

This sampling method simply collects data from
the sources that are immediately available at the
time when data is to be collected. When asking
people whether they like an ice cream bar by
giving them free taste samples, necessarily,
the only opinions one will get are those of people
who are adventurous enough to try new ice
creams. Other
potential buyers of this ice cream may have been too
shy or unaware that there exists a new ice cream
being hawked.

8.3.2 Quota sampling
In quota sampling, one sets a specified number of sampling points
and collects samples until that upper
limit (or quota) is reached. This method
may work when the quota is computed with specific formulas (to
be discussed in a later section of this Unit 8.0). If a
quota is just arbitrarily thought of, we run the risk of
not being confident that the sample is
large enough to make valid inferences about
the population. In quota sampling, the order
of selecting points to measure must still
be unbiased, unlike in convenience sampling, in order to
assure validity.

8.3.3 Judgment Sampling

In this procedure, the data collector is assumed
to be an expert in judging whether a sampling
element will be included in the sample or not. This
gives the data collector direct control
over the membership of the
sample. Judgment sampling is done when decision
makers feel that some population members have
better information than other members, or
that some elements are more representative of the
population than others. In computer game testing,
some expert gamers are given a “first-look” into
certain games that will come out later to enable the
game creators to see how average game
consumers/customers would react to the game as
their skills in the game grow. The expert gamers
would, as conventional wisdom goes, be able to find
problems faster than others would. Eventually,
others would find out that, say, Orcs are better than
humans, but the experts would find out earlier during
first-looks.

Non-probability sampling techniques are not necessarily
wrong and do have a place in statistics. Sometimes, a
non-random sample is what one gets when one is pressed for
time and cannot draw a truly random sample. When
newspeople interview people on the street, they would more
likely than not get a non-random (non-probability) sample,
because those who are not vocal about their
views may never be interviewed. Many in
the population would feel the same reluctance about being
interviewed, and they would not be represented.
Non-random sampling techniques may be used if the intention is
only to describe a population and not to make
inferences about the future behavior of a variable from the
said population.

8.4 Sample size Determination

We now turn our attention to knowing just how many
observations must be made in a sample to be reasonably
correct with any inferences made.

We have three formulas for this:

$$ n = \left( \frac{Z_{\alpha/2}\,\sigma_x}{E} \right)^2
\quad\text{or}\quad
n = \frac{Z_{\alpha/2}^2\, p\,(1-p)}{E^2}
\quad\text{or}\quad
n = \frac{N Z_{\alpha/2}^2\, p q}{N E^2 + Z_{\alpha/2}^2\, p q} $$

Where n = required minimum sample size.
E = tolerable error in the estimate, either for a
measurement or for a proportion.
$Z_{\alpha/2}$ = normal Z table value for a sample inference
probability of error $\alpha$.
[The value of $\alpha$ is halved because we can be wrong in
the inference in two extreme ways, being too high, or
being too low. We divide the total probability of error $\alpha$
into two ($\alpha/2$) to represent this.]
$\sigma_x$ = population standard deviation in the measurement for
the mean.
p = proportion of the population that is of interest, based
on prior knowledge about the population. If p is
unknown, a value of 0.5 may be assumed because
p = 0.5 maximizes the value of p(1-p).
q = 1 - p.
N = population size.

The first formula is useful when we are measuring variables.
The second and third formulas are useful when we are making
inferences about proportions. The second assumes we do not
know the population size N, while the third formula uses it.

In all three formulas, the parameter E represents the tolerable
error in either the measurement (for formula 1) or the tolerable
error in the proportion (0<E<1, as in formulas 2 and 3).

Example problems:

1. A survey is planned to determine the average annual family
medical expenses of employees of a large company. The
management of the company wishes to be 95% confident that
the sample average is correct to within P200 of the true average
annual medical expenses. A pilot study indicates that the
standard deviation can be estimated as P1,500. How large a
sample size is necessary?
[Ans: 216.09, rounded up to 217 families’ medical expenses.]

2. A cable television company would like to estimate the proportion
of its customers who would purchase a cable television program
guide. The company would like to be 95% confident that its
estimate is correct to within 0.05 of the true proportion. Past
experience in other areas indicates that 30% of the customers
will purchase the program guide. What sample size is needed?
[Ans: 322.6944 or 323 customers should be
asked whether they would buy or not.]

3. In a feasibility study, the target market has been identified as
4,200 households in a certain metropolitan residential zone.
Determine the sample size that is appropriate to get statistical
inferences at 95% confidence level when the estimate of the
market acceptability must be within 4 percentage points of the
true value.
[Ans: Assume p=0.5, n = 525.19 or 526 samples]
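
The three example problems can be checked with a few lines of Python. This is a sketch under the assumption that Z = 1.96 is used for 95% confidence; the function names are invented here. Each raw result is rounded up, since n is a minimum sample size.

```python
import math


def n_for_mean(z, sigma, e):
    """Formula 1: sample size for estimating a mean."""
    return (z * sigma / e) ** 2


def n_for_proportion(z, p, e):
    """Formula 2: sample size for a proportion, population size unknown."""
    return z ** 2 * p * (1 - p) / e ** 2


def n_for_proportion_finite(z, p, e, population):
    """Formula 3: sample size for a proportion, population size N known."""
    q = 1 - p
    return population * z ** 2 * p * q / (population * e ** 2 + z ** 2 * p * q)


Z_95 = 1.96  # assumed normal table value for 95% confidence

n1 = n_for_mean(Z_95, 1500, 200)
n2 = n_for_proportion(Z_95, 0.30, 0.05)
n3 = n_for_proportion_finite(Z_95, 0.5, 0.04, 4200)

for raw in (n1, n2, n3):
    print(round(raw, 4), "->", math.ceil(raw))
```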

A last word about sample size:

In the absence of formulas, we can use the rule of thumb of
using at least 30 observations in a sample so that the
mean of our sample can be treated as approximately “normally
distributed”. This is supported by a well-established
statistical theorem called the Central Limit Theorem.
This theorem states that when sample sizes are large enough, the
distribution of the sample mean becomes like a normal
distribution, whatever the shape of the population (provided its
variance is finite). This “large
enough” lower limit is commonly taken by statisticians to be 30
observations.

Take note that this “30 sample size” rule of thumb only
supports approximate normality of your sample mean. It does not
guarantee representativeness. Normality matters in sample tests
because many statistical tests are based on a normality
assumption. If that assumption fails badly, the results of such
tests may not be valid.
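
The behaviour described by the Central Limit Theorem can be seen in a small simulation. The sketch below is an illustration only: it assumes a strongly skewed exponential population with mean 1, samples of 30 observations, and 5,000 repetitions, all chosen arbitrarily. The sample means should cluster around 1 with a spread close to 1/sqrt(30), or about 0.18.

```python
import random
import statistics


def sample_means(draw, n, repetitions, seed=None):
    """Repeatedly draw samples of size n and record each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(repetitions)
    ]


# Skewed population: exponential with mean 1 and standard deviation 1.
means = sample_means(lambda rng: rng.expovariate(1.0), n=30,
                     repetitions=5000, seed=3)

print("mean of sample means:", statistics.fmean(means))
print("spread of sample means:", statistics.stdev(means))
```

A histogram of `means` would look roughly bell-shaped even though the individual observations are far from normal.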

8.5 Basic Experimental Design:

What most of us learned in our grade school science classes
about the scientific method goes something like this:

The Scientific Method:


1. Create a hypothesis.
2. Perform an experiment. (or Make observations.)
3. Make conclusions based on the results of the experiment.

What our grade school science teachers may have left out is how
to design an experiment. What we normally have in grade school is a
set procedure to be followed. Let us now discuss certain concepts
that are used in the design of experiments.

Most, if not all, hypotheses are about some causal relationship
between a cause and an effect. Let us take some engineering
examples:

• Glass coating affects solar water heating efficiency.
• Coefficient of thermal dispersion affects the machinability of cast
iron.
• The shape and texture of an object affect how a robot arm can
pick it up.
• The type of interfering material affects radio signal transmission.
• The amount of bacteria affects oxidation rates in a sewage tank.
• The amount of moisture in concrete mixes affects the shear
strength of the resulting concrete slab.
• The number of micromotions needed to perform a task affects
the skill-learning curve of a worker.

8.5.1 Hypothesis building: Operationally defining your
measures of performance.

Our hypothesis should be translated into some numerical
relationship in order for us to collect data for an experiment. To do
this, we may want to find some measure that best describes the
population.

We usually use numerical measurements in our sampling data.
What we need in experiments is to select a particular quantitative
measure to best describe what we want to study.

Let us take an easily imagined experiment. Suppose that we just
learned about a new way to go home by car from school. Now, we
would like to know if this new way would be better than our old route.
We can now choose a measure of route quality. Time would be a
variable associated with routes. The number of minutes it takes you to
get home via both routes would then be of interest.

You could state your initial hypothesis that the new route takes
the same number of minutes of travel time as the old route. You have
effectively made a causal relationship in your hypothesis: that the
“route” causes differences in “travel time”.

8.5.2 Performing the Experiment: Sample size
considerations and experimental conditioning.

So you could now make observations. This is where care should be
taken, and where design of experiments begins.

We need to determine sample size. Let’s suppose (via
judgment, a nonprobability sampling approach) that 2 weeks’ worth of
observations on the new route should be adequate to represent all
future route times. Since you have always been using your old route,
but need some exact quantity, you decide to take one week’s worth of
trips and time yourself on the old route.

Now we need to take care that experimental conditions are kept
constant during the two observation periods. This ensures that no
other factors come in to affect the travel times. Here are some
examples of factors that may affect the travel time (and must be
eliminated or minimized in their effects):

• Beginning to go home at different times. Rush hours vs non-rush
hours would certainly affect travel time. Try going home at
the times that you normally would, say 4 to 4:15 pm every day.
• Using a different car. We may not be aware of it, but we may
unconsciously drive faster with an old car than with a relatively
new one.
• Going for gas refills on various days. When you measure your
travel time, gas refueling times must not be included so as not to
increase travel time. [Of course, you would normally take 2
days out of a week to stop by a gasoline station to refuel, but
note that these stops may be irregular activities in your travel.
You must guarantee consistency of conditions so that the
experimental observations would be untainted by other
time-expanding factors.]
• Carpooling with friends on some days, and going home alone on
others. This is self-explanatory. You go different ways, and
the detour certainly lengthens the travel time.
• Having a driver instead of driving yourself. Your driver may have
more skills.
• Using another type of fuel. Using another car oil. Having your
car oil changed. Car-related changes may make your car run
faster.
• Turning on your car radio to a different station. Some kinds of
music may affect your driving awareness and pacing.

You may already have a good enough picture from the list
above of what other conditions may inadvertently affect
driving time. The key here is consistency of conditions. If that is
assured, then we can say that the average times you got using your
old route and the average times you got with the new route can
now be compared on equal terms, and any significant differences
would point to the superiority of one route over the
other.

Let’s suppose that the average times are significantly
different with the new route, and you are confident that the
new route can be adopted. This does not mean that the new route
will never take you longer to get home once in a while. When
this happens, just remember that some deviation from the
normal conditions may have caused it.
There may have been a car accident along the new route during
that time, or there may have been a new traffic scheme
implemented. In any case, you should be observant, and try to
determine if route conditions have truly changed or if the
changes are just “flukes” or what statisticians call outlier points.

8.5.3 Ways to improve the experimental design

In the example given on routes and travel times, one can
improve the experiment by enabling some randomization in the
observation.

• You may try observing on various days in a month,
and not during the contiguous 2-week period for the
new route and the 1-week period for the old route. You could take
observations such that each weekday has at least 2
observations, and you could alternate old and new
routes: 2 days with old route, then 2 days with new
route, then 1 day with old, then 1 day new, etc.
• You could also try driving out during rush hour, and
then during non-rush hours, but keep your records
clear on what time conditions were on each trial, so
that you may compare rush-hour effects along with
the originally planned difference between old and
new route. The following matrix should help visualize
the conditions:

                          Old route           New route
Rush-hour departure       <Observed times>    ...
Non-rush-hour departure   ...                 ...

In fact, this design is studying two factors instead of just
one. The two factors that affect travel time are (1)
routes and (2) time of departure.

Old and new routes refer to the levels of the factor
called “route”, and rush and non-rush hour refer to the
levels of the factor called “time of departure”.

One can also add another factor, say, using new and
old cars--to further find if the car type affects travel
time. We may now revise the design of the experiment
to include this new factor.

                          Old car                    New car
                          Old route    New route     Old route    New route
Rush-hour departure
Non-rush-hour departure
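
The cells of such a design table can be listed mechanically. The sketch below, with factor and level names taken from the example, enumerates every combination of levels (a full factorial layout); the dictionary structure is just one convenient assumption for holding the design.

```python
from itertools import product

# Each factor maps to its levels, as in the travel-time example.
factors = {
    "car": ["old car", "new car"],
    "route": ["old route", "new route"],
    "departure": ["rush hour", "non-rush hour"],
}

# Every combination of one level per factor is one experimental cell.
cells = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for number, cell in enumerate(cells, start=1):
    print(number, cell)
```

Three factors at two levels each give 2 x 2 x 2 = 8 cells, and each cell needs its own set of timed trips.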

A word about censoring data:


When an assignable cause can be found that changes your
observed times, then you can confidently drop that data point
(i.e. disregard that observation) from your sample and simply go
on collecting the next sampling point.

8.5.4 Making basic conclusions:

To find if the new route is different from the old route, you
may create a histogram of the two samples, and compare the
two. Let’s say you took 14 observations with the old route and
21 with the new. You made the following histograms:

[Figure: two histograms of travel time, one panel for the Old Route and
one for the New Route. Horizontal axis: Minutes, in 5-minute intervals
from 15-20 to 55-60. Vertical axis: Number of observations.]

Old Route: Mean = 42.5 minutes, SD = 5.88 mins, Median = 43 minutes
New Route: Mean = 30.60 minutes, SD = 5.80 mins, Median = 30.63 minutes

From the histograms, it is quite obvious that the new route takes
fewer minutes. The mode of the old route is on the
interval [40,45] minutes, while the mode of the new route is in

The statistics show that the standard deviations are not too
different (5.88 mins vs 5.80 mins). What is more important is that
the average times differ by about 12 minutes. Since the
standard deviations indicate how much individual trip times vary,
the two means are apart by about 2 standard deviations, suggesting
that there is really a meaningful difference between the average times
of both routes.
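
One common way to put a number on this comparison, offered here only as a hedged sketch and not as the method of these notes, is to divide the difference in means by its standard error (the unequal-variance, or Welch, form). The summary statistics and the sample sizes of 14 and 21 come from the example; the function name is made up.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in means divided by its standard error
    (unequal-variance form). Returns (t, standard_error)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se, se


# Old route: 14 trips; new route: 21 trips.
t_stat, se = welch_t(42.5, 5.88, 14, 30.60, 5.80, 21)
print("standard error of the difference:", round(se, 3))
print("t statistic:", round(t_stat, 2))
```

The difference of 11.9 minutes is several standard errors wide, which supports the conclusion drawn from the histograms. Confidence intervals for such differences are taken up in the next section.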

This gives more evidence that the new route indeed saves you
travel time over the old route. Based on this data, you can confidently
say that the new route should be your preferred route home.

When the other factors are included, then histograms of these
combinations may be made further, and comparisons made again.

Further statistical differences may be ascertained by computing
the confidence intervals for the differences among the groups
sampled. Confidence intervals are the topic of the next section.

When more factors are included in the experiment, then
more explanatory causes may be identified.

Basic Experimental design:


To summarize, design of experiments is the planning
involved in hypothesizing about some cause-effect relationship,
then operationally defining a performance measure (usually the
“effect”) and finding some suitable numerical quantity to
measure in the experiment. Once the measures are selected,
levels of the causal factor(s) would have to be identified. Each
level of the factor would be the basis for a sample each.

Care must be practiced in observing and collecting data:
consistency of experimental conditions must be assured, and
enough data must be collected for representativeness.

Conclusions may be made by analyzing the sample data.

Practice Activities:

Hypothesize about what may cause a certain phenomenon
on which you can easily observe data. Put what you learned in
this section to use. Short of analyzing the data you have
collected, perform the following:
a. Formulate a hypothesis. Identify factors (causes) that
affect some observable event (effect).
b. Select a suitable measure.
c. Identify levels of the factors in your experiment and
make a table of the experimental observation cases
that need to be made.
d. Collect data.
e. Create histograms of the results. Make some
preliminary conclusions. (Keep your data for the
next section’s analysis.)
