Research Methods Knowledge Base - Book - SAMPLING
Sampling is the process of selecting units (e.g., people, organizations) from a population
of interest so that by studying the sample we may fairly generalize our results back to the
population from which they were chosen. Let's begin by covering some of the key terms
in sampling like "population" and "sampling frame." Then, because some types of
sampling rely upon quantitative models, we'll talk about some of the statistical terms used
in sampling. Finally, we'll discuss the major distinction between probability and
nonprobability sampling methods and work through the major types in each.
External Validity
External validity is related to generalizing. That's the major thing you need to keep in
mind. Recall that validity refers to the approximate truth of propositions, inferences, or
conclusions. So, external validity refers to the approximate truth of conclusions that
involve generalizations. Put in more pedestrian terms, external validity is the degree to
which the conclusions in your study would hold for other persons in other places and at
other times.
In science there are two major approaches to how we provide evidence for a
generalization. I'll call the first approach the Sampling Model. In the sampling model,
you start by identifying the population you would like to generalize to. Then, you draw a
fair sample from that population and conduct your research with the sample. Finally,
because the sample is representative of the population, you can automatically generalize
your results back to the population. There are several problems with this approach. First,
perhaps you don't know at the time of your study who you might ultimately like to
generalize to. Second, you may not be easily able to draw a fair or representative sample.
Third, it's impossible to sample across all times that you might like to generalize to (like
next year).
One way to improve external validity is to do a good job of drawing a representative
sample: you should try to assure that the respondents participate in your study and that you keep
your dropout rates low. A second approach would be to use the theory of proximal
similarity more effectively. How? Perhaps you could do a better job of describing the
ways your contexts and others differ, providing lots of data about the degree of similarity
between various groups of people, places, and even times. You might even be able to map
out the degree of proximal similarity among various contexts with a methodology like
concept mapping. Perhaps the best approach to criticisms of generalizations is simply to
show them that they're wrong -- do your study in a variety of places, with different people
and at different times. That is, your external validity (ability to generalize) will be
stronger the more you replicate your study.
Sampling Terminology
As with anything else in life you have to learn the language of an area if you're going to
ever hope to use it. Here, I want to introduce several different terms for the major groups
that are involved in a sampling process and the role that each group plays in the logic of
sampling.
The major question that motivates sampling in the first place is: "Who do you want to
generalize to?" Or should it be: "To whom do you want to generalize?" In most social
research we are interested in more than just the people who directly participate in our
study. We would like to be able to talk in general terms and not be confined only to the
people who are in our study. Now, there are times when we aren't very concerned about
generalizing. Maybe we're just evaluating a program in a local agency and we don't care
whether the program would work with other people in other places and at other times. In
that case, sampling and generalizing might not be of interest. In other cases, we would
really like to be able to generalize almost universally. When psychologists do research,
they are often interested in developing theories that would hold for all humans. But in
most applied social research, we are interested in generalizing to specific groups. The
group you wish to generalize to is often called the population in your study. This is the
group you would like to sample from because this is the group you are interested in
generalizing to. Let's imagine that you wish to generalize to urban homeless males
between the ages of 30 and 50 in the United States. If that is the population of interest,
you are likely to have a very hard time developing a reasonable sampling plan. You are
probably not going to find an accurate listing of this population, and even if you did, you
would almost certainly not be able to mount a national sample across hundreds of urban
areas. So we probably should make a distinction between the population you would like
to generalize to, and the population that will be accessible to you. We'll call the former
the theoretical population and the latter the accessible population. In this example, the
accessible population might be homeless males between the ages of 30 and 50 in six
selected urban areas across the U.S.
Once you've identified the theoretical and accessible populations, you have to do one
more thing before you can actually draw a sample -- you have to get a list of the members
of the accessible population. (Or, you have to spell out in detail how you will contact
them to assure representativeness). The listing of the accessible population from which
you'll draw your sample is called the sampling frame. If you were doing a phone survey
and selecting names from the telephone book, the book would be your sampling frame.
That wouldn't be a great way to sample because significant subportions of the population
either don't have a phone or have moved in or out of the area since the last book was
printed. Notice that in this case, you might identify the area code and all three-digit
prefixes within that area code and draw a sample simply by randomly dialing numbers
(cleverly known as random-digit-dialing). In this case, the sampling frame is not a list
per se, but is rather a procedure that you follow as the actual basis for sampling. Finally,
you actually draw your sample (using one of the many sampling procedures). The sample
is the group of people who you select to be in your study. Notice that I didn't say that the
sample was the group of people who are actually in your study. You may not be able to
contact or recruit all of the people you actually sample, or some could drop out over the
course of the study. The group that actually completes your study is a subsample of the
sample -- it doesn't include nonrespondents or dropouts. The problem of nonresponse and
its effects on a study will be addressed elsewhere.
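If it helps to see the procedure-as-frame idea in code, here is a minimal Python sketch of
random-digit dialing. The area code, prefixes, and sample size are made up purely for
illustration.

```python
import random

def random_digit_dial(area_code, prefixes, n_numbers, seed=None):
    """Generate n_numbers random phone numbers within the given area code
    and three-digit prefixes. The 'sampling frame' here is the procedure
    itself, not a list of people."""
    rng = random.Random(seed)
    numbers = set()
    while len(numbers) < n_numbers:
        prefix = rng.choice(prefixes)
        line = rng.randint(0, 9999)            # last four digits
        numbers.add(f"({area_code}) {prefix}-{line:04d}")
    return sorted(numbers)

# Hypothetical area code and prefixes, just to show the mechanics.
sample = random_digit_dial("607", ["255", "272", "277"], n_numbers=10, seed=42)
for phone in sample:
    print(phone)
```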
People often confuse what is meant by random selection with the idea of random
assignment. You should make sure that you understand the distinction between random
selection and random assignment.
At this point, you should appreciate that sampling is a difficult multi-step process and that
there are lots of places you can go wrong. In fact, as we move from each step to the next
in identifying a sample, there is the possibility of introducing systematic error or bias.
For instance, even if you are able to identify perfectly the population of interest, you may
not have access to all of them. And even if you do, you may not have a complete and
accurate enumeration or sampling frame from which to select. And, even if you do, you
may not draw the sample correctly or accurately. And, even if you do, they may not all
come and they may not all stay. Depressed yet? This is a very difficult business indeed.
At times like this I'm reminded of what Donald Campbell used to say (I'll paraphrase
here): "Cousins to the amoeba, it's amazing that we know anything at all!"
Statistical Terms in Sampling
A value, like a mean or average, that we calculate on a sample is called a statistic. In
most studies it is not feasible to measure the entire population. If you measure the entire population and calculate a value
like a mean or average, we don't refer to this as a statistic, we call it a parameter of the
population.
The distribution of a statistic (like the average) across an infinite number of samples of
the same size is called the sampling distribution. We don't ever actually construct a sampling distribution. Why not? You're
not paying attention! Because to construct it we would have to take an infinite number of
samples and at least the last time I checked, on this planet infinite is not a number we
know how to reach. So why do we even talk about a sampling distribution? Now that's a
good question! Because we need to realize that our sample is just one of a potentially
infinite number of samples that we could have taken. When we keep the sampling
distribution in mind, we realize that while the statistic we got from our sample is
probably near the center of the sampling distribution (because most of the samples would
be there) we could have gotten one of the extreme samples just by the luck of the draw. If
we take the average of the sampling distribution -- the average of the averages of an
infinite number of samples -- we would be much closer to the true population average --
the parameter of interest. So the average of the sampling distribution is essentially
equivalent to the parameter. But what is the standard deviation of the sampling
distribution? (OK, never had statistics? There are any number of places on the web where
you can learn about them or even just brush up if you've gotten rusty. This isn't one of
them. I'm going to assume that you at least know what a standard deviation is, or that
you're capable of finding out relatively quickly). The standard deviation of the sampling
distribution tells us something about how different samples would be distributed. In
statistics it is referred to as the standard error (so we can keep it separate in our minds
from standard deviations. Getting confused? Go get a cup of coffee and come back in ten
minutes...OK, let's try once more... A standard deviation is the spread of the scores
around the average in a single sample. The standard error is the spread of the averages
around the average of averages in a sampling distribution. Got it?)
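If the distinction is still fuzzy, a quick simulation can make it concrete. The sketch below
uses a made-up population and an arbitrary sample size; it draws many samples, then
compares the spread of scores within one sample (a standard deviation) with the spread of
the sample averages (the standard error).

```python
import random
import statistics

random.seed(1)

# A made-up "population" of 10,000 scores, purely for illustration.
population = [random.gauss(mu=50, sigma=10) for _ in range(10_000)]

n = 100              # size of each sample
num_samples = 5_000  # stand-in for an "infinite" number of samples

sample_means = []
for _ in range(num_samples):
    sample = random.sample(population, n)
    sample_means.append(statistics.mean(sample))

one_sample = random.sample(population, n)

print("population mean (the parameter):", round(statistics.mean(population), 2))
print("average of the sample averages: ", round(statistics.mean(sample_means), 2))
print("std dev within one sample:      ", round(statistics.stdev(one_sample), 2))
print("std dev of the sample averages  ")
print("  (the standard error):         ", round(statistics.stdev(sample_means), 2))
```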
Sampling Error
In sampling contexts, the standard error is called sampling error. Sampling error gives
us some idea of the precision of our statistical estimate. A low sampling error means that
we had relatively less variability or range in the sampling distribution. But here we go
again -- we never actually see the sampling distribution! So how do we calculate
sampling error? We base our calculation on the standard deviation of our sample. The
greater the sample standard deviation, the greater the standard error (and the sampling
error). The standard error is also related to the sample size. The greater your sample size,
the smaller the standard error. Why? Because the greater the sample size, the closer your
sample is to the actual population itself. If you take a sample that consists of the entire
population you actually have no sampling error because you don't have a sample, you
have the entire population. In that case, the mean you estimate is the parameter.
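In practice, the standard error of a mean is usually estimated from a single sample as the
sample standard deviation divided by the square root of the sample size. The short sketch
below, again with made-up data, shows how that estimate shrinks as the sample size
grows.

```python
import math
import random
import statistics

def standard_error(sample):
    """Estimate the standard error of the mean from one sample:
    sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

random.seed(2)
population = [random.gauss(50, 10) for _ in range(10_000)]  # invented scores

for n in (25, 100, 400, 1600):
    sample = random.sample(population, n)
    print(f"n = {n:5d}  estimated standard error = {standard_error(sample):.2f}")
```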
Probability Sampling
A probability sampling method is any method of sampling that utilizes some form of
random selection. In order to have a random selection method, you must set up some
process or procedure that assures that the different units in your population have equal
probabilities of being chosen. Humans have long practiced various forms of random
selection, such as picking a name out of a hat, or choosing the short straw. These days, we
tend to use computers as the mechanism for generating random numbers as the basis for
random selection.
Some Definitions
Before I can explain the various probability methods we have to define some basic terms.
These are:
N = the number of cases in the population (or sampling frame)
n = the number of cases in the sample
NCn = the number of combinations (subsets) of n cases that can be drawn from N
f = n/N = the sampling fraction
That's it. With those terms defined we can begin to define the different probability
sampling methods.
Simple Random Sampling
Objective: To select n units out of N such that each of the NCn possible samples has an
equal chance of being selected.
Procedure: Use a table of random numbers, a computer random number
generator, or a mechanical device to select the sample.
A somewhat stilted, if accurate, definition. Let's see if we can make it a little more real.
How do we select a simple random sample? Let's assume that we are doing some
research with a small service agency that wishes to assess clients' views of quality of
service over the past year. First, we have to get the sampling frame organized. To
accomplish this, we'll go through agency records to identify every client over the past 12
months. If we're lucky, the agency has good accurate computerized records and can
quickly produce such a list. Then, we have to actually draw the sample. Decide on the
number of clients you would like to have in the final sample. For the sake of the example,
let's say you want to select 100 clients to survey and that there were 1000 clients over the
past 12 months. Then, the sampling fraction is f = n/N = 100/1000 = .10 or 10%. Now, to
actually draw the sample, you have several options. You could print off the list of 1000
clients, tear them into separate strips, put the strips in a hat, mix them up real good, close
your eyes and pull out the first 100. But this mechanical procedure would be tedious and
the quality of the sample would depend on how thoroughly you mixed them up and how
randomly you reached in. Perhaps a better procedure would be to use the kind of ball
machine that is popular with many of the state lotteries. You would need three sets of
balls numbered 0 to 9, one set for each of the digits from 000 to 999 (if we select 000
we'll call that 1000). Number the list of names from 1 to 1000 and then use the ball
machine to select the three digits that identify each person. The obvious disadvantage here
is that you need to get the ball machines. (Where do they make those things, anyway? Is
there a ball machine industry?).
Neither of these mechanical procedures is very feasible and, with the development of
inexpensive computers, there is a much easier way. Here's a simple procedure that's
especially useful if you have the names of the clients already on the computer. Many
computer programs can generate a series of random numbers. Let's assume you can copy
and paste the list of client names into a column in an EXCEL spreadsheet. Then, in the
column right next to it paste the function =RAND() which is EXCEL's way of putting a
random number between 0 and 1 in the cells. Then, sort both columns -- the list of names
and the random number -- by the random numbers. This rearranges the list in random
order from the lowest to the highest random number. Then, all you have to do is take the
first hundred names in this sorted list. Pretty simple. You could probably accomplish the
whole thing in under a minute.
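The same sort-by-a-random-number trick is easy to express in code. Here is a minimal
Python sketch; the client names are placeholders for whatever list you actually pull from
the agency's records.

```python
import random

# Placeholder sampling frame: 1000 client names standing in for the real records.
clients = [f"Client {i:04d}" for i in range(1, 1001)]

random.seed(7)  # fix the seed only if you want a reproducible draw

# Mimic the spreadsheet trick: attach a random number to each name,
# sort by that number, and keep the first 100.
keyed = [(random.random(), name) for name in clients]
keyed.sort()
sample = [name for _, name in keyed[:100]]

# random.sample does the same job in a single call.
sample_direct = random.sample(clients, 100)

print(len(sample), sample[:5])
print(len(sample_direct), sample_direct[:5])
```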
Simple random sampling is simple to accomplish and is easy to explain to others.
Because simple random sampling is a fair way to select a sample, it is reasonable to
generalize the results from the sample back to the population. Simple random sampling is
not the most statistically efficient method of sampling and you may, just because of the
luck of the draw, not get good representation of subgroups in a population. To deal with
these issues, we have to turn to other sampling methods.
Systematic Random Sampling
In systematic random sampling, you number the units in the population from 1 to N,
decide on the sample size n, compute the interval k = N/n, choose a random start between
1 and k, and then take every k-th unit from that start onward. All of this will be much
clearer with an example. Let's assume that we have a population
that only has N=100 people in it and that you want to take a sample of n=20. To use
systematic sampling, the population must be listed in a random order. The sampling
fraction would be f = 20/100 = 20%. In this case, the interval size, k, is equal to N/n =
100/20 = 5. Now, select a random integer from 1 to 5. In our example, imagine that you
chose 4. Now, to select the sample, start with the 4th unit in the list and take every k-th
unit (every 5th, because k=5). You would be sampling units 4, 9, 14, 19, and so on up to 99,
and you would wind up with 20 units in your sample.
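Here is the same procedure as a minimal Python sketch, assuming the list of units is
already in random order:

```python
import random

def systematic_sample(units, n, seed=None):
    """Select every k-th unit (k = N // n) after a random start between 1 and k.
    Assumes the units are already in random order with respect to what you measure."""
    rng = random.Random(seed)
    k = len(units) // n                  # sampling interval
    start = rng.randint(1, k)            # random start, 1..k
    return units[start - 1::k][:n]       # take every k-th unit from the start

units = list(range(1, 101))              # units numbered 1..100
sample = systematic_sample(units, n=20, seed=3)
print(sample)   # e.g. [4, 9, 14, ..., 99] if the random start happens to be 4
```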
For this to work, it is essential that the units in the population are randomly ordered, at
least with respect to the characteristics you are measuring. Why would you ever want to
use systematic random sampling? For one thing, it is fairly easy to do. You only have to
select a single random number to start things off. It may also be more precise than simple
random sampling. Finally, in some situations there is simply no easier way to do random
sampling. For instance, I once had to do a study that involved sampling from all the
books in a library. Once selected, I would have to go to the shelf, locate the book, and
record when it last circulated. I knew that I had a fairly good sampling frame in the form
of the shelf list (which is a card catalog where the entries are arranged in the order they
occur on the shelf). To do a simple random sample, I could have estimated the total
number of books and generated random numbers to draw the sample; but how would I
find book #74,329 easily if that is the number I selected? I couldn't very well count the
cards until I came to 74,329! Stratifying wouldn't solve that problem either. For instance,
I could have stratified by card catalog drawer and drawn a simple random sample within
each drawer. But I'd still be stuck counting cards. Instead, I did a systematic random
sample. I estimated the number of books in the entire collection. Let's imagine it was
100,000. I decided that I wanted to take a sample of 1000 for a sampling fraction of
1000/100,000 = 1%. To get the sampling interval k, I divided N/n = 100,000/1000 = 100.
Then I selected a random integer between 1 and 100. Let's say I got 57. Next I did a little
side study to determine how thick a thousand cards are in the card catalog (taking into
account the varying ages of the cards). Let's say that on average I found that two cards
that were separated by 100 cards were about .75 inches apart in the catalog drawer. That
information gave me everything I needed to draw the sample. I counted to the 57th card by
hand and recorded the book information. Then, I took a compass. (Remember those from
your high-school math class? They're the funny little metal instruments with a sharp pin
on one end and a pencil on the other that you used to draw circles in geometry class.)
Then I set the compass at .75", stuck the pin end in at the 57th card and pointed with the
pencil end to the next card (approximately 100 books away). In this way, I approximated
selecting the 157th, 257th, 357th, and so on. I was able to accomplish the entire selection
procedure in very little time using this systematic random sampling approach. I'd
probably still be there counting cards if I'd tried another random sampling method. (Okay,
so I have no life. I got compensated nicely, I don't mind saying, for coming up with this
scheme.)
Multi-Stage Sampling
The four methods we've covered so far -- simple, stratified, systematic and cluster -- are
the simplest random sampling strategies. In most real applied social research, we would
use sampling methods that are considerably more complex than these simple variations.
The most important principle here is that we can combine the simple methods described
earlier in a variety of useful ways that help us address our sampling needs in the most
efficient and effective manner possible. When we combine sampling methods, we call
this multi-stage sampling.
For example, consider the idea of sampling New York State residents for face-to-face
interviews. Clearly we would want to do some type of cluster sampling as the first stage
of the process. We might sample townships or census tracts throughout the state. But in
cluster sampling we would then go on to measure everyone in the clusters we select.
Even if we are sampling census tracts we may not be able to measure everyone who is in
the census tract. So, we might set up a stratified sampling process within the clusters. In
this case, we would have a two-stage sampling process with stratified samples within
cluster samples. Or, consider the problem of sampling students in grade schools. We
might begin with a national sample of school districts stratified by economics and
educational level. Within selected districts, we might do a simple random sample of
schools. Within schools, we might do a simple random sample of classes or grades. And,
within classes, we might even do a simple random sample of students. In this case, we
have three or four stages in the sampling process and we use both stratified and simple
random sampling. By combining different sampling methods we are able to achieve a
rich variety of probabilistic sampling methods that can be used in a wide range of social
research contexts.
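Here is a rough Python sketch of the grade-school example. The districts, schools, and
classes are invented, and a real design would also need sampling weights; the point is
only to show how the stages nest.

```python
import random

random.seed(11)

# Invented sampling frame: districts in two economic strata, each district
# containing schools, each school containing a few classes.
def make_districts(tag, n_districts=20, n_schools=6, n_classes=4):
    return {
        f"district_{tag}{d}": {
            f"school_{s}": [f"class_{c}" for c in range(1, n_classes + 1)]
            for s in range(1, n_schools + 1)
        }
        for d in range(1, n_districts + 1)
    }

frame = {"low_income": make_districts("L"), "high_income": make_districts("H")}

selected = []
for stratum, districts in frame.items():
    # Stage 1: stratified sampling -- draw districts within each economic stratum.
    for district in random.sample(sorted(districts), 2):
        # Stage 2: simple random sample of schools within the district.
        schools = districts[district]
        for school in random.sample(sorted(schools), 3):
            # Stage 3: simple random sample of classes within the school.
            for cls in random.sample(schools[school], 2):
                selected.append((stratum, district, school, cls))

print(f"{len(selected)} classes selected; the first few:")
for row in selected[:5]:
    print(row)
```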
Nonprobability Sampling
The difference between nonprobability and probability sampling is that nonprobability
sampling does not involve random selection and probability sampling does. Does that
mean that nonprobability samples aren't representative of the population? Not necessarily.
But it does mean that nonprobability samples cannot depend upon the rationale of
probability theory. At least with a probabilistic sample, we know the odds or probability
that we have represented the population well. We are able to estimate confidence intervals
for the statistic. With nonprobability samples, we may or may not represent the
population well, and it will often be hard for us to know how well we've done so. In
general, researchers prefer probabilistic or random sampling methods over
nonprobabilistic ones, and consider them to be more accurate and rigorous. However, in
applied social research there may be circumstances where it is not feasible, practical or
theoretically sensible to do random sampling. Here, we consider a wide range of
nonprobabilistic alternatives.
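For example, with a probability sample you can attach a rough 95% confidence interval to
an estimated mean (roughly plus or minus two standard errors, using the usual normal
approximation). A small sketch with made-up scores:

```python
import math
import statistics

# Made-up satisfaction scores from a hypothetical probability sample.
scores = [72, 85, 90, 64, 77, 81, 69, 88, 75, 79, 83, 71, 68, 92, 80]

mean = statistics.mean(scores)
se = statistics.stdev(scores) / math.sqrt(len(scores))  # estimated standard error

low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.1f}, rough 95% CI = ({low:.1f}, {high:.1f})")
```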
We can divide nonprobability sampling methods into two broad types: accidental or
purposive. Most sampling methods are purposive in nature because we usually approach
the sampling problem with a specific plan in mind. The most important distinctions
among these types of sampling methods are the ones between the different types of
purposive sampling approaches.
Purposive Sampling
Expert Sampling
Expert sampling involves assembling a sample of persons with
known or demonstrable experience and expertise in some area. Often, we
convene such a sample under the auspices of a "panel of experts." There
are actually two reasons you might do expert sampling. First, because it
would be the best way to elicit the views of persons who have specific
expertise. In this case, expert sampling is essentially just a specific
subcase of purposive sampling. But the other reason you might use expert
sampling is to provide evidence for the validity of another sampling
approach you've chosen. For instance, let's say you do modal instance
sampling and are concerned that the criteria you used for defining the
modal instance are subject to criticism. You might convene an expert panel
consisting of persons with acknowledged experience and insight into that
field or topic and ask them to examine your modal definitions and
comment on their appropriateness and validity. The advantage of doing
this is that you aren't out on your own trying to defend your decisions -- you
have some acknowledged experts to back you. The disadvantage is
that even the experts can be, and often are, wrong.
Quota Sampling
In quota sampling, you select people nonrandomly according to some
fixed quota. There are two types of quota sampling: proportional and
nonproportional. In proportional quota sampling you want to represent the
major characteristics of the population by sampling a proportional amount
of each. For instance, if you know the population has 40% women and
60% men, and that you want a total sample size of 100, you will continue
sampling until you get those percentages and then you will stop. So, if
you've already got the 40 women for your sample, but not the 60 men,
you will continue to sample men but even if legitimate women
respondents come along, you will not sample them because you have
already "met your quota." The problem here (as in much purposive
sampling) is that you have to decide the specific characteristics on which
you will base the quota. Will it be by gender, age, education, race, religion,
etc.?
Nonproportional quota sampling is a bit less restrictive. In this method,
you specify the minimum number of sampled units you want in each
category. Here, you're not concerned with having numbers that match the
proportions in the population. Instead, you simply want to have enough to
assure that you will be able to talk about even small groups in the
population. This method is the nonprobabilistic analogue of stratified
random sampling in that it is typically used to assure that smaller groups
are adequately represented in your sample.
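The quota-filling logic looks something like the Python sketch below. The stream of
respondents and the gender quotas are fabricated; the same loop handles proportional
quotas (targets set from population percentages) or nonproportional ones (minimum
counts you decide on).

```python
import random

random.seed(5)

# Fabricated stream of would-be respondents with a gender attribute.
arrivals = [{"id": i, "gender": random.choice(["woman", "man"])} for i in range(1, 501)]

# Proportional quotas for a sample of 100 from a population that is
# 40% women and 60% men. For nonproportional quota sampling you would
# instead set minimum counts per group (e.g., at least 20 of each).
quotas = {"woman": 40, "man": 60}

sample = []
counts = {group: 0 for group in quotas}

for person in arrivals:
    group = person["gender"]
    if counts[group] < quotas[group]:        # still need people from this group?
        sample.append(person)
        counts[group] += 1
    if all(counts[g] >= quotas[g] for g in quotas):
        break                                # every quota met -- stop sampling

print(counts)   # e.g. {'woman': 40, 'man': 60}
```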
Heterogeneity Sampling
We sample for heterogeneity when we want to include all opinions or
views, and we aren't concerned about representing these views
proportionately. Another term for this is sampling for diversity. In many
brainstorming or nominal group processes (including concept mapping), we rely on some
form of heterogeneity sampling because the primary interest is in capturing a broad
spectrum of ideas, not in identifying the average or typical ones.
Snowball Sampling
In snowball sampling, you begin by identifying someone who meets the
criteria for inclusion in your study. You then ask them to recommend
others they know who also meet the criteria. Although this
method would hardly lead to representative samples, there are times when
it may be the best method available. Snowball sampling is especially
useful when you are trying to reach populations that are inaccessible or
hard to find. For instance, if you are studying the homeless, you are not
likely to be able to find good lists of homeless people within a specific
geographical area. However, if you go to that area and identify one or two,
you may find that they know very well who the other homeless people in
their vicinity are and how you can find them.
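The referral chain can be sketched as a walk over a network of who knows whom. The
contact lists and the eligibility rule below are purely illustrative.

```python
from collections import deque

# Invented referral network: who each person says they know.
contacts = {
    "Ann":  ["Bea", "Carl"],
    "Bea":  ["Dev", "Eli"],
    "Carl": ["Eli", "Fay"],
    "Dev":  ["Gus"],
    "Eli":  ["Hana"],
    "Fay":  [],
    "Gus":  ["Ann"],     # referrals can loop back
    "Hana": [],
}

# Illustrative inclusion rule; in a real study this would be your criteria.
eligible = {"Ann", "Bea", "Carl", "Dev", "Eli", "Hana"}

def snowball(seeds, max_size=50):
    """Start from seed participants and follow their referrals,
    keeping only people who meet the inclusion criteria."""
    sample, seen = [], set()
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in seen:
            continue
        seen.add(person)
        if person in eligible:
            sample.append(person)
            queue.extend(contacts.get(person, []))  # ask them for referrals
    return sample

print(snowball(["Ann"]))   # e.g. ['Ann', 'Bea', 'Carl', 'Dev', 'Eli', 'Hana']
```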