
Sampling error: In statistics, sampling error is incurred when the statistical characteristics of

a population are estimated from a subset, or sample, of that population. Since the sample
does not include all members of the population, statistics on the sample, such as means and
quantiles, generally differ from the characteristics of the entire population, which are
known as parameters. For example, if one measures the height of a thousand individuals
from a country of one million, the average height of the thousand is typically not the same
as the average height of all one million people in the country. Since sampling is typically
done to determine the characteristics of a whole population, the difference between the
sample and population values is considered an error. Exact measurement of sampling error
is generally not feasible since the true population values are unknown.
In statistics, sampling error is the error caused by observing a sample instead of the whole
population. The sampling error is the difference between a sample statistic used to estimate
a population parameter and the actual but unknown value of the parameter. An estimate of
a quantity of interest, such as an average or percentage, will generally be subject to sample-
to-sample variation.
These variations in the possible sample values of a statistic can theoretically be expressed as
sampling errors, although in practice the exact sampling error is typically unknown.
Example: consider a population of 80 students and suppose you want to estimate the average
height in this population. It is hard to measure the height of all 80 students, so you decide
to take a simple random sample. It is reasonable to measure the height of 30 students:
randomly sample 30 students out of the 80 and take their average height, which can then be
taken as a good estimate of the average height of the entire population. One way to select
30 students from the population is to put the names of the 80 students on pieces of paper,
place them in a bowl, and draw 30 pieces of paper without replacement (as sketched in the
code below).
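A minimal sketch of this drawing-from-a-bowl procedure, assuming hypothetical student names and made-up heights; `random.sample` draws without replacement, just like picking slips from the bowl and not putting them back.

```python
import random

# Hypothetical population: 80 students with placeholder heights in cm.
population = [f"student_{i}" for i in range(1, 81)]
heights = {name: random.gauss(170, 8) for name in population}  # assumed data

# Draw 30 names "from the bowl" without replacement.
sample = random.sample(population, k=30)

# The sample mean serves as an estimate of the population mean.
sample_mean = sum(heights[name] for name in sample) / len(sample)
population_mean = sum(heights.values()) / len(heights)

print(f"Sample mean height:     {sample_mean:.1f} cm")
print(f"Population mean height: {population_mean:.1f} cm")
print(f"Sampling error:         {sample_mean - population_mean:+.1f} cm")
```

The difference printed on the last line is the sampling error for this particular sample; rerunning the script draws a different sample and gives a different error.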
A simple random sample is a subset of individuals (a sample) chosen from a
larger set (a population). Each individual is chosen randomly and entirely by chance, such
that each individual has the same probability of being chosen at any stage during the
sampling process. A simple random sample is an unbiased surveying technique.
Simple random sampling is a sampling technique in which every item in the population has an
equal chance of being selected in the sample. Selection of items depends entirely on chance,
so it is also called the method of chance. It is unbiased and representative of the
population. However, it can be difficult and expensive to take a simple random sample when
dealing with people; it is more practical when the population is geographically concentrated
and a good sampling frame exists. A sampling frame is a list of all the people or objects in
the population of interest. Simple random sampling can be easily implemented for
manufacturing populations.

Sample question: Outline the steps for obtaining a simple random sample for outcomes of
strokes in U.S. trauma hospitals.
Step 1: Make a list of all the trauma hospitals in the U.S. (there are several hundred: the CDC
keeps a list).
Step 2: Assign a sequential number to each trauma center (1,2,3…n). This is your sampling
frame (the list from which you draw your simple random sample).
Step 3: Figure out what your sample size is going to be.
Step 4: Use a random number generator to select the sample, using your sampling frame
(population size) from Step 2 and your sample size from Step 3. For example, if your sample
size is 50 and your population is 500, generate 50 random numbers between 1 and 500.
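A short sketch of Steps 2 to 4, assuming a hypothetical sampling frame of 500 numbered trauma centers and a sample size of 50; the real CDC list and the chosen sample size would be substituted in practice.

```python
import random

population_size = 500   # assumed size of the sampling frame (Step 2)
sample_size = 50        # chosen sample size (Step 3)

# Step 4: generate 50 distinct random numbers between 1 and 500.
selected_ids = random.sample(range(1, population_size + 1), k=sample_size)

# The trauma centers whose sequential numbers were drawn form the sample.
print(sorted(selected_ids))
```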

Consider a hospital that has 1,000 staff members and needs to allocate the night shift to 100
of them. All of their names are put in a box and 100 are drawn at random, so each person has
an equal chance of being selected.

Simple random sampling is a fair method of sampling and helps to reduce bias compared with
other sampling methods. It is a basic method of collecting data and does not require any
technical or subject knowledge. It works even when the population size is very large, and
there is no restriction on the sample size that can be drawn: from a large population, you
can obtain a small sample quite easily.

Convenience sampling is just that: convenient. You ask people nearby or take the next 20
objects off the production line; you do whatever is easy or convenient. The resulting sample
is biased in some way, but it is a quick and cheap way to collect data. Convenience samples
can suffer from self-selection bias when people choose to participate because they have an
interest in the issue in question.

In stratified sampling, the entire population is divided into mutually exclusive,
homogeneous, non-overlapping subgroups known as strata. A sample is then drawn randomly from
each stratum. The procedure of drawing simple random samples from each stratum is called
stratified random sampling.

Stratified random sampling is used when your population is divided into strata (by
characteristics like sex or education level) and you want each stratum to be represented in
your sample. The strata may already be defined (as in census data) or you might define them
yourself to fit the purposes of your research.

To perform stratified random sampling, take a random sample from within each category or
stratum. Let’s say you have a population divided into the following strata:
 Category 1: Low socioeconomic status — 39 percent
 Category 2: Middle class — 38 percent
 Category 3: Upper income — 23 percent
To get the stratified random sample, you would randomly sample the categories so that
your eventual sample size has 39 percent of participants taken from category 1, 38 percent
from category 2 and 23 percent from category 3.

The following steps should be taken to obtain the stratified sample:

1. Name the target population.
2. Name the categories (strata) in the population.
3. Figure out what sample size you need.
4. List all of the cases within each stratum.
5. Assign a random number to each case.
6. Sort the cases by random number.
7. Choose the number of participants required.

How to Get a Stratified Random Sample: Steps

Sample size for each stratum = (size of entire sample / population size) × stratum size.
Sample question: You work for a small company of 1,000 people and want to find out how
they are saving for retirement. Use stratified random sampling to obtain your sample.
Step 1: Decide how you want to stratify (divide up) your population. For example, people in
their twenties might have different saving strategies than people in their fifties.
Step 2: Make a table representing your strata. The following table shows the age groups and
how many people in the population are in each stratum:

Age        Total Number of People in Stratum
20-29      160
30-39      220
40-49      240
50-59      200
60+        180

Step 3: Decide on your sample size. If you don’t know how to find a sample size, see: Sample
size (how to find one). For this example, we’ll assume your sample size is 50.

Step 4: Use the stratified sample formula (sample size for each stratum = size of entire
sample / population size × stratum size) to calculate the number of people to sample from
each group:

Age      Number of People in Stratum     Number of People in Sample
20-29    160                             50/1000 × 160 = 8
30-39    220                             50/1000 × 220 = 11
40-49    240                             50/1000 × 240 = 12
50-59    200                             50/1000 × 200 = 10
60+      180                             50/1000 × 180 = 9


Note that all of the individual results from the strata add up to your sample size of 50:
8 + 11 + 12 + 10 + 9 = 50.

Step 5: Perform random sampling (i.e. simple random sampling) in each stratum to select
your survey participants.
Each element in your population should only fit into one stratum. In other words, one
person cannot be in more than one group.
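A sketch of Steps 4 and 5 for the age-strata example above, assuming each stratum is simply a list of hypothetical member IDs; the proportional allocation reproduces the 8, 11, 12, 10 and 9 figures from the table.

```python
import random

strata_sizes = {"20-29": 160, "30-39": 220, "40-49": 240, "50-59": 200, "60+": 180}
population_size = sum(strata_sizes.values())   # 1000
total_sample_size = 50

sample = {}
for name, size in strata_sizes.items():
    # Step 4: stratum sample size = (total sample / population size) * stratum size
    n = round(total_sample_size / population_size * size)
    # Step 5: simple random sampling within the stratum (hypothetical member IDs here)
    members = [f"{name}_person_{i}" for i in range(1, size + 1)]
    sample[name] = random.sample(members, k=n)

for name, picked in sample.items():
    print(name, len(picked))   # prints 8, 11, 12, 10, 9 respectively
```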

Stratified random sampling is also called proportional random sampling or quota random
sampling.

A stratified random sampling involves dividing the entire population into homogeneous
groups which are called strata (singular is stratum). Random samples are then selected from
each stratum. For example, consider an academic researcher who would like to know the
number of MBA students that got a job within three months of graduation in 2007. He will
soon find that there were almost 200,000 MBA graduates for the year. He might decide to
just take a simple random sample of 50,000 grads and run a survey, or better still, divide the
population into strata and take a random sample from the strata. To do this, he could create
population groups based on gender, age range, race, country of nationality, and career
background. A random sample from each stratum is taken in a number proportional to the
stratum's size when compared to the population. These subsets of the strata are then
pooled to form a random sample.

Proportionate and Disproportionate Stratification


Stratified random sampling ensures that each subgroup of a given population is adequately
represented within the whole sample population of a research study. Stratification can be
proportionate or disproportionate. In a proportionate stratified method, the sample size of
each stratum is proportionate to the population size of the stratum. For example, if the
researcher wanted a sample of 50,000 using age range, the proportionate stratified random
sample would be obtained using this formula: (sample size / population size) × stratum size.
Assuming a population size of 180,000 MBA grads per year:

Age group                      24-28     29-33     34-37     Total
Number of people in stratum    90,000    60,000    30,000    180,000
Stratum sample size            25,000    16,667    8,333     50,000

The strata sample size for MBA grads that are in the range of 24 to 28 years old is calculated
as (50,000/180,000) x 90,000 = 25,000. Same method is used for the other age range
groups. Now that the strata sample size is known, the researcher can perform simple
random sampling in each stratum to select his survey participants. In other words, 25,000
grads from the 24-28 age group will be selected randomly from the entire population,
16,667 grads from the 29-33 age range will be selected from the population randomly, and
so on.

In a disproportional stratified sample, the size of each stratum is not proportional to its size
in the population. The researcher may decide to sample ½ of the population within the 34-
37 age group and 1/3 of the grads with ages ranging from 29-33 years.
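A brief sketch contrasting the two allocations under the assumed MBA strata sizes above. The disproportionate fractions (1/2 for the 34-37 group and 1/3 for the 29-33 group) are the ones mentioned in the text; the 24-28 group is left at its proportionate share purely for illustration.

```python
strata = {"24-28": 90_000, "29-33": 60_000, "34-37": 30_000}
population = sum(strata.values())          # 180,000
target_sample = 50_000

# Proportionate: each stratum's share of the sample equals its share of the population.
proportionate = {g: round(target_sample / population * n) for g, n in strata.items()}

# Disproportionate: sampling fractions chosen by the researcher, not by population share.
fractions = {"24-28": 25_000 / 90_000,  # illustrative: kept at the proportionate share
             "29-33": 1 / 3,
             "34-37": 1 / 2}
disproportionate = {g: round(fractions[g] * n) for g, n in strata.items()}

print(proportionate)      # {'24-28': 25000, '29-33': 16667, '34-37': 8333}
print(disproportionate)   # {'24-28': 25000, '29-33': 20000, '34-37': 15000}
```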

It is important to note that one person cannot fit into multiple strata. Each entity must only
fit in one stratum. Having overlapping subgroups means that some individuals will have
higher chances of being selected for the survey, which completely negates the concept of
stratified sampling as a type of probability sampling.

The main advantage of stratified sampling is that it captures key population characteristics
in the sample.

Stratification gives a smaller error in estimation and greater precision than the simple
random sampling method. The greater the differences between the strata, the greater the
gain in precision.

Simple random sampling is a sample of individuals that exist in a population; the individuals
are randomly selected from the population and placed into a sample. This method of
randomly selecting individuals seeks to select a sample size that is an unbiased
representation of the population. However, it is not advantageous when members of the
population vary widely.

Stratified random sampling is a method of sampling that involves the division of a
population into smaller groups known as strata. In stratified random sampling, or
stratification, the strata are formed based on members' shared attributes or characteristics.

Stratified random sampling is a better method than simple random sampling. Stratified
random sampling divides a population into subgroups or strata, and random samples are
taken, in proportion to the population, from each of the strata created. The members in
each of the stratum formed have similar attributes and characteristics. This method of
sampling is widely used and very useful when the target population is heterogeneous.
A simple random sample should be taken from each stratum.

Suppose a research team wants to determine the GPA of college students across the U.S.
The research team has difficulty collecting data from all 21 million college students; it
decides to take a random sample of the population by using 4,000 students.

Now assume that the team looks at the different attributes of the sample participants and
wonders if there are any differences in GPAs and students’ majors. Suppose it finds that 560
students are English majors, 1,135 are science majors, 800 are computer science majors,
1,090 are engineering majors, and 415 are math majors. The team wants to use a proportional
stratified random sample, where each stratum in the sample is proportional to the
corresponding stratum in the population.

Assume the team researches the demographics of college students in the U.S. and finds the
percentages of students in each major: 12% major in English, 28% in science, 24% in computer
science, 21% in engineering, and 15% in mathematics. Thus, five strata are created from the
stratified random sampling process.

The team then needs to confirm that the stratum of the population is in proportion to the
stratum in the sample; however, they find the proportions are not equal. The team then
needs to resample 4,000 students from the population and randomly select 480 English,
1,120 science, 960 computer science, 840 engineering, and 600 mathematics students. With
those, it has a proportionate stratified random sample of college students, which provides a
better representation of students' college majors in the U.S.

Stratified random sampling is a technique best used with a sample population easily broken
into distinct subgroups. Samples are then taken from each subgroup based on the ratio of
the subgroup’s size to the total data population.

For example, assume a total data population of 1000, broken into four subgroups with data
populations as follows:

A: 450
B: 250
C: 200
D: 100

To perform stratified random sampling with 200 pieces of the data, 45% of the sample must
come from A, 25% must come from B, 20% must come from C and 10% must come from
D. This yields a sample of 90 samples from A, 50 samples from B, 40 samples from C and 20
samples from D, for a total of 200.

Using stratified random samples ensures that there will be selections from each
subgroup. Otherwise, there is a chance that one subgroup might be omitted, and that
subgroup’s characteristics would not be included in the statistical analysis of the sample.

Stratified random sampling is used in the investment world to create a portfolio that
replicates an index, such as a bond index. Rather than incur the high costs of purchasing all
the thousands of bonds in the specific bond index, a portfolio manager creates a sample
replication of the index. He does this by using stratified random sampling to make sure
bonds of all types within the index are included in the sample.

Cluster sampling is a sampling technique in which the population is divided into already
existing groupings. A cluster is a natural but heterogeneous grouping of the members of the
population. Common variables used to cluster the population are geographical areas, buildings
and schools. Sub-types of cluster sampling are single-stage, two-stage and multistage cluster
sampling. Heterogeneity within the cluster is an important feature of an ideal cluster design.

Cluster sampling is a sampling plan used when mutually homogeneous yet internally
heterogeneous groupings are evident in a statistical population. It is often used in marketing
research. In this sampling plan, the total population is divided into these groups (known as
clusters) and a simple random sample of the groups is selected. The elements in each cluster
are then sampled. If all elements in each sampled cluster are sampled, then this is referred
to as a "one-stage" cluster sampling plan. If a simple random subsample of elements is
selected within each of these groups, this is referred to as a "two-stage" cluster sampling
plan. A common motivation for cluster sampling is to reduce the total number of interviews
and costs given the desired accuracy. For a fixed sample size, the expected random error is
smaller when most of the variation in the population is present internally within the groups,
and not between the groups.

The population within a cluster should ideally be as heterogeneous as possible, but there
should be homogeneity between clusters. Each cluster should be a small-scale
representation of the total population. The clusters should be mutually exclusive and
collectively exhaustive. A random sampling technique is then used on any relevant clusters
to choose which clusters to include in the study. In single-stage cluster sampling, all the
elements from each of the selected clusters are sampled. In two-stage cluster sampling, a
random sampling technique is applied to the elements from each of the selected clusters.

The main difference between cluster sampling and stratified sampling is that in cluster
sampling the cluster is treated as the sampling unit so sampling is done on a population of
clusters (at least in the first stage). In stratified sampling, by contrast, the sampling is
done on elements within each stratum: a random sample is drawn from each of the strata,
whereas in cluster sampling only the selected clusters are sampled. A common motivation
of cluster sampling is to reduce costs by increasing sampling efficiency. This contrasts with
stratified sampling where the motivation is to increase precision.

There is also multistage cluster sampling, where at least two stages are taken in selecting
elements from clusters.

Without modifying the estimated parameter, cluster sampling is unbiased when the clusters
are approximately the same size. In this case, the parameter is computed by combining all
the selected clusters. When the clusters are of different sizes, probability proportionate to
size sampling is used. In this sampling plan, the probability of selecting a cluster is
proportional to its size, so that a large cluster has a greater probability of selection than
a small cluster.
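A sketch of selecting clusters with probability proportional to size, assuming a handful of hypothetical clusters with made-up sizes; `random.choices` performs weighted selection with replacement, which is one simple way to approximate PPS selection.

```python
import random

# Hypothetical clusters (e.g. city blocks) and their assumed sizes.
cluster_sizes = {"block_A": 120, "block_B": 80, "block_C": 300,
                 "block_D": 50, "block_E": 450}

clusters = list(cluster_sizes)
weights = [cluster_sizes[c] for c in clusters]   # selection weight = cluster size

# Draw 2 clusters; larger clusters are more likely to be drawn.
# (random.choices samples with replacement, so this is only an approximation of PPS.)
selected = random.choices(clusters, weights=weights, k=2)
print(selected)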

An example of cluster sampling is area sampling or geographical cluster sampling. Each
cluster is a geographical area. Because a geographically dispersed population can be
expensive to survey, greater economy than simple random sampling can be achieved by grouping
several respondents within a local area into a cluster.

 Can be cheaper than other sampling plans – e.g. fewer travel expenses, administration
costs.
 Feasibility: This sampling plan takes large populations into account. Since these groups
are so large, deploying any other sampling plan would be very costly.
 Economy: The two major concerns of expenditure, i.e., traveling and listing, are
greatly reduced in this method. For example: Compiling research information about
every household in a city would be very costly, whereas compiling information about
various blocks of the city will be more economical.
 Disadvantages:
 Higher sampling error, which can be expressed in the so-called "design effect", the ratio
between the number of subjects in the cluster study and the number of subjects in an
equally reliable, randomly sampled unclustered study
 Biased samples: If the group in population that is chosen as a sample has a biased
opinion, then the entire population is inferred to have the same opinion. This may not
be the actual case.

First, the researcher selects groups or clusters, and then from each cluster, the researcher
selects the individual subjects by either simple random or systematic random sampling. The
researcher can even opt to include the entire cluster and not just a subset from it.
The most common cluster used in research is a geographical cluster. For example, a
researcher wants to survey academic performance of high school students in Spain.

1. He can divide the entire population (population of Spain) into different clusters (cities).
2. Then the researcher selects a number of clusters depending on his research through
simple or systematic random sampling.
3. Then, from the selected clusters (randomly selected cities) the researcher can either
include all the high school students as subjects or he can select a number of subjects from
each cluster through simple or systematic random sampling.

The important thing to remember about this sampling technique is to give all the clusters
equal chances of being selected.
Types of Cluster Sample
One-Stage Cluster Sample - A one-stage cluster sample occurs when the researcher includes
all the high school students from all the randomly selected clusters in the sample.

Two-Stage Cluster Sample - A two-stage cluster sample is obtained when the researcher only
selects a number of students from each cluster by using simple or systematic random sampling.

Cluster sampling is a probability sampling method in which the researcher divides the
population into separate groups, called clusters. A simple random sample of clusters is then
selected from the population, and the researcher conducts the analysis on data from the
sampled clusters.
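A sketch of the one-stage and two-stage designs described above, assuming a hypothetical dictionary that maps cities (the clusters) to lists of student IDs; the cities, cluster sizes and within-cluster sample size are all illustrative.

```python
import random

# Hypothetical clusters: cities mapped to their high school students.
clusters = {city: [f"{city}_student_{i}" for i in range(1, 201)]
            for city in ["Madrid", "Barcelona", "Valencia", "Sevilla", "Bilbao", "Granada"]}

# Stage 1: randomly select a number of clusters (cities).
selected_cities = random.sample(list(clusters), k=3)

# One-stage: include every student in each selected cluster.
one_stage_sample = [s for city in selected_cities for s in clusters[city]]

# Two-stage: take a simple random sample of students within each selected cluster.
two_stage_sample = [s for city in selected_cities
                    for s in random.sample(clusters[city], k=30)]

print(len(one_stage_sample), len(two_stage_sample))   # 600 vs 90 students
```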

Systematic sampling is a type of probability sampling method in which sample members from a
larger population are selected according to a random starting point and a fixed, periodic
interval. This interval, called the sampling interval, is calculated by dividing the
population size by the desired sample size.

Since simple random sampling of a population can be inefficient and time-consuming,
statisticians turn to other methods, such as systematic sampling. Choosing a sample through a
systematic approach can be done quickly. Once a fixed starting point has been identified, a
constant interval is selected to facilitate participant selection.

For example, if you wanted to select a random group of 1,000 people from a population of
50,000 using systematic sampling, all the potential participants must be placed in a list and a
starting point would be selected. Once the list is formed, every 50th person on the list
(starting the count at the selected starting point) would be chosen as a participant, since
50,000/1,000 = 50. For example, if the selected starting point was 20, the 70th person on
the list would be chosen followed by the 120th, and so on. Once the end of the list was
reached and if additional participants are required, the count loops to the beginning of the
list to finish the count.
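A sketch of the 50,000-person example, assuming the list is simply the integers 1 to 50,000 and the starting point is fixed at 20 to mirror the text; the interval is population / sample = 50, and the modulo wraps the count around the end of the list as described above.

```python
population = list(range(1, 50_001))   # stand-in for the list of potential participants
sample_size = 1_000
interval = len(population) // sample_size   # 50,000 / 1,000 = 50

start = 20   # randomly chosen starting point (fixed here to mirror the example)
# Take every 50th person from the starting point, wrapping around if needed.
sample = [population[(start - 1 + i * interval) % len(population)]
          for i in range(sample_size)]

print(sample[:3])    # [20, 70, 120]
print(len(sample))   # 1000
```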

Risks Associated With Systematic Sampling


One risk that statisticians must consider when conducting systematic sampling involves how
the list used with the sampling interval is organized. If the population placed on the list is
organized in a cyclical pattern that matches the sampling interval, the selected sample may
be biased. For example, a company's human resources department wants to pick a sample
of employees and ask how they feel about company policies. Employees are grouped in
teams of 20, with each team headed by a manager. If the list used to pick the sample size is
organized with teams clustered together, the statistician risks picking only managers (or no
managers at all) depending on the sampling interval.

To begin, a researcher selects a starting integer to base the system on. After a number has
been selected, the researcher picks the interval, or spaces between samples in the
population.

There is a greater risk of data manipulation with systematic sampling because researchers
might be able to construct their systems to increase the likelihood of achieving a targeted
outcome rather than letting the random data produce a representative answer. Any
resulting statistics could not be trusted.

Disadvantage of Systematic Sampling. The process of selection can interact with a hidden
periodic trait within the population. If the sampling technique coincides with the periodicity
of the trait, the sampling technique will no longer be random and representativeness of
the sample is compromised.

When you’re sampling from a population, you want to make sure you’re getting a fair
representation of that population. Otherwise, your statistics will be biased or skewed and
perhaps meaningless. One way to get a fair and random sample is to assign a number to every
population member and then choose every nth member from that population. For example, you
could choose every 10th member, or every 100th member. This method of choosing every nth
member is called systematic sampling.

Systematic sampling is quick and convenient when you have a complete list of the members
of your population. However, if there's some kind of pattern to the original list, then bias
may creep into your statistics. For example, if a list of people is ordered as
MFMFMFMF, then choosing every 10th number will give you a sample consisting entirely of
females. How to perform systematic sampling without this type of sampling bias? You could
randomly shuffle the list before choosing the nth item or you could use repeated systematic
sampling, where you take several small samples from the same population. It’s used if you
aren’t sure you have a completely random list and you want to avoid sample bias.
Step 1: Assign a number to every element in your population.
Step 2: Decide how large your sample size should be.
Step 3: Divide the population by your sample size.
For example, if your population is 100 and your sample size is 10, then 100 / 10 = 10, so
you'll choose every 10th item.

Repeated Systematic Sampling


Step 4: Use the sampling interval from Step 3 up to a certain point, then switch to a
different starting point and continue sampling every nth item.
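A sketch of repeated systematic sampling, assuming an ordered list of 100 items and an interval of 10: instead of one pass of every 10th item, several smaller systematic passes are taken from different random starting points and combined.

```python
import random

population = list(range(1, 101))   # assumed ordered list of 100 items
interval = 10                      # from Step 3: 100 / 10

def systematic_pass(start, interval, population, n_items):
    """Take every `interval`-th item beginning at position `start` (1-indexed)."""
    return [population[(start - 1 + i * interval) % len(population)]
            for i in range(n_items)]

# Two passes of 5 items each, from two different random starting points.
starts = random.sample(range(1, interval + 1), k=2)
combined = []
for start in starts:
    combined.extend(systematic_pass(start, interval, population, n_items=5))

print(combined)   # 10 items drawn in two separate systematic passes
```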

Systematic sampling is cost- and time-efficient, and it is well suited to collecting data
from geographically dispersed cases. However, systematic sampling can be applied only if a
complete list of the population is available, and if there are periodic patterns within the
dataset, the sample will be biased.

Sampling bias is a bias in which a sample is collected in such a way that some members of
the intended population are less likely to be included than others. It results in a biased
sample, a non-random sample of a population (or non-human factors) in which all
individuals, or instances, were not equally likely to have been selected.

Sampling bias is problematic because it is possible that a statistic computed from the sample
is systematically erroneous. Sampling bias can lead to a systematic over- or under-estimation
of the corresponding parameter in the population. Sampling bias occurs in practice as it is
practically impossible to ensure perfect randomness in sampling. If the degree of
misrepresentation is small, then the sample can be treated as a reasonable approximation
to a random sample.

Three types of bias can be distinguished: information bias, selection bias, and
confounding. These three types of bias and their potential solutions are discussed using
various examples.

Selection bias is the bias introduced by the selection of individuals, groups or data for
analysis in such a way that proper randomization is not achieved, thereby ensuring that the
sample obtained is not representative of the population intended to be analysed. It is
sometimes referred to as the selection effect.

Sample selection bias is a type of bias caused by choosing non-random data for statistical
analysis. The bias exists due to a flaw in the sample selection process, where a subset of the
data is systematically excluded due to a particular attribute. The exclusion of the subset can
influence the statistical significance of the test, or produce distorted results.

Survivorship bias is a common type of sample selection bias. For example, when back-
testing an investment strategy on a large group of stocks, it may be convenient to look for
securities that have data for the entire sample period. If we were going to test the strategy
against 15 years worth of stock data, we might be inclined to look for stocks that have
complete information for the entire 15-year period. However, eliminating stocks that stopped
trading or left the market would introduce a bias into our data sample. Since we only include
stocks that lasted the full 15-year period, our final results would be flawed, as these
stocks performed well enough to survive the market.
Hedge fund performance indexes are one example of sample selection bias subject to
survivorship bias. Because hedge funds that don’t survive stop reporting their performance
to index aggregators, resulting indices are naturally tilted to funds and strategies that
remain, hence “survive.” This can be an issue with popular mutual fund reporting services as
well.

Self-selection bias (see also non-response bias) is possible whenever the group of people
being studied has any form of control over whether to participate.

Participants' decision to participate may be correlated with traits that affect the study,
making the participants a non-representative sample. For example, people who have strong
opinions or substantial knowledge may be more willing to spend time answering a survey
than those who do not. Another example is online and phone-in polls, which are biased
samples because the respondents are self-selected. Those individuals who are highly
motivated to respond, typically individuals who have strong opinions, are overrepresented,
and individuals that are indifferent or apathetic are less likely to respond. This often leads to
a polarization of responses with extreme perspectives being given a disproportionate weight
in the summary. As a result, these types of polls are regarded as unscientific.

Self-selection bias arises in any situation in which individuals select themselves into a group,
causing a biased sample with nonprobability sampling. It is commonly used to describe
situations where the characteristics of the people which cause them to select themselves in
the group create abnormal or undesirable conditions in the group. It is closely related to
the non-response bias, describing when the group of people responding has different
responses than the group of people not responding.

Self-selection bias is a major problem in research in sociology, psychology, economics and
many other social sciences.[1] In such fields, a poll suffering from such bias is termed a
self-selected listener opinion poll or "SLOP".

Self-selection makes determination of causation more difficult. For example, when attempting
to assess the effect of a test preparation course in increasing participants' test scores,
significantly higher test scores might be observed among students who choose to participate
in the preparation course itself. Due to self-selection, there may be a number of
differences between the people who choose to take the course and those who choose not
to, such as motivation, socioeconomic status, or prior test-taking experience. Due to self-
selection according to such factors, a significant difference in mean test scores could be
observed between the two populations independent of any ability of the course to effect
higher test scores. An outcome might be that those who elect to do the preparation course
would have achieved higher scores in the actual test anyway. If the study measures an
improvement in absolute test scores due to participation in the preparation course, they
may be skewed to show a higher effect. A relative measure of 'improvement' might improve
the reliability of the study somewhat, but only partially.

Self-selection bias causes problems for research about programs or products. In particular,
self-selection affects evaluation of whether or not a given program has some effect, and
complicates interpretation of market research.
In survey sampling, the bias that results from an unrepresentative sample is called selection
bias.

Undercoverage occurs when some members of the population are inadequately represented in the
sample. A classic example of undercoverage is the Literary Digest voter survey, which
predicted that Alfred Landon would beat Franklin Roosevelt in the 1936 presidential election.
The survey sample suffered from undercoverage of low-income voters, who tended to be
Democrats. Undercoverage is often a problem with convenience samples.

Voluntary response bias occurs when sample members are self-selected volunteers, as in
voluntary samples. An example would be call-in radio shows that solicit audience
participation in surveys on controversial topics (abortion, affirmative action, gun control,
etc.). The resulting sample tends to overrepresent individuals who have strong opinions.

Nonresponse bias. Sometimes, individuals chosen for the sample are unwilling or unable to
participate in the survey. This can be a big problem with mail surveys, where the response
rate can be very low.

Selection bias is a kind of error that occurs when the researcher decides who is going to be
studied. It is usually associated with research where the selection of participants isn't
random.

For example, say you want to study the effects of working nights on the incidence of a
certain health problem. You collect health information on a group of 9-to-5 workers and a
group of workers doing the same kind of work, but at night. You then measure the rates at
which members of both groups reported the health problem. You might conclude that night
work is associated with an increase in that problem.

The trouble is, the two groups you studied may have been very different to begin with. The
people who worked nights may have been less skilled, with fewer employment options.
Their lower socioeconomic status would also be linked with more health risks—due to less
healthy diets, less time and money for leisure activities and so on. So your finding may not
be related to night work at all, but a reflection of the influence of socioeconomic status.

Selection bias also occurs when people volunteer for a study. Those who choose to join (i.e.
who self-select into the study) may share a characteristic that makes them different from
non-participants from the get-go. Let’s say you want to assess a program for improving the
eating habits of shift workers. You put up flyers in workplaces where many people work night
shifts and invite them to participate. However, those who sign up may be very different from
those who
don’t. They may be more health conscious to begin with, which is why they are interested in
a program to improve eating habits.

If this was the case, it wouldn’t be fair to conclude that the program was effective because
the health of those who took part in the program was better than the health of those who
did not. Due to self-selection, other factors may have affected the health of your study
participants more than the program.
Minimizing selection bias
Good researchers will look for ways to overcome selection bias in their observational
studies. They’ll try to make their study representative by including as many people as
possible. They will match the people in their study and control groups as closely as possible.
They will “adjust” for factors that may affect outcomes. They will talk about selection bias in
their reports, and recognize the degree to which their results may apply only to certain
groups or in certain circumstances.

Another way researchers try to minimize selection bias is by conducting experimental studies,
in which participants are randomly assigned to the study or control groups (i.e. randomized
controlled studies or RCTs).

Often, selection bias is unavoidable. That’s why it’s important for researchers to examine
their study design for this type of bias and find ways to adjust for it, and to acknowledge it in
their study report.

Look-ahead bias occurs by using information or data in a study or simulation that would not
have been known or available during the period being analyzed. This will usually lead to
inaccurate results in the study or simulation. Look-ahead bias can be used to sway
simulation results closer into line with the desired outcome of the test.

For example, if a trade is simulated based on information that was not available at the time
of the trade - such as a quarterly earnings number that was released three months later - it
will diminish the accuracy of the trading strategy's true performance and potentially bias the
results in favor of the desired outcome.

Look-ahead Bias
Look-ahead bias is present when the analyst assumes information is readily available on a
certain date while in fact it’s not. As an example, analysts may assume end-of-year financial
information such as the annual profit generated is available in January, yet most companies
take up to 60 additional days before releasing results.
Data Mining Bias
Data mining is the practice of analyzing historical data so as to unearth trends and other
inherent relationships between variables. Analysts may then use such trends to predict
future behavior.
Data mining bias occurs when analysts excessively analyze data, giving rise to statistically
irrelevant, sometimes non-existent trends.
Time Period Bias
Time period bias involves inappropriate generalization of time-specific results – those
results that only apply to certain seasons or periods. Most entities experience seasonal
variation in performance so that some months may be more productive than others. For
example, ice cream production companies across Europe may record bigger sales during the
summer and lower sales during winter. Therefore, a sample of such entities drawn during
winter will estimate winter-specific parameters.
Sample Selection Bias
Sample selection bias refers to the tendency to exclude a section of the population from
sample analysis due to unavailability of data. This erodes the idea of randomness since the
exclusion of a certain class of data is somewhat identical to collecting data from a subset of
the population. The resulting parameter is not representative of the population as a whole.
Survivorship Bias
Survivorship bias entails exclusion of information that relates to financial vehicles that are
no longer existent, during sampling. Consequently, conclusions may underestimate or
overestimate the population parameters. For example, most mutual fund databases that
track performance may exclude funds that have underperformed, leading to closure.
Analyzing only the “surviving” funds may overestimate the average mutual fund earnings.
Data snooping bias is a statistical bias that appears when exhaustively searching for
combinations of variables: the probability that a result arose by chance grows with the
number of combinations tested.

To minimize the likelihood that the results suffer from data snooping bias, use a simple and
popular technique called out-of-sample testing.

In order to minimize the probability that our results occurred simply by chance, we can
divide the data used in the backtesting process into two samples.
The first is called the in-sample, and it is the data sample that will be used to backtest
all the combinations that result from the initial trading rules.
The second is called the out-of-sample, and it is used as a way to test the best performing
rules (the ones that were picked from the in-sample backtesting) on new data. The
out-of-sample testing acts as a filter: rules that do not perform as well as in the in-sample
test are rejected, and only rules that pass both tests are accepted.
This technique dramatically decreases the likelihood that the rules suffer from data
snooping bias.
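A minimal sketch of this in-sample / out-of-sample split, assuming a hypothetical list of daily returns and a set of candidate rules represented as toy lookback lengths; a real backtest would of course be far more involved.

```python
import random

# Hypothetical daily return series (placeholder data).
random.seed(1)
returns = [random.gauss(0, 0.01) for _ in range(1_000)]

# Split the data: the first 70% is the in-sample, the rest the out-of-sample.
split = int(len(returns) * 0.7)
in_sample, out_of_sample = returns[:split], returns[split:]

def backtest(rule_lookback, data):
    """Toy metric: average next-day return after `rule_lookback` consecutive up days."""
    picks = [data[i + 1] for i in range(rule_lookback, len(data) - 1)
             if all(r > 0 for r in data[i - rule_lookback + 1:i + 1])]
    return sum(picks) / len(picks) if picks else float("-inf")

candidate_rules = range(1, 6)   # hypothetical trading-rule parameters

# Pick the best rule on the in-sample ...
best_rule = max(candidate_rules, key=lambda r: backtest(r, in_sample))

# ... and only accept it if it also performs on unseen out-of-sample data.
print("best in-sample rule:", best_rule)
print("out-of-sample score:", backtest(best_rule, out_of_sample))
```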

Data Snooping Bias is also referred to as Optimization Bias or Curve Fitting.

This bias is the result of refining too many parameters to improve a system’s performance
on a single data set.

A common example of data snooping starts out as an honest effort to improve a system.
You then test if adding another indicator into the mix will improve your results. If it does,
then you incorporate that indicator into the system and test adding another indicator. The
end result is a system that is perfectly optimized to trade the exact data set you tested it on.
The problem is that the system is only optimized for that specific data set, which already
happened.

The best way to avoid data snooping, or curve fitting, is to keep your systems simple, using
as few parameters as possible. It is also important to backtest your system on many
different data sets across different markets and time periods.
Data-snooping bias is a form of statistical bias generated by the misuse of data mining
techniques, which can lead to bogus results in scientific research. In the process of
data mining, huge numbers of hypotheses about a single data set can be tested in a very
short time, by exhaustively searching for combinations of variables that might show a
correlation. Thus, given enough hypotheses tested, it is virtually certain that some of them
will appear to be highly statistically significant, even on a data set with no real correlations
at all. Researchers who are using data mining techniques can be easily misled by these
apparently significant results, even though they are merely chance artifacts.
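A short simulation of this effect, assuming purely random data with no real relationships: when enough variable pairs are tested, some correlations look "strong" by chance alone. The variable count and the 0.3 threshold are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)
n_obs, n_vars = 50, 40

# Purely random data: no variable is genuinely related to any other.
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Exhaustively test every pair of variables and flag "strong" correlations.
flagged = []
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r = corr(data[i], data[j])
        if abs(r) > 0.3:
            flagged.append((i, j, r))

n_pairs = n_vars * (n_vars - 1) // 2
print(f"{len(flagged)} of {n_pairs} pairs look correlated purely by chance")
```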

For example, in a list of 367 people, at least two are guaranteed to share a birthday. Let's
say, on a particular list of 367 people, Mary Jones and John Smith both have their birthday
on August 7th. A data-snooping hypothesis would seek to find something special about the
two (for example, perhaps they are the youngest and the oldest; perhaps they are the only
two who have met exactly once before; exactly twice before; exactly three times before;
perhaps they are the only two with a father who has the same first name; a mother who has
the same first name etc.) By mentally going through hundreds, or perhaps thousands, of
potential, very interesting hypotheses that each have a low-probability of being true, we can
find one that is. Let's say that for this data-group it turns out that John and Mary are the
only two who switched minors three times in college, a fact we found out by exhaustively
comparing their life's histories. Our hypothesis can then become "Being born on August 7th
results in a much higher than average chance of switching minors more than twice in
college"! Indeed, turning to the data, we are helpless to see that it very strongly supports
that correlation, since not one of the other people (with a different birthday) had switched
minors three times in college, whereas BOTH of the people with an August 7th birthday had!
Turning to the general population, we attempt to reproduce the results, by selecting for
August 7th birthdates, and find that no such correlation can be extrapolated. Why? Because
in this example we have become victims of the data-snooper, who only chose whatever
obscure fact happened to be true for that particular data-set.

Data fishing can also be done by narrowing down the sample size to include only those results
that bear out the intended hypothesis. For example, a drug might be tested on 1,000 patients
and the results might not show a statistically significant positive result for a given problem.
However, by narrowing down the study to 500 people and using a selection bias towards
those who showed favorable results by using the drug, the company can claim something
that is not actually true.

However, not all data dredging is intentional. Many times, researchers are simply misled by
the apparent correlations that they see. This happens most frequently when the researchers
themselves are not sure what exactly they are looking for. It is therefore important to form
a hypothesis before starting and conducting the experiment in order to prevent any accidental
cases of data dredging.
If not, the researchers might stumble upon some correlation that doesn't actually exist but
shows strongly in their data. Researchers working in data mining need to be aware of this, as
it can seriously mislead and divert valuable resources toward claims that are not really true.
