
SAMPLING THEORY

- Population and sample


- Sampling with and without replacement
- Random sample and population parameters
- Sample statistics and mean
- Sample variances
- Unbiased estimates and efficient estimates
Population And Sample
Population:
• A population is the entire set of items from which you draw data for a statistical study.
• Example: We want to predict the final grades of all
students in the school based on various features such
as study hours, attendance, and past exam scores.
Sample:
• A subset of a larger population that contains
characteristics of that population.
• A sample is used in statistical testing when the
population size is too large for all members or
observations to be included in the test.
• Example: We randomly select 50 students from the
entire school population to create a sample. This
subset will be used to build and train our predictive
model.
Why sampling?
Feasibility of Data Collection:
•Reason: In some situations, collecting data from the entire population may be impractical
or impossible.
•Example: If you're studying the behavior of online users, obtaining data from the entire
global user base may be unrealistic, so you might opt for a sample.

Reducing Bias:
•Reason: In some cases, working with a sample can help reduce bias in the data or
mitigate the impact of outliers.
•Example: If a dataset contains outliers or is heavily skewed, a sample might provide a
more balanced representation for model training.
Model Testing and Evaluation:
•Reason: Sampling is crucial during the testing and evaluation phase of model
development to assess performance on unseen data.
•Example: After training a model, it is essential to evaluate its performance on a separate
sample (validation or test set) to ensure its generalization to new, unseen data.

Computational Efficiency:
•Reason: Training machine learning models on large datasets can be computationally
expensive and time-consuming.
•Example: If you're building a model to predict customer preferences, it might be more
practical to work with a sample of the customer data rather than the entire dataset.
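The computational-efficiency point above can be illustrated with a minimal Python sketch. It assumes a hypothetical pandas DataFrame named `customers` built in memory; in practice the data would come from a file or database.

```python
import pandas as pd

# Hypothetical customer data; stands in for a much larger real dataset.
customers = pd.DataFrame({
    "customer_id": range(1, 100_001),
    "age": [20 + (i % 50) for i in range(100_000)],
})

# Draw a 10% random sample (without replacement) for faster experimentation.
# random_state makes the draw reproducible.
sample = customers.sample(frac=0.10, random_state=42)
print(len(sample))  # 10000 rows instead of 100000
```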
Sampling with and without replacement
• Sampling without replacement: a subset of the observations is selected randomly, and once an observation is selected it cannot be selected again.
•In sampling without replacement, once an item is selected from the population, it is not
returned before the next selection.
•This means that each subsequent selection has a reduced pool of items to choose from.
•Once an item is selected, it cannot be selected again in the same sample.
•Sampling without replacement is often used when dealing with finite populations or when
it's important to ensure that each item is selected only once.
•E.g. Continuing with the dataset of 1000 images of cats and dogs, in cross-validation, you
divide the dataset into 5 equal parts. After each selection of a validation set, you do not put
the images back into the dataset before the next selection. This ensures that each image is
used exactly once for validation across all iterations of cross-validation, without
replacement.
• Sampling with replacement: a subset of observations is selected randomly, and an observation may be selected more than once.
•In sampling with replacement, each selected item is returned to the population before
the next item is selected.
•This means that each item in the population has the same probability of being selected
each time.
•Consequently, it's possible to select the same item more than once in the sample.
•Sampling with replacement is commonly used in situations where the population is
either very large or infinite.
•E.g. Imagine you have a dataset of 1000 images of cats and dogs for a classification
task. you randomly select 100 images from the dataset, and after each selection, you put
the image back into the dataset before the next selection. This allows some images to be
selected multiple times for training the model.
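The difference between the two schemes can be seen in a short NumPy sketch reusing the 1000-image example; `image_ids` is just a stand-in array of identifiers.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
image_ids = np.arange(1000)  # stand-in for the 1000 cat/dog images

# Sampling WITH replacement: the same image id can appear more than once.
with_repl = rng.choice(image_ids, size=100, replace=True)

# Sampling WITHOUT replacement: every selected image id is unique.
without_repl = rng.choice(image_ids, size=100, replace=False)

print(len(set(with_repl)))      # usually < 100, duplicates are possible
print(len(set(without_repl)))   # always 100, all ids are distinct
```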
Sampling methods
1.Probability sampling:
Probability sampling is a sampling technique where a researcher selects a few criteria and
chooses members of a population randomly. All the members have an equal opportunity to
participate in the sample with this selection parameter.
For example, in a population of 1000 members, every member will have a 1/1000 chance of
being selected to be a part of the sample. Probability sampling reduces sampling bias and gives every member of the population a chance of being included in the sample.
2.Non-probability sampling:
In non-probability sampling, the researcher chooses members for research in a non-random way, based on convenience or judgement. This sampling method does not follow a fixed or predefined selection process, which makes it difficult for all population elements to have equal opportunities to be included in a sample.
For example:
suppose a teacher has to choose 4 participants from a class of 30 students in a debate
competition. Here, the teacher may select the top 4 debaters on the basis of her own conscious
judgement about the top debaters in the class. This is an example of purposive sampling. In this
method, the purpose of the sample guides the choice of certain members or units of the
population. Here, not all population members have the same chance of being selected.
Simple random sampling
In the simple random sampling technique, every item in the population has an equal chance of being selected in the sample. Since the item selection depends entirely on chance, this method is known as the "method of chance selection". When the sample size is large and the items are chosen randomly, the resulting sample is also called a "representative sample".
Example:
Suppose we want to select a simple random sample of 200 students from a school of 500 students. Here, we can assign a number from 1 to 500 to every student in the school database and use a random number generator to select 200 of those numbers.

One of the best probability sampling techniques for saving time and resources is the simple random sampling method. It is a reliable method of obtaining information in which every single member of a population is chosen randomly, merely by chance, and each individual has the same probability of being chosen to be a part of the sample.
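A minimal sketch of the school example, assuming the 500 students are simply numbered 1 to 500; `random.sample` draws without replacement, so every student has the same chance and no one is chosen twice.

```python
import random

# Assign numbers 1..500 to every student in the (hypothetical) school database.
student_ids = list(range(1, 501))

# Simple random sample of 200 students.
random.seed(7)
sample_ids = random.sample(student_ids, k=200)
print(len(sample_ids))  # 200
```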
Cluster sampling
Cluster sampling is a method where the researchers divide the
entire population into sections or clusters representing a
population. Clusters are identified and included in a sample based
on demographic parameters like age, sex, location, etc. This makes
it very simple for a survey creator to derive effective inferences
from the feedback.
Example:
In a study on education quality, a researcher selects
five schools out of a total of twenty in a district using
cluster sampling. They then survey all teachers
within those selected schools to gather data on
teaching methods and student performance.

Cluster sampling is a sampling technique used in statistics when the population being studied is too large and widely dispersed to enumerate or sample directly. Instead of sampling individuals from the entire population, researchers randomly select whole clusters and then collect data within them.
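A minimal two-stage sketch of the school example above, assuming a hypothetical lookup `teachers_by_school` from each school to its teachers: first whole clusters (schools) are selected at random, then every teacher in the selected clusters is surveyed.

```python
import random

# Hypothetical list of 20 schools (clusters) in the district, 10 teachers each.
schools = [f"school_{i}" for i in range(1, 21)]
teachers_by_school = {s: [f"{s}_teacher_{j}" for j in range(1, 11)] for s in schools}

# Stage 1: randomly select 5 whole clusters.
random.seed(1)
selected_schools = random.sample(schools, k=5)

# Stage 2: survey EVERY teacher within the selected schools.
surveyed = [t for s in selected_schools for t in teachers_by_school[s]]
print(len(surveyed))  # 5 schools x 10 teachers each = 50
```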
Systematic sampling
Systematic sampling is a probability sampling method where
the researcher chooses elements from a target population by
selecting a random starting point and selecting sample
members after a fixed ‘sampling interval.’
Example:Suppose we have a list of 200 registered
voters, and we want to select a systematic
sample of 20 voters to interview about their
political preferences. With a sampling interval of
200 / 20 = 10, we randomly choose a starting
point, say the 5th voter. We then proceed to
select every 10th voter from the list until we
reach 20, forming our systematic sample.
Systematic sampling is often more efficient than simple random sampling
because it requires less time and effort. Once the sampling interval is
determined, selecting the sample members becomes straightforward.
One disadvantage of systematic sampling is that if there's a periodic pattern
or structure in the population, it may lead to a biased sample. Additionally, if
there are periodic changes within the population, systematic sampling may
produce a sample that is not representative of the whole population.
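The voter example translates into a short sketch: compute the sampling interval, pick a random starting point within the first interval, then take every k-th voter from the list.

```python
import random

voters = list(range(1, 201))           # 200 registered voters in list order

sample_size = 20
interval = len(voters) // sample_size  # sampling interval k = 200 / 20 = 10

random.seed(3)
start = random.randrange(interval)     # random start within the first interval

systematic_sample = voters[start::interval]  # every 10th voter from the start
print(len(systematic_sample))  # 20
```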
Stratified random sampling
Stratified sampling designs involve partitioning
a population into strata based on a certain
characteristic that is known for every
sampling unit in the population, and then
selecting samples independently from each
stratum.
This design offers flexibility of sampling
methods in different strata and gains improved
precision of estimates of target parameters
when each stratum is composed of units that
are relatively homogeneous.
Stratified sampling is a method used in
statistics and research to ensure that
subgroups within a population are represented
proportionally in the sample.
For example: for a market research study, the target market is split into two strata based on gender, and members are then sampled from each stratum in proportion to its share of the population.
How stratified sampling works:
1.Identify Strata: First, you divide the population into distinct subgroups
called "strata." These strata should be mutually exclusive and collectively
exhaustive, meaning that every member of the population belongs to one and
only one stratum, and all possible members of the population are accounted
for in the strata.
2.Determine Proportions: Next, you determine the proportion of the
population that each stratum represents. This could be based on certain
characteristics or attributes that you're interested in studying.
3.Sample from Each Stratum: After determining the proportions, you then
take a sample from each stratum. The size of the sample from each stratum is
proportional to the size of the stratum relative to the population. This ensures
that each subgroup is adequately represented in the final sample.
4.Combine Samples: Finally, you combine the samples from each stratum to
form the complete sample for your study.
Stratified sampling is particularly useful when there are known differences or
variations within the population that could affect the outcome of the study. By
ensuring that each subgroup is represented in the sample, researchers can
obtain more precise and representative estimates.
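A minimal pandas sketch of proportional stratified sampling, assuming a hypothetical population table with a `gender` column defining the strata (60% F, 40% M); each stratum contributes to the sample in proportion to its size.

```python
import pandas as pd

population = pd.DataFrame({
    "person_id": range(1, 1001),
    "gender": ["F"] * 600 + ["M"] * 400,   # strata: 60% F, 40% M
})

sample_size = 100

# Sample the same fraction from every stratum, so proportions are preserved.
stratified_sample = (
    population.groupby("gender", group_keys=False)
              .sample(frac=sample_size / len(population), random_state=0)
)
print(stratified_sample["gender"].value_counts())  # F: 60, M: 40
```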
Random sample

• A random sample is a subset of individuals or items selected from a larger population in such a way that every member of the population has an equal chance of being chosen.
• Random sampling is a fundamental technique used in statistics and research to draw conclusions about a population based on the characteristics of the sample.
• When conducting research or analysis, random sampling helps ensure that the sample is representative of the population, meaning that the characteristics and attributes of the sample closely resemble those of the entire population.
• This allows researchers to make valid inferences or generalizations about the population based on the data collected from the sample.
Population Parameter
A parameter is a characteristic of a population. Population parameters are
numerical values that describe various characteristics of a population. These
parameters provide a summary of the entire population and are typically
unknown because it's often impractical or impossible to measure every
individual in the population. Instead, researchers use statistical techniques to
estimate these parameters based on data collected from a sample of the
population.
These parameters are used in statistical analysis to make inferences about the
population, test hypotheses, and draw conclusions. Estimating population
parameters from sample data involves using statistical methods such as point
estimation and interval estimation. Point estimation provides a single value
estimate for a population parameter, while interval estimation provides a range
of values within which the parameter is likely to lie, along with a level of
confidence.
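As a rough illustration of point versus interval estimation, the sketch below uses a simulated sample of 50 exam scores and the normal approximation (±1.96 standard errors) for a 95% confidence interval; the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
scores = rng.normal(loc=70, scale=10, size=50)   # hypothetical sample of exam scores

# Point estimate of the population mean: the sample mean.
point_estimate = scores.mean()

# Interval estimate: approximate 95% confidence interval via the normal approximation.
std_err = scores.std(ddof=1) / np.sqrt(len(scores))
ci_low, ci_high = point_estimate - 1.96 * std_err, point_estimate + 1.96 * std_err
print(point_estimate, (ci_low, ci_high))
```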
Sample statistics
Sample statistics are numerical values calculated from data collected
from a sample. These statistics provide information about the
characteristics of the sample and are used to estimate population
parameters or make inferences about the population as a whole.
Common sample statistics include the sample mean, sample standard
deviation, sample median, sample variance, etc. Sample statistics are
often used in statistical analysis to summarize data, test hypotheses, and
draw conclusions.
These sample statistics are used to estimate their corresponding
population parameters. However, sample statistics are subject to
sampling variability, meaning that they may differ from one sample to
another. The accuracy of sample statistics as estimators of population
parameters depends on factors such as sample size, sampling method,
and the representativeness of the sample.
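The common sample statistics listed above can be computed directly with NumPy; the sample of study hours below is made up for illustration.

```python
import numpy as np

sample = np.array([2.0, 3.5, 4.0, 1.5, 5.0, 3.0, 2.5, 4.5, 3.5, 2.0])  # study hours

print("sample mean:     ", sample.mean())
print("sample median:   ", np.median(sample))
print("sample variance: ", sample.var(ddof=1))   # ddof=1 -> divide by n - 1
print("sample std dev:  ", sample.std(ddof=1))
```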
PARAMETER VS STATISTICS:
(1) A parameter is a fixed measure describing the whole population (a population is a group of people or things with common characteristics). On the other hand, a statistic is a characteristic of a sample, a portion of the target population.
(2) A parameter is a fixed, usually unknown numerical value, while a statistic is a known number that varies from sample to sample.
(3) Sample statistics and population parameters use different statistical notations.
(4) Parameters do not change, while statistics vary with the sample drawn.
(5) A parameter is a characteristic of a population and a statistic is a characteristic of a sample.
(6) A statistic computed from a sample is used to make an inference (a guess) about a population parameter.
                     Population Parameter   Sample Statistic
Sample size          N                       n
Mean                 μ                       x̄
Variance             σ²                      s²
Standard deviation   σ                       s
Sample Mean
•The sample mean, denoted by x̄, is a measure of central tendency that represents the average value of a set of data points in a sample.
•It provides a single numerical value that summarizes the center of the distribution of data
in the sample.
•The sample mean is calculated by summing up all the individual values in the sample and
dividing by the total number of values (sample size).
•Mathematically, it can be represented as:
x̄ = (x₁ + x₂ + … + xₙ) / n = (1/n) Σ xᵢ
•where:
• x̄ is the sample mean,
• xᵢ represents each individual data point,
• n is the total number of data points in the sample.
•The sample mean provides an estimate of the population mean when the sample is drawn
from a larger population.
•It represents the "typical" value or average value observed in the sample.
•The sample mean is influenced by the values of all data points in the sample.
•The sample mean is a point estimator, meaning it provides a single estimate of the
population mean.
•It is sensitive to outliers in the data, as extreme values can disproportionately influence
the calculation of the mean.
•The sample mean is an unbiased estimator of the population mean, meaning that, on
average, it provides an accurate estimate of the population mean when multiple samples
are drawn.
•It is used extensively in data analysis, inference, and decision-making
processes across various fields, including science, business, and social
sciences.
Sample Variance
•Sample variance can be defined as the average of the squared differences of the data points from the mean of the data set.
•It is an absolute measure of dispersion and is used to check the deviation of data points with
respect to the data's average.
•The formula to calculate sample variance is:
s² = Σ (xᵢ − x̄)² / (n − 1)
where:
• s² is the sample variance,
• xᵢ represents each individual data point,
• x̄ is the sample mean,
• n is the total number of data points in the sample.
•Sample variance measures the average squared deviation of data points from the sample
mean.
•Larger variance values indicate greater variability among data points, while smaller values
suggest more consistency.
•Sample variance is commonly used in inferential statistics to estimate population variance.
•It serves as a crucial parameter in hypothesis testing and constructing confidence intervals.
•Sample variance can help identify outliers or extreme values that significantly affect the
overall variability of the sample.
•Sample variance helps assess the consistency or variability of data within a sample.
•Its importance extends across various fields, including scientific research,
business analytics, quality control, finance, and risk management.
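A tiny worked check of the formula, assuming the usual n − 1 denominator: the hand computation and NumPy's `var(ddof=1)` agree on a made-up five-point sample.

```python
import numpy as np

x = np.array([4.0, 7.0, 6.0, 5.0, 8.0])
n = len(x)
x_bar = x.mean()                                  # sample mean = 6.0

# Sample variance from the formula: s^2 = sum((x_i - x_bar)^2) / (n - 1)
s_squared = ((x - x_bar) ** 2).sum() / (n - 1)

assert np.isclose(s_squared, x.var(ddof=1))       # same result from NumPy
print(s_squared)  # 2.5
```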
Unbiased estimate
Definition:
• An unbiased estimate is a statistical estimator whose expected value is equal to the true
population parameter being estimated. In simpler terms, an unbiased estimator, when used
repeatedly, produces estimates that are on average equal to the true value of the parameter
being estimated.
Mathematical Definition:
• Mathematically, an estimator θ̂ is unbiased if its expected value E(θ̂) equals the true population parameter θ: E(θ̂) = θ, where
• θ̂ is the estimator,
• E(θ̂) represents the expected value of the estimator,
• θ is the true population parameter being estimated.
• This means that, on average, the estimator provides an accurate estimate of the population
parameter across multiple samples.
•For example, when estimating the population mean (μ) using the sample mean x̄, the sample mean is an unbiased estimator because E(x̄) = μ.
This means that, on average, the sample mean accurately estimates the population
mean.
•Unbiased estimators are desirable because they provide estimates that are not
systematically too high or too low.
•They provide a fair and consistent estimate of the population parameter across different
samples.
•Unbiasedness is a desirable property when evaluating the performance of estimators, as it
ensures that the estimator does not systematically overestimate or underestimate the
population parameter.
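Unbiasedness can be checked empirically: the simulation below draws many samples from a normal population with assumed parameters μ = 5 and σ² = 4, and shows that the sample mean and the n − 1 variance average out to the true values, while the n-denominator variance is systematically too small.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_mu, true_sigma2 = 5.0, 4.0       # assumed population parameters
n, trials = 10, 20_000

means, biased_vars, unbiased_vars = [], [], []
for _ in range(trials):
    s = rng.normal(true_mu, np.sqrt(true_sigma2), size=n)
    means.append(s.mean())
    biased_vars.append(s.var(ddof=0))     # divide by n     -> biased
    unbiased_vars.append(s.var(ddof=1))   # divide by n - 1 -> unbiased

print(np.mean(means))          # close to 5.0:  E(x̄) = μ
print(np.mean(unbiased_vars))  # close to 4.0:  E(s²) = σ²
print(np.mean(biased_vars))    # close to 3.6:  systematically underestimates σ²
```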
Efficient estimate
Definition:
An efficient estimator is a statistical estimator that achieves the smallest possible variance
among all unbiased estimators for a given sample size. In other words, an efficient estimator
minimizes the variability of estimates and provides the most precise estimate of the
population parameter.
An estimator is considered efficient if it achieves the smallest possible variance among all
unbiased estimators for a given sample size.
Mathematically, let θ̂ be an estimator for a population parameter θ. The efficiency of θ̂ is determined by comparing its variance with the variances of all other unbiased estimators of θ.
If θ̂ is the estimator with the smallest variance among all unbiased estimators, it is said to be efficient. In other words, for every other unbiased estimator θ̃ of θ:
Var(θ̂) ≤ Var(θ̃)
• Efficiency is a desirable property because it ensures that estimates are not only unbiased but
also precise.
• Efficient estimators produce estimates with the least amount of sampling variability, making
them more reliable and informative.
• Efficient estimates provide more reliable estimates of population parameters. They minimize
the variability in estimates, making them more consistent and accurate.
• By reducing variability in estimates, efficient estimators provide a clearer picture of the
underlying data. This leads to a better understanding of the phenomenon being studied.
• In predictive modeling and forecasting, precise estimates are essential for accurate
predictions. Efficient estimates contribute to better forecasting models and more reliable
predictions.
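Relative efficiency can also be seen in a simulation: for a normal population, both the sample mean and the sample median are (roughly) unbiased for μ, but the sample mean has the smaller sampling variance, so it is the more efficient estimator here; the population parameters below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_mu, sigma, n, trials = 0.0, 1.0, 50, 20_000

sample_means, sample_medians = [], []
for _ in range(trials):
    s = rng.normal(true_mu, sigma, size=n)
    sample_means.append(s.mean())
    sample_medians.append(np.median(s))

print(np.var(sample_means))    # ≈ σ²/n      = 0.020
print(np.var(sample_medians))  # ≈ πσ²/(2n) ≈ 0.031  (larger -> less efficient)
```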
