Sampling Theory
Reducing Bias:
•Reason: In some cases, working with a sample can help reduce bias in the data or
mitigate the impact of outliers.
•Example: If a dataset contains outliers or is heavily skewed, a sample might provide a
more balanced representation for model training.
Model Testing and Evaluation:
•Reason: Sampling is crucial during the testing and evaluation phase of model
development to assess performance on unseen data.
•Example: After training a model, it is essential to evaluate its performance on a separate
sample (validation or test set) to ensure its generalization to new, unseen data.
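The hold-out idea above can be sketched with Python's standard library; the toy dataset of 100 rows and the 80/20 split ratio are illustrative assumptions, not part of the original example:

```python
import random

data = list(range(100))                 # toy dataset of 100 observations
random.seed(42)

# Hold out 20% of the observations as an unseen test set.
test_idx = set(random.sample(range(len(data)), k=20))
train = [x for i, x in enumerate(data) if i not in test_idx]
test = [x for i, x in enumerate(data) if i in test_idx]

print(len(train), len(test))  # 80 20
```

Because the held-out rows never enter training, performance on them approximates performance on new, unseen data.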
Computational Efficiency:
•Reason: Training machine learning models on large datasets can be computationally
expensive and time-consuming.
•Example: If you're building a model to predict customer preferences, it might be more
practical to work with a sample of the customer data rather than the entire dataset.
Sampling with and without replacement
Sampling without replacement: a subset of the observations is selected randomly, and once an observation is selected it cannot be selected again.
•In sampling without replacement, once an item is selected from the population, it is not
returned before the next selection.
•This means that each subsequent selection has a reduced pool of items to choose from.
•Once an item is selected, it cannot be selected again in the same sample.
•Sampling without replacement is often used when dealing with finite populations or when
it's important to ensure that each item is selected only once.
•E.g. Continuing with the dataset of 1000 images of cats and dogs, in cross-validation, you
divide the dataset into 5 equal parts. After each selection of a validation set, you do not put
the images back into the dataset before the next selection. This ensures that each image is
used exactly once for validation across all iterations of cross-validation, without
replacement.
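A minimal sketch of this 5-fold partition, using placeholder image names rather than real data:

```python
import random

# Stand-in names for the 1000 cat/dog images in the example.
images = [f"img_{i}" for i in range(1000)]
random.seed(0)
random.shuffle(images)

# Split into 5 disjoint folds: each image lands in exactly one
# validation fold, i.e. sampling without replacement.
k = 5
fold_size = len(images) // k
folds = [images[i * fold_size:(i + 1) * fold_size] for i in range(k)]

all_val = [img for fold in folds for img in fold]
print(len(set(all_val)))  # 1000 — every image used exactly once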
Sampling with replacement: a subset of observations is selected randomly, and an observation may be selected more than once.
•In sampling with replacement, each selected item is returned to the population before
the next item is selected.
•This means that each item in the population has the same probability of being selected
each time.
•Consequently, it's possible to select the same item more than once in the sample.
•Sampling with replacement is commonly used in situations where the population is
either very large or infinite.
•E.g. Imagine you have a dataset of 1000 images of cats and dogs for a classification
task. You randomly select 100 images from the dataset, and after each selection, you put
the image back into the dataset before the next selection. This allows some images to be
selected multiple times for training the model.
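Both schemes can be illustrated with Python's `random` module: `random.choices` draws with replacement, `random.sample` without. The image dataset is represented here by simple integer IDs:

```python
import random

random.seed(1)
images = list(range(1000))  # IDs standing in for the 1000 images

# With replacement: the same image can be drawn more than once.
with_repl = random.choices(images, k=100)

# Without replacement: each image can be drawn at most once.
without_repl = random.sample(images, k=100)

print(len(set(without_repl)))  # always 100: duplicates are impossible
print(len(set(with_repl)))     # may be below 100 if any image repeated
```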
Sampling methods
1.Probability sampling:
Probability sampling is a sampling technique in which the researcher fixes the selection criteria and then chooses members of the population at random. Under this selection process, all members have an equal opportunity to be included in the sample.
For example, in a population of 1000 members, every member has a 1/1000 chance of being selected as part of a sample. Probability sampling reduces sampling bias because every member of the population has a chance of being included in the sample.
2.Non-probability sampling:
In non-probability sampling, the researcher chooses members for research without a fixed or predefined random selection process, for example on the basis of convenience or judgement. As a result, not all population elements have an equal opportunity to be included in the sample.
For example:
Suppose a teacher has to choose 4 participants from a class of 30 students for a debate competition. Here, the teacher may select the top 4 debaters on the basis of her own judgement about who the best debaters in the class are. This is an example of purposive sampling: the purpose of the sample guides the choice of certain members or units of the population. Here, not every member of the population has the same chance of being selected.
Simple random sampling
In the simple random sampling technique, every item in the population has an equal chance of being selected for the sample. Since the selection depends entirely on chance, this method is known as the "method of chance selection". When the sample size is large and items are chosen randomly, the resulting sample is known as a "representative sample".
Example:
Suppose we want to select a simple random sample of 200 students from a school of 500 students. Here, we can assign a number from 1 to 500 to every student in the school database and use a random number generator to select 200 of those numbers.
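A sketch of this procedure with Python's `random.sample`, assuming the school has 500 students numbered 1 to 500:

```python
import random

random.seed(7)
student_ids = range(1, 501)  # every student numbered 1..500

# Simple random sample: each student has the same chance of selection,
# and no student can be chosen twice.
sample = random.sample(student_ids, k=200)

print(len(sample), len(set(sample)))  # 200 200
```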
When conducting research or analysis, random sampling helps ensure that the sample is
representative of the population, meaning that the characteristics and attributes of the
sample closely resemble those of the entire population.
This allows researchers to make valid inferences or generalizations about the population
based on the data collected from the sample.
Population Parameter
A parameter is a characteristic of a population. Population parameters are
numerical values that describe various characteristics of a population. These
parameters provide a summary of the entire population and are typically
unknown because it's often impractical or impossible to measure every
individual in the population. Instead, researchers use statistical techniques to
estimate these parameters based on data collected from a sample of the
population.
These parameters are used in statistical analysis to make inferences about the
population, test hypotheses, and draw conclusions. Estimating population
parameters from sample data involves using statistical methods such as point
estimation and interval estimation. Point estimation provides a single value
estimate for a population parameter, while interval estimation provides a range
of values within which the parameter is likely to lie, along with a level of
confidence.
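Point and interval estimation can be illustrated on a toy sample. The measurements below are made up for illustration, and the 1.96 critical value assumes a normal approximation (a t critical value would be more appropriate for a sample this small):

```python
import math
import statistics

sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7, 5.2, 5.5]  # toy data

n = len(sample)
mean = statistics.mean(sample)   # point estimate of the population mean
sd = statistics.stdev(sample)    # sample standard deviation (n - 1 divisor)

# 95% interval estimate: point estimate plus/minus a margin of error.
half_width = 1.96 * sd / math.sqrt(n)
print(round(mean, 2))
print((round(mean - half_width, 2), round(mean + half_width, 2)))
```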
Sample Statistics
Sample statistics are numerical values calculated from data collected
from a sample. These statistics provide information about the
characteristics of the sample and are used to estimate population
parameters or make inferences about the population as a whole.
Common sample statistics include the sample mean, sample standard
deviation, sample median, sample variance, etc. Sample statistics are
often used in statistical analysis to summarize data, test hypotheses, and
draw conclusions.
These sample statistics are used to estimate their corresponding
population parameters. However, sample statistics are subject to
sampling variability, meaning that they may differ from one sample to
another. The accuracy of sample statistics as estimators of population
parameters depends on factors such as sample size, sampling method,
and the representativeness of the sample.
PARAMETER VS STATISTIC:
(1) A parameter is a fixed measure describing the whole population (a population is a
group of people or things with common characteristics). A statistic, on the other
hand, is a characteristic of a sample, a portion of the target population.
(2) A parameter is a fixed, unknown numerical value, while a statistic is a known
number and a variable that depends on the portion of the population sampled.
(3) Sample statistics and population parameters use different statistical notations.
(4) Parameters do not change, while statistics vary from sample to sample.
(5) A parameter is a characteristic of a population; a statistic is a characteristic of
a sample.
(6) A statistic computed from a sample is used to make a guess (estimate) about a
population parameter.
                     Population parameter   Sample statistic
Sample size          N                      n
Mean                 μ                      x̄
Variance             σ²                     s²
Standard deviation   σ                      s
Sample Mean
•The sample mean, denoted by x̄, is a measure of central tendency that represents the average
value of a set of data points in a sample.
•It provides a single numerical value that summarizes the center of the distribution of data
in the sample.
•The sample mean is calculated by summing up all the individual values in the sample and
dividing by the total number of values (sample size).
•Mathematically, it can be represented as:
   x̄ = (1/n) Σ xᵢ   (summing i from 1 to n)
•where:
 • x̄ is the sample mean,
 • xᵢ represents each individual data point,
 • n is the total number of data points in the sample.
•The sample mean provides an estimate of the population mean when the sample is drawn
from a larger population.
•It represents the "typical" value or average value observed in the sample.
•The sample mean is influenced by the values of all data points in the sample.
•The sample mean is a point estimator, meaning it provides a single estimate of the
population mean.
•It is sensitive to outliers in the data, as extreme values can disproportionately influence
the calculation of the mean.
•The sample mean is an unbiased estimator of the population mean, meaning that, on
average, it provides an accurate estimate of the population mean when multiple samples
are drawn.
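The unbiasedness claim can be checked empirically: averaged over many samples, the sample means cluster around the population mean. The synthetic population and tolerance below are illustrative assumptions:

```python
import random
import statistics

random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # true population mean

# Draw many samples of size 30; the average of the sample means
# should sit close to mu, showing there is no systematic bias.
sample_means = [
    statistics.mean(random.sample(population, k=30)) for _ in range(2_000)
]
print(abs(statistics.mean(sample_means) - mu) < 0.5)  # True
```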
•It is used extensively in data analysis, inference, and decision-making
processes across various fields, including science, business, and social
sciences.
Sample Variance
•Sample variance can be defined as the average of the squared differences of the data points
from the mean of the data set.
•It is an absolute measure of dispersion and is used to check the deviation of data points with
respect to the data's average.
•The formula to calculate sample variance:
   s² = Σ (xᵢ − x̄)² / (n − 1)   (summing i from 1 to n)
where:
•s² is the sample variance,
•xᵢ represents each individual data point,
•x̄ is the sample mean,
•n is the total number of data points in the sample.
•Sample variance measures the average squared deviation of data points from the sample
mean.
•Larger variance values indicate greater variability among data points, while smaller values
suggest more consistency.
•Sample variance is commonly used in inferential statistics to estimate population variance.
•It serves as a crucial parameter in hypothesis testing and constructing confidence intervals.
•Sample variance can help identify outliers or extreme values that significantly affect the
overall variability of the sample.
•Sample variance helps assess the consistency or variability of data within a sample.
•Its importance extends across various fields, including scientific research,
business analytics, quality control, finance, and risk management.
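A quick sketch of the formula on a small toy dataset, checked against `statistics.variance` (which also divides by n − 1):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # toy dataset

n = len(data)
mean = sum(data) / n
# Sample variance: squared deviations from the mean, divided by n - 1.
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

print(s2)  # 4.571428571428571 (i.e. 32/7)
print(abs(s2 - statistics.variance(data)) < 1e-12)  # True
```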
Unbiased estimate
Definition:
• An unbiased estimate is a statistical estimator whose expected value is equal to the true
population parameter being estimated. In simpler terms, an unbiased estimator, when used
repeatedly, produces estimates that are on average equal to the true value of the parameter
being estimated.
Mathematical Definition:
• Mathematically, an estimator θ̂ is unbiased if its expected value E(θ̂) equals the true population
parameter θ: E(θ̂) = θ, where
• θ̂ is the estimator,
• E(θ̂) represents the expected value of the estimator,
• θ is the true population parameter being estimated.
• This means that, on average, the estimator provides an accurate estimate of the population
parameter across multiple samples.
•For example, when estimating the population mean (μ) using the sample mean x̄, the
sample mean is an unbiased estimator because: E(x̄) = μ
This means that, on average, the sample mean accurately estimates the population
mean.
•Unbiased estimators are desirable because they provide estimates that are not
systematically too high or too low.
•They provide a fair and consistent estimate of the population parameter across different
samples.
•Unbiasedness is a desirable property when evaluating the performance of estimators, as it
ensures that the estimator does not systematically overestimate or underestimate the
population parameter.
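The role of the n − 1 divisor in the sample variance can be seen by simulation: dividing by n systematically underestimates the population variance, while dividing by n − 1 does not. The synthetic population and tolerance are illustrative assumptions:

```python
import random
import statistics

random.seed(0)
population = [random.gauss(0, 2) for _ in range(10_000)]
sigma2 = statistics.pvariance(population)  # true population variance

biased, unbiased = [], []
for _ in range(3_000):
    s = random.sample(population, k=5)
    m = sum(s) / len(s)
    ss = sum((x - m) ** 2 for x in s)
    biased.append(ss / len(s))           # divide by n: biased low
    unbiased.append(ss / (len(s) - 1))   # divide by n - 1: unbiased

print(statistics.mean(biased) < sigma2)  # True: systematic underestimate
print(abs(statistics.mean(unbiased) - sigma2) < sigma2 * 0.1)
```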
Efficient estimate
Definition:
An efficient estimator is a statistical estimator that achieves the smallest possible variance
among all unbiased estimators for a given sample size. In other words, an efficient estimator
minimizes the variability of estimates and provides the most precise estimate of the
population parameter.
Mathematically, let θ̂ be an estimator for a population parameter θ. The efficiency of θ̂ is
determined by comparing its variance with the variances of all other unbiased estimators of
θ.
If θ̂ is the estimator with the smallest variance among all unbiased estimators, it is said to be
efficient. In other words, for every other unbiased estimator θ̃ of θ:
Var(θ̂) ≤ Var(θ̃)
• Efficiency is a desirable property because it ensures that estimates are not only unbiased but
also precise.
• Efficient estimators produce estimates with the least amount of sampling variability, making
them more reliable and informative.
• Efficient estimates provide more reliable estimates of population parameters. They minimize
the variability in estimates, making them more consistent and accurate.
• By reducing variability in estimates, efficient estimators provide a clearer picture of the
underlying data. This leads to a better understanding of the phenomenon being studied.
• In predictive modeling and forecasting, precise estimates are essential for accurate
predictions. Efficient estimates contribute to better forecasting models and more reliable
predictions.
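Efficiency can be illustrated by comparing two estimators of the centre of a normal population: the sample mean and the sample median are both unbiased here, but the mean has the smaller sampling variance and is therefore the more efficient estimator. The population parameters below are illustrative:

```python
import random
import statistics

random.seed(0)

# Repeatedly draw samples from a normal population and record both
# estimators of its centre.
means, medians = [], []
for _ in range(3_000):
    s = [random.gauss(100, 15) for _ in range(25)]
    means.append(statistics.mean(s))
    medians.append(statistics.median(s))

# The estimator with the smaller variance across samples is more efficient.
print(statistics.variance(means) < statistics.variance(medians))  # True
```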