Asm Assignment
Introduction
Sampling is the process of selecting and studying a small group of people drawn from a larger group in order to reach conclusions that are likely to apply to the larger group as a whole. Researchers use it to collect data from a subset rather than studying the full population, which is often impracticable or impossible.
Example:
i. For a research study you need to collect data. Let us suppose that as a researcher, you want
to study the association between role model of parents and undesirable behaviour of
children in a home for street children. For this, you have to select a few representative cases
from the home. The process of selection requires thorough knowledge of various sampling
techniques.
ii. Consider a company that wants to understand consumer preferences for a new product. Instead of surveying all potential customers, it selects a random sample of 1000 people from its target market. This sample is analyzed to draw inferences that help the company make an informed decision.
Advantages of Sampling
2. Role of Sampling Theory in Statistical Inference
Sampling theory plays a crucial role in statistical research by providing the framework for drawing conclusions about a population from a subset of observations (a sample). In practice, it is usually impossible or prohibitively costly to collect data from an entire population, which makes sampling essential for analysis and decision making.
Population: The entire group of individuals or events that a statistician wants to study and draw conclusions about.
Types of Sampling:
● Random Sampling: Every entity has an equal chance of being selected in the sample.
● Stratified Sampling: The population is divided into strata, and random samples are taken from each stratum.
● Cluster Sampling: The population is divided into clusters, and whole clusters are selected at random.
The Central Limit Theorem (CLT) states that the distribution of the sample mean will be approximately normal when the sample size is large enough, regardless of the shape of the population distribution. The CLT helps statisticians make inferences using the properties of the normal distribution without knowing the underlying distribution of the population.
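A minimal R sketch of this idea, using a simulated (and deliberately skewed) exponential population as an assumed example: although the individual observations are far from normal, the means of repeated samples of size 50 are approximately normally distributed.

# Draw 5000 samples of size 50 from a skewed exponential population
# and look at the distribution of their sample means.
set.seed(42)
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))

hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 50)",
     xlab = "Sample mean")
qqnorm(sample_means)   # points close to the reference line indicate
qqline(sample_means)   # approximate normality, as the CLT predicts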
1. Estimation of Population Parameters: Sample mean and sample variance are used to estimate
population mean and population variance.
2. Hypothesis testing: Test statistics computed from the sample, based on unbiased estimators, are used to decide between the null and alternative hypotheses.
3. Confidence Intervals: Sample statistics are used to construct intervals that, with a stated level of confidence, are expected to contain the true population parameter.
Suppose we want to estimate the average height of students at BITS. Measuring everyone’s height is not feasible, so we select a random sample of 200 students and calculate their average height, which comes out to be 170 cm. Using the CLT to justify approximate normality of the sample mean, the corresponding 95% confidence interval is (168 cm, 172 cm).
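A rough R sketch of this calculation (the 200 heights are simulated here, since the real measurements are not available), showing a CLT-based 95% interval:

# Simulated stand-in for the 200 measured heights (in cm)
set.seed(1)
heights <- rnorm(200, mean = 170, sd = 14)

x_bar <- mean(heights)
se    <- sd(heights) / sqrt(length(heights))     # standard error of the mean
ci    <- x_bar + c(-1, 1) * qnorm(0.975) * se    # CLT-based 95% interval
round(ci, 1)

t.test(heights)$conf.int   # equivalent built-in (t-based) interval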
1) Simple Random Sampling
● A type of probability sampling technique in which every member of the population has an equal chance of being selected.
● Example: If we want to survey 50 students out of 5000 students living in the hostels at BITS, we assign a unique number to each student and use a random number generator to pick 50 students.
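A minimal R sketch of this selection, assuming the 5000 students are identified by the numbers 1 to 5000:

# Simple random sample of 50 student IDs out of 5000, each equally likely
set.seed(7)
student_ids <- 1:5000
chosen <- sample(student_ids, size = 50)   # without replacement by default
head(chosen)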
2) Systematic Sampling
●A type of probability sampling technique in which sample members from a larger population
are selected according to a random starting point but with a fixed, periodic interval.
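A short R sketch of systematic selection, again assuming 5000 numbered students and a desired sample of 50:

# Every k-th student after a random starting point
set.seed(7)
N <- 5000; n <- 50
k <- N %/% n                      # sampling interval (here 100)
start <- sample(1:k, 1)           # random start between 1 and k
chosen <- seq(from = start, by = k, length.out = n)
head(chosen)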
3) Stratified Sampling
● A type of probability sampling technique in which the total population is divided into homogeneous groups (strata), and samples are then drawn from each stratum.
● It is of two types: proportionate stratified sampling and disproportionate stratified sampling.
● Example: BITS has both undergraduate (UG) and postgraduate (PG) students across multiple disciplines. If we want all programs to be proportionally represented in our survey, we can divide the student population into strata by department and then use random sampling within each stratum.
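A minimal R sketch of this idea, using made-up department labels; here a fixed number of students is drawn from each stratum (proportional allocation would instead scale each stratum's sample size by its share of the population):

set.seed(7)
students <- data.frame(
  id   = 1:5000,
  dept = sample(c("CS", "EEE", "Mech", "Bio"), 5000, replace = TRUE)
)

# Draw 10 students at random from every department (stratum)
strat_sample <- do.call(rbind, lapply(split(students, students$dept),
                                      function(s) s[sample(nrow(s), 10), ]))
table(strat_sample$dept)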
4) Cluster Sampling
● A type of probability sampling technique in which the population is divided into clusters of people with broadly homogeneous characteristics, and every cluster has an equal chance of being selected into the sample.
● It is of three types: single-stage cluster sampling, two-stage cluster sampling, and multistage cluster sampling.
●Example: BITS has multiple hostels across its campus. Instead of surveying individual
students from all hostels, we could randomly select three hostels and survey all students from
those hostels.
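A minimal R sketch with hypothetical hostel names: three hostels (clusters) are chosen at random and every student in those hostels is included.

set.seed(7)
students <- data.frame(
  id     = 1:5000,
  hostel = sample(paste0("Hostel_", LETTERS[1:10]), 5000, replace = TRUE)
)

picked_hostels <- sample(unique(students$hostel), 3)   # randomly chosen clusters
cluster_sample <- students[students$hostel %in% picked_hostels, ]
table(cluster_sample$hostel)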
Non-Probability Sampling Techniques
Non-probability sampling techniques have several advantages: they are often more cost-effective and quicker to implement than probability-based methods, and they are particularly useful in exploratory research or when the target population is difficult to access.
There are a few limitations to non-probability sampling techniques. First, there is a higher risk of
sampling bias since not all members of the population have an equal chance of being selected.
Additionally, the results may not be generalizable to the entire population due to the lack of randomness
in the selection process.
1. Convenience Sampling
This is the most basic form of non-probability sampling, in which samples are drawn directly from whichever part of the population is conveniently available to the researcher. Because the samples are chosen for ease of access, the researcher makes no attempt to ensure that they represent the entire population.
Even though this is quick and convenient, there is a high chance that the results will be biased.
● Example: A student researcher studying the eating habits of college students may survey
classmates or students at the campus library simply because they are easy to access and
willing to participate.
2. Purposive/Judgmental Sampling
Purposive sampling is based on the presumption that, with good judgment, the researcher can select sample units that satisfy the requirements of the study.
A common strategy in this technique is to select cases judged to be typical of the population of interest, assuming that errors of judgment in the selection will tend to counterbalance one another.
3. Quota Sampling
In the quota sampling method, the researcher forms a sample of individuals chosen to represent the population with respect to specific traits or qualities, selecting subsets that together provide data intended to generalize to the entire population.
This gives samples that represent diverse groups; however, the selection within each group is non-random and based on convenience.
● Example: A student researcher studying the effect of social media on students’ academic
performance might set a quota to include 50 undergraduate students and 50 higher degree
students.
4. Snowball Sampling
Snowball sampling, also known as referral or respondent-driven sampling, is invaluable for accessing
hard-to-reach or elusive populations such as homeless people, teenagers, drug users, or other hidden
populations. Initial participants are recruited based on some criteria, who then refer others within their
networks, creating a snowball effect. The researchers only have control over the initial respondents who
in turn bring in more such respondents.
1. Ethical Guidelines
Ethical considerations ensure that participants are respected, their rights are protected, and the research
remains credible.
2. Technical Guidelines
Technical guidelines ensure that the sampling process is scientifically sound and yields reliable, valid
data for statistical inference.
When collecting data from a population, one of the key assumptions we make in statistical analysis is
that the data is independent and identically distributed (i.i.d). This helps us make sure that the results
we get from the data are accurate for the whole group. In simple terms, independent means that one
sample doesn’t affect another, and identically distributed means that all the samples follow the same
pattern or distribution.
• Independent: This means that the selection of one sample does not influence the selection of
another. For example, flipping a coin multiple times is independent because each flip doesn’t affect
the next.
• Identically Distributed: All samples must come from the same probability distribution. For
instance, if we measure the height of students in a class, all heights follow the same distribution (e.g.,
normal distribution).
Example (R programming):
Random Sampling: We draw two samples from the same normal distribution. Both samples are
independent of each other, and they follow the same probability distribution (mean = 50, sd = 10).
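A sketch of the set-up described above (the exact code is not shown, so this is a reconstruction with an assumed sample size of 1000 per sample):

set.seed(123)
sample1 <- rnorm(1000, mean = 50, sd = 10)   # independent draws from N(50, 10)
sample2 <- rnorm(1000, mean = 50, sd = 10)   # same distribution, drawn separately

hist(sample1, col = rgb(0, 0, 1, 0.4), breaks = 30,
     main = "Two samples from the same distribution", xlab = "Value")
hist(sample2, col = rgb(1, 0, 0, 0.4), breaks = 30, add = TRUE)
plot(sample1, sample2, main = "No visible pattern: samples are independent")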
Results: The histograms overlap, showing that both samples follow the same distribution, and the
scatter plot does not show any clear pattern, indicating independence.
Non-i.i.d. Sampling: In this case, we draw samples from two different distributions, where the second sample has a different mean and standard deviation (mean = 60, sd = 15), so the samples are not identically distributed. In addition, a third sample is generated alongside samples 1 and 2 that depends on sample 1, so samples 1 and 3 are not independent.
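A corresponding sketch (again a reconstruction; sample 3 is assumed here to be built by adding small noise to sample 1, which makes it dependent on sample 1):

set.seed(123)
sample1 <- rnorm(1000, mean = 50, sd = 10)
sample2 <- rnorm(1000, mean = 60, sd = 15)           # different mean and sd
sample3 <- sample1 + rnorm(1000, mean = 0, sd = 2)   # constructed from sample1

hist(sample1, col = rgb(0, 0, 1, 0.4), breaks = 30,
     main = "Samples from different distributions", xlab = "Value")
hist(sample2, col = rgb(1, 0, 0, 0.4), breaks = 30, add = TRUE)
plot(sample1, sample3, main = "Clear pattern: samples are not independent")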
Results: The scatter plot shows a clear relationship between samples 1 and 3, indicating that they are not independent, and the histograms do not overlap, showing that the samples are not identically distributed.
To ensure independence in sampling, each sample must be chosen without affecting the others.
Random sampling helps by giving everyone an equal chance of being selected. In small groups, if you
don’t return the samples after selection (sampling without replacement), future picks are affected, so
sampling with replacement is better to keep the choices independent. Bias in methods like quota
sampling and others can also ruin independence by making some samples more likely to be chosen. If
the samples aren’t independent, the results might not be reliable, leading to incorrect conclusions.
For identically distributed samples, all samples should follow the same pattern or distribution.
Stratified sampling, where the population is divided into groups and samples are taken from each,
helps achieve this. Removing outliers also ensures that the data is consistent. If the samples aren’t
identically distributed, the analysis might be biased, and the conclusions may not represent the true
population. Before sampling, make sure the population has consistent characteristics. After sampling,
use graphs like histograms or control charts to verify that the samples are distributed the same way.
1. Handling Missing Data
• Listwise Deletion: This method removes any observation that contains missing values. Although intuitive, it can reduce the sample size dramatically if many variables have missing values. Listwise deletion is typically advisable only if data is missing completely at random (MCAR), meaning the missingness is unrelated to anything in the data.
• Imputation: A more elaborate alternative is to fill in missing values with plausible estimates. The most intuitive method is mean imputation, which simply replaces missing values with the average of the observed values for that variable. A more sophisticated approach is multiple imputation (MI), in which missing values are predicted using regression models based on their relationship with other variables.
Both methods have their strengths and weaknesses, but multiple imputation tends to be more effective
in preserving the underlying structure of the data, thereby minimizing bias.
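A small R sketch on made-up data, showing listwise deletion and mean imputation (multiple imputation is normally done with a dedicated package such as mice and is only noted here):

df <- data.frame(age    = c(23, 31, NA, 45, 29),
                 income = c(40, NA, 52, 61, 48))

# Listwise deletion: drop every row that has any missing value
complete_rows <- na.omit(df)

# Mean imputation: replace NAs with the mean of the observed values
imputed <- df
imputed$age[is.na(imputed$age)]       <- mean(df$age, na.rm = TRUE)
imputed$income[is.na(imputed$income)] <- mean(df$income, na.rm = TRUE)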
2. Outlier Detection
Outliers are values that lie very far from the other values. They may reflect genuine differences in the data, or they may simply be errors that occurred during data collection. Outliers need to be treated carefully because they can mislead statistical inferences, particularly those based on summary measures like the mean or standard deviation. Here are several methods for detecting outliers:
• Z-Scores: A Z-score measures the distance of a data point from the mean, expressed in standard deviations. A data point whose Z-score exceeds 3 in absolute value is often considered an outlier in normally distributed data.
• Interquartile Range (IQR): This method identifies outliers as values that lie more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3), where the IQR is the range between Q1 and Q3.
o Example: If analysing housing prices, homes priced far above the upper quartile might be
identified as outliers.
Outliers can be handled by Winsorization, where extreme values are replaced with the nearest valid
values, or by log transformation, which reduces the impact of outliers in right-skewed data, such as
income.
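A short R sketch on simulated prices, showing both detection rules and a simple Winsorization at the IQR fences:

set.seed(1)
prices <- c(rnorm(100, mean = 300, sd = 50), 900, 1200)   # two extreme values added

# Z-score rule: flag points more than 3 standard deviations from the mean
z <- (prices - mean(prices)) / sd(prices)
z_outliers <- prices[abs(z) > 3]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q   <- quantile(prices, c(0.25, 0.75))
iqr <- q[2] - q[1]
iqr_outliers <- prices[prices < q[1] - 1.5 * iqr | prices > q[2] + 1.5 * iqr]

# Winsorization: cap extreme values at the IQR fences
winsorized <- pmin(pmax(prices, q[1] - 1.5 * iqr), q[2] + 1.5 * iqr)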
3. Data Pre-Processing
Before running any statistical models, it is important to ensure that the data is in a consistent format and
prepared for analysis. Pre-processing involves several steps:
• Standardization: Many statistical techniques, including linear regression, work best when the variables of interest are measured on comparable scales. Standardization rescales the data so that its mean is 0 and its standard deviation is 1. This is useful for handling variables measured in different units or magnitudes, such as height in centimeters and weight in kilograms.
• Normalization: This technique scales the data to a fixed range, often between 0 and 1.
Normalization is crucial when using algorithms sensitive to scale, such as k-nearest neighbours.
• Encoding Categorical Variables: Categorical data must be converted into numerical form before modelling, for example through one-hot encoding.
o Example: In a dataset of survey responses, one-hot encoding would create separate columns for each response category, assigning a 1 or 0 depending on the respondent’s answer.
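A minimal R sketch on a made-up data frame, illustrating standardization, min-max normalization, and one-hot encoding:

df <- data.frame(height_cm = c(160, 172, 181, 168),
                 weight_kg = c(55, 70, 85, 62),
                 response  = factor(c("yes", "no", "maybe", "yes")))

standardized <- scale(df[, c("height_cm", "weight_kg")])    # mean 0, sd 1 per column

min_max    <- function(x) (x - min(x)) / (max(x) - min(x))  # rescale to [0, 1]
normalized <- sapply(df[, c("height_cm", "weight_kg")], min_max)

one_hot <- model.matrix(~ response - 1, data = df)          # one 0/1 column per category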
A further step is to verify the statistical assumptions behind the intended analysis.
• Example: In time series data, the presence of autocorrelation might violate the i.i.d. assumption. Transformations like differencing can restore this property.
Another important check is normality: many statistical models assume that the data are normally distributed.
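Two quick normality checks in R, shown here on simulated data standing in for the variable being analysed: a Q-Q plot and the Shapiro-Wilk test.

set.seed(1)
x <- rnorm(200, mean = 170, sd = 10)   # stands in for the variable being checked

qqnorm(x)          # points near the line suggest approximate normality
qqline(x)
shapiro.test(x)    # large p-value: no evidence against normality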
Several software tools and programming languages can facilitate the data cleaning process:
•Python: Libraries like pandas, NumPy, and scikit-learn are effective for handling missing data,
detecting outliers, and transforming variables.
•R: Packages such as dplyr and tidyverse excel in data manipulation and cleaning.
•MATLAB: Offers strong built-in functions for data cleaning and pre-processing, particularly in
engineering and scientific fields.
By leveraging these tools, researchers can streamline the data cleaning process, making it more efficient
and accurate.