
1. Introduction

Sampling is a process through which we can analyse or study a small group drawn from a larger group in order to derive conclusions that are likely to apply to the entire larger group. It is used by researchers to collect data from a small group rather than studying the full population, which can be impracticable or impossible.

Example:

i. Suppose that, as a researcher, you want to study the association between the role model of parents and undesirable behaviour of children in a home for street children. For this, you have to select a few representative cases from the home, and the process of selection requires thorough knowledge of the various sampling techniques.

ii. Consider a company that wants to understand consumer preferences for a new product. Instead of surveying all potential customers, it selects a random sample of 1,000 people from its target market and analyses this sample to draw the inferences it needs to make an informed decision.

Advantages of Sampling

● It requires less time and money than studying the entire population.
● It makes research feasible when the population is very large or difficult to access.
● It allows more detailed and careful data collection on each selected unit.
● When carried out properly, it yields results that are accurate enough for sound conclusions.
2. Role of Sampling Theory in Statistical Inference
Sampling theory plays a crucial role in statistical research by providing the framework to draw conclusions about a population from a subset of observations (called a sample). In practice, it is usually impossible or impractical to collect data from an entire population, which makes sampling essential for analysis and decision making.

Population - the entire group of individuals or events that a statistician wants to study and draw conclusions about.

Sample - a subset of the population that is used to represent the characteristics of the population.

Types of Sampling:

Random sampling - every entity has an equal chance of being selected in the sample.

Stratified sampling - the population is divided into strata and a random sample is taken from each stratum.

Systematic sampling - individuals are selected at regular intervals from a list.

Cluster sampling - the population is divided into clusters and whole clusters are selected at random.

Central Limit Theorem (CLT):

The Central Limit Theorem states that the distribution of the sample mean is approximately normal when the sample size is large enough, regardless of the shape of the population distribution. The CLT therefore allows statisticians to make inferences using the properties of the normal distribution without knowing the underlying distribution of the population.
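The following R sketch illustrates the CLT by repeatedly sampling from a skewed (exponential) population; the population, sample size, and number of repetitions are illustrative assumptions, not values from the assignment.

```r
# Illustrative sketch: sampling distribution of the mean from a non-normal population
set.seed(42)

n_samples   <- 5000   # number of repeated samples (assumed)
sample_size <- 50     # size of each sample (assumed)

# Skewed "population": exponential distribution with rate 0.2 (population mean = 5)
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 0.2)))

# The histogram of sample means is approximately normal and centred near 5,
# even though the underlying population is strongly skewed
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean", xlab = "Sample mean")
```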

Role in Statistical Inference:

1. Estimation of Population Parameters: the sample mean and sample variance are used to estimate the population mean and population variance.

2. Hypothesis Testing: sample statistics, built from unbiased estimators, are used as test statistics when deciding whether to reject a null hypothesis in favour of an alternative.

3. Confidence Intervals: sample statistics are also used to construct interval estimates that quantify the uncertainty attached to an estimated population parameter.

Example: Estimating the Average Height of Students

Suppose we want to estimate the average height of students at BITS. Measuring everyone's height is not feasible, so we select a random sample of 200 students and calculate their average height, which comes out to be 170 cm.

Using sampling theory:

1. The point estimate of the population mean is 170 cm. (Point estimation)

2. A 95% confidence interval is (168 cm, 172 cm). (The CLT justifies treating the sampling distribution of the mean as approximately normal.)
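A minimal R sketch of how such an interval could be computed is shown below; the heights are simulated stand-ins rather than real BITS data, and the mean and standard deviation used to generate them are assumptions chosen to match the example.

```r
# Simulated stand-in for the 200 measured heights (values are assumptions, not real data)
set.seed(1)
heights <- rnorm(200, mean = 170, sd = 7)

n    <- length(heights)
xbar <- mean(heights)             # point estimate of the population mean
se   <- sd(heights) / sqrt(n)     # standard error of the sample mean

# 95% confidence interval using the normal approximation justified by the CLT
ci <- xbar + c(-1, 1) * qnorm(0.975) * se
round(ci, 1)
```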

3. Probability Sampling Techniques


Probability sampling refers to techniques in which every member of the population has a known, non-zero chance of being selected, so that a small group drawn from a larger population can be used to make statistically justified statements about the overall population.

There are four main methods of probability sampling:

1) Simple Random Sampling

● A probability sampling technique in which we randomly select a subset of participants from a population, with every participant equally likely to be chosen.

● Example: If we want to survey 50 students out of the 5,000 students living in the hostels at BITS, we assign a unique number to each student and use a random number generator to pick 50 students.
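A minimal R sketch of this selection is given below, assuming the students are identified simply by the numbers 1 to 5,000.

```r
# Sketch: simple random sampling of 50 out of 5000 hostel students
set.seed(123)
student_ids <- 1:5000                     # unique number assigned to each student
chosen <- sample(student_ids, size = 50)  # without replacement, equal probability
chosen
```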

2) Systematic Sampling

● A probability sampling technique in which sample members are selected from a larger population at a fixed, periodic interval after a random starting point.

● Example: To select 50 students out of 4,000 participants attending OASIS, the sampling interval is 4,000 / 50 = 80. We randomly choose a starting point between 1 and 80, say the 7th student, and then pick every 80th student (7, 87, 167, and so on) until we reach a total of 50.
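A short R sketch of this procedure follows; it draws its own random starting point rather than fixing it at 7.

```r
# Sketch: systematic sampling of 50 students from 4000 OASIS participants
set.seed(7)
N <- 4000
n <- 50
k <- N %/% n                # sampling interval (80)
start <- sample(1:k, 1)     # random starting point between 1 and k
selected <- seq(from = start, by = k, length.out = n)
selected
```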

3) Stratified Random Sampling

● A probability sampling technique in which the total population is divided into homogeneous groups (strata) and a random sample is drawn from each stratum.

● It is of two types:

● Proportionate - the sample size for each group is proportional to its size in the population.

● Disproportionate - the sample size can vary across groups.

● Example: BITS has both undergraduate and postgraduate students across multiple disciplines. If we want all programmes to be proportionally represented in our survey, we can divide the student population into strata by department and then use random sampling within each stratum.
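A minimal R sketch of proportionate stratified sampling is shown below; the department names and student counts are illustrative assumptions.

```r
# Sketch: proportionate stratified sampling by department (toy data, names are illustrative)
set.seed(99)
students <- data.frame(
  id   = 1:1000,
  dept = sample(c("CS", "EEE", "Mech", "Chem"), 1000, replace = TRUE)
)

frac <- 0.05  # sample 5% from every stratum (assumed sampling fraction)
strat_sample <- do.call(rbind, lapply(split(students, students$dept), function(stratum) {
  stratum[sample(nrow(stratum), size = ceiling(frac * nrow(stratum))), ]
}))

table(strat_sample$dept)  # counts remain proportional to stratum sizes
```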
4) Cluster Sampling

● A probability sampling technique in which the population is divided into clusters, each intended to be broadly representative of the population, and whole clusters are randomly selected into the sample.

● It is of three types:

● Single-stage - entire clusters are surveyed.

● Two-stage - a random sample is taken from within the selected clusters.

● Multistage - further stages of sampling are carried out within the selected clusters.

● Example: BITS has multiple hostels across its campus. Instead of surveying individual students from all hostels, we could randomly select three hostels and survey all students from those hostels.
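A minimal R sketch of single-stage cluster sampling follows; the hostel names and the assignment of students to hostels are illustrative assumptions.

```r
# Sketch: single-stage cluster sampling of hostels (toy data, names are illustrative)
set.seed(21)
students <- data.frame(
  id     = 1:1200,
  hostel = sample(paste0("Hostel_", LETTERS[1:12]), 1200, replace = TRUE)
)

chosen_hostels <- sample(unique(students$hostel), size = 3)        # randomly pick 3 clusters
cluster_sample <- students[students$hostel %in% chosen_hostels, ]  # survey everyone in them

table(cluster_sample$hostel)
```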

Pros and cons of the probability sampling techniques:

Simple random sampling
● Pros: easy to implement; unbiased selection.
● Cons: resource-intensive for large populations.

Systematic sampling
● Pros: easier and faster to implement; provides a good spread across the entire population.
● Cons: if the population has an unseen cyclic pattern, it can give biased results.

Stratified random sampling
● Pros: reduces variability within strata, improving precision; good for subgroup analysis.
● Cons: not useful for homogeneous populations.

Cluster sampling
● Pros: more practical and less expensive when dealing with large populations.
● Cons: less precision; important subgroups can be underrepresented or overlooked.

Non-Probability Sampling Techniques


The non-probability sampling method is a technique in which the researcher/statistician selects the
sample based on subjective judgment rather than the random selection generally seen in probability
sampling. These are methods of selecting samples where not all members of a population have a known
or equal chance of being selected.

There are several advantages to using non-probability sampling techniques. They are often more cost-
effective and quicker to implement than probability-based methods. These techniques are particularly
useful in exploratory research or when the target population is difficult to access.

There are a few limitations to non-probability sampling techniques. First, there is a higher risk of
sampling bias since not all members of the population have an equal chance of being selected.
Additionally, the results may not be generalizable to the entire population due to the lack of randomness
in the selection process.

The main types of non-probability sampling methods are:

1. Convenience Sampling
This is the most basic form of non-probability sampling, in which the sample is drawn from the part of the population that is most conveniently available to the researcher. Because the units are chosen purely for ease of access, the sample is unlikely to represent the entire population.

Even though this approach is quick and convenient, there is a high chance that the results will be biased.

● Example: A student researcher studying the eating habits of college students may survey
classmates or students at the campus library simply because they are easy to access and
willing to participate.

2. Purposive/Judgmental Sampling
Purposive sampling is based on the presumption that, with good judgment, the researcher can select sample units that satisfy the requirements of the study.

A common strategy of this sampling technique is to select cases that are judged to be typical of the
population in which one is interested, assuming that errors of judgment in the selection will tend to
counterbalance each other.

● Example: A student researcher examining the study habits of top-performing students might select participants based on their CGPA, interviewing only students with a CGPA greater than 9.

3. Quota Sampling
In the quota sampling method, the researcher builds a sample in which individuals with specific traits or qualities are represented in fixed proportions (quotas). Within each quota, the researcher chooses the subsets that are expected to yield data that generalize to the entire population.

This gives better samples that represent diverse units. However, the selection within each group is non-random and based on convenience.

● Example: A student researcher studying the effect of social media on students’ academic
performance might set a quota to include 50 undergraduate students and 50 higher degree
students.
4. Snowball Sampling
Snowball sampling, also known as referral or respondent-driven sampling, is invaluable for accessing
hard-to-reach or elusive populations such as homeless people, teenagers, drug users, or other hidden
populations. Initial participants are recruited based on some criteria, who then refer others within their
networks, creating a snowball effect. The researchers only have control over the initial respondents who
in turn bring in more such respondents.

● Example: A student researcher investigating the experiences of international students may start by interviewing a few known international students and asking them to refer others.

4. Standard Guidelines for Sample Data Collection


When conducting research using sampling methods, it is crucial to follow established ethical and technical guidelines to ensure that the data collected are valid and that the collection process is ethical. The following are some key principles:

1. Ethical Guidelines
Ethical considerations ensure that participants are respected, their rights are protected, and the research
remains credible.

1.1. Informed Consent


Participants must be fully informed about the purpose of the research, what their
participation entails, and any potential risks or benefits. They should give their voluntary
consent to participate, with the option to withdraw at any point without facing any
consequences.
1.2. Confidentiality & Anonymity
It is the researcher’s responsibility to protect participants’ privacy. Any personal data
collected should be anonymized to ensure that individuals cannot be identified. Moreover,
researchers must ensure that the collected data is securely stored to prevent unauthorized
access.

1.3. Avoiding Harm


Researchers must avoid causing any psychological, emotional, or physical harm to
participants. If sensitive topics are discussed, participants must be given the choice to refuse
or withdraw from such sections of the survey.

1.4. Right to Information


Participants have the right to know how their data will be used and whom it will be shared
with. They should also be informed of the outcomes, which ensures transparency in the
research process.

2. Technical Guidelines
Technical guidelines ensure that the sampling process is scientifically sound and yields reliable, valid
data for statistical inference.

2.1. Development of the Sampling Frame


A sampling frame is a list or database from which the sample will be drawn. It must be
complete and representative of the target population to avoid selection bias.

2.2. Sampling Design


The sampling design refers to the method used to select the sample from the population. Researchers should use probability sampling methods (e.g., simple random sampling, stratified sampling) so that each member of the population has a known, non-zero chance of being selected. This reduces bias and increases the generalizability of the results.

2.3. Sample Size Determination


Determining the appropriate sample size is critical for ensuring reliable and valid results. If
the sample size is too small, the results may not be representative of the population, leading
to inaccurate inferences.
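As a concrete illustration, a commonly used formula for the sample size needed to estimate a mean with margin of error E is n = (z * sigma / E)^2; the following R sketch applies it with assumed values of sigma and E, which are not taken from the assignment.

```r
# Sketch: sample size for estimating a mean with a desired margin of error
# n = (z * sigma / E)^2 ; sigma and E are assumed values for illustration
sigma <- 7            # assumed population standard deviation
E     <- 1            # desired margin of error
z     <- qnorm(0.975) # z-value for 95% confidence

n_required <- ceiling((z * sigma / E)^2)
n_required  # about 189 under these assumptions
```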

2.4. Pre-Testing the Data Collection Instrument


Before full-scale data collection, it is essential to test the survey or measurement tools on a
smaller sample to identify any issues with the questions or data collection process. This
process, known as pre-testing, helps improve the reliability and validity of the survey.

2.5. Data Quality Control


During data collection, it is important to ensure data quality by monitoring the collection
process for accuracy and consistency. This might involve training data collectors, setting
clear instructions for survey administration, and using technology (e.g., digital tools) to
minimize errors.
5. Ensuring the I.I.D. Property While Sampling
Introduction:

When collecting data from a population, one of the key assumptions we make in statistical analysis is that the data are independent and identically distributed (i.i.d.). This assumption helps ensure that the conclusions we draw from the sample generalize to the whole population. In simple terms, independent means that one sample does not affect another, and identically distributed means that all the samples follow the same distribution.

What is the I.I.D. Property?

The i.i.d. property is made up of:

• Independent: This means that the selection of one sample does not influence the selection of
another. For example, flipping a coin multiple times is independent because each flip doesn’t affect
the next.

• Identically Distributed: All samples must come from the same probability distribution. For
instance, if we measure the height of students in a class, all heights follow the same distribution (e.g.,
normal distribution).

Example (R programming):

Example 1: I.I.D. Behavior in Sampling

Random Sampling: We draw two samples from the same normal distribution. Both samples are
independent of each other, and they follow the same probability distribution (mean = 50, sd = 10).

Results: The histograms overlap, showing that both samples follow the same distribution, and the
scatter plot does not show any clear pattern, indicating independence.
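The assignment's original R code is not reproduced here; the following is a minimal sketch consistent with the description above, with the sample size (500) chosen arbitrarily.

```r
# Sketch of Example 1: two independent samples from the same distribution (i.i.d.)
set.seed(10)
sample1 <- rnorm(500, mean = 50, sd = 10)
sample2 <- rnorm(500, mean = 50, sd = 10)

# Overlapping histograms: both samples follow the same distribution
hist(sample1, breaks = 30, col = rgb(0, 0, 1, 0.4),
     main = "I.I.D. samples", xlab = "Value")
hist(sample2, breaks = 30, col = rgb(1, 0, 0, 0.4), add = TRUE)

# Scatter plot: no visible pattern, consistent with independence
plot(sample1, sample2, main = "Sample 1 vs Sample 2")
```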

Example 2: Non-I.I.D. Behavior in Sampling

In this case, we draw samples from two different distributions, where the second sample has a different mean and standard deviation (mean = 60, sd = 15). This shows that the samples are not identically distributed. In addition, a third sample was generated as a function of sample 1, making it dependent on sample 1.

Results: The scatter plot shows a clear relationship between sample 1 and sample 3, indicating that they are not independent, and the histograms of samples 1 and 2 do not overlap, showing that they are not identically distributed.
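Again, the original code is not shown in this extract; the sketch below reconstructs the described behaviour, with the exact way sample 3 depends on sample 1 being an assumption.

```r
# Sketch of Example 2: non-i.i.d. behaviour
set.seed(11)
sample1 <- rnorm(500, mean = 50, sd = 10)
sample2 <- rnorm(500, mean = 60, sd = 15)          # different distribution: not identically distributed
sample3 <- sample1 + rnorm(500, mean = 0, sd = 2)  # built from sample1: not independent (assumed form)

# Non-overlapping histograms show the distributions differ
hist(sample1, breaks = 30, col = rgb(0, 0, 1, 0.4),
     main = "Non-identically distributed samples", xlab = "Value")
hist(sample2, breaks = 30, col = rgb(1, 0, 0, 0.4), add = TRUE)

# Clear linear pattern between sample1 and sample3 indicates dependence
plot(sample1, sample3, main = "Sample 1 vs Sample 3")
```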

Ensuring Independence and identically distributed property

To ensure independence in sampling, each sample must be chosen without affecting the others.
Random sampling helps by giving everyone an equal chance of being selected. In small groups, if you
don’t return the samples after selection (sampling without replacement), future picks are affected, so
sampling with replacement is better to keep the choices independent. Bias in methods like quota
sampling and others can also ruin independence by making some samples more likely to be chosen. If
the samples aren’t independent, the results might not be reliable, leading to incorrect conclusions.
For identically distributed samples, all samples should follow the same pattern or distribution.
Stratified sampling, where the population is divided into groups and samples are taken from each,
helps achieve this. Removing outliers also ensures that the data is consistent. If the samples aren’t
identically distributed, the analysis might be biased, and the conclusions may not represent the true
population. Before sampling, make sure the population has consistent characteristics. After sampling,
use graphs like histograms or control charts to verify that the samples are distributed the same way.

6. The Process of Cleaning Data


Data cleaning is the most important preprocessing step in any statistical analysis, especially of sampled data. It involves checking the dataset and correcting errors, inconsistencies, and biases before any meaningful conclusions are drawn from it. This process, commonly referred to as data pre-processing, includes locating missing values, detecting outliers, and standardizing the data so that it is ready for robust analysis.

1. Handling Missing Data


Missing data arise very frequently in real datasets, from sources such as nonresponse in surveys, malfunctioning sensors, or corrupted records. Missing data may introduce bias if not properly handled, and they reduce the statistical power of an analysis if ignored or mishandled. Several strategies exist to deal with missing data:

• Listwise Deletion: This method eliminates any observation that contains missing values. Although intuitive, it can reduce the sample size dramatically if many variables have missing values. Listwise deletion is typically only advisable when data are missing completely at random (MCAR), meaning the missingness is unrelated to the data.

• Imputation: A more elaborate approach is to fill in missing values with plausible estimates. The most intuitive method is mean imputation, which simply replaces missing values with the average of the observed values for that variable. A more sophisticated approach is multiple imputation (MI), in which missing values are predicted using regression models based on their relationship with other variables.

o Example: In a dataset of employee ages, missing values might be imputed by examining relationships between age and years of experience.

Both methods have their strengths and weaknesses, but multiple imputation tends to be more effective
in preserving the underlying structure of the data, thereby minimizing bias.
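A minimal R sketch of listwise deletion and mean imputation on a toy data frame is given below; the values are illustrative, and in practice multiple imputation is usually done with a dedicated package such as mice.

```r
# Sketch: listwise deletion vs. mean imputation (toy data, values are illustrative)
employees <- data.frame(
  age              = c(25, 32, NA, 41, 29, NA),
  years_experience = c(2, 8, 5, 15, 4, 11)
)

# Listwise deletion: drop every row that contains a missing value
complete_cases <- na.omit(employees)

# Mean imputation: replace missing ages with the mean of the observed ages
imputed <- employees
imputed$age[is.na(imputed$age)] <- mean(employees$age, na.rm = TRUE)
imputed
```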

2. Outlier Detection
Outliers are values that lie very far from the other values. They may result from genuine variability in the data, or they may simply be errors that occurred during data collection. Outliers need to be treated carefully because they can mislead statistical inferences, particularly those based on summary measures such as the mean or standard deviation. Several methods exist for detecting outliers:

• Z-Scores: A Z-score measures the distance of a data point from the mean, expressed in standard deviations. A Z-score with absolute value greater than 3 is often considered an outlier in normally distributed data.

• Interquartile Range (IQR): This method identifies outliers as values that lie beyond 1.5 times
the IQR, which is the range between the first and third quartiles (Q1 and Q3).
o Example: If analysing housing prices, homes priced far above the upper quartile might be
identified as outliers.

Outliers can be handled by Winsorization, where extreme values are replaced with the nearest valid
values, or by log transformation, which reduces the impact of outliers in right-skewed data, such as
income.
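The sketch below applies the Z-score and IQR rules, plus a simple Winsorization, to illustrative data containing one planted extreme value.

```r
# Sketch: flagging outliers with Z-scores and the IQR rule (toy data)
set.seed(5)
prices <- c(rnorm(100, mean = 250, sd = 30), 1200)  # illustrative prices with one extreme value

# Z-score rule: |z| > 3 flags an outlier (assumes roughly normal data)
z <- (prices - mean(prices)) / sd(prices)
prices[abs(z) > 3]

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q   <- quantile(prices, c(0.25, 0.75))
iqr <- q[2] - q[1]
prices[prices < q[1] - 1.5 * iqr | prices > q[2] + 1.5 * iqr]

# Simple Winsorization: cap values at the 5th and 95th percentiles
capped <- pmin(pmax(prices, quantile(prices, 0.05)), quantile(prices, 0.95))

# Log transformation to reduce the impact of right-skewed extremes
log_prices <- log(prices)
```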

3. Data Pre-Processing
Before running any statistical models, it is important to ensure that the data is in a consistent format and
prepared for analysis. Pre-processing involves several steps:

• Standardization: Many statistical techniques, including linear regression, work best when the variables of interest are measured on comparable scales. During standardization, data are rescaled so that the mean becomes 0 and the standard deviation becomes 1. This is useful for handling variables measured in different units or magnitudes, such as height in centimetres and weight in kilograms.

• Normalization: This technique scales the data to a fixed range, often between 0 and 1.
Normalization is crucial when using algorithms sensitive to scale, such as k-nearest neighbours.

• Encoding Categorical Data: Categorical variables, such as gender or region, need to be converted into a numerical format for use in most statistical models. One-hot encoding is commonly used, where each category is represented by a binary variable.

o Example: In a dataset of survey responses, one-hot encoding would create separate columns for
each response category, assigning a 1 or 0 depending on the respondent’s answer.
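A short R sketch of these three pre-processing steps on a toy data frame follows; the column names and values are illustrative assumptions.

```r
# Sketch: standardization, min-max normalization, and one-hot encoding (toy data)
df <- data.frame(
  height_cm = c(160, 172, 181, 168),
  weight_kg = c(55, 70, 85, 62),
  region    = factor(c("North", "South", "East", "North"))
)

# Standardization: rescale to mean 0 and standard deviation 1
df$height_std <- scale(df$height_cm)[, 1]

# Min-max normalization: rescale to the range [0, 1]
df$weight_norm <- (df$weight_kg - min(df$weight_kg)) / (max(df$weight_kg) - min(df$weight_kg))

# One-hot encoding of the categorical variable (one binary column per category)
one_hot <- model.matrix(~ region - 1, data = df)
cbind(df, one_hot)
```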

4. Ensuring Data Validity


After cleaning, it's essential to verify that the cleaned dataset meets the assumptions of the analysis. A
critical assumption in most statistical analyses is that the data are independent and identically distributed
(i.i.d.). This means that each observation is independent of the others and drawn from the same
distribution. Violations of this assumption can lead to biased estimates.

•Example: In time series data, the presence of autocorrelation might violate the i.i.d. assumption.
Transformations like differencing can restore this property.

Another important check is normality. Many statistical models assume that the data are normally
distributed.
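The following R sketch shows a few quick, standard checks of these assumptions on a generic numeric variable; the data here are simulated purely for illustration.

```r
# Sketch: quick checks of the i.i.d. and normality assumptions
set.seed(3)
x <- rnorm(200)      # stand-in for a cleaned numeric variable

acf(x)               # autocorrelation plot: spikes outside the bands suggest dependence
shapiro.test(x)      # Shapiro-Wilk test: a small p-value suggests non-normality
qqnorm(x); qqline(x) # Q-Q plot: points close to the line support approximate normality
```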

5. Tools for Data Cleaning

Several software tools and programming languages can facilitate the data cleaning process:

•Python: Libraries like pandas, NumPy, and scikit-learn are effective for handling missing data,
detecting outliers, and transforming variables.

•R: Packages such as dplyr and tidyverse excel in data manipulation and cleaning.

•MATLAB: Offers strong built-in functions for data cleaning and pre-processing, particularly in
engineering and scientific fields.

By leveraging these tools, researchers can streamline the data cleaning process, making it more efficient
and accurate.
