Week9 BAM

This document provides an overview of statistical sampling techniques and inferential statistics. It discusses the differences between populations and samples, and why random sampling is preferred to reduce bias. The main random sampling techniques covered are simple random sampling, stratified sampling, cluster sampling, and systematic sampling. It also provides examples of how to implement each technique. The document concludes with a recap of inferential statistics such as confidence intervals and hypothesis testing. It provides an example of calculating a 90% confidence interval using sample mean, standard deviation and normal distribution properties.

Uploaded by

rajaayyappan317
Copyright
© All Rights Reserved

BAM 1024

Introduction to Statistical Analysis


Weekly Course Objectives
● Why do we even sample?
● Random sampling.
● SRS
● SRS with and without replacement.
● Types of random sampling:
○ Simple random sample
○ Stratified sampling
○ Cluster sampling
○ Systematic sampling
● Recap Inferential Statistics
● Confidence intervals
Recap
● A population is every single observation that falls under the
experiment you want to run.
● A sample is a subset of observations from your population.
● For example, the heights of the students in one class of a
first-year statistics course would be a sample. The population
would be every first-year statistics student (within some
demographic).
● It’s very expensive to collect information - some companies
even sell information from surveys (look at SurveyMonkey)!
● Getting adequate data for your study is critical, but sometimes
data isn’t easy to come by (e.g. weather behavior during a solar
eclipse).
● We ideally want a representative sample, which is a subset of
the population that has the same characteristics as it.
Random Sampling
● If you chose samples yourself, that would lead to bias in your
statistics.
○ For example, say you “felt” 10 specific people from the class
would be representative for the average IQ within the class.
○ Because you chose participants yourself, you are skewing
the insights / statistics you get from that sample.
● To reduce bias, we ideally want a random sample.
● A random sample is a sample that is chosen randomly.
● The most common sampling technique is simple random
sampling (SRS), in which, at every draw in the sampling
process, each object still available has an equal probability
of being selected!
● In other words, each sample of the same size has an equal
chance of being selected.
SRS example
● Say you had 10 people. If you had to sample 3 people, the
probability of selecting, say person 1, 3 and 7, is the exact same
as sampling person 2, 3 and 9. Everyone is selected with the
same probability at each selection process.

● P(1,3,7) = 3C3 / 10C3 = 1/120
● P(2,3,9) = 3C3 / 10C3 = 1/120
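The count behind these probabilities can be checked with a short Python sketch. It assumes the 10 people are simply numbered 1 to 10; the variable names and the fixed seed are illustrative choices, not part of the slides.

```python
import random
from math import comb

population = list(range(1, 11))   # assumption: people are numbered 1..10
n_samples = comb(10, 3)           # number of distinct 3-person samples
p_specific = 1 / n_samples        # probability of any one specific sample

print(n_samples, round(p_specific, 4))   # 120 0.0083

# Draw one simple random sample of 3 people (without replacement).
rng = random.Random(42)           # fixed seed only so the run is repeatable
sample = rng.sample(population, 3)
print(sorted(sample))
```

Every one of the 120 possible samples, including {1, 3, 7} and {2, 3, 9}, has the same probability 1/120.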
SRS with replacement
● True random sampling is done with replacement. That is, once
a member is picked, that member goes back into the
population and thus may be chosen more than once.
● Therefore, SRS with replacement is when an observation is
selected but is placed back into the sampling pool to be
selected again!
● This means that you can select observations more than once.
● In practical applications, samples are selected without
replacement. Meaning, once you sample an observation, you
cannot select it again.
○ For example, once you are surveyed, you will typically not
be asked again.
● For this reason, the probability of getting a specific sample
is higher without replacement than with replacement.
SRS with replacement example
● Say you had 10 people and had to sample 3 of them.
● The probability that the first person you sample is Bob is 1/10.
○ This makes sense because there are 10 people, and Bob is
one of them.
● This is where things get interesting… Given that Bob was
picked first, the probability that the next person you sample
is not Bob is:
○ SRS without replacement: 9/9 = 100%. This makes sense
because the next person has to be different from Bob.
○ SRS with replacement: 9/10 = 90%. This makes sense
because the next person could be Bob again.
● This example goes to show that the probability of getting a
specific sample is higher without replacement, because with
replacement the same observations can be drawn again, giving
more possible outcomes! More options = less probability for
each one under fixed conditions!
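To make the comparison concrete, here is a minimal Python sketch that computes the probability of ending up with one specific group of three people (persons 1, 3 and 7, reused from the earlier SRS example) under both schemes. Treating the sample as an unordered group is an assumption of this sketch; the brute-force enumeration is only there as a cross-check.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

N, k = 10, 3
target = {1, 3, 7}   # the specific group we hope to draw

# Without replacement: one favourable group out of C(10, 3) equally likely groups.
p_without = Fraction(1, comb(N, k))

# With replacement: 10^3 equally likely ordered draws, of which only the
# 3! orderings of (1, 3, 7) produce the target group.
p_with = Fraction(factorial(k), N ** k)

# Cross-check the with-replacement count by enumerating every ordered draw.
hits = sum(1 for draw in product(range(1, N + 1), repeat=k) if set(draw) == target)

print(p_without, p_with, hits)   # 1/120 3/500 6
```

Since 1/120 is about 0.0083 and 3/500 is 0.006, the specific sample is more likely without replacement, as the slide states.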
Cluster sampling
● Sometimes our sampling population is divided into groups.
This sets us up for a natural sampling process where instead of
sampling from the overall population, you can just sample a
subset from the groups to collect a sample.
● This process is called cluster sampling.
● To choose a cluster sample:
○ Divide the population into clusters (groups) and then
randomly select some of the clusters.
● All the members from these clusters are in the cluster sample.
Cluster sampling example
● Say our population was students from a specific school. This
school is naturally divided into departments.
● Instead of sampling, say, 200 students individually, we can
sample departments so it’s faster. So, if you randomly sample,
say, four departments from your school population, the four
departments make up the cluster sample and every student
within the sampled departments is used!
● We can begin this process by dividing the school’s students by
department.
● The departments are the clusters. Number each department,
and then choose four different numbers using simple random
sampling. All members of the four departments with those
numbers are the cluster sample.
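The procedure above can be sketched in Python as follows. The school here is hypothetical: the 10 departments, their sizes and the student labels are made up purely for illustration.

```python
import random

rng = random.Random(0)   # fixed seed only so the run is repeatable

# Hypothetical school: 10 departments with made-up sizes (25, 30, ..., 70 students).
departments = {
    f"dept_{i}": [f"dept_{i}_student_{j}" for j in range(20 + 5 * i)]
    for i in range(1, 11)
}

# Step 1: choose 4 departments (clusters) by simple random sampling.
chosen = rng.sample(sorted(departments), 4)

# Step 2: every student in a chosen department joins the cluster sample.
cluster_sample = [student for dept in chosen for student in departments[dept]]

print(chosen, len(cluster_sample))
```

Note that the final sample size is not fixed in advance: it depends on which departments happen to be drawn.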
Stratified sampling
● There’s a chance that we might get a biased sample using cluster
sampling if the selected groups have characteristics that differ
from the overall population.
● To counter this, why not just sample some amount from all
groups!?
● Stratified sampling divides the population into strata (groups)
and draws an SRS from within each stratum, so that every
stratum is represented in the sample. The number drawn per
stratum can be proportional to the stratum’s size or, as in the
example that follows, the same for every stratum.
Stratified sampling example
● Much like before, say our population was students from a
specific school. This school is naturally divided into
departments.
● We want to sample 200 students but want a representation of
each department. So, if we had say 10 departments, we could
sample 200/10 = 20 students per department!
● Say department 1 had 30 students and department 2 had 100.
You would still only select 20 students from each, regardless of
the fact that ⅔ of department 1 will be selected, and only ⅕ of
department 2 will be selected.

20 students per department
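A minimal Python sketch of this example, drawing the same number of students from every department. Only the sizes of departments 1 and 2 (30 and 100) come from the slide; the other eight sizes and all student labels are hypothetical.

```python
import random

rng = random.Random(0)   # fixed seed only so the run is repeatable

# Department sizes: the first two are from the example, the rest are made up.
sizes = [30, 100, 45, 60, 80, 25, 50, 70, 90, 40]
departments = {
    f"dept_{i}": [f"dept_{i}_student_{j}" for j in range(size)]
    for i, size in enumerate(sizes, start=1)
}

per_stratum = 200 // len(departments)   # 200 / 10 = 20 students per department

# Draw an SRS of 20 students inside each department (stratum).
stratified_sample = {
    name: rng.sample(students, per_stratum)
    for name, students in departments.items()
}

total = sum(len(chosen) for chosen in stratified_sample.values())
frac_dept1 = per_stratum / len(departments["dept_1"])   # 20/30 = 2/3
frac_dept2 = per_stratum / len(departments["dept_2"])   # 20/100 = 1/5
print(total, round(frac_dept1, 3), round(frac_dept2, 3))   # 200 0.667 0.2
```

Unlike cluster sampling, every department contributes to the sample, and the total sample size is known before any drawing happens.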


Systematic Sampling
● Sometimes, our sampling observations come in an ordered
list or arrive at regular intervals. In that case, instead of
“randomly” picking out people ourselves, there is a systematic
method that still avoids our own selection bias.
● Systematic sampling randomly selects a starting point and
takes every nth piece of data from a listing of the population.
● This lets us collect data in a timely manner, without
introducing bias through our own choices.
Systematic Sampling example
● For example, suppose you have to do a phone survey. Your
phone book contains 20,000 residence listings.
● You must choose 400 names for the sample.
● Doing SRS implies you would have to select 400 numbers from
the book at random. Software would make this easy if the
numbers were recorded in some data set… but they probably
aren’t.
● Instead, you could use a systematic sampling approach: since
you want 400 numbers from the 20,000, the interval is
20,000 / 400 = 50, so you pick a random starting point among
the first 50 listings and then take every 50th listing after it.
● This would eliminate the headache of recording the numbers
down first and then doing an SRS approach.
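If the listings were available as a Python list, the phone book example could be sketched like this. The listing labels are placeholders, and the random starting point follows the definition of systematic sampling given earlier.

```python
import random

rng = random.Random(0)   # fixed seed only so the run is repeatable

listings = [f"listing_{i}" for i in range(20_000)]   # placeholder phone book
n_wanted = 400

k = len(listings) // n_wanted   # sampling interval: 20,000 / 400 = 50
start = rng.randrange(k)        # random starting point among the first 50 listings

# Take the starting listing and every 50th listing after it.
systematic_sample = listings[start::k]

print(k, start, len(systematic_sample))
```

Whatever starting point is drawn, the slice always yields exactly 400 listings, because 20,000 is an exact multiple of 50.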
Inferential Statistics
● Recall, inferential statistics are those which involve making
inferences on our data using some probabilistic approach.
● For example:
○ Confidence intervals are ranges built from a sample so that,
if we were to collect a sample over and over again, a stated
percentage of them (e.g. 95%) would capture the true
population parameter.
○ Hypothesis testing measures a claim made about the
population using the data (average height is greater than
150cm, average grades are less than 70%, etc.)
● Descriptive statistics describe what you see; inferential
statistics attempt to infer based on your data.
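The "sample over and over again" idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a made-up mean of 150 and standard deviation of 10, each sample has 50 observations, and the interval used is the known-standard-deviation interval (sample mean ± z·σ/√n) developed in the confidence interval slides.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(1)   # fixed seed only so the run is repeatable

true_mean, sigma, n = 150.0, 10.0, 50   # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)         # z-value for a 95% interval
half_width = z * sigma / n ** 0.5

trials = 2000
covered = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    x_bar = mean(sample)
    # Does this sample's interval contain the true population mean?
    if x_bar - half_width < true_mean < x_bar + half_width:
        covered += 1

coverage = covered / trials
print(coverage)   # should land close to 0.95
```

Each individual interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across repeated samples.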
Confidence Intervals
● Placing our trust in 1 value for an analysis is very risky.
● Say we wanted to forecast budgets and stated that we expect
the average savings to be $10,000 next month. When next
month’s financial results come in, we find out we actually only
saved $8,000 - but the business trusted us so much that they
allocated the $2,000 not saved to some other product. Now we
are in trouble!
● To avoid this issue without saying “I don’t know” or “maybe we
expect somewhere around $10,000”, we instead give a range of
what we can expect!
● So, a confidence interval is a range of values that, at a
stated confidence level, is expected to contain the population
parameter we are estimating.

Point estimate: mean. Confidence interval: mean ± margin of error.


Confidence Intervals cont.

● We essentially want to find how likely we are to capture the
true population mean within an interval, because our single
estimate may be off:
○ We could have gotten a bad sample.
○ Getting more observations in our sample is expensive.
● For a given sample, the wider the interval, the more confident
we are that it captures the true mean!
● We need to use a bit of math to derive the values needed to
calculate the confidence intervals.
Confidence Intervals Example
● Atinder has made his own chocolate called Atindies! Like
Smarties, but with an A on them. He would like to know the
average weight a box of chocolates can have. Say he took a
sample of 1000 boxes of chocolates and found the mean and
standard deviation to be 45 grams and 3.8 grams, respectively.
The data is distributed normally as well.
● What is the 90% confidence interval for the average weight of
the box?
● The 90% confidence interval is captured
when you are within 1.645 standard errors
(σ / √(n)) of the mean!
○ We want the interval to be centred on the
mean s.t. it covers the middle 90% of the
distribution of the sample mean.
○ How? Proof on next slide
Confidence Intervals Example: Proof
● We know that since the confidence interval is symmetric
around the mean, we must trim off a total of α from the
tails of the standard normal distribution.
○ 1-α = 90%
○ Therefore, here, α = 10%. This means α/2 = 5% is trimmed
off each side of the normal distribution.
● Therefore, P(Z<z) = 0.95 -> z = 1.645.
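The z-value can also be looked up in code rather than in a table. A minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

confidence = 0.90
alpha = 1 - confidence   # total probability trimmed off the two tails

# z such that P(Z < z) = 1 - alpha/2 = 0.95 for a standard normal Z.
z = NormalDist().inv_cdf(1 - alpha / 2)

print(round(z, 3))   # 1.645
```

Changing `confidence` to 0.95 gives the familiar value of about 1.96.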


Confidence Intervals Example: Proof
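● Remember, to standardize your results we must subtract the
mean, and then divide by the standard deviation. So we get the
following:
$Z = \dfrac{X - \mu}{\sigma}$
● However, the sample mean $\bar{X}$ of a sample of size n has
its own mean and variance:
○ $E(\bar{X}) = \mu$
○ $\mathrm{Var}(\bar{X}) = \sigma^2 / n$
■ $\mathrm{S.D.}(\bar{X}) = \sigma / \sqrt{n}$
● So, standardizing the observed sample mean $\bar{x}$, we get as
the confidence interval:
$-z_{\alpha/2} < Z < z_{\alpha/2}$
$-z_{\alpha/2} < \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} < z_{\alpha/2}$
$\bar{x} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$,
where $z_{\alpha/2}$ is the value with $P(Z < z_{\alpha/2}) = 1 - \alpha/2$.
For proof of the expected value and variance of the sample mean, visit here, and for the CI proof, here!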
Confidence Intervals: Back to the example
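$\bar{x} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
$\left(45 - 1.645 \cdot \dfrac{3.8}{\sqrt{1000}},\ 45 + 1.645 \cdot \dfrac{3.8}{\sqrt{1000}}\right)$
$(45 - 0.19767,\ 45 + 0.19767)$
$(44.802,\ 45.198)$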
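The same calculation as a short Python sketch. The helper name `z_confidence_interval` is hypothetical, and, as in the slides, the sample standard deviation of 3.8 g is used in place of σ.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Return (low, high) for x_bar ± z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Atindies example: n = 1000 boxes, mean 45 g, standard deviation 3.8 g.
low, high = z_confidence_interval(45, 3.8, 1000, 0.90)
print(round(low, 3), round(high, 3))   # 44.802 45.198
```

The function can be reused for other confidence levels or sample sizes by changing its arguments.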
Confidence Intervals: Question 1
● The average height of a random sample of 400 people from a
city is 1.75 m. It is known that the heights in the population
follow a normal distribution with a variance of 0.16 m².

● Determine the 95% confidence interval for the average
height of the population.
Confidence Intervals: Question 2
● You want to rent an unfurnished one-bedroom apartment in
Durham, NC next year. The mean monthly rent for a random
sample of 60 apartments advertised on Craigslist (a website
that lists apartments for rent) is $1000. Assume a population
standard deviation of $200.

● Construct a 95% confidence interval.

● How large a sample of one-bedroom apartments above would
be needed to estimate the population mean within plus or
minus $50 with 90% confidence?
Homework 1
Homework 1 Answers
Let’s do some Python programming :)
Thank you
