This document provides an overview of statistical sampling techniques and inferential statistics. It discusses the differences between populations and samples, and why random sampling is preferred to reduce bias. The main random sampling techniques covered are simple random sampling, stratified sampling, cluster sampling, and systematic sampling. It also provides examples of how to implement each technique. The document concludes with a recap of inferential statistics such as confidence intervals and hypothesis testing. It provides an example of calculating a 90% confidence interval using sample mean, standard deviation and normal distribution properties.
Weekly Course Objectives
● Why do we even sample?
● Random sampling.
● SRS
● SRS with and without replacement.
● Types of random sampling:
○ Simple random sample
○ Stratified sampling
○ Cluster sampling
○ Systematic sampling
● Recap Inferential Statistics
● Confidence intervals
Recap
● A population is every single observation that falls under the experiment you want to run.
● A sample is a subset of observations from your population.
● For example, the heights of the students in one class of a first year statistics course would be a sample. The population would be every first year statistics student (within some demographic).
● It’s very expensive to collect information - some companies even sell information from surveys (look at SurveyMonkey)!
● Getting adequate data for your study is critical, but sometimes data isn’t easy to come by (the weather behavior during a solar eclipse).
● We ideally want a representative sample, which is a subset of the population that has the same characteristics as the population.
Random Sampling
● If you chose samples yourself, that would lead to bias in your statistics.
○ For example, say you “felt” 10 specific people from the class would be representative of the average IQ within the class.
○ Because you chose the participants yourself, you are skewing the insights / statistics you get from that sample.
● To reduce bias, we ideally want a random sample.
● A random sample is a sample that is chosen randomly.
● The most common sampling technique is simple random sampling (SRS), in which every object in the population has an equal probability of being selected at each step of the sampling process!
● In other words, each sample of the same size has an equal chance of being selected.
SRS example
● Say you had 10 people. If you had to sample 3 people, the probability of selecting, say, persons 1, 3 and 7 is exactly the same as the probability of selecting persons 2, 3 and 9. Everyone is selected with the same probability at each selection step.
● P(1,3,7) = 3C3 / 10C3
● P(2,3,9) = 3C3 / 10C3
SRS with replacement
● True random sampling is done with replacement. That is, once a member is picked, that member goes back into the population and thus may be chosen more than once.
● Therefore, SRS with replacement is when an observation is selected and then placed back into the sampling pool, where it can be selected again!
● This means that you can select observations more than once.
● In practical applications, samples are usually selected without replacement. Meaning, once you sample an observation, you cannot select it again.
○ For example, once you are surveyed, you will typically not be asked again.
● For this reason, the probability of getting a specific sample is higher without replacement than with replacement.
SRS with replacement example
● Say you had 10 people and had to sample 3 of them.
● The probability that the first person you sample is Bob is 1/10.
○ This makes sense because there are 10 people, and Bob is one of them.
● This is where things get interesting… Given that Bob was picked first, the probability that the second person you sample is not Bob is:
○ SRS without replacement: 9/9 = 100%. This makes sense because the next person has to be different from Bob.
○ SRS with replacement: 9/10 = 90%. This makes sense because the next person could be Bob again.
● This example goes to show that the probability of getting a specific sample is higher without replacement only because with replacement the same observation can be drawn again, giving you more possible samples! More options = less probability under fixed conditions!
Cluster sampling
● Sometimes our sampling population is divided into groups. This sets us up for a natural sampling process where, instead of sampling from the overall population, you can just sample a subset of the groups to collect a sample.
● This process is called cluster sampling.
● To choose a cluster sample:
○ Divide the population into clusters (groups) and then randomly select some of the clusters.
● All the members from these clusters are in the cluster sample.
Cluster sampling example
● Say our population was students from a specific school. This school is naturally divided into departments.
● Instead of sampling, say, 200 students individually, we can sample the departments so it’s faster. So, if you randomly sample, say, four departments from your college population, the four departments make up the cluster sample and every student within the sampled departments is used!
● We can begin this process by dividing your college population by department.
● The departments are the clusters. Number each department, and then choose four different numbers using simple random sampling. All members of the four departments with those numbers are the cluster sample.
Stratified sampling
● There’s a chance that we might get a biased sample using cluster sampling if we end up with groups whose characteristics differ from those of the overall population.
● To counter this, why not just sample some amount from all groups!?
● Stratified sampling divides the population into strata/groups and uses SRS within each stratum to sample a set amount of observations, so that every stratum/group is represented in the sample. The amount can be proportional to each stratum’s size or, as in the example that follows, the same for every stratum.
Stratified sampling example
● Much like before, say our population was students from a specific school. This school is naturally divided into departments.
● We want to sample 200 students but want a representation of each department. So, if we had, say, 10 departments, we could sample 200/10 = 20 students per department!
● Say department 1 had 30 students and department 2 had 100. You would still select only 20 students from each, regardless of the fact that ⅔ of department 1 will be selected, and only ⅕ of department 2 will be selected.
20 students per department
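The department example can be sketched in Python to show how cluster and stratified sampling differ. This is only a minimal sketch under stated assumptions: the school, its department sizes, the student IDs and the helper names (`make_school`, `cluster_sample`, `stratified_sample`) are all made up for illustration and are not part of the course material.

```python
import random


def make_school(rng, n_departments=10):
    """Build a hypothetical school: department name -> list of student IDs."""
    school = {}
    for d in range(1, n_departments + 1):
        size = rng.randint(30, 100)  # made-up department sizes
        school[f"dept_{d}"] = [f"d{d}_s{i}" for i in range(size)]
    return school


def cluster_sample(school, n_clusters, rng):
    """Cluster sampling: pick whole departments at random, keep every student in them."""
    chosen = rng.sample(sorted(school), n_clusters)
    return chosen, [student for dept in chosen for student in school[dept]]


def stratified_sample(school, per_stratum, rng):
    """Stratified sampling: SRS without replacement of equal size from every department."""
    sample = []
    for dept in sorted(school):
        sample.extend(rng.sample(school[dept], per_stratum))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
school = make_school(rng)

chosen_departments, clustered = cluster_sample(school, n_clusters=4, rng=rng)
stratified = stratified_sample(school, per_stratum=20, rng=rng)

print(len(chosen_departments))  # 4 departments, all of their students are used
print(len(stratified))          # 10 departments x 20 students = 200
```

Here `random.sample` draws without replacement, which matches how surveys are usually run; `random.choices` could be swapped in to draw with replacement.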
Systematic (Convenience) Sampling
● Sometimes, our sampling observations will be collected in intervals. For this reason, instead of “randomly” picking out people ourselves, there is an unbiased, systematic method that makes collection easier.
● Systematic sampling randomly selects a starting point and takes every nth piece of data from a listing of the population.
● This ensures we collect data in a timely manner, without bias.
Systematic (Convenience) Sampling example
● For example, suppose you have to do a phone survey. Your phone book contains 20,000 residence listings.
● You must choose 400 names for the sample.
● Doing SRS implies you would have to select 400 numbers from the book at random. Software would make this easy if the numbers were recorded in some data set… but they probably aren’t.
● Instead, you could use a systematic sampling approach: since you want 400 numbers from the 20,000, you could pick a random starting point among the first 50 listings and then sample every 50th listing (20,000 / 400 = 50).
● This would eliminate the headache of recording the numbers down first and then doing an SRS approach.
Inferential Statistics
● Recall, inferential statistics are those which involve making inferences from our data using some probabilistic approach.
● For example:
○ Confidence intervals are intervals that would capture the true population parameter a stated percentage of the time if we were to collect a sample over and over again.
○ Hypothesis testing evaluates a claim made about the population using the data (average height is greater than 150 cm, average grades are less than 70%, etc.)
● Descriptive statistics describe what you see; inferential statistics attempt to infer based on your data.
Confidence Intervals
● Placing our trust in 1 value for an analysis is very risky.
● Say we wanted to forecast budgets and stated that we expect the average savings to be $10,000 next month. When next month’s financial results come in, we find out we actually only saved $8,000 - but the business trusted us so much that they had already allocated the $2,000 we did not save to some other product. Now we are in trouble!
● To avoid this issue without resorting to “I don’t know” or “maybe we expect somewhere around $10,000”, we instead give a range of what we can expect!
● So, a confidence interval is a range of values that is expected to contain the true population parameter with a stated level of confidence.
[Figure: a point estimate (the mean) compared with a confidence interval around the mean]
Confidence Intervals cont.
● We essentially want to know how likely we are to capture the true population mean within an interval.
○ We could have gotten a bad sample.
○ Getting more observations for our sample is expensive.
● The bigger the interval, the more confident we are!
● We need to use a bit of math to derive the values needed to calculate the confidence intervals.
Confidence Intervals Example
● Atinder has made his own chocolate called Atindies! Like Smarties, but with an A on them. He would like to know the average weight of a box of chocolates. Say he took a sample of 1000 boxes of chocolates and found the mean and standard deviation to be 45 grams and 3.8 grams, respectively. The data is normally distributed as well.
● What is the 90% confidence interval for the average weight of a box?
● The 90% confidence interval extends 1.645 standard deviations of the sample mean (standard errors) on either side of the mean!
○ We want the interval to be centered on the mean such that 90% of the distribution is captured.
○ How? Proof on next slide
Confidence Intervals Example: Proof
● We know that since the confidence interval is symmetric around the mean, we must trim off a total of α from the tails of the standard normal distribution.
○ 1-α = 90%
○ Therefore, here, α = 10%. This means 5% is trimmed off each side of the normal distribution:
● Therefore, P(Z<z) = 0.95 -> z = 1.645.
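The critical value can be checked numerically rather than read off a table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import NormalDist

confidence = 0.90
alpha = 1 - confidence  # total probability trimmed from both tails

# inv_cdf returns the z for which P(Z < z) equals the given probability,
# here 1 - alpha/2 = 0.95 for the standard normal distribution.
z = NormalDist().inv_cdf(1 - alpha / 2)

print(round(z, 3))  # 1.645
```

Changing `confidence` to 0.95 gives the familiar value of about 1.96.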
Confidence Intervals Example: Proof
● Remember, to standardize a result we subtract the mean and then divide by the standard deviation. So we get the following: Z = (X - µ) / σ
● However, the mean and standard deviation of the sample mean X̄ for a sample of size n are:
○ E(X̄) = µ
○ Var(X̄) = σ²/n
■ S.D. = σ / √n
● So we get as the confidence interval:
-zα/2 < Z < zα/2
-zα/2 < (x̄ - µ) / (σ / √n) < zα/2
x̄ - zα/2 (σ / √n) < µ < x̄ + zα/2 (σ / √n),
where zα/2 is the value with cumulative probability 1 - α/2, i.e. P(Z < zα/2) = 1 - α/2.
For proof of the expected value and variance of the sample mean, visit here!, and the CI proof here!
Confidence Intervals: Back to the example
x̄ - zα/2 (σ / √n) < µ < x̄ + zα/2 (σ / √n)
Using x̄ = 45, σ ≈ 3.8 (the sample standard deviation), n = 1000 and zα/2 = 1.645:
(45 - 1.645 * (3.8 / √1000), 45 + 1.645 * (3.8 / √1000))
(45 - 0.19767, 45 + 0.19767)
(44.802, 45.198)
Confidence Intervals: Question 1
● The average height of a random sample of 400 people from a city is 1.75 m. It is known that the heights of the population are random variables that follow a normal distribution with a variance of 0.16.
● Determine the 95% confidence interval for the average height of the population.
Confidence Intervals: Question 2
● You want to rent an unfurnished one-bedroom apartment in Durham, NC next year. The mean monthly rent for a random sample of 60 apartments advertised on Craigslist (a website that lists apartments for rent) is $1000. Assume a population standard deviation of $200.
● Construct a 95% confidence interval.
● How large a sample of one-bedroom apartments would be needed to estimate the population mean within plus or minus $50 with 90% confidence?
Homework 1
Homework 1 Answers
Let’s do some Python programming :)
Thank you
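As a first bit of Python, the chocolate box interval from the worked example can be reproduced with a short sketch. The helper name `mean_confidence_interval` is hypothetical, and the sketch assumes, as the example does, that the sample standard deviation of 3.8 g can stand in for σ and that the normal distribution applies.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, confidence):
    """Return (low, high) for the population mean using the normal distribution."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value z_(alpha/2)
    margin = z * sd / sqrt(n)                           # z times the standard error
    return mean - margin, mean + margin


# Worked example: 1000 boxes, mean 45 g, standard deviation 3.8 g, 90% confidence
low, high = mean_confidence_interval(mean=45, sd=3.8, n=1000, confidence=0.90)

print(round(low, 3), round(high, 3))  # 44.802 45.198
```

The same function can be pointed at Question 1 and Question 2 by changing the arguments; note that Question 1 gives a variance, so its square root is what goes in as `sd`.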