Confidence Intervals - Stat 20 Berkeley

Confidence Intervals

Quantifying the sampling variability of a statistic.

The process of generalizing from a statistic of a sample to a


parameter of a population is known as statistical inference. The
parameter of interest could be a mean, median, proportion,
correlation coefficient, the coefficient of a linear model . . . the
list goes on. In the scenario that unfolded in Pimentel Hall,
the parameter was the mean year of the 527 students in the
class. The process of estimating that parameter by calculating
the sample mean of the 18 students who decided to sit in the
front row that day induces a sampling distribution.

[Figure: Sampling Distribution — histogram of 𝑥̄ (mean year).]

This sampling distribution captures the two sources of error


that creep in while generalizing. The horizontal offset from the
true population parameter (the red line) to the mean of the
sampling distribution (the gold triangle) represents the bias.
The spread of the sampling distribution represents the variation.
In these lecture notes you’ll learn how to quantify sampling
variability using two common tools.

Standard Error (SE): The standard deviation of the sampling distribution of a statistic.

Confidence Interval: An interval of two values, a lower and an upper bound on the statistic, that captures most of the sampling distribution.

To focus on the variation, let’s introduce a second example, one


in which we will not need to worry about bias.

A Simple Random Sample

Restaurants in San Francisco


Every year, the city of San Francisco’s health department visits
all the restaurants in the city and inspects them for food safety.
Each restaurant is given an inspection score; these range from
100 (perfectly clean) to 48 (serious potential for contamination).
We have these scores from 2016. Let’s build up to the sampling
distribution bit by bit.

The Population Distribution


Our population consists of the restaurants in San Francisco.
Since the data are published online for all restaurants, we have
a census¹ of scores for every restaurant in the city.

[Figure: Population Distribution — proportion of restaurants at each food safety score.]
¹ The term census refers to a setting where you have access to the entire population.
The population distribution is skewed left with a long left tail.
The highest possible score is 100. It appears that even scores
are more popular than odd scores in the 90s; in fact, there are
no scores of 99, 97, or 95.
We can calculate two parameters of this population:

• The population mean, 𝜇, is 87.6.
• The population SD, 𝜎, is 8.9.

Population parameters, like the parameters of probability distributions, are usually given a Greek letter. The population mean is 𝜇, said “myoo”, and the population standard deviation is 𝜎, said “sigma”.
The Empirical Distribution
Although we have data on all of the restaurants in the city,
imagine that you’re an inspector who has visited a simple
random sample of 100 restaurants. That is, you draw 100
times without replacement from the population, with each unit
equally likely to be selected. This leads to a representative
sample that will have no selection bias.
The distribution of this sample (an empirical distribution) looks
like:

[Figure: Empirical Dist. (Sample 1) — proportion of sampled restaurants at each food safety score.]

The sample statistics here are:

• The sample mean, 𝑥̄, is 86.27.
• The sample SD, 𝑠, is 9.9.

While parameters are symbolized with Greek letters, statistics are usually symbolized with Latin letters.

Observe that the empirical distribution resembles the population
distribution because we are using a sampling method without
selection bias. It’s not a perfect match, but the shape is similar.
The sample average (𝑥̄) and the sample SD (𝑠) are also close to,
but not the same as, the population average (𝜇) and SD (𝜎).

The Sampling Distribution


If you compared your sample to that of another inspector who
visited 100 restaurants, their sample would not be identical to
yours, but it would still resemble the population distribution,
and its 𝑥̄ and 𝑠 would be close to those of all the restaurants
in the city.
The distribution of the possible values of 𝑥̄ across simple
random samples of 100 restaurants is the sampling distribution
of the sample mean. We can use it to, for example, find the
chance that the sample mean will be over 88, or the chance
that the sample mean will be between 85 and 95.
Ordinarily this distribution takes some work to create, but in
this thought-experiment we have access to the full population,
so we can simply use the computer to simulate the process. We
repeat, 100,000 times, the process of drawing a simple random
sample of 100 restaurants and computing its mean. The full
distribution looks like:

[Figure: Sampling Distribution — proportion at each value of the average food safety score.]
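This simulation can be sketched in Python. The population below is a stand-in (a truncated normal with 𝜇 = 87.6 and 𝜎 = 8.9), since the real 2016 scores are not bundled with these notes:

```python
import random
import statistics

random.seed(0)

# Stand-in population of 5766 scores; the notes use the real 2016
# data (population mean 87.6, population SD 8.9).
population = [min(100, max(48, random.gauss(87.6, 8.9))) for _ in range(5766)]

# Repeat many times: draw a simple random sample of 100 restaurants
# (without replacement) and record its mean.
sample_means = [
    statistics.mean(random.sample(population, 100))
    for _ in range(10_000)
]

print(statistics.mean(sample_means))   # close to the population mean
print(statistics.stdev(sample_means))  # the SE, roughly sigma / sqrt(100)
```

The notes use 100,000 repetitions; 10,000 keeps the sketch quick and gives nearly the same picture.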

We can consider numerical summaries of this distribution:

• The mean of the sampling distribution is 87.6.


• The SD of the sampling distribution, which is called the
Standard Error (SE), is 0.9. This convention of using a
different name for the SD for the distribution of a statistic
helps keep straight which kind of standard deviation we’re
talking about.

Observe that the sampling distribution of 𝑥̄ doesn’t look any-


thing like the population or sample. Instead, it’s roughly sym-
metric in shape with a center that matches 𝜇, and a small SE.
The small size of the SE reflects the fact that the 𝑥̄ tends to be
quite close to 𝜇.
Again, the sampling distribution provides the distribution of
the possible values of 𝑥̄. From this distribution, we find that the
chance that 𝑥̄ is over 88 is about 0.33, and the chance that 𝑥̄ is
between 85 and 95 is roughly 1.
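Both chances can be checked against the normal approximation, treating the sampling distribution as normal with mean 87.6 and SE 0.89 (standard library only; `normal_cdf` is a helper defined here, not a built-in):

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    # Normal CDF written in terms of the error function
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

mean, se = 87.6, 0.89  # center and SE of the sampling distribution

print(1 - normal_cdf(88, mean, se))                         # about 0.33
print(normal_cdf(95, mean, se) - normal_cdf(85, mean, se))  # about 1
```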

Putting the Three Panels Together


Let’s look at these three aspects of this process side-by-side.
[Figure: three panels side by side — Population Distribution, Empirical Dist. (Sample 1), and Sampling Distribution.]

        Population    Empirical     Sampling
Shape   left skew     left skew     bell-shaped / normal
Mean    𝜇 = 87.6      𝑥̄ = 86.27     87.6
SD      𝜎 = 8.9       𝑠 = 9.9       0.89

Observe that:

1. 𝜇 and the mean of the sampling distribution are roughly


the same.

2. 𝜎 and the SE of the sample averages are related in the
   following way²:

   SE(𝑥̄) ≈ 𝜎/√𝑛

3. The histogram of the sample averages is not skewed like the
   histogram of the population; on the contrary, it is symmetric
   and bell-shaped, like the normal curve.
4. The histogram of our sample of 100 resembles the population
   histogram.
5. Since 100 is a pretty large sample,

   𝜇 ≈ 𝑥̄
   𝜎 ≈ 𝑠

Up until this point, we’ve worked through this thought exper-


iment with the unrealistic assumption that we know the popu-
lation. Now we’re ready to make inferences in a setting where
we don’t know the population.

² This approximation becomes an equality for a random sample with replacement. When we have a SRS, the exact formula is

SE(𝑥̄) = √((𝑁 − 𝑛)/(𝑁 − 1)) · 𝜎/√𝑛

The additional term, called the finite population correction factor, adjusts for the fact that we are drawing without replacement. Here 𝑁 is the number of tickets in the box (the size of the population) and 𝑛 is the number of tickets drawn from the box (the size of the sample). To help make sense of this correction factor, think about the following two cases:

• Draw 𝑁 tickets from the box (that is, 𝑛 = 𝑁).
• Draw only one ticket from the box.

What happens to the SE in these two extreme cases?

In the first case, you will always see the entire population if you are drawing without replacement, so the sample mean will exactly match the population mean. The sampling distribution has no variation, so SE = 0.

In the second case, since you take only one draw from the box, it doesn’t matter whether you replace it or not, so the SE for a SRS should match the SE when sampling with replacement in this special case. In settings where 𝑁 is large relative to 𝑛, drawing without replacement effectively behaves as if you are sampling with replacement.
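The correction factor and its two extreme cases can be checked numerically; this is a sketch, and `se_mean` is a hypothetical helper, not a standard function:

```python
from math import sqrt

def se_mean(sigma, n, N=None):
    """SE of the sample mean; applies the finite population
    correction when the population size N is given (SRS)."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))  # finite population correction
    return se

# Restaurant example: sigma = 8.9, n = 100, N = 5766
print(se_mean(8.9, 100))           # with replacement: 0.89
print(se_mean(8.9, 100, N=5766))   # SRS: slightly smaller, about 0.88

# The two extreme cases from the footnote:
print(se_mean(8.9, 5766, N=5766))  # n = N: SE is exactly 0
print(se_mean(8.9, 1, N=5766))     # n = 1: matches the with-replacement SE, 8.9
```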

6
Inference for a Population Average

Drawing on our understanding of the thought-experiment, we


ask:
What happens when you don’t see the population, you just have
your sample, and you want to make an inference about the
population?
We have serious gaps in our procedure for learning about the
sampling distribution!

To start, we know we can use the sample average, 𝑥̄, to estimate
the population average, 𝜇. This is called a point estimate of
the population parameter.
But can we do better than that? Can we bring in more of the
information that we have learned from the thought-experiment?
For example, can we accompany our point estimate with a sense
of its accuracy? Ideally, this would be the SE of the sample
mean. Unfortunately, we don’t know the SE because it depends
on 𝜎. So now what do we do?

Standard Error

The thought-experiment tells us that 𝑠 is close to 𝜎 (when
you have a SRS), so we can substitute 𝑠 into the formula
for the SE:

SE(𝑥̄) ≈ 𝑠/√𝑛

When presenting your findings, you might say that, based on a
SRS of 100 restaurants in San Francisco, the average food safety
score is estimated to be 86 with a standard error of about 1.
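In code, the plug-in estimate is one line (numbers from the sample above):

```python
from math import sqrt

x_bar = 86.27  # sample mean
s = 9.9        # sample SD
n = 100

se = s / sqrt(n)     # plug-in estimate of SE(x_bar)
print(round(se, 2))  # 0.99, i.e. a standard error of about 1
```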
Suppose someone took a sample of 25 restaurants and provided
an estimate of the average food safety score. Is that only 1/4
as accurate because the sample is 1/4 the size of ours?
Suppose someone took a sample of 100 restaurants in New York
City where there are 50,000 restaurants (this is a made up
number). Is their estimate only 1/10 as accurate because the
number of units in the population is 10 times yours?
We can use the formula for the SE to answer these questions.
In the table below, we have calculated SEs for a generic value of
𝜎 and various choices of the population size and sample size.

                      Sample Size (𝑛)
Population Size (𝑁)   25           100           400
500                   SE = 𝜎/5     SE = 𝜎/10     SE = 𝜎/20
5,000                 SE = 𝜎/5     SE = 𝜎/10     SE = 𝜎/20
50,000                SE = 𝜎/5     SE = 𝜎/10     SE = 𝜎/20

What do you notice about the relationship between sample size


and population size and SE?

• The absolute size of the population doesn’t enter into the


accuracy of the estimate, as long as the sample size is
small relative to the population.
• A sample of 400 is twice as accurate as a sample of 100,
which in turn is twice as accurate as a sample of 25 (as-
suming the population is relatively much larger than the
sample). The precision of estimating the population mean
improves according to the square root of the sample size.
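A quick check of the square-root scaling, using the restaurant 𝜎:

```python
from math import sqrt

sigma = 8.9  # population SD from the restaurant example

for n in (25, 100, 400):
    se = sigma / sqrt(n)
    print(f"n = {n:3d}: SE = sigma/{sqrt(n):.0f} = {se:.2f}")
```

Quadrupling the sample size halves the SE.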

Confidence Intervals

Confidence intervals bring in more information from the


thought-experiment. The confidence interval provides an
interval estimate, instead of a point estimate, that is based on
the spread of the sampling distribution of the statistic.
We have seen that the sampling distribution takes a familiar
shape: that of the normal curve (also called the bell curve)³.
Therefore we can fill in some of the holes in the
thought-experiment with approximations.

This is the Central Limit Theorem in action. The CLT states
that sums of independent random variables become normally
distributed as 𝑛 increases. Conveniently enough, most useful
statistics are some version of a sum: 𝑥̄ is a sum divided by 𝑛,
and 𝑝̂ is a sum of variables that take values 0 or 1, divided
by 𝑛. This powerful mathematical result enables one of the
most popular methods of constructing confidence intervals.

Normal Confidence Intervals


When the sampling distribution is roughly normal in shape,
then we can construct an interval that expresses exactly how
much sampling variability there is. Using our single sample
of data and the properties of the normal distribution, we can
be 95% confident that the population parameter is within the
following interval:

[𝑥̄ − 1.96·SE, 𝑥̄ + 1.96·SE]

The number 1.96 doesn’t come out of thin air; refer to the notes
on the Normal Distribution to understand its origins.

³ This is not always the case. We’ll come back to this point later.

So for a sample where the sample mean is 86 and the 95%
confidence interval is [84.3, 88.2], you would say,

I am 95% confident that the population mean is


between 84.3 and 88.2.

For the particular interval that you have created, you don’t
know whether it contains the population mean or not. This is
why we use the term confidence to describe it instead of
probability. Probability comes into play when taking the sample;
after that, our confidence interval is a known, observed value
with nothing left to chance.
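As a sketch, here is the interval computed from round numbers like those in the example (the notes’ exact interval differs slightly):

```python
from math import sqrt

x_bar = 86.0  # sample mean
s = 9.9       # sample SD
n = 100

se = s / sqrt(n)  # estimated SE of x_bar
lower = x_bar - 1.96 * se
upper = x_bar + 1.96 * se
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")  # 95% CI: [84.1, 87.9]
```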

Confidence not Probability


To be more precise about what is meant by “confidence”, let’s
take 100 samples of size 25 from the restaurant scores, and
calculate a 95% confidence interval for each of our 100 samples.
How many of these 100 confidence intervals do you think will
include the population mean?
Let’s simulate it! At the bottom of the plot below, the horizontal
line at 𝑦 = 1 indicates the coverage of the confidence interval
from the first sample; it stretches from roughly 84 to 91. The
line above it, at 𝑦 = 2, indicates the coverage of the confidence
interval that resulted from the second sample, from roughly 85
to 92.5. Both of these confidence intervals happened to cover
the true population parameter, indicated by the black vertical
line.

[Figure: 100 confidence intervals, one per iteration (𝑦 = 1, …, 100), plotted over the range of 𝑥̄; a vertical line marks the population mean.]

As we look up the graph through the remaining intervals, we
see that 95 of the 100 confidence intervals cover the population
parameter. This is by design: if we simulate another 100 times,
we may get a different number, but it is likely to be close
to 95.
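This simulation can be sketched as follows (again with a stand-in population, since the real scores are not included here):

```python
import random
import statistics
from math import sqrt

random.seed(1)

# Stand-in population (the notes use the real 2016 scores:
# mu = 87.6, sigma = 8.9, N = 5766).
population = [min(100, max(48, random.gauss(87.6, 8.9))) for _ in range(5766)]
mu = statistics.mean(population)

covered = 0
for _ in range(100):
    sample = random.sample(population, 25)    # SRS of size 25
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / sqrt(25)  # plug-in SE
    if x_bar - 1.96 * se <= mu <= x_bar + 1.96 * se:
        covered += 1

print(covered, "of 100 intervals cover the population mean")
```

The count lands near 95, often a little lower in practice because with 𝑛 = 25 the plug-in SE and the normal approximation are both slightly optimistic.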

Inference for a Population Proportion

To gain practice with making confidence intervals, we turn to


another example. This time we sample from a population where
the values are 0s and 1s. You will see that the process is very
much the same, although there are a few simplifications that
arise due to the nature of the population.

Suppose we only want to eat at restaurants with food safety
scores above 95. Let’s make a confidence interval for the pro-
portion of restaurants in San Francisco with scores that are
“excellent” (scores over 95). To tackle this problem, we can
modify our population. Since we need only to keep track of
whether a score is excellent, we can replace the scores on the
tickets with 0s and 1s, where 1 indicates an excellent score. Of
the 5766 restaurants in San Francisco, 1240 are excellent, so we
can think of our population as a box with 5766 tickets: 1240
marked 1 and 4526 marked 0. This time let’s take a SRS of 25.
The thought-experiment appears as:

[Figure: three panels — Population, Empirical Distribution, and Sampling Distribution for the 0-1 (excellent score) box.]

        Population                  Empirical    Sampling
Shape   right skew                  right skew   bell-shaped / normal
Mean    𝑝 = 0.22                    𝑝̂ = 0.36     0.22
SD      𝜎 = √(𝑝(1 − 𝑝)) = 0.41      𝑠 = 0.49     0.08

In the special case of a 0-1 box:

• The population average is the proportion of 1s in the box; let’s call this parameter 𝑝.
• Taking a draw from the population distribution amounts to taking a draw from a Bernoulli random variable, so 𝜎 = √(𝑝(1 − 𝑝)).
• The sampling distribution has mean 𝑝.
• The sample proportion, 𝑝̂, tends to be close to 𝑝.
• The SE of the sample proportion⁴ is approximately

  SE(𝑝̂) ≈ √(𝑝̂(1 − 𝑝̂)/𝑛)

⁴ This calculation results from casting the total number of 1s in a sample of size 𝑛 as a binomial random variable with success probability 𝑝. Call that random variable 𝑌; the variance of a binomial random variable is Var(𝑌) = 𝑛𝑝(1 − 𝑝). The full derivation appears at the end of these notes.

With an equation to estimate the SE from our data in hand, we
can form a 95% confidence interval:

[𝑝̂ − 1.96·√(𝑝̂(1 − 𝑝̂)/𝑛), 𝑝̂ + 1.96·√(𝑝̂(1 − 𝑝̂)/𝑛)]
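Plugging in the sample from the thought-experiment (𝑝̂ = 0.36, 𝑛 = 25):

```python
from math import sqrt

p_hat = 0.36  # sample proportion of excellent scores
n = 25

se = sqrt(p_hat * (1 - p_hat) / n)  # plug-in SE of p_hat
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se
print(f"95% CI for p: [{lower:.2f}, {upper:.2f}]")  # [0.17, 0.55]
```

This interval happens to cover the true proportion, 𝑝 = 0.22.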

Summary

In these notes, we have restricted ourselves to the simple ran-


dom sample, where the only source of error that we’re con-
cerned with is sampling variability. We outlined two tools for
estimating that variability: the standard error (SE) and the
confidence interval.
We saw how the size of the sample impacts the standard error
of the estimate. The larger the sample, the more accurate our
estimates are; in particular, the accuracy improves according
to 1/√𝑛. We also found that the size of the population doesn’t
impact the accuracy, as long as the sample is small compared
to the population.
We made confidence intervals for population averages and pro-
portions using the normal distribution. This approach can be
extended to other properties of a population, such as the me-
dian of a population, or the coefficient in a regression equa-
tion.
Considering the sample proportion as a binomial count divided by 𝑛, that is 𝑝̂ = 𝑌/𝑛, and applying the properties of variance, we can find the variance of 𝑝̂:

Var(𝑝̂) = Var(𝑌/𝑛)          (1)
       = (1/𝑛²) Var(𝑌)      (2)
       = (1/𝑛²) 𝑛𝑝(1 − 𝑝)   (3)
       = 𝑝(1 − 𝑝)/𝑛         (4)

So the standard error can be calculated as:

SE(𝑝̂) = √Var(𝑝̂) = √(𝑝(1 − 𝑝)/𝑛)   (5)

When estimating the SE from data, we plug in 𝑝̂ for 𝑝.

The confidence intervals that we have made are approximate in
the following sense:

• We’re approximating the shape of the unknown sampling


distribution with the normal curve.
• The SD of the sample is used in place of the SD of the
population in calculating the SE of the statistic.

There are times when we are unwilling to make the assumption


of normality. This is the topic of the next set of notes.
