Session 2 Workshop

The document discusses the importance of confidence intervals (CIs) in statistical analysis, particularly in estimating summary statistics for a population based on sample data. It outlines the conditions under which CIs are useful, the significance of a 95% confidence level, and the assumptions necessary for valid statistical formulas. Additionally, it emphasizes the distinction between accuracy and precision in data analysis and the need for effective communication of statistical findings to non-expert audiences.

Management Science

Session 2 workshop: Confidence intervals and the reliability of statistical analyses

Stefan Scholtes
Unit 1: Descriptive statistics: How well do we understand the data we have?
No assumptions needed to interpret results

Units 2-5: Inferential statistics: Does the data we have tell us anything about data we wish we had?
Assumptions are needed to interpret results
Learning objectives

Technical: Measuring the precision of estimates of unknown numbers

1. When are confidence intervals useful?

2. What is a “95%” confidence interval?

3. When are the statistical formulas for 95% CIs valid?

4. Accuracy (unbiasedness) is more important than precision (error margin)

Managerial: Communicating uncertainty in estimates

Managers do statistics to become more confident about the validity of their messages, not to include the stats in those messages
Key concept: Sample & Population

Sample (partial information): data that you have
“Population” of interest (full information): data you’d love to have

History vs. future
Customers you have data on vs. customers you’d like to have data on
The population of interest is defined by an
“exam question” that informs a decision
Decision on 28 Sep 2021:
Over the past two weeks, Covid bed
capacity in the East of England has
been declining. Hospitals are asking
whether it is safe to release reserved
COVID ward capacity for planned
surgery over the coming 2-3 weeks?

Exam question?
How will the COVID bed occupancy
change between 28/9/21 and 15/10/21

Data we have (sample): Daily COVID hospital bed occupancy in the region from 15/8/21 – 27/9/21

Data we wish we had (“population of interest”)

Population of interest?
Daily COVID hospital bed occupancy
in the region from 15/8/21 – 15/10/21
A “population of interest” doesn’t refer to actual
“people” but to a spreadsheet you’d like to have,
so that its summary statistics would answer your
exam question
When are confidence
intervals useful?

CIs are used when we want to estimate summary statistics of a population of interest, using information from the sample

Naïve: Use the value of the sample statistic as a “best estimate”
- Doesn’t tell you anything about precision

Better: Give a range of plausible values of the summary statistics
- Mid-point of confidence interval is “best estimate”
- Width of confidence interval reveals precision of the estimate
When are confidence intervals not useful?

You want to know the average GMAT score


of the MBA class

Would a confidence interval be useful?


What is a “95% Confidence interval”?

A confidence interval is a range that is “likely” to contain an unknown summary statistic of a population of interest

What does “likely” mean?

Definition: 95% of all 95% confidence intervals ever created will contain the unknown summary statistic – and 5% won’t

Think weather app: If the weather app says that there is a 60% chance
of rain today, then this tells you that on 60% of all days that have a
“60% chance of rain” label, it actually rains, and on 40% it doesn’t.

If you are uncomfortable getting it wrong 5% of the time, use a higher confidence level, e.g. 99%. But that will give you a wider CI, which might be managerially less useful.
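
The “95% of all intervals” definition can be illustrated with a small simulation. The sketch below is not part of the workshop material: it assumes a hypothetical population (normally distributed values with a known mean), builds many intervals with the ± 2 SE rule, and counts how often they contain the true value. The function names `ci_95` and `coverage` are made up for this illustration.

```python
import random
import statistics


def ci_95(sample):
    """Return (low, high) using the rule: sample mean +/- 2 * SE."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.pstdev(sample) / n ** 0.5
    return mean - 2 * se, mean + 2 * se


def coverage(true_mean=3.0, n=100, repeats=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Assumption: a hypothetical population with mean true_mean and stdev 1.
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        low, high = ci_95(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / repeats


print(f"Share of intervals containing the true value: {coverage():.1%}")
```

Under these assumptions the printed share should land close to 95%, although any single interval either contains the true value or does not.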
Calculating 95% Confidence Intervals
The formulas

$$\text{95\% CI} = \text{Sample statistic} \pm 2 \cdot \mathrm{SE}$$

The formula for the standard error (SE) depends on the sample statistic you are interested in

Average: $$\mathrm{SE} = \frac{\text{standard deviation}(\text{Data})}{\sqrt{N}}$$

Proportion: $$\mathrm{SE} = \sqrt{\frac{p(1-p)}{N}}$$ ($p$ is the sample proportion)
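
As a minimal sketch, the two standard-error formulas can be written as small Python functions. The function names and the example numbers are hypothetical, and using `statistics.pstdev` (the population standard deviation) is a choice made for this sketch.

```python
import math
import statistics


def ci_for_average(data):
    """95% CI for an average: mean +/- 2 * stdev / sqrt(N)."""
    n = len(data)
    mean = statistics.fmean(data)
    se = statistics.pstdev(data) / math.sqrt(n)
    return mean - 2 * se, mean + 2 * se


def ci_for_proportion(p, n):
    """95% CI for a proportion: p +/- 2 * sqrt(p * (1 - p) / N)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - 2 * se, p + 2 * se


# Hypothetical example: 40% of 600 respondents answer "yes".
low, high = ci_for_proportion(0.4, 600)
print(f"95% CI for the proportion: {low:.2f} to {high:.2f}")
```

In the hypothetical example the standard error is 0.02, so the interval runs from 0.36 to 0.44.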
Three ways of presenting a Confidence Interval

Call length in VLE example:

1. As a range:
95% CI = 2.98 mins to 3.74 mins

2. As a best estimate with an error margin


95% CI = 3.36 min ± 0.38 min
Error margin = 2*SE = 0.38 min

3. As a best estimate with a percentage error margin


95% CI = 3.36mins ± 11%
Relative EM = EM / Average = 0.38 mins /3.36 mins = 11%
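
The three presentations are simple transformations of the same two numbers, the best estimate and the error margin. A small sketch using the call-length figures above (the helper name `present_ci` is hypothetical):

```python
def present_ci(estimate, error_margin, unit="mins"):
    """Return the same 95% CI in the three presentation styles."""
    low, high = estimate - error_margin, estimate + error_margin
    relative = error_margin / estimate
    return [
        f"95% CI = {low:.2f} {unit} to {high:.2f} {unit}",
        f"95% CI = {estimate:.2f} {unit} ± {error_margin:.2f} {unit}",
        f"95% CI = {estimate:.2f} {unit} ± {relative:.0%}",
    ]


for line in present_ci(3.36, 0.38):
    print(line)
```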
Excel confusion: stdev.p or stdev.s?

Excel has two different formulas, stdev.s and stdev.p, for standard deviations that produce slightly different results

$$\text{stdev.s}(\text{Data}) = \sqrt{\frac{N}{N-1}} \cdot \text{stdev.p}(\text{Data})$$

Which Excel formula you use makes little difference in practice.

I will consistently use stdev.p

$$\text{SE of Average} = \frac{\text{stdev.p}(\text{Data})}{\sqrt{N}}$$
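
For readers working outside Excel, Python's standard library offers the same pair: `statistics.pstdev` divides by N (like stdev.p) and `statistics.stdev` divides by N-1 (like stdev.s). The sketch below uses a small made-up data set to show how the two relate.

```python
import math
import statistics

# Hypothetical data, chosen only to illustrate the two formulas.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

sd_p = statistics.pstdev(data)  # divides by N, like Excel's stdev.p
sd_s = statistics.stdev(data)   # divides by N - 1, like Excel's stdev.s

# The two differ only by the factor sqrt(N / (N - 1)).
assert math.isclose(sd_s, math.sqrt(n / (n - 1)) * sd_p)

se_of_average = sd_p / math.sqrt(n)
print(f"stdev.p = {sd_p:.3f}, stdev.s = {sd_s:.3f}, SE = {se_of_average:.3f}")
```

The factor shrinks towards 1 as N grows, which is why the choice matters little for the sample sizes used in practice.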
Exercise 1: How many patients die in
hospitals as a consequence of negligence?
Calculate a 95% confidence interval for the number of
avoidable lethal adverse events per 1,000 admitted patients

Data
• 2.6 M hospital admissions in NY State in 1984
• Sample: Medical records of 30,195 hospital admissions in NY State hospitals in 1984,
randomly drawn

Summary statistics
• 3.7% of sampled patients experienced at least one “adverse event”
• 13.6% of these adverse events led to death
• 51.3% of the lethal adverse events were classified by the research team as “avoidable”

Comparison data: 2,060 road fatalities in NY State in 1984.

Key questions:
1. What is the population of interest? (the data we’d love to have)
2. What’s the population statistic of interest?
3. What’s the value of the corresponding sample statistic?
4. What’s the formula we need to calculate the 95% CI?
5. How should we communicate the result?
Communication

You are writing a blog for a general audience on the


dangers of negligence in hospitals and want to include the
evidence from the Harvard Medical Practice Study

“30 years ago, researchers at Harvard showed that (…).


Has the situation improved?”

What do you include in “(…)”?

Why?
“30 years ago, researchers at Harvard showed that avoidable adverse events in hospitals caused more than twice as many patient fatalities as road accidents. Has the situation improved?”
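
The “more than double” claim can be checked against the Exercise 1 figures. The sketch below is one possible reading of the data, not the official solution: it assumes the three percentages can be multiplied into a single per-admission proportion, and that the proportion formula for the SE applies with N equal to the 30,195 sampled admissions.

```python
import math

ADMISSIONS_NY_1984 = 2_600_000
SAMPLE_SIZE = 30_195
ROAD_FATALITIES = 2_060

# Assumption: share of admissions with an avoidable lethal adverse event.
p = 0.037 * 0.136 * 0.513
se = math.sqrt(p * (1 - p) / SAMPLE_SIZE)
low, high = p - 2 * se, p + 2 * se

per_1000 = (1000 * low, 1000 * p, 1000 * high)
deaths = (ADMISSIONS_NY_1984 * low, ADMISSIONS_NY_1984 * high)

print(f"Per 1,000 admissions: {per_1000[1]:.2f} "
      f"(95% CI {per_1000[0]:.2f} to {per_1000[2]:.2f})")
print(f"Scaled to all admissions: {deaths[0]:,.0f} to {deaths[1]:,.0f} deaths")
print(f"Lower bound relative to road fatalities: {deaths[0] / ROAD_FATALITIES:.1f}x")
```

Under these assumptions the interval is roughly 2.0 to 3.2 avoidable lethal events per 1,000 admissions, and even its lower end scales to more than twice the road fatality count.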
Question: What proportion of
the UK population are in
favor of writing off student
loans for nurses?

Calculate a 95% confidence


interval

How would you deal with the


“don’t knows”?

https://yougov.co.uk/topics/politics/survey-results/daily/2023/09/28/b92d8/2
Excel
Formulas
VLE example: Call Centre Staffing
Contract with new client: Need to deliver 1,000 calls per day
Contract with callers: A caller is expected to deliver 360 call-minutes per working
day

How many callers do we need to hire?

Depends on the length of the calls


If each call takes 2 mins, a caller can handle 360/2=180 calls per day and we
need 1,000/180=5.6 FTE callers for 1,000 calls per day

But call lengths vary

Would you use


the mean or the
median for this
task?
How good is the estimated average call length?

95% CI is 2.98 mins - 3.74 mins

→ Based on the data, a caller can do between 96 and 121 calls per day
→ We therefore need between 8.3 and 10.4 FTE callers
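
The step from the confidence interval to a staffing range is plain arithmetic, sketched below. The function name `staffing_range` is hypothetical; the 360 call-minutes per caller and 1,000 calls per day come from the contract terms above.

```python
def staffing_range(ci_low_mins, ci_high_mins,
                   minutes_per_caller=360, calls_per_day=1000):
    """Translate a CI for average call length into a range of FTE callers."""
    # Short calls mean more calls per caller, hence fewer callers needed.
    calls_if_short = minutes_per_caller / ci_low_mins
    calls_if_long = minutes_per_caller / ci_high_mins
    return calls_per_day / calls_if_short, calls_per_day / calls_if_long


fte_low, fte_high = staffing_range(2.98, 3.74)
print(f"FTE callers needed: {fte_low:.1f} to {fte_high:.1f}")
```

The uncertainty about the average call length carries straight through to the staffing estimate, which is what prompts the boss's question.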

Boss: “8 or 10? That’s a 20%


difference in staffing. Can’t you be
more precise?”
Collect more data – but how much more?

Want to reduce the error margin

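$$\mathrm{EM} = 2 \cdot \frac{\text{stdev.p}(\text{data})}{\sqrt{N}}$$

Rearranging gives $$N = \left( 2 \cdot \frac{\text{stdev.p}(\text{data})}{\mathrm{EM}} \right)^{2}$$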

The square on the right-hand side means that we need to increase
- N by a factor $2^2 = 4$ if we want to cut EM by 50% (i.e. to EM/2)
- N by a factor $10^2 = 100$ if we want to cut EM by 90% (i.e. to EM/10)
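
A minimal sketch of the rearranged formula follows. The standard deviation and target error margins are hypothetical values chosen to make the factor of 4 visible; `required_sample_size` is a made-up helper name.

```python
import math


def required_sample_size(stdev, target_em):
    """Smallest N for which 2 * stdev / sqrt(N) is at most target_em."""
    return math.ceil((2 * stdev / target_em) ** 2)


# Hypothetical: call lengths with a standard deviation of 2 minutes.
n_now = required_sample_size(stdev=2.0, target_em=0.5)
n_half_em = required_sample_size(stdev=2.0, target_em=0.25)
print(f"EM = 0.50 min needs N = {n_now}")
print(f"EM = 0.25 min needs N = {n_half_em} ({n_half_em // n_now}x as many)")
```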
The critical question:

Is the 95% CI formula valid?

Remember the definition:

95% of all 95% confidence intervals ever created


will contain the unknown summary statistic – and
5% won’t
3 threats to the reliability of any statistical estimation

1. Poor data quality (garbage-in-garbage-out)
2. Small sample size (formulas assume “large enough” sample)
3. Biased sampling process (comparing apples and oranges)
Reasons for poor data quality

• Data Entry Errors


• Incomplete Data
• Duplicate records
• Data inconsistency (e.g. formatting, units of
measurement, variable definition)
• Outdated Data
• Data integration issues (data linking)
• Lack of Data Documentation / Data Dictionaries
• Poor Data Governance Culture
• ETC
Sample size assumption: continuous variables

The confidence interval formula is only valid if the sample is


“large enough”

First rule of thumb: At a minimum, N should be larger than 30

If the histogram suggests that the distribution is different from a


normal distribution, you may need larger samples (e.g. N=100)
Sample size assumption: Binary variables

Second rule of thumb for binary variable (0 or 1):

The computed 95% CI is fully contained in the interval [0,1]


(no negative values, no values larger than 1)

You need a large sample if the proportion of 1’s is very small or very large

Example: Novel surgery, N=100 surgery patients, 2 deaths


Exam question: what’s the mortality rate from this surgery?

→ $p = 0.02$; $\mathrm{SE} = \sqrt{\frac{p(1-p)}{N}} = 0.014$
→ 95% CI = [-0.008, 0.048]

Sample size assumption not satisfied because -0.008 < 0
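
The rule of thumb for binary variables can be expressed as a simple check: compute the interval and test whether it stays inside [0, 1]. The sketch below applies it to the surgery example; the helper names and the larger comparison sample (20 deaths in 1,000 patients) are hypothetical.

```python
import math


def proportion_ci(events, n):
    """95% CI for a proportion using p +/- 2 * sqrt(p * (1 - p) / N)."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p - 2 * se, p + 2 * se


def sample_large_enough(events, n):
    """Rule of thumb: the computed CI must lie fully inside [0, 1]."""
    low, high = proportion_ci(events, n)
    return 0 <= low and high <= 1


low, high = proportion_ci(2, 100)
print(f"N=100, 2 deaths: CI = [{low:.3f}, {high:.3f}], "
      f"valid: {sample_large_enough(2, 100)}")
# Hypothetical larger study with the same 2% mortality rate.
print(f"N=1000, 20 deaths: valid: {sample_large_enough(20, 1000)}")
```

With the same 2% rate, the check fails for 100 patients because the lower bound is negative, and passes for the hypothetical 1,000 patients.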


You often hear people saying “…it’s
meaningless, the sample is way too small”

The miracle of statistics is that


the precision of your estimate of the
population of interest does (by and
large) not depend on the size of the
population of interest, it only
depends on the size of the
sample
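
This claim can be explored with a simulation. The sketch below is an illustration under made-up assumptions: two hypothetical populations of very different size, each with 30% “yes” members, are sampled repeatedly with the same sample size, and the spread of the resulting estimates is compared.

```python
import random
import statistics


def spread_of_estimates(population_size, sample_size=400, repeats=500, seed=0):
    """Standard deviation of sample proportions across repeated random samples."""
    rng = random.Random(seed)
    # Assumption: hypothetical population in which 30% of members are 1.
    ones = population_size * 3 // 10
    population = [1] * ones + [0] * (population_size - ones)
    estimates = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.pstdev(estimates)


small = spread_of_estimates(10_000)
large = spread_of_estimates(1_000_000)
print(f"Population of 10,000:    spread of estimates = {small:.4f}")
print(f"Population of 1,000,000: spread of estimates = {large:.4f}")
```

With a sample of 400 in both cases, the two spreads should come out very similar even though one population is 100 times larger than the other.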
p=31%, N=362,
95% CI = [26%, 36%]

What’s the real


question to ask?
Bias relates to representativeness for the
population of interest
Before you discuss bias / unrepresentativeness, you
must clarify what your population of interest is

Start with the smallest meaningful population of


interest beyond the sample (the “core population”)
– All admissions to acute hospitals in NY state in 1984?

Then ask whether the results generalize to wider


populations of interest
– All hospital admissions in the USA in the 1980s?
– All hospital admissions in the USA in 1991 (when the article was published)?
– All hospital admissions in the world today?
Representativeness is a judgement call

Mathematical ideal: Random sampling from


the population of interest
– Harvard Medical Practice Study

Market researchers are even smarter
- Stratified sampling ensures that all important subgroups are properly represented in the sample
- Response weighting of survey responses ensures that responses of important but under-represented subgroups gain appropriate weight
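
A minimal sketch of response weighting, using entirely hypothetical numbers: a survey in which under-30s make up half of the population of interest but only a quarter of the respondents. Each subgroup's result is weighted by its population share rather than by its share of the sample.

```python
# Hypothetical survey: (respondents, number answering "yes") per subgroup.
sample_counts = {"under_30": (50, 30), "30_plus": (150, 45)}
# Assumed shares of each subgroup in the population of interest.
population_shares = {"under_30": 0.5, "30_plus": 0.5}

total_respondents = sum(n for n, _ in sample_counts.values())
total_yes = sum(yes for _, yes in sample_counts.values())
unweighted = total_yes / total_respondents

# Weight each subgroup's "yes" rate by its population share.
weighted = sum(
    population_shares[group] * yes / n
    for group, (n, yes) in sample_counts.items()
)

print(f"Unweighted estimate: {unweighted:.1%}")
print(f"Weighted estimate:   {weighted:.1%}")
```

In this made-up case the unweighted estimate understates support because the more supportive subgroup is under-represented among respondents.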
Accuracy refers to representativeness of the data for the
population of interest
Precision refers to the width of the confidence interval

Precise = Narrow range
Imprecise = Wide range

Accurate = Right “on average”
Errors cancel out
No “systematic error” (no bias towards particular areas)

Inaccurate = Wrong “on average”
Errors don’t cancel out
Error is “systematic” (biased towards particular area)
Accuracy vs precision

[Figure: 2×2 grid of accurate / inaccurate against precise / imprecise]

Would you rather have a result that is accurate (unbiased) but imprecise (small sample), or a result that is inaccurate (biased) but precise (large sample)?
Sample size increases precision but does not
reduce the bias (and often makes it worse)

[Figure labels: “Unknown value we are interested in estimating”; “We are increasing the sample to improve precision”]
One of the biggest mistakes in naïve statistics

“A larger sample is better than a smaller


sample”

WRONG!

“An unbiased sample is better than a


biased sample”
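
The trade-off between sample size and bias can be made concrete with a simulated example. Everything in the sketch below is hypothetical: a true proportion of 30% in the population of interest, and a convenience sample drawn from a group in which the proportion is 50%.

```python
import math
import random


def ci_from_sample(sample):
    """95% CI for a proportion from a list of 0/1 responses."""
    n = len(sample)
    p = sum(sample) / n
    se = math.sqrt(p * (1 - p) / n)
    return p - 2 * se, p + 2 * se


rng = random.Random(42)
TRUE_P = 0.30    # assumed share in the population of interest
BIASED_P = 0.50  # assumed share among those reached by a convenience sample

small_unbiased = [1 if rng.random() < TRUE_P else 0 for _ in range(100)]
large_biased = [1 if rng.random() < BIASED_P else 0 for _ in range(10_000)]

low_u, high_u = ci_from_sample(small_unbiased)
low_b, high_b = ci_from_sample(large_biased)

print(f"Unbiased, N=100:    {low_u:.2f} to {high_u:.2f}")
print(f"Biased,   N=10,000: {low_b:.2f} to {high_b:.2f}")
```

The large biased sample yields a much narrower interval, but one centred on the wrong value, so its precision gives false reassurance.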
5 key learning points for Unit 2

1. When do we use confidence intervals?

We want to know the value of an


unknown number (a statistic)
related to a population of interest
but only have the value of the
statistic for a sample

(e.g. we have some data on the


past but no data on the future)
5 key learning points for Unit 2

2. What is a “95%” confidence interval?

95% of all correctly calculated


95% confidence intervals will
contain the population statistic
of interest – and 5% won’t

You do not know if your interval


contains the value, you only know
the chance
5 key learning points for Unit 2

3. When are statistical formulas valid?

Three key assumptions


1. High-quality data
2. Sufficiently large sample
3. Unbiased data collection
process
5 key learning points for Unit 2

4. Accuracy is more important than precision

Precision = width of the confidence


interval
• Doubling precision requires quadrupling
of the sample size

Accuracy = representativeness of the


sample
• Determined by the data collection process
• The gold standard is random sampling
from the population of interest

Beware of “convenience sampling” of


large samples through online platforms
5 key learning points for Unit 2

5. Communicate for impact

Do not use statistical language in


your comms, unless you know that
your audience appreciates it

Make sure you have “the stats” up your sleeve in case someone asks

The main goal of statistical analysis


is to give yourself confidence in your
narrative (“Do you believe it?”)
Appendix: VLE exercises
Insurance company

An insurance company has designed a new travel


insurance product for students. Before they launch it,
they want you to find out if there is sufficient demand

You have designed a questionnaire and hired a few


Cambridge u/g students who will ask students at
college dinner if they have time for an interview while
they eat

You have given them a bunch of £10 Amazon


vouchers to incentivize interviewees to take part in
the survey
Interrogating representativeness
The fact that the sample is not randomly chosen doesn’t automatically
invalidate it

Suppose the Cambridge survey shows that 10% of the surveyed


students are willing to buy the travel insurance product.
• Prosecutor: “Cambridge students have better high school grades
than those in other universities, so the sample is not representative
for the UK student population”
• Defense: “I agree - but what has that to do with their willingness to
buy a travel insurance programme?”

Potential non-representativeness should be justified AND must be


linked to the results of interest
• Prosecutor: “Cambridge students, with a roughly 50% private school
attendance rate compared to 7% in the general UK population, have
higher socioeconomic backgrounds, on average, and therefore likely
more frequent international travel. This implies that your demand
estimate based on Cambridge data may overstate the needs of the
broader UK undergraduate population”
How to respond to challenges to non-
representativeness?

You have to refute non-representativeness


arguments with evidence

E.g. “A recently published survey by a


consultancy on spending habits of students,
by type of university, showed that students in
the leading universities don’t spend more on
travel than those in other universities.”
What if my sample is not representative for the
population I need for my decision?

Don’t throw the sample away – but ask “for


what segment of the population of interest is
the sample (reasonably) representative?”

Then sample additional data from the


remaining segment for which the sample is
not representative
Exercise 2:
Length of stay for fractured hips

A nurse has collected data


on length of stay

Using the sample, what can


you say about the average
length of stay of patients
with this condition?
Reliability of the estimate

Nurse: “Of course this is a random sample. I went into the file room and took a bunch of 200 patient records from a shelf. I didn’t know anything about these patients beforehand. How could this be non-random?”

Explain the notion of a representative


sample to the nurse

Discuss what could possibly make the


current sample non-representative.

If you are unhappy with this sampling


process, propose a better sampling method.
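
One possible answer, sketched under assumptions: if every patient record with this condition has an identifier, a simple random sample gives each record the same chance of selection, regardless of which shelf it sits on. The record IDs and the helper name `draw_random_sample` are hypothetical.

```python
import random


def draw_random_sample(record_ids, sample_size, seed=None):
    """Pick sample_size records so that every record is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(record_ids), sample_size)


# Hypothetical: 5,000 hip-fracture records numbered 1 to 5,000.
all_records = range(1, 5001)
chosen = draw_random_sample(all_records, 200, seed=2024)
print(f"Selected {len(chosen)} records, e.g. {sorted(chosen)[:5]}")
```

Taking records from one shelf may over-represent a particular period, ward or alphabetical range, whereas drawing from the full list avoids that kind of systematic pattern.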
Communicating
statistical concepts

Nurse: “You mentioned a


95% confidence interval.
What is that?”

Explain the notion of a


confidence interval to the
nurse
- why do we need it?
- what do we need to ensure
that the way we compute it is
valid?
