Biostatistics 8
Week 8 lectures
Melissa Wos-Oxley
Aims for Week 8: Statistical Inference
1. Key concepts in statistical inference: 2 most common types
• Hypothesis testing (tests of significance)
• Estimation (Confidence Intervals)
2. Introduce key terms like null hypothesis, test statistics, p-values,
sampling distributions, SRS
3. Statistical inference for proportions
Moore et al. 2017 – Chapters 6.1-6.2; (revision in chapters 1.4 & 5), for statistical
inference for proportions – chapter 8.1
Summary of topic to-date and where to next?
Statistics (weeks 1-5):
✓ Types of variables (& data), concept of sampling, and graphing data
✓ Descriptive statistics & summaries, shape of data distributions (3 S’s, and outliers)
✓ Relationships between variables (explanatory vs response variables)
✓ Correlation
✓ Regression
✓ Producing data: types of studies, data collection, errors
✓ Probability (weeks 6-7):
✓ Marginal, joint, conditional, concept of independence (week 6)
✓ Probability models for discrete and continuous random variables (week 7)
✓ Binomial probability distributions
✓ Normal probability distribution (use the standard Normal Table)
➢ Hypothesis testing and estimation (weeks 8-11):
➢ Statistical Inference for proportions (weeks 8 & 9)
➢ Statistical Inference for numerical data (weeks 10 & 11)
➢ t-distributions (another Table to use)
Rather than collecting every possible datum in a population, we take random samples
to represent that population and make inferences from these samples back to the
whole population.
We can now combine this idea with our knowledge of probability and draw
conclusions about a population by applying statistical inference, powerful methods
that use probability to give confidence in our conclusions.
[Diagram: COLLECT DATA in the form of n replicates → SAMPLE STATISTICS (x̄, s, p̂) with an assumed “Normal sampling distribution” → MAKE CONCLUSIONS ABOUT POPULATION (population parameters µ, σ, N, p), through ESTIMATION (point estimation & confidence intervals) and HYPOTHESIS TESTING (test-statistic & p-value).]
Hypothesis testing:
• “tests of significance” used to assess the evidence provided by the data in favour of
some claim about the population parameters.
• A 6-step process: setting a hypothesis, applying a test and generating & interpreting
both a test statistic and p-value.
The factor is land-use; where there are 2 samples [rural vs urban]; each nutrient is a dependent
variable; population is the units of water
You are interested in testing whether a new antibiotic will kill a certain bacterial strain; you grow
bacteria on agar plates and then add the antibiotic (at 1% concentration) to half of the plates, leaving
the other half of the plates free from antibiotic.
The factor is antibiotic use; where there are 2 samples [1% antibiotic use vs control]; counts of
bacterial colonies or zone of inhibition is the dependent variable; population is bacterial colonies
You are interested in knowing whether cigarette smoking changes the bacterial assemblages and
diversity of your nasal passages; you sample people who currently smoke and those who have never smoked.
The factor is lifestyle habits; where there are 2 samples [smokers vs non-smokers]; diversity of
bacteria in the nasal passages is the dependent variable; population is the human population
• When we “count” outcomes across categorical variables (as with the binomial last week,
classifying & counting “successes” and “failures”), so that the data are counts or
percents obtained from counts, we are interested in making inferences about
population proportions (p).
➢ statistical inferences for proportions – z-test & CI for p
• The sampling distribution of a statistic is the distribution of all possible values taken by
the test statistic when all possible samples of a fixed size n are taken from the
population.
• It is a theoretical concept – we do not actually build it.
• The sampling distribution of a statistic is the ‘probability distribution’ of the statistic.
6 steps in carrying out a hypothesis test (generic recipe)
3. Determine the sampling distribution of the test statistic (from step 2)
Sampling variability:
• Each time we take a random sample from a population, we are likely to get a different set
of individuals and thus calculate a different test statistic.
• If we take a lot of random samples of the same size from a given population, the variation
from sample to sample - the sampling distribution – will follow a predictable pattern.
• Sampling distributions are never exactly Normal, but as sample size increases, the
sampling distribution of the test statistic becomes approximately Normal.
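The sampling-variability idea above can be illustrated with a small simulation. This is only a sketch: the population proportion p = 0.5, the sample size n = 20, the number of repeated samples, and the function name `simulate_sample_proportions` are all assumptions chosen for illustration, not values from the lecture.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n and return each sample proportion."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    proportions = []
    for _ in range(num_samples):
        # Each individual is a "success" with probability p
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

# Hypothetical population: p = 0.5, repeated samples of size n = 20
p_hats = simulate_sample_proportions(p=0.5, n=20, num_samples=10_000)

# The sample proportions vary from sample to sample, but in a predictable way:
# they centre on p and spread out by roughly sqrt(p(1-p)/n).
print("mean of sample proportions:   ", round(statistics.mean(p_hats), 3))
print("std dev of sample proportions:", round(statistics.pstdev(p_hats), 3))
print("sqrt(p(1-p)/n):               ", round((0.5 * 0.5 / 20) ** 0.5, 3))
```

A histogram of `p_hats` would approximate the sampling distribution of the sample proportion for this hypothetical population.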
The confidence interval (CI) is a range of values that’s likely to include a population value with a
certain degree of confidence.
It is often expressed as a percentage (e.g. 95%), whereby a population value, such as the mean, lies between a lower and an upper limit.
EXAMPLE 2: A Google research study asked 5013 smartphone users about how they used their phones. In response to
a question about purchases, 3563 reported that they purchased an item after using their smartphones to search for
information about the item.
Is there evidence to support the statement that greater than 70% of smartphone users use their phones to research an
item prior to its purchase?
That is, does the population proportion significantly differ from 70%?
EXAMPLE 3: A sample of university student mobile phone users were asked about their service provider, where 18 of the
33 students surveyed were Optus customers. Optus claims that the proportion of Optus customers among university
student mobile phone users is 50%.
Does the sample provide evidence to support Optus' claim?
That is, does the population proportion significantly differ from 50%?
6 steps in carrying out a hypothesis test (generic recipe)
1. Pose the null and alternative hypotheses
2. Choose an appropriate statistical test and calculate the test statistic
3. Determine the sampling distribution of the test statistic (from step 2)
4. Find the p-value associated with the test statistic
5. Make a decision to either “reject” or “accept” the null hypothesis (based on step 4)
6. State the conclusion in the context of the scientific problem -relate it back to the real
world (population!)
Next 7 lectures
Considerations for hypothesis testing of population proportions:
• Tests for the null hypothesis of a single population proportion (H0: p = p0) are based on the z
statistic from the one-sample z-test; where the test statistic summarises the differences
between the observed and expected data.
• The test statistic is a random variable with a distribution that we know
• Assumptions:
✓Random: a random sample of subjects for testing
✓Normal: Assuming a null (H0: p = p0) is true, we can use Normal distribution to calculate
p-values (elaborate on next slide)
• The p-value can be interpreted as the strength of the evidence provided by the observed
data against H0.
Normal Approximation for binomial distributions
• As n gets larger, something interesting happens to the shape of a binomial
distribution, that is, it becomes approximately Normal. In this case, the sampling
distribution of p̂ becomes approximately Normal.
• As a rule of thumb, we will use the Normal approximation when n is large; that is, when
• n × p ≥ 10, AND
• n(1 – p) ≥ 10
- page 324 of Moore et al. 2017.
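The rule of thumb above is easy to check in code. A minimal sketch follows; the function name `normal_approximation_ok` and the `threshold` parameter are hypothetical, introduced only for this illustration.

```python
def normal_approximation_ok(n, p, threshold=10):
    """Return True when both n*p and n*(1-p) are at least `threshold`."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

# n = 20, p = 0.5 gives 10 expected successes and 10 expected failures
print(normal_approximation_ok(20, 0.5))   # condition is just met
# n = 20, p = 0.3 gives only 6 expected successes
print(normal_approximation_ok(20, 0.3))   # condition is not met
```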
EXAMPLE 1: Your company produces a sunblock lotion designed to protect the skin
from UVA and UVB exposure to the sun. You hire a testing company to compare your
product with your major competitor’s product. For 13 of 20 randomly chosen subjects,
your product provides better protection; for the remaining seven, the competitor’s
product does.
Do you have evidence to claim that your product provides better protection?
That is, does the population proportion significantly differ from 50%?
6 steps in carrying out a hypothesis test (generic recipe)
1. Pose the null and alternative hypotheses
Tests for the null hypothesis of a single population proportion (H0: p = p0) are based on the z statistic
from the one-sample z-test:
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$
Inference about a population proportion p from an SRS of
size n is based on the sample proportion, p̂:
$$\hat{p} = \frac{\text{count of successes in the sample}}{n}$$
n = 20, p0 = 0.5
p̂ = 13/20 = 0.65
Then, $$z = \frac{0.65 - 0.5}{\sqrt{\dfrac{0.5(1 - 0.5)}{20}}} = \frac{0.15}{\sqrt{\dfrac{0.5(0.5)}{20}}} = \frac{0.15}{\sqrt{0.0125}} = \frac{0.15}{0.112} = 1.34$$
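The arithmetic for the sunblock example (13 of 20 subjects, p0 = 0.5) can be reproduced with a few lines of Python. This is a sketch only; the function name `one_sample_z` is hypothetical.

```python
import math

def one_sample_z(successes, n, p0):
    """z statistic for H0: p = p0, using the standard error under the null."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

# Sunblock example: 13 of 20 subjects, null value p0 = 0.5
z = one_sample_z(successes=13, n=20, p0=0.5)
print(round(z, 2))  # 1.34
```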
6 steps in carrying out a hypothesis test (generic recipe)
3. Determine the sampling distribution of the test statistic (from step 2)
p-values are then calculated from the sampling distribution, the Z ~ N(0,1) distribution (when
the expected number of successes, n × p0, and the expected number of failures, n(1 - p0),
are both at least 10).
Check the conditions for using a Normal approximation:
n × p0 = 20 x 0.5 = 10
n(1 - p0) = 20(1 – 0.5) = 10
As both are at least 10, we may use the Normal approximation.
6 steps in carrying out a hypothesis test (generic recipe)
4. Find the p-value associated with the test statistic (in this case z-statistic)
The z statistic has approximately the standard Normal distribution when H0 is true. P-values
therefore come from the standard Normal distribution. Also, upon testing the conditions, both n×p0
and n(1 - p0) were equal to 10, which meets the "at least 10" condition, so we can obtain p-values from the Z~N(0,1) distribution.
From Table A or the function in Excel “=NORM.S.DIST(z,TRUE)” (from Week 7 Friday’s lecture),
we can find P(Z < 1.34). It is 0.9099. So the probability in the upper tail is 1 - 0.9099 = 0.0901.
We set our alpha value (α) at 0.05: we reject the null if the p-value is less than 0.05 and
accept the null if the p-value is greater than 0.05. Here 0.0901 is greater than 0.05 (and the two-sided
p-value, 2 × 0.0901 = 0.1802, is larger still), so we do not reject the null.
Do you have evidence to claim that the new sunscreen provides better protection?
Answer: There is no statistical evidence to support the claim that the sunscreen provides
better protection.
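As an alternative to Table A or Excel, the same tail probabilities can be obtained in Python. The sketch below assumes Python 3.8 or later (for `statistics.NormalDist`) and takes z = 1.34 from step 2; the variable names are illustrative.

```python
from statistics import NormalDist

z = 1.34        # test statistic from step 2
alpha = 0.05    # significance level used in the lecture

standard_normal = NormalDist(mu=0, sigma=1)
lower_tail = standard_normal.cdf(z)   # same quantity as Excel's NORM.S.DIST(z, TRUE)
upper_tail = 1 - lower_tail           # P(Z > 1.34)
two_sided = 2 * upper_tail            # p-value for H1: p != p0

print(round(lower_tail, 4))   # 0.9099
print(round(upper_tail, 4))   # 0.0901
print(round(two_sided, 4))    # 0.1802
print("reject H0" if two_sided < alpha else "do not reject H0")
```

Whether the one-tail or the two-sided probability is the relevant p-value depends on how the alternative hypothesis was posed in step 1; in this example both exceed 0.05.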
Summary of the assumptions/conditions for inference on p
1) The data used for the estimate are an SRS from the population studied.
2) The population size is at least 10 times as large as the sample used for inference.
This ensures that the standard deviation of p̂ is close to √(p(1 − p)/n).
3) The sample size, n, is large enough that the sampling distribution can be
approximated with a Normal distribution. How large a sample size is required
depends in part on the value of p and the test conducted. Otherwise, rely on the
binomial distribution (will not show you this further in this topic).
Suppose that the sunscreen provided better UVA and UVB protection for 15 of the 20
subjects. Perform the significance test and summarise the results.
That is, does the population proportion significantly differ from 50%?
n = 20, p0 = 0.5
p̂ = 15/20 = 0.75
Then, $$z = \frac{0.75 - 0.5}{\sqrt{\dfrac{0.5(1 - 0.5)}{20}}} = \frac{0.25}{\sqrt{\dfrac{0.5(0.5)}{20}}} = \frac{0.25}{\sqrt{0.0125}} = \frac{0.25}{0.112} = 2.23$$
If conditions are met, we should do a one-sample z test for the population proportion p.
STEP 1: H0: p = 0.50 H1: p ≠ 0.50
STEP 2: z = 2.23
STEP 3: YES, conditions are met. Normal approximation to Binomial: Z ~ N(0,1)
✓Random: We assume the company chose a random sample of subjects for testing.
✓Normal: Assuming H0: p = 0.50 is true, the expected numbers of successes and failures are
np0 = 20(0.50) = 10 and n(1 – p0) = 20(1 - 0.50) = 10, respectively. Because both of these
values are at least 10, we can use the Normal approximation.
✓The sampling distribution: z ~ N(0,1)
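Steps 2 to 5 for the 15-of-20 case can be sketched in Python as follows. This is an illustration under the two-sided alternative H1: p ≠ 0.50 with α = 0.05 (the level used earlier in the lecture); the variable names are assumptions. Note that carrying full precision gives z of about 2.24 rather than the 2.23 obtained with the rounded denominator 0.112.

```python
import math
from statistics import NormalDist

successes, n, p0, alpha = 15, 20, 0.5, 0.05

# Step 2: test statistic for the one-sample z test
p_hat = successes / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Step 3: conditions for the Normal approximation
conditions_met = n * p0 >= 10 and n * (1 - p0) >= 10

# Step 4: two-sided p-value from the standard Normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: compare the p-value with alpha
reject_null = p_value < alpha

print("z =", round(z, 3))
print("conditions met:", conditions_met)
print("two-sided p-value =", round(p_value, 4))
print("reject H0:", reject_null)
```

Step 6, stating the conclusion in the context of the sunscreen comparison, still has to be written in words.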
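For Example 2 (3563 of the 5013 smartphone users): n = 5013, p0 = 0.7
p̂ = 3563/5013 = 0.71
Then, $$z = \frac{0.71 - 0.7}{\sqrt{\dfrac{0.7(1 - 0.7)}{5013}}} = \frac{0.01}{\sqrt{\dfrac{0.7(0.3)}{5013}}} = \frac{0.01}{\sqrt{0.00004189}} = \frac{0.01}{0.0065} = 1.54$$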
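The smartphone calculation can also be checked in Python. The sketch below computes z twice: once with p̂ rounded to 0.71, as in the hand calculation, and once with the unrounded proportion 3563/5013. The variable names are illustrative, and the comparison is offered only to show how much intermediate rounding can matter with a large n.

```python
import math

successes, n, p0 = 3563, 5013, 0.7

# Standard error of the sample proportion under H0: p = p0
standard_error = math.sqrt(p0 * (1 - p0) / n)

# Hand-calculation style: sample proportion rounded to two decimals
p_hat_rounded = round(successes / n, 2)
z_rounded = (p_hat_rounded - p0) / standard_error

# Full precision: no rounding of the sample proportion
p_hat = successes / n
z_unrounded = (p_hat - p0) / standard_error

print("p-hat (rounded):  ", p_hat_rounded, " z =", round(z_rounded, 3))
print("p-hat (unrounded):", round(p_hat, 5), " z =", round(z_unrounded, 3))
```

Because the standard error is so small here, rounding p̂ to 0.71 shifts z by roughly 0.1, so it is safer to keep extra decimal places until the final step.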
EXAMPLE 2: A Google research study asked 5013 smartphone users about how they used their
phones. In response to a question about purchases, 3563 reported that they
purchased an item after using their smartphones to search for information about the item.
Is there evidence to support the statement that greater than 70% of smartphone users use
their phones to research an item prior to its purchase?
That is, does the population proportion significantly differ from 70%?
STEP 1: H0: H1 :
STEP 2: z =
STEP 3: Are conditions met? Normal approximation to Binomial: Z ~N(0,1)
STEP 4: p-value =
STEP 5: accept or reject the NULL?
STEP 6: Make a statement…………