Biometry Lecture 6 Posted

The document discusses hypothesis testing, which involves comparing sample data to hypotheses about a population. Hypothesis testing asks whether there is any effect, rather than estimating the size of an effect. It provides an example of Jonas Salk testing whether his polio vaccine was effective by comparing outcomes between a treatment group that received the vaccine and a control group that received a placebo. The document also provides an example of testing whether toads exhibit handedness by recording which forelimb each of 18 toads used to remove a balloon from their head and calculating the probability of the observed results if there was actually no preference.


Chp 6: Hypothesis testing
Hypothesis testing
• Hypothesis testing, like estimation, uses sample data to make inferences about the population
• Unlike estimation, hypothesis testing asks only whether an effect exists, not how large it is
Hypothesis testing
• Estimation asks, “how large is
the effect?”

• Hypothesis testing asks, “Is there any effect at all?”
(Pictured: Pierre-Simon Laplace and R. A. Fisher)

https://youtu.be/lgs7d5saFFc
Hypothesis testing
• Hypothesis testing involves
comparing results with two
expectations (hypotheses)
1. The null hypothesis (Ho) – there is no relationship, no effect
• Null means “nothing”, nada, no association
• This is your default, or primary
expectation
Hypothesis testing
• The assumption of the null is
intended to protect the integrity
of your conclusions
• By forcing you to consider the
idea of no relationship or no
effect, you should be less likely
to jump to a positive conclusion
Hypothesis testing
• Hypothesis testing involves
comparing results with two
expectations (hypotheses)
2. The alternative hypothesis (Ha) – there is a relationship, an effect; it covers the possibilities the null does not
Hypothesis testing
• An example of hypothesis testing
• Polio vaccine was developed by
Jonas Salk
• At one time polio was considered
one of the most frightening public
health problems in the world
Hypothesis testing
• Polio is an infectious disease
caused by the poliovirus
• Effects range from no symptoms
to death
• Can result in life-long disability
Hypothesis testing
• Salk and his team developed an
experiment to test their polio
vaccine
• Their null hypothesis was that the
vaccine didn’t work
• The alternative hypothesis was
that it would work
Hypothesis testing
• Salk’s vaccine was tested on
elementary school students
• 401,974 students were randomly assigned to two groups
• Treatment group – students who received the vaccine
• Control group – students who received a saline injection
• Students were unaware which group they were in
Hypothesis testing
• Realize that:
• Most students did not get polio,
regardless of what treatment
group they were in
• Vaccines don’t work 100% of the time
Hypothesis testing
• Of those who received the vaccine, 64 students (0.016%) developed polio
• Meanwhile, 229 students (0.057%) in the control group developed the disease
Hypothesis testing
• Did the vaccine work, or did
this relatively small difference
arise purely by chance?
• To answer this question requires
evaluating the null hypothesis
• You evaluate the null and not the alternative because you are starting from the premise that there is no effect, i.e. the vaccine did not work
Hypothesis testing
• So, to test their null hypothesis, Salk et al. calculated a probability
• This probability reflected the likelihood of getting their results if the vaccine actually had no effect, i.e. if the null hypothesis were true
Hypothesis testing
• They found the probability was very small
• Thus, they rejected the null hypothesis that the vaccine had no effect
• You don’t test the alternative hypothesis; it is accepted by default because you rejected the null
Hypothesis testing
• Hypothesis testing quantifies how unusual the data are, assuming that the null hypothesis is true
Hypothesis testing
• The polio example left out details for the sake of simplicity
• Let’s consider another, more detailed example
• Do animals exhibit handedness, like humans?
• i.e., are some animals right- or left-handed?
Hypothesis testing
• Some researchers actually tested this in toads
• They tied a balloon to the head of each toad and recorded which forelimb (right or left) the animal used to try to remove it
• They did this with 18 toads, n = 18
Hypothesis testing
• The four basic steps in
hypothesis testing:
1. State the hypotheses
2. Compute the test statistic
3. Determine the probability of the observed test statistic when the null hypothesis is in fact true (i.e. the P-value)
4. Draw the appropriate conclusions based on your results
Stating the hypothesis
• The number of interest is the
proportion (p) of right-handed
toads in the population
• We are inferring the proportion of right-handed toads in the total population of toads based on our sample of 18
• If our sample is random, it should be adequate
Stating the hypothesis
• The null hypothesis, Ho (our
default) is that both forelimbs
are used equally, i.e. p = 0.5
• In other words, there is no difference in the use of their limbs
Stating the hypothesis
• The alternative hypothesis (Ha) is that left- and right-handed toads are not equally frequent
• i.e., p ≠ 0.5
• This is a two-sided (or two-tailed) hypothesis; it makes no prediction as to the direction of the difference
• A one-sided hypothesis would have stated that left- or right-handed toads were more common
Stating the hypothesis
• After stating their hypotheses, the researchers had to calculate the test statistic
The test statistic
• The test statistic is a number
calculated from the data that is
used to evaluate how
compatible the data are with
the result expected under the
null hypothesis
The test statistic
• In our toad example, if the null is correct we would expect to observe 9 (half of 18) left-handed toads and 9 right-handed toads, i.e. p = 0.5
• Instead, we observed that 14 toads used their right forelimb to remove the balloon
• Thus, 14 is our test statistic
The test statistic
• In this example, we are using
the number of right-handed
toads to provide a value to test
• This does not mean we are expecting more right- or left-handed toads
The test statistic
• By itself, the test statistic doesn’t tell us much; to make it meaningful we must calculate the probability of getting a result like 14 when the null expectation (9) is actually true
• This value is called a P-value
• But before we can calculate this probability, we must understand where it comes from
P-values
• The null distribution is the
sampling distribution of
outcomes for a test statistic
under the assumption that the
null hypothesis is true
P-values
• The height of a given bar of the null distribution corresponds to the probability (0.1855) of obtaining a given value (9 right-handed toads) if the null is true
P-values
• This is the same data, but in table form

Number of right-handed toads    Probability
 0        0.000004
 1        0.00007
 2        0.0006
 3        0.0031
 4        0.0117
 5        0.0327
 6        0.0708
 7        0.1214
 8        0.1669
 9        0.1855
10        0.1669
11        0.1214
12        0.0708
13        0.0327
14        0.0117
15        0.0031
16        0.0006
17        0.00007
18        0.000004
Total     1
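These probabilities come from the binomial distribution. As a minimal sketch in R (the language of the packages cited later in this lecture), they can be reproduced with dbinom():

```r
# Null distribution for the toad experiment: X ~ Binomial(n = 18, p = 0.5)
n <- 18
probs <- dbinom(0:n, size = n, prob = 0.5)  # P(X = x) for x = 0..18
data.frame(right_handed = 0:n, probability = round(probs, 6))
sum(probs)  # the probabilities across all outcomes sum to 1
```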
P-values
• The P-value is the probability of obtaining the observed test statistic, or one more extreme, assuming the null is true (see the table above)
P-values
• The P-value is calculated by summing the probabilities of the relevant observations
P-values
• Our expectation
under the null
hypothesis was 9
• However, we
observed 14
P-values
• To calculate the P-value we sum the probabilities of all outcomes at least as extreme as the one observed
• The probabilities of observing 14, 15, 16, 17, and 18 right-handed toads
• But we also include the probabilities of observing 0, 1, 2, 3, and 4 right-handed toads
P-values
• We sum both sides of the distribution because we made no prediction about toad handedness, left vs. right
• Hence this is a two-tailed hypothesis
P-values
• If we had made a statement about getting more left-handed toads, we would have summed only one side of the probability distribution
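A sketch of both calculations in R, using the same binomial probabilities as the table above:

```r
# Two-sided P-value: sum the probabilities of outcomes at least as extreme
# as 14 right-handed toads, in both tails (0-4 and 14-18)
p_two <- sum(dbinom(c(0:4, 14:18), size = 18, prob = 0.5))
p_two  # ~0.031

# One-sided version, had we predicted more right-handed toads a priori:
p_one <- sum(dbinom(14:18, size = 18, prob = 0.5))
p_one  # ~0.015, half the two-sided value (the distribution is symmetric)
```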
P-values
• Shown with the table: the values we sum are those for 0–4 and 14–18 right-handed toads (the two tails of the table above)
P-values
• All of these values summed = our P-value = 0.031
P-values
• This means that the probability of an outcome as extreme or more extreme than 14 out of 18 toads being right-handed is 0.031
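R’s built-in exact binomial test performs this entire calculation in one call; a minimal sketch:

```r
# Exact binomial test of H0: p = 0.5 against the two-sided alternative
binom.test(x = 14, n = 18, p = 0.5, alternative = "two.sided")
# Reports a two-sided P-value of ~0.031, matching the sum-by-hand approach
```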
Draw the appropriate conclusion
• Is 0.031 small enough?
• By convention, we set our cut-off at 0.05
Draw the appropriate conclusion
• In other words, if P is less than or equal to 0.05, then
we reject the null hypothesis
• If P is >0.05, we do not reject
Draw the appropriate conclusion
• In this case, 0.031 is smaller than 0.05; thus we reject the null hypothesis that these toads use both forelimbs equally frequently
P-values and confidence intervals

• Confidence intervals (CI) – a measure of uncertainty of an estimate
• A range of values surrounding the sample estimate that is likely to contain the population parameter
• 95% confidence intervals are most common
• If sampling were repeated many times, 95% of the intervals so constructed would contain the true population parameter
• e.g. the 95% CI for our 100 sampled genes is:
• 2121.4 < μ < 2702.2
P-values and confidence intervals
• CIs can be
calculated for all
the samples
taken
P-values and confidence intervals
• Imagine your null is that an estimate = 0
• You calculate the 95% confidence interval for the estimate and find the upper and lower bounds do not include 0
[Figure: the estimate with its 95% CI on the Y axis; the interval does not include 0, which lies below it]
P-values and confidence intervals
• You then perform a hypothesis test with an appropriate statistical model and the P-value is <0.05
• This will be the case 99% of the time
[Figure: the same estimate and 95% CI, which does not include 0]
P-values and confidence intervals
• Thus, if you are interested in whether your estimate is different from 0 and your confidence intervals do not include 0, it’s very likely significant
[Figure: the estimate’s 95% CI again, not including 0]
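This duality between confidence intervals and tests can be illustrated with a small sketch in R; the data here are simulated purely for illustration:

```r
# One-sample example: does the 95% CI for the mean exclude 0?
set.seed(1)
x <- rnorm(25, mean = 1, sd = 2)  # hypothetical sample whose true mean is 1
fit <- t.test(x, mu = 0)          # tests H0: population mean = 0
fit$conf.int                       # 95% CI for the mean
fit$p.value                        # P < 0.05 exactly when the CI excludes 0
```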
P-values and confidence intervals
• Imagine your null is that two means are the same
• You calculate the 95% confidence intervals for both means, respectively
• You find that the 95% confidence intervals do not overlap
[Figure: two means whose 95% CIs do not overlap]
P-values and confidence intervals
• You then apply an appropriate statistical model and find that P < 0.05
• Thus, if your confidence intervals don’t overlap between estimates, it’s likely significant
[Figure: the two means with non-overlapping 95% CIs]
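A companion sketch for the two-means case (again with simulated data); note that non-overlapping 95% CIs imply significance, but the converse is not exact:

```r
# Two-group example: compare each group's 95% CI, then run the test
set.seed(2)
g1 <- rnorm(20, mean = 10, sd = 2)  # hypothetical group 1
g2 <- rnorm(20, mean = 14, sd = 2)  # hypothetical group 2
t.test(g1)$conf.int  # 95% CI for group 1's mean
t.test(g2)$conf.int  # 95% CI for group 2's mean
t.test(g1, g2)       # non-overlapping CIs -> expect P < 0.05
# Overlapping CIs can still yield P < 0.05, so run the test either way
```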
Errors in hypothesis testing
• Rejecting the null hypothesis does not necessarily mean that the null hypothesis is false
• Likewise, one could fail to reject, when in reality a
difference does exist
Errors in hypothesis testing
• Chance may influence sampling in ways that are difficult
to predict
• Some uncertainty can be quantified, though, if the data are a random sample, so making rational decisions is possible
Errors in hypothesis testing
There are two kinds of errors in hypothesis testing
• A Type I error is detecting an effect that is not present, while a Type II error is failing to detect an effect that is present

One of these is much worse than the other…..
Type I error
• Type I error – rejecting a true null hypothesis
• The significance level (α) gives us the probability of committing a Type I error
• We reject the null when P ≤ 0.05
• This means that, if the null were true, we would reject it mistakenly 1 time in 20 (1/20 = 0.05), or 5% of the time
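A quick simulation sketch of this idea in R: when the null really is true, about 5% of tests at α = 0.05 reject it anyway:

```r
# Simulate 10,000 samples where H0 (mean = 0) is TRUE, then test each one
set.seed(3)
pvals <- replicate(10000, t.test(rnorm(20), mu = 0)$p.value)
mean(pvals <= 0.05)  # proportion of false rejections; close to 0.05
```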
The significance level (α)
• α = 0.05, usually
• What if α is lowered (a
smaller number, e.g.
0.01)?
• It would reduce our
Type I error rate, and
make our test more
stringent
• However, it becomes
more likely that a type
II error will occur
(having a false
negative)
The significance level (α)
• α = 0.05, usually
• What if α is increased (a
larger number, e.g. 0.1)?
• It would make the null easier to reject, but this increases the likelihood of a false positive (rejecting a true null)
• It has been done, but you would:
• Have to do this a priori (before you do the stats)!
• Extensively justify this decision, and many would never accept that justification
Type II error
• Type II error – failing to reject a false null
hypothesis
• In other words, you conclude there is no difference
when there actually is…
• The probability of a Type II error is directly related to power
Power

• Power is the probability that a random sample taken from a population will, when analyzed, lead to rejection of a false null hypothesis (i.e. you say they are different and they really are)
• Power = 1 – Type II error rate (β)
• Thus, the smaller your Type II error rate, the more power you have
Power
• You want more power!
Power
• What affects power?
• The significance level (α)
• A larger α (e.g. α = 0.1) means a less stringent test and a smaller Type II error rate
• The smaller your Type II error rate, the more power you have
• But this also means increased Type I error (false positives)
• Magnitude of the effect
• This refers to how different groups are
• Quantified by the effect size (more on that later)

• Thus, you shouldn’t mess with α, and you often can’t do much to improve the magnitude of the effect
Power
• What affects power?
• The sample size (n)
• Directly affects sampling
error
• Generally, effects are
harder to detect in
smaller samples
• Increasing sample size is often the easiest way to increase power
Power
• How much power do I have?
• Power is difficult to quantify
• We will never know the true population parameter,
thus we can never really say how much power we
have
• There are “power analyses” available, but these only give estimates
Power
• Power estimates are specific to a given statistical model
(ANOVA, regression)
• There is no generic power analysis
• Many statistical models do not have “canned” power
analyses
• You would have to get creative, and more
importantly know what you are doing
Power
• Although cumbersome and limited, power analysis is not useless; we will cover it in detail at a later date
• The R Package “pwr” provides many power analysis
tools
https://cran.r-project.org/web/packages/pwr/index.html
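As a sketch of what pwr offers (assuming the package is installed; ES.h() and pwr.p.test() are pwr functions for a one-proportion test), applied to the toad example:

```r
library(pwr)  # install.packages("pwr") if needed

# Effect size for a single proportion (14/18 observed) vs. the null p = 0.5
h <- ES.h(14/18, 0.5)

pwr.p.test(h = h, n = 18, sig.level = 0.05)       # estimated power at n = 18
pwr.p.test(h = h, power = 0.80, sig.level = 0.05) # n needed for 80% power
```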
Problems with hypothesis testing
• The multiple comparisons problem – this happens when one performs multiple hypothesis tests at the same time
Problems with hypothesis testing
• The multiple comparisons problem means that as more tests are performed on a dataset, it becomes more likely that a null will be rejected (P < 0.05) purely by chance (it’s not real!!!)
• Eventually, you will find a significant result
• In other words, it increases your Type I error rate (false positives)
• This all applies to P-values as well as confidence intervals
Problems with hypothesis testing
• There are many ways to control for the multiple
comparisons problem
• Use the appropriate statistical model
• You don’t do a bunch of t-tests; use a single test instead
• If you have to do multiple tests, perform a correction for
multiple comparisons
• Bonferroni correction – really well known, but very conservative, maybe too conservative (increased Type II error)
• False discovery rate (FDR) – newer, less extreme
• Many, many, many others
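Base R’s p.adjust() implements several of these corrections; a minimal sketch with made-up P-values:

```r
# Hypothetical raw P-values from five separate tests
p <- c(0.001, 0.012, 0.031, 0.046, 0.20)
p.adjust(p, method = "bonferroni")  # Bonferroni: conservative
p.adjust(p, method = "BH")          # Benjamini-Hochberg FDR: less extreme
```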
Problems with hypothesis testing
• The multiple comparisons problem goes hand-in-hand
with data dredging (AKA P-hacking)
• This occurs when people try to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis
[Figure: P > 0.05 vs. P < 0.05]
Problems with hypothesis testing
• Science is predicated on the idea that:
1. You make a statement (a hypothesis) upfront (a priori), before you have analyzed the data
2. You apply a specific statistical test to evaluate your specific
hypothesis
Do not go on fishing expeditions!
Problems with hypothesis testing
• How to data dredge (do not do this!!!)
• Collect data
• Correlate all your data, looking for relationships that are
significant
• Make up a story to explain your correlations
• Data dredging is a huge problem which undermines the
value of science

https://youtu.be/42QuXLucH3Q
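A short simulation sketch of why dredging “works”: even pure noise produces “significant” correlations if you test enough of them:

```r
# Correlate one random response with 100 random predictors (all pure noise)
set.seed(4)
y <- rnorm(30)
pvals <- replicate(100, cor.test(rnorm(30), y)$p.value)
sum(pvals < 0.05)  # expect ~5 spurious "significant" relationships
```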
Problems with hypothesis testing
• How to prevent data dredging
• State your hypothesis before you collect data
• If you cannot reject the null you can consider another
hypothesis and test it with the data
• Do not just see what has a relationship and make up a story
Problems with hypothesis testing
• There is nothing
biologically relevant
about α = 0.05

“α = 0.05 sounded good to me.”*


-R A Fisher
* Not a quote by RA Fisher
Problems with hypothesis testing
• This could obscure many
biological phenomena
• e.g. measuring natural
selection involves
regression analysis (a
statistical model) and
thus α = 0.05 is the P
value “cut-off”

Errors in hypothesis testing
• However, theory and empirical
data show that selection can
be statistically non-significant
over one or a few generations
(i.e. P > 0.05) but can still lead
to strong evolutionary
outcomes over long time
periods, so long as it is
consistent
Errors in hypothesis testing
• Hypothesis testing and causation
• Causation is an action that connects one process (the
cause) with another process or state (the effect), where
the first is understood to be partly responsible for the
second, and the second is dependent on the first
• To test causation requires a hypothesis which directly tests predictions generated by the stated causal mechanism
• Causation is best studied using manipulative experiments
Errors in hypothesis testing
• Hypothesis testing and causation
• A P-value only tells you how improbable your result would be if the null were really true, given the data you have collected
• By itself, it tells you nothing regarding cause and effect
Errors in hypothesis testing
• Hypothesis testing and causation
• The best examples of this are the most ridiculous
• All relationships presented in the link below are statistically
significant, i.e. P < 0.05
• http://www.tylervigen.com/spurious-correlations
Problems with hypothesis testing
• Many, many people do not understand, forget, or never
learned that hypothesis testing does not indicate the
strength of a finding
• Hypothesis testing asks, “Is there an effect?”
• Estimation, more specifically effect sizes, ask, “how large is
the effect?”
• Effect sizes – a quantitative measure of the strength of an observation
Problems with hypothesis testing
• Effect size estimation complements hypothesis testing
Follow these steps:
1. P-values tell you if there is a difference
2. Effect size estimates tell you how large that difference is
2a. Including confidence intervals for effect sizes provides an estimate
of uncertainty around the effect size estimate

Do this and you will be cool, like super cool


Problems with hypothesis testing
• However, effect sizes are more complicated than P-values, as there are many options which depend on the model used (much like power analysis)
• e.g. Pearson’s r, coefficient of determination (R2), eta-squared (η2), omega-squared (ω2), Cohen's d, just to name a few

https://cran.r-project.org/web/packages/compute.es/index.html

https://cran.r-project.org/web/packages/bootES/index.html
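Those packages wrap many such measures; as a hand-rolled sketch (with simulated data), Cohen’s d for two groups is just the difference in means divided by the pooled standard deviation:

```r
# Cohen's d by hand for two hypothetical groups of equal size
set.seed(5)
g1 <- rnorm(20, mean = 10, sd = 2)
g2 <- rnorm(20, mean = 12, sd = 2)
pooled_sd <- sqrt((19 * var(g1) + 19 * var(g2)) / 38)  # (n1-1), (n2-1) weights
d <- (mean(g2) - mean(g1)) / pooled_sd
d  # standardized effect size; ~0.8 is conventionally "large"
```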
