Hypothesis Power Analysis

Hypothesis Testing

1
Hypothesis Testing
• The science part of data science frequently involves
forming and testing hypotheses about our data and
the processes that generate it.
• A car manufacturing company releases a new car and claims that it gives a mileage of 25 km per liter.
• Should we believe it or not? And why should or
shouldn't we believe it?
• Hypothesis testing is a method that can be used to
make decisions in such situations.
2
Hypothesis Testing
Hypothesis testing generally involves four steps.
1. First, we develop two claims: the null hypothesis (H0) and the alternative
hypothesis (Ha). In our case ‘the car can run 25 km per liter’ is the null
hypothesis and ‘the car cannot run 25 km per liter’ is the alternative
hypothesis.
2. Second, we collect a sample (a sample of newly manufactured cars), gather
the relevant data from it, and summarize the data using statistics.
3. Then, we calculate how probable our observed result would be if the null
hypothesis were true.
4. Finally, we reach a conclusion based on the result of the previous step.
If the probability of the observed data is very low, we can safely discard our
null hypothesis in favour of the alternative hypothesis. If it is not low, we
can continue to believe the null.
Thus, by using hypothesis testing we can evaluate mutually exclusive claims
and choose the one which is best supported by our data.
3
Hypothesis Testing
Hypothesis testing for a population proportion:
• When the population parameter we are interested in is
categorical in nature (for example, weight category:
underweight, normal weight, and overweight), we use a z-test
for hypothesis testing on the population proportion.

• z-test for population proportion
• The z-test is so named because it depends on the z-score.
• Hypothesis testing with the z-test follows the same four general
steps of hypothesis testing.
4
Hypothesis Testing
(Running example: p0 = 0.3, p = 0.35, n = 100)

• In the first step, we declare our null and alternative
hypotheses (H0 and Ha). For example, let's say research
claims that 30% of students in a particular university own
an iPhone. So our null hypothesis will be
• H0: p = p0 = 0.3.
• We then collect random sample of size n = 100 students
from the university.
• The sample size should meet two criteria:
• (i) np0 >= 10 and (ii) n(1-p0) >= 10.
• Since both of our criteria are met,
• (for (i) we get 30 and for (ii) we get 70),
• our choice of sample size is fine.
5
Hypothesis Testing
• Now suppose, we found that out of 100 students 35 students
used iPhone.
• Then, we can find our sample proportion
• p = 35/100 = 0.35, followed by calculation of the z-score, given as:
• z = (p − p0) / √(p0(1 − p0)/n)
• Plugging in the values, we find our z-score:
• z = (0.35 − 0.3) / √(0.3 × (1 − 0.3)/100)
• ≈ 1.091
• What does a z-score of 1.091 mean?
• We can interpret it as our sample proportion being 1.091
standard deviations above our null value (0.3).
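• As a quick check, this z-score can be reproduced in a few lines of Python (a minimal sketch; the variable names are ours, not from the slides):

import math

p0, p_hat, n = 0.3, 0.35, 100                      # null value, sample proportion, sample size
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)    # standardize the sample proportion
print(z)                                           # ≈ 1.091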
6
Hypothesis Testing
• P-Values:
• An alternative way of thinking about the preceding test involves p-values.
• Instead of choosing bounds based on some probability cutoff, we compute the
probability, assuming H0 is true, of seeing a value at least as extreme as
the one we actually observed.
• Now, we will find the corresponding probability (p-value) from the calculated z value.
• The z-value we found has a special property: it follows the standard normal
distribution.
• At z = 0, the sample proportion equals the null value, and the p-value is at
its highest.
• As we move away from zero in either direction (to the right for positive z,
to the left for negative z), the distance from the null value grows and the
p-value decreases.
• The greater the magnitude of z, the smaller the p-value.
– For example, if we have z-values 1, -0.5, -2.5, and 2, then the p-value for z = -2.5 will be
the smallest, since its magnitude (2.5) is the largest.
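• We can verify this numerically (a sketch assuming two-sided p-values; scipy is our choice of library, not named on the slides):

from scipy.stats import norm

for z in [1, -0.5, -2.5, 2]:
    p = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value under a standard normal
    print(z, round(p, 4))
# z = -2.5 gives the smallest p-value (≈ 0.0124)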
7
Statistical Hypothesis Testing
• Statistical Hypothesis Testing
• A statistical hypothesis test makes an assumption about the
outcome, called the null hypothesis.
• For example, the null hypothesis for the Pearson’s correlation
test is that there is no relationship between two variables.
• The null hypothesis for the Student’s t test is that there is no
difference between the means of two populations.
• The test is often interpreted using a p-value, which is the
probability of observing the result given that the null
hypothesis is true, not the reverse, which is a common
misinterpretation.

8
Statistical Hypothesis Testing
• p-value (p): Probability of obtaining a result equal to or more extreme
than was observed in the data.
• In interpreting the p-value of a significance test, you must specify a
significance level, often referred to by the Greek lowercase letter alpha (α).
A common value for the significance level is 5%, written as 0.05.
• The p-value is interpreted in the context of the chosen significance level. The
result of a significance test is claimed to be “statistically significant” if the
p-value is less than the significance level. This means that the null
hypothesis (that there is no effect) is rejected.
• p <= alpha: reject H0, different distribution.
• p > alpha: fail to reject H0, same distribution.
• Where:
• Significance level (alpha): Boundary for specifying a statistically significant
finding when interpreting the p-value.
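• In code, this decision rule is a single comparison (illustrative values; p is assumed to come from a test such as the iPhone example):

alpha = 0.05   # significance level
p = 0.137      # p-value from the iPhone example

if p <= alpha:
    print('Reject H0: statistically significant')
else:
    print('Fail to reject H0')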
9
Statistical Hypothesis Testing
• We can see that the p-value is just a probability and that in actuality
the result may be different.
• The test could be wrong. Given the p-value, we could make an error
in our interpretation.
• There are two types of errors:
• Type I Error. Rejecting the null hypothesis when there is in fact no
significant effect (false positive). The p-value is optimistically small.
• Type II Error. Failing to reject the null hypothesis when there is a
significant effect (false negative). The p-value is pessimistically large.
• In this context, we can think of the significance level as the
probability of rejecting the null hypothesis if it were true. That is the
probability of making a Type I Error or a false positive.

10
Statistical Hypothesis Testing
• What Is Statistical Power?
• Statistical power, or the power of a hypothesis
test is the probability that the test correctly
rejects the null hypothesis.
• That is, it is the probability of a true positive result.
• More precisely, statistical power is the probability that a test
will correctly reject a false null hypothesis; it is meaningful
only when the null hypothesis is in fact false.
11
Statistical Hypothesis Testing
• The higher the statistical power for a given experiment, the lower the
probability of making a Type II (false negative) error.
• That is, the higher the probability of detecting an effect when there
is an effect.
• In fact, the power is precisely the complement of the probability of a Type
II error.
– Power = 1 − Type II Error
– Pr(True Positive) = 1 − Pr(False Negative)
• More intuitively, the statistical power can be thought of as the
probability of accepting the alternative hypothesis when the
alternative hypothesis is true.
• When interpreting statistical power, we seek experimental setups
that have high statistical power.
12
Statistical Hypothesis Testing
• Low Statistical Power: Large risk of committing Type II errors,
e.g. a false negative.
• High Statistical Power: Small risk of committing Type II errors.
• Experimental results with too low statistical power will lead to
invalid conclusions about the meaning of the results. Therefore
a minimum level of statistical power must be sought.
• It is common to design experiments with a statistical power of
80% or better, e.g. 0.80. This means a 20% probability of
encountering a Type II error.
• This differs from the 5% likelihood of encountering a Type I
error under the standard significance level.

13
Power Analysis
• Statistical power is one piece in a puzzle that has four related parts; they
are:
• Effect Size. The quantified magnitude of a result present in the population.
• Effect size is calculated using a specific statistical measure, such as
Pearson’s correlation coefficient for the relationship between variables or
Cohen’s d for the difference between groups.
• Sample Size. The number of observations in the sample.
• Significance. The significance level used in the statistical test, e.g. alpha.
Often set to 5% or 0.05.
• Statistical Power. The probability of accepting the alternative hypothesis if
it is true.
• All four variables are related. For example, a larger sample size can make
an effect easier to detect, and the statistical power can be increased in a
test by increasing the significance level.
14
Power Analysis
• We can use either a z-table or software to find the p-value
corresponding to a given z-value.
• The p-value gives the probability of getting the observed data
if the null hypothesis were true.
• So, if we get a very small p-value (usually smaller than 0.05), we can
reject the null hypothesis.
• If the p-value is large, we can continue to believe the claim made
by the research that 30% of the students in the university use iPhones.
• From the z-value we calculated, we get a corresponding (one-sided) p-value of
0.137.
• Since the p-value is greater than the significance level of 0.05, we can
continue to believe our null hypothesis.
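• This p-value can be reproduced from the z-score (a sketch using scipy's normal CDF, our choice of library):

import math
from scipy.stats import norm

p0, p_hat, n = 0.3, 0.35, 100
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # ≈ 1.091
p_value = 1 - norm.cdf(z)                         # one-sided p-value, ≈ 0.137
print(p_value)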

15
Power Analysis
• A power analysis involves estimating one of these four
parameters given values for three other parameters.
• This is a powerful tool in both the design and in the analysis of
experiments that we wish to interpret using statistical
hypothesis tests.
• For example, the statistical power can be estimated given an
effect size, a sample size, and a significance level. Alternatively, the
required sample size can be estimated given a desired effect size,
significance level, and statistical power.
• Power analysis answers questions like “how much statistical
power does my study have?” and “how big a sample size do I
need?”.
16
Power Analysis
• The most common use of a power analysis is
in the estimation of the minimum sample size
required for an experiment.
• Power analyses are normally run before a
study is conducted.
• A prospective or a priori power analysis can be
used to estimate any one of the four power
parameters but is most often used to estimate
required sample sizes.
17
Power Analysis
• As a practitioner, we can start with sensible defaults for some
parameters, such as a significance level of 0.05 and a power level of
0.80.
• We can then estimate a desirable minimum effect size, specific to
the experiment being performed.
• A power analysis can then be used to estimate the minimum
sample size required.
• In addition, multiple power analyses can be performed to provide a
curve of one parameter against another, such as the change in the
size of an effect in an experiment given changes to the sample size.
• More elaborate plots can be created varying three of the
parameters. This is a useful tool for experimental design.
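• As a concrete illustration, the sketch below (using statsmodels, which later slides also use) runs one power analysis per candidate effect size to estimate the minimum sample size at the default alpha and power:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect in [0.2, 0.5, 0.8]:   # small, medium, large (Cohen's d)
    n = analysis.solve_power(effect_size=effect, power=0.8, alpha=0.05)
    print('effect = %.1f -> n per group = %.1f' % (effect, n))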

18
Student’s t Test Power Analysis
• We can make the idea of statistical power and power analysis concrete
with a worked example.
• We look at the Student’s t test, a statistical hypothesis test for
comparing the means of two samples of Gaussian variables.
• The assumption, or null hypothesis, of the test is that the sample
populations have the same mean, e.g. that there is no difference between
the samples or that the samples are drawn from the same underlying
population.
• The test will calculate a p-value that can be interpreted as to whether the
samples are the same (fail to reject the null hypothesis), or there is a
statistically significant difference between the samples (reject the null
hypothesis).
• A common significance level for interpreting the p-value is 5% or 0.05.
• Significance level (alpha): 5% or 0.05.
19
Student’s t Test Power Analysis
• The size of the effect of comparing two groups can be quantified
with an effect size measure.
• A common measure for comparing the difference in means
between two groups is Cohen’s d.
• It calculates a standard score that describes the difference in terms
of the number of standard deviations that the means are different.
• A large effect size for Cohen’s d is 0.80 or higher, as is commonly
accepted when using the measure.
• Effect Size: Cohen’s d of at least 0.80.
• We can use the default and assume a minimum statistical power of
80% or 0.8.
• Statistical Power: 80% or 0.80.
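• For reference, Cohen’s d can be computed from two samples roughly as follows (a sketch using the pooled standard deviation; the function name is ours):

import numpy as np

def cohens_d(x1, x2) -> float:
    # Difference in means, expressed in units of the pooled standard deviation.
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / pooled_sd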

20
Student’s t Test Power Analysis
• For a given experiment with these defaults, we may be interested in
estimating a suitable sample size.
• That is, how many observations are required from each sample in
order to detect an effect of 0.80, with an 80% chance of
detecting the effect if it is real (a 20% chance of a Type II error) and
• a 5% chance of claiming an effect when there is no such effect (Type I
error)?
• We can solve this using a power analysis.
• The statsmodels library provides the TTestIndPower class for
calculating a power analysis for the Student’s t test with independent
samples.
• Of note is the TTestPower class that can perform the same analysis
for the paired Student’s t test.
21
Student’s t Test Power Analysis
• The function solve_power() can be used to calculate one of the four
parameters in a power analysis.
• In our case, we are interested in calculating the sample size.
• We can use the function by providing the three pieces of information we
know (alpha, effect, and power) and setting the argument we wish to
calculate (nobs1) to None.
• This tells the function what to calculate.
• A note on sample size: the function has an argument called ratio, which is
the ratio of the number of observations in the second sample to the first.
• If both samples are expected to have the same number of observations,
then the ratio is 1.0. If, e.g., the second sample is expected to have half as
many observations, then the ratio would be 0.5.
• A TTestIndPower instance must be created;
• then we can call solve_power() with our arguments to estimate the
sample size for the experiment.
22
Student’s t Test Power Analysis
# perform power analysis (Python)
analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

# The complete example is listed below.

# estimate sample size via power analysis
from statsmodels.stats.power import TTestIndPower

# parameters for power analysis
effect = 0.8
alpha = 0.05
power = 0.8

# perform power analysis
analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)
print('Sample Size: %.3f' % result)

# Output:
# Sample Size: 25.525
Running the example calculates and prints the estimated number of
samples for the experiment as 25.525, which we round up to 26. This would be
a suggested minimum number of observations per group required to see an
effect of the desired size.

23
Student’s t Test Power Analysis
• We can go one step further and calculate power curves.
• Power curves are line plots that show how the change in variables, such
as effect size and sample size, impact the power of the statistical test.
• The plot_power() function can be used to create power curves. The
dependent variable (x-axis) must be specified by name in the ‘dep_var‘
argument.
• Arrays of values can then be specified for the sample size (nobs), effect
size (effect_size), and significance (alpha) parameters.
• One or multiple curves will then be plotted showing the impact on
statistical power.
• For example, we can assume a significance of 0.05 (the default for the
function) and explore the change in sample size between 5 and 100
with low, medium, and high effect sizes.

24
Student’s t Test Power Analysis
# calculate power curves for varying sample and effect size
from numpy import array
from matplotlib import pyplot
from statsmodels.stats.power import TTestIndPower

# parameters for power analysis
effect_sizes = array([0.2, 0.5, 0.8])
sample_sizes = array(range(5, 100))

# calculate power curves from multiple power analyses
analysis = TTestIndPower()
analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes)
pyplot.show()
25
Student’s t Test Power Analysis
• Running the example creates the plot showing the impact on
statistical power (y-axis) for three different effect sizes (es) as the
sample size (x-axis) is increased.

We can see that if we are interested in a large effect, a point of diminishing
returns in terms of statistical power occurs at around 40 to 50 observations.

Ref: https://machinelearningmastery.com/statistical-power-and-power-
analysis-in-python/
26
27
Hypothesis Testing
• Applications
• Hypothesis testing has a variety of applications. Most
prominently, it is used to verify the claims of an
organization, or to choose between two options by
evaluating which one is better. Some specific examples are
described below:
• 1. Testing performance: We deploy multiple statistical
models and need to find out the best one.
• For example, we might be trying to improve our 'related
content' recommendation engine. Hypothesis testing is
used to decide which model is best.
28
Hypothesis Testing
• Applications
• 2. A/B Testing: We deploy different versions of web products and
need to find out which one is best. Does the red button lead to higher
conversion in sign-ups? Does the blue button increase the total number
of purchases? Hypothesis testing is used to decide which version is
better.
• 3. In social science, it can be used to verify claims such as "domestic
violence against women is higher in rural areas than in urban areas".
• 4. In the healthcare industry, it is used to evaluate whether a proposed
drug is effective, whether it is more effective than existing drugs, or
even whether hospital carpeting results in more infections.
This wide range of applications makes hypothesis testing an
important skill for any data science professional.
29
Hypothesis and Inferences
Confidence Interval
• We have been testing hypotheses about the
value of the heads probability p, which is a
parameter of the unknown “heads”
distribution.
• When this is the case, a third approach is to
construct a confidence interval around the
observed value of the parameter.

30
Hypothesis and Inferences
• Confidence Interval
• Example: estimate the probability of the unfair coin
by looking at the average value of the Bernoulli
variables corresponding to each flip:
1 if heads and 0 if tails.
• If we observe 525 heads out of 1000 flips, we
estimate p ≈ 0.525. If we knew the exact value of p,
the central limit theorem tells us that the average of
those Bernoulli variables would be approximately
normal, with mean p and standard deviation:
31
Hypothesis and Inferences
• Confidence Interval
math.sqrt(p * (1 - p) / 1000)
• Here we don’t know p, so instead we use our estimate:
p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)   # 0.0158
• This is not entirely justified, but people seem to do it anyway.
• Using the normal approximation, we conclude that we are
“95% confident” that the following interval contains the true
parameter p:
• normal_two_sided_bounds(0.95, mu, sigma)   # [0.4940, 0.5560]
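• The helper normal_two_sided_bounds is not defined on these slides; a minimal sketch, assuming a normal distribution and using scipy's inverse CDF, could look like this:

from scipy.stats import norm

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """Symmetric interval around mu that contains the given probability."""
    tail = (1 - probability) / 2   # e.g. 0.025 in each tail for a 95% interval
    return norm.ppf(tail, mu, sigma), norm.ppf(1 - tail, mu, sigma)

# normal_two_sided_bounds(0.95, 0.525, 0.0158) -> approximately (0.494, 0.556)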

32
More examples on hypothesis
testing – self study

33
Hypothesis and Inferences
P-Hacking:
• A procedure that erroneously rejects the null hypothesis only 5% of the
time will, by definition, 5% of the time erroneously reject the null
hypothesis:
import random
from typing import List

def run_experiment() -> List[bool]:
    """Flips a fair coin 1000 times; True = heads, False = tails."""
    return [random.random() < 0.5 for _ in range(1000)]
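• The rejection count can then be reproduced with a counting step along these lines (a sketch building on the block above; 469 and 531 are the usual two-sided 5% bounds for 1000 fair flips):

def reject_fairness(experiment: List[bool]) -> bool:
    """Rejects fairness at the 5% significance level."""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531

random.seed(0)
experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([e for e in experiments if reject_fairness(e)])
print(num_rejections)   # num_rejections == 46 with this seed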
• This means that if you test enough hypotheses against your dataset, one
of them will almost certainly appear significant. Remove the right
outliers, and you can get the p-value below 0.05.
• This is called p-hacking.
34
Hypothesis and Inferences
Example: Running an A/B Test:
• One of the primary responsibilities of a data scientist is experience
optimization.
• This is a euphemism for trying to get people to click on advertisements.
• One of the advertisers has developed a new energy drink targeted at data
scientists, and the VP of Advertisements wants your help choosing between
advertisement A (“tastes great!”) and advertisement B (“less bias!”).
• Being a scientist, you decide to run an experiment by randomly showing site
visitors one of the two advertisements and tracking how many people click on
each one.
• If 990 out of 1000 A-viewers click their ad, while only 10 out of 1000
B-viewers click theirs, we can be confident that A is better.
• But what if the differences are not so stark?
• In this situation we should use statistical inference.

35
Hypothesis and Inferences
Example: Running an A/B Test:
Statistical inference:
• Let’s say that NA people see ad A, and that nA of them click it. We can think
of each ad view as a Bernoulli trial where pA is the probability that someone
clicks ad A. Then (if NA is large, which it is here) we know that nA/NA is
approximately a normal random variable with mean pA and standard
deviation σA = √(pA(1 − pA)/NA).
• Similarly, nB/NB is approximately a normal random variable
with mean pB and standard deviation σB = √(pB(1 − pB)/NB).

The code for this is:

import math
from typing import Tuple

def estimated_parameters(N: int, n: int) -> Tuple[float, float]:
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

36
Hypothesis and Inferences
Example: Running an A/B Test:
Statistical inference:

• If we assume that the two normals are independent, then their difference
should also be normal, with mean pB − pA and standard deviation
√(σA² + σB²).
• This means we can test the null hypothesis that pA and pB are
the same (i.e. pA − pB = 0) by using the statistic:

def a_b_test_statistic(N_A: int, n_A: int, N_B: int, n_B: int) -> float:
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)


• which should be approximately a standard normal variable.


37
Hypothesis and Inferences
Example: Running an A/B Test:
Statistical inference:
For example, if “tastes great” gets 200 clicks out of 1000 views and “less bias” gets 180
clicks out of 1000 views, the statistic equals:
z = a_b_test_statistic(1000, 200, 1000, 180)   # -1.14
The probability of seeing such a large difference if the means were actually equal
would be:
two_sided_p_value(z)   # 0.254
which is large enough that we cannot conclude there is much of a difference. On the
other hand, if “less bias” only got 150 clicks, we would have:
z = a_b_test_statistic(1000, 200, 1000, 150)   # -2.94
two_sided_p_value(z)   # 0.003
meaning there is only a 0.003 probability of seeing such a large difference if the ads
were equally effective.
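The helper two_sided_p_value is not defined on these slides; a minimal sketch, assuming a standard normal null distribution and using scipy for the CDF, might be:

from scipy.stats import norm

def two_sided_p_value(x, mu=0, sigma=1):
    """Probability of a value at least as extreme as x, in either direction."""
    if x >= mu:
        return 2 * (1 - norm.cdf(x, mu, sigma))
    else:
        return 2 * norm.cdf(x, mu, sigma)

# two_sided_p_value(-1.14) -> approximately 0.254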

38
