
IJD® MODULE ON BIOSTATISTICS AND RESEARCH METHODOLOGY FOR THE DERMATOLOGIST

Biostatistics Series Module 2: Overview of Hypothesis Testing


Avijit Hazra, Nithya Gogtay

Abstract

Hypothesis testing (or statistical inference) is one of the major applications of biostatistics. Much of medical research begins with a research question that can be framed as a hypothesis. Inferential statistics begins with a null hypothesis that reflects the conservative position of no change or no difference in comparison to baseline or between groups. Usually, the researcher has reason to believe that there is some effect or some difference, which is the alternative hypothesis. The researcher therefore proceeds to study samples and measure outcomes in the hope of generating evidence strong enough for the statistician to be able to reject the null hypothesis. The concept of the P value is almost universally used in hypothesis testing. It denotes the probability of obtaining by chance a result at least as extreme as that observed, even when the null hypothesis is true and no real difference exists. Usually, if P is < 0.05 the null hypothesis is rejected and sample results are deemed statistically significant. With the increasing availability of computers and access to specialized statistical software, the drudgery involved in statistical calculations is now a thing of the past, once the learning curve of the software has been traversed. The life sciences researcher is therefore free to devote oneself to optimally designing the study, carefully selecting the hypothesis tests to be applied, and taking care in conducting the study well. Unfortunately, selecting the right test seems difficult initially. Thinking of the research hypothesis as addressing one of five generic research questions helps in selection of the right hypothesis test. In addition, it is important to be clear about the nature of the variables (e.g., numerical vs. categorical; parametric vs. nonparametric) and the number of groups or data sets being compared (e.g., two or more than two) at a time. The same research question may be explored by more than one type of hypothesis test. While this may be of utility in highlighting different aspects of the problem, merely reapplying different tests to the same issue in the hope of finding a P < 0.05 is a wrong use of statistics. Finally, it is becoming the norm that an estimate of the size of any effect, expressed with its 95% confidence interval, is required for meaningful interpretation of results. A large study is likely to have a small (and therefore "statistically significant") P value, but a "real" estimate of the effect would be provided by the 95% confidence interval. If the intervals overlap between two interventions, then the difference between them is not so clear-cut even if P < 0.05. The two approaches are now considered complementary to one another.

Key Words: Confidence interval, hypothesis testing, inferential statistics, null hypothesis, P value, research question

From the Department of Pharmacology, Institute of Postgraduate Medical Education and Research, Kolkata, West Bengal, and the Department of Clinical Pharmacology, Seth GS Medical College and KEM Hospital, Parel, Mumbai, Maharashtra, India

Address for correspondence: Dr. Avijit Hazra, Department of Pharmacology, Institute of Postgraduate Medical Education and Research, 244B, Acharya J. C. Bose Road, Kolkata - 700 020, West Bengal, India. E-mail: [email protected]

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License, which allows others to remix, tweak, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms. For reprints contact: [email protected]

How to cite this article: Hazra A, Gogtay N. Biostatistics series module 2: Overview of hypothesis testing. Indian J Dermatol 2016;61:137-45. Website: www.e-ijd.org. DOI: 10.4103/0019-5154.177775. Received: February, 2016. Accepted: February, 2016.

© 2016 Indian Journal of Dermatology | Published by Wolters Kluwer - Medknow

Introduction

Much of statistics involves collecting, describing, and analyzing data that are subject to random variation. Descriptive statistics presents and summarizes collected data to characterize features of their distribution. Inferential statistics analyzes sample data in order to estimate or predict characteristics of the larger population from which the sample is drawn. Biological phenomena are inherently variable, and in this age of "evidence-based medicine" an understanding of such variation through statistical approaches is essential not only for the medical researcher who intends to draw inferences from his sample, but also for the practicing clinician and the medical teacher whose responsibility is to critically appraise the presented inferences before accepting them into practice or the curriculum.


Development of new drugs, devices, and techniques is heavily dependent, nowadays, upon statistical analyses to prove their effectiveness. It is also frequently alleged that statistical analyses can be misused and misrepresented. This should be a further impetus for understanding inferential statistics.

Much of medical research begins with a research question that can be stated as a hypothesis. Testing of a hypothesis involves comparisons between sets of numbers. Comparisons are done for various reasons. A common reason for comparison is to see if there is a difference between data sets or groups. For instance, we may sample the price of 1 kg of potatoes and that of 1 kg of onions from different market locations and then compare the two sets of prices. We may be engaging in this exercise with the hypothesis that there is a price difference between the two commodities, with onions being the more expensive item. Suppose we sample the price of onions per kilogram and that of 10 g gold from the same market locations. The intention in this latter exercise is unlikely to be finding out the price difference of the two items. However, one may have a hypothesis that, in Indian markets, the price of onions goes up whenever there is a hike in gold prices, and hence the price sampling. The question in this case, therefore, is to check whether there is an association between commodity prices. Another reason for comparison may be to assess if there is agreement between sets of numbers. Thus, in the same set of subjects, we may be measuring fasting blood glucose from capillary blood samples using a glucometer on one hand and, at the same time, from whole venous plasma samples sent to a reference laboratory on the other. The intention here would be to compare the two sets of readings to judge if there is agreement between them. Whatever the reason, comparisons are formally conducted as statistical tests of hypothesis, and the results generated can be extrapolated to the populations from which the samples have been drawn.

However, not all situations require formal hypothesis testing. Let us say we give you two topical antifungal drugs and tell you that Drug A offers a cure rate of 20% in tinea corporis, while Drug B offers an 80% cure rate in the same indication, and ask which one you will use. Your answer, most likely, would be Drug B without hesitation. We do not require any formal hypothesis testing to draw an inference if the difference is this large. Simple descriptive statistics, such as percentages, are enough to help us take a clinical decision. However, today if we are offered two antifungal drugs, the cure rates claimed are more likely to be around 95% and 97%. Which one to use then? This question is not answered so easily. Efficacy-wise, they appear to be comparable. We would, therefore, start looking at other factors, such as adverse drug reaction profile, cost, and availability, to arrive at a decision. We may even wonder whether the observed 2% difference is a real occurrence or just a chance difference. It is in such situations that hypothesis testing is required to help us arrive at a decision whether an observed change or difference is statistically 'significant', so that it may become the basis for altering or adapting clinical practice.

The Null Hypothesis, Type I and Type II Errors, and Power

The statistical convention in hypothesis testing is to begin with a null hypothesis that reflects the conservative position of no change in comparison to baseline or no difference between groups. Usually, the researcher has reason to believe that there is some effect or some difference; indeed, this is usually the reason for the study in the first place! The null hypothesis is designated H0, while the clinician's working hypothesis may be one of a number of alternatives (e.g., H1 ≡ A > B or H2 ≡ B > A). Hypothesis testing requires that researchers proceed to study samples and measure outcomes in the hope of finding evidence strong enough to be able to reject the null hypothesis. This is somewhat confusing at first but becomes easier to understand if we take recourse to a legal analogy. Under Indian criminal law, an accused is presumed to be not guilty unless proven otherwise. The task before the prosecution is therefore to collect and present sufficient evidence to enable the judge to reject the presumption of not guilty, which is the null hypothesis in this case.

While putting the null hypothesis to the test, two types of error may creep into the analysis: Type I error of incorrectly rejecting the null hypothesis, the probability of which is denoted by the quantity α, and Type II error of incorrectly accepting the null hypothesis, the probability of which is designated β. Again, this is confusing, and so let us return to the criminal court, where the verdict given may be at variance with reality. Let us say a person has not committed a murder but is accused of the same. The prosecuting side collects and presents evidence in such a manner that the judge pronounces the verdict as guilty. Thus, although the person has not committed the crime, the presumption of not guilty is falsely rejected. This is Type I error. On the other hand, let us say that the person has committed the murder but the prosecution fails to provide enough evidence to establish guilt. Therefore, the judge pronounces the verdict as not guilty, which is actually not the truth. This is Type II error. Note that although lay people and even the media may interpret not guilty as innocent, this is not the correct interpretation. The judge did not use the term "innocent" but gave the verdict as "not guilty." What the judge meant was that, in the technicalities of law, sufficient evidence was not presented to establish guilt beyond reasonable doubt.
The statistician's stand in hypothesis testing is like that of the judge hearing a criminal case. He will start with the null hypothesis, and it is the clinician's job to collect enough evidence to enable the statistician to reject the null hypothesis of no change or no difference. Figure 1 diagrammatically depicts the concepts of Type I and Type II errors.

Figure 1: Diagrammatic representation of the concept of the null hypothesis and error types. Note that α and β denote the probabilities, respectively, of Type I and Type II errors. The happy faces represent error-free decisions.

Statisticians also fondly speak of the power of a study. Mathematically, power is simply the complement of β, i.e., power = (1 − β). In probability terms, power denotes the probability of correctly rejecting H0 when it is false and thereby detecting a real difference when it does exist. In practice, it is next to impossible to achieve 100% power in a study, since there will always be the possibility of some quantum of Type II error. Type I and Type II errors bear a reciprocal relationship. For a given sample size, both cannot be minimized at the same time; if we seek to minimize Type II error, the probability of Type I error will go up, and vice versa. Therefore, the strategy is to strike an acceptable balance between the two a priori. Conventionally, this is done by setting the acceptable value of α at no more than 0.05 (i.e., 5%) and that of β at no more than 0.2 (i.e., 20%). The latter is more usually expressed as a power of no less than 0.8 (i.e., 80%). The chosen values of α and β affect the size of the sample that needs to be studied: the smaller the values, the larger the size. For any research question, they are the two fundamental quantities that will influence sample size. We will deal with other factors that affect sample size in a later module.

The P Value

The concept of the P value is almost universally used in hypothesis testing. Technically, P denotes the probability of obtaining a result equal to or "more extreme" than what is actually observed, assuming that the null hypothesis is true. The boundary for "more extreme" is dependent on the way the hypothesis is tested. Before the test is performed, a threshold value is chosen, called the significance level of the test (also denoted by α), and this is conventionally taken as 5% or 0.05. If the P value obtained from the hypothesis test is less than the chosen threshold significance level, it is taken that the observed result is inconsistent with the null hypothesis, and so the null hypothesis must be rejected. This ensures that the Type I error rate is at most α. Typically the interpretation is:
• A small P value (<0.05) indicates strong evidence against the null hypothesis, so it is rejected. The alternative hypothesis may be accepted, although it is not 100% certain that it is true. The result is said to be statistically "significant"
• An even smaller P value (<0.01) indicates even stronger evidence against the null hypothesis. The result may be considered statistically "highly significant"
• A large P value (>0.05) indicates weak evidence against the null hypothesis. Therefore, it cannot be rejected, and the alternative hypothesis cannot be accepted
• A P value close to the cutoff (≈0.05 after rounding off) is considered marginal. It is better to err on the side of caution in such a case and not reject the null hypothesis.

Let us try to understand the P value concept by an example. Suppose a researcher observes that the difference in cure rate for pityriasis versicolor using single doses of two long-acting systemic antifungals on 50 subjects (25 in each group) is 11%, with an associated P value of 0.07. This means that, assuming the null hypothesis of no difference in the cure rate of the two antifungals to be true, the probability of observing a difference of 11% is 0.07, i.e., 7%. Since this is above the threshold of 5%, the null hypothesis cannot be rejected, and we have to go by the inference that the 11% difference may have occurred by chance. If another group repeats the study on 500 subjects (250 in each group) and observes the same 11% difference in cure rate, but this time with a P value of 0.03, the null hypothesis is to be rejected. The inference is that the difference in cure rate between the two drugs is statistically significant, and therefore the clinical choice should be the more effective drug. Note that, in this example, although the observed difference was the same, the P value became significant with the increase in sample size. This demonstrates one of the fallacies of taking the P value as something sacrosanct. The inference from the P value is heavily dependent on sample size. In fact, with a large enough sample, one can discover statistical significance in even a marginal difference. Thus, if we compare two antihypertensive drugs on 10,000 subjects, we may find that an observed difference of just 2 mmHg in systolic or diastolic blood pressure also turns out to be statistically significant. But is a difference of 2 mmHg clinically meaningful? Most likely it is not. If the observed difference is large, then even very small samples will yield a statistically significant P value. Therefore, P values must always be interpreted in the context of a given sample size. If the sample size is inadequate, studies will be underpowered, and even small P values will be clinically meaningless.

A P value is often given the adjective of being one-tailed (or one-sided) or two-tailed (or two-sided). Tails refer to the ends of a distribution curve, which typically has two tails. Figure 2 depicts the two tails of a normal distribution curve. A two-tailed P value implies one obtained through two-sided testing, meaning testing that has been done without any directional assumptions for the change or difference that was studied. If we are studying cure rates of two systemic antifungals for pityriasis versicolor, we should ideally do the testing as a two-sided situation, since new Drug A may be better than existing Drug B, and the reverse possibility also exists. It is seldom fair to begin with the presumption that Drug A can only be better than or similar to Drug B but will not be worse than it. However, consider a hypothetical drug that is claimed to increase height in adults. We may be justified in testing this drug as a one-tailed situation, since we know that even if a drug cannot increase adult height, it will not decrease it. Decreasing height is a biological improbability unless the drug is causing bone degeneration in the axial skeleton. A one-tailed test is more powerful in detecting a difference, but it should not be applied unless one is certain that change or difference is possible only in one direction.

Figure 2: A normal distribution curve with its two tails. Note that an observed result is likely to return a statistically significant result in hypothesis testing if it falls in one of the two shaded areas, which together represent 5% of the total area. Thus, the shaded area is the area of rejection of the null hypothesis.
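The interplay of α, β, and sample size discussed under power can be made concrete. The sketch below uses the standard normal-approximation formula for the sample size needed per group when comparing two proportions; the formula is a textbook one rather than one given in this module, and the cure rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per group for comparing two proportions
    with a two-sided test, using the normal approximation."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Stricter error limits (smaller alpha and beta) demand a larger sample:
print(n_per_group(0.60, 0.40))                          # conventional 5% / 80%
print(n_per_group(0.60, 0.40, alpha=0.01, power=0.90))  # stricter limits
```

Running this shows the per-group requirement roughly doubling when α is tightened to 0.01 and power raised to 90%, which is the "smaller the values, the larger the size" relationship stated above.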

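The worked P value example can also be reproduced in code. This is a minimal sketch of a Pearson chi-square test on a 2 × 2 table of cure counts; the module's example does not supply raw counts, so the numbers here are purely illustrative. For one degree of freedom, the chi-square variable is the square of a standard normal deviate, which lets the P value be computed with the standard library alone.

```python
from math import sqrt
from statistics import NormalDist


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (df = 1, no continuity correction) for the
    table [[a, b], [c, d]]; returns (statistic, two-sided P value)."""
    n = a + b + c + d
    # Expected counts under the null hypothesis of no association
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = sum((obs - exp) ** 2 / exp
               for obs, exp in zip([a, b, c, d], expected))
    # df = 1: P(chi2 > stat) = P(|Z| > sqrt(stat)) for standard normal Z
    p = 2 * (1 - NormalDist().cdf(sqrt(stat)))
    return stat, p


# Hypothetical trial: 60/100 cured on drug A versus 45/100 on drug B
stat, p = chi_square_2x2(60, 40, 45, 55)
print(f"chi-square = {stat:.2f}, P = {p:.3f}")  # P falls below 0.05, so reject H0
```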

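For a symmetric test statistic, the one-tailed versus two-tailed distinction described above amounts to halving the P value when the direction of the difference was fixed in advance. A minimal sketch, assuming a standard normal test statistic (the value z = 1.7 is hypothetical):

```python
from statistics import NormalDist


def p_values(z: float) -> tuple[float, float]:
    """One-sided and two-sided P values for a standard normal test
    statistic z. The one-sided value is valid only if the direction of
    the effect was specified before the data were seen."""
    one_sided = 1 - NormalDist().cdf(z)
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_sided, two_sided


one, two = p_values(1.7)
print(f"one-tailed P = {one:.3f}, two-tailed P = {two:.3f}")
```

Here the same statistic is "significant" one-sided but not two-sided, which is exactly why a one-tailed test must not be chosen after looking at the results.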
Steps in hypothesis testing

Hypothesis testing, as it stands now, should proceed through the following five steps:
• Select a study design and sample size appropriate to the research question or hypothesis to be tested
• Decide upon the hypothesis test (i.e., the test of statistical significance) that is to be applied for each outcome variable of interest
• Once the data have been collected, apply the test and determine the P value from the results observed
• Compare it with the critical value of P, say 0.05 or 0.01
• If the P value is less than the critical value, reject the null hypothesis (and rejoice); otherwise, accept the null hypothesis (and reflect on the reasons why no difference was detected).

With the increasing availability of computers and access to specialized statistical software (e.g., SPSS, Statistica, Systat, SAS, STATA, S-PLUS, R, MedCalc, Prism, etc.), the drudgery involved in statistical calculations is now a thing of the past. Medical researchers are therefore free to devote their energy to optimally designing the study, selecting the appropriate tests to be applied based on sound statistical principles, and taking care in conducting the study well. Once this is done, the computer will work on the data that are fed into it and take care of the rest. The argument that statistics is time-consuming can no longer be an excuse for not doing the appropriate analysis.

Which test to apply in a given situation

A large number of statistical tests are used in hypothesis testing. However, most research questions can be tackled through a basket of some 36 tests. Let us follow an algorithmic approach to understand the selection of the appropriate test. To do so, we convert the specific research question in our study to a generic question. It turns out that the vast majority of research questions we tackle can fit into one of five generic questions:
a. Is there a difference between groups or data sets that are unpaired?
b. Is there a difference between groups or data sets that are paired?
c. Is there an association between groups or data sets?
d. Is there an agreement between groups or data sets?
e. Is there a difference between time to event trends?
Let us pick up each question and follow the algorithm that leads to the tests. We will discuss the pros and cons of individual tests in subsequent modules. For the time being, let us concentrate on the schemes that are based on the context of the individual questions.

Question 1. Is there a difference between groups or data sets that are unpaired (parallel or independent)?

This is perhaps the most common question encountered in clinical research. The tests required to answer this question are decided on the basis of the nature of the data and the number of groups to be compared. The data sets or groups need to be independent of one another, i.e., there should be no possibility of the values in one data set influencing values in the other set or being related to them in some way. If related, then the data sets will have to be treated as paired.
Thus, if we compare the blood glucose values of two independent sets of subjects, the data sets are unpaired. However, if we impose a condition, such as all the individuals comprising one data set being brothers or sisters of the individuals represented in the other data set, then there is a possibility of corresponding values in the two data sets being related in some way (because of genetic or other familial reasons), and the data sets are no longer independent. Figure 3 provides a flowchart for test selection in the context of this question.

Figure 3: Tests to assess statistical significance of difference between data sets that are independent of one another
• Numerical data, parametric: 2 groups to be compared → Student's unpaired t test; more than 2 groups → analysis of variance (ANOVA, F test)
• Numerical data, otherwise (nonparametric): 2 groups → Mann-Whitney U test; more than 2 groups → Kruskal-Wallis ANOVA (H test)
• Categorical data: 2 groups → Chi-square (χ²) test or Fisher's exact test; more than 2 groups → Chi-square (χ²) test

Note that numerical data have been subcategorized as "parametric" or "otherwise." Numerical data that follow the parameters of a normal distribution curve come in the first subcategory. In other words, parametric data are normally distributed numerical data. If the distribution is skewed, if there is no particular distribution, or simply if the distribution is unknown, then the data must be considered nonparametric. How do we know whether numerical data are normally distributed? We can look at the two measures of central tendency, mean and median. The properties of the normal distribution curve tell us that these should coincide. Hence, if the mean and median are the same or are close to one another (compared to the total spread of the data), then we are probably dealing with parametric data. However, a more foolproof way is to formally test the fit of the data to a normal distribution using one of a number of "goodness-of-fit" tests, such as the Kolmogorov–Smirnov test, the Shapiro–Wilk test, or the D'Agostino and Pearson omnibus normality test. If such a test returns P < 0.05, it implies that the null hypothesis of no difference between the data's distribution and a normal distribution has to be rejected, and the data are taken to be nonparametric. The normal probability plot is a graphical method of deducing normality of continuous data.

Note also that whenever more than two groups are to be compared at a time, we have a multiple group comparison situation. A multiple group comparison test (like one-way analysis of variance [ANOVA], Kruskal–Wallis ANOVA, or the Chi-square test) will tell us whether there is a statistically significant difference overall. It will not point out exactly between which two groups or data sets the significant difference lies. If we need to answer this question, we have to follow up the multiple group comparison test with a so-called post hoc test. Thus, if ANOVA or its nonparametric counterpart shows a significant difference between the multiple groups tested, we can follow it up with various post hoc tests like:
• Parametric data: Tukey's honestly significant difference test (Tukey–Kramer test), Newman–Keuls test, Bonferroni's test, Dunnett's test, Scheffe's test, etc.
• Nonparametric data: Dunn's test.

Question 2. Is there a difference between groups or data sets that are paired (cross-over type or matched)?

Data sets are considered to be paired if there is a possibility that values in one data set influence the values in another or are related to them in some way. There is often confusion over which data sets to treat as paired. The following provide a guide:
• Before-after or time series data: A variable is measured before an intervention, and the measurement is repeated at the end of the intervention. There may be periodic interim measurements as well. The investigator is interested in knowing if there is a significant change from the baseline value with time
• A crossover study is done, and both arms receive both treatments, though at different times. Comparison needs to be done within a group in addition to between groups
• Subjects are recruited in pairs, deliberately matched for key potentially confounding variables such as age, sex, disease duration, and disease severity. One subject gets one treatment, while his paired counterpart gets the other
• Measurements are taken more than once, and comparison needs to be made between such repeated sets (e.g., duplicate or triplicate) of measurements
• Variables are measured in other types of pairs, for example, right-left, twins, parent-child, etc.

Many instances of pairing are obvious. A before-after (intervention) comparison would be paired. If we have time series data, such as Psoriasis Area and Severity Index scores estimated at baseline and every 3 months over 1 year, then all the five sets of data are paired to one another. Similarly, twin studies, sibling studies, parent-offspring studies, cross-over studies, and matched case–control studies usually involve paired comparisons. However, some instances of pairing are not so obvious. Suppose we are evaluating two topical treatments for atopic dermatitis, which is a symmetrical dermatosis, and decide to use one half of the body (randomly selected) for the test treatment and the other half for some control treatment. We may assume that data accruing from the test half and the control half are independent of one another, but is this assumption correct? Are we certain that if the test treatment works and the lesions on the test half regress, this is in no way going to influence the results in the control half? That there would be absolutely no systemic effect of the treatment that would influence the control half? If we are certain, then by all means we can go ahead and treat the two data sets as unpaired. Otherwise, it is preferable to treat them as paired.
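The crude mean-versus-median screen for normality described under Question 1 can be sketched as follows. The tolerance of 0.25 standard deviations used here is an arbitrary illustrative cutoff, and a formal goodness-of-fit test (Kolmogorov–Smirnov, Shapiro–Wilk) remains the more foolproof approach.

```python
from statistics import mean, median, pstdev


def crude_normality_screen(data: list[float], tol: float = 0.25) -> bool:
    """Crude screen: for roughly normal data the mean and median should
    nearly coincide relative to the spread of the data. NOT a substitute
    for a formal goodness-of-fit test."""
    spread = pstdev(data)
    if spread == 0:
        return True  # constant data: mean equals median trivially
    return abs(mean(data) - median(data)) / spread < tol


symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]  # mean = median = 6
skewed = [1, 1, 1, 2, 2, 3, 5, 9, 30]    # long right tail pulls the mean up
print(crude_normality_screen(symmetric))  # True
print(crude_normality_screen(skewed))     # False
```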

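For unpaired nonparametric data, Figure 3 points to the Mann-Whitney U test. The sketch below is self-contained, using the large-sample normal approximation for the P value; the data are hypothetical, ties are given half-counts but the variance is not corrected for them, and real analyses should rely on statistical software.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Mann-Whitney U statistic with a normal-approximation two-sided
    P value (adequate for larger samples; no tie correction)."""
    # U counts, over all pairs, how often an x value exceeds a y value
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
            for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# Hypothetical skewed scores in two independent groups:
group_a = [12, 15, 11, 30, 14, 45, 13, 16]
group_b = [22, 40, 35, 28, 50, 31, 55, 38]
u, p = mann_whitney_u(group_a, group_b)
print(f"U = {u}, P = {p:.3f}")
```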

Hazra and Gogtay: Overview of hypothesis testing

from the test half and the control half are independent Question 3. Is there an association between groups
of one another but is this assumption correct? Are we or data sets?
certain that if the test treatment works and the lesions As seen from Figure 5, the algorithm for deciding tests
on the test half regress, this is in no way going to appropriate to this question is simpler. Correlation is a
influence the results in the control half? There would statistical procedure that indicates the extent to which
be absolutely no systemic effect of the treatment that two or more variables fluctuate together. A positive
would influence the control half? If we are certain then correlation indicates that the variables increase or
by all means, we can go ahead and treat the two data decrease in parallel; a negative (or inverse) correlation
sets as unpaired. Otherwise, it is preferable to treat indicates that as one variable increases in value, the
them as paired. other decreases.
Figure 4 provides the flowchart for test selection in With numerical data, we quantify the strength of an
the context of comparing data sets that show pairing. association by calculating a correlation coefficient. If
Note that the scheme remains the same as for the first both the variables are normally distributed, we calculate
question but the tests are different. Thus, Student’s Pearson’s product moment correlation coefficient r
unpaired or independent samples t‑test is now replaced or simply Pearson’s r. If one or both variables are
by the Student’s paired t‑test, with its nonparametric nonparametric or we do not know what their distribution
counterpart now as the Wilcoxon’s signed rank test. If are, we calculate either Spearman’s rank correlation
we are comparing two independent proportions, such coefficient Rho (ρ) or Kendall’s rank correlation
as the gender distribution in two arms of a parallel coefficient Tau (τ).
group clinical trial, we can use the Chi‑square test or
If numerical variables are found to be correlated to a
Fisher’s exact test. However, if we are comparing paired
strong degree, a regression procedure may be attempted
proportions, such as the proportion of subjects with a
to deduce a predictive quantitative relationship between
headache before and after treatment of herpes zoster
them. In the simplest scenario, if two numerical
with aciclovir, then the test to employ is McNemar’s test
variables are strongly correlated and linearly related to
which is also a Chi‑squared test.
one another, a simple linear regression analysis (by least
If multiple group comparisons are involved, such squares method) enables generation of a mathematical
as repeated measures ANOVA or its nonparametric equation to allow prediction of one variable, given the
counterpart the Friedman’s ANOVA, then once again value of the other.
post hoc tests are required if we are interested in
Associating categorical data becomes simple if we can
deciphering exactly between which two data sets the
arrange the data in two rows and two columns as a
significant difference lies. We have listed the post hoc
2 × 2 contingency table. Thus, if we want to explore the
tests under the first question. The same tests can be
association between smoking and lung cancer, we can
used adjusted for paired comparisons. However, note
categorize subjects as smokers and nonsmokers and the
that post hoc tests are run to confirm where the
outcome as lung cancer and no lung cancer. Arranging
differences occurred between groups, and they should this data in a 2 × 2 table will allow ready calculation
only be run when an overall significant difference is of a relative risk or an odds ratio, two measures of
noted. Running post hoc tests when the a priori multiple association that are hallmarks of epidemiological studies.
comparison test has not returned a significant P value is However, if we categorize subjects as nonsmokers,
tantamount to ‘data dredging’ in the hope of discovering moderate smokers, and chain smokers, then we have to
a significant difference by chance and then emphasizing look for association by calculating a Chi‑square for trend
it inappropriately. or other measures. We will take a detailed look at risk
assessment in a future module.
Figure 4: Tests to assess statistical significance of difference between data sets that are or could be paired. [Numerical data, parametric: 2 groups to be compared - Student's paired t test; >2 groups - repeated measures ANOVA. Numerical data, otherwise: 2 groups - Wilcoxon's matched pairs signed rank test; >2 groups - Friedman's ANOVA. Categorical data: 2 groups - McNemar's chi-square test; >2 groups - Cochran's Q test.]

Figure 5: Tests for association between variables. [Numerical data, both parametric: Pearson's (product moment) correlation coefficient r; otherwise: Spearman's (rank) correlation coefficient ρ or Kendall's rank correlation coefficient. Categorical data: 2 groups - relative risk, odds ratio; >2 groups - Chi-square for trend, log-linear analysis.]
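To make the paired branch of Figure 4 concrete, here is a minimal Python sketch (an illustrative aside, assuming scipy is available; the before/after scores for eight subjects are invented). The same subjects are measured twice, and the paired t test and its nonparametric counterpart are applied to the within-subject differences.

```python
# Paired comparison of hypothetical before/after scores for the same
# eight subjects (data invented for illustration).
from scipy.stats import ttest_rel, wilcoxon

before = [72, 68, 75, 80, 64, 70, 77, 69]
after  = [75, 70, 78, 84, 66, 72, 80, 70]

# Parametric route: Student's paired t test on within-subject differences.
t_stat, p_t = ttest_rel(after, before)

# Nonparametric counterpart: Wilcoxon's matched pairs signed rank test.
w_stat, p_w = wilcoxon(after, before)

print(f"paired t test: P = {p_t:.4f}")
print(f"Wilcoxon signed rank: P = {p_w:.4f}")
```

Because every subject's score rises, both tests return a small P value here; with real data the nonparametric route would be preferred when the differences are not normally distributed.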

Indian Journal of Dermatology 2016; 61(2) 142


Hazra and Gogtay: Overview of hypothesis testing

Question 4. Is there an agreement between groups or data sets?

Agreement between groups or data sets can be inferred indirectly as lack of significant difference, but there are fallacies with this approach. Let's say in a university examination all candidates are being assessed independently by an internal examiner and an external examiner and being marked out of 50 in each case. In this instance, if we are interested in seeing to what extent the assessments of the two examiners agree, we may calculate the average marks (out of 50) given by the two examiners and see if there is a statistically significant difference between means by a paired comparison. If the difference is not significant, we may conclude that the two examiners have agreed in their assessments. However, we have compared group means. Even if group means tally, there may be wide variation in the marks allotted to individual examinees by the examiners. What if the denominators were different, the internal examiner marking out of 30 and the external marking out of 70? The means will no longer tally, even if assessments of the performance of individual candidates are in agreement. We could of course convert to percentages and work with the mean of the percentages, but the initial fallacy remains. Measures of association were used earlier to denote agreement, but there are also problems there.

Therefore, statisticians agree that a better way of assessing agreement is to do so differently, using measures that adjust individual agreements for the proportion of disagreement in the overall result. The tests to do so are depicted in Figure 6.

An intraclass correlation coefficient, as we will see later in our module on correlation and regression, is interpreted in a manner similar to a correlation coefficient and is now employed widely in rater validation of questionnaires. Cohen's kappa statistic is used to compare diagnostic tests that return their results as positive or negative or as some other categorical outcome.

Figure 6: Tests for agreement between groups or data sets. [Numerical data - numerical method: intraclass correlation coefficient; graphical method: Bland-Altman plot. Categorical data, 2 or >2 categories to be compared: Cohen's kappa coefficient.]

Question 5. Is there a difference between time to event trends?

Time to an event that is going to happen in the (relatively distant) future is a special kind of numeric data. For example, let us consider two chemotherapy Regimens A and B for malignant melanoma of the skin. With Regimen A, we tell you that at the end of 5 years following treatment, 20% of the patients are expected to survive. With Regimen B, the expected 5-year survival rate is also 20%. Which regimen to choose? Obviously, on the face of it, both regimens are similar with respect to efficacy. Now, we divulge a little more information. Let's say with Regimen A there are no deaths in the first 3 years, then 40% of the patients die in the 4th year and another 40% in the 5th year, so 20% are surviving at the end of 5 years. With Regimen B, nobody dies in the 1st year, but thereafter 20% of the patients die each year, leaving 20% survivors at the end of 5 years. This time there is something to choose, and probably Regimen A will be the natural choice. Thus, whenever we are considering the time to a future event as a variable, it is not only the absolute time but also the time trend that matters.

With such an example there are other complexities. For instance, if we start with a hundred melanoma patients overall, then it may so happen that at the end of 5 years, twenty would be surviving, sixty would have died of cancer, ten may have died of unrelated causes, and the remaining ten may be simply lost to follow-up. How do we reconcile these diverse outcomes when estimating survival? This issue is dealt with by censoring in survival studies.

The algorithm is simplest for this question, as seen in Figure 7. We just have to decide on the number of groups to compare. In practice, the log-rank test is popularly used as it is simplest to understand and allows both two-group and multiple-group comparisons.

Figure 7: Tests for comparing time to event data sets. [2 groups: Cox-Mantel test, Gehan's (generalized Wilcoxon) test, or log-rank test; >2 groups: log-rank test.]

It is evident from the above schemes that when numerical data are involved, it is important to distinguish between parametric tests (applied to data that are normally distributed) and nonparametric tests (applied to data that are not normally distributed or whose distribution is unknown). It is sometimes possible to transform skewed data to a normal distribution (e.g., through a log transformation) and then analyze it through parametric tests. However, a better option is to use nonparametric tests for nonparametric data. Note that survival or time-to-event data are generally taken as nonparametric. Some of the other assumptions of parametric tests are that samples have the same variance, i.e., drawn from the same population (Levene's test or Brown–Forsythe
test can be applied to assess homogeneity of variances), observations within a group are independent, and that the samples have been drawn randomly from the population. These assumptions are not always met, but the commonly used tests are robust enough to tolerate some deviations from ideal.

Limitations of the P value approach to inference

The concept of the P value dates back to the 1770s when Pierre-Simon Laplace studied almost half a million births. He observed an excess of boys compared to girls and concluded by calculation of a P value that the excess was a real, but unexplained, effect. However, the concept was formally introduced by Karl Pearson in the early 1900s through his Pearson's Chi-square test, where he presented the Chi-squared distribution and noted p as capital P. Its use was popularized by Ronald Aylmer Fisher through his influential book Statistical Methods for Research Workers, published in 1925. Fisher proposed the level P < 0.05 (i.e., a 1 in 20 chance of results occurring by chance) as the cut-off for statistical significance. Since then the P value concept has played a central role in inferential statistics, but its limitations must be understood.

Using the P value, researchers classify results as statistically "significant" or "nonsignificant," based on whether the P value was smaller than the prespecified cut-off. This practice is now frowned upon, and the use of exact P values (to, say, three decimal places) is now preferred. This is partly for practical reasons, because the use of statistical software allows ready calculation of exact P values, unlike in the past when statistical tables had to be used. However, there is also a technical reason for this shift. The imputation of statistical significance based on a convention (P < 0.05) tends to lead to a misleading notion that a "statistically significant" result is the real thing. However, note that a P value of 0.05 means that 1 in 20 results would show a difference at least as big as that observed just by chance. Thus, a researcher who accepts a 'significant' result as real will be wrong 5% of the time (committing a Type I error). Similarly, dismissing an apparently "nonsignificant" finding as a null result may also be incorrect (committing a Type II error), particularly in a small study, in which the lack of statistical significance may simply be due to the small sample size rather than a real lack of clinical effect. Both scenarios have serious implications in the context of accepting or rejecting a new treatment. It must be clearly understood that the P value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. The presentation of exact P values allows the reader to make an informed judgment as to whether the observed effect is likely to be due to chance and this, taken in the context of other available evidence, will result in a more practical conclusion being reached.

Moreover, the P value does not give a clear indication as to the clinical importance of an observed effect. It is not like a score indicating that the smaller the P value, the more clinically important the result. A small P value simply indicates that the observed result is unlikely to be due to chance. However, just as a small study may fail to detect a genuine effect, a large study may yield a small P value (and a very large study a very small P value) based on a small difference that is unlikely to be of clinical importance. We discussed an example earlier. Decisions on whether to accept or reject a new treatment necessarily have to depend on the observed extent of change, which is not reflected in the P value.

P value vis-à-vis confidence interval

The P value provides a measure of the likelihood of a chance result, but additional information is revealed by expressing the result with appropriate confidence intervals. It is becoming the norm that an estimate of the size of any effect based on a sample (a point estimate) must be expressed with its 95% confidence interval (an interval estimate) for meaningful interpretation of results. A large study is likely to have a small (and therefore "statistically significant") P value, but a "real" estimate of the effect would be provided by the 95% confidence interval. If the intervals overlap between two treatments, then the difference between them is not so clear-cut even if P < 0.05. Conversely, even with a nonsignificant P, confidence intervals that do not overlap suggest a real difference between interventions. Increasingly, statistical packages are getting equipped with routines to provide 95% confidence intervals for a whole range of statistics. Confidence intervals and P values are actually complementary to one another, and both have an important role to play in drawing inferences.

Conclusion

The living world is full of uncertainty. Biostatistics is not a collection of clever mathematical formulae but an applied science born out of the need to make sense of this uncertainty and to deal with the random variations that arise in any study situation. All medical personnel, whatever their primary fields of activity, such as clinical practice, teaching, research, or administration, need to be aware of the fundamental principles of inferential statistics to correctly interpret study results and critically read the medical literature, so as to make informed decisions furthering their own activity. This discussion has given an overview of hypothesis testing. To acquire a working knowledge of the field, it is important to read through the statistical analysis part of research studies published in peer-reviewed journals. Mastery of the principles can, however, come only through actual number crunching using the appropriate statistical software.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

Further Reading

1. Hypothesis testing. In: Glaser AN, editor. High Yield Biostatistics. Baltimore: Lippincott Williams & Wilkins; 2001. p. 33-49.
2. P values and statistical hypothesis testing. In: Motulsky HJ, editor. Prism 4 Statistics Guide – Statistical Analysis for Laboratory and Clinical Researchers. San Diego: GraphPad Software Inc.; 2005. p. 16-9.
3. Interpreting P values and statistical significance. In: Motulsky HJ, editor. Prism 4 Statistics Guide – Statistical Analysis for Laboratory and Clinical Researchers. San Diego: GraphPad Software Inc.; 2005. p. 20-4.
4. Research questions about one group. In: Dawson B, Trapp RG, editors. Basic and Clinical Biostatistics. 4th ed. New York: McGraw-Hill; 2004. p. 93-133.
5. Research questions about two separate or independent groups. In: Dawson B, Trapp RG, editors. Basic and Clinical Biostatistics. 4th ed. New York: McGraw-Hill; 2004. p. 134-61.
