Design and Analysis of Non-Inferiority Trials
Mark D. Rothmann
Brian L. Wiens
Ivan S. F. Chan
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Rothmann, Mark D.
Design and analysis of non-inferiority trials / Mark D. Rothmann, Brian L. Wiens,
Ivan S.F. Chan.
p. ; cm. -- (Chapman & Hall/CRC biostatistics series)
Includes bibliographical references and index.
ISBN 978-1-58488-804-8 (hardback : alk. paper)
1. Drugs--Testing. 2. Experimental design. 3. Therapeutics, Experimental. I. Wiens,
Brian L. II. Chan, Ivan S. F. III. Title. IV. Series: Chapman & Hall/CRC biostatistics
series.
[DNLM: 1. Clinical Trials as Topic. 2. Research Design. 3. Therapies, Investigational.
QV 771]
RM301.27.R68 2011
615.5’80724--dc22 2011005377
Preface
In recent years there has been frequent use of non-inferiority trial designs
to establish the efficacy of an experimental agent. There has also been a pro-
liferation of research articles on the design and analysis of non-inferiority
studies. Points to Consider documents involving non-inferiority trials have
been issued by the European Medicines Agency, and there is a draft guid-
ance on non-inferiority trials that has been issued by the U.S. Food and Drug
Administration. A typical non-inferiority trial randomizes subjects to an
experimental regimen or to a standard of care, which is often referred to
as an “active” control. A non-inferiority trial places a limit on the amount
an experimental therapy is allowed to be inferior to a standard of care to
still be considered worthwhile. This limit or non-inferiority margin should
be selected so that a loss of efficacy of less than the margin relative to the
standard of care implies that the experimental therapy has efficacy (relative
to a placebo) and its efficacy is not unacceptably worse than the standard
of care. A new treatment that offers a better safety profile or a more prefer-
able method of administration compared to standard treatment may be ben-
eficial even if somewhat less effective than standard treatment. There have
been many non-inferiority clinical trials in various medical areas, including
thrombolytic, oncology, cardiorenal, and anti-infective drugs, vaccines, and
medical devices.
Design and Analysis of Non-Inferiority Trials is not intended as a substitute
for regulatory guidances on non-inferiority trials, but as a complement to
such guidances. This text provides a comprehensive discussion on the pur-
pose and issues involved in non-inferiority trials and will assist the reader in
designing a non-inferiority trial and in assessing the quality of non-inferior-
ity comparisons done in practice.
Design and Analysis of Non-Inferiority Trials is intended for statisticians and
nonstatisticians involved in drug development. Although some sections are
technical and written for an audience of statisticians, most of the book is
nontechnical and written to be easily understood by a broad audience with-
out any prior knowledge of non-inferiority clinical trials. Additionally, every
chapter begins with a nontechnical introduction.
We have strived to provide a thorough discussion on the most important
aspects involved in the design and analysis of non-inferiority trials. The first
two chapters discuss the history of non-inferiority trials and the design and
conduct considerations for a non-inferiority trial. A first step in designing a
non-inferiority trial is evaluating the previous effect of the selected active
control treatment. Chapters 3 and 4 cover the strength of evidence of an effi-
cacy finding and evaluating the effect size of a treatment. The active con-
trol therapy is identified based on knowledge of its performance in previous
trials, not independent of the results of those previous trials. Thus, addi-
tional efforts are required to understand the effect size of the active control.
Chapter 5 presents the two main analysis methods frequently used in non-
inferiority trials, their variations, and their properties. Chapter 6 discusses
the gold standard non-inferiority design that additionally includes a placebo
group. Chapters 7 through 10 cover a variety of individual issues of non-infe-
riority trials, including multiple comparisons, missing data, analysis popu-
lation, the use of safety margins, the internal consistency of non-inferiority
inference, the use of surrogate endpoints, trial monitoring, and equivalence
trials. Chapters 11 through 13 provide specific issues and analysis methods
when the data are binary, continuous, and time to event, respectively. Design
and Analysis of Non-Inferiority Trials can be read fully in the order presented.
Individual chapters can also be understood or used directly as a reference
without reading the previous chapters. A reader with little prior exposure to non-infe-
riority trials should start with Chapters 1 through 6 in the order presented,
and cover the remaining material as needed. We have also included a discus-
sion on p values, confidence intervals, and frequentist and Bayesian analyses
in the appendix.
We appreciate the assistance of all the reviewers of this book and the book
proposal for their careful, insightful review. We are also indebted to so many
at Taylor & Francis Publishing, most notably David Grubbs for his guidance
and patience.
We thank all of those who have provided discussions and interactions on
non-inferiority trials, including David Brown, Kevin Carroll, Gang Chen,
George Chi, Ralph D’Agostino, Susan Ellenberg, Thomas Fleming, Paul Flyer,
Thomas Hammerstrom, Dieter Hauschke, Rob Hemmings, David Henry, Jim
Hung, Qi Jiang, Armin Koch, John Lawrence, Ning Li, Kathryn Odem-Davis,
Robert O’Neill, Stephen Snapinn, Greg Soon, Robert Temple, Ram Tiwari,
Yi Tsong, Hsiao-Hui Tsou, Thamban Valappil, and Sue Jane Wang. We are
particularly grateful to Dr. Ellenberg for providing slides on the history of
non-inferiority trials.
We are grateful for the support and encouragement provided by our fam-
ilies. Our deepest gratitude to our wives, Shiowjen (for MR), Marilyn (for
BW), and Lotus (for IC), for their patience and support during the writing of
this book.
than a standard of care. This may be the case when the experimen-
tal therapy is in the same “drug class” as the standard therapy. It
would be necessary to demonstrate that the experimental therapy
has efficacy either better than or not too much worse than the stan-
dard therapy.
Case 4: The experimental regimen replaces one drug in a standard regimen
of multiple drugs with the experimental drug. For the experi-
mental regimen to be considered as an alternative to the standard
regimen, it may be necessary for the experimental regimen to have
better efficacy than every drug, drug combination, and regimen for
which that standard combination is superior. If each component
of the drug combination for the experimental arm is regarded as
“active,” it may also be necessary for the experimental combina-
tion to have more efficacy than any subset of the drugs in that drug
combination. When a new standard of care demonstrates improved
survival over the previous standard of care, it may (or may not) be
unethical to give patients that previous standard of care for that
indication or line of therapy. Thus, it is important that any therapy
being considered for use has sufficient efficacy to be considered an
ethical therapy for the studied indication.
Case 5: The purpose of an experimental drug is to reduce the chance
of toxicities or side effects to patients caused by a standard therapy. It
is important to study whether the experimental drug interferes with
the effectiveness of the standard therapy; that is, to study the amount
of efficacy of the standard therapy that may be lost by additionally
providing patients with the experimental therapy. The standard
therapy with the experimental drug may be worthwhile to patients
relative to the standard therapy alone if the standard therapy plus the
experimental drug has fewer toxicities or side effects than the standard
therapy alone, despite having slightly less efficacy than the standard
therapy alone. However, a lower dose (or less frequent use) of the stan-
dard therapy may also have fewer toxicities or side effects than the reg-
ular dose of the standard therapy. While a trial comparing the regular
dose of the standard therapy with and without the experimental drug
provides efficacy and safety data on the two regimens, it may not
(depending on the results of the trial) provide evidence of the neces-
sity of the experimental drug, unless the dose–response relationship
on efficacy and safety is known for the standard therapy.
therapy (“me too drugs”) has been criticized.8,9 This is particularly problem-
atic when nonrigorous margins are used, potentially leading to a “biocreep,”
in which an inferior therapy is used as the control therapy for the next gen-
eration of non-inferiority trials.
There may not be an appropriate choice for the active comparator for a
non-inferiority trial of efficacy even when it may seem that a non-inferi-
ority trial is the appropriate choice. Per the International Conference on
Harmonization (ICH) E9 Guidance10: “A suitable active comparator could
be a widely used therapy whose efficacy in the relevant indication has been
clearly established and quantified in well-designed and well-documented
superiority trials and which can be reliably expected to have similar effi-
cacy in the contemplated active control trial.” Per the ICH-E10 guidelines,11
an active control can be used in a non-inferiority trial when its effect is
(1) of substantial magnitude compared to placebo or some other reference
therapy, (2) precisely estimated, and (3) relevant to the setting of the non-
inferiority trial. Due to the importance of the effect of the active control
(the motivation for conducting an active control trial), the non-inferiority
margin should be sufficiently small so that demonstrating non-inferiority
leads to the conclusion that the experimental therapy preserves a substan-
tial fraction of the active control effect and that the use of the experimental
therapy instead of the active control therapy will not result in a clinically
meaningful loss of effectiveness.12
For an active-controlled clinical trial, the efficacy requirements are less rig-
orous in a non-inferiority comparison (less needs to be statistically ruled out)
than in a superiority comparison. That is, when compared with an effective
standard therapy, it is easier to demonstrate that the experimental therapy
has noninferior efficacy than to demonstrate that it has superior efficacy.
Thus, when the efficacy of an experimental therapy must be determined
against an effective standard therapy, non-inferiority may be preferred as
the main objective instead of superiority. When the experimental therapy
has a small efficacy advantage over the standard therapy, a superiority trial
having the standard therapy as the control therapy would require a large
number of subjects to be adequately powered (e.g., at least 80% power). When
the experimental therapy has no efficacy advantage over the standard ther-
apy, it is impossible to design an adequately powered superiority trial that
has the standard therapy as the control therapy.
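To make the power comparison concrete, below is a minimal sketch (ours, not the book's) of the usual normal-approximation power calculation for comparing means, using symbols that mirror those defined later in Chapter 2 (Δa for the assumed true difference E − C, δ for the margin, σ for the common standard deviation). Setting δ = 0 gives the superiority test, whose power at Δa = 0 equals α no matter how large the trial is.

```python
import math
from scipy.stats import norm

def power_one_sided(delta_a, delta, sigma, n_per_arm, alpha=0.025):
    """Approximate power of the one-sided test that rules out a true
    difference in effects (E - C) of -delta or worse, when the true
    difference equals delta_a. delta = 0 gives a superiority test.
    Assumes equal allocation and a common standard deviation sigma."""
    se = sigma * math.sqrt(2.0 / n_per_arm)  # SE of the estimated difference
    return norm.cdf((delta_a + delta) / se - norm.ppf(1 - alpha))

# With no true advantage (delta_a = 0), a superiority test has power equal
# to alpha no matter how large the trial is ...
print(power_one_sided(0, 0, sigma=30, n_per_arm=10_000))  # 0.025
# ... while a non-inferiority test with margin delta = 10 can be well
# powered at that same alternative.
print(power_one_sided(0, 10, sigma=30, n_per_arm=200))    # about 0.92
```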
TABLE 1.1
Description of Each Type of Comparison

Type of Comparison   Description
Inferiority          The experimental arm is worse than the control arm.
Equivalence          The absolute difference between the experimental arm and
                     the control arm is smaller than a prespecified margin.
Non-inferiority      The experimental arm is either better than the control arm
                     or the experimental arm is inferior to the control arm by
                     less than some prespecified margin.
Superiority          The experimental arm is better than the control arm.
Difference           The study arms are not equal. Either the experimental arm
                     is worse than the control arm or the experimental arm is
                     better than the control arm.
arm is also noninferior to the control arm. When the equivalence margin cor-
responds to the non-inferiority margin and the experimental arm is “equiva-
lent” to the control arm, then the experimental arm is also noninferior to the
control arm.
Whether a specific relation can be concluded between the control and
experimental arms is often reduced to comparing a confidence interval for
the difference in effects with either zero, a non-inferiority margin of δ, or
equivalence limits of ±δ (for some δ > 0). For a prespecified confidence level,
a confidence interval does not contain those cases that have been ruled out
by the data. For a confidence level of 100(1 − α)%, the method for determining
the confidence interval is such that, before observing the data, there was a
100(1 − α)% chance (or greater) that the confidence interval would capture
the true value. As such, about 95% of all 95% confidence intervals capture the
true value of the parameter that is being estimated.
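This coverage property is easy to check by simulation. The sketch below (illustrative only, with σ treated as known for simplicity) builds 100,000 95% confidence intervals for a normal mean and counts how often they capture the true value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12345)
mu, sigma, n, n_trials = 5.0, 2.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(n_trials, n))
means = samples.mean(axis=1)
half_width = norm.ppf(0.975) * sigma / np.sqrt(n)  # known-sigma interval

covered = (means - half_width <= mu) & (mu <= means + half_width)
print(covered.mean())  # close to 0.95
```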
In Figure 1.1, since confidence interval A contains only negative values for
the difference in the effects between the control and experimental therapies
(C–E), the experimental therapy is concluded to be superior to the control
therapy.

FIGURE 1.1
Relationship between different types of conclusions. (Confidence intervals
A through E for C – E are shown against the reference values –δ, 0, and δ;
values to the left of zero favor the experimental therapy and values to the
right favor the control.)

Because confidence interval B contains only values less than δ, the
experimental therapy is concluded to be noninferior to the control therapy
with respect to the margin δ. However, as confidence interval B contains both
positive and negative values for the difference in the effects, the experimental
therapy cannot be concluded to be superior or inferior to the control therapy.
As confidence interval C contains only values between –δ and δ, reflecting
a small absolute difference in the effects of the experimental and control
therapies, the experimental therapy and control therapy are concluded to
be “equivalent” or similar with respect to the limits ±δ. Since confidence
interval D contains only positive values for the difference in the effects that
are smaller than δ, the experimental therapy is concluded to be inferior and
noninferior to the control therapy. This would mean that the experimental
therapy is less effective than the control therapy, but not with unacceptably
worse efficacy. Since confidence interval E contains only positive values with
some of those values larger than δ, the experimental therapy is inferior to
the control therapy and cannot be concluded to be noninferior to the control
therapy.
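These decision rules can be written out directly. The sketch below (ours, not the book's) reports every conclusion a given confidence interval for C − E supports with respect to a margin δ > 0; note that in practice equivalence is often judged with a 90% interval while superiority and non-inferiority use a 95% interval.

```python
def supported_conclusions(lower, upper, delta):
    """Conclusions supported by a confidence interval (lower, upper) for
    C - E, the difference in effects between the control and experimental
    therapies, with margin delta > 0 (larger C - E favors the control)."""
    conclusions = []
    if upper < 0:
        conclusions.append("experimental superior to control")
    if lower > 0:
        conclusions.append("experimental inferior to control")
    if upper < delta:
        conclusions.append("experimental noninferior to control")
    if -delta < lower and upper < delta:
        conclusions.append("arms equivalent within limits +/- delta")
    return conclusions

# Interval D of Figure 1.1: entirely positive but below delta, so the
# experimental arm is concluded both inferior and noninferior (these
# limits also happen to fall within the equivalence limits).
print(supported_conclusions(0.2, 0.8, delta=1.0))
```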
The type of comparison that can be done depends on the type or scale of
the data. Data may be qualitative or quantitative. Qualitative data may be
nominal or ordinal. The scale is nominal when subjects’ outcomes are orga-
nized into unordered categories (e.g., gender, type of disease). The scale
is ordinal when subjects’ outcomes are organized into ordered categories.
Quantitative data may have an interval or ratio scale. The scale is interval
when differences have meaning (e.g., time of day and temperature in degrees
Celsius); two pairs of values with equal differences convey the same
meaning. The scale is ratio when ratios or quotients have meaning (e.g., time
to complete a task, survival time, temperature in kelvins). Data hav-
ing a ratio scale have a meaningful zero.
When the data have a nominal scale, the relevant parameters are the actual
relative frequencies or probabilities for each category. Since the categories
are unordered, comparisons between study arms of the distributions for
such measurements involve comparing for each category the similarity of
the respective relative frequencies. That the distributions are different or
that distributions are similar (an “equivalence” type of inference) are the
only possible type of inferences. Non-inferiority, superiority, and inferiority
inferences require that there is an order to the possible values. One measure
of the similarity of two distributions of nominal measurements is the sum
over all categories of the smaller relative frequencies between the two arms.
For all other scales of measurements, any type of inference (e.g., equivalence,
non-inferiority or superiority) can be made.
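The similarity measure described above for nominal data has a one-line implementation; the sketch below (ours) assumes each arm's distribution is given as relative frequencies over the same list of categories.

```python
def distribution_overlap(freqs_arm1, freqs_arm2):
    """Sum, over all categories, of the smaller relative frequency between
    the two arms: 1 for identical distributions, 0 for distributions that
    put their mass on disjoint categories."""
    return sum(min(f1, f2) for f1, f2 in zip(freqs_arm1, freqs_arm2))

print(distribution_overlap([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # 0.9
```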
For data having an ordinal scale, additional relevant parameters would
include the actual cumulative relative frequencies or cumulative probabili-
ties for each category. For a given category, its cumulative relative frequency
is the relative frequency of observations that either fall into that category or
any category having less value. For data that have an interval or ratio scale,
References
1. Rothmann, M. et al., Design and analysis of non-inferiority mortality trials in
oncology, Stat. Med., 22, 239–264, 2003.
2. Ellenberg, S.S. and Temple, R., Placebo controlled trials and active-control trials
in the evaluation of new treatments. Part 2: Practical issues and specific cases,
Ann. Intern. Med., 133, 464–470, 2000.
3. Freedman, B., Equipoise and the ethics of clinical research, N. Engl. J. Med., 317,
141–145, 1987.
4. Freedman, B., Placebo-controlled trials and the logic of clinical purpose, IRB:
Rev. Hum. Subj. Res., 12, 1–6, 1990.
5. D’Agostino, R.B., Massaro, J.M., and Sullivan, L.M., Non-inferiority trials:
Design concepts and issues—The encounters of academic consultants in statis-
tics, Stat. Med., 22, 169–186, 2003.
6. Ebbutt, A.F. and Frith, L., Practical issues in equivalence trials, Stat. Med., 17,
1691–1701, 1998.
7. Committee for Medicinal Products for Human Use (CHMP), Guideline on
the Choice of the Non-inferiority Margin, EMA, London, 2005, at http://www.ema.europa.eu/ema/pages/includes/document/open_document.jsp?webContentId=WC500003636.
8. Piaggio, G. et al., Reporting of non-inferiority and equivalence randomized tri-
als: An extension of the CONSORT statement, JAMA, 295, 1152–1160, 2006.
9. Fleming, T.R., Current issues in non-inferiority trials, Stat. Med., 27, 317–332,
2008.
10. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E9: Statistical Principles
for Clinical Trials, 1998, at http://www.ich.org/cache/compo/475-272-1.html#E4.
11. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E-10: Guidance on
Choice of Control Group in Clinical Trials, 2000, at http://www.ich.org/cache/compo/475-272-1.html#E4.
12. Fleming, T.R. and Powers, J.H., Issues in non-inferiority trials: The evidence in
community-acquired pneumonia, Clin. Infect. Dis., 47, S108–120, 2008.
13. Makuch, R. and Simon, R., Sample size requirements for evaluating a conservative therapy, Cancer Treat. Rep., 62, 1037–1040, 1978.
14. Lasagna, L., Placebos and controlled trials under attack, Eur. J. Clin. Pharmacol.,
15, 373–374, 1979.
15. Blackwelder, W.C., Proving the null hypothesis in clinical trials, Control. Clin.
Trials, 3, 345–353, 1982.
16. Temple, R., Government viewpoint of clinical trials, Drug Inf. J., 16, 10–17,
1982.
17. Fleming, T.R., Treatment evaluation in active control studies, Cancer Treat. Rep.,
71, 1061–1065, 1987.
18. Fleming, T.R., Evaluation of active control trials in acquired immune deficiency
syndrome, J. AIDS, 3, 82–87, 1990.
19. U.S. Food and Drug Administration Division of Anti-Infective Drug Products
Advisory Committee Meeting Transcript, February 19–20, 2002, at http://www.fda.gov/ohrms/dockets/ac/cder02.htm#Anti-Infective.
20. U.S. Food and Drug Administration. Guidance for Industry Antibacterial Drug
Products: Use of Non-inferiority Studies to Support Approval (draft guidance),
October 2007.
21. Greene, W.L., Concato, J., and Feinstein, A.R., Claims of equivalence in medical
research: Are they supported by the evidence?, Ann. Intern. Med., 132, 715–722,
2000.
22. Committee for Proprietary Medicinal Products, Points to Consider on Switching between Superiority and Non-inferiority, EMA, London, 2005, at http://www.ema.europa.eu/ema/pages/includes/document/open_document.jsp?webContentId=WC500003658.
23. Le Henanff, A. et al., Quality of reporting of non-inferiority and equivalence ran-
domized trials, JAMA, 295, 1147–1151, 2006.
24. U.S. Food and Drug Administration, Guidance for Industry: Non-inferiority
Clinical Trials (draft guidance), March 2010.
2.1 Introduction
The gold standard in evaluating the safety and efficacy of an experimental
agent is a placebo-controlled trial that is designed and conducted so that no
or little bias is introduced in the comparison of study arms. It is also neces-
sary for the clinical trial to have assay sensitivity—the ability to distinguish
an effective therapy from an ineffective therapy. The experimental condi-
tions of the clinical trial should also be such that the results are externally
valid.
Poor study conduct will either introduce a bias, favoring one treatment
over another, or obscure treatment differences. Obscuring treatment dif-
ferences makes it more difficult to show that one study arm is better than
another. However, for an active-controlled trial, obscuring treatment differ-
ences will make it easier to conclude both equivalence and non-inferiority
when the experimental therapy is not notably better than the active control.
Additionally, since the active control effect size is often assumed before con-
ducting the non-inferiority trial, poor study conduct can reduce the active
control effect in the setting of the non-inferiority trial, making it more dif-
ficult to distinguish whether an experimental therapy is effective. For a non-
inferiority comparison, it is important that the selection of the non-inferiority
margin and the effect of the control arm for the current trial are such that
a demonstration of noninferior efficacy by the experimental arm compared
with the control arm, along with an appropriate, fair study conduct, will
imply that the experimental therapy is effective and not unacceptably worse
than the active control.
In this chapter we will discuss external validity, assay sensitivity, the steps
and issues in designing a non-inferiority trial, including the setting of the
non-inferiority margin, the analysis population, and the sizing of a non-
inferiority trial. The last section of this chapter briefly discusses the early
history and experience of non-inferiority studies in anti-infective products.
3. Setting a margin
4. Conducting the trial
to ICH E10,2 indications where this has been a concern include depression,
anxiety, dementia, angina, symptomatic congestive heart failure, seasonal
allergies, and symptomatic gastroesophageal reflux disease. For these indica-
tions, therapies have been shown effective in multiple well-controlled trials.
However, because of the lack of sensitivity to having even a minimal effect
size, a non-inferiority margin cannot be established for which the effec-
tiveness of the experimental arm could be inferred from a non-inferiority
comparison. In such situations, it would be necessary to conduct trials for
the purpose of demonstrating superiority to either a placebo or a standard
therapy.
of the standard therapy can be assessed against the placebo arm within
the non-inferiority trial. Because only direct comparisons are needed to
evaluate the effectiveness of the experimental therapy, there will be fewer
issues involving the sensitivity of the trial to establish that the experimental
therapy has efficacy or adequate efficacy. For example, there are no similar
issues, as with a two-arm non-inferiority trial, about whether the results of
the previous trials are transferable to the non-inferiority trial. Additionally,
a three-arm trial allows for the control of both the precision of the estimated
effect of the experimental therapy versus placebo and the precision of the
estimated difference between the experimental and active control thera-
pies. This is usually an advantage, as the precision of the historical estimate of
the active control effect is fixed, possibly leading to imprecise indirect
estimates of the effect of the experimental therapy relative to placebo. If the
precision of the historically estimated effect of the active control is very low,
this historical estimation may not be useful in designing a two-arm, active-
controlled non-inferiority trial.
Recall that for an experimental drug that treats or prevents toxicity caused
by another drug, it may be important to study whether the use of the exper-
imental drug alters the benefits or likelihood of benefit from the original
therapy. Since a reduction of the dosage of the original drug should produce
less toxicity, for the experimental drug to be useful, it is important that the
use of the experimental drug with the studied dosage of original drug have
a benefit–toxicity profile that is as good as or better than the benefit–toxicity
profile of any given reduction of dosage of the original drug. Without know-
ing the dose–response relationship of the original drug on benefit and toxic-
ity, the demonstration of less toxicity and noninferior benefit of adding the
experimental drug to the standard dosage of the original drug may not be
sufficient to show that the experimental drug is absolutely necessary.
The assumption that the historical estimation of the effect of the control
therapy is unbiased for the setting of the non-inferiority trial has been called
the constancy assumption.
Because the evaluation of the active control effect is based on past studies,
it is often unclear whether these estimated effects apply to the non-inferiority
trial setting. Even when the historical studies show a fairly constant active
control effect, there may be factors that would alter the effect of the active
control in the setting of the non-inferiority trial. The non-inferiority trial
may be conducted in subjects with less responsive or more resistant disease;
subjects may now have access to better supportive care or different concomi-
tant interventions that may attenuate the active control effect, or there may
be lower adherence in the non-inferiority trial.4 The definitions of the pri-
mary endpoint and/or how the primary endpoint is measured may also vary
across studies. If it is believed that the effect size of the active control therapy
has diminished or otherwise will be smaller in the non-inferiority trial than
in the previous trials, the estimated effect of the control therapy should be
reduced when applied to the setting of the non-inferiority trial. If the active
control effect is smaller in the non-inferiority trial than in the historical tri-
als and this is not accounted for in the analysis, the assay sensitivity of the
trial will be low and there will be an increased risk of claiming an ineffec-
tive therapy as effective. The non-inferiority margin is often conservatively
chosen because of concerns that the effect of the standard therapy may have
diminished. As stated in the ICH-E10 guidance2: “The determination of the
margin in a non-inferiority trial is based on both statistical reasoning and
clinical judgment, and should reflect uncertainties in the evidence on which
the choice is based, and should be suitably conservative.”
As the effect of the active control depends on external experience, the
non-inferiority comparison is an across-trials comparison. As such, formal
cause-and-effect conclusions cannot be made from an across-trials comparison
without either making assumptions or providing evidence or arguments
that the conditions and conduct of the current trial and previous trial are
exchangeable, or the results are so marked that the lack of such exchange-
ability is not impactful.
Although many of these across-trials issues are shared with historically
controlled trials, other issues are different. Essentially, historically controlled
studies compare subject outcomes where subjects were not randomized.
Unaddressed imbalances between groups on known and unknown prog-
nostic factors can invalidate a historical comparison. Differences between
the historical trials used to evaluate the effect of the active control and the
non-inferiority trial in factors associated with the size of the active control
effect (effect modifiers) that are not accounted for in the analysis can invali-
date a non-inferiority comparison. More on effect modification is discussed
in Chapter 4 on evaluating the active control effect.
The non-inferiority margin should account for effect modifiers and also for
biases in the estimation of the active control effect. Biases in the estimation of
the historical effect of the active control can arise owing to selection biases in
choosing the historical studies and regression to the mean bias in identifica-
tion of the active control. If the historical trials were found through a literature
search, there may be a publication bias. If studies having unfavorable or less
favorable results were not published, and thus not included in the estimation
of the active control therapy’s effect, the historical active control effect will be
overestimated. Furthermore, the active control is likely selected on the basis
of outcome (i.e., positive results from previous trials) and thus the estimated
active control effect will be biased and greater than the true effect size.
Additionally, in the absence of the ability to estimate the between-trial vari-
ability of the effect of the active control therapy, some additional variability
may need to be added to the variance of the estimator of the control therapy’s
historical effect to account for potential unknown factors that influence the
effect of the active control. This would be particularly true if there were only
one or two previous studies that could be used to estimate the effect of the
control therapy and the disease of interest has a history of therapies having
between-trial variability in their effects.
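A minimal sketch of this inflation follows (ours, not the book's); here τ is an assumed between-trial standard deviation supplied by the analyst, not estimated from the data.

```python
import math
from scipy.stats import norm

def inflated_ci(effect_est, se_hist, tau, level=0.95):
    """CI for the historical active control effect after adding an assumed
    between-trial variance tau**2 to the within-trial variance se_hist**2."""
    se_total = math.sqrt(se_hist**2 + tau**2)
    z = norm.ppf(0.5 + level / 2)
    return effect_est - z * se_total, effect_est + z * se_total

print(inflated_ci(10.0, 2.0, tau=0.0))  # no inflation: about (6.08, 13.92)
print(inflated_ci(10.0, 2.0, tau=2.0))  # inflated:     about (4.46, 15.54)
```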
A constant or slightly varying effect size across studies for the control
therapy is more important when the effect size is always small or moderate.
The planned non-inferiority trial may still have assay sensitivity in dem-
onstrating that the experimental arm has adequate efficacy when there are
inconsistent, but all large, demonstrated effect sizes across studies. When
large effects have been demonstrated across studies, there may be little sta-
tistical uncertainty in the choice of the acceptable amount of loss of a control
therapy’s effect that an experimental therapy can have for the experimental
therapy to be noninferior.
The U.S. Food and Drug Administration (FDA) draft guidance1 discusses
an efficacy margin (M1), used in evaluating whether the experimental ther-
apy has any efficacy, and a clinical margin (M2), used to evaluate whether the
experimental therapy has unacceptably less efficacy than the active control.
The reason for considering a clinical margin that is smaller than the efficacy
margin is attributable to the importance of the effect of the active control
therapy. The importance of the active control effect is often the reason why a
placebo-controlled trial cannot be conducted.
As noted in an FDA advisory committee meeting for antibiotic drugs,5 for
some diseases (e.g., pneumonia), the reasons that make it unethical to do a
placebo-controlled trial are the same reasons attributed to the unwillingness
to have an experimental therapy that is much less effective than the standard
therapy. For clinical trials in such diseases, it may therefore be worthwhile
to consider how much less efficacious a new therapy could be compared with
an existing therapy when choosing the non-inferiority margin. Such margins
may be based on clinical practice guidelines, patient opinion, other sources,
and/or sound reasoning.
For endpoints of mortality or irreversible morbidity, it may be more difficult
or impossible to define any margin that is clinically acceptable. However, if
The potential for biocreep can be greatly reduced by using as the control
therapy the therapy (or one of the therapies) with the greatest demonstrated
effect.5,8
Fleming6 proposed that a clinical margin that takes into consideration the
perspective of the patient be determined by a team of clinical and statistical
researchers—that is, how much clinical benefit would a patient be willing to
exchange for greater ease of administration or less risk of adverse events.
Probably the most common choice of a non-inferiority margin is half of the
lower limit of the 95% CI of the effect of the control therapy based on a meta-
analysis of historical studies comparing that therapy with placebo. Different
individuals have agreed on using such a margin but have disagreed on its
interpretation. Some have viewed this margin as acceptable only for indi-
rectly concluding that the experimental treatment is better than placebo—
that is, the lower limit of the 95% CI of the effect of the control therapy is
used as an “estimate” of the historical effect of the control therapy and is
then decreased by 50% to apply it to the setting of the non-inferiority trial
(see Snapinn9). Others have viewed such a non-inferiority margin as using
the lower limit of the 95% CI of the effect of the control therapy as a con-
servative estimate of the control therapy’s effect for the non-inferiority trial, and
it is required that the experimental therapy retain 50% of the effect of the
control therapy.10 In both perspectives, how conservative such an approach
to selecting a margin is changes from case to case and is independent of the
concerns on how transferable the estimates based on historical trials are to
the non-inferiority trial.
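As a numerical sketch of this margin construction (the estimate and standard error below are hypothetical): starting from a meta-analytic estimate of the control effect versus placebo, take the lower limit of the 95% CI and halve it.

```python
from scipy.stats import norm

# Hypothetical meta-analytic estimate of the control effect vs. placebo
# (both numbers are made up for illustration).
effect_est, se = 10.0, 2.0

m1 = effect_est - norm.ppf(0.975) * se  # lower 95% limit, about 6.08
m2 = 0.5 * m1                           # half the lower limit, about 3.04

print(f"M1 = {m1:.2f}, candidate non-inferiority margin M2 = {m2:.2f}")
```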
Synthesis test procedures have also been used in testing non-inferiority.
Typically, the results from the active-controlled trial and the results from
estimating the historical effect of the control therapy are integrated through
a normalized test statistic. The goal is to demonstrate that the experimental
therapy retains a fraction of the control therapy’s effect greater than a pre-
specified fraction of the control therapy’s effect. Examples and discussion on
particular synthesis methods can be found in the papers of Hasselblad and
Kong,11 Holmgren,12 Simon,13 and Rothmann et al.10 The procedures used by
Rothmann et al.,10 Hasselblad and Kong,11 and Holmgren12 are designed to
maintain a desired type I error rate when the estimation of the effect of the
control therapy is unbiased for the setting of the non-inferiority trial. For a
synthesis test procedure, Wang, Hung, and Tsong14 examined how the type
I error rate changes in various cases when the historical estimation of the
control effect is used and the constancy assumption is false.
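A common form of the synthesis test statistic is sketched below under the constancy assumption (the exact formulations in the cited papers differ in details such as the scale of analysis); it combines the within-trial estimate of E − C with the historical estimate of C − P.

```python
import math
from scipy.stats import norm

def synthesis_test(d_ec, se_ec, d_cp, se_cp, fraction, alpha=0.025):
    """One-sided synthesis test that the experimental therapy retains more
    than `fraction` of the control effect.

    d_ec, se_ec : estimate and SE of E - C from the non-inferiority trial
    d_cp, se_cp : historical estimate and SE of C - P (control vs. placebo)
    The null hypothesis is E - C <= -(1 - fraction) * (C - P).
    """
    z = (d_ec + (1 - fraction) * d_cp) / math.sqrt(
        se_ec**2 + (1 - fraction)**2 * se_cp**2)
    return z, z > norm.ppf(1 - alpha)

# Illustrative numbers: observed E - C of -1 with SE 1.5, historical C - P
# of 10 with SE 2, requiring 50% retention of the control effect.
print(synthesis_test(-1.0, 1.5, 10.0, 2.0, fraction=0.5))  # (2.22, True)
```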
Efficacy and Clinical Margins. As stated in the FDA draft guidance,1 “Determining
the NI margin is the single greatest challenge in the design, conduct,
and interpretation of NI trials.” We discussed in Section 2.3 many of the
issues involved in selecting a non-inferiority margin. Temple and Ellenberg15
described three possible margins, M0, M1, and M2. M0 (or just zero) is the
margin used when the active control is not regularly superior to placebo. M1
is the efficacy margin used to determine whether the experimental therapy
has any efficacy and M2 is the clinical margin used to evaluate whether the
experimental therapy has unacceptably less efficacy than the active control.
The efficacy margin has been regarded as the assumed active control effect
size for the non-inferiority trial.1 The value for M1 is often based on previ-
ous trials evaluating the active control effect with appropriate adjustments
due to factors (effect modifiers) that may lead to a different effect size for the
active control in the setting of the non-inferiority trial. Since the quality of
the non-inferiority trial cannot be assessed beforehand,1 “the size of M1 can-
not be entirely specified until the NI study is complete.” As concluding that
an ineffective therapy is effective comes with a great cost, there is a tendency
to conservatively choose the assumed effect of the active control and the cor-
responding clinical margin.1
Clinical judgment is used in determining M2, which cannot be greater
than M1. M2 is often determined by taking a fraction of M1. This particular
fraction, the retention fraction, depends on the importance of the endpoint
and the size of the active control effect. The importance of the active control
effect is the motivation for choosing a non-inferiority trial as the basis of
demonstrating the effectiveness of the experimental therapy. Therefore, it
may be unacceptable for the experimental therapy to have an effect much
less than the active control. Situations that influence the retention fraction are
provided in the FDA draft guidance.1 If the active control has a large effect
in reducing the mortality rate, retaining a large fraction of that effect will be
desirable. If it is known that the experimental therapy is associated with a
lower incidence of serious adverse events or is more tolerable for patients, the
retention fraction may be lowered.
Statistical hypotheses involving such an M1 or M2 are surrogate hypoth-
eses. The intention or hope is that ruling out that the experimental therapy
has an effect that is less than the effect of the active control by M1 or more
will imply that the experimental therapy is effective. Likewise, the inten-
tion is that ruling out that the experimental therapy has an effect that is less
than the effect of the active control by M2 will imply that the experimental
therapy does not have unacceptably worse efficacy than the active control
therapy. When M2 is much smaller than M1, ruling out a difference in effects
between the experimental therapy and active control of M2 should provide
persuasive evidence that the experimental therapy is effective.
Assurance that the active control will have an effect at least the size of M1
in the setting of the non-inferiority trial is the “single most critical determi-
nation” in planning the non-inferiority trial.1 Whether the non-inferiority
trial will have assay sensitivity is based on whether the effect of the active
control will be at least M1 in the setting of the non-inferiority trial, and the
quality of the design and conduct of the non-inferiority trial.
The FDA draft guidance1 prefers basing M1 on the lower limit of a high-
percentage confidence interval for the active control effect (e.g., a 95% CI) from a
meta-analysis of clinical trials that evaluates the effect of the active control.
In ruling out that the experimental therapy is unacceptably worse than the
outcomes more similar between the study arms. It is important that the
results and/or conclusions of these analyses be similar. Any difference in the
results of the analyses may be indicative of influential, poor study quality.
Too much missing data may potentially introduce a large bias and invalidate
both analyses, even if the results are similar.
For a non-inferiority comparison, an ITT analysis need not be more conser-
vative than a PP analysis. For non-inferiority comparisons of anti-infective
products, in most studies evaluated by Brittain and Lin,17 the ITT analysis was
less conservative than the PP analysis. In fact, sloppiness due to poor study
conduct may introduce a bias that favors a particular treatment arm. Study 1
of Rothmann et al.18 in an advanced cancer setting had a high percentage of
subjects prematurely censored for progression-free survival. On the basis of
poorer prognosis for overall survival among subjects prematurely censored
for progression-free survival compared with those still under observation for
a progression-free survival event on the experimental arm, this premature
censoring appears to be highly informative, whereas the premature censor-
ing on the control arm does not appear to be informative. More on analysis
populations is discussed in Chapter 8.
Because biases can also occur in subtle, unknown ways, the robustness of
the results and primary conclusions should be evaluated19—that is, how sen-
sitive the conclusions are to the limitations of the data and the unverifiable
assumptions made. Open-label trials may be particularly vulnerable to bias.
The limitations of the data will likely not be known until they are analyzed.
It is important to keep missing data to a minimum. Proper sensitivity analy-
ses are important in addressing data limitations and the potential impact
of missing data. Sensitivity analyses should be prespecified to the extent
possible. While sensitivity analyses are recommended, they do not compensate
for poor trial conduct or poor adherence to protocol, and they do not rescue
the results of a poor-quality clinical trial.
It is thus important that the conduct of the non-inferiority trial be of high
quality so as not to compromise the non-inferiority comparison by either
obscuring differences in the effects of the study arms on the endpoint of
interest, or being so dissimilar to the study conduct of those previous trials
whose results were used to establish the non-inferiority criterion so as to
make the non-inferiority margin irrelevant.
The comparison of interest with the greatest real-world relevance is that
between a control arm of a standard therapy along with best medical manage-
ment and an experimental arm consisting of the experimental therapy along
with best medical management. This compares how all or many patients are
currently being treated with how the same group of patients could be treated
if the experimental drug becomes approved for that indication. Influences or
biases that interfere with having an unbiased comparison reduce the assay
sensitivity of the trial. However, this comparison of interest may not require
that all aspects of the trial conduct be equal between arms. Differences in the
tolerability and effectiveness of different study therapies may result in the
subjects in one study arm complying more frequently in taking their study
therapy than subjects in other study arms. This unevenness in taking the
assigned study therapy is an outcome of being on different arms and not a
bias to any comparison that will be made. The analysis should not adjust for
such unevenness. The potential subsequent therapies that are used and their
distribution of usage may naturally be different between study arms, and
such would be expected for that comparison of interest. If the control therapy
is available for subsequent use in practice, or would be available for subse-
quent use if the experimental therapy is approved (as part of its best medi-
cal management), then it may be natural for the control therapy to be made
available to subjects on the experimental arm for subsequent use. Although
this feature may make it more difficult to show that the experimental ther-
apy has better efficacy than the active control therapy, it can make it easier to
show non-inferiority or equivalence. Delayed use of the control therapy may
be noninferior to immediate use of the control therapy. In that case, an
experimental therapy that is noninferior to the control therapy in a trial where
many subjects cross in to the control therapy may not be distinguishable
from a placebo. Therefore, unless it was true for the historical stud-
ies used to establish the non-inferiority margin, allowing the control therapy
to be available to the subjects in the experimental arm can obscure a non-
inferiority comparison and the determination of effectiveness of the experi-
mental therapy. If it is unethical to deny the control therapy for later use to
subjects on the experimental arm, either the non-inferiority margin would
need to account for this cross-in to the control therapy or a superiority com-
parison may need to be required. In most instances, previous studies evalu-
ating the effect of that standard therapy (the active control therapy) would
probably not have subjects on the placebo arm later use the standard therapy.
This makes it difficult to evaluate the effect of the active control therapy in the
setting of the non-inferiority trial where cross-in to the control therapy may
be allowed.
The study conduct of the non-inferiority trial cannot be evaluated until
the trial has ended. It is only at that time an assessment or reassessment
can be made as to the transferability of the results of previous trials that
had been used to establish the non-inferiority margin. If the non-inferiority
margin was based on previous trials and the conduct of the non-inferiority
trial is not consistent with the required conduct, a reevaluation of the non-
inferiority margin may be necessary.
a Superiority is concluded if the lower bound of the two-sided 95% CI for Δ is greater than zero.
b Non-inferiority is concluded if the lower bound of the two-sided 95% CI for Δ is greater than −δ.
c Equivalence is concluded if the two-sided 90% CI for Δ lies within the interval (–δ,δ).
continuous data. It is assumed that outcomes with larger values are more
desirable. Here,
Δ is the true difference in the effects of the experimental arm and the
control arm (E − C).
Δa is the assumed difference in the effects of the experimental arm
and the control arm chosen to size the study.
δ is the non-inferiority (equivalence) margin.
α is the significance level.
1 − β is the power (the probability of making the respective conclu-
sion of superiority, non-inferiority, or equivalence) at Δa.
σ2 is the common population variance of the values for each study arm.
zγ is the 100(1 − γ)th percentile of a standard normal distribution.
π is the proportion of patients randomized to the control arm.
For time-to-event endpoints where effect sizes are measured with a log-
hazard ratio, formulas for the required number of events are obtained by
replacing σ with the numeral 1. Note that the sample size formula for a supe-
riority trial is just the sample size for a non-inferiority trial when δ = 0 (or
when δ → 0+). Note also that
a) For δ > 0, the same α, β, and σ, and the same alternative Δa, the required
sample size is smaller for a non-inferiority trial than for a superiority
trial.
b) For δ > 0 and the same α, β, and σ, the sample size for a superiority
trial powered at the alternative Δa equals the sample size for a non-
inferiority trial powered at the alternative Δa − δ.
c) For both superiority and non-inferiority trials, the required sample
size decreases as Δa increases within the alternative hypothesis.
d) For an equivalence trial, the required sample size decreases as
|Δa| decreases.
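The displayed formula itself is not reproduced in this excerpt. The sketch below is written from the symbol definitions above and is consistent with the sample sizes reported in Tables 2.2 and 2.3 up to rounding; treat it as an illustration rather than the book's exact formula.

```python
from scipy.stats import norm

def total_sample_size(alpha, beta, sigma, delta, delta_a, pi=0.5):
    """Total sample size for a one-sided non-inferiority comparison of
    means with margin delta > 0 at the alternative delta_a = E - C;
    delta = 0 gives a superiority trial. pi is the proportion of
    subjects randomized to the control arm. For a log-hazard-ratio
    analysis, setting sigma = 1 gives the required number of events."""
    z_sum = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return z_sum**2 * sigma**2 * (1/pi + 1/(1 - pi)) / (delta_a + delta)**2

# alpha = 0.025, beta = 0.10, sigma = 30, as in Tables 2.2 and 2.3:
print(total_sample_size(0.025, 0.10, 30, delta=0, delta_a=10))  # ~378
print(total_sample_size(0.025, 0.10, 30, delta=5, delta_a=0))   # ~1513
print(total_sample_size(0.025, 0.10, 30, delta=1, delta_a=8))   # ~467
```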
arm have the same effect. This may be consistent with the thought that
“non-inferiority” is “one-sided equivalence.” When the active control has a
small effect (i.e., when the non-inferiority margin is small), a non-inferiority
trial powered at no difference in the effects of the experimental and control
therapies will generally require a rather large sample size. When the control
therapy has a very small effect versus a placebo, sizing a trial at the alterna-
tive where the experimental arm and the control arm have the same effect
means that the trial is being powered at an alternative where the experi-
mental arm has a very small effect versus a placebo. A placebo-controlled
clinical trial with such an experimental arm having a small effect relative to
placebo would require a large study size to be adequately powered to dem-
onstrate superiority. When the non-inferiority margin is much smaller than
the true effect of the control therapy versus placebo, a large sample size may
be needed.
As the active control therapy may represent the therapy having the best
effect among all therapies previously evaluated for a disease, it may also be
unrealistic to assume that the next arbitrary product for that disease is equal
in effect to the active control therapy. In these instances, it may be reason-
able to size the non-inferiority trial on the basis of an assumed difference
in effects where the experimental therapy is less effective than the active
control therapy.
In an active-controlled, substitution trial, a misconception frequently
arises about whether a non-inferiority trial or a superiority trial would
require more subjects. When comparing an experimental therapy with an
active control therapy, lesser values need to be statistically ruled out by the
CI for a non-inferiority comparison than for a superiority comparison. Thus,
for a fixed power, a larger sample size is required for a superiority compari-
son than for a non-inferiority comparison of the same two treatment arms.
The misconception arises from comparing sample size calculations based
on different assumed differences in effects. For the superiority analysis, the
calculated sample size has adequate power when the experimental arm has
greater efficacy than the control arm by some meaningful amount. For the
non-inferiority analysis, the calculated sample size has adequate power when
the experimental arm and the control arm have the same effect. The sample
sizes should be compared on the basis of a single assumed difference in the
effects of the experimental and active control therapies. When comparing an
experimental therapy to an active control therapy for the same assumed dif-
ference in effects and power, a non-inferiority comparison requires a smaller
sample size than a superiority comparison.
There is also a misconception that the more efficacious the control therapy,
the easier it is for an experimental therapy (E) to demonstrate non-inferiority.
Suppose that, in designing a non-inferiority trial, there are two candidates
that may be chosen as the active comparator of the trial, C1 and C2, where C2
is more effective than C1 (C2 > C1). It is easier for an experimental therapy
to demonstrate superiority (more probable or requires a smaller size) against
TABLE 2.2
Sample Sizes for Direct or Indirect Superiority Comparisons of Test
Therapy versus Placebo (α = 0.025, β = 0.10, σ = 30 in each case)

Type of trial                 Δa and δ                         N
Superiority of E vs. P        Δa = 10 = ΔE–P, δ = 0            378
Non-inferiority of E vs. C    Δa = 0 = ΔE–C, δ = 10 = ΔC–P     378
Non-inferiority of E vs. C    Δa = 5 = ΔE–C, δ = 5 = ΔC–P      378
TABLE 2.3
Sample Sizes for Non-Inferiority Comparisons of Test Therapy versus Control
Therapy where Greater Than 50% Retention of Control Therapy’s Effect Is Required
(α = 0.025, β = 0.10, σ = 30; each comparison is non-inferiority of E vs. C)

Δa                δ                        N       N/378
Δa = 0 = ΔE–C     δ = 5 = 0.5 × ΔC–P       1513    4
Δa = 5 = ΔE–C     δ = 2.5 = 0.5 × ΔC–P     673     1.78
Δa = 8 = ΔE–C     δ = 1 = 0.5 × ΔC–P       467     1.23
Because the margin was based on the larger cure rate, the margin used at
the time of analysis may be different from the anticipated margin at the time
of study design. If the anticipated control cure rate was 81%, the anticipated
margin is 15%. If the observed cure rate for the control arm is 79%, with a
lower observed cure rate for the experimental arm, a 20% margin would
TABLE 2.4
Possible Cases in Deciding whether Non-Inferiority Has Been Demonstrated

        Sample Size   Experimental Arm:     Control Arm:          95% CI for the Difference
Case    per Arm       Number Cured (Rate)   Number Cured (Rate)   in Cure Rates               Margin
1       150           112 (0.75)            121 (0.81)            (–0.154, 0.034)             –0.15
2       150           103 (0.69)            118 (0.79)            (–0.199, –0.001)            –0.20
TABLE 2.5
Example Showing Lack of Transitivity of a Non-Inferiority Conclusion

                 Arm A             Arm B             Arm C
Cure rate        122/150 (0.81)    115/150 (0.77)    103/150 (0.69)

                 B vs. A           C vs. B           C vs. A
95% CI           (–0.139, 0.045)   (–0.180, 0.020)   (–0.224, –0.030)
Margin           –0.15             –0.20             –0.20
TABLE 2.6
Potential Change in Observed Cure Rates When Experimental Arm for Each Study
Is Control Arm of the Next Study

          Sample Size   Experimental Arm:     Control Arm:          95% CI for Difference
Therapy   per Arm       Number Cured (Rate)   Number Cured (Rate)   in Cure Rates            Margin
A         150           118 (0.79)            —                     —                        —
B         150           103 (0.69)            118 (0.79)            (–0.199, –0.001)         0.20
C         150           90 (0.60)             103 (0.69)            (–0.195, 0.021)          0.20
D         150           78 (0.52)             90 (0.60)             (–0.192, 0.032)          0.20
E         150           66 (0.44)             78 (0.52)             (–0.193, 0.033)          0.20
F         150           53 (0.35)             66 (0.44)             (–0.197, 0.024)          0.20
G         150           39 (0.26)             53 (0.35)             (–0.197, 0.010)          0.20
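As a check, the intervals in Tables 2.4 through 2.6 are consistent with standard Wald confidence intervals for a difference in proportions. A minimal sketch (ours, not the book's):

```python
import math

def wald_ci(cured_e, cured_c, n=150, z=1.96):
    """95% Wald CI for the difference in cure rates, experimental
    minus control, with n subjects per arm."""
    p_e, p_c = cured_e / n, cured_c / n
    se = math.sqrt(p_e * (1 - p_e) / n + p_c * (1 - p_c) / n)
    d = p_e - p_c
    return round(d - z * se, 3), round(d + z * se, 3)

# Case 2 of Table 2.4: the experimental arm is noninferior (lower limit
# above -0.20) and yet also inferior (upper limit below 0).
print(wald_ci(103, 118))  # (-0.199, -0.001)
```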
clinical trials. In February 2002 the FDA held an advisory committee meet-
ing on the selection of the non-inferiority margin for antibiotic drugs.5 The
outcomes from the meeting8 included
Some PP populations exclude subjects who die before the primary end-
point assessment where the cause of death is not regarded as related to the
infection. Alternatively, these patients have been treated as nonfailures in
the PP population and treated as failures in the ITT population. True ITT
analyses where all subjects are followed to the endpoint or the end of study
rarely occur due to missing outcomes. Patients with missing outcomes are
often treated as failures in the cure rate analyses.
Brittain and Lin17 compared the PP and ITT analyses from 20 trials
that were presented to the FDA Anti-Infective Drug Products Advisory
Committee between October 1999 and January 2003. Each trial studied a spe-
cific infection. The characteristics of the trials and the results are summarized
as follows:
• The overall sample sizes ranged from 20 to 819 with a median of 400.
• The percentage of patients in the ITT population that were excluded
from the PP population ranged from 2% to 43% with a median of 22%.
• The estimated treatment effect was more favorable for the experi-
mental therapy in the ITT analysis for 13 of the 20 trials.
• The 95% CI for the difference in cure rates was wider for the ITT
analysis for 12 of the 20 trials.
• The absolute differences in the treatment effect between the PP and
ITT analyses ranged from 0.03% to 18.9% (the trial with the overall
sample size of 20) with a median of 1.3%. The second largest absolute
difference was 4.8%.
References
1. U.S. Food and Drug Administration, Guidance for industry: Non-inferiority
clinical trials (draft guidance), March 2010.
2. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E10: Guidance on
choice of control group in clinical trials, 2000, at http://www.ich.org/cache/compo/475-272-1.html#E4.
3. Wiens, B., Choosing an equivalence limit for non-inferiority or equivalence
studies, Control. Clin. Trials, 23, 2–14, 2002.
4. Fleming, T.R. and Powers, J.H., Issues in non-inferiority trials: The evidence in
community-acquired pneumonia, Clin. Infect. Dis., 47, S108–S120, 2008.
5. U.S. Food and Drug Administration Division of Anti-Infective Drug Products
Advisory Committee meeting transcript, February 19–20, 2002, at http://www.fda.gov/ohrms/dockets/ac/cder02.htm#Anti-Infective.
6. Fleming, T.R., Design and interpretation of equivalence trials, Am. Heart J., 139,
S171–S176, 2000.
3.1 Introduction
It is important to evaluate from the evidence in the data whether an observed
finding is real and can be reproduced in any similar or different relevant
setting. In evaluating the strength of the evidence in the data, the Kefauver–Harris amendment of 1962 to the Federal Food, Drug, and Cosmetic Act defines "substantial evidence" as "evidence consisting of adequate and well-controlled investigations."1 According to Huque,2 the U.S. Food and Drug Administration (FDA) has interpreted this as "the need to conduct at least two adequate
and well-controlled studies, each convincing on its own, as evidence of effi-
cacy of a new treatment of a given disease.” There are conditions where data
from a single adequate and well-controlled trial can be considered to con-
stitute substantial evidence.3 An interpretation of this is data from a large,
adequate, and well-controlled multicenter study with a sufficiently small
p-value that is internally consistent and clinically meaningful.2
Studying how the results of a clinical trial or an experiment would change
if it were repeated under the exact same conditions (i.e., the same time in
history with the same clinical investigators and potential patient pool, etc.)
is a study of the variability of the results. Studying how the results of a clini-
cal trial or an experiment would change if an additional study (or studies)
were performed under a different environment (e.g., using different clini-
cal investigators or a different potential pool of patients) is a study of the
reproducibility of the results. Understanding the variability of the results from
a clinical trial is necessary but usually not sufficient in understanding the
reproducibility of the results.
For some indications, it may often be reasonable that if the results of a
given clinical trial are quite marked and internally consistent—having a
large estimated effect size that is many times greater than its correspond-
ing standard error—a statistically significant result will be reproduced if an
additional clinical trial were conducted having the same number of subjects
as the earlier trial. The only way to truly know would be to conduct the addi-
tional clinical trial.
Given the results from the trial (i.e., “given the data”), the probability that
the experimental therapy is or is not effective can be determined on the basis
of some prior probability that a random experimental therapy is effective.
This differs from a p-value, which is a probability involving the likelihood
of observing the actual data (or data that would provide stronger evidence)
given that the null hypothesis is true. The p-value does not consider the like-
lihood that the null hypothesis is true. Suppose 5% of investigated agents
for a given indication are truly effective (and meaningfully so). Additionally,
when an agent is effective, there is 80% power to achieve a one-sided p-value
of less than 0.025. When an agent is ineffective, there is a 2.5% probability of
achieving a one-sided p-value less than 0.025. For a typical 100 cases, Table
3.1 gives the number of cases for each combination of whether the investi-
gated agent is truly effective and whether the observed one-sided p-value
is less than 0.025. From Table 3.1, in 2.4 out of 6.4 cases (37.5%) where the
one-sided p-value is less than 0.025, the agent was truly ineffective. Thus,
when 5% of the investigated agents for a given indication are truly effective,
simply achieving a one-sided p-value of less than 0.025 from a single clinical
trial may be suggestive, but far from convincing, evidence of effectiveness. An example similar to the one provided here can be found in the paper by Fleming.4
We will refer to the posterior probability that the experimental agent is truly
ineffective, given that the experimental therapy has been concluded as effec-
tive, as the Bayesian false-positive rate.
Suppose that two simultaneously conducted clinical trials are done per
investigational agent. An investigational agent is concluded as effective when
each of the two studies has a one-sided p-value less than 0.025. As before,
when an agent is effective, there is 80% power to achieve a one-sided p-value
less than 0.025 within a single clinical trial. For a typical 100 cases, Table 3.2
gives the number of cases for each combination of whether the investigated
agent is truly effective and whether the agent is concluded as effective by
achieving one-sided p-values less than 0.025 in both studies. From Table 3.2,
in 3.2 out of 3.26 cases (≈98.2%) where the conclusion was “effective,” the
agent was truly effective. Therefore, the Bayesian false-positive rate is about
1.8%. Thus, when 5% of the investigated agents for a given indication are
truly effective, achieving a one-sided p-value less than 0.025 from each of
two clinical trials is fairly convincing evidence of effectiveness.
TABLE 3.1
Number of Cases for Which the Agent Is Effective According to Observed p-Value

                         Truth
One-Sided p-Value    Effective    Ineffective    Total
<0.025               4            2.4            6.4
≥0.025               1            92.6           93.6
Total                5            95             100
TABLE 3.2
Number of Cases for Which the Agent Is Effective According to Conclusion Drawn from Two Studies

                             Truth
Conclusion of "Effective"    Effective    Ineffective    Total
Yes                          3.2          0.06           3.26
No                           1.8          94.94          96.74
Total                        5            95             100
In general, for a one-sided significance level of α/2, power 1 − β at the assumed effect, and probability η that a random agent is truly effective, the Bayesian false-positive rate equals

\alpha^* = \frac{(\alpha/2)(1-\eta)}{(\alpha/2)(1-\eta) + (1-\beta)\eta} \quad (3.1)
The Bayesian false-positive rate increases as the power or trial size decreases.
Thus, a group of small studies would have a larger Bayesian false-positive
rate than an analogous group of large studies. The Bayesian false-positive
rate also increases as the probability that a random agent is truly effective
decreases or as the significance level increases. The Bayesian false-negative rate, β*, the probability that an experimental therapy is effective given a nonfavorable test result, equals

\beta^* = \frac{\beta\eta}{\beta\eta + (1-\alpha/2)(1-\eta)}
The Bayesian false-negative rate increases as the power or trial size decreases.
Thus, a group of small studies would have a larger Bayesian false-negative
rate than an analogous group of large studies. The Bayesian false-negative
rate also increases as the probability that a random agent is truly effec-
tive increases or as the significance level increases (and the power remains
unchanged). Our notation of α* and β* for the Bayesian false-positive and
Bayesian false-negative rates is the reverse of the notation by Lee and Zelen.5
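To make these relations concrete, the following minimal Python sketch (our illustration, not from the original text; the function names are ours) evaluates the Bayesian false-positive rate of Equation 3.1 and the false-negative rate above, reproducing the single-trial and two-trial examples of Tables 3.1 and 3.2:

def bayes_false_positive(alpha_half, power, eta):
    # Equation 3.1: P(agent is truly ineffective | favorable result)
    return alpha_half * (1 - eta) / (alpha_half * (1 - eta) + power * eta)

def bayes_false_negative(alpha_half, power, eta):
    # P(agent is truly effective | nonfavorable result)
    beta = 1 - power
    return beta * eta / (beta * eta + (1 - alpha_half) * (1 - eta))

# Single trial: eta = 0.05, 80% power, one-sided level 0.025
print(bayes_false_positive(0.025, 0.80, 0.05))        # ~0.373 (2.4/6.4 in Table 3.1)
print(bayes_false_negative(0.025, 0.80, 0.05))        # ~0.011 (1/93.6 in Table 3.1)
# Two trials, each requiring p < 0.025: level 0.025^2, power 0.80^2
print(bayes_false_positive(0.025**2, 0.80**2, 0.05))  # ~0.018 (Table 3.2)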
Lee and Zelen5 examined 87 studies conducted by the Eastern Cooperative
Oncology Group. Most studies used a one-sided significance level of 0.05
and sized for 80–90% power. Twenty-five of those studies had “significant
outcomes.” On the basis of a model that considers only no effect and the
assumed effect to size the study as the possibilities for the true effect, it was
deduced that the true fraction of studies with effective experimental thera-
pies was between 0.28 and 0.32. Moreover, “on average,” 3 of the 25 studies
(12%) having significant outcomes are expected to have false-positive con-
clusions and 4–10% of the 62 nonpositive studies are expected to be false-
negative conclusions.
For fixed power and probability that a random trial uses an effective
experimental agent, the one-sided significance level (α/2) for a single trial
or overall level for two trials can be determined so as to lead to a desired
Bayesian false-positive rate. For fixed β, η, and α*,
\frac{\alpha}{2} = \frac{\alpha^*(1-\beta)\eta}{(1-\alpha^*)(1-\eta)} = \frac{\alpha^*}{1-\alpha^*} \cdot \frac{\eta}{1-\eta} \cdot (1-\beta) \quad (3.2)
The required significance level equals the product of the odds of a false-
positive result, the odds a random study has an effective experimental agent,
and the power at the assumed effect.
From Equation 3.2, for α* = 0.025 and 1 – β = 0.9, α/2 ≈ (0.0231)(η/(1 – η)). When η > 0.52, α/2 > α* = 0.025. For α* = 0.025, 0.01, and 0.000625 and 1 – β = 0.9, Table 3.3 gives the value of α/2 for various η values. When the probability that a random agent is truly effective is 0.2, a single-study significance level of 0.0058 leads to a Bayesian false-positive rate of 0.025, whereas a single-study significance level of 0.00014 leads to a Bayesian false-positive rate of 0.000625.
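As a quick check of Equation 3.2, the required one-sided significance level can be computed directly (a sketch of ours, not the authors' code):

def required_alpha_half(alpha_star, power, eta):
    # Equation 3.2: one-sided level that yields Bayesian false-positive rate alpha_star
    return (alpha_star / (1 - alpha_star)) * (eta / (1 - eta)) * power

print(required_alpha_half(0.025, 0.9, 0.2))      # ~0.0058
print(required_alpha_half(0.000625, 0.9, 0.2))   # ~0.00014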
For a superiority trial with 90% power at the assumed effect and a one-sided significance level of 0.025, an estimated effect with a one-sided p-value greater than 0.053 supports no effect more than the assumed effect. For a non-inferiority trial, an estimated difference in effects between the experimental and
active control arms of –δ/2 supports the null difference of –δ more than the
alternative of no difference in effects. When the study has 90% power at no
difference with a one-sided significance level of 0.025, an estimated differ-
ence in effects of –δ/2 corresponds to a one-sided p-value of 0.053.
Bayes Factor. Goodman6 proposes the use of the Bayes factor as a measure of
the strength of evidence instead of a p-value. The greater the data support the
alternative hypothesis relative to the null hypothesis, the more likely the alter-
native hypothesis is true. For testing two simple hypotheses, we have that

\text{posterior odds that } H_o \text{ is true} = \text{Bayes factor} \times \text{prior odds that } H_o \text{ is true} \quad (3.3)

where the Bayes factor equals the probability of the data given the null hypothesis divided by the probability of the data given the alternative hypothesis.
For testing the simple hypotheses Ho:θ = θo versus Ha:θ = θa, Equation 3.3
can be expressed as
\frac{g(\theta_o \mid x)}{g(\theta_a \mid x)} = \frac{h(\theta_o)\, f(x \mid \theta_o)}{h(\theta_a)\, f(x \mid \theta_a)}

where h denotes the prior density, g the posterior density, and f(x | θ) the likelihood of the data x.
\text{Minimum Bayes factor} = \frac{f(x \mid \theta_o)}{\sup_{\theta \in \Theta_a} f(x \mid \theta)} \quad (3.4)
which is also the generalized likelihood ratio that is often used as a frequen-
tist test statistic. In practice, the supremum in the denominator of Equation
3.4 occurs at the maximum likelihood estimate of θ.
In many applications where the maximum likelihood estimator has an
approximate normal distribution, the minimum Bayes factor is approxi-
mated by exp(–z2/2), where z is the number of standard errors the maximum
likelihood estimate is different from θo.6 Goodman evaluated the strength of
evidence for various “small” p-values and prior odds that the null hypoth-
esis is true. On the basis of a fairly pessimistic prior that the alternative
hypothesis is true, Goodman regarded a one-sided p-value of 0.05 as pro-
viding moderate evidence (at best) against the null hypothesis, a one-sided
p-value of 0.001–0.01 as at best moderate to strong evidence against the null
hypothesis, and a one-sided p-value of less than 0.001 as strong to very strong
evidence. Data leading to a p-value less than 0.001 yields posterior odds that
the null hypothesis is true that is less than 1/216 of the prior odds that the
null hypothesis is true.6
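For intuition, the approximation exp(–z²/2) is easy to tabulate. The sketch below (ours, not from the text; scipy assumed, with one-sided p-values converted to z via the normal quantile) prints the minimum Bayes factor and the corresponding posterior odds when the prior odds on the null are even:

from math import exp
from scipy.stats import norm

def min_bayes_factor(p_one_sided):
    # approximate minimum Bayes factor exp(-z^2/2), z = normal quantile of 1 - p
    z = norm.ppf(1 - p_one_sided)
    return exp(-z * z / 2)

prior_odds = 1.0  # even prior odds on the null
for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    print(p, round(bf, 4), round(bf * prior_odds, 4))  # posterior odds = bf x prior odds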
When testing with two composite hypotheses (i.e., Ho:θ ∊ Θo vs. Ha:θ ∊ Θa), a
natural extension chooses the generalized likelihood ratio as that Bayes fac-
tor when determining the posterior odds that the null hypothesis is true. For
two composite hypotheses, Goodman6 proposes having the selected Bayes
factor based on a weight function. For a nonnegative function w defined on
the parameter space, the weight-based Bayes factor is given by
\frac{\displaystyle \int_{\Theta_o} w(\theta)\, f(x \mid \theta)\, d\theta \Big/ \int_{\Theta_o} w(\theta)\, d\theta}{\displaystyle \int_{\Theta_a} w(\theta)\, f(x \mid \theta)\, d\theta \Big/ \int_{\Theta_a} w(\theta)\, d\theta}
A weight function can also be used when testing a simple hypothesis against a composite hypothesis. When the weight function is the prior density function h, the posterior odds that the null hypothesis is true is given by

\frac{\displaystyle \int_{\Theta_o} h(\theta)\, f(x \mid \theta)\, d\theta}{\displaystyle \int_{\Theta_a} h(\theta)\, f(x \mid \theta)\, d\theta}
which is the posterior odds that the null hypothesis is true on the basis of the
posterior distribution for θ.
For fixed prior odds that the null hypothesis is true, Goodman6 notes that
the weights or prior densities for the possibilities in the alternative hypoth-
esis can be distributed to focus on whether the true difference or effect is
meaningful. When the observed effect is small and not meaningful, such a
weight-based Bayes factor would account for this and lead to unimpressive
posterior odds that the null hypothesis is true.
In practice, it may be better or more appropriate for the prior odds of the
null hypothesis to be based on a typical or random therapy for that indica-
tion, not on the prior belief involving the given experimental therapy. This
leads to a consistent criterion across all studies in that indication. Different
decisions from studies involving different experimental therapies would be
based on the differences in the study results.
3.3 Reproducibility
It is important that a finding in one laboratory by one investigator can be
reproduced in another laboratory by a different investigator. A finding that
fails to be reproduced when tried at different laboratories by different investi-
gators may not be of great consequence and may have been a fluke. Likewise,
it is important to know whether a positive finding from a clinical trial can be
reproduced from an independent clinical trial having different subjects and
different investigators. If a positive finding from a given clinical trial fails to
be reproduced by other conducted clinical trials, the finding will lack exter-
nal validity. Hung and O'Neill9 investigated the distribution for the p-value under the alternative hypothesis and the likelihood of reproducing a positive result in an identical, second trial when the true effect for the second trial is taken to be the observed effect from the first trial. When the observed one-sided
p-value in the first trial is 0.025, there is a 50% probability of achieving a one-
sided p-value less than 0.025 in the second trial when the true effect is the
observed effect from the first trial. When the observed one-sided p-value in
the first trial is 0.000625, there is a 90% probability of achieving a one-sided
p-value less than 0.025 in the second trial when the true effect is the observed
effect from the first trial.
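The reproducibility probabilities quoted above follow from treating the second trial's test statistic as normal with mean equal to the first trial's observed z value; a small sketch (our illustration; scipy assumed):

from scipy.stats import norm

def reproducibility(p1, alpha_half=0.025):
    # P(one-sided p < alpha_half in an identical second trial) when the true
    # effect is taken to be the first trial's observed effect: Z2 ~ N(z1, 1)
    z1 = norm.ppf(1 - p1)
    return norm.cdf(z1 - norm.ppf(1 - alpha_half))

print(reproducibility(0.025))     # 0.50
print(reproducibility(0.000625))  # ~0.90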
In practice, the reproducibility of a positive finding from a clinical trial
need not require clinical trials of identical designs. Separate positive find-
ings from clinical trials involving different stages of the same disease may
support each other and represent reproducibility of a positive finding for that
disease. Similarly, positive findings from a clinical trial using subjects who
were previously treated and also from a clinical trial using subjects who were
previously untreated for the disease may support each other and represent
reproducibility of a positive finding.
Reproducibility is also important for a non-inferiority efficacy claim. However, there may be differing views on what reproducibility should mean for
a non-inferiority inference,10 which depends jointly on both the compari-
son from the non-inferiority trial and the historical experience of the active
control. Conceptually, repeating the entire non-inferiority inference can be
considered as jointly repeating both the historical experience of the active
control and the non-inferiority trial.11 Alternatively, the reproducibility of a
non-inferiority inference can be viewed by separately assessing the repro-
ducibility in the estimated active control effect across the previous trials and
the reproducibility in the difference in effects between the active control and
the experimental therapy across multiple non-inferiority trials. When the
non-inferiority trial is based on an assumed active control effect size and
that effect size or a larger effect size is “regularly reproduced” across trials
studying the active control, the testing of non-inferiority will generally be
associated with a rather small false-positive rate (α*) for a conclusion that
the experimental therapy is effective and constitute substantial evidence of
any efficacy, provided that the active control effect is at least the size of the
assumed effect.
A consistent, reproduced conclusion of efficacy across trials not only
increases the likelihood that the finding of efficacy is real but also can
justify that a model used to estimate the active control effect may approx-
imately hold. Before observing the results of any study, the estimated treat-
ment effect or treatment difference is unbiased. However, as the decision to
evaluate the effect of a selected active control is dependent on the already
observed effects, the retrospective estimation of the active control effect is
biased. When a finding of efficacy across studies has reproducibility, this
bias should be small.
For indications where there is only one effective standard therapy that can
be difficult to tolerate, a second clinical trial comparing the experimental
therapy with a placebo can use subjects that do not tolerate the standard
therapy. A demonstration of effectiveness for that trial may involve dem-
onstrating superior efficacy to the placebo or some other therapy. In some
instances, the dose–response relationship of an experimental therapy may
provide supportive information on the efficacy of the experimental therapy.
Given observed data x1, x2, . . . , xn and the posterior density g(θ | x1, x2, . . . , xn) for the parameter θ over the parameter space Ω, the predictive density for a future observation xn+1 is

f^*(x_{n+1}) = \int_{\Omega} f(x_{n+1} \mid \theta)\, g(\theta \mid x_1, x_2, \ldots, x_n)\, d\theta \quad (3.5)
For example, consider a Jeffreys prior (a beta distribution with both parame-
ters equal to 0.5) for the probability that a random study subject will respond
to therapy. Suppose that 9 of 20 patients have responded to therapy. Solely
on the basis of these data, the predictive probability that a future random
study patient will respond to therapy is 19/42 (≈0.452). This value (19/42) was found by evaluating

f^*(1) = \int_0^1 u \cdot \frac{1}{B(9.5,\, 11.5)}\, u^{8.5} (1-u)^{10.5}\, du, \quad \text{where } B(9.5,\, 11.5) = \int_0^1 u^{8.5} (1-u)^{10.5}\, du.

Thus, with Bernoulli (dichotomous) data, the
posterior mean is the predictive probability that a future random observation
will be a success. Additionally, the predictive distribution for the number of
the next m subjects that will respond to therapy is a beta-binomial distribution (a binomial distribution with m trials whose success probability is drawn from the Beta(9.5, 11.5) posterior); each future subject individually has a marginal response probability of 19/42.
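A small numerical companion for this example (ours, not from the text; scipy's betabinom, available in scipy 1.4 and later, is assumed):

from scipy.stats import betabinom

a, b = 0.5 + 9, 0.5 + 11      # Beta(9.5, 11.5) posterior after 9 responders in 20
print(a / (a + b))            # posterior mean = predictive P(next responds) = 19/42

m = 10                        # predictive distribution for responders among next m
pred = betabinom(m, a, b)
print(pred.mean(), pred.pmf(5))  # mean m*19/42; P(exactly 5 of the next 10 respond)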
Example 3.1 illustrates determining the predictive probability of a favor-
able outcome from a second identical trial when the first clinical trial has a
favorable outcome.
Example 3.1
When it is believed that the true effect in the second trial is the same as the
true effect in the first trial, the posterior distribution for the common effect
based on the results in the first trial forms the prior distribution for the com-
mon effect in the second trial. The predictive probability of achieving statis-
tical significance can also be determined under a model for differing true
effects between clinical trials by adding variability into the prior distribu-
tion for the common effect in the second trial or from a hierarchical model.
Example 3.2
TABLE 3.4
Posterior and Predictive Distributions
Parameter Posterior Distribution
θC/P Normal distribution with mean –0.234 and standard deviation 0.0750
θE/C Normal distribution with mean –0.044 and standard deviation 0.0613
θE/P Normal distribution with mean –0.278 and standard deviation 0.0969
Estimator Predictive Distribution
xE/P Normal distribution with mean –0.278 and standard deviation 0.1300
For an estimator U observed in a completed trial and a corresponding estimator V in a future trial, a 100(1 − α)% prediction interval for V is given by

(u - z_{\alpha/2}\, \sigma_{V-U},\; u + z_{\alpha/2}\, \sigma_{V-U}) \quad (3.6)

where u is the observed value for U, 1 – Φ(z_{α/2}) = α/2, and σ_{V–U} is the standard deviation of V – U. In Example 3.1, U is the estimated log-hazard ratio from the first clinical trial based on 400 events, V is the estimator of the log-hazard ratio for the second clinical trial based on 400 events, and σ_{V–U} = √0.02 ≈ 0.141. From Equation 3.6 the 95% prediction interval for the observed log-hazard ratio in the second clinical trial, on the basis of 400 events, is (–0.565, –0.010) (i.e., the 95% prediction interval for the hazard ratio is (0.568, 0.990)). An observed hazard ratio less than 0.822 is needed for statistical significance at a one-sided 0.025 level. The one-sided 74.2% prediction interval for the observed hazard ratio in the second clinical trial is (0, 0.822), an analogous result to the Bayesian predictive probability.
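These quantities can be reproduced in a few lines (our sketch; scipy assumed; since the point estimate itself is not restated here, u is backed out as the midpoint of the reported interval):

from math import exp, log, sqrt
from scipy.stats import norm

se = 2 / sqrt(400)                  # SE of a log-hazard ratio with 400 events, 1:1 randomization
sigma_vu = sqrt(2) * se             # sigma_{V-U} = sqrt(0.02) ~ 0.141
u = (log(0.568) + log(0.990)) / 2   # observed log-hazard ratio (interval midpoint)

print(exp(u - 1.96 * sigma_vu), exp(u + 1.96 * sigma_vu))   # ~ (0.568, 0.990)
print(norm.cdf((log(0.822) - u) / sigma_vu))                # ~0.742, the one-sided coverage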
For Example 3.2, prediction limits can be determined for the comparison
of capecitabine with 5-FU from a hypothetical 5-FU–controlled trial. Let
UC/P and UE/C denote the estimators of the overall survival log-hazard ratio
for 5-FU + LV versus 5-FU (“placebo”) and capecitabine versus 5-FU + LV,
respectively. Let V denote the estimator of the overall survival log-hazard
ratio for capecitabine versus 5-FU based on 533 events from the hypotheti-
cal trial comparing capecitabine and 5-FU arms. We will assume that E(V) =
E(UC/P + UE/C) and use normal distributions for the sampling distributions.
The 95% prediction interval for V is (–0.533, –0.023). The corresponding 95% prediction interval for the hazard ratio based on 533 events is (0.587, 0.977). This interval includes 0.844, the threshold that the observed hazard ratio must fall below to achieve statistical significance, as well as larger values for the observed hazard ratio. A one-sided 79.7% prediction interval for the log-
hazard ratio based on 533 events is (–∞, –0.170), the analog to the result of the
Bayesian predictive analysis.
References
1. U.S. Food and Drug Administration. Statement regarding the demonstration of
effectiveness of human drug products and devices. Federal Register, 60, Docket
No. 9500230, 39180–39181, August 1, 1995.
2. Huque, M.F., Commentaries on statistical consideration of the strategy for dem-
onstrating clinical evidence of effectiveness—one larger vs two smaller pivotal
studies by Z. Shun, E. Chi, S. Durrleman and L. Fisher, Stat. Med., 24, 1639–1651,
2005.
3. U.S. Food and Drug Administration, Guidance for industry: Providing clini-
cal evi dence of effectiveness for human drug and biological products, 1998,
at http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatory
Information/Guidances/UCM078749.pdf.
4. Fleming, T.R., Clinical trials: Discerning hype from substance, Ann. Intern. Med.,
153, 400–406, 2010.
5. Lee, S.J. and Zelen, M., Clinical trials and sample size considerations: Another
perspective, Stat. Sci., 15, 95–110, 2000.
6. Goodman, S.N., Toward evidence-based medical statistics 2: The Bayes factor,
Ann. Intern. Med., 130, 1005–1013, 1999.
7. Shun, Z. et al., Statistical consideration of the strategy for demonstrating clinical
evidence of effectiveness—one larger vs two smaller pivotal studies, Stat. Med., 24, 1619–1637, 2005.
8. Koch, G.G., Commentaries on statistical consideration of the strategy for dem-
onstrating clinical evidence of effectiveness—one larger vs two smaller pivotal
studies by Z. Shun, E. Chi, S. Durrleman and L. Fisher, Stat. Med., 24, 1639–1651,
2005.
9. Hung, H.M.J. and O’Neill, R.T. Utilities of the p-value distribution associated
with effect size in clinical trials, Biometrical J., 45, 659–669, 2003.
10. Rothmann, M.D., Issues to consider when constructing a non-inferiority analy-
sis, ASA Biopharm. Sec. Proc., 1–6, 2005.
11. Lawrence, J., Some remarks about the analysis of active control studies,
Biometrical J., 47, 616–622, 2005.
12. Tsong, Y. et al., Choice of λ-margin and dependency of non-inferiority trials,
Stat. Med., 27, 520–528, 2008.
13. Tsong, Y., Zhang, J., and Levenson, M., Choice of δ non-inferiority margin and
dependency of the non-inferiority trials, J. Biopharm. Stat., 17, 279–288, 2007.
14. FDA Medical-Statistical review for Xeloda (NDA 20-896), dated April 23, 2001.
15. FDA/CDER New and Generic Drug Approvals: Xeloda product labeling, at
http://www.fda.gov/cder/foi/label/2003/20896slr012_xeloda_lb1.pdf.
16. Rothmann, M. et al. Design and analysis of non-inferiority mortality trials in
oncology, Stat. Med., 22, 239–264, 2003.
4.1 Introduction
According to the U.S. Food and Drug Administration (FDA) Draft Guidance
on Non-inferiority Trials,1 “The first and most critical task in designing an
NI study is obtaining the best estimate of the effect of the active control in
the NI study (i.e., M1).” The FDA draft guidance on non-inferiority trials1
provides instances for which a non-inferiority margin can be defined in the
absence of controlled clinical trials evaluating the active control effect. The
circumstances are similar to those for which historically controlled trials can
provide persuasive evidence.2 For example, there should be a good under-
standing or estimate of the outcome (e.g., spontaneous cure rate) without
treatment, and the outcomes or cure rate for the active control from mul-
tiple historical experiences should be substantially different from those
seen without treatment (e.g., substantially different spontaneous cure rates).
The assumed effect of the active control in the setting of the non-inferiority trial would be conservatively chosen.
Usually, there are data on the effect of the active control therapy from other
clinical trials. It is a daunting task to determine whether the estimated effects
of the active control therapy from previous trials apply to the setting of the
non-inferiority trial. Differences in patient populations, in the natural his-
tory of the disease, and in supportive care are just some of the factors that can
alter the effect of a therapy from one clinical trial to another. Additionally,
bias may be introduced by identifying the active control therapy after its
effect has been estimated. Bias may also be introduced by selective, post hoc
determination of which studies to include in the evaluation of the active con-
trol effect. How to integrate results across trials and whether the integrated
results would apply to a future clinical trial are also key concerns. The poten-
tial heterogeneity in the active control effect across trials needs to be consid-
ered and investigated. For the setting of the non-inferiority trial, explained
heterogeneity should be accounted for in the estimated effect of the active
control and unexplained heterogeneity should be accounted for in the cor-
responding variance. These issues and topics are discussed in this chapter.
Determining the active control effect involves examining the effect of the active control from relevant previous tri-
als, adjusting for any potential biases, and understanding and adjusting for
any differences between the historical trials and the non-inferiority trial. The
between-trial variability that cannot be explained should also be considered
when modeling the uncertainty of an estimate of the active control effect.
When there are no historical studies providing relevant information on the
effect of the control therapy, a two-arm non-inferiority trial cannot be done.
When there are relevant randomized comparative studies, it may be possible
to assess the effect of the control therapy. In assessing the size of the active
control effect in the setting of the non-inferiority trial, relevant information
from previous trials needs to be considered, including the consistency of the
size of the estimated effects, consistency of any effect, and similarities and
differences in the designs of the trials (e.g., differences in patient popula-
tions, concurrent therapies, and subsequent therapies).
If there are concerns that the effect of the control therapy has diminished
by some fraction, ε, then the estimated control effect can also be reduced by
this fraction.
When there is one historical, randomized trial comparing the active con-
trol with placebo, between-trial variability cannot be assessed. It is also dif-
ficult to quantify the between-trial variability with just two historical trials.
The potential between-trial variability should be considered, particularly
when that disease or indication has a history of inconsistent estimated effects
across clinical trials investigating the effects of the same therapy.
“Likes” should be combined with “likes.” For example, it may be inappro-
priate to combine the results of observational studies with blinded, placebo-
controlled studies. Therefore, in a meta-analysis of a collection of studies, it
may be necessary to first divide the overall collection of studies into subsets
where, within each subset, the studies are fairly homogeneous on the most
important design and conduct features relative to the treatment effect and its
estimation. Then a meta-analysis is done for each subset of the studies. The
use of multiple definitions of the endpoint, differences in how the endpoint
is measured or how frequently the endpoint is monitored, differences in the
amount of follow-up on the endpoint, or meaningfully different patient pop-
ulations may be the basis for dividing the overall collection of studies into
subsets.
“unbiased” for the effect of the active control in the non-inferiority trial (i.e.,
E(\hat{\gamma}) = E(\hat{\gamma}_i) = \gamma). The difference is the variance that is attributed to \hat{\gamma}. Since E(\hat{\gamma} - \gamma_{k+1})^2 = E(\hat{\gamma} - \gamma)^2 + E(\gamma_{k+1} - \gamma)^2, the variance for the second case is larger.
While the estimator is unbiased for the active control effect in both cases, the
modeling of the uncertainty of the estimator and its sampling distribution
is different. In general, the constancy assumption is more than just having
an unbiased estimator of the active control effect in the non-inferiority trial,
but also correctly modeling or identifying the sampling distribution for the
estimator.
The constancy of effect may depend on the chosen metric. The benefit of a
therapy used to prevent a disease may depend on the placebo rate of getting
the disease. An experimental therapy that prevents occurrence of the disease
in one out of two subjects who would have otherwise acquired the disease
has an occurrence rate of 25% when the placebo rate is 50% (a difference of
25%). The occurrence rate would be 15% when the placebo rate is 30% (a dif-
ference of 15%). How to make adjustments for departures from constancy in
the active control effect is often a matter of judgment.1
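The arithmetic behind this example is worth making explicit; a tiny sketch (ours) contrasting a constant relative risk with a nonconstant risk difference:

# Constant relative risk (0.5) does not give a constant risk difference:
for placebo_rate in (0.50, 0.30):
    treated_rate = 0.5 * placebo_rate            # half of would-be cases are prevented
    print(placebo_rate, treated_rate, placebo_rate - treated_rate)
# placebo 50% -> treated 25% (difference 25%); placebo 30% -> treated 15% (difference 15%)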
When the relevant studies are known, this will be an easier task. In many instances, a literature
search is done to obtain known relevant studies that can be used to quantify
the effect of the active control therapy. There may be concern of a publication
bias, that only the results from studies indicative of a treatment effect will
be published or that such studies are more likely to be published (and thus
found in a literature search) than the results from those studies that did not
indicate a treatment effect. There are various techniques that can assist in
recognizing that a publication bias may exist. The techniques tend to assume
that the true treatment effect is constant across trials or that the true effect
is not dependent on the size of the trial. Hopefully, the recent creation of a
clinical trials data bank (i.e., clinicaltrials.gov) will reduce the possibility of
publication bias for estimating the effects of many future active controls.
A “funnel plot” is a graphical display often used to evaluate for potential
publication bias.9 Plotted for each study is the pair of the estimated effect and
a measure related to the associated variability in the estimate. The greater the
variability associated with the estimate (as with smaller studies), the more
spread there is in the observed estimates; thus, when there is no publication
or sampling bias, a funnel-like shape is expected.
Search strategies that attempt to find all relevant studies and minimize
bias should be used. The methods for abstracting estimates and the standard
error from summaries of the results should also be considered. For example,
it is easy and valid to derive the estimate and corresponding standard error from a 95% confidence interval that was based on a normal distribution. The estimated effect would be the midpoint of the confidence interval, whereas the standard error would equal the difference of the upper and lower limits of the 95% confidence interval divided by 3.92 (i.e., by 2 × 1.96). When the sub-
jects are monitored indefinitely for an event (e.g., death) and accrued over
time, it would probably be inappropriate to use the fraction of subjects who
had events in both arms to arrive at an estimate of the hazard ratio.
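For ratio-type endpoints this abstraction is done on the log scale; a minimal sketch (ours, using the TAX 317 confidence interval that reappears in Example 4.3):

from math import exp, log

lo, hi = log(0.35), log(0.88)       # reported 95% CI for a hazard ratio
estimate = (lo + hi) / 2            # log-scale midpoint
se = (hi - lo) / 3.92               # interval width / (2 x 1.96)
print(round(exp(estimate), 2), round(se, 3))   # ~0.56 and ~0.235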
The Timing of the Definitive Analysis Is Random. When a study includes one
or more interim analyses, the sample treatment effect is often regarded as the
estimated treatment effect when an efficacy boundary is first crossed, or as
the estimated treatment effect at the final planned analysis should no efficacy
boundary be crossed. For this definition, when the experimental therapy is
more effective than the control therapy or placebo, the sample treatment
effect is biased, having a mean or expected value greater than the true treat-
ment effect. If the study were replicated over and over, the long-run arithme-
tic average sample treatment effect would exceed the true treatment effect.
This long-run average gives equal weight to each replication. If each study
result was weighted by the amount of the information used at the definitive
analysis (analogous to a fixed-effects meta-analysis), then the corresponding
long-run weighted average sample treatment effect would converge almost
surely to the true treatment effect. On the basis of a sequence of independent,
identical clinical trials having at least one interim analysis, the cumulative
fixed-effects meta-analysis estimator is asymptotically unbiased (i.e., the bias
decreases towards zero) as the number of trials increases, whereas the bias
in the arithmetic average is constant in the number of trials. When a given
clinical trial necessarily continues to the final analysis, regardless of the
results of earlier analyses, the estimated treatment effect at the time of the
final planned analysis is an unbiased estimator of the true treatment effect,
provided no design or conduct changes occur on the basis of the results of
the interim analysis.
When there is zero treatment effect and parallel boundaries are used at
the interim analyses, the expected sample treatment effect will be zero. The
bias in the observed treatment effect at the time of the definitive analysis is
attributable to the randomness of the amount of information at the time of
that analysis. The amount of bias will depend on the true effect size, the α
allocation, and the timings of the analyses.
Adaptive designs having a sample size reestimation component also have
sample treatment effects that are biased with the mean sample effect greater
than the true treatment effect when the treatment is effective.
Random Highs. Random highs in the estimated effect of the active control
therapy in historical studies are a real issue. For example, “data dredging”
leads to estimates of treatment effects that tend to overstate the true effect.
Situations include selecting a subgroup retrospectively on the basis of the
estimated effect seen in that group. The estimates that generate hypothe-
ses for further studies are in themselves conditionally biased, tending to be
larger than the true effect size. Likewise, conditional bias estimates can occur
when a claim is limited to a subgroup either due to quite positive results seen
for that subgroup or for quite negative results seen for the complement sub-
group. There are various other scenarios when random highs are likely to
be more prevalent, which include the following: when the use of the control
therapy in the non-inferiority trial was predicated on the success of one or
two trials designed to study that therapy in the indication; the first trial in an
indication to yield a favorable statistically significant result after many other
trials (possibly based on different therapies) previously failed to do so; the
estimated effect is from an interim analysis that resulted in favorable statisti-
cal significance or is from a design having a sample size reestimation; and
a retrospective or nonprespecified analysis on a demographic or genomic
subgroup.
Statistical Significance Bias. Studies whose results are responsible for moti-
vating the use of a therapy as an active control introduce some bias or con-
ditional bias into the historical estimation of a treatment effect. Before a trial
that will be well conducted and well controlled is started, the estimated
treatment effect is unbiased for the true treatment effect of a population that
is represented by the subjects in the clinical trial. At the start of the trial, the
observed treatment effect will or will not wind up being large enough to
achieve statistical significance. Conditional on statistical significance being
achieved, the expected or mean sample treatment effect is greater than the
true treatment effect. Because active control therapies in a non-inferiority
trial are often selected because statistical significance was reached in one or
two trials that were designed to study that therapy in the indication, there
will be a tendency for the estimated active control effect to be greater than
the true effect. It is therefore necessary for the estimated control effect to
be either reproduced in multiple trials or be “adjusted” for this conditional
bias.
When a drug has demonstrated an effect in a clinical trial (e.g., one-sided
p-value < 0.025), it is more likely than not that the estimated effect in a sec-
ond trial of the same design will be smaller than that seen in the first trial.
Statistically, consider two normalized test statistics from two separate, iden-
tically designed clinical trials, Z1 and Z2, that have standard normal distribu-
tions when there is no difference in effects between treatments. For the ith
trial (i = 1,2), a one-sided p-value of less than 0.025 is equivalent to Zi > 1.96.
It can be shown that P(Z1 > Z2|Z1 > 1.96) > 0.5. In other words, given that the
first trial achieved statistical significance, it is more likely that the first trial
had a smaller p-value (and also a larger estimated effect) than the second
trial.
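This inequality is easy to check by simulation. In the sketch below (ours; numpy assumed), a common drift mu is added to both statistics, with mu = 0 corresponding to no treatment effect:

import numpy as np

rng = np.random.default_rng(0)
for mu in (0.0, 1.0, 2.8):   # common true drift; 2.8 gives ~80% power at one-sided 0.025
    z1, z2 = mu + rng.standard_normal((2, 1_000_000))
    significant = z1 > 1.96
    print(mu, (z1[significant] > z2[significant]).mean())   # always > 0.5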
Let σ denote the standard error for the estimated treatment effect. Consider
a study having 100γ% power at the actual treatment effect μ, which is based
on a large sample normalized test statistic with a one-sided significance level
of α. When statistical significance has been achieved (p-value < α), the conditional expected or mean estimated treatment effect is approximately μ + gU(γ)σ, where gU(γ) = ϕ(Φ⁻¹(γ))/γ and ϕ and Φ are the density and distribution functions for a standard normal distribution, respectively. Note that for a one-sided significance level of α, γ = Φ(μ/σ − z_α). When statistical significance is not reached, the conditional expected or mean estimated treatment effect is approximately μ − gL(γ)σ, where gL(γ) = ϕ(Φ⁻¹(γ))/(1 − γ). Because of the symmetry of ϕ about zero, gU(γ) = gL(1 − γ). Table 4.1 provides values of gU(γ) for various γ values. For 90% power at the actual treatment effect μ, based on a large sample normalized test, the conditional expected or mean estimated treatment effect given that statistical significance has been reached is approximately μ + 0.195σ. Thus, if the same study having 90% power was repeated over and over, where only the estimated effects from those replications having a one-sided p-value of < 0.025 are retained, the long-run average of the retained estimated effects would be approximately μ + 0.195σ, not μ.
TABLE 4.1
Number of Standard Errors of Bias in Achieving Statistical Significance at a One-Sided Level, by Power

γ       gU(γ)       γ       gU(γ)       γ       gU(γ)
0.05    2.06        0.30    1.16        0.75    0.42
0.10    1.75        0.40    0.97        0.80    0.35
0.15    1.55        0.50    0.80        0.85    0.27
0.20    1.40        0.60    0.64        0.90    0.195
0.25    1.27        0.70    0.50        0.95    0.11
Example 4.1
Consider a time-to-event endpoint compared between two arms after 400 events
in a placebo-controlled clinical trial having a 1:1 randomization. Suppose the true
experimental to placebo hazard ratio is 0.894, which provides 20% power to
achieve statistical significance at a one-sided α of 0.025 (which occurs when the
observed experimental to placebo hazard ratio is less than 0.822). Given that
statistical significance is achieved, the mean for the observed experimental to
placebo log-hazard ratio is –0.252 (=ln 0.894 – 1.40 × 0.1) from Table 4.1, which
corresponds to a hazard ratio of 0.777. In cases where the true power is 20%,
the typical observed experimental to placebo hazard ratio when statistical signifi-
cance is achieved will be 13% less than the true value.
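The conditional bias gU(γ) and the Example 4.1 numbers can be reproduced with a few lines (our sketch, not the authors' code; scipy assumed):

from math import exp, log, sqrt
from scipy.stats import norm

def g_upper(power):
    # gU(gamma) = phi(Phi^{-1}(gamma)) / gamma: conditional bias, in standard
    # errors, of the estimated effect given statistical significance
    return norm.pdf(norm.ppf(power)) / power

print(round(g_upper(0.20), 2), round(g_upper(0.90), 3))   # 1.40 and 0.195 (Table 4.1)

se = 2 / sqrt(400)                                        # 0.1, as in Example 4.1
print(round(exp(log(0.894) - g_upper(0.20) * se), 3))     # ~0.777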
Maximum and Regression to the Mean Biases. In baseball, the best rookie
player in each league receives the Rookie of the Year (ROY) award. The per-
formance of the winners in their next seasons tends to be worse than in their
rookie year. This is often referred to as the “sophomore jinx.” The ROY has
the maximum performance or outcome among all rookies in their league
their first year. Also, among the same group of players for the next year, the
ROY cannot do any better than having the maximum performance and can
do comparatively worse.
The sophomore jinx occurs because the ROY is identified, not at random,
but on the basis of having the maximum performance. The sophomore jinx
is an example of “regression to the mean.” The bias that occurs from using
the rookie outcome of the ROY to project their second (sophomore) year per-
formance or estimate their ability, while ignoring and not adjusting for the
fact that the ROY is being identified on the basis of maximum performance/
outcome (not identified at random), is an example of regression to the mean
bias. In a non-inferiority trial, the active control is usually identified on the
basis of past performance in clinical trials. Often, the active control will be
the therapy that performed best (or one therapy among the therapies that
performed the best) in previous clinical trials. Therefore, unless a proper
adjustment is made, including the outcomes from previous clinical trials
that were used to identify the active control will lead to biased estimation of
the active control effect with a tendency to overestimate the true effect even
when the true effect of the active control remains constant across previous
trials and the non-inferiority trial.
Regression to the mean refers to the phenomenon in a simple linear regression in which, when an observed value x of an explanatory variable X is k standard deviations away from its mean μX, the expected value μY|x of the response variable Y is ρk of its standard deviations away from its mean μY, where –1 < ρ < 1 is the correlation coefficient. Since |ρk| < |k|, in relative terms of respective standard deviations, μY|x is closer to μY than x is to μX.
For example, if ρ = 0.5 with standard deviations of σ X and σ Y, then when we
observe x = μX + 2σ X, the corresponding expected value of Y is μY|x = μY + σ Y.
For the sophomore jinx, X is the performance in the rookie year and Y is the
performance in the second year.
In statistics, when an outcome or estimate represents a maximum, the out-
come or estimate will tend to be greater than the true mean of the underlying
distribution with high probability. Thus, in the subject area of clinical tri-
als, when the estimated effect represents a maximum across studies and/or
subgroups, it is highly likely that the estimated effect is greater than the true
effect. Conditional bias is also introduced when the selection of historical
studies used to estimate the effect of the active control is outcome dependent.
For example, limiting the selected studies to a narrow indication where a
study achieves statistical significance and ignoring the results from related
indications will lead to a bias and an exaggerated estimate of the active con-
trol effect and potentially inflate the type I error rate for the non-inferiority
trial. The maximum of a random sample tends to be larger than the mean of
the underlying distribution (i.e., larger than the true effect). The bias of the
maximum in estimating the underlying mean increases as the number of
studies increases. When an observation represents a maximum, it should not
be evaluated as if it were an isolated, random observation.
Consider an investigational agent, A, being studied for a first-line meta-
static cancer in three large, equally sized, randomized clinical trials. Each
clinical trial compared the addition of agent A to a different standard che-
motherapy regimen with that standard chemotherapy regimen alone. The
three clinical trials used a different background standard chemotherapy
regimen (X1, X2, and X3). Suppose that the only trial that demonstrated
improved overall survival when agent A is added is the trial that used X1 as
the background chemotherapy. Now, a sponsor wants to study the addition
of the experimental agent B in a non-inferiority trial that compares X1 plus
B with X1 plus A. As the observed effect of adding A to X1 represents the
maximum observed effect across three trials, the observed effect of adding
A to X1 probably overestimates the true effect. Therefore, if the estimation of
the effect of the active control, A, only considers the previous trial that used
X1 as the background chemotherapy and ignores the fact that the observed
effect represents a maximum observed effect, the true effect of adding A to
X1 will tend to be overestimated. This may then lead to an inappropriately
large non-inferiority margin and an increase in the likelihood of conclud-
ing that an ineffective experimental therapy is effective. If only the results
from the clinical trial using X1 as the background therapy are used, the esti-
mated effect in that trial needs to be interpreted and modeled as represent-
ing a maximum observed effect. It is important to note that when improved
survival is not demonstrated, it does not mean that improved survival was
ruled out. If the other two trials had slightly favorable observed effects, their
failure to demonstrate a survival improvement does not mean that the effect
of adding agent A to chemotherapy is heterogeneous across background che-
motherapies. The observed effects across the studies may still be consistent
with homogeneous effects. Knowledge of the observed effects from the other
two studies is needed to correctly interpret the results from the study using
X1 as the background chemotherapy.
Similar situations would also arise when an investigational agent is stud-
ied in multiple lines of an advanced or metastatic cancer, when an investi-
gational agent is studied in separate trials in different disease settings, or
when the chosen dose for the active control in the non-inferiority trial is the
dose with the greatest estimated effect and only data on that dose is used to
estimate the effect of the active control. Treating a better or best finding as
coming from an isolated trial will tend to overstate the true effect. Treating
a better or best finding as a maximum or the relevant upper order statistic of a sample will be correct when the effects are homogeneous and will be conservative when the effects are heterogeneous. However, when the effects are homogeneous, the most reliable esti-
mate of the common effect integrates the estimated effects from all trials.
Dealing with Maximum Bias. For a random sample, the observed maximum
is not an appropriate estimator of the common true mean or treatment effect.
When assumptions are made on the shape of the underlying distribution
and/or the shape of the distribution of the maximum, the observed maxi-
mum can be used to make inferences on the common true mean or treatment
effect.
Let X1, … ,Xk be a random sample from a distribution with underlying
distribution function H. Let X(k) denote the maximum of X1, . . . ,Xk, and let
H(k) denote its distribution function. Then for –∞ < t < ∞, H_{(k)}(t) = (H(t))^k. The quantiles/percentiles for X(k) are given by H_{(k)}^{-1}(\gamma) = H^{-1}(\gamma^{1/k}) for 0 < γ < 1. The mean and variance for X(k) are given, respectively, as

\mu_{X_{(k)}} = \int_0^1 H^{-1}(x^{1/k})\, dx \quad \text{and} \quad \sigma^2_{X_{(k)}} = \int_0^1 \left( H^{-1}(x^{1/k}) - \mu_{X_{(k)}} \right)^2 dx.

When Z1, . . . , Zk is a random sample from a standard normal distribution with distribution function Φ and maximum Z(k), the quantiles/percentiles for Z(k) are given by

\Phi_{(k)}^{-1}(\gamma) = \Phi^{-1}(\gamma^{1/k}) \quad (4.1)

for 0 < γ < 1. The mean and variance for Z(k) are given, respectively, as

\mu_{Z_{(k)}} = \int_0^1 \Phi^{-1}(x^{1/k})\, dx \quad (4.2)

and

\sigma^2_{Z_{(k)}} = \int_0^1 \left( \Phi^{-1}(x^{1/k}) - \mu_{Z_{(k)}} \right)^2 dx \quad (4.3)

Table 4.2 provides the mean, standard deviation, and various percentiles based on Equations 4.1 through 4.3.
When the underlying distribution for X1, . . . ,Xk is a normal distribution
with mean μ and standard deviation σ, X(k) is equal in distribution to μ + Z(k)σ.
The behavior of the minimum treatment effect is analogous to that of the
maximum treatment effect, with the roles of the treatment arms reversed.
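Equations 4.1 through 4.3 are straightforward to evaluate numerically; the sketch below (ours; scipy assumed) reproduces several entries of Table 4.2:

from scipy.stats import norm
from scipy.integrate import quad

def max_normal_moments(k):
    # mean and SD of the maximum of k iid standard normals (Equations 4.2 and 4.3)
    mean, _ = quad(lambda x: norm.ppf(x ** (1.0 / k)), 0, 1)
    var, _ = quad(lambda x: (norm.ppf(x ** (1.0 / k)) - mean) ** 2, 0, 1)
    return mean, var ** 0.5

for k in (2, 5, 25):
    m, s = max_normal_moments(k)
    print(k, round(m, 2), round(s, 2))       # (0.56, 0.83), (1.16, 0.67), (1.97, 0.51)

print(round(norm.ppf(0.975 ** 0.2), 2))      # 97.5th percentile for k = 5: 2.57 (Equation 4.1)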
Example 4.2 illustrates using the distribution of the maximum in construct-
ing a confidence interval for the true common treatment effect for a time-to-
event endpoint.
Example 4.2
Suppose that there are five randomized, placebo-controlled clinical trials evaluat-
ing an experimental therapy with each trial based on a one-to-one randomization.
For each study, as in Example 4.1, the same time-to-event endpoint will be com-
pared after 400 events where the true experimental versus placebo hazard ratio is
0.894 (which provides 20% power to achieve statistical significance at a one-sided
α of 0.025) in every study. Then, using Table 4.2, the maximum observed treatment effect is represented by the minimum observed experimental versus placebo log-hazard ratio, which has mean –0.228 (= ln 0.894 – 1.16 × 2/√400), corresponding to a hazard ratio of 0.796. The standard deviation for the minimum log-hazard ratio is 0.067. The median minimum experimental versus placebo hazard ratio is 0.799 (= exp(ln 0.894 – 1.13 × 2/√400)). Using the 2.5th and 97.5th percentiles in Table 4.2, an equal-tailed 95% prediction interval for the minimum hazard ratio is (0.691, 0.899).
TABLE 4.2
Means, Standard Deviations, and Various Percentiles for the Maximum of a Random Sample from a Standard Normal Distribution

                            Percentiles
k     μZ(k)    σZ(k)    2.5th    25th    50th    75th    97.5th
2     0.56     0.83     –1.00    0       0.54    1.11    2.24
3     0.85     0.75     –0.55    0.33    0.82    1.33    2.39
5     1.16     0.67     –0.05    0.70    1.13    1.59    2.57
10    1.54     0.59     0.50     1.13    1.50    1.91    2.80
25    1.97     0.51     1.09     1.61    1.92    2.28    3.09
Suppose instead that the true common experimental versus placebo hazard ratio
is unknown and that only the best (minimum) observed hazard ratio is considered.
If the minimum observed hazard ratio is 0.75, then, based solely on that, a 95% equal-tailed confidence interval for the true common experimental versus placebo hazard ratio is (0.746, 0.970). This confidence interval is based on the 2.5th and
97.5th percentiles in Table 4.2 and the relation X(5) = μ + Z(5)σ in distribution, which
is applied to the placebo versus the experimental log-hazard ratio.
Once the estimates have been observed along with their respective order,
the confidence coefficients for the confidence intervals change. The confidence
coefficients will depend on the distributions (or conditional distributions) of
the order statistics. In Example 4.2, where there is a random sample of five
estimated effects, the confidence coefficient for the error symmetric 95% con-
fidence interval for that individual study that had the maximum (minimum)
estimated effect is now an error asymmetric 88.1% confidence interval when
the order of the estimated effects across all five studies is considered. The
confidence coefficient for the 95% confidence interval for that study having
the second largest (second smallest) estimated effect is 99.4% when the order
of the estimated effects is considered. The confidence coefficient for the 95%
confidence interval for that study having the median estimated effect is 99.97%
when the order of the estimated effects is considered. For a random sample of
estimated effects, the confidence coefficient for the 95% confidence interval
for the individual study that had the maximum (median) estimated effect
decreases (increases) toward zero (one) as the number of studies increases.
Simultaneous Confidence Bounds. Fairly analogous to having the inference
based on a maximum is requiring simultaneous one-sided confidence inter-
vals to maintain a desired overall coverage. For k studies and a probability
of 1 – α that every one-sided confidence interval will capture the respective
true effect, the common confidence coefficient for each confidence interval is
(1 − α)1/k. When the estimated effects across studies is a random sample (e.g.,
the studies are identical in design and conduct), the largest (smallest) of the
one-sided simultaneous lower (upper) confidence bounds each with confi-
dence coefficient (1 − α)1/k equals the lower confidence bound of coefficient
1 – α based solely on the maximum (minimum) observed effect. For example,
when k = 5 and α = 0.025, the confidence coefficient for each confidence inter-
val is 0.995 (=0.9750.2). Note that the formula for determining the common
confidence coefficient for each confidence interval is the same as the formula
for relating the (1 – α)th quantile for the maximum to the [(1 − α)1/k]th quantile
of the underlying distribution.
It is fairly common to use for the non-inferiority trial the lower limit of
a 95% confidence interval for the true active control effect (usually from a
meta-analysis) as a surrogate or substitute for the unknown true effect of the
active control. When only the result from the study that produced the larg-
est estimated effect among the k studies is considered, it seems a reasonable
analog to base the surrogate or substitute for the unknown true effect of the
active control for the non-inferiority trial as the (0.975)^{1/k} × 100% lower confidence bound calculated solely from that study.
More extensive modeling based on order statistics can also be done. For
example, suppose it is believed that there may be a specific number of small
studies that did not get published because of unfavorable results for the
treated arm. A model can be applied to the results from the known small
studies that assumes that those known results represent better-order statis-
tics from some samples of independent observations. Two approaches used
in Example 4.3 are based on the maximum of a sample of estimated effects
that are not a random sample. In Example 4.3 we consider various ways of
integrating the available information from two studies on the overall sur-
vival effect of docetaxel in second-line non-small cell lung cancer (NSCLC).
Example 4.3
The JMEI trial studied the use of pemetrexed against the active control of docetaxel
at a dose of 75 mg/m2 (D75) with subjects in second-line NSCLC. A non-inferiority
claim for pemetrexed versus docetaxel on overall survival was sought.10 Thus, it
would be necessary to understand the effect of docetaxel on overall survival in
second-line NSCLC. There have been several clinical trials studying the effects
of docetaxel in NSCLC and other cancers. For the sake of this example, only two
studies of docetaxel in second-line NSCLC (TAX 317 and TAX 320) will be consid-
ered. For the TAX 320 study, 373 subjects were randomized to either 100 mg/m2
docetaxel (D100), D75, or a control therapy (vinorelbine or ifosfamide, V/I). There
is little evidence that vinorelbine or ifosfamide extends life in a second-line setting
of NSCLC. For the TAX 317 study, 100 subjects were randomized to D100 or best
supportive care (BSC) in phase A of the study, and 104 subjects were randomized
to D75 or BSC in phase B of the study.
How the results are modeled or integrated will have a great impact on the
estimation of the relevant effect of D75. When an approach is selected retro-
spectively and dependent on the trial results, it will produce biased estimates.
Prespecification of an approach before the conduct of the TAX 320 and TAX 317
studies (or independent of their results) would be necessary to produce unbiased
estimates. Some possible approaches are listed below.
1. A naïve approach that uses only the results from phase B of the TAX 317 study.
2. The evidence from the TAX 320 study is not strong enough to rule out that the effects are equal between the docetaxel regimens. Therefore, estima-
tion of the active control effect based on the assumption that the effects
of the docetaxel regimens are equal and constant across studies can be
considered.
a. Use only the results from the TAX 317 study.
b. Use results from both studies treating the control arms of vinorelbine or
ifosfamide, and BSC as exchangeable.
3. An approach that integrates the results in the TAX 320 study of the com-
parison of D100 with D75, with the separate comparisons of each phase
of docetaxel to BSC from the TAX 317 study. The effects of each docetaxel
For each approach, Table 4.3 summarizes the estimated hazard ratio, the corre-
sponding 95% confidence interval for the true D75 versus BSC hazard ratio, and
the one-sided p-value for testing that D75 is superior to BSC. For approach 1, the
estimate of the D75 versus BSC hazard ratio from TAX 317 is 0.56 with the cor-
responding 95% confidence interval of 0.35–0.88.11 From the confidence interval,
the standard error for the log-hazard ratio estimator is approximately 0.235 and the
one-sided p-value for superiority of D75 versus BSC is approximated as 0.007.
From the overall survival results provided in the Statistical review of NDA 20449/S11 for TAX 317,12 with data cutoff date of April 12, 1999, the observed D100 versus BSC hazard ratio is either 0.96 or 1.04 = 1/0.96 (using the p-value and the number of events for each group) and the corresponding standard error for the log-hazard ratio estimator is 0.221 (= √(1/40 + 1/42)). For this example, we will use 0.96 as the observed hazard ratio. For approach 2a, applying a fixed-effects meta-analysis to the independent comparisons from phases A and B of TAX 317 leads to an estimated D75/D100 versus BSC hazard ratio of 0.743 (= exp([ln 0.56/(0.235)² + ln 0.96/(0.221)²]/[1/(0.235)² + 1/(0.221)²])) and the corresponding standard error for the log-hazard ratio estimator of 0.160 (= (1/(0.235)² + 1/(0.221)²)^(–1/2)).
For the TAX 320 study, there were 104, 97, and 110 deaths in the D100, D75,
and V/I treatment groups, respectively.12 The D75 versus V/I hazard ratio is pro-
vided in the product label for Taxotere11 as 0.82, and the D100 versus V/I hazard
ratio is determined to be either 0.99 or 1.01 = 1/0.99. For this example, we will use
0.99 as the observed hazard ratio.

TABLE 4.3
Estimates of D75 versus BSC Hazard Ratio by Approach

Approach    Estimate    95% Confidence Interval    One-Sided p-Value
1           0.56        (0.35, 0.88)               0.007
2a          0.743       (0.543, 1.018)             0.032
2b          0.842       (0.698, 1.015)             0.035
3           0.655       (0.466, 0.921)             0.007
4a          0.675       (0.524, 0.938)             0.011
4b          0.704       (0.536, 0.985)             0.021

The geometric mean of the two hazard ratio
estimates is 0.901, which will be used as the combined estimate of the D75/D100
versus V/I hazard ratio. The estimated standard error for the combined log-hazard
ratio estimator is 0.119 $\left(= 0.5\sqrt{1/104 + 1/97 + 4/110}\,\right)$. When the combined results
for TAX 320 and TAX 317 (determined for approach 2a) are integrated by a fixed-
effects meta-analysis, the estimated D75/D100 versus V/I/BSC hazard ratio is
0.842 $\left(=\exp\left\{\left[\ln 0.743/(0.160)^2 + \ln 0.901/(0.119)^2\right]\Big/\left[1/(0.160)^2 + 1/(0.119)^2\right]\right\}\right)$, with a
corresponding estimated standard error for the log-hazard ratio estimator of 0.095
$\left(= 1\Big/\sqrt{1/(0.160)^2 + 1/(0.119)^2}\,\right)$.
For approach 3, the D75 versus D100 hazard ratio from TAX 320 is determined
as 0.828 (= 0.82/0.99), with a corresponding standard error for the log-hazard
ratio of 0.141 $\left(=\sqrt{1/104 + 1/97}\,\right)$. This result is combined with the results of the D100
versus BSC, yielding an estimated hazard ratio of 0.795 and a corresponding stan-
dard error for the log-hazard ratio estimator of 0.262. This indirect comparison is
now integrated with the direct comparison of D75 versus BSC, yielding an overall
D75 versus BSC hazard ratio of 0.655 with a corresponding standard error for the
log-hazard ratio estimator of 0.175. This estimate of the D75 versus BSC hazard
ratio is the maximum likelihood estimate under the model where the log-hazard
ratio estimators have independent normal distributions with respective standard
deviations equal to the estimated standard errors and the true log-hazard ratios
of D75 versus BSC, BSC versus D100, and D100 versus D75 are required to sum
to zero.
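These combinations are simple inverse-variance (fixed-effects) calculations. Below is a minimal Python sketch, using the rounded published estimates quoted above as inputs; `combine` is a hypothetical helper name, and small discrepancies from the tabled values reflect rounding of the inputs.

```python
import math

def combine(log_hrs, ses):
    """Inverse-variance (fixed-effects) combination of log-hazard ratios."""
    weights = [1 / se ** 2 for se in ses]
    est = sum(w * x for w, x in zip(weights, log_hrs)) / sum(weights)
    return est, 1 / math.sqrt(sum(weights))

# Approach 2a: combine phases A and B of TAX 317.
est_2a, se_2a = combine([math.log(0.56), math.log(0.96)], [0.235, 0.221])
print(math.exp(est_2a), se_2a)  # approx. 0.74 and 0.16

# Approach 3: indirect D75 vs. BSC estimate (D75 vs. D100 combined with
# D100 vs. BSC), then integrated with the direct D75 vs. BSC comparison.
indirect = math.log(0.828) + math.log(0.96)        # approx. ln(0.795)
se_indirect = math.sqrt(0.141 ** 2 + 0.221 ** 2)   # approx. 0.262
est_3, se_3 = combine([indirect, math.log(0.56)], [se_indirect, 0.235])
print(math.exp(est_3), se_3)  # approx. 0.655 and 0.175
```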
Let β denote the common true D75/D100 versus BSC/V/I log-hazard ratio across
studies. Let $\hat\beta_1$, $\hat\beta_2$, $\hat\beta_3$, and $\hat\beta_4$ denote the estimated log-hazard ratios of D100 versus V/I,
D75 versus V/I in TAX 320, D100 versus BSC, and D75 versus BSC in TAX 317,
respectively. For approaches 4a and 4b, the distribution of the deviation between
the minimum observed log-hazard ratio (the maximum observed effect) and β was studied
by simulations. This deviation equals the minimum deviation across estimators
(i.e., $\min\{\hat\beta_1,\hat\beta_2,\hat\beta_3,\hat\beta_4\} - \beta = \min\{\hat\beta_1-\beta,\ \hat\beta_2-\beta,\ \hat\beta_3-\beta,\ \hat\beta_4-\beta\}$). Let (Z1, Z2),
Z3, Z4 be independent, where each Zi has a standard normal distribution and
(Z1, Z2) has a bivariate normal joint distribution with a correlation of 0.5. For the
comparisons of D100 versus V/I and D75 versus V/I in TAX 320, the respective
deviations $\hat\beta_1 - \beta$ and $\hat\beta_2 - \beta$ are modeled as 0.138Z1 and 0.138Z2 (0.138 repre-
sents the average of the two estimated standard errors). For the comparisons of
D100 versus BSC and D75 versus BSC in TAX 317, the respective deviations $\hat\beta_3 - \beta$
and $\hat\beta_4 - \beta$ are modeled as 0.221Z3 and 0.235Z4.
On the basis of 100,000 replications, the simulated mean minimum deviation is
−0.188 with simulated 2.5th and 97.5th percentiles of −0.515 and 0.067, respec-
tively. On the basis of retaining only the maximum observed effect of a hazard
ratio of 0.56, this leads to the estimated common hazard ratio of 0.675 (= exp(ln
0.56 + 0.188)) and limits of the corresponding 95% confidence interval of 0.524
(= exp(ln 0.56 – 0.067)) and 0.938 (= exp(ln 0.56 + 0.515)).
In 31,899 of the 100,000 replications, $\hat\beta_4 = \min\{\hat\beta_1,\hat\beta_2,\hat\beta_3,\hat\beta_4\}$. Conditioning on
$\hat\beta_4 = \min\{\hat\beta_1,\hat\beta_2,\hat\beta_3,\hat\beta_4\}$, the simulated mean minimum deviation is −0.229 with
simulated 2.5th and 97.5th percentiles of −0.565 and 0.043, respectively. On the
basis of retaining only the maximum observed effect of a hazard ratio of 0.56 and
that this maximum observed effect came from phase B of TAX 317, this leads to
the estimated common hazard ratio of 0.704 (= exp(ln 0.56 + 0.229)) and limits of
the corresponding 95% confidence interval of 0.536 (= exp(ln 0.56 – 0.043)) and
0.985 (= exp(ln 0.56 + 0.565)).
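A minimal sketch of this simulation under the stated model (with an arbitrary random seed) is given below; the printed summaries agree with the values above up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000
z1 = rng.standard_normal(n)
z2 = 0.5 * z1 + np.sqrt(0.75) * rng.standard_normal(n)  # corr(Z1, Z2) = 0.5
z3, z4 = rng.standard_normal(n), rng.standard_normal(n)

# Deviations of the four estimators from the common log-hazard ratio.
devs = np.column_stack([0.138 * z1, 0.138 * z2, 0.221 * z3, 0.235 * z4])
min_dev = devs.min(axis=1)

# Approach 4a: unconditional distribution of the minimum deviation.
print(min_dev.mean(), np.percentile(min_dev, [2.5, 97.5]))  # approx. -0.19, (-0.52, 0.07)

# Approach 4b: condition on the minimum arising from D75 vs. BSC (phase B).
cond = min_dev[devs.argmin(axis=1) == 3]
print(cond.mean(), np.percentile(cond, [2.5, 97.5]))        # approx. -0.23, (-0.56, 0.04)
```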
From Table 4.3, the results vary across approaches, with the estimated D75
versus BSC hazard ratio ranging from 0.56 to 0.842. The upper limits of
the 95% confidence intervals range from 0.88 to 1.018. Additionally, the more
information used in an approach or the more restrictive the assumptions, the nar-
rower the 95% confidence interval. Two approaches failed to achieve statistical
significance at a one-sided 0.025 level. All in all, the results do not provide sub-
stantial evidence that the true D75 versus BSC hazard ratio for overall survival is
less than 1.
When the true within-study variances are known, the fixed-effects estimator of γ is
given by $\hat\gamma_{FE} = \sum_{i=1}^{k}(1/\sigma_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$. The variance of $\hat\gamma_{FE}$ is $1\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$.
Commonly, the true variances of the within-study sample effects are not
known, but are estimated. Let $s_i^2$ denote the estimated variance of $\hat\gamma_i$ for
i = 1, …, k. Then the standard fixed-effects estimator of γ is given by

$\hat\gamma_{FE} = \sum_{i=1}^{k}(1/s_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/s_i^2)$

and the corresponding estimated variance is given by $s^2 = 1\Big/\sum_{i=1}^{k}(1/s_i^2)$.
Homogeneity of the effects can be tested on the basis of the statistic
$Q = \sum_{i=1}^{k}(1/s_i^2)(\hat\gamma_i - \hat\gamma_{FE})^2$ (see DerSimonian and Laird16). When the
effects are homogeneous, Q has an approximate χ2 distribution with k – 1
degrees of freedom. For 0 < α < 1, let $\chi^2_{k-1,\alpha}$ denote the upper αth percentile
from a χ2 distribution with k – 1 degrees of freedom. When $Q > \chi^2_{k-1,\alpha}$, a single
common effect is rejected. Formally, the conclusion is that there are at least
two different values among γ1, γ2, . . . , γk. The formal conclusion is neither
that there are k distinct values for γ1, γ2, . . . , γk nor that γ1, γ2, . . . , γk are
independent and identically distributed random variables. The test of
heterogeneity tends to have low power in most practical situations. A
formula for determining the power for testing heterogeneity on the basis of
the test statistic Q is given by Jackson.17
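A minimal sketch of this heterogeneity test, assuming estimated effects g with estimated standard errors s from k studies (`q_test` is a hypothetical helper name):

```python
import numpy as np
from scipy.stats import chi2

def q_test(g, s):
    """Test of homogeneity based on Q (DerSimonian and Laird)."""
    g, s = np.asarray(g, float), np.asarray(s, float)
    w = 1 / s ** 2
    g_fe = np.sum(w * g) / np.sum(w)   # fixed-effects estimate
    q = np.sum(w * (g - g_fe) ** 2)
    p = chi2.sf(q, df=len(g) - 1)      # approximate p-value under homogeneity
    return q, p
```

For the three trials of Table 4.5 below, for instance, q_test([7, 6, 37], [3, 2, 3.5]) gives Q of roughly 63 on 2 degrees of freedom, leaving little doubt about heterogeneity.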
For binary outcomes with the Peto (O − E) method, where $Z_i$ denotes the observed
minus expected number of events and $V_i$ the corresponding hypergeometric variance
in study i, the estimated common log-odds ratio is given by
$\hat\theta = \sum_{i=1}^{k}V_i\hat\theta_i\Big/\sum_{i=1}^{k}V_i = \sum_{i=1}^{k}Z_i\Big/\sum_{i=1}^{k}V_i$. The estimated variance of $\hat\theta$ is
$1\Big/\sum_{i=1}^{k}V_i$. Homogeneity of the study log-odds ratios can be tested on the
basis of the statistic $R = \sum_{i=1}^{k}Z_i^2/V_i - \left(\sum_{i=1}^{k}Z_i\right)^{2}\Big/\sum_{i=1}^{k}V_i$ (see Yusuf et al.18).
In a random-effects model, the true study effects γ1, γ2, . . . , γk are modeled as
a random sample from a distribution having mean γ and between-study variance τ2.
When σ1, . . . , σk and τ2 are known, the random-effects estimator
of γ is given by $\hat\gamma_{RE} = \sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1}\hat\gamma_i\Big/\sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1}$. The variance of $\hat\gamma_{RE}$
is $1\Big/\sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1}$. In practice, σ1, . . . , σk and τ2 are not known. Then the
between-study variance, τ2, is estimated by

$\hat\tau^2 = \max\left\{0,\ \frac{\sum_{i=1}^{k}(1/s_i^2)(\hat\gamma_i - \hat\gamma_0)^2 - (k-1)}{\sum_{i=1}^{k}(1/s_i^2) - \sum_{i=1}^{k}(1/s_i^4)\Big/\sum_{i=1}^{k}(1/s_i^2)}\right\}$

where $\hat\gamma_0 = \sum_{i=1}^{k}(1/s_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/s_i^2)$. Then the random-effects estimator
of γ is given by $\hat\gamma_{RE} = \sum_{i=1}^{k}(\hat\tau^2+s_i^2)^{-1}\hat\gamma_i\Big/\sum_{i=1}^{k}(\hat\tau^2+s_i^2)^{-1}$. The corresponding
estimated variance for $\hat\gamma_{RE}$ is $s^2 = 1\Big/\sum_{i=1}^{k}(\hat\tau^2+s_i^2)^{-1}$.
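A sketch of the corresponding computations follows; `dl_random_effects` is a hypothetical helper name.

```python
import numpy as np

def dl_random_effects(g, s):
    """DerSimonian-Laird random-effects estimate from effects g and standard errors s."""
    g, s = np.asarray(g, float), np.asarray(s, float)
    w = 1 / s ** 2
    g0 = np.sum(w * g) / np.sum(w)                 # fixed-effects estimate
    q = np.sum(w * (g - g0) ** 2)
    k = len(g)
    # Moment estimator of the between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_re = 1 / (tau2 + s ** 2)
    g_re = np.sum(w_re * g) / np.sum(w_re)         # random-effects estimate
    return g_re, 1 / np.sqrt(np.sum(w_re)), tau2
```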
For the random-effects model, the units are studies, not subjects. Inference
is formally on studies. For the inference to apply at the subject level, either all
studies should have the same size or all individual estimated effects
should have the same standard error. Failing that, the study standard
error should not be correlated (not even spuriously correlated) with the study
effect size. It is difficult to evaluate and be certain that the study standard
error and study effect size are not correlated.
While the existence of heterogeneity invalidates the assumptions of a fixed-
effects meta-analysis, the existence of a correlation between the estimated
effects and the within-study standard errors invalidates the assumptions of
a random-effects meta-analysis. There are other circumstances that can also
invalidate the assumptions of a random-effects meta-analysis.
For these meta-analysis methods, the common or average effect in the models
reflects the expected value (or conditional expected value for a random-effects
model) of the estimated effects. This is important as study conduct, missing
data, or design features can introduce bias in estimating the true study effect.
Biggerstaff and Tweedie23 used results of Larholt, Tsiatis, and Gelber21 to determine the confidence intervals for τ2 and
the alternative confidence intervals for γ on the basis of the distribution of

$\hat\tau^2_{BT} = \frac{\sum_{i=1}^{k}(1/s_i^2)(\hat\gamma_i - \hat\gamma_0)^2 - (k-1)}{\sum_{i=1}^{k}(1/s_i^2) - \sum_{i=1}^{k}(1/s_i^4)\Big/\sum_{i=1}^{k}(1/s_i^2)}$
The type I error rate of the random-effects (RE) test increases as the between-trial
variance grows relative to the within-trial variance and the number of studies
decreases. For known and equal study-specific variances of σ2, an approximate
type I error rate for the RE test is

$2\left[1 - \Phi\!\left(\frac{z_{\alpha/2}}{\sqrt{1+\tau^2/\sigma^2}}\right)F_{k-1}\!\left(\frac{k-1}{1+\tau^2/\sigma^2}\right) - \int_{(k-1)/(1+\tau^2/\sigma^2)}^{\infty}\Phi\!\left(z_{\alpha/2}\sqrt{x/(k-1)}\right)f_{k-1}(x)\,\mathrm{d}x\right]$

where $F_{k-1}$ and $f_{k-1}$ are the distribution and density functions for a χ2 distribu-
tion with k – 1 degrees of freedom, respectively. As noted by Ziegler, Koch,
and Victor,22 for a fixed number of studies, the type I error rate is increas-
ing in τ 2/σ 2. Thus, for fixed k and τ 2, the type I error rate is decreasing in
σ 2 (increasing in the sample size/the number of events for a time-to-event
endpoint). For fixed τ 2 and σ 2, the type I error rate decreases as the number
of studies, k, increases. As σ2 → 0, the type I error rate converges to
$2(1 - G_{k-1}(z_{\alpha/2}))$, where $G_{k-1}$ is the distribution function for a t distribution with
k – 1 degrees of freedom.
For selected numbers of studies, Table 4.4 provides the limiting type I error
rates as σ 2 → 0 for α/2 = 0.025. When there are only three studies and the
within-trial variability is much less than the between-trial variability, the
type I error rate for the superiority test will be about 9.5%. The type I error
inflation may be quite large when the number of studies is small.
As σ2 → 0, the form of the asymptotic distribution function, $H_{k-1}$, for the
RE test statistic when γ = 0 is provided in the paper of Ziegler, Koch, and
Victor.22 They proposed using $H_{k-1}^{-1}(1-\alpha/2)$ as the critical value for the RE
test statistic when testing for effectiveness and as a multiplier when deter-
mining confidence intervals for γ. From their simulations, the new test either
maintains the approximate type I error rate or is conservative.
TABLE 4.4
Limiting Type I Error Rates as σ 2 → 0 for α/2 = 0.025
Number of Studies Limiting Type I Error Rate
3 0.095
10 0.041
25 0.031
50 0.028
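As σ2 → 0 the RE statistic behaves like a t statistic with k − 1 degrees of freedom, so the one-sided rates in Table 4.4 correspond to the tail probability P(t_{k−1} > z_{0.025}). A minimal check, assuming SciPy is available:

```python
from scipy.stats import norm, t

z = norm.ppf(0.975)  # z_{0.025}, approx. 1.96
for k in (3, 10, 25, 50):
    print(k, round(float(t.sf(z, df=k - 1)), 3))
# prints 0.095, 0.041, 0.031, 0.028, matching Table 4.4
```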
Example 4.4
In this example, there are three previous randomized, clinical trials comparing the
active control therapy to placebo on a continuous outcome. We will assume that
each within-trial estimator of the treatment effect is unbiased and has a normal
distribution. Table 4.5 provides the estimated active control effect along with the
corresponding standard deviation and 95% confidence interval for each trial and
for the fixed-effects and random-effects meta-analyses. Figure 4.1 displays the
corresponding 95% confidence intervals.
For a random-effects meta-analysis, Table 4.5 provides the 95% confi-
dence intervals based on both percentiles from a standard normal distribution
and percentiles from a t distribution with 2 degrees of freedom.
TABLE 4.5
Trial and Integrated Estimated Effects, Standard Deviations, and 95% Confidence
Intervals
Trial/Analysis Estimated Effect Standard Deviation 95% Confidence Interval
Trial 1 7 3 (1.1, 12.9)
Trial 2 6 2 (2.1, 9.9)
Trial 3 37 3.5 (30.2, 43.8)
Fixed effects 12.0 1.50 (9.0, 14.9)
Random effects 16.5 9.01 (–1.2, 34.2)a (–22.3, 55.3)b
a Based on the 2.5th and 97.5th percentiles of a standard normal distribution.
b Based on the 2.5th and 97.5th percentiles of a t distribution with 2 degrees of freedom.
FIGURE 4.1
Trial and meta-analysis 95% confidence intervals.
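Applying the dl_random_effects sketch given earlier to the three trials of Table 4.5 reproduces the tabled summaries (up to rounding):

```python
effects, sds = [7, 6, 37], [3, 2, 3.5]

w = [1 / s ** 2 for s in sds]
fe = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
print(fe, 1 / sum(w) ** 0.5)   # approx. 12.0 and 1.50 (fixed effects)

re, se_re, tau2 = dl_random_effects(effects, sds)
print(re, se_re)               # approx. 16.5 and 9.01 (random effects)
```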
TABLE 4.6
Trial and Integrated Estimated Effects, Standard Deviations, and 95% Confidence
Intervals
Trial/Analysis Estimated Effect Standard Deviation 95% Confidence Interval
Trial 1 7 3 (1.1, 12.9)
Trial 2 6 2 (2.1, 9.9)
Trial 3 12 3.5 (5.2, 18.8)
Fixed effects 7.4 1.50 (4.4, 10.3)
Random effects 7.5 1.62 (4.3, 10.6)a (0.5, 14.4)b
a Based on the 2.5th and 97.5th percentiles of a standard normal distribution.
b Based on the 2.5th and 97.5th percentiles of a t distribution with 2 degrees of freedom.
dose of the active control therapy, and study conduct in the non-inferiority trial.
Differences in the bias on the estimated effects, due to differences in study con-
duct and study design, also contribute to heterogeneity in the estimated effects.
This heterogeneity is not properly dealt with by being treated as unexplained vari-
ability in treatment effects.
When there is convincing evidence that $\gamma_i$ and $\sigma_i^2$ are correlated, the assump-
tions for the random-effects model probably do not hold. That is, the assumption
that γ1, γ2, . . . , γk are identically distributed is probably false.
The fixed-effects estimator, $\hat\gamma_{FE}$, is also an unbiased estimator of γ when
the random-effects model holds. The variance for $\hat\gamma_{FE}$ under the random-
effects model is given by

$1\Big/\sum_{i=1}^{k}(1/\sigma_i^2) + \tau^2\sum_{i=1}^{k}(1/\sigma_i^4)\Big/\left(\sum_{i=1}^{k}(1/\sigma_i^2)\right)^{2}$

The variance of $\hat\gamma_{FE}$ under a fixed-effects model and the variances for
$\hat\gamma_{RE}$ and $\hat\gamma_{FE}$ under a random-effects model are respectively ordered as

$1\Big/\sum_{i=1}^{k}(1/\sigma_i^2)\ \le\ 1\Big/\sum_{i=1}^{k}(\sigma_i^2+\tau^2)^{-1}\ \le\ 1\Big/\sum_{i=1}^{k}(1/\sigma_i^2) + \tau^2\sum_{i=1}^{k}(1/\sigma_i^4)\Big/\left(\sum_{i=1}^{k}(1/\sigma_i^2)\right)^{2}$

The closer τ2 is to zero, the more similar the variances. When
the fixed effects and random effects estimated effect sizes are quite different,
this may indicate that the assumptions for the random effects model do not
hold.
When the effects are heterogeneous, Greenland and Salvan20 suggest mod-
eling the study differences instead of providing a single estimated effect.
In a case like that given in Table 4.5, where there is enormous heteroge-
neity of the estimated effects, it is likely that much of that heterogeneity is
explainable. Neither the fixed-effects nor the DerSimonian–Laird random-
effects meta-analysis is appropriate in such a case. It is important to investigate the
heterogeneity of the estimated effects. The potential bias in the estimates
should also be considered. The variability in the estimated effects that can be
explained should be used to estimate the active control effect in the setting
of the non-inferiority trial along with any further effect modification antici-
pated in the setting of the non-inferiority trial. The precision of the estimate
would be based on the within-trial variances of the estimated effects and
the unexplained between-trial variability in the estimated effects of active
control therapy.
The standard deviation for the resulting estimator of the active control effect
is larger than the corresponding standard deviation for the estimator not
based on a covariate adjustment.
In the absence of other biases previously discussed, the unbiased appli-
cation of this estimated effect to the non-inferiority trial requires the con-
ditional constancy assumption that for any given set of covariates the
conditional active control effect is constant across all studies (including the
non-inferiority trial) and that all effect modifiers have been accounted for.
There should be biological plausibility that a covariate is an effect modifier
with preferably reproduced results on the effect size of the active control.
Selection of a covariate should not be based on data dredging by selecting
an arbitrary covariate that just happens to have differing observed effects
across its subgroups.
Such a procedure is most relevant when multiple trials evaluating the active
control have demonstrated similar heterogeneous effects. When there are
no underlying differences in the effects across subgroups, there will always
be some anticipated difference in the estimated effects across subgroups.
Observing similar but small differences in the estimated effects within sub-
groups in two, three, or even four trials may not be strong evidence of het-
erogeneous effects (or at least meaningful heterogeneous effects that would
deserve attention). A conservative approach may select the smaller of the
margin (or the more conservative estimation of the active control effect) from
an approach that adjusts the estimated active control effect by the relative
frequencies of important subgroups and from an approach of homogeneous
effects across subgroups.
This problem of heterogeneous effects across subgroups cannot be solved
by using adjusted analyses within each study, as such analyses either assume
homogeneous effects across the corresponding subgroups or weight the
effects by the corresponding relative frequencies, which would differ
across trials (leading to heterogeneous effects across trials).
However, when the effect of the active control varies across important
subgroups, a non-inferiority or any efficacy conclusion overall is really a
conclusion on an overall or weighted-average result with the weights being
the relative frequencies of the important subgroups. A conclusion of non-
inferiority or any efficacy for every meaningful subgroup requires individ-
ual non-inferiority comparisons for each subgroup. Unless the results are
quite marked, it is very difficult to interpret subgroup analyses in a two-arm
non-inferiority trial.
For a Bayesian analog to a fixed-effects meta-analysis with known within-study
variances and a noninformative prior, the posterior distribution for the active
control effect γ is a normal distribution with mean equal to
$\sum_{i=1}^{k}(1/\sigma_i^2)\hat\gamma_i\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$ and variance equal to $1\Big/\sum_{i=1}^{k}(1/\sigma_i^2)$. When the
constancy assumption holds, the derived posterior distribution for the effect
of the active control can validly be used in the setting of the non-inferiority
trial as the distribution for the effect of the active control. If appropriate, the
active control effect can be discounted when applied to the setting of the
non-inferiority trial.
For a Bayesian analog to a random-effects meta-analysis, the output can be
either a posterior distribution for the random within-study treatment effect,
γ k+1, or a posterior distribution for the mean treatment effect across studies
(i.e., the global mean), γ. Let ψ = τ 2. We will consider an improper prior distri-
bution for (γ,ψ) whose density depends only on the value of ψ (i.e., g(γ,ψ) = j(ψ)).
Conditional on (γ,ψ), the true within-study effects, γ1, γ 2, . . . , γ k, are assumed
to be a random sample from a normal distribution having mean γ and vari-
ance ψ. Conditional on γ1, γ2, . . . , γk, the estimators $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_k$ are independently normally
distributed, where $\hat\gamma_i$ has a normal distribution with mean $\gamma_i$ and variance
$\sigma_i^2$ for i = 1, . . . , k. The variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$ may be regarded as known or as
having some prior distribution. In the known variances case, the joint poste-
rior distribution for (γ,ψ) is determined conditional on the observed values,
x1, x2, . . . , xk, of $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_k$. The joint posterior density will factor into the
product of the marginal distribution for ψ and a normal conditional distri-
bution for γ given ψ having a mean equal to
$\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$
and variance equal to $1\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$. The density for the marginal distri-
bution for ψ is proportional to $\exp[-(1/2)q(\psi)]\prod_{i=1}^{k}(\sigma_i^2+\psi)^{-1/2}\times j(\psi)$, where

$q(\psi) = \sum_{i=1}^{k}x_i^2(\sigma_i^2+\psi)^{-1} - \left(\sum_{i=1}^{k}x_i(\sigma_i^2+\psi)^{-1}\right)^{2}\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$

For simulating from the posterior distribution of γ, a value ψr is randomly drawn
from the marginal posterior distribution for ψ, and then a value for γ is randomly
drawn from the normal distribution having mean equal to
$\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$
and variance equal to $1\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$.
Given ψ, the conditional posterior mean for γ is thus equal
to $\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi)^{-1}$.
Under the assumption that the same model applies for the next trial or the
non-inferiority trial (i.e., a form of the constancy assumption), the distribu-
tion for the treatment or active control effect in the next study, γ k+1, is based
on the posterior distribution for (γ,ψ) and that conditional on (γ,ψ), γ k+1 has a
normal distribution with mean γ and variance ψ. Thus, the distribution for
γ k+1 can be approximated by further taking a random value from the normal
distribution with mean equal to the simulated value for γ and variance equal
to the simulated value for ψ. As noted earlier, if appropriate the active control
effect can be discounted when applied to the setting of the non-inferiority
trial.
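A sketch of this two-stage simulation for known within-study variances and a flat prior j(ψ) = 1, with hypothetical inputs x and sigma2 and ψ discretized on a grid:

```python
import numpy as np

x = np.array([0.4, 0.6, 0.3])          # observed study effects (hypothetical)
sigma2 = np.array([0.04, 0.02, 0.05])  # known within-study variances (hypothetical)

# Discretized marginal posterior density for psi (flat prior j(psi) = 1).
psi_grid = np.linspace(1e-6, 1.0, 2000)
log_dens = []
for psi in psi_grid:
    v = sigma2 + psi
    q = np.sum(x ** 2 / v) - np.sum(x / v) ** 2 / np.sum(1 / v)
    log_dens.append(-0.5 * q - 0.5 * np.sum(np.log(v)))
dens = np.exp(np.array(log_dens) - max(log_dens))
dens /= dens.sum()

rng = np.random.default_rng(1)
psi_r = rng.choice(psi_grid, size=10_000, p=dens)  # draws of psi
v_r = sigma2[None, :] + psi_r[:, None]
mean_r = (x / v_r).sum(axis=1) / (1 / v_r).sum(axis=1)
var_r = 1 / (1 / v_r).sum(axis=1)
gamma = rng.normal(mean_r, np.sqrt(var_r))         # posterior draws of gamma
gamma_next = rng.normal(gamma, np.sqrt(psi_r))     # draws of gamma_{k+1}
```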
In the cases where the variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$ are unknown, the joint pos-
terior distribution for $(\gamma_i, \sigma_i^2)$ can be determined for each i = 1, 2, . . . , k. For
continuous data, Section 12.2.4 provides an example of a joint posterior dis-
tribution for the mean and variance. Approximating the posterior distribution
for γ then involves, for each replication, drawing the within-study variances from
their posterior distributions, drawing a value ψr from the corresponding marginal
posterior distribution for ψ, and then drawing a value for γ from the normal
distribution with mean $\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}x_i\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$ and variance
$1\Big/\sum_{i=1}^{k}(\sigma_i^2+\psi_r)^{-1}$.
References
1. U.S. Food and Drug Administration, Guidance for industry: Non-inferiority
clinical trials (draft guidance), March 2010.
2. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E-10: Guidance on
choice of control group in clinical trials, 2000, at http://www.ich.org/cache/
compo/475-272-1.html#E4.
3. Prentice, R.L. et al., Combined postmenopausal hormone therapy and cardio-
vascular disease: Toward resolving the discrepancy between observational
studies and the Women’s Health Initiative clinical trial, Am. J. Epidemiol., 162,
404–414, 2005.
4. Fleming, T.R. et al., Some essential considerations in the design and conduct of
non-inferiority trials, submitted manuscript, 2010.
5. Wang, S.-J., Hung, H.M.J., and Tsong, Y., Utility and pitfalls of some statistical
methods in active controlled clinical trials, Control. Clin. Trials, 23, 15–28, 2002.
6. Hedges, L.V., Modeling publication selection effects in meta-analysis, Stat. Sci.,
7, 246–255, 1992.
7. Dear, K.B. and Begg, C.B., An approach for assessing publication bias prior to
performing meta-analysis, Stat. Sci., 7, 237–245, 1992.
8. Sterling, T.D., Rosenbaum, W.L., and Weinkam, J.J., Publication decisions revis-
ited: The effect of the outcome of statistical tests on the decision to publish and
vice-versa, Am. Stat., 49, 108–112, 1995.
9. Light, R.J. and Pillemer, D.B., Summing Up: The Science of Reviewing Research,
Harvard University Press, Boston, MA, 1984.
10. U.S. Food and Drug Administration Oncologic Drugs Advisory Committee
meeting, July 27, 2004, transcript, at http://www.fda.gov/ohrms/dockets/
ac/04/transcripts/2004-4060T1.pdf.
11. Product label for Taxotere, at http://www.accessdata.fda.gov/drugsatfda_
docs/label/2010/020449s059lbl.pdf.
12. Statistical review of NDA 20449/S11 dated December 15, 1999, at http://www
.accessdata.fda.gov/drugsatfda_docs/nda/99/20449-S011_TAXOTERE_statr
.pdf.
13. Glass, G.V., Primary, secondary and meta-analysis of research, Educ. Res., 5, 3–8,
1976.
14. Follmann, D.A. and Proschan, M.A., Valid inferences in random effects meta-
analysis, Biometrics, 55, 732–737, 1999.
15. Rothmann, M.D. et al., Missing data in biologic oncology products, J. Biopharm.
Stat., 19, 1074–1084, 2009.
16. DerSimonian, R. and Laird, N., Meta-analysis in clinical trials, Control. Clin.
Trials, 7, 177–188, 1986.
17. Jackson, D., The power of the standard test for the presence of heterogeneity in
meta-analysis, Stat. Med., 25, 2688–2699, 2006.
18. Yusuf, S. et al., Beta blockade during and after myocardial infarction: An over-
view of the randomized trials, Prog. Cardiovasc. Dis., 27, 335–371, 1985.
19. Peto, R., Why do we need systematic overviews of randomised trials? Stat. Med.,
6, 233–240, 1987.
20. Greenland, S. and Salvan, A., Bias in the one-step method for pooling study
results, Stat. Med., 9, 247–252, 1990.
21. Larholt, K., Tsiatis, A.A., and Gelber, R.D., Variability of coverage probabili-
ties when applying a random effects methodology for meta-analysis, Harvard
School of Public Health Department of Biostatistics, unpublished, 1990.
22. Ziegler, S., Koch, A., and Victor, N., Deficits and remedy of the standard random
effects methods in meta-analysis, Methods Inform. Med., 40, 148–155, 2001.
23. Biggerstaff, B.J. and Tweedie, R.L., Incorporating variability in estimates of het-
erogeneity in the random effects model in meta-analysis, Stat. Med., 16, 753–768,
1997.
24. Zhang, Z., Covariate-adjusted putative placebo analysis in active-controlled
clinical trials, Stat. Biopharm. Res., 1, 279–290, 2009.
5.1 Introduction
A non-inferiority analysis is frequently conducted based on the determi-
nation of a non-inferiority margin or threshold. The choice of the margin
should depend on prior experience of the estimated effect of the active con-
trol in adequate, well-controlled trials, and account for regression-to-the-
mean bias, effect modification, and clinical judgment. The non-inferiority
margin must be small enough to preclude a conclusion that a placebo (or a treatment that
is no better than placebo on a given endpoint) is noninferior to the active
control. Other concerns about the non-inferiority margin might make the
margin even smaller, but it should not be larger than the smallest anticipated
difference between a placebo and the active control in the setting of the non-
inferiority trial.
From experience there are basically two philosophies in constructing a
non-inferiority analysis. One philosophy involves making adjustments to
the estimation of the active control effect to account for biases, effect modi-
fication, and any additional uncertainty, and then using a test procedure that
targets a desired type I–like error rate. The other philosophy involves apply-
ing a conservative method of analysis (e.g., comparing the most conservative
limits of 95% confidence intervals) that includes the results from an unad-
justed estimation of the active control effect and from the non-inferiority
trial. The hope is that the conservative method will account for any biases in
the estimate of the active control effect and any deviation from the constancy
assumption.
There will be instances when the non-inferiority analysis will not be based
on either philosophy. For example, clinical judgment may deem the unaccept-
able amount of loss of the active control effect to be smaller than would be
determined from either philosophy. Another exception can occur when there
is great heterogeneity in the effect of the control in previous studies. If the
heterogeneity cannot be explained, the non-inferiority analysis may need to
consider this heterogeneity and how small the active control effect may need
to be in the non-inferiority trial. If the active control therapy has not regu-
larly shown efficacy in clinical trials, the non-inferiority margin may need to
91
be zero, meaning that the experimental therapy must show superiority to the
active control to be deemed effective.
In this chapter, we discuss two-confidence-interval and synthesis methods
for non-inferiority testing in an active-controlled trial. These methods are
compared in Section 5.4. Additionally, the type I error rates are also assessed
in Section 5.4, including under practical models where the estimation of the
active control effect is subject to regression-to-the-mean bias. In Section 5.5,
we compare the results of the two-confidence-interval and synthesis methods
with an example in oncology.
active control may have versus placebo in the setting of the non-inferiority
trial. If the active control has an effect of M1 in the non-inferiority trial, the
trial will have assay sensitivity in determining whether an experimental
therapy is effective or ineffective provided adequate study conduct. The non-
inferiority margin M2 is a fraction of M1 chosen to assure that the experi-
mental therapy retains at least some desired amount of the active control
effect. The margins of M1 and M2 are used respectively for two objectives:
(1) demonstrating that the experimental therapy is superior to placebo and
(2) demonstrating that the experimental therapy is not unacceptably worse
than the active control.
M2 has been treated as a fixed margin, despite often being based on or
influenced by the estimated active control effect. In this section, we will con-
sider testing involving statistical hypotheses that treat M2 as a fixed value.
In Section 5.4 on evaluating error rates, we will treat M1 and M2 as realized
values involving the estimated active control effect. We will consider com-
paring the treatment arms using metrics based on undesirable outcomes.
This includes the difference in means where the smaller the value the better,
differences in proportions on an undesirable event, the log-relative risk of
an undesirable event, and the log-hazard ratio where the longer the time the
better. Then the hypotheses of interest are expressed as

$H_o{:}\ \beta_N \ge M_2 \quad \text{versus} \quad H_a{:}\ \beta_N < M_2 \qquad (5.1)$
where β N is the experimental therapy versus the active control therapy (i.e.,
E–C or E/C) parameter of interest (i.e., the true treatment difference) in
the non-inferiority trial. The inequalities in the hypotheses in Equation 5.1
would be reversed for “positive” or desirable outcomes (e.g., cure,
prevention, time-to-relief). For these cases, each hypothesis is expressed by
multiplying each side of the inequality by –1 and defining the parameter in
terms of the active control therapy versus the experimental therapy.
Since M1 and M2 are based on the estimated effect of the active control,
they are realizations of random quantities, not constants fixed a priori. The hypotheses in
Equation 5.1 are surrogate hypotheses for whether the experimental ther-
apy is unacceptably worse than the active control in the setting of the non-
inferiority trial. The hope is that rejecting Ho in Expression 5.1 and concluding
Ha will imply that the experimental therapy is effective, with an effect that
is not unacceptably worse than the active control. The null hypothesis in
Expression 5.1 is rejected and non-inferiority is concluded when the upper
limit of a 100(1 – α)% confidence interval for βN is less than M2. Normal-
based confidence intervals are often used, so Ho would be rejected when
$\hat\beta_N + z_{\alpha/2}s_N < M_2$, where $\hat\beta_N$ is the estimated value for $\beta_N$ and $s_N$ is the esti-
mated standard deviation for $\hat\beta_N$.
Additionally, it is popular to define M1 as the lower limit of a 100(1 – γ)%
confidence interval for the historical active control effect, βH (expressed so that
positive values correspond to a beneficial effect).
Consistent with Hung, Wang, and O’Neill,5 a “Y–X method” will refer to a
two-confidence-interval procedure where a two-sided Y% confidence inter-
val is determined from the non-inferiority trial and the active control effect
is based on a two-sided X% confidence interval. The definitions of Y and
X are the reverse in the U.S. Food and Drug Administration (FDA) draft
guidance.6
Using 95% confidence intervals for both the historical effect of the active
control therapy and for the comparison of the experimental and active control
therapies in the non-inferiority trial is common. We will refer to this approach
as the 95–95 method or approach. This approach has been described as com-
paring the two statistically worst cases. For the 95–95 approach, Rothmann et
al.1 showed that when the constancy assumption holds, the one-sided type I
error rate is between 0.0027 and 0.025 in falsely concluding that an ineffective
therapy is effective. Sankoh7 called the two-confidence-interval approach
“uniformly ultraconservative,” preferring to use a fraction of the point
estimate instead of the lower bound of a confidence interval for the active
control effect. Although using a fraction of the lower bound of a confidence
interval may be conservative in many situations, it may not be conservative
(and certainly not uniformly ultraconservative) in all situations, particularly
in indications where regression-to-the-mean bias and/or effect modification
are major concerns.
Such a margin (the lower bound of the 95% confidence interval or some
fraction thereof) will be conservative when the constancy assumption holds.
However, in many cases, the constancy assumption does not hold, or cannot
be proven to hold. The use of the lower limit of the 95% confidence interval
for the estimated active control effect provides some adjustment for bias and
deviation from the constancy. Subjects enrolled in the current study may be
fundamentally different from subjects enrolled in the historical study, owing
to changes in diagnosis or standards of concomitant care since the histori-
cal study was completed; or the disease is fundamentally different (such as
infectious diseases, which are known to change over time as they adapt in
response to medications); or logistics differ (when a study is run in a dif-
ferent set of geographic sites than the historical comparison used). When
the constancy assumption may not hold, choosing a fraction of the lower
bound of a confidence interval for the historical treatment effect can provide
an allowance for deviation from the constancy assumption.
The width of the confidence interval for the historical effect of the active
control will depend on the sample sizes of the historical studies. A large
estimated effect for the active control therapy from large studies may pro-
duce a confidence interval with a lower bound that corresponds to a large effect, and thus
require a smaller sample size for the non-inferiority trial to rule out a differ-
ence of practical importance. Conversely, a single small study may produce
a confidence interval with a lower bound that corresponds to a small effect,
even if the point estimate of the active control effect was large, and thus
require a large sample size for the non-inferiority trial to rule out an appro-
priate non-inferiority margin. In such cases, it is tempting, although it may
not be possible, to increase the margin because of the lack of precision in the
estimate of the historical treatment effect. When warranted, the confidence
level for the historical effect of the active control therapy can be adjusted to
be higher for a more conservative, smaller margin or be adjusted lower for
a more liberal, larger margin. Hauck and Anderson8 suggested utilizing the
lower bound of a confidence interval with a confidence level of 68–90%. The
lower confidence level will lead to a larger lower bound, and hence a larger
non-inferiority margin.
Fixed-effects and random-effects meta-analyses have been used in deter-
mining the confidence interval for the historical active control effect.
Between-trial variability in the active control is a concern especially when
the heterogeneity in the active control effect cannot be explained. When
there is a single study, the heterogeneity in the active control effect cannot
be assessed. Also, with the lack of a reproduced effect size, the significant
or highly significant result from a single trial may have a large associated
regression-to-the-mean bias and thus greatly overstate the true active con-
trol effect. The existence of multiple studies that provide consistent estimates
of the active control effect gives assurance that the regression-to-the-mean
bias is small, and that the meta-analysis reliably estimates the active control
effect when the historical effect of the active control applies in the setting of
the non-inferiority trial. Concerns about applying the estimated active
control effect to the setting of the non-inferiority trial may lead to either dis-
counting the estimated active control effect (i.e., discounting the lower limit
of the confidence interval for the active control effect) or basing the non-
inferiority margin on a larger-level confidence interval for the active control
effect.
If the multiple historical comparisons of the active control to placebo
provide inconsistent estimates of the active control effect, confidence in a
common active control effect decreases. In such a case, the choice of non-
inferiority margin should consider the between-trial variability in the active
control effect. When a random-effects meta-analysis is used for the estima-
tion of the active control effect, Lawrence9 proposes using a 95% prediction
interval for the active control effect in the next (random) trial as a replacement
in the 95–95 method for the 95% confidence interval for the active control
effect.
Example 5.1 summarizes one of the first two-confidence-interval proce-
dures, which involved thrombolytic products.
Example 5.1
Example 5.2
To illustrate the use of the margins just discussed, we revisit Example 4.3, which
considered six approaches for estimating the effect of docetaxel versus best sup-
portive care (BSC) on overall survival in second-line NSCLC. Table 4.3 provided
the estimates and 95% confidence intervals for the docetaxel versus BSC haz-
ard ratios. Here the BSC versus docetaxel log-hazard ratio is the docetaxel effect
parameter, βH. For each of the six approaches using Table 4.3, Table 5.1 gives the
95% confidence interval for βH and the corresponding margins obtained from the
confidence interval, where M1 is the lower limit of the 95% confidence interval
for βH and M2 represents 50% of M1. For approaches 2a and 2b, a superiority
comparison to docetaxel would be required for a new investigational agent.
TABLE 5.1
95% Confidence Intervals for Docetaxel Effect and Corresponding Margins
Approach 95% Confidence Interval for βH Margins
1 (0.128, 1.050) M1 = 0.128, M2 = 0.064
2a (–0.018, 0.611) M0 = 0
2b (–0.015, 0.360) M0 = 0
3 (0.082, 0.764) M1 = 0.082, M2 = 0.041
4a (0.064, 0.646) M1 = 0.064, M2 = 0.032
4b (0.015, 0.624) M1 = 0.015, M2 = 0.0075
The JMEI trial studied the use of pemetrexed against the active control of doc-
etaxel at a dose of 75 mg/m2 in subjects with second-line NSCLC. From the FDA
Oncologic Drugs Advisory Committee meeting transcript,12 the 95% confidence
interval for the pemetrexed versus docetaxel hazard ratio in the JMEI study is
0.817–1.204. Taking the natural logarithms of each limit in the confidence interval
gives a 95% confidence interval for the pemetrexed versus docetaxel log-hazard
ratio, βN, of –0.202 to 0.186. The upper limit of the 95% confidence interval for βN
of 0.186 exceeds every margin specified in Table 5.1. Thus when a margin is based
on a 95% confidence interval for the docetaxel effect versus BSC, the results
from the JMEI trial fail to conclude non-inferiority to docetaxel, regardless of the
approach used to estimate the docetaxel effect.
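The margin arithmetic in Table 5.1 is mechanical: M1 is minus the natural logarithm of the upper limit of the docetaxel-versus-BSC hazard-ratio confidence interval, and M2 is taken here as 50% of M1. A minimal sketch:

```python
import math

hr_upper = {"1": 0.88, "2a": 1.018, "2b": 1.015,
            "3": 0.921, "4a": 0.938, "4b": 0.985}
for approach, upper in hr_upper.items():
    m1 = -math.log(upper)  # lower limit of the 95% CI for beta_H
    if m1 <= 0:
        print(approach, "M0 = 0 (superiority to docetaxel required)")
    else:
        print(approach, f"M1 = {m1:.3f}, M2 = {m1 / 2:.4f}")
```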
Reducing the Potential for Biocreep. In general, when possible, the seemingly
most effective, available standard of care should be used as the active control
in a non-inferiority trial. However, as the selected standard of care is the
therapy or regimen that has the best estimated effect, the estimation of the
effect of that standard of care will have a regression-to-the-mean bias. This
bias should be accounted for by either making an appropriate adjustment to
the estimation of its effect or including the estimated effects of all potential
candidates for a standard of care into the meta-analysis.
For a given indication, once the first non-inferiority trial has established
a criterion for non-inferiority, it may be reasonable that all future non-
inferiority trials have the same or more stringent criterion regardless of the
active control used in the trial. For example, suppose that a margin of 5 days
was used for the duration of an adverse event for the original active control
(A) in non-inferiority trials. If another therapy (B) is to be used as an active
control in a non-inferiority trial and the 5-day margin to A is still relevant,
the non-inferiority margin for B as a control, δ, should be such that it guar-
antees that if the experimental therapy (C) is noninferior to B with margin δ,
then C is noninferior to A with a margin of 5 days. Suppose B was previously
compared with A in randomized trials and the 95% confidence interval for
the difference in days of the mean durations was –0.8 to 1.6. Using the phi-
losophy of a 95–95 method, a margin of 5.0 – 1.6 = 3.4 days may be justified
for B as the control therapy in a non-inferiority trial. In practice, it may also
This provides a basis, for some indications, for requiring that a new therapy
have efficacy greater than some minimal threshold. In the typical synthesis
testing, that threshold is regarded as a prespecified fraction of the effect of
the active control. Snapinn and Jiang18 expressed concern that a requirement
that the experimental therapy in a non-inferiority trial retain more than some
fraction of the effect of the active control creates a higher bar for approval
than was required for the active control, and that such a requirement may
prevent the approval of superior treatments to the active control.
In this section we discuss definitions for the proportion of the active con-
trol effect that the experimental therapy retains, possible corresponding sets
of non-inferiority hypotheses that can be tested, frequentist and Bayesian
procedures, and respective issues.
$\lambda = \frac{\beta_H - \beta_N}{\beta_H - 1} \qquad (5.3)$
The definition of the proportion of the active control effect that is retained
by the experimental therapy in Equation 5.3 is referred to as an “arithmetic
definition” in Rothmann’s paper.1 The definition of the retention fraction has
been used for relative metrics—for example, a relative risk or a hazard ratio.
However, for relative metrics, how two different possible values (e.g., a and b)
for the metric compare (or statistically compare) depends on their ratio (i.e.,
a/b) not on their difference (i.e., a – b).
For undesirable outcomes (i.e., smaller probabilities of “success” are better)
with a prespecified retention fraction of λo, the null and alternative hypoth-
eses are expressed as

$H_o{:}\ \lambda \le \lambda_o \quad \text{versus} \quad H_a{:}\ \lambda > \lambda_o \qquad (5.4)$

For absolute metrics (where the null value of βH is 0 rather than 1), the
retention fraction is defined as

$\lambda = \frac{\beta_H - \beta_N}{\beta_H} \qquad (5.5)$
When it is assumed that βH > 0, the alternative hypothesis is that the experi-
mental therapy retains more than 100λo% of the historical effect of the active
control.
The null and alternative hypotheses have also been expressed simply, with
the null hypothesis (5.8) being the complement of the alternative hypothesis

$H_a{:}\ \{\beta_N - (1-\lambda_o)\beta_H < 0 \text{ and } \beta_H > 0\}\ \text{or}\ \{\beta_N - \beta_H < 0 \text{ and } \beta_H < 0\} \qquad (5.9)$
For relative metrics, the test can be based on the test statistic

$Z_1 = \frac{\log\hat\beta_N - \log\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)}{\sqrt{\widehat{\mathrm{Var}}(\log\hat\beta_N) + \big[(1-\lambda_o)\hat\beta_H\big/\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)\big]^{2}\,\widehat{\mathrm{Var}}(\log\hat\beta_H)}} \qquad (5.10)$

The test rejects the null hypothesis in Expression 5.4 and concludes non-
inferiority when $Z_1 < -z_{\alpha/2}$ for some 0 < α < 1. When $\hat\beta_N$ and $\hat\beta_H$ are in-
dependent, $Z_1$ having an approximate standard normal distribution when λ =
λo depends on $\hat\beta_N$ having a normal distribution, on whether $\log\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)$
has an approximate normal distribution with a variance reliably estimated by
$\big[(1-\lambda_o)\hat\beta_H\big/\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)\big]^{2}\,\widehat{\mathrm{Var}}(\log\hat\beta_H)$, and on whether
$\widehat{\mathrm{Var}}(\log\hat\beta_N)\Big/\Big\{\big[(1-\lambda_o)\hat\beta_H\big/\big(\lambda_o + (1-\lambda_o)\hat\beta_H\big)\big]^{2}\,\widehat{\mathrm{Var}}(\log\hat\beta_H)\Big\}$ is large.
The remainder of this section will focus on absolute metrics.
Hauck and Anderson8 proposed using

$\hat\beta_N - \hat\beta_H + 1.645\sqrt{s_N^2 + s_H^2} < 0 \qquad (5.11)$

as a one-sided test that the experimental therapy is better than placebo. When
the constancy assumption holds, the left-hand side in Expression 5.11 is the
upper limit of the two-sided 90% confidence interval for the difference
between the experimental therapy and a placebo. This test procedure can
be rewritten to compare the upper limit of a one-sided 95% confidence
interval for βN with $\delta^* = \hat\beta_H - cs_H$, where $c = 1.645\sqrt{1 + s_N^2/s_H^2} - 1.645\,s_N/s_H$.
Non-inferiority is concluded when the upper limit of the two-sided 90% con-
fidence interval for βN is less than δ*. A similar procedure can also be found
in papers by Fisher and colleagues.21,22
The use of δ* is contingent on the constancy assumption. Hauck and
Anderson recommended discounting δ* when it is believed that there may be
between-trial variability in the active control effect. As there may be disagree-
ment in the appropriate margin, Hauck and Anderson recommend reporting
the 90% or the 95% two-sided confidence interval for βN to allow each indi-
vidual to decide for themselves whether non-inferiority has been met.
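A sketch of the δ* rewriting with hypothetical inputs follows; the two prints agree, illustrating the algebraic equivalence of Expression 5.11 and the δ* comparison.

```python
import math

bH, sH = 0.40, 0.10  # hypothetical historical effect estimate and standard error
bN, sN = 0.05, 0.12  # hypothetical non-inferiority-trial estimate and standard error

c = 1.645 * math.sqrt(1 + sN ** 2 / sH ** 2) - 1.645 * sN / sH
delta_star = bH - c * sH

print(bN - bH + 1.645 * math.sqrt(sN ** 2 + sH ** 2) < 0)  # Expression 5.11
print(bN + 1.645 * sN < delta_star)                        # equivalent delta* form
```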
When Expression 5.11 and the constancy assumption hold, the left-hand
side provides the minimal effect (or a greater effect) versus placebo
that can be ruled out. For the experimental therapy to rule out the same
minimal effect versus placebo as the active control has ruled out on the basis
of a one-sided 95% confidence interval requires that

$\hat\beta_N < -1.645\left(\sqrt{s_N^2 + s_H^2} - s_H\right) < 0$
For a one-sided 100(1 – α/2)% confidence interval, 1.645 is replaced with zα/2.
The Standard Synthesis Method. The standard synthesis method is based on
the test statistic
$Z_2 = \frac{\hat\beta_N - (1-\lambda_o)\hat\beta_H}{\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}} \qquad (5.12)$
The test rejects the null hypothesis and concludes non-inferiority when $Z_2 < -z_{\alpha/2}$
for some 0 < α < 1. After correcting for differences in notation, we see that
the test statistics $Z_1$ in Equation 5.10 and $Z_2$ in Equation 5.12 are equivalent
when λo = 0 or 1. When the estimators $\hat\beta_N$ and $\hat\beta_H$ are independent, normally
distributed, and unbiased, or approximately so, $Z_2$ will have a standard nor-
mal distribution or an approximate standard normal distribution when $\beta_N =
(1-\lambda_o)\beta_H$, with a type I error rate of approximately α/2 for falsely concluding
that an experimental therapy that retains 100λo% of the active control effect
retains more than 100λo% of the active control effect. When $\hat\beta_H$ tends to overes-
timate (underestimate) the effect of the active control in the setting of the non-
inferiority trial, the type I error rate will be inflated (deflated). If the sampling
distributions for $\hat\beta_N - \hat\beta_H$ and $\hat\beta_N - (1-\lambda_o)\hat\beta_H$ are not normal distributions, these
two tests can be modified to fit the appropriate sampling distributions.
A Fieller 100(1 – α)% confidence interval can be determined for λ.1 The
Fieller 100(1 – α)% confidence interval equals {λo: –zα/2 < Z2 < zα/2, –∞ ≤ λo ≤ ∞}.
The null hypothesis in Expression 5.6 is rejected whenever every value in the
Fieller 100(1 – α)% confidence interval exceeds the prespecified value for λo.
If it is believed that the effect of the active control may have decreased,
the historical estimated effect can be discounted by using 0 < θ < 1.1 The test
statistic $Z_2$ in Expression 5.6 would then be replaced with the test statistic $Z_2^*$
in Equation 5.13, where

$Z_2^* = \frac{\hat\beta_N - (1-\lambda_o)\theta\hat\beta_H}{\sqrt{s_N^2 + (1-\lambda_o)^2\theta^2 s_H^2}} \qquad (5.13)$
When the size of the non-inferiority trial depends on the estimated active
control effect, the synthesis test statistic takes the form

$\frac{\hat\beta_N - (1-\lambda_o)\hat\beta_H}{\sqrt{\hat\sigma^2_{\hat\beta_N}(\hat\beta_H) + (1-\lambda_o)^2 s_H^2}}$

where the true variance for the non-inferiority trial, $\sigma^2_{\hat\beta_N}(\hat\beta_H)$, is a random
variable that depends on the estimated active control effect, $\hat\beta_H$. Rothmann2
assessed the type I error probability for testing the hypotheses in Expression
5.6 for two confidence interval methods and the standard synthesis method
when the standard error from the non-inferiority trial depends on the esti-
mated historical active control effect.
Delta-Method Confidence Interval Approach. When λo = 0, Hasselblad and
Kong15 recommended testing the hypotheses in Expression 5.6 on the basis
of the test statistic $Z_2$ in Equation 5.12. However, when 0 < λo ≤ 1, Hasselblad
and Kong15 proposed a delta-method confidence interval test procedure. The
estimator of λ is given by $\hat\lambda = 1 - \hat\beta_N/\hat\beta_H$ and the estimated standard error is
given by $S_{\hat\lambda} = \sqrt{(\hat\beta_N/\hat\beta_H)^2\big(s_N^2/\hat\beta_N^2 + s_H^2/\hat\beta_H^2\big)}$. The null hypothesis in Expression
5.7 is rejected, and non-inferiority is concluded, when $\hat\lambda - z_{\alpha/2}S_{\hat\lambda} > \lambda_o$. The test
is equivalent to rejecting the null hypothesis in Expression 5.7 when

$Z_3 = \frac{\hat\beta_N - (1-\lambda_o)\hat\beta_H}{\hat\beta_H S_{\hat\lambda}} < -z_{\alpha/2} \qquad (5.14)$

As $Z_2$ in Equation 5.12 and $Z_3$ in Equation 5.14 have the same numerators, and
$Z_2$ has an approximate standard normal distribution when λ = λo, how close
the distribution of $Z_3$ is to a standard normal distribution may depend on
whether $R = \hat\beta_H S_{\hat\lambda}\Big/\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}$ (i.e., $Z_2/Z_3$) tends to be close to 1.
When λo = 0, R will be greater than 1 with probability 1 and Z3 would have
a distribution more concentrated near zero than the distribution for Z2.23 It
was noted from simulations that fairly large sample sizes may be needed
for the ratio of two independent normally distributed quantities to have an
approximate normal distribution.23 In particular the ratio of the mean of β̂ H to
its (estimated) standard deviation should be greater than 8 for the test based
on the delta-method confidence interval to have approximately the desired
type I error rate when β̂ H unbiasedly estimates the effect of the active control
in the setting of the non-inferiority trial. We comment further on the distri-
bution of Z3 in Section 5.4.2 on comparing the different analysis methods.
In Example 5.3, synthesis procedures will be performed including the
determination of Fieller confidence intervals for the proportion of the active
control effect retained by the experimental therapy.
Example 5.3
To illustrate the use of some of the synthesis methods just discussed, we revisit
Example 4.3. The JMEI trial studied the use of pemetrexed against the active con-
trol of docetaxel at a dose of 75 mg/m2 in subjects with second-line NSCLC. The
endpoint of interest was overall survival. We use the result of approach 2b in
Example 4.3 for the estimation of the docetaxel effect. From that approach, the
estimated docetaxel versus BSC hazard ratio was 0.842, with a corresponding
standard error for the log-hazard ratio estimator of 0.095 (95% confidence interval
for the hazard ratio of 0.698, 1.015). From the FDA Oncologic Drugs Advisory
Committee meeting,12 the estimated pemetrexed versus docetaxel hazard ratio
was 0.992 in the JMEI study with corresponding standard error for the log-hazard
ratio estimator of 0.099, which is determined from the 95% confidence interval of
0.817–1.204. Then the indirect estimate of the pemetrexed versus BSC hazard ratio
is given by 0.992 × 0.842 = 0.835, with a standard deviation for the correspond-
ing log-hazard ratio estimator of $\sqrt{(0.099)^2 + (0.095)^2} = 0.137$. This leads to a 95%
confidence interval for the pemetrexed versus BSC hazard ratio of 0.638–1.093.

For λo = 0.5, we have $Z_2 = \dfrac{\ln 0.992 - (1-0.5)\ln(1/0.842)}{\sqrt{(0.099)^2 + (1-0.5)^2(0.095)^2}} = -0.856$, which would
correspond with a one-sided p-value of 0.195. Here, since the 95% confidence
interval for the docetaxel versus BSC hazard ratio includes 1 (the upper limit is 1.015), and the
pemetrexed versus docetaxel estimated hazard ratio is close to 1, the 95% Fieller
confidence interval for λ as defined in Equation 5.5 is –∞ to ∞. That is, the
95% confidence interval does not rule out any possibilities for λ. A 90% Fieller
confidence interval for λ is –1.01 to 3.55.
If the estimated docetaxel effect was discounted by 20% (i.e., θ = 0.8), the
resulting value of the test statistic would be Z2* = −0.724 with a corresponding
one-sided p-value of 0.23.
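A sketch reproducing the computations of Example 5.3 follows; the Fieller limits are found by a simple grid search and agree with the quoted interval up to rounding of the inputs.

```python
import numpy as np
from scipy.stats import norm

bN, sN = np.log(0.992), 0.099      # pemetrexed vs. docetaxel (JMEI)
bH, sH = np.log(1 / 0.842), 0.095  # BSC vs. docetaxel (approach 2b)

def z2(lam, theta=1.0):
    """Synthesis statistic Z2 (Z2* when theta < 1 discounts the control effect)."""
    return (bN - (1 - lam) * theta * bH) / np.sqrt(
        sN ** 2 + (1 - lam) ** 2 * theta ** 2 * sH ** 2)

print(z2(0.5), norm.cdf(z2(0.5)))                        # approx. -0.856 and 0.196
print(z2(0.5, theta=0.8), norm.cdf(z2(0.5, theta=0.8)))  # approx. -0.724 and 0.234

lam = np.linspace(-5, 5, 200_001)
inside = lam[np.abs(z2(lam)) < norm.ppf(0.95)]           # 90% Fieller interval
print(inside.min(), inside.max())                        # approx. -1.0 and 3.6
```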
$\hat\beta_H \pm z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$

is a 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ when the
experimental therapy has the same efficacy as a placebo. Non-inferiority (or
any efficacy in this case) is concluded when $\hat\beta_N < \hat\beta_H - z_{\alpha/2}\sqrt{\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$,
or in other words when the observed value for $\hat\beta_N$ is less than the lower limit
of the 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ (deter-
mined under the assumption that the experimental therapy has the same
efficacy as a placebo).
We will discuss synthesis methods as prediction interval methods for both
fixed-effects and random-effects models for the active control effect.
Fixed-Effects Model. Consider a fixed-effects model for the active control
effect where it is assumed that the active control effect is constant across all
trials, including in the setting of the non-inferiority trial. When the experi-
mental therapy retains 100λ% of the effect of the active control therapy,
a 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ is given by
$(1-\lambda)\hat\beta_H \pm z_{\alpha/2}\sqrt{(1-\lambda)^2\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$. Non-inferiority (i.e., the experi-
mental therapy retains more than 100λ% of the effect of the active control
therapy) is concluded when $\hat\beta_N < (1-\lambda)\hat\beta_H - z_{\alpha/2}\sqrt{(1-\lambda)^2\mathrm{Var}(\hat\beta_H) + \mathrm{Var}(\hat\beta_N)}$, or
in other words when the observed value for $\hat\beta_N$ is less than the lower limit of
the 100(1 – α)% prediction interval for the observed value of $\hat\beta_N$ (determined
under the assumption that the experimental therapy retains exactly 100λ% of
the effect of the active control therapy).
Random-Effects Model. Consider a random-effects model for the active con-
trol effect where it is assumed that the same random-effects model holds
for all trials, including in the setting of the non-inferiority trial. For the case
where a random-effects model is used for the effect of the active control,
the same notation will be used as in Section 4.3.3. Thus γ and γ k+1 will be
used in place of βH and βN, respectively. Parameters and random variables
for the non-inferiority trial will be subscripted by k + 1. When the experi-
mental therapy has the same effect as a placebo and the same random-effects
model that applies for the historical studies of the active control also applies
in the non-inferiority trial, we have $\gamma_{k+1} = \gamma + \eta_{k+1}$ and $\hat\gamma_{k+1} = \gamma + \eta_{k+1} + \varepsilon_{k+1}$,
where $\eta_{k+1} \sim N(0,\tau^2)$ and $\varepsilon_{k+1}|\eta_{k+1} \sim N(0,\sigma_{k+1}^2)$ are uncorrelated, γ is the global
mean active control effect across studies, and $\sigma_{k+1}^2 = \mathrm{Var}(\hat\gamma_{k+1}|\gamma_{k+1})$. If σ1,…, σk,
σk+1 and τ2 are known, then $\hat\gamma_{k+1} - \hat\gamma$ has a normal distribution with mean
equal to zero and variance $1\Big/\sum_{i=1}^{k}(\tau^2+\sigma_i^2)^{-1} + \tau^2 + \sigma_{k+1}^2$. In practice, σ1,…,
σk, σk+1, and τ2 are not known, and then a 100(1 – α)% prediction interval for
the observed value of $\hat\gamma_{k+1}$ is given by

$\hat\gamma \pm w_{\alpha/2}\sqrt{1\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + \hat\tau^2 + \hat\sigma_{k+1}^2}$

where $w_{\alpha/2}$ is the 100(1 – α/2) percentile (or an approximation thereof) of
the distribution for $(\hat\gamma_{k+1} - \hat\gamma)\Big/\sqrt{1\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + \hat\tau^2 + \hat\sigma_{k+1}^2}$. Under certain
assumptions or conditions, this standardized quantity may be approximated
by a t distribution. When the experimental therapy retains exactly 100λo% of the
effect of the active control therapy, the corresponding 100(1 – α)% prediction
interval for the observed value of $\hat\beta_N$ is

$(1-\lambda_o)\hat\gamma \pm w_{\alpha/2}\sqrt{(1-\lambda_o)^2\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + (1-\lambda_o)^2\hat\tau^2 + \hat\sigma_{k+1}^2}$

Non-inferiority (i.e., the experimental therapy retains more than 100λo% of the effect
of the active control therapy) is concluded when
$\hat\beta_N < (1-\lambda_o)\hat\gamma - w_{\alpha/2}\sqrt{(1-\lambda_o)^2\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + (1-\lambda_o)^2\hat\tau^2 + \hat\sigma_{k+1}^2}$.
When $w_{\alpha/2} \approx z_{\alpha/2}$, this test procedure can be expressed as comparing a
synthesis-like test statistic to $-z_{\alpha/2}$. The test statistic would be given by

$Z_4 = \frac{\hat\beta_N - (1-\lambda_o)\hat\gamma}{\sqrt{(1-\lambda_o)^2\Big/\sum_{i=1}^{k}(\hat\tau^2+\hat\sigma_i^2)^{-1} + (1-\lambda_o)^2\hat\tau^2 + \hat\sigma_{k+1}^2}}$
The appropriateness of using standard normal critical values may be influ-
enced by the sampling distribution for the estimated active control effect and
by whether the sizing of the non-inferiority trial depended on the estimation
of the active control effect.
μX will factor into the product of the marginal posterior densities. The joint
posterior density for μY and μX is given by

$\frac{h_X(\mu_X)h_Y(\mu_Y)\exp\{(-1/2)[(y-\mu_Y)^2/\sigma_Y^2 + (x-\mu_X)^2/\sigma^2(y)]\}}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h_X(\mu_X)h_Y(\mu_Y)\exp\{(-1/2)[(y-\mu_Y)^2/\sigma_Y^2 + (x-\mu_X)^2/\sigma^2(y)]\}\,\mathrm{d}\mu_Y\,\mathrm{d}\mu_X}$

which factors into the product of

$\frac{h_Y(\mu_Y)\exp\{(-1/2)(\mu_Y-y)^2/\sigma_Y^2\}}{\int_{-\infty}^{\infty} h_Y(\mu_Y)\exp\{(-1/2)(\mu_Y-y)^2/\sigma_Y^2\}\,\mathrm{d}\mu_Y}$

and

$\frac{h_X(\mu_X)\exp\{(-1/2)(\mu_X-x)^2/\sigma^2(y)\}}{\int_{-\infty}^{\infty} h_X(\mu_X)\exp\{(-1/2)(\mu_X-x)^2/\sigma^2(y)\}\,\mathrm{d}\mu_X}$
The outcome is modeled as $y = \chi + \beta x + \gamma z + \varepsilon$, where x and z are treatment
indicators and ε is the random deviation from the mean. The indicator x = 1 if
the treatment is the control therapy; otherwise x = 0. The indicator z = 1 if the
treatment is the experimental therapy; otherwise z = 0. Per Simon’s setup, larger
values of y are better outcomes. The mean outcomes for the experimental therapy,
control therapy, and placebo are χ + γ, χ + β, and χ, respectively. The errors are
assumed to be independent and normally distributed with mean 0 and some
common variance. Let h denote the joint prior density for χ, β, and γ. Then for
the sample means from the non-inferiority trial, $\bar y_E$ and $\bar y_C$, the joint posterior
density satisfies

$g(\chi,\beta,\gamma|\bar y_E,\bar y_C) \propto f_E(\bar y_E|\chi,\gamma)\,f_C(\bar y_C|\chi,\beta)\,h(\chi,\beta,\gamma)$
When χ, β, and γ are modeled with independent prior distributions, h can be
replaced with the product of the marginal prior densities.
When the sample means are modeled as having normal distributions and
independent normal distributions are chosen for the prior distributions of
χ, β, and γ, the joint posterior distribution for (χ, β, γ) is a multivariate nor-
mal distribution. Various posterior probabilities can be determined (a simu-
lation sketch follows the list below). These include, for “positive” or desirable
outcomes (e.g., response):
(a) The probability that the experimental therapy is better than placebo
(i.e., P(γ > 0)).
(b) The probability that the experimental therapy is better than the con-
trol therapy (i.e., P(γ > β)).
(c) The probability that the experimental therapy is better than both the
control therapy and placebo (i.e., P(γ > β and γ > 0)).
(d) From Simon’s paper,14 the probability that the experimental therapy
retains more than 100k% of the control therapy’s effect and the con-
trol therapy is better than placebo (i.e., P(γ – kβ > 0 and β > 0)).
(e) The probability that the experimental therapy retains more than
100k% of the control therapy’s effect and the control therapy is better
than placebo, or the experimental arm is better than both the control
therapy and placebo (i.e., P(γ – kβ > 0 and β > 0) + P(γ > 0 and β < 0)).
Note that the probability statements in (a)–(e) do not involve χ. The inequali-
ties in the probability statements would be reversed for “negative” or unde-
sirable outcomes (e.g., adverse events, time-to-death/overall survival). In (e),
the experimental therapy may have adequate efficacy when the experimental
therapy retains more than some minimal fraction of the effect of the control
therapy when the control therapy is effective, or when the experimental ther-
apy is more effective than both the placebo and the control therapy when the
control therapy is not effective. Additional comments on (e) are given below.
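Probabilities (a) through (e) are easy to approximate by simulation when independent normal posteriors are used for β and γ. The following minimal sketch illustrates this; the posterior means and standard deviations are assumptions for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical independent normal posteriors for the control effect (beta)
# and the experimental effect (gamma), both vs. placebo; desirable outcome.
beta = rng.normal(0.30, 0.10, n)    # control vs. placebo effect
gamma = rng.normal(0.25, 0.12, n)   # experimental vs. placebo effect
k = 0.5                             # retention fraction of interest

p_a = np.mean(gamma > 0)                            # (a) better than placebo
p_b = np.mean(gamma > beta)                         # (b) better than control
p_c = np.mean((gamma > beta) & (gamma > 0))         # (c) better than both
p_d = np.mean((gamma - k * beta > 0) & (beta > 0))  # (d) retention + effective control
p_e = p_d + np.mean((gamma > 0) & (beta < 0))       # (e) (d), or effective when control is not
print(p_a, p_b, p_c, p_d, p_e)
```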
The posterior probabilities in (d) and (e) involve the experimental therapy retaining more than a minimal fraction of the control therapy's effect.
The definitions for the fraction of the control therapy’s effect retained by
the experimental therapy, the retention fraction, require that the control
therapy has a positive effect (β > 0). In such situations, the retention frac-
tion is a measure of the relative efficacy of the experimental therapy ver-
sus the control therapy. Since the parameter space for (γ, β) is –∞ < γ < ∞,
–∞ < β < ∞, and includes possibilities where the effect of the active control is
zero or negative, the retention fraction is not defined for some possible (γ, β).
In general, it can be problematic dealing with new parameters (a function
of the original parameters) that do not exist everywhere over the original/
underlying parameter space. This is particularly true when the estimator of
the new parameter is a function of estimators of the original parameters.
The variance for the original parameters and their sampling distributions
would incorporate possibilities for which the new parameter is not defined.
Inference on the new parameter should consider such issues. Here when
β ≤ 0 (the placebo is as effective or more effective than the control therapy),
the desired possibilities for (γ, β) may be that the experimental therapy has
any efficacy (i.e., γ > 0). When β > 0 and the experimental therapy has some
fixed advantage over placebo, the proportion of the control therapy’s effect
that is retained by the experimental therapy increases without bound as the
effect of the control therapy decreases toward zero. It is thus reasonable for
any fixed γ > 0 and –∞ < a < b < ∞ that the relative efficacy of the experimen-
tal therapy versus the control therapy is larger when β = a than when β = b,
even when a (and possibly also b) is negative. The probability in (e) would
consider any case of (γ, β) where γ > 0 and β ≤ 0 as providing greater relative
efficacy of the experimental therapy versus the control therapy than any case
of (γ, β) where γ > 0 and β > 0.
For undesirable outcomes as was used in the earlier definitions of βN and
βH, the probability statements (a)–(e) are given by
1. P(βN – βH < 0)
2. P(βN < 0)
3. P(βN < 0 and βN – βH < 0)
4. P(βN – (1 – k)βH < 0 and βH > 0)
5. P(βN – (1 – k)βH < 0 and βH > 0) + P(βN – βH < 0 and βH < 0)
Example 5.4
Consider the following hypothetical example for overall survival. The prior distribution for βH, the placebo versus control therapy log-hazard ratio, is modeled as a normal distribution with mean 0.2 and standard deviation 0.1. On the basis of a noninformative prior distribution and the study results comparing the experimental and control arms in the non-inferiority trial, the posterior distribution for βN, the experimental versus control log-hazard ratio, is modeled as a normal distribution with mean –0.10 and standard deviation 0.08. Then we have the following probabilities:

The probability that the experimental therapy is better than placebo: P(βN – βH < 0) = 0.990
The probability that the experimental therapy is better than the control therapy: P(βN < 0) = 0.894
The probability that the experimental therapy is better than both the control therapy and placebo: P(βN < 0 and βN – βH < 0) = 0.891
The probability that the experimental therapy retains more than 50% of the control therapy's effect and the control therapy is better than placebo: P(βN – βH/2 < 0 and βH > 0) = 0.964
The probability that the experimental therapy retains more than 50% of the control therapy's effect and the control therapy is better than placebo, or the experimental arm is better than both the control therapy and placebo: P(βN – βH/2 < 0 and βH > 0) + P(βN – βH < 0 and βH < 0) = 0.981

Additionally requiring that the control therapy is better than placebo, we have for λ = 1 – βN/βH that 0.95 = P(0.614 < λ < 9.90, βH > 0). Thus 0.614–9.90 is a 95% credible interval for λ, the proportion of the control therapy's effect that is retained by the experimental therapy, when additionally requiring that the control therapy has an effect.
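The probabilities in Example 5.4, and the joint probability behind the 0.614–9.90 credible interval, can be checked with a short Monte Carlo sketch under the stated normal models (an illustration, not the authors' computation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000

betaH = rng.normal(0.20, 0.10, n)   # placebo vs. control log-hazard ratio
betaN = rng.normal(-0.10, 0.08, n)  # experimental vs. control log-hazard ratio

print(np.mean(betaN - betaH < 0))                         # ~0.990
print(np.mean(betaN < 0))                                 # ~0.894
print(np.mean((betaN < 0) & (betaN - betaH < 0)))         # ~0.891
print(np.mean((betaN - betaH / 2 < 0) & (betaH > 0)))     # ~0.964
print(np.mean((betaN - betaH / 2 < 0) & (betaH > 0))
      + np.mean((betaN - betaH < 0) & (betaH < 0)))       # ~0.981

# Joint probability behind the 0.614-9.90 credible interval for the
# retention fraction lambda = 1 - betaN/betaH (division warnings near
# betaH = 0 are harmless here).
lam = 1 - betaN / betaH
print(np.mean((betaH > 0) & (lam > 0.614) & (lam < 9.90)))  # ~0.95
```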
There are six possible orderings for the effects of the experimental therapy,
active control, and placebo. Table 5.2 provides the posterior probability for each
possible ordering on overall survival of the experimental therapy, control therapy,
and placebo. There is only a 0.005 posterior probability that the placebo is better
than both the active control and the experimental therapy. The bulk of the prob-
ability, 0.973, corresponds with the orderings of E > C > P and C > E > P.
Effect Retention Likelihood Plot. As a graphical tool for assessing the relative
efficacy of an experimental therapy to the active control therapy, Carroll26
proposed the use of an effect retention likelihood plot, which plots the pos-
terior probability that the experimental therapy retains more than a given
retention fraction against that given retention fraction between 0 (i.e., indirect
superiority to placebo) and 1 (i.e., superiority to the active control). According
to Carroll, the use of an effect retention likelihood plot would be part of
a stepwise approach where first the non-inferiority trial would be sized to
indirectly demonstrate that the experimental therapy is better than placebo;
when the data are analyzed, the posterior probability that the experimental
therapy is superior to placebo is determined, and if sufficiently high, then
the relative efficacy of the experimental therapy to the control therapy is
assessed using the effect retention likelihood plot.
Analogous plots to the effect retention likelihood plot can also be con-
structed of the posterior probability that the difference in effects of the
experimental therapy and the active control therapy (or the indirect effect
of the experimental therapy versus placebo) is greater than any prespecified
value. Additionally, when noninformative prior distributions are used for
the effects, the posterior probabilities will equal or approximately equal 1
minus the corresponding one-sided p-value. Therefore the one-sided p-values
can be substituted for the corresponding posterior probabilities in such plots.
Example 5.5 gives a modified version of Carroll's effect retention likelihood plot for approach 2b in Examples 4.3, 5.2, and 5.3.
TABLE 5.2
Posterior Probability for Each Possible Ordering

Order^a      Probability Statement         Posterior Probability
E > C > P    P(βN < 0, βH > 0)             0.874
E > P > C    P(βN – βH < 0, βH < 0)        0.017
C > E > P    P(βN > 0, βN – βH < 0)        0.099
C > P > E    P(βH > 0, βN – βH > 0)        0.004
P > E > C    P(βN – βH > 0, βN < 0)        0.003
P > C > E    P(βH < 0, βN > 0)             0.002
^a The ">" sign represents "better than" or "superior to."
FIGURE 5.1
Probability that the true effect retention exceeds a given value between 0 and 1. (The plot shows the posterior probability, ranging from about 0.5 to 0.9, against the retention fraction from 0.0 to 1.0.)
Example 5.5
We will revisit the previous example involving pemetrexed, docetaxel, and BSC in NSCLC based on approach 2b. Consider noninformative prior distributions on the pemetrexed versus docetaxel log-hazard ratio, βN, and the BSC versus docetaxel log-hazard ratio, βH. Then βN has a normal posterior distribution with mean –0.008 and standard deviation 0.099, and βH has an independent normal posterior distribution with mean 0.172 and standard deviation 0.095. Let Λ = 1 – βN/βH. Figure 5.1 provides a plot of P(Λ > λ, βH > 0) + P(βN – βH < 0, βH < 0) versus λ, which is a modified version of Carroll's effect retention likelihood plot. For λ = 0, 0.25, 0.5, 0.75, and 1, the respective probability is 0.905, 0.868, 0.803, 0.690, and 0.526.
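A sketch of how the plotted curve in Figure 5.1 can be computed by simulation under the stated posterior distributions (illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

betaN = rng.normal(-0.008, 0.099, n)  # pemetrexed vs. docetaxel log HR
betaH = rng.normal(0.172, 0.095, n)   # BSC vs. docetaxel log HR
Lam = 1 - betaN / betaH

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    p = (np.mean((Lam > lam) & (betaH > 0))
         + np.mean((betaN - betaH < 0) & (betaH < 0)))
    print(lam, round(p, 3))  # ~0.905, 0.868, 0.803, 0.690, 0.526
```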
5.3.7 Application
Example 5.6 revisits the non-inferiority comparison of pemetrexed ver-
sus docetaxel discussed in Example 5.3 using all approaches discussed in
Example 4.3 in Section 4.2.4. This allows for a comparison of the results from
each approach in estimating the docetaxel effect. Additionally, for approaches 4a and 4b, which involve a nonnormal sampling distribution for the estimated docetaxel effect, a Bayesian analysis will be done.
Example 5.6
For each approach discussed in Section 4.2.4, Table 5.3 provides the estimates
and 95% confidence interval for the indirect pemetrexed versus BSC hazard ratio
along with the one-sided p-value for indirectly testing that pemetrexed is superior
to BSC and the one-sided p-value for testing that pemetrexed is noninferior to doc-
etaxel at 75 mg/m2 (D75) by retaining more than 50% of the effect of D75 versus
BSC. These calculations are based on the “constancy assumption” that the effects
are constant across trials. The indirect estimate of the pemetrexed versus BSC
TABLE 5.3
Estimates, Confidence Intervals, and p-Values of Pemetrexed versus BSC Hazard Ratio by Approach

                Pemetrexed vs. BSC Hazard Ratio                      One-Sided p-Value
Approach   Estimate   95% Confidence/Credible Interval   Pemetrexed Better than BSC   50% Retention
1          0.555      (0.337, 0.916)                     0.011                        0.026
2a         0.737      (0.510, 1.067)                     0.052                        0.110
2b         0.835      (0.638, 1.093)                     0.095                        0.195
3          0.650      (0.438, 0.963)                     0.016                        0.048
4a         0.670      (0.485, 0.974)                     0.019                        0.053
4b         0.699      (0.499, 1.027)                     0.033                        0.077
log-hazard ratio equals the pemetrexed versus D75 estimate from the JMEI study
plus the estimated D75 versus BSC log-hazard ratio from the particular approach
(hazard ratios provided in Table 4.3). For approaches 1, 2a, 2b, and 3, the stan-
dard error for the indirect log-hazard ratio estimator is the square root of the sum
of the variances. The corresponding p-values for approaches 1, 2a, 2b, and 3 are
based on synthesis test statistics. Results based on approach 2b are provided in
Example 5.3.
For approaches 4a and 4b, simulations were performed to determine the indi-
rect estimate and 95% credible interval for the pemetrexed versus BSC hazard
ratio and to determine the posterior probabilities for the respective one-sided null
hypotheses in testing that pemetrexed is superior to BSC and that pemetrexed is
noninferior to D75.
As in Example 4.3 in Section 4.2.4, let β denote the common true D75/D100 ver-
sus BSC/V/I log-hazard ratio (β = – β H). Also, define β̂1, β̂ 2 , β̂3 , and β̂ 4 , and (Z1, Z2),
Z3, Z4 as in Example 4.3. Let βˆ = min{βˆ1, βˆ 2 , βˆ3 , βˆ 4 } denote the minimum observed
D75/D100 versus BSC/V/I log-hazard ratio (maximum observed effect) and let
W = min{0.138Z1, 0.138Z2, 0.221Z3, 0.235Z4}. Then β̂ = β + W . Thus, β = βˆ − W ,
and because the distribution of W does not depend on the value of β, it makes
sense that the posterior distribution of β is the distribution of y –W given that β̂ = y .
This will be true when a flat, improper prior distribution is selected for β.
Let f W (·) denote the density for W and f(·|β) denote the density for β̂ given the
true value β. Then for –∞ < y < ∞, f(y|β) = f W (y – β). For a flat, improper prior
distribution for β (i.e., the “density” equals a positive constant over the parameter
space), the posterior density for β is given by g(β|y) = f W (y – β) for –∞ < β < ∞.
Thus, given the value of y for β̂ , the posterior distribution of β is simulated through
random values of W, where for each replication β = βˆ − W is calculated. On the
basis of the results of the JMEI study, the posterior distribution for βN, the pem-
etrexed versus D75 log-hazard ratio is modeled as having a normal distribution
with mean ln 0.992 and standard deviation 0.099. The pemetrexed versus BSC
log-hazard ratio is equal to β + βN (i.e., βN – β H). On the basis of 100,000 simula-
tions, the posterior distribution of β + βN has mean –0.400 (0.670 = exp(–0.400))
with 2.5th and 97.5th percentiles of –0.724 (0.485 = exp(–0.724)) and –0.026
(0.974 = exp(–0.026)), respectively. Zero was the 98.1st percentile of the posterior
distribution of β + βN, which leads to the “p-value” (i.e., the posterior probability
that pemetrexed is inferior to BSC) of 0.019 = 1 – 0.981. The retention fraction is
given by λ = 1 + βN/β when β < 0. Among the 100,000 replications, 94.7% had
λ > 0.5 and β < 0, or β + βN < 0 and β > 0. The posterior probability of the comple-
ment event of 5.3% is used in Table 5.3 as a one-sided p-value for testing for more
than 50% retention of the docetaxel effect.
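The following sketch illustrates the approach 4a simulation. Two assumptions are made for illustration: the observed minimum β̂ is a hypothetical placeholder (the actual value comes from Example 4.3), and Z1,…,Z4 are drawn as independent standard normals even though (Z1, Z2) are dependent in the actual example. Conditioning as in approach 4b would simply restrict to the replications where the fourth component attains the minimum.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rep = 100_000

# The dependence between Z1 and Z2 from Example 4.3 is not reproduced here;
# all four statistics are drawn as independent standard normals.
scales = np.array([0.138, 0.138, 0.221, 0.235])
W = (scales * rng.standard_normal((n_rep, 4))).min(axis=1)

beta_hat = -0.35       # hypothetical placeholder for the observed minimum
beta = beta_hat - W    # posterior draws of beta under a flat, improper prior

# Posterior for the pemetrexed vs. D75 log-hazard ratio (JMEI study)
betaN = rng.normal(np.log(0.992), 0.099, n_rep)
total = beta + betaN   # indirect pemetrexed vs. BSC log-hazard ratio
print(np.exp(total.mean()), np.exp(np.quantile(total, [0.025, 0.975])))
print((total < 0).mean())  # posterior probability pemetrexed is better than BSC
```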
For approach 4b, 31,899 of the 100,000 replications had β̂ 4 = min{βˆ1, βˆ 2 , βˆ3 , βˆ 4 } .
Conditioning on β̂ 4 = min{βˆ1, βˆ 2 , βˆ3 , βˆ 4 } , the posterior distribution of β + βN has a
mean of –0.358 (0.699 = exp(–0.358)) with 2.5th and 97.5th percentiles of –0.695
(0.499 = exp(–0.695)) and 0.026 (1.027 = exp(0.026)), respectively. Zero was the
96.7th percentile of the posterior distribution of β + βN, which leads to a one-sided
p-value of 0.033 = 1 – 0.967. Among the 31,899 replications, 92.3% had λ > 0.5
and β < 0, or β + βN < 0 and β > 0. The posterior probability of the complement
event of 7.7% is used in Table 5.3 as a one-sided p-value for testing for more than
50% retention of the docetaxel effect.
In all, three of the six approaches provided one-study evidence that pemetrexed is more effective than BSC by having one-sided p-values less than 0.025. The one-sided p-values for each approach were greater than 0.025 for testing that pemetrexed retained more than 50% of the docetaxel effect.
$$n = 4\sigma^2/\sigma_N^2 \qquad (5.16)$$

$$r = 4/\sigma_N^2 \qquad (5.17)$$
Example 5.7
Consider a continuous variable where smaller values are more desirable and the
estimated mean difference between placebo and the active control is 4.5 with a
corresponding standard error of 0.6. An experimental therapy is required to dem-
onstrate better than 60% retention of the active control effect. For the standard
synthesis method or a Bayesian synthesis method, Table 5.4 provides the solutions
for σ N in Equation 5.15 and the overall sample size for a one-to-one randomization
based on Equation 5.16 when 90% power is desired for βN,a = –0.5, 0, or 0.5, pos-
sibilities where the experimental therapy is slightly more effective than the active
control, has the same effect as the active control, or is slightly less effective than
the active control, respectively; α = 0.05 and it is assumed that the population
variance in each arm is 100. For βN,a = –0.5, the solution to Equation 5.15 is σN = 0.685. Then applying Equation 5.16 gives a total sample size of n = 4(100)/(0.685)² ≈ 853.
TABLE 5.4
Sample Sizes by Assumed Mean Differences for Experimental and Active Control
Arms for 90% Power
Assumed Mean Difference in Non-Inferiority Trial (Exper.–Control)   Standard Error in Non-Inferiority Trial   Sample Size
–0.5 0.685 853
0 0.524 1459
0.5 0.357 3142
TABLE 5.5
Event Sizes by Assumed Experimental versus Active Control Hazard
Ratios for 80% Power
Assumed Hazard Ratio (Exper./Control)   Standard Error in Non-Inferiority Trial   Event Size
0.9 0.0885 511
1 0.0443 2037
1.1 No positive solution No event size can
provide 80% power
Example 5.8
Consider a time-to-event variable where longer values are more desirable. The
placebo versus active control hazard ratio is 1.40 with a corresponding standard
error for the log-hazard ratio of 0.1. The experimental therapy is required to dem-
onstrate better than 50% retention of the active control effect. For the standard
synthesis method or a Bayesian synthesis method, Table 5.5 provides the solutions
for σ N in Equation 5.15 and the overall required number of events for a one-to-
one randomization based on Equation 5.17 when 80% power is desired for βN,a =
ln 0.9, 0, or ln 1.1, possibilities where the experimental therapy has a 10% lower
instantaneous risk of an event than the active control, has the same instantaneous
risk as the active control, or has a 10% greater instantaneous risk of an event than
the active control, respectively (where α = 0.05). For βN,a = ln 0.9, the solution to Equation 5.15 is σN = 0.0885. Then applying Equation 5.17 gives a total of r = 4/(0.0885)² ≈ 511 events.
In Example 5.8, note that 80% power cannot be achieved at βN,a = ln 1.1, regardless of the sample size. This is attributable to the estimated active control effect, with its fixed, nonzero standard error, being known beforehand, whereas powering a comparison involving a difference of two parameters is usually based on an assumed difference for those two parameters where the standard error for the estimated difference can be chosen as any positive value. Here the known standard error for (1 – λo)β̂H establishes a positive lower bound for the standard error of β̂N – (1 – λo)β̂H.
The power at βN,a = ln 1.1 is maximized at approximately 9.5% for 441 events. When conditioned on the estimated active control effect and its corresponding standard error, the power for a given βN,a need not be monotone in the sample/event size.
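Given a solution σN of Equation 5.15, Equations 5.16 and 5.17 are direct to apply. A minimal sketch follows, using the rounded σN values from Tables 5.4 and 5.5; because of that rounding, some results differ slightly from the tabled values.

```python
import math

def total_sample_size(sigma2, sigma_N):
    """Equation 5.16: total N for a 1:1 randomized continuous endpoint."""
    return math.ceil(4 * sigma2 / sigma_N**2)

def total_events(sigma_N):
    """Equation 5.17: total events for a 1:1 randomized time-to-event endpoint."""
    return math.ceil(4 / sigma_N**2)

# sigma_N values as rounded in Tables 5.4 and 5.5; compare with the tabled
# sample sizes 853, 1459, 3142 and event sizes 511, 2037.
for s in (0.685, 0.524, 0.357):
    print(s, total_sample_size(100, s))
for s in (0.0885, 0.0443):
    print(s, total_events(s))
```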
For a one-to-one randomization when βN,a < (1 – λo)(β̂H – zα/2sH), Equations 5.16 and 5.17 will apply for determining the sample size for a continuous variable and the event size for a time-to-event variable, respectively. For βN,a < (1 – λo)(β̂H – zα/2sH), the sample size or event size can be determined that provides a given power greater than α/2. For βN,a > (1 – λo)(β̂H – zα/2sH), the power will always be less than α/2. Again, as previously stated, the term "power" in this context is truly "conditional power." As (1 – λo)(β̂H – zα/2sH) is an already observed value, βN < (1 – λo)(β̂H – zα/2sH) does not reflect an alternative hypothesis that specifies when the experimental therapy has acceptable efficacy. Likewise, for the standard synthesis method, whether the conditional power is greater than or less than α/2 does not necessarily correspond to exactly when the experimental therapy has acceptable efficacy.
$$Z_4 = \frac{\hat{\beta}_N - (1-\lambda_o)\hat{\beta}_H}{s_N + (1-\lambda_o)s_H} = Z_2/Q < -z_{\alpha/2} \qquad (5.19)$$

where

$$Q = \frac{s_N + (1-\lambda_o)s_H}{\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}}$$

and Z2 is the standard synthesis method test statistic given in Equation 5.12. The value for Q is necessarily greater than or equal to 1, which follows from the triangle inequality applied to a right triangle having for the lengths of its legs sN and (1 – λo)sH, and thus having $\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}$ for the length of the hypotenuse (see Figure 5.2). As the sum of the lengths of the legs is greater than the length of the hypotenuse, we have
FIGURE 5.2
Right triangle representation of standard errors, with legs of lengths sN and (1 – λo)sH.
that Q ≥ 1 (equality holding only when a given leg has length zero). The largest possible value for Q is $\sqrt{2} \approx 1.414$, which occurs when sN = (1 – λo)sH. Wiens28 expressed the ratio Q as a function of λo when sN = sH that is equivalent to

$$t(\lambda_o) = \frac{2-\lambda_o}{\sqrt{1+(1-\lambda_o)^2}}$$
$$Z_5 = \frac{\hat{\beta}_N - (1-\lambda_o)\hat{\beta}_H}{s_N} = R_{PE} \times Z_2 < -z_{\alpha/2} \qquad (5.20)$$

where $R_{PE} = \sqrt{1 + (1-\lambda_o)^2 s_H^2/s_N^2}$. The factor RPE increases from 1 to ∞ as $s_H^2/s_N^2$ increases from 0 to ∞.
Recall that the Hasselblad and Kong method concludes non-inferiority when

$$Z_3 = \frac{\hat{\beta}_N - (1-\lambda_o)\hat{\beta}_H}{\sqrt{s_N^2 + (\hat{\beta}_N/\hat{\beta}_H)^2 s_H^2}} = R_{HK} \times Z_2 < -z_{\alpha/2} \qquad (5.21)$$

where

$$R_{HK} = \sqrt{\frac{s_N^2 + (1-\lambda_o)^2 s_H^2}{s_N^2 + (\hat{\beta}_N/\hat{\beta}_H)^2 s_H^2}}$$

We see from Equation 5.21 that when βN = (1 – λo)βH and Var(β̂N/β̂H) is small, RHK ≈ 1 in distribution and Z3 will have an approximate standard normal distribution when Z2 has a standard normal distribution.
$$\hat{\beta}_N + z_{\alpha/2}s_N < (1-\lambda_o)\hat{\beta}_H - z_{\alpha/2}\left(\sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2} - s_N\right) \qquad (5.22)$$

The right-hand side of Equation 5.22 is larger than the right-hand side of Equation 5.23 by

$$z_{\alpha/2}\left(s_N + (1-\lambda_o)s_H - \sqrt{s_N^2 + (1-\lambda_o)^2 s_H^2}\right) \qquad (5.24)$$

Expression 5.24, which can also be found in the Appendix of Fleming,29 can be thought of as the adjustment made by the two–confidence interval method compared with the synthesis method. Such an adjustment may be in hopes of addressing any bias in the estimation of the active control effect and any deviation from constancy. In terms of S = (1 – λo)sH, Expression 5.24 will be between 0 and zα/2S, which occur at sN = 0 and as sN → ∞, respectively.
Unified Test Statistic. For indirectly testing that the experimental therapy is
effective (i.e., more effective than placebo), Snapinn and Jiang16 compared the
power of synthesis and two–confidence interval methods through a unified
approach where the test statistic is given by
$$U(w,v) = \frac{\hat{\beta}_N - (1-w)\hat{\beta}_H}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-w)^2\,\mathrm{Var}(\hat{\beta}_H) + 2v(1-w)\sqrt{\mathrm{Var}(\hat{\beta}_N)\,\mathrm{Var}(\hat{\beta}_H)}}} \qquad (5.25)$$

where v ≥ 0 is a variance inflation factor and 0 ≤ w ≤ 1 is a discounting factor.
For a given (w, v), non-inferiority is concluded when U(w, v) < –1.96. We see
from Equation 5.25 that the standard synthesis statistic given in Equation
5.12 equals U(λo, 0), the two–confidence interval equivalent test statistic
in Equation 5.19 equals U(λo, 1), and U(1,v) provides the test statistic for a
superiority test of the experimental therapy to the active control, regardless
of the value of v. Snapinn and Jiang noted that failing to account for viola-
tions of assay sensitivity and constancy can lead to an inflated type I error
rate, which increases the risk of claiming an ineffective therapy as effective.
Departures from assay sensitivity are given by the amount a = E(β̂N) – βN,ideal and departures from constancy by the amount c = E(β̂H) – βC,P,N, where βN,ideal is the true treatment difference between the active control and the experimental arms under ideal trial situations and βC,P,N is the actual effect of
the active control in the non-inferiority trial. For a given pair of values for the
departures from assay sensitivity and constancy (a, c) and β = βH = βN and
fixed values for the variances of β̂ N and β̂ H , Snapinn and Jiang determined
the values wS and wF so that the calibrated synthesis test statistic U(wS, 0) and
the calibrated fixed margin approach test statistic U(wF, 1) maintain a 0.025
type I error rate. For various cases studied involving departures from assay
sensitivity and constancy, Snapinn and Jiang found that the calibrated syn-
thesis method based on the statistic U(wS, 0) had greater power than the cali-
brated fixed-margin approach based on U(wF, 1). The difference in power (or
in the determined sample sizes) became more profound as Var(βˆ H )/Var(βˆ N )
increased.
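A sketch of the unified statistic of Equation 5.25 and its special cases; the numerical inputs below are illustrative assumptions, not values from the text.

```python
import math

def U(betaN_hat, betaH_hat, varN, varH, w, v):
    """Unified test statistic of Equation 5.25."""
    num = betaN_hat - (1 - w) * betaH_hat
    den = math.sqrt(varN + (1 - w)**2 * varH
                    + 2 * v * (1 - w) * math.sqrt(varN * varH))
    return num / den

# Illustrative inputs (assumptions, not values from the text)
bN, bH, vN, vH, lam_o = -0.10, 0.30, 0.01, 0.01, 0.5
print(U(bN, bH, vN, vH, lam_o, 0))  # standard synthesis statistic (Eq. 5.12)
print(U(bN, bH, vN, vH, lam_o, 1))  # two-CI equivalent statistic (Eq. 5.19)
print(U(bN, bH, vN, vH, 1, 0))      # superiority vs. the active control
# Non-inferiority is concluded when the chosen U(w, v) < -1.96.
```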
$$z_{\alpha_N/2} = z_{\alpha/2}\sqrt{k} \quad\text{and}\quad z_{\alpha_H/2} = z_{\alpha/2}\sqrt{1-k} \qquad (5.26)$$
TABLE 5.6
zαN/2 and zαH/2 Based on Equation 5.26

       α/2 = 0.0005        α/2 = 0.005         α/2 = 0.025
k      zαN/2    zαH/2      zαN/2    zαH/2      zαN/2    zαH/2
1/2 2.327 2.327 1.822 1.822 1.386 1.386
2/3 2.687 1.900 2.103 1.487 1.600 1.132
3/4 2.850 1.645 2.231 1.288 1.697 0.980
1 3.291 0 2.576 0 1.960 0
$$z_{\alpha/2} = z_{\alpha_N/2}\sqrt{k} + z_{\alpha_H/2}\sqrt{1-k} \qquad (5.27)$$
For k = 1/2, 2/3, 3/4, and 1, Table 5.7 gives the values for zα/2 used in the standard synthesis test and the one-sided type I error rates (α/2) under the constancy assumption for the 95–95 approach and a 95–80 approach based on Equation 5.27. When k = 1/2, a two 95% confidence interval approach is equivalent to a standard synthesis method that targets a one-sided type I error rate of 0.0028. For the 95–80 method, zα/2 has an umbrella shape in k with its maximum at k = 0.7004 of z0.0096 = 2.342. For 0.1606 < k < 1, we have zα/2 > 1.96, and for k < 0.1606, we have zα/2 < 1.96. For k = 0, 1/4, and 1/3, the
TABLE 5.7
zα/2 Values Used in Standard Synthesis Test and One-Sided Type I Error Rates
αN/2 = αH/2 = 0.025 αN/2 = 0.025, αH/2 = 0.10
values for zα/2 and α/2 are the same as those in Table 5.7 for k = 1, 3/4, and
2/3, respectively. Rothmann et al.1 provided a graph of the type I error for the
95–95 approach by the ratio of the standard deviations (i.e., (1 – λo)sH/sN).
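Equations 5.26 and 5.27 are straightforward to evaluate. The sketch below reproduces the implied one-sided level of about 0.0028 for the 95–95 approach at k = 1/2 and the umbrella maximum of the 95–80 approach:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def z_synthesis(zN, zH, k):
    """Equation 5.27: synthesis critical value implied by a two-CI procedure."""
    return zN * sqrt(k) + zH * sqrt(1 - k)

# 95-95 approach at k = 1/2: implied one-sided type I error rate ~0.0028
z = z_synthesis(1.96, 1.96, 0.5)
print(z, 1 - nd.cdf(z))

# 95-80 approach (zH = z_0.10): umbrella shape in k, maximum near k = 0.7004
zH = nd.inv_cdf(0.90)
best = max(((k / 1000, z_synthesis(1.96, zH, k / 1000)) for k in range(1, 1000)),
           key=lambda t: t[1])
print(best, 1 - nd.cdf(best[1]))  # ~ (0.700, 2.342), one-sided level ~0.0096
```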
For a one-to-one randomization and a time-to-event endpoint, Equation 10
of Rothmann et al.1 gives the non-inferiority threshold for a 100(1 – α N)% con-
fidence interval of βN that is consistent with the standard synthesis method
when α N/2 = α/2. Non-inferiority is concluded at a targeted level of α/2
for the standard synthesis method when the upper limit of the two-sided
100(1 – α)% confidence interval for βN is less than

$$(1-\lambda_o)\left(\hat{\beta}_H - z_{\alpha/2}\,\frac{1-\sqrt{k}}{\sqrt{1-k}}\,s_H\right) \qquad (5.28)$$

where β̂H is the placebo versus active control estimate of the active control effect. For a two–confidence interval approach, the use of the threshold in Equation 5.28 is equivalent to choosing zαH/2 (αH/2) by

$$z_{\alpha_H/2} = z_{\alpha/2}\,\frac{1-\sqrt{k}}{\sqrt{1-k}} = z_{\alpha/2}\sqrt{\frac{1-\sqrt{k}}{1+\sqrt{k}}} \qquad (5.29)$$
For k = 0, 1/4, 1/3, 1/2, 2/3, 3/4, and 1, Table 5.8 gives the values for zα H /2 and
α H/2 for a two–confidence interval procedure where α N/2 = α/2, which is
equivalent to the standard synthesis method with critical value –zα/2, which
targets a one-sided type I error rate of α/2 under the constancy assump-
tion. When k = 1/2 and α/2 = 0.025, an approach that compares a two-sided
95% confidence interval for βN to a two-sided 58.3% confidence interval for
(1 – λo) βH is equivalent to the standard synthesis test with critical value –1.96
(α/2 = 0.025).
When zαH/2 = 0, the two–confidence interval method is referred to as the "point estimate method." The point estimate method would compare a confidence interval for βN from the non-inferiority trial with (1 – λo)β̂H.

TABLE 5.8
zαH/2 and αH/2 Based on Equation 5.29

       α/2 = 0.0005        α/2 = 0.005         α/2 = 0.025

For the point estimate method, by Equation 5.27,

$$z_{\alpha/2} = z_{\alpha_N/2}\sqrt{k} \qquad (5.30)$$

If instead a common level is desired for the two confidence intervals (αN/2 = αH/2), Equation 5.27 yields

$$z_{\alpha_N/2} = z_{\alpha_H/2} = z_{\alpha/2}\big/\big(\sqrt{k} + \sqrt{1-k}\big) \qquad (5.31)$$
For k = 1/2, 2/3, 3/4, and 1, and α/2 = 0.0005, 0.005, and 0.025, Table 5.9 gives
the common values for zα N /2 = zα H /2 and α N = α H based on Equation 5.31 that
yield two–confidence interval procedures equivalent to the standard synthe-
sis test with critical value –zα/2. For k = 1/2 and α/2 = 0.025, zα N /2 = zα H /2 = 1.386
and α N/2 = α H/2 = 0.0829, corresponding to a two 83.4% confidence interval
approach being equivalent to the standard synthesis method that targets a
one-sided type I error rate of 0.025. As the right-hand side of Equation 5.31 is
symmetric in k, the values for zα/2 and α/2 for k = 0, 1/4, and 1/3 are the same
as the values in Table 5.9 for k = 1, 3/4, and 2/3, respectively.
If it is desired to have equal-length confidence intervals, then for 0 < k < 1,

$$z_{\alpha_N/2} = \frac{z_{\alpha/2}}{2\sqrt{k}} \quad\text{and}\quad z_{\alpha_H/2} = \frac{z_{\alpha/2}}{2\sqrt{1-k}} \qquad (5.32)$$
For k = 1/2, 2/3, 3/4, and k → 1, and α/2 = 0.0005, 0.005, and 0.025, Table
5.10 gives the values for zα N /2 and zα H /2 (α N and α H) based on Equation 5.32
that provide an equal-length two confidence interval procedure equivalent
to the standard synthesis test with critical value –zα/2. For k = 3/4 and α/2 = 0.025, a 74.2–95 two–confidence interval procedure is based on equal-length confidence intervals.
TABLE 5.9
Values for zα N /2 = zα H /2 and α N = α H Based on Equation 5.31
α/2 = 0.0005 α/2 = 0.005 α/2 = 0.025
k zα N / 2 = zα H / 2 (αN/2 = αH/2) zα N / 2 = zα H / 2 (αN/2 = αH/2) zα N / 2 = zα H / 2 (αN/2 = αH/2)
1/2 2.327 (0.0100) 1.822 (0.0343) 1.386 (0.0829)
2/3 2.361 (0.0091) 1.848 (0.0323) 1.406 (0.0798)
3/4 2.409 (0.0080) 1.886 (0.0297) 1.435 (0.0757)
1 3.291 (0.0005) 2.576 (0.0050) 1.960 (0.0250)
TABLE 5.10
Values for zα N /2 , zα H /2 , α N/2 and α H/2 Based on Equation 5.32
α/2= 0.0005 α/2 = 0.005 α/2 = 0.025
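The calibrations in Equations 5.29, 5.31, and 5.32 (the last as reconstructed above) can be computed directly; a minimal sketch that reproduces several of the values cited for Tables 5.8 through 5.10:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
z = nd.inv_cdf(0.975)  # 1.96 for alpha/2 = 0.025

for k in (0.5, 2 / 3, 0.75):
    zH_529 = z * (1 - sqrt(k)) / sqrt(1 - k)  # Equation 5.29 (alphaN/2 = alpha/2)
    z_531 = z / (sqrt(k) + sqrt(1 - k))       # Equation 5.31 (common alphas)
    zN_532 = z / (2 * sqrt(k))                # Equation 5.32 (equal lengths)
    zH_532 = z / (2 * sqrt(1 - k))
    print(k, round(zH_529, 3), round(z_531, 3), round(zN_532, 3), round(zH_532, 3))

# k = 1/2: zH_529 ~ 0.812 (a 58.3% two-sided CI for the control effect) and
# z_531 ~ 1.386 (two 83.4% CIs); k = 3/4: (zN_532, zH_532) ~ (1.132, 1.96),
# the 74.2-95 equal-length procedure.
```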
$$P\big(U < l(\hat{\beta}_H)\,\big|\,H_o{:}\,\theta = 0\big) = \int P\big(U < l(\beta_H)\,\big|\,H_o{:}\,\theta = 0;\ \beta_H\big)\,v(\beta_H)\,\mathrm{d}\beta_H$$
Odem-Davis24 compared across-trials type I error rates for the standard syn-
thesis method, the standard synthesis method with various discounting,
the 95–95 two–confidence interval method, and various bias-only adjusted
synthesis methods for cases where the effect of the active control in the non-
inferiority trial is smaller than the historical effect by some known fraction.
The 95–95 two–confidence interval method has a type I error rate less than
the analogous synthesis method. Odem-Davis observed that the type I error
rate for the 95–95 two–confidence interval method was more sensitive to
changes in the historical variance. Additionally, when the historical variance
is small and the estimator of the active control effect in the setting of the non-
inferiority trial has a small bias favoring overestimating the true effect, the
95–95 two–confidence interval method may have a type I error rate greater
than that of a synthesis method that accounts for the bias. A synthesis method
that does not closely account for the bias would be even more likely to lead to
a false positive than the 95–95 two–confidence interval method.
Wang, Hung, and Tsong3 examined the impact of deviations from con-
stancy on a range of procedures. Included are: (a) a synthesis method
indirectly testing superiority of the experimental therapy to placebo, (b) a
method that uses a non-inferiority margin or a random threshold based on
50% of the estimated control effect, (c) a random threshold based on 50% of
the lower limit or the 95% confidence interval for the control effect, (d) use
of a non-inferiority margin or a random threshold based on 20% of the
estimated control effect, (e) a random threshold based on 20% of the lower
limit of the 95% confidence interval for the control effect. For the meth-
ods based on random thresholds, non-inferiority would be concluded if the
95% confidence interval for the difference in the effects of the experimental
therapy and the active control lies entirely below the random threshold.
The method in (a) had the largest type I error rates, then (b), (c), (d), and (e)
in that order.
When there is a deviation in the effect of the active control in the non-inferiority trial relative to the historical effect of the active control by a and the estimator of the historical effect of the active control is normally distributed, the approximate type I error rate is given by (from Equations 5.12, 5.19, and 5.20, respectively)

$$\Phi\!\left(-z_{\alpha/2} + \frac{(1-\lambda_o)a}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-\lambda_o)^2\mathrm{Var}(\hat{\beta}_H)}}\right)$$

for the standard synthesis method,

$$\Phi\!\left(-z_{\alpha/2}Q + \frac{(1-\lambda_o)a}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-\lambda_o)^2\mathrm{Var}(\hat{\beta}_H)}}\right)$$

for a two 100(1 – α)% confidence interval method, and

$$\Phi\!\left(-z_{\alpha/2}/R_{PE} + \frac{(1-\lambda_o)a}{\sqrt{\mathrm{Var}(\hat{\beta}_N) + (1-\lambda_o)^2\mathrm{Var}(\hat{\beta}_H)}}\right)$$

for the point estimate method.
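A sketch evaluating these three approximate type I error rates as functions of the deviation a; the variances, retention fraction, and deviation used are illustrative assumptions:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def type1_rates(a, varN, varH, lam_o, z=1.96):
    """Approximate type I error rates under a constancy deviation a,
    per the three expressions above."""
    sN, sH = sqrt(varN), sqrt(varH)
    se = sqrt(varN + (1 - lam_o)**2 * varH)
    shift = (1 - lam_o) * a / se
    Q = (sN + (1 - lam_o) * sH) / se
    R_PE = sqrt(1 + (1 - lam_o)**2 * varH / varN)
    return (nd.cdf(-z + shift),          # standard synthesis method
            nd.cdf(-z * Q + shift),      # two 100(1 - alpha)% CI method
            nd.cdf(-z / R_PE + shift))   # point estimate method

# Illustrative assumptions: equal variances, 50% retention, deviation a = 0.1
print(type1_rates(0.1, 0.01, 0.01, 0.5))
```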
Hung et al.31 simulated the across-trials type I error rate for the point
estimate, 95–95, and standard synthesis methods based on proportions of
undesirable outcomes. For the simulations, the true active control rate is 14%
or 15%, which is less than the placebo rate by 4%; the sample size is 7500 or
10,000 per arm. The sample sizes for the active control and the placebo in the
historical comparison are assumed to be roughly equal to the sample sizes
in the non-inferiority trial. In each case, the 95–95 and the standard synthesis
methods had type I error rates of approximately 0.003 and 0.025, respectively.
The unconditional type I error rate for the point estimate method ranged
from 0.050 to 0.058.
For the 95–80, 95–85, 95–90, and 95–95 methods, Figure 1 of Hung et al.5
gives across-trials type I error rate curves as the ratio of the variance of the
historical estimate of the active control effect to the variance in the non-
inferiority trial ranges from 0.5 to 10.
Additionally, for both fixed-effects and random-effects meta-analyses,
Wang, Hung, and Tsong3 compared the type I error rates for 50% and 80%
retention based on log-relative risks under the constancy assumption of three
methods: one method using the lower limit of the 95% confidence interval for
the control effect as the true effect of the control therapy, a standard synthe-
sis method, and the Hasselblad and Kong procedure. For the random-effects
model, the true effect for the active control in the non-inferiority trial is
assumed to equal the true global mean effect across studies. The 95% lower
confidence interval method maintained type I error rates below the target of
0.025 in all studied cases. The standard synthesis method maintains a proper
type I error rate except for cases where the within-trial variability was much
smaller than the between-trial variability. The Hasselblad and Kong method
consistently had a slightly higher type I error rate than the synthesis method
in the random-effects cases. This increase above the desired type I error rate
may be largely due to using a normal sampling distribution for the estimated
active control effect when an appropriate t-distribution would be a better
choice.32,33
For the fixed-effects meta-analysis, the standard synthesis method
maintained approximately the desired type I error rate. When basing the
non- inferiority margin on the lower limit of the 95% confidence interval for
the active control effect, the simulated type I error rates ranged from 0.0015
to 0.0115 with the type I error rates increasing as the percentage of retention
increased. The Hasselblad and Kong method had simulated type I error rates
ranging from 0.0009 to 0.0386 with the type I error rates decreasing as the
percentage of retention increases.
For 0 < α < 1 and 0 < γ < 1, Ho is rejected by a two–confidence interval proce-
dure whenever X + zα/2σ X < (1 – λo)(Y – zγ/2 σY). For a given strategy σ X(•), the
probability of rejecting the null hypothesis at (μX, μY) is
$$\int_{-\infty}^{\infty} \Phi\!\left(-z_{\alpha/2} + \frac{(1-\lambda_o)(y - z_{\gamma/2}\sigma_Y) - \mu_X}{\sigma_X(y)}\right)\varphi\!\left(\frac{y-\mu_Y}{\sigma_Y}\right)\mathrm{d}y\Big/\sigma_Y \qquad (5.33)$$
Based on Expression 5.33, Rothmann2 showed that over all possible strategies
σ X(•), the supremum one-sided type I error probability is (α + γ)/2 – αγ/4 >
α/2. This supremum occurs as σX(y) → 0 for y > μY and as σX(y) → ∞ for y < μY.
Conversely, the infimum one-sided type I error probability across all possible
strategies is αγ/4. This infimum occurs as σX(y) → 0 for y < μY and as σX(y) →
∞ for y > μY. However, practical strategies have type I error probabilities
between those two extremes. For α = γ = 0.05, the infimum and supremum
type I error probabilities are 0.0006 and 0.0494, respectively. This is a wider
range than the range for the type I error rate when the design of the non-
inferiority trial is independent of the estimation of the active control effect
of 0.0028–0.025.
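Expression 5.33 can be evaluated by simple numerical integration for any candidate strategy σX(·). The sketch below uses an exaggerated two-valued strategy at the null boundary μX = (1 – λo)μY to show how strongly the rejection probability depends on the strategy; the theoretical extremes over all strategies are αγ/4 ≈ 0.0006 and (α + γ)/2 – αγ/4 ≈ 0.0494.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

nd = NormalDist()

def reject_prob(sigma_X, mu_X, mu_Y, sigma_Y, lam_o,
                alpha=0.05, gamma=0.05, grid=4000, width=10):
    """Expression 5.33 by midpoint numerical integration; sigma_X is a
    strategy function of the observed y."""
    za, zg = nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(1 - gamma / 2)
    total, dy = 0.0, 2 * width * sigma_Y / grid
    for i in range(grid):
        y = mu_Y - width * sigma_Y + (i + 0.5) * dy
        inner = -za + ((1 - lam_o) * (y - zg * sigma_Y) - mu_X) / sigma_X(y)
        density = exp(-0.5 * ((y - mu_Y) / sigma_Y)**2) / (sigma_Y * sqrt(2 * pi))
        total += nd.cdf(inner) * density * dy
    return total

# Exaggerated strategy: tiny sigma_X when the control estimate is large,
# huge sigma_X when it is small; smoother strategies fall in between.
lam_o, mu_Y, sigma_Y = 0.5, 1.0, 0.2
strategy = lambda y: 1e-6 if y > mu_Y else 1e6
print(reject_prob(strategy, (1 - lam_o) * mu_Y, mu_Y, sigma_Y, lam_o))  # ~0.037
```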
A common strategy sizes the non-inferiority trial to have a desired power (e.g., 80% or 90% power) when there is no difference in the effects of the experimental and active control therapies. For the synthesis method, Ho is rejected whenever $X + z_{\alpha/2}\sqrt{\sigma_X^2 + (1-\lambda_o)^2\sigma_Y^2} < (1-\lambda_o)Y$ for some 0 < α < 1. For a given strategy σX(•), the probability of rejecting the null hypothesis at (μX, μY) is

$$\int_{-\infty}^{\infty} \Phi\!\left(\frac{(1-\lambda_o)y - \mu_X - z_{\alpha/2}\sqrt{\sigma_X^2(y) + (1-\lambda_o)^2\sigma_Y^2}}{\sigma_X(y)}\right)\varphi\!\left(\frac{y-\mu_Y}{\sigma_Y}\right)\mathrm{d}y\Big/\sigma_Y$$
As with the two–confidence interval approach, Rothmann determines the
infimum and supremum type I error probabilities over all possible strategies
for σ X(•) for the synthesis approach. When α = 0.05, infimum and supremum
type I error probabilities are 0.0006 and 0.049, respectively. This is the same
range for the type I error probabilities as with the two–confidence interval
approach when α = γ = 0.05. A critical value for the test can be determined
that gives a maximum type I error probability of 0.025 over a likely range
of values for (1 – λo)μY or that leads to a type I error rate of 0.025 based on a
distribution for the possible values of (1 – λo)μY.
In the examined settings, Rothmann noted that strategies based on having
adequate power when there is no difference in effects between the experi-
mental and active control therapies have smaller type I error rates than the
type I error rate in an analogous case when the sizing of the non-inferiority
trial is independent of the estimation of the active control effect. This will
be likewise seen in Example 5.9 in Section 5.4.4.2, which examines the type I
error rate under a model that introduces regression-to-the-mean bias.
As noted in Section 5.3.6, for the Bayesian setting, the analysis does not
depend on whether the sizing of the non-inferiority trial is independent of or dependent on the estimation of the active control effect. In both cases, the joint
posterior distribution of the differences in effects of the experimental ther-
apy versus the active control therapy and the active control therapy versus
placebo factors into the product of the marginal posterior distributions when
independent prior distributions are used. In the frequentist setting, the joint
density function does not factor into the marginal density functions unless
the sizing of the non-inferiority trial does not depend on the estimation of
the active control effect.
Example 5.9
Consider that there are five previously studied therapies for an indication. For i = 1,
…, 5, let Xi denote the observed placebo versus active control log-hazard ratio
for the i-th therapy from either one clinical trial or a meta-analysis of clinical
trials. Suppose X1,…,X5 is a random sample from a normal distribution having
mean ln(4/3) and standard deviation of 0.1. A new experimental therapy is to be
compared in a clinical trial to that therapy among the five therapies that has the
largest observed effect. We will first assume that the sizing of the non-inferiority
trial is independent of the estimation of the active control effect. We are interested
in testing the hypotheses in Expression 5.6 when λo = 0 and when the true placebo
versus active control log-hazard ratio in the non-inferiority trial is ln(4/3). Table
5.11 provides simulated type I error rates for both the 95–95 and standard synthe-
sis methods when the true placebo versus experimental hazard ratio is 1 (i.e., the
experimental therapy is ineffective) for various standard errors for the experimen-
tal versus active control log-hazard ratio, as the standard error goes to 0 and as
the standard error goes to infinity. Note that the regression-to-the-mean bias and
the type I error rates depend on the common historical standard deviation and the
standard error in the non-inferiority trial and do not depend on the value for the
common effect of the five previously studied therapies.
As the standard error for the estimated difference in the non-inferiority trial
decreases, the simulated type I error rate for the standard synthesis method
increases from 0.025 toward about 0.12. The type I error rate for the 95–95 method
is “U-shaped” in the standard error from the non-inferiority trial. That is, as the
standard error in the non-inferiority trial decreases, the type I error rate decreases
to some minimum value at a standard error between 0.1 and 0.2, and increases
TABLE 5.11
Simulated Type I Error Rates—Independent Design
Simulated Type I Error Ratea
Non-Inferiority Trial Standard Error 95–95 Method Standard Synthesis Method
0 0.1194 0.1194
0.03 0.0359 0.1161
0.04 0.0260 0.1145
0.05 0.0200 0.1114
0.0707 0.0157 0.1066
0.1 0.0119 0.0920
0.2 0.0130 0.0638
→∞b 0.025 0.025
a Each type I error rate is based on 100,000 simulations.
b Type I error rate for the “→ ∞” case determined mathematically. The simulated type I error
rate for this case was 0.0245.
toward a value of about 0.12 as the standard error decreases toward zero. The
95–95 method maintains a type I error rate below 0.025 unless the standard error
in the trial is very small.
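The independent-design simulation in Example 5.9 can be sketched as follows (same model: five historical estimates with common mean ln(4/3) and standard deviation 0.1, the largest selected as the control effect; the experimental therapy is ineffective and λo = 0):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
sH, mu = 0.1, np.log(4 / 3)

# Five previously studied therapies; the largest observed placebo vs.
# control log-hazard ratio is selected as the active control effect.
betaH_hat = (mu + sH * rng.standard_normal((n, 5))).max(axis=1)

for sN in (0.03, 0.05, 0.1, 0.2):
    # Ineffective experimental therapy with lambda_o = 0: the true
    # experimental vs. control log-hazard ratio equals ln(4/3).
    betaN_hat = mu + sN * rng.standard_normal(n)
    synthesis = (betaN_hat - betaH_hat) / np.sqrt(sN**2 + sH**2) < -1.96
    two_ci = betaN_hat + 1.96 * sN < betaH_hat - 1.96 * sH
    print(sN, two_ci.mean(), synthesis.mean())  # compare with Table 5.11
```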
We next consider the case where the sizing of the non-inferiority trial depends
on the estimated active control effect. For both analysis methods, Table 5.12 pro-
vides simulated type I error rates when the true placebo versus experimental haz-
ard ratio is 1, where the non-inferiority trial is sized based on a conditional power
ranging from 3% to 100% from using the 95–95 method under the assumptions
that the experimental and active control therapies have the same effect and the
true placebo versus active control log-hazard ratio in the non-inferiority trial is
ln(4/3).
Comparing Tables 5.11 and 5.12, when the sizing of the non-inferiority trial
depends on the estimated active control effect, the type I error rates tend to be
slightly smaller than when the sizing of the non-inferiority trial is independent of
the estimated active control effect. The 95–95 method achieved type I error rates
smaller than 0.025 for every practical choice for power. The simulated type I error
rates for the standard synthesis method exceeded 0.025 and were increasing in
the conditional power.
In general, the amount of regression-to-the-mean bias (and the increase to
the type I error rates of the standard synthesis method and the 95–95 method)
increases as there is an increase in the number of previously studied investiga-
tional agents that potentially could have produced results so as to be selected as
the active control and/or as the standard deviations for the estimated effects of the
previously studied investigational agents increases. When the standard errors for
the previously studied investigational agents are equal, the regression-to-the-mean
bias will be at its greatest when the effects of those previously studied investi-
gational agents are equal. Reproducibility across studies in the effect size of the
selected active control can provide assurance that the size of the regression-to-the-mean bias is small.
TABLE 5.12
Simulated Type I Error Rates—Dependent Design
Simulated Type I Error Ratea
Conditional Power (%) 95–95 Method Standard Synthesis Method
3 0.0225 0.0269
10 0.0138 0.0428
20 0.0113 0.0528
30 0.0101 0.0589
40 0.0095 0.0631
50 0.0094 0.0668
60 0.0092 0.0702
70 0.0090 0.0739
80 0.0091 0.0776
90 0.0095 0.0824
99.99 0.0143 0.0988
Just below 100b 0.1177 0.1193
a Each type I error rate is based on 100,000 simulations.
b Power of 1 – β where zβ = 1000.
TABLE 5.13
Simulated Type I Error Rates for a Bias-Corrected Synthesis
Test—Independent Design
Non-Inferiority Trial Standard Error Simulated Type I Error Ratea
0 0.0047
0.03 0.0053
0.04 0.0057
0.05 0.0066
0.0707 0.0078
0.1 0.0116
0.2 0.0193
→ ∞b 0.025
a Each type I error rate is based on 100,000 simulations.
b Type I error rate for the “→ ∞” case determined mathematically. The
simulated type I error rate for this case was 0.0245.
When the standard error for the non-inferiority trial equaled 0.1, the common standard error for the
previous studies, the type I error rates for the bias-corrected synthesis and 95–95
methods are similar (0.0116 and 0.0119, respectively).
Table 5.14 provides simulated type I error rates when the true placebo versus
experimental hazard ratio is 1 for a bias-corrected synthesis method where the
non-inferiority trial is sized on the basis of a conditional power ranging from 3% to
100% from using the 95–95 method under the assumptions that the experimental
and active control therapies have the same effect and the true placebo versus
active control log-hazard ratio in the non-inferiority trial is ln(4/3).
As in the independent design case, we note that the type I error rates in Table 5.14
for the bias-corrected synthesis method are all smaller than 0.025 and are much
smaller than those type I error rates given in Table 5.12 for the standard synthesis
method without a bias correction. Additionally, the type I error rates in Table 5.14
for the bias-corrected synthesis method are decreasing in the conditional power,
not increasing in the conditional power as in Table 5.12 for the standard synthesis
method without a bias correction. Also, in this example for 90% or greater con-
ditional power, the bias-corrected synthesis method had a smaller simulated type
I error rate than the 95–95 method without bias correction. For 80% or smaller
conditional power, the bias-corrected synthesis method had a greater simulated
type I error rate than the 95–95 method without bias correction.
TABLE 5.14
Simulated Type I Error Rates for a Bias Corrected Synthesis
Test—Dependent Design
Conditional Power (%) Simulated Type I Error Ratea
3 0.0240
10 0.0194
20 0.0166
30 0.0145
40 0.0134
50 0.0123
60 0.0113
70 0.0104
80 0.0095
90 0.0085
99.99 0.0065
Just below 100b 0.0048
a Each type I error rate is based on 100,000 simulations.
b Power of 1 – β, where zβ = 1000.
made by the 95–95 method in Expression 5.24 when the standard error in the
non-inferiority trial equals (1 – λo) multiplied by the common standard
error for the five previously studied therapies.
Example 5.10 evaluates the type I error rate of false conclusions of efficacy
of the experimental therapy under the previous model in Chapter 3, where
the likelihood that the active control is truly effective depends on the prob-
ability a random agent for that indication is truly effective and the power for
concluding effectiveness when the agent is effective.
Example 5.10
Often, when a therapy has been concluded as effective for an indication based
on one or more clinical trials (generally, at least two clinical trials), it may become
unethical to conduct further placebo-controlled trials for that indication. The ther-
apy concluded as effective would be given as an “active” control in future trials.
The active control may or may not be effective. As we have seen earlier, the likeli-
hood that the active control is truly effective depends on the statistical significance
of the results, the probability a random agent for that indication is truly effective,
and the power for concluding effectiveness when the agent is effective. Equation
3.1 gives the probability that the active control is ineffective. Under this paradigm,
we will evaluate the type I error rate of falsely concluding that the experimental
therapy has any efficacy when it has zero efficacy and the type I error rate of
falsely concluding that the experimental therapy has efficacy more than half the
efficacy of the active control when the experimental therapy has efficacy exactly
half that of the active control. For ease, it is assumed that the sizing of the non-
inferiority trial is independent of the estimation of the active control effect.
The active control for the non-inferiority trial will be evaluated based on achiev-
ing a favorable one-sided p-value less than 0.025 from a single trial. It is under-
stood that in practice, usually a more stringent criterion than a one-sided p-value
less than 0.025 would be used for having a therapy as an active control in future
clinical trials. The evaluation of the type I error rate in a non-inferiority trial can
also be done for a more stringent significance level than 0.025. At the beginning
of the historical trial, the estimated active control effect is assumed to have a
normal distribution with standard deviation sH. When the active control is truly
effective, the actual power is assumed to be 90%, which makes the true effect of
the active control approximately 3.24sH. Given that the p-value is less than 0.025,
the conditional mean, median, and standard deviation for the estimated effect are
3.44sH, 3.37sH, and 0.84sH, respectively. When the “active” control is truly inef-
fective (has zero effect), the conditional mean, median, and standard deviation
for the estimated effect given a p-value of less than 0.025 are 2.33sH, 2.24sH, and
0.34sH, respectively.
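The conditional moments quoted above follow from the normal distribution truncated at the significance cutoff; a quick Monte Carlo check (in units of sH):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 2_000_000

# Estimated effects in units of sH; "significant" means one-sided p < 0.025.
for true_effect in (3.24, 0.0):  # 90%-power case, then a truly ineffective control
    est = true_effect + rng.standard_normal(n)
    kept = est[est > 1.96]
    print(true_effect, kept.mean(), np.median(kept), kept.std())
# ~ (3.44, 3.37, 0.84) when effective; ~ (2.33, 2.24, 0.34) when ineffective
```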
For the cases where the active control has its assumed effect and when the
active control has zero effect, the type I error rate for a conclusion of any efficacy
from a non-inferiority trial was evaluated for the standard synthesis method, the
standard synthesis method with 50% discounting, the 95–95 method, and the
95–95 method with 50% discounting under the assumption that the active control
effect has not changed. For these methods, Tables 5.15 and 5.16 provide the simu-
lated type I error rates for a false conclusion of any efficacy when the experimental
TABLE 5.15
Simulated Type I Error Rates Conditioned on Active Control Having Its Assumed
Effect with Historical Power of 90% and Significance Level of 0.025
Ratio of Variances   Synthesis Method (0% Retention)   Synthesis Method (0% Retention, 50% Discounting)   95–95 Method (0% Retention)   95–95 Method (0% Retention, 50% Discounting)
→0+ 0.025 0.025 0.025 0.025
0.1 0.0270 0.0075 0.0078 0.0033
0.25 0.0270 0.0031 0.0044 0.0008
0.5 0.0279 0.0011 0.0034 0.0002
1 0.0269 0.0003 0.0027 0.00001
2 0.0286 0.00003 0.0032 0
4 0.0271 0.00001 0.0045 0
10 0.0282 0 0.0076 0
→∞ 0.0278 0.0000001 0.0278 0.0000001
therapy has zero efficacy for various ratios of the variance of the estimated dif-
ference in effects from the non-inferiority trial to the variance of the historically
based estimate of the active control effect. For each case, 100,000 simulations
were used on the underlying distribution for the estimated effect. Thus, when the
active control is effective (is ineffective), there were typically 90,000 cases (2,500
cases) with a p-value of less than 0.025 that were further used to simulate the type
I error rate of falsely concluding that the experimental therapy has efficacy when
the experimental therapy has zero effect in the non-inferiority trial.
The simulated type I error rates have associated margins of error with some
simulated rates greater than the true rate. Some reversals appear to occur in the
order of the simulated type I error rates. For example, when the active control is
truly ineffective and a 95–95 method is used with 50% discounting, the simulated
type I error rates when the ratio of variances is 0.5 and 1 are 0.0452 and 0.0404,
TABLE 5.16
Simulated Type I Error Rates Conditioned on Active Control Being Ineffective
Ratio of Variances   Synthesis Method (0% Retention)   Synthesis Method (0% Retention, 50% Discounting)   95–95 Method (0% Retention)   95–95 Method (0% Retention, 50% Discounting)
→0+ 0.025 0.025 0.025 0.025
0.1 0.1033 0.0572 0.0347 0.0300
0.25 0.1619 0.0844 0.0457 0.0337
0.5 0.2524 0.1274 0.0611 0.0452
1 0.3353 0.1494 0.0646 0.0404
2 0.4651 0.2323 0.1051 0.0570
4 0.5796 0.3217 0.1466 0.0666
10 0.7221 0.5199 0.2913 0.1233
→∞ 1 1 1 1
respectively. Based on the general pattern for the simulated type I error rates for
this procedure, it is likely that the true type I error rates in these two cases are
ordered in the reverse direction. Additionally, it appears that when the active con-
trol is effective, the synthesis method has a type I error rate between 0.025 and
0.0278 and that the deviations out of this range by the simulated type I error rates
are due to random error.
When the active control is truly effective, the methods based on 50% discount-
ing appear to have type I error rates for concluding any efficacy that decreases as
the ratio of the variances increases. The type I error rates for the 95–95 method
without discounting was U-shaped in the ratio of the variances, with the type I
error rate exceeding 0.025 only when the ratio of variances is quite large. When
the active control is truly ineffective for all methods, the limiting type I error rate is
1 as the ratio of variances goes to infinity. In all cases, the limiting type I error rate
is 0.025 as the ratio of variances goes to zero.
In the cases studied when the active control was truly effective, the standard
synthesis method with 50% discounting had a smaller type I error rate for a false
conclusion of any efficacy when the experimental therapy has zero efficacy than
the 95–95 method without discounting. However, when the active control is truly
ineffective, the order was reversed with the 95–95 method without discount-
ing having the smaller type I error rate. For these two methods, when the likeli-
hood that the active control is truly effective is considered, the standard synthesis
method with 50% discounting will have the smaller type I error rate when it is
highly likely that the active control is effective, and the larger type I error rate
when it is highly unlikely that the active control is effective.
For the four analysis methods, Tables 5.17 and 5.18 provide the type I error rates
for a false conclusion of any efficacy when the experimental therapy has zero effi-
cacy based on various probabilities that the active control is ineffective when the
ratio of the variances is 1 and 4. The probability that the active control is ineffec-
tive is based on Equation 3.1 and the selected probability that a random therapy is
effective. The simulation-based type I error rate is then a convex combination of
the corresponding conditional type I error rates in Tables 5.15 and 5.16. When the
probability a random therapy is effective equals 0.25, the probability is 0.0769 that
a therapy that achieves favorable statistical significance at a one-sided 0.025 level
is ineffective. Then for the standard synthesis method when the ratio of variances
is 1, the simulation-based type I error rate for a false conclusion of any efficacy
TABLE 5.17
Simulation-Based Type I Error Rate for a False Conclusion of Any Efficacy When
Ratio of Variances = 1
Probability Random Therapy Is Effective   Probability Active Control Is Ineffective   Standard Synthesis Method   95–95 Method   Standard Synthesis Method with 50% Discounting   95–95 Method with 50% Discounting
0.1 0.2000 0.0886 0.0151 0.0301 0.0081
0.25 0.0769 0.0506 0.0075 0.0118 0.0031
0.5 0.0270 0.0352 0.0044 0.0043 0.0011
0.75 0.0092 0.0297 0.0033 0.0017 0.0004
0.9 0.0031 0.0278 0.0029 0.0008 0.0001
TABLE 5.18
Simulation-Based Type I Error Rate for a False Conclusion of Any Efficacy When
Ratio of Variances = 4
Probability Random Therapy Is Effective   Probability Active Control Is Ineffective   Standard Synthesis Method   95–95 Method   Standard Synthesis Method with 50% Discounting   95–95 Method with 50% Discounting
0.1 0.2000 0.1376 0.0329 0.0643 0.0133
0.25 0.0769 0.0696 0.0154 0.0248 0.0051
0.5 0.0270 0.0420 0.0083 0.0087 0.0018
0.75 0.0092 0.0322 0.0058 0.0030 0.0006
0.9 0.0031 0.0288 0.0049 0.0010 0.0002
when the experimental therapy has zero efficacy is 0.9231 × 0.0269 + 0.0769 ×
0.3351 = 0.0506. When the probability that a random therapy is effective is only
10% and the ratio of variances is 4, the simulation-based type I error rate for the
95–95 method without discounting is 0.0329, larger than a desired one-sided level
of 0.025. When the probability a random therapy is effective is small, the type I
error rate for a false conclusion of any efficacy for the 95–95 method without
discounting will be larger than an intended level of 0.025. It is thus important in
settings where “success” is rare to consider using a more stringent criterion than
a margin based on the lower limit of the 95% confidence interval of the active
control effect.
For cases where the active control has its assumed effect and when the active
control has zero effect, the simulated type I error rates for a conclusion of efficacy
of more than half the efficacy of the active control when the experimental therapy
has efficacy exactly half that of the active control are provided in Table 5.19 for
TABLE 5.19
Simulated Conditional Type I Error Rates for Testing for Better than 50% Retention of
Active Control Effect
                     Active Control Effective                                Active Control Ineffective
Ratio of Variances   Synthesis Method, 50% Retention   95–95 Method, 50% Retention   Synthesis Method, 50% Retention   95–95 Method, 50% Retention
→0+ 0.025 0.025 0.025 0.025
0.1 0.0269 0.0137 0.0572 0.0300
0.25 0.0273 0.0091 0.0844 0.0337
0.5 0.0279 0.0067 0.1274 0.0452
1 0.0272 0.0043 0.1494 0.0404
2 0.0282 0.0034 0.2323 0.0570
4 0.0268 0.0030 0.3217 0.0666
10 0.0280 0.0036 0.5199 0.1233
→∞ 0.0278 0.0278 1 1
the standard synthesis and 95–95 methods based on 50% retention. When the
active control has zero effect, half the active control effect equals zero effect.
As with Tables 5.15 and 5.16, for each case, 100,000 simulations were used in
the underlying distribution for the estimated effect. Simulations are based on no
change in the active control effect. For the standard synthesis method and the
95–95 method, the limiting conditional type I error rate is 0.0278 = 0.025/0.9 as
the ratio of variances goes to infinity when the active control is truly effective.
For both methods, the limiting conditional type I error rate is 0.025 as the ratio of
variances goes to zero, and when the active control is truly ineffective the limiting
type I error rate is 1 as the ratio of variances goes to infinity. When the probability
that a random therapy is effective equals 0.25 (leading to a probability of 0.0769
that the active control is ineffective from Table 5.17) and the ratio of variances is
4, the standard synthesis method for 50% retention has a simulation-based type I
error rate of 0.9231 × 0.0268 + 0.0769 × 0.3217 = 0.0495 and the 95–95 method
has a simulation-based type I error rate of 0.9231 × 0.0030 + 0.0769 × 0.0666 =
0.0079.
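The convex-combination arithmetic can be sketched directly from the tabled conditional rates; p_ineffective below reproduces the Equation 3.1 computation for a one-sided significance level of 0.025 and 90% power:

```python
# Probability the active control is ineffective given a significant result
# (the Equation 3.1 computation), then the unconditional type I error rates
# as convex combinations of the conditional rates in Tables 5.15, 5.16, and 5.19.
def p_ineffective(p_eff, power=0.90, alpha=0.025):
    return (1 - p_eff) * alpha / ((1 - p_eff) * alpha + p_eff * power)

pi = p_ineffective(0.25)
print(pi)                               # ~0.0769

print((1 - pi) * 0.0269 + pi * 0.3353)  # synthesis, any efficacy, ratio = 1 (~0.0506)
print((1 - pi) * 0.0268 + pi * 0.3217)  # synthesis, 50% retention, ratio = 4 (~0.0495)
print((1 - pi) * 0.0030 + pi * 0.0666)  # 95-95, 50% retention, ratio = 4 (~0.0079)
```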
TABLE 5.20
Summary of Results on a Meta-Analysis on Overall Survival Comparing
5-FU + LV with 5-FU
Log Hazard Ratioa Standard Error 95% Confidence Interval for Hazard Ratioa
0.2341 0.0750 (1.091–1.464)
a Hazard ratios are 5-FU + LV/5-FU.
Bolus 5-FU by itself has not demonstrated an effect on overall survival for
first-line metastatic colorectal cancer, whereas the addition of leucovorin to
bolus 5-FU appears to improve overall survival (see Table 5.20). There is an
assumption in the trials that the use of the fluoropyrimidine capecitabine
does not require the additional use of leucovorin to improve its effect. A sys-
tematic review was done to find clinical trials that compared regimens of
5-FU + LV similar to the Mayo Clinic regimen with the same
regimen minus leucovorin. For each capecitabine trial, capecitabine would
be regarded as noninferior to 5-FU + LV on overall survival if capecitabine
retains greater than 50% of the historical survival effect of 5-FU + LV relative
to 5-FU alone.
TABLE 5.21
Summary of Overall Survival Results for Two Capecitabine Clinical Trials

Study        Total Number    Log Hazard    Standard    95% Confidence Interval
             of Deaths       Ratioa        Error       for the Hazard Ratioa
Study 1      533             –0.0036       0.0868      (0.841, 1.181)
Study 2      533             –0.0844       0.0867      (0.775, 1.089)
Combinedb    1066            –0.0440       0.0613      (0.849, 1.079)
a Hazard ratios are capecitabine/5-FU + LV.
b A fixed-effects meta-analysis of studies 1 and 2.
Both a 95–95 approach and a synthesis method with a test statistic similar to
Equation 5.10 were used.
On the basis of the 95–95 approach, the non-inferiority efficacy threshold
for the capecitabine versus 5-FU + LV hazard ratio is 1.091. The upper limits of
the 95% confidence interval for the capecitabine versus 5-FU + LV hazard ratio
are 1.181 and 1.089 for studies 1 and 2, respectively. Thus, study 1 fails to meet
this criterion for determining efficacy, whereas study 2 barely succeeds. For
the fixed-effects meta-analysis of studies 1 and 2, the upper limit of the 95%
confidence interval for the capecitabine versus 5-FU + LV hazard ratio is 1.079,
satisfying the non-inferiority efficacy threshold as determined by the 95–95
approach. The non-inferiority threshold for the capecitabine versus 5-FU + LV
hazard ratio is 1.045 (=1 + (1.09 – 1)/2) based on the 95–95 approach and a reten-
tion fraction of 50%. Studies 1 and 2 and their combined analysis all fail to
satisfy this threshold with the upper limits of their respective 95% confidence
intervals for the capecitabine versus 5-FU + LV hazard ratio above 1.045.
When targeting a one-sided type I error rate of 0.025 and assuming that
Z1 in Equation 5.10 has an approximate standard normal distribution at the
boundary of the null hypothesis in Expression 5.4, the critical value for the
synthesis method is –1.96. For studies 1 and 2, the values for the test statistic
Z1 in Equation 5.10 for a retention fraction of 50% are –1.32 and –2.16, respec-
tively. From this synthesis method, study 1 fails to demonstrate that capecit-
abine retains more than 50% of the historical effect of 5-FU + LV relative to
5-FU (–1.32 > –1.96), whereas study 2 demonstrates that capecitabine retains
more than 50% of the historical effect of 5-FU + LV relative to 5-FU (–2.16 <
–1.96). For the combined analysis, Z1 = –2.26.
A Fieller lower confidence limit for the retention fraction can be deter-
mined by setting Z1 = –1.96 and solving for the unspecified retention fraction,
λo. From this, we see that study 1 (study 2) demonstrated that capecitabine
retains at least 10% (61%) of the historical effect of 5-FU + LV on overall sur-
vival. The combined analysis demonstrated that capecitabine retains at least
64% of the historical effect of 5-FU + LV on overall survival.
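The exact form of Equation 5.10 appears earlier in the chapter; as an illustration only, the sketch below assumes the common synthesis form Z1 = (β̂ − (1 − λo)β̂H)/√(SE² + (1 − λo)²SEH²) on the log hazard scale, where β̂H and SEH are the historical estimate and standard error from Table 5.20. Solving Z1 = −1.96 for the retention fraction gives lower limits close to, but not exactly, the 10%, 61%, and 64% quoted above; the small differences reflect the precise form of Equation 5.10 and rounding.

```python
import math
from scipy.optimize import brentq

HIST, HIST_SE = 0.2341, 0.0750  # historical 5-FU + LV effect (Table 5.20)

def z1(log_hr, se, retention):
    """Synthesis-type statistic for testing retention of the control effect."""
    loss = 1.0 - retention  # fraction of the effect allowed to be lost
    return (log_hr - loss * HIST) / math.sqrt(se**2 + (loss * HIST_SE)**2)

for label, est, se in [("Study 1", -0.0036, 0.0868),
                       ("Study 2", -0.0844, 0.0867),
                       ("Combined", -0.0440, 0.0613)]:
    # Fieller-type lower limit: the largest retention fraction with Z1 = -1.96
    lam = brentq(lambda r: z1(est, se, r) + 1.96, -5.0, 1.0)
    print(f"{label}: retention demonstrated to be at least {lam:.2f}")
```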
TABLE 5.22
Power Is Provided for Individual Studies and Combined Analyses at Various True
Hazard Ratios of Capecitabine versus 5-FU + LV

                      Methoda
                      Non-Inferiority                     Standard Synthesis Method with
                      Threshold of 1.044                  50% Retention of the Control Effect
True Hazard Ratiob    533 Deaths (%)    1066 Deaths (%)   533 Deaths (%)    1066 Deaths (%)
1.05                  2                 2                 9                 12
1.00                  7                 10                22                35
0.95                  19                34                42                67
0.90                  40                68                67                91
0.85                  66                92                86                99
a Based on the geometric definition of the proportion of effect retained.
b For overall survival of capecitabine versus 5-FU + LV.
As shown in Table 5.22, as the alternative becomes a more and more advantageous
effect for capecitabine, the power increases greatly. Also, for each alternative
and fixed number of deaths, the power is higher for the standard
synthesis method than for the 95–95 approach with its threshold of 1.044.
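The entries of Table 5.22 can be closely reproduced with normal-approximation power formulas, using the Schoenfeld variance 4/d for the log hazard ratio under 1:1 randomization and treating the historical estimate of Table 5.20 as fixed; a sketch:

```python
import math
from scipy.stats import norm

HIST, HIST_SE = 0.2341, 0.0750   # historical effect (Table 5.20)
Z_A = norm.ppf(0.975)            # one-sided 0.025 critical value

def power_fixed_margin(true_hr, deaths, margin=1.044):
    """Power to conclude HR(capecitabine/5-FU + LV) < margin."""
    se = math.sqrt(4.0 / deaths)  # var(log HR) ~ 4/d for 1:1 randomization
    return norm.cdf((math.log(margin) - math.log(true_hr)) / se - Z_A)

def power_synthesis(true_hr, deaths, retention=0.5):
    """Conditional power of the synthesis test for 50% retention."""
    se = math.sqrt(4.0 / deaths)
    crit = (1 - retention) * HIST - Z_A * math.sqrt(
        se**2 + ((1 - retention) * HIST_SE)**2)
    return norm.cdf((crit - math.log(true_hr)) / se)

for hr in (1.05, 1.00, 0.95, 0.90, 0.85):  # rows of Table 5.22
    row = [round(100 * f(hr, d)) for f in (power_fixed_margin, power_synthesis)
           for d in (533, 1066)]
    print(hr, row)  # e.g., 1.00 -> [7, 10, 22, 35]
```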
Conditioning on the results from the estimation of the historical effect
size of 5-FU + LV on overall survival, Table 5.23 provides, for a one-to-one
randomization, the number of events to have 90% power at various capecit-
abine versus 5-FU + LV hazard ratios for these two methods of analysis.
For each case in Table 5.23, as the alternative becomes a slightly more and
more advantageous effect for capecitabine, the number of events sharply
decreases. This helps illustrate the importance of a proper choice of the
alternative to size the non-inferiority trial. Also, for each hazard ratio, the
number of events is smaller for the test procedure that uses a normalized
test statistic (the synthesis method) than for the test procedure based on
the two 95% confidence interval (95–95) approach.
TABLE 5.23
For Each Method, Number of Events Is Provided to Have 90% Power for Various
True Hazard Ratios of Capecitabine versus 5-FU + LV (1:1 Randomization)

True Hazard Ratioa    Non-Inferiority Cutoff    Standard Synthesis Method with
                      of 1.044                  50% Retention of the Control Effect
1.00                  22,669                    7,192
0.95                  4,721                     2,113
0.90                  1,908                     1,030
0.85                  995                       606
a For overall survival of capecitabine versus 5-FU + LV.
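Under the same approximations, the event counts in Table 5.23 follow from inverting the power formulas: Schoenfeld's formula for the fixed threshold and a one-dimensional solve for the synthesis method; a sketch:

```python
import math
from scipy.stats import norm
from scipy.optimize import brentq

HIST, HIST_SE = 0.2341, 0.0750
Z_A, Z_B = norm.ppf(0.975), norm.ppf(0.90)  # one-sided 0.025, 90% power

def events_fixed_margin(true_hr, margin=1.044):
    """Schoenfeld: d = 4(z_a + z_b)^2 / (log margin - log true HR)^2 for 1:1."""
    return 4 * (Z_A + Z_B) ** 2 / (math.log(margin) - math.log(true_hr)) ** 2

def events_synthesis(true_hr, retention=0.5):
    """Number of events d making the conditional power of the synthesis test 90%."""
    def gap(d):
        se = math.sqrt(4.0 / d)
        crit = (1 - retention) * HIST - Z_A * math.sqrt(
            se**2 + ((1 - retention) * HIST_SE)**2)
        return (crit - math.log(true_hr)) / se - Z_B
    return brentq(gap, 10, 1e7)

for hr in (1.00, 0.95, 0.90, 0.85):
    print(hr, round(events_fixed_margin(hr)), round(events_synthesis(hr)))
# approximately reproduces Table 5.23: (22669, 7192), (4721, 2113), ...
```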
References
1. Rothmann, M. et al., Design and analysis of non-inferiority mortality trials in
oncology, Stat. Med., 22, 239–264, 2003.
2. Rothmann, M., Type I error probabilities based on design-stage strategies with
applications to non-inferiority trials, J. Biopharm. Stat., 15, 109–127, 2005.
3. Wang, S.-J., Hung, H.M.J., and Tsong, Y., Utility and pitfalls of some statisti-
cal methods in active controlled clinical trials, Control Clin. Trials, 23, 15–28,
2002.
4. Temple, R. and Ellenberg, S.S., Placebo-controlled trials and active-controlled
trials in the evaluation of new treatments: Part 1. Ethical and scientific issues,
Ann. Intern. Med., 133, 455–463, 2000.
5. Hung, H.M.J., Wang, S.-J., and O’Neill, R., Issues with statistical risks for test-
ing methods in non-inferiority trial without a placebo arm, J. Biopharm. Stat., 17,
201–213, 2007.
6. U.S. Food and Drug Administration, Guidance for industry: Non-inferiority
clinical trials (draft guidance), March 2010.
7. Sankoh, A.J., A note on the conservativeness of the confidence interval approach
for the selection of non-inferiority margin in the two-arm active-control trial,
Stat. Med., 27, 3732–3742, 2008.
8. Hauck, W.W. and Anderson, S., Some issues in the design and analysis of equiv-
alence trials, Drug Inf. J., 33, 109–118, 1999.
9. Lawrence, J., Some remarks about the analysis of active control studies, Biom. J.,
47, 616–622, 2005.
10. Gupta, G. et al., Statistical review experiences in equivalence testing at FDA/
CBER, Proc. Biopharm. Sec., American Statistical Association Alexandria, VA,
1999, 220–223.
11. The ASSENT-2 Investigators, Single bolus tenecteplase compared with front-
loaded alteplase in acute myocardial infarction: ASSENT-2 double-blind ran-
domised trial, Lancet, 354, 716–722, 1999.
12. U.S. Food and Drug Administration Oncologic Drugs Advisory Committee
meeting July 27, 2004, transcript at http://www.fda.gov/ohrms/dockets/
ac/04/transcripts/2004-4060T1.pdf.
13. Holmgren, E.B., Establishing equivalence by showing that a specified percent-
age of the effect of the active control over placebo is maintained, J. Biopharm.
Stat., 9, 651–659, 1999.
14. Simon, R., Bayesian design and analysis of active control clinical trials, Biometrics,
55, 484–487, 1999.
15. Hasselblad, V. and Kong, D.F., Statistical methods for comparison to placebo in
active-control trials, Drug Inf. J., 35, 435–449, 2001.
16. Snapinn, S. and Jiang, Q., Controlling the type 1 error rate in non-inferiority tri-
als, Stat. Med., 27, 371–381, 2008.
17. Clinton, B. and Gore, A., Reinventing regulation of drugs and medical devices,
National Performance Review, April 1995.
18. Snapinn, S. and Jiang, Q., Preservation of effect and the regulatory approval
of new treatments on the basis of non-inferiority trials, Stat. Med., 27, 382–391,
2008.
19. Hung, H.M.J., Wang, S.-J., and O’Neill, R.T., A regulatory perspective on choice
of margin and statistical inference issue in non-inferiority trials, Biom. J., 47,
28–36, 2005.
20. Chen, G., Wang, Y.-C., and Chi, Y.H.G., Hypotheses and type I error in active-
control non-inferiority trials, J. Biopharm. Stat., 14, 301–313, 2004.
21. Fisher, L.D., Active control trials: What about a placebo? A method illustrated
with clopidogrel, aspirin and placebo, J. Am. Coll. Cardiol., 31, 49A, 1998.
22. Fisher, L.D., Gent, M., and Büller, H.R., Active-control trials: How would a new
agent compare with placebo? A method illustrated with clopidogrel, aspirin,
and placebo, Am. Heart J., 141, 26–32, 2001.
23. Rothmann, M.D. and Tsou, H., On non-inferiority analysis based on delta-
method confidence intervals, J. Biopharm. Stat., 13, 565–583, 2003.
24. Odem-Davis, K.S., Current issues in non-inferiority trials, dissertation,
University of Washington, Department of Biostatistics, 2010.
25. Wang, S.-J. and Hung, H.M.J., TACT method for non-inferiority testing in active
controlled trials, Stat. Med., 22, 227–238, 2003.
26. Carroll, K.J., Active-controlled non-inferiority trials in oncology: Arbitrary lim-
its, infeasible sample sizes and uninformative data analysis. Is there another
way? Pharm. Stat., 5, 283–293, 2006.
27. Snapinn, S.M., Alternatives for discounting in the analysis of non-inferiority tri-
als, J. Biopharm. Stat., 14, 263–273, 2004.
28. Wiens, B., Choosing an equivalence limit for non-inferiority or equivalence
studies, Control Clin. Trials, 23, 2–14, 2002.
29. Fleming, T.R., Current issues in non-inferiority trials, Stat. Med., 27, 317–332,
2008.
30. Hettmansperger, T.P., Two-sample inference based on one-sample sign statis-
tics, Appl. Stat., 33, 45–51, 1984.
31. Hung, H.M.J. et al., Some fundamental issues with non-inferiority testing in
active controlled clinical trials, Stat. Med., 22, 213–225, 2003.
32. Follmann, D.A. and Proschan, M.A., Valid inferences in random effects meta-
analysis, Biometrics, 55, 732–737, 1999.
33. Larholt, K., Tsiatis, A.A., and Gelber, R.D., Variability of coverage probabili-
ties when applying a random effects methodology for meta-analysis, Harvard
School Public Health Department of Biostatistics, unpublished, 1990.
34. FDA Medical-Statistical review for Xeloda (NDA 20-896), dated April 23, 2001.
35. FDA/CDER New and Generic Drug Approvals: Xeloda product labeling, at
http://www.fda.gov/cder/foi/label/2003/20896slr012_xeloda_lb1.pdf.
6.1 Introduction
When designing a study to show that an experimental therapy is effective, it
is sometimes possible to include a third arm in the study to obtain data on
both a concurrent placebo control and an active control. Earlier in this book,
we considered two-arm non-inferiority trials having only an active control
when the use of a placebo control is unethical or problematic—for example,
if an effective treatment is available for a disease with obvious discomfort or
irreversible morbidity, it may be difficult to obtain permission from an ethi-
cal review board to include a placebo control and most likely impossible to
obtain informed consent from potential study subjects. Alternatively, when
a placebo control is ethical, the comparison of an experimental treatment to
a placebo control is the gold standard and inclusion of an active control is
generally not required. However, there are situations in which inclusion of
both a placebo control and an active control are ethically and scientifically
defensible.
A three-arm trial involving concurrent active and placebo controls may
evolve in one of two ways—an active control may be added to a placebo-
controlled trial where the objective is the demonstration of superior efficacy
of the experimental therapy relative to placebo or a placebo control is added
to a two-arm non-inferiority trial.
When the standard therapy has a large, important effect, use of placebo-
controlled trials without a concurrent standard therapy arm may allow
claims of effectiveness for drugs that are substantially less effective than
standard therapy. Also, failure to demonstrate superiority of an experimen-
tal therapy to placebo can either be due to the experimental therapy being
ineffective or due to the trial lacking assay sensitivity. Additional use of an
active control arm (i.e., a three-arm trial) can assist in determining whether
the study has assay sensitivity. If the active control is demonstrated to be
superior to placebo, then the trial has assay sensitivity. If neither the active
control nor the experimental therapy is demonstrated to be superior to pla-
cebo, the trial may have lacked assay sensitivity.
It may be rare that all these criteria are met. Potential situations include a dis-
ease that does not result in discomfort, mortality, or irreversible morbidity
(perhaps for short-term studies of chronic indications in which there are lim-
ited acute symptoms, or for acute diseases with relatively mild symptoms).
Another potential situation is when the active control has not been studied
in a clinical trial in the disease under investigation but is hypothesized to
work. A third potential situation is a disease area in which the active control
is believed to confer benefit, but for whatever reason does not always show
an advantage in direct comparisons to placebo (i.e., the studies lack assay
sensitivity). An example of the first situation is a study of mild infections.
If the comparisons of the active control to placebo and the experimental treatment to placebo are made at the
full α level, there is an inflated chance that at least one of these two treatments
will falsely be considered superior to placebo, if neither is. However, this
might not be sufficient to disregard such an analysis strategy. The hypoth-
eses can be structured such that the only hypothesis used for making a con-
clusion is the comparison of the experimental treatment to placebo, and the
comparison of active control to placebo is presented for descriptive purposes
and only interpreted in the event that the primary comparison is not positive.
Testing the experimental treatment versus placebo at the full α level will con-
trol the probability that an ineffective treatment is considered effective, even
though there is an inflated chance of rejecting at least one true null hypoth-
esis. Note that a non-inferiority comparison in this case is of relatively low
interest, and a non-inferiority margin δ might not even be proposed a priori.
If a margin is proposed, a fixed sequence test would be a useful approach
to multiplicity, with the first comparison being the experimental therapy to
placebo and the second comparison being the non-inferiority comparison of
the experimental therapy to the active control. Again, the comparison of the
active control to placebo would be used to establish assay sensitivity, but will
not be included in the α-preserving multiple comparison procedure.
Concluding superiority of the experimental treatment compared with the
active control can be a secondary objective of a trial in which the primary
comparison is the experimental treatment to placebo. The comparison of the
experimental treatment to the active control will generally follow the fixed-
sequence approach, so other hypotheses are first tested and, conditional on
demonstrating sufficient benefit, the experimental treatment is compared
with the active control using the full two-sided α level. As an example, the
first comparison could be a direct comparison of the experimental treatment
versus placebo at the full α level. If this comparison shows significance favor-
ing the experimental treatment, the second comparison would be the non-
inferiority comparisons, also at the full α (or one-sided α/2) level. If this again
is positive, the third comparison might be to demonstrate that experimental
treatment is superior to the active control, again at the full α level. This third
comparison would be two sided; thus, it is possible with this strategy to dem-
onstrate that the experimental treatment is worse or better than the active
control.
Another use of an active control as a third arm in a non-inferiority trial
might be to establish favorable risk–benefit of the experimental treatment
compared with the active control. When the active control consistently dem-
onstrates efficacy compared with placebo but is also associated with consid-
erable toxicity, a less efficacious but better tolerated experimental treatment
might be preferable and therefore should be considered so future patients
can choose among treatments with various levels of efficacy and tolerability.
Statistical methodology for combining measures of efficacy and toxicity is
not commonly used, although such methods have been proposed.4–6 Therefore, such
comparisons of risk–benefit will generally be made more informally.
Inferences on Means. For continuous outcomes, inference is made on the fraction
of the active control effect (relative to placebo) that is retained by the
experimental therapy,

λ = (μE − μP)/(μC − μP)    (6.1)
The best linear unbiased estimator for the unknown quantity in the hypotheses
of Expression 6.3 is found by substituting the sample means for the population
means: ψ(μ̂) = X̄E − λoX̄C − (1 − λo)X̄P, where μ̂ is the vector of sample means.
Under the null hypothesis, with the assumption of normality and homogeneous
variances, the test can use the statistic

TBLUE = [X̄E − λoX̄C − (1 − λo)X̄P] / [σ̂ √(1/nE + λo²/nC + (1 − λo)²/nP)]    (6.4)

where σ̂ is the pooled estimate of the standard deviation and nP, nC, and nE are
the corresponding sample sizes. Under Ho, TBLUE in Equation 6.4 follows a t distribution
with nE + nC + nP − 3 degrees of freedom.
Pigeot et al.3 discussed values of λo and suggested λo = 0.8 could be appro-
priate. This implies that the experimental treatment retains 80% or more of
the efficacy of the active control. Values of λo = 0.5, or even less, may be appro-
priate depending on the indication and the amount of efficacy and toxicity
of the active control.
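A minimal sketch of the test based on Equation 6.4, with simulated data and illustrative names (assuming numpy and scipy are available):

```python
import numpy as np
from scipy.stats import t

def t_blue(x_e, x_c, x_p, lam0):
    """Three-arm retention test of Equation 6.4 (normality, equal variances)."""
    arms = [np.asarray(a, dtype=float) for a in (x_e, x_c, x_p)]
    n_e, n_c, n_p = (len(a) for a in arms)
    df = n_e + n_c + n_p - 3
    pooled_var = sum((len(a) - 1) * a.var(ddof=1) for a in arms) / df
    est = arms[0].mean() - lam0 * arms[1].mean() - (1 - lam0) * arms[2].mean()
    se = np.sqrt(pooled_var * (1/n_e + lam0**2/n_c + (1 - lam0)**2/n_p))
    return est / se, df

# Simulated example: mu_P = 0, mu_C = 10, mu_E = 9, lambda_0 = 0.8
rng = np.random.default_rng(1)
stat, df = t_blue(rng.normal(9, 5, 100), rng.normal(10, 5, 100),
                  rng.normal(0, 5, 100), lam0=0.8)
print(f"T_BLUE = {stat:.2f}, one-sided p = {t.sf(stat, df):.4f}")
```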
Hasler, Vonk, and Hothorn7 considered the continuous case under the
assumption of unequal, unknown variances. The test statistic becomes
T = [X̄E − λoX̄C − (1 − λo)X̄P] / √(sE²/nE + λo²sC²/nC + (1 − λo)²sP²/nP)

with approximate degrees of freedom given by the Satterthwaite-type formula

ν = (sE²/nE + λo²sC²/nC + (1 − λo)²sP²/nP)² /
    [sE⁴/(nE²(nE − 1)) + λo⁴sC⁴/(nC²(nC − 1)) + (1 − λo)⁴sP⁴/(nP²(nP − 1))]
Koch and Tangen8 provided a sample size formula for three-arm trials. Pigeot
et al.3 also discussed optimal allocation ratios of active control to experimental
to placebo. In general, the active control and experimental treatment should
have the same sample size. With equal sample sizes for the first two treat-
ments, the optimal allocation ratio becomes 1:1:kP. Pigeot et al. showed that
kP = (1 − λo)√(2 + 2λo) / (1 + λo²)    (6.5)
was optimal for λo < 1. In other words, the optimal ratio depends only on the
value of λo. For λo = 0.8, the ratio is 1:1:0.23, or approximately 9:9:2. For λo = 0.5,
the optimal ratio is 1:1:0.69, or approximately 3:3:2. For λo = 0.3213, the ratio
is approximately 1:1:1; whereas for even smaller values of λo, more subjects
are required in the placebo group than in the other two groups. We have
from Equation 6.5 that kP is decreasing in λo for 0 < λo < 1 and as λo increases
toward 1, kP decreases toward 0. For λo < 0.5, we recommend equal sample
sizes in all three groups, as the power loss will not be large compared to the
complexity of unequal ratios and the potential ethical concern of enrolling
more subjects in the placebo group than the other groups. In the general case
(the experimental and active control arms need not have equal sample sizes),
as shown by Pigeot et al.,3 the optimal allocation across the experimental,
active control, and placebo arms is 1:λo:1 – λo. Optimal allocation ratios when
the variances are unequal among the three arms lead to larger allocation for
the study arms with larger variances.7 The procedure described in Hasler,
Vonk, and Hothorn’s paper7 required that the ratio of variances among treat-
ments be known, something that in practice is not known precisely. A two-
stage sample size recalculation procedure for the three-arm testing problem
is described by Schwartz and Denne.9 The optimal allocation ratios for the
two-stage procedure are the same as those for the fixed sample size case.
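For example, Equation 6.5 reproduces the allocation ratios quoted above:

```python
import math

def optimal_placebo_share(lam0):
    """k_P in the optimal 1:1:k_P allocation of Equation 6.5 (lambda_0 < 1)."""
    return (1 - lam0) * math.sqrt(2 + 2 * lam0) / (1 + lam0**2)

for lam0 in (0.8, 0.5, 0.3213):
    print(lam0, round(optimal_placebo_share(lam0), 2))
# 0.8 -> 0.23 (about 9:9:2); 0.5 -> 0.69 (about 3:3:2); 0.3213 -> 1.0 (1:1:1)
```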
Inferences on Proportions. Similar methodologies have been proposed for
demonstrating that a sufficient proportion of the effect of the active con-
trol was maintained when considering binomial data rather than continu-
ous data.10,11 Letting p̂P , p̂C, and p̂E be the observed success proportions of
desired outcomes for the placebo, active control, and experimental treatment
groups, respectively, inference is made on the quantity λ, modified from
Equation 6.1 for binary data: λ = (pE − pP)/(pC − pP). The hypotheses are
expressed analogously to Expression 6.3 (Expression 6.6). Tang and Tang10
investigated, by simulation, Wald-type tests and tests that use null restricted
maximum likelihood estimates of the proportions when estimating the standard error,
with allocations to the experimental, active control, and placebo groups of 1:1:1,
2:2:1, and 3:2:1, respectively. According to their results, the simulated one-sided
type I error probabilities ranged overall from 0.043 to 0.060 with the use of the
restricted maximum likelihood estimates maintaining the desired one-sided
type I error rate of 0.05 better than the Wald’s test. The power when using the
restricted maximum likelihood estimates to estimate the standard error was
consistently slightly less than the power when using the sample proportions.
Kieser and Friede11 also investigated the type I error rate for a three-arm
non-inferiority test on proportions when using the Wald-type test and the
analogous test based on the null restricted maximum likelihood estimates of
the true proportions when estimating the standard error. Their calculations
were based on the actual probabilities from the corresponding binomial dis-
tributions, not simulations, and differed from those of Tang and Tang.10 All
cases in Tang and Tang’s study10 were considered along with additional cases.
Desired one-sided levels of α = 0.025 and 0.05 were considered with λo = 0.6
and 0.8; pP = 0.05, 0.10, . . . , 0.50; pC – pP = 0.05, 0.10, . . . , 0.95 (only those cases
where pC ≤ 1); and pE = λopC + (1 – λo)pP . The overall sample sizes were 30, 60, 90,
120, 180, 240, and 300 with allocations to the experimental, active control, and
placebo groups of 1:1:1, 2:2:1, and 3:2:1. Both procedures tended to have actual
type I error rates above the desired rates of 0.025 and 0.05. The inflation was
more pronounced for the Wald-type test. Interestingly, cases that had the greatest
actual type I error rate for the Wald-type test (as high as 0.212) had the actual
type I error rate maintained under the desired level of 0.025 or 0.05 when using
the restricted maximum likelihood estimates to estimate the standard error.
Kieser and Friede further proposed sample size calculations to achieve a
given power. Because power estimates depend on the variances, which dif-
fer under the null and alternative hypotheses, Kieser and Friede proposed
several sample size formulae. The one with the best properties gives the
overall sample size N = (1 + kE + kC)(zατ0 + zβτ1)²/ψ1², where the allocation of
subjects to the experimental, active control, and placebo groups is kE:kC:1,
ψ1 = pE,1 − λopC,1 − (1 − λo)pP,1 is the value of the tested contrast under the
alternative, and τi² = (1 − λo)²pP,i(1 − pP,i) + (λo²/kC)pC,i(1 − pC,i) + (1/kE)pE,i(1 − pE,i),
where for i = 0 the proportions are under the null hypothesis and for i = 1 the
proportions are under the alternative hypothesis. However, even this formula can be incor-
rect, so simulations are advised to confirm the power before conducting the
study. In addition, the ratio k E:kC:1 does not in general have a unique point
that maximizes power, so investigation of various values (with kC > kE > 1
often holding) is advised.
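A sketch of this sample size formula (notation as above; the proportions and allocation here are purely illustrative and, per the caution above, the result should be checked by simulation):

```python
import math
from scipy.stats import norm

def kf_total_n(p_null, p_alt, lam0, k_e, k_c, alpha=0.025, power=0.9):
    """Total N for the three-arm binary test; p_null and p_alt are (pP, pC, pE)
    under the null and alternative; allocation is k_E:k_C:1 (placebo = 1)."""
    def tau2(pp, pc, pe):
        return ((1 - lam0)**2 * pp * (1 - pp)
                + lam0**2 / k_c * pc * (1 - pc)
                + pe * (1 - pe) / k_e)
    psi1 = p_alt[2] - lam0 * p_alt[1] - (1 - lam0) * p_alt[0]
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    n = (1 + k_e + k_c) * (z_a * math.sqrt(tau2(*p_null))
                           + z_b * math.sqrt(tau2(*p_alt)))**2 / psi1**2
    return math.ceil(n)

# Null proportions on the boundary p_E = lam0*p_C + (1 - lam0)*p_P
print(kf_total_n(p_null=(0.3, 0.5, 0.46), p_alt=(0.3, 0.5, 0.55),
                 lam0=0.8, k_e=2, k_c=2))  # about 1378 subjects in total
```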
Additionally, because the hypotheses in Expression 6.2 assume that the
active control is superior to placebo, Kieser and Friede11 recommended test-
ing first that the active control is superior to placebo at the full α, and if supe-
riority is concluded, proceed to testing for non-inferiority at the full α. They
further discussed how to size the trial under this testing sequence to achieve
the desired power of concluding non-inferiority.
The inconsistent calculations reported by Tang and Tang10 and Kieser
and Friede11 suggest additional caution against planning based on direct
calculations alone, without verification by simulation.
Inferences on Time-to-Event Endpoints. For time-to-event outcomes, the retention
fraction is defined as λ = βP/E/βP/C, where βP/E and βP/C denote the log hazard
ratios of placebo to the experimental therapy and of placebo to the active
control, respectively. The standard error of the estimator β̂P/E − λoβ̂P/C is
estimated by

√(1/rE + λo²/rC + (1 − λo)²/rP)    (6.9)

where rE, rC, and rP denote the number of events in the experimental, active
control, and placebo arms, respectively. From Expressions 6.8 and 6.9, we
have the test statistic

Z = (β̂P/E − λoβ̂P/C) / √(1/rE + λo²/rC + (1 − λo)²/rP)    (6.10)

The test rejects the null hypothesis in Expression 6.7 and concludes non-
inferiority when Z > zα/2.
A similar test statistic to Equation 6.10 was used by Mielke, Munk, and
Schacht13 under the assumption of underlying exponential distributions.
There, the estimator β̂ P/E (β̂ P/C ) is equal to the difference in the natural loga-
rithms of the maximum likelihood estimators of the means of the experi-
mental (active control) and placebo arms.
For all of these three-arm non-inferiority cases, a Fieller 100(1 – α) confi-
dence interval for λ can be found by treating λo as unknown and setting the
test statistic equal to ±zα/2 (or the analogous values from the appropriate t dis-
tribution) and solving for λo.3,8 If all the values in the confidence interval are
greater than zero, superiority of the experimental arm to the placebo arm is
concluded. If all the values in the confidence interval are greater than λo, non-
inferiority of the experimental arm to the active control arm is concluded. If
all the values in the confidence interval are greater than 1, superiority of the
experimental arm to the active control arm is concluded.
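A sketch of the survival-case computation, using the form of Equation 6.10 given above (event counts and log hazard ratios here are hypothetical):

```python
import math
from scipy.stats import norm

def three_arm_survival_z(b_pe, b_pc, r_e, r_c, r_p, lam0):
    """Equation 6.10: b_pe, b_pc are estimated log hazard ratios of placebo
    vs. experimental and placebo vs. active control; r_* are event counts."""
    se = math.sqrt(1/r_e + lam0**2/r_c + (1 - lam0)**2/r_p)
    return (b_pe - lam0 * b_pc) / se

z = three_arm_survival_z(b_pe=0.38, b_pc=0.41, r_e=200, r_c=200, r_p=120,
                         lam0=0.5)
print(f"Z = {z:.2f}; conclude non-inferiority if Z > {norm.ppf(0.975):.2f}")
```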
Capturing All Possibilities of Efficacy. In determining whether the experi
mental therapy is efficacious or has adequate efficacy, the possibility that
μ C ≤ μP < μE should be included, but is not included, in the non-inferiority
inference. For a two-arm non-inferiority trial, superiority of the experimen-
tal therapy to the active control therapy is intended to imply non-inferiority
and that the experimental therapy is effective (i.e., due to the assumption that
the control therapy is “active”). Although the possibility that μP < μ C < μE is
included in the alternative hypothesis in Expression 6.3 by having the over-
all assumption that μP < μ C and that μE – λoμ C – (1 – λo)μP > 0 for some pre-
specified 0 ≤ λo ≤ 1, the possibility that μ C ≤ μP < μE is excluded. The possibility
of μ C ≤ μP < μE accounts for one-sixth of the overall, unrestricted parameter
space for (μP, μ C, μE), and the estimation of μE – λoμ C – (1 – λo)μP that is done,
including the modeling of the uncertainty in that estimation, does not pre-
clude μ C ≤ μP < μE. Order restricted or constrained inference is not done. The
aforementioned test procedures do not estimate or model the estimation of
μE – λoμ C – (1 – λo)μP under the restriction that μP < μ C.
Having as the alternative hypothesis

Ha: {(μP, μC, μE): μE − λoμC − (1 − λo)μP > 0, μP < μC} ∪ {(μP, μC, μE): μC ≤ μP < μE}    (6.11)

would capture all of these possibilities of efficacy.
Under the Jeffreys prior, for a normal random sample x1, x2, ..., xn, the joint
posterior density of the mean μ and variance θ satisfies

g(μ, θ | x1, x2, ..., xn) ∝ θ^(−1/2) exp{−(μ − x̄)²/(2θ/n)} × θ^(−n/2−1) exp{−S/(2θ)},
where S = Σᵢ(xi − x̄)²    (6.13)

We see from Expression 6.13 that the joint density factors into the product
of an inverse gamma marginal distribution for θ and a normal conditional
distribution for μ given θ. The inverse gamma distribution has shape and
scale parameters equal to n/2 and S/2, respectively, with a mean equal to
S/(n − 2) and a variance equal to 2S²/[(n − 2)²(n − 4)]. Note that θ has an
inverse gamma distribution with parameters n/2 and S/2 if and only if 1/θ has
a gamma distribution with parameters n/2 and 2/S, with mean equal to n/S.
If the posterior probability exceeds the preselected threshold (or in the last case both prob-
abilities exceed the threshold), non-inferiority of the experimental therapy
would be concluded.
An alternative way of calculating posterior probabilities in this case is pro-
vided by Gamalo et al.16 They discussed the use of a generalized p-value (i.e.,
the posterior probability of a one-sided null hypothesis) and a generalized
confidence interval (i.e., a credible interval) for μE – λoμC – (1 – λo)μP when the
variances are unknown and are not assumed equal. For arm i, i = C, E, P, let
ni denote the number of subjects on that arm, x̄i denote the observed sample
mean, and si denote the observed standard deviation (i.e., si² has the form
Σⱼ(xj,i − x̄i)²/(ni − 1), summing over j = 1, ..., ni). The posterior distributions
of the means μC, μE, and μP are independent, where for i = C, E, P the posterior
distribution of μi is that of

x̄i + Ti si/√ni

where Ti has a t distribution with ni − 1 degrees of freedom.
Similar results are reported from applying the procedure described by
Gamalo et al.16 as with applying the procedure of Hasler, Vonk, and Hothorn7
in testing the hypotheses in Expression 6.3. The advantage of the Bayesian
procedure is that the uncertainty of whether the active control is superior to placebo
(i.e., μP < μC) and the possibility that μC ≤ μP < μE can be incorporated directly into
the testing procedure. That is, posterior probabilities like those in (e) and (f)
can be calculated. Gamalo et al.16 validated the type I error rate of their pro-
cedure in testing the hypotheses in Expression 6.3 with simulations based on
a model that includes modeling the variances.
The hypotheses given in Expression 6.12 can also be tested either based
on the posterior probability of μE > max{μP,μ C – δ} or based on min{P(μE > μP),
P(μE > μ C – δ)}. Comparing min{P(μE > μP), P(μE > μ C – δ)} to a threshold of 0.975
would be analogous to two separate one-sided tests at level 0.025 that the
experimental therapy is superior to placebo (i.e., μE > μP) and that the experi-
mental therapy is noninferior to the active control therapy (i.e., μE > μ C – δ).
The two-arm versions of these Bayesian approaches are discussed in
Section 12.2.4, along with examples that calculate posterior probabilities and
credible intervals for the difference in means.
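A sketch of the posterior simulation for means, using the t representation above; the summary statistics, threshold, and margin δ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
B = 100_000  # posterior draws

def mu_draws(xbar, s, n):
    """Posterior draws of mu_i: xbar + T * s / sqrt(n), T ~ t(n - 1)."""
    return xbar + rng.standard_t(n - 1, size=B) * s / np.sqrt(n)

mu_e = mu_draws(9.2, 5.1, 100)   # experimental (hypothetical summaries)
mu_c = mu_draws(10.0, 4.8, 100)  # active control
mu_p = mu_draws(1.1, 5.0, 50)    # placebo

lam0, delta = 0.8, 2.0
retained = mu_e - lam0 * mu_c - (1 - lam0) * mu_p > 0
prob_f = np.mean((retained & (mu_p < mu_c)) | ((mu_c <= mu_p) & (mu_p < mu_e)))
prob_two_tests = min(np.mean(mu_e > mu_p), np.mean(mu_e > mu_c - delta))
print(prob_f, prob_two_tests)  # compare each with a threshold such as 0.975
```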
Inferences on Proportions. For each arm we will consider a beta prior dis-
tribution for the probability of a success. For a random sample of n binary
observations where x are successes, a beta prior distribution with parameters
α and β for p, the probability of success, leads to a posterior distribution for
p with parameters α + x and β + n – x. A Jeffreys prior distribution has α =
β = 0.5.
For proportions where a “success” is desirable, the probability in (f) is the
posterior probability that pE – λopC – (1 – λo)pP > 0, pP < pC or pC ≤ pP < pE. If
that probability exceeds some threshold (e.g., 0.975), then the experimental
therapy is concluded to be noninferior to the active control therapy and to be
efficacious.
A noninformative prior for (θP/E, θP/C) leads to a joint posterior distribution
for (θP/E, θP/C) with mean (θ̂P/E, θ̂P/C), variances σ²P/E and σ²P/C, and correlation ρ.
Unlike the estimator of the standard error of a sample mean from a normal random
sample, the estimators of standard error for a log-hazard ratio are quite stable
in the vast majority of applications. Likewise, the estimator of ρ is fairly
stable. Therefore, for the Bayesian model, with slight crudeness, we use estimates
of σ²P/E and σ²P/C and of the correlation ρ as the true values. A parallel to using
the test statistic in Equation 6.10 would use 1/rP + 1/rE, 1/rP + 1/rC, and
1/√((1 + rP/rC)(1 + rP/rE)) as the values for σ²P/E, σ²P/C, and ρ. Then the posterior
probability of θP/E − λoθP/C ≤ 0 would equal exactly the one-sided p-value from
using the test statistic in Equation 6.10.
For time-to-event endpoints where the event is undesirable (i.e., longer
times are more desirable), the probability in (f) is the posterior probability
that θ P/E – λo θ P/C > 0, θ P/C > 0, or θ P/C ≤ 0 < θ P/E. If that probability exceeds
some threshold (e.g., 0.975), then the experimental therapy is concluded to
be noninferior to the active control therapy and to be efficacious. The prob-
ability in (e) is the posterior probability of the alternative hypothesis given in
Expression 6.7. As with means and proportions, the analogs of the frequen-
tist tests would be based on P(θ P/E – λo θ P/C > 0), or P(θ P/E – λo θ P/C > 0, θ P/C > 0),
or both P(θ P/C > 0) and P(θ P/E – λo θ P/C > 0). If the posterior probability exceeds
the preselected threshold (or in the last case both probabilities exceed the
threshold), non-inferiority of the experimental therapy would be concluded.
Example 6.1 illustrates the use of a frequentist and a Bayesian method in a
three-arm testing involving proportions.
Example 6.1
This example uses data from a three-arm trial of simethicone (the experimental
therapy), cisapride (the active control), and placebo that was analyzed by Tang and
Tang.10 The endpoint considered here is the reporting of adverse events, for which the
motivation for doing non-inferiority testing is not clear, and we do not recommend
doing non-inferiority testing in this fashion. For this example, we will assume that
the greater the reporting rate of adverse events the better. Adverse events were
reported in 7 of 61 patients randomized to placebo, 10 of 59 patients randomized
to cisapride (the active control), and 12 of 58 patients randomized to simethicone
(the experimental therapy). For testing the hypotheses in Expression 6.6 with λo =
0.8, the one-sided p-value = 0.234 for the Wald-type test as given by Tang and
Tang.10 It should be noted that the 95% confidence intervals for the difference in
rates between the simethicone and placebo arms, and between the cisapride and
placebo arm are –0.039 to 0.224 and –0.070 to 0.179, respectively. Thus, neither
the active control nor experimental therapy demonstrated a higher underlying
reporting rate of adverse events. Also, for every –∞ < λo < ∞, the value of the Wald-
type test statistic is between –1.96 and 1.96. Therefore, ignoring that the active
control was not demonstrated to be “superior” to placebo, the Fieller 95% confi-
dence interval for the retention fraction is –∞ to ∞.
Jeffreys’ prior distributions were used for each of pE, pC, and pP. Posterior prob-
abilities and credible intervals were approximated from 100,000 simulations. The
simulated 95% credible intervals for the difference in rates between the simethi-
cone and placebo arms, and between the cisapride and placebo arm were similar
to the Wald’s 95% confidence intervals and are (–0.038 to 0.224) and (–0.070
to 0.180), respectively. The simulated 95% credible interval for the difference in
rates between the simethicone and cisapride arms is (–0.104 to 0.178). Simulated
posterior probabilities of interest are given in Table 6.1.
Note that, in (f), pE > pP ≥ pC implies pE – 0.8pC – 0.2pP > 0. From Table 6.1, the
simulated posterior probability of pE – 0.8pC – 0.2pP > 0 and pC > pP, or pE > pP ≥
pC equals 0.738, which is far smaller than 0.975. The experimental arm has not
demonstrated the combination of non-inferiority and efficacy (i.e., adverse events
reporting rates greater than placebo). The direct analog of the one-sided p-value
of .234 of the Wald-type test in testing the hypotheses in Expression 6.6 is the
simulated posterior probability for pE – 0.8pC – 0.2pP > 0 of 0.763 (i.e., the simu-
lated posterior probability of the null hypothesis is 0.237 = 1 – 0.763). However, in
2.5% of the simulations, pE – 0.8pC – 0.2pP > 0 and pP > pE > pC. The uncertainty
that pP > pE > pC is not accounted for in the Wald-type test of the hypotheses in
Expression 6.6.
TABLE 6.1
Simulated Posterior Probabilities of Interest
Event Simulated Posterior Probability
(a) pE > pP 0.916
(b) pC > pP 0.806
(c) pE > pC 0.696
(d) pE > max{pC,pP} 0.667
(e) pE – 0.8pC – 0.2pP > 0 and pC > pP 0.589
(f) pE – 0.8pC – 0.2pP > 0 and pC > pP, or pE > pP ≥ pC 0.738
pE – 0.8pC – 0.2pP > 0 and pP > pE > pC 0.025
pE – 0.8pC – 0.2pP > 0 0.763
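The posterior probabilities in Table 6.1 can be approximated in a few lines; with a finite number of draws the third decimal will wobble slightly:

```python
import numpy as np

rng = np.random.default_rng(2011)
B = 100_000

# Jeffreys Beta(0.5, 0.5) priors give Beta(0.5 + x, 0.5 + n - x) posteriors
p_p = rng.beta(0.5 + 7, 0.5 + 54, B)   # placebo: 7 of 61
p_c = rng.beta(0.5 + 10, 0.5 + 49, B)  # cisapride: 10 of 59
p_e = rng.beta(0.5 + 12, 0.5 + 46, B)  # simethicone: 12 of 58

retained = p_e - 0.8 * p_c - 0.2 * p_p > 0
print("(a)", np.mean(p_e > p_p))                       # ~0.916
print("(e)", np.mean(retained & (p_c > p_p)))          # ~0.589
print("(f)", np.mean((retained & (p_c > p_p))
                     | ((p_e > p_p) & (p_p >= p_c))))  # ~0.738
```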
References
1. Koch, G.G., Comments on ‘current issues in non-inferiority trials’ by Thomas R.
Fleming, Stat. Med., 27, 333–342, 2008.
2. Temple, R. and Ellenberg S.S., Placebo-controlled trials and active-controlled tri-
als in the evaluation of new treatments: Part 1. Ethical and scientific issues, Ann.
Intern. Med., 133, 455–463, 2000.
3. Pigeot, I. et al., Assessing non-inferiority of a new treatment in a three-arm clini-
cal trial including a placebo, Stat. Med., 22, 883–899, 2003.
4. Letierce, A. et al., Two-treatment comparison based on joint toxicity and efficacy
ordered alternatives in cancer trials, Stat. Med., 22, 859–868, 2003.
5. Jennison, C. and Turnbull, B.W., Group sequential tests for bivariate response:
Interim analyses of clinical trials with both efficacy and safety endpoints,
Biometrics, 49, 741–752, 1993.
6. Thall, P.F. and Cheng, S.-C., Treatment comparisons based on two-dimensional
safety and efficacy alternatives in oncology trials, Biometrics, 55, 746–753, 1999.
7. Hasler, M., Vonk, R., and Hothorn, L.A., Assessing non-inferiority of a new
treatment in a three-arm trial in the presence of heteroscedasticity, Stat. Med.,
27, 490–503, 2008.
8. Koch, G.G. and Tangen, C.M., Nonparametric analysis of covariance and its role
in non-inferiority clinical trials, Drug Inf. J., 33, 1145–1159, 1999.
9. Schwartz, T.A. and Denne, J.S., A two-stage sample size recalculation procedure
for placebo- and active-controlled non-inferiority trials, Stat. Med., 25, 3396–3406,
2006.
10. Tang, M.-L. and Tang, N.-S., Tests of non-inferiority via rate difference for three-
arm clinical trials with placebo, J. Biopharm. Stat., 14, 337–347, 2004.
11. Kieser, M. and Friede, T., Planning and analysis of three-arm non-inferiority tri-
als with binary endpoints, Stat. Med., 26, 253–273, 2007.
12. Farrington, C.P. and Manning, G., Test statistics and sample size formulae for
comparative binomial trials with null hypothesis of non-zero risk difference or
non-unity relative risk, Stat. Med., 9, 1447–1454, 1990.
13. Mielke, M., Munk, A., and Schacht, A., The assessment of non-inferiority in a
gold standard design with censored, exponentially distributed endpoints, Stat.
Med., 27, 5093–5110, 2008.
14. Koch, A. and Röhmel, J., Hypothesis testing in the ‘gold standard’ design for
proving the efficacy of an experimental treatment relative to placebo and a refer-
ence, J. Biopharm. Stat., 14, 315–325, 2004.
15. Ghosh, P. et al., Assessing non-inferiority in a three-arm trial using the Bayesian
approach, Technical report, Memorial Sloan-Kettering Cancer Center, 2010.
16. Gamalo, M. et al., A generalized p-value approach for assessing non-inferiority
in a three-arm trial, Stat. Methods Med. Res. Published online February 7, 2011.
7.1 Introduction
Multiple comparisons pose a problem in any clinical trial. There are many
aspects to this, including exploring multiple treatment arms, multiple effi-
cacy measurement endpoints, and multiple timepoints, but all lead to the
same problem: the chance of falsely concluding efficacy is inflated without
proper recognition of multiplicity.
In non-inferiority testing, the roles of the null and alternative hypoth
eses are in some ways reversed, which can cause confusion at first glance. In
non-inferiority testing, the type I error is the probability of concluding non-
inferiority when the active control is markedly superior to the experimental
treatment; in superiority testing, the type I error is the probability of con-
cluding superiority of one treatment when the effects of the treatments are
identical. This may lead to some confusion about the interpretation of type I
and type II errors. However, when the type I error is properly recognized as
the probability of rejecting a null hypothesis that is true, and the type II error
is the probability of not rejecting a null hypothesis that is false, the confusion
dissipates. This is the same in non-inferiority or superiority testing.
Control of the type I error rate can be defined in different ways. The most
common for clinical trials is control of the familywise error (FWE) rate—the
probability of rejecting at least one true null hypothesis. In the case of testing
multiple endpoints for non-inferiority, this means concluding non-inferiority
at least one time when non-inferiority is not true. Control in the strong sense
requires that the FWE is controlled at the claimed α level or less for every
possibility in the parameter space (i.e., no matter which null hypotheses are
true and which are false). This is in contrast to control in the weak sense,
which requires that the FWE is controlled at or below the claimed level only
when all null hypotheses are true. Control of the FWE in the strong sense is
most commonly used in a regulatory setting. Thus, in the rest of this chapter,
we do not continually state “in the strong sense” when we refer to control of
the FWE, although this is implied. Other definitions of type I error rate can
also be considered, including the comparisonwise error rate (CWE: controlling
the type I error rate of each comparison separately, without adjustment for
multiplicity).
Example 7.1
With such an approach, one primary endpoint becomes the most important
primary endpoint, and other endpoints are not even considered unless non-
inferiority is demonstrated for the first endpoint. If non-inferiority is dem-
onstrated on an endpoint, the next endpoint (again from the prespecified
ordering) will be tested. It may seem illogical that a less important endpoint
that is called “primary” might not be tested at all, depending on the results
of the previous comparisons. If this is a concern, an alternative is to only
call the first primary endpoint “primary” and label other endpoints “sec-
ondary,” by placing them in a separate family of endpoints. This change in
labels has no impact on the testing process, the power, or the interpretation
of results.
An obvious alternative to the fixed sequence strategy, to avoid some of the
problems mentioned above, is to save some of the α for subsequent testing, as
in the fallback test described earlier. With the fallback, all comparisons can
be considered even if one or more endpoints do not result in a conclusion of
non-inferiority.
The Holm procedure, described earlier, can also be used for a single family
of comparisons, and will control the FWE.
Hochberg8 proposed a procedure based on the Holm procedure, but using
a step-up rather than a step-down approach. That is, the null hypothesis is
rejected and non-inferiority is concluded for the endpoint associated with
the largest p-value for which p(j) ≤ α/(J – j + 1), and for all endpoints with
smaller p-values. By definition, this will include all endpoints for which
the Holm procedure concludes non-inferiority, and maybe more; thus, the
Hochberg procedure is uniformly more powerful than the Holm procedure.
Again, caution must be used as the Hochberg procedure does not always
control the FWE in the strong sense.
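For a single family of one-sided non-inferiority p-values, the two step procedures can be sketched as follows (the p-values are illustrative):

```python
def holm(pvals, alpha=0.05):
    """Holm step-down: reject while p(j) <= alpha / (J - j + 1), then stop."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    J, rejected = len(pvals), set()
    for j, i in enumerate(order, start=1):
        if pvals[i] > alpha / (J - j + 1):
            break
        rejected.add(i)
    return rejected

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: find the largest j with p(j) <= alpha / (J - j + 1)
    and reject that endpoint and all endpoints with smaller p-values."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    J = len(pvals)
    for j in range(J, 0, -1):
        if pvals[order[j - 1]] <= alpha / (J - j + 1):
            return set(order[:j])
    return set()

p = [0.030, 0.013, 0.041]
print(holm(p), hochberg(p))  # Holm rejects {1}; Hochberg rejects {0, 1, 2}
```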
More generally, a multiple comparison procedure that is a closed test-
ing procedure will control the FWE in the strong sense.9 A closed testing
procedure considers all possible nonempty subsets of hypotheses. With J
endpoints, there will be Σₖ₌₁ᴶ (J choose k) = 2ᴶ − 1 subsets to consider. Within each
subset, the corresponding intersection null hypothesis is tested at level α. If a
Bonferroni test with weights α1, ..., αJ (summing to α) is used within each subset,
the error rate for a given subset will be bounded above by Σₖ₌₁ᴶ αₖ I(k), where
I(k) is an indicator function that equals 1 if endpoint k is in the subset and 0
otherwise.
subset containing all endpoints. For this reason, it is always possible to find
a procedure that is more powerful than the Bonferroni procedure for a single
family of hypotheses but still controls the FWE in the strong sense.
FIGURE 7.1
A testing sequence involving two endpoints.
After non-inferiority is concluded for the primary endpoint, superiority for the
primary endpoint and non-inferiority for the secondary endpoint are each
tested at a one-sided level of α/2 (or otherwise such that the levels sum to α).
If both are rejected, then superiority for the secondary endpoint can be tested
at the one-sided level of α; if only non-inferiority for the secondary endpoint
is concluded, then superiority for the secondary endpoint can be tested at the
one-sided level of α/2; otherwise, superiority for the secondary endpoint can-
not be tested. However, further testing would cease if non-inferiority for the
primary endpoint was not concluded, or if either superiority for the primary
endpoint or non-inferiority for the secondary endpoint was not concluded.
Proof of the control of FWE is shown by considering the closure of all
potential hypotheses. In the example, the closure contains 15 nonempty sub-
sets, but the shortcut illustrated in Figure 7.1 provides an equivalent test.
This method can be expanded to a larger number of endpoints, including
co-primary and co-secondary endpoints.
References
1. Berger, R.L. and Hsu, J.C., Bioequivalence trials, intersection–unions tests and
equivalence confidence sets, Stat. Sci., 11, 283–319, 1996.
2. Berger, R.L., Multiparameter hypothesis testing and acceptance sampling,
Technometrics, 24, 295–300, 1982.
3. Holm, S., A simple sequentially rejective multiple test procedure, Scand. J. Stat.,
6, 65–70, 1979.
4. Dunnett, C., New tables for multiple comparisons with a control, Biometrics, 20,
482–491, 1964.
5. Wiens, B.L., A fixed sequence Bonferroni procedure for testing multiple end-
points, Pharm. Stat., 2, 211–215, 2003.
6. Wiens, B.L. and Dmitrienko, A., The fallback procedure for evaluating a single
family of hypotheses, J. Biopharm. Stat., 15, 929–942, 2005.
7. O’Neill, R.T., Secondary endpoints cannot be validly analyzed if the primary
endpoint does not demonstrate clear statistical significance, Control. Clin. Trials,
18, 550–556, 1997.
8. Hochberg, Y., A sharper Bonferroni procedure for multiple tests of significance,
Biometrika, 75, 800–802, 1988.
9. Marcus, R., Peritz, E., and Gabriel, K.R., On closed testing procedures with
special reference to ordered analysis of variance, Biometrika, 63, 655–660, 1976.
8.1 Introduction
Issues involving missing data are often linked with issues involving the choice
of the proper analysis set. However, it is also true that missing data issues
are often confused with issues involving the proper choice of analysis set.
Consider a randomized, double-blind, two-arm clinical trial in which some
subjects drop out at randomization before undergoing study therapy or any
other therapy, and have no follow-up for the study endpoint. Because these
subjects should be fairly distributed between arms, they may be excluded
from the analysis without compromising the integrity of the randomization.
Whether to include such subjects in the analysis is an analysis set issue. If
these subjects were included in the analysis [i.e., as in an intent-to-treat (ITT)
analysis], the imputation or representation of their values for the endpoint
should not depend on treatment arms (since such subjects should have been
fairly distributed between arms) and should consider the actual adherence
or nonadherence to therapy. A variation of this “imputation under the null”
can be used for non-inferiority trials and will be discussed later.
According to the ITT principle, all subjects should be followed to the end-
point or the end of study with the comparisons based on the “as-randomized”
treatment groups (i.e., based on the ITT population). This allows for an
unbiased analysis. Missing data violate the ITT principle and can under-
mine both the integrity of the randomization and confidence in the results.
Additionally, selective follow-up of subjects can weaken the quality of the
data from the high quality expected from a randomized clinical trial to that
obtained from an observational study.
The purpose of accounting for missing data is not to retrospectively change
the design or objective of the clinical trial or the adherence to therapy or the
protocol of any subject. Rather, the purpose is to account for all subjects with
respect to the ITT principle. For a subject with a missing outcome, the objec-
tive is to adequately represent the missing outcome based on what would
have been the expected outcome had the outcome been measured.
Example 8.1
When there may be a relationship between the missingness and the observed
responses, efforts should focus on investigating this relationship.
Informal visual approaches can often provide a preliminary assessment
of the relationship between the missingness and the observed responses.
A graphical approach to assessing this relationship with longitudinal data
is to graph over time the outcomes for those who have complete data and
those who discontinue at various time points (including only data up to
the point at which the discontinuation occurred, of course). Subsetting the
subjects into a small number of groups (three to five groups often provide
informative results) based on meaningful study benchmarks will allow a
quick visual assessment of potential relationships. Such graphs can provide
assessment of whether the trajectory of response or baseline values differed
between those with complete data and those with incomplete data.
Inferential assessments of the relationship between missingness and the
observed data are also possible. Under the null hypothesis that data are
MCAR, various test statistics can be developed. Simplistically, testing for
differences in early response between subjects with complete data and sub-
jects with incomplete data can provide a test – albeit, perhaps, not an opti-
mal one. Incorporating covariates into this assessment can improve the test.
Assessing the correlation between the presence of outcome measurements and
baseline characteristics and, after accounting for this relationship, between the
presence of outcome measurements and the previous measurement will provide
insight. Logistic regression
can be used for the multivariate investigations. This will give information on
whether subjects with higher or lower outcomes, or a better or worse interme-
diate response, are more likely to have missing data at the subsequent assess-
ment, something allowed under MAR but not under MCAR. Other tests
specifically developed for missing data assessments are available as well.10
The ability of such tests to differentiate the type of missingness is unclear due
to lack of power, but they can be used to demonstrate the lack of MCAR.
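As a sketch of such a multivariate assessment (the data and variable names are hypothetical; assumes pandas and statsmodels):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"baseline": rng.normal(50, 10, n),
                   "prev_response": rng.normal(45, 12, n)})
# Simulate missingness that depends on the previous observed response
# (missing at random, but not completely at random)
logit = -2.5 + 0.05 * (50 - df["prev_response"])
df["missing"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(df[["baseline", "prev_response"]])
fit = sm.Logit(df["missing"], X).fit(disp=0)
print(fit.params)   # a clear prev_response effect argues against MCAR
print(fit.pvalues)
```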
Distinguishing between MAR and MNAR is more difficult. By defini-
tion, MNAR data are missing in part because of the unobserved value—but
without observing the value, it can be difficult to establish this association.
Heuristic arguments can be used, more convincingly if they argue for MNAR
than if they argue against MNAR. Other markers of disease progression or
regression can be used if collected after discontinuation. In the situation that
subjects who discontinue (or otherwise have missing data) tend to experi-
ence events at a higher or lower rate after discontinuation than subjects who
continue, one can conclude MNAR. Subjects who discontinue because of
adverse events might also be considered to have data that are MNAR, espe-
cially if the specific adverse events are related to or affect the outcome mea-
surement (such as when the outcome is quality of life). This can be handled
more easily in the analysis than MNAR in general, but if discontinuation due
to reported adverse events is common, it might also be common that discon-
tinuation occurs because of adverse events that are not reported, which is
harder to handle.
therapy has inferior or equal efficacy to the active control). Even when esti-
mation is unbiased, LOCF may understate the standard error surrounding
the point estimate, leading to test procedures that have an inflated type I
error rate.
Koch11 recommended incorporating imputation under the null hypothesis
for missing data in a non-inferiority trial. Therefore, a penalized imputa-
tion might be considered. When a subject discontinues, subsequent missing
observations can be imputed as the last observation plus or minus a small
amount to penalize the analysis for the missing data. For a non-inferiority
margin of δ > 0 with larger values being more preferred, possible ways of
performing imputation under the non-inferiority null hypothesis include
first performing an imputation under the assumption of no treatment differ-
ence and then either (1) subtract δ from each imputed outcome in the experi-
mental group, or (2) add δ to each imputed outcome in the control group, or
(3) subtract δ/2 from each imputed outcome in the experimental group and
add δ/2 to each imputed outcome in the control group (or some variation
thereof). Per Koch11 for binary data (i.e., “0” for a failure and “1” for a success)
with a fixed margin on the difference in proportions, the imputed outcomes
should be inclusively between 0 and 1. Imputation under the non-inferiority
null hypothesis is particularly important when the evaluation of the active
control effect and the selection of the non-inferiority margin did not consider
the possibility of missing data.
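A sketch of option (1) for a binary endpoint — impute under no treatment difference, then penalize imputed experimental-arm outcomes by δ, truncating to [0, 1] as noted above (names and data illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def impute_penalized(y, p_null, delta, penalize):
    """Impute missing binary outcomes (np.nan) at the pooled rate p_null;
    if penalize, subtract delta from imputed values and clip to [0, 1]."""
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    y[miss] = rng.binomial(1, p_null, miss.sum())
    if penalize:
        y[miss] = np.clip(y[miss] - delta, 0.0, 1.0)
    return y

y_exp = [1, 0, np.nan, 1, np.nan, 1]
y_ctl = [1, 1, 0, np.nan, 1, 0]
observed = [v for v in y_exp + y_ctl if not np.isnan(v)]
p_null = float(np.mean(observed))  # common success rate under the null
print(impute_penalized(y_exp, p_null, delta=0.10, penalize=True))
print(impute_penalized(y_ctl, p_null, delta=0.10, penalize=False))
```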
Incorporating an imputation under the non-inferiority null hypothesis can
help alleviate some problems inherent in disregarding missing data when
establishing the non-inferiority margin. Consider a continuous outcome
for an endpoint of interest where the conditional effect of the active control
among subjects for which the endpoint is measured is 50 and the conditional
effect of the active control is zero among subjects for which the endpoint is
not measured. If the endpoint is not measured in 20% of subjects in both
arms, then the true effect of the active control is 40. Suppose that the evalu-
ation of the active control effect in previous clinical trials treated missing
data as ignorable and M1, the efficacy margin, was set at 50. Suppose the
non-inferiority trial had 20% of subjects on each arm not measured for
the primary endpoint. The true effect of the active control in the setting of
the non-inferiority trial for the ITT population is 40. The efficacy margin
can be adjusted retrospectively to account for the missing data. However,
there may be a preference to pre-specify the margin without later chang-
ing it. Employing an imputation under the non-inferiority null hypothe-
sis (based on a true mean difference of 50) without altering the margin can
achieve the same or similar results as altering the margin and ignoring the
missing data.
Another simple imputation approach, with similar advantages and disad-
vantages to LOCF, is to use the mean value of all observed values (or other
estimate of central tendency) to replace missing values. This might be the
mean across an individual’s observed assessments, or the median across all
subjects with observed values at that assessment. Multiple imputation is
most helpful when the process used to impute data is different from the
process used to analyze data, such as by using different covariates to predict
missingness than are used in the analysis model, and most useful for large
studies. This could happen when information used to explain missingness
occurs after discontinuation, such as death predicting (backwards) study dis-
continuation within a short period before the death. Such information may
not be included in the analysis model, but can easily be incorporated into
the imputation procedure. If multiple imputation is planned, collection of
auxiliary data (such as other measures of efficacy or follow-up after discon-
tinuation, subject to ethical constraints) should be planned. Treating subjects
with missing data the same way as subjects with complete data may bias
the conclusion toward no difference. For a non-inferiority analysis, a penal-
ized imputation approach can also be considered with multiple imputation.
However, little is known about how this approach will affect an analysis in
the presence of non-ignorable missingness.
A trite but true piece of advice is that missing data are best handled by
prevention. Missing data are most easily handled when almost all data are
known. Most sensitivity analyses, even when missing data are handled or
imputed quite differently, will give very similar conclusions if the amount of
missing data is small. This requires planning before the study starts to allow
collection of auxiliary data, motivate study subjects and investigators to com-
ply with the protocol, and perhaps gain approval from ethics committees to
allow collection of data even after a subject “withdraws” from some study
procedures. Efforts should be directed toward obtaining data on all subjects,
even those who discontinue treatment, to best assess treatment effects when
subjects are noncompliant with certain aspects of the study plan.
In summary, there is not one clear method for handling missing data that
are MAR but not MCAR. Each method has promise, but each must be used
with caution. Simple methods that focus on unbiased estimation may inflate
the type I error rate by underestimating the variance. For example, in many
settings, dropouts are likely to have poorer outcomes than subjects who
remain under observation for outcomes. Then, for continuous outcomes,
treating the missing outcome as ignorable or using LOCF (and other single
imputation methods) treats their unknown outcomes as being more central-
valued, and thus will then lead to an underestimated subject-level standard
deviation. More complex methods can address the estimation of the variance
of the estimated treatment difference. Various methods should be consid-
ered for most studies, with one simple method pre-specified as primary and
the other methods pre-specified as sensitivity or exploratory analyses.
Example 8.2
Among subjects who did not complete the study, nearly equal numbers in the two
groups discontinued for administrative reasons such as inconvenient study visits.
Many more subjects in the experimental group discontinued because of adverse
events or lack of efficacy (14.0% vs. 1.3%). It is quite possible that subjects who discontinued because of
adverse events had a fairly large antihypertensive effect, enough to affect the group
mean by 1 mmHg. So, the discontinuation is related to outcome, and implies that
the missing data are not MCAR. In addition, subjects who do not complete the
study do not receive benefit at week 8, so the inclusion of such data in the primary
analysis via LOCF is questionable (for superiority or non-inferiority).
A pattern-mixture model composite approach, such as that suggested by Shih and
Quan,14 does not fully confirm the conclusions of the primary analysis, owing both to the
failure to conclude non-inferiority on the antihypertensive effect and to the increase
in discontinuations for adverse reasons. Therefore, the conclusion of the primary analysis
is called into question by the sensitivity analysis because of the impact of missing data.
Table 8.1
Summary of Trial Results and Reasons for Discontinuation

                                 Active Control    Experimental
Overall study population
  N                                   150               150
  Mean                               –8.0              –8.2
  Standard deviation                 14.2              13.8
  Confidence interval for the difference (active control minus experimental): (–3.0, 3.4)
Completers
  N                                   140               120
  Mean                               –8.0              –7.2
  Standard deviation                 14.2              13.8
  Confidence interval for the difference (active control minus experimental): (–4.2, 2.6)
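The intervals shown are consistent with normal-approximation confidence intervals for the difference in means (active control minus experimental) computed from the summary statistics; a small sketch, assuming a 95% level:

```python
import math

def diff_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Normal-approximation CI for the difference in means (group 1
    minus group 2) computed from summary statistics."""
    d = m1 - m2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return d - z * se, d + z * se

print(diff_ci(-8.0, 14.2, 150, -8.2, 13.8, 150))  # overall: about (-3.0, 3.4)
print(diff_ci(-8.0, 14.2, 140, -7.2, 13.8, 120))  # completers: about (-4.2, 2.6)
```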
The choice of analysis set in non-inferiority trials remains an evolving issue. In this section we will define several analysis sets used in
clinical trials and discuss their advantages and disadvantages.
Depending on the method of analysis, use of the ITT set can provide unbi-
ased estimation and comparisons of the endpoints in the setting of the clini-
cal trial. If the clinical trial setting represents medical practice, subjects in
the trial are randomly selected from subjects in medical practice, and full fol-
low-up is obtained, then use of the ITT set will provide unbiased estimates
and comparisons on the use of the treatments in practice. Limitations to this
include selecting subjects who are not representative of medical practice and
accounting for missing data – especially from subjects who discontinue the
trial but would have continued to be observed by a physician in medical
practice.
Part of the ITT principle is that subjects be followed until an event or end
of study. A subject is said to be lost to follow-up if the individual is not fol-
lowed to the endpoint or to the end of study. Loss to follow-up can rarely be
assumed to be random; that is, subjects lost to follow-up generally differ from
subjects not lost to follow-up, particularly with respect to the distribution
of the primary outcome. Even if the numbers that are lost to follow-up are
similar between arms, the unobserved outcomes may not be similar between
arms. It is much more important to keep loss-to-follow-up rates to a mini-
mum than to have similar numbers that are lost to follow-up. The greater the
amount of loss to follow-up, the greater the potential of substantial bias, even
if rates are similar between groups.
One rationale for using the PP analysis set is that it may more closely fol-
low the scientific hypothesis that a subject with the disease of interest, who
receives a particular treatment, will exhibit improvement compared with a
subject not receiving that treatment. If a subject does not have the disease
under study, or does not receive the treatment, the subject is not part of the
target population for examining the scientific hypothesis.
Another reason for the recommendation to use the PP set for non-inferiority
analyses is that deviations from the protocol in randomization, conduct or
evaluation might make the outcomes for the treatment groups more simi-
lar. In other words, sloppiness in trial conduct or other deviations from the
planned procedures may bias the results toward no difference between the
arms. In an extreme case, all subjects on both treatment arms could discon-
tinue treatment immediately upon randomization, resulting in all subjects
receiving the same treatment. Producing outcomes that are more similar
between the groups has the effect of making a superiority analysis con-
servative, since no difference between treatments is in the null hypothesis.
However, producing outcomes that are more similar between the groups
might have the effect of making a non-inferiority analysis anticonservative,
since no difference between treatments is in the alternative hypothesis.
Consider the following situations to illustrate the relative conservativeness
of the ITT and PP analysis sets:
Subjects are treated with study treatment that is not their randomized study
treatment. This can be caused by several kinds of errors in trial conduct.
Under the belief that the PP analysis set addresses these problems, the PP
analysis set has been commonly used instead of or in addition to the ITT
analysis set. A concern is whether the PP analysis set is always the appropri-
ate way to address these and other issues. Some authors have questioned
the wisdom of using the per protocol concept for analyzing non-inferiority
clinical trials. A discussion of several antibiotic non-inferiority clinical trials
concluded that use of the ITT analysis set does not systematically lead to
smaller estimates of treatment effect in these trials (see Section 2.5 for more
details).19 A hybrid ITT/PP analysis set, which excludes noncompliant subjects
as with the PP set while addressing the impact of missing data due only to
lack of efficacy with an ITT approach (based on maximum likelihood), was
also proposed as a compromise.20 More aggressively campaigning against
the use of the PP analysis sets, Hauck and Anderson21 noted that standards
required the use of the PP analysis set for the null hypothesis Ho: μC − μE > δ
versus Ha: μC − μE ≤ δ for any value of δ > 0, making the null hypothesis of
equality the only point at which the ITT analysis set is favored. Such a dis-
continuity is difficult to justify for only one point, making the assumption
faulty. Wiens and Zhao22 expanded this idea and concluded that the argu-
ments for using the ITT set for superiority analyses apply equally well to
non-inferiority analyses, and therefore the ITT analysis set should be consid-
ered. Furthermore, the PP approach is not the universally best choice for a
sensitivity analysis and therefore should not be a standard adjunct analysis,
much less the standard co-primary analysis.
Other authors have proposed basing the analysis on the treatment actu-
ally received, regardless of whether it was the randomized treatment. These
as-treated analyses, however, sacrifice the protection of randomization and
can introduce selection bias.
Example 8.3
In an antibiotic trial, suppose that identifying the pathogen takes 48 hours but treatment must be commenced immediately for
ethical reasons. If the pathogen is determined to be one that is not susceptible to
the study treatments based on preclinical results, or if no pathogen is found, it is
common to discontinue the subject from the study and ignore the subject in any
efficacy analyses. However, this might not be the best course of action. When the
new treatment is approved for marketing, it will be prescribed based on empiri-
cal symptoms rather than on cultures. Thus, it may be of interest for the subject
to be offered the chance to stay in the study and even continue to receive study
medication until symptoms resolve. It is likely that the informed subject will not
choose to remain on study medication if told that it will likely not impact the
symptoms. The subject who chooses to discontinue study medication and start a
different course of treatment should continue to receive follow-up evaluations in
accordance with the protocol. The next question becomes how to use the subject
in the analysis. For the primary analysis, it may be possible to remove the subject
since the exclusion was in place before randomization—even though it was not
known until after randomization.27 It may be of benefit to report the success rate
in the primary analysis and also among subjects treated empirically, to give the
physician information on success rates under clinical trial situations and under
practical situations.
No choice of analysis set can substitute for poor trial conduct or poor adherence
to the protocol, and none can salvage a poorly conducted clinical trial.
Although both the ITT and PP approaches have been criticized, we recommend
that non-inferiority analyses be performed on both the ITT and the PP
analysis sets. In most instances, the results should be quite similar.
Any notable difference in the results should be investigated and may be
indicative of poor study conduct or other reasons that must be thoroughly
investigated and explained. Similarity in the results of the ITT and PP anal-
yses, while reassuring, does not imply confidence in the results. A poorly
conducted trial likely introduces bias, which can be of a nearly equal size for
both the ITT and PP analyses.
References
1. European Medicines Agency, Guideline on Missing Data in Confirmatory
Clinical Trials, Committee for Medicinal Products for Human Use, 2009, at http://
www.ema.europa.eu/human/ewp/177699endraft.pdf.
2. Code of Federal Regulations 21 CFR 314.126.
3. Carroll, K.J., Analysis of progression-free survival in oncology trials: Some com-
mon statistical issues. Pharm. Stat., 6, 99–113, 2007.
4. Fleming, T.R., Rothmann, M.D., and Lu, H.L., Issues in using progression-free
survival when evaluating oncology products. J. Clin. Oncol., 27, 2874–2880,
2009.
5. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH), E9: Statistical principles for
clinical trials. 1998, at http://www.ich.org/cache/compo/475-272-1.html#E4.
6. Fleming, T.R., Addressing missing data in clinical trials. Ann. Intern. Med., 154,
113–117, 2011.
7. Jackson, J.B. et al., Intrapartum and neonatal single-dose nevirapine compared
with zidovudine for prevention of mother-to-child transmission of HIV-1 in
Kampala, Uganda: 18 months follow-up of the HIVNET 012 randomised trial.
Lancet 362, 859–868, 2003.
8. Little, R.J.A. and Rubin, D.B., Statistical Analysis with Missing Data, John Wiley,
New York, NY, 1987.
9. Little, R.J.A., Regression with missing X’s: a review, J. Am. Stat. Assoc., 87, 1227–
1237, 1992.
10. Little, R.J.A., A test for missing completely at random for multivariate data with
missing values, J. Am. Stat. Assoc., 83, 1198–1202, 1988.
11. Koch, G.G., Comments on ‘current issues in non-inferiority trials’ by Thomas R.
Fleming, Stat. Med., 27, 333–342, 2008.
12. Wiens, B.L., Randomization as a basis for inference in noninferiority trials,
Pharm. Stat., 5, 265–271, 2006.
13. Mallinckrodt, C.H. et al., Recommendations for the primary analysis of continu-
ous endpoints in longitudinal clinical trials, Drug Inf. J., 42, 303–319, 2008.
14. Shih, W.J. and Quan, H., Testing for treatment differences with dropouts present
in clinical trials—A composite approach, Stat. Med., 16, 1225–1239, 1997.
15. Hollis, S., A graphical sensitivity analysis for clinical trials with non-ignorable
missing binary outcome, Stat. Med., 21, 3755–3911, 2002.
16. Hill, A.B., Principles of Medical Statistics, 7th ed., Lancet, London, 1961.
17. Snapinn, S.M., Noninferiority trials. Curr. Control Trials Cardiovasc. Med. 1, 19–21,
2000.
18. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) E-10: Guidance on
choice of control group in clinical trials, 2000, at http://www.ich.org/cache/
compo/475-272-1.html#E4.
19. Brittain, E. and Lin, D., A comparison of intent-to-treat and per-protocol results
in antibiotic non-inferiority trials, Stat. Med., 24, 1–10, 2005.
20. Sanchez, M.M. and Chen, X., Choosing the analysis population in non-inferiority
studies: Per protocol or intent-to-treat, Stat. Med., 25, 1169–1181, 2006.
21. Hauck, W.W. and Anderson, S., Some issues in the design and analysis of equiv-
alence trials, Drug Inf. J., 33, 177–224, 1999.
22. Wiens, B.L. and Zhao, W., The role of intention to treat in analysis of noninferi-
ority studies, Clin. Trials, 4, 286–291, 2007.
23. Stewart, W.H., Basing intention-to-treat on cause and effect criteria, Drug Inf. J.,
38, 361–369, 2004.
24. Robins, J.M., Correction for non-compliance in equivalence trials, Stat. Med., 17,
269–302, 1998.
25. Lee, Y.J., Ellenberg, J.H., Hirtz, D.G., and Nelson, K.B., Analysis of clinical trial
data by treatment actually received: Is it really an option? Stat. Med., 10, 1595–
1605, 1991.
26. Peto, R., Pike, M.C., Armitage, P., Breslow, N.E., Cox, D.R., Howard, S.V., Mantel,
N., McPherson, K., Peto, J., Smith, P.G., Design and analysis of randomized clin-
ical trials requiring prolonged observation of each patient. I. Introduction and
design, Brit. J. Cancer, 34, 585–612, 1976.
27. Gillings, D. and Koch, G., The application of the principle of intention-to-treat
to the analysis of clinical trials, Drug Inf. J., 25, 411–424, 1991.
9.1 Introduction
The statistical framework for ruling out a prespecified increased risk of
an adverse event is similar to that used in determining non-inferior
efficacy. Examples include establishing the safety of a test treatment
compared to placebo or establishing the safety of a test compound com-
pared to an active control, both with the objective of ruling out an important
increase in the rates of adverse events. Less common, but possible, is the com-
parison of a test compound to an active control with inference desired on the
event rate of the test compound compared to a putative placebo. Because the
design and analysis are dependent on the objectives, and the objectives can
vary, it is vital to prespecify and define the study objectives.
There may be uncertainty about the safety of a drug at the time of approval.
Some adverse events are infrequent or are long-term adverse outcomes
that may not be discovered during the clinical trials that led to approval.
Additionally, the risk–benefit profile can change based on changes in sup-
portive care, the nature of the disease, the standards or in the understanding
of the risks. A change to an unfavorable risk-benefit assessment may alter or
remove the indication or intended use. New evidence on safety may be suf-
ficient to provide caution in the use of the product but not sufficient to lead
to an unfavorable or uncertain risk–benefit profile. Some changes in risks
can be addressed through introduced changes in medical practice. Subjects
at greater risk of a particular known adverse event may either not be offered
the drug or may be monitored more closely for the adverse event while
receiving the drug.
The U.S. Food and Drug Administration (FDA) Amendments Act of 2007,1
which expanded the authority of the FDA during postmarketing, provides
situations in which a postapproval study on safety may be required. A post-
approval study on the safety of a drug may be required to assess a known
serious risk, or a signal of a serious risk, or to identify an unexpected serious
risk when data indicate the potential for a serious risk. The source of a safety
signal may be clinical trials, adverse event reports, postapproval studies, peer-
reviewed biomedical literature, postmarket data, or other scientific data.
Evaluating whether data suggest a safety signal that was not prespeci-
fied for the investigation may be associated with substantial error and bias.
Although the efficacy of a drug is based on the intended effects of the exper-
imental agent or regimen, its safety profile usually involves unintended,
harmful effects. If the rate of these unintended, harmful effects is too great,
the risk–benefit profile may be unfavorable. However, unlike efficacy analy-
ses that prespecify the endpoints to be tested and with the overall type I error
rate maintained at a desired level, standard safety analyses usually involve
multiple tests, sometimes on nonprespecified adverse events, without any
multiplicity adjustment. Thus there is an exploratory nature to the standard
safety analyses that are conducted in a clinical trial. Additionally, owing to
the multiple testing, the most impressive differences between arms in an
adverse event will tend to be randomly high, and more likely than not will
have a smaller observed difference in a subsequent identically designed and
conducted clinical trial. Therefore, when a safety signal is evaluated on the
basis of ongoing or previous trials, any meta-analysis used to formally test
whether an unacceptable increased risk can be ruled out should not include
the results from the clinical trial that identified the potential safety risk: that
analysis is conditionally biased and potentially represents a random high.
Retrospective meta-analysis may be used to identify safety signals. If
random-effects meta-analyses are done, the results should be viewed with
care. Increasing the variability and altering the weighting of the studies can
obscure the determination of a safety signal, or of the subgroup in which a
safety signal may be present.
There are three criteria or questions to be considered when assessing the
reliability of an exploratory safety analysis2:
When the comparison is expressed as a difference in the rates of the adverse event, the hypotheses take the form

Ho: pE − pC ≥ δ versus Ha: pE − pC < δ

for an appropriate δ > 0. In other words, the null hypothesis is that the investigational
treatment increases the event rate by at least some difference δ, and the
alternative hypothesis is that the investigational treatment increases the rate
by less than δ (or has no effect, or decreases the event rate). The safety margin
for increased risk or harm may depend on the benefit of the product. The
parallels to non-inferiority testing for efficacy are immediately obvious. A
possible disadvantage of expressing the hypotheses in terms of a risk difference
is a lack of robustness to an incorrect estimate of the rate in the control
group: an increase of 5 percentage points may seem inconsequential when
the placebo event rate is 30%, but not when the placebo event rate is 3%. This
disadvantage can be exacerbated when the patient population changes over
time.
Alternatively, the hypotheses can be expressed in terms of a ratio of event rates (risk ratio):

Ho: pE/pC ≥ δ versus Ha: pE/pC < δ

for an appropriate δ > 1. The relative risk is often perceived as being more consistent
across different patient populations with different event rates than the risk
difference.
difference. However, the risk ratio is not robust to a change in event rates, par-
ticularly for fairly rare events. A 50% increase when the placebo event rate is 1%
(which affects 1 out of every 200 subjects) is quite different from a 50% increase
when the placebo event rate is 10% (which affects 1 out of every 20 subjects).
If time to an undesirable event is of interest, then the hypotheses can be
expressed in terms of a hazard ratio, θ. With the margins of 1.8 and 1.3 considered in
the FDA guidance for antidiabetic therapies,3 the hypotheses are

Ho1: θ ≥ 1.8 versus Ha1: θ < 1.8

and

Ho2: θ ≥ 1.3 versus Ha2: θ < 1.3

where θ is the hazard ratio of the experimental therapy (in the numerator) to
the control therapy (either a placebo or an active comparator, if appropriate).

[FIGURE 9.1: Some possible outcomes for a safety evaluation, displayed on the risk ratio scale.]
If the first null hypothesis, Ho1, is not rejected, the possibility of an important
safety signal cannot be ruled out, and approval is unlikely. If both null
hypotheses are rejected, an important safety signal is unlikely and approval
is possible, given that efficacy and other safety data support such approval.
If only the first null hypothesis (Ho1) is rejected and the second (Ho2) is not,
then further study is required. In this situation, other safety and efficacy
data may allow approval of the product, but the sponsor will be obligated to
conduct further study to rule out an important increase of 30% in cardiovas-
cular risk.
With multiple hypotheses being tested, and possibly multiple attempts at
testing, it is necessary to consider the type I error rate. We consider testing
on the basis of two-sided 95% confidence intervals. When θ ≥ 1.8, the prob-
ability is at most 0.025 that θ ≥ 1.8 is rejected on the basis of the data used
for the consideration of the approval. The probability of concluding that θ <
1.3 is much less than 0.025. In the event that it is falsely concluded that θ <
1.8 but θ < 1.3 is not concluded, it is quite unlikely that a later safety study,
if properly designed and conducted, would conclude θ < 1.3 (the probability
being much less than 0.025).
When 1.3 < θ < 1.8, concluding that θ < 1.8 is not an error and is also not
automatic. Any given test of θ ≥ 1.3 versus θ < 1.3 would maintain the desired
type I error rate or a smaller rate. Because there would likely be two oppor-
tunities (the pre- and post-approval analyses) to conclude that θ < 1.3, the
overall type I error rate would be a little less than 0.05 (when θ is slightly
larger than 1.3) or smaller.
In calculating confidence intervals on risk ratios, the number of events
(particularly for rare events) must be adequate to result in a sufficiently nar-
row confidence interval. Observing few events will result in a confidence
interval that is wide, which may result in not being able to rule out an impor-
tant increase in events even if the observed rates are similar. To obtain an
adequate number of events, the guidance document recommends enroll-
ing patients at increased risk of the event. This serves a second purpose,
which is to study patients who are at higher risk, since some such patients
will inevitably receive the drug once it is approved for marketing. However,
enrolling patients at increased risk of the event can also make the results less
generalizable: enrolling patients with different risk levels than the general
population may provide results that do not easily extrapolate to the general
population, whether those rates are higher or lower.
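To make the margin checks concrete, here is a minimal sketch assuming a log-normal approximation for an estimated hazard ratio, with se(log θ̂) ≈ √(1/dE + 1/dC) for event counts dE and dC; the estimate and event counts are hypothetical:

```python
import math

def hazard_ratio_ci(hr_hat, events_exp, events_ctl, z=1.959964):
    """Approximate two-sided 95% CI for a hazard ratio, using the
    log-normal approximation se(log HR) ~ sqrt(1/d_E + 1/d_C)."""
    se = math.sqrt(1.0 / events_exp + 1.0 / events_ctl)
    lo = math.exp(math.log(hr_hat) - z * se)
    hi = math.exp(math.log(hr_hat) + z * se)
    return lo, hi

lo, hi = hazard_ratio_ci(1.10, 120, 115)   # hypothetical estimate and event counts
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
print("rules out 1.8:", hi < 1.8)          # premarketing criterion
print("rules out 1.3:", hi < 1.3)          # criterion for avoiding further study
```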
We direct the reader to Chapter 13 in this book for determining confi-
dence intervals for non-inferiority testing of time-to-event endpoints and to
Chapter 7 on multiple testing.
References
1. Food and Drug Administration Amendments Act of 2007, at http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=110_cong_public_laws&docid=f:publ085.110.
2. Fleming, T.R., Identifying and addressing safety signals in clinical trials, New
Engl. J. Med., 359, 1400–1402, 2008.
3. Guidance for industry: Diabetes mellitus—evaluating cardiovascular risk in
new antidiabetic therapies to treat type 2 diabetes, United States Food and Drug
Administration, Silver Spring, MD, 2008.
10.1 Introduction
In this chapter, we discuss additional topics that may be involved in the design,
analysis, and interpretation of the results of a non-inferiority trial. Many of these
topics are well developed for superiority trials but less understood for non-
inferiority trials. We discuss issues involving the consistency of non-inferiority
across subgroups in Section 10.2. The relationship between non-inferiority
inferences on a surrogate endpoint and those on the corresponding clinical benefit
endpoint is discussed in Section 10.3. Adaptive designs (mostly involving trial
monitoring) and group sequential trials are discussed in Section 10.4. Section
10.5 provides a brief discussion on equivalence comparisons.
The effects of therapies (e.g., the active control and experimental therapies
in a non-inferiority trial) may vary across meaningful subgroups. A non-
inferiority inference involves concluding that the effectiveness of the experi-
mental therapy is both superior to placebo and not unacceptably worse than
the active control. To formally make such an inference, or to check for con-
sistency in those inferences, across subgroups would require an understand-
ing of the effect of the active control relative to placebo in the investigated
subgroups along with the estimated difference in effects between the active
control and experimental therapies from the non-inferiority trial(s). There
are various scenarios in which the effects (relative to placebo) of the active
control therapy and/or the experimental therapy, as well as the differences
in their effects, may vary across subgroups. Different subgroups may also
have different “non-inferiority margins” because of varying effects of the
active control.
A surrogate endpoint is an endpoint used as a substitute for a clinical
benefit endpoint. The objective of using a surrogate endpoint is that specific
inferences on the surrogate endpoint imply specific inferences on the clinical
benefit endpoint. It is therefore important that treatment effects on the surro-
gate endpoint are related to treatment effects on the clinical benefit endpoint.
In a superiority trial with a rather good surrogate endpoint that represents the
sole pathway toward clinical benefit, superiority on the surrogate endpoint
would imply superiority on the clinical benefit endpoint. For a non-inferiority
trial, such an implication additionally depends on the relationship between
effect sizes on the surrogate endpoint and effect sizes on the clinical benefit
endpoint.
[FIGURE 10.1: Exploratory plot to informally look for qualitative or quantitative interaction. Subgroup estimates with per-arm sample sizes are displayed for gender, etiology (1–3), and geographic region (Europe, Asia, Australia, Japan) against the non-inferiority margin δ.]
For estimated treatment differences D1, …, DI across I subgroups, the test statistic is

W = \sum_{i=1}^{I} \frac{(D_i - \bar{D})^2}{\sigma_i^2}

where σi is the standard deviation within the ith stratum and \bar{D} is the mean of the Di values weighted by the inverses of the subgroup variances (standard errors squared):

\bar{D} = \left(\sum_{i=1}^{I} \frac{D_i}{\sigma_i^2}\right)\left(\sum_{i=1}^{I} \frac{1}{\sigma_i^2}\right)^{-1}
Under the null hypothesis, W has a central χ2 distribution with I – 1 degrees
of freedom.2 Rejection of Ho implies that an understanding of the interaction
is required to fully interpret the results of the clinical trial. However,
failure to reject Ho does not imply anything about the magnitude or mean-
ingfulness of the differences in effects across subgroups. Tests for interac-
tion effects tend to have little power at meaningful alternatives, and thus
the interaction effects may be rather large even if Ho is not rejected. These
limitations will also be true for tests of a qualitative interaction when com-
paring the experimental and active control therapies. The test procedure for
an interaction is basically the same as the test procedure for heterogeneous
effects across studies provided in Section 4.3.1.
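A minimal sketch of this heterogeneity test, assuming subgroup estimates and standard errors are available; the inputs below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def interaction_test(d, se):
    """Heterogeneity test of treatment differences across I subgroups:
    W = sum_i (D_i - Dbar)^2 / sigma_i^2, with Dbar the inverse-variance
    weighted mean; W ~ chi-square with I - 1 df under homogeneity."""
    d = np.asarray(d, float)
    se = np.asarray(se, float)
    w = 1.0 / se**2
    d_bar = np.sum(w * d) / np.sum(w)
    W = np.sum((d - d_bar) ** 2 / se**2)
    p = chi2.sf(W, df=len(d) - 1)
    return W, p

# hypothetical subgroup differences and their standard errors
print(interaction_test([0.4, 1.2, -0.1], [0.50, 0.60, 0.55]))
```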
Wiens and Heyse proposed several likelihood ratio–type tests for a qualitative
interaction, where a common non-inferiority margin across a partition
of subgroups is used, on the basis of quantities QLR, QLR−, and QLR+ defined as
follows:

Q_{LR}^{+} = \sum_{i=1}^{I} \frac{(D_i - \delta)^2}{\sigma_i^2}\, I(D_i > \delta)

Q_{LR}^{-} = \sum_{i=1}^{I} \frac{(D_i - \delta)^2}{\sigma_i^2}\, I(D_i < \delta)

Q_{LR} = \min\left(Q_{LR}^{+},\, Q_{LR}^{-}\right)
In these calculations, I(•) is an indicator function that equals 1 if the argument
is true and 0 otherwise.1 The test statistic QLR+ tests the null hypothesis
that non-inferiority exists in all strata versus the alternative hypothesis that
non-inferiority does not exist in at least one stratum. Such hypotheses can
be written as

HoQ+: δi ≤ δ for all i versus HaQ+: δi > δ for at least one i

where δi denotes the true treatment difference in the ith stratum. The test statistic QLR tests the null hypothesis that either non-inferiority
exists in all strata or the experimental treatment is markedly inferior to the
active control in all strata; the alternative is that this is true in some strata but
not in others. Such hypotheses can be written as

HoQ: δi ≤ δ for all i or δi ≥ δ for all i versus
HaQ: δi > δ for at least one i and δi < δ for at least one i
Thus, both QLR and QLR+ are testing for the existence of qualitative interaction.
For non-inferiority, the test statistic QLR does not seem to make much
sense because the additional area in the null region is an area in which it
will not be possible to conclude non-inferiority, but in equivalence analyses
QLR might be appropriate. Critical values for both tests are given in Table 1
of Gail and Simon,2 who discussed these tests, without the “–δ” in the formulae
for QLR+ and QLR−, for superiority trials.
When the point estimate of treatment effect in every subgroup has the
same directional relationship to the non-inferiority margin, either QLR+ or QLR−
will be zero (and therefore QLR will be zero). Using the data from Figure 10.1,
the test statistic QLR+ will be zero when testing interaction between treatment
and gender and between treatment and etiology, but neither QLR+ nor QLR− will
be zero when testing interaction between treatment and geographic region.
Therefore, QLR will be zero in the interaction test for treatment and gender
and treatment and etiology, but not for the interaction test of treatment and
geographic region.
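A minimal sketch computing QLR+, QLR−, and QLR from subgroup estimates, standard errors, and a common margin; the resulting statistics would be compared with the critical values in Table 1 of Gail and Simon, and all inputs here are hypothetical:

```python
import numpy as np

def q_lr(d, se, delta):
    """Wiens-Heyse likelihood-ratio-type quantities for qualitative
    interaction relative to a common non-inferiority margin delta."""
    d = np.asarray(d, float)
    se = np.asarray(se, float)
    z2 = (d - delta) ** 2 / se**2
    q_plus = np.sum(z2[d > delta])    # strata observed beyond the margin
    q_minus = np.sum(z2[d < delta])   # strata observed within the margin
    return q_plus, q_minus, min(q_plus, q_minus)

# hypothetical subgroup differences, standard errors, and margin
print(q_lr([0.2, -0.5, 1.6], [0.5, 0.6, 0.7], delta=1.0))
```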
Note that in both cases, the tests start with the null hypothesis of no quali-
tative interaction and conclude qualitative interaction only if there is strong
evidence that it exists. In both cases, observed qualitative interaction (i.e., at
least one stratum with Di > δ and at least one stratum with Di < δ) is neces-
sary to reject the null hypothesis and conclude the existence of qualitative
interaction. An alternative test, based on the “min test,” assumes the exis-
tence of qualitative interaction unless there is strong evidence to the con-
trary.1 However, a test based on the min test will require a conclusion of
non-inferiority in each stratum.3 Hence, this test will have little power
for the typical non-inferiority trial (and thus great uncertainty) and is not
recommended.
An alternative to the likelihood ratio test is the standardized range test.
For an appropriate critical value C, the standardized range test considers
the hypotheses in Expression 10.1—analogous to QLR—with the test statistics
max_i((D_i − δ)/σ_i) and min_i((D_i − δ)/σ_i). The test can be written as
Q_SR = min[max_i((D_i − δ)/σ_i), −min_i((D_i − δ)/σ_i)]; HoQ is rejected if Q_SR > C. Alternatively, the range
test considers the hypotheses in Expression 10.2—analogous to QLR+—with
the test statistic min_i(−(D_i − δ)/σ_i); HoQ+ is rejected if this statistic is less than C′. Critical
values for C and C′ are given in Table 1 of Piantadosi and Gail.4 Furthermore,
the range test is more powerful when the effect is reversed in very few
subgroups, whereas the likelihood ratio test is more powerful when a few
subgroups have an effect in one direction and a few in the other. For non-
inferiority purposes, it is unlikely to achieve a conclusion of non-inferiority
if there are many subgroups for which the true effect is against a conclu-
sion of non-inferiority, which argues for use of the standardized range test.
However, this test has not been studied extensively in the non-inferiority lit-
erature and therefore should be approached with caution. In addition, with
few strata, the difference in the performance of the tests is minor, so the
likelihood ratio test should perform well.4
Similar criteria can be established for other types of endpoints (e.g., continu-
ous or binary endpoints). The second criterion requires that the effect of a
treatment on the clinical benefit endpoint is completely mediated through
the surrogate endpoint. Various researchers recommend verifying the sec-
ond criterion through meta-analysis of relevant trials studying the (poten-
tial) surrogate and clinical benefit endpoints.9,13,15
For regular approval, the type I error rate for drawing conclusions on the
clinical benefit endpoint from testing on the surrogate endpoint should be
maintained at the desired level that would be used in a test directly on the
clinical benefit endpoint. In a superiority trial, if testing at a one-sided type
I error rate of 0.025 on the clinical benefit endpoint is desired, then the sur-
rogate endpoint must be such that if the experimental therapy has zero effect
on the clinical benefit endpoint, the probability is 0.025 of demonstrating
superiority on the surrogate endpoint.
For an active-controlled trial, if the aim of the trial is to demonstrate any
efficacy, the non-inferiority margin for the surrogate endpoint should assure
that when the experimental therapy has zero effect on the clinical benefit
endpoint, the probability that the experimental arm demonstrates non-
inferiority to the control arm on the surrogate endpoint is at most 0.025 or
whatever level is prespecified. In this setting, it is concluded that the experi-
mental therapy has a positive effect on the clinical benefit endpoint when-
ever it is concluded that the experimental therapy has a noninferior effect
on the surrogate endpoint. When the surrogate endpoint is acceptable for
regular approval, it is sufficient to choose a non-inferiority margin that addi-
tionally guarantees that the experimental therapy has an effect on the sur-
rogate endpoint (i.e., the non-inferiority margin is less than or equal to the
effect on the surrogate endpoint that the active control can be assumed to
have in the non-inferiority trial). When the aim of the trial is to demonstrate
adequate efficacy (e.g., the experimental therapy retains at least some mini-
mal amount or fraction of the active control effect), more precise information
is needed on the relationship between effect sizes on the surrogate endpoint
and effect sizes on the clinical benefit endpoint. Such precise information
on the relationship of the effect sizes may not be known. Uncertainty on the
precise relationship may lead to a more conservatively selected margin for
the surrogate endpoint or invalidate the use of the surrogate endpoint in a
non-inferiority trial setting.
For fixed non-inferiority margins, the requirements for the surrogate and
clinical benefit endpoints have an appearance similar to the mathematical
requirement for a function to be uniformly continuous. Consider the case
of using means, where μE,S and μC,S are the true means for the surrogate end-
point and μE,CB and μC,CB are the true means for the clinical benefit endpoint
for the experimental and active control arms, respectively. The use of the sur-
rogate endpoint for regular approval with an associated type I error rate or
significance level of 2.5%, where the non-inferiority margin on the surrogate
endpoint is δ > 0 and the non-inferiority margin on the clinical benefit end-
point is ε > 0, requires 97.5% certainty that μ C,S – μE,S < δ to imply 97.5% cer-
tainty that μ C,CB – μE,CB < ε. The value for ε would represent either the entire
effect of the control therapy (vs. placebo) on the clinical benefit endpoint or
the amount of the effect that a therapy can be worse than the active control
therapy but still have adequate efficacy. It is unlikely that 97.5% certainty that
μC,S – μE,S < δ will be equivalent to 97.5% certainty that μC,CB – μE,CB < ε. A surrogate
endpoint would still be useful and conservative when 97.5% certainty
that μC,S – μE,S < δ was equivalent to a greater than 97.5% certainty that μC,CB –
μE,CB < ε. Example 10.1 describes the approval of peg-filgrastim based on non-
inferiority comparisons from two clinical trials on a surrogate endpoint.
Example 10.1
The registration clinical trials comparing peg-filgrastim with filgrastim are exam-
ples of non-inferiority trials on the surrogate endpoint of the duration of severe
neutropenia, which led to the regular approval of peg-filgrastim. Filgrastim was
approved on the basis of a demonstrated improvement (reduction) in the clinical
benefit endpoint of the incidence of febrile neutropenia.16 Filgrastim also dem-
onstrated an improvement in the duration of severe neutropenia during the first
cycle of chemotherapy.16 The duration of severe neutropenia in the first cycle is
correlated with the chance of getting febrile neutropenia. It is also biologically
plausible that reducing the duration of severe neutropenia decreases the likeli-
hood of experiencing febrile neutropenia.
In each of the two non-inferiority registration trials comparing peg-filgrastim
with filgrastim, the non-inferiority margin for the mean duration of severe neu-
tropenia during the first cycle of chemotherapy was 1 day.17 Study 1, which ran-
domized 157 subjects, used fixed-dose peg-filgrastim, whereas study 2, which
randomized 310 subjects, used a weight-adjusted dose of peg-filgrastim. The
95% CIs for the difference in the mean duration of severe neutropenia between
peg-filgrastim and filgrastim were (–0.2 to 0.6) and (–0.2 to 0.4) for studies 1 and
2, respectively. Both studies succeeded in demonstrating that peg-filgrastim was
noninferior to filgrastim in the mean duration of severe neutropenia during the
first cycle of chemotherapy, leading to the conclusion that peg-filgrastim is effec-
tive (relative to placebo) in reducing the incidence of febrile neutropenia.
final analysis. Proponents of adaptive designs believe that they can be more
efficient than standard designs in either reducing the expected trial size for a
given power or in increasing the study power for a given trial size.
Adaptive designs should not be a means to alleviate the burden of rigorous
planning. Changes in the design of ongoing trials are not recommended.18
When substantial changes are made to the design of the trial, the primary
analysis may need to stratify by whether subjects were randomized before or
after the change.18 There may not be a way to correct an analysis for adapta-
tions that affect subjects already in the trial.
When an adaptation involves external information, that external informa-
tion tends to be available to the study subjects. However, this is not true
when adaptations are made based on internal information. As such, there
may be ethical concerns when adaptations are made based on internal data.
If a sponsor deems the results important enough to make design modifi-
cations during the trial, then information learned from the study should be
important for subjects to learn.19 However, subjects and investigators may
prejudge the results if provided information on the relative treatment effect.
Properties of adaptive designs are not well understood for many poten-
tial adaptations in non-inferiority clinical trials. Properties of non-inferiority
group sequential trials have been studied in greater detail, so our discussion
focuses on such designs. In these designs, the study can be terminated early
at prespecified time points on the basis of accumulating evidence of efficacy
or of lack of efficacy. For reasons that will be discussed later, such designs
are not as common in non-inferiority trials, but can easily be implemented
if desired in a particular situation. Other adaptations can also be considered
for non-inferiority designs, but much less experience is available with which
to evaluate them. Adding or dropping a treatment arm is applicable when
multiple treatment arms are tested (notably in dose-ranging trials that com-
pare several doses of a single drug), but such designs are not commonly ana-
lyzed as non-inferiority trials. Changing the primary endpoint or primary
test statistic is a difficult proposition for superiority trials, and not much is
known about the effects on non-inferiority trials. Changing the sample size
is possible in a non-inferiority trial, usually as a result of insufficient infor-
mation being available before the start of the study to appropriately power
the study.
For the rest of this section, we will focus on group sequential methods and
the use of sample size reestimation based on interim results. For other issues
involving adaptive designs, see the U.S. Food and Drug Administration
(FDA) draft guidance.20
One approach avoids estimating treatment effects at the interim analysis, using only blinded estimates of the variability
from the internal pilot study to reestimate sample size.29 For an interim analysis,
consider calculating the one-sample standard deviation at a prespecified
interim analysis (i.e., the internal pilot study) according to the usual method:

\hat{\sigma} = \sqrt{\frac{1}{n_1 - 1}\sum_{i}(X_i - \bar{X})^2}

This estimate is then used with a priori estimates of the treatment difference to recompute the sample size.
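A minimal sketch of the resulting blinded reestimation for a two-arm continuous endpoint, assuming the usual normal-approximation sample size formula; the blinded SD and the assumed difference are hypothetical planning inputs:

```python
import math
from scipy.stats import norm

def blinded_ssr(pooled_sd, assumed_diff, alpha=0.025, power=0.9):
    """Per-arm sample size for a two-arm comparison, recomputed with the
    blinded (one-sample) SD from the internal pilot and the a priori
    assumed difference: n = 2 * sd^2 * (z_{1-alpha} + z_{1-beta})^2 / diff^2."""
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    n = 2 * (pooled_sd * (z_a + z_b) / assumed_diff) ** 2
    return math.ceil(n)

# hypothetical: blinded SD of 14.0 from the pilot, assumed difference of 5
print(blinded_ssr(14.0, 5.0))
```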
Estimated treatment effects from the first stage of the trial may not represent the comparisons between
arms on the same endpoint for the second stage of the trial. For a time-to-event
endpoint, the duration of follow-up is different for the two analyses.
Short-term effects or relative effects may not translate to long-term effects.
The effects of therapy may be greater for subjects with better prognosis than
those with worse prognosis. The patients with worse prognosis will have a
lopsided contribution to the total number of events at the interim analysis
and thus may yield an estimated effect that would tend to be smaller than
that obtained at the final analysis. Additionally, sample size or event size
reestimation allows for the possibility of back calculating, gaining an idea of
the estimated effect at the interim analysis. Such knowledge could be used
in a manner that reduces the integrity of the trial.
Alternatively, the trial can be sized for a minimally meaningful effect size
and a group sequential testing procedure can be implemented. This is gen-
erally more efficient and provides a more natural relative weighting of data
generated before and after any interim look.
Wang et al.32 compared the power, type I error rate, and sample size for two
group sequential approaches for testing non-inferiority and superiority with a
two-stage adaptive approach. In the two-stage adaptive approach, if the trial
continues after stage 1, the specific primary objective for the end of trial is cho-
sen between non-inferiority and superiority based on the results in stage 1.
Thresholds for decision making in such two-stage adaptive approaches were
also considered by Shih, Quan, and Li33 and Koyama, Sampson, and Gleser.34
The validity of the TOST approach has been documented by Berger,37 and this
approach is general enough to apply to continuous, discrete, or time-to-event
data. The testing approach is equivalent to comparing an appropriate-level
confidence interval for μE/μC with the interval (δ1, δ2). If the confidence interval
lies entirely within (δ1, δ2), then “equivalence” is concluded. Otherwise,
equivalence is not shown. For example, performing the standard sets of tests
of the respective sets of hypotheses in Expression 10.4, each at a significance
level of α/2, is equivalent to comparing a 100(1 – α)% two-sided confidence
interval for μE/μC with the interval (δ1, δ2). As the two tests are simultaneously
performed at a significance level of α/2 and both null hypotheses need to
be rejected to conclude equivalence, the type I error rate is maintained at a
level of α/2 or less.
This TOST approach is recommended in the International Conference on
Harmonization of Technical Requirements for Registration of Pharmaceutic
als for Human Use E9,27 which states that “Operationally, this (equivalence
test) is equivalent to the method of using two simultaneous one-sided tests
to test the (composite) null hypothesis that the treatment difference is outside
the equivalence margins.”
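A minimal sketch of the TOST logic for a difference in means, assuming normally distributed test statistics and hypothetical margins; each one-sided test is performed at level α/2, matching the 100(1 − α)% confidence interval formulation above:

```python
from scipy.stats import norm

def tost(diff_hat, se, lower, upper, alpha=0.05):
    """Two one-sided tests for equivalence of a difference: reject
    Ho1: diff <= lower and Ho2: diff >= upper, each at level alpha/2;
    equivalent to the 100(1 - alpha)% CI lying within (lower, upper)."""
    p_lower = norm.sf((diff_hat - lower) / se)  # test against lower margin
    p_upper = norm.sf((upper - diff_hat) / se)  # test against upper margin
    return max(p_lower, p_upper) < alpha / 2

# hypothetical estimate, standard error, and margins
print(tost(0.1, 0.2, -0.5, 0.5))
```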
For discrete data with k possible outcomes, the parameter of interest is
\sum_{i=1}^{k} \min\{p_{E,i}, p_{C,i}\}, where pE,i and pC,i denote the probabilities
of the ith outcome for the experimental and control arms. It is easy to see that
this parameter does not retain any information on any ordered relationship
among the observations—that is, the possible outcomes are treated as nominal,
unordered categories.
For continuous data, Rom and Hwang38 defined the PSR to be the over-
lap under the density curves between the two treatments. It measures the
degree of overlap (similarity) of the two distributions. More formally, the
PSR is given by
PSR = \int_{-\infty}^{\infty} \min\{f_E(x), f_C(x)\}\, dx
where f E and fC are the underlying density functions for the experimental
and control arms, respectively. A PSR close to 1 indicates similar distribu-
tions of outcomes between the two arms. In practice, when the PSR is far
from 1, the means, medians, and/or variances of the two distributions will
be quite different.
We will consider the special case of normal distributions having equal
standard deviations. Let μE and μ C denote the underlying means for the
experimental and control arms, and let σ denote the common standard devi-
ation. Then the PSR can be expressed as a decreasing function of the absolute
standardized difference in the means (|DS|). That is,
PSR(DS) = 2Φ(–|DS|/2)
where DS = (μE – μC)/σ and Φ is the distribution function for a standard normal
distribution. Inferences on PSR then reduce to inferences on |μE – μC|/σ,
the absolute number of standard deviations separating the means. When σ
is known, the inference reduces to an inference on |μE – μC| with hypoth-
eses tested like those in Expressions 10.3 and 10.4. The analysis can then
be based on a confidence interval for μE – μ C or a TOST on the difference in
means. When σ 2 is unknown, the t statistic can be used to make inference
since its noncentrality parameter is a monotone function of the standardized
difference in means.38 Rom and Hwang38 have also derived the PSR as a
function of the means and standard deviations of two normal distributions,
allowing the standard deviations to be different and unknown. They showed
in this general normal case that the PSR measure provides a better tool for
comparing treatments than the standard t test, which only focuses on a dif-
ference in means and not a difference in the standard deviations.
In terms of equivalence margin for PSR, there is no universally agreed-
upon value. Rom and Hwang38 suggested that a PSR of at least 0.7 (70% over-
lap) could be used to judge whether two treatments are equivalent. Values
of 0.8 or 0.9 have also been suggested. As with other designs, if one is inter-
ested in using the PSR to analyze equivalence trials, it is important to pre-
specify the equivalence margin and discuss its properties before the start of
the study.
For two normal distributions with a common standard deviation, Table
10.1 gives the corresponding PSR for different values of |DS|. To further
interpret these values of |DS| and PSR(DS), the probability that a random
observation from the smaller distribution is greater than a random observa-
tion from the larger distribution and the percentile of the value of the smaller
mean in the larger distribution are also provided in Table 10.1. When the two
means differ by half a standard deviation (i.e., |DS| = 0.5), the PSR is 0.80,
and the probability that a random observation from the smaller distribution
is greater than a random observation from the larger distribution is 0.36.
Also, since a value of a half standard deviation below the mean is the 31st
percentile of a normal distribution, the smaller mean is the 31st percentile of
the larger normal distribution.
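The entries of Table 10.1 follow from standard normal calculations; a small sketch that reproduces them, using P(X1 > X2) = Φ(−|DS|/√2) for independent normals with a common standard deviation:

```python
from math import sqrt
from scipy.stats import norm

def psr_normal(ds):
    """PSR for two normals with common SD: 2 * Phi(-|ds|/2)."""
    return 2 * norm.cdf(-abs(ds) / 2)

print("  |DS|   PSR  P(greater)  percentile")
for ds in [0, 0.25, 0.5, 0.75, 1]:
    psr = psr_normal(ds)
    p_greater = norm.cdf(-abs(ds) / sqrt(2))  # smaller-mean obs exceeds larger-mean obs
    pct = 100 * norm.cdf(-abs(ds))            # percentile of smaller mean in larger dist.
    print(f"{ds:6} {psr:5.2f} {p_greater:10.2f} {pct:10.0f}")
```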
A nonparametric estimate of PSR using kernel density estimates was
proposed by Heyse and Stine.39 This nonparametric estimate avoids strong
assumptions on the shape of the populations, such as normality or equal vari-
ance. Through empirical studies, they showed that nonparametric estimates
of PSR are accurate for a variety of normal and nonnormal distributions. The
sampling variance from the kernel-based estimate of PSR is only slightly
larger than that of the normal maximum likelihood estimated variance for
TABLE 10.1
Comparative Characteristics of Two Normal Distributions Having a Common
Standard Deviation

Number of Standard     Proportion of   Probability that Smaller     Percentile of Smaller
Deviations Difference  Similar         Distribution Has Greater     Mean in the Larger
in Means               Responses       Random Value                 Distribution
0                      1               0.50                         50
0.25                   0.90            0.43                         40
0.5                    0.80            0.36                         31
0.75                   0.71            0.30                         23
1                      0.62            0.24                         16
normal data, and the kernel-based estimate may have less bias in analyzing
nonnormal data.
In a pure nonparametric setting where no assumptions are made about the
underlying distributions, the amount of overlap in the densities also treats
the data as having a nominal scale as in the discrete case. A relationship to
order in the outcomes is introduced when the densities are expressed involv-
ing a parameter for which order makes sense (e.g., the mean), as in the afore-
mentioned normal case.
Kolmogorov–Smirnov Approach. When order makes sense (i.e., the data have
an ordinal, interval, or a ratio scale), a Kolmogorov–Smirnov type of statistic
is one of the possibilities for an equivalence comparison. For distribution
functions FE and FC of the experimental and control arms, respectively, the
hypotheses can be expressed as

Ho: \max_x |F_E(x) - F_C(x)| \ge \delta versus Ha: \max_x |F_E(x) - F_C(x)| < \delta

for some margin δ > 0. The Kolmogorov–Smirnov statistic is given by \max_{-\infty<x<\infty} |\hat{F}_E(x) - \hat{F}_C(x)|, where
\hat{F}_E and \hat{F}_C are the corresponding estimated distribution functions. As equality
of the distributions is in the alternative hypothesis, the common scaled version
of the Kolmogorov–Smirnov test statistic would not apply for equivalence
testing. Bootstrapping or simulations may be useful in studying the behavior
of the Kolmogorov–Smirnov statistic in equivalence testing. Alternatively, the
hypotheses in the above expression could be tested on the basis of simulta
neous confidence bounds for FE(x) – FC(x). Some rank-based tests of equiva-
lence are provided by Wellek.35
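A minimal sketch of the unscaled Kolmogorov–Smirnov distance with a simple bootstrap, as suggested above; the margin against which the upper quantile would be compared, and the simulated data, are hypothetical:

```python
import numpy as np

def ks_distance(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    fe = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    fc = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(fe - fc))

rng = np.random.default_rng(2)
x, y = rng.normal(0, 1, 200), rng.normal(0.1, 1, 200)
boot = [ks_distance(rng.choice(x, 200), rng.choice(y, 200))
        for _ in range(1000)]
print(ks_distance(x, y), np.quantile(boot, 0.95))  # compare to margin delta
```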
(1) The pooled estimate of the within-lot variance is often used when it
may not be appropriate; its use assumes that the within-lot variances
are equal, when it may be important to additionally demonstrate
that the within-lot variances are similar in order to reliably conclude
that the production process will consistently produce lots that have
similar biological effects.
(2) Normality is assumed for the distribution for the mean and this may
not be an appropriate assumption in many cases.
(3) The type I error rate may be much less than 0.05 and is dependent
on the number of lots compared; the larger the number of lots com-
pared, the smaller the type I error rate.
A min test over all pairwise comparisons of the lot means can be based on

Z_{\min} = \min_{1 \le i < j \le k} \frac{\delta - |\bar{X}_i - \bar{X}_j|}{\sqrt{\sigma_i^2/n_i + \sigma_j^2/n_j}}
If Zmin > Zα*, then equivalence is concluded, where the critical value is calcu-
lated from the distribution of the range statistic for the means.46 When the
standard errors are equal (e.g., equal lot sizes and the lots are assumed to
have the same within-lot variance), the min test is equivalent to the range
test of Giani and Finner.44
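A minimal sketch of the Zmin computation over all pairwise lot comparisons; the lot summaries and margin are hypothetical, and the critical value Zα* would come from the distribution of the range statistic as described above:

```python
import math
from itertools import combinations

def z_min(means, sds, ns, delta):
    """Min-test statistic over all pairwise comparisons of k lot means:
    the smallest standardized (margin minus |difference|) across pairs."""
    stats = []
    for i, j in combinations(range(len(means)), 2):
        se = math.sqrt(sds[i] ** 2 / ns[i] + sds[j] ** 2 / ns[j])
        stats.append((delta - abs(means[i] - means[j])) / se)
    return min(stats)

# hypothetical lot means (log titers), SDs, lot sizes, and margin
print(z_min([2.1, 2.0, 2.2], [0.60, 0.50, 0.55], [100, 100, 100], 0.4))
```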
When the within-lot variances or the lot sizes are not equal, the min test can
be quite conservative (i.e., the min test has a type I error rate much smaller
than the desired significance level). Wiens and Iglewicz46 suggested using
an adjusted critical value, Zα**. The smallest within-lot standard error for
the mean across all lots is used as the common within-lot standard error in
determining the value for Zα** . Wiens and Iglewicz46 showed that the result-
ing test is still conservative, but since Zα** ≤ Zα*, the test is both less conserva-
tive and more powerful than the original min test.
Ng47 proposed hypotheses and an equivalence test on the basis of the
between-lot variability of the means for common lot sizes and common
within-lot variances. Here, the null hypothesis is
H_o: \left(\sum_{i=1}^{k} (\mu_i - \bar{\mu})^2\right)^{1/2} \ge \delta
for some margin on the variability, δ. The test statistic is the standard F sta-
tistic for testing for any between-lot variability. On the boundary of the null
hypothesis, the test statistic has a noncentral F distribution with k – 1 and
k(n – 1) degrees of freedom and noncentrality parameter nδ²/σ², where n is
the common lot size and σ² is the common within-lot variance. The critical
value depends on the value of σ². When σ² must be estimated, Ng provides
an iterative method for finding the critical value. The test procedure assumes
that the data are normally distributed.
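A minimal sketch of the critical value calculation when σ² is treated as known, using the noncentral F distribution described above; the inputs are hypothetical, and Ng's iterative method would be needed when σ² is estimated:

```python
from scipy.stats import ncf

def ng_critical_value(k, n, delta, sigma2, alpha=0.05):
    """Critical value for Ng's equivalence test with known sigma2:
    reject Ho (between-lot variability >= delta) when the usual
    between-lot F statistic falls below this alpha-quantile of the
    noncentral F with k-1 and k(n-1) df and ncp n*delta^2/sigma2."""
    return ncf.ppf(alpha, k - 1, k * (n - 1), n * delta**2 / sigma2)

print(ng_critical_value(k=3, n=100, delta=0.4, sigma2=1.0))
```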
When the means and variances of the distributions exist and differences
in values make sense, the expected difference of the average squared dif-
ference between random observations from any two distributions relative
to the same expected difference when the random observations are drawn
from the same distribution is an appropriate measure of the amount of dif-
ference between the two distributions. For many distributions, this would
mean taking a random observation from each distribution and measuring
the variability in their values. Let X1, …, Xk be independent but not identically
distributed random variables with \bar{X} = \sum_{i=1}^{k} X_i/k. Then, if we randomly
select two of the k distributions and randomly draw an observation from
each selected distribution,

E\left[\sum_{i<j}(X_i - X_j)^2 \Big/ \binom{k}{2}\right] = E\left[\frac{2}{k}\sum_{i=1}^{k}(X_i - \bar{X})^2\right]

represents the expected distance squared between the two observations. Let
the respective means be denoted by μ1, …, μk and the respective variances
by \sigma_1^2, \ldots, \sigma_k^2. Then the above expected value equals
\frac{2}{k}\sum_{i=1}^{k}\sigma_i^2 + \frac{1}{k}\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2 \quad (10.6)
For each i = 1, …, k, let W1i and W2i be independent and identically distributed
with variance \sigma_i^2. Then, if we select one of the k distributions at random,

E\left[\sum_{i=1}^{k}(W_{1i} - W_{2i})^2 \Big/ k\right] = \frac{2}{k}\sum_{i=1}^{k}\sigma_i^2 \quad (10.7)
represents the expected square distance between two random observations
taken from that randomly selected distribution. The difference between the two
expectations in Expressions 10.6 and 10.7 equals

\frac{1}{k}\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2
A related measure of divergence among the k distributions, with densities f1, …, fk, is

\gamma = \frac{1}{k}\int_{-\infty}^{\infty}\max_{1 \le j \le k} f_j(x)\, dx
There have been different approaches as to the meaning of lot consistency. For any
given approach, the equivalence margins will vary from case to case, and
may be dependent on the indication and the efficacy of the product.
References
1. Wiens, B.L. and Heyse, J.F., Testing for interaction in studies of non-inferiority,
J. Biopharm. Stat., 13, 103–115, 2003.
2. Gail, M. and Simon, R., Testing for qualitative interactions between treatment
effects and patient subsets, Biometrics, 41, 361–376, 1985.
3. Laska, E.M. and Meisner, M.J., Testing whether an identified treatment is best,
Biometrics, 45, 1139–1151, 1989.
4. Piantadosi, S. and Gail, M.H., A comparison of the power of two tests for quali-
tative interactions, Stat. Med., 12, 1239–1248, 1993.
5. U.S. Code of Federal Regulations, Title 21, Sec. 314.500-560 and Sec. 601.40–46.
6. Fleming, T.R. and Powers, J.H., Issues in non-inferiority trials: The evidence in
community-acquired pneumonia, Clin. Infect. Dis., 47, S108–120, 2008.
7. Bruzzi, P. et al., Objective response to chemotherapy as a potential surrogate
endpoint of survival in metastatic breast cancer patients, J. Clin. Oncol., 23, 5117–
5125, 2005.
8. Fleming, T.R. and DeMets, D.L., Surrogate endpoints in clinical trials: Are we
being misled? Ann. Intern. Med., 125, 605–613, 1996.
9. Fleming, T.R., Surrogate endpoints and FDA’s accelerated approval process:
The challenges are greater than they seem, Health Aff., 24, 67–78, 2005.
10. Fleming, T.R., Objective response rate as a surrogate endpoint: A commentary, J.
Clin. Oncol., 23, 4845–4846, 2005.
11. Rothmann, M.D., Issues to consider when constructing a non-inferiority analy-
sis, ASA Biopharm. Sec. Pro., 1–6, 2005.
12. Fleming, T.R., Current issues in non-inferiority trials, Stat. Med., 27, 317–332,
2008.
13. Prentice, R.L., Surrogate endpoints in clinical trials: Discussion, definition and
operational criteria, Stat. Med., 8, 431–440, 1989.
14. Prentice, R.L., Surrogate and mediating endpoints: Current status and future
directions, J. Natl. Cancer Inst., 101, 216–217, 2009.
15. Baker, S.G., Surrogate endpoints: Wishful thinking or reality? J. Natl. Cancer
Inst., 98, 502–503, 2006.
16. Neupogen product labeling available at http://www.accessdata.fda.gov/Scripts/
cder/DrugsatFDA/index.cfm?fuseaction=Search.Label_ApprovalHistory.
17. Neulasta product labeling available at http://www.accessdata.fda.gov/Scripts/
cder/DrugsatFDA/index.cfm?fuseaction=Search.Label_ApprovalHistory.
18. Committee for Proprietary Medicinal Products. Reflection paper on method-
ological issues in confirmatory clinical trials with flexible design and analysis
plan, EMA, London, 2006.
19. Fleming, T.R., Standard versus adaptive monitoring procedures: A commentary,
Stat. Med., 25, 3305–3312, 2006.
20. Guidance for Industry: Adaptive design clinical trials for drugs and biologics
(draft guidance), February 2010.
21. Pocock, S.J., Group sequential methods in the design and analysis of clinical tri-
als, Biometrika, 64, 191–199, 1977.
22. O’Brien, P.C. and Fleming, T.R., A multiple testing procedure for clinical trials,
Biometrics, 35, 549–556, 1979.
23. Kim, K. and DeMets, D.L., Design and analysis of group sequential tests based
on the type I error spending rate function, Biometrika, 74, 149–154, 1987.
24. Jennison, C. and Turnbull, B.W., Repeated confidence intervals for group
sequential clinical trials, Control. Clin. Trials, 5, 33–45, 1984.
25. Jennison, C. and Turnbull, B.W., Sequential equivalence testing and repeated
confidence intervals, with application to normal and binary data, Biometrics, 49,
31–43, 1993.
26. Lawrence, J., Some remarks about the analysis of active control studies,
Biometrical J., 47, 616–622, 2005.
27. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH), E9: Statistical principles for
clinical trials, 1998, at http://www.ich.org/cache/compo/475-272-1.html#E4.
28. Wittes, J. and Brittain, E., The role of internal pilot studies in increasing the effi-
ciency of clinical trials, Stat. Med., 9, 65–72, 1990.
29. Friede, T. and Kieser, M., Blind sample size reassessment in non-inferiority and
equivalence trials, Stat. Med., 22, 995–1007, 2003.
30. Gao, P., Ware, J.H., and Mehta, C.R., Sample size re-estimation for adaptive
sequential design in clinical trials, J. Biopharm. Stat., 18, 1184–1196, 2008.
31. Cui, L., Hung, H.M.J., and Wang, S.-J, Modification of sample size in group
sequential clinical trials, Biometrics, 55, 321–324, 1999.
32. Wang, S.J. et al., Group sequential test strategies for superiority and non-inferiority
hypotheses in active controlled clinical trials, Stat. Med., 20, 1903–1912, 2001.
33. Shih, W.J., Quan, H., and Li, G., Two-stage adaptive strategy for superiority
and non-inferiority hypotheses in active controlled clinical trials, Stat. Med., 23,
2781–2798 2004.
34. Koyama, T., Sampson, A.R., and Gleser, L.J., A framework for two-stage adap-
tive procedures to simultaneously test non-inferiority and superiority, Stat.
Med., 24, 2439–2456, 2005.
35. Wellek, S., Testing Statistical Hypotheses of Equivalence, Chapman & Hall/CRC
Press, Boca Raton, FL, 2003.
36. Schuirmann, D., A comparison of the two one-sided tests procedure and the
power for assessing the equivalence of average bioavailability, J. Pharmacokinet.
Pharm., 15, 657–680, 1987.
37. Berger, R.L., Multiparameter hypothesis testing and acceptance sampling,
Technometrics, 24, 295–300, 1982.
38. Rom, D.M. and Hwang, E., Testing for individual and population equivalence
based on the proportion of similar responses, Stat. Med., 15, 1489–1505, 1996.
39. Heyse, J.F. and Stine, R., Use of the overlapping coefficient for measuring the
similarity of treatments, Am. Stat. Assoc. Proc. Biopharm. Sec., 29–32, 2000.
40. Guidance for Industry for the Evaluation of Combination Vaccines for Pre
ventable Diseases: Production, Testing, and Clinical Studies. U.S. Department
of Health and Human Services, Food and Drug Administration, Center for
Biologics Evaluation and Research, April 1997.
41. Lachenbruch, P.A., Rida, W., and Kou, J., Lot consistency as an equivalence
problem, J. Biopharm. Stat., 14, 275–290, 2004.
42. Wiens, B.L., Heyse, J.F., and Matthews, H., Similarity of three treatments, with
application to vaccine development, Am. Stat. Assoc. Proc. Biopharm. Sec., 203–
206, 1996.
43. Lieberman, J.M. et al., The safety and immunogenicity of a quadrivalent mea-
sles, mumps, rubella and varicella vaccine in healthy children: A study of man-
ufacturing consistency and persistence of antibody, Pediatr. Infect. Dis. J., 25,
615–622, 2006.
44. Giani, G. and Finner, H., Some general results on least favorable parameter con-
figurations with special reference to equivalence testing and the range statistic,
J. Stat. Plan. Infer., 28, 33–47, 1991.
45. Sasabuchi, S., A test of multivariate normal mean with composite hypotheses
determined by linear inequalities, Biometrika, 67, 429–439, 1980.
46. Wiens, B. and Iglewicz, B., On testing equivalence of three populations,
J. Biopharm. Stat., 9, 465–483, 1999.
47. Ng, T., Iterative chi-square test for equivalence of multiple treatment groups,
Am. Stat. Assoc. Proc. Biopharm. Sec., 2464–2469, 2002.
48. Cleveland, W. and Lachenbruch, P., A measure of divergence among several
populations, Commun. Stat., 33, 201–211, 1974.
11.1 Introduction
Many clinical trials use outcome variables that are binary in nature, that is,
there are two possible outcomes for each subject. Without loss of generality,
these two outcomes are called “success” and “failure.” Other terms used are
“with an event” and “without the event.” Usually there are qualitative differ-
ences between these two outcomes, in that one outcome is always preferred
over the other. Specifically excluded from this class of outcome variables are
outcomes of a time-to-event endpoint where we are interested in the length
of time until an event (as well as the occurrence of the event are of inter-
est), or outcomes where we are interested in the magnitude in gradations
between success and failure.
The simplest model of proportions is the binomial model, in which each
subject has the same probability of success, p. When a sample of n subjects is
taken, the expected number of successes is np and the variance is np(1 – p).
Often of most interest is the proportion of successes, p̂ = x/n, which then has
mean p and variance p(1 – p)/n. With large sample sizes, the normal approxi-
mation can be used to describe the distribution of p̂ with ease of calculations
and minimal loss of precision, as noted below. Since the variance estimate is
not independent of the mean estimate, extension of the normal approxima-
tion to more complex situations must be used with caution.
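As a quick numerical check of these formulas, the following minimal sketch (our illustration, not from the original text; NumPy and SciPy are assumed, and the values of n and p are arbitrary) compares the standard deviation of p̂ implied by p(1 − p)/n with the exact binomial value, and a normal-approximation tail probability with its exact counterpart.

```python
# Minimal sketch: mean/variance of p-hat and the normal approximation.
# n and p below are arbitrary illustrative values.
import numpy as np
from scipy.stats import binom, norm

n, p = 200, 0.3
sd_phat = np.sqrt(p * (1 - p) / n)       # sd of p-hat = sqrt(p(1-p)/n)
print(sd_phat, binom.std(n, p) / n)      # identical by construction

# normal approximation to P(p-hat <= 0.35) vs. the exact binomial value
print(norm.cdf(0.35, loc=p, scale=sd_phat), binom.cdf(int(0.35 * n), n, p))
```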
The experimental therapy is noninferior to the control therapy if the prob-
ability of a success outcome on the experimental arm is better than or not too
much worse than that of the control arm. When a “success” is a positive or
desirable outcome (as the word “success” suggests), this means that the prob-
ability of a success for the experimental arm is greater than or not too much
less than that for the control arm. When a “success” is a negative or undesir-
able outcome, this means that the probability of a success for the experimen-
tal arm is less than or not too much greater than that for the control arm. This
“not too much less than” or “not too much greater than” can be expressed
through a difference in the two probabilities of a success, in the ratio of the
two probabilities (i.e., a relative risk), or through an odds ratio.
For a desirable outcome, the non-inferiority hypotheses based on the difference in the probabilities of a success are

Ho: pE − pC ≤ −δ versus Ha: pE − pC > −δ.  (11.1)
That is, the null hypothesis is that the active control is superior to the experi-
mental treatment by at least a quantity of δ ≥ 0 that is prespecified. The alter-
native hypothesis is that the active control is superior by a smaller amount,
or the two treatments are identical, or the experimental treatment is superior.
When δ = 0, the hypotheses in Expression 11.1 reduce to classical one-sided
hypotheses for a superiority trial. The null hypothesis in Expression 11.1 is
rejected and the experimental therapy is concluded to be noninferior to the
control therapy when a decrease in the proportion of success of δ or greater
is statistically ruled out. If a “success” is an undesirable outcome, the roles
of pC and pE in the hypotheses in Expression 11.1 would be reversed (i.e., test
Ho: pC – pE ≤ –δ vs. Ha: pC – pE > –δ).
In the simplest application involving a desirable outcome, a confidence
interval can be calculated on the difference pE – pC, and non-inferiority is con-
cluded (the null hypothesis is rejected) if the lower bound is greater than –δ.
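A minimal sketch of this decision rule, assuming SciPy is available and using the simple Wald interval (whose limitations are discussed in Section 11.2.3); the counts are the hypothetical data of Example 11.4 below and the margin δ = 0.10 is illustrative.

```python
# Minimal sketch (not the authors' code): conclude non-inferiority when the
# lower Wald confidence bound for pE - pC exceeds -delta.
import numpy as np
from scipy.stats import norm

def wald_noninferiority(y, nE, x, nC, delta, alpha=0.05):
    pE, pC = y / nE, x / nC
    se = np.sqrt(pE * (1 - pE) / nE + pC * (1 - pC) / nC)
    lower = (pE - pC) - norm.ppf(1 - alpha / 2) * se
    return lower, lower > -delta          # lower bound, NI conclusion

# hypothetical data of Example 11.4: lower bound is about -0.098
print(wald_noninferiority(131, 150, 135, 150, delta=0.10))
```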
TABLE 11.1
Notation for Breakdown of Counts of Responses between Treatment Arms
Treatment Arm Response No Response Sample Size
Control x nC – x nC
Experimental y nE – y nE
Total s nC + nE – s nC + nE
Given a true difference (pE – pC = –δ) under the null hypothesis Ho, the
probability of observing an outcome (x, y) = (i, j) is given by
P(X = i, Y = j | Ho) = (nC choose i)(nE choose j)(p + δ)^i (1 − p − δ)^(nC−i) p^j (1 − p)^(nE−j),  (11.2)
where p = pE is the nuisance parameter with the domain A = [0, 1 – δ]. For the
classical null hypothesis of no difference (δ = 0), the marginal total (S = X + Y)
is the sufficient statistic for the nuisance parameter (p). To eliminate the effect
of p, an exact test can be constructed conditional on this sufficient statistic,
which yields the well-known Fisher’s exact test.
In the case where pE – pC = –δ (δ > 0), there is no simple sufficient statistic for
p (the numbers of successes from each group are jointly minimal sufficient
statistics). Therefore, the conditional argument will not simplify the problem
of the nuisance parameter in testing non-inferiority hypotheses. In general,
an exact test of non-inferiority can be developed on the basis of the null prob-
ability distribution given in Equation 11.2 using the unconditional sampling
space consisting of all possible 2 × 2 tables given the sample sizes (nC, nE).
The exact test procedure defines the tail region (TR) of the observed table
(i, j) as the region of those tables that are at least as extreme as the observed
table according to a predefined ordering criterion. Then the exact p-value is
defined as
p-value = max_{p∈A} P( (X, Y) ∈ TR(i, j) | Ho, p ).  (11.3)
The exact p-value calculation eliminates the nuisance parameter using the
maximization principle,2,3 which caters to the worst-case scenario. Because
the maximization involves a large number of iterations in evaluating sums
of binomial probabilities, the exact unconditional tests are computationally
intensive, particularly with large sample sizes.
A natural ordering criterion proposed by Chan4 used the Z-statistic based
on the constrained maximum likelihood estimate (MLE) of parameters
under the null hypothesis:
Z1 = Z(x, y) = (p̂E − p̂C + δ) / { p̃E(1 − p̃E)/nE + p̃C(1 − p̃C)/nC }^(1/2)  (11.4)

where p̂C and p̂E are the observed response rates for the control and experimental treatment groups, respectively. In addition, p̃C and p̃E are the MLEs
of pC and pE, respectively, under the constraint pE − pC = −δ given in the null
hypothesis. The closed-form solutions for p̃C and p̃E are given by Farrington
and Manning5 and are provided in Expression 11.6. Since large values of Z1
favor the alternative hypothesis, the tail region includes those tables whose
Z1 statistics are larger than or equal to the Z1 statistic associated with the
observed table (i, j), zobs. As a result, the exact p-value can be obtained as
p-value = max_{p∈A} P( {(X, Y): Z1 ≥ zobs} | Ho, p ).
1. Compute the Z1 statistic for all tables and order them. Let zobs be the
calculated value of Z1 for the observed table. The tail of the observed
table includes those tables whose Z1 statistics are larger than or equal
to zobs.
2. For a given value of the nuisance parameter p in A = [0, 1 – δ], cal-
culate the tail probability by summing up the probabilities of those
tables in the tail using the probability function (Equation 11.2).
3. Repeat step 2 for every value of p in its domain. Then the exact
p-value is the maximum of the tail probability over the domain of
p. Since the domain of p is continuous, a numerical grid search (e.g.,
more than 1000 points) over the domain can be done to obtain the
maximum tail probability. This should provide adequate accuracy
for most practical uses.
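The following sketch implements steps 1 through 3 for the difference measure. It is an illustration under stated assumptions rather than the authors' implementation: the restricted MLEs are obtained by a one-dimensional grid maximization of the likelihood instead of the closed-form Farrington–Manning solution of Expression 11.6, and the toy counts are hypothetical and kept small because the enumeration is computationally intensive.

```python
# Minimal sketch of the exact unconditional non-inferiority test (difference
# scale) following steps 1-3 above; SciPy assumed.
import numpy as np
from scipy.stats import binom

def z1_stat(y, x, nE, nC, delta):
    """Z1 of Equation 11.4; the restricted MLE (pE - pC = -delta) is found by
    a 1-D grid search instead of the closed form of Expression 11.6."""
    grid = np.linspace(1e-6, 1 - delta - 1e-6, 2000)
    loglik = binom.logpmf(y, nE, grid) + binom.logpmf(x, nC, grid + delta)
    pE_t = grid[np.argmax(loglik)]        # restricted MLE of pE
    pC_t = pE_t + delta                   # restricted MLE of pC
    se = np.sqrt(pE_t * (1 - pE_t) / nE + pC_t * (1 - pC_t) / nC)
    return (y / nE - x / nC + delta) / se

def exact_p_value(y_obs, x_obs, nE, nC, delta, n_grid=1000):
    # step 1: order all (nE+1) x (nC+1) tables by their Z1 statistics
    z = np.array([[z1_stat(j, i, nE, nC, delta) for i in range(nC + 1)]
                  for j in range(nE + 1)])
    tail = z >= z[y_obs, x_obs]           # tables at least as extreme
    # steps 2-3: maximize the tail probability over the nuisance parameter
    best = 0.0
    for p in np.linspace(1e-6, 1 - delta - 1e-6, n_grid):
        pr = np.outer(binom.pmf(np.arange(nE + 1), nE, p),
                      binom.pmf(np.arange(nC + 1), nC, p + delta))
        best = max(best, pr[tail].sum())
    return best

# toy data: 14/15 successes (E) vs. 13/15 (C), margin delta = 0.10
print(exact_p_value(y_obs=14, x_obs=13, nE=15, nC=15, delta=0.10))
```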
For a nominal α level test, we reject the null hypothesis if the exact p-value
is less than or equal to α. To obtain the true level of the exact test, we first
convert the test procedure to find the critical value given the nominal α level
and the sample sizes nC and nE. This critical value does not depend on any
specific value of the nuisance parameter, and the true level is the maximum
(over the domain of the nuisance parameter) null probability of those tables
of which the test statistics are less than or equal to the critical value. This
exact test has been implemented in commercial software.
When δ = 0, p̃C and p̃E both simplify to the pooled estimate of the response
rate among the two groups, and the Z1 statistic in 11.4 reduces to the Z-pooled
statistic for the classical null hypothesis of no difference. As a result, the exact
unconditional test of non-inferiority based on Z1 provides a generalization of
the unconditional test of the classical null hypothesis studied by Suissa and
Shuster6 and Haber.7
Other types of statistics may also be considered as ordering criteria. A few
examples include: (1) the observed difference Dobs = pˆ E − pˆ C, (2) a Z-statistic
with the variance (denominator of Z) estimated directly from the observed
proportions, (3) a Z-statistic with the variance estimated from fixed marginal
totals,8 and (4) a likelihood ratio statistic.9 Findings from empirical investiga-
tions show that the Z-statistic in 11.4 generally performs better than Dobs and
other Z-type statistics. Röhmel and Mansmann10 recommended an ordering criterion that satisfies Barnard's convexity condition.
The upper bound of the two-sided 100(1 − α)% confidence interval (ΔU) for Δ = pE − pC is obtained by considering the one-sided hypothesis Ho: Δ = Δo versus H1: Δ < Δo such that

ΔU = sup{ Δo : max_{pC∈A} P( Z1(X, Y; Δo) ≤ Z1(x, y; Δo) | Δo, pC ) > α/2 }.
Similarly, the lower bound of the two-sided 100(1 – α)% confidence interval
(ΔL) is obtained by considering the one-sided hypothesis Ho: Δ = Δo versus
H1: Δ > Δo such that
ΔL = inf{ Δo : max_{pC∈A} P( Z1(X, Y; Δo) ≥ Z1(x, y; Δo) | Δo, pC ) > α/2 }.
It was shown by Chan and Zhang9 that the exact confidence interval based
on the Z1 statistic is much better than the simple tail-based confidence inter-
val (see Santner and Snell14) as well as confidence intervals based on the
Z-unpooled and the likelihood ratio statistics. Also, since the exact confi-
dence interval based on Z1 is obtained by inverting two one-sided tests, it
controls the error rate of each side at the α/2 level, and hence provides con-
sistent inference with the p-value from the one-sided hypothesis. In other
words, if the null hypothesis in 11.1 is rejected at the one-sided α/2 level for
a specific δ, then the lower bound of the two-sided 100(1 – α)% confidence
interval for the difference pE – pC will be greater than –δ.
Exact confidence intervals have also been proposed by Agresti and Min15
and Chen16 by inverting the two-sided hypothesis Ho: Δ = Δo versus H1: Δ ≠
Δo based on the Z1 statistic. The resulting confidence interval generally has
a shorter width than the one obtained by inverting two one-sided tests, and
therefore is very useful if the hypothesis is two-sided in nature or if esti-
mation is of primary interest. Other methods (non-test-based) that are also
useful for estimation purposes have been proposed by Coe and Tamhane17
and Santner and Yamagami.18 By inverting a two-sided test, the confidence
interval controls the overall error rate at the α level but does not guarantee
control of the error rate of each side at the α/2 level. Consequently, it may
potentially produce results that are inconsistent with a one-sided hypothesis
test. Therefore if the criterion of showing non-inferiority is to require that the
lower bound of the two-sided confidence interval for the difference pE – pC
be greater than –δ, then controlling the one-sided type I error is essential,
and constructing the confidence interval by inverting two one-sided tests is
recommended.
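A minimal sketch of the lower bound ΔL by test inversion, reusing the exact_p_value function from the earlier sketch; only Δo ≤ 0 is scanned (which suffices for a non-inferiority margin), and coarse grids are used because each exact p-value is expensive to compute. The step size and toy data are illustrative.

```python
# Minimal sketch: lower confidence limit by inverting one-sided exact tests.
import numpy as np

def exact_lower_bound(y_obs, x_obs, nE, nC, alpha=0.05, step=0.02):
    """Approximate Delta_L of the two-sided 100(1-alpha)% interval for
    pE - pC, restricted to Delta_o <= 0."""
    for d_o in np.arange(-0.98, 1e-9, step):
        # smallest Delta_o whose one-sided exact test is not rejected
        if exact_p_value(y_obs, x_obs, nE, nC, delta=-d_o,
                         n_grid=200) > alpha / 2:
            return d_o
    return 0.0

print(exact_lower_bound(14, 13, 15, 15))
```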
Examples 11.1 and 11.2 provide p-values from applying some of these meth-
ods to the results of actual studies.
Example 11.1
Example 11.2
Z = (p̂E − p̂C + δ) / se(p̂E − p̂C)  (11.5)
where zα/2 is the 100(1 – α/2)th percentile of the standard normal distribution
and se( pˆ E − pˆ C ) is the standard error of the estimated difference in propor-
tions. The standard error is commonly estimated by the unrestricted MLE,
√( p̂E(1 − p̂E)/nE + p̂C(1 − p̂C)/nC ), which leads to a Wald's confidence interval for the
true difference in proportions. However, not all possible values of p̂E and p̂C can
be observed: in a study with nC and nE observations, only multiples of 1/nC and
1/nE, respectively, can occur. When there are large sample sizes, these (nE + 1)
(nC + 1) possible outcomes are fairly dense within the unit square, the param-
eter space of pE and pC. In cases with small sample sizes and/or probabilities
of success near 0 or 1, this simple confidence interval can have suboptimal
coverage probabilities and the associated test can reject the null hypothesis
less often (or more often) than desired. In addition, the unrestricted MLE of
the variance is inconsistent with the null hypothesis, which restricts the true
difference in proportions.
Hauck and Anderson22 considered confidence intervals of the form
pˆ E − pˆ C ± { zα/2 × se( pˆ E − pˆ C ) + CC}, where CC denotes continuity correction
and se( pˆ E − pˆ C ) may or may not be based on the unrestricted MLE of the
variance. Hauck and Anderson concluded that some adjustment is neces-
sary, either in estimating the standard error or through use of a continuity
correction or both, even if sample sizes are large. With minor restrictions
on sample size, Hauck and Anderson recommended the unbiased estimate
of standard error (i.e., using n − 1 in the denominators instead of n as in the
MLE) and also using a continuity correction of 1/{2 × min(nE,nC)}. We note
that this is based on two-sided coverage probabilities, not on the testing of a
where
v = b³/(3a)³ − bc/(6a²) + d/(2a),
u = sign(v)[ b²/(3a)² − c/(3a) ]^(1/2),
w = [π + cos⁻¹(v/u³)]/3,
and
a = 1 + k,
d = −p̂C δ(1 + δ).
The Farrington–Manning test statistic is

ZFM = Z1 = (p̂E − p̂C + δ) / { p̃E(1 − p̃E)/nE + p̃C(1 − p̃C)/nC }^(1/2),
Example 11.3
p̂ ± zα/2 √( p̂(1 − p̂)/n ).
Many authors have discussed the poor coverage properties (and mainte-
nance of a desired type I error probability) of Wald’s confidence interval in
both the single proportion and the difference in proportions settings.23–29 We
begin with some results involving a single proportion.
Agresti and Coull25 proposed adding two successes and two failures to the data before applying Wald's 95% interval, giving the interval p̃ ± z0.025 √( p̃(1 − p̃)/ñ ), where p̃ = (x + 2)/ñ and ñ = n + 4. Applying this idea to a 100(1 − α)% confidence interval for arbitrary α yields the interval p̃ ± zα/2 √( p̃(1 − p̃)/ñ ), where p̃ = (x + z²α/2/2)/ñ and ñ = n + z²α/2. When this interval is applied to no data (x = 0, n = 0), the resulting interval is [0, 1]. Agresti and Coull reported substantial improvement in the coverage probability of this interval over Wald's interval for small sample sizes.
Brown, Cai, and Dasgupta29 compared the probability coverage and inter-
val lengths of several methods for constructing a confidence interval for a
single proportion. They recommended the Wilson interval or the equal-tailed
Jeffreys credible interval for small sample sizes (n ≤ 40), and the interval of
Agresti and Coull25 for large sample sizes (n > 40). All of these intervals
have instances where the coverage probability of the 95% interval is below
95%. For success rates fairly close to 0 and 1, the Jeffreys interval had a very
small coverage probability. To improve the probability coverage in such
cases, a modified version of the Jeffreys interval was proposed by Brown,
Cai, and Dasgupta.29 When x = 0, define the upper limit of the interval by
1 – (α/2)1/n and when x = n define the lower limit of the interval by (α/2)1/n.
For the details and more on the Jeffreys credible interval, see Appendices
A.2.2 and A.3.
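The single-proportion intervals discussed above can be sketched as follows (our illustration, with SciPy assumed; the data of x = 2 successes in n = 25 trials are arbitrary).

```python
# Minimal sketch: Wald, Agresti-Coull, and the modified Jeffreys interval
# for a single proportion.
import numpy as np
from scipy.stats import beta, norm

def wald_single(x, n, alpha=0.05):
    p, z = x / n, norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def agresti_coull(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    n_t = n + z ** 2                      # n-tilde
    p_t = (x + z ** 2 / 2) / n_t          # p-tilde
    half = z * np.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half

def jeffreys_modified(x, n, alpha=0.05):
    # equal-tailed Jeffreys credible interval with the boundary
    # modification of Brown, Cai, and Dasgupta described above
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x + 0.5, n - x + 0.5)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    if x == 0:
        hi = 1 - (alpha / 2) ** (1 / n)
    if x == n:
        lo = (alpha / 2) ** (1 / n)
    return lo, hi

print(wald_single(2, 25), agresti_coull(2, 25), jeffreys_modified(2, 25))
```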
The choices for standard errors are the unrestricted MLE of the standard
error and a modified version that replaces ni with ni − 1 for i = E, C. The
possible corrections are: (1) no correction (CC = 0), (2) Yates correction (CC =
1/(2nE) + 1/(2nC)), (3) a correction of Schouten et al.31 (CC = 1/(2 max(nE,nC))),
and (4) a correction of Hauck and Anderson (CC = 1/(2 min(nE,nC))). The cases
considered had minimum expected cell counts ranging from 2 to 15 with the
smallest group size ranging from 6 to 100.
As mentioned in Section 11.2.3, Hauck and Anderson22 recommended
the use of the Hauck–Anderson correction with the modified version of the
standard error. When the desired confidence level was 90% or 95% and the
minimum expected cell count was at least 3, that method gave coverage
probabilities close to the desired level. Wald’s interval with a Yates correction
also performed reasonably well but was more conservative. Wald’s interval
without any correction did not provide adequate coverage at any sample size
studied, and Hauck and Anderson recommended against its use. When the
desired confidence level was 99% and the minimum expected cell count was
at least 5, their recommended method and Wald's interval with a Yates cor-
rection performed equally well. No method studied performed consis-
tently well when the minimum expected cell count was 2. Tu32 preferred
Wald’s interval with a Hauck–Anderson continuity correction for equiva-
lence testing.
Li and Chuang-Stein33 evaluated and compared the type I error rate in non-
inferiority testing of a difference of two proportions using Wald’s interval
with and without a Hauck–Anderson continuity correction. Their evaluation
was based on equal allocation for “sample sizes relevant to the confirmatory
trials.” The sample sizes were between 100 and 300. For the cases studied
where all the expected cell counts (successes and failures for both arms) were
greater than 15 and 2.5% is the one-sided targeted type I error rate, the
estimated type I error rate for the standard Wald’s interval was between 2.3%
and 2.75%. Wald’s interval with Hauck–Anderson continuity correction pro-
duced type I error rates consistently below 2.5%. For the cases studied where
some of the expected cell counts were less than 15 and 2.5% is the one-sided
targeted type I error rate, the estimated type I error rate for the standard
Wald’s interval could go beyond 2.75%. The inflation appeared to increase as
the smallest expected cell count approached 5. In these cases, Wald’s inter-
val with Hauck–Anderson continuity correction performed fairly well and
produced type I error rates below 2.75%. Li and Chuang-Stein33 recommend
using Wald’s interval without a continuity correction when the expected fre-
quency of all cell counts is at least 15. Otherwise, they recommend imple-
menting the Hauck–Anderson correction.
Newcombe28 compared the coverage probabilities and expected lengths
of 11 methods for determining a confidence interval for the difference in
proportions. A tail area profile likelihood-based method, and the methods
of Mee34 and Miettinen and Nurminen,35 which invert test statistics that
use standard errors restricted to the specified difference in proportions, all
performed well but were either difficult to compute or required a computer
program. Newcombe recommended a method that combined Wilson score
intervals for the two proportions either with or without a continuity cor-
rection. The Newcombe–Wilson 100(1 – α)% confidence interval without a
continuity correction is given by (L, U) where
lE and uE are the roots of pE − y/nE = zα/2 pE (1 − pE )/nE and lC and uC are
the roots of pC − x/nC = zα/2 pC (1 − pC )/nC . For the Newcombe–Wilson in
terval with a continuity correction, lE and uE are the limits of the interval
{ }
p : p − y/nE − 0.5/nE ≤ zα/2 p(1 − p)/nE and lC and uC are the limits of the
{
interval p : p − x/nC − 0.5/nC ≤ zα/2 p(1 − p)/nC . }
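A minimal sketch of the Newcombe–Wilson interval without a continuity correction, using the closed-form Wilson score limits for each arm; applied to the hypothetical data of Example 11.4 below, it reproduces the lower limit of about −0.1002 quoted there.

```python
# Minimal sketch: Newcombe-Wilson interval for pE - pC by combining the
# Wilson score limits of each arm (square-and-add rule).
import numpy as np
from scipy.stats import norm

def wilson_limits(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    p = x / n
    center = (p + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = (z * np.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
            / (1 + z ** 2 / n))
    return center - half, center + half

def newcombe_wilson(y, nE, x, nC, alpha=0.05):
    pE, pC = y / nE, x / nC
    lE, uE = wilson_limits(y, nE, alpha)
    lC, uC = wilson_limits(x, nC, alpha)
    d = pE - pC
    return (d - np.sqrt((pE - lE) ** 2 + (uC - pC) ** 2),
            d + np.sqrt((uE - pE) ** 2 + (pC - lC) ** 2))

print(newcombe_wilson(131, 150, 135, 150))   # ~(-0.1002, 0.0465)
```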
Motivated by Agresti and Coull,25 Agresti and Caffo26 proposed to use the
corresponding Wald’s interval after adding one success and one failure to
each treatment group for the 95% confidence interval of the difference of pro-
portions. This addition of observations performed best on the basis of their
simulations using pairs of true probabilities selected randomly over the unit
square (uniform distribution) and group sizes selected randomly (uniform
distribution) over {10, 11, …, 30}. The resulting 95% confidence interval is p̃E − p̃C ± z0.025 √( p̃E(1 − p̃E)/ñE + p̃C(1 − p̃C)/ñC ), where, for i = E, C, p̃i = (xi + 1)/ñi and ñi = ni + 2 with xi denoting the observed number of successes in arm i. When this interval is applied to no data (x = y = 0 and ni = 0 for i = E, C), the resulting interval is [−1, 1]. Note that for a common sample size, the middle of the Agresti and Caffo interval is closer to zero than that of Wald's interval (i.e., |p̃E − p̃C| ≤ |p̂E − p̂C|). However, this need not be true for uneven sample sizes.
Santner et al.36 compared the small-sample probability coverage and
expected lengths of five methods for determining a 90% confidence inter
val for the difference of proportions. The methods include the asymptotic
method of Miettinen and Nurminen,35 which is based on the score statis-
tic, and the exact methods of Agresti and Min,15 Chan and Zhang,9 Coe and
Tamhane,17 and Santner and Yamagami.18 For seven pairs of sample sizes
(three cases of balanced allocation and four cases of unbalanced allocation
were examined), the average (exact) probability coverage was calculated
(based on binomial distributions) across 10,000 pairs of (pE,pC) selected evenly
across the unit square. The overall sample size ranged from 20 to 70. The
authors conclude that the exact method of Coe and Tamhane performed the
best, and the asymptotic method of Miettinen and Nurminen performed
the worst. The authors recommended the use of the Coe and Tamhane
method; when that method is not available, either the method of Agresti and
Min or the method of Chan and Zhang is recommended. The use of any of
these five methods was strongly recommended by the authors in the abstract
of the paper. However, in the conclusions of the paper, use of the methods of
Santner and Yamagami and Miettinen and Nurminen was discouraged.
the distribution of type I error rates of Wald's method. How the type
I error rates of the Agresti–Caffo method compare also depends on
the control success rate.
Dann and Koch37 seem to prefer Wald’s method and the Agresti–Caffo
method when the allocation ratio is large (3:2, 2:1, or 3:1), and the Farrington–
Manning method or the Newcombe–Wilson method for smaller allocation
ratios (1:2 or 1:1). It should be noted that their results do not directly apply to
control success rates (for positive outcomes) less than 0.5.
We close this section with Example 11.4, which applies the 95% confidence
intervals for the difference in proportions from various methods to the
results from a hypothetical clinical trial.
Example 11.4
Suppose there are 131 successes among 150 subjects in the experimental arm
and 135 successes among 150 subjects in the control arm. For various asymptotic
and Bayesian methods (see Section 11.5), the corresponding 95% two-sided confidence interval for pE − pC is determined. The results are provided in Table 11.2,
where the methods are listed in decreasing order with respect to the lower confidence limit. Four of the nine methods would lead to a non-inferiority conclusion if
δ = 0.10 (the lower limit of the Newcombe–Wilson interval is −0.1002).
TABLE 11.2
95% Confidence Intervals for Difference in Proportions
Method 95% Confidence Interval
Wald (–0.098, 0.045)
Zero prior Bayesiana (–0.099, 0.045)
Jeffreys (–0.099, 0.045)
Agresti–Caffo (–0.099, 0.046)
Newcombe–Wilson (–0.100, 0.047)
Newcombe–Wilson with CC (–0.100, 0.047)
Farrington–Manningb (–0.101, 0.047)
Wald with Hauck and Anderson CC (–0.102, 0.048)
Wald with Yates CC (–0.105, 0.052)
a Based on the resulting posterior distributions as the prior parameters α → 0 and β → 0 for each arm.
b Standard errors based on the null restricted estimates of the proportions.
that pair of probabilities closest to (p1, p2) satisfying p1′ – p2′ = –δ (when a
restricted estimate of the standard error is used), or p1′ = p1 and p2′ = p2 (when
an unrestricted estimate of the standard error is used). Then the power
is approximately

P( (p̂E − p̂C + δ)/σo > zα/2 ) = P( (p̂E − p̂C − Δa)/σa > (zα/2σo − δ − Δa)/σa ) ≈ Φ( (Δa + δ − zα/2σo)/σa ).
For a desired power of 1 – β, the right-hand term above is set equal to Φ(zβ).
Simplifying the equation leads to
nC = [ ( zβ √( p1(1 − p1)/k + p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + p2′(1 − p2′) ) ) / (Δa + δ) ]².  (11.8)
For the analyses proposed in previous papers,5,35 using p1′ = (p1 + p2 – δ)/2 and
p2′ = (p1 + p2 + δ)/2 or using p1′ = (kp1 + p2 – δ)/(1 + k) and p2′ = (kp1 + p2 + kδ)/(1 +
k) could also be appropriate, as they more closely match the analysis method.
Otherwise, ( p1′ , p2′ ) can be selected using some rule for determining the pair
of ( p1′ , p2′ ) in the null hypothesis that is the most difficult to reject when (p1, p2)
is the true pair of the probabilities of a success. The change in the estimated
sample size may not be dramatically affected unless δ is very large, or p1 and
p2 are close to 0 or 1. When δ = 0, corresponding to a superiority analysis, it is
common to use p1′ = p2′ = (p1 + p2)/2 or p1′ = p2′ = (kp1 + p2)/(1 + k). Otherwise,
the approaches of Farrington and Manning5 and Miettinen and Nurminen35
to obtain MLEs of the true success rates restricted to the null hypothesis can
be adapted to determine p1′ and p2′ . This is accomplished by treating p1 and
p2 as the observed success rates.
The use of the sample-size formula in Equation 11.8 is illustrated in
Example 11.5.
Example 11.5
Suppose δ = 0.10, p1 = p2 = 0.85, Δa = 0, p1′ = (p1 + p2 − δ)/2 = 0.80, p2′ = (p1 + p2 + δ)/2 = 0.90, z0.025 = 1.96, and z0.10 = 1.28. Then by 11.8, the required sample size per arm for a one-to-one randomization (k = 1) is calculated as 16.26² ≈ 264.5. Thus around 265 subjects should be randomized to each treatment group. For p1′ = p1 = 0.85 and p2′ = p2 = 0.85, the calculated sample size is 268 subjects per arm.
A new antibiotic might be developed to have a higher cure rate than currently available treatments. If the investigational antibiotic might cure 95% of all cases (p1 = 0.95, p2 = 0.85, Δa = 0.10, p1′ = (p1 + p2 − δ)/2 = 0.85, p2′ = (p1 + p2 + δ)/2 = 0.95), then the required sample size is only 46 per treatment arm by 11.8. This sample size may be low enough to cause concern about whether the difference in sample proportions has an approximately normal distribution. A sample size of 200 per group would provide 90% power to show that the experimental treatment is superior, with even greater power to show non-inferiority.
Alternatively, a new antibiotic may be expected to have a slightly lower cure rate than currently available treatments, but have some other advantage such as better tolerability. If the investigational antibiotic might cure 80% of all cases, the required sample size to show non-inferiority is about 1200 per treatment group (p1 = 0.80, p2 = 0.85, Δa = −0.05, p1′ = (p1 + p2 − δ)/2 = 0.775, p2′ = (p1 + p2 + δ)/2 = 0.875).
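The three calculations in Example 11.5 can be reproduced with the following minimal sketch of Equation 11.8 (the function and argument names are our own; p1p and p2p denote p1′ and p2′).

```python
# Minimal sketch of the per-arm sample-size formula in Equation 11.8.
import math
from scipy.stats import norm

def n_control(p1, p2, p1p, p2p, delta, Delta_a, k=1.0,
              alpha=0.05, beta=0.10):
    num = (norm.ppf(1 - beta) * math.sqrt(p1*(1 - p1)/k + p2*(1 - p2))
           + norm.ppf(1 - alpha/2) * math.sqrt(p1p*(1 - p1p)/k
                                               + p2p*(1 - p2p)))
    return (num / (Delta_a + delta)) ** 2

print(n_control(0.85, 0.85, 0.80, 0.90, 0.10, 0.0))    # ~264.5 -> 265 per arm
print(n_control(0.95, 0.85, 0.85, 0.95, 0.10, 0.10))   # ~46 per arm
print(n_control(0.80, 0.85, 0.775, 0.875, 0.10, -0.05))  # ~1200 per arm
```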
The total sample size, nC + nE = (1 + k)nC, corresponding to Equation 11.8 is

[ ( zβ √( p1(1 − p1)/k + p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + p2′(1 − p2′) ) ) / (Δa + δ) ]² (1 + k).  (11.9)
The optimal k that minimizes Expression 11.9 can be found using calculus
by taking a derivative of Expression 11.9 with respect to k, setting the result
equal to zero and then solving for k, or by a “grid search” by evaluating
Expression 11.9 for many candidates for k.
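A grid search for the optimal allocation ratio can be sketched as follows (our illustration; the inputs are the assumptions of Example 11.5, and the search range for k is arbitrary).

```python
# Minimal sketch: grid search for the k = nE/nC minimizing Expression 11.9.
import numpy as np
from scipy.stats import norm

def total_size(k, p1, p2, p1p, p2p, delta, Delta_a, alpha=0.05, beta=0.10):
    num = (norm.ppf(1 - beta) * np.sqrt(p1*(1 - p1)/k + p2*(1 - p2))
           + norm.ppf(1 - alpha/2) * np.sqrt(p1p*(1 - p1p)/k
                                             + p2p*(1 - p2p)))
    return (num / (Delta_a + delta)) ** 2 * (1 + k)

ks = np.linspace(0.5, 3.0, 251)
sizes = np.array([total_size(k, 0.85, 0.85, 0.80, 0.90, 0.10, 0.0)
                  for k in ks])
print(ks[sizes.argmin()], sizes.min())   # approximate optimal k, study size
```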
Example 11.6 determines the optimal allocation ratio using the assump-
tions in Example 11.5
Example 11.6
FIGURE 11.1
Relationship between null and selected alternative.
Ho: θ = pE/pC ≤ θo versus Ha: θ = pE/pC > θo, where 0 < θo ≤ 1.  (11.10)

That is, the null hypothesis is that the success rate of the experimental arm is
smaller than a prespecified fraction, θo, of the success rate of the control arm,
whereas the alternative is that the active control is superior by a smaller amount,
the two treatments are identical, or the experimental treatment is superior. When
θo = 1, the hypotheses in Expression 11.10 reduce to classical one-sided hypoth-
eses for a superiority trial. If a “success” is a negative outcome, the roles of pC and
pE in the hypotheses in Expression 11.10 would be reversed, leading to hypotheses that can be expressed as Ho: θ = pE/pC ≥ θo versus Ha: θ = pE/pC < θo, where θo ≥ 1.
In the simplest application with no covariates, a confidence interval can be
calculated for pE/pC and non-inferiority is concluded (the null hypothesis in
Expression 11.10 is rejected) if the lower bound of the confidence interval is
greater than θo.
Z2 = Z2(X, Y) = (p̂E − θo p̂C) / { p̃E(1 − p̃E)/nE + θo² p̃C(1 − p̃C)/nC }^(1/2)  (11.11)
where p̂C and p̂E are the observed response rates and p̃C and p̃E are the MLEs
of pC and pE, respectively, under the constraint pE = θopC given in the null
hypothesis in Expression 11.10. The closed-form solutions for p̃C and p̃E are
given in Farrington and Manning's study5 and in Expression 11.18 of this
text. Since large values of Z2 favor the alternative hypothesis, the tail region
includes tables whose Z2 values are greater than or equal to zobs, the Z2 value
for the observed table. Therefore, the exact p-value is calculated as

p-value = max_{p∈A} P( {(X, Y): Z2 ≥ zobs} | Ho, p ),  (11.12)

where p (= pE) is the nuisance parameter with the domain A = [0, θo], and the probability is evaluated using the following null probability function:

P(X = i, Y = j | Ho) = (nC choose i)(nE choose j) θo^(−i) p^(i+j) (1 − θo^(−1)p)^(nC−i) (1 − p)^(nE−j).
For a nominal α-level test, the critical region and true size can be calculated in a similar fashion as for the exact unconditional test using the difference measure
described in Section 11.2.2. In the special case where θo = 1, the hypotheses in
Expression 11.10 are those in Expression 11.1 with δ = 0. In this special case,
the Z1 and Z2 statistics are identical. Chan and Bohidar39 have studied the
utility of this exact unconditional test in designing clinical trials and found
that the empirical performance of this exact test compares very favorably
with its asymptotic counterpart (Z2 test) in terms of type I error rate, power,
and sample size under a wide range of true parameter values.
Example 11.7
Chan12 reanalyzed the data in Example 11.2 using the relative risk measure to show
non-inferiority based on the criterion requiring that the tumor response rate to the
chemotherapy treatment (pE) be greater than 90% of the response to the radiation
therapy (pC). This corresponds to a threshold of θo = 0.9 for the relative risk. Since
p̂E = 0.943 [83/88] and p̂C = 0.908 [69/76], the observed relative risk is θ̂ = 1.039.
The MLEs when pE/pC = 0.9 are p̃C = 0.946 and p̃E = 0.851, which gives Z2 = 2.835
from Equation 11.11. From Equation 11.12 the exact unconditional test based on Z2
yielded a p-value of 0.0028, compared with the asymptotic p-value of 0.0024. Both
tests strongly supported the conclusion of non-inferiority. At the one-sided 5% level, the size of the exact test is 4.82%, whereas the type I error rate of the asymptotic Z2 test is 5.32% when pC = 0.9, and the size of the asymptotic test is approximately 5.59%.
φ = λE/(λC + λE) = nE pE/(nC pC + nE pE) = θ/(θ + u),

where u = nC/nE.
Since φ is increasing in θ, the non-inferiority hypotheses in Expression
11.10 are equivalent to

Ho: φ ≥ φo versus Ha: φ < φo,  (11.13)

where φo = θo/(θo + u). Thus, inferences can be based on a simple exact test
involving a one-sample binomial distribution. Suppose yobs is the number
of disease cases observed in the experimental group, then the exact p-value
conditional on the total number of disease cases S = s is
p-value = Pr{Y ≤ yobs | Y ∼ Binomial(s, φo)} = Σ_{k=0}^{yobs} [ s!/(k!(s − k)!) ] φo^k (1 − φo)^(s−k).  (11.14)
For an α-level test, the critical value yα can be determined as the largest value
satisfying that Pr{Y ≤ yα|Y ∼ Binomial(s, ϕo)} is as close as possible to α from
below without exceeding it. The power conditional on S = s for testing the
hypotheses in Expression 11.13 against a specific alternative ϕ = ϕ1 < ϕo is
then calculated as Pr{Y ≤ yα|y ∼ Binomial(s, ϕ1)}, where ϕ1 = θ 1/(θ 1 + u).
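A minimal sketch of these conditional design calculations, giving the critical value yα and the conditional power for a fixed total number of cases s (SciPy assumed; the inputs in the example call are illustrative).

```python
# Minimal sketch: critical value and conditional power for the exact
# conditional test of Equation 11.14.
from scipy.stats import binom

def critical_value(s, phi_o, alpha=0.05):
    # largest y with P(Y <= y | Binomial(s, phi_o)) <= alpha
    y = int(binom.ppf(alpha, s, phi_o))
    while binom.cdf(y, s, phi_o) > alpha:
        y -= 1
    return y

def conditional_power(s, theta_o, theta_1, u, alpha=0.05):
    phi_o = theta_o / (theta_o + u)
    phi_1 = theta_1 / (theta_1 + u)
    return binom.cdf(critical_value(s, phi_o, alpha), s, phi_1)

# e.g., 60 total cases, margin theta_o = 1.5, true theta_1 = 0.5, u = 1
print(conditional_power(60, 1.5, 0.5, 1.0))
```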
Note that Equation 11.14 could also be evaluated via the F-distribution
using the following relationship (see, e.g., Johnson, Kotz, and Kemp,40 p. 110):

Σ_{k=0}^{y} [ s!/(k!(s − k)!) ] φ^k (1 − φ)^(s−k) = F_{ν1,ν2}( ν2φ/(ν1(1 − φ)) ).
The confidence limits for φ are

φL = ν1 F⁻¹_{ν1,ν2}(α/2) / ( ν2 + ν1 F⁻¹_{ν1,ν2}(α/2) ),

where
ν1 = 2yobs,
ν2 = 2(s − yobs + 1),
and

φU = ν1 F⁻¹_{ν1,ν2}(1 − α/2) / ( ν2 + ν1 F⁻¹_{ν1,ν2}(1 − α/2) ),

where
ν1 = 2(yobs + 1),
ν2 = 2(s − yobs).
Then, a 100(1 − α)% exact confidence interval (θL, θU) for the relative risk θ
is given by

θL = uφL/(1 − φL),  θU = uφU/(1 − φU).  (11.15)
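A minimal sketch of this exact conditional confidence interval via Equation 11.15 and the F-distribution form of the binomial (Clopper–Pearson) limits (SciPy assumed; the example inputs are arbitrary).

```python
# Minimal sketch: exact conditional CI for the relative risk (Eq. 11.15).
from scipy.stats import f as f_dist

def exact_conditional_rr_ci(y_obs, s, nE, nC, alpha=0.05):
    u = nC / nE
    if y_obs > 0:
        v1, v2 = 2 * y_obs, 2 * (s - y_obs + 1)
        q = f_dist.ppf(alpha / 2, v1, v2)
        phi_L = v1 * q / (v2 + v1 * q)
    else:
        phi_L = 0.0
    if y_obs < s:
        v1, v2 = 2 * (y_obs + 1), 2 * (s - y_obs)
        q = f_dist.ppf(1 - alpha / 2, v1, v2)
        phi_U = v1 * q / (v2 + v1 * q)
    else:
        phi_U = 1.0
    theta_L = u * phi_L / (1 - phi_L)
    theta_U = u * phi_U / (1 - phi_U) if phi_U < 1 else float("inf")
    return theta_L, theta_U

# e.g., 5 of 30 total cases on the experimental arm, equal group sizes
print(exact_conditional_rr_ci(y_obs=5, s=30, nE=500, nC=500))
```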
This exact conditional method can be applied to design a study with a goal to
obtain a fixed total number of events instead of running for a fixed duration.
Once the desired total number of events (S) is fixed, the power of the study
depends on incidence rates only through the relative risk (θ = pE/pC); thus,
one can avoid the situation potentially encountered in a fixed-duration trial
where the anticipated power is not achieved at the end of the trial because
the number of events is too few owing to unexpectedly low incidence rates.
Since the unconditional expected value of S is (nCpC + nEpE), the expected
number of subjects required for the study can be estimated on the basis of
the incidence rate in the control group (pC) and the relative risk (θ = θ 1) under
the alternative hypothesis:
nE ≈ s / ( (u + θ1) pC ).  (11.16)
Example 11.8
Chan12 used the exact conditional method to design a non-inferiority trial compar-
ing a new hepatitis A vaccine with immune globulin (IG, standard treatment C) in
postexposure prophylaxis of hepatitis A disease. IG is believed to have approxi-
mately 90% efficacy for postexposure prophylaxis. However, IG is a derived
blood product, and thus there are concerns about its safety and purity in addition
to its short-lived protection. In contrast, the hepatitis A vaccine has been demon-
strated to be safe and highly efficacious (≈100%) in preexposure prophylaxis,41 and
capable of inducing long-term protective immunity against hepatitis A in healthy
subjects.42 Recognizing the potential long-term benefit of the vaccine, investiga-
tors of this study intended to show that the vaccine is noninferior to IG in terms of
postexposure efficacy. Since we are dealing with negative outcomes (have the dis-
ease), the hypotheses to be tested are Ho: θ = pE/pC ≥ θo versus Ha: θ = pE/pC < θo,
where θo ≥ 1. If non-inferiority is established (θ < θo), one can infer that the new
vaccine has reasonable efficacy (π E) for postexposure prophylaxis on the basis of
the following indirect argument:
πE = 1 − pE/pU = 1 − (pE/pC)·(pC/pU) = 1 − θ(1 − πC) > 1 − θo(1 − πC)  (11.17)
and p̃C = p̃E/θo, where x and y are the observed numbers of successes in the
control and experimental arms, respectively, given in Table 11.1.
An alternative would be to base the inference on the log-relative risk,
log(pE/pC) = log pE – log pC. Katz et al.46 proposed the use of the standard
Taylor series method for determining a one-sided or two-sided confidence
interval for a relative risk.
By using the asymptotic standard error of log( pˆ E ) − log( pˆ C ) , a 100 (1 – α)%
confidence interval for log pE – log pC can be calculated as
log(p̂E) − log(p̂C) ± zα/2 √( (1 − p̂E)/(nE p̂E) + (1 − p̂C)/(nC p̂C) ).
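A minimal sketch of this Taylor series (Katz) interval; applied to the hypothetical data of Example 11.9 below, it reproduces the interval (0.895, 1.052) reported in Table 11.3.

```python
# Minimal sketch: Katz et al. (Taylor series) CI for the relative risk.
import numpy as np
from scipy.stats import norm

def katz_rr_ci(y, nE, x, nC, alpha=0.05):
    pE, pC = y / nE, x / nC
    se = np.sqrt((1 - pE) / (nE * pE) + (1 - pC) / (nC * pC))
    z = norm.ppf(1 - alpha / 2)
    log_rr = np.log(pE / pC)
    return np.exp(log_rr - z * se), np.exp(log_rr + z * se)

print(katz_rr_ci(131, 150, 135, 150))   # ~(0.895, 1.052), cf. Table 11.3
```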
A Wald-type statistic for testing the hypotheses in Expression 11.10 is

(p̂E − θo p̂C) / √( p̂E(1 − p̂E)/nE + θo² p̂C(1 − p̂C)/nC ),

and a Pearson-type chi-square statistic is

(y − nE p̃E)² / (nE p̃E(1 − p̃E)) + (x − nC p̃C)² / (nC p̃C(1 − p̃C)),

where p̃E and p̃C are the MLEs restricted to pE/pC = θo (see Equation 11.18 for the expressions for p̃E and p̃C). The test statistic is compared with the appropriate upper percentile of a χ² distribution with 1 degree of freedom. Bailey's confidence limits for the relative risk, based on a cube-root transformation, are

(p̂E/p̂C) · [ ( 1 ± zα/2 { (1 − p̂E)/y + (1 − p̂C)/x − z²α/2(1 − p̂E)(1 − p̂C)/(9xy) }^(1/2) / 3 ) / ( 1 − z²α/2(1 − p̂C)/(9x) ) ]³.

Coverage probabilities were compared for one-sided lower 95% confidence intervals for the same possible combinations as in Katz et al.'s46 study (nE = nC = 100; pC = 0.1, 0.2, or 0.4; and θo =
0.25, 0.5, 0.667, 1, 1.5, 2, and 4). Bailey’s method had coverage probabilities
much closer to the desired level than the Taylor series method and Pearson’s
method. For non-inferiority testing targeting a one-sided type I error rate
of 5%, the corresponding (one-sided) type I error rates ranged from 4.6% to
5.2% for Bailey’s method, from 4.3% to 5.8% for the Taylor series method, and
from 3.8% to 4.6% for Bailey’s method with a continuity correction.
Dann and Koch1 evaluated and compared several methods of construct-
ing a confidence interval for the relative risk based on the calculated limits,
power, type I error rate, and agreement/disagreement with other methods.
The methods included were classified into three categories: Taylor series
methods, solution to quadratic equation methods (Fieller-based methods
applied to normalized test statistics), and maximum likelihood–based meth-
ods (asymptotic likelihood ratio test and Pearson’s χ2 test).
The Taylor series methods are the standard Taylor series method, the mod-
ified Taylor series method of Gart and Nam50 that adds 0.5 to the number of
successes in each group and to the common sample size, and another modi-
fied method (adapted Agresti–Caffo) that adds 4θo/(1 + θo) successes and 4/
(1 + θo) failures to the experimental group and 4/(1 + θo) successes and 4θo/
(1 + θo) failures to the control group. In addition, a Taylor series adjusted
alpha method that uses z0.0225 = 2.005 instead of z0.025 = 1.96 for the standard
Taylor series method is investigated.
The quadratic methods studied included the standard quadratic method
(referred to as “F-M 1” by Dann and Koch1), an adapted version that divides
by one less the sample size when determining the standard error (referred to
as “the quadratic method” in the paper of Dann and Koch1), Bailey’s method,
and two variations of the quadratic method provided by Farrington and
Manning.5 One of the variations in Farrington and Manning’s study5 uses
the MLEs in Equation 11.18 in determining the standard error. The other
variation uses the approach of Dunnett and Gent8 in obtaining estimates of
the success rates for determining the standard error.
The maximum likelihood methods include the Pearson’s method pro-
posed by Koopman48 and the generalized likelihood ratio test (referred to as
the “deviance method” by Dann and Koch1). The deviance statistic is
2 ln[ L(p̂E, p̂C)/L(p̃E, p̃C) ], where p̃E and p̃C are the MLEs restricted to pE/pC = θo
provided in Section 11.3.3. The test statistic is compared with the appropriate
upper percentile of a χ2 distribution with 1 degree of freedom. A two-sided
100(1 – α)% confidence interval for the relative risk consists of those values that
can be specified as θo for which the respective one-sided or two-sided test of level
α fails to reject the null hypothesis. A one-sided confidence interval would extend
the appropriate side of a two-sided confidence interval to zero or infinity.
The assessments were based on one-sided 97.5% confidence intervals. The
trial sizes used were 100, 140, and 200 patients per group. Control success
rates (for undesirable outcomes) of 0.10, 0.15, 0.20, and 0.25 were examined
with null relative risks (experimental/control) of 0.667, 0.8, 1, 1.25, 1.5, and 2.
For each combination of sample size, control success rate, and relative risk,
100,000 simulations were performed. For the majority of methods, when the
experimental or control number of “successes” was three or fewer, the confi-
dence interval was replaced by the corresponding exact confidence interval
for the odds ratio. When there were no successes in the control group, the
upper limit was assigned the value of 100.
The simulated type I error rates were provided in the paper of Dann and
Koch1 for a null relative risk of 2. The quadratic method had the smallest simu-
lated type I error rate in every case. In all 12 of these cases, the Taylor series
adjusted alpha method and the deviance methods maintained the type I error
rate between 0.023 and the desired 0.025. Bailey's method maintained a type
I error rate between 0.0225 and 0.0275 in all cases. The adapted Agresti–Caffo
method had simulated type I error rates between 0.020 and 0.027. The other
methods did not maintain the targeted type I error rate as consistently as the
Taylor series adjusted alpha, deviance, adapted Agresti–Caffo, and Bailey
methods. In cases having a null relative risk of 0.667, 0.8, 1,
1.25, and 1.5, the power was determined for each method for a true relative risk
of 2. In those 60 cases, the Taylor series adjusted alpha, deviance, and Bailey’s
methods all produced extraordinarily similar simulated power.
We close with Example 11.9, which applies the 95% confidence intervals
for the relative risk from various methods to the results from a hypothetical
clinical trial.
Example 11.9
Suppose there are 131 successes among 150 subjects in the experimental arm
and 135 successes among 150 subjects in the control arm. For various asymptotic
and Bayesian methods (see Section 11.5), the corresponding 95% two-sided confidence interval for pE/pC is determined. The results are provided in Table 11.3
TABLE 11.3
95% Confidence Intervals for Relative Risk from Various Methods
Method 95% Confidence Interval
Taylor series (Katz) (0.895, 1.052)
Bailey (0.895, 1.052)
Standard quadratic (0.894, 1.052)
Zero prior Bayesiana (0.893, 1.052)
Taylor series adjusted alpha (0.893, 1.054)
Jeffreys (0.893, 1.053)
Deviance (0.892, 1.053)
Farrington–Manning (0.890, 1.055)
Koopman–Pearson (0.890, 1.055)
a Based on the resulting posterior distributions as the prior parameters α → 0 and β → 0 for each arm.
where the methods are listed in decreasing order with respect to the lower confi-
dence limit. The order of the lower limits can be fairly arbitrary and depends on
the general success rate, which of the arms had the greater rate, the sample size,
and the allocation ratio.
nC = [ ( zβ √( p1(1 − p1)/k + θo² p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + θo² p2′(1 − p2′) ) ) / (p1 − θo p2) ]².  (11.20)
Using p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo) or using p1′ = θo(kp1 +
p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo) could be appropriate in many cases.
Otherwise, ( p1′ , p2′ ) can be selected using some rule for determining the pair
of ( p1′ , p2′ ) in the null hypothesis that is the most difficult to reject when (p1,
p2) is the true pair of the probabilities of a success. The change in the esti-
mated sample size may not be dramatically affected unless θo is very small,
or p1 and p2 are close to 0 or 1. It should be understood that p1′ = θo(kp1 + p2)/
(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo) need not both be between 0 and 1 (e.g.,
when p1 = p2 = 0.9, θo = 0.5 and k = 1, p2′ = 1.2).
Example 11.10 compares and contrasts the sample-size formulas in
Equations 11.19 and 11.20.
Example 11.10
We will first compare and contrast the sample-size formulas in Equations 11.19
and 11.20 at both 80% and 90% power for three cases based on a one-to-one
randomization (k = 1). The values for θo, p1, and p2 are provided for each case
below.
The values chosen for p1′ and p2′ will be based on the formulas p1′ = θo(p1 + p2)/
(1 + θo) and p2′ = (p1 + p2)/(1 + θo). The results are summarized in Table 11.4.
In all cases examined, the sample size was smaller using formula 11.20 than for-
mula 11.19. In each case, the respective calculated sample sizes from the formulas
are closer for 90% power than for 80% power. The calculated sample sizes using
formula 11.19 grew at a faster rate or at least a faster relative rate as the power
increased from 80% to 90%. This occurs because when the inference is based on
the distribution for pˆE − θ o pˆ C , σa /σo is larger than when the inference is based on
the distribution of the estimator of the log-relative risk. For case 1, there was very
little difference in the sample-size calculation. Case 2 had a smaller value for θo
TABLE 11.4
Calculated Sample Sizes for 80% and 90% Power for Cases 1 through 3
Sample Size per Arm
Power (%) θo (p1, p2) (p1′ , p2′ ) log( pˆ E / pˆ C )a pˆ E − θ o pˆ Cb
80 0.7 (0.4, 0.4) (0.329, 0.471) 192 190
90 0.7 (0.4, 0.4) (0.329, 0.471) 256 255
80 0.3 (0.04, 0.04) (0.018, 0.062) 336 284
90 0.3 (0.04, 0.04) (0.018, 0.062) 435 403
80 0.1 (0.04, 0.04) (0.007, 0.073) 168 90
90 0.1 (0.04, 0.04) (0.007, 0.073) 204 141
a Calculations based on Equation 11.19.
b Calculations based on Equation 11.20.
and values for p1 and p2 close to zero. The calculated sample size begins to devi-
ate between the two formulas. Deviation is larger despite the sample sizes being
smaller for case 3, which had an even smaller value for θo than case 2. When p1′ =
p1 and p2′ = p2, the results are summarized in Table 11.5.
For case 1, comparing Tables 11.4 and 11.5, there was only moderate change in
the calculated sample sizes using p1′ = p1 = 0.4 and p2′ = p2 = 0.4 instead of p1′ =
θo(p1 + p2)/(1 + θo) = 0.329 and p2′ = (p1 + p2)/(1 + θo) = 0.471. For cases 2 and 3,
there were dramatic changes in the calculated sample sizes. Although in Table
11.4, for all cases examined, the sample size was smaller using formula 11.20 than
formula 11.19, the reverse is seen in Table 11.5. As with Table 11.4, for cases 2 and
3, there were different calculated sample sizes between formulas 11.19 and 11.20.
For each case when the inference is based on the log-relative risk estimator, the
calculated sample size decreases when p1′ = p1 and p2′ = p2 is used instead of p1′ =
θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo). This is because when p1′ + p2′ is fixed,
(1− p1′)/p1′ + (1− p2′ )/p2′ = 1/p1′ + 1/p2′ – 2 becomes smaller when the probabilities
p1′ and p2′ become more similar.
Conversely, for an inference based on p̂E − θo p̂C, the calculated sample size
increases (p1′(1 − p1′) + θo² p2′(1 − p2′) increases) when p1′ = p1 and p2′ = p2 is used
TABLE 11.5
Calculated Sample Sizes when p1′ = p1 and p2′ = p2
Sample Size per Arm
Power (%) θo (p1, p2) ( p1′ , p2′ ) log( pˆ E / pˆ C )a pˆ E − θ o pˆ C b
80 0.7 (0.4, 0.4) (0.4, 0.4) 186 195
90 0.7 (0.4, 0.4) (0.4, 0.4) 248 261
80 0.3 (0.04, 0.04) (0.04, 0.04) 260 420
90 0.3 (0.04, 0.04) (0.04, 0.04) 348 561
80 0.1 (0.04, 0.04) (0.04, 0.04) 72 235
90 0.1 (0.04, 0.04) (0.04, 0.04) 96 315
a Calculations based on Equation 11.19.
b Calculations based on Equation 11.20.
instead of p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo). When p1′ + p2′ = s for
fixed s and 0 ≤ θo ≤ 1, the maximum value of p1′(1 − p1′) + θo² p2′(1 − p2′) occurs when
p1′ = min{ s, (1 + 2sθo² − θo²)/(2θo² + 2) }.
Example 11.10 illustrates that the choice for p1′ and p2′ can have a small or
rather large effect on the calculated sample size depending on the value for θo
and the expected success probabilities. Whenever the calculated sample size
changes greatly as the choices for p1′ and p2′ change, simulations should be
used to find the appropriate sample size or validate a calculated sample size.
The total study size, (1 + k)nC, when the inference is based on log(p̂E/p̂C) is

[ ( zβ √( (1 − p1)/(kp1) + (1 − p2)/p2 ) + zα/2 √( (1 − p1′)/(kp1′) + (1 − p2′)/p2′ ) ) / (log θa − log θo) ]² (1 + k),  (11.21)

and when the inference is based on p̂E − θo p̂C it is

[ ( zβ √( p1(1 − p1)/k + θo² p2(1 − p2) ) + zα/2 √( p1′(1 − p1′)/k + θo² p2′(1 − p2′) ) ) / (p1 − θo p2) ]² (1 + k).  (11.22)
In either case, the optimal k that minimizes Equation 11.21 or 11.22 can be
found by using calculus or by a “grid search.” Example 11.11 compares and
contrasts the sample-size formulas in Equations 11.21 and 11.22.
Example 11.11
We will compare and contrast the results for the optimal overall study size based
on formulas 11.21 and 11.22 for cases 1 through 3 in Example 11.10 at both 80%
and 90% power when p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo). The
results are summarized in Tables 11.6 and 11.7.
We see that the study size reduction is more prominent when the inference is
based on pˆE − θ o pˆ C . Within a case, the optimal k for 90% power is smaller than
that for 80% when the inference is based on the distribution of log( pˆE / pˆ C ) , but
larger when the inference is based on the distribution of pˆE − θ o pˆ C.
Tables 11.8 and 11.9 provide analogous results on the calculation of the optimal
allocation ratio, k, and the corresponding study size when p1′ = θo(kp1 + p2)/(1 +
kθo) and p2′ = (kp1 + p2)/(1 + kθo).
TABLE 11.6
Sample Sizes for Log-Relative Risk Based on Optimal Allocation
log( pˆ E / pˆ C )
Reduction in
Case Power (%) Ratio nE nC n Study Sizea
1 80 1.23 210 170 380 4 (1%)
1 90 1.20 277 231 508 4 (1%)
2 80 1.55 390 251 641 31 (5%)
2 90 1.47 499 341 840 30 (3%)
3 80 2.34 203 87 290 46 (13%)
3 90 2.11 246 117 363 45 (11%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.19.
For this scenario, the calculated sample sizes are more similar between formulas
11.19 and 11.20 than in the earlier scenarios. Although earlier, when p1′ = θo(p1 +
p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo), the study reduction was more prominent
using an optimal allocation ratio when the inference is based on pˆE − θ o pˆ C , we see
that when p1′ = θo(kp1 + p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo), the study size
reduction is more prominent using an allocation ratio when the inference is based
on log( pˆE / pˆ C ). As before, within a case, the optimal k for 90% power is smaller
than that for 80% when the inference is based on the distribution of log( pˆE / pˆ C ) ,
but larger when the inference is based on the distribution of pˆE − θ o pˆ C . Compared
with when p1′ = θo(p1 + p2)/(1 + θo) and p2′ = (p1 + p2)/(1 + θo), the calculated study
size for the optimal allocation ratio is smaller when the inference is based on the
distribution of log( pˆE / pˆ C ) when p1′ = θo(kp1 + p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 +
kθo), but larger when the inference is based on the distribution of pˆE − θ o pˆ C . For
cases 1 through 3, the optimal allocation ratios when p1′ = θo(p1 + p2)/(1 + θo) and
p2′ = (p1 + p2)/(1 + θo) and the inference based on the distribution of log( pˆE / pˆ C )
were similar to the optimal allocation ratios when p1′ = θo(kp1 + p2)/(1 + kθo) and
p2′ = (kp1 + p2)/(1 + kθo) and the inference based on the distribution of pˆE − θ o pˆ C .
Likewise, the optimal allocation ratios when p1′ = θo(kp1 + p2)/(1 + kθo) and p2′ =
(kp1 + p2)/(1 + kθo) with the inference is based on the distribution of log( pˆE / pˆ C )
TABLE 11.7
Sample Sizes for pˆ E − θ o pˆ C Based on Optimal Allocation
pˆ E − θ o pˆ C
Reduction in
Case Power (%) Ratio nE nC n Study Sizea
1 80 1.37 214 156 370 10 (3%)
1 90 1.38 288 209 497 13 (3%)
2 80 2.10 337 161 498 70 (12%)
2 90 2.34 486 208 694 112 (14%)
3 80 4.53 105 23 128 52 (29%)
3 90 5.00 163 33 196 86 (30%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.20.
TABLE 11.8
Sample Sizes for Log-Relative Risk Based on Optimal Allocation
Reduction in
Case Power (%) ( p1′ , p2′ ) Ratio nE nC n Study Sizea
1 80 (0.342, 0.489) 1.53 222 146 368 16 (4%)
1 90 (0.340, 0.486) 1.44 293 203 496 16 (3%)
2 80 (0.024, 0.079) 2.40 394 165 559 113 (17%)
2 90 (0.023, 0.077) 2.13 515 242 757 113 (13%)
3 80 (0.016, 0.164) 5.24 154 30 184 152 (45%)
3 90 (0.015, 0.146) 4.19 206 50 256 152 (37%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.19.
were similar to the optimal allocation ratios when p1′ = θo(p1 + p2)/(1 + θo) and p2′ =
(p1 + p2)/(1 + θo) and the inference is based on the distribution of pˆE − θ o pˆ C.
These examples illustrate how the “optimal” allocation ratio depends on the
selection of p1′ and p2′ . Therefore, for the relative risk, the “optimal” allocation
ratio should be interpreted with caution. A moderate or even large change
in the allocation ratio often provides only a small change in the power (for a
fixed sample size) or sample size (for fixed power). Also, the allocation ratio
selected to maximize the power for the analysis of the primary efficacy end-
point may not be optimal or appropriate for the evaluation of secondary effi-
cacy endpoints and/or safety endpoints.
When the inference is based on log( pˆ E /pˆ C ) and p1′ = p1 = p2′ = p2, the optimal
allocation ratio will be k = 1 (and thus the sample sizes are those provided
in Table 11.5). When the inference is based on pˆ E − θ o pˆ C and p1′ = p1 = p2′ = p2,
the optimal allocation ratio will be k = 1/θo. The sample sizes are provided
in Table 11.10 for cases 1 through 3. Compared with Table 11.9 when p1′ =
θo(kp1 + p2)/(1 + kθo) and p2′ = (kp1 + p2)/(1 + kθo), the sample sizes for the
optimal allocation ratio are larger when p1′ = p1 and p2′ = p2, as are the cor-
responding calculated optimal allocation ratios.
TABLE 11.9
Sample Sizes for pˆ E − θ o pˆ C Based on Optimal Allocation
Reduction in
Case Power (%) ( p1′ , p2′ ) Ratio nE nC n Study Sizea
1 80 (0.338, 0.482) 1.32 212 160 372 8 (2%)
1 90 (0.338, 0.483) 1.34 286 214 500 10 (2%)
2 80 (0.020, 0.068) 1.43 325 226 551 17 (3%)
2 90 (0.021, 0.070) 1.59 471 296 767 39 (5%)
3 80 (0.011, 0.107) 2.27 106 47 153 27 (15%)
3 90 (0.012, 0.117) 2.71 169 63 232 50 (18%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.20.
TABLE 11.10
Sample Sizes for pˆ E − θ o pˆ C Based on Optimal Allocation
Reduction in
Case Power (%) ( p1′ , p2′ ) Ratio nE nC n Study Sizea,b
1 80 (0.4, 0.4) 1.43 223 156 379 1 (0%)
1 90 (0.4, 0.4) 1.43 298 209 507 3 (1%)
2 80 (0.04, 0.04) 3.33 500 150 650 –82 (–14%)
2 90 (0.04, 0.04) 3.33 669 201 870 –64 (–8%)
3 80 (0.04, 0.04) 10 256 26 282 –102 (–57%)
3 90 (0.04, 0.04) 10 343 34 377 –95 (–34%)
a Reduction is relative to the sample-size calculation in Table 11.4 using Equation 11.20.
b Reductions in sample sizes relative to Table 11.5, using Equation 11.20, are 11 (3%), 15 (3%),
190 (24%), 252 (22%), 188 (40%), 253 (40%), respectively.
ω = p/(1 − p),

and inversely

p = ω/(1 + ω).

When the odds ω = 2, p = 2/3, indicating 2 successes for every failure. For a
comparative study with two binomial parameters, pE and pC, the odds ratio
between the new treatment (E) and the control (C) is

ψ = [ pE/(1 − pE) ] / [ pC/(1 − pC) ].
It can be seen that the odds ratio = relative risk of a success ÷ relative risk of
a failure. When the probabilities of a success are very small (the relative risk
of a failure ≈ 1), the odds ratio is approximately equal to the relative risk of
a success.
When a success is a desirable outcome, the hypotheses for testing that the
experimental therapy is noninferior to the control therapy based on a pre-
specified threshold of ψo (0 < ψo ≤ 1) are

Ho: ψ ≤ ψo versus Ha: ψ > ψo.  (11.23)

Conditional on the total number of successes S = s, the number of successes Y in the experimental group has the probability function

P(Y = y | s, ψ) = (nE choose y)(nC choose s − y) ψ^y / Σ_k (nE choose k)(nC choose s − k) ψ^k,  (11.24)
where the permissible values of y and k consist of all integers within the
range max(0, s – nC) to min(nE, s). This is called the extended hypergeomet-
ric distribution, and more details can be found in the papers of Zelterman51
and Johnson et al.40 Note that for a classical null hypothesis of unity odds
ratio (Ho: ψ = 1), the probability function in Equation 11.24 will reduce to the
hypergeometric distribution under the null hypothesis.
Suppose yobs is the observed number of positive responses in the new treatment group. The exact p-value for testing the hypotheses in Expression 11.23 is then Σ_{i=yobs}^{B} P(Y = i|s, ψo), and exact confidence limits (ψL, ψU) for ψ satisfy

Σ_{i=A}^{yobs} P(Y = i|s, ψU) = α/2 and Σ_{i=yobs}^{B} P(Y = i|s, ψL) = α/2,

where A = max(0, s − nC) and B = min(nE, s). Agresti and Min15 have also discussed the construction of confidence intervals for the odds ratio by inverting a two-sided test.
The above method of analyzing the odds ratio in a single 2 × 2 table has been
extended to analyze a common odds ratio in a series of 2 × 2 tables.54–57
The sample estimator of the odds ratio is

ψ̂ = y(nC − x)/[x(nE − y)] = p̂E(1 − p̂C)/[p̂C(1 − p̂E)].

If there are zero counts, the following amended estimator has been shown to have good large-sample behavior58:

ψ̃ = (y + 0.5)(nC − x + 0.5)/[(x + 0.5)(nE − y + 0.5)].
The standard error of log ψ̂ is estimated by

σ̂ = [1/Y + 1/(nE − Y) + 1/X + 1/(nC − X)]^(1/2) = [1/(nE p̂E(1 − p̂E)) + 1/(nC p̂C(1 − p̂C))]^(1/2),

where Y and X denote the numbers of successes in the experimental and control arms.
To protect against having a zero cell count, 1 or 0.5 can be added to each cell
count in the estimator of standard error. On the basis of the asymptotic nor-
mality of log ψ̂, a two-sided 100 (1 – α)% Wald’s confidence interval for log ψ
is given by
log ψ̂ ± zα/2 σ̂
where zα/2 is the upper α/2 percentile of the standard normal distribution.
Therefore, a confidence interval (ψ L,ψ U) for the odds ratio (ψ) can be obtained
by exponentiating the above limits.
For testing the non-inferiority hypotheses in Expression 11.23, the test statistic is

Z = (log ψ̂ − log ψo)/σ̂,

where Z has an approximate standard normal distribution when ψ = ψo and Φ(•) denotes the standard normal distribution function, so that the p-value is 1 − Φ(Z). The null hypothesis will be rejected if Z > zα/2 or, equivalently, if the p-value is less than α/2. Equivalently, one can compare the lower limit (ψL) of the 100(1 − α)% confidence interval for ψ with ψo. The non-inferiority hypothesis will be rejected at the one-sided α/2 level if ψL > ψo.
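These calculations are simple to carry out directly. The following Python sketch is a minimal illustration: the function name is ours, the inputs in the final line are hypothetical, and the 0.5 cell correction is applied throughout (in both the amended point estimate and the standard error), which is one of the choices described above.

    import numpy as np
    from scipy.stats import norm

    def odds_ratio_wald(y, n_e, x, n_c, psi_o, alpha=0.05):
        """Wald test of Ho: psi <= psi_o and a two-sided 100(1 - alpha)% CI
        for the odds ratio, adding 0.5 to each cell (the amended estimator)
        to protect against zero counts."""
        psi_hat = ((y + 0.5) * (n_c - x + 0.5)) / ((x + 0.5) * (n_e - y + 0.5))
        se = np.sqrt(1 / (y + 0.5) + 1 / (n_e - y + 0.5)
                     + 1 / (x + 0.5) + 1 / (n_c - x + 0.5))
        z = (np.log(psi_hat) - np.log(psi_o)) / se
        z_crit = norm.ppf(1 - alpha / 2)
        ci = np.exp(np.log(psi_hat) + np.array([-1.0, 1.0]) * z_crit * se)
        return psi_hat, ci, z, 1 - norm.cdf(z)   # one-sided p-value

    # Hypothetical data: 40/50 successes (experimental), 35/50 (control), margin 0.5
    print(odds_ratio_wald(40, 50, 35, 50, 0.5))

Using the same corrected cells in the point estimate and the standard error keeps the test and the confidence interval consistent with each other.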
If there are covariates to be adjusted in the analysis, one can consider
performing a logistic regression with covariates or a log-linear model if all
covariates are categorical. Then the odds ratio between the treatments can be
estimated from the regression parameter.
For planning, let ψa denote the odds ratio at which power 1 − β is desired and k the allocation ratio. The control-arm sample size is given by

nC = {zβ[(kp1(1 − p1))^(−1) + (p2(1 − p2))^(−1)]^(1/2) + zα/2[(kp1′(1 − p1′))^(−1) + (p2′(1 − p2′))^(−1)]^(1/2)}²/(log ψa − log ψo)²   (11.25)

and the overall sample size by nC(1 + k):

n = {zβ[(kp1(1 − p1))^(−1) + (p2(1 − p2))^(−1)]^(1/2) + zα/2[(kp1′(1 − p1′))^(−1) + (p2′(1 − p2′))^(−1)]^(1/2)}²(1 + k)/(log ψa − log ψo)².

As before, the optimal k that minimizes the above expression can be found by using calculus or by a "grid search."
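A grid search is easy to sketch in Python. The function below encodes the total sample size (1 + k) times Equation 11.25 as reconstructed above, so it inherits that reconstruction's assumptions; the parameter values in the usage lines are hypothetical. For symmetric inputs (p1′ = p1 = p2′ = p2), the search returns k = 1.

    import numpy as np
    from scipy.stats import norm

    def total_size(k, p1, p2, p1p, p2p, psi_a, psi_o, alpha=0.05, power=0.80):
        # Total sample size (1 + k) times Equation 11.25 (as reconstructed above)
        z_beta, z_half = norm.ppf(power), norm.ppf(1 - alpha / 2)
        v = np.sqrt(1 / (k * p1 * (1 - p1)) + 1 / (p2 * (1 - p2)))
        vp = np.sqrt(1 / (k * p1p * (1 - p1p)) + 1 / (p2p * (1 - p2p)))
        return (1 + k) * (z_beta * v + z_half * vp) ** 2 / (np.log(psi_a) - np.log(psi_o)) ** 2

    ks = np.arange(0.25, 4.001, 0.005)                     # grid of allocation ratios
    sizes = [total_size(k, 0.4, 0.4, 0.4, 0.4, 1.0, 0.5) for k in ks]
    print(round(ks[int(np.argmin(sizes))], 3))             # prints 1.0 for symmetric inputs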
TABLE 11.11
Integrals Representing Posterior Probability of Non-Inferiority
Characteristic                Determination of the Posterior Probability
Difference in proportions     For any 0 ≤ k ≤ 1, P(pE − pC > −k) = 1 − ∫_k^1 ∫_0^{v−k} gE(u|y)gC(v|x) du dv.
Odds ratio                    For any 0 < k < 1, P([pE/(1 − pE)]/[pC/(1 − pC)] > k) = ∫_0^1 ∫_{kv/(1+(k−1)v)}^1 gE(u|y)gC(v|x) du dv.
Example 11.12
This example examines the sensitivity of the inferences to the choice of prior distributions for the probability of a response. For each case, independent beta
distributions are selected as the prior distribution for the probability of a response
for the control and the experimental arms. Table 11.12 summarizes the equal-
tailed 95% credible intervals for the difference, relative risk, and odds ratio of the
probabilities of a response under seven different pairs of the prior distributions.
Each credible interval was based on 1 million simulations from the corresponding
beta posterior distributions. The first case uses the limiting posterior distributions,
as the parameters (α and β) for each prior distribution tend toward zero. This
establishes a limiting beta posterior distribution where the inference is essentially
based entirely on the data (the mean of each posterior distribution is the respec-
tive observed proportion of responders). The second case has a Jeffreys prior
distribution for each probability of a response. The remaining cases are based
on having prior information on each probability of a response that is essentially
equivalent to having response data on 40 subjects. The third case is essentially
equivalent to beginning with 20 responders and 20 nonresponders for each arm.
When compared with case 1, this choice of a common prior distribution makes
the proportion of responders between arms more similar and closer to 0.5, while
not having a great impact on the variance of the posterior distributions. The fourth
case represents starting with 34 responders out of 40 subjects (85%, the same
as the observed proportion in the control arm) in each arm. When compared
with case 1, this choice of a common prior distribution makes the proportion of
responders between arms more similar and reduces the variance of each posterior
distribution. The fifth case represents starting with 34 responders out of 40 sub-
jects in the control arm and 32 responders out of 40 subjects in the experimen-
tal arm. When compared with case 1, this choice of prior distributions does not
change the proportion of responders in the respective arms while reducing the
variance of each posterior distribution.
The sixth case represents starting with 34 responders out of 40 subjects in the control arm and 28 responders out of 40 subjects in the experimental arm. For a non-inferiority margin of 15%, this case starts with observed proportions whose difference equals the margin.
TABLE 11.12
Equal-Tailed 95% Credible Intervals for a Difference, Relative Risk, and Odds Ratio
Case   Prior Parameters                             pE − pC            pE/pC            Odds Ratio
1      C: α→0, β→0; E: α→0, β→0                     (–0.155, 0.055)    (0.825, 1.070)   (0.329, 1.468)
2      C: α = 0.5, β = 0.5; E: α = 0.5, β = 0.5     (–0.155, 0.055)    (0.824, 1.070)   (0.336, 1.464)
3      C: α = 20, β = 20; E: α = 20, β = 20         (–0.139, 0.068)    (0.825, 1.098)   (0.487, 1.418)
4      C: α = 34, β = 6; E: α = 34, β = 6           (–0.123, 0.051)    (0.861, 1.064)   (0.406, 1.449)
5      C: α = 34, β = 6; E: α = 32, β = 8           (–0.139, 0.039)    (0.842, 1.048)   (0.372, 1.309)
6      C: α = 34, β = 6; E: α = 28, β = 12          (–0.170, 0.012)    (0.807, 1.015)   (0.317, 1.085)
7      C: α = 34, β = 6; E: α = 20, β = 20          (–0.231, –0.040)   (0.737, 0.950)   (0.238, 0.785)
If the results of a clinical trial are to stand alone, the α’s and β’s for the prior
distributions should be relatively small when compared to the sample size.
Otherwise, as would be done in a meta-analysis, the use of a beta prior dis-
tribution for each arm involves integrating prior successes and failures with
the successes and failures in the present clinical trial.
We note again that it is the size of the parameters for the prior distributions that can be influential rather than the prior probability of non-inferiority, inferiority, or superiority. Suppose for the experimental and control arms, the prior distributions for the probability of a response are beta distributions with α = 5 × 10⁻¹⁰ and β = 9.5 × 10⁻⁹ for the experimental arm and α = 9.5 × 10⁻⁹ and β = 5 × 10⁻¹⁰ for the control arm. Then the prior probability that pC > pE is greater than 0.9 (90%). However, these prior distributions lose any real impact once the response status is known for at least one patient in each arm.
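These posterior quantities are typically computed by simulation, as in Example 11.12. A minimal Python sketch follows, assuming hypothetical data (34 of 40 control responders, 32 of 40 experimental responders) and the Jeffreys priors of case 2; with a beta prior for each arm, the posterior for each arm is again a beta distribution.

    import numpy as np

    rng = np.random.default_rng(20110515)
    x, n_c = 34, 40   # hypothetical control data: 34 responders of 40
    y, n_e = 32, 40   # hypothetical experimental data: 32 responders of 40
    a, b = 0.5, 0.5   # Jeffreys prior for each arm (case 2 of Table 11.12)

    # With a beta prior, each posterior is beta(alpha + successes, beta + failures)
    p_c = rng.beta(a + x, b + (n_c - x), 1_000_000)
    p_e = rng.beta(a + y, b + (n_e - y), 1_000_000)

    print("P(pE - pC > -0.15):", (p_e - p_c > -0.15).mean())   # posterior prob. of NI
    for label, draws in [("difference", p_e - p_c), ("relative risk", p_e / p_c),
                         ("odds ratio", (p_e / (1 - p_e)) / (p_c / (1 - p_c)))]:
        print(label, np.percentile(draws, [2.5, 97.5]))        # equal-tailed 95% intervals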
In addition to using posterior probabilities for testing non-inferiority, Kim
and Xue60 discussed two other Bayesian approaches for non-inferiority test-
ing. The first alternative approach determines the 5% contour region defined
as those possible (pE,pC) whose joint posterior density is 5% of the joint den-
sity of the mode. When the 5% contour region lies entirely within the non-
inferiority region (the alternative hypothesis), non-inferiority is concluded.
The second alternative approach concludes non-inferiority whenever the 95%
credible set of highest posterior probability for (pE,pC) lies entirely within the
non-inferiority region (the alternative hypothesis).
TABLE 11.13
Distribution of Successes and Failures between Arms across Strata
Experimental Arm Control Arm
Strata Success Failure Success Failure
1 40 20 30 30
2 30 30 20 40
Total 70 50 50 70
Ideally, the same method used in determining the effect of the control therapy and the non-inferiority margin should also be used in comparing the experimental and control arms in the non-inferiority trial. If not, some "adjustment" may be needed to the non-inferiority margin.
A Cochran–Mantel–Haenszel procedure is a very common stratified anal-
ysis when testing for the difference (or superiority or inferiority) between
two treatment arms on a binary endpoint. Essentially, the numerator of the test statistic is the sum across strata, within one of the arms, of the differences between the observed and expected numbers of successes (the expectations computed assuming no difference in success rates between arms). The denominator is
an estimate of the corresponding standard deviation under the assumption
that the success rate is equal between arms within each stratum. This is one
of the primary ways of performing a stratified analysis.
Another type of stratified or adjusted analysis adjusts with respect to some
preset relative frequency of some characteristic or combinations of charac-
teristics in a target population. A particular subpopulation or stratum would
consist of subjects that have the same level for that chosen characteristic or
the same combination of levels of many characteristics. For valid compari-
sons, the same relative weights should be used for each arm.
Suppose the target population consists of k strata or subpopulations, where the ith subpopulation makes up a known proportion ai of the target population (Σ_{i=1}^k ai = 1). The estimated success rate for the target population is then given by p̂Target = Σ_{i=1}^k ai p̂i, where p̂i is some estimator of pi, the success rate for the ith subpopulation. If p̂i is an unbiased estimator of pi, then p̂Target will be an unbiased estimator of the true success rate for the target population. Each p̂i can be modeled as having a normal distribution with mean pi and variance pi(1 − pi)/bi, where bi is the number of Bernoulli trials observed for the ith subpopulation. Then p̂Target can be modeled as having a normal distribution with mean Σ_{i=1}^k ai pi and variance Σ_{i=1}^k ai² pi(1 − pi)/bi.
For a clinical trial, let p̂E,i and p̂C,i denote the observed proportions of "successes" in the ith stratum or subpopulation for the experimental and control arms, respectively. Then the respective estimators for the target population are given by p̂E = Σ_{i=1}^k ai p̂E,i and p̂C = Σ_{i=1}^k ai p̂C,i. The difference is given by

p̂E − p̂C = Σ_{i=1}^k ai(p̂E,i − p̂C,i).

Thus, the difference in the overall estimated rates is a weighted average (same weights) of the differences in the observed rates within each stratum or subpopulation. The overall relative risk is given by

p̂E/p̂C = Σ_{i=1}^k ai p̂E,i / Σ_{i=1}^k ai p̂C,i = Σ_{i=1}^k (ai p̂C,i)(p̂E,i/p̂C,i) / Σ_{i=1}^k ai p̂C,i.

Thus, the overall relative risk for the target population can be expressed as a weighted average, with random weights, of the relative risks within the strata. The overall odds ratio is given by

p̂E(1 − p̂C)/[p̂C(1 − p̂E)] = Σ_{i=1}^k Σ_{j=1}^k ai aj p̂E,i(1 − p̂C,j) / Σ_{i=1}^k Σ_{j=1}^k ai aj(1 − p̂E,i)p̂C,j.

Since this expression involves products of terms calculated from different strata, this odds ratio estimator cannot be expressed as a weighted average of the within-stratum odds ratios. Discussion and inferences about odds ratios provided later will instead be based on a common or "average" odds ratio.
There are many choices for how to do a stratified or adjusted non-inferiority
analysis of a binary endpoint. This goes beyond whether a difference in pro-
portions, a relative risk, or an odds ratio is chosen as the basis for making an
inference. When comparing two proportions or probabilities in a randomized,
stratified clinical trial, one method for comparing the difference in propor-
tions uses the overall strata sizes as the common weights for each arm. This
allows for a comparison of the two proportions with respect to a target popu-
lation that has the same breakdown for the strata levels as observed in the
study. This is also consistent with the one proportion problem in estimating a
common proportion across strata (or studies). When it is assumed that the true
probability of a success for an arm is constant across strata (or studies), the
MLE of the common probability of a success for that arm uses the total num-
ber of subjects in that stratum for just that arm (or the study size for that arm)
as weights. This will lead to the overall proportion of successes as the estimate
of the common probability of a success. However, it may not be reasonable to assume for a given arm that the true probability of a success is constant across strata. It should be noted that in a clinical trial, the most prognostic factors are usually selected as stratification factors; the success rate can thus be expected to vary greatly across the levels of a stratification factor.
Another adjusted analysis weights the observed within-stratum risk differences, estimating a common overall risk difference (Δi ≡ Δ) by

Δ̂w = Σ_{i=1}^k wi Δ̂i / Σ_{i=1}^k wi,

where wi = 1/[xE,i(nE,i − xE,i)/nE,i + xC,i(nC,i − xC,i)/nC,i]. Relative to the proportion of all observations that are in a stratum, this estimator downweights those strata where the observed success rates are near 0.5 and overweights those strata where the observed success rates are close to 0 or 1. For most clinical trials, the risk difference will not be constant or approximately constant across strata. A related analysis uses the harmonic mean of the numbers of subjects in the experimental and control arms within a stratum as the stratum weight, as in the Mantel–Haenszel estimator below.
The Mantel–Haenszel estimator of the common risk difference across strata is given by

Δ̂MH = Σ_{i=1}^k [(xE,i nC,i − xC,i nE,i)/Ni] / Σ_{i=1}^k nC,i nE,i/Ni = Σ_{i=1}^k wi Δ̂i / Σ_{i=1}^k wi,

where wi = nC,i nE,i/Ni = 1/(1/nC,i + 1/nE,i) and Δ̂i = xE,i/nE,i − xC,i/nC,i; the notation for the cell counts in the ith stratum is given in Table 11.14. Equivalently, the weight for a given stratum can be considered as (proportional to) the harmonic mean of the within-stratum sizes for the experimental and control arms.

TABLE 11.14
Notation for the Cell Counts in the ith Stratum
           Experimental    Control
Success    xE,i            xC,i
Failure    nE,i − xE,i     nC,i − xC,i
Total      nE,i            nC,i           Ni
Estimators of a common relative risk across strata (θi ≡ θ) include a weighted average of the within-stratum log-relative risks (i.e., an estimator of log θ), which is given by

log θ̂w = Σ_{i=1}^k wi log θ̂i / Σ_{i=1}^k wi,

where wi = (1/xE,i − 1/nE,i + 1/xC,i − 1/nC,i)^(−1). The weight for a stratum is the inverse of the asymptotic variance of the respective log-relative risk estimator.
The Mantel–Haenszel estimators of a common relative risk and of a common odds ratio are given by

θ̂MH = Σ_{i=1}^k xE,i nC,i/Ni / Σ_{i=1}^k xC,i nE,i/Ni = Σ_{i=1}^k wi θ̂i / Σ_{i=1}^k wi   (11.27)

ψ̂MH = Σ_{i=1}^k xE,i(nC,i − xC,i)/Ni / Σ_{i=1}^k xC,i(nE,i − xE,i)/Ni = Σ_{i=1}^k wi ψ̂i / Σ_{i=1}^k wi   (11.28)
An estimator of the variance of log ψ̂MH, due to Robins, Breslow, and Greenland,63 is given by

Var̂(log ψ̂MH) = Σ_{i=1}^k Pi Ri/(2R+²) + Σ_{i=1}^k (Pi Si + Qi Ri)/(2R+S+) + Σ_{i=1}^k Qi Si/(2S+²)   (11.29)

where Pi = (xE,i + nC,i − xC,i)/Ni, Qi = (xC,i + nE,i − xE,i)/Ni, Ri = xE,i(nC,i − xC,i)/Ni, Si = xC,i(nE,i − xE,i)/Ni, R+ = Σ_{i=1}^k Ri, and S+ = Σ_{i=1}^k Si. We will use this standard error estimate in an example to construct confidence intervals for the common or average odds ratio.
Mantel and Haenszel64 indicated their disbelief that the relative risk for an exposure factor would be constant across strata and suggested instead the use of an average relative risk or, rather, an average odds ratio.
Logistic Regression. A logistic regression model can be used to estimate a common log-odds ratio across all possibilities for a collection of covariates. For a logistic regression model, the log-odds of a success is a linear function of the covariate values of the given patient. A patient having baseline covariate values x1, …, xk (one or more of these covariates used to identify the treatment arm) has a log-odds of success of α + Σ_{i=1}^k βi xi. When the sample size is large, the MLEs will have approximate normal distributions. For fixed values of the baseline covariates, the treatment coefficient represents the common log-odds ratio between the experimental and control arms.
Example 11.13 illustrates the use of these methods.
Example 11.13
To illustrate these methods, data were simulated for a two-arm study having 200
subjects. One hundred subjects were randomized to each arm according to two
stratification factors having two levels each. The endpoint is a binary response.
Table 11.15 gives the subject breakdown according to treatment arm, stratification
factors, and response status.
From Equation 11.26, the Mantel–Haenszel estimates of pE and pC are 0.631
and 0.429, respectively. The respective estimates of the corresponding standard
deviations are 0.0479 and 0.0480. The approximate 95% confidence interval for
pE – pC is 0.069–0.335. From Equation 11.27, the Mantel–Haenszel estimate of the
relative risk pE /pC is 1.47. On the basis of Fieller’s method, the approximate 95%
confidence interval of 1.14–1.95 is found by solving for the values of x that satisfy
−1.96 < (0.631 − 0.429x)/[(0.0479)² + (0.0480x)²]^(1/2) < 1.96.
From Equation 11.28, the Mantel–Haenszel estimate of the common odds ratio
is 2.31 with approximate 95% confidence interval of 1.30–4.09 based on the
standard error estimate of the log-odds ratio of Robins, Breslow, and Greenland
in Equation 11.29.
For a logistic regression model using treatment arm and the two stratification
factors as factors in the model, the estimate of the common odds ratio is 2.31 with
corresponding approximate 95% confidence interval of 1.30–4.09, identical to the
Mantel–Haenszel estimate and the corresponding confidence interval. When an
interaction term for the stratification factors is added, the estimate of the common
odds ratio is 2.33 with a corresponding approximate 95% confidence interval of
1.31–4.15.
TABLE 11.15
Breakdown by Treatment Arm, Stratification Factors, and Response Status
Arm Factor 1 Factor 2 n Number of Responses
Experimental 0 0 30 21
0 1 25 15
1 0 23 15
1 1 22 12
Control 0 0 32 18
0 1 25 6
1 0 22 9
1 1 21 10
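A short Python sketch of the calculations in Example 11.13, using the cell counts of Table 11.15 (the four strata being the combinations of the two stratification factors), is given below. It should approximately reproduce the Mantel–Haenszel estimates and the confidence interval for the common odds ratio based on the Robins, Breslow, and Greenland standard error.

    import numpy as np

    # Table 11.15 cell counts; strata are the four factor-level combinations
    x_e = np.array([21., 15., 15., 12.]); n_e = np.array([30., 25., 23., 22.])
    x_c = np.array([18.,  6.,  9., 10.]); n_c = np.array([32., 25., 22., 21.])
    N = n_e + n_c

    rd_mh = ((x_e * n_c - x_c * n_e) / N).sum() / (n_c * n_e / N).sum()       # risk difference
    rr_mh = (x_e * n_c / N).sum() / (x_c * n_e / N).sum()                     # Equation 11.27
    or_mh = (x_e * (n_c - x_c) / N).sum() / (x_c * (n_e - x_e) / N).sum()     # Equation 11.28

    # Robins-Breslow-Greenland variance of log(or_mh), Equation 11.29
    P = (x_e + n_c - x_c) / N; Q = (x_c + n_e - x_e) / N
    R = x_e * (n_c - x_c) / N; S = x_c * (n_e - x_e) / N
    var_log = ((P * R).sum() / (2 * R.sum() ** 2)
               + (P * S + Q * R).sum() / (2 * R.sum() * S.sum())
               + (Q * S).sum() / (2 * S.sum() ** 2))
    ci = np.exp(np.log(or_mh) + np.array([-1.0, 1.0]) * 1.96 * np.sqrt(var_log))
    print(rd_mh, rr_mh, or_mh, ci)   # approx. 0.20, 1.47, 2.31, (1.30, 4.09)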
When the non-inferiority margin is a function δ(pC) of the control success rate, the hypotheses Ho: pE − pC ≤ −δ(pC) versus Ha: pE − pC > −δ(pC) (Expression 11.30) can be tested with the statistic

ZS = [p̂E − p̂C + δ(pC)] / [pE(1 − pE)/nE + pC(1 − pC)(1 − δ′(pC))²/nC]^(1/2),

where δ′ is the first derivative of δ. The values for pE and pC may correspond to the sample proportions or to the MLEs of the proportions under the null hypothesis. The hypotheses in Expression 11.30 can also be tested on the basis of the posterior probability that pE − pC > −δ(pC). Alternatively, the specific form for δ(pC) may dictate an appropriate method of analysis.
Various Variable Margins. For the risk difference and the relative risk, the
corresponding variable margin is a linear function of pC. Phillips66 pro-
posed the use of a linear function in pC, δ(pC) = a + bpC. A motivation was
to “fit” a line to the random margin provided in the U.S. Food and Drug
Administration (FDA) guidelines for anti-infective products67 by having that
margin based on pC. The value b < 0 in that fit is indicative of a margin that increases as pC decreases. Thus, when the success rate (and possibly the effect) of the active control appears smaller, the acceptable amount of inferiority becomes larger. This seems counterintuitive; it would make more sense to keep the margin the same or make it smaller as the perceived effect of the control therapy becomes smaller. When b > 0, the variable margin δ(pC) = a + bpC appropriately decreases as pC decreases. For a > 0 and 0 < b < 1, pC − (a + bpC) < 0 whenever pC < a/(1 − b). Thus, such a variable margin should be avoided whenever pC may be less than a/(1 − b).
Röhmel68,69 proposed various functions for δ(•). In one study, Röhmel69 proposed the use of δ(pC) = 0.223√(pC(1 − pC)) and δ(pC) = 0.333√(pC(1 − pC)) for the purpose of stabilizing the desired power and providing a variable margin fairly consistent with those provided in the FDA guidelines for anti-infective products.67 The power should be fairly stable among possibilities for pC that are not close to 0 or 1. Röhmel appears to have been recommending such a variable margin for real situations (e.g., antibiotics or anti-infective products) when the anticipated success rate is greater than 50%. For 0.50 < pC < 1, the margin δ(pC) = c√(pC(1 − pC)) increases as pC decreases. When it is anticipated that 0.50 < pC < 1, such a function for the margin is probably no longer appropriate for a non-inferiority registration trial. For 0 < pC < 0.50, the margin δ(pC) = c√(pC(1 − pC)) decreases as pC decreases, which is more appropriate. However, pC − c√(pC(1 − pC)) < 0 when pC is close to zero. Thus, such a variable margin should be avoided when pC may be close to zero.
By stabilizing the power for a given sample size, such a variable margin may be appropriate for a randomized phase 2 study to assist in making a go/no-go decision to phase 3. For a one-sided significance level of α/2 and power of 1 − β at pE = pC = p, where δ(pC) = c√(pC(1 − pC)) with c > 0, a crude sample-size calculation of the number of patients per arm is given by 2(zβ + zα/2)²/c² (derived from Equation 11.7 with k = 1, δ = c√(p(1 − p)), Δa = 0, and p1′ = p1 = p2 = p2′ = p).
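This crude calculation takes only a few lines of Python; a one-sided 0.025 level and 80% power are assumed, and c = 0.223 and c = 0.333 are the two choices of c noted above.

    from scipy.stats import norm

    def per_arm_size(c, alpha=0.05, power=0.80):
        """Crude per-arm sample size 2(z_beta + z_{alpha/2})^2 / c^2 for the
        variable margin delta(pC) = c * sqrt(pC(1 - pC))."""
        return 2 * (norm.ppf(power) + norm.ppf(1 - alpha / 2)) ** 2 / c ** 2

    print(per_arm_size(0.223), per_arm_size(0.333))   # the two values of c noted above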
For each control success rate considered and a sample size per arm of 150 or 300 subjects, 1000 trials were simulated, and the proportion in which non-inferiority was demonstrated with each approach was determined.
The second Bayesian approach (using a retrospective prior for the control
rate) and the exact likelihood ratio test maintained a type I error near 0.05 in
all cases. The Bayesian approach using independent uniform prior distribu-
tions slightly inflates the type I error rate in all cases. The observed event rate
procedure had an inflated type I error rate that was as high as 0.10 when the
control success rate was 15%. The inflation at all time points is mostly due to
the statistic ignoring the variability in δ ( pˆ C ). The particularly high inflated
type I error rate when the control rate was 15% appears to be due to the value
of δ(0.15) being smaller than what would be consistent with values of δ(pC)
for pC near 0.15. The value of δ (0.15) is 0.09, whereas from our interpolations
δ(pC) ≈ 0.04 + 0.4pC for pC near 0.15, which would lead to δ(0.15) ≈ 0.10. Thus,
although (pE,pC) = (0.24, 0.15) is on the boundary of the null hypothesis, the
general behavior of δ(pC) is such that (pE,pC) = (0.25, 0.15) is “expected” to be
on the boundary of the null hypothesis and (pE,pC) = (0.24, 0.15) is expected
to be just in the alternative hypothesis of non-inferiority. Hence, the added
inflation of the type I error rate when the control rate is 15%.
The second Bayesian approach tended to have the greatest power followed
by the observed event rate approach, and then followed by the first Bayesian
approach.
A type I error rate evaluation of using the testing procedure described in
the 1992 FDA Guidance67 is given in Example 11.14.
Example 11.14
Much of the work on variable margins for proportions has been motivated by experience in testing the efficacy of anti-infective products. The observed margin is given by δ̂ = 0.2 if p̂max < 0.8, δ̂ = 0.15 if 0.8 ≤ p̂max < 0.9, and δ̂ = 0.1 if p̂max ≥ 0.9, where p̂max = max{p̂E, p̂C}. Non-inferiority would be concluded if the lower limit of the 95% two-sided confidence interval for the experimental-versus-control difference in the cure rates is greater than the negative of the observed margin.
Formally, this test procedure does not perfectly correspond to a test of two specific statistical hypotheses. Statistical hypotheses can be specified for a test that is approximately equal to this non-inferiority test. The alternative hypothesis would be defined as the union of the sets {(pE, pC): pC < 0.8, pE > pC − 0.2}, {(pE, pC): 0.8 ≤ pC < 0.9, pE > pC − 0.15}, and {(pE, pC): pC ≥ 0.9, pE > pC − 0.1}. The null hypothesis is the complement. The variable margin is given by δ(pC) = 0.2 if pC < 0.8; δ(pC) = 0.15 if 0.8 ≤ pC < 0.9; and δ(pC) = 0.1 if pC ≥ 0.9.
For three possibilities in the null hypothesis and sample sizes of 150 and 300
per arm, the simulated probabilities of rejecting the null hypothesis (the type I
error rate) for testing these hypotheses by using the testing procedure described in
the 1992 FDA Guidance are provided in Table 11.16. The two-sided 95% Wald’s
confidence interval for the difference in proportions was used. Two possibilities
are located at or near where the variable margin changes. One million simulations
were used in each case. For the case where (pE, pC) = (0.65, 0.8) and the sample size is 150 per arm, only 124 simulations (about 1 in every 8000) had δ(p̂C) ≠ δ̂ = δ(max{p̂E, p̂C}). In all of these 124 simulations, non-inferiority was concluded using the smaller margin δ̂ = δ(max{p̂E, p̂C}). In all other studied cases for (pE, pC) and the sample size, observing δ(p̂C) ≠ δ̂ = δ(max{p̂E, p̂C}) was rarer and never influenced the conclusion on non-inferiority. Unless the sample size is small, non-inferiority will be demonstrated with respect to δ̂ = δ(max{p̂E, p̂C}) whenever p̂E > p̂C. Thus, for all practical purposes, the observed margin could have been regarded as δ(p̂C). When (pE, pC) = (0.65, 0.8), the type I error rate is
greatly inflated and tends to increase toward some value slightly larger than 0.5 as
the common sample size increases without bound. When (pE,pC) = (0.599, 0.799),
the type I error rate is slightly deflated and tends to increase toward 0.025 as the
common sample size increases without bound. When (pE,pC) = (0.70, 0.85), a
value on the boundary of the null hypothesis not very near a change point in the
variable margin, the type I error rate is inflated and tends to decrease toward 0.025
as the common sample size increases without bound.
TABLE 11.16
Type I Error Rates Consistent with Old FDA Guidelines
on Anti-Infective Products
Sample Size per Arm
In a matched-pair design, each pair yields two outcomes: for example, a sample may be split in two, with one part tested by a new assay (or diagnostic test) and the other by the standard method. When the outcome measure is dichotomous, risk differences and risk ratios are often used to compare treatments. In this section we describe the statistical methods for evaluating non-inferiority based on the difference and the ratio of two proportions in a matched-pair design. Methods appropriate for large samples and for small to moderate samples will be discussed.
Consider the matched-pair design in which two treatments (e.g., experi-
mental and control) are performed on the same n subjects. A “response” to a
treatment will be denoted by a “1,” whereas a “2” will denote “no response”
to the treatment. For any subject the possible outcomes are denoted by (1, 1),
(1, 2), (2, 1), and (2, 2), where the first (second) entry is the outcome to the experimental (control) treatment.
probabilities of the pairs and let a, b, c, and d (a + b + c + d = n) be the observed
numbers for the pairs (1, 1), (1, 2), (2, 1), and (2, 2), respectively. The observed
vector (a, b, c, d) is assumed to come from the usual multinomial distribution
model:
n!
P(( a, b , c , d) n,(q11 , q12 , q11 , q12 )) = a
q11 b
q12 c
q21 d
q22 .
a!b! c ! d!
Then pE = q11 + q12 and pC = q11 + q21 are the probabilities of a response to the experimental and control treatments, respectively. For a classical hypothesis test of no difference between the new and standard treatments, the McNemar test statistic71 is

ZM = (b − c)/(b + c)^(1/2).   (11.31)
For testing non-inferiority on the difference pE − pC with a margin δ, Tango72 derived the score test statistic

ZD = (b − c + nδ)/{n[2q̂21 − δ(δ + 1)]}^(1/2)   (11.33)

where the estimator q̂21 is the constrained MLE of q21 under pE − pC = −δ. Specifically, Tango72 showed that

q̂21 = [(B² − 4AC)^(1/2) − B]/(2A)   (11.34)

where the coefficients A, B, and C are functions of b, c, n, and δ given by Tango.72
The null hypothesis will be rejected at the one-sided α/2 level if ZD > zα/2, where zα/2 is the upper α/2 percentile of the standard normal distribution. Tango provided special cases for this test. From Equation 11.34, when δ = 0, corresponding to the test of no difference between the test and control treatments, q̂21 = (b + c)/(2n). Thus, the test statistic ZD in Equation 11.33 simplifies to the McNemar test statistic in Equation 11.31. When the off-diagonal cells are both zero (b = c = 0), q̂21 = δ and the test statistic reduces to
ZD = [nδ/(1 − δ)]^(1/2).
In other words, if there are no discordant pairs observed in the study, the non-inferiority hypothesis will be rejected at the one-sided α/2 level provided the sample size n is large enough:

n > zα/2²(1 − δ)/δ.
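As a quick Python sketch (the 10% margin in the usage line is illustrative only):

    import math
    from scipy.stats import norm

    def min_n_no_discordant(delta, alpha_half=0.025):
        """Smallest n rejecting the non-inferiority null when b = c = 0, i.e.,
        the smallest integer n with n > z^2 (1 - delta) / delta."""
        z = norm.ppf(1 - alpha_half)
        return math.floor(z ** 2 * (1 - delta) / delta) + 1

    print(min_n_no_discordant(0.10))   # a 10% margin requires n >= 35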
On the basis of the score test statistic ZD in Equation 11.33, a 100(1 − α)% confidence interval for the difference in proportions Δ = pE − pC = q12 − q21 can be constructed by solving for δ in the equations ZD = ±zα/2.
Example 11.15
In a crossover clinical trial,73 a chemical disinfection system was compared with a thermal disinfection system for soft contact lenses; the outcomes for the 44 subjects are summarized in Table 11.17.
It has been shown (see Tango72) that the score statistic ZD performs better in small samples than two other asymptotic tests, proposed by Lu and Bean74 and by Morikawa and Yanagawa.75 In particular, the type I error rate of the ZD test statistic is much closer to the nominal level than those of the other two statistics, while maintaining similar power. For a matched-pair design with
small sample sizes, an exact test of non-inferiority proposed by Hsueh, Liu,
and Chen76 can be used to guarantee control of the type I error rates. Sample-
size and power calculation methods have been developed in Lu and Bean74
and Nam.77 Nam77 showed that a method based on the score-type statistic
performed better than the method of Lu and Bean.74
TABLE 11.17
Outcomes of Disinfection Systems for Soft Contact Lenses
Thermal Disinfection
Chemical Disinfection Effective Ineffective Total
Effective 43 0 43
Ineffective 1 0 1
Total 44 0 44
For assessing non-inferiority based on the ratio of the two marginal response rates, the hypotheses are

Ho: pE/pC ≤ θo versus Ha: pE/pC > θo,   (11.35)

where 0 < θo < 1 is a prespecified acceptable threshold for the ratio of the two proportions. Rejection of the null hypothesis will lead to a conclusion of non-inferiority in the sense that the experimental treatment has a similar positive response rate compared with the standard treatment based on the marginal response rates. Extending the work of Tango,72 a score test statistic was derived by Tang, Tang, and Chan78 for testing the non-inferiority hypotheses in Expression 11.35:

ZR = [a + b − (a + c)θo]/{n[(1 + θo)q̂21 + (a + b + c)(θo − 1)/n]}^(1/2),   (11.36)
where q̂21 is the constrained MLE of q21 under the null hypothesis, given by the larger root of a quadratic equation; that is,

q̂21 = [(B² − 4AC)^(1/2) − B]/(2A)   (11.37)

where the coefficients A, B, and C are functions of the cell counts and θo given by Tang, Tang, and Chan.78
Tang, Tang, and Chan78 compared the performance of ZR in Equation
11.36 with several other potential test statistics, including one proposed by
Lachenbruch and Lynch.79 The empirical comparison showed that ZR was
the only test statistic that behaved satisfactorily in the sense that its empiri-
cal type I error rate was much closer to the desired, nominal level than those
for the other tests. The ZR statistic tends to be slightly conservative in cases
where pE and pC are large and the probability of discordance (q21) is low. In
addition, the empirical coverage probabilities of the confidence intervals
based on ZR were close to the nominal level, and the error rates of both tails
were generally similar. Sample sizes and power calculation formulas based
on the ZR statistic for both hypothesis testing and confidence interval estima-
tion were given by Tang et al.80
Example 11.16 illustrates the use of non-inferiority testing based on a rela-
tive risk in a matched-pair design.
Example 11.16
Tang, Tang, and Chan78 revisited the crossover clinical trial described in Example
11.15, where a chemical disinfection system was compared with a thermal dis-
infection system for soft contact lenses. Here suppose the interest is to assess
non-inferiority using the relative risk with a margin of 0.9 (requiring the response rate of the chemical method to be at least 90% of that of the thermal method). The observed risk ratio (chemical/thermal) is 0.977, with a 90% confidence interval based on ZR in Equation 11.36 of (0.904, 1.038). The p-value for testing the null hypothesis in Expression 11.35 is .044. The results indicate that the chemical method is non-inferior to the thermal method at a one-sided 0.05 level, but not at the one-sided 0.025 level.
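Since the coefficients A, B, and C of Equation 11.37 are given in Tang, Tang, and Chan,78 the Python sketch below instead obtains the constrained MLE of q21 by numerically maximizing the multinomial likelihood under Ho: pE = θo pC; this is an illustrative stand-in for the closed-form solution, not the authors' algorithm. Applied to the Table 11.17 data with θo = 0.9, it should approximately reproduce ZR ≈ 1.71 and the one-sided p-value of .044.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import xlogy
    from scipy.stats import norm

    def z_ratio(a, b, c, d, theta_o):
        """ZR of Equation 11.36 with the constrained MLE of q21 obtained by
        numerically maximizing the multinomial likelihood under
        Ho: pE = theta_o * pC (a sketch standing in for Equation 11.37)."""
        n = a + b + c + d

        def negloglik(q):
            q11, q21 = q
            q12 = theta_o * (q11 + q21) - q11        # imposed by the null constraint
            q22 = 1.0 - q11 - q12 - q21
            if min(q11, q12, q21, q22) < -1e-12:
                return 1e12                           # outside the simplex: penalize
            probs = np.clip([q11, q12, q21, q22], 0.0, 1.0)
            return -sum(xlogy(cnt, p) for cnt, p in zip((a, b, c, d), probs))

        res = minimize(negloglik, x0=[0.5, 0.2], method="Nelder-Mead",
                       options={"xatol": 1e-12, "fatol": 1e-12})
        q21_hat = res.x[1]
        zr = (a + b - (a + c) * theta_o) / np.sqrt(
            n * ((1 + theta_o) * q21_hat + (a + b + c) * (theta_o - 1) / n))
        return zr, 1 - norm.cdf(zr)

    print(z_ratio(43, 0, 1, 0, 0.9))   # Table 11.17 data; approx. (1.71, 0.044)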
References
1. Dann, R.S. and Koch, G.G., Review and evaluation of methods for computing
confidence intervals for the ratio of two proportions and considerations for non-
inferiority clinical trials, J. Biopharm. Stat., 15, 85–107, 2005.
2. Barnard, G.A., Significance tests for 2 × 2 tables, Biometrika, 34, 123–138,
1947.
3. Basu, D., On the elimination of nuisance parameters, J. Am. Stat. Assoc., 72, 355,
1977.
4. Chan, I.S.F., Exact tests of equivalence and efficacy with a non-zero lower bound
for comparative studies, Stat. Med., 17, 1403–1413, 1998.
5. Farrington, C.P. and Manning, G., Test statistics and sample size formulae for
comparative binomial trials with null hypothesis of non-zero risk difference or
non-unity relative risk, Stat. Med., 9, 1447–1454, 1990.
6. Suissa, S. and Shuster, J.J., Exact unconditional sample sizes for the 2 × 2 bino-
mial trial, J. R. Stat. Soc. A, 148, 317–327, 1985.
7. Haber, M., An exact unconditional test for the 2 × 2 comparative trials, Psychol.
Bull., 99, 129–132, 1986.
8. Dunnett, C.W. and Gent, M., Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables, Biometrics, 33, 593–602, 1977.
9. Chan, I.S.F. and Zhang, Z., Test-based exact confidence intervals for the differ-
ence of two binomial proportions, Biometrics, 55, 1201–1209, 1999.
10. Röhmel, J. and Mansmann, U., Unconditional non-asymptotic one-sided tests
for independent binomial proportions when the interest lies in showing non-
inferiority and/or superiority, Biom. J., 41, 149–170, 1999.
11. Andres, A.M. and Mato, A.S., Choosing the optimal unconditional test for com-
paring two independent proportions, Comput. Stat. Data Anal., 17, 555–574,
1994.
12. Chan, I.S.F., Providing non-inferiority or equivalence of two treatments with
dichotomous endpoints using exact methods, Stat. Method. Med. Res., 12, 37–58,
2003.
13. Clopper, C.J. and Pearson, E.S., The use of confidence or fiducial limits illus-
trated in the case of the binomial, Biometrika, 26, 404–413, 1934.
14. Santner, T.J. and Snell, M.K., Small-sample confidence intervals for p1 – p2 and
p1/p2 in 2 × 2 contingency tables, J. Am. Stat. Assoc., 75, 386–394, 1980.
15. Agresti, A. and Min, Y., On small-sample confidence intervals for parameters in
discrete distributions, Biometrics, 57, 963–971, 2001.
16. Chen, X., A quasi-exact method for the confidence intervals of the difference
of two independent binomial proportions in small sample cases, Stat. Med., 21,
943–956, 2002.
17. Coe, P.R. and Tamhane, A.C., Small sample confidence intervals for the differ-
ence, ratio, and odds ratio of two success probabilities, Commun. Stat. B Simul.,
22, 925–938, 1993.
18. Santner, T.J. and Yamagami, S., Invariant small sample confidence intervals for
the difference of two success probabilities, Commun. Stat. B Simul., 22, 33–59,
1993.
19. Fries, L.F. et al., Safety and immunogenicity of a recombinant protein influenza
A vaccine in adult human volunteers and protective efficacy against wild-type
H1N1 virus challenge, J. Infect. Dis., 167, 593–601, 1993.
20. Boschloo, R.D., Raised conditional level of significance for the 2×2-table when
testing the equality of two probabilities, Stat. Neerl., 24, 1–35, 1970.
21. Rodary, C., Com-Nougue, C., and Tournade, M.F., How to establish equivalence
between treatments: A one-sided clinical trial in paediatric oncology, Stat. Med.,
8, 593–598, 1989.
22. Hauck, W.W. and Anderson, S., A comparison of large sample confidence inter-
val methods for the differences of two binomial probabilities, Am. Stat., 40, 318–
322, 1986.
23. Ghosh, B.K., A comparison of some approximate confidence intervals for the
binomial parameter, J. Am. Stat. Assoc., 74, 894–900, 1979.
24. Vollset, S.E., Confidence intervals for a binomial proportion, Stat. Med., 12, 809–
824, 1993.
25. Agresti, A. and Coull, B.A., Approximate is better than ‘exact’ for interval esti-
mation of binomial proportions, Am. Stat., 52, 119–126, 1998.
26. Agresti, A. and Caffo, B., Simple and effective confidence intervals for propor-
tions and differences of proportions result from adding two successes and two
failures, Am. Stat., 54, 280–288, 2000.
27. Newcombe, R.G., Two-sided confidence intervals for the single proportion:
comparison of seven methods, Stat. Med., 17, 857–872, 1998.
28. Newcombe, R.G., Interval estimation for the difference between independent
proportions: Comparison of seven methods, Stat. Med., 17, 873–890, 1998.
29. Brown, L.D., Cai, T., and Dasgupta, A., Interval estimation for a binomial pro-
portion (with discussion), Stat. Sci., 16, 101–133, 2001.
30. Wilson, E.B., Probable inference, the law of succession, and statistical inference.
J. Am. Stat. Assoc., 22, 209–212, 1927.
31. Schouten, H.J.A. et al., Comparing two independent binomial proportions by a
modified chi-square test, Biom. J., 22, 241–248, 1980.
32. Tu, D., A comparative study of some statistical procedures in establishing thera-
peutic equivalence of nonsystemic drugs with binary endpoints, Drug Inf. J., 31,
1291–1300, 1997.
33. Li, Z. and Chuang-Stein, C., A note on comparing two binomial proportions in
confirmatory non-inferiority trials, Drug Inf. J., 40, 203–208, 2006.
34. Mee, R.W., Confidence bounds for the difference between two probabilities,
Biometrics, 40, 1175–1176, 1984.
35. Miettinen, O.S. and Nurminen, M., Comparative analysis of two rates, Stat.
Med., 4, 213–226, 1985.
36. Santner, T.J. et al., Small-sample comparisons of confidence intervals for the dif-
ference of two independent binomial proportions, Comput. Stat. Data Anal., 51,
5791–5799, 2007.
37. Dann, R.S. and Koch, G.G., Methods for one-sided testing of the difference
between proportions and sample size considerations related to non-inferiority
clinical trials, Pharm. Stat., 7, 130–141, 2008.
38. Hilton, J.F., Designs of superiority and non-inferiority trials for binary responses
are noninterchangeable, Biom. J., 48, 934–947, 2006.
39. Chan, I.S.F. and Bohidar, N.R., Exact power and sample size for vaccine efficacy
studies, Commun. Stat. Theory, 27, 1305–1322, 1998.
40. Johnson, N.L., Kotz, S., and Kemp, A.W., Univariate Discrete Distributions, Wiley,
New York, NY, 1992.
41. Werzberger, A. et al., A controlled trial of a formalin-inactivated hepatitis A vac-
cine in healthy children, New Engl. J. Med., 327, 453–457, 1992.
42. Wiens, B.L. et al., Duration of protection from clinical hepatitis A disease after
vaccination with VAQTA®, J. Med. Virol., 49, 235–241, 1996.
43. Temple, R., Problems in interpreting active control equivalence trials, Acct. Res.,
4, 267–275, 1996.
44. Jones, B. et al., Trials to assess equivalence: The importance of rigorous meth-
ods, Br. Med. J., 313: 36–39, 1996.
45. Ebbutt, A.F. and Frith, L., Practical issues in equivalence trials, Stat. Med., 17,
1691–1701, 1998.
46. Katz, D. et al., Obtaining confidence intervals for the risk ratio in cohort studies,
Biometrics, 34, 469–474, 1978.
47. Thomas, D.G. and Gart, J.J., A table of exact confidence limits for differences and ratios of two proportions and their odds ratios, J. Am. Stat. Assoc., 72, 73–76, 1977.
48. Koopman, P.A.R., Confidence intervals for the ratio of two binomial propor-
tions, Biometrics, 40, 513–517, 1984.
49. Bailey, B.J.R., Confidence limits to the risk ratio, Biometrics, 43, 201–205, 1987.
50. Gart, J.J. and Nam, J., Approximate interval estimation of the ratio of binomial
parameters: A review and corrections for skewness, Biometrics, 44, 323–338,
1988.
51. Zelterman, D., Models for Discrete Data, Oxford University Press, Oxford, 1999.
52. Cornfield, J., A statistical problem arising from retrospective studies, in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. IV, J. Neyman (ed.), 135–148, University of California Press, Berkeley, CA, 1956.
53. Gart, J.J., The comparison of proportions: A review of significance tests, confi-
dence intervals and adjustments for stratification, Rev. Inst. Int. Stat., 39, 148–
169, 1971.
54. Mehta, C.R., Patel, N.R., and Gray, R., Computing an exact confidence interval
for the common odds ratio in several 2 by 2 contingency tables, J. Am. Stat.
Assoc., 80, 969–973, 1985.
55. Vollset, S.E., Hirji, K.F., and Elashoff, R.M., Fast computation of exact confidence
limits for the common odds ratio in a series of 2×2 tables, J. Am. Stat. Assoc., 86,
404–409, 1991.
56. Mehta, C.R. and Walsh, S.J., Comparison of exact, mid-p, and Mantel–Haenszel
confidence intervals for the common odds ratio across several 2×2 contingency
tables, Am. Stat., 46, 146–150, 1992.
57. Emerson, J.D., Combining estimates of the odds ratio: The state of the art, Stat.
Methods Med. Res., 3, 157–178, 1994.
58. Gart, J.J. and Zweifel, J.R., On the bias of various estimators of the logit and its variance with application to quantal bioassay, Biometrika, 54, 181–187, 1967.
59. Cytovene product labeling available at www.fda.gov/cder/foi/label/2000/
20460s10lbl.pdf.
60. Kim, M.Y. and Xue, X., Likelihood ratio and a Bayesian approach were supe-
rior to standard non-inferiority analysis when the non-inferiority margin varied
with the control event rate, J. Clin. Epidemiol., 57, 1253–1261, 2004.
61. Grizzle, J.E., Starmer, C.F., and Koch, G.G., Analysis of categorical data by linear
models, Biometrics, 25, 489–504, 1969.
62. Gart, J.J., On the combination of relative risks, Biometrics, 18, 601–610, 1962.
63. Robins, J., Breslow, N., and Greenland, S., Estimators of the Mantel–Haenszel variance consistent in both sparse data and large-strata limiting models, Biometrics, 42, 311–323, 1986.
64. Mantel, N. and Haenszel, W., Statistical aspects of the analysis of data from retrospective studies of disease, J. Natl. Cancer Inst., 22, 719–748, 1959.
65. Zhang, Z., Non-inferiority testing with a variable margin, Biom. J., 48, 948–965,
2006.
66. Phillips, K.F., A new test of non-inferiority for anti-infective trials, Stat. Med., 22,
201–212, 2003.
67. U.S. Food and Drug Administration, Division of Anti-infective Drug Products,
Clinical Development and Labeling of Anti-Infective Drug Products. Points-to-
consider. U.S. Food and Drug Administration, Washington, DC, 1992.
68. Röhmel, J., Therapeutic equivalence investigations: Statistical considerations,
Stat. Med., 17, 1703–1714, 1998.
69. Röhmel, J., Statistical considerations of FDA and CPMP rules for the investiga-
tion of new antibacterial products, Stat. Med., 20, 2561–2571, 2001.
70. Tsou, H.H. et al., Mixed non-inferiority margin and statistical tests in active con-
trolled trials, J. Biopharm. Stat., 17, 339–357, 2007.
71. McNemar, Q., Note on the sampling error of the difference between correlated
proportions or percentages, Psychometrika 12, 153–157, 1947.
72. Tango, T., Equivalence test and confidence interval for the difference in propor-
tions for the paired-sample design, Stat. Med., 17, 891–908, 1998.
73. Miyanaga, Y., Clinical evaluation of the hydrogen peroxide SCL disinfection
system (SCL-D), Jpn. J. Soft Contact Lenses, 36, 163–173, 1994.
74. Lu, Y. and Bean, J.A., On the sample size for one-sided equivalence of sensitivi-
ties based upon McNemar’s test, Stat. Med., 14, 1831–1839, 1995.
75. Morikawa, T. and Yanagawa, T., Taiounoaru 2chi data ni taisuru doutousei
kentei (Equivalence testing for paired dichotomous data), P. Ann. Conf. Biometric
Soc. Jpn., 123–126, 1995.
76. Hsueh, H.M., Liu, J.P., and Chen, J.J., Unconditional exact tests for equiva-
lence or non-inferiority for paired binary endpoints, Biometrics, 57, 478–483,
2001.
77. Nam, J., Establishing equivalence of two treatments and sample size require-
ments in matched-pairs design, Biometrics, 53, 1422–1430, 1997.
78. Tang, N.S., Tang, M.L., and Chan, I.S.F., On tests of equivalence via non-unity
relative risk for matched-pairs design, Stat. Med., 22, 1217–1233, 2003.
79. Lachenbruch, P.A. and Lynch, C.J., Assessing screening tests: Extensions of
McNemar’s Test, Stat. Med., 17, 2207–2217, 1998.
80. Tang, N.S. et al., Sample size determination for establishing equivalence/non-
inferiority via ratio of two proportions in matched-pair design, Biometrics, 58,
957–963, 2002.
81. Chan, I.S.F. et al., Statistical analysis of non-inferiority trials with a rate ratio in
small-sample matched-pair designs, Biometrics, 59, 1170–1177, 2003.
12.1 Introduction
This chapter discusses non-inferiority based on the underlying means or
medians when there are no missing data or censored observations. Means
and medians are often used to describe the typical value or the central loca-
tion of a distribution. Medians are preferred when the data are skewed. The
outcomes may be continuous or discrete. For continuous outcomes where
larger outcomes are more desirable and differences between outcomes have
meaning (i.e., the data have an interval or ratio scale), the difference in the
means of the experimental and control arms in a randomized trial repre-
sents the average benefit across trial subjects from being randomized to the
experimental arm instead of the control arm. A difference in the medians
does not have any analogous interpretation unless additional assumptions
are made on the underlying distributions (e.g., that the shapes of the under-
lying distributions are equal).
For discrete outcomes, such as scores, the value for the mean will prob-
ably not be a possible value and may not be interpretable. In such a case,
inferences based on means may be difficult to interpret without additional
assumptions (e.g., the distributions, when different, are ordered). In these
situations, testing should not be based on the mean. For binary data, the
mean is the proportion of 1s or successes, which is interpretable.
When the difference in means (medians) defines the benefit or loss of bene-
fit, non-inferiority testing should be based on the difference in means (medi-
ans). When the data are positive and relative changes are most important, it
may be more appropriate to base non-inferiority testing on the ratio of the
means (medians). The mean for the control group may be needed to under-
stand and interpret a ratio of means.
A normal model is frequently used for inferences on the mean when the
sample size is large. The sample mean is assumed to be a random value from
an approximate normal distribution, with mean equal to the true mean or
population mean and variance equal to σ 2/n, where σ 2 is the population vari-
ance and n is the sample size. Inferences on a median are often based on the
behavior of the order statistics (see Section 12.4).
The hypotheses of interest are

Ho: μC − μE ≥ δ versus Ha: μC − μE < δ.   (12.1)

That is, the null hypothesis is that the mean in the active control group is superior to the mean in the experimental treatment group by at least a quantity δ, whereas the alternative is that the active control is superior by a smaller amount, or the two treatments are identical, or the experimental treatment is superior.
After the non-inferiority margin δ is added to each response in the experimental group, a test of superiority of the investigational treatment over the active control by any method, including a parametric test, will be equivalent to a test of non-inferiority for the original values. A permutation test for superiority can easily be used to test the null hypothesis after transformation.2
A valid permutation test requires that the residuals be exchangeable. That is, if the distribution of Xi − μi (an observed value minus its treatment group mean) is identical for the two treatment groups, the permutation test is valid. This requirement is often assumed to be correct (at least mostly correct) but rarely checked in a detailed manner.
Obvious examples of situations where the residuals are not exchangeable
include when one treatment produces a unimodal distribution and the other
produces a bimodal distribution, or when one treatment produces responses
with a larger variance than those produced by the other treatment. In such
cases, the permutation test will not be appropriate.3
A sufficient condition for the permutation test to be valid is that each sub-
ject would have a response, if assigned to receive the active control, that
exceeds that subject’s response, if assigned to receive the experimental
treatment, by exactly δ. This condition guarantees that the necessary condi-
tion from the previous paragraph is met, but this condition is not in itself
necessary.
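A Python sketch of such a permutation test follows; it assumes, as one common choice of transformation, that the margin δ is added to each experimental response, and the data in the usage lines are simulated for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def perm_noninferiority_p(x_control, y_experimental, delta, n_perm=10_000):
        """One-sided permutation p-value for Ho: muC - muE >= delta, run as a
        superiority test after adding delta to each experimental response."""
        x = np.asarray(x_control, dtype=float)
        y = np.asarray(y_experimental, dtype=float) + delta
        pooled, n_c = np.concatenate([x, y]), len(x)
        observed = y.mean() - x.mean()
        count = 0
        for _ in range(n_perm):
            p = rng.permutation(pooled)
            count += (p[n_c:].mean() - p[:n_c].mean()) >= observed
        return count / n_perm

    x = rng.normal(50, 5, 25)    # simulated control responses (illustrative)
    y = rng.normal(49, 5, 30)    # simulated experimental responses
    print(perm_noninferiority_p(x, y, delta=4.0))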
the control and experimental groups, respectively. Again, with large sample
sizes, the relative difference in the two estimates will be negligible.
As an equivalent alternative to the confidence interval methodology, a test statistic can be calculated. If (x̄C − x̄E − δ)/se(X̄C − X̄E) is less than the critical value (e.g., less than −zα/2), non-inferiority is concluded. Alternatively, non-inferiority is concluded when the appropriate-level confidence interval for μC − μE contains only values less than δ. Using a test statistic has the advantage of being able to
calculate a p-value for the test of the null hypothesis. However, p-values are not
often calculated for such non-inferiority tests and, when they are calculated,
they are prone to misinterpretation as an indication of the existence of differ-
ences, not the rejection of the null hypothesis of a specific nonzero difference.
We will later compare the results of different analysis methods based on both
the calculated p-values (and an analogous posterior probability) for given vari-
ous margins and compare the calculated 95% confidence/credible intervals.
The sample variances for the control and experimental arms are given by SC² = Σ_{i=1}^{nC} (Xi − X̄)²/(nC − 1) and SE² = Σ_{j=1}^{nE} (Yj − Ȳ)²/(nE − 1). We will
consider three cases for testing the hypotheses in Expression 12.1: (1) large sample normal-based inference, (2) using Satterthwaite degrees of freedom,4 and (3) using a t statistic under the assumption of unknown but equal variances. Procedures (1) and (2) are two approaches to the Behrens–Fisher problem: making statistical inferences on the difference in the means of two normal distributions having unknown variances that are not assumed to be equal.
Large Sample Normal Inference. For large sample sizes, it follows from the central limit theorem that the test statistic

Z = (X̄ − Ȳ − δ)/(SC²/nC + SE²/nE)^(1/2)   (12.2)

has an approximate standard normal distribution when μC − μE = δ. The Satterthwaite approach refers the same statistic to a t distribution with estimated degrees of freedom

df = (SC²/nC + SE²/nE)²/[(SC²/nC)²/(nC − 1) + (SE²/nE)²/(nE − 1)].   (12.3)

Under the assumption of unknown but equal variances, with S² = [(nC − 1)SC² + (nE − 1)SE²]/(nC + nE − 2) denoting the pooled variance estimate, the test statistic

T = (X̄ − Ȳ − δ)/[S²(1/nC + 1/nE)]^(1/2)   (12.4)

has a t distribution with nC + nE − 2 degrees of freedom when μC − μE = δ.
Example 12.1
Suppose that nC = 25, nE = 30, x̄ = 40.5, ȳ = 39.1, sC² = 4, and sE² = 49, with a non-inferiority margin of δ = 4. The one-sided p-values and two-sided 95% confidence intervals for the three methods are provided in Table 12.1.
TABLE 12.1
Summary of One-Sided p-Values and 95% Confidence Intervals
          Large Sample Normal   Satterthwaite   Equal Variance
p-Value   0.026                 0.030           0.039
95% CI    (–1.23, 4.03)         (–1.32, 4.12)   (–1.51, 4.31)
For each method, the one-sided p-value is greater than 0.025 and each 95% confidence interval contains the non-inferiority margin of 4. Therefore, non-inferiority cannot be concluded. The upper limits of the 95% confidence intervals represent the smallest margin that could have been prespecified for which non-inferiority would have been concluded. The equal variance method has both the largest confidence interval upper limit of 4.31 and the largest p-value of 0.039. This is primarily due to the larger estimated standard error for X̄ − Ȳ used by the equal variance method. For the equal variance method, the estimated standard error for X̄ − Ȳ equals 1.45, whereas it equals 1.34 for the large sample normal and Satterthwaite methods. When the treatment group having the larger sample size has the larger (smaller) observed sample variance, the estimated standard error for X̄ − Ȳ used by the equal variance method will be larger (smaller) than that used by the large sample normal and Satterthwaite methods. When the sample sizes are equal, the same standard error for X̄ − Ȳ is used in all three methods. Note that the multipliers (i.e., the absolute values of the critical values) used for the confidence intervals were 1.960, 2.032, and 2.006.
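The three analyses of Example 12.1 can be reproduced with a few lines of Python; the summary statistics below are those inferred above for Example 12.1, and Equation 12.3 supplies the Satterthwaite degrees of freedom.

    import numpy as np
    from scipy.stats import norm, t

    n_c, xbar, s2_c = 25, 40.5, 4.0    # Example 12.1 summaries
    n_e, ybar, s2_e = 30, 39.1, 49.0
    delta = 4.0

    se = np.sqrt(s2_c / n_c + s2_e / n_e)
    z = (xbar - ybar - delta) / se
    print("normal:", norm.cdf(z), xbar - ybar + np.array([-1, 1]) * 1.96 * se)

    df = (s2_c / n_c + s2_e / n_e) ** 2 / (
        (s2_c / n_c) ** 2 / (n_c - 1) + (s2_e / n_e) ** 2 / (n_e - 1))   # Equation 12.3
    print("Satterthwaite:", t.cdf(z, df),
          xbar - ybar + np.array([-1, 1]) * t.ppf(0.975, df) * se)

    s2_pool = ((n_c - 1) * s2_c + (n_e - 1) * s2_e) / (n_c + n_e - 2)
    se_p = np.sqrt(s2_pool * (1 / n_c + 1 / n_e))
    print("equal variance:", t.cdf((xbar - ybar - delta) / se_p, n_c + n_e - 2),
          xbar - ybar + np.array([-1, 1]) * t.ppf(0.975, n_c + n_e - 2) * se_p)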
Example 12.2
We still have nC = 25, nE = 30, x = 40.5, and y = 39.1. However, now sC2 = 49 and
sE2 = 4. The one-sided p-values and two-sided 95% confidence intervals are pro-
vided in Table 12.2. From Equation 12.3 the degrees of freedom for the Satterwaite
method equal 27.
For each method, the one-sided p-value is greater than 0.025 and each 95% confidence interval contains the non-inferiority margin of 4. Therefore, non-inferiority cannot be concluded. The Satterthwaite method has both the largest confidence interval upper limit of 4.37 and the largest p-value of 0.042. The estimated standard errors used for X̄ − Ȳ are approximately reversed from Example 12.1. For the equal variance method, the estimated standard error for X̄ − Ȳ equals 1.34, whereas it equals 1.45 for the large sample normal and Satterthwaite methods. The multipliers used for the confidence intervals were 1.960, 2.052, and 2.006.
TABLE 12.2
Summary of One-Sided p-Values and 95% Confidence Intervals
          Large Sample Normal   Satterthwaite   Equal Variance
p-Value   0.036                 0.042           0.029
95% CI    (–1.44, 4.24)         (–1.57, 4.37)   (–1.28, 4.08)
TABLE 12.3
Summary of One-Sided p-Values and 95% Confidence Intervals
          Large Sample Normal   Satterthwaite   Equal Variance
p-Value   0.015                 0.016           0.017
95% CI    (–1.76, 2.75)         (–1.79, 2.78)   (–1.83, 2.82)
Example 12.3
Example 12.3 considers larger samples, with nC = 50 and nE = 55 (the data are revisited in Example 12.4); the one-sided p-values and two-sided 95% confidence intervals are provided in Table 12.3.
In the Bayesian treatment of these hypotheses, the population variances may be known or unknown. Both cases are provided to illustrate the similarities and differences in applying the methods. As in Section 12.2.3.1, only the case where the population variances are unknown will be carried forward and compared in revisited examples with the methods in Section 12.2.3.1.
Variances Known. For a random sample of n from a normal distribution with mean μ and known variance σ², a normal prior distribution for μ that has mean υ and variance τ² leads to a normal posterior distribution for μ that has mean

(n x̄/σ² + υ/τ²)/(n/σ² + 1/τ²)

and variance

1/(n/σ² + 1/τ²).

When τ² is relatively large compared with σ²/n, the specific choice of τ² will have little impact. The Jeffreys prior has density h(μ) ∝ √(I(μ)) = 1/σ for −∞ < μ < ∞, which is not a proper density and is a noninformative prior for μ. When h(μ) = 1/σ is used, the resulting posterior density is that of a normal distribution having mean x̄ and variance σ²/n.
The parameters μ C and μE can be regarded as independent. Therefore, the
posterior distribution for μ C – μE is a normal distribution with mean equal to
the difference in the posterior means for μ C and μE and variance equal to the
sum of the posterior variances for μ C and μE.
When testing Ho: μ C – μE ≥ δ versus Ha: μ C – μE < δ, the null hypothesis
is rejected and non-inferiority is concluded when the posterior probability
of μ C – μE < δ exceeds some threshold (e.g., exceeds 1 – α/2) or alternatively
when the appropriate level credible interval for μ C – μE contains only values
less than δ.
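A minimal Python sketch of this known-variance analysis follows. It treats the Example 12.2 sample variances as if they were known population variances and uses a vague normal prior (τ² large relative to σ²/n), so the posterior essentially reproduces the large-sample frequentist answer.

    import numpy as np
    from scipy.stats import norm

    def posterior_of_mean(xbar, n, sigma2, nu=0.0, tau2=1e6):
        """Posterior mean and variance of mu for a normal sample with known
        variance sigma2 and a normal prior with mean nu and variance tau2."""
        precision = n / sigma2 + 1 / tau2
        return (n * xbar / sigma2 + nu / tau2) / precision, 1 / precision

    mC, vC = posterior_of_mean(40.5, 25, 49.0)   # control, Example 12.2 summaries
    mE, vE = posterior_of_mean(39.1, 30, 4.0)    # experimental
    # muC - muE is normal with mean mC - mE and variance vC + vE
    print(norm.cdf((4.0 - (mC - mE)) / np.sqrt(vC + vE)))   # P(muC - muE < 4) ~ 0.96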
Variances Unknown. In the frequentist setting, the analysis simplifies when
the additional assumption is made that the two underlying normal dis-
tributions have the same variance. In the Bayesian setting this additional
assumption complicates the analysis by leading to a joint posterior distribu-
tion where μ C and μE are not independent (μ C and μE are conditionally inde-
pendent given σ). We will discuss two Bayesian procedures, which will be
referred to as the Bayesian-γ and Bayesian-T procedures. These procedures
were introduced in Section 6.3 for three-arm non-inferiority trials. We repeat
the explanations of the procedures below.
Bayesian-γ Procedure. This procedure is similar to a procedure provided by
Ghosh et al.5 There are different choices on what function of the variance
(e.g., σ, σ 2, 1/σ, or 1/σ 2) to model. Applying the joint Jeffreys prior in each case
leads to joint posterior distributions that provide different posterior prob-
abilities. For this discussion, the variance will be modeled with σ 2, as this
will lead to a more convenient form for the joint posterior distribution. For
θ = σ², the density of the Jeffreys prior, h, satisfies h(μ, θ) ∝ θ^(−3/2) for −∞ < μ < ∞ and θ > 0. Then for X1, X2, …, Xn, a random sample from a normal distribution with mean μ and variance θ, where the prior density satisfies h(μ, θ) ∝ θ^(−3/2), the joint posterior density satisfies
g(μ, θ|x1, x2, …, xn) ∝ θ^(−1/2) exp[−(μ − x̄)²/(2θ/n)] × θ^(−n/2−1) exp[−Σ_{i=1}^n (xi − x̄)²/(2θ)].   (12.5)
We see from Expression 12.5 that the joint density factors into the product of an inverse gamma marginal density for θ and a normal conditional density for μ given θ. The inverse gamma distribution has shape and scale parameters equal to n/2 and Σ_{i=1}^n (xi − x̄)²/2, respectively, with mean equal to Σ_{i=1}^n (xi − x̄)²/(n − 2) and variance equal to 2[Σ_{i=1}^n (xi − x̄)²]²/[(n − 2)²(n − 4)]. Note that θ has an inverse gamma distribution with parameters n/2 and Σ_{i=1}^n (xi − x̄)²/2 if and only if 1/θ has a gamma distribution with parameters n/2 and 2/Σ_{i=1}^n (xi − x̄)², with mean equal to n/Σ_{i=1}^n (xi − x̄)². Given θ, μ has a normal distribution with mean equal to x̄ and variance equal to θ/n. Therefore, to simulate probabilities involving μ, a random value for 1/θ can be taken from the gamma distribution with parameters n/2 and 2/Σ_{i=1}^n (xi − x̄)², and then a random value for μ can be taken from a normal distribution having mean x̄ and variance θ/n.
Bayesian-T Procedure. Another approach, consistent with Gamalo et al.,6 addresses the problem of unknown variances by using translated t distributions for the posterior distributions. The mean of the control arm, μC, has a posterior distribution equal to the distribution of x̄ + TC sC/√nC, where TC has a t distribution with nC − 1 degrees of freedom. The mean of the experimental arm, μE, has a posterior distribution equal to the distribution of ȳ + TE sE/√nE, where TE has a t distribution with nE − 1 degrees of freedom.
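Both procedures are straightforward to simulate, as Example 12.4 below illustrates. The Python sketch that follows assumes the Example 12.2 summaries (nC = 25, x̄ = 40.5, sC² = 49; nE = 30, ȳ = 39.1, sE² = 4) and should roughly reproduce the Bayesian-γ and Bayesian-T entries of Table 12.5 (e.g., posterior probabilities of the null hypothesis near 0.039 and 0.044 at δ = 4).

    import numpy as np

    rng = np.random.default_rng(7)
    B = 100_000
    n_c, xbar, ss_c = 25, 40.5, 24 * 49.0   # ss = sum of squared deviations
    n_e, ybar, ss_e = 30, 39.1, 29 * 4.0

    # Bayesian-gamma: draw 1/theta ~ gamma(n/2, scale = 2/ss), then mu ~ N(xbar, theta/n)
    th_c = 1.0 / rng.gamma(n_c / 2, 2 / ss_c, B)
    th_e = 1.0 / rng.gamma(n_e / 2, 2 / ss_e, B)
    diff_g = rng.normal(xbar, np.sqrt(th_c / n_c)) - rng.normal(ybar, np.sqrt(th_e / n_e))
    print("gamma:", (diff_g >= 4).mean(), np.percentile(diff_g, [2.5, 97.5]))

    # Bayesian-T: mu has the distribution of xbar + T * s / sqrt(n), T ~ t(n - 1)
    mu_c = xbar + rng.standard_t(n_c - 1, B) * np.sqrt(ss_c / (n_c - 1)) / np.sqrt(n_c)
    mu_e = ybar + rng.standard_t(n_e - 1, B) * np.sqrt(ss_e / (n_e - 1)) / np.sqrt(n_e)
    print("T:", (mu_c - mu_e >= 4).mean(), np.percentile(mu_c - mu_e, [2.5, 97.5]))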
Example 12.4 revisits the examples in Section 12.2.3.1. We compare the pos-
terior probabilities of the null hypothesis for various margins and the cor-
responding 95% credible interval for the above Bayesian procedures with the
methods in Section 12.2.3.1.
Example 12.4
For the data of Example 12.1, 100,000 values for μC − μE were simulated for each Bayesian method. Table 12.4 provides a summary of the 95% confidence/credible intervals for the Bayesian methods and the three methods discussed in Section 12.2.3.1, along with the calculated p-values or simulated posterior probabilities of the null hypothesis in Expression 12.1 for various choices of a non-inferiority margin, δ.
TABLE 12.4
Summary of p-Values, Posterior Probabilities, and 95% Confidence/Credible Intervals
Non-Inferiority Margin, δ   Bayesian-γ      Bayesian-T      Large Sample Normal   Satterthwaite   Equal Variance
0                           0.850           0.846           0.865                 0.861           0.831
1                           0.618           0.618           0.617                 0.617           0.608
2                           0.327           0.330           0.327                 0.329           0.340
3                           0.117           0.121           0.116                 0.120           0.137
4                           0.029           0.030           0.026                 0.030           0.039
5                           0.005           0.005           0.004                 0.006           0.008
95% CI                      (–1.38, 4.08)   (–1.34, 4.12)   (–1.23, 4.03)         (–1.32, 4.12)   (–1.51, 4.31)
For Example 12.2, where the observed sample variances are reversed, we have that $2/\sum_{i=1}^{25}(x_i - \bar{x})^2 = 0.001701$ and $2/\sum_{j=1}^{30}(y_j - \bar{y})^2 = 0.01724$. One hundred thousand values for μC – μE were simulated for each Bayesian method.
For the Bayesian-γ procedure, the first step in simulating a value for μC (μE) is randomly selecting a value for $1/\sigma_C^2$ ($1/\sigma_E^2$) from a gamma distribution with parameters 12.5 and 0.001701 (15 and 0.01724). The second steps are the same as before. For the Bayesian-T procedure, a value for μC (μE) was selected at random from the distribution for $40.5 + 7T_C/5$ ($39.1 + 2T_E/\sqrt{30}$), where $T_C$ ($T_E$) has a t distribution with 24 (29) degrees of freedom.
Table 12.5 provides a summary of the 95% confidence/credible intervals for these Bayesian methods and the three methods discussed in Section 12.2.3.1, along with the calculated p-values or simulated probabilities of the null hypothesis in Expression 12.1 for various choices of a non-inferiority margin, δ. In this example the posterior probabilities and 95% credible intervals for the Bayesian methods are similar, respectively, to the p-values and 95% confidence intervals from the large sample normal and Satterthwaite methods.
For Example 12.3, we have that $2/\sum_{i=1}^{50}(x_i - \bar{x})^2 = 0.001748$ and $2/\sum_{j=1}^{55}(y_j - \bar{y})^2 = 0.000784$. These two values are the values of the scale parameters for simulating values for $1/\sigma_C^2$ and $1/\sigma_E^2$ from respective gamma distributions with values
for the shape parameters of 25 and 27.5. For both Bayesian methods, a value
for μC – μE is simulated in analogous fashion as above. Again, 100,000 values for
μ C – μE were simulated for each method. Table 12.6 provides a summary of the
95% confidence/credible intervals for the Bayesian methods and the three meth-
ods discussed in Section 12.2.3.1 along with the calculated p-values or simulated
probabilities of the null hypothesis in Expression 12.1 for various choices of a non-
inferiority margin, δ.
In all examples (Tables 12.4 through 12.6) the posterior probabilities and 95% credible intervals for the Bayesian methods are similar, respectively, to the p-values and 95% confidence intervals from the large sample normal and Satterthwaite methods. In each case the Bayesian-T and Satterthwaite methods gave quite similar results, with the Bayesian-T method producing slightly wider 95% confidence/
TABLE 12.5
Summary of p-Values, Posterior Probabilities, and 95% Confidence/Credible Intervals

Non-Inferiority Margin, δ   Bayesian-γ      Bayesian-T      Large Sample Normal   Satterthwaite   Equal Variance
0                           0.833           0.828           0.833                 0.829           0.850
1                           0.610           0.610           0.609                 0.608           0.617
2                           0.340           0.344           0.339                 0.341           0.328
3                           0.136           0.140           0.134                 0.139           0.118
4                           0.039           0.044           0.036                 0.042           0.029
5                           0.009           0.010           0.006                 0.010           0.005
95% CI                      (–1.49, 4.32)   (–1.59, 4.41)   (–1.44, 4.24)         (–1.57, 4.37)   (–1.28, 4.08)
TABLE 12.6
Summary of p-Values, Posterior Probabilities, and 95% Confidence/Credible Intervals

Non-Inferiority Margin, δ   Bayesian-γ      Bayesian-T      Large Sample Normal   Satterthwaite   Equal Variance
0                           0.666           0.662           0.666                 0.666           0.663
1                           0.331           0.329           0.331                 0.331           0.334
2                           0.096           0.096           0.096                 0.097           0.101
2.5                         0.042           0.042           0.041                 0.042           0.045
3                           0.016           0.016           0.015                 0.016           0.017
95% CI                      (–1.78, 2.77)   (–1.82, 2.79)   (–1.76, 2.75)         (–1.79, 2.78)   (–1.83, 2.82)
credible intervals. For the Bayesian-γ method, when the posterior probability of the null hypothesis was very small, the posterior probability lay between the p-values for the large sample normal method and the Satterthwaite method, and in each example the upper limit of the 95% credible interval for μC – μE lay between the upper limits of the 95% confidence intervals from the large sample normal and Satterthwaite methods.
Alternatively, when the control therapy is not terribly effective, the non-in-
feriority margin is small. Assuming no difference in the true means for the
experimental and control arms not only will lead to a large sample size but will
also reflect the belief that the experimental therapy is not terribly effective.
In deriving sample size formulas, we borrow from ideas in Kieser and Hauschke.7 Let μ1 and μ2 represent the assumed true means for the experimental and control arms, respectively, where μ2 – μ1 < δ. Let σ1 and σ2 denote the respective assumed underlying standard deviations, where l = σ1/σ2. Let k = nE/nC denote the allocation ratio. On the basis of the assumption of independent normal random samples and using a Satterthwaite-like approximation for the degrees of freedom, the test statistic Z in Expression 12.2 will be modeled as having an approximate noncentral t distribution with noncentrality parameter given by $(\mu_2 - \mu_1 - \delta)\big/\big(\sigma_2\sqrt{(l^2/k + 1)/n_C}\big)$ and degrees of freedom given by
$$\nu = \frac{(1 + l^2/k)^2}{1/(n_C - 1) + l^4/[k^2(kn_C - 1)]}.$$
We will use the approximate relation that the 100βth percentile of the noncentral t distribution is approximately equal to the noncentrality parameter plus the 100βth percentile of the t distribution having the same degrees of freedom. Then an iterative sample size formula for nC (nC appears on both sides of the equation, through ν) is given by
$$n_C = \frac{\sigma_2^2(l^2/k + 1)\big(t_{\alpha/2,\nu} + t_{\beta,\nu}\big)^2}{(\mu_2 - \mu_1 - \delta)^2}$$
for the frequentist procedures. For the Bayesian-T procedure, the required sample size should be similar to that calculated for the Satterthwaite-like procedure. Alternatively, for the frequentist or Bayesian methods, the required
sample size can be based on an assumed distribution over all possibilities
for (μ1, μ2).
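A small fixed-point iteration suffices to solve the relation above for nC. The sketch below assumes particular operating characteristics (one-sided α/2 = 0.025 and 80% power); the book's tables may rest on different choices, so the outputs here are illustrative only.

```python
import math
from scipy import stats

def n_control(mu1, mu2, delta, sigma2, l, k, alpha2=0.025, power=0.80):
    """Fixed-point iteration for n_C in the Satterthwaite-like relation above."""
    n_c = 10.0  # starting value
    for _ in range(200):
        nu = (1 + l**2 / k) ** 2 / (1 / (n_c - 1) + l**4 / (k**2 * (k * n_c - 1)))
        t_a = stats.t.ppf(1 - alpha2, nu)   # t_{alpha/2, nu}
        t_b = stats.t.ppf(power, nu)        # t_{beta, nu}
        n_new = sigma2**2 * (l**2 / k + 1) * (t_a + t_b) ** 2 / (mu2 - mu1 - delta) ** 2
        if abs(n_new - n_c) < 1e-9:
            break
        n_c = n_new
    return math.ceil(n_c)

# Example 12.5-style inputs: means 38 and 40, delta = 10, sigma2 = 12, l = 6/12.
print(n_control(38, 40, 10, 12, 0.5, 1.0), n_control(38, 40, 10, 12, 0.5, 2.0))
```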
The use of the sample size formulas will be illustrated in Example 12.5.
Example 12.5
Consider an endpoint that is the improvement from baseline in some score. The
non-inferiority margin for the difference in mean improvement is 10. Both a 1:1 and
a 2:1 experimental to control allocation are potentially being considered. The trial
will be sized on the basis of the following assumptions: the true mean improve-
ment is 38 and 40 for the experimental and control arms, respectively, and the
corresponding underlying standard deviations are 6 and 12, respectively. For the
equal variance approach, the common underlying variance will be assumed as
90 (the average of the variances for standard deviations of 6 and 12). Table 12.7
provides the determined sample sizes for the control arm.
For this example, the calculated sample sizes were smaller for the large sample
normal and equal variance methods. For the large sample normal method, when
the inference is based on the difference in means with equal and known underly-
ing variances, the overall sample size for a 2:1 allocation will be 12.5% greater
than the overall sample size for a 1:1 allocation. In this example, for each method,
the percentage increase in the calculated value for the overall sample size going
from a 1:1 allocation to a 2:1 allocation is greater than 12.5%. This is due to the
unequal variances and the iterative nature of the sample size equations for the Satterthwaite and equal variance methods.
For the large sample normal and Satterthwaite methods, the optimal allocation ratio is k = l = σ1/σ2. In this example, for the large sample normal and Satterthwaite methods, k = 0.5 is the allocation ratio that minimizes the calculated overall sample size. For the large sample normal method, the corresponding allocation is 16 subjects to the control arm and 8 subjects to the experimental arm. For the Satterthwaite method, the corresponding allocation is 18 subjects to the control arm and 9 subjects to the experimental arm. For the equal variance method, a 1:1 allocation ratio minimizes the calculated overall sample size with 15 subjects
allocated to each arm.
TABLE 12.7
Sample Sizes for the Control Arm

                      Allocation Ratio
Method                1:1    2:1    Percentage Change in Overall Sample Size (%)a
Large sample normal   14     12     35.1
Satterthwaite         15     14     38.6
Equal variance        15     11     15.9
a Based on the calculated value before rounding up.
subject was assigned to the control arm in the real study but to the experimental arm in the rerandomization, the subject's actual observed value is multiplied by δ for the rerandomization calculations; if a subject was assigned to the experimental arm in the real study but to the control arm in the rerandomization, the subject's actual observed value is divided by δ for the rerandomization calculations; and if a subject receives the same allocation in the rerandomization as in the real trial, the actual observed value is used. In such an approach, where outcomes must be positive, we are assuming the shapes of the distributions for the logarithms of the outcomes are identical between the control and experimental arms. That is, the shapes of the underlying distributions for the outcomes differ by a scale factor. Here the permutation test being valid means that the residuals of the logarithms are exchangeable. A sufficient condition for the permutation test to be valid is that each subject would have an outcome, if assigned to receive the experimental therapy, that is exactly δ times the outcome that the subject would have had if assigned to receive the control therapy.
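A rerandomization test of this kind is simple to simulate. A minimal sketch, assuming the log of the ratio of sample means as the test statistic (the text does not fix a particular statistic) and numpy arrays of positive outcomes as inputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def rerandomization_pvalue(x, y, delta, n_rerand=10_000):
    """Rerandomization test sketch for the ratio hypotheses with margin delta.
    x: observed control outcomes; y: observed experimental outcomes (all > 0)."""
    values = np.concatenate([x, y])
    is_exp = np.concatenate([np.zeros(len(x), bool), np.ones(len(y), bool)])
    observed = np.log(y.mean() / x.mean())
    stats_ = np.empty(n_rerand)
    for b in range(n_rerand):
        new = rng.permutation(is_exp)
        v = values.copy()
        v[new & ~is_exp] *= delta   # control in trial -> experimental in rerandomization
        v[~new & is_exp] /= delta   # experimental in trial -> control in rerandomization
        stats_[b] = np.log(v[new].mean() / v[~new].mean())
    # One-sided p-value; small values support non-inferiority.
    return np.mean(stats_ >= observed)
```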
Let $X_1, X_2, \ldots, X_{n_C}$ and $Y_1, Y_2, \ldots, Y_{n_E}$ denote independent random samples for the control and experimental arms, respectively. Let μC and μE denote the means of the underlying distributions for the control and experimental arms and let $\sigma_C^2$ and $\sigma_E^2$ denote the respective variances. Let $S_C^2$ and $S_E^2$ denote the respective sample variances given by $S_C^2 = \sum_{i=1}^{n_C}(X_i - \bar{X})^2/(n_C - 1)$ and $S_E^2 = \sum_{j=1}^{n_E}(Y_j - \bar{Y})^2/(n_E - 1)$.
j=1
$$Z = \frac{\bar{Y} - \delta\bar{X}}{\sqrt{S_E^2/n_E + \delta^2 S_C^2/n_C}} \quad (12.9)$$
with Satterthwaite-like degrees of freedom given by
$$\frac{\big(\delta^2 s_C^2/n_C + s_E^2/n_E\big)^2}{\dfrac{\delta^4 s_C^4}{n_C^2(n_C - 1)} + \dfrac{s_E^4}{n_E^2(n_E - 1)}} \quad (12.10)$$
For the equal variance approach, the test statistic is
$$T = \frac{\bar{Y} - \delta\bar{X}}{\sqrt{S^2(1/n_E + \delta^2/n_C)}} \quad (12.11)$$
A 100(1 – α)% confidence interval for μE/μC by a Fieller approach is given by $\{\lambda: -t_{\alpha/2,n_C+n_E-2} < (\bar{y} - \lambda\bar{x})\big/\sqrt{s^2(1/n_E + \lambda^2/n_C)} < t_{\alpha/2,n_C+n_E-2}\}$.
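The Fieller set can be computed by a direct grid search over candidate values of λ. A minimal sketch, with the grid endpoints as assumptions that would need widening for a particular data set:

```python
import numpy as np
from scipy import stats

def fieller_ci(x, y, alpha=0.05, lam_grid=np.linspace(0.5, 1.5, 100_001)):
    """Grid-search version of the Fieller confidence set for mu_E/mu_C
    under the pooled (equal variance) model described above."""
    n_c, n_e = len(x), len(y)
    s2 = ((n_c - 1) * x.var(ddof=1) + (n_e - 1) * y.var(ddof=1)) / (n_c + n_e - 2)
    t_crit = stats.t.ppf(1 - alpha / 2, n_c + n_e - 2)
    t_vals = (y.mean() - lam_grid * x.mean()) / np.sqrt(s2 * (1 / n_e + lam_grid**2 / n_c))
    kept = lam_grid[np.abs(t_vals) < t_crit]
    return kept.min(), kept.max()  # assumes the set is an interval within the grid
```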
A Delta-Method Approach. A delta-method approach can be considered in
the testing of the hypotheses in Expression 12.7. This can be done using either
a test statistic based on the ratio of sample means or a confidence interval
for μE/μ C. Hasselblad and Kong8 considered a delta-method approach to the
retention fraction for relative risks, odds ratios, and hazard ratios. Rothmann
and Tsou9 evaluated the behavior of delta-method confidence interval procedures through the maintenance of a desired 0.025 one-sided type I error rate and the quality of the estimated standard error for the ratio of two normally distributed estimators.
The theorem behind the delta method can be found in many sources, such
as the book by Bishop, Fienberg, and Holland.10 For independent sequences
of random variables $\{U_n\}$ and $\{V_n\}$, we have from the delta-method theorem that if $\sqrt{n}(U_n - \mu_1) \xrightarrow{d} N(0, \sigma_1^2)$ and $\sqrt{n}(V_n - \mu_2) \xrightarrow{d} N(0, \sigma_2^2)$, then
$$\sqrt{n}\left(\frac{U_n}{V_n} - \frac{\mu_1}{\mu_2}\right) \xrightarrow{d} N\!\left(0, \frac{\sigma_1^2}{\mu_2^2} + \frac{\mu_1^2\sigma_2^2}{\mu_2^4}\right)$$
provided μ2 ≠ 0. It follows, as noted by Rothmann and Tsou,9 that
$$W_n = \sqrt{n}\left(\frac{U_n}{V_n} - \frac{\mu_1}{\mu_2}\right)\bigg/\sqrt{\frac{\sigma_1^2}{\mu_2^2} + \frac{\mu_1^2\sigma_2^2}{\mu_2^4}} = \sqrt{n}\left(U_n - V_n\frac{\mu_1}{\mu_2}\right)\bigg/\sqrt{\sigma_1^2 + \frac{\mu_1^2}{\mu_2^2}\sigma_2^2} \times (\mu_2/V_n) = Z_n \times (\mu_2/V_n)$$
This leads to the test statistic
$$W = \left(\frac{\bar{Y}}{\bar{X}} - \delta\right)\bigg/\sqrt{\frac{\bar{Y}^2}{\bar{X}^2}\left(\frac{S_E^2/n_E}{\bar{Y}^2} + \frac{S_C^2/n_C}{\bar{X}^2}\right)} \quad (12.12)$$
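Expression 12.12 is simple to evaluate from the arm-level data. A minimal sketch (the one-sided p-value convention for the non-inferiority alternative is an assumed choice):

```python
import numpy as np
from scipy import stats

def delta_method_w(x, y, delta):
    """Delta-method statistic of Expression 12.12 for the ratio of means."""
    n_c, n_e = len(x), len(y)
    xbar, ybar = x.mean(), y.mean()
    ratio = ybar / xbar
    se = np.sqrt(ratio**2 * (y.var(ddof=1) / n_e / ybar**2 +
                             x.var(ddof=1) / n_c / xbar**2))
    w = (ratio - delta) / se
    return w, stats.norm.sf(w)  # one-sided p-value for non-inferiority
```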
Example 12.6 revisits Examples 12.1 through 12.3. Here the non-inferiority
inference will be based on the ratio of the means. The posterior probabilities of the null hypothesis or p-values when δ = 0.9 and the 95% confidence/credible intervals will be determined using each Bayesian procedure and each procedure discussed in Section 12.3.3.1.
Example 12.6
We revisit Examples 12.1 through 12.3. Table 12.8 provides the p-value or the
posterior probabilities of the null hypothesis in Expression 12.7 (or Expression
12.8 if appropriate) when δ = 0.9 and the corresponding 95% confidence/credible
intervals for the six methods discussed in this section. The p-values or posterior
probabilities of the null hypothesis less than 0.025 are italicized. For the Bayesian
methods, the set of values for μ C and μE simulated earlier were used. All simulated
values for μC and μE were positive and far from zero.
TABLE 12.8
Summary of p-Values, Posterior Probabilities, and 95% Confidence Intervals

                      Example 12.1               Example 12.2               Example 12.3
Procedures            p-Value  95% CI            p-Value  95% CI            p-Value  95% CI
Large sample normal   0.0230   (0.901, 1.030)    0.0217   (0.902, 1.038)    0.0125   (0.910, 1.061)
Satterthwaite-like    0.0271   (0.899, 1.033)    0.0265   (0.899, 1.042)    0.0137   (0.909, 1.062)
Equal variance        0.0294   (0.898, 1.039)    0.0206   (0.903, 1.033)    0.0135   (0.909, 1.064)
Delta-method          0.0236   (0.901, 1.030)    0.0292   (0.898, 1.033)    0.0148   (0.908, 1.059)
Bayesian-γ a          0.0255   (0.900, 1.032)    0.0249   (0.900, 1.039)    0.0137   (0.909, 1.062)
Bayesian-T a          0.0270   (0.899, 1.033)    0.0283   (0.898, 1.042)    0.0144   (0.909, 1.063)
a Posterior probabilities of the null hypothesis are given under the p-value column.
parameter given by $(\mu_1 - \delta\mu_2)\big/\big(\sigma_2\sqrt{(l^2/k + \delta^2)/n_C}\big)$ and degrees of freedom given by
$$\nu = \frac{(\delta^2 + l^2/k)^2}{\delta^4/(n_C - 1) + l^4/[k^2(kn_C - 1)]}.$$
We will use the approximate relation that the 100βth percentile of the noncentral t distribution is approximately equal to the noncentrality parameter plus the 100βth percentile of the t distribution having the same degrees of freedom.
Example 12.7
Consider an endpoint that is the improvement from baseline in some score. The
non-inferiority threshold for the ratio of mean improvement is 0.75. Both a 1:1 and
a 2:1 experimental-to-control allocation are potentially being considered. As in
Example 12.5, the trial will be sized on the basis of true mean improvement of 38
and 40 for the experimental and control arms, respectively, with corresponding
underlying standard deviations of 6 and 12, respectively. For the equal variance
TABLE 12.9
Sample Sizes for the Control Arm

                      Allocation Ratio
Method                1:1    2:1    Percentage Change in Overall Sample Size (%)a
Large sample normal   20     17     26.9
Delta method          28     25     33.7
Satterthwaite         21     18     30.2
Equal variance        25     17     4.2
a Based on the calculated value before rounding up.
approach, the common underlying variance will be assumed as 90. Table 12.9
provides the determined sample sizes for the control arm.
For this example, the calculated sample sizes were smallest for the large sample
normal and equal variance methods. The delta method requires a larger sample
size than the large sample normal method since μ1/μ2 = 0.95 > 0.75.
When the inference is based on the difference in means with equal and known
underlying variances (i.e., when l/δ = 1), the overall sample size for a 2:1 alloca-
tion will be 12.5% greater than the overall sample size for a 1:1 allocation. In this
example for these methods, the percentage increase in the overall sample size
from a 1:1 allocation to a 2:1 allocation was quite different from 12.5% since l/δ (or
lμ2/μ1 when the delta-method is used) was different from 1. The iterative nature of
the Satterthwaite and equal variance methods also influences the particular percentage increase. For each of the large sample normal, Satterthwaite (l/δ = 2/3), and delta
methods (lμ2/μ1 ≈ 0.526), there was around a 30% increase in the overall sample
size from a 1:1 allocation to a 2:1 allocation. For the equal variance method l/δ =
4/3 and instead of a 12.5% increase in the calculated value for the overall sample
size going from a 1:1 allocation to a 2:1 allocation, there was a 4.2% increase.
The optimal allocation ratio is l/δ (or lμ2/μ1 when the delta method is used). In this example, for the large sample normal and the Satterthwaite methods, k = 2/3 is the allocation ratio that minimizes the calculated overall sample size. For the large sample normal method, this corresponds to 22 subjects in the control arm and 15 subjects in the experimental arm. For the Satterthwaite-like method, 25 subjects are allocated to the control arm and 16 subjects are allocated to the experimental arm. For the delta method, k ≈ 0.526 is the allocation ratio that minimizes the calculated overall sample size with 33 subjects allocated to the control arm and 17 subjects allocated to the experimental arm. For the equal variance method, k = 4/3 is the allocation ratio that minimizes the calculated overall sample size with 21 subjects allocated to the control arm and 29 subjects allocated to the experimental arm.
For a continuous random variable X, a median µ satisfies P(X ≤ µ) = 0.5 = P(X ≥ µ). We will consider only cases involving continuous distributions having unique medians. We will denote the medians for the underlying distributions for the experimental and control arms as µE and µC, respectively, and their difference by ∆ = µE − µC.
Medians are often used to describe the central location of a distribution
that is skewed. They are less frequently used for comparison purposes. A
difference in two means between an experimental and control arm in a ran-
domized trial represents the average benefit across trial subjects from being
randomized to the experimental arm instead of the control arm. A differ-
ence in the two medians does not have any analogous interpretation unless
additional assumptions are made on the underlying distributions (e.g., that
the shapes of the underlying distributions are equal). Non-inferiority testing involving a median is rare. There are, however, properties of non-inferiority testing that are distinct to medians relative to other metrics, and there is some common methodology that is more easily discussed in terms of medians.
For positive-valued variables, the median of the log-values is the log of the
median. Therefore, the ratio of the medians can be tested through the differ-
ence of the medians of log-values. Means do not have such a property (the
mean of the log-values is not the log of the mean).
For a sample, the sample median is the middle ordered value when the sample size is odd. When the sample size is even, any value between the two middle values is a sample median; however, it is common to use the average of the two middle values as the sample median. We will use this common convention in defining the sample median when the sample size is even.
Median Test. Let M denote the number of observations in the experimental arm that are greater than the median of the combined sample. Under the null hypothesis, M has a hypergeometric distribution, where
$$P(M = m) = \binom{N/2}{m}\binom{N/2}{n_E - m}\bigg/\binom{N}{n_E}$$
for m = 0, 1, 2, …, nE, where $\binom{a}{b} = 0$ whenever a < b. When Δ = 0, the distribution for M is symmetric about the mean nE/2 with variance nCnE/[4(N – 1)]. The test rejects Ho: Δ = 0 when M < d or M > nE − d + 1, where α/2 = P(M < d|Δ = 0) = P(M > nE − d + 1|Δ = 0). For large sample sizes, the value for d can be approximated by the greatest integer less than or equal to $n_E/2 - z_{\alpha/2}\sqrt{n_E n_C/[4(N - 1)]}$. When Δ = 0 (FC = FE), $(M - n_E/2)\big/\sqrt{\lambda n_E/4} \xrightarrow{d} N(0, 1)$ as nC, nE → ∞ and nC/N → λ > 0.
Let $X_{(1)} < X_{(2)} < \cdots < X_{(n_C)}$ and $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n_E)}$ denote the respective order statistics. Without loss of generality, assume that N is an even number and that nC ≥ nE. The test statistic in Equation 12.14 can be reexpressed as
$$M = \sum_{j=1}^{n_E} I\big(Y_{(j)} - X_{(N/2-j+1)} > 0\big) \quad (12.15)$$
where 1 – αE and 1 – αC are the respective confidence coefficients for the individual confidence intervals for the medians of the experimental and control arms.
When $(L_E, U_E) = (Y_{(d_E)}, Y_{(n_E - d_E + 1)})$ and $(L_C, U_C) = (X_{(d_C)}, X_{(n_C - d_C + 1)})$ for some dE and dC, we have that for large nC and nE, from Theorem 2.2 of Hettmansperger,11 the confidence coefficients are related by
In addition, the asymptotic width of the confidence interval does not depend on the choice of αE and αC. From Theorem 2.3 of Hettmansperger,11 with probability 1,
where fC(0) is the common density at the median. Thus, there are many pairs of αE and αC that lead to a confidence interval for the difference in medians of (LE – UC, UE – LC) that has a confidence coefficient of approximately 1 – α and the same asymptotic width/efficiency. Note from Equation 12.17, when it is desired to have αE = αC, then set
is desired to have α E = α C, then set
(
zα E/2 = zα C/2 = zα/2 / λ + 1 − λ ) (12.18)
The overall assumption of the two underlying distributions having the same
shape (i.e., FC(y) = FE(y – Δ) for all y) is necessary for Equations 12.16 through
12.19 and other properties to hold. If the shapes of the two distributions
are quite different, then these methods may not produce confidence inter-
vals for the difference in medians of a desired level. Pratt13 noted that when the true medians are equal with underlying distributions having different shapes, then the two-sided level for the median test is asymptotically equal to $2[1 - \Phi(cz_{\alpha/2})]$, where $c = (1 - \lambda + \lambda\tau)\big/\sqrt{1 - \lambda + \lambda\tau^2}$, and τ is the ratio of the underlying densities (fC/fE) at the common median. It follows, for large sample sizes, that the desired significance level can be approximately maintained (as noted by Freidlin and Gastwirth15) when the underlying assumption of
equal shapes does not hold. Mathisen's Test. Let V denote the number of observations in the experimental arm that exceed the control sample median (nC odd). For large sample sizes, the value for d can be approximated by the greatest integer less than or equal to $n_E/2 - z_{\alpha/2}\sqrt{n_E(N + 1)/(4[n_C + 2])}$. When Δ = 0 (FC = FE), $(V - n_E/2)\big/\sqrt{n_E/(4\lambda)} \xrightarrow{d} N(0, 1)$ as nC, nE → ∞ and nC/N → λ > 0.
Asymptotic results can also be applied when nC is even and large. This can be shown from the relation
$$\sum_{j=1}^{n_E} I\big(Y_j > X_{(n_C/2+1)}\big) \le V \le \sum_{j=1}^{n_E} I\big(Y_j > X_{(n_C/2)}\big)$$
and that each bounding sum has the same asymptotic distribution as V when nC is odd. The probability distributions for these bounding sums are also easily obtained and can be used to obtain approximate critical values. Alternatively, for ease when nC is even, $\sum_{j=1}^{n_E} I(Y_j > X_{(n_C/2+1)})$ or $\sum_{j=1}^{n_E} I(Y_j > X_{(n_C/2)})$ can be used as the test statistic instead of V. Note that neither of these sums has a symmetric distribution when Δ = 0.
The test statistic in Equation 12.20 can be reexpressed as
$$V = \sum_{j=1}^{n_E} I\big(Y_{(j)} - X_{((n_C+1)/2)} > 0\big) \quad (12.21)$$
Therefore, from Equation 12.21, the value of V depends on the two samples through the ordered values of
$$Y_{(1)} - X_{((n_C+1)/2)} < Y_{(2)} - X_{((n_C+1)/2)} < \cdots < Y_{(n_E)} - X_{((n_C+1)/2)}$$
The decision from Mathisen’s test (rejecting or failing to reject Ho: Δ = 0) can
depend on which sample’s median is used. If the roles were switched for the
X’s and Y’s, the decision may change.
We provide a table similar to Table 1 of Hettmansperger.11 When α = 0.05
for each method, Table 12.10 provides the confidence coefficients for the
individual intervals for the control and experimental medians for allocation
ratios between 1 and 3, and as the allocation ratio goes to infinity. Equations
12.16, 12.18, 12.19, and 12.22 are used to determine the confidence coefficients.
Those common entries differ somewhat from those of Hettmansperger.11 The
TABLE 12.10
Confidence Coefficients for the Intervals for Control and Experimental Medians when α = 0.05

         Mathisen's Test     Median Test         Equal Coefficients   Equal Lengths
nC/nE    1 – αC   1 – αE     1 – αC   1 – αE     1 – αC = 1 – αE      1 – αC   1 – αE
1        0        0.994      0.834    0.834      0.834                0.834    0.834
1.5      0        0.989      0.785    0.871      0.836                0.879    0.794
2        0        0.984      0.742    0.890      0.840                0.910    0.770
2.5      0        0.980      0.705    0.902      0.845                0.933    0.754
3        0        0.976      0.673    0.910      0.849                0.950    0.742
→∞       0        0.95       0        0.95       0.95                 1        0.673
methods are arranged so that the confidence coefficient for the control arm
increases (experimental arm decreases) going from left to right. For the con-
fidence intervals based on Mathisen’s test, as the allocation ratio increased
from 1 to 3, the confidence coefficient of the confidence interval for the con-
trol median remained at zero and the confidence coefficient of the confi-
dence interval for the experimental median decreased from 0.994 to 0.976.
For the remaining three methods, the common confidence coefficient for
the individual confidence intervals for the medians was 0.834 when equal
sample sizes were used. For the confidence intervals based on the two sam-
ple median test, as the allocation ratio increased from 1 to 3, the confidence
coefficient of the confidence interval for the control median decreased from
0.834 to 0.673, whereas the confidence coefficient of the confidence interval
for the experimental median increased from 0.834 to 0.910. The common con-
fidence coefficient for the equal coefficients case was stable, varying from
0.834 to 0.849 as the allocation ratio ranged from 1 to 3. For the approach of
using equal asymptotic length confidence intervals, as the allocation ratio
increased from 1 to 3, the confidence coefficient of the confidence interval for
the control arm increased from 0.834 to 0.950, whereas the confidence coef-
ficient of the confidence interval for the experimental arm decreased from
0.834 to 0.742. For Mathisen’s test, the median test, and the equal coefficients
cases, the limiting confidence coefficient for the confidence interval for the
experimental median was 0.95 as the allocation ratio approached infinity. For
the equal-lengths approach, the limiting confidence coefficient of the confidence interval for the experimental median was 0.673 (i.e., $z_{\alpha_E/2} \to z_{0.025}/2$) as the allocation ratio approached infinity, whereas the limiting confidence coefficient of the confidence interval for the control median was 1.
When nC/nE is less than 1, except for Mathisen’s test, the confidence coef-
ficients can be found by reversing the roles of the control and experimen-
tal arms. For Mathisen’s test when α = 0.05, the confidence coefficient for
the individual confidence interval for the experimental median is greater
than 0.999 when nC/nE < 0.549. For Mathisen’s test, all the uncertainty in the
comparison of medians is reflected in the confidence interval for the experi-
mental median. As the uncertainty in the estimation of the control median
becomes larger relative to the uncertainty in the estimation of the experi-
mental median, a greater confidence coefficient is needed for the confidence
interval for the experimental median. A two–confidence interval procedure
based on Mathisen’s test is analogous to using the point estimate of the his-
torical effect of the active control therapy as the true effect of the active con-
trol in the non-inferiority trial (and thereby ignoring the uncertainty in the
estimate) when the constancy assumption holds.
Mann–Whitney–Wilcoxon Test. The Mann–Whitney–Wilcoxon test assesses the equality of two distributions. When the assumption is made that
the two distributions have the same shape (i.e., FC(y) = FE(y – Δ) for all y and
some Δ), inferences can be made on the shift parameter, which equals the
difference in the medians. Let X1, X2, …, X nC and Y1, Y2, …, YnE denote inde-
pendent random samples from distributions having respective distribution
functions FC and FE.
Consider testing Ho: Δ = 0 against Ha: Δ ≠ 0 under the assumption that the two distributions have the same shape. The Mann–Whitney–Wilcoxon test can be based on the sum of the ranks of the observations in one of the arms among the combined observations (i.e., $\sum_{i=1}^{n_E} R(Y_i)$) or equivalently based on the test statistic
$$W = \sum_{j=1}^{n_E}\sum_{i=1}^{n_C} I(Y_j - X_i > 0)$$
where I is an indicator function. Note that when Δ = 0, W has a symmetric
distribution about the mean nCnE/2 with variance nCnE(nC + nE + 1)/12. The
test rejects Ho: Δ = 0 when W < d or W > nCnE − d + 1, where α/2 = P(W < d|Δ
= 0) = P(W > nCnE − d + 1|Δ = 0). The Mann–Whitney–Wilcoxon test is the
locally most powerful rank test when FC has a logistic distribution.
For a non-inferiority margin of δ (for some δ > 0), the hypotheses can be
expressed as Ho: Δ ≤ –δ and Ha: Δ > –δ. The corresponding test statistic is
$$W_\delta = \sum_{j=1}^{n_E}\sum_{i=1}^{n_C} I(Y_j - X_i > -\delta)$$
The null hypothesis is rejected and non-inferiority is concluded whenever
Wδ > nCnE − d + 1, where α/2 = P(Wδ > nCnE − d + 1|Δ = −δ). Alternatively,
non-inferiority can be tested by finding the corresponding confidence inter-
val for the difference in medians, Δ, and comparing the interval with −δ.
It can be shown that a 100(1 − α)% confidence interval for Δ based on the Mann–Whitney–Wilcoxon test is given by $\big(Z_{(d)}, Z_{(n_C n_E - d + 1)}\big)$, where $Z_{(1)} < \cdots < Z_{(n_C n_E)}$ are the ordered differences of Yj – Xi for i = 1, …, nC and j = 1, …, nE. Non-inferiority is concluded when this confidence interval only contains values greater than −δ. For large sample sizes, the value for d can be approximated by the greatest integer less than or equal to $n_C n_E/2 - z_{\alpha/2}\sqrt{n_C n_E(n_C + n_E + 1)/12}$. The asymptotic behavior of the test can also be determined when FC ≠ FE. When FC ≠ FE, W has mean $n_C n_E p_1$ and variance
$$n_C n_E(p_1 - p_1^2) + n_C n_E(n_E - 1)(p_2 - p_1^2) + n_C n_E(n_C - 1)(p_3 - p_1^2),$$
where $p_1 = P(Y_1 > X_1)$, $p_2 = P(\min(Y_1, Y_2) > X_1)$, and $p_3 = P(Y_1 > \max(X_1, X_2))$ (see Theorem 3.5.1 of Hettmansperger17). When FC ≠ FE, it follows that as nC, nE → ∞ and nC/N → λ > 0, $P(W^* > z_{\alpha/2}) \to 0$ if $p_1 < 0.5$, $P(W^* > z_{\alpha/2}) \to 1 - \Phi\big(z_{\alpha/2}\big/\sqrt{12\{(1 - \lambda)(p_2 - 0.25) + \lambda(p_3 - 0.25)\}}\big)$ if $p_1 = 0.5$, and $P(W^* > z_{\alpha/2}) \to 1$ if $p_1 > 0.5$.
Since the test statistics are distribution free when Δ = 0, for small samples
d can either be found from mathematical calculations or by simulations. The
asymptotic approximate results can be used for large sample sizes.
These methods can be used to find confidence intervals for the ratio of medi-
ans when the two underlying distributions are positive-valued and “differ” by
a scale factor. In this case we have FC(0) = 0 and FC(y) = FE(θy) for all y > 0 and
some θ > 0, which represents the experimental to control ratio of the medians.
Then the underlying distributions for the logarithms of the observations have
the same shape (i.e., GC(y) = GE(y + logθ)) with the difference in medians of log
θ. These methods can be used to find a confidence interval for log θ that can be
converted to a confidence interval for the ratio of medians θ.
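The interval from the ordered pairwise differences is easy to compute directly for moderate sample sizes. A minimal sketch using the large-sample approximation for d:

```python
import numpy as np
from scipy import stats

def mww_shift_ci(x, y, alpha=0.05):
    """CI for the shift Delta from the ordered differences Y_j - X_i."""
    n_c, n_e = len(x), len(y)
    z = np.sort((y[:, None] - x[None, :]).ravel())
    sd = np.sqrt(n_c * n_e * (n_c + n_e + 1) / 12.0)
    d = int(np.floor(n_c * n_e / 2.0 - stats.norm.ppf(1 - alpha / 2) * sd))
    return z[d - 1], z[n_c * n_e - d]  # Z_(d) and Z_(nC nE - d + 1), 1-based

# For a ratio of medians with positive data, apply the same interval to the
# logarithms of the observations and exponentiate the endpoints.
```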
For a random sample of size n from a distribution with median µ and density f positive and continuous at µ, the sample median satisfies
$$\sqrt{n}(\tilde{X} - \mu) \xrightarrow{d} N\big(0, (4[f(\mu)]^2)^{-1}\big)$$
For the difference in the two sample medians, as nC, nE → ∞ and nC/N → λ > 0,
$$\sqrt{N}(\hat{\Delta} - \Delta) \xrightarrow{d} N\big(0, (4\lambda(1 - \lambda)[f(\mu)]^2)^{-1}\big).$$
When comparing two distributions that have the same shape (i.e., FC(y) = FE(y – Δ) for all y and some Δ) where a common variance exists, $4[f(\mu)]^2\sigma^2$ is also the relative efficiency of the difference in sample medians to the difference in sample means when there are independent random samples. When the variance does not exist, the median or difference in medians is more efficient. For underlying double exponential distributions, the relative efficiency of the difference in sample medians to the difference in sample means is 2.
Hodges–Lehmann Estimator of the Difference in Medians. The corresponding estimator of the difference in medians based on the Mann–Whitney–Wilcoxon test is the Hodges–Lehmann estimator $\hat{\Delta}_{HL} = \mathrm{med}_{i,j}(Y_j - X_i)$. As nC, nE → ∞ and nC/N → λ > 0, $\sqrt{N}(\hat{\Delta}_{HL} - \Delta) \xrightarrow{d} N(0, \tau^{-2})$, where
$$\tau = \sqrt{12\lambda(1 - \lambda)} \int_{-\infty}^{\infty} f_C^2(x)\,\mathrm{d}x.$$
When the two underlying distributions have the same shape, the efficiency of the Hodges–Lehmann estimator of the difference in medians relative to the difference in sample medians equals $3\big[\int_{-\infty}^{\infty} f_C^2(x)\,\mathrm{d}x\big]^2\big/[f(\mu)]^2$. For normal random samples with equal underlying variances, this relative efficiency is approximately 1.5 and the relative efficiency of the Hodges–Lehmann estimator to the difference in means is approximately 0.955. For random samples from double exponential distributions, the relative efficiency of the Hodges–Lehmann estimator of the difference in medians to the difference in sample means is approximately 1.5.
where 0 ≤ δ < 0.5. The value for δ, the non-inferiority margin, should depend on both the effect of the control therapy and the differences in the utility or preferences of the categories. In general, the fewer the number of categories (e.g., only "failure" and "success"), the larger the difference in utility between successive categories will be, and thus the tendency for a smaller margin.
We will describe the test procedure in Munzel and Hauschke18 for testing the hypotheses in Expression 12.23. Let $X_1, X_2, \ldots, X_{n_C}$ and $Y_1, Y_2, \ldots, Y_{n_E}$ denote independent random samples from distributions having respective distribution functions FC and FE. An unbiased, consistent estimator for p is given by $\hat{p} = (\bar{R}_E - (n_E + 1)/2)/n_C$, where $\bar{R}_E$ is the arithmetic average of the ranks of the observations in the experimental arm among all observations. That is, $\bar{R}_E = \sum_{j=1}^{n_E} R(Y_j)/n_E$, where $R(Y_j)$ is the rank of Yj in the ordering of the combined sample. Define also $\bar{R}_C = \sum_{j=1}^{n_C} R(X_j)/n_C$, where $R(X_j)$ is the rank of Xj in the ordering of the combined sample. When ties occur, $R(Y_j)$ ($R(X_j)$) is the midrank. Let $R^{(C)}(X_j)$ denote the rank of Xj among $X_1, \ldots, X_{n_C}$ and let $R^{(E)}(Y_j)$ denote the rank of Yj among $Y_1, \ldots, Y_{n_E}$. Define
$$J_C^2 = \sum_{j=1}^{n_C}\big(R(X_j) - R^{(C)}(X_j) - \bar{R}_C + (n_C + 1)/2\big)^2\big/(n_C - 1)$$
and
$$J_E^2 = \sum_{j=1}^{n_E}\big(R(Y_j) - R^{(E)}(Y_j) - \bar{R}_E + (n_E + 1)/2\big)^2\big/(n_E - 1).$$
Then for i = E, C, define $\hat{u}_i^2$ as the corresponding rank-based variance estimator based on $J_i^2$, and the test statistic
$$Q = \frac{\hat{p} - (0.5 - \delta)}{\sqrt{\hat{u}_C^2/n_C + \hat{u}_E^2/n_E}} \quad (12.24)$$
For large sample sizes, Q may be compared with standard normal critical values; the approximation can be inaccurate for small sample sizes. They recommend, for per-group sample sizes between 15 and 50, using a Satterthwaite-like approximation of the degrees of freedom. The quality of the approximation of the degrees of freedom was dependent on the number of categories and was deemed sufficient when there were at least three categories.
The value of tα/2,ν (where ν is the Satterthwaite degrees of freedom) would replace zα/2 for the determination of confidence intervals and as a critical value in hypothesis testing.
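The statistic in Expression 12.24 can be computed with midranks via scipy.stats.rankdata. In the sketch below, the normalizations $\hat{u}_C^2 = J_C^2/n_E^2$ and $\hat{u}_E^2 = J_E^2/n_C^2$ are an assumed (Brunner–Munzel-type) choice, since the text's exact definition of $\hat{u}_i^2$ is not reproduced above.

```python
import numpy as np
from scipy import stats

def q_statistic(x, y, delta):
    """Sketch of the rank-based statistic Q in Expression 12.24."""
    n_c, n_e = len(x), len(y)
    r_all = stats.rankdata(np.concatenate([x, y]))  # midranks, combined sample
    r_x, r_y = r_all[:n_c], r_all[n_c:]
    p_hat = (r_y.mean() - (n_e + 1) / 2.0) / n_c
    j2_c = np.sum((r_x - stats.rankdata(x) - r_x.mean() + (n_c + 1) / 2.0) ** 2) / (n_c - 1)
    j2_e = np.sum((r_y - stats.rankdata(y) - r_y.mean() + (n_e + 1) / 2.0) ** 2) / (n_e - 1)
    u2_c, u2_e = j2_c / n_e**2, j2_e / n_c**2  # assumed Brunner-Munzel normalization
    return (p_hat - (0.5 - delta)) / np.sqrt(u2_c / n_c + u2_e / n_e)
```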
For sizing a trial, let p′ represent the assumed value for p. Let σ 1 and σ 2
denote the respective assumed underlying standard deviations of FC(X1) and
FE(Y1), where l = σ 1/σ 2. Let k = nE/nC denote the allocation ratio. Then the
sample size for the control arm is given by
When the sample sizes are small, the term zα/2 + zβ in Equation 12.25 can be
replaced with tα/2,ν + tβ,ν (where v is the Satterthwaite degrees of freedom).
This creates an equation where iterations will be needed to determine the
sample sizes, since nC appears on both sides of the equation.
For two noncontinuous distributions, Wellek and Hampel20 proposed a nonparametric test of equivalence around the parameter P(Y > X | Y ≠ X). This parameter ignores the ties and the probability of a tie. For both equivalence and non-inferiority testing, a tie is consistent with the alternative hypothesis. Therefore, use of the parameter P(Y > X | Y ≠ X) will greatly penalize an experimental therapy in situations where a tie is quite likely and will lead to conservative testing, as noted by Munzel and Hauschke.18
For equivalence testing, a constant odds ratio based on the Wilcoxon mid-
ranks statistic and derived from the corresponding exact permutation distri-
bution was considered by Mehta, Patel, and Tsiatis.21
Often scores are assigned to the ordered categories and the data are treated
as continuous. However, it may be difficult to interpret a specific difference
between arms in the score, and the scores themselves may be subjective.
References
1. Hollander, M. and Wolfe, D.A., Nonparametric Statistical Methods, John Wiley &
Sons, New York, 1973.
2. Wiens, B.L., Randomization as a basis for inference in non-inferiority trials,
Pharm. Stat., 5, 265–271, 2006.
3. Good, P., Permutation, Parametric, and Bootstrap Tests of Hypotheses, Springer, New
York, NY, 2005.
4. Satterthwaite, F., An approximate distribution of estimates of variance compo-
nents, Biometrics, 2, 110–114, 1946.
13.1 Introduction
Many meaningful clinical endpoints are time-to-event endpoints—for
example, overall survival, time to a response, time to a cardiac-related event,
and time to progressive disease. When the intention and the outcome are that all subjects are followed until the event is observed (no censoring), time-to-event endpoints can be analyzed as continuous endpoints. The inferences can be based on the mean, median, or some other quantity relevant to continuous endpoints. However, in most practical cases involving a time-to-event endpoint, not all subjects are followed until an event (i.e., some subjects have their times censored). This limits the types of analyses that can be performed.
Nonparametric inferences on means and/or medians may not be possible. To
base the inference on means or medians may require following subjects for
a long and perhaps impractical length of time.
For a time-to-event endpoint, the amount of available information for
inferential purposes is tied to the total number of events and increases either
by continuing the follow-up on subjects that have not had events or by begin-
ning to follow additional subjects for events. For standard binary or con-
tinuous endpoints, the amount of available information increases solely by
including the outcomes of additional subjects.
For clinical trials, most time-to-event endpoints are defined as the time
from randomization (or enrollment or start of therapy) to the event of inter-
est or the first of many events of interest. Typically, at randomization, the
subject does not have the event or any of the events of interest. Starting the
time-to-event endpoint at randomization is also important because it is through randomization that subjects and their prognoses are fairly allocated to the treatment arms. In addition, to maintain the integrity of
this fairness of randomization, intent-to-treat analyses should be conducted
where all subjects are followed, regardless of adherence, to an event or the
end of study (i.e., until the data cutoff date or until some prespecified maxi-
mum follow-up has been completed). This allows for a valid comparison of
the study arms.
24 months than when the mean placebo survival is 4 months. The benefit of
the experimental therapy on survival is truly defined by the improvement
in mean/expected survival. However, usually the inference is not based on
the difference in mean/expected time-to-event. Therefore, the results when
positive may need to be translated into some form that provides an impres-
sion of the clinical benefit of the experimental therapy. Whenever possible,
although such instances are few, an inference should be based on a differ-
ence of means.
Composite Endpoints. Composite time-to-event endpoints are popular.
Such an endpoint is the time to the first event in a set of events of interest.
Examples include the time to the first event of stroke, myocardial infarction,
and death for cardiovascular trials, and the time to the first event of disease
progression and death in metastatic or advanced cancer. The use of a com-
posite endpoint may be necessary when the disease can be characterized
by many factors. For example, as reported by Chi,1 a disease may be charac-
terized by its pathophysiology, severity, signs and symptoms, progression,
morbidity, and mortality. Since the event or hazard rate for a composite end-
point is greater than that of the individual components, the use of a com-
posite endpoint has the advantage of requiring fewer subjects and having
an earlier analysis than a trial designed on an individual component. Many
researchers have written on the issues and disadvantages of using composite
endpoints.1–4
The individual components of a composite endpoint should be relevant
and meaningful for subjects and constitute clinical benefit. When the com-
ponents are equally important, and a new drug demonstrates superior effi-
cacy, the particular distribution of events across the individual components
or the differences between arms in the distribution of the events do not con-
tribute any necessary additional information on the new drug’s overall ben-
efit. When the importance varies across components, it is more difficult to
interpret the endpoint and the corresponding results. For example, a drug
that improves a major component (e.g., death) while having an adverse effect
on a minor component may be beneficial, but a drug that improves a minor
component while having an adverse effect on a major component would not
be beneficial. In addition, the severity of a component may change over time
owing to improvements in the treatment or management of the component,
thus reducing the utility of including that component in the composite end-
point. A composite endpoint may not be sensible if the components have
widely different importance.
When the components are not equally important, additional analyses involving subcomposite endpoints may need to be done to assess the effects. The subcomposite time-to-event endpoint should exclude the events of less relative importance. This process of excluding events of lesser importance may need to be repeated until an analysis is done on a subcomposite time-to-event endpoint that includes only the most important events of equal value. To perform valid analyses on these additional
$$h(t) = \lim_{\varepsilon \to 0} \frac{P(t \le T \le t + \varepsilon \,|\, T \ge t)}{\varepsilon} \quad (13.1)$$
For an individual subject, the hazard rate at time t, h(t), represents the
instantaneous risk of an event at time t given that the subject has not
had an event by time t. If the event is death, h(t) represents the instanta-
neous risk of death at time t for a subject who is alive as time t approaches.
Additionally, for a subject who is alive at time t (without an event by time
t), the probability the subject dies (has an event) during the next ε of time is
approximately εh(t) for a small ε. Evaluating the limit in Equation 13.1 gives
$$h(t) = \lim_{\varepsilon \to 0} \frac{P(t \le T \le t + \varepsilon \,|\, T \ge t)}{\varepsilon} = \frac{f(t)}{S(t)} = -\frac{\mathrm{d}}{\mathrm{d}t}\log S(t).$$
The cumulative hazard function is given by $H(t) = \int_0^t h(x)\,\mathrm{d}x = -\log S(t)$ for t ≥ 0.
When the hazard functions for the experimental and control arms are proportional, then the common ratio of the hazard functions, called the hazard ratio, is often used to measure the difference in the two distributions. The hazard ratio, θ, satisfies
$$\theta = \frac{h_E(t)}{h_C(t)} = \frac{H_E(t)}{H_C(t)} = \frac{-\log S_E(t)}{-\log S_C(t)}$$
and $S_E(t) = [S_C(t)]^\theta$ for all t ≥ 0.
In this chapter we will discuss the types of censoring, reasons for cen-
soring, and the issue of censoring deaths in Section 13.2. Non-inferiority
analyses involving exponential distributions are discussed in Section 13.3.
Non-inferiority analysis based on a hazard ratio from a proportional haz-
ards model is discussed in Section 13.4. Non-inferiority analyses either at
landmarks or involving medians are discussed in Section 13.5. The extension
of the testing problem in Section 13.5 to an inference over a preset interval is
discussed in Section 13.6.
13.2 Censoring
Throughout this chapter, the only type of censoring that will be considered is
right censoring. Right censoring means that a subject’s true time is unknown
and to the right of (greater than) the censored time. When the term censoring
is used, it will refer to right censoring unless otherwise stated.
Informative censoring occurs when the prognosis of a given subject with
a censored time is not independent of the censoring. In other words, what
to expect for a given subject’s ultimate or true time-to-event, which is cen-
sored at time x, is not represented by the follow-up experience of those
subjects in the same group with times that exceed x. Whenever a subject is
censored because treatment is being withheld because of a declining physical condition, the censoring is likely informative.
1. The subject did not have an event observed at the time of the data
cutoff for the analysis.
2. The subject completed the prespecified required time on study with-
out having an event.
For this type of censoring, the random censoring times are not independent of the
actual times. Although the censoring times are not independent of the time-
to-event endpoints, the censoring is not informative for a given treatment arm
provided the actual times to the event are a random sample.
Inclusion of Death as an Event. Censoring a subject’s time-to-event endpoint
because of death or death not related to disease is problematic and creates a
hypothetical endpoint. Follow-up ceases at death; there is no remaining time
to the event and thus there is no loss to follow-up or missing data. Complete
information or follow-up has been done on the subject. When death is not an
event of interest for the time-to-event endpoint, there is no actual time-to-
event for a subject who dies without being observed for an event of interest.
In situations where such censoring occurs, the subjects’ time-to-event times
in a given arm are not a random sample from that distribution that is being
estimated. Such an endpoint is a hypothetical endpoint. That is, for a given arm, what is being estimated is the distribution for the time-to-event, or the time-to-event when death occurs without experiencing the event, and we pretend that the remaining time to an event for dead subjects can be represented by the remaining time to an event for living subjects still under observation for an event. Having living subjects represent dead subjects goes beyond informative censoring and defies common sense. Another setting where this censoring
occurs involves analyses that include only disease-related deaths. This not
only creates a hypothetical endpoint where living subjects represent dead
subjects but also the determination of whether a death was related to the
disease, or even to the treatment, may be inexact and subjective.
times are better than smaller times (i.e., the event is undesirable), the null and alternative hypotheses to be considered are
$$H_o{:}\ \mu_C - \mu_E \ge \delta \quad \text{versus} \quad H_a{:}\ \mu_C - \mu_E < \delta \quad (13.2)$$
That is, the null hypothesis is that the active control is superior to the experi-
mental treatment by at least the prespecified quantity of δ ≥ 0. The alterna-
tive hypothesis is that the active control is superior by a smaller amount, or
the two treatments are identical, or the experimental treatment is superior.
When δ = 0, the hypotheses in Expression 13.2 reduce to classical one-sided
hypotheses for superiority testing. Rejection of the null hypothesis (Ho) leads
to the conclusion that the experimental treatment is noninferior to the con-
trol treatment. When smaller times are more desired than larger times (i.e.,
the event is desirable), the roles of μE and μ C in the hypotheses in Expression
13.2 would be reversed (i.e., test Ho: μ C – μE ≤ –δ vs. Ha: μ C – μE> –δ).
Let rE and rC denote the number of events observed in the experimental and control arms, respectively. For type II censoring, an approximate 100(1 – α)% confidence interval for the difference in means is given by $\hat{\mu}_E - \hat{\mu}_C \pm z_{\alpha/2}\sqrt{\hat{\mu}_E^2/r_E + \hat{\mu}_C^2/r_C}$ when rE and rC are sufficiently large. For type I censoring, an approximate 100(1 – α)% confidence interval based on Cox's proposal is given by
$$\frac{2r_E\hat{\mu}_E}{2r_E + 1} - \frac{2r_C\hat{\mu}_C}{2r_C + 1} \pm z_{\alpha/2}\sqrt{\frac{\hat{\mu}_E^2}{r_E + 0.5} + \frac{\hat{\mu}_C^2}{r_C + 0.5}}$$
when rE and rC are sufficiently large.
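Both interval formulas are one-liners given the arm summaries. A sketch, using the event counts and mean estimates for this example as they are quoted in Example 13.2 below:

```python
import numpy as np

def diff_means_ci(mu_e, r_e, mu_c, r_c, z=1.959964, type1=True):
    """Approximate CI for mu_E - mu_C under exponential models; type1=True
    uses Cox's proposal for type I censoring, otherwise the type II formula."""
    if type1:
        est = 2 * r_e * mu_e / (2 * r_e + 1) - 2 * r_c * mu_c / (2 * r_c + 1)
        half = z * np.sqrt(mu_e**2 / (r_e + 0.5) + mu_c**2 / (r_c + 0.5))
    else:
        est = mu_e - mu_c
        half = z * np.sqrt(mu_e**2 / r_e + mu_c**2 / r_c)
    return est, (est - half, est + half)

print(diff_means_ci(105.4, 137, 117.4, 63, type1=True))
print(diff_means_ci(105.4, 137, 117.4, 63, type1=False))
```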
The following example illustrates the use of these formulas.
Example 13.1
TABLE 13.1
Calculated 95% Confidence Intervals for the Difference in Means by Method

Method              Estimate   95% CI          Width
Type I censoring    –11.5      (–45.3, 22.4)   67.7
Type II censoring   –12.0      (–46.0, 21.9)   67.9
more than the estimate of the experimental mean. This led to a larger estimate of
the difference in means (–11.5 vs. –12). Had the values for the experimental and
control arms been exchanged for each other, the type II censoring method would
have the larger estimate for the difference in means (12 vs. 11.5).
That is, the null hypothesis is that the mean for the active control is superior
to that of the experimental treatment by at least δμ C, where δ ≥ 0. The alterna-
tive hypothesis is that the active control is superior by a smaller amount, or
the two treatments are identical, or the experimental treatment is superior.
When δ = 0, the hypotheses in Expression 13.3 reduce to classical one-sided
hypotheses for a superiority trial. Rejection of the null hypothesis (Ho) leads
to the conclusion that the experimental treatment is noninferior to the con-
trol treatment. When smaller times are more desired than larger times, the
roles of μE and μ C in the hypotheses in Expression 13.3 would be reversed
(i.e., test Ho: μ C/μE ≤ 1 – δ vs. Ha: μ C/μE> 1 – δ). Note that μE/μ C is also the
scale factor relating the two exponential distributions and the control versus
experimental hazard ratio (the ratio of the instantaneous risk of an event).
For type II censoring within each arm, we have that $V = (\hat{\mu}_E/\mu_E)\big/(\hat{\mu}_C/\mu_C)$ has an F distribution with 2rE and 2rC degrees of freedom in the numerator and denominator, respectively. For any 0 < γ < 1, the corresponding 100(1 – γ)% percentile, $F_{\gamma,2r_E,2r_C}$, is defined as that value satisfying $\gamma = P(V > F_{\gamma,2r_E,2r_C})$. For other F distributions, similar notation will be used for the percentiles. Note that $F_{1-\gamma,2r_C,2r_E} = 1/F_{\gamma,2r_E,2r_C}$. For 0 < α < 1, we have that
$$P\left(F_{1-\alpha/2,2r_E,2r_C} < \frac{\hat{\mu}_E/\mu_E}{\hat{\mu}_C/\mu_C} < F_{\alpha/2,2r_E,2r_C}\right) = P\left((\hat{\mu}_E/\hat{\mu}_C)F_{1-\alpha/2,2r_C,2r_E} < \frac{\mu_E}{\mu_C} < (\hat{\mu}_E/\hat{\mu}_C)F_{\alpha/2,2r_C,2r_E}\right).$$
Thus, a 100(1 – α)% confidence interval for the hazard ratio is given by
$$\big((\hat{\mu}_E/\hat{\mu}_C)F_{1-\alpha/2,2r_C,2r_E},\ (\hat{\mu}_E/\hat{\mu}_C)F_{\alpha/2,2r_C,2r_E}\big).$$
For type I censoring, the corresponding approximate 100(1 – α)% confidence interval is
$$\left(\frac{r_E(2r_C + 1)\hat{\mu}_E}{r_C(2r_E + 1)\hat{\mu}_C}F_{1-\alpha/2,2r_C+1,2r_E+1},\ \frac{r_E(2r_C + 1)\hat{\mu}_E}{r_C(2r_E + 1)\hat{\mu}_C}F_{\alpha/2,2r_C+1,2r_E+1}\right).$$
These results may apply approximately for other types of random censoring.
For endpoints for which larger values are better, non-inferiority
is concluded if the confidence interval for the ratio of means (the recipro-
cal of the hazard ratio) contains only values greater than the non-inferiority
threshold.
Another way of determining a confidence interval for the hazard ratio applies the asymptotic distributions for the natural log of the maximum likelihood estimators of μE and μC. For type II censoring, an approximate 100(1 – α)% confidence interval for the control versus experimental log hazard ratio is given by $\log\hat{\mu}_E - \log\hat{\mu}_C \pm z_{\alpha/2}\sqrt{1/r_E + 1/r_C}$. For type I censoring, from applying the proposal of Cox, we analogously have an approximate 100(1 – α)% confidence interval for the control versus experimental log hazard ratio given by
$$(\log\hat{\mu}_E - \log\hat{\mu}_C) + \log\frac{r_C(2r_E + 1)}{r_E(2r_C + 1)} \pm z_{\alpha/2}\sqrt{\frac{1}{r_E + 0.5} + \frac{1}{r_C + 0.5}}.$$
These results may apply approximately for other types of random censoring.
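These intervals are easy to compute with scipy. Note that the text's $F_{\gamma,a,b}$ is an upper-tail percentile, so scipy's lower-tail quantile function is called with the complementary probability; the inputs below can be checked against Example 13.2.

```python
import numpy as np
from scipy import stats

def hr_ci(mu_e, r_e, mu_c, r_c, alpha=0.05, use_f=True, type1=True):
    """CIs for the control vs. experimental hazard ratio mu_E/mu_C."""
    if use_f:
        if type1:
            c = r_e * (2 * r_c + 1) * mu_e / (r_c * (2 * r_e + 1) * mu_c)
            dfn, dfd = 2 * r_c + 1, 2 * r_e + 1
        else:
            c = mu_e / mu_c
            dfn, dfd = 2 * r_c, 2 * r_e
        # F_{1-alpha/2} (upper tail) is the lower-tail alpha/2 quantile, and vice versa.
        return c * stats.f.ppf(alpha / 2, dfn, dfd), c * stats.f.ppf(1 - alpha / 2, dfn, dfd)
    log_hr = np.log(mu_e / mu_c)
    if type1:
        log_hr += np.log(r_c * (2 * r_e + 1) / (r_e * (2 * r_c + 1)))
        half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(1 / (r_e + 0.5) + 1 / (r_c + 0.5))
    else:
        half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(1 / r_e + 1 / r_c)
    return np.exp(log_hr - half), np.exp(log_hr + half)
```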
Example 13.2 illustrates these formulas.
Example 13.2
We revisit Example 13.1. Note that the true control versus experimental hazard
ratio is 10/9 (≈1.111). We will apply each confidence interval formula for a hazard
ratio or log-hazard ratio conditioning on the number of uncensored observations
in each arm. We again have rE = 137 uncensored observations in the experimen-
tal arm with µ̂E = 105.4 and rC = 63 uncensored observations in the control arm
with µ̂C = 117.4. The corresponding confidence intervals for the hazard ratio are
provided in Table 13.2. All confidence intervals contain the true control versus experimental hazard ratio of roughly 1.111. For a non-inferiority threshold of 0.8,
non-inferiority would fail to be concluded regardless of the method. The confi-
dence intervals for the type II censoring methods have a slightly greater relative
width than the type I censoring methods. Which method is more conservative
depends on various factors, including whether the experimental or control arm
performed better in the clinical trial. In this example, the control arm performed
better (despite having a poorer underlying distribution) and the F distribution
methods gave more conservative intervals (i.e., had smaller lower limits) than
their normal distribution counterparts. Likewise, the type II censoring methods
were more conservative than the type I censoring methods. Had the values for the experimental and control arms been exchanged for each other, the opposite relations for conservatism would have held, as noted by the order of the upper limits of the 95% confidence intervals.

TABLE 13.2
Calculated 95% Confidence Intervals for the Hazard Ratio by Method

Method                                   Estimate   95% CI           Relative Width
Type I censoring, F distribution         0.901      (0.663, 1.205)   1.816
Type II censoring, F distribution        0.898      (0.660, 1.201)   1.820
Type I censoring, normal distribution    0.901      (0.670, 1.214)   1.813
Type II censoring, normal distribution   0.898      (0.666, 1.210)   1.816
function and represents the hazard function for a subject having β′x = 0,
when such is possible. Per Cox,6 estimation of β through a partial likelihood
function does not depend on the function h0.
Suppose there are 10 subjects at risk of an event at time t (i.e., as time
approaches t), for some t > 0, having hazard rates hi(t) for i = 1, . . . , 10 that are
continuous at t. Given that an event occurred at time t for exactly 1 of the 10
subjects, the probability that subject j had the event is given by
10
h j (t) ∑ h (t)
i=1
i (13.5)
The probability would remain the same if each subject’s hazard rate was mul-
tiplied or divided by some positive constant c (e.g., divided by a baseline haz-
ard rate value denoted by h0(t)). The partial likelihood function for β is based
on the product of conditional probabilities like that in Expression 13.5.
Let xi denote the vector of explanatory variables for the ith subject, i = 1,
. . . , n. Suppose that among the n subjects, k subjects are each followed to an
event where their times to an event are different, whereas n–k subjects have
their times to an event censored. Let t(1) < t (2) < . . . < t(k) denote the ordered
times when events occurred. For i = 1, . . . , k define R(t(i)) as the set of indices
of those subjects at risk of an event as time t(i) approaches (i.e., consists of the
indices of subjects whose time to event, censored or uncensored, is at least
t(i)) and let x(i) denote the vector of explanatory variables for the subject that
had an event at time t(i). Then applying the multiplication rule of probabilities
to the k probabilities of the form of Expression 13.5 at the times of the events
leads to the partial likelihood for β of
L(β) = ∏ exp(β′x
i=1
(i) ) ∑ exp(β′x )
j∈R ( t( i ) )
j (13.6)
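In practice this partial likelihood is maximized by standard software. A minimal sketch with the third-party lifelines package (a recent version is assumed) and made-up data:

```python
import pandas as pd
from lifelines import CoxPHFitter

# 'time' is the observed (possibly censored) time, 'event' is 1 if the event
# was observed and 0 if censored, 'treat' is 1 for the experimental arm.
df = pd.DataFrame({
    "time":  [5.0, 8.2, 3.1, 9.4, 7.7, 2.9, 6.5, 4.4],
    "event": [1, 0, 1, 1, 0, 1, 1, 0],
    "treat": [1, 1, 1, 1, 0, 0, 0, 0],
})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
beta1 = cph.params_["treat"]              # estimated log-hazard ratio
print(beta1, cph.confidence_intervals_)   # Wald-type 95% CI for beta1
```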
The use of the partial likelihood assumes that the censoring mechanisms are independent of the actual times to the event. Formally, if the
censoring is not independent, using the partial likelihood as a basis for infer-
ence may not be justified. It would be particularly problematic if the amount
of informative censoring is substantial.
The censoring mechanism being “random” requires that when an indi-
vidual censored at an early time would have survived without an event to
some later time, t″, their hazard rate of an event at time t″ would be the same
as that of another subject, having the same set of values for the explanatory
variables, who survived to time t″ without having an event. In essence, cen-
soring and the process of achieving an event are determined by independent
mechanisms.
Let β1 denote the experimental versus control log-hazard ratio correspond-
ing to the model given in Expression 13.4. Then the experimental versus
control hazard ratio is θ = exp(β1). For a non-inferiority threshold θo ≥ 1, the
hypotheses are expressed as
$$H_o{:}\ \theta \ge \theta_o \quad \text{versus} \quad H_a{:}\ \theta < \theta_o \quad (13.7)$$
Let $\hat{\beta}_1$ denote the maximum likelihood Cox estimator (often referred to as a Wald's estimator) of β1 and let $se(\hat{\beta}_1)$ denote an estimate of its standard error.
We will elaborate on the form for the standard error later. An approximate
100(1 – α)% confidence interval for the experimental versus control hazard ratio, θ = exp(β1), is given by $\exp\big(\hat{\beta}_1 \pm z_{\alpha/2}\,se(\hat{\beta}_1)\big)$, where the asymptotic standard error of $\hat{\beta}_1$ can be expressed as
$$\left[n_E n_C \int_0^1 \big(n_C + n_E\theta u^{(\theta-1)/\theta}\big)^{-1}\,\mathrm{d}u\right]^{-1/2} \quad (13.9)$$
TABLE 13.3
Asymptotic Relative Efficiencies and Ratios of Asymptotic Standard Errors

Hazard Ratio or Its Reciprocal   Asymptotic Relative Efficiencya   Ratio of the Asymptotic Standard Errorsb
1.00                             1.0000                            1.0000
0.95                             0.9993                            0.9997
0.90                             0.9972                            0.9986
0.85                             0.9935                            0.9968
0.80                             0.9879                            0.9939
0.75                             0.9803                            0.9901
0.70                             0.9703                            0.9851
0.65                             0.9578                            0.9787
0.60                             0.9424                            0.9708
0.55                             0.9238                            0.9611
0.50                             0.9014                            0.9494
0.45                             0.8747                            0.9352
0.40                             0.8429                            0.9181
0.35                             0.8050                            0.8972
0.30                             0.7596                            0.8716
a Cox estimator to the maximum likelihood estimator.
b Maximum likelihood estimator to the Cox estimator.
For rE and rC equal to the observed numbers of events in the experimental and control arms, the quantity $\sqrt{1/r_E + 1/r_C}$ has been a useful estimate of the unrestricted standard error of the log-hazard ratio when determining confidence intervals. An approximate 100(1 – α)% confidence interval for the experimental versus control hazard ratio would then be given by
$$\exp\big(\ln\hat{\theta} \pm z_{\alpha/2}\sqrt{1/r_E + 1/r_C}\big) \quad (13.10)$$
It should be noted that the standard error provided from statistical packages
is determined under the null hypothesis of no difference (i.e., a hazard ratio
of 1). Frequently in practice, the standard error restricted to the hazard ratio
equal to 1 will be relatively close to the quantity provided in Expression
13.10 and other estimates of the standard error. Thus, using the standard
error from the statistical packages tends to lead to the same conclusion as
using some other estimate of the standard error (e.g., unrestricted version or
restricted to the non-inferiority null hypothesis). However, caution should
be taken in the choice of an estimate of the standard error of the log-hazard
ratio. We provide two examples (Examples 13.3 and 13.4) to illustrate the use
of Expressions 13.9 and 13.10.
Example 13.3
Consider a two-arm study of the experimental drug A and the active control drug
B where 400 subjects are evenly randomized between the two arms. Suppose all
400 subjects are followed to the event of interest (e.g., death). Consider testing Ho:
θ ≥ 1.25 versus Ha: θ < 1.25. The test statistic is
$$\frac{\ln\hat{\theta} - \ln 1.25}{s}$$
where $\hat{\theta}$ is the Wald's estimator from a Cox model with treatment as the sole explanatory variable. The value for s is selected as $(2/\sqrt{400})/0.9939 = 0.10061$,
where the value of 0.9939 comes from Table 13.3. For an observed hazard ratio
of 0.95, the value for the test statistic above is –2.7277, which corresponds to a
p-value of 0.003. For a one-sided significance level of 0.025, non-inferiority is
concluded (0.003 < 0.025).
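The arithmetic in Example 13.3 can be checked with a short script; a minimal sketch in Python (using scipy, with the hazard ratio, event count, and Table 13.3 adjustment taken from the example):

```python
import numpy as np
from scipy.stats import norm

# Example 13.3: Ho: theta >= 1.25 vs. Ha: theta < 1.25; 400 events, 1:1 randomization
theta_hat = 0.95                         # observed hazard ratio
s = (2/np.sqrt(400))/0.9939              # standard error of log-hazard ratio (Table 13.3)
z = (np.log(theta_hat) - np.log(1.25))/s
print(round(z, 4), round(norm.cdf(z), 4))   # -2.7277 and one-sided p-value 0.0032
```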
Example 13.4
Consider again testing Ho: θ ≥ 1.25 versus Ha: θ < 1.25 on the basis of a two-arm
study where 1000 subjects are evenly randomized to the experimental and con-
trol arms. At the time of analysis, there are 320 and 304 events in the experimen-
tal and control arms, respectively, with a corresponding estimate of the hazard
ratio of 1.10. Using Expression 13.10 gives an estimate of the standard error of √(1/320 + 1/304) = 0.0801, from which a corresponding 95% confidence interval of
0.940–1.287 is obtained. Since the upper limit of 1.287 is greater than 1.25, non-
inferiority cannot be concluded.
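Similarly, the confidence interval in Example 13.4 follows directly from Expression 13.10; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

theta_hat, rE, rC = 1.10, 320, 304
se = np.sqrt(1/rE + 1/rC)                                # Expression 13.10
ci = np.exp(np.log(theta_hat) + np.array([-1, 1])*norm.ppf(0.975)*se)
print(round(se, 4), ci.round(3))   # 0.0801 and 95% CI (0.940, 1.287); 1.287 > 1.25
```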
Standard Errors for the Effects of Binary Covariates. For a binary (0–1) variable,
the form for the asymptotic standard error for its corresponding regression
parameter (a log-hazard ratio) is similar to that of the treatment effect. One
difference is that the standard error for the treatment effect (the treatment
log-hazard ratio) can be fairly controlled by knowing ahead of time roughly
how many subjects will be in each arm and how many events will be needed
for the analysis. For a binary covariate, the number of subjects that will have
each value, 0 and 1, is random (not controlled). It is customary to condition on
the number of subjects that have each value of the binary covariate (and the
number of events observed at each level) when determining the correspond-
ing standard error for its log-hazard ratio estimator. It is that conditioning
that justifies using the same formulas for the standard error for the binary
covariate as for the treatment effect. For ri (i = 0, 1) equal to the observed num-
ber of events among subjects having value i, a useful estimate of the asymptotic standard error for the log-hazard ratio of the binary covariate is √(1/r0 + 1/r1).
As with the treatment effect, software packages restrict the standard error to
the null hypothesis that the true value of the log-hazard ratio for the binary
covariate equals zero. If the true log-hazard ratio is not far from zero, these two estimated standard errors should be approximately equal.
It is important to note that when a variable is prognostic, and is still prog-
nostic given the set of values of any other potential covariates, the propor-
tional hazards assumption cannot simultaneously hold for the model that
includes that variable as an explanatory variable and the model that omits it
as an explanatory variable.
P(X > Y). For an experimental versus control hazard ratio of θ, it can easily
be shown that the probability is 1/(1 + θ) that a random subject, X, given the
experimental therapy will have a longer time to the event than a random
subject, Y, given the control therapy. This probability that the random subject
in the experimental arm has an event after the random subject in the control
arm has an event remains constant even when both random subjects have
“survived” for t amount of time without having an event.
For a randomly paired design, when the hazard rates are proportional,
a confidence interval can be found for the hazard ratio by first finding a
confidence interval for 1/(1 + θ), the probability that a random subject given
the experimental therapy will have a longer time to the event than a random
subject given the control therapy, and then converting the interval into a con-
fidence interval for θ. This can be done by simultaneously following subjects
within their random pairs and conditioning on the number of pairs, n, where
at least one subject had an event. On this condition, the number of pairs
where the experimental subject had a longer time has a binomial distribu-
tion with parameters n and 1/(1 + θ).
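A minimal sketch of this paired approach, with hypothetical counts (n pairs with at least one event, of which m had the experimental subject surviving longer), using an exact Clopper–Pearson interval for 1/(1 + θ) and then inverting:

```python
from scipy.stats import beta

# Hypothetical paired data: n pairs with at least one event; in m of them the
# experimental subject had the longer time to the event
n, m, alpha = 200, 120, 0.05
# Exact (Clopper-Pearson) confidence interval for q = 1/(1 + theta)
q_lo = beta.ppf(alpha/2, m, n - m + 1)
q_hi = beta.ppf(1 - alpha/2, m + 1, n - m)
# q = 1/(1 + theta) is decreasing in theta, so the interval endpoints swap
theta_lo, theta_hi = (1 - q_hi)/q_hi, (1 - q_lo)/q_lo
print(round(theta_lo, 3), round(theta_hi, 3))
```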
Additional Design Considerations. The treatment parameter, β1, being esti-
mated by a Cox model is dependent on the covariates being included and
excluded in the model. Suppose that the baseline prognosis of subjects
improves as accrual continues (e.g., owing to differences in a known prog-
nostic factor), that at any time of accrual for a subject accrued at that time
the hazard rates for the theoretical distribution of the experimental arm are
proportional to the hazard rates of the control arm with a log-hazard ratio of
β1 ≠ 0, and that all subjects are followed to the events or the same data cutoff
date. Then the “log-hazard ratio” being estimated by a Cox analysis with
treatment as the sole explanatory variable is between zero and β1. For β1 to be
that value being estimated as the treatment effect by a Cox analysis, covari-
ates that collectively completely capture the baseline prognosis of subjects
need to be included in the model.
Owing to the covariates included or excluded in a Cox model, it is impor-
tant to realize that the active control versus placebo treatment parameter
may also be different across historical trials (and in the non-inferiority trial).
Not adjusting for influential covariates in the active control therapy versus
placebo trials will tend to “underestimate” the active control effect relative to
when there is adjustment for those covariates. Also, when influential covari-
ates are adjusted for, the treatment parameter being estimated will depend
on the prognosis of the subjects. Consideration should be given, in both the historical trials and the non-inferiority trial, to capturing and adjusting for important covariates.
Group sequential non-inferiority trials can be done on a hazard ratio with
a fixed non-inferiority margin.10 When the hazards are not proportional,
having multiple analyses at different study times may be acceptable for a
superiority trial. However, nonproportional hazards can be problematic
for a non-inferiority analysis based on a hazard ratio that uses the same
threshold or margin for both analyses. Since the follow-up or censoring
distribution will depend on the time of the analysis, when the hazards are
not proportional a different parameter or value (that is called the “hazard
ratio” or “average hazard ratio”) is being estimated at each analysis. If there
have also been nonproportional hazards when comparing the active control
with placebo, then the effect of the active control therapy versus placebo as
measured by a hazard ratio depends on the follow-up or censoring distribu-
tion. In the presence of nonproportional hazards, a non-inferiority criterion
should consider the amount of subject follow-up to ensure that the rejec-
tion of the null hypothesis will truly mean that the experimental therapy
is noninferior to the active control therapy. When the non-inferiority criterion is based on the hazard ratio, the required number of events for power 1 – β at a one-sided significance level of α/2 with a k:1 randomization is approximately

((zα/2 + zβ)/(β1,a − β1,o))² (1 + k)²/k    (13.11)
where β1,a is the assumed experimental versus control log-hazard ratio (or
selected alternative), and β1,o is the non-inferiority threshold for the log-
hazard ratio. When powering for a superiority claim, β1,o = 0. Expression
13.11 was provided by Fleming10 for a one-to-one randomization (k = 1).
After determining the required number of events, the sample size will further depend on the accrual pattern, the duration of follow-up, and the underlying time-to-event and censoring distributions, which together determine the probability that a randomly selected subject will have had an event by the time of the analysis.
Example 13.5 illustrates the determination of the sample sizes for a time-to-
event endpoint.
Example 13.5
For testing Ho: θ ≥ 1.15 versus Ha: θ < 1.15 with 90% power at an assumed hazard ratio of 0.95, a one-sided significance level of 0.025, and a 2:1 randomization (k = 2), the required number of events is

((zα/2 + zβ)/(β1,a − β1,o))² (1 + k)²/k = ((1.96 + 1.2816)/(ln 0.95 − ln 1.15))² (1 + 2)²/2 ≈ 1296 events.
In determining the appropriate sample size that achieves 1296 events by some
target time, we will assume that the subjects will be accrued over 24 months
in a uniform fashion and it is desired to have the analysis 12 months after the
end of accrual (at a study time of 36 months). A random accrual time will be
modeled as uniformly distributed over the first 24 months. For ease in deter-
mining the sample size, we will assume that the underlying distributions
for the experimental and control arms are exponential distributions with
respective medians of 10 and 9.5 months. Then, the probability that a random
subject will have had an event by the study time of 36 months is
∫₀²⁴ [1 − exp(−(36 − x)/(10/ln 2))] dx/24 = 0.788 and

∫₀²⁴ [1 − exp(−(36 − x)/(9.5/ln 2))] dx/24 = 0.803
for the experimental and control arms, respectively. Therefore, after 36 months,
we would expect (2 × 0.788 + 0.803)/3 = 0.793 of the subjects to have had events.
This leads to a sample size of 1296/0.793 ≈ 1635 subjects. This sample-size calcu-
lation provides the number of subjects needed so the expected number of events
at the study time of 36 months is the number required for stopping and perform-
ing the analysis. There will be some variability to the timing in study months that
the analysis is performed (1296 events are reached). This variability may also be
considered when designing the trial.
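The calculations in Example 13.5 can be reproduced numerically; a minimal sketch using scipy's quadrature (the design inputs are those stated in the example):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

z_a, z_b, k = norm.ppf(0.975), norm.ppf(0.90), 2
beta_a, beta_o = np.log(0.95), np.log(1.15)
events = int(np.ceil(((z_a + z_b)/(beta_a - beta_o))**2 * (1 + k)**2/k))  # 13.11

def p_event(median):
    """P(event by month 36) with uniform accrual over 24 months and an
    exponential event-time distribution with the given median."""
    return quad(lambda x: 1 - np.exp(-(36 - x)/(median/np.log(2))), 0, 24)[0]/24

p_bar = (2*p_event(10) + p_event(9.5))/3
print(events, round(p_bar, 3), int(np.ceil(events/p_bar)))   # 1296, 0.793, 1635
```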
Suppose u1 < u2 < · · · < ur denote the distinct ordered times at which events were observed when combining the observations from both arms. Then, for j = 1, . . . , r, let Nj denote the number of subjects at risk of an event just before time uj and let vj = ∑_{i=1}^{j} 1/Ni. The resulting plot has for each arm the points (vj, ln(−ln Ŝ(uj))) plotted for j = 1, . . . , r. Other approaches that can be used
include having the jth gap length, vj − vj−1, equal to the reciprocal of the harmonic
mean of the number of subjects in each arm that are at risk of an event just
before the jth event (or corrected version should one arm have no subjects at
risk) or having vj = j (i.e., enumerating the event times on the x-axis). In all
these cases, the resulting plot is invariant under increasing transformations.
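A minimal sketch of the plotting coordinates described above, for one arm, with hypothetical event times, risk-set sizes, and Kaplan–Meier values:

```python
import numpy as np

def cloglog_points(n_at_risk, surv_est):
    """Points (v_j, ln(-ln S_hat(u_j))) with v_j = sum over i <= j of 1/N_i."""
    v = np.cumsum(1.0/np.asarray(n_at_risk, dtype=float))
    y = np.log(-np.log(np.asarray(surv_est, dtype=float)))
    return v, y

# Hypothetical inputs for one arm at its distinct event times u_1 < u_2 < u_3
N = [50, 44, 37]            # subjects at risk just before each event time
S = [0.98, 0.955, 0.93]     # Kaplan-Meier estimates at those times
print(cloglog_points(N, S))
```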
Time-Dependent Covariate Model. Likewise, tests of proportional hazards
involving time or log time as a time-dependent covariate are not invariant
when applying an increasing, continuous transformation to all the censored
and uncensored times. For example, the p-values for testing whether the
coefficient for the time-dependent covariate is zero will change after the
transformation h(x) = exp{x +exp{x}} is applied to all the observations, even
though the ordered arrangement of the observations remains the same.
Analogous to the rescaling that was described above for the graphical dis-
plays, the censored and uncensored times can be rescaled on the basis of
the sum of the reciprocal of the number of subjects at risk of an event. The
rescaled survival time of the jth event would be ∑_{i=1}^{j} 1/Ni. The censored
times would also be rescaled so as not to affect the overall ordered arrange-
ment of the observations.
As with other diagnostic plots used to evaluate an assumption of propor-
tional hazards, the plot of the ratio of the hazard rates will also overempha-
size the places where the estimated hazard rates are near zero (i.e., the time
intervals that are not influential in a comparison of the arms). In a published
study, the authors concluded from one such plot that the hazard ratio was not
constant, whereas from a quite different type of plot they concluded that an assumption of proportional cumulative odds may be appropriate. The conclusion is
unusual as the aspects of the estimated survival distribution were such that
similar conclusions should have been drawn on the proportionality of the cumulative hazards and the cumulative odds. In the example, the estimated survival
probabilities within both arms were greater than 90% over all studied time
points. In particular, −ln(Ŝi(t))/(1 − Ŝi(t)) ≈ 1 for all studied t and i = E, C, and ŜC(t)/ŜE(t) appears to only vary between 0.98 and 1 over the studied interval
of time. Different conclusions on proportionality were drawn because the
assessment of proportional hazards was based on the ratio of estimates of
the hazard rates (not the ratio of estimates of the cumulative hazards) where
outlying estimates of the ratio were observed when the survival curves were
nearly flat and the hazard rates were close to zero. The assessment of the
cumulative odds ratio was based on a plot of the cumulative odds ratio over
time, which did not vary significantly. A plot of the ratio of the cumulative
hazards would also not have significantly varied over time.
A confidence interval for SE(t*) – SC(t*) can be determined using the respec-
tive Kaplan–Meier estimates and Greenwood’s estimates of the correspond-
ing variance. When the lower limit of the confidence interval for SE(t*) – SC(t*)
is greater than –δ, non-inferiority is concluded.
Kaplan–Meier Estimation. In the absence of censoring, the determination of
the estimated survival function (i.e., the event-free probabilities) for a given
arm is straightforward. For t > 0, the estimated survival function is given by Ŝ(t) = the relative frequency of times in that arm that are greater than t. In the presence of censoring, the
most common estimate of the survival function is the Kaplan–Meier esti-
mate.19 As earlier, let t(1)< t(2) < . . . < t(k) denote the distinct ordered times
when events occurred, and for i = 1, . . . , k define R(t(i)) as the set of indices
of those subjects at risk of an event as time t(i) approaches (i.e., consists of the
indices of subjects whose time-to-event, censored or uncensored, is at least
t(i)). Let ni denote the size of R(t(i))—that is, the number of subjects at risk of an
event as time t(i) approaches—and let di denote the number of subjects that
had events at time t(i). For ease, we will define t(0) = 0. Then for i = 1, . . . , k, 1 –
di/ ni represents the relative frequency of subjects followed completely from
time t(i–1) to time t(i) that did not have an event, and represents an estimate of
the conditional probability that a subject will not have an event during the
interval from t(i–1) to t(i) given they have not had an event by time t(i–1). For
intermediate intervals, t(i–1) to t, where t(i–1) < t < t(i), the observed relative fre-
quency of subjects followed completely from time t(i–1) to time t that did not
have an event is 1. Thus, 1 is the estimate of the conditional probability that
a subject will not have an event during the interval from t(i–1) to t given they
have not had an event by time t(i–1). The Kaplan–Meier estimate of the sur-
vival function applies the multiplication rule to these estimated conditional
probabilities.
For t(i) ≤ t < t(i+1), i = 0, 1, . . . , k, the Kaplan–Meier estimate of the survival function is given by

Ŝ(t) = ∏_{j=1}^{i} (1 − dj/nj)

For t(i) ≤ t < t(i+1), i = 0, 1, . . . , k, Greenwood's formula provides an estimate of the variance for Ŝ(t) of

Var̂(Ŝ(t)) ≈ (Ŝ(t))² ∑_{j=1}^{i} dj/(nj(nj − dj))
An approximate 100(1 – α)% confidence interval for SE(t*) – SC(t*) is given by ŜE(t*) − ŜC(t*) ± zα/2 √(Var̂(ŜE(t*)) + Var̂(ŜC(t*))). When the lower limit of the confidence interval is greater than –δ, the null hypothesis in Expression 13.12 is rejected and non-inferiority is concluded.
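The landmark comparison above is easy to assemble from scratch; a minimal sketch with hypothetical data (times, 0/1 event indicators, landmark t*, and margin δ = 0.10):

```python
import numpy as np
from scipy.stats import norm

def km_greenwood(times, events, t_star):
    """Kaplan-Meier estimate S_hat(t*) and its Greenwood variance estimate."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    s, gw = 1.0, 0.0
    for t in np.unique(times[events == 1]):
        if t > t_star:
            break
        n = np.sum(times >= t)                      # at risk just before t
        d = np.sum((times == t) & (events == 1))    # events at t
        s *= 1 - d/n
        gw += d/(n*(n - d))
    return s, s**2*gw

# Hypothetical data for each arm
sE, vE = km_greenwood([5, 7, 9, 12, 14, 20], [1, 0, 1, 1, 0, 0], t_star=12)
sC, vC = km_greenwood([4, 8, 10, 11, 15, 18], [1, 1, 0, 1, 0, 1], t_star=12)
lower = (sE - sC) - norm.ppf(0.975)*np.sqrt(vE + vC)
print(round(lower, 3), lower > -0.10)   # non-inferiority if lower limit > -delta
```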
As noted by Com-Nougue, Rodary, and Patte,8 under the assumption of proportional hazards, the non-inferiority margin for the landmark analysis
can be linked to a non-inferiority threshold based on a hazard ratio by θo =
ln(SC(t*) – δ)/ln(SC(t*)). This relation, along with a guess of the event-free probability at the landmark for the control arm, can guide in translating a historical problem where inference was based on a hazard ratio to a non-inferiority problem involving a difference in event-free probabilities, or vice versa. First, the historical control effect would be estimated using one of the metrics, and then an appropriate non-inferiority threshold or margin would be determined for that metric. Then,
with a guess of SC(t*), the above relation leads to a non-inferiority margin or
threshold for the other metric.
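This back-and-forth conversion is a one-liner in each direction; a minimal sketch (the values of SC(t*), δ, and θo below are illustrative only):

```python
import numpy as np

def hr_threshold_from_margin(s_C, delta):
    # theta_o = ln(S_C(t*) - delta)/ln(S_C(t*))
    return np.log(s_C - delta)/np.log(s_C)

def margin_from_hr_threshold(s_C, theta_o):
    # inverse relation under proportional hazards: S_E = S_C**theta_o
    return s_C - s_C**theta_o

print(round(hr_threshold_from_margin(0.70, 0.10), 3))   # 1.432
print(round(margin_from_hr_threshold(0.70, 1.25), 3))   # 0.060
```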
For inference on one median, Efron21 and Reid22 used bootstrap methods to
derive confidence intervals for the median. Such bootstrapping methods can
be easily applied to determine confidence intervals for the difference or ratio
of two medians. Alternatively, confidence sets for one median can be derived
by inverting tests similar to a sign test, as done by Brookmeyer and Crowley23
and Emerson.24 The confidence interval for one median consists of all values
t* for which a two-sided test of Ho: S(t*) = 0.5 fails to reject the null hypothesis.
The Brookmeyer and Crowley procedure23 uses the Kaplan–Meier estimated
survival probability at t*, Sˆ (t*), along with the corresponding Greenwood
estimated variance. This estimated variance changes as t* changes. As noted
by Wang and Hettmansperger,17 the confidence set derived from these meth-
ods need not be an interval. For the Brookmeyer and Crowley procedure,
this inadequacy can be alleviated by choosing the Greenwood estimated variance of Ŝ(x), where x is the observed median. In generalizing to two medi-
ans, this type of estimated variance can be used for both arms in a minimum
dispersion test statistic, as in Su and Wei’s study.18
General Procedures. For comparing two medians, we will first discuss two
procedures that are not based on the assumption that the two underlying
distributions are related by a scale factor.
Su and Wei18 derived confidence intervals for the difference and ratio of
two medians based on a quadratic test statistic similar to a minimum dis-
persion test statistic used by Basawa and Koul25 for continuous data. We will
present the test procedure for a ratio of medians.
Let X and Y denote the sample medians of the control and experimental arms, respectively. The observed sample medians x and y satisfy x = min{t: ŜC(t) ≤ 0.5} and y = min{t: ŜE(t) ≤ 0.5}. For testing Ho: Λ = Λo against Ha: Λ ≠ Λo (for some 0 < Λo ≤ 1), the test statistic is

G(Λo) = min_{t>0} [(ŜE(Λo t) − 0.5)²/σ̂E² + (ŜC(t) − 0.5)²/σ̂C²]

where σ̂E² and σ̂C² are the Greenwood's estimates of the variances of ŜE(y) and ŜC(x), respectively. From a simulation
study of Su and Wei,18 the upper percentiles of the distribution of G(Λo) when
Λ = Λo can be approximated by the upper percentiles of a χ² distribution with 1 degree of freedom. Let χ²_{1,α} denote the 100αth upper percentile of a χ² distribution with 1 degree of freedom. Then, an approximate 100(1 – α)% confidence interval for µE/µC consists of those positive values u so that G(u) < χ²_{1,α}.
If the lower bound of the approximate 100(1 – α)% confidence interval for µE/µC is greater than Λo = 1 – δ, then the null hypothesis in Expression 13.13 is rejected and non-inferiority is concluded. An approximate 100(1 – α)% confidence interval for the control median is given by x ± zα/2 σ̂C/mC(ε), where mE(ε) and mC(ε) denote estimates of the densities of the respective time-to-event distributions near their medians (ε denoting a bandwidth). A corresponding test statistic for testing against the threshold Λo is

Z* = (Y − ΛoX)/√(σ̂E²/mE²(ε) + Λo²σ̂C²/mC²(ε))
When Z* > zα/2, the null hypothesis in Expression 13.13 is rejected and non-
inferiority is concluded. A Fieller approach can also be used to determine an
approximate 100(1 – α)% confidence interval for µE/µC .
The remaining procedures that will be discussed are based on the overall
assumption that the two underlying distributions differ by a scale factor.
Adapting Standard Time-to-Event Tests. Let X1, X2, . . . , X nC and Y1, Y2, . . . , YnE
denote independent random samples from distributions having respective
distribution functions FC and FE. The X’s and the Y’s represent the actual,
uncensored times to the event of the control and experimental arms, respec-
tively. The underlying assumption is that the two distributions are related
through a scale factor Λ (i.e., FE(y) = FC(y/Λ) for all y and some Λ). For the
control arm, the independent censoring variables are denoted as A1, A2, . . . ,
AnC , which are assumed to be a random sample having common distribution
function H. For each subject in the control arm, the variable X i* = min{X i , Ai }
and the event status I (X i = X i* ) are observed. For the experimental arm, the
independent censoring variables are denoted as B1, B2, . . . , BnE , which are
assumed to be a random sample having a common distribution function K.
For each subject in the experimental arm, the variable Yi* = min{Yi , Bi } and
the event status I (Yi = Yi* ) are observed.
Confidence intervals for Λ can be obtained through a test statistic for time-
to-event endpoints by altering the values in one of the arms and then testing
for the equality of the underlying distributions. For any positive number c,
replace X1, X2, . . . , X nC with cX1, cX2, . . . , cX nC and replace A1, A2, . . . , AnC
with cA1, cA2, . . . , cAnC . For the observations in the control arm, the analysis
multiplies each observed censored or uncensored time-to-event by c with-
out changing the event status for those observations. If the null hypothesis
of equal medians (i.e., equal underlying distributions for Yi and cXi) is not
rejected at a two-sided significance level α, then Λo is in the (approximate)
100(1 – α)% confidence interval for Λ. If the lower bound of this approxi-
mate 100(1 – α)% confidence interval for the scale factor (i.e., also for µE/µC )
is greater than Λo = 1 – δ, then the null hypothesis in Expression 13.13 is
rejected and non-inferiority is concluded. This procedure for determining a
confidence interval for the scale factor or ratio of medians can be applied to
the log-rank test (where the parameter of interest tends to be a hazard ratio,
not a scale factor) or any Wilcoxon-like test. This procedure for obtaining a
confidence interval for a scale factor is analogous to manipulating the Mann–
Whitney–Wilcoxon procedure for deriving a confidence interval for the shift
in the distributions (i.e., the difference in medians) provided in Section 12.4.
There may, however, be some crudeness to this procedure. For the event status to remain the same, the corresponding censoring
variables would also need to be multiplied by c. In a properly conducted clin-
ical trial, the censoring distributions should be the same across arms. When
assumptions that are made on the underlying distribution do not hold, com-
parisons involving quite different censoring distributions can be difficult to
interpret when there is a moderate or large amount of censoring.
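A grid-search version of this test-inversion procedure is straightforward; the sketch below assumes the lifelines package for the log-rank test and uses simulated data purely for illustration:

```python
import numpy as np
from lifelines.statistics import logrank_test

def scale_factor_ci(x_times, x_events, y_times, y_events, alpha=0.05):
    """All c for which the log-rank test of Y vs. c*X (event statuses
    unchanged) is not rejected at level alpha."""
    kept = [c for c in np.linspace(0.5, 2.0, 151)
            if logrank_test(np.asarray(x_times)*c, y_times,
                            event_observed_A=x_events,
                            event_observed_B=y_events).p_value > alpha]
    return (min(kept), max(kept)) if kept else None

# Simulated censored samples: control (x) and experimental (y)
rng = np.random.default_rng(1)
x, y = rng.exponential(10, 80), rng.exponential(11, 80)
cx, cy = rng.uniform(5, 30, 80), rng.uniform(5, 30, 80)
print(scale_factor_ci(np.minimum(x, cx), (x <= cx).astype(int),
                      np.minimum(y, cy), (y <= cy).astype(int)))
```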
Two Confidence Interval Procedures. For time-to-event data in the presence of
censoring, the use and properties of confidence intervals for the difference in
medians where the limits are differences of the confidence limits of the indi-
vidual confidence intervals for the medians were investigated by Wang and
Hettmansperger.17 Several cases are considered, including the case where the
two underlying time-to-event distributions are assumed to have the same
shape. The results are fairly analogous to those by Hettmansperger16 for
determining confidence intervals for the difference in medians for continu-
ous data, which are summarized in Section 12.4.
It is, however, unlikely that two time-to-event distributions differ by a
shift. If the two underlying distributions differ by a scale factor, which is
often assumed when comparing time-to-event distributions, then the dis-
tributions for the log times will have the same shape (i.e., differ by a shift).
The results of Wang and Hettmansperger17 can be applied to testing the ratio
of the underlying medians when assuming equal shapes for the distribu-
tion of the log times. For ease in both presentation and in comparing the
results to those of Hettmansperger16 in Section 12.4, the results of Wang and
Hettmansperger17 will be presented for a difference in medians for the log
times. The medians for the log times are denoted by µlog,E and µlog,C for the experimental and control arms, respectively.
As in Section 12.4, the 100(1 – α)% confidence interval for the difference in
medians of the log times has the form (L, U) = (LE – UC, UE – LC), where
(LE, UE) is a 100(1 – α E)% confidence interval for the median log time of the
experimental arm and (LC, UC) is a 100(1 – α C)% confidence interval for the
median log time of the control arm. The confidence coefficients for the indi-
vidual confidence intervals are selected so that when those two intervals are
disjoint, Ho: Δ = 0 is rejected at a significance level of α in favor of the two-
sided alternative Ha: Δ ≠ 0. The null hypothesis in Expression 13.14 is rejected
at a significance level of α/2, and non-inferiority is concluded if L > – δ.
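In practice the procedure reduces to simple arithmetic on the individual interval limits; a minimal sketch with hypothetical limits and a ratio margin of 0.8 translated to the log scale:

```python
import numpy as np

# Hypothetical individual confidence intervals for the median log times
LE, UE = np.log(9.0), np.log(13.0)    # experimental arm
LC, UC = np.log(8.5), np.log(12.0)    # control arm
L, U = LE - UC, UE - LC               # interval for the difference in medians
delta = -np.log(0.8)                  # log-scale margin from a ratio margin of 0.8
print(round(L, 3), round(U, 3), L > -delta)   # non-inferiority if L > -delta
```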
The previous notation for the time-to-event and censoring variables will
apply here to the log times. Let X1, X2, . . . , X nC and Y1, Y2, . . . , YnE denote
independent random samples from distributions having respective distribu-
tion functions FC and FE. The X’s and the Y’s represent the actual, uncensored
log times to the event of the control and experimental arms, respectively.
For the control arm, the independent censoring variables for the log times
are denoted as A1, A2, . . . , AnC (i.e., exp(A1), . . . , exp(AnC ) are the censor-
ing variables for exp(X1), exp(X2), . . . , exp(X nC ) ), which are assumed to be
a random sample having common distribution function H. For each sub-
ject in the control arm, the variable X i* = min{X i , Ai } and the event status
I (X i = X i* ) are observed. The common distribution function for X i* is given
by FC* (t) = 1 − (1 − FC (t))(1 − H (t)) . For the experimental arm, the independent
censoring variables for the log times are denoted as B1, B2, . . . , BnE , which
are assumed to be a random sample having common distribution function
K. For each subject in the experimental arm, the variable Yi* = min{Yi , Bi } and
the event status I (Yi = Yi* ) are observed. The common distribution function
for Yi* is given by FE* (t) = 1 − (1 − FE (t))(1 − K (t)).
The left continuous inverse of a distribution F is defined by F⁻¹(p) = inf{t: F(t) ≥ p} for 0 < p < 1. Let F̂C and F̂E denote the Kaplan–Meier estimates of FC and FE, respectively, and let GE*(t) = P(Yi* ≤ t, Yi* = Yi). Per Wang and Hettmansperger,17 τE and τC denote the associated asymptotic variance parameters for the sample medians, and the multipliers ZE and ZC are chosen to satisfy

√λ ZE + √(1 − λ) ZC ≈ zα/2 √(λτE + (1 − λ)τC)    (13.15)

In addition, the asymptotic width of the confidence interval does not depend on the choice of ZE and ZC that satisfies Equation 13.15. From Theorem 3 of Wang and Hettmansperger,17

ZE = √(τE/τC) ZC = zα/2 √(τE(λτE + (1 − λ)τC))/(λτE + (1 − λ)τC)
The authors also provided formulas for the multipliers in the equal-depth
case where dE = dC.17
For the equal confidence coefficient procedure, the common confidence
coefficient ranged from 0.83 to 0.88 in the cases studied,17 where the alloca-
tion ratios ranged from 1 to 3 and various relative frequencies of censoring
were assumed. We refer the reader to Wang and Hettmansperger’s paper17
for analogously determined confidence coefficients for the equal-length and equal-depth procedures.
Additionally, Wang and Hettmansperger17 modified the two confidence interval procedures for the equal-shape case to obtain related procedures.
For the cumulative odds ratio, ψ(t) = [FE(t)/(1 – FE(t))]/[FC(t)/(1 – FC(t))], the
hypotheses are expressed as
Ho: ψ(t) ≥ ψo for some t ∈ [τ0, τ1] and Ha: ψ(t) < ψo for all t ∈ [τ0, τ1]    (13.17)

Ho: sup_{t∈[τ0,τ1]} [SC(t) − SE(t)] ≥ δ and Ha: sup_{t∈[τ0,τ1]} [SC(t) − SE(t)] < δ    (13.18)
The hypotheses involving the cumulative odds ratio, ratio of the cumula-
tive hazards, ratio of event-free probabilities, and relative risk of an event,
would involve the following supremums being compared to the appropriate
non-inferiority margin or threshold: sup_{t∈[τ0,τ1]} ψ(t), sup_{t∈[τ0,τ1]} [ln SE(t)/ln SC(t)], sup_{t∈[τ0,τ1]} [SC(t)/SE(t)], and sup_{t∈[τ0,τ1]} [(1 − SE(t))/(1 − SC(t))], respectively. Freitag, Lange, and Munk27 used a hybrid bootstrap-based procedure based on that used by Shao and Tu28 to construct a confidence interval for sup_{t∈[τ0,τ1]} [SC(t) − SE(t)]
for testing the hypotheses in Expression 13.18. When the upper limit of the
confidence interval is less than the non-inferiority margin/threshold, the
null hypothesis is rejected and non-inferiority is concluded. This procedure
maintains the desired type I error rate and the supremum approach has more
power than the pointwise approach.
Comparing the event-free probabilities over an interval makes use of more
information in the data than a landmark analysis. As with landmark analy-
ses, it is not necessary to assume proportional hazards, the existence of a
scale factor, or proportional cumulative odds. When such an assumption
holds (or approximately holds), it is more efficient to base the inference on a
procedure that is designed for such an assumption than to restrict the infer-
ence to some prespecified interval.
The selected non-inferiority margin or threshold here may represent the
maximal allowed difference across [τ0,τ1], which may be a larger allowed
difference than for a landmark analysis at a specific time. When the same
margin or threshold is used for the non-inferiority analysis over [τ0,τ1] as
for a non-inferiority landmark analysis at landmark t* ∈ [τ0,τ1], rejecting the
null hypothesis in Expressions 13.16 or 13.18 for the interval analysis implies
that the null hypothesis in Expression 13.12 is rejected for the landmark
analysis.
Besides the need to choose a larger margin than for a landmark analysis, it can be very tricky to use the historical results to determine the effect of
the control therapy. The specific information on the estimates of the event-
free probabilities over [τ0,τ1] and their corresponding standard errors for the
control therapy and the placebo may not be readily available from some or
all of the historical trials. If such information is not readily available but it is still desired to base the non-inferiority inference on [τ0,τ1], it is likely that
the non-inferiority margin would be conservatively chosen.
It may also be difficult (as with landmark analyses) to determine how to incorporate differences in estimated SC(t) across trials, how to choose the appro-
priate/best interval [τ0,τ1] to consider and to set the non-inferiority margin/
threshold. There are analogous concerns and issues when the non-inferiority
inference over [τ0,τ1] is based on the cumulative odds ratio, ratio of the cumu-
lative hazards, ratio of event-free probabilities, or relative risk of an event.
When the assumptions do not hold for proportional hazards, the existence
of a scale factor, or a constant cumulative odds ratio, the corresponding sample
estimator unbiasedly estimates some quantity that depends on the amount
of follow-up (i.e., the censoring distributions) for that trial. Thus, the under-
lying time-to-event distributions for the control therapy and the placebo can
remain constant across trials (across historical trials and the non-inferiority
trial), thereby having a constant true effect of the control therapy across tri-
als; however, because the assumption relating the underlying distributions is
not true (e.g., the hazards are not proportional) and the amount of follow-up
differs across trials, the value the selected estimator (e.g., the hazard ratio
estimator) is unbiasedly estimating varies across the trials. Landmark analy-
ses and analyses over an interval would not be affected by differences across
trials in the amount of follow-up, although they have their own issues.
Horizontal Differences in the Survival Functions. The difference in medians is
one specific horizontal difference in the experimental and control survival
functions (i.e., SE⁻¹(0.5) − SC⁻¹(0.5)). For continuous time-to-event distributions, the difference in the means is the average of the horizontal differences in the experimental and control survival functions over all percentiles (i.e., the average of SE⁻¹(p) − SC⁻¹(p) for 0 < p < 1), or simply the mean difference in percentiles. Thus, the difference in means (when the means exist) is given by

µE − µC = ∫₀¹ (SE⁻¹(y) − SC⁻¹(y)) dy    (13.19)
Graphically, the difference in means is equal to the area between the two survival functions, which is usually represented by

µE − µC = ∫₀^∞ (SE(x) − SC(x)) dx    (13.20)
When the means exist, the expression in Equation 13.19 can be extended to continuous real-valued distributions as µE − µC = ∫₀¹ (FE⁻¹(y) − FC⁻¹(y)) dy. The assumption of continuous distributions is necessary for Equation 13.19 to hold. Therefore, care is needed when applying Equation 13.19 with Kaplan–Meier estimates of SE and SC, which are discrete.
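Under a restriction to a finite horizon (needed because the Kaplan–Meier curves are step functions that may not reach zero), the area interpretation in Equation 13.20 can be computed directly; a minimal sketch with hypothetical Kaplan–Meier summaries:

```python
import numpy as np

def restricted_mean(jump_times, surv_after_jump, horizon):
    """Area under a Kaplan-Meier step function up to `horizon`."""
    t = np.clip(np.concatenate(([0.0], np.asarray(jump_times, float), [horizon])),
                0.0, horizon)
    s = np.concatenate(([1.0], np.asarray(surv_after_jump, float)))
    return float(np.sum(s*np.diff(t)))

# Hypothetical jump times and Kaplan-Meier values for each arm
mE = restricted_mean([3, 6, 10], [0.9, 0.7, 0.5], horizon=12)
mC = restricted_mean([2, 5, 9], [0.85, 0.65, 0.45], horizon=12)
print(mE, mC, mE - mC)   # difference = area between the curves up to the horizon
```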
References
1. Chi, G.Y.H., Some issues with composite endpoints in clinical trials, Fund. Clin.
Pharm., 19, 609–619, 2005.
2. DeMets, D.L. and Califf, R.M., Lessons learned from recent cardiovascular clini-
cal trials: Part I, Circulation, 106, 746–751, 2002.
3. Montori, V.M. et al., Validity of composite endpoints in clinical trials, Br. Med. J.,
330, 594–596, 2005.
4. Kleist, P., Composite endpoints: Proceed with caution, Appl. Clin. Trial, May 1,
2006, at http://appliedclinicaltrialsonline.findpharma.com/appliedclinicaltrials/
article/articleDetail.jsp?id=324331.
5. Cox, D.R., Some simple approximate tests for Poisson variates, Biometrika, 40,
354–360, 1953.
6. Cox, D.R., Regression models and life tables, J. R. Stat. Soc., 34, 187–220, 1972.
7. Cox, D.R., Partial likelihood, Biometrika, 62, 269–276, 1975.
8. Com-Nougue, C., Rodary, C., and Patte, C., How to establish equivalence when
data are censored: A randomized trial of treatments for B non-Hodgkin lym-
phoma, Stat. Med., 12, 1353–1364, 1993.
9. Efron, B., Efficiency of Cox’s likelihood function for censored data, J. Am. Stat.
Assoc., 72, 557–565, 1977.
10. Fleming, T.R., Evaluation of active control trials in AIDS, J. Acq. Immun. Def.
Synd., 2, S82–S87, 1990.
11. Kalbfleisch, J.D. and Prentice, R.L., Estimation of the average hazard ratio,
Biometrika, 68, 105–112, 1981.
12. Fleming, T.R. and Harrington, D., Counting Processes and Survival Analysis, Wiley,
Chichester, 1991.
13. Crisp, A. and Curtis, P., Sample size estimation for non-inferiority trials of time-
to-event data, Pharm. Stat., 7, 236–244, 2008.
14. Cox, D.R., A note on the graphical analysis of survival data, Biometrika, 66, 188–
190, 1979.
15. Nelson, W., Theory and application of hazard plotting for censored failure data,
Technometrics, 14, 945–966, 1972.
16. Hettmansperger, T.P., Two-sample inference based on one-sample sign statis-
tics, J. R. Stat. Soc. C Appl., 33, 45–51, 1984.
17. Wang, J.-L. and Hettmansperger, T.P., Two-sample inference for median sur-
vival times based on one-sample procedures for censored survival data, J. Am.
Stat. Assoc., 85, 529–536, 1990.
18. Su, J.Q. and Wei, L.J., Nonparametric estimation for the difference or ratio of
median failure times, Biometrics, 49, 603–607, 1993.
19. Kaplan, E.L. and Meier, P., Nonparametric estimation from incomplete observa-
tions, J. Am. Stat. Assoc., 53, 457–481, 1958.
20. Thomas, D.R. and Grunkemeier, G.L., Confidence interval estimation of sur-
vival probabilities for censored data, J. Am. Stat. Assoc., 70, 865–871, 1975.
21. Efron, B., Censored data and the bootstrap, J. Am. Stat. Assoc., 76, 312–319,
1981.
22. Reid, N., Estimating the median survival time, Biometrika, 68, 601–608, 1981.
23. Brookmeyer, R. and Crowley, J., A confidence interval for the median survival
time, Biometrics, 38, 29–41, 1982.
24. Emerson, J.D., Nonparametric confidence intervals for the median in the pres-
ence of right censoring, Biometrics, 38, 17–27, 1982.
25. Basawa, I.V. and Koul, H.L., Large-sample statistics based on quadratic disper-
sion, Int. Stat. Rev., 56, 199–219, 1988.
26. Wei, L.J. and Gail, M.H., Nonparametric estimation for a scale-change with cen-
sored observations, J. Am. Stat. Assoc., 78, 382–388, 1983.
27. Freitag, G., Lange, S. and Munk, A., Non-parametric assessment of non-
inferiority with censored data, Stat. Med., 25, 1201–1217, 2006.
28. Shao, J. and Tu, D., The Jackknife and Bootstrap, Springer, New York, NY, 1995.
A.1.1 p-Values
A p-value is the probability of obtaining results as extreme or more extreme
(against the null hypothesis) than the observed results, where the probabil-
ity is determined under the assumption that the null hypothesis is true. For
most cases, when the null hypothesis is true, the p-value is a completely ran-
dom value between 0 and 1, its statistical distribution being a uniform distri-
bution over (0,1). As commonly applied, the null hypothesis is rejected if and
only if the p-value is less than or equal to the significance level. Hence, the
p-value can be regarded as the smallest significance level for which the null
hypothesis is rejected.
A p-value measures the strength of evidence against the null hypothesis
in the direction or directions of the alternative hypothesis. The smaller a
p-value, the stronger is the evidence against the null hypothesis, in favor of the
alternative hypothesis. A large p-value would correspond to little evidence
against the null hypothesis. Little or no evidence against the null hypothesis
does not mean that there is great evidence for the null hypothesis.
Examples A.1 through A.3 illustrate some properties of p-values. These
examples involve dichotomous data (coin tosses), continuous data (hemoglo-
bin levels), and time-to-event data (for an undesirable event).
Example A.1
We will consider a simple experiment of tossing a coin 10 times. Let p denote the
probability that any given toss results as a head. The coin is fair if p = 0.5. For the
null hypothesis of p = 0.5, there are three realistic possibilities for the alternative
hypothesis: p < 0.5, p > 0.5, and p ≠ 0.5. Suppose eight of these tosses result in a
head. Table A.1 summarizes the p-value in each of three cases. This example helps
illustrate the differences among the three cases in the directions of the strength of evidence.

TABLE A.1
Summary of p-Values for Three Cases Involving Dichotomous Data

Case | Null Hypothesis | Alternative Hypothesis | Result of the Experiment | As Strong or Stronger Evidence in Favor of Ha | p-Value^a
1 | Ho: p = 0.5 | Ha: p < 0.5 | 8 heads in 10 tosses | 8 or fewer heads in 10 tosses | 0.989
2 | Ho: p = 0.5 | Ha: p > 0.5 | 8 heads in 10 tosses | 8 or more heads in 10 tosses | 0.055
3 | Ho: p = 0.5 | Ha: p ≠ 0.5 (p < 0.5 or p > 0.5) | 8 heads in 10 tosses | 8 or more heads in 10 tosses, or 2 or fewer heads in 10 tosses | 0.109
a p-Values as fractions are 1013/1024, 56/1024, and 112/1024, respectively.

In case 1, the smaller the number of heads, the stronger is the evidence
against the null hypothesis in favor of the alternative hypothesis. In case 2, the
larger the number of heads, the stronger is the evidence against the null hypothesis
in favor of the alternative hypothesis. In case 3, the further the number of heads
is from five (50% of the number of tosses), the stronger the evidence against the
null hypothesis in favor of the alternative hypothesis. Note also that the p-value
in case 3 is double the p-value in case 2 (double the smaller of the p-values in
cases 1 and 2). In case 3, two or fewer heads among 10 tosses provide as strong
or stronger evidence against p = 0.5 in favor of p < 0.5 as the strength of eight
heads among 10 tosses provides against p = 0.5 in favor of p > 0.5. If p = 0.5, for
10 tosses, the probability of getting two or fewer heads equals the probability of
getting eight or more heads.
In this coin-tossing example, the test is based on the number of heads in 10
tosses, which is referred to as the test statistic. The p-values in cases 1 and 2 are
referred to as one-sided p-values since the respective alternative hypotheses are
one-sided. Likewise, since the alternative hypothesis in case 3 is two-sided, the
respective p-value is referred as a two-sided p-value. Note that, here, the sum of
the one-sided p-values equals 1 plus the probability of getting the observed num-
ber of heads if the null hypothesis is true. Whenever the test statistic has a discrete
distribution, the sum of the one-sided p-values will equal 1 plus the probability of
getting the observed value of the test statistic if the null hypothesis is true.
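The three p-values in Table A.1 can be verified directly from the binomial distribution; a minimal sketch:

```python
from scipy.stats import binom

n, x = 10, 8                          # 8 heads in 10 tosses; Ho: p = 0.5
p1 = binom.cdf(x, n, 0.5)             # Ha: p < 0.5: P(X <= 8) = 1013/1024
p2 = binom.sf(x - 1, n, 0.5)          # Ha: p > 0.5: P(X >= 8) = 56/1024
p3 = p2 + binom.cdf(n - x, n, 0.5)    # Ha: p != 0.5: add P(X <= 2) = 112/1024
print(round(p1, 3), round(p2, 3), round(p3, 3))   # 0.989, 0.055, 0.109
```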
Example A.2
TABLE A.2
Summary of p-Values for Three Cases Involving Continuous Data

Case | Null Hypothesis | Alternative Hypothesis | Result of the Experiment | As Strong or Stronger Evidence in Favor of Ha | p-Value
1 | Ho: μ = 11 | Ha: μ < 11 | Sample mean from 4 patients is 10.5 | Sample mean from 4 patients is 10.5 or less | 0.106
2 | Ho: μ = 11 | Ha: μ > 11 | Sample mean from 4 patients is 10.5 | Sample mean from 4 patients is 10.5 or more | 0.894
3 | Ho: μ = 11 | Ha: μ ≠ 11 (μ < 11 or μ > 11) | Sample mean from 4 patients is 10.5 | Sample mean from 4 patients is either 10.5 or less, or 11.5 or more | 0.211
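The p-values in Table A.2 are consistent with a known standard deviation of 0.8 (so a standard error of 0.4 for the mean of 4 patients); under that assumption, a minimal sketch:

```python
from scipy.stats import norm

mu0, xbar, se = 11, 10.5, 0.8/2       # assumed sigma = 0.8, n = 4 patients
z = (xbar - mu0)/se                   # -1.25
print(round(norm.cdf(z), 3),          # Ha: mu < 11 -> 0.106
      round(norm.sf(z), 3),           # Ha: mu > 11 -> 0.894
      round(2*norm.cdf(z), 3))        # Ha: mu != 11 -> 0.211
```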
Example A.3 will compare and contrast the calculation of a p-value for each
of four types of comparisons. For this example, an equivalence comparison
will be evaluated in Section A.3.
Example A.3
Let θ denote the true experimental arm versus control arm hazard ratio of some
undesirable event (e.g., death or disease progression). For an observed hazard
ratio of 0.91 based on 400 events in a clinical trial that had a one-to-one ran-
domization, Table A.3 summarizes the p-value for each comparison type. For the
non-inferiority comparison, a hazard ratio threshold of 1.1 is used.
For the inferiority, superiority, and difference comparisons, the orderings of the
strength of evidence against the null hypothesis (in favor of the alternative hypoth-
esis) are analogous to cases 1, 2, and 3 in each of Tables A.1 and A.2.
Note that the order of the strength of evidence is the same for a superiority compari-
son as with a non-inferiority comparison. For each of these comparisons, the smaller
the observed hazard ratio, the more favorable is the result for the experimental arm.
For these two comparisons, it is the same event (observing a hazard ratio of 0.91 or
less) whose probability is the p-value. The p-values are different because the prob-
abilities are calculated under different assumptions of the truth (θ = 1 and θ = 1.1). In
fact, because the “bar is lower” for a non-inferiority comparison than for a superiority
comparison between the same two treatment arms, the p-value for the non-inferiority
comparison will always be smaller than the p-value for a superiority comparison.
Note that had the observed hazard ratio equaled 1, the p-values for an inferior-
ity comparison and for a superiority comparison would be equal (both p-values
equaling 0.5). In this example, the p-values for an inferiority comparison and for
a non-inferiority comparison would be equal (both p-values approximately 0.317)
if the observed hazard ratio were the square root of 1.1 (the geometric mean of
1 and 1.1). When these p-values are equal, the strength of evidence in favor of
inferiority equals the strength of evidence in favor of non-inferiority.
There is fairly suggestive but not compelling evidence that the experimental arm is
noninferior to the control arm with respect to the time-to-event endpoint. For any
significance level less than 0.10, whether a one-sided or two-sided significance level,
there is not strong enough evidence that the experimental arm is inferior, superior,
or different from the control arm with respect to the time-to-event endpoint.
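Under the normal approximation for the log-hazard ratio (standard error about 2 divided by the square root of the number of events for a one-to-one randomization), the four p-values discussed above can be sketched as follows:

```python
import numpy as np
from scipy.stats import norm

theta_hat, events = 0.91, 400
se = 2/np.sqrt(events)                        # SE of the log-hazard ratio
z1 = np.log(theta_hat)/se                     # compared against theta = 1
zN = (np.log(theta_hat) - np.log(1.1))/se     # compared against theta = 1.1
print(round(norm.cdf(z1), 3),    # superiority p-value, ~0.173
      round(norm.sf(z1), 3),     # inferiority p-value, ~0.827
      round(2*norm.cdf(z1), 3),  # difference (two-sided) p-value, ~0.346
      round(norm.cdf(zN), 3))    # non-inferiority p-value, ~0.029
```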
When the correct value for a parameter or effect is the hypothesized value
in the null hypothesis and the test statistic has a continuous distribution, the
p-value is a random value between 0 and 1, its statistical distribution being
a uniform distribution over (0,1). For most hypothesis-testing scenarios in practice, when the correct value lies in the alternative hypothesis, the distribution of the p-value is stochastically smaller. Several authors have examined the distri-
bution of the p-value when the alternative hypothesis is true. Dempster and
Schatzoff1 and Schatzoff2 investigated the stochastic nature of the p-value
and evaluated test procedures based on the expected (mean) p-value at a
given alternative. Hung et al.3 determined, for a fixed significance level and
a fixed difference between the true value and the hypothesized value in the
null hypothesis, that as the sample size increases, the mean, variance, and
percentiles for the distribution of the p-value decrease toward zero. They
also examined the distribution of the p-value in certain cases when the effect
size is a random variable. Sackrowitz and Samuel-Cahn4 extended the work
of Dempster and Schatzoff,1 and also related the expected p-value to the sig-
nificance level and power. Joiner5 introduced the median significance level
and the “significance level of the average” (the significance level that corre-
sponds with the mean value of the test statistic) as measures of test efficiency.
Bhattacharya and Habtzghi6 also used the median p-value to evaluate the
performance of a test. Below, we provide our own analogous derivation for
the distribution of the p-value.
In general, the distribution of the p-value depends on the sample size and
the true value of the parameter (or alternatively on the significance level and
the true power of the test). For test statistics that are normally distributed, the
distribution of the p-value depends on the number of standard errors separating the true value and the hypothesized value in the null hypothesis. For
a random sample of size n from a normal distribution with mean μa and stan-
dard deviation σ, we will see that the distribution of the p-value for testing
the null hypothesis μ = μo against the alternative hypothesis μ < μo depends on the value of (μa – μo)/(σ/√n). We can replace μ, σ, and n by the equivalent quantities when comparing two means or when using a log-hazard ratio to compare
two time-to-event distributions. For test statistics that are modeled as hav-
ing a normal distribution, when the power is 1 – β with a one-sided signifi-
cance level of α, the number of standard errors separating the true value and the hypothesized value in the null hypothesis reduces to zα + zβ.
Suppose we are testing Ho: θ = θo versus Ha: θ < θo on the basis of an es
timate θ̂ , where the true value is θa and (θ̂ − θ a )/σ ′ is modeled as having
a standard normal distribution. The test statistic is (θ̂ − θ o )/σ ′ , and thus
the p-value is the observed value of Φ((θ̂ − θo)/σ′). Let G denote the distribution function for the p-value. Then for 0 < w < 1,

G(w) = P(Φ((θ̂ − θo)/σ′) ≤ w) = P((θ̂ − θa)/σ′ ≤ Φ⁻¹(w) + (θo − θa)/σ′) = Φ(Φ⁻¹(w) + (zα + zβ))
where α is the significance level and 1 – β is the power when the true value of θ is θa. For 0 < y < 1, the quantile function is given by G⁻¹(y) = Φ(Φ⁻¹(y) − (zα + zβ)).
Since Φ–1(p) = z1–p for 0 < p < 1, the 100p-th percentile of the distribution of
the p-value is given by Φ(z1–p – (zα + zβ)). Note for any significance level α, the
100(1 – α)-th percentile for the p-value is β (i.e., 1 minus the power). Also, the
100(1 – β)-th percentile for the p-value is α. For 0 < w < 1, the density function for the p-value is given by g(w) = φ(Φ⁻¹(w) + (zα + zβ))/φ(Φ⁻¹(w)), where φ denotes the standard normal density function.
We note that it can easily be shown that the distribution of the p-value
becomes larger with respect to a likelihood ratio ordering as zα + zβ becomes
smaller. In particular, for a fixed significance level, α, the distribution of the
p-value becomes smaller with respect to a likelihood ratio ordering when
the power increases (which can occur by either increasing the sample size or
choosing a more favorable alternative). Thus, for a fixed sample size, when comparing two alternatives, the relative likelihood increases in favor of the more favorable alternative as the observed p-value becomes smaller.
Note also that the test statistic has a normal distribution with mean
– (zα + zβ) and variance 1.
For cases where the test statistic is normally distributed, Table A.4 pro-
vides the median, 5th percentile, and 95th percentile for the distribution of
the p-value for various combinations for the significance level and power.
Whenever the power at the true effect size is 80% or greater, the median
p-value is very small and relatively much smaller than the significance level.
If a clinical trial is adequately powered at the actual effect size, the p-value
will typically be very small. An observed p-value that is microscopic (e.g.,
smaller than 10 –8 if the significance level is 0.005) would tend to be indica-
tive of an overpowered study—that is, the study would have had near 100%
power at the true effect size.
TABLE A.4
Median and Percentiles for the Distribution of the p-Value Based on Significance Level and Power

Significance Level^a | Power | Median | 5th Percentile | 95th Percentile
0.05 | 0.05 | 0.5 | 0.05 | 0.95
0.05 | 0.5 | 0.05 | 0.0005 | 0.5
0.05 | 0.8 | 0.0064 | 0.00002 | 0.2
0.05 | 0.9 | 0.0017 | 0.000002 | 0.1
0.025 | 0.025 | 0.5 | 0.05 | 0.95
0.025 | 0.5 | 0.025 | 0.0002 | 0.376
0.025 | 0.8 | 0.0025 | 0.000004 | 0.124
0.025 | 0.9 | 0.0006 | 0.0000005 | 0.055
0.01 | 0.01 | 0.5 | 0.05 | 0.95
0.01 | 0.5 | 0.01 | 0.00004 | 0.248
0.01 | 0.8 | 0.0008 | 0.0000007 | 0.064
0.01 | 0.9 | 0.0002 | 0.00000007 | 0.025
0.005 | 0.005 | 0.5 | 0.05 | 0.95
0.005 | 0.5 | 0.005 | 0.00001 | 0.176
0.005 | 0.8 | 0.0003 | 0.0000002 | 0.038
0.005 | 0.9 | 0.00006 | 0.00000002 | 0.013
a One-sided significance level.
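The entries of Table A.4 follow from the percentile formula Φ(z1–p – (zα + zβ)); a minimal sketch reproducing the rows for a one-sided significance level of 0.05:

```python
from scipy.stats import norm

def p_value_percentile(p, alpha, power):
    """100p-th percentile of the p-value distribution: Phi(Phi^{-1}(p) - (z_alpha + z_beta))."""
    z_sum = norm.ppf(1 - alpha) + norm.ppf(power)
    return norm.cdf(norm.ppf(p) - z_sum)

for power in (0.05, 0.5, 0.8, 0.9):
    median, p5, p95 = (p_value_percentile(p, 0.05, power) for p in (0.5, 0.05, 0.95))
    print(power, f"{median:.4f} {p5:.6f} {p95:.3f}")
```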
Although p-values larger than the significance level are not out of the ordi-
nary when the power at the true effect size is 80% or greater, large p-values
are out of the ordinary. Large p-values are indicative of either the alternative
hypothesis being false or that the study is not adequately powered at the true
effect size (i.e., the assumed effect size is greater than the true effect size). In
the latter case, the effect size chosen to design the study (“power the study”)
was larger than the true effect size.
Note that reporting a p-value for non-inferiority testing is rare. This is pri-
marily due to some subjectivity in the determination of the non-inferiority
margin.
Over repeated experiments (i.e., many independent repetitions of the experiment), about 95% of the 95% confidence intervals actually capture the correct value of the respective parameter. The value of (1 – α) is called the confidence coefficient.
For each different choice of an alternative hypothesis as presented in Tables
A.1 and A.2, there is a different type of confidence interval. For some real-
valued parameter θ and significance level α, θo values where Ho: θ = θo is not
rejected in favor of Ha: θ < θo form a 100(1 – α)% confidence interval for θ of
the form (–∞, U). The value for U is referred to as the 100(1 – α)% confidence
upper bound for θ. Analogously, θo values where Ho: θ = θo is not rejected
in favor of Ha: θ > θo at a significance level α form a 100(1 – α)% confidence
interval for θ of the form (L, ∞). The value for L is referred to as the 100(1 –α)%
confidence lower bound for θ. These two types of confidence intervals are
sometimes referred to as “one-sided confidence intervals” since they are
based on tests of one-sided alternative hypotheses.
Values for θo where Ho: θ = θo is not rejected in favor of Ha: θ ≠ θo at a signifi-
cance level α form a 100(1 – α)% confidence interval for θ of the form (L, U).
Such confidence intervals are sometimes referred to as “two-sided confidence
intervals” since they are based on tests of two-sided alternative hypotheses.
In clinical trials, the lower limit and upper limits, L and U, of the two-sided
100(1 – α)% confidence interval for θ that tend to be used are the 100(1 – α/2)%
confidence lower bound for θ and the 100(1 – α/2)% confidence upper bound
for θ, respectively. This allows the selected two-sided confidence interval to
be “error symmetric.” That is, before any data occur, the probability that the
two-sided 100(1 – α)% confidence interval for θ will lie entirely above the
true value equals the probability that the two-sided 100(1 – α)% confidence
interval for θ will lie entirely below the true value.
The factors that influence the width of a confidence interval depend on the
type of parameter of interest. For a mean of some characteristic within a study
arm, the width of the confidence interval depends on the sample size, the
variability between patients on that characteristic, and the confidence coef-
ficient. For a treatment versus control log-hazard ratio of some time-to-event
endpoint, the width of the confidence interval depends on the breakdown
of the number of events (or the total number of events and the randomiza-
tion ratio) and the confidence coefficient. In general terms, the width of a
confidence interval will depend on some quantification of the amount of evi-
dence or information gathered and the confidence coefficient. Increasing the
amount of evidence gathered on the correct value of a parameter θ (increasing
the sample size or the number of events) reduces the width of the confidence
interval. Also, increasing the confidence coefficient increases the width of
the confidence interval (e.g., a 90% confidence interval is wider than the cor-
responding 80% confidence interval).
Table A.5 summarizes various confidence interval formulas for large
sample sizes (event sizes). Here, for some quantitative characteristic, xE and
xC denote the sample means in the experimental and control arms, respec-
tively; sE and sC denote the respective sample standard deviations in the
TABLE A.5
Formulas for Approximate 100(1 – α)% Confidence Intervals for Particular
Parameters of Interest
Parameter of Interest Approximate 100(1 – α)% Confidence Interval
Single mean (μE)
xE ± zα /2 sE / nE
experimental and control arms, respectively; and μE and μ C denote the actual
or underlying means for the experimental and control arms, respectively.
For a dichotomous characteristic, where the possibilities will be expressed
as “success” or “failure,” let p̂E and p̂C denote the sample proportions of “suc-
cesses” in the experimental and control arms, respectively, and pE and pC
denote the actual probability of a “success” for the experimental and control
arms, respectively.
For a time-to-event endpoint, let θ denote the true experimental arm ver-
sus control arm log-hazard ratio and let θ̂ denote its estimate based on rE and
rC events on the experimental and control arms, respectively. Furthermore,
let nE and nC denote the sample sizes for the experimental and control arms,
respectively, and let the 100(1 – γ)-th percentile of a standard normal distribu-
tion be denoted by zγ .
The confidence intervals in Table A.5 are all of the same form—the esti-
mate plus or minus the corresponding standard error for the estimator multi-
plied by a standard normal value, which represents the number of standard
errors that the estimate and the parameter will be within each other 100(1 –
α)% of the time. The standard error is the square root of the average squared
distance between the estimator of the parameter and the actual value of the
parameter.
As can be seen from Table A.5, the confidence interval for a difference in
means (or proportions) is not determined by manipulating the individual
confidence intervals for each mean (proportion). The use of separate confi-
dence intervals is conservative in determining whether we can rule out that
the two true means are equal. Each separate confidence interval reflects, for
that arm only, the possibilities for the true mean where it was not out of the
ordinary to observe the data that was observed or more extreme data. The
confidence interval for the difference reflects possibilities for the difference
in means for which it was not out of the ordinary to observe that collective
data from both arms or more extreme data.
Suppose the 95% confidence interval for the mean of one arm is (2–8) and the
95% confidence interval for the mean of the other arm is (5–11). The 95% confi-
dence interval for the difference in means is (–0.43, 6.43). For the first (second)
arm, it is not out of the ordinary to observe the data for that arm if the true
mean was 2.5 (10.5). However, it would be out of the ordinary to observe the
collective data of both arms if the difference in the true means was 8.
The standard error of the estimator of the log-hazard ratio depends on the
randomization ratio, the total number of events, and the true hazard ratio.
For a one-to-one randomization, when the true hazard ratio is not far from 1,
the standard error is approximately 2 divided by the square root of the total
number of events. For a fixed total number of events, this provides a specific
relationship between the upper limit and lower limits of a confidence inter-
val for the hazard ratio and the hazard ratio estimate. For example, for a one-
to-one randomization where there are 400 events, the upper limit of the 95%
confidence interval for the hazard ratio should be about 22% greater than the
estimate of the hazard ratio, which in turn should be about 22% greater than
the lower limit of that 95% confidence interval.
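As a quick check of the 22% figure, note that exp(1.96 × 2/√400) ≈ 1.2165; a one-line sketch:

```python
import numpy as np
from scipy.stats import norm

print(np.exp(norm.ppf(0.975)*2/np.sqrt(400)))   # ~1.2165: each limit ~22% away
```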
A frequent mistake in calculating confidence intervals is applying asymp-
totic methods when the sample size is not large enough for the assumptions
to approximately hold. The confidence intervals will then not have a level
approximately equal to the desired level. The delta method has been applied
to many functions where it would take rather large sample sizes for the
asymptotic results to approximately hold. Care should be taken when apply-
ing such methods.
the 95% confidence interval for the true hazard ratio is (0.45, 0.79). The first
trial provides stronger evidence than the second trial that the experimental
arm gives longer times to the event than the control arm. However, the sec-
ond trial rules out more possibilities for the hazard ratio away from equality
(0.79, 1) than the first trial (0.83, 1).
The use of hypotheses tests and p-values has been viewed by some as
dichotomizing the results as either “successful” or “unsuccessful,” and that
the role of a clinical trial should be to get precise estimates of the effect of
the experimental therapy relative to the control therapy. Precise estimates
would require a particular maximum standard error for the estimate and
maximum width of the confidence interval. This would require studies of
some minimal sample size.
In hypothesis testing, the conclusion of the alternative hypothesis in one trial is reproduced by another trial if that other trial also reaches the conclusion of the same alternative hypothesis. In such a case, for each trial, the p-value is smaller than the respective significance level. In practice, for two superiority trials, a one-sided significance level of 0.025 would be used for each trial. For confidence intervals, it may be unclear what is meant by a “reproduced finding.” Is a reproduced finding getting similar confidence intervals from each study, or confidence intervals that have a great amount of overlap with respect to their widths?
the primary outcome variable is continuous. See Wiens’ study11 for more
discussion.
Exact Tests for Binary and Categorical Data. For binary data, exact methods
have been proposed in the literature and are commonly used. Unlike reran-
domization methods for continuous data, exact methods are easy to interpret
for binary data. Again, as one moves from a null hypothesis of no difference
to a null hypothesis of a nonzero difference, complications arise.
The idea of exact tests was first introduced by Fisher12 while he was devel-
oping a conditional exact test for comparing two independent proportions
in a 2 × 2 table. Fisher’s exact test deals with the classical null hypothesis of
no difference between the two proportions conditioning on the observed
marginal totals. In this case, the marginal totals form a sufficient statistic
(i.e., a sufficient quantity from the data on which inferences can be based)
for the nuisance parameter (common proportion, p). Conditioning on the
marginal totals yields a hypergeometric distribution for the number of suc-
cesses for the experimental group. However, because of the discreteness of
the hypergeometric distribution, Fisher’s test tends to be overly conservative.
For the same problem, Barnard13 proposed exact unconditional tests in
which all combinations of the unconditional sampling space are considered
in constructing the test. The probability is thus spread over more possi-
bilities, providing a test statistic that is less discrete in nature than the test
statistic for Fisher’s test. As a result, these exact unconditional tests gener-
ally offer better power than Fisher’s test, although they are computationally
more involved.
For testing the hypothesis of non-inferiority where the null space con-
tains a nonzero difference or a nonunity relative risk, a nuisance parameter
arises, making the calculation of the exact p-values more complicated. As
there is no simple sufficient statistic for the nuisance parameter, the condi-
tioning argument does not solve the problem. Exact unconditional methods
for non-inferiority testing have been proposed by Chan14 and Röhmel and
Mansmann,15 in which the nuisance parameter is eliminated using the maxi-
mization principle—that is, the exact p-value is taken as the maximum tail
probability over the entire null space. Because the maximization involves a
large number of iterations in evaluating sums of binomial probabilities, the
exact unconditional tests are computationally intensive, particularly with
large sample sizes. Some exact test procedures for non-inferiority and the
associated confidence intervals are currently available in statistical software.
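To illustrate the maximization principle, the sketch below computes an exact unconditional p-value for the non-inferiority hypotheses Ho: pE – pC ≤ –δ versus Ha: pE – pC > –δ. For simplicity it orders outcomes by the difference in sample proportions rather than by the score statistic used by Chan,14 and it maximizes the tail probability over a grid on the null boundary; it is a rough sketch of the idea, not the published algorithm, and the data in the last line are hypothetical.

```python
import numpy as np
from scipy.stats import binom

def exact_unconditional_p(xE, nE, xC, nC, delta, grid=1000):
    """Exact unconditional p-value for Ho: pE - pC <= -delta versus
    Ha: pE - pC > -delta, ordering outcomes by the difference in
    sample proportions and maximizing over the null boundary."""
    t_obs = xE / nE - xC / nC
    e = np.arange(nE + 1)[:, None]            # experimental successes
    c = np.arange(nC + 1)[None, :]            # control successes
    extreme = (e / nE - c / nC) >= t_obs      # as or more extreme outcomes
    p_max = 0.0
    for pC in np.linspace(delta, 1.0, grid):  # boundary: pE = pC - delta
        probs = (binom.pmf(np.arange(nE + 1), nE, pC - delta)[:, None]
                 * binom.pmf(np.arange(nC + 1), nC, pC)[None, :])
        p_max = max(p_max, probs[extreme].sum())
    return p_max

# hypothetical data: 10/15 vs. 4/15 responders, margin delta = 0.10
print(exact_unconditional_p(10, 15, 4, 15, delta=0.10))
```

A finer grid (or an optimization step) would be used in practice, and the enumeration grows quickly with the sample sizes, which is the computational burden noted above.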
Although most exact tests have been developed for non-inferiority testing, they can be easily adapted for equivalence testing by reformulating the equivalence hypothesis as two simultaneous non-inferiority hypotheses of opposite direction. Then, equivalence of the two treatments can be established if both one-sided hypotheses are rejected.16,17 This approach is also recommended in regulatory environments, as indicated in the International Conference on Harmonization E9 Guideline18 “Statistical Principles for Clinical Trials,” which states that “Operationally, this (equivalence test) is
any sample size. With the advances in computer hardware and software, the
results from exact tests can be readily determined.
Asymptotic Tests for Continuous Data. When the primary endpoint is continuous, asymptotic methods are commonly used. Consider the null and alternative hypotheses Ho: μC – μE ≥ δ versus Ha: μC – μE < δ. By assuming that the estimators of μC and μE, the sample means X̄C and X̄E, have approximate normal distributions (based on the underlying normality of the data or on a central limit theorem), the test statistic

Z = (X̄C – X̄E – δ)/se(X̄C – X̄E)

will have an approximate standard normal distribution, where se(X̄C – X̄E) is the standard error of the difference in sample means. For sample sizes of nC and nE, respectively, and common standard deviation σ, the standard error will be σ√(1/nC + 1/nE). Except for the subtraction of the nonzero δ in the numerator, this test statistic Z is identical to the test statistic for a test of superiority. The null hypothesis is rejected if Z < –zα in a one-sided test at level α, where zα is the 100 × (1 – α) percentile of the standard normal distribution (e.g., 1.645 for a one-sided test with significance level 0.05). A p-value can also be calculated. When σ is unknown and the samples are from normal distributions, a t test can be used where σ² is estimated by a pooled variance. With the large sample sizes common in non-inferiority clinical trials, it is not necessary to assume either an equal variance or that the samples are from a normal distribution. Rather, √(sC²/nC + sE²/nE), where sC² and sE² are the respective sample variances, can be used to estimate the standard error of the difference in sample means, replacing se(X̄C – X̄E) in Z.
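A minimal sketch of this test with unpooled variances; the summary statistics and margin in the final line are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def noninf_test_means(xbarC, sC, nC, xbarE, sE, nE, delta, alpha=0.025):
    # Test Ho: muC - muE >= delta versus Ha: muC - muE < delta.
    se = np.sqrt(sC**2 / nC + sE**2 / nE)
    z = (xbarC - xbarE - delta) / se
    p = norm.cdf(z)                    # reject when z < -z_alpha
    half = norm.ppf(1 - alpha) * se    # two-sided 100(1 - 2*alpha)% CI;
    ci = (xbarC - xbarE - half, xbarC - xbarE + half)
    return z, p, ci                    # upper CI bound < delta iff z < -z_alpha

# hypothetical summary data: nC = nE = 300, margin delta = 1.5
print(noninf_test_means(10.0, 4.0, 300, 9.5, 4.2, 300, delta=1.5))
```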
In practice, non-inferiority test procedures are often expressed without test statistics (although test statistics can be used), in part because of the subjectivity involved in the choice of δ when analyzing non-inferiority trials. Rather, a two-sided 100(1 – α)% confidence interval for μC – μE is determined, and the null hypothesis is rejected if the upper bound of the confidence interval is less than δ. If the confidence interval is calculated as μ̂C – μ̂E ± zα/2 se(μ̂C – μ̂E), the confidence interval approach and the test statistic approach are identical. The confidence interval approach conveys those margins that can and cannot be ruled out by the data. Thus, when there are different perspectives on the non-inferiority margin (e.g., different perspectives among regulatory bodies), individual decisions on non-inferiority are based on the same confidence interval. The two-sided confidence interval is preferred because it gives information about both the “best-case scenario” and the “worst-case scenario” for the experimental treatment. A two-sided confidence interval can also be used to simultaneously test for superiority, non-inferiority, inferiority, and equivalence (if equivalence margins are established).
More details on statistical approaches for continuous data are given in
Chapter 12.
Asymptotic Tests for Binary Data. When the primary endpoint is binary, infer-
ence is based on the proportion of subjects in each arm who have a favorable
TABLE A.6
Summary of Posterior Probabilities of p > 0.5 for Different Prior Distributions When Observing 8 Heads in 10 Tosses

Case   Prior Distribution for p   Posterior Distribution for p   Posterior Probability of p > 0.5
1      Beta (α = 1, β = 1)        Beta (α = 9, β = 3)            0.967
2      Beta (α = 3, β = 3)        Beta (α = 11, β = 5)           0.941
3      Beta (α = 5, β = 5)        Beta (α = 13, β = 7)           0.916
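The entries of Table A.6 can be reproduced directly from the Beta posterior (a sketch):

```python
from scipy.stats import beta

# 8 heads in 10 tosses; a Beta(a, b) prior gives a Beta(a + 8, b + 2) posterior
for a, b in [(1, 1), (3, 3), (5, 5)]:
    print(a, b, round(beta.sf(0.5, a + 8, b + 2), 3))  # P(p > 0.5 | data)
```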
Example A.4
As before, let θ denote the true experimental versus control hazard ratio of some
undesirable event (e.g., death or disease progression). For an observed hazard
ratio of 0.91 based on 400 events in a clinical trial that had a one-to-one random-
ization, Table A.7 summarizes the posterior probability that θ lies in the alternative
hypothesis for each comparison type.
Since superiority (θ < 1) implies non-inferiority (θ < 1.1), the posterior probability for the non-inferiority comparison will always be greater than the posterior probability for a superiority comparison of the same two treatment arms. Also, since the
parameter has a continuous posterior distribution, the sum of the posterior prob-
abilities for superiority and for inferiority is 1, and the posterior probability for the
alternative hypothesis of a difference comparison is always 1 (even if the observed
hazard ratio is 1). In practice, since a difference comparison is a two-sided com-
parison, the posterior probabilities on each side of no difference would be calcu-
lated and compared when making an inference about whether a difference (and
the direction of that difference) has been demonstrated.
Note that from Tables A.3 and A.7, for inferiority, superiority, and non-inferiority comparisons, the sum of the p-value and the posterior probability of the alternative hypothesis equals 1. This will occur for each of these comparisons whenever a normal model with known variance is used for the estimator and a noninformative prior distribution is selected for the parameter. More on comparing and contrasting a p-value and the posterior probability of the alternative hypothesis will be provided in Section A.3.

TABLE A.7
Summary of Posterior Probabilities for Various Types of Comparisons for an Observed Experimental versus Control Hazard Ratio of 0.91 Based on 400 Events

Case              Null Hypothesis   Alternative Hypothesis   Posterior Probability of the Alternative Hypothesis
Inferiority       Ho: θ ≤ 1         Ha: θ > 1                0.173
Superiority       Ho: θ ≥ 1         Ha: θ < 1                0.827
Difference        Ho: θ = 1         Ha: θ ≠ 1                1
Non-inferiority   Ho: θ ≥ 1.1       Ha: θ < 1.1              0.971

For each case, the posterior distribution for the true log-hazard ratio is N(ln(0.91), (0.1)²).a
a The posterior distribution is approximated for a one-to-one randomization by using the asymptotic distribution of the estimator of the log-hazard ratio and a noninformative prior on the true log-hazard ratio.
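A sketch reproducing the posterior probabilities in Table A.7 from the normal posterior for the log-hazard ratio:

```python
import numpy as np
from scipy.stats import norm

mu, se = np.log(0.91), 0.1                   # ln(theta) ~ N(ln 0.91, 0.1^2)
p_sup = norm.cdf((np.log(1.0) - mu) / se)    # P(theta < 1)   ~ 0.827
p_inf = 1 - p_sup                            # P(theta > 1)   ~ 0.173
p_ni = norm.cdf((np.log(1.1) - mu) / se)     # P(theta < 1.1) ~ 0.971
print(p_inf, p_sup, p_ni)
```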
parameter θ. Let Ω denote the parameter space for θ, the set of all possible values for θ. We denote the probability density function (or probability mass function) of X by f(x|θ). Let h denote the probability density function for the prior distribution of θ. Then for the observed values x1, . . . , xn, from a random sample X1, . . . , Xn, the posterior density function for θ is given by

g(θ|x1, x2, . . . , xn) = [f(x1|θ) f(x2|θ) ⋅⋅⋅ f(xn|θ) h(θ)] / [∫Ω f(x1|θ) f(x2|θ) ⋅⋅⋅ f(xn|θ) h(θ) dθ].   (A.1)
TABLE A.8
Summary of Conjugate Families of Prior Distributions

Distribution from which
X1, …, Xn Is Randomly Drawn   Prior Distribution         Posterior Distributiona
Bernoulli (p)                 p ~ Beta (α, β)            p ~ Beta (α + Σxi, β + n – Σxi), with mean
                                                         (n x̄ + (α + β)(α/(α + β)))/(n + α + β)
Normal (μ, σ²), σ² known      μ ~ Normal (υ, τ²)         μ ~ Normal, with mean
                                                         ((n/σ²) x̄ + (1/τ²) υ)/(n/σ² + 1/τ²)
                                                         and variance 1/(n/σ² + 1/τ²)
Poisson (λ)                   λ ~ Gamma (α, β), where    λ ~ Gamma (α + Σxi, 1/(n + 1/β)), with mean
                              the mean of λ is αβ        (n x̄ + (1/β)(αβ))/(n + 1/β)
a For the posterior distributions, the observed values are x1, . . . , xn with sample mean x̄.
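A sketch of two of the conjugate updates in Table A.8 (the data vector in the last line is hypothetical):

```python
import numpy as np

def update_beta(alpha, beta_, x):
    # Beta-Bernoulli: x is an array of 0/1 observations
    return alpha + x.sum(), beta_ + len(x) - x.sum()

def update_normal(nu, tau2, x, sigma2):
    # Normal-Normal with known data variance sigma2; prior mu ~ N(nu, tau2)
    n, xbar = len(x), np.mean(x)
    post_var = 1 / (n / sigma2 + 1 / tau2)
    post_mean = post_var * (n * xbar / sigma2 + nu / tau2)
    return post_mean, post_var

# 8 successes in 10 trials with a uniform Beta(1, 1) prior -> Beta(9, 3)
print(update_beta(1, 1, np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])))
```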
distributions for which the inferences on the parameter are based almost
entirely on the data. In some settings, Bayesian inferences based on such
prior distributions are completely analogous or identical to inferences based
on frequentist methods. For example, consider an experiment where inde-
pendent samples of size 100 each are taken from normal distributions each
having a variance of 25 and respective means of μ1 and μ2. The statistical
hypotheses that will be tested are Ho: μ1 ≤ μ2 and Ha: μ1 > μ2. Let x̄1 and x̄2 denote the observed values of the respective sample means. The p-value for the normalized test statistic is 1 – Φ((x̄1 – x̄2)/√(1/2)). For noninformative prior distributions for μ1 and μ2 (or equivalently, a noninformative prior distribution on θ = μ1 – μ2), the posterior probability that μ1 > μ2 equals Φ((x̄1 – x̄2)/√(1/2)). Thus, for this example, for any 0 < α < 1, rejecting Ho (and thus concluding Ha) whenever the p-value is ≤ α is equivalent to rejecting Ho whenever the posterior probability of Ha is ≥ 1 – α.
Jeffreys Prior Distributions. In the univariate setting, a Jeffreys prior has a density function proportional to the square root of Fisher’s information. Fisher’s information in a single observation is given by

I(θ) = –E[∂²/∂θ² log f(X|θ)].

The density for the Jeffreys prior then satisfies h(θ) ∝ √I(θ). When sampling
That is, the estimate, a, is the value in the parameter space that minimizes the posterior expected loss ∫Ω L(θ, a) g(θ|x1, . . . , xn) dθ. For example, consider the squared-error loss function, L(θ, a) = (θ – a)². The value a = E(θ|x1, . . . , xn), the posterior mean of θ, minimizes E(θ – a)² = ∫Ω (θ – a)² g(θ|x1, . . . , xn) dθ. For an absolute loss function (L(θ, a) = |θ – a|), the expected loss is minimized by using the median of the posterior distribution as the estimate.
Consider a Jeffreys prior for p, the probability of a success. An experiment results in two successes and eight failures among 10 trials. Here, p has a posterior Beta distribution with α = 2.5 and β = 8.5. Table A.9 provides the estimates for three loss functions.
The posterior median and mean for p are approximately 0.210 and 0.227,
respectively. For cubed absolute loss, the value of approximately 0.242 mini-
mizes the expected loss. When the Beta distribution has a mean less than 0.5,
as in this example, the sequence of ah that minimizes E(|θ – a|h) increases to
0.5. If the Beta distribution has a mean greater than 0.5, this same sequence
decreases to 0.5. When the Beta distribution has a mean equal to 0.5, ah = 0.5
minimizes E(|θ – a|h) for all h > 0.
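The estimates in Table A.9 can be approximated numerically from posterior draws (a Monte Carlo sketch):

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

draws = beta(2.5, 8.5).rvs(200_000, random_state=1)   # posterior draws

for h in (1, 2, 3):   # absolute, squared, and cubed absolute loss
    est = minimize_scalar(lambda a: np.mean(np.abs(draws - a) ** h),
                          bounds=(0, 1), method="bounded").x
    print(h, round(est, 3))   # ~ 0.21, 0.23, 0.24
```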
A frequentist evaluation of a Bayesian method can also be done. For Bayesian estimators, the sampling distribution can be determined, as can the mean square error (when it exists) and the asymptotic properties of the estimator. For example, consider making an inference on a response rate p, based on a sample of 20 subjects and a Beta prior distribution for p where the value of each parameter is 2. Let x denote the number among the 20 subjects that responded. We will model x as a random value from a binomial distribution based on 20 trials with probability of success p. We denote the mean of the posterior distribution by p̂ = (x + 2)/24. Then the sampling distribution for p̂ is summarized by

P(p̂ = (x + 2)/24) = [20!/(x!(20 – x)!)] p^x (1 – p)^(20–x), for x = 0, . . . , 20.

The mean squared error for p̂ is (1 + p – p²)/144.
Credible Intervals. To illustrate an example of an equal-tailed credible inter-
val, consider a randomized, controlled clinical trial where 10 of 15 patients
on the experimental arm and 4 of 15 patients on the control arm responded.
We will use a Jeffreys prior for the prior distribution for the response rate (pC
TABLE A.9
Loss Functions and Corresponding Estimates

Loss Function L(θ, a)            Estimate
Absolute loss, |θ – a|           0.210
Squared-error loss, (θ – a)²     0.227
Cubed absolute loss, |θ – a|³    0.245
|θ – a|^h as h → ∞               0.5
TABLE A.10
Summary of 95% Credible Intervals

Parameter                              Prior Distribution              Posterior Distribution          95% Credible Interval
Control response rate pC               pC ~ Beta (0.5, 0.5)            pC ~ Beta (4.5, 11.5)           (0.097, 0.517)
Experimental response rate pE          pE ~ Beta (0.5, 0.5)            pE ~ Beta (10.5, 5.5)           (0.416, 0.860)
Difference in response rates pE – pC   pC and pE assumed independent   pC and pE assumed independent   (0.046, 0.663)
and pE) of each arm. Table A.10 summarizes the equal-tailed 95% credible
intervals for the response rate of each arm and for the difference in response
rates. The joint posterior distribution (with joint density being the product of
the posterior densities for pC and pE) is used to determine the 95% credible
interval for the difference.
Note that the 95% exact confidence intervals for pC and pE are (0.078, 0.551)
and (0.384, 0.882), respectively. Also, the large sample normal approximate
95% confidence interval for pE – pC is (0.073, 0.727). The 95% credible inter-
vals are narrower than the respective exact 95% confidence intervals for pC
and pE. Here, for the difference in response rates, the 95% credible interval
is narrower than and slightly shifted to the left of the 95% confidence inter-
val. Section A.3 investigates the relationship between credible intervals for a proportion based on a Jeffreys prior and the corresponding exact confidence intervals.
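A sketch reproducing Table A.10; the credible interval for the difference is approximated with Monte Carlo draws from the joint posterior:

```python
import numpy as np
from scipy.stats import beta

postC = beta(4.5, 11.5)    # Jeffreys prior + 4/15 responders (control)
postE = beta(10.5, 5.5)    # Jeffreys prior + 10/15 responders (experimental)
print(postC.ppf([0.025, 0.975]))   # ~ (0.097, 0.517)
print(postE.ppf([0.025, 0.975]))   # ~ (0.416, 0.860)

# equal-tailed credible interval for pE - pC from the joint posterior
diff = (postE.rvs(200_000, random_state=1)
        - postC.rvs(200_000, random_state=2))
print(np.quantile(diff, [0.025, 0.975]))   # ~ (0.046, 0.663)
```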
Hypothesis testing can be based on a credible interval, on the magnitude of the posterior probability that the alternative hypothesis is true, or on the expected loss/cost for rejecting or failing to reject the null hypothesis. In
any case, there will exist a rejection region, a set of possible samples for
which the null hypothesis is rejected. The rejection region can be assessed
to determine the type I error rate or size of the test, and the power function.
For example, suppose a posterior probability for p > 0.5 greater than 0.975 is needed to reject the null hypothesis that p ≤ 0.5 and conclude the alternative hypothesis that p > 0.5. For a Jeffreys prior distribution, this would require at least 15 responses among the 20 subjects. The power function for this test is

Σ_{x=15}^{20} [20!/(x!(20 – x)!)] p^x (1 – p)^(20–x), for 0 < p < 1,

and thus the size of the test is approximately 0.021.
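A sketch that locates the rejection region and evaluates the size of this test:

```python
import numpy as np
from scipy.stats import beta, binom

n = 20
x = np.arange(n + 1)
post = beta.sf(0.5, x + 0.5, n - x + 0.5)   # P(p > 0.5 | x), Jeffreys prior
x_min = x[post > 0.975].min()               # smallest x rejecting Ho (15)
size = binom.sf(x_min - 1, n, 0.5)          # P(X >= x_min | p = 0.5) ~ 0.021
print(x_min, size)
```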
The Likelihood Principle. The likelihood principle states that when two dif-
ferent experiments lead to the same or proportional likelihood functions,
inferences about the parameter should be identical for the two experiments.
The likelihood principle is preserved with Bayesian inference but may not
be preserved with frequentist inference. This is illustrated in the following
example.
Example A.5
Suppose we learn that among seven subjects that were given an investigational
drug in a phase I study, one subject experienced the target toxicity. Suppose we
are interested in whether the true probability that a subject will experience the tar-
get toxicity, p, is less than 0.5. Thus, we are interested in testing Ho: p = 0.5 against the alternative hypothesis Ha: p < 0.5. Consider the following two possible designs (with
corresponding calculations for the p-value):
• Design A (binomial design): The design of the study required the toxicity
experiences of exactly seven subjects given the proposed dose of the inves-
tigational drug. One patient out of the seven experienced the target toxicity.
Here, the p-value, the probability that zero or one of the seven patients
would experience the target toxicity when p = 0.5, equals 1/16 (=0.0625).
The one-sided lower 95% confidence interval for p is (0, 0.52).
• Design B (negative binomial design): For the study, subjects were to receive
the investigational therapy, one at a time, until a subject experiences the
target toxicity or until 10 patients have received the investigational therapy.
Here, the p-value, the probability that at least seven patients will be treated
with the investigational drug when p = 0.5, equals 1/64 (≈0.0156). The one-
sided lower 95% confidence interval for p is (0, 0.39).
The corresponding likelihood functions for p in the two studies are propor-
tional. However, if formal hypothesis testing were done with a significance level of 0.05, the decision on whether the evidence is strong enough to conclude p < 0.5 would differ between the study designs. In the Bayesian set-
ting, we have from Equation A.1 that since the likelihood functions for p are
proportional, the posterior distribution for p will not depend on whether the
design was A or B. For a Jeffreys prior for p (which leads to a Beta posterior
distribution with parameter values 1.5 and 6.5), the posterior probability that
p > 0.5 is 0.025. The one-sided lower 95% credible interval for p is (0, 0.44). For
a Beta prior on p with parameters α and β, the posterior distribution for p
will have parameter values α + 1 and β + 6. Sending α and β to zero leads to a
limiting Beta posterior distribution with parameter values of 1 and 6. On the
basis of this Beta posterior distribution, the posterior probability that p > 0.5
is 1/64 with a one-sided lower 95% credible interval for p of (0, 0.39). This pro-
vides analogous results to the frequentist analysis using a negative binomial
design. It can be shown that for alternative hypotheses of the form p > pothat
the p-value for observing the xth success in the nth trial from a negative bino-
mial design equals the probability of p > po when p has a Beta distribution
with parameter values x and n–x. Further comparison of credible intervals
and confidence intervals for proportions is provided in Section A.3.
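A sketch of the calculations in Example A.5, showing the design-dependent frequentist p-values and the design-independent posterior probability:

```python
from scipy.stats import binom, beta

# Design A (binomial): 1 toxicity among n = 7 subjects
pA = binom.cdf(1, 7, 0.5)        # = 1/16 = 0.0625
# Design B (negative binomial): first toxicity on subject 7
pB = 0.5 ** 6                    # = 1/64 ~ 0.0156
# Bayesian posterior (Jeffreys prior) is the same under either design
post = beta.sf(0.5, 1.5, 6.5)    # P(p > 0.5 | data) ~ 0.025
print(pA, pB, post)
```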
The exact 100(1 – α/2)% confidence lower bound for p is pL, where pL satisfies

Σ_{j=x}^{n} [n!/(j!(n – j)!)] pL^j (1 – pL)^(n–j) = α/2.

Similarly, the exact 100(1 – α/2)% confidence upper bound for p is pU, where pU satisfies

Σ_{j=0}^{x} [n!/(j!(n – j)!)] pU^j (1 – pU)^(n–j) = α/2.

The exact 100(1 – α)% confidence interval for p is then given by (pL, pU).
It can be shown by applying multiple integrations by parts that pL and pU satisfy

Σ_{j=x}^{n} [n!/(j!(n – j)!)] pL^j (1 – pL)^(n–j) = ∫_0^(pL) [n!/((x – 1)!(n – x)!)] z^(x–1) (1 – z)^(n–x) dz = α/2

and

Σ_{j=0}^{x} [n!/(j!(n – j)!)] pU^j (1 – pU)^(n–j) = ∫_(pU)^1 [n!/(x!(n – x – 1)!)] z^x (1 – z)^(n–x–1) dz = α/2.

Thus, the 100(1 – α)% confidence interval for p has the 100α/2-th percentile of a Beta distribution with parameter values x and n – x + 1 for its lower limit (pL) and the 100(1 – α/2)-th percentile of a Beta distribution with parameter values x + 1 and n – x for its upper limit (pU).
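This Beta-percentile characterization makes the exact interval straightforward to compute (a sketch):

```python
from scipy.stats import beta

def exact_ci(x, n, alpha=0.05):
    # Clopper-Pearson interval via the Beta percentile characterization
    lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

print(exact_ci(4, 15))   # ~ (0.078, 0.551), the interval quoted for pC above
```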
Let qL and qU denote the exact 100α/2% confidence lower bound for p and the exact 100α/2% confidence upper bound for p, respectively. Thus, qL is the 100(1 – α/2)-th percentile of a Beta distribution with parameter values x and n – x + 1, and qU is the 100α/2-th percentile of a Beta distribution with parameter values x + 1 and n – x. Since we are 100α/2% confident that the actual value of p is in [qL, 1] and also 100α/2% confident that the actual value of p is in [0, qU], it would seem reasonable that we are 100(1 – α)% confident that the actual value of p is in (qU, qL). However, note that a Beta distribution with parameter values x and n – x + 1 is stochastically smaller than a Beta distribution with parameter values x + ½ and n – x + ½ (whose 100α/2-th and 100(1 – α/2)-th percentiles, rL and rU, are the limits of the corresponding Jeffreys-prior credible interval), which in turn is smaller than a Beta distribution with parameter values x + 1 and n – x. Thus, pL < rL < qU and qL < rU < pU. Hence, (qU, qL) is contained in (rL, rU), which in turn is contained in (pL, pU).
Note that when a frequentist says that a confidence coefficient (regardless of whether the confidence interval is one-sided or two-sided) is γ, this means that the confidence interval will capture the correct value of the parameter at least 100γ% of the time. Thus, the interval (qU, qL) will capture the correct value of the parameter at most (not at least) 100(1 – α)% of the time. In fact, there are few possibilities (if any) for the actual value of p that would be captured exactly 100(1 – α)% of the time by an exact 100(1 – α)% confidence interval for p. Thus, invariably, the “probability coverage” of a 100(1 – α)% exact confidence interval for p, (pL, pU), is greater than 1 – α. Exact confidence intervals for p have accordingly been regarded as conservative.
For negative binomial sampling, where the xth success is observed on the nth trial, the limits sL and sU of the exact 100(1 – α)% confidence interval satisfy

Σ_{j=x}^{n} [n!/(j!(n – j)!)] sL^j (1 – sL)^(n–j) = ∫_0^(sL) [n!/((x – 1)!(n – x)!)] z^(x–1) (1 – z)^(n–x) dz = α/2

and

Σ_{j=0}^{x–1} [(n – 1)!/(j!(n – 1 – j)!)] sU^j (1 – sU)^(n–1–j) = ∫_(sU)^1 [(n – 1)!/((x – 1)!(n – x – 1)!)] z^(x–1) (1 – z)^(n–x–1) dz = α/2.

Thus, the 100(1 – α)% confidence interval for p has the 100α/2-th percentile of a Beta distribution with parameter values x and n – x + 1 for its lower limit (sL) and the 100(1 – α/2)-th percentile of a Beta distribution with parameter values x and n – x for its upper limit (sU).
Note that the lower limit of the 100(1 – α)% exact confidence interval for p is the
same for binomial sampling as negative binomial sampling (i.e., sL = pL). Since
a Beta distribution with parameter values x and n – x is stochastically smaller than a Beta distribution with parameter values x + 1 and n – x, we have sU < pU. Whether there is an ordering between a Beta distribution with parameter values x and n – x and a Beta distribution with parameter values x + ½ and n – x + ½ (i.e., the respective order of sU and rU) depends on the specific values for
x and n. Note also that a Beta distribution with parameters x and n – x is the
limiting posterior distribution from using a Beta prior distribution for p with
parameters α and β, and then sending α and β to zero. Thus, sU is the upper limit
of the respective credible interval based on this limiting posterior distribution.
Inference based on a Jeffreys prior distribution can be directly generalized to
comparing two proportions. If the samples involving each proportion are inde-
pendent, the true probabilities of a success will have independent Beta posterior
distributions. The posterior distribution for the difference in the true probabili-
ties (or some other function of the true probabilities) can be determined, which
can be used to find an estimate and a credible interval for the difference in the
probabilities of a success. The specific approach used to find an exact confi-
dence interval for p cannot be directly extended to making inferences about
a difference in two probabilities. However, there are exact confidence inter-
val approaches for the difference of two probabilities (see Chapter 11). These
approaches require setting an ordering on the possible observations that may or may not be specified a priori (i.e., different orderings have been used in practice).
methods adjust for the possibility that the control therapy is not effective.
We borrow ideas from Simon23 in constructing some of the credible inter-
vals. We will assume that the sample/event size for the non-inferiority trial
is independent of results in estimating the effect of the control therapy.
Consider the following hypothetical example for overall survival. The pla-
cebo versus control therapy log-hazard ratio is estimated as 0.20, with corre-
sponding standard error of 0.10. A normal distribution is considered for the
sampling distribution of the placebo versus control log-hazard ratio estimator.
From the non-inferiority trial, the experimental therapy versus control therapy log-hazard ratio is estimated as –0.10, with corresponding standard error of 0.08. Table A.11 gives 95% confidence intervals and 95% credible intervals for the retention fraction, λ, based on various methods. Those methods that are being introduced are described below. For the Bayesian methods, the posterior distribution for the placebo versus control therapy log-hazard ratio, β, is modeled as a normal distribution with mean 0.2 and standard deviation 0.1, and the posterior distribution for the experimental versus control log-hazard ratio, η, is modeled as a normal distribution with mean –0.10 and standard deviation 0.08.
The intervals using frequentist methods do not adjust for the uncertainty
that the control therapy is less effective than placebo. Here the p-value
for testing whether the control therapy is better than placebo is 0.019. The
(Bayesian) probability that the control therapy is less effective than placebo
is also 0.019. If we ignore whether the control therapy is more or less effective
than placebo and extend the definition of λ to include cases where β < 0 (λ =
1 – η/β), then P(λ > 0.239) = P(λ < 4.56) = 0.975. Here the 95% (“equal-tailed”)
credible interval analog to the 95% confidence interval is (0.239, 4.56).
The other two credible intervals in Table A.11 do not ignore the uncertainty
that the control therapy may be less effective than placebo. When the deter-
mination of a credible interval for λ requires or is restricted to β > 0, the 95%
(“equal-tailed”) credible interval for λ is (–0.527, 4.51). That is, P(λ > –0.527, β > 0)
= 0.975 and P(λ > 4.51, β > 0) = 0.025. For this case, since P(β > 0) ≈ 0.977, an equal-
tailed credible interval with coefficient greater than 0.954 cannot be determined.
If possibilities where η – β < 0 and β < 0 (cases where the experimental therapy
is better than placebo, which is better than the control therapy) are regarded as
TABLE A.11
95% Confidence Interval or Credible Interval for Retention Fraction, λ, Based on Several Methods

Method                                           95% Confidence Interval or Credible Interval for λ
Fieller (based on a normalized test statistic)   (0.640, 26.612)
Delta method                                     (0.575, 2.424)
Bayesian, ignoring whether β < 0 or β > 0        P(0.239 < λ < 4.56) = 0.95
Bayesian, excluding P(η – β < 0 and β < 0)       P(–0.527 < λ < 4.51, β > 0) = 0.95
Bayesian, including P(η – β < 0 and β < 0)       P(0.614 < λ < 9.90, β > 0) = 0.95
having greater relative efficacy than any case where λ > 0 and β > 0, then the 95% (“equal-tailed”) credible interval for λ is (0.614, 9.90). That is, P({λ > 0.614, β > 0} or {η – β < 0, β < 0}) = 0.975 and P({λ > 9.90, β > 0} or {η – β < 0, β < 0}) = 0.025.
This last method has the advantage of considering both the uncertainty that
the control therapy may be less effective than a placebo and also other cases of
greater relative efficacy. The interval (0.614, 9.90) may be the most appropriate
95% CI (Confidence interval or credible interval) for the retention fraction.
When P(β > 0) is extremely close to 1, the credible intervals from the above
three methods will be approximately the same. The 95% confidence interval
from the Fieller method will also be similar. For example, if the estimate
of the placebo versus control therapy log-hazard ratio is instead 0.4, then
the 95% Fieller confidence interval and each of the three 95% credible inter-
vals are approximately (0.851, 1.81). The approximate 95% confidence interval
using the delta method is (0.839, 1.66). The confidence interval for the delta
method is noticeably different from that using Fieller’s method. The estima-
tor of the retention fraction may not have an approximate normal distribu-
tion. Rothmann and Tsou24 examined the actual coverage of delta method
confidence intervals for the retention fraction when estimated by a ratio of
independent random variables, each having an approximate normal distri-
bution. When the ratio of the mean to the standard deviation is greater than 8 for the estimator of the effect of the control therapy, then (per Rothmann and Tsou24) a hypothesis test based on a delta method confidence interval for the retention fraction will have approximately the desired type I error rate.
For testing for a retention fraction of more than 0.5, Table A.12 summarizes
the one-sided p-values and the analogous posterior probabilities using these
methods for the example given in Table A.11. For this case, the p-value or
posterior probability was similar for the normalized test statistic, the delta
method, and the Bayesian method, which includes possibilities where η – β <
0 and β < 0 as having the greatest relative efficacy.
TABLE A.12
One-Sided p-Values and Analogous Posterior Probabilities for Non-Inferiority, Testing for a Retention Fraction of More than 0.5

Method                                      p-Value/Posterior Probability
Normalized test statistic                   0.017
Delta method                                0.017
Bayesian, ignoring whether β < 0 or β > 0   P(λ < 0.5) = 0.032
Bayesian, excluding P(η – β < 0 and β < 0)  1 – P(λ > 0.5, β > 0) = 0.036
Bayesian, including P(η – β < 0 and β < 0)  1 – P({λ > 0.5, β > 0} or {η – β < 0, β < 0}) = 0.019
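The Bayesian quantities in Tables A.11 and A.12 can be approximated by Monte Carlo draws from the two normal posteriors. The sketch below implements the method that ignores the sign of β:

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(0.20, 0.10, 1_000_000)    # beta: placebo vs. control log-HR
e = rng.normal(-0.10, 0.08, 1_000_000)   # eta: experimental vs. control log-HR
lam = 1 - e / b                          # retention fraction

print(np.quantile(lam, [0.025, 0.975]))  # ~ (0.24, 4.6)
print(np.mean(lam < 0.5))                # ~ 0.032 (Table A.12, "ignore" row)
```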
experiments even though the observed data are identical. The Bayesian anal-
ysis remains the same. This is also pertinent to the design and analysis of
non-inferiority trials when the analysis includes the estimation of the effect
of the control therapy from previous trials. The frequentist interpretation of the results formally depends on whether the design of the non-inferiority trial was independent of, or dependent on, the estimation of the effect of the control therapy. We will illustrate this by considering two designs for comparing
the means from two samples. For the purpose of a non-inferiority compari-
son, μ1 represents the effect of the control therapy versus placebo that will be
estimated by previous trials and μ2 represents the difference in the effects of
the control and experimental therapies that will be estimated from the non-
inferiority trial. Consider the following two experiments.
Case 1: A random sample of size 25 is drawn from a normal distribution
having an unknown mean μ1 and a variance equal to 100, and an indepen-
dent random sample of size 100 is drawn from a normal distribution having
an unknown mean μ2 and a variance equal to 100.
Case 2: A random sample of size 25 is drawn from a normal distribution
having an unknown mean μ1 and a variance equal to 100. The observed
sample mean, x1 , is noted. An independent random sample of size m(x1 ) is
drawn from a normal distribution having an unknown mean μ2 and a vari-
ance equal to 100, for some positive-integer valued function m.
Let x1 and x2 denote the respective sample means.
In case 1, the likelihood function reduces to (is proportional to)

L(μ1, μ2; x̄1, x̄2) = f(x̄1, x̄2; μ1, μ2) = [1/(4π)] exp{–[(x̄1 – μ1)²/8 + (x̄2 – μ2)²/2]}.

The likelihood function factors into the product of separate functions of x̄1 and x̄2, and also factors into the product of separate functions of μ1 and μ2. The two random sample means are independent, and if independent noninformative priors are selected for μ1 and μ2, then μ1 and μ2 will be independent at all stages of sampling. In fact, in such a case, certain frequentist and Bayesian inferences will be the same.
In case 2, the likelihood function reduces to (is proportional to)

L(μ1, μ2; x̄1, x̄2) = f(x̄1, x̄2; μ1, μ2) = [√(m(x̄1))/(40π)] exp{–[(x̄1 – μ1)²/8 + (x̄2 – μ2)²/(200/m(x̄1))]}.

This likelihood function will factor into the product of separate functions of μ1 and μ2. Analogous types of Bayesian methods can be applied in case 2 as in case 1, and if m(x̄1) = 100, the likelihood functions and the posterior distributions will be identical, and hence the inferences will be identical. However, in case 2, the likelihood function cannot be expressed as the product of separate functions of x̄1 and x̄2. In fact, if m is not a constant function, then the difference in the random sample means will not have a normal distribution. Suppose μ1 = μ2 = 0, and suppose m(x) = 1 if x < 0 while m(x) is so large that 100/m(x) ≈ 0 if x > 0. Then it is easy to see that P(X̄1 – X̄2 > 0) > 0.5, even though E(X̄1 – X̄2) = 0.
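A quick simulation of this extreme choice of m, with the huge second-stage sample size proxied by a tiny standard deviation, makes the point:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 2.0, 1_000_000)    # mean of 25 obs with variance 100
sd2 = np.where(x1 < 0, 10.0, 1e-3)      # m = 1 if x1 < 0; m huge if x1 > 0
x2 = rng.normal(0.0, sd2)               # mean of the second-stage sample
print(np.mean(x1 - x2 > 0))             # ~ 0.72, above 0.5 despite mean zero
```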
Scenarios like these arise when historical trials are used to estimate the
effect of the non-inferiority trial’s control therapy. Many of the first such non-
inferiority analyses that used historical trials to estimate the effect of the
control therapy had this estimation occur retrospectively after the results
of the non-inferiority trial were known. It is currently common practice to prospectively estimate the effect of the control therapy before conducting the non-inferiority trial. Thus, the non-inferiority criterion and the sizing of the non-inferiority trial will depend on the results from estimating the effect of the control therapy. For Bayesian analyses, it does not matter whether the estimate of the control therapy’s effect and its corresponding variance influence the sizing of the non-inferiority trial. For frequentist analyses, the sampling distribution of whatever test statistic is used is altered and may not be readily determined, even approximately. Rothmann25 provided a discussion on how the
type I error probability changes across the boundary of a non-inferiority null
hypothesis and potential ways of addressing this problem when trying to
maintain a desired type I error rate.
Example A.6
Consider testing the equivalence hypotheses

Ho: θ ≤ 0.8 or θ ≥ 1.25 vs. Ha: 0.8 < θ < 1.25 (A.3)

where 0.8 and 1.25 are the equivalence limits. As in Examples A.3 and A.4, the
observed experimental to control hazard ratio is 0.91. We will define the p-value
consistent with Section A.1 as the (largest) probability of observing a hazard ratio
of 0.91 or more extreme (more in favor of the alternative hypothesis) if the null
hypothesis were true. It would seem reasonable, at least conceptually, that the closer the observed hazard ratio is to 1 in a relative sense (the closer the observed log-hazard ratio is to 0), the stronger the evidence against the null hypothesis in favor of “equivalence.” On the basis of that approach, the p-value is the largest probability of getting an observed hazard ratio between 0.91 and 1/0.91 when the null hypothesis is true, which equals 0.098. In practice, a p-value
is rarely calculated when performing an equivalence test. In general, equivalence
is concluded if a confidence interval (usually a 90% confidence interval) contains
only possibilities within the equivalence margin. For example, for the alternative
hypothesis of equivalence in Equation A.3, equivalence may be concluded if a
90% confidence interval for θ lies within (0.8, 1.25). As the 90% confidence inter-
val for θ is (0.772, 1.072), which is not contained in (0.8, 1.25), equivalence cannot be concluded. Here, the p-value is less than 0.10, but the 90% confidence interval contains possibilities in the null hypothesis. The relationship between inferences based on a p-value and inferences based on a confidence interval is thus different for equivalence tests than for a superiority test or a test of a difference.
For determining the posterior probability of the alternative hypothesis in Equation
A.3, a noninformative prior distribution will be used for the true log-hazard ratio,
and the estimated log-hazard ratio will be modeled as having a normal distribu-
tion with standard deviation of 0.1. The posterior distribution for the log-hazard
ratio is a normal distribution with mean ln(0.91) and standard deviation 0.1. The
posterior probability of the alternative hypothesis in Expression A.3 is 0.900. Note
that for the equivalence comparison, the posterior probability of the alternative
hypothesis did not equal 1 minus the p-value. If a 90% posterior probability were
required for a conclusion of equivalence, the result would lie on the boundary of
statistical significance.
Thus, while it seems that there is 90% confidence that 0.8 < θ < 1.25, the 90%
confidence interval for θ does not lie within the interval (0.8, 1.25). We note that
these types of equivalence hypotheses tend to be tested using a 90% confidence
interval in various settings, including generic drug settings. Schuirmann17 showed
that such a test has a maximum type I error rate of 0.05. This is the result of treat-
ing an equivalence test as performing two simultaneous one-sided tests based on
one-sided 95% confidence intervals, both of which need statistical significance at
a 5% level. The alternative hypotheses for the one-sided tests are Ha: 0.8 < θ and
Ha: θ < 1.25. More commonly, the p-value for the equivalence test is alternatively
defined as the maximum of the two p-values from the two one-sided tests. For
each of the one-sided tests, the definition of the p-value in Section A.1 is used. In
this example, the respective one-sided p-values are 0.099 and 0.0008, resulting
in a p-value of 0.099 for the equivalence test. This p-value is compared with 0.05
(the desired type I error rate), not 0.10. Since 0.099 > 0.05, equivalence is not demonstrated. Note also that this p-value corresponds to the largest-level confidence interval lying within (0.8, 1.25), which has confidence coefficient 1 – 2 × p-value. Here, the 80.2% confidence interval is (0.8, 1.035).
For this second definition of the p-value for the equivalence test, the distribution of the p-value at the least favorable parameter value in Ho: θ ≤ 0.8 or θ ≥ 1.25 is stochastically larger than a uniform distribution over (0, 1). Thus, this test can be conservative. For the first definition of the p-value for an equivalence test in this example, that least favorable distribution is a uniform distribution over (0, 1).
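A sketch of the calculations in Example A.6:

```python
import numpy as np
from scipy.stats import norm

est, se = np.log(0.91), 0.1
lo, hi = np.log(0.8), np.log(1.25)        # equivalence limits, log scale

p_lo = 1 - norm.cdf((est - lo) / se)      # Ha: theta > 0.8   -> ~ 0.099
p_hi = norm.cdf((est - hi) / se)          # Ha: theta < 1.25  -> ~ 0.0008
p_tost = max(p_lo, p_hi)                  # ~ 0.099, compared with 0.05

# posterior probability of equivalence under a noninformative prior
post_equiv = norm.cdf((hi - est) / se) - norm.cdf((lo - est) / se)   # ~ 0.900
ci90 = np.exp(est + np.array([-1, 1]) * norm.ppf(0.95) * se)  # ~ (0.77, 1.07)
print(p_tost, post_equiv, ci90)
```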
A.4.1 Stratification
Clinical trials are commonly randomized using permuted blocks.26 Furthermore, randomization is commonly stratified by some predefined prognostic
factor. With stratification, separate permuted blocks are generated for each
level of stratification, and subjects are assigned the next available random-
ized treatment from the stratum to which they belong. In this way, subjects
from each stratum are assigned to the various treatment groups in num-
bers approximately equal to the desired randomization ratio (exactly equal
to the extent that entire blocks are used). This is desired to balance the levels
of meaningful prognostic factors between arms. By doing so, any demon-
strated difference between the arms can be attributed to the difference in the
treatments received instead of one arm being allocated with better subjects
than the other arm.
With stratification, treatment arms tend to be more similar in the distribu-
tion of the stratification factors. A clinical trial in which treatment arms are
not well balanced can be subject to criticism and difficult to interpret, even
though the randomization procedure was fair in its assignment of subjects
to arms, and the calculation of the p-value accounts for any potential imbal-
ance. Without stratification, the treatment arms will be balanced on average;
with stratification, the balance will be much closer for the realized allocation
as well as for the mean among many theoretical realizations.
A second advantage of using stratification is that it allows for the use of
analyses that have greater power. When a stratification factor is used for the
randomization process, the analysis may adjust for the stratification factor
by either using that factor as a covariate in the analysis or by integrating the
results of the comparisons within each level of the factor (known as a “strati-
fied analysis”). For an analysis of covariance, including the factor as a covari-
ate in the model tends to reduce the associated standard error in estimating
the difference in means (the treatment effect). The analysis of covariance
allows the covariate to explain its contribution to the total variability of the
observed outcomes. The remaining variability is now the background vari-
ability in estimating the treatment effect. This is true even if the treatment
arms are balanced for the stratification factor, provided the stratification fac-
tor is correlated with outcome. For a stratified analysis and for Cox propor-
tional hazards models and logistic regression models, the use of a prognostic
covariate in the analysis allows for a comparison of “likes” between treat-
ment arms. That is, the comparison is not obscured by arbitrary differences
in the covariate between arms. In a stratified analysis, patients with similar
be problematic when each site enrolls few study subjects, as it creates many
strata, some of which may be confounded with treatment when all subjects
at a given site are randomized to the same treatment. When it is appropriate,
sites can be grouped by geographic region or by some other meaningful cri-
terion (e.g., grouping based on climate or based on the specialization of the
principal investigator whenever the climate or specialization has impact).
A.4.2 Analyses
Members of the population for which a new drug is being developed have
characteristics that are quite heterogeneous. For a clinical trial to be appli-
cable to the entire population (have external validity), the study population
will also be heterogeneous. When subjects are provided a treatment, this
subject variability contributes to the variability in the observed outcomes.
Restricting a study to only patients who have similar values for a very influ-
ential prognostic factor will lead to less variable outcomes and will require
fewer patients for the desired power. However, the results from such a
study will only be externally valid for people similar to the study subjects.
Variability can be reduced by stratifying the randomization and analysis,
or by adjusting the analysis for prognostic factors not used in the random-
ization process. Stratified and adjusted analyses therefore allow studies to
enroll a diverse group of subjects without requiring a dramatically larger
number of subjects as a study using a homogeneous group of subjects.27,28
When a clinical trial incorporates a randomization process that uses strati-
fication, the analysis commonly incorporates the stratification factor as a
covariate. More generally, when a potentially prognostic baseline variable
is identified before the study begins, an analysis may use this variable as
a covariate in the model whether or not it was used as a stratification fac-
tor in the randomization process. Whether these prognostic factors should
be included as covariates in the primary analysis model is a matter of some
controversy for superiority analyses.26,27 For non-inferiority analyses, the
same controversies exist along with other concerns.
Covariates that can be included in the analysis model are identified in one
of three ways: prospectively identified as being prognostic, retrospectively
identified as being correlated with response, or recognized as being imbal-
anced in the clinical trial under consideration. A factor that is prospectively
identified is the easiest to assess; a factor that is identified based on the data
observed in the study is more difficult, and its inclusion in the model may
introduce bias.
In general, prospectively identified factors can be included as covariates in
the analysis model without much controversy. If the covariate influences the
conclusion, the argument can be made that the inclusion of the factor in the
analysis was made before the data were known and, therefore, the conclu-
sion is not biased. One aspect of covariate analysis that may differ in non-
inferiority and superiority analyses is the issue of collinearity. Collinearity
occurs when two prognostic factors are related to each other and at least
one is related to the outcome variable. A model that includes both covari-
ates commonly shows neither covariate has a statistically significant effect
on the outcome. Thus, with each collinear variable considered after adjust-
ment for the other, an effect is not identified for either covariate. When one
of these two variables is the randomized treatment group (i.e., by not having
the covariate balanced between groups by the randomization), this can have
the effect of masking a real effect of treatment. For superiority analyses, this
has the effect of decreasing the chance of finding a relationship; for non-
inferiority analyses, the impact on conclusions is much less well understood.
We thus recommend caution in interpreting a non-inferiority clinical trial
in which the analysis uses a covariate not included in the randomization
process and in which there is considerable imbalance between comparative
treatment arms.
The impact of choosing covariates for inclusion in the model on the basis
of data observed in the model is even more difficult to defend. Such post
hoc model choices are subject to biases in superiority analyses as well as in
non-inferiority analyses. Additionally, choosing covariates on the basis of
baseline imbalances can have the effect of causing collinearity, obscuring
differences in treatment groups, and resulting in a biased estimate of treat-
ment effect.
To illustrate, consider a simple additive analysis of covariance (ANCOVA)
model, in which
yij = α + κi + γj + ε
where α is the grand mean, κi is the effect of treatment group i, γj is the effect
of categorical covariate j with two levels, ε is the error, and yij is the observed
value for a subject with associated covariate and treatment assignment. If the
covariate is predictive of the outcome as an additive effect to the treatment,
as expected by the model, the confidence interval calculated via ANCOVA
will tend to be shorter than the confidence interval calculated without con-
sideration of the covariate. The confidence interval for the true difference
in means, μE – μ C, is calculated using the estimated treatment effect and the
mean square error, and the null hypothesis is rejected if the lower bound of
the confidence interval is greater than –δ.
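A sketch of a covariate-adjusted non-inferiority analysis on simulated data; the margin δ = 1, the effect sizes, and the variable names are hypothetical, and the statsmodels package is assumed to be available:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"trt": rng.choice(["C", "E"], n),
                   "cov": rng.choice(["low", "high"], n)})
# true E - C difference of 0.2; the covariate adds 3.0 when "high"
df["y"] = (0.2 * (df.trt == "E") + 3.0 * (df.cov == "high")
           + rng.normal(0, 4, n))

fit = smf.ols("y ~ trt + cov", data=df).fit()   # additive ANCOVA model
lo, hi = fit.conf_int().loc["trt[T.E]"]         # 95% CI for mu_E - mu_C
delta = 1.0                                     # hypothetical NI margin
print((lo, hi), "non-inferior" if lo > -delta else "not demonstrated")
```

Because the covariate explains part of the outcome variability, the adjusted interval will tend to be shorter than the unadjusted one, as the text describes.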
Despite the advantages of using covariates in the analysis, we caution
that all covariates, like other aspects of the analysis, must be prespeci-
fied. Including or excluding covariates to obtain a confidence interval that
excludes –δ, based on post hoc analyses, is not appropriate as it inflates the
chance of falsely concluding that the experimental drug is noninferior to the
standard drug.
It is also often important to investigate whether there is any interaction
between treatment and a prespecified covariate on the outcome of inter-
est. Such an interaction effect means that the difference in the effects of the
experimental and control therapies varies across the levels of the covariate.
For superiority testing, when this difference in effect always favors the same
therapy, the interaction is regarded as quantitative. If instead this difference in
effects sometimes favors the experimental therapy and sometimes favors the
control therapy, the interaction is regarded as qualitative. Determination of
whether the interaction is quantitative or qualitative involves comparing the
difference of effects with zero (zero being the value specified as the difference
in effects in the null hypothesis).29 For non-inferiority testing, the determina-
tion of whether the interaction is quantitative or qualitative involves compar-
ing the difference of effects with the non-inferiority margin. In a stratified
analysis, an advantage of the control group in one stratum that is larger than
the non-inferiority margin can be offset by an advantage of the experimental
group in another stratum, a situation akin to that of a qualitative interaction
and causing the two treatments, on average, to look similar.30 In such a situa-
tion, a non-inferiority analysis on the overall population may be problematic.
Examination of the treatment effect in each stratum to check for consistency
of effect will be important (see Chapter 10 for further details).
References
1. Dempster, A.P. and Schatzoff, M., Expected significance level as a sensibility index for test statistics, J. Am. Stat. Assoc., 60, 420–436, 1965.
2. Schatzoff, M., Sensitivity comparisons among tests of the general linear hypoth-
eses, J. Am. Stat. Assoc., 61, 415–435, 1966.
3. Hung, H.M.J. et al., The behavior of the p-value when the alternative hypothesis
is true, Biometrics, 53, 11–22, 1997.
4. Sackrowitz, H. and Samuel-Cahn, E., P-values as random variables—expected
p-values, Am. Stat., 53, 326–331, 1999.
5. Joiner, B.L., The median significance level and other small sample measures of
test efficacy, J. Am. Stat. Assoc., 64, 971–985, 1969.
6. Bhattacharya, B. and Habtzghi, D., Median of the p-value under the alternative
hypothesis, Am. Stat., 56, 202–206, 2002.
7. Fisher, R.A., The Design of Experiments, Oliver and Boyd, Edinburgh, 1935.
8. Hollander, M. and Wolfe, D.A., Nonparametric Statistical Methods, John Wiley,
New York, NY, 1973.
9. Good P., Permutation, Parametric, and Bootstrap Tests of Hypotheses, Springer, New
York, NY, 2005.
10. Box, G.E.P., Hunter, J.S., and Hunter, W.G., Statistics for Experimenters: An
Introduction to Design, Data Analysis, and Model Building, John Wiley, New York,
NY, 1978.
11. Wiens, B.L., Randomization as a basis for inference in non-inferiority trials,
Pharm. Stat., 5, 265–271, 2006.
12. Fisher, R.A., Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh,
1925.
13. Barnard, G.A., Significance tests for 2 × 2 tables, Biometrika, 34, 123–138, 1947.
14. Chan, I.S.F., Exact tests of equivalence and efficacy with a non-zero lower bound
for comparative studies, Stat. Med., 17, 1403–1413, 1998.
15. Röhmel, J. and Mansmann, U., Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority, Biom. J., 41, 149–170, 1999.
16. Dunnett, C.W. and Gent, M., Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables, Biometrics, 33, 593–602, 1977.
17. Schuirmann, D., A comparison of the two one-sided tests procedure and the
power for assessing the equivalence of average bioavailability, J. Pharmacokinet.
Pharm., 15, 657–680, 1987.
18. International Conference on Harmonization of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH), E9: statistical princi-
ples for clinical trials, 1998, at http://www.ich.org/cache/compo/475-272-1
.html#E4.
19. Neyman, J. and Pearson, E.S., On the problem of the most efficient tests of statistical hypotheses, Philos. T. R. Soc. Lond., 231, 289–337, 1933.
20. Blackwelder, W.C., Proving the null hypothesis in clinical trials, Control. Clin.
Trials, 3, 345–353, 1982.
21. Carlin, B.P. and Louis, T.A., Bayes and Empirical Bayes Methods for Data Analysis,
Chapman and Hall, London, 1996.
22. Goodman, S.N., Toward evidence-based medical statistics: The P-value fallacy, Ann. Intern. Med., 130, 995–1004, 1999.
23. Simon R., Bayesian design and analysis of active control clinical trials, Biometrics,
55, 484–487, 1999.
24. Rothmann, M.D. and Tsou, H., On non-inferiority analysis based on delta-
method confidence intervals, J. Biopharm. Stat., 13, 565–583, 2003.
25. Rothmann, M., Type I error probabilities based on design-stage strategies with
applications to non-inferiority trials, J. Biopharm. Stat., 15, 109–127, 2005.
26. Senn, S., Added values: Controversies concerning randomization and additivity
in clinical trials, Stat. Med., 23, 3729–3753, 2004.
27. Friedman, L.M., Furberg, C.D., and DeMets, D.L., Fundamentals of Clinical Trials,
3rd Edition, Springer, New York, NY, 1998.
28. Montgomery, D.C., Design and Analysis of Experiments, John Wiley & Sons, New
York, NY, 1991.
29. Gail, M. and Simon, R., Testing for qualitative interactions between treatment effects and patient subsets, Biometrics, 41, 361–372, 1985.
30. Wiens, B.L. and Heyse, J.F., Testing for interaction in studies of non-inferiority,
J. Biopharm. Stat., 13, 103–115, 2003.